Internet Slang Dataset


Currently, I am working on a project on Emotion Detection in Text. We (my team and I) are building a Machine Learning model that can predict emotions from posts on micro-blogging sites like Twitter. We’ve been considering multiple features for learning, such as the context of the text and the usage of emoticons, emojis, special characters, hashtags, etc.

One of the challenges we faced is the constraint on post length, which is only 140 characters. This leads users to make heavy use of slang and acronyms in place of words or even entire phrases,
e.g. 2MORO (Tomorrow), ALAP (As Long As Possible), PERF (Perfect), etc.
To predict an emotion one must understand it first. Hence we searched the internet for a reliable dataset of these slang terms and acronyms (a slang or texting dictionary). Unfortunately, we didn’t come across any extensive list, so we decided to create one.

There are many online resources for this (though few in dataset format). The one we found most useful is http://www.internetslang.com/ ;
http://www.ruf.rice.edu/~kemmer/Words04/usage/slang_internet.html is also a good repository, but not a very extensive one.

Hence we created a dataset of 7500+ slang words and their meanings by scraping http://www.internetslang.com/ . This post is to share that set with the Internet, so it might be useful for all those exploring this field, just like yours truly.

Dataset:

Slang_Dict

Dataset Specification:
Delimited using ‘`’ (the backtick character)
PS: We didn’t use any of the ‘commonly’ used delimiter characters because they are commonly found in ASCII emoticons and expressions.
The dataset contains two columns: the slang term and its meaning.
If a slang term has multiple meanings, they are separated by the ‘|’ symbol.

Although the repository contains 7500+ entries, it’s still only the tip of the iceberg, and you might find many slang terms missing from it.
Also, many slang terms and acronyms are region- or cluster-specific (used primarily by certain groups of people), which makes them pretty difficult to capture.
Moreover, if any reader comes across a new slang term, I’d request that they add it here and share it.

Here is a small Python snippet to use the dataset.

import csv

slang_data = []
# slang_filename is the path to the downloaded dataset file
with open(slang_filename, 'r', newline='') as exRtFile:
    exchReader = csv.reader(exRtFile, delimiter='`', quoting=csv.QUOTE_NONE)
    for row in exchReader:
        slang_data.append(row)

# Each row is a [slang, meaning] pair:
# row[0] holds the acronym, row[1] holds the meaning phrase(s).
# Multiple meanings in row[1] are separated by '|', e.g. row[1].split('|')

 

Unicode Dataset:
Another dataset which might be useful if you are working in a similar domain:

A set of all Unicode code points and their textual descriptions, e.g. UNICODE(1F64B) – HAPPY PERSON RAISING ONE HAND (this code point represents an emoji character):
ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
To understand the dataset and its parameters: ftp://ftp.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.html
To explore more about its contents, visit: https://en.wikipedia.org/wiki/Unicode
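
Since UnicodeData.txt is a plain semicolon-delimited file (the first field is the hex code point, the second is the character name), it is easy to parse. Here is a minimal Java sketch, assuming the file has been downloaded locally; the file name and the map are just illustrative:

// Hedged sketch: building a code point -> name lookup from UnicodeData.txt
// import java.io.BufferedReader; import java.io.FileReader;
// import java.util.HashMap; import java.util.Map;
Map<Integer, String> names = new HashMap<>();
try (BufferedReader br = new BufferedReader(new FileReader("UnicodeData.txt"))) {
    String line;
    while ((line = br.readLine()) != null) {
        String[] fields = line.split(";");
        // fields[0] = hex code point, fields[1] = character name
        names.put(Integer.parseInt(fields[0], 16), fields[1]);
    }
}
System.out.println(names.get(0x1F64B)); // HAPPY PERSON RAISING ONE HAND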

Again, courtesy and source: http://www.internetslang.com/

Graph Database using Java [OrientDB and Gremlin]


What is a Graph Database?

In computing, a graph database is a database that uses graph structures for semantic queries with nodes, edges and properties to represent and store data. A graph database is any storage system that provides index-free adjacency.
More info: Graph Database

There are many graph databases available on the net. The well-known and commonly used ones are Neo4j, OrientDB, Allegro, etc.

For my project I experimented with OrientDB. The focus of this tutorial is to cover the setup and use of OrientDB from Java.

A little introduction to OrientDB:
OrientDB is an open-source NoSQL database management system written in Java. The major advantage of OrientDB over other graph databases is that it is a multi-model database, supporting graph, document, key/value, and object models, while relationships are managed as in graph databases, with direct connections between records. Very few platforms provide such model flexibility.

More info: OrientDB; NoSQL; Multi-model Database

Implementation

Note: I have used Gremlin as the querying language. (Gremlin is a graph traversal language specialized for working with property graphs.) One can use OrientDB’s SQL queries as well.

A great Tutorial to Learn OrientDB: http://pettergraff.blogspot.it/2014/01/getting-started-with-orientdb.html
A great Tutorial to Learn Gremlin: http://gremlindocs.com/

Code snippet to implement OrientDB in Java, using Gremlin as the querying language:

Installing Libraries:
1. Install OrientDB: download

2. Add the following libraries to your project from the ‘lib’ folder inside the OrientDB installation folder:
gremlin-* (all jars)
orientdb-* (all jars)
commons-* (all jars)
pipes-*
blueprints-core-*
concurrentlinkedhashmap-lru-*

This tutorial assumes the database has already been created in OrientDB (it is very easy to create; you will find it in the tutorials mentioned above). If not, you can work with the example DB that ships with OrientDB: GratefulDeadConcerts

Code Snippet:

// Required imports (from the OrientDB/Blueprints jars):
// import com.tinkerpop.blueprints.impls.orient.OrientGraph;
// import com.tinkerpop.blueprints.Vertex;
// import com.tinkerpop.blueprints.Edge;

OrientGraph graph = new OrientGraph("plocal:/databases/Animal_Data", "reader", "reader");

// Adding a vertex and setting properties:
Vertex vanimal1 = graph.addVertex("class:Animal");
vanimal1.setProperty("Name", "Tiger");

Vertex vanimal2 = graph.addVertex("class:Animal");
vanimal2.setProperty("Name", "Deer");

Vertex vEmpty = graph.addVertex(null); // Creates a vertex in the V class (superclass of all vertices)

Edge ehunts = graph.addEdge(null, vanimal1, vanimal2, "hunts");
// The statement above creates an edge of class E (superclass of all edges)

Similarly, ‘getters’ can be used to fetch vertices and properties, as sketched below.
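
A minimal sketch continuing from the snippet above (the property values are the ones set earlier; this is an illustration of the getter calls, not the only way to query):

// Fetching vertices by a property value and reading properties back
for (Vertex v : graph.getVertices("Name", "Tiger")) {
    System.out.println(v.getProperty("Name").toString()); // prints: Tiger
}
// Iterating all edges and reading their labels
for (Edge e : graph.getEdges()) {
    System.out.println(e.getLabel()); // prints: hunts
}
graph.shutdown(); // release the graph when done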

SQL Queries: http://orientdb.com/docs/last/Tutorial-Java.html

Gremlin Snippets:

Defining a pipeline

GremlinPipeline pipe = new GremlinPipeline(); // com.tinkerpop.gremlin.java.GremlinPipeline

1. Fetching features of a Vertex

Vertex temVert;
Iterable<Vertex> vertices = graph.getVertices("Name", "Lion");

This returns all vertices having the property “Name” set to “Lion”.
In my case there is only one such vertex, so I take it into a Vertex object:

if (vertices.iterator().hasNext()) // checks whether the returned list of vertices is empty
{
    temVert = vertices.iterator().next();
    String liveLocation = temVert.getProperty("Lives").toString(); // "Lives" is another property on this vertex
}

2. Getting paths to all the 1-hop neighbours

Iterable<Vertex> vertices = graph.getVertices("Name", "Lion");
if (vertices.iterator().hasNext())
{
    temVert = vertices.iterator().next();
    pipe.start(temVert).bothE().bothV().simplePath().property("Name").path();
    for (Object path : pipe) {
        System.out.println(path);
    }
}

The pipe can be broken down as follows:
start() – defines the starting vertex.
bothE() – selects both incoming and outgoing edges of the start() vertex.
Note: for directed traversals, use inE() and outE() for incoming and outgoing edges respectively.

bothV() – selects both vertices (tail and head) of each edge from the previous step.
To follow edges in only one direction, use inV() or outV() as required (e.g. for traversing a tree or any other directed graph).

The above two statements can be written more compactly using both(),
e.g.

pipe.start(temVert).both().simplePath().property("Name").path();

This fetches all vertices adjacent to the current vertex, whether reached via outgoing or incoming edges.

simplePath() – filters out paths that revisit an element, so the traversal does not simply step back to where it came from.

property("Name") – selects the relevant property to display. Without it, the path would contain only vertex IDs.

path() – returns the whole path from the start of the pipe. Without it, only the final step’s result would be displayed; for the snippet above, that would be just the “Name” property of the vertices one hop from the starting point.

Similarly, for two-hop paths:

pipe.start(temVert).bothE().bothV().bothE().bothV().simplePath().property("Name").path();

OR

pipe.start(temVert).both().both().simplePath().property("Name").path();

There are many other ways to achieve this…
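
For instance, a hedged sketch using Gremlin 2.x’s loop() step instead of chaining both() (this assumes com.tinkerpop.pipes.PipeFunction and com.tinkerpop.pipes.branch.LoopPipe are on the classpath, and the hop-count convention follows the gremlindocs loop() example, so treat it as a starting point rather than a definitive recipe):

// Hedged sketch: N-hop traversal via loop() rather than repeated both()
// loop(1, ...) re-runs the step one position back (the both()) while the
// supplied function returns true; getLoops() < 3 corresponds to two hops.
pipe.start(temVert).both()
    .loop(1, new PipeFunction<LoopPipe.LoopBundle<Vertex>, Boolean>() {
        public Boolean compute(LoopPipe.LoopBundle<Vertex> bundle) {
            return bundle.getLoops() < 3;
        }
    })
    .simplePath().property("Name").path();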

For more examples use: http://gremlindocs.com/

And for more information regarding OrientDB, refer to its manual: Manual

More References:
http://www.fromdev.com/2013/09/Gremlin-Example-Query-Snippets-Graph-DB.html
https://github.com/orientechnologies/orientdb/wiki
http://devdocs.inightmare.org/introduction-to-orientdb-graph-edition/

Perceptron Neural Network in Java [using Weka Library]


A multilayer perceptron (MLP) is a feedforward artificial neural network model that maps sets of input data onto a set of appropriate outputs. An MLP consists of multiple layers of nodes in a directed graph, with each layer fully connected to the next one.

If you are looking for an implementation of a neural network in the Weka GUI tool, follow this tutorial.

The following is a step-by-step implementation in Java (using the Weka libraries):

Adding Weka libraries

Download Weka from here.
From the package, find weka.jar and add it to the project. [Project Properties > Java Build Path > Libraries (tab) > Add External JARs…]

Java Code Snippet

1. Building a Neural Classifier


// Required imports:
// import java.io.FileReader;
// import weka.core.Instances;
// import weka.classifiers.functions.MultilayerPerceptron;

public void simpleWekaTrain(String filepath){
    try{
        // Reading the training ARFF file
        FileReader trainreader = new FileReader(filepath);
        Instances train = new Instances(trainreader);
        train.setClassIndex(train.numAttributes() - 1);

        // Instance of the NN
        MultilayerPerceptron mlp = new MultilayerPerceptron();

        // Setting parameters
        mlp.setLearningRate(0.1);
        mlp.setMomentum(0.2);
        mlp.setTrainingTime(2000);
        mlp.setHiddenLayers("3");

        mlp.buildClassifier(train);
    }
    catch(Exception ex){
        ex.printStackTrace();
    }
}


Another way to set parameters:


mlp.setOptions(Utils.splitOptions("-L 0.1 -M 0.2 -N 2000 -V 0 -S 0 -E 20 -H 3")); // Utils is weka.core.Utils


Where,

L = Learning Rate
M = Momentum
N = Training Time or Epochs
H = Hidden Layers
etc.
Find all the parameter information here.

The above code is for only one hidden layer with 3 perceptrons. In the case of multiple layers, the values are passed as comma-separated values.
Example: if there are two hidden layers with 4 and 5 perceptrons respectively, the call will be:
mlp.setHiddenLayers("4,5") OR "… -H 4,5"

For this project I’ve used an ARFF file as input. CSV files work too, but they should be loaded through one of Weka’s converters rather than the plain Instances(Reader) constructor, which expects ARFF; see the sketch below.
An example of an ARFF file can be found here.
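
A minimal sketch using Weka’s weka.core.converters.CSVLoader (the file name train.csv is just a placeholder):

// Hedged sketch: loading a CSV training file via Weka's converter API
// import java.io.File;
// import weka.core.Instances;
// import weka.core.converters.CSVLoader;
CSVLoader loader = new CSVLoader();
loader.setSource(new File("train.csv")); // placeholder path
Instances train = loader.getDataSet();
train.setClassIndex(train.numAttributes() - 1);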
2. Neural Classifier Training Validation

To evaluate the model on the training data:


Evaluation eval = new Evaluation(train); // weka.classifiers.Evaluation
eval.evaluateModel(mlp, train);

System.out.println(eval.errorRate()); // Prints the training error rate (RMSE for numeric classes)
System.out.println(eval.toSummaryString()); // Summary of training


To apply k-fold cross-validation:


eval.crossValidateModel(mlp, train, kfolds, new Random(1)); // kfolds = number of folds; Random is java.util.Random
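
Afterwards, the same Evaluation object holds the aggregated cross-validation results. A small hedged usage sketch (the choice of 10 folds is just illustrative):

// Hedged usage sketch: 10 folds, then print the aggregated summary
int kfolds = 10;
eval.crossValidateModel(mlp, train, kfolds, new java.util.Random(1));
System.out.println(eval.toSummaryString());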


3. Saving Classifier Model



mlp.buildClassifier(train);

weka.core.SerializationHelper.write(<String Path>, mlp);


Note: this will generate a ‘.model’ file.

4. Reading Classifier Model


MultilayerPerceptron mlp = (MultilayerPerceptron) weka.core.SerializationHelper.read(<String Model Path>);


5. Evaluating/Predicting unlabelled data

Unlabelled data in the ARFF file holds the character ‘?’ in the class attribute column.

5.1 Fetching all the entries from the arff file


Instances datapredict = new Instances(
        new BufferedReader(
                new FileReader(<Predictdatapath>)));
datapredict.setClassIndex(datapredict.numAttributes() - 1);
Instances predicteddata = new Instances(datapredict);


5.2 Classify/Predict each value


for (int i = 0; i < datapredict.numInstances(); i++) {
    double clsLabel = mlp.classifyInstance(datapredict.instance(i));
    predicteddata.instance(i).setClassValue(clsLabel);
}
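
classifyInstance() returns the index of the predicted class value. Two hedged follow-ups, reusing the mlp and datapredict objects from above: recovering the label text via classAttribute().value(), and fetching per-class probabilities via distributionForInstance():

// Hedged sketch: label text and class membership probabilities per instance
for (int i = 0; i < datapredict.numInstances(); i++) {
    double clsLabel = mlp.classifyInstance(datapredict.instance(i));
    String label = predicteddata.classAttribute().value((int) clsLabel); // predicted label text
    double[] dist = mlp.distributionForInstance(datapredict.instance(i)); // per-class probabilities
    System.out.println(label + " " + java.util.Arrays.toString(dist));
}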


5.3 Generating output file


// Storing again in ARFF
BufferedWriter writer = new BufferedWriter(
        new FileWriter(<Output File Path>));
writer.write(predicteddata.toString());
writer.newLine();
writer.flush();
writer.close();


I have created these snippets after going through several interesting resources.
For more information, check them out:
https://weka.wikispaces.com/Use+Weka+in+your+Java+code
http://weka.8497.n7.nabble.com/Multi-layer-perception-td2896.html
https://weka.wikispaces.com/Serialization
http://weka.wikispaces.com/Programmatic+Use

Hello world!


Hey Folks,

I am Hitanshu Tiwari, a Computer Engineer from India.
By interest, I am a designer, a developer, and a sci-fi fan.

This is going to be my geek blog, where I’ll be posting tutorials, designs, and articles that I come across in my projects.

That’s all for now,

Thank you,