Internet Slang Dataset


I am currently working on a project on emotion detection in text. My team and I are building a machine learning model that predicts emotions from posts on micro-blogging sites such as Twitter. We’ve been considering several features for learning, such as the context of the text and the use of emoticons, emojis, special characters, and hashtags.

One of the challenges we faced comes from the limit on post length, which is only 140 characters. This leads users to rely heavily on slang and acronyms in place of words or even entire phrases,
e.g. 2MORO, ALAP (As Long As Possible), PERF (Perfect), etc.
To predict an emotion, one must first understand the text. Hence we searched the internet for a reliable dataset of these slang terms and acronyms (a slang dictionary or texting dictionary). Unfortunately, we didn’t come across any extensive list, so we decided to create one.

There are many online resources for this (though not many in dataset format). The one we found most useful is http://www.internetslang.com/
http://www.ruf.rice.edu/~kemmer/Words04/usage/slang_internet.html is also a good repository, but not a very extensive one.

Hence we created a dataset of 7500+ slang words and their meanings by scraping http://www.internetslang.com/ . This post shares that set with the Internet, so it might be useful for all those who are exploring this field, just like yours truly.

Dataset:

Slang_Dict

Dataset Specification:
Delimited using ‘`’ (the backtick character)
PS: We didn’t use any of the ‘commonly’ used delimiter characters because they are ‘commonly’ used in ASCII emoticons and expressions.
The dataset contains two columns: the slang term and its meaning
If a slang term has multiple meanings, they are separated by the ‘|’ symbol.
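
As a rough illustration of the format described above, the sketch below splits one raw line into a slang term and its list of meanings. The function name parse_slang_line and the sample lines are made up for illustration and are not rows taken from the dataset.

```python
def parse_slang_line(line):
    """Split one 'slang`meaning|meaning' line into (slang, [meanings])."""
    # partition() splits on the first backtick only
    slang, _, meanings = line.rstrip("\n").partition("`")
    return slang.strip(), [m.strip() for m in meanings.split("|") if m.strip()]


# Hypothetical sample lines in the described format (not real dataset rows).
sample_lines = [
    "ALAP`As Long As Possible",
    "PERF`Perfect",
    "XYZ`first meaning|second meaning",
]

parsed = dict(parse_slang_line(line) for line in sample_lines)
print(parsed["XYZ"])
```

Keeping the meanings as a list leaves the choice between them to whatever step comes next.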

Although the repository contains 7500+ entries, it’s still only the tip of the iceberg. You might find many slang terms missing here.
Also, many slang terms and acronyms are specific to a region or group (used primarily by certain groups of people), which makes them pretty difficult to capture.
Moreover, I would request that any reader who finds a new slang term add it here and share it.

Here is a small Python snippet to use the dataset.

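import csv

# Set slang_filename to the path of the downloaded dataset file.
slang_data = []
# Open in text mode with newline='' as the csv module expects on Python 3.
with open(slang_filename, newline='') as slang_file:
    slang_reader = csv.reader(slang_file, delimiter='`', quoting=csv.QUOTE_NONE)
    for row in slang_reader:
        slang_data.append(row)

# Each row in slang_data is a [slang, meaning] pair:
# row[0] contains the acronym, row[1] contains its meaning phrase(s)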
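
Once the rows are loaded, one possible use is to expand slang in a post before passing it to a model. The following is a minimal sketch rather than a definitive implementation: build_lookup, expand_slang and the sample rows are hypothetical, it assumes row[0] is the slang and row[1] its meaning, it matches whitespace-separated tokens case-insensitively, and it simply takes the first listed meaning.

```python
def build_lookup(rows):
    """Map upper-cased slang to its first listed meaning."""
    lookup = {}
    for row in rows:
        if len(row) < 2:
            continue  # skip malformed rows
        slang, meanings = row[0], row[1]
        lookup[slang.strip().upper()] = meanings.split("|")[0].strip()
    return lookup


def expand_slang(text, lookup):
    """Replace whitespace-separated tokens that are known slang."""
    return " ".join(lookup.get(tok.upper(), tok) for tok in text.split())


# Hypothetical rows, shaped like the output of the csv.reader loop above.
sample_rows = [["ALAP", "As Long As Possible"], ["PERF", "Perfect"]]
lookup = build_lookup(sample_rows)
print(expand_slang("stay alap that was perf", lookup))
# -> stay As Long As Possible that was Perfect
```

Picking the first meaning is only a placeholder. Choosing among multiple meanings would need the surrounding context.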

 

Unicode Dataset:
Another dataset which might be useful if you are working in a similar domain:

The set of all Unicode characters and their names, e.g. UNICODE(1F64B) – HAPPY PERSON RAISING ONE HAND (this code point represents an emoji character)
ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
To understand the dataset and its parameters: ftp://ftp.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.html
To explore more about its contents, visit: https://en.wikipedia.org/wiki/Unicode
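
To show how the code point and name from the example above might be read out of UnicodeData.txt, here is a small sketch. It assumes the file's semicolon-separated layout, with the hexadecimal code point in the first field and the character name in the second. The sample line is written by hand in that layout, and parse_unicode_line is a hypothetical helper name.

```python
import unicodedata

# One line in the UnicodeData.txt layout (assumed): semicolon-separated
# fields, field 0 is the code point in hex, field 1 is the character name.
sample_line = "1F64B;HAPPY PERSON RAISING ONE HAND;So;0;ON;;;;;N;;;;;"


def parse_unicode_line(line):
    """Return (code_point, name) from one UnicodeData.txt-style line."""
    fields = line.strip().split(";")
    return int(fields[0], 16), fields[1]


code_point, name = parse_unicode_line(sample_line)
print(hex(code_point), name)

# Cross-check against the Unicode database bundled with Python.
print(unicodedata.name(chr(code_point), "<unknown>"))
```

For quick lookups, Python's built-in unicodedata module may be enough, without downloading the file at all.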

Again Courtesy and Source: [ http://www.internetslang.com/ ]
