I recently joined Twitter. I am beyond fashionably late to this party, but Twitter has been on my linguistics radar for several years, thanks to the fact that many researchers have used tweets as data sets to study various aspects of language change and use.

Personally, I love the idea of hashtags as meta-commentary or as a type of paralinguistic cue (#SorryNotSorry comes to mind). I’ve also listened to presentations where hashtags were used to track the spread of breaking news, and where researchers showed how to tell trending hashtags that evolved organically from hashtags that were purposefully created (like those shown at the bottom of your TV screen during your favorite television series). Organically evolved hashtags show differences in wording or spelling, whereas created hashtags tend to all spring up at the same time and in the same format.

But one of the big draws is the sheer amount of data that Twitter can provide. Searching a hashtag brings up thousands of tweets, and it’s here that language researchers are looking for insights. Since August I’ve come across two articles about a study that turned to Twitter to investigate Spanish dialects and discovered the existence of two “superdialects” whose usage doesn’t depend on geographic region. Rather, one dialect appears to be used more often in cities, and the other in rural areas. You can read the first article I found, from the MIT Technology Review, here, and/or a more recent write-up of the same study that I found on bigthink.com here.

Another way that researchers are using tweets as data is to reveal the overall mood of Twitter users on different days of the week and at different times throughout the day. This Buzzfeed article has some fun color-coded (if a little confusing) charts showing just that. The study uses specific search phrases like “feeling happy” or “hungover.” And with a huge data set culled from these search terms, their findings are probably reliable... to a point.
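For the curious, here is a rough sketch of what counting search phrases by day of the week might look like. To be clear, this is not the study's actual method: the sample tweets, the timestamps, and the function name `count_phrases_by_weekday` are all invented for illustration, and a real project would pull far more tweets from a proper corpus.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical sample data: (ISO timestamp, tweet text), made up for this sketch.
SAMPLE_TWEETS = [
    ("2014-09-20T10:15:00", "Feeling happy, brunch with friends"),
    ("2014-09-21T09:30:00", "so hungover today"),
    ("2014-09-21T11:00:00", "Hungover and regretting everything"),
    ("2014-09-22T08:00:00", "Monday again, need coffee"),
    ("2014-09-26T18:45:00", "feeling happy that it's Friday"),
]

# The two search phrases mentioned in the post.
PHRASES = ["feeling happy", "hungover"]


def count_phrases_by_weekday(tweets, phrases):
    """Count case-insensitive phrase matches, keyed by (weekday, phrase)."""
    counts = Counter()
    for timestamp, text in tweets:
        # datetime.weekday() returns 0 for Monday through 6 for Sunday.
        weekday = WEEKDAYS[datetime.fromisoformat(timestamp).weekday()]
        lowered = text.lower()
        for phrase in phrases:
            if phrase in lowered:
                counts[(weekday, phrase)] += 1
    return counts


counts = count_phrases_by_weekday(SAMPLE_TWEETS, PHRASES)
for (weekday, phrase), n in sorted(counts.items()):
    print(weekday, phrase, n)
```

Even in this toy version, the counts only say how often a phrase appears, not whether the person typing it meant it.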

But as we all know, the words we use aren’t always meant literally. The lack of paralinguistic cues like facial expressions, body positioning, and tone of voice in online communication, combined with the 140-character limit, means that taking a phrase out of context isn’t going to reveal a foolproof data set of mood indicators. The sentiment analysis of tweets and Facebook posts is a big challenge to computational linguists: just think of how many meanings the word “like” can have online, or how much difficulty Sheldon Cooper from The Big Bang Theory has with recognizing sarcasm, and you have an idea of the possible pitfalls.
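To make the pitfall concrete, here is a small sketch comparing a purely literal phrase match with a slightly smarter one that checks for a negating word right before the phrase. The example tweets, the `NEGATORS` word list, and both function names are my own assumptions for illustration, and this is nowhere near what real sentiment analysis tools do.

```python
# Hypothetical list of negating words; a real system would need far more.
NEGATORS = {"not", "never", "hardly"}


def literal_match(text, phrase="feeling happy"):
    """Return True if the phrase appears anywhere in the text, ignoring context."""
    return phrase in text.lower()


def negation_aware_match(text, phrase="feeling happy"):
    """Like literal_match, but reject a hit when the word just before it negates it."""
    lowered = text.lower()
    index = lowered.find(phrase)
    if index == -1:
        return False
    preceding_words = lowered[:index].split()
    return not (preceding_words and preceding_words[-1] in NEGATORS)


# Invented example tweets.
sincere = "Feeling happy after a long walk"
negated = "Definitely not feeling happy about this"
sarcastic = "Oh great, stuck in traffic again. Feeling happy."

for tweet in (sincere, negated, sarcastic):
    print(tweet, literal_match(tweet), negation_aware_match(tweet))
```

The negation check catches the second tweet, but both functions still count the sarcastic one as happy, which is exactly the kind of case that needs the context a keyword search throws away.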

Does anyone else have examples of Twitter being used as a data source for research?

If you’re on Twitter, let me know! I’m always looking for cool people to follow. And if you want to return the favor, you can follow me at @l_g_johnson.


As a bonus, yesterday was National Punctuation Day! Check out this fun Mental Floss article about lesser-known punctuation—I think it pairs well with the above section on paralinguistic cues. They’re like less colorful emojis!

2 thoughts on “How the Twitterverse is Contributing to Language Research”

    1. Hi Mura–Thank you so much for sharing that link! I love that it gives practical and detailed information on how to build a Twitter corpus AND touches on the tricky nature of sentiment analysis. 🙂
