Linguistics Identifies Anonymous Users

January 9th, 2013

See also:

How Unique – and Trackable – Is Your Browser?

Software Can Identify You from Your Browsing Habits

The Ugly Truth About Online Anonymity

Via: SC Magazine:

Up to 80 percent of certain anonymous underground forum users can be identified using linguistics, researchers say.

The techniques compare user posts to track them across forums and could even unveil authors of thesis papers or blogs who had taken to underground networks.

“If our dataset contains 100 users we can at least identify 80 of them,” researcher Sadia Afroz told an audience at the 29C3 Chaos Communication Congress in Germany.

“Function words are very specific to the writer. Even if you are writing a thesis, you’ll probably use the same function words in chat messages.

“Even if your text is not clean, your writing style can give you away.”

The analysis techniques could also reveal botnet owners, malware tool authors and provide insight into the size and scope of underground markets, making the research appealing to law enforcement.

To achieve their results, the researchers used techniques including stylometric analysis, the authorship attribution framework JStylo, and latent Dirichlet allocation, which can distinguish a conversation on stolen credit cards from one on exploit-writing and similarly help identify interesting people.
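As a rough illustration of the function-word idea, here is a minimal Python sketch that builds a function-word frequency profile for each text and picks the known author whose profile is closest by cosine similarity. It is not the researchers' method or the JStylo API; the word list, the function names (`profile`, `cosine`, `closest_author`) and the choice of cosine similarity are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative list (an assumption); real stylometry uses far more words.
FUNCTION_WORDS = ["the", "a", "of", "and", "to", "in",
                  "that", "but", "is", "it", "for", "with"]

def profile(text):
    """Relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def closest_author(unknown_text, known_texts):
    """Return the author in known_texts whose profile best matches unknown_text."""
    target = profile(unknown_text)
    return max(known_texts,
               key=lambda name: cosine(target, profile(known_texts[name])))
```

Calling `closest_author(anonymous_post, {"alice": alice_thesis, "bob": bob_blog})` returns the name with the most similar function-word usage. With only a dozen words and short texts the result is unreliable, which is consistent with the large word counts the researchers say they need.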

While successful, the work faces a series of challenges. Analysis requires a minimum of 5000 words per user (this research used the “gold standard” of 6500 words), which culled the list of potential targets from tens of thousands to mere hundreds.
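The effect of the word-count threshold can be sketched in a few lines of Python. The data layout (a dict mapping each user to a list of post strings) and the function name `eligible_users` are hypothetical, and counting whitespace-separated tokens is a simplifying assumption; only the 5000-word minimum comes from the article.

```python
MIN_WORDS = 5000  # minimum word count reported in the article

def eligible_users(posts_by_user, min_words=MIN_WORDS):
    """Return {user: total_words} for users with at least min_words of text.

    posts_by_user is assumed to map a user name to a list of post strings.
    """
    totals = {
        user: sum(len(post.split()) for post in posts)
        for user, posts in posts_by_user.items()
    }
    return {user: n for user, n in totals.items() if n >= min_words}
```

Applied to a forum dump, a filter like this would discard most accounts, which is how a pool of tens of thousands of users can shrink to a few hundred candidates.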

The analysis also needs to separate product information, such as listings for credit cards, exploits and drugs, from conversational text so that machine learning can automate the process, according to researcher Aylin Caliskan Islam.

And posts must be translated to English, a process which boosted author identification from 66 to around 80 percent but was imperfect using freely available tools like Google and Bing.

However, both of these tasks were performed successfully, and further development, including the use of “exclusive” language translation tools, would only serve to boost the identification accuracy.

Leetspeak, an alternative alphabet popular in some forum circles, cannot be translated.

One Response to “Linguistics Identifies Anonymous Users”

  1. neologiste says:

    i think they meant “l33tspeak”

    brilliant, and hilarious. all those teenaged mmorpg players (and their predecessors) certainly had no idea they were creating a secret language that would serve to preserve their anonymity from the State in such a perfectly infuriating way.
