Deep Analysis of Russian Twitter Trolls


Kenny Bastani


The history behind Russian disinformation is a dense and continuously evolving subject. Much of the best research on it hasn’t yet reached the mainstream, which made this an excellent opportunity to see whether I could use some open source tooling to surface new analytical evidence.

The premier dataset available to researchers on this subject has a lot of history behind it. Researchers from Clemson University, Darren Linvill and Patrick Warren, published a dataset containing 2,973,371 tweets from a network of 2,848 fake accounts belonging to Russia’s Internet Research Agency (IRA). Darren and Patrick have added an extraordinary amount of depth to this subject over the years, helping to monitor and tackle foreign malign influence on social media platforms.

“At heart, if security agencies and political actors throughout the democratic world are to detect and deter such action in the future, it is crucial that we understand the pattern of such strategic social media activity, and develop tools to resist it when it emerges.”

“The Russians are Hacking My Brain!”

— Darren Linvill, Patrick Warren, Brandon Boatwright, and Will Grant

The full dataset was published by the Clemson researchers and open-sourced through FiveThirtyEight on GitHub, where you can read more about its history and schema.

Open source repositories

In this blog post, I’ll show you how to use Apache Pinot and Superset to analyze 3 million IRA tweets open-sourced by FiveThirtyEight.

To get up and running with the example project I discuss in this blog post, head over to my open source repository, which includes a bootstrap recipe.

https://github.com/kbastani/russian-troll-analysis

Analyzing the dataset

Doing efficient exploratory analysis over millions of tweets in real time requires a fast datastore designed for precisely that. Apache Pinot provides the backend query capabilities that made this research possible. On top of Pinot, I needed a tool for building charts and dashboards, and Apache Superset played that role perfectly.
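If you want to poke at the data yourself, everything runs locally. As a minimal sketch, assuming a default quick-start style setup and an ingested table named ira_tweets (the table name, ports, and column names here are illustrative, not necessarily what the bootstrap recipe uses), you can query the Pinot broker directly from Python with the pinotdb driver:

```python
from pinotdb import connect

# Connect to the Pinot broker's SQL endpoint (default quick-start ports assumed).
conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
cur = conn.cursor()

# "ira_tweets" is an illustrative name for the ingested FiveThirtyEight CSVs;
# account_category carries the Clemson labels (RightTroll, LeftTroll, NewsFeed, ...).
cur.execute("SELECT account_category, COUNT(*) AS tweets FROM ira_tweets GROUP BY account_category")
for row in cur:
    print(row)
```

Superset connects to the same broker through the pinotdb SQLAlchemy dialect, so every chart in this post ultimately boils down to a SQL query like the ones sketched below.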

My analysis started with some basic assumptions based on my prior research on the subject. First, I wanted to take a step back from the existing belief that the trolls employed strategies to influence a desired election outcome. This assumption is now infamous and may have led to complex investigations politicized by both the news media and members of Congress after 2016.

So, I thought to myself, what if election interference using social media isn’t technically possible?

Exploring the data

After loading the raw data into Apache Pinot, the first step was to verify the analysis that was initially provided by FiveThirtyEight in 2018. The first chart they showed was a simple activity view that attempted to show possible election interference in 2016.


Image by Darren Linvill and Patrick Warren | From FiveThirtyEight

To verify that I had the same dataset, I generated a SQL query and visualization using Pinot and Superset.
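The query behind that verification chart is essentially a count of tweets per day, split by account category. Here is a rough sketch of it, assuming publish_date was ingested as epoch milliseconds (ToDateTime is Pinot’s built-in conversion from epoch millis to a date string); the real chart was built in Superset, but the SQL is the same idea:

```python
import pandas as pd
from pinotdb import connect

cur = connect(host="localhost", port=8099, path="/query/sql", scheme="http").cursor()

# Daily tweet counts per account category; LIMIT raised because Pinot defaults to 10 rows.
cur.execute("""
    SELECT ToDateTime(publish_date, 'yyyy-MM-dd') AS day,
           account_category,
           COUNT(*) AS tweets
    FROM ira_tweets
    GROUP BY ToDateTime(publish_date, 'yyyy-MM-dd'), account_category
    LIMIT 100000
""")

df = pd.DataFrame(cur.fetchall(), columns=["day", "account_category", "tweets"])

# Pivot so each account category becomes its own series, like the FiveThirtyEight chart.
activity = df.pivot_table(index="day", columns="account_category",
                          values="tweets", aggfunc="sum").sort_index()
print(activity.tail())
```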


Russian Troll Tweet Activity from 2015 to 2018

After verifying that my query matched the Clemson researchers’ chart, I went further to understand other features in the dataset.


Image by Darren Linvill and Patrick Warren | From FiveThirtyEight

The graphic above shows the extensive work the Clemson researchers did to categorize the fake IRA accounts’ intentions and behaviors. This view of the data is where, for me, the election interference assumption broke down. Twitter is a complex dynamical system of action and reaction, and it moves so fast that it would be difficult to game it toward any particular outcome.

I decided to develop a chart that more easily told the story of the behaviors of the different types of Twitter accounts in the dataset.

The chart I came up with smooths the activity shown in the FiveThirtyEight visualization. Here we see spiking activity going back to 2015 that may not be related to election interference. What I needed next was a view of the narratives behind each of these spikes, so I looked at whether particular topics or themes in the mainstream news media explained them.
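The smoothing itself is nothing exotic. I built the chart in Superset, but the idea is a simple rolling average; continuing from the activity DataFrame in the earlier sketch, it looks roughly like this:

```python
import matplotlib.pyplot as plt

# "activity" is the day-by-category DataFrame from the previous snippet.
# A centered 7-day rolling mean removes day-to-day noise while keeping the spikes visible.
smoothed = activity.fillna(0).rolling(window=7, center=True, min_periods=1).mean()

smoothed.plot(figsize=(12, 5), title="Smoothed IRA tweet activity by account category")
plt.show()
```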


Russian Troll Twitter Activity Related to Fox News

The chart above shows all mentions related to Fox News from the right trolls. The query surfaced some interesting spiking activity, so I decided to check the Fox News headline for June 10th, 2015.
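For anyone reproducing this, the Fox News slice can be pulled with a query along these lines; the matching pattern below is illustrative rather than my exact filter, and RightTroll is the account category label used in the FiveThirtyEight data:

```python
from pinotdb import connect

cur = connect(host="localhost", port=8099, path="/query/sql", scheme="http").cursor()

# Right-troll tweets mentioning Fox News, bucketed by day.
cur.execute("""
    SELECT ToDateTime(publish_date, 'yyyy-MM-dd') AS day,
           COUNT(*) AS tweets
    FROM ira_tweets
    WHERE account_category = 'RightTroll'
      AND content LIKE '%FoxNews%'
    GROUP BY ToDateTime(publish_date, 'yyyy-MM-dd')
    LIMIT 10000
""")
print(cur.fetchall()[:5])
```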


https://web.archive.org/web/20150610010931/http://www.foxnews.com/

What shocked me about this result was that it was eerily similar to narratives that followed the tragic death of George Floyd in May 2020. To understand whether there was anything to this particular headline concerning the Twitter dataset, I decided to compare Fox News to other news media outlets.


https://web.archive.org/web/20150610080013/http://www.cnn.com/

Above is a screenshot of CNN’s website several hours after Fox News ran its headline. The big difference in coverage between the two sites is that Fox News used terms that were narratives in the IRA tweets, while CNN did not. While this was only one data point, I needed to understand whether the narrative drove the news cycle or vice versa. Which came first, and how did the coverage change over time?

The chart below shows the origin event where police and racial injustice began as a narrative. The query I used to filter results is a relevance-based search using keywords found in the June 10th headline on the Fox News site. Apache Pinot has a full-text indexing implementation based on Apache Lucene, which lets me return results relevant to the query.
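Two pieces make that possible in Pinot: a Lucene text index declared on the content column in the table config (a fieldConfigList entry with indexType TEXT), and the TEXT_MATCH predicate in the query. Here is a minimal sketch, with an illustrative Lucene expression built from the June 10th headline keywords rather than my exact query:

```python
from pinotdb import connect

cur = connect(host="localhost", port=8099, path="/query/sql", scheme="http").cursor()

# Requires a Lucene text index on "content"; the Lucene query string is illustrative.
cur.execute("""
    SELECT ToDateTime(publish_date, 'yyyy-MM-dd') AS day,
           account_category,
           COUNT(*) AS tweets
    FROM ira_tweets
    WHERE TEXT_MATCH(content, '"white cop" OR wrestling OR "pool party" OR police')
    GROUP BY ToDateTime(publish_date, 'yyyy-MM-dd'), account_category
    LIMIT 100000
""")
```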


Text-based Search using FoxNews.com Headline: “A white Texas cop filmed wrestling…”

Here we see that there is a clear origin event for the narrative around racial injustice and police on June 10th, 2015. The narrative begins with spiking activity dominated by right trolls. In between these blue spikes there is sustained activity by left trolls on the same narrative. The final spikes are, again, dominated by right trolls.


Twitter Troll Activity Related to White Supremacy and Police

After running some smoothing on the data, I could see that the largest spike in the chart spans the full month of August 2017. I checked the news headlines for that month using the Wayback Machine and compared the narratives to the right-troll narratives from June 2015.


https://web.archive.org/web/20170813011501/http://www.foxnews.com/

The narratives in the news media were now entirely consistent with the ideological content from two years earlier. The consistent theme? Terror, fear, anger, and outrage.


https://web.archive.org/web/20170813011501/http://www.cnn.com/

At this point, the takeaway for me was that toxic ideologies and phrases in the IRA dataset had become prevalent in more recent news media headlines. To understand how ideologies and narratives were evolving, I decided to enrich the original dataset using named entity recognition from the open source Stanford CoreNLP library.

Named Entity Recognition

Going any further in understanding the semantic content and narratives of the IRA tweets requires time-series-based natural language processing. Ideally, I wanted to relate the time series charts not to individual tweets but to the text entities contained across all tweets.

The Stanford NLP Group provides a Java-based library, Stanford CoreNLP, for performing named entity recognition (NER). I’ve used this library on tweets in the past, and it works reasonably well.
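CoreNLP runs on the JVM, but you don’t have to write Java to use it. Here is a rough sketch of driving it from Python through the Stanford NLP Group’s stanza wrapper, which launches the CoreNLP server in the background; it assumes the CORENLP_HOME environment variable points at an unpacked CoreNLP distribution, and the example tweet is made up:

```python
from stanza.server import CoreNLPClient

# An illustrative tweet; the real enrichment job loops over every row of the dataset.
text = "Fox News says a white Texas cop was filmed wrestling a black teenager in McKinney."

# Starts a local CoreNLP server (CORENLP_HOME must point at a CoreNLP install).
with CoreNLPClient(annotators=["tokenize", "ssplit", "pos", "lemma", "ner"],
                   timeout=30000, memory="4G", be_quiet=True) as client:
    ann = client.annotate(text)
    for sentence in ann.sentence:
        for token in sentence.token:
            if token.ner != "O":  # keep only tokens tagged with a named-entity category
                print(token.word, token.ner)
```

The entity categories CoreNLP emits (PERSON, ORGANIZATION, TITLE, IDEOLOGY, CRIMINAL_CHARGE, CAUSE_OF_DEATH, and others) are what the charts in the rest of this section are grouped by.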

The first enrichment chart shows the number of tweets that contained a named entity for a particular category.

This next chart shows the number of distinct entity names belonging to each category. For this chart, I’ve filtered out handle and url, which are irrelevant for understanding different narratives behind the fake accounts.

Now that we have a reasonably good understanding of the distribution of entities and categories, we can begin to look at what each category contains. Specifically, we want to look at the named entities that classify the narratives of different fake accounts.
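Each of the category breakdowns below comes from the same style of query against the NER-enriched data. Here is a sketch of the kind of query involved; the table and column names (ira_tweet_entities, entity_category, entity_value) are assumptions about the enriched schema for illustration, not necessarily the names used in the repository:

```python
from pinotdb import connect

cur = connect(host="localhost", port=8099, path="/query/sql", scheme="http").cursor()

# Most-mentioned PERSON entities, broken down by the fake account's category.
cur.execute("""
    SELECT account_category, entity_value, COUNT(*) AS mentions
    FROM ira_tweet_entities
    WHERE entity_category = 'PERSON'
    GROUP BY account_category, entity_value
    ORDER BY COUNT(*) DESC
    LIMIT 25
""")
for row in cur:
    print(row)
```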

People

This chart contains the number of tweets by the fake account’s category for person entities.

Organizations

This chart contains the number of tweets by the fake account’s category for organization entities.

Criminal charges

This chart contains the number of tweets by the fake account’s category for criminal charges.

Causes of death

This chart contains the number of tweets by the fake account’s category for causes of death.

Miscellaneous

This chart contains the number of tweets by the fake account’s category for miscellaneous entities.

Titles

This chart contains the number of tweets by the fake account’s category for entities representing various titles that refer to a person.

Ideologies

This chart contains the number of tweets by the fake account’s category for ideologies.

Ideology time series

This chart is a time series that shows a smoothed distribution of ideologies mentioned in tweets.

Right troll ideologies

This next chart shows ideologies used for right troll accounts.

Left troll ideologies

This chart shows ideologies used for left troll accounts.

Drawing conclusions

Was it possible that IRA trolls were successful at interfering with elections in the United States? To answer this question, I feel that it is essential to understand what election interference means.

What is election interference?

Election interference is a nebulous political term that holds little to no legal weight when used by politicians. Foreign malign influence, however, describes nation-state-funded campaigns intended to influence the public opinion of voters. When foreign nations attempt to influence public opinion, and with it the voters in an election, it becomes a national security concern.

Now, what about domestic election interference?

Domestic election interference is an entirely made-up term that roughly translates to “campaigning for a political candidate.” As long as politicians obey campaign finance laws and are not involved in any form of election fraud, they are free to interfere with an election domestically without legal liability. The only exception to this rule that I could find is when political speech on social media, for example, is a threat to national security or public safety.

Did the IRA activity on Twitter interfere with US elections?

Absolutely, but not for the reasons most people think. After extensively analyzing these tweets, it’s clear to me that election interference wasn’t the point. Instead, the purpose of these three million tweets was to amplify terror, outrage, and fear, and to sow discord in the public square. Part of the damage these tweets caused is that no one can conclude what their ultimate goal was; just reading through them reflects back onto the reader whatever bias or conclusion they held before the exercise.

Social media’s real danger is the fog of confusion that makes people believe that no one tells the truth. In reality, I think there probably isn’t a simple truth, or for those that know the truth, there’s no simple explanation.

Virtue signaling does occur between members of political parties, but that’s nothing new. Twitter may amplify this effect, and some politicians may find themselves disproportionately influential through virtue signaling around individual narratives. It falls on politicians to understand that there’s a price to divisive virtue signaling during hotly contested political campaigns.

Twitter has an excessive amount of control over users who trust their feeds. We are, after all, social creatures. Twitter gives us a space to learn how to identify with our in-group, through things such as memes and virtue signaling. Virtue signaling tends to gain more likes, retweets, and followers.

An in-group’s behavior influences everyone in some way or another, and on Twitter, no one is immune from malign influence. Malign influence seems to thrive on the idea that virtue signaling is key to preserving an in-group’s boundaries. By introducing fake out-groups that are ideologically opposed to a Twitter user’s in-group, the influencers of that group submit to more extreme forms of virtue signaling to their followers. This is where I think the dominoes began to fall in 2015.

Solving the Problem

So, how do we stop malign influence actors from spreading toxic or extreme ideological narratives by virtue signaling? I don’t think we can without more research on the psychology behind digital virtue signaling. Malign influence actors need only find a way into the feed of a well-intentioned member of an in-group. Retweets and likes are both vulnerable to this, and it only takes one person you trust to spread malign influence throughout an in-group.

As for a solution?

I think solving this problem on Twitter requires reducing the amplification of retweets and rethinking the recommendation algorithm. Recommended topics should ideally steer toward non-political sources of virtue signaling rather than ones that feed on fear, anger, or outrage. This isn’t a simple problem to solve, and I hope to see more data and research as Twitter gets a handle on things.

Final thoughts

I hope this article was interesting and helpful to readers. If you have any feedback, comments, corrections, or recommendations, please feel free to reach out to me here or on Twitter. I encourage those brave enough to explore the open source project I’ve put together, which will let you replicate my findings and run your own analysis on the data.

It’s a passion topic for me, and I hope it becomes one for you too. I’ve spared no effort to make it as easy as possible to set up your own analysis environment.

To get up and running with the example project I discuss in this blog post, head over to my open source repository with the bootstrap recipe.

https://github.com/kbastani/russian-troll-analysis

Special thanks

Special thanks to folks in the Apache Pinot community for providing feedback on this article. Thank you also to Darren Linvill, Patrick Warren, and the others who are working to inform decision makers about the threat of foreign malign influence on social media.