Using ELK to Analyze Twitter Data

We face a novel challenge these days: every internet user generates copious amounts of data every day, yet we lack adequate resources to take full advantage of that data. Without diving into what it means to "take full advantage" and getting into privacy concerns, let us consider some of the good ways we can use data.

  • Find patterns in public opinion on a topic
  • Determine what topic is gaining popularity in communities
  • Predict shifts in markets, politics, or other environments

These are broad enough that they are unlikely to have a negative impact on any person's privacy, but they offer clear advantages to governments, companies, and interest groups. Today we see political organizations using public data to estimate the probability of a candidate winning an election, companies gauging how a decision was received (or is likely to be received) by their audience, and interest groups finding like-minded communities to focus their efforts on.

Alright, so data is useful. Now what? Now we need to decide how to make sense of it, and computers are our best bet if we want to tackle any sizable data set. For this exercise, let's take a look at the ELK stack.

What is the ELK stack? The ELK stack is a collection of three tools that work together to make sense of data.

E - Elasticsearch: JSON-based search and analytics engine
L - Logstash: Data collection pipeline
K - Kibana: UI for data visualization

So this gives us a method for collecting data, searching or analyzing the data, and visualizing the data. It seems pretty clear why these get used together more often than independently. Logstash processes whatever data we give it access to, based on filters we set, and sends the result on to Elasticsearch. Then, using Kibana, we can view and analyze our filtered data.
Setup isn't too bad for any of the three on any operating system with the current release (7.2.0 at the time of writing), and the developers have what seems to be very thorough documentation on how to get set up (a quick sanity check follows the list below):

Installing Elasticsearch
Installing Logstash
Installing Kibana
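
Once everything is installed and started, a quick way to confirm that Elasticsearch and Kibana are actually listening is to poke their default ports. This is just a rough sketch of my own (it assumes a stock local install, the default ports 9200 and 5601, and Python with the requests library), not part of Elastic's instructions.

import requests

def check(name, url):
    # Print the HTTP status if the service answers, or a warning if it doesn't.
    try:
        resp = requests.get(url, timeout=5)
        print(f"{name}: HTTP {resp.status_code}")
    except requests.ConnectionError:
        print(f"{name}: not reachable at {url}")

check("Elasticsearch", "http://localhost:9200")      # default Elasticsearch HTTP port
check("Kibana", "http://localhost:5601/api/status")  # default Kibana port and status endpoint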

Preparing to Connect to Twitter
Clearly, in order to analyze Twitter data we need a way to connect to Twitter. To use their API we need to set up a developer account, register an app with them, then generate some keys and tokens that we will need in order to configure Logstash. Doing this does require that you agree not to submit data to governments, not to spy, and not to try to harvest personal information about users. Basic stuff.
Okay, we've got our keys, so now we can refer to Elastic's documentation for Logstash's Twitter plugin, which we will be making use of, and get the following format for our configuration file:
input {
  twitter {
    consumer_key => "consumer_key"
    consumer_secret => "consumer_secret"
    oauth_token => "oauth_token"
    oauth_token_secret => "oauth_token_secret"
    keywords => ["Keyword1","Keyword2"]
    full_tweet => true
  }
}
output {
  elasticsearch { hosts => ["localhost:9200"] }
  stdout { codec => rubydebug }
}
We can then launch Logstash from the command line and, provided the other components were installed and started as Elastic's instructions suggest, we are up and running, collecting data! This plugin, unfortunately, only collects current data from Twitter while it is running. That can be worked around with additional code and the premium Twitter APIs, but for many use cases live collection is more than enough to work with. It is also worth noting that any API call to Twitter for past Tweets counts towards a relatively limited quota, beyond which you would have to pay.
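
If you want to confirm that tweets are actually landing in Elasticsearch, you can ask it for a document count. This is a minimal sketch of my own, assuming the default logstash-* index pattern; adjust it if your output block names an index explicitly.

import requests

# Count documents in the indices Logstash writes to by default (logstash-YYYY.MM.dd).
resp = requests.get("http://localhost:9200/logstash-*/_count")
print(resp.json().get("count", 0), "tweets indexed so far")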

With a very basic implementation, we mainly have the ability to count how many Tweets in our dataset (which is built from the keywords set in the config) contain some other keyword, which we can define through Kibana.

In this example, I filtered tweets by the keywords "Hurricane" and "Hurricanes" in my configuration file, then assigned several filters to split the results into "sub-buckets", the different lines seen in the image. To chart how many Tweets contained certain keywords, I used a Date Histogram, which makes visualizing the frequency of data, and in this case relative frequencies, much easier.
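
Under the hood, a Date Histogram with keyword sub-buckets boils down to an Elasticsearch aggregation. Here is a rough sketch of that kind of query, sent directly to Elasticsearch rather than built in Kibana; the field names ("text", "@timestamp") and the hurricane names in the buckets are my own placeholder assumptions, not the exact filters behind the chart.

import json
import requests

query = {
    "size": 0,  # we only want aggregation results, not the matching tweets themselves
    "aggs": {
        "tweets_over_time": {
            "date_histogram": {"field": "@timestamp", "fixed_interval": "1h"},
            "aggs": {
                # One sub-bucket (one line on the chart) per phrase we care about.
                # The hurricane names here are placeholders.
                "storms": {
                    "filters": {
                        "filters": {
                            "hurricane_a": {"match_phrase": {"text": "HurricaneNameA"}},
                            "hurricane_b": {"match_phrase": {"text": "HurricaneNameB"}},
                        }
                    }
                }
            },
        }
    },
}

resp = requests.post(
    "http://localhost:9200/logstash-*/_search",
    headers={"Content-Type": "application/json"},
    data=json.dumps(query),
)
print(json.dumps(resp.json()["aggregations"], indent=2))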

To mimic a real-world scenario, I chose several specific subjects (hurricanes by name) belonging to my broad subject (hurricanes), which lets me quickly see how many people are talking about which hurricanes and when. This particular example only covers a relatively small window of time, but it is easy to see how the concept scales.

Now for the more interesting idea explored here: pinpointing associated topics and public sentiment. A good topic to explore in our hurricane example is climate change. It is widely believed that climate change is likely to have increased, and will continue to increase, the frequency of catastrophic weather events such as hurricanes. As a rough way of seeing how many people are talking about this relationship, I added a filter for "climate change" or "global warming", using both because people often use them to refer to the same phenomenon. For this time period, it appears that people who were talking about hurricanes at all mentioned climate change as much as or more than they mentioned any particular hurricane. In our example, this suggests that people may be more interested in the cause of these hurricanes than in the hurricanes themselves!
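
The climate change filter is just a boolean "should" over the two phrases. Here is a sketch of the equivalent count query, under the same assumption that the tweet body sits in a "text" field:

import json
import requests

# Count collected tweets that also mention either phrase.
query = {
    "query": {
        "bool": {
            "should": [
                {"match_phrase": {"text": "climate change"}},
                {"match_phrase": {"text": "global warming"}},
            ],
            "minimum_should_match": 1,
        }
    }
}

resp = requests.post(
    "http://localhost:9200/logstash-*/_count",
    headers={"Content-Type": "application/json"},
    data=json.dumps(query),
)
print(resp.json()["count"], "tweets mention climate change or global warming")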

Okay, so there isn't much most companies can do with information on who is talking about which hurricanes or why. But the concept is very transferable. Take, for instance, the rapidly changing processor market in the midst of rising competition between Intel and AMD. What are people talking about when they mention Intel? AMD? Do people express positive or negative emotions or ideas when talking about one or the other? The keywords could be set to include "AMD" and "Intel", then filters could first split the results into a graph for each brand and then subdivide each graph into groups of keywords related to public sentiment. The use case is wildly different, but the differences in implementation are trivial.
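
To illustrate how little would change, the nested split described above is just a filters aggregation inside another filters aggregation; the structure below could be dropped into the "aggs" section of the earlier search sketch. The brand and sentiment phrases are crude placeholders, not a real sentiment model.

# Split tweets by brand, then split each brand's bucket by sentiment-related phrases.
brand_sentiment_aggs = {
    "brands": {
        "filters": {
            "filters": {
                "intel": {"match": {"text": "Intel"}},
                "amd": {"match": {"text": "AMD"}},
            }
        },
        "aggs": {
            "sentiment": {
                "filters": {
                    "filters": {
                        # A match query on several words matches any of them.
                        "positive": {"match": {"text": "great fast impressive"}},
                        "negative": {"match": {"text": "slow overpriced disappointing"}},
                    }
                }
            }
        },
    }
}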

Hopefully some light has been shed on the general approach a person can take to make use of anonymous personal data, as well as why it can be such an important tool for a company or other organization.
