We face a novel challenge today: every internet user generates copious amounts of data every day, yet we lack adequate resources to take full advantage of it. Without diving into what it means to "take full advantage" and getting into privacy concerns, let us consider some of the good ways we can use data:
- Find patterns in public opinion on a topic
- Determine what topic is gaining popularity in communities
- Predict shifts in markets, politics, or other environments
These are broad enough that they are unlikely to have a negative impact on any person's privacy, yet they have clear advantages for governments, companies, and any interest group. Today we see political organizations using public data to estimate the probability of a candidate winning an election, companies measuring how a decision was received (or is likely to be received) by their audience, and interest groups finding like-minded communities to focus their efforts on.
Alright, so data is useful. Now what? Now we need to decide how to make sense of it, and computers are our best bet if we want to tackle any sizable data set. For this exercise, let's take a look at the ELK stack.
What is the ELK stack? The ELK stack is a collection of three tools that work together to make sense of data:
E - Elasticsearch: JSON-based search and analytics engine
L - Logstash: Data collection pipeline
K - Kibana: UI for data visualization
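To make the division of labor concrete, here is a minimal, hypothetical Python sketch of the three roles: a Logstash-like collection step, an Elasticsearch-like keyword index, and a Kibana-like human-readable summary. All names here are illustrative; the real tools do far more than this.

```python
from collections import defaultdict

# "Logstash" role: collect and normalize raw events into documents.
def collect(raw_events):
    return [{"text": e.strip().lower()} for e in raw_events]

# "Elasticsearch" role: index documents by term so they can be searched quickly.
def build_index(docs):
    index = defaultdict(list)
    for i, doc in enumerate(docs):
        for term in doc["text"].split():
            index[term].append(i)
    return index

# "Kibana" role: summarize a search result for a human.
def summarize(index, term):
    hits = index.get(term.lower(), [])
    return f"{len(hits)} document(s) mention '{term}'"

docs = collect(["Hurricane Harvey makes landfall", "Harvey weakens inland"])
index = build_index(docs)
print(summarize(index, "Harvey"))  # → 2 document(s) mention 'Harvey'
```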
Installing Elasticsearch
Installing Logstash
Installing Kibana
```
input {
  twitter {
    consumer_key => "consumer_key"
    consumer_secret => "consumer_secret"
    oauth_token => "oauth_token"
    oauth_token_secret => "oauth_token_secret"
    keywords => ["Keyword1", "Keyword2"]
    full_tweet => true
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
  }
  stdout {
    codec => rubydebug
  }
}
```
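The `keywords` option is what narrows the incoming stream; conceptually it is just a case-insensitive match against the tweet text. A rough Python approximation of that filtering step (the idea of matching on a single `text` field is a simplification; a full tweet object has many fields):

```python
def matches_keywords(tweet_text, keywords):
    """Return True if any keyword appears in the tweet text (case-insensitive)."""
    lowered = tweet_text.lower()
    return any(k.lower() in lowered for k in keywords)

tweets = [
    "Hurricane warnings issued for the coast",
    "Great weather for a picnic today",
    "Hurricanes are getting stronger every year",
]
kept = [t for t in tweets if matches_keywords(t, ["Hurricane", "Hurricanes"])]
print(len(kept))  # → 2
```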
In this example, I filtered tweets by the keywords "Hurricane" and "Hurricanes" in my configuration file, then assigned several filters to split the results into "sub-buckets", the different lines seen in the image. To show the number of tweets containing certain keywords, I used a Date Histogram, which makes visualizing the frequency of data, and in this case relative frequencies, much easier.
To mimic a realistic scenario, I chose several specific subjects (hurricanes by name) belonging to my broad subject (hurricanes), which lets me quickly see how many people are talking about each hurricane and when. This particular example covers only a relatively small window of time, but it is easy to see how the concept scales.
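The chart described above corresponds to an Elasticsearch date histogram aggregation with one filter bucket per hurricane name. Below is a sketch of how such a request body could be built in Python. The field names (`@timestamp`, `text`) and the interval are assumptions that depend on how Logstash indexed the tweets, and `calendar_interval` is the parameter name in recent Elasticsearch versions (older versions used `interval`).

```python
import json

def date_histogram_query(names, field="text", time_field="@timestamp", interval="day"):
    """Build an Elasticsearch request body: tweets per day, split per hurricane name."""
    return {
        "size": 0,  # we only want the aggregation, not the raw hits
        "aggs": {
            "per_day": {
                "date_histogram": {"field": time_field, "calendar_interval": interval},
                "aggs": {
                    "per_hurricane": {
                        "filters": {
                            "filters": {name: {"match": {field: name}} for name in names}
                        }
                    }
                },
            }
        },
    }

body = date_histogram_query(["Harvey", "Irma", "Maria"])
print(json.dumps(body, indent=2))
```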
Now for the more interesting idea explored here: pinpointing associated topics and public sentiment. A good topic to explore in our hurricane example is climate change. It is widely believed that climate change has increased, and will continue to increase, the frequency of catastrophic weather events such as hurricanes. As a rough way of gauging how many people are talking about this relationship, I added a filter for "climate change" or "global warming", using both because people often use them to refer to the same phenomenon. For this time period, it appears that people who were talking about hurricanes at all mentioned climate change as much as or more than they mentioned any particular hurricane. In our example, this suggests that people may be more interested in the cause of these hurricanes than in the hurricanes themselves!
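As a rough offline illustration of that comparison, the sketch below counts how many tweets mention climate change (under either name) versus a specific hurricane. The sample tweets are invented for the example.

```python
CLIMATE_TERMS = ["climate change", "global warming"]

def mentions_any(text, terms):
    """Case-insensitive check for whether any term appears in the text."""
    lowered = text.lower()
    return any(t in lowered for t in terms)

tweets = [
    "Harvey flooding is devastating",
    "Is climate change making hurricanes worse?",
    "Global warming means stronger storms",
    "Praying for everyone in Irma's path",
]
climate = sum(mentions_any(t, CLIMATE_TERMS) for t in tweets)
harvey = sum(mentions_any(t, ["harvey"]) for t in tweets)
print(climate, harvey)  # → 2 1
```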
Okay, so there isn't too much most companies can do with information on who is talking about which hurricanes or why. But this concept is very transferable. Take, for instance, the rapidly changing processor market amid rising competition between Intel and AMD. What are people talking about when they mention Intel? AMD? Do people express positive or negative emotions or ideas when talking about one or the other? The keywords could be set to include AMD and Intel; the filters would first split the results into a graph for each company, then subdivide each graph into groups of keywords related to public sentiment. The use case is wildly different, but the differences in implementation are trivial.
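The adaptation sketched above amounts to a two-level split: first by brand, then by sentiment-related keywords. Here is a hedged Python sketch of that bucketing; the keyword lists and sample tweets are invented, and a real analysis would use a proper sentiment model rather than word lists.

```python
from collections import defaultdict

BRANDS = ["intel", "amd"]
SENTIMENT = {
    "positive": ["love", "fast", "great"],
    "negative": ["hate", "slow", "overpriced"],
}

def brand_sentiment_counts(tweets):
    """Count tweets per (brand, sentiment) bucket, mirroring a split-then-filter chart."""
    counts = defaultdict(int)
    for text in tweets:
        lowered = text.lower()
        for brand in BRANDS:
            if brand not in lowered:
                continue  # first-level split: only count tweets mentioning the brand
            for label, words in SENTIMENT.items():
                if any(w in lowered for w in words):
                    counts[(brand, label)] += 1  # second-level split: sentiment bucket
    return dict(counts)

tweets = [
    "I love my new AMD chip, so fast",
    "Intel feels overpriced this generation",
    "AMD vs Intel, who wins?",
]
print(brand_sentiment_counts(tweets))
```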
Hopefully this has shed some light on the general approach one can take to make use of anonymous personal data, and on why it can be such an important tool for a company or other organization.