Big Public Data to Predict Crowd Behavior: Nathan Kallus Interview

Data Science Weekly Interview with Nathan Kallus, PhD Candidate at the Operations Research Center at MIT - working on using big public data to predict crowd behavior.

We recently caught up with Nathan Kallus, PhD Candidate at the Operations Research Center at MIT. We were keen to learn more about his background, his research into data-driven decision making and the recent work he's done using big public data to predict crowd behavior - especially as relates to social unrest…

Hi Nathan, firstly thank you for the interview. Let's start with your background and how you became interested in working with data...

Q - What is your 30 second bio?
A - I grew up in Israel and went to college at UC Berkeley where I first discovered my passion for statistics and optimization. Today I am a PhD Candidate at the Operations Research Center at MIT and my research revolves around the combination of statistics/data science with mathematical optimization. I am really interested in the theory and practice of data-driven decision-making and in general the analytical capacities and challenges of unstructured and large-scale data.

Q - How did you get interested in working with data?
A - Most things in life cannot be known with certainty. In my own life, this is one of the reasons why I always like to keep an open mind toward new things and never judge others. Statistics, together with the related field of data science, is the most important tool for understanding things that are not absolute, as most things are. It allows us, first, to describe uncertainty in the real world and then, second, to investigate it using real data. This, to me, is very stimulating and makes working with data quite exciting. Optimization is the mathematics of making the best decision possible, but what that decision is depends on the settings. An optimal decision in unrealistic or misspecified settings may end up being a very bad one in practice, so it is critical to recognize uncertainty when optimizing, including uncertainty in one’s model.

The field of operations research has, historically, been primarily model-driven -- necessarily so due to a past dearth of data. Many quantitative methods for decision making were based on modeling and on distributional assumptions with little to no deference to data. At most it was estimate, then optimize. Nonetheless, the field has transformed whole industries: airlines, advertising, retail, finance, and more. The explosion in the availability and accessibility of data is ushering forth a shift toward a data-driven paradigm for decision making. New theory, methods, and applications are necessary to realize this and the combination of statistics and data science with optimization provides the right toolset. I find it critical and fascinating to work on the methodological advances in data-driven decision making that must come hand-in-hand with the technological advances of the era of information, and on practical applications that combine these effectively.

Q - It's definitely an exciting time to be in this field! ... So, what was the first data set you remember working with? What did you do with it?
A - Perhaps my first endeavor into predictive analytics was a final project in a course I took in undergrad. I was wondering if I could uncover certain innate personal characteristics of users such as sex (which is at least for the most part innate) based on petty aspects of their Facebook profile such as group memberships (this was before Likes, etc.). I wrote up a dummy Facebook app and got some friends to install it so I could scrape their friend network. Soon enough I had a few thousand profiles. The prediction was reasonably accurate, if I recall.

Q - Was there a specific "aha" moment when you realized the power of data?
A - I don't know about a first "aha," but the biggest "aha" was definitely when I saw how accurately I could predict real-world events like mass protests and violent social unrest using data from the "virtual" world of social media. There the scale of the data was really critical for success.

We'll dive into that in more detail shortly! First, let's talk more broadly about your field of research - data-driven decision making…

Q - What have been some of the main advances in recent years?
A - The most important advances have been technological, resulting in an increase in the availability and accessibility of useful data. Enterprise resource planning (ERP), supply chain management (SCM), and customer relationship management (CRM) software platforms are becoming more ubiquitous and are collecting more raw data by default simply as they operate. A lot of things that used to be offline, like the news and government data, are now online and machine-readable and can therefore be used for analysis, often with the help of natural language processing. New modes of communication such as Facebook and Twitter have taken hold online and data from these provide a digital window into sentiments, consumer behavior, the behavior of crowds, and more. In 2012 about 2.5 exabytes of data were created each day and this number has increased by some 25% each year since (fun fact: Walmart consumer transactions alone make up approximately 0.01% of daily data collection/creation).
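
To put those figures in perspective, here is a back-of-the-envelope sketch of the arithmetic. It assumes decimal (SI) units, reads the 25% figure as simple annual compounding from 2012, and applies the 0.01% share to the 2012 daily volume; the function and variable names are illustrative only.

```python
EXABYTE = 10**18   # bytes, decimal (SI) definition (assumption)
TERABYTE = 10**12  # bytes, decimal (SI) definition (assumption)

def daily_data_bytes(year, base_year=2012, base_eb_per_day=2.5, growth=0.25):
    """Project daily data creation, compounding `growth` once per year from `base_year`."""
    return base_eb_per_day * EXABYTE * (1 + growth) ** (year - base_year)

# 0.01% written as a fraction; applied here to the 2012 daily volume.
walmart_share = 0.0001
walmart_tb_per_day = daily_data_bytes(2012) * walmart_share / TERABYTE

# Two years of 25% growth: 2.5 * 1.25**2 exabytes per day.
eb_per_day_2014 = daily_data_bytes(2014) / EXABYTE
```

Under these assumptions the quoted share works out to roughly 250 terabytes of transaction data per day, and the daily total grows to a little under 4 exabytes two years later.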

This explosion of data is changing how we think about a lot of things but most importantly it needs to change the way we make decisions or its collection is for naught. The lack of real impact on decisions was (or still is) one of the biggest criticisms of the "Big Data" buzzword frenzy. I was reading the Technology Review's Business Report recently and liked the tagline they had: "What’s the point of all that data, anyway? It’s to make decisions" (Jan 2014).

Q - That's a great quote! ... So what are the biggest areas of opportunity / questions you would like to tackle?
A - How to use unstructured (e.g. text, video, health records) and irregular (e.g. non-IID) data -- properties that characterize (or at least should characterize) the kind of data referred to as "Big" -- for decision-making in a theoretically principled manner with practical impact.

Q - A great goal! :) Now, two of your more theoretical papers recently won awards in the Operations Research community (congratulations!) - can you share a high level summary of one or both? (e.g., what problem you were tackling, how you approached it, what you found etc.)
A - In both papers we were addressing the fundamental question of how to go from data to decisions and in both the idea of combining statistical tools (both theoretical and practical) with optimization tools (same) was the key. On both I worked with Dimitris Bertsimas and Vishal Gupta.

In one, we developed a new framework for data-driven optimization under uncertainty that is notable for combining generality, tractability, convergence, and finite-sample performance guarantees. The key was in making theoretical connections between statistical properties of hypothesis tests and optimization properties of decision problems. This yielded a new theoretical framework that unified some existing work and resulted in new tools that are well suited for practical use.

In the other, we worked with an already widely popular method for optimization under uncertainty, robust optimization (RO), and showed how it can be made data-driven and tailored for data-rich environments. The key ingredient in RO is an uncertainty set. These are traditionally designed ad hoc and in a model-driven manner. The paper proposes several procedures for designing these based directly on data and shows that data-driven variants of existing RO approaches consistently outperform their data-poor analogues.
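
To make the notion of an uncertainty set concrete, the following minimal sketch builds a box-shaped set from sample quantiles and checks a linear constraint a·x <= b against the worst case over that box. This is not one of the procedures from the paper; the quantile box is a simple stand-in chosen for illustration, and all function names and the sample values are hypothetical.

```python
def quantile(sorted_vals, q):
    """Empirical quantile by linear interpolation between order statistics."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

def box_uncertainty_set(samples, coverage=0.9):
    """Per-coordinate [low, high] intervals holding the central `coverage`
    fraction of each coordinate's observed values (marginal, not joint)."""
    tail = (1 - coverage) / 2
    columns = [sorted(col) for col in zip(*samples)]
    return [(quantile(c, tail), quantile(c, 1 - tail)) for c in columns]

def worst_case_lhs(box, x):
    """Maximum of a.x over all a in the box: each coordinate independently
    takes whichever endpoint makes its term largest."""
    return sum(max(lo * xi, hi * xi) for (lo, hi), xi in zip(box, x))

def is_robustly_feasible(box, x, b):
    """True if a.x <= b holds for every coefficient vector a in the box."""
    return worst_case_lhs(box, x) <= b

# Made-up observations of an uncertain coefficient vector a = (a1, a2).
samples = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [1.1, 2.1], [0.9, 1.9]]
box = box_uncertainty_set(samples, coverage=1.0)  # full observed range
feasible_loose = is_robustly_feasible(box, [1.0, 1.0], b=4.0)
feasible_tight = is_robustly_feasible(box, [1.0, 1.0], b=3.0)
```

With the full observed range as the set, the decision x = (1, 1) is accepted for b = 4 but rejected for b = 3, since the worst-case coefficients sum to about 3.4. Lowering `coverage` shrinks the box and makes the check less conservative.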

Q - Thanks for sharing - and congrats again on the recognition! ... Let's switch gears and talk about your recent work on Predicting Crowd Behavior with Big Public Data, which is a great example of how some of your research can be applied - and has been featured in the news … could you tell us a little more about this work? First, what specific problem were you trying to solve?
A - I had become quite interested in the predictive power of social media. Never before has this amount of communication been publicly available and so accessible. When it comes to predicting the actions of people, it makes a lot of sense: it’s crowds talking about themselves. For example, the manifestation of mass demonstrations often involves collective reinforcement of shared ideas and public calls to action to gather at a specific place and time. Both of these now take place to some extent online, perhaps providing a sufficiently wide window into the real world and its future trajectory. So I set out to both verify and quantify this idea that social media data, and other online data, can predict significant future events such as mass protests, violent social unrest, and cyber hacktivism campaigns.

Q - Makes sense. How did you approach it?
A - I teamed up with a company called Recorded Future. They collect a ton of data from various open-content online sources like news, government publications, blogs, social media, etc. This is something that's sometimes called web intelligence or open source intelligence. Importantly, this data included lots of Twitter activity of the sort I talked about above.

I first looked for signs of potential predictive signals in the data. A descriptive analysis suggested a potential signal in the Tweets that were posted before a given day but seemingly talked about a protest to occur on that day (for example, social-media calls to arms are often like this: posted before the day in question, about the day in question). The key here was to analyze the unstructured text for the events reported or discussed in it and the time frame for each event's purported occurrence. Other signals emerged from analyzing, for example, the news. In tandem, these were helpful in sifting out the social media trends that fizzled out before materializing on the ground.
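
A minimal sketch of that forward-looking signal follows. It assumes an upstream text-analysis step has already produced, for each post, the date it was posted and the date of the event it refers to (or `None` if no dated event was found); that step, the tuple layout, the function name, and the example dates are all hypothetical.

```python
from collections import Counter
from datetime import date

def forward_looking_counts(posts):
    """Count, per target day, the posts written strictly before that day
    that mention an event expected to occur on it."""
    counts = Counter()
    for posted_on, event_on in posts:
        if event_on is not None and posted_on < event_on:
            counts[event_on] += 1
    return counts

# Made-up (posted_on, event_on) pairs for illustration.
posts = [
    (date(2020, 1, 7), date(2020, 1, 10)),   # call to action, 3 days ahead
    (date(2020, 1, 8), date(2020, 1, 10)),   # another advance mention
    (date(2020, 1, 10), date(2020, 1, 10)),  # same-day report: not forward-looking
    (date(2020, 1, 11), date(2020, 1, 10)),  # after the fact: excluded
    (date(2020, 1, 9), None),                # no dated event found in the text
    (date(2020, 1, 9), date(2020, 1, 15)),   # advance mention of a later day
]
counts = forward_looking_counts(posts)
```

Only mentions that precede the day they describe are counted, so same-day and retrospective reports do not leak into the signal for that day.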

Then it was a matter of employing machine learning to train a predictive classification model to uncover the right predictive patterns in the set of signals I had extracted. Transforming and normalizing the data in the right way was also critical to making it work. There was also simply a ton of data, so of course an ample amount of engineering was necessary to handle it -- that, and a rather big computer.
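
The interview does not say which model or normalization was used, so the sketch below substitutes two common stand-ins: z-score standardization and logistic regression trained by plain gradient descent. The feature columns (advance social-media mentions and news mentions per day) and every number in the toy data are invented for illustration.

```python
import math

def fit_standardizer(rows):
    """Per-feature mean and (population) standard deviation from training rows."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 for a constant feature to avoid dividing by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return means, stds

def standardize(row, means, stds):
    return [(v - m) / s for v, m, s in zip(row, means, stds)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(rows, labels, lr=0.5, epochs=500):
    """Full-batch gradient descent on the mean logistic loss."""
    n, d = len(rows), len(rows[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * x[j] / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict_proba(row, w, b):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b)

# Toy data: [advance social-media mentions, news mentions]; label 1 = event occurred.
raw = [[5, 1], [8, 0], [12, 2], [150, 9], [220, 14], [310, 11]]
labels = [0, 0, 0, 1, 1, 1]

means, stds = fit_standardizer(raw)
train = [standardize(r, means, stds) for r in raw]
w, b = train_logistic(train, labels)
predicted = [int(predict_proba(x, w, b) >= 0.5) for x in train]
```

Note that a new day's features have to be scaled with the training means and standard deviations before calling `predict_proba`; refitting the scaler on new data would change what the learned weights mean.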

Q - I can imagine! What answers / insights did you uncover?
A - While the issue of whether mobilization, and how much of it, occurs online is highly controversial, it did become quite clear to me that the permeation of connective technologies was sufficient that the dialogues on platforms like Twitter provided a sufficiently clear window into these processes to allow pretty accurate prediction. I really could verify that social media data had the power to predict some futures and that it does so quite accurately. In the paper I looked at historical data in order to validate this, but right now the system is running live in real time and it's exhilarating to follow it and see it actually predicting these events before they occur and before I see them reported in the news. We have documented many cases including in Pakistan, Bahrain, and Egypt where we saw a clear prediction and then two to three days later headlines everywhere.

Q - What are further potential applications of this approach? Which get you most excited?
A - The ability to forecast these things has clear benefits. Countries and authorities faced with a high likelihood of significant protest can prepare themselves and their citizens to avoid any unnecessary violence and damage (sadly a lot of unrest in certain regions ends up with clashes and deaths). Companies with personnel and supply chain operations in an affected region can ask their employees to stay at home and to remain apolitical and can attempt to safeguard their facilities in advance. Countries, companies, and organizations faced with possible cyber campaigns against them can beef up their cyber security in anticipation of attacks or even preemptively address the cause for the anger directed at them.

Besides providing predictions, the system can be useful to pundits and decision makers because it allows one to follow in particular those trends in social media that will actually lead to a major event. Then one can better understand what is going on now, where it will lead, as well as why. Let me give an example. The system had predicted in advance the unrest in Egypt that sadly led to 5 deaths on March 28. If you looked at the tweets that supported that prediction, you could see calls to demonstrate from both the pro-Brotherhood and pro-El-Sisi sides, raising fears of street clashes, and that the group Tamarod was behind the latter calls. These were actually two facts that were at a later time specifically mentioned in an AP release.

More broadly I am very excited about the predictive power of such web intelligence in more business operations, as it has for example been shown to predict product demand and other consumer behavior. With this sort of prediction, the decision theory actually becomes more complicated and is the subject of a current research project of mine.

Nathan - Thank you ever so much for your time! Really enjoyed learning more about your background, your research areas and the work you've done using big public data to predict crowd behavior - especially as relates to social unrest. Good luck with your ongoing research!

Readers, thanks for joining us! If you want to keep up to speed, Nathan can be found online here.

P.S. If you enjoyed this interview and want to learn more about

  • what it takes to become a data scientist
  • what skills you need
  • what type of work is currently being done in the field

then check out Data Scientists at Work - a collection of 16 interviews with some of the world's most influential and innovative data scientists, who each address all the above and more! :)
