Social Media & Machine Learning tell Stores where to locate: Dmytro Karamshuk Interview

Data Science Weekly Interview with Dmytro Karamshuk, Researcher in Computer Science and Engineering at King's College London - Mining Online Location-based Services for Optimal Retail Store Placement.

We recently caught up with Dmytro Karamshuk, Researcher in Computer Science and Engineering at King's College London - investigating data mining, complex networks, human mobility and mobile networks. We were keen to learn more about his background, how human mobility modeling has evolved and what his research has uncovered in terms of applying machine learning to social media data to determine optimal retail store placement…

Hi Dmytro, firstly thank you for the interview. Let's start with your background and how you became interested in working with data...

Q - What is your 30 second bio?
A - I’m a computer scientist with startup experience working on understanding user behavior in social media and mobile networks to make the world slightly better... Currently at King’s College London; previously at the University of Cambridge; the Italian National Research Council; the Institute of Market and Technologies in Lucca; and as a managing partner at a software engineering startup (http://stanfy.com).

Q - How did you get interested in Machine Learning?
A - It was a commercial project we did with my classmates (co-founders of stanfy.com) during our university years. We were creating a web 2.0 system for our client, Banks.com Inc., when the idea of recommending interesting stuff to users emerged. Although we didn’t know much about recsys at that time, we implemented something very similar to item-to-item collaborative filtering. It was the year 2005. Everyone was happy with the outcome.

Q - That's great! ... So, what was the first data set you remember working with? What did you do with it?
A - The first dataset was at the age of 15, when I coded a web chat in Perl. The data was stored in a MySQL database and a bunch of my friends would use it for fun. Didn’t do much with it to be honest, but it was interesting to play with the logs and run some simple stats.

Q - Was there a specific "aha" moment when you realized the power of data?
A - When there is a lack of data, researchers would usually start making assumptions and build synthetic models. An “aha” moment is when at last you get your hands on the real data and see how all your assumptions crash one by one. Had it during my PhD :)

I guess that's good and bad! Thanks for sharing your background - let's switch gears and talk in more detail about your research field - Machine Learning Applied to Human Mobility & Urban Dynamics...

Q - How has human mobility modeling been approached in the past?
A - Interest in urban mobility arose many decades (or even centuries) before we (very recently) got the first large-scale traces of human movements. Lacking data, researchers had to rely on either probabilistic synthetic models (the most prominent ones we have discussed in our review) or coarse-grained statistics usually collected through user surveys. There were also more creative ways of gathering data, such as the one dollar bill experiment, where users collectively reported the locations of one dollar bills across the US and these reports were used as a proxy for human movements. The scale of the dataset was phenomenal for that time.

Another groundbreaking experiment, called Reality Mining, was conducted by researchers at MIT: a number of volunteers among MIT students agreed to carry a mobile device in their pockets running software that recorded all Bluetooth communications with other similar devices. This was probably the first dataset on human contact obtained in an automated way.

Frankly, and relatedly, a large-scale dataset of human movements had already existed for a few decades: mobile phone operators have been collecting logs of mobile users’ locations (i.e., base stations from which they access the network) from the earliest days of mobile phones. However, this information was kept under lock and key for a long time because operators were afraid of leaking commercially sensitive information. The first to break this taboo was a group of physicists from Northeastern University in Boston, who published a large-scale study of mobile phone data in their prominent 2008 Nature paper.

More recently, with the emergence of location-based social networks (such as Foursquare, Gowalla or Altergeo), where users voluntarily share their whereabouts with the world (via Twitter, for example), we have finally got public access to a massive trace of human mobility.

Q - Really interesting context ... What then excites you most about bringing Machine Learning and Human Mobility & Urban Dynamics together? How is this approach different from traditional mobility models?
A - Machine learning is useful in two ways in this context: as a tool to build something very practical, very useful, such as recommender systems; and as an exploratory tool for theoretical research to understand complex dependencies and correlations between various variables in a given physical system.

Q - What are the biggest areas of opportunity / questions you want to tackle?
A - One interesting challenge lies in disambiguating information across various sources of data. There are dozens of different signals which we employ in urban computing: from social networks and census data to signals collected from sensors installed in personal devices, cars and embedded in the streets (CCTV cameras for example). So far all these data sources have been mostly considered separately because it is very difficult if not impossible to link a user of, say, an Oyster Card [used on the London subway / tube system] with a Twitter user, or his account in a fitness app or sensor installed in his car. But if we could draw these links at least on some coarse-grained level, among social groups or users with similar demographic profiles for example, it could skyrocket our understanding of user behavior in the cyber-physical world and open up huge space for exploration in various aspects of urban studies. I believe we will see a number of statistical methods as well as technological solutions emerging in this field in the near future.

Q - That would be very exciting! ... Couple of more practical questions. First, what Machine Learning methods have you found most helpful?
A - I play with various supervised-learning models. I find them more suitable for "exploratory" research where one wants to test some hypothesis or check some dependencies in the data. As long as a problem can be formalized as a supervised learning task, the results can be directly validated against the data, which makes it more convenient than unsupervised learning, where manual validation is required.

Q - And what are your favorite tools / applications to work with?
A - I have recently switched to Python (after ten years with Java) and found it very suitable for data analysis. I would name SciPy, scikit-learn and GraphLab as my new favorites. In Java my data analysis bundle would consist of Weka, RankLib, LensKit, Gephi, Apache Commons Math and other statistical libs. When I have inspiration I play with Processing to draw some fancy visualizations.

That is fancy! :) On that note, let's talk more about your recent work on Mining Online Location-based Services for Optimal Retail Store Placement, which has caught retailers' attention - it is a great example of how your research can be applied ...

Q - Could you tell us a little more about this work?
A - In this work we solve the good old problem of finding the best place to open a business in the city, but with new data coming from location-based social networks. Here we wear the hat of a retail chain manager (say, a Starbucks manager) to answer the question: given a number of locations in the city, can we predict the one where a new restaurant will thrive? See, big retailers spend a fortune on sophisticated user surveys and expensive market analysis, so we wondered how user-generated data which was already available online could be used to solve the very same problem in a cost-effective way.

Q - That sounds great! How did you approach it?
A - We collected a dataset of public tweets where users were reporting their location with the Foursquare app. A basic record in this dataset would say a venue V in New York (e.g., a restaurant, railway station or a bank) where a user U was at time T. We used this data to measure how popular each venue is and to build various indicators of popularity. This would include things like the number of competitors in the area, presence of transportation hubs or other place-attractors nearby, intensity of human mobility in the area etc.
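
To make the record format and two of the indicators concrete, here is a minimal Python sketch. The sample check-ins, venue names, coordinates, the 200 m radius and the helper functions are all hypothetical illustrations, not the data or code used in the study.

```python
from collections import Counter
from math import asin, cos, radians, sin, sqrt

# Hypothetical check-in records: (user U, venue V, time T in seconds).
checkins = [
    ("u1", "station", 40),
    ("u1", "cafe_a", 100),
    ("u2", "cafe_a", 160),
    ("u3", "cafe_b", 90),
]

# Hypothetical venue metadata: venue -> (category, latitude, longitude).
venues = {
    "cafe_a": ("coffee", 40.7527, -73.9772),
    "cafe_b": ("coffee", 40.7530, -73.9780),
    "station": ("transport", 40.7528, -73.9765),
}


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))


def popularity(records):
    """Number of check-ins per venue, used here as the popularity measure."""
    return Counter(venue for _, venue, _ in records)


def competitors_nearby(venue, venue_table, radius_m=200.0):
    """Count other venues of the same category within radius_m metres."""
    category, lat, lon = venue_table[venue]
    return sum(
        1
        for other, (cat, olat, olon) in venue_table.items()
        if other != venue
        and cat == category
        and haversine_m(lat, lon, olat, olon) <= radius_m
    )


print(popularity(checkins))
print(competitors_nearby("cafe_a", venues))
```

Other indicators mentioned above, such as the presence of transportation hubs nearby, could be computed the same way by counting venues of a different category within the radius.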

Q - What were the major steps?
A - Initially we tried to solve this problem for a general business. It took us a month of trial and error to understand that the popularity indicators may vary significantly across different types of venues: placing a Chinese restaurant in Chinatown might be a good idea, but not as good for an Italian restaurant. We then decided to focus our efforts on three chains: McDonald's, Starbucks and Dunkin' Donuts.

The next important insight came from the work of a physicist, Pablo Jensen, who a decade ago proposed to use a complex network toolkit to analyze geographic interactions between various business activities. We used some of his ideas to build a popularity indicator that assesses the importance of co-locating a restaurant with other business activities.

Q - So what unlock did the social media data provide?
A - The most crucial difference between the social networking data we had and traditional data sources is the fact that we have fine-grained data on individual users' movements. So, we can ask questions like: what is a place A which users who have just visited B would also usually visit? Like a coffee shop on one's way from a train station to the office. We learned these patterns from the dataset and used them as yet another source to predict popularity.
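
The "visited B, then A" idea can be sketched as counting transitions between consecutive check-ins of the same user. This is only an illustrative Python sketch: the sample records, the one-hour `max_gap` cutoff and the function names are assumptions, not details taken from the study.

```python
from collections import Counter, defaultdict

# Hypothetical check-in records: (user, venue, time in seconds).
checkins = [
    ("u1", "station", 0),
    ("u1", "cafe", 600),
    ("u1", "office", 1500),
    ("u2", "station", 100),
    ("u2", "cafe", 900),
    ("u3", "station", 0),
    ("u3", "office", 7200),
]


def transition_counts(records, max_gap=3600):
    """Count moves between consecutive venues of each user.

    Only pairs of check-ins at most max_gap seconds apart are counted
    (an assumed cutoff, chosen for illustration).
    """
    by_user = defaultdict(list)
    for user, venue, t in records:
        by_user[user].append((t, venue))

    counts = Counter()
    for visits in by_user.values():
        visits.sort()  # order each user's check-ins by time
        for (t1, v1), (t2, v2) in zip(visits, visits[1:]):
            if v1 != v2 and t2 - t1 <= max_gap:
                counts[(v1, v2)] += 1
    return counts


def likely_next(counts, origin):
    """Destinations reached from origin, most frequent first."""
    pairs = [(dst, n) for (src, dst), n in counts.items() if src == origin]
    return sorted(pairs, key=lambda p: (-p[1], p[0]))


counts = transition_counts(checkins)
print(likely_next(counts, "station"))
```

A candidate location could then be scored by how much of this observed flow passes between the venues around it.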

Q - How did Machine Learning help?
A - Once we devised these various indicators of human movements and co-location of places, we used machine learning to build a popularity prediction algorithm. Our goal was to use all the indicators and the popularity data to train a supervised model which would rank a given list of locations according to their prospective success. We tried various models and found that a regression-to-rank model based on support vector regression performed the best in this case.
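
The regression-to-rank idea can be sketched in a few lines with scikit-learn: fit a support vector regressor on indicator values against observed popularity, then sort candidate locations by predicted popularity. Everything specific below is an assumption made for illustration: the synthetic data, the three unnamed features, the linear kernel and the `C` and `epsilon` values are not the study's actual setup.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic training data (illustration only): 40 existing venues,
# each described by 3 hypothetical indicator values in [0, 1).
rng = np.random.default_rng(0)
X_train = rng.random((40, 3))
true_weights = np.array([3.0, 1.0, -2.0])  # made-up relationship
y_train = X_train @ true_weights + 0.05 * rng.standard_normal(40)

# Support vector regression; kernel and hyperparameters are assumed.
model = SVR(kernel="linear", C=10.0, epsilon=0.01)
model.fit(X_train, y_train)


def rank_locations(fitted_model, X_candidates):
    """Return candidate indices ordered from best to worst predicted popularity."""
    scores = fitted_model.predict(X_candidates)
    return np.argsort(-scores), scores


# Three hypothetical candidate locations.
X_candidates = np.array([
    [0.9, 0.5, 0.1],
    [0.1, 0.5, 0.9],
    [0.5, 0.5, 0.5],
])
order, scores = rank_locations(model, X_candidates)
print(order, scores)
```

The regressor never sees the ranking directly; the ranking is simply read off the predicted values, which is what makes this a regression-to-rank approach rather than a dedicated learning-to-rank model.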

Q - What answers / insights did you uncover?
A - Our main finding was that the model which combined user mobility information built out of users’ check-ins with data about the geographic location of users performed better than geographic data alone. In other words, we can indeed gain incredibly valuable insights into user retail preferences from social media data. You can see more details on the exact results of the different models we tried in this recent presentation.

Q - That's really interesting! What are the next steps / where else could this be applied?
A - Not only is this very relevant for retailers, we believe our work can inspire research in various other areas where urban mobility is important. For example, the same way of thinking can be applied to study land prices, the optimal location for opening an office, understanding public use of common spaces, etc.

That would be great! Finally, let's talk a little about the future...

Q - What does the future of Machine Learning look like?
A - As a machine learning practitioner I would be really happy to see a basic machine learning course in a high school program. We have already realized that coding is a must-know tool for a 21st century person and I believe data science is next in line.

Q - Any words of wisdom for Machine Learning students or practitioners starting out?
A - Solve the problems that are worth solving. A quote from a wise man which I strongly second. One would rather start as an amateur in something that is really important, with a potentially groundbreaking impact, than as an expert in something that no one cares about.

Dmytro - Thank you ever so much for your time! Really enjoyed learning more about your background, how human mobility modeling has evolved and what your research has uncovered in terms of applying machine learning to social media data to determine optimal retail store placement. Dmytro can be found online here and on twitter @karamshuk.

Readers, thanks for joining us!

P.S. If you enjoyed this interview and want to learn more about

what it takes to become a data scientist
what skills do I need
what type of work is currently being done in the field

then check out Data Scientists at Work - a collection of 16 interviews with some of the world's most influential and innovative data scientists, who each address all the above and more! :)