A few years ago, I started working for Edifixio. It was a very nice place to work. They design and manage cloud solutions for big companies, and also integrate SalesForce into their clients’ workflows. Initially, my job was to fix an app they had been developing internally, whose goal was to help predict how social media would respond to a post before publishing it.
Their initial approach was hugely problematic. They used proprietary software called SPSS, made by IBM, running on a Windows machine in AWS, and they interacted with it through batch scripts. Their main issue was extracting information from the images attached to social media posts. They also ran into Facebook’s rate limits, as they fetched the data in real time without any caching.
After understanding that, I started pitching the idea of a “data lake”. At the time, this wasn’t as easy as it is today: there was no one-button solution for deploying a data lake on AWS. CloudFormation existed, of course, but I had to write everything from scratch. The plan was to build a “data fetcher” which would scrape the Facebook API asynchronously to get the requested posts, then store them in a database.
Any old database would’ve worked, but I wanted the best system for the job. I attended a seminar at AWS’ headquarters in Paris where an AWS expert came to talk about the inner workings of DynamoDB. DynamoDB is a NoSQL database system which is really efficient on one, and only one, condition: you have to know what you want to do with your data. The advantage of SQL is its flexibility. You want a weird view of the data? Just JOIN this and that and you’re good. The trade-off is that you sacrifice scalability and speed, since you need a big request processor.
On the other hand, with DynamoDB, you don’t have this flexibility of choosing how to access the data after inserting it. However, if you know in advance what you’ll request, you can make the whole database infinitely scalable, fast, fault tolerant, and so on. I won’t go into the HASH key and the SORT key (aka RANGE key) here, but understanding them is crucial: they are what the decision to use DynamoDB hinges on. Needless to say, in my case, DynamoDB was just perfect. I knew exactly how I would access posts: page name as the HASH key, timestamp as the SORT key.
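To make that access pattern concrete, here is a minimal sketch of what the corresponding DynamoDB Query parameters look like. The table and attribute names are illustrative, not the real schema; with boto3 you would pass this dict straight to `client.query(**params)`.

```python
def page_window_query(page, start_ts, end_ts):
    """All posts of one page in a time window: a single Query, no table scan."""
    return {
        "TableName": "facebook_posts",  # hypothetical table name
        # page_name is the HASH key, created_at the SORT (RANGE) key,
        # so DynamoDB can jump straight to the right partition and range.
        "KeyConditionExpression":
            "page_name = :p AND created_at BETWEEN :t0 AND :t1",
        "ExpressionAttributeValues": {
            ":p": {"S": page},
            ":t0": {"N": str(start_ts)},
            ":t1": {"N": str(end_ts)},
        },
    }

params = page_window_query("some_brand", 1546300800, 1548979200)
```

Because the page name and timestamp fully determine the lookup, this query stays fast no matter how many pages or posts the table holds.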
Designing the scraper was easily the most arduous part of this whole thing. First off, the Facebook API is an ever-changing mess. If you’re not using it for the simplest of simple use cases, you had better have a whole team dedicated to keeping things working. The Q&A forums are outdated within a few months. The libraries are few and far between. The undocumented behaviours are legion. It really is a trial by fire for your debugging capabilities. Of course, to make matters worse, it’s not as if this scraper could run locally. No no no, this runs in the Cloud, i.e. further away than any sane step-by-step debugger has ever been. Forget your import pdb; pdb.set_trace(), forget your remote debuggers. The one thing you had better have is a solid logging system.
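As a sketch of what “solid logging” can mean in practice: emit one JSON object per line, so the logs are trivial to grep and cloud log collectors (CloudWatch and friends) can index the fields. The service name and messages below are made up for illustration.

```python
import json
import logging
import sys

def make_logger(service, stream=None):
    """Build a logger that writes one JSON object per line to `stream`
    (stdout by default, which is where cloud runtimes collect logs)."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the root logger out of it
    handler = logging.StreamHandler(stream or sys.stdout)

    class JsonFormatter(logging.Formatter):
        def format(self, record):
            return json.dumps({
                "service": service,
                "level": record.levelname,
                "msg": record.getMessage(),
            })

    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger

log = make_logger("fb-scraper")  # hypothetical service name
log.info("fetched %d posts from %s", 20, "some_brand")
```

When the scraper misbehaves at 3 a.m. in a region you can’t attach to, lines like these are the only debugger you get.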
In a Cloud environment, it’s essential to correctly subdivide the application into “microservices”. The Facebook scraper was one such microservice. In the few years I worked on the project, I wrote three versions of it: the first two in Python, the latest in Java. I never looked back after switching away from Python for this kind of thing. A typed language is the only path to sanity in this world.
If I had to redo this whole ordeal from the beginning, I would never even have tried to design my own scraper. I can guarantee anyone that paying for a good scraper by the request is worth every single penny.
After developing my local prototype of a Facebook post scraper, I used the little data I had to show what analyses were possible. As I mentioned, when I joined the project, they were using SPSS, a statistics toolbox that works on natural language: it analyses words and can correlate their use with certain metrics. However, Facebook posts are more than just words. To work around that, they used ClarifAI to “describe” images in words. This approach was not only very costly, it also failed to capture subtle nuances. An image is worth a thousand words, as we say. My instinctive response was to use CNNs. During my first meeting, I began explaining to the whole team how a CNN works. Of course, now I realise I should never have gotten into such nitty-gritty details, but that’s what experience is.
I therefore ditched the whole SPSS+ClarifAI stack and instead used transfer learning to extract implicit features from images. I then used dimensionality reduction techniques (t-SNE back then, since UMAP wasn’t a thing yet) to display those features in 2D. This approach proved spectacular: it pivoted the whole project in a far more sustainable and novel direction.
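The pipeline can be sketched in a few lines. A pretrained CNN (e.g. a Keras ResNet50 with its classification head removed) turns each image into a feature vector, and t-SNE maps those vectors to 2D for plotting. In this sketch the CNN step is stubbed with random vectors so the flow stays runnable; the 2048 dimensions mirror a typical ResNet embedding size, an assumption rather than the project’s actual setup.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the transfer-learning step: in the real pipeline each
# row would be a pretrained CNN's embedding of one post's image.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 2048))

# Reduce the high-dimensional embeddings to 2D for the scatter plot.
coords = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=0).fit_transform(features)
# coords[i] is the 2D position of post i: nearby points mean
# the CNN saw similar content in the two images.
```

The key property is that no labels are needed anywhere: the pretrained network’s internal representation does all the work, which is exactly what made the approach so much cheaper than the ClarifAI route.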
Each point on this graph represents a post. When two dots are close together, it means their images share similar content. I can’t tell you explicitly what this content is, but I can tell you the images are similar. That makes it really easy to find content similar to yours, see how it was worded, and what response it got.
I designed the above interface myself using Plotly’s excellent Dash framework. It’s a fully interactive graph: you can hover over points to see the corresponding posts, or select a subset for closer inspection. Of course, it’s also packed with other analyses, like plain old shares-versus-likes graphs.
Hashtag graph network
From the trove of data our efficient pipeline provided (up to 20,000 posts per day, with images, comments, and metrics), I designed many other analyses. Amongst them was this little hashtag (#) and mention (@) visualisation.
On this interactive figure, each dot is either a post (blue dot), a hashtag (red triangle), or a mention (green triangle). When a post uses a hashtag or an @ mention, I link the post’s dot to that hashtag’s or mention’s dot. This forms a graph, in the mathematical sense of the term, with nodes and edges connecting posts together. This visualisation is great for understanding which brand interacted with which other brand or influencer. You can also tell whether a brand did something at the same location as another, since they share the same hashtag (e.g. #fashionweek).
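The structure described above can be sketched with networkx. The post ids and tags below are made up; in the real pipeline they would come out of the scraped data.

```python
import networkx as nx

# Each post maps to the hashtags (#) and mentions (@) it uses.
posts = {
    "post1": ["#fashionweek", "@influencer_a"],
    "post2": ["#fashionweek"],
    "post3": ["@influencer_a", "#sale"],
}

G = nx.Graph()
for post, tags in posts.items():
    G.add_node(post, kind="post")
    for tag in tags:
        kind = "hashtag" if tag.startswith("#") else "mention"
        G.add_node(tag, kind=kind)
        G.add_edge(post, tag)

# Two posts sharing a hashtag end up two hops apart, with the
# hashtag node sitting between them on the shortest path.
path = nx.shortest_path(G, "post1", "post2")
```

Connected components and shortest paths on this graph are what surface the brand-to-brand and brand-to-influencer relationships the visualisation makes visible.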
This graph is fully interactive: you can zoom, pan, and hover over the posts’ dots (the blue ones) to get a preview of each post. You can also select multiple posts to compare them, read their comments, and add them to your favourites.