A look back from 2020: Alright, this project now makes me cringe. I have learned a lot since then, and I would take a much better approach to feature selection, hyperparameter optimization, cross-validation, and reporting (basically the whole machine learning part of it). Still, I am not removing it since it documents my journey.
I had planned to write this post a few months ago, but I suddenly became a research assistant and got very busy, so I could not find the time. Last semester, I took the "Machine Learning for Multimedia Informatics" course, and we were free to pick our own topic for a machine learning application as our course project. Since I had tried to gather some data from Tinder and use it before taking the course, I wanted to build something on top of that, but I was later directed towards different topics and chose to create something related to Twitter.
The basic idea is to predict the popularity of a given tweet even before it gets tweeted. In the literature, popularity is mostly measured by the retweet count, but most people do not get retweeted much, and that measure does not apply to private accounts. Since I wanted to create something that could potentially be useful for any Twitter user, I decided to measure the number of favorites (likes) instead. I believe this is a more stable metric because people are reluctant to retweet an ordinary user but are more generous with favorites, maybe because favorites do not appear alongside a user's own tweets (like retweets do) in their main profile feed.
Collecting the data
To train my model(s), I needed to collect my own dataset. I used the Twitter Streaming API to randomly select people whose tweets are in English and who use Twitter in English (thanks to the API, that information is provided by Twitter). However, Twitter's user base is highly skewed, and I realized that I was not getting enough people with thousands or millions of followers, so I later implemented quota sampling instead of a totally random one. This selection process took about a couple of days, with short intervals to reduce the sampling error. I call these users my "seeds".
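The quota sampling logic can be sketched roughly as follows. The bucket thresholds, quota sizes, and the shape of the stream are assumptions for illustration, not the values I actually used:

```python
import random

# Hypothetical follower-count buckets and per-bucket quotas; the real
# thresholds and quota sizes were different (these are illustrative).
BUCKETS = [(0, 1_000), (1_000, 100_000), (100_000, float("inf"))]
QUOTAS = {0: 3, 1: 2, 2: 1}  # seeds wanted per bucket

def bucket_of(followers):
    """Return the index of the follower-count bucket a user falls into."""
    for i, (lo, hi) in enumerate(BUCKETS):
        if lo <= followers < hi:
            return i
    raise ValueError(followers)

def quota_sample(stream):
    """Accept users from a stream until every bucket's quota is filled."""
    counts = {i: 0 for i in QUOTAS}
    seeds = []
    for user in stream:
        b = bucket_of(user["followers"])
        if counts[b] < QUOTAS[b]:
            counts[b] += 1
            seeds.append(user)
        if all(counts[i] >= QUOTAS[i] for i in QUOTAS):
            break
    return seeds

# Simulated stream skewed toward low-follower users, like Twitter's user base.
random.seed(0)
stream = ({"followers": random.choice([50, 500, 5_000, 2_000_000])}
          for _ in range(10_000))
seeds = quota_sample(stream)
print(len(seeds))  # 6, once every quota is filled
```

The point of the quotas is exactly the skew problem described above: without them, a random stream almost never yields accounts with millions of followers.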
After collecting the seeds, I collected their tweets. I also implemented a breadth-first algorithm to fetch the tweets of the seeds' followers in case the seed set did not contain enough users (it turned out it did). I collected more than 300,000 tweets. While collecting them, I applied several other filtering mechanisms. For example, I checked whether a tweet was a retweet (embedded or not), too new (too early to measure its popularity), too old (since we do not know the user's follower count, etc. at that time), or written in another language. Based on my literature review and my assumptions, I collected these features (in the final iteration) for each tweet:
- Flesch Reading Ease score
- Mention count
- Attached media existence
- Day of week
- Time of day (in seconds)
- "Verified" badge existence
- Tweet count
- Following count
- Follower count
- Follower/tweet ratio
- Follower/followee ratio
- Dominant emotion
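Most of these features can be derived from a tweet and its author's public counts. A minimal sketch, where the field names and input shapes are assumptions for illustration:

```python
from datetime import datetime, timezone

def extract_features(tweet, user):
    """Build a feature dict for one tweet; field names are illustrative."""
    created = tweet["created_at"]  # a timezone-aware datetime
    midnight = created.replace(hour=0, minute=0, second=0, microsecond=0)
    return {
        "mention_count": tweet["text"].count("@"),
        "has_media": bool(tweet.get("media")),
        "day_of_week": created.weekday(),  # 0 = Monday
        "time_of_day": int((created - midnight).total_seconds()),
        "verified": int(user["verified"]),
        "tweet_count": user["tweets"],
        "following_count": user["following"],
        "follower_count": user["followers"],
        # Guard against division by zero for brand-new accounts.
        "follower_tweet_ratio": user["followers"] / max(user["tweets"], 1),
        "follower_followee_ratio": user["followers"] / max(user["following"], 1),
    }

tweet = {"text": "Hello @world", "media": [],
         "created_at": datetime(2018, 3, 14, 15, 9, 26, tzinfo=timezone.utc)}
user = {"verified": False, "tweets": 1200, "following": 300, "followers": 4500}
features = extract_features(tweet, user)
print(features["time_of_day"], features["follower_followee_ratio"])  # 54566 15.0
```

The readability score and the emotion feature need more work than this (a readability formula and the lexicon lookup described below), so they are left out of the sketch.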
As you can imagine, some of the features require processing, especially the "emotion" feature. Before using the tweet content, I discarded mentions and emojis. To detect the dominant emotion, I used EmoLex, an emotion lexicon created by Saif M. Mohammad and Peter D. Turney at the National Research Council Canada. I am grateful to them for letting me use it in my project, and here are the related papers:
Saif Mohammad and Peter Turney. Crowdsourcing a Word-Emotion Association Lexicon. Computational Intelligence, 29(3): 436-465, 2013. Wiley Blackwell Publishing Ltd.
Saif Mohammad and Peter Turney. Emotions Evoked by Common Words and Phrases: Using Mechanical Turk to Create an Emotion Lexicon. In Proceedings of the NAACL-HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, June 2010, LA, California.
EmoLex is based on Robert Plutchik's Wheel of Emotions, so it is essentially a word list including words associated with anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. I believe it also includes the associated sentiment, but I did not use that. Since the original lexicon is in English, I limited my project to tweets written in English. For each tweet, after tokenization, lemmatization, and negation handling, my code counts the occurrences of these emotions to find the dominant one. If multiple emotions are equally dominant, it randomly chooses one. If no emotion is found, the tweet is coded with 0 (None). I included emotion in my project because, as previous studies suggest, certain emotions such as anger can make online content more popular. Here is a relevant YouTube video. It also served as language-processing practice for me.
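The counting step can be sketched like this. The tiny lexicon below is a stand-in: the real EmoLex maps thousands of words to Plutchik's eight emotions, and the tokenization here is deliberately cruder than the NLTK/TextBlob pipeline I actually used:

```python
import random
import re

# Toy stand-in for EmoLex; these few entries are assumptions for illustration.
LEXICON = {
    "angry": {"anger"},
    "happy": {"joy"},
    "terrified": {"fear"},
    "trust": {"trust"},
}
NEGATORS = {"not", "no", "never", "n't"}

def dominant_emotion(text, rng=random):
    """Count lexicon emotions in a tweet, skipping negated words,
    and return the dominant one (0 means no emotion found)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = {}
    for i, tok in enumerate(tokens):
        if i > 0 and tokens[i - 1] in NEGATORS:
            continue  # crude negation handling: ignore a negated emotion word
        for emotion in LEXICON.get(tok, ()):
            counts[emotion] = counts.get(emotion, 0) + 1
    if not counts:
        return 0
    best = max(counts.values())
    # Ties are broken randomly, as in the original approach.
    return rng.choice(sorted(e for e, c in counts.items() if c == best))

print(dominant_emotion("I am not angry, I am happy"))  # joy
print(dominant_emotion("Nothing emotional here"))      # 0
```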
However, automatic emotion recognition has certain complications. Since the focus was not on language processing and this was just an introductory course project, my approach was very simple. I handled negations, but detecting sarcasm is not that simple. Also, words like "sick" and "crazy" can be both positive and negative, which requires context, but my code just counts the words (unless there is a negation). Negation handling, by the way, relies on correct grammar, which might not be the case on social media. Moreover, the actual content might not be related to a word used at the beginning of a chunk of text, but I believe that was not very important in this case due to the 140-character (now 280-character) nature of Twitter. For language processing, I used the NLTK and TextBlob libraries.
I must admit that some of the features were not used at all, while others were used only conditionally.
I decided to use scikit-learn's random forest ensemble, but there were some problems. 300,000 tweets were just too many, and most of the data were not even useful when making a prediction for a specific Twitter user. The random forest implementation also tries to keep every tree in memory, which was causing problems. So I implemented my own memory-efficient random forest with adaptive parameters, built on scikit-learn's CART models.
When the system gets a prediction request, it filters the relatively gigantic dataset to find similar data points. The relevant data points (tweets) are then used to train the decision trees. Tree parameters such as depth and training features are also adapted to the requested user's account characteristics. For example, I realized that "emotion" provides no information when it comes to celebrity accounts. It looks like people do not care much about the content if the author of a tweet is a celebrity they love. Therefore, that feature is not considered for celebrities, which improves both the performance and the efficiency. After a tree is created and trained, the popularity of the requested tweet is predicted right away and added to a list, and the tree is destroyed before the next tree in the forest is created. In the end, the listed results are averaged. Using this method, I believe my application can produce much more believable results in an acceptable time frame. Even my 9-year-old laptop can run it. Obviously, a real solution would be distributed and big-data-oriented. This semester I am taking a big data course and learning the basics of Hadoop, Spark, etc. I think using those tools for this problem would be a better approach.
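The filter-then-train-one-tree-at-a-time idea can be sketched as follows. The similarity filter, synthetic data, and parameter values here are assumptions for illustration; the real system adapted depth and features per account type as described above:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def predict_favorites(dataset, targets, query, n_trees=25, depth=8, seed=0):
    """Sequential random forest: filter to similar rows, then train,
    use, and discard one tree at a time so only one tree is in memory."""
    rng = np.random.default_rng(seed)
    # Crude similarity filter (an assumption): keep rows whose follower
    # count (last column here) is within one order of magnitude of the query.
    ratio = dataset[:, -1] / max(query[-1], 1)
    mask = (ratio > 0.1) & (ratio < 10)
    similar, y = dataset[mask], targets[mask]
    predictions = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(similar), len(similar))  # bootstrap sample
        tree = DecisionTreeRegressor(max_depth=depth,
                                     random_state=int(rng.integers(1 << 30)))
        tree.fit(similar[idx], y[idx])
        predictions.append(tree.predict([query])[0])
        del tree  # the tree is discarded before the next one is built
    return float(np.mean(predictions))

# Synthetic data: favorites roughly proportional to follower count.
gen = np.random.default_rng(1)
X = gen.uniform(0, 1, (500, 3))
X[:, -1] = gen.uniform(100, 10_000, 500)       # follower count
y = X[:, -1] * 0.01 + gen.normal(0, 1, 500)    # ~1 favorite per 100 followers
estimate = predict_favorites(X, y, query=np.array([0.5, 0.5, 2_000.0]))
```

Because each bootstrapped tree is built, queried, and destroyed in turn, peak memory stays at one tree regardless of the forest size, which is the whole point of the approach.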
Web development and user interface
I think machine learning feels like magic to many people. Since my application is about prediction and popularity, I decided to go with a "Gypsy fortune teller" theme. I believe hiding the science behind the application and creating a more fun, visual setting might attract ordinary people (another example I can think of is Akinator). After looking at some examples, I developed a minimalistic design vision (I had very little time for polishing). An obvious design decision was to use a magical sphere to display the predictions.
Since I was using Python, I used Flask to create my minimal one-page web application. When a user requests a prediction, the system pulls the requested username's public information (follower/followee/tweet counts, so that it can work with any account owned by anyone) and combines it with the tweet content/context sent via AJAX. While the user waits for the prediction, the system pulls the related Twitter user's profile image to animate it inside the magical sphere. I think this provides a customized feel and stalls the user in an entertaining way. After the averaged prediction is ready, the response is sent back and the number is displayed in the sphere.
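The request/response cycle can be sketched as a minimal Flask endpoint. The route name, payload fields, and the `predict_popularity` placeholder (and its `42`) are hypothetical; the real handler ran the per-request forest described earlier:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_popularity(username, text):
    """Stand-in for the real pipeline: pull the user's public counts,
    extract features, run the per-request forest. Hypothetical here."""
    return {"favorites": 42}

@app.route("/predict", methods=["POST"])
def predict():
    """Receive the tweet draft sent with AJAX and return the prediction."""
    payload = request.get_json()
    if not payload or "username" not in payload or "text" not in payload:
        return jsonify({"error": "username and text are required"}), 400
    result = predict_popularity(payload["username"], payload["text"])
    return jsonify(result)

# Exercising the endpoint with Flask's built-in test client:
client = app.test_client()
response = client.post("/predict", json={"username": "someone", "text": "Hello!"})
print(response.get_json())  # {'favorites': 42}
```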
As a generalist, I think the most fun part of designing the interface was creating the magical sphere using Cinema 4D, Adobe Photoshop, Adobe After Effects, and CSS animations. After modeling the scene, I took some layered renders and played with them in Photoshop. Using CSS tricks, I carefully overlapped and aligned these layers. For the magical particles, I found a CC0 video that could be cut so that it loops seamlessly. I cut the video, changed the hue to match Twitter's color scheme, and cropped it before re-rendering it in Adobe After Effects. Between those static layers, I placed this video and animated it to rotate, which I believe makes the sphere look much better.
I could not get the layers' blending modes to work, so I could not achieve the results I had achieved in Photoshop, but I think it was not that bad. I combined the sphere with a font I found and Bootstrap. I had to learn to overlap the elements and nudge everything a bit as I went so that the sphere would fit, which required some hacking and somewhat broke the responsive nature of the page. If I ever make this tool public, I want to make it fully responsive first.
Then, I added the text layer that will be updated by the AJAX response. I also added some extra CSS/jQuery to format and adjust the prediction text automatically. After adding other details and error/warning cases (for example, when the input is not written in English or it includes a mention, etc.), it was complete.
I actually learned more about the Twitter API and collecting and processing the data than about machine learning itself, but that is not very surprising, considering a big portion of most machine learning problems is dealing with the data. Nowadays, I heavily use my web scraping skills for professional purposes, so I am glad I decided to collect my own data instead of finding an existing dataset. I also realized that theoretically bad decisions can improve the results in practice. I read Soroush Vosoughi's PhD dissertation during the process (I read a lot more, but this one was my favorite) and I really liked it. I would very much like to write a thesis on a similar topic and develop an actual web application open to everyone as an extension of that thesis.