We aimed to build a service that fetches tweets from Twitter and filters them, so that the user only sees the least spammy/irrelevant tweets.
The tweets were already prefiltered:
After that:
Apache Kafka turned out to be a very tricky tool, especially when connecting from/to Python and NodeJS. Docker/docker-compose also raised issues caused by the local caching of images, volumes and containers.
The data pipeline coming from a Twitter Stream into MongoDB and an AngularJS UI had some issues, especially regarding the connection to the Vowpal Wabbit container through a socket. Memory issues also appeared when the near-duplicates filter started to grow.
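The project's actual near-duplicates filter is not shown here, but the memory-growth problem it hit can be avoided by bounding the filter. Below is a minimal, hypothetical sketch: it keeps only a fixed-size LRU set of tweet fingerprints, so memory stays constant no matter how long the stream runs (the class name, normalization, and size limit are illustrative assumptions, not the project's code).

```python
from collections import OrderedDict

class BoundedNearDuplicateFilter:
    """Keeps a fixed-size LRU set of tweet fingerprints so the
    filter's memory footprint cannot grow without bound."""

    def __init__(self, max_size=100_000):
        self.max_size = max_size
        self._seen = OrderedDict()  # fingerprint -> None, in LRU order

    @staticmethod
    def _fingerprint(text):
        # Normalize aggressively so trivially edited copies collide:
        # lowercase, keep alphanumeric words only, sort them.
        words = sorted(w for w in text.lower().split() if w.isalnum())
        return hash(" ".join(words))

    def is_duplicate(self, text):
        fp = self._fingerprint(text)
        if fp in self._seen:
            self._seen.move_to_end(fp)  # refresh LRU position
            return True
        self._seen[fp] = None
        if len(self._seen) > self.max_size:
            self._seen.popitem(last=False)  # evict oldest fingerprint
        return False
```

For example, `is_duplicate("Buy now cheap")` followed by `is_duplicate("cheap buy NOW")` returns `False` then `True`, since both normalize to the same fingerprint. The trade-off of the eviction is that very old tweets can reappear as "new", which is usually acceptable for a stream.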
The vectorization of the tweets and the implementation of an online learning model proved to be both challenging and quite productive, as we could incrementally add labeled tweets to the model.
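The online-learning loop described above can be sketched in a few lines. This is not the Vowpal Wabbit setup the project used, but an illustrative stand-in with the same two ingredients: feature hashing to vectorize each tweet into a fixed-size space (as VW does), and a logistic model updated one labeled example at a time. All names, the hash dimension, and the learning rate are assumptions for the sketch.

```python
import math
import re

HASH_DIM = 2 ** 18  # fixed hashed feature space, similar to VW's default -b 18

def vectorize(text):
    """Hash each token into a fixed-size sparse vector (feature hashing)."""
    vec = {}
    for token in re.findall(r"\w+", text.lower()):
        idx = hash(token) % HASH_DIM
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec

class OnlineLogisticRegression:
    """Tiny SGD logistic regression that learns one labeled example
    at a time, mimicking an online training loop like VW's."""

    def __init__(self, lr=0.5):
        self.lr = lr
        self.w = {}  # sparse weight vector

    def predict_proba(self, vec):
        z = sum(self.w.get(i, 0.0) * x for i, x in vec.items())
        return 1.0 / (1.0 + math.exp(-z))

    def learn(self, text, label):  # label: 1 = spam, 0 = relevant
        vec = vectorize(text)
        err = self.predict_proba(vec) - label
        for i, x in vec.items():
            self.w[i] = self.w.get(i, 0.0) - self.lr * err * x
```

The key property is that `learn` can be called whenever a new labeled tweet arrives, so the model improves incrementally without retraining from scratch.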
We selected a dataset of about 50k tweets covering a number of specifically selected topics.
We then configured a script to perform a grid search with Vowpal Wabbit over the model parameters.
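A grid search like this boils down to enumerating the Cartesian product of parameter values and emitting one VW command line per combination. The sketch below shows the idea; the grid values and file name are hypothetical, not the ones the project actually searched (those are not listed in this report).

```python
import itertools

# Hypothetical parameter grid -- the actual values searched in the
# project are not shown here.
PARAM_GRID = {
    "--learning_rate": [0.1, 0.5, 1.0],
    "--l2": [0.0, 1e-6],
    "--ngram": [1, 2],
    "--bit_precision": [18, 24],
}

def vw_commands(grid, train_file="tweets.vw"):
    """Yield one Vowpal Wabbit command line per parameter combination."""
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        flags = " ".join(f"{k} {v}" for k, v in zip(keys, values))
        yield f"vw {train_file} --loss_function logistic {flags}"
```

This toy grid already produces 3 × 2 × 2 × 2 = 24 commands; a richer grid over more parameters is how the run count quickly reaches the thousands reported below.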
All those parameter combinations yielded approximately 16k test runs and scores, which are visualized below.
In order to filter the data and find the best model for your needs, you should: