What was the project about?

We aimed to build a service that fetches tweets from Twitter and filters them, so that the user only sees the less spammy/irrelevant ones.

The tweets were already prefiltered:

  • Sampled stream: The Twitter connection used the public sampled stream, which delivers roughly a 1% sample of all tweets worldwide.
  • Language filter: We discarded tweets in languages other than English and Spanish.

After that:

  • Near-duplicate filter: Tweets that looked "too similar" to previous tweets were considered near-duplicates. We used MinHash and Locality-Sensitive Hashing (LSH) for this purpose.
  • Relevance classifier: This model was indirectly supervised by the user, who selected or discarded tweets. We used Vowpal Wabbit and its active learning approaches.

What tools did we use?

Programming languages and modelling

  • Scala as the programming language for the Twitter stream ingestion.
  • A Python toolset (including, among others, datasketch and scikit-learn).
  • Vowpal Wabbit for online machine learning and active learning.


UI / Data visualization

  • Jekyll and DC.js for the website development.
  • MEAN stack for the UI (the Twitter filter assistant, not the website).

What were the main challenges?


Apache Kafka turned out to be a very tricky tool, especially when connecting to and from Python and Node.js. Docker and docker-compose also raised issues caused by the local caching of images, volumes and containers.


The data pipeline from the Twitter stream to MongoDB and the AngularJS UI had some issues, especially regarding the connection to the Vowpal Wabbit container through a socket. Some memory issues also appeared as the near-duplicate filter's index started to grow.


The vectorization of tweets and the implementation of an online learning model proved to be both challenging and quite productive, as we could incrementally add labeled tweets to the model.

How did the supervised modelling perform?

We selected a dataset of about 50k tweets, which included some specifically selected topics.

We then configured a script to perform a grid search with Vowpal Wabbit over the model parameters.

All those parameter combinations yielded approximately 16k test runs and scores, which are visualized below.
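The grid itself is just a cross product of parameter values, one Vowpal Wabbit run per combination. A minimal sketch of how such a grid can be generated: the flags below (`--loss_function`, `-l`, `--nn`, `--passes`) are real VW options, but the value ranges are illustrative, not the grid we actually swept.

```python
# Sketch: generate one VW command line per parameter combination.
# Flag names are real VW options; the value grids are illustrative only.
import itertools

grid = {
    "--loss_function": ["logistic", "hinge"],
    "-l": [0.1, 0.5, 1.0],   # learning rate
    "--nn": [1, 2, 4],       # hidden units in the single hidden layer
    "--passes": [1, 5],
}

names = list(grid)
combos = list(itertools.product(*(grid[name] for name in names)))
print(len(combos))  # 2 * 3 * 3 * 2 = 36 runs for this toy grid

# First few command lines the script would launch:
for values in combos[:2]:
    args = " ".join(f"{n} {v}" for n, v in zip(names, values))
    print(f"vw train.vw {args}")
```

Each run's scores (precision, recall, F-score) are then collected into the dataset that feeds the interactive visualization.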

In order to filter the data and find the best model for your needs, you should:

  • Select a range of values in the model scores below (the score we chose for our model was the F-score).
  • Check which models and parameters fit the data best.

In our case, the best model was a neural network with 1 layer (which is almost the same as a logistic regression).

[Interactive model-selection charts: Loss Function, Learning Rate, Pred Threshold]