M'interessa: Universitat de Barcelona Postgraduate Capstone Project

What was the project about?

We aimed to build a service that fetches tweets from Twitter and filters them, so that the user sees only the tweets least likely to be spam or irrelevant.

The tweets were already prefiltered:

Sampled stream: The Twitter connection used the public sample stream (roughly a 1% sample of all tweets worldwide).
Language filter: We filtered out tweets in languages other than English and Spanish.
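The language filter can be illustrated with a minimal Python sketch. It assumes each line of the stream is a JSON tweet carrying a `lang` field, as in the Twitter streaming payload; the names `ALLOWED_LANGUAGES` and `keep_tweet` are hypothetical and are not taken from the project code.

```python
import json

# Assumption: each tweet is a JSON object with a "lang" field holding a
# language tag such as "en" or "es".
ALLOWED_LANGUAGES = {"en", "es"}


def keep_tweet(raw_line: str) -> bool:
    """Return True if the line is a tweet written in an allowed language."""
    try:
        tweet = json.loads(raw_line)
    except json.JSONDecodeError:
        return False  # skip keep-alive blank lines and malformed payloads
    if not isinstance(tweet, dict):
        return False
    return tweet.get("lang") in ALLOWED_LANGUAGES


# Hypothetical sample lines standing in for the sampled stream.
sample = [
    '{"id": 1, "lang": "en", "text": "hello world"}',
    '{"id": 2, "lang": "fr", "text": "bonjour"}',
    '{"id": 3, "lang": "es", "text": "hola"}',
    "",
]
kept = [json.loads(line)["id"] for line in sample if keep_tweet(line)]
print(kept)  # [1, 3]
```

Filtering this early keeps the later, more expensive steps from processing tweets that would be discarded anyway.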

After that:

Near-duplicate filter: Tweets that looked "too similar" to previous tweets were considered near-duplicates. Locality-Sensitive Hashing and MinHash techniques were used for that purpose.
Relevance classifier: This model was indirectly supervised by the user, who selected or discarded tweets. Vowpal Wabbit and its active learning approaches were used.
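To give an idea of how the near-duplicate filter works, here is a minimal sketch of MinHash signatures combined with LSH banding, written with the Python standard library only. It is not the project's implementation and does not use the datasketch API; the signature length, number of bands, threshold and all function and class names are assumptions chosen for illustration.

```python
import hashlib
import re

NUM_HASHES = 64   # hypothetical signature length
BANDS = 16        # hypothetical: 16 bands of 4 rows each
THRESHOLD = 0.8   # hypothetical similarity cut-off


def shingles(text, k=3):
    """Return the set of word k-shingles of a lower-cased tweet."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash(items):
    """MinHash signature: the minimum of each seeded hash over the set."""
    signature = []
    for seed in range(NUM_HASHES):
        salt = str(seed).encode()
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def similarity(sig_a, sig_b):
    """Fraction of equal positions, an estimate of the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


class NearDuplicateFilter:
    def __init__(self):
        self.buckets = {}  # (band index, band values) -> stored signatures

    def is_duplicate(self, text):
        sig = minhash(shingles(text))
        rows = NUM_HASHES // BANDS
        keys = [(b, tuple(sig[b * rows:(b + 1) * rows])) for b in range(BANDS)]
        # Only signatures sharing at least one band are compared (LSH).
        candidates = [c for key in keys for c in self.buckets.get(key, [])]
        duplicate = any(similarity(sig, c) >= THRESHOLD for c in candidates)
        if not duplicate:
            for key in keys:
                self.buckets.setdefault(key, []).append(sig)
        return duplicate


dedup = NearDuplicateFilter()
first = dedup.is_duplicate("Breaking news: heavy rain floods the city centre tonight")
repeat = dedup.is_duplicate("BREAKING NEWS heavy rain floods the city centre tonight!!")
other = dedup.is_duplicate("Barcelona wins the league after a dramatic final match")
print(first, repeat, other)  # False True False
```

Note that the bucket index in this sketch grows with every new non-duplicate tweet, so a long-running stream would need some form of eviction.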

What tools did we use?

Programming languages and modelling

Scala as the programming language for the Twitter stream ingestion.
Python toolset (including datasketch and scikit-learn, among others).
Vowpal Wabbit for online machine learning and active learning.

Infrastructure

Apache Kafka as a queue manager/stream aggregator
Docker and docker-compose for the containerization

UI / Data visualization

Jekyll and DC.js for the website development
MEAN stack for the UI (the Twitter filter assistant, not the website)

Which were the main challenges?

Technology

Apache Kafka proved to be a tricky tool, especially when connecting to it from Python and NodeJS. Docker and docker-compose also caused issues related to the local caching of images, volumes and containers.

Pipeline

The data pipeline going from the Twitter stream to MongoDB and the AngularJS UI had some issues, especially regarding the connection to the Vowpal Wabbit container through a socket. Also, memory issues appeared as the near-duplicate filter grew.

Modelling

Vectorizing the tweets and implementing an online learning model proved both challenging and productive, as we could incrementally add labeled tweets to the model.
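The idea of vectorizing a tweet and updating a model one labeled tweet at a time can be sketched as follows. This is a pure-Python stand-in using feature hashing and a logistic regression trained by stochastic gradient descent; it is not Vowpal Wabbit and not the project's code, and the feature space size, learning rate and example tweets are assumptions.

```python
import math
import re
import zlib

DIM = 1 << 18  # hypothetical size of the hashed feature space


def vectorize(text):
    """Map each lower-cased token to a hashed index (the hashing trick)."""
    counts = {}
    for token in re.findall(r"\w+", text.lower()):
        index = zlib.crc32(token.encode()) % DIM
        counts[index] = counts.get(index, 0.0) + 1.0
    return counts


class OnlineLogisticRegression:
    def __init__(self, learning_rate=0.5):
        self.weights = {}  # sparse weights: hashed index -> value
        self.learning_rate = learning_rate

    def predict(self, features):
        """Probability that the tweet is relevant."""
        score = sum(self.weights.get(i, 0.0) * v for i, v in features.items())
        score = max(-30.0, min(30.0, score))  # avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-score))

    def learn(self, features, label):
        """One SGD step on the logistic loss; label is 1 (keep) or 0 (discard)."""
        error = self.predict(features) - label
        for i, v in features.items():
            self.weights[i] = self.weights.get(i, 0.0) - self.learning_rate * error * v


model = OnlineLogisticRegression()
# Hypothetical user feedback arriving one tweet at a time.
feedback = [
    ("win a free iphone click here", 0),
    ("new paper about online learning", 1),
]
for text, label in feedback:
    model.learn(vectorize(text), label)

p_spam = model.predict(vectorize("free iphone"))
p_good = model.predict(vectorize("online learning paper"))
```

Because each update touches only the weights of the tokens in the current tweet, the model can keep learning while the stream is running, with no retraining from scratch.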

How did the supervised modelling perform?

We selected a dataset of about 50k tweets, which included some specifically selected topics.

We then configured a script to perform a grid search with Vowpal Wabbit over the model parameters.
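A grid search script of this kind could look roughly like the sketch below, which only builds the `vw` command lines and does not run them. The flags are Vowpal Wabbit command-line options as we understand them; the parameter values, the file name `train.vw` and the function name `grid_commands` are hypothetical and do not reproduce the grid actually used in the project.

```python
import itertools
import shlex

# Hypothetical parameter values, chosen only to illustrate the idea.
LOSS_FUNCTIONS = ["logistic", "hinge"]
LEARNING_RATES = [0.1, 0.5, 1.0]
DECAYS = [0.9, 1.0]
NGRAMS = [1, 2, 3]
SKIPS = [0, 1]


def grid_commands(train_file="train.vw"):
    """Yield one vw command line per parameter combination."""
    grid = itertools.product(LOSS_FUNCTIONS, LEARNING_RATES, DECAYS, NGRAMS, SKIPS)
    for loss, rate, decay, ngram, skips in grid:
        if ngram == 1 and skips > 0:
            continue  # skips only make sense together with n-grams
        args = [
            "vw", "-d", train_file,
            "--loss_function", loss,
            "--learning_rate", str(rate),
            "--decay_learning_rate", str(decay),
        ]
        if ngram > 1:
            args += ["--ngram", str(ngram)]
            if skips > 0:
                args += ["--skips", str(skips)]
        yield shlex.join(args)


commands = list(grid_commands())
print(len(commands))  # 60
print(commands[0])
```

Each command would then be run and scored on held-out tweets, giving one row of parameters and scores per combination.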

The grid search produced approximately 16k combinations of parameters and scores, which are visualized below.

To filter the data and find the best model for your needs:

Select a range of values in the model score charts below (the score we chose for our model was the F-score).
Check which models and parameters fit the data best.

In our case, it was a neural network with 1 layer (which is almost the same as a logistic regression).
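The selection steps above can also be sketched in code: compute the scores for each run, keep the runs whose F-score falls in the chosen range, and look at their parameters. The confusion counts and parameter values below are made up for illustration and are not results from the project.

```python
def scores(tp, fp, fn, tn):
    """Precision, recall, accuracy and F-measure from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f_measure": f_measure,
    }


# Hypothetical grid search results: parameters plus confusion counts.
runs = [
    {"loss": "logistic", "ngram": 2, "tp": 80, "fp": 20, "fn": 20, "tn": 80},
    {"loss": "hinge", "ngram": 1, "tp": 90, "fp": 60, "fn": 10, "tn": 40},
    {"loss": "logistic", "ngram": 1, "tp": 50, "fp": 10, "fn": 50, "tn": 90},
]

scored = [
    {**run, **scores(run["tp"], run["fp"], run["fn"], run["tn"])}
    for run in runs
]

# Keep the runs in the chosen F-score range, best first.
selected = sorted(
    (run for run in scored if run["f_measure"] >= 0.7),
    key=lambda run: run["f_measure"],
    reverse=True,
)
best = selected[0]
```

With these made-up counts, the first run has precision, recall and F-measure all equal to 0.8 and comes out on top.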

[Interactive charts. Parameters: Loss Function, Algorithm, Decay, Ngram, Skip, Learning Rate, Pred Threshold. Scores: Precision, Recall, Accuracy, F-measure.]