How much data do we need? A study on detecting Fake News with Machine Learning

One remarkable convenience we enjoy today is the sheer number of datasets available online, hosted on data hubs such as Kaggle or HuggingFace.

An important problem of the last few years is the propagation of fake news: statements that range from outright lies, to convenient interpretations of the facts, to naive repetitions of common sense.

Well, fortunately, there are many datasets online containing news labeled as either fake or true. But is it feasible to use Machine Learning to detect fake news?

Single dataset results

Start by downloading a fake news dataset. Run the usual train-test pipeline and check the accuracy with which you can distinguish fake news from real news. Is this accuracy high enough to use your system as a fake news filter in a social media portal?
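
For concreteness, here is a minimal sketch of such a pipeline using scikit-learn. It assumes the dataset is a single CSV file with a text column and a binary label column; the file name and column names below are placeholders for whatever your dataset actually uses.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Placeholder file and column names: adapt to your dataset's schema.
df = pd.read_csv("fake_news.csv")

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"],
    test_size=0.2, random_state=42, stratify=df["label"],
)

# Bag-of-Words features followed by a linear classifier.
pipeline = make_pipeline(
    CountVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, pipeline.predict(X_test)))
```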

Cross-dataset study

Now, download two more fake news datasets. First, run the train-test pipeline evaluation on each of them. After that, proceed to a more interesting experiment:

  • train a classification pipeline on one dataset
  • test the pipeline on another dataset

Do this for every train/test combination of your datasets; a sketch of the full grid follows below. What happens to the results?
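
Here is one way to organize that grid, assuming the three datasets have been saved as CSV files sharing the same text/label schema as before, with labels encoded consistently across them (all file names below are placeholders):

```python
import pandas as pd
from itertools import permutations
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Placeholder file names; each CSV is assumed to have "text" and "label"
# columns, with labels encoded the same way in all three datasets.
datasets = {
    name: pd.read_csv(f"{name}.csv")
    for name in ("dataset_a", "dataset_b", "dataset_c")
}

def make_bow_pipeline():
    # A fresh pipeline per run, so no fitted state leaks between experiments.
    return make_pipeline(
        CountVectorizer(stop_words="english"),
        LogisticRegression(max_iter=1000),
    )

# Cross-dataset grid: train on one dataset, test on a different one.
for train_name, test_name in permutations(datasets, 2):
    model = make_bow_pipeline()
    model.fit(datasets[train_name]["text"], datasets[train_name]["label"])
    predictions = model.predict(datasets[test_name]["text"])
    acc = accuracy_score(datasets[test_name]["label"], predictions)
    print(f"train: {train_name}  test: {test_name}  accuracy: {acc:.3f}")
```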

Do you believe fake news detection with a classifier like ours is reliable?

Keep going

Find out how much each word contributes to the overall classification. Are the most influential words the same in each dataset?
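
For a linear model like the logistic regression in the sketches above, each word's contribution can be read directly from the learned coefficients. A minimal sketch, assuming `pipeline` is the fitted model from the single-dataset experiment:

```python
import numpy as np

# make_pipeline names its steps after the lowercased class names.
vectorizer = pipeline.named_steps["countvectorizer"]
classifier = pipeline.named_steps["logisticregression"]

words = vectorizer.get_feature_names_out()
coefficients = classifier.coef_[0]  # one weight per word in the binary case

order = np.argsort(coefficients)
print("Strongest words for one class:  ", words[order[:10]])
print("Strongest words for the other:  ", words[order[-10:]])
```

Comparing these top-weighted word lists across your datasets should hint at whether the classifiers learn something general about fake news or merely dataset-specific vocabulary.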

Reflect:

  • Why is this happening?
  • Do these results indicate that our detector is reliable?
  • Is it feasible to detect fake news using a Bag-of-Words classifier?