The ultimate guide to finding news articles published online, using open-source and free tools only. Use it to build a News API alternative.
Build a News API alternative with open-source tools only.
Data scientists and NLP enthusiasts love working with news data because there are many real-world use cases:
While building the NewsCatcher News API, we discovered many open-source and free tools, services, and libraries that help you find and parse news articles published online.
We even published two Python packages that help you work with news data:
While there are a few paid options (including what NewsCatcher does), some non-commercial use cases can be covered by open-source and free options, so you don't have to pay a News API provider.
GDELT analyses news articles published online. It applies Natural Language Processing to understand what is being written in the news worldwide. In addition, its GKG (Global Knowledge Graph) dataset lets you find links to newly published news articles.
Pros:
Cons:
You might think that a list of URLs isn't much, but it's half the job done. For example, you could use the newspaper3k Python package to parse an article from its URL or its HTML.
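To make the "list of URLs" idea concrete, here is a minimal sketch that pulls article links out of a GKG file you have already downloaded. It assumes the file is tab-separated and that the article URL sits in the fifth field; that position is my assumption, so check it against the GDELT codebook. The function name `extract_article_urls` and the sample rows are made up for illustration.

```python
import csv
import io

# Assumption: in a GKG file the article URL is the fifth tab-separated field.
# Verify this against the GDELT codebook before relying on it.
URL_COLUMN = 4


def extract_article_urls(gkg_tsv: str) -> list[str]:
    """Return unique http(s) URLs from the assumed URL column, in file order."""
    csv.field_size_limit(10_000_000)  # GKG rows can contain very long fields
    reader = csv.reader(io.StringIO(gkg_tsv), delimiter="\t", quoting=csv.QUOTE_NONE)
    urls = []
    seen = set()
    for row in reader:
        if len(row) <= URL_COLUMN:
            continue  # skip short or malformed rows
        url = row[URL_COLUMN].strip()
        if url.startswith(("http://", "https://")) and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Synthetic rows that only mimic the assumed layout; not real GDELT data.
sample = (
    "20240101000000-0\t20240101000000\t1\texample.com\thttps://example.com/news/a\n"
    "20240101000000-1\t20240101000000\t1\texample.org\thttps://example.org/news/b\n"
    "20240101000000-2\t20240101000000\t1\texample.com\thttps://example.com/news/a\n"
)
print(extract_article_urls(sample))
```

Each URL returned this way can then be handed to whatever article parser you prefer.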
Common Crawl crawls the web and open-sources every page it finds. It is a non-profit, so I highly encourage you to donate if you end up using its data.
In 2016, Common Crawl decided to decouple the news crawl from its primary dataset. News Crawl uses RSS feeds and news sitemaps to find news articles. This part of the crawl is open-sourced separately.
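A rough sketch of how you might turn a News Crawl file listing into download links. It assumes the crawl is published as WARC archives, that a gzipped listing with one relative path per line is available, and that files are served from `https://data.commoncrawl.org/`. All three are assumptions on my side, so confirm them on the Common Crawl website. The file names in the sample are invented.

```python
import gzip

# Assumption: archives are served from this base URL; confirm before use.
BASE_URL = "https://data.commoncrawl.org/"


def warc_urls_from_listing(listing_gz: bytes, base_url: str = BASE_URL) -> list[str]:
    """Decompress a gzipped path listing and return one full URL per non-empty line."""
    text = gzip.decompress(listing_gz).decode("utf-8")
    return [base_url + line.strip() for line in text.splitlines() if line.strip()]


# Synthetic listing standing in for a real downloaded file (paths are made up).
fake_listing = gzip.compress(
    b"crawl-data/CC-NEWS/2024/01/CC-NEWS-20240101000000-00001.warc.gz\n"
    b"crawl-data/CC-NEWS/2024/01/CC-NEWS-20240101010000-00002.warc.gz\n"
)
for url in warc_urls_from_listing(fake_listing):
    print(url)
```

Reading the archives themselves needs a WARC reader, which is outside this sketch.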
Pros:
Cons:
RSS feeds still exist. Our beta version relied solely on RSS feeds. You can read the full story here:
How we built a news API beta in 60 days
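If you have never worked with RSS directly, here is a minimal sketch of reading a feed with Python's standard library only. It assumes a plain RSS 2.0 document with `item` elements holding `title`, `link`, and `pubDate`; real feeds vary, and a dedicated feed parser handles the edge cases better. The function name `parse_rss_items` and the sample feed are illustrative.

```python
import xml.etree.ElementTree as ET


def parse_rss_items(rss_xml: str) -> list[dict]:
    """Extract title, link, and publication date from every <item> in a feed."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            # findtext returns None for a missing tag, so fall back to ""
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "published": (item.findtext("pubDate") or "").strip(),
        })
    return items


# Hand-written sample feed; in practice you would download the XML first.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First story</title>
      <link>https://example.com/first</link>
      <pubDate>Mon, 01 Jan 2024 08:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second story</title>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>"""

for entry in parse_rss_items(SAMPLE_FEED):
    print(entry["title"], entry["link"])
```

Polling many feeds on a schedule and deduplicating by link is the part that takes real engineering effort; the parsing itself is the easy bit.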
Pros:
Cons:
Google News is the biggest UI-first news aggregator.
Google News has an RSS feed for every UI page. These feeds are lightweight, and you will not get blocked for accessing them many times a day.
We wrote a Python library that helps you parse any Google News RSS page. Even if you are not a Python person, you can use this repository as an unofficial Google News RSS documentation (there is no official one).
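As a small illustration, this sketch builds the URL of a Google News search feed. Since there is no official documentation, the URL pattern (`/rss/search` with `q`, `hl`, `gl`, and `ceid` parameters) is an assumption based on how the site behaves, and it may change without notice. The function name `google_news_search_feed` is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: undocumented base path of the Google News RSS feeds.
GOOGLE_NEWS_RSS = "https://news.google.com/rss"


def google_news_search_feed(query: str, lang: str = "en", country: str = "US") -> str:
    """Build the RSS URL for a Google News search (assumed, undocumented format)."""
    params = {
        "q": query,                    # search terms
        "hl": f"{lang}-{country}",     # interface language
        "gl": country,                 # country edition
        "ceid": f"{country}:{lang}",   # edition identifier
    }
    return f"{GOOGLE_NEWS_RSS}/search?{urlencode(params)}"


print(google_news_search_feed("climate change"))
```

The resulting URL returns ordinary RSS XML, so any RSS parser can read it.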
This list is a good starting point if you'd like to experiment with news data.
Building your News API alternative may teach you more about the subject itself.
Plus, it's a great data engineering exercise.