Our core product is a JSON News API: we enable you to find relevant news
articles based on your query parameters, such as keyword, language, country,
news website, etc.In order to find relevant results only - among the more than a million news
articles that we index daily - the API makes a call to
ElasticSearch clusters.ElasticSearch is a powerful tool to index text data.But, the REST API itself is just the tip of the iceberg: the bulk of the work is
to bring structured news articles into the ElasticSearch cluster.
News gets published online every second. To keep up, we have to constantly
monitor news websites and check for updates. So we crawl the web. We don’t crawl
the entire web like Google does, just the news parts of it.We have a list of over 60,000 news websites that produce new articles (new data
points).Our web crawlers constantly find new web pages. The next step would be to check
whether these pages have been processed already. Then, our algorithms decide
whether this page is a news article page or not.At the end of this process, we have a stream of HTML pages of news articles that
were recently published online.
Part II: Data extraction (How we turn unstructured text into structured data)
Now, we have to turn the HTML page into a source of structured data, such as
title, article body, published time, etc.This is the most advanced and complicated part of what we do, as we have to
write a generic news parser: it has to extract data fields from any news
website in any language.Another more obvious option would be to write a dedicated web scraper for each
news website. But it becomes a much more difficult task when you have ~100,000
news websites to get news articles from.
Now, each news article is structured in the following JSON format:
Copy
Ask AI
{ "title": "Tesla Model X vs Tesla Model Y: which Tesla SUV should you buy?", "author": "Rob Clymo", "published_date": "2021-07-26 16:42:51", "published_date_precision": "full", "link": "https://www.techradar.com/news/tesla-model-x-vs-tesla-model-y", "clean_url": "techradar.com", "excerpt": "What to look for if you're in the market a Tesla SUV", "summary": "Tesla is gradually filling every spot in the current carbuying marketplace. There's even a budget level compact on the way that'll open up the brand to more customers. It's also establishing itself in the burgeoning SUV sector with models in the shape of the older Tesla Model X and the new Tesla Model Y. Both offer the classic SUV experience while blending it with Tesla's unique styling, dazzling performance and lots of innovation.With demand for SUV's still high, the Tesla Model X is perfect fo", "rights": "techradar.com", "rank": 640, "topic": "tech", "country": "US", "language": "en", "authors": ["Rob Clymo"], "media": "https://cdn.mos.cms.futurecdn.net/5BMZkUwoQhz64NSM6a89PG-1200-80.jpg", "is_opinion": false, "twitter_account": "@TechRadar"}
We may have over 1.5M such articles daily, so we’d need the most efficient tool
to index this data.We use ElasticSearch for this purpose. It enables our NEWS API to respond to
your advanced queries in less than 300ms.