News API
Our core product is a JSON News API: it lets you find relevant news articles using query parameters such as keyword, language, country, and news website. To return only relevant results from the more than one million news articles we index daily, the API queries our Elasticsearch clusters. Elasticsearch is a powerful tool for indexing and searching text data. But the REST API itself is just the tip of the iceberg: the bulk of the work is getting structured news articles into the Elasticsearch cluster.
Part I: Web crawling (How we find news data)
News gets published online every second. To keep up, we have to constantly monitor news websites and check for updates, so we crawl the web. We don’t crawl the entire web like Google does, just the news parts of it. We maintain a list of over 60,000 news websites that produce new articles (new data points). Our web crawlers constantly find new web pages. For each page, we first check whether it has already been processed. Then our algorithms decide whether the page is a news article or not. At the end of this process, we have a stream of HTML pages of recently published news articles.
Part II: Data extraction (How we turn unstructured text into structured data)
Next, we have to turn each HTML page into structured data: title, article body, published time, and so on.
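As a rough illustration of this step, the sketch below pulls a title, a published time, and the body text out of a raw HTML string using only Python's standard library. It is a minimal sketch, not our production extractor: the `ArticleParser` class and `extract_article` function are hypothetical names, and it assumes the page exposes its publication time in an `article:published_time` meta tag and its body in `<p>` elements, which many real news pages do not.

```python
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Collects the <title>, a published-time <meta> tag, and <p> text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.published_time = None
        self.paragraphs = []
        self._in_title = False
        self._in_paragraph = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            # Start a new paragraph buffer; nested tags append to it.
            self._in_paragraph = True
            self.paragraphs.append("")
        elif tag == "meta" and attrs.get("property") == "article:published_time":
            # Assumption: the page uses this Open Graph style meta tag.
            self.published_time = attrs.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_paragraph:
            self.paragraphs[-1] += data


def extract_article(html):
    """Return a dict of structured fields parsed from one HTML page."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    body = "\n".join(p.strip() for p in parser.paragraphs if p.strip())
    return {
        "title": parser.title.strip(),
        "body": body,
        "published_time": parser.published_time,
    }
```

Real pages are far messier than this (navigation text inside `<p>` tags, missing metadata, several competing date formats), so a production extractor needs per-site rules or a learned model on top of a parser like this one.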
Part III: Data enrichment
After we parse the fields from the article HTML, we have to enrich them with more data points:
- article language
- publisher’s country
- publisher’s page rank
- publisher’s topic
- whether this article is “Opinion” or not
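The enrichment fields listed above can be pictured as a join between a parsed article and a table of publisher metadata. The sketch below is only an illustration under stated assumptions: the `PUBLISHERS` table, the `enrich` function, the field names, and the URL-based opinion check are all hypothetical, and language detection is left as a caller-supplied function because the text above does not say how any of these values are computed.

```python
from urllib.parse import urlparse

# Hypothetical publisher metadata, keyed by domain. A real table would
# cover every website in the crawl list.
PUBLISHERS = {
    "example-news.com": {"country": "US", "rank": 120, "topic": "business"},
}


def enrich(article, publishers=PUBLISHERS, detect_language=lambda text: None):
    """Return a copy of `article` with the enrichment fields added.

    `detect_language` is a placeholder hook: pass in any function that
    maps text to a language code. The default returns None.
    """
    parsed_url = urlparse(article["url"])
    domain = parsed_url.netloc.lower()
    if domain.startswith("www."):
        domain = domain[4:]
    publisher = publishers.get(domain, {})

    enriched = dict(article)
    enriched["language"] = detect_language(article.get("body", ""))
    enriched["country"] = publisher.get("country")
    enriched["rank"] = publisher.get("rank")
    enriched["topic"] = publisher.get("topic")
    # Placeholder heuristic only: treat URLs under /opinion/ as opinion pieces.
    enriched["is_opinion"] = "/opinion/" in parsed_url.path.lower()
    return enriched
```

Keeping publisher-level fields in a separate lookup means they are stored once per website rather than recomputed for every article.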