By the end of this tutorial, you will be able to write a Named Entity Recognition pipeline using SpaCy: it will detect company acquisitions from news headlines.
By the end of this tutorial, you will be able to write a Named Entity Recognition pipeline using SpaCy: it will detect company acquisitions from news headlines.
Demo:
Learn on practice how to use named entity recognition to mine insights from news in real-time. We will write a Python script to analyze news articles' headlines to learn about all the latest company acquisitions.
In this tutorial, we will:
Natural Language Processing (NLP) is a set of techniques that helps analyze human-generated text.
Examples of applying NLP to real-world business problems:
We try to solve a problem in this article: find all company acquisitions from news articles published online.
Let's compare these two articles:
"Apple is planning to acquire a news API provider in the near future"
"Apple for breakfast: pros and cons"
Both articles mention "Apple"; however, the first one talks about the company, while the second one is about the fruit.
Without named entity recognition, we are not able to even qualify the content.
There are not just organizations (ORG) that the NER can identify. Spacy - Python NLP package that we're going to use - has a pre-built NER pipeline that includes:
For example, take a look at the article below after it's processed by the Spacy pipeline:
"On the 15th of September, Tim Cook announced that Apple wants to acquire ABC Group from New York for 1 billion dollars"
For now, let's use a pre-defined list of news article headlines to test named entity recognition in Spacy.
In the case of one company that acquires another one, it is fair to assume that there should be at least two ORG tags per headline: at least one for the acquire and at least one for the acquirer.
To install Spacy, run in your console:
pip install spacy
python -m spacy download en_core_web_lg
In your Python interpreter, load the package and pre-trained model:
First, let's run a script to see what entity types were recognized in each headline using the Spacy NER pipeline.
The very first example is the most obvious: one company acquires another one. Spacy NER identified both companies correctly.
The third article headline talks about an organization and a person.
All headlines talk about some acquisition; however, not all talk about one company acquiring another one.
Also, let's make it a generic function that we can apply to no matter which article.
The function below does two things:
Limitations:
Now that we know that a headline is an excellent candidate for an acquisition example, we want to ensure that one company is acquiring another one. For this, we will use dependency parsing with Spacy.
I'm not going to go much into the detail as it is outside of the scope of this article.
Using display, we can visualize the dependency tree.
The company that is bought should be:
After identifying the token, we:
I'll provide a quick example of how you could get the latest articles that contain lemma "acquire" from 2 sources: prnewswire and businesswire.
We'll use NewsCatcher News API Python SDK for that.
To install News API SDK, run in your console:
pip install newscatcherapi
You can grab your API key here.
We'll make a call to find articles that:
Finally, we print out the results.
I leave you here so that you can experiment with actual data yourself.