Tutorial

May 7, 2024

How To Annotate Entities With Spacy PhraseMatcher

Aditya Singh

Head Of Product

By the end of this article, you will be able to write a name matching pipeline using spaCy’s PhraseMatcher. It will find and label organization names and their stock ticker symbols.

DEMO

You can find the accompanying web app here.

In this tutorial, we'll:

Learn about the spaCy PhraseMatcher class
Use PhraseMatcher to create a text annotation pipeline that labels organization names and stock tickers
Learn a quick-and-easy way to scrape Wikipedia tables directly into pandas DataFrame
Use spaCy's displacy class to visualize custom entities

The Problem We're Trying To Solve In This Article

While working on the named-entity recognition (NER) pipeline for one of our previous articles, we ran into some issues with the default spaCy NER model. The pipeline aimed to detect significant events like acquisitions. For this, we used spaCy’s in-built NER model to detect organization names in news headlines. The problem is that the model is far from perfect so it doesn’t necessarily detect all organizations. And this is to be expected for entities like organization names that have variations and new additions.

There are two ways to solve this issue. We can either train a better statistical NER model on an updated custom dataset or use a rule-based approach to make the detections. The funny thing about this choice is that it’s not really a choice. You see, to train a better NER model we would need to label text data. And the best way to optimize the data annotation process is to automate it (or parts of it) using a rule-based method.

So, that’s exactly what we are doing to do in this article. We’ll learn about spaCy’s PhraseMatcher and use it to annotate organization names and stock ticker symbols.

Introduction To spaCy PhraseMathcer

The PhraseMatcher class enables us to efficiently match large lists of phrases. It accepts match patterns in the form of Doc objects that can contain many tokens.

First, let's install spaCy.

The next step is to initialize the PhraseMatcher with a vocabulary. Like Matcher, the PhraseMatcher object must share the same vocabulary with the documents it will operate on.

After initializing the PhraseMathcer object with a vocab, we can add the patterns using the .add() method.

NOTE: Each phrase string has to be processed with the nlp object to create Doc objects for the pattern dictionary. Doing this in a simple loop or a list comprehension can easily become inefficient and slow. If we only need to match for text patterns (and not other attributes) we can use nlp.make_doc instead, which only runs the tokenizer. We can also use nlp.tokenizer.pipe for an extra speed boost as it processes the texts as a stream.

two options for creating pattern Docs more efficiently: make_doc and tokenizer.pipe

Now that we have added the desired phrases, we can use the matcher object to find them in some news text.

Output of the PhraseMatcher we just created for labeling names and political positions of power

Matching On Other Token Attributes

By default, the PhraseMatcher matches on the verbatim token text, i.e., .text attribute. Using the attr argument on initialization, we can change the token attribute the PhraseMatcher uses when comparing the patterns to the matched Doc.

Let’s say we wanted to create case-insensitive match patterns. We can do so using the attribute LOWER that enables us to match on Token.lower.

Output of our PhraseMatcher that matches for lowercase version of the text

Although nlp.make_doc helps us create Doc objects for patterns as efficiently as possible, it can cause issues if we want to match for other attributes. It only runs the tokenizer and none of the other pipeline components. So, we need to ensure that the required pipeline component runs when we create the pattern dictionary.

For example, to match on POS or LEMMA, the pattern Doc objects need to have part-of-speech tags set by the tagger or morphologizer. We either call the nlp object on the pattern texts, or use nlp.select_pipes to run components selectively.

Why PhraseMatcher?

Okay, that’s cool and everything, but why are we choosing PhraseMacher in the first place? Wouldn’t Python’s built-in find() method do just fine? What about spaCy’s Matcher class?

Yeah, both of these approaches will work, but they are rather inefficient.

The time complexity of the find() function is O(N*L) where N is the size of the string in which we search, and L is the size of the phrase(pattern) we are searching for. And this is just for one phrase, so this complexity would increase linearly with the number of phrases as well.

Matcher vs PhraseMatcher

There are two main reasons for choosing PhraseMacther over the Matcher class when working with large lists of patterns:

computational efficiency
ease of implementation.

Matcher runs in O(N*M) time for a document of N words and a pattern list of M entries. On the other hand, PhraseMatcher's average-case complexity is much better. The specifics depend on the pattern list, but it's generally somewhere between O(N) and O(N*log(M)). This has a big impact when the phrase dictionaries are very big, i.e., when we have a large M. For instance, let's say we want to match for a dictionary with every drug manufactured by a pharmaceutical company. PhraseMatcher will massively outperform its Matcher equivalent.

Just take a look at this time complexity plot. See how quickly the upper bound for Matcher(purple) moves in comparison to PhraseMatcher(blue) for the same increase in M.

time complexity plot Matcher vs PhraseMatcher that shows Matcher increasing substatially faster than PhraseMatcher

The other advantage of PhraseMatcher is its ease of use. With PhraseMatcher we can easily loop over the list of phrases and create Doc objects to pass as patterns. The process is a bit more complex for a Matcher object as it works on the token level. So we’ll have to accommodate for phrases of different token lengths.

For instance, to match the phrase “Washington, D.C.” using Matcher, we would have to create a complex pattern like the following two:

Two Match patterns options for spaCy Matcher that match for

Whereas, when using PhraseMatcher, we can simply pass in nlp("Washington, D.C.") and don’t have to write complex token patterns.

Using PhraseMatcher To Label News Data

Besides spaCy and pandas, we’ll be using a few other Python libraries:

cleanco - To clean and normalize company names
BeautifulSoup and requests - To fetch the list of S&P500 companies from Wikipedia and extract relevant data.

To install these run:

Getting Organization Names And Ticker Symbols

We’ll be using the list of all NasDAQ listed companies available on datahub.

first five rows of the datahub NasDAQ list of company names and tickers

This list was last updated three years ago, so we’ll supplement it using the list of S&P500 companies available on Wikipedia.

Get the webpage with the table of S&P500 companies.

Parse the HTML using BeautifulSoup and select the required table.

highlighted wikipedia HTML table class in screenshot that has inspect element options on

Convert the HTML table into a Pandas DataFrame using the read_html() method

first five rows of Wikipedia S&P500 table we just converted into a dataframe

There are some extra columns that we don’t need, and the name for the organizations’ column is different from the other DataFrame. We’ll drop the extra columns and rename the ‘Security’ column.

Now we can combine the two tables to get a more comprehensive list of names and tickers.

Cleaning Organization Names

Examples of extra information enclosed within parentheses in organization names

Some of the organization names have additional information in parentheses such as the state of registration or the class of the share. These things don’t matter for our aim of labeling organization names. So let’s create a function that removes anything enclosed with a pair of parentheses.

Besides that, organization names can have multiple forms. For example, Activision can be written as just ‘Activision Blizzard’ or ‘Activision Blizzard, Inc.’. Currently, our list contains the longer form of the names, to create a robust annotation system we’ll need to include both forms.

To obtain the cleaner, shorter versions of the organization names we’ll use cleanco

There are a few empty entries, so let’s deal with that before we move further.

Moreover, there are some incomplete/wrong organization names that will cause false positives if they're not dealt with.

Create The PhraseMatcher Labeller

Now that we have all the data we need, we can create our PhraseMatcher object.

PhraseMatcher supports adding multiple rules containing several patterns, and assigning IDs to each matcher rule. This means we can use the same matcher object for the names and tickers.

And that’s it! Now we can use this to label text data. Let’s try it out on some news data and see how it performs.

company names and ticker matches found by our phrasematcher annotator

Hmm, where seems to be some duplication. The problem is that our PhraseMatcher finds both forms of the company name, the longer complete version, and the shorter, cleaner version. There’s a fairly straightforward (greedy) solution to this issue, we count the occurrences of each matched name in all matched names.

The idea is that shorter names like “Activision Blizzard” will have a count of more than one. This is because they will as they will also appear as substrings in longer names like “Activision Blizzard, Inc”. But the longer version will occur only once. Let’s add this de-duplication and visualize the results using spaCy’s displacy.

displacy visualization of labelled news data after duplication fix

Conclusion

In this article, you learn about spaCy’s PhraseMatcher and used it to create a rudimentary text annotation pipeline that labels organization names and their ticker symbols. Depending on your needs, this pipeline can be easily extended to offer near-perfect results. You can even supplement it with a statistical NER model.

Generally speaking, statistical models are useful if your application needs to be able to generalize based on the context. That being said, you should always carefully consider if your use case really needs a model. Sometimes it might be better off with a rule-based system.

Two main criteria for choosing a rule-based approach:

You already have a large dictionary of terms to match
There’s a more or less finite number of instances of the given category (public-traded organization names and cities are good examples)

Combining a rule-based system with a statistical NER model might help you get better results if misspellings and spelling variations are important for your application. And spaCy’s entity recognizer respects pre-defined entities (e.g. set by previous ) and will use them as constraints for its predictions. So, if you train a NER model to detect public traded companies and add it after the rules, it will only find entities in the text that hasn't been labeled. This can potentially help you find entities the rules might miss.

Also interesting

all articles

Black thin grid lines forming diamond-shaped pattern on a white background.

Company

July 6, 2026

Structured Data Extraction from Web Search Results: JSON Schemas, Validation Prompts, and What Goes Wrong

Artem Bugara CEO & co-founder

Company

July 1, 2026

What Is Recall in AI Search? Why Your AI Agent Might Be Missing 80% of Results

Margaretha Boetticher Head of Growth

Tutorial

June 23, 2026

How to Track New Local Business Openings: Build an Automated Local Business Tracker

Engineering Team

Company

June 15, 2026

Web Search API for Risk Monitoring: How Risk Teams Catch Signals Early

Artem Bugara CEO & co-founder

Tutorial

June 10, 2026

How to Evaluate Your AI Agent's Web Search Quality (Without Manual Labeling)

Artem Bugara CEO & co-founder

Product

June 5, 2026

Web Scraping API vs. Custom Scraper: Which One Should You Use?

Margaretha Boetticher Head of Growth

Tutorial

June 2, 2026

How Investment Teams Use Web Search APIs for Real-Time Market Intelligence

Margaretha Boetticher Head of Growth

Tutorial

May 27, 2026

How to Build a Deep Research Agent with CatchAll and LangChain

Artem Bugara CEO & co-founder

Tutorial

May 25, 2026

How to Monitor M&A Activity: Build an Automated Mergers & Acquisitions Tracker

NewsCatcher

Company

May 5, 2026

Best Web Search API: An In-Depth Comparison of Available Tools in 2026

Margaretha Boetticher Head of Growth

Product

April 29, 2026

Web Scraping API vs Web Search API: A Developer's Guide to Choosing the Right Tool

Margaretha Boetticher Head of Growth

Product

April 23, 2026

Web Search API Types: Three Architectures, One Confusing Name

Oleksandr Sirenko

Product

April 20, 2026

Introducing Company Watchlist: Scope Any Query to Your List of Companies

Margaretha Boetticher Head of Growth

Company

April 14, 2026

What Is a Web Search API? A Guide for Developers and Analysts

Margaretha Boetticher Head of Growth

Product

April 8, 2026

Web Search API Benchmarks: Q1 2026 — CatchAll vs Exa, OpenAI, and More

Oleksandr Sirenko

Company

March 26, 2026

Why We're Building a Different Type of Web Index

Artem Bugara CEO & co-founder

Tutorial

February 25, 2026

Beyond the Scoreboard: Building a Live Olympics 2026 Incident and Medal Dashboard with CatchAll

NewsCatcher

Product

February 3, 2026

Google found 69 results. We found 3,261. Here's how

Engineering Team

Company

January 28, 2026

Why Recall Beats Precision for Real-World AI Research

Oleksandr Sirenko

Tutorial

January 14, 2026

Building a Deep Research Agent with CatchAll and CrewAI

NewsCatcher

Product

January 13, 2026

Evaluating Recall in Web Search APIs: OpenAI vs Exa vs Parallel AI vs CatchAll

NewsCatcher

Tutorial

December 29, 2025

Building a Supply Chain Risk Monitor Using CatchAll and CrewAI

NewsCatcher

Product

November 21, 2025

Introducing CatchAll: A SOTA Web Search API for Real-World Events

Margaretha Boetticher Head of Growth

Company

June 10, 2025

How Transparency International Uses NewsCatcher Data to Fight Health Corruption

Jonathan Cushing Programme Director

Product

March 14, 2025

Comparing News Data Search: LLMs, Analyst, and NewsCatcher Pipelines

Aditya Singh Head Of Product

Product

March 6, 2025

Measuring Product Launch Impact with News Data

Mariia Platonova Head of Marketing

Company

January 24, 2025

NewsCatcher Partners With Reworkd To Streamline Access To Actionable Web Data

Artem Bugara CEO & co-founder

Tutorial

January 22, 2025

Fake News Detection Using Python

Karthik Devan Tech Copywriter

Company

December 16, 2024

Top Media Outlets: 50 Essential News Sites to Consider for Your News Analysis in 2025

Mariia Platonova Head of Marketing

Product

December 9, 2024

How Does Our Local News API Work?

Aditya Singh Head Of Product

Tutorial

November 25, 2024

Detecting Events in News Using NewsCatcher’s Events Intelligence API

Aditya Singh Head Of Product

Product

November 5, 2024

Introducing NewsCatcher's Local News API

Aditya Singh Head Of Product

Company

October 15, 2024

How to Choose a News API

Artem Bugara CEO & co-founder

Product

September 17, 2024

Using Sentiment Analysis for Market Research

Artem Bugara CEO & co-founder

Company

August 8, 2024

60,000 AI-generated news articles are published every day

Bradley Emi CTO Pangram Labs

Product

May 7, 2024

Top 4 Free & Open-Source News API Alternatives

Artem Bugara CEO & co-founder

Tutorial

May 7, 2024

Ultimate Guide To Text Similarity With Python

Aditya Singh Head Of Product

Tutorial

May 7, 2024

Using News API For Share Of Voice (SOV) Measurement & Competitor Tracking

Artem Bugara CEO & co-founder

Tutorial

May 7, 2024

How To Train Custom Named Entity Recognition [NER] Model With SpaCy

Aditya Singh Head Of Product

Company

May 7, 2024

Top 15 Takeaways From Running A Bootstrapped Startup For 1 Year

Artem Bugara CEO & co-founder

Tutorial

May 7, 2024

Named Entity Recognition (NER) with SpaCy [with code example]

Aditya Singh Head Of Product

Product

May 7, 2024

How We Built A News API Beta In 60 Days

Artem Bugara CEO & co-founder

Tutorial

May 7, 2024

How To Present/Show Open-Source Projects [Practical Guide]

Artem Bugara CEO & co-founder

Tutorial

May 7, 2024

Google Kubernetes Engine as an alternative to Cloud Run

Maksym Sugonyaka

Product

May 7, 2024

Google News RSS Search Parameters: The Missing Docs

Artem Bugara CEO & co-founder

Tutorial

May 7, 2024

Building A PR/Communication Media Monitoring Tool With News API

Artem Bugara CEO & co-founder

Product

May 7, 2024

100k+ Rows Topic Labeled News Dataset

Artem Bugara CEO & co-founder

Product

May 7, 2024

Announcing Free COVID-19 News API

Artem Bugara CEO & co-founder

Tutorial

March 14, 2024

SpaCy vs NLTK. Text Normalization Comparison [with code]

Aditya Singh Head Of Product

Tutorial

March 14, 2024

Top 6 Text Annotation Tools

Aditya Singh Head Of Product

Tutorial

March 14, 2024

Sentiment Analysis Using Python

Aditya Singh Head Of Product

Tutorial

March 14, 2024

Mining Financial Stock News Using SpaCy Matcher

Aditya Singh Head Of Product

Tutorial

March 14, 2024

Learning Natural Language Processing (NLP) Made Easy

Aditya Singh Head Of Product

Tutorial

March 14, 2024

How To Classify Text With Python, Transformers & scikit-learn

Aditya Singh Head Of Product

Tutorial

March 14, 2024

How To Build Your Own Crypto News Aggregator

Aditya Singh Head Of Product

Tutorial

March 14, 2024

4 Python Web Scraping Libraries To Mining News Data

Aditya Singh Head Of Product

Also interesting

all articles

Company

July 6, 2026

Structured Data Extraction from Web Search Results: JSON Schemas, Validation Prompts, and What Goes Wrong

Artem Bugara

CEO & co-founder

Company

July 1, 2026

What Is Recall in AI Search? Why Your AI Agent Might Be Missing 80% of Results

Margaretha Boetticher

Head of Growth

Tutorial

June 23, 2026

How to Track New Local Business Openings: Build an Automated Local Business Tracker

Engineering Team

Company

June 15, 2026

Web Search API for Risk Monitoring: How Risk Teams Catch Signals Early

Artem Bugara

CEO & co-founder

Tutorial

June 10, 2026

How to Evaluate Your AI Agent's Web Search Quality (Without Manual Labeling)

Artem Bugara

CEO & co-founder

Product

June 5, 2026

Web Scraping API vs. Custom Scraper: Which One Should You Use?

Margaretha Boetticher

Head of Growth

Tutorial

June 2, 2026

How Investment Teams Use Web Search APIs for Real-Time Market Intelligence

Margaretha Boetticher

Head of Growth

Tutorial

May 27, 2026

How to Build a Deep Research Agent with CatchAll and LangChain

Artem Bugara

CEO & co-founder

Tutorial

May 25, 2026

How to Monitor M&A Activity: Build an Automated Mergers & Acquisitions Tracker

NewsCatcher

Company

May 5, 2026

Best Web Search API: An In-Depth Comparison of Available Tools in 2026

Margaretha Boetticher

Head of Growth

Product

April 29, 2026

Web Scraping API vs Web Search API: A Developer's Guide to Choosing the Right Tool

Margaretha Boetticher

Head of Growth

Product

April 23, 2026

Web Search API Types: Three Architectures, One Confusing Name

Oleksandr Sirenko

Product

April 20, 2026

Introducing Company Watchlist: Scope Any Query to Your List of Companies

Margaretha Boetticher

Head of Growth

Company

April 14, 2026

What Is a Web Search API? A Guide for Developers and Analysts

Margaretha Boetticher

Head of Growth

Product

April 8, 2026

Web Search API Benchmarks: Q1 2026 — CatchAll vs Exa, OpenAI, and More

Oleksandr Sirenko

Company

March 26, 2026

Why We're Building a Different Type of Web Index

Artem Bugara

CEO & co-founder

Tutorial

February 25, 2026

Beyond the Scoreboard: Building a Live Olympics 2026 Incident and Medal Dashboard with CatchAll

NewsCatcher

Product

February 3, 2026

Google found 69 results. We found 3,261. Here's how

Engineering Team

Company

January 28, 2026

Why Recall Beats Precision for Real-World AI Research

Oleksandr Sirenko

Tutorial

January 14, 2026

Building a Deep Research Agent with CatchAll and CrewAI

NewsCatcher

Product

January 13, 2026

Evaluating Recall in Web Search APIs: OpenAI vs Exa vs Parallel AI vs CatchAll

NewsCatcher

Tutorial

December 29, 2025

Building a Supply Chain Risk Monitor Using CatchAll and CrewAI

NewsCatcher

Product

November 21, 2025

Introducing CatchAll: A SOTA Web Search API for Real-World Events

Margaretha Boetticher

Head of Growth

Company

June 10, 2025

How Transparency International Uses NewsCatcher Data to Fight Health Corruption

Jonathan Cushing

Programme Director

Product

March 14, 2025

Comparing News Data Search: LLMs, Analyst, and NewsCatcher Pipelines

Aditya Singh

Head Of Product

Product

March 6, 2025

Measuring Product Launch Impact with News Data

Mariia Platonova

Head of Marketing

Company

January 24, 2025

NewsCatcher Partners With Reworkd To Streamline Access To Actionable Web Data

Artem Bugara

CEO & co-founder

Tutorial

January 22, 2025

Fake News Detection Using Python

Karthik Devan

Tech Copywriter

Company

December 16, 2024

Top Media Outlets: 50 Essential News Sites to Consider for Your News Analysis in 2025

Mariia Platonova

Head of Marketing

Product

December 9, 2024

How Does Our Local News API Work?

Aditya Singh

Head Of Product

Tutorial

November 25, 2024

Detecting Events in News Using NewsCatcher’s Events Intelligence API

Aditya Singh

Head Of Product

Product

November 5, 2024

Introducing NewsCatcher's Local News API

Aditya Singh

Head Of Product

Company

October 15, 2024

How to Choose a News API

Artem Bugara

CEO & co-founder

Product

September 17, 2024

Using Sentiment Analysis for Market Research

Artem Bugara

CEO & co-founder

Company

August 8, 2024

60,000 AI-generated news articles are published every day

Bradley Emi

CTO Pangram Labs

Product

May 7, 2024

Top 4 Free & Open-Source News API Alternatives

Artem Bugara

CEO & co-founder

Tutorial

May 7, 2024

Ultimate Guide To Text Similarity With Python

Aditya Singh

Head Of Product

Tutorial

May 7, 2024

Using News API For Share Of Voice (SOV) Measurement & Competitor Tracking

Artem Bugara

CEO & co-founder

Tutorial

May 7, 2024

How To Train Custom Named Entity Recognition [NER] Model With SpaCy

Aditya Singh

Head Of Product

Company

May 7, 2024

Top 15 Takeaways From Running A Bootstrapped Startup For 1 Year

Artem Bugara

CEO & co-founder

Tutorial

May 7, 2024

Named Entity Recognition (NER) with SpaCy [with code example]

Aditya Singh

Head Of Product

Product

May 7, 2024

How We Built A News API Beta In 60 Days

Artem Bugara

CEO & co-founder

Tutorial

May 7, 2024

How To Present/Show Open-Source Projects [Practical Guide]

Artem Bugara

CEO & co-founder

Tutorial

May 7, 2024

Google Kubernetes Engine as an alternative to Cloud Run

Maksym Sugonyaka

Product

May 7, 2024

Google News RSS Search Parameters: The Missing Docs

Artem Bugara

CEO & co-founder

Tutorial

May 7, 2024

Building A PR/Communication Media Monitoring Tool With News API

Artem Bugara

CEO & co-founder

Product

May 7, 2024

100k+ Rows Topic Labeled News Dataset

Artem Bugara

CEO & co-founder

Product

May 7, 2024

Announcing Free COVID-19 News API

Artem Bugara

CEO & co-founder

Tutorial

March 14, 2024

SpaCy vs NLTK. Text Normalization Comparison [with code]

Aditya Singh

Head Of Product

Tutorial

March 14, 2024

Top 6 Text Annotation Tools

Aditya Singh

Head Of Product

Tutorial

March 14, 2024

Sentiment Analysis Using Python

Aditya Singh

Head Of Product

Tutorial

March 14, 2024

Mining Financial Stock News Using SpaCy Matcher

Aditya Singh

Head Of Product

Tutorial

March 14, 2024

Learning Natural Language Processing (NLP) Made Easy

Aditya Singh

Head Of Product

Tutorial

March 14, 2024

How To Classify Text With Python, Transformers & scikit-learn

Aditya Singh

Head Of Product

Tutorial

March 14, 2024

How To Build Your Own Crypto News Aggregator

Aditya Singh

Head Of Product

Tutorial

March 14, 2024

4 Python Web Scraping Libraries To Mining News Data

Aditya Singh

Head Of Product

How To Annotate Entities With Spacy PhraseMatcher

The Problem We're Trying To Solve In This Article

Introduction To spaCy PhraseMathcer

Matching On Other Token Attributes

Why PhraseMatcher?

Matcher vs PhraseMatcher

Using PhraseMatcher To Label News Data

Getting Organization Names And Ticker Symbols

Cleaning Organization Names

Create The PhraseMatcher Labeller

Conclusion

Also interesting

Also interesting

DEVELOPERS

Technology