Tutorial

May 7, 2024

How To Train Custom Named Entity Recognition [NER] Model With SpaCy

Aditya Singh

Head Of Product

Learn what named entity recognition is, how it works, and how it can be used. Use spaCy to train your own custom NER model.

In this tutorial, we'll:

Learn about named entity recognition (NER), how it works, and its applications
Use spaCy's pre-trained NER transformer model
Train a custom NER model with spaCy that extracts medical entities from journal texts

What Is Named Entity Recognition?

Named Entity Recognition (NER) is one of the core information extraction tasks in Natural Language Processing (NLP). It aims to locate and categorize key information, i.e., entities, in text data. These ‘entities’ can be any word or sequence of words (usually proper nouns) that consistently refers to the same thing. For instance, an entity detection system might detect the word “NewsCatcher” in text and classify it as an “Organization”.

NewsCatcher named entity recognition example visualization

At its core, all entity recognition systems have two steps:

Detecting the entities in text
Categorizing the entities into named classes

In the first step, the NER system detects the location of the token or series of tokens that form an entity. Inside-outside-beginning (IOB) tagging is a common scheme for marking where entities start and end: each token is tagged as the beginning of an entity, inside one, or outside any entity. The second step assigns each detected entity to a category. These categories change depending on the use case, but here are some of the most common entity classes:

Person
Organization
Location
Time
Measurements or Quantities
String patterns like email addresses, phone numbers, or IP addresses
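
The two steps above can be sketched in plain Python. This is only an illustration: it assumes a tagger has already produced one IOB tag per token, and the tokens, tags, and the `iob_to_entities` helper are made up for the example rather than taken from spaCy.

```python
def iob_to_entities(tokens, tags):
    """Group IOB-tagged tokens into (entity_text, label) pairs."""
    entities = []
    current_tokens, current_label = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins: close any entity that is still open.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" (outside) or a stray I- tag: close any open entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None

    if current_tokens:
        entities.append((" ".join(current_tokens), current_label))
    return entities


# Hypothetical tagger output for one sentence.
tokens = ["Alice", "Smith", "joined", "NewsCatcher", "in", "Sydney"]
tags = ["B-PERSON", "I-PERSON", "O", "B-ORG", "O", "B-LOC"]

entities = iob_to_entities(tokens, tags)
print(entities)
# → [('Alice Smith', 'PERSON'), ('NewsCatcher', 'ORG'), ('Sydney', 'LOC')]
```

The `B-`/`I-` prefixes carry the detection step (where an entity starts and how far it extends), while the suffix after the dash carries the categorization step.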

Although there are some rule-based entity recognition approaches, most modern named entity recognition systems use a machine learning or deep learning model. Because text is written by humans, it carries a lot of inherent ambiguity. For instance, the word ‘Sydney’ can refer to either a location or a person’s name.

Entity ambiguity example that shows Sydney used as a person's name and to refer to the city.

There is no sure-fire way of dealing with such ambiguities, but as a general rule of thumb, the more relevant the training data is to the task, the better the model performs. For example, later in the article, we’ll train a custom NER model that extracts medical information from clinical texts. If we then try to use this model to extract the same medical information from other types of text, say blog posts or news articles, it will perform a lot worse.

Application Of Named Entity Recognition

NER is useful in any application that needs a high-level understanding of a large quantity of text. A good NER model lets the computer identify the subject or theme of a text at a glance and quickly group documents based on their relevance.

Information Extraction And Summarization

There is an enormous amount of text data being produced every day, but most of the text in documents is not necessarily useful. For example, insurance policy documents are extremely lengthy and have a lot of information. But insurance inspectors only need a specific set of information like name, contact details, age, etc. Being able to quickly extract and summarize this information saves a lot of time.

Optimizing Search Engines

Illustration of an optimized search engine that only looks through list of relevant entities

Let’s say we are creating an internal search engine for a news publisher that has millions of articles. A simple substring look-up would end up going through all the words in those millions of articles for every query, which is rather slow. If we instead run an NER model over the articles once, extract the relevant entities, and store them, we can optimize the search process: the search query only needs to be matched against the list of relevant entities, so the search takes less time.
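
A minimal sketch of this idea, assuming the entities have already been extracted by an NER model and using made-up article ids and entity lists, is an inverted index from entity to articles:

```python
from collections import defaultdict


def build_entity_index(article_entities):
    """Map each lowercased entity to the set of article ids that mention it."""
    index = defaultdict(set)
    for article_id, entities in article_entities.items():
        for entity in entities:
            index[entity.lower()].add(article_id)
    return index


def search(index, query):
    """Return ids of articles whose stored entities match the query exactly."""
    return sorted(index.get(query.lower(), set()))


# Hypothetical result of running NER once over three articles.
article_entities = {
    1: ["NewsCatcher", "Sydney"],
    2: ["Apple Watch", "Washington"],
    3: ["Sydney"],
}

index = build_entity_index(article_entities)
print(search(index, "sydney"))  # → [1, 3]
```

Each query becomes a single dictionary look-up instead of a scan over every article. The trade-off in this sketch is that only exact entity matches are found, so a real system would likely combine it with some form of normalization or fuzzy matching.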

Machine Translation

Translation example from English to French with a named entity, person.

NER is also useful in translation applications as certain named entities like Person and Location don’t need to be translated, while others do.

Content Classification

News and publishing houses generate large amounts of online content every day, and categorizing it correctly is very important for getting the most use out of each article. Named Entity Recognition can automatically scan entire articles and reveal the major people, organizations, and places discussed in them. Knowing the relevant tags for each article helps in automatically categorizing the articles into defined hierarchies and enables better content discovery.

Customer Support

There are a number of ways to smooth the handling of customer feedback using Named Entity Recognition. Let’s say we are handling the customer support department of an electronics store with multiple branches worldwide, and we need to go through a number of mentions in our customers’ feedback, for instance a tweet that names a product and a location.

If we pass this tweet through a Named Entity Recognition API, it pulls out the entities Washington (Location) and Apple Watch (Product). This information can then be used to categorize the complaint and assign it to the relevant department within the organization.
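
As a rough illustration of that last step, the snippet below routes a complaint based on the entity labels an NER system returned. The department names, the `ROUTING_RULES` table, and the `route_complaint` helper are all hypothetical.

```python
# Hypothetical mapping from entity label to the department that handles it.
ROUTING_RULES = {
    "PRODUCT": "product-support",
    "LOCATION": "regional-operations",
}


def route_complaint(entities, default="general-support"):
    """Pick departments for a complaint from its (text, label) entities."""
    departments = []
    for _text, label in entities:
        department = ROUTING_RULES.get(label)
        if department and department not in departments:
            departments.append(department)
    # Fall back to a default queue when no known entity type was found.
    return departments or [default]


# Entities from the example above, as (text, label) pairs.
entities = [("Washington", "LOCATION"), ("Apple Watch", "PRODUCT")]
print(route_complaint(entities))
# → ['regional-operations', 'product-support']
```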

NER in spaCy

spaCy, widely regarded as one of the fastest NLP frameworks in Python, comes with optimized implementations of many common NLP tasks, including NER. spaCy v3.0 also introduced transformer-based pipelines that brought its accuracy up to the state of the art at the time. By default, a pre-trained spaCy pipeline loads the part-of-speech tagger, dependency parser, and NER.

So we can perform named entity recognition in a few lines of code:
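
The snippet below is a minimal sketch rather than the article's original code. It assumes the `en_core_web_trf` pipeline has been installed with `python -m spacy download en_core_web_trf`; the `extract_entities` and `load_pipeline` helpers and the sample sentence are illustrative.

```python
import spacy


def extract_entities(nlp, text):
    """Return (text, label) pairs for every entity the pipeline finds."""
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


def load_pipeline(name="en_core_web_trf"):
    """Load a pre-trained pipeline, or return None if it is not installed."""
    try:
        return spacy.load(name)
    except OSError:
        # spaCy raises OSError when the package or path cannot be found.
        print(f"Model '{name}' not found; run: python -m spacy download {name}")
        return None


nlp = load_pipeline()
if nlp is not None:
    sample = "My Apple Watch stopped working after I bought it in Washington."
    for ent_text, label in extract_entities(nlp, sample):
        print(ent_text, label)
```

The entities live on `doc.ents`, and each one exposes its text and its predicted label through `ent.text` and `ent.label_`.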

Although this RoBERTa-based model performs well on general-domain text such as the OntoNotes 5 corpus it was trained on, it doesn’t perform as well on other kinds of text data. For instance, if we try to extract entities from medical journal text, it won’t detect any relevant information.

output of spacy's default NER on medical journal text

To fix this we’ll need to train our own NER model, and the good thing is that spaCy makes that process very straightforward.

How To Train A Custom NER Model In spaCy

To train our custom named entity recognition model, we’ll need some relevant text data with the proper annotations. For the purpose of this tutorial, we’ll be using the medical entities dataset available on Kaggle.

Let’s install spaCy and spacy-transformers, and then take a look at the dataset.

We only need the text string, the entity start and end indices, and the entity type.
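
A sketch of that extraction step is shown below. The record layout (`examples`, `content`, `annotations`, `start`, `end`, `tag_name`) and the labels are assumptions made for illustration, so adjust the field names to whatever the downloaded file actually contains.

```python
import json

# Stand-in for the annotated dataset; the field names are assumed.
raw = json.loads("""
{
  "examples": [
    {
      "content": "Patients received ibuprofen for fever.",
      "annotations": [
        {"start": 18, "end": 27, "tag_name": "Medicine"},
        {"start": 32, "end": 37, "tag_name": "Symptom"}
      ]
    }
  ]
}
""")


def to_training_data(dataset):
    """Reduce each record to (text, [(start, end, LABEL), ...])."""
    training_data = []
    for example in dataset["examples"]:
        text = example["content"]
        entities = [
            (ann["start"], ann["end"], ann["tag_name"].upper())
            for ann in example["annotations"]
        ]
        training_data.append((text, entities))
    return training_data


training_data = to_training_data(raw)
for text, entities in training_data:
    for start, end, label in entities:
        # Slicing the text with the offsets is a quick sanity check.
        print(label, repr(text[start:end]))
```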

spaCy uses the DocBin class for annotated data, so we’ll have to create DocBin objects for our training examples. The DocBin class efficiently serializes the information from a collection of Doc objects. It is faster and produces smaller data sizes than pickle, and it allows the user to deserialize without executing arbitrary Python code.
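
A minimal sketch of building a DocBin, assuming training examples in the form `(text, [(start, end, label), ...])`. The sample sentence and labels are invented, and the blank English pipeline is used only for tokenization.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # tokenizer only, no trained components

# Hypothetical training example with character offsets.
training_data = [
    ("Patients received ibuprofen for fever.",
     [(18, 27, "MEDICINE"), (32, 37, "SYMPTOM")]),
]

doc_bin = DocBin()
for text, annotations in training_data:
    doc = nlp.make_doc(text)
    spans = []
    for start, end, label in annotations:
        # char_span returns None when the offsets do not line up with tokens.
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print(f"Skipping misaligned entity: {text[start:end]!r}")
        else:
            spans.append(span)
    doc.ents = spans
    doc_bin.add(doc)

# Round-trip through bytes to show that the annotations survive serialization.
restored = DocBin().from_bytes(doc_bin.to_bytes())
docs = list(restored.get_docs(nlp.vocab))
print([(ent.text, ent.label_) for ent in docs[0].ents])

# For training, the same object would be written to disk, e.g.:
# doc_bin.to_disk("train.spacy")
```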

There are some entity span overlaps, i.e., the indices of some entities overlap. spaCy provides a utility function, filter_spans, to deal with this: when spans overlap, it keeps the longest one.
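
The behavior of `filter_spans` can be seen on a small example. The sentence, the label, and the `span_for` helper below are made up for illustration.

```python
import spacy
from spacy.util import filter_spans

nlp = spacy.blank("en")
text = "She was diagnosed with type 2 diabetes mellitus."
doc = nlp.make_doc(text)


def span_for(phrase, label):
    """Build a labeled span for the first occurrence of phrase in the text."""
    start = text.index(phrase)
    return doc.char_span(start, start + len(phrase), label=label)


# Two annotations that overlap: the short one sits inside the long one.
candidates = [
    span_for("diabetes", "CONDITION"),
    span_for("type 2 diabetes mellitus", "CONDITION"),
]

kept = filter_spans(candidates)  # keeps the longest of the overlapping spans
doc.ents = kept                  # safe now: entity spans must not overlap
print([span.text for span in kept])
```

Assigning the unfiltered `candidates` to `doc.ents` would raise an error, because spaCy does not allow a token to belong to more than one entity.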

config options on spaCy’s training quickstart page

You can manually create a config file for your use case or quickly generate a base config on spaCy’s training quickstart page.

We’ll be working with a base config file created using the quickstart page. This is an incomplete file with only our custom options, so we’ll have to fill in the rest with the default values.
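
Filling in the defaults is done with spaCy's `init fill-config` command, i.e. `python -m spacy init fill-config base_config.cfg config.cfg`. The sketch below wraps that command in Python; the file names and the `fill_config_command` helper are assumptions.

```python
import subprocess
import sys
from pathlib import Path


def fill_config_command(base_config, output_config):
    """Build the `spacy init fill-config` command for the current interpreter."""
    return [
        sys.executable, "-m", "spacy", "init", "fill-config",
        str(base_config), str(output_config),
    ]


base = Path("base_config.cfg")  # assumed name of the file from the quickstart page
if base.exists():
    # check=True raises CalledProcessError if spaCy reports a problem.
    subprocess.run(fill_config_command(base, "config.cfg"), check=True)
else:
    print(f"{base} not found; download it from the quickstart page first")
```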

output after filling the config file with default option values

Now we have all that we need to train our model.
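
Training is usually started from the command line with `python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy`. The sketch below does the same from Python through `spacy.cli.train.train`. The file names are assumptions, and `train_custom_ner` is a hypothetical wrapper.

```python
from pathlib import Path

from spacy.cli.train import train


def train_custom_ner(config_path, train_path, dev_path, output_dir="output"):
    """Train from a filled config and return the path of the best model."""
    missing = [str(p) for p in (config_path, train_path, dev_path)
               if not Path(p).exists()]
    if missing:
        raise FileNotFoundError(f"Missing training inputs: {', '.join(missing)}")

    train(
        Path(config_path),
        output_path=Path(output_dir),
        # Overrides fill the [paths] section that the config leaves open.
        overrides={"paths.train": str(train_path), "paths.dev": str(dev_path)},
    )
    # spaCy writes model-best and model-last into the output directory.
    return Path(output_dir) / "model-best"


inputs = ("config.cfg", "train.spacy", "dev.spacy")  # assumed file names
if all(Path(p).exists() for p in inputs):
    best_model = train_custom_ner(*inputs)
    print(f"Best model saved to {best_model}")
else:
    print("Create config.cfg, train.spacy and dev.spacy before training")
```

With a transformer config, training on a GPU is considerably faster. `train` accepts a `use_gpu` argument for that, and the command line equivalent is `--gpu-id 0`.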

Let’s load the best-performing model and test it on a piece of text.
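
A sketch of this step, assuming training wrote its results to `output/model-best`. The `render_entities` helper, the sample sentence, and the output file name are illustrative.

```python
from pathlib import Path

import spacy
from spacy import displacy


def render_entities(nlp, text):
    """Return displaCy HTML markup highlighting the entities found in text."""
    doc = nlp(text)
    # jupyter=False makes render() return the markup instead of displaying it.
    return displacy.render(doc, style="ent", jupyter=False)


model_dir = Path("output/model-best")  # assumed output location from training
if model_dir.exists():
    nlp_ner = spacy.load(model_dir)
    html = render_entities(nlp_ner, "Patients received ibuprofen for fever.")
    Path("entities.html").write_text(html, encoding="utf-8")
else:
    print(f"{model_dir} not found; train the model first")
```

In a Jupyter notebook, calling `displacy.render(doc, style="ent")` directly displays the highlighted text inline.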

output of the trained custom named entity recognition model

Even with a very limited amount of data, the model achieves decent performance.
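
To put a number on "decent", NER models are commonly scored with precision, recall, and F1 over exact entity matches; spaCy reports these as ENTS_P, ENTS_R, and ENTS_F during training. The sketch below computes the same kind of score by hand on invented spans, with `span_scores` as a hypothetical helper.

```python
def span_scores(gold, predicted):
    """Exact-match precision, recall, and F1 over (start, end, label) spans."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example: one prediction is correct, one has the wrong boundaries.
gold = [(18, 27, "MEDICINE"), (32, 37, "SYMPTOM")]
predicted = [(18, 27, "MEDICINE"), (28, 31, "SYMPTOM")]

precision, recall, f1 = span_scores(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
# → P=0.50 R=0.50 F1=0.50
```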

Conclusion

In this article, you learned how named entity recognition works and where it can be used, and then trained a custom NER model for extracting medical entities from journal text. You can use the same process to train a custom NER model for your own applications; you'll just need some annotated data. If you can't find a pre-existing dataset for your use case, you can create your own with a data annotation tool.

‍


