Tutorial

March 14, 2024

How To Classify Text With Python, Transformers & scikit-learn

Aditya Singh

Head Of Product

What is text classification? How does do text classifiers work? How can you train your own news classification models?

There’s an enormous amount of text out there, and more and more of it is generated every day in the form of emails, social media posts, chats, websites, and articles. All these text documents are rich sources of information. But because of the unstructured nature of the text, understanding and analyzing it is difficult and time-consuming. as a result, most companies are not able to leverage this invaluable source of information. This is where Natural Language Processing (NLP) methods like text classification come in.

What Is Text Classification?

Text classification, also known as text categorization or text tagging, is the process of assigning a text document to one or more categories or classes. It enables organizations to automatically structure all types of relevant text in a quick and inexpensive way. By classifying their text data, organizations can get a quick overview of the trends and also narrow down sections for further analysis.

Let’s say you are hired by an organization to help them sort through their customer reviews. First things first, if the organization has a global customer base and there are reviews written in multiple languages, dividing them based on their language will help with downstream tasks.

as uplifting and fulfilling as positive reviews are, they rarely contain any urgent issues that need to be addressed immediately. So, it might be a good idea to perform sentiment analysis and classify the reviews as positive and negative. Here your previous language classification will come in handy for making language-specific sentiment analysis models.

Once you have the negative reviews, you can further categorize them by feature/product mentioned. This will enable different teams to easily find relevant reviews and figure out what they’re doing wrong or what improvements need to be made.

How Does Text Classification Work?

There are two broad ways of performing text classification:

Manual
Automatic

Manual text classification involves a human annotator, who reads through the contents of the text documents and tags them. Now you might think that this method only causes problems when you’re working with large amounts of text. But you would be wrong to think so, here are two reasons:

Real-Time Analysis

as a side effect of being slow, manual classification hinders an organization’s ability to quickly identify and respond to critical situations. For instance, let’s say a cloud provider or an API goes down, if a person is sequentially going through the customer support tickets, it might take the organization a while to realize that their services are down.

Consistency

Humans are imperfect beings even on their best days. They’re prone to making mistakes due to factors like lack of sleep, boredom, distractions, etc. These can cause inconsistent classifications. For critical applications, these lapses can cost the organization thousands of dollars.

And beyond these disadvantages, human hours cost considerably more than running a Python script on a cloud server. So performing automatic text classification by applying natural language processing and other AI techniques is the best choice for most cases. Automatic classification is faster and more cost-effective. And on top of that, once a text classification model is trained to satisfactory specifications, it performs consistently.

There are many ways to automatically classify text documents, but all the methods can be classified into three types:

Rule-based methods
Machine learning-based methods
Hybrid methods

Rule-based Methods

Rule-based methods use a set of manually created linguistic rules to classify text. These rules consist of a pattern or a set of patterns for each of the categories. A very simple approach could be to classify documents based on the occurrences of category-specific words.

Say that you want to classify news articles into two classes: Business and Science. To do this you’ll need to create two lists of words that categorize each class. For instance, you could choose names of organizations like Goldman Sachs, Morgan Stanley, Apple, etc. for the business class, and for the science class, you could choose words like NasA, scientist, researchers, etc.

Now, when you want to classify a news article, the rule-based classifier will count the number of business-related words and science-related words. If the number of business-related words is greater than science-related words then the article will be classified as Business and vice versa.

rule-based classifier that uses count of topic-specific terms example

For example, the rule-based classifier will classify the news headline “

International Space Station: How NasA Plans To Destroy It” as science because there is one science-related word - “NasA” and no business-related words.

The biggest advantage of using rule-based systems is that there easy to understand by laymen. So once a bare-bones system is created, it can be incrementally improved over time. But the other side of this advantage is that the developers need deep knowledge of the domain to create the rules. Moreover, rule-based classification methods don’t scale well as adding new rules without proper testing can affect the results of older rules.

Machine Learning-Based Methods

Instead of defining the rules manually, you can choose a machine learning-based method that automatically learns the rules using past observations. Machine learning-based classifiers use labeled examples as training data to learn associations between words/phrases and the labels, i.e., the categories.

Now, that sounds easy enough to tackle, but there’s one problem you need to solve before you can train your machine learning classifier. Feature extraction. You see, computers don’t understand the text as we do, they only understand numbers, 0s & 1s. In the case of computer vision problems, the images are internally stored in the form of numbers representing the values of the individual pixels.

But that isn’t the case for text. So, the first step for training an NLP classifier is to transform the text into a numerical vector representation. One of the most commonly used text embedding methods is bag of words. It creates a vector that counts the occurrences of each word in a predefined dictionary.

Let's say you define the dictionary as: “(What, a, sunny, serene, beautiful, day, night)”, and you want to create the vector embedding of the sentence “What a serene night”. You would end up with the following vector representation for the sentence: (1, 1, 0, 1, 0, 0, 1).

After you have generated the vector representation of all the labeled text documents, you can use them to train a classifier. The vector representation of text documents is passed to the classifier with their correct categories. The model learns the associations between different linguistic features in the text and the categories:

Once the model has been trained up to the required performance standards, it can be used to make accurate predictions. The same feature extraction method is used to create the vector representation of the new text documents. The classification model uses these feature vectors to predict the categories of the documents.

machine learning classifier prediction illustration

Machine learning-based text classification methods are generally more accurate than rule-based classifiers. Besides that, machine learning classifiers are easier to scale as you can simply add new training examples to update the model. The only problem with machine learning classifiers is that they are hard to understand and debug. So if something goes wrong, it can be hard to figure out what caused the issue.

Hybrid Methods

Hybrid text classification methods combine the best of both worlds. They combine the generalization ability of the machine learning classifier with the easy-to-understand and tweak rule-based methods. They can learn complex rules thanks to the machine learning model, and any conflicting classifications or erratic behavior can be fixed using rules.

Take for instance the task of classifying financial news articles based on their industrial sectors such as pharma, finance, automotive, mining, etc. To do this you can create a hybrid system. First, train a named entity recognition model that extracts company names from the news articles. Then, create a list of the companies in each sector. And that's it, using these two things you can create a competent classifier.

Classifying News Headlines With Transformers & scikit-learn

Firstly, install spaCy wrapper for sentence transformers, spacy-sentence-bert, and the scikit-learn module.

And get the data here.

You'll be working with some of our old Google News data dumps. The news data is stored in the JSONL format. Normally you can load JSONL files into a DataFrame using the read_json method with the lines=True argument, but our data is structured a bit weirdly.

Each of the individual entries is stored within an Object (enclosed within {'item':}). So, if you try to load it straight away with the read_json method it will load all of it as one column.

output after loading the data with read_json method

To fix this, you can read the file like a normal text file and use the json.loads method to create a list of dictionaries.

example output of the list of dictionaries

There is still an issue to deal with 😬. The value of the attributes for each headline is stored inside its own object. Fret not, you can quickly fix it with some more dictionary manipulation.

Now you can create a DataFrame out of the news data.

first five rows of the news headlines dataframe

Let's take a look at all the topic labels.

You don't really need the other attributes like country, lang, or cleaned_url. And the topic labels-'WORLD' and 'NATION' are not very informative as their definition is very loose. So, you can drop them off.

Let's check for missing values and drop incomplete entries.

Next, you should take a look at how many training examples we have for each category.

bar plot of the number of training examples we have for each topic

More than 350000 articles for the two biggest categories and even the small group has about 40000 examples!

What about the distribution of headline lengths?

distribution plot of the lengths of the news headlines

Looks "normal". Get it?

Jokes aside, as you're going to use the state-of-the-art BERT transformer model to create the vector representations, I recommend that you train the models with a subset of the available dataset.

Now you can load the BERT sentence transformer and create the vector embedding for the headlines.

Let's split the data into training and testing sets.

And finally, you can train your choice of machine learning classifier.

The Support Vector Classifier(SVM) comes out on top. You can use this to predict the topic of new news headlines.

Let's try it out on a bunch of news headlines from Google News:

Conclusion

There you have it. You now know what text classification is, how it works, and how you can train your own machine learning text classifiers. But hey, what about neural networks? That's a good question! The thing is, as good as neural networks are they are not really necessary for this task. Both SVMs and NNs can approximate non-linear decision boundaries, and they both achieve comparable results on the same dataset. Neural networks can pull ahead in performance but they need a lot more computational power, training data, and time.

On the other hand, support vector machines can reliably identify the decision boundary based on the sole support vectors. Therefore, you can train an SVM classifier with a small subset of the data you would need to train a neural network that achieves similar performance. If, however, any marginal increase in performance is beneficial for your application feel free to go with neural networks.

‍

Also interesting

all articles

Black thin grid lines forming diamond-shaped pattern on a white background.

Product

July 14, 2026

All 272 Security Breaches in 3 Days: How CatchAll Found What Others Missed

Engineering Team

Company

July 6, 2026

Structured Data Extraction from Web Search Results: JSON Schemas, Validation Prompts, and What Goes Wrong

Artem Bugara CEO & co-founder

Company

July 1, 2026

What Is Recall in AI Search? Why Your AI Agent Might Be Missing 80% of Results

Margaretha Boetticher Head of Growth

Tutorial

June 23, 2026

How to Track New Local Business Openings: Build an Automated Local Business Tracker

Engineering Team

Company

June 15, 2026

Web Search API for Risk Monitoring: How Risk Teams Catch Signals Early

Artem Bugara CEO & co-founder

Tutorial

June 10, 2026

How to Evaluate Your AI Agent's Web Search Quality (Without Manual Labeling)

Artem Bugara CEO & co-founder

Product

June 5, 2026

Web Scraping API vs. Custom Scraper: Which One Should You Use?

Margaretha Boetticher Head of Growth

Tutorial

June 2, 2026

How Investment Teams Use Web Search APIs for Real-Time Market Intelligence

Margaretha Boetticher Head of Growth

Tutorial

May 27, 2026

How to Build a Deep Research Agent with CatchAll and LangChain

Artem Bugara CEO & co-founder

Tutorial

May 25, 2026

How to Monitor M&A Activity: Build an Automated Mergers & Acquisitions Tracker

NewsCatcher

Company

May 5, 2026

Best Web Search API: An In-Depth Comparison of Available Tools in 2026

Margaretha Boetticher Head of Growth

Product

April 29, 2026

Web Scraping API vs Web Search API: A Developer's Guide to Choosing the Right Tool

Margaretha Boetticher Head of Growth

Product

April 23, 2026

Web Search API Types: Three Architectures, One Confusing Name

Oleksandr Sirenko

Product

April 20, 2026

Introducing Company Watchlist: Scope Any Query to Your List of Companies

Margaretha Boetticher Head of Growth

Company

April 14, 2026

What Is a Web Search API? A Guide for Developers and Analysts

Margaretha Boetticher Head of Growth

Product

April 8, 2026

Web Search API Benchmarks: Q1 2026 — CatchAll vs Exa, OpenAI, and More

Oleksandr Sirenko

Company

March 26, 2026

Why We're Building a Different Type of Web Index

Artem Bugara CEO & co-founder

Tutorial

February 25, 2026

Beyond the Scoreboard: Building a Live Olympics 2026 Incident and Medal Dashboard with CatchAll

NewsCatcher

Product

February 3, 2026

Google found 69 results. We found 3,261. Here's how

Engineering Team

Company

January 28, 2026

Why Recall Beats Precision for Real-World AI Research

Oleksandr Sirenko

Tutorial

January 14, 2026

Building a Deep Research Agent with CatchAll and CrewAI

NewsCatcher

Product

January 13, 2026

Evaluating Recall in Web Search APIs: OpenAI vs Exa vs Parallel AI vs CatchAll

NewsCatcher

Tutorial

December 29, 2025

Building a Supply Chain Risk Monitor Using CatchAll and CrewAI

NewsCatcher

Product

November 21, 2025

Introducing CatchAll: A SOTA Web Search API for Real-World Events

Margaretha Boetticher Head of Growth

Company

June 10, 2025

How Transparency International Uses NewsCatcher Data to Fight Health Corruption

Jonathan Cushing Programme Director

Product

March 14, 2025

Comparing News Data Search: LLMs, Analyst, and NewsCatcher Pipelines

Aditya Singh Head Of Product

Product

March 6, 2025

Measuring Product Launch Impact with News Data

Mariia Platonova Head of Marketing

Company

January 24, 2025

NewsCatcher Partners With Reworkd To Streamline Access To Actionable Web Data

Artem Bugara CEO & co-founder

Tutorial

January 22, 2025

Fake News Detection Using Python

Karthik Devan Tech Copywriter

Company

December 16, 2024

Top Media Outlets: 50 Essential News Sites to Consider for Your News Analysis in 2025

Mariia Platonova Head of Marketing

Product

December 9, 2024

How Does Our Local News API Work?

Aditya Singh Head Of Product

Tutorial

November 25, 2024

Detecting Events in News Using NewsCatcher’s Events Intelligence API

Aditya Singh Head Of Product

Product

November 5, 2024

Introducing NewsCatcher's Local News API

Aditya Singh Head Of Product

Company

October 15, 2024

How to Choose a News API

Artem Bugara CEO & co-founder

Product

September 17, 2024

Using Sentiment Analysis for Market Research

Artem Bugara CEO & co-founder

Company

August 8, 2024

60,000 AI-generated news articles are published every day

Bradley Emi CTO Pangram Labs

Product

May 7, 2024

Top 4 Free & Open-Source News API Alternatives

Artem Bugara CEO & co-founder

Tutorial

May 7, 2024

Ultimate Guide To Text Similarity With Python

Aditya Singh Head Of Product

Tutorial

May 7, 2024

Using News API For Share Of Voice (SOV) Measurement & Competitor Tracking

Artem Bugara CEO & co-founder

Tutorial

May 7, 2024

How To Train Custom Named Entity Recognition [NER] Model With SpaCy

Aditya Singh Head Of Product

Company

May 7, 2024

Top 15 Takeaways From Running A Bootstrapped Startup For 1 Year

Artem Bugara CEO & co-founder

Tutorial

May 7, 2024

Named Entity Recognition (NER) with SpaCy [with code example]

Aditya Singh Head Of Product

Product

May 7, 2024

How We Built A News API Beta In 60 Days

Artem Bugara CEO & co-founder

Tutorial

May 7, 2024

How To Annotate Entities With Spacy PhraseMatcher

Aditya Singh Head Of Product

Tutorial

May 7, 2024

How To Present/Show Open-Source Projects [Practical Guide]

Artem Bugara CEO & co-founder

Tutorial

May 7, 2024

Google Kubernetes Engine as an alternative to Cloud Run

Maksym Sugonyaka

Product

May 7, 2024

Google News RSS Search Parameters: The Missing Docs

Artem Bugara CEO & co-founder

Tutorial

May 7, 2024

Building A PR/Communication Media Monitoring Tool With News API

Artem Bugara CEO & co-founder

Product

May 7, 2024

100k+ Rows Topic Labeled News Dataset

Artem Bugara CEO & co-founder

Product

May 7, 2024

Announcing Free COVID-19 News API

Artem Bugara CEO & co-founder

Tutorial

March 14, 2024

SpaCy vs NLTK. Text Normalization Comparison [with code]

Aditya Singh Head Of Product

Tutorial

March 14, 2024

Top 6 Text Annotation Tools

Aditya Singh Head Of Product

Tutorial

March 14, 2024

Sentiment Analysis Using Python

Aditya Singh Head Of Product

Tutorial

March 14, 2024

Mining Financial Stock News Using SpaCy Matcher

Aditya Singh Head Of Product

Tutorial

March 14, 2024

Learning Natural Language Processing (NLP) Made Easy

Aditya Singh Head Of Product

Tutorial

March 14, 2024

How To Build Your Own Crypto News Aggregator

Aditya Singh Head Of Product

Tutorial

March 14, 2024

4 Python Web Scraping Libraries To Mining News Data

Aditya Singh Head Of Product

Also interesting

all articles

Product

July 14, 2026

All 272 Security Breaches in 3 Days: How CatchAll Found What Others Missed

Engineering Team

Company

July 6, 2026

Structured Data Extraction from Web Search Results: JSON Schemas, Validation Prompts, and What Goes Wrong

Artem Bugara

CEO & co-founder

Company

July 1, 2026

What Is Recall in AI Search? Why Your AI Agent Might Be Missing 80% of Results

Margaretha Boetticher

Head of Growth

Tutorial

June 23, 2026

How to Track New Local Business Openings: Build an Automated Local Business Tracker

Engineering Team

Company

June 15, 2026

Web Search API for Risk Monitoring: How Risk Teams Catch Signals Early

Artem Bugara

CEO & co-founder

Tutorial

June 10, 2026

How to Evaluate Your AI Agent's Web Search Quality (Without Manual Labeling)

Artem Bugara

CEO & co-founder

Product

June 5, 2026

Web Scraping API vs. Custom Scraper: Which One Should You Use?

Margaretha Boetticher

Head of Growth

Tutorial

June 2, 2026

How Investment Teams Use Web Search APIs for Real-Time Market Intelligence

Margaretha Boetticher

Head of Growth

Tutorial

May 27, 2026

How to Build a Deep Research Agent with CatchAll and LangChain

Artem Bugara

CEO & co-founder

Tutorial

May 25, 2026

How to Monitor M&A Activity: Build an Automated Mergers & Acquisitions Tracker

NewsCatcher

Company

May 5, 2026

Best Web Search API: An In-Depth Comparison of Available Tools in 2026

Margaretha Boetticher

Head of Growth

Product

April 29, 2026

Web Scraping API vs Web Search API: A Developer's Guide to Choosing the Right Tool

Margaretha Boetticher

Head of Growth

Product

April 23, 2026

Web Search API Types: Three Architectures, One Confusing Name

Oleksandr Sirenko

Product

April 20, 2026

Introducing Company Watchlist: Scope Any Query to Your List of Companies

Margaretha Boetticher

Head of Growth

Company

April 14, 2026

What Is a Web Search API? A Guide for Developers and Analysts

Margaretha Boetticher

Head of Growth

Product

April 8, 2026

Web Search API Benchmarks: Q1 2026 — CatchAll vs Exa, OpenAI, and More

Oleksandr Sirenko

Company

March 26, 2026

Why We're Building a Different Type of Web Index

Artem Bugara

CEO & co-founder

Tutorial

February 25, 2026

Beyond the Scoreboard: Building a Live Olympics 2026 Incident and Medal Dashboard with CatchAll

NewsCatcher

Product

February 3, 2026

Google found 69 results. We found 3,261. Here's how

Engineering Team

Company

January 28, 2026

Why Recall Beats Precision for Real-World AI Research

Oleksandr Sirenko

Tutorial

January 14, 2026

Building a Deep Research Agent with CatchAll and CrewAI

NewsCatcher

Product

January 13, 2026

Evaluating Recall in Web Search APIs: OpenAI vs Exa vs Parallel AI vs CatchAll

NewsCatcher

Tutorial

December 29, 2025

Building a Supply Chain Risk Monitor Using CatchAll and CrewAI

NewsCatcher

Product

November 21, 2025

Introducing CatchAll: A SOTA Web Search API for Real-World Events

Margaretha Boetticher

Head of Growth

Company

June 10, 2025

How Transparency International Uses NewsCatcher Data to Fight Health Corruption

Jonathan Cushing

Programme Director

Product

March 14, 2025

Comparing News Data Search: LLMs, Analyst, and NewsCatcher Pipelines

Aditya Singh

Head Of Product

Product

March 6, 2025

Measuring Product Launch Impact with News Data

Mariia Platonova

Head of Marketing

Company

January 24, 2025

NewsCatcher Partners With Reworkd To Streamline Access To Actionable Web Data

Artem Bugara

CEO & co-founder

Tutorial

January 22, 2025

Fake News Detection Using Python

Karthik Devan

Tech Copywriter

Company

December 16, 2024

Top Media Outlets: 50 Essential News Sites to Consider for Your News Analysis in 2025

Mariia Platonova

Head of Marketing

Product

December 9, 2024

How Does Our Local News API Work?

Aditya Singh

Head Of Product

Tutorial

November 25, 2024

Detecting Events in News Using NewsCatcher’s Events Intelligence API

Aditya Singh

Head Of Product

Product

November 5, 2024

Introducing NewsCatcher's Local News API

Aditya Singh

Head Of Product

Company

October 15, 2024

How to Choose a News API

Artem Bugara

CEO & co-founder

Product

September 17, 2024

Using Sentiment Analysis for Market Research

Artem Bugara

CEO & co-founder

Company

August 8, 2024

60,000 AI-generated news articles are published every day

Bradley Emi

CTO Pangram Labs

Product

May 7, 2024

Top 4 Free & Open-Source News API Alternatives

Artem Bugara

CEO & co-founder

Tutorial

May 7, 2024

Ultimate Guide To Text Similarity With Python

Aditya Singh

Head Of Product

Tutorial

May 7, 2024

Using News API For Share Of Voice (SOV) Measurement & Competitor Tracking

Artem Bugara

CEO & co-founder

Tutorial

May 7, 2024

How To Train Custom Named Entity Recognition [NER] Model With SpaCy

Aditya Singh

Head Of Product

Company

May 7, 2024

Top 15 Takeaways From Running A Bootstrapped Startup For 1 Year

Artem Bugara

CEO & co-founder

Tutorial

May 7, 2024

Named Entity Recognition (NER) with SpaCy [with code example]

Aditya Singh

Head Of Product

Product

May 7, 2024

How We Built A News API Beta In 60 Days

Artem Bugara

CEO & co-founder

Tutorial

May 7, 2024

How To Annotate Entities With Spacy PhraseMatcher

Aditya Singh

Head Of Product

Tutorial

May 7, 2024

How To Present/Show Open-Source Projects [Practical Guide]

Artem Bugara

CEO & co-founder

Tutorial

May 7, 2024

Google Kubernetes Engine as an alternative to Cloud Run

Maksym Sugonyaka

Product

May 7, 2024

Google News RSS Search Parameters: The Missing Docs

Artem Bugara

CEO & co-founder

Tutorial

May 7, 2024

Building A PR/Communication Media Monitoring Tool With News API

Artem Bugara

CEO & co-founder

Product

May 7, 2024

100k+ Rows Topic Labeled News Dataset

Artem Bugara

CEO & co-founder

Product

May 7, 2024

Announcing Free COVID-19 News API

Artem Bugara

CEO & co-founder

Tutorial

March 14, 2024

SpaCy vs NLTK. Text Normalization Comparison [with code]

Aditya Singh

Head Of Product

Tutorial

March 14, 2024

Top 6 Text Annotation Tools

Aditya Singh

Head Of Product

Tutorial

March 14, 2024

Sentiment Analysis Using Python

Aditya Singh

Head Of Product

Tutorial

March 14, 2024

Mining Financial Stock News Using SpaCy Matcher

Aditya Singh

Head Of Product

Tutorial

March 14, 2024

Learning Natural Language Processing (NLP) Made Easy

Aditya Singh

Head Of Product

Tutorial

March 14, 2024

How To Build Your Own Crypto News Aggregator

Aditya Singh

Head Of Product

Tutorial

March 14, 2024

4 Python Web Scraping Libraries To Mining News Data

Aditya Singh

Head Of Product

How To Classify Text With Python, Transformers & scikit-learn

What Is Text Classification?

How Does Text Classification Work?

Rule-based Methods

Machine Learning-Based Methods

Hybrid Methods

Classifying News Headlines With Transformers & scikit-learn

Conclusion

Also interesting

Also interesting

DEVELOPERS

Technology