Sentiment Analysis Using Python

Aditya Singh

March 14, 2024

Apr 21, 2025

Learn how to use sentiment analysis to mine insights from tweets and news articles.

By the end of this tutorial, you will be able to write a Sentiment Analysis pipeline using NLTK and transformers: it will detect public sentiment around companies from news headlines and tweets.

DEMO:

animated twitter and news sentiment score plot

Learn how to use sentiment analysis to mine insights from different data sources. We will write a Python script to analyze tweets and news articles to learn about the public sentiment around some tech companies.

In this tutorial, we will:

build a data pipeline to fetch tweets from Twitter and articles from top news publications
clean and normalize the text data
apply the pre-trained FinBERT sentiment analysis model, loaded through the transformers library
visualize the sentiment results

What Is Sentiment Analysis?

Sentiment analysis is the automated text analysis process that identifies and quantifies subjective information in text data.

When we need to understand what someone thinks about a product, service, or company, we get their feedback and store it in the form of an ordinal data point. In fact, most feedback forms and reviews have some form of this:

Nowadays, simple data points are not always representative of customer satisfaction. But people love to share their opinions on social media, so why not use that? That’s where sentiment analysis comes in handy. Sentiment analysis aims to quantify the sentiment, opinion, or judgment expressed in what people write online. So it’s no surprise that the most common type of sentiment analysis is polarity detection, which classifies the sentiment of a text as Positive, Negative, or Neutral.
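
To make polarity detection concrete, here is a deliberately naive sketch that counts positive and negative words. The word lists and the `polarity` function are invented for illustration; this is not how FinBERT or any production model works, only the shape of the input and output.

```python
# Toy word lists, invented for illustration only; real models learn these cues from data.
POSITIVE_WORDS = {"love", "great", "good", "amazing", "happy"}
NEGATIVE_WORDS = {"hate", "terrible", "bad", "awful", "broken"}


def polarity(text):
    """Classify text as Positive, Negative, or Neutral by counting cue words."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    score = sum(w in POSITIVE_WORDS for w in words) - sum(w in NEGATIVE_WORDS for w in words)
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"


print(polarity("I love this phone, the camera is great!"))         # Positive
print(polarity("The update is terrible and my screen is broken"))  # Negative
print(polarity("The store opens at nine"))                         # Neutral
```

A word-counting rule like this breaks on negation and sarcasm, which is one reason the rest of the tutorial relies on a pre-trained model instead.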

Check out the sentiment analysis model below which tags this tweet as Negative:

Where Is Sentiment Analysis Used?

Marketing: Companies often use sentiment analysis to develop their marketing strategies, and to check how well they perform. In addition to that, sentiment analysis also helps the companies get a better grasp of how well their products and services are being received by the customers.

Politics: For the longest time, pre-election polls served as the only means of evaluating where the candidates stand in an upcoming election. These are rather inaccurate and can be deceiving as they are at the mercy of the voter turn-out. Sentiment analysis makes this process easier by leveraging the free-flowing political discourse on social networking sites.

Public Actions: As dystopian as it may seem, sentiment analysis can be used to look out for “destructive” tendencies in public rallies, protests, and demonstrations.

Twitter Sentiment Analysis With Python

Social networking platforms like Twitter enable businesses to engage with users. But there’s a lot of data, so it can be hard for brands to decide which tweets or mentions to respond to first. That's why sentiment analysis has become an essential part of social media marketing strategies.

Let's start by configuring the data pipeline to get some tweets.

Python Libraries Stack and Set-Up

We’ll be using the following libraries:

Tweepy - A convenient Python library for accessing the Twitter API
NLTK - Natural Language Toolkit, a library for natural language processing. We use it for text cleaning and tokenization
transformers - Python library that provides thousands of pre-trained transformer models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, text generation, and more in over 100 languages. We will be using the ProsusAI/finbert model for financial sentiment analysis.
wordcloud - Python module for creating word clouds
plotly - Visualization library
newscatcherapi - An easy-to-use Python library for fetching news articles programmatically.

The code for this article can be found here.

Let's start by installing all required libraries.

pip install -r requirements.txt


After that import all of them into your working environment.

	import os
	import re
	import tweepy
	from tweepy import OAuthHandler
	import numpy as np
	import pandas as pd

	# text treatment
	import nltk
	from nltk.tokenize import word_tokenize
	import string
	from nltk.corpus import stopwords
	from nltk.stem.porter import PorterStemmer
	from nltk.sentiment.vader import SentimentIntensityAnalyzer

	#Wordcloud
	from wordcloud import WordCloud, ImageColorGenerator
	import matplotlib.pyplot as plt
	from PIL import Image

	# Graphs
	import plotly.io as pio
	pio.renderers.default='browser'
	import plotly.express as px

	# transformers
	from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline, AutoModelForTokenClassification


To work with the Twitter API, we need access tokens. For that, we have to apply for a developer account. Here’s the official documentation for getting started with the Twitter API.

	consumer_key = os.environ['CONSUMER_KEY']
	consumer_secret = os.environ['CONSUMER_SECRET']
	access_token = os.environ['ACCESS_TOKEN']
	access_token_secret = os.environ['ACCESS_SECRET']


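	# Written against Tweepy 3.x. Tweepy 4 and later removed the
	# wait_on_rate_limit_notify argument, so drop it there.
	auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
	auth.set_access_token(access_token, access_token_secret)
	api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)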

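Reading `os.environ[...]` directly raises a bare `KeyError` on the first missing variable. As a small optional sketch, the helper below checks all four variables up front and reports every missing one at once. The function name `load_twitter_credentials` is our own and not part of Tweepy.

```python
import os

# Names of the environment variables read in the snippet above.
REQUIRED_VARS = ("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")


def load_twitter_credentials(env=os.environ):
    """Return the four credentials as a dict, or raise one error naming every missing variable."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


# Demonstration with a made-up mapping instead of the real environment
fake_env = {"CONSUMER_KEY": "a", "CONSUMER_SECRET": "b", "ACCESS_TOKEN": "c", "ACCESS_SECRET": "d"}
credentials = load_twitter_credentials(fake_env)
print(sorted(credentials))
```

In the real script you would call `load_twitter_credentials()` with no argument and pass the values on to the Tweepy authentication calls.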

Get Trends

Trend hijacking is a growth-hacking/marketing strategy in which the company or individual hops on a trending meme or ‘challenge’ to capitalize on the trend’s organic traffic.

To look for trends, we use the trends_place() method, which takes a Where On Earth IDentifier (WOEID) as its id argument. A WOEID is a unique 32-bit reference identifier for a location on Earth. For this example, we’ll use New York’s WOEID (2459115).

You can use https://www.findmecity.com to find the WOEID for whatever location you want to look up the trends for.

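	# Tweepy 3.x call; Tweepy 4 and later renamed it to api.get_place_trends()
	trends_result = api.trends_place(id=2459115)[0]['trends']
	# Map each trend name to its tweet volume
	trends = {}
	for i in trends_result:
	    trends[i['name']] = i['tweet_volume']
	# Join the trend names into one string for the word cloud
	trends_names = ' '.join(list(trends.keys()))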


The prominent trends don’t exactly pop out at us in this form. Let’s visualize them as a word cloud.

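	# Load the hashtag-shaped mask image and turn pure-black pixels white;
	# WordCloud does not draw on white areas of the mask
	hashtag = np.array(Image.open("hashtag.jpg"))
	hashtag[hashtag == 0] = 255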

	wordcloud = WordCloud(background_color="white",max_words=200, mask=hashtag, contour_width=3, contour_color='firebrick', collocations=False).generate(trends_names)
	plt.figure(figsize=[20,10])
	plt.imshow(wordcloud, interpolation="bilinear")
	plt.axis("off")
	plt.show()


Can you guess what day of the week the code was written?
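
The word cloud above is built from the joined trend names, so every word carries the same weight. One possible refinement, sketched here with made-up data, is to weight each trend by its `tweet_volume`. The sketch assumes that some trends come back without a volume (`None`), and the helper names are our own.

```python
def trend_frequencies(trends):
    """Keep only trends that report a tweet volume, as a name -> volume mapping."""
    return {name: volume for name, volume in trends.items() if volume}


def top_trends(trends, n=3):
    """Return the n trend names with the highest reported volume."""
    freqs = trend_frequencies(trends)
    return sorted(freqs, key=freqs.get, reverse=True)[:n]


# Hypothetical sample in the same shape as the `trends` dict built earlier
sample_trends = {
    "#MondayMotivation": 48200,
    "Halloween": 151000,
    "#NYCMarathon": None,  # volume not reported
    "Subway": 12900,
}

print(top_trends(sample_trends, n=2))  # ['Halloween', '#MondayMotivation']

# The filtered mapping could then be passed to
# WordCloud(...).generate_from_frequencies(trend_frequencies(trends))
# so that font size follows tweet volume.
```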

Get Tweets

Trends aside, when we are concerned about the public sentiment around a product or company, we often have to explicitly look up all tweets that mention it. We can search for tweets containing particular phrases using the Cursor object.

Let's say we want to analyze the sentiment around three big tech companies: Apple, Amazon, and Facebook. We’ll fetch a thousand tweets for each. But before we do that, let’s fetch one tweet, print it, and familiarize ourselves with its structure.

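	# Cursor.items() returns an iterator, so collect it into a list before indexing
	tweets = list(tweepy.Cursor(api.search, q='Apple', tweet_mode='extended').items(1))
	one_tweet = tweets[-1]
	# Raw JSON payload of the tweet as a dict
	one_tweet._json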


There’s no lack of useful attributes in the tweet, but we don’t need most of them for our purpose. Let's create a function that extracts the relevant information from many tweets at once.

	variables_we_need = ['created_at', 'id', 'full_text', 'entities', 'user', 'coordinates', 'retweet_count', 'favorite_count', 'lang']

	def get_all_tweets(count=100, q='', lang='', since='', until=None, tweet_mode='extended'):
	    # `until` is accepted here because the calls further down pass it
	    results = []

	    tweets = tweepy.Cursor(api.search, q=q, lang=lang, since=since, until=until, tweet_mode=tweet_mode).items(count)

	    for tweet in tweets:
	        # Keep only the attributes listed in variables_we_need
	        d = {}
	        for variable in variables_we_need:
	            d[variable] = tweet._json[variable]
	        results.append(d)

	    return results

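Indexing with `tweet._json[variable]` raises a `KeyError` if a payload ever lacks one of the keys. A more forgiving variant, sketched below on a made-up and heavily trimmed payload, uses `dict.get` so that absent keys become `None`. The names `FIELDS` and `extract_fields` are hypothetical.

```python
FIELDS = ["created_at", "id", "full_text", "retweet_count", "favorite_count", "lang"]


def extract_fields(tweet_json, fields=FIELDS):
    """Copy the selected keys from a tweet payload; absent keys become None."""
    return {field: tweet_json.get(field) for field in fields}


# Hypothetical payload, far smaller than what the API actually returns
sample = {
    "id": 1,
    "full_text": "Apple unveils new MacBook",
    "lang": "en",
    "retweet_count": 3,
    "source": "web",
}

row = extract_fields(sample)
print(row["full_text"])       # Apple unveils new MacBook
print(row["favorite_count"])  # None
```

Whether a silent `None` or a loud error is the better choice depends on how much you trust the upstream data; for exploratory analysis the forgiving version is usually easier to work with.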

Now let’s use this function to extract 1000 tweets for each of the three big tech companies.

	results_apple = get_all_tweets(count=1000, q='apple', tweet_mode='extended', lang='en', since='2021-10-25', until='2021-10-31')
	results_facebook = get_all_tweets(count=1000, q='facebook', tweet_mode='extended', lang='en', since='2021-10-25', until='2021-10-31')
	results_amazon = get_all_tweets(count=1000, q='amazon', tweet_mode='extended', lang='en', since='2021-10-25', until='2021-10-31')
	results_apple[0]


Cleaning & Normalizing The Tweets

The tweet text in its current form isn’t the most conducive to analysis. We need to clean it before applying our sentiment analysis models. Here are the normalization operations we will apply to the tweets:

Delete @account_name at the beginning of retweets and any account mentions in the tweet text. This information isn’t needed for sentiment analysis and can be found in the ‘entities’ attribute of the tweets.
Extract and store links in a different attribute and delete them from the tweet.
Tokenize the remaining text.
Remove punctuation and stopwords.
Remove tokens that are not purely alphabetic.

There are no hard-and-fast rules for normalizing text; these are just the steps that work best for our use case. Let’s create a function that does all these cleaning steps and use it on the tweets extracted above:

	def clean_text(text, all_mentions):
	    # If retweet, delete the leading "RT @account:" prefix. The pattern is anchored
	    # and non-greedy so it stops at the first colon, not the one inside "https:"
	    text = re.sub(r'^RT\s.*?:', '', text)
	    # Find all links and delete them
	    all_links = re.findall(r'(https?:.*?)\s', text + ' ')

	    for i in all_links:
	        text = text.replace(i, '')

	    # Delete account mentions
	    for i in all_mentions:
	        text = text.replace('@' + i, '')

	    # Tokens (word_tokenize and stopwords need their NLTK data, fetched with nltk.download)
	    tokens = word_tokenize(text.replace('-', ' '))
	    # convert to lower case
	    tokens = [w.lower() for w in tokens]
	    # remove punctuation from each word
	    table = str.maketrans('', '', string.punctuation)
	    stripped = [w.translate(table) for w in tokens]
	    # remove remaining tokens that are not alphabetic
	    words = [word for word in stripped if word.isalpha()]
	    # filter out stop words
	    stop_words = set(stopwords.words('english'))
	    words = [w for w in words if w not in stop_words]
	    phrase = " ".join(words)

	    return phrase, all_links

	for i in results_apple:
	    i['clean_text'], i['all_link'] = clean_text(i['full_text'], [j['screen_name'] for j in i['entities']['user_mentions']])
	for i in results_facebook:
	    i['clean_text'], i['all_link'] = clean_text(i['full_text'], [j['screen_name'] for j in i['entities']['user_mentions']])
	for i in results_amazon:
	    i['clean_text'], i['all_link'] = clean_text(i['full_text'], [j['screen_name'] for j in i['entities']['user_mentions']])

	results_apple[0]

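To see what these steps do to a concrete tweet without installing anything, here is a dependency-free sketch of the same idea. It is a simplification and not the tutorial's function: it swaps `word_tokenize` for a plain whitespace split and NLTK's English stopword list for a tiny hand-written set, and the sample tweet is made up.

```python
import re
import string

# Tiny illustrative stopword list; the tutorial uses NLTK's English list instead.
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "in", "on", "for", "at", "this"}


def clean_tweet(text, mentions=()):
    """Return (cleaned text, list of links) for one tweet."""
    # Drop a leading retweet prefix such as "RT @account: "
    text = re.sub(r"^RT\s+@\w+:\s*", "", text)
    # Store the links, then delete them from the text
    links = re.findall(r"https?://\S+", text)
    text = re.sub(r"https?://\S+", "", text)
    # Delete account mentions
    for name in mentions:
        text = text.replace("@" + name, "")
    # Split on whitespace, strip punctuation, lower-case
    table = str.maketrans("", "", string.punctuation)
    tokens = [w.translate(table).lower() for w in text.replace("-", " ").split()]
    # Keep alphabetic tokens that are not stopwords
    words = [w for w in tokens if w.isalpha() and w not in STOP_WORDS]
    return " ".join(words), links


sample = "RT @techfan: The new iPhone camera is amazing, thanks @Apple! https://t.co/abc123"
cleaned, links = clean_tweet(sample, mentions=["Apple"])
print(cleaned)  # new iphone camera amazing thanks
print(links)    # ['https://t.co/abc123']
```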

Sentiment Analysis Using Transformers

We can now do sentiment analysis on the cleaned tweet text. We’ll be using the FinBERT model for this. It is a transformer model trained on a large financial corpus. Let’s start by creating the sentiment analysis pipeline using the appropriate tokenizer and model. Then we’ll create a function that applies the pipeline to each tweet.

	# Download and load the pre-trained FinBERT model
	tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")

	model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")

	nlp = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

	possible_sentiments = ['negative', 'neutral', 'positive']

	# Get sentiments
	def get_sentiments(input_dict, variable_text):

	    for item_ in input_dict:
	        # Run the pipeline defined above; it returns the top label and its score
	        sentiment = nlp(item_[variable_text])
	        for item in sentiment:
	            for shade in possible_sentiments:
	                # Store the score under the predicted label and 0 under the other two
	                if item['label'] == shade:
	                    item_[shade] = item['score']
	                else:
	                    item_[shade] = 0

	    return input_dict

	results_apple = get_sentiments(results_apple, 'clean_text')
	results_facebook = get_sentiments(results_facebook, 'clean_text')
	results_amazon = get_sentiments(results_amazon, 'clean_text')

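Downloading the model takes a while, so it can help to test the bookkeeping separately from the model. The sketch below expresses the same label-to-column logic with the classifier passed in as an argument and a stand-in classifier that returns a fixed, made-up prediction. The names `one_hot_scores`, `add_sentiments`, and `fake_classifier` are our own, and the sketch assumes the classifier returns a list with one `{'label', 'score'}` dict per input string.

```python
POSSIBLE_SENTIMENTS = ("negative", "neutral", "positive")


def one_hot_scores(prediction):
    """Put the score under the predicted label and 0.0 under the other two."""
    return {
        label: (prediction["score"] if prediction["label"] == label else 0.0)
        for label in POSSIBLE_SENTIMENTS
    }


def add_sentiments(records, text_key, classify):
    """Add the three sentiment columns to every record, using any classifier callable."""
    for record in records:
        prediction = classify(record[text_key])[0]
        record.update(one_hot_scores(prediction))
    return records


def fake_classifier(text):
    """Stand-in for the real pipeline; always returns the same made-up prediction."""
    return [{"label": "negative", "score": 0.91}]


rows = add_sentiments([{"clean_text": "shares fall earnings miss"}], "clean_text", fake_classifier)
print(rows[0]["negative"], rows[0]["neutral"], rows[0]["positive"])  # 0.91 0.0 0.0
```

Note that only the winning label keeps a score here, so the other two columns are zeros, not the model's actual probabilities. That matters when these columns are averaged later.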

Here is what one tweet looks like now:

Visualize The Results

Viewing the sentiment of individual tweets doesn’t really help us establish the overall public sentiment. Visualizing the overall score for each company is much more useful.

Convert the dictionaries into pandas DataFrames for ease of slicing and manipulation, and then concatenate them.

	# Create DataFrames
	apple_tweets_pd = pd.DataFrame(results_apple).loc[:, ['negative', 'neutral', 'positive']]
	facebook_tweets_pd = pd.DataFrame(results_facebook).loc[:, ['negative', 'neutral', 'positive']]
	amazon_tweets_pd = pd.DataFrame(results_amazon).loc[:, ['negative', 'neutral', 'positive']]

	# Concatenate the per-company mean scores into one table
	total_score_tweets = pd.concat([apple_tweets_pd.mean(), facebook_tweets_pd.mean(), amazon_tweets_pd.mean()], axis=1)
	total_score_tweets = total_score_tweets.transpose()
	total_score_tweets = total_score_tweets.reset_index()
	total_score_tweets.columns = ['Company', 'negative', 'neutral', 'positive']
	total_score_tweets['Company'] = ['Apple', 'Facebook', 'Amazon']

	total_score_tweets

	# Visualize
	fig = px.histogram(total_score_tweets,
	                   x='Company',
	                   title='Sentiment Score by Company | Tweets',
	                   y=['negative', 'neutral', 'positive'],
	                   barmode='group',
	                   color_discrete_sequence=["red", "blue", "green"])
	fig.update_xaxes(title='Companies').update_yaxes(title='Sentiment score')
	fig.show()


That gives us:

bar chart of tweet sentiment scores per company

As we can see, most tweets carry no emotional attitude; they are simply neutral. We only extracted 1,000 tweets for each company, and a larger sample could give different results. Still, a trend is visible: Amazon has more negative tweets, whereas for Apple and Facebook the numbers of positive and negative tweets are almost equal.
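
The chart plots mean scores, while the reading above talks about numbers of tweets. The two usually move together but are not the same quantity. The sketch below computes both from rows shaped like our per-tweet results; the sample rows are made up and the helper names are our own.

```python
from collections import Counter

LABELS = ("negative", "neutral", "positive")


def mean_scores(rows):
    """Average each sentiment column over all rows (what the bar chart shows)."""
    return {label: sum(row[label] for row in rows) / len(rows) for label in LABELS}


def label_counts(rows):
    """Count how many rows have each label as their highest-scoring one."""
    counts = Counter(max(LABELS, key=lambda label: row[label]) for row in rows)
    return {label: counts.get(label, 0) for label in LABELS}


# Hypothetical rows: one score under the predicted label, zeros elsewhere
sample_rows = [
    {"negative": 0.0, "neutral": 0.9, "positive": 0.0},
    {"negative": 0.8, "neutral": 0.0, "positive": 0.0},
    {"negative": 0.0, "neutral": 0.7, "positive": 0.0},
    {"negative": 0.0, "neutral": 0.0, "positive": 0.6},
]

print(label_counts(sample_rows))  # {'negative': 1, 'neutral': 2, 'positive': 1}
print(mean_scores(sample_rows))
```

In this sample, negative and positive tie on count (one tweet each) but differ on mean score, because the mean also reflects how confident the model was.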

News Sentiment Analysis With Python

Another great source for gauging public sentiment is news articles. Let’s do the same kind of analysis for the three companies, this time on news articles. We’ll use the newscatcherapi package to fetch them; to work with it, you’ll need to get an API key from here.

The news text is well-written and isn’t littered with mentions, hashtags, and special characters. So we can skip the cleaning step and move straight to sentiment analysis.

	# Import Packages
	from newscatcherapi import NewsCatcherApiClient
	import time

	# Initialize NewsCatcher API
	newscatcherapi = NewsCatcherApiClient(x_api_key='YOUR-X-API-KEY')

	# Extract News
	apple_articles = []
	facebook_articles = []
	amazon_articles = []

	# Fetch pages 1-10 (up to 100 articles each) per company, pausing between requests
	for i in range(1, 11):
	    apple_articles.extend(newscatcherapi.get_search(q='(Apple AND company) OR "Apple Inc"',
	                                                    lang='en',
	                                                    from_='2021-10-25',
	                                                    to_='2021-10-31',
	                                                    page_size=100,
	                                                    page=i)['articles'])
	    time.sleep(1)

	    facebook_articles.extend(newscatcherapi.get_search(q='(Facebook AND company) OR "Facebook Inc"',
	                                                       lang='en',
	                                                       from_='2021-10-25',
	                                                       to_='2021-10-31',
	                                                       page_size=100,
	                                                       page=i)['articles'])

	    time.sleep(1)

	    amazon_articles.extend(newscatcherapi.get_search(q='(Amazon AND company) OR "Amazon Inc"',
	                                                     lang='en',
	                                                     from_='2021-10-25',
	                                                     to_='2021-10-31',
	                                                     page_size=100,
	                                                     page=i)['articles'])

	    time.sleep(1)

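The three near-identical calls in the loop can be folded into one helper. Below is a minimal sketch of such a helper, written against a stand-in fetch function so that it runs without an API key. `fetch_all_pages`, `fake_fetch`, and the sample headlines are all hypothetical, and stopping early on an empty page is our own assumption about how you would want the loop to behave.

```python
import time


def fetch_all_pages(fetch_page, max_pages=10, pause_seconds=0.0):
    """Collect articles page by page, stopping early when a page comes back empty."""
    articles = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:
            break
        articles.extend(batch)
        time.sleep(pause_seconds)  # be polite to the API between requests
    return articles


# Stand-in for a real API call: two pages of made-up headlines, then nothing
FAKE_PAGES = {
    1: [{"title": "Apple posts record quarter"}],
    2: [{"title": "Apple delays product launch"}],
}


def fake_fetch(page):
    return FAKE_PAGES.get(page, [])


articles = fetch_all_pages(fake_fetch)
print(len(articles))  # 2
```

With the real client, `fetch_page` could be a small function that wraps the `get_search(...)` call shown above and returns its `'articles'` list for the given page.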

And finally, visualize the scores.

	apple_articles_pd = pd.DataFrame(get_sentiments(apple_articles, 'title')).loc[:, ['negative', 'neutral', 'positive']]
	facebook_articles_pd = pd.DataFrame(get_sentiments(facebook_articles, 'title')).loc[:, ['negative', 'neutral', 'positive']]
	amazon_articles_pd = pd.DataFrame(get_sentiments(amazon_articles, 'title')).loc[:, ['negative', 'neutral', 'positive']]

	total_score_articles = pd.concat([apple_articles_pd.mean(), facebook_articles_pd.mean(), amazon_articles_pd.mean()], axis=1)
	total_score_articles = total_score_articles.transpose()
	total_score_articles = total_score_articles.reset_index()
	total_score_articles.columns = ['Company', 'negative', 'neutral', 'positive']
	total_score_articles['Company'] = ['Apple', 'Facebook', 'Amazon']

	# Sentiment Score
	total_score_articles

	# Graph
	fig = px.histogram(total_score_articles,
	                   x='Company',
	                   title='Sentiment Score by Company | News Articles',
	                   y=['negative', 'neutral', 'positive'],
	                   barmode='group',
	                   color_discrete_sequence=["red", "blue", "green"])
	fig.update_xaxes(title='Companies').update_yaxes(title='Sentiment score')
	fig.show()


bar chart of news sentiment scores per company

Conclusion

The goal of this article was to introduce you to a simplified process of Sentiment Analysis. Keep in mind that every analysis contains these steps:

Data Extraction

Data Cleaning

Model Application

Visualization

Our analysis focused on public sentiment towards three tech companies. It reveals the more sentimental/biased nature of news headlines: whereas the majority of the tweets are neutral, news headlines seem to be predominantly negative. Should we be concerned? 🤔
