Use NLP with AWS Comprehend to know the most famous U.S. candidate


500 million.

That's the number of tweets posted per day, with 50,000 new ones since you started reading this article. I tweet a lot myself.

It's a colossal amount of data, containing the opinions and thoughts of nearly 50 million US citizens.

With the growth of NLP (Natural Language Processing) tools such as AWS Comprehend, Google Cloud NLP or IBM Watson, we can analyze people's messages and know their thoughts.

With such tools, is it possible to know the popularity of Joe Biden and Donald Trump, the two candidates for president of the United States?

We will proceed in several stages:

  1. Find out how to retrieve tweets about both candidates
  2. Find out how to use AWS Comprehend to analyze them
  3. Develop a program with Node.js that performs all these actions and stores the results in a database
  4. Analyze results
  5. Identify procedural flaws

Retrieve Tweets 🐦

To get tweets to analyze, we are going to use the Twitter API.
To use the Twitter API, we need to sign up and register an app on the Twitter developers console to get API keys. Now we can use the Twit library to retrieve tweets:

const Twit = require("twit"); // import the Twit library

// Set up Twit with the credentials of your Twitter app
const T = new Twit({
  consumer_key: "YOUR API CONSUMER KEY",
  consumer_secret: "YOUR API CONSUMER SECRET",
  access_token: "YOUR API ACCESS TOKEN",
  access_token_secret: "YOUR API ACCESS TOKEN SECRET",
});

T.get(
  "search/tweets", // search recent tweets
  {
    q: "Donald Trump", // get tweets containing "Donald Trump"
    count: 10, // get the last 10 tweets
  },
  (err, data) => {
    if (err) return console.error(err); // data may be undefined on failure
    console.log(data.statuses); // our tweets! 🎉🎉🎉
  }
);

statuses now contains the last 10 tweets containing "Donald Trump", each with a lot of information, including the tweet's content in the text attribute.
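As a small sketch of the next step, the helper below keeps only what the analysis needs from each status. The name toDocuments and the sample data are made up for illustration; id_str and text are fields of the tweet objects returned by search/tweets. Using id_str rather than id avoids rounding, since tweet IDs are larger than the biggest integer a JavaScript number can represent exactly.

```javascript
// Hypothetical helper: reduce raw statuses to { id, text } pairs,
// dropping tweets whose text is missing or blank.
function toDocuments(statuses) {
  return statuses
    .filter((status) => typeof status.text === "string" && status.text.trim() !== "")
    .map((status) => ({ id: status.id_str, text: status.text }));
}

// Made-up sample shaped like the statuses of a search/tweets response
const sampleStatuses = [
  { id_str: "1257563324088123401", text: "First tweet" },
  { id_str: "1257563324088123402", text: "   " }, // blank, will be dropped
  { id_str: "1257563324088123403", text: "Second tweet" },
];

console.log(toDocuments(sampleStatuses));
```

Filtering out blank texts here is a design choice for this sketch: it avoids sending empty strings to the sentiment analysis later on.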


Analyze Sentiments of Tweets 🧠

Now that we have data to analyze, let's use AWS Comprehend to get the sentiment of a text's author.
To use Comprehend, we need to sign up on the AWS console and set up the AWS CLI.
Now we can use Comprehend from the AWS SDK:

const AWS = require("aws-sdk"); // import the AWS SDK

// Set up the Comprehend client
const comprehend = new AWS.Comprehend({
  apiVersion: "2017-11-27",
  region: "eu-west-1",
});

comprehend.detectSentiment(
  {
    Text: "I love apples", // the text to analyze
    LanguageCode: "en", // the language of the text
  },
  (err, data) => {
    if (err) return console.error(err); // stop here if the call failed
    console.log(data); // the sentiment of the author! 🎉🎉🎉
  }
);

In data we now have the following object:

{
  "Sentiment": "POSITIVE",  // The most present emotion
  "SentimentScore": {  // The score in each of the emotions
    "Positive": 0.9970256686210632,
    "Negative": 0.00012458743003662676,
    "Neutral": 0.0028411296661943197,
    "Mixed": 0.000008623201210866682
  }
}

In this case, the text "I love apples" is detected as POSITIVE with a confidence score of 0.997.
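To read that object programmatically, something like the following could work. It is only a sketch: dominantSentiment is a hypothetical helper name, and the input is the response shown above for "I love apples".

```javascript
// Hypothetical helper: extract the dominant label and its score
// from a detectSentiment response.
function dominantSentiment(data) {
  const label = data.Sentiment; // e.g. "POSITIVE"
  // SentimentScore keys are capitalized: "Positive", "Negative", ...
  const key = label.charAt(0) + label.slice(1).toLowerCase();
  return { label, confidence: data.SentimentScore[key] };
}

// The response shown above for "I love apples"
const appleResponse = {
  Sentiment: "POSITIVE",
  SentimentScore: {
    Positive: 0.9970256686210632,
    Negative: 0.00012458743003662676,
    Neutral: 0.0028411296661943197,
    Mixed: 0.000008623201210866682,
  },
};

console.log(dominantSentiment(appleResponse));
```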


The complete code 🚀

You can get the complete code at this address: opinion-analyzer. It does the following things:

  • Let the user enter the subject and the number of tweets to retrieve:
node index.js --subject "Donald Trump" --tweets 100
  • Retrieve these tweets
  • Get the sentiment of each tweet
  • If the sentiment of a tweet is Positive or Negative, store it in a lightweight JSON database using lowdb. If the sentiment is Neutral or Mixed, ignore it.
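For the first step, reading the two command-line options could look roughly like this. This is a hand-rolled sketch, not necessarily how opinion-analyzer does it: parseCliOptions is a hypothetical function, and it assumes every option is passed as a "--name value" pair.

```javascript
// Hypothetical parser for: node index.js --subject "Donald Trump" --tweets 100
// Assumes every option is given as a "--name value" pair.
function parseCliOptions(argv) {
  const options = {};
  for (let i = 0; i < argv.length; i += 2) {
    const name = argv[i].replace(/^--/, ""); // "--subject" -> "subject"
    options[name] = argv[i + 1];
  }
  const tweets = Number.parseInt(options.tweets, 10);
  if (!options.subject || !Number.isInteger(tweets) || tweets <= 0) {
    throw new Error('Usage: node index.js --subject "<text>" --tweets <number>');
  }
  return { subject: options.subject, tweets };
}

// In a real script, process.argv.slice(2) would hold these values
console.log(parseCliOptions(["--subject", "Donald Trump", "--tweets", "100"]));
```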

With a crontab, I launch this script every 6 hours to retrieve 25 tweets about Donald Trump and 25 about Joe Biden. The database has the following format:

{
  "Donald Trump": {  // The subject
    "sinceId": 1257563324088123400,  // The ID of the last tweet retrieved
    "5-5-2020": {  // The date of the extract
      "positive": 0.08,  // The rate of positive tweets
      "negative": 0.32,  // The rate of negative tweets
      "count": 50  // The number of tweets analyzed
    }
  },
  "Joe Biden": {
    "sinceId": 1257563449518768000,
    "5-5-2020": {
      "positive": 0.1,
      "negative": 0.24,
      "count": 50
    }
  }
}
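A daily entry like the ones above could be derived from the sentiment labels along these lines. This is a sketch under one assumption that the excerpt suggests but does not state: the rates are computed over all analyzed tweets, including the Neutral and Mixed ones (which is why positive and negative do not add up to 1). The helper name dailyRecord and the sample labels are made up.

```javascript
// Hypothetical helper: turn one day's sentiment labels into a record.
// Assumption: rates are computed over all analyzed tweets,
// including NEUTRAL and MIXED ones.
function dailyRecord(labels) {
  const count = labels.length;
  if (count === 0) return { positive: 0, negative: 0, count: 0 };
  const positives = labels.filter((label) => label === "POSITIVE").length;
  const negatives = labels.filter((label) => label === "NEGATIVE").length;
  return { positive: positives / count, negative: negatives / count, count };
}

// Made-up labels: 4 positive, 16 negative and 30 neutral tweets
const sampleLabels = [
  ...Array(4).fill("POSITIVE"),
  ...Array(16).fill("NEGATIVE"),
  ...Array(30).fill("NEUTRAL"),
];

console.log(dailyRecord(sampleLabels));
```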

The Results 📊

After running for a week, here are the results for Donald Trump:

[Chart: Donald Trump Popularity]

And here's Joe Biden's:

[Chart: Joe Biden Popularity]

As we can see, Joe Biden has significantly more positive tweets (about 362% more) and fewer negative ones (about 38% fewer) than Donald Trump.

One surprising thing is that both Donald Trump and Joe Biden have many more negative tweets than positive ones (1672% more for Donald Trump and 287% more for Joe Biden).
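For reference, "X% more" comparisons like these are relative differences. The sketch below applies the formula to the single "5-5-2020" entries from the database excerpt, not to the full week of data, so its output differs from the figures quoted above. percentMore is a hypothetical helper name.

```javascript
// Relative difference of `value` compared to `reference`, in percent.
// A negative result means `value` is lower than `reference`.
function percentMore(value, reference) {
  return ((value - reference) / reference) * 100;
}

// Rates from the "5-5-2020" entries of the database excerpt
const trumpDay = { positive: 0.08, negative: 0.32 };
const bidenDay = { positive: 0.1, negative: 0.24 };

// Negative vs positive tweets for each candidate on that day
console.log(percentMore(trumpDay.negative, trumpDay.positive).toFixed(0) + "%");
console.log(percentMore(bidenDay.negative, bidenDay.positive).toFixed(0) + "%");

// Joe Biden compared to Donald Trump on that day
console.log(percentMore(bidenDay.positive, trumpDay.positive).toFixed(0) + "%");
console.log(percentMore(bidenDay.negative, trumpDay.negative).toFixed(0) + "%");
```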


The Flaws 😬

OK, so Joe Biden is more popular than Donald Trump?
Hmm, not so sure.

There are a lot of factors that invalidate the previous results:

  • Not everyone is on Twitter: "only" 22% of US adults are.
  • Only active tweeters are taken into account, not users who don't post.
  • People react more when they feel negative emotions, which helps explain why there are so many more negative tweets than positive ones.
  • The program only runs 4 times a day. To be really accurate, it should take into account all new tweets, regardless of the hour.
  • There is not enough data. I would need thousands or even millions of tweets to get convincing results, but I don't really want to spend my scholarship on my Amazon bill, so we'll stay at 100 per day.

Finally, Comprehend doesn't handle sarcasm well. For example, a tweet containing "Of course I love Donald Trump! who doesn't love racism?" was scored as positive with 89.2% confidence.
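To put the "not enough data" point in numbers, here is a rough sketch of the 95% margin of error on a measured rate. It relies on strong assumptions that tweets from a search do not really satisfy: a simple random sample of independent tweets and the usual normal approximation. The helper name marginOfError is made up, and the 0.32 rate with 50 tweets comes from the database excerpt.

```javascript
// Approximate 95% margin of error for a proportion measured on n samples.
// Assumes a simple random sample and the normal approximation.
function marginOfError(rate, n) {
  return 1.96 * Math.sqrt((rate * (1 - rate)) / n);
}

// A 0.32 negative rate measured on 50 tweets (as in the excerpt)
console.log(marginOfError(0.32, 50).toFixed(3));

// The same rate measured on 5,000 tweets
console.log(marginOfError(0.32, 5000).toFixed(3));
```

With 50 tweets, the uncertainty is about 13 percentage points either way, which is larger than the gap between the two candidates in the excerpt. With a hundred times more tweets, it shrinks by a factor of ten.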


Conclusion 

There are many reasons to doubt that these results reflect reality, which can be checked very easily against FiveThirtyEight.

But as we have seen, most flaws are not related to Comprehend. They mainly come from the fact that analyzing tweets does not tell us how popular someone is, but rather how much negativity or positivity they generate, which is very different.

NLP is a very promising domain. Numerous studies already exist on its use in the field of politics, but it also finds its use in many other fields. For example, services like DeepL use deep learning to translate text efficiently.

With this program we have only seen a glimpse of what can be achieved with NLP. This area is very exciting and there is a lot to do with it!

If you have any questions or remarks do not hesitate to tweet me!