Macabe

🪶scraping bearblogs and analyzing sentiment with VADER

It’s always best to check if a site provides an API or allows scraping before proceeding.
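
One way to check whether a site allows scraping is to read its robots.txt. The sketch below uses the standard library's `urllib.robotparser` on an inline set of rules so it runs without network access. The helper name `is_allowed`, the sample rules, and the example.bearblog.dev URLs are made up for illustration and are not part of the original script.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits fetching url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so split the raw text first.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, standing in for a real site's robots.txt.
sample_rules = """\
User-agent: *
Disallow: /private/
"""

blog_ok = is_allowed(sample_rules, "https://example.bearblog.dev/blog/")
private_ok = is_allowed(sample_rules, "https://example.bearblog.dev/private/notes/")
print(blog_ok, private_ok)
```

Against a live site, `parser.set_url(...)` followed by `parser.read()` would load the real file instead of the inline string. A robots.txt check does not replace reading the site's terms of use.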

The script relies on the following imports.

from flask import Flask, request, render_template_string, jsonify, send_file
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk
import threading

The script does two things.

  1. Scrapes blog posts from a specified domain using requests and parses them with BeautifulSoup. It finds text within specific HTML tags (h1, h2, h3, and p).

  2. Passes each string of text found in those tags to the VADER sentiment analyzer, which computes four scores (negative, neutral, positive, and compound). These scores are saved to a .txt file, along with each post’s title, link, and date.
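
The two steps can be sketched roughly as follows. This is a minimal illustration, not the original script: the function names, the sample HTML, and the report layout are all assumptions. The scorer is passed in as a plain function so the sketch runs without downloading the VADER lexicon.

```python
from bs4 import BeautifulSoup

# The tags the script looks at, per the description above.
TAGS = ["h1", "h2", "h3", "p"]


def extract_text_blocks(html):
    """Return the non-empty text of every h1, h2, h3, and p tag, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    blocks = []
    for tag in soup.find_all(TAGS):
        text = tag.get_text(" ", strip=True)
        if text:
            blocks.append(text)
    return blocks


def score_blocks(blocks, score_fn):
    """Pair each text block with whatever score_fn returns for it."""
    return [(text, score_fn(text)) for text in blocks]


def format_report(title, link, date, scored):
    """Build the text that would be written to the .txt file (layout is hypothetical)."""
    lines = [f"Title: {title}", f"Link: {link}", f"Date: {date}"]
    for text, scores in scored:
        lines.append(f"{scores}\t{text}")
    return "\n".join(lines) + "\n"


# Hypothetical post body, used here in place of a fetched page.
sample_html = "<h1>My post</h1><div><p>The book was good.</p><span>skipped</span><p> </p></div>"
blocks = extract_text_blocks(sample_html)
print(blocks)
```

To score with VADER, pass `SentimentIntensityAnalyzer().polarity_scores` as `score_fn` after running `nltk.download("vader_lexicon")` once.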

What is scraping?

Scraping is the act of extracting information from websites. It involves making requests, fetching HTML or other content, and then parsing that content to retrieve specific pieces of information. If you want to collect online data, this is an efficient way to do so.
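
As a rough sketch of that request, fetch, and parse cycle: `fetch_html` downloads a page with requests, and `extract_links` pulls every link out of the HTML and resolves it against the page URL with `urljoin`. Both function names, the User-Agent string, and the example.bearblog.dev address are placeholders chosen for this example.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising on HTTP error statuses."""
    headers = {"User-Agent": "sentiment-demo/0.1"}  # placeholder identifier
    response = requests.get(url, timeout=timeout, headers=headers)
    response.raise_for_status()
    return response.text


def extract_links(html, base_url):
    """Return an absolute URL for every <a href=...> in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


# Inline HTML stands in for a fetched blog index so the parsing step runs offline.
sample = (
    '<ul><li><a href="/first-post/">First</a></li>'
    '<li><a href="https://example.com/x">Elsewhere</a></li>'
    "<li><a>no href</a></li></ul>"
)
links = extract_links(sample, "https://example.bearblog.dev/blog/")
print(links)
```

On a real site the two pieces chain together as `extract_links(fetch_html(url), url)`.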

What is sentiment analysis?

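Sentiment analysis is a technique in natural language processing (NLP) for determining the emotional tone of a piece of text. The value of this ability seems to grow each day. The scoring is the fascinating part: pos, neu, and neg are the proportions of the text that fall into each category, and compound is a single normalized score from -1 (most negative) to +1 (most positive). Here are outputs from the examples given in the vaderSentiment GitHub repository.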

VADER is smart, handsome, and funny.----------------------------- {'pos': 0.746, 'compound': 0.8316, 'neu': 0.254, 'neg': 0.0}
VADER is smart, handsome, and funny!----------------------------- {'pos': 0.752, 'compound': 0.8439, 'neu': 0.248, 'neg': 0.0}
VADER is very smart, handsome, and funny.------------------------ {'pos': 0.701, 'compound': 0.8545, 'neu': 0.299, 'neg': 0.0}
VADER is VERY SMART, handsome, and FUNNY.------------------------ {'pos': 0.754, 'compound': 0.9227, 'neu': 0.246, 'neg': 0.0}
VADER is VERY SMART, handsome, and FUNNY!!!---------------------- {'pos': 0.767, 'compound': 0.9342, 'neu': 0.233, 'neg': 0.0}
VADER is VERY SMART, uber handsome, and FRIGGIN FUNNY!!!--------- {'pos': 0.706, 'compound': 0.9469, 'neu': 0.294, 'neg': 0.0}
VADER is not smart, handsome, nor funny.------------------------- {'pos': 0.0, 'compound': -0.7424, 'neu': 0.354, 'neg': 0.646}
The book was good.----------------------------------------------- {'pos': 0.492, 'compound': 0.4404, 'neu': 0.508, 'neg': 0.0}
At least it isn't a horrible book.------------------------------- {'pos': 0.363, 'compound': 0.431, 'neu': 0.637, 'neg': 0.0}
The book was only kind of good.---------------------------------- {'pos': 0.303, 'compound': 0.3832, 'neu': 0.697, 'neg': 0.0}
The plot was good, but the characters are uncompelling and the dialog is not great. {'pos': 0.094, 'compound': -0.7042, 'neu': 0.579, 'neg': 0.327}
Today SUX!------------------------------------------------------- {'pos': 0.0, 'compound': -0.5461, 'neu': 0.221, 'neg': 0.779}
Today only kinda sux! But I'll get by, lol----------------------- {'pos': 0.317, 'compound': 0.5249, 'neu': 0.556, 'neg': 0.127}
Make sure you :) or :D today!------------------------------------ {'pos': 0.706, 'compound': 0.8633, 'neu': 0.294, 'neg': 0.0}
Catch utf-8 emoji such as 💘 and 💋 and 😁-------------------- {'pos': 0.279, 'compound': 0.7003, 'neu': 0.721, 'neg': 0.0}
Not bad at all--------------------------------------------------- {'pos': 0.487, 'compound': 0.431, 'neu': 0.513, 'neg': 0.0}
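
To turn a compound score into a label, the vaderSentiment README suggests cutoffs of 0.05 and -0.05. The sketch below applies them to three of the score dictionaries listed above. The function name `label_from_compound` is made up for this example.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to a coarse label using symmetric cutoffs."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Score dictionaries copied from the example outputs above.
examples = {
    "The book was good.": {"pos": 0.492, "compound": 0.4404, "neu": 0.508, "neg": 0.0},
    "Today SUX!": {"pos": 0.0, "compound": -0.5461, "neu": 0.221, "neg": 0.779},
    "Not bad at all": {"pos": 0.487, "compound": 0.431, "neu": 0.513, "neg": 0.0},
}

labels = {text: label_from_compound(scores["compound"]) for text, scores in examples.items()}
for text, label in labels.items():
    print(f"{label:>8}  {text}")
```

With NLTK, a dictionary like these comes from `SentimentIntensityAnalyzer().polarity_scores(text)`.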

Here is what each library does in the script. Each one is like a unique ingredient in a recipe.

  1. Flask A lightweight web framework for building web applications in Python.

    Flask imports: Flask (the application class), request (data from the incoming request), render_template_string (renders a template held in a string), jsonify (builds a JSON response), and send_file (sends a file to the client).

  2. requests This library sends HTTP requests, making it ideal for fetching data from the web, such as blog content. In this script, it pulls the HTML of Bearblog posts for analysis.

  3. BeautifulSoup Part of the bs4 package, BeautifulSoup parses HTML documents. It makes it easy to extract elements like blog titles, links, and main content from a webpage’s HTML structure.

  4. urllib.parse Specifically, the urljoin function builds absolute URLs from relative paths, so that links to blog posts on a domain resolve correctly.

  5. nltk (Natural Language Toolkit) One of the most widely used libraries for working with human language data. The SentimentIntensityAnalyzer class from nltk.sentiment.vader performs sentiment analysis on text, assigning positive, negative, neutral, and compound scores. The script also downloads vader_lexicon, the word list VADER uses for scoring.

  6. threading This module runs the web scraping and analysis in a separate thread, so the web app stays responsive during long-running work such as scoring many posts.
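
Here is a minimal sketch of how Flask and threading might fit together, assuming one background job at a time. The route names, the `job` dictionary, and the placeholder `analyze_domain` function are inventions for this example, and the original script may be organized differently.

```python
import threading

from flask import Flask, jsonify, request

app = Flask(__name__)

# State shared between the request handlers and the worker thread.
job = {"status": "idle", "domain": None, "result": None}
job_lock = threading.Lock()
workers = []  # keeps thread handles so callers can join() them


def analyze_domain(domain):
    """Placeholder for the slow scrape-and-score work (hypothetical)."""
    result = f"analysis of {domain} would be produced here"
    with job_lock:
        job.update(status="done", result=result)


@app.route("/start", methods=["POST"])
def start():
    domain = request.form.get("domain", "")
    with job_lock:
        if job["status"] == "running":
            return jsonify(error="a job is already running"), 409
        job.update(status="running", domain=domain, result=None)
    # Run the slow work off the request thread so this handler returns at once.
    worker = threading.Thread(target=analyze_domain, args=(domain,), daemon=True)
    workers.append(worker)
    worker.start()
    return jsonify(status="started"), 202


@app.route("/status")
def status():
    with job_lock:
        return jsonify(dict(job))
```

Calling `app.run()` would serve this locally. The browser can then poll `/status` until the job reports "done" instead of waiting on one long request.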

Web scraping and sentiment analysis can be applied in many different ways across many different industries. Have an idea for it? Make a fork.

#tech #technicalwriting