Python - Analyzing most used words in lyrics with Apache Spark
Couting occurences of words in a file is a basic example about how using Apache Spark. Let's find out how !
Couting occurences of words in a file is a basic example about how using Apache Spark. Let's find out how !
Last updated 20th January, 2021
This guide gives you a basic example about using Apache Spark and OVHcloud Data Processing. We will first read data from a CSV file, then count the frequence of each word in this particular file.
Here we will use as an example a dataset of lyrics from billboard songs, and find the most common words used over time.
Apart being fun here, word count is really helpful when you want to analyze products reviews, support tickets, most researched words, and so on. You can easily make statistics for yourself or users. Famous social networks such as Twitter also analyze the trending topics by counting frequencies of words.
At the end of the tutorial, you will succeed to read CSV file from OVHcloud Object Storage, count words, remove stop words, and find the twentiest most used ones.
For this tutorial we will download a CSV file which includes :
This dataset, created by Kaylin Pavlik, can be downloaded on his Github repository.
She also wrote a really interesting blog post about this data immersion in pop culture:
As you can see in the graph we can find surprising results. She wrote that "Average word counts (both unique and total) have increased over time. Variance in word counts has also increased, perhaps due to greater genre diversity in the chart rankings over time".
Let's go back to the tutorial. Click on the CSV file, then download "billboard_lyrics_1964-2015.csv" on your computer.
You can download Python and Environment files generated here on official OVHcloud Github repository [1] We recommend the use of Conda as a preferred Python environment. Also, if you want to execute this code on your computer, it will require Apache Spark up and running, locally.
Apache Spark comes with handy functions to read various kind of data sources but also powerful ones to analyze data.
Create a new Python file in the same folder as your CSV file and add inside it:
import sys
from operator import add
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer
from pyspark.ml.feature import StopWordsRemover
import pyspark.sql.functions as f
spark = SparkSession\
.builder \
.appName("PythonWordCount") \
.getOrCreate()
# Read CSV from OVHcloud Object Storage https://github.com/walkerkq/musiclyrics
data = spark.read.format('csv').options(header='true', inferSchema='true') \
.load('billboard_lyrics_1964-2015.csv') \
print('############ CSV extract:')
data.show()
This first piece of code creates a new Apache Spark session, then reads and loads a CSV file into a dataframe called "data".
When executed, inside our logs we will see the content of the dataframe thanks to the data.show() function:
+----+--------------------+--------------------+----+--------------------+------+
|Rank| Song| Artist|Year| Lyrics|Source|
+----+--------------------+--------------------+----+--------------------+------+
| 1| wooly bully|sam the sham and ...|1965|sam the sham misc...| 3|
| 2|i cant help mysel...| four tops|1965| sugar pie honey ...| 1|
| 3|i cant get no sat...| the rolling stones|1965| | 1|
| 4| you were on my mind| we five|1965| when i woke up t...| 1|
| 5|youve lost that l...|the righteous bro...|1965| you never close ...| 1|
| 6| downtown| petula clark|1965| when youre alone...| 1|
| 7| help| the beatles|1965|help i need someb...| 3|
| 8|cant you hear my ...| hermans hermits|1965|carterlewis every...| 5|
| 9|crying in the chapel| elvis presley|1965| you saw me cryin...| 1|
| 10| my girl| the temptations|1965|ive got sunshine ...| 3|
| 11| help me rhonda| the beach boys|1965|well since she pu...| 3|
| 12| king of the road| roger miller|1965| trailer for sale...| 1|
| 13|the birds and the...| jewel akens|1965|let me tell ya bo...| 3|
| 14|hold me thrill me...| mel carter|1965| hold me hold me ...| 1|
| 15| shotgun|junior walker th...|1965|i said ̢shotgun s...| 3|
| 16| i got you babe| sonny cher|1965|they say were you...| 3|
| 17| this diamond ring|gary lewis the p...|1965|who wants to buy ...| 3|
| 18| the in crowd| ramsey lewis trio|1965| instrumental| 3|
| 19|mrs brown youve g...| hermans hermits|1965| mrs brown youve ...| 1|
| 20|stop in the name ...| the supremes|1965| stop in the name...| 1|
+----+--------------------+--------------------+----+--------------------+------+
only showing top 20 rows
Now, we can play with this dataframe! Let's explode the strings of the "Lyrics" column and count the words frequencies:
# Count and group word frequencies on the column Lyrics, when splitted by space comma
data.withColumn('word', f.explode(f.split(f.col('Lyrics'), ' '))) \
.groupBy('word') \
.count() \
.sort('count', ascending=False) \
.show()
The explode() function will take the data inside the Lyrics column, and separate all the data found based on the separator "whitespace character".
Here is the result shown in the logs:
+----+-----+
|word|count|
+----+-----+
| you|64606|
| i|56466|
| the|53451|
| to|35752|
| and|32555|
| me|31170|
| a|29282|
| it|25688|
| my|22821|
| in|18553|
|that|16151|
| on|15814|
|your|15459|
|love|15283|
| im|14278|
| be|13004|
| of|12825|
| |12266|
| all|11895|
|dont|11587|
+----+-----+
only showing top 20 rows
It works as expected, but the Top 20 is filled with stop words like "I", "The", "To", ... We will now remove them.
Apache Spark has a feature called StopWordsRemover that will clean up the data for us. The input shoud not be a string such as "I love you" but an array like [I, love, you].
We will first transform our data with tokenization then redo a GroupBy function:
# To remove stop words (like "I", "The", ...), we need to provide arrays of words, not strings. Here we use APache Spark Tokenizer to do so.
# We create a new column to push our arrays of words
tokenizer = Tokenizer(inputCol="Lyrics", outputCol="words_token")
tokenized = tokenizer.transform(data).select('Rank','words_token')
print('############ Tokenized data extract:')
tokenized.show()
# Once in arrays, we can use the Apache Spark function StopWordsRemover
# A new column "words_clean" is here as an output
remover = StopWordsRemover(inputCol='words_token', outputCol='words_clean')
data_clean = remover.transform(tokenized).select('Rank', 'words_clean')
print('############ Data Cleaning extract:')
data_clean.show()
# Final step : like in the beginning, we can group again words and sort them by the most used
result = data_clean.withColumn('word', f.explode(f.col('words_clean'))) \
.groupBy('word') \
.count().sort('count', ascending=False) \
print('############ TOP20 Most used words in Billboard songs are:')
result.show()
# Stop Spark Process
spark.stop()
The result is much better:
+-----+-----+
| word|count|
+-----+-----+
| love|15283|
| im|14278|
| dont|11587|
| know|11166|
| like|10949|
| oh| 9736|
| baby| 9098|
| got| 8289|
| get| 8265|
| | 7982|
|youre| 6592|
| yeah| 6259|
| want| 6214|
| go| 6105|
| make| 5520|
| one| 5412|
| cant| 5338|
| see| 5264|
| time| 5176|
| let| 4927|
+-----+-----+
only showing top 20 rows
Here is the main finding: if we analyze the lyrics of most famous songs during the last 50 years, the word "love" comes in first position with 15283 occurences. Followed by "i'm", "don't", "know", and so on.
Our code is now complete, let's now use OVHcloud Data Processing to benefit from an online Apache Spark cluster, with automation and scalability.
All Python jobs launched via Data Processing require a conda environment.yml file in order to manage libraries used and versions. You can find how to generate them here: How to submit a Python job on the Data Processing platform using the OVHcloud manager
Both files must be placed in the same object container.
In this particular case, you can use this environment below. Copy and paste in a environment.yml file.
name: data-processing-environment
channels:
- defaults
dependencies:
- python=3.7.6
- pyspark
You are now ready to submit this job.
A detailed version about how to deploy a job in Python is explained on the guide How to submit a job in Python.
Briefly, inside OVHcloud Data Processing control panel, click on "start a new job" then:
Data Processing
from the left panelSubmit a new job
When the job is completed, you can access its log file, either in the object container 'odp-files' or in the dashboard of your job. Below are two extracts of this log.
Here is an extract :
As you can see, all our the Python's print functions are displayed in the logs. It is the easiest way to keep track of a job run. We also see Apache Spark logs.
[1] Please select the tag related to the Spark version you want to use.
Counting word frequencies on a 8MB file as shown here does not require lot of power, and could have been done on a single laptop. But as said previously we can imagine much more useful use-cases, such as analyzing product reviews for an e-commerce website, dig into social networks messages, analyzing most recurring patterns in support tickets and so on.
Usually it will require more compute power due to the amount of data to manipulate. That's where OVHcloud Data Processing will become helpful, bringing automation and scalability.
You can read data from multiple sources, like Object Storage via AWS S3 API protocols, then write your results in files.
If you are not familiar with Apache Spark, we recommend you to visit Apache Spark's official website and pyspark's documentation.
You can send your questions, suggestions or feedbacks in our community of users on our Discord in the channel #dataprocessing-spark
Zachęcamy do przesyłania sugestii, które pomogą nam ulepszyć naszą dokumentację.
Obrazy, zawartość, struktura - podziel się swoim pomysłem, my dołożymy wszelkich starań, aby wprowadzić ulepszenia.
Zgłoszenie przesłane za pomocą tego formularza nie zostanie obsłużone. Skorzystaj z formularza "Utwórz zgłoszenie" .
Dziękujemy. Twoja opinia jest dla nas bardzo cenna.
Dostęp do OVHcloud Community Przesyłaj pytania, zdobywaj informacje, publikuj treści i kontaktuj się z innymi użytkownikami OVHcloud Community.
Porozmawiaj ze społecznością OVHcloud