BREAKING DOWN THE TOP TRENDING YOUTUBE VIDEOS (U.S. & CA) By: Esther Zhang
YouTube is a well-known worldwide platform for uploading videos. It was launched in 2005 by Chad Hurley, Steve Chen, and Jawed Karim. In the 15 years since, no one could have predicted how popular the site would become or the impact it would have on the world; it is now worth billions of dollars. From music to entertainment to tutorials to politics, there is something for everyone.
According to Business of Apps, 500 hours of content are uploaded to YouTube every minute. In recent years especially, many people have chosen to make a career out of posting videos on YouTube, becoming what are known as “content creators”. My fascination with what makes a video go viral inspired me to analyze data on YouTube’s top trending videos. Top-trending videos are determined by numerous factors, chiefly user interaction (number of views, comments, and likes/dislikes).
The objective of this project is to analyze YouTube’s top trending video data from December 1st, 2017 to May 31st, 2018 from a data science perspective, and hopefully enlighten the general public about how these trending videos varied between the U.S. and Canada in 2017-2018.
I decided to include Canada’s data on top trending videos to analyze another slice of the media landscape. Many consider the U.S. and Canada to be similar in many ways, as they are geographically close. Prior to this project, I had never given much thought to the true cultural differences between the two countries. I wanted to compare the trending video data and see if the countries do hold similar preferences in content, whether in regards to categories, channels, or other areas.
Throughout this tutorial, I will attempt to uncover potential trends between the video statistics. Additionally, I want to analyze whether posting time affected the most viewed videos.
More information on YouTube’s growth and its user, usage, content, and revenue statistics may be found here: https://www.businessofapps.com/data/youtube-statistics/
During data collection, I parse my data from raw CSV files so that it is ready for the next step of the data life cycle.
I found my dataset on Kaggle, titled “Trending YouTube Video Statistics”. The link to this page is: https://www.kaggle.com/datasnaek/youtube-new
The content includes data on daily trending YouTube videos for the U.S., Great Britain, Germany, Canada, France, Russia, Mexico, South Korea, Japan, and India. Again, I only chose to use data on the U.S. and Canada. Within each region’s data are columns for video title, channel title, publish time, category id, tags, views, likes/dislikes, description, and comment count. The data comes as .csv files, which I will parse using the raw GitHub links found at:
https://raw.githubusercontent.com/mitchelljy/Trending-YouTube-Scraper/master/output/18.01.11_US_videos.csv
https://raw.githubusercontent.com/mitchelljy/Trending-YouTube-Scraper/master/output/18.01.11_CA_videos.csv
I saved the U.S. and Canada’s data into separate dataframes, which will be manipulated later on so that they are easier to read.
The category id is the only column that is linked to a separate JSON file for each region. Each numeric id corresponds to a category name in the JSON; for example, category id 10 in the U.S. file corresponds to Music. The column metadata can be found at: https://www.kaggle.com/datasnaek/youtube-new/data I will be reassigning the category ids to their category names in the next stage.
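To make the id-to-name mapping concrete before we get to the real files, here is a minimal sketch of the JSON layout; the entries below are illustrative stand-ins, not the full file:

```python
import json

# A tiny stand-in for the structure of US_category_id.json (values illustrative)
sample = '''{"items": [
  {"id": "10", "snippet": {"title": "Music"}},
  {"id": "24", "snippet": {"title": "Entertainment"}}
]}'''

# Each item maps a numeric id string to a human-readable category title
categories = {int(item["id"]): item["snippet"]["title"]
              for item in json.loads(sample)["items"]}
print(categories[10])  # Music
```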
import requests
import numpy as np
import pandas as pd
from datetime import datetime, timezone
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import json
import warnings
warnings.filterwarnings("ignore")
# Setting the font (font.family must be "serif" for the font.serif choice to apply)
plt.rcParams["font.family"] = "serif"
plt.rcParams["font.serif"] = "cmr10"
First, I will scrape the U.S. videos data:
# Reading U.S. csv into first dataframe
df_one = pd.read_csv("https://raw.githubusercontent.com/mitchelljy/Trending-YouTube-Scraper/master/output/18.01.11_US_videos.csv")
df_one.head()
Next, I will scrape the Canada videos data:
# Reading Canada csv into second dataframe
df_two = pd.read_csv("https://raw.githubusercontent.com/mitchelljy/Trending-YouTube-Scraper/master/output/18.01.11_CA_videos.csv")
df_two.head()
Now, we are ready to process the data for further analysis!
When I process the data, I am essentially reformatting the dataframes so that they are more readable and easier to manage as I continue on to data exploration!
The first step in tidying this dataset was fixing the time format by converting the “published at” and “trending date” columns into datetime objects. I first parse the times from their current string format and then save the result back to the same column. The trending date holds only a date, while published at holds a date and time. As seen in the original dataframe, the published at column was in ISO 8601 format with a trailing “Z” (“Zulu”, i.e., UTC), which made it hard to see at a glance when a video was posted. Creating these datetime objects is important so that I can later graph video trends over time. More info on Python’s datetime objects can be found here: https://docs.python.org/3/library/datetime.html
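As a quick illustration of the two formats involved (the dates below are made up):

```python
from datetime import datetime

# trending_date comes as "%y.%d.%m" (e.g., "18.11.01" = January 11, 2018),
# publishedAt as ISO 8601 with a literal "Z" (Zulu/UTC) suffix
trending = datetime.strptime("18.11.01", "%y.%d.%m").date()
published = datetime.strptime("2018-01-10T17:00:00.000Z", "%Y-%m-%dT%H:%M:%S.%fZ")
print(trending)   # 2018-01-11
print(published)  # 2018-01-10 17:00:00
```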
Next, I dropped the columns: video_id, channelId, tags, thumbnail_link and description. They will not be used in my analysis.
I repeated the same process for both the U.S. and CA dataframes.
# U.S.
# Converting trending date and published at strings into datetime objects
time_format = "%y.%d.%m"
time_format2 = "%Y-%m-%dT%H:%M:%S.%fZ"
# Trending date will be in the format YYYY-MM-DD
df_one["trending_date"] = pd.to_datetime(df_one["trending_date"], format=time_format).dt.date
# Published at will be in the format YYYY-MM-DD HH:MM:SS
df_one["publishedAt"] = pd.to_datetime(df_one["publishedAt"], format=time_format2)
# Dropping irrelevant columns and saving the original dataframe into a new one
df_us = df_one.drop(columns=["video_id", "channelId", "tags", "thumbnail_link", "description"])
# Renaming columns for style and formatting purposes
df_us = df_us.rename(columns={"publishedAt": "published_at", "channelTitle": "channel_title", "categoryId": "category_id"})
df_us.head()
# CANADA
# Converting trending date and published at strings into datetime objects
time_format = "%y.%d.%m"
time_format2 = "%Y-%m-%dT%H:%M:%S.%fZ"
# Trending date will be in the format YYYY-MM-DD
df_two["trending_date"] = pd.to_datetime(df_two["trending_date"], format=time_format).dt.date
# Published at will be in the format YYYY-MM-DD HH:MM:SS
df_two["publishedAt"] = pd.to_datetime(df_two["publishedAt"], format=time_format2)
# Dropping irrelevant columns and saving the original dataframe into a new one
df_ca = df_two.drop(columns=["video_id", "channelId", "tags", "thumbnail_link", "description"])
# Renaming columns for style and formatting purposes
df_ca = df_ca.rename(columns={"publishedAt": "published_at", "channelTitle": "channel_title", "categoryId": "category_id"})
df_ca.head()
The last step in my data processing phase is changing the category ids to category names, so that readers can see each category in the dataframes without needing a separate key. I first create a new column in the original dataframe that holds strings. Next, I use pandas to read in the category id JSON file that I uploaded to my notebook, and build a temporary dataframe of categories that stores the numeric id in one column and the corresponding name in the next. I then loop through the original dataframe’s category_id column with an inner loop over the temporary dataframe’s id column; once the ids match, I set the original dataframe’s category_name to the temporary dataframe’s name at that index. After all the category names are updated, I drop the category_id column.
# Reading in U.S. categories from JSON file & updating the U.S. dataframe
# to contain category names instead of category id's
df_us['category_name'] = np.nan # create new column
df_us['category_name'] = df_us['category_name'].astype(str) # change column type to string
# Reading in from the JSON file
us_categories = pd.read_json('US_category_id.json')
# Creating temporary dataframe for categories
df_categories = pd.DataFrame()
# Mapping ids from the JSON file into the column "id"
df_categories['id'] = us_categories['items'].map(lambda row: row['id'])
# Mapping names from the JSON file into the column "name"
df_categories['name'] = us_categories['items'].map(lambda row: row['snippet']['title'])
# change column type to int
df_us['category_id'] = df_us['category_id'].astype(int)
df_categories['id'] = df_categories['id'].astype(int)
# Setting the U.S. dataframe's "category_name" to each entry's corresponding category (based on its ID)
for index in df_us.index:
    for ind in df_categories.index:
        if df_us['category_id'][index] == df_categories['id'][ind]:
            df_us.at[index, "category_name"] = df_categories.at[ind, "name"]
# Dropping the "category_id" column, no longer needed
df_us = df_us.drop(columns={"category_id"})
df_us.head()
# Reading in CA categories from JSON file & updating the CA dataframe
# to contain category names instead of category id's
df_ca['category_name'] = np.nan # create new column
df_ca['category_name'] = df_ca['category_name'].astype(str) # change column type to string
# Reading in from the JSON file
ca_categories = pd.read_json('CA_category_id.json')
# Creating temporary dataframe for categories
df_categories2 = pd.DataFrame()
# Mapping ids from the JSON file into the column "id"
df_categories2['id'] = ca_categories['items'].map(lambda row: row['id'])
# Mapping names from the JSON file into the column "name"
df_categories2['name'] = ca_categories['items'].map(lambda row: row['snippet']['title'])
# change column type to int
df_ca['category_id'] = df_ca['category_id'].astype(int)
df_categories2['id'] = df_categories2['id'].astype(int)
# Setting the Canada dataframe's "category_name" to each entry's corresponding category (based on its ID)
for index in df_ca.index:
    for ind in df_categories2.index:
        if df_ca['category_id'][index] == df_categories2['id'][ind]:
            df_ca.at[index, "category_name"] = df_categories2.at[ind, "name"]
# Dropping the "category_id" column, no longer needed
df_ca = df_ca.drop(columns={"category_id"})
df_ca.head()
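The nested matching loops above rescan the whole category table once per video row. The same lookup can be done in one vectorized pass with pandas' `Series.map`; here is a sketch using toy frames standing in for the real ones:

```python
import pandas as pd

# Toy stand-ins for the category key and the video dataframe (values illustrative)
df_categories = pd.DataFrame({"id": [10, 24], "name": ["Music", "Entertainment"]})
df_videos = pd.DataFrame({"title": ["a", "b", "c"], "category_id": [24, 10, 10]})

# Build the id -> name dictionary once, then map every row at once
id_to_name = dict(zip(df_categories["id"], df_categories["name"]))
df_videos["category_name"] = df_videos["category_id"].map(id_to_name)
print(df_videos["category_name"].tolist())  # ['Entertainment', 'Music', 'Music']
```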
Time to analyze!
In the exploratory analysis step, I will create numerous plots of the data to break it down and look for potential trends. Then, I will analyze whether any of the plots show strong correlations.
There are over 40,000 videos across the dataframes, but the first half of my analyses will only look at the top 20 trending videos by view count. View count is a driving factor in boosting a trending video, along with factors like likes/dislikes and comments, so I hope to see interesting differences between the U.S. and Canada’s data.
Now that I have the trending video data, I can graph which videos trended the most based on their view count. I plotted the top 20 most viewed YouTube videos in the U.S. and Canada separately.
# Top 20 Most Viewed Youtube Videos in the U.S.
plt.title("Top 20 Most Viewed Youtube Videos in the U.S.")
plt.xticks(rotation=90)
# Sorting most viewed videos by row in descending order
most_viewed_us = df_us.groupby('title').view_count.max().sort_values(ascending=False)[:20]
# Creating bar plot
sns.barplot(x=most_viewed_us.index, y=most_viewed_us.values)
plt.xlabel('Title')
plt.ylabel('Views (hundred million)')
plt.show()
# Top 20 Most Viewed Youtube Videos in Canada
plt.title("Top 20 Most Viewed Youtube Videos in Canada")
plt.xticks(rotation=90)
# Sorting most viewed videos by row in descending order
most_viewed_ca = df_ca.groupby('title').view_count.max().sort_values(ascending=False)[:20]
# Creating bar plot
sns.barplot(x=most_viewed_ca.index, y=most_viewed_ca.values)
plt.xlabel('Title')
plt.ylabel('Views (hundred million)')
plt.show()
Since none of the top 20 videos had the same number of views, I could easily check the U.S. and Canada most viewed lists for shared trending videos by comparing view counts. I printed out the shared titles along with their rankings within each region's top 20.
Interestingly enough, 6 of the 20 videos were the same between the two countries. However, none of the shared videos held the same ranking in both. For instance, "XXXTENTACION & Lil Pump ft. Maluma & Swae Lee - Arms Around You (Official Lyrics Video)" was the most viewed in Canada but only the tenth most viewed in the U.S.
for index in range(len(most_viewed_us)):
    for index2 in range(len(most_viewed_ca)):
        # Positional access via .iloc (plain [] would do label lookups against the title index)
        views_us = int(most_viewed_us.iloc[index])
        views_ca = int(most_viewed_ca.iloc[index2])
        if views_us == views_ca:
            print(most_viewed_us.index[index])
            # 1-based rankings
            print("U.S. ranking: " + str(index + 1) + " Canada ranking: " + str(index2 + 1) + '\n')
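Matching on exact view counts works here only because no two of these videos shared a count; matching titles directly via `Index.intersection` is a more robust alternative. A sketch with made-up titles and counts standing in for the two top-20 Series:

```python
import pandas as pd

# Toy stand-ins for the two top-20 Series (index = title, value = max views)
most_viewed_us = pd.Series({"Video A": 900, "Video B": 800, "Video C": 700})
most_viewed_ca = pd.Series({"Video C": 950, "Video D": 600, "Video A": 500})

# Titles trending in both regions, reported with each region's 1-based ranking
for title in most_viewed_us.index.intersection(most_viewed_ca.index):
    us_rank = most_viewed_us.index.get_loc(title) + 1
    ca_rank = most_viewed_ca.index.get_loc(title) + 1
    print(f"{title}: U.S. #{us_rank}, Canada #{ca_rank}")
```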
Next, I wanted to plot the top 20 trending videos' likes vs dislikes ratio. I created a scatter plot with the likes along the x-axis and the dislikes along the y-axis. Each point/marker is a different color and the legend on the right indicates the video title.
# US Likes vs Dislikes (Top 20 Trending Videos by Viewcount)
df_top20us = df_us.nlargest(20, 'view_count')
fig, ax = plt.subplots()
# Plotting each video as its own colored marker so the legend can list titles
for index in df_top20us.index:
    ax.plot(df_top20us['likes'][index], df_top20us['dislikes'][index], marker='o', linestyle='', markersize=8, label=df_top20us['title'][index])
plt.title("Top 20 U.S. Trending Videos Likes vs Dislikes")
plt.xlabel('Likes (millions)')
plt.ylabel('Dislikes (hundred thousand)')
plt.legend(bbox_to_anchor=(1.05, 1),loc='upper left')
# CA Likes vs Dislikes (Top 20 Trending Videos by Viewcount)
df_top20ca = df_ca.nlargest(20, 'view_count')
fig, ax = plt.subplots()
# Plotting each video as its own colored marker so the legend can list titles
for index in df_top20ca.index:
    ax.plot(df_top20ca['likes'][index], df_top20ca['dislikes'][index], marker='o', linestyle='', markersize=8, label=df_top20ca['title'][index])
plt.title("Top 20 Canada Trending Videos Likes vs Dislikes")
plt.xlabel('Likes (millions)')
plt.ylabel('Dislikes (hundred thousand)')
plt.legend(bbox_to_anchor=(1.05, 1),loc='upper left')
Based on each individual plot, I noticed they differed more than expected.
In the U.S., there was an upward linear trend: more likes seem to correlate with more dislikes. I decided to recreate the plot with the line of best fit.
In Canada, however, the data points were much more scattered, so I cannot conclude that likes strongly correlate with dislikes. I replotted the graph above with its line of best fit as well; as seen, there are multiple outliers in the data.
These plots raise smaller interesting questions, such as: is Canada's like-to-dislike ratio a reflection of more controversial videos in its top 20 most viewed? Are Canadian YouTube users more critical of trending music? It is worth noting that the U.S. graph had much higher like and dislike counts than Canada's.
Although I can't draw any concrete conclusions, it is fascinating how data science plays a role in uncovering these minor cultural discrepancies. The videos may be similar between the two regions, but the like-to-dislike ratio varies.
# Plotting the line of best fit for the U.S. likes vs dislikes
sns.lmplot(x='likes', y='dislikes', data=df_top20us, fit_reg=True)
# Plotting the line of best fit for Canada's likes vs dislikes
sns.lmplot(x='likes', y='dislikes', data=df_top20ca, fit_reg=True)
I wanted to analyze, for the top 20 trending videos, the time gap between when they were published and when they trended.
I plotted two scatter plots in the same graph- one with the published date as the x-axis and the other with the trending date as the x-axis; the y-axis was the video titles for both.
plt.figure(figsize=(10,6))
plt.title("Published Date & Trending Date (U.S. Top 20)")
plt.scatter(df_top20us['published_at'], df_top20us['title'], color='red', label='Published')
plt.scatter(df_top20us['trending_date'], df_top20us['title'], color='green', label='Trending')
plt.legend()
plt.xlabel('Date')
plt.ylabel('Titles')
plt.show()
plt.figure(figsize=(10,6))
plt.title("Published Date & Trending Date (Canada Top 20)")
plt.scatter(df_top20ca['published_at'], df_top20ca['title'], color='red', label='Published')
plt.scatter(df_top20ca['trending_date'], df_top20ca['title'], color='green', label='Trending')
plt.legend()
plt.xlabel('Date')
plt.ylabel('Titles')
plt.show()
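The quantity these plots visualize, the gap between publishing and trending, can also be computed directly as a new column. A sketch with invented dates in place of the real top-20 rows:

```python
import pandas as pd

# Toy rows standing in for a top-20 dataframe (dates invented)
df = pd.DataFrame({
    "title": ["a", "b"],
    "published_at": pd.to_datetime(["2018-01-05 17:00", "2018-01-08 02:00"]),
    "trending_date": pd.to_datetime(["2018-01-11", "2018-01-11"]),
})

# Whole days elapsed from publish time to the trending date
df["days_to_trend"] = (df["trending_date"] - df["published_at"]).dt.days
print(df["days_to_trend"].tolist())  # [5, 2]
```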
Finally, I will explore the most popular categories amongst all the trending videos in each region.
I counted the number of times each category name appeared and printed them out. Then I created a barplot with the category names along the x-axis and the corresponding counts on the y-axis.
# Number of Videos by Category (U.S.)
category_counts_us = df_us.value_counts('category_name')
print(category_counts_us)
plt.title("Number of Videos by Category (U.S.)")
plt.xticks(rotation=90)
sns.barplot(x=category_counts_us.index, y=category_counts_us.values)
plt.xlabel('Category Name')
plt.ylabel('Counts')
plt.show()
# Number of Videos by Category (Canada)
category_counts_ca = df_ca.value_counts('category_name')
print(category_counts_ca)
plt.title("Number of Videos by Category (Canada)")
plt.xticks(rotation=90)
sns.barplot(x=category_counts_ca.index, y=category_counts_ca.values)
plt.xlabel('Category Name')
plt.ylabel('Counts')
plt.show()
Since each region has its own set of category names, there were some differences between the U.S. and Canada barplots.
Both had Music and Entertainment as the top two categories with the most trending videos, followed by fairly similar rankings for Howto & Style and Film & Animation. In the latter half of the barplots, the category names themselves diverge: the U.S. had separate categories for News & Politics, Gaming, and Pets & Animals, while Canada had a separate category for Shows.
The most popular categories of trending videos are important information for those entering content creation. With so many videos uploaded daily, the competition is high. For instance, while music-related videos trend more often, that could simply be because many more music-related videos are posted. It is hard to say whether the more common trending categories translate to higher chances of a video in those categories making the trending page.
This data could also inform other research topics, like: what content is most commonly searched across countries? What hobbies and interests do different regions like most?
More information on YouTube categories and which specific content falls into which category can be found here: https://techpostplus.com/youtube-video-categories-list-faqs-and-solutions/
This tutorial was a fun way to analyze one of the most popular platforms in the world: YouTube. Regarding my earlier question of how the U.S. and Canada differ culturally through their top trending YouTube videos, I conclude that we are indeed more similar than not.
The graphs of the top 20 most viewed trending videos showed many video crossovers between the two regions in such a small sample. The categories then revealed that the top 5 categories in both regions were the same: Music, Entertainment, Howto & Style, Film & Animation, and Comedy.
The scatter plots of likes vs. dislikes revealed how the regions' ratios differ for their top 20 trending videos: Canada's graph showed a less predictable like-to-dislike ratio than the U.S. This analysis could open up new discussions on the details of the top 20 trending videos and how they are perceived differently in each region.
For all the future young and old content creators, data science is your friend! By analyzing and understanding data on trending YouTube videos, you'll better know what it takes to go viral yourself!