BREAKING DOWN THE TOP TRENDING YOUTUBE VIDEOS (U.S. & CA) By: Esther Zhang

  1. INTRODUCTION

YouTube is a well-known, worldwide platform for uploading videos. The platform was launched in 2005 by Chad Hurley, Steve Chen and Jawed Karim. In the 15 years since, no one could have predicted how popular the site and app would become or how much impact they would have on the world. It is now worth billions of dollars. From music to entertainment to tutorials to politics, there is something for everyone.

According to Business of Apps, 500 hours of content are uploaded to YouTube every minute. In recent years, many people, known as “content creators”, have chosen to make a career out of posting videos on YouTube. My fascination with what makes a video go viral inspired me to analyze data on YouTube’s top trending videos. Top trending videos are determined by numerous factors, including user interaction (number of views, comments and likes/dislikes).

The objective of this project is to analyze YouTube’s top trending video data from December 1st, 2017 to May 31st, 2018 from a data science perspective, and hopefully enlighten the general public about how these trending videos varied between the U.S. and Canada in 2017-2018.

I decided to include Canada’s data on top trending videos to analyze another aspect of media. Many consider the U.S. and Canada to be similar in many ways, as they are geographically close. Prior to this project, I had never given much thought to the true cultural differences between the two countries. I wanted to compare the trending video data and see whether the countries hold similar preferences in terms of content, whether in categories, channels or other areas.

Throughout this tutorial, I will attempt to uncover potential trends in the video statistics. Additionally, I want to analyze whether posting time affected the most viewed videos.

More information on YouTube’s growth, users, usage, content and revenue statistics may be found here: https://www.businessofapps.com/data/youtube-statistics/

  2. DATA COLLECTION

During data collection, I read my data from raw csv files so that it is ready for the next step of the data life cycle.

I found my dataset on Kaggle, where it is titled “Trending Youtube Video Statistics”. The link to this page is: https://www.kaggle.com/datasnaek/youtube-new

The content includes data on daily trending YouTube videos for the U.S., Great Britain, Germany, Canada, France, Russia, Mexico, South Korea, Japan and India. Again, I only chose to use data on the U.S. and Canada. Each region’s data has columns for video title, channel title, publish time, category id, tags, views, likes/dislikes, description and comment count. The data was in the form of .csv files, which I will parse using the raw GitHub links found at: https://raw.githubusercontent.com/mitchelljy/Trending-YouTube-Scraper/master/output/18.01.11_US_videos.csv and https://raw.githubusercontent.com/mitchelljy/Trending-YouTube-Scraper/master/output/18.01.11_CA_videos.csv

I saved the U.S. and Canada’s data into separate dataframes which will be manipulated later on so that it’s easier to read.
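
Because the two regions are read in exactly the same way, the loading step could also be wrapped in a small helper. The following is only a sketch: the function name `load_region` and the extra `region` column are my own assumptions and are not used elsewhere in this tutorial, and the rows are made up so that the example runs without a network connection.

```python
import io
import pandas as pd

def load_region(source, region):
    """Read one region's trending CSV and tag each row with a region code.

    `source` can be anything pd.read_csv accepts (URL, path or file-like object).
    The helper and the "region" column are hypothetical, for illustration only.
    """
    frame = pd.read_csv(source)
    frame["region"] = region
    return frame

# Tiny in-memory stand-in for a real file (made-up rows, real column names)
sample_csv = io.StringIO(
    "video_id,title,view_count\n"
    "abc123,Example video A,1000\n"
    "def456,Example video B,2500\n"
)
sample = load_region(sample_csv, "US")
print(sample.shape)   # number of rows and columns that were read
print(sample.dtypes)  # check that numeric columns were parsed as numbers
```

With the real data, the raw GitHub links above would be passed as `source` in place of the in-memory sample.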

The category id is the only column that is linked to a separate JSON file for each region. Each numeric id corresponds to a category name in the JSON; for example, category id 10 in the U.S. categories is associated with music. I will be reassigning the category ids to their category names in the next stage. The column metadata can be found at: https://www.kaggle.com/datasnaek/youtube-new/data

In [1]:
import requests
import numpy as np 
import pandas as pd 
from datetime import datetime, timezone 
import matplotlib.pyplot as plt 
import matplotlib
import seaborn as sns
import json
import warnings
warnings.filterwarnings("ignore")

# Setting the font
plt.rcParams["font.serif"] = "cmr10"

First, I will read in the U.S. video data:

In [2]:
# Reading U.S. csv into first dataframe

df_one = pd.read_csv("https://raw.githubusercontent.com/mitchelljy/Trending-YouTube-Scraper/master/output/18.01.11_US_videos.csv")
df_one.head() 
Out[2]:
video_id title publishedAt channelId channelTitle categoryId trending_date tags view_count likes dislikes comment_count thumbnail_link comments_disabled ratings_disabled description
0 U68MJz9DrI4 Gucci Mane, Bruno Mars, Kodak Black - Wake Up ... 2018-10-31T14:17:10.000Z UCSugZEYrWbzqIWGD195V-YA OfficialGucciMane 10 18.01.11 officialguccimane|gucci mane|gucci|mane|atlant... 673533 91342 1480 8204 https://i.ytimg.com/vi/U68MJz9DrI4/default.jpg False False The official music video for Gucci Mane, Bruno...
1 shWNgr0ifrs Kevin Gates - Great Man [Official Music Video] 2018-10-31T14:00:08.000Z UCj2GTFekdV3EUsTVN8oaEqA kevingatesTV 10 18.01.11 kevin gates|kevin|gates|atlantic|atlantic reco... 207289 18647 256 1707 https://i.ytimg.com/vi/shWNgr0ifrs/default.jpg False False Kevin Gates - Great ManStream/Download - https...
2 JIFMN986m8s Worst Halloween Candy Taste Test (Day 3) 2018-10-31T10:00:03.000Z UC4PooiX37Pld1T8J5SYT-SQ Good Mythical Morning 24 18.01.11 gmm|good mythical morning|rhettandlink|rhett a... 1054156 39129 1236 10153 https://i.ytimg.com/vi/JIFMN986m8s/default.jpg False False We've arrived at the finals! Which of our Ehh ...
3 0iy3HPxBFQY James Corden & Ariana Grande Visit an Escape Room 2018-10-31T05:01:11.000Z UCJ0uqCI0Vqr2Rrt1HseGirg The Late Late Show with James Corden 24 18.01.11 The Late Late Show|Late Late Show|James Corden... 896693 49252 289 2247 https://i.ytimg.com/vi/0iy3HPxBFQY/default.jpg False False Since Ariana Grande loves Halloween and being ...
4 XeAClxSYQc8 Ellen's Backstage Scares Featuring Kris Jenner... 2018-10-31T13:00:04.000Z UCp0hYYBW6IMayGgR-WeoCvQ TheEllenShow 24 18.01.11 scare|montage|ellen staff|tv|kris jenner|ciara... 408612 18862 207 0 https://i.ytimg.com/vi/XeAClxSYQc8/default.jpg True False One of Ellen's favorite parts of Halloween is ...

Next, I will read in the Canada video data:

In [3]:
# Reading Canada csv into second dataframe

df_two = pd.read_csv("https://raw.githubusercontent.com/mitchelljy/Trending-YouTube-Scraper/master/output/18.01.11_CA_videos.csv")
df_two.head()
Out[3]:
video_id title publishedAt channelId channelTitle categoryId trending_date tags view_count likes dislikes comment_count thumbnail_link comments_disabled ratings_disabled description
0 ZAfAud_M_mg Halsey - Without Me 2018-10-29T15:58:23.000Z UCm3FgJ2Hqm7tb70T-GfwXVA HalseyVEVO 10 18.01.11 halsey without me|halsey|without me|halsey alo... 4467893 297950 6369 15626 https://i.ytimg.com/vi/ZAfAud_M_mg/default.jpg False False Without Me available now: https://halsey.lnk.t...
1 YyWru2XOiK0 Tyga - Dip (Official Video) ft. Nicki Minaj 2018-10-29T19:00:49.000Z UChXnu0HBydqedqhnClp0rJg TygaVEVO 10 18.01.11 Tyga|Dip|(Official|Video)|Last|Kings|Music|EMP... 5318708 252922 14229 24324 https://i.ytimg.com/vi/YyWru2XOiK0/default.jpg False False Download the new single, DIP. Out Now!Stream: ...
2 mwsJDfiOJdk Worst Halloween Candy Taste Test (Day 2) 2018-10-30T10:00:10.000Z UC4PooiX37Pld1T8J5SYT-SQ Good Mythical Morning 24 18.01.11 gmm|good mythical morning|rhettandlink|rhett a... 2037463 52481 1808 14696 https://i.ytimg.com/vi/mwsJDfiOJdk/default.jpg False False Day 2 of the worst Halloween Candy tournament ...
3 0iy3HPxBFQY James Corden & Ariana Grande Visit an Escape Room 2018-10-31T05:01:11.000Z UCJ0uqCI0Vqr2Rrt1HseGirg The Late Late Show with James Corden 24 18.01.11 The Late Late Show|Late Late Show|James Corden... 896693 49252 289 2247 https://i.ytimg.com/vi/0iy3HPxBFQY/default.jpg False False Since Ariana Grande loves Halloween and being ...
4 WZwr2a_lFWY IZ*ONE (아이즈원) - 라비앙로즈 (La Vie en Rose) MV 2018-10-29T09:00:05.000Z UC_pwIXKXNm5KGhdEVzmY60A Stone Music Entertainment 10 18.01.11 K-CULTURE korean Music MV Music Video K-Pop Kp... 7777735 425685 28512 71743 https://i.ytimg.com/vi/WZwr2a_lFWY/default.jpg False False IZ*ONE (아이즈원) - 라비앙로즈 (La Vie en Rose) MV 입니다....

Now, we are ready to process the data for further analysis!

  3. DATA PROCESSING / MUNGING

When I process the data, I am essentially reformatting the dataframe so that it is more readable and easier to manage when continuing on with the data exploration!

The first step in tidying this dataset was fixing the time format by converting the “published at” and “trending date” columns into datetime objects. For each row, I parse the string in its current format and then store the newly created datetime object back in the same column at that row’s index. The trending date holds only a date, while published at holds both a date and a time. As seen in the original dataframe, the published at column was formerly an ISO 8601 timestamp in UTC (the trailing “Z” stands for “Zulu” time), which made it hard to see at a glance when the video was posted. It was important to create these datetime objects so that I can later graph video trends over time. More info on Python’s datetime objects can be found here: https://docs.python.org/3/library/datetime.html

Next, I dropped the columns: video_id, channelId, tags, thumbnail_link and description. They will not be used in my analysis.

I repeated the same process for both the U.S. and CA dataframes.
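
The same conversion can also be sketched without a row-by-row loop by letting pandas parse each column in one call. This is an alternative to the cells below, not the method they use. The rows here are made up to mimic the two raw formats, and treating the trailing “Z” as UTC and then dropping the time zone is my assumption about how to reproduce the naive timestamps shown in the output.

```python
import pandas as pd

# Made-up rows that mimic the two raw column formats
raw = pd.DataFrame({
    "trending_date": ["18.01.11", "18.02.11"],
    "publishedAt": ["2018-10-31T14:17:10.000Z", "2018-10-29T15:58:23.000Z"],
})

# trending_date is year.day.month with a two-digit year
trending = pd.to_datetime(raw["trending_date"], format="%y.%d.%m")

# The trailing "Z" marks UTC: parse as UTC, then drop the zone to get naive timestamps
published = pd.to_datetime(raw["publishedAt"], utc=True).dt.tz_localize(None)

parsed = pd.DataFrame({
    "trending_date": trending.dt.date,  # keep only the calendar date
    "published_at": published,
})
print(parsed)
```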

In [4]:
# U.S.
# Converting trending date times and published at times into datetime objects
time_format = "%y.%d.%m"
time_format2 = "%Y-%m-%dT%H:%M:%S.000Z"

for index in df_one.index:
    
    # Trending date will be in the format YYYY-MM-DD
    date_time_obj = datetime.strptime(df_one.at[index,"trending_date"], time_format)
    df_one.at[index,"trending_date"] = date_time_obj.date()
    
    # Published at will be in the format YYYY-MM-DD HH:MM:SS
    date_time_obj2 = datetime.strptime(df_one.at[index,"publishedAt"], time_format2)
    df_one.at[index,"publishedAt"] = date_time_obj2

# Dropping irrelevant columns and saving the original dataframe into a new one
df_us = df_one.drop(columns=["video_id", "channelId", "tags", "thumbnail_link", "description"])

# Renaming columns for style and formatting purposes
df_us = df_us.rename(columns={"publishedAt": "published_at", "channelTitle": "channel_title", "categoryId": "category_id"})

df_us.head()
Out[4]:
title published_at channel_title category_id trending_date view_count likes dislikes comment_count comments_disabled ratings_disabled
0 Gucci Mane, Bruno Mars, Kodak Black - Wake Up ... 2018-10-31 14:17:10 OfficialGucciMane 10 2018-11-01 673533 91342 1480 8204 False False
1 Kevin Gates - Great Man [Official Music Video] 2018-10-31 14:00:08 kevingatesTV 10 2018-11-01 207289 18647 256 1707 False False
2 Worst Halloween Candy Taste Test (Day 3) 2018-10-31 10:00:03 Good Mythical Morning 24 2018-11-01 1054156 39129 1236 10153 False False
3 James Corden & Ariana Grande Visit an Escape Room 2018-10-31 05:01:11 The Late Late Show with James Corden 24 2018-11-01 896693 49252 289 2247 False False
4 Ellen's Backstage Scares Featuring Kris Jenner... 2018-10-31 13:00:04 TheEllenShow 24 2018-11-01 408612 18862 207 0 True False
In [5]:
# CANADA
# Converting trending date times and published at times into datetime objects
time_format = "%y.%d.%m"
time_format2 = "%Y-%m-%dT%H:%M:%S.000Z"

for index in df_two.index:
    # Trending date will be in the format YYYY-MM-DD
    date_time_obj = datetime.strptime(df_two.at[index,"trending_date"], time_format)
    df_two.at[index,"trending_date"] = date_time_obj.date()

    # Published at will be in the format YYYY-MM-DD HH:MM:SS
    date_time_obj2 = datetime.strptime(df_two.at[index,"publishedAt"], time_format2)
    df_two.at[index,"publishedAt"] = date_time_obj2
    
# Dropping irrelevant columns and saving the original dataframe into a new one
df_ca = df_two.drop(columns=["video_id", "channelId", "tags", "thumbnail_link", "description"])
# Renaming columns for style and formatting purposes
df_ca = df_ca.rename(columns={"publishedAt": "published_at", "channelTitle": "channel_title", "categoryId": "category_id"})

df_ca.head()
Out[5]:
title published_at channel_title category_id trending_date view_count likes dislikes comment_count comments_disabled ratings_disabled
0 Halsey - Without Me 2018-10-29 15:58:23 HalseyVEVO 10 2018-11-01 4467893 297950 6369 15626 False False
1 Tyga - Dip (Official Video) ft. Nicki Minaj 2018-10-29 19:00:49 TygaVEVO 10 2018-11-01 5318708 252922 14229 24324 False False
2 Worst Halloween Candy Taste Test (Day 2) 2018-10-30 10:00:10 Good Mythical Morning 24 2018-11-01 2037463 52481 1808 14696 False False
3 James Corden & Ariana Grande Visit an Escape Room 2018-10-31 05:01:11 The Late Late Show with James Corden 24 2018-11-01 896693 49252 289 2247 False False
4 IZ*ONE (아이즈원) - 라비앙로즈 (La Vie en Rose) MV 2018-10-29 09:00:05 Stone Music Entertainment 10 2018-11-01 7777735 425685 28512 71743 False False

The last step in my data processing phase is changing the category ids to category names so that readers can see the category in the dataframes without having to use a separate key. I first create a new column in the dataframe that will hold strings. Next, I use pandas to read in the category id JSON file that I uploaded to my notebook. I create a temporary dataframe for categories that stores the numeric id in one column and the corresponding name in the next. I then loop through the dataframe’s category_id column, with an inner loop over the temporary dataframe’s id column to check for equality. Once the ids match, I set the dataframe’s category_name to the temporary dataframe’s name at that index. After all the category names are updated, I drop the category_id column.
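
The lookup described above can also be sketched with a dictionary and `Series.map`, which avoids the nested loops. This is an alternative sketch, not the code used in the cells below. The dictionary stands in for the parsed JSON file and assumes the same `items` / `id` / `snippet` / `title` layout; the video rows and the id 99 are made up.

```python
import pandas as pd

# Made-up stand-in for the parsed category JSON (assumed layout)
category_json = {
    "items": [
        {"id": "10", "snippet": {"title": "Music"}},
        {"id": "24", "snippet": {"title": "Entertainment"}},
    ]
}

# Build an id -> name lookup once
id_to_name = {
    int(item["id"]): item["snippet"]["title"]
    for item in category_json["items"]
}

videos = pd.DataFrame({"title": ["A", "B", "C"], "category_id": [10, 24, 99]})

# map() looks up every row; ids missing from the lookup become NaN
videos["category_name"] = videos["category_id"].map(id_to_name)
print(videos)
```

One side effect of this approach is that an id with no matching category shows up as a missing value, which makes gaps in the category file easy to spot.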

In [6]:
# Reading in U.S. categories from JSON file & updating the U.S. dataframe
# to contain category names instead of category id's

df_us['category_name'] = np.nan # create new column
df_us['category_name'] = df_us['category_name'].astype(str) # change column type to string

# Reading in from the JSON file
us_categories = pd.read_json('US_category_id.json')

# Creating temporary dataframe for categories
df_categories = pd.DataFrame()
# Mapping ids from the JSON file into the column "id"
df_categories['id'] = us_categories['items'].map(lambda row: row['id'])
# Mapping names from the JSON file into the column "name"
df_categories['name'] = us_categories['items'].map(lambda row: row['snippet']['title'])

# change column type to int
df_us['category_id'] = df_us['category_id'].astype(int)
df_categories['id'] = df_categories['id'].astype(int)

# setting the U.S. dataframe's column "category_name" to each entry's corresponding category (based on its ID)
for index in df_us.index:
    for ind in df_categories.index:
        if (df_us['category_id'][index] == df_categories['id'][ind]):
            df_us.at[index, "category_name"] = df_categories.at[ind, "name"]

# Dropping the "category_id" column, no longer needed
df_us = df_us.drop(columns=["category_id"])
                             
df_us.head()
Out[6]:
title published_at channel_title trending_date view_count likes dislikes comment_count comments_disabled ratings_disabled category_name
0 Gucci Mane, Bruno Mars, Kodak Black - Wake Up ... 2018-10-31 14:17:10 OfficialGucciMane 2018-11-01 673533 91342 1480 8204 False False Music
1 Kevin Gates - Great Man [Official Music Video] 2018-10-31 14:00:08 kevingatesTV 2018-11-01 207289 18647 256 1707 False False Music
2 Worst Halloween Candy Taste Test (Day 3) 2018-10-31 10:00:03 Good Mythical Morning 2018-11-01 1054156 39129 1236 10153 False False Entertainment
3 James Corden & Ariana Grande Visit an Escape Room 2018-10-31 05:01:11 The Late Late Show with James Corden 2018-11-01 896693 49252 289 2247 False False Entertainment
4 Ellen's Backstage Scares Featuring Kris Jenner... 2018-10-31 13:00:04 TheEllenShow 2018-11-01 408612 18862 207 0 True False Entertainment
In [7]:
# Reading in CA categories from JSON file & updating the CA dataframe
# to contain category names instead of category id's

df_ca['category_name'] = np.nan # create new column
df_ca['category_name'] = df_ca['category_name'].astype(str) # change column type to string

# Reading in from the JSON file
ca_categories = pd.read_json('CA_category_id.json')

# Creating temporary dataframe for categories
df_categories2 = pd.DataFrame()
# Mapping ids from the JSON file into the column "id"
df_categories2['id'] = ca_categories['items'].map(lambda row: row['id'])
# Mapping names from the JSON file into the column "name"
df_categories2['name'] = ca_categories['items'].map(lambda row: row['snippet']['title'])

# change column type to int
df_ca['category_id'] = df_ca['category_id'].astype(int)
df_categories2['id'] = df_categories2['id'].astype(int)

# setting the Canada dataframe's column "category_name" to each entry's corresponding category (based on its ID)
for index in df_ca.index:
    for ind in df_categories2.index:
        if (df_ca['category_id'][index] == df_categories2['id'][ind]):
            df_ca.at[index, "category_name"] = df_categories2.at[ind, "name"]

# Dropping the "category_id" column, no longer needed
df_ca = df_ca.drop(columns=["category_id"])
                             
df_ca.head()
Out[7]:
title published_at channel_title trending_date view_count likes dislikes comment_count comments_disabled ratings_disabled category_name
0 Halsey - Without Me 2018-10-29 15:58:23 HalseyVEVO 2018-11-01 4467893 297950 6369 15626 False False Music
1 Tyga - Dip (Official Video) ft. Nicki Minaj 2018-10-29 19:00:49 TygaVEVO 2018-11-01 5318708 252922 14229 24324 False False Music
2 Worst Halloween Candy Taste Test (Day 2) 2018-10-30 10:00:10 Good Mythical Morning 2018-11-01 2037463 52481 1808 14696 False False Entertainment
3 James Corden & Ariana Grande Visit an Escape Room 2018-10-31 05:01:11 The Late Late Show with James Corden 2018-11-01 896693 49252 289 2247 False False Entertainment
4 IZ*ONE (아이즈원) - 라비앙로즈 (La Vie en Rose) MV 2018-10-29 09:00:05 Stone Music Entertainment 2018-11-01 7777735 425685 28512 71743 False False Music

Time to analyze!

  4. DATA EXPLORATION & ANALYSIS

In the exploratory analysis step, I will create numerous plots of the data to break it down and look for potential trends. Then, I will analyze whether any of the plots show strong correlations.

There are over 40,000 videos in the dataframe, but the first half of my analysis will only look at the top 20 trending videos by view count. View count is a driving factor in what makes a video trend, along with other factors such as likes/dislikes and comments, so I hope to see interesting differences between the U.S. and Canada’s data.

Now that I have the trending video data, I can graph which videos trended the most based on their view count. I plotted the top 20 most viewed YouTube videos in the U.S. and Canada separately.

In [8]:
# Top 20 Most Viewed Youtube Videos in the U.S.
plt.title("Top 20 Most Viewed Youtube Videos in the U.S.")
plt.xticks(rotation=90)

# Sorting most viewed videos by row in descending order
most_viewed_us = df_us.groupby('title').view_count.max().sort_values(ascending=False)[:20]

# Creating bar plot
sns.barplot(x=most_viewed_us.index, y=most_viewed_us.values)
plt.xlabel('Title')
plt.ylabel('Views (hundred million)')
plt.show()
In [9]:
# Top 20 Most Viewed Youtube Videos in Canada
plt.title("Top 20 Most Viewed Youtube Videos in Canada")
plt.xticks(rotation=90)

# Sorting most viewed videos by row in descending order
most_viewed_ca = df_ca.groupby('title').view_count.max().sort_values(ascending=False)[:20]

# Creating bar plot
sns.barplot(x=most_viewed_ca.index, y=most_viewed_ca.values)
plt.xlabel('Title')
plt.ylabel('Views (hundred million)')
plt.show()

Because no two of the top 20 videos had the same number of views, I could find the videos that trended in both the U.S. and Canada by comparing view counts across the two lists. I printed their titles along with their ranking within each region's top 20. The printed rankings are zero-based, so 0 denotes the most viewed video.

Interestingly enough, 6 of the 20 videos appeared in both countries' lists. However, none of the shared videos held the same rank in both. For instance, "XXXTENTACION & Lil Pump ft. Maluma & Swae Lee - Arms Around You (Official Lyrics Video)" was the most viewed in Canada but only the tenth most viewed in the U.S.
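
Matching on view counts only works while every count is unique. As a hedged alternative, the shared videos could be found by matching titles instead. The sketch below uses made-up titles and view counts, and reports 1-based ranks; it swaps in a different matching rule from the one used in the cell that follows.

```python
import pandas as pd

# Made-up view counts indexed by title, sorted descending like the top-20 lists
top_us = pd.Series({"Video A": 900, "Video B": 800, "Video C": 700})
top_ca = pd.Series({"Video C": 950, "Video D": 600, "Video A": 500})

# 1-based rank of each title within its own list
rank_us = pd.Series(range(1, len(top_us) + 1), index=top_us.index)
rank_ca = pd.Series(range(1, len(top_ca) + 1), index=top_ca.index)

# Titles present in both lists, kept in U.S. order
shared = [title for title in top_us.index if title in top_ca.index]

comparison = pd.DataFrame({
    "us_rank": rank_us[shared],
    "ca_rank": rank_ca[shared],
})
print(comparison)
```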

In [10]:
for index in range(len(most_viewed_us)):
    for index2 in range(len(most_viewed_ca)):
        # .iloc selects by position; the Series are indexed by title
        views_us = int(most_viewed_us.iloc[index])
        views_ca = int(most_viewed_ca.iloc[index2])
        
        if views_us == views_ca:
            print(most_viewed_us.index[index])
            print("U.S. ranking: " + str(index) + " Canada ranking: " + str(index2) + '\n')
XXXTENTACION & Lil Pump ft. Maluma & Swae Lee  - Arms Around You (Official Lyrics Video)
U.S. ranking: 9 Canada ranking: 0

Little Mix - Woman Like Me (Official Video) ft. Nicki Minaj
U.S. ranking: 10 Canada ranking: 1

Cardi B - Money (Official Audio)
U.S. ranking: 15 Canada ranking: 2

Jason Derulo x David Guetta - Goodbye (feat. Nicki Minaj & Willy William) [OFFICIAL MUSIC VIDEO]
U.S. ranking: 16 Canada ranking: 4

Grocery Store Stereotypes
U.S. ranking: 17 Canada ranking: 5

Lady Gaga, Bradley Cooper - I'll Never Love Again (A Star Is Born)
U.S. ranking: 18 Canada ranking: 6

Next, I wanted to plot likes against dislikes for the top 20 trending videos. I created a scatter plot with the likes along the x-axis and the dislikes along the y-axis. Each marker is a different color, and the legend on the right maps each color to a video title.

In [11]:
# US Likes vs Dislikes (Top 20 Trending Videos by Viewcount)

df_top20us = df_us.nlargest(20, 'view_count')

fig, ax = plt.subplots()
ax.scatter(df_top20us['likes'], df_top20us['dislikes'])

for index in df_top20us.index:
    plt.plot(df_top20us['likes'][index], df_top20us['dislikes'][index], marker='o', linestyle='', markersize=8, label=df_top20us['title'][index])

plt.title("Top 20 U.S. Trending Videos Likes vs Dislikes")
plt.xlabel('Likes (millions)')
plt.ylabel('Dislikes (hundred thousand)')
plt.legend(bbox_to_anchor=(1.05, 1),loc='upper left')
Out[11]:
<matplotlib.legend.Legend at 0x7f22bab95850>
In [12]:
# CA Likes vs Dislikes (Top 20 Trending Videos by Viewcount)

df_top20ca = df_ca.nlargest(20, 'view_count')

fig, ax = plt.subplots()
ax.scatter(df_top20ca['likes'], df_top20ca['dislikes'])

for index in df_top20ca.index:
    plt.plot(df_top20ca['likes'][index], df_top20ca['dislikes'][index], marker='o', linestyle='', markersize=8, label=df_top20ca['title'][index])

plt.title("Top 20 Canada Trending Videos Likes vs Dislikes")
plt.xlabel('Likes (millions)')
plt.ylabel('Dislikes (hundred thousand)')
plt.legend(bbox_to_anchor=(1.05, 1),loc='upper left')
Out[12]:
<matplotlib.legend.Legend at 0x7f22baad17f0>

Comparing the two plots, I noticed that they differed more than I expected.

In the U.S., there was an upward linear trend: more likes appear to correlate with more dislikes. I decided to recreate the plot with the line of best fit.

However, in Canada, the data points were much more scattered. I cannot conclude that likes have a strong correlation with dislikes. I replotted the above graph with the line of best fit as well. As seen, there are multiple outliers in the data.

These plots raise several smaller, interesting questions, such as: Is Canada's like-to-dislike ratio a reflection of more controversial videos in the top 20 most viewed? Are Canadian YouTube users more critical of trending music? It is worth noting that the U.S. graph had much higher counts of likes and dislikes than Canada's.

Although I can't draw any concrete conclusions, it is fascinating how data science plays a role in uncovering these minor cultural discrepancies. The videos may be similar between the two regions, but the like-to-dislike ratio varies.
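
One way to put a number on how "linear" each scatter plot looks is the Pearson correlation coefficient, which pandas exposes through `Series.corr`. The sketch below uses made-up like and dislike counts, not values from the dataset, purely to show the calculation; it is not part of the analysis above.

```python
import pandas as pd

# Made-up counts: one tightly linear set and one scattered set
tight = pd.DataFrame({"likes": [100, 200, 300, 400],
                      "dislikes": [10, 20, 30, 40]})
loose = pd.DataFrame({"likes": [100, 200, 300, 400],
                      "dislikes": [40, 10, 35, 15]})

# Series.corr defaults to the Pearson coefficient, which ranges from -1 to 1
r_tight = tight["likes"].corr(tight["dislikes"])
r_loose = loose["likes"].corr(loose["dislikes"])

print(round(r_tight, 3), round(r_loose, 3))
```

A value near 1 or -1 indicates a strong linear relationship, while a value near 0 indicates a weak one. With only 20 points per region, a single outlier can shift the coefficient noticeably, so it should be read alongside the plots rather than in place of them.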

In [13]:
# Plotting the line of best fit for the U.S. likes vs dislikes
sns.lmplot(x='likes', y='dislikes', data=df_top20us, fit_reg=True)
Out[13]:
<seaborn.axisgrid.FacetGrid at 0x7f22ba9ca880>
In [14]:
# Plotting the line of best fit for Canada's likes vs dislikes
sns.lmplot(x='likes', y='dislikes', data=df_top20ca, fit_reg=True)
Out[14]:
<seaborn.axisgrid.FacetGrid at 0x7f22ba90f310>

Next, I wanted to analyze how much time passed between when each of the top 20 trending videos was published and the date it was trending.

I drew two scatter plots on the same graph: one with the published date on the x-axis and the other with the trending date on the x-axis. The y-axis shows the video titles for both.
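
The gap between the two dates can also be computed directly instead of being read off the graph. This is a minimal sketch with made-up rows. It assumes both columns are pandas datetime columns; a column of plain Python date objects would first need to go through `pd.to_datetime`. The column name `days_to_trend` is hypothetical.

```python
import pandas as pd

# Made-up rows shaped like the processed dataframe
videos = pd.DataFrame({
    "title": ["Video A", "Video B"],
    "published_at": pd.to_datetime(["2018-10-29 15:58:23", "2018-10-31 05:01:11"]),
    "trending_date": pd.to_datetime(["2018-11-01", "2018-11-01"]),
})

# Compare calendar dates only, so the time of day does not produce partial days
lag = videos["trending_date"] - videos["published_at"].dt.normalize()
videos["days_to_trend"] = lag.dt.days

print(videos[["title", "days_to_trend"]])
```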

In [23]:
plt.figure(figsize=(10,6))

plt.title("Published Date & Trending Date (U.S. Top 20)")

# Use the titles from the same dataframe so each date lines up with its own video
plt.scatter(df_top20us['published_at'], df_top20us['title'], color='red')
plt.scatter(df_top20us['trending_date'], df_top20us['title'], color='green')
plt.xlabel('Date')
plt.ylabel('Titles')
plt.show()
In [22]:
plt.figure(figsize=(10,6))

plt.title("Published Date & Trending Date (Canada Top 20)")

# Use the titles from the same dataframe so each date lines up with its own video
plt.scatter(df_top20ca['published_at'], df_top20ca['title'], color='red')
plt.scatter(df_top20ca['trending_date'], df_top20ca['title'], color='green')
plt.xlabel('Date')
plt.ylabel('Titles')
plt.show()

Finally, I will explore the most popular categories amongst all the trending videos in each region.

I counted the number of times each category name appeared and printed them out. Then I created a barplot with the category names along the x-axis and the corresponding counts on the y-axis.

In [20]:
# Number of Videos by Category (U.S.)

category_counts_us = df_us.value_counts('category_name')
print(category_counts_us)

plt.title("Number of Videos by Category (U.S.)")
plt.xticks(rotation=90)

sns.barplot(x=category_counts_us.index, y=category_counts_us.values)
plt.xlabel('Category Name')
plt.ylabel('Counts')
plt.show()
category_name
Music                   58
Entertainment           45
Howto & Style           18
Film & Animation        15
Comedy                  15
People & Blogs          12
Sports                  10
News & Politics          7
Gaming                   6
Science & Technology     5
Education                4
Pets & Animals           3
Travel & Events          1
Autos & Vehicles         1
dtype: int64
In [21]:
# Number of Videos by Category (Canada)

category_counts_ca = df_ca.value_counts('category_name')
print(category_counts_ca)

plt.title("Number of Videos by Category (Canada)")
plt.xticks(rotation=90)

sns.barplot(x=category_counts_ca.index, y=category_counts_ca.values)
plt.xlabel('Category Name')
plt.ylabel('Counts')
plt.show()
category_name
Entertainment           23
Music                   17
Howto & Style           10
Film & Animation         8
Comedy                   8
Sports                   5
Education                4
Science & Technology     3
People & Blogs           2
Autos & Vehicles         2
Travel & Events          1
Shows                    1
dtype: int64

Since each region has its own set of category names, there were some differences between the U.S. and Canada barplots.

Both had Music and Entertainment as the top two categories with the most trending videos, although in opposite order. These two were followed by the same next two categories: Howto & Style and Film & Animation. In the latter half of the barplots, even the category names differed. For instance, the U.S. had categories for News & Politics, Gaming and Pets & Animals that did not appear for Canada. Meanwhile, Canada had a category for Shows.
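
Because the two regions contain different numbers of videos, raw counts are hard to compare side by side. One option is to compare each category's share of its region's total instead. The sketch below uses made-up category labels, not the real data, and the names `us`, `ca` and `shares` are hypothetical.

```python
import pandas as pd

# Made-up category labels for each region
us = pd.Series(["Music", "Music", "Entertainment", "Gaming"])
ca = pd.Series(["Entertainment", "Entertainment", "Music", "Shows"])

# normalize=True turns counts into fractions of each region's total
shares = pd.concat(
    {"US": us.value_counts(normalize=True),
     "CA": ca.value_counts(normalize=True)},
    axis=1,
).fillna(0)  # a category missing from one region gets a share of 0

print(shares)
```

Laid out this way, categories that appear in only one region stand out as zeros in the other region's column.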

The most popular categories of trending videos are important information for anyone entering content creation. With so many videos being uploaded daily, the competition is high. For instance, while music-related videos trend more often, that could simply be a result of many more music-related videos being posted. It is hard to say whether a video has a higher chance of making the trending page just because it falls into one of the more common trending categories.

This data can also be used when analyzing other research topics, such as: What content is most commonly searched across countries? What hobbies and interests are most popular in different regions?

More information on YouTube categories and what specific content falls into which category can be found here: https://techpostplus.com/youtube-video-categories-list-faqs-and-solutions/

  5. CONCLUSION

This tutorial was a fun way to analyze one of the most popular platforms in the world: YouTube. With regard to my earlier question of how the U.S. and Canada differ culturally through their top trending YouTube videos, I conclude that we are indeed more similar than not.

The graphs of the top 20 most viewed trending videos showed that there are already many video crossovers between the two regions in such a small sample. Then, the categories revealed that the top 5 categories in both regions were the same: music, entertainment, howto & style, film & animation and comedy.

The scatter plots of likes vs dislikes revealed how the regions' like-to-dislike ratios differ for their top 20 trending videos. Canada's graph showed a less predictable like-to-dislike ratio than the U.S. graph. This analysis could open up new discussions on the details of the top 20 trending videos and how they are perceived differently in each region.

For all the future young and old content creators, data science is your friend! By analyzing and understanding data on trending YouTube videos, you'll better know what it takes to go viral yourself!