Video Games Sales Data Analysis

13 minute read

Introduction

Have you ever wondered what the most played video game of the last thirty years is? Or which gaming platform has been used the most in the last decade? Is Super Mario the most popular game ever? Or is Call of Duty the most appealing title of the last decade?

In this post, we are going to explore advanced data manipulation techniques that are typically used to translate raw data into insightful plots and charts — enabling you to answer these types of questions.

To perform this analysis, we will use the Python language, and we will explore two of the most important data analytics libraries: Pandas and Matplotlib.

Here I will just report the most important insights from the analysis. You can however check the full source code here to better understand the corresponding implementation. The data required for this analysis can be found at this stable link.

1. Data Manipulation and Cleaning

1.1 Data Ingestion

We read the file 'data/vgsales.csv' using the pandas read_csv() function, and store it into the variable data.

import pandas as pd

data = pd.read_csv('data/vgsales.csv')

We now check the overall composition of the dataset. It contains sales figures, split by region, for different video games and years. The dataset is not the most up to date, but it covers titles up to 2016. We inspect it with the .info() method:
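data.info()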

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16598 entries, 0 to 16597
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Rank          16598 non-null  int64  
 1   Name          16598 non-null  object
 2   Platform      16598 non-null  object
 3   Year          16327 non-null  float64
 4   Genre         16598 non-null  object
 5   Publisher     16540 non-null  object
 6   NA_Sales      16598 non-null  float64
 7   EU_Sales      16598 non-null  float64
 8   JP_Sales      16598 non-null  float64
 9   Other_Sales   16598 non-null  float64
 10  Global_Sales  16598 non-null  float64
dtypes: float64(6), int64(1), object(4)
memory usage: 1.4+ MB

1.2 Data Cleaning

We can clearly see that there are a few null values in both the Publisher and Year columns. The strategy here is to remove them.
Let us remove the null values using the pandas .dropna() method:

data.dropna(
  subset=['Publisher', 'Year'],
  how='any',
  inplace=True
)

The dataset was generated at the end of 2016. Hence, each observation with a Year greater than or equal to 2017 must be considered either corrupted or incorrect, and must therefore be removed.

data = data.query('Year < 2017')

For more information on the query method, please check this post here.

1.3 Data Consistency

Before moving to a proper data analysis, we need to be sure the data is consistent. By consistency I mean an intrinsic characteristic of the data: for instance, a Publisher name might contain a typo, or the same Publisher might be identified by several different names.

To investigate this, let us check how Sony appears inside our dataset: we access the Publisher column, which is of type object, and look at the different names under which Sony appears:

data[data['Publisher'].str.contains(
  'sony',
  case=False
  )]['Publisher'].value_counts()
Sony Computer Entertainment            682
Sony Computer Entertainment Europe      15
Sony Online Entertainment                8
Sony Computer Entertainment America      3
Sony Music Entertainment                 1
Name: Publisher, dtype: int64

Clearly, the Sony publisher is not identified homogeneously across observations. We might think of mapping each of these variants to a more general label, say "Sony", to reduce heterogeneity in our data.

To do this, we create a custom method, called merging_info_publisher, to be called whenever we wish to perform this kind of cleaning on our data. You can find it in the aforementioned source code (a minimal sketch is also shown right after the next check). Let us now check how "Sony" is mapped inside our dataset after the transformation:

data[data['Publisher'].str.contains(
  'Sony'
)]['Publisher'].value_counts()
Sony    709
Name: Publisher, dtype: int64
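For reference, here is a minimal sketch of what such a helper could look like. The signature and the keyword handling are assumptions on my side; the actual implementation lives in the source code:

def merging_info_publisher(df, keyword, target_name):
    # map every Publisher containing `keyword` (case-insensitive) to `target_name`
    mask = df['Publisher'].str.contains(keyword, case=False, na=False)
    df.loc[mask, 'Publisher'] = target_name
    return df

data = merging_info_publisher(data, 'sony', 'Sony')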

This pattern is possibly repeated for other publishers as well. Here we identify a few Publishers that might appear under different labels inside the dataset, and we apply the merging_info_publisher method to each of them. After these transformations, let us check the absolute distribution of the top 20 Publishers in our dataset:

data['Publisher'].value_counts().head(20)
Electronic Arts                           1341
Activision                                 996
Ubisoft                                    931
Namco                                      928
Konami                                     823
THQ                                        712
Sony                                       709
Nintendo                                   696
Sega                                       630
Take-Two Interactive                       412
Capcom                                     376
Atari                                      347
Tecmo Koei                                 338
Square Enix                                231
Warner Bros. Interactive Entertainment     217
Disney Interactive Studios                 214
Midway Games                               196
Eidos Interactive                          196
505 Games                                  192
Microsoft Game Studios                     189
Name: Publisher, dtype: int64

2. Total Games Released Each Year

We now want to know when the video game industry experienced a drastic development. Based on the number of games released each year, we might be able to find out when the video game boom happened. We store the distribution of video game releases by year in the variable counter_df_by_year.

# filtered_data is the cleaned dataset obtained in Section 1 (see the source code)
counter_df_by_year = filtered_data[
  ['Name', 'Year']
  ].drop_duplicates().groupby('Year').count()

counter_df_by_year.rename(
  columns={'Name': 'Number of Games'},
  inplace=True
)

Let us embed that dataframe into a graphical dimension, using matplotlib:

import matplotlib.pyplot as plt

with plt.style.context('ggplot'):
    fig, ax = plt.subplots(figsize=(10,8))
    ax.plot(
      counter_df_by_year.index,
      counter_df_by_year['Number of Games'],
      marker='d'
    )
    ax.set_xlabel('Year')
    ax.set_ylabel('Number of Games')
    ax.set_title('Evolution of Video Games Industry')
    plt.show()

There was a significant boom in the late 2000s. Since then, the number of distinct releases has shrunk, possibly due to a stronger convergence towards popular titles by both customers and developers.

3. Publisher Analysis with respect to Global Sales

Instead of considering the absolute frequencies (the number of video game releases) of the top publishers, a better proxy is to rank the top publishers by Global Sales, identified by the column Global_Sales.
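One possible way to compute this ranking is sketched below; the variable name top_publishers_by_sales is mine, and the actual chart is produced in the source code:

top_publishers_by_sales = (
    data.groupby('Publisher')['Global_Sales']
    .sum()
    .sort_values(ascending=False)
    .head(10)
)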

Nintendo is the most popular publisher (not surprisingly), followed by two other famous Publishers: Electronic Arts and Activision.

We now want to go further and see which Platform was the most popular in each Year. To do so, we again use as a proxy the total Global Sales of the video games released for each specific Platform.

However, this requires a little bit of data wrangling, so we need to perform a few steps to answer this question. I kindly ask you to look at the source code for more details; I will just briefly explain what I do there:

  1. First of all, we count the number of video games by Platform using the .groupby() method;
  2. We then need two steps:
    • pick the top 20 platforms with respect to the column Total Observations;
    • filter the data so that only those Platforms are kept.
  3. The resulting dataframe, call it filtered_data_top20, is then used to aggregate the data with respect to Global Sales using the pandas .pivot_table() method. We would like to have the Platforms as columns and the Year as index. Ideally, this is done like this:
pivoted_data = filtered_data_top20.pivot_table(
    index='Year',
    columns='Platform',
    values='Global_Sales',
    aggfunc='sum',
    fill_value=0
)
  4. The next part is a little bit tricky: we now want to find the Platform with the highest sales in each distinct year. To do that we employ the NumPy function argsort(), which sorts, for each row, the observations in ascending order. We then select the column names according to that ordering, so that the platform with the highest aggregated Global Sales ends up in the last position. A sketch of steps 1, 2 and 4 is shown right after this list.
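Under the assumption that filtered_data is the cleaned dataset built in Section 1 (the exact variable names in the source code may differ), steps 1, 2 and 4 could be sketched as follows; note that steps 1 and 2 would run before the pivot shown in step 3:

import numpy as np

# steps 1-2: count releases per Platform and keep the 20 most represented platforms
obs_by_platform = (
    filtered_data.groupby('Platform')['Name']
    .count()
    .rename('Total Observations')
)
top20_platforms = obs_by_platform.sort_values(ascending=False).head(20).index
filtered_data_top20 = filtered_data[filtered_data['Platform'].isin(top20_platforms)]

# step 4: argsort each row of the pivoted table (ascending) and take the last
# column name, i.e. the platform with the highest aggregated Global Sales
sorted_cols = np.argsort(pivoted_data.values, axis=1)
most_popular_platform_by_year = pd.DataFrame({
    'Year': pivoted_data.index,
    'Platform': pivoted_data.columns[sorted_cols[:, -1]],
}).reset_index(drop=True)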

The following table shows the tail of the yearly distribution of the most popular platforms during the last 40 years, obtained by following the above steps.

Year Platform
32 2012.0 PS3
33 2013.0 PS3
34 2014.0 PS4
35 2015.0 PS4
36 2016.0 PS4

A natural question now is: which was the most popular game in each single year? To answer this question, we can employ the code below:

most_popular_games = pd.DataFrame()
for _, row in most_popular_platform_by_year.iterrows():
    year = row['Year']
    platform = row['Platform']

    inner_df = filtered_data.query("Year == @year & Platform==@platform")

    pivoted_table_year_platform = inner_df.pivot_table(
        index='Year',
        columns='Name',
        values='Global_Sales',
        aggfunc='sum',
        fill_value=0
    )

    temp_col_max_value = pivoted_table_year_platform.max(axis=1).to_frame() # finds max value by row
    temp_col_max_value.rename(columns={0:'Total sales (ML of units)'}, inplace=True)

    temp_col_max = pivoted_table_year_platform.idxmax(axis=1).to_frame() # find the column with the greatest value on each row
    temp_col_max.rename(columns={0:'Most Wanted Title'}, inplace=True)

    merging_dfs = pd.concat([temp_col_max, temp_col_max_value], axis=1)

    # DataFrame.append() is removed in recent pandas versions, so we use pd.concat
    most_popular_games = pd.concat([most_popular_games, merging_dfs])

As we can see from the output shown below, there are many famous titles: from Pac-Man to The Legend of Zelda in the '80s. The '90s are obviously dominated by the Super Mario saga, but interestingly the most sold title of that decade was Gran Turismo: what a game! I still remember it. For its time, it was an absolute breakthrough. Also note that in 1995 Sony released the PlayStation, and not surprisingly games related to that platform were the most sold in the second half of the '90s.

Most Wanted Title Total sales (ML of units)
Year
1980.0 Asteroids 4.31
1981.0 Pitfall! 4.50
1982.0 Pac-Man 7.81
1983.0 Pitfall II: Lost Caverns 1.31
1984.0 Beamrider 0.27
1985.0 Ghostbusters 0.45
1986.0 Solaris 0.37
1987.0 Kung-Fu Master 0.65
1988.0 River Raid II 0.51
1989.0 Double Dragon 0.47
1990.0 Super Mario World 20.61
1991.0 The Legend of Zelda: A Link to the Past 4.61
1992.0 Super Mario Kart 8.76
1993.0 Super Mario All-Stars 10.55
1994.0 Donkey Kong Country 9.30
1995.0 Namco Museum Vol.1 3.84
1996.0 Crash Bandicoot 6.82
1997.0 Gran Turismo 10.95
1998.0 Tekken 3 7.16
1999.0 Gran Turismo 2 9.49
2000.0 Final Fantasy IX 5.30
2001.0 Gran Turismo 3: A-Spec 14.98
2002.0 Grand Theft Auto: Vice City 16.15
2003.0 Need for Speed Underground 7.20
2004.0 Grand Theft Auto: San Andreas 20.81
2005.0 Madden NFL 06 4.91
2006.0 Wii Sports 82.74
2007.0 Wii Fit 22.72
2008.0 Mario Kart Wii 35.82
2009.0 Wii Sports Resort 33.00
2010.0 Kinect Adventures! 21.82
2011.0 Call of Duty: Modern Warfare 3 13.46
2012.0 Call of Duty: Black Ops II 14.03
2013.0 Grand Theft Auto V 21.40
2014.0 Grand Theft Auto V 11.98
2015.0 Call of Duty: Black Ops 3 14.24
2016.0 FIFA 17 4.77

Moving to modern times, we see that GTA V was quite a popular game and, not surprisingly, so was Call of Duty: Black Ops 3, a very good title from Activision.

Given the above results, we can perform a little more data manipulation and obtain the following result, which shows the most sold title by Platform. Interestingly, GTA V was the most sold title for both PS3 and PS4: unbelievable!
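A possible first step for this manipulation (a sketch on my side, using the variables defined earlier in this post) is to attach to each yearly winner the Platform it belongs to; the exact aggregation by Platform that produces the table below is in the source code:

# most_popular_games is indexed by Year, so we join it with the
# yearly winning platforms computed before
winners_with_platform = most_popular_games.merge(
    most_popular_platform_by_year,
    left_index=True,
    right_on='Year'
)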

Most Wanted Title Total sales (ML of units)
Platform
2600 Solaris 7.81
PS Tekken 3 10.95
PS2 Need for Speed Underground 20.81
PS3 Grand Theft Auto V 21.40
PS4 Grand Theft Auto V 14.24
SNES The Legend of Zelda: A Link to the Past 20.61
Wii Wii Sports Resort 82.74
X360 Kinect Adventures! 21.82

Surprisingly, none of the Super Mario titles appears in this list! Hence, we can conclude with one final analysis: which were the most sold video games ever?

We are interested in investigating which were the most sold titles of the last 40 years. To do so, we employ the .groupby() method and store the result in the most_wanted_vg variable.

# each row of the dataset is a single (title, platform) release, so this
# groups the releases by their Global_Sales figure
most_wanted_vg = filtered_data_top20[
    ['Name', 'Global_Sales']
].groupby('Global_Sales').sum()

most_wanted_vg.sort_values(
  by='Global_Sales',
  ascending=False
).head(10)

The result is shown here below:

Name
Global_Sales
82.74 Wii Sports
35.82 Mario Kart Wii
33.00 Wii Sports Resort
30.01 New Super Mario Bros.
29.02 Wii Play
28.62 New Super Mario Bros. Wii
24.76 Nintendogs
23.42 Mario Kart DS
22.72 Wii Fit
22.00 Wii Fit Plus

It looks like Nintendo games were the most sold! Wii Sports was a huge success for Nintendo. The classic Mario games, however, account for 4 of the top 10 most popular titles.

Conclusions

I hope you have enjoyed this analysis: it shows how important pandas and Matplotlib are in helping you turn raw data into interpretable results.

If you want to replicate the results seen in this post, please check the source code here.

If you have enjoyed this post, please do not forget to star the repository where the source code is placed. See you in the next post!