'Number of Years Genres Have the Highest Vote Average', 'Proportion of Years Genres Have the Highest Vote Average', 'Number of Years Genres Have the Highest Average Budget', 'Proportion of Years Genres Have the Highest Average Budget', 'Number of Years Genres Have the Highest Average Revenue', 'Proportion of Years Genres Have the Highest Average Revenue', 'Number of Years Genres Have the Highest Average Profit', 'Proportion of Years Genres Have the Highest Average Profit'. 'runtime', 'genres', 'production_companies', 'release_date', As a data science newbie and self-learner, this definitely encouraged me a lot. We use essential cookies to perform essential website functions, e.g.

Then, drop the null values with dropna in cast, director, genres columns since they are just in tiny amount. Dealing with Multiple Values Columns You can try it for yourself here. Partial conclusion mean_high = high['popularity'].mean()

calculated. GitHub is home to over 50 million developers working together to host and review code, manage projects, and build software together.

Name: vote_average, dtype: int64

id imdb_id popularity budget revenue original_title cast director runtime genres production_c

Name: vote_average, dtype: int64 Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. Droping Extraneous Columns In [30]: # should return False Question 2: The distribution of revenue in different popularity levels in recent five years.

And from the table bellow, after transfer all zero values to null values in budget and revenue data, we can see that both the distribution of budget and revenue are much better, without too concentrate on the zero value or small values.


service is used by millions of people while we process over 2 billion requests. Check out the result.

I left out the issue on the later stage. In [39]: # mean rating for each revenue level

Learn more. MacFarlane If you wish to opt out, please close your SlideShare account. Out[20]:

In this project, we started our analysis by examining the most popular movie by genre.

Out[64]: budget_adj popularity

Khan|Vi... Chris 6.951084 0.578849 8

The movie dataset, which is originally from Kaggle, was cleaned and provided by Udacity.

Robert De Niro, Bruce Wills, Samuel L jackson, Nicolas Cage have worked in more than 50 movies. sure your experience on TMDb is nothing short of amazing.

The mean of the revenue_adj column across all years for each dataframe. plt.title('Popularity by Revenue Level', fontsize=15) Adventure movies are also the most profitible on average.

plt.show() [1]: https://www.themoviedb.org/about 6. runtime int64

director 10821 non-null object

Rylance|Amy memory usage: 1.5+ MB Do movies with highest revenue have more popularity?

Theron|Hugh These files contain metadata for all 45,000 movies listed in the Full MovieLens Dataset. If you continue browsing the site, you agree to the use of cookies on this website. You provide a query string and we provide the closest match. Anna

… homepage object movies.groupby('budget_adj')['popularity'].value_counts().tail(10) Questions in the projects are as follows: In this process the main idea is to take a quick glance on the data set, find the potential unreasonable data value, unnecessary variables for my research question, null data or duplicates, and then make data clearing decisions. Features with missing values

By continuing to use TMDb, you are agreeing to this policy. revenue_adj float64 As one of the important steps I have joined Data Analyst Nanodegree. locations = [1,2] 1.574815e+09 6.6 16 Note: This project was completed as a part of Udacity Data Analyst Nanodegree that I finished in March, 2018. You will need an installation of Python, plus the following libraries: In [31]: movies.shape ### The TMDb revenue_adj 10865 non-null float64 # Create the string to eval the dataframe of the current genre's movies. It’s kind of huge amounts. Woodley|Theo Search for any missing information, like missing genre fields.

Dallas This dataset also has files containing 26 million ratings from 270,000 users for all 45,000 movies.

In [56]: # 10 first values Trevorrow Universa We can see that there are no null values except the keywords and production_companies that I decided to keep before. budget 10865 non-null int64 2.

Here we inspect the column revenue, revenue_adj, budget and budget_adj counting the number of rows having 0 values


labels = ['low', 'high'] Use Git or checkout with SVN using the web URL. In [40]: heights = [mean_low, mean_high] revenue_adj 10865 non-null float64

Or just use function to retrieve the information I want?

Introduction 1.012787 6.5 48 movies.head() 1 76341 tt1392190 28.419936 150000000 378436354 In this project, we have to analyze a dataset and then communicate our findings about it. 2.541001e+08 7.3 18 There are some odd characters in the ‘cast’ column.

2 262500 tt2908446 13.112507 110000000 295238201 Insurgent We use essential cookies to perform essential website functions, e.g. In [61]: heights = [mean_low, mean_high] plt.scatter(movies['budget_adj'], movies['popularity'], linewidth=5) It is difficult to say the movies with high budget have a better rating since according to the histogram, the height of the plt.title('Vote Ratings by Budget Level', fontsize=15)

In [50]: # 10 last values This does limit the report because it will not look at every combination of genre as separate groups.

World First count the zero value in the zero budget dataframe . Fiction

This is my first part for the project, for the next part I will post the Exploratory Data Analysis part and question finding results!

popularity 10814 7.1. Investigating the TMDb Movie Data ... 23 movies are missing genre information, so given the size of the dataset, the simplest solution is to remove these movies from the dataset for this analysis.

genres 10842 non-null object

imdb_id 10855 non-null object In [22]: # inspecting the movies and budget columns The number samples of the dataset plt.xlabel('Revenue Level', fontsize=12) Learn more.

Out[63]: budget_adj popularity /search - Text based search is the most common way.

Pitch Perfect 8.102293 0.626646 45 We've, then examined, the movie popularity year by year. Along with extensive metadata for movies, TV shows and people, we also offer one of the If nothing happens, download the GitHub extension for Visual Studio and try again. Khan|Vi...

Additionally, this analysis is limited because of how the averages of each year and overall were divided and compared by individual genre.

runtime 10865 non-null int64 count 10865.000000 10865.000000 1.086500e+04 1.086500e+04 10865.000000 10865.000000 10865.000000 10865.000000