Written by Adam Green.
The large ecosystem of libraries is one of the reasons Python is a popular choice for data science. Yet with so many different libraries available, it can be tough to know where to start.
Pandas and Matplotlib are a great place to start your journey into Python data science & analysis tools.
Together these two libraries offer enough to start doing meaningful data work in Python.
Pandas is a Python library for working with tabular data.
Tabular data has rows and columns. If you have worked with Excel, you have already worked with tabular data.
In this article we will work with a cities dataset - a tabular dataset with three rows and three columns:

| city     | population | hemisphere |
|----------|------------|------------|
| auckland | 1.6        | south      |
| berlin   | 3.6        | north      |
| london   | 8.9        | north      |
You can install pandas with pip, a Python package manager:
```shell-session
$ pip install pandas
```
You can also find a notebook with all the code we develop in this article on Github or on Google Colab.
Pandas is used by importing the pandas package, commonly aliased to pd:
```python
import pandas as pd
```
At the core of Pandas is the pd.DataFrame.

A pd.DataFrame has three parts:

1. the index,
2. the columns,
3. the values.

The index and columns are labels - they tell us what a row or column represents, such as the population column label. The values are the data itself, like 1.6.
One way to create a DataFrame is from a Python dictionary, where the dictionary keys are the column names and the dictionary values are the data:
```python
import pandas as pd

pd.DataFrame({
    'city': ['auckland', 'berlin', 'london'],
    'population': [1.6, 3.6, 8.9],
    'hemisphere': ['south', 'north', 'north'],
})
"""
       city  population hemisphere
0  auckland         1.6      south
1    berlin         3.6      north
2    london         8.9      north
"""
```
Being able to construct DataFrames from Python objects such as dictionaries and lists is handy, but it only works for small datasets that we type manually in code - fine for creating test data in unit tests, but not practical for large datasets.
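As a rough sketch of the unit-test use case (the test function and the check are hypothetical, not part of the original article), a hand-built DataFrame is small enough to type directly inside a test:

```python
import pandas as pd


def test_population_is_positive():
    # hand-built DataFrame - small enough to type directly in the test
    data = pd.DataFrame({
        'city': ['auckland', 'berlin'],
        'population': [1.6, 3.6],
    })
    # hypothetical check - every city should have a positive population
    assert (data['population'] > 0).all()
```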
A more common way to create a DataFrame is from a CSV file. Pandas handles this with pd.read_csv, which reads data from a CSV file on our local computer.
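For example, a minimal sketch of reading a local CSV file (the file is written first so the example is self-contained - the cities.csv filename is just an assumption for illustration):

```python
import pandas as pd

# write a small CSV file so the example is self-contained
pd.DataFrame({
    'city': ['auckland', 'berlin', 'london'],
    'population': [1.6, 3.6, 8.9],
    'hemisphere': ['south', 'north', 'north'],
}).to_csv('cities.csv', index=False)

# read the CSV file back from the local filesystem
data = pd.read_csv('cities.csv')
```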
This same function can also be used to read data from the Internet. Below we read in our cities dataset using pd.read_csv to read from a Github URI:
```python
import pandas as pd

data = pd.read_csv(
    'https://raw.githubusercontent.com/ADGEfficiency/data-science-south-data/main/cities/cities.csv'
)
data
"""
       city  population hemisphere
0  auckland         1.6      south
1    berlin         3.6      north
2    london         8.9      north
"""
```
This is the way we will load our cities dataset in the rest of this post - reading directly from a public URI (such as a URL).

We can access the three parts of our DataFrame as attributes of an initialized pd.DataFrame object:
```python
data.index
# RangeIndex(start=0, stop=3, step=1)

data.columns
# Index(['city', 'population', 'hemisphere'], dtype='object')

data.values
"""
[['auckland' 1.6 'south']
 ['berlin' 3.6 'north']
 ['london' 8.9 'north']]
"""
```
The index pandas created for us is a range index (also called an integer index). Rows are labelled with a sequence of integers ([0, 1, 2]):
```python
data.index
# RangeIndex(start=0, stop=3, step=1)

data
"""
       city  population hemisphere
0  auckland         1.6      south
1    berlin         3.6      north
2    london         8.9      north
"""
```
We can replace this with a more meaningful index using set_index, turning the city column into the index:
```python
data = data.set_index('city')
data
"""
          population hemisphere
city
auckland         1.6      south
berlin           3.6      north
london           8.9      north
"""
```
We have lost our original integer index of [0, 1, 2] and gained an index of city - great!
A basic operation in data analysis is selection - selecting rows and columns. Pandas offers two ways to do this - by label and by integer position.

In Pandas we select by label using loc and by position using iloc. These two selection methods can be used on both the rows (index) and the columns of a DataFrame. We can use : to select an entire row or column.
loc Uses Labels

loc selects based on the labels of the rows (index) and columns - we use it to select data based on labels:
```python
# select the berlin row, all columns
data.loc['berlin', :]

# select all rows, the population column
data.loc[:, 'population']
```
iloc Uses Position

iloc selects based on the integer position of the rows and columns - we use it to select data based on position:
```python
# select the first row, all columns
data.iloc[0, :]

# select all rows, second column
data.iloc[:, 1]
```
Selecting based on position is very useful when your data is sorted.

iloc is also why a range index isn't that useful in pandas - we can always use iloc to select data based on its position.
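As a quick sketch of this (not from the original article), we can sort the cities by population and then use iloc to grab the row of the largest city, whatever its label happens to be:

```python
import pandas as pd

data = pd.read_csv(
    'https://raw.githubusercontent.com/ADGEfficiency/data-science-south-data/main/cities/cities.csv'
).set_index('city')

# sort by population, largest first, then select the first row by position
largest = data.sort_values('population', ascending=False).iloc[0, :]
print(largest)
```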
Now that we have been introduced to loc & iloc, let's use them to answer two questions about our cities dataset.
What is the population of Auckland?

We can answer this using loc, selecting the auckland row and the population column:
```python
data.loc['auckland', 'population']
# 1.6
```
Which hemisphere is our first city in?

We can answer this using iloc to select the first row with 0 and loc to select the hemisphere column:
```python
data.iloc[0].loc['hemisphere']
# 'south'
```
Another basic operation in data analysis is filtering - selecting rows or columns based on conditional logic (if statements, equalities like == and inequalities like > or <).
We can filter in pandas with a boolean mask, which can be created with a conditional statement:
```python
import pandas as pd

data = pd.read_csv(
    'https://raw.githubusercontent.com/ADGEfficiency/data-science-south-data/main/cities/cities.csv'
).set_index('city')

# create our boolean mask with the conditional 'population < 2.0'
# aka population less than 2.0
mask = data.loc[:, 'population'] < 2.0
mask
"""
city
auckland     True
berlin      False
london      False
Name: population, dtype: bool
"""
```
The boolean mask is an array of either True or False - here indicating True if the population of the city is less than 2.
We can use our boolean mask to filter our dataset with loc - loc understands how to use a boolean mask:
```python
subset = data.loc[mask, :]
subset
"""
          population hemisphere
city
auckland         1.6      south
"""
```
loc or iloc?

The power of using boolean masks is that we can select many rows at once.
Below we create a boolean mask based on the hemisphere of the city - then use this mask to select two rows:
```python
mask = data.loc[:, 'hemisphere'] == 'north'
mask
"""
city
auckland    False
berlin       True
london       True
Name: hemisphere, dtype: bool
"""

subset = data.loc[mask, :]
subset
"""
        population hemisphere
city
berlin         3.6      north
london         8.9      north
"""
```
The final data analysis operation we will look at is aggregation. In Pandas aggregation can be done by grouping - using the groupby method on a DataFrame.
In Pandas aggregation happens in two steps:

1. grouping the data based on the values of a column, such as hemisphere,
2. applying an aggregation function to each group.

Aggregation allows us to estimate statistics - let's use it to answer a few more questions.
What is the average population in each hemisphere?

We can answer this with our two step workflow:

1. groupby('hemisphere') to group by hemisphere,
2. mean() to calculate the average for each of our hemisphere groups (north and south).
```python
import pandas as pd

data = pd.read_csv(
    'https://raw.githubusercontent.com/ADGEfficiency/data-science-south-data/main/cities/cities.csv'
).set_index('city')

data.groupby('hemisphere').mean()
"""
            population
hemisphere
north             6.25
south             1.60
"""
```
What is the total population in each hemisphere?

Our two steps to answer this question are:

1. group by hemisphere,
2. aggregate with sum.
```python
data.loc[:, ['population', 'hemisphere']].groupby('hemisphere').sum()
"""
            population
hemisphere
north             12.5
south              1.6
"""
```
The end of many data analysis pipelines involves saving data to disk - below we save our aggregation to a CSV file:
```python
groups = data.groupby('hemisphere').sum()
groups.to_csv('groups.csv')
```
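If we want to check the file, a quick sketch (assuming groups.csv was written by the step above) of reading it back from disk:

```python
import pandas as pd

# read the saved aggregation back from disk to check the round trip
groups = pd.read_csv('groups.csv').set_index('hemisphere')
print(groups)
```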
All the code we looked at above is in full below:
```python
# import & alias pandas as pd
import pandas as pd

# read dataset from github url
data = pd.read_csv(
    'https://raw.githubusercontent.com/ADGEfficiency/data-science-south-data/main/cities/cities.csv'
).set_index('city')

# save cities dataset to file on local machine
data.to_csv('cities.csv')

# select the berlin row, all columns
data.loc['berlin', :]

# select all rows, the population column
data.loc[:, 'population']

# select the first row, all columns
data.iloc[0, :]

# select all rows, second column
data.iloc[:, 1]

# what is the population of Auckland?
data.loc['auckland', 'population']

# which hemisphere is our first city in?
data.iloc[0].loc['hemisphere']

# select population less than 2
subset = data.loc[data.loc[:, 'population'] < 2.0, :]

# northern hemisphere cities
subset = data.loc[data.loc[:, 'hemisphere'] == 'north', :]

# average population in each hemisphere
data.groupby('hemisphere').mean()

# total population in each hemisphere
data.loc[:, ['population', 'hemisphere']].groupby('hemisphere').sum()

# save aggregation to a csv file on the local computer
groups = data.groupby('hemisphere').sum()
groups.to_csv('groups.csv')
```
Matplotlib is a Python library for creating data visualizations.
One challenge of Matplotlib is that it offers multiple APIs for plotting data - there are different ways to do the same thing, and it's not always clear which API a given example is using.

Mastering Matplotlib requires understanding these different APIs and using whichever is best for your data and workflow.
We recommend and will use the plt.subplots API. It offers the flexibility to plot multiple charts in the same figure, and integrates nicely with Pandas.
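As a rough illustration of the different APIs (the data here is made up just for this sketch), the same line can be plotted with the implicit pyplot API or with the explicit plt.subplots API:

```python
import matplotlib.pyplot as plt

x = [1, 2, 3]
y = [2, 4, 8]

# implicit pyplot API - plots onto a hidden 'current' figure and axes
plt.plot(x, y)
plt.savefig('implicit.png')

# explicit plt.subplots API - we keep references to the figure and axes
fig, ax = plt.subplots()
ax.plot(x, y)
fig.savefig('explicit.png')
```

Holding explicit references to the figure and axes is what makes it easy to plot multiple charts in the same figure.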
You can install Matplotlib with pip:
```shell-session
$ pip install matplotlib
```
Matplotlib is used in Python by importing the matplotlib package, commonly aliasing the submodule matplotlib.pyplot as plt:
```python
import matplotlib.pyplot as plt
```
The core components in Matplotlib are the figure and axes.
An axes is a single plot within a figure. A figure can have one or more axes - each axes can be thought of as a separate plot.

This axes is different from an axis, such as the x-axis or y-axis of a plot. An axes will have both an x-axis and a y-axis.
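A minimal sketch of the distinction (just for illustration, not from the original article) - a single axes object holds both an x-axis and a y-axis:

```python
import matplotlib.pyplot as plt

# a figure with a single axes
fig, ax = plt.subplots()

# the axes object contains both an x-axis and a y-axis
print(type(ax.xaxis))  # <class 'matplotlib.axis.XAxis'>
print(type(ax.yaxis))  # <class 'matplotlib.axis.YAxis'>
```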
plt.subplots Matplotlib API

We can create a figure with two axes with the plt.subplots API:
```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(ncols=2)
```
At this point both of the axes in our figure are empty.

Next we load our data with Pandas and create a plot on our first axes:
```python
import pandas as pd

# load data from CSV with pandas
data = pd.read_csv(
    'https://raw.githubusercontent.com/ADGEfficiency/data-science-south-data/main/cities/cities.csv'
).set_index('city')

# create a bar plot on the first axes of our figure
data.plot(y='population', ax=axes[0], kind='bar')
```
Pandas automatically creates labels for the x-axis and y-axis - quite nice!
Our second axes is still empty - we can plot something on it by passing ax=axes[1] into another plot call on our DataFrame:
```python
# access the second axes and plot a scatter plot
data.plot(x='land-area', y='population', ax=axes[1], kind='scatter')
```
Now we can see two visualizations of our data - a bar chart and a scatter plot.
The final step in our pipeline is to save our figure to a PNG file on our local machine:
```python
fig.savefig('cities.png')
```
The full code for our visualization pipeline is below:
```python
import matplotlib.pyplot as plt
import pandas as pd

# create one figure with two axes
fig, axes = plt.subplots(nrows=2)

# run a simple data pipeline
data = pd.read_csv(
    'https://raw.githubusercontent.com/ADGEfficiency/data-science-south-data/main/cities/cities.csv'
).set_index('city')

# access the first axes and plot a bar chart
data.plot(y='population', ax=axes[0], kind='bar')

# access the second axes and plot a scatter plot
data.plot(x='land-area', y='population', ax=axes[1], kind='scatter')

# small trick to get the x-axis labels to play nice
plt.tight_layout()

# save the figure as a png file
fig.savefig('cities.png')
```
Key takeaways from this post are:

- a pd.DataFrame is made of an index, columns and values,
- .loc selects based on the label, iloc selects based on position,
- aggregation is done by grouping with .groupby,
- the plt.subplots API is the best way to use Matplotlib.

This content is a sample from our Introduction to Pandas course.
Thanks for reading!
If you enjoyed this blog post, make sure to check out our free 77 data science lessons across 22 courses.