Pandas
Pandas Notes#
Getting Started#
Install Anaconda has packages that you need for datascience. It helps people that are not software developers.
- numpy
- pandas
- matplotlib
- cborne
- jupyter
Jupyter Notebooks#
Let you create and share documents that have live code, explanatory text and visualisations.
Open jupyter from anaconda navigator or use jupyter notebook
from your cli.
Once in jupyter click File -> New -> Notebook -> Python3
Some important shortcuts:
Esc + m
: Turns a cell into markdown modeShift + Enter
: Turns a cell into code modeEsc + b
: Insert empty code cells below current code cellEsc + a
: Insert empty code cells above current code cell
When you are within the paramteres of a function you can hit
Shift + Tab
and it will show you documentation.
Using Pandas#
Use
import pandas
or
# Accepted shortening
import pandas as pd
View versions#
View version of pandas and dependencies
pandas.show_versions()
To view just the pandas versions:
pandas.__version__
View help#
To view the help on any function use:
pd.<function_name>?
Eg.
pd.read_csv?
Dataframes#
A dataframe is like a 2 dimensional array, a table with rows and columns.
Read and show the first few rows with:
olympic_data = pd.read_csv('data/olympics.csv', skiprows=4)
olympic_data.head()
Series#
A series is a 1 deimensional array of indexed data
- Supports integer and label based indexing
Accessing a single series can use either square bracket or dot notation#
data_frame['series_name']
or
data_frame.series_name
Dot notation will not work if there is a space in the series name
Accessing multiple series#
data_frame[['series_name1', 'series_name2']]
Tips#
Check if you are working with a series or dataframe
type(data_frame)
pandas.core.frame.DataFrame
type(data_frame.series_name)
pandas.core.series.Series
type(data_frame[['series_name1', 'series_name2']])
pandas.core.frame.DataFrame
Data Input and Validation#
Reading data:
- read_excel()
- read_json()
- read_sql_table()
- read_csv()
Shape#
The shape attribute returns a tuple representing the dimensionality of the data
olympic_data.shape
(29216, 10)
It returns (rows
, columns
)
Get just rows
olympic_data.shape[0]
29216
Get just columns
olympic_data.shape[1]
10
Head and Tail#
Show the first n
rows at the start and the end, by default 5.
It is the first thing you should do, examine the data.
olympic_data.head()
and
olympic_data.tail()
Info()#
Shows entries, data type and number of non-null entries for each series
For real world usage there will be missing data
olympic_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29216 entries, 0 to 29215
Data columns (total 10 columns):
City 29216 non-null object
Edition 29216 non-null int64
Sport 29216 non-null object
Discipline 29216 non-null object
Athlete 29216 non-null object
NOC 29216 non-null object
Gender 29216 non-null object
Event 29216 non-null object
Event_gender 29216 non-null object
Medal 29216 non-null object
dtypes: int64(1), object(9)
memory usage: 2.2+ MB
Basic Analysis#
value_count()#
Returns a series object counting all the unique values.
The first value is the most frequently occuring element, to reverse set the ascending
to True.
The missing data values (NA
) will be dropped by default, with dropna=True
To view the nubmer of medals given out per edition:
olympic_data.Edition.value_counts()
2008 2042
2000 2015
2004 1998
1996 1859
1992 1705
1988 1546
1984 1459
...
To count the number of medals per gener
olympic_data.Gender.value_counts()
Men 21721
Women 7495
Name: Gender, dtype: int64
sort_values()#
Sorts values in a series. You are sorting by asceding order by default, by default the missing data is added at the end.
athletes = olympic_data.Athlete.sort_values()
To sort by two rows, say the earlist edition and names from A to Z.
olympic_data.sort_values(by=['Edition', 'Athlete'])
Boolean Indexing#
Boolean vectors can be used to filter data.
AND
will use&
OR
will use|
NOT
will use~
Multiple conditions must be grouped in brackets
View all records that got a gold medal
olympic_data.Medal == 'Gold'
Returns
0 True
1 False
2 False
3 True
4 False
5 False
6 True
7 False
To select a dataframe where all the athletes won gold
olympic_data[olympic_data.Medal == 'Gold']
Multiple conditions
olympic_data[(olympic_data.Medal == 'Gold') & (olympic_data.Gender == 'Women')]
String Handling#
Using python’s str
there are a number of built-in methods like str.contains()
, str.startswith()
and str.isnumeric()
Finding Flo Jo we search for florence:
olympic_data.Athlete.str.contains('Florence')
this gives us a boolean series, which we must plug into the dataframe
olympic_data[olympic_data.Athlete.str.contains('Florence')]
Basic Plotting#
Matplotlib makes easy things easy and hard thing possible. You can generate plots, histograms, bar charts, error charts with a few lines of code.
Importing#
import matplotlib.pyplot as plt
the merger command: %matplotlib inline
allows the output of the matplotlib command to be executed within the jupyter notebook. The resulting plots are also stored in the notebook.
You will want to explicitly set the axis or figure objects that you will be using on more complex plots.
The shorter method is the pyplot
method of matplotlib
Plot types#
By default the plot is a link
plot(kind='line')
plot(kind='bar')
plot(kind='barh')
plot(kind='pie')
plot(kind=’line’)
Tracking changes over a period of time Easier to compare changes with different groups and small changes
plot(kind=’bar’)
Track changes over time over different groups. Best when changes are large over time.
plot(kind=’barh’)
Horizontal bar graph, axis rotated.
plot(kind=’pie’)
Best for comparing parts of a whole. Don’t show changes over time.
Example#
first_olympics.Sport.value_counts().plot(kind='line');
Add the ;
to the end of that line to suppress: <matplotlib.axes._subplots.AxesSubplot at 0x117bf6da0>
Plot Colours#
You can set the colour as keyword arguments to the plot()
command
Eg.
plot(color='red')
More information on matplotlib colours
Figure Size#
Tuple given as keyword argument to plot()
to specify the width and height
Eg.
first_olympics.Sport.value_counts().plot(figsize=(15, 3));
Apparently the numbers represent inches (great news!)
Colourmaps#
Find a good representation of your data. Is there an intuitive colour scheme? Eg. Gold, Silver and Bronze for medal winners. Blue for male, pink for female.
Classes of colur maps:
- sequential - representing ordered data
- diverging - information deviates around a middle value
- Qualitative - No ordering or relationship (miscellaneous)
Eg.
first_olympics.Sport.value_counts().plot(kind='pie', colormap='Pastel1')
The color
parameter for pie charts does not exist
Seaborn#
Visualisation library based on matplotlib
Why:
- Attractive statistical plots
- A complement and not a substitute to matplotlib
- Integrates well with pandas
Take a look at the seaborn examples
Countplot()#
seaboard.countplot(
data=source_dataframe,
hue=categorical_variable_colours,
order=sequence_of_categorical_variables,
palette=colors_for_levels
)
Seaborn vs Matplotlib#
- Matplotlib - Short scripts with pyplot with simple plot types
- Seaborn - Statistical or categorical data needing advanced plots
Importing#
import seaborn as sns
Indexing#
- The index is an immutable array (that is it cannot be changed in place and will need to be recreated)
-
Indexing allows you to access a row or column using a label
type(olympic_data.index) pandas.core.indexes.range.RangeIndex
Immutable
olympic_data.index[100] = 5
TypeError: Index does not support mutable operations
setindex()#
Which of series is the index
olympic_data.set_index('Athlete')
You can change the index inplace with
olympic_data.set_index('Athlete', inplace=True)
or
athlete_index_df = olympic_data.set_index('Athlete')
Reset the index#
Return the dataframe to the integer index
olympic_data.reset_index(inplace=True)
sort_index()#
Speeds up data access on large datasets. Sort objects by a label along the axis.
athlete_index_df.sort_index(inplace=True)
athlete_index_df.head()
Descending order:
athlete_index_df.sort_index(inplace=True, ascending=False)
athlete_index_df.head()
loc[]#
Label based index, selecting via the label.
Raises a KeyError
when the items are not found
olympic_data.loc['BOLT, Usain']
The index has to be the type you are searching
Although the two below are synonymous:
olympic_data[olympic_data.Athlete == 'BOLT, Usain']
olympic_data.loc[olympic_data.Athlete == 'BOLT, Usain']
iloc[]#
Integer based indexing
Allows for traditional pythonic slicing
View an item:
olympic_data.iloc[1700]
Athlete RABOT, Pierre
City London
Edition 1908
Sport Sailing
Discipline Sailing
NOC FRA
Gender Men
Event 6m
Event_gender X
Medal Bronze
Name: 1700, dtype: object
Return multiple items:
olympic_data.iloc[[1542, 2390, 6001, 15000]]
Show 2nd to 5th:
olympic_data.iloc[1:4]
Group by#
- Split a dataframe into groups based on some criteria
- Applies a function to each group independently
- Combines the results into a dataframe
Creating a group by object only verifies that you have passed a valid mapping
The group by object is not a dataframe but a group of dataframes in a dict-like structure
type(olympic_data.groupby('Edition'))
pandas.core.groupby.groupby.DataFrameGroupBy
View the group by as a list:
list(olympic_data.groupby('Edition'))
Iterate through a Group#
Each group is a dataframe
for key, group in olympic_data.groupby('Edition'):
print(key)
Groupby Computations#
GroupBy.size()
GroupBy.count()
GroupBy.first()
,GroupBy.last()
GroupBy.head()
,GroupBy.tail()
GroupBy.mean()
GroupBy.max()
,GroupBy.min()
agg()
allows multiple statistics in one calculation per group
Eg.
olympic_data.loc[olympic_data.Athlete == 'LEWIS, Carl'].groupby('Athlete').agg({'Edition': ['min', 'max', 'count']})
Reshaping#
Stack function allows you to set inner columns as rows
If you have grouped by data and want to view it like the original data frame use:
g = mw.groupby(['NOC', 'Gender', 'Discipline', 'Event']).size()
Stack#
Returns a dataframe or series, by default dropna=True
so you won’t have any missing values
A stack returns a dataframe or series with a new innermost layer of rows
Stack returns a taller dataframe
df.stack(['Event'])
unstack#
Returns a dataframe or series
Sometimes it makes sense for NA
values to be 0. Meaning no medals are won for that event.
Returns a wider dataframe
df.unstack('Gender')
Data Visualisations#
sns.heatmap(sorted_medal_data)
Create our own colourmaps#
from matplotlib.colors import ListedColormap
Seaborn uses palette. View the current colour palette:
sns.color_palette()
View it on a plot:
sns.palplot(sns.color_palette())
Gold silver bronze:
gold_silver_bronze = ['#dbb40c', '#c5c9c7', '#a87900']
sns.palplot(gold_silver_bronze)
Create the colourmap:
gsb_colourmap = ListedColormap(sns.color_palette(gold_silver_bronze))
top_5.plot(kind='bar', figsize=(20, 10), colormap=gsb_colourmap);
Example Adding a column#
most_medals = us_medals.groupby(['Edition', 'Athlete', 'Medal']).size().unstack('Medal', fill_value=0)
most_medals['Total'] = most_medals['Gold'] + most_medals['Silver'] + most_medals['Bronze']
Sort by group#
# Reset index
most_medals.reset_index(inplace=True)
for year, group in most_medals.groupby('Edition'):
print(group.sort_values('Total', ascending=False)[:1])
# Create a list and then create a dataframe
tu = [group.sort_values('Total', ascending=False)[:1] for year, group in most_medals.groupby('Edition')]
tu
top = pd.DataFrame()
for element in tu:
top = top.append(element)
top
Source: Lynda Pandas for Datascience