Pandas Notes

Getting Started

Install Anaconda has packages that you need for datascience. It helps people that are not software developers.

  • numpy
  • pandas
  • matplotlib
  • cborne
  • jupyter

Jupyter Notebooks

Let you create and share documents that have live code, explanatory text and visualisations.

Open jupyter from anaconda navigator or use jupyter notebook from your cli.

Once in jupyter click File -> New -> Notebook -> Python3

Some important shortcuts:

  • Esc + m: Turns a cell into markdown mode
  • Shift + Enter: Turns a cell into code mode
  • Esc + b: Insert empty code cells below current code cell
  • Esc + a: Insert empty code cells above current code cell

When you are within the paramteres of a function you can hit Shift + Tab and it will show you documentation.

Using Pandas

Use

import pandas

or

# Accepted shortening
import pandas as pd

View versions

View version of pandas and dependencies

pandas.show_versions()

To view just the pandas versions:

pandas.__version__

View help

To view the help on any function use:

pd.<function_name>?

Eg.

pd.read_csv?

Dataframes

A dataframe is like a 2 dimensional array, a table with rows and columns.

Read and show the first few rows with:

olympic_data = pd.read_csv('data/olympics.csv', skiprows=4)
olympic_data.head()

Series

A series is a 1 deimensional array of indexed data

  • Supports integer and label based indexing

Accessing a single series can use either square bracket or dot notation

data_frame['series_name']

or

data_frame.series_name

Dot notation will not work if there is a space in the series name

Accessing multiple series

data_frame[['series_name1', 'series_name2']]

Tips

Check if you are working with a series or dataframe

type(data_frame)
pandas.core.frame.DataFrame

type(data_frame.series_name)
pandas.core.series.Series

type(data_frame[['series_name1', 'series_name2']])
pandas.core.frame.DataFrame

Data Input and Validation

Reading data:

  • read_excel()
  • read_json()
  • read_sql_table()
  • read_csv()

Shape

The shape attribute returns a tuple representing the dimensionality of the data

olympic_data.shape
(29216, 10)

It returns (rows, columns)

Get just rows

olympic_data.shape[0]
29216

Get just columns

olympic_data.shape[1]
10

Head and Tail

Show the first n rows at the start and the end, by default 5.

It is the first thing you should do, examine the data.

olympic_data.head()

and

olympic_data.tail()

Info()

Shows entries, data type and number of non-null entries for each series

For real world usage there will be missing data

olympic_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29216 entries, 0 to 29215
Data columns (total 10 columns):
City            29216 non-null object
Edition         29216 non-null int64
Sport           29216 non-null object
Discipline      29216 non-null object
Athlete         29216 non-null object
NOC             29216 non-null object
Gender          29216 non-null object
Event           29216 non-null object
Event_gender    29216 non-null object
Medal           29216 non-null object
dtypes: int64(1), object(9)
memory usage: 2.2+ MB

Basic Analysis

value_count()

Returns a series object counting all the unique values. The first value is the most frequently occuring element, to reverse set the ascending to True. The missing data values (NA) will be dropped by default, with dropna=True

To view the nubmer of medals given out per edition:

olympic_data.Edition.value_counts()

2008    2042
2000    2015
2004    1998
1996    1859
1992    1705
1988    1546
1984    1459
...

To count the number of medals per gener

olympic_data.Gender.value_counts()

Men      21721
Women     7495
Name: Gender, dtype: int64

sort_values()

Sorts values in a series. You are sorting by asceding order by default, by default the missing data is added at the end.

athletes = olympic_data.Athlete.sort_values()

To sort by two rows, say the earlist edition and names from A to Z.

olympic_data.sort_values(by=['Edition', 'Athlete'])

Boolean Indexing

Boolean vectors can be used to filter data.

  • AND will use &
  • OR will use |
  • NOT will use ~

Multiple conditions must be grouped in brackets

View all records that got a gold medal

olympic_data.Medal == 'Gold'

Returns

0         True
1        False
2        False
3         True
4        False
5        False
6         True
7        False

To select a dataframe where all the athletes won gold

olympic_data[olympic_data.Medal == 'Gold']

Multiple conditions

olympic_data[(olympic_data.Medal == 'Gold') & (olympic_data.Gender == 'Women')]

String Handling

Using python’s str there are a number of built-in methods like str.contains(), str.startswith() and str.isnumeric()

Finding Flo Jo we search for florence:

olympic_data.Athlete.str.contains('Florence')

this gives us a boolean series, which we must plug into the dataframe

olympic_data[olympic_data.Athlete.str.contains('Florence')]

Basic Plotting

Matplotlib makes easy things easy and hard thing possible. You can generate plots, histograms, bar charts, error charts with a few lines of code.

Importing

import matplotlib.pyplot as plt

the merger command: %matplotlib inline allows the output of the matplotlib command to be executed within the jupyter notebook. The resulting plots are also stored in the notebook.

You will want to explicitly set the axis or figure objects that you will be using on more complex plots. The shorter method is the pyplot method of matplotlib

Plot types

By default the plot is a link

plot(kind='line')
plot(kind='bar')
plot(kind='barh')
plot(kind='pie')

plot(kind=’line’)

Tracking changes over a period of time Easier to compare changes with different groups and small changes

plot(kind=’bar’)

Track changes over time over different groups. Best when changes are large over time.

plot(kind=’barh’)

Horizontal bar graph, axis rotated.

plot(kind=’pie’)

Best for comparing parts of a whole. Don’t show changes over time.

Example

first_olympics.Sport.value_counts().plot(kind='line');

Add the ; to the end of that line to suppress: <matplotlib.axes._subplots.AxesSubplot at 0x117bf6da0>

Plot Colours

You can set the colour as keyword arguments to the plot() command

Eg.

plot(color='red')

More information on matplotlib colours

Figure Size

Tuple given as keyword argument to plot() to specify the width and height

Eg.

first_olympics.Sport.value_counts().plot(figsize=(15, 3));

Apparently the numbers represent inches (great news!)

Colourmaps

Find a good representation of your data. Is there an intuitive colour scheme? Eg. Gold, Silver and Bronze for medal winners. Blue for male, pink for female.

Classes of colur maps:

  • sequential - representing ordered data
  • diverging - information deviates around a middle value
  • Qualitative - No ordering or relationship (miscellaneous)

Eg.

first_olympics.Sport.value_counts().plot(kind='pie', colormap='Pastel1')

The color parameter for pie charts does not exist

Seaborn

Visualisation library based on matplotlib

Why: * Attractive statistical plots * A complement and not a substitute to matplotlib * Integrates well with pandas

Take a look at the seaborn examples

Countplot()

seaboard.countplot(
    data=source_dataframe,
    hue=categorical_variable_colours,
    order=sequence_of_categorical_variables,
    palette=colors_for_levels
)

Seaborn vs Matplotlib

  • Matplotlib - Short scripts with pyplot with simple plot types
  • Seaborn - Statistical or categorical data needing advanced plots

Importing

import seaborn as sns

Indexing

  • The index is an immutable array (that is it cannot be changed in place and will need to be recreated)
  • Indexing allows you to access a row or column using a label

    type(olympic_data.index) pandas.core.indexes.range.RangeIndex

Immutable

olympic_data.index[100] = 5
TypeError: Index does not support mutable operations

setindex()

Which of series is the index

olympic_data.set_index('Athlete')

You can change the index inplace with

olympic_data.set_index('Athlete', inplace=True)

or

athlete_index_df = olympic_data.set_index('Athlete')

Reset the index

Return the dataframe to the integer index

olympic_data.reset_index(inplace=True)

sort_index()

Speeds up data access on large datasets. Sort objects by a label along the axis.

athlete_index_df.sort_index(inplace=True)
athlete_index_df.head()

Descending order:

athlete_index_df.sort_index(inplace=True, ascending=False)
athlete_index_df.head()

loc[]

Label based index, selecting via the label.

Raises a KeyError when the items are not found

olympic_data.loc['BOLT, Usain']

The index has to be the type you are searching

Although the two below are synonymous:

olympic_data[olympic_data.Athlete == 'BOLT, Usain']
olympic_data.loc[olympic_data.Athlete == 'BOLT, Usain']

iloc[]

Integer based indexing

Allows for traditional pythonic slicing

View an item:

olympic_data.iloc[1700]
Athlete         RABOT, Pierre
City                   London
Edition                  1908
Sport                 Sailing
Discipline            Sailing
NOC                       FRA
Gender                    Men
Event                      6m
Event_gender                X
Medal                  Bronze
Name: 1700, dtype: object

Return multiple items:

olympic_data.iloc[[1542, 2390, 6001, 15000]]

Show 2nd to 5th:

olympic_data.iloc[1:4]

Group by

  • Split a dataframe into groups based on some criteria
  • Applies a function to each group independently
  • Combines the results into a dataframe

Creating a group by object only verifies that you have passed a valid mapping

The group by object is not a dataframe but a group of dataframes in a dict-like structure

type(olympic_data.groupby('Edition'))
pandas.core.groupby.groupby.DataFrameGroupBy

View the group by as a list:

list(olympic_data.groupby('Edition'))

Iterate through a Group

Each group is a dataframe

for key, group in olympic_data.groupby('Edition'):
    print(key)

Groupby Computations

  • GroupBy.size()
  • GroupBy.count()
  • GroupBy.first(), GroupBy.last()
  • GroupBy.head(), GroupBy.tail()
  • GroupBy.mean()
  • GroupBy.max(), GroupBy.min()

agg() allows multiple statistics in one calculation per group

Eg.

olympic_data.loc[olympic_data.Athlete == 'LEWIS, Carl'].groupby('Athlete').agg({'Edition': ['min', 'max', 'count']})

Reshaping

Stack function allows you to set inner columns as rows

If you have grouped by data and want to view it like the original data frame use:

g = mw.groupby(['NOC', 'Gender', 'Discipline', 'Event']).size()

Stack

Returns a dataframe or series, by default dropna=True so you won’t have any missing values

A stack returns a dataframe or series with a new innermost layer of rows

Stack returns a taller dataframe

df.stack(['Event'])

unstack

Returns a dataframe or series

Sometimes it makes sense for NA values to be 0. Meaning no medals are won for that event.

Returns a wider dataframe

df.unstack('Gender')

Data Visualisations

sns.heatmap(sorted_medal_data)

Create our own colourmaps

from matplotlib.colors import ListedColormap

Seaborn uses palette. View the current colour palette:

sns.color_palette()

View it on a plot:

sns.palplot(sns.color_palette())

Gold silver bronze:

gold_silver_bronze = ['#dbb40c', '#c5c9c7', '#a87900']
sns.palplot(gold_silver_bronze)

Create the colourmap:

gsb_colourmap = ListedColormap(sns.color_palette(gold_silver_bronze))
top_5.plot(kind='bar', figsize=(20, 10), colormap=gsb_colourmap);

Example Adding a column

most_medals = us_medals.groupby(['Edition', 'Athlete', 'Medal']).size().unstack('Medal', fill_value=0)
most_medals['Total'] = most_medals['Gold'] + most_medals['Silver'] + most_medals['Bronze']

Sort by group

# Reset index
most_medals.reset_index(inplace=True)
for year, group in most_medals.groupby('Edition'):
    print(group.sort_values('Total', ascending=False)[:1])

# Create a list and then create a dataframe
tu = [group.sort_values('Total', ascending=False)[:1] for year, group in most_medals.groupby('Edition')]
tu
top = pd.DataFrame()
for element in tu:
    top = top.append(element)
top

Source: Lynda Pandas for Datascience