Python for Data Analysis Notes#
What Data?#
- Tabular data, like a CSV or Excel file
- Multi-dimensional arrays (matrices)
- Related data tables
- Evenly spaced time series
Why Python?#
Not from the book
Because it is the best.
Python can be used at any stage: research, prototyping, and production systems.
Why not python?#
Its maximum performance is not great, due to it being interpreted rather than compiled. The GIL (Global Interpreter Lock) prevents the interpreter from running more than one bytecode instruction at a time. Developer time is more important than CPU time in most cases, so it is a good tradeoff.
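Also not from the book: the GIL blocks thread-based parallelism for CPU-bound Python code, but the standard library's multiprocessing module sidesteps it by using separate processes, each with its own interpreter and GIL. A minimal sketch:

from multiprocessing import Pool

def square(n):
    return n * n

if __name__ == '__main__':
    # four worker processes, each with its own GIL
    with Pool(4) as pool:
        results = pool.map(square, range(10))
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]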
Essential Python Libraries#
Numpy#
Numerical Python provides data structures, algorithms and library glue for scientific applications using numerical data.
- A fast multi-dimensional array, the ndarray
- Functions for performing element-wise and array-wise operations
- Tools for writing datasets to disk
- Linear algebra, Fourier transform and random number generation
- A mature C API to enable Python extensions and native C and C++ code to access the numpy array
It is a container for using and manipulating data, more efficient than the built-in Python data structures. C code can run operations on a numpy array without copying data into another in-memory representation.
Pandas#
Provides high-level data structures and functions that make working with tabular data fast, easy and expressive.
Its primary data structure is the DataFrame - a tabular, column-oriented data structure with row and column labels.
It provides the tools for data manipulation, preparation and cleaning:
- Labeled axes with data alignment
- Integrated time series functionality
- The same data structures handle time series and non-time series data
- Arithmetic and reductions pass on metadata
- Flexible handling of missing data
- Operations similar to those in databases
The name pandas is derived from panel data, an econometrics term, and Python data analysis.
Matplotlib#
The most popular Python library for creating plots and other 2D visualisations. It is well suited to creating publication-ready charts.
IPython and Jupyter#
The interactive Python interpreter. It maximises developer productivity by encouraging an interactive execute-explore workflow, as opposed to edit-compile-run. Data analysis involves a lot of trial and error, which is why this works well.
Jupyter is a language-agnostic interactive computing tool. It provides an interactive code document called a notebook.
SciPy#
A collection of packages addressing a number of problem domains in scientific computing:
- scipy.integrate - numerical integration routines and differential equation solvers
- scipy.linalg - linear algebra routines
- scipy.optimize - function optimizers and root-finding algorithms
- scipy.signal - signal processing tools
- scipy.sparse - sparse matrices and sparse linear system solvers
- scipy.special - wrapper around SPECFUN, a Fortran library
- scipy.stats - continuous and discrete probability distributions, density functions, samplers, statistical and descriptive tests
- scipy.weave - tools for using C++ to accelerate array computations
Scikit-learn#
A general-purpose machine learning toolkit for Python programmers.
Submodules:
- Classification - SVM, nearest neighbors, random forests
- Regression - Lasso, ridge regression, logistic regression
- Clustering - k-means, spectral clustering
- Dimensionality reduction - PCA, feature selection, matrix factorisation
- Preprocessing - feature extraction, normalisation
Statsmodels#
A statistical analysis package for classical statistics and econometrics.
Submodules:
- Regression models: linear regression, generalised linear models, robust linear models, linear mixed effects models
- ANOVA - analysis of variance
- Time series analysis - AR, ARMA, ARIMA, VAR
- Nonparametric methods: kernel density estimation, kernel regression
- Visualisation of statistical model results
Tasks in this book#
Interacting with the Outside world#
Reading and writing various file formats
Preparation#
Cleaning, munging, combining, normalising, reshaping, slicing, dicing and transforming data for analysis.
Transformation#
Applying mathematical and statistical operations to groups of data sets to derive new data sets
Modeling and computation#
Connecting your data to statistical models and machine learning algorithms.
Presentation#
Creating interactive or static graphical visualisations or textual summaries
Import Conventions#
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import sklearn as sk
import statsmodels as sm
These aliases are used since it is bad practice to from numpy import *
Jargon#
Munge/Munging/Wrangling - Manipulating unstructured or messy data into a structured clean form
Chapter 2: Python, iPython and Jupyter Notebooks#
Python Interpreter#
Python is interpreted; that is, it executes one statement at a time.
Enter the interpreter with (provided python is installed)
python
exit it with:
>>> exit()
Running programs from the terminal
$ python hello_world.py
Hello world
Many data scientists make use of ipython, an interactive Python shell that gives you tab completion, lets you run files and pretty-prints output:
ipython
You can use %run to execute scripts within the shell:
In [1]: %run hello_world.py
Hello world
IPython Basics#
Get a quick reference
%quickref
Basic variable assignment
In [5]: a = 5
In [6]: a
Out[6]: 5
Printed output is formatted more readably, e.g.:
In [10]: data = {i : np.random.randn() for i in range(7)}
In [11]: data
Out[11]:
{0: 0.6470083582639674,
1: -0.07424180544449763,
2: 1.017173940313301,
3: 0.5332176226383887,
4: 0.4663520822888573,
5: 0.9544131609184705,
6: -1.2202876585351747}
Jupyter Notebook#
A notebook is an interactive document for code, text, markdown and visualisations.
The Jupyter notebook interacts with kernels, which are implementations of the interactive computing protocol for various programming languages.
Start a notebook:
$ jupyter notebook
It will usually open the browser automatically at: http://localhost:8888/tree
Jupyter can also be used on a remote computer
Create a new python 3 notebook:
File -> New -> Python3
Execute code:
Shift + Enter
The extension for Python notebooks is .ipynb.
You can use Tab to complete partially typed commands or paths.
Introspection#
Use ? before or after any command:
In [6]: ?any
Signature: any(iterable, /)
Docstring:
Return True if bool(x) is True for any x in the iterable.
If the iterable is empty, return False.
Type: builtin_function_or_method
This is object introspection; it shows us the docstring.
Using ?? will show the source code if possible.
You can also use it to search for functions
In [8]: import numpy as np
In [9]: np.*load*?
np.__loader__
np.load
np.loads
np.loadtxt
np.pkgload
The %run Command#
Any file can be run as a python program in your interactive shell.
In [550]: %run ipython_script_test.py
The script is run in an empty namespace, the same as running it from the command line (no existing imports will be used).
To include in the current namespace:
%run -i ipython_script_test.py
In Jupyter you can also load the code from a file using:
%load ipython_script_test.py
Interrupting running code#
You can use Ctrl + C, which raises a KeyboardInterrupt, when code execution seems stuck.
When a piece of Python code has called into compiled extension modules, pressing Ctrl + C will not cause the program execution to stop immediately in all cases.
Executing code from the clipboard#
Copy the code, and in a Jupyter notebook use:
%paste
Or to edit before pasting
%cpaste
Terminal Keyboard Shortcuts#
Has some emacs-style shortcuts (boooh!):
- ctrl-p or up-arrow: previous command
- ctrl-n or down-arrow: next command
- ctrl-r: reverse history search
- ctrl-shift-p: paste from clipboard
- ctrl-c: cancel executing code
- ctrl-a: go to beginning of line
- ctrl-e: go to end of line
- ctrl-k: delete from cursor to end of line
- ctrl-u: discard all text from current line
- ctrl-l: clear screen
Exceptions and Tracebacks#
With %run you may get an exception. IPython gives a lot more context than the standard Python interpreter; the amount of context is set with %xmode.
You can use %pdb or %debug to step into the traceback.
Magic Commands#
All commands that start with % (they are not available in the standard Python interpreter).
For example, use %timeit to check execution time:
In [13]: a = np.random.randn(100, 100)
In [14]: %timeit np.dot(a, a)
26.5 µs ± 805 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
You can use help (?) on magic commands:
%debug?
You can use them without the %, provided no variable with the same name is defined, by enabling %automagic.
- %quickref - quick reference card
- %magic - detailed docs for magic commands
- %hist - print command input history
- %pdb - automatically enter debugger on exception
- %paste - execute code from clipboard
- %cpaste - manually paste in code from clipboard
- %reset - delete all variables and names defined in the interactive namespace
- %page object - page through an object (like less)
- %run script.py - execute a Python script
- %prun statement - execute a statement in the profiler
- %time statement - report the execution time of a statement
- %timeit statement - run a statement multiple times and get the average execution time
- %who - show variables defined in the namespace
- %xdel variable - delete a variable from the namespace
Jupyter Magic Commands#
Often loaded by third-party extensions; cell magics use %%
Eg. Creating a cython extension in jupyter
%%cython
cimport numpy as cnp
cnp.import_array()
def sum_arr(cnp.ndarray[cnp.float64_t] arr):
cdef int i
cdef double total = 0
for i in range(len(arr)):
total += arr[i]
return total
Can be run with:
In [2]: arr = np.random.randn(1000000)
In [3]: sum(arr)
Out[3]: 340.78566319852672
I got the following error:
UsageError: Cell magic `%%cython` not found.
The above did not work; apparently you should run %load_ext Cython first, but that didn't work for me either.
Also, there might be an error in the book: sum(arr) should be sum_arr(arr)
Matplotlib integration#
Using %matplotlib, or in Jupyter %matplotlib inline.
I didn't even do that though, just executed:
import numpy as np
import matplotlib.pyplot as plt
plt.plot(np.random.randn(50).cumsum())
Python Language Basics#
Semantics#
- Use indentation, not braces (white space is very important)
- A colon denotes the start of an (indented) block
- It makes things cosmetically consistent and easier to read through (a convention)
- Default to 4 spaces
- The semicolon is still used, rarely, for statements on a single line: eg. x = 5; y = 6; z = 1;
- Everything is an object - every number, string, class, module or function is an object and has a type()
- Variables are passed by reference by default
- The type of an object is stored in the object itself. Python is a strongly-typed language:

>>> 5 + '5'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'int' and 'str'

- In Python, operations are carried out strictly, or immediately (not lazily, when the value is needed, as in Haskell). Python uses generators for more efficient evaluation.
- In python 3, Unicode has become a first class string type.
- In Python, a module is simply a .py file
Eg.
# some_module.py
PI = 3.14159

def f(x):
    return x + 2

def g(a, b):
    return a + b
Used in another file:
import some_module
result = some_module.f(5)
pi = some_module.PI
Or, importing names directly:
from some_module import f, g, PI
result = g(5, PI)
Binary Operators#
| Operation | Description |
| --- | --- |
| a + b | Add a and b |
| a - b | Subtract b from a |
| a * b | Multiply a by b |
| a / b | Divide a by b |
| a // b | Floor-divide a by b, dropping any fractional remainder |
| a ** b | Raise a to the b power |
| a & b | True if both a and b are True; for integers, take the bitwise AND |
| a \| b | True if either a or b is True; for integers, take the bitwise OR |
| a ^ b | For booleans, True if a or b is True, but not both; for integers, take the bitwise EXCLUSIVE-OR |
| a == b | True if a equals b |
| a != b | True if a is not equal to b |
| a <= b, a < b | True if a is less than or equal to (less than) b |
| a > b, a >= b | True if a is greater than (greater than or equal to) b |
| a is b | True if a and b reference the same Python object |
| a is not b | True if a and b reference different Python objects |
Mutable and immutable#
Most Python objects are mutable: lists, dicts and numpy arrays. This means their contents can be changed.
Some objects are immutable; once they are created their contents cannot be changed. Tuples, for example:

>>> a = (10, 15, 20)
>>> a[1] = 40
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment
>>> a
(10, 15, 20)
Scalar Types#
| Type | Description |
| --- | --- |
| None | The Python "null" value (only one instance of the None object exists) |
| str | String type; holds Unicode (UTF-8 encoded) strings |
| bytes | Raw ASCII bytes (or Unicode encoded as bytes) |
| float | Double-precision (64-bit) floating-point number (note there is no separate double type) |
| bool | A True or False value |
| int | Arbitrary-precision signed integer |
Dates and Time#
| Type | Description |
| --- | --- |
| %Y | 4-digit year |
| %y | 2-digit year |
| %m | 2-digit month [01, 12] |
| %d | 2-digit day [01, 31] |
| %H | Hour (24-hour clock) [00, 23] |
| %I | Hour (12-hour clock) [01, 12] |
| %M | 2-digit minute [00, 59] |
| %S | Second [00, 61] (seconds 60, 61 account for leap seconds) |
| %w | Weekday as integer [0 (Sunday), 6] |
| %U | Week number of the year [00, 53]; Sunday is considered the first day of the week, and days before the first Sunday of the year are "week 0" |
| %W | Week number of the year [00, 53]; Monday is considered the first day of the week, and days before the first Monday of the year are "week 0" |
| %z | UTC time zone offset as +HHMM or -HHMM; empty if time zone naive |
| %F | Shortcut for %Y-%m-%d, for example 2012-4-18 |
| %D | Shortcut for %m/%d/%y, for example 04/18/12 |
Convert a string time to a datetime object: (String - Parse - Time )
In [108]: datetime.strptime('20091031', '%Y%m%d')
Out[108]: datetime.datetime(2009, 10, 31, 0, 0)
Format a datetime as a string (String - Format - Time), assuming dt = datetime(2011, 10, 29, 20, 30) was defined earlier:
In [107]: dt.strftime('%m/%d/%Y %H:%M')
Out[107]: '10/29/2011 20:30'
Rounding a time down to the hour:
In [109]: dt.replace(minute=0, second=0)
Out[109]: datetime.datetime(2011, 10, 29, 20, 0)
A lot more stuff in the book, I recommend reading it
Chapter 3: Built In Data Structures, Functions and Files#
Tuple#
A one-dimensional, fixed-length, immutable sequence of Python objects.
>>> tup = 4, 5, 6
>>> tup
(4, 5, 6)
or
>>> tup = (4, 5, 6)
A nested tuple:
>>> nested_tup = (3, 5, 6), (7, 8)
>>> nested_tup
((3, 5, 6), (7, 8))
Convert a list to a tuple:
>>> a = [1, 2, 3]
>>> tuple(a)
(1, 2, 3)
Elements can be accessed with square brackets and an index:
>>> nested_tup[1]
(7, 8)
Tuples can be concatenated with + and repeated with *
You can unpack tuples by assigning to a tuple-like expression of variables:
>>> a, b = nested_tup
>>> a
(3, 5, 6)
>>> b
(7, 8)
or
>>> a, (b, c,) = nested_tup
>>> a
(3, 5, 6)
>>> b
7
>>> c
8
Unpacking too many values raises a ValueError, but you can capture the rest with *the_rest:
>>> values = 1, 2, 3, 4, 5, 6
>>> first, second, rest = values
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: too many values to unpack (expected 3)
>>> first, second, *the_rest = values
>>> first
1
>>> second
2
>>> the_rest
[3, 4, 5, 6]
Count occurrences of an element:
>>> the_rest.count(3)
1
List#
Lists are variable-length, one-dimensional and mutable (they can be modified in place).
>>> list_ = [1, 2, 3, None]
>>> list_
[1, 2, 3, None]
or
>>> list(nested_tup)
[(3, 5, 6), (7, 8)]
Add elements:
In [39]: b_list.append('dwarf')
Insert elements at specified location:
In [41]: b_list.insert(1, 'red')
Take an element out by index:
In [43]: b_list.pop(2)
Take an element out by value (remove locates and removes the first matching occurrence):
In [46]: b_list.remove('foo')
Check if a list contains element:
>>> None in list_
True
Concatenating lists:
>>> x = [1, 2, 3]
>>> y = [1, 2, 3, 4]
>>> z = x + y
>>> z
[1, 2, 3, 1, 2, 3, 4]
or extend:
>>> x.extend([5, 6, 7])
>>> x
[1, 2, 3, 5, 6, 7]
extend is less expensive than concatenating with + and reassigning.
Sorting#
Can be sorted in place with sort():
>>> x = [5, 7, 3, 9, 99]
>>> x.sort()
>>> x
[3, 5, 7, 9, 99]
You can also supply a function to sort with, using the key kwarg:
>>> x = ['hello', 'world', 'lets', 'be', 'friends']
>>> x.sort(key=len)
>>> x
['be', 'lets', 'hello', 'world', 'friends']
Binary search and maintaining sorted list#
import bisect
>>> x
[3, 5, 7, 9, 99]
bisect.bisect calculates where an element should be inserted to keep the list sorted
>>> bisect.bisect(x, 2)
0
bisect.insort actually does the inserting
>>> bisect.insort(x, 2)
>>> x
[2, 3, 5, 7, 9, 99]
The onus is on you to ensure the list is already sorted; it would be too expensive for bisect to check.
Slicing#
You can select portions of a list, or of any list-like type (tuples, numpy arrays, pandas Series):
>>> x
[2, 3, 5, 7, 9, 99]
>>> x[2:4]
[5, 7]
Enumerate#
Looping over a list while keeping track of the index:
>>> for index, value in enumerate(x):
... print(index, value)
...
0 2
1 3
2 5
3 7
4 9
5 99
Sorted#
Returns a new sorted list
>>> sorted([7, 1, 2, 6, 0, 3, 2])
[0, 1, 2, 2, 3, 6, 7]
Zip#
Zip pairs up two lists with each other, creating a zip object that becomes a list of tuples when cast to a list:
>>> numbers = [1, 2, 3]
>>> words = ['one', 'two', 'three']
>>> combined = zip(numbers, words)
>>> combined
<zip object at 0x10f850148>
>>> list(combined)
[(1, 'one'), (2, 'two'), (3, 'three')]
The most common use is iterating over both lists together, which you can do with:
>>> for i, (a, b) in enumerate(zip(numbers, words)):
... print(i, a, b)
...
0 1 one
1 2 two
2 3 three
Reversed#
Creates an iterator of the reverse of the list:
>>> list(reversed([7, 1, 2, 6, 0, 3, 2]))
[2, 3, 0, 6, 2, 1, 7]
Dict#
A flexibly sized collection of key-value pairs, where keys and values are Python objects.
Also known as a hash map or associative array.
empty_dict = {}
Create with elements:
my_dict = {'a': 'some value', 'b': [1, 2, 3, 4], 7: 'an integer'}
Access an element with []:
>>> my_dict['a']
'some value'
Check if a dict contains a key:
>>> 'c' in my_dict
False
>>> 'a' in my_dict
True
Keys can be deleted with del or pop:
>>> del my_dict['a']
>>> my_dict
{'b': [1, 2, 3, 4], 7: 'an integer'}
>>> b = my_dict.pop('b')
>>> b
[1, 2, 3, 4]
Get all keys or all values:
>>> list(my_dict.keys())
['a', 'b', 7]
>>> list(my_dict.values())
['some value', [1, 2, 3, 4], 'an integer']
Concatenating with + does not work:
>>> my_a = {'a': 1, 'b':2}
>>> my_b = {'c': 3, 'd': 4}
>>> my_c = my_a + my_b
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'dict' and 'dict'
You can merge two dicts together with update():
>>> my_a.update(my_b)
>>> my_a
{'a': 1, 'b': 2, 'c': 3, 'd': 4}
There is another way to create a new dict:
>>> my_c = dict(my_a.items() | my_b.items())
>>> my_c
{'c': 3, 'a': 1, 'b': 2, 'd': 4}
Creating dicts from sequences:
mapping = {}
for key, value in zip(key_list, value_list):
mapping[key] = value
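Since dict accepts an iterable of 2-tuples, the loop above can be collapsed into one line (a small sketch, not from the notes above):

# zip produces the (key, value) pairs, dict consumes them
mapping = dict(zip(['a', 'b', 'c'], [1, 2, 3]))
# {'a': 1, 'b': 2, 'c': 3}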
Default Values#
get() and pop() can return default values.
value = some_dict.get(key, default_value)
>>> my_c.get('e', 5)
5
Even better might be to use defaultdict:
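A minimal sketch of defaultdict, from the standard library's collections module: the factory you pass in supplies a default value for missing keys, so you can skip the get/pop dance entirely.

from collections import defaultdict

# group words by first letter; a missing key starts as an empty list
by_letter = defaultdict(list)
for word in ['apple', 'bat', 'bar', 'atom', 'book']:
    by_letter[word[0]].append(word)

# {'a': ['apple', 'atom'], 'b': ['bat', 'bar', 'book']}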
Valid dict keys#
A key is only valid if it is hashable; you can check whether an object is hashable with hash()
>>> datetime.date(2018,10,11)
datetime.date(2018, 10, 11)
>>> my_dict
{'a': 'some value', 'b': [1, 2, 3, 4], 7: 'an integer'}
>>> my_dict[datetime.date(2018,10,11)] = 3
>>> my_dict[[3, 4, 5]] = 3
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
Set#
A set is an unordered collection of unique elements
>>> {2, 2, 3, 4, 1, 1}
{1, 2, 3, 4}
or
>>> set([2, 2, 3, 4, 1, 1])
{1, 2, 3, 4}
It supports mathematical set operations like union (or):
>>> a = {1, 2, 3, 4, 5}
>>> b = {3, 4, 5, 6, 7, 8}
>>> a | b
{1, 2, 3, 4, 5, 6, 7, 8}
Intersection (and):
>>> a & b
{3, 4, 5}
Difference:
>>> a - b
{1, 2}
Exclusive or:
>>> a ^ b
{1, 2, 6, 7, 8}
There are in-place counterparts that modify the existing set, which is more efficient for large sets:
c = a.copy()
c |= b
or
c &= b
Check if is a subset:
In [24]: a.issubset(b)
Out[24]: False
In [25]: a.issubset(c)
Out[25]: True
Sets are equal if their contents are equal:
In [26]: {1, 2, 3} == {3, 2, 1}
Out[26]: True
Set Operations#
| Function | Alternate Syntax | Description |
| --- | --- | --- |
| a.add(x) | N/A | Add element x to the set a |
| a.clear() | N/A | Reset the set a to an empty state, discarding all of its elements |
| a.remove(x) | N/A | Remove element x from the set a |
| a.pop() | N/A | Remove an arbitrary element from the set a, raising KeyError if the set is empty |
| a.union(b) | a \| b | All of the unique elements in a and b |
| a.update(b) | a \|= b | Set the contents of a to be the union of the elements in a and b |
| a.intersection(b) | a & b | All of the elements in both a and b |
| a.intersection_update(b) | a &= b | Set the contents of a to be the intersection of the elements in a and b |
| a.difference(b) | a - b | The elements in a that are not in b |
| a.difference_update(b) | a -= b | Set a to the elements in a that are not in b |
| a.symmetric_difference(b) | a ^ b | All of the elements in either a or b but not both |
| a.symmetric_difference_update(b) | a ^= b | Set a to contain the elements in either a or b but not both |
| a.issubset(b) | N/A | True if the elements of a are all contained in b |
| a.issuperset(b) | N/A | True if the elements of b are all contained in a |
List, Set and Dict Comprehensions#
List comprehensions allow us to concisely form a new list by filtering and transforming the elements of a collection.
The basic form:
[expr for val in collection if condition]
Equivalent to:
result = []
for val in collection:
if condition:
result.append(expr)
Example:
In [30]: strings = ['a', 'as', 'bat', 'car', 'dove', 'python']
In [31]: [x.upper() for x in strings if len(x) > 2]
Out[31]: ['BAT', 'CAR', 'DOVE', 'PYTHON']
Dict comp:
dict_comp = {key-expr : value-expr for value in collection if condition}
Set comp:
set_comp = {expr for value in collection if condition}
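For example, a dict comp mapping each string to its length, and a set comp of the unique lengths (a small sketch reusing the strings list from above):

strings = ['a', 'as', 'bat', 'car', 'dove', 'python']

# dict comp: map each string to its length
lengths = {s: len(s) for s in strings}
# {'a': 1, 'as': 2, 'bat': 3, 'car': 3, 'dove': 4, 'python': 6}

# set comp: the set of unique lengths
unique_lengths = {len(s) for s in strings}
# {1, 2, 3, 4, 6}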
Functions, Namespace and Scope#
Functions can access variables from both the local and global scope; in Python a variable scope is called a namespace.
Variables are assigned to the local namespace by default.
Local variables are declared within a function.
You can assign to global variables within a function, but you need to use the global keyword:
def global_array():
global a
a = []
Returning Multiple Values#
def f():
a = 5
b = 6
c = 7
return a, b, c
a, b, c = f()
Pretty much the same as tuple unpacking discussed earlier
Functions are objects#
You can use functions as arguments to other functions
import re

def remove_punctuation(value):
    return re.sub('[!#?]', '', value)
You can use map, which applies a function to a collection:
In [23]: map(remove_punctuation, states)
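Note that map is lazy in Python 3, returning an iterator; wrap it in list() to materialise the results. A sketch with a hypothetical states list:

import re

def remove_punctuation(value):
    return re.sub('[!#?]', '', value)

states = ['  Alabama ', 'Georgia!', 'south carolina##']  # hypothetical inputs

print(list(map(remove_punctuation, states)))
# ['  Alabama ', 'Georgia', 'south carolina']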
Anonymous Lambda functions#
Don’t use these
Example:
def short_function(x):
return x * 2
is equivalent to:
equiv_anon = lambda x: x * 2
Closures: Functions that return Functions#
A dynamically generated function returned by another function
A key property is that the returned function has access to the variables in the local namespace where it was created.
Example:
In [32]: def make_closure(a):
...: def closure():
...: print('I know the secret: {}'.format(a))
...: return closure
...:
...:
In [33]: closure = make_closure(5)
In [34]: closure()
I know the secret: 5
You can even mutate values in that local scope. One limitation is that you cannot rebind variables there, so it is best to modify an existing list or dict, or to make use of the nonlocal keyword.
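A minimal sketch of nonlocal (not the book's example): it lets the inner function rebind a variable from the enclosing scope rather than shadowing it.

def make_counter():
    count = 0
    def counter():
        nonlocal count  # rebind the enclosing count instead of creating a local one
        count += 1
        return count
    return counter

c = make_counter()
c()  # 1
c()  # 2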
Args and Kwargs#
When a function is defined as func(*args, **kwargs), its positional arguments are packed into the tuple args and its keyword arguments into the dict kwargs, as if the function body did:
a, b, c = args
d = kwargs.get('d', d_default_value)
e = kwargs.get('e', e_default_value)
This is done behind the scenes. You can therefore send variables into a function that aren't in the signature.
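A small sketch making the packing visible, using a hypothetical show_args function:

def show_args(*args, **kwargs):
    # positional arguments arrive packed in the tuple args,
    # keyword arguments packed in the dict kwargs
    print('args:', args)
    print('kwargs:', kwargs)

show_args(1, 2, c=3, d=4)
# args: (1, 2)
# kwargs: {'c': 3, 'd': 4}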
A lot more hectic stuff in the book…generators etc.
Files and Operating Systems#
Most examples in this book will use pandas.read_csv to import data files. However, it is good to know how it is done at a lower level:
Open a file for reading or writing:
In [1]: path = './hello_world.txt'
In [2]: f = open(path)
By default the file is opened in r mode (read-only).
You can iterate over it:
In [4]: for line in f:
   ...:     print(line)
If we had done f = open(path, 'w'), a new empty file would have overwritten that file.
Using x instead creates a new file, but fails if the file already exists.
You can read a certain number of characters or the whole thing:
In [3]: f.read()
Out[3]: 'Hello\nWorld\n\nWhoopee!!\n'
Reading advances the file position; once the whole file has been read, further reads return nothing, so reopen the file (or seek back) to read again:
In [8]: f.read(2)
Out[8]: 'He'
You can read it in binary mode:
In [9]: f = open(path, 'rb')
In [10]: f.read()
Out[10]: b'Hello\nWorld\n\nWhoopee!!\n'
You can check where the read position is with:
In [11]: f.tell()
Out[11]: 23
Go to another location with seek:
In [12]: f.seek(2)
Out[12]: 2
In [13]: f.read()
Out[13]: b'llo\nWorld\n\nWhoopee!!\n'
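One thing worth adding (not covered above): files should be closed when you are done with them, and the with statement does this automatically when the block exits:

# f is closed at the end of the block, even if an exception is raised inside it
with open('./hello_world.txt') as f:
    lines = [line.rstrip() for line in f]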
File Modes#
| Mode | Description |
| --- | --- |
| r | Read-only mode |
| w | Write-only mode; creates a new file (erasing the data for any file with the same name) |
| x | Write-only mode; creates a new file, but fails if the file path already exists |
| a | Append to existing file (create it if it does not exist) |
| r+ | Read and write |
| b | Add to mode for binary files, that is 'rb' or 'wb' |
| t | Text mode for files (automatically decoding bytes to Unicode); this is the default if not specified. Add t to other modes to use this, that is 'rt' or 'xt' |
File Methods#
| Method | Description |
| --- | --- |
| read([size]) | Return data from file as a string, with optional size argument indicating the number of bytes to read |
| readlines([size]) | Return list of lines (as strings) in the file, with optional size argument |
| write(str) | Write passed string to file |
| writelines(strings) | Write passed sequence of strings to the file |
| close() | Close the handle |
| flush() | Flush the internal I/O buffer to disk |
| seek(pos) | Move to indicated file position (integer) |
| tell() | Return current file position as integer |
| closed | True if the file is closed |
Chapter 4: Numpy Basics#
Numerical Python, one of the most important foundational packages for numerical computing.
Features:
- ndarray - an efficient multi-dimensional array providing fast array-oriented operations and flexible broadcasting
- Mathematical functions applied to all elements, no loops required
- Tools for reading and writing to disk and working with memory-mapped files
- Linear algebra, random number generation
- A C API for connecting with C, C++ and FORTRAN

Knowing numpy will help you use pandas more effectively.
For data science:
- Fast vectorised array operations
- Efficient descriptive statistics (aggregate and summary)
- Data alignment and relational manipulations as found in databases

pandas provides time series functionality not found in numpy.
Numpy is designed for efficiency on large arrays of data:
my_arr = np.arange(1000000)
my_list = list(range(1000000))
In [20]: %timeit for _ in range(10): my_arr2 = my_arr * 2
20.1 ms ± 949 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [21]: %timeit for _ in range(10): my_list2 = [x * 2 for x in my_list]
769 ms ± 14.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Ndarray#
An N-dimensional array object for homogeneous (same-type) data that lets you perform scalar operations on it (which really applies the operation to each element):
In [1]: import numpy as np
In [2]: data = np.random.randn(2, 3)
In [3]: data
Out[3]:
array([[-0.33620461, 1.62919406, -0.75813031],
[ 0.47703292, 0.67547306, -0.03832895]])
In [4]: data * 10
Out[4]:
array([[-3.36204611, 16.29194062, -7.5813031 ],
[ 4.77032922, 6.75473056, -0.38328952]])
In [5]: data + 10
Out[5]:
array([[ 9.66379539, 11.62919406, 9.24186969],
[10.47703292, 10.67547306, 9.96167105]])
Shape#
A tuple indicating the size of each dimension
In [7]: data.shape
Out[7]: (2, 3)
Datatype#
In [9]: data.dtype
Out[9]: dtype('float64')
An array, numpy array and ndarray are the same thing in this book
Dimensions#
Number of array dimensions
In [28]: data.ndim
Out[28]: 2
Creating ndarrays#
Use the array function:
In [11]: numbers = [0.1, 0.2, 0.3, 0.5678]
In [13]: number_array = np.array(numbers)
In [14]: number_array
Out[14]: array([0.1 , 0.2 , 0.3 , 0.5678])
Nested sequences will be converted into a multi-dimensional array
In [15]: numbers = [[1, 2, 3], [4, 5, 6]]
In [16]: multi_arr = np.array(numbers)
In [17]: multi_arr
Out[17]:
array([[1, 2, 3],
[4, 5, 6]])
You can create arrays of zeros and ones with zeros and ones:
In [29]: np.zeros(10)
Out[29]: array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
In [33]: np.ones((10, 10))
Out[33]:
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])
It is not safe to assume np.empty will return zeros; it will return uninitialised garbage values:
In [34]: np.empty((2, 3, 2))
Out[34]:
array([[[ 1.72723371e-077, -4.34288275e-311],
[ 2.96439388e-323, 2.22028315e-314],
[ 2.20432227e-314, 2.26173552e-314]],
[[ 2.20290004e-314, 2.20304074e-314],
[ 2.26375497e-314, 2.20285084e-314],
[ 2.20285084e-314, 8.34402697e-309]]])
arange is the numpy equivalent of Python's built-in range:
In [35]: np.arange(10)
Out[35]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Datatypes#
Datatypes are the source of numpy's flexibility in interacting with other systems and programming languages:
In [36]: arr1 = np.array([1, 2, 3], dtype=np.float32)
In [37]: arr2 = np.array([1, 2, 3], dtype=np.int32)
In [38]: arr1.dtype
Out[38]: dtype('float32')
In [40]: arr2.dtype
Out[40]: dtype('int32')
Knowing the exact number of bits is not required; a general idea of the kind of type (int, float, string) is enough.
You can cast to another datatype with astype:
In [42]: arr1.dtype
Out[42]: dtype('float32')
In [43]: int_arr = arr1.astype(np.int32)
In [44]: int_arr.dtype
Out[44]: dtype('int32')
The decimal part will be truncated.
A ValueError is raised if casting fails.
Calling astype always creates a new array.
Operations between Arrays and Scalars#
You can do batch operations on an ndarray without loops, called vectorization.
In [45]: arr = np.array([[1, 2, 3,], [4, 5, 6]])
In [46]: arr
Out[46]:
array([[1, 2, 3],
[4, 5, 6]])
In [47]: arr * arr
Out[47]:
array([[ 1, 4, 9],
[16, 25, 36]])
In [48]: arr - arr
Out[48]:
array([[0, 0, 0],
[0, 0, 0]])
In [49]: 1/ arr
Out[49]:
array([[1. , 0.5 , 0.33333333],
[0.25 , 0.2 , 0.16666667]])
In [50]: arr ** 0.5
Out[50]:
array([[1. , 1.41421356, 1.73205081],
[2. , 2.23606798, 2.44948974]])
Operations between differently sized arrays are called broadcasting.
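A tiny sketch of broadcasting: a 1-D array combined with a 2-D array is stretched across each row, provided the trailing dimensions match.

arr = np.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)
row = np.array([10, 20, 30])            # shape (3,)

arr + row
# array([[11, 22, 33],
#        [14, 25, 36]])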
Basic Indexing and Slicing#
They work the same as python lists.
In [51]: arr = np.arange(10)
In [52]: arr[5]
Out[52]: 5
In [53]: arr[2:3]
Out[53]: array([2])
In [54]: arr
Out[54]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [55]: arr[2:3] = 33
In [56]: arr
Out[56]: array([ 0, 1, 33, 3, 4, 5, 6, 7, 8, 9])
Importantly, slices of an ndarray are not copies of the data, merely views. Modifying a slice will therefore modify the original ndarray.
Numpy is performance oriented and not eager to make frivolous copies.
To copy a slice you must explicitly use copy():
arr_slice = arr[2:3].copy()
Multi-dimensional arrays can be indexed recursively or via comma-separated indices:
In [74]: arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
In [76]: arr2d[2]
Out[76]: array([7, 8, 9])
In [77]: arr2d[2][0]
Out[77]: 7
In [78]: arr2d[2, 0]
Out[78]: 7
Multi-dimensional slicing is hectic…read the book for more info
Boolean Indexing#
Say we have names and data:
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
data = np.random.randn(7, 4)
We can apply boolean expressions to the ndarray and they will be evaluated element-wise:
In [82]: names == 'Joe'
Out[82]: array([False, True, False, False, False, True, True])
The boolean array can be used to index another array axis (provided the axis it is slicing is the same size)
In [83]: data[names == 'Joe']
Out[83]:
array([[-1.61727753e-03, 1.11134244e+00, 3.17965850e-01,
1.66651524e-01],
[ 1.71907497e+00, 5.76550942e-01, -1.09187824e+00,
-8.16713776e-01],
[ 4.16526723e-01, 2.90342599e-02, -1.77285758e+00,
4.69447898e-01]])
If you want to combine multiple boolean conditions with AND and OR, remember to use parentheses around each condition:
In [85]: (names == 'Joe') | (names == 'Will')
Out[85]: array([False, True, True, False, True, True, True])
Selecting data from an array with boolean indexing always creates a copy of the data.
Only | and & work here, not their reserved-keyword equivalents (or, and).
You can then set data on the selected part of the array:
In [93]: data[names == 'Joe'] = 4
In [94]: data
Out[94]:
array([[-0.80941878, -0.9509764 , -0.27128519, 1.30517233],
[ 4. , 4. , 4. , 4. ],
[ 2.35196404, -0.60165119, -1.00170882, -1.25451698],
[-1.2806518 , 0.42240918, -0.57464507, -1.22217808],
[-0.26818876, -0.32669154, 2.20744615, -0.95180174],
[ 4. , 4. , 4. , 4. ],
[ 4. , 4. , 4. , 4. ]])
Fancy Indexing#
Fancy Indexing is indexing using integer arrays
arr = np.empty((8, 4))
for i in range(8):
arr[i] = i
In [100]: arr
Out[100]:
array([[0., 0., 0., 0.],
[1., 1., 1., 1.],
[2., 2., 2., 2.],
[3., 3., 3., 3.],
[4., 4., 4., 4.],
[5., 5., 5., 5.],
[6., 6., 6., 6.],
[7., 7., 7., 7.]])
To select a subset of rows in a particular order:
In [104]: arr[[2, 4, 5, 6]]
Out[104]:
array([[2., 2., 2., 2.],
[4., 4., 4., 4.],
[5., 5., 5., 5.],
[6., 6., 6., 6.]])
Using multiple indices:
In [105]: arr = np.arange(32).reshape((8, 4))
In [106]: arr
Out[106]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15],
[16, 17, 18, 19],
[20, 21, 22, 23],
[24, 25, 26, 27],
[28, 29, 30, 31]])
In [107]: arr[[1, 2, 3], [0, 3, 1]]
Out[107]: array([ 4, 11, 13])
The resulting copy contains the elements (1, 0), (2, 3) and (3, 1).
Fancy indexing always copies the data.
Transposing arrays and swapping axes#
Transposing returns a view, not a copy.
Arrays have the transpose method and the special T attribute:
In [109]: arr = np.arange(15).reshape((5, 3))
In [110]: arr
Out[110]:
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11],
[12, 13, 14]])
In [111]: arr.T
Out[111]:
array([[ 0, 3, 6, 9, 12],
[ 1, 4, 7, 10, 13],
[ 2, 5, 8, 11, 14]])
Calculating the inner matrix product (whatever that means):
In [114]: np.dot(arr.T, arr)
Out[114]:
array([[270, 300, 330],
[300, 335, 370],
[330, 370, 410]])
Higher dimensional arrays can be given an array of axis numbers to transpose around:
In [115]: arr = np.arange(16).reshape((2, 2, 4))
In [116]: arr
Out[116]:
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7]],
[[ 8, 9, 10, 11],
[12, 13, 14, 15]]])
In [117]: arr.transpose((1, 0, 2))
Out[117]:
array([[[ 0, 1, 2, 3],
[ 8, 9, 10, 11]],
[[ 4, 5, 6, 7],
[12, 13, 14, 15]]])
swapaxes takes a pair of axes to swap:
In [118]: arr
Out[118]:
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7]],
[[ 8, 9, 10, 11],
[12, 13, 14, 15]]])
In [119]: arr.swapaxes(1, 2)
Out[119]:
array([[[ 0, 4],
[ 1, 5],
[ 2, 6],
[ 3, 7]],
[[ 8, 12],
[ 9, 13],
[10, 14],
[11, 15]]])
Universal Functions#
A ufunc (universal function) performs element-wise operations on data in ndarrays.
Unary ufuncs#
np.sqrt()
In [120]: arr = np.arange(10)
In [121]: np.sqrt(arr)
Out[121]:
array([0. , 1. , 1.41421356, 1.73205081, 2. ,
2.23606798, 2.44948974, 2.64575131, 2.82842712, 3. ])
np.exp() - calculate the exponential:
In [124]: np.exp(arr)
Out[124]:
array([1.00000000e+00, 2.71828183e+00, 7.38905610e+00, 2.00855369e+01,
5.45981500e+01, 1.48413159e+02, 4.03428793e+02, 1.09663316e+03,
2.98095799e+03, 8.10308393e+03])
Others
abs, fabs, sqrt, square, exp, log, log10, sign, ceil, floor, rint, modf, isnan, isfinite, isinf, cos, cosh, sin, sinh, tan, tanh, logical_not
Binary Functions#
np.maximum(x, y) - gives the element-wise maximum of two arrays:
In [126]: x = np.random.randn(8)
In [127]: y = np.random.randn(8)
In [128]: x
Out[128]:
array([ 1.0055233 , -0.15194186, 0.89593432, 0.40450036, -0.21771624,
0.61883328, -1.18958781, -1.13737865])
In [129]: y
Out[129]:
array([-0.84924147, -0.79931119, 1.11336264, -0.19901553, -0.33127759,
-0.48134005, 0.17885233, 1.33822367])
In [130]: np.maximum(x, y)
Out[130]:
array([ 1.0055233 , -0.15194186, 1.11336264, 0.40450036, -0.21771624,
0.61883328, 0.17885233, 1.33822367])
Others:
add, subtract, multiply, divide, power, maximum, minimum, mod, copysign, greater, greater_equal, less, not_equal, logical_and, logical_or, logical_xor
Loop Free Programming#
np.meshgrid takes two 1D arrays and produces two 2D matrices corresponding to all pairs of (x, y) from the two arrays:
In [131]: arr = np.arange(-5, 5, 0.01) # 1000 equally spaced points
In [135]: xs, ys = np.meshgrid(arr, arr)
z = np.sqrt(xs ** 2 + ys ** 2)
plt.imshow(z, cmap=plt.cm.gray); plt.colorbar()
plt.title("Image plot of $\sqrt{x^2 + y^2}$ for a grid of values")
Expressing Conditional Logic as Array Operations#
Getting values from different arrays depending on a boolean array:
xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])
result = [(x if c else y) for x, y, c in zip(xarr, yarr, cond)]
In [7]: result
Out[7]: [1.1, 2.2, 1.3, 1.4, 2.5]
The problem with the above is that it will be slow, as it is done in pure Python, and it won't work with multi-dimensional arrays. Better to make use of where:
In [8]: result = np.where(cond, xarr, yarr)
In [9]: result
Out[9]: array([1.1, 2.2, 1.3, 1.4, 2.5])
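The second and third arguments to np.where don't need to be arrays; scalars broadcast too, which is handy for deriving a new array from an existing one. A small sketch:

arr = np.random.randn(4, 4)

# replace positive values with 2 and negative values with -2
np.where(arr > 0, 2, -2)

# or replace only the positive values, keeping the rest unchanged
np.where(arr > 0, 2, arr)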
Mathematical and Statistical Methods#
Get normally distributed random data:
In [10]: arr = np.random.randn(400)
Mean#
In [12]: arr.mean()
Out[12]: 0.02312549029816819
or:
In [13]: np.mean(arr)
Out[13]: 0.02312549029816819
Sum#
In [14]: arr.sum()
Out[14]: 9.250196119267276
The above functions take an optional axis argument, e.g.:
arr.mean(1)
arr.mean(axis=1)
Cumsum and Cumprod#
Cumulative sum and cumulative product produce an array of the intermediate results:
arr.cumprod()
arr.cumsum()
In a multi-dimensional array, accumulation returns an array of the same size, accumulated along the specified axis:
In [20]: arr = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
In [21]: arr
Out[21]:
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
In [22]: arr.cumsum(0)
Out[22]:
array([[ 0, 1, 2],
[ 3, 5, 7],
[ 9, 12, 15]])
as opposed to:
In [23]: arr.cumsum()
Out[23]: array([ 0, 1, 3, 6, 10, 15, 21, 28, 36])
Available methods#
sum, mean, std, var, min, max, argmin, argmax, cumsum, cumprod
Methods of Boolean Arrays#
Sum is often used to count True values:
In [24]: arr = np.random.randn(100)
In [28]: (arr > 0).sum()
Out[28]: 48
Other methods:
- any() checks if any values are True
- all() checks if all values are True

In [29]: (arr > 0).any()
Out[29]: True
In [30]: (arr > 0).all()
Out[30]: False
Sorting#
Arrays can be sorted in place with arr.sort():
In [31]: arr = np.random.randn(10)
In [32]: arr
Out[32]:
array([ 0.71608056, 1.34479278, 1.41800966, -0.48860031, -0.56617617,
-1.42766719, -0.30723866, -0.91101707, 1.70796963, -0.09524445])
In [33]: arr.sort()
In [34]: arr
Out[34]:
array([-1.42766719, -0.91101707, -0.56617617, -0.48860031, -0.30723866,
-0.09524445, 0.71608056, 1.34479278, 1.41800966, 1.70796963])
A multi-dimensional array can be sorted along an axis: 0 is column-wise and 1 is row-wise, e.g.:
In [45]: arr = np.random.randn(2, 3)
In [46]: arr
Out[46]:
array([[ 0.28019483, -0.05325257, -2.63604385],
[-0.24228228, -0.63299377, -1.84148469]])
In [47]: arr.sort(0)
In [48]: arr
Out[48]:
array([[-0.24228228, -0.63299377, -2.63604385],
[ 0.28019483, -0.05325257, -1.84148469]])
Finding a quantile:
In [49]: large_arr = np.random.randn(1000)
In [50]: large_arr.sort()
In [51]: large_arr[int(0.05 * len(large_arr))]
Out[51]: -1.6273103628065735
This gives the 5% quantile
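For comparison (not from the book's example), np.percentile computes quantiles directly without sorting by hand; it interpolates between points, so the result can differ slightly from the sorted-index approach:

np.percentile(large_arr, 5)  # the 5% quantile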
Unique and Other set logic#
np.unique returns the sorted unique values in an array:
In [52]: names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
In [55]: np.unique(names)
Out[55]: array(['Bob', 'Joe', 'Will'], dtype='<U4')
Testing membership of one array in another:
In [56]: values = np.array([6, 0, 0, 3, 2, 5, 6, 9, 7, 9, 7, 9])
In [57]: np.in1d(values, [6, 7, 9])
Out[57]:
array([ True, False, False, False, False, False, True, True, True,
True, True, True])
Other methods:
unique(x), intersect1d(x, y), union1d(x, y), in1d(x, y), setdiff1d(x, y), setxor1d(x, y)
File Input and Output#
np.load and np.save are the workhorses of saving and loading array data on disk.
Files are saved uncompressed with the extension .npy:
In [58]: arr = np.arange(10)
In [59]: arr
Out[59]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [60]: np.save('some_array', arr)
In [61]: %pwd
Out[61]: '/Users/stephen/projects/fixes'
In [62]: np.load('some_array.npy')
Out[62]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
You can save multiple arrays in a zip archive with savez():
np.savez('array_archive.npz', a=arr, b=arr)
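Loading an .npz archive returns a dict-like object that loads each array lazily on access:

arch = np.load('array_archive.npz')
arch['b']
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])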
Linear Algebra#
Matrix multiplication, decompositions, determinants and other square matrix math.
Multiplying two 2-dimensional arrays with * gives an element-wise product instead of a matrix dot product.
That is why there is a separate function, dot - available both as an array method and as a function in the numpy namespace - for matrix multiplication.
In [197]: x = np.array([[1., 2., 3.], [4., 5., 6.]])
In [198]: y = np.array([[6., 23.], [-1, 7], [8, 9]])
In [199]: x
Out[199]:
array([[ 1., 2., 3.],
[ 4., 5., 6.]])
In [200]: y
Out[200]:
array([[ 6., 23.],
[ -1., 7.],
[ 8., 9.]])
In [68]: x.dot(y)
Out[68]:
array([[ 28., 64.],
[ 67., 181.]])
You can use @ as an infix operator that performs matrix multiplication:
In [69]: x @ np.ones(3)
Out[69]: array([ 6., 15.])
numpy.linalg has a standard set of matrix decompositions and things like determinant and inverse.
It is a bit beyond scope at this stage…read the book.
Pseudorandom Number Generation#
numpy.random supplements the standard Python random module.
You can get a 4 x 4 array of normally distributed samples using normal:
In [70]: samples = np.random.normal(size=(4, 4))
In [71]: samples
Out[71]:
array([[-0.5007292 , -0.34324411, -0.22568356, 0.6718186 ],
[-1.10226626, 1.36328937, 0.87324304, 0.62881017],
[ 0.12820656, 0.7683322 , 1.45949403, 1.93611694],
[ 0.63583831, -2.30817119, -1.42783868, -0.79192873]])
Random Functions#
| Function | Description |
| --- | --- |
| seed | Seed the random number generator |
| permutation | Return a random permutation of a sequence, or return a permuted range |
| shuffle | Randomly permute a sequence in place |
| rand | Draw samples from a uniform distribution |
| randint | Draw random integers from a given low-to-high range |
| randn | Draw samples from a normal distribution with mean 0 and standard deviation 1 (MATLAB-like interface) |
| binomial | Draw samples from a binomial distribution |
| normal | Draw samples from a normal (Gaussian) distribution |
| beta | Draw samples from a beta distribution |
| chisquare | Draw samples from a chi-square distribution |
| gamma | Draw samples from a gamma distribution |
| uniform | Draw samples from a uniform [0, 1) distribution |
Random Walks#
More on Random Walks in the book
Chapter 5: Getting Started with Pandas#
Pandas contains data structures and data manipulation tools designed to make data cleaning and analysis fast and easy.
Import convention:
import pandas as pd
Data Structures#
Pandas' two workhorse data structures are the DataFrame and the Series.
Series#
A one-dimensional array-like object containing a sequence of values of a single type and an associated array of labels, called an index.
In [2]: series = pd.Series([1, 4, 6, 7, 8])
In [3]: series
Out[3]:
0 1
1 4
2 6
3 7
4 8
dtype: int64
Labels on the left, values on the right
Get the values of a series as an array:
In [5]: series.values
Out[5]: array([1, 4, 6, 7, 8])
Get the index of the series:
In [7]: series.index
Out[7]: RangeIndex(start=0, stop=5, step=1)
Often you want to set data with labels:
In [8]: series = pd.Series([4, 5, 6],['a', 'b', 'c'])
In [9]: series
Out[9]:
a 4
b 5
c 6
dtype: int64
Select a single value:
In [11]: series['b']
Out[11]: 5
Select multiple values:
In [12]: series[['b', 'c']]
Out[12]:
b 5
c 6
dtype: int64
Filtering, scalar multiplication and applying math functions will preserve the index link:
In [15]: series[series > 4]
Out[15]:
b 5
c 6
dtype: int64
In [16]: series * 2
Out[16]:
a 8
b 10
c 12
dtype: int64
In [19]: np.exp(series)
Out[19]:
a 54.598150
b 148.413159
c 403.428793
dtype: float64
You can think of a series as a fixed length, ordered dict.
Check if a label / index exists:
In [20]: 'b' in series
Out[20]: True
Creating a series from a python dict:
In [23]: p_data = {'Gauteng': 5900, 'Western Cape': 3200, 'Northern Cape': 200}
In [24]: province_series = pd.Series(p_data)
In [25]: province_series
Out[25]:
Gauteng 5900
Western Cape 3200
Northern Cape 200
dtype: int64
You can override the dict's order by passing the index kwarg (keys not listed are dropped):
In [26]: province_series = pd.Series(p_data, index=['Northern Cape', 'Western Cape'])
In [27]: province_series
Out[27]:
Northern Cape 200
Western Cape 3200
dtype: int64
If a corresponding value for an index label is not found, you get a NaN:
In [28]: province_series = pd.Series(p_data, index=['Northern Cape', 'Western Cape', 'KwaZulu Natal'])
In [29]: province_series
Out[29]:
Northern Cape 200.0
Western Cape 3200.0
KwaZulu Natal NaN
dtype: float64
NaN stands for Not a Number, which pandas uses to mark NA or missing values.
pandas.isnull and pandas.notnull should be used to detect missing values:
In [30]: pd.isnull(province_series)
Out[30]:
Northern Cape False
Western Cape False
KwaZulu Natal True
dtype: bool
Data alignment can be thought of as a database JOIN.
The Series and its index have a name attribute, which integrates with other pandas functionality:
In [32]: series.name = 'FF+ Voters'
In [33]: series.name
Out[33]: 'FF+ Voters'
In [36]: series.index.name = 'Province'
An index can be altered in-place:
In [37]: series.index = ['Eastern', 'Western', 'Northern']
In [38]: series
Out[38]:
Eastern 4
Western 5
Northern 6
Name: FF+ Voters, dtype: int64
DataFrame#
A rectangular table of data, with an ordered collection of columns that can each have a different type. It has row and column labels.
Under the hood it is stored as one or more two-dimensional blocks.
You can use hierarchical indexing to represent higher-dimensional data.
Creating a DataFrame from a dict of equal-length lists or numpy arrays:
In [41]: data = {'province': ['gauteng', 'limpopo', 'northern cape'], 'year': ['2000', '2001', '2002'], 'pop': ['10000', '20000', '30000']}
In [42]: frame = pd.DataFrame(data)
In [43]: frame
Out[43]:
province year pop
0 gauteng 2000 10000
1 limpopo 2001 20000
2 northern cape 2002 30000
To view the first few rows use head():
In [44]: frame.head()
Out[44]:
province year pop
0 gauteng 2000 10000
1 limpopo 2001 20000
2 northern cape 2002 30000
Use the columns kwarg to arrange the columns in that order:
In [45]: frame = pd.DataFrame(data, columns=['province', 'pop', 'year'])
In [46]: frame
Out[46]:
province pop year
0 gauteng 10000 2000
1 limpopo 20000 2001
2 northern cape 30000 2002
If you pass a column that does not exist, it will be created with missing values:
In [51]: frame = pd.DataFrame(data, columns=['province', 'pop', 'year', 'dest'])
In [52]: frame
Out[52]:
province pop year dest
0 gauteng 10000 2000 NaN
1 limpopo 20000 2001 NaN
2 northern cape 30000 2002 NaN
Get the columns:
In [53]: frame.columns
Out[53]: Index(['province', 'pop', 'year', 'dest'], dtype='object')
Retrieve a single column as a series:
In [55]: frame.year
Out[55]:
0 2000
1 2001
2 2002
Name: year, dtype: object
or
In [56]: frame['year']
Out[56]:
0 2000
1 2001
2 2002
Name: year, dtype: object
It is more precise to use the index notation (not property dot-notation)
Columns can be modified by assignment of a scalar:
In [61]: frame['pop'] = 100
In [62]: frame
Out[62]:
province pop year dest
0 gauteng 100 2000 NaN
1 limpopo 100 2001 NaN
2 northern cape 100 2002 NaN
or of a numpy array (vector):
In [67]: frame['pop'] = np.arange(3.)
In [68]: frame
Out[68]:
province pop year dest
0 gauteng 0.0 2000 NaN
1 limpopo 1.0 2001 NaN
2 northern cape 2.0 2002 NaN
When assigning an array, the length must match the DataFrame's.
Assigning to a column that does not exist will create a new column.
The del keyword will delete columns:
In [71]: del frame['dest']
In [72]: frame
Out[72]:
province pop year
0 gauteng 0 2000
1 limpopo 1 2001
2 northern cape 2 2002
Create a new column with a boolean expression
In [75]: frame['is_gauteng'] = frame['province'] == 'gauteng'
In [76]: frame
Out[76]:
province pop year is_gauteng
0 gauteng 0 2000 True
1 limpopo 1 2001 False
2 northern cape 2 2002 False
The column returned, much like with a numpy array, is a view, not a copy. Any in-place modification will therefore update the DataFrame.
Another common scenario is nested dicts, which will take outer dict keys as column labels and inner dict keys as row indices.
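A small sketch in the style of the province data above:

pop = {'gauteng': {2000: 10000, 2001: 12000},
       'limpopo': {2001: 20000, 2002: 21000}}

frame3 = pd.DataFrame(pop)
#       gauteng  limpopo
# 2000  10000.0      NaN
# 2001  12000.0  20000.0
# 2002      NaN  21000.0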
You can transpose (swap) a DataFrame's columns and rows with T:
In [77]: frame.T
Out[77]:
0 1 2
province gauteng limpopo northern cape
pop 0 1 2
year 2000 2001 2002
is_gauteng True False False
Set the index and columns names:
frame.index.name = 'year'
frame.columns.name = 'province'
View the values of a dataframe as a 2-d numpy array:
In [83]: frame.values
Out[83]:
array([['gauteng', 0, '2000', True],
['limpopo', 1, '2001', False],
['northern cape', 2, '2002', False]], dtype=object)
If the datatypes of the columns differ, the dtype of the values array will be object.
Index Objects#
Index objects hold the axis labels and other metadata, like the axis names.
In [5]: obj = pd.Series(range(3), index=['a', 'b', 'c'])
In [6]: obj.index
Out[6]: Index(['a', 'b', 'c'], dtype='object')
Index objects are immutable so they cannot be changed by the user
Immutability makes it easier to share index objects among data structures.
Types of indexes:
- Index - array of Python objects
- Int64Index - integer values
- Float64Index - float values
- MultiIndex - for hierarchical indexes; an array of tuples
- RangeIndex - evenly spaced sequence of integers
- CategoricalIndex - an index of categorical values
- DatetimeIndex - nanosecond timestamps using numpy's datetime64
- PeriodIndex - timespans
Essential Functionality#
Reindexing#
Create a new object conformed to a new index:
In [7]: obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
In [8]: obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
In [9]: obj
Out[9]:
d 4.5
b 7.2
a -5.3
c 3.6
dtype: float64
In [10]: obj2
Out[10]:
a -5.3
b 7.2
c 3.6
d 4.5
e NaN
dtype: float64
For ordered data like time series, it might be useful to fill in values when reindexing, using the ffill method:
In [11]: obj3 = pd.Series(['red', 'green', 'blue'], index=[1, 3, 5])
In [14]: obj3
Out[14]:
1 red
3 green
5 blue
dtype: object
In [15]: obj3.reindex(range(6), method='ffill')
Out[15]:
0 NaN
1 red
2 red
3 green
4 green
5 blue
dtype: object
bfill, on the other hand, fills backwards:
In [17]: obj3.reindex(range(6), method='bfill')
Out[17]:
0 red
1 red
2 green
3 green
4 blue
5 blue
dtype: object
reindex can alter the (row) index, the columns, or both:
In [18]: frame = pd.DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'b', 'c'],
...: columns=['gauteng', 'northern cape', 'limpopo'])
In [19]: frame
Out[19]:
gauteng northern cape limpopo
a 0 1 2
b 3 4 5
c 6 7 8
In [20]: frame2 = frame.reindex(['a', 'b', 'c', 'd'])
In [21]: frame2
Out[21]:
gauteng northern cape limpopo
a 0.0 1.0 2.0
b 3.0 4.0 5.0
c 6.0 7.0 8.0
d NaN NaN NaN
Columns are reindexed using the columns keyword:
In [22]: frame.reindex(columns=['western cape', 'mpumalanga', 'kwazulu natal'])
Out[22]:
western cape mpumalanga kwazulu natal
a NaN NaN NaN
b NaN NaN NaN
c NaN NaN NaN
Reindexing can also be done with the label indexer loc (although, as the warning below shows, this usage is deprecated):
In [24]: frame.loc[['a', 'b', 'c', 'd'], ['gauteng', 'limpopo', 'northern cape']]
/Library/Frameworks/Python.framework/Versions/3.6/bin/ipython:1: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
#!/Library/Frameworks/Python.framework/Versions/3.6/bin/python3.6
Out[24]:
gauteng limpopo northern cape
a 0.0 2.0 1.0
b 3.0 5.0 4.0
c 6.0 8.0 7.0
d NaN NaN NaN
Dropping Entries from index#
Use drop:
In [28]: frame2
Out[28]:
gauteng limpopo northern cape
a 0.0 2.0 1.0
b 3.0 5.0 4.0
c 6.0 8.0 7.0
d NaN NaN NaN
In [29]: frame2.drop('b')
Out[29]:
gauteng limpopo northern cape
a 0.0 2.0 1.0
c 6.0 8.0 7.0
d NaN NaN NaN
drop a column:
In [30]: frame2.drop(columns=['gauteng', 'northern cape'])
Out[30]:
limpopo
a 2.0
b 5.0
c 8.0
d NaN
Modify in place without returning a value, using inplace=True:
frame2.drop(columns=['gauteng', 'northern cape'], inplace=True)
Indexing, Selection and Filtering#
Indexing works the same as with a numpy array, except you can use labels as well as integers.
loc (label-based selection) and iloc (integer-based selection) enable you to select a subset of rows and columns with numpy-like notation:
data.loc['Colorado', ['two', 'three']]
data.iloc[[1, 2], [3, 0, 1]]
Arithmetic and Data Alignment#
Adding Series or DataFrames creates a new Series or DataFrame with NaN values for labels that don't exist in both.
You can fill in values for missing labels with:
df1.add(df2, fill_value=0)
This also works with reindex.
Operations between a Series and a DataFrame are called broadcasting.
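A minimal sketch of frame-series broadcasting (not from the notes above): the series' index is matched against the frame's columns, then the operation is broadcast down the rows.

frame = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=['a', 'b', 'c'])
series = frame.iloc[0]  # first row: a=0.0, b=1.0, c=2.0

frame - series
#      a    b    c
# 0  0.0  0.0  0.0
# 1  3.0  3.0  3.0
# 2  6.0  6.0  6.0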
Function application and mappings#
In [165]: frame
Out[165]:
b d e
Utah -0.204708 0.478943 -0.519439
Ohio -0.555730 1.965781 1.393406
Texas 0.092908 0.281746 0.769023
Oregon 1.246435 1.007189 -1.296221
In [166]: np.abs(frame)
Out[166]:
b d e
Utah 0.204708 0.478943 0.519439
Ohio 0.555730 1.965781 1.393406
Texas 0.092908 0.281746 0.769023
Oregon 1.246435 1.007189 1.296221
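The section title also promises function application: the apply method runs a function once per column (or once per row with axis=1). A sketch using the frame above; the outputs follow from the values shown:

# spread (max minus min) of each column
f = lambda x: x.max() - x.min()

frame.apply(f)
# b    1.802165
# d    1.684035
# e    2.689627
# dtype: float64

frame.apply(f, axis=1)  # the same reduction, row-wise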
Sorting and Ranking#
On a Series, sort by index with sort_index:
obj.sort_index()
In a DataFrame you can sort by the row index, or by the column labels with axis=1:
frame.sort_index()
or
frame.sort_index(axis=1)
You can set to ascending or descending order:
frame.sort_index(axis=1, ascending=False)
You can sort values with, you guessed it, sort_values().
NaN values are sent to the end by default.
In a dataframe you can sort by 1 or more columns:
frame.sort_values(by='b')
or
frame.sort_values(by=['a', 'b'])
Ranking is similar to sorting, it gives numbers representing where the value should rank in sorting:
In [32]: series = pd.Series([1, 6, -5, -6, 7, 8, 10, -5])
In [33]: series.rank()
Out[33]:
0 4.0
1 5.0
2 2.5
3 1.0
4 6.0
5 7.0
6 8.0
7 2.5
dtype: float64
Rank based on the order in which the values appear in the data (ties are broken by whichever value appears first):
In [36]: series.rank(method='first')
Out[36]:
0 4.0
1 5.0
2 2.0
3 1.0
4 6.0
5 7.0
6 8.0
7 3.0
dtype: float64
Tie-breaking methods:
average, min, max, first, dense
Duplicate Index values#
Unique axis labels are not mandatory
In [37]: obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
In [38]: obj
Out[38]:
a 0
a 1
b 2
b 3
c 4
dtype: int64
Check if the index is unique:
In [39]: obj.index.is_unique
Out[39]: False
Selecting a duplicated label will return both values:
In [40]: obj['a']
Out[40]:
a 0
a 1
dtype: int64
Descriptive Statistics#
sum()
In [41]: frame
Out[41]:
gauteng northern cape limpopo
a 0 1 2
b 3 4 5
c 6 7 8
In [42]: frame.sum()
Out[42]:
gauteng 9
northern cape 12
limpopo 15
dtype: int64
sum(axis=1)
In [43]: frame.sum(axis=1)
Out[43]:
a 3
b 12
c 21
dtype: int64
Set an NaN value (note: this assigns the string 'NaN', not the actual np.nan value):
In [56]: frame.loc['a', 'gauteng'] = 'NaN'
In [57]: frame
Out[57]:
gauteng northern cape limpopo
a NaN 1 2
b 3 4 5
c 6 7 8
So you can choose not to skip NaN values with skipna, which is True by default:
In [58]: frame.mean(axis=1)
Out[58]:
a 1.5
b 4.5
c 7.5
dtype: float64
In [59]: frame.mean(axis=1, skipna=False)
Out[59]:
a 1.5
b 4.5
c 7.5
dtype: float64
No difference in the above case - because the string 'NaN' was assigned rather than the real np.nan, the gauteng column is non-numeric and is excluded from the row means entirely. With an actual np.nan, skipna=False would give NaN for row a.
describe
the frame:
In [63]: frame.describe()
Out[63]:
northern cape limpopo
count 3.0 3.0
mean 4.0 5.0
std 3.0 3.0
min 1.0 2.0
25% 2.5 3.5
50% 4.0 5.0
75% 5.5 6.5
max 7.0 8.0
describe produces alternative summary statistics for non-numeric data.
Methods:
count, describe, min, max, sum, mean, median, prod, var, std, skew, kurt, cumsum, diff
Correlation and Covariance#
Let's consider some stock closing prices and volumes for these summary statistics.
We need:
pip install pandas_datareader
Let's get the data:
import pandas as pd
import pandas_datareader.data as web
all_data = {ticker: web.get_data_yahoo(ticker) for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}
price = pd.DataFrame({ticker: data['Adj Close'] for ticker, data in all_data.items()})
volume = pd.DataFrame({ticker: data['Volume'] for ticker, data in all_data.items()})
In [9]: price.head()
Out[9]:
AAPL IBM MSFT GOOG
Date
2010-01-04 27.095369 102.944206 24.827723 311.349976
2010-01-05 27.142210 101.700638 24.835745 309.978882
2010-01-06 26.710482 101.039986 24.683319 302.164703
2010-01-07 26.661104 100.690277 24.426620 295.130463
2010-01-08 26.838356 101.700638 24.595085 299.064880
Compute the percentage changes in the price:
In [10]: returns = price.pct_change()
In [11]: returns.tail()
Out[11]:
AAPL IBM MSFT GOOG
Date
2018-10-17 -0.004321 -0.076282 -0.002613 -0.004985
2018-10-18 -0.023374 -0.026110 -0.019962 -0.024846
2018-10-19 0.015230 -0.011107 0.001475 0.007804
2018-10-22 0.006110 0.007126 0.008927 0.004287
2018-10-23 0.009427 0.009152 -0.013956 0.002297
The corr
method of Series
computes the correlation of the overlapping, non-NA, aligned-by-index values in two Series. cov
computes the covariance.
In [15]: returns['MSFT'].corr(returns['IBM'])
Out[15]: 0.4723006230591393
In [16]: returns['MSFT'].cov(returns['IBM'])
Out[16]: 8.074798783252494e-05
Using these methods on the DataFrame
returns a correlation or covariance matrix.
In [17]: returns.cov()
Out[17]:
AAPL IBM MSFT GOOG
AAPL 0.000251 0.000069 0.000094 0.000105
IBM 0.000069 0.000145 0.000081 0.000073
MSFT 0.000094 0.000081 0.000201 0.000111
GOOG 0.000105 0.000073 0.000111 0.000231
In [18]: returns.corr()
Out[18]:
AAPL IBM MSFT GOOG
AAPL 1.000000 0.362561 0.419753 0.436144
IBM 0.362561 1.000000 0.472301 0.396107
MSFT 0.419753 0.472301 1.000000 0.513596
GOOG 0.436144 0.396107 0.513596 1.000000
You can do pair-wise correlation with corrwith
:
In [20]: returns.corrwith(returns['GOOG'])
Out[20]:
AAPL 0.436144
IBM 0.396107
MSFT 0.513596
GOOG 1.000000
dtype: float64
Correlation of percentage changes with volume:
In [21]: returns.corrwith(volume)
Out[21]:
AAPL -0.066129
IBM -0.173960
MSFT -0.087216
GOOG -0.017287
dtype: float64
Unique Values, Value Counts and Membership#
In [22]: obj = pd.Series(['a', 'b', 'c', 'c', 'b', 'b', 'a', 'b', 'a', 'c', 'b' ])
In [23]: uniques = obj.unique()
In [24]: uniques
Out[24]: array(['a', 'b', 'c'], dtype=object)
Unique values are not returned in sorted order but can be sorted after the fact if needed
value_counts()
computes a series containing value frequencies:
In [25]: obj.value_counts()
Out[25]:
b 5
c 3
a 3
dtype: int64
It is sorted in descending order for convenience
You can use value_counts()
from the top-level pandas namespace too:
In [26]: pd.value_counts(obj, sort=False)
Out[26]:
b    5
a    3
c    3
dtype: int64
Membership check with isin()
:
In [27]: mask = obj.isin(['b', 'c'])
In [28]: mask
Out[28]:
0 False
1 True
2 True
3 True
4 True
5 True
6 False
7 True
8 False
9 True
10 True
dtype: bool
In [29]: obj[mask]
Out[29]:
1 b
2 c
3 c
4 b
5 b
7 b
9 c
10 b
dtype: object
Index.get_indexer
maps an array of possibly non-distinct values to their integer positions in an index of distinct values:
In [30]: unique_vals = obj.unique()
In [31]: unique_vals
Out[31]: array(['a', 'b', 'c'], dtype=object)
In [32]: pd.Index(unique_vals).get_indexer(obj)
Out[32]: array([0, 1, 2, 2, 1, 1, 0, 1, 0, 2, 1])
Chapter 6: Data Loading, Storage and File Formats#
Reading and writing data in text formats#
Python is good for working with text and file munging due to its:
- simple syntax for interacting with files
- lightweight built-in datastructures
- convenient language features (tuple packing and unpacking)
Reading data into a dataframe:
read_csv - load comma-separated values from a file, URL or file-like object
read_table - load delimited data from a file (\t is the default delimiter)
read_fwf - load data from a file with fixed-width columns
read_clipboard - read data from the clipboard (handy for tables on web pages)
read_excel - read tabular data from an Excel XLS or XLSX file
read_hdf - read from HDF5
read_html - read from <table>s in an HTML document
read_json - read from a JSON (JavaScript Object Notation) string
read_msgpack - read from the MessagePack binary format
read_pickle - read an object from the Python pickle format
read_sas - read a SAS dataset
read_sql - read the result of a SQL query (using SQLAlchemy) as a pandas DataFrame
read_stata - read from the Stata file format
read_feather - read from the Feather file format
Real-world data is messy, so read_csv has many options for cleaning it up as it loads.
Type inference means you don't need to explicitly specify column data types for numeric, integer, boolean or string columns. Dates, however, do require a bit of effort.
In [1]: !cat ex1.csv
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
In [3]: df = pd.read_csv('ex1.csv')
In [4]: df
Out[4]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
We could have also used:
In [7]: df = pd.read_table('ex1.csv', delimiter=',')
If your data does not have a header row (the first line), you can say so and let pandas assign default column names:
df = pd.read_csv('ex1.csv', header=None)
or set the names yourself with:
df = pd.read_csv('ex1.csv', names=['a', 'b', 'c', 'd', 'message'])
A hierarchical index can be formed from multiple columns using index_col
:
In [11]: !cat csv_mindex.csv
key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16
In [13]: parsed = pd.read_csv('csv_mindex.csv', index_col=['key1', 'key2'])
In [14]: parsed
Out[14]:
value1 value2
key1 key2
one a 1 2
b 3 4
c 5 6
d 7 8
two a 9 10
b 11 12
c 13 14
d 15 16
In some cases a file may not have a fixed delimiter, perhaps just whitespace. In that case you can give a regular expression as the sep
kwarg:
In [23]: list(open('ex2.txt'))
Out[23]:
[' A B C\n',
' aaa -0.264438 -1.026059 -0.619500\n',
' bbb 0.927272 0.302904 -0.032399\n',
' ccc -0.264273 -0.386314 -0.217601\n',
' ddd -0.871858 -0.348382 1.100491\n']
In [24]: result = pd.read_table('ex2.txt', sep='\s+')
In [25]: result
Out[25]:
A B C
aaa -0.264438 -1.026059 -0.619500
bbb 0.927272 0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382 1.100491
You can skip certain rows with skiprows=[0, 3, 4]
Handling missing values in pandas is important. Missing data is usually an empty string or a sentinel value such as NA, -1.#IND or NULL.
The na_values kwarg can take a list or set of strings to treat as missing values:
result = pd.read_csv('examples/ex5.csv', na_values=['NULL'])
Different NA
sentinels can be specified for each column in a dict:
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
pd.read_csv('examples/ex5.csv', na_values=sentinels)
Some Other Arguments#
path - location or URL of the object
sep or delimiter - character sequence or regular expression used to separate fields in each row
comment - character(s) used to split comments off the end of lines
parse_dates - attempt to parse data to datetime; can be given a list of columns to try. False by default. (See the sketch below.)
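As a rough sketch of how a few of these combine (the file messy.csv and its date column are hypothetical):
# skip '#' comments, split fields on semicolons, parse the 'date' column to datetime
df = pd.read_csv('messy.csv', sep=';', comment='#', parse_dates=['date'])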
Reading Text Files in Pieces#
For large files you may want to work on small chunks of a file at a time, and not have the entire file in memory.
Check the book
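The gist, as a minimal sketch: passing chunksize makes read_csv return an iterator of DataFrame chunks (ex6.csv and its key column are the book's example data):
import pandas as pd

chunker = pd.read_csv('ex6.csv', chunksize=1000)  # yields 1000-row DataFrames

tot = pd.Series([], dtype=float)
for piece in chunker:
    # aggregate the value counts of the 'key' column chunk by chunk
    tot = tot.add(piece['key'].value_counts(), fill_value=0)

tot = tot.sort_values(ascending=False)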
Writing Data out to Text Format#
Data can be exported to a file using pandas' to_csv
method.
In [28]: df.to_csv('out.csv')
In [29]: !cat out.csv
,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo
You can use a different separator and write to stdout
with:
In [30]: import sys
In [31]: df.to_csv(sys.stdout, sep='|')
|a|b|c|d|message
0|1|2|3|4|hello
1|5|6|7|8|world
2|9|10|11|12|foo
Missing values are written as empty strings; to use something else, pass na_rep='NULL'
Both row and column labels are written by default; this can be disabled with:
index=False
and header=False
You can also specify which columns you want to write:
In [32]: df.to_csv('out.csv', columns=['a', 'b', 'message'])
In [33]: !cat out.csv
,a,b,message
0,1,2,hello
1,5,6,world
2,9,10,foo
Manually working with delimited formats#
Sometimes you receive a csv with malformed lines
In [35]: !cat ex7.csv
"a","b","c"
"1","2","3"
"1","2","3","4"
For any file with a single character delimiter you can use pythons’s csv
module:
In [36]: import csv
In [37]: f = open('ex7.csv')
In [38]: reader = csv.reader(f)
In [39]: for line in reader:
...: print(line)
['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3', '4']
From there you can do the wrangling you need
Defining a new csv format with a different delimiter, string quoting or line termination convention is done by defining a subclass of csv.Dialect
class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL
reader = csv.reader(f, dialect=my_dialect)
Individual parameters can be given as keywords:
reader = csv.reader(f, delimiter='|')
Files with fixed multi-character delimiters can't be handled by the csv module; use the string split method or re.split instead.
To write csv files:
with open('mydata.csv', 'w') as f:
writer = csv.writer(f, dialect=my_dialect)
writer.writerow(('one', 'two', 'three'))
writer.writerow(('1', '2', '3'))
writer.writerow(('4', '5', '6'))
writer.writerow(('7', '8', '9'))
JSON Data#
JSON (Javascript Object Notation) has become a standard for sending data over HTTP.
It is nearly valid python code except for its null value null
and nuances like not allowing trailing commas.
All the keys in a JSON object must be strings.
We will make use of the built-in json
module.
To convert a JSON string to Python form, use json.loads
:
obj = """
{"name": "Wes",
"places_lived": ["United States", "Spain", "Germany"],
"pet": null,
"siblings": [{"name": "Scott", "age": 29, "pets": ["Zeus", "Zuko"]},
{"name": "Katie", "age": 38,
"pets": ["Sixes", "Stache", "Cisco"]}]
}
"""
In [45]: import json
In [46]: result = json.loads(obj)
In [47]: result
Out[47]:
{'name': 'Wes',
'places_lived': ['United States', 'Spain', 'Germany'],
'pet': None,
'siblings': [{'name': 'Scott', 'age': 29, 'pets': ['Zeus', 'Zuko']},
{'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]}
json.dumps
converts a Python object back to JSON. (Note: In [48] below dumps the original string obj rather than the parsed result, so the output is just that string re-encoded; json.dumps(result) is usually what you want.)
In [48]: as_json = json.dumps(obj)
In [49]: as_json
Out[49]: '"\\n{\\"name\\": \\"Wes\\",\\n \\"places_lived\\": [\\"United States\\", \\"Spain\\", \\"Germany\\"],\\n \\"pet\\": null,\\n \\"siblings\\": [{\\"name\\": \\"Scott\\", \\"age\\": 29, \\"pets\\": [\\"Zeus\\", \\"Zuko\\"]},\\n {\\"name\\": \\"Katie\\", \\"age\\": 38,\\n \\"pets\\": [\\"Sixes\\", \\"Stache\\", \\"Cisco\\"]}]\\n}\\n"'
You can pass a list of JSON objects to the DataFrame constructor:
In [51]: siblings = pd.DataFrame(result['siblings'], columns=['name', 'age'])
In [52]: siblings
Out[52]:
name age
0 Scott 29
1 Katie 38
pd.read_json
automatically converts json datasets into a Series or DataFrame.
data = pd.read_json('examples/example.json')
To export data from pandas to json use to_json
:
data.to_json()
XML and HTML: Web Scraping#
Python has a few libraries for reading and writing HTML and XML data: lxml
, Beautiful Soup and html5lib
Pandas has the built-in read_html
which automatically parses tables out of html files into a DataFrame.
By default it finds and parses data in <table>
tags.
In [3]: tables = pd.read_html('fdic_failed_bank_list.html')
Returns a list
In [9]: type(tables)
Out[9]: list
The data is held in elements of that list:
In [14]: failures = tables[0]
In [15]: failures.head()
Out[15]:
Bank Name City ST ... Acquiring Institution Closing Date Updated Date
0 Allied Bank Mulberry AR ... Today's Bank September 23, 2016 November 17, 2016
1 The Woodbury Banking Company Woodbury GA ... United Bank August 19, 2016 November 17, 2016
2 First CornerStone Bank King of Prussia PA ... First-Citizens Bank & Trust Company May 6, 2016 September 6, 2016
3 Trust Company Bank Memphis TN ... The Bank of Fayette County April 29, 2016 September 6, 2016
4 North Milwaukee State Bank Milwaukee WI ... First-Citizens Bank & Trust Company March 11, 2016 June 16, 2016
[5 rows x 7 columns]
We can compute the number of bank failures per year:
In [16]: close_timestamps = pd.to_datetime(failures['Closing Date'])
In [18]: close_timestamps.dt.year.value_counts()
Out[18]:
2010 157
2009 140
2011 92
2012 51
2008 25
2013 24
2014 18
2002 11
2015 8
2016 5
2004 4
2001 4
2007 3
2003 3
2000 2
Name: Closing Date, dtype: int64
Parsing XML with lxml.objectify#
Read the book…
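For the flavour of it, a minimal hypothetical sketch (the XML tags here are made up; the book works through a real-world XML file instead):
import pandas as pd
from io import StringIO
from lxml import objectify

xml = StringIO("""<root>
  <rec><province>Gauteng</province><population>14</population></rec>
  <rec><province>Limpopo</province><population>5</population></rec>
</root>""")

root = objectify.parse(xml).getroot()
# iterating over root.rec yields every <rec> element
rows = [{'province': str(r.province), 'population': int(r.population)} for r in root.rec]
df = pd.DataFrame(rows)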
Binary Data Formats#
One of the easiest ways to store data (serialisation) is using Python's pickle
.
It is efficient and stores data in binary.
In [19]: frame = pd.read_csv('ex1.csv')
In [21]: frame
Out[21]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
In [23]: frame.to_pickle('frame_pickle')
Any pickled object can be read by pickle
directly or with pd.read_pickle
:
Pickle is recommended only for short-term storage, as it is hard to guarantee that the format will be stable over time.
Other formats like HDF5 and MessagePack, as well as Feather, can be checked out in the book; a small HDF5 sketch follows.
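A rough sketch of the HDF5 route, assuming the PyTables package is installed (pip install tables) and reusing the frame from above:
frame.to_hdf('mydata.h5', 'obj1', format='table')     # 'table' format supports queries
pd.read_hdf('mydata.h5', 'obj1', where=['index < 2']) # read back a filtered subset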
Reading Microsoft Excel Files#
Excel 2003 and higher. You use it with pd.ExcelFile
and pd.read_excel
.
Internally these tools use the xlrd
and openpyxl
packages. You will probably need to install them separately.
First create an ExcelFile object:
In [2]: xlsx = pd.ExcelFile('ex1.xlsx')
In [3]: xlsx
Out[3]: <pandas.io.excel.ExcelFile at 0x119658dd8>
Then read it with:
In [5]: pd.read_excel(xlsx)
Out[5]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
If you are reading a file with multiple sheets it is faster to create the ExcelFile object once; otherwise you can simply pass the file path to
pd.read_excel
To write to an excel file you must first create an ExcelWriter
, then write data to it using to_excel
:
In [9]: writer = pd.ExcelWriter('ex2.xlsx')
In [10]: frame.to_excel(writer)
In [11]: writer.save()
Interacting with Web APIs#
A good package to familiarise yourself with is requests
We can pass data directly into a dataframe and extract fields of interest.
In [14]: import requests
In [15]: url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
In [16]: resp = requests.get(url)
In [17]: resp
Out[17]: <Response [200]>
In [18]: data = resp.json()
In [19]: data[0]['title']
Out[19]: 'ENH: Implement IntervalIndex.is_overlapping'
In [20]: issues = pd.DataFrame(data, columns=['number', 'title', 'labels', 'state'])
In [21]: issues
Out[21]:
number title labels state
0 23327 ENH: Implement IntervalIndex.is_overlapping [{'id': 76812, 'node_id': 'MDU6TGFiZWw3NjgxMg=... open
1 23326 Data type of new assigned DataFrame column [{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ=... open
2 23325 AttributeError: module 'pandas' has no attribu... [] open
3 23324 Rounding valid timestamps near daylight saving... [{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ=... open
4 23323 TST: Update sparse data generation [{'id': 49182326, 'node_id': 'MDU6TGFiZWw0OTE4... open
5 23321 Support for partition_cols in to_parquet [{'id': 685114413, 'node_id': 'MDU6TGFiZWw2ODU... open
6 23320 BUG/TST: timedelta-like with Index/Series/Data... [] open
Interacting with Databases#
In a business setting most data will be stored in a relational database: SQL Server, PostgreSQL and MySQL.
It is much quicker to work with SQLAlchemy than with native database drivers directly.
pip install SQLAlchemy
pip install mysqlclient
Typically you would connect to a database with a url like:
dialect+driver://username:password@host:port/database
E.g. (assuming from sqlalchemy import create_engine):
engine = create_engine('postgresql://scott:tiger@localhost/mydatabase')
engine = create_engine('mysql://scott:tiger@localhost/foo')
engine = create_engine('mysql+mysqldb://scott:tiger@localhost/foo')
engine = create_engine('oracle://scott:tiger@127.0.0.1:1521/sidname')
engine = create_engine('mssql+pyodbc://scott:tiger@mydsn')
In [5]: queue_frame = pd.read_sql('select * from chart_queue', db)
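A self-contained sketch of the round trip, assuming SQLAlchemy is installed (an in-memory SQLite database is used so no server or credentials are needed; the demo table is made up):
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///:memory:')

# write a small table, then read it back with a query
pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']}).to_sql('demo', engine, index=False)
df = pd.read_sql('select * from demo', engine)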
Chapter 7: Data Cleaning and Preparation#
A significant amount of your time in data analysis and modelling is spent on data preparation: loading, cleaning, transforming and rearranging.
Apparently taking up to 80% of your coding time
Handling Missing Data#
All descriptive statistics exclude missing data by default.
The missing sentinel value for floating points is NaN
which is imperfect but functional.
In [1]: import pandas as pd
In [2]: import numpy as np
In [5]: string_data = pd.Series(['Aardvark', 'Artichoke', np.nan, 'Avocado'])
In [6]: string_data.isnull()
Out[6]:
0 False
1 False
2 True
3 False
dtype: bool
In pandas, we've adopted a convention used in the R language by referring to missing data as NA
(Not Available).
NA
may mean data that does not exist or data that exists but was not observed (due to data collection problems).
The built-in python None
value is also treated as NA
in object arrays
In [7]: string_data[0] = None
In [8]: string_data
Out[8]:
0 None
1 Artichoke
2 NaN
3 Avocado
dtype: object
In [10]: string_data.isnull()
Out[10]:
0 True
1 False
2 True
3 False
dtype: bool
Filtering out Missing Data#
You can use isnull()
and boolean indexing but dropna
can also be helpful.
In [11]: from numpy import nan as NA
In [12]: data = pd.Series([1, NA, 3.5, 7])
In [13]: data.dropna()
Out[13]:
0 1.0
2 3.5
3 7.0
dtype: float64
which is equivalent to:
In [15]: data[data.notnull()]
Out[15]:
0 1.0
2 3.5
3 7.0
dtype: float64
On a dataframe dropna()
drops all rows containing a missing value
Passing how='all'
will only drop rows where all the values are missing
data.dropna(how='all')
To drop columns, do the same thing but where axis=1
data.dropna(axis=1, how='all')
Suppose you only want to keep rows containing a certain number of observations. For that you would use thresh
:
In [16]: df = pd.DataFrame(np.random.randn(7, 3))
In [17]: df
Out[17]:
0 1 2
0 0.871250 -1.434184 1.323816
1 -2.049019 1.685563 -0.028869
2 -0.284594 -1.658746 -0.893838
3 -0.306614 -0.045451 0.072289
4 1.253934 0.242588 0.802684
5 1.301810 -1.869278 0.492717
6 0.307062 -0.273495 -0.468776
In [18]: df.iloc[:4, 1] = NA
In [19]: df
Out[19]:
0 1 2
0 0.871250 NaN 1.323816
1 -2.049019 NaN -0.028869
2 -0.284594 NaN -0.893838
3 -0.306614 NaN 0.072289
4 1.253934 0.242588 0.802684
5 1.301810 -1.869278 0.492717
6 0.307062 -0.273495 -0.468776
In [20]: df.iloc[:2, 2] = NA
In [21]: df
Out[21]:
0 1 2
0 0.871250 NaN NaN
1 -2.049019 NaN NaN
2 -0.284594 NaN -0.893838
3 -0.306614 NaN 0.072289
4 1.253934 0.242588 0.802684
5 1.301810 -1.869278 0.492717
6 0.307062 -0.273495 -0.468776
In [22]: df.dropna(thresh=3)
Out[22]:
0 1 2
4 1.253934 0.242588 0.802684
5 1.301810 -1.869278 0.492717
6 0.307062 -0.273495 -0.468776
In [23]: df.dropna(thresh=2)
Out[23]:
0 1 2
2 -0.284594 NaN -0.893838
3 -0.306614 NaN 0.072289
4 1.253934 0.242588 0.802684
5 1.301810 -1.869278 0.492717
6 0.307062 -0.273495 -0.468776
Filling in Missing Data#
Calling fillna
with a constant replaces missing values with that value:
In [24]: df.fillna(0)
Out[24]:
0 1 2
0 0.871250 0.000000 0.000000
1 -2.049019 0.000000 0.000000
2 -0.284594 0.000000 -0.893838
3 -0.306614 0.000000 0.072289
4 1.253934 0.242588 0.802684
5 1.301810 -1.869278 0.492717
6 0.307062 -0.273495 -0.468776
You can use a different fill value for a different column with a dict:
In [28]: df.fillna({0: -1, 1: 0.5, 2: 0})
Out[28]:
0 1 2
0 0.871250 0.500000 0.000000
1 -2.049019 0.500000 0.000000
2 -0.284594 0.500000 -0.893838
3 -0.306614 0.500000 0.072289
4 1.253934 0.242588 0.802684
5 1.301810 -1.869278 0.492717
6 0.307062 -0.273495 -0.468776
You can change it inplace with inplace=True
instead of returning a new object
The same indexing methods available can be used with fillna
:
In [30]: df = pd.DataFrame(np.random.randn(6, 3))
In [31]: df.iloc[2:, 1] = NA; df.iloc[4:, 2] = NA
In [32]: df
Out[32]:
0 1 2
0 0.427217 -0.321663 0.806618
1 -0.918541 1.607939 -0.374820
2 -0.720018 NaN -0.989154
3 0.740126 NaN 0.176258
4 -0.920978 NaN NaN
5 0.449166 NaN NaN
In [33]: df.fillna(method='ffill')
Out[33]:
0 1 2
0 0.427217 -0.321663 0.806618
1 -0.918541 1.607939 -0.374820
2 -0.720018 1.607939 -0.989154
3 0.740126 1.607939 0.176258
4 -0.920978 1.607939 0.176258
5 0.449166 1.607939 0.176258
and limit:
In [34]: df.fillna(method='ffill', limit=2)
Out[34]:
0 1 2
0 0.427217 -0.321663 0.806618
1 -0.918541 1.607939 -0.374820
2 -0.720018 1.607939 -0.989154
3 0.740126 1.607939 0.176258
4 -0.920978 NaN 0.176258
5 0.449166 NaN 0.176258
You could also fill data with the mean:
In [35]: data = pd.Series([1.0, NA, 3.5, NA, 7])
In [36]: data.fillna(data.mean())
Out[36]:
0 1.000000
1 3.833333
2 3.500000
3 3.833333
4 7.000000
dtype: float64
Data Transformation#
Removing Duplicates#
The duplicated()
method returns a boolean series of whether each row is duplicated or not:
df.duplicated()
To get a dataframe with duplicates dropped:
df.drop_duplicates()
Both methods consider all of the columns by default.
Suppose you want to filter only on a certain column:
data.drop_duplicates(['k1'])
The first occurrence is kept by default; if you want to keep the last, use the kwarg keep='last'
Transforming Data Using a Function or Mapping#
In [39]: data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'pastrami', 'corned beef', 'Bacon', 'Pastrami'],
...: 'ounces': [4, 3, 12, 6, 7.5, 8, 3,]})
In [40]: data
Out[40]:
food ounces
0 bacon 4.0
1 pulled pork 3.0
2 bacon 12.0
3 pastrami 6.0
4 corned beef 7.5
5 Bacon 8.0
6 Pastrami 3.0
Suppose you wanted to add a column for the animal a type of meat comes from:
meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
}
The map
method on a series accepts a function or dict-like mapping, but there is a problem: some meats are capitalised.
In [42]: data['animal'] = data['food'].str.lower().map(meat_to_animal)
In [43]: data
Out[43]:
food ounces animal
0 bacon 4.0 pig
1 pulled pork 3.0 pig
2 bacon 12.0 pig
3 pastrami 6.0 cow
4 corned beef 7.5 cow
5 Bacon 8.0 pig
6 Pastrami 3.0 cow
We could have also used a function:
In [45]: data['animal'] = data['food'].map(lambda x: meat_to_animal[x.lower()])
Replacing Values#
Take this data:
In [48]: data = pd.Series([1., -999., 2., -999., -1000., 3.])
In [49]: data
Out[49]:
0 1.0
1 -999.0
2 2.0
3 -999.0
4 -1000.0
5 3.0
dtype: float64
Say that -999
is a sentinel for missing data. To replace it with a value pandas understands as missing, we can use replace
:
In [50]: data.replace(-999, np.NAN)
Out[50]:
0 1.0
1 NaN
2 2.0
3 NaN
4 -1000.0
5 3.0
dtype: float64
You can replace multiple values:
In [51]: data.replace([-999, -1000], np.NAN)
Out[51]:
0 1.0
1 NaN
2 2.0
3 NaN
4 NaN
5 3.0
dtype: float64
Multiple specific replacements:
In [52]: data.replace({-999: 0 , -1000: np.NAN})
Out[52]:
0 1.0
1 0.0
2 2.0
3 0.0
4 NaN
5 3.0
dtype: float64
Renaming axis#
transform = lambda x: x[:4].upper()
data.index.map(transform)
or
data.rename(index=str.title, columns=str.upper)
data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
Discretisation and Binning#
Continuous data is often separated into bins
Say you have ages:
In [53]: ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
In [54]: ages
Out[54]: [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
In [55]: bins = [18, 25, 35, 60, 100]
You then use the cut
function
In [56]: cats = pd.cut(ages, bins)
In [57]: cats
Out[57]:
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
The object returned is a special Categorical
object
In [58]: cats.codes
Out[58]: array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
In [59]: cats.categories
Out[59]:
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]]
closed='right',
dtype='interval[int64]')
In [60]: pd.value_counts(cats)
Out[60]:
(18, 25] 5
(35, 60] 3
(25, 35] 3
(60, 100] 1
dtype: int64
Consistent with mathematical notation: a parenthesis (
means that side is open while a square bracket means closed ]
(inclusive).
You can set which side is open or closed with right=False
In [61]: cats = pd.cut(ages, bins, right=False)
In [62]: cats
Out[62]:
[[18, 25), [18, 25), [25, 35), [25, 35), [18, 25), ..., [25, 35), [60, 100), [35, 60), [35, 60), [25, 35)]
Length: 12
Categories (4, interval[int64]): [[18, 25) < [25, 35) < [35, 60) < [60, 100)]
If you give cut an integer instead of explicit bin edges it will compute equal-length bins based on the minimum and maximum values in the data.
In [64]: data = np.random.randn(20)
In [65]: data
Out[65]:
array([ 0.31992007, -1.37057101, -0.78414082, 1.41844717, -0.29812797,
0.9521121 , 0.82879956, 2.77438651, 0.75959752, -1.31259677,
0.51138379, -0.32840281, 1.81732981, -0.52482447, 0.72973532,
0.49750981, -0.56058252, -0.25608326, 0.28111671, 2.05995848])
In [66]: pd.cut(data, 4, precision=2)
Out[66]:
[(-0.33, 0.7], (-1.37, -0.33], (-1.37, -0.33], (0.7, 1.74], (-0.33, 0.7], ..., (-0.33, 0.7], (-1.37, -0.33], (-0.33, 0.7], (-0.33, 0.7], (1.74, 2.77]]
Length: 20
Categories (4, interval[float64]): [(-1.37, -0.33] < (-0.33, 0.7] < (0.7, 1.74] < (1.74, 2.77]]
A closely related function, qcut
, bins the data based on sample quantiles.
Depending on the distribution of the data, using cut
will not usually result in each bin having the same number of data points.
qcut
uses sample quantiles instead, so by definition you get roughly equal-sized bins.
In [67]: data = np.random.randn(1000)
In [68]: cats = pd.qcut(data, 4)
In [69]: cats
Out[69]:
[(-0.647, 0.0519], (0.719, 3.005], (-0.647, 0.0519], (-0.647, 0.0519], (-0.647, 0.0519], ..., (0.719, 3.005], (0.0519, 0.719], (0.719, 3.005], (-0.647, 0.0519], (0.719, 3.005]]
Length: 1000
Categories (4, interval[float64]): [(-3.189, -0.647] < (-0.647, 0.0519] < (0.0519, 0.719] <
(0.719, 3.005]]
In [70]: pd.value_counts(cats)
Out[70]:
(0.719, 3.005] 250
(0.0519, 0.719] 250
(-0.647, 0.0519] 250
(-3.189, -0.647] 250
dtype: int64
You can pass your own quantiles (values from 0 to 1):
In [71]: cats = pd.qcut(data, [0, 0.1, 0.5, 0.9, 1])
In [72]: cats.value_counts()
Out[72]:
(-3.189, -1.305] 100
(-1.305, 0.0519] 400
(0.0519, 1.257] 400
(1.257, 3.005] 100
dtype: int64
Detecting and Filtering Outliers#
Filtering or transforming outliers is largely a matter of applying array operations.
In [73]: data = pd.DataFrame(np.random.randn(1000, 4))
In [74]: data.describe()
Out[74]:
0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean -0.007788 -0.010025 0.029215 0.011847
std 0.983244 1.028296 1.004748 1.009238
min -3.015770 -3.204882 -3.727639 -2.709379
25% -0.650793 -0.670933 -0.699211 -0.655731
50% -0.028469 -0.024508 0.036163 -0.006192
75% 0.675486 0.668831 0.762269 0.694783
max 3.156981 3.034008 2.976704 3.516318
Suppose you want to find the values in a column exceeding 3 in magnitude:
In [75]: col = data[3]
In [76]: col[np.abs(col) > 3]
Out[76]:
218 3.128675
509 3.011054
699 3.516318
Name: 3, dtype: float64
To select all rows having a value exceeding 3 in magnitude, use any
In [77]: data[(np.abs(data) > 3).any(1)]
Out[77]:
0 1 2 3
187 -3.015770 -0.028759 -0.087647 1.514852
218 0.677857 0.448257 0.334189 3.128675
313 3.156981 -0.842627 -0.762875 0.803619
487 -0.620277 -3.094748 0.769148 -0.081738
509 0.346854 -1.075202 0.175269 3.011054
640 0.538403 -3.204882 -2.088300 1.448109
697 -0.388438 1.511103 -3.727639 0.479053
699 0.580033 1.228524 1.146316 3.516318
953 1.060103 3.034008 0.980496 0.985401
962 -0.683237 -3.093548 0.467704 0.111335
You can cap the values of the outliers, e.g.:
data[np.abs(data) > 3] = np.sign(data) * 3
np.sign
returns -1 or 1 based on the sign of the values
Permutation and Random Sampling#
Permuting (randomly reordering) a series or the rows in a dataframe is easy to do using the numpy.random.permutation
function.
You can create a sampler and then use df.take, which selects values by positional index:
In [79]: df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))
In [80]: df
Out[80]:
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
In [81]: sampler = np.random.permutation(5)
In [82]: sampler
Out[82]: array([3, 0, 4, 1, 2])
In [83]: df.take(sampler)
Out[83]:
0 1 2 3
3 12 13 14 15
0 0 1 2 3
4 16 17 18 19
1 4 5 6 7
2 8 9 10 11
To get a random subset without replacement:
In [85]: df.take(np.random.permutation(len(df))[:3])
Out[85]:
0 1 2 3
1 4 5 6 7
0 0 1 2 3
2 8 9 10 11
To generate a sample with replacement:
In [86]: bag = np.array([5, 7, -1, 6, 4])
In [87]: sampler = np.random.randint(0, len(bag), size=10)
In [88]: sampler
Out[88]: array([0, 0, 0, 0, 2, 1, 0, 3, 1, 2])
In [89]: draws = bag.take(sampler)
In [90]: draws
Out[90]: array([ 5, 5, 5, 5, -1, 7, 5, 6, 7, -1])
In [91]: bag
Out[91]: array([ 5, 7, -1, 6, 4])
Computing Indicator / Dummy Variables#
Another type of transformation for statistical modelling and machine learning is converting a categorical variable into a “dummy” or “indicator” matrix.
More in the book about this
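The core tool is pd.get_dummies; a minimal sketch along the lines of the book's example:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
pd.get_dummies(df['key'])                 # one 0/1 indicator column per distinct key
pd.get_dummies(df['key'], prefix='key')   # columns named key_a, key_b, key_c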
String Manipulation#
Most text handling in python is made easy with the built in string object and functions.
String Object Methods#
Split a comma-separated string:
In [92]: values = 'a,b, guido'
In [93]: values.split(',')
Out[93]: ['a', 'b', ' guido']
It is often combined with strip()
to trim whitespace and linebreaks
In [96]: pieces = [x.strip() for x in values.split(',')]
In [97]: pieces
Out[97]: ['a', 'b', 'guido']
These values can be concatenated
In [98]: first, second, third = pieces
In [99]: first + '::' + second + '::' + third
Out[99]: 'a::b::guido'
A more pythonic and faster way is using str.join()
:
In [101]: '::'.join(pieces)
Out[101]: 'a::b::guido'
This always feels like it is called in reverse
Detect a substring with in
; you can also use find
and index
:
In [104]: 'guido' in values
Out[104]: True
In [105]: values.index(',')
Out[105]: 1
In [108]: values.find(':')
Out[108]: -1
index
raises an exception if the string isn't found, whereas find returns -1
To count the occurrences of a substring, use count
:
In [109]: values.count(',')
Out[109]: 2
Substitution and removal of certain characters can be done with str.replace()
In [110]: values.replace(',', '::')
Out[110]: 'a::b:: guido'
Remove with an empty string:
In [111]: values.replace(',', '')
Out[111]: 'ab guido'
Regular Expressions#
A way to search or match string patterns in text. Python’s re
module is responsible for applying regular expressions to strings.
3 categories:
- pattern matching
- substitution
- splitting
Splitting a string by variable whitespace:
In [112]: import re
In [113]: text = 'Foo Bar\t baz \tqux'
In [115]: re.split('\s+', text)
Out[115]: ['Foo', 'Bar', 'baz', 'qux']
The pattern \s+ splits on one or more whitespace characters
The regular expression is first compiled and then applied to the text. It is much more efficient to compile a regular expression once, creating a reusable object, with re.compile()
In [117]: regex = re.compile('\s+')
In [118]: regex.split(text)
Out[118]: ['Foo', 'Bar', 'baz', 'qux']
If instead you want a list of all patterns matching the regex:
In [119]: regex.findall(text)
Out[119]: [' ', '\t ', ' \t']
To avoid verbose escaping with the \ character, it is better to use raw string literals, e.g. r'C:\x' instead of 'C:\\x'
match
and search
are closely related to findall
:
search
only returns the first matchmatch
only matches the beginning of a string
In [120]: text = """Dave dave@google.com
In [120]: text = """Dave dave@google.com
...: Steve steve@gmail.com
...: Rob rob@gmail.com
...: Ryan ryan@yahoo.com
...: """
In [121]: pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
In [122]: regex = re.compile(pattern, flags=re.IGNORECASE)
In [123]: regex.findall(text)
Out[123]: ['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']
re.IGNORECASE
makes the search case insensitive
search
returns a special match object for the first email found. The match object only tells us the start and end positions of the match within the string.
In [124]: m = regex.search(text)
In [125]: m
Out[125]: <_sre.SRE_Match object; span=(5, 20), match='dave@google.com'>
In [126]: text[m.start():m.end()]
Out[126]: 'dave@google.com'
match
returns None
as it will only match if the pattern occurs at the start of the string
sub
will return a new string with matches replaced:
In [127]: print(regex.sub('REDACTED', text))
Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED
Suppose you want to group parts of the match; for that you use parentheses ()
:
In [130]: pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
In [131]: regex = re.compile(pattern, flags=re.IGNORECASE)
In [132]: m = regex.match('abc@example.com')
In [133]: m.groups()
Out[133]: ('abc', 'example', 'com')
findall
will return tuples:
In [134]: regex.findall(text)
Out[134]:
[('dave', 'google', 'com'),
('steve', 'gmail', 'com'),
('rob', 'gmail', 'com'),
('ryan', 'yahoo', 'com')]
sub
has access to special symbols \1
and \2, which correspond to the matched groups
In [136]: regex.sub(r'Username: \1 Domain: \2 Suffix: \3', text)
Out[136]: 'Dave Username: dave Domain: google Suffix: com\n Steve Username: steve Domain: gmail Suffix: com\n Rob Username: rob Domain: gmail Suffix: com\n Ryan Username: ryan Domain: yahoo Suffix: com\n '
You can also name matched groups using the notation:
r'(?P<name>[A-Z]+)'
Example:
In [139]: pattern = r'(?P<username>[A-Z0-9._%+-]+)@(?P<domain>[A-Z0-9.-]+)\.(?P<suffix>[A-Z]{2,4})'
In [140]: regex = re.compile(pattern, flags=re.IGNORECASE|re.VERBOSE)
In [141]: m = regex.match('abc@example.com')
In [142]: m.groupdict()
Out[142]: {'username': 'abc', 'domain': 'example', 'suffix': 'com'}
Vectorised String Functions#
In [144]: data = {'Dave': 'dave@gmail.com', 'Pat': 'pat@yahoo.com', 'elaine': 'elaine@gmail.com', 'petri': np.NaN}
In [145]: data = pd.Series(data)
In [146]: data
Out[146]:
Dave dave@gmail.com
Pat pat@yahoo.com
elaine elaine@gmail.com
petri NaN
dtype: object
In [147]: data.isnull()
Out[147]:
Dave False
Pat False
elaine False
petri True
dtype: bool
You can use string functions via .str
:
In [148]: data.str.upper()
Out[148]:
Dave DAVE@GMAIL.COM
Pat PAT@YAHOO.COM
elaine ELAINE@GMAIL.COM
petri NaN
dtype: object
In [149]: data.str.contains('gmail')
Out[149]:
Dave True
Pat False
elaine True
petri NaN
dtype: object
Regular expressions can be applied:
In [150]: data.str.findall(pattern, flags=re.IGNORECASE)
Out[150]:
Dave [(dave, gmail, com)]
Pat [(pat, yahoo, com)]
elaine [(elaine, gmail, com)]
petri NaN
dtype: object
match returns a boolean per string for whether it matches the pattern:
In [151]: matches = data.str.match(pattern, flags=re.IGNORECASE)
In [152]: matches
Out[152]:
Dave True
Pat True
elaine True
petri NaN
dtype: object
str.get(1) retrieves element 1 from each list-like value; matches holds booleans here, so there is nothing to index into and the result is all NaN:
In [154]: matches.str.get(1)
Out[154]:
Dave NaN
Pat NaN
elaine NaN
petri NaN
dtype: float64
You can also slice strings:
In [155]: data.str[:5]
Out[155]:
Dave dave@
Pat pat@y
elaine elain
petri NaN
dtype: object
Chapter 8: Data Wrangling, Join, Combine and Reshape#
In many applications data is spread across different files or databases and arranged in a form that is difficult to analyse.
Hierarchical Indexing#
Allows you to have multiple index levels on an axis.
It allows you to work with higher dimension data in a lower dimension form.
In [163]: data = pd.Series(np.random.randn(9), index = [['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd' ], [1, 2, 3, 1, 3, 1, 2, 2, 3]])
In [164]: data
Out[164]:
a 1 -0.766229
2 0.041728
3 0.114579
b 1 -0.726996
3 -1.168821
c 1 0.851205
2 -0.131947
d 2 -0.447351
3 0.449313
dtype: float64
This is a series with MultiIndex
as its index.
In [165]: data.index
Out[165]:
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 2, 0, 1, 1, 2]])
Partial indexing lets you select subsets:
In [166]: data['b']
Out[166]:
1 -0.726996
3 -1.168821
dtype: float64
In [168]: data[['b', 'd']]
Out[168]:
b 1 -0.726996
3 -1.168821
d 2 -0.447351
3 0.449313
dtype: float64
In [169]: data['b':'c']
Out[169]:
b 1 -0.726996
3 -1.168821
c 1 0.851205
2 -0.131947
dtype: float64
Select the second element in the inner level from each outer level:
In [170]: data.loc[:, 2]
Out[170]:
a 0.041728
c -0.131947
d -0.447351
dtype: float64
Hierarchical indexing plays an important role in reshaping data and group-based operations like forming a pivot table.
This data could be rearranged into a dataframe using the unstack()
method:
In [171]: data.unstack()
Out[171]:
1 2 3
a -0.766229 0.041728 0.114579
b -0.726996 NaN -1.168821
c 0.851205 -0.131947 NaN
d NaN -0.447351 0.449313
The inverse operation is stack()
:
In [172]: data.unstack().stack()
Out[172]:
a 1 -0.766229
2 0.041728
3 0.114579
b 1 -0.726996
3 -1.168821
c 1 0.851205
2 -0.131947
d 2 -0.447351
3 0.449313
dtype: float64
With a dataframe, each axis can have a hierarchical index:
In [173]: frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
...: index=[['a', 'a', 'b', 'b',],[1, 2, 1, 2]],
...: columns=[['Ohio', 'Ohio', 'Colarado'], ['Green', 'Red', 'Green']])
In [174]: frame
Out[174]:
Ohio Colarado
Green Red Green
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
Hierarchical levels can have names. Don't confuse index names with axis labels:
In [176]: frame.index.names = ['key1', 'key2']
In [177]: frame.columns.names = ['state', 'colour']
In [178]: frame
Out[178]:
state Ohio Colarado
colour Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
With partial indexing:
In [179]: frame['Ohio']
Out[179]:
colour Green Red
key1 key2
a 1 0 1
2 3 4
b 1 6 7
2 9 10
Reordering and Sorting Levels#
swaplevel
takes two level numbers or names and returns a new object with the levels interchanged.
In [180]: frame.swaplevel('key1', 'key2')
Out[180]:
state Ohio Colarado
colour Green Red Green
key2 key1
1 a 0 1 2
2 a 3 4 5
1 b 6 7 8
2 b 9 10 11
sort_index
can sort the values in a single level.
In [181]: frame.sort_index(level=1)
Out[181]:
state Ohio Colarado
colour Green Red Green
key1 key2
a 1 0 1 2
b 1 6 7 8
a 2 3 4 5
b 2 9 10 11
Data selection performance is much better on hierarchically indexed objects if the index is lexicographically sorted starting from the outermost level, i.e. the result of sort_index()
or sort_index(level=0)
Summary Statistics by Level#
Many descriptive statistics have a level
option to apply on a specific axis.
In [183]: frame
Out[183]:
state Ohio Colarado
colour Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
In [185]: frame.sum(level='key2')
Out[185]:
state Ohio Colarado
colour Green Red Green
key2
1 6 8 10
2 12 14 16
In [189]: frame.sum(level='colour', axis=1)
Out[189]:
colour Green Red
key1 key2
a 1 2 1
2 8 4
b 1 14 7
2 20 10
It uses
groupby
Indexing with a DataFrame's columns#
In [190]: frame = pd.DataFrame({'a': range(7), 'b': range(7, 0 , -1), 'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two',], 'd': [0, 1, 2, 0, 1,
...: 2, 3]})
In [191]: frame
Out[191]:
a b c d
0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
3 3 4 two 0
4 4 3 two 1
5 5 2 two 2
6 6 1 two 3
set_index
will return a new dataframe that uses one or more of its columns as the index:
In [192]: frame2 = frame.set_index(['c', 'd'])
In [193]: frame2
Out[193]:
a b
c d
one 0 0 7
1 1 6
2 2 5
two 0 3 4
1 4 3
2 5 2
3 6 1
By default the columns are removed from the dataframe, though you can leave them in:
In [194]: frame.set_index(['c', 'd'], drop=False)
Out[194]:
a b c d
c d
one 0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
two 0 3 4 two 0
1 4 3 two 1
2 5 2 two 2
3 6 1 two 3
To undo this hierarchical indexing use reset_index()
:
In [196]: frame2.reset_index()
Out[196]:
c d a b
0 one 0 0 7
1 one 1 1 6
2 one 2 2 5
3 two 0 3 4
4 two 1 4 3
5 two 2 5 2
6 two 3 6 1
Integer Indexes#
There are subtle differences in integer indexing between python and numpy/pandas:
In [197]: ser = pd.Series(np.arange(3.))
In [198]: ser[-1]
gives an error:
KeyError: -1
Whereas with a non-integer index, there is no potential for ambiguity:
In [200]: ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])
In [201]: ser2
Out[201]:
a 0.0
b 1.0
c 2.0
dtype: float64
In [203]: ser2[-1]
Out[203]: 2.0
With an axis containing integers, data indexing will always be label-oriented.
Use loc
for label selection and iloc
for integer selection
In [204]: ser[:1]
Out[204]:
0 0.0
dtype: float64
In [205]: ser.iloc[:1]
Out[205]:
0 0.0
dtype: float64
In [206]: ser.loc[:1]
Out[206]:
0 0.0
1 1.0
dtype: float64
Combining and Merging Datasets#
There are a few ways to combine data:
pandas.merge - connects rows based on one or more keys, similar to a database join
pandas.concat - concatenates or stacks objects together along an axis
combine_first - instance method that splices together overlapping data, filling missing values in one object with values from another
Database-style joins#
In [207]: df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
In [208]: df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})
In [209]: df1
Out[209]:
key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 a 5
6 b 6
In [210]: df2
Out[210]:
key data2
0 a 0
1 b 1
2 d 2
In [211]: pd.merge(df1, df2)
Out[211]:
key data1 data2
0 b 0 1
1 b 1 1
2 b 6 1
3 a 2 0
4 a 4 0
5 a 5 0
This is a many-to-one join as df2
has only a single row for each key
If the column to join on is not specified merge
uses the overlapping column names as keys.
It is good practice to be explicit, not implicit,
about what to join on:
In [213]: pd.merge(df1, df2, on='key')
Out[213]:
key data1 data2
0 b 0 1
1 b 1 1
2 b 6 1
3 a 2 0
4 a 4 0
5 a 5 0
If the column names are different in each dataframe, you can specify them separately:
In [214]: df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
In [215]: df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'], 'data2': range(3)})
In [216]: pd.merge(df3, df4, left_on='lkey', right_on='rkey')
Out[216]:
lkey data1 rkey data2
0 b 0 b 1
1 b 1 b 1
2 b 6 b 1
3 a 2 a 0
4 a 4 a 0
5 a 5 a 0
By default, merge does an
inner join
, which takes the intersection, i.e. the keys common to both tables.
Other options are left
, right
and outer
. outer
takes the union by applying both left and right joins:
In [218]: pd.merge(df1, df2, how='outer')
Out[218]:
key data1 data2
0 b 0.0 1.0
1 b 1.0 1.0
2 b 6.0 1.0
3 a 2.0 0.0
4 a 4.0 0.0
5 a 5.0 0.0
6 c 3.0 NaN
7 d NaN 2.0
many-to-many joins have well-defined but not intuitive behaviour:
In [222]: df1
Out[222]:
key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 a 5
6 b 6
In [223]: df2
Out[223]:
key data2
0 a 0
1 b 1
2 a 2
3 b 3
4 d 4
In [224]: pd.merge(df1, df2, on='key', how='left')
Out[224]:
key data1 data2
0 b 0 1.0
1 b 0 3.0
2 b 1 1.0
3 b 1 3.0
4 a 2 0.0
5 a 2 2.0
6 c 3 NaN
7 a 4 0.0
8 a 4 2.0
9 a 5 0.0
10 a 5 2.0
11 b 6 1.0
12 b 6 3.0
Many-to-many joins form a cartesian product of the rows. There were 3 b’s in df1
and 2 b’s in df2
therefore in the merged dataframe there are 6 b’s.
In [225]: pd.merge(df1, df2, how='inner')
Out[225]:
key data1 data2
0 b 0 1
1 b 0 3
2 b 1 1
3 b 1 3
4 b 6 1
5 b 6 3
6 a 2 0
7 a 2 2
8 a 4 0
9 a 4 2
10 a 5 0
11 a 5 2
To join with multiple keys, pass a list of column names:
In [226]: left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'], 'key2': ['one', 'two', 'one'], 'lval': [1, 2, 3]})
In [227]: right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'], 'key2': ['one', 'one', 'one', 'two'], 'rval': [4, 5, 6, 7]})
In [229]: left
Out[229]:
key1 key2 lval
0 foo one 1
1 foo two 2
2 bar one 3
In [230]: right
Out[230]:
key1 key2 rval
0 foo one 4
1 foo one 5
2 bar one 6
3 bar two 7
In [228]: pd.merge(left, right, on=['key1', 'key2'], how='outer')
Out[228]:
key1 key2 lval rval
0 foo one 1.0 4.0
1 foo one 1.0 5.0
2 foo two 2.0 NaN
3 bar one 3.0 6.0
4 bar two NaN 7.0
To determine which combinations will appear in the result, think of the multiple keys as forming an array of tuples used as a single join key.
Overlapping column names can be addressed manually by renaming the axis labels. merge
has a suffixes
option for specifying strings to append to overlapping column names:
In [231]: pd.merge(left, right, on='key1')
Out[231]:
key1 key2_x lval key2_y rval
0 foo one 1 one 4
1 foo one 1 one 5
2 foo two 2 one 4
3 foo two 2 one 5
4 bar one 3 one 6
5 bar one 3 two 7
In [232]: pd.merge(left, right, on='key1', suffixes=['_left', '_right'])
Out[232]:
key1 key2_left lval key2_right rval
0 foo one 1 one 4
1 foo one 1 one 5
2 foo two 2 one 4
3 foo two 2 one 5
4 bar one 3 one 6
5 bar one 3 two 7
Merging on Index#
Sometimes the merge keys are found in its index.
To indicate which index should be used as the merge key use left_index=True
or right_index=True
or both.
In [233]: left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'], 'value': range(6)})
In [245]: right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])
In [240]: left1
Out[240]:
key value
0 a 0
1 b 1
2 a 2
3 a 3
4 b 4
5 c 5
In [246]: right1
Out[246]:
group_val
a 3.5
b 7.0
In [247]: pd.merge(left1, right1, left_on='key', right_index=True)
Out[247]:
key value group_val
0 a 0 3.5
2 a 2 3.5
3 a 3 3.5
1 b 1 7.0
4 b 4 7.0
You can do the union instead of intersection:
In [248]: pd.merge(left1, right1, left_on='key', right_index=True, how='outer')
Out[248]:
key value group_val
0 a 0 3.5
2 a 2 3.5
3 a 3 3.5
1 b 1 7.0
4 b 4 7.0
5 c 5 NaN
With hierarchically indexed data, things are more complicated, as joining on the index is inherently a multiple-key merge.
In this case you have to state the multiple keys to merge on. More in the book…
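Roughly the pattern from the book: the left frame supplies both keys through left_on while the right frame joins on its MultiIndex.
import pandas as pd

lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Nevada'],
                      'key2': [2000, 2001, 2001],
                      'data': range(3)})
righth = pd.DataFrame({'event': [4, 5, 6]},
                      index=pd.MultiIndex.from_tuples(
                          [('Ohio', 2000), ('Ohio', 2001), ('Nevada', 2001)]))

# both key columns together are matched against the two index levels
pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True)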
Concatenating along an axis#
Concatenating, binding or stacking
In [249]: arr = np.arange(12).reshape((3, 4))
In [256]: arr
Out[256]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
In [257]: np.concatenate([arr, arr], axis=1)
Out[257]:
array([[ 0, 1, 2, 3, 0, 1, 2, 3],
[ 4, 5, 6, 7, 4, 5, 6, 7],
[ 8, 9, 10, 11, 8, 9, 10, 11]])
In a dataframe context:
- If the objects are indexed differently, should we combine the distinct values or keep only the shared (intersection) values?
- Do concatenated chunks of data need to be identifiable in the resulting object?
- Does the concatenation axis contain data that needs to be preserved? In many cases the default integer labels should be discarded.
The concat
function in pandas provides a consistent way to address the above concerns.
In [258]: s1 = pd.Series([0, 1], index=['a', 'b'])
In [259]: s2 = pd.Series([2, 3,4], index=['c', 'd', 'e'])
In [260]: s3 = pd.Series([5, 6], index=['f', 'g'])
In [263]: s1
Out[263]:
a 0
b 1
dtype: int64
In [264]: s2
Out[264]:
c 2
d 3
e 4
dtype: int64
In [265]: s3
Out[265]:
f 5
g 6
dtype: int64
Concatenated:
In [262]: pd.concat([s1, s2, s3])
Out[262]:
a 0
b 1
c 2
d 3
e 4
f 5
g 6
dtype: int64
By default concat
works along axis=0
If you pass axis=1
the result is a DataFrame
:
In [266]: pd.concat([s1, s2, s3], axis=1)
Out[266]:
0 1 2
a 0.0 NaN NaN
b 1.0 NaN NaN
c NaN 2.0 NaN
d NaN 3.0 NaN
e NaN 4.0 NaN
f NaN NaN 5.0
g NaN NaN 6.0
You can intersect the indexes instead with join='inner'.
There is a lot more detailed stuff in the book, like join_axes.
Combining data with an overlap is also explained; a small combine_first sketch follows.
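A minimal sketch of combine_first: take values from the calling object, patching its holes with values from the argument.
import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2.5, 0.0], index=['f', 'e', 'd'])
b = pd.Series([0.0, np.nan, 2.0], index=['f', 'e', 'd'])

b.combine_first(a)   # f: 0.0 (from b), e: 2.5 (patched from a), d: 2.0 (from b)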
Reshaping and Pivoting#
Reshaping with Hierarchical Indexing#
stack
rotates or pivots from the columns in the data to rows
unstack
pivots from rows to columns
In [3]: data = pd.DataFrame(np.arange(6).reshape(2,3), index=pd.Index(['Ohio', 'Colarado']), columns=pd.Index(['one', 'two', 'three'], name='number'))
In [4]: data
Out[4]:
number one two three
Ohio 0 1 2
Colarado 3 4 5
stack()
pivots the columns into rows, producing a series:
In [5]: result = data.stack()
In [6]: result
Out[6]:
number
Ohio one 0
two 1
three 2
Colarado one 3
two 4
three 5
dtype: int64
You can rearrange the series back into a dataframe with unstack()
:
In [7]: result.unstack()
Out[7]:
number one two three
Ohio 0 1 2
Colarado 3 4 5
By default the innermost level is unstacked, you can choose a different level by giving the name or number:
In [8]: result.unstack(0)
Out[8]:
Ohio Colarado
number
one 0 3
two 1 4
three 2 5
Unstacking might give missing data
Stacking filters out missing data by default; you can stop that by passing dropna=False
:
data2.unstack().stack(dropna=False)
The level unstacked becomes the lowest level in the result
Pivoting Long to Wide Format#
A common way of storing time series in databases and CSV files is the long or stacked format.
Let's load some time series data and do some wrangling on it:
In [4]: data = pd.read_csv('macrodata.csv')
In [5]: periods = pd.PeriodIndex(year=data.year, quarter=data.quarter, name='data')
In [6]: data.head()
Out[6]:
year quarter realgdp realcons realinv realgovt realdpi cpi m1 tbilrate unemp pop infl realint
0 1959.0 1.0 2710.349 1707.4 286.898 470.045 1886.9 28.98 139.7 2.82 5.8 177.146 0.00 0.00
1 1959.0 2.0 2778.801 1733.7 310.859 481.301 1919.7 29.15 141.7 3.08 5.1 177.830 2.34 0.74
2 1959.0 3.0 2775.488 1751.8 289.226 491.260 1916.4 29.35 140.5 3.82 5.3 178.657 2.74 1.09
3 1959.0 4.0 2785.204 1753.7 299.356 484.052 1931.3 29.37 140.0 4.33 5.6 179.386 0.27 4.06
4 1960.0 1.0 2847.699 1770.5 331.722 462.199 1955.5 29.54 139.6 3.50 5.2 180.007 2.31 1.19
In [7]: data = pd.DataFrame(data.to_records(), columns=pd.Index(['realgdp', 'infl', 'unemp'], name='item'),
...: index=periods.to_timestamp('D', 'end'))
In [8]: data.head()
Out[8]:
item realgdp infl unemp
data
1959-03-31 2710.349 0.00 5.8
1959-06-30 2778.801 2.34 5.1
1959-09-30 2775.488 2.74 5.3
1959-12-31 2785.204 0.27 5.6
1960-03-31 2847.699 2.31 5.2
In [9]: ldata = data.stack().reset_index().rename(columns={0: 'value'})
In [10]: ldata.head()
Out[10]:
data item value
0 1959-03-31 realgdp 2710.349
1 1959-03-31 infl 0.000
2 1959-03-31 unemp 5.800
3 1959-06-30 realgdp 2778.801
4 1959-06-30 infl 2.340
This is long
format as it is time series with multiple keys.
Each row represents a single observation.
Data is frequently stored like this in MySQL db’s.
In this case date
(data
) and item
would be primary keys.
You may prefer a dataframe with one column per distinct item value, indexed by date.
The pivot
method does this:
In [11]: pivoted = ldata.pivot('data', 'item', 'value')
In [13]: pivoted.head()
Out[13]:
item infl realgdp unemp
data
1959-03-31 0.00 2710.349 5.8
1959-06-30 2.34 2778.801 5.1
1959-09-30 2.74 2775.488 5.3
1959-12-31 0.27 2785.204 5.6
1960-03-31 2.31 2847.699 5.2
The first two arguments passed are the row and column index, then finally an optional value column to fill the dataframe with.
More in the book on pivoting with multiple value columns per key; it will create hierarchical columns. A rough sketch follows.
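A rough sketch, assuming the ldata frame from above and the same positional-argument style: with a second value column, pivot produces hierarchical columns, and it is equivalent to set_index followed by unstack.
ldata['value2'] = np.random.randn(len(ldata))

pivoted = ldata.pivot('data', 'item')                          # hierarchical columns: (value, item), (value2, item)
unstacked = ldata.set_index(['data', 'item']).unstack('item')  # same result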
Chapter 9: Plotting and Visualisation#
Making informative visualisations (plots) is one of the most important tasks in data analysis. It may be part of the exploratory process: identifying outliers, generating ideas for models and data transformations.
For others building interactive visualisations for the web may be the end goal.
Matplotlib is a desktop plotting package for creating two-dimensional publication-quality plots.
mplot3d for 3D plots
basemap for mapping and projections
The best way to follow along is with inline plotting in a Jupyter notebook: %matplotlib inline
matplotlib API Primer#
Import convention:
import matplotlib.pyplot as plt
First plot:
plt.plot(np.arange(10));
Libraries like seaborn and pandas' built-in plotting functions deal with many of the mundane tasks of creating plots, but it is still important to learn a bit about the underlying API.
Figures and Subplots#
Plots in matplotlib
reside within a Figure
object.
Check the Chapter 9: Figures and Subplots Jupyter Notebook Notes, which I will attempt to host if possible.
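The gist, as a minimal sketch: plots live on Axes objects inside a Figure.
import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure()
ax1 = fig.add_subplot(2, 2, 1)   # first plot in a 2x2 grid
ax2 = fig.add_subplot(2, 2, 2)

ax1.hist(np.random.randn(100), bins=20, color='k', alpha=0.3)
ax2.plot(np.random.randn(50).cumsum(), 'k--')
plt.show()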
Chapter 10: Data Aggregation and Group Operations#
Categorising a data set and applying functions to each group is a critical part of the data analysis workflow. After loading, merging and preparing a dataset you may need to compute group statistics or pivot tables for visualisations.
SQL can be limited in this regard; in this chapter you will learn how to:
- Split a pandas object into pieces based on keys (functions, arrays or dataframe columns)
- Compute group summary statistics
- Apply transformations, linear regression, rank or subset selection per group
- Compute pivot tables and cross-tabulations
- Quantile analysis
GroupBy Mechanics#
Split-Apply-Combine
for group operations was coined by R’s Hadley Wickham.
- Split the group based on one or more keys, on a specific axis (0 = rows, 1 = columns)
- A function is applied to each group producing a new value
- Finally the results of those functions are combined into a result object
For example summing the total spend at different shops (as the key) in a dataset
The keys do not need to be of the same type; a key can be:
- A list or array of values the same length as the axis being grouped
- a column name in a dataframe
- a dict or series corresponding to values on an axis
- A function to be invoked on the axis index or labels
In [4]: df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
   ...:                    'key2': ['one', 'two', 'one', 'two', 'one'],
   ...:                    'data1': np.random.randn(5),
   ...:                    'data2': np.random.randn(5)})
In [5]: df
Out[5]:
  key1 key2     data1     data2
0    a  one  0.148874  0.617623
1    a  two  0.421171  0.352511
2    b  one -0.566405  1.491486
3    b  two  2.147993 -0.948390
4    a  one  0.423003 -1.586536
Suppose you want to calculate the mean
of the data1
column using the group labels from key1
, one way:
In [6]: grouped = df['data1'].groupby(df['key1'])
In [7]: grouped
Out[7]: <pandas.core.groupby.groupby.SeriesGroupBy object at 0x11512f748>
The grouped
variable is now a GroupBy
object; it hasn't calculated anything yet except some intermediate data about the group key df['key1']
.
To calculate the mean
:
In [8]: grouped.mean()
Out[8]:
key1
a 0.331016
b 0.790794
Name: data1, dtype: float64
The data (a series) has been aggregated according to the group key, producing a new series indexed by unique values in the key1
column.
In [19]: means = df['data1'].groupby([df['key1'], df['key2']]).mean()
In [20]: means
Out[20]:
key1 key2
a one 0.285939
two 0.421171
b one -0.566405
two 2.147993
Name: data1, dtype: float64
In this case we grouped by 2 keys and the resulting series now has a hierarchical index
In [21]: means.unstack()
Out[21]:
key2 one two
key1
a 0.285939 0.421171
b -0.566405 2.147993
The group keys are Series but they could also be arrays of the correct length
In [22]: states = np.array(['Gauteng', 'Western Cape', 'Mpumalanga', 'Kwa-Zulu Natal', 'Northern Cape'])
In [23]: years = np.array([2005, 2006, 2007, 2008, 2009])
In [26]: df['data1'].groupby([states, years]).mean()
Out[26]:
Gauteng 2005 0.148874
Kwa-Zulu Natal 2008 2.147993
Mpumalanga 2007 -0.566405
Northern Cape 2009 0.423003
Western Cape 2006 0.421171
Name: data1, dtype: float64
Not sure what the hell I just did above: the two arrays act as the group keys, so each (state, year) pair forms its own group, and since every pair is unique here each "mean" is just the single underlying value.
Frequently the grouping data is in your dataframe already:
In [28]: df.groupby('key1').mean()
Out[28]:
data1 data2
key1
a 0.331016 -0.205467
b 0.790794 0.271548
or with 2 keys:
In [29]: df.groupby(['key1', 'key2']).mean()
Out[29]:
data1 data2
key1 key2
a one 0.285939 -0.484456
two 0.421171 0.352511
b one -0.566405 1.491486
two 2.147993 -0.948390
In the single key example the key2 column is not shown. Because it is not numeric data, it is known as a nuisance column and is therefore excluded from the result.
By default only numeric columns are aggregated.
A generally useful method is size(),
which returns the size of each group
In [30]: df.groupby(['key1', 'key2']).size()
Out[30]:
key1 key2
a one 2
two 1
b one 1
two 1
dtype: int64
All missing values are excluded from the result
Iterating over groups#
The GroupBy
object supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data.
In [31]: for name, group in df.groupby('key1'):
...: print(name)
a
b
In the case of multiple keys, the first element in the tuple will be a tuple of key values:
In [32]: for (k1, k2), group in df.groupby(['key1', 'key2']):
...: print((k1, k2))
('a', 'one')
('a', 'two')
('b', 'one')
('b', 'two')
Of course you can cast the groups to a dict:
In [36]: dict(list(df.groupby('key1')))
Out[36]:
{'a': key1 key2 data1 data2
0 a one 0.148874 0.617623
1 a two 0.421171 0.352511
4 a one 0.423003 -1.586536, 'b': key1 key2 data1 data2
2 b one -0.566405 1.491486
3 b two 2.147993 -0.948390}
By default, groupby
groups on axis=0
(rows)
You could however group on columns (axis=1
):
In [38]: grouped = df.groupby(df.dtypes, axis=1)
In [39]: dict(list(grouped))
Out[39]:
{dtype('float64'): data1 data2
0 0.148874 0.617623
1 0.421171 0.352511
2 -0.566405 1.491486
3 2.147993 -0.948390
4 0.423003 -1.586536, dtype('O'): key1 key2
0 a one
1 a two
2 b one
3 b two
4 a one}
Selecting a Column or Subset of Columns#
Indexing a GroupBy object created from a DataFrame with a column name or array of column names has the effect of selecting those columns for aggregation:
df.groupby('key1')['data1']
is the same as:
df['data1'].groupby(df['key1'])
For large datasets it may be desirable to aggregate only a few columns.
To compute means for just the data2 column and get the result as a dataframe, we could write:
In [40]: df.groupby(['key1', 'key2'])[['data2']].mean()
Out[40]:
data2
key1 key2
a one -0.484456
two 0.352511
b one 1.491486
two -0.948390
The object returned is a grouped DataFrame if a list or array is passed and a grouped series if only the single column is passed as a scalar.
In [44]: type(df.groupby(['key1', 'key2'])['data2'])
Out[44]: pandas.core.groupby.groupby.SeriesGroupBy
In [45]: type(df.groupby(['key1', 'key2'])[['data2']])
Out[45]: pandas.core.groupby.groupby.DataFrameGroupBy
Grouping with Dicts and Series#
Grouping information may exist in a form other than an array
In [46]: people = pd.DataFrame(np.random.randn(5, 5), columns=['a', 'b', 'c', 'd', 'e'], index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'],)
In [49]: people.iloc[2:3, [1, 2]] = np.NaN
In [50]: people
Out[50]:
a b c d e
Joe -0.997165 0.194776 -0.565123 0.848180 -1.277432
Steve -0.719995 1.353184 -1.120488 0.772741 -0.919065
Wes 1.513412 NaN NaN -0.212853 -0.219539
Jim -1.605091 0.283836 0.600302 1.887151 0.083196
Travis -0.915961 -0.331931 -1.177884 1.064124 1.125694
Now we have a mapping of columns to groups:
In [51]: mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red', 'f': 'orange'}
You can simply pass that to groupby
:
In [52]: by_column = people.groupby(mapping, axis=1)
In [54]: by_column.sum()
Out[54]:
blue red
Joe 0.283057 -2.079821
Steve -0.347747 -0.285876
Wes -0.212853 1.293874
Jim 2.487453 -1.238059
Travis -0.113760 -0.122198
The same holds true for series
In [55]: map_series = pd.Series(mapping)
In [56]: map_series
Out[56]:
a red
b red
c blue
d blue
e red
f orange
dtype: object
In [57]: people.groupby(map_series, axis=1).count()
Out[57]:
blue red
Joe 2 3
Steve 2 3
Wes 1 2
Jim 2 3
Travis 2 3
Grouping with Functions#
A function passed as the group key will be called once per index value; the return values will be used as the group names.
Suppose you wanted to group people by the length of their name:
In [58]: people.groupby(len).sum()
Out[58]:
a b c d e
3 -1.088844 0.478612 0.035179 2.522479 -1.413775
5 -0.719995 1.353184 -1.120488 0.772741 -0.919065
6 -0.915961 -0.331931 -1.177884 1.064124 1.125694
Mixing functions with arrays, dicts or series is not a problem as everything is turned into an array internally.
In [59]: key_list = ['one', 'one', 'one', 'two', 'two']
In [60]: people.groupby([len, key_list]).min()
Out[60]:
a b c d e
3 one -0.997165 0.194776 -0.565123 -0.212853 -1.277432
two -1.605091 0.283836 0.600302 1.887151 0.083196
5 one -0.719995 1.353184 -1.120488 0.772741 -0.919065
6 two -0.915961 -0.331931 -1.177884 1.064124 1.125694
Grouping by Index Levels#
For hierarchically indexed data you can aggregate using one of the levels of an axis's index. To do that, pass the level keyword:
In [63]: columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'], [1, 3, 5, 1, 3]], names=['country', 'tenor'])
In [64]: hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)
In [65]: hier_df
Out[65]:
country US JP
tenor 1 3 5 1 3
0 0.751653 0.196602 0.112632 -1.101248 -0.337879
1 0.040296 -0.983856 -0.300156 1.766771 0.637154
2 -0.949319 -0.454868 -0.202546 -0.003234 -0.700482
3 1.334632 1.579028 1.965208 0.671107 -0.627314
In [66]: hier_df.groupby(level='country', axis=1).count()
Out[66]:
country JP US
0 2 3
1 2 3
2 2 3
3 2 3
Data Aggregation#
Data aggregation here means any transformation that produces scalar values from arrays, like sum, count, min, and mean.
You can also use transformations of your own making. For example, quantile computes sample quantiles of a Series or of a DataFrame's columns:
In [67]: df
Out[67]:
key1 key2 data1 data2
0 a one 0.148874 0.617623
1 a two 0.421171 0.352511
2 b one -0.566405 1.491486
3 b two 2.147993 -0.948390
4 a one 0.423003 -1.586536
In [68]: grouped = df.groupby('key1')
In [69]: grouped['data1'].quantile(0.9)
Out[69]:
key1
a 0.422637
b 1.876553
Name: data1, dtype: float64
quantile is a Series method and is thus available on groupby objects: groupby efficiently slices the Series, calls the method on each piece, and assembles the results.
To use your own aggregation function, pass any function that aggregates an array to the agg or aggregate method:
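(peak_to_peak below isn't defined in these notes; it's the book's example function, which computes the range of an array:)
def peak_to_peak(arr):
    return arr.max() - arr.min()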
In [71]: grouped.agg(peak_to_peak)
Out[71]:
data1 data2
key1
a 0.274130 2.204160
b 2.714398 2.439876
describe also works on grouped data:
In [72]: grouped.describe()
Out[72]:
data1 ... data2
count mean std min 25% 50% ... std min 25% 50% 75% max
key1 ...
a 3.0 0.331016 0.157743 0.148874 0.285022 0.421171 ... 1.203364 -1.586536 -0.617013 0.352511 0.485067 0.617623
b 2.0 0.790794 1.919369 -0.566405 0.112195 0.790794 ... 1.725253 -0.948390 -0.338421 0.271548 0.881517 1.491486
[2 rows x 16 columns]
Custom aggregation functions are generally much slower than the optimized functions:
count, sum, mean, median, std, var, min, max, prod, first, last
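As a rough self-check (a sketch of mine, not from the book; the sizes are arbitrary):
import time
import numpy as np
import pandas as pd

# Synthetic data: 1,000,000 rows across 100 groups
big = pd.DataFrame({'key': np.random.randint(0, 100, 1_000_000),
                    'val': np.random.randn(1_000_000)})
g = big.groupby('key')['val']

t0 = time.perf_counter(); g.mean(); t1 = time.perf_counter()
g.agg(lambda x: x.mean()); t2 = time.perf_counter()
print(f'optimized mean: {t1 - t0:.3f}s, custom lambda: {t2 - t1:.3f}s')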
Using tip data:
In [75]: tips = pd.read_csv('tips.csv')
In [76]: tips.head()
Out[76]:
total_bill tip smoker day time size
0 16.99 1.01 No Sun Dinner 2
1 10.34 1.66 No Sun Dinner 3
2 21.01 3.50 No Sun Dinner 3
3 23.68 3.31 No Sun Dinner 2
4 24.59 3.61 No Sun Dinner 4
We create the tip_pct
column:
In [78]: tips['tip_pct'] = tips['tip'] / tips['total_bill']
Column-wise and Multi-function Application#
Group by sex and smoker:
In [90]: grouped = tips.groupby(['sex', 'smoker'])
In [91]: grouped_pct = grouped['tip_pct']
In [92]: grouped_pct.agg('mean')
Out[92]:
sex smoker
Female No 0.156921
Yes 0.182150
Male No 0.160669
Yes 0.152771
Name: tip_pct, dtype: float64
If you pass a list of functions, you get a DataFrame with column names taken from the functions:
In [94]: grouped_pct.agg(['mean', 'std', peak_to_peak])
Out[94]:
mean std peak_to_peak
sex smoker
Female No 0.156921 0.036421 0.195876
Yes 0.182150 0.071595 0.360233
Male No 0.160669 0.041849 0.220186
Yes 0.152771 0.090588 0.674707
You don't have to accept those column names; in particular, lambda functions get the name <lambda>, which makes them hard to identify. Instead you can pass a list of (name, function) tuples, where the first element of each tuple is used as the column name:
In [96]: grouped_pct.agg([('foo', 'mean'), ('bar', np.std)])
Out[96]:
foo bar
sex smoker
Female No 0.156921 0.036421
Yes 0.182150 0.071595
Male No 0.160669 0.041849
Yes 0.152771 0.090588
With a DataFrame you have more columns to apply functions to:
In [97]: functions = ['count', 'mean', 'max']
In [98]: result = grouped[['tip_pct', 'total_bill']].agg(functions)
In [99]: result
Out[99]:
tip_pct total_bill
count mean max count mean max
sex smoker
Female No 54 0.156921 0.252672 54 18.105185 35.83
Yes 33 0.182150 0.416667 33 17.977879 44.30
Male No 97 0.160669 0.291990 97 19.791237 48.33
Yes 60 0.152771 0.710345 60 22.284500 50.81
It gives a result with hierarchical columns, which you can view:
In [100]: result['tip_pct']
Out[100]:
count mean max
sex smoker
Female No 54 0.156921 0.252672
Yes 33 0.182150 0.416667
Male No 97 0.160669 0.291990
Yes 60 0.152771 0.710345
Suppose you want to apply different functions to one or more columns:
In [101]: grouped.agg({'tip': np.max, 'size': sum})
Out[101]:
tip size
sex smoker
Female No 5.2 140
Yes 6.5 74
Male No 9.0 263
Yes 10.0 150
In [102]: grouped.agg({'tip': ['min', 'max', 'mean', 'std'], 'size': 'sum'})
Out[102]:
tip size
min max mean std sum
sex smoker
Female No 1.00 5.2 2.773519 1.128425 140
Yes 1.00 6.5 2.931515 1.219916 74
Male No 1.25 9.0 3.113402 1.489559 263
Yes 1.00 10.0 3.051167 1.500120 150
Returning Aggregated Data without Indexes#
Sometimes you don't want an index coming back with the results. You can disable that behaviour by passing as_index=False to groupby:
In [107]: tips.groupby(['sex', 'smoker'], as_index=False).mean()
Out[107]:
sex smoker total_bill tip size tip_pct
0 Female No 18.105185 2.773519 2.592593 0.156921
1 Female Yes 17.977879 2.931515 2.242424 0.182150
2 Male No 19.791237 3.113402 2.711340 0.160669
3 Male Yes 22.284500 3.051167 2.500000 0.152771
You can of course use reset_index() to achieve the same result.
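That is, something like:
tips.groupby(['sex', 'smoker']).mean().reset_index()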
Apply: General split-apply-combine#
The most general-purpose GroupBy method is apply(). apply splits the object into pieces, invokes the function on each piece, and then attempts to concatenate the pieces back together.
Suppose you wanted to select the top five tip_pct values by group. First, write a function that selects the rows with the largest values in a particular column:
In [110]: def top(df, n=5, column='tip_pct'):
...: return df.sort_values(by=column)[-n:]
In [111]: top(tips, n=6)
Out[111]:
total_bill tip sex smoker day time size tip_pct
109 14.31 4.00 Female Yes Sat Dinner 2 0.279525
183 23.17 6.50 Male Yes Sun Dinner 4 0.280535
232 11.61 3.39 Male No Sat Dinner 2 0.291990
67 3.07 1.00 Female Yes Sat Dinner 1 0.325733
178 9.60 4.00 Female Yes Sun Dinner 2 0.416667
172 7.25 5.15 Male Yes Sun Dinner 2 0.710345
So this returns the six rows with the biggest tip_pct. Now, if we group by smoker:
In [116]: tips.groupby('smoker').apply(top)
Out[116]:
total_bill tip sex smoker day time size tip_pct
smoker
No 88 24.71 5.85 Male No Thur Lunch 2 0.236746
185 20.69 5.00 Male No Sun Dinner 5 0.241663
51 10.29 2.60 Female No Sun Dinner 2 0.252672
149 7.51 2.00 Male No Thur Lunch 2 0.266312
232 11.61 3.39 Male No Sat Dinner 2 0.291990
Yes 109 14.31 4.00 Female Yes Sat Dinner 2 0.279525
183 23.17 6.50 Male Yes Sun Dinner 4 0.280535
67 3.07 1.00 Female Yes Sat Dinner 1 0.325733
178 9.60 4.00 Female Yes Sun Dinner 2 0.416667
172 7.25 5.15 Male Yes Sun Dinner 2 0.710345
The top function is called on each piece of the DataFrame, then the results are glued together using pandas.concat. The result therefore has a hierarchical index whose inner level contains index values from the original DataFrame.
If the function you pass to apply takes other arguments, you can pass those after the function:
In [118]: tips.groupby('smoker').apply(top, n=1, column='total_bill')
Out[118]:
total_bill tip sex smoker day time size tip_pct
smoker
No 212 48.33 9.0 Male No Sat Dinner 4 0.186220
Yes 170 50.81 10.0 Male Yes Sat Dinner 3 0.196812
The function passed to apply only needs to return a scalar value or a pandas object; beyond that, what you do inside it is up to your creativity.
Remember the describe function on a groupby object:
In [120]: result = tips.groupby('smoker')['tip_pct'].describe()
In [121]: result
Out[121]:
count mean std min 25% 50% 75% max
smoker
No 151.0 0.159328 0.039910 0.056797 0.136906 0.155625 0.185014 0.291990
Yes 93.0 0.163196 0.085119 0.035638 0.106771 0.153846 0.195059 0.710345
The describe() method is really just an alias for:
f = lambda x: x.describe()
grouped.apply(f)
Suppressing the group keys#
Use group_keys=False:
In [124]: tips.groupby('smoker').apply(top)
Out[124]:
total_bill tip sex smoker day time size tip_pct
smoker
No 88 24.71 5.85 Male No Thur Lunch 2 0.236746
185 20.69 5.00 Male No Sun Dinner 5 0.241663
51 10.29 2.60 Female No Sun Dinner 2 0.252672
149 7.51 2.00 Male No Thur Lunch 2 0.266312
232 11.61 3.39 Male No Sat Dinner 2 0.291990
Yes 109 14.31 4.00 Female Yes Sat Dinner 2 0.279525
183 23.17 6.50 Male Yes Sun Dinner 4 0.280535
67 3.07 1.00 Female Yes Sat Dinner 1 0.325733
178 9.60 4.00 Female Yes Sun Dinner 2 0.416667
172 7.25 5.15 Male Yes Sun Dinner 2 0.710345
In [125]: tips.groupby('smoker', group_keys=False).apply(top)
Out[125]:
total_bill tip sex smoker day time size tip_pct
88 24.71 5.85 Male No Thur Lunch 2 0.236746
185 20.69 5.00 Male No Sun Dinner 5 0.241663
51 10.29 2.60 Female No Sun Dinner 2 0.252672
149 7.51 2.00 Male No Thur Lunch 2 0.266312
232 11.61 3.39 Male No Sat Dinner 2 0.291990
109 14.31 4.00 Female Yes Sat Dinner 2 0.279525
183 23.17 6.50 Male Yes Sun Dinner 4 0.280535
67 3.07 1.00 Female Yes Sat Dinner 1 0.325733
178 9.60 4.00 Female Yes Sun Dinner 2 0.416667
172 7.25 5.15 Male Yes Sun Dinner 2 0.710345
Quantile and bucket analysis#
Pandas has cut and qcut for slicing data into buckets. These functions can be combined with groupby:
In [126]: frame = pd.DataFrame({'data1': np.random.randn(1000), 'data2': np.random.randn(1000)})
In [127]: quartiles = pd.cut(frame.data1, 4)
In [128]: quartiles[:10]
Out[128]:
0 (-1.654, -0.189]
1 (-0.189, 1.275]
2 (-0.189, 1.275]
3 (-0.189, 1.275]
4 (-0.189, 1.275]
5 (-0.189, 1.275]
6 (-0.189, 1.275]
7 (-0.189, 1.275]
8 (1.275, 2.74]
9 (-0.189, 1.275]
Name: data1, dtype: category
Categories (4, interval[float64]): [(-3.125, -1.654] < (-1.654, -0.189] < (-0.189, 1.275] <
(1.275, 2.74]]
The categorical object returned can be passed directly to groupby:
In [129]: def get_stats(group):
...: return {'min': group.min(), 'max': group.max(), 'count': group.count(), 'mean': group.mean()}
In [130]: grouped = frame.data2.groupby(quartiles)
In [132]: grouped.apply(get_stats).unstack()
Out[132]:
count max mean min
data1
(-3.125, -1.654] 52.0 2.299146 0.177311 -1.974142
(-1.654, -0.189] 396.0 2.713077 0.171591 -3.334721
(-0.189, 1.275] 452.0 3.190075 -0.093762 -2.807887
(1.275, 2.74] 100.0 2.637351 0.004744 -2.028048
These were equal-length buckets. To compute equal-size buckets (based on sample quantiles), use qcut:
In [133]: grouping = pd.qcut(frame.data1, 10, labels=False)
In [134]: grouped = frame.data2.groupby(grouping)
In [135]: grouped.apply(get_stats).unstack()
Out[135]:
count max mean min
data1
0 100.0 2.299146 0.170813 -3.334721
1 100.0 2.713077 0.083603 -2.528098
2 100.0 2.238571 0.178352 -1.928308
3 100.0 2.669556 0.258033 -2.031216
4 100.0 3.085353 0.031921 -2.807887
5 100.0 3.190075 -0.107540 -2.686559
6 100.0 2.122244 -0.050376 -2.385933
7 100.0 2.099441 -0.167291 -2.330819
8 100.0 2.417712 -0.049621 -2.789057
9 100.0 2.637351 0.004744 -2.028048
Check the book for examples of:
- Filling missing values with group-specific values (a sketch of this one follows below)
- Random sampling and permutation
- Group weighted average and correlation
- Group-wise linear regression
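As a taste of the first item, a minimal sketch (mine, following the book's apply pattern) that fills NaNs with each group's mean:
import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(6),
              index=['Ohio', 'New York', 'Vermont', 'Florida', 'Oregon', 'Nevada'])
s[['Vermont', 'Nevada']] = np.nan           # introduce missing values
group_key = ['East'] * 3 + ['West'] * 3     # first three states East, the rest West
s.groupby(group_key).apply(lambda g: g.fillna(g.mean()))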
Pivot Tables and Cross Tabulation#
A pivot table is a data summarisation tool. It aggregates table data by one or more keys, arranging the data with some group keys along the rows and some along the columns. Pivot tables are made possible in pandas by groupby combined with reshape operations utilising hierarchical indexing.
DataFrame has a pivot_table method, and there is also a top-level pandas.pivot_table function.
Pivot tables can add partial totals, known as margins.
Suppose you want grouped means (the default aggregation for pivot_table):
In [8]: tips.pivot_table(index=['sex', 'smoker'])
Out[8]:
size tip tip_pct total_bill
sex smoker
Female No 2.592593 2.773519 0.156921 18.105185
Yes 2.242424 2.931515 0.182150 17.977879
Male No 2.711340 3.113402 0.160669 19.791237
Yes 2.500000 3.051167 0.152771 22.284500
Now suppose we want to aggregate only tip_pct and size, and additionally group by day. We put smoker in the columns and day in the rows:
In [10]: tips.pivot_table(['tip_pct', 'size'], index=['sex', 'day'], columns=['smoker'])
Out[10]:
size tip_pct
smoker No Yes No Yes
sex day
Female Fri 2.500000 2.000000 0.165296 0.209129
Sat 2.307692 2.200000 0.147993 0.163817
Sun 3.071429 2.500000 0.165710 0.237075
Thur 2.480000 2.428571 0.155971 0.163073
Male Fri 2.000000 2.125000 0.138005 0.144730
Sat 2.656250 2.629630 0.162132 0.139067
Sun 2.883721 2.600000 0.158291 0.173964
Thur 2.500000 2.300000 0.165706 0.164417
You can add partial totals by passing margins=True. This adds All column and row labels:
In [11]: tips.pivot_table(['tip_pct', 'size'], index=['sex', 'day'], columns=['smoker'], margins=True)
Out[11]:
size tip_pct
smoker No Yes All No Yes All
sex day
Female Fri 2.500000 2.000000 2.111111 0.165296 0.209129 0.199388
Sat 2.307692 2.200000 2.250000 0.147993 0.163817 0.156470
Sun 3.071429 2.500000 2.944444 0.165710 0.237075 0.181569
Thur 2.480000 2.428571 2.468750 0.155971 0.163073 0.157525
Male Fri 2.000000 2.125000 2.100000 0.138005 0.144730 0.143385
Sat 2.656250 2.629630 2.644068 0.162132 0.139067 0.151577
Sun 2.883721 2.600000 2.810345 0.158291 0.173964 0.162344
Thur 2.500000 2.300000 2.433333 0.165706 0.164417 0.165276
All 2.668874 2.408602 2.569672 0.159328 0.163196 0.160803
To use a different aggregation function, pass it as aggfunc:
In [12]: tips.pivot_table(['tip_pct', 'size'], index=['sex', 'day'], columns=['smoker'], margins=True, aggfunc=len)
Out[12]:
size tip_pct
smoker No Yes All No Yes All
sex day
Female Fri 2 7 9 2.0 7.0 9.0
Sat 13 15 28 13.0 15.0 28.0
Sun 14 4 18 14.0 4.0 18.0
Thur 25 7 32 25.0 7.0 32.0
Male Fri 2 8 10 2.0 8.0 10.0
Sat 32 27 59 32.0 27.0 59.0
Sun 43 15 58 43.0 15.0 58.0
Thur 20 10 30 20.0 10.0 30.0
All 151 93 244 151.0 93.0 244.0
So with aggfunc=len we get group frequencies (aggfunc='count' works too).
If some combinations are empty, you may want to supply a fill value with fill_value=0:
In [21]: tips.pivot_table(['size'], index=['time', 'sex', 'smoker'], columns=['day'], margins=True, aggfunc='sum', fill_value=0)
Out[21]:
size
day Fri Sat Sun Thur All
time sex smoker
Dinner Female No 2 30 43 2 77
Yes 8 33 10 0 51
Male No 4 85 124 0 213
Yes 12 71 39 0 122
Lunch Female No 3 0 0 60 63
Yes 6 0 0 17 23
Male No 0 0 0 50 50
Yes 5 0 0 23 28
All 40 219 216 152 627
Cross-Tabulations: Crosstab#
A crosstab is a special case of a pivot table that computes group frequencies.
pd.crosstab(data.x, data.y, margins=True)
The first two arguments to crosstab can each be an array, Series, or list of arrays.
In [5]: pd.crosstab(tips.sex, tips.smoker, margins=True)
Out[5]:
smoker No Yes All
sex
Female 54 33 87
Male 97 60 157
All 151 93 244
In [6]: pd.crosstab([tips.time, tips.day], tips.smoker, margins=True)
Out[6]:
smoker No Yes All
time day
Dinner Fri 3 9 12
Sat 45 42 87
Sun 57 19 76
Thur 1 0 1
Lunch Fri 1 6 7
Thur 44 17 61
All 151 93 244
Chapter 11: Time Series#
Important in many fields: finance, economics, ecology, neuroscience and physics.
Anything observed or measured at many points in time forms a time series.
- Fixed frequency: data points occur at regular intervals (eg. every 15 minutes)
- Irregular frequency: No fixed time between data points
How you mark time series data:
- Timestamps - instants in time
- Fixed periods - a month, or the year 2008
- Intervals - a start and end timestamp
- Experiment or elapsed time - time relative to another time
Date and Time Datatypes and Tools#
Python has the built-in datetime, time, and calendar modules. The datetime.datetime type is widely used:
In [1]: from datetime import datetime
In [2]: now = datetime.now()
In [3]: now
Out[3]: datetime.datetime(2018, 11, 8, 12, 36, 8, 869124)
In [4]: now.year, now.month, now.day
Out[4]: (2018, 11, 8)
datetime stores the date and time down to the microsecond.
timedelta represents the temporal difference between two datetime objects:
In [5]: delta = now - datetime(2011, 1, 7)
In [6]: delta
Out[6]: datetime.timedelta(2862, 45368, 869124)
In [7]: delta.days
Out[7]: 2862
In [8]: delta.seconds
Out[8]: 45368
You can add or subtract a timedelta to or from a datetime object to yield a new datetime:
In [9]: from datetime import timedelta
In [10]: start = datetime(2001, 1, 1)
In [11]: start + timedelta(12)
Out[11]: datetime.datetime(2001, 1, 13, 0, 0)
In [14]: start - 2 * timedelta(12)
Out[14]: datetime.datetime(2000, 12, 8, 0, 0)
Converting between string and datetime#
datetime objects and pandas Timestamp objects can be formatted as strings using str or the strftime method.
As a reminder, read strftime as string-format-time and strptime as string-parse-time.
In [16]: stamp = datetime(2011, 1, 12)
In [17]: str(stamp)
Out[17]: '2011-01-12 00:00:00'
In [18]: stamp.strftime('%Y-%m-%d')
Out[18]: '2011-01-12'
strptime converts strings to dates:
In [19]: value = '2018-11-11'
In [20]: datetime.strptime(value, '%Y-%m-%d')
Out[20]: datetime.datetime(2018, 11, 11, 0, 0)
If you don't know the format of a datetime string, you can use dateutil:
In [21]: from dateutil.parser import parse
In [22]: parse('2018-01-01 14:52')
Out[22]: datetime.datetime(2018, 1, 1, 14, 52)
dateutil is capable of parsing most human-intelligible date representations:
In [23]: parse('Jan 25, 2008 9:45 PM')
Out[23]: datetime.datetime(2008, 1, 25, 21, 45)
One gotcha is the day appearing before the month in some locales; dateutil assumes the month comes first unless you pass dayfirst=True (as always, America first is the way):
In [24]: parse('6/12/2011')
Out[24]: datetime.datetime(2011, 6, 12, 0, 0)
In [25]: parse('6/12/2011', dayfirst=True)
Out[25]: datetime.datetime(2011, 12, 6, 0, 0)
Pandas has a to_datetime() method that parses many kinds of date representations:
In [26]: datestrs = ['2017-01-12 12:00:00', '23 Jan, 2018 4:45 PM']
In [29]: pd.to_datetime(datestrs)
Out[29]: DatetimeIndex(['2017-01-12 12:00:00', '2018-01-23 16:45:00'], dtype='datetime64[ns]', freq=None)
It also handles missing times:
In [30]: pd.to_datetime(datestrs + [None])
Out[30]: DatetimeIndex(['2017-01-12 12:00:00', '2018-01-23 16:45:00', 'NaT'], dtype='datetime64[ns]', freq=None)
NaT means Not a Time.
Note that dateutil has some unexpected behaviours.
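For example, a bare number still parses as a date (a gotcha the book warns about):
from dateutil.parser import parse
parse('42')   # parsed as the year 2042, with today's month and day filled in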
Time Series Basics#
In [32]: dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 8), datetime(2011, 1, 12)]
In [35]: ts = pd.Series(np.random.randn(4), index=dates)
In [36]: ts
Out[36]:
2011-01-02 -0.640063
2011-01-05 -0.946634
2011-01-08 1.051946
2011-01-12 -0.637094
dtype: float64
The index has been made into a DatetimeIndex:
In [37]: ts.index
Out[37]: DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-08', '2011-01-12'], dtype='datetime64[ns]', freq=None)
Like other Series, arithmetic operations between timeseries align on dates:
In [39]: ts + ts[::2]
Out[39]:
2011-01-02 -1.280125
2011-01-05 NaN
2011-01-08 2.103892
2011-01-12 NaN
dtype: float64
Pandas stores timestamps with numpy's datetime64 datatype at nanosecond resolution:
In [41]: ts.index.dtype
Out[41]: dtype('<M8[ns]')
Scalar values from a DatetimeIndex are pandas Timestamp objects:
In [42]: ts.index[0]
Out[42]: Timestamp('2011-01-02 00:00:00')
A Timestamp can be substituted anywhere you would use a datetime object.
Indexing, Selection and Subsetting#
Time Series behave the same as any other Series in pandas
In [47]: ts.index[2]
Out[47]: Timestamp('2011-01-08 00:00:00')
In [48]: ts[ts.index[2]]
Out[48]: 1.051945822754756
For convenience you can use a string as a date:
In [49]: ts['2011/01/08']
Out[49]: 1.051945822754756
A longer time series:
In [50]: longer_ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
In [51]: longer_ts.head()
Out[51]:
2000-01-01 -1.554534
2000-01-02 -1.860764
2000-01-03 -2.241509
2000-01-04 -1.096584
2000-01-05 -2.403843
Freq: D, dtype: float64
You can select slices of data by just part of the date:
In [53]: longer_ts['2000-03'].head()
Out[53]:
2000-03-01 -0.240239
2000-03-02 0.508818
2000-03-03 -0.432612
2000-03-04 0.497631
2000-03-05 -1.356072
Freq: D, dtype: float64
Slicing with datetime objects works too:
In [60]: longer_ts[datetime(2002, 9, 24):]
Out[60]:
2002-09-24 0.200162
2002-09-25 0.377376
2002-09-26 -0.247723
Freq: D, dtype: float64
Because most time series data is ordered, you can slice with timestamps not contained in the series to perform a range query:
In [62]: longer_ts['1999-01-01': '2000-1-10']
Out[62]:
2000-01-01 -1.554534
2000-01-02 -1.860764
2000-01-03 -2.241509
2000-01-04 -1.096584
2000-01-05 -2.403843
2000-01-06 1.392809
2000-01-07 -0.552810
2000-01-08 -0.221427
2000-01-09 0.994283
2000-01-10 -0.335201
Freq: D, dtype: float64
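There is also a truncate instance method that slices between two dates, e.g. with the same longer_ts:
longer_ts.truncate(before='2000-01-01', after='2000-01-10')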
The same indexing holds true for dataframes:
In [64]: dates = pd.date_range('1/1/2001', periods=100, freq='W-WED')
In [66]: long_df = pd.DataFrame(np.random.randn(100, 3), index=dates, columns=['JHB', 'CPT', 'DBN'])
In [69]: long_df.head()
Out[69]:
JHB CPT DBN
2001-01-03 -0.888841 1.500003 0.015592
2001-01-10 1.441082 -0.293374 -0.243195
2001-01-17 0.500772 3.376711 0.269602
2001-01-24 0.348858 1.043463 -0.363716
2001-01-31 -0.032118 -0.245414 -0.495472
In [72]: long_df.loc['2002-05']
Out[72]:
JHB CPT DBN
2002-05-01 0.440993 -0.553701 0.732000
2002-05-08 -0.595253 -0.068814 0.222021
2002-05-15 -1.180922 2.071676 -0.077351
2002-05-22 -0.036615 1.090348 -0.640301
2002-05-29 1.025804 0.070982 -1.463681
Time Series with Duplicate Indices#
There may be multiple data observations for a particular timestamp:
In [74]: dates = pd.DatetimeIndex(['1-1-2011', '1-2-2011', '1-2-2011', '1-2-2011', '1-5-2011'])
In [75]: dup_ts = pd.Series(np.arange(5), index=dates)
In [76]: dup_ts
Out[76]:
2011-01-01 0
2011-01-02 1
2011-01-02 2
2011-01-02 3
2011-01-05 4
dtype: int64
Check if the index is unique with:
In [79]: dup_ts.index.is_unique
Out[79]: False
Slicing this series either returns scalars or slices depending on the timestamp:
In [83]: dup_ts['1-1-2011']
Out[83]: 0
In [84]: dup_ts['1-2-2011']
Out[84]:
2011-01-02 1
2011-01-02 2
2011-01-02 3
dtype: int64
Ya I know, the strings were parsed as American dates, so the first number is the month, which is why everything stays in January.
To aggregate data having non-unique timestamps, use groupby with level=0:
In [85]: grouped = dup_ts.groupby(level=0)
In [86]: grouped.mean()
Out[86]:
2011-01-01 0
2011-01-02 2
2011-01-05 4
dtype: int64
In [87]: grouped.count()
Out[87]:
2011-01-01 1
2011-01-02 3
2011-01-05 1
dtype: int64
Date Ranges, Frequencies and Shifting#
Generic time series in pandas are assumed to be irregular (they have no fixed frequency). Sometimes it is desirable to work with a fixed frequency such as daily, monthly, or every 15 minutes, even if that means introducing missing values into your time series.
Fortunately pandas has tools for resampling, inferring frequencies, and generating fixed-frequency date ranges.
Converting a time series to a fixed frequency can be done with resample:
In [88]: ts
Out[88]:
2011-01-02 -0.640063
2011-01-05 -0.946634
2011-01-08 1.051946
2011-01-12 -0.637094
dtype: float64
In [89]: resampler = ts.resample('D')
In [91]: resampler
Out[91]: DatetimeIndexResampler [freq=<Day>, axis=0, closed=left, label=left, convention=start, base=0]
This doesn't seem to do anything yet: resample returns a lazy Resampler object, and nothing is computed until you call an aggregation on it.
Generating Date Ranges#
pandas.date_range is responsible for generating a DatetimeIndex with a specified length and frequency:
In [92]: index = pd.date_range('7-11-2018', '25-12-2018')
In [93]: index
Out[93]:
DatetimeIndex(['2018-07-11', '2018-07-12', '2018-07-13', '2018-07-14',
'2018-07-15', '2018-07-16', '2018-07-17', '2018-07-18',
'2018-07-19', '2018-07-20',
...
'2018-12-16', '2018-12-17', '2018-12-18', '2018-12-19',
'2018-12-20', '2018-12-21', '2018-12-22', '2018-12-23',
'2018-12-24', '2018-12-25'],
dtype='datetime64[ns]', length=168, freq='D')
By default date_range generates daily timestamps. If you pass just a start or end date, you need to set periods, the number of periods to generate:
In [94]: pd.date_range('7-11-2018', periods=5)
Out[94]:
DatetimeIndex(['2018-07-11', '2018-07-12', '2018-07-13', '2018-07-14',
'2018-07-15'],
dtype='datetime64[ns]', freq='D')
The start and end dates specify strict boundaries. If you wanted a date index with the last business day of each month, you would pass the BM frequency, and only dates falling on or inside the date interval will be included:
In [95]: pd.date_range('2000-01-01', '2000-12-01', freq='BM')
Out[95]:
DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', '2000-04-28',
'2000-05-31', '2000-06-30', '2000-07-31', '2000-08-31',
'2000-09-29', '2000-10-31', '2000-11-30'],
dtype='datetime64[ns]', freq='BM')
Feel the power
date_range preserves the time of the start or end date:
In [97]: pd.date_range(start='2000-01-01 14:45', periods=20)
Out[97]:
DatetimeIndex(['2000-01-01 14:45:00', '2000-01-02 14:45:00',
'2000-01-03 14:45:00', '2000-01-04 14:45:00',
'2000-01-05 14:45:00', '2000-01-06 14:45:00',
'2000-01-07 14:45:00', '2000-01-08 14:45:00',
'2000-01-09 14:45:00', '2000-01-10 14:45:00',
'2000-01-11 14:45:00', '2000-01-12 14:45:00',
'2000-01-13 14:45:00', '2000-01-14 14:45:00',
'2000-01-15 14:45:00', '2000-01-16 14:45:00',
'2000-01-17 14:45:00', '2000-01-18 14:45:00',
'2000-01-19 14:45:00', '2000-01-20 14:45:00'],
dtype='datetime64[ns]', freq='D')
Sometimes you may want to normalise the times to midnight:
In [98]: pd.date_range(start='2000-01-01 14:45', periods=5, normalize=True)
Out[98]:
DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
'2000-01-05'],
dtype='datetime64[ns]', freq='D')
Frequencies and Date Offsets#
Frequencies are composed of a base frequency and a multiplier. Base frequencies have a string alias: M for monthly, H for hourly.
For each base frequency there is an object called a date offset:
In [99]: from pandas.tseries.offsets import Hour, Minute
In [100]: hour = Hour()
In [101]: four_hours = Hour(4)
In [102]: four_hours
Out[102]: <4 * Hours>
In most applications you would never need to create these objects explicitly; instead use a string alias like H or 4H:
In [103]: pd.date_range('2000-01-01', '2000-01-03 23:59', freq='4H')
Out[103]:
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 04:00:00',
'2000-01-01 08:00:00', '2000-01-01 12:00:00',
'2000-01-01 16:00:00', '2000-01-01 20:00:00',
'2000-01-02 00:00:00', '2000-01-02 04:00:00',
'2000-01-02 08:00:00', '2000-01-02 12:00:00',
'2000-01-02 16:00:00', '2000-01-02 20:00:00',
'2000-01-03 00:00:00', '2000-01-03 04:00:00',
'2000-01-03 08:00:00', '2000-01-03 12:00:00',
'2000-01-03 16:00:00', '2000-01-03 20:00:00'],
dtype='datetime64[ns]', freq='4H')
Most offsets can be combined with addition:
In [104]: Hour(2) + Minute(30)
Out[104]: <150 * Minutes>
You can also pass frequency strings like 2h30min:
In [106]: pd.date_range('2000-01-01', periods=10, freq='2h30min')
Out[106]:
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 02:30:00',
'2000-01-01 05:00:00', '2000-01-01 07:30:00',
'2000-01-01 10:00:00', '2000-01-01 12:30:00',
'2000-01-01 15:00:00', '2000-01-01 17:30:00',
'2000-01-01 20:00:00', '2000-01-01 22:30:00'],
dtype='datetime64[ns]', freq='150T')
Take note: min, not m.
Some frequencies' points are not evenly spaced in time, such as calendar month end (M) and last business day of the month (BM). These are known as anchored offsets.
- D - calendar day
- B - business day
- H - hourly
- T or min - every minute
- S - every second
- L or ms - every millisecond
- U - every microsecond
- M - month end
- BM - business month end
- MS - month start
- BMS - business month start
- W-MON, W-TUE, ... - weekly on a given day
- WOM-3FRI - week of month (e.g. the third Friday of every month)
- Q-JAN - quarterly, anchored on the last calendar day of the given month
- BQ-FEB - quarterly, anchored on the last business day of the month
- QS-JAN, BQS-JAN - quarterly, anchored on the first calendar/business day
- A-JAN - annually, on the last calendar day of the given month
- BA-JAN - annually, on the last business day of the given month
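For example, WOM-3FRI generates the third Friday of every month (example adapted from the book):
pd.date_range('2012-01-01', '2012-09-01', freq='WOM-3FRI')
# DatetimeIndex(['2012-01-20', '2012-02-17', '2012-03-16', ...], freq='WOM-3FRI')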
Shifting (Leading and Lagging) Data#
Shifting refers to moving data backward or forward through time while keeping the index unmodified:
In [107]: ts = pd.Series(np.random.randn(4), index=pd.date_range('1-1-2000', periods=4, freq='M'))
In [108]: ts
Out[108]:
2000-01-31 0.502127
2000-02-29 -1.085350
2000-03-31 -0.390934
2000-04-30 0.702752
Freq: M, dtype: float64
In [109]: ts.shift(2)
Out[109]:
2000-01-31 NaN
2000-02-29 NaN
2000-03-31 0.502127
2000-04-30 -1.085350
Freq: M, dtype: float64
In [110]: ts.shift(-2)
Out[110]:
2000-01-31 -0.390934
2000-02-29 0.702752
2000-03-31 NaN
2000-04-30 NaN
Freq: M, dtype: float64
A common use of shift is calculating changes over time; ts / ts.shift(1) gives the period-over-period ratio (subtract 1 for the percent change):
In [111]: ts / ts.shift(1)
Out[111]:
2000-01-31 NaN
2000-02-29 -2.161507
2000-03-31 0.360192
2000-04-30 -1.797623
Freq: M, dtype: float64
shift leaves the index unmodified, so some data is discarded. If the frequency is known, it can be passed to shift to advance the timestamps instead of simply the data:
In [113]: ts.shift(2, freq='M')
Out[113]:
2000-03-31 0.502127
2000-04-30 -1.085350
2000-05-31 -0.390934
2000-06-30 0.702752
Freq: M, dtype: float64
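Other frequencies can be passed as well (following the book's pattern):
ts.shift(3, freq='D')    # move the timestamps forward 3 days
ts.shift(1, freq='90T')  # move the timestamps forward 90 minutes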
Shifting Dates with Offsets#
Pandas date offsets can also be used with Timestamp or datetime objects:
In [115]: from pandas.tseries.offsets import Day, MonthEnd
In [116]: now = datetime(2018, 1, 1)
In [117]: now + 3 * Day()
Out[117]: Timestamp('2018-01-04 00:00:00')
If you add an anchored offset like MonthEnd, the first increment will roll forward a date to the next anchor point:
In [118]: now + MonthEnd()
Out[118]: Timestamp('2018-01-31 00:00:00')
In [119]: now + MonthEnd(2)
Out[119]: Timestamp('2018-02-28 00:00:00')
You can explicitly roll dates with an offset's rollforward and rollback methods:
In [120]: offset = MonthEnd()
In [121]: offset.rollback(now)
Out[121]: Timestamp('2017-12-31 00:00:00')
In [122]: offset.rollforward(now)
Out[122]: Timestamp('2018-01-31 00:00:00')
You can use these rolling methods with groupby:
ts.groupby(offset.rollforward).mean()
But it is easier and faster to use resample:
ts.resample('M').mean()
Timezone Handling#
Many choose to work in UTC (coordinated universal time), the successor to GMT. Timezones are expressed as offsets from UTC.
pytz is the package to use for timezones:
In [123]: import pytz
In [125]: pytz.common_timezones[:5]
Out[125]:
['Africa/Abidjan',
'Africa/Accra',
'Africa/Addis_Ababa',
'Africa/Algiers',
'Africa/Asmara']
In [126]: tz = pytz.timezone('Africa/Johannesburg')
In [127]: tz
Out[127]: <DstTzInfo 'Africa/Johannesburg' LMT+1:52:00 STD>
Timezone localization and conversion#
By default, time series are timezone-naive:
In [128]: rng = pd.date_range('3-9-2012 9:30', periods=6, freq='D')
In [129]: ts = pd.Series(np.random.randn(len(rng)), index=rng)
In [131]: print(ts.index.tz)
None
Date ranges can be generated with a timezone set:
In [133]: pd.date_range(start='2017-01-02', periods=6, freq='B', tz='UTC')
Out[133]:
DatetimeIndex(['2017-01-02', '2017-01-03', '2017-01-04', '2017-01-05',
'2017-01-06', '2017-01-09'],
dtype='datetime64[ns, UTC]', freq='B')
Converting from naive to localised is handled by tz_localize:
In [135]: ts
Out[135]:
2012-03-09 09:30:00 0.696816
2012-03-10 09:30:00 -0.905538
2012-03-11 09:30:00 1.260273
2012-03-12 09:30:00 -0.787496
2012-03-13 09:30:00 1.262290
2012-03-14 09:30:00 1.451650
Freq: D, dtype: float64
In [136]: ts_utc = ts.tz_localize('UTC')
In [137]: ts_utc
Out[137]:
2012-03-09 09:30:00+00:00 0.696816
2012-03-10 09:30:00+00:00 -0.905538
2012-03-11 09:30:00+00:00 1.260273
2012-03-12 09:30:00+00:00 -0.787496
2012-03-13 09:30:00+00:00 1.262290
2012-03-14 09:30:00+00:00 1.451650
Freq: D, dtype: float64
In [138]: ts_utc.index.tz
Out[138]: <UTC>
Once a time series has been localised, it can be converted to another timezone using tz_convert:
In [139]: ts_utc.tz_convert('America/New_York')
Out[139]:
2012-03-09 04:30:00-05:00 0.696816
2012-03-10 04:30:00-05:00 -0.905538
2012-03-11 05:30:00-04:00 1.260273
2012-03-12 05:30:00-04:00 -0.787496
2012-03-13 05:30:00-04:00 1.262290
2012-03-14 05:30:00-04:00 1.451650
Freq: D, dtype: float64
In [140]: ts_utc.tz_convert('Africa/Johannesburg')
Out[140]:
2012-03-09 11:30:00+02:00 0.696816
2012-03-10 11:30:00+02:00 -0.905538
2012-03-11 11:30:00+02:00 1.260273
2012-03-12 11:30:00+02:00 -0.787496
2012-03-13 11:30:00+02:00 1.262290
2012-03-14 11:30:00+02:00 1.451650
Freq: D, dtype: float64
Operations with Timezone-aware Timestamp Objects#
In [141]: stamp = pd.Timestamp('2018-11-08 8:46 PM')
In [144]: stamp_za = stamp.tz_localize('Africa/Johannesburg')
In [147]: stamp_za.tz_convert('Australia/Adelaide')
Out[147]: Timestamp('2018-11-09 05:16:00+1030', tz='Australia/Adelaide')
The cricket tomorrow starts at 4:50 AM South African time:
In [148]: stamp = pd.Timestamp('2018-11-9 4:50 AM')
In [149]: stamp_za = stamp.tz_localize('Africa/Johannesburg')
In [150]: stamp_adelaide = stamp_za.tz_convert('Australia/Adelaide')
In [151]: stamp_adelaide
Out[151]: Timestamp('2018-11-09 13:20:00+1030', tz='Australia/Adelaide')
That means it starts at 13:20 for them.
You can also create an aware timestamp with:
In [152]: za_time = pd.Timestamp(datetime.now(), tz='Africa/Johannesburg')
In [153]: za_time
Out[153]: Timestamp('2018-11-08 20:53:35.386909+0200', tz='Africa/Johannesburg')
Timestamps hold UTC nanoseconds since the epoch (1970-01-01), which is the same no matter the timezone:
In [155]: stamp.value
Out[155]: 1541739000000000000
Operations between 2 Timezones#
If two time series with different timezones are combined, the result will be UTC. Since timestamps are stored in UTC under the hood, no conversion is required.
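A minimal sketch of this (following the book's example):
rng = pd.date_range('3/7/2012 9:30', periods=10, freq='B')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts1 = ts[:7].tz_localize('Europe/London')
ts2 = ts1[2:].tz_convert('Europe/Moscow')
result = ts1 + ts2
result.index   # DatetimeIndex(..., dtype='datetime64[ns, UTC]', freq='B')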
Periods and period Arithmetic#
Period Frequency Conversion#
Quarterly Period Frequencies#
Converting Timestamps to Periods and Back#
Creating a PeriodIndex from Arrays#
Lots of stuff in the book…
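For a flavour of what's covered there, a minimal sketch of the Period API (standard pandas, examples mine):
p = pd.Period(2007, freq='A-DEC')   # the full year 2007
p + 5                               # Period('2012', 'A-DEC')
p.asfreq('M', how='start')          # Period('2007-01', 'M')
rng = pd.period_range('2000-01-01', '2000-06-30', freq='M')
pd.Series(np.random.randn(6), index=rng)   # a period-indexed series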
Resampling and Frequency Conversion#
Resampling is the process of converting a time series from one frequency to another.
Going from a higher frequency to a lower frequency is called downsampling; converting a lower frequency to a higher frequency is called upsampling.
resample is the workhorse function. It has a similar API to groupby:
In [159]: rng = pd.date_range('2000-01-01', periods=100, freq='D')
In [160]: ts = pd.Series(np.random.randn(len(rng)), index=rng)
In [162]: ts.head()
Out[162]:
2000-01-01 -1.243879
2000-01-02 1.536819
2000-01-03 -0.941166
2000-01-04 0.472657
2000-01-05 0.269019
Freq: D, dtype: float64
We can resample the data to months (group by month) and take the mean:
In [163]: ts.resample('M')
Out[163]: DatetimeIndexResampler [freq=<MonthEnd>, axis=0, closed=right, label=right, convention=start, base=0]
In [164]: ts.resample('M').mean()
Out[164]:
2000-01-31 0.114949
2000-02-29 0.260392
2000-03-31 0.260345
2000-04-30 -0.100887
Freq: M, dtype: float64
Resample parameters:
- freq - the resampled frequency
- axis=0 - the axis to resample on
- fill_method - how to interpolate when upsampling, e.g. ffill or bfill
- closed - when downsampling, which end of each interval is closed (inclusive): right or left
- kind - aggregate to timestamp or period
Downsampling#
Aggregating data to a regular, lower frequency is a common task. The desired frequency indicates bin edges used to slice the time series into pieces.
So you need to think about which end of each interval is closed and how to label each bin (with the start or the end of the interval).
In [165]: rng = pd.date_range('2000-01-01', periods=12, freq='T')
In [166]: ts = pd.Series(np.arange(12), index=rng)
In [167]: ts
Out[167]:
2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 00:07:00 7
2000-01-01 00:08:00 8
2000-01-01 00:09:00 9
2000-01-01 00:10:00 10
2000-01-01 00:11:00 11
Freq: T, dtype: int64
Suppose you want to aggregate the data into five-minute chunks or bars by taking the sum of each group:
In [169]: ts.resample('5min').sum()
Out[169]:
2000-01-01 00:00:00 10
2000-01-01 00:05:00 35
2000-01-01 00:10:00 21
Freq: 5T, dtype: int64
The freq defines the bin edges, so 5-minute increments here.
By default the left bin edge is inclusive, so the 00:00 value is included in the 00:00 to 00:05 interval.
If you instead want the 00:00 bin to cover the data from just after 00:00 up to and including 00:05, close on the right:
In [168]: ts.resample('5min', closed='right').sum()
Out[168]:
1999-12-31 23:55:00 0
2000-01-01 00:00:00 15
2000-01-01 00:05:00 40
2000-01-01 00:10:00 11
Freq: 5T, dtype: int64
The resulting timestamps are the labels from the left side of each bin. You can change that to use the right side with label='right':
:
In [170]: ts.resample('5min', closed='right', label='right').sum()
Out[170]:
2000-01-01 00:00:00 0
2000-01-01 00:05:00 15
2000-01-01 00:10:00 40
2000-01-01 00:15:00 11
Freq: 5T, dtype: int64
Sometimes you may want to shift the result index by some amount, for example subtracting one second from the right edge to make it clearer which interval each timestamp refers to. To do this, pass an offset to loffset:
In [171]: ts.resample('5min', closed='right', label='right', loffset='-1s').sum()
Out[171]:
1999-12-31 23:59:59 0
2000-01-01 00:04:59 15
2000-01-01 00:09:59 40
2000-01-01 00:14:59 11
Freq: 5T, dtype: int64
This could have been done by using shift on the result instead.
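For example (a sketch, not the book's code):
ts.resample('5min', closed='right', label='right').sum().shift(-1, freq='S')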
Open-High-Low-Close (OHLC) Resampling#
In finance it is common to compute four values for each bucket: the open (first), high (max), low (min), and close (last).
The ohlc() method calculates these for you:
In [174]: ts.resample('5min').ohlc()
Out[174]:
open high low close
2000-01-01 00:00:00 0 4 0 4
2000-01-01 00:05:00 5 9 5 9
2000-01-01 00:10:00 10 11 10 11
Upsampling and Interpolation#
When converting from a lower frequency to a higher frequency, no aggregation is needed.
In [175]: frame = pd.DataFrame(np.random.randn(2, 4), index=pd.date_range('1-1-2000', periods=2, freq='W-WED'),
...: columns=['Colorado', 'Texas', 'New York', 'Ohio'])
In [176]: frame
Out[176]:
Colorado Texas New York Ohio
2000-01-05 1.140193 -0.729382 1.102285 0.633958
2000-01-12 -0.890363 0.472829 0.128073 0.183249
Aggregating the data (which has at most one value per group) results in empty groups, whose sums are reported as 0 by default, it looks like:
In [178]: df_daily = frame.resample('D').sum()
In [179]: df_daily
Out[179]:
Colorado Texas New York Ohio
2000-01-05 1.140193 -0.729382 1.102285 0.633958
2000-01-06 0.000000 0.000000 0.000000 0.000000
2000-01-07 0.000000 0.000000 0.000000 0.000000
2000-01-08 0.000000 0.000000 0.000000 0.000000
2000-01-09 0.000000 0.000000 0.000000 0.000000
2000-01-10 0.000000 0.000000 0.000000 0.000000
2000-01-11 0.000000 0.000000 0.000000 0.000000
2000-01-12 -0.890363 0.472829 0.128073 0.183249
Suppose you want to fill forward:
In [181]: frame.resample('D').ffill()
Out[181]:
Colorado Texas New York Ohio
2000-01-05 1.140193 -0.729382 1.102285 0.633958
2000-01-06 1.140193 -0.729382 1.102285 0.633958
2000-01-07 1.140193 -0.729382 1.102285 0.633958
2000-01-08 1.140193 -0.729382 1.102285 0.633958
2000-01-09 1.140193 -0.729382 1.102285 0.633958
2000-01-10 1.140193 -0.729382 1.102285 0.633958
2000-01-11 1.140193 -0.729382 1.102285 0.633958
You can limit how far to fill:
In [182]: frame.resample('D').ffill(limit=2)
Out[182]:
Colorado Texas New York Ohio
2000-01-05 1.140193 -0.729382 1.102285 0.633958
2000-01-06 1.140193 -0.729382 1.102285 0.633958
2000-01-07 1.140193 -0.729382 1.102285 0.633958
2000-01-08 NaN NaN NaN NaN
2000-01-09 NaN NaN NaN NaN
2000-01-10 NaN NaN NaN NaN
2000-01-11 NaN NaN NaN NaN
2000-01-12 -0.890363 0.472829 0.128073 0.183249
Notably the new date index need not overlap with the old one at all:
In [188]: frame.resample('W-THU').ffill()
Out[188]:
Colorado Texas New York Ohio
2000-01-06 1.140193 -0.729382 1.102285 0.633958
2000-01-13 -0.890363 0.472829 0.128073 0.183249
Resampling with Periods#
In [199]: frame = pd.DataFrame(np.random.randn(24, 4), index=pd.date_range('1-2000', '1-2002', freq='M'), columns=['Colorado', 'Texas', 'New York', 'Ohio'])
In [201]: frame[:5]
Out[201]:
            Colorado     Texas  New York      Ohio
2000-01-31  0.623972  0.526449  0.752766 -0.150275
2000-02-29  0.423840 -0.890684  0.222549 -0.562278
2000-03-31  1.157608 -0.395766  0.619686 -0.919198
2000-04-30  1.864226 -0.399488  0.613920  1.416122
2000-05-31  1.125638  0.488510 -0.829483 -1.558244
In [202]: annual_frame = frame.resample('A-DEC').mean()
In [203]: annual_frame
Out[203]:
Colorado Texas New York Ohio
2000-12-31 0.306742 -0.360456 -0.102797 -0.286532
2001-12-31 -0.033584 0.165718 -0.418406 0.107400
You must decide which end of the timespan to place the values before resampling.
In [204]: annual_frame.resample('Q-DEC').ffill()
Out[204]:
Colorado Texas New York Ohio
2000-12-31 0.306742 -0.360456 -0.102797 -0.286532
2001-03-31 0.306742 -0.360456 -0.102797 -0.286532
2001-06-30 0.306742 -0.360456 -0.102797 -0.286532
2001-09-30 0.306742 -0.360456 -0.102797 -0.286532
2001-12-31 -0.033584 0.165718 -0.418406 0.107400
It defaults to end, but you can set convention='start':
In [205]: annual_frame.resample('Q-DEC', convention='start').ffill()
Out[205]:
Colorado Texas New York Ohio
2000-12-31 0.306742 -0.360456 -0.102797 -0.286532
2001-03-31 0.306742 -0.360456 -0.102797 -0.286532
2001-06-30 0.306742 -0.360456 -0.102797 -0.286532
2001-09-30 0.306742 -0.360456 -0.102797 -0.286532
2001-12-31 -0.033584 0.165718 -0.418406 0.107400
Not sure what the above did… the output is identical; the convention argument only applies when resampling periods, and this frame has a plain DatetimeIndex.
The rules are stricter when resampling periods:
- downsampling - the target frequency must be a subperiod of the source
- upsampling - the target frequency must be a superperiod of the source
This mainly affects quarterly, annual and weekly frequencies.
Moving Window Functions#
Rolling (moving window) statistics, such as rolling averages.
The rolling operator behaves similarly to resample and groupby: it can be called on a Series or DataFrame along with a window (a number of periods).
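close_px below comes from the book's stock-price dataset (stock_px.csv), which isn't reproduced in these notes; a synthetic stand-in could be built like this (names and sizes assumed for illustration):
import numpy as np
import pandas as pd

# Random-walk "prices" on business days, as a stand-in for the book's close_px
dates = pd.date_range('2003-01-01', periods=1000, freq='B')
close_px = pd.DataFrame(np.abs(np.random.randn(len(dates), 3)).cumsum(axis=0) + 50,
                        index=dates, columns=['AAPL', 'MSFT', 'XOM'])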
aapl_std_250 = close_px.AAPL.rolling(250, min_periods=10).std()
To compute an expanding window mean, use expanding() instead:
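For instance, continuing from aapl_std_250 above:
expanding_mean = aapl_std_250.expanding().mean()   # window starts at the series start and grows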
Calling rolling followed by an aggregation on a DataFrame applies the transformation to every column:
close_px.rolling(60).mean().plot(logy=True);
The above plots the 60-day moving average of every column on a log y-axis.
Exponentially Weighted Functions#
To give more weight to more recent observations, you specify a constant decay factor. A popular way to specify the decay factor is using a span, which makes the result comparable to a simple moving window function with window size equal to the span.
Pandas has an ewm operator for this:
ewm60 = aapl_px.ewm(span=30).mean()
ewm60.plot(style='k--', label='EW MA');
Some more hectic stuff in the book
Chapter 12: Advanced Numpy#
I skipped this chapter because it is probably a bit chunky and I’m not in the mood