Effective Python
Effective Python Summary Notes#
I’ve been wanting to learn and improve my Python, so I read Effective Python by Brett Slatkin and took these notes.
He also has some good tech reads on his blog.
Pythonic Thinking#
1. Know which python version you are using#
$ python --version
or
>>> import sys
>>> print(sys.version_info)
sys.version_info(major=3, minor=6, micro=4, releaselevel='final', serial=0)
>>> print(sys.version)
3.6.4 (default, Mar 9 2018, 23:15:03)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)]
- Prefer python 3 for new projects.
- There are many runtimes: CPython, Jython, IronPython, PyPy. The default is CPython.
2. Follow PEP8 Style Guide#
Consistent style makes code more approachable and easier to read, and it facilitates collaboration
3. Know the Differences Between bytes, str, and unicode#
In python 3 there is bytes
and str
.
str
contain unicode values
bytes
contain raw 8-bit values
- You need to use encode and decode to convert unicode to bytes
- Do encoding and decoding at the furthest boundary of the interface (so the core of the program works with unicode)
- bytes and str instances are never equivalent (In python 3)
- File handles (using
open
) default to UTF-8 encoding
Ensure you use wb
(write-binary) mode as opposed to just w
(write-character) mode:
with open('/tmp/random.bin', 'wb') as f:
When reading a file you can specify the mode:
with open('data.bin', 'r', encoding='cp1252') as f:
....
Get the default encoding on your system:
python3 -c 'import locale; print(locale.getpreferredencoding())'
UTF-8
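To do the conversion at the boundary, the book suggests a pair of helper functions (a minimal sketch, assuming UTF-8):
def to_str(bytes_or_str):
    '''Accept bytes or str and always return str'''
    if isinstance(bytes_or_str, bytes):
        return bytes_or_str.decode('utf-8')
    return bytes_or_str

def to_bytes(bytes_or_str):
    '''Accept bytes or str and always return bytes'''
    if isinstance(bytes_or_str, str):
        return bytes_or_str.encode('utf-8')
    return bytes_or_str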
4. Prefer Interpolated f-strings over C-style format and str.format#
C-style format:
a = 0b10111011
b = 0xc5f
print('Binary is %d, hex is %d' % (a,b))
Binary is 187, hex is 3167
Problems with C-style format:
- Changing the order of the tuple makes the expression fail; changing the format string while keeping the tuple order gives the same error
- Tuple and format becomes long forcing splitting across lines - hurting readability
- Using the same value multiple times, must duplicate in tuple
- redundancy in dictionaries
Advanced string formatting:
a = 1234.5678
formatted = format(a, ',.2f')
print(formatted)
1,234.57
Instead of c-style formatting you can use placeholders {}
, which are replaced by positional arguments:
key = 'my_key'
value = 1244
'{} = {}'.format(key, value)
my_key = 1244
You can optionally format the placeholder with a colon character:
formatted = '{:<10} = {:.2f}'.format(key, value)
print(formatted)
my_key = 1244.00
The formatting per class can be customised per class with __format__
With C-style and str.format formatting you need to double the special character to ensure it is not interpreted as a placeholder.
print('%.2f%%' % 12.5)
12.50%
print('{} replaces {{}}'.format(1.23))
1.23 replaces {}
Positional index can be specified when formatted:
formatted = '{1} = {0}'.format(key, value)
print(formatted)
1244 = my_key
The same index can be used multiple times, not needing the value to be passed again.
formatted = '{0} loves food. {0} loves eating.'.format('john')
print(formatted)
john loves food. john loves eating.
Using
str.format
is not recommended
Interpolated format strings#
Python 3.6 added interpolated format strings.
You precede the string with f
, like b
for byte-strings and r
for raw unescaped strings.
f-strings remove the redundancy of declaring the strings to be formatted.
formatted = f'{key} = {value}'
my_key = 1244
The format specifiers are still available
formatted = f'{key!r:<10} = {value:.2f}'
'my_key' = 1244.00
Comparison:
f_string = f'{key:<10} = {value:.2f}'
c_tuple = '%-10s = %.2f' % (key, value)
str_args = '{:<10} = {:.2f}'.format(key, value)
str_kw = '{key:<10} = {value:.2f}'.format(key=key, value=value)
c_dict = '%(key)-10s = %(value).2f' % {'key': key, 'value': value}
F-strings let you put a full python expression in the placeholder
pantry = [('plums', 2), ('horse raddish', 1), ('corn', 4)]
for i, (item, count) in enumerate(pantry):
f_string = f'#{i+1}: {item.title():<15s} = {round(count)}'
print(f_string)
Results:
$ python f_string.py
#1: Plums = 2
#2: Horse Raddish = 1
#3: Corn = 4
You can even use a parameterised format:
places = 3
number = 1.23456
print(f'My number is {number:.{places}f}')
My number is 1.235
Choose f-strings
5. Write helper functions, instead of complex expressions#
Consider:
red = int(my_values.get('red', [''])[0] or 0)
This code is not obvious. There is a lot of visual noise and it is not approachable.
You could use a ternary
:
red = my_values.get('red', [''])
red = int(red[0]) if red[0] else 0
but it is still not great.
So a helper function:
def get_first_int(values, key, default=0):
found = values.get(key, [''])
if found[0]:
found = int(found[0])
else:
found = default
return found
and calling:
green = get_first_int(my_values, 'green')
is much clearer.
- Move complex expressions into a helper function, especially when the logic is repeated
6. Prefer Multiple Assignment Unpacking over Indexing#
Python uses tuples
to create immutable, ordered sequences of values.
snack_calories = {
'chips': 140,
'popcorn': 80,
'nuts': 190,
}
items = tuple(snack_calories.items())
print(items)
- You can access an item in a tuple with the index.
- Once created, you cannot modify a tuple value
Unpacking#
Python has syntax to let you unpack a tuple into variables.
item = ('Peanut butter', 'Jelly')
first, second = item # Unpacking
print(first, 'and', second)
There is less noise in the code than using indexes
The same unpacking logic applies to more complex structures: unpacking lists of tuples
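For example, unpacking a list of tuples directly in a for loop (a small sketch):
snacks = [('bacon', 350), ('donut', 240), ('muffin', 190)]
for name, calories in snacks:
    print(f'{name} has {calories} calories')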
You can also swap items in place:
a[i -1], a[i] = a[i], a[i - 1]
There is usually no need to access anything by indexes
Avoid indexing where possible
7. Prefer Enumerate over Range#
If you need the index use enumerate
, Python enumerate
wraps any iterator with a lazy generator
As opposed to:
for i in range(len(flavor_list)):
    flavor = flavor_list[i]
    print(f'{i + 1}: {flavor}')
consider (and setting where enumerate should start counting from):
for i, flavor in enumerate(flavor_list, 1):
    print(f'{i}: {flavor}')
enumerate
yields pairs of the loop index and the next value from the given iterator.
flavour_list = ['chocolate', 'strawberry', 'bubblegum']
it = enumerate(flavour_list)
print(next(it))
(0, 'chocolate')
print(next(it))
(1, 'strawberry')
print(next(it))
(2, 'bubblegum')
That is why unpacking works:
for index, value in enumerate(flavour_list):
...
8. Use zip to process iterators in parallel#
names = ['Cecilia', 'Lise', 'Marie']
letters = [len(n) for n in names]
For processing a list and a derived list simultaneously you can use enumerate
to get the index:
longest_name = None
max_letters = 0
for i, name in enumerate(names):
    count = letters[i]
    if count > max_letters:
        longest_name = name
        max_letters = count
But python provides zip
, that wraps 2 or more iterators with a lazy generator.
The zip generator yields tuples containing the next value from each iterator
for name, count in zip(names, letters):
if count > max_letters:
longest_name = name
max_letters = count
print(longest_name)
- If the iterators supplied are not the same length, it keeps going until 1 is exhausted.
zip
will truncate quietly
You can use itertools.zip_longest
to always exhaust the longest iterator and use fill values:
import itertools
for name, count in itertools.zip_longest(names, letters):
print(f'{name}: {count}')
9. Avoid Else blocks after for and while#
for i in range(3):
print('Loop {}'.format(i))
else:
print('Else block!')
Python weirdly allows an else block after a for
loop, and that makes it confusing for new programmers.
The else
block runs immediately after the loop finishes normally.
So it will execute regardless of whether the loop body was ever entered (e.g. when looping over an empty sequence).
- A break statement in the for body will skip the else block (see the sketch below)
- The behaviour is not obvious or intuitive
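A small illustration of break skipping the else block:
for i in range(3):
    print('Loop', i)
    if i == 1:
        break
else:
    print('Else block!')  # Never runs because of the break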
This is also the case with a while
loop:
while False:
print('Never Runs!')
else:
print('While Else block runs!')
10. Prevent Repetition with Assignment Expressions#
The infamous walrus operator
Introduced in python 3.8.
a := b
pronounced "a walrus b"
They allow you to assign variables in places where a normal assignment statement is disallowed.
For example, we want to make sure there is at least 1 lemon to squeeze for lemonade.
fresh_fruit = {
'apple': 10,
'banana': 8,
'lemon': 5,
}
def make_lemonade(count):
...
def out_of_stock():
...
count = fresh_fruit.get('lemon', 0)
if count:
make_lemonade(count)
else:
out_of_stock()
The problem above is the count
variable is only used in the if
portion of the if statement.
So we could rewrite the above as:
if count := fresh_fruit.get('lemon', 0):
make_lemonade(count)
else:
out_of_stock()
If I needed more than 4 apples for cider:
if (count := fresh_fruit.get('apple', 0)) >= 4:
make_cider(count)
else:
out_of_stock()
You need to surround the assignment expression with parentheses when comparing it like this
Assignment expressions also improve the emulation of switch/case
statements:
if (count := fresh_fruit.get('banana', 0)) >= 2:
pieces = slice_bananas(count)
to_enjoy = make_smoothies(pieces)
elif (count := fresh_fruit.get('apple', 0)) >= 4:
to_enjoy = make_cider(count)
elif count := fresh_fruit.get('lemon', 0):
to_enjoy = make_lemonade(count)
else:
to_enjoy = 'Nothing'
This improves readability and reduces nesting.
- The assignment expression both assigns and evaluates
- If it is used as a subexpression it needs parentheses
Lists and Sequences#
11. Know how to slice sequences#
- list, str and bytes can be sliced
- The result of a slice is a whole new list; the original is not changed
Syntax is:
somelist[start:end]
e.g.:
a = [1, 2, 3, 4]
a[:2]
a[:5]
a[0:5]
12. Avoid Striding and slicing in a single Expression#
somelist[start:end:stride]
The stride lets you take every nth
item
>>> colours = ['red', 'orange', 'yellow', 'blue', 'green']
>>> colours[::2]
['red', 'yellow', 'green']
- Can be very confusing, especially negative strides
- Avoid start and end when doing a stride
- Use the itertools module's islice function if necessary (see the sketch below)
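A quick sketch of islice taking every other item without combining start, end and stride in a single slice:
from itertools import islice
colours = ['red', 'orange', 'yellow', 'blue', 'green']
print(list(islice(colours, 0, None, 2)))
['red', 'yellow', 'green']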
13. Prefer Catch all unpacking over Slicing#
A limitation of unpacking is you need to know the number of items in the list or sequence.
car_ages = [10, 0, 5, 6, 16, 21, 8]
car_ages_descending = sorted(car_ages, reverse=True)
oldest, second_oldest = car_ages_descending
You get an error:
ValueError: too many values to unpack (expected 2)
Newcomers will overcome this with indexing - which can become messy and noisy, and is also prone to off-by-one errors.
The solution involves python's catch-all unpacking with a starred expression. It allows one part of the unpacking pattern to match any number of items.
oldest, second_oldest, *others = car_ages_descending
print(oldest, second_oldest, others)
21 16 [10, 8, 6, 5, 0]
A starred expression can appear anywhere:
oldest, *others, youngest = car_ages_descending
print(oldest, others, youngest)
21 [16, 10, 8, 6, 5] 0
- You cannot use a catch-all expression on its own
- You cannot use multiple catch-alls
Catch-alls always become lists; if there are no leftover items, the catch-all becomes an empty list.
The danger is the catch-all consuming an iterator too large for memory - only use it when you know the result will fit in memory.
14. Sort by Complex Criteria using the key parameter#
The list
built-in type provides a sort
method for ordering items in a list based on criteria.
By default .sort()
will order in ascending order.
numbers = [55, 78, 13, 0, 12, -8]
numbers.sort()
print(numbers)
[-8, 0, 12, 13, 55, 78]
It works for nearly all built-in types, but it won't work for your custom class if instances of the class cannot be compared.
Example Tool
class:
class Tool:
    def __init__(self, name, weight):
        self.name = name
        self.weight = weight

    def __repr__(self):
        return f'Tool({self.name!r}, {self.weight})'
if __name__ == "__main__":
tools = [
Tool('level', 3.5),
Tool('hammer', 1.25),
Tool('screwdriver', 0.5),
Tool('chisel', 0.25),
]
tools.sort()
You will get an error:
TypeError: '<' not supported between instances of 'Tool' and 'Tool'
Often there is an attribute of the class that can be used for ordering. For this, sort() accepts a key parameter that is expected to be a function. In this case we use the lambda keyword to sort by name.
tools.sort(key=lambda x: x.name)
print(tools)
[Tool('chisel', 0.25), Tool('hammer', 1.25), Tool('level', 3.5), Tool('screwdriver', 0.5)]
You can use any attribute that has a natural order.
They can also be used to transform and sort in one, eg. for a string type lambda x: x.lower()
What about sorting on more than 1 criteria:
- Tuples are comparable by default and have a natural ordering, meaning that they implement all of the special methods, such as
__lt__
, that are required by the sort method.
You can take advantage of this by sorting the tuple:
tools.sort(key=lambda x: (x.weight, x.name))
print(tools)
[Tool('chisel', 0.25), Tool('screwdriver', 0.5), Tool('hammer', 1.25), Tool('level', 3.5)]
One disadvantage is that all criteria must be in ascending order or all in descending order (with the reverse=True
parameter)
tools.sort(key=lambda x: (x.weight, x.name), reverse=True)
[Tool('level', 3.5), Tool('hammer', 1.25), Tool('screwdriver', 0.5), Tool('chisel', 0.25)]
More edge cases in the book…
15. Be Cautious when Relying on Dict Insertion Ordering#
In Python 3.5 and prior, iterating over a dictionary would return keys in arbitrary order - the order of iteration would not match the order of insertion.
Functions keys
, values
, items
and popitem
would also show this behaviour - prior to python 3.6.
There are many repercussions to this change.
**kwargs
would come in any order.
Classes also use the dict
type for their instance dictionaries __dict__
The collections
module used to be the go-to for its OrderedDict
class that preserved insertion ordering - it may still be preferable over python's dict
if you have a high rate of key insertions and popitem calls, where it is faster.
Python makes it easy to define custom container types for standard protocols list
, dict
and others.
Python is not statically typed - so most code relies on duck typing - where an object's behaviour is its de facto type - instead of rigid class hierarchies.
When you don’t get the dict
object but a similar duck typed object you have 3 options:
- Change your program to expect different objects
- Check the expected type and raise an Exception if it is different
- Use a type annotation and mypy to ensure it is a dict instance and not just a MutableMapping
Then check it with static analysis: python3 -m mypy --strict example.py
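A minimal sketch of what that could look like (the function name and annotation are illustrative, not from the book):
from typing import Dict

def get_winner(ranks: Dict[str, int]) -> str:
    # mypy --strict will flag callers that pass an object annotated
    # as some other mapping type instead of a plain dict
    return next(iter(ranks))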
16. Prefer get over in and KeyError to Handle Missing Dict Keys#
counters = {
'pumpernickel': 2,
'sourdough': 1,
}
key = 'wheat'  # the bread being counted (any key works here)
count = counters.get(key, 0)
counters[key] = count + 1
There is a Counter class in the built-in collections module (see the sketch below)
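A quick sketch of Counter doing the same counting job:
from collections import Counter

counters = Counter({'pumpernickel': 2, 'sourdough': 1})
counters['wheat'] += 1  # Missing keys default to zero, no get() needed
print(counters)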
If you wanted to know who voted for each type:
votes = {
'baguette': ['Bob', 'Alice'],
'ciabatta': ['Coco', 'Deb'],
}
key = 'brioche'
who = 'Elmer'
if key in votes:
names = votes[key]
else:
votes[key] = names = []
names.append(who)
print(votes)
You could also use:
names = votes.get(key)
if names is None:
votes[key] = names = []
or with an assignment expression:
if (names := votes.get(key)) is None:
votes[key] = names = []
names.append(who)
Not the most readable
setdefault
fetches the value of a key - if it isn't present it assigns (and returns) the default value provided.
names = votes.setdefault(key, [])
names.append(who)
setdefault
is not self explanatory - it should have been called get_or_set
so new developers would understand faster and without having to look at the docs.
It is important to note that the default value is assigned into the dictionary directly, not copied - if that object is modified after being set as the default, the value stored under the key changes too.
There are only a few circumstances in which using setdefault is the shortest way to handle missing dictionary keys, such as when the default values are cheap to construct, mutable, and there’s no potential for raising exceptions (e.g., list instances).
17. Prefer defaultdict Over setdefault to Handle Missing Items in Internal State#
For the instance where you are creating a mechanism for storing countries and cities:
class Visits:
def __init__(self):
self.data = {}
def add(self, country, city):
city_set = self.data.setdefault(country, set())
city_set.add(city)
This hides setdefault from the caller and provides a nicer interface.
visits = Visits()
visits.add('Russia', 'Yekaterinburg')
visits.add('Tanzania', 'Zanzibar')
print(visits.data)
>>>
{'Russia': {'Yekaterinburg'}, 'Tanzania': {'Zanzibar'}}
Using defaultdict
:
from collections import defaultdict
class Visits:
def __init__(self):
self.data = defaultdict(set)
def add(self, country, city):
self.data[country].add(city)
visits = Visits()
visits.add('England', 'Bath')
visits.add('England', 'London')
print(visits.data)
>>> defaultdict(<class 'set'>, {'England': {'London', 'Bath'}})
18. Know how to construct key dependent values with __missing__#
For example, say that I’m writing a program to manage social network profile pictures on the filesystem. I need a dictionary to map profile picture pathnames to open file handles so I can read and write those images as needed.
You can subclass the dict type and implement the __missing__ special method to add custom logic for handling missing keys
class Pictures(dict):
def __missing__(self, key):
value = open_picture(key)
self[key] = value
return value
pictures = Pictures()
handle = pictures[path]
handle.seek(0)
image_data = handle.read()
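The open_picture helper and path aren't shown in these notes; a rough version along the lines of the book (the path is just an example):
def open_picture(profile_path):
    try:
        return open(profile_path, 'a+b')  # Read/write, create the file if missing
    except OSError:
        print(f'Failed to open path {profile_path}')
        raise

path = 'profile_1234.png'  # hypothetical example path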
- The setdefault method of dict is a bad fit when creating the default value has a high computational cost or may raise exceptions
- The function passed to defaultdict must not require any arguments
Functions#
Functions enable you to break large programs into smaller, simpler pieces with names to represent their intent. They improve readability and make code more approachable. They allow for reuse and refactoring.
19. Never Unpack more than 3 Variables when functions return Multiple Values#
One effect of unpacking is it allows python functions to return more than 1 value
def get_stats(numbers):
minimum = min(numbers)
maximum = max(numbers)
return minimum, maximum
lengths = [63, 73, 72, 60, 67, 66, 71, 61, 72, 70]
minimum, maximum = get_stats(lengths) # Two return values
Unpacking many return values makes it easy to accidentally reorder them - causing hard-to-spot bugs (imagine get_stats extended to also return the average, median and count):
# Correct:
minimum, maximum, average, median, count = get_stats(lengths)
# Oops! Median and average swapped:
minimum, maximum, median, average, count = get_stats(lengths)
The unpacking line might also become very long - PEP8 forces wrapping onto the next line - hurting readability
Never unpack more than 3 values - if you want to return more than that, you are better off defining a lightweight class or namedtuple
20. Prefer Raising Exceptions to Returning None#
Use List Comprehensions Instead of map and filter#
List comprehensions derive one list from another
>>> numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> [x**2 for x in numbers]
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
Preferable over using map
that requires a lambda
squares = map(lambda x: x ** 2, numbers)
You can also use list comprehensions to filter with an if:
[x**2 for x in numbers if x % 2 == 0]
which can be achieved with map
and filter
:
alt = map(lambda x: x**2, filter(lambda x: x % 2 == 0, numbers))
list(alt)
There are also list comprehensions for dict
and set
chile_ranks = {'ghost': 1, 'habanero': 2, 'cayenne': 3}
# dict comprehension
rank_dict = {rank: name for name, rank in chile_ranks.items()}
# set comprehension
chile_len_set = {len(name) for name in rank_dict.values()}
Avoid More Than Two Expressions in List Comprehensions#
List comprehensions support multiple loops
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
flat = [x for row in matrix for x in row]
and multiple if conditions
a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
b = [x for x in a if x > 4 if x % 2 == 0]
c = [x for x in a if x > 4 and x % 2 == 0]
You can also use conditions at each level:
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
filtered = [[x for x in row if x % 3 == 0]
            for row in matrix if sum(row) >= 10]
print(filtered)
But this is horrendous for someone else to comprehend
Consider Generator Expressions for Large Comprehensions#
- List comprehensions create a new list with at most the same number of values in the input sequence
- For large inputs this may cause the program to crash due to memory usage
- To solve this, Python provides generator expressions, a generalization of list comprehensions and
generators
- generator expressions evaluate to an iterator that yields one item at a time from the expression
When you’re looking for a way to compose functionality that’s operating on a large stream of input, generator expressions are the best tool for the job
it = (len(x) for x in open('/tmp/my_file.txt'))
gen = (print(i) for i in [9, 1, 2, 3, 3])
print(next(gen))
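Generator expressions can also be chained - the output of one becomes the input of another, and items still stream through one at a time:
roots = ((x, x**0.5) for x in it)
print(next(roots))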
Take Advantage of Each Block in try/except/else/finally#
Finally Blocks#
Use try...finally
when you want exceptions to propagate up but you also want to run cleanup code when exceptions occur.
handle = open('/tmp/random_data.txt') # May raise IOError
try:
data = handle.read() # May raise UnicodeDecodeError
finally:
handle.close() # Always runs after try:
Else Blocks#
- When the try block doesn't raise an exception, the else block will run.
- The else block helps you minimize the amount of code in the try block and improves readability

import json

def load_json_key(data, key):
    try:
        result_dict = json.loads(data)  # May raise ValueError
    except ValueError as e:
        raise KeyError from e
    else:
        return result_dict[key]
If decoding is successful, the key lookup in the else block runs; if that raises a KeyError
it propagates up to the caller
Everything together Try…Except…Else…Finally#
UNDEFINED = object()
def divide_json(path):
handle = open(path, 'r+') # May raise IOError
try:
data = handle.read() # May raise UnicodeDecodeError
op = json.loads(data) # May raise ValueError
value = (op['numerator'] / op['denominator'])
# May raise ZeroDivisionError
except ZeroDivisionError as e:
return UNDEFINED
else:
op['result'] = value
result = json.dumps(op)
handle.seek(0)
handle.write(result) # May raise IOError
return value
finally:
handle.close() # Always runs
Functions#
Functions are the best organisation tool and help break up large programs into smaller pieces. They improve readability and make code more approachable.
Prefer Exceptions to Returning None#
There’s a draw for Python programmers to give special meaning to the return value of None
A helper function that divides one number by another. In the case of dividing by zero, returning None seems natural because the result is undefined.
def divide(a, b):
try:
return a/b
except ZeroDivisionError:
return None
Using the function:
result = divide(x, y)
if result is None:
print('Invalid inputs')
The problem is what happens when the numerator is 0 and the denominator is not zero: the function returns 0.
Then if the caller evaluates the result in an if
condition and checks for falsiness instead of is None, the valid 0 result looks like an error.
That is why returning None
is error prone
There are two ways to fix this, the first is returning a two tuple of (success_flag, result)
The problem is that some callers will just ignore the success flag, using the _ convention for unused variables
The better way is to not return None
at all, rather raise an exception and have them deal with it.
def divide(a, b):
try:
return a / b
except ZeroDivisionError as e:
raise ValueError('Invalid inputs') from e
I would even not raise the ValueError
It is then handled better on the caller (no check for None):
x, y = 5, 2
try:
result = divide(x, y)
except ValueError:
print('Invalid inputs')
else:
print('Result is %.1f' % result)
>>>
Result is 2.5
Raise exceptions instead of returning None
Know How Closures Interact with Variable Scope#
- closures: functions that refer to variables from the scope in which they were defined
- functions are first class objects: you can refer to them directly, assign them to variables, pass them as arguments to other functions
When you reference a variable the python interpreter resolves the reference in this order:
1. Current function’s scope
2. Any enclosing scopes
3. Scope of the module containing the code (global scope)
4. The built-in scope (python built in functions: len
, str
, etc.)
If none of these find the reference a NameError
is raised.
Assigning a value to a variable works differently. If the variable is already defined in the current scope, then it will just take on the new value. If the variable doesn’t exist in the current scope, then Python treats the assignment as a variable definition
def sort_priority2(numbers, group):
found = False # Scope: 'sort_priority2'
def helper(x):
if x in group:
found = True # Scope: 'helper' -- Bad!
return (0, x)
return (1, x)
numbers.sort(key=helper)
return found
So how do you get the data out:
The nonlocal
statement is used to indicate that scope traversal should happen upon assignment for a specific variable name. It won't traverse up to the module (global) scope.
def sort_priority3(numbers, group):
found = False
def helper(x):
nonlocal found
if x in group:
found = True
return (0, x)
return (1, x)
numbers.sort(key=helper)
return found
- It's complementary to the global statement, which indicates that a variable's assignment should go directly into the module scope.
- When your usage of nonlocal starts getting complicated, it's better to wrap your state in a helper class.
- By default, closures can’t affect enclosing scopes by assigning variables.
- Avoid
nonlocal
A class can be used to make it much easier to read:
class Sorter(object):
def __init__(self, group):
self.group = group
self.found = False
def __call__(self, x):
if x in self.group:
self.found = True
return (0, x)
return (1, x)
sorter = Sorter(group)
numbers.sort(key=sorter)
assert sorter.found is True
Consider Generators Instead of Returning Lists#
Take getting the indices of words in a sentence:
def index_words(text):
result = []
if text:
result.append(0)
for index, letter in enumerate(text):
if letter == ' ':
result.append(index + 1)
return result
- It is dense and noisy
- One line for creating result list and one for returning it
- It requires all results to be stored in the list before being returned (inefficient use of memory)
The better way is to use a generator. When called, generator
functions do not actually run but instead immediately return an iterator.
With each call to __next__
of the iterator, it will advance to the next yield
expression
def index_words_iter(text):
if text:
yield 0
for index, letter in enumerate(text):
if letter == ' ':
yield index + 1
- It is easier to read as references to the result list have been eliminated
- The iterator returned by the generator can be converted with
list()
- Done line by line especially useful in a stream of reading from a file
Be Defensive when Iterating over Arguments#
An iterator only produces its results a single time
If you iterate over an iterator or generator that has already raised a StopIteration exception, you won’t get any results the second time around
Using our previous example:
address = 'Four score and seven years ago...'
word_iterator = index_words_iter(address)
print(list(word_iterator))
print(list(word_iterator))
returns
[0, 5, 11, 15, 21, 27]
[]
Also no exception is raised as python functions are looking for the StopIteration
exception during normal operation. They don't know the difference between an iterator with no output and an iterator whose output has been exhausted.
One way to fix this is to copy the results of the iterator but the output could be large and cause your program to crash.
The better way to achieve the same result is to provide a new container class that implements the iterator protocol
The iterator protocol is how Python for loops and related expressions traverse the contents of a container type. When Python sees a statement like for x in foo it will actually call iter(foo). The iter built-in function calls the foo.__iter__
special method in turn. The __iter__
method must return an iterator object (which itself implements the __next__
special method). Then the for loop repeatedly calls the next built-in function on the iterator object until it’s exhausted (and raises a StopIteration exception).
It sounds complicated, but practically speaking you can achieve all of this behavior for your classes by implementing the __iter__
method as a generator
class WordIndexer:
def __init__(self, text):
self.text = text
def __iter__(self):
if self.text:
yield 0
for index, letter in enumerate(self.text):
if letter == ' ':
yield index + 1
calling it with:
word_index = WordIndexer(address)
print(list(word_index))
print(list(word_index))
Now WordIndexer
is a class that implements the iterator protocol (a container for an iterator).
Now we need to ensure that the argument passed to a function is a container and not an iterator:
def normalize_defensive(numbers):
    '''When an iterator object is passed into iter() the same iterator is returned;
    when a container is passed, a new iterator is returned each time.'''
    if iter(numbers) is iter(numbers):
        raise TypeError('Must supply a container')
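Usage then looks like this (a small sketch):
visits = [15, 35, 80]
normalize_defensive(visits)  # Fine - a list is a container
it = iter(visits)
normalize_defensive(it)      # Raises TypeError: Must supply a container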
Reduce Visual Noise with Variable Positional Arguments#
Optional positional arguments (*args
) can make a function call more clear and remove visual noise.
Take this example:
def log(message, values):
if not values:
print(message)
else:
values_str = ', '.join(str(x) for x in values)
print('{}: {}'.format(message, values_str))
To just print out my message, I have to send an empty []
log('My numbers are', [1, 2])
log('hello world', [])
You can tell python it is an optional parameters with:
def log(message, *values):
...
and then call it with:
log('hello world')
You would need to change how you send values in though:
favorites = [7, 33, 99]
log('Favorite colors', *favorites)
The *favorites
syntax tells python to pass items from the sequence as positional arguments, so inside log:
values == (7, 33, 99)     # called as log('Favorite colors', *favorites)
values == ([7, 33, 99],)  # called as log('Favorite colors', favorites)
There are a few problems:
- The variable arguments are always turned into a tuple before they are passed to your function.
This could consume a lot of memory if a generator is passed, as it is fully consumed and turned into a tuple.
Functions that accept
*args
are best for situations where you know the number of inputs in the argument list will be reasonably small
- You can’t add new positional arguments to your function in the future without migrating every caller
I.e. changing the signature to def log(sequence, message, *values):
will break an existing call to log('hello world')
Bugs like this are hard to track down.
Therefore you should use keyword only arguments when extending a function already accepting *args
Provide Optional Behavior with Keyword Arguments#
All positional arguments in python can also be called with keywords. They can be called:
def remainder(number, divisor):
return number % divisor
assert remainder(20, 7) == 6
assert remainder(20, divisor=7) == 6
assert remainder(number=20, divisor=7) == 6
assert remainder(divisor=7, number=20) == 6
One way it cannot be called is with:
assert remainder(number=20, 7) == 6
as that raises: SyntaxError: positional argument follows keyword argument
Also each argument must be specified once:
remainder(20, number=7)
gives: TypeError: remainder() got multiple values for argument 'number'
- Keyword arguments make function calls clearer to new readers of code
- They can have default values - reducing repetitive code and reducing noise (gets difficult with complex defaults)
- They provide a powerful way to extend a function's parameters while maintaining backwards compatibility with existing callers
With a default period of per second:
def flow_rate(weight_diff, time_diff, period=1):
return (weight_diff / time_diff) * period
would be preferable to:
def flow_rate(weight_diff, time_diff, period):
return (weight_diff / time_diff) * period
You could also extend this without breaking existing calls with:
def flow_rate(weight_diff, time_diff, period=1, units_per_kg=1): return ((weight_diff / units_per_kg) / time_diff) * period
The only problem with this is that optional arguments period
and units_per_kg
may still be specified as positional arguments.
pounds_per_hour = flow_rate(weight_diff, time_diff, 3600, 2.2)
The best practice is to always specify optional arguments using the keyword names and never pass them as positional arguments.
Use None and Docstrings to specify dynamic default arguments#
Sometimes you need to use a non-static value as a keyword argument's default.
For example, when logging a message you want to include the time and date of the log:
def log(message, when=datetime.datetime.now()):
print('{}: {}'.format(when, message))
log('Hi there!')
sleep(0.1)
log('Hi again!')
>>> 2018-07-13 21:34:08.251207: Hi there!
>>> 2018-07-13 21:34:08.251207: Hi again!
Remember: the default datetime.datetime.now()
is only evaluated once, when the function is defined
The convention for achieving the desired result is to set when=None
and document the actual behaviour in a docstring.
def log(message, when=None):
'''Log a message with a timestamp
Args:
message: Message to print
when: datetime of when the message occurred
Default to present time
'''
when = datetime.datetime.now() if when is None else when
print('{}: {}'.format(when, message))
Using None
as the default is especially important for arguments that are mutable.
Say you want to decode some json with a default:
def decode(data, default={}):
try:
return json.loads(data)
except ValueError:
return default
foo = decode('bad data')
foo['stuff'] = 5
bar = decode('also bad data')
bar['jink'] = '45'
print('Foo:', foo)
print('bar:', bar)
>>> Foo: {'stuff': 5, 'jink': '45'}
>>> bar: {'stuff': 5, 'jink': '45'}
Unfortunately both foo
and bar
are both equal to the default
parameter.
They are the same dictionary object being modified.
The fix is setting default=None
Change it like:
def decode(data, default=None):
if default is None:
default = {}
try:
return json.loads(data)
except ValueError:
return default
- Use None as the default argument for keyword arguments that have a dynamic value
- Default keyword argument values are evaluated only once, when the function is defined at module load time
Enforce Clarity with Keyword-Only Arguments#
Say you have a function with signature:
def safe_division(number, divisor, ignore_overflow, ignore_zero_division):
    ...
Expecting the ignore_overflow
and ignore_zero_division
flags to be boolean. You can call it:
>>> result = safe_division(1, 0, False, True)
>>> result = safe_division(1, 10**500, True, False)
It is not clear which boolean flag is which, and it is easy to confuse them. One way to improve this is to default them both to False and have callers opt in to the flags they want to switch on. The problem is you can still call it positionally:
safe_division(1, 10**500, True, False)
In python 3 you can demand clarity with keyword only arguments. These arguments can only be supplied by keyword never by position.
You do this using the *
symbol in the argument list, which indicates the end of positional arguments and the beginning of keyword-only arguments.
def safe_division_c(number, divisor, *, ignore_overflow=False, ignore_zero_division=False):
...
Now calling it badly:
safe_division_c(1, 10**500, True, False)
>>> TypeError: safe_division_c() takes 2 positional arguments but 4 were given
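Calling it with the keyword-only flags is explicit about intent (assuming the elided body handles the flags as in the book):
safe_division_c(1, 10**500, ignore_overflow=True)
safe_division_c(1, 0, ignore_zero_division=True)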
Classes and Inheritance#
Python supports inheritance
(acquiring attributes and methods from a parent class), polymorphism
(a way for multiple classes to implement their own unique versions of a method) and encapsulation
(restricting direct access to an object's attributes and methods)
Prefer Helper Classes over Bookkeeping with Dictionaries and Tuples#
When the bookkeeping in a class is getting complex, with many dictionaries and tuples nested within it, it is time to break it into a hierarchy of classes.
This is a common problem when scope increases (at first you didn’t know you had to keep track of such and such). It is important to remember that more than one layer of nesting is a problem.
- Avoid dictionaries that contain dictionaries
- It makes your code hard to read
- It makes maintenance difficult
Breaking it into classes:
- helps create well defined interfaces encapsulating data
- A layer of abstraction between your interfaces and your concrete implementations
Extending tuples is also an issue, as associating more data with them later causes issues for all calling code.
A namedtuple
in the collections
module does exactly what you need…defining a tiny immutable data class.
Limitations of namedtuple
:
- You cannot specify default argument values. With a handful of optional values a class is a better choice.
- Attributes are still accessible by numerical indices and iteration
A complete example:
Grade = collections.namedtuple('Grade', ('score', 'weight'))
class Subject(object):
def __init__(self):
self._grades = []
def report_grade(self, score, weight):
self._grades.append(Grade(score, weight))
def average_grade(self):
total, total_weight = 0, 0
for grade in self._grades:
total += grade.score * grade.weight
total_weight += grade.weight
return total / total_weight
class Student(object):
def __init__(self):
self._subjects = {}
def subject(self, name):
if name not in self._subjects:
self._subjects[name] = Subject()
return self._subjects[name]
def average_grade(self):
total, count = 0, 0
for subject in self._subjects.values():
total += subject.average_grade()
count += 1
return total / count
class Gradebook(object):
def __init__(self):
self._students = {}
def student(self, name):
if name not in self._students:
self._students[name] = Student()
return self._students[name]
Usage:
book = Gradebook()
albert = book.student('Albert Einstein')
math = albert.subject('Math')
math.report_grade(80, 0.10)
print(albert.average_grade())
>>> 80.0
It may have become longer but it is much easier to read
Accept Functions for Simple Interfaces Instead of Classes#
Python's built-in APIs let you customise behaviour by passing in a function.
Like the list type's sort method, which takes a key
argument to determine the order.
Ordering by length:
names = ['Socrates', 'Archimedes', 'Plato', 'Aristotle']
names.sort(key=lambda x: len(x))
print(names)
>>> ['Plato', 'Socrates', 'Aristotle', 'Archimedes']
Functions are ideal for hooks as they are easier to describe and simpler to define than classes.
Ie. better than using Abstract Class
- Functions are often all you need as the interface between simple components
- The __call__ special method enables instances of a class to behave like plain old python functions
- When you need a function to maintain state, consider providing a class that defines __call__ (see the sketch below)
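A minimal sketch of a stateful hook using __call__, along the lines of the book's CountMissing example:
from collections import defaultdict

class CountMissing:
    '''Stateful hook: counts how many times the default factory is called'''
    def __init__(self):
        self.added = 0

    def __call__(self):
        self.added += 1
        return 0

counter = CountMissing()
result = defaultdict(counter, {'green': 12})  # The instance is the hook
result['red'] += 5
result['blue'] += 3
assert counter.added == 2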
Refer to the book for more information…
Use @classmethod Polymorphism to construct methods generically#
Polymorphism is a way for multiple classes in a hierarchy to implement their own unique version of a method.
This allows many classes to fulfill the same interface or abstract base class while providing different functionality
Say you want a common class to represent input data for a MapReduce implementation:
class InputData(object):
def read(self):
raise NotImplementedError
There is one version of a concrete subclass that reads from a file on disk:
class PathInputData(InputData):
def __init__(self, path):
self.path = path
def read(self):
return open(self.path).read()
Now you could also have a class that reads from the network
Now we want a similar setup for a MapReduce worker to consume input data in a standard way
class Worker(object):
    def __init__(self, input_data):
        self.input_data = input_data
        self.result = None
def map(self):
raise NotImplementedError
def reduce(self, other):
raise NotImplementedError
Remember: a concrete class is a class where all methods are completely implemented. An abstract class is one where some methods are not fully defined (an abstract outline of a class).
The concrete subclass of Worker
:
class LineCountWorker(Worker):
    def map(self):
        data = self.input_data.read()
        self.result = data.count('\n')

    def reduce(self, other):
        self.result += other.result
Now the big hurdle…What connects these pieces?
I have a set of classes with reasonable abstractions and interfaces, but they are only useful once the class is constructed. What is responsible for building the objects and orchestrating the map reduce?
We can manually build this with helper functions:
def generate_inputs(data_dir):
for name in os.listdir(data_dir):
yield PathInputData(os.path.join(data_dir, name))
def create_workers(input_list):
workers = []
for input_data in input_list:
workers.append(LineCountWorker(input_data))
return workers
def execute(workers):
threads = [Thread(target=w.map) for w in workers]
for thread in threads: thread.start()
for thread in threads: thread.join()
first, rest = workers[0], workers[1:]
for worker in rest:
first.reduce(worker)
return first.result
def mapreduce(data_dir):
inputs = generate_inputs(data_dir)
workers = create_workers(inputs)
return execute(workers)
There is a big problem here. The functions are not generic at all.
If you write a different type of InputData
or Worker
subclass you would have to rewrite all of these functions. This boils down to needing a generic way to construct objects.
In other languages you could solve this problem with constructor polymorphism, making each subclass of InputData
have a special constructor that can be used generically.
The problem is that python only has a single constructor method: __init__
. It is unreasonable to require each subclass to have a compatible constructor.
The best way to solve this is with: @classmethod
polymorphism
Python class method polymorphism extends to whole classes, not just their constructed objects.
Remember polymorphism means to take on different forms
class GenericInputData(object):
    def read(self):
        raise NotImplementedError
@classmethod
def generate_inputs(cls, config):
raise NotImplementedError
generate_inputs
takes a dictionary of configuration parameters that the concrete subclass must interpret.
class PathInputData(GenericInputData):
    def __init__(self, path):
        self.path = path
def read(self):
return open(self.path).read()
@classmethod
def generate_inputs(cls, config):
data_dir = config['data_dir']
for name in os.listdir(data_dir):
yield cls(os.path.join(data_dir, name))
Similarly, I can make the create_workers helper part of the GenericWorker class. Here, I use the input_class parameter, which must be a subclass of GenericInputData, to generate the necessary inputs. I construct instances of the GenericWorker concrete subclass using
cls()
as a generic constructor.
class GenericWorker(object):
# ...
def map(self):
raise NotImplementedError
def reduce(self, other):
raise NotImplementedError
@classmethod
def create_workers(cls, input_class, config):
workers = []
for input_data in input_class.generate_inputs(config):
workers.append(cls(input_data))
return workers
The call to input_class.generate_inputs
is the class polymorphism. Also the cls(input_data)
provides an alternate way to instantiate instead of using __init__
directly.
We can then just change the parent class:
class LineCountWorker(GenericWorker):
...
and finally rewrite mapreduce
to be more generic:
def mapreduce(worker_class, input_class, config):
    workers = worker_class.create_workers(input_class, config)
    return execute(workers)
Calling the function now requires more parameters:
with TemporaryDirectory() as tmpdir:
write_test_files(tmpdir)
config = {'data_dir': tmpdir}
result = mapreduce(LineCountWorker, PathInputData, config)
Initialise Parent classes with Super#
Calling the parent class __init__
method directly can lead to unpredictable behaviour, especially with multiple inheritance, because the __init__
call order across the class hierarchy is not well defined.
Python 2.2 introduced super
and set the MRO
- Method Resolution Order.
Python 3 introduced super
with no arguments and it should be used because it is clear, concise and always does the right thing.
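A minimal sketch of the Python 3 form:
class MyBaseClass:
    def __init__(self, value):
        self.value = value

class TimesSevenCorrect(MyBaseClass):
    def __init__(self, value):
        super().__init__(value)  # No arguments needed in Python 3
        self.value *= 7

assert TimesSevenCorrect(4).value == 28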
Use Multiple Inheritance Only for Mix-in utility Classes#
Python makes multiple inheritance possible and tractable, but it is better to avoid it altogether.
If you want the encapsulation and convenience of multiple inheritance, use a mixin instead. A mixin is a small utility class that only defines a set of additional methods a class should provide.
Mixin classes don't define their own instance attributes and don't require their
constructor to be called.
Example: you want the ability to convert a python object from its in-memory representation to a dictionary ready for serialisation.
class ToDictMixin(object):
def to_dict(self):
return self._traverse_dict(self.__dict__)
def _traverse_dict(self, instance_dict):
output = {}
for key, value in instance_dict.items():
output[key] = self._traverse(key, value)
return output
def _traverse(self, key, value):
if isinstance(value, ToDictMixin):
return value.to_dict()
elif isinstance(value, dict):
return self._traverse_dict(value)
elif isinstance(value, list):
return [self._traverse(key, i) for i in value]
elif hasattr(value, '__dict__'):
return self._traverse_dict(value.__dict__)
else:
return value
Using it:
class BinaryTree(ToDictMixin):
def __init__(self, value, left=None, right=None):
self.value = value
self.left = left
self.right = right
tree = BinaryTree(10, left=BinaryTree(7, right=BinaryTree(9)),
                  right=BinaryTree(13, left=BinaryTree(11)))
print(tree.to_dict())
The mixin methods can also be overridden.
A lot more to read on this in the book…
Prefer Public Attributes over Private Ones#
In python there are only 2 attribute visibility types: private and public.
class MyObject(object):
def __init__(self):
self.public_field = 5
self.__private_field = 10
def get_private_field(self):
return self.__private_field
Public attributes can be accessed with dot notation
:
my_obj = MyObject()
print(my_obj.public_field)
Private fields start with a double underscore __
and can be accessed by methods of the containing class.
print(my_obj.get_private_field())
Directly accessing a private attribute gives an error:
print(my_obj.__private_field)
>>> AttributeError: 'MyObject' object has no attribute '__private_field'
- Class methods can access private attributes because they are declared within the class block
- A subclass cannot access its parent class's private fields
The python compiler just does a check on the calling class name, therefore this works:
class MyChildObject(MyObject):
pass
print(my_child_obj.get_private_field())
>>> 10
but if MyChildObject held the get_private_field()
method it would fail.
If you look at the __dict__
of a object you can see parent attributes:
(Pdb) my_child_obj.__dict__
{'public_field': 5, '_MyObject__private_field': 10}
and accessing them is easy:
print(my_child_obj._MyObject__private_field)
Why isn’t visibility restricted? The python motto:
“We are all consenting adults here.”
The benefits of being open outweigh the downsides of being closed.
To minimise the damage of accessing internals unknowingly follow the PEP8 naming conventions.
Fields prefixed with a single underscore (_protected_fields) are protected, meaning external users of the class should proceed with caution.
By choosing private fields you make subclass overrides and extensions cumbersome and brittle - those private references will break when the class hierarchy changes.
It is better to allow subclasses to do more by using _protected
attributes. Make sure to document their importance and that they be treated as immutable.
Inherit from collections.abc for custom Container Types#
Much of python is defining classes, data and how they relate. Each python class is a container of some kind.
Oftentimes when creating a sequence you will extend (inherit from) list
.
But what about a BinaryTree that you want to allow indexing for, that isn’t a list but is similar.
class BinaryNode(object):
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right
You can access an item with obj.__getitem__(0)
ie. obj[0]
class IndexableNode(BinaryNode):
    def _search(self, count, index):
        # ...
        # Returns (found, count)

    def __getitem__(self, index):
        found, _ = self._search(0, index)
        if not found:
            raise IndexError('Index out of range')
        return found.value
But then you would also need implementations of __len__
, count
and index
You should use an abstract base class (abc
) from collections
:
from collections.abc import Sequence
Then once you implement the __getitem__
and __len__
the other methods come for free.
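A small self-contained sketch (the class name is illustrative): implement only __getitem__ and __len__ and Sequence fills in index, count, __contains__ and iteration:
from collections.abc import Sequence

class FrequencyList(Sequence):
    def __init__(self, members):
        self._members = list(members)

    def __getitem__(self, index):
        return self._members[index]

    def __len__(self):
        return len(self._members)

foods = FrequencyList(['apple', 'banana', 'apple'])
assert foods.count('apple') == 2
assert foods.index('banana') == 1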
- You can still inherit directly from python's container types list and dict for simple cases
Metaclasses and Attributes#
Metaclasses let you intercept python's class
statement to provide special behaviour each time a class is defined.
Remember to follow the rule of least surprise
Use Plain attributes instead of Get and Set Methods#
Explicit getter and setter methods can be written in python and may be seen as good ways to:
- encapsulate functionality
- validate usage
- define boundaries
In python, you never need to do this. Always start with simple public attributes.
If you need special behaviour you can use @property
and the setter
decorator. This also helps to add validation and type checking (Resistor below is a simple base class from the book whose __init__ takes ohms).
class BoundedResistance(Resistor):
def __init__(self, ohms):
super().__init__(ohms)
@property
def ohms(self):
return self._ohms
@ohms.setter
def ohms(self, ohms):
if ohms <= 0:
raise ValueError('%f ohms must be > 0' % ohms)
self._ohms = ohms
Don’t set other attributes in getter property methods. Only modify related object state in
setters
If you are doing something slow or complex, rather do it in a normal method - callers expect property access to be as fast as plain attribute access.
Consider @property Instead of Refactoring Attributes#
“One advanced but common use of @property is transitioning what was once a simple numerical attribute into an on-the-fly calculation”
Check the book for a good example…
- Use @property to give existing instance attributes new functionality
- Make incremental progress towards better data models
- Consider refactoring a class when you find yourself using @property too regularly
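A minimal illustration of the idea (not the book's example): an attribute that used to be stored becomes computed on the fly, without changing callers:
class Rectangle:
    def __init__(self, width, height):
        self.width = width
        self.height = height

    @property
    def area(self):
        # Previously a plain attribute set in __init__, now computed on access
        return self.width * self.height

r = Rectangle(3, 4)
r.width = 5
assert r.area == 20  # Callers still just read .area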
Use Descriptors for reusable @property methods#
The big problem with @property
is reuse. The methods it decorates cannot be reused for multiple attributes in the same class or external classes.
Take the example:
class Exam(object):
def __init__(self):
self._writing_grade = 0
self._math_grade = 0
@staticmethod
def _check_grade(value):
if not (0 <= value <= 100):
raise ValueError('Grade must be between 0 and 100')
@property
def writing_grade(self):
return self._writing_grade
@writing_grade.setter
def writing_grade(self, value):
self._check_grade(value)
self._writing_grade = value
@property
def math_grade(self):
return self._math_grade
@math_grade.setter
def math_grade(self, value):
self._check_grade(value)
self._math_grade = value
We are duplicating properties and the grade validations.
The better way to do this is to use a descriptor, that describes how attribute access is interpreted by the language.
* Provide __get__
and __set__
methods to reuse grade validation behaviour.
* They are better than mixins at this because you can reuse the same logic for many attributes in the same class.
The class implementing descriptor:
class Grade(object):
def __get__(*args, **kwargs):
# ...
def __set__(*args, **kwargs):
# ...
The exam:
class Exam(object):
# Class attributes
math_grade = Grade()
writing_grade = Grade()
science_grade = Grade()
Assigning properties:
exam = Exam()
exam.writing_grade = 40
# Which is really
Exam.__dict__['writing_grade'].__set__(exam, 40)
Retrieving properties:
print(exam.writing_grade)
# Which is really
print(Exam.__dict__['writing_grade'].__get__(exam, Exam))
In short, when an Exam instance doesn’t have an attribute named writing_grade, Python will fall back to the Exam class’s attribute instead. If this class attribute is an object that has __get__
and __set__
methods, Python will assume you want to follow the descriptor protocol.
There are still many gotchas here you can go through in the book…
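One of those gotchas: a naive implementation stores state on the Grade descriptor itself, which is shared across every Exam instance. A sketch of the per-instance fix the book lands on, using WeakKeyDictionary so Exam instances can still be garbage collected:
from weakref import WeakKeyDictionary

class Grade:
    def __init__(self):
        self._values = WeakKeyDictionary()

    def __get__(self, instance, instance_type):
        if instance is None:
            return self
        return self._values.get(instance, 0)

    def __set__(self, instance, value):
        if not (0 <= value <= 100):
            raise ValueError('Grade must be between 0 and 100')
        self._values[instance] = value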
Use getattr, getattribute, and setattr for Lazy Attributes#
Read the book…
Validate subclasses with Meta Classes#
- Use metaclasses to ensure that subclasses are well formed at the time they are defined, before objects of their type are constructed
- The __new__ method of metaclasses is run after the class statement's entire body has been processed (see the sketch below)
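A minimal sketch of the pattern, along the lines of the book's polygon example:
class ValidatePolygon(type):
    def __new__(meta, name, bases, class_dict):
        # Skip validation for the abstract base class itself
        if bases != (object,):
            if class_dict['sides'] < 3:
                raise ValueError('Polygons need 3+ sides')
        return type.__new__(meta, name, bases, class_dict)

class Polygon(object, metaclass=ValidatePolygon):
    sides = None  # Specified by subclasses

class Triangle(Polygon):
    sides = 3

# class Line(Polygon):
#     sides = 1  # Raises ValueError while the class is being defined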
Register Class Existence with Metaclasses#
Hectic topic…read the book
Annotate Class Attributes with Metaclasses#
Again…hectic
Concurrency and Parallelism#
Concurrency is when a computer does many different things seemingly at the same time - interleaving execution of programs so it appears as if everything is happening at once.
Parallelism is actually doing many different things at the same time.
Concurrency provides no speedup for the total work.
These topics are a bit too hectic for now… you are welcome to read the book…I will leave the headings here
Use Subprocess to manage Child processes#
Read full details in the book…
Use Threads for Blocking I/O, Avoid Parallelism#
Read full details in the book…
Use Lock to Prevent Data Races in Threads#
Read full details in the book…
Use Queue to Coordinate Work between Threads#
Read full details in the book…
Consider Coroutines to Run Many Functions Concurrently#
Read full details in the book…
Consider concurrent.futures for True Parallelism#
Read full details in the book…Item 41
Built-in Modules#
Python takes a batteries-included approach to the standard library. Some of these built-in modules are so closely intertwined with idiomatic python that they may as well be part of the language specification.
Define Function Decorators with functools.wraps#
Decorators have the ability to run additional code before or after any calls to the function they wrap. This allows them to access and modify input arguments and return values.
Say you want to print arguments and return values for a recursive function call:
def trace(func):
'''Decorator to display input arguments and return value'''
def wrapper(*args, **kwargs):
result = func(*args, **kwargs)
print(f'{ func.__name__ }({ args },{ kwargs }) -> { result}')
return result
return wrapper
You can apply this function with the @
symbol
@trace
def fibonacci(n):
'''Return the n-th fibonacci number'''
if n in (0, 1):
return 1
return fibonacci(n-1) + fibonacci(n-2)
The @
symbol is equivalent to calling: fibonacci = trace(fibonacci)
Testing it:
result = fibonacci(3)
print(result)
gives:
fibonacci((1,),{}) -> 1
fibonacci((0,),{}) -> 1
fibonacci((2,),{}) -> 2
fibonacci((1,),{}) -> 1
fibonacci((3,),{}) -> 3
3
There is however an unintended side effect, the function returned does not think it is called fibonacci
.
print(fibonacci)
<function trace.<locals>.wrapper at 0x108a0fbf8>
The trace
function returns the wrapper
it defines. The wrapper
function is what is assigned to the fibonacci
name with the decorator. The problem is that it undermines debuggers and object serialisers.
For example the help is useless:
>>> from test import fibonacci
>>> help(fibonacci)
Help on function wrapper in module test:
wrapper(*args, **kwargs)
The solution is to use the wraps
helper function from the functools
built-in module.
This is a decorator that helps you write decorators
Applying it to wrapper
copies the important metadata about the inner function to the outer function.
The important part below is @wraps(func)
from functools import wraps
def trace(func):
'''Decorator to display input arguments and return value'''
@wraps(func)
def wrapper(*args, **kwargs):
result = func(*args, **kwargs)
...
Now help()
works well
In [1]: from test import fibonacci
In [2]: help(fibonacci)
Help on function fibonacci in module test:
fibonacci(n)
Return the n-th fibonacci number
Consider contextlib and with statements for reusable try/finally behaviour#
The with
statement in python is used to indicate when code is running in a special context.
lock = Lock()
with lock:
print('Lock is held')
is equivalent to:
lock.acquire()
try:
print('Lock is held')
finally:
lock.release()
The with
is better as it eliminates the need to write repetitive code.
It’s easy to make your objects and functions capable of use in with statements by using the contextlib built-in module. This module contains the contextmanager decorator, which lets a simple function be used in with statements. This is much easier than defining a new class with the special methods
__enter__
and__exit__
(the standard way).
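A small sketch of a function-based context manager (roughly the book's debug_logging example):
import logging
from contextlib import contextmanager

@contextmanager
def debug_logging(level):
    logger = logging.getLogger()
    old_level = logger.getEffectiveLevel()
    logger.setLevel(level)
    try:
        yield  # The body of the with block runs here
    finally:
        logger.setLevel(old_level)  # Always restore the previous level

with debug_logging(logging.DEBUG):
    logging.debug('This will be printed')
logging.debug('This will not')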
There is more information in the book…Item 43
Make pickle reliable with copyreg#
The pickle
built-in module can serialize python objects into a stream of bytes and deserialise them back into python objects. Pickle byte streams shouldn't be used to communicate between untrusted parties. The purpose of pickle is to communicate between two programs you control over binary channels.
The
pickle
module’s serialization format is unsafe by design. The serialized data contains what is essentially a program that describes how to reconstruct the original Python object. This means a malicious pickle payload could be used to compromise any part of the Python program that attempts to deserialize it. In contrast, thejson
module is safe by design. Serialized JSON data contains a simple description of an object hierarchy. Deserializing JSON data does not expose a Python program to any additional risk. Formats like JSON should be used for communication between programs or people that don’t trust each other.
Say you have a class tracking the state of a game for a player:
class GameState(object):
'''Track the state of your game'''
def __init__(self):
self.level = 0
self.lives = 4
You use and save the state of a player:
import pickle

state = GameState()
state.level += 1
state.lives -= 1
state_path = '/tmp/game_state.bin'
with open(state_path, 'wb') as f:
pickle.dump(state, f)
You can later resume the game state with:
state_path = '/tmp/game_state.bin'
with open(state_path, 'rb') as f:
state_after = pickle.load(f)
print(state_after.__dict__)
But what if you add a new points field to the GameState class? Serialising and deserialising a new instance will still work, but resuming the old saved state will give an object without the points attribute, even though the instance is of the GameState type.
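A minimal sketch of the problem (redefining the class with a new points field and then loading the old save from above):

```python
class GameState(object):
    '''Track the state of your game'''
    def __init__(self):
        self.level = 0
        self.lives = 4
        self.points = 0  # new field added after the old save was written

with open(state_path, 'rb') as f:
    state_after = pickle.load(f)

print(isinstance(state_after, GameState))  # True
print(state_after.__dict__)                # {'level': 1, 'lives': 3} -- no 'points'
```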
Fixing these issues requires copyreg
Default attribute values#
You can set default attribute values:
def __init__(self, lives=4, level=0, points=0):
...
To use this constructor for pickling, create a helper function that takes a GameState
object and turns it into a tuple of parameters for the copyreg
module.
The returned tuple contains the function and parameters to use when unpickling:
def pickle_game_state(game_state):
kwargs = game_state.__dict__
return unpickle_game_state, (kwargs,)
Now I need unpickle_game_state
that takes serialised data and parameters and returns a GameState
object
def unpickle_game_state(kwargs):
return GameState(**kwargs)
Now register them with copyreg
:
import copyreg, pickle
copyreg.pickle(GameState, pickle_game_state)
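A quick check (assuming the default-argument constructor above simply assigns its arguments to attributes):

```python
state = GameState()
serialized = pickle.dumps(state)
state_after = pickle.loads(serialized)
print(state_after.__dict__)
# {'level': 0, 'lives': 4, 'points': 0} -- missing fields are filled in by the constructor defaults
```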
Unfortunately this worked for new objects, but did not work for me when deserialising the old saved pickle file.
The unpickle_game_state
function was not run.
There is more info in the book on versioning of classes and providing stable import paths…
Use datetime instead of local clocks#
UTC
Coordinated Universal Time is the standard, timezone-independent representation of time. It is good for computers but not great for humans, who need a local reference point.
Use datetime
with the help of pytz
for conversions. The old time
module should be avoided.
The time module#
The localtime
function from the time
built-in module lets you convert Unix time (seconds since the epoch) to the local time of the host computer.
from time import localtime, mktime, strftime, strptime
now = 1407694710
local_tuple = localtime(now)
time_format = '%Y-%m-%d %H:%M:%S'
time_str = strftime(time_format, local_tuple)
print(time_str)
time_tuple = strptime(time_str, time_format)
utc_now = mktime(time_tuple)
print(utc_now)
>>> 2014-08-10 20:18:30
>>> 1407694710.0
The problem comes when converting time to other timezones. The time
module uses the host platform/operating system and this has different formats and missing timezones.
If you must use time, only use it for converting Unix time to the host computer's local time; in all other cases use datetime
The datetime module#
You can use datetime
to convert a time to your local timezone:
import datetime

now = datetime.datetime(2014, 8, 10, 18, 18, 30)
now_utc = now.replace(tzinfo=datetime.timezone.utc)
now_local = now_utc.astimezone()
print(now_local)
Datetime lets you change timezones but it does not hold the definitions of the rules for the timezones.
Enter pytz
pytz
holds the timezone information of every timezone you might need.
To use pytz
effectively always convert first to UTC
then to the target time.
Top tip, you can get all timezones with: pytz.all_timezones
In this example I convert a Sydney flight arrival time into UTC (note all these calls are required):
import datetime
import pytz
time_format = '%Y-%m-%d %H:%M:%S'
arrival_sydney = '2014-05-01 05:33:24'
sydney_dt_naive = datetime.datetime.strptime(arrival_sydney, time_format)
sydney = pytz.timezone('Australia/Sydney')
sydney_dt = sydney.localize(sydney_dt_naive)
utc_dt = pytz.utc.normalize(sydney_dt.astimezone(pytz.utc))
print(utc_dt)
>>> 2014-04-30 19:33:24+00:00
Now I can convert that UTC time to Johannesburg time:
jhb_timezone = pytz.timezone('Africa/Johannesburg')
jhb_dt = jhb_timezone.normalize(utc_dt.astimezone(jhb_timezone))
print(jhb_dt)
>>> 2014-04-30 21:33:24+02:00
Use Built-in algorithms and data structures#
When implementing programs with non-trivial amounts of data you will eventually see slowdowns, most likely because you are not using the most suitable algorithms and data structures. On top of speed, these algorithms and data structures also make life easier.
Double Ended Queue#
The deque
class from the collections
module is a double ended queue.
Ideal for a FIFO (first in, first out) queue
from collections import deque
fifo = deque()
fifo.append(1)
x = fifo.popleft()
list
also holds a sequence of items and lets you insert or remove items from the end in constant time. Inserting and removing items at the head of a list takes linear time O(n), but constant time for a deque
O(1)
Ordered Dictionary#
Standard dictionaries were historically unordered, meaning the same dict
could have different orders of iteration. (Since Python 3.7 plain dicts preserve insertion order, but that was not guaranteed when the book was written.)
The OrderedDict
class from the collections
module is a special type of dictionary that keeps track of the order keys were inserted. Iterating through it has predictable behaviour.
from collections import OrderedDict

a = OrderedDict()
a['one'] = 1
a['two'] = 2
b = OrderedDict()
b['one'] = 'red'
b['two'] = 'blue'
for number, colour in zip(a.values(), b.values()):
print(number, colour)
Default Dictionary#
Useful for bookkeeping and tracking statistics.
With dictionaries you cannot assume a key is present, making it difficult to increase a counter for example:
stats = {}
key = 'my_counter'
if key not in stats:
    stats[key] = 0
stats[key] += 1
defaultdict
automatically stores a default value when a key does not exist; all you need to do is provide a function that produces the default value. In this case int() returns 0.
from collections import defaultdict
stats = defaultdict(int)
stats['my_counter'] += 1
Heap Queue#
Heaps are useful for maintaining a priority queue. The heapq
module provides functions for creating heaps in standard list
types with functions like heappush
, heappop
and nsmallest
.
Remember items are always removed with highest priority first (lowest number):
import heapq

a = []
heapq.heappush(a, 5)
heapq.heappush(a, 3)
heapq.heappush(a, 7)
heapq.heappush(a, 4)
print(
heapq.heappop(a),
heapq.heappop(a),
heapq.heappop(a),
heapq.heappop(a)
)
Accessing the list with list[0]
always returns the smallest item:
assert a[0] == heapq.nsmallest(1, a)[0] == 3
Calling the sort method on the list maintains the heap invariant.
print('Before:', a)
a.sort()
print('After: ', a)
>>>
Before: [3, 4, 7, 5]
After: [3, 4, 5, 7]
Each of these heapq operations takes logarithmic time in the length of the list; doing the same work with a plain list would take linear time.
Bisection#
Searching for an item in a list takes linear time proportional to its length when you call the index
method.
The bisect
module’s function bisect_left
provides an efficient binary search through a sequence of sorted items. The value it returns is the insertion point of the value into the sequence.
from bisect import bisect_left

x = list(range(10**6))
i = x.index(991234)
i = bisect_left(x, 991234)
The binary search is logarithmic.
Iterator Tools#
itertools
contains a large number of functions for organising and interacting with iterators.
There are 3 main categories (see the short sketch after this list):

- Linking iterators together:
  - `chain` - combines multiple iterators into a single sequential iterator
  - `cycle` - repeats an iterator's items forever
  - `tee` - splits a single iterator into multiple parallel iterators
  - `zip_longest` - `zip` for iterators of differing lengths
- Filtering:
  - `islice` - slices an iterator by numerical indexes without copying
  - `takewhile` - returns items from an iterator while a predicate condition is true
  - `dropwhile` - returns items from an iterator once the predicate returns `False` for the first time
  - `filterfalse` - returns items from an iterator when the predicate function returns false
- Combinations:
  - `product` - returns the Cartesian product of items from an iterator
  - `permutations` - returns ordered permutations of length N with items from an iterator
  - `combinations` - returns ordered combinations of length N with unrepeated items from an iterator
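A minimal sketch of a few of these functions:

```python
from itertools import chain, cycle, islice, takewhile, dropwhile, product

print(list(chain([1, 2], [3, 4])))                      # [1, 2, 3, 4]
print(list(islice(cycle([1, 2]), 5)))                   # [1, 2, 1, 2, 1]
print(list(takewhile(lambda x: x < 3, [1, 2, 5, 1])))   # [1, 2]
print(list(dropwhile(lambda x: x < 3, [1, 2, 5, 1])))   # [5, 1]
print(list(product([1, 2], ['a', 'b'])))                # [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]
```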
Use decimal when precision is paramount#
rate = 1.45
seconds = 3*60 + 42
cost = rate * seconds / 60
print(cost)
print(round(cost, 2))
With floating point math and rounding down you get:
5.364999999999999
5.36
This won't do. The Decimal
class provides fixed-point math with 28 places of precision by default.
It gives you more precision and control over rounding.
from decimal import Decimal, ROUND_UP
rate = Decimal('1.45')
seconds = Decimal('222') # 3*60 + 42
cost = rate * seconds / Decimal('60')
print(cost)
rounded = cost.quantize(Decimal('0.01'), rounding=ROUND_UP)
print(rounded)
Gives:
5.365
5.37
Using the quantize method this way also properly handles the edge case of very short, cheap phone calls: an actual zero cost still comes out as 0, while a tiny non-zero cost like 0.000000000001 is rounded up to 0.01 instead of down to 0.
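A minimal sketch of that edge case (the rate and duration here are illustrative):

```python
from decimal import Decimal, ROUND_UP

rate = Decimal('0.05')       # very cheap per-minute rate
seconds = Decimal('5')       # very short call
cost = rate * seconds / Decimal('60')
print(cost)                  # 0.004166666666666666666666666667
rounded = cost.quantize(Decimal('0.01'), rounding=ROUND_UP)
print(rounded)               # 0.01 -- rounded up instead of silently becoming 0
```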
Know where to find Community built modules#
Python has a central repository of modules called PyPI (the Python Package Index), which are built and maintained by the community.
Collaboration#
There are language features in Python to help you construct well-defined APIs with clear interface boundaries. The Python community has established best practices that maximise maintainability over time. You need to be deliberate about your collaboration goals.
Write docstrings for every function, class and module#
Documentation is very important due to the dynamic nature of the language. Unlike other languages the documentation from source is available when a program runs.
You can add documentation immediately after the def
statement of a function:
def palindrome(word):
'''Return True if the given word is a palindrome'''
return word == word[::-1]
You can retrieve the docstring with:
print(repr(palindrome.__doc__))
Consequences:

- Makes interactive development easier with ipython and the help function
- A standard way of defining documentation makes it easier to build tools that convert it into more appealing formats like HTML, e.g. Sphinx or Read the Docs
- First-class, accessible and good-looking documentation encourages people to write it
Documenting Modules#
Each module should have a top-level docstring.
#!/usr/bin/env python3
'''Single sentence describing modules purpose
The paragraphs that follow should contain details that all
users should know about
It is a good place to highlight important features:
-
-
Usage information for command line utilities
'''
Documenting Classes#
Each class should have a docstring highlighting public attributes and methods, along with guidance on interacting with protected attributes etc.
eg.
class Player(object):
"""Represents a player of the game.
Subclasses may override the 'tick' method to provide
custom animations for the player's movement depending
on their power level, etc.
Public attributes:
- power: Unused power-ups (float between 0 and 1).
- coins: Coins found during the level (integer).
""”
Documenting Functions#
Every public method and function should have a docstring. Similar to other docstrings with arguments at the end.
eg.
def find_anagrams(word, dictionary):
"""Find all anagrams for a word.
This function only runs as fast as the test for
membership in the 'dictionary' container. It will
be slow if the dictionary is a list and fast if
it's a set.
Args:
word: String of the target word.
dictionary: Container with all strings that
are known to be actual words.
Returns:
List of anagrams that were found. Empty if
none were found.
"""
- If your function has no arguments and a simple return value, a single sentence description is probably good enough.
- If your function uses
*args
and**kwargs
use documentation to describe their purpose. - Default values should be mentioned
- Generators should describe what the generator yields
Use packages to organise modules and provide stable APIs#
As the size of a codebase grows it is natural to reorganise its structure into smaller functions.
You may find yourself with so many modules that another layer is needed.
For that python provides packages
which are modules containing other modules.
In most cases packages
are created by putting an __init__.py
file into a directory.
Once that is present you can import modules from that package:
main.py
mypackage/__init__.py
mypackage/models.py
mypackage/utils.py
in main.py
:
from mypackage import utils
Namespaces#
Packages let you divide modules into separate namespaces.
from analysis.utils import inspect as analysis_inspect
from frontend.utils import inspect as frontend_inspect
When functions have the same name you can import them under a different name
Even better is to avoid the
as
altogether and access the function with thepackage.module.function
way
Stable API#
Python lets you provide a strict, stable API for external consumers. You will want to provide stable functionality that does not change between releases.
Say you want all functions in my_module.utils
and my_module.models
to be accessible via my_module
.
You can, add a __init__.py
:
__all__ = []
from . models import *
__all__ += models.__all__
from . utils import *
__all__ += utils.__all__
in utils.py
:
from . models import Projectile
__all__ = ['simulate_collision']
...
in models.py
:
__all__ = ['Projectile']
class Projectile:
...
Try to avoid using import *
as it can overwrite names already existing in your module and hides the source of names from new readers
Define a root exception to insulate callers from APIs#
Python has a built-in hierarchy of exceptions for the language and standard library. There’s a draw to using the built-in exception types for reporting errors instead of defining your own new types
Sometimes raising a ValueError
makes sense but it is much more powerful for an API to define its own hierarchy of exceptions.
# my_module.py
class Error(Exception):
"""Base-class for all exceptions raised by this module."""
class InvalidDensityError(Error):
"""There was a problem with a provided density value.""”
Having a root exception lets consumers of your API catch exceptions you raise on purpose.
eg: We are specifically catching the my_module.Error
try:
weight = my_module.determine_weight(1, -1)
except my_module.Error as e:
logging.error('Unexpected error: %s', e)
These root exceptions:
- Let callers know there is a problem with their usage of the API
- If an exception is not caught properly it will propagate all the way up to the consumer's catch-all except block, bringing the problem to their attention (catching the Python Exception base class can help you find bugs)
- They help find bugs in your API code - any other (non-root) exceptions that escape are ones you did not intend to raise
- Future-proofs the API when expanding it later, e.g.:
class NegativeDensityError(InvalidDensityError):
    """A provided density value was negative."""
    ...
The calling code will still work as it catches the parent InvalidDensityError
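A minimal sketch of how calling code can take advantage of the hierarchy (determine_weight is assumed to raise InvalidDensityError for bad input, as in the example above):

```python
import logging
import my_module

try:
    weight = my_module.determine_weight(1, -1)
except my_module.InvalidDensityError:
    weight = 0                                    # a known, recoverable problem
except my_module.Error as e:
    logging.error('Bug in the calling code: %s', e)
except Exception as e:
    logging.error('Bug in the API code: %s', e)   # unexpected -> likely a bug in my_module
    raise
```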
Know How to Break Circular Dependencies#
When collaborating with others you will inevitably end up with a mutual interdependency between modules.
You have a dialog module, importing app
:
import app
class Dialog(object):
def __init__(self, save_dir):
self.save_dir = save_dir
# ...
save_dialog = Dialog(app.prefs.get('save_dir'))
def show():
# ...
The app
modules contains a prefs
object that also imports the dialog
class:
import dialog
class Prefs(object):
# ...
def get(self, name):
# ...
prefs = Prefs()
dialog.show()
This is a circular dependency: if you try to use the app
module you will get:
AttributeError: 'module' object has no attribute 'prefs'
So how does Python's import machinery work? In depth-first order:
- Searches for your module in
sys.path
- Loads the code and ensures it compiles
- Creates corresponding empty module object
- Inserts the module into
sys.modules
- Runs the code in the module object to define its contents
The attributes of a module aren't defined until its code runs in step 5, but a module can already be imported by others as soon as it is inserted into sys.modules in step 4.
The app
module imports dialog
. The dialog
module imports app
.
app.prefs
raises the error because app
is just an empty shell at this point.
The best way to fix this is to ensure that prefs
is at the bottom of the dependency tree.
Here are 3 approaches to breaking the circular dependency:
Reordering Imports#
Import dialog
at the bottom of app
:
class Prefs(object):
# ...
prefs = Prefs()
import dialog # Moved
dialog.show()
This will avoid the AttributeError
but it goes against PEP8
Import, Configure, Run#
Have modules minimise side effects at import time.
Have modules only define functions
, classes
and constants
Avoid running any functions at import time.
Then each module provides a configure
function, which is called once all other modules have finished importing.
dialog.py:
import app
class Dialog(object):
# ...
app.py:
import dialog
class Prefs(object):
# ...
prefs = Prefs()
def configure():
# ...
main.py:
import app
import dialog
app.configure()
dialog.configure()
dialog.show()
Then your main.py
should:
- Import
- Configure
- Run
This can make your code harder to read but will allow for the dependency injection design pattern.
Dynamic Import#
The simplest fix is to use an import
statement inside a function. This is called a dynamic import because the importing happens while the program is running.
dialog.py:
class Dialog(object):
# ...
save_dialog = Dialog()
def show():
import app # Dynamic import
It requires no structural changes to the way modules are defined and imported. There are downsides: the repeated import cost can hurt, especially inside loops, and by delaying execution you may get surprising failures at runtime.
Use Virtual Environments for Isolated and Reproducible Dependencies#
Potentially use pipenv in this case…
Production#
Consider Module-scoped code to configure deployment environments#
When putting things into production you often have environment-specific configuration, such as a test versus a production database, and this can be handled by your modules.
You can override parts of your program at startup time to provide different functionality:
dev_main.py:
TESTING = True
import db_connection
db = db_connection.Database()
prod_main.py:
TESTING = False
import db_connection
db = db_connection.Database()
The only difference is the value of TESTING
Then in your code you can decide which db to use with:
db_connection.py
import __main__
class TestingDatabase(object):
# ...
class RealDatabase(object):
# ...
if __main__.TESTING:
Database = TestingDatabase
else:
Database = RealDatabase
Once your deployment environments get complicated, you should consider moving them out of Python constants (like
TESTING
) and into dedicated configuration files. Tools like theconfigparser
built-in module let you maintain production configurations separate from code, a distinction that’s crucial for collaborating with an operations team.
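A minimal sketch of that idea, assuming a hypothetical db.ini file with a [db] section:

```python
import configparser

config = configparser.ConfigParser()
config.read('db.ini')  # hypothetical config file maintained outside the codebase
TESTING = config.getboolean('db', 'testing', fallback=True)
```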
Another example: if your program needs to work differently based on the host platform, you can inspect the sys
module.
db_connection.py:
import sys
class Win32Database(object):
# ...
class PosixDatabase(object):
# ...
if sys.platform.startswith('win32'):
Database = Win32Database
else:
Database = PosixDatabase
You can also get environment variables with: os.environ
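For example, a sketch using a hypothetical DB_HOST environment variable:

```python
import os

# fall back to localhost when DB_HOST is not set (variable name is illustrative)
db_host = os.environ.get('DB_HOST', 'localhost')
```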
Use repr Strings for Debugging Output#
print
will get you surprisingly far when debugging.
The problem is that these human readable results don’t show the type.
>>> print('5')
5
>>> print(5)
5
You always want to see the repr
version which is the printable representation of an object.
>>> print(repr('5'))
'5'
>>> print(repr(5))
5
The repr
of an object from a user-defined class is not particularly helpful, although if you have control of the class you can define your own __repr__
method to display the object:
class BetterClass(object):
def __init__(self, x, y):
# ...
def __repr__(self):
return 'BetterClass(%d, %d)' % (self.x, self.y)
When you don't have control over the class you can check the object's instance dictionary with obj.__dict__
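A minimal sketch with a hypothetical class you don't control:

```python
class OpaqueClass(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y

obj = OpaqueClass(4, 5)
print(obj)           # <__main__.OpaqueClass object at 0x...>
print(obj.__dict__)  # {'x': 4, 'y': 5}
```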
Test Everything with unittest#
So many people don’t do this (you should start with the test)
- Python doesn’t have static type checking, so the compiler doesn’t stop the program when types are wrong.
- You don't know whether functions will be defined at runtime.
Most Python devs see this as a blessing because of the productivity gained from the brevity and simplicity.
Also type safety isn’t everything and code needs to be tested. You should always test your code no matter what language it is written in.
In python the only way to have any confidence in your code is to write tests, there is no veil of static type checking to make you feel safe.
Tests are easy to write in Python thanks to the same dynamic features, like easily overridable behaviours. Tests are insurance, giving you confidence your code is correct and making future modification and refactoring safer.
The simplest way to write a test is by using unittest
.
See more on writing Unit Tests
For more advanced testing libraries see pytest and nose
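A minimal sketch of a unittest test case, reusing the palindrome helper from earlier as the code under test (the module name is hypothetical):

```python
# utils_test.py
from unittest import TestCase, main

def palindrome(word):
    '''Return True if the given word is a palindrome'''
    return word == word[::-1]

class PalindromeTestCase(TestCase):
    def test_palindrome(self):
        self.assertTrue(palindrome('tacocat'))

    def test_not_a_palindrome(self):
        self.assertFalse(palindrome('banana'))

if __name__ == '__main__':
    main()
```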
Consider Interactive Debugging with pdb#
Everyone encounters bugs. Writing tests isolates code but does not help you find the root cause of issues.
You should use Python's built-in interactive debugger
Other programming languages make you put a breakpoint on a certain line. The python debugger differs in that you directly initiate the debugger in the code.
All you need to do is add: import pdb; pdb.set_trace()
def complex_func(a, b, c):
# ...
import pdb; pdb.set_trace()
As soon as the statement runs, execution is paused and you can inspect local variables.
You can use locals
, help
and import
.
Inspecting current state:

- `bt` - print the traceback of the current execution stack
- `up` - move the scope up, to the caller of the current function
- `down` - move the scope down one level in the function call stack

Resuming execution:

- `step` - run the program until the next line, stopping in the next function called
- `next` - run the next line, not stopping when the next function is called
- `return` - run the program until the current function returns
- `continue` - continue running until the next breakpoint
Profile Before Optimising#
Slowdowns can be obscure. The best thing to do is ignore intuition and directly measure the performance of a program before you try to optimise it.
Python provides a built in profiler.
Let's try it on this insertion sort:
from random import randint
max_size = 10**4
data = [randint(0, max_size) for _ in range(max_size)]
test = lambda: insertion_sort(data)
def insertion_sort(data):
result = []
for value in data:
insert_value(result, value)
return result
def insert_value(array, value):
for i, existing in enumerate(array):
if existing > value:
array.insert(i, value)
return
array.append(value)
Python provides profile
, written in pure Python, and cProfile
, a C extension with lower overhead.
Make sure to profile only the portion of the code you have control over, not external systems.
import cProfile
from pstats import Stats
profiler = cProfile.Profile()
profiler.runcall(test)
stats = Stats(profiler)
stats.strip_dirs()
stats.sort_stats('cumulative')
stats.print_stats()
Also a nice visualizer is snakeviz
python -m cProfile -o expiring_dict_test expiring_dict_test.py
snakeviz expiring_dict_test
Results:
$ python test.py
20003 function calls in 2.167 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 2.167 2.167 test.py:198(<lambda>)
1 0.004 0.004 2.167 2.167 test.py:181(insertion_sort)
10000 2.142 0.000 2.163 0.000 test.py:188(insert_value)
9988 0.020 0.000 0.020 0.000 {method 'insert' of 'list' objects}
12 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
- `ncalls` - number of calls to the function during the profiling period
- `tottime` - number of seconds spent executing the function itself (not other functions it calls)
- `percall` - average seconds spent in the function per call, excluding other functions it calls
- `cumtime` - cumulative seconds spent in the function, including other functions it calls
- `cumtime percall` - average seconds spent per call, including other functions it calls
You can see the time spent in insert_value
is the biggest time waster
from bisect import bisect_left
def insert_value(array, value):
i = bisect_left(array, value)
array.insert(i, value)
Now the results:
30003 function calls in 0.067 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.067 0.067 test.py:196(<lambda>)
1 0.007 0.007 0.067 0.067 test.py:182(insertion_sort)
10000 0.008 0.000 0.060 0.000 test.py:189(insert_value)
10000 0.028 0.000 0.028 0.000 {method 'insert' of 'list' objects}
10000 0.024 0.000 0.024 0.000 {built-in method _bisect.bisect_left}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Sometimes for more complex issues you can use stats.print_callers()
Use tracemalloc to Understand Memory Usage and Leaks#
More about this in the book…Item 59
Source:
- “Effective Python: 59 Specific Ways to Write Better Python (Effective Software Development Series).” - Brett Slatkin