Tutorial:PracticalPython/6 Generators

Generators

Iteration (the for-loop) is one of the most common programming patterns in Python. Programs do a lot of iteration to process lists, read files, query databases, and more. One of the most powerful features of Python is the ability to customize and redefine iteration in the form of a so-called “generator function.” This section introduces this topic. By the end, you’ll write some programs that process some real-time streaming data in an interesting way.

Iteration Protocol

This section looks at the underlying process of iteration.

Iteration Everywhere

Many different objects support iteration.

a = 'hello'
for c in a: # Loop over characters in a
    ...

b = { 'name': 'Dave', 'password':'foo'}
for k in b: # Loop over keys in dictionary
    ...

c = [1,2,3,4]
for i in c: # Loop over items in a list/tuple
    ...

f = open('foo.txt')
for x in f: # Loop over lines in a file
    ...

Iteration: Protocol

Consider the for-statement.

for x in obj:
    # statements

What happens under the hood?

_iter = obj.__iter__()        # Get iterator object
while True:
    try:
        x = _iter.__next__()  # Get next item
    except StopIteration:     # No more items
        break
    # statements ...

All the objects that work with the for-loop implement this low-level iteration protocol.

Example: Manual iteration over a list.

>>> x = [1,2,3]
>>> it = x.__iter__()
>>> it
<list_iterator object at 0x590b0>
>>> it.__next__()
1
>>> it.__next__()
2
>>> it.__next__()
3
>>> it.__next__()
Traceback (most recent call last):
File "<stdin>", line 1, in ? StopIteration
>>>

Supporting Iteration

Knowing about iteration is useful if you want to add it to your own objects. For example, making a custom container.

class Portfolio:
    def __init__(self):
        self.holdings = []

    def __iter__(self):
        return self.holdings.__iter__()
    ...

port = Portfolio()
for s in port:
    ...

Exercises

Exercise 6.1: Iteration Illustrated

Create the following list:

a = [1,9,4,25,16]

Manually iterate over this list. Call __iter__() to get an iterator and call the __next__() method to obtain successive elements.

>>> i = a.__iter__()
>>> i
<list_iterator object at 0x64c10>
>>> i.__next__()
1
>>> i.__next__()
9
>>> i.__next__()
4
>>> i.__next__()
25
>>> i.__next__()
16
>>> i.__next__()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
>>>

The next() built-in function is a shortcut for calling the __next__() method of an iterator. Try using it on a file:

>>> f = open('Data/portfolio.csv')
>>> f.__iter__()    # Note: This returns the file itself
<_io.TextIOWrapper name='Data/portfolio.csv' mode='r' encoding='UTF-8'>
>>> next(f)
'name,shares,price\n'
>>> next(f)
'"AA",100,32.20\n'
>>> next(f)
'"IBM",50,91.10\n'
>>>

Keep calling next(f) until you reach the end of the file. Watch what happens.
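As an aside, next() also accepts a default value that is returned instead of raising StopIteration once the iterator is exhausted. A small sketch using the same file:

f = open('Data/portfolio.csv')
while True:
    line = next(f, None)    # Returns None at end of file instead of raising StopIteration
    if line is None:
        break
    print(line, end='')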

Exercise 6.2: Supporting Iteration

On occasion, you might want to make one of your own objects support iteration–especially if your object wraps around an existing list or other iterable. In a new file portfolio.py, define the following class:

# portfolio.py

class Portfolio:

    def __init__(self, holdings):
        self._holdings = holdings

    @property
    def total_cost(self):
        return sum([s.cost for s in self._holdings])

    def tabulate_shares(self):
        from collections import Counter
        total_shares = Counter()
        for s in self._holdings:
            total_shares[s.name] += s.shares
        return total_shares

This class is meant to be a layer around a list, but with some extra methods such as the total_cost property. Modify the read_portfolio() function in report.py so that it creates a Portfolio instance like this:

# report.py
...

import fileparse
from stock import Stock
from portfolio import Portfolio

def read_portfolio(filename):
    '''
    Read a stock portfolio file into a list of dictionaries with keys
    name, shares, and price.
    '''
    with open(filename) as file:
        portdicts = fileparse.parse_csv(file,
                                        select=['name','shares','price'],
                                        types=[str,int,float])

    portfolio = [ Stock(d['name'], d['shares'], d['price']) for d in portdicts ]
    return Portfolio(portfolio)
...

Try running the report.py program. You will find that it fails spectacularly because Portfolio instances aren't iterable.

>>> import report
>>> report.portfolio_report('Data/portfolio.csv', 'Data/prices.csv')
... crashes ...

Fix this by modifying the Portfolio class to support iteration:

class Portfolio:

    def __init__(self, holdings):
        self._holdings = holdings

    def __iter__(self):
        return self._holdings.__iter__()

    @property
    def total_cost(self):
        return sum([s.shares*s.price for s in self._holdings])

    def tabulate_shares(self):
        from collections import Counter
        total_shares = Counter()
        for s in self._holdings:
            total_shares[s.name] += s.shares
        return total_shares

After you’ve made this change, your report.py program should work again. While you’re at it, fix up your pcost.py program to use the new Portfolio object. Like this:

# pcost.py

import report

def portfolio_cost(filename):
    '''
    Computes the total cost (shares*price) of a portfolio file
    '''
    portfolio = report.read_portfolio(filename)
    return portfolio.total_cost
...

Test it to make sure it works:

>>> import pcost
>>> pcost.portfolio_cost('Data/portfolio.csv')
44671.15
>>>

Exercise 6.3: Making a more proper container

If you are making a container class, you often want to do more than just iteration. Modify the Portfolio class so that it has some other special methods like this:

class Portfolio:
    def __init__(self, holdings):
        self._holdings = holdings

    def __iter__(self):
        return self._holdings.__iter__()

    def __len__(self):
        return len(self._holdings)

    def __getitem__(self, index):
        return self._holdings[index]

    def __contains__(self, name):
        return any([s.name == name for s in self._holdings])

    @property
    def total_cost(self):
        return sum([s.shares*s.price for s in self._holdings])

    def tabulate_shares(self):
        from collections import Counter
        total_shares = Counter()
        for s in self._holdings:
            total_shares[s.name] += s.shares
        return total_shares

Now, try some experiments using this new class:

>>> import report
>>> portfolio = report.read_portfolio('Data/portfolio.csv')
>>> len(portfolio)
7
>>> portfolio[0]
Stock('AA', 100, 32.2)
>>> portfolio[1]
Stock('IBM', 50, 91.1)
>>> portfolio[0:3]
[Stock('AA', 100, 32.2), Stock('IBM', 50, 91.1), Stock('CAT', 150, 83.44)]
>>> 'IBM' in portfolio
True
>>> 'AAPL' in portfolio
False
>>>

One important observation about this: code is generally considered "Pythonic" if it speaks the common vocabulary of how other parts of Python normally work. For container objects, supporting iteration, indexing, containment, and other kinds of operators is an important part of this.

Customizing Iteration

This section looks at how you can customize iteration using a generator function.

A problem

Suppose you wanted to create your own custom iteration pattern.

For example, a countdown.

>>> for x in countdown(10):
...   print(x, end=' ')
...
10 9 8 7 6 5 4 3 2 1
>>>

There is an easy way to do this.

Generators

A generator is a function that defines iteration.

def countdown(n):
    while n > 0:
        yield n
        n -= 1

For example:

>>> for x in countdown(10):
...   print(x, end=' ')
...
10 9 8 7 6 5 4 3 2 1
>>>

A generator is any function that uses the yield statement.

The behavior of a generator is different from that of a normal function. Calling a generator function creates a generator object. It does not immediately execute the function.

def countdown(n):
    # Added a print statement
    print('Counting down from', n)
    while n > 0:
        yield n
        n -= 1
>>> x = countdown(10)
# There is NO PRINT STATEMENT
>>> x
# x is a generator object
<generator object at 0x58490>
>>>

The function only starts executing on the first __next__() call.

>>> x = countdown(10)
>>> x
<generator object at 0x58490>
>>> x.__next__()
Counting down from 10
10
>>>

yield produces a value, but suspends the execution of the function. The function resumes on the next call to __next__().

>>> x.__next__()
9
>>> x.__next__()
8

When the generator function finally returns, the iteration stops with a StopIteration exception.

>>> x.__next__()
1
>>> x.__next__()
Traceback (most recent call last):
File "<stdin>", line 1, in ? StopIteration
>>>

Observation: A generator function implements the same low-level protocol that the for statement uses on lists, tuples, dicts, files, etc.
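Because a generator implements this protocol, it can be used anywhere an iterable is expected. For example, with the original countdown() function (the version without the added print statement):

>>> list(countdown(5))
[5, 4, 3, 2, 1]
>>> sum(countdown(5))
15
>>>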

Exercises

Exercise 6.4: A Simple Generator

If you ever find yourself wanting to customize iteration, you should always think of generator functions. They're easy to write: make a function that carries out the desired iteration logic and use yield to emit values.

For example, try this generator that searches a file for lines containing a matching substring:

>>> def filematch(filename, substr):
        with open(filename, 'r') as f:
            for line in f:
                if substr in line:
                    yield line

>>> for line in open('Data/portfolio.csv'):
        print(line, end='')

name,shares,price
"AA",100,32.20
"IBM",50,91.10
"CAT",150,83.44
"MSFT",200,51.23
"GE",95,40.37
"MSFT",50,65.10
"IBM",100,70.44
>>> for line in filematch('Data/portfolio.csv', 'IBM'):
        print(line, end='')

"IBM",50,91.10
"IBM",100,70.44
>>>

This is kind of interesting: the idea that you can hide a bunch of custom processing in a function and use it to feed a for-loop. The next example looks at a more unusual case.

Exercise 6.5: Monitoring a streaming data source

Generators can be an interesting way to monitor real-time data sources such as log files or stock market feeds. In this part, we’ll explore this idea. To start, follow the next instructions carefully.

The program Data/stocksim.py simulates stock market data. As output, it constantly writes real-time data to a file Data/stocklog.csv. In a separate command window, go into the Data/ directory and run this program:

bash % python3 stocksim.py

If you are on Windows, just locate the stocksim.py program and double-click on it to run it. Now, forget about this program (just let it run). Using another window, look at the file Data/stocklog.csv being written by the simulator. You should see new lines of text being added to the file every few seconds. Again, just let this program run in the background—it will run for several hours (you shouldn’t need to worry about it).

Once the above program is running, let’s write a little program to open the file, seek to the end, and watch for new output. Create a file follow.py and put this code in it:

# follow.py
import os
import time

f = open('Data/stocklog.csv')
f.seek(0, os.SEEK_END)   # Move file pointer 0 bytes from end of file

while True:
    line = f.readline()
    if line == '':
        time.sleep(0.1)   # Sleep briefly and retry
        continue
    fields = line.split(',')
    name = fields[0].strip('"')
    price = float(fields[1])
    change = float(fields[4])
    if change < 0:
        print(f'{name:>10s} {price:>10.2f} {change:>10.2f}')

If you run the program, you’ll see a real-time stock ticker. Under the hood, this code is kind of like the Unix tail -f command that’s used to watch a log file.

Note: The use of the readline() method in this example is somewhat unusual in that it is not the usual way of reading lines from a file (normally you would just use a for-loop). However, in this case, we are using it to repeatedly probe the end of the file to see if more data has been added (readline() will either return new data or an empty string).
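You can verify this behavior yourself; at the end of a file, readline() returns an empty string rather than raising an exception:

>>> f = open('Data/portfolio.csv')
>>> data = f.read()     # Read everything, leaving the file position at the end
>>> f.readline()        # At end of file, readline() returns an empty string
''
>>>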

Exercise 6.6: Using a generator to produce data

If you look at the code in Exercise 6.5, the first part of the code is producing lines of data whereas the statements at the end of the while loop are consuming the data. A major feature of generator functions is that you can move all of the data production code into a reusable function.

Modify the code in Exercise 6.5 so that the file-reading is performed by a generator function follow(filename). Make it so the following code works:

>>> for line in follow('Data/stocklog.csv'):
          print(line, end='')

... Should see lines of output produced here ...

Modify the stock ticker code so that it looks like this:

if __name__ == '__main__':
    for line in follow('Data/stocklog.csv'):
        fields = line.split(',')
        name = fields[0].strip('"')
        price = float(fields[1])
        change = float(fields[4])
        if change < 0:
            print(f'{name:>10s} {price:>10.2f} {change:>10.2f}')

Exercise 6.7: Watching your portfolio

Modify the follow.py program so that it watches the stream of stock data and prints a ticker showing information for only those stocks in a portfolio. For example:

if __name__ == '__main__':
    import report

    portfolio = report.read_portfolio('Data/portfolio.csv')

    for line in follow('Data/stocklog.csv'):
        fields = line.split(',')
        name = fields[0].strip('"')
        price = float(fields[1])
        change = float(fields[4])
        if name in portfolio:
            print(f'{name:>10s} {price:>10.2f} {change:>10.2f}')

Note: For this to work, your Portfolio class must support the in operator. See Exercise 6.3 and make sure you implement the __contains__() method.

Discussion

Something very powerful just happened here. You moved an interesting iteration pattern (reading lines at the end of a file) into its own little function. The follow() function is now this completely general purpose utility that you can use in any program. For example, you could use it to watch server logs, debugging logs, and other similar data sources. That’s kind of cool.
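For instance, a sketch of reusing it to watch a server log (the path here is purely hypothetical):

from follow import follow

# Hypothetical log file, just to illustrate reuse of follow()
for line in follow('/var/log/server.log'):
    if 'ERROR' in line:
        print(line, end='')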

Producers, Consumers and Pipelines

Generators are a useful tool for setting up various kinds of producer/consumer problems and dataflow pipelines. This section discusses that.

Producer-Consumer Problems

Generators are closely related to various forms of producer-consumer problems.

# Producer
def follow(f):
    ...
    while True:
        ...
        yield line        # Produces value in `line` below
        ...

# Consumer
for line in follow(f):    # Consumes value from `yield` above
    ...

yield produces the values that the for-loop consumes.

Generator Pipelines

You can use this aspect of generators to set up processing pipelines (like Unix pipes).

producer → processing → processing → consumer

Processing pipelines have an initial data producer, some set of intermediate processing stages, and a final consumer.

producer → processing → processing → consumer

def producer():
    ...
    yield item
    ...

The producer is typically a generator, although it could also be a list or some other sequence. yield feeds data into the pipeline.

producer → processing → processing → consumer

def consumer(s):
    for item in s:
        ...

The consumer is a for-loop. It gets items and does something with them.

producer → processing → processing → consumer

def processing(s):
    for item in s:
        ...
        yield newitem
        ...

Intermediate processing stages simultaneously consume and produce items. They might modify the data stream. They can also filter (discarding items).

producer → processing → processing → consumer

def producer():
    ...
    yield item          # yields the item that is received by the `processing`
    ...

def processing(s):
    for item in s:      # Comes from the `producer`
        ...
        yield newitem   # yields a new item
        ...

def consumer(s):
    for item in s:      # Comes from the `processing`
        ...

Code to set up the pipeline:

a = producer()
b = processing(a)
c = consumer(b)

You will notice that data incrementally flows through the different functions.
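Here is a minimal, self-contained sketch of that flow using plain numbers (the names are illustrative only):

def producer():
    for i in range(5):
        yield i                 # Feed values into the pipeline

def processing(s):
    for item in s:
        yield item * 10         # Transform each item as it flows through

def consumer(s):
    for item in s:
        print(item)             # Final consumer pulls items and prints them

a = producer()                  # Nothing executes yet; a is a generator object
b = processing(a)               # Still nothing executes; b is a generator object
consumer(b)                     # The for-loop in consumer() drives the pipeline: prints 0 10 20 30 40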

Exercises

For this exercise the stocksim.py program should still be running in the background. You’re going to use the follow() function you wrote in the previous exercise.

Exercise 6.8: Setting up a simple pipeline

Let’s see the pipelining idea in action. Write the following function:

>>> def filematch(lines, substr):
        for line in lines:
            if substr in line:
                yield line

>>>

This function is almost exactly the same as the first generator example in the previous exercise except that it's no longer opening a file; it merely operates on a sequence of lines given to it as an argument. Now, try this:

>>> lines = follow('Data/stocklog.csv')
>>> ibm = filematch(lines, 'IBM')
>>> for line in ibm:
        print(line)

... wait for output ...

It might take a while for output to appear, but eventually you should see some lines containing data for IBM.

Exercise 6.9: Setting up a more complex pipeline

Take the pipelining idea a few steps further by performing more actions.

>>> from follow import follow
>>> import csv
>>> lines = follow('Data/stocklog.csv')
>>> rows = csv.reader(lines)
>>> for row in rows:
        print(row)

['BA', '98.35', '6/11/2007', '09:41.07', '0.16', '98.25', '98.35', '98.31', '158148']
['AA', '39.63', '6/11/2007', '09:41.07', '-0.03', '39.67', '39.63', '39.31', '270224']
['XOM', '82.45', '6/11/2007', '09:41.07', '-0.23', '82.68', '82.64', '82.41', '748062']
['PG', '62.95', '6/11/2007', '09:41.08', '-0.12', '62.80', '62.97', '62.61', '454327']
...

Well, that’s interesting. What you’re seeing here is that the output of the follow() function has been piped into the csv.reader() function and we’re now getting a sequence of split rows.

Exercise 6.10: Making more pipeline components

Let’s extend the whole idea into a larger pipeline. In a separate file ticker.py, start by creating a function that reads a CSV file as you did above:

# ticker.py

from follow import follow
import csv

def parse_stock_data(lines):
    rows = csv.reader(lines)
    return rows

if __name__ == '__main__':
    lines = follow('Data/stocklog.csv')
    rows = parse_stock_data(lines)
    for row in rows:
        print(row)

Write a new function that selects specific columns:

# ticker.py
...
def select_columns(rows, indices):
    for row in rows:
        yield [row[index] for index in indices]
...
def parse_stock_data(lines):
    rows = csv.reader(lines)
    rows = select_columns(rows, [0, 1, 4])
    return rows

Run your program again. You should see output narrowed down like this:

['BA', '98.35', '0.16']
['AA', '39.63', '-0.03']
['XOM', '82.45', '-0.23']
['PG', '62.95', '-0.12']
...

Write generator functions that convert data types and build dictionaries. For example:

# ticker.py
...

def convert_types(rows, types):
    for row in rows:
        yield [func(val) for func, val in zip(types, row)]

def make_dicts(rows, headers):
    for row in rows:
        yield dict(zip(headers, row))
...
def parse_stock_data(lines):
    rows = csv.reader(lines)
    rows = select_columns(rows, [0, 1, 4])
    rows = convert_types(rows, [str, float, float])
    rows = make_dicts(rows, ['name', 'price', 'change'])
    return rows
...

Run your program again. You should now see a stream of dictionaries like this:

{ 'name':'BA', 'price':98.35, 'change':0.16 }
{ 'name':'AA', 'price':39.63, 'change':-0.03 }
{ 'name':'XOM', 'price':82.45, 'change': -0.23 }
{ 'name':'PG', 'price':62.95, 'change':-0.12 }
...

Exercise 6.11: Filtering data

Write a function that filters data. For example:

# ticker.py
...

def filter_symbols(rows, names):
    for row in rows:
        if row['name'] in names:
            yield row

Use this to filter stocks to just those in your portfolio:

import report
portfolio = report.read_portfolio('Data/portfolio.csv')
rows = parse_stock_data(follow('Data/stocklog.csv'))
rows = filter_symbols(rows, portfolio)
for row in rows:
    print(row)

Exercise 6.12: Putting it all together

In the ticker.py program, write a function ticker(portfile, logfile, fmt) that creates a real-time stock ticker from a given portfolio, logfile, and table format. For example:

>>> from ticker import ticker
>>> ticker('Data/portfolio.csv', 'Data/stocklog.csv', 'txt')
      Name      Price     Change
---------- ---------- ----------
        GE      37.14      -0.18
      MSFT      29.96      -0.09
       CAT      78.03      -0.49
        AA      39.34      -0.32
...

>>> ticker('Data/portfolio.csv', 'Data/stocklog.csv', 'csv')
Name,Price,Change
IBM,102.79,-0.28
CAT,78.04,-0.48
AA,39.35,-0.31
CAT,78.05,-0.47
...
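One possible sketch of such a function, reusing the pipeline stages defined above; it formats the rows inline rather than using the tableformat module from earlier chapters (adapt as needed):

def ticker(portfile, logfile, fmt):
    import report
    portfolio = report.read_portfolio(portfile)
    rows = parse_stock_data(follow(logfile))
    rows = filter_symbols(rows, portfolio)
    if fmt == 'csv':
        print('Name,Price,Change')
        for row in rows:
            print(f"{row['name']},{row['price']},{row['change']}")
    else:
        # Plain-text table
        print(f"{'Name':>10s} {'Price':>10s} {'Change':>10s}")
        print(f"{'-'*10} {'-'*10} {'-'*10}")
        for row in rows:
            print(f"{row['name']:>10s} {row['price']:>10.2f} {row['change']:>10.2f}")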

Discussion

Some lessons learned: You can create various generator functions and chain them together to perform processing involving data-flow pipelines. In addition, you can create functions that package a series of pipeline stages into a single function call (for example, the parse_stock_data() function).

More Generators

This section introduces a few additional generator-related topics, including generator expressions and the itertools module.

Generator Expressions

A generator version of a list comprehension.

>>> a = [1,2,3,4]
>>> b = (2*x for x in a)
>>> b
<generator object at 0x58760>
>>> for i in b:
...   print(i, end=' ')
...
2 4 6 8
>>>

Differences with List Comprehensions.

  • Does not construct a list.
  • Only useful purpose is iteration.
  • Once consumed, can’t be reused.

General syntax.

(<expression> for i in s if <conditional>)

It can also serve as a function argument.

sum(x*x for x in a)

It can be applied to any iterable.

>>> a = [1,2,3,4]
>>> b = (x*x for x in a)
>>> c = (-x for x in b)
>>> for i in c:
...   print(i, end=' ')
...
-1 -4 -9 -16
>>>

The main use of generator expressions is in code that performs some calculation on a sequence, but only uses the result once. For example, strip all comments from a file.

f = open('somefile.txt')
lines = (line for line in f if not line.startswith('#'))
for line in lines:
    ...
f.close()

With generators, the code runs faster and uses little memory. It’s like a filter applied to a stream.
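A quick way to see the memory difference (a small sketch; exact byte counts vary by Python version):

import sys

nums = range(1_000_000)
squares_list = [x*x for x in nums]   # Builds the entire list in memory
squares_gen = (x*x for x in nums)    # Generator object; values are computed on demand

print(sys.getsizeof(squares_list))   # Several megabytes
print(sys.getsizeof(squares_gen))    # A few hundred bytes at most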

Why Generators

  • Many problems are much more clearly expressed in terms of iteration.
    • Looping over a collection of items and performing some kind of operation (searching, replacing, modifying, etc.).
    • Processing pipelines can be applied to a wide range of data processing problems.
  • Better memory efficiency.
    • Only produce values when needed.
    • Contrast to constructing giant lists.
    • Can operate on streaming data
  • Generators encourage code reuse
    • Separates the iteration from code that uses the iteration
    • You can build a toolbox of interesting iteration functions and mix-n-match.

itertools module

The itertools module is a standard library module with various functions designed to help with iterators/generators.

itertools.chain(s1,s2)
itertools.count(n)
itertools.cycle(s)
itertools.dropwhile(predicate, s)
itertools.groupby(s)
itertools.islice(s, start, stop)
itertools.takewhile(predicate, s)
itertools.repeat(s, n)
itertools.tee(s, ncopies)
itertools.zip_longest(s1, ..., sN)

All functions process data iteratively. They implement various kinds of iteration patterns.
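For example, a couple of these in action (a small sketch):

>>> import itertools
>>> list(itertools.chain([1, 2, 3], [4, 5]))
[1, 2, 3, 4, 5]
>>> list(itertools.islice(itertools.count(10), 5))   # First five values of an infinite counter
[10, 11, 12, 13, 14]
>>>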

More information is available in the Generator Tricks for Systems Programmers tutorial from PyCon '08.

Exercises

In the previous exercises, you wrote some code that followed lines being written to a log file and parsed them into a sequence of rows. This exercise continues to build upon that. Make sure the Data/stocksim.py program is still running.

Exercise 6.13: Generator Expressions

Generator expressions are a generator version of a list comprehension. For example:

>>> nums = [1, 2, 3, 4, 5]
>>> squares = (x*x for x in nums)
>>> squares
<generator object <genexpr> at 0x109207e60>
>>> for n in squares:
...     print(n)
...
1
4
9
16
25

Unlike a list comprehension, a generator expression can only be used once. Thus, if you try another for-loop, you get nothing:

>>> for n in squares:
...     print(n)
...
>>>

Exercise 6.14: Generator Expressions in Function Arguments

Generator expressions are sometimes placed into function arguments. It looks a little weird at first, but try this experiment:

>>> nums = [1,2,3,4,5]
>>> sum([x*x for x in nums])    # A list comprehension
55
>>> sum(x*x for x in nums)      # A generator expression
55
>>>

In the above example, the second version using generators would use significantly less memory if a large list was being manipulated.

In your portfolio.py file, you performed a few calculations involving list comprehensions. Try replacing these with generator expressions.
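For example, the total_cost property might be rewritten like this (a sketch; adjust to match your own class):

@property
def total_cost(self):
    # Generator expression instead of a list comprehension: no intermediate list is built
    return sum(s.shares * s.price for s in self._holdings)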

Exercise 6.15: Code simplification

Generator expressions are often a useful replacement for small generator functions. For example, instead of writing a function like this:

def filter_symbols(rows, names):
    for row in rows:
        if row['name'] in names:
            yield row

You could write something like this:

rows = (row for row in rows if row['name'] in names)

Modify the ticker.py program to use generator expressions as appropriate.
