California Population Data

It’s surprisingly hard to find a simple list of annual population data for the state of California. You can find a chart on Google, but the underlying data must be extracted from several spreadsheets on the Census site.

So here’s a simple Python list of the annual data (1950-2010):

[(1950, 10677000),
 (1951, 11134000),
 (1952, 11635000),
 (1953, 12251000),
 (1954, 12746000),
 (1955, 13133000),
 (1956, 13713000),
 (1957, 14264000),
 (1958, 14880000),
 (1959, 15467000),
 (1960, 15870000),
 (1961, 16497000),
 (1962, 17072000),
 (1963, 17668000),
 (1964, 18151000),
 (1965, 18585000),
 (1966, 18858000),
 (1967, 19176000),
 (1968, 19394000),
 (1969, 19711000),
 (1970, 19971069),
 (1971, 20345939),
 (1972, 20585469),
 (1973, 20868728),
 (1974, 21173865),
 (1975, 21537849),
 (1976, 21935909),
 (1977, 22352396),
 (1978, 22835958),
 (1979, 23256880),
 (1980, 23667902),
 (1981, 24285933),
 (1982, 24820009),
 (1983, 25360026),
 (1984, 25844393),
 (1985, 26441109),
 (1986, 27102237),
 (1987, 27777158),
 (1988, 28464249),
 (1989, 29218164),
 (1990, 29950111),
 (1991, 30414114),
 (1992, 30875920),
 (1993, 31147208),
 (1994, 31317179),
 (1995, 31493525),
 (1996, 31780829),
 (1997, 32217708),
 (1998, 32682794),
 (1999, 33145121),
 (2000, 33987977),
 (2001, 34479458),
 (2002, 34871843),
 (2003, 35253159),
 (2004, 35574576),
 (2005, 35827943),
 (2006, 36021202),
 (2007, 36250311),
 (2008, 36604337),
 (2009, 36961229),
 (2010, 37349363)]

And here’s a list of the raw data and its sources:

    # raw data: (year range, raw numbers, unit/multiplier, delim, source)
    raw_data = (
        ( range(1950,1955),
          '10,677   11,134   11,635  12,251   12,746',
          1000,
          '\s+',
          'http://www.census.gov/popest/archives/1980s/st5060ts.txt' ),
        ( range(1955,1960),
          '13,133  13,713   14,264   14,880  15,467',
          1000,
          '\s+',
          'http://www.census.gov/popest/archives/1980s/st5060ts.txt' ),
        ( range(1960,1965),
          '15,870   16,497   17,072   17,668   18,151',
          1000,
          '\s+',
          'http://www.census.gov/popest/archives/1980s/st6070ts.txt' ),
        ( range(1965,1970),
          '18,585   18,858   19,176   19,394   19,711',
          1000,
          '\s+',
          'http://www.census.gov/popest/archives/1980s/st6070ts.txt' ),
        ( range(1970,1976),
          '19971069  20345939  20585469  20868728  21173865  21537849',
          1,
          '\s+',
          'http://www.census.gov/popest/archives/1980s/st7080ts.txt' ),
        ( range(1976,1980),
          '21935909  22352396  22835958  23256880',
          1,
          '\s+',
          'http://www.census.gov/popest/archives/1980s/st7080ts.txt' ),
        ( range(1980,1985),
          '23667902  24285933  24820009  25360026  25844393',
          1,
          '\s+',
          'http://www.census.gov/popest/archives/1980s/st8090ts.txt' ),
        ( range(1985,1990),
          '26441109  27102237  27777158  28464249  29218164',
          1,
          '\s+',
          'http://www.census.gov/popest/archives/1980s/st8090ts.txt' ),
        ( range(1999,1989,-1),
          '33145121  32682794  32217708  31780829  31493525  31317179  31147208  30875920  30414114  29950111',
          1,
          '\s+',
          'http://www.census.gov/popest/archives/1990s/ST-99-07.txt' ),
        ( range(2000,2011),
         '"33,987,977","34,479,458","34,871,843","35,253,159","35,574,576","35,827,943","36,021,202","36,250,311","36,604,337","36,961,229","37,349,363"',
          1,
          ',',
          'http://www.census.gov/popest/intercensal/state/ST-EST00INT-01.csv' ),
    )
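The code that turns raw_data into the list above isn't shown, so here is a minimal sketch of one way to do it, in Python 3. The function names parse_block and build_population_list are hypothetical, and only three of the raw_data entries are copied in to keep it short. One assumption worth flagging: the 2000-2010 entry can't simply be split on its ',' delimiter, because the quoted numbers contain commas as thousands separators, so the sketch pulls out the quoted fields instead.

```python
import re

# Three entries copied from raw_data above; the rest have the same shape.
sample_raw_data = (
    (range(1950, 1955),
     '10,677   11,134   11,635  12,251   12,746',
     1000,
     r'\s+',
     'http://www.census.gov/popest/archives/1980s/st5060ts.txt'),
    (range(1999, 1989, -1),
     '33145121  32682794  32217708  31780829  31493525  '
     '31317179  31147208  30875920  30414114  29950111',
     1,
     r'\s+',
     'http://www.census.gov/popest/archives/1990s/ST-99-07.txt'),
    (range(2000, 2011),
     '"33,987,977","34,479,458","34,871,843","35,253,159","35,574,576",'
     '"35,827,943","36,021,202","36,250,311","36,604,337","36,961,229",'
     '"37,349,363"',
     1,
     ',',
     'http://www.census.gov/popest/intercensal/state/ST-EST00INT-01.csv'),
)


def parse_block(years, raw, multiplier, delim, source):
    """Turn one raw_data entry into a list of (year, population) tuples."""
    if '"' in raw:
        # Quoted CSV fields: take what is inside each pair of quotes, since
        # splitting on ',' would also split the thousands separators.
        fields = re.findall(r'"([^"]*)"', raw)
    else:
        fields = re.split(delim, raw.strip())
    values = [int(field.replace(',', '')) * multiplier for field in fields]
    years = list(years)
    if len(years) != len(values):
        raise ValueError('%s: %d years but %d values'
                         % (source, len(years), len(values)))
    return list(zip(years, values))


def build_population_list(raw_data):
    """Combine all entries into one list sorted by year."""
    rows = []
    for entry in raw_data:
        rows.extend(parse_block(*entry))
    return sorted(rows)


population = build_population_list(sample_raw_data)
```

Sorting at the end takes care of the 1990s entry, whose source lists the years in descending order.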

Python Timeit

I’m working through the lessons in the free online version of Allen B. Downey’s Think Stats. One thing I like to do whenever I write code is test it. So almost as soon as I began the exercises, I added a module called testwell.

Since performance is often a consideration in statistical computing, my testing module includes a basic timing function called timeit. The Python standard library includes a timeit module, but I don’t find it very friendly: when you pass it a statement as a string, it runs the code in its own separate namespace, so anything you want to time has to be imported or rebuilt in a setup string. I just want a timing function that can time comparable pieces of code in the current environment. Here’s my timeit function:

import time

def timeit(f, *args, **kw):
    """Time a function over n trials.

    f1 = lambda a: a * a
    def f2(a, b):
        return f1(a) + f1(b)
    def f3():
        return f2(10, 20)

    USAGE:
        t1 = timeit(f1, a, n=1000)
        t2 = timeit(f2, a, b, n=1000)
        t3 = timeit(f3, n=1000)
        pprint([t1, t2, t3])
    """
    n = kw.get('n', 100)
    # parenthesized form works as a statement in Python 2 and a call in Python 3
    print('timing %s over %s trials' % (f, n))
    t0 = time.time()
    for i in range(n):
        f(*args)

    total_time = time.time() - t0
    per_trial = total_time / n
    return '%.2f (%s trials at %.6f per trial)' % (total_time, n, per_trial)

It takes a function as its first argument, followed by any positional arguments to pass to that function, and an n keyword to specify how many times to run it.

The pattern I like best is to enclose the operation you wish to test in a function, then just pass that with the n keyword:

import pprint

num_trials = 100000
f1 = lambda: Percentile(scores, 50)
f2 = lambda: iPercentile(scores, 50)
t1 = timeit(f1, n=num_trials)
t2 = timeit(f2, n=num_trials)
pprint.pprint([t1, t2])

# output
timing <function <lambda> at 0x8950bc4> over 100000 trials
timing <function <lambda> at 0x8950bfc> over 100000 trials
['0.86 (100000 trials at 0.000009 per trial)',
 '0.38 (100000 trials at 0.000004 per trial)']
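For comparison, the standard library's timeit.timeit also accepts a callable in place of a statement string, and a callable keeps access to the current environment through its closure. Here is a minimal sketch of the same pattern in Python 3. The function square_sum is a hypothetical stand-in for whatever is being timed, and the summary string just mimics the format used above.

```python
import timeit


def square_sum(a, b):
    """Hypothetical stand-in for the operation being timed."""
    return a * a + b * b


num_trials = 10000

# Passing a callable avoids the setup string: the lambda closes over
# names that already exist in the current module.
total_time = timeit.timeit(lambda: square_sum(10, 20), number=num_trials)
per_trial = total_time / num_trials
summary = '%.2f (%s trials at %.6f per trial)' % (
    total_time, num_trials, per_trial)
print(summary)
```

Wrapping the call in a lambda adds a little overhead per trial, which matters only when the timed operation is itself very cheap.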

Updating Model Schema in Google App Engine

Problem

I have an App Engine app that keeps track of some tweets (a “status” in the API). I decided I wanted to store the time each message was originally tweeted. So I need to update the schema of one of my models, TweetDigest, to add the new property (field). I would also like to update any existing records to include the value.

Solution

First things first. Let’s update the model. This is easy enough. Just add the new property to the existing model. The relevant code:

from google.appengine.ext import db

class TweetDigest(db.Model):
    tweet_id        = db.StringProperty(required=True, indexed=True)
    user_id         = db.StringProperty(required=True)
    screen_name     = db.StringProperty(required=True)
    text            = db.StringProperty(required=True, multiline=True)
    stored_at       = db.DateTimeProperty(auto_now_add=True)

    # new property
    tweeted_at      = db.DateTimeProperty(required=True)

That’s the easy part. Any new records will include that property. Existing records, however, will not have the property at all. Let me emphasize that: it’s not that the new field is set to null for existing records. Existing records do not have the field at all. A couple of queries in the interactive console illustrate:

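from datetime import datetime
from google.appengine.ext import db
from pprint import pprint
from project.models.twitter import TweetDigest

print datetime.now()
# filtering on None finds nothing: existing records lack the property entirely
count = TweetDigest.all().filter('tweeted_at', None).count()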
print count

record = TweetDigest.all().get()
pprint(record.__dict__['_entity'])

Output:

2011-09-05 00:11:25.657055
0
{u'screen_name': u'klenwell',
 u'stored_at': datetime.datetime(2011, 7, 6, 9, 26, 56, 757862),
 u'text': u'Premature optimization is the root of all evil.',
 u'tweet_id': u'1948390000',
 u'user_id': u'1820900'}

So how to fix this? Well, in this case, I have to retrieve the created_at value for each status using the Twitter API and then update each record. I set up an action in a special controller to do this. It queries the datastore to fetch all the existing records, then creates a task for each one that will query the Twitter API, get the created_at value, and update the record in the datastore.

Here’s the controller code that creates the tasks:

def add_tweet_created_at_field(self):
    # queue settings
    queue_name = 'tweetdigest-schema-change'
    queue_url_f = '/backend/queue/store_tweet_created_at_value/%s'
    queue_params = {}

    # purge queue
    self.purge_queue(queue_name)
    logging.info('purged queue: %s' % (queue_name))

    # select all TweetDigest records without tweeted_at and add to queue
    queue_count = 0
    query = TweetDigest.all()
    for digest in query:
        if digest.tweeted_at is None:
            queue_url = queue_url_f % (digest.tweet_id)
            added_task = self.queue_task(queue_url, queue_params, queue_name)
            queue_count += 1

    logging.info('queued %s TweetDigest records for update' % (queue_count))

    # output
    response = {
        'queued tasks'          : queue_count,
    }
    self.set('data', pformat(response))
    self.render(self.default_view)

This action runs within the appswell framework. The task code is left as an exercise for the reader.
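The task depends on the Twitter client and the framework, but one piece stands on its own: converting Twitter's created_at string into a datetime that can be assigned to tweeted_at. Here is a minimal sketch in Python 3, assuming the classic REST API timestamp format (for example, Wed Aug 27 13:08:45 +0000 2008). The function name parse_created_at is hypothetical and is not part of appswell or any Twitter library.

```python
from datetime import datetime, timezone

# Assumed format of the created_at field in the classic Twitter REST API.
TWITTER_TIME_FORMAT = '%a %b %d %H:%M:%S %z %Y'


def parse_created_at(created_at):
    """Convert a Twitter created_at string to a naive UTC datetime."""
    aware = datetime.strptime(created_at, TWITTER_TIME_FORMAT)
    # Normalize to UTC, then drop the tzinfo so the value is a plain
    # datetime expressed in UTC.
    return aware.astimezone(timezone.utc).replace(tzinfo=None)


tweeted_at = parse_created_at('Wed Aug 27 13:08:45 +0000 2008')
print(tweeted_at)
```

The task would then assign the result to the record's tweeted_at property and save the record.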

References

http://code.google.com/appengine/articles/update_schema.html
http://stackoverflow.com/questions/7037269/check-if-a-field-is-present-in-an-entity
http://appswell.appspot.com/

Google App Engine Memcache Limits

Problem

If you attempt to store an object larger than approximately 1 MB using memcache in Google App Engine, it raises a ValueError like this:

ValueError: Values may not be more than 1000000 bytes in length; received 1088171 bytes

Solution

I’ve added a library to my Appswell framework that allows you to get around this limit by serializing an object into multiple strings and storing these along with an index object that stores the key.

Usage Example:

import multicache as memcache

# cache params
cache_data = some_large_nested_dict
cache_key = 'test_multicache'
cache_len = 60

# save data
memcache.set(cache_key, cache_data, cache_len)

# retrieve data
retrieved_data = memcache.get(cache_key)

The module can be easily extracted from the framework. See these links for additional details:

source code: http://code.google.com/p/appswell/source/browse/appspot/lib/multicache.py
wiki page: http://klenwell.com/is/AppengineMulticache
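To illustrate the general idea (this is not the multicache source), here is a minimal Python 3 sketch of chunked caching. It pickles the object, slices the bytes into pieces below the size limit, and stores the pieces plus an index entry listing their keys. A plain dict stands in for memcache so the example runs anywhere. The names set_chunked and get_chunked, the chunk key format, and the 950,000-byte chunk size are all assumptions made for the sketch.

```python
import pickle

# Assumed safety margin below the 1,000,000-byte value limit.
CHUNK_SIZE = 950000


def set_chunked(cache, key, value, chunk_size=CHUNK_SIZE):
    """Store value under key as one index entry plus N chunk entries."""
    blob = pickle.dumps(value, pickle.HIGHEST_PROTOCOL)
    chunks = [blob[i:i + chunk_size] for i in range(0, len(blob), chunk_size)]
    chunk_keys = ['%s.chunk.%d' % (key, i) for i in range(len(chunks))]
    for chunk_key, chunk in zip(chunk_keys, chunks):
        cache[chunk_key] = chunk
    # The index entry lives under the caller's key and lists the chunk keys.
    cache[key] = chunk_keys
    return len(chunks)


def get_chunked(cache, key):
    """Reassemble a value stored by set_chunked, or return None on a miss."""
    chunk_keys = cache.get(key)
    if chunk_keys is None:
        return None
    chunks = [cache.get(chunk_key) for chunk_key in chunk_keys]
    if any(chunk is None for chunk in chunks):
        # A chunk was evicted; the value cannot be rebuilt, so report a miss.
        return None
    return pickle.loads(b''.join(chunks))


fake_cache = {}
data = {'numbers': list(range(1000)), 'label': 'test_multicache'}
num_chunks = set_chunked(fake_cache, 'test_multicache', data, chunk_size=500)
restored = get_chunked(fake_cache, 'test_multicache')
```

With a real cache, any single chunk can be evicted independently, which is why the read path treats a missing chunk as a miss for the whole object.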

Julian Day in Python

Needed this today at work. Turned out to be much easier than I anticipated:

from datetime import datetime

# call now() once so the year and day of year come from the same instant
now = datetime.now()
julianday = '%d%03d' % (now.year, now.timetuple().tm_yday)
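Strictly speaking, this value is an ordinal date (the year followed by the day of the year) rather than the astronomical Julian Day. The same string can also be produced with strftime, and strptime will parse it back. Here is a minimal sketch in Python 3; the function names are hypothetical.

```python
from datetime import datetime


def to_ordinal_date(dt):
    """Format a datetime as YYYYDDD, where DDD is the zero-padded day of year."""
    return dt.strftime('%Y%j')


def from_ordinal_date(text):
    """Parse a YYYYDDD string back into a datetime at midnight."""
    return datetime.strptime(text, '%Y%j')


stamp = to_ordinal_date(datetime(2011, 9, 5))
print(stamp)  # prints 2011248
print(from_ordinal_date(stamp))
```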