klenwell information services : PywellInputOutput

Pywell Lesson: File Input/Output


Goals


Lecture

In its simplest form, a script accepts some input data, manipulates it, and spits some data back out. Data in, data out. So where does this data come from?

It can come from a variety of resources. One of the most common is files, and that's where we'll begin. Python makes it easy to read data from a file. Let's say we have a file in the /tmp directory.

>>> file_path = "/tmp/myfile.txt"
>>> f = open(file_path)
>>> contents = f.read()
>>> f.close()
>>> print contents


Very simple. Data in: the contents of file /tmp/myfile.txt. Data out: the contents, unchanged. f here is a file object, a handler that provides a number of methods for manipulating files.
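Beyond read(), the file object offers several other methods for pulling data out. Here's a quick sketch of a few of them; it builds its own scratch file (rather than assuming /tmp/myfile.txt exists) so it is self-contained:

```python
import os
import tempfile

# Create a scratch file to read from (the path here is generated,
# not the /tmp/myfile.txt from the example above).
file_path = os.path.join(tempfile.mkdtemp(), "myfile.txt")

f = open(file_path, "w")        # "w" opens the file for writing
f.write("line one\nline two\nline three\n")
f.close()

f = open(file_path)             # default mode is "r" (read-only)
first_line = f.readline()       # read a single line, newline included
other_lines = f.readlines()     # read the remaining lines into a list
f.close()

print(first_line)               # line one
print(other_lines)              # ['line two\n', 'line three\n']
```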

Other data resources include databases, web services, and message queues. All follow a similar pattern: open the resource, collect the data, close the resource.
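For files, the idiomatic way to write that open-collect-close pattern in Python is the with statement, which closes the resource automatically even if an error occurs partway through. A minimal sketch, again using a generated scratch file rather than a fixed path:

```python
import os
import tempfile

# A scratch file to demonstrate with (path generated for the example).
file_path = os.path.join(tempfile.mkdtemp(), "myfile.txt")

with open(file_path, "w") as f:
    f.write("data in, data out\n")

# The with block closes the file on exit -- no explicit f.close() needed.
with open(file_path) as f:
    contents = f.read()

print(contents)
print(f.closed)     # True: the file was closed when the block ended
```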

For example, here's a simple script that pulls the latest JSON data from the USGS website's JSON feed and finds the largest earthquake in the last day. It requires Python's urllib and json libraries:

>>> import urllib, json
>>> usgs_url = "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.geojson"
>>> response = urllib.urlopen(usgs_url)
>>> raw_json = response.read()
>>> json_data = json.loads(raw_json)
>>> max([entry['properties']['mag'] for entry in json_data['features']])


Here's a breakdown of the steps:

1. Import the required Python libraries
2. Read the data feed
For the sake of convenience, the USGS JSON feed URL is assigned to the variable usgs_url.
The url is opened using the urllib module's urlopen function (note the similarity to the file interface).
It is then read, much like a file, into the raw_json variable.
3. Parse JSON data into a dictionary
The contents are parsed using the json module's loads function, which takes a string of JSON data and converts it to a dictionary.
A dictionary makes the data much easier to work with. Remember, a dictionary is a data structure that maps a key to a value, accessed with bracket notation: json_data['metadata'], for example, returns the value stored under the metadata key.
The value can be almost any data type: a string, an integer, a list, even another dict. Once we have our json_data dict, we can play with it. For instance:
>>> type(json_data)
<type 'dict'>
>>> json_data.keys()
[u'type', u'features', u'bbox', u'metadata']
>>> json_data['metadata']
{u'url': u'http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.geojson', u'count': 154, u'generated': 1370802521000, u'api': u'1.0.1', u'title': u'USGS All Earthquakes, Past Day'}
>>> type(json_data['features'])
<type 'list'>
>>> json_data['features'][0].keys()
[u'geometry', u'type', u'properties', u'id']
>>> first = json_data['features'][0]
>>> first['properties']
{u'rms': 0.32, u'code': u'15357601', u'cdi': None, u'sources': u',ci,', u'nst': 15, u'tz': -420, u'magType': u'Ml', u'detail': u'http://earthquake.usgs.gov/earthquakes/feed/v1.0/detail/ci15357601.geojson', u'sig': 30, u'net': u'ci', u'type': u'earthquake', u'status': u'AUTOMATIC', u'updated': 1370801930386, u'felt': None, u'alert': None, u'dmin': 0.11678099, u'mag': 1.4, u'gap': 79.2, u'types': u',general-link,geoserve,nearby-cities,origin,scitech-link,', u'url': u'http://earthquake.usgs.gov/earthquakes/eventpage/ci15357601', u'ids': u',ci15357601,', u'tsunami': None, u'place': u'2km S of Brawley, California', u'time': 1370801714000, u'mmi': None}


The data we're specifically interested in, the earthquakes from the past day, is stored under the features key:
>>> earthquakes = json_data['features']
>>> first_earthquake = earthquakes[0]
>>> first_earthquake['properties']
{u'rms': 0.32, u'code': u'15357601', u'cdi': None, u'sources': u',ci,', u'nst': 15, u'tz': -420, u'magType': u'Ml', u'detail': u'http://earthquake.usgs.gov/earthquakes/feed/v1.0/detail/ci15357601.geojson', u'sig': 30, u'net': u'ci', u'type': u'earthquake', u'status': u'AUTOMATIC', u'updated': 1370801930386, u'felt': None, u'alert': None, u'dmin': 0.11678099, u'mag': 1.4, u'gap': 79.2, u'types': u',general-link,geoserve,nearby-cities,origin,scitech-link,', u'url': u'http://earthquake.usgs.gov/earthquakes/eventpage/ci15357601', u'ids': u',ci15357601,', u'tsunami': None, u'place': u'2km S of Brawley, California', u'time': 1370801714000, u'mmi': None}
>>> first_earthquake['properties']['mag']
1.4
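Step 3 in miniature: json.loads turns any string of JSON into Python objects, and json.dumps goes the other way. A small sample shaped like one USGS feature (the values here are illustrative, not from a live feed):

```python
import json

# A JSON string shaped like one entry from the USGS feed.
raw = '{"properties": {"mag": 1.4, "place": "2km S of Brawley, California"}}'

feature = json.loads(raw)               # JSON string -> a plain dict
print(feature['properties']['mag'])     # 1.4
print(feature['properties']['place'])   # 2km S of Brawley, California

# json.dumps serializes the dict back into a JSON string.
print(json.dumps(feature['properties'], sort_keys=True))
```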


4. Find the entry with the highest magnitude
Here, for the sake of brevity, I use a list comprehension to collect the magnitude of each earthquake, then use the built-in max function to find the largest value.
If you're not familiar with list comprehensions yet, this code may make more sense:
earthquakes = json_data['features']
earthquake_magnitudes = []

for earthquake in earthquakes:
  earthquake_magnitudes.append(earthquake['properties']['mag'])

print len(earthquakes)  # number of earthquakes in the last day
print max(earthquake_magnitudes)  # magnitude of the biggest earthquake
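Note that max(earthquake_magnitudes) returns only the magnitude, not the entry itself. To get the full entry, as the step title promises, you can pass max a key function. A sketch on hand-built sample data (the real script would use the json_data parsed from the feed; these values are illustrative):

```python
# Sample data in the shape of json_data['features'] (illustrative values).
earthquakes = [
    {'properties': {'mag': 1.4, 'place': '2km S of Brawley, California'}},
    {'properties': {'mag': 4.2, 'place': 'offshore Northern California'}},
    {'properties': {'mag': 2.7, 'place': 'Central Alaska'}},
]

# key tells max how to rank each entry: here, by its magnitude.
biggest = max(earthquakes, key=lambda quake: quake['properties']['mag'])

print(biggest['properties']['mag'])     # 4.2
print(biggest['properties']['place'])   # offshore Northern California
```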



Exercises

Write a script that collects the USGS earthquake data and answers the following questions:

Extra Credit

Parse the New York Times' XML feed of Most E-Mailed Articles and answer the following questions:
You'll need to research how to parse XML in Python. It's not quite as simple as JSON.
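As a starting point, the standard library's xml.etree.ElementTree module can parse the feed once you've downloaded it. A minimal sketch on an inline XML string (the element names here are a guess at a generic RSS-style layout, not the Times feed's actual schema):

```python
import xml.etree.ElementTree as ET

# A tiny RSS-like document standing in for the downloaded feed.
raw_xml = """
<rss>
  <channel>
    <item><title>First Article</title></item>
    <item><title>Second Article</title></item>
  </channel>
</rss>
"""

root = ET.fromstring(raw_xml)   # parse string -> tree of Element objects
titles = [item.find('title').text for item in root.iter('item')]

print(len(titles))      # 2
print(titles[0])        # First Article
```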

References

Python Documentation on File Objects
Python JSON Library