klenwell information services : PywellInputOutput

Revision [2625]

This is an old revision of PywellInputOutput made by KlenwellAdmin on 2013-06-09 12:05:54.
 

Pywell Lesson: File Input/Output


Goals



Lecture

In its simplest form, a script accepts some input data, manipulates it, and spits some data back out. Data in, data out. So where does this data come from?

It can come from a variety of resources. One of the most common is a file, and that's where we begin. Python makes it easy to read data out of a file. Let's say we have a file in the /tmp directory:

>>> file_path = "/tmp/myfile.txt"
>>> f = open(file_path)
>>> contents = f.read()
>>> f.close()
>>> print contents


Very simple. Data in: the contents of the file /tmp/myfile.txt. Data out: the contents, unchanged. Here f is a file object, a handle that provides a number of methods for reading and manipulating the file.
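
As an aside, the same read can also be written with a with block, which closes the file automatically when the block ends, even if an error occurs along the way. The sketch below is not part of the original session: it creates its own sample file in a temporary directory (the file name and contents are made up for illustration) so it can run on any machine, and it is written to work under both Python 2.6+ and Python 3.

```python
import os
import tempfile

# Hypothetical sample file, created here so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
file_path = os.path.join(tmp_dir, "myfile.txt")

# Data out: write two lines of text to the file.
with open(file_path, "w") as f:
    f.write("first line\nsecond line\n")

# Data in: the with block closes the file for us when it ends.
with open(file_path) as f:
    contents = f.read()

# Iterating over a file object yields one line at a time.
with open(file_path) as f:
    lines = [line.rstrip("\n") for line in f]

print(contents)
print(lines)
```

Reading line by line, as in the last block, is usually the better choice for large files, since f.read() pulls the whole file into memory at once.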

Other data resources include databases, web services, and message queues. All follow a similar pattern: open the resource, collect the data, close the resource.
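
To make the open/collect/close pattern concrete for a database, here is a minimal sketch using the sqlite3 module from Python's standard library. The table name, column names, and rows are hypothetical sample data invented for this illustration, and the database lives only in memory.

```python
import sqlite3

# Open the resource: an in-memory SQLite database (nothing is written to disk).
conn = sqlite3.connect(":memory:")

# Set up hypothetical sample data so there is something to collect.
conn.execute("CREATE TABLE readings (station TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("north", 1.5), ("south", 4.2), ("east", 2.8)],
)

# Collect the data.
rows = conn.execute(
    "SELECT station, value FROM readings ORDER BY value"
).fetchall()

# Close the resource.
conn.close()

print(rows)
```

The three commented steps line up with the file example: connect plays the role of open, fetchall the role of read, and close is close.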

For example, here's a simple script that pulls the latest data from the USGS website's JSON feed and finds the magnitude of the largest earthquake in the last day. It uses the urllib and json modules from Python's standard library:

>>> import urllib, json
>>> usgs_url = "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.geojson"
>>> response = urllib.urlopen(usgs_url)
>>> raw_json = response.read()
>>> json_data = json.loads(raw_json)
>>> max([entry['properties']['mag'] for entry in json_data['features']])
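
The session above is written for Python 2, where urlopen lives directly in the urllib module. In Python 3 it lives in urllib.request instead. The sketch below shows one way to write the fetch so it runs under either version. The helper name fetch_json is made up for this example, and decoding the response as UTF-8 is an assumption about the feed's encoding.

```python
import json
from contextlib import closing

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib import urlopen  # Python 2


def fetch_json(url):
    """Open the URL, read the response, close it, and parse the JSON.

    Hypothetical helper for illustration; assumes a UTF-8 encoded body.
    """
    # closing() guarantees response.close() is called, mirroring f.close().
    with closing(urlopen(url)) as response:
        raw_json = response.read()
    return json.loads(raw_json.decode("utf-8"))
```

With this in place, json_data = fetch_json(usgs_url) would stand in for the urlopen, read, and loads lines of the session above.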


Here's a breakdown of the steps:

1. Import the required Python libraries
2. Read the data feed
   For convenience, the URL of the USGS JSON data feed is stored in the variable usgs_url.
   The URL is opened using the urllib module's urlopen function, urllib.urlopen(usgs_url) (note the similarity to the file interface).
   The response is then read, much like a file, into the raw_json variable.
3. Parse the JSON data into a dictionary
   The contents are parsed using the json module's loads function, which takes a string of JSON data and converts it to a dictionary.
   A dictionary makes the data much easier to work with. Remember that a dictionary is a data structure that ties a key to a value, which is accessed by putting the key in square brackets after the dictionary's name.
   The value can be almost any data type: a string, an integer, a list, another dict. Once we have our json_data dict, we can play with it, for instance by listing its keys with json_data.keys().
   The data we're specifically interested in, the earthquakes from the past day, is stored under the features key: json_data['features']
4. Find the entry with the highest magnitude
   Here, for the sake of brevity, I use a list comprehension to collect the magnitude of each earthquake, then use the built-in max function to find the maximum value (the last line of the session above).
   If you're not familiar with list comprehensions yet, this code may make more sense:
earthquakes = json_data['features']
earthquake_magnitudes = []

# collect the magnitude of each earthquake
for earthquake in earthquakes:
    earthquake_magnitudes.append(earthquake['properties']['mag'])

print len(earthquakes)  # number of earthquakes in the last day
print max(earthquake_magnitudes)  # largest magnitude in the last day
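
Steps 3 and 4 can also be tried without a network connection. The sketch below runs them against a small hand-written string shaped like the feed. The sample entries, the place field values, and the entry with a missing magnitude are all made up for illustration, and skipping entries without a magnitude is just one possible way to handle incomplete records. It is written to run under both Python 2.6+ and Python 3.

```python
import json

# Hypothetical sample shaped like the feed: a top-level object with a
# "features" list, each entry carrying a "properties" dict.
raw_json = """
{
  "type": "FeatureCollection",
  "features": [
    {"properties": {"mag": 1.2, "place": "Place A"}},
    {"properties": {"mag": 4.7, "place": "Place B"}},
    {"properties": {"mag": null, "place": "Place C"}},
    {"properties": {"mag": 2.9, "place": "Place D"}}
  ]
}
"""

# Step 3: parse the JSON string into a dictionary.
json_data = json.loads(raw_json)

# A dictionary ties keys to values: list the keys, then look one up.
top_level_keys = sorted(json_data.keys())
earthquakes = json_data["features"]

# Step 4: keep only entries that have a magnitude, then collect the values.
measured = [q for q in earthquakes if q["properties"]["mag"] is not None]
magnitudes = [q["properties"]["mag"] for q in measured]
largest_magnitude = max(magnitudes)

# max also accepts a key function, which returns the whole entry
# rather than just the number.
largest_quake = max(measured, key=lambda q: q["properties"]["mag"])

print(top_level_keys)
print(len(earthquakes))
print(largest_magnitude)
print(largest_quake["properties"]["place"])
```

Passing a key function to max is handy for the exercises, since it gives access to every property of the largest earthquake and not only its magnitude.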



Exercises

Write a script that collects the USGS earthquake data and answers the following questions:

Extra Credit

Parse the New York Times' XML feed of Most E-Mailed Articles and answer the following questions:
You'll need to research how to parse XML in Python. It's not quite as simple as JSON.

References