The urllib
library makes it very easy to retrieve web pages and process the data
in Python. Using urllib you can treat a web page much like a file.
You simply indicate which web page you would like to retrieve and
urllib handles all of the HTTP protocol details. The equivalent code
to read the text.txt file from the web using urllib is as follows:
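A minimal sketch of such a program, assuming Python 2's urllib and a placeholder address for text.txt (the real URL may differ):

import urllib

# Open the URL; urllib handles all of the HTTP protocol details.
# The address below is a placeholder for wherever text.txt lives.
fhand = urllib.urlopen('http://binapratica.blogspot.com/text.txt')

# Read the response just like a file, one line at a time.
for line in fhand:
    print line.strip()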
Once the web page has
been opened with urllib.urlopen we can
treat it like a file and read through it using a for loop. When the
program runs, we only see the output of the contents of the file. The
headers are still sent, but the urllib
code consumes the headers and only returns the data to us.
This is a test. This is a test. This is a test.
This is a test. This is a test. This is a test.
This is a test. This is a test. This is a test.
This is a test. This is a test. This is a test.
As an example, we can
write a program to retrieve the data for text.txt and compute the
frequency of each word in the file as follows:
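A sketch of that word-counting program, under the same assumption about where text.txt lives:

import urllib

counts = dict()
# Placeholder URL for text.txt.
fhand = urllib.urlopen('http://binapratica.blogspot.com/text.txt')
for line in fhand:
    words = line.split()
    for word in words:
        # Add one to the count for this word, starting from zero.
        counts[word] = counts.get(word, 0) + 1
print counts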
Again, once we have
opened the web page, we can read it like a local file.
Parsing HTML and scraping the web
One of the common uses of
the urllib capability in Python is to
scrape the web. Web scraping is when we write a program that pretends
to be a web browser and retrieves pages and then examines the data in
those pages looking for patterns.
Parsing HTML using Regular Expressions
As an example, a search
engine such as Google will look at the source of one web page and
extract the links to other pages and retrieve those pages, extracting
links, and so on. Using this technique, Google spiders its way
through nearly all of the pages on the web. Google also uses the
frequency of links from pages it finds to a particular page as one
measure of how “important” a page is and how highly the page
should appear in its search results.
One simple way to parse
HTML is to use regular expressions to repeatedly search and extract
for substrings that match a particular pattern. Here is a simple web
page:
<h1>The First Page</h1>
<p>
If you like, you can switch to the
<a href="http://binapratica.blogspot.com/page2.html">Second Page</a>
</p>
We can construct a
well-formed regular expression to match and extract the link values
from the above text as follows:
href="http://.+?"
Our regular expression looks for strings that start with “href="http://”, followed by one or more characters (“.+?”), followed by another double quote.
The question mark added to the “.+?” indicates that the match is
to be done in a “non-greedy” fashion instead of a “greedy”
fashion. A non-greedy match tries to find the smallest possible
matching string and a greedy match tries to find the largest possible
matching string. We need to add parentheses to our regular expression
to indicate which part of our matched string we would like to extract
and produce the following program:
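A sketch of what urlregex.py could look like, assuming Python 2's urllib and raw_input for the prompt:

import urllib
import re

url = raw_input('Enter - ')
html = urllib.urlopen(url).read()
# The parentheses capture only the link text between the double quotes.
links = re.findall('href="(http://.+?)"', html)
for link in links:
    print link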
The findall
regular expression method will give us a list of all of the strings
that match our regular expression, returning only the link text
between the double quotes. When we run the program, we get the
following output:
python urlregex.py
Regular expressions work very nicely when your HTML is well formatted and predictable. But since there are a lot of “broken” HTML pages out there, a solution that uses only regular expressions might either miss some valid links or end up with bad data. This can be solved by using a robust HTML parsing library.
Parsing HTML using BeautifulSoup
There are a number of
Python libraries which can help you parse HTML and extract data from
the pages. Each of the libraries has its strengths and weaknesses and
you can pick one based on your needs.
As an example, we will
simply parse some HTML input and extract links using the
BeautifulSoup library. You can download and install the BeautifulSoup
code from: www.crummy.com
You can download and
“install” BeautifulSoup or you can simply place the
BeautifulSoup.py file in the same folder as your application. Even
though HTML looks like XML and some pages are carefully constructed
to be XML, most HTML is generally broken in ways that cause an XML
parser to reject the entire page of HTML as improperly formed.
BeautifulSoup tolerates highly flawed HTML and still lets you easily
extract the data you need. We will use urllib
to read the page and then use BeautifulSoup to extract the href
attributes from the anchor (a) tags.
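A sketch of urllinks.py, assuming BeautifulSoup.py (BeautifulSoup 3) sits in the same folder as the program:

import urllib
from BeautifulSoup import *

url = raw_input('Enter - ')
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# Retrieve all of the anchor tags and print the href attribute of each.
tags = soup('a')
for tag in tags:
    print tag.get('href', None)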
The program prompts for a
web address, then opens the web page, reads the data and passes the
data to the BeautifulSoup parser, and then retrieves all of the
anchor tags and prints out the href
attribute for each tag. When the program runs it looks as follows:
python urllinks.py
You can use BeautifulSoup
to pull out various parts of each tag as follows:
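A sketch of urllink2.py under the same assumptions, printing several parts of each anchor tag:

import urllib
from BeautifulSoup import *

url = raw_input('Enter - ')
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# Retrieve all of the anchor tags and look at several parts of each one.
tags = soup('a')
for tag in tags:
    print 'TAG:', tag
    print 'URL:', tag.get('href', None)
    print 'Content:', tag.contents[0]
    print 'Attrs:', tag.attrs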
This produces the
following output:
python urllink2.py
TAG: <a href="http://binapratica.blogspot.com/page2.html">Second Page</a>
URL: http://binapratica.blogspot.com/page2.html
Content: [u'\nSecond Page']
Attrs: [(u'href', u'http://binapratica.blogspot.com/page2.html')]
These examples only begin
to show the power of BeautifulSoup when it comes to parsing HTML. See
the documentation and samples at www.crummy.com
for more detail.
Reading binary files using urllib
Sometimes you want to
retrieve a non-text (or binary) file such as an image or video file.
The data in these files is generally not useful to print out but you
can easily make a copy of a URL to a local file on your hard disk
using urllib. The pattern is to open the
URL and use read to download the entire contents of the document into
a string variable (img) and then write that information to a local
file as follows:
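A sketch of that pattern (the curl1.py mentioned below), using the img1.jpg address that appears later in this section:

import urllib

# Read the entire image into memory at once...
img = urllib.urlopen('http://binapratica.blogspot.com/img1.jpg').read()

# ...then write it out to a local file, opened in binary mode.
fhand = open('img1.jpg', 'wb')
fhand.write(img)
fhand.close()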
This program reads all of
the data in at once across the network and stores it in the variable
img in the main memory of your computer
and then opens the file img1.jpg and writes the data out to your
disk. This will work if the size of the file is less than the size of
the memory of your computer. However if this is a large audio or
video file, this program may crash or at least run extremely slowly
when your computer runs out of memory. In order to avoid running out
of memory, we retrieve the data in blocks (or buffers) and then write
each block to your disk before retrieving the next block. This way
the program can read any sized file without using up all of the
memory you have in your computer.
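A sketch of curl2.py along those lines:

import urllib

img = urllib.urlopen('http://binapratica.blogspot.com/img1.jpg')
fhand = open('img1.jpg', 'wb')
size = 0
while True:
    # Read up to 100,000 characters at a time and stop at end of data.
    info = img.read(100000)
    if len(info) < 1: break
    size = size + len(info)
    fhand.write(info)

print size, 'characters copied.'
fhand.close()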
In this example, we read
only 100,000 characters at a time and then write those characters to
the img1.jpg file before retrieving the next 100,000 characters of
data from the web. This program runs as follows:
python curl2.py
568248 characters copied.
If you have a Unix or
Macintosh computer, you probably have a command built into your
operating system that performs this operation as follows:
curl -O
http://binapratica.blogspot.com/img1.jpg
The command curl is short for “copy URL”, and so these two examples are cleverly named curl1.py and curl2.py, as they implement similar functionality to the curl command. There is also a curl3.py sample program that does this task a little more effectively, in case you actually want to use this pattern in a program you are writing.