Tuesday, June 9, 2015

Python and FTP

So, how do we write an FTP client using Python? Before jumping into code, let's review the steps:
  1. Connect to server
  2. Log in
  3. Make service request(s) (and hopefully get response[s])
  4. Quit
When using Python’s FTP support, all you do is import the ftplib module and instantiate the ftplib.FTP class. All FTP activity—logging in, transferring files, and logging out—will be accomplished using your object.

Here is some Python pseudocode:
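Something along these lines (a sketch only; the server name, the login credentials, and the particular request are placeholders):

from ftplib import FTP
f = FTP('some.ftp.server')                   # 1. connect to server
f.login('anonymous', 'you@your.domain')      # 2. log in
f.retrlines('RETR some.file', a_callback)    # 3. make service request(s)
f.quit()                                     # 4. quit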


Soon we will look at a real example, but for now, let’s familiarize ourselves with methods from the ftplib.FTP class, which you will likely use in your code.

ftplib.FTP Class Methods

I outline the most popular methods below. The list is not comprehensive, but the methods presented here are the ones that make up the API for FTP client programming in Python. In other words, you don't really need the others, because they are either utility or administrative functions or are used internally by the API methods listed here.

login(user='anonymous', passwd='', acct='')
    Log in to the FTP server; all arguments are optional
pwd()
    Return the current working directory
cwd(path)
    Change the current working directory to path
dir([path[, ...[, cb]]])
    Display a directory listing of path; an optional callback cb is passed to retrlines()
nlst([path[, ...]])
    Like dir(), but returns a list of file names instead of displaying them
retrlines(cmd[, cb])
    Download the text file given by FTP cmd, for example, RETR filename; an optional callback cb processes each line of the file
retrbinary(cmd, cb[, bs=8192[, ra]])
    Similar to retrlines(), except for binary files; the required callback cb processes each downloaded block (blocksize bs defaults to 8K)
storlines(cmd, f)
    Upload the text file given by FTP cmd, for example, STOR filename; an open file object f is required
storbinary(cmd, f[, bs=8192])
    Similar to storlines(), but for binary files; an open file object f is required, and the upload blocksize bs defaults to 8K
rename(old, new)
    Rename remote file old to new
delete(path)
    Delete the remote file located at path
mkd(directory)
    Create a remote directory
rmd(directory)
    Remove a remote directory
quit()
    Close the connection and quit
The methods you will most likely use in a normal FTP transaction include login(), cwd(), dir(), pwd(), stor*(), retr*(), and quit(). There are more FTP object methods not listed in the table that you might find useful. For more detailed information about FTP objects, read the Python documentation available at http://docs.python.org/library/ftplib#ftp-objects.

An Interactive FTP Example

Using FTP from Python is so simple that you do not even have to write a script; you can do it all from the interactive interpreter and watch the action and output in real time. Here is a sample session, run from the interactive Python interpreter in a terminal:
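(A session along these lines; the server here is just the Mozilla anonymous FTP host used later in this post, the directory name is a placeholder, and the exact reply strings vary from server to server.)

>>> from ftplib import FTP
>>> f = FTP('ftp.mozilla.org')       # connect to the server
>>> f.login()                        # log in anonymously
'230 Login successful.'
>>> f.dir()                          # display a directory listing
...
>>> f.cwd('pub')                     # change to another directory
'250 Directory successfully changed.'
>>> f.quit()                         # close the connection
'221 Goodbye.'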


A Client Program FTP Example

I mentioned previously that an example script is not really necessary because you can run everything interactively and not get lost in any code. I will try anyway. Suppose that you want a piece of code that downloads the latest copy of Bugzilla from the Mozilla Web site. The following example is what we came up with. It is written as an application, but even so, you can probably run it interactively, too. Our application uses the FTP library to download the file and includes some error checking.

Example:
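Here is a sketch of such a script (I call it getLatestFileByFTP.py to match the run shown below; the line numbers cited in the line-by-line explanation refer to the original listing, so treat them as approximate here):

#!/usr/bin/env python

import ftplib
import os
import socket

HOST = 'ftp.mozilla.org'
DIRN = 'pub/mozilla.org/webtools'
FILE = 'bugzilla-LATEST.tar.gz'

def main():
    # connect to the FTP server
    try:
        f = ftplib.FTP(HOST)
    except (socket.error, socket.gaierror) as e:
        print('ERROR: cannot reach "%s"' % HOST)
        return
    print('*** Connected to host "%s"' % HOST)

    # log in anonymously
    try:
        f.login()
    except ftplib.error_perm:
        print('ERROR: cannot login anonymously')
        f.quit()
        return
    print('*** Logged in as "anonymous"')

    # change to the distribution directory
    try:
        f.cwd(DIRN)
    except ftplib.error_perm:
        print('ERROR: cannot CD to "%s"' % DIRN)
        f.quit()
        return
    print('*** Changed to "%s" folder' % DIRN)

    # download the file (the callback is the write method of a new local file)
    try:
        f.retrbinary('RETR %s' % FILE, open(FILE, 'wb').write)
    except ftplib.error_perm:
        print('ERROR: cannot read file "%s"' % FILE)
        os.unlink(FILE)
    else:
        print('*** Downloaded "%s" to CWD' % FILE)
    f.quit()

if __name__ == '__main__':
    main()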


Be aware that this script is not automated, so it is up to you to run it whenever you want to perform the download; alternatively, if you are on a Unix-based system, you can set up a cron job to automate it for you. Another issue is that the script will break if either the file name or the directory name changes.

If no errors occur when you run this script, you get the following output:

$ getLatestFileByFTP.py
*** Connected to host "ftp.mozilla.org"
*** Logged in as "anonymous"
*** Changed to "pub/mozilla.org/webtools" folder
*** Downloaded "bugzilla-LATEST.tar.gz" to CWD
$

Line-by-Line Explanation


Lines 10–16
The initial lines of code import the necessary modules (mainly to grab exception objects) and set a few constants.


Lines 18–52
The main() function consists of several steps: create an FTP object and attempt to connect to the FTP server (lines 20–24), returning and quitting on any failure; attempt to log in anonymously and abort if unsuccessful (lines 27–31); change to the distribution directory (lines 34–39); and, finally, try to download the file (lines 42–49).

For line 21 and all the other exception handlers in this example that save the exception instance (in this case, e): if you are using Python 2.5 or older, you need to change the as to a comma, because the as syntax was introduced (but not required) in version 2.6 to help with the migration to 3.x. Python 3 understands only the newer syntax shown in line 21.
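For reference, the two spellings look like this (a trivial illustration; the handler body is just an example):

import ftplib

f = ftplib.FTP('ftp.mozilla.org')

# Python 2.6 and newer (and the only form Python 3 accepts):
try:
    f.login()
except ftplib.error_perm as e:
    print(e)

# Python 2.5 and older use a comma instead of as:
#     except ftplib.error_perm, e: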

On line 42, we pass a callback to retrbinary() that should be executed for every block of binary data downloaded. This is the write() method of a file object we create to write out the local version of the file. We are depending on the Python interpreter to adequately close our file after the transfer is done and to not lose any of our data. Although more convenient, I usually try to avoid using this style, because the programmer should be responsible for freeing resources directly allocated rather than depending on other code. In this case, we should save the open file object to a variable, say loc, and then pass loc.write in the call to ftp.retrbinary().

After the transfer has completed, we would call loc.close(). If for some reason we are not able to save the file, we remove the empty file to avoid cluttering up the file system (line 45). We should put some error-checking around that call to os.unlink(FILE) in case the file does not exist. Finally, to avoid another pair of lines that close the FTP connection and return, we use an else clause (lines 35–42).
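Putting those suggestions together, the download step inside main() might look something like this sketch (loc is simply our name for the local file object; the other names are from the script above):

# download the file, keeping an explicit handle on the local file
loc = open(FILE, 'wb')
try:
    f.retrbinary('RETR %s' % FILE, loc.write)
except ftplib.error_perm:
    print('ERROR: cannot read file "%s"' % FILE)
    loc.close()
    if os.path.exists(FILE):   # error-check before removing the empty file
        os.unlink(FILE)
else:
    print('*** Downloaded "%s" to CWD' % FILE)
    loc.close()
f.quit()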

Lines 46–49
This is the usual idiom for running a stand-alone script.

Conclusion

FTP is not only useful for downloading client applications to build and/or use; it can also be helpful in your everyday job for moving files between systems. For example, suppose that you are an engineer or a system administrator who needs to transfer files. The obvious choice is scp or rsync when crossing the Internet boundary or pushing files to an externally visible server. However, those tools carry a penalty when moving extremely large logs or database files between internal computers on a secure network: the overhead of security, encryption, and compression/decompression. If all you want is a simple FTP application that moves files for you quickly after hours, Python is a great way to build it!

You can read more about FTP in the FTP Protocol Definition/Specification (RFC 959) at http://tools.ietf.org/html/rfc959 as well as on the www.networksorcery.com/enp/protocol/ftp.htm Web page. Other related RFCs include 2228, 2389, 2428, 2577, 2640, and 4217. To find out more about Python's FTP support, you can start at http://docs.python.org/library/ftplib.

That's it, see you in the next post!


Tks,

Retrieving web pages with urllib


The urllib library makes it very easy to retrieve web pages and process the data in Python. Using urllib, you can treat a web page much like a file: you simply indicate which web page you would like to retrieve, and urllib handles all of the HTTP protocol details. The code to read the text.txt file from the web using urllib is as follows:
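(A minimal sketch using Python 2's urllib; the URL is a placeholder for wherever your text.txt actually lives.)

import urllib

# open the URL and iterate over it much like a local file
fhand = urllib.urlopen('http://www.example.com/text.txt')
for line in fhand:
    print line.strip()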


Once the web page has been opened with urllib.urlopen we can treat it like a file and read through it using a for loop. When the program runs, we only see the output of the contents of the file. The headers are still sent, but the urllib code consumes the headers and only returns the data to us.

This is a test. This is a test. This is a test.
This is a test. This is a test. This is a test.
This is a test. This is a test. This is a test.
This is a test. This is a test. This is a test.

As an example, we can write a program to retrieve the data for text.txt and compute the frequency of each word in the file as follows:
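A sketch of such a program (again, the URL is a placeholder):

import urllib

counts = dict()
fhand = urllib.urlopen('http://www.example.com/text.txt')
for line in fhand:
    for word in line.split():
        counts[word] = counts.get(word, 0) + 1
print counts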


Again, once we have opened the web page, we can read it like a local file.

Parsing HTML and scraping the web
One of the common uses of the urllib capability in Python is to scrape the web. Web scraping is when we write a program that pretends to be a web browser, retrieves pages, and then examines the data in those pages, looking for patterns.

Parsing HTML using Regular Expressions
As an example, a search engine such as Google will look at the source of one web page and extract the links to other pages and retrieve those pages, extracting links, and so on. Using this technique, Google spiders its way through nearly all of the pages on the web. Google also uses the frequency of links from pages it finds to a particular page as one measure of how “important” a page is and how highly the page should appear in its search results.

One simple way to parse HTML is to use regular expressions to repeatedly search for and extract substrings that match a particular pattern. Here is a simple web page:

<h1>The First Page</h1>
<p>
If you like, you can switch to the
<a href="http://binapratica.blogspot.com/page2.html">Second Page</a>.
</p>

We can construct a well-formed regular expression to match and extract the link values from the above text as follows:

href="http://.+?"

Our regular expression looks for strings that start with href="http:// followed by one or more characters (.+?) followed by another double quote. The question mark in .+? indicates that the match is to be done in a "non-greedy" fashion instead of a "greedy" fashion: a non-greedy match tries to find the smallest possible matching string, whereas a greedy match tries to find the largest possible matching string. We add parentheses to our regular expression to indicate which part of the matched string we would like to extract, and produce the following program:
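A sketch of that program (assuming a script named urlregex.py, as in the run below; the page URL is typed in by the user):

import urllib
import re

url = raw_input('Enter - ')
html = urllib.urlopen(url).read()
# extract everything between the double quotes that follow href=
links = re.findall('href="(http://.+?)"', html)
for link in links:
    print link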


The findall regular expression method will give us a list of all of the strings that match our regular expression, returning only the link text between the double quotes. When we run the program, we get the following output:

python urlregex.py

Regular expressions work very nicely when your HTML is well formatted and predictable. But since there are a lot of "broken" HTML pages out there, a solution using only regular expressions might either miss some valid links or end up with bad data. This can be solved by using a robust HTML parsing library.

Parsing HTML using BeautifulSoup
There are a number of Python libraries which can help you parse HTML and extract data from the pages. Each of the libraries has its strengths and weaknesses and you can pick one based on your needs.
As an example, we will simply parse some HTML input and extract links using the BeautifulSoup library. You can download and install the BeautifulSoup code from: www.crummy.com

You can download and “install” BeautifulSoup or you can simply place the BeautifulSoup.py file in the same folder as your application. Even though HTML looks like XML and some pages are carefully constructed to be XML, most HTML is generally broken in ways that cause an XML parser to reject the entire page of HTML as improperly formed. BeautifulSoup tolerates highly flawed HTML and still lets you easily extract the data you need. We will use urllib to read the page and then use BeautifulSoup to extract the href attributes from the anchor (a) tags.
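A sketch of that program (assuming the old BeautifulSoup.py, that is, BeautifulSoup 3, placed next to the script, and a file named urllinks.py as in the run below):

import urllib
from BeautifulSoup import *

url = raw_input('Enter - ')
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# retrieve all of the anchor tags and print each href attribute
tags = soup('a')
for tag in tags:
    print tag.get('href', None)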


The program prompts for a web address, then opens the web page, reads the data and passes the data to the BeautifulSoup parser, and then retrieves all of the anchor tags and prints out the href attribute for each tag. When the program runs it looks as follows:

python urllinks.py

You can use BeautifulSoup to pull out various parts of each tag as follows:
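For example, a variation along these lines (the slice of contents and the attrs list reflect BeautifulSoup 3 behavior, matching the output shown below):

import urllib
from BeautifulSoup import *

url = raw_input('Enter - ')
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# look at the individual parts of each anchor tag
tags = soup('a')
for tag in tags:
    print 'TAG:', tag
    print 'URL:', tag.get('href', None)
    print 'Content:', tag.contents[0:1]
    print 'Attrs:', tag.attrs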


This produces the following output:

python urllink2.py
TAG: <a href="http://binapratica.blogspot.com/page2.html">Second Page</a>
URL: http://binapratica.blogspot.com/page2.html
Content: [u'\nSecond Page']
Attrs: [(u'href', u'http://binapratica.blogspot.com/page2.html')]

These examples only begin to show the power of BeautifulSoup when it comes to parsing HTML. See the documentation and samples at www.crummy.com for more detail.

Reading binary files using urllib
Sometimes you want to retrieve a non-text (or binary) file such as an image or video file. The data in these files is generally not useful to print out but you can easily make a copy of a URL to a local file on your hard disk using urllib. The pattern is to open the URL and use read to download the entire contents of the document into a string variable (img) and then write that information to a local file as follows:
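A sketch (reusing the image URL from the curl example at the end of this section):

import urllib

# read the entire image into memory, then write it out to a local file
img = urllib.urlopen('http://binapratica.blogspot.com/img1.jpg').read()
fhand = open('img1.jpg', 'wb')
fhand.write(img)
fhand.close()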


This program reads all of the data in at once across the network and stores it in the variable img in the main memory of your computer, then opens the file img1.jpg and writes the data out to your disk. This will work if the size of the file is less than the size of the memory of your computer. However, if this is a large audio or video file, this program may crash or at least run extremely slowly when your computer runs out of memory. In order to avoid running out of memory, we retrieve the data in blocks (or buffers) and then write each block to your disk before retrieving the next block. This way the program can read any sized file without using up all of the memory in your computer.
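A sketch of that buffered version (reading 100,000 characters per block, as described below):

import urllib

img = urllib.urlopen('http://binapratica.blogspot.com/img1.jpg')
fhand = open('img1.jpg', 'wb')
size = 0
while True:
    # read the next block; an empty string means end of file
    info = img.read(100000)
    if len(info) < 1:
        break
    size = size + len(info)
    fhand.write(info)

print size, 'characters copied.'
fhand.close()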


In this example, we read only 100,000 characters at a time and then write those characters to the img1.jpg file before retrieving the next 100,000 characters of data from the web. This program runs as follows:

python curl2.py
568248 characters copied.

If you have a Unix or Macintosh computer, you probably have a command built into your operating system that performs this operation as follows:

curl -O http://binapratica.blogspot.com/img1.jpg

The command curl is short for "copy URL," so these two examples are cleverly named curl1.py and curl2.py, as they implement functionality similar to the curl command when run against http://binapratica.blogspot.com/img1.jpg. There is also a curl3.py sample program that does this task a little more effectively, in case you actually want to use this pattern in a program you are writing.