Sunday, April 3, 2016

Integrating Python with other languages

The standard Python distribution is written in C and is often referred to as CPython. But there are other implementations of the Python language built on other languages, and these non‐C interpreters make it easy to integrate their host language with Python. Two of the best known of these alternate Python versions are Jython, written in Java, and IronPython, an implementation for Microsoft’s .NET environment. A third alternative is Cython, which is not strictly an implementation of Python but a very closely related superset that can be compiled into C, providing very fast performance while keeping the speed of development of a Python‐like language.

Jython
The Java implementation of Python offers many advantages to Java programmers looking for an interactive environment in which to test their Java classes or to build prototype solutions that can, if necessary, be converted to full Java later. The distribution includes both an interpreter and a compiler.

The interpreter comes with the familiar interactive prompt, as well as the ability to run scripts directly. In addition to importing Python modules (including many of the regular Python standard library modules), Jython can import Java libraries, making the classes available to the Python interpreter as if they were regular Python classes. This makes it possible to exercise and test new Java classes interactively at the Jython prompt. Jython also enables dynamic prototyping of solutions by mixing Java and Python code together. The interpreter can also be used to run script files with all of these same features for bigger projects or prototypes.

The compiler takes Jython code (either pure Python or a mixture of Java classes and Python code) and compiles it into a '.java' file. This is a powerful tool for prototyping new classes because they can be developed and written in Python, compiled, and included in Java code. Once proven, the Python version can be seamlessly replaced with a pure Java version. The downside of Jython is that it tends to produce slower code than pure Java and is also more memory hungry. This is largely due to the fact that the compiler effectively embeds a Python interpreter in the output files.
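As a quick illustration (a minimal sketch; ArrayList is a standard JDK class), a Jython session can import and exercise Java classes directly:

from java.util import ArrayList   # a Java class, imported like a Python module

items = ArrayList()
items.add("tested from Jython")
print(items.get(0))               # prints: tested from Jython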

IronPython
IronPython is a version of Python written for the Microsoft .NET framework. 
.NET is not a single language system; rather, it depends on a common bytecode to which several languages can be compiled. The modules so produced can then be shared between languages. Thus, code written in IronPython can import modules written in C#, C++, Visual Basic, and several other .NET compatible languages. Similarly, IronPython modules can be imported by any of those other .NET languages. IronPython is an extremely appealing prospect for developers working on the .NET platform. 
Better still, an open source variant of .NET called Mono has been produced that runs under Linux, Mac OS X, and many other platforms, including mainframe computers and games consoles, while maintaining binary‐level compatibility with the Microsoft .NET implementation. As .NET becomes the de facto standard for building applications on Microsoft Windows, the availability of Python within that framework is a major boon for Python programmers.

The IronPython implementation supports most of the standard Python library as well as the .NET module system. Modules in .NET are called assemblies, but they are imported into IronPython in exactly the same way that ordinary Python modules are imported. Some issues exist because of the mismatch between Python's dynamic typing and the more static .NET type system, but once understood these can be worked around using helper features built into IronPython. Full documentation is provided in the IronPython documentation.
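For example (a minimal sketch; System.Xml is a standard .NET assembly, and the XML string is illustrative), a .NET assembly can be loaded and used from IronPython like this:

import clr
clr.AddReference("System.Xml")           # load the .NET assembly
from System.Xml import XmlDocument

doc = XmlDocument()
doc.LoadXml("<greeting>Hello from IronPython</greeting>")
print(doc.DocumentElement.InnerText)     # prints: Hello from IronPython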

Cython
Cython is significantly different from the other language integration options discussed here. It is, in effect, a separate language from Python but is highly compatible with it, describing itself as a super‐set of Python. This means Python programmers can easily learn Cython and take advantage of its special features. So what are these features that would make you want to use Cython? In short, speed. Cython is a compiler that produces C code that, in turn, can be compiled to native machine code and thereby has the potential to run much faster than its Python equivalent. This compiled code can then be imported back into regular Python just like any other module to provide the best of both worlds — easy Python development combined with C‐level speed of execution.
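A minimal sketch of that workflow, with hypothetical file names (the module here uses plain Python syntax, which Cython accepts as‐is; adding Cython's optional static type declarations is what unlocks the larger speedups):

# fastmath.pyx -- ordinary Python syntax, compiled to C by Cython
def total(values):
    s = 0
    for v in values:
        s += v
    return s

# setup.py -- builds the C extension module
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("fastmath.pyx"))

After running "python setup.py build_ext --inplace", the compiled module can be imported from regular Python with a plain "import fastmath".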

Oracle Data Integration

What is it?
Oracle Data Integrator is a comprehensive data integration platform that covers all data integration requirements: from high-volume, high-performance batch loads, to event-driven, trickle-feed integration processes, to SOA-enabled data services. This tool is used in BI and Big Data projects for extract-transform-load (ETL) processing. For more information, see the official documentation.

Why Jython?
As mentioned above, Jython (the Java version of Python, or better, Python running on the JVM) is an object-oriented scripting language. Jython scripts run on any platform that has a Java Virtual Machine (JVM). Oracle Data Integrator includes the Jython interpreter within its execution agent, so the agent is able to run scripts written in Jython.

Oracle Data Integrator users may write procedures or knowledge modules using Jython, and may mix Jython code with SQL, PL/SQL, OS calls, and so on. Thanks to Jython, the programming capabilities of Oracle Data Integrator are dramatically increased: it is possible to perform complex processing with strings, lists, and dictionaries, call FTP modules, manage files, integrate external Java classes, and more.

Note: To use Jython code in Knowledge Module (KM) or procedure commands, you must systematically set the technology of the command to Jython.

How to do it? (Using the Jython interpreter)
Jython programs can be interpreted for test purposes outside of Data Integrator, using the standard Jython interpreter. 

>> Start the Jython interpreter:
1. Start an OS prompt (console)
2. Go to the Oracle Data Integrator /bin directory.
3. Key in: jython
4. The interpreter is launched

>> Exiting the Jython interpreter:
1. Hit Ctrl+Z (^Z), then Enter
2. You exit the interpreter

>> Running Jython scripts:
1. Go to the Oracle Data Integrator /bin directory.
2. Type in: jython <script_path.py>
3. The script is executed

>> Using Jython in the procedures:
All Jython programs can be called from a procedure or a Knowledge Module.

>> Create a procedure that calls Jython:
1. In Designer, select a Folder in your Project and insert a new Procedure
2. Type the Name of the procedure
3. Add a command line in the Detail tab
4. In the command window, type the Name for this command
5. In the Command on Target tab, choose the Jython Technology from the list
6. In the Command text, type the Jython program to be executed, or use the expression editor
7. Click OK to apply the changes
8. Click Apply to apply the changes in the procedure window
9. In the Execution tab, click the Execute button and follow the execution results in the execution log.

The procedure that was created with this process can be added to a Package like any other procedure.
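For illustration only, the Command text in step 6 could hold a small Jython snippet such as the following (the log file path is a hypothetical placeholder):

import os

log = open('c:/temp/odi_procedure.log', 'a')    # placeholder path
log.write('Procedure step executed; files in c:/temp: %d\n' % len(os.listdir('c:/temp')))
log.close()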

Some Examples

>> Using FTP:
In some environments, it can be useful to use FTP (File Transfer Protocol) to transfer files between heterogeneous systems. Oracle Data Integrator provides an additional Jython module to further integrate FTP. The following example shows how to use this module: pull the '*.txt' files from '/home/odi' on the server ftp.myserver.com into the local directory 'c:\temp'.
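Since the original listing used Oracle Data Integrator's own FTP helper module, which is not reproduced here, the sketch below does the same job with the standard ftplib module (also available to Jython); the credentials are placeholders:

from ftplib import FTP
import os

ftp = FTP('ftp.myserver.com')
ftp.login('odi', 'odi_password')                     # placeholder credentials
ftp.cwd('/home/odi')
for name in ftp.nlst():                              # list remote file names
    if name.endswith('.txt'):
        local = open(os.path.join('c:\\temp', name), 'wb')
        ftp.retrbinary('RETR ' + name, local.write)  # download each *.txt file
        local.close()
ftp.quit()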



>> Using JDBC:
It can be convenient to use JDBC (Java Database Connectivity) to connect to a database from Jython. All Java classes in the CLASSPATH can be used directly in Jython. The following example shows how to use the JDBC API to connect to a database, run a SQL query, and write the result to a file. The reference documentation for Java is available at http://java.sun.com.
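A sketch of such a Jython program is shown below (this is not the post's original listing; the driver class, connection URL, credentials, table, and output path are placeholders):

from java.lang import Class
from java.sql import DriverManager

Class.forName('oracle.jdbc.OracleDriver')          # JDBC driver must be on the CLASSPATH
conn = DriverManager.getConnection(
    'jdbc:oracle:thin:@localhost:1521:ORCL', 'scott', 'tiger')
stmt = conn.createStatement()
rs = stmt.executeQuery('SELECT customer_id, customer_name FROM customers')

out = open('c:/temp/customers.txt', 'w')
while rs.next():                                   # walk the result set
    out.write('%s;%s\n' % (rs.getString(1), rs.getString(2)))
out.close()

rs.close()
stmt.close()
conn.close()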



Other code samples on these topics are available on my GitHub.

Thank you, see you in the next post!

Tuesday, June 9, 2015

Python and FTP

So, how do we write an FTP client using Python? A code example is coming up shortly, but first let's review the steps:
  1. Connect to server
  2. Log in
  3. Make service request(s) (and hopefully get response[s])
  4. Quit
When using Python’s FTP support, all you do is import the ftplib module and instantiate the ftplib.FTP class. All FTP activity—logging in, transferring files, and logging out—will be accomplished using your object.

Here is some Python pseudocode:
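A minimal sketch of those four steps (the host name and credentials are placeholders):

from ftplib import FTP

f = FTP('ftp.example.com')                  # 1. connect to the server
f.login('anonymous', 'guest@example.com')   # 2. log in (anonymously here)
# 3. make service requests: f.cwd(), f.dir(), f.retrbinary(), etc.
f.quit()                                    # 4. quit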


Soon we will look at a real example, but for now, let’s familiarize ourselves with methods from the ftplib.FTP class, which you will likely use in your code.

ftplib.FTP Class Methods

I outline the most popular methods below. The list is not comprehensive, but the ones presented here are those that make up the API for FTP client programming in Python. In other words, you don't really need to use the others, because they are either utility or administrative functions or are used internally by the API methods listed here.

login(user='anonymous', passwd='', acct='')
    Log in to the FTP server; all arguments are optional
pwd()
    Return the current working directory
cwd(path)
    Change the current working directory to path
dir([path[, ...[, cb]]])
    Display a directory listing of path; an optional callback cb is passed to retrlines()
nlst([path[, ...]])
    Like dir(), but returns a list of file names instead of displaying them
retrlines(cmd[, cb])
    Download a text file given FTP cmd, for example 'RETR filename'; an optional callback cb processes each line of the file
retrbinary(cmd, cb[, bs=8192[, ra]])
    Similar to retrlines() except for binary files; the callback cb, which processes each downloaded block (size bs defaults to 8K), is required
storlines(cmd, f)
    Upload a text file given FTP cmd, for example 'STOR filename'; an open file object f is required
storbinary(cmd, f[, bs=8192])
    Similar to storlines() but for binary files; an open file object f is required, and the upload block size bs defaults to 8K
rename(old, new)
    Rename remote file old to new
delete(path)
    Delete the remote file located at path
mkd(directory)
    Create a remote directory
rmd(directory)
    Remove a remote directory
quit()
    Close the connection and quit
The methods you will most likely use in a normal FTP transaction include login(), cwd(), dir(), pwd(), stor*(), retr*(), and quit(). There are more FTP object methods not listed in the table that you might find useful. For more detailed information about FTP objects, read the Python documentation available at http://docs.python.org/library/ftplib#ftp-objects.

An Interactive FTP Example

Using FTP from Python is so simple that you do not even have to write a script; you can do it all from the interactive interpreter and see the action and output in real time. Here is a sample session using the interactive Python shell in a terminal:
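A sketch of such a session follows (the host matches the example in the next section; the server replies shown are illustrative):

>>> from ftplib import FTP
>>> f = FTP('ftp.mozilla.org')                  # connect to the server
>>> f.login('anonymous', 'guest@example.com')   # anonymous login
'230 Login successful.'
>>> f.cwd('pub/mozilla.org/webtools')           # change to the distribution directory
'250 Directory successfully changed.'
>>> f.retrlines('LIST')                         # request a directory listing
...
>>> f.quit()                                    # close the connection
'221 Goodbye.'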


A Client Program FTP Example

I mentioned previously that an example script is not even necessary, because you can run everything interactively without getting lost in any code. Let's try one anyway. For example, suppose that you want a piece of code that downloads the latest copy of Bugzilla from the Mozilla Web site. The example that follows is what we came up with. It is a small application, but even so, you could probably run it interactively, too. Our application uses the FTP library to download the file and includes some error checking.

Example:
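Here is a sketch of the script, consistent with the output and explanation below (the line numbers cited in the explanation refer to the original listing rather than to this sketch):

#!/usr/bin/env python

import ftplib
import os
import socket

HOST = 'ftp.mozilla.org'
DIRN = 'pub/mozilla.org/webtools'
FILE = 'bugzilla-LATEST.tar.gz'

def main():
    # Connect to the FTP server; give up on any network failure.
    try:
        f = ftplib.FTP(HOST)
    except (socket.error, socket.gaierror) as e:
        print('ERROR: cannot reach "%s"' % HOST)
        return
    print('*** Connected to host "%s"' % HOST)

    # Log in anonymously; abort if the server refuses.
    try:
        f.login()
    except ftplib.error_perm:
        print('ERROR: cannot login anonymously')
        f.quit()
        return
    print('*** Logged in as "anonymous"')

    # Change to the distribution directory.
    try:
        f.cwd(DIRN)
    except ftplib.error_perm:
        print('ERROR: cannot CD to "%s"' % DIRN)
        f.quit()
        return
    print('*** Changed to "%s" folder' % DIRN)

    # Download the file in binary mode; remove a partial file on failure.
    try:
        f.retrbinary('RETR %s' % FILE, open(FILE, 'wb').write)
    except ftplib.error_perm:
        print('ERROR: cannot read file "%s"' % FILE)
        os.unlink(FILE)
    else:
        print('*** Downloaded "%s" to CWD' % FILE)
    f.quit()

if __name__ == '__main__':
    main()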


Be aware that this script is not automated, so it is up to you to run it whenever you want to perform the download, or if you are on a Unix-based system, you can set up a cron job to automate it for you. Another issue is that it will break if either the file or directory names change.

If no errors occur when you run this script, you get the following output:

$ getLatestFileByFTP.py
*** Connected to host "ftp.mozilla.org"
*** Logged in as "anonymous"
*** Changed to "pub/mozilla.org/webtools" folder
*** Downloaded "bugzilla-LATEST.tar.gz" to CWD
$

Line-by-Line Explanation


Lines 10–16
The initial lines of code import the necessary modules (mainly to grab exception objects) and set a few constants.


Lines 18–52
The main() function consists of various steps of operation: create an FTP object and attempt to connect to the FTP server (lines 20–24), returning (and quitting) on any failure. We attempt to log in anonymously and abort if unsuccessful (lines 27–31). The next step is to change to the distribution directory (lines 34–39), and finally, we try to download the file (lines 42–49).

For line 21 and all other exception handlers in this example where you’re saving the exception instance—in this case e—if you’re using Python 2.5 and older, you need to change the as to a comma, because this new syntax was introduced (but not required) in version 2.6 to help with 3.x migration. Python 3 only understands the new syntax shown in line 21.

On line 42, we pass a callback to retrbinary() that should be executed for every block of binary data downloaded. This is the write() method of a file object we create to write out the local version of the file. We are depending on the Python interpreter to adequately close our file after the transfer is done and to not lose any of our data. Although more convenient, I usually try to avoid using this style, because the programmer should be responsible for freeing resources directly allocated rather than depending on other code. In this case, we should save the open file object to a variable, say loc, and then pass loc.write in the call to ftp.retrbinary().

After the transfer has completed, we would call loc.close(). If for some reason we are not able to save the file, we remove the empty file to avoid cluttering up the file system (line 45). We should put some error-checking around that call to os.unlink(FILE) in case the file does not exist. Finally, to avoid another pair of lines that close the FTP connection and return, we use an else clause (lines 35–42).

Lines 46–49
This is the usual idiom for running a stand-alone script.

Conclusion:

FTP is not only useful for downloading client applications to build and/or use, but it can also be helpful in your everyday job for moving files between systems. For example, suppose that you are an engineer or a system administrator needing to transfer files. It is an obvious choice to use the scp or rsync commands when crossing the Internet boundary or pushing files to an externally visible server. However, there is a penalty when moving extremely large log or database files between internal computers on a secure network in that manner: the overhead of security, encryption, compression/decompression, and so on. If what you want to do is just build a simple FTP application that moves files for you quickly after hours, using Python is a great way to do it!

You can read more about FTP in the FTP Protocol Definition/Specification (RFC 959) at http://tools.ietf.org/html/rfc959 as well as on the www.networksorcery.com/enp/protocol/ftp.htm Web page. Other related RFCs include 2228, 2389, 2428, 2577, 2640, and 4217. To find out more about Python’s FTP support, you can start at http://docs.python.org/library/ftplib.

That's it, see you in the next post!



Retrieving web pages with urllib


The urllib library makes it very easy to retrieve web pages and process the data in Python. Using urllib, you can treat a web page much like a file: you simply indicate which web page you would like to retrieve, and urllib handles all of the HTTP protocol details. The code to read the text.txt file from the web using urllib is as follows:
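A sketch of that code (Python 2 style, to match the post's urllib.urlopen; the URL is assumed to point at the post's text.txt file):

import urllib

fhand = urllib.urlopen('http://binapratica.blogspot.com/text.txt')
for line in fhand:           # read the page line by line, like a file
    print(line.strip())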


Once the web page has been opened with urllib.urlopen, we can treat it like a file and read through it using a for loop. When the program runs, we see only the contents of the file. The headers are still sent, but the urllib code consumes them and returns only the data to us.

This is a test. This is a test. This is a test.
This is a test. This is a test. This is a test.
This is a test. This is a test. This is a test.
This is a test. This is a test. This is a test.

As an example, we can write a program to retrieve the data for text.txt and compute the frequency of each word in the file as follows:
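A sketch of that word-counting program (same Python 2 and URL assumptions as above):

import urllib

counts = dict()
fhand = urllib.urlopen('http://binapratica.blogspot.com/text.txt')
for line in fhand:
    for word in line.split():
        counts[word] = counts.get(word, 0) + 1   # tally each word
print(counts)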


Again, once we have opened the web page, we can read it like a local file.

Parsing HTML and scraping the web
One of the common uses of the urllib capability in Python is to scrape the web. Web scraping is when we write a program that pretends to be a web browser and retrieves pages and then examines the data in those pages looking for patterns.

Parsing HTML using Regular Expressions
As an example, a search engine such as Google will look at the source of one web page and extract the links to other pages and retrieve those pages, extracting links, and so on. Using this technique, Google spiders its way through nearly all of the pages on the web. Google also uses the frequency of links from pages it finds to a particular page as one measure of how “important” a page is and how highly the page should appear in its search results.

One simple way to parse HTML is to use regular expressions to repeatedly search for and extract substrings that match a particular pattern. Here is a simple web page:

<h1>The First Page</h1>
<p>
If you like, you can switch to the
<a href="http://binapratica.blogspot.com/page2.html">Second Page</a>.
</p>

We can construct a well-formed regular expression to match and extract the link values from the above text as follows:

href="http://.+?"

Our regular expression looks for strings that start with href="http:// followed by one or more characters (the .+?), followed by another double quote. The question mark in .+? indicates that the match is to be done in a "non-greedy" fashion instead of a "greedy" fashion: a non-greedy match tries to find the smallest possible matching string, while a greedy match tries to find the largest possible matching string. We then add parentheses to our regular expression to indicate which part of the matched string we would like to extract, producing the following program:
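A sketch of that program (Python 2 style; the page address is read from the user at the prompt):

import re
import urllib

url = raw_input('Enter - ')
html = urllib.urlopen(url).read()
links = re.findall('href="(http://.+?)"', html)   # extract only the quoted link text
for link in links:
    print(link)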


The findall regular expression method will give us a list of all of the strings that match our regular expression, returning only the link text between the double quotes. When we run the program, we get the following output:

python urlregex.py

Regular expressions work very nicely when your HTML is well formatted and predictable. But since there are a lot of “broken” HTML pages out there, a solution using only regular expressions might either miss some valid links or end up with bad data. This can be solved by using a robust HTML parsing library.

Parsing HTML using BeautifulSoup
There are a number of Python libraries which can help you parse HTML and extract data from the pages. Each of the libraries has its strengths and weaknesses and you can pick one based on your needs.
As an example, we will simply parse some HTML input and extract links using the BeautifulSoup library. You can download and install the BeautifulSoup code from: www.crummy.com

You can download and “install” BeautifulSoup or you can simply place the BeautifulSoup.py file in the same folder as your application. Even though HTML looks like XML and some pages are carefully constructed to be XML, most HTML is generally broken in ways that cause an XML parser to reject the entire page of HTML as improperly formed. BeautifulSoup tolerates highly flawed HTML and still lets you easily extract the data you need. We will use urllib to read the page and then use BeautifulSoup to extract the href attributes from the anchor (a) tags.
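A sketch of that program, written against the older BeautifulSoup module this post refers to (the import would differ slightly for the newer bs4 package):

import urllib
from BeautifulSoup import BeautifulSoup

url = raw_input('Enter - ')
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# Retrieve all of the anchor tags and print each href attribute
for tag in soup('a'):
    print(tag.get('href', None))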


The program prompts for a web address, then opens the web page, reads the data and passes the data to the BeautifulSoup parser, and then retrieves all of the anchor tags and prints out the href attribute for each tag. When the program runs it looks as follows:

python urllinks.py

You can use BeautifulSoup to pull out various parts of each tag as follows:
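A sketch consistent with the output shown below (same BeautifulSoup assumptions as above):

import urllib
from BeautifulSoup import BeautifulSoup

url = raw_input('Enter - ')
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

for tag in soup('a'):
    print('TAG: %s' % tag)                      # the whole anchor tag
    print('URL: %s' % tag.get('href', None))    # just the href attribute
    print('Content: %s' % tag.contents)         # the text inside the tag
    print('Attrs: %s' % tag.attrs)              # all attributes as pairs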


This produces the following output:

python urllink2.py
TAG: <a href="http://binapratica.blogspot.com/page2.html">Second Page</a>
URL: http://binapratica.blogspot.com/page2.html
Content: [u'\nSecond Page']
Attrs: [(u'href', u'http://binapratica.blogspot.com/page2.html')]

These examples only begin to show the power of BeautifulSoup when it comes to parsing HTML. See the documentation and samples at www.crummy.com for more detail.

Reading binary files using urllib
Sometimes you want to retrieve a non-text (or binary) file such as an image or video file. The data in these files is generally not useful to print out but you can easily make a copy of a URL to a local file on your hard disk using urllib. The pattern is to open the URL and use read to download the entire contents of the document into a string variable (img) and then write that information to a local file as follows:
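A sketch of that simple version (the image URL is taken from the curl example later in the post):

import urllib

img = urllib.urlopen('http://binapratica.blogspot.com/img1.jpg').read()
fhand = open('img1.jpg', 'wb')
fhand.write(img)      # write the whole document in one go
fhand.close()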


This program reads all of the data in at once across the network and stores it in the variable img in the main memory of your computer and then opens the file img1.jpg and writes the data out to your disk. This will work if the size of the file is less than the size of the memory of your computer. However if this is a large audio or video file, this program may crash or at least run extremely slowly when your computer runs out of memory. In order to avoid running out of memory, we retrieve the data in blocks (or buffers) and then write each block to your disk before retrieving the next block. This way the program can read any sized file without using up all of the memory you have in your computer.
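A sketch of the buffered version described above (same URL assumption):

import urllib

img = urllib.urlopen('http://binapratica.blogspot.com/img1.jpg')
fhand = open('img1.jpg', 'wb')
size = 0
while True:
    info = img.read(100000)   # read at most 100,000 characters at a time
    if len(info) < 1:
        break
    size = size + len(info)
    fhand.write(info)         # write the block before fetching the next one
fhand.close()
print('%d characters copied.' % size)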


In this example, we read only 100,000 characters at a time and then write those characters to the img1.jpg file before retrieving the next 100,000 characters of data from the web. This program runs as follows:

python curl2.py
568248 characters copied.

If you have a Unix or Macintosh computer, you probably have a command built into your operating system that performs this operation as follows:

curl -O http://binapratica.blogspot.com/img1.jpg

The curl command is short for “copy URL,” and these two examples are cleverly named curl1.py and curl2.py, since they implement similar functionality to the curl command for http://binapratica.blogspot.com/img1.jpg. There is also a curl3.py sample program that does this task a little more effectively, in case you actually want to use this pattern in a program you are writing.

Friday, July 25, 2014

About Indexes

Indexes are created to provide direct access to rows. An index is a tree structure. Indexes can be classified by their logical design or their physical implementation: logical classification is based on the application perspective, whereas physical classification is based on how the indexes are stored. Indexes can be partitioned or nonpartitioned. Large tables use partitioned indexes, which spread an index across multiple tablespaces, decreasing contention for index lookups and increasing manageability. An index may consist of a single column or multiple columns; it may be unique or nonunique. Some of these index types are outlined below.
  • Function-based indexes precompute the value of a function or expression of one or more columns and store it in an index. They can be created as a B-tree or as a bitmap, and they can improve the performance of queries performed on tables that rarely change.
  • Domain indexes are application specific and are created and managed by the user or the application. Single-column indexes can be built on text, spatial, scalar, object, or LOB data types.
  • B-tree indexes store a list of row IDs for each key. The structure of a B-tree index is similar to the one in SQL Server described above. The leaf nodes contain index entries that point to rows in a table, and the leaf blocks allow scanning the index in either ascending or descending order. The Oracle server maintains all indexes when insert, update, or delete operations are performed on a table.
  • Bitmap indexes are useful when columns have low cardinality and the table has a large number of rows. For example, a column may contain few distinct values, like Y/N for marital status or M/F for gender. A bitmap index is organized like a B-tree, but the leaf nodes store a bitmap for each key value instead of a list of row IDs. When changes are made to the key columns, the bitmaps must be modified.