Wednesday, October 26, 2011

ECL - A Practical Language for a Practical Data Programmer

Suppose you have a list of objects that contain Person information: First Name, Last Name, Phone Number, Date of Birth and Address. Each object must be transformed into a new object that contains a Name field, Phone Number, Date of Birth and Address, where the Name field is the full name of the person and the remaining fields are copies of the corresponding fields in the original object.

If we attempt to write the solution in Java, the function to perform the conversion for each Person object will look something like:

public PersonEx convert(Person person) {
 PersonEx personEx = new PersonEx();
 // Full name: first name, then last name
 personEx.name = person.firstName + " " + person.lastName;
 personEx.phoneNumber = person.phoneNumber;
 personEx.address = person.address;
 personEx.dateOfBirth = person.dateOfBirth;
 return personEx;
}

Now, the equivalent code in ECL is:

PersonEx convert(Person person) := TRANSFORM 
 SELF.name := person.firstName + ' ' + person.lastName;
 SELF := person; 
end;

The ECL code is much shorter but achieves the same objective. Let me explain. ECL automates programming steps as much as possible, based on the information that is available to the compiler. In the above example, ECL knows that the return value is a record of type "PersonEx". Hence, the keyword "SELF" is equivalent to "PersonEx self = new PersonEx();" in Java. The instance creation and association are implicit. This eliminates the need to type in all the extra code, which greatly simplifies your programming task.

There is some more implicit magic here. What does the statement "SELF := person;" do? You guessed right. It is equivalent to explicitly writing the following code:

SELF.phoneNumber := person.phoneNumber;
SELF.address := person.address;
SELF.dateOfBirth := person.dateOfBirth;


ECL compares the two record structures and automatically initializes every field of the result that has a same-named field in the source and has not already been initialized.

ECL provides many more features that make our programming lives easier. The following are some examples:

PROJECT(persons, convert(LEFT));

Here, "LEFT" refers to the current record in "persons". "PROJECT" declares that the "convert" transformation is to be applied to every record in the dataset "persons". Implementing something similar in Java would need an iterator, a loop and several variable initialization steps.
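
For comparison, here is a minimal sketch of what the PROJECT call could look like when written by hand in Java. The class name ProjectSketch, the field types (plain strings, including the date of birth) and the sample values are assumptions made for illustration; only the field names come from the example above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ProjectSketch {
    // Hypothetical input type with the Person fields described in this post.
    static class Person {
        final String firstName, lastName, phoneNumber, address, dateOfBirth;

        Person(String firstName, String lastName, String phoneNumber,
               String address, String dateOfBirth) {
            this.firstName = firstName;
            this.lastName = lastName;
            this.phoneNumber = phoneNumber;
            this.address = address;
            this.dateOfBirth = dateOfBirth;
        }
    }

    // Hypothetical output type: a single name field plus the copied fields.
    static class PersonEx {
        String name, phoneNumber, address, dateOfBirth;
    }

    // Same role as the ECL TRANSFORM: build the full name, copy the rest.
    static PersonEx convert(Person person) {
        PersonEx personEx = new PersonEx();
        personEx.name = person.firstName + " " + person.lastName;
        personEx.phoneNumber = person.phoneNumber;
        personEx.address = person.address;
        personEx.dateOfBirth = person.dateOfBirth;
        return personEx;
    }

    // Rough analogue of PROJECT(persons, convert(LEFT)):
    // apply convert to every record and collect the results.
    static List<PersonEx> project(List<Person> persons) {
        List<PersonEx> result = new ArrayList<>(persons.size());
        for (Person person : persons) {
            result.add(convert(person));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Person> persons = Arrays.asList(
            new Person("Jason", "Smith", "555-0100", "1 Main St", "1980-01-01"),
            new Person("Mary", "Jones", "555-0101", "2 Oak Ave", "1975-06-15"));
        for (PersonEx personEx : project(persons)) {
            System.out.println(personEx.name);
        }
    }
}
```

The loop, the result list and the explicit field copies are exactly the bookkeeping that the single ECL line leaves to the compiler.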

OUTPUT(persons(firstName = 'Jason'));

This action outputs the records of "persons" that satisfy the filter "firstName = 'Jason'".
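
A comparable filter in Java might be sketched as follows. The class FilterSketch, the helper filterByFirstName and the sample records are hypothetical names chosen for this illustration, and the Person type is cut down to two fields to keep the sketch short.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class FilterSketch {
    // Hypothetical, reduced Person type: only the fields needed for the filter.
    static class Person {
        final String firstName, lastName;

        Person(String firstName, String lastName) {
            this.firstName = firstName;
            this.lastName = lastName;
        }
    }

    // Rough analogue of persons(firstName = 'Jason'):
    // keep only the records whose firstName equals the given value.
    static List<Person> filterByFirstName(List<Person> persons, String firstName) {
        return persons.stream()
                .filter(person -> firstName.equals(person.firstName))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Person> persons = Arrays.asList(
            new Person("Jason", "Smith"),
            new Person("Mary", "Jones"),
            new Person("Jason", "Brown"));
        for (Person person : filterByFirstName(persons, "Jason")) {
            System.out.println(person.firstName + " " + person.lastName);
        }
    }
}
```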

As you can see in the above examples, ECL has been designed to be simple and practical. It enables data programmers to quickly turn their thoughts into programming tasks by keeping the syntax simple and minimal. Java is used in the examples to show you how ECL's contrasting style can be used effectively to solve data processing problems. It does not mean that Java is not a practical language. It simply means that ECL is abstract enough to shield the programmer from complex language structures such as those in Java.


Tuesday, October 18, 2011

Parsing a Web Server Log File on Thor

One of the major benefits of Thor, the HPCC Systems data refinery component, is its ability to process large volumes of data, such as log file data, without the need for pre-processing outside of the platform. In this blog entry you will be introduced to ECL's powerful regular expression library, which makes data extraction look easy. ECL is the programming language used to write ETL (Extract, Transform and Load) jobs on Thor.

Understanding the Input Content

Typical log file data can be classified as semi-structured content: most log files have records and columns that can be extracted. This is in contrast to unstructured data, where record and column boundaries are hard to identify.

A web server log file is an example of a semi-structured file. A typical line in a web server log file that conforms to the standard common log format reads as:

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

ECL, the powerful distributed data programming language, has built-in libraries to process both semi-structured data (log files, etc.) and unstructured data (emails, HTML content, Twitter feeds, etc.). This enables you to maximize the parallel processing capability of the platform right out of the gate. No holding back.

Designing for Extraction

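Let us identify the tokens that are present in a web log file. This will enable us to code a token parser to successfully extract web server log data from a file that implements the common log format. Using the sample line above, the tokens are:

IP address: 127.0.0.1
Identity: -
Authenticated user: frank
Date and time: [10/Oct/2000:13:55:36 -0700]
Request: "GET /apache_pb.gif HTTP/1.0"
Status code: 200
Bytes returned: 2326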

Writing ECL Code

We will now write an ECL program to parse a file that contains lines of data where each line is formatted as shown above.

Step 1 - Declare record structure to store the raw input

//Declare the record to store each record from
//the raw input file. Since the file has lines of log data,
//the record will need one string field to store each line.
RawLayout := record
string rawTxt;
end;

//Declare the file. In this example,
//for simplicity, the content is shown inline
fileRaw := dataset([{'127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'}], RawLayout);


Step 2 - Declare regular expression patterns to parse the tokens from each raw input record

Beware! This is where you would need to pick up a regular expressions book (or web site) if you are not already familiar with the topic.


pattern alphaFmt := pattern('[a-zA-Z]')+;
pattern alphaNumbFmt := pattern('[a-zA-Z0-9]')+;
pattern sepFmt := ' '+;
pattern numFmt := pattern('[0-9]')+;
pattern ipFmt := numFmt '.' numFmt '.' numFmt '.' numFmt;
pattern identFmt := '-';
pattern authuserFmt := alphaNumbFmt;
pattern hoursFromGMT := pattern('[\\-\\+]') numFmt;
pattern yearFmt := numFmt;
pattern monthFmt := alphaNumbFmt;
pattern dayFmt := numFmt;
pattern hoursFmt := numFmt;
pattern minutesFmt := numFmt;
pattern secondsFmt := numFmt;
pattern dateFmt := '[' dayFmt '/' monthFmt '/' yearFmt ':' hoursFmt ':' minutesFmt ':' secondsFmt ' ' hoursFromGMT ']';
pattern cmdFmt := alphaFmt;
pattern notQuoteFmt := pattern('[^"]')*;
pattern paramsFmt := opt('?' notQuoteFmt);
pattern urlFmt := pattern('[^"\\?]')*;
pattern httpMethodFmt := 'HTTP/' numFmt '.' numFmt;
pattern requestFmt := '"' cmdFmt urlFmt paramsFmt httpMethodFmt '"';
pattern statusFmt := numFmt;
pattern bytesFmt := numFmt;

pattern line := ipFmt sepFmt identFmt sepFmt authUserFmt sepFmt dateFmt sepFmt requestFmt sepFmt statusFmt sepFmt bytesFmt;


The declarations above are easy to follow if you know your regular expressions. These pattern declarations are used by the parser to extract tokens from the raw input record and map them to a (structured) model that can then be used to perform further processing.

The pattern line declaration specifies how each line in the file should be parsed and interpreted as tokens.

Step 3 - Declare the new record that will contain the extracted tokens


LogLayout := record
string ip := matchtext(ipFmt);
string authUser := matchtext(authuserFmt);
string date := matchtext(dateFmt);
string request := matchtext(requestFmt);
string status := matchtext(statusFmt);
string bytes := matchtext(bytesFmt);
end;


The matchtext() function is used to extract the specific token you are interested in from the parser.

Step 4 - Parse the file and output the result

logFile := parse(fileRaw,
rawTxt,
line,
LogLayout,first);

output(logFile);


The parse function accepts as parameters the dataset to parse, the field in RawLayout to parse, the pattern to apply to each line, the output record layout and a flag. The flag value first indicates "Only return a row for the first match starting at a particular position".
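
For readers who think in Java, the extraction step could be sketched with java.util.regex as shown below. This is an illustration under stated assumptions, not a translation of the ECL program: the class LogLineSketch and the type LogEntry are hypothetical, the date and request tokens are matched more loosely than the ECL patterns match them, and the whole line is required to match.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogLineSketch {
    // Simplified pattern for one common log format line.
    // Each named group plays the role of one ECL token pattern.
    static final Pattern LINE = Pattern.compile(
        "(?<ip>\\d+\\.\\d+\\.\\d+\\.\\d+) +"     // like ipFmt
        + "- +"                                   // like identFmt
        + "(?<authUser>[a-zA-Z0-9]+) +"           // like authuserFmt
        + "(?<date>\\[[^\\]]+\\]) +"              // looser than dateFmt
        + "(?<request>\"[^\"]*\") +"              // looser than requestFmt
        + "(?<status>\\d+) +"                     // like statusFmt
        + "(?<bytes>\\d+)");                      // like bytesFmt

    // Hypothetical counterpart of LogLayout: every token is kept as text.
    static class LogEntry {
        final String ip, authUser, date, request, status, bytes;

        LogEntry(String ip, String authUser, String date,
                 String request, String status, String bytes) {
            this.ip = ip;
            this.authUser = authUser;
            this.date = date;
            this.request = request;
            this.status = status;
            this.bytes = bytes;
        }
    }

    // Returns the extracted tokens, or an empty Optional for a malformed line.
    static Optional<LogEntry> parse(String rawTxt) {
        Matcher matcher = LINE.matcher(rawTxt);
        if (!matcher.matches()) {
            return Optional.empty();
        }
        return Optional.of(new LogEntry(
            matcher.group("ip"), matcher.group("authUser"), matcher.group("date"),
            matcher.group("request"), matcher.group("status"), matcher.group("bytes")));
    }

    public static void main(String[] args) {
        String rawTxt = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
            + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326";
        parse(rawTxt).ifPresent(entry ->
            System.out.println(entry.ip + " " + entry.status + " " + entry.bytes));
    }
}
```

The named groups do the job that matchtext() does in the ECL record: each one pulls a single token out of the matched line.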

Step 5 - Submit the program to Thor and view the results


Once the program is submitted (run), the output should look like:

                                                                                


There are a few variations of this program that you can implement.

Some Variations

What if the data source is actually a sprayed file?

fileRaw := dataset('~.::myfile',RawLayout,csv(separator('')));

You will simply replace the fileRaw declaration with the one above. "~.::myfile" is the logical name of the sprayed file.

How can you record error lines that do not match the specified pattern, in other words, malformed input?

ErrorLayout := record
string t := fileRaw.rawTxt;
end;

e := parse(fileRaw,
rawTxt,
line,
ErrorLayout,NOT MATCHED ONLY);

output(e);
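
The same idea, collecting only the lines that fail to match, might look like this in Java. As before, this is a sketch under assumptions: the class UnmatchedSketch and the method unmatched are hypothetical, and the regular expression is a simplified stand-in for the ECL line pattern.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class UnmatchedSketch {
    // Simplified stand-in for the ECL "line" pattern.
    static final Pattern LINE = Pattern.compile(
        "\\d+\\.\\d+\\.\\d+\\.\\d+ +- +[a-zA-Z0-9]+ +"
        + "\\[[^\\]]+\\] +\"[^\"]*\" +\\d+ +\\d+");

    // Rough analogue of the NOT MATCHED ONLY flag:
    // return only the raw lines that the pattern does not match.
    static List<String> unmatched(List<String> rawLines) {
        List<String> errors = new ArrayList<>();
        for (String rawTxt : rawLines) {
            if (!LINE.matcher(rawTxt).matches()) {
                errors.add(rawTxt);
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        List<String> rawLines = Arrays.asList(
            "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
                + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326",
            "a malformed line");
        for (String rawTxt : unmatched(rawLines)) {
            System.out.println(rawTxt);
        }
    }
}
```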

As you have seen, the ECL language and Thor provide you with a powerful framework to accomplish your ETL tasks. You can learn more about Thor and ECL at http://hpccsystems.com.