One of the major benefits of Thor, the HPCC Systems data refinery component, is its ability to process large volumes of data, such as data from log files, without the need for pre-processing outside the platform. In this blog entry you will be introduced to ECL's powerful regular expression library, which makes data extraction look easy. ECL is the programming language used to program ETL (Extract, Transform and Load) jobs on Thor.
Understanding the Input Content
Typical log file data can be classified as semi-structured content: most log files have records and columns that can be extracted, unlike unstructured data, where record and column boundaries are hard to identify.
A web server log file is an example of a semi-structured file. A typical line in a web server log file that conforms to the standard common log format reads as follows:
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
ECL, the powerful distributed data programming language, has built-in libraries to process both semi-structured data (log files, etc.) and unstructured data (emails, HTML content, Twitter feeds, etc.). This enables you to maximize the parallel processing capability of the platform right out of the gate. No holding back.
Designing for Extraction
Let us identify the tokens that are present in a web log file. This will enable us to code a token parser to successfully extract web server log data from a file that implements the common log format. For the sample line shown above, the tokens break down as follows:
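- IP address of the client: 127.0.0.1
- Client identity per RFC 1413 ('-' when not available): -
- Authenticated user id: frank
- Date, time, and timezone of the request: [10/Oct/2000:13:55:36 -0700]
- The request line, in quotes: "GET /apache_pb.gif HTTP/1.0"
- HTTP status code returned to the client: 200
- Size of the response, in bytes: 2326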
Writing ECL Code
We will now write an ECL program to parse a file that contains lines of data where each line is formatted as shown above.
Step 1 - Declare record structure to store the raw input
//Declare the record to store each record from
//the raw input file. Since the file has lines of log data,
//the record will need one string field to store each line.
RawLayout := record
    string rawTxt;
end;
//Declare the file. In this example,
//for simplicity, the content is shown inline
fileRaw := dataset([{'127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'}], RawLayout);
Step 2 - Declare regular expression patterns to parse the tokens from each raw input record
Beware! This is where you will need to pick up a regular expressions book (or website) if you are not already familiar with the topic.
//Basic building blocks: alphabetic, alphanumeric, separator and numeric tokens
pattern alphaFmt := pattern('[a-zA-Z]')+;
pattern alphaNumbFmt := pattern('[a-zA-Z0-9]')+;
pattern sepFmt := ' '+;
pattern numFmt := pattern('[0-9]')+;
//An IP address is four numbers separated by dots
pattern ipFmt := numFmt '.' numFmt '.' numFmt '.' numFmt;
pattern identFmt := '-';
pattern authUserFmt := alphaNumbFmt;
//Date components, e.g. [10/Oct/2000:13:55:36 -0700]
pattern hoursFromGMT := pattern('[\\-\\+]') numFmt;
pattern yearFmt := numFmt;
pattern monthFmt := alphaNumbFmt;
pattern dayFmt := numFmt;
pattern hoursFmt := numFmt;
pattern minutesFmt := numFmt;
pattern secondsFmt := numFmt;
pattern dateFmt := '[' dayFmt '/' monthFmt '/' yearFmt ':' hoursFmt ':' minutesFmt ':' secondsFmt ' ' hoursFromGMT ']';
//Request components, e.g. "GET /apache_pb.gif HTTP/1.0"
pattern cmdFmt := alphaFmt;
pattern notQuoteFmt := pattern('[^"]')*;
pattern paramsFmt := opt('?' notQuoteFmt);
pattern urlFmt := pattern('[^"\\?]')*;
pattern httpMethodFmt := 'HTTP/' numFmt '.' numFmt;
pattern requestFmt := '"' cmdFmt urlFmt paramsFmt httpMethodFmt '"';
pattern statusFmt := numFmt;
pattern bytesFmt := numFmt;
//A complete log line is the tokens above joined by separators
pattern line := ipFmt sepFmt identFmt sepFmt authUserFmt sepFmt dateFmt sepFmt requestFmt sepFmt statusFmt sepFmt bytesFmt;
The declarations above are easy to follow if you know your regular expressions. These pattern declarations are used by the parser to extract tokens from the raw input record and map them to a (structured) model that can then be used to perform further processing.
The pattern line declaration specifies how each line in the file should be parsed and interpreted as tokens.
Step 3 - Declare the new record that will contain the extracted tokens
LogLayout := record
    string ip := matchtext(ipFmt);
    string authUser := matchtext(authUserFmt);
    string date := matchtext(dateFmt);
    string request := matchtext(requestFmt);
    string status := matchtext(statusFmt);
    string bytes := matchtext(bytesFmt);
end;
The matchtext() function extracts the text matched by a specific pattern from the parser. For the sample line above, for instance, matchtext(ipFmt) returns '127.0.0.1'.
Step 4 - Parse the file and output the result
logFile := parse(fileRaw,
                 rawTxt,
                 line,
                 LogLayout,
                 first);
output(logFile);
The parse function accepts as parameters the file to parse, the field in RawLayout to parse, the token pattern for each line, the output record layout, and a flag. The flag value first indicates "Only return a row for the first match starting at a particular position".
Step 5 - Submit the program to Thor and view the results
Once the program is submitted (run), the output should look like:
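For the inline sample record used above, the parsed row contains one field per token (the exact formatting depends on how you view the workunit results):
ip:       127.0.0.1
authUser: frank
date:     [10/Oct/2000:13:55:36 -0700]
request:  "GET /apache_pb.gif HTTP/1.0"
status:   200
bytes:    2326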
There are a few variations of this program that you can implement.
Some Variations
What if the data source is actually a sprayed file?
fileRaw := dataset('~.::myfile',RawLayout,csv(separator('')));
You simply replace the fileRaw declaration with the one above. "~.::myfile" is the logical name of the sprayed file, and the empty separator tells the CSV reader to place each whole line into the single rawTxt field.
How can you record error lines that do not match the specified pattern? In other words, how do you capture malformed input?
ErrorLayout := record
    string t := fileRaw.rawTxt;
end;
e := parse(fileRaw,
           rawTxt,
           line,
           ErrorLayout,
           NOT MATCHED ONLY);
output(e);
As you have seen, the ECL language and Thor provide you with a powerful framework for accomplishing your ETL tasks. You can learn more about Thor and ECL at http://hpccsystems.com.