Wednesday, July 20, 2011

MapReduce Vs. HPCC Platform

Let us assume that we have to solve the following problem:

"Find the number of times each URL is accessed by parsing a log file containing URL and date stamp pairs"

Solution 1

This is easy. Traverse the log file, one line at a time. Record the count for every unique URL that is encountered. This is easily accomplished in a Java program using a for loop and a hashmap.
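A minimal Java sketch of this approach (the input layout, a comma-separated url,date line, is assumed from the problem statement; names are mine):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class UrlCounter {
    // Traverse the log lines and record a count for every unique URL.
    public static Map<String, Integer> countUrls(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            String url = line.split(",")[0];    // first field is the URL
            counts.merge(url, 1, Integer::sum); // increment the running count
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
            "url1,2011-07-20", "url2,2011-07-20", "url1,2011-07-21");
        System.out.println(countUrls(log).get("url1")); // prints 2
    }
}
```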

Solution 1 works great if the input file is small. What if you are dealing with a large volume of data: giga, tera, or even petabytes? Sequential processing is not a practical option for dealing with Big Data.

Solution 2

In this solution, the input is split into multiple <key, value> pairs and fed to a map function in parallel. The map function converts each input pair into an intermediate <key, value> pair. In our example, the map function will output a <url, 1> pair for every input <key, value> pair, where url is the URL identified from the input and 1 stands for a single occurrence.

Before Map Step:

Key = 1 Value = <url1, date1>
Key = 2 Value = <url2, date2>
Key = 3 Value = <url1, date3>
Key = 4 Value = <url3, date4>

After Map Step:

Key = url1 Value = 1
Key = url2 Value = 1
Key = url1 Value = 1
Key = url3 Value = 1

The data is then sorted by the intermediate keys (<url1, 1>, <url2, 1> etc.) so that all occurrences of the same key are grouped together. Every unique key, along with all of its values, is then passed to a reduce function. The reduce function can be called multiple times for the same key. In our example, the reduce function simply counts the occurrences of each unique url, producing <url, total count> pairs.

After Reduce Step:
Key = url1 Value = 2
Key = url2 Value = 1
Key = url3 Value = 1


This process of solving the problem using Map and Reduce steps is called MapReduce. The MapReduce paradigm was made famous by Google, which used it to process large volumes of crawled data, spreading the map and reduce jobs across several worker nodes in a cluster to execute them in parallel and achieve high throughput. Another well known MapReduce framework is Hadoop.
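The map, sort/group and reduce steps described above can be sketched in plain Java (a single-process illustration of the paradigm, not a distributed implementation; all names are mine):

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class UrlMapReduce {
    // Map step: one input line becomes an intermediate <url, 1> pair.
    static Map.Entry<String, Integer> map(String line) {
        return new AbstractMap.SimpleEntry<>(line.split(",")[0], 1);
    }

    // Sort/group step: all occurrences of the same key are grouped together.
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return grouped;
    }

    // Reduce step: sum the grouped values to get <url, total count>.
    static int reduce(List<Integer> values) {
        int total = 0;
        for (int v : values) total += v;
        return total;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("url1,d1", "url2,d2", "url1,d3", "url3,d4");
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : input) intermediate.add(map(line));
        group(intermediate).forEach((url, ones) ->
            System.out.println(url + " = " + reduce(ones)));
        // prints: url1 = 2, url2 = 1, url3 = 1 (one per line)
    }
}
```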

The major drawback of the MapReduce paradigm is that it is intended for batch oriented jobs. It is suitable for ETL (Extraction, Transformation and Loading) but not for online query processing. So Hadoop and its extensions, Hive and HBase, have been built on top of a batch oriented framework that is not really meant for online query processing. The other drawback is that every task needs to be expressed in terms of Map and Reduce steps so that the work can be distributed across nodes; in most cases it takes several Map and Reduce steps to solve a single problem.

Solution 3

SELECT url, count(*) FROM urllog GROUP BY url

How easy is that? No map or reduce logic. SQL, with its declarative nature, lets us concentrate on the What. However, SQL databases do not lend themselves to Big Data processing, which typically involves several clustered nodes working together to produce an end result.

HPCC Systems' ECL is specifically designed to overcome these limitations. ECL is a truly declarative language, somewhat similar to SQL, that lets you solve the problem by expressing the What rather than the How. The complexity of clustering is well encapsulated by the ECL language and hence is never really exposed to the programmer.

Simple ECL Code to find the count for each unique URL:
//Declare the input record structure. Assume an input CSV file of URL,Date fields
rec := RECORD
  STRING50 url;
  STRING50 date;
END;

//Declare the source of the data
urllog := DATASET('~tutorial::AC::Urllog',rec,CSV);

//Declare the record structure to hold the aggregates for each unique URL
grouprec := RECORD
  urllog.url;
  groupCount := COUNT(GROUP);
END;

//The TABLE function is equivalent to an SQL SELECT command
//The following declaration creates a Cross Tab aggregate equivalent to the SELECT shown above
RepTable := TABLE(urllog,grouprec,url);

//Output the new record set
OUTPUT(RepTable);

Sample Input:

Sample Output:

The HPCC Platform will distribute the work across the nodes based on the most optimal path that is determined at runtime.

As compared to MapReduce frameworks like Hadoop, the HPCC Platform keeps it simple: let the platform determine the best work distribution across nodes, so that the developer solves the What rather than worrying about How the work is distributed. Further, the HPCC Platform comprises two components that are each optimized to solve specific problems. Thor is used as the ETL (Extraction, Transformation, Loading and Linking) engine and Roxie is used as the online query processing engine.

Monday, July 11, 2011

ECL - Part III (Declarative, Attributes (aka Definitions) and Actions)

ECL is a declarative programming language. In declarative programming, we express the logic of a computation without describing its control flow or state. For example, in Java you would write:

x = 1;
y = 2;
System.out.println(x + y);

Java is an imperative programming language, where the sequence of steps you write dictates to the compiler in what order the code has to be executed. In the example, x = 1 is executed first, y = 2 next, and then System.out.println. Here, the programmer controls the sequence and state.

For the same code, in ECL you would:

x:= 1; //An attribute declaration
y:= 2;
output(x+y); //An action

The code looks similar to the Java code. The difference is that the steps x := 1 and y := 2 neither perform a state assignment operation nor define the control flow. They simply mean: when there is a need to use x, substitute the value 1. Until there is a need, do not do anything.
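That substitute-on-demand behaviour can even be mimicked in Java with a lazy supplier (a rough analogy only, not how the ECL compiler actually works):

```java
import java.util.function.Supplier;

public class LazyDemo {
    public static void main(String[] args) {
        // Declarations: nothing is evaluated or assigned yet.
        Supplier<Integer> x = () -> 1;
        Supplier<Integer> y = () -> 2;

        // The action: the values are substituted only now, when they are needed.
        System.out.println(x.get() + y.get()); // prints 3
    }
}
```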

Well, you might now ask the question - Is being a declarative programming language really important?

The answer is "YES". In parallel computing, it is best left to the platform to determine the optimal sequence of (or parallel) steps to execute to reach the end goal. Performance and scale are everything.

Attributes and Actions

  • Attribute (aka Definition): A declaration such as x := 1; is an attribute declaration, where x is the attribute. It is important to note that an attribute is not equivalent to a variable state assignment; it is a declaration, and a declaration postpones evaluation until the attribute is actually used.
  • Action: An action such as output(x+y); instructs the platform to execute code that is expected to produce a result.
In ECL, every statement is either an attribute (aka definition) or an action. Declarative thinking helps Big Data developers focus on the problem solution (the What) rather than on the sequence of steps, parallel programming techniques and state assignment (the How). Being declarative is another reason why ECL is a powerful language for Big Data programming.

Friday, July 08, 2011

ECL - Part II (ECL IDE Basics and Transformations)

In Part I of the ECL blog series we were introduced to the HPCC platform and learned how to load a data file and display its contents using ECL. In Part II, we will continue from where we left off and learn about transformations in ECL. This will give you a glimpse of the power of the ECL language and why it is the best language for data manipulation (Big or Small).

Before we begin to code transformations, let us spend some time understanding the features/views available in the ECL IDE, the tool used to write ECL code:

  • Builder - Use the builder to edit your ECL code, build and submit it for execution.
  • Submit/Compile - Is used to compile an ECL code file and submit it as a job for execution on the cluster
  • Output Results - Executed ECL code results can be viewed here.
  • Syntax Errors - Check if your ECL code is free of syntax errors using the compile option (F7). The Syntax Errors view displays design time syntax errors.
  • Runtime Errors - The error log view  displays the errors that occur when ECL code is executed on the cluster.
  • Workunits - Displays all the ECL jobs that have been executed on a cluster. It is conveniently categorized by days, months and years. 
  • Repository - This is synonymous with projects in other IDEs. It shows the location of files on local storage. For me, it can be found on the hard disk at "C:\Users\Public\Documents\HPCC Systems". It can be configured to point elsewhere by changing the IDE preferences.
  • Workspace - Is a logical work environment that can be used to enhance your programming experience. 
  • Datasets - Lists the available data sets on the cluster. It is convenient to select a data set and copy its label for use in your code.
Read more about the ECL IDE and Client Tools here

Now back to coding transformations.  For the transformation example, we are going to work with the OriginalPerson dataset from Part I and transform the data to create a new TransformedPerson dataset, which is a copy of the OriginalPerson dataset with the First, Middle and Last names converted to upper case.

Open a new builder window (CTRL+N) and type in the following code: 

//Import the standard library for the string functions used below
IMPORT Std;

//Declare the format of the source and destination record
Layout_People := RECORD
  STRING15 FirstName;
  STRING25 LastName;
  STRING15 MiddleName;
  STRING5 Zip;
  STRING42 Street;
  STRING20 City;
  STRING2 State;
END;

//Declare reference to source file
File_OriginalPerson :=
  DATASET('~tutorial::AC::OriginalPerson', Layout_People, THOR);

//Write the Transform code
Layout_People toUpperPlease(Layout_People pInput) := TRANSFORM
  SELF.FirstName := Std.Str.ToUpperCase(pInput.FirstName);
  SELF.LastName := Std.Str.ToUpperCase(pInput.LastName);
  SELF.MiddleName := Std.Str.ToUpperCase(pInput.MiddleName);
  SELF.Zip := pInput.Zip;
  SELF.Street := pInput.Street;
  SELF.City := pInput.City;
  SELF.State := pInput.State;
END;

//Apply the transformation to each record
TransformedPersonDataset :=
  PROJECT(File_OriginalPerson, toUpperPlease(LEFT));

//Output the new Dataset
OUTPUT(TransformedPersonDataset);

The important step is the call to the PROJECT function. In this particular case it means:

"Transform Dataset File_OriginalPerson to TransformedPersonDataset By applying  transformation toUpperPlease for each record of LEFT dataset = File_OriginalPerson"

LEFT refers to the input record set, analogous to the LEFT side of SQL join syntax. In this case, LEFT is File_OriginalPerson.
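Conceptually, PROJECT just maps the transform over every record of the input dataset. A rough Java analogy, with a trimmed-down two-field record (names are mine):

```java
import java.util.ArrayList;
import java.util.List;

public class ProjectDemo {
    record Person(String firstName, String lastName) {}

    // Analogue of toUpperPlease: build a new record from the input (LEFT) record.
    static Person toUpper(Person left) {
        return new Person(left.firstName().toUpperCase(), left.lastName().toUpperCase());
    }

    // Analogue of PROJECT: apply the transform to each record of the dataset.
    static List<Person> project(List<Person> dataset) {
        List<Person> out = new ArrayList<>();
        for (Person p : dataset) out.add(toUpper(p));
        return out;
    }

    public static void main(String[] args) {
        List<Person> in = List.of(new Person("john", "doe"), new Person("jane", "roe"));
        System.out.println(project(in).get(0).firstName()); // prints JOHN
    }
}
```

The difference, of course, is that PROJECT runs the transform in parallel across the cluster nodes rather than in a single loop.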

Compile and Submit the code. View the results in the Output Results view.

This is some powerful code. ECL lets you solve complex data manipulation problems with simple and concise code. And this is only the tip of the iceberg: read the ECL Programmer's Guide and the ECL Language Reference to discover ECL's immense power.

Wednesday, July 06, 2011

ECL - Part I (loading data)

One of the cool features of the HPCC platform is its ability to extract, transform and load Big Data (terabytes to petabytes). At the core of the HPCC platform is the powerful and simple ECL (Enterprise Control Language) programming language. Part I of the ECL blog series shows you how to load data and display its contents; in essence, how to get the data ready for further manipulation.

Before we begin to get our feet wet and drool over the code, it is imperative to spend a few minutes understanding the HPCC platform architecture. The following high level architecture diagram provides you with a view of the important components:

THOR - The Thor cluster performs the Loading, Extraction and Transformation of the data. It is used to load Big Data (unstructured, semi-structured or structured), transform it and optimize it for querying.

ROXIE - The Roxie cluster is optimized to perform queries with very high concurrency. Typically, processed data from a Thor cluster is exported to a Roxie cluster to enable real time, fast and highly concurrent query processing.

ESP - The ESP provides you with a simple web services interface that is used to access the Roxie queries.

Now back to the coding. How do we load the data? The HPCC Platform has a built in utility called ECL Watch. Data loading is one of the functions (among many) that ECL Watch provides. The following step by step tutorial takes you through the process:

1) For the sake of sanity, we will assume that you have been able to download and install the HPCC VM. If not, please proceed to do so and read the HPCC VM Install Guide.

2) Now download the sample data file containing person information - first name, last name etc. Once downloaded, extract the contents of the zip file and store it at a known location.

3) Point your browser to ECL Watch and use the upload/download file link to upload the file to the landing zone.

Browse to the sample person file you downloaded, select and upload.

4) Spray the file contents to all the nodes across the cluster. Again, use the ECL Watch utility to do this.

The label is a logical name; AC stands for my initials, but it really can be anything. Choose the file that you uploaded in step 3, set the record size to 124 (the record size of the person file) and submit for it to be sprayed.

If successful, you should see a results page that looks like:

Click on the View Progress to view the progress of the spray

5) Download, install and configure the ECL IDE if you have not done so already.

ECL IDE preferences:

Enter your VM IP in the "Server" input.

Save the preference

6) Now, you are ready to write ECL Code. While in the ECL IDE, press CTRL+N to open a new work unit. Type in the following code:

Layout_People := RECORD
STRING15 FirstName;
STRING25 LastName;
STRING15 MiddleName;
STRING5 Zip;
STRING42 Street;
STRING20 City;
STRING2 State;
END;

File_OriginalPerson :=
DATASET('~tutorial::AC::OriginalPerson', Layout_People, THOR);
//Here change the AC to whatever you used to name the label in step 4

OUTPUT(File_OriginalPerson);

7) Syntax check (F7) and Submit/Execute the code to see the following results:

That is it for now. In my next blog post, we will be looking at some of the features of the ECL IDE and ECL Language. In the process, we will also expand upon the person example and learn about transformations, indexing and sorting.