Monday, December 05, 2016

ECL as a Data Flow Language

The HPCC Systems platform provides two levels of parallelism:

1. Data partitioned parallel processing - Data is partitioned into parts, and the parts are distributed across multiple slave nodes. This partitioning allows an ECL operation to execute simultaneously on every slave node, with each node operating on its own data parts (a minimal sketch of this follows the list).

2. Data flow parallel processing - The ECL program is compiled into an optimized data flow graph (also known as an explain plan or execution plan), with each node in the graph representing an operation.
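To make the first level concrete, here is a minimal sketch of data partitioned processing. It is purely illustrative (the record layout, field names, and data are assumptions, not taken from a real application), and it assumes a Thor cluster with multiple slave nodes:

PersonRec := RECORD
  STRING20 firstname;
  STRING20 lastname;
  UNSIGNED1 age;
END;

people := DATASET([{'John', 'Smith', 30},
                   {'Jane', 'Doe', 25},
                   {'Alex', 'Jones', 41}], PersonRec);

// Hash-partition the records across the slave nodes by lastname, then
// sort each node's local partition independently of the other nodes.
distributedPeople := DISTRIBUTE(people, HASH32(lastname));
locallySorted := SORT(distributedPeople, lastname, LOCAL);

OUTPUT(locallySorted);

Because the SORT is marked LOCAL, every slave node sorts only the data parts it holds, so the operation runs simultaneously across the cluster.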

For the second level, consider ECL code in which two independent SORT operations are applied to the same dataset, for example:
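Continuing with the illustrative people dataset defined above (again a sketch, not a snippet from a real application):

// Two SORTs over the same input; neither depends on the other's output.
byName := SORT(people, lastname, firstname);
byAge := SORT(people, age);

OUTPUT(byName);
OUTPUT(byAge);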

Code like this is compiled to the following data flow graph representation:

[Data flow graph: the dataset feeds a split, and the two SORT activities run in parallel on the two branches.]

The compiler is able to understand (shown by the split in the graph above) that the individual SORT operations have no dependency on each other and can therefore be executed in parallel.


In contrast, let us consider a situation where one ECL operation depends on the output of another, as in the sketch below.
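A minimal sketch, once more using the illustrative people dataset from above rather than code from a real application:

// DEDUP removes adjacent duplicate records, so its input must be sorted
// on the dedup fields; the DEDUP therefore depends on the SORT's output.
sortedPeople := SORT(people, lastname, firstname);
uniquePeople := DEDUP(sortedPeople, lastname, firstname);

OUTPUT(uniquePeople);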

Here, the DEDUP operation depends on the output of the SORT operation. The ECL compiler automatically generates the correct data flow graph:


[Data flow graph: the SORT activity's output feeds directly into the DEDUP activity.]

The data flow architecture has been around for a long time; in fact, most RDBMS SQL engines are built on one. Applied in a distributed data processing environment, however, a data flow architecture makes for a very powerful solution. Another great example of a data flow engine is Google's TensorFlow. With ECL, HPCC Systems provides a simple interface to program with, while hiding the complexity of parallel processing behind the data flow architecture of the ECL compiler. The compiler maximizes the use of the available computational power (CPUs, GPUs, etc.) by deploying the ECL computations for parallel processing.