|2014-06 Microsoft Research Parallel Software|
This training note records two major efforts undertaken by Microsoft Research to prepare for parallel computing, as published by ACM in 2011 and 2013 respectively. For simplicity, we refer to the first effort as Dandelion and the second as Linqits. The information below is Compucon’s interpretation of the Microsoft efforts.
• Dandelion is a Windows system for high-level programming of heterogeneous systems. It uses the .NET LINQ approach to identify code regions for parallelization and to integrate data-parallel operators into C# and F#. It adopts the dataflow execution model and comprises 3 execution engines: distributed cluster, multi-core CPU, and GPU. In a benchmark of a single k-means iteration on a single machine, Dandelion is 6.4X faster than a single CPU core, whereas the multi-core CPU is 3.1X faster than a single core. In terms of programming effort, the C++ version needed 491 lines of code. The Dandelion version had only 42 lines and achieved a runtime performance gain of 14 to 66 times. The CUDA version took 909 lines and achieved a performance gain of 50 to 100 times.
• Linqits can be read as "LINQ circuits" as an aid to understanding. It is a Microsoft Research project for directly accelerating LINQ, a declarative subset of the C# programming language that lets the programmer embed user-defined anonymous functions and so express rich algorithms elegantly. Its major contribution is bringing C#, through LINQ, to Linux and FPGA. Running C# applications on Debian Linux required cross-compiling the Linux-compatible “Mono 2.10” runtime with an ARM GCC 4.1 cross-compiler. Instead of relying on Mono's built-in LINQ provider, the researchers leveraged an improved internal implementation of LINQ called Dandelion. Linqits uses the Dandelion compiler scheme, which produces a Query Plan based on dataflow execution, together with a C# Runtime and Scheduler that allocates workloads to the appropriate hardware engine. FPGA support comes in 2 parts: the Hardware Template and the C# Runtime. The Hardware Template is a special IP block that is highly parameterized at the RTL level and can be customized to suit the particular needs of the query plan. This template supports 6 of the 7 major LINQ operators. It is currently implemented in fewer than 10K lines of Verilog and has been placed and routed at 100MHz on the FPGA platform for all applications tested. The C# Runtime must pre-configure the hardware to operate in tandem with managed code. The test system is a Xilinx ZYNQ-7020, which consists of 2 ARM Cortex A9 cores and a Xilinx FPGA with 53K LUTs and 106K flip-flops. The C# Runtime decides dynamically whether each node of the query plan should execute in hardware (FPGA) or software (ARM). At runtime, data is streamed in from DDR3 main memory, processed by the respective cores, and the results are returned to main memory. The project reports impressive performance gains.
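
To make the Dandelion item more concrete, the single k-means iteration used in its benchmark can be written as a short chain of data-parallel operators, which is the style LINQ encourages. The sketch below is in Python rather than C#, and the function names (nearest_center, kmeans_step) are ours, not Dandelion's. It only illustrates the Select, GroupBy, Aggregate shape of the computation and is not the benchmarked code.

```python
from collections import defaultdict
from math import dist


def nearest_center(point, centers):
    """Return the index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: dist(point, centers[i]))


def kmeans_step(points, centers):
    """One k-means iteration expressed as Select -> GroupBy -> Aggregate."""
    # Select: tag every point with the index of its nearest center.
    assigned = [(nearest_center(p, centers), p) for p in points]

    # GroupBy: collect the points that share a center index.
    groups = defaultdict(list)
    for idx, p in assigned:
        groups[idx].append(p)

    # Aggregate: each new center is the mean of its group.
    # A center with no points keeps its old position (our assumption).
    new_centers = []
    for i, old in enumerate(centers):
        members = groups.get(i)
        if not members:
            new_centers.append(old)
            continue
        mean = tuple(
            sum(m[d] for m in members) / len(members) for d in range(len(old))
        )
        new_centers.append(mean)
    return new_centers


if __name__ == "__main__":
    pts = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(kmeans_step(pts, [(0, 0), (10, 10)]))
```

Each of the three stages works on independent elements or independent groups, which is what allows a system of this kind to map them onto a multi-core CPU, a GPU, or a cluster without changes to the program text.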
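
The dynamic hardware/software decision described in the Linqits item can be pictured with a small sketch. Everything in it is hypothetical: the note does not say which operators the Hardware Template supports or what criteria the C# Runtime applies, so the operator set HW_SUPPORTED, the PlanNode fields, and the input-size threshold are assumptions made purely for illustration. It is written in Python rather than C#.

```python
from dataclasses import dataclass

# Hypothetical operator set: the note only says 6 of the 7 major LINQ
# operators are supported, without naming them.
HW_SUPPORTED = {"Select", "Where", "Aggregate"}


@dataclass
class PlanNode:
    """One node of a query plan (illustrative fields only)."""
    name: str
    operator: str
    input_size: int


def choose_engine(node, hw_busy, min_hw_input=1024):
    """Pick "FPGA" or "ARM" for one node using an assumed, simplified policy."""
    if node.operator not in HW_SUPPORTED:
        return "ARM"   # the template cannot run this operator
    if hw_busy:
        return "ARM"   # hardware is occupied, fall back to software
    if node.input_size < min_hw_input:
        return "ARM"   # too little data to justify configuring hardware
    return "FPGA"


def schedule(plan, hw_busy=False):
    """Assign every node of the plan to an engine, in plan order."""
    return [(node.name, choose_engine(node, hw_busy)) for node in plan]


if __name__ == "__main__":
    plan = [
        PlanNode("filter", "Where", 100_000),
        PlanNode("sort", "OrderBy", 100_000),
        PlanNode("sum", "Aggregate", 16),
    ]
    print(schedule(plan))
```

The point of the sketch is the per-node decision: one query plan can run partly on the FPGA and partly on the ARM cores, with software always available as the fallback.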