2014-06 Microsoft Research Parallel Software
June 2014

This training note records two major efforts undertaken by Microsoft Research to prepare for parallel computing, as published by ACM in 2011 and 2013 respectively. For simplicity, we refer to the first effort as Dandelion and the second effort as Linqits. The information below is Compucon’s interpretation of the Microsoft efforts.

Compucon spoke at the Multicore World 2014 conference in Auckland in February and quoted the gains of Linqits to the audience, as shown in the image here. Black-Scholes and k-means are two of the compute-intensive algorithms tested by the researchers. P is the performance gain and E is the energy gain. The performance gain came from parallel hardware and parallel software together. The time to complete a task is reduced, but the parallel hardware consumes additional energy while it runs, so the energy gain is smaller than the performance gain. In addition, the researchers claimed that the number of lines of code (the programming effort) required was reduced, which is eye-opening.
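The relation between P and E can be made concrete with a little arithmetic. The sketch below assumes that energy is simply average power multiplied by run time. The function name `energy_gain` and the numbers in the usage line are hypothetical and are not taken from the Microsoft results.

```python
def energy_gain(perf_gain: float, power_ratio: float) -> float:
    """Return the energy gain E for a given performance gain P.

    perf_gain   -- P, baseline run time divided by parallel run time
    power_ratio -- parallel average power divided by baseline average power
                   (an assumed input, not a figure from the source)

    With energy = power * time, the parallel run uses
    (power_ratio * baseline power) * (baseline time / P),
    so E = baseline energy / parallel energy = P / power_ratio.
    """
    if perf_gain <= 0 or power_ratio <= 0:
        raise ValueError("both arguments must be positive")
    return perf_gain / power_ratio


# Hypothetical numbers: a task runs 10 times faster on hardware that
# draws 2.5 times the power, so the energy gain is 4 times.
print(energy_gain(10.0, 2.5))  # 4.0
```

Whenever the parallel hardware draws more power than the baseline (a power ratio above 1), E comes out smaller than P, which matches the pattern described above.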
MICROSOFT RESEARCH PARALLEL COMPUTING

• Dandelion is a Windows system for high-level programming of heterogeneous systems. It adopts the .NET LINQ approach to identify code regions for parallelization and to integrate data-parallel operators into C# and F#. It adopts the dataflow execution model and comprises three execution engines: distributed cluster, multi-core CPU, and GPU. In a benchmark of a single k-means iteration on a single machine, Dandelion is 6.4X faster than a single CPU core, whereas the multi-core CPU is 3.1X faster than a single core. The C++ version needed 491 lines of code. The Dandelion version needed only 42 lines and achieved a runtime performance gain of 14 to 66 times. The CUDA version took 909 lines and achieved a performance gain of 50 to 100 times.

• Linqits can be read as "LINQ circuits" to assist our understanding. It is a Microsoft Research project for directly accelerating LINQ, a declarative subset of the C# programming language that allows the programmer to embed user-defined anonymous functions and so express rich algorithms elegantly. Its major contribution is bringing C# through LINQ to Linux and FPGA. Running C# applications on Debian Linux requires cross-compiling the Linux-compatible “Mono 2.10” runtime with an ARM GCC 4.1 cross-compiler. Instead of relying on Mono's built-in LINQ provider, the researchers used an improved internal implementation of LINQ called Dandelion. Linqits uses the Dandelion compiler scheme of producing a Query Plan based on dataflow execution, together with a C# Runtime and Scheduler that allocates workloads to the appropriate hardware engine. The FPGA is included through two components: the Hardware Template and the C# Runtime. The Hardware Template is a special IP block that is highly parameterized at the RTL level and can be customized to suit the particular needs of the query plan. This template supports 6 of the 7 major LINQ operators. It is currently implemented in less than 10K lines of Verilog and has been placed-and-routed at 100MHz on the FPGA platform for all applications tested. The C# Runtime must pre-configure the hardware to operate in tandem with managed code. The test system is a Xilinx ZYNQ-7020, consisting of 2 ARM Cortex A9 cores and a Xilinx FPGA with 53K LUTs and 106K flip-flops. The C# Runtime decides dynamically whether a node of the query plan should execute in hardware (FPGA) or software (ARM). At runtime, data is streamed in from DDR3 main memory and processed by the respective cores, and the results are returned to main memory. This project has achieved impressive performance gains.
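To give a feel for the data-parallel operator style mentioned in the Dandelion bullet, here is a minimal Python sketch of a single k-means iteration written as a map, group, and reduce pipeline, loosely mirroring the LINQ operators Select, GroupBy, and Aggregate. It is not Dandelion code and does not use any Dandelion API. The names `nearest` and `kmeans_step` and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def nearest(point: Point, centers: Sequence[Point]) -> int:
    """Index of the center with the smallest squared distance to point."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )


def kmeans_step(points: Sequence[Point], centers: Sequence[Point]) -> List[Point]:
    """One k-means iteration expressed as map, group, and reduce stages."""
    # "Select": map every point to (index of nearest center, point).
    assigned = ((nearest(p, centers), p) for p in points)

    # "GroupBy": collect the points that share a center index.
    groups: Dict[int, List[Point]] = defaultdict(list)
    for index, point in assigned:
        groups[index].append(point)

    # "Aggregate": reduce each group to its mean, which is the new center.
    new_centers: List[Point] = []
    for i, old_center in enumerate(centers):
        members = groups.get(i)
        if not members:
            new_centers.append(old_center)  # keep a center that attracted no points
        else:
            n = len(members)
            new_centers.append(tuple(sum(col) / n for col in zip(*members)))
    return new_centers


# Hypothetical sample data: two obvious clusters.
pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
print(kmeans_step(pts, [(1.0, 1.0), (9.0, 9.0)]))
# [(0.0, 1.0), (10.0, 11.0)]
```

Each stage is independent across points or groups. That independence is what lets a dataflow runtime spread the work over a cluster, CPU cores, or a GPU without the programmer writing explicit parallel code.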
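The Linqits runtime decision described above, placing each query plan node on hardware or software, can be pictured with the toy Python sketch below. It is not the Linqits scheduler. The source does not say which 6 operators the Hardware Template supports or what criteria the runtime applies, so the `HW_SUPPORTED` set, the `min_rows` threshold, and the names `PlanNode`, `place`, and `schedule` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence, Tuple

# ASSUMPTION: the source says the template supports 6 of the 7 major LINQ
# operators but does not name them; this particular set is hypothetical.
HW_SUPPORTED: FrozenSet[str] = frozenset(
    {"Select", "SelectMany", "Where", "GroupBy", "Aggregate", "Join"}
)


@dataclass(frozen=True)
class PlanNode:
    """One node of a dataflow query plan (hypothetical structure)."""
    operator: str
    input_rows: int


def place(
    node: PlanNode,
    hw_supported: FrozenSet[str] = HW_SUPPORTED,
    min_rows: int = 1024,  # ASSUMPTION: illustrative offload threshold
) -> str:
    """Choose "FPGA" or "ARM" for a single plan node."""
    # Offload only when the template can run the operator and the input
    # is large enough to be worth streaming through the hardware.
    if node.operator in hw_supported and node.input_rows >= min_rows:
        return "FPGA"
    return "ARM"


def schedule(plan: Sequence[PlanNode]) -> List[Tuple[str, str]]:
    """Return (operator, target) pairs for every node in the plan."""
    return [(node.operator, place(node)) for node in plan]


plan = [PlanNode("Select", 100_000), PlanNode("OrderBy", 100_000), PlanNode("Where", 10)]
print(schedule(plan))
# [('Select', 'FPGA'), ('OrderBy', 'ARM'), ('Where', 'ARM')]
```

A real runtime would presumably also account for the cost of configuring the hardware and moving data, which this sketch leaves out.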