Dynamically Reconfigurable Resource Array

University dissertation from Stockholm : KTH Royal Institute of Technology

Abstract: The goals set by the International Technology Roadmap for Semiconductors (ITRS) for the consumer portable category, to be realized by 2020, are a 1000X improvement in performance with only a 40% increase in power budget and no increase in design team size. To meet these goals, the challenges facing the VLSI community are gaps in architecture efficacy, design productivity and battery capacity. As the causes of the gaps in architecture efficacy and battery capacity, this thesis identifies: a) instruction granularity mismatch, b) bit-width granularity mismatch, c) silicon granularity mismatch and d) parallelism mismatch. Field Programmable Gate Array (FPGA) technology can address instruction/bit-width granularity and parallelism mismatch but suffers from silicon granularity mismatch due to high reconfiguration overheads. The ultimate design goal of a system-on-chip is to achieve ASIC-like performance with FPGA-like flexibility, design time and cost. Coarse Grain Reconfigurable Architectures (CGRAs) are a compromise between ASIC and FPGA, since they provide better computational efficiency than FPGAs and better engineering efficiency than ASICs. However, the current generation of CGRAs lacks many architectural properties that would enable mainstream industry to use them in place of ASICs and/or FPGAs. To discuss these properties objectively, the first part of the thesis proposes a classification scheme that divides parallel computing machines into 47 classes and proposes how they can be graded in terms of flexibility. We apply this classification scheme to academic and industrial reconfigurable architectures to compare their similarities and differences. We identify an instruction flow spatial computing class to be used for a CGRA fabric called Dynamically Reconfigurable Resource Array (DRRA), presented in the second part of this thesis.
The DRRA fabric is a Parallel Distributed Digital Signal Processing (PDDSP) fabric with distributed arithmetic, logic, interconnect and control resources. Problems associated with the distributed control model of DRRA are identified, and architectural solutions that can be exploited by the compiler tools are presented. After logical and physical synthesis, DRRA shows a peak performance of 21 GOPS and a peak silicon efficiency of 16.03 GOPS/mm². We further performed a three-level validation of the DRRA fabric. At the first level, we mapped a number of signal- and compute-intensive algorithms to demonstrate the flexibility of the DRRA fabric. At the second level, we measured the gap between ASIC, DRRA and FPGA. On average, compared to FPGA, DRRA shows improvements of 22.87x in area, 10.75x in power consumption, 852x in configuration bits, 959x in configuration cycles, 63.94x in silicon efficiency, 4.78x in computational efficiency, and 6.15E+10x in energy-delay product. Finally, at the third level, we present the use of DRRA for a real-world example: implementing a configurable 128-, 256-, 512-, 1024- and 2048-point FFT processor. For the 1024-point FFT, in terms of computational efficiency, DRRA outperforms all CGRAs by at least 2x and is worse than ASIC by 3.45x. As regards silicon efficiency, although dedicated processors perform 1.6x better, DRRA is better than all other CGRAs.
