Performance and Energy Efficient Building Blocks for Network-on-Chip Architectures
Abstract: The ever shrinking size of the MOS transistors brings the promise of scalable Network-on-Chip (NoC) architectures containing hundreds of processing elements with on-chip communication, all integrated into a single die. Such a computational fabric will provide high levels of performance in an energy efficient manner. To mitigate emerging wire-delay problem and to address the need for substantial interconnect bandwidth, packet switched routers are fast replacing shared buses and dedicated wires as the interconnect fabric of choice. With on-chip communication consuming a significant portion of the chip power and area budgets, there is a compelling need for compact, low power routers. While applications dictate the choice of the compute core, the advent of multimedia applications, such as 3D graphics and signal processing, places stronger demands for self-contained, low-latency floating-point processors with increased throughput. Therefore, this work focuses on two key building blocks critical to the success of NoC design: high performance, area and energy efficient router and floating-point processor architectures.This thesis first presents a six-port four-lane 57 GB/s non-blocking router core based on wormhole switching. The router features double-pumped crossbar channels and destinationaware channel drivers that dynamically configure based on the current packet destination. This enables 45% reduction in crossbar channel area, 23% overall router area, up to 3.8X reduction in peak channel power, and 7.2% improvement in average channel power, with no performance penalty over a published design. In a 150nm six-metal CMOS process, the 12.2mm2 router contains 1.9 million transistors and operates at 1GHz at 1.2V. We next present a new pipelined single-precision floating-point multiply accumulator core (FPMAC) featuring a single-cycle accumulate loop using base 32 and internal carry-save arithmetic, with delayed addition techniques. Combined algorithmic, logic and circuit techniques enable multiply-accumulates at speeds exceeding 3GHz, with single-cycle throughput. Unlike existing FPMAC architectures, the design eliminates scheduling restrictions between consecutive FPMAC instructions. The optimizations allow removal of the costly normalization step from the critical accumulate loop and conditionally powered down using dynamic sleep transistors on long accumulate operations, saving active and leakage power. In addition, an improved leading zero anticipator (LZA) and overflow detection logic applicable to carry-save format is presented. In a 90nm seven-metal dual-VT CMOS process, the 2mm2 custom design contains 230K transistors. The fully functional first silicon achieves 6.2 GFLOPS of performance while dissipating 1.2W at 3.1GHz, 1.3V supply.It is clear that realization of successful NoC designs require well balanced decisions at all levels: architecture, logic, circuit and physical design. Our results from key building blocks demonstrate the feasibility of pushing the performance limits of compute cores and communication routers, while keeping active and leakage power, and area under control.
This dissertation MIGHT be available in PDF-format. Check this page to see if it is available for download.