ASPDAC '19: Proceedings of the 24th Asia and South Pacific Design Automation Conference

SESSION: University design contest

A wide conversion ratio, 92.8% efficiency, 3-level buck converter with adaptive on/off-time control and shared charge pump intermediate voltage regulator

Kousuke Miyaji
Yuki Karasawa
Takanobu Fukuoka

An efficient cascode 3-level buck converter with adaptive on/off-time (AOOT) control and shared charge pump (CP) intermediate voltage (Vmid) regulator is proposed and demonstrated. The conversion ratio (CR) Vout/Vin is enhanced by using the proposed AOOT control scheme, where the control switches between adaptive on-time (AOnT) and adaptive off-time (AOffT) mode according to the target CR. The proposed CP shares flying capacitor Cfly and power switches in the 3-level buck converter to generate Vmid=Vin/2 achieving both small size and low loss. The proposed 3-level buck converter is implemented in a standard 0.25μm CMOS process. 92.8% maximum efficiency and wide CR are obtained with the integrated Vmid regulator.

A three-dimensional millimeter-wave frequency-shift based CMOS biosensor using vertically stacked spiral inductors in LC oscillators

Maya Matsunaga
Taiki Nakanishi
Atsuki Kobayashi
Kiichi Niitsu

This paper presents a millimeter-wave frequency-shift-based CMOS biosensor that is capable of providing three-dimensional (3D) resolution. The vertical resolution from the sensor surface is obtained using dual-layer LC oscillators, which enable 3D target detection. The LC oscillators produce different frequency shifts from the desired resonant frequency due to the frequency-dependent complex relative permittivity of the biomolecular target. The measurement results from a 65-nm test chip demonstrated the feasibility of achieving 3D resolution.

Design of 385 x 385 μm2 0.165V 270pW fully-integrated supply-modulated OOK transmitter in 65nm CMOS for glasses-free, self-powered, and fuel-cell-embedded continuous glucose monitoring contact lens

Kenya Hayashi
Shigeki Arata
Ge Xu
Shunya Murakami
Cong Dang Bui
Takuyoshi Doike
Maya Matsunaga
Atsuki Kobayashi
Kiichi Niitsu

This work presents the lowest-power sub-mm2 supply-modulated OOK transmitter for enabling a self-powered continuous glucose monitoring (CGM) contact lens. By combining the transmitter with a glucose fuel cell, which functions as both the power source and the sensing transducer, a self-powered CGM contact lens can be realized. The 385 x 385 μm2 test chip, implemented in 65-nm standard CMOS technology, consumes 270 pW at 0.165 V and successfully demonstrates self-powered operation using a 2 x 2 mm2 solid-state glucose fuel cell.

2D optical imaging using photosystem I photosensor platform with 32x32 CMOS biosensor array

Kiichi Niitsu
Taichi Sakabe
Mariko Miyachi
Yoshinori Yamanoi
Hiroshi Nishihara
Tatsuya Tomo
Kazuo Nakazato

This paper presents 2D imaging using a photosensor platform with a newly proposed large-scale CMOS biosensor array in 0.6-μm standard CMOS. The platform combines photosystem I (PSI) isolated from Thermosynechococcus elongatus and a large-scale CMOS biosensor array. PSI converts the absorbed photons into electrons, which are then sensed by the CMOS biosensor array. The prototyped photosensor enables CMOS-based 2D imaging using PSI for the first time.

Design of gate-leakage-based timer using an amplifier-less replica-bias switching technique in 55-nm DDC CMOS

Atsuki Kobayashi
Yuya Nishio
Kenya Hayashi
Shigeki Arata
Kiichi Niitsu

A design of a gate-leakage-based timer using an amplifier-less replica-bias switching technique that can realize stable and low-voltage operation is presented. To generate a stable oscillation frequency, a topology that discharges the pre-charged capacitor via a gate-leaking MOS capacitor with a low-leakage switch and logic circuits is employed. The test chip fabricated in 55-nm deeply depleted channel (DDC) CMOS technology achieves an Allan deviation floor of 200 ppm at a supply voltage of 350 mV in a 0.0022 mm2 area.

A low-voltage CMOS electrophoresis IC using electroless gold plating for small-form-factor biomolecule manipulation

Kiichi Niitsu
Yuuki Yamaji
Atsuki Kobayashi
Kazuo Nakazato

We present a sub-1-V CMOS-based electrophoresis method for small-form-factor biomolecule manipulation that is contained in a microchip. This is the first time this type of device has been presented in the literature. By combining CMOS technology with electroless gold plating, the electrode pitch can be reduced and the required input voltage can be decreased to less than 1 V. We fabricated the CMOS electrophoresis chip in a cost-competitive 0.6 μm standard CMOS process. A sample/hold circuit in each cell is used to generate a constant output from an analog input. After forming gold electrodes using an electroless gold plating technique, we were able to manipulate red food coloring with a 0-0.7 V input voltage range. The results show that the proposed CMOS chip is effective for electrophoresis-based manipulation.

A low-voltage low-power multi-channel neural interface IC using level-shifted feedback technology

Liangjian Lyu
Yu Wang
Chixiao Chen
C. -J. Richard Shi

A low-voltage low-power 16-channel neural interface front-end IC for in-vivo neural recording applications is presented in this paper. A current reuse telescope amplifier is used to achieve better noise efficiency factor (NEF). Power efficiency factor (PEF) is further improved by reducing supply voltage with the proposed level-shifted feedback (LSFB) technique. The neural interface is fabricated in a 65 nm CMOS process. It operates under 0.6V supply voltage consuming 1.07 μW/channel. An input referred noise of 5.18 μV is measured, leading to a NEF of 2.94 and a PEF of 5.19 over 10 kHz bandwidth.
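
The reported NEF and PEF figures can be cross-checked against each other. The sketch below assumes the commonly used relation PEF = NEF^2 x VDD; the abstract does not state this definition, so it is an assumption here, and the function name is illustrative only.

```python
# Assumed relation (not stated in the abstract): PEF = NEF^2 * VDD.
def power_efficiency_factor(nef: float, vdd_volts: float) -> float:
    """Return the power efficiency factor for a given NEF and supply voltage."""
    return nef ** 2 * vdd_volts


# Values taken from the abstract: NEF = 2.94 at a 0.6 V supply.
pef = power_efficiency_factor(nef=2.94, vdd_volts=0.6)
print(round(pef, 2))  # prints 5.19
```

Under that assumption, the result matches the PEF of 5.19 quoted in the abstract.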

Development of a high stability, low standby power six-transistor CMOS SRAM employing a single power supply

Nobuaki Kobayashi
Tadayoshi Enomoto

We developed and applied a new circuit, called the "Self-controllable Voltage Level (SVL)" circuit, not only to expand both "write" and "read" stabilities, but also to achieve a low stand-by power and data holding capability in a single low power supply, 90-nm, 2-kbit, six-transistor CMOS SRAM. The SVL circuit can adaptively lower and raise the wordline voltages for a "read" and "write" operation, respectively. It can also adaptively lower and raise the memory cell supply voltages for the "write" and "hold" operations, and "read" operation, respectively. The Si area overhead of the SVL circuit is only 1.383% of the conventional SRAM.

Design of heterogeneously-integrated memory system with storage class memories and NAND flash memories

Chihiro Matsui
Ken Takeuchi

A heterogeneously-integrated memory system is configured with various types of storage class memories (SCMs) and NAND flash memories. SCMs are faster than NAND flash, and they are divided into memory and storage types according to their characteristics. NAND flash memories are also classified by the number of stored bits per memory cell. These non-volatile memories have trade-offs among access speed, capacity and bit cost. Therefore, mixing and matching various non-volatile memories is essential to simultaneously achieve the best speed and cost of the storage. This paper proposes a design methodology with unique interaction of device, circuit and system to achieve the appropriate configurations in the heterogeneously-integrated memory system for a given application.

A 65-nm CMOS fully-integrated circulating tumor cell and exosome analyzer using an on-chip vector network analyzer and a transmission-line-based detection window

Taiki Nakanishi
Maya Matsunaga
Shunya Murakami
Atsuki Kobayashi
Kiichi Niitsu

A fully-integrated CMOS circuit based on a vector network analyzer (VNA) and a transmission-line-based detection window for circulating tumor cell (CTC) and exosome analysis is presented. We have introduced a fully-integrated architecture, which eliminates undesired parasitic components and enables high sensitivity, for the analysis of extremely low-concentration CTCs in blood. To validate the operation of the proposed system, a test chip was fabricated using 65-nm CMOS technology. Measurement results show the effectiveness of the approach.

Low standby power CMOS delay flip-flop with data retention capability

Nobuaki Kobayashi
Tadayoshi Enomoto

We developed and applied a new circuit, called the self-controllable voltage level (SVL) circuit, not only to achieve low standby power dissipation (Pst) while retaining data, but also to switch quickly between an operational mode and a standby mode, in a single power source, 90-nm CMOS delay flip-flop (D-FF). The Pst of the developed D-FF is only 5.585 nW/bit, 14.81% of the 37.71 nW/bit of the conventional D-FF at a supply voltage (Vdd) of 1.0 V. The static-noise margin of the developed D-FF is 0.2576 V, and that of the conventional D-FF is 0.3576 V (at Vdd of 1.0 V). The Si area overhead of the SVL circuit is 11.62% of the conventional D-FF.
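
The standby-power and noise-margin figures above can be related with a few lines of arithmetic. This is only a consistency check on the numbers quoted in the abstract; the variable names are illustrative.

```python
# Figures quoted in the abstract (Vdd = 1.0 V).
pst_developed_nw = 5.585       # standby power of the SVL-based D-FF, nW/bit
pst_conventional_nw = 37.71    # standby power of the conventional D-FF, nW/bit
snm_developed_v = 0.2576       # static-noise margin, developed D-FF
snm_conventional_v = 0.3576    # static-noise margin, conventional D-FF

# Standby power of the developed D-FF as a percentage of the conventional one.
ratio_percent = 100.0 * pst_developed_nw / pst_conventional_nw

# Equivalent reduction factor and the noise-margin cost paid for it.
reduction_factor = pst_conventional_nw / pst_developed_nw
snm_drop_v = snm_conventional_v - snm_developed_v

print(f"{ratio_percent:.2f}% of conventional Pst")
print(f"{reduction_factor:.2f}x lower standby power")
print(f"{snm_drop_v:.4f} V lower static-noise margin")
```

The ratio reproduces the 14.81% stated in the abstract; the static-noise margin is 0.1 V lower than that of the conventional D-FF.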

Accelerate pattern recognition for cyber security analysis

Mohammad Tahghighi
Wei Zhang

Network security analysis is about processing the network equipment's log records to capture malicious and anomalous traffic. Scrutinizing a huge number of records to capture complex patterns is time consuming and difficult to parallelize. In this paper, we propose a hardware/software co-designed system to address this problem for specific IP chaining patterns.

FPGA laboratory system supporting power measurement for low-power digital design

Marco Winzker
Andrea Schwandt

Power measurement of a digital design implementation supports development of low-power systems and gives insight into the performance of a circuit. A laboratory system is presented that consists of an FPGA board for use in a hands-on and remote laboratory. Measurement results show how the system can be utilized for teaching and research.

SESSION: Real-time embedded software

Towards limiting the impact of timing anomalies in complex real-time processors

Pedro Benedicte
Jaume Abella
Carles Hernandez
Enrico Mezzetti
Francisco J. Cazorla

Timing verification of embedded critical real-time systems is hindered by complex designs. Timing anomalies, deeply analyzed in static timing analysis, require specific solutions to bound their impact. For the first time, we study the concept and impact of timing anomalies in measurement-based timing analysis, the approach most widely used in industry, showing that they need to be considered and handled differently. In addition, we analyze anomalies in the context of Measurement-Based Probabilistic Timing Analysis, which simplifies quantifying their impact.

SeRoHAL: generation of selectively robust hardware abstraction layers for efficient protection of mixed-criticality systems

Petra R. Kleeberger
Juana Rivera
Daniel Mueller-Gritschneder
Ulf Schlichtmann

A major challenge in mixed-criticality system design is to ensure safe behavior under the influence of hardware errors while complying with cost and performance constraints. SeRoHAL generates hardware abstraction layers with software-based safety mechanisms to handle errors in peripheral interfaces. To reduce performance and memory overheads, SeRoHAL can select protection mechanisms, depending on the criticality of the hardware accesses.
We evaluated SeRoHAL on robot arm control software. During fault injection, it prevents up to 76% of the assertion failures. Selective protection customized to the criticality of the accesses reduces the induced overheads significantly compared to protection of all hardware accesses.

Partitioned and overhead-aware scheduling of mixed-criticality real-time systems

Yuanbin Zhou
Soheil Samii
Petru Eles
Zebo Peng

Modern real-time embedded and cyber-physical systems comprise a large number of applications, often of different criticalities, executing on the same computing platform. Partitioned scheduling is used to provide temporal isolation among tasks with different criticalities. Isolation is often a requirement, for example, in order to avoid the case when a low criticality task overruns or fails in such a way that causes a failure in a high criticality task. When the number of partitions increases in mixed criticality systems, the size of the schedule table can become extremely large, which becomes a critical bottleneck due to design time and memory constraints of embedded systems. In addition, switching between partitions at runtime causes CPU overhead due to preemption. In this paper, we propose a design framework comprising a hyper-period optimization algorithm, which reduces the size of the schedule table and preserves schedulability, and a re-scheduling algorithm to reduce the number of preemptions. Extensive experiments demonstrate the effectiveness of the proposed algorithms and design framework.
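
To illustrate why the hyper-period drives schedule table size, the sketch below computes the hyper-period as the least common multiple of the task periods. The periods are hypothetical, and this is not the optimization algorithm of the paper, which the abstract does not detail.

```python
from functools import reduce
from math import gcd


def hyper_period(periods):
    """Least common multiple of all task periods (same time unit for all)."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods)


# Hypothetical task periods in milliseconds.
original_periods = [10, 15, 7]
# Same task set with the 7 ms period shortened to 5 ms (a divisor of 10 and 15).
adjusted_periods = [10, 15, 5]

print(hyper_period(original_periods))  # prints 210
print(hyper_period(adjusted_periods))  # prints 30
```

A static schedule table has to cover one hyper-period, so the adjusted task set needs a table spanning 30 ms instead of 210 ms. Shortening a period raises the utilization of that task, which is why schedulability has to be rechecked after any such change.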

SESSION: Hardware and system security

Layout recognition attacks on split manufacturing

Wenbin Xu
Lang Feng
Jeyavijayan (JV) Rajendran
Jiang Hu

One technique to prevent attacks from an untrusted foundry is split manufacturing, where only a part of the layout is sent to the untrusted high-end foundry, and the rest is manufactured at a trusted low-end foundry. The untrusted foundry has the front-end-of-line (FEOL) layout and the original circuit netlist and attempts to identify critical components on the layout for Trojan insertion. Although defense methods for this scenario have been developed, the corresponding attack technique is not well explored. For instance, the Boolean satisfiability (SAT) based bijective mapping attack is mentioned without detailed research. Hence, the defense methods are mostly evaluated with the k-security metric without actual attacks. We provide the first systematic study, to the best of our knowledge, on attack techniques in this scenario. Besides implementing the SAT-based bijective mapping attack, we develop a new attack technique based on structural pattern matching. Experimental comparison with the bijective mapping attack shows that the new attack technique achieves about the same success rate with much faster speed for cases without the k-security defense, and has a much better success rate at the same runtime for cases with the k-security defense. The results offer an alternative and practical interpretation for k-security in split manufacturing.

Execution of provably secure assays on MEDA biochips to thwart attacks

Tung-Che Liang
Mohammed Shayan
Krishnendu Chakrabarty
Ramesh Karri

Digital microfluidic biochips (DMFBs) have emerged as a promising platform for DNA sequencing, clinical chemistry, and point-of-care diagnostics. Recent research has shown that DMFBs are susceptible to various types of malicious attacks. Defenses proposed thus far only offer probabilistic guarantees of security due to the limitation of on-chip sensor resources. A micro-electrode-dot-array (MEDA) biochip is a next-generation DMFB that enables the sensing of on-chip droplet locations, which are captured in the form of a droplet-location map. We propose a security mechanism that validates assay execution by reconstructing the sequencing graph (i.e., the assay specification) from the droplet-location maps and comparing it against the golden sequencing graph. We prove that there is a unique (one-to-one) mapping from the set of droplet-location maps (over the duration of the assay) to the set of possible sequencing graphs. Any deviation in the droplet-location maps due to an attack is detected by this countermeasure because the resulting derived sequencing graph is not isomorphic to the original sequencing graph. We highlight the strength of the security mechanism by simulating attacks on real-life bioassays.

TAD: time side-channel attack defense of obfuscated source code

Alexander Fell
Hung Thinh Pham
Siew-Kei Lam

Program obfuscation is widely used to protect commercial software against reverse-engineering. However, an adversary can still download, disassemble and analyze binaries of the obfuscated code executed on an embedded System-on-Chip (SoC), and by correlating execution times to input values, extract secret information from the program. In this paper, we show (1) the impact of widely-used obfuscation methods on timing leakage, and (2) that well-known software countermeasures to reduce timing leakage of programs are not always effective for low-noise environments found in embedded systems. We propose two methods for mitigating timing leakage in obfuscated code. The first is a compiler driven method, called TAD, which removes conditional branches with distinguishable execution times for an input program. In the second method (TADCI), TAD is combined with dynamic hardware diversity by replacing primitive instructions with Custom Instructions (CIs) that exhibit non-deterministic execution times at runtime. Experimental results on the RISC-V platform show that the information leakage is reduced by 92% and 82% when TADCI is applied to the original and obfuscated source code, respectively.
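
The general idea of removing a secret-dependent conditional branch can be sketched as a branchless select. This is a generic if-conversion example, not the TAD compiler pass itself, and the function names are hypothetical. Python gives no timing guarantees, so the snippet only illustrates the shape of the transformation that a compiler would apply at the instruction level.

```python
def select_branching(cond: int, a: int, b: int) -> int:
    """Reference version: control flow depends on cond."""
    if cond:
        return a
    return b


def select_branchless(cond: int, a: int, b: int) -> int:
    """Same result without a branch on cond; cond must be 0 or 1."""
    mask = -cond  # 0 stays 0 (no bits set); 1 becomes -1 (all bits set)
    return (a & mask) | (b & ~mask)


# Both versions agree for every combination of inputs tried here.
for cond in (0, 1):
    for a, b in ((3, 9), (-5, 12), (0, 0)):
        assert select_branching(cond, a, b) == select_branchless(cond, a, b)
```

Both paths of the original branch are replaced by the same sequence of bitwise operations, so the executed instructions no longer depend on the value of `cond`.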

SESSION: Thermal- and power-aware design and optimization

Leakage-aware thermal management for multi-core systems using piecewise linear model based predictive control

Xingxing Guo
Hai Wang
Chi Zhang
He Tang
Yuan Yuan

Performing thermal management on new generation IC chips is challenging. This is because the leakage power, which is significant in today's chips, is nonlinearly related to temperature, resulting in a complex nonlinear control problem in thermal management. In this paper, a new dynamic thermal management (DTM) method with piecewise linear (PWL) thermal model based predictive control is proposed to solve the nonlinear control problem. First, a PWL thermal model is built by combining multiple local linear thermal models expanded at several Taylor expansion points. These Taylor expansion points are carefully selected by a systematic scheme which exploits the thermal behavior property of the IC chips. Based on the PWL thermal model, a new predictive control method is proposed to compute the future power recommendation for DTM. By approximating the nonlinearity accurately with the PWL thermal model and being equipped with predictive control technique, the new DTM can achieve an overall high quality temperature management with smooth and accurate temperature tracking. Experimental results show the new method outperforms the linear model predictive control based method in temperature management quality with negligible computing overhead.
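
The construction of a PWL model from local linear models can be sketched on a single scalar nonlinearity. The example assumes a hypothetical exponential leakage model with made-up constants, and it picks the nearest expansion point for each query; the paper's systematic selection of expansion points and its full thermal model are not reproduced here.

```python
import math

# Hypothetical leakage model constants (illustrative only).
P0 = 1.0      # leakage power in watts at the reference temperature
K = 0.02      # exponential temperature coefficient, 1/degC
T_REF = 40.0  # reference temperature, degC


def leakage(temp_c):
    """Nonlinear leakage power as a function of temperature."""
    return P0 * math.exp(K * (temp_c - T_REF))


def leakage_slope(temp_c):
    """Derivative of the leakage model with respect to temperature."""
    return K * leakage(temp_c)


def pwl_leakage(temp_c, expansion_points):
    """First-order Taylor expansion around the nearest expansion point."""
    t0 = min(expansion_points, key=lambda p: abs(p - temp_c))
    return leakage(t0) + leakage_slope(t0) * (temp_c - t0)


query = 85.0
single_linear_error = abs(pwl_leakage(query, [70.0]) - leakage(query))
pwl_error = abs(pwl_leakage(query, [50.0, 70.0, 90.0]) - leakage(query))
print(single_linear_error > pwl_error)  # prints True
```

With three expansion points, the approximation error at 85 degC is smaller than with a single linearization at 70 degC, which is the effect the PWL model relies on.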

Multi-angle bended heat pipe design using x-architecture routing with dynamic thermal weight on mobile devices

Hsuan-Hsuan Hsiao
Hong-Wen Chiou
Yu-Min Lee

A heat pipe is an effective passive cooling technique for mobile devices. This work builds a multi-angle bended heat pipe thermal model and presents an X-architecture routing engine guided by developed dynamic thermal weights to construct the heat pipe path for reducing the operating temperatures of a smartphone. Compared with a commercial tool, the error of the thermal model is only 4.79%. The routing engine can efficiently reduce the operating temperatures of application processors by at least 13.20% in smartphones.

Fully-automated synthesis of power management controllers from UPF

Dustin Peterson
Oliver Bringmann

We present a methodology for automatic synthesis of power management controllers for System-on-Chip designs by using an extended version of the Unified Power Format (UPF). Our methodology takes an SoC design and a UPF-based power design, and automatically generates a power management controller in Verilog/VHDL that implements the power state machine specified in UPF. It performs a priority-based scheduling for all power state machine actions, connects each power management signal to the corresponding logic wire in the UPF design and integrates the controller into the System-on-Chip using a configurable bus interface. We implemented the proposed approach as a plugin for Synopsys Design Compiler to close the gap in today's power management flows and evaluated it on a RISC-V System-on-Chip.

SESSION: Reverse engineering: growing more mature - and facing powerful countermeasures

Integrated flow for reverse engineering of nanoscale technologies

Bernhard Lippmann
Michael Werner
Niklas Unverricht
Aayush Singla
Peter Egger
Anja Dübotzky
Horst Gieser
Martin Rasche
Oliver Kellermann
Helmut Graeb

In view of potential risks of piracy and malicious manipulation of complex integrated circuits built in technologies of 45 nm and less, there is an increasing need for an effective and efficient process of reverse engineering. This paper provides an overview of the current process and details on a new tool for the acquisition and synthesis of large area images and the extraction of a layout. For the first time the error between the generated layout and the known drawn GDS will be compared quantitatively as a figure of merit (FOM). From this layout a circuit graph of an ECC encryption and the partitioning into circuit blocks will be extracted.

NETA: when IP fails, secrets leak

Travis Meade
Jason Portillo
Shaojie Zhang
Yier Jin

Assuring the quality and the trustworthiness of third party resources has been a hard problem to tackle. Researchers have shown that analyzing Integrated Circuits (IC), without the aid of golden models, is challenging. In this paper we discuss a toolset, NETA, designed to aid IP users in assuring the confidentiality, integrity, and accessibility of their IC or third party IP core. The discussed toolset gives access to a slew of gate-level analysis tools, many of which are heuristic-based, for the purposes of extracting high-level circuit design information. NETA mainly comprises the following tools: RELIC, REBUS, REPCA, REFSM, and REPATH.
The first step involved in netlist analysis falls to signal classification. RELIC uses a heuristic-based fan-in structure matcher to determine the uniqueness of each signal in the netlist. REBUS finds word groups by leveraging the data bus in the netlist in conjunction with RELIC's signal comparison through heuristic verification of input structures. REPCA, on the other hand, tries to improve upon the standard brute-force RELIC comparison by leveraging the data analysis technique of PCA and a sparse RELIC analysis on all signals. Given a netlist and a set of registers, REFSM reconstructs the logic which represents the behavior of a particular register set over the course of the operation of a given netlist. REFSM has been shown useful for examining register interaction at a higher level. REPATH, similar to REFSM, finds a series of input patterns which force a logical FSM initialized with some reset state into a state specified by the user. Finally, REFSM 2 is introduced to utilize linear time precomputation to improve the original REFSM.

Machine learning and structural characteristics for reverse engineering

Johanna Baehr
Alessandro Bernardini
Georg Sigl
Ulf Schlichtmann

In the past years, much of the research into hardware reverse engineering has focused on the abstraction of gate level netlists to a human readable form. However, none of the proposed methods consider a realistic reverse engineering scenario, where the netlist is physically extracted from a chip. This paper analyzes how errors caused by this extraction and the later partitioning of the netlist affect the ability to identify the functionality. Current formal verification based methods, which compare against a golden model, are incapable of dealing with such erroneous netlists. Two new methods are proposed, which focus on the idea that structural similarity implies functional similarity. The first approach uses fuzzy structural similarity matching to compare the structural characteristics of an unknown design against designs in a golden model library using machine learning. The second approach proposes a method for inexact graph matching using fuzzy graph isomorphisms, based on the functionalities of gates used within the design. For realistic error percentages, both approaches are able to match more than 90% of designs correctly. This is an important first step for hardware reverse engineering methods beyond formal verification based equivalence matching.

Towards cognitive obfuscation: impeding hardware reverse engineering based on psychological insights

Carina Wiesen
Nils Albartus
Max Hoffmann
Steffen Becker
Sebastian Wallat
Marc Fyrbiak
Nikol Rummel
Christof Paar

In contrast to software reverse engineering, there are hardly any tools available that support hardware reversing. Therefore, the reversing process is conducted by human analysts combining several complex semi-automated steps. However, countermeasures against reversing are evaluated solely against mathematical models. Our research goal is the establishment of cognitive obfuscation based on the exploration of underlying psychological processes. We aim to identify problems which are hard to solve for human analysts and derive novel quantification metrics, thus enabling stronger obfuscation techniques.

Insights into the mind of a trojan designer: the challenge to integrate a trojan into the bitstream

Maik Ender
Pawel Swierczynski
Sebastian Wallat
Matthias Wilhelm
Paul Martin Knopp
Christof Paar

The threat of inserting hardware Trojans during design, production, or in-field operation poses a danger for integrated circuits in real-world applications. A particularly critical case of hardware Trojans is the malicious manipulation of third-party FPGA configurations. In addition to attack vectors during the design process, FPGAs can be infiltrated in a non-invasive manner after shipment through alterations of the bitstream. First, we present an improved methodology for bitstream file format reversing. Second, we introduce a novel idea for Trojan insertion.

SESSION: All about PIM

GraphSAR: a sparsity-aware processing-in-memory architecture for large-scale graph processing on ReRAMs

Guohao Dai
Tianhao Huang
Yu Wang
Huazhong Yang
John Wawrzynek

Large-scale graph processing has drawn great attention in recent years. The emerging metal-oxide resistive random access memory (ReRAM) and ReRAM crossbars have shown huge potential in accelerating graph processing. However, the sparse feature of natural graphs hinders the performance of graph processing on ReRAMs. Previous work of graph processing on ReRAMs stored and computed edges separately, leading to high energy consumption and long latency of transferring data. In this paper, we present GraphSAR, a sparsity-aware processing-in-memory large-scale graph processing accelerator on ReRAMs. Computations over edges are performed in the memory, eliminating overheads of transferring edges. Moreover, graphs are divided considering the sparsity. Subgraphs with low densities are further divided into smaller ones to minimize the waste of memory space. According to our extensive experimental results, GraphSAR achieves 4.43x energy reduction and 1.85x speedup (8.19x lower energy-delay product, EDP) against previous graph processing architecture on ReRAMs (GraphR [1]).
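
The relation between the three reported figures follows from the definition of the energy-delay product (energy multiplied by delay): the EDP improvement is the product of the energy reduction and the speedup. A short check on the rounded numbers from the abstract:

```python
# Figures quoted in the abstract, both relative to GraphR.
energy_reduction = 4.43  # times lower energy
speedup = 1.85           # times lower delay

# EDP = energy * delay, so the two improvement factors multiply.
edp_improvement = energy_reduction * speedup
print(f"{edp_improvement:.4f}")
```

The product of the two rounded factors is about 8.20, which agrees with the reported 8.19x up to rounding of the published inputs.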

ParaPIM: a parallel processing-in-memory accelerator for binary-weight deep neural networks

Shaahin Angizi
Zhezhi He
Deliang Fan

Recent algorithmic progression has brought competitive classification accuracy despite constraining neural networks to binary weights (+1/-1). These findings show remarkable optimization opportunities to eliminate the need for computationally-intensive multiplications, reducing memory access and storage. In this paper, we present ParaPIM architecture, which transforms current Spin Orbit Torque Magnetic Random Access Memory (SOT-MRAM) sub-arrays to massively parallel computational units capable of running inferences for Binary-Weight Deep Neural Networks (BWNNs). ParaPIM's in-situ computing architecture can be leveraged to greatly reduce energy consumption dealing with convolutional layers, accelerate BWNNs inference, eliminate unnecessary off-chip accesses and provide ultra-high internal bandwidth. The device-to-architecture co-simulation results indicate ~4x higher energy efficiency and 7.3x speedup over recent processing-in-DRAM acceleration, or roughly 5x higher energy-efficiency and 20.5x speedup over recent ASIC approaches, while maintaining inference accuracy comparable to baseline designs.

CompRRAE: RRAM-based convolutional neural network accelerator with reduced computations through a runtime activation estimation

Xizi Chen
Jingyang Zhu
Jingbo Jiang
Chi-Ying Tsui

Recently Resistive-RAM (RRAM) crossbar has been used in the design of the accelerator of convolutional neural networks (CNNs) to solve the memory wall issue. However, the intensive multiply-accumulate computations (MACs) executed at the crossbars during the inference phase are still the bottleneck for the further improvement of energy efficiency and throughput. In this work, we explore several methods to reduce the computations for the RRAM-based CNN accelerators. First, the output sparsity resulting from the widely employed Rectified Linear Unit is exploited, and a significant portion of computations are bypassed through an early detection of the negative output activations. Second, an adaptive approximation is proposed to terminate the MAC early when the sum of the partial results of the remaining computations is considered to be within a certain range of the intermediate accumulated result and thus has an insignificant contribution to the inference. In order to determine these redundant computations, a novel runtime estimation on the maximum and minimum values of each output activation is developed and used during the MAC operation. Experimental results show that around 70% of the computations can be reduced during the inference with a negligible accuracy loss smaller than 0.2%. As a result, the energy efficiency and the throughput are improved by over 2.9 and 2.8 times, respectively, compared with the state-of-the-art RRAM-based accelerators.
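
The first idea, skipping the rest of a MAC once the output is known to be non-positive before ReLU, can be sketched in software as follows. This is a simplified illustration under stated assumptions: input activations lie in [0, act_max], and the bound on the remaining partial sums is precomputed from the positive weights. The paper's actual runtime estimation scheme and its adaptive approximation are not reproduced, and the function name is hypothetical.

```python
def relu_mac_with_early_exit(weights, activations, act_max):
    """Return (relu(dot(weights, activations)), number of MACs performed).

    Assumes 0 <= activation <= act_max for every input activation.
    """
    n = len(weights)
    # pos_bound[i] is an upper bound on the sum of terms i .. n-1:
    # only positive weights can increase the sum, by at most w * act_max.
    pos_bound = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        pos_bound[i] = pos_bound[i + 1] + max(weights[i], 0.0) * act_max

    acc = 0.0
    for i in range(n):
        if acc + pos_bound[i] <= 0.0:
            # The final sum cannot become positive, so ReLU outputs 0.
            return 0.0, i
        acc += weights[i] * activations[i]
    return max(acc, 0.0), n


# Early exit after one MAC: the remaining positive weights cannot recover -2.
print(relu_mac_with_early_exit([-2.0, -1.0, 0.5, 0.25], [1.0, 1.0, 1.0, 1.0], 1.0))
# No early exit: the output stays positive throughout.
print(relu_mac_with_early_exit([1.0, -0.5], [1.0, 1.0], 1.0))
```

The early exit never changes the result, because it only triggers when even the most favorable remaining terms leave the sum at or below zero.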

CuckooPIM: an efficient and less-blocking coherence mechanism for processing-in-memory systems

Sheng Xu
Xiaoming Chen
Ying Wang
Yinhe Han
Xiaowei Li

The ever-growing processing ability of in-memory processing logic makes the data sharing and coherence between processors and in-memory logic play an increasingly important role in Processing-in-Memory (PIM) systems. Unfortunately, the existing state-of-the-art coarse-grained PIM coherence solutions suffer from unnecessary data movements and stalls caused by a data ping-pong issue. This work proposes CuckooPIM, a criticality-aware and less-blocking coherence mechanism, which can effectively avoid unnecessary data movements and stalls. Experiments reveal that CuckooPIM achieves 1.68x speedup on average compared with coarse-grained PIM coherence.

AERIS: area/energy-efficient 1T2R ReRAM based processing-in-memory neural network system-on-a-chip

Jinshan Yue
Yongpan Liu
Fang Su
Shuangchen Li
Zhe Yuan
Zhibo Wang
Wenyu Sun
Xueqing Li
Huazhong Yang

ReRAM-based processing-in-memory (PIM) architecture is a promising solution for deep neural networks (NN), due to its high energy efficiency and small footprint. However, traditional PIM architecture has to use a separate crossbar array to store either positive or negative (P/N) weights, which limits both energy efficiency and area efficiency. Even worse, imbalanced running times of different layers and idle ADCs/DACs further lower the overall system efficiency. This paper proposes AERIS, an Area/Energy-efficient 1T2R ReRAM based processing-In-memory NN System-on-a-chip to enhance both energy and area efficiency. We propose an area-efficient 1T2R ReRAM structure to represent both P/N weights in a single array, and a reference current cancelling scheme (RCS) is also presented for better accuracy. Moreover, a layer-balance scheduling strategy, as well as the power gating technique for interface circuits, such as ADCs/DACs, is adopted for higher energy efficiency. Experimental results show that compared with state-of-the-art ReRAM-based architectures, AERIS achieves 8.5x/1.3x peak energy/area efficiency improvements in total, due to layer-balance scheduling for different layers, power gating of interface circuits, and 1T2R ReRAM circuits. Furthermore, we demonstrate that the proposed RCS compensates the non-ideal factors of ReRAM and improves NN accuracy by 5.2% in the XNOR net on CIFAR-10 dataset.

SESSION: Design for reliability

IR-ATA: IR annotated timing analysis, a flow for closing the loop between PDN design, IR analysis & timing closure

Ashkan Vakil
Houman Homayoun
Avesta Sasan

This paper presents IR-ATA, a novel flow for modeling the timing impact of IR drop during the physical design and timing closure of an ASIC chip. We first illustrate how the conventional mechanism for budgeting the IR drop and voltage noise (by using hard margins) leads to a sub-optimal design. Consequently, we propose a new approach for modeling and margining against voltage noise, such that each timing path is margined based on its own topology and its own view of voltage noise. With such a path-based margining mechanism, the margins for IR drop and voltage noise for most timing paths in the design are safely relaxed. The reduction in the margin increases the available timing slack that could be used for improving the power, performance, and area of a design. Finally, we illustrate how IR-ATA could be used to track the timing impact of physical or PDN changes, allowing the physical designers to explore tradeoffs that were previously, for lack of methodology, not possible.

Learning-based prediction of package power delivery network quality

Yi Cao
Andrew B. Kahng
Joseph Li
Abinash Roy
Vaishnav Srinivas
Bangqi Xu

Power Delivery Network (PDN) is a critical component in modern System-on-Chip (SoC) designs. With the rapid development in applications, the quality of PDN, especially Package (PKG) PDN, determines whether a sufficient amount of power can be delivered to critical computing blocks. In conventional PKG design, PDN design typically takes multiple weeks including many manual iterations for optimization. Also, there is a large discrepancy between (i) quick simulation tools used for quick PDN quality assessment during the design phase, and (ii) the golden extraction tool used for signoff. This discrepancy may introduce more iterations. In this work, we propose a learning-based methodology to perform PKG PDN quality assessment both before layout (when only bump/ball maps, but no package routing, are available) and after layout (when routing is completed but no signoff analysis has been launched). Our contributions include (i) identification of important parameters to estimate the achievable PKG PDN quality in terms of bump inductance; (ii) the avoidance of unnecessary manual trial and error overheads in PKG PDN design; and (iii) more accurate design-phase PKG PDN quality assessment. We validate accuracy of our predictive models on PKG designs from industry. Experimental results show that, across a testbed of 17 industry PKG designs, we can predict bump inductance with an average absolute percentage error of 21.2% or less, given only pinmap and technology information. We improve prediction accuracy to achieve an average absolute percentage error of 17.5% or less when layout information is considered.
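The accuracy figures above are stated as an average absolute percentage error. As a minimal sketch of how such a metric is commonly computed (the function name and the sample values below are hypothetical and are not taken from the paper):

```python
def average_absolute_percentage_error(actual, predicted):
    """Mean of |predicted - actual| / |actual|, expressed in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for a, p in zip(actual, predicted):
        if a == 0:
            raise ValueError("actual values must be non-zero")
        total += abs(p - a) / abs(a)
    return 100.0 * total / len(actual)


# Hypothetical bump inductance values in arbitrary units, for illustration only.
golden = [10.0, 20.0, 40.0]   # e.g. values from a signoff-quality extraction
model = [12.0, 18.0, 40.0]    # e.g. values from a predictive model
error = average_absolute_percentage_error(golden, model)
print(f"average absolute percentage error: {error:.1f}%")
```

In this made-up example the three per-design errors are 20%, 10%, and 0%, so the average is 10%.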

Tackling signal electromigration with learning-based detection and multistage mitigation

Wei Ye
Mohamed Baker Alawieh
Yibo Lin
David Z. Pan

With the continuous scaling of integrated circuit (IC) technologies, electromigration (EM) prevails as one of the major reliability challenges facing the design of robust circuits. With such aggressive scaling in advanced technology nodes, signal nets experience high switching frequency, which further exacerbates the signal EM effect. Traditionally, signal EM fixing approaches analyze EM violations after the routing stage, and repair is attempted via iterative incremental routing or cell resizing techniques. However, these "EM-analysis-then-fix" approaches are ill-equipped when faced with the ever-growing EM violations in advanced technology nodes. In this work, we propose a novel signal EM handling framework that (i) incorporates EM detection and fixing techniques into earlier stages of the physical design process, and (ii) integrates machine learning based detection alongside a multistage mitigation. Experimental results demonstrate that our framework can achieve a 15x speedup when compared to the state-of-the-art EDA tool while achieving similar performance in terms of EM mitigation and overhead.

ROBIN: incremental oblique interleaved ECC for reliability improvement in STT-MRAM caches

Elham Cheshmikhani
Hamed Farbeh
Hossein Asadi

Spin-Transfer Torque Magnetic RAM (STT-MRAM) is a promising alternative to SRAM in on-chip cache memories. Despite its advantages, the high error rate of STT-MRAM is a major limiting factor for on-chip cache memories. In this paper, we first present a comprehensive analysis that reveals that conventional Error-Correcting Codes (ECCs) lose their efficiency due to data-dependent error patterns, and then propose an efficient ECC configuration, called ROBIN, to improve the correction capability. The evaluations show that the inefficiency of conventional ECC increases the cache error rate by an average of 151.7% while ROBIN reduces this value by more than 28.6x.

Aging-aware chip health prediction adopting an innovative monitoring strategy

Yun-Ting Wang
Kai-Chiang Wu
Chung-Han Chou
Shih-Chieh Chang

Concerns exist that the reliability of chips is worsening because of downscaling technology. Among various reliability challenges, device aging is a dominant concern because it degrades circuit performance over time. Traditionally, runtime monitoring approaches are proposed to estimate aging effects. However, such techniques tend to predict and monitor delay degradation status for circuit mitigation measures rather than the health condition of the chip. In this paper, we propose an aging-aware chip health prediction methodology that adapts to workload conditions and process, supply voltage, and temperature variations. Our prediction methodology adopts an innovative on-chip delay monitoring strategy by tracing representative aging-aware delay behavior. The delay behavior is then fed into a machine learning engine to predict the age of the tested chips. Experimental results indicate that our strategy can obtain 97.40% accuracy with 4.14% area overhead on average. To the authors' knowledge, this is the first method that accurately predicts current chip age and provides information regarding future chip health.

SESSION: New advances in emerging computing paradigms

Compiling SU(4) quantum circuits to IBM QX architectures

Alwin Zulehner
Robert Wille

The Noisy Intermediate-Scale Quantum (NISQ) technology is currently investigated by major players in the field to build the first practically useful quantum computer. IBM QX architectures are the first ones which are already publicly available today. However, in order to use them, the respective quantum circuits have to be compiled for the target architecture in use. While first approaches have been proposed for this purpose, they are infeasible for a certain set of SU(4) quantum circuits which have recently been introduced to benchmark corresponding compilers. In this work, we analyze the bottlenecks of existing compilers and provide a dedicated method for compiling this kind of circuit to IBM QX architectures. Our experimental evaluation (using tools provided by IBM) shows that the proposed approach significantly outperforms IBM's own solution regarding fidelity of the compiled circuit as well as runtime. Moreover, the solution proposed in this work has been declared winner of the IBM QISKit Developer Challenge. An implementation of the proposed methodology is publicly available at http://iic.jku.at/eda/research/ibm_qx_mapping.

Quantum circuit compilers using gate commutation rules

Toshinari Itoko
Rudy Raymond
Takashi Imamichi
Atsushi Matsuo
Andrew W. Cross

The use of noisy intermediate-scale quantum computers (NISQCs), which consist of dozens of noisy qubits with limited coupling constraints, has been increasing. A circuit compiler, which transforms an input circuit into an equivalent output circuit conforming to the coupling constraints with as few additional gates as possible, is essential for running applications on NISQCs. We propose a formulation and two algorithms exploiting gate commutation rules to obtain a better circuit compiler.
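To make the notion of a gate commutation rule concrete, the sketch below checks numerically whether two gates commute, i.e. whether applying them in either order gives the same unitary. It only illustrates the underlying property that a compiler could exploit to reorder gates; it is not the formulation or either algorithm from the paper, and all helper names are hypothetical. The basis ordering |q0 q1> with q0 as the CNOT control is an assumption of this example.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def kron(a, b):
    """Kronecker product: `a` acts on the first qubit, `b` on the second."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def commute(a, b):
    """True if a*b equals b*a (exact, since all entries here are integers)."""
    return matmul(a, b) == matmul(b, a)


I2 = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]
Z = [[1, 0], [0, -1]]

# CNOT with q0 as control and q1 as target, basis order |00>, |01>, |10>, |11>.
CNOT = [[1, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 1],
        [0, 0, 1, 0]]

Z_ON_CONTROL = kron(Z, I2)   # Z on the control qubit
X_ON_TARGET = kron(I2, X)    # X on the target qubit
X_ON_CONTROL = kron(X, I2)   # X on the control qubit

print(commute(CNOT, Z_ON_CONTROL))  # Z on the control commutes with CNOT
print(commute(CNOT, X_ON_TARGET))   # X on the target commutes with CNOT
print(commute(CNOT, X_ON_CONTROL))  # X on the control does not
```

Gates that commute can be swapped without changing the circuit's function, which gives a compiler extra freedom when choosing the order in which gates are mapped to coupled qubits.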

Scalable design for field-coupled nanocomputing circuits

Marcel Walter
Robert Wille
Frank Sill Torres
Daniel Große
Rolf Drechsler

Field-coupled Nanocomputing (FCN) technologies are considered as a solution to overcome physical boundaries of conventional CMOS approaches. But despite groundbreaking advances regarding their physical implementation, e.g., as Quantum-dot Cellular Automata (QCA), Nanomagnet Logic (NML), and many more, there is an unsettling lack of methods for large-scale design automation of FCN circuits. In fact, design automation for this class of technologies is still in its infancy - heavily relying either on manual labor or automatic methods which are applicable for rather small functionality only. This work presents a design method which - for the first time - allows for the scalable design of FCN circuits that satisfy dedicated constraints of these technologies. The proposed scheme is capable of handling around 40000 gates within seconds while the current state-of-the-art takes hours to handle around 20 gates. This is confirmed by experimental results on the layout level for various established benchmark libraries.

BDD-based synthesis of optical logic circuits exploiting wavelength division multiplexing

Ryosuke Matsuo
Jun Shiomi
Tohru Ishihara
Hidetoshi Onodera
Akihiko Shinya
Masaya Notomi

Optical circuits using nanophotonic devices attract significant interest due to their ultra-high speed operation. As a consequence, the synthesis methods for optical circuits also attract increasing attention. However, existing methods for synthesizing optical circuits mostly rely on straightforward mappings from established data structures such as the Binary Decision Diagram (BDD). The strategy of simply mapping a BDD to an optical circuit sometimes results in an explosion of size and involves significant power losses in branches and optical devices. To address these issues, this paper proposes a method for reducing the size of BDD-based optical logic circuits exploiting wavelength division multiplexing (WDM). The paper also proposes a method for reducing the number of branches in a BDD-based circuit, which reduces the power dissipation in laser sources. Experimental results obtained using a partial product accumulation circuit in parallel multipliers demonstrate significant advantages of our method over existing approaches in terms of area and power consumption.

Hybrid binary-unary hardware accelerator

S. Rasoul Faraji
Kia Bazargan

Stochastic computing has been used in recent years to create designs with significantly smaller area by harnessing unary encoding of data. However, the low area advantage comes at an exponential price in latency, making the area x delay cost unattractive. In this paper, we present a novel method which uses a hybrid binary / unary representation to perform computations. We first divide the input range into a few sub-regions, perform unary computations on each sub-region individually, and finally pack the outputs of all sub-regions back to compact binary. Moreover, we propose a synthesis methodology and a regression model to predict an optimal or sub-optimal design in the design space. The proposed method is especially well-suited to FPGAs due to the abundant availability of routing and flip-flop resources. To the best of our knowledge, we are the first to show a scalable method based on the principles of stochastic computing that can beat conventional binary in terms of a real cost, i.e., area x delay. Our method outperforms the binary and fully unary methods on a number of functions and on a common edge detection algorithm. In terms of area x delay cost, our cost is on average only 2.51% and 10.2% of the binary for 8- and 10-bit resolutions, respectively. These numbers are 2--3 orders of magnitude better than the results of traditional stochastic methods. Our method is not competitive with the binary method for high-resolution oscillating functions such as sin(15x).

SESSION: Design, testing, and fault tolerance of neuromorphic systems

Fault tolerance in neuromorphic computing systems

Mengyun Liu
Lixue Xia
Yu Wang
Krishnendu Chakrabarty

Resistive Random Access Memory (RRAM) and RRAM-based computing systems (RCS) provide energy-efficient technology options for neuromorphic computing. However, the applicability of RCS is limited by reliability problems that arise from the immature fabrication process. In order to take advantage of RCS in practical applications, fault-tolerant design is a key challenge. We present a survey of fault-tolerant designs for RRAM-based neuromorphic computing systems. We first describe RRAM-based crossbars and training architectures in RCS. Following this, we classify fault models into different categories, and review post-fabrication testing methods. Subsequently, online testing methods are presented. Finally, we present various fault-tolerant techniques that were designed to tolerate different types of RRAM faults. The methods reviewed in this survey represent recent trends in fault-tolerant designs of RCS, and are expected to motivate further research in this field.

Build reliable and efficient neuromorphic design with memristor technology

Bing Li
Bonan Yan
Chenchen Liu
Hai (Helen) Li

Neuromorphic computing is a revolutionary approach to computation, which attempts to mimic the human brain's mechanism for extremely high implementation efficiency and intelligence. Recent research studies have shown that memristor technology has great potential for realizing power- and area-efficient neuromorphic computing systems (NCS). On the other hand, memristor device processing is still under development. Unreliable devices can severely degrade system performance, which arises as one of the major challenges in developing memristor-based NCS. In this paper, we first review the impacts of the limited reliability of memristor devices and summarize the recent research progress in building reliable and efficient memristor-based NCS. In the end, we discuss the main difficulties and the trend in memristor-based NCS development.

Reliable in-memory neuromorphic computing using spintronics

Christopher Münch
Rajendra Bishnoi
Mehdi B. Tahoori

Recently, Spin Transfer Torque Random Access Memory (STT-MRAM) technology has drawn a lot of attention for the direct implementation of neural networks, because it offers several advantages such as near-zero leakage, high endurance, good scalability, small footprint and CMOS compatibility. The storing device in this technology, the Magnetic Tunnel Junction (MTJ), is developed using magnetic layers that require new fabrication materials and processes. Due to the complexities of fabrication steps and materials, MTJ cells are subject to various failure mechanisms. As a consequence, the functionality of the neuromorphic computing architecture based on this technology is severely affected. In this paper, we have developed a framework to analyze the functional capability of neural network inference in the presence of several MTJ defects. Using this framework, we have demonstrated the memory array size that is necessary to tolerate a given amount of defects and how to actively decrease this overhead by disabling parts of the network.

SESSION: Memory-centric design and synthesis

A staircase structure for scalable and efficient synthesis of memristor-aided logic

Alwin Zulehner
Kamalika Datta
Indranil Sengupta
Robert Wille

The identification of the memristor as the fourth fundamental circuit element and, eventually, its fabrication in the HP labs provide new capabilities for in-memory computing. While there already exist sophisticated methods for realizing logic gates with memristors, mapping them to crossbar structures (which can easily be fabricated) still constitutes a challenging task. This is particularly the case since several (complementary) design objectives have to be satisfied, e.g. the design method has to be scalable, should yield designs requiring a low number of timesteps and utilized memristors, and a layout should result that is hardly skewed. However, all solutions proposed thus far only focus on one of these objectives and hardly address the other ones. Consequently, rather imperfect solutions are generated by state-of-the-art design methods for memristor-aided logic thus far. In this work, we propose a corresponding automatic design solution which addresses all these design objectives at once. To this end, a staircase structure is utilized which employs an almost square-like layout and remains perfectly scalable while, at the same time, keeping the number of timesteps and utilized memristors close to the minimum. Experimental evaluations confirm that the proposed approach indeed allows all design objectives to be satisfied at once.

On-chip memory optimization for high-level synthesis of multi-dimensional data on FPGA

Daewoo Kim
Sugil Lee
Jongeun Lee

It is very challenging to design an on-chip memory architecture for high-performance kernels with large amounts of computation and data. The on-chip memory architecture must support efficient data access from both the computation part and the external memory part, which often have very different expectations about how data should be accessed and stored. Previous work provides only a limited set of optimizations. In this paper we show how to fundamentally restructure on-chip buffers, by decoupling the logical array view from the physical buffer view, and providing general mapping schemes for the two. Our framework considers the entire data flow from the external memory to the computation part in order to minimize resource usage without creating a performance bottleneck. Our experimental results demonstrate that our proposed technique can generate solutions that reduce memory usage significantly (2X over the conventional method), and successfully generate optimized on-chip buffer architectures without costly design iterations for highly optimized computation kernels.

HUBPA: high utilization bidirectional pipeline architecture for neuromorphic computing

Houxiang Ji
Li Jiang
Tianjian Li
Naifeng Jing
Jing Ke
Xiaoyao Liang

Training Convolutional Neural Networks (CNNs) is both memory- and computation-intensive. The resistive random access memory (ReRAM) has shown its advantage in accelerating such tasks with high energy efficiency. However, the ReRAM-based pipeline architecture suffers from low utilization of computing resources, caused by the imbalanced data throughput in different pipeline stages because of the inherent down-sampling effect in CNNs and the inflexible usage of ReRAM cells. In this paper, we propose a novel ReRAM-based bidirectional pipeline architecture, named HUBPA, to accelerate the training with higher utilization of the computing resources. Two stages of the CNN training, forward and backward propagations, are scheduled in HUBPA dynamically to share the computing resources. We design an accessory control scheme for the context switch of these two tasks. We also propose an efficient algorithm to allocate computing resources for each neural network layer. Our experimental results show that, compared with the state-of-the-art ReRAM pipeline architecture, HUBPA improves the performance by 1.7X and reduces the energy consumption by 1.5X, based on the current benchmarks.

SESSION: Efficient modeling of analog, mixed signal and arithmetic circuits

Efficient sparsification of dense circuit matrices in model order reduction

Charalampos Antoniadis
Nestor Evmorfopoulos
Georgios Stamoulis

The integration of more components into ICs due to the ever increasing technology scaling has led to very large parasitic networks consisting of millions of nodes, which have to be simulated at many time points or frequencies to verify the proper operation of the chip. Model Order Reduction (MOR) techniques have been employed routinely to substitute the large scale parasitic model by a model of lower order with similar response at the input/output ports. However, all established MOR techniques result in dense system matrices that render their simulation impractical. To this end, in this paper we propose a methodology for the sparsification of the dense circuit matrices resulting from Model Order Reduction, which employs a sequence of algorithms based on the computation of the nearest diagonally dominant matrix and the sparsification of the corresponding graph. Experimental results indicate that a high sparsity ratio of the reduced system matrices can be achieved with very small loss of accuracy.

Spectral approach to verifying non-linear arithmetic circuits

Cunxi Yu
Tiankai Su
Atif Yasin
Maciej Ciesielski

This paper presents a fast and effective computer algebraic method for analyzing and verifying non-linear integer arithmetic circuits using a novel algebraic spectral model. It introduces the concept of an algebraic spectrum, a numerical form of a polynomial expression; it uses the distribution of coefficients of the monomials to determine the type of arithmetic function under verification. In contrast to previous works, the proof of functional correctness is achieved by computing an algebraic spectrum combined with local rewriting of word-level polynomials. The speedup is achieved by propagating coefficients through the circuit using the And-Inverter Graph (AIG) data structure. The effectiveness of the method is demonstrated with experiments including standard and Booth multipliers, and other synthesized non-linear arithmetic circuits up to 1024 bits containing over 12 million gates.

S2-PM: semi-supervised learning for efficient performance modeling of analog and mixed signal circuits

Mohamed Baker Alawieh
Xiyuan Tang
David Z. Pan

As integrated circuit technologies continue to scale, variability modeling is becoming more crucial, yet more challenging. In this paper, we propose a novel performance modeling method based on semi-supervised co-learning. We exploit the multiple representations of process variation in any analog and mixed signal circuit to establish a co-learning framework where unlabeled samples are leveraged to improve the model accuracy without incurring any simulation cost. Practically, our proposed method relies on a small set of labeled data, and the availability of no-cost unlabeled data, to efficiently build an accurate performance model for any analog and mixed signal circuit design. Our numerical experiments demonstrate that the proposed approach achieves up to 30% reduction in simulation cost compared to the state-of-the-art modeling technique without surrendering any accuracy.

SESSION: Logic and precision optimization for neural network designs

Energy-efficient, low-latency realization of neural networks through boolean logic minimization

Mahdi Nazemi
Ghasem Pasandi
Massoud Pedram

Deep neural networks have been successfully deployed in a wide variety of applications including computer vision and speech recognition. To cope with computational and storage complexity of these models, this paper presents a training method that enables a radically different approach for realization of deep neural networks through Boolean logic minimization. The aforementioned realization completely removes the energy-hungry step of accessing memory for obtaining model parameters, consumes about two orders of magnitude fewer computing resources compared to realizations that use floating-point operations, and has a substantially lower latency.

Log-quantized stochastic computing for memory and computation efficient DNNs

Hyeonuk Sim
Jongeun Lee

For energy efficiency, many low-bit quantization methods for deep neural networks (DNNs) have been proposed. Among them, logarithmic quantization stands out for its acceptable deep learning performance. It also simplifies high-cost multipliers and drastically reduces the memory footprint. Meanwhile, stochastic computing (SC) was proposed for low-cost DNN acceleration, and a recently proposed SC multiplier significantly improved accuracy and latency, which are the main drawbacks of SC. However, in its binary-interfaced system, which still costs much less than storing entire stochastic streams, quantization is basically linear, the same as in conventional fixed-point binary. We applied logarithmically quantized DNNs to the state-of-the-art SC multiplier and studied how it can benefit. We found that SC multiplication on logarithmically quantized inputs is more accurate and can help the fine-tuning process. Furthermore, we designed a much lower-cost SC-DNN accelerator that exploits the reduced complexity of the inputs. Finally, while logarithmic quantization benefits data flow, the proposed architecture achieves 40% less area and 24% less power consumption than the previous SC-DNN accelerator. Its area X latency product is smaller even than that of the shifter-based accelerator.
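As a rough illustration of why logarithmic quantization simplifies multipliers, the sketch below rounds a weight to a signed power of two and then multiplies an integer activation using only a shift. This is generic logarithmic quantization under assumed parameters (the exponent range and all function names are hypothetical); it does not model the SC multiplier or the accelerator described in the abstract.

```python
import math


def log_quantize(w, min_exp=-7, max_exp=0):
    """Return (sign, exponent) such that w is approximated by sign * 2**exponent.

    The exponent range [min_exp, max_exp] is an assumption for this example.
    """
    if w == 0:
        return 0, min_exp
    exp = round(math.log2(abs(w)))
    exp = max(min_exp, min(max_exp, exp))
    return (1 if w > 0 else -1), exp


def shift_multiply(x, sign, exp):
    """Multiply a non-negative integer activation x by sign * 2**exp with shifts only."""
    if sign == 0:
        return 0
    shifted = x << exp if exp >= 0 else x >> -exp
    return sign * shifted


sign, exp = log_quantize(0.3)             # 0.3 is rounded to 2**-2 = 0.25
product = shift_multiply(64, sign, exp)   # 64 * 0.25 computed as 64 >> 2
print(sign, exp, product)
```

Only the sign and a small exponent need to be stored per weight, which is where the reduced memory footprint comes from in this simplified view.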

Cell division: weight bit-width reduction technique for convolutional neural network hardware accelerators

Hanmin Park
Kiyoung Choi

The datapath bit-width of hardware accelerators for convolutional neural network (CNN) inference is generally chosen to be wide enough that they can be used to process upcoming unknown CNNs. Here we introduce the cell division technique, which is a variant of function-preserving transformations. With this technique, it is guaranteed that CNNs whose weights are quantized to a fixed-point format of arbitrary bit-width can be transformed to CNNs with narrower weight bit-widths without any accuracy drop (or any accuracy change). As a result, CNN hardware accelerators are released from the weight bit-width constraint, which has been preventing them from having narrower datapaths. In addition, CNNs that have wider weight bit-widths than those assumed by a CNN hardware accelerator can be executed on the accelerator. Experimental results on LeNet-300-100, LeNet-5, AlexNet, and VGG-16 show that weights can be reduced down to 2--5 bits with a 2.5X--5.2X decrease in weight storage requirement and without any accuracy drop.

SESSION: Modern mask optimization: from shallow to deep learning

LithoROC: lithography hotspot detection with explicit ROC optimization

Wei Ye
Yibo Lin
Meng Li
Qiang Liu
David Z. Pan

As modern integrated circuits scale up with escalating complexity of layout design patterns, lithography hotspot detection, a key stage of physical verification to ensure layout finishing and design closure, has raised a higher demand on its efficiency and accuracy. Among all the hotspot detection approaches, machine learning distinguishes itself for achieving high accuracy while maintaining low false alarms. However, due to the class imbalance problem, the conventional practice which uses the accuracy and false alarm metrics to evaluate different machine learning models is becoming less effective. In this work, we propose the use of the area under the ROC curve (AUC), which provides a more holistic measure for imbalanced datasets compared with the previous methods. To systematically handle class imbalance, we further propose the surrogate loss functions for direct AUC maximization as a substitute for the conventional cross-entropy loss. Experimental results demonstrate that the new surrogate loss functions are promising to outperform the cross-entropy loss when applied to the state-of-the-art neural network model for hotspot detection.
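To illustrate the metric discussed above, the sketch below computes AUC as the fraction of (hotspot, non-hotspot) pairs that a classifier ranks correctly, together with one common differentiable stand-in for it, a pairwise squared hinge loss. The choice of surrogate, the margin, and the sample scores are assumptions for illustration; the abstract does not state which surrogate losses the paper uses.

```python
def split_scores(scores, labels):
    """Separate scores of positive (label 1) and negative (label 0) samples."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    return pos, neg


def auc(scores, labels):
    """Fraction of positive/negative pairs ranked correctly (ties count half)."""
    pos, neg = split_scores(scores, labels)
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def pairwise_squared_hinge(scores, labels, margin=1.0):
    """Smooth pairwise penalty that is zero once every positive beats every negative by `margin`."""
    pos, neg = split_scores(scores, labels)
    total = 0.0
    for p in pos:
        for n in neg:
            total += max(0.0, margin - (p - n)) ** 2
    return total / (len(pos) * len(neg))


# Hypothetical classifier scores; label 1 marks a hotspot.
scores = [0.9, 0.4, 0.35, 0.1]
labels = [1, 0, 1, 0]
print(auc(scores, labels))                     # 3 of 4 pairs are ranked correctly
print(pairwise_squared_hinge(scores, labels))  # positive, so there is still loss to minimize
```

Because AUC depends only on how positives rank against negatives, it is not dominated by the majority class the way plain accuracy can be on imbalanced data.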

Detecting multi-layer layout hotspots with adaptive squish patterns

Haoyu Yang
Piyush Pathak
Frank Gennari
Ya-Chieh Lai
Bei Yu

Layout hotspot detection is one of the critical steps in the modern integrated circuit design flow. It aims to find potential weak points in layouts before feeding them into the manufacturing stage. Rapid development of machine learning has made it a preferable alternative to traditional hotspot detection solutions. Recent research ranges from layout feature extraction to learning model design. However, only single layer layout hotspots are considered in state-of-the-art hotspot detectors, and certain defects such as metal-to-via failures are not naturally supported. In this paper, we propose an adaptive squish representation for multilayer layouts, which is storage efficient, lossless and compatible with deep neural networks. We conduct experiments on 14nm industrial designs with a metal layer and its two adjacent via layers that contain metal-to-via hotspots. Results show that the adaptive squish representation can achieve satisfactory hotspot detection accuracy by incorporating a medium-sized convolutional neural network.

A local optimal method on DSA guiding template assignment with redundant/dummy via insertion

Xingquan Li
Bei Yu
Jianli Chen
Wenxing Zhu

As an emerging manufacturing technology, block copolymer directed self-assembly (DSA) is promising for via layer fabrication. Meanwhile, redundant via insertion is considered an essential step for yield improvement. For better reliability and manufacturability, in this paper, we concurrently consider DSA guiding template assignment with redundant via and dummy via insertion at the post-routing stage. Firstly, by analyzing the structural properties of guiding templates, we propose a building-block based solution expression to discard redundant solutions. Then, honoring the compact solution expression, we construct a conflict graph with dummy via insertion, and formulate the problem as an integer linear programming (ILP) problem. To make a good trade-off between solution quality and runtime, we relax the ILP to an unconstrained nonlinear programming (UNP) problem. Finally, a line search optimization algorithm is proposed to solve the UNP. Experimental results verify the effectiveness of our new solution expression and the efficiency of our proposed algorithm.

Deep learning-based framework for comprehensive mask optimization

Bo-Yi Yu
Yong Zhong
Shao-Yun Fang
Hung-Fei Kuo

With the dramatic increase of design complexity and the advance of semiconductor technology nodes, huge difficulties appear during design for manufacturability with existing lithography solutions. Sub-resolution assist feature (SRAF) insertion and optical proximity correction (OPC) are both indispensable resolution enhancement techniques (RET) to maximize the process window and ensure feature printability. Conventional model-based SRAF insertion and OPC methods are widely applied in industrial applications but suffer from extremely long runtime due to the iterative optimization process. In this paper, we propose the first deep learning framework to simultaneously perform SRAF insertion and edge-based OPC. In addition, to make the optimized masks more reliable and convincing for industrial application, we employ a commercial lithography simulation tool to consider the quality of the wafer image with various lithographic metrics. The effectiveness and efficiency of the proposed framework are demonstrated in experimental results, which also show the success of machine learning-based lithography optimization techniques for the current complex and large-scale circuit layouts.

SESSION: System level modelling methods I

AxDNN: towards the cross-layer design of approximate DNNs

Yinghui Fan
Xiaoxi Wu
Jiying Dong
Zhi Qi

Thanks to the inborn error resistance of neural networks, approximate computing has become a promising and hardware-friendly technique to improve the energy efficiency of DNNs. From the layer of algorithms, architectures, to circuits, there are many possibilities to implement approximate DNNs. However, the complicated interaction between major design concerns, e.g., power performance, and the lack of an efficient simulator across multiple design layers have generated suboptimal solutions of approximate DNNs through the conventional design method. In this paper, we present a systematic framework towards the cross-layer design of approximate DNNs. By introducing hardware imperfection to the training phase, the accuracy of DNN models can be recovered by up to 5.32% when the most aggressive approximate multiplier has been used. Integrated with the techniques of activation pruning and voltage scaling, the energy efficiency of the approximate DNN accelerator can be improved by 52.5% on average. We also build a pre-RTL simulation environment where we can easily express accelerator architectures, try the combination of different approximate strategies, and evaluate the power consumption. Experiments demonstrate that the pre-RTL simulation achieves a ~20X speedup compared with the traditional RTL method when evaluating the same target. The convenient pre-RTL simulation helps us to quickly figure out the trade-off between accuracy and energy at the design stage for an approximate DNN accelerator.

Simulate-the-hardware: training accurate binarized neural networks for low-precision neural accelerators

Jiajun Li
Ying Wang
Bosheng Liu
Yinhe Han
Xiaowei Li

This work investigates how to effectively train binarized neural networks (BNNs) for specialized low-precision neural accelerators. When BNNs are mapped onto specialized neural accelerators that adopt fixed-point feature data representation and binary parameters, the operation overflow caused by short fixed-point coding makes the BNN inference results from deep learning frameworks on CPU/GPU inconsistent with those from the accelerators. This issue leads to a large deviation between the training environment and the inference implementation, and causes potential model accuracy losses when the model is deployed on the accelerators. Therefore, we present a series of methods to contain the overflow phenomenon and enable typical deep learning frameworks such as TensorFlow to effectively train BNNs that work with high accuracy and convergence speed on the specialized neural accelerators.
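The mismatch described in this abstract can be illustrated with a minimal sketch. It assumes a signed 8-bit accumulator and compares wrap-around with saturation; the bit width, the helper names, and the sample values are illustrative assumptions, not the accelerator or the methods used in the paper.

```python
def wrap_int8(x):
    """Wrap an integer into the signed 8-bit two's-complement range."""
    return (x + 128) % 256 - 128


def saturate_int8(x):
    """Clamp an integer to the signed 8-bit range [-128, 127]."""
    return max(-128, min(127, x))


def accumulate(features, weights, clamp):
    """Dot product with binary (+1/-1) weights, applying `clamp` after each add."""
    acc = 0
    for f, w in zip(features, weights):
        acc = clamp(acc + f * w)
    return acc


# Hypothetical feature values and binary weights (assumed for illustration).
features = [100, 60, 30]
weights = [+1, +1, -1]

exact = sum(f * w for f, w in zip(features, weights))   # what a CPU/GPU framework computes
wrapped = accumulate(features, weights, wrap_int8)      # short accumulator that wraps
saturated = accumulate(features, weights, saturate_int8)  # short accumulator that saturates

print(exact, wrapped, saturated)
```

The three results differ, which is the kind of training-versus-deployment deviation the abstract refers to.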

An N-way group association architecture and sparse data group association load balancing algorithm for sparse CNN accelerators

Jingyu Wang
Zhe Yuan
Ruoyang Liu
Huazhong Yang
Yongpan Liu

In recent years, ASIC CNN accelerators have attracted great attention among researchers for their high performance and energy efficiency. Some prior works utilize the sparsity of CNNs to improve performance and energy efficiency. However, these methods bring tremendous overhead to the output memory, and the performance suffers from hash collisions. This paper presents: 1) an N-Way Group Association Architecture to reduce the memory overhead of sparse CNN accelerators; 2) a Sparse Data Group Association Load Balancing Algorithm, implemented by the Scheduler module in the architecture, to reduce the collision rate and improve performance. Compared with the state-of-the-art accelerator, this work achieves either 1) 1.74x performance with a 50% memory overhead reduction in the 4-way associated design or 2) 1.91x performance without memory overhead reduction in the 2-way associated design, which is close to the theoretical performance limit (without collision).

Maximizing power state cross coverage in firmware-based power management

Vladimir Herdt
Hoang M. Le
Daniel Große
Rolf Drechsler

Virtual Prototypes (VPs) are becoming increasingly attractive for the early analysis of SoC power management, which is nowadays mostly implemented in firmware (FW). Power and timing constraints can be monitored and validated by executing a set of test-cases in a power-aware FW/VP co-simulation. In this context, cross coverage of power states is an effective but challenging quality metric. This paper proposes a novel coverage-driven approach to automatically generate test-cases maximizing this cross coverage. In particular, we integrate a coverage-loop that successively refines the generation process based on previous results. We demonstrate our approach on a LEON3-based VP.
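As a rough illustration of the metric, cross coverage can be read as the fraction of state combinations that the executed test-cases have reached. The sketch below assumes two components with hypothetical state names; it is not the paper's coverage model or its generation loop.

```python
from itertools import product


def cross_coverage(observed, states_a, states_b):
    """Return (coverage ratio, sorted list of unreached state pairs)."""
    all_pairs = set(product(states_a, states_b))
    hit = set(observed) & all_pairs
    return len(hit) / len(all_pairs), sorted(all_pairs - hit)


# Hypothetical power states of two components (names are assumptions).
cpu_states = ["RUN", "SLEEP"]
radio_states = ["ON", "OFF"]

# State pairs seen while running some test-cases in co-simulation.
observed = [("RUN", "ON"), ("RUN", "OFF"), ("RUN", "ON")]

ratio, missing = cross_coverage(observed, cpu_states, radio_states)
print(ratio, missing)
```

A coverage-driven generator would use the list of unreached pairs to steer the next round of test-cases.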

SESSION: Testing and design for security

Improving scan chain diagnostic accuracy using multi-stage artificial neural networks

Mason Chern
Shih-Wei Lee
Shi-Yu Huang
Yu Huang
Gaurav Veda
Kun-Han (Hans) Tsai
Wu-Tung Cheng

Diagnosis of intermittent scan chain failures remains a hard problem. We demonstrate that Artificial Neural Networks (ANNs) can be used to achieve significantly higher accuracy. The key is to incorporate domain knowledge and use a multi-stage process in which ANNs have gradually refined focuses. Experimental results on benchmark circuits show that this method is, on average, 20% more accurate than a state-of-the-art commercial tool for intermittent stuck-at faults, and improves the hit rate from 25.3% to 73.9% for some test cases.

Testing stuck-open faults of priority address encoder in content addressable memories

Tsai-Ling Tsai
Jin-Fu Li
Chun-Lung Hsu
Chi-Tien Su

Content addressable memory (CAM) is widely used in systems that need parallel search. Testing a CAM is more difficult than testing a random access memory (RAM) because of the CAM's complicated function. As with RAM testing, CAM testing should cover the cell array and the peripheral circuits. In this paper, we propose a March-like test, March-PCL, for detecting the stuck-open faults (SOFs) of the priority address encoder of CAMs. To the best of our knowledge, this is the first work to discuss the testing of SOFs of the priority address encoder of CAMs. March-PCL requires 4N Write and 4N Compare operations to cover 100% of SOFs.

ScanSAT: unlocking obfuscated scan chains

Lilas Alrahis
Muhammad Yasin
Hani Saleh
Baker Mohammad
Mahmoud Al-Qutayri
Ozgur Sinanoglu

While financially advantageous, outsourcing key steps such as testing to potentially untrusted Outsourced Semiconductor Assembly and Test (OSAT) companies may pose a risk of compromising on-chip assets. Obfuscation of scan chains is a technique that hides the actual scan data from untrusted testers; logic inserted between the scan cells, driven by a secret key, hides the transformation functions between the scan-in stimulus (scan-out response) and the delivered scan pattern (captured response). In this paper, we propose ScanSAT: an attack that transforms a scan-obfuscated circuit into its logic-locked version and applies a variant of the Boolean satisfiability (SAT) based attack, thereby extracting the secret key. Our empirical results demonstrate that ScanSAT can easily break naive scan obfuscation techniques using only three or fewer attack iterations, even for large key sizes and in the presence of scan compression.

CycSAT-unresolvable cyclic logic encryption using unreachable states

Amin Rezaei
You Li
Yuanqi Shen
Shuyu Kong
Hai Zhou

Logic encryption has attracted much attention due to increasing IC design costs and the growing number of untrusted foundries. Unreachable states in a design provide a space of flexibility for logic encryption to explore. However, because the scan chain is accessible, traditional combinational encryption cannot benefit from such flexibility. Cyclic logic encryption inserts key-controlled feedback into the original circuit to prevent piracy and overproduction. We find that cyclic logic encryption can utilize unreachable states to improve security. Even though cyclic encryption is vulnerable to a powerful attack called CycSAT, we develop a new way of cyclic encryption that utilizes unreachable states to defeat CycSAT. The attack complexity of the proposed scheme is discussed and its robustness is demonstrated.

SESSION: Network-centric design and system

Routing in optical network-on-chip: minimizing contention with guaranteed thermal reliability

Mengquan Li
Weichen Liu
Lei Yang
Peng Chen
Duo Liu
Nan Guan

Communication contention and thermal susceptibility are two potential issues in optical network-on-chip (ONoC) architectures, and both are critical for ONoC designs. However, minimizing conflict and guaranteeing thermal reliability are incompatible in most cases. In this paper, we present a routing criterion at the network level. Combined with device-level thermal tuning, it can implement a thermally reliable ONoC. We further propose two routing approaches, a mixed-integer linear programming (MILP) model and a heuristic algorithm (CAR), to minimize communication conflict under guaranteed thermal reliability while mitigating the energy overhead of thermal regulation in the presence of chip thermal variations. By applying the criterion, our approaches achieve excellent performance with a largely reduced complexity of design space exploration. Evaluation results on synthetic communication traces and realistic benchmarks show that the MILP-based approach achieves an average 112.73% improvement in communication performance and a 4.18% reduction in energy overhead compared to state-of-the-art techniques. Our heuristic algorithm introduces only a 4.40% performance difference compared to the optimal results and is more scalable to large ONoCs.

Bidirectional tuning of microring-based silicon photonic transceivers for optimal energy efficiency

Yuyang Wang
M. Ashkan Seyedi
Jared Hulme
Marco Fiorentino
Raymond G. Beausoleil
Kwang-Ting Cheng

Microring-based silicon photonic transceivers are promising to resolve the communication bottleneck of future high-performance computing systems. To rectify process variations in microring resonance wavelengths, thermal tuning is usually preferred over electrical tuning due to its preservation of extinction ratios and quality factors. However, the low energy efficiency of resistive thermal tuners results in nontrivial tuning cost and overall energy consumption of the transceiver. In this study, we propose a hybrid tuning strategy which involves both thermal and electrical tuning. Our strategy determines the tuning direction of each resonance wavelength with the goal of optimizing the transceiver energy efficiency without compromising signal integrity. Formulated as an integer programming problem and solved by a genetic algorithm, our tuning strategy yields 32%~53% savings of overall energy per bit for measured data of 5-channel transceivers at 5~10 Gb/s per channel, and up to 24% saving for synthetic data of 30-channel transceivers, generated based on the process variation models built upon measured data. We further investigated a polynomial-time approximation method which achieves over 100x speedup in tuning scheme computation, while still maintaining considerable energy-per-bit savings.

Redeeming chip-level power efficiency by collaborative management of the computation and communication

Ning Lin
Hang Lu
Xin Wei
Xiaowei Li

Power consumption is the first-order design constraint in future many-core processors. Conventional power management approaches usually focus on certain functional components, either computation or communication hardware resources, and try to optimize their power consumption as much as possible while leaving the other part untouched. However, such a unilateral power control concept, though it has some potential to contribute to overall power reduction, cannot guarantee the optimal power efficiency of the chip. In this paper, we propose a novel Collaborative management approach, termed CoCom, that coordinates both the Computation and Communication infrastructure in tandem. Unlike prior work that deals with power control separately, it leverages the correlations between the two parts as the "key chain" to guide their respective power state coordination in the appropriate direction. In addition, it uses dedicated hybrid on-chip/off-chip mechanisms to minimize the control cost while guaranteeing effectiveness. Experimental results show that, compared with the conventional unilateral baselines, CoCom achieves substantial power reduction with minimal performance degradation.

A high-level modeling and simulation approach using test-driven cellular automata for fast performance analysis of RTL NoC designs

Moon Gi Seok
Hessam S. Sarjoughian
Daejin Park

Speeding up the simulation of packet transmission in a designed RTL NoC is essential for analyzing performance or optimizing NoC parameters for various combinations of intellectual-property (IP) blocks, which requires repeated computations for parameter-space exploration. In this paper, we propose a high-level modeling and simulation (M&S) approach using a revised cellular automata (CA) concept to speed up the simulation of dynamic flit movements and queue occupancy within the target RTL NoC. The CA abstracts the detailed RTL operations by deciding a cell's action state (related to moving packet flits and changing the connections between CA cells) from its own high-level state and those of its neighbors, and then executing the operations relevant to the decided action state. While performing these operations, including connection requests and acceptances, architecture-independent and user-developed routing and arbitration functions are utilized. The decision regarding the action states follows a rule set, which is generated by the proposed test environment. The proposed method was applied to an open-source Verilog NoC and achieved a simulation speedup of approximately 8 to 31 times for a given parameter set.

SESSION: Advanced memory systems

A sharing-aware L1.5D cache for data reuse in GPGPUs

Jianfei Wang
Li Jiang
Jing Ke
Xiaoyao Liang
Naifeng Jing

With GPUs heading towards general-purpose computing, hardware caching, e.g., the first-level data (L1D) cache, has been introduced into the on-chip memory hierarchy of GPGPUs. However, given the massive multi-threading of GPGPUs, the small L1D requires better management to achieve a higher hit rate and benefit performance. In this paper, observing L1D usage inefficiencies such as data duplication among streaming multiprocessors (SMs), which wastes precious L1D resources, we first propose a shared L1.5D cache that substitutes for the private L1D caches in several SMs to reduce the duplicated data and in turn increase the effective cache size for each SM. We evaluate and adopt a suitable layout of the L1.5D to meet the timing requirements in GPGPUs. Then, to protect sharable data from early eviction, we propose a sharable-data-aware cache management scheme, which leverages a lightweight PC-based history table to protect sharable data on cache replacement. The experiments demonstrate that the proposed design achieves an average 20.1% performance improvement, with the on-chip hit rate increased by 16.9%, for applications with sharable data.

NeuralHMC: an efficient HMC-based accelerator for deep neural networks

Chuhan Min
Jiachen Mao
Hai Li
Yiran Chen

In Deep Neural Network (DNN) applications, the energy consumption and performance cost of moving data between the memory hierarchy and computational units are significantly higher than those of the computation itself. A process-in-memory (PIM) architecture such as the Hybrid Memory Cube (HMC) is an excellent candidate for improving data locality for efficient DNN execution. However, it is still hard to efficiently deploy the large-scale matrix computations of DNNs on HMC because of its coarse-grained packet protocol. In this work, we propose NeuralHMC, the first HMC-based accelerator tailored for efficient DNN execution. Experimental results show that NeuralHMC reduces data movement by 1.4x to 2.5x (depending on the DNN data reuse strategy) compared to a Von Neumann architecture. Furthermore, compared to a state-of-the-art PIM-based DNN accelerator, NeuralHMC improves system performance by 4.1x and reduces energy by 1.5x on average.

Boosting chipkill capability under retention-error induced reliability emergency

Xianwei Zhang
Rujia Wang
Youtao Zhang
Jun Yang

The DRAM-based main memory of high-end embedded systems faces two design challenges: (i) degrading reliability; and (ii) increasing power and energy consumption. While chipkill ECC (error correction code) and multi-rate refresh may be adopted to address them, respectively, a simple integration of the two results in 3x or more SDC (silent data corruption) errors and fails to meet the system reliability guarantee. This is referred to as a reliability emergency.
In this paper, we propose PlusN, a hardware-assisted memory error protection design that adaptively boosts the baseline chipkill capability to address the reliability emergency. Based on the error probability assessment at runtime, the system switches its memory protection between the baseline chipkill and PlusN; the latter generates a stronger ECC with low storage and access overheads. Our experimental results show that PlusN can effectively enforce the system reliability guarantee under different reliability emergency scenarios.

SESSION: Learning: make patterning light and right

SRAF insertion via supervised dictionary learning

Hao Geng
Haoyu Yang
Yuzhe Ma
Joydeep Mitra
Bei Yu

In the modern VLSI design flow, sub-resolution assist feature (SRAF) insertion is one of the resolution enhancement techniques (RETs) used to improve chip manufacturing yield. As feature sizes continue to scale down aggressively, layout feature learning becomes extremely critical. In this paper, for the first time, we enhance conventional manual feature construction by proposing a supervised online dictionary learning algorithm for simultaneous feature extraction and dimensionality reduction. By taking advantage of label information, the proposed dictionary learning engine can discriminatively and accurately represent the input data. We further consider SRAF design rules in a global view, and design an integer linear programming model for the post-processing stage of the SRAF insertion framework. Experimental results demonstrate that, compared with a state-of-the-art SRAF insertion tool, our framework not only boosts the mask optimization quality in terms of edge placement error (EPE) and process variation (PV) band area, but also achieves some speed-up.

A fast machine learning-based mask printability predictor for OPC acceleration

Bentian Jiang
Hang Zhang
Jinglei Yang
Evangeline F. Y. Young

Continuous shrinking of VLSI technology nodes brings us powerful chips with lower power consumption, but it also introduces many manufacturability issues. The lithography simulation process for new feature sizes suffers from large computational overhead. As a result, the conventional mask optimization process has become extremely resource-consuming in terms of both time and cost. In this paper, we propose a high-performance machine learning-based mask printability evaluation framework for lithography-related applications, and apply it in a conventional mask optimization tool to verify its effectiveness.

Semi-supervised hotspot detection with self-paced multi-task learning

Ying Chen
Yibo Lin
Tianyang Gai
Yajuan Su
Yayi Wei
David Z. Pan

Lithography simulation is computationally expensive for hotspot detection. Machine learning-based hotspot detection is a promising technique to reduce the simulation overhead. However, most learning approaches rely on a large amount of training data to achieve good accuracy and generality. At the early stage of developing a new technology node, the amount of data labeled as hotspots or non-hotspots is very limited. In this paper, we propose a semi-supervised hotspot detection approach with a self-paced multi-task learning paradigm, leveraging data samples both with and without labels to improve model accuracy and generality. Experimental results demonstrate that our approach can achieve 2.9-4.5% better accuracy at the same false alarm levels than the state-of-the-art work using 10%-50% of the training data. The source code and trained models are released on https://github.com/qwepi/SSL.

SESSION: Design and CAD for emerging memories

Exploring emerging CNFET for efficient last level cache design

Dawen Xu
Li Li
Ying Wang
Cheng Liu
Huawei Li

Carbon nanotube field-effect transistors (CNFETs) are emerging as a promising alternative to conventional CMOS owing to their much higher speed and power efficiency. They are particularly suitable for building the power-hungry last level cache (LLC). However, process variation (PV) in CNFETs substantially affects operation stability and thus the worst-case timing, which dramatically limits the LLC operating frequency in a fully synchronous design. To address this problem, we developed a variation-aware cache in which each part of the cache can run at its optimal frequency, so that overall cache performance improves significantly.
Because the variation unique to the CNFET fabrication process is asymmetrically correlated, the cache latency distribution is closely related to the LLC layout. For the two typical LLC layouts, we propose the variation-aware-set (VAS) cache and the variation-aware-way (VAW) cache, respectively, to make the best use of the CNFET cache architecture. For the VAS cache, we further propose a static page mapping to ensure that the most frequently used data are mapped to the fast cache region. Similarly, we apply a latency-aware LRU replacement strategy to assign the most recent data to the fast cache region. According to the experiments, the optimized CNFET-based LLC improves performance by 39% and reduces power consumption by 10% on average compared to the baseline CNFET LLC design.

Mosaic: an automated synthesis flow for boolean logic based on memristor crossbar

Lei Xie

A memristor crossbar stacked on top of CMOS circuitry is a promising candidate for future VLSI circuits, due to its great scalability, near-zero standby power consumption, etc. To design large-scale logic circuits, an automated synthesis flow that maps Boolean functions onto a memristor crossbar is in high demand. This paper proposes such a synthesis flow, Mosaic, which reuses part of the existing CMOS synthesis flow. In addition, two schemes are proposed to optimize designs in terms of delay and power consumption. To verify Mosaic and its optimization schemes, four types of adders are used as a case study; the incurred delay, area and power costs for both the crossbar and its CMOS controller are evaluated. The results show that the optimized adders reduce delay (>26%), power consumption (>21%) and area (>23%) compared to the initial ones. To show the potential of Mosaic for design space exploration, we use nine other, more complex benchmarks. The results show that the design can be significantly optimized in terms of both area (4.5x to 82.9x) and delay (2.4x to 9.5x).

Handling stuck-at-faults in memristor crossbar arrays using matrix transformations

Baogang Zhang
Necati Uysal
Deliang Fan
Rickard Ewetz

Matrix-vector multiplication is the dominant computational workload in the inference phase of neural networks. Memristor crossbar arrays (MCAs) can inherently execute matrix-vector multiplication with low latency and low power consumption. A key challenge is that the classification accuracy may be severely degraded by stuck-at-fault defects. Earlier studies have shown that the accuracy loss can be recovered by retraining each neural network or by utilizing additional hardware. In this paper, we propose to handle stuck-at-faults using matrix transformations. A transformation T changes a weight matrix W into a weight matrix W' = T(W) that is more robust to stuck-at-faults. In particular, we propose a row flipping transformation, a permutation transformation, and a value range transformation. The row flipping transformation translates stuck-off (stuck-on) faults into stuck-on (stuck-off) faults. The permutation transformation maps small (large) weights to memristors that are stuck-off (stuck-on). The value range transformation reduces the magnitude of the smallest and largest elements in the matrix, so that each stuck-at-fault introduces an error of smaller magnitude. The experimental results demonstrate that the proposed framework is capable of recovering 99% of the accuracy loss introduced by stuck-at-faults without requiring the neural network to be retrained.
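The idea behind the permutation transformation can be sketched under simplifying assumptions: weights lie in [0, 1], a stuck-off cell reads as 0.0 and a stuck-on cell as 1.0, and the best row assignment is found by brute force. The fault model, the error measure, and the function names are assumptions made for illustration; the abstract does not specify the authors' algorithm.

```python
from itertools import permutations

# Assumed fault model: value a faulty cell is forced to.
STUCK_VALUE = {"off": 0.0, "on": 1.0}


def fault_error(weights, faults, perm):
    """Total absolute error when logical row perm[p] is stored in physical row p."""
    err = 0.0
    for p, logical in enumerate(perm):
        for w, f in zip(weights[logical], faults[p]):
            if f is not None:
                err += abs(w - STUCK_VALUE[f])
    return err


def best_row_permutation(weights, faults):
    """Exhaustively search row assignments (only practical for tiny matrices)."""
    rows = range(len(weights))
    return min(permutations(rows), key=lambda perm: fault_error(weights, faults, perm))


weights = [[0.9, 0.1],
           [0.1, 0.8]]
faults = [["off", None],   # physical row 0, column 0 is stuck-off
          [None, None]]

best = best_row_permutation(weights, faults)
print(best, fault_error(weights, faults, best))
```

Here the small weight 0.1 is moved onto the stuck-off cell, so the fault costs 0.1 instead of 0.9. In a real crossbar the inputs or outputs would need the matching permutation so that the product is unchanged.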

SESSION: Optimized training for neural networks

CAPTOR: a class adaptive filter pruning framework for convolutional neural networks in mobile applications

Zhuwei Qin
Fuxun Yu
Chenchen Liu
Xiang Chen

Nowadays, the evolution of deep learning and cloud services significantly promotes neural network-based mobile applications. Although intelligent and prolific, those applications still lack flexibility: for classification tasks, neural networks are generally trained online with vast classification targets to cover various utilization contexts. However, only some of the classes are used in practice due to individual mobile user preference and application specificity. Thus the unneeded classes cause considerable computation and communication cost. In this work, we propose CAPTOR, a class-level reconfiguration framework for Convolutional Neural Networks (CNNs). By identifying the class activation preference of convolutional filters through feature interest visualization and gradient analysis, CAPTOR can effectively cluster and adaptively prune the filters associated with unneeded classes. Therefore, CAPTOR enables class-level CNN reconfiguration for network model compression and local deployment on mobile devices. Experiments show that CAPTOR can reduce the computation load for VGG-16 by up to 40.5% and energy consumption by up to 37.9% with negligible loss of accuracy. For AlexNet, CAPTOR also reduces the computation load by up to 42.8% and energy consumption by up to 37.6% with less than 3% loss in accuracy.

TNPU: an efficient accelerator architecture for training convolutional neural networks

Jiajun Li
Guihai Yan
Wenyan Lu
Shuhao Jiang
Shijun Gong
Jingya Wu
Junchao Yan
Xiaowei Li

Training large-scale convolutional neural networks (CNNs) is an extremely computation- and memory-intensive task that requires massive computational resources and training time. Recently, many accelerator solutions have been proposed to improve the performance and efficiency of CNNs. Existing approaches mainly focus on the inference phase of CNNs, and can hardly address the new challenges posed by CNN training: the resource requirement diversity and the bidirectional data dependency between convolutional layers (CVLs) and fully-connected layers (FCLs). To overcome this problem, this paper presents a new accelerator architecture for CNN training, called TNPU, which leverages the complementary resource requirements of CVLs and FCLs. Unlike prior approaches that optimize CVLs and FCLs separately, we orchestrate the computation of CVLs and FCLs to run concurrently in a single computing unit, so that both computing and memory resources maintain high utilization, thereby boosting performance. We also propose a simplified out-of-order scheduling mechanism to address the bidirectional data dependency issues in CNN training. The experiments show that TNPU achieves speedups of 1.5x and 1.3x, with average energy reductions of 35.7% and 24.1%, over comparably provisioned state-of-the-art accelerators (DNPU and DaDianNao), respectively.

REIN: a robust training method for enhancing generalization ability of neural networks in autonomous driving systems

Fuxun Yu
Chenchen Liu
Xiang Chen

In recent years, neural networks have shown great potential in autonomous driving systems. However, theoretically well-trained neural networks often perform poorly when facing real-world examples with unexpected physical variations. As current neural networks still suffer from limited generalization ability, those unexpected variations can cause considerable accuracy degradation and critical safety issues. Therefore, the generalization ability of neural networks has become one of the most critical challenges for autonomous driving system design. In this work, we propose a robust training method to enhance the generalization ability of neural networks in various practical autonomous driving scenarios. Based on detailed practical variation modeling and neural network generalization ability analysis, the proposed training method can consistently improve model classification accuracy by up to 25% in various scenarios (e.g., rainy/foggy weather, dark lighting, and camera discrepancy). Even with adversarial corner cases, our model can still achieve up to 40% accuracy improvement over the natural model.

SESSION: New trends in biochips

Factorization based dilution of biochemical fluids with micro-electrode-dot-array biochips

Sohini Saha
Debraj Kundu
Sudip Roy
Sukanta Bhattacharjee
Krishnendu Chakrabarty
Partha P. Chakrabarti
Bhargab B. Bhattacharya

Sample preparation, an essential preprocessing step for biochemical protocols, is concerned with the generation of fluids satisfying specific target ratios and error tolerance. Recent micro-electrode-dot-array (MEDA)-based DMF biochips provide the advantage of supporting both discrete and dynamic mixing models, the power of which has not yet been fully harnessed for implementing on-chip dilution and mixing of fluids. In this paper, we propose a novel factorization-based algorithm called FacDA for efficient and accurate dilution of sample fluid on a MEDA chip. Simulation results reveal that over a large number of test cases with the mixing volume constraint in the range of 4-10 units, FacDA requires around 38% fewer mixing steps and 52% fewer sample units, and generates approximately 23% less waste, all on average, compared to two prior dilution algorithms used for MEDA chips.

Sample preparation for multiple-reactant bioassays on micro-electrode-dot-array biochips

Tung-Che Liang
Yun-Sheng Chan
Tsung-Yi Ho
Krishnendu Chakrabarty
Chen-Yi Lee

Sample preparation, as a key procedure in many biochemical protocols, mixes various samples and/or reagents into solutions that contain the target concentrations. Digital microfluidic biochips (DMFBs) have been adopted as a platform for sample preparation because they provide automatic procedures that require less reactant consumption and reduce human-induced errors. However, traditional DMFBs only utilize the (1:1) mixing model, i.e., only two droplets of the same volume can be mixed at a time, which results in higher completion time and the wastage of valuable reactants. To overcome this limitation, a next-generation micro-electrode-dot-array (MEDA) architecture that provides flexibility of mixing multiple droplets of different volumes in a single operation was proposed. In this paper, we present a generic multiple-reactant sample preparation algorithm that exploits the novel fluidic operations on MEDA biochips. Simulated experiments show that the proposed method outperforms existing methods in terms of saving reactant cost, minimizing the number of operations, and reducing the amount of waste.
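The difference between the (1:1) mixing model and multi-volume mixing can be shown with basic concentration arithmetic. The sketch below assumes ideal mixing (volume-weighted average concentration) and uses a hypothetical `mix` helper; it is not the sample preparation algorithm from the paper.

```python
from fractions import Fraction


def mix(droplets):
    """Mix droplets given as (concentration, volume); return (concentration, volume)."""
    total_volume = sum(v for _, v in droplets)
    conc = sum(Fraction(c) * v for c, v in droplets) / total_volume
    return conc, total_volume


sample = (1, 1)   # pure sample, 1 unit volume
buffer = (0, 1)   # pure buffer, 1 unit volume

# (1:1) model: reaching concentration 1/4 takes two mix operations.
half, _ = mix([sample, buffer])        # 1/2
quarter_1to1, _ = mix([(half, 1), buffer])  # one unit of the 1/2 droplet plus buffer

# Multi-volume mixing: 1 unit of sample with 3 units of buffer in one operation.
quarter_meda, volume = mix([(1, 1), (0, 3)])

print(quarter_1to1, quarter_meda, volume)
```

Both routes reach 1/4, but the (1:1) route needs an extra operation and leaves an unused half-concentration droplet, which is the kind of overhead the abstract attributes to traditional DMFBs.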

Robust sample preparation on digital microfluidic biochips

Zhanwei Zhong
Robert Wille
Krishnendu Chakrabarty

Sample preparation is an important application for the digital microfluidic biochip (DMFB) platform, and many methods have been developed to reduce the time and reagent usage associated with on-chip sample preparation. However, errors in fluidic operations can result in the concentration of the resulting droplet being outside the calibration range. Current error-recovery methods have the drawback that they require on-chip sensors and additional re-execution time. In this paper, we present two dilution-chain structures that can generate a droplet with a desired concentration even if volume variations occur during droplet splitting. Experimental results show the effectiveness of the proposed method compared to previous methods.

SESSION: Power-efficient machine learning hardware design

SAADI: a scalable accuracy approximate divider for dynamic energy-quality scaling

Setareh Behroozi
Jingjie Li
Jackson Melchert
Younghyun Kim

Approximate computing can significantly improve the energy efficiency of arithmetic operations in error-resilient applications. In this paper, we propose an approximate divider design that facilitates dynamic energy-quality scaling. Conventional approximate dividers lack runtime energy-quality scalability, which is the key to maximizing energy efficiency while meeting dynamically varying accuracy requirements. Our divider design, named SAADI, approximates the reciprocal of the divisor in an incremental manner, so that division speed and energy efficiency can be dynamically traded for accuracy by controlling the number of iterations. For the approximate 8-bit division of 32-bit/16-bit division, the average accuracy of SAADI can be adjusted between 92.5% and 99.0% by varying latency up to 7x. We evaluate the accuracy and energy consumption of SAADI for various design parameters and demonstrate its efficacy for low-power signal processing applications.
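To make the accuracy-versus-iterations trade-off concrete, the sketch below uses a generic truncated geometric series for the reciprocal, 1/d = 1 + x + x^2 + ... with x = 1 - d. This is a textbook approximation chosen for illustration, with the divisor assumed to be pre-scaled into [0.5, 1); it is not a description of SAADI's circuit.

```python
def approx_reciprocal(d, iterations):
    """Approximate 1/d for 0.5 <= d < 1 with `iterations` extra series terms."""
    x = 1.0 - d
    term, total = 1.0, 1.0
    for _ in range(iterations):
        term *= x          # next power of x
        total += term      # each iteration refines the estimate
    return total


def approx_divide(n, d, iterations):
    """Approximate n / d by multiplying with the approximate reciprocal."""
    return n * approx_reciprocal(d, iterations)


d = 0.75
errors = [abs(1.0 / d - approx_reciprocal(d, k)) for k in range(5)]
print(errors)
print(approx_divide(3.0, d, 4))
```

Stopping after fewer iterations costs less work but leaves a larger error, which mirrors the runtime energy-quality knob the abstract describes.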

SeFAct: selective feature activation and early classification for CNNs

Farhana Sharmin Snigdha
Ibrahim Ahmed
Susmita Dey Manasi
Meghna G. Mankalale
Jiang Hu
Sachin S. Sapatnekar

This work presents a dynamic energy reduction approach for hardware accelerators for convolutional neural networks (CNNs). Two methods are used: (1) an adaptive data-dependent scheme that selectively activates a subset of all neurons by narrowing down the possible activated classes, and (2) static bitwidth reduction. The former is applied in late layers of the CNN, while the latter is more effective in early layers. Even accounting for the implementation overheads, the results show 20-25% energy savings with 5-10% accuracy loss.

FACH: FPGA-based acceleration of hyperdimensional computing by reducing computational complexity

Mohsen Imani
Sahand Salamat
Saransh Gupta
Jiani Huang
Tajana Rosing

Brain-inspired hyperdimensional (HD) computing explores computing with hypervectors for the emulation of cognition as an alternative to computing with numbers. In HD, input symbols are mapped to a hypervector and an associative search is performed for reasoning and classification. An associative memory, which finds the closest match between a set of learned hypervectors and a query hypervector, uses a simple Hamming distance metric for the similarity check. However, we observe that, in order to provide acceptable classification accuracy, HD needs to store a non-binarized model in associative memory and use costly similarity metrics such as cosine to perform a reasoning task. This makes HD computationally expensive when it is used for realistic classification problems. In this paper, we propose an FPGA-based acceleration of HD (FACH) which significantly improves the computation efficiency by removing the majority of multiplications during the reasoning task. FACH identifies representative values in each class hypervector using a clustering algorithm. Then, it creates a new HD model with hardware-friendly operations, and accordingly proposes an FPGA-based implementation to accelerate such tasks. Our evaluations on several classification problems show that FACH can provide 5.9X energy efficiency improvement and 5.1X speedup as compared to a baseline FPGA-based implementation, while ensuring the same quality of classification.
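
The baseline associative search mentioned in the abstract, a nearest match by Hamming distance over binarized hypervectors, can be sketched as follows. The 8-element vectors, the class labels, and the function names are made up for illustration (real HD models use thousands of dimensions), and the sketch shows only the binary baseline, not FACH's clustered non-binary model.

```python
from typing import Dict, List


def hamming_distance(a: List[int], b: List[int]) -> int:
    """Count the positions at which two equal-length binary vectors differ."""
    if len(a) != len(b):
        raise ValueError("hypervectors must have the same dimensionality")
    return sum(x != y for x, y in zip(a, b))


def associative_search(query: List[int],
                       class_hypervectors: Dict[str, List[int]]) -> str:
    """Return the label whose learned hypervector is closest to the query."""
    return min(class_hypervectors,
               key=lambda label: hamming_distance(query, class_hypervectors[label]))


# Toy model: two learned class hypervectors (hypothetical values).
classes = {
    "A": [1, 0, 1, 1, 0, 0, 1, 0],
    "B": [0, 1, 0, 0, 1, 1, 0, 1],
}
query = [1, 0, 1, 0, 0, 0, 1, 0]
predicted = associative_search(query, classes)
```

The Hamming search needs only comparisons and a counter, which is why it is cheap in hardware; a cosine similarity over a non-binarized model replaces each comparison with a multiplication, and that is the cost the abstract says FACH targets.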

SESSION: Security of machine learning and machine learning for security: progress and challenges for secure, machine intelligent mobile systems

ADMM attack: an enhanced adversarial attack for deep neural networks with undetectable distortions

Pu Zhao
Kaidi Xu
Sijia Liu
Yanzhi Wang
Xue Lin

Many recent studies demonstrate that state-of-the-art deep neural networks (DNNs) might be easily fooled by adversarial examples, generated by adding carefully crafted and visually imperceptible distortions onto original legal inputs through adversarial attacks. Adversarial examples can lead the DNN to misclassify them as any target labels. In the literature, various methods are proposed to minimize the different lp norms of the distortion. However, a versatile framework for all types of adversarial attacks is still lacking. To achieve a better understanding of the security properties of DNNs, we propose a general framework for constructing adversarial examples by leveraging the Alternating Direction Method of Multipliers (ADMM) to split the optimization approach for effective minimization of various lp norms of the distortion, including l0, l1, l2, and l∞ norms. Thus, the proposed general framework unifies the methods of crafting l0, l1, l2, and l∞ attacks. The experimental results demonstrate that the proposed ADMM attacks achieve both a high attack success rate and minimal distortion for the misclassification compared with state-of-the-art attack methods.
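
For readers unfamiliar with the four distortion measures named above, the following sketch computes them for a perturbation vector. It illustrates only the norms themselves, not the ADMM attack; the function name lp_norms and the sample perturbation are hypothetical.

```python
from typing import Dict, List


def lp_norms(delta: List[float]) -> Dict[str, float]:
    """Return the l0, l1, l2 and l-infinity measures of a perturbation."""
    magnitudes = [abs(d) for d in delta]
    return {
        "l0": sum(1 for v in magnitudes if v != 0),   # number of changed entries
        "l1": sum(magnitudes),                        # total absolute change
        "l2": sum(v * v for v in magnitudes) ** 0.5,  # Euclidean length
        "linf": max(magnitudes, default=0.0),         # largest single change
    }


# Hypothetical distortion added to a 4-pixel input.
delta = [0.0, 3.0, -4.0, 0.0]
norms = lp_norms(delta)
```

Each norm penalizes a different kind of distortion: l0 counts how many inputs change, l∞ bounds the largest single change, and l1 and l2 sit in between, which is why attacks are usually categorized by the norm they minimize.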

A system-level perspective to understand the vulnerability of deep learning systems

Tao Liu
Nuo Xu
Qi Liu
Yanzhi Wang
Wujie Wen

Deep neural networks (DNNs) now achieve human-level performance on many machine learning applications such as self-driving cars, gaming, and computer-aided diagnosis. However, recent studies show that such a promising technique has gradually become a major attack target, significantly threatening the safety of machine learning services. On one hand, the adversarial or poisoning attacks incurred by DNN algorithm vulnerabilities can mislead decisions with very high confidence. On the other hand, system-level DNN attacks built upon models, training/inference algorithms, and the hardware and software involved in DNN execution have also emerged, causing more diversified damage such as denial of service and private data theft. In this paper, we present an overview of such emerging system-level DNN attacks by systematically formulating their attack routines. Several representative cases are selected in our study to summarize the characteristics of system-level DNN attacks. Based on our formulation, we further discuss the challenges and several possible techniques to mitigate such emerging system-level DNN attacks.

HAMPER: high-performance adaptive mobile security enhancement against malicious speech and image recognition

Zirui Xu
Fuxun Yu
Chenchen Liu
Xiang Chen

Recently, machine learning technologies have been widely used in cognitive applications such as Automatic Speech Recognition (ASR) and Image Recognition (IR). Unfortunately, these techniques have been massively used in unauthorized audio/image data analysis, causing serious privacy leakage. To address this issue, we propose HAMPER in this work, which is a data encryption framework that protects audio/image data from unauthorized ASR/IR analysis. Leveraging machine learning models' vulnerability to adversarial examples, HAMPER encrypts the audio/image data with adversarial noise to perturb the recognition results of ASR/IR systems. To deploy the proposed framework on a wide range of platforms (e.g. mobile devices), HAMPER also takes into consideration computation efficiency, perturbation transferability, and data attribute configuration. Therefore, rather than focusing on the high-level machine learning models, HAMPER generates adversarial examples from the low-level features. Taking advantage of the light computation load, fundamental impact, and direct configurability of the low-level features, the generated adversarial examples can efficiently and effectively affect the whole ASR/IR system. Experimental results show that HAMPER can effectively perturb unauthorized ASR/IR analysis with 85% Word-Error-Rate (WER) and 83% Image-Error-Rate (IER), respectively. Also, HAMPER achieves faster processing with a 1.5X speedup for image encryption and as much as 26X for audio, compared to the state-of-the-art methods. Moreover, HAMPER achieves strong transferability and configures adversarial examples with desired attributes for better scenario adaptation.

AdverQuil: an efficient adversarial detection and alleviation technique for black-box neuromorphic computing systems

Hsin-Pai Cheng
Juncheng Shen
Huanrui Yang
Qing Wu
Hai Li
Yiran Chen

In recent years, neuromorphic computing systems (NCS) have gained popularity in accelerating neural network computation because of their high energy efficiency. The known vulnerability of neural networks to adversarial attack, however, raises a severe security concern for NCS. In addition, there are certain application scenarios in which users have limited access to the NCS. In such scenarios, defense technologies that require changing the training methods of the NCS, e.g., adversarial training, become impracticable. In this work, we propose AdverQuil - an efficient adversarial detection and alleviation technique for black-box NCS. AdverQuil can identify the adversarial strength of input examples and select the best strategy for NCS to respond to the attack, without changing the structure/parameters of the original neural network or its training method. Experimental results show that on MNIST and CIFAR-10 datasets, AdverQuil achieves a high efficiency of 79.5 - 167K image/sec/watt. AdverQuil introduces less than 25% of hardware overhead, and can be combined with various adversarial alleviation techniques to provide a flexible trade-off between hardware cost, energy efficiency and classification accuracy.

SESSION: System level modelling methods II

SIMULTime: Context-sensitive timing simulation on intermediate code representation for rapid platform explorations

Alessandro Cornaglia
Alexander Viehl
Oliver Bringmann
Wolfgang Rosenstiel

Nowadays, product lines are common practice in the embedded systems domain as they allow for substantial reductions in development costs and time-to-market through the consistent application of design paradigms such as variability and structured reuse management. In that context, accurate and fast timing predictions are essential for an early evaluation of all relevant variants of a product line concerning target platform properties. Context-sensitive simulations provide attractive benefits for timing analysis. Nevertheless, these simulations depend strongly on a single configuration pair of compiler and hardware platform. To cope with this limitation, we present SIMULTime, a new technique for context-sensitive timing simulation based on the software intermediate representation. The assured simulation throughput significantly increases by simultaneously simulating different hardware platforms and compiler configurations. Multiple accurate timing predictions are produced by running the simulator only once. Our novel approach was applied to several applications, showing that SIMULTime increases the average simulation throughput by 90% when at least four configurations are analyzed in parallel.

Modeling processor idle times in MPSoC platforms to enable integrated DPM, DVFS, and task scheduling subject to a hard deadline

Amirhossein Esmaili
Mahdi Nazemi
Massoud Pedram

Energy efficiency is one of the most critical design criteria for modern embedded systems such as multiprocessor system-on-chips (MPSoCs). Dynamic voltage and frequency scaling (DVFS) and dynamic power management (DPM) are two major techniques for reducing energy consumption in such embedded systems. Furthermore, MPSoCs are becoming more popular for many real-time applications. One of the challenges of integrating DPM with DVFS and task scheduling of real-time applications on MPSoCs is the modeling of idle intervals on these platforms. In this paper, we present a novel approach for modeling idle intervals in MPSoC platforms which leads to a mixed integer linear programming (MILP) formulation integrating DPM, DVFS, and task scheduling of periodic task graphs subject to a hard deadline. We also present a heuristic approach for solving the MILP and compare its results with those obtained from solving the MILP.

Phone-nomenon: a system-level thermal simulator for handheld devices

Hong-Wen Chiou
Yu-Min Lee
Shin-Yu Shiau
Chi-Wen Pan
Tai-Yu Chen

This work presents a system-level thermal simulator, Phone-nomenon, to predict the thermal behavior of smartphones. First, we study the nonlinearity of internal and external heat transfer mechanisms and propose a compact thermal model. After that, we develop an iterative framework to handle the nonlinearity. Compared with a commercial tool, ANSYS Icepak, Phone-nomenon can achieve two and three orders of magnitude speedup with 3.58% maximum error and 1.72°C difference for steady-state and transient-state simulations, respectively. Meanwhile, Phone-nomenon also fits the measured data of a built thermal test vehicle well.

Virtual prototyping of heterogeneous automotive applications: matlab, SystemC, or both?

Xiao Pan
Carna Zivkovic
Christoph Grimm

We present a case study on virtual prototyping of automotive applications. We address the co-simulation of HW/SW systems involving firmware, communication protocols, and physical/mechanical systems in the context of model-based and agile development processes. The case study compares the Matlab/Simulink and SystemC based approaches by an e-gas benchmark. We compare the simulation performance, modeling capabilities and applicability in different stages of the development process.

SESSION: Placement

Diffusion break-aware leakage power optimization and detailed placement in sub-10nm VLSI

Sun ik Heo
Andrew B. Kahng
Minsoo Kim
Lutong Wang

A diffusion break (DB) isolates two neighboring devices in a standard cell-based design and has a stress effect on delay and leakage power. In foundry sub-10nm design enablements, device performance is changed according to the type of DB - single diffusion break (SDB) or double diffusion break (DDB) - that is used in the library cell layout. Crucially, local layout effect (LLE) can substantially affect device performance and leakage. Our present work focuses on the 2nd DB effect, a type of LLE in which distance to the second-closest DB (i.e., a distance that depends on the placement of a given cell's neighboring cell) also impacts performance of a given device. In this work, we implement a 2nd DB-aware timing and leakage analysis flow, and show how a lack of 2nd DB awareness can misguide current optimization in place-and-route stages. We then develop 2nd DB-aware leakage optimization and detailed placement heuristics. Experimental results in a scaled foundry 14nm technology indicate that our 2nd DB-aware analysis and optimization flow achieves, on average, 80% recovery of the leakage increment that is induced by the 2nd DB effect, without changing design performance.

MDP-trees: multi-domain macro placement for ultra large-scale mixed-size designs

Yen-Chun Liu
Tung-Chieh Chen
Yao-Wen Chang
Sy-Yen Kuo

In this paper, we present a new hybrid representation of slicing trees and multi-packing trees, called multi-domain-packing trees (MDP-trees), for macro placement to handle ultra large-scale multi-domain mixed-size designs. A multi-domain design typically consists of a set of mixed-size domains, each with hundreds/thousands of large macros and (tens of) millions of standard cells, which is often seen in modern high-end applications (e.g., 4G LTE products and upcoming 5G ones). To the best of our knowledge, there is still no published work specifically tackling the domain planning and macro placement simultaneously. Based on binary trees, the MDP-tree is very efficient and effective for handling macro placement with multiple domains. Previous works on macro placement can handle only single-domain designs, which do not consider the global interactions among domains. In contrast, our MDP-trees plan domain regions globally, and optimize the interconnections among domains and macro/cell positions simultaneously. The placement area of each domain is well reserved, and the macro displacement is minimized from initial macro positions of the design prototype. Experimental results show that our approach can significantly reduce both the average half-perimeter wirelength and the average global routing wirelength.

A shape-driven spreading algorithm using linear programming for global placement

Shounak Dhar
Love Singhal
Mahesh A. Iyer
David Z. Pan

In this paper, we consider the problem of finding the global shape for placement of cells in a chip that results in minimum wirelength. Under certain assumptions, we theoretically prove that some shapes are better than others for purposes of minimizing wirelength, while ensuring that overlap-removal is a key constraint of the placer. We derive some conditions for the optimal shape and obtain a shape which is numerically close to the optimum. We also propose a linear-programming-based spreading algorithm with parameters to tune the resultant shape and derive a cost function that is better than total or maximum displacement objectives, that are traditionally used in many numerical global placers. Our new cost function also does not require explicit wirelength computation, and our spreading algorithm preserves to a large extent, the relative order among the cells placed after a numerical placer iteration. Our experimental results demonstrate that our shape-driven spreading algorithm improves wirelength, routing congestion and runtime compared to a bi-partitioning based spreading algorithm used in a state-of-the-art academic global placer for FPGAs.

Finding placement-relevant clusters with fast modularity-based clustering

Mateus Fogaça
Andrew B. Kahng
Ricardo Reis
Lutong Wang

In advanced technology nodes, IC implementation faces increasing design complexity as well as ever-more demanding design schedule requirements. This raises the need for new decomposition approaches that can help reduce problem complexity, in conjunction with new predictive methodologies that can help avoid bottlenecks and loops in the physical implementation flow. Notably, with modern design methodologies it would be very valuable to better predict final placement of the gate-level netlist: this would enable more accurate early assessment of performance, congestion and floorplan viability in the SOC floorplanning/RTL planning stages of design. In this work, we study a new criterion for the classic challenge of VLSI netlist clustering: how well netlist clusters "stay together" through final implementation. We propose use of several evaluators of this criterion. We also explore the use of modularity-driven clustering to identify natural clusters in a given graph without the tuning of parameters and size balance constraints typically required by VLSI CAD partitioning methods. We find that the netlist hypergraph-to-graph mapping can significantly affect quality of results, and we experimentally identify an effective recipe for weighting that also comprehends topological proximity to I/Os. Further, we empirically demonstrate that modularity-based clustering achieves better correlation to actual netlist placements than traditional VLSI CAD methods (our method is also 4X faster than use of hMetis for our largest testcases). Finally, we show a potential flow with fast "blob placement" of clusters to evaluate netlist and floorplan viability in early design stages; this flow can predict gate-level placement of 370K cells in 200 seconds on a single core.
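
As background for the modularity-driven clustering mentioned above, the sketch below evaluates the standard (Newman) modularity score of a given partition of an unweighted graph. It is an illustration only: the paper's hypergraph-to-graph mapping and edge weighting are not reproduced, and the toy graph, labels, and function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple


def modularity(edges: List[Tuple[Hashable, Hashable]],
               community_of: Dict[Hashable, Hashable]) -> float:
    """Newman modularity Q = sum over communities of L_c/m - (d_c / 2m)**2.

    m is the number of edges, L_c the number of edges inside community c,
    and d_c the total degree of the vertices in c.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("modularity is undefined for a graph without edges")
    internal_edges = defaultdict(int)
    total_degree = defaultdict(int)
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        total_degree[cu] += 1
        total_degree[cv] += 1
        if cu == cv:
            internal_edges[cu] += 1
    return sum(internal_edges[c] / m - (total_degree[c] / (2 * m)) ** 2
               for c in total_degree)


# Two triangles joined by a single edge (hypothetical toy netlist graph).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_clusters = {0: "left", 1: "left", 2: "left", 3: "right", 4: "right", 5: "right"}
one_cluster = {v: "all" for v in range(6)}
q_split = modularity(edges, two_clusters)
q_merged = modularity(edges, one_cluster)
```

A modularity-based method searches for the partition that maximizes this score, so the number and size of clusters emerge from the graph rather than from balance constraints supplied by the user.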

SESSION: Algorithms and architectures for emerging applications

An approximation algorithm to the optimal switch control of reconfigurable battery packs

Shih-Yu Chen
Jie-Hong R. Jiang
Shou-Hung Welkin Ling
Shih-Hao Liang
Mao-Cheng Huang

The broad applications of lithium-ion batteries in cyber-physical systems attract intensive research on building energy-efficient battery systems. Reconfigurable battery packs have been proposed to improve reliability and energy efficiency. Despite recent efforts, how to simultaneously maximize battery usage time and minimize switching count during reconfiguration is rarely addressed. In this work, we devise a control algorithm that, under a simplified battery model, achieves the longest usage time under a given constant power-load while the switching count is at most twice above the minimum. It is further generalized for arbitrary power-loads and adjusted for refined battery models. Simulation experiments show promising benefits of the proposed algorithm.

Autonomous vehicle routing in multiple intersections

Sheng-Hao Lin
Tsung-Yi Ho

Advancements in artificial intelligence and the Internet of Things indicate that the realization of commercial autonomous vehicles is almost ready. With autonomous vehicles come new approaches to solving some of the current traffic problems such as fuel consumption, congestion, and high incident rates. Autonomous Intersection Management (AIM) is an example that utilizes the unique attributes of autonomous vehicles to improve the efficiency of a single intersection. However, in a system of interconnected intersections, improving individual intersections alone does not guarantee a system optimum. Therefore, we extend from a single intersection to a grid of intersections and propose a novel vehicle routing method for autonomous vehicles that can effectively reduce the travel time of each vehicle. With dedicated short range communications and the fine-grained control of autonomous vehicles, we are able to apply wire routing algorithms with modified constraints to vehicle routing. Our method intelligently avoids congestion by simulating future traffic, thereby achieving a system optimum.

GRAM: graph processing in a ReRAM-based computational memory

Minxuan Zhou
Mohsen Imani
Saransh Gupta
Yeseong Kim
Tajana Rosing

The performance of graph processing for real-world graphs is limited by inefficient memory behaviours in traditional systems because of random memory access patterns. Offloading computations to the memory is a promising strategy to overcome such challenges. In this paper, we exploit the resistive memory (ReRAM) based processing-in-memory (PIM) technology to accelerate graph applications. The proposed solution, GRAM, can efficiently execute the vertex-centric model, which is widely used in large-scale parallel graph processing programs, in the computational memory. The hardware-software co-design used in GRAM maximizes the computation parallelism while minimizing the number of data movements. Based on our experiments with three important graph kernels on seven real-world graphs, GRAM provides 122.5X and 11.1X speedup compared with an in-memory graph system and optimized multithreading algorithms running on a multi-core CPU. Compared to a GPU-based graph acceleration library and a recently proposed PIM accelerator, GRAM improves the performance by 7.1X and 3.8X, respectively.
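
The vertex-centric programming model referred to above can be sketched in software as a loop of supersteps in which each vertex that received messages updates its own value and sends messages along its outgoing edges. The example below uses single-source shortest paths as the kernel; it is a plain sequential sketch with hypothetical names and a made-up graph, and it says nothing about how GRAM maps this model onto ReRAM.

```python
from math import inf
from typing import Dict, Hashable, List, Tuple

Graph = Dict[Hashable, List[Tuple[Hashable, float]]]


def vertex_centric_sssp(adjacency: Graph, source: Hashable) -> Dict[Hashable, float]:
    """Single-source shortest paths written as vertex-centric supersteps."""
    distance = {vertex: inf for vertex in adjacency}
    inbox = {source: [0.0]}                 # messages delivered this superstep
    while inbox:
        outbox: Dict[Hashable, List[float]] = {}
        for vertex, messages in inbox.items():
            best = min(messages)            # per-vertex compute function
            if best < distance[vertex]:
                distance[vertex] = best
                # Scatter: propose a new distance to every out-neighbor.
                for neighbor, weight in adjacency[vertex]:
                    outbox.setdefault(neighbor, []).append(best + weight)
        inbox = outbox                      # barrier between supersteps
    return distance


# Hypothetical weighted directed graph; every vertex appears as a key.
graph = {
    "a": [("b", 1.0), ("c", 4.0)],
    "b": [("c", 2.0)],
    "c": [("d", 1.0)],
    "d": [],
}
distances = vertex_centric_sssp(graph, "a")
```

Within one superstep the per-vertex updates are independent of each other, which is the parallelism a vertex-centric accelerator can exploit; the scatter phase is where the random memory accesses mentioned in the abstract arise.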

ADEPOS: anomaly detection based power saving for predictive maintenance using edge computing

Sumon Kumar Bose
Bapi Kar
Mohendra Roy
Pradeep Kumar Gopalakrishnan
Arindam Basu

In Industry 4.0, predictive maintenance (PdM) is one of the most important applications pertaining to the Internet of Things (IoT). Machine learning is used to predict the possible failure of a machine before the actual event occurs. However, the main challenges in PdM are: (a) lack of enough data from failing machines, and (b) paucity of power and bandwidth to transmit sensor data to the cloud throughout the lifetime of the machine. Alternatively, edge computing approaches reduce data transmission and consume little energy. In this paper, we propose an Anomaly Detection based Power Saving (ADEPOS) scheme using approximate computing through the lifetime of the machine. At the beginning of the machine's life, low-accuracy computations are used while the machine is healthy. However, on detection of anomalies as time progresses, the system is switched to higher-accuracy modes. We show using the NASA bearing dataset that with ADEPOS we need 8.8X fewer neurons on average and, based on post-layout results, the resultant energy savings are 6.4-6.65X.

SESSION: Embedded software for parallel architecture

Efficient sporadic task handling in parallel AUTOSAR applications using runnable migration

Milan Copic
Rainer Leupers
Gerd Ascheid

Automotive software has become immensely complex. To manage this complexity, a safety-critical application is commonly written respecting the AUTOSAR standard and deployed on a multi-core ECU. However, parallelization of an AUTOSAR task is hindered by data dependencies between runnables, the smallest code-fragments executed by the run-time system. Consequently, a substantial number of idle intervals is introduced. We propose to utilize such intervals in sporadic tasks by migrating runnables that were originally scheduled to execute in the scope of periodic tasks.

A heuristic for multi objective software application mappings on heterogeneous MPSoCs

Gereon Onnebrink
Ahmed Hallawa
Rainer Leupers
Gerd Ascheid
Awaid-Ud-Din Shaheen

Efficient development of parallel software is one of the biggest hurdles to exploiting the advantages of heterogeneous multi-core architectures. Fast and accurate compiler technology is required for determining the trade-off between multiple objectives, such as power and performance. To tackle this problem, the paper at hand proposes the novel heuristic TONPET. Furthermore, it is integrated into the SLX tool suite for a detailed evaluation and an applicability study. TONPET is tested against representative benchmarks on three different platforms and compared to a state-of-the-art Evolutionary Multi Objective Algorithm (EMOA). On average, TONPET produces 6% better Pareto fronts, while being 18X faster in the worst case.

ReRAM-based processing-in-memory architecture for blockchain platforms

Fang Wang
Zhaoyan Shen
Lei Han
Zili Shao

Blockchain's decentralized consensus mechanism has attracted many applications, such as IoT devices. Blockchain maintains a linked list of blocks and grows by mining new blocks. However, Blockchain mining consumes huge computation resources and energy, which is unacceptable for resource-limited embedded devices. This paper for the first time presents a ReRAM-based processing-in-memory architecture for Blockchain mining, called Re-Mining. Re-Mining includes a message schedule module and a SHA computation module. The modules are composed of several basic ReRAM-based logic operation units, such as ROR, RSF and XOR. Re-Mining further designs intra-transaction and inter-transaction parallel mechanisms to accelerate Blockchain mining. Simulation results show that the proposed Re-Mining architecture outperforms CPU-based and GPU-based implementations significantly.
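
To show why rotate, shift, and XOR units suffice for a message schedule module, the sketch below expands one 512-bit block into 64 words following the SHA-256 message schedule. Two assumptions are made here that the abstract does not state: that the SHA variant is SHA-256 (as used in Bitcoin-style mining), and that RSF denotes a logical right shift. It is a software reference only, not the Re-Mining design.

```python
from typing import List

MASK32 = 0xFFFFFFFF  # SHA-256 operates on 32-bit words


def ror(x: int, n: int) -> int:
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK32


def shr(x: int, n: int) -> int:
    """Logical right shift of a 32-bit word."""
    return x >> n


def small_sigma0(x: int) -> int:
    return ror(x, 7) ^ ror(x, 18) ^ shr(x, 3)


def small_sigma1(x: int) -> int:
    return ror(x, 17) ^ ror(x, 19) ^ shr(x, 10)


def message_schedule(block_words: List[int]) -> List[int]:
    """Expand 16 input words into the 64-word SHA-256 message schedule."""
    if len(block_words) != 16:
        raise ValueError("a SHA-256 block is exactly sixteen 32-bit words")
    w = [word & MASK32 for word in block_words]
    for t in range(16, 64):
        w.append((small_sigma1(w[t - 2]) + w[t - 7]
                  + small_sigma0(w[t - 15]) + w[t - 16]) & MASK32)
    return w


# Padded block for the empty message: a single 1 bit, then zeros.
padded_empty_message = [0x80000000] + [0] * 15
schedule = message_schedule(padded_empty_message)
```

Apart from the 32-bit additions, every operation in the expansion is a rotate, a shift, or an XOR, which matches the basic logic units the abstract lists.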

SESSION: Machine learning and hardware security

Towards practical homomorphic email filtering: a hardware-accelerated secure naïve bayesian filter

Song Bian
Masayuki Hiromoto
Takashi Sato

A secure version of the naïve Bayesian filter (NBF), referred to as SNBF, is proposed utilizing a partially homomorphic encryption (PHE) scheme. SNBF can be implemented with only the additive homomorphism from the Paillier system, and we derive new techniques to reduce the computational cost of PHE-based SNBF. In the experiment, we implemented SNBF both in software and hardware. Compared to the best existing PHE scheme, we achieved 1,200x (resp., 398,840x) runtime reduction in the CPU (resp., ASIC) implementations, with an additional 1,919x power reduction on the designated hardware multiplier. Our hardware implementation is able to classify an average-length email in 0.5 s, making it one of the most practical NBF schemes to date.
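
The additive homomorphism of the Paillier system that the abstract relies on can be demonstrated with a toy implementation: multiplying two ciphertexts yields an encryption of the sum of the plaintexts. The sketch uses tiny hard-coded primes and fixed randomness so that it is reproducible, which makes it completely insecure; the key sizes, function names, and sample values are illustrative assumptions and not taken from the paper. It requires Python 3.8 or later for the modular inverse via pow.

```python
import math
from typing import Tuple

PublicKey = Tuple[int, int]    # (n, g)
PrivateKey = Tuple[int, int]   # (lambda, mu)


def _l(u: int, n: int) -> int:
    """The Paillier L function: L(u) = (u - 1) / n."""
    return (u - 1) // n


def keygen(p: int, q: int) -> Tuple[PublicKey, PrivateKey]:
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)
    g = n + 1                                           # common simple choice
    mu = pow(_l(pow(g, lam, n * n), n), -1, n)          # modular inverse
    return (n, g), (lam, mu)


def encrypt(public: PublicKey, message: int, r: int) -> int:
    """Encrypt message (0 <= message < n) with randomness r coprime to n."""
    n, g = public
    n_squared = n * n
    return (pow(g, message, n_squared) * pow(r, n, n_squared)) % n_squared


def decrypt(public: PublicKey, private: PrivateKey, ciphertext: int) -> int:
    n, _ = public
    lam, mu = private
    return (_l(pow(ciphertext, lam, n * n), n) * mu) % n


def add_encrypted(public: PublicKey, c1: int, c2: int) -> int:
    """Homomorphic addition: the product of ciphertexts encrypts the sum."""
    n, _ = public
    return (c1 * c2) % (n * n)


public_key, private_key = keygen(61, 53)          # toy primes, NOT secure
c1 = encrypt(public_key, 42, r=17)
c2 = encrypt(public_key, 58, r=23)
decrypted_sum = decrypt(public_key, private_key, add_encrypted(public_key, c1, c2))
```

A naïve Bayesian score is typically accumulated as a sum of per-word log-likelihood terms, so an additively homomorphic scheme is enough to accumulate such a score over encrypted values without decrypting them.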

A 0.16pJ/bit recurrent neural network based PUF for enhanced machine learning attack resistance

Nimesh Shah
Manaar Alam
Durga Prasad Sahoo
Debdeep Mukhopadhyay
Arindam Basu

Physically Unclonable Function (PUF) circuits are finding wide-spread use due to increasing adoption of IoT devices. However, the existing strong PUFs such as Arbiter PUFs (APUF) and its compositions are susceptible to machine learning (ML) attacks because the challenge-response pairs have a linear relationship. In this paper, we present a Recurrent-Neural-Network PUF (RNN-PUF) which uses a combination of feedback and XOR function to significantly improve resistance to ML attack, without significant reduction in the reliability. ML attack is also partly reduced by using a shared comparator with offset-cancellation to remove bias and save power. From simulation results, we obtain ML attack accuracy of 62% for different ML algorithms, while reliability stays above 93%. This represents a 33.5% improvement in our Figure-of-Merit. Power consumption is estimated to be 12.3μW with energy/bit of ≈ 0.16pJ.

P3M: a PIM-based neural network model protection scheme for deep learning accelerator

Wen Li
Ying Wang
Huawei Li
Xiaowei Li

This work targets the edge computing scenario in which terminal deep learning accelerators use pre-trained neural network models distributed from third-party providers (e.g. from data center clouds) to process private data instead of sending it to the cloud. In this scenario, the network model is exposed to the risk of being attacked in unverified devices if the parameters and hyper-parameters are transmitted and processed in an unencrypted way. Our work tackles this security problem by using on-chip memory Physical Unclonable Functions (PUFs) and Processing-In-Memory (PIM). We allow the model execution only on authorized devices and protect the model from white-box attacks, black-box attacks and model tampering attacks. The proposed PUFs-and-PIM based Protection method for neural Models (P3M) can utilize unstable PUFs to protect the neural models in edge deep learning accelerators with negligible performance overhead. The experimental results show considerable performance improvement over two state-of-the-art solutions we evaluated.

SESSION: Memory architecture for efficient neural network computing

Learning the sparsity for ReRAM: mapping and pruning sparse neural network for ReRAM based accelerator

Jilan Lin
Zhenhua Zhu
Yu Wang
Yuan Xie

With its in-memory processing ability, ReRAM based computing is becoming more and more attractive for accelerating neural networks (NNs). However, most ReRAM based accelerators cannot support efficient mapping for sparse NNs, and the whole dense matrix must be mapped onto the ReRAM crossbar array to achieve O(1) computation complexity. In this paper, we propose a sparse NN mapping scheme based on element clustering to achieve better ReRAM crossbar utilization. Further, we propose a crossbar-grained pruning algorithm to remove the crossbars with low utilization. Finally, since most current ReRAM devices cannot achieve high precision, we analyze the effect of quantization precision for sparse NNs, and propose to complete high-precision composing in the analog field and design related periphery circuits. In our experiments, we discuss how the system performs with different crossbar sizes to choose the optimized design. Our results show that our mapping scheme for sparse NNs with the proposed pruning algorithm achieves 3-5X energy efficiency and more than 2.5-6X speedup, compared with those accelerators for dense NNs. Also, the accuracy experiments show that our pruning method appears to have almost no accuracy loss.

In-memory batch-normalization for resistive memory based binary neural network hardware

Hyungjun Kim
Yulhwa Kim
Jae-Joon Kim

Binary Neural Network (BNN) has a great potential to be implemented on Resistive memory Crossbar Array (RCA)-based hardware accelerators because it requires only 1-bit precision for weights and activations. While general structures to implement convolution or fully-connected layers in RCA-based BNN hardware were actively studied in previous works, Batch-Normalization (BN) layer, which is another key layer of BNN, has not been discussed in depth yet. In this work, we propose in-memory batch-normalization schemes which integrate BN layers on RCA so that area/energy-efficiency of the BNN accelerators can be maximized. In addition, we also show that sense amp error due to device mismatch can be suppressed using the proposed in-memory BN design.

XOMA: exclusive on-chip memory architecture for energy-efficient deep learning acceleration

Hyeonuk Sim
Jason H. Anderson
Jongeun Lee

State-of-the-art deep neural networks (DNNs) require hundreds of millions of multiply-accumulate (MAC) computations to perform inference, e.g. in image-recognition tasks. To improve the performance and energy efficiency, deep learning accelerators have been proposed, realized both on FPGAs and as custom ASICs. Generally, such accelerators comprise many parallel processing elements, capable of executing large numbers of concurrent MAC operations. From the energy perspective, however, most consumption arises due to memory accesses, both to off-chip external memory, and on-chip buffers. In this paper, we propose an on-chip DNN co-processor architecture where minimizing memory accesses is the primary design objective. To the maximum possible extent, off-chip memory accesses are eliminated, providing lowest-possible energy consumption for inference. Compared to a state-of-the-art ASIC, our architecture requires 36% fewer external memory accesses and 53% less energy consumption for low-latency image classification.

SESSION: Logic-level security and synthesis

BeSAT: behavioral SAT-based attack on cyclic logic encryption

Yuanqi Shen
You Li
Amin Rezaei
Shuyu Kong
David Dlott
Hai Zhou

Cyclic logic encryption is newly proposed in the area of hardware security. It introduces feedback cycles into the circuit to defeat existing logic decryption techniques. To ensure that the circuit is acyclic under the correct key, CycSAT is developed to add the acyclic condition as a CNF formula to the SAT-based attack. However, we found that it is impossible to capture all cycles in any graph with any set of feedback signals as done in the CycSAT algorithm. In this paper, we propose a behavioral SAT-based attack called BeSAT. BeSAT observes the behavior of the encrypted circuit on top of the structural analysis, so the stateful and oscillatory keys missed by CycSAT can still be blocked. The experimental results show that BeSAT successfully overcomes the drawback of CycSAT.

Structural rewriting in XOR-majority graphs

Zhufei Chu
Mathias Soeken
Yinshui Xia
Lunyao Wang
Giovanni De Micheli

In this paper, we present a structural rewriting method for a recently proposed XOR-Majority graph (XMG), which has exclusive-OR (XOR), majority-of-three (MAJ), and inverters as primitives. XMGs are an extension of Majority-Inverter Graphs (MIGs). Previous work presented an axiomatic system, Ω, and its derived transformation rules for manipulation of MIGs. By additionally introducing the XOR primitive, the identities of MAJ-XOR operations should be exploited to enable powerful logic rewriting in XMGs. We first propose two MAJ-XOR identities and exploit their potential optimization opportunities during structural rewriting. Then, we discuss the rewriting rules that can be used for different operations. Finally, we also address the structural XOR detection problem in MIGs. The experimental results on EPFL benchmark suites show that the proposed method can optimize the size/depth product of XMGs and their mapped look-up tables (LUTs), which in turn benefits quantum circuit synthesis that uses XMGs as the underlying logic representation.
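
To make the XMG primitives concrete, the Python sketch below checks a few textbook identities relating MAJ, XOR, and inverters by exhaustive truth-table enumeration. These are standard Boolean facts chosen for illustration. They are not claimed to be the two MAJ-XOR identities proposed in the paper, and the helper names (maj, inv, holds) are hypothetical.

```python
from itertools import product


def maj(a, b, c):
    """Majority-of-three on 0/1 values."""
    return (a & b) | (a & c) | (b & c)


def inv(a):
    """Inverter on a 0/1 value."""
    return a ^ 1


def holds(lhs, rhs, arity):
    """True if lhs and rhs agree on every 0/1 input assignment."""
    return all(lhs(*v) == rhs(*v) for v in product((0, 1), repeat=arity))


# Majority expressed with XOR and AND: MAJ(a,b,c) = ab ^ ac ^ bc.
maj_as_xor = holds(maj, lambda a, b, c: (a & b) ^ (a & c) ^ (b & c), 3)

# Self-duality of majority: MAJ(!a,!b,!c) = !MAJ(a,b,c).
maj_self_dual = holds(lambda a, b, c: maj(inv(a), inv(b), inv(c)),
                      lambda a, b, c: inv(maj(a, b, c)), 3)

# Inverter propagation through XOR: !a ^ b = !(a ^ b).
xor_inv_prop = holds(lambda a, b: inv(a) ^ b, lambda a, b: inv(a ^ b), 2)
```

A rewriting engine would apply identities like these as local graph transformations. Exhaustive checking is only practical for the small arities shown here.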

Design automation for adiabatic circuits

Alwin Zulehner
Michael P. Frank
Robert Wille

Adiabatic circuits are heavily investigated since they allow for computations with an asymptotically close to zero energy dissipation per operation---serving as an alternative technology for many scenarios where energy efficiency is preferred over fast execution. Their concepts are motivated by the fact that the information lost from conventional circuits results in an entropy increase which causes energy dissipation. To overcome this issue, computations are performed in a (conditionally) reversible fashion which, additionally, have to satisfy switching rules that are different from conventional circuitry---crying out for dedicated design automation solutions. While previous approaches either focus on their electrical realization (resulting in small, hand-crafted circuits only) or on designing fully reversible building blocks (an unnecessary overhead), this work aims for providing an automatic and dedicated design scheme that explicitly takes the recent findings in this domain into account. To this end, we review the theoretical and technical background of adiabatic circuits and present automated methods that dedicatedly realize the desired function as an adiabatic circuit. The resulting methods are further optimized---leading to an automatic and efficient design automation for this promising technology. Evaluations confirm the benefits and applicability of the proposed solution.

SESSION: Analysis and algorithms for digital design verification

A figure of merit for assertions in verification

Samuel Hertz
Debjit Pal
Spencer Offenberger
Shobha Vasudevan

Assertion quality is critical to the confidence and claims in a design's verification. In current practice, there is no metric to evaluate assertions. We introduce a methodology to rank register transfer level (RTL) assertions. We define assertion importance and assertion complexity and present efficient algorithms to compute them. Our method ranks each assertion according to its importance and complexity. We demonstrate the effectiveness of our ranking for pre-silicon verification on a detailed case study. For completeness, we study the relevance of our highly ranked assertions in a post-silicon validation context, using traced and restored signal values from the design's netlist.

Suspect2vec: a suspect prediction model for directed RTL debugging

Neil Veira
Zissis Poulos
Andreas Veneris

Automated debugging tools based on Boolean Satisfiability (SAT) have greatly alleviated the time and effort required to diagnose and rectify a failing design. Practical experience shows that long-running debugging instances can often be resolved faster using partial results that are available before the SAT solver completes its search. In such cases it is preferable for the tool to maximize the number of suspects it returns during the early stages of its deployment. To capitalize on this observation, this paper proposes a directed SAT-based debugging algorithm which prioritizes examining design locations that are more likely to be suspects. This prioritization is determined by suspect2vec --- a model which learns from historical debug data to predict the suspect locations that will be found. Experiments show that this algorithm is expected to find 16% more suspects than the baseline algorithm if terminated prematurely, while still retaining the ability to find all suspects if executed to completion. Key to its performance and a contribution of this work is the accuracy of the suspect prediction model. This is because incorrect predictions introduce overhead in exploring parts of the search space where few or no solutions exist. Suspect2vec is experimentally demonstrated to outperform existing suspect prediction methods by an average accuracy of 5--20%.

Path controllability analysis for high quality designs

Li-Jie Chen
Hong-Zu Chou
Kai-Hui Chang
Sy-Yen Kuo
Chi-Lai Huang

Given a design variable and its fanin cone, determining whether one fanin variable has controlling power over other fanin variables can benefit many design steps such as verification, synthesis and test generation. In this work we formulate this path controllability problem and propose several algorithms that not only solve this problem but also return values that enable or block other fanin variables. Empirical results show that our algorithms can effectively perform path controllability analysis and help produce high-quality designs.
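
One plausible reading of "controlling power" is that fixing a fanin variable to some value makes the design variable independent of every other fanin. The brute-force Python sketch below illustrates that reading on a tiny function. It is an assumption made for illustration, not the paper's formulation or algorithms, and the names blocking_values and example are hypothetical.

```python
from itertools import product


def blocking_values(f, names, var):
    """Return the values of `var` that make f constant over all other fanins.

    f takes keyword arguments named after `names` and returns a truthy value.
    The search is exhaustive, so it only suits small fanin cones.
    """
    others = [n for n in names if n != var]
    result = []
    for value in (False, True):
        outputs = set()
        for bits in product((False, True), repeat=len(others)):
            assignment = dict(zip(others, bits))
            assignment[var] = value
            outputs.add(bool(f(**assignment)))
        if len(outputs) == 1:  # output no longer depends on the others
            result.append(value)
    return result


def example(en, a, b):
    """Toy fanin cone: an enable gating an OR of two inputs."""
    return en and (a or b)
```

For the toy cone, blocking_values(example, ["en", "a", "b"], "en") returns [False]: driving en low blocks both a and b, while neither a nor b alone can block en.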

SESSION: FPGA and optics-based neural network designs

Implementing neural machine translation with bi-directional GRU and attention mechanism on FPGAs using HLS

Qin Li
Xiaofan Zhang
JinJun Xiong
Wen-mei Hwu
Deming Chen

Neural machine translation (NMT) is a popular topic in Natural Language Processing which uses deep neural networks (DNNs) for translation from source to targeted languages. With the emerging technologies, such as bidirectional Gated Recurrent Units (GRU), attention mechanisms, and beam-search algorithms, NMT can deliver improved translation quality compared to the conventional statistics-based methods, especially for translating long sentences. However, higher translation quality means more complicated models, higher computation/memory demands, and longer translation time, which causes difficulties for practical use. In this paper, we propose a design methodology for implementing the inference of a real-life NMT (with the problem size = 172 GFLOP) on FPGA for improved run time latency and energy efficiency. We use High-Level Synthesis (HLS) to build high-performance parameterized IPs for handling the most basic operations (multiply-accumulations) and construct these IPs to accelerate the matrix-vector multiplication (MVM) kernels, which are frequently used in NMT. Also, we perform a design space exploration by considering both computation resources and memory access bandwidth when utilizing the hardware parallelism in the model and generate the best parameter configurations of the proposed IPs. Accordingly, we propose a novel hybrid parallel structure for accelerating the NMT with affordable resource overhead for the targeted FPGA. Our design is demonstrated on a Xilinx VCU118 with overall performance at 7.16 GFLOPS.

Efficient FPGA implementation of local binary convolutional neural network

Aidyn Zhakatayev
Jongeun Lee

Binarized Neural Networks (BNNs) have shown a capability of performing various classification tasks while taking advantage of computational simplicity and memory saving. The problem with BNNs, however, is low accuracy on large convolutional neural networks (CNNs). The Local Binary Convolutional Neural Network (LBCNN) compensates for the accuracy loss of BNNs by using a standard convolutional layer together with a binary convolutional layer, and can achieve accuracy as high as the standard AlexNet CNN. For the first time, we propose an FPGA hardware design architecture for LBCNN and address its unique challenges. We present a performance and resource usage predictor along with a design space exploration framework. Our architecture on LBCNN AlexNet shows 76.6% higher performance in terms of GOPS, and 2.6X and 2.7X higher performance density in terms of GOPS/Slice and GOPS/DSP, respectively, compared to a previous FPGA implementation of the standard AlexNet CNN.

Hardware-software co-design of slimmed optical neural networks

Zheng Zhao
Derong Liu
Meng Li
Zhoufeng Ying
Lu Zhang
Biying Xu
Bei Yu
Ray T. Chen
David Z. Pan

An optical neural network (ONN) is neuromorphic computing hardware based on optical components. Since its first on-chip experimental demonstration, it has attracted more and more research interest due to the advantages of ultra-high speed inference with low power consumption. In this work, we design a novel slimmed architecture for realizing optical neural networks considering both its software and hardware implementations. Different from the originally proposed ONN architecture based on singular value decomposition, which results in two implementation-expensive unitary matrices, we show a more area-efficient architecture which uses a sparse tree network block, a single unitary block and a diagonal block for each neural network layer. In the experiments, we demonstrate that by leveraging the training engine, we are able to find a comparable accuracy to that of the previous architecture, which brings about the flexibility of using the slimmed implementation. The area cost in terms of the Mach-Zehnder interferometers, the core optical components of ONN, is 15%-38% less for various sizes of optical neural networks.

SESSION: The resurgence of reconfigurable computing in the post-Moore era

Software defined architectures for data analytics

Vito Giovanni Castellana
Marco Minutoli
Antonino Tumeo
Marco Lattuada
Pietro Fezzardi
Fabrizio Ferrandi

Data analytics applications increasingly are complex workflows composed of phases with very different program behaviors (e.g., graph algorithms and machine learning, algorithms operating on sparse and dense data structures, etc). To reach the levels of efficiency required to process these workflows in real time, upcoming architectures will need to leverage even more workload specialization. If, at one end, we may find even more heterogeneous processors composed of a myriad of specialized processing elements, at the other end we may see novel reconfigurable architectures, composed of sets of functional units and memories interconnected with (re)configurable on-chip networks, able to adapt dynamically to the workload characteristics. Field Programmable Gate Arrays are more and more used for accelerating various workloads and, in particular, inferencing in machine learning, providing higher efficiency than other solutions. However, their fine-grained nature still leads to issues for the design software and still makes dynamic reconfiguration impractical. Future, more coarse-grained architectures could offer the features to execute diverse workloads at high efficiency while providing better reconfiguration mechanisms for dynamic adaptability. Nevertheless, we argue that the challenges for reconfigurable computing remain in the software. In this position paper, we describe a possible toolchain for reconfigurable architectures targeted at data analytics.

Runtime reconfigurable memory hierarchy in embedded scalable platforms

Davide Giri
Paolo Mantovani
Luca P. Carloni

In heterogeneous systems-on-chip, the optimal choice of the cache-coherence model for a loosely-coupled accelerator may vary at each invocation, depending on workload and system status. We propose a runtime adaptive algorithm to manage the coherence of accelerators. The algorithm's choices are based on the combination of static and dynamic features of the active accelerators and their workloads. We evaluate the algorithm by leveraging our FPGA-based platform for rapid SoC prototyping. Experimental results, obtained through the deployment of a multi-core and multi-accelerator system that runs Linux SMP, show the benefits of our approach in terms of execution time and memory accesses.

XPPE: cross-platform performance estimation of hardware accelerators using machine learning

Hosein Mohammadi Makrani
Hossein Sayadi
Tinoosh Mohsenin
Setareh Rafatirad
Avesta Sasan
Houman Homayoun

The increasing heterogeneity in the applications to be processed means that ASICs are no longer the most efficient processing platform. Hybrid processing platforms such as CPU+FPGA are emerging as powerful processing platforms to support efficient processing for a diverse range of applications. Hardware/Software co-design enabled designers to take advantage of these new hybrid platforms such as Zynq. However, dividing an application into two parts, where one part runs on the CPU and the other part is converted to a hardware accelerator implemented on the FPGA, makes platform selection difficult for developers, as there is a significant variation in the application's performance achieved on different platforms. Developers are required to fully implement the design on each platform to have an estimation of the performance. This process is tedious when the number of available platforms is large. To address this challenge, in this work we propose XPPE, a neural network based cross-platform performance estimation. XPPE utilizes the resource utilization of an application on a specific FPGA to estimate the performance on other FPGAs. The proposed estimation is performed for a wide range of applications and evaluated against a vast set of platforms. Moreover, XPPE enables developers to explore the design space without having to fully implement and map the application. Our evaluation results show that the correlation between the estimated speedup using XPPE and the actual speedup of applications on a hybrid platform over an ARM processor is more than 0.98.

SESSION: Hardware acceleration

Addressing the issue of processing element under-utilization in general-purpose systolic deep learning accelerators

Bosheng Liu
Xiaoming Chen
Ying Wang
Yinhe Han
Jiajun Li
Haobo Xu
Xiaowei Li

As an energy-efficient hardware solution for deep neural network (DNN) inference, systolic accelerators are particularly popular in both embedded and datacenter computing scenarios. Despite their excellent performance and energy efficiency, however, systolic DNN accelerators naturally face a resource under-utilization problem - not all DNN models can well match the fixed processing elements (PEs) in a systolic array implementation, because typical DNN models vary significantly from application to application. Consequently, state-of-the-art hardware solutions are not expected to deliver the nominal (peak) performance and energy efficiency as claimed because of resource under-utilization. To deal with this dilemma, this study proposes a novel systolic DNN accelerator with a flexible computation mapping and dataflow scheme. By providing three types of parallelism (channel-direction mapping, planar mapping, and hybrid) and dynamically switching among them, our accelerator offers the adaptability to match various DNN models to the fixed hardware resources, and thus enables flexibly exploiting PE provision and data reuse for a wide range of DNN models to achieve optimal performance and energy efficiency.

ALook: adaptive lookup for GPGPU acceleration

Daniel Peroni
Mohsen Imani
Tajana Rosing

Associative memory in the form of look-up tables can decrease the energy consumption of GPGPU applications by exploiting data locality and reducing the number of redundant computations. State-of-the-art architectures utilize associative memory as static look-up tables. Static designs lack the ability to adapt to applications at runtime, limiting them to small segments of code with high redundancy. In this paper, we propose an adaptive look-up based approach, called ALook, which uses a dynamic update policy to maintain a set of recently used operations in associative memory. ALook updates with values computed by floating point units at runtime to adapt to the workload and matches the stored results to avoid recomputing similar operations. ALook utilizes a novel FPU architecture which accelerates GPU computation by parallelizing the operation lookup process. We test the efficiency of ALook on image processing, general purpose, and machine learning applications by integrating it beside FPUs in an AMD Southern Island GPU. Our evaluation shows that ALook provides a 3.6X improvement in EDP (Energy Delay Product) and a 32.8% performance speedup, compared to an unmodified GPU, for applications accepting less than 5% output error. The proposed ALook architecture improves the GPU performance by 2.0X as compared to state-of-the-art computational reuse methods for the same level of output error.
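
The idea of a dynamically updated table of recent operations can be sketched in software. The Python sketch below is a loose analogy, not the ALook hardware: the abstract does not specify the update policy or the matching rule, so the least-recently-used eviction and the mantissa-truncation matching used here are assumptions, and the names truncate_f32 and LookupFPU are hypothetical.

```python
import struct
from collections import OrderedDict


def truncate_f32(x, drop_bits):
    """Return x's float32 bit pattern with the lowest mantissa bits zeroed."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    mask = 0xFFFFFFFF & ~((1 << drop_bits) - 1)
    return bits & mask


class LookupFPU:
    """Wraps a two-operand operation with a small table of recent results."""

    def __init__(self, op, capacity=8, drop_bits=12):
        self.op = op
        self.capacity = capacity
        self.drop_bits = drop_bits
        self.table = OrderedDict()  # key order tracks recency of use
        self.hits = 0
        self.misses = 0

    def compute(self, a, b):
        key = (truncate_f32(a, self.drop_bits), truncate_f32(b, self.drop_bits))
        if key in self.table:
            # Approximate match: reuse the stored result for similar operands.
            self.hits += 1
            self.table.move_to_end(key)
            return self.table[key]
        self.misses += 1
        result = self.op(a, b)
        self.table[key] = result
        if len(self.table) > self.capacity:
            self.table.popitem(last=False)  # evict the least recently used entry
        return result
```

Raising drop_bits makes more operand pairs share a table entry, trading output error for a higher hit rate, which mirrors the error-tolerance framing in the abstract.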

Collaborative accelerators for in-memory MapReduce on scale-up machines

Abraham Addisie
Valeria Bertacco

Relying on efficient data analytics platforms is increasingly becoming crucial for both small and large scale datasets. While MapReduce implementations, such as Hadoop and Spark, were originally proposed for petascale processing in scale-out clusters, it has been noted that, today, most data center processes operate on gigabyte-order or smaller datasets, which are best processed in single high-end scale-up machines. In this context, Phoenix++ is a highly optimized MapReduce framework available for chip-multiprocessor (CMP) scale-up machines. In this paper we observe that Phoenix++ suffers from an inefficient utilization of the memory subsystem, and a serialized execution of the MapReduce stages. To overcome these inefficiencies, we propose CASM, an architecture that equips each core in a CMP design with a dedicated instance of a specialized hardware unit (the CASM accelerators). These units collaborate to manage the key-value data structure and minimize both on- and off-chip communication costs. Our experimental evaluation on a 64-core design indicates that CASM provides more than a 4x speedup over the highly optimized Phoenix++ framework, while keeping area overhead at only 6%, and reducing energy demands by over 3.5x.

SESSION: Routing

Detailed routing by sparse grid graph and minimum-area-captured path search

Gengjie Chen
Chak-Wa Pui
Haocheng Li
Jingsong Chen
Bentian Jiang
Evangeline F. Y. Young

Different from global routing, detailed routing takes care of many detailed design rules and is performed on a significantly larger routing grid graph. In advanced technology nodes, it becomes the most complicated and time-consuming stage. We propose Dr. CU, an efficient and effective detailed router, to tackle the challenges. To handle a 3D detailed routing grid graph of enormous size, a set of two-level sparse data structures is designed for runtime and memory efficiency. For handling the minimum-area constraint, an optimal correct-by-construction path search algorithm is proposed. Besides, an efficient bulk synchronous parallel scheme is adopted to further reduce the runtime. Compared with the first-place entry of the ISPD 2018 Contest, our router improves the routing quality by up to 65% and on average 39%, according to the contest metric. At the same time, it achieves 80--93% memory reduction, and 2.5--15X speed-up.

Latency constraint guided buffer sizing and layer assignment for clock trees with useful skew

Necati Uysal
Wen-Hao Liu
Rickard Ewetz

Closing timing using clock tree optimization (CTO) is a tremendously challenging problem that may require designer intervention. CTO is performed by specifying and realizing delay adjustments in an initially constructed clock tree. Delay adjustments are typically realized by inserting delay buffers or detour wires. In this paper, we propose a latency constraint guided buffer sizing and layer assignment framework for clock trees with useful skew, called the BLU framework. The BLU framework realizes delay adjustments during CTO by performing buffer sizing and layer assignment. Given an initial clock tree, the BLU framework first predicts the final timing quality and specifies a set of delay adjustments, which are translated into latency constraints. Next, buffer sizing and layer assignment are performed with respect to the latency constraints using an extension of van Ginneken's algorithm. Moreover, the framework includes a feature for reducing the power consumption by relaxing the latency constraints and a method for improving the timing performance by tightening the latency constraints. The experimental results demonstrate that the proposed framework is capable of reducing the capacitive cost by 13% on average. The total negative slack (TNS) and worst negative slack (WNS) are reduced by up to 58% and 20%, respectively.
