High temperature and temperature non-uniformity have become serious threats to the performance and reliability of high-performance integrated circuits (ICs), making thermal effects a non-negligible issue in circuit and physical design. To estimate temperature accurately, the locations of modules must be determined, which makes efficient and effective thermal-aware floorplanning play an increasingly important role. To address this problem, this paper proposes a differential nonlinear model that approximates temperature and minimizes wirelength at the same time during floorplanning. We also apply techniques such as thermal-aware clustering and shrinking hot modules within a multi-level framework to further reduce temperature without inducing longer wirelength. The experimental results demonstrate that our method greatly improves both temperature and wirelength compared with other works. More importantly, our runtime is very fast and the fixed-outline constraint is also satisfied.
Poisson's equation has been used in VLSI global placement to describe the potential field induced by a given charge density distribution. Unlike previous global placement methods that solve Poisson's equation numerically, in this paper we provide an analytical solution of the equation to calculate the potential energy of an electrostatic system. The analytical solution is derived using the separation of variables method and an exact density function that models the block distribution in the placement region; the resulting solution is an infinite series that converges absolutely. Using the analytical solution, we give a fast computation scheme for Poisson's equation and develop an effective and efficient global placement algorithm called Pplace. Experimental results show that Pplace achieves smaller placement wirelength than ePlace and NTUplace3, two leading wirelength-driven placers. Given the pervasive applications of Poisson's equation in scientific fields, our effective, efficient, and robust computation scheme for its analytical solution can also have substantial impact on those fields.
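As a pointer to the form such a solution takes (an illustrative textbook derivation, not necessarily the exact density function or boundary conditions used in Pplace): for a rectangular placement region $[0,a]\times[0,b]$ with Neumann boundary conditions, Poisson's equation $\nabla^2\psi(x,y) = -\rho(x,y)$ admits a separation-of-variables solution in a cosine basis. Writing the density as $\rho(x,y)=\sum_{m,n}\rho_{mn}\cos\frac{m\pi x}{a}\cos\frac{n\pi y}{b}$, the potential is the series
$$\psi(x,y)=\sum_{\substack{m,n\ge 0\\ (m,n)\ne(0,0)}}\frac{\rho_{mn}}{\left(\frac{m\pi}{a}\right)^2+\left(\frac{n\pi}{b}\right)^2}\cos\frac{m\pi x}{a}\cos\frac{n\pi y}{b},$$
obtained by substituting the series into the equation and matching coefficients term by term.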
Fogging and proximity effects are two major factors that cause inaccurate exposure, and thus layout pattern distortions, in e-beam lithography. In this paper, we propose the first analytical placement algorithm that considers both the fogging and proximity effects. We first formulate the global placement problem as a separable minimization problem with linear constraints, where the different objectives can be tackled one by one in an alternating fashion. We then propose a novel proximal group alternating direction method of multipliers (ADMM) to solve the separable minimization problem through two subproblems: the first subproblem (mainly associated with wirelength and density) is solved by a steepest descent method without line search, and the second (mainly associated with the fogging and proximity effects) is handled by an analytical scheme. We prove the global convergence of the proximal group ADMM method. Finally, legalization and detailed placement are used to legalize and further improve the placement result. Experimental results show that our algorithm is effective and efficient for the addressed problem. Compared with the state-of-the-art work, our algorithm not only achieves 13.4% smaller fogging variation and 21.4% lower proximity variation, but also runs 1.65X faster.
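For readers unfamiliar with the alternating structure described above, the following minimal NumPy sketch shows a generic two-block ADMM on a toy separable problem, with quadratic stand-ins for the wirelength/density and fogging/proximity terms; the variable names and closed-form updates are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

# Toy separable problem: minimize f(x) + g(z) subject to x - z = 0,
# with f(x) = 0.5*||x - p||^2   (stand-in for wirelength + density)
# and  g(z) = 0.5*lam*||z - q||^2 (stand-in for fogging/proximity terms).
rng = np.random.default_rng(1)
n, lam, rho = 100, 2.0, 1.0
p, q = rng.normal(size=n), rng.normal(size=n)

x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)   # u: scaled dual variable
for k in range(200):
    # x-update: one fixed-size gradient step (no line search) on the
    # augmented Lagrangian, mirroring the first subproblem above.
    grad = (x - p) + rho * (x - z + u)
    x = x - grad / (1.0 + rho)
    # z-update: closed-form proximal step for the quadratic g.
    z = (lam * q + rho * (x + u)) / (lam + rho)
    # Dual update enforcing the coupling constraint x = z.
    u = u + x - z

x_star = (p + lam * q) / (1.0 + lam)                # exact minimizer, for reference
print("max deviation from optimum:", np.max(np.abs(x - x_star)))
```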
The 2.5D FPGA is a promising technology for accommodating a large design in one FPGA chip, but the limited number of inter-die connections in a 2.5D FPGA may cause routing failures. To resolve these failures, input/output time-division multiplexing is adopted after netlist partitioning by grouping cross-die signals to go through one routing channel at a timing penalty. However, grouping signals after partitioning might lead to a suboptimal solution. Consequently, it is desirable to consider simultaneous partitioning and signal grouping, although the optimization objectives of partitioning and grouping differ and the time complexity of such simultaneous optimization is usually high. In this paper, we propose a simultaneous partitioning and grouping algorithm that not only integrates the two objectives smoothly but also reduces the time complexity to linear time per partitioning iteration. Experimental results show that our proposed algorithm outperforms the state-of-the-art flow in both cross-die signal timing criticality and system-clock period.
Reversible logic is a building block for adiabatic and quantum computing, in addition to other applications. Since most common functions are not reversible, one needs to embed them into reversible functions of appropriate size by adding ancillary inputs and garbage outputs. We explore the Intellectual Property (IP) piracy of reversible circuits. The number of embeddings of regular functions in a reversible function and the percentage of leaked ancillary inputs measure the difficulty of recovering the embedded function. To illustrate the key concepts, we study reversible logic circuits designed using reversible logic synthesis tools based on Binary Decision Diagrams and Quantum Multi-valued Decision Diagrams.
To enhance the security of logic obfuscation schemes, recent literature has proposed delay-based logic locking in combination with traditional functional logic locking approaches. A circuit obfuscated using this approach preserves the correct functionality only when both the correct functional and delay keys are provided. In this paper, we develop a novel SAT-formulation-based approach called TimingSAT to deobfuscate the functionality of such delay-locked designs within a reasonable amount of time. The proposed technique models the timing characteristics of the various types of gates present in the design as Boolean functions to build timing-profile-embedded SAT formulations in terms of the targeted key inputs. The TimingSAT attack works in two stages: in the first stage, the functional keys are found using the traditional SAT attack approach, and in the second stage, the delay keys are deciphered using the timing-profile-embedded SAT formulation of the circuit. In both stages of the attack, wrong keys are iteratively eliminated until a key belonging to the correct equivalence class is obtained. The experimental results highlight the effectiveness of the proposed TimingSAT attack in breaking delay-logic-locked benchmarks within a few hours.
Similar to digital circuits, analog and mixed-signal (AMS) circuits are susceptible to supply-chain attacks such as piracy, overproduction, and Trojan insertion. However, unlike digital circuits, the supply-chain security of AMS circuits is less explored. In this work, we propose to perform "logic locking" on the digital section of an AMS circuit. The idea is to make the analog design intentionally suffer from the effects of process variations, which impede the operation of the circuit. Only when the correct key is applied are the effects of process variations mitigated, so that the analog circuit performs as desired. We provide theoretical guarantees of the security of the circuit, along with simulation results for a band-pass filter, a low-noise amplifier, and a low-dropout regulator, as well as experimental results of our technique on a band-pass filter.
With the globalization of manufacturing and supply chains, ensuring the security and trustworthiness of ICs has become an urgent challenge. Split manufacturing (SM) and layout camouflaging (LC) are promising techniques to protect the intellectual property (IP) of ICs from malicious entities during and after manufacturing (i.e., from untrusted foundries and from reverse engineering by end users). In this paper, we strive for "the best of both worlds," that is, of SM and LC. To do so, we extend both techniques toward 3D integration, an up-and-coming design and manufacturing paradigm based on stacking and interconnecting multiple chips/dies/tiers.
We first review prior art and its limitations. We also put forward a novel, practical threat model of IP piracy that is in line with the business models of present-day design houses. Next, we discuss how 3D integration is a naturally strong match for combining SM and LC. We propose a security-driven CAD and manufacturing flow for face-to-face (F2F) 3D ICs, along with obfuscation of the interconnects. Based on this CAD flow, we conduct comprehensive experiments on DRC-clean layouts. Strengthened by an extensive security analysis (including a novel attack to recover obfuscated F2F interconnects), we argue that moving into the third dimension is a compelling direction for effective and efficient IP protection.
Efficient acceleration of deep neural networks is a manifold task. To reduce memory requirements and energy consumption, we propose dedicated accelerators with novel arithmetic processing elements that use bit shifts instead of multipliers. While a regular power-of-2 quantization scheme allows multiplierless computation of multiply-accumulate operations, it suffers from high accuracy losses in neural networks. We therefore evaluate the use of powers of arbitrary log bases and confirm their suitability for quantizing pre-trained neural networks. The presented method works without retraining the neural network and is therefore suitable for applications in which no labeled training data is available. To verify the proposed method, we implement the log-based processing elements in a neural network accelerator on an FPGA. The hardware efficiency is evaluated in terms of FPGA utilization and energy requirements in comparison to regular 8-bit fixed-point multiplier-based acceleration. Using this approach, hardware resources are minimized and power consumption is reduced by 22.3%.
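A small sketch of the quantization step described above (rounding pre-trained weights to powers of an arbitrary log base, without retraining); the base values, clipping range, and error metric below are illustrative assumptions rather than the paper's configuration.

```python
import numpy as np

def quantize_pow(w, base=2.0 ** 0.5, clip_exp=(-12, 0)):
    """Round each weight to the nearest power of `base` (sign preserved).

    For base = 2 the resulting multiplications reduce to plain bit shifts;
    for bases like 2**0.5 each weight maps to a half-integer power of two.
    """
    w = np.asarray(w, dtype=np.float64)
    sign = np.sign(w)
    mag = np.abs(w)
    exp = np.round(np.log(np.maximum(mag, 1e-12)) / np.log(base))
    exp = np.clip(exp, *clip_exp)
    return sign * base ** exp

w = np.random.randn(1000) * 0.1          # stand-in for pre-trained weights
for b in (2.0, 2.0 ** 0.5, 2.0 ** 0.25):
    q = quantize_pow(w, base=b)
    print(f"base={b:.3f}  quantization MSE={np.mean((w - q) ** 2):.2e}")
```

Finer log bases shrink the quantization error at the cost of more shift-add terms per multiplication, which is the accuracy/hardware trade-off the accelerator exploits.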
Recent large-scale CNNs suffer from a severe memory-wall problem, as their number of weights ranges from tens to hundreds of millions. Processing in-memory (PIM) and binary CNNs have been proposed to reduce the number of memory accesses and the memory footprint, respectively. Combining these two separate concepts, we propose a novel processing-in-DRAM framework for binary CNNs, called NID, in which the dominant convolution operations are processed using in-DRAM bulk bitwise operations. We first identify the problem that bitcount operations built only from bulk bitwise AND/OR/NOT incur significant delay overhead as kernels get larger. We then not only optimize performance by efficiently allocating inputs and kernels to DRAM banks for both convolutional and fully-connected layers through design space exploration, but also mitigate the overhead of bitcount operations by splitting kernels into multiple parts. Partial-sum accumulations and the tasks of other layers, such as max-pooling and normalization layers, are processed in the peripheral area of the DRAM with negligible overhead. As a result, our NID framework achieves 19X-36X performance and 9X-14X EDP improvements for convolutional layers, and 9X-17X performance and 1.4X-4.5X EDP improvements for fully-connected layers, over a previous PIM technique on four large-scale CNN models.
Neural-network-based approximate computing is a universal architecture that promises tremendous energy efficiency for many error-resilient applications. To guarantee approximation quality, existing works deploy two neural networks (NNs), e.g., an approximator and a predictor. The approximator provides the approximate results, while the predictor predicts whether the input data is safe to approximate under the given quality requirement. However, it is non-trivial and time-consuming to make these two neural networks coordinate---they have different optimization objectives---by training them separately. This paper proposes a novel neural network structure---AXNet---that fuses the two NNs into a holistic, end-to-end trainable NN. Leveraging the philosophy of multi-task learning, AXNet can tremendously improve the invocation (the proportion of safe-to-approximate samples) and reduce the approximation error. The training effort also decreases significantly. Experimental results show 50.7% more invocation and substantial cuts in training time compared with an existing neural-network-based approximate computing framework.
This work introduces the concept of scalable-effort Convolutional Neural Networks (ConvNets), an effort-accuracy scalable model for classifying data at multiple levels of abstraction. Scalable-effort ConvNets are able to adapt at run-time to the complexity of the classification problem, i.e., the level of abstraction defined by the application (or context), and reach a given classification accuracy with minimal computational effort. The mechanism is implemented using a single-weight scalable-precision model rather than an ensemble of quantized-weight models; this makes the proposed strategy highly flexible and particularly suited for embedded architectures with limited resource availability.
The paper describes (i) a hardware/software vertical implementation of scalable-precision multiply-and-accumulate arithmetic, and (ii) an accuracy-constrained heuristic that delivers near-optimal layer-by-layer precision mapping at a predefined level of abstraction. It also reports validation on three state-of-the-art nets, i.e., AlexNet, SqueezeNet, and MobileNet, trained and tested with ImageNet. The collected results show that scalable-effort ConvNets guarantee flexibility and substantial savings: a 47.07% computational effort reduction at minimum accuracy, or a 30.6% accuracy improvement at maximum effort, w.r.t. standard flat ConvNets (averaged over the three benchmarks for high-level classification).
Several emerging reconfigurable technologies offering device-level runtime reconfigurability have been explored in recent years. These technologies offer the freedom to choose between p- and n-type functionality from a single transistor. To optimally utilize the feature sets of these technologies, circuit designs and storage elements require novel designs that complement existing and future electronic requirements. An important aspect of sustaining such endeavors is to supplement the existing design flow from the device level to the circuit level, backed by a thorough evaluation to ascertain the feasibility of such explorations. Additionally, since these technologies offer runtime reconfigurability and often encapsulate more than one function, hardware security features such as polymorphic logic gates and on-chip key storage come at naturally low cost with circuits based on these reconfigurable technologies. This paper presents innovative approaches devised for circuit designs harnessing the reconfigurable features of these nanotechnologies. New circuit design paradigms based on these nano devices are discussed to brainstorm on exciting avenues for novel computing elements.
This work is the first to investigate how the design of data-driven (i.e., toggling-based) clock gating can be closely integrated with the synthesis of flip-flops, a problem that has never been addressed in prior clock gating work. Our key observation is that an internal part of a flip-flop cell can be reused to generate its clock gating enable signal. Based on this, we propose a newly optimized flip-flop wiring structure, called eXOR-FF, in which internal logic can be reused in every clock cycle to decide whether the flip-flop is to be activated or deactivated through clock gating, thereby achieving area savings (and thus leakage as well as dynamic power savings) on every pair of flip-flop and its toggling detection logic. We then propose a comprehensive methodology of placement/timing-aware clock gating exploration with two unique strengths: it is best suited to maximally exploiting the benefit of eXOR-FFs, and it precisely analyzes the decomposition of power consumption and timing impact and translates them into cost functions in the core engine of the clock gating exploration.
The reliability of a P/G network is one of the most important concerns in chip design, which makes powerplanning a critical step in physical design. Traditional P/G network design mainly focuses on reducing the use of routing resources to satisfy voltage drop and electromigration constraints over a regular mesh. As the number of macros in a modern design increases, this style may waste routing resources and make routing congestion more severe in local regions. To save routing resources and increase routability, this paper proposes a refined powerplanning method. First, we propose a row-style power mesh that facilitates the connection of pre-placed macros and increases the routability of signal nets in the later stages. In addition, we determine an effective power stripe width that reduces wasted routing resources and provides a stronger supply voltage. Moreover, we present the first work that uses linear programming to minimize P/G routing area while considering routability at the same time. The experimental results show that the routability of a design with many macros can be significantly improved by our row-style power networks.
Conventional on-chip spiral inductors consume significant top-metal routing area, which prevents their adoption in many on-chip applications. Recently, a TSV-inductor with a magnetic core has been proven a viable option for an on-chip DC-DC converter in a 14nm test chip. The operating conditions of such inductors play a major role in maximizing the performance and efficiency of the DC-DC converter. However, due to its unique TSV structure, unlike the conventional spiral inductor, many of its modeling details remain unclear. This paper analyzes the modeling details of a magnetic-core TSV-inductor and proposes a design methodology to optimize the power losses of the inductor. With this methodology, designers can ensure fast and reliable inductor optimization for on-chip applications. Experimental results show that the optimized magnetic-core TSV-inductor achieves inductance density improvements of 6.0--7.7X and quality factor improvements of 1.3--1.6X while maintaining the same footprint.
During design signoff, many iterations of Engineering Change Order (ECO) are needed to ensure that the IR drop of each cell instance meets the specified limit. This wastes resources because repeated dynamic IR drop simulations take a very long time on very similar designs. In this work, we train a machine learning model on data from before the ECO and predict the IR drop after the ECO. To increase prediction accuracy, we propose 17 timing-aware, power-aware, and physical-aware features. Our method is scalable because the feature dimension is fixed (937), independent of design size and cell library. We also propose building regional models for cell instances near IR drop violations, which improves both prediction accuracy and training time. Our experiments show a prediction correlation coefficient of 0.97 and an average error of 3.0mV on a 5-million-cell industrial design. Our IR drop prediction for 100K cell instances can be completed within 2 minutes. The proposed method thus provides fast IR drop prediction to speed up ECO.
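The post-ECO IR drop predictor can be pictured as plain supervised regression over fixed-length per-instance feature vectors. The sketch below, on synthetic data with stand-in features and a generic scikit-learn regressor, only illustrates that workflow and the correlation metric reported above; it is not the authors' model.

```python
import numpy as np
from sklearn.linear_model import Ridge
from scipy.stats import pearsonr

# Synthetic stand-in data: one fixed-length feature vector per cell instance
# (timing-, power-, and physical-aware features extracted before the ECO)
# and the dynamic IR drop observed after the ECO as the regression target.
rng = np.random.default_rng(0)
n_cells, n_features = 20_000, 937            # fixed feature dimension, as above
X = rng.normal(size=(n_cells, n_features))
true_w = np.zeros(n_features); true_w[:20] = rng.normal(scale=0.5, size=20)
y = X @ true_w + rng.normal(scale=0.1, size=n_cells)

split = int(0.8 * n_cells)                   # train on known instances, test on the rest
model = Ridge(alpha=1.0).fit(X[:split], y[:split])

pred = model.predict(X[split:])
corr, _ = pearsonr(pred, y[split:])
print(f"correlation={corr:.3f}  mean abs error={np.mean(np.abs(pred - y[split:])):.3f}")
```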
We provide a systematization of knowledge of the recent progress made in addressing the crucial problem of deep learning on encrypted data. The problem is important due to the prevalence of deep learning models across various applications, and to privacy concerns over the exposure of deep learning IP and users' data. Our focus is on provably secure methodologies that rely on cryptographic primitives and not on trusted third parties/platforms. The computational intensity of the learning models, together with the complexity of realizing the cryptographic algorithms, makes practical implementation a challenge. We provide a summary of the state of the art, a comparison of the existing solutions, and a discussion of future challenges and opportunities.
Machine learning, specifically deep learning, is becoming a key technology component in application domains such as identity management, finance, automotive, and healthcare, to name a few. Proprietary machine learning models - Machine Learning IP - are developed and deployed at the network edge, on end devices, and in the cloud to maximize user experience.
With the proliferation of applications embedding Machine Learning IPs, machine learning models and hyper-parameters become attractive to attackers and require protection. Major players in the semiconductor industry provide on-device mechanisms to protect the IP, at rest and during execution, from being copied, altered, reverse engineered, and abused by attackers. In this work, we explore system security architecture mechanisms and their application to Machine Learning IP protection.
Deep Learning (DL) models have been shown to be vulnerable to adversarial attacks. In light of these attacks, it is critical to reliably quantify the confidence of a neural network's predictions to enable the safe adoption of DL models in autonomous, sensitive tasks (e.g., unmanned vehicles and drones). This article discusses recent research advances in unsupervised model assurance against the strongest adversarial attacks known to date and quantitatively compares their performance. Given the widespread usage of DL models, it is imperative to provide model assurance by carefully examining the feature maps automatically learned within DL models, instead of looking back with regret when deep learning systems are compromised by adversaries.
Inference efficiency is the predominant consideration in designing deep learning accelerators. Previous work mainly focuses on skipping zero values to deal with the considerable amount of ineffectual computation, while zero bits in non-zero values, another major source of ineffectual computation, are often ignored. The reason lies in the difficulty of extracting the essential bits while performing multiply-and-accumulate (MAC) operations in the processing element. Based on the fact that zero bits account for as much as 68.9% of the overall weight bits in modern deep convolutional neural network models, this paper first proposes a weight kneading technique that eliminates the ineffectual computation caused by both zero-value weights and zero bits in non-zero weights. In addition, a split-and-accumulate (SAC) computing pattern that replaces the conventional MAC, together with a corresponding hardware accelerator design called Tetris, is proposed to support weight kneading at the hardware level. Experimental results show that Tetris can speed up inference by up to 1.50x and improve power efficiency by up to 5.33x compared with state-of-the-art baselines.
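The idea of skipping zero bits can be illustrated in a few lines of Python: decompose each weight into its essential (set) bit positions and accumulate shifted activations, one term per set bit, which is the spirit of the split-and-accumulate pattern. This is an unsigned fixed-point toy, not the Tetris datapath.

```python
def essential_bits(w, bits=8):
    """Return the positions of the set bits in an unsigned fixed-point weight."""
    return [i for i in range(bits) if (w >> i) & 1]

def sac(x, w, bits=8):
    """Split-and-accumulate: sum of shifted activations, one term per set bit.

    Zero bits in the weight (and zero-valued weights) contribute no terms,
    which is the ineffectual work that weight kneading removes."""
    return sum(x << i for i in essential_bits(w, bits))

x, w = 27, 0b01000100            # only two essential bits -> two shift-adds
assert sac(x, w) == x * w
print(essential_bits(w), sac(x, w))
```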
Unlike standard Convolutional Neural Networks (CNNs) with fully-connected layers, Fully Convolutional Neural Networks (FCNs) are prevalent in computer vision applications such as object detection, semantic/image segmentation, and the popular generative tasks based on Generative Adversarial Networks (GANs). In an FCN, traditional convolutional layers and deconvolutional layers contribute the majority of the computational complexity. However, prior deep learning accelerator designs mostly focus on CNN optimization. They either use independent compute resources to handle deconvolution or convert deconvolutional layers (Deconv) into general convolution operations, which incurs considerable overhead.
To address this problem, we propose a unified fully convolutional accelerator that handles both deconvolutional and convolutional layers with a single processing element (PE) array. We re-optimize the conventional CNN accelerator architecture of a regular 2D processing element array to enable it to more efficiently support the data flow of deconvolutional-layer inference. By exploiting the locality in deconvolutional filters, this architecture reduces on-chip memory communication from 24.79 GB to 6.56 GB and improves power efficiency significantly. Compared to the prior baseline deconvolution acceleration scheme, the proposed accelerator achieves a 1.3X--44.9X speedup and reduces energy consumption by 14.6%-97.6% on a set of representative benchmark applications. Meanwhile, it keeps CNN inference performance similar to that of an optimized CNN-only accelerator, with negligible power consumption and chip area overhead.
As convolutional neural networks (CNNs) enable state-of-the-art computer vision applications, their high energy consumption has emerged as a key impediment to their deployment on embedded and mobile devices. Towards efficient image classification under hardware constraints, prior work has proposed adaptive CNNs, i.e., systems of networks with different accuracy and computation characteristics, where a selection scheme adaptively selects the network to be evaluated for each input image. While previous efforts have investigated different network selection schemes, we find that they do not necessarily result in energy savings when deployed on mobile systems. The key limitation of existing methods is that they learn only how data should be processed among the CNNs and not the network architectures, with each network being treated as a blackbox.
To address this limitation, we pursue a more powerful design paradigm where the architecture settings of the CNNs are treated as hyper-parameters to be globally optimized. We cast the design of adaptive CNNs as a hyper-parameter optimization problem with respect to energy, accuracy, and communication constraints imposed by the mobile device. To efficiently solve this problem, we adapt Bayesian optimization to the properties of the design space, reaching near-optimal configurations in a few tens of function evaluations. Our method reduces the energy consumed for image classification on a mobile device by up to 6X compared to the best previously published work that uses CNNs as black boxes. Finally, we evaluate two image classification practices, i.e., classifying all images locally versus over the cloud, under energy and communication constraints.
Deep neural networks (DNNs) are increasingly being accelerated on application-specific hardware such as the Google TPU, designed especially for deep learning. Timing speculation is a promising approach to further increase the energy efficiency of DNN accelerators. Architectural exploration for timing speculation requires detailed gate-level timing simulations that can be time-consuming for large DNNs, which execute millions of multiply-and-accumulate (MAC) operations. In this paper, we propose FATE, a new methodology for fast and accurate timing simulations of DNN accelerators like the Google TPU. FATE introduces two novel ideas: (i) DelayNet, a DNN-based timing model for MAC units; and (ii) a statistical sampling methodology that reduces the number of MAC operations for which timing simulations are performed. We show that FATE results in an 8X--58X speed-up in timing simulations while introducing less than 2% error in classification accuracy estimates. We demonstrate the use of FATE by comparing a conventional DNN accelerator that uses 2's complement (2C) arithmetic with one that uses signed-magnitude representation (SMR). We show that the SMR implementation provides 18% more energy savings than 2C for the same classification accuracy, a result that might be of independent interest.
For future autonomous vehicles, the system development life cycle must keep up with the rapid rate of innovation and the changing needs of the market. Waterfall is too slow to react to such changes, and therefore there is a growing emphasis on adopting Agile development concepts in the automotive industry. Ensuring requirements traceability, and thus proving functional safety, is a serious challenge in this direction. Modern cars are complex cyber-physical systems and are traditionally designed using a set of disjoint tools, which adds to the challenge. In this paper, we point out that multi-domain coupling and design automation using correct-by-design approaches can lead to safe designs even in an Agile environment. In this context, we study current industry trends. We further outline the challenges involved in multi-domain coupling and demonstrate, using a state-of-the-art approach, how these challenges can be addressed by exploiting domain-specific knowledge.
Smart buildings of the future are complex cyber-physical-human systems that involve close interactions among the embedded platform (for sensing, computation, communication, and control), mechanical components, the physical environment, building architecture, and occupant activities. The design and operation of such buildings require a new set of methodologies and tools that can address these heterogeneous domains in a holistic, quantitative, and automated fashion. In this paper, we present our design automation methods for improving building energy efficiency and offering comfortable services to occupants at low cost. In particular, we highlight our work in developing both model-based and data-driven approaches for building automation and control, including methods for co-scheduling heterogeneous energy demands and supplies, for integrating intelligent building energy management with grid optimization through a proactive demand response framework, for optimizing HVAC control with deep reinforcement learning, and for accurately measuring in-building temperature by combining prior modeling information with a few sensor measurements based upon Bayesian inference.
High-power Lithium-Ion (Li-Ion) battery packs used in stationary Electrical Energy Storage (EES) systems and Electric Vehicle (EV) applications require a sophisticated Battery Management System (BMS) to maintain safe operation and improve their performance. With the increasing complexity of these battery packs and their demand for shorter time-to-market, decentralized approaches for battery management, providing a high degree of modularity, scalability, and improved control performance, are typically preferred. However, manual design approaches for these complex distributed systems are time-consuming and error-prone, resulting in reduced energy efficiency of the overall system. Dedicated design automation techniques considering all abstraction levels of the battery system are therefore required to obtain highly optimized battery packs. This paper presents, from a design automation perspective, recent advances in the domain of battery systems, which combine electrochemical cells with their associated management modules. Specifically, we classify battery systems into three abstraction levels: cell level (battery cells and their interconnection schemes), module level (sensing and charge-balancing circuits), and pack level (computation and control algorithms). We provide an overview of the challenges that exist in each abstraction layer and give an outlook on the future design automation techniques required to overcome these limitations.
Dynamic verification is widely used to increase confidence in the correctness of RTL circuits during the pre-silicon design phase. Despite numerous attempts over the last decades to automate stimuli generation based on coverage feedback, Coverage Directed Test Generation (CDG) has not found the widespread adoption that one would expect. Building on ideas from the software testing community around coverage-guided mutational fuzz testing, we propose a new approach to the CDG problem that requires minimal setup and takes advantage of FPGA-accelerated simulation for rapid testing. We provide test input and coverage definitions that allow fuzz testing to be applied to RTL circuit verification. In addition, we propose and implement a series of transformation passes that make it feasible to reset arbitrary RTL designs quickly, a requirement for deterministic test execution. Alongside this paper we provide rfuzz, a fully featured implementation of our testing methodology, which we make available as open-source software to the research community. An empirical evaluation of rfuzz shows promising results in achieving coverage for a wide range of RTL designs, from communication IPs to an industry-scale 64-bit CPU.
This paper proposes a framework for the functional verification of shared memory that relies on reusable coverage-driven directed test generation. It introduces a new mechanism to improve the quality of non-deterministic tests. The generator exploits general properties of coherence protocols and cache memories for better control of transition coverage, which serves as a proxy for increasing the actual coverage metric adopted in a given verification environment. Being independent of the coverage metric, coherence protocol, and cache parameters, the proposed generator is reusable across quite different designs and verification environments. We report the coverage for 8-, 16-, and 32-core designs and the effort required to expose nine different types of errors. The proposed technique was always able to reach coverage similar to a state-of-the-art generator, and above a certain coverage threshold it always did so faster. For instance, when executing tests with 1K operations to verify 32-core designs, the former reached 65% coverage around 5 times faster than the latter. Besides, we identified challenging errors that could hardly be found by the latter within one hour, but were exposed by our technique in 5 to 30 minutes.
Stimulus generation is an essential part of hardware verification, being at the core of widely applied constrained-random verification techniques. However, as verification problems get more and more complex, so do the constraints that must be satisfied. In this context, it is a challenge to efficiently generate random stimuli that achieve good coverage of the design space. We developed a new technique, SMTSampler, which can sample random solutions from Satisfiability Modulo Theories (SMT) formulas with bit-vectors, arrays, and uninterpreted functions. The technique uses a small number of calls to a constraint solver to generate up to millions of stimuli. Our evaluation on a large set of complex industrial SMT benchmarks shows that SMTSampler can handle a larger class of SMT problems, outperforming state-of-the-art constraint sampling techniques in the number of samples produced and the coverage of the constraint space.
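For context, the naive way to sample stimuli from an SMT constraint set is to enumerate models one solver call at a time, as the z3-based sketch below does on a toy constraint. SMTSampler's contribution is precisely to avoid this cost by deriving up to millions of samples from a small number of solver calls, so the sketch shows only the baseline it improves upon; the constraint itself is a made-up placeholder.

```python
from z3 import BitVec, Solver, Or, sat

# Toy constraint over two 8-bit stimulus fields (placeholder for a real
# constrained-random verification constraint set).
a, b = BitVec("a", 8), BitVec("b", 8)
s = Solver()
s.add(a + b == 100, a > 10, b > 10)

samples = []
while len(samples) < 20 and s.check() == sat:
    m = s.model()
    samples.append((m[a].as_long(), m[b].as_long()))
    # Block this exact assignment so the next check yields a new stimulus.
    s.add(Or(a != m[a], b != m[b]))

print(samples)
```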
Memristor-based deep learning accelerators provide a promising solution for improving the energy efficiency of neuromorphic computing systems. However, the electrical properties and crossbar structure of memristors make these accelerators error-prone. To enable reliable memristor-based accelerators, a simulation platform is needed to precisely analyze the impact of non-ideal circuit and device properties on inference accuracy. In this paper, we propose a flexible simulation framework, DL-RSIM, to tackle this challenge. DL-RSIM simulates the error rates of every sum-of-products computation in the memristor-based accelerator and injects the errors into the targeted TensorFlow-based neural network model. A rich set of reliability impact factors is explored by DL-RSIM, and it can be incorporated with any deep learning neural network implemented in TensorFlow. Using three representative convolutional neural networks as case studies, we show that DL-RSIM can guide chip designers in choosing a reliability-friendly design option and developing reliability optimization techniques.
In this paper, we present a ferroelectric FET (FeFET) based power-efficient architecture to accelerate data-intensive applications such as deep neural networks (DNNs). We propose a cross-cutting solution combining emerging device technologies, circuit optimizations, and micro-architecture innovations. At the device level, a FeFET crossbar is utilized to perform vector-matrix multiplication (VMM). As a field-effect device, the FeFET significantly reduces read/write energy compared with resistive random-access memory (ReRAM). At the circuit level, we propose an all-digital peripheral design, reducing the large overhead introduced by ADCs and DACs in prior works. In terms of micro-architecture innovation, a dedicated hierarchical network-on-chip (H-NoC) is developed for input broadcasting and on-the-fly partial-result processing, reducing the data transmission volume and latency. Speed, power, area, and computing accuracy are evaluated based on detailed device characterization and system modeling. For DNN computing, our design achieves 254x and 9.7x gains in power efficiency (GOPS/W) compared to GPU and ReRAM-based designs, respectively.
Transfer learning has recently demonstrated great success in general supervised learning by mitigating expensive training efforts. However, existing neural network accelerators have proven inefficient at executing transfer learning because they fail to accommodate the layer-wise heterogeneity in computation and memory requirements. In this work, we propose EMAT---an efficient multi-task architecture for transfer learning built on resistive memory (ReRAM) technology. EMAT utilizes the energy efficiency of ReRAM arrays for matrix-vector multiplication and realizes a hierarchical reconfigurable design with heterogeneous computation components to incorporate the data patterns in transfer learning. Compared to a GPU platform, EMAT achieves on average a 120X performance speedup and 87X energy saving. EMAT also obtains a 2.5X speedup compared to the state-of-the-art CMOS accelerator.
Maintaining high energy efficiency has become a critical design issue for high-performance systems. Many power management techniques, such as dynamic voltage and frequency scaling (DVFS), have been proposed for processor cores. However, very few solutions consider the power losses incurred in the power delivery system (PDS), despite the fact that they have a significant impact on the system's overall energy efficiency. With the explosive growth of system complexity and highly dynamic workload variations, it is challenging to find optimal power management policies that effectively match power delivery with power consumption. To tackle these problems, we propose a reinforcement-learning-based power management scheme for manycore systems that jointly monitors and adjusts both the PDS and the processor cores to improve overall energy efficiency. The learning agents distributed across power domains not only manage the power states of the processor cores but also control the on/off states of the on-chip VRs to proactively adapt to workload variations. Experimental results with realistic applications show that when the proposed approach is applied to a large-scale system with a hybrid PDS, it lowers the overall system energy-delay product (EDP) by 41% compared with a traditional monolithic DVFS approach using a bulky off-chip VR.
Stochastic gradient descent (SGD) is a widely used algorithm in many applications, especially in the training of deep learning models. Low-precision implementation of SGD has been studied as a major acceleration approach. However, if not used appropriately, low-precision implementation can deteriorate convergence because of rounding error when gradients become small near a local optimum. In this work, to balance throughput and algorithmic accuracy, we apply the Q-learning technique to adjust the precision of SGD automatically by designing an appropriate decision function. The proposed decision function for Q-learning takes the error rate of the objective function, its gradients, and the current precision configuration as inputs. Q-learning then chooses the precision adaptively for hardware efficiency and algorithmic accuracy. We use reconfigurable devices such as FPGAs to evaluate the adaptive precision configurations generated by the proposed Q-learning method. We prototype the framework using the LeNet-5 model with the MNIST and CIFAR10 datasets and implement it on a Xilinx KCU1500 FPGA board. In the experiments, we analyze the throughput of different precision representations and the precision selection of our framework. The results show that the proposed framework with adaptive precision increases throughput by up to 4.3X compared to the conventional 32-bit floating-point setting, and it achieves both the best hardware efficiency and algorithmic accuracy.
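The sketch below outlines a tabular Q-learning loop of the kind described above: the state discretizes the decision-function inputs (objective error, gradient magnitude, current precision), the actions lower, keep, or raise the precision, and the reward trades off loss improvement against the cost of higher precision. All constants, the reward shaping, and the toy loss curve are illustrative assumptions, not the paper's setup.

```python
import random
from collections import defaultdict

PRECISIONS = [8, 16, 32]              # candidate bit widths
ACTIONS = [-1, 0, +1]                 # lower / keep / raise the precision index
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1

Q = defaultdict(float)                # Q[(state, action_index)]

def state(loss, grad_norm, prec_idx, buckets=10):
    """Discretized decision-function inputs: error level, gradient magnitude,
    and the current precision configuration."""
    return (min(int(loss * buckets), buckets - 1),
            min(int(grad_norm * buckets), buckets - 1),
            prec_idx)

def choose(s):
    if random.random() < EPS:                       # epsilon-greedy exploration
        return random.choice(range(len(ACTIONS)))
    return max(range(len(ACTIONS)), key=lambda a: Q[(s, a)])

def update(s, a, r, s_next):
    best_next = max(Q[(s_next, a2)] for a2 in range(len(ACTIONS)))
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

def reward(loss_drop, prec_idx):
    # Favor progress on the objective, penalize expensive high precision.
    return loss_drop + 0.5 / PRECISIONS[prec_idx]

prec_idx, loss = 1, 1.0               # toy training loop with a fake loss curve
for step in range(1000):
    s = state(loss, loss, prec_idx)   # gradient magnitude proxied by the loss
    a = choose(s)
    prec_idx = min(max(prec_idx + ACTIONS[a], 0), len(PRECISIONS) - 1)
    new_loss = loss * (0.995 if PRECISIONS[prec_idx] >= 16 else 0.999)
    update(s, a, reward(loss - new_loss, prec_idx), state(new_loss, new_loss, prec_idx))
    loss = new_loss

print("final precision choice:", PRECISIONS[prec_idx], "bits, final loss:", round(loss, 4))
```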
Mobile devices running augmented reality applications consume considerable energy for graphics-intensive workloads. This paper presents a scheme for the differentiated handling of camera-captured physical scenes and computer-generated virtual objects according to different perceptual quality metrics. We propose online algorithms and their real-time implementations to reduce energy consumption through dynamic frame rate adaptation while maintaining the visual quality required for augmented reality applications. To evaluate the system's efficacy, we integrate our scheme into Android and conduct extensive experiments on a commercial smartphone with various application scenarios. The results show that the proposed scheme can achieve energy savings of up to 39.1% in comparison to the native graphics system in Android while maintaining satisfactory visual quality.
In this paper, we present DATC Robust Design Flow (RDF) from logic synthesis to detailed routing. We further include detailed placement and detailed routing tools based on recent EDA research contests. We also demonstrate RDF in a scalable cloud infrastructure. Design methodology and cross-stage optimization research can be conducted via RDF.
Subthreshold SRAM design is crucial for addressing the memory bottleneck in energy-constrained applications. While statistical optimization can be applied based on Monte-Carlo (MC) simulation, exploration of the bitcell design space is time-consuming. This paper presents a framework for model-based design and optimization of subthreshold SRAM bitcells under random PVT variations. By incorporating key design and process features, a physical model of bitcell static noise margin (SNM) is derived analytically. It captures intra-die SNM variations through the combination of a folded-normal distribution and a non-central chi-squared distribution. Validation against MC simulation shows that it accurately models SNM distributions down to 25mV, beyond 6-sigma, for typical bitcells in 28nm. Model-based tuning of subthreshold SRAM bitcells is investigated for design trade-offs between leakage, area, and stability. When targeting a specific SNM constraint, we show that an optimal standby voltage exists that offers minimum bitcell leakage power - any deviation above or below it increases power consumption. When targeting a specific standby voltage, our design flow identifies bitcell instances with 12x lower leakage power or 3x smaller area compared to the minimum-length design.
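One practical payoff of an analytical SNM model is that its distributions can be queried far out in the tail, where Monte-Carlo is hopeless. The snippet below simply evaluates extreme quantiles of a folded-normal and a non-central chi-squared distribution with SciPy, using made-up parameters; the paper derives the actual parameters, and their combination, from the bitcell's design and process features.

```python
from scipy import stats

# Illustrative parameters only -- not taken from the paper.
fn = stats.foldnorm(c=1.5, scale=0.02)        # folded-normal component (volts)
nc = stats.ncx2(df=2, nc=1.0, scale=4e-4)     # non-central chi-squared component

# A closed-form model can be queried far beyond 6-sigma, where plain
# Monte-Carlo would need >1e9 samples per point to resolve the tail.
for p in (1e-3, 1e-6, 1e-9):
    print(f"p={p:.0e}: foldnorm quantile={fn.ppf(p):.3e}  ncx2 quantile={nc.ppf(p):.3e}")
```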
Adaptive voltage scaling (AVS) is a promising approach to overcome manufacturing variability, dynamic environmental fluctuation, and aging. This paper focuses on the timing sensors necessary for AVS implementation and compares in-situ timing error predictive flip-flops (TEP-FFs) with a critical path replica in terms of how much voltage margin can be reduced. To estimate the theoretical bound of ideal AVS, this work proposes a linear-programming-based minimum supply voltage analysis and discusses voltage adaptation performance quantitatively by investigating the gap between the lower bound and the actual supply voltages. Experimental results show that TEP-FF based AVS and replica-based AVS achieve up to 13.3% and 8.9% supply voltage reduction, respectively, while satisfying the target MTTF. AVS with TEP-FFs tracks the theoretical bound with a 2.5% to 5.6% voltage margin, while AVS with a replica needs a 7.2% to 9.9% margin.
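A linear-programming bound of the kind mentioned above can be sketched as follows: with a linearized delay model per cycle and a ramp-rate limit on the regulator, the minimum per-cycle supply voltage is a small LP. The delay model, constants, and constraints below are illustrative assumptions, not the paper's formulation.

```python
import numpy as np
from scipy.optimize import linprog

# Toy linearized timing model: delay_t(V) ~ d0_t - s*(V - V0) must meet Tclk.
T, V0, s, Tclk, dV = 8, 0.9, 4.0, 1.0, 0.02      # illustrative numbers
d0 = np.array([1.00, 1.05, 0.95, 1.10, 1.02, 0.98, 1.07, 1.01])

c = np.ones(T)                                   # minimize total (=average) Vdd
A, b = [], []
for t in range(T):                               # timing: -s*V_t <= Tclk - d0_t - s*V0
    row = np.zeros(T); row[t] = -s
    A.append(row); b.append(Tclk - d0[t] - s * V0)
for t in range(1, T):                            # ramp-rate limits on the regulator
    up = np.zeros(T); up[t], up[t - 1] = 1, -1
    A.append(up); b.append(dV)
    A.append(-up); b.append(dV)

res = linprog(c, A_ub=np.array(A), b_ub=np.array(b),
              bounds=[(0.5, 1.1)] * T, method="highs")
print("minimum per-cycle Vdd bound:", np.round(res.x, 3))
```

Comparing such a lower bound against the voltages actually reached by TEP-FF or replica-based adaptation is what quantifies the residual margin reported above.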
Organic thin-film transistors (OTFTs) are widely used in flexible circuits, such as flexible displays, sensor arrays, and radio-frequency identification cards (RFIDs), because these technologies offer features such as better flexibility, lower cost, and easy manufacturability using low-temperature fabrication processes. This paper develops a procedure that evaluates the performance of flexible displays. Due to their very nature, flexible displays experience significant mechanical strain/stress in the field because of the deformation caused by daily use. These deformations can impact device and circuit performance, potentially causing a loss of functionality. This paper first models the effects of extrinsic strain due to two fundamental deformation modes, bending and twisting. Next, this strain is translated into variations in device mobility, after which analytical models for error analysis in the flexible display are derived based on the rendered image values in each pixel of the display. Finally, two error correction approaches for flexible displays are proposed, based on voltage compensation and flexible clocking.
As data security has become a major concern in modern storage systems with low-cost multi-level-cell (MLC) flash memories, it is not trivial to realize data sanitization in such systems. Even though some existing works employ encryption or the built-in erase operation to achieve this requirement, they still suffer from the risk of being deciphered or from performance degradation. In contrast to existing work, a fast sanitization scheme is proposed to provide the highest degree of security for data sanitization; that is, every old version of data can be immediately sanitized with zero live-data-copy overhead once the new version of the data is created/written. In particular, the scheme further considers the reliability issue of MLC flash memories and includes a one-shot sanitization design to minimize the disturbance during data sanitization. The feasibility and the capability of the proposed scheme were evaluated through extensive experiments based on real flash chips. The results demonstrate that the scheme can achieve data sanitization with zero live-data-copy, with a performance overhead of less than 1%.
Secure deletion ensures user privacy by permanently removing invalid data from secondary storage. This process is particularly critical for solid state drives (SSDs), in which invalid data is generated not only upon deleting a file but also upon updating a file, of which the user may not be aware. While previous secure deletion schemes are usually applied to all invalid data on the SSD, our observation is that in many cases security is not required for all files on the SSD. This paper proposes an efficient secure deletion scheme targeting only the invalid data of files marked as "secure" by the user. A security-aware data allocation strategy is designed, which separates secure and unsecure data at the lower (block) level but mixes them at higher levels of the SSD's hierarchical organization. Block-level separation minimizes secure deletion cost, while higher-level mixing mitigates the adverse impact of secure deletion on SSD lifetime. A two-level block management scheme is further developed to scatter secure blocks over the SSD for wear leveling. Experiments on real-world benchmarks confirm the advantage of the proposed scheme in reducing secure deletion cost and improving SSD lifetime.
The exponential growth in creation and consumption of various forms of digital data has led to the emergence of new application workloads such as machine learning, data analytics and search. These workloads process large amounts of data and hence pose increased demands on the on-chip and off-chip interconnects of modern computing systems. Therefore, techniques that can improve the energy-efficiency and performance of interconnects are becoming increasingly important.
Approximate computing promises significant advantages over more traditional computing architectures with respect to circuit area, performance, power efficiency, flexibility, and cost. It is suitable for applications where limited and controlled inaccuracies are tolerable or where uncertainty is intrinsic to the inputs or their processing, as happens, e.g., in (deep) machine learning and in image and signal processing. This paper discusses a dimension of approximate computing that has been neglected so far, despite representing a major asset nowadays: security. A number of hardware-related security threats are considered, and the implications of approximate circuits or systems designed to address these threats are discussed.
Neural networks and deep learning are promising techniques for bringing brain-inspired computing into embedded platforms. They pave the way to new kinds of associative memories, classifiers, data mining, machine learning, and search engines, which can form the basis of critical and sensitive applications such as autonomous driving. Emerging non-volatile memory technologies integrated in so-called Multi-Processor System-on-Chip (MPSoC) architectures enable the realization of such computational paradigms. These architectures take advantage of the Network-on-Chip concept to efficiently carry out communications with dedicated distributed memories and processing elements. However, current MPSoC-based neuromorphic architectures are deployed without taking security into account. The growing complexity and the hyper-sharing of hardware resources in MPSoCs may become a threat, increasing the risk of malware infections and of Trojans introduced at design time. In particular, MPSoC microarchitectural side channels and fault injection attacks can be exploited to leak sensitive information and to cause malfunctions. In this work we present three contributions on this issue: i) a first analysis of security issues in MPSoC-based neuromorphic architectures; ii) a discussion of the threat model of neuromorphic architectures; and iii) a demonstration of the correlation between SNN input and the neural computation.
Today, secure systems are built by identifying potential vulnerabilities and then adding protections to thwart the associated attacks. Unfortunately, the complexity of today's systems makes it impossible to prove that all attacks are stopped, so clever attackers find a way around even the most carefully designed protections. In this article, we take a sobering look at the state of secure system design and ask ourselves why the "security arms race" never ends. The answer lies in our inability to develop adequate security verification technologies. We then examine an advanced defensive system in nature - the human immune system - and we discover that it does not remove vulnerabilities; rather, it adds offensive measures to protect the body when its vulnerabilities are penetrated. We close the article with brief speculation on how the human immune system could inspire more capable secure system designs.
Modern processing systems with heterogeneous components (e.g., CPUs, GPUs) have numerous configuration and design options such as the number and types of cores, frequency, and memory bandwidth. Hardware architects must perform design space explorations in order to accurately target markets of interest under tight time-to-market constraints. This need highlights the importance of rapid performance and power estimation mechanisms.
This work describes the use of machine learning (ML) techniques within a methodology for estimating the performance and power of heterogeneous systems. In particular, we measure the power and performance of a large collection of test applications running on real hardware across numerous hardware configurations. We use these measurements to train an ML model; the model learns how the applications scale with the system's key design parameters.
Later, new applications of interest are executed on a single configuration, and we gather hardware performance counter values that describe how each application used the hardware. These values are fed into our ML model's inference algorithm, which quickly identifies how the application will scale across various design points. In this way, we can rapidly predict the performance and power of the new application across a wide range of system configurations.
Once the initial run of the program is complete, our ML algorithm can predict the application's performance and power at many hardware points faster than running it at each of those points and with a level of accuracy comparable to cycle-level simulators.
In the emerging data-driven science paradigm, computing systems ranging from IoT and mobile devices to manycores and datacenters play distinct roles. These systems need to be optimized for the objectives and constraints dictated by the needs of the application. In this paper, we describe how machine learning techniques can be leveraged to improve the computational efficiency of hardware design optimization. This includes generic methodologies that are applicable to any hardware design space. As examples, we discuss a guided design space exploration framework to accelerate application-specific manycore system design and advanced imitation learning techniques to improve on-chip resource management. We present experimental results for application-specific manycore system design optimization and dynamic power management to demonstrate the efficacy of these methods over traditional EDA approaches.
Data-driven prognostic health management is essential to ensure high reliability and rapid error recovery in commercial core router systems. The effectiveness of prognostic health management depends on whether failures can be accurately predicted with sufficient lead time. This paper describes how time-series analysis and machine-learning techniques can be used to detect anomalies and predict failures in complex core router systems. First, both a feature-categorization-based hybrid method and a changepoint-based method have been developed to detect anomalies in time-varying features with different statistical characteristics. Next, an SVM-based failure predictor is developed to predict both the categories and the lead time of system failures from the collected anomalies. A comprehensive set of experimental results is presented for data collected during 30 days of field operation from over 20 core routers deployed by customers of a major telecom company.
Neural approximate computing gains enormous energy efficiency at the cost of tolerable quality loss. A neural approximator maps the input data to outputs, while a classifier determines whether the input data is safe to approximate with a quality guarantee. However, existing works cannot maximize the invocation of the approximator, resulting in limited speedup and energy saving. By exploring the mapping space of the target functions, we observe a non-uniform distribution of the approximation error incurred by the same approximator. We thus propose a novel approximate computing architecture with a Multiclass-Classifier and Multiple Approximators (MCMA). These approximators have identical network topologies and can thus share the same hardware resources in a neural processing unit (NPU) chip. At runtime, MCMA can swap in the invoked approximator by merely shipping its synapse weights from the on-chip memory to the buffers near the MAC units within a cycle. We also propose efficient co-training methods for the MCMA architecture. Experimental results show substantially higher invocation with MCMA as well as gains in energy efficiency.
Recently, deterministic approaches to stochastic computing (SC) have been proposed. These compute with the same constructs as stochastic computing but operate on deterministic bit streams. These approaches reduce the area, greatly reduce the latency (by an exponential factor), and produce completely accurate results. However, these methods do not scale well. Also, they lack the property of progressive precision enjoyed by SC. As a result, these deterministic approaches are not competitive for applications where some degree of inaccuracy can be tolerated. In this work we introduce two fast-converging, scalable deterministic approaches to SC based on low-discrepancy sequences. The results are completely accurate when running the operations for the required number of cycles. However, the computation can be truncated early if some inaccuracy is acceptable. Experimental results show that the proposed approaches significantly improve both the processing time and area-delay product compared to prior approaches.
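As a flavor of how low-discrepancy sequences yield deterministic, progressively precise bit streams, the sketch below multiplies two values by ANDing streams driven by a 2-D Halton point set (van der Corput sequences in bases 2 and 3); truncating the streams early still gives a usable estimate that sharpens with length. This is a simple construction in the same spirit, not necessarily the exact sequences used in the paper.

```python
def vdc(i, base=2):
    """Van der Corput low-discrepancy sequence value for index i in a given base."""
    v, denom = 0.0, 1.0
    while i:
        denom *= base
        i, rem = divmod(i, base)
        v += rem / denom
    return v

def multiply(a, b, n):
    """Deterministic stochastic-computing multiply: AND two bit streams whose
    i-th bits are [vdc(i,2) < a] and [vdc(i,3) < b] (a 2-D Halton point set).
    Any prefix of the streams is well distributed, so stopping early still
    gives a progressively refined estimate of a*b."""
    ones = sum(1 for i in range(n) if vdc(i, 2) < a and vdc(i, 3) < b)
    return ones / n

a, b = 0.3, 0.7
for n in (16, 64, 256, 1024):
    print(f"n={n:4d}  estimate={multiply(a, b, n):.4f}  exact={a*b:.4f}")
```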
Approximate computing has emerged as a design paradigm that allows hardware costs to be decreased by reducing the accuracy of the computation for applications that are robust against such errors. In Boolean logic approximation, the number of terms and literals of a logic function can be reduced by allowing erroneous outputs for some input combinations. This paper proposes a novel methodology for the approximation of multi-output logic functions. Related work on multi-output logic approximation minimizes each output function separately; in this paper, we show that this loses a huge optimization potential. As a remedy, our methodology considers the effect on all output functions when introducing errors, thus exploiting the cross-function minimization potential. Moreover, our approach is integrated into a design space exploration technique to obtain not only a single solution but a Pareto set of designs with different trade-offs between hardware costs (terms and literals) and error (number of minterms that have been falsified). Experimental results show that our technique is very efficient at exploring Pareto-optimal fronts. For some benchmarks, the number of terms could be reduced by up to 15% and the number of literals by up to 19% relative to an accurate function implementation, with degrees of inaccuracy around 0.1% w.r.t. accurate designs. Moreover, we show that the Pareto fronts obtained by our methodology dominate the results obtained when applying related work.
It is extremely challenging to deploy computing-intensive convolutional neural networks (CNNs) with rich parameters in mobile devices because of their limited computing resources and low power budgets. Although prior works build fast and energy-efficient CNN accelerators by greatly sacrificing test accuracy, mobile devices have to guarantee high CNN test accuracy for critical applications, e.g., unlocking phones by face recognition. In this paper, we propose a 3D XPoint ReRAM-based process-in-memory architecture, 3DICT, to provide various test accuracies to applications with different priorities by lookup-based CNN tests that dynamically exploit the trade-off between test accuracy and latency. Compared to state-of-the-art accelerators, on average, 3DICT improves the CNN test performance per Watt by 13% to 61X and guarantees 9-year endurance under various CNN test accuracy requirements.
Passive crossbar arrays of resistive random-access memory (RRAM) have shown great potential to meet the demands of future memory. By eliminating the access transistor per cell, the crossbar array achieves a higher memory density but introduces sneak currents, which incur extra energy waste and reliability issues. The complementary resistive switch (CRS), consisting of two anti-serially stacked memristors, is considered a promising solution to the sneak current problem. However, the destructive read of the CRS results in an additional recovery write operation, which strongly restricts its further adoption. Exploiting the dual CRS/memristor mode of CRS devices, we propose Aliens, a novel hybrid architecture for resistive random-access memory which introduces one alien cell (memristor mode) for each wordline in the crossbar to provide a practical hybrid memory without the operating system's intervention. Aliens draws advantages from both modes: the restrained sneak current of the CRS mode and the non-destructive read of the memristor mode. The simple and regular cell-mode organization of Aliens enables an energy-saving read method and an effective mode switching strategy called Lazy-Switch. By exploiting memory access locality, Lazy-Switch delays and merges the recovery write operations of the CRS mode. Due to fewer recovery write operations and negligible sneak currents, Aliens achieves improvements in energy, overall endurance, and access performance. The experimental results show that our design offers average energy savings of 13.9X compared with memristor-only memory, a memory lifetime 5.3X longer than CRS-only memory, and performance competitive with memristor-only memory.
The Internet of Things (IoT) has led to the emergence of big data. Processing this amount of data poses a challenge for current computing systems. Processing-in-memory (PIM) enables in-place computation, which reduces data movement, a major latency bottleneck in conventional systems. In this paper, we propose an in-memory implementation of fast and energy-efficient logic (FELIX) which combines the functionality of PIM with memories. To the best of the authors' knowledge, FELIX is the first PIM logic to enable single-cycle NOR, NOT, NAND, minority, and OR operations directly in crossbar memory. We exploit voltage-threshold-based memristors to enable single-cycle operations. Execution is purely in-memory: it neither reads out data nor changes the sense amplifiers, while preserving data in memory. We extend these single-cycle operations to implement more complex functions such as XOR and addition in memory with 2X lower latency than the fastest published PIM technique. We also increase the amount of in-memory parallelism in our design by segmenting bitlines using switches. To evaluate the efficiency of our design at the system level, we design a FELIX-based HyperDimensional (HD) computing accelerator. Our evaluation shows that, for all applications tested using HD, FELIX provides on average 128.8X speedup and 5,589.3X lower energy consumption as compared to an AMD GPU. FELIX HD also achieves on average 2.21X higher energy efficiency, 1.86X speedup, and 1.68X less memory as compared to the fastest PIM technique.
Building a high-performance FPGA accelerator for Deep Neural Networks (DNNs) often requires RTL programming, hardware verification, and precise resource allocation, all of which can be time-consuming and challenging even for seasoned FPGA developers. To bridge the gap between fast DNN construction in software (e.g., Caffe, TensorFlow) and slow hardware implementation, we propose DNNBuilder for building high-performance DNN hardware accelerators on FPGAs automatically. A number of novel techniques, including high-quality RTL neural network components, a fine-grained layer-based pipeline architecture, and a column-based cache scheme, are developed to meet the throughput and latency requirements of both cloud and edge devices, boosting throughput, reducing latency, and saving FPGA on-chip memory. To address the limited-resource challenge, we design an automatic design space exploration tool to generate optimized parallelism guidelines by considering external memory access bandwidth, data reuse behaviors, FPGA resource availability, and DNN complexity. DNNBuilder is demonstrated on four DNNs (Alexnet, ZF, VGG16, and YOLO) on two FPGAs (XC7Z045 and KU115), targeting edge and cloud computing, respectively. The fine-grained layer-based pipeline architecture and the column-based cache scheme contribute to 7.7x and 43x reductions in latency and BRAM utilization compared to conventional designs. We achieve the best performance (up to 5.15x faster) and efficiency (up to 5.88x more efficient) compared to published FPGA-based classification-oriented DNN accelerators for both edge and cloud computing cases. We reach 4218 GOPS for running an object detection DNN, which is the highest throughput reported to the best of our knowledge. DNNBuilder can provide millisecond-scale real-time performance for processing HD video input and deliver higher efficiency (up to 4.35x) than GPU-based solutions.
The rapid improvement in computation capability has made convolutional neural networks (CNNs) a great success in recent years on image classification tasks, which has also spurred the development of object detection algorithms with significantly improved accuracy. However, during the deployment phase, many applications demand low-latency processing of a single image under strict power constraints, which reduces the efficiency of GPUs and other general-purpose platforms and brings opportunities for dedicated acceleration hardware, e.g., FPGAs, that customize the digital circuit for the inference algorithm. This work therefore proposes to customize the detection algorithm, e.g., SSD, to benefit its hardware implementation with low data precision at the cost of marginal accuracy degradation. The proposed FPGA-based deep learning inference accelerator is demonstrated on two Intel FPGAs for the SSD algorithm, achieving up to 2.18 TOPS throughput and up to 3.3X superior energy efficiency compared to a GPU.
FPGAs have been increasingly used as reconfigurable hardware accelerators for applications leveraging convolutional neural networks (CNNs) in recent years. Previous designs normally adopt a uniform accelerator architecture that processes all layers of a given CNN model one after another. This homogeneous design methodology usually suffers from dynamic resource underutilization due to the tensor shape diversity of different layers. As a result, designs equipped with heterogeneous accelerators specific to different layers were proposed to resolve this issue. However, existing heterogeneous designs sacrifice latency for throughput by concurrent execution of multiple input images on different accelerators. In this paper, we propose an architecture named Tile-Grained Pipeline Architecture (TGPA) for low-latency CNN inference. TGPA adopts a heterogeneous design which supports pipelined execution of multiple tiles within a single input image on multiple heterogeneous accelerators. The accelerators are partitioned onto different FPGA dies to guarantee high frequency. A partition strategy is designed to maximize on-chip resource utilization. Experimental results show that TGPA designs for different CNN models achieve up to 40% performance improvement over homogeneous designs, and 3X latency reduction over state-of-the-art designs.
Reliance on off-site untrusted fabrication facilities has given rise to several threats such as intellectual property (IP) piracy, overbuilding, and hardware Trojans. Logic locking is a promising defense against such malicious activities that is enforced at the silicon layer. Over the past decade, several logic locking defenses and attacks have been presented, thereby enhancing the state of the art. Nevertheless, there has been little research aiming to demonstrate the applicability of logic locking to large-scale multi-million-gate industrial designs consisting of multiple IP blocks with different security requirements. In this work, we take on this challenge and successfully lock a multi-million-gate system-on-chip (SoC) provided by DARPA, taking it all the way to GDSII layout. We analyze how specific features, constraints, and security requirements of an IP block can be leveraged to lock its functionality in the most appropriate way. We show that the blocks of an SoC can be locked in a customized manner at 0.5%, 15.3%, and 1.5% chip-level overhead in power, performance, and area, respectively.
With the advent of many-core systems, use cases of embedded systems have become more dynamic: plenty of applications are concurrently executed, but may dynamically be exchanged and modified even after deployment. Moreover, resources may temporarily or permanently become unavailable because of thermal aspects, dynamic power management, or the occurrence of faults. This poses new challenges for reaching objectives like timeliness for real-time execution, performance for best-effort execution, and maximal system utilization. In this work, we first focus on dynamic management schemes for reliability/aging optimization under thermal constraints. The reliability of on-chip systems in the current and upcoming technology nodes is continuously degrading with every new generation because transistor scaling is approaching its fundamental limits. Protecting systems against degradation effects such as circuit aging comes with considerable losses in efficiency. We demonstrate in this work why sustaining reliability while maximizing the utilization of available resources, and hence avoiding efficiency loss, is quite challenging; this holds even more when thermal constraints come into play. Then, we discuss techniques for run-time management of multiple applications which sustain real-time properties. Our solution relies on hybrid application mapping, denoting the combination of design-time analysis with run-time application mapping. We present a method for Real-time Mapping Reconfiguration (RMR) which enables the Run-Time Manager (RM) to execute real-time applications even in the presence of dynamic thermal- and reliability-aware resource management.
This paper is part of the ICCAD 2018 Special Session on "Managing Heterogeneous Many-cores for High-Performance and Energy-Efficiency". The other two papers of this Special Session are [1] and [2].
Energy efficiency and performance of heterogeneous multiprocessor systems-on-chip (SoC) depend critically on utilizing a diverse set of processing elements and managing their power states dynamically. Dynamic resource management techniques typically rely on power consumption and performance models to assess the impact of dynamic decisions. Despite the importance of these decisions, many existing approaches rely on fixed power and performance models learned offline. This paper presents an online learning framework to construct adaptive analytical models. We illustrate this framework for modeling GPU frame processing time, GPU power consumption and SoC power-temperature dynamics. Experiments on Intel Atom E3826, Qualcomm Snapdragon 810, and Samsung Exynos 5422 SoCs demonstrate that the proposed approach achieves less than 6% error under dynamically varying workloads.
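The abstract does not specify the learning rule, but a common way to build such adaptive analytical models online is recursive least squares over a linear feature model. The sketch below is a minimal illustration under that assumption, with hypothetical counter features and synthetic telemetry standing in for measured power.

```python
import numpy as np

class RLSModel:
    """Recursive least-squares update of a linear power/performance model y ~ w.x."""
    def __init__(self, n_features, forgetting=0.99):
        self.w = np.zeros(n_features)
        self.P = np.eye(n_features) * 1e3     # inverse-covariance estimate
        self.lam = forgetting                 # discounts stale samples under workload drift

    def predict(self, x):
        return float(self.w @ x)

    def update(self, x, y):
        # standard RLS gain and covariance update
        Px = self.P @ x
        k = Px / (self.lam + x @ Px)
        self.w += k * (y - self.w @ x)
        self.P = (self.P - np.outer(k, Px)) / self.lam

rng = np.random.default_rng(0)
true_w = np.array([2.0, 1.5, 0.3, 0.8])       # hypothetical ground-truth coefficients
model = RLSModel(n_features=4)
for _ in range(500):
    x = np.append(rng.random(3), 1.0)         # [utilization, frequency, bus activity, bias]
    y = true_w @ x + rng.normal(scale=0.05)   # "measured" power with sensor noise
    model.update(x, y)
print(model.w)                                # converges toward true_w as samples arrive
```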
The widespread adoption of big data has led to the search for high-performance and low-power computational platforms. Emerging heterogeneous manycore processing platforms consisting of CPU and GPU cores along with various types of accelerators offer power and area-efficient trade-offs for running these applications. However, heterogeneous manycore architectures need to satisfy the communication and memory requirements of the diverse computing elements that conventional Network-on-Chip (NoC) architectures are unable to handle effectively. Further, with increasing system sizes and level of heterogeneity, it becomes difficult to quickly explore the large design space and establish the appropriate design trade-offs. To address these challenges, machine learning-inspired heterogeneous manycore system design is a promising research direction to pursue. In this paper, we highlight various salient features of heterogeneous manycore architectures enabled by emerging interconnect technologies and machine learning techniques.
Multi-cell spacing constraints arise due to aggressive scaling and manufacturing issues. For example, multi-cell spacing constraints can be incorporated to address the pin accessibility problem in sub-10nm nodes. This work studies detailed placement considering multi-cell spacing constraints. A naive approach is to model each multi-cell spacing constraint as a set of 2-cell spacing constraints, but the resulting total cell displacement would be much larger than necessary. We thus aim to tackle this problem and propose a practical multi-cell method. First, we analyze the initial layout to determine which cell pair in each multi-cell spacing constraint is the easiest to break apart. Second, we apply a single-row dynamic programming (SRDP)-based method, called Intra-Row Move (IRM), one row at a time to resolve a majority of violations while minimizing the total cell displacement or wirelength increase. With cell virtualization and movable region computation techniques, our IRM can be easily extended to handle mixed cell-height designs with only a slight modification of the cost computation in the SRDP method. Finally, we apply an integer linear programming-based method called Global Move (GM) to resolve the remaining violations. Experimental results indicate that our multi-cell method is much better than a 2-cell method in both solution quality and runtime.
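As a toy illustration of what a single-row dynamic program over cells and placement sites can look like (not the paper's IRM algorithm), the sketch below places an ordered row of cells on integer sites while enforcing assumed minimum gaps between adjacent cells and minimizing total displacement from the original positions.

```python
import numpy as np

def intra_row_dp(orig_x, widths, num_sites, min_gap):
    """Toy single-row DP: keep the given cell order, place each cell on an integer
    site, enforce min_gap[i] extra sites between cell i and cell i+1, and
    minimize total displacement from the original positions orig_x."""
    n, INF = len(orig_x), float("inf")
    cost = np.full((n, num_sites), INF)
    parent = np.zeros((n, num_sites), dtype=int)
    cost[0] = [abs(s - orig_x[0]) for s in range(num_sites)]
    for i in range(1, n):
        best, best_s = INF, 0
        for s in range(num_sites):
            limit = s - widths[i - 1] - min_gap[i - 1]   # rightmost legal site of cell i-1
            if limit < 0:
                continue
            if cost[i - 1, limit] < best:                # running prefix minimum
                best, best_s = cost[i - 1, limit], limit
            if best < INF:
                cost[i, s] = best + abs(s - orig_x[i])
                parent[i, s] = best_s
    s = int(np.argmin(cost[-1]))                         # best site of the last cell
    placed = [0] * n
    for i in range(n - 1, -1, -1):                       # backtrack the chosen sites
        placed[i] = s
        if i > 0:
            s = parent[i, s]
    return placed, float(cost[-1].min())

# three unit-width cells at sites 2, 3, 4; cells 0 and 1 need 2 extra sites between them
print(intra_row_dp(orig_x=[2, 3, 4], widths=[1, 1, 1], num_sites=12, min_gap=[2, 0]))
# -> ([0, 3, 4], 2.0): a naive pairwise push-apart would displace the cells more
```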
Along with device scaling, the drain-to-drain abutment (DDA) constraint arises as an emerging challenge in modern circuit designs, which incurs additional difficulties especially for designs with mixed-cell-height standard cells, which have prevailed in advanced technologies. This paper presents the first work to address the mixed-cell-height placement problem considering the DDA constraint, from post global placement throughout detailed placement. Our algorithm consists of three major stages: (1) DDA-aware preprocessing, (2) legalization, and (3) detailed placement. In the DDA-aware preprocessing stage, we first align cells to desired rows, considering the distribution ratio of source nodes to drain nodes. After deciding the cell ordering of every row, we adopt the modulus-based matrix splitting iteration method to remove all cell overlaps with minimum total displacement in the legalization stage. For detailed placement, we propose a satisfiability-based approach which considers the whole layout to flip a subset of cells and swap pairs of adjacent cells simultaneously. Compared with a shortest-path method, experimental results show that our proposed algorithm can significantly reduce cell violations and displacements with reasonable runtime.
Mixed-cell-height circuits have become popular in advanced technologies for better power, area, routability, and performance trade-offs. With the technology and region constraints imposed by modern circuit designs, the mixed-cell-height legalization problem has become more challenging. In this paper, we present an effective and efficient legalization algorithm for mixed-cell-height circuit designs with technology and region constraints. We first present a fence region handling technique to unify the fence regions and the default ones. To obtain a desired cell assignment, we then propose a movement-aware cell reassignment method by iteratively reassigning cells in locally dense areas to their desired rows. After cell reassignment, a technology-aware legalization is presented to remove cell overlaps while satisfying the technology constraints. Finally, we propose a technology-aware refinement to further reduce the average and maximum cell movements without increasing the technology constraints violations. Compared with the champion of the 2017 ICCAD CAD Contest and the state-of-the-art work, experimental results show that our algorithm achieves the best average and maximum cell movements and significantly fewer technology constraint violations, in a comparable runtime.
Mixed-cell-height standard cells are prevalently used in advanced technologies to achieve better design trade-offs among timing, power, and routability. As feature size decreases, placement of cells with multiple threshold voltages may violate the complex minimum-implant-area (MIA) layer rules arising from the limitations of patterning technologies. Existing works consider the mixed-cell-height placement problem only during legalization, or handle the MIA constraints during detailed placement. In this paper, we address the mixed-cell-height placement problem with MIA constraints in two major stages: post global placement and MIA-aware legalization. In the post global placement stage, we first present a continuous and differentiable cost function to address the Vdd/Vss alignment constraints, and add weighted pseudo nets to MIA-violating cells dynamically. Then, we propose a proximal optimization method based on the given global placement result to simultaneously consider Vdd/Vss alignment constraints, MIA constraints, cell distribution, cell displacement, and total wirelength. In the MIA-aware legalization stage, we develop a graph-based method to cluster cells of specific threshold voltages, and apply strip-packing-based binary linear programming to reshape cells. Then, we propose a matching-based technique to resolve intra-row MIA violations and reduce filler insertion. Furthermore, we formulate inter-row MIA-aware legalization as a quadratic programming problem, which is efficiently solved by a modulus-based matrix splitting iteration method. Finally, MIA-aware cell allocation and refinement are performed to further improve the result. Experimental results show that, without any extra area overhead, our algorithm still achieves 8.5% shorter final total wirelength than the state-of-the-art work.
RAPID is a low-overhead critical-word-first read acceleration architecture for improved performance and endurance in MLC/TLC non-volatile memories (NVMs). RAPID encodes the critical words in a cache line using only the most significant bits (MSbs) of the MLC/TLC NVM cells. Since the MSbs of an NVM cell can be decoded using a single read strobe, the data (i.e., critical words) encoded using the MSbs can be decoded with low latency. System-level SPEC CPU2006 workload evaluations of a TLC RRAM architecture show that RAPID improves read latency by 21%, energy by 24%, and endurance by 2-4x over state-of-the-art striped NVM.
FPGAs that utilize via-switches, a kind of nonvolatile resistive RAM, for crossbar implementation are attracting attention due to higher integration density and performance. However, programming via-switches arbitrarily in a crossbar is not trivial, since a programming current must be provided through signal wires that are shared by multiple via-switches. Consequently, depending on the previous programming status in sequential programming, unintentional switch programming may occur due to signal detours, which is called the sneak path problem. This problem interferes with the reconfiguration of via-switch FPGAs, and hence countermeasures for the sneak path problem are indispensable. This paper identifies the circuit status that causes the sneak path problem and proposes a sneak path avoidance method that gives a sneak-path-free programming order of via-switches in a crossbar. We prove that a sneak-path-free programming order necessarily exists for arbitrary on-off patterns in a crossbar as long as no loops exist, and also validate the proof and the proposed method with simulation-based evaluation. Thanks to the proposed method, any practical configuration of a via-switch FPGA can be successfully programmed without the sneak path problem.
Convolutional Neural Networks (CNNs) play a vital role in machine learning. CNNs are typically both computing and memory intensive. Emerging resistive random-access memories (RRAMs) and RRAM crossbars have demonstrated great potential in boosting the performance and energy efficiency of CNNs. Compared with small crossbars, large crossbars show better energy efficiency with less interface overhead. However, conventional workload mapping methods for small crossbars cannot make full use of the computation ability of large crossbars. In this paper, we propose an Overlapped Mapping Method (OMM) and a MIxed Size Crossbar based RRAM CNN Accelerator (MISCA) to solve this problem. MISCA with OMM can reduce the energy consumption caused by the interface circuits and improve the parallelism of computation by leveraging the idle RRAM cells in crossbars. Simulation results show that MISCA with OMM achieves 2.7x speedup, 30% utilization rate improvement, and 1.2x energy efficiency improvement on average compared with a fixed-size-crossbar-based accelerator using the conventional mapping method. In comparison with a GPU platform, MISCA with OMM achieves on average 490.4x higher energy efficiency and 20x higher speedup. Compared with PRIME, an existing RRAM-based accelerator, MISCA achieves 26.4x speedup and 1.65x energy efficiency improvement.
We propose an efficient Ising processor with approximated parallel tempering (IPAPT) implemented on an FPGA. Hardware-friendly approximations of the components of parallel tempering (PT) are proposed to enhance solution quality with low hardware overhead. Multiple replicas of Ising states with different temperatures run in parallel by sharing a single network structure, and the replicas are exchanged based on an approximated energy evaluation. The application of PT substantially improves the quality of optimization solutions. Experimental results on various max-cut problems show that the use of PT significantly increases the probability of obtaining optimal solutions, and that IPAPT obtains optimal solutions two orders of magnitude faster than a software solver.
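For readers unfamiliar with parallel tempering, the sketch below runs a plain software version on a small random Ising instance: Metropolis sweeps at several temperatures plus replica exchanges with the standard acceptance rule. The instance, temperature ladder, and sweep counts are arbitrary, and the hardware-friendly approximations described above are not modeled.

```python
import numpy as np

rng = np.random.default_rng(1)

def energy(J, s):
    """Ising energy E = -1/2 * s^T J s (J symmetric, zero diagonal, no field)."""
    return -0.5 * s @ J @ s

def metropolis_sweep(J, s, beta):
    """One sweep of single-spin-flip Metropolis updates at inverse temperature beta."""
    for i in range(len(s)):
        dE = 2.0 * s[i] * (J[i] @ s)          # energy change of flipping spin i
        if dE <= 0 or rng.random() < np.exp(-beta * dE):
            s[i] = -s[i]
    return s

def parallel_tempering(J, betas, sweeps=200):
    n = J.shape[0]
    replicas = [rng.choice([-1, 1], size=n) for _ in betas]
    for _ in range(sweeps):
        for s, beta in zip(replicas, betas):
            metropolis_sweep(J, s, beta)
        # attempt replica exchange between neighbouring temperatures
        for k in range(len(betas) - 1):
            dB = betas[k + 1] - betas[k]
            dE = energy(J, replicas[k + 1]) - energy(J, replicas[k])
            if rng.random() < min(1.0, np.exp(dB * dE)):
                replicas[k], replicas[k + 1] = replicas[k + 1], replicas[k]
    best = min(replicas, key=lambda s: energy(J, s))
    return best, energy(J, best)

# tiny random max-cut-style instance (hypothetical couplings)
n = 16
J = rng.choice([-1.0, 0.0, 1.0], size=(n, n))
J = np.triu(J, 1); J = J + J.T
state, e = parallel_tempering(J, betas=np.linspace(0.1, 2.0, 6))
print(e)
```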
Deep neural networks (DNNs) are known to be vulnerable to adversarial attacks. That is, adversarial examples, obtained by adding delicately crafted distortions onto original legal inputs, can mislead a DNN to classify them as any target label. This work provides a solution for hardening DNNs under adversarial attacks through defensive dropout. Besides using dropout during training for the best test accuracy, we propose to use dropout also at test time to achieve strong defense effects. We consider the problem of building robust DNNs as an attacker-defender two-player game, where the attacker and the defender know each other's strategies and try to optimize their own strategies towards an equilibrium. Based on the observed effect of the test dropout rate on test accuracy and attack success rate, we propose a defensive dropout algorithm to determine an optimal test dropout rate given the neural network model and the attacker's strategy for generating adversarial examples. We also investigate the mechanism behind the outstanding defense effects achieved by the proposed defensive dropout. Compared with stochastic activation pruning (SAP), another defense method that introduces randomness into the DNN model, we find that our defensive dropout achieves much larger variances of the gradients, which is the key to the improved defense effects (much lower attack success rate). For example, our defensive dropout can reduce the attack success rate from 100% to 13.89% under the currently strongest attack, i.e., the C&W attack, on the MNIST dataset.
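A minimal sketch of the test-time-dropout idea, assuming a small two-layer classifier with hypothetical weights: dropout masks are sampled during inference, so repeated forward passes on the same input are stochastic, which is what blurs the gradient information an attacker relies on. This is a conceptual illustration, not the paper's model or tuning procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, W, b, drop_rate=0.0):
    """Fully connected layer with ReLU and (optionally) dropout applied at test time."""
    h = np.maximum(0.0, x @ W + b)
    if drop_rate > 0.0:
        mask = rng.random(h.shape) >= drop_rate        # keep each unit w.p. 1 - p
        h = h * mask / (1.0 - drop_rate)               # inverted-dropout scaling
    return h

def predict(x, params, test_drop_rate):
    """Stochastic forward pass: randomness at inference time is the defense knob."""
    h = dense(x, params["W1"], params["b1"], drop_rate=test_drop_rate)
    return h @ params["W2"] + params["b2"]

# hypothetical pretrained weights for a 784-100-10 classifier
params = {"W1": rng.normal(size=(784, 100)) * 0.05, "b1": np.zeros(100),
          "W2": rng.normal(size=(100, 10)) * 0.05,  "b2": np.zeros(10)}
x = rng.random((1, 784))
# the test dropout rate is a tunable parameter traded off against clean accuracy
votes = [int(predict(x, params, test_drop_rate=0.3).argmax()) for _ in range(10)]
print(votes)   # predictions vary across passes because of the test-time masks
```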
Human activity recognition (HAR) has attracted significant research interest due to its applications in health monitoring and patient rehabilitation. Recent research on HAR focuses on using smartphones due to their widespread use. However, this leads to inconvenient use, limited choice of sensors and inefficient use of resources, since smartphones are not designed for HAR. This paper presents the first HAR framework that can perform both online training and inference. The proposed framework starts with a novel technique that generates features using the fast Fourier and discrete wavelet transforms of a textile-based stretch sensor and accelerometer data. Using these features, we design a neural network classifier which is trained online using the policy gradient algorithm. Experiments on a low power IoT device (TI-CC2650 MCU) with nine users show 97.7% accuracy in identifying six activities and their transitions with less than 12.5 mW power consumption.
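A small sketch of the kind of window-level feature generation described above, assuming a 128-sample sensor window: a few low-frequency FFT magnitudes plus statistics of a single-level Haar wavelet decomposition. The window length, feature choices, and synthetic signal are illustrative assumptions, not the paper's exact feature set.

```python
import numpy as np

def haar_dwt(x):
    """Single-level Haar DWT: pairwise averages (approximation) and differences (detail)."""
    x = x[: len(x) // 2 * 2].reshape(-1, 2)
    return (x[:, 0] + x[:, 1]) / np.sqrt(2), (x[:, 0] - x[:, 1]) / np.sqrt(2)

def window_features(segment):
    """Features for one sensor window: low-frequency FFT magnitudes plus
    summary statistics of the Haar approximation/detail coefficients."""
    spectrum = np.abs(np.fft.rfft(segment))
    approx, detail = haar_dwt(segment)
    return np.concatenate([
        spectrum[:8],                                  # dominant low-frequency content
        [approx.mean(), approx.std(), detail.mean(), detail.std()],
    ])

# hypothetical 128-sample window from a stretch sensor
rng = np.random.default_rng(0)
segment = np.sin(np.linspace(0, 6 * np.pi, 128)) + 0.1 * rng.standard_normal(128)
print(window_features(segment).shape)    # (12,) feature vector fed to the classifier
```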
The Micro-electrode-dot-array (MEDA) is a next-generation digital microfluidic biochip (DMFB) platform that supports fine-grained control and real-time sensing of droplet movements. These capabilities permit continuous monitoring and checkpoint-based validation of assay execution on MEDA. This paper presents a class of "shadow attacks" that abuse the timing slack in the assay execution. State-of-the-art checkpoint-based validation techniques cannot expose the shadow operations. We develop a defense that introduces extra checkpoints in the assay execution at time instances when the assay is prone to shadow attacks. Experiments confirm the effectiveness and practicality of the defense.
Blockchain provides decentralized consensus in large, open networks without a trusted authority, making it a promising solution for the Internet of Things (IoT) to distribute verifiable data, such as firmware updates. However, verifying data integrity and consensus on a linearly growing blockchain quickly exceeds memory and processing capabilities of embedded systems.
As a remedy, we propose a generic blockchain extension that enables highly constrained devices to verify the inclusion and integrity of any block within a blockchain. Instead of traversing block by block, we construct a LeapChain that reduces verification steps without weakening the integrity guarantees of the blockchain. Applied to Proof-of-Work blockchains, our scheme can be used to verify consensus by proving a certain amount of work on top of a block.
Our analytical and experimental results show that, compared to existing approaches, only LeapChain provides deterministic and tight upper bounds on the memory requirements in the kilobyte range, significantly extending the possibilities of blockchain application on embedded IoT devices.
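For context, the sketch below shows the baseline that a leap-based scheme aims to shortcut: verifying a block by walking hash links back from the chain tip, whose cost grows linearly with the distance to the target block. The header format and payload digests are toy assumptions; the actual LeapChain construction and its sparse shortcut links are not reproduced here.

```python
import hashlib

def block_hash(header):
    """Hash a simplified block header (toy stand-in for a real block format)."""
    data = f"{header['height']}|{header['prev_hash']}|{header['payload_digest']}"
    return hashlib.sha256(data.encode()).hexdigest()

def make_chain(n):
    """Build a toy chain of n headers linked by prev_hash."""
    chain, prev = [], "0" * 64
    for h in range(n):
        header = {"height": h, "prev_hash": prev, "payload_digest": f"fw-update-{h}"}
        chain.append(header)
        prev = block_hash(header)
    return chain

def verify_back_to(chain, target_height):
    """Walk prev_hash links from the tip down to target_height, recomputing hashes.
    Cost is linear in the distance; a leap/shortcut scheme bounds this walk."""
    expected_prev = chain[-1]["prev_hash"]
    for header in reversed(chain[:-1]):
        if block_hash(header) != expected_prev:
            return False                      # link broken: integrity violated
        if header["height"] == target_height:
            return True
        expected_prev = header["prev_hash"]
    return False

chain = make_chain(100)
print(verify_back_to(chain, target_height=40))   # True; tampering with any block breaks it
```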
Convolutional neural networks (CNNs) are of increasing widespread use in robotics, especially for object recognition. However, such CNNs still lack several critical properties necessary for robots to properly perceive and function autonomously in uncertain, and potentially adversarial, environments. In this paper, we investigate factors for accurate, reliable, and resource-efficient object and pose recognition suitable for robotic manipulation in adversarial clutter. Our exploration is in the context of a three-stage pipeline of discriminative CNN-based recognition, generative probabilistic estimation, and robot manipulation. This pipeline proposes using a SAmpling Network Density filter, or SAND filter, to recover from potentially erroneous decisions produced by a CNN through generative probabilistic inference. We present experimental results from SAND filter perception for robotic manipulation in tabletop scenes with both benign and adversarial clutter. These experiments vary CNN model complexity for object recognition and evaluate levels of inaccuracy that can be recovered by generative pose inference. This scenario is extended to consider adversarial environmental modifications with varied lighting, occlusions, and surface modifications.
Advancements in machine learning have led to its adoption in numerous applications ranging from computer vision to security. Despite these advancements, the vulnerabilities of machine learning techniques are being exploited as well. Adversarial samples are samples generated by adding delicately crafted perturbations to normal input samples. This work presents an overview of different techniques to generate adversarial samples and of defenses that make classifiers robust. Furthermore, adversarial learning and its effective utilization to enhance robustness, along with the required constraints, are demonstrated experimentally, achieving up to 97.65% accuracy even against the C&W attack. Although adversarial learning enhances robustness, we also show that it can be further exploited for vulnerabilities.
The majority function <xyz> evaluates to true if at least two of its three Boolean inputs evaluate to true. The majority function has frequently been studied as a central primitive in logic synthesis applications for many decades. Knuth refers to the majority function in the last volume of his seminal The Art of Computer Programming as "probably the most important ternary operation in the entire universe." Majority logic synthesis has recently regained significant interest in the design automation community due to emerging nanotechnologies which operate based on the majority function. In addition, majority logic synthesis has successfully been employed in CMOS-based applications such as standard cell or FPGA mapping.
This tutorial gives a broad introduction into the field of majority logic synthesis. It will review fundamental results and describe recent contributions from theory, practice, and applications.
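As a concrete reference, the snippet below spells out the three-input majority function, its two-level form xy + xz + yz, and the fact that AND and OR are obtained by fixing one input to 0 or 1, which is part of why it is so useful as a synthesis primitive.

```python
from itertools import product

def maj(x, y, z):
    """Three-input majority <xyz>: true iff at least two inputs are true.
    Equivalent two-level form: xy + xz + yz; the function is monotone and self-dual."""
    return (x & y) | (x & z) | (y & z)

# definition check against the "at least two ones" counting form
for x, y, z in product([0, 1], repeat=3):
    assert maj(x, y, z) == int(x + y + z >= 2)

# AND and OR are special cases obtained by fixing one input
assert all(maj(x, y, 0) == (x & y) and maj(x, y, 1) == (x | y)
           for x, y in product([0, 1], repeat=2))
```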
Early routability prediction helps designers and tools perform preventive measures so that design rule violations can be avoided proactively. However, it is a huge challenge to build a predictor that is both accurate and fast. In this work, we study how to leverage convolutional neural networks to address this challenge. The proposed method, called RouteNet, can either evaluate the overall routability of cell placement solutions without global routing or predict the locations of DRC (Design Rule Checking) hotspots. In both cases, large macros in mixed-size designs are taken into consideration. Experiments on benchmark circuits show that RouteNet can forecast overall routability with accuracy similar to that of a global router while using substantially less runtime. For DRC hotspot prediction, RouteNet improves accuracy by 50% compared to global routing. It also significantly outperforms other machine learning approaches such as support vector machines and logistic regression.
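A minimal sketch of the general idea of a fully convolutional hotspot predictor, assuming stacked placement feature maps as input and a per-tile score map as output; the channel counts, layer depth, and feature choices are illustrative and are not RouteNet's actual architecture.

```python
import torch
import torch.nn as nn

class HotspotNet(nn.Module):
    """Tiny fully convolutional model: placement feature maps in, per-tile hotspot logits out."""
    def __init__(self, in_channels=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, kernel_size=1),        # one hotspot score per tile
        )

    def forward(self, x):
        return self.net(x)

# hypothetical input channels: pin density, macro mask, cell density, routing capacity
features = torch.randn(1, 4, 128, 128)
scores = HotspotNet()(features)                      # shape (1, 1, 128, 128)
print(scores.shape)
```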
Detailed routing is a dead-or-alive critical element in design automation tooling for advanced node enablement. However, very few works address detailed routing in the recent open literature, particularly in the context of modern industrial designs and a complete, end-to-end flow. The ISPD-2018 Initial Detailed Routing Contest addressed this gap for modern industrial designs, using a reduced design rules set. In this work, we present TritonRoute, an initial detailed router for the ISPD-2018 contest. Given route guides from global routing, the initial detailed routing stage should generate a detailed routing solution honoring the route guides as much as possible, while minimizing wirelength, via count and various design rule violations. In our work, the key contribution is intra-layer parallel routing, where we partition each layer into parallel panels and route each panel using an Integer Linear Programming-based algorithm. We sequentially route layer by layer from the bottom to the top. We evaluate our router using the official ISPD-2018 benchmark suite and show that we reduce the contest metric by up to 74%, and on average 50%, compared to the first-place routing solution for each testcase.
Detailed routing is the most complicated and time-consuming stage in VLSI design and has become a critical process for advanced node enablement. To handle the high complexity of modern detailed routing, initial detailed routing is often employed to minimize design-rule violations to facilitate final detailed routing, even though it is still not violation-free after initial routing. This paper presents a novel initial detailed routing algorithm to consider industrial design-rule constraints and optimize the total wirelength and via count. Our algorithm consists of three major stages: (1) an effective pin-access point generation method to identify valid points to model a complex pin shape, (2) a via-aware track assignment method to minimize the overlaps between assigned wire segments, and (3) a detailed routing algorithm with a novel negotiation-based rip-up and re-route scheme that enables multithreading and honors global routing information while minimizing design-rule violations. Experimental results show that our router outperforms all the winning teams of the 2018 ACM ISPD Initial Detailed Routing Contest, where the top-3 routers result in 23%, 52%, and 1224% higher costs than ours.
The multi-layer obstacle-avoiding rectilinear Steiner minimal tree (ML-OARSMT) problem has been extensively studied in recent years. In this work, we consider a variant of the ML-OARSMT problem and extend its applicability to a net open location finder. Since ECOs or router limitations may cause open nets, we develop a framework to detect and reconnect existing nets to resolve the net opens. Different from prior connection-graph-based approaches, we propose a technique that applies efficient Boolean operations to repair net opens. Our method has good quality and scalability and is highly parallelizable. Compared with the results of the ICCAD-2017 contest, we show that our proposed algorithm achieves the smallest cost with a 4.81X average speedup over the top-3 winners.
Neural networks (NNs) are key to deep learning systems. Their efficient hardware implementation is crucial to applications at the edge. Binarized NNs (BNNs), where the weights and output of a neuron are of binary values {-1, +1} (or encoded in {0,1}), have been proposed recently. As no multiplier is required, they are particularly attractive and suitable for hardware realization. Most prior NN synthesis methods target hardware architectures with neural processing elements (NPEs), where the weights of a neuron are loaded and the output of the neuron is computed. The load-and-compute method, though area efficient, requires expensive memory access, which deteriorates energy and performance efficiency. In this work we aim at synthesizing BNN dense layers into dedicated logic circuits. We formulate the corresponding matrix covering problem and propose a scalable algorithm to reduce the area and routing cost of BNNs. Experimental results justify the effectiveness of the method in terms of area and net savings on FPGA implementation. Our method provides an alternative implementation of BNNs, and can be applied in combination with NPE-based implementations for area, speed, and power trade-offs.
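For intuition on why no multipliers are needed, the sketch below evaluates a binarized dense layer with the standard XNOR-popcount identity: for {-1,+1} inputs and weights, the dot product equals twice the number of sign matches minus the vector length. The layer sizes here are arbitrary.

```python
import numpy as np

def binarize(x):
    """Map real values to {-1, +1}."""
    return np.where(x >= 0, 1, -1).astype(np.int8)

def bnn_dense(x_bin, W_bin):
    """Binarized dense layer: with inputs/weights in {-1,+1} encoded as bits,
    each dot product is an XNOR followed by a popcount, no multipliers needed."""
    n = x_bin.shape[0]
    x_bits = (x_bin > 0)
    W_bits = (W_bin > 0)
    matches = np.count_nonzero(x_bits[None, :] == W_bits, axis=1)   # popcount of XNOR
    return 2 * matches - n                                          # = sum of +/-1 products

rng = np.random.default_rng(0)
x = binarize(rng.standard_normal(64))
W = binarize(rng.standard_normal((10, 64)))
assert np.array_equal(bnn_dense(x, W), W @ x)      # matches the +/-1 arithmetic exactly
```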
Threshold logic functions have gained revived attention due to their connection to the neural networks employed in deep learning. Despite prior endeavors in the characterization of threshold logic functions, to the best of our knowledge, the quest for a canonical representation of threshold logic functions in the form of their realizing linear inequalities remains open. In this paper we devise a procedure to canonicalize a threshold logic function such that two threshold logic functions are equivalent if and only if their canonicalized linear inequalities are the same. We further strengthen the canonicity to ensure that symmetric variables of a threshold logic function receive the same weight in the canonicalized linear inequality. The canonicalization procedure invokes O(m) queries to a linear programming (resp. integer linear programming) solver when a linear inequality solution with fractional (resp. integral) weight and threshold values is to be found, where m is the number of symmetry groups of the given threshold logic function. The guaranteed canonicity allows direct application to the classification of NP (input negation, input permutation) and NPN (input negation, input permutation, output negation) equivalence of threshold logic functions. It may thus enable applications such as equivalence checking, Boolean matching, and library construction for threshold circuit synthesis.
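The canonicalization procedure above is built on linear programming queries; as background, the sketch below shows the basic form such a query can take, assuming a unit separation margin: a single LP feasibility check that returns weights and a threshold if the function is threshold, and reports failure otherwise (e.g., for XOR). It is not the paper's canonicalization itself.

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

def threshold_realization(f, n):
    """Find weights w and threshold T with f(x)=1 iff w.x >= T (unit separation
    margin on the offset), via one LP feasibility query; None if f is not threshold."""
    A_ub, b_ub = [], []
    for x in product([0, 1], repeat=n):
        x = np.array(x, dtype=float)
        if f(x):
            A_ub.append(np.append(-x, 1.0)); b_ub.append(0.0)    # enforce w.x >= T
        else:
            A_ub.append(np.append(x, -1.0)); b_ub.append(-1.0)   # enforce w.x <= T - 1
    res = linprog(c=np.zeros(n + 1), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * (n + 1))
    return (res.x[:n], res.x[n]) if res.success else None

maj = lambda x: int(x.sum() >= 2)                 # 3-input majority is a threshold function
xor = lambda x: int(x.sum() % 2 == 1)             # XOR is not
print(threshold_realization(maj, 3))
print(threshold_realization(xor, 2))              # None
```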
Approximate computing is an emerging paradigm for error-tolerant applications. By introducing a reasonable amount of inaccuracy, both the area and delay of a circuit can be reduced significantly. To synthesize approximate circuits automatically, many approximate logic synthesis (ALS) algorithms have been proposed. However, they mainly focus on area reduction and are not optimal in reducing the delay of the circuits. In this paper, we propose DALS, a delay-driven ALS framework. DALS works on the AND-inverter graph (AIG) representation of a circuit. It supports a wide range of approximate local changes and some commonly-used error metrics, including error rate and mean error distance. In order to select an optimal set of nodes in the AIG to apply approximate local changes, DALS establishes a critical error network (CEN) from the AIG and formulates a maximum flow problem on the CEN. Our experimental results on a wide range of benchmarks show that DALS produces approximate circuits with significantly reduced delays.
Parallel computing is a trend for enhancing the scalability of electronic design automation (EDA) tools using widely available multicore platforms. In order to benefit from parallelism, well-known EDA algorithms have to be reformulated and optimized for multicore implementation. This paper introduces a set of principles to enable fine-grained parallel AND-inverter graph (AIG) rewriting. It presents a novel method to discover and rewrite parts of the AIG in parallel, without the need for graph partitioning. Experiments show that, when synthesizing large designs composed of millions of AIG nodes, the parallel rewriting on 40 physical cores is up to 36x and 68x faster than the ABC commands rewrite -l and drw, respectively, with comparable quality of results in terms of AIG size and depth.
Specialized hardware accelerators are being increasingly integrated into today's computer systems to achieve improved performance and energy efficiency. However, the resulting variety and complexity make it challenging to ensure the security of these accelerators. To mitigate complexity while guaranteeing security, we propose a high-level synthesis (HLS) infrastructure that incorporates static information flow analysis to enforce security policies on HLS-generated hardware accelerators. Our security-constrained HLS infrastructure is able to effectively identify both explicit and implicit information leakage. By detecting the security vulnerabilities at the behavioral level, our tool allows designers to address these vulnerabilities at an early stage of the design flow. We further propose a novel synthesis technique in HLS to eliminate timing channels in the generated accelerator. Our approach is able to remove timing channels in a verifiable manner while incurring lower performance overhead for high-security tasks on the accelerator.
Hardware information flow analysis detects security vulnerabilities resulting from unintended design flaws, timing channels, and hardware Trojans. These information flow models are typically generated in a general way, which includes a significant amount of redundancy that is irrelevant to the specified security properties. In this work, we propose a property-specific approach for information flow security. We create information flow models tailored to the properties to be verified by performing a property-specific search to identify security-critical paths. This helps find suspicious signals that require closer inspection and quickly eliminates portions of the design that are free of security violations. Our property-specific trimming technique reduces the complexity of the security model; this accelerates security verification and restricts potential security violations to a smaller region, which helps quickly pinpoint hardware security vulnerabilities.
Heterogeneous CPU-FPGA systems have been shown to achieve significant performance gains in domain-specific computing. However, contrary to the huge efforts invested in performance acceleration, the community has not yet investigated the security consequences of incorporating FPGAs into the traditional CPU-based architecture. In fact, the interplay between the CPU and the FPGA in such a heterogeneous system may introduce brand-new attack surfaces if not well controlled. We propose a hardware isolation-based secure architecture, namely HISA, to mitigate the identified new threats. HISA extends the CPU-based hardware isolation primitive to the heterogeneous FPGA components and achieves security guarantees by enforcing two types of security policies in the isolated secure environment, namely the access control policy and the output verification policy. We evaluate HISA using four reference FPGA IP cores together with a variety of reference security policies targeting representative CPU-FPGA attacks. Our implementation and experiments on real hardware prove that HISA is an effective security complement to the existing CPU-only and FPGA-only secure architectures.
For the past decade, security experts have warned that malicious engineers could modify hardware designs to include hardware backdoors (trojans), which, in turn, could grant attackers full control over a system. Proposed defenses to detect these attacks have been outpaced by the development of increasingly small, but equally dangerous, trojans. To thwart trojan-based attacks, we propose a novel architecture that maps the security-critical portions of a processor design to a one-time programmable, LUT-free fabric. The programmable fabric is automatically generated by analyzing the HDL of targeted modules. We present our tools to generate the fabric and map functionally equivalent designs onto the fabric. By having a trusted party randomly select a mapping and configure each chip, we prevent an attacker from knowing the physical location of targeted signals at manufacturing time. In addition, we provide decoy options (canaries) for the mapping of security-critical signals, such that hardware trojans hitting a decoy are thwarted and exposed. Using this defense approach, any trojan capable of analyzing the entire configurable fabric must employ complex logic functions with a large silicon footprint, thus exposing it to detection by inspection. We evaluated our solution on a RISC-V BOOM processor and demonstrated that, by providing the ability to map each critical signal to 6 distinct locations on the chip, we can reduce the chance of attack success by an undetectable trojan by 99%, incurring only a 27% area overhead.
Automotive systems have always been designed with safety in mind. In this regard, the functional safety standard ISO 26262 was drafted with the intention of minimizing risk due to random hardware faults or systematic failures in the design of electrical and electronic components of an automobile. However, the growing complexity of a modern car has added another potential point of failure in the form of cyber or sensor attacks. Recently, researchers have demonstrated that vulnerabilities in a vehicle's software or sensing units could enable them to remotely alter the intended operation of the vehicle. As such, in addition to safety, security should be considered an important design goal. However, designing security solutions without consideration of safety objectives could result in potential hazards. Consequently, in this paper we propose the notion of security for safety and show that by integrating safety conditions with our system-level security solution, which comprises a modified Kalman filter and a Chi-squared detector, we can prevent potential hazards that could occur due to violation of safety objectives during an attack. Furthermore, with the help of a car-following case study, where the follower car is equipped with an adaptive cruise control unit, we show that our proposed system-level security solution preserves the safety constraints and prevents collisions between vehicles while under sensor attack.
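A toy sketch of the residual-based detection idea, assuming a one-dimensional constant-velocity Kalman filter and a chi-squared test on the normalized innovation; the plant, noise levels, injected attack, and threshold are all illustrative and do not reproduce the paper's modified filter or its safety integration.

```python
import numpy as np

class ChiSquaredDetector:
    """Flag a sensor attack when the normalized Kalman innovation exceeds a
    chi-squared threshold (1 degree of freedom in this toy example)."""
    def __init__(self, threshold=6.63):        # ~99th percentile of chi^2 with 1 dof
        self.threshold = threshold

    def check(self, innovation, innovation_cov):
        return innovation**2 / innovation_cov > self.threshold

# 1-D constant-velocity Kalman filter (position measured, velocity estimated)
dt, q, r = 0.1, 0.01, 0.25
F = np.array([[1, dt], [0, 1]]); H = np.array([[1.0, 0.0]])
Q = q * np.eye(2); R = np.array([[r]])
x, P = np.zeros(2), np.eye(2)
det = ChiSquaredDetector()

rng = np.random.default_rng(0)
true_pos = 0.0
for k in range(200):
    true_pos += 1.0 * dt
    z = true_pos + rng.normal(scale=np.sqrt(r))
    if k > 150:
        z += 5.0                               # injected sensor attack (hypothetical)
    # predict
    x = F @ x; P = F @ P @ F.T + Q
    # innovation and its covariance
    y = z - (H @ x)[0]; S = (H @ P @ H.T + R)[0, 0]
    if det.check(y, S):
        print(f"step {k}: possible sensor attack (residual {y:.2f})")
    # update
    K = (P @ H.T / S).ravel()
    x = x + K * y; P = (np.eye(2) - np.outer(K, H)) @ P
```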
With the increasing use of Ethernet-based communication backbones in safety-critical real-time domains, both efficient and predictable interfacing and cryptographically secure authentication of high-speed data streams are becoming very important. Although the increasing data rates of in-vehicle networks allow the integration of more demanding (e.g., camera-based) applications, processing speeds and, in particular, memory bandwidths are no longer scaling accordingly. The need for authentication, on the other hand, stems from the ongoing convergence of traditionally separated functional domains and the extended connectivity both inside (e.g., smart-phones) and outside (e.g., telemetry, cloud-based services, and vehicle-to-X technologies) current vehicles. The inclusion of cryptographic measures thus requires careful interface design to meet the throughput, latency, safety, security, and power constraints given by the particular application domain. Over the last decades, this has forced system designers to not only optimize their software stacks accordingly, but also incrementally move interface functionalities from software to hardware. This paper discusses existing and emerging methods for dealing with high-speed data streams, ranging from software-only via mixed hardware/software approaches to fully hardware-based solutions. In particular, we introduce two approaches to acquire and authenticate GigE Vision video streams at the full line rate of Gigabit Ethernet on programmable SoCs suitable for future heterogeneous automotive processing platforms.
Connected vehicle applications such as autonomous intersections and intelligent traffic signals have shown great promise in improving transportation safety and efficiency. However, security is a major concern in these systems, as vehicles and surrounding infrastructures communicate through ad-hoc networks. In this paper, we first review security vulnerabilities in connected vehicle applications. We then introduce and discuss some of the defense mechanisms at the network and system levels, including (1) the Security Credential Management System (SCMS) proposed by the United States Department of Transportation, (2) an intrusion detection system (IDS) that we are developing and its application to collaborative adaptive cruise control, and (3) a partial consensus mechanism and its application to lane merging. These mechanisms can help improve the security of connected vehicle applications.
The Internet of Things (IoT) technology is transforming the world into Smart Cities, which will have a huge impact on future societal lifestyle, economy, and business. Intelligent Transportation Systems (ITS), especially IoT-enabled Electric Vehicles (EVs), are anticipated to be an integral part of future Smart Cities. Assuring ITS safety and security is critical to the success of Smart Cities because human lives are at stake. The state-of-the-art understanding of this matter is very superficial because there are many new problems that have yet to be investigated. For example, the cyber-physical nature of ITS requires considering humans-in-the-loop (i.e., drivers and pedestrians) and imposes many new challenges. In this paper, we systematically explore the threat model against ITS safety and security (e.g., malfunctions of connected EVs/transportation infrastructures, driver misbehavior and unexpected medical conditions, and cyber attacks). Then, we present a novel and systematic ITS safety and security architecture, which aims to reduce accidents caused or amplified by a range of threats. The architecture has appealing features: (i) it is centered on proactive cyber-physical-human defense; (ii) it facilitates the detection of early-warning signals of accidents; (iii) it automates effective defense against a range of threats.
Electromigration (EM) is becoming a progressively severe reliability challenge due to increased interconnect current densities. A shift from traditional (post-layout) EM verification to robust (pro-active) EM-aware design - where the circuit layout is designed with individual EM-robust solutions - is urgently needed. This tutorial will give an overview of EM and its effects on the reliability of present and future integrated circuits (ICs). We introduce the physical EM process and present its specific characteristics that can be affected during physical design. Examples of EM countermeasures which are applied in today's commercial design flows are presented. We show how to improve the EM-robustness of metallization patterns and we also consider mission profiles to obtain application-oriented current-density limits. The increasing interaction of EM with thermal migration is investigated as well. We conclude with a discussion of application examples to shift from the current post-layout EM verification towards an EM-aware physical design process. Its methodologies, such as EM-aware routing, increase the EM-robustness of the layout with the overall goal of reducing the negative impact of EM on the circuit's reliability.
Since the invention of generalized polynomial chaos in 2002, uncertainty quantification has impacted many engineering fields, including variation-aware design automation of integrated circuits and integrated photonics. Due to the fast convergence rate, the generalized polynomial chaos expansion has achieved orders-of-magnitude speedup than Monte Carlo in many applications. However, almost all existing generalized polynomial chaos methods have a strong assumption: the uncertain parameters are mutually independent or Gaussian correlated. This assumption rarely holds in many realistic applications, and it has been a long-standing challenge for both theorists and practitioners.
This paper proposes a rigorous and efficient solution to address the challenge of non-Gaussian correlation. We first extend generalized polynomial chaos and propose a class of smooth basis functions to efficiently handle non-Gaussian correlations. Then, we consider high-dimensional parameters and develop a scalable tensor method to compute the proposed basis functions. Finally, we develop a sparse solver with adaptive sample selection to solve high-dimensional uncertainty quantification problems. We validate our theory and algorithm on electronic and photonic ICs with 19 to 57 non-Gaussian correlated variation parameters. The results show that our approach outperforms Monte Carlo by 2500× to 3000× in terms of efficiency. Moreover, our method can accurately predict the output density functions with multiple peaks caused by non-Gaussian correlations, which are hard to handle by existing methods.
Based on the results in this paper, many novel uncertainty quantification algorithms can be developed and can be further applied to a broad range of engineering domains.
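As background only, the sketch below illustrates the classical one-dimensional generalized polynomial chaos setting with an independent standard Gaussian parameter and probabilists' Hermite polynomials, comparing the PCE mean and variance against Monte Carlo; it deliberately does not cover the non-Gaussian correlated case that is the paper's actual contribution, and the response function is hypothetical.

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite_e import hermeval, hermegauss

def pce_coefficients(g, degree, quad_order=40):
    """Project g(xi), xi ~ N(0,1), onto probabilists' Hermite polynomials:
    c_k = E[g(xi) He_k(xi)] / k!  (1-D, independent-Gaussian case only)."""
    nodes, weights = hermegauss(quad_order)           # quadrature for weight exp(-x^2/2)
    weights = weights / np.sqrt(2 * np.pi)            # normalize to the N(0,1) density
    coeffs = []
    for k in range(degree + 1):
        basis = hermeval(nodes, [0] * k + [1])        # He_k evaluated at the nodes
        coeffs.append(np.sum(weights * g(nodes) * basis) / factorial(k))
    return np.array(coeffs)

g = lambda xi: np.exp(0.3 * xi) + 0.1 * xi**2         # hypothetical circuit response
c = pce_coefficients(g, degree=6)
# mean and variance follow directly from the expansion coefficients
mean = c[0]
var = sum(factorial(k) * c[k]**2 for k in range(1, len(c)))
samples = g(np.random.default_rng(0).standard_normal(200_000))
print(mean, samples.mean(), var, samples.var())       # PCE vs. Monte Carlo estimates
```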
Due to inherent complex behaviors and stringent requirements in analog and mixed-signal (AMS) systems, verification becomes a key bottleneck in the product development cycle. For the first time, we present a Bayesian optimization (BO) based approach to the challenging problem of verifying AMS circuits with stringent low failure requirements. At the heart of the proposed BO process is a delicate balancing between two competing needs: exploitation of the current statistical model for quick identification of highly-likely failures and exploration of undiscovered design space so as to detect hard-to-find failures within a large parametric space. To do so, we simultaneously leverage multiple optimized acquisition functions to explore varying degrees of balancing between exploitation and exploration. This makes it possible to not only detect rare failures which other techniques fail to identify, but also do so with significantly improved efficiency. We further build in a mechanism into the BO process to enable detection of multiple failure regions, hence providing a higher degree of coverage. Moreover, the proposed approach is readily parallelizable, further speeding up failure detection, particularly for large circuits for which acquisition of simulation/measurement data is very time-consuming. Our experimental study demonstrates that the proposed approach is very effective in finding very rare failures and multiple failure regions which existing statistical sampling techniques and other BO techniques can miss, thereby providing a more robust and cost-effective methodology for rare failure detection.
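A minimal sketch of BO-driven rare-failure search under strong assumptions: a one-dimensional toy "circuit response" with two narrow failure regions, a Gaussian-process surrogate, and three acquisition functions (failure probability, predictive uncertainty, UCB) used in round-robin fashion as a crude stand-in for the joint use of multiple optimized acquisitions described above.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
spec = 2.5                                              # hypothetical failure spec

def circuit_response(x):
    """Toy stand-in for a simulated AMS metric with two narrow failure regions."""
    return np.sin(3 * x) + 2.8 * np.exp(-80 * (x - 0.2)**2) + 2.9 * np.exp(-90 * (x - 0.85)**2)

X = rng.random((5, 1))                                  # initial "simulations"
y = circuit_response(X.ravel())
cands = np.linspace(0, 1, 400).reshape(-1, 1)

for it in range(20):
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.1), alpha=1e-6).fit(X, y)
    mu, sd = gp.predict(cands, return_std=True)
    acqs = {
        "exploit": norm.cdf((mu - spec) / np.maximum(sd, 1e-9)),  # P(response > spec)
        "explore": sd,                                            # pure uncertainty
        "ucb": mu + 2.0 * sd,                                     # a balance of both
    }
    acq = list(acqs.values())[it % len(acqs)]           # round-robin over acquisitions
    x_next = cands[int(np.argmax(acq))]
    X = np.vstack([X, x_next]); y = np.append(y, circuit_response(x_next[0]))

print("failure points found:", np.sort(X.ravel()[y > spec]))
```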
Transient simulation becomes a bottleneck for modern IC designs due to large numbers of transistors, interconnects, and tight design margins. The modified nodal analysis (MNA) formulation leads to differential algebraic equations (DAEs), which consist of ordinary differential equations (ODEs) and algebraic equations. Solving DAEs with conventional multi-step integration methods has been a research topic over the last few decades. We adopt a matrix exponential-based integration method for circuit transient analysis, but its stability and accuracy with DAEs remain an open problem. We identify that potential stability issues in the calculation of the matrix exponential and vector product (MEVP) with the rational Krylov method originate from the singular system matrix in DAEs. We then devise a robust algorithm to implicitly regularize the system matrix while maintaining its sparsity. With the new approach, φ functions are applied for MEVP to improve the accuracy of results. Moreover, our framework no longer suffers from step-size limitations, so a large leap step can be adopted to skip many simulation steps in between. Features of the algorithm are validated on large-scale power delivery networks, achieving high efficiency and accuracy.
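To make MEVP concrete, the sketch below uses SciPy's expm_multiply as a stand-in for a rational-Krylov MEVP kernel on a small, randomly generated stable ODE system, and compares one large exponential-integrator step against many small backward-Euler steps. The DAE regularization and φ-function machinery described above are not shown, and the system matrix is a hypothetical stand-in for a linearized circuit.

```python
import numpy as np
from scipy.sparse import random as sprandom, identity
from scipy.sparse.linalg import expm_multiply, splu

rng = np.random.default_rng(0)
n = 500
# a sparse, mostly stable system matrix standing in for a linear circuit (hypothetical)
A = (sprandom(n, n, density=0.01, random_state=0) - 2.0 * identity(n)).tocsc()
x0 = rng.random(n)

h = 1.0                                               # one large "leap" step
x_exp = expm_multiply(A * h, x0)                      # matrix-exponential-vector product

# reference: many small backward-Euler steps x_{k+1} = (I - dt*A)^{-1} x_k
dt, x_be = 1e-3, x0.copy()
lu = splu((identity(n) - dt * A).tocsc())
for _ in range(int(h / dt)):
    x_be = lu.solve(x_be)

print(np.linalg.norm(x_exp - x_be) / np.linalg.norm(x_be))   # relative difference of the two
```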
Optical network-on-chip (NoC) is a promising platform beyond electronic NoCs. In particular, wavelength-routed optical network-on-chip (WRONoC) is renowned for its high bandwidth and ultra-low signal delay. Current WRONoC topology generation approaches focus on full connectivity, i.e., all masters are connected to all slaves. This assumption leads to wasted resources for application-specific designs. In this work, we propose CustomTopo: a general solution to the topology generation problem on WRONoCs that supports customized connectivity. CustomTopo models the topology structure and its communication behavior as an integer linear programming (ILP) problem, with an adjustable optimization target considering the number of add-drop filters (ADFs), the number of wavelengths, and insertion loss. The time for solving the ILP problem generally correlates positively with the network communication density. Experimental results show that CustomTopo is applicable to various communication requirements, and the resulting customized topology enables a remarkable reduction in both resource usage and insertion loss.
2.5D integration technology is gaining popularity in the design of homogeneous and heterogeneous many-core computing systems. 2.5D network design, both inter- and intra-chiplet, impacts overall system performance as well as its manufacturing cost and thermal feasibility. This paper introduces a cross-layer methodology for designing networks in 2.5D systems. We optimize the network design and chiplet placement jointly across logical, physical, and circuit layers to achieve an energy-efficient network, while maximizing system performance, minimizing manufacturing cost, and adhering to thermal constraints. In the logical layer, our co-optimization considers eight different network topologies. In the physical layer, we consider routing, microbump assignment, and microbump pitch constraints to account for the extra costs associated with microbump utilization in the inter-chiplet communication. In the circuit layer, we consider both passive and active links with five different link types, including a gas station link design. Using our cross-layer methodology results in more accurate determination of (superior) inter-chiplet network and 2.5D system designs compared to prior methods. Compared to 2D systems, our approach achieves 29% better performance with the same manufacturing cost, or 25% lower cost with the same performance.
Application-specific MPSoCs profit immensely from a custom-fit Network-on-Chip (NoC) architecture in terms of network performance and power consumption. In this paper we suggest a new approach to explore application-specific NoC architectures. In contrast to other heuristics, our approach uses a set of network modifications defined with graph rewriting rules to model the design space exploration as a Markov Decision Process (MDP). The MDP can be efficiently explored using the Monte Carlo Tree Search (MCTS) heuristic. We formulate a weighted-sum reward function to compute a single solution with a good trade-off between power and latency, or a set of max reward functions to compute the complete Pareto front between the two objectives. The Wavefront feature adds additional efficiency when computing the Pareto front by exchanging solutions between parallel MCTS optimization processes. Comparison with other popular search heuristics demonstrates the higher efficiency of MCTS-based heuristics for several test cases. Additionally, the Wavefront-MCTS heuristic allows complete traceability and control by the designer, enabling an interactive design space exploration process.
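As a rough sketch of how such a search could be structured, the following minimal UCT-style MCTS operates over an abstract "apply a rewrite rule" interface together with a weighted-sum reward; the rule names, the fake evaluator, and the normalization are stand-ins for the paper's NoC-specific machinery and network simulator.

```python
# Minimal UCT-style MCTS sketch over an abstract graph-rewriting interface.
import math, random

def weighted_reward(power, latency, w_power=0.5, w_latency=0.5):
    # assumed weighted-sum form; the paper's exact normalization is not reproduced
    return -(w_power * power + w_latency * latency)

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = {}, 0, 0.0

def uct_search(root_state, actions, step, evaluate, iters=500, c=1.4):
    root = Node(root_state)
    for _ in range(iters):
        node = root
        # selection: descend while fully expanded, maximizing the UCT score
        while node.children and len(node.children) == len(actions(node.state)):
            node = max(node.children.values(),
                       key=lambda n: n.value / n.visits
                       + c * math.sqrt(math.log(node.visits) / n.visits))
        # expansion: try one untried rewrite rule
        untried = [a for a in actions(node.state) if a not in node.children]
        if untried:
            a = random.choice(untried)
            child = Node(step(node.state, a), parent=node)
            node.children[a] = child
            node = child
        # evaluation + backpropagation
        r = evaluate(node.state)
        while node is not None:
            node.visits += 1
            node.value += r
            node = node.parent
    return max(root.children.values(), key=lambda n: n.visits).state

# toy stand-ins: three abstract rewrite rules and a fake power/latency evaluator;
# a real flow would call a network simulator here
RULES = ["add_link", "remove_link", "merge_routers"]
def actions(state):   return [r for r in RULES if r not in state]
def step(state, a):   return state + (a,)
def evaluate(state):  return weighted_reward(power=10 - 2 * len(state), latency=5 + len(state))

print("best first rewrite found:", uct_search((), actions, step, evaluate, iters=200))
```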
In order to further increase the productivity of field-programmable gate array (FPGA) programmers, several design space exploration (DSE) frameworks for high-level synthesis (HLS) tools have been recently proposed to automatically determine the FPGA design parameters. However, one of the common limitations found in these tools is that they cannot find a design point with large speedup for applications with variable loop bounds. The reason is that loops with variable loop bounds cannot be efficiently parallelized or pipelined with simple insertion of HLS directives. Also, making highly accurate prediction of cycles and resource consumption on the entire design space becomes a challenging task because of the inaccuracy of the HLS tool cycle prediction and the wide design space. In this paper we present an HLS-based FPGA optimization and DSE framework that produces a high-performance design even in the presence of variable loop bounds. We propose code transformations that increase the utilization of the compute resources for variable loops, including several computation patterns with loop-carried dependency such as floating-point reduction and prefix sum. In order to rapidly perform DSE with high accuracy, we describe a resource and cycle estimation model constructed from the information obtained from the actual HLS synthesis. Experiments on applications with variable loop bounds in Polybench benchmarks with Vivado HLS show that our framework improves the baseline implementation by 75X on average and outperforms current state-of-the-art DSE frameworks.
FPGA application developers must explore increasingly large design spaces to identify regions of code to accelerate. High-Level Synthesis (HLS) tools automatically derive FPGA-based designs from high-level language specifications, which improves designer productivity; however, HLS tool run-times are cost-prohibitive for design space exploration, preventing designers from adequately answering cost-value decisions without expert guidance. To address this concern, this paper introduces a machine learning framework to predict FPGA performance and power consumption without relying on analytical models or HLS tools in-the-loop. For workloads that were manually optimized by appropriately setting pragmas, the framework obtains a worst-case relative error of 9.08% while running 43.78x faster than HLS; for unoptimized workloads, the framework obtains a worst-case relative error of 9.79% while running 36.24x faster than HLS.
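The sketch below shows only the general idea of such a predictor: learn a mapping from pragma settings to post-synthesis latency and power so that candidate designs can be scored without invoking the HLS tool. The feature names, the synthetic training data, and the random-forest model are illustrative assumptions, not the paper's framework.

```python
# Train a surrogate model that predicts (latency, power) from pragma settings.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# hypothetical features: [unroll_factor, pipeline_II, array_partition_factor, loop_trip_count]
X = rng.integers(1, 32, size=(200, 4)).astype(float)
# synthetic targets standing in for (latency_cycles, power_mW) measured from HLS runs
y = np.column_stack([1e5 / (X[:, 0] * X[:, 2]) * X[:, 3],
                     5.0 * X[:, 0] + 2.0 * X[:, 2]])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
rel_err = np.abs(model.predict(X_te) - y_te) / np.abs(y_te)
print("worst-case relative error on held-out configurations:", rel_err.max())
```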
Executing deep learning algorithms on mobile embedded devices is challenging because embedded devices usually have tight constraints on computational power, memory size, and energy consumption, while the resource requirements of deep learning algorithms that achieve high accuracy continue to increase. Thus it is typical to use an energy-efficient accelerator such as a mobile GPU, DSP array, or customized neural processor chip. Moreover, new deep learning algorithms that aim to balance accuracy, speed, and resource requirements are developed on deep learning frameworks such as Caffe [16] and Tensorflow [1], which are assumed to run directly on the target hardware. However, embedded devices may not be able to run those frameworks directly due to hardware limitations or missing OS support. To overcome this difficulty, we develop a deep learning software framework that generates C code that can run on any device. The framework provides various options for software optimization that can be applied according to the optimization methodology proposed in this paper. Another benefit is that it can generate various styles of C code, tailored to a specific compiler or accelerator architecture. Experiments on three platforms, NVIDIA Jetson TX2 [23], Odroid XU4 [10], and SRP (Samsung Reconfigurable Processor) [32], demonstrate the potential of the proposed approach.
Unlike traditional processors, embedded Internet of Things (IoT) devices lack the resources to incorporate protection against modern sophisticated attacks, with potentially critical consequences. Remote attestation (RA) is a security service to establish trust in the integrity of a remote device. While conventional RA is static and limited to detecting malicious modification of software binaries at load time, recent research has made progress towards runtime attestation, such as attesting the control flow of an executing program. However, existing control-flow attestation schemes are inefficient and vulnerable to sophisticated data-oriented programming (DOP) attacks that subvert these schemes while keeping the control flow of the code intact.
In this paper, we present LiteHAX, an efficient hardware-assisted remote attestation scheme for RISC-based embedded devices that enables detecting both control-flow attacks as well as DOP attacks. LiteHAX continuously tracks both the control-flow and data-flow events of a program executing on a remote device and reports them to a trusted verifying party. We implemented and evaluated LiteHAX on a RISC-V System-on-Chip (SoC) and show that it has minimal performance and area overhead.
Microarchitectural side-channel attacks have posed serious threats to many computing systems, ranging from embedded systems and mobile devices to desktop workstations and cloud servers. Such attacks exploit side-channel vulnerabilities stemming from fundamental microarchitectural performance features, including the most common caches, out-of-order execution (for the newly revealed Meltdown exploit), and speculative execution (for Spectre). Prior efforts have focused on identifying and assessing these security vulnerabilities, and designing and implementing countermeasures against them. However, the efforts aiming at detecting specific side-channel attacks tend to be narrowly focused, which can make them effective but also makes them obsolete very quickly. In this paper, we propose a new methodology for detecting microarchitectural side-channel attacks that has the potential for a wide scope of applicability, as we demonstrate using a case study involving the Prime+Probe attack family. Instead of looking at the side-effects of side-channel attacks on microarchitectural elements such as hardware performance counters, we target the high-level semantics and invariant patterns of these attacks. We have applied our method to different Prime+Probe attack variants on the instruction cache, data cache, and last-level cache, as well as several benign programs as benchmarks. The method can detect all of the Prime+Probe attack variants with a true positive rate of 100% and an average false positive rate of 7.4%.
Injecting randomization into different layers of the computing platform has been shown to be beneficial for security, resilience to software bugs, and timing analysis. In this paper, with a focus on the latter, we share our experience with memory and timing resource management when software randomization techniques are applied to one of the most stringent industrial environments, ARINC653-based avionics. We describe the challenges of this task, propose a set of solutions, and present the results obtained for two commercial avionics applications executed on COTS hardware and an RTOS.
Single Flux Quantum (SFQ) electronic circuits originated with the advent of Rapid Single Flux Quantum (RSFQ) logic in 1985 and have since evolved to include more energy-efficient technologies such as ERSFQ and eSFQ. SFQ logic circuits, based on the manipulation of quantized flux pulses, have been demonstrated to run at clock speeds in excess of 120 GHz, and with bit-switch energy below 1 aJ. Small SFQ microprocessors have been developed, but characteristics inherent to SFQ circuits and the lack of circuit design tools have hampered the development of large SFQ systems. SFQ circuit characteristics include fan-out of one and the subsequent demand for pulse splitters, gate-level clocking, susceptibility to magnetic fields and sensitivity to intra-gate and inter-gate inductance. Superconducting interconnects propagate data pulses at the speed of light, but suffer from reflections at vias that attenuate transmitted pulses. The recently started IARPA SuperTools program aims to deliver SFQ Computer-Aided Design (CAD) tools that can enable the successful design of 64 bit RISC processors given the characteristics of SFQ circuits. A discussion on the technology of SFQ circuits and the most modern SFQ fabrication processes is presented, with a focus on the unique electronic design automation CAD requirements for the design, layout and verification of SFQ circuits.
Josephson junction-based superconducting logic families have been proposed for processing analog and digital signals, achieving low energy dissipation and ultra-fast switching speed. There are two representative technologies: DC-biased RSFQ (rapid single flux quantum) technology and its variants, which have achieved a verified speed of 370 GHz, and AC-biased AQFP (adiabatic quantum-flux-parametron), which achieves energy dissipation near quantum limits. Despite the extraordinary characteristics of the superconducting logic families, many technical challenges remain, including the choice of circuit fabrics and architectures that utilize SFQ technology and the development of effective design automation methodologies and tools. This paper presents our work on developing design flows and tools for DC- and AC-biased SFQ circuits, leveraging the unique characteristics and design requirements of the SFQ logic families. More precisely, physical design algorithms, including placement, clock tree routing, and signal routing algorithms targeting RSFQ circuits, are presented first. Next, a majority/minority gate-based automatic synthesis framework targeting AQFP logic circuits is described. Finally, experimental results demonstrating the efficacy of the proposed framework and tools are presented.
With increasing clock frequencies, the timing requirement of Rapid Single Flux Quantum (RSFQ) digital circuits is critical for achieving correct functionality. To meet this requirement, it is necessary to incorporate the length-matching constraint into the routing problem. However, the solutions of existing routing algorithms are inherently limited by pre-allocated splitters (SPLs), which complicate the subsequent routing stage under the length-matching constraint. Hence, in this paper, we reallocate SPLs to fully utilize routing resources and cope with length matching effectively. We propose the first multi-terminal routing algorithm for RSFQ circuits that integrates SPL reallocation into the routing stage. Experimental results on a practical circuit show that our proposed algorithm achieves routing completion while reducing the required area by 17%. Compared with [2], we still improve by 7% with less runtime when SPLs are pre-allocated.
Electromagnetic (EM) analysis reveals secret information by analyzing the EM emission from a cryptographic device. The EM analysis (EMA) attack is emerging as a serious threat to hardware security. It has been noted that the on-chip power grid (PG) has a security implication for EMA attacks by affecting the fluctuations of the supply current. However, there is little study on exploiting this intrinsic property as an active countermeasure against EMA. In this paper, we investigate the effect of the PG on EM emission and propose an active countermeasure against EMA, i.e., the EM Equalizer (EME). By adjusting the PG impedance, the current waveform can be flattened, equalizing the EM profile. Therefore, the correlation between secret data and EM emission is significantly reduced. As a first attempt at co-optimization for power and EM security, we extend the EME method to also address the vulnerability to power analysis. To verify the EME method, several cryptographic designs are implemented. The measurements to disclosure (MTD) are improved by 1138x with area and power overheads of 0.62% and 1.36%, respectively.
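The numerical toy below illustrates only the statistical effect described in the abstract (a flatter, less data-dependent waveform lowers the correlation a side-channel attacker can exploit); the Hamming-weight leakage model, noise levels, and "equalization" factor are synthetic assumptions, not the EME circuit technique.

```python
# Correlation between a secret-dependent leakage model and simulated traces,
# before and after the data-dependent component is artificially flattened.
import numpy as np

rng = np.random.default_rng(1)
n_traces = 2000
secret_hw = rng.integers(0, 9, n_traces).astype(float)        # stand-in Hamming-weight leakage

raw_current = 0.8 * secret_hw + rng.normal(0, 1.0, n_traces)   # strong data-dependent ripple
flattened   = 0.1 * secret_hw + rng.normal(0, 1.0, n_traces)   # after hypothetical equalization

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

print("correlation before equalization:", corr(secret_hw, raw_current))
print("correlation after  equalization:", corr(secret_hw, flattened))
```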
The RSA algorithm [21] is a public-key cipher widely used in digital signatures and Internet protocols, including the Secure Sockets Layer (SSL) and Transport Layer Security (TLS). RSA entails excessive computational complexity compared with symmetric ciphers. For scenarios where an Internet domain is handling a large number of SSL connections and generating digital signatures for a large number of files, the amount of RSA computation becomes a major performance bottleneck. With the advent of general-purpose GPUs, the performance of RSA has been improved significantly by exploiting parallel computing on a GPU [9, 18, 23, 26], leveraging the Single Instruction Multiple Thread (SIMT) model.
The current practice in board-level integration is to incorporate chips and components from numerous vendors. A fully trusted supply chain for all used components and chipsets is an important, yet extremely difficult to achieve, prerequisite to validate a complete board-level system for safe and secure operation. An increasing risk is that most chips nowadays run software or firmware, typically updated throughout the system lifetime, making it practically impossible to validate the full system at every given point in the manufacturing, integration and operational life cycle. This risk is elevated in devices that run 3rd party firmware. In this paper we show that an FPGA used as a common accelerator in various boards can be reprogrammed by software to introduce a sensor, suitable as a remote power analysis side-channel attack vector at the board-level. We show successful power analysis attacks from one FPGA on the board to another chip implementing RSA and AES cryptographic modules. Since the sensor is only mapped through firmware, this threat is very hard to detect, because data can be exfiltrated without requiring inter-chip communication between victim and attacker. Our results also prove the potential vulnerability in which any untrusted chip on the board can launch such attacks on the remaining system.
Elliptic Curve Cryptography (ECC), initially proposed by Koblitz [17] and Miller [20], is a public-key cipher. Compared with other popular public-key ciphers (e.g., RSA), ECC features a shorter key length for the same level of security. For example, a 256-bit ECC cipher provides 128-bit security, equivalent to a 2048-bit RSA cipher [4]. Using smaller keys, ECC requires less memory for performing cryptographic operations. Embedded systems, especially given the proliferation of Internet-of-Things (IoT) devices and platforms, require efficient and low-power secure communications between edge devices and gateways/clouds. ECC has been widely adopted in IoT systems for authentication of communications, while RSA, which is much more costly to compute, remains the standard for desktops and servers.
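Since the abstract leans on ECC's compactness, a textbook double-and-add scalar multiplication is sketched below on a short Weierstrass curve over a tiny prime field; the curve parameters and base point are toy values chosen only for illustration, not a secure or constant-time implementation.

```python
# Textbook ECC scalar multiplication on y^2 = x^3 + 2x + 3 over GF(97) (toy curve).
P_MOD, A_COEF = 97, 2

def ec_add(P, Q):
    """Group law on the toy curve; None represents the point at infinity."""
    if P is None: return Q
    if Q is None: return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None                                  # P + (-P) = infinity
    if P == Q:
        lam = (3 * x1 * x1 + A_COEF) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def scalar_mult(k, P):
    """Double-and-add: the core operation of ECC key agreement and signatures."""
    R = None
    while k:
        if k & 1:
            R = ec_add(R, P)
        P = ec_add(P, P)
        k >>= 1
    return R

print(scalar_mult(13, (3, 6)))                       # 13·G for a toy generator G on the curve
```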
Stencil computation is one of the most important kernels in many application domains such as image processing, solving partial differential equations, and cellular automata. Many stencil kernels are complex, usually consisting of multiple stages or iterations, and are often computation-bound. Such kernels are often offloaded to FPGAs to take advantage of the efficiency of dedicated hardware. However, implementing such complex kernels efficiently is not trivial, due to complicated data dependencies, the difficulty of programming FPGAs in RTL, and the large design space.
In this paper we present SODA, an automated framework for implementing Stencil algorithms with Optimized Dataflow Architecture on FPGAs. The SODA microarchitecture minimizes the on-chip reuse buffer size required for full data reuse and provides flexible and scalable fine-grained parallelism. The SODA automation framework takes high-level user input and generates an efficient, high-frequency dataflow implementation. This significantly reduces the difficulty of programming FPGAs efficiently for stencil algorithms. The SODA design-space exploration framework models the resource constraints and searches for the performance-optimized configuration with accurate models for post-synthesis resource utilization and on-board execution throughput. Experimental results from on-board execution using a wide range of benchmarks show up to a 3.28x speedup over a 24-thread CPU, and our fully automated framework achieves better performance compared with manually designed state-of-the-art FPGA accelerators.
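A software line-buffer model conveys the reuse-buffer intuition behind such dataflow stencil designs: for a 3x3 stencil over a raster-order stream of width W, full reuse needs only about two rows plus a few registers, independent of the image height. The code below is a conceptual model only (image-boundary handling is omitted), not SODA's generated hardware.

```python
# Line-buffer view of stencil data reuse for a 3x3 box filter over a streamed image.
from collections import deque

def stream_blur3x3(pixels, width):
    """Consume a raster-order pixel stream; emit 3x3 box-filter outputs."""
    window = deque(maxlen=2 * width + 3)        # minimal reuse buffer: 2 lines + 3 pixels
    for p in pixels:
        window.append(p)
        if len(window) == window.maxlen:
            # fixed-offset taps into the reuse buffer form the 3x3 window
            taps = [window[i + r * width] for r in range(3) for i in range(3)]
            yield sum(taps) / 9.0               # windows wrapping across rows are not filtered out

W = 8
image = list(range(W * W))                      # toy 8x8 image streamed row by row
outputs = list(stream_blur3x3(image, W))
print(len(outputs), outputs[0])
```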
Automatic systolic array generation has long been an interesting topic due to the need to reduce the lengthy development cycles of manual designs. Existing automatic systolic array generation approaches build dependency graphs from algorithms, and iteratively map computation nodes in the graph into processing elements (PEs) with time stamps that specify the sequence of nodes that operate within each PE. A number of previous works implemented this idea and generated designs for ASICs. However, all of these works relied on human intervention and usually generated inferior designs compared to manual designs. In this work, we present our ongoing compilation framework, PolySA, which leverages the power of the polyhedral model to achieve end-to-end compilation for systolic array architectures on FPGAs. PolySA is the first fully automated compilation framework for generating high-performance systolic array architectures on FPGAs, leveraging recent advances in high-level synthesis. We demonstrate PolySA on two key applications---matrix multiplication and convolutional neural networks. PolySA is able to generate optimal designs within one hour with performance comparable to state-of-the-art manual designs.
Memory partitioning has been widely adopted to increase memory bandwidth. Data reuse is a hardware-efficient way to improve data access throughput by exploiting locality in memory access patterns. We found that for many applications in image and video processing, a global data reuse scheme can be shared by multiple patterns. In this paper, we propose an efficient data reuse strategy for multi-pattern data access. First, a heuristic algorithm is proposed to extract the reuse information as well as find the non-reusable data elements of each pattern. Then the non-reusable elements are partitioned into several memory banks by an efficient memory partitioning algorithm. Moreover, the reuse information is utilized to generate the global data reuse logic shared by the multiple patterns. We design a novel algorithm to minimize the number of registers required by the data reuse logic. Experimental results show that compared with the state-of-the-art approach, our proposed method reduces the number of required BRAMs by 62.2% on average, with average reductions of 82.1% in SLICEs, 87.1% in LUTs, 71.6% in flip-flops, 73.1% in DSP48Es, 83.8% in SRLs, 46.7% in storage overhead, 79.1% in dynamic power consumption, and 82.6% in execution time of memory partitioning. In addition, performance is improved by 14.4%.
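For readers unfamiliar with bank partitioning, the tiny example below shows the basic mechanism that such algorithms build on: a cyclic bank mapping that places elements accessed in the same cycle into different banks. It is a generic illustration, not the paper's partitioning or reuse-register minimization algorithm.

```python
# Cyclic bank mapping: elements accessed together in one cycle must land in distinct banks.
NUM_BANKS = 3

def bank(x, y, width):
    """Map a 2D array element to (bank id, offset within bank) under cyclic partitioning."""
    flat = y * width + x
    return flat % NUM_BANKS, flat // NUM_BANKS

# a 3-point horizontal pattern {(x-1,y), (x,y), (x+1,y)}: with 3 cyclic banks the three
# accesses of one cycle always hit distinct banks, so they can be served in parallel
width = 16
for x in (1, 2, 3):
    print((x, 0), "->", bank(x, 0, width))
```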
The most attractive feature of field-programmable gate arrays (FPGAs) is their configuration flexibility. However, if the configuration is performed manually, this flexibility places a heavy burden on system designers to choose among a vast number of configuration parameters and program transformations. In this paper, we improve the state-of-the-art with two main innovations: First, we apply compiler-automated transformations to the data layout and program statements to create streaming accesses. Such accesses are turned into streaming interfaces when the kernels are implemented in hardware, allowing the kernels to run efficiently. Second, we use two-step mixed integer programming to first minimize the execution time and then to minimize energy dissipation. Configuration parameters are chosen automatically, including several important ones omitted by existing models. Experimental results demonstrate significant performance gains and energy savings using these techniques.
Modern biomedical devices use sensor fusion techniques to improve the classification accuracy of users' motion intent for rehabilitation applications. The design of the motion classifier faces significant challenges due to the large number of channels and the stringent communication latency requirement. This paper proposes an edge-computing distributed neural processor to effectively reduce the data traffic and physical wiring congestion. A special local and global networking architecture is introduced to significantly reduce traffic among multiple chips in edge computing. To optimize the design space of the selected features, a systematic design methodology is proposed. A novel mixed-signal feature extraction approach, assisted by neural-network distortion recovery, is also provided to significantly reduce the silicon area. A 12-channel 55 nm CMOS test chip was implemented to demonstrate the proposed systematic design methodology. Measurements show that the test chip consumes only 20 µW of power, more than 10,000X less than the clinically used microprocessor, and can perform edge-computing networking operations within 5 ms.
As small-form-factor and low-power end devices matter in the cloud networking and Internet-of-Things era, bio-inspired neuromorphic architectures have recently attracted great attention in the hope of approaching the energy efficiency of brain functions. Among promising solutions, the liquid state machine (LSM), which consists of randomly and recurrently connected reservoir neurons and trainable readout neurons, has shown great promise in delivering brain-inspired computing power. In this work, we adopt the state-of-the-art face-to-face (F2F)-bonded 3D IC flow named Compact-2D [4] for the LSM processor design, and study the power-area-accuracy benefits of 3D LSM ICs targeting next-generation commercial-grade neuromorphic computing platforms. First, we analyze how the size and connection density of the reservoir in the LSM architecture affect the learning performance using a real-world speech recognition benchmark. We also explore how much power-area design overhead must be paid to enable better classification accuracy. Based on the power-area-accuracy trade-off, we implement an F2F-bonded 3D LSM IC using the optimal LSM architecture, and show that 3D integration practically benefits the LSM processor design with large form-factor and power savings while preserving the best learning performance.
In this work, we first propose a deep depthwise Convolutional Neural Network (CNN) structure, called Add-Net, which uses binarized depthwise separable convolution to replace conventional spatial convolution. In Add-Net, the computationally expensive convolution operations (i.e., multiplication and accumulation) are converted into hardware-friendly addition operations. We meticulously investigate and analyze Add-Net's performance (i.e., accuracy, parameter size, and computational cost) in object recognition applications compared to a traditional baseline CNN using the popular large-scale ImageNet dataset. Accordingly, we propose a Depthwise CNN In-Memory Accelerator (DIMA) based on SOT-MRAM computational sub-arrays to efficiently accelerate Add-Net within non-volatile MRAM. Our device-to-architecture co-simulation results show that, with almost the same inference accuracy as the baseline CNN on different datasets, DIMA can obtain ~1.4× better energy-efficiency and a 15.7× speedup compared to ASICs, and ~1.6× better energy-efficiency and a 5.6× speedup over the best processing-in-DRAM accelerators.
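To make the "multiplication becomes addition" point concrete, here is a plain numpy reference of a depthwise convolution with weights binarized to {+1, -1}, where each multiply-accumulate collapses into a signed add; the channel counts, kernel size, and the omitted pointwise stage are simplifications, not Add-Net's exact structure.

```python
# Depthwise convolution with {+1,-1} weights: MAC reduces to add/subtract.
import numpy as np

def binarized_depthwise_conv(x, w_sign):
    """x: (C, H, W) activations; w_sign: (C, k, k) weights in {+1, -1}."""
    C, H, W = x.shape
    k = w_sign.shape[1]
    out = np.zeros((C, H - k + 1, W - k + 1))
    for c in range(C):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                patch = x[c, i:i + k, j:j + k]
                # multiplication by +/-1 is just addition or subtraction
                out[c, i, j] = patch[w_sign[c] > 0].sum() - patch[w_sign[c] < 0].sum()
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))
w = np.sign(rng.standard_normal((4, 3, 3)))
print(binarized_depthwise_conv(x, w).shape)       # (4, 6, 6)
```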
Continuous flow-based biochips are one of the promising platforms used in biochemical and pharmaceutical laboratories due to their efficiency and low cost. Inside such a chip, fluid volumes of nanoliter size are transported between devices for various operations, such as mixing and detection. The transportation channels and corresponding operation devices are controlled by microvalves driven by external pressure sources. Since assigning an independent pressure source to every microvalve would be impractical due to high costs and limited system dimensions, the states of microvalves are switched using a control logic by time multiplexing. Existing control logic designs, however, still switch only a single control channel per operation, leading to low efficiency. In this paper, we propose the first automatic synthesis approach for a control logic that is able to switch multiple control channels simultaneously to reduce the overall switching time of valve states. In addition, we propose the first fault-aware control logic design, which introduces redundant control paths to maintain correct function even when manufacturing defects occur. Compared with the existing direct connection method, the proposed multi-channel switching mechanism can reduce the switching time of valve states by up to 64%. In addition, all control paths for fault tolerance have been realized.
In this paper, we propose a new multi-physics finite element method (FEM) based analysis method for void growth simulation of confined copper interconnects. This new method for the first time considers three important physics simultaneously in the EM failure process, together with their time-varying interactions: the hydrostatic stress in the confined interconnect wire, the current density, and the Joule-heating-induced temperature. As a result, we solve a set of coupled partial differential equations consisting of the stress diffusion equation (Korhonen's equation), the phase field equation (for modeling void boundary movement), the Laplace equation for current density, and the heat diffusion equation for Joule heating and wire temperature. In the new method, we show that each of the physics has its own physical domain and boundary conditions, and how such coupled multi-physics transient analysis is carried out with FEM while the different time scales are properly handled. Experimental results show that by considering all three coupled physics - the stress, current density, and temperature - and their transient behaviors, the proposed FEM EM solver can predict the unique transient wire resistance change pattern of copper interconnect wires, which is well supported by published experimental data. We also show that the simulated void growth speed is less conservative than that of a recently proposed compact EM model.
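For orientation, the block below writes out representative forms of three of the coupled equations as they commonly appear in the electromigration literature; the paper's exact formulation, the phase-field equation for the void boundary, and the boundary conditions are not reproduced, and the symbols (atomic diffusivity D_a, effective modulus B, atomic volume Ω, effective charge Z, resistivity ρ, conductivity σ_c, thermal conductivity k_T) follow the usual conventions.

```latex
% Representative forms only; the paper's formulation and boundary conditions may differ.
\begin{align}
  % Korhonen's stress-diffusion equation (1D form along the wire)
  \frac{\partial \sigma}{\partial t}
    &= \frac{\partial}{\partial x}\!\left[\kappa
       \left(\frac{\partial \sigma}{\partial x}
       + \frac{e Z \rho}{\Omega}\, j\right)\right],
  \qquad \kappa = \frac{D_a B \Omega}{k_B T},\\
  % Laplace equation for the electric potential and current density
  \nabla \cdot \left(\sigma_c \nabla V\right) &= 0,
  \qquad \vec{j} = -\sigma_c \nabla V,\\
  % heat diffusion with Joule heating as the source term
  \rho_m c_p \frac{\partial T}{\partial t}
    &= \nabla \cdot \left(k_T \nabla T\right) + \frac{|\vec{j}|^{2}}{\sigma_c}.
\end{align}
```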
Transistor aging due to Bias Temperature Instability (BTI) is a crucial degradation mechanism that affects the reliability of circuits over time. Aging-aware circuit design flows virtually do not exist yet, and even research is in its infancy. In this work, we demonstrate how the deleterious effects of BTI-induced degradations can be modeled from physics, where they occur, all the way up to the system level, where they ultimately affect the delay and power of circuits. To achieve that, degradation-aware cell libraries, which properly capture the impact of BTI not only on the delay of standard cells but also on their static and dynamic power, are created. Unlike the state of the art, which solely models the impact of BTI on the threshold voltage of transistors (Vth), we are the first to model the other key transistor parameters degraded by BTI, such as carrier mobility (μ), sub-threshold slope (SS), and gate-drain capacitance (Cgd).
Our cell libraries are compatible with existing commercial CAD tools. Employing the mature algorithms in such tools enables designers - after importing our cell libraries - to accurately estimate the overall impact of aging on the delay and/or power of any circuit, regardless of its complexity. We demonstrate that ΔVth alone (as used in the state of the art) is insufficient to correctly model the impact of BTI on either the delay or the power of circuits. On the one hand, neglecting BTI-induced μ and Cgd degradations leads to underestimating the impact that BTI has on increasing the delay of circuits. Hence, designers will employ narrower timing guardbands within which the reliability of circuits over their lifetime cannot be sustained. On the other hand, neglecting BTI-induced SS degradation leads to overestimating the impact that BTI has on static power reduction. Hence, the potential benefit that circuits gain from BTI will be exaggerated.
In addition to conventional PVT (process, voltage, and temperature) variation, time-dependent current fluctuations such as random telegraph noise (RTN) pose a new challenge to VLSI reliability. In this paper, we show that, compared with static random variation, the RTN amplitude of a particular device is not constant across supply voltages and temperatures. A device may show a large RTN amplitude at one operating condition and a small amplitude at another. As a result, the RTN amplitude distribution becomes uncorrelated across a wide range of voltages and temperatures. The emergence of this uncorrelated distribution causes significant degradation of worst-case values. Analysis results based on variability models from a 65 nm silicon-on-insulator process show that uncorrelated RTN degrades the worst-case threshold voltage significantly compared with the case where RTN is not considered. Delay variation analysis shows that considering RTN in the statistical analysis has little impact at high supply voltages. However, at low-voltage operation, RTN can degrade the worst-case value by more than 5%.
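A small Monte Carlo exercise conveys the qualitative point about worst-case degradation; the distributions and parameters below are synthetic assumptions, not the paper's 65 nm silicon-on-insulator models.

```python
# Worst-case threshold-voltage shift with and without an uncorrelated RTN component.
import numpy as np

rng = np.random.default_rng(0)
n_devices = 1_000_000
static_dvth = rng.normal(0.0, 10e-3, n_devices)            # static random variation (10 mV sigma, assumed)
rtn_amp     = rng.lognormal(np.log(2e-3), 1.0, n_devices)   # heavy-tailed RTN amplitude (assumed)

worst_static_only = static_dvth.max()
worst_with_rtn    = (static_dvth + rtn_amp).max()
print(f"worst-case dVth, static only : {worst_static_only * 1e3:.1f} mV")
print(f"worst-case dVth, with RTN    : {worst_with_rtn * 1e3:.1f} mV")
```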
Soft errors are a major safety concern in many devices, e.g., in automotive, industrial, control or medical applications. Ideally, safety-critical systems should be resilient against the impact of soft errors, but at a low cost. This requires to evaluate the soft error resilience, which is typically done by extensive fault injection.
In this paper, we present ETISS-ML, a multi-level processor simulator which achieves both accuracy and performance for fault simulation by intelligently switching the level of abstraction between an Instruction Set Simulator (ISS) and an RTL simulator. For a given software testcase and fault scenario, the software is first executed in ISS mode until shortly before the fault injection. Then ETISS-ML switches to RTL mode for accurate fault simulation. Whenever the impact of the fault has propagated completely out of the processor's micro-architecture, the simulation can switch back to ISS mode. This paper describes the methods needed to preserve accuracy during both of these switches. Experimental results show that ETISS-ML obtains near-ISS performance with RTL accuracy. It is also shown that ETISS-ML can be used as the processor model in SystemC/TLM virtual prototypes (VPs) and hence enables investigation of the impact of soft errors at the system level.
Quantum computation is currently moving from an academic idea to a practical reality. The recent past has seen tremendous progress in the physical implementation of corresponding quantum computers - also involving big players such as IBM, Google, Intel, Rigetti, Microsoft, and Alibaba. These devices promise substantial speedups over conventional computers for applications like quantum chemistry, optimization, machine learning, cryptography, quantum simulation, and systems of linear equations. The Computer-Aided Design and Verification (jointly referred to as CAD) community needs to be ready for this revolutionary new technology. While research on automatic design methods for quantum computers is currently underway, there is still far too little coordination between the CAD community and the quantum computation community. Consequently, many CAD approaches proposed in the past have either addressed the wrong problems or failed to reach the end users. In this summary paper, we provide a glimpse into both sides. To this end, we review and discuss selected accomplishments from the CAD domain as well as open challenges within the quantum domain. These examples showcase the recent state of the art but also outline the remaining work left to be done in both communities.
Nowadays, a variety of multipliers are used in different computationally intensive industrial applications. Most of these multipliers are highly parallelized and structurally complex. Therefore, the existing formal verification techniques fail to verify them.
In recent years, formal multiplier verification based on Symbolic Computer Algebra (SCA) has shown superior results in comparison to all other existing proof techniques. However, for non-trivial architectures, a monomial explosion can still be observed. A common understanding is that this is caused by redundant monomials, also known as vanishing monomials. While several approaches have been proposed to overcome the explosion, the problem itself is still not fully understood.
In this paper we present a new theory for the origin of vanishing monomials and how they can be handled to prevent the explosion during backward rewriting. We implement our new approach as the SCA-verifier PolyCleaner. The experimental results show the efficiency of our proposed method in verification of non-trivial million-gate multipliers.
GPUs have been widely used to accelerate big-data inference applications and scientific computing through their parallelized hardware resources and programming model. Their extreme parallelism increases the possibility of bugs such as data races and un-coalesced memory accesses, and thus verifying program correctness is critical. State-of-the-art GPU program verification efforts mainly focus on analyzing application-level programs, e.g., in C, and suffer from the following limitations: (1) high false-positive rate due to coarse-grained abstraction of synchronization primitives, (2) high complexity of reasoning about pointer arithmetic, and (3) keeping up with an evolving API for developing application-level programs.
In this paper, we address these limitations by modeling GPUs and reasoning about programs at the instruction level. We formally model the Nvidia GPU at the parallel thread execution (PTX) level using the recently proposed Instruction-Level Abstraction (ILA) model for accelerators. PTX is analogous to the Instruction-Set Architecture (ISA) of a general-purpose processor. Our formal ILA model of the GPU includes non-synchronization instructions as well as all synchronization primitives, enabling us to verify multithreaded programs. We demonstrate the applicability of our ILA model in scalable GPU program verification of data-race checking. The evaluation shows that our checker outperforms state-of-the-art GPU data race checkers with fewer false positives and improved scalability.
In this paper, we propose an architecture for FPGA emulation of mixed-signal systems that achieves high accuracy at a high throughput. We represent the analog output of a block as a superposition of step responses to changes in its analog input, and the output is evaluated only when needed by the digital subsystem. Our architecture is therefore intended for digitally-driven systems; that is, those in which the inputs of analog dynamical blocks change only on digital clock edges. We implemented a high-speed link transceiver design using the proposed architecture on a Xilinx FPGA. This design demonstrates how our approach breaks the link between simulation rate and time resolution that is characteristic of prior approaches. The emulator is flexible, allowing for the real-time adjustment of analog dynamics, clock jitter, and various design parameters. We demonstrate that our architecture achieves 1% accuracy while running 3 orders of magnitude faster than a comparable high-performance CPU simulation.
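The short software analogy below illustrates the superposition idea described in the abstract: the analog output is the sum of step responses triggered by each change of a piecewise-constant input, and it is evaluated only at the time points the digital side asks for. The first-order step response and the numeric values are assumptions for illustration, not the emulator's hardware datapath.

```python
# Output as a superposition of step responses, evaluated only when queried.
import math

def step_response(t, tau=1e-9):
    """Assumed first-order step response; a real emulator uses the block's own response."""
    return 0.0 if t < 0 else 1.0 - math.exp(-t / tau)

def output_at(t, input_changes, tau=1e-9):
    """input_changes: list of (time, delta) pairs describing a piecewise-constant input."""
    return sum(delta * step_response(t - t_k, tau) for t_k, delta in input_changes)

# input steps from 0 V to 1.0 V at t = 0, then down to 0.4 V at t = 3 ns
changes = [(0.0, 1.0), (3e-9, -0.6)]
for t in (1e-9, 3e-9, 6e-9):                 # query only at the clock edges of interest
    print(f"t = {t:.1e} s  ->  {output_at(t, changes):.4f} V")
```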
A concerning weakness of deep neural networks is their susceptibility to adversarial attacks. While methods exist to detect these attacks, they incur significant drawbacks and ignore external features that could aid in attack detection. In this work, we propose SPN Dash, a method for detecting adversarial attacks based on the integrity of the sensor pattern noise embedded in submitted images. Through experiments, we show that our SPN Dash method is capable of detecting the addition of adversarial noise with up to 94% accuracy for images of size 256×256. Analysis shows that SPN Dash is robust to image scaling techniques, as well as a small amount of image compression. This performance is on par with state-of-the-art neural network-based detectors, while incurring an order of magnitude less computational and memory overhead.
Deep neural networks (DNNs) have become an important tool for bringing intelligence to mobile and embedded devices. The increasingly wide deployment, sharing and potential commercialization of DNN models create a compelling need for intellectual property (IP) protection. Recently, DNN watermarking emerges as a plausible IP protection method. Enabling DNN watermarking on embedded devices in a practical setting requires a black-box approach. Existing DNN watermarking frameworks either fail to meet the black-box requirement or are susceptible to several forms of attacks. We propose a watermarking framework by incorporating the author's signature in the process of training DNNs. While functioning normally in regular cases, the resulting watermarked DNN behaves in a different, predefined pattern when given any signed inputs, thus proving the authorship. We demonstrate an example implementation of the framework on popular image classification datasets and show that strong watermarks can be embedded in the models.
Recent advances in adversarial Deep Learning (DL) have opened up a largely unexplored surface for malicious attacks jeopardizing the integrity of autonomous DL systems. With the wide-spread usage of DL in critical and time-sensitive applications, including unmanned vehicles, drones, and video surveillance systems, online detection of malicious inputs is of utmost importance. We propose DeepFense, the first end-to-end automated framework that simultaneously enables efficient and safe execution of DL models. DeepFense formalizes the goal of thwarting adversarial attacks as an optimization problem that minimizes the rarely observed regions in the latent feature space spanned by a DL network. To solve the aforementioned minimization problem, a set of complementary but disjoint modular redundancies are trained to validate the legitimacy of the input samples in parallel with the victim DL model. DeepFense leverages hardware/software/algorithm co-design and customized acceleration to achieve just-in-time performance in resource-constrained settings. The proposed countermeasure is unsupervised, meaning that no adversarial sample is leveraged to train modular redundancies. We further provide an accompanying API to reduce the non-recurring engineering cost and ensure automated adaptation to various platforms. Extensive evaluations on FPGAs and GPUs demonstrate up to two orders of magnitude performance improvement while enabling online adversarial sample detection.
Deep learning algorithms have demonstrated super-human capabilities in many cognitive tasks, such as image classification and speech recognition. As a result, there is an increasing interest in deploying neural networks (NNs) on low-power processors found in always-on systems, such as those based on Arm Cortex-M microcontrollers. In this paper, we discuss the challenges of deploying neural networks on microcontrollers with limited memory, compute resources and power budgets. We introduce CMSIS-NN, a library of optimized software kernels to enable deployment of NNs on Cortex-M cores. We also present techniques for NN algorithm exploration to develop light-weight models suitable for resource constrained systems, using keyword spotting as an example.
Recent breakthroughs in Neural Architecture Search (NAS) have achieved state-of-the-art performance in many tasks such as image classification and language understanding. However, most existing works optimize only for model accuracy and largely ignore other important factors imposed by the underlying hardware and devices, such as latency and energy, when making inferences. In this paper, we first introduce the problem of NAS and provide a survey of recent works. Then we take a deep dive into two recent advances that extend NAS into multi-objective frameworks: MONAS [19] and DPP-Net [10]. Both MONAS and DPP-Net are capable of optimizing accuracy together with other objectives imposed by devices, searching for neural architectures that can be best deployed on a wide spectrum of devices: from embedded systems and mobile devices to workstations. Experimental results show that architectures found by MONAS and DPP-Net achieve Pareto optimality w.r.t. the given objectives for various devices.
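Since Pareto optimality over accuracy and device cost is central here, the small helper below filters a set of candidate architectures down to their Pareto front over two minimized objectives; the candidate numbers are made up, and MONAS/DPP-Net's actual search procedures are not shown.

```python
# Keep only the non-dominated candidates over (error, latency), both minimized.
def pareto_front(points):
    """points: list of (error, latency) tuples."""
    front = []
    for p in points:
        dominated = any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points)
        if not dominated:
            front.append(p)
    return front

candidates = [(0.08, 12.0), (0.10, 7.0), (0.09, 9.0), (0.12, 6.5), (0.08, 15.0)]
print(sorted(pareto_front(candidates)))
# -> [(0.08, 12.0), (0.09, 9.0), (0.10, 7.0), (0.12, 6.5)]
```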
Recent breakthroughs in Machine Learning (ML) applications, and especially in Deep Learning (DL), have made DL models a key component in almost every modern computing system. The increased popularity of DL applications deployed on a wide spectrum of platforms (from mobile devices to datacenters) has resulted in a plethora of design challenges related to the constraints introduced by the hardware itself. "What is the latency or energy cost of an inference made by a Deep Neural Network (DNN)?" "Is it possible to predict this latency or energy consumption before a model is even trained?" "If so, how can machine learners take advantage of these models to design the hardware-optimal DNN for deployment?" From lengthening the battery life of mobile devices to reducing the runtime requirements of DL models executing in the cloud, the answers to these questions have drawn significant attention.
One cannot optimize what is not properly modeled. Therefore, it is important to understand the hardware efficiency of DL models when serving inferences, before even training the model. This key observation has motivated the use of predictive models to capture the hardware performance or energy efficiency of ML applications. Furthermore, ML practitioners are currently challenged with the task of designing the DNN model, i.e., of tuning the hyper-parameters of the DNN architecture, while optimizing for both the accuracy of the DL model and its hardware efficiency. Therefore, state-of-the-art methodologies have proposed hardware-aware hyper-parameter optimization techniques. In this paper, we provide a comprehensive assessment of state-of-the-art work and selected results on hardware-aware modeling and optimization for ML applications. We also highlight several open questions that are poised to give rise to novel hardware-aware designs in the next few years, as DL applications continue to significantly impact associated hardware systems and platforms.