DAC 2018 TOC

Ensemble learning for effective run-time hardware-based malware detection: a comprehensive analysis and classification

Sayadi
Hossein

Malware detection at the hardware level has emerged recently as a promising solution
to improve the security of computing systems. Hardware-based malware detectors take
advantage of Machine Learning (ML) classifiers to detect patterns of malicious …

DeepSecure: scalable provably-secure deep learning

Rouhani
Bita Darvish

This paper presents DeepSecure, a scalable and provably secure Deep Learning
(DL) framework that is built upon automated design, efficient logic synthesis, and
optimization methodologies. DeepSecure targets scenarios in which neither of the …

DWE: decrypting learning with errors with errors

Bian
Song

The Learning with Errors (LWE) problem is a novel foundation of a variety of cryptographic
applications, including quantumly-secure public-key encryption, digital signature,
and fully homomorphic encryption. In this work, we propose an approximate …

Reverse engineering convolutional neural networks through side-channel information
leaks

Hua
Weizhe

A convolutional neural network (CNN) model represents a crucial piece of intellectual
property in many applications. Revealing its structure or weights would leak confidential
information. In this paper we present novel reverse-engineering attacks on …

OFTL: ordering-aware FTL for maximizing performance of the journaling file system

Park
Daekyu

Journaling in the ext4 file system employs two FLUSH commands to make its data durable,
even though FLUSH is more expensive than ordinary write operations. In this
paper, to halve the number of FLUSH commands, we propose an efficient FTL, called
…

LAWN: boosting the performance of NVMM file system through reducing write amplification

Wang
Chundong

Byte-addressable non-volatile memories can be used with DRAM to build a hybrid memory
system of volatile/non-volatile main memory (NVMM). NVMM file systems demand consistency
techniques such as logging and copy-on-write to guarantee data consistency in …

FastGC: accelerate garbage collection via an efficient copyback-based data migration in SSDs

Wu
Fei

Copyback is an advanced command contributing to accelerating data migration in garbage
collection (GC). Unfortunately, detecting copyback feasibility (whether copyback can
be carried out with assurable reliability) against data corruption in the …

Dynamic management of key states for reinforcement learning-assisted garbage collection
to reduce long tail latency in SSD

Kang
Wonkyung

Garbage collection (GC) is one of the main causes of the long-tail latency problem in
storage systems. Long-tail latency due to GC is more than 100 times greater than the
average latency at the 99th percentile. Therefore, due to such a long tail latency, …

WB-trees: a meshed tree representation for FinFET analog layout designs

Lu
Yu-Sheng

The emerging design requirements with the FinFET technology, along with traditional
geometrical constraints, make the FinFET-based analog placement even more challenging.
Previous works can handle only partial FinFET-induced design constraints because …

Analog placement with current flow and symmetry constraints using PCP-SP

Patyal
Abhishek

Modern analog placement techniques require consideration of current path and symmetry
constraints. The symmetry pairs can be efficiently packed using the symmetry island
configurations, but not all these configurations result in minimum gate …

Multi-objective Bayesian optimization for analog/RF circuit synthesis

Lyu
Wenlong

In this paper, a novel multi-objective Bayesian optimization method is proposed for
the sizing of analog/RF circuits. The proposed approach follows the framework of Bayesian
optimization to balance the exploitation and exploration. Gaussian processes (…

Calibrating process variation at system level with in-situ low-precision transfer
learning for analog neural network processors

Jia
Kaige

Process Variation (PV) may cause accuracy loss in analog neural network (ANN)
processors, making them hard to scale down and degrading their feasibility.
This paper first analyses the impact of PV on the performance of ANN chips. It then proposes
…

DPS: dynamic precision scaling for stochastic computing-based deep neural networks

Sim
Hyeonuk

Stochastic computing (SC) is a promising technique with advantages such as low-cost,
low-power, and error-resilience. However, so far SC-based CNN (convolutional neural
network) accelerators have been kept to relatively small CNNs only, primarily due
to …

Dyhard-DNN: even more DNN acceleration with dynamic hardware reconfiguration

Putic
Mateja

Deep Neural Networks (DNNs) have demonstrated their utility across a wide range of
input data types, usable across diverse computing substrates, from edge devices to
datacenters. This broad utility has resulted in myriad hardware accelerator …

Exploring the programmability for deep learning processors: from architecture to tensorization

Chen
Chixiao

This paper presents an instruction and Fabric Programmable Neuron Array (iFPNA) architecture, its 28nm CMOS chip prototype, and a compiler for
the acceleration of a variety of deep learning neural networks (DNNs) including convolutional
neural networks (…

LCP: a layer clusters paralleling mapping method for accelerating inception and residual
networks on FPGA

Lin
Xinhan

Deep convolutional neural networks (DCNNs) have been widely used in various AI applications.
Inception and Residual are two promising structures adopted in many important modern
DCNN models, including AlphaGo Zero’s model. These structures allow …

Ares: a framework for quantifying the resilience of deep neural networks

Reagen
Brandon

As the use of deep neural networks continues to grow, so does the fraction of compute
cycles devoted to their execution. This has led the CAD and architecture communities
to devote considerable attention to building DNN hardware. Despite these efforts,
…

DeepN-JPEG: a deep neural network favorable JPEG-based image compression framework

Liu
Zihao

As one of the most fascinating machine learning techniques, deep neural network (DNN)
has demonstrated excellent performance in various intelligent tasks such as image
classification. DNN achieves such performance, to a large extent, by performing expensive
…

Thundervolt: enabling aggressive voltage underscaling and timing error resilience for energy efficient
deep learning accelerators

Zhang
Jeff

Hardware accelerators are being increasingly deployed to boost the performance and
energy efficiency of deep neural network (DNN) inference. In this paper we propose
Thundervolt, a new framework that enables aggressive voltage underscaling of high-…

Loom: exploiting weight and activation precisions to accelerate convolutional neural networks

Sharify
Sayeh

Loom (LM), a hardware inference accelerator for Convolutional Neural Networks (CNNs), is presented.
In LM every bit of data precision that can be saved translates to proportional performance
gains. For both weights and activations LM exploits profile-…

Parallelizing SRAM arrays with customized bit-cell for binary neural networks

Liu
Rui

Recent advances in deep neural networks (DNNs) have shown Binary Neural Networks (BNNs)
are able to provide a reasonable accuracy on various image datasets with a significant
reduction in computation and memory cost. In this paper, we explore two BNNs: …

An ultra-low energy internally analog, externally digital vector-matrix multiplier
based on NOR flash memory technology

Mahmoodi
M. Reza

Vector-matrix multiplication (VMM) is a core operation in many signal and data processing
algorithms. Previous work showed that analog multipliers based on nonvolatile memories
have superior energy efficiency as compared to digital counterparts at low-…

Coding approach for low-power 3D interconnects

Bamberg
Lennart

Through-silicon vias (TSVs) in 3D ICs show a significant power consumption, which
can be reduced using coding techniques. This work presents an approach which reduces
the TSV power consumption by a signal-aware bit assignment which includes inversions
…

A novel 3D DRAM memory cube architecture for space applications

Agnesina
Anthony

The first mainstream products in 3D IC design are memory devices where multiple memory
tiers are horizontally integrated to offer manifold improvements compared with their
2D counterparts. Unfortunately, none of these existing 3D memory cubes are ready …

A general graph based pessimism reduction framework for design optimization of timing
closure

Peng
Fulin

In this paper, we develop a general pessimism reduction framework for design optimization
of timing closure. Although the modified graph based timing analysis (mGBA) slack
model can be readily formulated into a quadratic programming problem with …

Virtualsync: timing optimization by synchronizing logic waves with sequential and combinational
components as delay units

Zhang
Grace Li

In digital circuit designs, sequential components such as flip-flops are used to synchronize
signal propagations. Logic computations are aligned at and thus isolated by flip-flop
stages. Although this fully synchronous style can reduce design efforts …

Noise-aware DVFS transition sequence optimization for battery-powered IoT devices

Luo
Shaoheng

Low power system-on-chips (SoCs) are now at the heart of Internet-of-Things (IoT)
devices, which are well known for their bursty workloads and limited energy storage
— usually in the form of tiny batteries. To ensure battery lifetime, DVFS has become
…

Accurate processor-level wirelength distribution model for technology pathfinding
using a modernized interpretation of Rent’s rule

Prasad
Divya

Faithful system-level modeling is vital to design and technology pathfinding, and
requires accurate representation of interconnects. In this study, Rent’s rule is modernized
to cater to advanced technology and design, and applied to derive a priori …

Semi-automatic safety analysis and optimization

Munk
Peter

The complexity of safety-critical E/E-systems within the automotive domain is continuously
increasing. At the same time, functional safety standards such as the ISO 26262 prescribe
analysis methods like the Fault Tree Analysis (FTA) and Failure Mode …

Reasoning about safety of learning-enabled components in autonomous cyber-physical
systems

Tuncali
Cumhur Erkan

We present a simulation-based approach for generating barrier certificate functions
for safety verification of cyber-physical systems (CPS) that contain neural network-based
controllers. A linear programming solver is utilized to find a candidate …

Runtime monitoring for safety of intelligent vehicles

Watanabe
Kosuke

Advanced driver-assistance systems (ADAS), autonomous driving, and connectivity have
enabled a range of new features, but also made automotive design more complex than
ever. Formal verification can be applied to establish functional correctness, but
its …

Revisiting context-based authentication in IoT

Miettinen
Markus

The emergence of IoT poses new challenges towards solutions for authenticating numerous
very heterogeneous IoT devices to their respective trust domains. Using passwords
or pre-defined keys has drawbacks that limit their use in IoT scenarios. Recent …

MAXelerator: FPGA accelerator for privacy preserving multiply-accumulate (MAC) on cloud servers

Hussain
Siam U.

This paper presents MAXelerator, the first hardware accelerator for privacy-preserving
machine learning (ML) on cloud servers. Cloud-based ML is being increasingly employed
in various data sensitive scenarios. While it enhances both efficiency and …

Hypernel: a hardware-assisted framework for kernel protection without nested paging

Kwon
Donghyun

Large OS kernels always suffer from attacks due to their numerous inherent vulnerabilities.
To protect the kernel, hypervisors have been employed by many security solutions.
However, relying on a hypervisor has a detrimental impact on the system …

Reducing the overhead of authenticated memory encryption using delta encoding and
ECC memory

Yitbarek
Salessawi Ferede

Data stored in an off-chip memory, such as DRAM or non-volatile main memory, can potentially
be extracted or tampered with by an attacker with physical access to a device. Protecting
against such attacks requires storing message authentication codes and counters – …

Reducing time and effort in IC implementation: a roadmap of challenges and solutions

Kahng
Andrew B.

To reduce time and effort in IC implementation, fundamental challenges must be solved.
First, the need for (expensive) humans must be removed wherever possible. Humans are
skilled at predicting downstream flow failures, evaluating key early decisions …

Efficient reinforcement learning for automating human decision-making in SoC design

Sadasivam
Shankar

The exponential growth in PVT corners due to Moore’s law scaling, and the increasing
demand for consumer applications and longer battery life in mobile devices, have ushered
in significant cost and power-related challenges for designing and productizing …

Compensated-DNN: energy efficient low-precision deep neural networks by compensating quantization errors

Jain
Shubham

Deep Neural Networks (DNNs) represent the state-of-the-art in many Artificial Intelligence
(AI) tasks involving images, videos, text, and natural language. Their ubiquitous
adoption is limited by the high computation and storage requirements of DNNs, …

Thermal-aware optimizations of ReRAM-based neuromorphic computing systems

Beigi
Majed Valad

ReRAM-based systems are attractive implementation alternatives for neuromorphic computing
because of their high speed and low design cost. In this work, we investigate the
impact of temperature on the ReRAM-based neuromorphic architectures and show how …

Compiler-guided instruction-level clock scheduling for timing speculative processors

Fan
Yuanbo

Despite the significant promise that circuit-level timing speculation has for enabling
operation in marginal conditions, overheads associated with recovery prove to be a
serious drawback. We show that fine-grained clock adjustment guided by the compiler
…

SRAM based opportunistic energy efficiency improvement in dual-supply near-threshold
processors

Gu
Yunfei

Energy-efficient microprocessors are essential for a wide range of applications. While
near-threshold computing is a promising technique to improve energy efficiency, optimal
supply demands from logic core and on-chip memory are conflicting. In this …

Enhancing workload-dependent voltage scaling for energy-efficient ultra-low-power
embedded systems

Mohan
Veni

Ultra-low-power (ULP) chipsets are in higher demand than ever due to the proliferation
of ULP embedded systems to support growing applications like the Internet of Things
(IoT), wearables and sensor networks. Since ULP systems are also cost constrained,
…

Efficient and reliable power delivery in voltage-stacked manycore system with hybrid
charge-recycling regulators

Zou
An

Voltage stacking (VS) fundamentally improves power delivery efficiency (PDE) by series-stacking
multiple voltage domains to eliminate explicit step-down voltage conversion and reduce
energy loss along the power delivery path. However, it suffers from …

Exact algorithms for delay-bounded Steiner arborescences

Held
Stephan

Rectilinear Steiner arborescences under linear delay constraints play an important
role for buffering. We present exact algorithms for either minimizing the total length
subject to delay constraints, or minimizing the total length plus the (weighted) …

Efficient multi-layer obstacle-avoiding region-to-region rectilinear Steiner tree
construction

Wang
Run-Yi

As Engineering Change Order (ECO) has attracted substantial attention in modern VLSI
design, the open net problem, which aims at constructing a shortest obstacle-avoiding
path to reconnect the net shapes in an open net, becomes more critical in the ECO
…

Obstacle-avoiding open-net connector with precise shortest distance estimation

Fang
Guan-Qi

At the end of digital integrated circuit (IC) design flow, some nets may still be
left open due to engineering change order (ECO). Resolving these opens could be quite
challenging for some huge nets such as power ground nets because of a large number
of …

COSAT: congestion, obstacle, and slew aware tree construction for multiple power domain design

Lu
Chien-Pang

Slew fixing, which ensures correct signal propagation, is essential during timing
closure of IC design flow. Conventionally, gate sizing, Vt swapping, or buffer insertion
is adopted to locally fix the slew violation on a single gate. Nevertheless, when
…

A machine learning framework to identify detailed routing short violations from a
placed netlist

Tabrizi
Aysa Fakheri

Detecting and preventing routing violations has become a critical issue in physical
design, especially in the early stages. Lack of correlation between global and detailed
routing congestion estimations and the long runtime required to frequently …

DSA-friendly detailed routing considering double patterning and DSA template assignments

Yu
Hai-Juan

As integrated circuit technology nodes continue to shrink, dense via distribution
becomes a severe challenge, requiring multiple masks to avoid spacing violations in
via layers. Meanwhile, the directed self-assembly (DSA) technique shows a great promise
…

Developing synthesis flows without human knowledge

Yu
Cunxi

Design flows are the explicit combinations of design transformations, primarily involved
in synthesis, placement and routing processes, to accomplish the design of Integrated
Circuits (ICs) and System-on-Chip (SoC). Mostly, the flows are developed based …

Efficient computation of ECO patch functions

Dao
Ai Quoc

Engineering Change Orders (ECO) modify a synthesized netlist after its specification
has changed. ECO is divided into two major tasks: finding target signals whose functions
should be updated and synthesizing the patch that produces the desired change. …

Canonical computation without canonical representation

Mishchenko
Alan

A representation of a Boolean function is canonical if, given a variable order, only one instance of the representation is possible for
the function. A computation is canonical if the result depends only on the Boolean function and a variable order, and …

SAT based exact synthesis using DAG topology families

Haaswijk
Winston

SAT based exact synthesis is a powerful technique, with applications in logic optimization,
technology mapping, and synthesis for emerging technologies. However, its runtime
behavior can be unpredictable and slow. In this paper, we propose to add a new …

Efficient batch statistical error estimation for iterative multi-level approximate
logic synthesis

Su
Sanbao

Approximate computing is an emerging energy-efficient paradigm for error-resilient
applications. Approximate logic synthesis (ALS) is an important field of it. To improve
the existing ALS flows, one key issue is to derive a more accurate and efficient …

BLASYS: approximate logic synthesis using Boolean matrix factorization

Hashemi
Soheil

Approximate computing is an emerging paradigm where design accuracy can be traded
off for benefits in design metrics such as design area, power consumption or circuit
complexity. In this work, we present a novel paradigm to synthesize approximate …

Optimized I/O determinism for emerging NVM-based NVMe SSD in an enterprise system

Kim
Seonbong

Non-volatile memory express (NVMe) over peripheral component interconnect express
(PCIe) has been adopted in the storage system to provide low latency and high throughput.
NVMe allows a host system to reduce latency because it offers a high parallel …

Improving runtime performance of deduplication system with host-managed SMR storage
drives

Wu
Chun-Feng

Due to the cost consideration for data storage, high-areal-density shingled-magnetic-recording
(SMR) drives and data deduplication techniques are getting popular in many data storage
services for the improvement of profit per storage unit. However, …

Wear leveling for crossbar resistive memory

Wen
Wen

Resistive Memory (ReRAM) is an emerging non-volatile memory technology that has many
advantages over conventional DRAM. ReRAM crossbar has the smallest 4F² planar cell size and thus is widely adopted for constructing dense memory with large
capacity. …

RADAR: a 3D-ReRAM based DNA alignment accelerator architecture

Huangfu
Wenqin

Next Generation Sequencing (NGS) technology has become an indispensable tool for studying
genomics, resulting in an exponential growth of biological data. Booming data volume
demands significant computational resources and creates challenges for ‘…

Mamba: closing the performance gap in productive hardware development frameworks

Jiang
Shunning

Modern high-level languages bring compelling productivity benefits to hardware design
and verification. For example, hardware generation and simulation frameworks (HGSFs)
use a single “host” language for parameterization, static elaboration, test bench
…

ACED: a hardware library for generating DSP systems

Wang
Angie

Designers translate DSP algorithms into application-specific hardware via primitives
composed in various ways for different architectural realizations. Despite sharing
underlying algorithms and hardware constructs, designs are often difficult to reuse,
…

PARM: power supply noise aware resource management for NoC based
multicore systems in the dark silicon era

Raparti
Venkata Yaswanth

Reliability is a major concern in chip multi-processors (CMPs) due to shrinking technology
and low operating voltages. Today’s processors designed at sub-10nm technology nodes
have high device densities and fast switching frequencies that cause …

Aging-constrained performance optimization for multi cores

Khdr
Heba

Circuit aging has become a dire design concern and hence it is considered a primary
design constraint. Current practice to cope with this problem is to apply (too) conservative
means. In contrast, we introduce a far less restrictive approach by …

A measurement system for capacitive PUF-based security enclosures

Obermaier
Johannes

Battery-backed security enclosures that are permanently monitored for penetration
and tampering are common solutions for providing physical integrity to multi-chip
embedded systems. This paper presents a well-tailored measurement system for a …

It’s hammer time: how to attack (rowhammer-based) DRAM-PUFs

Zeitouni
Shaza

Physically Unclonable Functions (PUFs) are still considered promising technology as
building blocks in cryptographic protocols. While most PUFs require dedicated circuitry,
recent research leverages DRAM hardware for PUFs due to its intrinsic properties …

CamPUF: physically unclonable function based on CMOS image sensor fixed pattern noise

Kim
Younghyun

Physically unclonable functions (PUFs) have proved to be an effective measure for
secure device authentication and key generation. We propose a novel PUF design, named
CamPUF, based on commercial off-the-shelf CMOS image sensors, which are ubiquitously
…

Tamper-resistant pin-constrained digital microfluidic biochips

Tang
Jack

Digital microfluidic biochips (DMFBs)—an emerging technology that implements bioassays
through manipulation of discrete fluid droplets—are vulnerable to actuation tampering
attacks, where a malicious adversary modifies control signals for the …

Approximation-aware coordinated power/performance management for heterogeneous multi-cores

Kanduri
Anil

Run-time resource management of heterogeneous multi-core systems is challenging due
to i) dynamic workloads, that often result in ii) conflicting knob actuation decisions,
which potentially iii) compromise on performance for thermal safety. We present a
…

QoS-aware stochastic power management for many-cores

Pathania
Anuj

A many-core processor can execute hundreds of multi-threaded tasks in parallel on
its 100s – 1000s of processing cores. When deployed in a Quality of Service (QoS)-based
system, the many-core must execute a task at a target QoS. The amount of processing
…

Employing classification-based algorithms for general-purpose approximate computing

Oliveira
Geraldo F.

Approximate computing has recently reemerged as a design solution for additional performance
and energy improvements at the cost of output quality. In this paper, we propose using
a tree-based classification algorithm as an approximation tool for …

Using imprecise computing for improved non-preemptive real-time scheduling

Huang
Lin

Conventional hard real-time scheduling is often overly pessimistic due to the worst
case execution time estimation. The pessimism can be mitigated by exploiting imprecise
computing in applications where occasional small errors are acceptable. This …

A modular digital VLSI flow for high-productivity SoC design

Khailany
Brucek

A high-productivity digital VLSI flow for designing complex SoCs is presented. The
flow includes high-level synthesis tools, an object-oriented library of synthesizable
SystemC and C++ components, and a modular VLSI physical design approach based on …

BaseJump STL: SystemVerilog needs a standard template library for hardware design

Taylor
Michael Bedford

We propose a Standard Template Library (STL) for synthesizable SystemVerilog that
sharply reduces the time required to design digital circuits. We overview the principles
that underlie the design of the open-source BaseJump STL, including light-weight …

TRIG: hardware accelerator for inference-based applications and experimental demonstration
using carbon nanotube FETs

Hills
Gage

The energy efficiency demands of future abundant-data applications, e.g., those which
use inference-based techniques to classify large amounts of data, exceed the capabilities
of digital systems today. Field-effect transistors (FETs) built using …

OPERON: optical-electrical power-efficient route synthesis for on-chip signals

Liu
Derong

As VLSI technology scales to deep sub-micron, optical interconnect becomes an attractive
alternative for on-chip communication. Traditional optical routing works mainly
optimize path loss, and few works explicitly exploit the optical-electrical …

Soft-FET: phase transition material assisted soft switching field effect transistor for supply
voltage droop mitigation

Teja
Subrahmanya

A novel Phase Transition Material (PTM) assisted soft switching transistor architecture
named “Soft-FET” is proposed for supply voltage droop mitigation. By utilizing the
abrupt phase transition mechanism in PTMs, the proposed Soft-FET achieves soft …

Ultralow power acoustic feature-scoring using Gaussian I-V transistors

Trivedi
Amit Ranjan

This paper discusses energy-efficient acoustic feature-scoring using transistors with
Gaussian-shaped Ids-Vgs. Acoustic feature-scoring is a critical step in speech recognition tasks such as
speaker recognition. Suited to the transistor, we discuss a …

Test cost reduction for X-value elimination by scan slice correlation analysis

Chae
Hyunsu

X-values in test output responses corrupt an output response compaction and can cause
a fault coverage loss. X-Masking and X-Canceling MISR methods have been suggested to eliminate X-values; however, there are control data volume and test time overhead …

Cross-layer fault-space pruning for hardware-assisted fault injection

Dietrich
Christian

With shrinking structure sizes, soft-error mitigation has become a major challenge
in the design and certification of safety-critical embedded systems. Their robustness
is quantified by extensive fault-injection campaigns, which on hardware level can
…

A machine learning based hard fault recuperation model for approximate hardware accelerators

Taher
Farah Naz

Continuous pursuit of higher performance and energy efficiency has led to heterogeneous
SoC that contains multiple dedicated hardware accelerators. These accelerators exploit
the inherent parallelism of tasks and are often tolerant to inaccuracies in …

SOTERIA: exploiting process variations to enhance hardware security with photonic NoC architectures

Chittamuru
Sai Vineel Reddy

Photonic networks-on-chip (PNoCs) enable high bandwidth on-chip data transfers by
using photonic waveguides capable of dense-wave-length-division-multiplexing (DWDM)
for signal traversal and microring resonators (MRs) for signal modulation. A Hardware
…

LEAD: learning-enabled energy-aware dynamic voltage/frequency scaling in NoCs

Clark
Mark

Network on Chips (NoCs) are the interconnect fabric of choice for multicore processors
due to their superiority over traditional buses and crossbars in terms of scalability.
While NoCs offer several advantages, they still suffer from high static and …

Subutai: distributed synchronization primitives in NoC interfaces for legacy parallel-applications

Cataldo
Rodrigo

Parallel applications are essential for efficiently using the computational power
of a Multiprocessor System-on-Chip (MPSoC). Unfortunately, these applications do not
scale effortlessly with the number of cores because of synchronization operations
that …

Packet pump: overcoming network bottleneck in on-chip interconnects for GPGPUs

Cheng
Xianwei

In order to fully exploit GPGPU’s parallel processing power, on-chip interconnects
need to provide bandwidth efficient data communication. GPGPUs exhibit a many-to-few-to-many
traffic pattern which makes the memory controller connected routers the …

STASH: security architecture for smart hybrid memories

Swami
Shivam

Whereas emerging non-volatile memories (NVMs) are low power, dense, scalable alternatives
to DRAM, the high latency and low endurance of these NVMs limit the feasibility of
NVM-only memory systems. Smart hybrid memories (SHMs) that integrate NVM, DRAM, …

ACME: advanced counter mode encryption for secure non-volatile memories

Swami
Shivam

Modern computing systems that integrate emerging non-volatile memories (NVMs) are
vulnerable to classical security threats to data confidentiality (e.g., stolen DIMM
and bus snooping attacks) as well as new security threats to system availability (e.g.,
…

CASTLE: compression architecture for secure low latency, low energy, high endurance NVMs

Palangappa
Poovaiah M.

CASTLE is a Compression-based main memory Architecture realizing a read-decrypt-free
(i.e., write-only) Secure solution for low laTency, Low Energy, high endurance non-volatile
memories (NVMs). CASTLE integrates pattern-based data compression and …

A collaborative defense against wear out attacks in non-volatile processors

Cronin
Patrick

While the Internet of Things (IoT) keeps advancing, its full adoption is continually
blocked by power delivery problems. One promising solution is Non-Volatile (NV) processors,
which harvest energy for themselves and employ a NV memory hierarchy. This …

Protecting the supply chain for automotives and IoTs

Ray
Sandip

Modern automotive systems and IoT devices are designed through a highly complex, globalized,
and potentially untrustworthy supply chain. Each player in this supply chain may (1)
introduce sensitive information and data (collectively termed “assets”) …

Reconciling remote attestation and safety-critical operation on simple IoT devices

Carpent
Xavier

Remote attestation (RA) is a means of malware detection, typically realized as an
interaction between a trusted verifier and a potentially compromised remote device
(prover). RA is especially relevant for low-end embedded devices that are incapable
of …

Formal security verification of concurrent firmware in SoCs using instruction-level
abstraction for hardware

Huang
Bo-Yuan

Formal security verification of firmware interacting with hardware in modern Systems-on-Chip
(SoCs) is a critical research problem. This faces the following challenges: (1) design
complexity and heterogeneity, (2) semantics gaps between software and …

Application level hardware tracing for scaling post-silicon debug

Pal
Debjit

We present a method for selecting trace messages for post-silicon validation of Systems-on-a-Chips
(SoCs) with diverse usage scenarios. We model specifications of interacting flows
in typical applications. Our method optimizes trace buffer utilization …

Specification-driven automated conformance checking for virtual prototype and post-silicon
designs

Gu
Haifeng

Due to the increasing complexity of System-on-Chip (SoC) design, how to ensure that
silicon implementations conform to their high-level specifications is becoming a major
challenge. To address this problem, we propose a novel specification-driven …

Formal micro-architectural analysis of on-chip ring networks

van Wesel
Perry

In the realm of Multi-Processors System-on-Chip (MPSoC’s), the Network-on-Chip (NoC)
connecting all system components plays a crucial role in the overall correctness and
performance of the system. Recent papers have proposed several ring based NoC …

HFMV: hybridizing formal methods and machine learning for verification of analog and mixed-signal
circuits

Hu
Hanbin

With increasing design complexity and robustness requirement, analog and mixed-signal
(AMS) verification manifests itself as a key bottleneck. While formal methods and
machine learning have been proposed for AMS verification, these two techniques suffer
…

Cost-aware patch generation for multi-target function rectification of engineering
change orders

Zhang
He-Teng

The increasing system complexity makes engineering change order (ECO) mostly inevitable
and a common practice in integrated circuit design. Despite extensive research being
made, prior methods are not effectively applicable to instances where …

Modelling multicore contention on the AURIX™ TC27x

Díaz
Enrique

Multicores are becoming ubiquitous in automotive. Yet, the expected benefits on integration
are challenged by multicore contention concerns on timing V&V. Worst-case execution
time (WCET) estimates are required as early as possible in the software …

Cache side-channel attacks and time-predictability in high-performance critical real-time
systems

Trilla
David

Embedded computers control an increasing number of systems directly interacting with
humans, while also managing more and more personal or sensitive information. As a result,
both safety and security are becoming ubiquitous requirements in embedded …

Cross-layer dependency analysis with timing dependence graphs

Möstl
Mischa

We present Non-interference Analysis as a model-based method to automatically reveal,
track and analyze end-to-end timing dependencies as part of a cross-layer dependency
analysis in complex systems. Based on revealed timing dependencies of functional …

Brook auto: high-level certification-friendly programming for GPU-powered automotive systems

Trompouki
Matina Maria

Modern automotive systems require increased performance to implement Advanced Driving
Assistance Systems (ADAS). GPU-powered platforms are promising candidates for such
computational tasks, however current low-level programming models challenge the …

Dynamic vehicle software with AUTOCONT

Jakobs
Christine

Future automotive software needs to deal with an increasing level of dynamicity, reasoned
by the wish for connected driving, software updates, and dynamic feature activation.
Such functionalities cannot be properly realized with today’s classic AUTOSAR …

Automated interpretation and reduction of in-vehicle network traces at a large scale

Mrowca
Artur

In modern vehicles, high communication complexity requires cost-effective integration
tests such as data-driven system verification with in-vehicle network traces. With
the growing amount of traces, distributable Big Data solutions for analyses become
…

Atomlayer: a universal reRAM-based CNN accelerator with atomic layer computation

Qiao
Ximing

Although ReRAM-based convolutional neural network (CNN) accelerators have been widely
studied, state-of-the-art solutions suffer from either incapability of training (e.g.,
ISAAC [1]) or inefficiency of inference (e.g., PipeLayer [2]) due to the …

Towards accurate and high-speed spiking neuromorphic systems with data quantization-aware
deep networks

Liu
Fuqiang

Deep Neural Networks (DNNs) have gained immense success in cognitive applications
and greatly pushed today’s artificial intelligence forward. The biggest challenge
in executing DNNs is their extremely data-extensive computations. The computing …

CMP-PIM: an energy-efficient comparator-based processing-in-memory neural network accelerator

Angizi
Shaahin

In this paper, an energy-efficient and high-speed comparator-based processing-in-memory
accelerator (CMP-PIM) is proposed to efficiently execute a novel hardware-oriented
comparator-based deep neural network called CMPNET. Inspired by local binary …

SNrram: an efficient sparse neural network computation architecture based on resistive random-access
memory

Wang
Peiqi

The sparsity in the deep neural networks can be leveraged by methods such as pruning
and compression to help the efficient deployment of large-scale deep neural networks
onto hardware platforms, such as GPU or FPGA, for better performance and power …

Long live TIME: improving lifetime for training-in-memory engines by structured gradient sparsification

Cai
Yi

Deeper and larger Neural Networks (NNs) have made breakthroughs in many fields, but
conventional CMOS-based computing platforms struggle to achieve higher energy efficiency.
RRAM-based systems provide a promising solution to build efficient Training-…

Hierarchical hyperdimensional computing for energy efficient classification

Imani
Mohsen

Brain-inspired Hyperdimensional (HD) computing emulates cognition tasks by computing
with hypervectors rather than traditional numerical values. In HD, an encoder maps
inputs to high dimensional vectors (hypervectors) and combines them to generate a
…

Dadu-P: a scalable accelerator for robot motion planning in a dynamic environment

Lian
Shiqi

As a critical operation in robotics, motion planning consumes lots of time and energy,
especially in a dynamic environment. Through approaches based on general-purpose processors,
it is hard to get a valid planning in real time. We present an …

Data prediction for response flows in packet processing cache

Yamaki
Hayato

We propose a technique to reduce compulsory misses of packet processing cache (PPC),
which largely affects both throughput and energy of core routers. Rather than prefetching
data, our technique called response prediction cache (RPC) speculatively …

PULP-HD: accelerating brain-inspired high-dimensional computing on a parallel ultra-low power
platform

Montagna
Fabio

Computing with high-dimensional (HD) vectors, also referred to as hypervectors, is a
brain-inspired alternative to computing with scalars. Key properties of HD computing
include a well-defined set of arithmetic operations on hypervectors, generality,
…

Active forwarding: eliminate IOMMU address translation for accelerator-rich architectures

Fu
Hsueh-Chun

Accelerator-rich architectures employ IOMMUs to support unified virtual address, but
research shows that they fail to meet the performance and energy requirements of
accelerators. Instead of optimizing the speed/energy of IOMMU address translation,
…

SARA: self-aware resource allocation for heterogeneous MPSoCs

Song
Yang

In modern heterogeneous MPSoCs, the management of shared memory resources is crucial
in delivering end-to-end QoS. Previous frameworks have either focused on singular
QoS targets or the allocation of partitionable resources among CPU applications at
…

PEP: proactive checkpointing for efficient preemption on GPUs

Li
Chen

The demand for multitasking GPUs increases whenever the GPU may be shared by multiple
applications, either spatially or temporally. This requires that GPUs can be preempted
and switch context to a new application while already executing one. Unlike CPUs,…

FMMU: a hardware-accelerated flash map management unit for scalable performance of flash-based
SSDs

Woo
Yeong-Jae

Address translation is increasingly a performance bottleneck in flash-based SSDs (solid
state drives). We propose a hardware-accelerated flash map management unit called
FMMU to speed up the address translation. The FMMU operates in a non-blocking …

Minimizing write amplification to enhance lifetime of large-page flash-memory storage
devices

Wang
Wei-Lin

Due to the decreasing endurance of flash chips, the lifetime of flash drives has become
a critical issue. To resolve this issue, various techniques such as wear-leveling
and error correction code have been proposed to reduce the bit error rates of flash
…

Proactive channel adjustment to improve polar code capability for flash storage devices

Hsu
Kun-Cheng

With the low encoding/decoding complexity and the high error correction capability,
polar code with the support of list-decoding and cyclic redundancy check can outperform
LDPC code in the area of data communication. Thus, it also draws a lot of …

Achieving defect-free multilevel 3D flash memories with one-shot program design

Ho
Chien-Chung

To store the desired data on MLC and TLC flash memories, the conventional programming
strategies need to divide a fixed range of threshold voltage (V_t) window into several
parts. The narrowly partitioned V_t window in turn limits the design of …

Power-based side-channel instruction-level disassembler

Park
Jungmin

Modern embedded computing devices are vulnerable to malware and software piracy
due to insufficient security scrutiny and the complications of continuous patching.
To detect malicious activity as well as protecting the integrity of executable …

Side-channel security of superscalar CPUs: evaluating the impact of micro-architectural features

Barenghi
Alessandro

Side-channel attacks are performed on increasingly complex targets, starting to threaten
superscalar CPUs supporting a complete operating system. The difficulty of both assessing
the vulnerability of a device to them, and validating the effectiveness of …

Electro-magnetic analysis of GPU-based AES implementation

Gao
Yiwen

In this work, for the first time, we investigate Electro-Magnetic (EM) attacks on
GPU-based AES implementation. In detail, we first sample EM traces using a delicate
trigger; then, we build a heuristic leakage model and a novel leakage model to exploit
…

GPU obfuscation: attack and defense strategies

Chakraborty
Abhishek

Conventional attacks against existing logic obfuscation techniques rely on the presence
of an activated hardware for analysis. In reality, obtaining such activated chips
may not always be practical,
especially if the on-chip test structures are …

Measurement-based cache representativeness on multipath programs

Milutinovic
Suzana

Autonomous vehicles in embedded real-time systems increase critical-software size
and complexity whose performance needs are covered with high-performance hardware
features like caches, which however hampers obtaining WCET estimates that hold valid
for …

Resource-aware partitioned scheduling for heterogeneous multicore real-time systems

Han
Jian-Jun

Heterogeneous multicore processors have become popular computing engines for modern
embedded real-time systems recently. However, there is rather limited research on
the scheduling of real-time tasks running on heterogeneous multicore systems with
…

Response-time analysis of DAG tasks supporting heterogeneous computing

Serrano
Maria A.

Hardware platforms are evolving towards parallel and heterogeneous architectures to
overcome the increasing necessity of more performance in the real-time domain. Parallel
programming models are fundamental to exploit the performance capabilities of …

Duet: an OLED & GPU co-management scheme for dynamic resolution adaptation

Lin
Han-Yi

The increasingly high display resolution of mobile devices imposes a further burden
on energy consumption. Existing schemes manage either OLED or GPU power to save energy.
This paper presents the design, algorithm, and implementation of a co-managing …

RAMP: resource-aware mapping for CGRAs

Dave
Shail

Coarse-grained reconfigurable array (CGRA) is a promising solution that can accelerate
even non-parallel loops. Acceleration achieved through CGRAs critically depends on
the goodness of mapping (of loop operations onto the PEs of CGRA), and in …

An architecture-agnostic integer linear programming approach to CGRA mapping

Chin
S. Alexander

Coarse-grained reconfigurable architectures (CGRAs) have gained traction as a potential
solution to implement accelerators for compute-intensive kernels, particularly in
domains requiring hardware programmability. Architecture and CAD for CGRAs are …

Dnestmap: mapping deeply-nested loops on ultra-low power CGRAs

Karunaratne
Manupa

Coarse-Grained Reconfigurable Arrays (CGRAs) provide high performance, energy-efficient
execution of the innermost loops of an application. Most real-world applications,
however, comprise deeply-nested loops with complex and often irregular control
…

Locality aware memory assignment and tiling

Rogers
Samuel

With the trend toward specialization, an efficient memory-path design is vital to
capitalize on customization in data-path. A monolithic memory hierarchy is often highly
inefficient for irregular applications, traditionally targeted for CPUs. New …

GAN-OPC: mask optimization with lithography-guided generative adversarial nets

Yang
Haoyu

Mask optimization has been a critical problem in the VLSI design flow due to the mismatch
between the lithography system and the continuously shrinking feature sizes. Optical
proximity correction (OPC) is one of the prevailing resolution enhancement …

An efficient Bayesian yield estimation method for high dimensional and high sigma
SRAM circuits

Zhai
Jinyuan

With increasing dimension of variation space and computational intensive circuit simulation,
accurate and fast yield estimation of realistic SRAM chip remains a significant and
complicated challenge. In this paper, … Experiment results show that the …

RAIN: a tool for reliability assessment of interconnect networks—physics to software

Abbasinasab
Ali

In this paper, we study the main interconnect aging processes: electromigration, thermomigration
and stress migration and propose comprehensive yet compact models for transient and
steady states based on hydrostatic stress evolution. Our model can be …

A fast and robust failure analysis of memory circuits using adaptive importance sampling
method

Shi
Xiao

Performance failure has become a growing concern for the robustness and reliability
of memory circuits. It is challenging to accurately estimate the extremely small failure
probability when failed samples are distributed in multiple disjoint failure …

SpWA: an efficient sparse winograd convolutional neural networks accelerator on FPGAs

Lu
Liqiang

FPGAs have been an efficient accelerator for CNN inference due to their high performance,
flexibility, and energy-efficiency. To improve the performance of CNNs on FPGAs, fast
algorithms and sparse methods emerge as the most attractive alternatives, which …

Efficient winograd-based convolution kernel implementation on edge devices

Xygkis
Athanasios

The implementation of Convolutional Neural Networks on edge Internet of Things (IoT)
devices is a significant programming challenge, due to the limited computational resources
and the real-time requirements of modern applications. This work focuses on …

An efficient kernel transformation architecture for binary- and ternary-weight neural
network inference

Zheng
Shixuan

While deep convolutional neural networks (CNNs) have emerged as the driving force
of a wide range of domains, their computationally and memory intensive natures hinder
the further deployment in mobile and embedded applications. Recently, CNNs with low-…

Content addressable memory based binarized neural network accelerator using time-domain
signal processing

Choi
Woong

Binarized neural network (BNN) is one of the most promising solutions for low-cost
convolutional neural network acceleration. Since BNN is based on binarized bit-level
operations, there exist great opportunities to reduce power-hungry data transfers
and …

A security vulnerability analysis of SoCFPGA architectures

Chaudhuri
Sumanta

SoCFPGAs or FPGAs integrated on the same die with chip multi processors have made
it to the market in the past years. In this article we analyse various security loopholes,
existing precautions and countermeasures in these architectures. We consider …

Raise your game for split manufacturing: restoring the true functionality through BEOL

Patnaik
Satwik

Split manufacturing (SM) seeks to protect against piracy of intellectual property
(IP) in chip designs. Here we propose a scheme to manipulate both placement and routing
in an intertwined manner, thereby increasing the resilience of SM layouts. Key …

Analysis of security of split manufacturing using machine learning

Zhang
Boyu

This work is the first to analyze the security of split manufacturing using machine
learning, based on data collected from layouts provided by industry, with 8 routing
metal layers, and significant variation in wire size and routing congestion across
…

Inducing local timing fault through EM injection

Ghodrati
Marjan

Electromagnetic fault injection (EMFI) is an efficient class of physical attacks that
can compromise the immunity of secure cryptographic algorithms. Despite successful
EMFI attacks, the effects of electromagnetic injection (EM) on a processor are not
…

IAfinder: identifying potential implicit assumptions to facilitate validation in medical cyber-physical
system

Fu
Zhicheng

According to the U.S. Food and Drug Administration (FDA) medical device recall database,
medical device recalls are at an all-time high. One of the major causes of the recalls
is due to implicit assumptions of which either the medical device operating …

An efficient timestamp-based monitoring approach to test timing constraints of cyber-physical
systems

Mehrabian
Mohammadreza

Formal specifications on temporal behavior of Cyber-Physical Systems (CPS) are essential
for verification of performance and safety. Existing solutions for verifying the satisfaction
of temporal constraints on a CPS are compute and resource intensive …

Runtime adjustment of IoT system-on-chips for minimum energy operation

Golanbari
Mohammad Saber

Energy-constrained Systems-on-Chips (SoC) are becoming major components of many emerging
applications, especially in the Internet of Things (IoT) domain. Although the best
energy efficiency is achieved when the SoC operates in the near-threshold region,
…

Edge-cloud collaborative processing for intelligent internet of things: a case study on smart surveillance

Mudassar
Burhan A.

Limited processing power and memory prevent realization of state of the art algorithms
on the edge level. Offloading computations to the cloud comes with tradeoffs as compression
techniques employed to conserve transmission bandwidth and energy …

Bandwidth-efficient deep learning

Han
Song

Deep learning algorithms are achieving increasingly higher prediction accuracy on
many machine learning tasks. However, applying brute-force programming to data demands
a huge amount of machine power to perform training and inference, and a huge amount
…

Co-design of deep neural nets and neural net accelerators for embedded vision applications

Kwon
Kiseok

Deep Learning is arguably the most rapidly evolving research area in recent years.
As a result it is not surprising that the design of state-of-the-art deep neural net
models proceeds without much consideration of the latest hardware targets, and the
…

Generalized augmented lagrangian and its applications to VLSI global placement

Zhu
Ziran

Global placement dominates the circuit placement process in its solution quality and
efficiency. With increasing design complexity and various design constraints, it is
desirable to develop an efficient, high-quality global placement algorithm for …

Routability-driven and fence-aware legalization for mixed-cell-height circuits

Li
Haocheng

Placement is one of the most critical stages in the physical synthesis flow. Circuits
with increasing numbers of cells of multi-row height have brought challenges to traditional
placers on efficiency and effectiveness. Furthermore, constraints on fence …

PlanarONoC: concurrent placement and routing considering crossing minimization for optical networks-on-chip

Chuang
Yu-Kai

Optical networks-on-chips (ONoCs) have become a promising solution for the on-chip
communication of multi-and many-core systems to provide superior communication bandwidths,
efficiency in power consumption, and latency performance compared to electronic …

Similarity-aware spectral sparsification by edge filtering

Feng
Zhuo

In recent years, spectral graph sparsification techniques that can compute ultra-sparse
graph proxies have been extensively studied for accelerating various numerical and
graph-related applications. Prior nearly-linear-time spectral sparsification …

S2FA: an accelerator automation framework for heterogeneous computing in datacenters

Yu
Cody Hao

Big data analytics using the JVM-based MapReduce framework has become a popular approach
to address the explosive growth of data sizes. Adopting FPGAs in datacenters as accelerators
to improve performance and energy efficiency also attracts increasing …

Automated accelerator generation and optimization with composable, parallel and pipeline
architecture

Cong
Jason

CPU-FPGA heterogeneous architectures feature flexible acceleration of many workloads
to advance computational capabilities and energy efficiency in today’s datacenters.
This advantage, however, is often overshadowed by the poor programmability of FPGAs.
…

TAO: techniques for algorithm-level obfuscation during high-level synthesis

Pilato
Christian

Intellectual Property (IP) theft costs semiconductor design companies billions of
dollars every year. Unauthorized IP copies start from reverse engineering the given
chip. Existing techniques to protect against IP theft aim to hide the IC’s …

Extracting data parallelism in non-stencil kernel computing by optimally coloring
folded memory conflict graph

Escobedo
Juan

Irregular memory access pattern in non-stencil kernel computing renders the well-known
hyperplane- [1], lattice- [2], or tessellation-based [3] HLS techniques ineffective.
We develop an elegant yet effective technique that synthesizes memory-optimal …

SMApproxlib: library of FPGA-based approximate multipliers

Ullah
Salim

The main focus of the existing approximate arithmetic circuits has been on ASIC-based
designs. However, due to the architectural differences between ASICs and FPGAs, comparable
performance gains cannot be achieved for FPGA-based systems by using the …

Sign-magnitude SC: getting 10X accuracy for free in stochastic computing for deep neural networks

Zhakatayev
Aidyn

Stochastic computing (SC) is a promising computing paradigm for applications with
low precision requirement, stringent cost and power restriction. One known problem
with SC, however, is the low accuracy especially with multiplication. In this paper
we …

Area-optimized low-latency approximate multipliers for FPGA-based hardware accelerators

Ullah
Salim

The architectural differences between ASICs and FPGAs limit the effective performance
gains achievable by the application of ASIC-based approximation principles for FPGA-based
reconfigurable computing systems. This paper presents a novel approximate …

Approximate on-the-fly coarse-grained reconfigurable acceleration for general-purpose
applications

Brandalero
Marcelo

Approximate functional unit designs have the potential to reduce power consumption
significantly compared to their precise counterparts; however, few works have investigated
composing them to build generic accelerators. In this work, we do a design-…

LEMAX: learning-based energy consumption minimization in approximate computing with quality
guarantee

Akhlaghi
Vahideh

Approximate computing aims to trade accuracy for energy efficiency. Various approximate
methods have been proposed in the literature that demonstrate the effectiveness of
relaxing accuracy requirements in a specific unit. This provides a basis for …

PIMA-logic: a novel processing-in-memory architecture for highly flexible and energy-efficient
logic computation

Angizi
Shaahin

In this paper, we propose PIMA-Logic, a novel Processing-in-Memory
Architecture for highly flexible and efficient Logic computation. Instead of
integrating complex logic units in cost-sensitive memory, PIMA-Logic …

Columba S: a scalable co-layout design automation tool for microfluidic large-scale integration

Tseng
Tsun-Ming

Microfluidic large-scale integration (mLSI) is a promising platform for high-throughput
biological applications. Design automation for mLSI has made much progress in recent
years. Columba and its succeeding work Columba 2.0 proposed a mathematical …

Design-for-testability for continuous-flow microfluidic biochips

Liu
Chunfeng

Flow-based microfluidic biochips are gaining traction in the microfluidics community
since they enable efficient and low-cost biochemical experiments. These highly integrated
lab-on-a-chip systems, however, suffer from manufacturing defects, which cause …

Design and architectural co-optimization of monolithic 3D liquid state machine-based
neuromorphic processor

Ku
Bon Woong

A liquid state machine (LSM) is a powerful recurrent spiking neural network shown
to be effective in various learning tasks including speech recognition. In this work,
we investigate design and architectural co-optimization to further improve the area-…

Enabling a new era of brain-inspired computing: energy-efficient spiking neural network with ring topology

Bai
Kangjun

The reservoir computing, an emerging computing paradigm, has proven its benefit to
multifarious applications. In this work, we successfully designed and fabricated an
analog delayed feedback reservoir (DFR) chip. Measurement results demonstrate its
rich …

A neuromorphic design using chaotic mott memristor with relaxation oscillation

Yan
Bonan

The recently proposed nanoscale Mott memristor features negative differential resistance
and chaotic dynamics. This work proposes a novel neuromorphic computing system that
utilizes Mott memristors to simplify peripheral circuitry. According to the …

DrAcc: a DRAM based accelerator for accurate CNN inference

Deng
Quan

Modern Convolutional Neural Networks (CNNs) are computation and memory intensive.
Thus it is crucial to develop hardware accelerators to achieve high performance as
well as power/energy-efficiency on resource limited embedded systems. DRAM-based CNN
…

On-chip deep neural network storage with multi-level eNVM

Donato
Marco

One of the biggest performance bottlenecks of today’s neural network (NN) accelerators
is off-chip memory accesses [11]. In this paper, we propose a method to use multi-level,
embedded nonvolatile memory (eNVM) to eliminate all off-chip weight accesses. …

Closed yet open DRAM: achieving low latency and high performance in DRAM memory systems

Subramanian
Lavanya

DRAM memory access is a critical performance bottleneck. To access one cache block,
an entire row needs to be sensed and amplified, data restored into the bitcells and
the bitlines precharged, incurring high latency. Isolating the bitlines and sense
…

VRL-DRAM: improving DRAM performance via variable refresh latency

Das
Anup

A DRAM chip requires periodic refresh operations to prevent data loss due to charge
leakage in DRAM cells. Refresh operations incur significant performance overhead as
a DRAM bank/rank becomes unavailable to service access requests while being …

Enabling union page cache to boost file access performance of NVRAM-based storage
device

Chen
Shuo-Han

Due to the fast access performance, byte-addressability, and non-volatility of non-volatile
random access memory (NVRAM), NVRAM has emerged as a popular candidate for the design
of memory/storage systems on mobile computing systems. For example, the …

FLOSS: FLOw sensitive scheduling on mobile platforms

Zhang
Haibo

Today’s mobile platforms have grown in sophistication to run a wide variety of frame-based
applications. To deliver better QoS and energy efficiency, these applications utilize
multi-flow execution, which exploits hardware-level parallelism across …

Context-aware dataflow adaptation technique for low-power multi-core embedded systems

Jung
Hyeonseok

Today’s embedded systems operate under increasingly dynamic conditions. First, computational
workloads can be either fluctuating or adjustable. Moreover, as many devices are battery-powered,
it is common to have runtime power management technique, which …

Architecture decomposition in system synthesis of heterogeneous many-core systems

Richthammer
Valentina

Determining feasible application mappings for Design Space Exploration (DSE) and run-time
embedding is a challenge for modern many-core systems. The underlying NP-complete
system-synthesis problem faces tremendously complex problem instances due to the …

NNsim: fast performance estimation based on sampled simulation of GPGPU kernels for neural
networks

Kang
Jintaek

Existent GPU simulators are too slow to use for neural networks implemented in GPUs.
For fast performance estimation, we propose a novel hybrid method of analytical performance
modeling and sampled simulation of GPUs. By taking full advantage of …

STAFF: online learning with stabilized adaptive forgetting factor and feature selection algorithm

Gupta
Ujjwal

Dynamic resource management techniques rely on power consumption and performance models
to optimize the operating frequency and utilization of processing elements, such as
CPU and GPU. Despite the importance of these decisions, many existing approaches …

Extensive evaluation of programming models and ISAs impact on multicore soft error
reliability

Rosa
Felipe da

To take advantage of the performance enhancements provided by multicore processors,
new instruction set architectures (ISAs) and parallel programming libraries have been
investigated across multiple industrial segments. This paper investigates the …

Optimized selection of wireless network topologies and components via efficient pruning
of feasible paths

Kirov
Dmitrii

We address the design space exploration of wireless networks to jointly select topology
and component sizing. We formulate the exploration problem as an optimized mapping
problem, where network elements are associated with components from pre-defined …