2021 Symposium on VLSI Circuits

Workshops

**Live Session** Sunday, June 13, 7:00-9:00 (JST)

**Workshop 1**

AI/Machine Learning for Circuit Design and Optimization

Organizer: X. Zhang, IBM Corp.

1. **Machine Learning for Agile IC Design and Manufacturing**, D. Pan, Univ. of Texas at Austin
2. **Machine Learning for Analog and Digital Design**, S. Han, Massachusetts Institute of Technology (MIT)
3. **Reinforcement Learning for Analog EDA**, L. Zhang, Memorial Univ. of Newfoundland
4. **Improving Circuit Design Productivity with Latest ML Methods**, H. (Mark) Ren, NVIDIA Research
5. **Learning to Play the Game of Macro Placement with Deep Reinforcement Learning**, Y.-J. Lee, Google

**Workshop 2**

PPAC Analysis and System-Technology Co-Optimization for 3D Memory-on-Logic IC, Many-Core SOC and AI Computing Applications

Organizers: R. Chen, imec
G. Van der Plas, imec

1. **Time Performance Improvement by Agile Design and 3D Integration**, T. Kuroda, The Univ. of Tokyo
2. **Future of HBM Packaging Technology**, K. Lee, SK hynix Inc.
3. **High-Performance AI Computing and 3D Opportunities**, M. Badaroglu, Qualcomm, Inc.
5. **3D Partitioning Strategies for Memory-on-Logic Designs & Many Core SoCs**, D. Milojevic, Université Libre de Bruxelles
8. **Tackling the Memory Wall via 3D Memory Partitioning: A System Level Perspective**, M. Perumkunnil, imec

**Workshop 3**

Deep Analysis Can Compress the Time to Design Optimum Analog/Mixed-Signal Circuits

Organizers:  A. A. Abidi, Univ. of California, Los Angeles
T. Iizuka, The Univ. of Tokyo

<table>
<thead>
<tr>
<th>Time</th>
<th>Session</th>
</tr>
</thead>
<tbody>
<tr>
<td>7:00–7:10</td>
<td>Introduction [Prof. Asad Abidi]</td>
</tr>
<tr>
<td>7:10–7:15</td>
<td>Elevator Pitch 1 [Prof. Willy Sansen]</td>
</tr>
<tr>
<td>7:15–7:25</td>
<td>Q&amp;A session for 1st talk</td>
</tr>
<tr>
<td>7:40–7:45</td>
<td>Elevator Pitch 2 [Dr. Kejian Shi]</td>
</tr>
<tr>
<td>7:45–7:55</td>
<td>Q&amp;A session for 2nd talk</td>
</tr>
<tr>
<td>7:55–8:00</td>
<td>Elevator Pitch 3 [Dr. Dihang Yang]</td>
</tr>
<tr>
<td>8:00–8:10</td>
<td>Q&amp;A session for 3rd talk</td>
</tr>
<tr>
<td>8:10–8:15</td>
<td>Elevator Pitch 4 [Prof. Shanthi Pavan]</td>
</tr>
<tr>
<td>8:15–8:25</td>
<td>Q&amp;A session for 4th talk</td>
</tr>
<tr>
<td>8:25–8:30</td>
<td>Elevator Pitch 5 [Prof. Tetsuya Iizuka]</td>
</tr>
<tr>
<td>8:30–8:40</td>
<td>Q&amp;A session for 5th talk</td>
</tr>
<tr>
<td>8:40–9:00</td>
<td>Panel Discussion</td>
</tr>
</tbody>
</table>

1. **Optimum Op Amp Design in One Day**, W. Sansen, KU Leuven
3. **Frequency Synthesizer Design in Two Days**, D. Yang, Broadcom Ltd.
4. **Delta-Sigma A/D Converter Design**, S. Pavan, Indian Institute of Technology Madras
5. **Nyquist A/D Converter Design in Four Days**, T. Iizuka, The Univ. of Tokyo

**Workshop 4**

Materials Introductions - A Path Forward for All Devices

Organizer:  D. Thompson, Applied Materials, Inc.

2. **Advanced Logic Focus**, C.-O. Chui, TSMC
4. **Capital Equipment as the Bridge from Lab to Fab: High k Materials as an Object Lesson**, R. Clark, TEL Technology Center, America, LLC
5. **New Devices from New Materials: Why We Need Them and How to Get Them**, G. Yeric, Cerfe Labs, Inc.
6. **Integrating New Materials for Device Fabrication**, R. M. Pearlstein, EMD Electronics
7. **The Role of Academia in Identifying Compelling Devices and Materials**, S. Salahuddin, Univ. of California, Berkeley

Mentoring & Networking Events

How to Navigate Academia and Industry in a Virtual World: Mentoring for Young Professionals
Organized by SSCS & EDS Women in Circuits and Young Professionals
- SESSION ONE - June 13 09:00 JST / June 12 17:00 PDT / June 13 02:00 CEST
- SESSION TWO - June 18 00:00 JST / June 17 08:00 PDT / June 17 17:00 CEST

Satellite Workshops

2021 Silicon Nanoelectronics Workshop
Sunday, June 13-, 2021, All Virtual

2021 Spintronics Workshop on LSI
Sunday, June 13-, 2021, All Virtual

Short Course 1 (Technology)
Advanced Process and Device Technology Toward 2nm-CMOS and Emerging Memory

Moderators: K. Endo, AIST
S. Datta, Univ. of Notre Dame

Live Q&A Session  Monday, June 14, 7:00-8:30 (JST)

CMOS Device Technology for the Next Decade, J. Cai, TSMC
Nanosheet Device Architectures to Enable CMOS Scaling in 3nm and Beyond: Nanosheet, Forksheet and CFET, N. Horiguchi, imec
Extension of Cu Interconnects and Considerations for Post-Cu Alternative Metals in Advanced Nodes, K. Motoyama, IBM Corp.
Contact Module Engineering for Advanced CMOS Technologies: Key Concepts, Engineering Techniques and Device Integration Challenges, N. Breil, Applied Materials, Inc.
Metrology Challenges Towards 2-nm Node, M. Ikota, Hitachi High-Tech Corp.
Emerging Memories and the Applications, H.-T. Lue, Macronix International Co., Ltd.
Key Device Technologies and Challenges for 3D Non-Volatile Memory, M. Saitoh, KIOXIA Corp.
DRAM: Challenges and Opportunities, K. Hamada, Micron Memory Japan, G.K.

Short Course 2 (Joint)
Enabling a Future of Even More Powerful Computing

Moderators: K. Yoshioka, Keio Univ.
S. H. Kang, Qualcomm

Live Q&A Session  Monday, June 14, 7:00-8:30 (JST)

Acceleration of Tomorrow's Computational Challenges, G. Loh, Advanced Micro Devices, Inc.
3D-Structured Monolithic and Heterogeneous Devices for Post-5G Applications, Y. Hayashi, Keio Univ.
Accelerated Computing: Latest Advances and Future Challenges, B. Keller, NVIDIA Corp.
Next-Generation Deep-Learning Accelerators: From Hardware to System, Y. S. Shao, Univ. of California, Berkeley
Hardware for Next Generation AI, D. Nikonov and A. Khosrowshahi, Intel Corp.
Quantum Computing with Superconducting Circuits, M. Brink, IBM Corp.

Short Course 3 (Circuits)

Advanced Circuits and Systems for Internet-of-Things (IoT) Sensors

Moderators:  
P.-H. Chen, NCTU  
D. Griffith, TI

Live Q&A Session  Monday, June 14, 7:00-8:30 (JST)

CMOS Sensor for IoT: First Frontier, S. Pietri, NXP Semiconductors N.V.

Non-CMOS Based Sensors for IoT, M. Zevenbergen, imec

Capacitive Power Management Circuits for Miniaturized Energy Harvesting IoT Systems, M.-K. Law, Univ. of Macau

Getting the Most Out of a Little: Ultra-Low Power Circuit Techniques for the IoT, D. A. Hall, Univ. of California, San Diego

Low Power and Energy-Efficient Digital CMOS for Mixed-Signal Sensor Interfaces, S. Bampi, Federal Univ. of Rio Grande do Sul

IC-Chip Level Physical Attack Protections for IoT Security, M. Nagata, Kobe Univ.

Image Sensor Technologies for Computer Vision Systems to Realize Smart Sensing, A. Nose, Sony Semiconductor Solutions Corp.


Forum

Technologies for Post COVID-19 Era

Moderator:  C. Van Hoof, imec

Live Q&A Session  Saturday, June 19, 7:00-8:30 (JST)


Going from a 5G/6G Vision to Real Implementation, F. Tillman, Ericsson Research

Digital Annealer: Technology for Solving Combinatorial Optimization Problems in Real World, J. Koyama, Fujitsu Ltd.

Challenges and Opportunities for Sub-7nm In-Memory/Near-Memory Computing, AI Accelerators, and Hardware Security, R. K. Krishnamurthy, Intel Corp.

NTT DOCOMO’s View on 5G Evolution and 6G, T. Asai, NTT DOCOMO, INC.

Data Analytics Approach for Smart Manufacturing and Optimizing Equipment Condition, T. Moriya, Tokyo Electron Ltd.

Leveraging Semiconductor Technologies for Next-Generation Healthcare Tools, P. Peumans, imec

SESSION 1
Opening and Plenary Session 1 [Room 1]
Tuesday, June 15, 6:50-8:50

6:50-
Opening Remarks
S. Yamakawa, Sony Semiconductor Solutions Corp.
K. Takeuchi, The Univ. of Tokyo
Chairperson: Y. Oike, Sony Semiconductor Solutions Corp.

C1-1 - 7:20 (Plenary)
Fugaku and A64FX: the First Exascale Supercomputer and Its Innovative Arm CPU, S. Matsuoka, Riken, Japan

Fugaku is the first exascale supercomputer in the world, designed and built primarily by Riken Center for Computational Science (R-CCS) and Fujitsu Ltd., but involving essentially all the major stakeholders in the Japanese HPC community. The name ‘Fugaku’ is an alternative name for Mt. Fuji, and was chosen to signify that the machine not only seeks very high performance, but also a broad base of users and applicability at the same time. The heart of Fugaku is the new Fujitsu A64FX Arm processor, which is 100% compliant with the AArch64 specifications, yet embodies technologies realized for the first time in a major server general-purpose CPU, such as 7nm process technology, on-package integrated HBM2 and terabyte-class SVE streaming capabilities, on-die embedded TOFU-D high-performance network including the network switch, and adoption of so-called ‘disaggregated architecture’ that allows separation and arbitrary combination of CPU core, memory, and network functions. Fugaku uses 158,974 A64FX CPUs in a single socket node configuration, making it the largest and fastest supercomputer ever created, signified by its groundbreaking achievements in major HPC benchmarks, as well as producing societal results in COVID-19 applications.

Chairperson: G. Jurczak, Lam Research Corp.

T1-1 - 8:05 (Plenary)

Circuits Focus Session
Energy-Efficient Machine Learning Processors [Room 1]
Tuesday, June 15, 9:20-10:20

Chairpersons: V. Honkote, Intel Corp.
D. Stark, Facebook

CFS1-1 - 9:20 (Invited)

MN-Core is a highly efficient deep learning training accelerator reaching in excess of 1 TFLOPS/W (half-precision) at board level in real-world mixed-precision workloads. To reach and sustain this level of performance, the design is partitioned and packaged as a four-die MCM exceeding 3000mm² of die area.

CFS1-2 - 9:30

CHIMERA is the first non-volatile deep neural network (DNN) chip for edge AI training and inference using foundry on-chip resistive RAM (RRAM) macros and no off-chip memory. CHIMERA achieves 0.92 TOPS peak performance and 2.2 TOPS/W. We scale inference to 6x larger DNNs by connecting 6 CHIMERAs with just 4% execution time and 5% energy costs, enabled by communication-sparse DNN mappings that exploit RRAM non-volatility through quick chip wakeup/shutdown (33 μs). We demonstrate the first incremental edge AI training which overcomes RRAM write energy, speed, and endurance challenges. Our training achieves the same accuracy as traditional algorithms with up to 283x fewer RRAM weight update steps and 340x better energy-delay product. We thus demonstrate 10 years of 20 samples/minute incremental edge AI training on CHIMERA.

CFS1-3 - 9:40
OmniDRL: A 29.3 TFLOPS/W Deep Reinforcement Learning Processor with Dual-Mode Weight Compression and On-Chip Sparse Weight Transposer, J. Lee, S. Kim, S. Kim, W. Jo, D. Han, J. Lee and H.-J. Yoo, KAIST, Korea

This paper presents OmniDRL, a 4.18 TFLOPS and 29.3 TFLOPS/W DRL processor. A group-sparse training core and exponent mean delta encoding are proposed to enable weight and feature map compression for every iteration of DRL training. A sparse weight transposer enables on-chip transpose of compressed weights, reducing external memory access. The processor is fabricated in 28 nm CMOS technology and occupies a 3.6x3.6 mm² die area. It achieves 7.16 TFLOPS/W energy efficiency for training a robot agent (MuJoCo HalfCheetah, TD3), which is 2.4 times higher than the previous state-of-the-art.

CFS1-4 - 9:50
DepFiN: A 12nm, 3.8TOPs Depth-First CNN Processor for High Res. Image Processing, K. Goetschalckx and M. Verhelst, KU Leuven, Belgium

High-resolution pixel processing (PP) tasks like demosaicing, denoising, and super-resolution strongly benefit from Convolutional Neural Network (CNN) approaches, yet give rise to different architectural challenges compared to typical classification CNNs, preventing real-time execution on existing SoA CNN processors. The 12nm DepFiN processor is the first processor optimized for a wide range of PP CNNs, innovating 1) at system level, by shifting the memory/IO trade-off through line-based extreme layer fusion (depth-first CNN computation); 2) at architecture level, by achieving near 100% utilization @3.8TOPs, even on traditionally challenging depthwise-pointwise and ShiftNet-layers, through an optimized dataflow for high resolution PP; 3) at gate level, by drastically reducing switching activity through improved register file and Multiply-Accumulate interfacing.

DepFiN brings high-resolution pixel processing to mobile platforms by achieving up to 3.8 TOPs and 20 TOPs/W, e.g. enabling 2x super-resolution with the FSRCNN network while limiting off-chip IO to 3.2Mfeatures/inference.

CFS1-5 - 10:00
PNNPU: A 11.9 TOPS/W High-Speed 3D Point Cloud-Based Neural Network Processor with Block-Based Point Processing for Regular DRAM Access, S. Kim, J. Lee, D. Im and H.-J. Yoo, KAIST, Korea

An efficient and high-speed 3D point cloud-based neural network processing unit (PNNPU) is proposed using block-based point processing. It has three key features: 1) a page-based point block memory management unit (PMMU) with a linked list-based page table (LLPT) for on-chip memory footprint reduction, 2) hierarchical block-wise farthest point sampling (HFPS) and block skipping ball-query (BSBQ) for fast and efficient point processing, 3) skipping-based max-pooling prediction (SMPP) for throughput enhancement. The PNNPU is fabricated in a 65nm CMOS process and evaluated on the 3D object detection (3D OD) application. As a result, it achieves 84.8 fps at 266.8mW power consumption and 6.6-11.9 TOPS/W energy efficiency.

CFS1-6 - 10:10

A processor exploiting dynamic weight pruning (DWP), named Trainer, is proposed for energy-efficient deep-neural-network (DNN) training on edge devices. It has three key features: 1) An implicit redundancy speculation unit (IRSU) improves throughput by 1.46x. 2) A dataflow allowing reuse-adaptive dynamic compression and PE regrouping increases utilization by 1.52x. 3) A data-retrieval-eliminated batch-normalization (BN) unit (REBU) saves 37.1% of energy. Trainer achieves a peak energy efficiency of 276.55TFLOPS/W. It reduces training energy by 2.23x and offers a 1.76x training speedup compared with the state-of-the-art sparse DNN training processor.

SESSION 2
Biosensing Techniques [Room 2]
Tuesday, June 15, 9:20-10:10

C2-1 - 9:20 (Invited)

A 1024-pixel CMOS biochip for multiplex polymerase chain reaction application is presented. Biosensing pixels include 137dB DDR photosensors and an integrated emission filter with OD ~ 6 to perform real-time fluorescence-based measurements while thermocycling the reaction chamber with heating and cooling rates of > ±10°C/s. The surface of the CMOS IC is biofunctionalized with DNA capturing probes. The biochip is integrated into a fluidic consumable enabling loading of extracted nucleic acid samples and the detection of upper respiratory pathogens, including SARS-CoV-2.

C2-2 - 9:30

A key challenge for near-infrared (NIR) powered neural recording ICs is to maintain robust operation in the presence of parasitic short circuit current from junction diodes when exposed to light. This is especially so when intentional currents are kept small to reduce power consumption. We present a neural recording IC that is tolerant of up to 300 μW/mm² light exposure (above the tissue limit) and consumes 0.57 μW at 38°C, making it the lowest-power design among standalone motes while incorporating on-chip feature extraction and individual gain control.

C2-3 - 9:40

This paper presents the first fully integrated 2-D array of electronic-photonic ultrasound sensors targeting low-power miniaturized ultrasound probes for endoscopic applications. Fabricated in a zero-change 45nm CMOS-SOI technology, this Electronic-Photonic System on Chip (EPSoC) utilizes micro-ring resonators (MRR) as ultrasound sensors instead of the traditional piezoelectric or capacitive micromachined transducers (PMUTs or CMUTs). The photonic nature of the sensor allows the power-hungry receiver electronics to be placed remotely, outside the probe tip, which lowers the power dissipation inside the human body, eliminates the electrical cabling, and reduces fiber count by 4x using wavelength division multiplexed (WDM) MRR sensors coupled onto the same waveguide. The photonic sensing element demonstrates >30 MHz bandwidth and 7.3 mV/kPa overall sensitivity while consuming 0.43 mW of power.

C2-4 - 9:50
An RF-Ultrasound Relay for Powering Deep Implants across Air-Tissue Interfaces with a Multi-Output Regulating Rectifier and Ultrasound Beamforming, E. So, P. Yeon, E. J. Chichilnisky and A. Arbabian, Stanford Univ., USA

Single modality wireless power has limited depth for mm-sized implants across air-tissue (e.g. retina implants) or skull-tissue interfaces because of high loss in tissue (RF, optical) or high reflection at the medium interface (ultrasound). The proposed RF-ultrasound (US) relay uses an RF link across the interface to a relay chip which rectifies the RF power and transmits focused ultrasound to the implant using a 6-channel phased array with 251% efficiency improvement at 2.5 cm over fixed focusing.

C2-5 - 10:00
A 3.8-μW/Ch, 15-GΩ Total Input Impedance Chopper Stabilized Amplifier with Dual Positive Feedback Loops and Auto-Calibration Scheme, Y. Park, J.-H. Cha, S.-H. Han, J.-H. Park and S.-J. Kim, UNIST, Korea

A capacitively-coupled chopper instrumentation amplifier (CCIA) with dual positive feedback loops (DPFL) is proposed to achieve excellent noise efficiency and high total input impedance (T-ZIN) simultaneously. The proposed DPFL consists of a conventional internal positive feedback loop (IPFL) and an external positive feedback loop (EPFL) outside the input chopper that compensates a large external parasitic capacitance (CPEXTER), improving T-ZIN significantly. An on-chip background calibration is also proposed to determine the feedback factor of the DPFL to deal with variable parasitic capacitors. The prototype shows an IRN of 0.36 μVrms from 0.5 to 300 Hz with an NEF of 1.5 and a high input impedance of 15 GΩ at 10 Hz with an additional CPEXTER of 82 pF.

SESSION 3
Integrated Voltage Regulators [Room 3]
Tuesday, June 15, 9:20-10:10

Chairpersons: P.-H. Chen, National Yang Ming Chiao Tung Univ.
X. Zhang, IBM Corp.

C3-1 - 9:20

A 5V-input, high-frequency, high-density (8A/mm²) buck converter featuring a low-voltage GaN power transistor (with 5-10x better FoM than Si) with on-die gate clamps, integrated with a CMOS companion die in 4mm x 4mm package, achieves 94.2% peak efficiency for 5Vin/1Vout at 3MHz switching frequency with a 40nH inductor.

C3-2 - 9:30
A 1S Direct-Battery-Attach Integrated Buck Voltage Regulator with 5-Stack Thin-Gate 22nm FinFET CMOS Featuring Active Voltage Balancing and Cascaded Self-Turn-ON Drivers, S. Kim, H. Krishnamurthy, S. Amin, S. Weng, J. Feng, H. Do, K. Radhakrishnan, K. Ravichandran, J. Tschannen and V. De, Intel Corp., USA

A 1S direct-battery-attach buck converter with a 5-stack, thin-gate-FinFET power train delivers a peak efficiency of 89.2% for 3.8V input to 1.8V output, with 10x higher power density (≈15A/mm²), switching at up to 10x higher frequency (40MHz) using 4x-10x lower inductance (10-100nH) than the state of the art. Cascaded self-timed drivers and soft-switching low-side drivers handle the complexity of driving 10 individual power switches safely.

C3-3 - 9:40

A dual-input, digital hybrid buck-LDO system featuring 300MHz Fully-Integrated Voltage Regulator (FIVR) and Computational Transient Management Controller (CTMC) based Low Dropout (LDO) regulator is presented. The high speed, parallel CTMC-LDO reduces the FIVR input resonance (82% pk-to-pk voltage-swing reduction). The CTMC-LDO operates without any communication with FIVR and can act as a clamp (54% droop & 69% settling time reduction) or share current (0-100%) in parallel with the FIVR, thus increasing load capacity.

C3-4 - 9:50
A 2.3GHz Fully Integrated DC-DC Converter Based on Electromagnetically Coupled Class-D LC Oscillators Achieving 78.1% Efficiency in 22nm FDSOI CMOS, A. Novello*, G. Atzeni*, G. Cristiano*, M. Coustans** and T. Jang*, *ETH Zürich and **STMicroelectronics, Switzerland

A fully integrated DC-DC converter based on electromagnetically coupled class-D LC oscillators achieving 0.42-3.2W/mm² power density and 69.4-78.1% efficiency is demonstrated in a 22nm FDSOI CMOS technology. This work proposes on-chip 8-shaped and vertically stacked transformers, which are orthogonally placed for high power density, a low undesired coupling coefficient, and low electromagnetic interference (EMI) radiation. In addition, the output ripple is <10mV without attaching any output capacitor thanks to the 4-phase electromagnetic power delivery scheme. The converter also offers a duty-cycled operation mode that enables <2% efficiency degradation down to 100μW. The total chip area is 0.59mm² for 5.9nH inductance (high efficiency version) and 0.22mm² for 3.9nH (high power density version).

C3-5 - 10:00
A Fully Integrated Switched-Capacitor Voltage Regulator with Multi-Rate Successive Approximation Achieving 190 ps Transient FoM and 83.7% Conversion Efficiency, B.-C. Wu and T.-T. Liu, National Taiwan Univ., Taiwan

This paper presents a fully integrated switched-capacitor dc-dc voltage regulator (SCVR) in standard 28 nm CMOS with a proposed regulation algorithm of multi-rate successive approximation (MRSA) and several conversion efficiency enhancement techniques. The proposed SCVR achieves 190 ps transient FoM with peak conversion efficiency of 83.7%@114.2 mA per mm² and 110x supported loading range of 80 μA-8.8 mA.

SESSION 4

Award and Plenary Session 2 [Room 1]

Wednesday, June 16, 6:40-8:50

Chairperson: B. Nikolić, Univ. of California, Berkeley

C4-1 - 7:20 (Plenary)
A New Era of Tailored Computing, M. Papermaster, S. Kosonocky, G. H. Loh and S. Naffziger, Advanced Micro Devices, Inc. (AMD), USA

The worldwide computing market grew tremendously over the past decades, and looking toward the future, these trends do not appear to be slowing down. Moore’s Law coupled with incredible innovation in hardware and software are engines driving this growth. However, the entire industry faces a barrage of challenges including the slowing of Moore’s Law, stringent power and energy constraints, an always-connected society, and disruptions from the on-going artificial intelligence revolution. To continue delivering ever higher-performance computing solutions amid these difficulties, the industry needs to pivot to a new mindset of “Tailored Computing.” The need for and opportunities to tailor our technologies in all aspects of future compute will propel the industry toward heterogeneity in everything it does.

Chairperson: K. Miyashita, Toshiba Electronic Devices & Storage Corp.

T4-1 - 8:05 (Plenary)
Pandemic Challenges, Technology Answers, S. Choi, Samsung Electronics Co., Ltd., Korea

As the global community was caught off guard by the pandemic, the semiconductor industry dealt with unexpected swings in applications and demands. The health crisis created an immediate need for social distancing that disconnected and disrupted human interactions, and technology had to step up on short notice to mend and reconnect communities. In this paper, we share the insights we gained as semiconductor technologists who were called to provide solutions in a nimble yet comprehensive manner to deal with the unexpected, and offer our vision and new model for the foundries, not just as manufacturers, but as solution providers. The new market reality dominated by “untact” and “connect” demands differentiated strategies in providing foundry solutions, which include close engagement with customers in earlier stages of technology R&D, as well as design infrastructure tailored to customers’ specific requirements. We present our vision to drive such change in foundry technology directions.

SESSION 5
Secure and Energy Efficient Circuits [Room 1]

Wednesday, June 16, 9:20-10:10

Chairpersons: Y. Yoshida, KIOXIA Corp.
C. Tokunaga, Intel Corp.

C5-1 - 9:20
MePLER: A 20.6-pJ Side-Channel-Aware In-Memory CDT Sampler, D. Li, Y. He, A. R. Pakala and K. Yang, Rice Univ., USA

This work presents MePLER, an in-Memory cumulative distribution table (CDT) sampler, featuring a custom cell derived from NAND-Type CAM for range-matching, a pipelined and segmented array for reduced energy, and suppressed timing and power side-channel leakage. The precision and sample range are configurable for different sampling requirements. A 65nm prototype achieves constant 85.9-MSps, 1-sample/cycle throughput, 20.6-pJ/sample efficiency, and 0.03-mm² footprint.
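
For readers unfamiliar with CDT sampling, the range-matching idea can be sketched in software as follows. This is a minimal illustration only, not the MePLER circuit: the names `build_cdt` and `cdt_sample`, the example weights, and the 8-bit precision are assumptions chosen for this sketch, and the binary search used here is not hardened against timing side channels.

```python
import bisect


def build_cdt(weights, precision_bits):
    """Build an integer cumulative distribution table (CDT).

    weights: non-negative relative probabilities, one per sample value.
    precision_bits: table precision; entries are scaled to [0, 2**precision_bits].
    (Both parameters are illustrative assumptions.)
    """
    total = float(sum(weights))
    scale = 1 << precision_bits
    cdt = []
    running = 0.0
    for w in weights:
        running += w
        cdt.append(round(running / total * scale))
    cdt[-1] = scale  # the last boundary always covers the full range
    return cdt


def cdt_sample(cdt, r):
    """Range-match a uniform random integer r in [0, cdt[-1]).

    Returns the index i such that cdt[i-1] <= r < cdt[i]
    (with cdt[-1] of the empty prefix taken as 0).
    """
    return bisect.bisect_right(cdt, r)


# Example: three outcomes with relative weights 1:2:1 at 8-bit precision.
example_cdt = build_cdt([1, 2, 1], 8)
example_index = cdt_sample(example_cdt, 100)
```

In hardware, a CAM-style array can compare the random input against all table boundaries at once, whereas this sketch searches them sequentially.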

C5-2 - 9:30

This paper presents EQZ-LDO, a digital low-drop-out regulator (LDO) with attack detection and detection-driven protection for side channel attack (SCA) resiliency. It typically incurs only 0.5% energy-delay-product (EDP) overhead since the proposed detection-driven scheme exercises protection only when the AES is under attack. This amortizes the EDP overhead over the lifetime of an Internet of Things (IoT) device. It still achieves very strong resiliency to SCA, demonstrating the protection of a 128b AES core from >10M-trace correlation power analysis (CPA).

C5-3 - 9:40
A 0.186-pJ per Bit Latch-Based True Random Number Generator with Mismatch Compensation and Random Noise Enhancement, R. Zhang, X. Wang, L. Wang, X. Chen, F. Yang, K. Liu and H. Shinohara, Waseda Univ., Japan

A calibration and feedback control-free latch-based true random number generator (TRNG) is presented. It features a mismatch self-compensation and a random noise enhancement technique to drastically improve the noise-to-mismatch ratio. By employing the XOR function of only 4-bit entropy sources, the proposed TRNG can efficiently operate across a wide voltage (0.3–1.0 V) and temperature (-20–100 °C) range. An 8-bit von Neumann with waiting (VN8W) post-processing technique is used to extract full entropy bitstreams, which have been verified by the NIST-SP 800-22 randomness tests. Robustness against supply noise attack is demonstrated. The proposed TRNG is fabricated in 130-nm CMOS technology and achieves the state-of-the-art energy of 0.186 pJ per bit at 0.3 V with a core area of 661 μm² (0.039 MF²).
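
As background, XOR combining and von Neumann debiasing can be sketched as below. This shows only the classic two-bit von Neumann extractor, not the VN8W variant used in the paper, whose details are not given here. The function names and the sample bit patterns are assumptions made for illustration.

```python
from functools import reduce
from operator import xor


def xor_combine(streams):
    """XOR several equally long raw bit streams into one stream."""
    return [reduce(xor, column) for column in zip(*streams)]


def von_neumann_extract(bits):
    """Classic von Neumann debiasing.

    Reads non-overlapping bit pairs: 01 -> 0, 10 -> 1, and 00/11 are discarded.
    """
    out = []
    for first, second in zip(bits[0::2], bits[1::2]):
        if first != second:
            out.append(first)
    return out


# Four hypothetical raw entropy-source outputs (made-up values).
raw_streams = [
    [0, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
]
combined = xor_combine(raw_streams)
debiased = von_neumann_extract([0, 1, 1, 0, 1, 1, 0, 0, 1, 0])
```

The basic extractor discards at least half of its input bits, which is one reason more elaborate post-processing schemes exist.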

C5-4 - 9:50
All-Digital Closed-Loop Unified Retention/Wake-Up Clamp in a 10nm 4-Core x86 IP, C. Augustine, A. Afzal, U. Misgar, A. Owahid, A. Raman, K. Subramanian, F. Merchant, J. Tschanz and M. Khellah, Intel Corp., USA

A 4-core x86 IP in 10nm with multiple low-power states including C1 (clock-gated core), C6 (power-gated core) and a new state called C1LP where the core voltage is lowered to its retention voltage ($V_{RETENTION}$) is presented. An all-digital closed-loop unified retention clamp for C1LP/wake up for C6 shows power savings of 33%/28% for core/IP, with 120ns wake up speed while addressing the impact of PVT variations.

C5-5 - 10:00

This work shows a system for power delivery network (PDN) impedance measurements (PIM), targeting high-performance computing (HPC) applications. A delay-line based "TRIG-after-SAMP" approach relaxes timing margins and eliminates high-speed clock sources. On-chip DUTs, with two bonding schemes and a programmable de-coupling capacitor array, are demonstrated using 7nm FinFET technology. Measurement results show that this system achieves a sampling bandwidth of 27 GHz, an accuracy of 1 mV, and a core area of 0.028 mm².

SESSION 6
Converters and Sensing Analog Frontend [Room 2]

Wednesday, June 16, 9:20-10:10

Chairpersons: K. Yoshioka, Keio Univ.
H. Lam, Analog Devices, Inc.

C6-1 - 9:20
A 192 nW 0.02 Hz High Pass Corner Acoustic Analog Front-End with Automatic Saturation Detection and Recovery, R. Rothe, M. Cho, K. Choo, S. Jeong, D. Sylvester and D. Blaauw, Univ. of Michigan, USA

This work presents a 192 nW acoustic analog front-end with a 0.02 Hz high pass corner robust to process and temperature variations (1.8x deviation across -40°C to 80°C and 0.035x standard deviation over mean spread). The DC common-mode feedback is established by a capacitive Delta Sigma modulated feedback injection. It ensures a unity DC feedback factor and eliminates the input frequency and phase-dependent systematic offset introduced by some previous techniques. This work also presents an automatic saturation detection and recovery mechanism. A 10x recovery time reduction is measured as compared to a standard implementation.

C6-2 - 9:30
A 3.68aFrms Resolution 183dB FoMs 4th-Order Continuous-Time Bandpass ΔΣ Capacitance-to-Digital Converter in 0.18µm CMOS, S. Park*, H. Chae** and S. Cho*, *KAIST and **Konkuk Univ., Korea

This paper presents an ultra-high-resolution energy-efficient 4th-order continuous-time (CT) bandpass (BP) delta-sigma (DS) capacitance-to-digital converter (CDC). This is the first work where thermal noise folding does not occur in a CDC. Previous works on ultra-high-resolution CDCs suffer from thermal noise folding to in-band, as they are based on DT-DSM. We employ a CT approach to prevent noise folding. Moreover, a BP-DSM for a CT CDC is proposed as it reduces the power consumption compared to a low-pass DSM when OSR is large. We also propose a charge-domain DAC at the input stage for low thermal noise, accurate coefficient, and simplicity. Finally, an inverter-based amplifier with gain boosting is proposed for high DC gain and low power consumption. The proposed CDC achieves a resolution of 3.68aFrms at room temperature while achieving a Schreier figure-of-merit of 183dB, which is a more than 2x improvement over the state-of-the-art CDCs.

C6-3 - 9:40
A 47.5nJ Resistor-to-Digital Converter for Detecting BTEX with 0.06ppb-Resolution, Y. Lee, B. Cho, C. Lee, J. Kim and Y. Chae, Yonsei Univ., Korea

This paper describes an energy-efficient resistor-to-digital converter (RDC) for detecting benzene, toluene, ethylbenzene and xylene (BTEX). The sensor selectivity is adjusted by the different heat levels, thereby detecting the BTEX with a single-R sensor. The sensor is directly digitized by an energy-efficient RDC based on a continuous-time delta-sigma ADC. The proposed RDC achieves better energy efficiency by processing the signal in the current domain. The prototype chip implemented in a 0.11-µm CMOS consumes only 95µW from a 1.5-V supply. It achieves a resolution of 863mΩ with 0.5-ms measurement time. This corresponds to a gas resolution of 0.06ppb and an energy per measurement of 47.5nJ. This work is fully verified through the BTEX measurements.
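
The quoted energy per measurement follows from the supply power and the measurement time stated in the abstract. A quick check, with the function and variable names chosen here purely for illustration:

```python
def energy_per_measurement_nj(power_w, time_s):
    """Energy in nanojoules for a given power (watts) and duration (seconds)."""
    return power_w * time_s * 1e9


# Values taken from the abstract: 95 uW power, 0.5 ms measurement time.
power_w = 95e-6
measurement_time_s = 0.5e-3
energy_nj = energy_per_measurement_nj(power_w, measurement_time_s)
print(f"{energy_nj:.1f} nJ")
```

The product of 95 µW and 0.5 ms is 47.5 nJ, matching the figure in the title.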

C6-4 - 9:50

We present an impedance-monitoring IC achieving a wide frequency range (FR) and fast output data rate (ODR). The proposed IC supports a wide FR with improved spectral density by down-converting the signal to the intermediate frequency ($f_{\text{IF}}$) in front of the instrumentation amplifier (IA) using the LO signal generated by a single-side-band (SSB) mixer. The proposed IF-sampling architecture does not require a narrow-bandwidth (BW) low-pass filter (LPF), resulting in a fast ODR. A time-interleaved (TI) DFT is also employed to further improve the ODR. A band-pass delta-sigma ADC (BP-ΔΣ-ADC) with auto-calibration and BP truncation is adopted to achieve the best noise performance at $f_{\text{IF}}$. The fabricated IC achieves 0.35Ωrms resolution in the FR from 4kHz to 8MHz with 122.1Hz BW while providing an ODR of up to 31.25kS/s.

C6-5 - 10:00
A 2A/15A Current Sensor with 1.4 µA Supply Current and ±0.35%/0.6% Gain Error from -40°C to 85°C Using an Analog Temperature-Compensation Scheme, R. Zamparette and K. Makinwa, Delft Univ. of Technology, Netherlands

This paper presents a 2A/15A fully-integrated current sensor with a 20 mΩ on-chip shunt/3 mΩ PCB shunt. It employs an energy-efficient hybrid sigma-delta ADC with an FIR-DAC and consumes only 1.4 µA, a 3x improvement on the state-of-the-art.

SESSION 7

SRAM and DRAM [Room 3]

Wednesday, June 16, 9:20-10:10

Chairpersons: K. Miyano, Micron Memory Japan, G.K.
J. Wuu, Advanced Micro Devices, Inc. (AMD)

C7-1 - 9:20
5nm Low Power SRAM Featuring Dual-Rail Architecture with Voltage-Tracking Assist Circuit for 5G Mobile Application,

A Voltage Auto Tracking Cell Power Lowering (VACPL) write-assist circuit is proposed for low-power SRAM with dual-rail architecture. VACPL adaptively controls the cell voltage with respect to the dual rail offset voltage to maximize bitcell write-ability. A 5nm EUV FinFET test chip demonstrates 210mV VMIN improvement and 4.7x larger range of operating voltage with VACPL. The proposed VACPL and VATA achieve 95.2% leakage power reduction by lowering VDDC by 400mV in a 5nm 5G mobile device.

C7-2 - 9:30
SRAM Write- and Performance-Assist Cells for Reducing Interconnect Resistance Effects Increased with Technology Scaling,
*Yonsei Univ. and **Samsung Electronics Co., Ltd., Korea

This paper presents SRAM write- and performance-assist cells that have bit-cell-compatible layouts and thus can be inserted into a bit-cell array without white space. The proposed cells effectively resolve the degradation in write-ability and performance caused by the interconnect resistance that increases with technology scaling.

C7-3 - 9:40
A 5nm Fin-FET 2G-Search/s 512-entry x 220-bit TCAM with Single Cycle Entry Update Capability for Data Center ASICs,
C. Deshpande, R. Garg, G. Jedhe, G. Narvekar and S. Kumar, MediaTek Inc., USA

This paper presents a 2G-search/s embedded Ternary Content Addressable Memory (TCAM) design in 5nm Fin-FET technology with the ability to update both SRAM words in a TCAM entry in a single clock cycle. This reduces TCAM update latency by 50% for data center Application Specific Integrated Circuits (ASICs) with only 1% area overhead and no search power penalty. We present a novel time-multiplexed input bus interface on a single-port TCAM cell array and a new architecture to enable fast updates. Silicon measurement shows the highest reported search rate of 2G-search/s at a 3.48Mb/mm² memory density, including all global peripheral circuitry, for a 512-entry, 220-bit-wide, 110Kb TCAM.

C7-4 - 9:50
Enhanced Core Circuits for Scaling DRAM: 0.7V VCC with Long Retention 138ms at 125°C and Random Row/Column Access Times Accelerated by 1.5ns,
*Etron Technology, Inc., Taiwan and **I&C Lab., Singapore

Two inventions improve DRAM's core circuits within a 1Gb DDR3 product: (1) Scale VCC down to 0.7V but generate a 1.3V Restore ONE signal into memory cells to enhance Retention-Time to at least 138ms at 125°C. This facilitates scaling VDD and peripheral devices. (2) When addresses are ready in the DRAM-controller, an Interface Circuitry enables pre-decoding Row/Column addresses into the DRAM before other Commands, thus accelerating Random-Access Row and Column Times by 1.5ns each.

C7-5 - 10:00
A Reflection and Crosstalk Canceling Continuous-Time Linear Equalizer for High-Speed DDR SDRAM,

This paper presents a reflection and crosstalk canceling continuous-time linear equalizer (CTLE) for a high-speed DDR SDRAM interface. To enhance the voltage margin in a noisy multi-drop DDR SDRAM channel, the proposed CTLE cancels reflection noise by common-mode compensation and compensates crosstalk by limiting RC filter charging to overcome inversion of common-mode information. The reflection and crosstalk canceling CTLE is implemented in a DRAM process and improves the average voltage margin of a 16GB RDIMM at 3.2Gbps by 28.3mV.
Panel Session [Room 1, 4]
Thursday, June 17, 7:00-9:00

Organizers (JFE): H. Morioka, Socionext Inc.
T. Tokuda, Tokyo Institute of Technology
Organizers (NAE): P. Ye, Purdue Univ.
J. Wu, Advanced Micro Devices, Inc.

7:00-8:30 Circuits Panel [Room 1]
New Generation Chip Makers vs. the Incumbents
Moderator: N. Verma, Princeton Univ.

8:00-9:00 Technology Panel [Room 4]
3D/Heterogeneous Integration: Are We Running Towards a Thermal Crisis?
Moderator: T. Ohba, Tokyo Institute of Technology

SESSION 8
Application-Specific Processors at the Edge [Room 1]
Thursday, June 17, 8:40-9:20

Chairpersons: S. Otani, Renesas Electronics Corp.
Z. Zhang, Univ. of Michigan

C8-1 - 8:40
A Sub-mW Dual-Engine ML Inference System-on-Chip for Complete End-to-End Face-Analysis at the Edge, P. Jokic*,**, E. Azarkhish*, R. Cattenoz*, E. Türetken*, L. Benini**, and S. Emery*, *CSEM and **ETH Zürich, Switzerland

Smart vision-based IoT applications operate on a sub-mW power budget while requiring power-hungry always-on image processing capabilities. This work presents a system-on-chip (SoC) that enables hierarchical processing of face analysis under multiple sub-mW operating scenarios using two tightly coupled machine learning (ML) accelerators. A dynamically scalable binary decision tree (BDT) engine for face detection (FD) allows triggering a multi-precision convolutional neural network (CNN) engine for subsequent face recognition (FR). The 22nm SoC can therefore dynamically trade off image analysis depth, frames-per-second (FPS), accuracy, and power consumption. It implements complete end-to-end edge processing, enabling always-on FD and FR within the tight 1mW power budget of a 55mm diameter indoor solar panel. The SoC achieves >2x improvement in energy efficiency at iso-accuracy and iso-FPS over state-of-the-art (SoA) systems.

C8-2 - 8:50
VOTA: A 2.45TFLOPS/W Heterogeneous Multi-Core Visual Object Tracking Accelerator Based on Correlation Filters, J. Zhu*, W. Tang*, C.-E. Lee*, H. Ye**, E. McCreath** and Z. Zhang*, *Univ. of Michigan, USA and **Australian National Univ., Australia

VOTA is a domain-specific accelerator for correlation filter (CF)-based visual object tracking (VOT). It encompasses a Winograd convolution core, an FFT core and a vector core in a high-bandwidth star-ring topology. VOTA's frame-based instructions and execution enable a 537GFLOPS performance and reduce the code size. An instruction-chaining mechanism permits inter-core pipelining to improve the utilization to 84.2%. A 10.2mm² 28nm FP16 VOTA prototype incorporating a RISC-V host CPU is measured to achieve 2.45TFLOPS/W at 0.72V. Running OPCF, a CF-based VOT enhanced by adaptive boosting and particle filters, the chip achieves 1157FPS on 640x480 input frames at 0.9V and 175MHz, consuming 296mW.

C8-3 - 9:00
A 2144.2-bits/min/mW 5-Heterogeneous PE-Based Domain-Specific Reconfigurable Array Processor for 8-Ch Wearable Brain-Computer Interface SoC, W. Byun*, M. Je* and J.-H. Kim**, *KAIST and **Ewha Womans Univ., Korea

This paper proposes a reconfigurable array processor (RAP)-based wearable brain-computer interface (BCI) SoC that can energy-efficiently accelerate the linear algebra operations mainly required for target identification algorithms in visual-stimuli-based BCI. The proposed domain-specific RAP contains an array of dynamically reconfigurable and scalable processing elements for energy efficiency, and supports almost all three levels of basic linear algebra subprograms (BLAS) as well as matrix decompositions. In addition, this work proposes an optimized target identification (TI) algorithm for RAP, which leads to a higher information transfer rate (ITR) of 139.9-bits/min and a better accuracy of 95.4% compared to the previous work [5], and a processing energy efficiency in ITR of 2144.2-bits/min/mW. This SoC was fabricated in 130nm CMOS and, with the proposed TI algorithm, it achieves 16.8x the energy efficiency of the state-of-the-art [1].

C8-4 - 9:10

A 2.2 mm² fully digital baseband SoC with four heterogeneous cores for 128-node 8-user distributed massive MIMO is presented. Two specialized DSPs perform rapid over-the-air synchronization within 0.1ms. A highly optimized 8-complex-lane MIMO vector processor provides a 4x hardware efficiency improvement over general-purpose processors. Circuit optimizations and the use of body-bias result in 107 pJ/b measured energy at a 169 Mb/s detection rate.
SESSION 9
Highly-Efficient Processor Architectures [Room 1]

Thursday, June 17, 9:30-10:10

Chairpersons: J. M. Kühn, Preferred Networks, Inc.
P. Whatmough, ARM Research Labs, Inc.

C9-1 - 9:30 (Invited)
Lessons from Loihi: Progress in Neuromorphic Computing, M. Davies, Intel Corp., USA

The past three years have seen significant progress in neuromorphic computing research, especially with Intel’s Loihi research chip enabling quantitative evaluation of algorithms and applications designed for this emerging computer architecture. These results have rigorously confirmed, for the first time, that significant gains in energy efficiency and latency are possible over a wide range of workloads compared to state-of-the-art conventional approaches. The greatest gains come from novel algorithms unrelated to the deep learning paradigm. While the speed, efficiency, and scalability of these algorithms suggest near-term commercial viability, Loihi’s high resource cost for large-scale workloads also highlights an urgent need for denser solutions for synaptic state in these architectures.

C9-2 - 9:40

Ultra-low-power microcontrollers (ULP MCUs) face a performance trade-off between energy-efficient computing during activity periods and low sleep power, associated with limited wake-up time and energy. Adaptive back-biasing in FD-SOI, along with near-threshold operation at ultra-low voltage, has brought significant improvements by dynamically shifting the minimum energy point (MEP) along the frequency axis. This work introduces a highly-integrated 64-MHz ULP Cortex-M4 MCU with 96-kB SRAM in 28nm FD-SOI. A clock and power management unit (CPMU) generates all internal supplies and clocks from a 1.8-V supply, while unified frequency and back-bias regulation (UFBBR) performs PVT compensation. Custom 16 kB ULP SRAMs achieve low read/write access energy, 1.2/0.84 pJ/32-bit access respectively, and provide 0.98nW/kB ultra-low-leakage data retention. A low-power biomedical analog front-end enables biopotential monitoring. The MEP is 5.5μW/MHz (8.2μW/MHz including conversion losses). Sleep power is 7.7μW with retention of logic state and 32-kB memory.

C9-3 - 9:50

PETRA is a configurable FP16 matrix multiplication and convolution accelerator designed to be 2.5D integrated using Advanced Interface Bus (AIB). PETRA is built upon four 16x16 systolic arrays, but it employs a configurable H-tree accumulation to improve both the latency and the utilization by up to 8x. A 22nm 3.04mm² PETRA prototype provides 1.433TFLOPS in computing matrix-matrix multiplication (MMM) and convolution (conv) at 0.88V, and it achieves a 6.97TFLOPS/W peak efficiency at 0.7V. PETRA is integrated with an Intel Stratix 10 FPGA in a multi-chip package (MCP) to provide the flexibility of FPGA and the performance and efficiency of PETRA.

SESSION 10
Advancements in Power Management ICs [Room 2]

Thursday, June 17, 8:40-9:40

Chairpersons: S.-W. Hong, Sogang Univ.
S. Z. Asl, Ferric, Inc.

C10-1 - 8:40
An 8Ω, 5.5W, 0.006% THD+N, 2xV$_{\text{BAT}}$-Swing Switched-Mode Audio Amplifier with Fully-Differential Linear Buck-Boost Topology Achieving Total Efficiency of 87%, J.-H. Lee and H.-S. Kim, KAIST, Korea

This paper presents a fully-differential single-inductor linear buck-boost power topology-based switched-mode audio amplifier. The proposed buck-boost topology outputs from 0 to 2x battery voltage while improving efficiency by up to 10% compared to the dual-step conversion of Class-D after the step-up. The inductor-free-wheeling with a fixed de-energizing phase linearizes the voltage conversion in the buck-boost. The chip fabricated in 0.18-μm achieves 0.006% THD+N and 87% efficiency at a maximum output power of 5.5W on an 8Ω-speaker.
C10-2 - 8:50
Efficient RF-PA Two-Chip Supply Modulator Architecture for 4G LTE and 5G NR Dual-Connectivity RF Front-End, J.-S. Paek, D. Kim, J.-Y. Han, Y. Choo and J. Lee, Samsung Electronics Co., Ltd., Korea

This paper presents a two-chip supply modulation architecture for efficient RF power amplification using a fully switched-mode supply modulator (SM) and a linear-assisted hybrid SM to support simultaneous transmission on LTE and 5G bands. The designed fully switched-mode SM consists of a fast buck converter and a slow buck converter, and it achieves 88.2% peak efficiency with a low RX band noise of -140dBm/Hz at the SM output. The designed 5G NR SM, consisting of a class-AB linear amplifier (LA) and an interleaved 3-level buck-boost converter, provides a 150-MHz 3-dB bandwidth for tracking the 100-MHz envelope signal. An optimal RF-PA supply deployment using the two SMs efficiently supports multiple RF-PA loads while satisfying the dual transmission requirements of E-UTRAN New Radio Dual-Connectivity (EN-DC) and 5G 100-MHz ET operation.

C10-3 - 9:00

This paper presents a hybrid dual-path Buck-Boost DC-DC converter (DPBB) achieving high power efficiency for the entire input voltage (VIN) range (2.8 to 4.2V) from a Li-ion battery for mobile applications. Unlike conventional non-inverting buck-boost converters (CBB), the proposed DPBB supplies a load current (ILOAD) at all phases by using the inductor path (L-path) and capacitor path (C-path) alternately. As a result, our design achieves high efficiency by reducing conduction losses and suppresses the output voltage (VOUT) ripple. It has only a single control mode over the whole VIN range, unlike the previous designs that require complex mode transitions. Note that high efficiency is achieved not only across a wide ILOAD range but also over a wide VIN range. The proposed DPBB achieves 96.6% peak efficiency with the smallest difference between the maximum and minimum efficiencies over a given range of ILOAD and VIN.

C10-4 - 9:10
A 4.5V-Input 0.3-to-1.7V-Output Step-Down Always-Dual-Path DC-DC Converter Achieving 91.5%-Efficiency with 250mΩ-DCR Inductor for Low-Voltage SoCs, J.-Y. Ko*, Y. Huh***, M.-W. Ko*, G.-G. Kang*, G.-H. Cho* and H.-S. Kim*, *KAIST, **Samsung Electronics Co., Ltd. and ***Samsung Advanced Institute of Technology, Korea

This paper presents an always-dual-path (ADP) DC-DC converter that achieves 4.5V-input 0.3-to-1.7V-output buck conversion for battery-powered low-voltage SoCs. Regardless of the voltage conversion ratio (VCR), the proposed ADP converter keeps the inductor current at a constant 0.5x of the load current, yielding high efficiency despite the large DCR of a compact-volume inductor. A seamless dual power path formed by two flying capacitors yields a low ripple. The chip fabricated in 180-nm 5V CMOS obtains an efficiency of 91.5% (84.6%) at a VCR of 0.38 (0.2) even with an inductor DCR of 250mΩ.

C10-5 - 9:20
A 6.78 MHz Wireless Power Transfer System for Simultaneous Charging of Multiple Receivers with Maximum Efficiency Using Adaptive Magnetic Field Distributor IC, H. Qiu and M. Takamiya, The Univ. of Tokyo, Japan

For the first time, we developed a 6.78 MHz wireless power transfer (WPT) system for simultaneous charging of multiple receiver (RX) coils. On the basis of the transmitter (TX)-RX and RX-RX coupling distinguished by the adaptive magnetic field distributor (AMFD) IC, the distribution of magnetic fields from the TX coils was optimized at each RX coil for maximum efficiency. A 2-TX 2-RX WPT system was implemented with the AMFD ICs fabricated in a 1.8 V, 180 nm CMOS process. Compared with the conventional method, the system efficiency is increased from 8.9 % to 61 % at a load power of 173 mW.

C10-6 - 9:30

The on-board voltage regulator in the DDR5 memory module is required to resiliently supply current at large load transient events and alleviate output noise at the same time. We present an adaptive on-time (AOT) buck regulator with a turbo dual-phase interleaving logic for stable regulation and on-time control with dithered pseudo-constant switching frequency to suppress output harmonics by 6dB. The voltage regulator delivers up to 10A with a peak efficiency of 92.5% and covers 10A/μs steep load transients.
SESSION 11

Advanced Wireless for 5G [Room 3]

Thursday, June 17, 8:40-9:20

Chairpersons:  H.-J. Song, POSTECH
  A. Zolfaghari, Broadcom Ltd.

C11-1 - 8:40

In this paper, a wirelessly-powered 28-GHz phased-array relay transceiver for 5G networks is proposed. This relay transceiver consists of the proposed vector-summing backscatter for Tx, and a passive phase-shifting self-heterodyne receiver and rectifier for Rx and 24-GHz WPT. Implemented with on-PCB array antennas, the transceiver achieves -27.5dB and -31.3dB EVM for Tx and Rx, respectively, for a 400-MHz 64QAM OFDMA-mode signal (5G NR, MCS 19) without any wired power supply.

C11-2 - 8:50

This work introduces a 28-GHz phased-array transceiver supporting fast beam switching. A total of 256 beam-pattern settings can be stored in the on-chip SRAM for 4-ns beam switching. The proposed phased array also supports 5G dual-polarized MIMO (DP-MIMO) operation with cross-pol. leakage self-cancellation. The cross-pol. leakages can be detected by the on-chip detector and cancelled by the canceller in both TX and RX modes. The measured DP-MIMO EVMs are 5.5% in 64QAM and 3.5% in 256QAM. Large-volume data streaming can be supported with low latency for 5G NR.

C11-3 - 9:00

This paper presents a highly efficient 5G NR transmitter (TX) system consisting of a SAW-less RF transmitter IC, an RF power amplifier module (PAM) and a supply modulator IC (SM-IC). The RF TX with a 12-bit current-steering digital-to-analog converter (DAC) generates +7 dBm of output power to drive the PAM for high power user equipment (HPUE) in the 5G NR n77 band. The SM-IC generates a maximum output power of 10W at a 5-V peak envelope output voltage and provides an envelope tracking supply voltage of a 100-MHz NR RF signal to the n77 PAM. The PAM employing a differential topology generates a saturation output power of 32.5 dBm including the loss of the SOI switch and output filter. The designed PAM in ET mode achieves an ACLR of -37 dBc at a 27-dBm output power and saves 950 mW in comparison to APT mode.

C11-4 - 9:10

This work presents a current-mode inverse Class-D digital PA (DPA) with enhanced power back-off (PBO) efficiency. The PA adopts extra switches that allow the output voltage swing to be scaled by half, leading to a (theoretical) 6 dB enhancement in PBO efficiency while maintaining (ideally) 100% drain efficiency (DE). Implemented in 65 nm CMOS, the proposed DPA improves DE by 1.5x at 4.2 dB PBO in comparison with a normalized Class-B PA while requiring only one transformer and a single supply voltage.

SESSION 12

Techniques for High-Performance Wireless [Room 3]

Thursday, June 17, 9:30-10:10

Chairpersons:  C.-H. Heng, National Univ. of Singapore
  D. Griffith, Texas Instruments Inc.

C12-1 - 9:30

A D-band radar/communication fusion-mode transceiver featuring dual-function multiplexers, a two-point modulation (TPM) FMCW digital PLL with a dual-core DCO, wideband I/Q LO generators, current chopping high-gain mixers, and a power-combining PA with high output power is implemented in 28nm CMOS. In the radar mode, the RF front-end demonstrates 46GHz bandwidth, and the FMCW chirp generated by the on-chip PLL/LO achieves a bandwidth of 30GHz and a slope of 30GHz/50μs. In the communication mode, the transceiver including analog baseband realizes 20GHz BW and the IRR is better than 40dB. The measured TX saturated Pout is 13 dBm and output P1dB is 8.3 dBm. The measured PLL phase noise is -112dBc/Hz at 1MHz offset from the 11.69GHz carrier. The TX-to-RX over-the-air (OTA) modulation-demodulation measurement with QPSK and 16QAM signals shows EVMs of -20.7dB and -19.7dB, respectively.
C12-2 - 9:40
A Ka-Band Switched-Capacitor RFDAC Using Edge-Combining in 22nm FD-SOI, H. M. Nguyen*, J. S. Walling**, A. Zhu*** and R. B. Staszewski****, *Univ. College Dublin, **MCCI, Ireland and ***Univ. of Utah, USA

The paper proposes an RFDAC based on an edge-combining (EC) switched-capacitor power amplifier (SCPA) that triples its clock frequency directly in the output stage to enable a near-mm-wave operation. Another edge-combining based frequency-tripling DLL network increases the system efficiency while a new layout structure accounts for the distributed effects of combining transmission lines. Implemented in 22nm FD-SOI, the prototype achieves $P_{out}>21$ dBm, $DE>36\%$, $SE>22\%$ while operating in the Ka-band at 27.9GHz. Modulation at 2.4Gb/s results in 3.3% EVM and 30.8dBc ACLR.

C12-3 - 9:50
A 0.4-6 GHz Receiver for LTE and WiFi, H. Razavi and B. Razavi, Univ. of California, Los Angeles, USA

A universal receiver employs a feedback method to alleviate noise-linearity trade-offs and a new harmonic rejection (HR) method that does not require accurate phase matching. Realized in 28-nm CMOS technology, the prototype provides channel selection filtering at RF for channel bandwidths from 200 kHz to 160 MHz and exhibits a noise figure of 2.1-4.2 dB with HR > 60.8 dB while drawing 49 mW.

C12-4 - 10:00
Battery-Less IoT Sensor Node with PLL-Less WiFi Backscattering Communications in a 2.5-μW Peak Power Envelope, L. Lin, K. A. A. Ahmed, P. S. Salamani and M. Alioto, National Univ. of Singapore, Singapore

A system on chip including 802.11b WiFi communications is introduced to demonstrate battery-less operation for low-cost mm-scale sensor nodes. μW peak power is enabled by PLL-less WiFi backscattering communications and event-driven frequency regulation to compensate for environmental variations. A 180nm testchip integrating the entire signal chain from any of four sensor interfaces to wireless communications with a commercial WiFi router exhibits 2.5μW total power.

Technology / Circuits Joint Focus Session 1
3D/Heterogeneous Integration [Room 5]
Thursday, June 17, 8:40-9:20

Chairpersons: T. Tanaka, Tohoku Univ.
M. Delaus, Analog Devices, Inc.

JFS1-1 - 8:40 (Invited)
Design and Technology Solutions for 3D Integrated High Performance Systems, G. Van der Plas and E. Beyne, imec, Belgium

3D system integration builds on interconnect scaling roadmaps of TSVs (5μm to 100nm CD) and fine pitch bumps/pads (to <1μm pitch) for D2W and W2W schemes. Si bridges connect chiplets at 9.5Gbps, 338fJ/b, while W2W fine pitch memory logic functional partitioning improves power/performance by 30% vs 2D. Impingement cooler, BSDPN, high density MIMCAP and integrated magnetics push the power wall to 300W/cm². On the other hand, 3D design flows require further development. Process optimization, DfT, KGD/S and heterogeneous technology optimization of functionally partitioned 3D-SOC make high performance systems cost-effective.

JFS1-2 - 8:50 (Invited)
Chiplet-Based Advanced Packaging Technology from 3D/TSV to FOWLP/FHE, T. Fukushima, Tohoku Univ., Japan

Recently, "chiplets" have been expected to further scale the performance of LSI systems. However, system integration with chiplets is not a new methodology. The basic concept dates back several decades. The symbolic configuration of this chiplet-based concept is 3D integration with TSVs, which we have worked on since 1989. This paper introduces our 3D and heterogeneous system integration research from its historical activities to the latest efforts, including capillary self-assembly of tiny dies with a size of less than 0.1 mm and advanced flexible hybrid electronics (FHE) using fan-out wafer-level packaging (FOWLP).

JFS1-3 - 9:00

A high-density, low-bit-error-rate and low-power PHY for ultra-short-reach (USR) die-to-die communication has been fabricated in TSMC 7nm FinFET 1P15M CMOS technology. Interconnection is demonstrated through TSMC Chip-on-Wafer-on-Substrate (CoWoS) and TSMC Integrated Fan-Out (InFO) packaging technology. The PHY employs energy-efficient, high-performance schemes, including single-ended signaling without termination, a quarter-rate strobe and an unbalanced scheme on the transceiver, minimum intrinsic auto-alignment, and a novel noise-immunity coding methodology. It achieves 20Gbps per wire and 0.46pJ/bit on a 1-mm ultra-short-reach platform with a BER target of 1E-25. The bandwidth density is 5.31Tbps/mm along the shoreline and 2.25Tbps/mm² by area.

A direct silicon water cooling solution using a fusion-bonded silicon lid is proposed. It is successfully demonstrated as an effective cooling solution with a total power of >2600 W on a single SoC, equivalent to a power density of 4.8 W/mm². Low-temperature fusion bonding of the logic chip to the silicon lid, with a trench/grid cooling structure cut into the lid, enables minimal thermal resistance between the active devices and the cooling water and the best cooling efficiency. Direct water cooling on the logic chip silicon backside has also been demonstrated with a power density better than 7 W/mm².

Joint Panel Session [Room 1]
Friday, June 18, 7:30-8:30

Organizers (JFE): H. Morioka, Socionext Inc.
T. Tokuda, Tokyo Institute of Technology
Organizers (NAE): P. Ye, Purdue Univ.
J. Wu, Advanced Micro Devices, Inc.

7:30-8:30 Joint Panel [Room 1]
The New Normal...How will it Change Work, Life and Education?
Moderator: K. Yano, Hitachi, Ltd.

Technology / Circuits Joint Focus Session 2
Computing-in-Memory [Room 1]
Friday, June 18, 8:40-10:00

Chairpersons: K. Sohn, Samsung Electronics Co., Ltd.
F. Sheikh, Intel Corp.

JFS2-1 - 8:40
A 13.7 TFLOPS/W Floating-Point DNN Processor Using Heterogeneous Computing Architecture with Exponent-Computing-in-Memory, J. Lee, J. Kim, W. Jo, S. Kim, S. Kim, J. Lee and H.-J. Yoo, KAIST, Korea

An energy-efficient floating-point DNN training processor is proposed with a heterogeneous bfloat16 computing architecture using exponent computing-in-memory (CIM) and a mantissa processing engine. Mantissa-free exponent calculation enables pipelining of exponent and mantissa operations for heterogeneous bfloat16 computing while reducing MAC power by 14.4 %. 6T SRAM exponent computing-in-memory with bitline charge reusing reduces memory access power by 46.4 %. The processor is fabricated in 28 nm CMOS technology and occupies a 1.62x3.6 mm² die area. It achieves 13.7 TFLOPS/W energy efficiency, which is 274 times higher than the previous floating-point CIM processor.

JFS2-2 - 8:50

We present a programmable in-memory computing (IMC) accelerator integrating 108 capacitive-coupling-based IMC SRAM macros with a total size of 3.4 Mb, demonstrating one of the largest IMC hardware designs to date. We developed a custom ISA featuring IMC and SIMD functional units with hardware loop support for a range of deep neural network (DNN) layer types. The 28nm prototype chip achieves a system-level peak energy efficiency of 437 TOPS/W and a peak throughput of 4.9 TOPS at 40MHz, 1V supply.

JFS2-3 - 9:00

This work presents a 65nm RNN processor with computing-in-memory (CIM) macros. The main contributions include: 1) A similarity analyzer (SimAyz) to fully leverage the temporal stability of input sequences for a 1.52x performance speedup; 2) An attention-based context-breaking (AttenBrk) method with output speculation to reduce off-chip data accesses by up to 30.3%; 3) A double-buffering scheme for CIM macros to hide writing latency and a pipelined processing element (PE) array to increase the system throughput. Measured results show that this chip achieves 6.54-to-26.03 TOPS/W energy efficiency across various LSTM benchmarks.

JFS2-4 - 9:10
Fully Row/Column-Parallel In-Memory Computing SRAM Macro Employing Capacitor-Based Mixed-Signal Computation with 5-b Inputs, J. Lee, H. Valavi, Y. Tang and N. Verma, Princeton Univ., USA

This paper presents an in-memory computing (IMC) macro in 28nm for fully row/column-parallel matrix-vector multiplication (MVM), exploiting precise capacitor-based analog computation to extend from binary input-vector elements to 5-b input-vector elements, for 16x increase in energy efficiency and 5x increase in throughput. The 1152(row)x256(col.) macro employs multi-level input drivers based on a digital-switch DAC implementation, which preserve compute accuracy well beyond the 8-b resolution of the output ADCs, and whose area is halved via a dynamic-range doubling (DRD) technique. The macro achieves the highest reported IMC energy efficiency of 5796 TOPS/W and compute density of 12 TOPS/mm² (both normalized to 1-b ops). CIFAR-10 image classification is demonstrated with accuracy of 91%, equal to the level of ideal SW implementation.
JFS2-5 - 9:20
HERMES Core – A 14nm CMOS and PCM-Based In-Memory Compute Core Using an Array of 300ps/LSB Linearized CCO-Based ADCs and Local Digital Processing,

We present a 256x256 in-memory compute (IMC) core designed and fabricated in 14nm CMOS with backend-integrated multi-level phase-change memory (PCM). It comprises 256 linearized current controlled oscillator (CCO)-based ADCs at a compact 4μm pitch and a local digital processing unit performing affine scaling and ReLU operations. A novel frequency-linearization technique for CCOs is introduced, leading to accurate on-chip matrix-vector-multiply (MVM) when operating over 1 GHz. Measured classification accuracies on MNIST and CIFAR-10 datasets are presented when two cores are employed for deep learning (DL) inference. The measured energy efficiency is 10.5 TOPS/W at a performance density of 1.59 TOPS/mm².

JFS2-6 - 9:30
A 20x28 Spins Hybrid In-Memory Annealing Computer Featuring Voltage-Mode Analog Spin Operator for Solving Combinatorial Optimization Problems, J. Mu*, Y. Su* and B. Kim**, *Nanyang Technological Univ., Singapore and **Univ. of California, Santa Barbara, USA

This work proposes a hybrid analog-digital implementation of an annealing computer that achieves major improvements in both area and programmability. A compact hybrid spin circuit adopts eight voltage-mode analog spin operators that accumulate the spin interactions from neighboring spins using segmented voltage-mode drivers, two additional pairs of spin operator units for a magnetic coefficient with offset calibration, and external binary random bits for simulated annealing. A sense amplifier converts the analog spin operation result to a binary spin state, and a register stores and transmits the spin state to the neighboring spins. The test chip is fabricated in a 65nm process, and the measured power consumption is 9.9mW at 0.8V and 320MHz. It achieves a 1.58x improvement in area and a >3x reduction in annealing time compared with recent works.

JFS2-7 - 9:40
Analog In-Memory Computing in FeFET-Based 1T1R Array for Edge AI Applications,

Deep neural network (DNN) inference for edge AI requires low-power operation, which can be achieved by implementing massively parallel matrix-vector multiplications (MVM) in the analog domain on a highly resistive memory array. We propose a 1T1R compute cell (1T1R-cell) using a ferroelectric hafnium oxide-based FET (FeFET) and a TiN/SiO₂ tunneling-junction mega-ohm resistor (MOR) for analog in-memory computing (AImC). The MOR exhibited tunneling current behavior and mega-ohm resistance. A 1T1R-cell array-level evaluation was also performed. Random-access writing with a low-write-disturbance scheme was confirmed from the DC current summation output, and binaries were successfully classified into “T” and “L”. Based on the experimental results of our proposed 1T1R-cell, we obtained a state-of-the-art energy efficiency of 13700 TOPS/W including the periphery. Furthermore, we confirmed that a high inference accuracy can be obtained with our low-resistance-variability 1T1R-cell with a properly trained model.

JFS2-8 - 9:50
Energy-Efficient Reliable HZO FeFET Computation-in-Memory with Local Multiply & Global Accumulate Array for Source-Follower & Charge-Sharing Voltage Sensing, C. Matsui, K. Toprasertpong, S. Takagi and K. Takeuchi, The Univ. of Tokyo, Japan

An energy-efficient, high-throughput, noise-immune, high-density HZO FeFET Computation-in-Memory (CiM) is proposed. A Local Multiply & Global Accumulate Array is realized by source-follower read, which multiplies neural network inputs and weights (FeFET V₁), and charge-sharing, which accumulates the multiplied values. The proposed FeFET CiM operates 32 Multiply-lines (MLs) and 1024 Accumulate-lines (ALs) in parallel. Source-follower voltage sensing achieves 3 bit/cell for weight storage. The proposed CiM is immune to read-disturb. 3 bit/cell FeFET remains feasible after 10-year data retention. Assuming a FeFET read time of 100 ns, 66 TOPS/W is achieved. Conventional pre-charge/discharge voltage-sensing operates only bit-line (BL)-parallel (not word-line (WL)-parallel) and increases read time. Conventional current-sensing CiM operates with 32 WL- and 16 BL-parallel, which is restricted by the DC current of memory cells. Thus, the proposed CiM provides 64 times higher TOPS/W than conventional current-sensing CiM.
C13-1 - 8:40
A 1024-Channel Simultaneous Recording Neural SoC with Stimulation and Real-Time Spike Detection, D.-Y. Yoon, S. Pinto, S. Chung, P. Merolla, T.-W. Koh and D. Seo, Neuralink Corporation, USA

A fully implantable brain-machine interface (BMI) targeting clinical applications has stringent size and power requirements. In this paper we present a 5x4mm² neural system-on-chip (SoC) capable of recording and stimulating from 1024 implanted electrodes via a serial digital link. The design has on-chip configurable spike detection that can reduce off-chip bandwidth by 100X. With fully integrated power management circuitry, including power-on-reset and brown-out detection, our design consumes 24.7mW in total, making it the lowest-power, highest-density AC-coupled neural SoC reported for recording both local field potential (LFP) and action potential (AP) with a 5Hz-10kHz bandwidth.

C13-2 - 8:50
A 1.15μW 5.54mm³ Implant with a Bidirectional Neural Sensor and Stimulator SoC Utilizing Bi-Phasic Quasi-Static Brain Communication Achieving 6kbps-10Mbps Uplink with Compressive Sensing and RO-PUF Based Collision Avoidance, B. Chatterjee, G. K. K., M. Nath, S. Xiao, N. Modak, D. Das, J. Krishna and S. Sen, Purdue Univ., USA

To solve the challenge of powering and communication in a brain implant with low end-to-end energy loss, we present Bi-Phasic Quasi-static Brain Communication (BP-QBC), achieving < 60dB worst-case channel loss, and ~41X lower power w.r.t. traditional Galvanic body channel communication (G-BCC) at a carrier frequency of 1MHz (~6X lower power than G-BCC at 10MHz) by blocking DC-current paths through the brain tissue. An additional 16X improvement in net energy-efficiency (μJ/b) is achieved through compressive sensing (CS), allowing a scalable (6kbps-10Mbps) duty-cycled uplink (UL) from the implant to an external wearable, while reducing the active-power consumption to 0.52μW at 10Mbps, i.e. within the range of harvested body-coupled power in the downlink (DL), with externally applied electric currents < 1/5th of ICNIRP safety limits. BP-QBC eliminates the need for sub-cranial interrogators, utilizing quasi-static electrical signals for end-to-end BCC, avoiding transduction losses unlike alternative technologies (ultrasound, optical and magneto-electric).

C13-3 - 9:00
A One-Shot Learning, Online-Tuning, Closed-Loop Epilepsy Management SoC with 0.97μJ/Classification and 97.8% Vector-Based Sensitivity, M. Zhang*, L. Zhang*, J. H. Park**, C.-W. Tsai*, K. A. Ng****, L. Lin*, Y. Dong*, J. Li*, T. Tang*, H. Wu*, L. 19t and J. Yoo*****, **National Univ. of Singapore, Singapore, ***Samsung Electronics Co., Ltd., Korea, ****DigiPen and *****The N.I Institute for Health, Singapore

We propose a patient-specific closed-loop epilepsy tracking and real-time suppression SoC with the first-in-literature one-shot learning and online tuning. The entire SoC consumes the lowest energy reported to date of 0.97μJ/class, and occupies the smallest area of 0.13mm²/Ch. Verified with CHB-MIT database and a local hospital patient, the 9.8b ENOB 2-Cycle AFE combined with the GTCA-SVM DBE achieves vector-based sensitivity, specificity, and latency of 97.8%, 99.5%, and <1s.

C13-4 - 9:10
A 1.5nJ/cls Unsupervised Online Learning Classifier for Seizure Detection, A. Chua*, M. I. Jordan* and R. Muller**, **Univ. of California, Berkeley and ***Chan-Zuckerberg Biohub, USA

This work presents a 1.5 nJ/classification (nJ/cls) seizure detection classifier which provides unsupervised online updates on an initial offline-trained regression model to achieve >97% average sensitivity and specificity on 27 patient datasets, including three that have >250 hours of continuous recording. The classifier was fabricated in 28nm CMOS and operates at 0.5V supply. Through hardware optimizations, low overall computational complexity, and voltage scaling, the online learning classifier achieves 24× better energy per classification and occupies 10× lower area than state-of-the-art.

C13-5 - 9:20
A 77-dB DR 16-Ch 2nd-Order Δ-ΔΣ Neural Recording Chip with 0.0077mm²/Ch, S. Wang***, M. Ballini*, X. Yang*, C. Sawigun*, J.-W. Weijers*, D. Biswas* and C. M. Lopez*, **imec, Belgium and ***Univ. of Southampton, UK

This paper presents a scalable 16-channel neural recording chip enabling simultaneous acquisition of action-potentials (APs), local-field potentials (LFPs), electrode DC offsets (EDOs) and stimulation artifacts (SAs) without saturation. By combining a DC-coupled Δ-ΔΣ architecture with new bootstrapping and chopping schemes, the proposed readout IC achieves an area of 0.0077mm² per channel, an input-referred noise of 5.53±0.36μVrms in the AP band and 2.8±0.18μVrms in the LFP band, a dynamic range (DR) of 77dB, an EDO tolerance of ±70mV and an input impedance of 283MΩ. The chip has been validated in an in vitro setting, demonstrating the capability to record extracellular signals even when using small, high-impedance electrodes. Because of the small area achieved, this architecture can be used to implement ultra-high-density neural probes for large-scale electrophysiology.
SESSION 14
Noise-Shaping A/D Converters [Room 2]

Friday, June 18, 9:40-10:10

Chairpersons: K. Yoshioka, Keio Univ.
R. Kapusta, Analog Devices, Inc.

C14-1 - 9:40
A 0.6V 86.5dB-DR 40kHz-BW Inverter-Based Continuous-Time Delta-Sigma Modulator with PVT-Robust Body-Biasing Technique, S. Lee*, S. Park*, Y. Kim* and Y. Chae**, *Samsung Electronics Co., Ltd. and **Yonsei Univ., Korea

This paper presents a body-biasing technique for an energy-efficient inverter-based integrator that significantly improves the PVT robustness of the integrators in sub-1V continuous-time delta-sigma modulators (CTDSMs). A prototype CTDSM with the body biasing technique is implemented in a 28 nm CMOS process and achieves 83 dB SNDR, 84 dB SNR, and 86.5 dB DR in a 40-kHz bandwidth, while consuming only 33.6 μW from a 0.6 V supply. It achieves a Schreier FoM of 177.3 dB.
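As a quick cross-check of the figures quoted above, the Schreier figure of merit can be recomputed from the reported dynamic range, bandwidth and power. This is a minimal sketch assuming the common DR-based definition, FoM = DR + 10·log10(BW/P); the function name is ours, and the abstract does not state which variant the authors used.

```python
import math

def schreier_fom_db(dr_db, bandwidth_hz, power_w):
    """DR-based Schreier FoM in dB (assumed definition, see lead-in)."""
    return dr_db + 10.0 * math.log10(bandwidth_hz / power_w)

# Values taken from the abstract: 86.5 dB DR, 40 kHz BW, 33.6 uW.
fom = schreier_fom_db(86.5, 40e3, 33.6e-6)
print(f"{fom:.1f} dB")  # prints "177.3 dB"
```

Under that assumption the result agrees with the 177.3 dB stated in the abstract.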

C14-2 - 9:50
OTA-free 1-1 MASH ADC Using Fully Passive Noise Shaping SAR & VCO ADC, S. T. Chandrasekaran*, S. P. Bhanushali*, S. Pietri** and A. Sanyal*, *Univ. at Buffalo - SUNY and **NXP Semiconductors N.V., USA

We present an OTA-free 1-1 MASH ADC utilizing a fully passive noise shaping (FPNS) SAR as the first stage and an open-loop VCO ADC as the second stage. The key contribution of this work is to address the challenge of driving a large sampling capacitor for high-resolution NS-SAR. The proposed architecture reduces the resolution of the SAR stage and leverages residue attenuation due to passive charge sharing in the FPNS SAR to linearize the VCO. Combining an FPNS SAR with a VCO ADC shapes the in-band thermal noise of the VCO and SAR comparator at the ADC output. Additionally, we demonstrate a computationally inexpensive foreground inter-stage gain calibration algorithm for the proposed ADC architecture. The prototype ADC consumes 0.16mW while achieving an SNDR/DR of 71.5/75.8dB over a 1.1MHz bandwidth and a Walden FoM of 23.3fJ/step, which is the lowest in 65nm technology.
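The Walden figure of merit quoted above can be approximated from the rounded numbers in the abstract. The sketch below assumes the usual definitions ENOB = (SNDR − 1.76)/6.02 and FoM = P / (2^ENOB · 2·BW); the function names are ours. Because the inputs are rounded, the result is expected to land near, not exactly on, the reported 23.3 fJ/step.

```python
import math

def enob_from_sndr(sndr_db):
    """Effective number of bits from SNDR (standard sine-wave formula)."""
    return (sndr_db - 1.76) / 6.02

def walden_fom_j_per_step(power_w, sndr_db, bandwidth_hz):
    """Walden FoM in joules per conversion step for an oversampled ADC."""
    return power_w / (2.0 ** enob_from_sndr(sndr_db) * 2.0 * bandwidth_hz)

# Values taken from the abstract: 0.16 mW, 71.5 dB SNDR, 1.1 MHz BW.
fom = walden_fom_j_per_step(0.16e-3, 71.5, 1.1e6)
print(f"{fom * 1e15:.1f} fJ/conv.-step")
```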

C14-3 - 10:00
A 300MHz-BW 38mW 37dB/40dB SNDR/DR Frequency-Interleaving Continuous-Time Bandpass Delta-Sigma ADC in 28nm CMOS, R. Lu and M. P. Flynn, Univ. of Michigan, USA

Bandpass CT DSMs offer essential advantages for RF and mm-wave systems, including digital I/Q mixing, innate anti-alias filtering, and an easy-to-drive input. However, existing CT DSMs lack the bandwidth for emerging applications, including 5G and radar. A practical frequency-interleaving architecture efficiently scales CT bandpass DSM bandwidth to break the CT DSM bandwidth bottleneck. A 28nm CMOS prototype frequency-interleaves two DSMs for 300MHz BW, 37dB SNDR at 1.5GHz while consuming 38mW.

SESSION 15
High-Speed ADCs [Room 3]

Friday, June 18, 8:40-9:20

Chairpersons: M. Fukazawa, Renesas Electronics Corp.
I. Arsovski, Google

C15-1 - 8:40
A 10.0 ENOB, 6.2 fJ/conv.-step, 500 MS/s Ringamp-Based Pipelined-SAR ADC with Background Calibration and Dynamic Reference Regulation in 16nm CMOS, J. Lagos, N. Markulic, B. Hershberg, D. Dermit, M. Shirvas, E. Martens and J. Craninckx, imec, Belgium

We present a single-channel fully-dynamic pipelined SAR ADC that leverages a novel quantizer and narrowband dither injection to achieve fast and comprehensive background calibration of DAC mismatch, interstage gain, and ring amplifier (ringamp) bias optimality. The ADC also includes an on-chip wide-range, fully-dynamic reference regulation system. Consuming 3.3 mW at 500 MS/s, it achieves 10.0 ENOB and 75.5 dB SFDR, yielding a Walden FoM of 6.2 fJ/conv.-step.

C15-2 - 8:50

This paper demonstrates a background timing-skew calibration for time-interleaved (TI) ADCs that uses direct time-based estimation to achieve input-independent and fast convergence. It suppresses the timing spurs of a 20GS/s 8x TI-ADC below -50dB at a Nyquist input within 24 calibration cycles, regardless of the calibrating input condition. The 8b time-domain TI-ADC achieves a 91.3fJ/conv.-step FoM_Walden and >16GHz bandwidth.

We present a time-interleaved (TI) SAR ADC with 8 channels realizing 8-bit conversion at 1 GS/s each. SNDR is 45 dB at low frequency with an ERBW of 5 GHz limited by sampler distortion. Conventional SAR conversion at high speed with minimum degradation is achieved by leveraging techniques such as early quantization, minimum delay logic, DAC redundancy and gain and offset compensation via the DAC. At 8 GS/s the ADC consumes 26 mW resulting in an efficiency of 30 fJ/conv.-step.

This paper presents an auxiliary-channel-assisted background calibration for the distortion of an ADC’s input front-end (buffer + T/H) and for inter-stage gain error. The custom-designed auxiliary channel runs at a fraction of the main ADC's speed, with only moderate noise performance but high linearity. It also has the same pseudo inter-stage gain characteristic as the main ADC, which, combined with the multi-layer LMS procedure, facilitates fast convergence. Verified in a 12-bit 1GS/s pipelined SAR ADC in 28nm CMOS, the SNDR and SFDR at Nyquist input are 59.28 dB and 67.09 dB, respectively. The distortion calibrations alone contribute >11.13dB SFDR improvement in the entire Nyquist band. Both the ADCs and input buffers work under a 1V supply, consuming 19.2mW with 17% from the buffer.

A 16Kb one-time-programmable (OTP) antifuse memory is fabricated in a 5nm high-K metal-gate FinFET CMOS for the first time. The bootstrap high voltage scheme (BHVS), read endpoint detection (REPD) and pseudo-differential sensing (PDS) are implemented to achieve intrinsic bit error rate (BER) below 1ppb for in-field programming in 5nm SoC and 10 years of data retention at 125°C.

This paper proposes a 16Mb e-NVM macrocell for automotive grade 0 microcontrollers based on a PCM cell with a Bipolar Transistor (BJT) selector. The solution is developed in proprietary 28nm FD-SOI CMOS technology, with a Super-STI scheme that has enabled high-density e-NVM with 0.019μm² cell size. The macrocell organization can be configured by the application either for Over-The-Air (OTA) mode up to 24MB, or for a 16MB extra-reliability mode with two cells per bit, still resulting in an extremely competitive equivalent bit-cell size (0.038μm²). Cell Mode configuration can be dynamically tuned, with a unique set of features for flexible assisted OTA software update. The integration of a 16MB PCM cell array, extensible up to 24MB, in an automotive grade product-like test vehicle chip is presented here as the evolution of the first Embedded PCM macrocell for automotive, complementing the fulfillment of all criteria in the demanding automotive environment.

A 28nm embedded Flash memory is designed for foundry automotive applications. A Temperature Auto-Tracking Sense Amplifier using Bit line Charge Boost and Bit line Leakage current Compensation improves both performance and size. In addition, Word Line and YMUX Gate Boost is applied to secure a sensing margin at low voltage. These techniques enable a 10ns read of 288 bits at a time in a 16Mb memory by improving the sensing margin over a wide temperature range and down to a minimum voltage of 0.85V. The design also achieves a competitive IP density of 7.42Mb per mm², and silicon validation confirmed a yield high enough for mass production.
C16-4 - 10:00

This paper presents an energy-efficient wordline driver for a triple level cell 3D NAND flash. Unlike a conventional circuit, which has a large charge pump and high-voltage regulators operating under the inefficient stepped-up voltage, the proposed circuit has a distributed charge pump (CP) that directly drives the wordlines, aided by a charge-compensating regulator that operates under the nominal supply and produces a ripple-free output. The proposed voltage driver for a 40 wordline layer is fabricated in a 180nm UHV process and consumes 98.3nJ from a 2.2V supply during one unit of program pulse and verify period, which is more than a 2.1x improvement in energy efficiency compared to the conventional scheme.

SESSION 17
Advanced Frequency Synthesizers [Room 4]
Friday, June 18, 8:40-9:20

Chairpersons: K. Okada, Tokyo Institute of Technology
A. Raychowdhury, Georgia Institute of Technology

C17-1 - 8:40
A 11.1-to-14.2 GHz Self-Adapted Two-Point Modulation Dual-Path Type-II Digital PLL Concurrently Achieving 124.7-MHz/µs Chirp Rate and 2.27-GHz Bandwidth, W. Deng***, Z. Chen**, H. Jia***, S. Sun***, G. Chen*, Z. Wang* and B. Chi*, *Tsinghua Univ., **Research Institute of Tsinghua Univ. in Shenzhen and ***Beijing Institute of Technology, China

Different from conventional two-point modulation (TPM) type-II PLLs requiring non-trivial gain calibrations and TPM type-III PLLs with loop stability concern and limited chirp rate, a self-adapting gain-mismatch TPM type-II digital PLL is proposed. The measurement results indicate that the proposed PLL can generate a precise triangular chirp with 2.27 GHz BW and 18.2 µs period, which demonstrates the widest normalized chirp bandwidth and the fastest chirp rate simultaneously.

C17-2 - 8:50

This paper proposes a mm-wave all-digital PLL (ADPLL) employing a 4x reference oversampling (ROS) phase detector (PD). The fractional-N operation is assisted by two capacitive DACs embedded in the ROS-PD. Exploiting the benefits of all-digital implementation, differential/offset mismatches can be compensated by CDAC and zeroed out through a 4-tap moving average (MA) to reduce the reference spurs. The proposed fractional-N ADPLL is implemented in 28 nm CMOS. It achieves rms jitter of 237 fs at a carrier of 28.8 GHz when using a standard 50 MHz crystal oscillator, while consuming only 11.9 mW, leading to FoMjitter-N of -269.3 dB.

C17-3 - 9:00
A 3.3-GHz 4.6-mW Fractional-N Type-II Hybrid Switched-Capacitor Sampling PLL Using CDAC-Embedded Digital Integral Path with -80-dBc Reference Spur, Z. Xu, M. Osada and T. Iizuka, The Univ. of Tokyo, Japan

We present a type-II fractional-N hybrid switched-capacitor sampling PLL, using a capacitive digital-to-analog converter (CDAC) as a sampler and an analog adder receiving digital integrator's output. To guarantee sufficient CDAC settling time and filter switch-on time, we designed a synchronous timing generator utilizing the multi-modulus divider's (MMDIV’s) inter-stage clocks. The prototype chip in 65-nm CMOS achieves -80-dBc reference spur, 236-fs integrated RMS jitter, and 4.6-mW power consumption, translating to -246-dB FoM.
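The jitter figures of merit quoted in the C17-2 and C17-3 abstracts can be cross-checked from the stated jitter and power. This sketch assumes the widely used definitions FoM = 10·log10((σ/1 s)² · P/1 mW) and FoM-N = FoM − 10·log10(f_out/f_ref); the function names are ours, and the abstracts do not spell out the formulas.

```python
import math

def jitter_fom_db(rms_jitter_s, power_w):
    """Jitter-power FoM in dB, normalized to 1 s and 1 mW (assumed definition)."""
    return 20.0 * math.log10(rms_jitter_s) + 10.0 * math.log10(power_w / 1e-3)

def jitter_n_fom_db(rms_jitter_s, power_w, f_out_hz, f_ref_hz):
    """Jitter FoM additionally normalized by the multiplication factor N."""
    n = f_out_hz / f_ref_hz
    return jitter_fom_db(rms_jitter_s, power_w) - 10.0 * math.log10(n)

# C17-3: 236 fs rms jitter, 4.6 mW (abstract reports -246 dB FoM).
fom_c17_3 = jitter_fom_db(236e-15, 4.6e-3)

# C17-2: 237 fs, 11.9 mW, 28.8 GHz carrier, 50 MHz reference
# (abstract reports -269.3 dB FoMjitter-N).
fom_n_c17_2 = jitter_n_fom_db(237e-15, 11.9e-3, 28.8e9, 50e6)

print(f"C17-3 FoM   = {fom_c17_3:.1f} dB")
print(f"C17-2 FoM-N = {fom_n_c17_2:.1f} dB")
```

Both values come out within about 0.1 dB of the reported numbers, which is consistent with rounding of the published jitter and power.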

C17-4 - 9:10
A Feedforward and Feedback Constant-Slope Digital-to-Time Converter in 28nm CMOS Achieving ≤ 0.12% INL/Range over >100mV Supply Range, P. Chen*, F. Zhang***, S. Hu* and R. B. Staszewski*, *Univ. College Dublin, Ireland and ***Silicon Austria Labs, Austria

We propose a constant-slope DTC with feedforward and feedback techniques to minimize nonlinearity and power consumption. Implemented in 28 nm CMOS, the proposed DTC runs at 50 MHz while consuming 36 µW. It features a 543 ps delay range and a nonlinearity cancellation yielding 0.11% INL/Range. The linearity is retained across supply voltages from 780 mV to 900 mV.

SESSION 18
High-Performance Clock Generators [Room 4]
Friday, June 18, 9:30-10:00

Chairpersons: W. Deng, Tsinghua Univ.
J. Proesel, Nubis Communications

C18-1 - 9:30
A 19-GHz PLL with 20.3-fs Jitter, Y. Zhao and B. Razavi, Univ. of California, Los Angeles, USA

A PLL samples both the rising and falling edges of the reference clock and employs a new retiming method in the feedback divider. Fabricated in 28-nm CMOS technology, the prototype achieves an rms jitter of 20.3 fs from 10 kHz to 100 MHz with a spur of -66 dBc while consuming 12 mW.
C18-2 - 9:40
A 43nW 32kHz Pulsed Injection TCXO with ±4.2ppm Accuracy Using ΔΣ Modulated Load Capacitance, S. Park*, J.-H. Seol*****, E. Xu**, D. Sylvester** and D. Blaauw***, *KAIST, Korea, **Univ. of Michigan, USA and ***Samsung Electronics Co., Ltd., Korea

This paper presents an ultra-low power temperature-compensated crystal oscillator (TCXO) with a pulsed injection XO driver for IoT applications. Temperature compensation is achieved using a single switched load capacitance, modulated by a ΔΣ modulator (DSM). The DSM digitizes a piece-wise linear (PWL) approximation of the XO temperature dependence, where a coarse 4-bit temperature sensor selects the PWL segment. The proposed 32.768kHz TCXO achieves an accuracy of ±4.2ppm from −20 to 85 °C with 3-point trimming and an Allan deviation floor of 34 ppb while consuming 43nW.
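To illustrate how a single switched capacitor can present a fractional average load, the sketch below models a generic first-order ΔΣ modulator as an integer accumulator. It is not the authors' circuit: the modulator order, the 16-level full scale and the 2 pF switched capacitor are all hypothetical choices made for this example.

```python
def first_order_dsm(level, full_scale, n_cycles):
    """Generic first-order delta-sigma bitstream (illustrative only).

    The fraction of ones in the output approaches level / full_scale.
    """
    acc = 0
    bits = []
    for _ in range(n_cycles):
        acc += level
        if acc >= full_scale:   # accumulator overflow: switch the capacitor in
            bits.append(1)
            acc -= full_scale
        else:                   # otherwise leave it switched out
            bits.append(0)
    return bits


# Hypothetical setting: 3/16 of a 2 pF switched load capacitor.
bits = first_order_dsm(level=3, full_scale=16, n_cycles=160)
duty = sum(bits) / len(bits)
c_switched_pf = 2.0                    # hypothetical capacitor value
c_average_pf = duty * c_switched_pf    # time-averaged added load
print(duty, c_average_pf)
```

Changing `level` moves the average capacitance in fine steps while the hardware still toggles only one capacitor, which is the general appeal of this kind of modulation.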

C18-3 - 9:50
A 27-73 GHz Injection-Locked Frequency Divider, H. Razavi and B. Razavi, Univ. of California, Los Angeles, USA

A new model for injection-locked dividers leads to a 4.76-mW prototype that operates from 24 GHz to 73 GHz with no need for tuning or adjustments. Occupying an area of 0.037 mm², the circuit can robustly serve millimeter-wave radios as well as full-rate 28-Gb/s, 56-Gb/s and half-rate 112 Gb/s wireline transceivers.

Technology / Circuits Joint Focus Session 3
Photonics Interconnect and Compute [Room 6]

Friday, June 18, 8:40-9:50

Chairpersons: Y. Shiratori, Nippon Telegraph and Telephone Corp. (NTT) / M. Takenaka, The Univ. of Tokyo
T. Letavic, GLOBALFOUNDRIES Inc.

JF3-1 - 8:40 (Invited)
On-Silicon Photonic Integrated Circuit toward On-Chip Interconnection and Distributed Computing, N. Nishiyama and T. Amemiya, Tokyo Institute of Technology, Japan

Heterogeneous material integration technology gives us freedom of material choice in both electronic and photonic devices. In this presentation, the status, technology and characteristics of photonic devices in photonic integrated circuits (PICs) on Si (SOI) will be reviewed. Membrane (thin III-V film) PICs can realize low-power data transmission on a Si substrate. These PICs are applicable to on-chip interconnection to reduce power dissipation at higher transmission speeds. 93 fJ/bit transmission at 20 Gbps has been demonstrated. Hybrid PICs were also demonstrated to realize a 10-Tbps-class transceiver with low energy cost for distributed computing. This structure can integrate multiple functions and many array devices in one chip. In addition, with dense integration, some electronic functions can be moved to the photonics part, which reduces power consumption.

JF3-2 - 8:50 (Invited)

For the first time, we demonstrate an error-free, 128Gbps (8x16Gbps) optical transceiver using a microring-based wavelength-division multiplexed (WDM) architecture. The optical transceiver ran for 12 hours with zero errors, resulting in a measured bit-error rate of <1.45e-15 per optical lane. The total number of bits sent during this time was ~691 terabits per lane and ~5.5 petabits aggregate across all lanes.
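The bit counts and the bit-error-rate bound quoted above follow from the lane rate and run time. The sketch below assumes the bound is simply one over the number of error-free bits observed, which appears consistent with the quoted <1.45e-15; a bound at a stated statistical confidence level would be looser. The function names are ours.

```python
def bits_transmitted(rate_bps, duration_s):
    """Total bits sent at a constant line rate."""
    return rate_bps * duration_s

def zero_error_ber_bound(n_bits):
    """Simple 1/N upper bound on BER after N error-free bits (assumed form)."""
    return 1.0 / n_bits

lane_rate_bps = 16e9          # 16 Gbps per optical lane (from the abstract)
n_lanes = 8                   # 8 lanes (from the abstract)
duration_s = 12 * 3600        # 12-hour run (from the abstract)

per_lane_bits = bits_transmitted(lane_rate_bps, duration_s)
aggregate_bits = n_lanes * per_lane_bits
ber_bound = zero_error_ber_bound(per_lane_bits)

print(f"{per_lane_bits / 1e12:.0f} Tb per lane")
print(f"{aggregate_bits / 1e15:.1f} Pb aggregate")
print(f"BER < {ber_bound:.3e}")
```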

JF3-3 - 9:00
Graphene Electro-Absorption Modulators Integrated at Wafer-Scale in a CMOS Fab, C. H. Wu*,**, S. Brems*, D. Yudistira*, D. Cott*, A. Milenin*, K. Vandersmissen*, A. Maestre***, J. Van Campenhout*, C. Huyghebaert* and M. Pantouvaki*,

imec's 300mm silicon photonics platform and the full integration sequence is using standard CMOS production tools except for the 6-inch CVD graphene growth and transfer, transferred by Graphenea. 164x TE EAMs were measured per wafer and demonstrate 90% yield with modulation efficiency (ME) of 41dB/mm for 8V voltage swing, after process optimization. The 3dB bandwidth of the EAMs is 14.9GHz for the device with 50μm active length. Both parameters show comparable performance with lab-based devices, obtained on coupons using similar CVD graphene. This work paves the way to enable high-volume manufacturing of 2D-material-based photonic devices.

JF3-4 - 9:10
Silicon Photonic Micro-Ring Modulator-Based 4 x 112 Gb/s O-Band WDM Transmitter with Ring Photocurrent-Based Thermal Control in 28nm CMOS, J. Sharma, H. Li, Z. Xuan, R. Kumar, C.-M. Hsu, M. Sakib, P. Liao, H. Rong, J. Jaussi and G. Balamurugan, Intel Corp., USA

We present a 4λ x 112 Gb/s/λ hybrid-integrated silicon photonic TX suitable for 400G Ethernet modules and co-packaged optics. The photonic IC (PIC) uses cascaded micro-ring modulators (MRMs) with integrated heaters for efficient wavelength division multiplexing (WDM). The 28nm CMOS electronic IC includes PAM4 MRM drivers with nonlinear FFE and control circuits to stabilize MRM performance against process and temperature variations. A thermal control scheme based on sensing MRM photocurrents is used to minimize monitoring hardware in the PIC. Measured results demonstrate 112 Gb/s PAM4 operation with <0.7 dB TDECQ from each of the 4 channels. To our best knowledge, this is the highest per-λ data rate reported for an O-band ring-based WDM transmitter.
First InGaAs/InAlAs Single-Photon Avalanche Diodes (SPADs) Heterogeneously Integrated with Si Photonics on SOI Platform for 1550 nm Detection, J. Zhang, H. Xu, G. Zhang, Y. Chen, H. Wang, K. H. Tan, S. Wicaksono, C. Wang, C. Sun, Q. Kong, C. W. Lim, S.-F. Yoon and X. Gong, National Univ. of Singapore, Singapore

For the first time, heterogeneous integration of InGaAs/InAlAs single-photon avalanche diodes (SPADs) with Si photonics was realized and demonstrated through a temperature die-to-die bonding technique. Together with the adoption of a triple-mesa structure in SPADs that not only avoids the surface exposure to the high electric field but also alleviates the electric field crowding at mesa edges, our integrated SPADs exhibit high single-photon detection efficiency (SPDE) of 22% and low dark count rate (DCR) of 8.6 × 10^5 Hz, which are among the best performance reported for InGaAs/InAlAs SPADs, and are approaching that of InGaAs/InP SPADs. High device yield and performance uniformity were also achieved.

Bandgap-Tunable III-V-OI Photonics Platform with Quantum Well Intermixing for Versatile Active-Passive Integration of Chip-Scale Photonic Integrated Circuits, N. Sekine, K. Toprasertpong, S. Takagi and M. Takenaka, The Univ. of Tokyo, Japan

We investigated monolithic active-passive integration on a III-V-OI photonics platform using quantum well intermixing for photonic integrated circuits. We developed a void-free wafer bonding technique and established large bandgap modulation through quantum well intermixing using low-energy hot P_2^+ molecular ion implantation. We monolithically integrated passive waveguides, waveguide PDs, electro-absorption (EA) modulators, and optical switches with carrier-injection optical phase shifters on III-V-OI operating at 1.55 μm wavelength.

First Demonstration of Waveguide-Coupled Ge<sub>0.92</sub>Sn<sub>0.08</sub>/Ge Multiple-Quantum-Well Photodetector on the SOI Platform for 2-μm Wavelength Optoelectronic Integrated Circuit, H. Wang*, Y. Chen*, J. Zhang*, G. Zhang*, Y.-C. Huang**, and X. Gong*,

*National Univ. of Singapore, Singapore and **Applied Materials, Inc., USA

We report the first demonstration of a silicon-on-insulator (SOI) waveguide-coupled Ge<sub>0.92</sub>Sn<sub>0.08</sub>/Ge multiple-quantum-well (MQW) photodiode (PD) for 2 μm wavelength using a flip-chip bonding technology. The light in the waveguide couples to the PD for detection via a grating coupler. The grating coupler and waveguide were designed and fabricated on a standard SOI wafer for 2 μm and bonded with the GeSn/Ge PDs. On the same wafer, back-illuminated GeSn/Ge PDs were also integrated using the same technology for free-space optical detection. Our waveguide-coupled PD exhibits a responsivity of 10.3 mA/W at 2 μm wavelength and one of the lowest dark current densities of 38.4 mA/cm<sup>2</sup> for Ge<sub>1-x</sub>Sn<sub>x</sub> PDs. In addition, no degradation of the dark current was found after the bonding.


This paper presents a CMOS image sensor and an AI accelerator to realize surveillance camera systems based on edge computing. For CMOS image sensors to be used for surveillance, it is desirable that they are highly sensitive even in low illumination. We propose a new timing shift ADC used in CMOS image sensors for improving high sensitivity performance. Our proposed ADC improves non-linearity characteristics under low illumination by 63%. Achieving power-efficient edge computing is a challenge for the systems to be used widely in the surveillance camera market. We demonstrate that our proposed AI accelerator performs inference processing for object recognition with 1 TOPS/W.


We developed a dual pixel with accurate and all-directional auto focus (AF) performance in CMOS image sensor (CIS). The optimized in-pixel deep trench isolation (DTI) provided accurate AF data and good image quality in the entire image area and over whole visible wavelength range. Furthermore, the horizontal-vertical (HV) dual pixel with the slanted in-pixel DTI enabled the acquisition of all-directional AF information by the conventional dual pixel readout method. These technologies were demonstrated in 1.4um dual pixel and will be applied to the further shrunken pixels.
JFS4-3 - 9:00

Sub-micron pixels have been widely adopted in recent CMOS image sensors to implement high-resolution cameras in small form factors, e.g., slim mobile phones. Even with shrinking pixels, customers demand higher image quality, and the pixel performance must remain comparable to that of the previous generations. Conventionally, to suppress the optical crosstalk between pixels, a metal grid has been used as an isolation structure between adjacent color filters. However, as the pixel size continues to shrink into the sub-micron regime, optical loss increases because the focal spot size of the pixel's microlens does not scale down with the pixel size due to the diffraction limit, so light absorption inevitably occurs in the metal grid. For the first time, we have demonstrated a new lossless, dielectric-only grid scheme. The result shows a 29% increase in sensitivity and a +1.2-dB enhancement in Y-SNR when compared to the previous hybrid metal-and-dielectric grid.

JFS4-4 - 9:10

This paper presents a 2-megapixel global-shutter CMOS image sensor (CIS) with a low random noise of 2.6 e-rms, a low power of 116.2 mW at video rate, and a high speed of up to 960 fps, using an advanced DRAM technology. To achieve a high-performance global-shutter CIS, we propose a novel digital pixel sensor architecture that performs global-shutter operation with a pixel-wise ADC and an in-pixel digital memory. Each pixel has two small-pitch Cu-to-Cu interconnectors for wafer-level stacking, and the pitch of each unit pixel is less than 5 um, making it the world's smallest pixel embedding both a pixel-level ADC and 22-bit memories.

JFS4-5 - 9:20
A Photon-Counting 4Mpixel Stacked BSI Quanta Image Sensor with 0.3e- Read Noise and 100dB Single-Exposure Dynamic Range, J. Ma, D. Zhang, O. Elgendy and S. Masoodian, Gigajot Technology Inc., USA

This paper reports a 4Mpixel, 3D-stacked backside illuminated Quanta Image Sensor (QIS) with 2.2um pixels that can operate simultaneously in photon-counting mode with deep sub-electron read noise (0.3e- rms) and linear integration mode with large full-well capacity (30K e-). A single-exposure dynamic range of 100dB is realized with this dual-mode readout under room temperature. This QIS device uses a cluster-parallel readout architecture to achieve up to 120fps frame rate at 550mW power consumption.
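The 100 dB single-exposure dynamic range quoted above matches the ratio of the stated full-well capacity to the stated read noise. This is a minimal sketch assuming the conventional definition DR = 20·log10(full well / read noise); the function name is ours.

```python
import math

def dynamic_range_db(full_well_e, read_noise_e):
    """Sensor dynamic range in dB from full-well capacity and read noise (electrons)."""
    return 20.0 * math.log10(full_well_e / read_noise_e)

# Values taken from the abstract: 30K e- full well, 0.3 e- rms read noise.
dr = dynamic_range_db(30_000, 0.3)
print(f"{dr:.1f} dB")  # prints "100.0 dB"
```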

JFS4-6 - 9:30
A 5.1ms Low-Latency Face Detection Imager with In-Memory Charge-Domain Computing of Machine-Learning Classifiers, H. Song*, S. Oh*, J. Salinas Jr.*, S.-Y. Park**, and E. Yoon*, *Univ. of Michigan, USA and **Pusan National Univ., Korea

We present a CMOS imager for low-latency face detection empowered by parallel imaging and computing of machine-learning (ML) classifiers. The energy-efficient parallel operation and multi-scale detection eliminate image capture delay and significantly alleviate backend computational loads. The proposed pixel architecture, composed of dynamic samplers in a global shutter (GS) pixel array, allows for energy-efficient in-memory charge-domain computing of feature extraction and classification. The illumination-invariant detection was realized by using log-Haar features. A prototype 240x240 imager achieved an on-chip face detection latency of 5.1ms with a 97.9% true positive rate and 2% false positive rate at 120fps. Moreover, the dynamic nature of in-memory computing allows an energy efficiency of 418pJ/pixel for feature extraction and classification, leading to the smallest latency-energy product of 3.66ms·nJ/pixel with digital backend processing.

JFS4-7 - 9:40

This paper presents a CMOS LiDAR sensor with high background noise (BGN) immunity. The sensor has on-chip pre-post weighted histogramming to detect only time-correlated time-of-flight (TOF) out of BGN from both sunlight and exponentially increased dark noise while enhancing sensitivity through higher excess voltage (Vex) of SPADs. The sensor also employs a SPAD-based random number generator (SRNG) for canceling interference (IF) from an infinite number of LiDARs. The sensor shows 8.08 cm accuracy for the range of 32 m under high BGN (105 klx sunlight and 48.72 kcps dark-count rate with increased Vex).

JFS4-8 - 9:50

Innovative applications with multiple near-infrared (multi-NIR) spectral CMOS image sensors (CIS) and camera systems have recently been developed. The multi-NIR filter is an indispensable key technology for practical use of the multi-NIR camera system in consumer cameras. Advanced processing technology for multi-NIR signals has been developed using a Fabry-Perot structure. Three types of NIR wavelength filters are formed as a Bayer pattern with 2 × 2 μm² pixel size on a 5-Mpixel BSI-CIS. The thickness differences of the three types of bandpass filters are suppressed to less than 75 nm. To enable high-end applications in surveillance, automobiles, and fundus cameras for health management, signal processing technology has also been developed that processes and mixes each signal of a multi-NIR signal with low-intensity visible light images. This provides good image SNR (Signal-to-Noise Ratio) under low lighting conditions of 0.1 lux or less, allowing changes of state to be easily identified.
SESSION 19
Circuits for Sensing Applications [Room 2]
Saturday, June 19, 8:40-9:20

Chairpersons: T. Tokuda, Tokyo Institute of Technology
P. Mercier, Univ. of California, San Diego

C19-1 - 8:40
A Direct-Digitization Open-Loop Gyroscope Frontend with +/-8000°/s Full-Scale Range and Noise Floor of 0.0047°/s/√Hz,

This paper presents the architecture and techniques used to realize the first direct-digitization open-loop gyroscope frontend, which achieves an input range of +/-8000 degrees per second (4x larger than the state of the art) with a noise floor of 0.0047 degrees per second per square root Hertz, deployed in an inertial measurement unit (IMU) drawing 650 microamperes from a 1.8V supply in gyroscope-only mode.

C19-2 - 8:50
A Self-Powered Wireless Gas Sensor Node Based on Photovoltaic Energy Harvesting,

In this work, we present a compact self-powered wireless gas sensor node based on photovoltaic (PV) energy harvesting (EH). Instead of a bulky and power-hungry gas sensor with separate gas signal processing (GSP) circuits, a mm3-sized colorimetric sensor film is integrated with a PV cell, and the GSP function is seamlessly embedded within EH circuits. Also, a dual-input shared-inductor boost converter is used to improve the EH efficiency under gas exposure. Offset cancellation is performed in GSP circuits to provide accurate gas-sensing readout without any external trimming.

C19-3 - 9:00
A SiPM Readout IC Embedded in a Boost Converter for Mobile Dosimeters,
H. Jeon*, I. Choi*, Y. Kim**, S.-U. Shin*** and M. Je*, *KAIST, **KETI and ***UNIST, Korea

A power- and hardware-efficient radiation detection system is presented for mobile dosimeters. Thanks to the duty quantizer embedded in the boost converter that drives the radiation detector, the radiation signal can be converted to a digital value without implementing separate sensor interface circuits. The prototype IC is implemented in a 0.18-μm BCD process and achieves an integrated input-referred noise of 0.217μA_{rms}, low enough to measure the radiation signal without a complex readout IC.

C19-4 - 9:10
Fully-Digital Self-Calibrating Decoder with Sub-uW, 1.6fJ/convstep and 0.0075mm² per Receptor for Scaling to Human-Like Tactile Sensing Density,
P. Agarwal, V. K. Rajanna, T. W. Da, B. C. K. Tee and M. Alioto, National Univ. of Singapore, Singapore

This work presents an area- and energy-efficient decoder for tactile e-skin sensing encoding to scale up receptor density to the human scale. A fully-digital signal-adaptive receptor interface and event decoder architecture are introduced, leveraging tactile signal sparsity in the temporal and spatial dimension to reduce activity and power at negligible accuracy degradation. A novel reference-less comparator self-calibration is introduced to cancel offset by exploiting the statistical balance of spread-spectrum tactile pulses and noise. A 40nm testchip demonstrates 1.6-fJ/convstep energy (0.0075mm² area) per receptor with 50X (5X) improvement over prior art, and 160-receptor aggregation on a single pad.

SESSION 20
Analog Circuit Techniques [Room 2]
Saturday, June 19, 9:30-10:10

Chairpersons: T. Nezuka, MIRISE Technologies Corp.
S. Ho, MediaTek Inc.

C20-1 - 9:30
A 0.93-μW Single-Stage Rail-to-Rail Class AB Buffer Amplifier Improving DC Gain and Slew-Rate with Different-Ratio Current-Mirrors and Positive-Feedback Loops,
J.-M. Cho*, H.-J. Park* and S.-W. Hong*, *Sogang Univ., Korea

This paper presents a rail-to-rail class AB single-stage buffer amplifier to drive a capacitive load (C_o). To achieve a high DC gain and a high slew-rate (SR), the proposed amplifier uses different-ratio current-mirrors (DRCMs) and positive feedback loops (PFLs). The proposed amplifier drives a wide range of C_o over 0.15-15 nF owing to the single-stage structure. A prototype chip was fabricated in a 0.18-μm CMOS process and occupies an area of 0.014 mm².

C20-2 - 9:40
This paper presents an OLED/μLED display driver IC with a cascaded loading-free capacitive interpolation (LFCI) DAC and a high-slew buffer amplifier. The 12-bit color depth is realized by combining a 7-bit R-DAC with the proposed 5-bit LFCI DAC while occupying only 295x17μm², a 2x reduction compared to the state of the art. In-pixel MSB conversion is also presented to reduce chip size further. The 5V amplifier offers a slew rate of 6.24V/μs at 80pF with a static current of 2μA. The chip, fabricated in a 180-nm process, achieves a measured 0.43LSB (DNL), 0.95LSB (INL), and 7.9mV (DVO).

C20-3 - 9:50
A -121.5 dB THD Class-D Audio Amplifier with 49 dB Suppression of LC Filter Nonlinearity and Robust to +/-30% LC Filter Spread,
H. Zhang*, M. Berkhout**, K. Makinwa* and Q. Fan*, *Delft Univ. of Technology and **Goodix Technology, Netherlands

This paper reports a Class-D audio amplifier that uses multi-loop feedback to suppress output LC filter nonlinearity by 49 dB, enabling the use of small, low-cost LC filters with +/-30% spread while maintaining low distortion. Fabricated in a 180 nm BCD process, the prototype achieves a THD of -121.5 dB and a THD+N of -107.1 dB. It delivers 12W/21W into an 8-Ω/4-Ω load with 91%/87% efficiency.

C20-4 - 10:00
A 650 pW, -71 dB PSRR, 205°C Temperature Range Hybrid Voltage Reference with Curvature-Based Temperature Compensation and SBFL Techniques,
C.-Z. Shao and Y.-T. Liao, National Chiao Tung Univ., Taiwan

This paper presents a 650 pW 1V hybrid voltage reference with curvature-based temperature compensation in a 0.18-μm CMOS process. The design achieves a temperature coefficient of 45 ppm/°C from -55 to 150°C, a line sensitivity of 0.016%/V, and a PSRR of -71 dB at 100 Hz by employing a self-biasing feedback loop.

SESSION 21
Ultra-High-Speed Wireline [Room 3]
Saturday, June 19, 8:40-9:30

Chairpersons: C. P. Yue, The Hong Kong Univ. of Science and Technology
P. Upadhyaya, Xilinx, Inc.

C21-1 - 8:40
A receiver analog front end (AFE) suitable for a 224 Gb/s PAM-4 long-reach ADC-based SerDes receiver is implemented in Intel 10nm FinFET process. The AFE consists of distributed input matching network and a hybrid peaking CTLE followed by a VGA that drives an interleaved ADC used for characterization. The AFE achieves 19dB boost and 11.7dB peak gain at 54GHz while consuming 60mW.

C21-2 - 8:50
This paper describes the design of a 1.24pJ/b 112Gb/s PAM4 transceiver testchip in 7nm FinFET for in-package die-to-die communication. The receiver supports 0-1.2V input common mode and utilizes a single-stage active-inductor-based CMOS CTLE with 12 data slicers and 2 error slicers. The quad-rate voltage-mode transmitter implements delay-based sub-UI two-tap FFE and digital I/Q and DCC clock calibration. A single-phase clock from a wideband LC PLL is distributed to eight transceiver channels. In each channel, an ILO generates eight-phase clocks that feed an 8-bit CMOS PI. The transceiver achieves <1e-12 BER over a 30mm channel at 106.25Gb/s and over a 20mm channel at 112Gb/s.

C21-3 - 9:00
106 Gb/s PAM-4 Transmitter With 2.1 Vpp Swing in 7nm FinFET Process,
H. Sephrhian, S. Aic, P. Madeira and D. Tonietto, Huawei Ottawa Research Center, Canada

This paper demonstrates a high-swing 106Gb/s PAM-4 transmitter in a 7nm FinFET process. The 7-bit DAC-based transmitter is designed to directly drive a range of TOSAs in either single-ended or differential configuration, eliminating the cost and power of an additional laser driver IC. Using a 2.4V supply, it achieves 2.1Vpp differential and 1.05Vpp single-ended swing across a 50Ω load. The energy efficiency of the transmitter was measured to be 2.63 pJ/bit with a 2.4V supply and 2.23 pJ/bit with a 1.5V supply and an external bias-T.
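
As a rough cross-check of the efficiency figures quoted above, energy per bit multiplied by data rate gives average power. The sketch below assumes that both pJ/bit figures are quoted at the 106 Gb/s data rate; the function name `power_mw` is illustrative and does not come from the paper.

```python
def power_mw(energy_pj_per_bit: float, data_rate_gbps: float) -> float:
    """Average power in mW from energy per bit (pJ/bit) and data rate (Gb/s).

    pJ/bit * Gb/s = 1e-12 J/bit * 1e9 bit/s = 1e-3 W, i.e. the product is in mW.
    """
    return energy_pj_per_bit * data_rate_gbps


# Assumption: both efficiency figures apply at 106 Gb/s.
p_2v4 = power_mw(2.63, 106.0)  # 2.4V supply
p_1v5 = power_mw(2.23, 106.0)  # 1.5V supply with external bias-T
print(f"{p_2v4:.1f} mW at 2.4V, {p_1v5:.1f} mW at 1.5V")
```

Under that assumption the transmitter draws roughly 279 mW from the 2.4V supply and roughly 236 mW in the 1.5V configuration.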

C21-4 - 9:10
A 56-Gb/s 8-mW PAM4 CDR/DMUX with High Jitter Tolerance,
G. Hou and B. Razavi, Univ. of California, Los Angeles, USA

An analog one-eighth-rate CDR circuit detects both major and minor transitions in PAM4 data by calculating the Euclidean distances between the sampled points. Realized in 28-nm CMOS technology, the prototype exhibits a jitter transfer bandwidth of 160 MHz and a jitter tolerance of 1 UI at 10 MHz.

C21-5 - 9:20
A 60-Gb/s 1.2-pJ/bit 1/4-Rate PAM4 Receiver with a -8-dB JTRAN 40-MHz 0.2-UIPP JTOL Clock and Data Recovery,
L. Wang*, Z. Zhang**, and C. P. Yue*, *Hong Kong Univ. of Science and Technology and **Chinese Academy of Sciences, China

This paper presents a source-synchronous 60-Gb/s quarter-rate (1/4-rate) PAM4 receiver (Rx) with a jitter compensation clock and data recovery circuit (JCCDR) to overcome the stringent trade-off between jitter transfer (JTRAN) and jitter tolerance bandwidth (JTOL BW). The jitter compensation circuit (JCC) utilizes the delay-locked loop (DLL) filter voltage to produce a complementary control signal \( VLF_{\text{INV}} \), which modulates a group of complementary voltage-controlled delay lines (C-VCDL) so as to negate the JTRANs on the recovered data and clock signals. The proposed 40-nm CMOS Rx test chip achieves error-free operation with PAM4 input from 30 to 60 Gb/s. The JCCDR achieves a 40-MHz JTOL BW with over 0.2-UIPP jitter amplitude while maintaining a -8-dB JTRAN. A jitter compensation ratio of around 60% has been achieved up to 40 MHz.

SESSION 22
Advanced Wireline Techniques [Room 3]
Saturday, June 19, 9:40-10:10

Chairpersons: H. Yamaguchi, Fujitsu Laboratories Ltd.
A. Loke, NXP Semiconductors N.V.

C22-1 - 9:40

A 134 GHz 16 QAM fully-packaged transceiver system for dielectric waveguides with >12 GHz of RF bandwidth, built in 22nm CMOS, achieves a measured EVM of -19.8 dB (~5x10^{-6} BER) over a reach of 3 meters at a 50 Gbps data rate, with a total power consumption of 494 mW from a 1.0 V supply. It achieves a FoM of 3.3 pJ/bit/m with the highest reported data rate at a distance greater than 2 m to date.
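
The quoted figure of merit appears consistent with dividing total power by data rate and by reach. The following minimal sketch reproduces that arithmetic from the numbers in the abstract; the FoM definition used here (power / data rate / distance) is an assumption inferred from the pJ/bit/m units, and the function name is illustrative.

```python
def link_fom_pj_per_bit_per_m(power_w: float, data_rate_bps: float, reach_m: float) -> float:
    """Energy per bit per meter, in pJ/bit/m (assumed FoM definition)."""
    energy_per_bit_j = power_w / data_rate_bps  # joules per bit
    return energy_per_bit_j / reach_m * 1e12    # convert J to pJ, normalize by reach


# Values taken from the abstract: 494 mW, 50 Gbps, 3 m.
fom = link_fom_pj_per_bit_per_m(power_w=0.494, data_rate_bps=50e9, reach_m=3.0)
print(f"{fom:.2f} pJ/bit/m")
```

With these inputs the result rounds to the 3.3 pJ/bit/m stated in the abstract.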

C22-2 - 9:50
A Machine Learning Inspired Transceiver with ISI-Resilient Data Encoding: Hybrid-Ternary Coding + 2-Tap FFE + CTLE + Feature Extraction and Classification for 44.7dB Channel Loss in 7.3pJ/bit,
Z. Wang, M. Megahed, Y. Chun and T. Anand, Oregon State Univ., USA

This paper presents a machine-learning-inspired, energy-efficient transceiver targeting long-reach channels using an ISI-resilient hybrid-ternary encoding on the transmitter and feature extraction and classification on the receiver. In addition to data encoding, the proposed transceiver also employs a 2-tap FFE and CTLE to achieve communication on a 44.7dB loss FR4 channel with BER less than 1e-6, and an energy efficiency of 7.3pJ/bit at 13.8Gb/s in 65nm CMOS.

C22-3 - 10:00

This paper presents a novel 8-ary modulation technique with higher SNR than PAM-8. The proposed modulation (SNR-Enhanced) modulates the pulse width and amplitude to achieve an average SNR improvement of 9.5 dB over PAM-8 in the near-end eye at the cost of an 8.2% reduction in the horizontal eye margin. Using 3-tap FFE and CTLE, the proposed transceiver achieves 1x10^{-7} BER at 9 dB channel loss with an efficiency of 5.39 pJ/bit in a 65 nm CMOS process.

Technology / Circuit and Technology for Quantum Computing [Room 6]
Saturday, June 19, 8:40-9:40

Chairpersons: M. Tada, NanoBridge Semiconductor, Inc.
B. En, Advanced Micro Devices, Inc. (AMD)

JFS5-1 - 8:40 (Invited)

Building quantum computers requires not only a large number of qubits with high fidelity and low variability, but also a large amount of analog and digital components to drive the qubits. Larger arrays of solid-state qubits with high fidelity and low variability require improvements in fabrication processes and array layout design co-optimized with the underlying hardware technology. Here we outline progress on 300mm fabrication of qubit devices and on classical CMOS components to enable the quantum system. We describe work on superconducting qubits and spin qubits in Si, both types of devices fabricated on 300mm experimental platforms, and discuss challenges related to variability. Massive electrical characterization over a wide temperature range is key to enabling system upscaling for QC.

JFS5-2 - 8:50 (Invited)
**Superconducting Quantum Computer: a Hint for Building Architectures**, Y. Tabuchi*, S. Tamate**, and S. Yorozu*, *Riken and **The Univ. of Tokyo, Japan

We discuss the scalability of superconducting quantum computers, focusing on the wiring problem. In the current wiring architecture, the number of wires inside a cryostat is almost proportional to the number of qubits. We introduce "the three Y's" (regularity, modularity, and hierarchy) into the architecture design of superconducting quantum computers. The key to eliminating wiring is found in quantum error correction codes that have thresholds and spatial translational symmetry, i.e., the surface code. We show a superconducting-digital-logic-based architecture and introduce a stacked heterogeneous structure of the quantum module.

JFS5-3 - 9:00

Larger arrays of electron spin qubits require radical improvements in fabrication and device uniformity. Here we demonstrate excellent qubit device uniformity and tunability from 300K down to mK temperatures. This is achieved, for the first time, by integrating an overlapping polycrystalline silicon-based gate stack in an ‘all-Silicon’ and lithographically flexible 300mm flow. Low-disorder Si/SiO$_2$ is proved by a 10K Hall mobility of 1.5x10$^4$ cm$^2$/Vs. Well-controlled sensors with low charge noise (3.6 μeV/Hz$^{0.5}$ at 1 Hz) are used for charge sensing down to the last electron. We demonstrate excellent and reproducible interdot coupling control over nearly 2 decades (2-100 GHz). We show spin manipulation and single-shot spin readout, extracting a valley splitting energy of around 150 μeV. These low-disorder, uniform qubit devices and 300mm fab integration pave the way for fast scale-up to large quantum processors.

JFS5-4 - 9:10

This work presents a qubit controller IC based on direct synthesis. The IC consists of six independently working pulse modulators utilizing the same LO frequency. We propose a sinusoid-shaping nonlinear DAC followed by a linear interpolating DAC to improve both energy and hardware efficiency. The IC, implemented in 40nm CMOS, is verified by superconducting qubit operations with Rabi and Ramsey oscillations while consuming < 1/60 of the power of the previous state of the art.

JFS5-5 - 9:20

We propose, for the first time, a buried nanomagnet (BNM) that realizes high-speed, low-variability silicon spin qubit operation, inspired by buried wiring technology. High-speed quantum-gate operation results from the large slanting magnetic field generated by the BNM placed very close to a spin qubit, and low fidelity variation results from the self-aligned fabrication process. Employing TCAD-based simulation, we demonstrate that the BNM realizes 10 times faster Rabi oscillation (faster spin-flip) than previous works and >99% fidelity under certain process variations. Also, the proposed BNM arrangement is implementable for error-correctable large-scale quantum computers employing a 2D-latticed qubit layout. This technology paves the way to practical large-scale quantum computers with silicon.

JFS5-6 - 9:30

We report first-of-its-kind scalability and tunability of Ge QDs that are controllably sized, closely coupled, and self-aligned with control gates, using a combination of lithographic patterning, spacer technology, and self-assembled growth. The core experimental design is based on the thermal oxidation of poly-SiGe spacer islands designated at each included-angle location of designed Si$_x$N$_{y-c}$Si ridge structures. Multiple Ge QDs with good size tunability of 7-20 nm were controllably achieved by adjusting the process times for deposition, etch back, and thermal oxidation of poly-SiGe spacer islands. Our Ge QD array provides a common platform for engineering diverse QD electronic devices with desired reconfigurability and optimizing their performance.