# A 1 TB/s 1 pJ/b 6.4 mm<sup>2</sup>/TB/s QDR Inductive-Coupling Interface Between 65-nm CMOS Logic and Emulated 100-nm DRAM

Noriyuki Miura, Member, IEEE, Mitsuko Saito, Student Member, IEEE, and Tadahiro Kuroda, Fellow, IEEE

Abstract: A 1 TB/s 1 pJ/b 6.4  $\rm mm^2/TB/s$  inductive-coupling interface between 65-nm complementary metal-oxide-semiconductor (CMOS) logic and emulated 100-nm dynamic random access memory (DRAM) is developed. BER  $< 10^{-16}$  operation is confirmed in 1024-bit parallel links. Compared to the latest wired 40-nm DRAM interface, the bandwidth is increased 32×, and the energy consumption and the layout area are reduced to 1/8 and 1/22, respectively.

Index Terms—High-bandwidth interface, inductive coupling, memory-processor stacking, three-dimensional integration.

## I. INTRODUCTION

IN GRAPHICS applications, such as game consoles and video cards, the required memory bandwidth between the graphics processing unit (GPU) and dynamic random access memory (DRAM) is increasing exponentially, by 10 times every five years, and will reach 1 TB/s (= 8 Tbit/s) around 2014 (Fig. 1). Currently, TB/s DRAM interfaces are intensively studied for future graphics applications [1]-[3]. In a conventional wired-link approach such as DDR, a GPU chip and a DRAM chip are separately packaged and mounted on a printed-circuit board [Fig. 2(a)]. A technical problem of this on-board implementation is poor channel characteristics, caused by the long interconnection length and the need for highly capacitive electrostatic discharge (ESD) protection devices in each chip I/O, which ultimately results in large power and area consumption in the interface. At the Symposium on VLSI Circuits 2008, an emulated 40-nm DRAM wired interface was presented [1]. It uses 65-nm complementary metal-oxide-semiconductor (CMOS) to emulate the interface performance in 40-nm DRAM. However, even with this advanced DRAM technology, huge power (~64 W) and area ( $\sim 144 \text{ mm}^2$ ) will be consumed for 1 TB/s memory bandwidth. The conventional on-board implementation thus faces huge power and area walls.

Three-dimensional (3-D) system integration is one of the key approaches to overcome these problems.

Manuscript received December 30, 2011; revised March 09, 2012; accepted March 24, 2012. Date of publication May 14, 2012; date of current version June 07, 2012. This work was supported by CREST/JST. The VLSI chip in this study has been fabricated in the chip fabrication program of VDEC, the University of Tokyo in collaboration with STARC, e-Shuttle, Inc., and Fujitsu Ltd. This paper was recommended by Guest Editor K. Choi.

The authors are with the Department of Electrical Engineering, Keio University, Yokohama 223-8522, Japan (e-mail: miura@kuroda.elec.keio.ac.jp).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/JETCAS.2012.2193836

Fig. 1. Memory bandwidth trend in graphics application.

Fig. 2. DRAM-GPU system implementation using (a) on-board DDR, (b) micro-bump, (c) TSV, and (d) inductive-coupling link.

GPU and DRAM chips are stacked together in a package and communicate with each other through vertical interconnections. Since the communication distance can be reduced to several tens of micrometers, the channel characteristics can potentially be improved, resulting in a high-speed, low-power interface with a small-area I/O cell. The technical challenge is how to physically form the vertical interconnections. Micro-bump technology [4] is one of the solutions [Fig. 2(b)]. Its drawback is a limitation in stacking direction: it can be applied only to face-to-face chip stacks. A high-performance GPU typically needs to be mounted face-down on a C4 bump package because of its large number of power and signal connections. Through-silicon via (TSV) technology [5], [6] can solve this problem [Fig. 2(c)]. However, TSV causes a large cost increase due to additional fabrication process steps. According to a calculation provided by the TSV promotion community EMC-3D [7], the additional fabrication cost per wafer will be around \$150, which is roughly equivalent to a cost increase of 20¢ per chip. The silicon area penalty for TSV is also not negligible. The current minimum TSV pitch is 50  $\mu m$ [5], [6]. Although the via diameter is around 20  $\mu m$ , transistor circuits cannot be placed in the spacing between TSVs due to TSV process stress, resulting in considerable silicon area overhead. For 1 TB/s data bandwidth, TSVs [5], [6] require an extra silicon area of  $45.5~\mathrm{mm}^2$  to  $102.4~\mathrm{mm}^2$ . In practice, further silicon area will be needed for redundant TSVs and I/O buffers.

Inductive-coupling link technology [8]-[12] is an alternative approach [Fig. 2(d)] in which coupled coils between stacked chips are utilized to form a wireless vertical data link. Since the coil is made of on-chip metal interconnections, no additional process steps are required and hence there is no additional cost. In addition, the silicon area penalty is small. Unlike TSV, active transistor circuits can be placed under the coil and also in the spacing between the coils. Although the channel pitch is relatively large ( $\sim 100 \ \mu \mathrm{m}$ ), the silicon area penalty will eventually be reduced by placing GPU logic and DRAM structures under the coil and in the spacing between the coils. Interference between the circuit and the coil is negligibly small [13]. As for communication performance, the inductive-coupling link is comparable with TSV even though it is based on a wireless communication scheme. Since the interface coil is not exposed to any possible physical contact outside of the chip, the ESD protection devices can be removed, enabling a high-speed, low-power, and small-area interface. At ISSCC 2006, a 0.128 TB/s inductive-coupling interface in 0.18  $\mu m$ CMOS was presented [10]. Its energy per bit is 1/3 and its area per bandwidth is 1/8 of those of the state-of-the-art 40-nm wired DRAM interface [1]. However, even if this inductive-coupling interface is applied for 1 TB/s bandwidth, large power ( $\sim$ 24 W) and area ( $\sim 18.6 \text{ mm}^2$ ) will still be consumed. Further power and area reduction is required in the inductive-coupling interface.

In this work, a 1 TB/s low-power small-area inductive-coupling interface between 65-nm CMOS logic (assumed as GPU) and 100-nm DRAM is presented. The power and area efficiency are both improved by  $3\times$  over the previous interface [10]. This performance improvement is not due to process scaling from 0.18  $\mu m$  to 65-nm CMOS. Typically, a DRAM transistor is about  $4 \times$  slower than a transistor in the same-generation logic process. In a 100-nm DRAM process, the transistor performance is almost equal to that in a 0.18  $\mu m$  logic process. In addition, layout area saving due to shrinkage of the process is not included in the calculation of the area efficiency. The  $3\times$  performance improvement in this work is obtained by the following three circuit techniques: 1) quad data rate (QDR) architecture, 2) injection-lock clock recovery, and 3) NMOS CML inductive-coupling transceiver. This paper extends [14] by adding details of circuit design and chip implementation.

The rest of the paper is organized as follows. Section II overviews the interface architecture and configuration. Section III proposes the three circuit techniques and describes the detailed circuit design. Section IV explains the test chip design, including coil layout optimization. Section V presents measurement results. Finally, Section VI concludes the paper.

Fig. 3. Overview of 1TB/s inductive-coupling interface.

## II. INTERFACE DESIGN

Fig. 3 depicts an overview of the 1 TB/s inductive-coupling interface. A 100-nm DRAM chip is stacked on top of a 65-nm CMOS GPU chip. In each chip, a 32×32 array of 1024 inductive-coupling links is arranged. Each link operates at 8 Gb/s, which yields 1 TB/s aggregated data bandwidth. Each link has a bi-directional capability to provide an uplink from GPU to DRAM for memory write, and a downlink from DRAM to GPU for memory read. In this interface, one of the most important design considerations is the previously mentioned transistor performance gap between 65-nm logic CMOS and 100-nm DRAM, which is typically 4×. In order to provide 8 Gb/s bi-directional capability, a QDR architecture is employed in the DRAM interface. By utilizing a quadrature clock for multiplexing and demultiplexing data, the operating clock frequency can be reduced to 1/4 for the given data rate, which compensates for the 4× transistor performance gap in the DRAM process. Another approach would be to employ a CML-type multiplexer (MUX) and demultiplexer (DMUX) in the DRAM interface. However, this approach would degrade power and area efficiency. In the QDR architecture, since the operating clock frequency is lowered, the MUX and DMUX can be implemented by using static CMOS digital circuitry. As a result, the power and the area efficiency can be significantly improved. The circuit detail of the QDR architecture will be explained in the next section.
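The headline numbers of this overview follow from simple arithmetic, sketched below. The constants are the ones quoted in the text; the function names are illustrative only, and 1 TB is taken as 1000 GB.

```python
# Link-budget arithmetic for the interface overview.
# Constants are quoted from the text; function names are hypothetical.
LINKS = 32 * 32          # 1024 inductive-coupling links
LINK_RATE_GBPS = 8.0     # data rate per link in Gb/s
QDR_PHASES = 4           # a quadrature clock provides four data phases per cycle


def aggregate_bandwidth_tbytes(links, rate_gbps):
    """Aggregate bandwidth in TB/s (8 bits per byte, 1000 GB per TB)."""
    return links * rate_gbps / 8.0 / 1000.0


def qdr_clock_ghz(rate_gbps, phases):
    """Clock frequency the DRAM-side logic must handle for a given data rate."""
    return rate_gbps / phases


bandwidth = aggregate_bandwidth_tbytes(LINKS, LINK_RATE_GBPS)
clock = qdr_clock_ghz(LINK_RATE_GBPS, QDR_PHASES)
print(LINKS, bandwidth, clock)
```

The 2 GHz result is the clock frequency that the 4× slower DRAM transistors are expected to handle with static CMOS logic.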

## III. CIRCUIT DESIGN

### A. QDR Architecture

Fig. 4 depicts the block diagram of the QDR inductive-coupling interface. In the GPU chip, CMOS-based MUX/DMUX can provide 8-GHz bandwidth. Therefore, an 8-Gb/s data stream can be directly generated and delivered to the inductive-coupling transceiver by using an 8-GHz clock. In the DRAM chip, on the other hand, CMOS-based MUX/DMUX can only manage up to 2-GHz bandwidth due to the 4× transistor performance gap. For multiplexing and demultiplexing the 8-Gb/s data stream, a 2-GHz quadrature clock is used. In the DRAM transmitter, buffered data are retrieved using the 2-GHz quadrature clock. Four 2-Gb/s streams *Mtxdm*[3:0] are generated with 90° phase offset. A first 4:2 MUX multiplexes *Mtxdm*[3:0] into 4 Gb/s *Txdm*0 and *Txdm*1. *Txdm*0 and *Txdm*1 are finally multiplexed into the 8-Gb/s transmit current  $I_{\rm TM}$  by a DRAM inductive-coupling transmitter Txm. The transceiver frontend including Txm is implemented by using CML, so the 8-Gb/s data stream can be managed even with the slow DRAM devices. In the DRAM receiver, a DRAM inductive-coupling receiver Rxm recovers the 8 Gb/s data and demultiplexes it into 4 Gb/s Rxdm0 and Rxdm1. A succeeding 2:4 DMUX demultiplexes them into 2 Gb/s Mrxdm[3:0]. A clock recovery circuit recovers a synchronous clock from each Mrxdm and stores the data in the buffer.

Fig. 4. Block diagram of QDR inductive-coupling interface.

Fig. 5. Circuit diagram and operating waveforms of clockless XOR-based multiplexer and demultiplexer.

As mentioned in the previous section, the MUX and DMUX are implemented by using static CMOS logic for power and area reduction. For further reduction in power and area, clockless MUX and DMUX based on XOR logic are proposed. Fig. 5 describes the circuit diagrams and the operating waveforms of the XOR-based MUX and DMUX. By utilizing the XOR logic and 180° phase offset, data can be multiplexed and demultiplexed without using a high-frequency clock. No precise timing control between a high-frequency clock and data is required, resulting in low power and small area compared to the conventional QDR MUX/DMUX in [1]. The 2:1 MUX consists of just a single XOR logic gate. 2 Gb/s Mtxdm0 and Mtxdm2 are retrieved from the buffer with 180° phase offset. By taking the XOR of them, the output Txdm0 toggles at every input transition. As a result, the 2 Gb/s input data are multiplexed into 4 Gb/s serial data. Demultiplexing is performed in the opposite way using two toggle dividers, one operating at the data rising edge and the other at the falling edge. 4 Gb/s Rxdm0 is demultiplexed to 2 Gb/s Mrxdm0 and Mrxdm2.
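The property that the XOR output toggles at every input transition can be illustrated with a small behavioral sketch. It is a slot-based model under stated assumptions, not the circuit of Fig. 5: each half-rate bit is held for two time slots, the second stream is delayed by one slot to represent the 180° phase offset, and the idle level before the first delayed bit is assumed to be 0. The function names are hypothetical, and the toggle-divider demultiplexer is not modeled.

```python
import random


def hold(bits, offset, n_slots):
    """Sample a half-rate bit stream on a slot grid (two slots per bit).

    `offset` delays the stream by that many slots; slots before the first
    bit are filled with 0 (an assumed idle level).
    """
    wave = [0] * offset
    for bit in bits:
        wave.extend([bit, bit])
    return wave[:n_slots]


def count_transitions(wave):
    """Count level changes between consecutive slots."""
    return sum(1 for a, b in zip(wave, wave[1:]) if a != b)


def xor_mux(stream_a, stream_b):
    """XOR two half-rate streams that are offset by half a bit period."""
    n_slots = 2 * min(len(stream_a), len(stream_b))
    wave_a = hold(stream_a, 0, n_slots)
    wave_b = hold(stream_b, 1, n_slots)  # one slot = 180 degrees of the bit period
    out = [a ^ b for a, b in zip(wave_a, wave_b)]
    return wave_a, wave_b, out


rng = random.Random(0)
a = [rng.randint(0, 1) for _ in range(64)]
b = [rng.randint(0, 1) for _ in range(64)]
wave_a, wave_b, out = xor_mux(a, b)

# Every input transition appears as exactly one output transition.
print(count_transitions(wave_a) + count_transitions(wave_b), count_transitions(out))
```

Because the two inputs can only change on alternating slot boundaries, their transitions never coincide, so the output carries one possible transition per slot, which is twice the input rate.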



Fig. 6. Circuit diagram and operating waveform of injection-locking VCO for clock recovery.



Fig. 7. Coarse frequency adjustment using replica PLL.

### B. Injection-Lock Clock Recovery

In our QDR architecture with the XOR-based MUX/DMUX (Fig. 4), precise timing control between clock and data is not needed at high data rates (>4 Gb/s), resulting in power and area reduction. Timing control is only required when sampling the 2 Gb/s *Mrxdm*[3:0] at the buffer. The sampling clocks are recovered from each received data stream by using injection-lock clock recovery. The clock transceiver and clock distribution circuits required in [8]-[10] are removed, which halves the power dissipation and layout area.

Fig. 6 depicts the circuit diagram of the clock recovery circuit, in which an injection-lock VCO (IVCO) is employed. The IVCO consists of an edge detector and a gated VCO core. The edge detector generates an injection pulse Inj at every data edge. The Inj pulse resets the VCO clock oscillation at every data edge to align the clock falling edge to the data edge. The clock rising edge is used to sample the data. In order to align the rising edge to the data center, the Inj pulse width is adjusted to half the clock cycle by using a replica of the VCO core as a delay line in the injection pulse generator. The free-running oscillation frequency  $f_{\rm OSC}$  of the IVCOs is coarsely adjusted to the target frequency of 2 GHz by  $V_{\rm CTRL}$ .  $V_{\rm CTRL}$  is generated by a PLL with another replica of the VCO core (Fig. 7). Fine adjustment in frequency and phase is performed by the data edge pulse injection.

Fig. 8. Circuit diagram of (a) proposed NMOS CML and (b) conventional CMOS inductive-coupling transceiver.

As shown in Fig. 6, the injection-lock clock recovery circuit is very simple. It requires only several digital logic gates. In addition, since the operating frequency is reduced to 2 GHz in the QDR architecture, all the logic gates are implemented in static CMOS logic. The power and area penalty caused by the clock recovery circuits is negligibly small in the overall interface. However, in order to keep synchronization, periodic pulse injection is needed in an actual implementation. One dummy bit must be inserted after every eight data bits, so the effective data rate is degraded to 89% (= 8/9).
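The overhead of the periodic dummy bit can be worked out as follows. This is a minimal sketch of the 8/9 arithmetic only; the function name and its default arguments are illustrative, and the 8 Gb/s line rate is the per-link figure from the text.

```python
def effective_rate(line_rate_gbps, data_bits=8, dummy_bits=1):
    """Payload rate when `dummy_bits` are inserted after every `data_bits`."""
    return line_rate_gbps * data_bits / (data_bits + dummy_bits)


payload_fraction = effective_rate(1.0)   # fraction of the line rate carrying data
payload_per_link = effective_rate(8.0)   # Gb/s of payload on an 8 Gb/s link
print(round(payload_fraction, 3), round(payload_per_link, 2))
```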

### C. NMOS CML Inductive-Coupling Transceiver

The interface frontend is implemented in NMOS CML for high-speed 8 Gb/s operation. Fig. 8(a) depicts the proposed NMOS CML inductive-coupling transceiver. Unlike a conventional transceiver [8] [Fig. 8(b)], only NMOS is used. The absence of PMOS enables 8 Gb/s operation even with slow DRAM transistors. The transmitter coil is center-tapped to the supply voltage  $V_{DD}$  and driven by an NMOS differential pair. Txdata is directly converted into the transmit current  $I_T$ . A positive or negative pulse-shaped voltage is induced in the receiver coil at every data transition. Unlike an NMOS single-ended pull-down transmitter, the supply current is kept constant, which significantly reduces supply noise for reliable operation. The receiver consists of a hysteresis latch using NMOS transistors and poly resistors. The hysteresis latch consists of a gain stage (differential pair) and a latch circuit (cross-coupled NMOS). The gain stage amplifies the  $V_R$  pulse and drives the succeeding latch to switch and recover *Rxdata*. According to *Rxdata*, the latch stage modulates the threshold voltage of the differential pair in the gain stage. A broken line in Fig. 8 denotes the modulated threshold voltage of the differential pair, namely  $V_{\rm TH}$ . For example, when Rxdata is low,  $V_{TH}$  increases to  $+V_H$  ( $V_H$  is the so-called hysteresis width). When the input pulse voltage exceeds  $+V_H$  due to the positive pulse  $V_R$ , Rxdata switches to high. The latch circuit then shifts  $V_{TH}$  to  $-V_H$  and holds Rxdata high until the negative pulse voltage  $V_R$  is applied to the input. By repeating this operation, digital data are correctly recovered from the pulse voltages.
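The data recovery described above can be captured in a short behavioral sketch. It is an idealized model, not the transistor-level circuit of Fig. 8: one voltage sample per bit, a pulse of fixed amplitude at each data transition, and a threshold that flips between $+V_H$ and $-V_H$ depending on the recovered bit. The function names, the uniform noise model, and the 40 mV noise bound are assumptions; the 160 mV pulse amplitude and 50 mV hysteresis are values quoted elsewhere in this paper.

```python
import random


def transmit_pulses(bits, amplitude, initial=0):
    """One sample per bit: +amplitude on 0->1, -amplitude on 1->0, else 0."""
    prev = initial
    pulses = []
    for bit in bits:
        pulses.append(amplitude * (bit - prev))
        prev = bit
    return pulses


def hysteresis_receive(v_r_samples, v_h, initial=0):
    """Recover NRZ data from bipolar pulses with a hysteresis threshold."""
    rx = initial
    recovered = []
    for v in v_r_samples:
        if rx == 0 and v > v_h:       # threshold sits at +V_H while output is low
            rx = 1
        elif rx == 1 and v < -v_h:    # threshold sits at -V_H while output is high
            rx = 0
        recovered.append(rx)
    return recovered


rng = random.Random(1)
tx_bits = [rng.randint(0, 1) for _ in range(1000)]
V_P = 0.160      # pulse amplitude in volts
V_H = 0.050      # hysteresis width in volts
NOISE = 0.040    # assumed noise bound in volts, below V_H

noisy = [v + rng.uniform(-NOISE, NOISE) for v in transmit_pulses(tx_bits, V_P)]
rx_bits = hysteresis_receive(noisy, V_H)
print(rx_bits == tx_bits)
```

Noise smaller than $V_H$ cannot flip the output between transitions, which is the role the hysteresis plays in rejecting crosstalk.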

Using only NMOS, rather than both PMOS and NMOS, improves the circuit operation margin against PVT variations. The margin is further improved by using the adaptive bias control (ABC) shown in Fig. 9. The tail transistor gate bias  $V_{\rm TAIL}$  is controlled by a constant current  $I_{\rm TAIL}$  which is generated by a bandgap reference circuit. This bias control keeps the  $I_T$  amplitude constant and hence  $V_R$  constant. The receiver input common mode is adaptively generated by the receiver replica. Together with the  $V_{\rm TAIL}$  control, this keeps the receiver sensitivity constant. This strong tolerance against PVT variation allows highly reliable operation of the 1024 links, which are distributed across a large chip area. It also enables power reduction in the transceiver. Simulated results are presented in Fig. 9.



Fig. 9. Adaptive bias control circuit and simulated PVT variation in  $dI_T/dt$ , receiver hysteresis, and power dissipation.



Fig. 10. Transceiver circuit for (a) uplink and (b) downlink

The receiver hysteresis width  $V_H$  needs to be larger than 50 mV for noise margin. In a conventional receiver [8] that uses both PMOS and NMOS without ABC,  $V_H$  varies from 50 to 180 mV. To provide adequate margin for the worst case ( $V_H = 180 \, \mathrm{mV}$ ), the received voltage  $V_R$  should be higher than 240 mV. In the proposed NMOS CML receiver with ABC, on the other hand,  $V_H$  varies from 50 to 100 mV over the same PVT variations. The required amplitude of  $V_R$  is reduced to 160 mV, reducing transmit current  $I_T$ . Furthermore, in the proposed transmitter, the PVT variation in  $I_T$  itself is also suppressed. In total, power dissipation in the transceiver is reduced to half.
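The required amplitudes quoted above imply the same headroom over the worst-case hysteresis in both designs, as the short calculation below shows. The helper name is illustrative. The last line assumes that $I_T$ scales in proportion to $V_R$ for a fixed coil and pulse width, which is a simplification made for this sketch.

```python
def headroom_mv(v_r_mv, v_h_worst_mv):
    """Margin between the received amplitude and the worst-case hysteresis."""
    return v_r_mv - v_h_worst_mv


conventional = headroom_mv(240, 180)   # conventional receiver [8] without ABC
proposed = headroom_mv(160, 100)       # NMOS CML receiver with ABC
amplitude_ratio = 160 / 240            # assumed to equal the I_T ratio
print(conventional, proposed, round(amplitude_ratio, 2))
```

The remaining reduction toward the stated factor of two is attributed in the text to the suppressed PVT variation of $I_T$ itself.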

The transceiver circuit in the DRAM is slightly modified from the standard NMOS CML transceiver in Fig. 8(a): the DRAM inductive-coupling transceiver contains the MUX/DMUX function. Fig. 10(a) shows the DRAM receiver Rxm, which includes a 1:2 DMUX formed by parallel toggle dividers operating at 180° phase offset. Using the differential output of the receiver core, 8 Gb/s Rxdm is demultiplexed to 4 Gb/s Rxdm0 and Rxdm1. Fig. 10(b) shows the DRAM transmitter Txm, which includes the 2:1 MUX function. The XOR logic is embedded into Txm using a CML network.



Fig. 11. Detail circuit schematic of DRAM transmitter and simulated eye patterns. (a) Before modification. (b) After modification.



Fig. 12. Equivalent circuit model of inductive-coupling channel.

The number of CML stages is reduced for low-power operation. Although the number of stacked transistors is increased in Txm, the NMOS CML enables 8 Gb/s operation even with slow DRAM devices. By symmetrically connecting the stacked NMOS transistors as shown in Fig. 11, data-pattern-dependent deterministic jitter can be eliminated, resulting in a wider eye opening and hence higher-speed operation.

### D. Inductive-Coupling Channel Design

This section presents a simple design guideline to quickly derive initial design parameters of the coil. The final specific design parameters should be determined based on iterative simulations by using a field solver and a circuit simulator.

Fig. 12 depicts an equivalent circuit model of an inductive-coupling channel. The received voltage  $V_R$  is given by

$$V_R = \frac{1}{(1 - \omega^2 L_R C_R) + j\omega R_R C_R} \cdot j\omega M \cdot \frac{1}{(1 - \omega^2 L_T C_T) + j\omega R_T C_T} \cdot I_T \tag{1}$$

where L is self inductance, C is parasitic capacitance, R is parasitic resistance of the coil (subscripts T and R denote that the parameter belongs to the transmitter and the receiver, respectively), and M is mutual inductance between the coils. In (1), the second term  $j\omega M$  expresses the ideal inductive-coupling channel response where  $V_R = MdI_T/dt$.

Fig. 13. Characteristics of transmit current and received voltage in (a) time and (b) frequency domain.

The ideal inductive-coupling channel functions as a transimpedance with a first-order differentiator, so that the channel response is proportional to the frequency. The first and the third terms denote the bandwidth limitation due to CR parasitics of the on-chip coils. Each appears as a second-order low-pass filter with peaking at the LC self-resonant frequency of the coil  $f_{\rm SR}$ . Due to this bandwidth limitation, the actual inductive-coupling channel exhibits a band-pass filter characteristic with the peak frequency at  $f_{\rm SR}$ . The actual channel response departs from the ideal channel response due to the peaking at around  $f_{\rm SR}$ . This peaking distorts the signal frequency spectrum and causes ringing and hence inter-symbol interference (ISI), resulting in bit error rate (BER) degradation. The actual inductive-coupling channel bandwidth  $f_{\rm CH}$  can be seen at around  $f_{\rm SR}/2$ , which is given by

$$f_{\rm CH} = \frac{f_{\rm SR}}{2} = \frac{1}{4\pi\sqrt{LC}}. \tag{2}$$
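The behavior of (1) and (2) can be explored numerically with the sketch below. The coil values (5 nH, 50 fF, 50 Ω, k = 0.2) are hypothetical and chosen only for illustration; they are not the parameters of the test chip. The function name is also illustrative, and identical transmitter and receiver coils are assumed.

```python
import math


def channel_transimpedance(f, L_t, C_t, R_t, L_r, C_r, R_r, M):
    """V_R / I_T from (1) at frequency f in Hz (complex value in ohms)."""
    w = 2 * math.pi * f
    rx = 1 / complex(1 - w**2 * L_r * C_r, w * R_r * C_r)
    tx = 1 / complex(1 - w**2 * L_t * C_t, w * R_t * C_t)
    return rx * (1j * w * M) * tx


# Hypothetical coil parameters (assumptions, identical for Tx and Rx).
L = 5e-9       # self inductance in henries
C = 50e-15     # parasitic capacitance in farads
R = 50.0       # parasitic resistance in ohms
K = 0.2        # coupling coefficient
M = K * math.sqrt(L * L)

f_sr = 1 / (2 * math.pi * math.sqrt(L * C))   # self-resonant frequency
f_ch = f_sr / 2                               # channel bandwidth from (2)

freqs = [i * 50e6 for i in range(1, 401)]     # 50 MHz to 20 GHz
gains = [abs(channel_transimpedance(f, L, C, R, L, C, R, M)) for f in freqs]
f_peak = freqs[gains.index(max(gains))]

print(round(f_sr / 1e9, 2), round(f_ch / 1e9, 2), round(f_peak / 1e9, 2))
```

With these values the response follows $\omega M$ at low frequency and peaks close to $f_{\rm SR}$, which is the band-pass shape described above.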

Characteristics of the transmit current  $I_T$  and the received voltage  $V_R$  are summarized in Fig. 13.  $V_R$  can be modeled as a Gaussian pulse

$$V_R(t) = V_P \exp\left[-\frac{4t^2}{\tau^2}\right] \tag{3}$$

where  $V_P$  is pulse amplitude and  $\tau$  is pulse width.  $I_T$  is given as an integral form of  $V_R$  and therefore similar to a step pulse. Assuming the step pulse amplitude is  $I_P$ , the  $V_R$  amplitude  $V_P$  is given by

$$V_P = \frac{4}{\sqrt{\pi}} M \frac{I_P}{\tau} \approx 2.3 M \frac{I_P}{\tau}. \tag{4}$$

Equation (4) shows that the  $V_R$  amplitude is proportional to the transmit current slew rate  $(S_P = I_P/\tau)$ . The minimum  $\tau$  is restricted by device performance.  $I_P$  can be increased by enlarging the transistor size, but it is ultimately restricted by the power budget, as the transmit power is given by  $I_PV_{DD}$ . M is, therefore, an important parameter for providing sufficient signal power to the receiver. M is rewritten as

$$M = k\sqrt{L_T L_R} \tag{5}$$

where k is a coupling coefficient which denotes how much magnetic flux is coupled between the transmitter and the receiver coils.

Fig. 14. Calculated coupling coefficient dependence on communication distance and coil diameter.

Since the self inductances  $L_T$  and  $L_R$  are limited by the channel bandwidth requirement, k finally governs the channel gain. k for multi-turn square coupled coils is approximately given by [15]

$$k = \left\{ \frac{0.25 D_{\text{eff},T} D_{\text{eff},R}}{X^2 + 0.25 D_{\text{eff},R}^2} \right\}^{1.5}, \quad (D_{\text{eff},R} \ge D_{\text{eff},T}) \tag{6}$$

$$k = \left\{ \frac{0.25 D_{\text{eff},T} D_{\text{eff},R}}{X^2 + 0.25 D_{\text{eff},T}^2} \right\}^{1.5}, \quad (D_{\text{eff},R} < D_{\text{eff},T}) \tag{7}$$

where  $X$  is the communication distance between the coils and  $D_{\rm eff}$  is an effective coil diameter (subscripts T and R denote transmitter and receiver, respectively).  $D_{\rm eff}$  is given by

$$D_{\text{eff}} = \frac{D_{\text{outer}} + D_{\text{inner}}}{2}. \tag{8}$$

 $D_{\rm outer}$  and  $D_{\rm inner}$  are the outer and the inner diameter of the coil, respectively. For further simplicity and better understanding of k, (6)-(8) are further approximated by assuming that both the transmitter and the receiver coils have a single wire turn ( $D_{\rm eff} = D_{\rm outer} = D_{\rm inner} = D$ ) and also have the same diameter ( $D_{{\rm eff},T} = D_{{\rm eff},R} = D$ ). That is

$$k = \left\{ \frac{0.25}{(X/D)^2 + 0.25} \right\}^{1.5}. \tag{9}$$
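Equations (6), (7), and (9) are easy to evaluate directly. The sketch below implements them as two small functions whose names are illustrative; the diameters and distance passed in the example calls are arbitrary test values in consistent units, not measured coil dimensions.

```python
def coupling_coefficient(x, d_eff_t, d_eff_r):
    """Approximate k from (6)/(7).

    The larger of the two effective diameters appears in the denominator,
    which covers both cases with one expression.
    """
    d_max = max(d_eff_t, d_eff_r)
    return (0.25 * d_eff_t * d_eff_r / (x**2 + 0.25 * d_max**2)) ** 1.5


def coupling_coefficient_equal(x_over_d):
    """Simplified k from (9) for equal-diameter single-turn coils."""
    return (0.25 / (x_over_d**2 + 0.25)) ** 1.5


for ratio in (1 / 3, 1.0, 3.0, 10.0):
    print(round(ratio, 3), round(coupling_coefficient_equal(ratio), 5))
```

Under (9), k is about 0.58 at X/D = 1/3 and about 0.09 at X/D = 1. For X/D much larger than 1 it approaches $0.125\,(D/X)^3$, which is the cubic attenuation seen in Fig. 14.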

Fig. 14 plots the calculated results. When X is longer than D (X/D>1), k is strongly attenuated, in proportion to the cube of D/X. For reliable data communication, X/D should be designed to be around 1/3, where k attenuates only about linearly with X/D. The number of wire turns of the coil n is the second most important design parameter. It determines the self inductance and hence the channel bandwidth  $f_{\rm CH}$ . n is finally decided based on the  $f_{\rm CH}$  requirement. The required channel bandwidth is determined by the signal frequency spectrum. The frequency spectrum of  $V_R$  is also given by a Gaussian distribution

$$|V_R(\omega)| = \frac{\sqrt{\pi}\tau V_P}{2} \exp\left(-\frac{\omega^2 \tau^2}{16}\right). \tag{10}$$

To suppress signal distortion in  $V_R$  pulse properly, frequency components with > 1/e of peak power should be delivered.



Fig. 15. Microphotograph of stacked test chips.

Therefore,  $f_{\rm CH}$  should be designed to meet the following condition:

$$f_{\rm CH} > \frac{2}{\pi \tau} \approx \frac{0.64}{\tau}.\tag{11}$$

The number of coil turns n is increased as far as possible while  $f_{\rm CH}$  still meets the channel bandwidth requirement.
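The Gaussian pulse model of (3), (4), (10), and (11) can be checked numerically with the sketch below. The numerical values of $\tau$, M, and $I_P$ are hypothetical and chosen only for illustration, and all function names are illustrative.

```python
import math


def pulse_amplitude(M, I_p, tau):
    """V_P from (4)."""
    return 4 / math.sqrt(math.pi) * M * I_p / tau


def spectrum(f, V_p, tau):
    """|V_R| from (10) at frequency f in Hz."""
    w = 2 * math.pi * f
    return math.sqrt(math.pi) * tau * V_p / 2 * math.exp(-(w * tau) ** 2 / 16)


def min_channel_bandwidth(tau):
    """Lower bound on f_CH from (11)."""
    return 2 / (math.pi * tau)


def max_lc_product(f_ch):
    """Largest L*C product allowed by (2) for a given f_CH."""
    return 1 / (4 * math.pi * f_ch) ** 2


def integrate_pulse(V_p, tau, steps=20000):
    """Trapezoidal integral of the Gaussian pulse (3) over [-5*tau, 5*tau]."""
    t0, t1 = -5 * tau, 5 * tau
    h = (t1 - t0) / steps
    total = 0.0
    for i in range(steps + 1):
        t = t0 + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * V_p * math.exp(-4 * t * t / (tau * tau))
    return total * h


# Hypothetical values (assumptions for illustration only).
TAU = 60e-12    # pulse width in seconds
M = 1e-9        # mutual inductance in henries
I_P = 5e-3      # current step amplitude in amperes

v_p = pulse_amplitude(M, I_P, TAU)
f_min = min_channel_bandwidth(TAU)
ratio_at_f_min = spectrum(f_min, v_p, TAU) / spectrum(0.0, v_p, TAU)
delta_i = integrate_pulse(v_p, TAU) / M   # total change in I_T implied by V_R

print(round(v_p, 3), round(f_min / 1e9, 2), round(ratio_at_f_min, 4))
```

At the bound of (11) the spectrum of (10) has dropped to 1/e of its peak, as stated. Integrating (3) with $V_P$ from (4) yields a total current change of $2I_P$, which is consistent with $I_T$ swinging between $-I_P$ and $+I_P$; that reading is an interpretation and is not stated explicitly in the text.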

## IV. TEST CHIP DESIGN

Fig. 15 shows a microphotograph of the stacked test chips and a layout snapshot of the GPU and the DRAM transceivers. Both test chips are fabricated in a 65-nm CMOS logic process. The DRAM transistor performance of a 100-nm DRAM process is emulated by enlarging the channel length from 60 to 150 nm. Only four metal layers are used for the DRAM transceiver layout, as in a typical DRAM process [16], although 12 metal layers are available in the 65-nm CMOS logic process. The DRAM chip is thinned to  $20 \,\mu\mathrm{m}$  and stacked over the GPU chip, both face-up, by using an adhesive 5  $\mu m$  thick. The communication distance between the coils is around 20  $\mu m$  (less than 25  $\mu m$ ), because M4 is used for the DRAM transceiver coil while M9 is used for the GPU coil. The transmitter (Tx) coil is placed concentrically inside the receiver (Rx) coil. The Tx coil is 60  $\mu m$  in diameter and the Rx coil is 79  $\mu m$ , to cover the typical operation range  $(X/D \sim 1/3)$ . The number of coil turns is 8 for both Tx and Rx. A  $32 \times 32$  array of 1024 transceivers is arranged with a pitch of 110  $\mu m$ . The alignment error is within  $\pm 5 \,\mu\mathrm{m}$ , which is small enough to keep strong coupling between the stacked coils [17]. Aggregated crosstalk between channels is about 10% of the signal amplitude even in the worst case [18]. The crosstalk can be rejected by the receiver hysteresis, which is higher than 50 mV.
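A few of the geometric figures above can be cross-checked with simple arithmetic. The sketch uses the dimensions quoted in the text. The 160 mV signal amplitude used for the crosstalk estimate is an assumption borrowed from the required $V_R$ in Section III-C, and the array side is approximated as pitch times the number of links per side.

```python
# Dimensions quoted in the text, in micrometers.
X_UM = 20.0        # communication distance
D_TX_UM = 60.0     # Tx coil diameter
D_RX_UM = 79.0     # Rx coil diameter
PITCH_UM = 110.0   # transceiver pitch
LINKS_PER_SIDE = 32

x_over_d_tx = X_UM / D_TX_UM
x_over_d_rx = X_UM / D_RX_UM
array_side_mm = LINKS_PER_SIDE * PITCH_UM / 1000.0   # approximate array span
coil_gap_um = PITCH_UM - D_RX_UM                     # spacing between Rx coils

# Crosstalk estimate, assuming a 160 mV signal amplitude (an assumption).
SIGNAL_MV = 160.0
crosstalk_mv = 0.10 * SIGNAL_MV
HYSTERESIS_MV = 50.0

print(round(x_over_d_tx, 3), round(x_over_d_rx, 3), array_side_mm, coil_gap_um)
print(crosstalk_mv, crosstalk_mv < HYSTERESIS_MV)
```

Under this assumption the worst-case crosstalk stays well below the 50 mV hysteresis.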

## V. MEASUREMENT RESULTS

Fig. 16 presents the measured BER dependence on data rate. The uplink and the downlink transceivers both successfully operate at data rates higher than the target data rate of 8 Gb/s with BER less than  $10^{-13}$ . Figs. 17 and 18 present the measured data rate dependence on supply voltages for BER  $< 10^{-13}$ . In both the uplink and the downlink, 8 Gb/s operation is achieved over supply voltage variations larger than  $\pm 10\%$ . It is confirmed that the NMOS CML transceiver with the ABC provides strong immunity against supply voltage variations. The immunity is further increased when the data rate is reduced. At the nominal voltage of 1.1 V, BER  $< 10^{-13}$  operation is confirmed for up to 9 Gb/s in the uplink and 8.5 Gb/s in the downlink. Even if the data rate fluctuates from 7.6 to 8.4 Gb/s due to the variation in the VCO, both links can operate at BER  $< 10^{-13}$ . Fig. 19 shows a measured waveform and jitter of the clock recovered by the IVCO. The measured timing jitter of the recovered clock relative to the received data is less than 6  $\mathrm{ps_{rms}}$  (<5% U.I.), which is negligibly small. The 1024 parallel transceivers are tested by using on-chip random pattern generators and error checkers. BER  $< 10^{-16}$  operation is confirmed.

Fig. 16. Measured BER dependence on data rate.

Fig. 17. Measured data rate dependence on supply voltage in uplink.

The chip performance is summarized in Table I and compared to the latest wired 40 nm DRAM interface [1]. The layout area of the inductive-coupling interface is calculated as the total coil footprint ( $79~\mu\mathrm{m} \times 79~\mu\mathrm{m} \times 1024$  channels + PLL area  $= 6.5~\mathrm{mm}^2$ ). At the least, the spacing between the coils can be utilized to place the GPU logic and the DRAM structure (it is also possible to utilize the free space under the coils). The NMOS CML transceiver circuit with the ABC improves immunity against PVT variations, which reduces the design margin. As a result, the energy per bit is reduced to 1/8 of [1]. In addition, the area per bandwidth is reduced to 1/22 of [1]. This is achieved by increasing the data rate using the quadrature clocking with the XOR-based MUX/DMUX, and by reducing area through the use of the injection-lock clock recovery. According to a simulation study in 40 nm DRAM, the same process as in [1], the data rate will be 24 Gb/s/link  $(1.5\times)$ , the energy per bit will be 0.3 pJ/b (1/27), and the area per bandwidth will be  $2.08 \, \mathrm{mm^2/TB/s} \, (1/66)$ . The test chips in this work are stacked face-up due to a limitation of our test equipment. In the case of a back-to-back chip stack [as in Fig. 2(d)], the communication distance and hence the coil diameter will be doubled, so that the coil footprint will be increased four times. Even so, the inductive-coupling link still has an advantage in area efficiency.

Fig. 18. Measured data rate dependence on supply voltage in downlink.

Fig. 19. Measured waveform and jitter of recovered clock.

TABLE I. PERFORMANCE SUMMARY AND COMPARISON

|                      | This Work                            | Previous Work [1]                |
|----------------------|--------------------------------------|----------------------------------|
| Aggregated Bandwidth | 1 TB/s (32)                          | 0.032 TB/s (1)                   |
| Data Rate            | 8 Gb/s/Link                          | 16 Gb/s/Link                     |
| Number of Links      | 1024                                 | 16                               |
| Layout Area          | 6.5 mm<sup>2</sup>                   | 4.4 mm<sup>2</sup>               |
| Area/Bandwidth       | 6.4 mm<sup>2</sup>/TB/s (1/22)       | 137.6 mm<sup>2</sup>/TB/s (1)    |
| Power Dissipation    | 8 W                                  | 2 W                              |
| Energy/bit           | 1 pJ/b (1/8)                         | 8 pJ/b (1)                       |
| BER                  | <10<sup>-16</sup>                    | <10<sup>-15</sup>                |
| Process              | 65 nm CMOS and emulated 100 nm DRAM  | Emulated 40 nm DRAM              |
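The derived entries of Table I can be reproduced from the raw figures, as sketched below. All inputs are values quoted in the table and the text; the variable names are illustrative, and 1 TB is taken as 1000 GB. Small differences from the tabulated values (for example 137.5 versus 137.6 mm<sup>2</sup>/TB/s) are within rounding.

```python
# This work (values from Table I and the text).
links = 1024
rate_gbps = 8.0
area_mm2 = 6.5                                  # coil footprint plus PLL area
power_w = 8.0
coil_footprint_mm2 = 79 * 79 * links / 1e6      # coils only, without the PLL

bits_per_s = links * rate_gbps * 1e9
bandwidth_tbytes = bits_per_s / 8 / 1e12
area_per_bw = area_mm2 / bandwidth_tbytes       # mm^2 per TB/s
energy_pj = power_w / bits_per_s * 1e12         # pJ per bit

# Previous work [1] (values from Table I).
ref_bits_per_s = 16 * 16.0 * 1e9
ref_bandwidth_tbytes = ref_bits_per_s / 8 / 1e12
ref_area_per_bw = 4.4 / ref_bandwidth_tbytes
ref_energy_pj = 2.0 / ref_bits_per_s * 1e12

print(round(coil_footprint_mm2, 2), round(area_per_bw, 2), round(energy_pj, 2))
print(round(bandwidth_tbytes / ref_bandwidth_tbytes, 1),
      round(ref_area_per_bw / area_per_bw, 1),
      round(ref_energy_pj / energy_pj, 1))
```

The three ratios come out at 32 for bandwidth, roughly 22 for area per bandwidth, and 8 for energy per bit, matching the normalized figures in parentheses in the table.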

## VI. CONCLUSION

A 1 TB/s low-power small-area inductive-coupling interface between a 65-nm CMOS GPU and 100-nm DRAM is developed. A QDR architecture compensates for the performance gap between 65-nm CMOS and 100-nm DRAM, which enables 8 Gb/s/link high-speed operation. Injection-lock clock recovery removes the clock link, resulting in area and power reduction by half. An NMOS CML transceiver with adaptive bias control provides strong tolerance against PVT variations, enabling BER  $< 10^{-16}$  parallel link operation at 1 TB/s. Compared to the 40-nm DRAM wired interface, 32× bandwidth is obtained with 1/8 of the energy and 1/22 of the area.

## ACKNOWLEDGMENT

The authors are grateful to M. Tago with NEC Corporation for the assistance in stacked-chip assembly.

## REFERENCES

- [1] N. Nguyen et al., "A 16-Gb/s differential I/O cell with 380 fs RJ in an emulated 40 nm DRAM process," in Symp. VLSI Circ. Dig. Tech. Papers, Jun. 2008, pp. 128–129.
- [2] K. Chang et al., "A 16 Gb/s/link, 64 GB/s bidirectional asymmetric memory interface cell," in Symp. VLSI Circ. Dig. Tech. Papers, Jun. 2008, pp. 126–127.
- [3] H. Lee et al., "A 16 Gb/s/Link, 64 GB/s bidirectional asymmetric memory interface," *IEEE J. Solid-State Circuits*, vol. 44, no. 4, pp. 1235–1247, Apr. 2009.
- [4] T. Ezaki et al., "A 160 Gb/s interface design configuration for multichip LSI," in ISSCC Dig. Tech. Papers, Feb. 2004, pp. 140–141.
- [5] M. Wordeman et al., "A 3D system prototype of an eDRAM cache stacked over processor-like logic using through-silicon vias," in ISSCC Dig. Tech. Papers, Feb. 2012, pp. 186–187.
- [6] J. Kim et al., "A 1.2 V 12.8 GB/s 2 Gb mobile wide-I/O DRAM with 4 × 128 I/Os using TSV-based stacking," in ISSCC Dig. Tech. Papers, Feb. 2011, pp. 496–498.
- [7] P. Siblerud, Cost reduction scenario of 3D TSV integration [Online]. Available: http://www.emc3d.org/CoO.html
- [8] N. Miura et al., "An 11 Gb/s inductive-coupling link with burst transmission," in ISSCC Dig. Tech. Papers, Feb. 2007, pp. 298–299.
- [9] N. Miura et al., "A 0.14 pJ/b inductive-coupling inter-chip data transceiver with digitally-controlled precise pulse shaping," in *ISSCC Dig. Tech. Papers*, Feb. 2007, pp. 358–359.
- [10] N. Miura et al., "A 1 Tb/s 3 W inductive-coupling transceiver for interchip clock and data link," in ISSCC Dig. Tech. Papers, Feb. 2006, pp. 424–425.
- [11] N. Miura et al., "A 195 Gb/s 1.2 W 3D-stacked inductive inter-chip wireless superconnect with transmit power control scheme," in ISSCC Dig. Tech. Papers, Feb. 2005, pp. 264–265.
- [12] D. Mizoguchi et al., "A 1.2 Gb/s/pin wireless superconnect based on inductive inter-chip signaling (IIS)," in ISSCC Dig. Tech. Papers, Feb. 2004, pp. 142–143.
- [13] K. Niitsu et al., "Interference from power/signal lines and to SRAM circuits in 65 nm CMOS inductive-coupling link," in A-SSCC Dig. Tech. Papers, Nov. 2007, pp. 131–134.
- [14] N. Miura et al., "An 8 Tb/s 1 pJ/b 0.8 mm<sup>2</sup>/Tb/s QDR inductive-coupling interface between 65 nm CMOS and 0.1 μm DRAM," in ISSCC Dig. Tech. Papers, Feb. 2010, pp. 436–437.
- [15] N. Miura et al., "Analysis and design of inductive coupling and transceiver circuit for inductive inter-chip wireless superconnect," *IEEE J. Solid-State Circuits*, vol. 40, no. 4, pp. 829–837, Apr. 2005.
- [16] International Technology Roadmap for Semiconductors (ITRS), 2003 edition: Interconnect [Online]. Available: http://www.itrs.net/
- [17] K. Niitsu et al., "Misalignment tolerance in inductive-coupling interchip link for 3D system integration," in Extended Abstracts SSDM, Sep. 2008, pp. 86–87.
- [18] N. Miura et al., "Crosstalk countermeasures for high-density inductive-coupling channel array," *IEEE J. Solid-State Circuits*, vol. 42, no. 2, pp. 410–421, Feb. 2007.



**Noriyuki Miura** (S'06–M'08) received the B.S., M.S., and Ph.D. degrees in electrical engineering from Keio University, Yokohama, Japan, in 2003, 2005, and 2007, respectively.

From 2005 to 2008, he served as a Fellow Researcher of the Japan Society for the Promotion of Science (JSPS). He is currently a Research Associate at Keio University, working on short-range wireless transceiver circuit design and wireless interconnect technology for 3-D system integration.

Dr. Miura received the IEEE System LSI Award in 2005 and 2007, the 2006 LSI IP Design Award, the 2006 IP/SoC Best Design Award, the 2006 IEEE SSCS Japan Chapter Young Researcher Award, and the 2007 ASP-DAC Outstanding Design Award.



**Mitsuko Saito** (S'09) received the B.S. and M.S. degrees in electrical engineering from Keio University, Yokohama, Japan, in 2009 and 2011, respectively, where she is currently working toward the Ph.D. degree.

Since 2008, she has been engaged in research on the 3-D stacked inductive inter-chip wireless interface for 3-D system integration. Since 2011, she has been serving as a Fellow Researcher of the Japan Society for the Promotion of Science (JSPS).

Ms. Saito received the Fifth TSMC Outstanding Student Research Award, Bronze Medal in 2011.



**Tadahiro Kuroda** (M'88–SM'00–F'06) received the Ph.D. degree in electrical engineering from the University of Tokyo, Tokyo, Japan, in 1999.

In 1982, he joined Toshiba Corporation, where he designed CMOS SRAMs, gate arrays, and standard cells. From 1988 to 1990, he was a Visiting Scholar with the University of California, Berkeley, where he conducted research in the field of VLSI CAD. In 1990, he returned to Toshiba and engaged in the research and development of BiCMOS ASICs, ECL gate arrays, high-speed CMOS LSIs for telecommunications, and low-power CMOS LSIs for multimedia and mobile applications. He invented a variable threshold-voltage CMOS (VTCMOS) technology to control VTH through substrate bias, and applied it to a DCT core processor and a gate array in 1995. He also developed a variable supply-voltage scheme using an embedded DC-DC converter, and applied it to a microprocessor core and an MPEG-4 chip for the first time in the world in 1997. In 2000, he moved to Keio University, Yokohama, Japan, where he has been a professor since 2002. He was a Visiting Professor at Hiroshima University, Japan, and the University of California, Berkeley. His research interests include low-power, high-speed CMOS design for wireless and wireline communications, human-computer interactions, and ubiquitous electronics. He has authored more than 200 technical publications, including 60 invited papers and 21 books/chapters, and has filed more than 100 patents.

Dr. Kuroda served as the General Chairman for the Symposium on VLSI Circuits, as the Vice Chairman for ASP-DAC, as a sub-committee chair for A-SSCC, ICCAD, and SSDM, and as a program committee member for ISSCC, the Symposium on VLSI Circuits, CICC, DAC, ASP-DAC, ISLPED, SSDM, ISQED, and other international conferences. He is a recipient of the 2005 P&I Patent of the Year Award, the 2006 LSI IP Design Award, the 2007 ASP-DAC Best Design Award, the 2009 IEICE Achievement Award, and the 2011 IEICE Society Award. He is an elected AdCom member for the IEEE Solid-State Circuits Society and an IEEE SSCS Distinguished Lecturer.