# THEME ARTICLE: EMERGING SYSTEM INTERCONNECTS

# Photonic Network-on-Wafer for Multichiplet GPUs

Shiqing Zhang, Ziyue Zhang, Mahmood Naderan-Tahan, Hossein SeyyedAghaei, Xin Wang, He Li, Senbiao Qin, Didier Colle, Guy Torfs, Mario Pickavet, Johan Bauwelinck, Günther Roelkens, and Lieven Eeckhout, *Ghent University, 9000, Gent, Belgium*

This article introduces the photonic network-on-wafer graphics processing unit (GPU) architecture to overcome fundamental limitations in electrical interconnect scaling by implementing the inter-GPU network in a wafer-scale optical interposer. We argue that the photonic-NoW GPU is a scalable architecture, delivering significant performance benefits in a power-efficient manner.

In the last decade, advancements in graphics processing units (GPUs) have propelled major developments in artificial intelligence (AI), high-performance computing (HPC), and data analytics. Continuing this trend in any of these domains requires the ability to continuously scale GPU performance. Until recently, GPU performance has been scaled by increasing the number of streaming multiprocessors (SMs) across generations. This was made possible by leveraging Moore's Law and using the maximum possible transistor count in the most advanced chip technology node. Unfortunately, transistor scaling is slowing down and is likely to eventually stop. In addition, manufacturing issues further constrain the maximum die size as modern-day GPUs are approaching the reticle limit (around 800 mm<sup>2</sup>). Moreover, very large dies lead to yield issues, driving the cost of large monolithic GPUs to undesirable levels.

The solution to GPU performance scaling is to connect multiple physical GPUs together while providing the abstraction of a single logical GPU to software. One approach is to connect multiple GPUs on a printed circuit board. Scaling GPU workloads on these multi-GPU systems is hard because of the limited inter-GPU bandwidth offered. On-package interconnects, e.g., through interposer technology, provide higher bandwidth and lower latency than off-package interconnects, providing a promising direction to scale GPU performance to a handful of GPUs.<sup>1</sup> Wafer-scale integration goes one step further by bonding premanufactured dies on a silicon wafer, providing a pathway toward a wafer-scale GPU with tens of GPUs.<sup>2</sup> Unfortunately, providing high bandwidth density at low power consumption over long distances is fundamentally challenging using electrical interconnects, constraining GPU scaling using electrical interposer technology.

0272-1732 © 2023 IEEE Digital Object Identifier 10.1109/MM.2023.3237927 Date of publication 18 January 2023; date of current version 13 March 2023.

IEEE Micro


In this article, we propose the photonic network-on-wafer (NoW) GPU architecture in which premanufactured and pretested GPU dies and memory chips are mounted on a wafer-level interposer that connects the GPU chips through a photonic network layer, while connecting each GPU die with its local memory stack electrically, as illustrated in Figure 1. The key asset of the photonic-NoW GPU architecture is the ability to achieve high bandwidth density at low power over relatively long, wafer-scale distances (up to tens of centimeters). The goal of this article is to present the vision of the photonic-NoW GPU architecture, and argue for its potential and feasibility based on a preliminary quantitative and qualitative evaluation. More specifically, our preliminary simulation results indicate that GPU applications benefit from increased interchip bandwidth and that bandwidth sensitivity increases with system size, supporting the case for a photonic wafer-scale inter-GPU interconnect. We further argue that manufacturing a photonic-NoW appears to be technically feasible in the near future, making it a promising direction to scale GPU performance. This article further highlights research and design opportunities and challenges in the context of the photonic-NoW GPU paradigm.

March/April 2023

Published by the IEEE Computer Society

FIGURE 1. Photonic-NoW GPU architecture. A high-bandwidth, low-energy photonic network connects the GPU tiles across a wafer.

We believe that this work is in line with where industry is heading. Nvidia's NVLink and NVSwitch technologies provide high-bandwidth electrical interconnects within and across server nodes.<sup>13</sup> Ayar Labs and Nvidia recently announced plans to explore high-bandwidth yet low-power optical interconnects to develop scale-out multi-GPU architectures.<sup>14</sup> Cerebras Systems developed a wafer-scale AI accelerator in which the cores are connected through an (electrical) on-wafer interconnect.<sup>15</sup> Lightmatter very recently announced Passage, a photonic wafer-scale interconnect that ties chiplets together with silicon photonics and co-packaged optics; while it is conceptually similar to our proposal, few details have been made public.<sup>16</sup>

# MOTIVATION

GPU systems are bandwidth-hungry: keeping thousands of concurrent threads fed with data requires high off-chip bandwidth. While individual GPU chips need high-bandwidth interconnects to their local high-bandwidth memory stack, multi-GPU systems additionally need a high-bandwidth inter-GPU interconnection network for accessing data in remote memory stacks. Providing a high-bandwidth network is challenging, particularly when considering a wafer-scale GPU architecture, for three reasons. First, we need high bandwidth density, i.e., a high bit rate per cross-sectional area of the interconnect. Second, the energy consumed per bit needs to be affordable. Third, we need interconnects over fairly long distances, up to tens of centimeters in wafer-scale GPUs.

Figure 2 summarizes the current state-of-the-art in electrical interconnect technology in terms of these three goals: bandwidth density, energy per bit, and distance of communication. The vertical axis reports bandwidth density (Gb/s/mm) per energy per bit (pJ/bit), while the horizontal axis shows interconnect length (mm). It is clear from this graph that current electrical interconnects achieve high bandwidth density at affordable energy per bit at the chip scale (i.e., for short distances of less than one centimeter). However, this high figure of merit plummets when considering interconnects in the tens of centimeters range, which we need for a wafer-scale network. More specifically, based on our survey of the recent literature, we find current technology to provide less than 100 Gb/s/mm per pJ/bit for interconnect lengths in the 10+ cm range. Ideally, to support a high-bandwidth wafer-scale interconnect, we would like to achieve one or two orders of magnitude higher bandwidth density per energy per bit.

FIGURE 2. Bandwidth density per energy per bit as a function of interconnect length. Current state-of-the-art electrical interconnects cannot achieve high bandwidth density and low energy over long distances.

FIGURE 3. A photonic-NoW four-GPU system. High-radix optical switches provide all-to-all connectivity; multiple ports per GPU and multiple wavelengths per port enable achieving high bandwidth.

We thus need to resort to technologies other than electrical interconnects to provide a high-bandwidth, low-energy wafer-scale interconnection network. We argue that a photonic-NoW could be this technology for the following four reasons. First, a photonic layer can achieve much higher bandwidth density compared to electrical interconnects. Electrical impedance-controlled high-speed transmission lines on chip (or on a silicon interposer) are typically shielded and separated by a distance of 100-200 $\mu$m to avoid crosstalk (on an organic interposer this increases to approximately 500 $\mu$m), while optical waveguides in silicon can be spaced on a pitch of 25 $\mu$m or less. Second, wavelength multiplexing can be used to further increase the bandwidth density, which is not possible using electrical interconnects. Third, the power consumption of optical interconnects is quasi-independent of the interconnect length when low-loss waveguide technology is used, making it the preferred solution for wafer-scale interconnects over a few tens of centimeters. Fourth, optical interconnects can cross each other in the same layer, thereby significantly simplifying the design, in contrast to electrical interconnects that need to be routed to different metal layers.

# PHOTONIC-NOW GPU ARCHITECTURE

We propose the photonic-NoW GPU architecture, as previously illustrated in Figure 1. A so-called *GPU tile* groups a GPU chiplet with its local memory stack. Because of the close proximity (less than 1 cm), this GPU-memory interconnect can be realized using traditional electrical interposer technology. Interconnecting the different GPU tiles, on the other hand, involves interconnects on the order of tens of centimeters. As mentioned above, we rely on photonic waveguides to provide high-bandwidth, low-power interconnects between different GPU tiles across the wafer. Each GPU tile thus contains electro-optical transceivers to convert bits from the electrical domain to the optical domain, and back. Optical switches route these bits through the waveguides from source to destination. Wavelength division multiplexing increases the bandwidth achieved per waveguide.

The photonic-NoW is a circuit-switched network in which all switching happens in the optical domain, i.e., there is no back-and-forth conversion between the electrical and optical domains on the path from sender to receiver. We opt for a circuit-switched network because optical switches cannot buffer packets the way electrical switches can. We currently consider high-radix optical switches,<sup>a</sup> where the radix is determined by the number of GPUs in the system that we want to connect to each other through a single-hop connection. The different ports per GPU connect to the different switches through a waveguide. Wavelength routing is deployed in the optical switches, in which a wavelength determines the destination GPU in the system. Each GPU maintains a routing table to keep track of the network port and wavelength to reach a particular destination GPU. The routing tables are configured such that there is no conflict in any of the switches.

<sup>a</sup>The radix of a switch is defined by the number of input/output (I/O) ports.

FIGURE 4. High-speed optical transceiver. A 32-Gb/s bitstream per port is (de)modulated to (from) 16 wavelengths.
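As a minimal sketch of what a conflict-free set of routing tables could look like, the Python snippet below assigns wavelengths cyclically and then checks for conflicts. The function names, the table layout, the single-switch setting, and the cyclic rule `(dst - src) mod N` are illustrative assumptions; the text does not specify how the routing tables are actually populated.

```python
def build_routing_tables(num_gpus, port=0):
    """Return {src: {dst: (port, wavelength)}} for one optical switch.

    Assumption (not from the article): the wavelength index is chosen
    cyclically as (dst - src) mod num_gpus, which needs at least
    num_gpus wavelengths per waveguide.
    """
    tables = {}
    for src in range(num_gpus):
        tables[src] = {
            dst: (port, (dst - src) % num_gpus)
            for dst in range(num_gpus)
            if dst != src
        }
    return tables


def is_conflict_free(tables):
    """True if no (port, wavelength) pair is reused in a conflicting way.

    Two conflicts are checked: two sources reaching the same destination
    on the same port and wavelength, and one source using the same port
    and wavelength for two different destinations.
    """
    arrivals = set()    # (port, dst, wavelength) seen at a switch output
    departures = set()  # (src, port, wavelength) used at a GPU port
    for src, table in tables.items():
        for dst, (port, wavelength) in table.items():
            arrival = (port, dst, wavelength)
            departure = (src, port, wavelength)
            if arrival in arrivals or departure in departures:
                return False
            arrivals.add(arrival)
            departures.add(departure)
    return True


tables = build_routing_tables(num_gpus=4)
print(tables[0])                 # routes from GPU 0 to GPUs 1, 2, and 3
print(is_conflict_free(tables))  # True for the cyclic assignment
```

The cyclic rule forms a Latin square over (source, destination) pairs, which is why each wavelength appears at most once per source and once per destination.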

Figure 3 illustrates this topology for a four-GPU system with nine switches and nine network ports per GPU. A single high-radix switch provides all-to-all connectivity. To achieve high bandwidth, we envision 16 wavelengths per network port and waveguide providing 32 Gb/s of unidirectional bandwidth per wavelength. To further increase bandwidth, we provide multiple network ports. For example, nine network ports provide 576-GB/s unidirectional or 1.152-TB/s bidirectional bandwidth per GPU. A waveguide connects a GPU's network port to an I/O port of the optical switch. The target network bandwidth determines the number of optical switches and the number of network ports per GPU. The "Photonic-NoW" section further motivates this photonic-NoW design in light of a variety of network-level design tradeoffs.
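The per-GPU bandwidth figures quoted above follow directly from the port count, the wavelength count, and the per-wavelength bit rate. The short Python sketch below reproduces that arithmetic; the function and constant names are ours, and the numbers are the ones given in the text.

```python
WAVELENGTHS_PER_PORT = 16   # wavelengths per network port (from the text)
GBPS_PER_WAVELENGTH = 32    # unidirectional Gb/s per wavelength (from the text)


def per_gpu_bandwidth_gbytes(num_ports):
    """Unidirectional per-GPU bandwidth in GB/s for a given port count."""
    gbits_per_s = num_ports * WAVELENGTHS_PER_PORT * GBPS_PER_WAVELENGTH
    return gbits_per_s / 8  # 8 bits per byte


unidirectional = per_gpu_bandwidth_gbytes(9)
bidirectional = 2 * unidirectional
print(unidirectional, bidirectional)  # 576.0 1152.0
```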

The interface between the electrical and optical domains relies on high-speed optical transceiver chiplets with dedicated electronic circuitry. The key electronic functions in the optical link are the modulator driver, the low-noise transimpedance amplifier (TIA), and the clock-and-data recovery (CDR) circuits to modulate outgoing traffic to one of the 16 wavelengths and to demodulate incoming traffic, as illustrated in Figure 4. Key challenges for the complementary metal-oxide-semiconductor (CMOS) circuit design are the high-speed operation at low power consumption, the electro-optic co-design and interfacing, and the high-density multichannel integration. More details are provided in the "High-Speed Optical Transceiver" section.

The cross-section of the photonic-NoW GPU architecture in Figure 5 illustrates its physical implementation. GPU chips are connected to their local memory chips through traditional electrical interposer technology [e.g., embedded multidie interconnect bridge (EMIB)]. The optical transceiver chiplets discussed above provide connectivity to the optical layer in which the photonic-NoW is implemented (waveguide routing and switching). Lasers are sourced outside the wafer. Power is provided using through-silicon vias. The optical waveguide layer consists of ultra-low loss SiN waveguide circuits for waveguide routing and optical switching. Active components, such as high-speed photodiodes (PDs), electroabsorption modulators (EAMs), and driver electronics, are integrated on the optical waveguide layer through microtransfer printing, as will be elaborated in the "Optical Layer" section. Optical amplifiers could be integrated as well to overcome the insertion loss of the photonic switches. However, they need to be positioned as far as possible from the GPU tiles to prevent efficiency degradation due to high operating temperatures.

FIGURE 5. Cross-section of the photonic-NoW GPU. GPU chips are connected to their local memory stack through electrical interposer interconnects; different GPU chips across the wafer are connected through waveguides in the wafer-scale photonic network.

| Technology       | Footprint | Achievable radix | Switching time                                               | Wavelength routing | Power |
|------------------|-----------|------------------|--------------------------------------------------------------|--------------------|-------|
| MEMS             | Large     | Very high        | $\sim$1 $\mu$s<sup>3</sup>                                   | No                 | Low   |
| MZI              | Medium    | High             | $\sim$10 ns<sup>4</sup>                                      | No                 | High  |
| Push-pull MR MZI | Medium    | Medium           | $\sim$10 $\mu$s<sup>5</sup>; $\sim$10 ns feasible\*          | Yes                | High  |
| AWGR             | Small     | High             | N/A                                                          | Yes                | Low   |

TABLE 1. High-radix optical switches: Network characteristics.

\*Assumes electro-optical switching as in Qiao et al.<sup>4</sup>

# **PHOTONIC-NOW**

While photonic networks have been explored at the chip level, there are at least two key differences between a photonic network-on-chip (NoC) and a photonic-NoW. First, the area footprint is much larger for an NoW (on the order of $10^4$–$10^5$ mm<sup>2</sup>) than for an NoC ($10^2$–$10^3$ mm<sup>2</sup>). This implies that a photonic-NoW can rely on large-footprint electronic and photonic devices that were out of reach for photonic NoCs. Second, a photonic-NoW needs to provide (much) higher bandwidth between the network nodes. The network nodes in a chip-level network are individual cores or cache banks, in contrast to the full-fledged GPUs in a wafer-scale network that need much higher bandwidth.

Our proposed NoW architecture relies on high-radix optical switches, as discussed in the previous section. The key benefits of a high-radix optical switch include high bandwidth per network node, low hop count for routing, and (relatively) simple network topologies. An optical switch has the potential to provide a higher radix than its electrical counterparts, and thus to connect more GPUs through single-hop connections. If the number of GPUs were to exceed the switch's radix, multihop communication would be needed to fully connect the network.

Different optical switch designs have been proposed with varying properties in terms of area footprint, achievable radix, switching time (i.e., the time it takes to reconfigure the I/O connectivity in the switch), support for wavelength routing, and power consumption. We now discuss existing optical switch designs and their key properties (see also Table 1).

- 1) Microelectromechanical systems (MEMS): Optical switches based on MEMS feature a high radix (hundreds of ports) and are energy efficient. On the flip side, MEMS switches suffer from a relatively low switching speed. The size of a MEMS crossbar scales quadratically with its radix. Most MEMS optical switches do not support wavelength routing. Kwon et al.<sup>3</sup> demonstrated a 128 $\times$ 128 MEMS switch with an $\sim$1-$\mu$s switching time.
- 2) Mach-Zehnder interferometer (MZI): MZI-based optical switches feature a high switching speed, but their achievable radix is lower compared to MEMS due to attenuation and crosstalk limitations. A 64 $\times$ 64 integrated MZI optical switch has been demonstrated by Qiao et al.<sup>4</sup> with an $\sim$10-ns switching time. Huang et al.<sup>5</sup> demonstrated that by using push-pull overcoupling microrings, MZI switches can support wavelength routing; however, the achievable radix is somewhat lower (tens of ports) due to increased attenuation loss. MZI switches consume more power than MEMS switches.
- 3) Arrayed waveguide grating router (AWGR): An AWGR provides static all-to-all connections between its input and output ports through wavelength routing. It has a small footprint and is power-efficient; however, its biggest limitation is that it is nonswitchable and may therefore be inefficient for unbalanced NoW traffic. An 8 $\times$ 8 O-band AWGR for on-chip communication has been demonstrated by Pitris et al.<sup>6</sup>

The photonic-NoW design assumed in this work aligns with the push-pull microring MZI switches, i.e., we assume that the switches support wavelength routing and that, if workloads exhibit time-varying bandwidth demands between different GPUs in the system, the switches can be reconfigured accordingly.
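The selection argument above can be restated as a simple filter over the entries of Table 1. The Python sketch below is only an illustration: the `OpticalSwitch` record and its field names are ours, and the `reconfigurable` flag encodes the "N/A" switching time of the static AWGR as `False`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpticalSwitch:
    name: str
    footprint: str
    radix: str
    reconfigurable: bool      # False for the static (nonswitchable) AWGR
    wavelength_routing: bool
    power: str


# Qualitative entries transcribed from Table 1.
TABLE_1 = [
    OpticalSwitch("MEMS", "Large", "Very high", True, False, "Low"),
    OpticalSwitch("MZI", "Medium", "High", True, False, "High"),
    OpticalSwitch("Push-pull MR MZI", "Medium", "Medium", True, True, "High"),
    OpticalSwitch("AWGR", "Small", "High", False, True, "Low"),
]


def candidates(switches):
    """Keep switches that support wavelength routing and can be reconfigured."""
    return [s.name for s in switches
            if s.wavelength_routing and s.reconfigurable]


print(candidates(TABLE_1))  # ['Push-pull MR MZI']
```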

# HIGH-SPEED OPTICAL TRANSCEIVER

The high-speed electronic and electro-optical transceiver circuits translate the data between the digital, analog, and optical domains for optimum transmission and reception. The very high density of optical waveguides and the use of multiple wavelengths in the photonic-NoW provide a massive amount of parallel interconnects so that basic binary NRZ modulation at a reasonable rate (e.g., 32 Gb/s) can be adopted for high energy efficiency (minimal pJ/b).<sup>7</sup> This avoids more complex multilevel modulation techniques, such as PAM-4, which is now widely adopted by the datacom interconnect industry. Implementing PAM-4 would bring significant disadvantages, such as higher circuit complexity, reduced noise margins, and the need for forward error correction. NRZ modulation is most efficient when sufficient bandwidth is available, which is certainly the case for the very compact (and hence low-capacitance, $\sim$10 fF) silicon-photonic EAM modulators and PDs considered here (and which have already been demonstrated for up to 100-Gb/s NRZ<sup>8</sup>).

The transceiver's chiplet size and power consumption will be mainly determined by the key electronic functions, consisting of the modulator driver, the low-noise TIA, and the serializer-deserializer (serdes) circuits including CDR. Advanced CMOS technology is preferred to co-integrate all these analog and mixed-signal transceiver circuits in a very compact transceiver chiplet. Large circuits such as complex equalizer circuits, T-coils, or peaking inductors are avoided by the use of moderate bit rates per wavelength and by dense co-integration through microtransfer printing (see the next section). Microtransfer printing saves significant area as it avoids the use of flip-chip pads (with, e.g., 40-$\mu$m diameter) and the associated parasitics (two pads plus bumps), since it can provide interconnections on pads of only 10 by 10 $\mu$m.

We believe it is feasible to achieve a figure of merit (bandwidth density per energy per bit) that enables a wafer-scale photonic interconnect. More specifically, based on recent results by Guermandi et al.,<sup>9</sup> we make the following estimates. Bandwidth density can be computed as the product of the bit rate per wavelength and the number of wavelengths per waveguide, divided by the waveguide pitch:

$$\text{bandwidth density} = \frac{\text{bit rate per wavelength} \times \text{no. wavelengths}}{\text{waveguide pitch}}$$

Assuming a TRx bit rate of 32-Gb/s NRZ per wavelength and 16 wavelengths per waveguide, and a waveguide pitch of 25 $\mu$m at the edge of the TRx chiplet (assuming 25-mm shoreline for optical I/O and a maximum of 8,000 TRx chiplets of 200 $\mu$m by 400 $\mu$m underneath a single GPU tile), we obtain a total bandwidth density of 20,480 Gb/s/mm. Accounting for 4-pJ/b energy consumption (this includes the laser, optical switching, and amplification), we arrive at the predicted figure of merit of 5,120 Gb/s/mm per pJ/b, as indicated by the red star in the target area in Figure 2.
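The estimate above can be reproduced in a few lines. The Python sketch below plugs in the stated bit rate, wavelength count, waveguide pitch, and energy per bit, and reports bandwidth density in Gb/s/mm; the function names are ours and purely illustrative.

```python
def bandwidth_density_gbps_per_mm(bit_rate_gbps, num_wavelengths, pitch_um):
    """Bandwidth density in Gb/s per mm of chiplet edge (shoreline)."""
    waveguides_per_mm = 1000 / pitch_um  # 1 mm = 1000 um
    return bit_rate_gbps * num_wavelengths * waveguides_per_mm


def figure_of_merit(density_gbps_per_mm, energy_pj_per_bit):
    """Bandwidth density per energy per bit, in Gb/s/mm per pJ/b."""
    return density_gbps_per_mm / energy_pj_per_bit


density = bandwidth_density_gbps_per_mm(32, 16, 25)
fom = figure_of_merit(density, 4.0)
print(density, fom)  # 20480.0 5120.0
```

For comparison, the surveyed electrical interconnects in the 10+ cm range stay below 100 Gb/s/mm per pJ/bit, so the predicted value is more than an order of magnitude higher.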

# **OPTICAL LAYER**

The optical layer has to provide the optical connectivity between the different GPU tiles. As discussed above, as the size of the multi-GPU system increases, so does the interconnection distance between the GPUs. This requires the implementation of low-loss waveguide technology. SiN waveguide circuits allow realizing low-loss (dB/m-level) waveguides, while at the same time keeping the bend radius small (50 $\mu$m), allowing for compact routing. It is, however, a purely passive platform, except for heaters that can be implemented to enable optical switching in tens of microseconds. For the optical transceivers, as well as for the inline optical amplification, non-native optoelectronic components need to be integrated on the SiN photonic interposer. A very efficient way to realize this is the use of microtransfer printing technology,<sup>10</sup> in which the non-native optoelectronic components (semiconductor optical amplifiers, PDs, modulators) are fabricated on their native (III-V) substrate, after which the devices are released from their substrate and transferred in a massively parallel way onto the SiN interposer. The devices that are transferred are only tens of micrometers wide and a few micrometers thick, and can be placed with submicron precision on the SiN interposer. Once printed, the devices are electrically connected to the interposer back-end stack using a metal redistribution layer. Besides optoelectronic components, ultrathin CMOS transceiver circuits and high-density electrical interposer chiplets (for interconnecting the GPU to memory) can also be integrated using the same microtransfer printing technology.

# EXPERIMENTAL SETUP

We use simulation to conduct a preliminary performance evaluation of the proposed photonic-NoW GPU architecture. We extended the GPGPU-Sim simulator<sup>11</sup> to model a multitile architecture. Each tile consists of a GPU chip along with a high-bandwidth (2 TB/s) 4-GB memory stack. The GPU consists of 64 SMs and features a 4-MB last-level cache (LLC), which is configured as a memory-side cache. An on-chip 4-Tb/s crossbar interconnection network connects the SMs to the LLC. The photonic-NoW is modeled using BookSim 2.0,<sup>12</sup> and is integrated with GPGPU-Sim to model the entire system. We further assume first-touch page allocation, and consider both round-robin (RR) and distributed CTA schedulers (DS) to optimize data locality within each tile.<sup>1</sup> We consider a diverse set of benchmarks taken from the AI and HPC application domains, namely b+tree, dwt2d, bfs, and lud from Rodinia, and ssd-resnet34 from the MLPerf inference benchmark suite. To make full use of the resources provided for increasing chiplet count, we carefully scale the input sets to provide enough threads.

FIGURE 6. Normalized performance (IPC): (a) the harmonic mean with the RR scheduler, (b) the harmonic mean with the DS scheduler, (c) individual benchmarks with the RR scheduler, and (d) individual benchmarks with the DS scheduler. GPU applications benefit from increased interchip bandwidth [see (a), (b), (c) and (d)], bandwidth sensitivity increases with system size [see (a) and (b)], and substantial speedup is obtained with a higher chiplet count if balanced interchiplet bandwidth is provided [see (c) and (d)].

We simulate 4-, 8-, and 16-chiplet systems. In the 4-chiplet system, we consider eight optical switches. We further assume 16 wavelengths per port/waveguide with 32-Gb/s unidirectional bandwidth each, providing a total of 512-GB/s unidirectional bandwidth (or 1,024-GB/s bidirectional bandwidth) per GPU chiplet. This baseline bandwidth configuration corresponds (roughly) to what Nvidia's fourth-generation NVLink provides in terms of per-GPU bandwidth, namely 900-GB/s bidirectional bandwidth. We explore the sensitivity of GPU performance to interchip network bandwidth (512, 1,024, 2,048, and 4,096 GB/s) by varying the number of ports per GPU and the number of switches proportionally. For example, a 1,024-GB/s bandwidth configuration requires 16 ports per GPU and 16 optical switches to provide 1,024-GB/s unidirectional bandwidth per GPU.
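The port and switch counts for each bandwidth configuration can be derived from the per-port bandwidth. The Python sketch below does this for the four configurations explored; it assumes, as in the 512-GB/s and 1,024-GB/s examples in the text, that the number of switches equals the number of ports per GPU. The function name is ours.

```python
import math

WAVELENGTHS_PER_PORT = 16   # from the text
GBPS_PER_WAVELENGTH = 32    # unidirectional Gb/s per wavelength, from the text


def ports_for_bandwidth(target_gbytes_per_s):
    """Ports per GPU needed for a target unidirectional bandwidth in GB/s."""
    per_port_gbytes = WAVELENGTHS_PER_PORT * GBPS_PER_WAVELENGTH / 8  # 64 GB/s
    return math.ceil(target_gbytes_per_s / per_port_gbytes)


for target in (512, 1024, 2048, 4096):
    ports = ports_for_bandwidth(target)
    # Assumption: one switch per port, scaled proportionally with the ports.
    print(f"{target} GB/s -> {ports} ports per GPU, {ports} switches")
```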

We further assume that the latency across the photonic-NoW equals 6 ns: 2 ns for electrical-to-optical conversion, 2 ns for optical transmission (assuming an  $\sim$ 30-cm waveguide length), and 2 ns for optical-to-electrical conversion.
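As a rough sanity check on the 2-ns transmission term, the Python sketch below computes the propagation delay over a 30-cm waveguide. The group index of 2.0 is our own illustrative assumption and is not stated in the text; the real value depends on the waveguide geometry.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458
ASSUMED_GROUP_INDEX = 2.0  # assumption for illustration, not from the article


def propagation_ns(length_m, group_index=ASSUMED_GROUP_INDEX):
    """Time of flight through a waveguide of the given length, in ns."""
    return length_m * group_index / SPEED_OF_LIGHT_M_PER_S * 1e9


def now_latency_ns(length_m, eo_ns=2.0, oe_ns=2.0):
    """End-to-end latency: E/O conversion + propagation + O/E conversion."""
    return eo_ns + propagation_ns(length_m) + oe_ns


print(round(propagation_ns(0.30), 2))  # close to 2 ns under the assumed index
print(round(now_latency_ns(0.30), 2))  # close to the 6 ns used in the model
```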

# **PRELIMINARY EVALUATION**

RR and DS are common policies used in multi-GPU systems. We find that the optimum CTA scheduling policy varies across benchmarks, number of chiplets, and interchiplet bandwidth. We hence evaluate bandwidth sensitivity for both RR and DS.

Figure 6(a) and (b) report average normalized performance (harmonic mean IPC, or number of instructions executed per cycle, across all benchmarks) for the 4-, 8-, and 16-chiplet systems under RR and DS, respectively. Four bandwidth configurations are considered, and the results are normalized to our baseline configuration, a 4-chiplet GPU with 512-GB/s unidirectional interchiplet bandwidth per chiplet. There are two important conclusions to be taken from these results. First, performance improves significantly as we increase the number of chiplets and interchiplet bandwidth beyond the baseline. We note a 4.70$\times$ and 4.19$\times$ performance improvement under RR and DS, respectively, for 16 chiplets with 4,096-GB/s unidirectional interchiplet bandwidth per chiplet. In other words, important AI and HPC workloads benefit from increased chiplet count and interchiplet bandwidth. Second, when comparing the 16-chiplet performance results against the 4- and 8-chiplet results, we note that the performance improvement increases with increasing interchiplet bandwidth. For RR scheduling, the 4,096-GB/s configuration yields 1.66$\times$, 3.01$\times$, and 4.70$\times$ higher performance for 4, 8, and 16 chiplets, respectively. Similarly, for DS scheduling, the 4,096-GB/s configuration yields 1.14$\times$, 2.09$\times$, and 4.19$\times$ higher performance for 4, 8, and 16 chiplets, respectively. This suggests that increasing interchiplet bandwidth is more critical for increased system size. The reason is that the effective GPU-to-GPU bandwidth decreases as system size increases, assuming fixed per-GPU off-chiplet bandwidth.

Figure 6(c) and (d) report speedup for individual benchmarks over the baseline configuration. Per-benchmark results are shown for three balanced configurations in which we simultaneously scale interchiplet bandwidth and system size, namely 1,024, 2,048, and 4,096 GB/s unidirectional bandwidth for the 4-, 8-, and 16-chiplet systems, respectively. We note speedups as high as 6.3$\times$ for dwt2d under RR and 6.45$\times$ for bfs using DS for the 4,096-GB/s 16-chiplet configuration. The key observation here is that substantial speedups are obtained at high chiplet count if interchiplet bandwidth increases commensurately. These results demonstrate that there is ample room for improving performance through a high-bandwidth photonic interchiplet NoW as we scale the number of GPUs.
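The bandwidth-dilution argument behind these results can be illustrated with a back-of-the-envelope calculation. The Python sketch below divides the per-GPU bandwidth evenly over all remote GPUs; the uniform all-to-all traffic pattern is our simplifying assumption, and real workloads will deviate from it.

```python
def per_pair_bandwidth(per_gpu_gbytes, num_gpus):
    """GB/s available toward each remote GPU, assuming uniform traffic."""
    return per_gpu_gbytes / (num_gpus - 1)


# Fixed 512 GB/s per GPU: the per-pair share shrinks with system size.
for n in (4, 8, 16):
    print(n, round(per_pair_bandwidth(512, n), 1))
# 4 170.7
# 8 73.1
# 16 34.1

# Balanced configurations: the per-pair share stays in the same range.
for n, bw in ((4, 1024), (8, 2048), (16, 4096)):
    print(n, bw, round(per_pair_bandwidth(bw, n), 1))
# 4 1024 341.3
# 8 2048 292.6
# 16 4096 273.1
```

Under this simple model, doubling the per-GPU bandwidth each time the chiplet count doubles keeps the per-pair bandwidth roughly constant, which is consistent with the notion of a balanced configuration used above.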

# SUMMARY AND FUTURE RESEARCH DIRECTIONS

This article proposed the photonic-NoW GPU architecture and argued that it is a promising and technically feasible paradigm to scale GPU performance. The photonic-NoW GPU as described and evaluated in this article is just a first attempt to explore the large design space. We believe that the photonic-NoW GPU paradigm opens up a wide arena of potential topics for future research, across different layers in the system stack.


At the architecture level, the high interchip bandwidth offered through the photonic-NoW might change how best to organize and manage the memory hierarchy on a per-workload basis. In particular, a high-bandwidth interchiplet network renders remote accesses relatively cheap, which provides opportunities to adaptively reconfigure the memory hierarchy to cache data locally (to maximize effective bandwidth) versus remotely (to maximize the effective cache capacity).

At the network level, network reconfiguration methods will be explored for our proposed photonic-NoW architecture. Suitable control mechanisms and algorithms can be designed for adapting the network's bandwidth distribution among GPUs to meet their time-varying bandwidth demands. Moreover, when the number of GPUs exceeds the achievable radix of the optical switches, multihop networks will need to be designed.

For the electro-optical transceiver design, high-efficiency and compact driver, TIA, and serdes circuits have been demonstrated in the literature. However, significant performance gains can still be found by customizing the designs for this particular application (data format, optical link budget, floorplan) and for a technology platform that combines high-speed photonics with very low optical losses and very low electrical parasitics thanks to microtransfer printing. Furthermore, to efficiently support wavelength routing and dynamic bandwidth configuration in the network, fast wake-up and fast-locking burst-mode CDR circuits also need to be investigated to save power when no data transfer takes place.

To implement the photonic layer, further development of the microtransfer printing technology is of key importance. This includes the development of the transfer printing of both III-V optoelectronic components and electronic ICs and interposers. Proof-of-principle demonstrations of the microtransfer printing of such components have already been made, but scaling up the technology to wafer-scale photonic interposers still needs to be demonstrated.

# ACKNOWLEDGMENTS

We thank the guest editors and reviewers for their valuable feedback. This work was supported by UGent Project under Grant BOF21-GOA-014.

# REFERENCES

- A. Arunkumar et al., "MCM-GPU: Multi-chip-module GPUs for continued performance scalability," in *Proc. IEEE/ACM Int. Symp. Comput. Archit.*, 2017, pp. 320–332.
- S. Pal, D. Petrisko, M. Tomei, P. Gupta, S. S. Iyer, and R. Kumar, "Architecting waferscale processors: A GPU case study," in *Proc. IEEE Int. Symp. High-Perform. Comput. Archit.*, 2019, pp. 250–263.
- K. Kwon et al., "128×128 silicon photonic MEMS switch with scalable row/column addressing," in *CLEO: Science and Innovations*, 2018, Paper SF1A–4.
- L. Qiao, W. Tang, and T. Chu, "Ultra-large-scale silicon optical switches," in *Proc. IEEE 13th Int. Conf. Group IV Photon.*, 2016, pp. 1–2.
- Y. Huang, Q. Cheng, A. Rizzo, and K. Bergman, "Push-pull microring-assisted space-and-wavelength selective switch," *Opt. Lett.*, vol. 45, no. 10, pp. 2696–2699, 2020.
- S. Pitris et al., "Silicon photonic 8×8 cyclic arrayed waveguide grating router for O-band on-chip communication," *Opt. Exp.*, vol. 26, no. 5, pp. 6276–6284, 2018.
- M. Raj et al., "Design of a 50-Gb/s hybrid integrated Si-photonic optical link in 16-nm FinFET," *IEEE J. Solid-State Circuits*, vol. 55, no. 4, pp. 1086–1095, Apr. 2020.
- J. Verbist et al., "Real-time 100 Gb/s NRZ and EDB transmission with a GeSi electroabsorption modulator for short-reach optical interconnects," *J. Lightw. Technol.*, vol. 36, no. 1, pp. 90–96, Jan. 2018.
- D. Guermandi et al., "TSV-assisted hybrid FinFET CMOS–silicon photonics technology for high density optical I/O," in *Proc. 45th Eur. Conf. Opt. Commun.*, 2019, pp. 1–4.
- J. Zhang et al., "III-V-on-Si photonic integrated circuits realized using micro-transfer-printing," *APL Photon.*, vol. 4, no. 11, 2019, Art. no. 110803.
- A. Bakhoda, G. L. Yuan, W. W. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA workloads using a detailed GPU simulator," in *Proc. Int. Symp. Perform. Anal. Syst. Softw.*, 2009, pp. 163–174.
- N. Jiang et al., "A detailed and flexible cycle-accurate network-on-chip simulator," in *Proc. IEEE Int. Symp. Perform. Anal. Syst. Softw.*, 2013, pp. 86–96.
- [Online]. Available: https://www.nvidia.com/en-us/data-center/nvlink/
- [Online]. Available: https://ayarlabs.com/ayar-labs-to-accelerate-development-and-application-of-optical-interconnects-in-artificial-intelligence-machine-learning-architectures-with-nvidia/
- [Online]. Available: https://www.cerebras.net/product-chip/
- [Online]. Available: https://lightmatter.co/products/passage/

SHIQING ZHANG is a doctoral student at Ghent University, 9000, Gent, Belgium. Her research interests include GPGPU system design, computer architecture modeling, and optimization. Zhang received a master's degree in computer science from the National University of Defense Technology, Changsha, China. Contact her at Shiqing.Zhang@UGent.be.

ZIYUE ZHANG is a doctoral student in the Fixed Internet Architectures & Optical Networks (FARON) Research Group at Ghent University-imec, 3001, Leuven, Belgium. His research interests include optical network design and network algorithm design. Zhang received a master's degree in photonics from Ghent University. Contact him at Ziyue.Zhang@UGent.be.

MAHMOOD NADERAN-TAHAN is a postdoctoral researcher at Ghent University, 9000, Gent, Belgium. His research interests include computer architecture, GPU acceleration, and performance benchmarking. Naderan-Tahan received a Ph.D. degree from Sharif University of Technology, Tehran, Iran. Contact him at Mahmood.Naderan@UGent.be.

HOSSEIN SEYYEDAGHAEI is a doctoral student at Ghent University, 9000, Gent, Belgium. His research interests include general-purpose computing on graphics processing units (GPGPU) system design, scale-down simulation, and optimization. SeyyedAghaei received a master's degree in computer engineering from Tehran University, Tehran, Iran. Contact him at SeyyedHossein.SeyyedAghaeiRezaei@UGent.be.

XIN WANG is a doctoral student in the IDLab Design Group at Ghent University-imec, 3001, Leuven, Belgium. His research focuses on high-speed mixed-signal integrated circuit design for (opto-)electronic communication systems. Wang received a master's degree in microelectronics from Fudan University, Shanghai, China. He is a Graduate Student Member of IEEE. Contact him at Xin.Wang@UGent.be.

HE LI is a doctoral student at Ghent University, 9000, Gent, Belgium, and imec, 3001, Leuven, Belgium. His research interests include photonic integrated circuits and optical interconnects. Li received a master's degree in material engineering from Nanjing University, Nanjing, China. Contact him at He.Li@UGent.be.

SENBIAO QIN is a doctoral student at Ghent University, 9000, Gent, Belgium, and imec, 3001, Leuven, Belgium. His research interests include photonic integrated circuits and optical interconnects. Qin received a master's degree in optical engineering from the Huazhong University of Science and Technology, Wuhan, China. Contact him at Senbiao.Qin@UGent.be.

DIDIER COLLE is a full professor at Ghent University, 9000, Gent, Belgium, and imec, 3001, Leuven, Belgium. His research interests include fixed Internet architectures and optical networks, green ICT, design of network algorithms, and techno-economic studies. Colle received a Ph.D. degree from Ghent University. He is a Member of IEEE. Contact him at Didier.Colle@UGent.be.

GUY TORFS is an associate professor at Ghent University, 9000, Gent, Belgium, and imec, 3001, Leuven, Belgium. His research focuses on high-speed (opto-)electronic integrated circuits. Torfs received a Ph.D. degree in applied sciences and electronics from Ghent University. Contact him at Guy.Torfs@UGent.be.

MARIO PICKAVET is a senior full professor at Ghent University, 9000, Gent, Belgium, and imec, 3001, Leuven, Belgium. His research interests include optical networking, green ICT, and algorithm design for complex networking problems. Pickavet received a Ph.D. degree in electrical engineering from Ghent University. He is a Senior Member of IEEE. Contact him at Mario.Pickavet@UGent.be.

JOHAN BAUWELINCK is an associate professor at Ghent University, 9000, Gent, Belgium, and imec, 3001, Leuven, Belgium. His research focuses on high-speed (opto-)electronic integrated circuits for optical interconnects and sensing. Bauwelinck received a Ph.D. degree in electrical engineering from Ghent University. He is a Senior Member of IEEE. Contact him at Johan.Bauwelinck@UGent.be.

GÜNTHER ROELKENS is a full professor at Ghent University, 9000, Gent, Belgium, and imec, 3001, Leuven, Belgium. His research interests include photonic integrated circuits, in particular heterogeneous photonic/electronic ICs. Roelkens received a Ph.D. degree in electrical engineering from Ghent University. He is a Senior Member of IEEE. Contact him at Gunther.Roelkens@UGent.be.

LIEVEN EECKHOUT is a senior full professor at Ghent University, 9000, Gent, Belgium. His research interests include computer architecture performance analysis and modeling, CPU/GPU microarchitecture, and resource management. Eeckhout received a Ph.D. degree in computer science engineering from Ghent University. He is an IEEE and ACM Fellow. Contact him at Lieven.Eeckhout@UGent.be.



March/April 2023