By John on December 11, 2020

These days, cloud data centers are busy building artificial intelligence (AI) computing platforms that let application software developers, who need not be machine learning (ML) specialists or data scientists, build applications for industries such as health care, robotics, social media, finance, autonomous cars, and gaming. These AI computing platforms are expected to become so popular and powerful that their impact may soon exceed that of the internet and mobile device platforms.

Two key factors are critical to the success of a cloud AI computing platform: computing power and the interconnection bandwidth among distributed computers. Computing power has been increasing at an astounding pace, with the compute used in the largest AI training runs doubling roughly every 3.4 months from 2012 through AlphaGo, thanks to processors such as graphics processing units (GPUs) and tensor processing units (TPUs) that are optimized for distributed and parallel computing. The growth of AI/ML at Google is shown in the figure below, and this hockey-stick growth rate is similar at other major cloud AI/ML data centers. As a result, AI/ML has driven east-west intra-data-center traffic to a previously unseen peak.

While data center operators have been using AI/ML to optimize network performance in support of AI/ML traffic, the networking ecosystem is growing much more slowly: both Ethernet switch and optical transceiver capacities double only about every two years on average, and this rate may even slow further in the next few years. Nevertheless, supercomputers built from multiple optically interconnected computing farms have recently achieved an astonishing 700 petaFLOPS of AI computing performance. In such a platform, optical interconnection is provided by thousands of short-reach 200 Gb/s pluggable optical transceivers in the spine and leaf switches, and these 200 Gb/s pluggables will soon be upgraded to 400 Gb/s.
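
To put the mismatch in perspective, a quick back-of-the-envelope comparison of the two doubling periods is sketched below. The 3.4-month and two-year figures are the trend numbers quoted above; the six-year horizon is an illustrative assumption.

```python
# Rough comparison of growth rates: AI training compute (doubling every
# 3.4 months) versus Ethernet switch / optical transceiver capacity
# (doubling roughly every 24 months). Numbers are trend figures, not data.

def growth_factor(years: float, doubling_months: float) -> float:
    """Multiplicative growth over `years`, given a doubling period in months."""
    return 2.0 ** (years * 12.0 / doubling_months)

horizon_years = 6  # illustrative window, roughly 2012 onward
compute = growth_factor(horizon_years, 3.4)
network = growth_factor(horizon_years, 24.0)

print(f"AI training compute: ~{compute:,.0f}x over {horizon_years} years")
print(f"Switch/transceiver capacity: ~{network:.0f}x over {horizon_years} years")
print(f"Gap between the two: ~{compute / network:,.0f}x")
```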

Looking forward, as the speed of Ethernet/InfiniBand switches continues to increase, pluggable optical transceivers could be replaced by co-packaged optics (CPO), in which the optical components are co-packaged with the spine and leaf switches. CPO is also foreseen to serve as the 100 to 400 Gb/s (and beyond) optical interface for future server chips, network interface cards, and GPUs/TPUs. The challenges of CPO lie not only in the 3D optoelectronic packaging technique but also in the extremely high reliability required of a CPO package: if one of the optical transceivers surrounding the central switch fails, the entire package has to be replaced.
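
A simple calculation shows why the reliability requirement is so severe: if a single failed optical engine condemns the whole co-packaged assembly, the package-level failure probability grows with the number of engines it contains. The sketch below uses a hypothetical per-engine failure probability purely for illustration.

```python
# Why CPO demands ultra-high per-device reliability: if any one of the
# optical engines co-packaged with the switch fails, the entire package
# must be replaced. The per-engine failure probability is an assumed
# illustrative value, not a measured figure.

def package_failure_prob(p_engine: float, n_engines: int) -> float:
    """Probability that at least one of n independent engines fails."""
    return 1.0 - (1.0 - p_engine) ** n_engines

p_engine = 1e-3  # assumed failure probability per engine over a service interval
for n in (1, 16, 32, 64):
    print(f"{n:3d} engines -> package failure probability "
          f"~{package_failure_prob(p_engine, n):.2%}")
```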

Recently, many researchers and startups have been exploring the use of silicon photonic integrated circuits (PICs) to implement much faster and more power-efficient artificial neural networks for AI/ML. Their motivation is that typical machine learning systems spend more than 90% of their energy and runtime on matrix multiplication, and linear matrix multiplication can be implemented with either parallel or cascaded (series) arrangements of silicon photonic Mach-Zehnder interferometers (MZIs). However, these approaches face a fundamental scalability challenge. For cascaded MZIs, scalability is limited by the accumulated optical insertion loss. For a parallel approach based on wavelength-division multiplexing, scalability is limited by the number of available wavelengths (including the limits of array lasers or comb lasers) and by the design of wavelength multiplexers/demultiplexers on a silicon PIC.
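
To make the series-MZI limitation concrete, the sketch below shows how cascaded insertion loss accumulates with mesh depth. The per-MZI loss value is an assumption chosen only for illustration, not a measured figure.

```python
# Scalability limit of cascaded (series) MZI meshes: insertion loss adds
# up stage by stage, so total loss in dB grows linearly with mesh depth,
# i.e., the transmitted optical power falls off exponentially.
# The per-MZI loss below is an assumed illustrative value.

loss_per_mzi_db = 0.3  # assumed insertion loss per MZI stage, in dB

for depth in (8, 32, 64, 128):  # mesh depth scales with matrix dimension
    total_loss_db = depth * loss_per_mzi_db
    transmitted = 10 ** (-total_loss_db / 10)
    print(f"depth {depth:4d}: {total_loss_db:5.1f} dB total loss, "
          f"{transmitted:.3%} of input power reaches the output")
```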

