Driving AI innovation and scale with high-capacity Ethernet
By Ash Halford, Head of Technology & Director Systems Engineering, ANZ, Juniper Networks
Wednesday, 12 February, 2025
AI has become the driving force behind transformative innovation across industries worldwide, from healthcare and financial services to autonomous vehicles and smart cities. These advancements demand massive data processing, computational power and a network fabric capable of handling unprecedented workloads.
While InfiniBand (IB) has long been the go-to technology for certain high-performance environments, its operational complexity and vendor lock-in present significant challenges. Ethernet, by contrast, is emerging as the universal choice for AI networks. Its openness, scalability, high performance and cost efficiency make it the ideal solution for the next wave of AI innovation.
Why Ethernet outpaces InfiniBand for AI
Several factors point to a future where next-generation graphics processing units (GPUs) and accelerators rely on Ethernet, sidestepping the cost and complexity of InfiniBand.
Avoiding vendor lock-in
InfiniBand’s proprietary technology creates dependency on a single vendor and restricts flexibility. Ethernet’s open standard, by contrast, fosters multivendor ecosystems and offers greater freedom in network design, enabling organisations to mix GPUs and accelerators from different vendors rather than depending on a single supplier.
Consequently, with the rise of specialised GPUs and accelerators from companies like AMD, Intel and innovative startups, Ethernet’s compatibility across vendors’ systems makes it the best choice for businesses seeking flexibility, allowing organisations to innovate and scale without connectivity restrictions.
Scalability and high performance
Ethernet provides the scalability and performance necessary to meet the increasing demands of AI workloads. It supports speeds of up to 800 Gbps, with 1.6 Tbps on the horizon, while cutting-edge features like RDMA over Converged Ethernet version 2 (RoCE v2) and Global Load Balancing ensure low latency and high throughput, matching or surpassing InfiniBand’s capabilities.
This positions Ethernet as an ideal choice for next-generation accelerators, empowering businesses to scale AI projects, integrate complex and diverse data sources, and foster seamless innovation without limitations. Moreover, continuous advancements in Ethernet standards further enhance its ability to manage congestion, ensure fault tolerance and optimise real-time performance, even in the most demanding AI environments.
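To make the congestion-management point concrete, the sketch below illustrates the general shape of the sender-side rate control commonly paired with RoCE v2 (a DCQCN-style scheme): the sender cuts its rate multiplicatively when ECN-marked packets arrive and recovers additively once congestion clears. The rates, step sizes and alpha value are illustrative assumptions, not figures from any vendor implementation.

```python
# Illustrative DCQCN-style rate control: multiplicative decrease on an
# ECN congestion mark, additive increase while the path is clear.
# All numeric parameters here are assumed for illustration.

LINE_RATE_GBPS = 400.0   # assumed link speed
MIN_RATE_GBPS = 1.0

def react_to_ecn(current_rate: float, alpha: float) -> float:
    """Multiplicative decrease when a congestion notification arrives."""
    return max(MIN_RATE_GBPS, current_rate * (1 - alpha / 2))

def recover(current_rate: float, step_gbps: float = 5.0) -> float:
    """Additive increase while no ECN marks are seen."""
    return min(LINE_RATE_GBPS, current_rate + step_gbps)

rate = LINE_RATE_GBPS
alpha = 0.5  # congestion-severity estimate; the real protocol updates this

# Two back-to-back ECN marks, then congestion clears.
rate = react_to_ecn(rate, alpha)   # 400 -> 300
rate = react_to_ecn(rate, alpha)   # 300 -> 225
for _ in range(10):
    rate = recover(rate)           # climbs back toward line rate
print(f"rate after recovery: {rate} Gbps")  # rate after recovery: 275.0 Gbps
```

The multiplicative-decrease/additive-increase pattern is what lets a lossless RoCE fabric back off quickly under incast while still reclaiming bandwidth once the hotspot passes.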
Cost efficiency and operational simplicity
As AI scales rapidly, businesses require affordable and easily deployable networking solutions. Ethernet’s widespread presence in data centres reduces deployment costs and streamlines integration with hybrid cloud setups, enabling seamless communication between on-premises hardware and cloud-based GPUs. Its broad adoption ensures access to a skilled talent pool and tools, minimising operational hurdles for AI projects. In addition, Ethernet’s open ecosystem fosters innovation, providing adaptable network designs while simplifying operational complexity for evolving AI workloads.
Lastly, Ethernet’s long history of winning out over technologies like Token Ring, FDDI, ATM and SONET underscores these strengths. The industry expects the same outcome for AI networks.
Ethernet’s impact on AI workloads
Ethernet’s high-speed connectivity, low latency and congestion management make it the best connectivity option for large-scale AI environments. The technology has proven to be the ideal fabric for AI, providing the bandwidth, performance and flexibility necessary for:
- Model training: AI models, particularly large ones, need an infrastructure that can handle vast datasets and iterative learning. Ethernet’s scalability with port speeds of up to 800G ensures that AI training can proceed without bottlenecks. It supports large AI model training, inference at scale, and distributed systems with advanced congestion management and RDMA.
- Inference and real-time decision-making: Low latency is crucial in applications like autonomous driving, financial trading and healthcare diagnostics. Ethernet’s architecture enables fast, reliable data transmission for faster, more accurate outcomes.
- Data processing at scale: AI processes vast amounts of data. Ethernet’s high-throughput capabilities make it the perfect fit for HPC clusters and distributed AI systems, ensuring smooth, high-volume data transfer.
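Why link speed matters so much for training can be seen with a back-of-the-envelope calculation. In a ring all-reduce, the collective commonly used to synchronise gradients across GPUs, each worker moves roughly 2 × (N − 1)/N times the gradient size per step, so the link rate directly bounds synchronisation time. The model size, precision and GPU count below are assumptions chosen for illustration, not benchmark data.

```python
# Back-of-the-envelope: how long one gradient sync takes over a single
# Ethernet link at different speeds, using the ring all-reduce traffic
# formula 2 * (N - 1) / N * gradient_bytes per worker.

def allreduce_bytes_per_worker(model_bytes: float, n_workers: int) -> float:
    """Traffic each worker sends in one ring all-reduce."""
    return 2 * (n_workers - 1) / n_workers * model_bytes

def transfer_seconds(traffic_bytes: float, link_gbps: float) -> float:
    """Ideal (zero-overhead) time to move the traffic over one link."""
    return traffic_bytes * 8 / (link_gbps * 1e9)

# Assumed example: 70e9 parameters in FP16 (2 bytes each), 1,024 GPUs.
grad_bytes = 70e9 * 2
traffic = allreduce_bytes_per_worker(grad_bytes, 1024)

for gbps in (400, 800):
    print(f"{gbps}G link: {transfer_seconds(traffic, gbps):.2f} s per sync")
```

Even in this idealised model, doubling the port speed from 400G to 800G halves the per-step communication time, which is why the headline Ethernet speeds above translate directly into training throughput.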
Industry leaders are adopting Ethernet
A recent survey by ZK Research and theCUBE Research found that 59% of respondents prefer Ethernet over InfiniBand. Reasons for that preference included an existing Ethernet installed base and in-house skills, cloud vendors already using Ethernet for AI, RoCE providing comparable performance, and concerns about InfiniBand.
Companies like Meta, NVIDIA and Oracle have already adopted Ethernet for large-scale AI applications:
- Meta’s Llama 3.1 Model: Meta trained this cutting-edge AI model using a massive Ethernet RoCE v2 cluster of 24,000 GPUs, demonstrating Ethernet’s ability to support large-scale training without packet loss.
- Oracle’s shift from IB to Ethernet: Oracle’s Exadata system transitioned from IB to Ethernet fabrics to handle demanding workloads, recognising Ethernet’s scalability and simplicity.
- NVIDIA’s adoption: NVIDIA, a member of the Ultra Ethernet Consortium, has embraced Ethernet solutions to support large-scale, multi-GPU clusters efficiently.
Powering AI’s future with Ethernet
AI requires massive data processing, complex algorithms and real-time decision-making, all of which depend on a robust, scalable network fabric. Ethernet has solidified its place as the backbone of AI innovation. Its proven performance, scalability and operational flexibility make it essential for supporting the rapid growth of AI workloads. Unlike InfiniBand, Ethernet offers a more future-proof solution that adapts as AI’s demands grow, enabling seamless infrastructure expansion.
With technologies like enhanced RoCE, Ethernet continues to provide low-latency, high-performance networking for next-generation AI accelerators. Its open ecosystem, high-speed connectivity and innovative congestion management ensure that organisations can confidently drive AI advancements, unlocking the full potential of AI applications and fuelling innovation.