Overview
AI Acceleration Without Vendor Lock-In
The Intel Gaudi 3 PCIe card (HL-338) packs 128GB of HBM2e memory, 3.7 TB/s of memory bandwidth, and 24 on-chip 200GbE RoCE v2 networking ports into a standard dual-slot PCIe Gen5 card. That last point is what sets it apart from every other accelerator in this lineup: the networking is built into the chip, not bolted on as separate NICs.
This means you can scale Gaudi 3 clusters using the Ethernet switches you already own instead of investing in proprietary interconnects. It supports PyTorch natively, integrates with Hugging Face and vLLM, and handles LLM inference, fine-tuning, and training workloads with automated FP8 quantization.
Gaudi 3 PCIe vs. H100 NVL PCIe
The buyer's real question: open Ethernet scaling with more memory, or the established CUDA ecosystem?
| | Gaudi 3 PCIe | H100 NVL PCIe |
|---|---|---|
| Memory | 128GB HBM2e | 94GB HBM3 |
| Bandwidth | 3.7 TB/s | 3.9 TB/s |
| BF16 | 1,835 TFLOPS | 1,671 TFLOPS* |
| FP8 | 1,835 TFLOPS | 3,341 TFLOPS* |
| On-Chip NICs | 24× 200GbE | None |
| TDP | 600W | 350-400W |
| Software | PyTorch (native) | CUDA |
*H100 NVL figures are with sparsity, per NVIDIA's published specifications.
- 128GB HBM2e Memory
- 24 On-Chip 200GbE Ports
- 1.8 PFLOPS FP8 / BF16
- 600W PCIe Gen5 Dual-Slot
Compatible Servers
Dell PowerEdge Server for Gaudi 3 PCIe
Dell is the lead OEM and first to market with an integrated Gaudi 3 PCIe server configuration.
Dell PowerEdge XE7740
- 4U server, up to 8× Intel Gaudi 3 PCIe accelerators
- Optionally, 2 groups of 4-way bridged accelerators (RoCE v2)
- 1:1 accelerator-to-NIC ratio via 8 full-height PCIe slots + OCP module
- Air-cooled, fits ~10kW racks without cooling upgrades
- Optimized for Llama, DeepSeek, Phi, Qwen, Falcon, and more
- Dell Smart Cooling, OpenManage Enterprise, APEX AIOps
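To put the ~10kW rack figure in context, here is a minimal power-budget sketch. It uses the 8-card maximum and the 600W per-card TDP quoted on this page; the `host_overhead_w` parameter (CPUs, memory, fans, storage) is a hypothetical placeholder, since no host power figure is given here.

```python
def accelerator_power_w(cards: int = 8, tdp_w: int = 600) -> int:
    """Worst-case accelerator draw: every card running at its full TDP."""
    return cards * tdp_w


def remaining_rack_budget_w(
    rack_budget_w: int = 10_000,
    cards: int = 8,
    tdp_w: int = 600,
    host_overhead_w: int = 0,  # hypothetical: fill in from your own server sizing
) -> int:
    """Watts left in the rack budget after accelerators and host overhead."""
    return rack_budget_w - accelerator_power_w(cards, tdp_w) - host_overhead_w


if __name__ == "__main__":
    print("Accelerators at full TDP:", accelerator_power_w(), "W")
    print("Left in a 10 kW budget before host overhead:", remaining_rack_budget_w(), "W")
```

Eight cards at full TDP account for 4.8kW, which leaves roughly half of a 10kW budget for the rest of the server. Actual draw depends on workload and configuration.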
Use Cases
Where Gaudi 3 PCIe Fits
LLM Inference & Fine-Tuning
128GB of HBM2e means larger models fit in memory without model parallelism overhead. The 24 integrated 200GbE ports eliminate the need for separate NICs, reducing cost and latency in multi-node inference clusters. Native vLLM and Hugging Face support with automated FP8 quantization makes deployment straightforward for popular models including Llama, DeepSeek, and Falcon.
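As a rough illustration of the memory claim above, the sketch below estimates weight-only footprints for a given parameter count and data type and compares them with card memory. The 20% headroom reserved for KV cache and activations and the 45B example model are assumptions made for illustration, not Intel or NVIDIA figures.

```python
# Bytes needed to store one parameter in each data type.
BYTES_PER_PARAM = {"fp8": 1, "bf16": 2, "fp32": 4}


def weight_footprint_gb(params_billion: float, dtype: str) -> float:
    """Weight-only footprint in decimal GB (ignores KV cache and activations)."""
    return params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 1e9


def fits_on_card(
    params_billion: float,
    dtype: str,
    card_memory_gb: float = 128.0,
    usable_fraction: float = 0.8,  # assumption: keep 20% free for KV cache etc.
) -> bool:
    """True if the weights fit within the usable share of one card's memory."""
    return weight_footprint_gb(params_billion, dtype) <= card_memory_gb * usable_fraction


if __name__ == "__main__":
    for size, dtype in [(70, "bf16"), (70, "fp8"), (45, "bf16")]:
        gb = weight_footprint_gb(size, dtype)
        print(f"{size}B {dtype}: {gb:.0f} GB weights, "
              f"fits 128GB card: {fits_on_card(size, dtype)}, "
              f"fits 94GB card: {fits_on_card(size, dtype, card_memory_gb=94.0)}")
```

Under these assumptions, a hypothetical 45B-parameter model in BF16 (about 90GB of weights) fits on a single 128GB card but not on a 94GB one, while a 70B model fits on one card only once quantized to FP8.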
Ethernet-Native Scaling
Other accelerators in this lineup rely on separate NICs for inter-node communication. Gaudi 3 integrates 24× 200GbE RoCE v2 ports directly on the chip, for 4.8 Tb/s of aggregate networking bandwidth per card (24 × 200 Gb/s). This means you can build multi-node training and inference clusters using the standard Ethernet switches you already own, without investing in dedicated interconnect hardware like NVLink or InfiniBand.
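The per-card networking figure follows directly from the port count and port speed. A minimal sketch of that arithmetic is below; it treats the numbers as raw line rate and ignores protocol overhead, and whether the figure is per direction is an open assumption because this page does not say.

```python
PORTS_PER_CARD = 24   # on-chip RoCE v2 ports, from the spec table
PORT_SPEED_GBPS = 200  # gigabits per second per port (200GbE)


def aggregate_tbps(ports: int = PORTS_PER_CARD, gbps: int = PORT_SPEED_GBPS) -> float:
    """Aggregate line rate in terabits per second."""
    return ports * gbps / 1000


def aggregate_gbytes_per_s(ports: int = PORTS_PER_CARD, gbps: int = PORT_SPEED_GBPS) -> float:
    """Same aggregate expressed in gigabytes per second (8 bits per byte)."""
    return ports * gbps / 8


if __name__ == "__main__":
    print(f"{aggregate_tbps():.1f} Tb/s aggregate")
    print(f"{aggregate_gbytes_per_s():.0f} GB/s aggregate")
```

Expressing the same number in bytes (600 GB/s) makes it easier to compare against the 3.7 TB/s HBM bandwidth when reasoning about where a multi-node workload is likely to bottleneck.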
Open Software & No Lock-In
Gaudi 3 integrates natively with PyTorch, so your team works with the framework they already know. Hugging Face model hub support and automated FP8 quantization simplify deployment. Unlike proprietary ecosystems, Intel's software stack is open, and the hardware scales over standard Ethernet. For organizations building AI infrastructure that they want to own and control, Gaudi 3 removes the lock-in concern.
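In practice, PyTorch code targets Gaudi through the `hpu` device exposed by the Intel Gaudi PyTorch bridge (the `habana_frameworks` package). The helper below is a hypothetical sketch, not part of Intel's software: it only checks whether that package is installed and falls back to `cpu` otherwise, so the same script can run on a development machine without a Gaudi card.

```python
import importlib.util


def pick_device_name() -> str:
    """Return "hpu" if the Gaudi PyTorch bridge is installed, else "cpu".

    This checks for the package only; it does not verify that a card is present.
    """
    if importlib.util.find_spec("habana_frameworks") is not None:
        return "hpu"
    return "cpu"


if __name__ == "__main__":
    name = pick_device_name()
    print("Selected device:", name)
    # Typical use in a PyTorch script (requires torch; on Gaudi, also import
    # habana_frameworks.torch.core before creating the device):
    #   device = torch.device(name)
    #   model = model.to(device)
```

Keeping device selection in one place means the rest of the model code stays ordinary PyTorch.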
Specifications
Intel Gaudi 3 PCIe (HL-338)
| Specification | Gaudi 3 PCIe |
|---|---|
| Architecture | Intel Gaudi 3 (5nm) |
| Compute Engines | 8 MME + 64 TPC |
| Memory | 128GB HBM2e |
| Memory Bandwidth | 3.7 TB/s |
| On-Die SRAM | 96MB (12.8 TB/s) |
| FP8 | 1,835 TFLOPS |
| BF16 | 1,835 TFLOPS |
| Data Types | FP8, BF16, FP16, TF32, FP32 |
| Networking | 24× 200GbE RoCE v2 on-chip (4.8 Tb/s) |
| Host Interface | PCIe Gen5 x16 |
| TDP | Up to 600W |
| Form Factor | Dual-slot, FHFL PCIe |
| Thermal | Passive |
| Software | Intel Gaudi Software, PyTorch, vLLM, Hugging Face |
Ready to Evaluate Intel Gaudi 3?
Open ecosystem, standard Ethernet, no vendor lock-in. ServerMonkey can configure the Dell PowerEdge XE7740 with Gaudi 3 PCIe.