
GPU Comparison for AI Training
Compare RTX 4090, A100, H100, and H200 for AI training, inference, and rentals to find the best fit for your workload.
When a team picks a GPU for training or inference, one of two things usually happens: either they buy the most expensive card “for future growth” and then spend six months watching it sit at 30% utilization, or they save money on VRAM and hit an OOM error halfway through an epoch. Both scenarios are about money: in the first case you overpay for hardware, and in the second you overpay for engineer time.
This article is an attempt to put in one place what people usually have to piece together from Reddit threads, benchmark comparisons, and hard-earned experience. No hardware ads here, just a practical “task → GPU → reasonable rental price” guide.
Why GPU choice matters
A common mistake is choosing a GPU by release year and TFLOPS alone. On paper, H100 is much faster than A100, and A100 is faster than RTX 4090. In practice, the real gain depends on what your pipeline is actually bottlenecked by.
If you fine-tune a 7B model in FP16 with a batch size that fits into 24 GB of VRAM, the speed difference between RTX 4090 and H100 on the same task will be about 2–3x — but H100 rental is often 5–8x more expensive. The math is simple: for the same budget, you can run three or four experiments on a 4090 instead of one on an H100. That matters a lot during the research phase.
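A quick back-of-the-envelope check makes the tradeoff concrete. The hourly rates, run time, and 2.5x speedup below are illustrative assumptions, not benchmarks:

```python
# Back-of-the-envelope: experiments per budget, with assumed numbers.
budget = 200.0           # USD for this round of experiments
hours_per_run_4090 = 10  # one fine-tuning run on RTX 4090 (assumed)
speedup_h100 = 2.5       # assumed H100 speedup on this workload

price_4090, price_h100 = 0.40, 3.00  # USD/hr, illustrative rental rates

runs_4090 = budget / (hours_per_run_4090 * price_4090)
runs_h100 = budget / ((hours_per_run_4090 / speedup_h100) * price_h100)

print(f"RTX 4090: {runs_4090:.0f} runs, H100: {runs_h100:.0f} runs")
# RTX 4090: 50 runs, H100: 17 runs -> ~3x more experiments per dollar
```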
The opposite case is training a 70B-parameter model. Here, RTX 4090 is not a real option: 24 GB of VRAM is not enough even for the weights, let alone the optimizer and activations. In that case, H100 with 80 GB HBM3 or H200 with 141 GB HBM3e is not a luxury — it is a requirement.
What to check besides FLOPS
- VRAM determines how large a model and batch size you can load at all. For LLMs, this is usually the first limit you hit.
- Memory bandwidth tells you how much data the GPU can move between memory and compute units per second. For most transformer inference workloads, memory bandwidth matters more than raw compute: H200 at 4.8 TB/s versus A100 at 2 TB/s can speed up LLM inference from memory alone, without any new tensor cores (a rough ceiling for this effect is sketched after this list).
- Tensor cores and format support matter too. H100 and H200 support FP8, while A100 does not. If your stack already uses FP8, Hopper-class GPUs can deliver a real 2x gain. If you are using FP16 or BF16, the difference is smaller.
- NVLink and interconnect matter more than the card itself for multi-GPU training. SXM versions of A100 and H100 connect through NVLink at hundreds of GB/s, while PCIe versions rely on the PCIe bus, which becomes a bottleneck in multi-GPU setups.
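To see why bandwidth dominates autoregressive decoding, here is a minimal sketch of the bandwidth ceiling for batch-1 generation. It assumes every generated token streams all weights from VRAM once, and it ignores KV-cache traffic and kernel overheads, so real throughput will be lower:

```python
# Upper bound for batch-1 decode speed when inference is memory-bound:
# each generated token has to read every model weight from VRAM.
def max_tokens_per_sec(params_b: float, bytes_per_param: float,
                       bandwidth_tb_s: float) -> float:
    model_bytes = params_b * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / model_bytes

# A 13B model in FP16 (2 bytes/param) on different cards:
for gpu, bw in [("A100", 2.0), ("H100", 3.35), ("H200", 4.8)]:
    print(f"{gpu}: ~{max_tokens_per_sec(13, 2, bw):.0f} tok/s ceiling")
# A100: ~77 tok/s, H100: ~129 tok/s, H200: ~185 tok/s
```

The ceiling scales linearly with bandwidth, which is exactly the H200-versus-A100 effect described in the list above.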
GPU profiles
The prices below are median rental prices on the market at the start of 2026, based on GPU marketplace aggregators. Prices can easily vary by 2–3x between providers.
| GPU | VRAM | Memory bandwidth | Tensor cores | Typical rental price | Best for |
|---|---|---|---|---|---|
| RTX 4090 | 24 GB GDDR6X | 1.0 TB/s | 4th gen, FP16/BF16 | $0.30–0.60/hr | Inference, fine-tuning up to 7B, rendering |
| A100 PCIe 40/80 GB | 40 or 80 GB HBM2e | 1.5–2.0 TB/s | 3rd gen | $0.80–1.60/hr | Mid-size model training, scientific compute |
| A100 SXM 80 GB | 80 GB HBM2e | 2.0 TB/s | 3rd gen, NVLink 600 GB/s | $1.20–2.20/hr | Multi-GPU training for 7B–30B models |
| H100 SXM 80 GB | 80 GB HBM3 | 3.35 TB/s | 4th gen, FP8, NVLink 900 GB/s | $2.50–4.50/hr | 30B+ LLM training, FP8 stacks |
| H200 SXM | 141 GB HBM3e | 4.8 TB/s | 4th gen, FP8 | $3.50–6.00/hr | 70B+ LLMs, long context, large-model inference |
In the mid-range, it is also worth knowing about A10 (24 GB, around $0.40–0.80/hr) — a workhorse for inference on medium-sized models — and L40S (48 GB, around $1.00–1.80/hr) — a compromise between memory size and cost, often a sensible alternative to A100 PCIe for inference and light training.
A100 PCIe vs SXM deserves a separate note: the difference is not only NVLink. SXM versions have higher TDP and keep frequencies more stable under load. For a single card, the performance gap is about 5–10%; for an 8-GPU setup, it can reach 30–40% thanks to faster inter-GPU communication.
Which GPU for which task
Training large LLMs
For 30B–70B+ models, the choice narrows to H100 and H200, often in multi-GPU configurations. A 70B model with Adam in mixed precision needs roughly 1.1 TB of memory for weights, gradients, and optimizer states alone, before activations. That is more than a single 8x H100 80 GB node (640 GB) can hold without sharding across two nodes or offloading, while an 8x H200 node (1,128 GB) just fits.
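Here is where the 1.1 TB figure comes from, as a minimal sketch assuming standard mixed-precision Adam: 2 bytes each per parameter for FP16 weights and gradients, plus 12 bytes for the FP32 master copy and the two moment tensors:

```python
# Rough training-memory estimate for full training with Adam in mixed
# precision: FP16 weights + grads, FP32 master weights and two moments.
def training_memory_gb(params_b: float) -> float:
    n = params_b * 1e9
    return (2 * n + 2 * n + 12 * n) / 1e9  # weights + grads + Adam states

print(f"70B model: ~{training_memory_gb(70):,.0f} GB before activations")
# 70B model: ~1,120 GB -> above one 8x H100 node (640 GB),
# just under one 8x H200 node (1,128 GB)
```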
H200 is especially attractive for long-context work and large-model inference because its 141 GB VRAM and 4.8 TB/s bandwidth reduce memory pressure and speed up inference.
As for A100 vs H100 debates for big-model training, A100 is still relevant for models up to about 13B–30B, especially if your pipeline is already tuned for BF16. But beyond that, H100 pays for itself by reducing training time. A job that takes three weeks on 8x A100 can fit into about one to one and a half weeks on 8x H100, even though the rental cost is around 1.8–2.2x higher.
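The break-even logic is worth writing down: the faster card wins on total cost whenever its speedup exceeds its price ratio. Plugging in the rough numbers above:

```python
# A faster GPU is cheaper overall when speedup > price ratio.
speedup     = 3.0 / 1.25  # ~3 weeks on 8x A100 vs ~1.25 weeks on 8x H100
price_ratio = 2.0         # H100 rental at ~1.8-2.2x the A100 price

cost_ratio = price_ratio / speedup
print(f"H100 bill is {cost_ratio:.0%} of the A100 bill")
# 83% -> slightly cheaper overall, and the result arrives ~2.4x sooner
```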
Fine-tuning medium models
This is where RTX 4090 becomes a genuinely reasonable choice, especially for research iterations. With QLoRA or 4-bit quantization, Llama-7B can be fine-tuned comfortably on a single 4090. The speed is lower than on A100, but rental cost is usually 3–5x lower, which is a strong tradeoff for prototyping and hyperparameter tuning.
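A minimal QLoRA setup sketch with transformers, peft, and bitsandbytes; the model name and LoRA hyperparameters are illustrative, not a recommendation:

```python
# QLoRA sketch: 4-bit NF4 base model, small trainable LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize base weights to 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in BF16
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # illustrative 7B checkpoint
    quantization_config=bnb,
    device_map="auto",
)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed choice
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of weights
```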
A100 80 GB starts to make sense when you need full fine-tuning without quantization, when the model is closer to 13B, or when batch size is critical for quality. If you are running multiple experiments in parallel, four RTX 4090s for the price of one H100 often deliver more total useful time than the single H100, especially when the tasks are independent and do not require inter-card communication.
Low-latency inference
There is no simple answer here — it depends on model size and latency requirements.
For models up to 7B in production inference, A10 and L40S often win on cost per request versus A100 and especially H100. Their compute is weaker, but for inference that is rarely the bottleneck; memory bandwidth and KV-cache handling matter more.
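The comparison that actually decides this is cost per generated token. The throughput figures below are assumptions for illustration, not benchmarks:

```python
# Cost per million generated tokens: the metric that usually picks
# between A10, L40S, and H100 for serving. Throughputs are assumed.
def usd_per_million_tokens(price_per_hr: float, tokens_per_sec: float) -> float:
    return price_per_hr / (tokens_per_sec * 3600) * 1e6

# Illustrative batched 7B-model serving numbers:
for gpu, price, tps in [("A10", 0.60, 600), ("L40S", 1.40, 1200),
                        ("H100", 3.50, 3000)]:
    print(f"{gpu}: ${usd_per_million_tokens(price, tps):.2f} per 1M tokens")
# A10: $0.28, L40S: $0.32, H100: $0.32 -- extra compute headroom does
# not automatically lower cost per request
```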
For 13B–70B inference, H100 and especially H200 pull ahead. FP8 can deliver real throughput gains over FP16 on A100, and H200 can fit larger models more comfortably without forcing tensor parallelism.
RTX 4090 can be used for production inference, but with caution: NVIDIA’s GeForce licensing terms may restrict commercial data-center use. For research and internal workloads, it is fine; for commercial service, data-center GPUs are the safer choice.
3D rendering and diffusion
RTX 4090 is the best value for money here. Stable Diffusion XL, Flux, and video generation are mostly compute-bound and benefit strongly from FP16, where the 4090 performs very well. A100 and H100 are often overkill for these tasks and do not deliver proportional gains.
Scientific compute and simulation
It depends on whether you need FP64. If you do, A100 and H100 are the only sensible choices, since GeForce cards are heavily limited in FP64. If your workload is FP32 or FP16, RTX 4090 can still be competitive.
Rental pricing
The least obvious part of GPU rental is how much pricing for identical hardware varies between providers. At the start of 2026, H100 SXM rates ranged from roughly $4.5–6 per GPU-hour on hyperscalers like AWS p5 and GCP A3, to about $2.5–3.5 per hour on specialized GPU clouds, and sometimes $1.8–2.5 per hour with smaller providers.
The difference between $1.8 and $5 per hour is 2.7x for the same hardware. Over a month of continuous rental for one card, that is about $2,300 difference. On an 8-card setup, that can exceed $18,000 per month. For a startup, that is often the difference between surviving to the next round and not making it there.
There are several reasons for this spread: hyperscale clouds charge for brand, SLA, and ecosystem integration; specialized GPU providers can price lower because GPUs are their entire business; and smaller providers in regions with cheaper electricity can offer prices large players cannot match. Marketplaces like QuData help compare offers from many providers in one place, saving time and reducing the hassle of searching manually.
How to decide fast
Use this checklist; a toy triage sketch for the first step follows below.
- Model size: calculate the VRAM you need with a 1.5–2x buffer for the optimizer and activations.
- Task: training large models from scratch means Hopper and multi-GPU setups; fine-tuning medium models means A100 or RTX 4090 depending on the method; inference means choosing between A10, L40S, and H100 based on model size and latency needs.
- Budget: in a research phase with dozens of experiments, saving on hardware almost always makes sense; in production with stable load, SLA and availability matter more than the last 20% of price.
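A minimal sketch of that first triage step. The 1.75x buffer and the GPU shortlist are assumptions, not rules; adjust for your optimizer, sequence length, and quantization:

```python
# Toy triage for the first checklist step: required VRAM with buffer.
# The 1.75x buffer and this GPU shortlist are assumptions, not rules.
GPUS = {"RTX 4090": 24, "L40S": 48, "A100/H100 80GB": 80, "H200": 141}

def smallest_fit(model_gb: float, buffer: float = 1.75) -> str:
    need = model_gb * buffer
    fits = [gpu for gpu, vram in GPUS.items() if vram >= need]
    return fits[0] if fits else "multi-GPU / sharding territory"

print(smallest_fit(3.5))  # 7B in 4-bit (~3.5 GB)  -> RTX 4090
print(smallest_fit(140))  # 70B in FP16 (~140 GB)  -> multi-GPU / sharding territory
```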
Most importantly, do not pick H100 “just in case.” If your model fits comfortably on an RTX 4090 and you are not compute-bound, paying extra for Hopper is just overpaying.