Learning About GPUs Through Measuring Memory Bandwidth

Author: Manon Oomen
Date: June 13, 2025
Reading time: 30 minutes

A shelf with various graphics cards next to each other with the display ports facing the camera.

In this article, we share what we learned about different GPU architectures by measuring their memory bandwidth with our microbenchmarks.

Findings

Running the microbenchmarks on various hardware has been very insightful, from learning about quirks in different chips to finding surprising bottlenecks in our microbenchmarks themselves.

IMPORTANT NOTE: These benchmarks are not a comparison of which GPU is ‘better’. They are just measurements to learn about the different GPU architectures; for this reason, we have opted to use GPUs from different price brackets. We are not interested in ‘this number is higher on this GPU’, but in which patterns are slow or fast on a given piece of hardware. If you are interested in comparing the real-world performance of GPUs, the Evolve benchmark suite is great for this!

Qualcomm Adreno 740 (Meta Quest 3)

Let’s start off with a mobile chip: the Qualcomm Adreno 740 in the Meta Quest 3. This GPU has 12 Shader Processors, each of which consists of 2 micro Shader Processors. We find a block of 1 KiB of L1 cache per micro Shader Processor. Additionally, there is a block of shared L2 cache, and further on a system-level cache, which is usually shared with the CPU. From our measurements we see drop-offs in read bandwidth after 128 KiB, 1 MiB, and 8 MiB. For the write bandwidth we see a clear drop-off in performance at 512 KiB, but further drop-offs after that are not quite as clear.
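
To give an idea of how we find these drop-off points, here is a minimal sketch of a working-set sweep, written in CUDA purely for illustration (the Adreno 740 is of course not a CUDA device, and this is not our actual benchmark code): read buffers of growing size, time the reads, and watch where the bandwidth falls off. The grid size and iteration count are arbitrary placeholder values.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Read every float4 in `buf` and fold it into a dummy accumulator so the
// compiler cannot remove the loads. `sink` is written behind a condition
// that never triggers, keeping the result live without adding write traffic.
__global__ void read_kernel(const float4* __restrict__ buf, float4* sink,
                            size_t n) {
    float4 acc = make_float4(0.f, 0.f, 0.f, 0.f);
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x) {
        float4 v = buf[i];
        acc.x += v.x; acc.y += v.y; acc.z += v.z; acc.w += v.w;
    }
    if (acc.x == -1.f) sink[0] = acc;
}

int main() {
    const int iters = 100;  // repeat to amortize launch overhead
    float4* sink;
    cudaMalloc(&sink, sizeof(float4));
    // Sweep the working-set size; bandwidth drops each time we fall out of
    // a cache level (e.g. past 128 KiB, 1 MiB, and 8 MiB on the Adreno 740).
    for (size_t kib = 32; kib <= 64 * 1024; kib *= 2) {
        size_t bytes = kib * 1024, n = bytes / sizeof(float4);
        float4* buf;
        cudaMalloc(&buf, bytes);
        cudaMemset(buf, 1, bytes);
        cudaEvent_t beg, end;
        cudaEventCreate(&beg);
        cudaEventCreate(&end);
        cudaEventRecord(beg);
        for (int it = 0; it < iters; ++it)
            read_kernel<<<1024, 256>>>(buf, sink, n);
        cudaEventRecord(end);
        cudaEventSynchronize(end);
        float ms = 0.f;
        cudaEventElapsedTime(&ms, beg, end);
        printf("%6zu KiB: %8.1f GiB/s\n", kib,
               (double)bytes * iters / (ms * 1e-3) / (1u << 30));
        cudaFree(buf);
        cudaEventDestroy(beg);
        cudaEventDestroy(end);
    }
    cudaFree(sink);
    return 0;
}
```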

Beyond the differences in bandwidth between the caches, we found some interesting data about using buffers vs textures. Across many GPUs it is fairly common to see a bandwidth difference for data that is in cache. However, on the Quest 3 we found a significant bandwidth difference in main memory. It is significant enough that, when designing software specifically for this hardware, it would be worth it to replace large buffers with textures to achieve a near 3x bandwidth improvement, from around 46 GiB/s to over 130 GiB/s.

Also interesting to note is that we hit the highest VRAM bandwidth of 143 GiB/s when we mix in 25% buffer loads. Additionally, we can see that using buffers is fine as long as we manage to hit the cache; keep in mind that this benchmark measures bandwidth, not latency. When writing code we would recommend using whichever resource type is most natural, but when optimizing for the Meta Quest 3 you might want to replace specific large buffers with textures to get that higher bandwidth.
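
To make the comparison concrete, here is a rough sketch of the two read paths, again in CUDA terms with made-up names (on the Quest 3 itself this would be a storage buffer vs a sampled image in Vulkan or GLES): the same float4 data is read once through a plain buffer pointer and once through the texture hardware.

```cuda
#include <cuda_runtime.h>

// Path 1: plain buffer loads through a raw pointer.
__global__ void read_via_buffer(const float4* __restrict__ buf, float4* sink,
                                size_t n) {
    float4 acc = make_float4(0.f, 0.f, 0.f, 0.f);
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x) {
        float4 v = buf[i];
        acc.x += v.x; acc.y += v.y; acc.z += v.z; acc.w += v.w;
    }
    if (acc.x == -1.f) sink[0] = acc;
}

// Path 2: the exact same data, but read through the texture pipeline.
__global__ void read_via_texture(cudaTextureObject_t tex, float4* sink,
                                 size_t n) {
    float4 acc = make_float4(0.f, 0.f, 0.f, 0.f);
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x) {
        float4 v = tex1Dfetch<float4>(tex, (int)i);
        acc.x += v.x; acc.y += v.y; acc.z += v.z; acc.w += v.w;
    }
    if (acc.x == -1.f) sink[0] = acc;
}

// Wrap an existing device buffer in a texture object so both kernels read
// the very same memory.
cudaTextureObject_t wrap_in_texture(float4* d_buf, size_t bytes) {
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = d_buf;
    res.res.linear.desc = cudaCreateChannelDesc<float4>();
    res.res.linear.sizeInBytes = bytes;
    cudaTextureDesc td = {};
    td.readMode = cudaReadModeElementType;
    cudaTextureObject_t tex;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);
    return tex;
}
```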

Read Bandwidth (GiB/s) on the Meta Quest 3 (Qualcomm Adreno 740)
Write Bandwidth (GiB/s) on the Meta Quest 3 (Qualcomm Adreno 740)
AMD RX 9070 XT

AMD has excellent documentation on what their hardware looks like. With the introduction of the RDNA architecture they released a white paper detailing the major differences, and with each new iteration of the architecture they release slides covering what is new. On top of that, we also get a fully documented ISA full of little hardware details.

An AMD RDNA 4 GPU is built out of a number of Shader Engines. Each Shader Engine consists of two Shader Arrays, and each Shader Array contains a number of Work Group Processors. In the case of the RX 9070 XT there are a total of 4 Shader Engines with 4 Work Group Processors per Shader Array, for 4 × 2 × 4 = 32 Work Group Processors in total. A Work Group Processor consists of 4 SIMDs divided into two Dual Compute Units. In a Work Group Processor we find 16 KiB of scalar cache, 32 KiB of instruction cache, and 128 KiB for groupshared memory, as well as two blocks of 32 KiB L0 cache, one for each Dual Compute Unit. Per Shader Array we find 256 KiB of L1 cache. This cache also handles coherency, gathering the load/store requests before they move on to the 8 MiB of L2 cache. Finally, RDNA 4 has 64 MiB of Infinity Cache, which on RDNA 3 lived on separate chiplets but is back on a monolithic die in RDNA 4.

When benchmarking the RX 9070 XT, we ran into an interesting problem. When measuring the bandwidth, we noticed a significant difference in performance when loading floating-point values vs integers. This had us quite confused since, from a hardware perspective, it should not matter whether you load an integer or a float.

It turns out we were running into an ALU bottleneck in our bandwidth benchmark! Just performing one integer multiplication per load was enough to be a bottleneck. When we replaced our dummy multiply op with an addition, we were able to measure the same bandwidth for both integer and floating-point values.
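
In kernel form, the difference is deceptively small. In the CUDA-flavored sketch below (assumed names, not our actual benchmark code), each loaded value has to feed some dummy computation so the compiler cannot eliminate the loads, and the choice of that dummy op is exactly what bit us:

```cuda
// Variant 1: one integer multiply per loaded component. On the RX 9070 XT
// this was enough to turn the bandwidth benchmark into an ALU benchmark.
__global__ void read_bw_mul(const uint4* __restrict__ buf, uint4* sink,
                            size_t n) {
    uint4 acc = make_uint4(1u, 1u, 1u, 1u);
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x) {
        uint4 v = buf[i];
        acc.x *= v.x; acc.y *= v.y; acc.z *= v.z; acc.w *= v.w;
    }
    if (acc.x == 0xFFFFFFFFu) sink[0] = acc;  // keep `acc` live
}

// Variant 2: one integer add per loaded component. Cheap enough to stay out
// of the way, so integers and floats now measure the same bandwidth.
__global__ void read_bw_add(const uint4* __restrict__ buf, uint4* sink,
                            size_t n) {
    uint4 acc = make_uint4(0u, 0u, 0u, 0u);
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x) {
        uint4 v = buf[i];
        acc.x += v.x; acc.y += v.y; acc.z += v.z; acc.w += v.w;
    }
    if (acc.x == 0xFFFFFFFFu) sink[0] = acc;
}
```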

I suppose we did not expect AMD’s L0 cache to be that fast. We managed to measure it at nearly 20 TiB/s when loading from a buffer. When loading from a storage texture, we only managed to reach 11 TiB/s.

Read Bandwidth (GiB/s) on the AMD RX 9070 XT
Write Bandwidth (GiB/s) on the AMD RX 9070 XT
Intel Arc B580

Battlemage is Intel's second discrete GPU architecture. When Battlemage was announced, Intel released some diagrams describing what their new GPU looks like, giving us some interesting insights.

An overview of the Intel B580 GPU architecture

The Battlemage GPU architecture is built out of Render Slices, 5 in the case of the B580. Each Render Slice contains 4 Xe Cores, as well as 4 Ray Tracing Units, 4 Samplers, and some blocks for rasterization. Inside each Xe Core we find 8 Xe2 Vector Engines, 96 KiB of instruction cache, and 256 KiB of L1 cache. The L1 block is also used for groupshared memory, with the block partitioned between the two as needed. The main block of L2 cache is 18 MiB in size.

A zoomed-in view of an Intel B580 Render Slice

The B580 surprised us when comparing 4-byte vs 16-byte loads from textures vs buffers. We found that for buffers, bandwidth goes down for larger data types, while for textures, bandwidth goes up for larger data types. This suggests that the data paths for buffers and textures, at least in the caches, are optimized for different use cases.

A second interesting find is that when reading or writing 12-byte elements from and to a buffer, performance increases as we move to a later cache level. We are not sure why this is the case, but it is interesting nonetheless.
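
For reference, here is roughly what these element widths look like as buffer-read kernels, sketched in CUDA with assumed names; the 12-byte case corresponds to a float3-style element that has no natural power-of-two load width:

```cuda
// 4-byte elements: one scalar load per iteration.
__global__ void read_4B(const float* __restrict__ buf, float* sink, size_t n) {
    float acc = 0.f;
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x)
        acc += buf[i];
    if (acc == -1.f) sink[0] = acc;
}

// 12-byte elements: float3 has no native 12-byte load, so the hardware sees
// a less regular access pattern. On the B580 this case got *faster* at
// later cache levels, which is the behavior we cannot yet explain.
__global__ void read_12B(const float3* __restrict__ buf, float* sink,
                         size_t n) {
    float acc = 0.f;
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x) {
        float3 v = buf[i];
        acc += v.x + v.y + v.z;
    }
    if (acc == -1.f) sink[0] = acc;
}

// 16-byte elements: a single vectorized float4 load per iteration.
__global__ void read_16B(const float4* __restrict__ buf, float* sink,
                         size_t n) {
    float acc = 0.f;
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x) {
        float4 v = buf[i];
        acc += v.x + v.y + v.z + v.w;
    }
    if (acc == -1.f) sink[0] = acc;
}
```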

Read Bandwidth (GiB/s) on the Intel Arc B580
Write Bandwidth (GiB/s) on the Intel Arc B580
NVIDIA RTX 5070 Ti

NVIDIA's Blackwell architecture is by far the largest of all the current-generation GPUs. When the new architecture was released, they also released a white paper describing all the changes, as well as some very useful diagrams. The diagram they released is of their full GB202 chip, of which a slightly cut-down version is found in the RTX 5090, likely due to binning.

The RTX 5070 Ti we tested uses the same architecture but a smaller version of the chip. On the chip we find 6 Graphics Processing Clusters (GPCs), 35 Texture Processing Clusters (TPCs), and 70 Streaming Multiprocessors (SMs). On the GPU we find 48 MiB of L2 cache, and as we zoom into the SMs we find 128 KiB of L1 cache on each of them. The L1 block is also used for groupshared memory and can be split up in 10 different configurations, ranging from 0 KiB to 228 KiB used for groupshared.
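
On NVIDIA hardware this split is even exposed as a per-kernel hint in CUDA. Below is a small sketch (the kernel and sizes are made up for illustration); the driver rounds the requested carveout to the nearest configuration the hardware supports:

```cuda
#include <cuda_runtime.h>

// A trivial kernel that stages data through dynamically sized groupshared
// memory, just so there is something for the carveout to apply to.
__global__ void tile_kernel(const float* in, float* out, int n) {
    extern __shared__ float tile[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.f;
    __syncthreads();
    if (i < n) out[i] = tile[threadIdx.x] * 2.f;
}

int main() {
    // Ask for roughly half of the unified L1/groupshared block as
    // groupshared; the driver picks the nearest supported split.
    cudaFuncSetAttribute(tile_kernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout, 50);

    const int n = 1 << 20, block = 256;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    tile_kernel<<<(n + block - 1) / block, block, block * sizeof(float)>>>(
        in, out, n);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```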

In benchmarking the RTX 5070 Ti, we found that many writes to the same small piece of memory cause a large bottleneck, reducing performance massively. To be fair, this is not a pattern with any real-world computation use, but it is interesting to see, and it makes us wonder what exactly is happening in the hardware to cause this.
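
The degenerate pattern in question looks roughly like the CUDA sketch below; the 16-element region is an arbitrary stand-in for "a small piece of memory":

```cuda
// Every thread in the grid keeps storing into the same 64-byte region, so
// all stores in flight compete for a couple of cache lines. On the
// RTX 5070 Ti this collapsed write throughput far more than we expected.
__global__ void hammer_small_region(unsigned int* buf, int iters) {
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for (int it = 0; it < iters; ++it)
        buf[(tid + it) & 15] = tid ^ it;  // address stays within 16 uints
}
```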

Another interesting find is that we saw a large decrease in read bandwidth when reading a uint1 from a buffer vs a float1. Earlier, on AMD's RX 9070 XT, we found that integer multiplies caused an ALU bottleneck. This case is different, as only the buffer loads are affected, not the texture loads. We reran the microbenchmark but got the same results back. So far we have not been able to explain why there is a difference in performance in this specific case.

Read Bandwidth (GiB/s) on the NVIDIA RTX 5070 Ti
Write Bandwidth (GiB/s) on the NVIDIA RTX 5070 Ti

Conclusion

With our new bandwidth microbenchmark we are just scratching the surface of what we can discover. So far we have found that some paths can be faster on certain hardware, like using textures instead of buffers on the Meta Quest 3, or that loading 12-byte elements on Intel doesn't do well in the caches closest to the shader cores. However, we also found patterns that still remain a mystery to us. As we further develop our microbenchmarks, we hope to uncover more insights into the hardware.

Generally we don't let one pattern being slower or faster on one specific vendor affect the way we write code, as we don't want to favor one GPU vendor over another. However, the common patterns are certainly helpful. Additionally, for projects targeting specific hardware that are not Evolve related, this type of data can help us squeeze every bit of performance out.

We already have a lot of ideas for more microbenchmarks we want to build and try out. When we find more interesting results, we will share them in another blog post, so keep your eyes out for that!

