Why slower with larger threadgroup memory?

I'm implementing an optimized matmul on Metal: https://github.com/crynux-ai/metal-matmul/blob/main/metal/1_shared_mem.metal

I notice that performance differs significantly depending on the threadgroup memory length set via [computeEncoder setThreadgroupMemoryLength:atIndex:]. All other lines are exactly the same; the only difference is this parameter.

Matmul performance is roughly 250 GFLOPS if I set 32768 (the maximum allowed on this M1 Max), but 400 GFLOPS if I set 8192.

Why does this happen? How can I optimize it?
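
For reference, the host-side encoding looks roughly like this (a simplified sketch; buffer names, indices, and sizes are illustrative rather than copied verbatim from the repo):

    id<MTLComputeCommandEncoder> computeEncoder = [commandBuffer computeCommandEncoder];
    [computeEncoder setComputePipelineState:pipelineState];
    [computeEncoder setBuffer:bufferA offset:0 atIndex:0];
    [computeEncoder setBuffer:bufferB offset:0 atIndex:1];
    [computeEncoder setBuffer:bufferC offset:0 atIndex:2];

    // The only line that changes between runs: the threadgroup memory reservation
    // backing the kernel's threadgroup-scoped argument at index 0.
    [computeEncoder setThreadgroupMemoryLength:32768 atIndex:0];   // vs. 8192

    [computeEncoder dispatchThreadgroups:MTLSizeMake(gridX, gridY, 1)
                   threadsPerThreadgroup:MTLSizeMake(tgX, tgY, 1)];
    [computeEncoder endEncoding];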

Answered by MikeAlpha in 833966022

Accepted Answer

This is normal. The thing is, every GPU core has only so much threadgroup memory. In the case of newer Apple Silicon GPUs, that's 32 KB.

Now, if you ask that a threadgroup be allocated 32 KB, you are simply telling the GPU "this threadgroup is going to use the whole threadgroup memory available on the core". In effect, you are limiting the number of concurrently executing threadgroups to one.

Smaller allocations increase that number. So, for example, with 16 KB the core will be able to run two independent threadgroups, with 8 KB four, and so on.

Of course, the performance of your code also depends on the amount of threadgroup memory it can use. So optimisation in this case is about finding the sweet spot where your code can still use a reasonable amount of threadgroup memory in every threadgroup, while at the same time the GPU can run enough threadgroups concurrently.
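
A rough way to see that limit from the host side (just a sketch; the variable names are illustrative, and real occupancy also depends on register usage and threadgroup size):

    // Estimate how many threadgroups can co-reside on a core, based on
    // threadgroup memory alone.
    NSUInteger perCoreLimit   = device.maxThreadgroupMemoryLength;            // 32 KB on M1-class GPUs
    NSUInteger dynamicBytes   = 8192;                                         // what you pass to setThreadgroupMemoryLength:atIndex:
    NSUInteger staticBytes    = pipelineState.staticThreadgroupMemoryLength;  // threadgroup memory declared inside the kernel
    NSUInteger perThreadgroup = dynamicBytes + staticBytes;
    NSUInteger concurrent     = perCoreLimit / perThreadgroup;                // 1 for 32 KB, 4 for 8 KB (with no static usage)
    NSLog(@"Up to %lu threadgroup(s) per core fit their threadgroup memory",
          (unsigned long)concurrent);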

The answer would be "usually, no". The basic idea underlying modern GPUs is that they try to hide memory latency with A LOT of parallelism. The cores are usually designed to execute several threadgroups at the same time. Some hardware (not sure about Apple Silicon, but I have seen several CUDA machines like this) wants you to have several threadgroups AND several SIMD groups/warps in each threadgroup to reach peak efficiency.
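
If you want to make sure each threadgroup contains several SIMD groups, you can derive the threadgroup size from the pipeline state. A sketch, assuming the factor of 4 is just an example target:

    // Pick a threadgroup size that is a whole number of SIMD groups
    // (4 of them here, as an example), within the pipeline's limit.
    NSUInteger simdWidth  = pipelineState.threadExecutionWidth;           // 32 on current Apple GPUs
    NSUInteger maxThreads = pipelineState.maxTotalThreadsPerThreadgroup;
    NSUInteger groups     = MIN(4, maxThreads / simdWidth);
    MTLSize threadsPerThreadgroup = MTLSizeMake(simdWidth, groups, 1);    // e.g. 32 x 4 = 128 threads = 4 SIMD groups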

My understanding is that you even proved this yourself: you got 250 GFLOPS when asking for 32 KB and 400 GFLOPS when asking for 8 KB of threadgroup memory. That memory request translates into "I want only one threadgroup to run on each core" vs. "I am OK with up to 4 threadgroups running on each core".

If I were you, I would immediately try 4 KB and 16 KB as well, i.e. up to 8 and up to 2 threadgroups per core, and then check performance.
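
A minimal timing sweep could look like this (runMatmulOnce is a hypothetical stand-in for your existing encode/commit path, parameterised only by the threadgroup memory length, and N is the matrix dimension):

    // Hypothetical sweep over candidate threadgroup memory lengths.
    NSUInteger candidates[] = { 4096, 8192, 16384, 32768 };
    double flops = 2.0 * (double)N * (double)N * (double)N;   // FLOPs for an N x N matmul

    for (int i = 0; i < 4; i++) {
        // runMatmulOnce is a placeholder for your encode/dispatch code.
        id<MTLCommandBuffer> cb = runMatmulOnce(queue, pipelineState, candidates[i]);
        [cb waitUntilCompleted];
        double seconds = cb.GPUEndTime - cb.GPUStartTime;     // GPU-side timestamps
        NSLog(@"%5lu bytes -> %.1f GFLOPS",
              (unsigned long)candidates[i], flops / seconds / 1e9);
    }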
