I'm implementing an optimized matmul kernel in Metal: https://github.com/crynux-ai/metal-matmul/blob/main/metal/1_shared_mem.metal
I notice that performance differs significantly depending on the threadgroup memory length set via [computeEncoder setThreadgroupMemoryLength]. All other lines are exactly the same; the only difference is this parameter.
Matmul performance is roughly 250 GFLOPS if I set 32768 (the maximum allowed on this M1 Max), but 400 GFLOPS if I set 8192.
Why does this happen? How can I optimize it?
This is normal. The thing is, every GPU core has only so much threadgroup memory; in the case of newer Apple Silicon GPUs, that's 32 kB.
Now, if you ask that a threadgroup be allocated 32 kB, you are simply telling the GPU "this one threadgroup is going to use all of the threadgroup memory available on the core". In effect, you are limiting the number of concurrently executing threadgroups to one.
Smaller allocations increase that number: with 16 kB, for example, the core can run two independent threadgroups; with 8 kB, four; and so on.
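Here's a minimal sketch of that arithmetic, assuming (as above) that the per-core threadgroup memory budget equals the device's maxThreadgroupMemoryLength; other limits such as registers also cap occupancy, and the variable names are illustrative only:

```objc
#import <Metal/Metal.h>

int main(void) {
    @autoreleasepool {
        id<MTLDevice> device = MTLCreateSystemDefaultDevice();

        // 32768 bytes on an M1 Max, matching the value from the question.
        NSUInteger perCoreBudget = device.maxThreadgroupMemoryLength;

        // The value you pass to -setThreadgroupMemoryLength:atIndex:.
        NSUInteger perThreadgroupBytes = 8192;

        // How many threadgroups can be resident on one core, considering
        // threadgroup memory alone.
        NSUInteger residentThreadgroups = perCoreBudget / perThreadgroupBytes;

        NSLog(@"%lu bytes per threadgroup -> up to %lu concurrent threadgroups per core",
              (unsigned long)perThreadgroupBytes,
              (unsigned long)residentThreadgroups);
    }
    return 0;
}
```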
Of course, the performance of your code also depends on how much threadgroup memory it can use. So optimisation in this case is about finding the sweet spot where your code still gets a reasonable amount of threadgroup memory per threadgroup, but at the same time the GPU can run enough threadgroups concurrently.
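One way to look for that sweet spot is to sweep the threadgroup-memory length on the host side and time each configuration. The sketch below assumes a pipeline compiled from your kernel with its threadgroup buffer at argument index 0 and the matrices at buffer indices 0–2; those indices and the function name sweepThreadgroupMemory are hypothetical (check them against your repo), and timing a whole command buffer like this is only a rough measure:

```objc
#import <Metal/Metal.h>

static void sweepThreadgroupMemory(id<MTLDevice> device,
                                   id<MTLCommandQueue> queue,
                                   id<MTLComputePipelineState> pipeline,
                                   id<MTLBuffer> bufA,
                                   id<MTLBuffer> bufB,
                                   id<MTLBuffer> bufC,
                                   MTLSize threadgroupsPerGrid,
                                   MTLSize threadsPerThreadgroup) {
    for (NSUInteger bytes = 4096; bytes <= device.maxThreadgroupMemoryLength; bytes *= 2) {
        CFAbsoluteTime start = CFAbsoluteTimeGetCurrent();

        id<MTLCommandBuffer> commandBuffer = [queue commandBuffer];
        id<MTLComputeCommandEncoder> encoder = [commandBuffer computeCommandEncoder];
        [encoder setComputePipelineState:pipeline];
        [encoder setBuffer:bufA offset:0 atIndex:0];
        [encoder setBuffer:bufB offset:0 atIndex:1];
        [encoder setBuffer:bufC offset:0 atIndex:2];
        // The only knob being varied: the threadgroup memory allocation.
        [encoder setThreadgroupMemoryLength:bytes atIndex:0];
        [encoder dispatchThreadgroups:threadgroupsPerGrid
                threadsPerThreadgroup:threadsPerThreadgroup];
        [encoder endEncoding];

        [commandBuffer commit];
        [commandBuffer waitUntilCompleted];

        NSLog(@"%5lu bytes of threadgroup memory -> %.3f ms",
              (unsigned long)bytes,
              (CFAbsoluteTimeGetCurrent() - start) * 1000.0);
    }
}
```

Whichever size wins here tells you the occupancy/tile-size trade-off for this particular kernel; for more precise numbers you could switch to GPU timestamps or Metal's performance counters instead of wall-clock time.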