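Explore Metal 4 games
Learn how to optimize your game engine with the latest improvements in Metal 4. See how to minimize CPU overhead, scale your graphics resource management to support massive scenes, make the most of your memory budget, and quickly load large libraries of pipeline states. To get the most out of this session, watch “Discover Metal 4” first.
Chapters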
- 0:00 - Intro
- 1:33 - Encode more efficiently
- 8:42 - Scale your resource management
- 17:24 - Load pipelines quickly
- 31:25 - Next steps
Resources
- Human Interface Guidelines: Designing for games
- Metal Binary Archives
- Reading and Writing to Sparse Textures
- Resource Synchronization
- Synchronizing resource accesses between multiple passes with a fence
- Synchronizing resource accesses with earlier passes with a consumer-based queue barrier
- Synchronizing resource accesses with subsequent passes with a producer-based queue barrier
- Synchronizing resource accesses within a single pass with an intrapass barrier
- Understanding the Metal 4 core API
- Using a Render Pipeline to Render Primitives
- Using the Metal 4 compilation API
Related Videos
WWDC25
- Level up your games
- What’s new in Metal rendering for immersive apps
- Go further with Metal 4 games
- Combine Metal 4 machine learning and graphics
- Discover Metal 4
Hi, I'm Jason. And I'm Yang. We are GPU Driver Engineers. In this video, we will present how you can accelerate your game engine with Metal 4.
This is the second talk in our four-part series introducing the next major version of the Metal API.
Before we explore Metal 4 games, please watch “Discover Metal 4” for a thorough orientation of what Metal 4 has to offer. After this talk, watch “Go Further with Metal 4 Games” to learn about incredible new Metal FX and Metal Raytracing APIs. And watch “Metal 4 Machine Learning” for a dedicated talk on integrating ML. Let's explore Metal 4.
Metal 4 is designed for the modern game engine. Games like Assassin’s Creed Shadows from Ubisoft draw incredibly detailed characters and landscapes to immerse players in fantasy worlds. They stream gigabytes of detailed geometry and textures, rendered with thousands of shaders to take advantage of all the computing power available with Apple Silicon. The games of the future will be even more demanding, and will need a graphics API that scales up to the task. That's Metal 4.
Metal 4 contains some important advancements in Metal for games. These include efficient command encoding and scaled resource management to speed up your hot path, as well as quicker pipeline loading to get your players out of loading screens and into games. In this talk, my colleague and I will dive into how to best use these features.
In every frame of your game, draw calls, kernel dispatches, blits, and ray tracing work are all on the encoding hot path. Metal 4 encoding is designed to meet this challenge by being efficient and concurrent.
Metal 4 unifies the most common operations into two encoder classes so that you can do more with each encoder. Command Allocators allow you to explicitly manage your memory allocations incurred while encoding. Command Buffers let you encode work across multiple threads.
The “Discover Metal 4” talk described how two encoder classes, render and compute, handle all the command encoding. To use them efficiently, you’ll need to synchronize data dependencies between compute operations, as well as use attachment maps to remap fragment shader outputs.
All compute operations (kernel dispatches, blits, and acceleration structure builds) can now be encoded in a single compute encoder. Without any additional synchronization, these commands run concurrently. This allows workloads without dependencies to make better use of GPU resources. If some commands in the pass need to run serially due to a data dependency, you can express this using a pass barrier. This barrier ensures that the GPU waits until all previous blits on this encoder complete before later compute dispatches begin. Here’s an example of how to synchronize access from a blit to a dispatch. First, the copyFromBuffer blit updates buffer1, so encode a pass barrier. Then you can encode a dispatch that uses the data in buffer1. That's unified compute encoding. Do all your compute work in one encoder and use barriers to express data dependencies.
Metal 4 has also updated render encoding. With color attachment mapping, you can now control the correspondence between a render pipeline’s color outputs and the render encoder’s attachments. Instead of binding pipelines to a fixed render target layout, you can provide a color attachment map. Then when you set a new pipeline, you can change the map instead of switching encoders.
Suppose you have a Metal pipeline with a fragment function that draws to three attachments. Without color attachment mapping, you would create a render encoder with three color attachments. The fragment function returns three color outputs, and the encoder directs those outputs to the corresponding attachments in tile memory. For the next draw call, you might need a pipeline that writes to a different set of outputs. Since the attachments are different, you would need to create a new render encoder to match the color outputs. With color attachment mapping, you don’t need a second encoder. Instead, the render encoder has all color attachments needed by both pipelines. Then the color attachment map translates shading output to specific attachments. To implement color attachment mapping, start by setting up a render pass descriptor that supports color attachment mapping. Then create the full superset of attachments that will be used by the encoder.
To configure which attachment the encoder will draw to, create a color attachment map, then set its remap entries. For each entry, set a logical index to determine the shader output, and a physical index to determine the attachment index. Construct these mapping objects before encoding, and reuse them each frame.
When setting a render pipeline, also bind a color attachment map. If a pipeline draws to different attachments, you can switch to a different color attachment map.
Color attachment mapping can significantly reduce the number of render encoders in your game. This reduces encoding overhead and improves GPU efficiency by reducing the number of render passes.
Metal 4 also gives you more control over memory allocations.
Command allocators allow you to reuse command buffer memory and avoid dynamic allocations while encoding. Allocator memory grows as you encode commands. Reset allocators once the associated GPU commands have finished running. Reset makes their command memory available for reuse in later command encoding. You can use multiple allocators to avoid blocking encoding while GPU work completes. When you encode on a new command allocator, it will allocate memory for the encoded commands. This command memory is the work that runs on the GPU, so wait until committed work completes before resetting. Once the GPU work is complete, reset the command allocator. This immediately marks its memory as available for reuse. To continue encoding while the GPU work is running, use a second command allocator. This lets you avoid blocking encoding on the completion of GPU work.
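As a rough illustration of the allocator rotation described above, here is a plain C sketch that models only the bookkeeping: which allocator a frame encodes into, and when that allocator is safe to reset. It makes no Metal calls. The names kAllocatorCount, allocator_slot_for_frame, and allocator_can_reset are hypothetical, and three frames in flight is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping for a ring of command allocators.
   This models scheduling only; it does not call any Metal API. */
enum { kAllocatorCount = 3 };  /* assumed number of frames in flight */

/* Pick the allocator slot used to encode a given frame. */
size_t allocator_slot_for_frame(uint64_t frame_index) {
    return (size_t)(frame_index % kAllocatorCount);
}

/* The slot for frame_index was last used by frame
   (frame_index - kAllocatorCount). It is safe to reset only once the GPU
   has finished that earlier frame. completed_frames is the number of
   frames whose GPU work is done, so frames 0 .. completed_frames - 1
   have completed. */
bool allocator_can_reset(uint64_t frame_index, uint64_t completed_frames) {
    /* Only frames before frame_index have been submitted so far. */
    assert(completed_frames <= frame_index);
    if (frame_index < kAllocatorCount) {
        return true;  /* this slot has never been used */
    }
    uint64_t previous_user = frame_index - kAllocatorCount;
    return completed_frames > previous_user;
}
```

With more than one slot, encoding for the current frame can proceed on one allocator while the GPU is still consuming another, which is the non-blocking behavior the talk describes.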
It’s important to reset a command allocator once its GPU work is complete. Allocator memory will grow to support encoding until it’s reset. If you don’t plan to encode any more work on a command allocator, you can release it to reduce your memory footprint. Command allocators aren’t thread safe, so use different allocators for different threads. This is important as you parallelize your scene encoding.
Metal 4 command buffers let you divide your encoding across multiple threads. In single-threaded encoding, you would encode a series of commands to one or more command buffers in sequence. Take advantage of Apple Silicon’s powerful multi-core CPU and begin multiple command buffers on multiple threads, each with a different command allocator. You can use the improved flexibility of Metal 4’s compute encoder to evenly distribute the encoding of blits, dispatches, and acceleration structure work. When you’ve finished encoding your command buffers, you can submit them all with a single commit call. Metal 4 also allows you to commit multiple render encoders as a single pass on the GPU.
Suppose you have a render pass that takes a long time to encode.
By default, if you divide your encoding across separate render encoders, the GPU runs them as separate render passes. Each pass incurs overhead for storing and loading intermediate results.
Metal 4 provides suspend and resume options to merge multiple render encoders. Simply suspend a render encoder in one command buffer and resume it in another command buffer.
Once you are done encoding the command buffers, submit them sequentially in a single commit call. Submitting the render encoders in one commit instructs Metal to merge the render passes.
To implement this, create the first encoder with suspending options. Metal will merge this encoder with future encoders. Use a different command buffer for each encoder. The middle encoder has both resuming and suspending options.
Create the final encoder with only resuming options. Commit all three encoded command buffers together. That’s it. Your render passes are now merged.
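The first, middle, and last rule above generalizes to any number of encoders. The plain C sketch below captures just that rule. The flag values are hypothetical stand-ins rather than the real Metal constants, and merged_pass_options is an illustrative name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values standing in for the real encoder options;
   the actual Metal constants are not reproduced here. */
enum {
    kOptionNone       = 0,
    kOptionSuspending = 1 << 0,
    kOptionResuming   = 1 << 1
};

/* Options for encoder `index` out of `count` encoders that should merge
   into one render pass: every encoder except the first resumes, and
   every encoder except the last suspends. */
unsigned merged_pass_options(size_t index, size_t count) {
    assert(count > 0 && index < count);
    unsigned options = kOptionNone;
    if (index > 0) {
        options |= kOptionResuming;
    }
    if (index + 1 < count) {
        options |= kOptionSuspending;
    }
    return options;
}
```

For three encoders this yields suspending, then resuming plus suspending, then resuming, matching the sequence in the talk. A single encoder gets no options at all.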
You can use Metal 4 to improve your encoding efficiency by reducing your encoder count, reusing your command memory, and encoding across multiple threads. To learn more about Metal 4 command encoding, please refer to the article on the Apple developer site.
Now that I have covered efficient encoding, let’s dive into efficient resource management. Metal 4 has some exciting new features to help you manage your resources at scale. Argument tables and residency sets allow you to scale your resource binding to thousands of resources. Metal 4 puts you in charge of managing your drawable resources, and gives you control over dependencies. Queue barriers provide a way to express your resource dependencies at scale. Texture view pools and placement sparse heaps help you manage the memory required by large resources.
Increasing the complexity of shaders often means that a bindless model is appropriate for the quantity of resources. With a single argument buffer, your shader can access thousands of resources, including buffers, textures, samplers, pipeline states, and more. But indexed bind points are still used for binding your root-level resources.
Use argument tables to bind resources by index. While encoding, set which argument table should be used by the next draw or dispatch. These resources are available to shaders as indexed function arguments. At draw and dispatch time, Metal collects the arguments. This means that it’s safe to set a new resource to a bind index between draw calls. A single argument table can be set on many encoder stages.
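One way to picture why rebinding between draws is safe is a toy model in which each draw records a snapshot of the table. The plain C sketch below follows the behavior as described in the talk; it is not a claim about how Metal implements it, and every name in it (ToyArgumentTable, ToyEncoder, toy_draw) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { kBindPoints = 4, kMaxDraws = 8 };

/* Toy stand-in for an argument table: one GPU address per bind index. */
typedef struct {
    uint64_t address[kBindPoints];
} ToyArgumentTable;

/* Toy encoder that records a snapshot of the table for each draw. */
typedef struct {
    ToyArgumentTable recorded[kMaxDraws];
    size_t draw_count;
} ToyEncoder;

/* Copy the table's current contents at draw time, so later edits to the
   table cannot change what this draw sees. */
void toy_draw(ToyEncoder *encoder, const ToyArgumentTable *table) {
    assert(encoder->draw_count < kMaxDraws);
    encoder->recorded[encoder->draw_count++] = *table;
}
```

In this model, setting a new address at index 0 after the first toy_draw call leaves the first recorded draw unchanged, and only the second draw observes the new binding.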
Creating argument tables before encoding lets you move resource binding off of your critical path. You can attach one argument table to several encoders.
Use argument tables together with argument buffers to scale your resource binding needs.
The next step to accessing these resources in your shaders is to make them GPU visible. Anytime you need a resource on the GPU, add it to a residency set. This includes pipelines, buffers, textures, and drawables. Residency sets allow you to group many resources together and make them all visible at once. Either attach them to a command buffer being committed, or directly to the command queue.
If a residency set doesn’t change much over time, prefer attaching it to the command queue. If a residency set changes frequently, then attach it to the appropriate command buffers.
It can take time to prepare large resources for the GPU. You can ask Metal to make a set’s resources resident ahead of time.
Prefer having fewer residency sets with more resources each. This improves performance by allowing Metal to process your resources in bulk. To learn more about residency sets, please refer to the article on the Apple developer site, as well as last year’s talk called “Port advanced games to Apple Platforms”.
With Metal 4, your control over resource residency also applies to your game’s drawable surfaces. To send your game’s rendered content to the display, you render to the drawable surfaces in a CAMetalLayer.
Each Metal layer maintains a dynamic residency set. Add it to your command queue to make all the textures in the layer resident. You only need to add the residency set once. CAMetalLayer will update it as necessary.
In Metal 4, you also need to synchronize your rendering with the drawable. During each frame, after getting the next drawable, encode a wait on the command queue prior to rendering to the drawable. Then after you’ve committed the render work, encode a signal to the drawable on the queue.
Call present to send your frame contents to the display once rendering is complete.
To reduce tracking overhead, Metal 4 puts you in charge of synchronizing your resources.
Earlier in this talk, I covered how to use Pass Barriers within encoders. Queue Barriers, on the other hand, express a data dependency across encoders on the same queue.
Barriers filter by Metal stages. Each command in an encoder is associated with one or more stages of execution.
For example, draw calls in render encoders generate both vertex and fragment shading stages. Apple Silicon GPUs batch all the vertex work together, followed by all the fragment shading work.
Metal 4’s compute commands correspond to dispatch, blit, and acceleration structure stages. It’s important to choose appropriate stages to avoid over-synchronizing.
In this example, a compute pass performs atmospheric simulation in a kernel dispatch. It writes the results to a texture in memory. The render pass draws the scene. The fragment shading requires the results of the simulation to blend with lighting, but the vertex work should be free to overlap with the compute work. To synchronize access to the simulation results, encode a barrier from the queue’s dispatch stages to the render encoder’s fragment stage.
To implement this example, begin by encoding the dispatch on a compute encoder. Then, on a render command encoder, add a barrier after the queue’s dispatch stage and before the fragment stage.
After the barrier, you can encode draw calls. Metal ensures that any fragment stage work, in the current render encoder and in future encoders, isn’t run until all dispatch stage work in previous encoders is complete.
To help you find the best places for your barriers, the Metal debugger will show you their locations, along with which encoders and which stages they apply to. Use this to maximize concurrency while maintaining your data dependencies.
To learn more about synchronizing your resources using Metal barriers, you can read through the full articles on the Apple developer site.
Texture and buffer streaming allows you to manage the memory footprint for thousands of resources. Metal 4 allows you to efficiently stream buffers and textures. You can create lightweight texture views and manage your resource memory footprint with placement sparse resources.
Modern games can create hundreds of texture and texture buffer views per frame. Texture view pools allow you to pre-allocate the memory required to contain all your texture views. Then you can create lightweight texture views at any index in the pool. This doesn’t incur any dynamic allocations, so you can create them while encoding.
Use the texture view’s resource ID to bind it to an argument buffer or to an argument table.
Here's how you can implement this. Create a texture view pool ahead of encoding time. In this case, the created texture view pool has memory allocated for 500 texture views.
When encoding, set a texture view at the desired index in the texture view pool. Use the returned MTLResourceID to bind your texture view to an argument table.
Sometimes, the resources you need to bind can have a large memory footprint.
Sparse resources are great for high-fidelity resources that may not all fit in memory at the same time. They decouple a resource’s creation from its memory backing.
With placement sparse, you are in control over how your resources get mapped to the pages of your heap.
When updating memory mappings for your resources, APIs on the Metal 4 command queue allow you to synchronize these updates with other GPU work.
The memory in a placement heap is organized as a sequence of tiles. You control the assignment of these tiles to your sparse buffers and textures.
Provide sparse resources with memory by mapping byte ranges or pixel regions to sparse tiles.
When creating a placement heap, consider the sparse page sizes that your resources will need. Larger page sizes have some performance benefits during mapping and unmapping operations, but will use more memory for padding and alignment. The heap will support all sparse page sizes up to the maximum you specify. This example chooses a 64 kilobyte maximum page size.
Once you've created a placement heap, you can create the sparse resources. You create placement sparse buffers and textures from a Metal device, similar to non-sparse resources. For buffers, align the requested buffer size to a multiple of the sparse tile size. The device provides a query to perform this conversion. Set the placement sparse page size when calling newBufferWithLength, or on a texture descriptor. This property informs the Metal device that a placement heap will provide the memory backing. When you first create a placement sparse buffer, it doesn't have any memory backing it. You assign tiles to buffer ranges with update mapping operations.
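The size alignment mentioned above is ordinary round-up arithmetic. The plain C sketch below shows the calculation for illustration only; in a real app, prefer the device query the talk mentions. The name round_up_to_tile is hypothetical, and the 64 KiB tile size in the usage note is an assumption chosen to echo the page size in the earlier example.

```c
#include <assert.h>
#include <stddef.h>

/* Round `size` up to the next multiple of `tile_size` (both in bytes). */
size_t round_up_to_tile(size_t size, size_t tile_size) {
    assert(tile_size > 0);
    size_t tiles = (size + tile_size - 1) / tile_size;  /* ceiling division */
    return tiles * tile_size;
}
```

For example, with an assumed 65,536-byte tile, a 100,000-byte request needs two tiles, so the aligned size is 131,072 bytes.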
To assign tiles from a placement heap to a buffer, first specify an update mapping operation. Provide the starting offset and length and the tile offset in the heap to assign to this buffer range. Then submit the mapping operation on a Metal 4 command queue.
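To make the three inputs of a mapping update concrete, here is a plain C model of the bookkeeping an engine might keep alongside the real operation. ToyMappingUpdate and both helpers are hypothetical and are not Metal types. The sketch assumes that ranges are tile-aligned.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one mapping update: which bytes of the
   sparse buffer receive which tiles of the placement heap. */
typedef struct {
    size_t buffer_offset;     /* start of the range, in bytes */
    size_t length;            /* length of the range, in bytes */
    size_t heap_tile_offset;  /* first heap tile assigned to the range */
} ToyMappingUpdate;

/* Number of heap tiles the update consumes. This sketch assumes the
   range starts and ends on tile boundaries. */
size_t toy_mapping_tile_count(const ToyMappingUpdate *update, size_t tile_size) {
    assert(tile_size > 0);
    assert(update->buffer_offset % tile_size == 0);
    assert(update->length % tile_size == 0);
    return update->length / tile_size;
}

/* One past the last heap tile used, handy for placing the next range. */
size_t toy_mapping_next_free_tile(const ToyMappingUpdate *update, size_t tile_size) {
    return update->heap_tile_offset + toy_mapping_tile_count(update, tile_size);
}
```

Tracking the next free tile this way is one simple option for handing out heap tiles linearly. A streaming system that also unmaps ranges would need a free list instead.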
To learn more about working with sparse resources, check out the article on the Apple Developer website.
Another challenge for modern games is managing large libraries of pipeline states. For that, I'll hand it over to Yang.
Thanks, Jason. Modern games need to create thousands of pipelines to render complex and dynamic visuals. Loading many pipelines quickly is crucial to eliminating shader compilation stutters and reducing your game's load time.
To load pipelines quickly in Metal 4, you can reuse your render pipeline compilations. You can also compile pipelines on device with a new level of parallelism. To go a step further, compiling pipelines ahead of time allows you to reduce your pipeline loading time to near zero.
I will start by showing you how to reuse your render pipeline compilations with flexible render pipeline states.
Let’s say you’re creating a city builder game where players can place houses around the map.
When the player is deciding where to place a house, the game should render the house in hologram style, so it needs a pipeline with an additive blend state. The player places the house and the house starts building. To render the house with transparency, to show the building progress, it needs another pipeline with a transparent blend state. Finally, when the house finishes building, it will render the house with a third pipeline with an opaque blend state.
You could compile these three pipelines with full state by providing the full pipeline configurations upon creation.
Start with a vertex function and the fragment function, and the color attachment configurations for the opaque, transparent, and hologram houses.
Color attachment configuration here means the part of the descriptor that affects writing fragment shader outputs to color attachments. This includes the pixel format of the attachments, the write masks, and the blend states.
Create a render pipeline descriptor that references the vertex function, the fragment function, and the opaque configuration. With that descriptor, you create the opaque pipeline containing a vertex binary, a fragment binary body, and a fragment output part.
By swapping the color attachment configuration in the descriptor, you can similarly create the transparent pipeline and the hologram pipeline.
Most of the binaries in these three pipelines are the same; only the fragment output part is different.
From a CPU timeline view, you compile the full opaque pipeline, then the transparent pipeline, followed by the hologram pipeline. The CPU spends a lot of time recompiling mostly the same pipeline, except for the fragment output part.
With Metal 4, you can now reuse most of the pipeline compilation by first creating an unspecialized pipeline. Then use different color attachment configurations to get the final specialized pipelines you need. This can save you a lot of render pipeline compilation time. To achieve the savings, first create an unspecialized pipeline. Start with the same descriptor, but instead of providing the actual color attachment configuration, set every field to unspecialized.
To do this, simply loop through all color attachment descriptors and set pixelFormat, writeMask, and blendingState to their corresponding unspecialized values.
The unspecialized pipeline contains a vertex binary, the fragment binary body, and the default fragment output part. The default fragment output works for some cases, but most of the time, you will need to replace it by specializing the pipeline.
To create the specialized pipeline, start with the unspecialized pipeline and a new render pipeline descriptor. This time, set the color attachment configuration in the descriptor to the actual values you need.
The specialized pipeline contains a corresponding fragment output replacing the default fragment output. This new fragment output can be generated very quickly, so you don't need to go through the full shader compilation process again.
I’ll return to the example of specializing the transparent pipeline. Start by setting the previously unspecialized properties in the descriptor. Enable the blending state and set blending sub states. The code here sets the pipeline to do pre-multiplied alpha blending. Then instantiate the specialized pipeline using the new descriptor with the unspecialized pipeline.
Your game may be creating thousands of stateful render pipelines. To maximize the load time reduction, create all your render pipelines unspecialized and specialize them later as needed.
After doing that, you may note that there is a small GPU performance overhead. A lot of the overhead can come from unnecessary work from the shared fragment body. For example, if a fragment shader writes to four color channels and the color attachment only has one channel, the compiler is no longer able to optimize the unused channels.
There's also a small overhead caused by jumping from the fragment binary body to the fragment output part.
This overhead is usually small, but can be large in some fragment shaders. Identify those important shaders and compile with full state in the background so your player can enjoy both short load times and great frame rate.
You can use the Metal System Trace in Instruments to rank the most expensive specialized fragment shaders.
To recap, here's how to best incorporate flexible render pipeline states into your game.
Start by compiling every render pipeline unspecialized and specialize them as needed. If there's a noticeable performance drop, use the Metal System Trace in Instruments to profile your game and identify important pipelines. For those important pipelines, compile a full-state version in the background and use it to replace the specialized version when it’s ready. To learn more about flexible render pipeline states, check out this article at the Apple Developer website.
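The last step of that recap, swapping in the full-state pipeline once a background compile finishes, can be sketched as a small slot that the render thread reads and a background thread publishes to. This is a plain C11 model with hypothetical names (ToyPipeline, ToyPipelineSlot, and both functions), not Metal API, and it assumes a single publisher per slot.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Opaque handle standing in for a pipeline state object. */
typedef const void *ToyPipeline;

typedef struct {
    ToyPipeline specialized;          /* available right after specialization */
    _Atomic(ToyPipeline) full_state;  /* NULL until the background compile finishes */
} ToyPipelineSlot;

/* Called by the background thread once the full-state compile is done. */
void toy_publish_full_state(ToyPipelineSlot *slot, ToyPipeline pipeline) {
    assert(pipeline != NULL);
    atomic_store_explicit(&slot->full_state, pipeline, memory_order_release);
}

/* Called by the render thread each time it binds this pipeline:
   prefer the full-state version when it exists. */
ToyPipeline toy_select_pipeline(ToyPipelineSlot *slot) {
    ToyPipeline full = atomic_load_explicit(&slot->full_state, memory_order_acquire);
    return full != NULL ? full : slot->specialized;
}
```

The render thread never waits in this model. It keeps drawing with the specialized pipeline until the faster one appears, which is how the player gets both short load times and a good frame rate.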
After reusing pipeline compilations with flexible render pipeline states, shorten your pipeline load time further by parallelizing your on-device compilation. Some games might use a single thread to load pipelines during gameplay. Here’s the single compilation thread that builds the pipelines the game is about to use. And here’s a render thread that runs the repeated frame rendering work, such as encoding. If the required pipelines aren’t ready in time, the game can stutter.
You can speed up pipeline loading by adding another compilation thread. The pipeline compilation finishes sooner. However, if you’re not careful with thread priority, the game can still miss its present intervals.
After setting the background compilation thread’s priority to a value lower than that of the render thread, the hitch is gone. The player can now enjoy smoother gameplay.
Here’s how to add multithreaded compilation to your game. Use the Metal 4 compilation APIs. The compiler is able to go even wider in Metal 4. You can either use Grand Central Dispatch or create your own thread pool, depending on which one fits your game’s architecture better. No matter which one you choose, remember to set an appropriate priority. Metal will respect the priority of your compilation tasks.
Grand Central Dispatch is the easiest way to issue your multithreaded compilations.
If you want the compilation to inherit the priority of the calling thread, you can use dispatch groups with the async methods provided by the compiler. Async methods are the ones with a completion handler. Metal will automatically execute these methods concurrently.
If you want to customize the priority of your compilation, you can create a concurrent dispatch queue with a custom quality-of-service, or QoS class.
For pipeline prewarming and streaming, we recommend you set it to default.
To submit compilation tasks to a dispatch queue, you can invoke sync methods in a block and send it to the queue via dispatch_async. Sync methods are the ones without a completion handler.
You can also create your own thread pool if this fits your game’s architecture better. Use the maximumConcurrentCompilationTaskCount property of the Metal device as the thread count for the thread pool. Set the default thread count to 2, as this is the maximum concurrency in OS versions that don’t support this property.
It is also important to set a proper quality-of-service (QoS) class for your compilation threads to avoid starving other important threads in your game. Set the QoS class to default for pipeline prewarming and streaming. And that's it. You can now start sending compilation tasks to your thread pool.
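The thread-count choice can be written as a one-line rule. In the plain C sketch below, `reported` stands for whatever value your code read from the device property, with 0 meaning the property is unavailable on the running OS. That convention and the name compile_thread_count are assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Fallback taken from the talk: two concurrent compilations on OS
   versions that do not expose the device property. */
enum { kFallbackCompileThreads = 2 };

/* `reported` is the value read from the device property, or 0 when the
   running OS does not expose it (a convention of this sketch). */
size_t compile_thread_count(size_t reported) {
    size_t count = reported > 0 ? reported : kFallbackCompileThreads;
    assert(count > 0);
    return count;
}
```

Whatever count this returns, each worker thread still needs the lower-than-render priority discussed above so that compilation never starves frame encoding.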
For more information on the best ways to parallelize and prioritize your pipeline compilation, check out these articles at the Apple Developer website.
Multithreaded compilation on-device can greatly reduce your compile time. To further reduce it to near zero, precompiling pipelines at development time is the way to go. To compile pipelines ahead of time, games typically use an end-to-end workflow.
The workflow begins by harvesting pipeline configurations used in-game by running the game with some instrumentation. The harvested results are fed into the GPU toolchain to build GPU binaries. Finally, at runtime, the game looks up the precompiled GPU binaries to quickly build pipelines. Metal 4 makes it easier than ever to harvest your pipeline configuration online and look up precompiled binaries in your shipping game.
In Metal 4, the easiest way to harvest a pipeline configuration is to serialize a pipeline script. A pipeline script is just a JSON formatted file. It contains a textual representation of the pipeline descriptors you create on device.
Serializing a pipeline script is easy with a pipeline data set serializer in Metal 4. Once you bind this object to your compiler, it will automatically record the descriptors for the created pipelines. Next, you can serialize these descriptors into a pipeline script.
To create the pipeline data set serializer, start with a descriptor. Set the configuration to CaptureDescriptors. This informs the serializer to only track pipeline descriptors and reduces its memory footprint. Use the serializer descriptor to create the pipeline data set serializer. Then attach the serializer to the compiler descriptor you use to create the compiler.
After creating the compiler, you can just use it to create pipelines as usual. The serializer will automatically record the pipeline descriptors you use.
When you are done harvesting, call serializeAsPipelinesScriptWithError to serialize the recorded pipelines into a pipeline script. The return value is an NSData. You can use your favorite method to send it back to your development system.
This example just writes it to a file on disk. Set the suffix of the file to mtl4-json. This is the suffix expected by the GPU toolchain.
Once you’ve harvested pipeline configurations, the next step is to build the binaries. Feed your pipeline configuration script and Metal IR libraries to metal-tt. It will output the GPU binaries packed in a Metal archive. Before you feed the harvested pipeline script to metal-tt, open the script and modify the path to the Metal IR libraries to match the path on your development system. For more information about the pipeline configuration script format, open the manual page with this command. Next, simply execute the metal-tt command on screen to build an archive for iOS.
Now that you’ve precompiled the binaries, your game needs to look them up at runtime.
Metal 4 makes it even easier for you to create pipelines from GPU binaries in an archive. Just use the same descriptor you would use to compile on device to retrieve the pipeline states.
For example, create a MTL4Archive object by providing the URL to your archive. Then, query the pipeline state directly from the archive object with a pipeline descriptor.
The lookup into the archive can miss for multiple reasons, such as no matching pipeline, an incompatible OS, or an incompatible GPU architecture. You need to handle such misses yourself in Metal 4. The example here simply falls back to on-device compilation, so the game still has the required pipeline state going forward.
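The lookup-then-fallback flow can be modeled as a helper that tries the archive first and compiles on device only on a miss. The plain C sketch below uses function pointers as hypothetical stand-ins for both steps. None of these names are Metal API, and the three sample callbacks exist only to exercise the helper.

```c
#include <assert.h>
#include <stddef.h>

/* Opaque handle standing in for a pipeline state object. */
typedef const void *ToyPipelineState;

/* Both sources are hypothetical stand-ins: `lookup` returns NULL on an
   archive miss, and `compile` builds the pipeline on device. */
typedef ToyPipelineState (*ToyPipelineSource)(const void *descriptor);

ToyPipelineState toy_load_pipeline(const void *descriptor,
                                   ToyPipelineSource lookup,
                                   ToyPipelineSource compile) {
    assert(lookup != NULL && compile != NULL);
    ToyPipelineState state = lookup(descriptor);
    if (state == NULL) {
        /* Miss: no match, incompatible OS, or incompatible GPU. */
        state = compile(descriptor);
    }
    return state;
}

/* Sample sources used to exercise the helper. */
const int toy_precompiled = 1, toy_compiled = 2;
ToyPipelineState toy_archive_hit(const void *d)    { (void)d; return &toy_precompiled; }
ToyPipelineState toy_archive_miss(const void *d)   { (void)d; return NULL; }
ToyPipelineState toy_device_compile(const void *d) { (void)d; return &toy_compiled; }
```

Because the same descriptor drives both paths, a miss costs only compile time. The caller ends up with a usable pipeline state either way.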
Here’s the CPU timeline for the example game again with multithreaded on-device compilation. By adopting ahead-of-time compilation, the pipeline load time shrinks to almost zero. To learn more about ahead-of-time compilation, check out these articles at the Apple Developer website.
Here's a recap. Metal 4 provides great ways to load pipeline states faster than ever before.
You can reuse compilation results using pipeline specialization. You can further speed up compilation with multithreading. For the lowest pipeline load times, adopt ahead of time compilation with a streamlined harvest and lookup workflow.
We were so excited to show you many ways to use Metal 4 APIs to build the next generation of high-performance games. Download the new Xcode to begin optimizing your game’s encoding, resource management, and pipeline loading. We’ve included sample projects and detailed articles to support your journey. To continue exploring what Metal 4 has to offer, check out the other talks in this series. Thank you for watching.
-
-
0:01 - Synchronize access to a buffer within an encoder
// Synchronize access to a buffer within an encoder
id<MTL4ComputeCommandEncoder> encoder = [commandBuffer computeCommandEncoder];
// The blit writes buffer1
[encoder copyFromBuffer:src sourceOffset:0 toBuffer:buffer1 destinationOffset:0 size:64];
// Make the blit's writes visible before any later dispatch runs
[encoder barrierAfterEncoderStages:MTLStageBlit
               beforeEncoderStages:MTLStageDispatch
                 visibilityOptions:MTL4VisibilityOptionDevice];
// The dispatch reads buffer1
[encoder setComputePipelineState:pso];
[argTable setAddress:buffer1.gpuAddress atIndex:0];
[encoder setArgumentTable:argTable];
[encoder dispatchThreads:threadsPerGrid threadsPerThreadgroup:threadsPerThreadgroup];
[encoder endEncoding];
-
4:29 - Configure superset of color attachments
// Configure superset of color attachments
MTL4RenderPassDescriptor *desc = [MTL4RenderPassDescriptor new];
desc.supportColorAttachmentMapping = YES;
desc.colorAttachments[0].texture = colortex0;
desc.colorAttachments[1].texture = colortex1;
desc.colorAttachments[2].texture = colortex2;
desc.colorAttachments[3].texture = colortex3;
desc.colorAttachments[4].texture = colortex4;
-
4:38 - Set color attachment map entries
// Set color attachment map entries
MTLLogicalToPhysicalColorAttachmentMap* myAttachmentRemap = [MTLLogicalToPhysicalColorAttachmentMap new];
[myAttachmentRemap setPhysicalIndex:0 forLogicalIndex:0];
[myAttachmentRemap setPhysicalIndex:3 forLogicalIndex:1];
[myAttachmentRemap setPhysicalIndex:4 forLogicalIndex:2];
-
4:57 - Set a color attachment map per pipeline
// Set a color attachment map per pipeline
[renderEncoder setRenderPipelineState:myPipeline];
[renderEncoder setColorAttachmentMap:myAttachmentRemap];

// Draw with myPipeline

[renderEncoder setRenderPipelineState:myPipeline2];
[renderEncoder setColorAttachmentMap:myAttachmentRemap2];

// Draw with myPipeline2
-
8:03 - Encode a single render pass with 3 render encoders
// Encode a single render pass with 3 render encoders with suspend/resume options
id<MTL4RenderCommandEncoder> enc0 = [cmdbuf0 renderCommandEncoderWithDescriptor:desc
                                                                        options:MTL4RenderEncoderOptionSuspending];
id<MTL4RenderCommandEncoder> enc1 = [cmdbuf1 renderCommandEncoderWithDescriptor:desc
                                                                        options:MTL4RenderEncoderOptionResuming |
                                                                                MTL4RenderEncoderOptionSuspending];
id<MTL4RenderCommandEncoder> enc2 = [cmdbuf2 renderCommandEncoderWithDescriptor:desc
                                                                        options:MTL4RenderEncoderOptionResuming];

id<MTL4CommandBuffer> cmdbufs[] = { cmdbuf0, cmdbuf1, cmdbuf2 };
[commandQueue commit:cmdbufs count:3];
-
11:48 - Synchronize drawable contents
// Synchronize drawable contents
id<MTLDrawable> drawable = [metalLayer nextDrawable];
[queue waitForDrawable:drawable];

// ... encode render commands to commandBuffer ...

[queue commit:&commandBuffer count:1];
[queue signalDrawable:drawable];
[drawable present];
-
13:25 - Encode a queue barrier to synchronize data
// Encode a queue barrier to synchronize data
id<MTL4ComputeCommandEncoder> compute = [commandBuffer computeCommandEncoder];
[compute dispatchThreadgroups:threadGrid threadsPerThreadgroup:threadsPerThreadgroup];
[compute endEncoding];

id<MTL4RenderCommandEncoder> render = [commandBuffer renderCommandEncoderWithDescriptor:des];
[render barrierAfterQueueStages:MTLStageDispatch
                   beforeStages:MTLStageFragment
              visibilityOptions:MTL4VisibilityOptionDevice];
[render drawPrimitives:MTLPrimitiveTypeTriangle vertexStart:vertexStart vertexCount:vertexCount];
[render endEncoding];
-
14:57 - Create a texture view pool
// Create a texture view pool
MTLResourceViewPoolDescriptor *desc = [[MTLResourceViewPoolDescriptor alloc] init];
desc.resourceCount = 500;
id <MTLTextureViewPool> myTextureViewPool = [myDevice newTextureViewPoolWithDescriptor:desc
                                                                                 error:nullptr];
-
15:07 - Set a texture view
// Set a texture view
MTLResourceID myTextureView = [myTextureViewPool setTextureView:myTexture
                                                     descriptor:myTextureViewDescriptor
                                                        atIndex:5];
[myArgumentTable setTexture:myTextureView atIndex:0];
-
16:01 - Choose appropriate sparse page size
MTLHeapDescriptor *desc = [MTLHeapDescriptor new];
desc.type = MTLHeapTypePlacement;
desc.storageMode = MTLStorageModePrivate;
desc.maxCompatiblePlacementSparsePageSize = MTLSparsePageSize64;
desc.size = alignedHeapSize;
id<MTLHeap> heap = [device newHeapWithDescriptor:desc];
-
17:05 - Update buffer mappings
// Update buffer mappings
MTL4UpdateSparseBufferMappingOperation bufferOperation;
bufferOperation.mode = MTLSparseTextureMappingModeMap;
bufferOperation.bufferRange.location = bufferOffsetInTiles;
bufferOperation.bufferRange.length = length;
bufferOperation.heapOffset = heapOffsetInTiles;
[cmdQueue updateBufferMappings:myBuf heap:myHeap operations:&bufferOperation count:1];
-
20:41 - Set unspecialized configuration
// In MTL4RenderPipelineColorAttachmentDescriptor
// Set unspecialized configuration
pipelineDescriptor.colorAttachments[i].pixelFormat = MTLPixelFormatUnspecialized;
pipelineDescriptor.colorAttachments[i].writeMask = MTLColorWriteMaskUnspecialized;
pipelineDescriptor.colorAttachments[i].blendingState = MTL4BlendStateUnspecialized;
-
21:40 - Create a specialized transparent pipeline
// Create a specialized transparent pipeline
// Set the previously unspecialized properties
pipelineDescriptor.colorAttachments[0].pixelFormat = MTLPixelFormatBGRA8Unorm;
pipelineDescriptor.colorAttachments[0].writeMask = MTLColorWriteMaskRed | MTLColorWriteMaskGreen | MTLColorWriteMaskBlue;
pipelineDescriptor.colorAttachments[0].blendingState = MTL4BlendStateEnabled;
pipelineDescriptor.colorAttachments[0].sourceRGBBlendFactor = MTLBlendFactorOne;
pipelineDescriptor.colorAttachments[0].destinationRGBBlendFactor = MTLBlendFactorOneMinusSourceAlpha;
pipelineDescriptor.colorAttachments[0].rgbBlendOperation = MTLBlendOperationAdd;

id<MTLRenderPipelineState> transparentPipeline =
    [compiler newRenderPipelineStateBySpecializationWithDescriptor:pipelineDescriptor
                                                          pipeline:unspecializedPipeline
                                                             error:&error];

// Similarly, create the specialized opaque and hologram pipelines
-
26:22 - Determine thread count
// Determine thread count
NSInteger numThreads = 2;
if (@available(macOS 13.3, iOS 19, visionOS 3, tvOS 19, *)) {
    numThreads = [device maximumConcurrentCompilationTaskCount];
}
-
26:30 - Set a proper QoS class for your compilation threads
// Create thread pool
for (NSInteger i = 0; i < numThreads; ++i) {
    // Create a thread with the default QoS class
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_set_qos_class_np(&attr, QOS_CLASS_DEFAULT, 0);
    pthread_create(&threadIds[i], &attr, entryPoint, NULL);
    pthread_attr_destroy(&attr);
}
-
28:24 - Harvest pipeline configuration scripts
// Harvest pipeline configuration scripts with the pipeline data set serializer
// Create a pipeline data set serializer that only captures descriptors
MTL4PipelineDataSetSerializerDescriptor *desc = [MTL4PipelineDataSetSerializerDescriptor new];
desc.configuration = MTL4PipelineDataSetSerializerConfigurationCaptureDescriptors;
id<MTL4PipelineDataSetSerializer> serializer = [device newPipelineDataSetSerializerWithDescriptor:desc];

// Set the pipeline data set serializer when creating the compiler
MTL4CompilerDescriptor *compilerDesc = [MTL4CompilerDescriptor new];
[compilerDesc setPipelineDataSetSerializer:serializer];
id<MTL4Compiler> compiler = [device newCompilerWithDescriptor:compilerDesc error:nil];

// Create pipelines using the compiler as usual

// Serialize the descriptors as a pipeline script
NSData *data = [serializer serializeAsPipelinesScriptWithError:&err];

// Write the pipeline script data to disk
NSString *path = [NSString pathWithComponents:@[folder, @"pipelines.mtl4-json"]];
BOOL success = [data writeToFile:path options:NSDataWritingAtomic error:&err];
-
30:28 - Query pipeline state from MTL4Archive
// Query pipeline state from MTL4Archive
id<MTL4Archive> archive = [device newArchiveWithURL:archiveURL error:&error];
id<MTLRenderPipelineState> pipeline = [archive newRenderPipelineStateWithDescriptor:descriptor error:&error];
if (pipeline == nil) {
    // Handle lookup miss by compiling on device
    pipeline = [compiler newRenderPipelineStateWithDescriptor:descriptor
                                          compilerTaskOptions:nil
                                                        error:&error];
}
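The lookup-then-fallback pattern in the last snippet can be modeled without any Metal types. The following is only a sketch in plain C under stated assumptions: `Archive`, `Pipeline`, `archive_lookup`, `compile_on_device`, and `load_pipeline` are all hypothetical names, and a 64-bit descriptor hash stands in for a real pipeline descriptor. It shows the control flow and a miss counter that could help you spot pipelines missing from your harvested set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a pipeline state; not a Metal type. */
typedef struct {
    uint64_t key;      /* hash of the pipeline descriptor (assumed) */
    int from_archive;  /* 1 if found in the prebuilt archive, 0 if compiled on device */
} Pipeline;

/* Hypothetical stand-in for an archive built ahead of time. */
typedef struct {
    const uint64_t *keys;  /* descriptor hashes harvested ahead of time */
    size_t count;
} Archive;

/* Fast path: returns 1 and fills `out` on a hit, returns 0 on a miss. */
static int archive_lookup(const Archive *archive, uint64_t key, Pipeline *out)
{
    assert(archive != NULL && out != NULL);
    for (size_t i = 0; i < archive->count; ++i) {
        if (archive->keys[i] == key) {
            out->key = key;
            out->from_archive = 1;
            return 1;
        }
    }
    return 0;
}

/* Slow path placeholder: a real engine would invoke the on-device compiler here. */
static Pipeline compile_on_device(uint64_t key)
{
    Pipeline p = { key, 0 };
    return p;
}

/* Tries the archive first and falls back to on-device compilation on a miss. */
static Pipeline load_pipeline(const Archive *archive, uint64_t key, size_t *miss_count)
{
    Pipeline p;
    if (archive_lookup(archive, key, &p)) {
        return p;
    }
    if (miss_count != NULL) {
        ++*miss_count;  /* track misses so the harvest step can be extended */
    }
    return compile_on_device(key);
}
```

Logging the descriptors that miss during play testing is one way to decide which pipelines to add to the next harvest.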
-
-
- 0:00 - Intro
This is the second in a four-part series on Metal 4, the new Apple graphics API designed for modern game engines. Metal 4 enhances command encoding, resource management, and pipeline loading. Metal 4 meets the demands of current and future games which stream gigabytes of detailed geometry and textures, rendered with thousands of shaders, so they can take advantage of all the computing power available with Apple silicon. Also watch the other parts of the series for more details on MetalFX, ray tracing, and machine learning integration.
- 1:33 - Encode more efficiently
Metal 4 is designed to enhance GPU efficiency by optimizing command encoding. It introduces two main encoder classes, render and compute, which can now handle most common game operations. You can use Metal 4 to improve your encoding efficiency by reducing your encoder count, reusing your command memory, and encoding across multiple threads.
- 8:42 - Scale your resource management
Metal 4 has some exciting new features to help you manage your resources at scale. Argument tables and residency sets allow you to scale your resource binding to thousands of resources. Metal 4 puts you in charge of managing your drawable resources, and gives you control over dependencies. Queue barriers provide a way to express your resource dependencies at scale. Texture view pools and placement sparse heaps help you manage the memory required by large resources.
- 17:24 - Load pipelines quickly
Modern games need thousands of pipelines to render complex and dynamic visuals. Loading many pipelines quickly is crucial to eliminating shader compilation stutters and reducing your game's load time. To load pipelines quickly in Metal 4, reuse your render pipeline compilations, compile the pipelines on-device with a new level of parallelism, and compile pipelines ahead of time so your pipeline loading time drops to nearly zero.
- 31:25 - Next steps
Metal 4 APIs are designed to enable you to build the next generation of high-performance games. You can check out the documentation on the Apple Developer website, try out the sample projects, and download the new Xcode to get started.
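For the placement sparse heaps mentioned in the resource management chapter, the code listing uses values such as `alignedHeapSize`, `bufferOffsetInTiles`, and `heapOffsetInTiles` without showing how they are derived. The sketch below is one plausible way to compute them in plain C. It assumes that `MTLSparsePageSize64` corresponds to 64 KiB pages and that offsets are expressed in whole pages; `align_up` and `bytes_to_tiles` are hypothetical helpers, and a real engine should take size and alignment requirements from the device rather than from a constant.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page size for MTLSparsePageSize64: 64 KiB. */
#define SPARSE_PAGE_BYTES (64u * 1024u)

/* Rounds `size` up to the next multiple of `alignment`. */
static uint64_t align_up(uint64_t size, uint64_t alignment)
{
    assert(alignment > 0);
    return (size + alignment - 1) / alignment * alignment;
}

/* Converts a page-aligned byte offset into a tile (page) index. */
static uint64_t bytes_to_tiles(uint64_t byte_offset, uint64_t page_bytes)
{
    assert(page_bytes > 0 && byte_offset % page_bytes == 0);
    return byte_offset / page_bytes;
}
```

For example, a 100,000-byte request rounds up to two 64 KiB pages, and a byte offset of 196,608 falls on tile index 3.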