KV-Cache MLState Not Updating During Prefill Stage in Core ML LLM Inference

Hello,

I'm running a large language model (LLM) in Core ML that uses a key-value cache (KV-cache) to store past attention states. The model was converted from PyTorch using coremltools and deployed on-device with Swift. The KV-cache is exposed via MLState and is used across inference steps for efficient autoregressive generation.

During the prefill stage, where a prompt of multiple tokens is passed to the model in a single batch to initialize the KV-cache, I've noticed that some entries in the KV-cache are not updated after inference.

Here are a few details about the issue and the setup:

  • After the call, the KV-cache entries in the MLState are unchanged from their input values (often still empty or zero-initialized) for some tokens in the batch.
  • The issue only happens during the prefill stage (i.e., first call over multiple tokens).
  • During decoding (single-token generation), the KV-cache updates normally.
  • The model is invoked using MLModel.prediction(from:using:options:) for each batch, roughly as in the sketch below.
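Roughly, the prefill call looks like the following sketch (the feature name "input_ids" and the runPrefill helper are placeholders for my actual input names and code):

```swift
import CoreML

// Sketch of the batched prefill call; "input_ids" is a placeholder feature name.
func runPrefill(model: MLModel, state: MLState, promptTokens: [Int32]) throws -> any MLFeatureProvider {
    // Pack the whole prompt into a single [1, promptLength] batch.
    let shape: [NSNumber] = [1, NSNumber(value: promptTokens.count)]
    let ids = try MLMultiArray(shape: shape, dataType: .int32)
    for (i, token) in promptTokens.enumerated() {
        ids[i] = NSNumber(value: token)
    }
    let input = try MLDictionaryFeatureProvider(dictionary: ["input_ids": ids])

    // One batched prediction; the KV-cache is carried in `state`, not in the outputs.
    return try model.prediction(from: input, using: state, options: MLPredictionOptions())
}
```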

I’ve confirmed:

  • The prompt tokens are non-repetitive and not masked.
  • The model spec has MLState inputs/outputs correctly configured for KV-cache tensors.
  • Each token is processed in a loop with the correct positional encodings.

Questions:

  1. Is there any known behavior in Core ML that could prevent MLState from updating during batched or prefill inference?
  2. Could this be caused by internal optimizations such as lazy execution, static masking, or zero-value short-circuiting?
  3. How can I confirm that each token in the batch is contributing to the KV-cache during prefill?

Any insights from the Core ML or LLM deployment community would be much appreciated.

The Core ML framework is agnostic to an LLM's specific setup, such as a pre-filling step versus single-token generation steps. It just executes whatever the model (ML Program) says.

Given that the state (KV-cache) is updated correctly in the single-token generation steps, I would suspect that the ML Program for the prompt step has a bug where it does not update the KV-cache.

Another thing worth verifying: you need to use the same MLState instance for both pre-filling and token generation, along the lines of the sketch below.
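Here is a minimal sketch of that usage (the generate helper and its parameters are illustrative, not a specific API): create the state once with makeState() and pass that same instance to the prefill call and to every decode call.

```swift
import CoreML

// Illustrative generation loop: one MLState instance shared by prefill and decode.
func generate(model: MLModel,
              prefillInput: any MLFeatureProvider,
              makeDecodeInput: (any MLFeatureProvider) throws -> any MLFeatureProvider,
              maxNewTokens: Int) throws {
    // Create the KV-cache state once; prediction calls update it in place.
    let kvCache = model.makeState()

    // Prefill writes the prompt's keys/values into `kvCache`.
    var lastOutput = try model.prediction(from: prefillInput, using: kvCache)

    // Every decode step must reuse the same MLState instance, not a fresh makeState().
    for _ in 0..<maxNewTokens {
        let nextInput = try makeDecodeInput(lastOutput)  // build the single-token input from the last output
        lastOutput = try model.prediction(from: nextInput, using: kvCache)
    }
}
```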

As for your questions,

  1. Not that I know of.
  2. No. Each prediction call fully completes the state update.
  3. After the pre-filling prediction call, you can examine the contents of the state with withMultiArray(for:), but I suppose you already did that (see the sketch below).
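For example, something along these lines, where the state name "keyCache" stands in for whatever name your model actually declares for the cache tensor:

```swift
import CoreML

// Rough check that prefill wrote something into the cache: count non-zero entries.
// The default state name "keyCache" is a placeholder; use your model's state name.
func dumpCacheStats(state: MLState, stateName: String = "keyCache") {
    state.withMultiArray(for: stateName) { array in
        var nonZero = 0
        for i in 0..<array.count where array[i].doubleValue != 0 {
            nonZero += 1
        }
        print("\(stateName): \(nonZero) of \(array.count) entries are non-zero after prefill")
    }
}
```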