Hello,
I'm running a large language model (LLM) in Core ML that uses a key-value cache (KV-cache) to store past attention states. The model was converted from PyTorch using coremltools and deployed on-device with Swift. The KV-cache is exposed via MLState and is used across inference steps for efficient autoregressive generation.
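For context, here is a minimal sketch of the Swift-side setup, roughly as I have it. The model path and the feature names (`inputIds`, `positionIds`) are placeholders from my project, not anything Core ML requires:

```swift
import CoreML

// Minimal stateful-model setup (iOS 18 / macOS 15 APIs). Feature names are
// placeholders; the real spec uses whatever the coremltools conversion produced.
final class LLMSession {
    let model: MLModel
    let kvCache: MLState   // holds the key/value cache tensors across calls

    init(modelURL: URL) throws {
        let config = MLModelConfiguration()
        config.computeUnits = .all
        model = try MLModel(contentsOf: modelURL, configuration: config)
        // Created once and passed to every prediction so Core ML can update it in place.
        kvCache = model.makeState()
    }

    // One forward pass over `tokens`: the whole prompt during prefill,
    // a single token during decode.
    func step(tokens: [Int32], positions: [Int32]) throws -> MLFeatureProvider {
        let ids = try MLMultiArray(shape: [1, NSNumber(value: tokens.count)], dataType: .int32)
        let pos = try MLMultiArray(shape: [1, NSNumber(value: positions.count)], dataType: .int32)
        for (i, t) in tokens.enumerated() { ids[[0, NSNumber(value: i)]] = NSNumber(value: t) }
        for (i, p) in positions.enumerated() { pos[[0, NSNumber(value: i)]] = NSNumber(value: p) }
        let input = try MLDictionaryFeatureProvider(
            dictionary: ["inputIds": ids, "positionIds": pos])
        return try model.prediction(from: input, using: kvCache)
    }
}
```

The single MLState instance is created once up front and shared by the prefill call and every decode call, which is my understanding of the intended usage.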
During the prefill stage — where a prompt of multiple tokens is passed to the model in a single batch to initialize the KV-cache — I've noticed that some entries in the KV-cache are not updated after inference.
Here are a few details about the setup and what I'm observing:
- After the prefill call, the KV-cache entries in the MLState for some tokens in the batch are unchanged from the input state (often still empty or zero-initialized).
- The issue only happens during the prefill stage (i.e., first call over multiple tokens).
- During decoding (single-token generation), the KV-cache updates normally.
- The model is invoked using MLModel.prediction(from:using:options:) for each batch.
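To be concrete, the call pattern looks roughly like this (using the LLMSession sketch above; the token values and the "logits" output name are illustrative):

```swift
import CoreML

func runGeneration(session: LLMSession) throws {
    // Prefill: the entire prompt in one batched call. This is the call after
    // which some KV-cache entries come back unchanged.
    let prompt: [Int32] = [101, 2009, 2003, 1037]            // illustrative token IDs
    let prefillPositions = Array(Int32(0)..<Int32(prompt.count))
    var output = try session.step(tokens: prompt, positions: prefillPositions)

    // Decode: one token per call, reusing the same MLState. Here the cache
    // updates normally.
    var position = Int32(prompt.count)
    for _ in 0..<16 {
        // Greedy next-token pick as a stand-in for real sampling; "logits" is
        // the output name in my spec and is assumed to cover only the last position.
        guard let logits = output.featureValue(for: "logits")?.multiArrayValue else { break }
        var best = 0
        for i in 1..<logits.count where logits[i].floatValue > logits[best].floatValue {
            best = i
        }
        output = try session.step(tokens: [Int32(best)], positions: [position])
        position += 1
    }
}
```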
I’ve confirmed:
- The prompt tokens are non-repetitive and not masked.
- The model spec has the KV-cache tensors correctly declared as state features, so they are exposed through MLState.
- Each prediction call is given the correct positional encodings for its tokens (the full prompt in the prefill batch, then one token at a time in the decode loop).
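For what it's worth, the way I've been checking whether the state actually changes is roughly this (again using the LLMSession sketch above; the state name "keyCache" comes from my model spec and will differ in other conversions):

```swift
import CoreML

// Snapshot a state tensor as a flat Float array so it can be diffed later.
func snapshot(of state: MLState, name: String) -> [Float] {
    state.withMultiArray(for: name) { array in
        (0..<array.count).map { array[$0].floatValue }
    }
}

// Compare the key cache before and after the prefill call and report how
// many elements were actually written.
func checkPrefillWrites(session: LLMSession, prompt: [Int32]) throws {
    let positions = Array(Int32(0)..<Int32(prompt.count))
    let before = snapshot(of: session.kvCache, name: "keyCache")
    _ = try session.step(tokens: prompt, positions: positions)
    let after = snapshot(of: session.kvCache, name: "keyCache")
    let changed = zip(before, after).filter { $0.0 != $0.1 }.count
    print("keyCache elements changed by prefill: \(changed) / \(after.count)")
}
```

If there is a better-supported way to introspect MLState contents than copying them out like this, I'd be glad to hear it.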
Questions:
- Is there any known behavior in Core ML that could prevent MLState from updating during batched or prefill inference?
- Could this be caused by internal optimizations such as lazy execution, static masking, or zero-value short-circuiting?
- How can I confirm that each token in the batch is contributing to the KV-cache during prefill?
Any insights from the Core ML or LLM deployment community would be much appreciated.