Overly strict foundation model rate limit when used in app extension

I am calling into an app extension from a Safari Web Extension (sendNativeMessage, which in turn results in a call to NSExtensionRequestHandling’s beginRequest). My Safari extension aims to make use of the new foundation models for some of the features it provides.

In my testing, I hit the rate limit after sending just 4 requests with a 30-second wait between each. This makes the FoundationModels framework (which would otherwise serve my use case perfectly well) unusable in this context, because the model is called in response to user input, and this rate of user input is perfectly plausible in a real-world scenario.

The error thrown as a result of the rate limit is "Safety guardrail was triggered after consecutive failures during streaming.", but the system logs in Console.app show that the rate limit is the real culprit.

My suggestions:

  • Please introduce sensible rate limits for app extensions, through an entitlement if need be. Even a limit of 1 request every couple of seconds would already fix the issue for me.
  • Please document the rate limit.
  • Please make the thrown error reflect that it is the result of a rate limit and not a generic guardrail violation. IMPORTANT: please indicate in the thrown error when it is safe to try again.

Filed a feedback here: FB18332004

Answered by Frameworks Engineer in 845702022

Thank you for the feedback!

First, rate limiting is not expected when your device is connected to power. This is a known issue. (153216632)

Rate limiting applies when your device is on battery AND your process is running in the background. Safari extensions run in the background. When using Foundation Models in the background, we recommend not streaming the responses, as streaming uses more power and hits the rate limit sooner. Instead, we recommend calling respond to generate the whole response.
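For illustration, a minimal sketch of the non-streaming call recommended above. The function name generateWholeResponse is hypothetical, and the availability annotation assumes the OS 26 SDKs; only LanguageModelSession, respond(to:), and the content property come from the framework.

```swift
import FoundationModels

/// Hypothetical helper: asks the on-device model for the complete
/// response in one call instead of streaming partial output.
@available(macOS 26.0, iOS 26.0, *)
func generateWholeResponse(for prompt: String) async throws -> String {
    // A fresh session with default settings (no custom instructions).
    let session = LanguageModelSession()
    // respond(to:) suspends until the whole response has been generated.
    let response = try await session.respond(to: prompt)
    return response.content
}
```

Any error, including one caused by rate limiting, propagates to the caller, so the extension's request handler still needs its own do/catch around the call.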

I am running it in the simulator in a SwiftUI app and keep getting guardrail violations in Beta 2. In Beta 1 it was working fine.

Thank you for the response @Frameworks Engineer!

Is there a chance that the rate limits will be relaxed for background tasks, like Safari extensions? As they stand, they make Foundation Models impossible to use for Safari extensions. I can't just tell my users to plug in their device in order to use my extension.

I don't quite understand how such strict, minute-long cooldowns make sense to begin with, since they make it impossible to provide a reliable user experience. At that point you might as well completely forbid the API in background processes with a static check.

As for your suggestions:

  1. I'm hitting the rate limit I described on power too (see the report for the sysdiag).
  2. I didn't see much difference (if any) in the limits when using the respond API as opposed to streamResponse.
  3. We are pretty much forced to use streamResponse because we are randomly hitting guardrail violations for even the most innocuous prompts (I think I saw a couple of reports about this already, but that's a separate issue). If I use the respond API, it is all or nothing; with streaming, at least I get some of the response before it taps out. Besides, streaming is a much better UX, so I wouldn't want to give up on it even if respond weren't as rate limited (which it currently is). There has to be another fix.
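To illustrate the "at least I get some of the response" point in item 3, here is a rough sketch in plain Swift concurrency, not tied to the FoundationModels API. The helper name collectPartial is hypothetical, and it assumes the stream yields incremental text chunks; if your stream yields cumulative snapshots instead, assign each element rather than appending it.

```swift
/// Hypothetical helper: drains any throwing async sequence of text chunks
/// and keeps whatever arrived before an error ended the stream.
func collectPartial<S: AsyncSequence>(
    _ chunks: S
) async -> (text: String, error: Error?) where S.Element == String {
    var text = ""
    do {
        for try await chunk in chunks {
            text += chunk  // assumption: chunks are deltas, not snapshots
        }
        return (text, nil)
    } catch {
        // The stream failed midway: return the partial text with the error.
        return (text, error)
    }
}

// Usage with a stand-in stream that fails after two chunks.
struct DemoError: Error {}

let demoStream = AsyncThrowingStream<String, Error> { continuation in
    continuation.yield("Hello, ")
    continuation.yield("wor")
    continuation.finish(throwing: DemoError())
}

let partial = await collectPartial(demoStream)
print(partial.text)
```

With the non-streaming respond call there is no equivalent: a thrown error leaves nothing to show the user.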

Crossing my fingers this rate limiting decision gets reversed (or reduced to seconds as opposed to minutes) because it will break a good bunch of perfectly valid use cases, like mine.
