In the Speech framework, is SFTranscriptionSegment timing supposed to be off and speechRecognitionMetadata nil until isFinal?


I'm working in Swift/SwiftUI, running Xcode 16.3 on macOS 15.4. I've seen this both in the iOS simulator and in a macOS app run from Xcode, and with 3 different audio files.

Nothing in the documentation says that the speechRecognitionMetadata property on an SFSpeechRecognitionResult will be nil until isFinal, but that's the behaviour I'm seeing.

I've stripped my class down to the following:

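import Speech

// Class name is a placeholder; the original declaration was stripped from this excerpt.
final class SpeechTranscriber {
    private var isAuthed = false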

    // I call this in a .task {} in my SwiftUI View
    public func requestSpeechRecognizerPermission() {
        SFSpeechRecognizer.requestAuthorization { authStatus in
            Task {
                self.isAuthed = authStatus == .authorized
            }
        }
    }
    
    public func transcribe(from url: URL) {
        guard isAuthed else { return }
        
        let locale = Locale(identifier: "en-US")
        let recognizer = SFSpeechRecognizer(locale: locale)
        let recognitionRequest = SFSpeechURLRecognitionRequest(url: url)
        
        // the behaviour occurs whether I set this to true or not, I recently set
        // it to true to see if it made a difference
        recognizer?.supportsOnDeviceRecognition = true
        recognitionRequest.shouldReportPartialResults = true
        recognitionRequest.addsPunctuation = true
        
        recognizer?.recognitionTask(with: recognitionRequest) { (result, error) in
            // Unwrap once instead of force-unwrapping below
            guard let result else { return }
            
            if result.isFinal {
                // speechRecognitionMetadata is not nil
            } else {
                // speechRecognitionMetadata is nil
            }
        }
    }
}

Further, and this isn't documented either, the SFTranscriptionSegment values don't have correct timestamp and duration values until isFinal. The values aren't all zero, but they don't align with the timing in the audio and they change to accurate values when isFinal is true.

The transcription otherwise "works", in that I get transcription text before isFinal and if I wait for isFinal the segments are correct and speechRecognitionMetadata is filled with values.

The context: I'm trying to generate a transcription whose spoken sections I can highlight as the audio plays, and I suspect I'm trying to use the Speech framework in a way it isn't designed for. My concept works if I pre-process the audio (i.e. run it through until isFinal and save the results I need to JSON). Is doing even a rougher version of this 'on the fly', which requires segments to have the right timestamp/duration before isFinal, simply impossible?
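
For illustration, here is a minimal sketch of what the pre-processing approach could look like. It is not the actual code from this project. It assumes the substring, timestamp and duration of each finalized segment have been copied into a plain Codable value. The names `TimedSegment`, `encodeSegments`, `decodeSegments` and `activeSegmentIndex` are hypothetical and not part of the Speech framework.

```swift
import Foundation

// Hypothetical value type holding only what highlighting needs from a
// finalized SFTranscriptionSegment (substring, timestamp, duration).
struct TimedSegment: Codable, Equatable {
    let text: String
    let start: TimeInterval     // seconds from the start of the audio
    let duration: TimeInterval  // seconds

    var end: TimeInterval { start + duration }
}

// Encode the finalized segments so they can be saved alongside the audio file.
func encodeSegments(_ segments: [TimedSegment]) throws -> Data {
    try JSONEncoder().encode(segments)
}

// Decode previously saved segments.
func decodeSegments(_ data: Data) throws -> [TimedSegment] {
    try JSONDecoder().decode([TimedSegment].self, from: data)
}

// Returns the index of the segment being spoken at `time`, or nil during silence.
// Assumes `segments` is sorted by `start` and the segments do not overlap.
func activeSegmentIndex(at time: TimeInterval, in segments: [TimedSegment]) -> Int? {
    var low = 0
    var high = segments.count - 1
    while low <= high {
        let mid = (low + high) / 2
        let segment = segments[mid]
        if time < segment.start {
            high = mid - 1
        } else if time >= segment.end {
            low = mid + 1
        } else {
            return mid
        }
    }
    return nil
}
```

During playback, something like a periodic time observer could pass the player's current time to `activeSegmentIndex(at:in:)` and highlight the returned segment. This only works with the accurate timings available once isFinal is true, which is exactly the limitation described above.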
