My team has developed an app with a Matter commissioner feature (for our own ecosystem) using the Matter framework on the MatterSupport extension.
Recently, we've noticed that commissioning Matter devices with the MatterSupport extension has become very unstable. Occasionally, the HomeUIService stops the flow after successfully commissioning to the first fabric, displaying the error: "Failed to perform Matter device setup: Error Domain=HMErrorDomain Code=2." (Normally, it should send the open commissioning window command to the device and then add the device to the second fabric.) We had never seen this issue until the last few weeks, and there have been no code changes in the app. We suspect that some data is failing to download from iCloud or the Apple account, and that this is causing the problem.
As an evaluation, we tried removing the HomeSupport extension and running the Matter framework directly in developer mode. With that setup the issue disappears, and commissioning works without any problems.
So, let me start with that error here:
displaying the error: "Failed to perform Matter device setup: Error Domain=HMErrorDomain Code=2."
Error 2 is "HMErrorCodeNotFound". Unfortunately, that's a fairly general error that's used in a large number of different contexts to basically mean "I didn't have/get something I expected". Given that you seem to be able to reproduce the issue, here's what I would suggest doing next.
First off, if at all possible, do this testing on a dedicated test device and with the minimum possible home configuration. This isn't always feasible, but every additional app, accessory, or configuration choice introduces more log activity. Log activity is what makes this process difficult, so anything you can do to reduce that noise is helpful.
Next, please install the following profiles on the device that's failing:
That's obviously a lot of profiles, but the goal here is to get "all" of the necessary information in a single pass so that we don't end up in a situation where the log tells us which component the failure happened in but not what the actual problem is.
Once those profiles are installed, do the following:
- Turn the device off.
- Leave it alone for "a while". The exact amount of time doesn't matter, but longer is always better. As little as 10-15 minutes is fine, overnight is fabulous.
The goal here is to create a large time gap in the console log, making it easier to cut out/ignore old data.
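To illustrate why that gap helps, here is a rough Python sketch that finds the largest time gap in a text log and keeps only what comes after it. It assumes each entry starts with a timestamp like `2024-05-01 10:15:30.123456-0700` (for example, a log that has been exported to plain text); the function names and the timestamp layout are my own assumptions for illustration, not part of any Apple tool.

```python
from datetime import datetime

# Assumed timestamp layout at the start of each log entry (an assumption,
# adjust it to match however your log was exported).
TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f%z"
TS_LENGTH = len("2024-05-01 10:15:30.123456-0700")


def parse_timestamp(line):
    """Return the leading timestamp of a log line, or None if it has none."""
    try:
        return datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
    except ValueError:
        return None


def lines_after_largest_gap(lines):
    """Drop everything before the largest time gap between stamped entries.

    Lines without a timestamp (continuation lines) are kept if they fall
    after the gap.
    """
    stamped = []
    for index, line in enumerate(lines):
        ts = parse_timestamp(line)
        if ts is not None:
            stamped.append((index, ts))

    if len(stamped) < 2:
        return list(lines)  # nothing to compare, keep the log as-is

    # Position (within `stamped`) of the entry that follows the largest gap.
    best = max(
        range(1, len(stamped)),
        key=lambda i: stamped[i][1] - stamped[i - 1][1],
    )
    first_kept_line = stamped[best][0]
    return list(lines[first_kept_line:])
```

With the device powered off overnight, the largest gap is almost certainly the power-off period, so the lines returned are the ones from the test session.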
When you're ready to start testing, do the following:
- Turn the device on and unlock it.
- Give the device a few minutes, then start testing.
- When the problem occurs, note the time it occurred, then wait a few minutes.
- Trigger a sysdiagnose and collect the data.
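If it helps with the "note the time it occurred" step, here is a small Python sketch that pulls the error domain and code out of a message like the one above and records a time window around the failure to include in the bug report. The helper names, the dictionary keys, and the 5-minute margin are all hypothetical choices of mine; nothing here is an Apple API.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches the "Error Domain=<domain> Code=<number>" form seen in the message.
ERROR_PATTERN = re.compile(r"Error Domain=(\w+) Code=(-?\d+)")


def parse_error(message):
    """Return (domain, code) from an error message, or None if absent."""
    match = ERROR_PATTERN.search(message)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def failure_note(message, observed_at, margin_minutes=5):
    """Build a small record of the failure for the bug report.

    `margin_minutes` is an arbitrary padding around the failure time so the
    relevant log window is easy to locate later.
    """
    parsed = parse_error(message)
    if parsed is None:
        return None
    domain, code = parsed
    margin = timedelta(minutes=margin_minutes)
    return {
        "domain": domain,
        "code": code,
        "observed_at": observed_at.isoformat(),
        "window_start": (observed_at - margin).isoformat(),
        "window_end": (observed_at + margin).isoformat(),
    }
```

Calling `failure_note(message, datetime.now(timezone.utc))` at the moment the error shows up gives you the failure time and a window you can paste straight into the bug description.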
Obviously that's the "ideal" flow when the problem is relatively easy to replicate. If the problem is more intermittent (or, for example, it only happens on an end user device), then you should do the same setup as above and then do your normal testing until the failure happens. Once the failure happens, the critical points are:
- When exactly the log is captured isn't that important. Eventually the system does purge data, but as long as the device has plenty of storage there isn't very much difference between a log collected immediately after and a log captured several hours later.
- It is important that you NOT reboot the device until after you've collected the sysdiagnose. Many components purge their log data when the device reboots, which makes that log data largely useless.
Once you've got a sysdiagnose, please file a bug describing what happened and what time the failure occurred, then upload the sysdiagnose. After that's done, please post the bug number back here and I'll see what I can determine. If it takes a while to get the bug data, then you can also file a code-level support request that includes my name and a link to this post.
__
Kevin Elliott
DTS Engineer, CoreOS/Hardware