Symbolicating kernel backtraces on Apple Silicon

I am able to symbolicate kernel backtraces for addresses that belong to my kext.

Is it possible to symbolicate kernel backtraces for addresses that lie beyond my kext and reference kernel code?


Is it possible to symbolicate kernel backtraces for addresses that lie beyond my kext and reference kernel code?

Yes, it should be. Each Kernel Debug Kit (KDK) release contains symbol files for that particular build. You can find those downloads here.
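For reference, a KDK typically installs under /Library/Developer/KDKs, and the kernel plus its dSYM live inside the .kdk bundle. A quick way to locate them (the exact .kdk name depends on the build you installed; this one is just an example):

# List installed KDKs, then look inside one for the kernels and their dSYMs.
ls /Library/Developer/KDKs/
ls /Library/Developer/KDKs/KDK_15.4_24E248.kdk/System/Library/Kernels/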

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Thanks for your response.

I've downloaded and extracted Kernel_Debug_Kit_15.4_build_24E248.dmg.

Here's what I've tried so far to symbolicate the sample kernel panic log.

Note: the kernel passed to the -o flag of atos(1) is the one found in kernel.release.t8112.dSYM/Contents/Resources/DWARF/kernel.release.t8112.

The load address was computed as the kernel base address given in the kernel panic log minus the corresponding slide:

atos -o kernel.release.t8112 -arch arm64e -l FFFFFE0007004000 0xfffffe001cde8ab8 0xfffffe001cf44894 0xfffffe001cf42a90 0xfffffe001cd938c0 0xfffffe001cde83b0 0xfffffe001d66d670 0xfffffe001d86a708 0xfffffe001d86941c 0xfffffe001d866504 0xfffffe001e234bc8 0xfffffe001cf4611c 0xfffffe001cd93938 0xfffffe001cee0a80 0xfffffe001cee0a80 0xfffffe001d38e408 0xfffffe001d3e9304 0xfffffe001ce8c24c 0xfffffe001cedf514 0xfffffe001cd9cc04

No output generated

The value of Kernel slide given in the kernel panic log passed to the -s flag

atos -o kernel.release.t8112 -arch arm64e -s 0000000014610000 0xfffffe001cde8ab8 0xfffffe001cf44894 0xfffffe001cf42a90 0xfffffe001cd938c0 0xfffffe001cde83b0 0xfffffe001d66d670 0xfffffe001d86a708 0xfffffe001d86941c 0xfffffe001d866504 0xfffffe001e234bc8 0xfffffe001cf4611c 0xfffffe001cd93938 0xfffffe001cee0a80 0xfffffe001cee0a80 0xfffffe001d38e408 0xfffffe001d3e9304 0xfffffe001ce8c24c 0xfffffe001cedf514 0xfffffe001cd9cc04

No output generated

The value of KernelCache slide given in the kernel panic log passed to the -s flag

atos -o kernel.release.t8112 -arch arm64e -s 0000000014608000 0xfffffe001cde8ab8 0xfffffe001cf44894 0xfffffe001cf42a90 0xfffffe001cd938c0 0xfffffe001cde83b0 0xfffffe001d66d670 0xfffffe001d86a708 0xfffffe001d86941c 0xfffffe001d866504 0xfffffe001e234bc8 0xfffffe001cf4611c 0xfffffe001cd93938 0xfffffe001cee0a80 0xfffffe001cee0a80 0xfffffe001d38e408 0xfffffe001d3e9304 0xfffffe001ce8c24c 0xfffffe001cedf514 0xfffffe001cd9cc04

No output generated

The value of Kernel text exec slide given in the kernel panic log passed to the -s flag

atos -o kernel.release.t8112 -arch arm64e -s 0000000015D88000 0xfffffe001cde8ab8 0xfffffe001cf44894 0xfffffe001cf42a90 0xfffffe001cd938c0 0xfffffe001cde83b0 0xfffffe001d66d670 0xfffffe001d86a708 0xfffffe001d86941c 0xfffffe001d866504 0xfffffe001e234bc8 0xfffffe001cf4611c 0xfffffe001cd93938 0xfffffe001cee0a80 0xfffffe001cee0a80 0xfffffe001d38e408 0xfffffe001d3e9304 0xfffffe001ce8c24c 0xfffffe001cedf514 0xfffffe001cd9cc04

Output: Looks bogus

__doprnt.digits (in kernel.release.t8112) + 103398
bootstrap_pagetables (in kernel.release.t8112) + 329876
bootstrap_pagetables (in kernel.release.t8112) + 322192
table_10bits (in kernel.release.t8112) + 10400
__doprnt.digits (in kernel.release.t8112) + 101598
recvmsg_x (in kernel.release.t8112) + 332
IOMachPort::noMoreSendersForObject(OSObject*, unsigned int, unsigned int*) (in kernel.release.t8112) + 440
IOUserClient::init(OSDictionary*) (in kernel.release.t8112) + 0
IOServiceMessageUserNotification::MetaClass::~MetaClass() (in kernel.release.t8112) + 528
bootstrap_pagetables (in kernel.release.t8112) + 336156
table_10bits (in kernel.release.t8112) + 10520
ropagetable_end (in kernel.release.t8112) + 2688
ropagetable_end (in kernel.release.t8112) + 2688
ipsec_cleanup_family (in kernel.release.t8112) + 920
pf_test (in kernel.release.t8112) + 3700
ropagetable_begin (in kernel.release.t8112) + 324884
nfU16InvMasks (in kernel.release.t8112) + 112

Can you please instruct me how to symbolicate my sample kernel panic log properly?

Can you please instruct me how to symbolicate my sample kernel panic log properly?

I think you're running into this issue. The short summary of the situation here is:

  • KEXT loading changed on ARM64, which ends up shifting the load offset away from the base address you see in an unsymbolicated backtrace. You can find the actual offset by running this command and then looking for the value of "vmaddr" under __TEXT_EXEC in the output (see the one-liner sketch after this list):
otool -l <path to kext executable>
...
cmd LC_SEGMENT_64
cmdsize 232
segname __TEXT_EXEC
vmaddr 0x0000000000004000
vmsize 0x0000000000004000
  • While the value is theoretically dynamic, in practice it's actually 0x4000. Practically speaking, I'd add that as a fixed value to the load address and only check vmaddr if atos returns something silly*. Here is what that atos command looks like:
atos -arch arm64e -o <symbol file path> -l <load address + (0x4000/vmaddr)> <address to symbol>

*Keep in mind that the vmaddr offset is large enough that atos will tend to return garbage, so any issue is unlikely to be subtle.

  • atos needs to be updated (r.135959767) to take care of this itself, probably by adding a new flag option like "-textExecAddress" that you'd use instead of "-l". However, until that's available you'll need to do the math yourself.
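As a small convenience, here's a one-liner sketch for pulling that vmaddr value out of the otool output (the KEXT path is a placeholder; substitute your own executable):

# Print the vmaddr of the first __TEXT_EXEC segment in a KEXT executable.
otool -l /path/to/MyKext.kext/Contents/MacOS/MyKext \
  | grep -A 3 'segname __TEXT_EXEC' \
  | awk '/vmaddr/ { print $2; exit }'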

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Thanks.

Just a couple of points to clarify.

atos -arch arm64e -o <symbol file path> -l <load address + (0x4000/vmaddr)> <address to symbol>

In that particular panic log, there's no load address specified for my kext.

What address do I specify in that case?

Regarding "0x4000/vmaddr": is that 0x4000 divided by vmaddr, or is it saying that 0x4000 is the value of vmaddr?

Accepted Answer

Let me step back to the beginning and explain this in a bit more detail:

  1. When using "-l", what atos is actually "getting" there is the load address starting from the first __TEXT segment in the executable.

  2. On arm64, (very generally) things shifted around so that code loads out of __TEXT_EXEC, NOT __TEXT.

SO, to correctly symbolicate, you need to properly shift the load address to account for that. Starting with loadable KEXTs, this is fairly easy. Their __TEXT segment is at 0, so the math here is:

(__TEXT_EXEC.vmaddr) - (__TEXT.vmaddr) ->

KEXTs also tend to have a fixed opening __TEXT size, so the value of __TEXT_EXEC.vmaddr ends up being "0x4000":

0x4000 - 0 -> 0x4000

That is where my original command came from:

atos -arch arm64e -o <symbol file path> -l <KEXT load address - 0x4000> <address to symbol>

NOTE: My earlier post was wrong, you need to subtract not add.

For a KEXT, that load address comes directly from the panic log. Here is what that looks like in the panic from the other thread on this issue:

          lr: 0xfffffe0013723894  fp: 0x0000000000000000
      Kernel Extensions in backtrace:
         com.company.product(1.4.21d119)[92BABD94-80A4-3F6D-857A-3240E4DA8009]@0xfffffe001203bfd0->0xfffffe00120533ab
            dependency: com.apple.iokit.IOPCIFamily(2.9)[6D6666E6-340F-3A5E-9464-DE05164C0658]@0xfffffe0015e65e90->0xfffffe0015e93b3f
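To make that concrete, here's a rough sketch using the kext load address from the excerpt above (the dSYM path and the frame address are placeholders; substitute your own):

# KEXT load address from the panic log: 0xfffffe001203bfd0
# Adjusted load address: 0xfffffe001203bfd0 - 0x4000 = 0xfffffe0012037fd0
atos -arch arm64e -o /path/to/com.company.product.dSYM/Contents/Resources/DWARF/product \
  -l 0xfffffe0012037fd0 <kext frame address>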

Now, the math is actually exactly the same for the kernel, but the numbers are totally different. Running otool -l on kernel.release.t8112 gets you:

__TEXT.vmaddr -> 
...
Load command 0
      cmd LC_SEGMENT_64
  cmdsize 552
  segname __TEXT
   vmaddr 0xfffffe0007004000
   vmsize 0x0000000000100000
...

AND:

__TEXT_EXEC.vmaddr:
...
Load command 2
      cmd LC_SEGMENT_64
  cmdsize 312
  segname __TEXT_EXEC
   vmaddr 0xfffffe00072b8000
   vmsize 0x00000000008f0000
...

(You may notice there are multiple __TEXT_EXEC sections; in this case, all we care about is the first.)

SO, plugging that into our math above:

(__TEXT_EXEC.vmaddr) - (__TEXT.vmaddr) ->
0xfffffe00072b8000 - 0xfffffe0007004000 -> 0x2B4000

That number is then subtracted from:

Kernel text exec base:  0xfffffe001cd8c000

0xfffffe001cd8c000 - 0x2B4000 -> 0xFFFFFE001CAD8000
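If you end up doing this often, the arithmetic can be scripted. A minimal sketch, assuming the KDK kernel file shown above and the "Kernel text exec base" value taken from your own panic log:

# Pull the first __TEXT and __TEXT_EXEC vmaddr values out of the kernel.
KERNEL=kernel.release.t8112
TEXT=$(otool -l "$KERNEL" | grep -A 3 'segname __TEXT$' | awk '/vmaddr/ { print $2; exit }')
TEXT_EXEC=$(otool -l "$KERNEL" | grep -A 3 'segname __TEXT_EXEC' | awk '/vmaddr/ { print $2; exit }')
printf 'offset = 0x%x\n' $(( TEXT_EXEC - TEXT ))   # -> 0x2b4000 for this kernel
# atos load address = "Kernel text exec base" (from the panic log) - offset
# 0xfffffe001cd8c000 - 0x2b4000 = 0xfffffe001cad8000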

Plugging that value into your atos command:

atos -o kernel.release.t8112 -arch arm64e -l 0xFFFFFE001CAD8000 0xfffffe001cde8ab8 0xfffffe001cf44894 0xfffffe001cf42a90 0xfffffe001cd938c0 0xfffffe001cde83b0 0xfffffe001d66d670 0xfffffe001d86a708 0xfffffe001d86941c 0xfffffe001d866504 0xfffffe001e234bc8 0xfffffe001cf4611c 0xfffffe001cd93938 0xfffffe001cee0a80 0xfffffe001cee0a80 0xfffffe001d38e408 0xfffffe001d3e9304 0xfffffe001ce8c24c 0xfffffe001cedf514 0xfffffe001cd9cc04

Gives us:

handle_debugger_trap (in kernel.release.t8112) (debug.c:1839)
handle_uncategorized (in kernel.release.t8112) (sleh.c:1591)
sleh_synchronous (in kernel.release.t8112) (sleh.c:1477)
fleh_synchronous (in kernel.release.t8112) + 44
panic_trap_to_debugger (in kernel.release.t8112) (debug.c:1400)
panic_with_options_and_initiator (in kernel.release.t8112) (debug.c:1213)
0xfffffe001d86a708
0xfffffe001d86941c
0xfffffe001d866504
0xfffffe001e234bc8
sleh_irq (in kernel.release.t8112) (sleh.c:3840)
fleh_irq (in kernel.release.t8112) + 64
vm_object_upl_request (in kernel.release.t8112) (vm_pageout.c:6213)
vm_object_upl_request (in kernel.release.t8112) (vm_pageout.c:6213)
ubc_create_upl_kernel (in kernel.release.t8112) (ubc_subr.c:2643)
vnode_pageout (in kernel.release.t8112) (vnode_pager.c:407)
vnode_pager_data_return (in kernel.release.t8112) (bsd_vm.c:451)
vm_pageout_iothread_external_continue (in kernel.release.t8112) (vm_pageout.c:4237)
Call_continuation (in kernel.release.t8112) + 196

Note the references to "panic...", which are an easy way to validate that the symbolication is valid. By its nature, the kernel cannot "crash" in the same sense any other process can, as the kernel IS the core component that controls all of the mechanisms that cause OTHER processes to crash. Because of that, "all" kernel panics end in a call to "panic", because that's the function that ends up generating the log data you're looking at and then stopping normal execution.

Anticipating your question: no, I don't know why the other 4 frames aren't symbolicating, but I don't think it's an issue with atos. Apple obviously has our own internal symbolication tools, and they're getting exactly the same result. Are they from your KEXT and you just didn't include the KEXT address info?

Having said that, they also don't really matter for this particular panic. That's because of this message at the top of the log:

panic(cpu 4 caller 0xfffffe001d86a708): watchdog timeout: no checkins from watchdogd in 90 seconds (25 total checkins since monitoring last enabled)

This message basically means exactly what it sounds like. There's a system in place (managed by "watchdogd") to verify that user space is responsive: it requires specific, critical system daemons to "check in" with the kernel. If they fail to do so, then the system times out and the kernel panics. That dynamic also means that the specific stack trace is largely irrelevant, as it simply shows what happened to be executing at the time the panic occurred, NOT what actually caused the panic.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Thanks. That makes perfect sense now.

I got the same symbolicated output on my system following your instructions.

Are they from your KEXT and you just didn't include the KEXT address info?

That particular kernel panic is reproducible whenever a very large zip file is being extracted over my filesystem.

With watchdog panics, is there a way of telling what caused watchdogd to emit a panic?

That particular kernel panic is reproducible whenever a very large zip file is being extracted over my filesystem.

With watchdog panics, is there a way of telling what caused watchdogd to emit a panic?

Not with certainty, but you may be able to get better "clarity". A few things to try:

  • I'm not sure exactly where you've been getting your panic log data, but the "full" log should also include a spindump of every thread on the system. The log is quite large and painful to symbolicate, but if you file a bug with one of those full logs and post the number back here, I can take a look and see if that offers any hints.

  • The KDK instructions also describe how to set up two-machine debugging and/or a core dump server. You can also find a pretty good overview of the kernel debugging tools inside the XNU source.

You can also look at this from the general issue side. Going back to here:

...reproducible whenever a very large zip file is being extracted over my filesystem.

Particularly if you're not using our flash storage (reading between the lines of your panic trace), at its core, zip extraction is basically just pure I/O. The data is being read from a source and written to a target, and the CPU itself is virtually guaranteed to be able to process (decompress) data faster than one of those two targets. SO, depending on the nature of the data and the devices, one of a limited set of things will occur:

  • Extraction will saturate the read source (because it's the slowest).

  • Extraction will saturate the VFS system (because the content is lots of small objects).

  • Extraction will saturate the write target (because it's the slowest).

A panic then occurs because some kind of stall/hang delays critical system functions long enough that the watchdog panics. That's typically because:

  1. Either of the storage devices ends up bottlenecking in a way that stalls other critical activity.

  2. "Something else" REACTS to what's happening in a way that generates the stall.

First off, the second point is what I would actually look at first, particularly by looking for any kind of Endpoint Security client. The ES API is unavoidably designed/positioned such that it can create EXACTLY this kind of failure, so there's no point trying to debug further until you've ruled out ES as the failure source.
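If it helps, a quick (if incomplete) way to look for that kind of software is just to list what's installed; for example:

# List installed system extensions (Endpoint Security clients show up here)...
systemextensionsctl list
# ...and any loaded third-party kernel extensions.
kmutil showloaded | grep -v com.apple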

Beyond that, you may also be able to debug this by looking at this as a hang/stall, NOT a panic. That is, if the system panics with a "1 GB" file but does NOT panic with a "900 MB" file, that is (probably) NOT because of any fundamental difference between those two sizes but simply because the smaller case finished before the watchdog gave up. Looking at the console log or running tools like spindump against the non-panic case may give you enough information to resolve the bottleneck, which will then prevent the final panic.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Thanks a lot for your really helpful and detailed responses.

Two-machine debugging on Apple Silicon has certain limitations, which are worded in the KDK Readme as follows:

Note: Apple silicon doesn’t support active kernel debugging. You may inspect the current state of the kernel when it is halted due to a panic or NMI. However, you cannot set breakpoints, continue code execution, step into code, step over code, or step out of the current instruction.

x86-64 is better suited for remote kernel debugging. This particular panic doesn't occur on x86-64 though. Only on Apple Silicon.

I could try and set up a core dump server/client and see how that pans out.

Before I go ahead and file the bug report you suggested (submitting a full panic log), a note on the ES angle you mentioned. In the full log, I found a reference to this:

"name":"AFKDCPEXTEndpoint9"},"1597":{"id":1597,"system_usec":104537416,"schedPriority":91,"state":["TH_RUN"],"snapshotFlags":["kKernel64_p","kThreadIOPassive","kThreadOnCore"],"user_usec":0,"kernelFrames":[[0,377776],[0,9311856],[10,23512],[10,18668],[10,6612],[11,6952],[0,1810716],[0,31032],[0,1395328],[0,1395328],[0,6300680],[0,6673156],[0,1049164],[0,1389844],[0,68612]],"basePriority":91,"lastMadeRunnableTime":3.1102084579999998,"userTime":0,"lastRunTime":3.1102656249999998,"systemTime":104.537416333,

Which corresponds to the panicked thread id:

Panicked thread: 0xfffffe31ffe31fc0, backtrace: 0xfffffe81838478d0, tid: 1597

That thread name, AFKDCPEXTEndpoint9, wouldn't be a part of the ES API you were referring to, would it?

That thread name, AFKDCPEXTEndpoint9, wouldn't be a part of the ES API you were referring to, would it?

It is not. I won't attempt to describe what it's actually "for", but it's part of a fairly general-purpose architecture for interacting with coprocessors. It's possible that it does have some role in this case, but that role is very likely secondary ("you did something that eventually caused it to stall, which then caused the panic"), not primary ("it stalled on its own, which then caused the panic").

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Hello Kevin,

I was wondering if you've had a chance to take a look at FB17497530 yet?

Thanks.

I was wondering if you've had a chance to take a look at FB17497530 yet?

I spent an hour or so looking at it today and have a few thoughts.

First off, something is clearly happening in the VM system. I don't know what the issue might actually be, but 200+ threads are blocked in vm_page_wait and I didn't find any thread that was showing clear "progress" in the VM system.

How much space is available on the boot volume, and is there ANY possibility that it's becoming constrained? Like most modern OSes, the system is going to "fail" if it actually exhausts available swap. There are systems in place to mitigate that (first by warning the user, then by terminating processes), but there isn't any way to guarantee that will work, and it's possible that some nuance of your implementation is bypassing those protections.

Related to that point, what's the content you're actually decompressing? One (or several) large files or a bunch of small ones? The thought that crossed my mind is that decompressing a very large file might cause a pre-allocation that ends up "suddenly" filling your boot volume faster than the normal mechanisms can react. That's not something I'd expect to be happening, but it's worth ruling out.
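For reference, a couple of quick checks along those lines (standard macOS commands, nothing specific to this panic):

df -h /               # free space on the boot volume
sysctl vm.swapusage   # current swap usage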

That particular kernel panic is reproducible whenever a very large zip file is being extracted over my filesystem.

Is your file system the source or the target for the zip file? I only found one VFS stack that looks like it might be your file system and it's a read, not a write call:

13524	Kernel Frames
0	kernel.release.t8112	0xfffffe0008acbaf4	thread_invoke + 1180	(sched_prim.c:3566)
1	kernel.release.t8112	0xfffffe0008aca59c	thread_block_reason + 244	(sched_prim.c:4183)
2	kernel.release.t8112	0xfffffe0008ba0ce4	vm_page_wait + 436	(vm_resident.c:5443)
3	kernel.release.t8112	0xfffffe0008ba29bc	vm_page_alloc_list + 96	(vm_resident.c:8581)
4	kernel.release.t8112	0xfffffe0008b562f8	kmem_alloc_guard_internal + 404	(vm_kern.c:908)
5	kernel.release.t8112	0xfffffe0008aa7298	kalloc_large + 360	(kalloc.c:2364)
6	kernel.release.t8112	0xfffffe0008fdbe40	_MALLOC + 68	(kern_malloc.c:119)
7	[15, 181932]
8	[15, 184316]
9	[15, 317620]
10	[15, 316216]
11	[15, 71024]
12	[15, 70676]
13	[15, 236844]
14	kernel.release.t8112	0xfffffe0008c7b994	cluster_io + 3416	(vfs_cluster.c:2336)
15	kernel.release.t8112	0xfffffe0008c8549c	advisory_read_ext + 756	(vfs_cluster.c:6258)
16	kernel.release.t8112	0xfffffe0008c857a4	cluster_read_ahead + 652	(vfs_cluster.c:2568)
17	kernel.release.t8112	0xfffffe0008c82d38	cluster_read_copy + 3524	(vfs_cluster.c:4942)
18	kernel.release.t8112	0xfffffe0008c81f28	cluster_read_ext + 376	(vfs_cluster.c:4456)
19	[15, 233920]
20	kernel.release.t8112	0xfffffe0008cc1eec	vn_read_common + 388	(vfs_vnops.c:1154)
21	kernel.release.t8112	0xfffffe0008cc1b6c	vn_read + 860	(vfs_vnops.c:1204)
22	kernel.release.t8112	0xfffffe0009022768	readv_internal + 560	(sys_generic.c:340)
23	kernel.release.t8112	0xfffffe0009022964	read + 300	(sys_generic.c:391)
24	kernel.release.t8112	0xfffffe000912aea0	unix_syscall + 772	(systemcalls.c:181)
25	kernel.release.t8112	0xfffffe0008beeb1c	sleh_synchronous + 888	(sleh.c:1257)
26	kernel.release.t8112	0xfffffe0008a3f8c0	fleh_synchronous + 44	
27	kernel.release.t8112	0xfffffe0008a3f884	fleh_dispatch64 + 412	

User Frames
0	libsystem_kernel.dylib	0x1804307dc	read + 8	
1	libarchive.2.dylib	0x18e446a00	client_read_proxy + 64	(archive_read.c:168)
2	libarchive.2.dylib	0x18e444e9c	__archive_read_filter_ahead + 628	(archive_read.c:1407)
3	libarchive.2.dylib	0x18e444c1c	__archive_read_ahead + 48	(archive_read.c:1340)
4	libarchive.2.dylib	0x18e4bad54	zip_read_data_none + 924	(archive_read_support_format_zip.c:1740)
5	libarchive.2.dylib	0x18e4b53f8	archive_read_format_zip_read_data + 512	(archive_read_support_format_zip.c:3189)
6	libarchive.2.dylib	0x18e446728	_archive_read_data_block + 192	(archive_read.c:989)
7	libarchive.2.dylib	0x18e4c9950	archive_read_data_block + 64	(archive_virtual.c:167)
8	libarchive.2.dylib	0x18e4f02e8	copy_data + 104	(archive_read_extract2.c:137)
9	libarchive.2.dylib	0x18e4f01f4	archive_read_extract2 + 208	(archive_read_extract2.c:99)
10	libarchive.2.dylib	0x18e46d588	archive_read_extract + 244	(archive_read_extract.c:74)

Related to that point, what's actually providing the storage for your file system? From the context in the bug, it sounds like this might be some form of network/remote file system, but I didn't find any obvious support daemon. Is it possible that some supporting component has failed?

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Thanks very much for looking into this.

Here are the points you wanted clarified.

How much space is available on the boot volume, and is there ANY possibility that it's becoming constrained?

That doesn't seem to be the case. See

what's the content you're actually decompressing? One (or several) large files or a bunch of small ones?

It is a bunch of files of varying sizes. Here's a

Is your file system the source or the target for the zip file?

It's actually both.

what's actually providing the storage for your file system? Is it possible that some supporting component has failed?

It is a network filesystem with I/O happening via RPC calls. Internally, the filesystem implements a socket polling algorithm that relies on the sock_upcall callback that's passed into sock_socket() on socket creation. The backing store on the remote server has been verified to have ample disk space available.

Do let me know if there's anything else I can provide that would help identify the cause of the problem.

Note: The message below is structured in the order I wrote it, but the later section on the VM system is probably what's most important. The short summary is that I think "something" you're doing is pushing the VM system past where it should be operating. The watchdog panic just happened to be how that underlying problem presented.

It is a bunch of files of varying sizes. Here's a

Yeah, that's ~60 GB of free space, so you shouldn't be that "tight".

Is your file system the source or the target for the zip file?

It's actually both.

Interesting. A few questions on that point then:

  • Does the panic happen when you're just the source and/or destination, or only when you're "both"? I suspect it will happen when you're just the destination and won't happen when you're just the source, but I'd like to confirm that.

  • How fast is the I/O path back to the server? Are you saturating that connection?

  • Is the I/O pipeline here simply "read compressed file from network -> decompress data -> write data out to network"? Or is there any intermediary in that process?

  • What actually makes it "back" to the server before the panic occurs? How much data were you actually able to write?

This is just a guess, but the compression process here means there is an asymmetry between your effective read performance and your write performance. How does your write process actually "work"? Is there anything that would limit/constrain how much data you have pending (waiting to be sent over the network)? This is purely speculative, but I could imagine a situation where you run the machine out of memory by blindly accepting data to be written much faster than you're able to send the data out.

Do let me know if there's anything else I can provide that would help identify the cause of the problem.

First off, just to be clear: given that your file system is involved, it's very likely that your file system is the underlying cause, even though the panic doesn't directly implicate you. There are a few different things I would look at here:

  • I would get Archive Utility out of the process. My guess is that you can replicate the panic with a basic tool that simply opens lots of files and writes a similar data volume at high speed, so that's probably where I'd start (see the sketch after this list). One suggestion here: I'd either generate random data or use the uncompressed data as the source, NOT "NULL" or simple patterns. If the data you work with is easily compressed, then the memory compressor could end up masking the problem.

  • Related to that point, I would write this as an app and have it write progress/diagnostic data to the screen. Part of what makes this process difficult is getting data "close" to the point of failure, and writing to the screen means you can capture data much closer to the point where the panic actually happens.

  • I'd also try adding the Network Link Conditioner into the mix and see how that changes things. If my theory is right, then reducing network performance should make the problem worse (as the machine is forced to sit on more data waiting to be written). If you're not familiar with the Network Link Conditioner, it's a preference pane that allows you to artificially restrict the Mac's network connection, simulating arbitrarily poor network conditions. You can find it in our "Additional Tools for Xcode", which you can download here.
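On the first bullet, a rough shell sketch of that kind of load generator (the target directory, file count, and file size are placeholders; adjust them to roughly match what the archive produces):

# Write 1000 files of 64 MB of random data to the file system under test, as fast as possible.
TARGET=/Volumes/MyRemoteFS/stress   # placeholder: a directory on your file system
mkdir -p "$TARGET"
for i in $(seq 1 1000); do
  dd if=/dev/urandom of="$TARGET/file_$i" bs=1m count=64 2>/dev/null
done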

The overall goal here is to first get a clearer understanding of what's actually happening leading up to the panic and, if that doesn't provide clarity, to let you reliably get the machine close "enough" to the point where it panics that you can see what's going on while the machine is still running (so you can get diagnostic data).

Finally, a question here:

Internally, the filesystem implements a socket polling algorithm that relies on the sock_upcall callback that's passed into sock_socket() on socket creation.

My big question here is what memory is actually holding the file data that's waiting to be written to your server. My concern is that this is happening directly in the kernel and/or is entangled with kernel memory in a way that ultimately forces the panic you're seeing. For reference, here is the state of memory from the panic:

active       695500
fileBacked   1390993
free         903
inactive     1
purgeable    0
speculative  695500
throttled    0
wired        100550

Note that both of these numbers jump out at me as VERY large (converting at the 16 KB page size shown in the vm_stat output below, e.g. 1390993 pages x 16384 bytes ≈ 21 GB):

fileBacked   1390993 -> ~21 GB
speculative  695500 -> ~10 GB

And this is downright SCARY:

free         903 -> ~14 MB

For comparison, here are the same values from my machine when I run "vm_stat", reordered and edited to match the list above:

Mach Virtual Memory Statistics: (page size of 16384 bytes)
...
Pages active:                            188532.
File-backed pages:                       136625.
Pages free:                               13765.
Pages inactive:                          180253.
Pages purgeable:                           1509.
Pages speculative:                         5770.
Pages throttled:                              0.
Pages wired down:                        299012.

Note that my machine is not REMOTELY idle. This was my machine state as I was writing this with 20+ apps running and an uptime of 28 days.

My informal "rule of thumb" is that the system tends to run at ~80-90% of available memory, leaving the extra 10-20% as the "cushion" it uses to deal with demand spikes. Pulling from the numbers below, the machine I'm working from has 16 GB and my free pages are:

Pages free: 13765-> ~2g

...exactly what I'd expect. Frankly, I'm surprised you managed to get as low as ~14 MB before you panicked. As an immediate test, I'd suggest running vm_stat while testing with "larger" archives (basically, the biggest archive that doesn't panic) and seeing how the numbers shift. I think you'll find that you're causing serious spikes as well; they're just not big enough to cause the panic.
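For example, vm_stat accepts an interval argument, so something like this (run in a second Terminal window while the extraction is going) gives you a second-by-second record to compare against; the log path is just an example:

# Sample memory statistics once a second; watch how the free, speculative,
# and file-backed counts move as the extraction progresses.
vm_stat 1 | tee /tmp/vmstat-during-extract.log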

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Does the panic happen when you're just the source and/or destination, or only when you're "both"? I suspect it will happen when you're just the destination and won't happen when you're just the source, but I'd like to confirm that.

You were right in assuming that the panic occurred when my filesystem was the destination. I was able to verify that.

How fast is the I/O path back to the server? Are you saturating that connection?

The connection is not likely to be saturated, as this is a 100 Gb link on a Thunderbolt 5 interface.

Is the I/O pipeline here simply "read compressed file from network-> decompress data-> write data out to network"? Or is there any intermediary in that process?

The I/O pipeline is as you described it with no intermediary involved.

What actually makes it "back" to the server before the panic occurs? How much data were you actually able to write?

On two subsequent runs, around 41-42 GB out of 64 GB of data were written before the panic ensued.

du -smx ./25116_CINEGRAPHER_MARCCAIN
41303	./25116_CINEGRAPHER_MARCCAIN

du -smx ./25116_CINEGRAPHER_MARCCAIN
42600	./25116_CINEGRAPHER_MARCCAIN

How does your write process actually "work"? Is there anything that would limit/constrain how much data you have pending (waiting to be sent over the network)?

The source uio_t buffer passed into vnop_write() is a userspace buffer. Before passing it down to sock_send(), which operates on kernel-resident memory buffers, we create a kernelspace copy of the userspace uio_t buffer of size equal to uio_resid(uspace_uio). The copying is performed by uiomove() incrementally, in chunks equal to either the amount of data left in the userspace buffer or the kernel's copysize_limit_panic, whichever is smaller.

The kernelspace uio_t buffer is further split up into smaller chunks of data pertinent to the filesystem design which end up being passed into sock_send().

Reading is done in a similar fashion, the only difference being the use of sock_receive_mbuf() instead of sock_receive(), which uses a uio_t buffer instead of an mbuf.

I'm onto the debugging strategies you suggested now. I'll report back on my findings as they emerge.

Thanks once again for all your help.

Hopefully, we'll be able to resolve this soon.

You were right in assuming that the panic occurred when my filesystem was the destination. I was able to verify that.

OK, that matches up with the memory side of this, since the kernel can't just "throw away" data it needs (or thinks it needs) to write in the same way read data can be.

Shifting to here:

The connection is not likely to be saturated as this is a 100Gb Link on a thunderbolt 5 interface.

So, one thing I would immediately do is DRAMATICALLY restrict that connection, probably down to Mb/s speeds, then watch what happens with files in the 1-2 GB range. That should slow the output rate down enough that you'll be able to see accumulation occurring with transfers that are small enough that they won't threaten system stability.

I suspect what you'll find is that you're accepting writes at "full" speed, leaving the kernel "stuck" with 1+ GB of memory slowly streaming out over that connection. That's not acceptable, since it means:

  • It's inevitably going to cause some kind of panic.

  • Even without a direct panic, any other kind of failure means all that unwritten data is lost.

Expanding on that last point: while write caching is a basic part of the system's design, you don't want to create a situation where you've told the user that something like a copy is "done" but the data transfer isn't actually finished until minutes later. The bottom line here is that you ultimately need to constrain how quickly you complete write calls so that they're roughly in line with your underlying ability to transfer data.

Reading is done in a similar fashion, the only difference being the use of sock_receive_mbuf() instead of sock_receive(), which uses a uio_t buffer instead of an mbuf.

Note that reads can't really cause the same issue as writes. Your VFS driver "returns" that read data to the larger system, at which point the VM system has its full "toolbox" for managing that memory. In the case of something like decompression (and lots of other use cases) that toolbox doesn't even matter that much, as user space is disposing of the read data at roughly the same rate you're returning it, preventing any accumulation.

BTW, how much memory does this machine have? Is it 32 GB or 16 GB?

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware
