macOS Hang After Implementing UserMapHBAData()

Hi everyone,

I was following the Video Modernize PCI and SCSI drivers with DriverKit and the Document to implement UserMapHBAData(), and here’s my current implementation:

// kern_return_t DRV_MAIN_CLASS_NAME::UserMapHBAData_Impl(uint32_t *uniqueTaskID)
kern_return_t IMPL(DRV_MAIN_CLASS_NAME, UserMapHBAData)
{
    Log("UserMapHBAData() - Start");
    
    // Define the vm_page_size explicitly
    const uint32_t vm_page_size = 4096;

    kern_return_t ret;
    IOBufferMemoryDescriptor *buffer = nullptr;
    IOMemoryMap *memMap = nullptr;
    void *taskData = nullptr;

    // Create a buffer for HBA-specific task data
    ret = IOBufferMemoryDescriptor::Create(kIOMemoryDirectionOutIn, ivars->fTaskDataSize, vm_page_size, &buffer);
    __Require((kIOReturnSuccess == ret), Exit);

    // Map memory to the driver extension's memory space
    ret = buffer->CreateMapping(0, 0, 0, 0, 0, &memMap);
    __Require((kIOReturnSuccess == ret), Exit);

    // Retrieve mapped memory address
    taskData = reinterpret_cast<void *>(memMap->GetAddress());
    __Require(taskData, Exit);

    // WARNING: Potential leak of an object stored into 'buffer'
    // WARNING: Potential leak of an object stored into 'memMap'

    // Assign a unique task ID
    ivars->fTaskID++; // ERROR: No member named 'fTaskID' in 'DriverKitAcxxx_IVars'
    ivars->fTaskArray[ivars->fTaskID] = taskData;
    *uniqueTaskID = ivars->fTaskID;

    Log("UserMapHBAData() - End");
    
    return kIOReturnSuccess;

Exit:
    // Cleanup in case of failure
    if (memMap) {
        memMap->free();  // Correct method for releasing memory maps
    }

    if (buffer) {
        buffer->free();  // Correct method for releasing buffer memory
    }
    
    LogErr("ret = 0x%0x", ret);
    Log("UserMapHBAData() - End");
    
    return ret;
}

For reference, in KEXT, memory allocation is typically done using:

IOBufferMemoryDescriptor *buffer = IOBufferMemoryDescriptor::inTaskWithOptions(
    kernel_task,          // Task in which memory is allocated
    kIODirectionOutIn,    // Direction (read/write)
    1024,                 // Size of the buffer in bytes
    4);                   // Alignment requirements

However, after installing the dext, macOS hangs, and I have to do a hardware reset. After rebooting, the sysctl list output shows:

% sectl list
1 extension(s)
--- com.apple.system_extension.driver_extension
enabled active  teamID  bundleID (version)  name    [state]
    *   -   com.accusys.DriverKitAcxxx (5.0/11) com.accusys.DriverKitAcxxx  [activated waiting for user]

Questions:

  • What could be causing macOS to halt?
  • How should I approach debugging and resolving this issue?

Looking forward to your insights, any suggestions would be greatly appreciated!

Best regards, Charles

Answered by DTS Engineer in 838693022

What could be causing macOS to halt?

First off, as general note, I think it's important to emphasize that DEXT development is FAR closer to KEXT development than it is standard app development. The DEXT architecture is VASTLY more secure that KEXTs and the impact "scope" of DEXT is far smaller. However, DEXTs are fundamentally still drivers and are FULLY capable of hanging or panic'ing the kernel, particularly the lower level PCI and SCSI families.

Looking at your specific code, there are a few things I notice:

  • I don't think it will make any difference but you can use IOVMPageSize to check the pages size and, strictly speaking, it's 16k on Apple Silicon, not 4k.

  • You're retrieving the address using GetAddress(), but that means you didn't retrieve/check the length. Both the address and the length are more typically retrieved through the returned arguments of "Map()".

  • You should be "fully" processing the memory in UserMapHBAData, not mapping it. Our own driver takes the map all the way through IODMACommand->PrepareForDMA so that it has the final physical address.

Having said all that, I suspect that you didn't actually hang here, but actually stalled at a later point when actually trying to start I/O.

That leads to here:

How should I approach debugging and resolving this issue?

A few different comments/suggestions:

  1. Keep in mind that significant log tends to be lost when the machine panic/power cycles, so don't assume that log data you retrieve after a reboot is necessarily accurate.

  2. I would try is creating an ssh session into the machine, using that to monitor the system log, then triggering the hang. It's possible that enough of the system is still functioning normally that you'll be able to see the kernel log data, at which point you can then use standard "print" debugging to investigate further.

  3. You can also use two machine kernel debugging to "actively" debug your DEXT. The instructions for that as well as kernel symbols are in the KDKs (Kernel Debug Kits) found in the "More Downloads" section of the developer portal. Note that there are issues with the "standard" atos (this does not effect lldb) symbolication process, the details of which are described here.

Finally, as a general comment, my experience has been that what's often helpful is to slow down and carefully validate every step of the process instead of focusing on getting it "all" to work. In the case here, I would start by simply getting the I/O "pipeline" working WITHOUT actually doing any I/O. That is, implementing every method and having each step "succeed" without actually touching hardware. Next, I'd try to get I/O working, but I'd "rig" the process so that I'm only actually doing a single I/O to a "fixed" addresses, not trying to setup the fully functional I/O pipeline. Finally, I'd then expand that approach to the "final" full I/O pipeline. The goal here is to ensure that at every point the failure "range" is narrow enough that it's relatively easy to figure out what actually failed, avoiding the problem where "everything" has been implemented, but nothing actually works.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

What could be causing macOS to halt?

First off, as general note, I think it's important to emphasize that DEXT development is FAR closer to KEXT development than it is standard app development. The DEXT architecture is VASTLY more secure that KEXTs and the impact "scope" of DEXT is far smaller. However, DEXTs are fundamentally still drivers and are FULLY capable of hanging or panic'ing the kernel, particularly the lower level PCI and SCSI families.

Looking at your specific code, there are a few things I notice:

  • I don't think it will make any difference but you can use IOVMPageSize to check the pages size and, strictly speaking, it's 16k on Apple Silicon, not 4k.

  • You're retrieving the address using GetAddress(), but that means you didn't retrieve/check the length. Both the address and the length are more typically retrieved through the returned arguments of "Map()".

  • You should be "fully" processing the memory in UserMapHBAData, not mapping it. Our own driver takes the map all the way through IODMACommand->PrepareForDMA so that it has the final physical address.

Having said all that, I suspect that you didn't actually hang here, but actually stalled at a later point when actually trying to start I/O.

That leads to here:

How should I approach debugging and resolving this issue?

A few different comments/suggestions:

  1. Keep in mind that significant log tends to be lost when the machine panic/power cycles, so don't assume that log data you retrieve after a reboot is necessarily accurate.

  2. I would try is creating an ssh session into the machine, using that to monitor the system log, then triggering the hang. It's possible that enough of the system is still functioning normally that you'll be able to see the kernel log data, at which point you can then use standard "print" debugging to investigate further.

  3. You can also use two machine kernel debugging to "actively" debug your DEXT. The instructions for that as well as kernel symbols are in the KDKs (Kernel Debug Kits) found in the "More Downloads" section of the developer portal. Note that there are issues with the "standard" atos (this does not effect lldb) symbolication process, the details of which are described here.

Finally, as a general comment, my experience has been that what's often helpful is to slow down and carefully validate every step of the process instead of focusing on getting it "all" to work. In the case here, I would start by simply getting the I/O "pipeline" working WITHOUT actually doing any I/O. That is, implementing every method and having each step "succeed" without actually touching hardware. Next, I'd try to get I/O working, but I'd "rig" the process so that I'm only actually doing a single I/O to a "fixed" addresses, not trying to setup the fully functional I/O pipeline. Finally, I'd then expand that approach to the "final" full I/O pipeline. The goal here is to ensure that at every point the failure "range" is narrow enough that it's relatively easy to figure out what actually failed, avoiding the problem where "everything" has been implemented, but nothing actually works.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

I figured out the reason why macOS crashed — it was because I called SetDispatchQueue() before assigning a value to the queue variable.

// Create and set dispatch AuxiliaryQueue
ret = IODispatchQueue::Create("AuxiliaryQueue",
                              kIODispatchQueueReentrant, // Options (e.g., allow reentrancy)
                              0, // Priority
                              &ivars->fAuxQueue);
if (!ivars->fAuxQueue) LogErr("Create AuxiliaryQueue fail");
__Require((kIOReturnSuccess == ret), Exit);

ret = SetDispatchQueue("AuxiliaryQueue", ivars->fAuxQueue);
if (kIOReturnSuccess != ret) LogErr("SetDispatchQueue AuxiliaryQueue");
__Require((kIOReturnSuccess == ret), Exit);

// Create and set dispatch InterruptQueue
ret = IODispatchQueue::Create("InterruptQueue",
                              kIODispatchQueueReentrant,
                              0,
                              &ivars->fIntQueue);
if (!ivars->fIntQueue) LogErr("Create InterruptQueue fail");
__Require((kIOReturnSuccess == ret), Exit);

// AME_IO_milliseconds_Delay(2000); // 200ms delay

#if 0 // Fail to SetDispatchQueue()
ret = SetDispatchQueue("InterruptQueue", ivars->fIntQueue);
if (kIOReturnSuccess != ret) LogErr("SetDispatchQueue InterruptQueue");
__Require((kIOReturnSuccess == ret), Exit);
#endif

However, there's still an issue: the second call to SetDispatchQueue() fails, even if I add a sleep before it.

UserInitializeController() - Start
!!! ERROR : SetDispatchQueue InterruptQueue !!!
!!! ERROR : ret = 0xe00002c2 !!!
UserInitializeController() - End

Additionally, at first, UserMapHBAData() wasn't working either. The logs showed the following error:

UserMapHBAData() - Start
!!! ERROR : Buffer allocation failed !!!
!!! ERROR : ret = 0xe00002bd !!!
UserMapHBAData() - End

Eventually, I was able to fix it by experimenting with different parameter values:

// ivars->fTaskDataSize = 64;
// ivars->fTaskDataSize = 128;
// ivars->fTaskDataSize = 512;
ivars->fTaskDataSize = 1024;
// ivars->fTaskDataSize = 4096;
const uint32_t alignment = 4;
// const uint32_t alignment = 0; // Pass 0 for no guaranteed alignment.

ret = IOBufferMemoryDescriptor::Create(kIOMemoryDirectionOutIn,
                                       ivars->fTaskDataSize,
                                       alignment,
                                       &buffer);
if (!buffer) LogErr("Buffer allocation failed!");
Log("ivars->fTaskDataSize = %llu, alignment = %d", ivars->fTaskDataSize, alignment);
__Require((kIOReturnSuccess == ret), Exit);

As you mentioned, the best documentation is working sample code. If Apple could provide complete, buildable source examples, it would be an immense help, both to Apple and to developers.

Lastly, you said:

In the case here, I would start by simply getting the I/O ‘pipeline’ working WITHOUT actually doing any I/O.

How should I go about doing that, exactly?

Best Regards, Charles

However, there's still an issue: the second call to SetDispatchQueue() fails, even if I add a sleep before it.

This isn't really an I/O call (it's basically adds and entry to a lookup table inside you DEXT), so the delay wouldn't matter.

!!! ERROR : ret = 0xe00002c2 !!!

The error "0xe00002c2" is "kIOReturnBadArgument". Have you marked any of the methods in your iig file with "QUEUENAME"-> InterruptQueue? As far as I can tell, the only error SetDispatchQueue can ever return is "kIOReturnBadArgument" and the only reason it return that is that the corresponding QUEUENAME() declaration is missing.

Eventually, I was able to fix it by experimenting with different parameter values:

Glad to here that you've got this working.

How should I go about doing that, exactly?

I'll admit, I haven't tried to implement this in detail, however, fundamentally, the I/O system is "blind". That is, the only reason it "knows" a write worked is because your driver said it did. I think you have enough control of the "configuration" side that you could "fake" what's necessary to present a block storage device to the system. At that point, I think you could then:

  • Fail every read. That should fail the partition scheme driver, leaving you with a dev node the system will ignore. That's fine, since you were never going to mount it anyway.

  • Succeed every write, discarding data.

At that point, you should be able to open up the rdev node to the device and send writes to the device. Note particularly elegant or functional, but it's a starting point you can build up from.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Following is the sequence of current implementation

Start()
Timer_Init()
UserInitializeController()

In UserInitializeController(), we loop to check whether the RAID is ready and also watch a flag to see if UserDoesHBAPerformDeviceManagement() has been called, then invoke the TargetForID_Init() to call UserCreateTargetForID()

Then we proceed with:

Timer_Start()
ZtsTimerOccurred()
UserGetDMASpecification()
UserReportInitiatorIdentifier()                   // *identifier = 65
UserReportHighestSupportedDeviceID() //  *id = 63
UserReportMaximumTaskCount()             //  *count = 200
UserMapHBAData()                                    // run 200 times
UserStartController()
UserDoesHBAPerformDeviceManagement()
UserInitializeTargetForID()                         // runs for Target IDs 0 through 63
                                                                      // Each call is paired with UserProcessParallelTask()
TargetForID_Init()                                        // be invoked in a loop process within UserInitializeController() 

Inside the log of TargetForID_Init() we see:

TargetForID_Init() - Start
AsyncEventHandler(ivars->fAsyncEventHandler, 0)
!!! ERROR : Failed at LUN 0: 0xe00002d8 !!!
...
Initialize Total 64 TargetForID, 0 Successfully
TargetForID_Init() - End

Questions:

  1. Does UserCreateTargetForID() have to be called only after UserDoesHBAPerformDeviceManagement()?

  2. What exactly does the error message 0xe00002d8 mean in this context?

  3. I never explicitly called UserInitializeTargetForID(), so why is it being executed?

Looking forward to your reply. Really appreciate it!

Best Regards, Charles

  1. I never explicitly called UserInitializeTargetForID(), so why is it being executed?

Did "UserDoesHBAPerformDeviceManagement" return "true"? If so, then what actually happens is that the SCSI driver itself will call "CreateTargetForID", which will then call UserInitializeTargetForID. Note that this means your DEXT will generally NOT call "UserCreateTargetForID".

Also, as broader context, some DEXT methods are designed for you to call (like "UserCreateTargetForID") while other are intended for the system to call (like "UserInitializeTargetForID").

  1. Does UserCreateTargetForID() have to be called only after UserDoesHBAPerformDeviceManagement()?

You shouldn't call it before UserDoesHBAPerformDeviceManagement is called, but I believe a controller that implemented it's on device management would call it "anytime" (for example to support hot plug).

  1. What exactly does the error message 0xe00002d8 mean in this context?

The error itself is kIOReturnNotReady. Looking at our code and what you've described above, I think what you're hitting is a low level reentrancy check inside DriverKit's IPC system. I suspect that what's actually going on is that CreateTargetForID is still in flight, which means you tried to reenter it by calling "UserCreateTargetForID".

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Did "UserDoesHBAPerformDeviceManagement" return "true"?

Yes, it returns TRUE, and at the same time sets ivars->bUserDoesHBAPerformDeviceManagement as true. This ensures that TargetForID_Init() waits until UserDoesHBAPerformDeviceManagement() has completed before executing AsyncEventHandler(), which then calls UserCreateTargetForID().

IMPL(DRV_MAIN_CLASS_NAME, UserDoesHBAPerformDeviceManagement)
{
    Log("UserDoesHBAPerformDeviceManagement() - Start");
    bool ret = TRUE;
    ivars->bUserDoesHBAPerformDeviceManagement = true;
    Log("UserDoesHBAPerformDeviceManagement() - End");
    return ret;
}

If AsyncEventHandler() is not called, then UserCreateTargetForID() will not be invoked by the system. Additionally, TargetForID_Init() is called by AME_Host_RAID_Ready(), that runs inside a while loop in the UserInitializeController().

while (1)
{
        bool sts;
        sts = AME_Check_RAID_Ready_Status(pAMEData);
        bool b_device_menagement = ivars->bUserDoesHBAPerformDeviceManagement;
        if (sts == TRUE && ivars->bUserDoesHBAPerformDeviceManagement) {
            AME_Host_RAID_Ready();  // will call TargetForID_Init() to invoke AsyncEventHandler() and then call UserCreateTargetForID()
            ...
            break;
        }
}

I suspect that what's actually going on is that CreateTargetForID is still in flight, which means you tried to reenter it by calling "UserCreateTargetForID".

How can I avoid re-entering UserCreateTargetForID() and ensure that CreateTargetForID() has already been executed?

Best Regards, Charles

First off, I need to correct something I said:

Did "UserDoesHBAPerformDeviceManagement" return "true"?

That should have sat "false". In other words, if UserDoesHBAPerformDeviceManagement returns:

  • "false"-> The system handles iterating the bus, so you don't need to call UserCreateTargetForID.

  • "true"-> Your controller is responsible for iterating the bus, and so you WILL need to call "UserCreateTargetForID".

Sorry for the confusion here.

Next, looking at other points:

Additionally, TargetForID_Init() is called by AME_Host_RAID_Ready(), that runs inside a while loop in the UserInitializeController().

Two things that concern me here:

  1. Did you actually mean "UserInitializeController" here? That method is a "preparation" method (analogous to "init" in IOKit), not the actual "start" of your controller. The method "UserStartController" is when your controller is actually expected to start "working". In my concrete terms, "UserInitializeController" is where you'd do things like

setup dispatch queues, create action handlers, and allocate memory. However, "UserStartController" is when you'd actually call your async action handler, which would iterate the bus, etc.

  1. As a general rule, blocking in DriverKit makes me very nervous. There are cases where it may be appropriate, but the issue here is that blocking your KEXT will also block the underlying kernel driver, which can then prevent THAT driver from operating "normally".

How can I avoid re-entering UserCreateTargetForID() and ensure that CreateTargetForID() has already been executed?

Expanding on that second point, when does this while loop return:

while (1)
{
        bool sts;
        sts = AME_Check_RAID_Ready_Status(pAMEData);
        bool b_device_menagement = ivars->bUserDoesHBAPerformDeviceManagement;
        if (sts == TRUE && ivars->bUserDoesHBAPerformDeviceManagement) {
            AME_Host_RAID_Ready();  // will call TargetForID_Init() to invoke AsyncEventHandler() and then call UserCreateTargetForID()
            ...
            break;
        }
}

Code like this blocks whatever thread it's called on and that's a real problem if that thread is also need for UserCreateTargetForID to finish.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

macOS Hang After Implementing UserMapHBAData()
 
 
Q