I’m working with Apple dispatch queues (GCD) in C with the following design: multiple dispatch queues enqueue tasks into a shared context, and a dedicated dispatch queue (let’s call it dispatch queue A) processes these tasks. However, it seems this design has a memory visibility issue.
Here’s a simplified version of my setup:
- I have a shared_context struct that holds:
  - task_list: a list that stores tasks to be prioritized and run — this list is only modified/processed by dispatch queue A (a serial dispatch queue), so I don't lock around it.
  - cross_thread_tasks: a list that other queues push tasks into, protected by a lock.
- Other dispatch queues call a function schedule_task that:
  - locks and appends a new task to cross_thread_tasks
  - calls dispatch_after_f() to schedule process_task() on dispatch queue A
- process_task() processes the task_list and is repeatedly scheduled on dispatch queue A. It:
  - swaps cross_thread_tasks into a local list (with locking)
  - pushes those tasks into task_list
  - runs tasks from task_list
  - reschedules itself via dispatch_after_f()
Problem:
Sometimes the tasks pushed from other threads don’t seem to show up in task_list when process_task() runs. The task_list appears to be missing them, as if the cross-thread tasks aren’t visible. However, if process_task() is dispatched from the same thread the tasks originate on, everything works fine.
It seems to be a memory visibility or synchronization issue. Since I only lock around cross_thread_tasks, could it be that changes to task_list (even though it is modified on dispatch queue A only) are not being properly synchronized or visible across threads?
My questions:
What’s the best practice to ensure shared context is consistently visible across threads when using dispatch queues? Is it mandatory to lock around all tasks? I would love to minimize/avoid locking if possible.
Any guidance, debugging tips, or architectural suggestions would be appreciated!
===============================
And here is pseudocode of my setup if it helps:
struct shared_data {
    struct linked_list* task_list;
};
struct shared_context {
    struct shared_data* data;
    struct linked_list* cross_thread_tasks;
    struct thread_mutex* lock; // lock is used to protect cross_thread_tasks
};
static void s_process_task(void* shared_context) {
    struct linked_list* local_tasks;
    // lock and swap the cross_thread_tasks into a local linked list
    lock(shared_context->lock);
    swap(shared_context->cross_thread_tasks, local_tasks);
    unlock(shared_context->lock);
    // I didn't lock around `shared_context->data` because it is only touched from dispatch queue A, in this function.
    for (task : local_tasks) {
        linked_list_push(shared_context->data->task_list, task);
    }
    // If this `process_task()` block was dispatched from `schedule_task()` on the thread where the task was created, `shared_context` can access the task properly; otherwise it cannot.
    for (task : shared_context->data->task_list) {
        run_task_if_timestamp_is_now(task);
    }
    timestamp = get_next_timestamp(shared_context->data->task_list);
    dispatch_after_f(timestamp, dispatch_queueA, shared_context, process_task);
}
// On dispatch queue B
static void schedule_task(struct task* task, void* shared_context) {
    lock(shared_context->lock);
    push(shared_context->cross_thread_tasks, task);
    unlock(shared_context->lock);
    timestamp = get_timestamp(task);
    // We only dispatch process_task() if the timestamp is less than 1 second out. We did this to avoid the dispatch queue scheduling the task too far ahead and preventing the shutdown process. Therefore, not all tasks will be dispatched from the thread that created them.
    if (timestamp < 1 second)
        dispatch_after_f(timestamp, dispatch_queueA, shared_context, process_task);
}
So, let me start with the immediate issue here:
Sometimes the tasks pushed from other threads don’t seem to show up in task_list when process_task() runs. The task_list appears to be missing them, as if the cross-thread tasks aren’t visible. However, if process_task() is dispatched from the same thread the tasks originate on, everything works fine.
Yes, it's very likely that something like that is going on. As described in "Synchronize Access to Shared Data in Memory", ARM has a very weak memory ordering model, particularly compared to x86. That gives the CPU very wide latitude to reorder reads/writes, which can create a variety of odd behaviors that wouldn't happen on x86.
I'll also note that these kinds of issues are extremely difficult to interpret/predict/diagnose. You can google "arm lock free programming" for all sorts of suggestions and advice, but my experience has been that outside of VERY narrow and specific situations, this approach has VERY little benefit.
Moving to here:
I’m working with apple dispatch queue in C with the following design: multiple dispatch queues enqueue tasks into a shared context, and a dedicated dispatch queue (let’s call it dispatch queue A) processes these tasks.
So, my first question here is "what are you actually trying to do". Something I've seen happen a few times is developers building increasingly complex architectures to make things "faster" which are in fact either largely pointless or, even worse, ACTIVELY slower than a much simpler solution. You can read a write-up I did on one example of that here.
I suggest a redesign below which (I believe) would resolve your immediate concern; however, I'm concerned that you're building something which is unnecessarily complex and/or costing your app significant performance. Notably, unless the actual "work" being performed is substantial, it's very likely that you're gaining very little from this approach.
What’s the best practice to ensure shared context is consistently visible across threads when using dispatch queues?
So, the simplest answer here is basically "don't share state across threads". In the case of GCD, part of GCD's role is to act as a locking primitive, so the fact that you need an additional lock inside your GCD work means that you're not using GCD properly.
In terms of resolving your immediate issue, there are two approaches that I'd consider:
The first is to just use one queue for EVERYTHING. Note that I mean EVERYTHING here: that is, both your "schedule_task" and "s_process_task" would ONLY be called from inside the same GCD queue. Note that while this might seem slower than your current approach, unless the actual work involved here is VERY large, I'd expect it to be equal to or significantly faster than what you're currently doing. Importantly, test this approach and figure out what's "wrong" with it before you try to move to something more complicated.
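To make that concrete, here's a minimal sketch of the single-queue shape. It reuses the linked_list/task names from your pseudocode, but the scheduler struct and the schedule_request/s_do_schedule names are placeholders I made up for illustration, not anything from your code:

#include <dispatch/dispatch.h>
#include <stdlib.h>

// Hypothetical types borrowed from the question's pseudocode.
struct task;
struct linked_list;
extern void linked_list_push(struct linked_list* list, struct task* task);

// All mutable state lives behind ONE serial queue; no mutex is needed
// because only work running on `queue` ever touches `task_list`.
struct scheduler {
    dispatch_queue_t queue;
    struct linked_list* task_list;
};

// Runs ONLY on scheduler->queue.
static void s_process_tasks(void* ctx) {
    struct scheduler* s = ctx;
    // ... run any due tasks in s->task_list here, then reschedule ...
    dispatch_after_f(dispatch_time(DISPATCH_TIME_NOW, 1 * NSEC_PER_SEC),
                     s->queue, s, s_process_tasks);
}

// A small heap-allocated request so each dispatch owns independent memory.
struct schedule_request {
    struct scheduler* scheduler;
    struct task* task;
};

// Also runs ONLY on scheduler->queue; this is the only place task_list is modified.
static void s_do_schedule(void* ctx) {
    struct schedule_request* req = ctx;
    linked_list_push(req->scheduler->task_list, req->task);
    free(req);
}

// Callable from any thread: it does no work itself, it just hops onto the
// serial queue, which replaces the lock entirely.
void schedule_task(struct scheduler* s, struct task* task) {
    struct schedule_request* req = malloc(sizeof(*req));
    req->scheduler = s;
    req->task = task;
    dispatch_async_f(s->queue, req, s_do_schedule);
}

The serial queue is doing the same job your mutex was doing, except that it also orders the work, so there's no separate "visibility" question left to reason about.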
In any case, the solution I would use is to strongly define the role of each queue and stick within those limitations. In your case, that means:
- The scheduling queue is responsible for managing the task list. ALL access to the task list must occur on this queue.
- The processing queue is where work occurs, but it ONLY executes the work it's directly given and does NOT have access to the task list.
When the scheduling queue decides work will occur, it removes those tasks from its task list and creates a new task list. That new task list is then dispatched to the processing queue. Sending results/progress/data from processing to scheduling works in reverse: the processing queue packages the data it wants to send and then sends it to the scheduling queue. The key point here is that at NO point do two queues ever have access to the "same" data.
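Here's a rough sketch of what that hand-off could look like. It assumes a scheduler struct holding both queues and a hypothetical detach_due_tasks() helper; none of these names come from your code, they just illustrate the ownership transfer:

#include <dispatch/dispatch.h>
#include <stdlib.h>

struct task;          // hypothetical, as in the question's pseudocode
struct linked_list;

struct scheduler {
    dispatch_queue_t scheduling_queue;   // sole owner of task_list
    dispatch_queue_t processing_queue;   // only ever runs batches it is handed
    struct linked_list* task_list;       // touched ONLY on scheduling_queue
};

// A detached batch: once the scheduling queue hands it off, it keeps no
// reference to it, so exactly one queue "owns" the batch at any time.
struct task_batch {
    struct scheduler* scheduler;         // used only to dispatch the reply
    struct task** tasks;
    size_t count;
};

// Hypothetical helper: removes due tasks from the list and returns them
// as an array the caller owns.
extern struct task** detach_due_tasks(struct linked_list* list, size_t* out_count);

// Runs on scheduling_queue: fold results back into the task list, then free.
static void s_batch_finished(void* ctx) {
    struct task_batch* batch = ctx;
    // ... update batch->scheduler->task_list bookkeeping here ...
    free(batch->tasks);
    free(batch);
}

// Runs on processing_queue: only touches the batch it was given.
static void s_run_batch(void* ctx) {
    struct task_batch* batch = ctx;
    for (size_t i = 0; i < batch->count; i++) {
        // ... run batch->tasks[i] ...
    }
    // Hand the batch (with any results packed into it) back to its owner.
    dispatch_async_f(batch->scheduler->scheduling_queue, batch, s_batch_finished);
}

// Runs on scheduling_queue: pull due tasks OUT of task_list into a fresh
// batch, then hand that batch off. No structure is ever visible to both
// queues at the same time.
static void s_dispatch_due_tasks(struct scheduler* s) {
    struct task_batch* batch = calloc(1, sizeof(*batch));
    batch->scheduler = s;
    batch->tasks = detach_due_tasks(s->task_list, &batch->count);
    if (batch->count > 0) {
        dispatch_async_f(s->processing_queue, batch, s_run_batch);
    } else {
        free(batch->tasks);
        free(batch);
    }
}

The batch is heap-allocated precisely so that ownership can travel with the dispatch itself; neither queue ever reads memory the other queue is still modifying.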
The one problem this architecture might have (or seem to) is that you only have one argument you can pass into dispatch_after_f, and you may need to pass in both some shared (read-only) state as well as the actual data that needs to be processed. Two solutions to that:
- dispatch_set_context/dispatch_get_context allow you to attach and retrieve an arbitrary pointer on any dispatch_object_t (including queues). This is commonly used to separate static configuration ("where does this data go when I'm done") from the actual data being processed.
- The structure you pass into dispatch_after_f can also include a reference to the other metadata. The key here is that you need to pass in "independent" memory, not references to a structure you're actively modifying in the other queue. A short sketch of both approaches follows below.
That last point is the key. Data needs to have a clear "owner" and ONLY that owner is allowed to access that data.
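To illustrate both options together, here's a small sketch; the queue labels and the queue_config/work_item types are all made up for the example. The static configuration is attached to the queue with dispatch_set_context(), while each dispatch_after_f() call carries only its own independently allocated payload:

#include <dispatch/dispatch.h>
#include <stdio.h>
#include <stdlib.h>

// Long-lived, read-only configuration ("where does this data go when I'm
// done"), attached to the queue once at startup via dispatch_set_context().
struct queue_config {
    dispatch_queue_t reply_queue;
};

// Per-dispatch payload: independent memory owned by this one work item.
struct work_item {
    int payload;
};

static dispatch_queue_t s_processing_queue;   // module-owned queue (hypothetical)

static void s_do_work(void* ctx) {
    struct work_item* item = ctx;                          // the data to process
    struct queue_config* config =
        dispatch_get_context(s_processing_queue);          // the static config
    printf("processing %d, will reply on %p\n",
           item->payload, (void*)config->reply_queue);
    free(item);
}

void setup_and_dispatch(void) {
    s_processing_queue =
        dispatch_queue_create("com.example.processing", DISPATCH_QUEUE_SERIAL);

    // Attach the static configuration to the queue itself.
    struct queue_config* config = calloc(1, sizeof(*config));
    config->reply_queue =
        dispatch_queue_create("com.example.scheduling", DISPATCH_QUEUE_SERIAL);
    dispatch_set_context(s_processing_queue, config);

    // Each dispatch carries only its own independent work item.
    struct work_item* item = calloc(1, sizeof(*item));
    item->payload = 42;
    dispatch_after_f(dispatch_time(DISPATCH_TIME_NOW, NSEC_PER_SEC),
                     s_processing_queue, item, s_do_work);
}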
__
Kevin Elliott
DTS Engineer, CoreOS/Hardware