CoreML multifunction model runtime memory cost

I'm currently trying to deploy a third-party LLM to Apple devices. The methodology is similar to https://github.com/Anemll/Anemll.

The biggest issue I'm facing is runtime memory usage. When a model (mlpackage or mlmodelc) contains multiple functions, the memory used for weights appears to be duplicated for each function I load. Here are the details:

  • I created my multifunction mlpackage following https://apple.github.io/coremltools/docs-guides/source/multifunction-models.html

  • I loaded each of the functions using the generated Swift class:

// One configuration is reused; only functionName changes between loads.
let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine

// Each load selects a different function of the same multifunction model.
config.functionName = "infer_512"
let ffn1_infer_512 = try! mimo_FFN_PF_lut4_chunk_01of02(configuration: config)

config.functionName = "infer_1024"
let ffn1_infer_1024 = try! mimo_FFN_PF_lut4_chunk_01of02(configuration: config)

config.functionName = "infer_2048"
let ffn1_infer_2048 = try! mimo_FFN_PF_lut4_chunk_01of02(configuration: config)

  • I observed that RAM usage increases linearly as I load each function.

  • Using Instruments, I can see that multiple HWX files are generated and loaded, each of which contains all of the weight data.
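
For anyone who wants to reproduce the linear growth without Instruments, here is a minimal sketch that reads the process's physical footprint through the Mach task_info call before and after each load. The helper names physicalFootprint and reportGrowth are my own for illustration, and it is an assumption that ANE-related allocations are fully attributed to this counter, so treat the numbers as a rough cross-check only.

```swift
import Darwin

/// Returns the current physical memory footprint of this process in bytes,
/// or nil if the Mach call fails.
func physicalFootprint() -> UInt64? {
    var info = task_vm_info_data_t()
    var count = mach_msg_type_number_t(
        MemoryLayout<task_vm_info_data_t>.size / MemoryLayout<integer_t>.size)
    let kr = withUnsafeMutablePointer(to: &info) { ptr in
        ptr.withMemoryRebound(to: integer_t.self, capacity: Int(count)) {
            task_info(mach_task_self_, task_flavor_t(TASK_VM_INFO), $0, &count)
        }
    }
    return kr == KERN_SUCCESS ? UInt64(info.phys_footprint) : nil
}

/// Hypothetical helper: runs `load` and prints how much the footprint grew.
/// The closure must store the loaded model somewhere that outlives the call,
/// otherwise the model may be released before the second measurement.
func reportGrowth(_ label: String, _ load: () throws -> Void) rethrows {
    let before = physicalFootprint() ?? 0
    try load()
    let after = physicalFootprint() ?? 0
    let deltaMB = (Double(after) - Double(before)) / 1_048_576
    print("\(label): \((deltaMB * 10).rounded() / 10) MB")
}

// Intended use with the generated class from this post (not run here):
//   var models: [Any] = []
//   config.functionName = "infer_512"
//   try reportGrowth("infer_512") {
//       models.append(try mimo_FFN_PF_lut4_chunk_01of02(configuration: config))
//   }
```

Calling reportGrowth once per function name should show roughly the same increase each time if the weights really are duplicated per function.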

My understanding of what's happening here:

  • The CoreML framework performs some MIL->MIL preprocessing before further compilation, which includes separating the CPU workload from the ANE workload.

  • The ANE part of each function is moved into a separate MIL file, and each of those files is then compiled separately into its own HWX file.

The problem is that the weight data is duplicated across these HWX files. Since the weight data of an LLM is huge, this causes out-of-memory issues on mobile devices.
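
To make the scale concrete, here is a back-of-the-envelope sketch of the cost of duplication. The parameter count is a made-up placeholder, not a measurement of my model, and the estimate assumes 4-bit palettized weights (as the "lut4" in the class name suggests) while ignoring lookup tables and activations.

```swift
/// Approximate size in bytes of `parameterCount` weights stored at
/// `bitsPerWeight` bits each (lookup-table overhead is ignored).
func weightBytes(parameterCount: Int, bitsPerWeight: Int) -> Int {
    return parameterCount * bitsPerWeight / 8
}

// Hypothetical chunk: 1 billion parameters at 4 bits per weight.
let perFunction = weightBytes(parameterCount: 1_000_000_000, bitsPerWeight: 4)

// Three functions (infer_512, infer_1024, infer_2048) loaded together.
let functionCount = 3
let duplicated = perFunction * functionCount  // one weight copy per HWX file
let shared = perFunction                      // one copy shared by all functions

print("duplicated: \(duplicated) bytes, shared: \(shared) bytes")
```

Under these assumptions the duplicated case needs three times the weight memory of the shared case, and the gap grows with every additional function.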

The improvement I'm hoping for from Apple: merge the processed MIL files back into one before calling ANECCompile(), so that the weights can be merged. I don't have control over that from user space, and I'm not sure whether it is feasible, so I'm asking for help here.

Thanks.

I hope we can try to merge the processed MIL files back into one before calling ANECCompile(), so that the weights can be merged.

As for user space, I would expect an API that initializes all of the model's functions together in one call and returns multiple MLModel objects.
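
To illustrate the call shape I have in mind, here is a user-space stand-in built only on the existing MLModel(contentsOf:configuration:) initializer and MLModelConfiguration.functionName. The name loadFunctions is hypothetical, and this sketch does not fix the duplication: it still loads each function separately. It only shows what a single entry point returning several MLModel objects could look like.

```swift
import CoreML

/// Hypothetical helper: loads every named function of a compiled
/// multifunction model and returns the models keyed by function name.
/// Each function is still loaded on its own, so weights are not shared.
@available(macOS 15.0, iOS 18.0, *)
func loadFunctions(
    named names: [String],
    from compiledModelURL: URL,
    computeUnits: MLComputeUnits = .cpuAndNeuralEngine
) throws -> [String: MLModel] {
    var models: [String: MLModel] = [:]
    for name in names {
        // Use a fresh configuration per function so no state is shared.
        let config = MLModelConfiguration()
        config.computeUnits = computeUnits
        config.functionName = name
        models[name] = try MLModel(contentsOf: compiledModelURL, configuration: config)
    }
    return models
}

// Intended use (the URL is a placeholder for the compiled .mlmodelc):
//   let models = try loadFunctions(
//       named: ["infer_512", "infer_1024", "infer_2048"],
//       from: modelURL)
```

A framework-level version of this call could see all requested functions at once, which is what would give it the chance to compile them together and keep a single copy of the weights.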
