Recently I've been trying to deploy a third-party LLM to Apple devices. The methodology is similar to https://github.com/Anemll/Anemll.
The biggest issue I'm having right now is runtime memory usage. When a model (mlpackage or mlmodelc) contains multiple functions, the runtime memory for the weights is somehow duplicated when I load all of them. Here are the details:
- I created my multifunction mlpackage following https://apple.github.io/coremltools/docs-guides/source/multifunction-models.html
- I loaded each of the functions using the generated Swift class:
```swift
let config = MLModelConfiguration()
config.computeUnits = MLComputeUnits.cpuAndNeuralEngine

config.functionName = "infer_512"
let ffn1_infer_512 = try! mimo_FFN_PF_lut4_chunk_01of02(configuration: config)

config.functionName = "infer_1024"
let ffn1_infer_1024 = try! mimo_FFN_PF_lut4_chunk_01of02(configuration: config)

config.functionName = "infer_2048"
let ffn1_infer_2048 = try! mimo_FFN_PF_lut4_chunk_01of02(configuration: config)
```
- I observed that RAM usage increases linearly as I load each of the functions (see the measurement sketch right after this list).
- Using Instruments, I can see that multiple HWX files are generated and loaded, each of which contains all of the weight data.
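
To make the "linear growth" observation reproducible, here is a minimal sketch. It assumes the generated class and function names from the snippet above, and uses an illustrative physFootprint() helper built on task_info/TASK_VM_INFO (the same number the Xcode memory gauge reports). It loads the three functions one after another and prints how much the process footprint grows after each load:

```swift
import CoreML
import Darwin

// Illustrative helper: current process footprint (phys_footprint) in bytes,
// the same metric the Xcode memory gauge reports.
func physFootprint() -> UInt64 {
    var info = task_vm_info_data_t()
    var count = mach_msg_type_number_t(
        MemoryLayout<task_vm_info_data_t>.size / MemoryLayout<integer_t>.size)
    let kr = withUnsafeMutablePointer(to: &info) { infoPtr in
        infoPtr.withMemoryRebound(to: integer_t.self, capacity: Int(count)) { rawPtr in
            task_info(mach_task_self_, task_flavor_t(TASK_VM_INFO), rawPtr, &count)
        }
    }
    return kr == KERN_SUCCESS ? info.phys_footprint : 0
}

// Load each function of the same chunk and report how much the footprint grew.
// Keeping the models in an array makes the growth cumulative, which is where
// the roughly linear increase shows up.
var loadedModels: [Any] = []
for name in ["infer_512", "infer_1024", "infer_2048"] {
    let config = MLModelConfiguration()
    config.computeUnits = .cpuAndNeuralEngine
    config.functionName = name

    let before = physFootprint()
    loadedModels.append(try! mimo_FFN_PF_lut4_chunk_01of02(configuration: config))
    let after = physFootprint()

    let deltaMiB = Double(Int64(after) - Int64(before)) / 1_048_576
    print("\(name): footprint grew by \(deltaMiB) MiB")
}
```

If the weights were shared between the functions, only the first load should account for the bulk of the growth; what I observe instead is that each load adds roughly the same amount.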
My understanding of what's happening here:
- The CoreML framework does some MIL-to-MIL preprocessing before further compilation, which includes separating the CPU workload from the ANE workload.
- The ANE part of each function is moved into its own MIL file and then compiled separately into its own HWX file.
The problem is that the weight data in these HWX files is duplicated. Since LLM weights are huge, this leads to out-of-memory issues on mobile devices.
The improvement I'm hoping for from Apple: merge the processed MIL files back into one before calling ANECCompile(), so that the weights can be shared instead of duplicated. I don't have control over that step from user space, and I'm not sure whether it is feasible, so I'm asking for help here.
Thanks.