9/20/2024
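This post analyzes a CANN error hit while running a Llama model, in which rtGetDevMsg reports a NULL context pointer. Reading through the code, we trace the error to line 182 of ggml-cann.cpp and look at why the call to aclrtMemGetAllocationGranularity fails.
Analysis of a llama.cpp failure on the Orange Pi AI Pro
First, the command and the error output.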
./build/bin/llama-cli -m /root/model/meta-llama-3.1-8b-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm none -mg 0
Error output:
CANN error: EE1001: [PID: 5213] 2024-09-20-14:02:01.210.511 The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]
Solution: 1.Check the input parameter range of the function. 2.Check the function invocation relationship.
TraceBack (most recent call last):
Get Allocation Granularity failed, runtime result = 207000[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
ctx is NULL![FUNC:GetDevErrMsg][FILE:api_impl.cc][LINE:5244]
The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]
current device: 0, in function ggml_cann_init at /root/llama.cpp/ggml/src/ggml-cann.cpp:182
aclrtMemGetAllocationGranularity( &prop, ACL_RT_MEM_ALLOC_GRANULARITY_RECOMMENDED, &info.devices[id].vmm_granularity)
/root/llama.cpp/ggml/src/ggml-cann.cpp:123: CANN error
/root/llama.cpp/build/ggml/src/libggml.so(+0x40464)[0xe7ffc5e20464]
/root/llama.cpp/build/ggml/src/libggml.so(ggml_abort+0x140)[0xe7ffc5e21630]
/root/llama.cpp/build/ggml/src/libggml.so(+0xc026c)[0xe7ffc5ea026c]
/root/llama.cpp/build/ggml/src/libggml.so(_Z14ggml_cann_infov+0x160)[0xe7ffc5ea0fb0]
/root/llama.cpp/build/ggml/src/libggml.so(ggml_backend_cann_get_device_count+0xc)[0xe7ffc5ea14fc]
/root/llama.cpp/build/ggml/src/libggml.so(ggml_backend_cann_buffer_type+0x50)[0xe7ffc5ea1560]
/root/llama.cpp/build/src/libllama.so(+0x77200)[0xe7ffc6027200]
/root/llama.cpp/build/src/libllama.so(llama_load_model_from_file+0xe00)[0xe7ffc60699f0]
./build/bin/llama-cli(+0x22e48)[0xaaaab9d62e48]
./build/bin/llama-cli(+0x12870)[0xaaaab9d52870]
/lib/aarch64-linux-gnu/libc.so.6(+0x273fc)[0xe7ffc59573fc]
/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98)[0xe7ffc59574cc]
./build/bin/llama-cli(+0x19a30)[0xaaaab9d59a30]
Aborted (core dumped)
The error output shows that the abort happens at line 123 of ggml-cann.cpp. That line is:
GGML_ABORT("CANN error");
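That line only raises the error, though. So where does the real failure occur?
Reading further up the output, the failing call is at line 182 of ggml-cann.cpp, and the message attached to it says that rtGetDevMsg failed because the context pointer was NULL. What is on line 182?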
ACL_CHECK(aclrtMemGetAllocationGranularity(
&prop, ACL_RT_MEM_ALLOC_GRANULARITY_RECOMMENDED,
&info.devices[id].vmm_granularity));
We can guess that the error is raised by ACL_CHECK. What does ACL_CHECK do? Its definition is at line 60 of ggml/src/ggml-cann/common.h:
#define ACL_CHECK_GEN(stmt, success, error_fn) \
    do { \
        int err_code = (stmt); \
        if (err_code != (success)) { \
            ggml_cann_error(#stmt, __func__, __FILE__, __LINE__, error_fn()); \
        } \
    } while (0);
#define ACL_CHECK(stmt) ACL_CHECK_GEN(stmt, 0, aclGetRecentErrMsg)
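To make the macro's behavior concrete, here is a minimal standalone sketch of the same pattern. Everything in it is a hypothetical stand-in (DEMO_CHECK, report_error, fake_recent_err_msg, fake_get_granularity); it does not call CANN or ggml, and unlike the real ggml_cann_error path it records the failure instead of aborting. It only illustrates how the stringified statement, the function name and the file:line end up in the report, much like the "in function ggml_cann_init at ...:182" line in the output above.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-ins; none of these are real CANN or ggml symbols.
static std::vector<std::string> g_reports;  // collected failure reports

// Plays the role of the error-message getter passed to the macro.
static const char * fake_recent_err_msg() {
    return "rtGetDevMsg execute failed, reason=[context pointer null]";
}

// Records the failure instead of aborting, so the flow can be inspected.
static void report_error(const char * stmt, const char * func,
                         const char * file, int line, const char * msg) {
    assert(msg != nullptr);  // the demo message source never returns null
    char buf[512];
    std::snprintf(buf, sizeof(buf), "%s | in function %s at %s:%d | %s",
                  stmt, func, file, line, msg);
    g_reports.push_back(buf);
}

// Same shape as ACL_CHECK_GEN: run the statement once, compare the result
// with the success value, and report with #stmt as the statement text.
#define DEMO_CHECK_GEN(stmt, success, error_fn)                            \
    do {                                                                   \
        int err_code = (stmt);                                             \
        if (err_code != (success)) {                                       \
            report_error(#stmt, __func__, __FILE__, __LINE__, error_fn()); \
        }                                                                  \
    } while (0)

#define DEMO_CHECK(stmt) DEMO_CHECK_GEN(stmt, 0, fake_recent_err_msg)

// Returns 0 on success, or the runtime result seen in the log on failure.
static int fake_get_granularity(bool ok) { return ok ? 0 : 207000; }

static void demo_init() {
    DEMO_CHECK(fake_get_granularity(true));   // succeeds, nothing is reported
    DEMO_CHECK(fake_get_granularity(false));  // fails, one report is recorded
}
```

Calling demo_init() from a main function leaves exactly one entry in g_reports, starting with the text of the failing call. Note that the error-message getter runs only after the checked call has already failed, so in this pattern the message text comes from a second, separate call.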
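OK, so when the check fails, it means aclrtMemGetAllocationGranularity returned something other than 0. What is aclrtMemGetAllocationGranularity? A search of the llama.cpp repository turns up no definition, so it must be a CANN function. Google found nothing! A Baidu search, however, led to this source file: https://gitee.com/cann/acl/blob/2c7e13bce2a0f010a2c7c76546c2d14c5aee0624/runtime/memory.cpp
This is the definition of aclrtMemGetAllocationGranularity: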
aclError aclrtMemGetAllocationGranularity(aclrtPhysicalMemProp *prop, aclrtMemGranularityOptions option,
                                          size_t *granularity)
{
    ACL_PROFILING_REG(acl::AclProfType::AclrtMemGetAllocationGranularity);
    ACL_LOG_DEBUG("start to execute aclrtMemGetAllocationGranularity");
    ACL_REQUIRES_NOT_NULL_WITH_INPUT_REPORT(prop);
    ACL_REQUIRES_NOT_NULL_WITH_INPUT_REPORT(granularity);
    rtDrvMemProp_t rtProp1 = {};
    rtProp1.side = prop->location.type;
    rtProp1.devid = prop->location.id;
    rtProp1.module_id = acl::APP_MODE_ID_U16;
    rtProp1.reserve = prop->reserve;
    switch (prop->memAttr) {
        case ACL_HBM_MEM_HUGE: {
            rtProp1.pg_type = 1U;
            rtProp1.mem_type = 0U;
            break;
        }
        case ACL_HBM_MEM_NORMAL: {
            rtProp1.pg_type = 0U;
            rtProp1.mem_type = 0U;
            break;
        }
        default: {
            ACL_LOG_ERROR("memAttr [%d] support ACL_HBM_MEM_HUGE or ACL_HBM_MEM_NORMAL",
                static_cast<int32_t>(prop->memAttr));
            return ACL_ERROR_INVALID_PARAM;
        }
    }
    const rtError_t rtErr = rtMemGetAllocationGranularity(&rtProp1,
        static_cast<rtDrvMemGranularityOptions>(option), granularity);
    if (rtErr != RT_ERROR_NONE) {
        ACL_LOG_CALL_ERROR("Get Allocation Granularity failed, runtime result = %d", static_cast<int32_t>(rtErr));
        return ACL_GET_ERRCODE_RTS(rtErr);
    }
    return ACL_SUCCESS;
}
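The format string "Get Allocation Granularity failed, runtime result = %d" in this function appears to match the traceback line that prints 207000, which suggests the failure comes back from rtMemGetAllocationGranularity rather than from the parameter checks or the switch. The following sketch models only that control flow. All names in it are hypothetical stand-ins, the invalid-parameter value is a placeholder, and the runtime call is a stub that always fails; it is not the CANN implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-ins that only mirror the shape of the quoted function.
enum DemoMemAttr { DEMO_HBM_MEM_HUGE, DEMO_HBM_MEM_NORMAL, DEMO_MEM_OTHER };

constexpr int DEMO_SUCCESS = 0;
constexpr int DEMO_ERROR_INVALID_PARAM = -1;  // placeholder, not the real ACL value
constexpr int DEMO_RT_FAILURE = 207000;       // runtime result printed in the log

struct DemoProp {
    DemoMemAttr memAttr;
};

static std::string g_last_log;  // last message the demo "logged"

// Stub for the runtime layer; it always fails so the error path is exercised.
static int stub_rt_get_granularity(unsigned pg_type, std::size_t * granularity) {
    (void) pg_type;
    (void) granularity;
    return DEMO_RT_FAILURE;
}

static int demo_get_granularity(const DemoProp * prop, std::size_t * granularity) {
    // Step 1: null checks, as the ACL_REQUIRES_NOT_NULL_* macros do.
    if (prop == nullptr || granularity == nullptr) {
        return DEMO_ERROR_INVALID_PARAM;
    }
    // Step 2: translate the memory attribute into a page type.
    unsigned pg_type = 0U;
    switch (prop->memAttr) {
        case DEMO_HBM_MEM_HUGE:   pg_type = 1U; break;
        case DEMO_HBM_MEM_NORMAL: pg_type = 0U; break;
        default:
            g_last_log = "unsupported memAttr";
            return DEMO_ERROR_INVALID_PARAM;
    }
    // Step 3: delegate to the runtime and log its result if it fails.
    const int rt_err = stub_rt_get_granularity(pg_type, granularity);
    if (rt_err != 0) {
        g_last_log = "Get Allocation Granularity failed, runtime result = " +
                     std::to_string(rt_err);
        return rt_err;  // the quoted code maps this through ACL_GET_ERRCODE_RTS
    }
    return DEMO_SUCCESS;
}

// Usage: a valid attribute still fails, because the failure comes from step 3.
static void demo_run() {
    std::size_t granularity = 0;
    DemoProp prop{DEMO_HBM_MEM_HUGE};
    const int rc = demo_get_granularity(&prop, &granularity);
    assert(rc == DEMO_RT_FAILURE);
    (void) rc;
}
```

In this model the "runtime result" message can only be produced in step 3, after the arguments have passed validation, so the next place to look would be the runtime function itself.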
So far, though, we have not found anything that looks like a context pointer. (To be continued)