9/20/2024

This post is a detailed analysis of a CANN error hit while running a Llama model: rtGetDevMsg fails with a NULL context pointer. By digging into the code, we trace the failure to line 182 of ggml-cann.cpp and look into why the call to aclrtMemGetAllocationGranularity fails.

Orange Pi AI Pro: analyzing a llama.cpp run failure

First, the command and the error it produces.

./build/bin/llama-cli -m /root/model/meta-llama-3.1-8b-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm none -mg 0

The error:

CANN error: EE1001: [PID: 5213] 2024-09-20-14:02:01.210.511 The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]
        Solution: 1.Check the input parameter range of the function. 2.Check the function invocation relationship.
        TraceBack (most recent call last):
        Get Allocation Granularity failed, runtime result = 207000[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
        ctx is NULL![FUNC:GetDevErrMsg][FILE:api_impl.cc][LINE:5244]
        The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]

  current device: 0, in function ggml_cann_init at /root/llama.cpp/ggml/src/ggml-cann.cpp:182
  aclrtMemGetAllocationGranularity( &prop, ACL_RT_MEM_ALLOC_GRANULARITY_RECOMMENDED, &info.devices[id].vmm_granularity)
/root/llama.cpp/ggml/src/ggml-cann.cpp:123: CANN error
/root/llama.cpp/build/ggml/src/libggml.so(+0x40464)[0xe7ffc5e20464]
/root/llama.cpp/build/ggml/src/libggml.so(ggml_abort+0x140)[0xe7ffc5e21630]
/root/llama.cpp/build/ggml/src/libggml.so(+0xc026c)[0xe7ffc5ea026c]
/root/llama.cpp/build/ggml/src/libggml.so(_Z14ggml_cann_infov+0x160)[0xe7ffc5ea0fb0]
/root/llama.cpp/build/ggml/src/libggml.so(ggml_backend_cann_get_device_count+0xc)[0xe7ffc5ea14fc]
/root/llama.cpp/build/ggml/src/libggml.so(ggml_backend_cann_buffer_type+0x50)[0xe7ffc5ea1560]
/root/llama.cpp/build/src/libllama.so(+0x77200)[0xe7ffc6027200]
/root/llama.cpp/build/src/libllama.so(llama_load_model_from_file+0xe00)[0xe7ffc60699f0]
./build/bin/llama-cli(+0x22e48)[0xaaaab9d62e48]
./build/bin/llama-cli(+0x12870)[0xaaaab9d52870]
/lib/aarch64-linux-gnu/libc.so.6(+0x273fc)[0xe7ffc59573fc]
/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98)[0xe7ffc59574cc]
./build/bin/llama-cli(+0x19a30)[0xaaaab9d59a30]
Aborted (core dumped)

Looking at the error output, we can see the abort happens at line 123 of ggml-cann.cpp, which reads:

GGML_ABORT("CANN error");

OK, so this line only raises the abort. Where does the actual error occur?

Reading further, the trace points to line 182 of ggml-cann.cpp: the call to rtGetDevMsg fails because the context pointer is NULL. So what is at line 182?

ACL_CHECK(aclrtMemGetAllocationGranularity(
            &prop, ACL_RT_MEM_ALLOC_GRANULARITY_RECOMMENDED,
            &info.devices[id].vmm_granularity));

We can guess that ACL_CHECK is what raised the error. What does ACL_CHECK do? Its definition is at line 60 of ggml/src/ggml-cann/common.h:

#define ACL_CHECK_GEN(stmt, success, error_fn)                                \
    do {                                                                      \
        int err_code = (stmt);                                                \
        if (err_code != (success)) {                                          \
            ggml_cann_error(#stmt, __func__, __FILE__, __LINE__, error_fn()); \
        }                                                                     \
    } while (0);

#define ACL_CHECK(stmt) ACL_CHECK_GEN(stmt, 0, aclGetRecentErrMsg)

OK, so the failure means aclrtMemGetAllocationGranularity returned a non-success code. What is this function? Searching the llama.cpp repository turns up nothing, so it must be a CANN function. Google finds nothing either, but a Baidu search turns up its source: https://gitee.com/cann/acl/blob/2c7e13bce2a0f010a2c7c76546c2d14c5aee0624/runtime/memory.cpp

This is the definition of aclrtMemGetAllocationGranularity:

aclError aclrtMemGetAllocationGranularity(aclrtPhysicalMemProp *prop, aclrtMemGranularityOptions option,
                                          size_t *granularity)
{
    ACL_PROFILING_REG(acl::AclProfType::AclrtMemGetAllocationGranularity);
    ACL_LOG_DEBUG("start to execute aclrtMemGetAllocationGranularity");
    ACL_REQUIRES_NOT_NULL_WITH_INPUT_REPORT(prop);
    ACL_REQUIRES_NOT_NULL_WITH_INPUT_REPORT(granularity);

    rtDrvMemProp_t rtProp1 = {};
    rtProp1.side = prop->location.type;
    rtProp1.devid = prop->location.id;
    rtProp1.module_id = acl::APP_MODE_ID_U16;
    rtProp1.reserve = prop->reserve;
    switch (prop->memAttr) {
        case ACL_HBM_MEM_HUGE: {
            rtProp1.pg_type = 1U;
            rtProp1.mem_type = 0U;
            break;
        }
        case ACL_HBM_MEM_NORMAL: {
            rtProp1.pg_type = 0U;
            rtProp1.mem_type = 0U;
            break;
        }
        default: {
            ACL_LOG_ERROR("memAttr [%d] support ACL_HBM_MEM_HUGE or ACL_HBM_MEM_NORMAL",
                static_cast<int32_t>(prop->memAttr));
            return ACL_ERROR_INVALID_PARAM;
        }
    }

    const rtError_t rtErr = rtMemGetAllocationGranularity(&rtProp1,
        static_cast<rtDrvMemGranularityOptions>(option), granularity);
    if (rtErr != RT_ERROR_NONE) {
        ACL_LOG_CALL_ERROR("Get Allocation Granularity failed, runtime result = %d", static_cast<int32_t>(rtErr));
        return ACL_GET_ERRCODE_RTS(rtErr);
    }
    return ACL_SUCCESS;
}

But so far we haven't found any context pointer: nothing in this function touches one, so the NULL context reported by rtGetDevMsg must originate deeper inside the runtime, in the rtMemGetAllocationGranularity call whose failure triggered the error-message lookup in the first place. (To be continued.)