Llama.cpp

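llama-cpp-python is a Python binding for llama.cpp.

It supports inference for many LLMs, which can be accessed on Hugging Face.

This notebook goes over how to run llama-cpp-python within LangChain.

Note: new versions of llama-cpp-python use GGUF model files (see here).

This is a breaking change.

To convert existing GGML models to GGUF, you can run the following in llama.cpp: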

python ./convert-llama-ggmlv3-to-gguf.py --eps 1e-5 --input models/openorca-platypus2-13b.ggmlv3.q4_0.bin --output models/openorca-platypus2-13b.gguf.q4_0.bin
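Before converting, it can help to check whether a model file is already in GGUF format. GGUF files begin with the four ASCII bytes `GGUF`, so reading the header is enough for a quick check. The sketch below is a minimal illustration; the helper name `is_gguf` and the throwaway demo files are assumptions made for this example and are not part of llama.cpp.

```python
import os
import tempfile


def is_gguf(path: str) -> bool:
    """Return True if the file starts with the GGUF magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"


# Demo on throwaway files; in practice, pass the path to your own model file.
with tempfile.TemporaryDirectory() as tmp:
    gguf_like = os.path.join(tmp, "demo.gguf")
    other = os.path.join(tmp, "demo.bin")
    with open(gguf_like, "wb") as f:
        f.write(b"GGUF" + b"\x00" * 8)  # magic bytes followed by padding
    with open(other, "wb") as f:
        f.write(b"not a gguf file")
    gguf_result = is_gguf(gguf_like)
    other_result = is_gguf(other)

print(gguf_result, other_result)
```

A file that fails this check is not necessarily a GGML file; the check only tells you that it is not GGUF.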

Installation

There are different options for installing the llama-cpp-python package:

CPU usage
CPU + GPU (using one of many BLAS backends)
Metal GPU (MacOS with Apple Silicon chip)

CPU-only installation

%pip install --upgrade --quiet  llama-cpp-python

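Installation with OpenBLAS / cuBLAS / CLBlast

llama.cpp supports multiple BLAS backends for faster processing. Use the FORCE_CMAKE=1 environment variable to force the use of cmake and install the pip package for the desired BLAS backend (source).

Example installation with the cuBLAS backend: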

!CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install llama-cpp-python

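IMPORTANT: If you have already installed the CPU-only version of the package, you need to reinstall it from scratch. Consider the following command: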

!CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
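Outside a notebook, the same install can be driven from a Python script. The sketch below only builds the environment and the pip command that mirror the shell line above; it does not run the build. The helper name `build_install` is hypothetical, and the flags are the ones already shown in this section.

```python
import os
import sys


def build_install(cmake_args: str, force_reinstall: bool = False):
    """Return (env, cmd) for a source build of llama-cpp-python."""
    # Copy the current environment and add the two build variables.
    env = dict(os.environ, CMAKE_ARGS=cmake_args, FORCE_CMAKE="1")
    cmd = [sys.executable, "-m", "pip", "install"]
    if force_reinstall:
        cmd += ["--upgrade", "--force-reinstall", "--no-cache-dir"]
    cmd.append("llama-cpp-python")
    return env, cmd


env, cmd = build_install("-DGGML_CUDA=on", force_reinstall=True)
print("CMAKE_ARGS=" + env["CMAKE_ARGS"], "FORCE_CMAKE=" + env["FORCE_CMAKE"])
print(" ".join(cmd[1:]))
# To actually run the build: subprocess.run(cmd, env=env, check=True)
```

Using `sys.executable -m pip` rather than a bare `pip` keeps the install tied to the interpreter that will later import the package.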

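Installation with Metal

llama.cpp supports Apple Silicon as a first-class citizen, optimized via the ARM NEON, Accelerate, and Metal frameworks. Use the FORCE_CMAKE=1 environment variable to force the use of cmake and install the pip package with Metal support (source).

Example installation with Metal support: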

!CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python

IMPORTANT: If you have already installed the CPU-only version of the package, you need to reinstall it from scratch. Consider the following command:

!CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir

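Installation on Windows

Install the llama-cpp-python library by compiling it from source. You can follow most of the instructions in the repository itself, but there are some Windows-specific instructions that might be useful.

Requirements to install llama-cpp-python:

git
python
cmake
Visual Studio Community (make sure you install it with the following settings)
- Desktop development with C++
- Python development
- Linux embedded development with C++

Clone the git repository recursively to also get the llama.cpp submodule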

git clone --recursive -j8 https://github.com/abetlen/llama-cpp-python.git

Open a Command Prompt and set the following environment variables.

set FORCE_CMAKE=1
set CMAKE_ARGS=-DGGML_CUDA=OFF

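If you have an NVIDIA GPU, make sure GGML_CUDA is set to ON (-DGGML_CUDA=ON)

Compiling and installing

Now you can cd into the llama-cpp-python directory and install the package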

python -m pip install -e .

IMPORTANT: If you have already installed the CPU-only version of the package, you need to reinstall it from scratch. Consider the following command:

!python -m pip install -e . --force-reinstall --no-cache-dir
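After any of the installs above, a quick way to confirm that the package is visible to your interpreter is to query its installed version. This is a small sketch using only the standard library; the helper name `installed_version` is an assumption for illustration.

```python
from importlib import metadata


def installed_version(dist_name: str = "llama-cpp-python"):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
print(version if version else "llama-cpp-python is not installed")
```

This confirms only that the distribution is installed, not which backend it was compiled with; the verbose model output described later is the place to check that.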

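Usage

Make sure you follow all the instructions to install all necessary model files.

You don't need an API_TOKEN, as you will run the LLM locally.

It is worth understanding which models are suitable for the machine you intend to use.

TheBloke's Hugging Face models have a Provided files section that lists the RAM required to run models of different quantization sizes and methods (e.g. Llama2-7B-Chat-GGUF).

This GitHub issue is also relevant for finding the right model for your machine.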
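For a first rough guess at whether a model will fit in memory, the size of the quantized weights can be estimated from the parameter count and the bits stored per weight. The sketch below is a back-of-the-envelope estimate only: it ignores the KV cache and runtime overhead, and the 4.5 bits per weight figure for q4_0 is an assumption based on that format storing a 16-bit scale for each block of 32 4-bit weights. The Provided files tables on the model pages remain the better source.

```python
def approx_weights_gib(n_params: float, bits_per_weight: float) -> float:
    """Rough size of the quantized weights in GiB (no KV cache, no overhead)."""
    return n_params * bits_per_weight / 8 / 2**30


# Assumption: q4_0 costs about 4.5 bits per weight
# (32 weights * 4 bits + one 16-bit scale, per block of 32 weights).
for name, n_params in [("7B", 7e9), ("13B", 13e9)]:
    print(f"{name} at q4_0: ~{approx_weights_gib(n_params, 4.5):.1f} GiB")
```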

from langchain_community.llms import LlamaCpp
from langchain_core.callbacks import CallbackManager, StreamingStdOutCallbackHandler
from langchain_core.prompts import PromptTemplate

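API Reference: LlamaCpp | CallbackManager | StreamingStdOutCallbackHandler | PromptTemplate

Consider using a template that suits your model! Check the model's page on Hugging Face etc. to get the correct prompting template.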

template = """Question: {question}

Answer: Let's work this out in a step by step way to be sure we have the right answer."""

prompt = PromptTemplate.from_template(template)

# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
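To see the exact text the model will receive, it helps to render the template once by hand. The sketch below uses plain `str.format` as a stand-in for `PromptTemplate`, so that it runs without LangChain installed; it assumes the template only uses simple `{name}` placeholders, as the one above does.

```python
template = """Question: {question}

Answer: Let's work this out in a step by step way to be sure we have the right answer."""

# Plain str.format stands in for PromptTemplate here (an assumption for this sketch).
rendered = template.format(
    question="What NFL team won the Super Bowl in the year Justin Bieber was born?"
)
print(rendered)
```

With LangChain installed, `prompt.format(question=...)` on the `prompt` object defined above serves the same purpose.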

CPU

Example using a LLaMA 2 7B model

# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="/Users/rlm/Desktop/Code/llama.cpp/models/openorca-platypus2-13b.gguf.q4_0.bin",
    temperature=0.75,
    max_tokens=2000,
    top_p=1,
    callback_manager=callback_manager,
    verbose=True,  # Verbose is required to pass to the callback manager
)

question = """
Question: A rap battle between Stephen Colbert and John Oliver
"""
llm.invoke(question)

Stephen Colbert:
Yo, John, I heard you've been talkin' smack about me on your show.
Let me tell you somethin', pal, I'm the king of late-night TV
My satire is sharp as a razor, it cuts deeper than a knife
While you're just a british bloke tryin' to be funny with your accent and your wit.
John Oliver:
Oh Stephen, don't be ridiculous, you may have the ratings but I got the real talk.
My show is the one that people actually watch and listen to, not just for the laughs but for the facts.
While you're busy talkin' trash, I'm out here bringing the truth to light.
Stephen Colbert:
Truth? Ha! You think your show is about truth? Please, it's all just a joke to you.
You're just a fancy-pants british guy tryin' to be funny with your news and your jokes.
While I'm the one who's really makin' a difference, with my sat

llama_print_timings:        load time =   358.60 ms
llama_print_timings:      sample time =   172.55 ms /   256 runs   (    0.67 ms per token,  1483.59 tokens per second)
llama_print_timings: prompt eval time =   613.36 ms /    16 tokens (   38.33 ms per token,    26.09 tokens per second)
llama_print_timings:        eval time = 10151.17 ms /   255 runs   (   39.81 ms per token,    25.12 tokens per second)
llama_print_timings:       total time = 11332.41 ms

"\nStephen Colbert:\nYo, John, I heard you've been talkin' smack about me on your show.\nLet me tell you somethin', pal, I'm the king of late-night TV\nMy satire is sharp as a razor, it cuts deeper than a knife\nWhile you're just a british bloke tryin' to be funny with your accent and your wit.\nJohn Oliver:\nOh Stephen, don't be ridiculous, you may have the ratings but I got the real talk.\nMy show is the one that people actually watch and listen to, not just for the laughs but for the facts.\nWhile you're busy talkin' trash, I'm out here bringing the truth to light.\nStephen Colbert:\nTruth? Ha! You think your show is about truth? Please, it's all just a joke to you.\nYou're just a fancy-pants british guy tryin' to be funny with your news and your jokes.\nWhile I'm the one who's really makin' a difference, with my sat"

Example using a LLaMA v1 model

# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="./ggml-model-q4_0.bin", callback_manager=callback_manager, verbose=True
)

llm_chain = prompt | llm

question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"
llm_chain.invoke({"question": question})

1. First, find out when Justin Bieber was born.
2. We know that Justin Bieber was born on March 1, 1994.
3. Next, we need to look up when the Super Bowl was played in that year.
4. The Super Bowl was played on January 28, 1995.
5. Finally, we can use this information to answer the question. The NFL team that won the Super Bowl in the year Justin Bieber was born is the San Francisco 49ers.

llama_print_timings:        load time =   434.15 ms
llama_print_timings:      sample time =    41.81 ms /   121 runs   (    0.35 ms per token)
llama_print_timings: prompt eval time =  2523.78 ms /    48 tokens (   52.58 ms per token)
llama_print_timings:        eval time = 23971.57 ms /   121 runs   (  198.11 ms per token)
llama_print_timings:       total time = 28945.95 ms

'\n\n1. First, find out when Justin Bieber was born.\n2. We know that Justin Bieber was born on March 1, 1994.\n3. Next, we need to look up when the Super Bowl was played in that year.\n4. The Super Bowl was played on January 28, 1995.\n5. Finally, we can use this information to answer the question. The NFL team that won the Super Bowl in the year Justin Bieber was born is the San Francisco 49ers.'
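The timing lines for this run report milliseconds per token but no tokens-per-second figure. It can be derived from the elapsed time and the token count on each line. The sketch below parses two of the lines printed above with a regular expression; the pattern is an assumption that fits the log format shown in this notebook and may need adjusting for other llama.cpp versions.

```python
import re

log = """\
llama_print_timings: prompt eval time =  2523.78 ms /    48 tokens (   52.58 ms per token)
llama_print_timings:        eval time = 23971.57 ms /   121 runs   (  198.11 ms per token)
"""

# Matches "<label> time = <ms> ms / <count> tokens" or "... runs".
pattern = re.compile(
    r"llama_print_timings:\s+(.+?) time =\s*([\d.]+) ms /\s*(\d+) (?:tokens|runs)"
)

rates = {}
for label, ms, count in pattern.findall(log):
    # tokens per second = token count / elapsed seconds
    rates[label] = int(count) / (float(ms) / 1000.0)

for label, rate in rates.items():
    print(f"{label}: {rate:.2f} tokens per second")
```

The `eval` rate is the generation speed, which is usually the number to compare between CPU and GPU runs.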

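GPU

If the installation with a BLAS backend was correct, you will see a BLAS = 1 indicator in the model properties.

Two of the most important parameters for use with a GPU are:

n_gpu_layers - determines how many layers of the model are offloaded to your GPU.
n_batch - how many tokens are processed in parallel.

Setting these parameters correctly will dramatically improve the evaluation speed (see the wrapper code for more details).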

n_gpu_layers = -1  # The number of layers to put on the GPU. The rest will be on the CPU. If you don't know how many layers there are, you can use -1 to move all to GPU.
n_batch = 512  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.

# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="/Users/rlm/Desktop/Code/llama.cpp/models/openorca-platypus2-13b.gguf.q4_0.bin",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    callback_manager=callback_manager,
    verbose=True,  # Verbose is required to pass to the callback manager
)
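The comment on n_batch above says the value should lie between 1 and n_ctx. If the batch size comes from configuration or user input, a small guard can enforce that range before the model is constructed. This is a minimal sketch; the helper name `clamp_n_batch` and the context size of 2048 are illustrative assumptions, not values from this notebook.

```python
def clamp_n_batch(n_batch: int, n_ctx: int) -> int:
    """Clamp n_batch into the range [1, n_ctx]."""
    return max(1, min(n_batch, n_ctx))


# n_ctx=2048 is an illustrative context size (assumption).
print(clamp_n_batch(512, 2048))   # 512: already within range
print(clamp_n_batch(4096, 2048))  # 2048: capped at n_ctx
print(clamp_n_batch(0, 2048))     # 1: raised to the minimum
```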

llm_chain = prompt | llm
question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"
llm_chain.invoke({"question": question})

1. Identify Justin Bieber's birth date: Justin Bieber was born on March 1, 1994.

2. Find the Super Bowl winner of that year: The NFL season of 1993 with the Super Bowl being played in January or of 1994.

3. Determine which team won the game: The Dallas Cowboys faced the Buffalo Bills in Super Bowl XXVII on January 31, 1993 (as the year is mis-labelled due to a error). The Dallas Cowboys won this matchup.

So, Justin Bieber was born when the Dallas Cowboys were the reigning NFL Super Bowl.

llama_print_timings:        load time =   427.63 ms
llama_print_timings:      sample time =   115.85 ms /   164 runs   (    0.71 ms per token,  1415.67 tokens per second)
llama_print_timings: prompt eval time =   427.53 ms /    45 tokens (    9.50 ms per token,   105.26 tokens per second)
llama_print_timings:        eval time =  4526.53 ms /   163 runs   (   27.77 ms per token,    36.01 tokens per second)
llama_print_timings:       total time =  5293.77 ms

"\n\n1. Identify Justin Bieber's birth date: Justin Bieber was born on March 1, 1994.\n\n2. Find the Super Bowl winner of that year: The NFL season of 1993 with the Super Bowl being played in January or of 1994.\n\n3. Determine which team won the game: The Dallas Cowboys faced the Buffalo Bills in Super Bowl XXVII on January 31, 1993 (as the year is mis-labelled due to a error). The Dallas Cowboys won this matchup.\n\nSo, Justin Bieber was born when the Dallas Cowboys were the reigning NFL Super Bowl."

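Metal

If the installation with Metal was correct, you will see a NEON = 1 indicator in the model properties.

The most important GPU parameters are:

n_gpu_layers - determines how many layers of the model are offloaded to your Metal GPU.
n_batch - how many tokens are processed in parallel; the default is 8, so set it to a larger number.
f16_kv - for some reason, Metal only supports True; otherwise you will get an error such as Asserting on type 0 GGML_ASSERT: .../ggml-metal.m:706: false && "not implemented"

Setting these parameters correctly will dramatically improve the evaluation speed (see the wrapper code for more details).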

n_gpu_layers = 1  # The number of layers to put on the GPU. The rest will be on the CPU. If you don't know how many layers there are, you can use -1 to move all to GPU.
n_batch = 512  # Should be between 1 and n_ctx, consider the amount of RAM of your Apple Silicon Chip.
# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="/Users/rlm/Desktop/Code/llama.cpp/models/openorca-platypus2-13b.gguf.q4_0.bin",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    f16_kv=True,  # MUST set to True, otherwise you will run into problem after a couple of calls
    callback_manager=callback_manager,
    verbose=True,  # Verbose is required to pass to the callback manager
)

The console log will show the following lines to indicate that Metal was enabled properly.

ggml_metal_init: allocating
ggml_metal_init: using MPS
...

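You can also check Activity Monitor and watch the GPU usage of the process; CPU usage will drop dramatically after turning on n_gpu_layers=1.

For the first call to the LLM, performance may be slow due to model compilation on the Metal GPU.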
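One way to observe the warm-up cost is to time several consecutive calls and compare the first with the rest. The sketch below shows the measurement pattern only: `fake_invoke` is a stand-in so the code runs without a model, and in the notebook you would pass `llm.invoke` instead. Both helper names are assumptions for illustration.

```python
import time


def time_calls(fn, prompt: str, n: int = 3):
    """Call fn(prompt) n times and return each call's duration in seconds."""
    durations = []
    for _ in range(n):
        start = time.perf_counter()
        fn(prompt)
        durations.append(time.perf_counter() - start)
    return durations


# Stand-in for llm.invoke so the sketch runs without a model (assumption).
def fake_invoke(prompt: str) -> str:
    return prompt.upper()


durations = time_calls(fake_invoke, "hello")
print([f"{d * 1000:.3f} ms" for d in durations])
```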

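Grammars

We can use grammars to constrain model outputs and sample tokens based on the rules defined in them.

To demonstrate this concept, we've included sample grammar files that will be used in the examples below.

Creating gbnf grammar files can be time-consuming, but if you have a use case where output schemas are important, there are two tools that can help:

An online grammar generator app that converts TypeScript interface definitions to a gbnf file.
A Python script for converting a JSON schema to a gbnf file. You can, for example, create a pydantic object, generate its JSON schema using the .schema_json() method, and then use this script to convert it to a gbnf file.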
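For very small output formats, a grammar can also be written by hand. A gbnf file is a set of rules of the form `name ::= expression`, with generation starting from the rule named `root`. The sketch below writes a tiny, hypothetical grammar that only permits the answers "yes" or "no" to a temporary file; it is an illustration and not one of the sample grammar files mentioned above.

```python
import os
import tempfile

# A hypothetical grammar: the model may only answer "yes" or "no".
YES_NO_GBNF = 'root ::= "yes" | "no"\n'

grammar_path = os.path.join(tempfile.mkdtemp(), "yes_no.gbnf")
with open(grammar_path, "w", encoding="utf-8") as f:
    f.write(YES_NO_GBNF)

print(grammar_path)
# The file could then be supplied via LlamaCpp(..., grammar_path=grammar_path).
```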

In the first example, supply the path to the json.gbnf file in order to produce JSON:

n_gpu_layers = 1  # The number of layers to put on the GPU. The rest will be on the CPU. If you don't know how many layers there are, you can use -1 to move all to GPU.
n_batch = 512  # Should be between 1 and n_ctx, consider the amount of RAM of your Apple Silicon Chip.
# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="/Users/rlm/Desktop/Code/llama.cpp/models/openorca-platypus2-13b.gguf.q4_0.bin",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    f16_kv=True,  # MUST set to True, otherwise you will run into problem after a couple of calls
    callback_manager=callback_manager,
    verbose=True,  # Verbose is required to pass to the callback manager
    grammar_path="/Users/rlm/Desktop/Code/langchain-main/langchain/libs/langchain/langchain/llms/grammars/json.gbnf",
)

%%capture captured --no-stdout
result = llm.invoke("Describe a person in JSON format:")

{
  "name": "John Doe",
  "age": 34,
  "": {
    "title": "Software Developer",
    "company": "Google"
  },
  "interests": [
    "Sports",
    "Music",
    "Cooking"
  ],
  "address": {
    "street_number": 123,
    "street_name": "Oak Street",
    "city": "Mountain View",
    "state": "California",
    "postal_code": 94040
  }}

llama_print_timings:        load time =   357.51 ms
llama_print_timings:      sample time =  1213.30 ms /   144 runs   (    8.43 ms per token,   118.68 tokens per second)
llama_print_timings: prompt eval time =   356.78 ms /     9 tokens (   39.64 ms per token,    25.23 tokens per second)
llama_print_timings:        eval time =  3947.16 ms /   143 runs   (   27.60 ms per token,    36.23 tokens per second)
llama_print_timings:       total time =  5846.21 ms
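Because the grammar constrains the model to emit JSON, the result can be loaded directly with the standard library. The sketch below parses an abbreviated copy of the output shown above; in the notebook you would pass `result` to `json.loads` instead of the sample string.

```python
import json

# Abbreviated copy of the output shown above; in the notebook, use `result`.
sample = """{
  "name": "John Doe",
  "age": 34,
  "interests": ["Sports", "Music", "Cooking"]
}"""

person = json.loads(sample)  # raises json.JSONDecodeError on invalid JSON
print(person["name"], person["age"], person["interests"])
```

If generation stops at max_tokens before the closing brace, the text is truncated and parsing will fail, so catching `json.JSONDecodeError` is still worthwhile.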

We can also supply list.gbnf to return a list:

n_gpu_layers = 1
n_batch = 512
llm = LlamaCpp(
    model_path="/Users/rlm/Desktop/Code/llama.cpp/models/openorca-platypus2-13b.gguf.q4_0.bin",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    f16_kv=True,  # MUST set to True, otherwise you will run into problem after a couple of calls
    callback_manager=callback_manager,
    verbose=True,
    grammar_path="/Users/rlm/Desktop/Code/langchain-main/langchain/libs/langchain/langchain/llms/grammars/list.gbnf",
)

%%capture captured --no-stdout
result = llm.invoke("List of top-3 my favourite books:")

["The Catcher in the Rye", "Wuthering Heights", "Anna Karenina"]

llama_print_timings:        load time =   322.34 ms
llama_print_timings:      sample time =   232.60 ms /    26 runs   (    8.95 ms per token,   111.78 tokens per second)
llama_print_timings: prompt eval time =   321.90 ms /    11 tokens (   29.26 ms per token,    34.17 tokens per second)
llama_print_timings:        eval time =   680.82 ms /    25 runs   (   27.23 ms per token,    36.72 tokens per second)
llama_print_timings:       total time =  1295.27 ms

LLM conceptual guide
LLM how-to guides
