Intel Weight-Only Quantization

Weight-Only Quantization for Hugging Face Models with Intel Extension for Transformers Pipelines

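Hugging Face models can be run locally through the WeightOnlyQuantPipeline class.

The Hugging Face Model Hub hosts over 120k models, 20k datasets, and 50k demo apps (Spaces), all open source and publicly available, on an online platform where people can easily collaborate and build ML together.

These can be called from LangChain through this local pipeline wrapper class.

To use it, you should have the transformers Python package installed, as well as pytorch and intel-extension-for-transformers.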

%pip install transformers --quiet
%pip install intel-extension-for-transformers

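Model Loading

Models can be loaded by specifying the model parameters with the from_model_id method. The model parameters include the WeightOnlyQuantConfig class from intel_extension_for_transformers.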

from intel_extension_for_transformers.transformers import WeightOnlyQuantConfig
from langchain_community.llms.weight_only_quantization import WeightOnlyQuantPipeline

conf = WeightOnlyQuantConfig(weight_dtype="nf4")
hf = WeightOnlyQuantPipeline.from_model_id(
    model_id="google/flan-t5-large",
    task="text2text-generation",
    quantization_config=conf,
    pipeline_kwargs={"max_new_tokens": 10},
)

API Reference: WeightOnlyQuantPipeline

They can also be loaded by passing in an existing transformers pipeline directly.

from intel_extension_for_transformers.transformers import AutoModelForSeq2SeqLM
from langchain_community.llms.weight_only_quantization import WeightOnlyQuantPipeline
from transformers import AutoTokenizer, pipeline

model_id = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
pipe = pipeline(
    "text2text-generation", model=model, tokenizer=tokenizer, max_new_tokens=10
)
hf = WeightOnlyQuantPipeline(pipeline=pipe)

Create Chain

With the model loaded into memory, you can compose it with a prompt to form a chain.

from langchain_core.prompts import PromptTemplate

template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate.from_template(template)

chain = prompt | hf

question = "What is electroencephalography?"

print(chain.invoke({"question": question}))

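API Reference: PromptTemplate

CPU Inference

Currently, intel-extension-for-transformers supports only CPU device inference. Intel GPU support is coming soon. When running on a machine with a CPU, you can specify device="cpu" or device=-1 to place the model on the CPU device. The default is -1, which selects CPU inference.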

conf = WeightOnlyQuantConfig(weight_dtype="nf4")
llm = WeightOnlyQuantPipeline.from_model_id(
    model_id="google/flan-t5-large",
    task="text2text-generation",
    quantization_config=conf,
    pipeline_kwargs={"max_new_tokens": 10},
)

template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate.from_template(template)

chain = prompt | llm

question = "What is electroencephalography?"

print(chain.invoke({"question": question}))

Batch CPU Inference

You can also run inference on the CPU in batch mode.

conf = WeightOnlyQuantConfig(weight_dtype="nf4")
llm = WeightOnlyQuantPipeline.from_model_id(
    model_id="google/flan-t5-large",
    task="text2text-generation",
    quantization_config=conf,
    pipeline_kwargs={"max_new_tokens": 10},
)

chain = prompt | llm.bind(stop=["\n\n"])

questions = []
for i in range(4):
    questions.append({"question": f"What is the number {i} in french?"})

answers = chain.batch(questions)
for answer in answers:
    print(answer)

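Data Types Supported by Intel-extension-for-transformers

We support quantizing the weights to the following data types for storage (weight_dtype in WeightOnlyQuantConfig):

int8: Uses an 8-bit data type.
int4_fullrange: Uses the -8 value of the int4 range, in contrast to the normal int4 range [-7,7].
int4_clip: Clips and retains the values within the int4 range, setting the others to zero.
nf4: Uses the normalized float 4-bit data type.
fp4_e2m1: Uses the regular float 4-bit data type. "e2" means that 2 bits are used for the exponent, and "m1" means that 1 bit is used for the mantissa.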
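
To make the difference between these 4-bit formats concrete, the sketch below enumerates the value grids for the two int4 ranges and for an E2M1 float. It assumes a common E2M1 convention (1 sign bit, 2 exponent bits with a bias of 1, 1 mantissa bit, subnormals, no infinities or NaN); the library's internal lookup tables may be scaled or ordered differently. The function names are illustrative and are not part of intel-extension-for-transformers.

```python
def int4_grid(full_range: bool) -> list[int]:
    """Return the integer levels available to a signed 4-bit weight."""
    low = -8 if full_range else -7
    return list(range(low, 8))


def fp4_e2m1_grid() -> list[float]:
    """Enumerate the non-negative magnitudes of an E2M1 float.

    Assumption: exponent bias of 1, subnormals supported, no inf/NaN.
    """
    values = set()
    for exponent in range(4):        # 2 exponent bits
        for mantissa in range(2):    # 1 mantissa bit
            if exponent == 0:
                # Subnormal: no implicit leading 1, effective exponent is 0.
                magnitude = mantissa * 0.5
            else:
                magnitude = (1 + mantissa * 0.5) * 2.0 ** (exponent - 1)
            values.add(magnitude)
    return sorted(values)


print(len(int4_grid(full_range=True)), len(int4_grid(full_range=False)))
print(fp4_e2m1_grid())
```

The integer grids are evenly spaced, while the float grid is denser near zero, which is the practical difference between the int4 and fp4 storage types.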

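While these techniques store weights in 4 or 8 bits, the computation still happens in float32, bfloat16, or int8 (compute_dtype in WeightOnlyQuantConfig):

fp32: Uses the float32 data type for computation.
bf16: Uses the bfloat16 data type for computation.
int8: Uses an 8-bit data type for computation.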
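
The split between storage type and compute type can be illustrated with a small pure-Python sketch: weights are stored as int8 codes plus one scale, then dequantized before the multiply-accumulate. This is only an illustration under simple assumptions (symmetric per-tensor scaling, Python floats standing in for the compute dtype); it is not the library's kernel, and the helper names are hypothetical.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Store weights as int8 codes plus one float scale (symmetric, per-tensor)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def linear(codes: list[int], scale: float, activations: list[float]) -> float:
    """Dequantize the stored codes to float, then accumulate in float."""
    return sum((c * scale) * x for c, x in zip(codes, activations))


weights = [0.5, -1.0, 0.25, 0.0]
activations = [1.0, 2.0, 4.0, 8.0]

codes, scale = quantize_int8(weights)           # what is kept in memory
exact = sum(w * x for w, x in zip(weights, activations))
approx = linear(codes, scale, activations)      # what is computed at runtime
print(codes, exact, approx)
```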

Supported Algorithms Matrix

Quantization algorithms supported in intel-extension-for-transformers (algorithm in WeightOnlyQuantConfig):

Algorithms	PyTorch	LLM Runtime
RTN	✔	✔
AWQ	✔	stay tuned
TEQ	✔	stay tuned

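RTN: Round-to-nearest, the most intuitive quantization method. It requires no additional dataset and is very fast. Generally speaking, RTN converts the weights into a uniformly distributed integer data type, but some algorithms, such as QLoRA, propose a non-uniform NF4 data type and prove its theoretical optimality.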
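
As a rough illustration of round-to-nearest, the sketch below quantizes a list of weights to symmetric 4-bit integer levels, with one scale per group, and immediately dequantizes them. The group size, the symmetric scheme, and the function name are assumptions made for the example rather than the library's defaults.

```python
def rtn_quantize(weights: list[float], bits: int = 4, group_size: int = 4) -> list[float]:
    """Quantize and immediately dequantize weights with round-to-nearest.

    Each group of `group_size` weights shares one scale (symmetric quantization).
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for 4 bits
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        for w in group:
            # Snap to the nearest integer level, then map back to a float.
            code = max(-qmax, min(qmax, round(w / scale)))
            out.append(code * scale)
    return out


weights = [0.7, -0.35, 0.1, 0.0, 7.0, -3.0, 1.0, 0.5]
dequantized = rtn_quantize(weights)
max_error = max(abs(a - b) for a, b in zip(weights, dequantized))
print(dequantized, max_error)
```

No calibration data is involved: the only inputs are the weights themselves, which is why RTN is fast.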

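AWQ: Shows that protecting only 1% of the salient weights can greatly reduce quantization error. The salient weight channels are selected by observing the per-channel distribution of activations and weights. The salient weights are still quantized, but they are multiplied by a large scale factor before quantization so that they are preserved.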
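
The scaling idea can be sketched on a single output row of a linear layer. The example below is a hand-picked toy case, not the AWQ implementation: it picks the salient input channel from one activation vector, uses a fixed scale factor of 2 (AWQ searches for the scales), and reuses a simple symmetric 4-bit round-to-nearest quantizer. All names and values are illustrative.

```python
def rtn(weights, qmax=7):
    """Symmetric round-to-nearest onto integer levels in [-qmax, qmax], then dequantize."""
    max_abs = max(abs(w) for w in weights)
    step = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(w / step))) * step for w in weights]


def dot(weights, activations):
    return sum(w * x for w, x in zip(weights, activations))


weights = [7.0, 1.4, -3.0, 2.0]       # one output row of a linear layer
activations = [0.1, 10.0, 0.1, 0.1]   # channel 1 carries a large activation

# Pick the salient input channel from the activation magnitudes.
salient = max(range(len(activations)), key=lambda i: abs(activations[i]))
s = 2.0  # hand-picked scale factor (assumption; AWQ searches for it)

# Equivalent transformation: scale the weight up and the activation down.
scaled_w = [w * s if i == salient else w for i, w in enumerate(weights)]
scaled_x = [x / s if i == salient else x for i, x in enumerate(activations)]

exact = dot(weights, activations)
plain_error = abs(dot(rtn(weights), activations) - exact)
awq_error = abs(dot(rtn(scaled_w), scaled_x) - exact)
print(plain_error, awq_error)
```

In this toy case the quantization step stays the same after scaling, so the rounding error on the salient weight is divided by the scale factor when the activation is scaled back down. With other values a fixed factor may not help, which is why the scale is searched in practice.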

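TEQ: A trainable equivalent transformation that preserves FP32 precision in weight-only quantization. It is inspired by AWQ, while providing a new solution for searching for the optimal per-channel scaling factor between activations and weights.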
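
The search for per-channel scaling factors can be sketched as follows. TEQ trains the scales; the example substitutes an exhaustive search over a small candidate set purely to show the objective, so it should be read as a stand-in and not as the TEQ algorithm. The quantizer, the candidate values, and the function names are assumptions.

```python
from itertools import product


def rtn(weights, qmax=7):
    """Symmetric round-to-nearest onto integer levels in [-qmax, qmax], then dequantize."""
    max_abs = max(abs(w) for w in weights)
    step = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(w / step))) * step for w in weights]


def output_error(weights, activations, scales):
    """Error of the quantized output after the transformation w*s, x/s."""
    scaled_w = [w * s for w, s in zip(weights, scales)]
    scaled_x = [x / s for x, s in zip(activations, scales)]
    exact = sum(w * x for w, x in zip(weights, activations))
    quantized = sum(w * x for w, x in zip(rtn(scaled_w), scaled_x))
    return abs(quantized - exact)


def search_scales(weights, activations, candidates=(0.5, 1.0, 2.0, 4.0)):
    """Try every per-channel scale combination (stand-in for training the scales)."""
    best = min(
        product(candidates, repeat=len(weights)),
        key=lambda scales: output_error(weights, activations, scales),
    )
    return list(best), output_error(weights, activations, best)


weights = [7.0, 1.4, -3.0, 2.0]
activations = [0.1, 10.0, 0.1, 0.1]

baseline_error = output_error(weights, activations, (1.0,) * len(weights))
best_scales, best_error = search_scales(weights, activations)
print(baseline_error, best_scales, best_error)
```

Because the all-ones scale vector is among the candidates, the searched result can never be worse than plain round-to-nearest on this input.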
