
Published: 2025-03-14


Ricky's Journey into Large Language Models

Core Concepts

Tokens and Embeddings

Tokens vs. Embeddings

Concept	Tokens	Embeddings
What is it?	A textual unit (like a word or subword)	A numerical vector that represents a token’s meaning
Data type	Discrete integers (IDs)	Dense floating-point tensors
Example	`"insulin"` → `2345` (token ID)	`[-0.01, 0.32, ..., 1.25]` (vector of 1024 floats)
Used for	Input to the model (after tokenization)	Internal representation in the model
Reversible?	✅ Can convert back to text via tokenizer	❌ Cannot easily recover original text from embedding

Visualizing Text → Tokens → Embeddings

  ┌──────────────────────────────────────────────┐
  │ "What condition is characterized by..."      │  → text
  └──────────────────────────────────────────────┘
                        ↓
                   Tokenization
                        ↓
┌────────────┬────────────┬────────────┬────────────┐
│  "What"    │ "condition"│    "is"    │   ...      │  → tokens
└────────────┴────────────┴────────────┴────────────┘
                        ↓
           [1023, 5678, 2345, ...]  → token IDs
                        ↓
              ┌───────────────────────┐
              │    Language Model     │   →   Language Model
              └───────────────────────┘
                        ↓
 ┌────────────────────────────────────────────────┐
 │      last_hidden_state (1, seq_len, 1024)      │ → token embeddings
 └────────────────────────────────────────────────┘
                        ↓
             Mean across tokens (dim=1)
                        ↓
         ┌──────────────────────────────┐
         │ Sentence Embedding (1, 1024) │  → sentence embeddings
         └──────────────────────────────┘
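
The flow in the diagram can be traced by hand on a toy example. The sketch below is only an illustration: the three-word vocabulary and the 4-dimensional embedding table are invented, and a real model produces contextual vectors (`last_hidden_state`) rather than a plain table lookup. Only the final averaging step matches the diagram exactly.

```python
# Toy walk-through: text -> tokens -> token IDs -> token vectors -> sentence vector.
# VOCAB and EMBEDDING_TABLE are made up for illustration (hypothetical values).
VOCAB = {"[UNK]": 0, "what": 1, "condition": 2, "is": 3}
EMBEDDING_TABLE = [
    [0.0, 0.0, 0.0, 0.0],   # [UNK]
    [0.1, 0.3, -0.2, 0.5],  # what
    [0.7, -0.1, 0.4, 0.0],  # condition
    [-0.3, 0.2, 0.1, 0.1],  # is
]


def tokenize(text):
    """Whitespace tokenization; real tokenizers split into subwords."""
    return text.lower().split()


def to_ids(tokens):
    """Map each token to its vocabulary ID, falling back to [UNK]."""
    return [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in tokens]


def mean_pool(vectors):
    """Average the token vectors dimension by dimension (the 'dim=1' mean)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


tokens = tokenize("What condition is")
token_ids = to_ids(tokens)
token_embeddings = [EMBEDDING_TABLE[i] for i in token_ids]
sentence_embedding = mean_pool(token_embeddings)
print(tokens, token_ids, len(sentence_embedding))
```

The sentence vector has the same width as one token vector, which is why the diagram goes from `(1, seq_len, 1024)` to `(1, 1024)`.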

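Fine-Tuning

Fine-tuning concept checklist

Why is fine-tuning needed? ✅ (2025.3.11)
The concept of fine-tuning ✅ (2025.3.11)
Fine-tuning steps ✅ (2025.3.11)
Categories of fine-tuning strategies ✅ (2025.3.11)
Fine-tuning frameworks
Specific fine-tuning frameworks
Challenges of fine-tuning ✅ (2025.3.11)
Fine-tuning terminology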

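Why is fine-tuning needed?

Pretrained LLMs such as GPT-4 and LLaMA have already absorbed a large amount of knowledge, but in concrete applications they can still fall short in the following ways:

Weak domain adaptation: specialized fields such as medicine, law, and finance have their own vocabulary and phrasing.
Weak task focus: specific tasks such as sentiment analysis, summarization, or code generation.
No command of a particular style: for example, making ChatGPT sound like a given brand or adopt a particular tone.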

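The concept of fine-tuning

Fine-tuning an LLM means continuing to train a pretrained model on task-specific data (updating some or all of its parameters, and sometimes adjusting its structure) in order to improve its performance on that task.

Fine-tuning steps

Prepare the fine-tuning dataset
Prepare the pretrained model
Adjust the model structure (for example, the output layer)
Train with the chosen fine-tuning technique
Validate and test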

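Categories of fine-tuning strategies

Full fine-tuning
- Updates all parameters; suitable when data is plentiful.
Parameter-efficient fine-tuning (PEFT)
- Updates only a small part of the parameters and freezes most of the model, reducing compute.
- LoRA (Low-Rank Adaptation)
  - Freezes each pretrained weight matrix W and learns a low-rank update BA next to it (effective weight W + BA). Only the small matrices A and B are trained, which sharply reduces the number of trainable parameters.
- Adapter fine-tuning
  - Inserts small adapter modules into the pretrained model and trains only those modules while the pretrained parameters stay unchanged. An adapter is usually a small neural-network layer.
Prompt tuning (Prompt Tuning, P-tuning)
- Learns a small set of continuous "soft prompt" vectors that are fed to the model together with the input, steering it toward the desired output while the model's own weights stay frozen. Prompt Tuning and P-tuning are related but distinct methods, and both are commonly counted as PEFT.
Knowledge-distillation fine-tuning
- Uses an already trained large model (the teacher) to guide the training of a small model (the student). The student learns the teacher's knowledge and so performs better on the target task.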
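
To make the LoRA saving concrete, here is a back-of-the-envelope count. It is a sketch under stated assumptions: the 4096 x 4096 layer size and rank 8 are hypothetical values picked for illustration, and the count covers a single weight matrix only.

```python
def full_params(d_out, d_in):
    """Trainable parameters when the whole d_out x d_in matrix is updated."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable parameters for a rank-r update: B is (d_out x r), A is (r x d_in)."""
    return rank * (d_out + d_in)


d_out = d_in = 4096  # hypothetical layer size
rank = 8             # hypothetical LoRA rank

full = full_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
print(full, lora, f"{lora / full:.4%}")
```

For these numbers the low-rank update trains 65,536 values instead of 16,777,216, well under 1% of the full matrix.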

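Challenges of fine-tuning

Data requirements: supervised fine-tuning depends on labeled data, and results are limited when data is scarce.
Compute: fine-tuning large models needs substantial compute, especially full fine-tuning.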

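Fine-tuning terminology summary

Chinese term	English term	Explanation
预训练	Pre-training	Training a model on large-scale unlabeled data (self-supervised) so that it learns general language features.
微调	Fine-tuning	Further training a pretrained model on domain- or task-specific data to improve its performance there.
监督微调	Supervised Fine-tuning	Fine-tuning on labeled data so the model learns the correct output for the task.
低秩适配	LoRA (Low-Rank Adaptation)	An efficient fine-tuning method that adds low-rank adapter matrices to the weight matrices to cut compute cost.
参数高效微调	PEFT (Parameter Efficient Fine-tuning)	Adjusting only part of the parameters (e.g. LoRA, Adapter) rather than the whole model, to reduce compute needs.
适配器	Adapter	A small network module inserted between Transformer layers for efficient fine-tuning.
全参数微调	Full Fine-tuning	Updating all parameters of the model; usually needs more compute.
指令微调	Instruction Fine-tuning	Training on varied instruction data so the model gets better at following instructions.
强化学习微调	RLHF (Reinforcement Learning from Human Feedback)	Reinforcement learning guided by human feedback, so that answers better match human expectations.
数据集	Dataset	The text or task data used for fine-tuning, usually split into training, validation, and test sets.
迁移学习	Transfer Learning	Reusing model weights trained on one task for another task to reduce training cost.
温度参数	Temperature	Controls the randomness of the output: a higher temperature gives more varied output, a lower one more deterministic output.
Token 限制	Token Limit	The maximum number of tokens the LLM can process at once, which bounds the context length in training and inference.
训练损失	Training Loss	Measures the model's error on the training set; a common loss function is cross-entropy loss.
验证损失	Validation Loss	Measures the model's performance on the validation set; used to detect overfitting.
过拟合	Overfitting	The model performs well on the training data but generalizes poorly to new data.
梯度累积	Gradient Accumulation	Accumulating gradients over several small batches before one update to simulate a larger batch with less GPU memory.
梯度裁剪	Gradient Clipping	A technique against exploding gradients that caps the gradient norm or value.
学习率	Learning Rate	A hyperparameter that controls the step size of parameter updates, affecting convergence speed and stability.
预训练权重	Pretrained Weights	Model parameters obtained from large-scale training, which fine-tuning then optimizes further.
AdamW 优化器	AdamW Optimizer	A variant of Adam with decoupled weight decay, widely used for LLM fine-tuning.
训练步数	Training Steps	The number of parameter updates performed during training, which affects convergence.
批次大小	Batch Size	The number of samples processed at once during training, which affects compute cost and convergence speed.
Prompt 工程	Prompt Engineering	Designing input prompts to steer the LLM toward the desired output.
指令数据	Instruction Data	Data that trains the model to follow instruction formats, such as "Please summarize this article".
增量微调	Incremental Fine-tuning	Training further from an already fine-tuned model rather than starting from the base model.
混合精度训练	Mixed Precision Training	Combining FP16 (or BF16) with FP32 during training to reduce GPU memory use and speed up computation.
零样本学习	Zero-shot Learning	The model makes predictions on a task it has not seen, without extra training or examples.
少样本学习	Few-shot Learning	Adapting the model to a new task from a handful of examples.
全精度训练	Full Precision Training	Training in FP32: high numerical precision but high GPU memory use.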
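
The gradient-accumulation entry in the table can be checked numerically. The sketch below is a toy, assuming a one-parameter model `y_hat = w * x` with mean-squared-error loss and made-up data. It shows that averaging the gradients of equal-sized micro-batches reproduces the full-batch gradient, which is why accumulation can stand in for a larger batch.

```python
import math


def grad_mse(w, batch):
    """Gradient of the mean squared error w.r.t. w for the model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


# Hypothetical (x, y) samples and starting weight.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w = 0.5

# One large batch.
full_batch_grad = grad_mse(w, data)

# Same data as two micro-batches; each gradient is scaled by 1/accum_steps
# and summed before the (single) optimizer step would happen.
accum_steps = 2
micro_batch_size = len(data) // accum_steps
accumulated = 0.0
for step in range(accum_steps):
    micro = data[step * micro_batch_size:(step + 1) * micro_batch_size]
    accumulated += grad_mse(w, micro) / accum_steps

print(full_batch_grad, accumulated)
```

The equality holds exactly only when the micro-batches have the same size; with uneven sizes each micro-batch would need to be weighted by its sample count.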

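Hugging Face in Practice

Platform overview

Hugging Face is a company focused on natural language processing (NLP) and on open-sourcing and deploying AI models.

It provides:

✅ A large collection of pretrained models (BERT, GPT, T5, and more)
✅ A unified Python library (transformers)
✅ A full toolchain for training, fine-tuning, deploying, and sharing models

Hugging Face's core products:

Name	Description
🤗 transformers	A powerful, easy-to-use Python library for loading, using, and training pretrained transformer models
🤗 Datasets	Provides thousands of standard datasets (e.g. SQuAD, SST-2, CoQA)
🤗 Hub	A model community platform where you can upload your own models and download models trained by others
🤗 Spaces	Free hosting and demos for your AI projects (supports Gradio, Streamlit, etc.)
🤗 Auto Classes	A unified model-loading interface inside transformers (e.g. `AutoTokenizer`, `AutoModel`) that simplifies usage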

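Downloading Models

Anatomy of a model name

Part	Example	Meaning
Organization / author	`bert-base-uncased` (no organization), `facebook/bart-large`	The publisher of the model, usually an author, research institute, or company
Architecture	`bert`, `roberta`, `gpt2`, `t5`, `llama`, `distilbert`	The type of Transformer architecture
Size	`base`, `large`, `small`	Indicates parameter count and model scale
Casing	`cased` / `uncased`	Whether English case is preserved (e.g. Apple ≠ apple)
Language	`english`, `chinese`, `multilingual`	Supported language(s)
Training task / dataset	`finetuned-sst-2`, `squad`, `cnn-dailymail`	Whether the model was fine-tuned on a specific task
Special training method	`wwm`, `whole-word-masking`, `ext`, `dapt`, `adapter`	Special techniques such as whole-word masking, extended training, or adapters

Common naming examples

Model name	Meaning
`bert-base-uncased`	Google's base-size BERT, case-insensitive, not fine-tuned
`google-bert/bert-large-cased-whole-word-masking`	A larger, case-sensitive BERT trained with whole-word masking
`distilbert-base-uncased-finetuned-sst-2-english`	A DistilBERT model fine-tuned on SST-2 for English sentiment analysis
`facebook/bart-large-cnn`	Facebook's BART model, fine-tuned on CNN news summarization
`google/flan-t5-xl`	Google's FLAN-T5 in the XL size, suited to zero-shot multi-task use
`microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext`	Microsoft's biomedical PubMedBERT, pretrained on PubMed abstracts and full-text articles

When you download a model from Hugging Face by hand (for example bert-base-uncased), these are the most important files:

File	Purpose	Required?
`config.json`	Model architecture config (number of layers, hidden size, classification head, etc.)	✅
`pytorch_model.bin` (or `model.safetensors`)	Model weights (PyTorch format; many repositories ship the safetensors file instead)	✅
`tokenizer_config.json`	Tokenizer configuration	✅
`vocab.txt` or `merges.txt` + `vocab.json`	Vocabulary (the format differs between models)	✅
`special_tokens_map.json`	Definitions of special tokens (e.g. `[CLS]`, `[SEP]`, `[MASK]`)	⚠️ Recommended
`tokenizer.json`	The fast-tokenizer file	⚠️ Recommended
`README.md`	Model card / documentation	❌ Optional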
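
After a manual download it can help to check that nothing essential is missing before calling `from_pretrained`. The helper below is a hypothetical convenience function, not part of any Hugging Face library. It assumes that `config.json` and `tokenizer_config.json` are required and that either `model.safetensors` or `pytorch_model.bin` is acceptable as the weights file. The vocabulary file is left out because its name differs between models.

```python
import os
import tempfile

REQUIRED = ["config.json", "tokenizer_config.json"]
WEIGHT_FILES = ["model.safetensors", "pytorch_model.bin"]  # either one is enough


def missing_model_files(model_dir):
    """Return the names of essential files that are absent from model_dir."""
    present = set(os.listdir(model_dir))
    missing = [name for name in REQUIRED if name not in present]
    if not any(name in present for name in WEIGHT_FILES):
        missing.append(" or ".join(WEIGHT_FILES))
    return missing


# Demo on a throwaway directory that lacks the tokenizer config.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ["config.json", "pytorch_model.bin"]:
        open(os.path.join(demo_dir, name), "w").close()
    report = missing_model_files(demo_dir)

print(report)
```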

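Code Basics

The transformers library

transformers is an open-source library from Hugging Face, used to:

Download and use pretrained models (BERT, GPT, BioGPT, T5, etc.)
Quickly run tasks such as text classification, text generation, question answering, translation, and summarization
Provide unified Tokenizer, Model, and Pipeline interfaces

🗂️ Core components of `transformers`

Module	Summary
`AutoTokenizer` / `XXXTokenizer`	Converts text into tokens and IDs
`AutoModel` / `XXXModel`	Returns hidden states (embeddings)
`AutoModelForCausalLM` / `XXXForCausalLM`	Used for text generation
`pipeline`	Runs a model on a task quickly (e.g. sentiment analysis, QA)

🧠 Choosing a model class for each task

Task	Model class	Example model
Text embeddings	`AutoModel`	`bert-base-uncased`
Text classification	`AutoModelForSequenceClassification`	`distilbert-base-uncased-finetuned-sst-2-english`
Text generation	`AutoModelForCausalLM`	`gpt2`, `microsoft/biogpt`
Question answering	`AutoModelForQuestionAnswering`	`distilbert-base-uncased-distilled-squad`
Translation	`AutoModelForSeq2SeqLM`	`t5-small`, `facebook/mbart-large-50-many-to-many-mmt`

Using pipeline

Simple examples for different tasks

📘 Sentence / text embeddings (`AutoModel`)

AutoModel returns only hidden states and has no classification head (it makes no predictions and only extracts features).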

import torch
from transformers import AutoTokenizer, AutoModel

# Load the locally saved DistilBERT model and tokenizer
model_path = '***/distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path)

# Input sentence
text = "Transformers are powerful models for NLP."

# Use the tokenizer to turn the sentence into model input tensors
inputs = tokenizer(text, return_tensors="pt")

# Disable gradient computation and run the forward pass
with torch.no_grad():
    outputs = model(**inputs)

# Average the hidden states of all tokens to get a sentence-level vector
sentence_embedding = outputs.last_hidden_state.mean(dim=1)

# Print the embedding shape (should be [1, hidden_size], e.g. [1, 768])
print(sentence_embedding.shape)

torch.Size([1, 768])
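
The plain `mean(dim=1)` above is fine for a single sentence, because nothing is padded. Once several sentences of different lengths are batched, padding positions would be averaged in too. The sketch below shows mask-aware mean pooling on plain Python lists so the arithmetic is easy to follow. The vectors and mask are made-up values, and a tensor version would apply the same idea using `attention_mask`.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average only the vectors whose mask value is 1 (real tokens, not padding)."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    return [sum(col) / len(kept) for col in zip(*kept)]


# Two real tokens followed by one padding position (hypothetical 2-d vectors).
hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(hidden, mask)
naive = [sum(col) / len(hidden) for col in zip(*hidden)]
print(pooled, naive)
```

The naive average is pulled toward the padding vector, while the masked version depends only on the real tokens.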

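🧾 Text generation (`AutoModelForCausalLM`)

AutoModelForCausalLM is one of Hugging Face's auto model loaders. It loads models for causal language modeling, that is, models that predict the next token given the preceding text.

🧠 What does "Causal LM" mean?

Causal: the model can only see "past" tokens, never future ones.

Training objective: the prediction for the current token depends only on the tokens to its left (the ones before it).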
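
The "only look left" rule is usually enforced with a lower-triangular attention mask. The sketch below builds such a mask for a 4-token sequence in plain Python. It is a conceptual illustration of the idea, not the internal implementation of any particular model.

```python
def causal_mask(seq_len):
    """Row i marks which positions token i may attend to: itself and everything before it."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


mask = causal_mask(4)
for row in mask:
    print(row)
```

The first token sees only itself, and the last token sees the whole prefix. During training this lets every position predict its next token in parallel without seeing it.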

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load GPT-2 tokenizer and model (causal language model)
model_path = '***/gpt2'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

# Define prompt for text generation
prompt = "The future of medicine is"

# Tokenize input (returns dictionary with input_ids and attention_mask)
inputs = tokenizer(prompt, return_tensors="pt")

# Generate text using sampling method
output_ids = model.generate(
    **inputs,  # Unpack tokenized inputs
    max_length=50,  # Maximum total tokens (input + generated)
    do_sample=True,  # Enable probabilistic sampling
    top_k=50,  # Consider top 50 probable tokens at each step
    top_p=0.9,  # Nucleus sampling: choose from top tokens covering 90% probability mass
    temperature=0.9,  # Lower = more predictable, higher = more creative
    pad_token_id=tokenizer.eos_token_id  # Use EOS token for padding
)

# Print the raw generated token IDs, then decode them to text
print(output_ids)
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(generated_text)

tensor([[ 464, 2003, 286, 9007, 318, 284, 1064, 649, 23533, 11,
475, 340, 468, 257, 890, 835, 284, 467, 878, 326,
4325, 13, 198, 198, 464, 717, 1688, 8668, 4473, 319,
428, 1808, 373, 5952, 287, 262, 3095, 12, 23664, 82,
11, 475, 340, 373, 407, 257, 1688, 1943, 13, 198]])
The future of medicine is to find new medicines, but it has a long way to go before that happens.

The first major clinical trial on this question was conducted in the mid-1980s, but it was not a major success.

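Parameter	Type	Description
`**inputs`	dict	The tokenizer-encoded inputs (e.g. `input_ids`, `attention_mask`)
`max_length`	int	Maximum total length in tokens, counting both the prompt and the generated text
`do_sample`	bool	Whether to sample (True means each step samples from the probability distribution instead of always taking the most likely token)
`top_k`	int	Sample only from the K most probable tokens at each step, which narrows the candidates and keeps very unlikely tokens out
`top_p`	float (0-1)	Nucleus sampling: sample only from the smallest set of top tokens whose cumulative probability reaches p, so the number of candidates varies per step
`temperature`	float (>0)	Controls how "creative" the output is: lower is more conservative (closer to deterministic), higher is more varied (closer to random)
`pad_token_id`	int	ID of the padding token (used to align lengths); GPT-2 defines no pad token, so the EOS ID is used here to avoid a warning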
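
How `temperature`, `top_k`, and `top_p` interact is easier to see on a tiny example. The sketch below is a simplified re-implementation in plain Python, not the library's own code. The five logits and the settings (`top_k=3`, `top_p=0.8`) are made up so that both filters visibly remove candidates.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax (numerically stabilized by subtracting the max)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k, top_p):
    """Keep the top_k tokens, renormalize, then keep the smallest prefix reaching top_p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)[:top_k]
    total_k = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for token_id, p in ranked:
        kept.append((token_id, p / total_k))
        cumulative += p / total_k
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token_id: p / total for token_id, p in kept}


logits = [2.0, 1.0, 0.5, -1.0, -3.0]          # hypothetical next-token logits
probs = softmax(logits, temperature=0.9)
candidates = filter_top_k_top_p(probs, top_k=3, top_p=0.8)

rng = random.Random(0)                         # fixed seed for reproducibility
next_token = rng.choices(list(candidates), weights=list(candidates.values()), k=1)[0]
print(candidates, next_token)
```

Here `top_k` cuts five tokens down to three, and `top_p` then keeps only the first two, whose combined probability already exceeds 0.8. The final sample is drawn from those two.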

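📊 Text classification (`AutoModelForSequenceClassification`)

AutoModelForSequenceClassification is a generic model interface in the Hugging Face Transformers library for loading and using pretrained text-classification models.

It adds a classification head (usually a linear layer that outputs logits) on top of a base model such as BERT, RoBERTa, or DistilBERT. Softmax is applied to the logits afterwards to obtain probabilities:

Text → Tokenizer → Transformer Encoder (BERT) → [CLS] → Linear Layer → logits → Softmax → class probabilities

The model's final output is:

outputs.logits  # shape = (batch_size, num_labels)

✅ Suitable task types

Task type	Example input	Example output
Binary classification (sentiment analysis)	"I love this product."	[Negative, ✅Positive]
Multi-class classification (intent detection)	"Book a flight to Tokyo tomorrow."	[weather, ✅booking, cancel]
Document classification	A full article	[sport, politics, tech]

Example code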

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# ==== 1. Load the local model and tokenizer ====
model_path = '***/distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

# ==== 2. Input text ====
text = "Clannad is the best anime I have ever seen."
# Encode the text as model input (contains input_ids and attention_mask)
inputs = tokenizer(text, return_tensors="pt")

# ==== 3. Forward pass and probabilities ====
with torch.no_grad():  # no gradients: saves memory, inference only
    outputs = model(**inputs)        # output contains the logits (raw scores)
    logits = outputs.logits          # shape (1, 2): one score per label
    probabilities = F.softmax(logits, dim=1)  # softmax turns logits into a probability distribution

# ==== 4. Predicted class ====
predicted_class = torch.argmax(probabilities, dim=1).item()  # index of the highest probability
labels = ["Negative", "Positive"]  # the two SST-2 labels

# ==== 5. Print the results ====
print("\n=== Raw model outputs ===")
print(outputs)

print("\n=== Logits (unnormalized scores) ===")
print(logits)

print("\n=== Probabilities (after softmax) ===")
print(probabilities)

# Print the final predicted label and its probability
print(f"\n✅ Predicted class: {labels[predicted_class]} ({probabilities[0][predicted_class]:.4f})")

=== Raw model outputs ===
SequenceClassifierOutput(loss=None, logits=tensor([[-4.2006, 4.5068]]), hidden_states=None, attentions=None)

=== Logits (unnormalized scores) ===
tensor([[-4.2006, 4.5068]])

=== Probabilities (after softmax) ===
tensor([[1.6534e-04, 9.9983e-01]])

✅ Predicted class: Positive (0.9998)
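
The step from logits to probabilities in the output above can be reproduced by hand. This sketch uses only the two logit values printed above and a plain-Python softmax. It is meant to show the arithmetic, not to replace `F.softmax`.

```python
import math


def softmax(logits):
    """Softmax over a list of scores, stabilized by subtracting the maximum."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-4.2006, 4.5068]   # values taken from the printed model output
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)
print([f"{p:.4e}" for p in probs], predicted)
```

Because softmax is monotonic, taking `argmax` over the logits directly would give the same predicted class. The softmax is only needed when the probability itself is reported.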

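❓Question answering (`AutoModelForQuestionAnswering`)

AutoModelForQuestionAnswering is Hugging Face's auto-loading interface for building extractive question-answering systems.

Extractive QA: given a question and a context passage, the model extracts one contiguous answer span from the passage.

Input:

Context: "The heart pumps blood through the body using rhythmic contractions."

Question: "What does the heart do?"

Model output: "pumps blood through the body"

These models are built on Transformer architectures such as BERT and RoBERTa. The model learns to predict two scores (logits) for every token:

Start score: how likely the token is to be the start of the answer
End score: how likely the token is to be the end of the answer

The simplest decoding picks:

start_index = argmax(start_scores)
end_index = argmax(end_scores)

So:

answer = input_tokens[start_index : end_index + 1]  # indices refer to the full encoded sequence (question + context)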
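
Taking the two argmax values independently can yield a start that lies after the end. A common remedy is to search for the best valid pair instead. The sketch below is one simple way to do that, assuming the span score is the sum of the start and end logits and that answers are capped at a hypothetical `max_answer_len` tokens. The logits are made up, and restricting the search to context tokens is left out for brevity.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end, score) with the highest start+end score subject to start <= end."""
    best = None
    for s, s_score in enumerate(start_logits):
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best


# Independent argmax would give start=2, end=1 here, which is not a valid span.
start_logits = [0.1, 0.2, 5.0, 0.3]
end_logits = [0.0, 6.0, 0.1, 4.0]

start, end, score = best_span(start_logits, end_logits)
print(start, end, score)
```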

Example code

import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# ========== 1. Load the model and tokenizer ==========
model_path = "***/distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForQuestionAnswering.from_pretrained(model_path)

# ========== 2. Question and context ==========
question = "What does NLP stand for?"
context = "NLP stands for Natural Language Processing, a subfield of artificial intelligence."

print("\n=== Raw input ===")
print("📌 Question:", question)
print("📄 Context:", context)

# Encode the input
inputs = tokenizer(question, context, return_tensors="pt")

# Print the encoded input
print("\n=== Tokenizer encoding ===")
print("🧾 input_ids:", inputs["input_ids"])
print("🧠 tokens:", tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))
print("📊 attention_mask:", inputs["attention_mask"])

# ========== 3. Inference ==========
with torch.no_grad():
    outputs = model(**inputs)

# ========== 4. Inspect the logits ==========
start_logits = outputs.start_logits
end_logits = outputs.end_logits

print("\n=== Model output scores (logits) ===")
print("🚩 start_logits:", start_logits)
print("🏁 end_logits:", end_logits)

# ========== 5. Compute start and end positions ==========
start_idx = torch.argmax(start_logits, dim=1).item()
end_idx = torch.argmax(end_logits, dim=1).item()

print(f"\n📍 Predicted start position: {start_idx}")
print(f"📍 Predicted end position: {end_idx}")

# ========== 6. Decode the answer ==========
if start_idx > end_idx:
    print("⚠️ Start position is after end position; the answer is invalid.")
    answer = "[Invalid prediction]"
else:
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0][start_idx:end_idx + 1])
    print("🔍 Answer tokens:", tokens)
    answer = tokenizer.convert_tokens_to_string(tokens)

# ========== 7. Print the final result ==========
print("\n✅ Final predicted answer:", answer)

Example output


=== Raw input ===
📌 Question: What does NLP stand for?
📄 Context: NLP stands for Natural Language Processing, a subfield of artificial intelligence.

=== Tokenizer encoding ===
🧾 input_ids: tensor([[  101,  1327,  1674, 21239,  2101,  2484,  1111,   136,   102, 21239,
          2101,  4061,  1111,  6240,  6828, 18821,  1158,   117,   170,  4841,
          2427,  1104,  8246,  4810,   119,   102]])
🧠 tokens: ['[CLS]', 'What', 'does', 'NL', '##P', 'stand', 'for', '?', '[SEP]', 'NL', '##P', 'stands', 'for', 'Natural', 'Language', 'Process', '##ing', ',', 'a', 'sub', '##field', 'of', 'artificial', 'intelligence', '.', '[SEP]']
📊 attention_mask: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1]])

=== Model output scores (logits) ===
🚩 start_logits: tensor([[-2.3248e+00, -2.4476e+00, -3.1210e+00, -2.7561e+00, -4.8028e+00,
         -4.4546e+00, -5.1295e+00, -2.0812e+00,  1.0627e-02,  3.7939e+00,
         -1.8733e+00, -1.1664e+00, -1.4235e+00,  1.1914e+01,  2.4047e+00,
          2.3639e+00,  4.6805e-01, -1.6326e+00,  1.9572e+00,  1.0077e+00,
         -2.5621e+00, -3.3284e+00,  3.0263e+00, -1.2865e+00, -2.7076e+00,
          1.0653e-02]])
🏁 end_logits: tensor([[-0.8734, -2.4869, -3.9640, -4.1681, -3.1094, -4.6843, -4.2300, -2.2853,
          0.8502, -2.0429,  0.0950, -2.6078, -2.2444,  4.0181,  5.2197,  2.5751,
         11.4774,  5.3121, -2.4660, -3.0402,  0.7667, -4.4215, -1.3441,  4.3169,
          4.8004,  0.8503]])

📍 Predicted start position: 13
📍 Predicted end position: 16
🔍 Answer tokens: ['Natural', 'Language', 'Process', '##ing']

✅ Final predicted answer: Natural Language Processing

🔴 Note: AutoModelForQuestionAnswering is only for building extractive question-answering systems; it does not cover the other kinds of QA.

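Extension: common question-answering paradigms

Type	Needs context?	Answer source	Typical models	Use case
Extractive QA	Yes	Span of the context	BERT, RoBERTa	Document reading comprehension
Generative QA	Optional	Generated by the model	T5, GPT	Open-domain QA
Multiple-choice QA	Yes	Given options	BERT, XLNet	Exam systems
Open-domain QA	No	Knowledge base / model	DPR + BERT	Intelligent assistants
Visual QA	Yes (image)	Image content	ViLBERT, CLIP	Image understanding
Table QA	Yes (table)	Table data	TAPAS	Structured-data queries

Comparison: extractive QA vs. zero-shot QA

Dimension	Extractive QA	Zero-shot QA
Definition	Extracts one contiguous answer from the provided context passage	Does not rely on a provided context; generates the answer directly from pretrained knowledge
Depends on context	✅ Context required	❌ Context optional
Model training	Needs fine-tuning on a QA dataset (e.g. SQuAD)	Usually built on a large language model (e.g. GPT-3, T5) with no QA-specific fine-tuning
Output form	Usually a substring of the passage	Free-form answer generated by the model
Model interface	`AutoModelForQuestionAnswering`	`AutoModelForSeq2SeqLM` / `AutoModelForCausalLM`
Example models	BERT QA, RoBERTa QA, ALBERT QA	GPT-3, T5, FLAN-T5, BioGPT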

Details of each QA type:

1. Extractive QA
Characteristics: extracts the answer span directly from the given text.

Example models: BERT, RoBERTa, DistilBERT.

Input and output:

Input: question + context

Output: start and end positions of the answer within the context (character or token indices)
Example code:
from transformers import pipeline
qa_pipeline = pipeline("question-answering", model="bert-large-uncased-whole-word-masking-finetuned-squad")
result = qa_pipeline(
    question="What is the capital of France?",
    context="France is a country in Europe. Its capital is Paris."
)
print(result)  # Output looks like: {'answer': 'Paris', 'score': 0.98, ...} (the exact score may differ)
Use cases: reading comprehension, document retrieval (e.g. the SQuAD dataset).
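2. Generative QA
Characteristics: the model generates a free-text answer to the question (it does not have to copy a sentence from the context).

Example models: T5, GPT-3, BART.

Input and output:

Input: question (+ optional context)

Output: generated answer text
Example code:
from transformers import pipeline
generator = pipeline("text2text-generation", model="t5-small")
answer = generator(
    "question: What is the capital of France? context: France is a country in Europe. Its capital is Paris."
)
print(answer)  # Expected form: [{'generated_text': 'Paris'}] (a small model may answer differently)
Use cases: open-domain QA, chatbots.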
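3. Multiple-choice QA
Characteristics: picks the correct answer from a given set of options.

Example models: BERT, XLNet.

Input and output:

Input: question + context + candidate options

Output: option label (e.g. A/B/C/D) or confidence scores
Example code:
import torch
from transformers import AutoModelForMultipleChoice, AutoTokenizer

# Note: bert-base-uncased ships without a trained multiple-choice head, so the head is
# randomly initialized and the prediction is not meaningful until the model is fine-tuned
model = AutoModelForMultipleChoice.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Question and options
question = "What is the capital of France?"
options = ["London", "Berlin", "Paris", "Madrid"]

# Encode every (question, option) pair, then add a batch dimension:
# the model expects tensors of shape (batch_size, num_choices, seq_len)
encoded = tokenizer([question] * len(options), options, padding=True, return_tensors="pt")
inputs = {name: tensor.unsqueeze(0) for name, tensor in encoded.items()}

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, num_choices)
best_answer = options[torch.argmax(logits, dim=1).item()]
print(best_answer)  # a fine-tuned model would be expected to print: Paris
Use cases: exam systems, standardized tests.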
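4. Open-domain QA
Characteristics: no context is supplied; the answer is retrieved from a knowledge base or recalled from the model's parameters.

Typical combinations:

Retriever (e.g. DPR) + reader (e.g. BERT)

Purely generative model (e.g. GPT-3)
Example code (retriever + reader):
# Uses the Haystack framework (retrieval-based QA); the imports follow the Haystack 1.x layout
from haystack import Pipeline
from haystack.nodes import BM25Retriever, FARMReader

# my_document_store is not defined here: it must be an existing Haystack document store
# that already holds your documents and supports BM25 retrieval
retriever = BM25Retriever(document_store=my_document_store)
reader = FARMReader(model_name="deepset/bert-base-cased-squad2")
pipeline = Pipeline()
pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
pipeline.add_node(component=reader, name="Reader", inputs=["Retriever"])

result = pipeline.run(query="What is the capital of France?")
Use cases: intelligent assistants (e.g. Alexa, Siri).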
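5. Visual QA (VQA)
Characteristics: answers questions based on the content of an image.

Example models: ViLBERT, ViLT, CLIP, BLIP.

Input and output:

Input: image + question

Output: text answer
Example code:
from transformers import pipeline
vqa_pipeline = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")
answer = vqa_pipeline(
    image="paris.jpg",  # path to a local image file (placeholder name)
    question="What is in the center of the image?"
)
print(answer)  # Output: a list of candidate answers with scores, e.g. [{'score': ..., 'answer': ...}, ...]
Use cases: image understanding, assistive tools for blind users.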
6. Table QA
Characteristics: extracts the answer from a structured table.

Example models: TAPAS, TaBERT.
Example code:
from transformers import pipeline
table_qa = pipeline("table-question-answering", model="google/tapas-base-finetuned-wtq")

# All cell values must be strings
table = {
    "City": ["Paris", "London", "Berlin"],
    "Country": ["France", "UK", "Germany"]
}
answer = table_qa(
    table=table,
    query="Which city is in France?"
)
print(answer)  # Output looks like: {'answer': 'Paris', ...}
Use cases: Excel / database queries, financial report analysis.


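🌍 Translation (`AutoModelForSeq2SeqLM`)

AutoModelForSeq2SeqLM is the Hugging Face model interface for generative tasks (such as summarization, translation, question answering, and dialogue) on models with an encoder-decoder architecture.

Task type	Example model	Input	Output
Text summarization	`facebook/bart-large-cnn`	News article	Summary sentences
Machine translation	`Helsinki-NLP/opus-mt-en-zh`	English	Chinese
Generative QA	`t5-base`, `google/flan-t5-base`	Question + document	Free-text answer
Multi-task learning	`google/flan-t5-xl`	Instruction-style prompt	Output for any task

Common Seq2Seq models (usable with AutoModelForSeq2SeqLM)

Model name	Description
`t5-base`	Google's multi-task Seq2Seq model
`google/flan-t5-base`	A stronger T5 variant that has been instruction-tuned
`facebook/bart-large-cnn`	BART fine-tuned for news summarization
`Helsinki-NLP/opus-mt-*`	Machine-translation models for many language pairs

Example code: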

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# 1. Load the tokenizer and model
model_path = '***/Helsinki-NLP:opus-mt-en-zh'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

# 2. Prepare the input text
source_text = "The future of AI is full of possibilities."

# 3. Tokenize and encode the input (convert it to the format the model accepts)
inputs = tokenizer(source_text, return_tensors="pt")

# 4. Generate the translation (gradients disabled to speed up inference)
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_length=64,     # maximum output length
        num_beams=5,       # beam width: larger is more thorough but slower
        early_stopping=True  # stop as soon as num_beams finished candidates exist
    )

# 5. Decode the output token IDs into readable text
translated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

# 6. Print the source text and the translation
print("\n🌐 Original:", source_text)
print("🌍 Translated:", translated_text)

🌐 Original: The future of AI is full of possibilities.
🌍 Translated: AI的未来充满了可能性。
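
The `num_beams` setting above controls beam search, which keeps several partial outputs alive instead of committing to the single best token at each step. The toy below illustrates why that can help. The next-token table is invented, and the loop is a bare-bones sketch of the idea rather than the library's implementation (no length penalty, no batching).

```python
import math

# Hypothetical next-token log-probabilities, keyed by the previous token.
NEXT_TOKEN_LOGPROBS = {
    "<s>": {"a": math.log(0.6), "b": math.log(0.4)},
    "a": {"x": math.log(0.5), "y": math.log(0.5)},
    "b": {"x": math.log(0.9), "y": math.log(0.1)},
    "x": {"</s>": 0.0},
    "y": {"</s>": 0.0},
}


def beam_search(num_beams, max_steps=4):
    """Return the best finished (tokens, log_prob) found with the given beam width."""
    beams = [(["<s>"], 0.0)]
    finished = []
    for _ in range(max_steps):
        # Expand every live beam by every possible next token.
        candidates = [
            (tokens + [tok], score + logp)
            for tokens, score in beams
            for tok, logp in NEXT_TOKEN_LOGPROBS[tokens[-1]].items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        # Keep the best num_beams candidates; finished ones leave the beam.
        beams = []
        for tokens, score in candidates[:num_beams]:
            if tokens[-1] == "</s>":
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    return max(finished, key=lambda c: c[1])


greedy_tokens, greedy_score = beam_search(num_beams=1)
beam_tokens, beam_score = beam_search(num_beams=2)
print(greedy_tokens, math.exp(greedy_score))
print(beam_tokens, math.exp(beam_score))
```

With a width of 1 the search commits to "a" because it looks best locally, and ends with probability 0.3. With a width of 2 it also keeps "b" and finds the sequence with probability 0.36.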

Token Basics

from transformers import AutoTokenizer

# Initialize tokenizer from a fine-tuned DistilBERT model for sentiment analysis
model_path = '***/distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Sample text for tokenization demonstration
text = "Hello, my name is Ricky"
print(f'Raw text: {text}\n')

# Tokenization: text -> subword tokens
tokens = tokenizer.tokenize(text)
print(f'Tokenized output: {tokens}\n')

# tokens -> vocabulary IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(f'Token IDs: {token_ids}\n')

# text -> model input format
input_py = tokenizer(text)  # Returns Python dictionary with lists
input_pt = tokenizer(text, return_tensors='pt')  # Returns Python dictionary with ready-to-use PyTorch tensors

print(f'Python dictionary format (input_py): {input_py}\n')
print(f'PyTorch tensor format (input_pt): {input_pt}')

# Note:
# Use return_tensors='pt' when the encoding is passed straight to a PyTorch model:
# 1. The tensors already carry the batch dimension that the model's forward() expects
# 2. They can be moved to a GPU together with the model via .to(device)
# 3. With padding=True, several texts can be encoded into one batch
# The plain-list format (input_py) is convenient for inspecting the IDs

Output

Raw text: Hello, my name is Ricky

Tokenized output: ['hello', ',', 'my', 'name', 'is', 'ricky']

Token IDs: [7592, 1010, 2026, 2171, 2003, 11184]

Python dictionary format (input_py): {'input_ids': [101, 7592, 1010, 2026, 2171, 2003, 11184, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

PyTorch tensor format (input_pt): {'input_ids': tensor([[  101,  7592,  1010,  2026,  2171,  2003, 11184,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}
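
Subword pieces such as 'NL', '##P' or 'Process', '##ing' (seen in the QA example earlier) come from WordPiece-style tokenization, which repeatedly takes the longest vocabulary entry that matches the rest of the word. The sketch below is a simplified version of that greedy matching. The five-entry vocabulary is made up, and a real tokenizer also handles casing, punctuation, and a far larger vocabulary.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"NL", "##P", "Process", "##ing", "Natural"}  # hypothetical vocabulary
nlp_pieces = wordpiece("NLP", toy_vocab)
processing_pieces = wordpiece("Processing", toy_vocab)
unknown_pieces = wordpiece("Ricky", toy_vocab)
print(nlp_pieces, processing_pieces, unknown_pieces)
```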

Embedding Basics

import torch
from transformers import AutoTokenizer, AutoModel

# Load pre-trained model and tokenizer
model_path = '/home/xixingyu/disk1/PretrainedModels/distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path)

# Sample text 
text = 'Hello, my name is Ricky.'

# =============================================
# STEP 1: TOKENIZATION - Converting text to numbers
# =============================================
print("\n=== Tokenization ===")
print(f"\nRaw text: {text}")

inputs = tokenizer(text, return_tensors='pt')  
print("Tokenized Input Structure (PyTorch tensors):")
print(inputs)

# Let's see what each component means
print("\nInput IDs (numerical representation of tokens):")
print(inputs['input_ids'])

print("\nAttention Mask (shows which tokens are real vs padding):")
print(inputs['attention_mask'])

# =============================================
# STEP 2: GENERATING EMBEDDINGS
# =============================================
print("\n=== Generating Embeddings ===")
with torch.no_grad():  # Disable gradient calculation for inference
    outputs = model(**inputs)  # Forward pass through the model
    
    # The model returns an output object; last_hidden_state holds one contextual vector per token
    token_embeddings = outputs.last_hidden_state
    
    print("\nShape of Token Embeddings:")
    print(f"[Batch Size, Sequence Length, Embedding Dimension] = {token_embeddings.shape}")
    print("- Batch Size = 1 (we processed one sentence)")
    print(f"- Sequence Length = {token_embeddings.shape[1]} (number of tokens)")
    print(f"- Embedding Dimension = {token_embeddings.shape[2]} (size of each vector)")

# =============================================
# STEP 3: CREATING SENTENCE EMBEDDINGS
# =============================================
print("\n=== Creating Sentence Embeddings ===")
# Simple method: Average all token embeddings (mean pooling)
sentence_embedding = token_embeddings.mean(dim=1)

print("\nShape of Sentence Embedding:")
print(f"[Batch Size, Embedding Dimension] = {sentence_embedding.shape}")
print("- Represents the entire sentence as a single vector")

# =============================================
# EDUCATIONAL NOTES
# =============================================
"""
WHAT ARE EMBEDDINGS?

1. Token Embeddings:
- Each word/subword is converted to a high-dimensional vector
- These vectors capture semantic meaning and context
- Example: "cat" and "kitten" will have similar vectors

2. Sentence Embeddings:
- A single vector representing the entire sentence
- Created by combining token embeddings (here by averaging)
- Can be used for tasks like similarity comparison

WHY THIS MATTERS:
- Computers don't understand words, only numbers
- Embeddings convert language to numerical representations
- Similar meanings → Similar vectors → Better AI understanding
"""

Output

=== Tokenization ===

Raw text: Hello, my name is Ricky.
Tokenized Input Structure (PyTorch tensors):
{'input_ids': tensor([[  101,  7592,  1010,  2026,  2171,  2003, 11184,  1012,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}

Input IDs (numerical representation of tokens):
tensor([[  101,  7592,  1010,  2026,  2171,  2003, 11184,  1012,   102]])

Attention Mask (shows which tokens are real vs padding):
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])

=== Generating Embeddings ===

Shape of Token Embeddings:
[Batch Size, Sequence Length, Embedding Dimension] = torch.Size([1, 9, 768])
- Batch Size = 1 (we processed one sentence)
- Sequence Length = 9 (number of tokens)
- Embedding Dimension = 768 (size of each vector)

=== Creating Sentence Embeddings ===

Shape of Sentence Embedding:
[Batch Size, Embedding Dimension] = torch.Size([1, 768])
- Represents the entire sentence as a single vector
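
Sentence vectors like the one above are typically compared with cosine similarity. The sketch below shows the formula on made-up 3-dimensional vectors. Real embeddings would have 768 dimensions here, and the names `cat`, `kitten`, and `invoice` are only labels for the hypothetical vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings: the first two point in similar directions.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

sim_close = cosine_similarity(cat, kitten)
sim_far = cosine_similarity(cat, invoice)
print(round(sim_close, 3), round(sim_far, 3))
```

A value near 1 means the vectors point the same way and a value near 0 means they are unrelated. The score ignores vector length, which is why it is a common choice for comparing embeddings.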

Text sentiment classification (starter program)

helloworld.py, the "hello world" of calling a pretrained model

from transformers import AutoTokenizer, AutoModelForSequenceClassification  # Hugging Face transformers library
import torch

# Path to the pretrained model (DistilBERT fine-tuned on the SST-2 English sentiment dataset, used for sentiment classification)
model_path = '***/distilbert-base-uncased-finetuned-sst-2-english'

# ===================== Load the model and tokenizer =====================
# Load the tokenizer that matches the pretrained model; it converts raw text into the token ID sequence the model understands
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Load the pretrained sequence-classification model
model = AutoModelForSequenceClassification.from_pretrained(model_path)

# ===================== Prepare the input =====================
# Text to analyze
text = "I love China"

# The tokenizer:
# - splits the text into tokens
# - adds special tokens (e.g. [CLS], [SEP])
# - converts the tokens to their IDs
# - builds the attention mask
# return_tensors='pt' returns PyTorch tensors (instead of plain Python lists)
inputs = tokenizer(text, return_tensors='pt')
print(inputs)
print()

# Structure of the input (for debugging):
# input_ids: the numeric ID of each token
# attention_mask: marks the valid positions (1 = real token, 0 = padding)

# ===================== Inference =====================
# Pass the prepared input to the model
output = model(**inputs)  # ** unpacks the dict into keyword arguments
print(output)
print()

# ===================== Parse the model output =====================
# Take the logits from the output (raw scores, before softmax)
# logits have shape [batch_size, num_classes]
# Only one sentence was passed in, so batch_size is 1
# SST-2 is a binary task, so num_classes is 2 (index 0: negative, 1: positive)
logits = output.logits
print(logits)
print()

# Predicted class: the index of the largest logit
# torch.argmax returns the index of the maximum value
# dim=1 searches along the class dimension (not the batch dimension)
predicted_class_id = torch.argmax(logits, dim=1)
print(predicted_class_id)  # still a PyTorch tensor at this point
print()

predicted_class_id = predicted_class_id.item()  # .item() converts the tensor to a Python int
print(predicted_class_id)  # now a plain integer (0 or 1)
print()

# ===================== Map and print the result =====================
# Class labels (note the standard SST-2 order)
labels = ['Negative', 'Positive']

# Look up the human-readable label for the prediction
print(labels[predicted_class_id])
print()

Output

{'input_ids': tensor([[ 101, 1045, 2293, 2859,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}

SequenceClassifierOutput(loss=None, logits=tensor([[-4.1768,  4.4845]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

tensor([[-4.1768,  4.4845]], grad_fn=<AddmmBackward0>)

tensor([1])

1

Positive

Medical LLMs

Models
HuatuoGPT (华佗)
BioGPT

LLM Leaderboards

https://aider.chat/docs/leaderboards/

https://livebench.ai/#/

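Ways to Use LLMs

1. Web interface (the most convenient way)

This suits scenarios that need no deep integration or where the technical barrier should stay low: users interact with the model directly in the browser.

📌 Representative platforms

Platform	Company	Model family	Website
ChatGPT	OpenAI	GPT-4	🔗 chat.openai.com
Claude	Anthropic	Claude 3.5 / Claude 3.7 Sonnet	🔗 claude.ai
Gemini	Google DeepMind	Gemini 1.5 series (formerly Bard)	🔗 gemini.google.com
POE	Quora	GPT, Claude, Gemini, Mistral, etc.	🔗 poe.com
DeepSeek	DeepSeek	DeepSeek-V3, DeepSeek-R1	🔗 www.deepseek.com
Tongyi Qianwen (通义千问)	Alibaba Cloud	Qwen series	🔗 tongyi.aliyun.com
Doubao (豆包)	ByteDance	Doubao series	🔗 www.doubao.com
Kimi	Moonshot AI (月之暗面)	Moonshot series	🔗 kimi.moonshot.cn
Zhipu Qingyan (智谱清言)	Zhipu AI	ChatGLM series	🔗 chatglm.cn

2. API access (for developers)

3. Local deployment (privacy- and security-focused)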

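About Ollama

Ollama is an open-source tool for running, deploying, and managing large language models (LLMs) locally. It is especially suited to developers, researchers, or anyone who wants to use AI models offline or in a private environment.

Official Ollama website: https://ollama.com/

Common Ollama commands

Model management

Command	Description
`ollama pull <model>`	Download a model (e.g. `llama3`, `mistral`)
`ollama list`	List the installed models
`ollama rm <model>`	Delete a local model
`ollama cp <source model> <new model name>`	Copy a model
`ollama show --license <model>`	Show a model's license

Running and interacting

Command	Description
`ollama run <model>`	Start an interactive chat (loads the model automatically)
`ollama run <model> "your prompt"`	Pass a prompt directly and print the output (non-interactive)
`ollama serve`	Start the local API server (default port `11434`)

System and debugging

Command	Description
`ollama help`	Show help for all commands
`ollama --version`	Show the Ollama version
`OLLAMA_MODELS=<path>`	Set the model storage path (environment variable)
`ollama ps`	List the models that are currently running

Downloading models with Ollama

Download DeepSeek with Ollama (https://ollama.com/library/deepseek-r1)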

ollama run deepseek-r1:7b
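
Once the model is pulled and `ollama serve` is running, the same model can be called from code through the local HTTP API on port 11434. The sketch below uses only the Python standard library and targets the `/api/generate` endpoint with streaming turned off. The helper names and the prompt are made up for illustration, and the request is only built here, not sent, so the snippet runs without a server.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local port


def build_generate_request(model, prompt):
    """Build a POST request asking the local Ollama server for one complete reply."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, timeout=120):
    """Send the request and return the reply text (requires a running Ollama server)."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]


request = build_generate_request("deepseek-r1:7b", "Explain LoRA in one sentence.")
print(request.get_method(), request.full_url)
# With the server running: print(generate("deepseek-r1:7b", "Explain LoRA in one sentence."))
```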

Cherry Studio

4. Plugins / Agent systems (building intelligent workflows)

Rickyの水果摊

https://ricky2333.github.io/AI-basics/index.html

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Rickyの水果摊 when reposting.
