When the labels shift from random (in-space) labels to arbitrary English words, direct models degrade sharply, while the drop for channel models is not obvious.
Format matters: the input-label pair structure must be followed; input text only, or labels only, both fail.
1) Ground-truth labels in the demonstrations are not important
: ICL steadily improves as the number of shots and the model scale grow; one would expect that giving the demonstrations wrong/random labels should hurt performance,
but random labels basically do not affect performance;
Demonstrations with random labels barely hurt performance across a range of classification and multi-choice tasks. The model does not rely on the input-label mapping in the demonstrations to perform the task.
The paper finds:
1. the label space and the distribution of the input text specified by the demonstrations are both key to in-context learning (regardless of whether the labels are correct for individual inputs);
: label space and input distribution are the key factors driving ICL performance
2. specifying the overall format is also crucial, e.g., when the label space is unknown, using random English words as labels is significantly better than using no labels;
: format matters; e.g., a demonstration whose labels are random English words still beats one with no labels at all
3. meta-training with an in-context learning objective (Min et al., 2021b) magnifies these effects: the models almost exclusively exploit simpler aspects of the demonstrations, such as the format, rather than the input-label mapping (a key conclusion)
: Using demonstrations is better than using none; when the labels are random it hardly matters, only slightly worse than gold labels. The three settings compared (made concrete in the sketch below):
No Demos: the LM predicts zero-shot directly, with no demonstrations in the prompt.
Demos w/ gold: the prompt contains K labeled examples, and the model then predicts.
Demos w/ random labels: the prompt contains K sampled examples, but each label is drawn at random from the label set instead of the ground truth.
Replacing gold labels with random labels only slightly hurts performance (a 0-5% drop). This trend is consistent across almost all model sizes (700M up to 175B).
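To make the three settings concrete, here is a minimal Python sketch of the prompt construction; the task, templates, and names (LABEL_SET, build_prompt, etc.) are illustrative assumptions, not the paper's actual setup.

```python
import random

# Toy sentiment-classification data; everything here is made up for illustration.
LABEL_SET = ["positive", "negative"]
demos = [
    ("the acting was superb", "positive"),
    ("a tedious, overlong mess", "negative"),
    ("I would watch it again", "positive"),
    ("the plot makes no sense", "negative"),
]
test_input = "an unexpectedly moving film"

def build_prompt(demos, test_input, label_fn):
    """Concatenate input-label pairs, then the test input with its label blank."""
    blocks = [f"Review: {x}\nSentiment: {label_fn(y)}" for x, y in demos]
    blocks.append(f"Review: {test_input}\nSentiment:")
    return "\n\n".join(blocks)

# No Demos: zero-shot, the prompt contains only the test input.
prompt_no_demos = f"Review: {test_input}\nSentiment:"

# Demos w/ gold: keep the ground-truth labels.
prompt_gold = build_prompt(demos, test_input, label_fn=lambda y: y)

# Demos w/ random labels: each demo label is resampled uniformly at random
# from the label set, independent of the true label.
prompt_random = build_prompt(demos, test_input,
                             label_fn=lambda y: random.choice(LABEL_SET))
```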
In each group of four bars (detailed legend on the right of the figure), the middle two (Random labels and OOD + Random labels) differ only in input distribution. Except for the Direct MetaICL model, the gap between these two settings is significant under every other model. So the experiments tentatively support the conclusion that the input distribution has a significant effect.
This suggests that in-distribution inputs in the demonstrations substantially contribute to performance gains. This is likely because conditioning on the in-distribution text makes the task closer to language modeling, since the LM always conditioned on the in-distribution text during training.
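A small follow-on sketch of the OOD-input setting, reusing random, LABEL_SET, build_prompt, and test_input from the snippet above; the OOD corpus here is made up for illustration.

```python
# OOD + Random labels: keep the pair format and the (random) labels, but swap
# every demo input for a sentence from an unrelated external corpus, so the
# demo inputs no longer match the test task's input distribution.
ood_corpus = [
    "the treaty was ratified in 1994",
    "quarterly revenue rose by three percent",
    "hydrogen is the lightest element",
    "the recipe calls for two cups of flour",
]
ood_demos = [(s, random.choice(LABEL_SET)) for s in ood_corpus]
prompt_ood_random = build_prompt(ood_demos, test_input, label_fn=lambda y: y)
```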
5.2 label space
: When labels go from random in-space labels to arbitrary English words, the two model types behave differently:
direct models: performance drops sharply, by 5~16%;
channel models: performance drops little, <2%, because they condition on the labels rather than generate them, and so do not benefit from knowing the label space.
the channel models only condition on the labels, and thus are not benefiting from knowing the label space. This is in contrast to direct models which must generate the correct labels.
: direct models vs channel models:
the former gain more from exploiting the label space, the latter from exploiting the input distribution
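A minimal sketch of the direct-vs-channel distinction, scoring candidate labels with an off-the-shelf causal LM via Hugging Face transformers; the prompt templates and the gpt2 checkpoint are assumptions for illustration, not the paper's exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def continuation_logprob(prefix: str, continuation: str) -> float:
    """Sum of token log-probs of `continuation` given `prefix` under the LM."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    full_ids = tok(prefix + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = model(full_ids).logits.log_softmax(-1)
    n = prefix_ids.shape[1]
    cont = full_ids[0, n:]  # continuation token ids
    # logits at position i predict token i+1, so shift the slice by one
    return logprobs[0, n - 1:-1].gather(-1, cont[:, None]).sum().item()

def direct_predict(demos, x, label_set):
    # Direct: condition on the input, score each candidate label: argmax_y P(y | x).
    # The model must produce a label, so knowing the label space helps.
    prefix = "\n\n".join(f"Review: {dx}\nSentiment: {dy}" for dx, dy in demos)
    prefix += f"\n\nReview: {x}\nSentiment:"
    return max(label_set, key=lambda y: continuation_logprob(prefix, " " + y))

def channel_predict(demos, x, label_set):
    # Channel: condition on the label, score the input: argmax_y P(x | y).
    # The label is only ever conditioned on, never generated, so the model
    # gains little from knowing the label space but must model the input text.
    def score(y):
        prefix = "\n\n".join(f"Sentiment: {dy}\nReview: {dx}" for dx, dy in demos)
        prefix += f"\n\nSentiment: {y}\nReview:"
        return continuation_logprob(prefix, " " + x)
    return max(label_set, key=score)
```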
5.3 input-label pairing
Some additional format ablations were added (sketched in code below):
demos with no labels: input text only, each example shown without its label
demos with labels only: only the labels, without any input text
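A short sketch of the two broken-format variants, again reusing demos and test_input from the earlier snippet.

```python
# Demos with no labels: input text only; the input-label pair format is broken.
no_labels = "\n\n".join(f"Review: {x}" for x, _ in demos)
prompt_no_labels = f"{no_labels}\n\nReview: {test_input}\nSentiment:"

# Demos with labels only: labels with no input text; also breaks the format.
labels_only = "\n\n".join(f"Sentiment: {y}" for _, y in demos)
prompt_labels_only = f"{labels_only}\n\nReview: {test_input}\nSentiment:"
```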
Left part: Sensitivity to example ordering
: In the figure, each train set ID is one fixed group of examples; within an ID, presenting the same examples in different orders yields different prompts. Performance clearly varies across orderings.
Right part: Zero-shot is sometimes better than few-shot
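A hypothetical sketch of the Left panel's ordering check: evaluate the same K demos under every permutation (reusing direct_predict, demos, and LABEL_SET from the snippets above) and inspect the spread of accuracies.

```python
import itertools

def accuracy(ordered_demos, eval_set, label_set):
    hits = sum(direct_predict(ordered_demos, x, label_set) == y
               for x, y in eval_set)
    return hits / len(eval_set)

# Tiny made-up eval set; the real experiments use a full test split.
eval_set = [("a forgettable sequel", "negative"),
            ("sharp, witty dialogue", "positive")]

scores = [accuracy(list(perm), eval_set, LABEL_SET)
          for perm in itertools.permutations(demos)]  # K! = 24 orderings of 4 demos
print(f"accuracy across orderings: min={min(scores):.2f}, max={max(scores):.2f}")
```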
【2020.06 OpenAI: ICL introduced with GPT-3】
2020.6 NeurIPS: OpenAI GPT-3: Language Models are Few-Shot Learners