Published as a conference paper at ICLR 2022
Jason Wei[^*], Maarten Bosma[^*], Vincent Y. Zhao[^*], Kelvin Guu[^*], Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le
Google Research
This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning—finetuning language models on a collection of datasets described via instructions—substantially improves zero-shot performance on unseen tasks.
We take a 137B parameter pretrained language model and instruction tune it on over 60 NLP datasets verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of 25 datasets that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that the number of finetuning datasets, model scale, and natural language instructions are key to the success of instruction tuning.
Figure 1: Top: overview of instruction tuning and FLAN. Instruction tuning finetunes a pretrained language model on a mixture of tasks phrased as instructions. At inference time, we evaluate on an unseen task type; for instance, we could evaluate the model on natural language inference (NLI) when no NLI tasks were seen during instruction tuning. Bottom: performance of zero-shot FLAN, compared with zero-shot and few-shot GPT-3, on three unseen task types (of the ten we evaluate) where instruction tuning improved performance substantially. NLI datasets: ANLI R1–R3, CB, RTE. Reading comprehension datasets: BoolQ, MultiRC, OBQA. Closed-book QA datasets: ARC-easy, ARC-challenge, NQ, TriviaQA.
Visual description of Figure 1, top panel:
* Finetune on many tasks (“instruction tuning”):
  * Input (commonsense reasoning): “Here is a goal: Get a cool sleep on summer days. How would you accomplish this goal? OPTIONS: -Keep stack of pillow cases in fridge. -Keep stack of pillow cases in oven.” -> Target: “keep stack of pillow cases in fridge”
  * Input (translation): “Translate this sentence to Spanish: The new office building was built in less than three months.” -> Target: “El nuevo edificio de oficinas se construyó en tres meses.”
  * Other tasks: sentiment analysis tasks, coreference resolution tasks, …
* Inference on unseen task type:
  * Input (natural language inference): “Premise: At my age you will probably have learnt one lesson. Hypothesis: It’s not certain how many lessons you’ll learn by your thirties. Does the premise entail the hypothesis? OPTIONS: -yes -it is not possible to tell -no” -> FLAN response: “It is not possible to tell”
Visual description of Figure 1, bottom panel (bar charts):
* Natural language inference: GPT-3 175B zero-shot 42.9, GPT-3 175B few-shot 53.2, FLAN 137B zero-shot 56.2
* Reading comprehension: GPT-3 175B zero-shot 63.7, GPT-3 175B few-shot 72.6, FLAN 137B zero-shot 77.4
* Closed-book QA: GPT-3 175B zero-shot 49.8, GPT-3 175B few-shot 55.7, FLAN 137B zero-shot 56.6
[^*]Lead contributors. Author contributions listed at end of paper.
Language models (LMs) at scale, such as GPT-3 (Brown et al., 2020), have been shown to perform few-shot learning remarkably well. They are less successful at zero-shot learning, however. For example, GPT-3’s zero-shot performance is much worse than few-shot performance on tasks such as reading comprehension, question answering, and natural language inference. One potential reason is that, without few-shot exemplars, it is harder for models to perform well on prompts that are not similar to the format of the pretraining data.
In this paper, we explore a simple method to improve the zero-shot performance of large language models, which would expand their reach to a broader audience. We leverage the intuition that NLP tasks can be described via natural language instructions, such as “Is the sentiment of this movie review positive or negative?” or “Translate ‘how are you’ into Chinese.” We take a pretrained language model of 137B parameters and perform instruction tuning—finetuning the model on a mixture of more than 60 NLP datasets expressed via natural language instructions. We refer to this resulting model as FLAN, for Finetuned LAnguage Net.
To evaluate the zero-shot performance of FLAN on unseen tasks, we group NLP datasets into clusters based on their task types and hold out each cluster for evaluation while instruction tuning FLAN on all other clusters. For example, as shown in Figure 1, to evaluate FLAN’s ability to perform natural language inference, we instruction tune the model on a range of other NLP tasks such as commonsense reasoning, translation, and sentiment analysis. As this setup ensures that FLAN has not seen any natural language inference tasks in instruction tuning, we then evaluate its ability to perform zero-shot natural language inference.
Our evaluations show that FLAN substantially improves the zero-shot performance of the base 137B-parameter model. FLAN’s zero-shot also outperforms 175B-parameter GPT-3’s zero-shot on 20 of 25 datasets that we evaluate, and even outperforms GPT-3’s few-shot by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. In ablation studies, we find that increasing the number of task clusters in instruction tuning improves performance on unseen tasks and that the benefits of instruction tuning emerge only with sufficient model scale.
Instruction tuning is a simple method that, as depicted in Figure 2, combines appealing aspects of both the pretrain–finetune and prompting paradigms by using supervision via finetuning to improve language models' responses to inference-time text interactions. Our empirical results demonstrate promising abilities of language models to perform tasks described purely via instructions. Source code for loading the instruction tuning dataset used for FLAN is publicly available at https://github.com/google-research/flan.
Figure 2: Comparing instruction tuning with pretrain–finetune and prompting.
* (A) Pretrain–finetune (BERT, T5): Pretrained LM \(\rightarrow\) Finetune on task A \(\rightarrow\) Inference on task A. Typically requires many task-specific examples; one specialized model for each task.
* (B) Prompting (GPT-3): Pretrained LM \(\rightarrow\) Improve performance via few-shot prompting or prompt engineering \(\rightarrow\) Inference on task A.
* (C) Instruction tuning (FLAN): Pretrained LM \(\rightarrow\) Instruction-tune on many tasks: B, C, D, … (model learns to perform many tasks via natural language instructions) \(\rightarrow\) Inference on unseen task A.
The motivation of instruction tuning is to improve the ability of language models to respond to NLP instructions. The idea is that by using supervision to teach an LM to perform tasks described via instructions, the LM will learn to follow instructions and do so even for unseen tasks. To evaluate performance on unseen tasks, we group datasets into clusters by task type and hold out each task cluster for evaluation while instruction tuning on all remaining clusters.
As creating an instruction tuning dataset with many tasks from scratch would be resource-intensive, we transform existing datasets from the research community into an instructional format. We aggregate 62 text datasets that are publicly available on TensorFlow Datasets, including both language understanding and language generation tasks, into a single mixture. Figure 3 shows these datasets—each dataset is categorized into one of twelve task clusters, where datasets in a given cluster are of the same task type. Descriptions, sizes, and examples of each dataset are shown in Appendix G.
Figure 3: Datasets and task clusters used in this paper (NLU tasks in blue; NLG tasks in teal).
For each dataset, we manually compose ten unique templates that use natural language instructions to describe the task for that dataset. While most of the ten templates describe the original task, to increase diversity, for each dataset we also include up to three templates that “turn the task around” (e.g., for sentiment classification we include templates asking to generate a movie review). We then instruction tune a pretrained language model on the mixture of all datasets, with examples in each dataset formatted via a randomly selected instruction template for that dataset. Figure 4 shows multiple instruction templates for a natural language inference dataset.
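To make the template mechanism concrete, the following minimal sketch formats one example with a randomly selected instruction template. The template strings and field names here are illustrative stand-ins rather than the actual FLAN templates (the full templates are listed in Appendix G).

```python
import random

# Illustrative (not actual) instruction templates for an NLI dataset;
# FLAN uses ten manually composed templates per dataset (Appendix G).
NLI_TEMPLATES = [
    "{premise}\n\nBased on the paragraph above, can we conclude that "
    "\"{hypothesis}\"?\n\n{options}",
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis?\n\n{options}",
    "Read the premise and decide whether the hypothesis follows.\n"
    "Premise: {premise}\nHypothesis: {hypothesis}\n{options}",
]

OPTIONS = "OPTIONS:\n- yes\n- no"

def format_example(example, templates):
    """Formats one raw example with a randomly selected instruction template."""
    template = random.choice(templates)
    return template.format(premise=example["premise"],
                           hypothesis=example["hypothesis"],
                           options=OPTIONS)

example = {
    "premise": "Russian cosmonaut Valery Polyakov set the record for the "
               "longest continuous amount of time spent in space.",
    "hypothesis": "Russians hold the record for the longest stay in space.",
}
print(format_example(example, NLI_TEMPLATES))
```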
Figure 4: Multiple instruction templates describing a natural language inference task.
Premise: Russian cosmonaut Valery Polyakov set the record for the longest continuous amount of time spent in space, a staggering 438 days, between 1994 and 1995.
Hypothesis: Russians hold the record for the longest stay in space.
Target: Entailment (Options: - yes, - no)
Templates 1–3: each template phrases the same task differently using the <premise>, <hypothesis>, and <options> fields (e.g., asking whether the hypothesis can be inferred from the premise).

We are interested in how FLAN performs on tasks not seen in instruction tuning, and so it is crucial to define what counts as an unseen task. Whereas some prior work defines unseen tasks by disallowing the same dataset to appear in training, we use a more conservative definition that leverages the task clusters from Figure 3. In this work, we only consider dataset \(\mathcal{D}\) unseen at evaluation time if no datasets from any task clusters that \(\mathcal{D}\) belongs to were seen during instruction tuning. For instance, if \(\mathcal{D}\) is an entailment task, then no entailment datasets appeared in instruction tuning, and we instruction-tuned on all other clusters.[^1] Hence, to evaluate zero-shot FLAN on \(c\) task clusters, we instruction tune \(c\) models, where each model holds out a different task cluster for evaluation.
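A rough sketch of this holdout scheme is shown below; the cluster and dataset names are a hypothetical subset standing in for the full twelve clusters and 62 datasets of Figure 3.

```python
# Hypothetical subset of the task clusters in Figure 3; the actual mixture
# has twelve clusters covering 62 datasets.
TASK_CLUSTERS = {
    "natural language inference": ["ANLI R1", "RTE", "CB"],
    "sentiment": ["SST-2", "IMDB", "Yelp"],
    "translation": ["WMT'14 En-Fr", "WMT'16 En-De"],
    "closed-book QA": ["Natural Questions", "TriviaQA", "ARC"],
}

def instruction_tuning_mixture(held_out_cluster):
    """Datasets used for instruction tuning when `held_out_cluster` is
    reserved for zero-shot evaluation: every dataset from every other
    cluster, so the held-out task type is entirely unseen."""
    return [d for cluster, datasets in TASK_CLUSTERS.items()
            if cluster != held_out_cluster for d in datasets]

# To evaluate on c task clusters, c separate models are instruction tuned,
# each holding out a different cluster.
for cluster in TASK_CLUSTERS:
    print(f"evaluate on {cluster}: tune on {instruction_tuning_mixture(cluster)}")
```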
The output space for a given task is either one of several classes (classification) or free text (generation). As FLAN is an instruction-tuned version of a decoder-only language model, it naturally responds in free text, and so no further modifications are needed for generation tasks.
For classification tasks, prior work (Brown et al., 2020) used a rank classification approach where, for example, only two outputs (“yes” and “no”) are considered and the higher-probability one is taken as the model’s prediction. Though this procedure is logically sound, it is imperfect in that the probability mass for answers may have an undesired distribution among ways of saying each answer (e.g., a large number of alternative ways of saying “yes” may lower the probability mass assigned to “yes”). Therefore, we include an options suffix, in which we append the token OPTIONS to the end of a classification task along with a list of the output classes for that task. This makes the model aware of which choices are desired when responding to classification tasks. Example use of options is shown in the NLI and commonsense examples in Figure 1.
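A minimal sketch of the options suffix and of rank classification over those options follows; `log_likelihood` is a hypothetical stand-in for scoring a candidate answer as a continuation under the language model, not part of any released API.

```python
def add_options_suffix(prompt, classes):
    """Appends the OPTIONS token and the list of output classes, so the model
    knows which choices are desired for a classification task."""
    return prompt + "\nOPTIONS:\n" + "\n".join(f"- {c}" for c in classes)

def rank_classify(prompt, classes, log_likelihood):
    """Rank classification: returns the class whose answer string receives
    the highest probability as a continuation of the prompt."""
    return max(classes, key=lambda c: log_likelihood(prompt, c))

prompt = add_options_suffix(
    "Premise: ...\nHypothesis: ...\nDoes the premise entail the hypothesis?",
    ["yes", "no"])
# prediction = rank_classify(prompt, ["yes", "no"], log_likelihood)
# where log_likelihood(prompt, answer) would query the language model.
```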
Model architecture and pretraining. In our experiments, we use LaMDA-PT, a dense left-to-right, decoder-only transformer language model of 137B parameters (Thoppilan et al., 2022). This model is pretrained on a collection of web documents (including those with computer code), dialog data, and Wikipedia, tokenized into 2.49T BPE tokens with a 32k vocabulary using the SentencePiece library (Kudo & Richardson, 2018). Around 10% of the pretraining data was non-English. Note that LaMDA-PT only has language model pretraining (cf. LaMDA, which was finetuned for dialog).
Instruction tuning procedure. FLAN is the instruction-tuned version of LaMDA-PT. Our instruction tuning pipeline mixes all datasets and randomly samples from each dataset. To balance the different sizes of datasets, we limit the number of training examples per dataset to 30k and follow the examples-proportional mixing scheme (Raffel et al., 2020) with a mixing rate maximum of 3k.[^2] We finetune all models for 30k gradient steps with a batch size of 8,192 tokens using the Adafactor Optimizer (Shazeer & Stern, 2018) with a learning rate of 3e-5. The input and target sequence lengths used in finetuning are 1024 and 256, respectively. We use packing (Raffel et al., 2020) to combine multiple training examples into a single sequence, separating inputs from targets using a special EOS token. This instruction tuning takes around 60 hours on a TPUv3 with 128 cores. For all evaluations, we report results on the final checkpoint trained for 30k steps.
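As a sketch of the examples-proportional mixing described above (with a mixing rate maximum of 3,000; see footnote 2), the per-dataset sampling weights could be computed roughly as follows; the dataset names and sizes are hypothetical.

```python
# Hypothetical dataset sizes, after the 30k-examples-per-dataset cap.
dataset_sizes = {"dataset_a": 30_000, "dataset_b": 16_000, "dataset_c": 250}

MIXING_RATE_MAX = 3_000  # no extra sampling weight beyond 3,000 examples

# Examples-proportional mixing: weight each dataset by its (capped) size.
capped = {name: min(size, MIXING_RATE_MAX)
          for name, size in dataset_sizes.items()}
total = sum(capped.values())
sampling_probabilities = {name: c / total for name, c in capped.items()}

print(sampling_probabilities)
# dataset_a and dataset_b both hit the cap, so they receive equal weight;
# dataset_c is weighted by its true size.
```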
We evaluate FLAN on natural language inference, reading comprehension, closed-book QA, translation, commonsense reasoning, coreference resolution, and struct-to-text. As described in §2.2, we evaluate on unseen tasks by grouping datasets into task clusters and holding out each cluster for evaluation while instruction tuning on all remaining clusters (i.e., each evaluation task cluster uses a different checkpoint). For each dataset, we evaluate the mean of performance on all templates, which proxies the expected performance given a typical natural language instruction. As a dev set is sometimes available for manual prompt engineering (Brown et al., 2020), for each dataset we also obtain the test set performance using the template with the best dev set performance.
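The two numbers reported per dataset can be sketched as follows, with hypothetical per-template scores:

```python
# Hypothetical dev/test scores for each instruction template of one dataset.
dev_scores  = {"template_1": 55.0, "template_2": 58.5, "template_3": 54.2}
test_scores = {"template_1": 54.1, "template_2": 57.9, "template_3": 55.0}

# (1) Mean test performance over all templates: a proxy for the expected
#     performance given a typical natural language instruction.
mean_over_templates = sum(test_scores.values()) / len(test_scores)

# (2) Test performance of the template with the best dev-set performance.
best_template = max(dev_scores, key=dev_scores.get)
best_dev_template_test = test_scores[best_template]

print(f"mean over templates: {mean_over_templates:.1f}")
print(f"best dev template ({best_template}): {best_dev_template_test:.1f}")
```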
For comparison, we report zero and few-shot results for LaMDA-PT using the same prompts as GPT-3 (as LaMDA-PT is not suitable for natural instructions without instruction tuning). This baseline provides the most direct ablation of how much instruction tuning helps. Instruction tuning significantly improves LaMDA-PT on most datasets.
We also show the zero-shot performance of GPT-3 175B (Brown et al., 2020) and GLaM 64B/64E (Du et al., 2021), as reported in their respective papers. With the best dev template, zero-shot FLAN outperforms zero-shot GPT-3 on 20 of 25 datasets and even surpasses GPT-3’s few-shot performance on 10 datasets. With the best dev template, zero-shot FLAN also outperforms zero-shot GLaM on 13 of 19 available datasets and one-shot GLaM on 11 of 19 datasets.
Overall, we observe that instruction tuning is very effective on tasks naturally verbalized as instructions (e.g., NLI, QA, translation, struct-to-text) and is less effective on tasks directly formulated as language modeling, where instructions would be largely redundant (e.g., commonsense reasoning and coreference resolution tasks that are formatted as finishing an incomplete sentence or paragraph). Results on natural language inference, reading comprehension, closed-book QA, and translation are summarized in Figure 5 and described below.
Figure 5: Zero-shot performance of FLAN compared to LaMDA-PT 137B, GPT-3 175B, and GLaM 64B/64E on natural language inference, reading comprehension, closed-book QA, and translation. Performance of FLAN is the mean of up to 10 instructional templates per task. Supervised models were either T5, BERT, or translation models (specified in Table 2 and Table 1 in the Appendix).
Natural language inference (NLI). On five NLI datasets, where a model must determine whether a hypothesis is true given some premise, FLAN outperforms all baselines by a large margin. As noted by Brown et al. (2020), perhaps one reason why GPT-3 struggles with NLI is that NLI examples are unlikely to have appeared naturally in an unsupervised training set and are thus awkwardly phrased as a continuation of a sentence. For FLAN, we phrase NLI as the more natural question “Does <premise> mean that <hypothesis>?”, achieving much higher performance.
Reading comprehension. On reading comprehension, where models are asked to answer a question about a provided passage, FLAN outperforms baselines for MultiRC (Khashabi et al., 2018) and OBQA (Mihaylov et al., 2018). On BoolQ (Clark et al., 2019a), FLAN outperforms GPT-3 by a large margin, though LaMDA-PT already achieves high performance on BoolQ.
Closed-book QA. For closed-book QA, which asks models to answer questions about the world without access to specific information containing the answer, FLAN outperforms GPT-3 on all four datasets. Compared to GLaM, FLAN has better performance on ARC-e and ARC-c (Clark et al., 2018), and slightly lower performance on NQ (Lee et al., 2019; Kwiatkowski et al., 2019) and TQA (Joshi et al., 2017).
Translation. Similar to GPT-3, the training data for LaMDA-PT is around 90% English and includes some text in other languages that was not specifically used to train the model to perform machine translation. We also evaluate FLAN’s performance on machine translation for the three datasets evaluated in the GPT-3 paper: French–English from WMT’14 (Bojar et al., 2014), and German–English and Romanian–English from WMT’16 (Bojar et al., 2016). Compared with GPT-3, FLAN outperforms zero-shot GPT-3 for all six evaluations, though it underperforms few-shot GPT-3 in most cases. Similar to GPT-3, FLAN shows strong results for translating into English and compares favorably against supervised translation baselines. Translating from English into other languages, however, was relatively weaker, as might be expected given that FLAN uses an English sentencepiece tokenizer and that the majority of pretraining data is English.

[^1] When evaluating on the read. comp. with commonsense cluster, both read. comp. and commonsense reasoning were dropped from instruction tuning. Conversely, the read. comp. with commonsense cluster was not used for instruction tuning when evaluating on read. comp. or commonsense reasoning. We also drop the paraphrase cluster from instruction tuning when evaluating on NLI tasks and vice-versa.

[^2] In this mixing scheme, a mixing rate maximum of 3,000 means that a dataset does not receive additional sampling weight for examples in excess of 3,000.
Additional tasks. Although we see strong results for the above task clusters, one limitation with instruction tuning is that it does not improve performance for many language modeling tasks (e.g., commonsense reasoning or coreference resolution tasks formulated as sentence completions). For seven commonsense reasoning and coreference resolution tasks (see Table 2 in the Appendix), FLAN only outperforms LaMDA-PT on three of the seven tasks. This negative result indicates that when the downstream task is the same as the original language modeling pre-training objective (i.e., in cases where instructions are largely redundant), instruction tuning is not useful. Finally, we report results for sentiment analysis, paraphrase detection, and struct-to-text, as well as additional datasets for which GPT-3 results are not available, in Table 2 and Table 1 in the Appendix. Generally, zero-shot FLAN outperforms zero-shot LaMDA-PT and is comparable with or better than few-shot LaMDA-PT.
As the core question of our paper asks how instruction tuning improves a model’s zero-shot performance on unseen tasks, in this first ablation we examine how performance is affected by the number of clusters and tasks used in instruction tuning. For this setup, we hold out NLI, closed-book QA, and commonsense reasoning as evaluation clusters, and use the seven remaining clusters for instruction tuning.[^3] We show results for one to seven instruction tuning clusters, where clusters are added in decreasing order of number of tasks per cluster.
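The construction of the seven nested instruction tuning mixtures can be sketched as follows; the cluster names and task counts shown are hypothetical placeholders for the seven tuning clusters.

```python
# Hypothetical (cluster, number of tasks) pairs standing in for the seven
# clusters used for instruction tuning in this ablation.
clusters = [("summarization", 11), ("translation", 9), ("reading comp.", 5),
            ("sentiment", 4), ("struct to text", 4), ("coreference", 3),
            ("misc.", 7)]

# Clusters are added in decreasing order of the number of tasks per cluster,
# giving seven nested instruction tuning mixtures (1 cluster, 2 clusters, ...).
ordered = sorted(clusters, key=lambda item: item[1], reverse=True)
for k in range(1, len(ordered) + 1):
    mixture = [name for name, _ in ordered[:k]]
    print(f"{k} cluster(s): {mixture}")
```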
Figure 6 shows these results. As expected, we observe that average performance across the three held-out clusters improves as we add additional clusters and tasks to instruction tuning (with the exception of the sentiment analysis cluster), confirming the benefits of our proposed instruction tuning approach on zero-shot performance on novel tasks. It is further interesting to see that, for the seven clusters we test, the performance does not appear to saturate, implying that performance may further improve with even more clusters added to instruction tuning. Of note, this ablation does not allow us to draw conclusions about which instruction tuning cluster contributes the most to each evaluation cluster, although we see minimal added value from the sentiment analysis cluster.
Figure 6: Adding additional task clusters to instruction tuning improves zero-shot performance on held-out task clusters. The evaluation tasks are the following. Commonsense: CoPA, HellaSwag, PiQA, and StoryCloze. NLI: ANLI R1–R3, QNLI, RTE, SNLI, and WNLI. Closed-book QA: ARC easy, ARC challenge, Natural Questions, and TriviaQA.
As Brown et al. (2020) show that the zero- and few-shot capabilities of language models substantially improve for larger models, we next explore how the benefits of instruction tuning are affected by model scale. Using the same cluster split as in the previous ablation study, we evaluate the effect of instruction tuning on models of size 422M, 2B, 8B, 68B, and 137B parameters.
Figure 7 shows these results. We see that for the two models on the order of 100B parameters, instruction tuning substantially improves performance on held-out tasks, as is expected given the prior results in our paper. The behavior on held-out tasks for the 8B and smaller models, however, is thought-provoking—instruction tuning actually hurts performance on held-out tasks. One potential explanation for this result could be that for small-scale models, learning the ~40 tasks used during instruction tuning fills the entire model capacity, causing these models to perform worse on new tasks. Under this potential explanation, for the larger scale models, instruction tuning fills up some model capacity but also teaches these models how to follow instructions, allowing them to generalize to new tasks with the remaining capacity.
Figure 7: Whereas instruction tuning helps large models generalize to new tasks, for small models it actually hurts generalization to unseen tasks, potentially because all model capacity is used to learn the mixture of instruction tuning tasks.
In a final ablation study, we explore the role of instructions during finetuning, as one possibility is that performance gains come entirely from multi-task finetuning and the model could perform just as well without instructions. We hence consider two finetuning setups without instructions. In a no template setup, only inputs and outputs were given to the model (e.g., for translation the input would be “The dog runs.” and the output would be “Le chien court.”). In a dataset name setup, each input is prepended with the name of the task and dataset (e.g., for translation to French, the input would be “[Translation: WMT’14 to French] The dog runs.”).
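For concreteness, the two ablation input formats for the translation example above would look roughly like the following sketch:

```python
example = {"source": "The dog runs.",
           "target": "Le chien court.",
           "task_and_dataset": "Translation: WMT'14 to French"}

# "No template": the raw input is given with no task information at all.
no_template_input = example["source"]

# "Dataset name": the task and dataset name is prepended to the input.
dataset_name_input = f"[{example['task_and_dataset']}] {example['source']}"

print(no_template_input)   # The dog runs.
print(dataset_name_input)  # [Translation: WMT'14 to French] The dog runs.
```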
We compare these two ablations to FLAN’s finetuning procedure, which used natural instructions (e.g., “Please translate this sentence to French: ‘The dog runs.’”). We perform evaluations for four held-out clusters from Figure 5. For the no template setup, we used the FLAN instructions during zero-shot inference (because if we used no template, the model would not know what task to perform). For models finetuned on dataset name only, we report zero-shot performance for FLAN instructions as well as using the dataset name. Figure 8 shows the results—both ablation configurations performed substantially worse than FLAN, indicating that training with instructions is crucial for zero-shot performance on unseen tasks.
Figure 8: Ablation study result using models with instructions removed from finetuning (FT).
So far, we have focused on instruction tuning in the zero-shot setting. Here, we study how instruction tuning can be used when few-shot exemplars are available at inference time. The format for the few-shot setting builds on the zero-shot format. For some input \(x\) and output \(y\), let \(\text{instruct}(x)\) denote the zero-shot instructions. Then, given \(k\) few-shot exemplars \((x_i, y_i)_{i=1}^k\) and a new input \(x\), the instruction format for the few-shot setting is “\(\text{instruct}(x_1) \oplus y_1 \oplus \text{instruct}(x_2) \oplus y_2 \oplus \dots \oplus \text{instruct}(x_k) \oplus y_k \oplus \text{instruct}(x)\)”, where \(\oplus\) denotes string concatenation with a delimiter token inserted in between. At both training and inference time, exemplars are randomly drawn from the training set; the number of exemplars is capped at 16, subject to the total sequence length being less than 960 tokens. Our experiment uses the same task splits and evaluation procedure as §3, such that few-shot exemplars for an unseen task are only used at inference time.
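A sketch of this few-shot prompt construction is given below (the 960-token length cap is omitted for brevity, and the delimiter string is a hypothetical stand-in for the delimiter token):

```python
DELIMITER = "\n\n"  # hypothetical stand-in for the delimiter token

def few_shot_prompt(instruct, exemplars, new_input, max_exemplars=16):
    """Builds instruct(x1) + y1 + ... + instruct(xk) + yk + instruct(x),
    with at most 16 exemplars."""
    parts = []
    for x_i, y_i in exemplars[:max_exemplars]:
        parts += [instruct(x_i), y_i]
    parts.append(instruct(new_input))
    return DELIMITER.join(parts)

# `instruct` applies a zero-shot instruction template to an input.
instruct = lambda x: f"Translate this sentence to Spanish: {x}"
exemplars = [("The house is red.", "La casa es roja.")]
print(few_shot_prompt(instruct, exemplars, "The dog runs."))
```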
As shown in Figure 9, few-shot exemplars improve the performance on all task clusters, compared with zero-shot FLAN. Exemplars are especially effective for tasks with large/complex output spaces, such as struct to text, translation, and closed-book QA, potentially because exemplars help the model better understand the output format. In addition, for all task clusters, standard deviation among templates is lower for few-shot FLAN, indicating reduced sensitivity to prompt engineering.
Figure 9: Adding few-shot exemplars to FLAN is a complementary method for improving the performance of instruction-tuned models. The orange bars indicate standard deviation among templates, averaged at the dataset level for each task cluster.
As we have seen that instruction tuning improves the ability of a model to respond to instructions, it follows that, if FLAN is indeed more amenable to performing NLP tasks, then it should also achieve better performance when performing inference using soft prompts, represented by prepended continuous variables optimized via prompt tuning (Li & Liang, 2021; Lester et al., 2021). As further analysis, we train continuous prompts for each of the SuperGLUE (Wang et al., 2019a) tasks in accordance with the cluster splits from §2.2, such that when prompt-tuning on task \(\mathcal{T}\), no tasks in the same cluster as \(\mathcal{T}\) were seen during instruction tuning. Our prompt tuning setup follows the procedure of Lester et al. (2021), except that we use a prompt length of 10 and a weight decay of 1e-4, and do not use dropout on the attention scores; we found in preliminary experiments that these changes improved the performance of LaMDA-PT.
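As a minimal sketch of the prompt tuning mechanism assumed here (Lester et al., 2021): a short sequence of continuous vectors is prepended to the frozen model's input embeddings, and only those vectors are optimized. The embedding dimension below is an arbitrary illustrative value.

```python
import numpy as np

PROMPT_LENGTH = 10  # number of soft-prompt vectors, as in our setup
EMBED_DIM = 16      # hypothetical embedding size, for illustration only

# The only trainable parameters in prompt tuning: the soft prompt itself.
# The instruction-tuned (or pretrained) language model stays frozen.
soft_prompt = np.random.normal(size=(PROMPT_LENGTH, EMBED_DIM))

def prepend_soft_prompt(token_embeddings):
    """Prepends the tuned continuous prompt to the input token embeddings."""
    return np.concatenate([soft_prompt, token_embeddings], axis=0)

token_embeddings = np.random.normal(size=(20, EMBED_DIM))  # a 20-token input
model_input = prepend_soft_prompt(token_embeddings)
print(model_input.shape)  # (30, 16): 10 prompt vectors + 20 token embeddings
```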
Figure 10 shows the results of these prompt tuning experiments, both using a fully supervised training set and in a low-resource setting with only 32 training examples. We see that in all scenarios, prompt tuning works better with FLAN than with LaMDA-PT. In many cases, especially in the low-resource setting, prompt tuning on FLAN even achieves more than a 10% improvement over prompt tuning on LaMDA-PT. This result exemplifies in another way how instruction tuning can result in a checkpoint that is more desirable for performing NLP tasks.
Figure 10: Instruction-tuned models respond better to continuous inputs from prompt tuning. When prompt tuning on a given dataset, no tasks from the same cluster as that dataset were seen during instruction tuning. Performance shown is the average on the SuperGLUE dev set.
Our work relates to several broad research areas including zero-shot learning, prompting, multi-task learning, and language models for NLP applications (Radford et al., 2019; Raffel et al., 2020; Brown et al., 2020; Efrat & Levy, 2020; Aghajanyan et al., 2021; Li & Liang, 2021, inter alia). We describe prior work for these broad areas in an extended related work section (Appendix D), and here we describe two subareas narrower in scope that perhaps relate most closely to our work.
[^3] We do not use the paraphrase or reading comprehension with commonsense clusters for instruction tuning in this ablation because they are too similar to NLI and commonsense reasoning, respectively.
The way we ask a model to respond to instructions is similar to QA-based task formulation (Kumar et al., 2016; McCann et al., 2018), which aims to unify NLP tasks by casting them as QA over a context. Though these methods are very similar to ours, they mostly focus on multi-task learning instead of zero-shot learning, and—as noted by Liu et al. (2021)—they are generally not motivated by using existing knowledge in pretrained LMs. Moreover, our work supersedes recent work such as Chai et al. (2020) and Zhong et al. (2021) in terms of both model scale and scope of tasks.
The success of language models has led to nascent research on the ability of models to follow instructions. Most recently, Mishra et al. (2021) finetune a 140M-parameter BART on instructions with few-shot exemplars and evaluate its few-shot abilities on unseen tasks—this is similar to our few-shot instruction tuning result from §4.4. This promising result (as well as one from Ye et al. (2021), which does not emphasize instructions as much) suggests that finetuning on a collection of tasks improves few-shot performance on unseen tasks, even at a smaller model scale. Sanh et al. (2021) finetune T5 in a setup similar to ours, finding that zero-shot learning can be improved in a model of 11B parameters. At a model scale similar to ours, OpenAI’s InstructGPT models are trained via both finetuning and reinforcement learning to produce outputs that are more preferred by human raters (Ouyang et al., 2022).
Our paper has explored a simple question in zero-shot prompting: does finetuning a model on a collection of tasks phrased as instructions improve its performance on unseen tasks? We operationalize this question via instruction tuning, a simple method that combines appealing aspects of both the pretrain–finetune and prompting paradigms. Our instruction-tuned model, FLAN, improves performance against an untuned model and surpasses zero-shot GPT-3 on the majority of tasks that we evaluate on. Ablation studies reveal that performance on unseen tasks improves with the number of instruction tuning task clusters, and, interestingly, that performance improvements from instruction tuning emerge only with sufficient model scale. Moreover, instruction tuning can be combined with other prompting methods such as few-shot prompting and prompt tuning.
The diverse capabilities of language models at scale have drawn attention to the tradeoffs between specialist models (one model per task) and generalist models (one model for many tasks; Arivazhagan et al., 2019; Pratap et al., 2020), for which our study has potential implications. Although one might expect labeled data to have the most natural role in improving specialist models, instruction tuning demonstrates how labeled data can be used to help large language models perform many unseen tasks. In other words, the positive effect of instruction tuning on cross-task generalization shows that task-specific training is complementary to general language modeling and motivates further research on generalist models.
As for limitations of our study, there is a degree of subjectivity in assigning tasks to clusters (though we try to use accepted categorizations in the literature), and we only explore the use of relatively short instructions, typically a single sentence long (cf. the detailed instructions given to crowd workers). A limitation of our evaluation is that individual examples might have appeared in the models’ pretraining data, which includes web documents, though in post-hoc analysis (Appendix C) we do not find any evidence that data overlap substantially impacted the results. Finally, the scale of FLAN 137B makes it costly to serve. Future work on instruction tuning could include gathering/generating even more task clusters for finetuning, cross-lingual experiments, using FLAN to generate data for training downstream classifiers, and using finetuning to improve model behavior with respect to bias and fairness (Solaiman & Dennison, 2021).
This paper has explored a simple method for improving the ability of language models at scale to perform zero-shot tasks based purely on instructions. Our instruction-tuned model, FLAN, compares favorably against GPT-3 and signals the potential ability for language models at scale to follow instructions. We hope that our paper will spur further research on instructions-based NLP, zero-shot learning, and using labeled data to improve large language models.
This work uses language models, for which the risks and potential harms are discussed in Bender & Koller (2020), Brown et al. (2020), Bender et al. (2021), Patterson et al. (2021), and others. As our contribution in this paper is not a pretrained language model itself but rather an empirical study of how instruction tuning affects the zero-shot performance of a language model on unseen tasks, we additionally highlight two relevant ethical considerations. First, labeled datasets such as those we use for finetuning can contain undesirable biases, and these biases can be propagated into zero-shot applications of the model on downstream tasks. Second, instruction-tuned models can potentially require less data and expertise to use; such lower barriers to access could increase both the benefits and the associated risks of such models.
We use the same pretrained language models as Austin et al. (2021). The energy cost and carbon footprint for the pretrained models were 451 MWh and 26 tCO2e, respectively. The number of additional gradient steps used to instruction tune FLAN is less than 2% of the number of pretraining steps, so the estimated additional energy cost is comparatively small.
Maarten Bosma conceived the original idea and implemented the first version of FLAN. Vincent Zhao prototyped the training and evaluation pipelines, as well as rank classification. Kelvin Guu proposed and implemented the idea of task clusters and evaluation using inter-cluster splits. Jason Wei, Maarten Bosma, Vincent Zhao, and Adams Wei Yu implemented the NLP tasks. Jason Wei, Vincent Zhao, and Adams Wei Yu conducted and managed most of the experiments. Jason Wei designed and ran the ablation studies. Jason Wei, Maarten Bosma, and Quoc V. Le wrote most of the paper. Jason Wei, Maarten Bosma, and Nan Du obtained the zero and few-shot baselines. Vincent Zhao and Kelvin Guu designed, implemented, and conducted the few-shot FLAN experiments. Maarten Bosma and Jason Wei ran the data contamination analysis. Brian Lester ran the prompt tuning experiments. Quoc V. Le and Andrew M. Dai advised, provided high-level guidance, and helped edit the paper.
We thank Ed Chi, Slav Petrov, Dan Garrette, Ruibo Liu, and Clara Meister for providing feedback on our manuscript. We thank Adam Roberts, Liam Fedus, Hyung Won Chung, and Noam Shazeer for helping debug some of our models. We thank Ellie Pavlick for feedback on the study design during the middle stages of the project. We thank Daniel De Freitas Adiwardana for helping initiate the project, large language model advising, and giving us access to some computational resources. Finally, we thank the team involved in pretraining LaMDA-PT: Daniel De Freitas Adiwardana, Noam Shazeer, Yanping Huang, Dmitry Lepikhin, Dehao Chen, Yuanzhong Xu and Zhifeng Chen.
Armen Aghajanyan, Anchit Gupta, Akshat Shrivastava, Xilun Chen, Luke
Zettlemoyer, and Sonal Gupta. Muppet: Massive multi-task representations
with pre-finetuning. arXiv preprint arXiv:2101.11038, 2021. URL
https://arxiv.org/abs/2101.11038.
Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin
Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin
Cherry, et al. Massively multilingual neural machine translation in the
wild: Findings and challenges. arXiv preprint arXiv:1907.05019,
2019. URL https://arxiv.org/abs/1907.05019.
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk
Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc
Le, and Charles Sutton. Program synthesis with large language models.
arXiv preprint arXiv:2108.07732, 2021. URL
https://arxiv.org/abs/2108.07732.
Amittai Axelrod, Xiaodong He, and Jianfeng Gao. Domain adaptation via
pseudo in-domain data selection. In Proceedings of the 2011
Conference on Empirical Methods in Natural Language Processing,
pp. 355–362, 2011. URL
https://aclanthology.org/D11-1033.
Marta Bañón, Pinzhen Chen, Barry Haddow, Kenneth Heafield, Hieu
Hoang, Miquel Esplà-Gomis, Mikel L. Forcada, Amir Kamran, Faheem Kirefu,
Philipp Koehn, Sergio Ortiz Rojas, Leopoldo Pla Sempere, Gema
Ramírez-Sánchez, Elsa Sarrías, Marek Strelec, Brian Thompson, William
Waites, Dion Wiggins, and Jaume Zaragoza. ParaCrawl: Web-scale
acquisition of parallel corpora. In Proceedings of the 58th Annual
Meeting of the Association for Computational Linguistics,
pp. 4555–4567, 2020. URL
https://aclanthology.org/2020.acl-main.417.
Emily M. Bender and Alexander Koller. Climbing towards NLU: On
meaning, form, and understanding in the age of data. In Proceedings
of the 58th Annual Meeting of the Association for Computational
Linguistics, pp. 5185–5198, 2020. URL
https://aclanthology.org/2020.acl-main.463.
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and
Shmargaret Shmitchell. On the dangers of stochastic parrots: Can
language models be too big? In Proceedings of the 2021 ACM
Conference on Fairness, Accountability, and Transparency, FAccT
’21, pp. 610–623. Association for Computing Machinery, 2021. URL
https://doi.org/10.1145/3442188.3445922.
Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. The
Fifth PASCAL Recognizing Textual Entailment Challenge. In TAC,
2009. URL
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.232.1231&rep=rep1&type=pdf.
Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin
Choi. PIQA: Reasoning about physical commonsense in natural language. In
Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.
URL https://arxiv.org/abs/1911.11641.
Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow,
Philipp Koehn, Christof Monz, Matt Post, and Lucia Specia (eds.).
Proceedings of the Ninth Workshop on Statistical Machine
Translation, 2014. URL
https://aclanthology.org/W14-3300.
Ondřej Bojar, Christian Buck, Rajen Chatterjee, Christian Federmann,
Liane Guillou, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes,
Aurélie Névéol, Mariana Neves, Pavel Pecina, Martin Popel, Philipp
Koehn, Christof Monz, Matteo Negri, Matt Post, Lucia Specia, Karin
Verspoor, Jörg Tiedemann, and Marco Turchi (eds.). Proceedings of
the First Conference on Machine Translation: Volume 1, Research
Papers, 2016. URL
https://aclanthology.org/W16-2200.
Rishi Bommasani, Drew A. Hudson, E. Adeli, R. Altman, Simran Arora,
Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut,
Emma Brunskill, E. Brynjolfsson, S. Buch, D. Card, Rodrigo Castellon,
Niladri S. Chatterji, Annie Chen, Kathleen Creel, Jared Quincy Davis,
Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, S. Ermon, J.
Etchemendy, Kawin Ethayarajh, L. Fei-Fei, Chelsea Finn, Trevor Gale,
Lauren E. Gillespie, Karan Goel, Noah D. Goodman, S. Grossman, Neel
Guha, Tatsunori Hashimoto, Peter Henderson, John Hewitt, Daniel E. Ho,
Jenny Hong, Kyle Hsu, Jing Huang, Thomas F. Icard, Saahil Jain, Dan
Jurafsky, Pratyusha Kalluri, Siddharth Karamcheti, G. Keeling, Fereshte
Khani, O. Khattab, Pang Wei Koh, M. Krass, Ranjay Krishna, Rohith
Kuditipudi, Ananya Kumar, Faisal Ladhak, Mina Lee, Tony Lee, J.
Leskovec, Isabelle Levent, Xiang Lisa Li, Xuechen Li, Tengyu Ma, Ali
Malik, Christopher D. Manning, Suvir P. Mirchandani, Eric Mitchell,
Zanele Munyikwa, Suraj Nair, Avanika Narayan, D. Narayanan, Ben Newman,
Allen Nie, J. C. Niebles, H. Nilforoshan, Julian Nyarko, Giray Ogut,
Laurel Orr, Isabel Papadimitriou, Joon Sung Park, C. Piech, Eva
Portelance, Christopher Potts, Aditi Raghunathan, Robert Reich, Hongyu
Ren, Frieda Rong, Yusuf H. Roohani, Camilo Ruiz, Jack Ryan, Christopher
R’e, D. Sadigh, Shiori Sagawa, Keshav Santhanam, Andy Shih, K.
Srinivasan, Alex Tamkin, Rohan Taori, Armin W. Thomas, Florian Tramèr,
Rose E. Wang, William Wang, Bohan Wu, Jiajun Wu, Yuhuai Wu, Sang Michael
Xie, Michihiro Yasunaga, Jiaxuan You, M. Zaharia, Michael Zhang, Tianyi
Zhang, Xikun Zhang, Yuhui Zhang, Lucia Zheng, Kaitlyn Zhou, and Percy
Liang. On the opportunities and risks of foundation models. arXiv
preprint arXiv:2108.07258, 2021. URL
https://arxiv.org/abs/2108.07258.
Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D.
Manning. A large annotated corpus for learning natural language
inference. In Proceedings of the 2015 Conference on Empirical
Methods in Natural Language Processing, pp. 632–642, 2015. URL
https://aclanthology.org/D15-1075.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D.
Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish
Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen
Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler,
Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz
Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam
McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language
models are few-shot learners. In Advances in Neural Information
Processing Systems, volume 33, pp. 1877–1901, 2020. URL
https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
Duo Chai, Wei Wu, Qinghong Han, Fei Wu, and Jiwei Li. Description
based text classification with reinforcement learning. In
Proceedings of the International Conference on Machine
Learning, pp. 1371–1382. PMLR, 2020. URL
http://proceedings.mlr.press/v119/chai20a/chai20a.pdf.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde,
Jared Kaplan, Harri Edwards, Yura Burda, Nicholas Joseph, Greg Brockman,
et al. Evaluating large language models trained on code. arXiv
preprint arXiv:2107.03374, 2021. URL
https://arxiv.org/abs/2107.03374.
Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin
Choi, Percy Liang, and Luke Zettlemoyer. QuAC: Question answering in
context. In Proceedings of the 2018 Conference on Empirical Methods
in Natural Language Processing, pp. 2174–2184, 2018. URL
https://aclanthology.org/D18-1241.
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski,
Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising
difficulty of natural yes/no questions. In Proceedings of the 2019
Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, Volume 1 (Long
and Short Papers), pp. 2924–2936, 2019a. URL
https://aclanthology.org/N19-1300.
Kevin Clark, Minh-Thang Luong, Urvashi Khandelwal, Christopher D.
Manning, and Quoc V. Le. BAM! born-again multi-task networks for natural
language understanding. In Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics, pp. 5931–5937,
2019b. URL https://aclanthology.org/P19-1595.
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish
Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved
question answering? Try ARC, the AI2 reasoning challenge. arXiv
preprint arXiv:1803.05457, 2018. URL
https://arxiv.org/abs/1803.05457.
Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray
Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from
scratch. Journal of Machine Learning Research, 12:2493–2537,
2011. URL
https://www.jmlr.org/papers/volume12/collobert11a/collobert11a.pdf.
Michele Corazza, Stefano Menini, Elena Cabrio, Sara Tonelli, and
Serena Villata. Hybrid emoji-based masked language models for zero-shot
abusive language detection. In Findings of the Association for
Computational Linguistics: EMNLP 2020, pp. 943–949, 2020. URL
https://aclanthology.org/2020.findings-emnlp.84.
Ido Dagan, Oren Glickman, and Bernardo Magnini. The PASCAL
Recognising Textual Entailment challenge. In Proceedings of the
First International Conference on Machine Learning Challenges:
Evaluating Predictive Uncertainty Visual Object Classification, and
Recognizing Textual Entailment, MLCW’05, pp. 177–190, 2005. URL
https://doi.org/10.1007/11736790_9.
Andrew M Dai and Quoc V Le. Semi-supervised sequence learning. In
Proceedings of the Conference on Neural Information Processing
Systems, 2015. URL
https://papers.nips.cc/paper/2015/file/7137debd45ae4d0ab9aa953017286b20-Paper.pdf.
Marie-Catherine De Marneffe, Mandy Simons, and Judith Tonhauser. The
CommitmentBank: Investigating projection in naturally occurring
discourse. In Proceedings of Sinn und Bedeutung, pp. 107–124,
2019. URL
https://ojs.ub.uni-konstanz.de/sub/index.php/sub/article/view/601.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.
BERT: Pre-training of deep bidirectional transformers for language
understanding. In Proceedings of the 2019 Conference of the North
American Chapter of the Association for Computational Linguistics: Human
Language Technologies, Volume 1 (Long and Short Papers),
pp. 4171–4186, 2019. URL
https://aclanthology.org/N19-1423.
William B. Dolan and Chris Brockett. Automatically constructing a
corpus of sentential paraphrases. In Proceedings of the Third
International Workshop on Paraphrasing (IWP2005), 2005. URL
https://aclanthology.org/I05-5002.
Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin,
Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et
al. GLaM: Efficient scaling of language models with mixture-of-experts.
arXiv preprint arXiv:2112.06905, 2021. URL
https://arxiv.org/pdf/2112.06905.
Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer
Singh, and Matt Gardner. DROP: A reading comprehension benchmark
requiring discrete reasoning over paragraphs. In Proceedings of the
2019 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, Volume 1 (Long
and Short Papers), pp. 2368–2378, 2019. URL
https://aclanthology.org/N19-1246.
Nadir Durrani, Barry Haddow, Philipp Koehn, and Kenneth Heafield.
Edinburgh’s phrase-based machine translation systems for WMT-14. In
Proceedings of the Ninth Workshop on Statistical Machine
Translation, pp. 97–104, 2014. URL
https://aclanthology.org/W14-3309.
Ondřej Dušek, David M. Howcroft, and Verena Rieser. Semantic noise
matters for neural natural language generation. In Proceedings of
the 12th International Conference on Natural Language Generation,
pp. 421–426, 2019. URL
https://aclanthology.org/W19-8652.
Sergey Edunov, Myle Ott, Michael Auli, and David Grangier.
Understanding back-translation at scale. In Proceedings of the 2018
Conference on Empirical Methods in Natural Language Processing,
pp. 489–500, 2018. URL
https://aclanthology.org/D18-1045.
Avia Efrat and Omer Levy. The Turking Test: Can language models
understand instructions? arXiv preprint arXiv:2010.11982, 2020.
URL https://arxiv.org/abs/2010.11982.
Alexander Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir Radev.
Multi-news: A large-scale multi-document summarization dataset and
abstractive hierarchical model. In Proceedings of the 57th Annual
Meeting of the Association for Computational Linguistics,
pp. 1074–1084, 2019. URL
https://aclanthology.org/P19-1102.
Fast.AI. Yelp Sentiment Classification Dataset.
https://course.fast.ai/datasets.
William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers:
Scaling to trillion parameter models with simple and efficient sparsity.
arXiv preprint arXiv:2101.03961, 2021. URL
https://arxiv.org/abs/2101.03961.
Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic
meta-learning for fast adaptation of deep networks. In Proceedings
of the International Conference on Machine Learning (ICML),
pp. 1126–1135, 2017. URL
https://arxiv.org/abs/1703.03400.
Tianyu Gao, Adam Fisch, and Danqi Chen. Making pre-trained language
models better few-shot learners. In Proceedings of the 59th Annual
Meeting of the Association for Computational Linguistics and the 11th
International Joint Conference on Natural Language Processing (Volume 1:
Long Papers), pp. 3816–3830, 2021. URL
https://aclanthology.org/2021.acl-long.295.
Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura
Perez-Beltrachini. The WebNLG challenge: Generating text from RDF data.
In Proceedings of the 10th International Conference on Natural
Language Generation, pp. 124–133, 2017. URL
https://aclanthology.org/W17-3518.
Sebastian Gehrmann, Tosin Adewumi, Karmanya Aggarwal, Pawan Sasanka
Ammanamanchi, Anuoluwapo Aremu, Antoine Bosselut, Khyathi Raghavi
Chandu, Miruna-Adriana Clinciu, Dipanjan Das, Kaustubh Dhole, Wanyu Du,
Esin Durmus, Ondřej Dušek, Chris Chinenye Emezue, Varun Gangal, Cristina
Garbacea, Tatsunori Hashimoto, Yufang Hou, Yacine Jernite, Harsh
Jhamtani, Yangfeng Ji, Shailza Jolly, Mihir Kale, Dhruv Kumar, Faisal
Ladhak, Aman Madaan, Mounica Maddela, Khyati Mahajan, Saad Mahamood,
Bodhisattwa Prasad Majumder, Pedro Henrique Martins, Angelina
McMillan-Major, Simon Mille, Emiel van Miltenburg, Moin Nadeem, Shashi
Narayan, Vitaly Nikolaev, Andre Niyongabo Rubungo, Salomey Osei, Ankur
Parikh, Laura Perez-Beltrachini, Niranjan Ramesh Rao, Vikas Raunak, Juan
Diego Rodriguez, Sashank Santhanam, João Sedoc, Thibault Sellam, Samira
Shaikh, Anastasia Shimorina, Marco Antonio Sobrevilla Cabezudo, Hendrik
Strobelt, Nishant Subramani, Wei Xu, Diyi Yang, Akhila Yerukola, and
Jiawei Zhou. The GEM benchmark: Natural language generation, its
evaluation and metrics. In Proceedings of the 1st Workshop on
Natural Language Generation, Evaluation, and Metrics (GEM 2021),
pp. 96–120, 2021. URL
https://aclanthology.org/2021.gem-1.10.
Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. The
third PASCAL recognizing textual entailment challenge. In
Proceedings of the ACL-PASCAL Workshop on Textual Entailment and
Paraphrasing, pp. 1–9, 2007. URL
https://aclanthology.org/W07-1401.
Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wawer.
SAMSum corpus: A human-annotated dialogue dataset for abstractive
summarization. In Proceedings of the 2nd Workshop on New Frontiers
in Summarization, pp. 70–79, 2019. URL
https://aclanthology.org/D19-5409.
Alec Go, Richa Bhayani, and Lei Huang. Twitter sentiment
classification using distant supervision. CS224N project report,
Stanford, 1(12):2009, 2009. URL
https://www-cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf.
Dan Goldwasser and Dan Roth. Learning from natural instructions.
Machine learning, 94(2):205–232, 2014. URL
https://link.springer.com/article/10.1007/s10994-013-5407-y.
Max Grusky, Mor Naaman, and Yoav Artzi. Newsroom: A dataset of 1.3
million summaries with diverse extractive strategies. In Proceedings
of the 2018 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, Volume 1
(Long Papers), pp. 708–719, 2018. URL
https://aclanthology.org/N18-1065.
R Bar Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo,
Bernardo Magnini, and Idan Szpektor. The Second PASCAL Recognising
Textual Entailment Challenge. In Proceedings of the Second PASCAL
Challenges Workshop on Recognising Textual Entailment, 2006. URL
http://www.cs.biu.ac.il/~szpekti/papers/RTE2-organizers.pdf.