Index
- Agent
- Decoding
- Fine-tuning
- Hallucination
- In-context learning
- Mixture of Experts (MoE)
- Prompt
- Prompt engineering
- Quantization
- Retrieval Augmented Generation (RAG)
- RLHF
- Zero-shot, few-shot prompting
Agent
A program or tool that makes one or more LLM calls as part of its operation.
Fine-tuning
Fine-tuning refers to the process of training a language model with labeled data for downstream tasks. In the context of LLMs, fine-tuning is further categorized into three approaches:
- Full fine-tuning means updating all the parameters of the language model.
- Efficient fine-tuning refers to only fine-tuning a subset of parameters in the language model. For example, Parameter-Efficient Fine-Tuning (PEFT) methods freeze the pretrained model parameters during fine-tuning and add a small number of trainable parameters (the adapters) on top of them. The adapters are trained to learn task-specific information (see the sketch after this list).
- Instruction tuning denotes fine-tuning a language model on downstream task instructions to encourage generalization to unseen tasks at inference time.
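As a minimal sketch of the PEFT idea, the snippet below attaches LoRA adapters to a frozen base model using the Hugging Face peft library; the base model name and hyperparameter values are illustrative assumptions, not recommendations.

```python
# Minimal PEFT sketch: freeze a pretrained model and train only small LoRA adapters.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # illustrative base model

lora_config = LoraConfig(
    r=8,               # rank of the adapter matrices (illustrative value)
    lora_alpha=16,     # scaling factor for the adapter updates
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)

# Only the adapter parameters are trainable; the pretrained weights stay frozen.
model.print_trainable_parameters()
```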
Hallucination
In the context of large language models (LLMs) like GPT-3, hallucination refers to a phenomenon where the model generates outputs that sound plausible but are factually incorrect or not grounded in the provided input or in reality.
Decoding
Decoding refers to the different approaches for selecting output tokens/words, based on their probabilities, while generating text. The main decoding approaches are listed below; a code sketch follows the list.
- Greedy decoding selects the word with the highest probability as the next word at each timestep while generating output tokens.
- Beam search reduces the risk of missing hidden high-probability word sequences by keeping the most likely num_beams hypotheses at each time step and eventually choosing the hypothesis with the overall highest probability. Greedy decoding and beam search are referred to as deterministic methods. Deterministic methods often lead to the problem of model degeneration, i.e., the generated text is unnatural and contains undesirable repetitions.
- Contrastive search aims to solve the degeneration problem by incorporating a degeneration penalty. In other words, it considers both the probability of a candidate word and the degeneration penalty when selecting it.
- Another approach is sampling, which falls into the category of stochastic methods. In its most basic form, sampling means randomly picking the next word according to the conditional word probability distribution produced by the language model. Two widely used stochastic methods are top-k sampling and nucleus sampling (also called top-p sampling).
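The sketch below shows how these strategies can be selected with the Hugging Face transformers generate API; the model name, prompt, and parameter values are illustrative choices.

```python
# Illustrative use of different decoding strategies via transformers' generate().
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The best thing about open models is", return_tensors="pt")

# Greedy decoding: always pick the highest-probability next token.
greedy = model.generate(**inputs, max_new_tokens=40, do_sample=False)

# Beam search: keep the num_beams most likely hypotheses at each step.
beam = model.generate(**inputs, max_new_tokens=40, num_beams=5, do_sample=False)

# Contrastive search: probability plus a degeneration penalty (penalty_alpha).
contrastive = model.generate(**inputs, max_new_tokens=40, penalty_alpha=0.6, top_k=4)

# Stochastic sampling: top-k and nucleus (top-p) sampling.
sampled = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_k=50, top_p=0.95)

print(tokenizer.decode(sampled[0], skip_special_tokens=True))
```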
In-context Learning
In-context learning (ICL) is a technique where task demonstrations are integrated into the prompt in a natural language/textual format. This enables a pre-trained LLM to address new tasks without fine-tuning the model (or updating its parameters).
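As a toy illustration, the prompt below embeds translation demonstrations directly in the input, and the frozen model is expected to continue the pattern; the examples are made up.

```python
# A toy in-context learning prompt: demonstrations are placed in the prompt itself,
# and the pretrained model completes the pattern without any parameter updates.
icl_prompt = """Translate English to French.

English: cheese
French: fromage

English: bread
French: pain

English: water
French:"""
# Feeding icl_prompt to a pretrained LLM should yield "eau" with no fine-tuning.
```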
Mixture of Experts (MoE)
The Mixture of Experts framework enables scaling LLMs to very large parameter counts while keeping the compute required per token low, since only a subset of parameters is active for each token. It relies on (1) sparse MoE layers, which are used instead of dense feed-forward network (FFN) layers; each MoE layer has a certain number of “experts”, where each expert is a neural network; and (2) a gate network or router that determines which tokens are sent to which expert. For example, Mixtral 8x7B, sometimes described as a “scaled-down GPT-4,” utilizes a Mixture of Experts (MoE) framework with eight experts.
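For intuition, here is a toy sparse MoE layer in PyTorch: a router scores each token, the top-k experts are selected, and their outputs are mixed by the routing weights. The dimensions and the simple dispatch loop are illustrative; production MoE layers (such as Mixtral's) add load-balancing losses and far more efficient expert dispatch.

```python
# Toy sparse Mixture of Experts layer: a router sends each token to its top-k experts.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        # Each expert is a small feed-forward network replacing the dense FFN.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(d_model, num_experts)  # gate network
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, d_model)
        gate_logits = self.router(x)                          # (num_tokens, num_experts)
        weights, indices = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                  # mixing weights per token
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e                  # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

layer = SparseMoE()
print(layer(torch.randn(10, 512)).shape)  # (10, 512)
```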
Prompt
A prompt is what we give to an LLM as input. The Prompt Engineering Guide categorizes a prompt into four parts/elements (combined in the toy example after this list):
- Instruction: a specific task or instruction you want the model to perform
- Context: external information or additional context that can steer the model to better responses
- Input data: the input or question that we want to find a response for
- Output indicator: the type or format of the output
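As a toy illustration, the snippet below assembles a prompt from the four elements; all of the strings are made-up examples.

```python
# Building a prompt from the four elements above (contents are illustrative).
instruction = "Classify the text into neutral, negative, or positive."
context = "The text below is a short customer review of a restaurant."
input_data = "Text: I think the food was okay."
output_indicator = "Sentiment:"

prompt = "\n\n".join([instruction, context, input_data, output_indicator])
print(prompt)
```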
Prompt engineering
Prompt engineering is a recent but rapidly growing discipline whose goal is to design an optimal prompt for a given generative model and objective.
Quantization
Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and activations with low-precision data types like 8-bit integer (int8) instead of the usual 32-bit floating point (float32).
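As a rough illustration of the idea, the snippet below applies symmetric int8 quantization to a weight tensor and dequantizes it back; the tensor and scaling scheme are toy assumptions, and libraries such as bitsandbytes or GPTQ handle this far more carefully.

```python
# Toy symmetric int8 quantization: scale float32 weights into the int8 range.
import torch

weights = torch.randn(4, 4)                                # float32 weights

scale = weights.abs().max() / 127                          # map largest magnitude to 127
q_weights = torch.round(weights / scale).to(torch.int8)    # int8 representation

dequantized = q_weights.float() * scale                    # approximate reconstruction
print((weights - dequantized).abs().max())                 # small quantization error
```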
RAG
RAG (Retrieval Augmented Generation), proposed by Meta researchers, augments the context given to LLMs by retrieving a set of relevant/supporting documents from a defined source (e.g., internal documents, knowledge graphs). The documents are concatenated with the original input prompt as context and fed to the text generator, which produces the final output. This makes RAG adaptive to situations where facts evolve over time, which is very useful because an LLM's parametric knowledge is static.
The following figure illustrates what we often refer to as the RAG triad, which consists of Query, Context, and Response.
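To make the retrieve-then-generate flow concrete, here is a minimal sketch; the word-overlap retriever and the example documents are toy assumptions standing in for a real embedding model and vector store.

```python
# Minimal RAG sketch: retrieve relevant documents, then prepend them to the prompt.
def retrieve(query, documents, k=2):
    # Toy relevance score: number of words shared between query and document.
    def overlap(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(documents, key=overlap, reverse=True)[:k]

documents = [
    "The company onboarding checklist was updated in March.",
    "Expense reports must be filed within 30 days.",
    "The office is closed on public holidays.",
]

query = "When do I need to file my expense report?"
context = "\n".join(retrieve(query, documents))

prompt = (
    "Answer using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}\nAnswer:"
)
# `prompt` is then sent to the text generator (the LLM) to produce the final output.
print(prompt)
```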
RLHF (Reinforcement Learning from Human Feedback)
Reinforcement Learning from Human Feedback (RLHF) is an approach to reinforcement learning in which human feedback is incorporated into the learning process to improve the performance of the learning agent. For example, ChatGPT uses RLHF: human feedback on the model's conversational responses is incorporated into training so that the responses better reflect human preferences in real-life interactions.
Zero-shot, few-shot prompting
Zero-shot or few-shot prompting refers to how many examples of the target task we give to an LLM. In zero-shot prompting, we use the LLM directly, without giving any examples, for instance to classify the sentiment of movie reviews. In contrast, we can also include one or more examples (e.g., movie reviews and their sentiments) in the prompt; in that case we are using few-shot prompting.
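For illustration, here are toy zero-shot and few-shot prompts for the movie-review sentiment task described above; the reviews and labels are made up.

```python
# Zero-shot prompt: no examples, just the task and the input.
zero_shot = """Classify the sentiment of the movie review as positive or negative.

Review: The plot was predictable and the acting felt flat.
Sentiment:"""

# Few-shot prompt: the same task, preceded by a few labeled demonstrations.
few_shot = """Classify the sentiment of the movie review as positive or negative.

Review: An absolute triumph from start to finish.
Sentiment: positive

Review: I walked out halfway through.
Sentiment: negative

Review: The plot was predictable and the acting felt flat.
Sentiment:"""
```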