LLM (Large Language Models) Dictionary: Demystifying LLM Terms



Index



Agent

A program or tool that makes one or more (non-zero) LLM calls, typically using those calls to decide which actions to take toward a goal.

Agentic RAG

Agentic RAG extends Retrieval Augmented Generation (RAG) with one or more agents and tools; to solve the task, the agent(s) can make decisions that shape the workflow, such as:
  • Whether to retrieve information or not
  • Which tool(s) to use to retrieve information
  • How to formulate the relevant queries by itself
  • How to evaluate and judge the retrieved context, and whether to retrieve again

Alignment

LLMs pretrained on large corpora with the next-token-prediction objective do not, by themselves, behave according to human intent, e.g., with respect to safety or instruction following. Alignment refers to the further step(s) of fine-tuning these pretrained models so that they align with human intent on such criteria.

Analogical Prompting

A method where the LLM is instructed to first recall relevant exemplars (self-generated or retrieved from prior knowledge) before attempting to solve the given problem, allowing it to solve the problem through analogy.

Budget Forcing

Budget forcing is a test-time strategy where, if the model generates a response that does not fully utilize the allocated token budget, a special “wait” token is appended to force additional token generation. This hints to the model that its answer may be incomplete or uncertain, potentially improving response accuracy and consistency [Advanced LLM Agent MOOC].
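A toy sketch of the idea in Python (the llm_generate function below is a hypothetical stand-in for any decoding call, not a real API):

def force_budget(prompt, llm_generate, budget=256):
    # Keep decoding until the token budget is used up; if the model stops early,
    # append a "Wait" cue to nudge it into generating more reasoning.
    text, used = prompt, 0
    while used < budget:
        chunk = llm_generate(text, max_new_tokens=budget - used)  # model may stop early
        used += len(chunk.split())                                # crude token count for illustration
        text += chunk
        if used < budget:
            text += " Wait,"
    return text

def llm_generate(text, max_new_tokens):
    # Toy stand-in that always "stops early", just to make the sketch runnable.
    return " the product is 6 * 7 = 42."

print(force_budget("Q: What is 6 * 7? Reasoning:", llm_generate, budget=20))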

Fine-tuning

Fine-tuning refers to the process of training the language model with labeled data for downstream tasks. In the context of LLMs, fine-tuning is further categorized into four kinds:

  • Full fine-tuning means updating all the parameters of the language model.
  • Efficient fine-tuning refers to fine-tuning only a subset of parameters in the language model. For example, Parameter-Efficient Fine-Tuning (PEFT) methods freeze the pretrained model parameters during fine-tuning and add a small number of trainable parameters (the adapters) on top; the adapters are trained to learn task-specific information (see the sketch after this list).
  • Instruction fine-tuning (supervised fine-tuning/instruction tuning) denotes fine-tuning the language model on downstream tasks phrased as instructions, to encourage generalization to unseen tasks at inference time.
  • Preference fine-tuning is a methodology that builds upon supervised fine-tuning (SFT) by incorporating human feedback to better align model outputs with human preferences, leading to stronger influence on style and chat-based evaluations, though the improvements in core task capabilities may be less pronounced [Advanced LLM Agent MOOC].
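A minimal sketch of efficient fine-tuning with LoRA adapters, assuming the Hugging Face transformers and peft libraries; the base model ("gpt2") and the hyperparameters are illustrative choices, not recommendations:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")   # any pretrained causal LM

# Freeze the pretrained weights and attach small trainable LoRA adapters.
lora_config = LoraConfig(
    r=8,                          # rank of the low-rank update matrices
    lora_alpha=16,                # scaling factor applied to the adapter output
    target_modules=["c_attn"],    # which modules receive adapters (model-specific)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only a small fraction of parameters is trainable

The wrapped model can then be trained as usual; only the adapter weights are updated.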

Function Calling

A specific mechanism within tool use that involves invoking predefined functions with structured inputs and receiving structured outputs, e.g., OpenAI function calling.
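A hedged sketch with the OpenAI Python client: the model is given a JSON-schema description of a function and may return structured arguments instead of free text (the get_weather function, its schema, and the model name are illustrative assumptions):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",   # hypothetical function exposed to the model
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)  # structured call(s), if the model chose to use the tool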

Hallucination 

In the context of large language models (LLMs) like GPT-3, hallucination refers to the phenomenon where the model generates fluent, plausible-sounding output that is factually incorrect or not grounded in its input or training data.

Decoding

Decoding refers to the different approaches for selecting output tokens/words from their probabilities while generating text. There are four main decoding approaches (a code sketch follows the list):
  • Greedy decoding selects the word with the highest probability as the next word at each timestep while generating output tokens.
  • Beam search is an alternative approach that reduces the risk of missing hidden high-probability word sequences by keeping the most likely num_beams hypotheses at each time step and eventually choosing the hypothesis that has the overall highest probability. Greedy decoding and beam search are referred to as deterministic methods. Deterministic methods often lead to the problem of model degeneration, i.e., the generated text is unnatural and contains undesirable repetitions.
  • Contrastive search aims to solve the degeneration problem by incorporating a degeneration penalty: it considers both the probability of a candidate word and the degeneration penalty when choosing it.
  • Another approach, sampling, falls into the category of stochastic methods. In its most basic form, sampling means randomly picking the next word according to the conditional word probability distribution produced by the language model. Two widely used stochastic methods are top-k sampling and nucleus sampling (also called top-p sampling).
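A minimal sketch of these strategies using Hugging Face transformers' generate(); the model name and the generation parameters are illustrative:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The meaning of life is", return_tensors="pt")

outputs = {
    "greedy": model.generate(**inputs, max_new_tokens=30, do_sample=False),
    "beam": model.generate(**inputs, max_new_tokens=30, num_beams=4),
    "contrastive": model.generate(**inputs, max_new_tokens=30, penalty_alpha=0.6, top_k=4),
    "top-k": model.generate(**inputs, max_new_tokens=30, do_sample=True, top_k=50),
    "top-p (nucleus)": model.generate(**inputs, max_new_tokens=30, do_sample=True, top_k=0, top_p=0.9),
}
for name, out in outputs.items():
    print(name, "->", tokenizer.decode(out[0], skip_special_tokens=True))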

In-context Learning

In-context learning (ICL) is a technique where task demonstrations are integrated into the prompt in a natural language/textual format. This enables pre-trained LLMs to address new tasks without fine-tuning the model (or updating their parameters).

Mixture of Experts (MoE)

The Mixture of Experts framework enables scaling the total parameter count of LLMs while keeping the compute (and the number of active parameters) per token much smaller, because only a few experts process each token. It relies on (1) sparse MoE layers, which are used instead of dense feed-forward network (FFN) layers; an MoE layer has a certain number of “experts”, where each expert is itself a neural network; and (2) a gate network or router that determines which tokens are sent to which expert. For example, Mixtral 8x7B, sometimes described as a “scaled-down GPT-4”, uses a Mixture of Experts (MoE) architecture with eight experts per layer.
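A toy sketch of a sparse MoE layer in PyTorch: a router scores the experts for each token, only the top-k experts run, and their outputs are combined with the router weights (dimensions and expert count are illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=64, d_hidden=256, n_experts=8, top_k=2):
        super().__init__()
        # Each expert is a small feed-forward network replacing the dense FFN.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        self.router = nn.Linear(d_model, n_experts)  # gate network / router
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)  # top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e  # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask][:, k:k + 1] * expert(x[mask])
        return out

moe = SparseMoE()
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])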

Prompt

A prompt is what we give to an LLM as input. The Prompt Engineering Guide categorizes a prompt into four parts/elements (an illustrative example follows the list):
  • Instruction: a specific task or instruction you want the model to perform
  • Context: external information or additional context that can steer the model to better responses
  • Input data: the input or question that we are interested to find a response for
  • Output indicator: the type or format of the output
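An illustrative prompt broken into the four elements above (the review text is made up):

prompt = (
    "Classify the review into neutral, negative, or positive.\n"        # Instruction
    "The reviews are written by customers of an online bookstore.\n"    # Context
    "Review: Shipping was slow but the book itself is wonderful.\n"     # Input data
    "Sentiment (one word):"                                             # Output indicator
)
print(prompt)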

Prompt engineering

Prompt engineering is a recent but rapidly growing discipline whose goal is to design the optimal prompt for a given generative model and objective.

Quantization

Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and activations with low-precision data types like 8-bit integer (int8) instead of the usual 32-bit floating point (float32).
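A minimal numeric sketch of symmetric int8 quantization of a weight tensor (per-tensor scaling; real quantization schemes are more elaborate):

import numpy as np

weights = np.random.randn(4, 4).astype(np.float32)     # original float32 weights

scale = np.abs(weights).max() / 127.0                   # one scale for the whole tensor
w_int8 = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

w_dequant = w_int8.astype(np.float32) * scale           # approximate reconstruction at use time
print("max abs error:", np.abs(weights - w_dequant).max())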

RAG

RAG (Retrieval Augmented Generation), proposed by Meta researchers, augments the context given to LLMs by retrieving a set of relevant/supporting documents from a defined source (e.g., internal documents, knowledge graphs). The retrieved documents are concatenated with the original input prompt as context and fed to the text generator, which produces the final output. This makes RAG well suited to situations where facts evolve over time, which is very useful because an LLM's parametric knowledge is static.

A figure often used to illustrate RAG is the RAG triad, which consists of Query, Context, and Response.
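A minimal sketch of the retrieve-then-generate flow under illustrative assumptions: TF-IDF similarity (scikit-learn) stands in for a real retriever/vector store, and the final generation call is only indicated in a comment:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday to Friday, 9am to 5pm CET.",
    "Premium accounts include priority support and offline access.",
]
query = "How long do I have to return a product?"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform([query])

# Retrieve the most relevant supporting document for the query.
scores = cosine_similarity(query_vector, doc_vectors)[0]
context = documents[scores.argmax()]

# Concatenate the retrieved context with the original question and feed it to the generator.
prompt = f"Answer using the context.\n\nContext: {context}\n\nQuestion: {query}\nAnswer:"
print(prompt)  # in practice: llm.generate(prompt)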



RLHF (Reinforcement Learning with Human Feedback)

Reinforcement Learning with Human Feedback (RLHF) is an approach to reinforcement learning in which human feedback is incorporated into the training process to improve the behavior of the learning agent. For example, ChatGPT uses RLHF: human preferences over candidate responses are used to train a reward model, which then steers the model toward conversational responses people prefer.

Scaling Laws

Language model performance (loss) scales as a power law with (1) model size, (2) dataset size, and (3) the amount of compute used for training.
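As a rough illustration, the power-law form reported by Kaplan et al. (2020) can be written as

L(N) \approx (N_c / N)^{\alpha_N}, \qquad L(D) \approx (D_c / D)^{\alpha_D}, \qquad L(C) \approx (C_c / C)^{\alpha_C}

where N is the number of (non-embedding) parameters, D the dataset size in tokens, and C the training compute; the constants and exponents are fitted empirically and depend on the setup, so treat this as schematic rather than as exact values.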

Tool Use

Tool use refers to an agent interacting with external systems, APIs, or tools to achieve a goal. Examples include function calls, web searches, database queries, code execution, and physical actions in robotics.

Zero-shot, few-shot prompting

Zero- or few-shot refers to how many examples of the target task we give to the LLM in the prompt. For example, in zero-shot prompting we ask the LLM directly to classify the sentiment of movie reviews without giving it any examples. In contrast, we can also include one or more solved examples (e.g., movie reviews and their sentiments) in our prompt; in that case we are using few-shot prompting.
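An illustrative few-shot prompt for the movie-review example (the reviews and labels are made up):

few_shot_prompt = """Classify the sentiment of the movie review as Positive or Negative.

Review: "An absolute masterpiece, I was glued to the screen."
Sentiment: Positive

Review: "Two hours of my life I will never get back."
Sentiment: Negative

Review: "The soundtrack alone is worth the ticket price."
Sentiment:"""

# Zero-shot prompting would drop the two solved examples and keep only the
# instruction and the final review.
print(few_shot_prompt)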
