GenAI Glossary

73 terms

Every generative AI term you need to know, explained in plain English. Includes interview angles for the terms that matter most in FAANG and top-tier AI company interviews.

A
4 terms

AGI (Artificial General Intelligence)

A hypothetical AI system that can understand, learn, and apply knowledge across any intellectual task at or above human level. Unlike today's narrow AI models that excel at specific tasks, AGI would generalize seamlessly. No AGI system exists today, though frontier labs like OpenAI and DeepMind have stated it as a long-term goal.

Alignment

The field of research focused on ensuring AI systems behave in accordance with human values and intentions. Alignment techniques include RLHF, Constitutional AI, and DPO. The core challenge is specifying what 'aligned' means when human preferences are complex, contradictory, and evolving.

Attention Mechanism

A neural network mechanism that allows a model to weigh the importance of different parts of the input when generating each part of the output. Self-attention — the variant used in Transformers — lets each token attend to every other token in the sequence, enabling the model to capture long-range dependencies.

Autoregressive Model

A model that generates output one token at a time, where each new token is conditioned on all previously generated tokens. GPT, Claude, and LLaMA are all autoregressive. This sequential generation is why LLM inference can be slow — each token requires a full forward pass through the model.
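The token-by-token loop can be sketched in plain Python. A hypothetical lookup table stands in for the neural network here — the point is only the control flow: each step conditions on everything generated so far.

```python
# Toy autoregressive decoding loop. TOY_MODEL is a made-up lookup table
# standing in for a real model's forward pass.
TOY_MODEL = {
    (): "The",
    ("The",): "cat",
    ("The", "cat"): "sat",
    ("The", "cat", "sat"): "<eos>",
}

def generate(max_tokens=10):
    tokens = []
    for _ in range(max_tokens):
        # Each step conditions on ALL previously generated tokens —
        # this is what makes the model autoregressive.
        next_token = TOY_MODEL.get(tuple(tokens), "<eos>")
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens

print(generate())  # ['The', 'cat', 'sat']
```

In a real LLM, the lookup is replaced by a full forward pass per token, which is exactly why sequential generation is slow.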

B
2 terms

BERT (Bidirectional Encoder Representations from Transformers)

A landmark encoder-only Transformer model released by Google in 2018. Unlike autoregressive models that read left-to-right, BERT reads text in both directions simultaneously, making it excellent for understanding tasks like classification, NER, and semantic search. BERT is not used for text generation.

Bias (in AI)

Systematic errors in AI outputs that reflect prejudices in training data, labeling decisions, or model architecture choices. Bias can manifest as stereotyping, underrepresentation, or unfair treatment of certain demographic groups. Mitigating bias requires careful dataset curation, evaluation benchmarks, and post-training alignment.

C
5 terms

Chain-of-Thought (CoT)

A prompting technique where the model is instructed to show its reasoning step by step before providing a final answer. CoT dramatically improves performance on math, logic, and multi-step reasoning tasks. Variants include zero-shot CoT ('Let's think step by step') and few-shot CoT (providing worked examples).
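The two variants can be illustrated with prompt strings (the question and worked example below are invented for illustration):

```python
question = ("A pen and a notebook cost $11 total. The notebook costs "
            "$10 more than the pen. How much is the pen?")

# Zero-shot CoT: append a reasoning trigger phrase to the question.
zero_shot_cot = f"{question}\nLet's think step by step."

# Few-shot CoT: prepend a worked example that demonstrates the
# step-by-step reasoning format before asking the real question.
few_shot_cot = (
    "Q: I have 3 apples and buy 2 more. How many apples do I have?\n"
    "A: I start with 3 apples. Buying 2 more gives 3 + 2 = 5. "
    "The answer is 5.\n\n"
    f"Q: {question}\nA:"
)
print(zero_shot_cot)
```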

ChatGPT

OpenAI's conversational AI product built on top of GPT models, fine-tuned with RLHF for dialogue. Launched November 2022, it reached 100 million users in 60 days — the fastest-growing consumer application in history. ChatGPT popularized the concept of AI assistants and triggered the current generative AI wave.

Claude

Anthropic's family of AI assistant models, designed with a focus on safety, helpfulness, and honesty. Claude uses Constitutional AI for alignment and offers long context windows (up to 200K tokens). The Claude model family includes Haiku (fast/cheap), Sonnet (balanced), and Opus (most capable).

Constitutional AI (CAI)

An alignment technique developed by Anthropic where the AI critiques and revises its own outputs based on a set of written principles (a 'constitution'). Instead of relying solely on human labelers, CAI uses AI feedback to scale the alignment process. This makes training safer and more consistent.

Context Window

The maximum number of tokens a model can process in a single request, including both the input prompt and the generated output. GPT-4 supports up to 128K tokens; Claude supports up to 200K. Larger context windows enable processing entire codebases or books, but naive self-attention's compute cost grows quadratically with sequence length.

D
3 terms

Diffusion Model

A generative model that creates data (typically images) by learning to reverse a gradual noising process. During training, noise is incrementally added to data; the model learns to remove it step by step. Stable Diffusion, DALL-E 3, and Midjourney all use diffusion-based architectures.

Distillation (Knowledge Distillation)

A technique where a smaller 'student' model is trained to mimic the behavior of a larger 'teacher' model. The student learns from the teacher's output probability distributions (soft labels) rather than just hard labels. This produces smaller, faster models that retain much of the teacher's capability.

DPO (Direct Preference Optimization)

A simpler alternative to RLHF for aligning language models with human preferences. DPO skips the reward model entirely and directly optimizes the language model using preference pairs (chosen vs rejected responses). It's mathematically equivalent to RLHF under certain conditions but is much easier to implement and more stable to train.
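The DPO loss for a single preference pair can be written in a few lines. This is a simplified sketch using summed token log-probabilities; a real implementation operates on batches of tensors:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given log-probabilities of the
    chosen/rejected responses under the policy and the frozen reference
    model. beta controls how far the policy may drift from the reference."""
    # Implicit reward margin: how much more the policy favors the chosen
    # response than the rejected one, relative to the reference model.
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    # Loss is -log(sigmoid(beta * margin)): it shrinks as the margin grows.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# No preference signal yet (zero margin) gives loss log(2) ≈ 0.693.
print(dpo_loss(0.0, 0.0, 0.0, 0.0))
```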

E
3 terms

Embedding

A dense vector representation of text (or images, audio, etc.) in a continuous high-dimensional space. Similar items cluster together in embedding space, enabling semantic search, clustering, and retrieval. Modern embedding models like OpenAI's text-embedding-3 or Cohere's embed produce vectors of 768-3072 dimensions.
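Similarity in embedding space is usually measured with cosine similarity. A minimal sketch with tiny made-up 3-dimensional vectors (real embeddings have hundreds to thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors: semantically related items point in similar directions.
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.82, 0.15]
banana = [0.1, 0.2, 0.95]

print(cosine_similarity(king, queen))   # close to 1.0
print(cosine_similarity(king, banana))  # much lower
```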

Encoder-Decoder

A Transformer architecture with two parts: an encoder that reads and compresses the input, and a decoder that generates the output. T5 and BART use this architecture. In contrast, GPT is decoder-only and BERT is encoder-only. Encoder-decoder models excel at tasks like translation and summarization.

Evaluation (LLM)

The process of measuring an LLM's performance across various dimensions: accuracy, coherence, safety, helpfulness, and instruction-following. Common benchmarks include MMLU, HumanEval, and MT-Bench. Increasingly, LLM-as-Judge approaches use a stronger model to evaluate a weaker one.

F
4 terms

Few-Shot Learning

A prompting technique where you provide a small number of input-output examples in the prompt to show the model the desired behavior. Typically 2-5 examples are sufficient. Few-shot learning leverages in-context learning — the model generalizes from the examples without any weight updates.

Fine-Tuning

The process of further training a pre-trained model on a specific dataset to adapt it for a particular task or domain. Full fine-tuning updates all model weights, while parameter-efficient methods like LoRA update only a small subset. Fine-tuning is how general foundation models become specialized for specific use cases.

Foundation Model

A large model trained on broad data at scale that can be adapted to a wide range of downstream tasks. The term was coined by Stanford's Center for Research on Foundation Models. GPT-4, Claude, LLaMA, and Gemini are all foundation models. The key insight is that pre-training on diverse data creates general capabilities that transfer.

Function Calling (Tool Use)

A capability that allows LLMs to generate structured JSON outputs that invoke external functions or APIs. Instead of answering directly, the model outputs a function name and arguments, which the application executes and feeds back as context. This enables LLMs to search the web, query databases, perform calculations, and take real-world actions.
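The round trip looks roughly like this. The exact field names vary by provider, and `get_weather` is an invented stand-in for a real API, but the shape — model emits a call, application executes it, result goes back as context — is the same everywhere:

```python
import json

# Hypothetical structured output the model might emit instead of an answer.
model_output = {
    "name": "get_weather",
    "arguments": {"city": "Tokyo", "unit": "celsius"},
}

def get_weather(city, unit):
    # Stand-in for a real weather API call.
    return {"city": city, "temp": 21, "unit": unit}

# The application — not the model — executes the function...
result = get_weather(**model_output["arguments"])

# ...and serializes the result back into the conversation as context,
# so the model can compose its final answer from real data.
tool_message = json.dumps(result)
print(tool_message)
```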

G
4 terms

GAN (Generative Adversarial Network)

A generative model architecture consisting of two neural networks — a generator and a discriminator — that compete against each other. The generator creates fake data while the discriminator tries to distinguish real from fake. GANs were revolutionary for image generation before diffusion models largely replaced them.

GPT (Generative Pre-trained Transformer)

OpenAI's family of autoregressive language models. GPT-3 (2020) demonstrated few-shot learning at scale. GPT-4 (2023) introduced multi-modal capabilities and dramatically improved reasoning. The GPT architecture — decoder-only Transformer with next-token prediction pre-training — has become the dominant paradigm for LLMs.

Grounding

Connecting an LLM's responses to verified external data sources to improve factual accuracy and reduce hallucinations. Grounding techniques include RAG (retrieving relevant documents), web search integration, and tool use for real-time data access. Google's Gemini and Perplexity AI heavily emphasize grounding.

Guardrails

Programmatic safety checks and constraints placed around LLM inputs and outputs to prevent harmful, off-topic, or policy-violating content. Guardrails can be implemented via system prompts, output classifiers, regex filters, or dedicated models like NVIDIA's NeMo Guardrails. They are essential for production deployments.

H
2 terms

Hallucination

When an LLM generates plausible-sounding but factually incorrect or fabricated information. Hallucinations arise because LLMs are trained to produce statistically likely text, not verified truth. Types include factual errors, citation fabrication, and confident but wrong reasoning. Reducing hallucinations is one of the biggest open challenges in the field.

Hugging Face

The leading open-source platform for machine learning models, datasets, and tools. Hugging Face Hub hosts 500K+ models and 100K+ datasets. Their Transformers library is the standard way to load and use pre-trained models. They also offer Inference Endpoints, Spaces for demos, and enterprise features.

I
3 terms

In-Context Learning (ICL)

The ability of LLMs to learn new tasks from examples provided in the prompt, without any weight updates. When you provide few-shot examples, the model uses attention to identify the pattern and generalize. ICL is an emergent capability that improves with model scale and was first prominently demonstrated in GPT-3.

Inference

The process of running a trained model to generate predictions or outputs for new inputs. In the LLM context, inference means feeding a prompt into the model and getting a completion. Inference cost and latency are major production concerns — optimizations include quantization, KV caching, batching, and speculative decoding.

Instruction Tuning

A fine-tuning approach where a pre-trained model is trained on a dataset of (instruction, response) pairs to make it follow natural language instructions. Instruction tuning is what transforms a raw language model (which just predicts next tokens) into a helpful assistant. The quality and diversity of instruction data heavily influence the model's capabilities.

J
1 term

JSON Mode

A model configuration that constrains the LLM to output valid JSON. Available in OpenAI's API and others, JSON mode ensures structured, parseable outputs for application integration. It's essential for function calling, data extraction pipelines, and any system that needs to programmatically consume LLM outputs.

K
2 terms

Knowledge Distillation

See 'Distillation' — the process of training a smaller student model to replicate the behavior of a larger teacher model. The student learns from the teacher's soft probability outputs rather than hard labels, capturing richer information about the teacher's learned representations.

KV Cache (Key-Value Cache)

An optimization for autoregressive Transformer inference that stores the key and value tensors from previously computed attention layers. Without KV caching, the model would recompute attention for all previous tokens at each step. The cache trades memory for speed — it's why LLM inference requires large GPU memory even for small batch sizes.

L
4 terms

LangChain

A popular open-source framework for building applications powered by LLMs. LangChain provides abstractions for chains (sequential LLM calls), agents (models that decide which tools to use), memory, and retrieval. It supports multiple LLM providers and has become the default framework for prototyping RAG and agent systems.

Large Language Model (LLM)

A neural network with billions of parameters trained on massive text corpora using next-token prediction. LLMs like GPT-4, Claude, and LLaMA demonstrate emergent capabilities — reasoning, code generation, and creative writing — that weren't explicitly programmed. The 'large' refers to both parameter count (7B to 1T+) and training data (trillions of tokens).

LLM-as-Judge

An evaluation paradigm where a powerful LLM (like GPT-4 or Claude) is used to evaluate the outputs of other models. The judge model scores responses on criteria like helpfulness, accuracy, and safety. This approach scales better than human evaluation but introduces biases (self-preference, position bias, verbosity bias).

LoRA (Low-Rank Adaptation)

A parameter-efficient fine-tuning technique that adds small trainable low-rank matrices to frozen model weights. Instead of updating all billions of parameters, LoRA typically trains less than 1% of the total parameters while achieving comparable performance. This makes fine-tuning accessible on consumer GPUs.
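The core idea — a frozen weight matrix plus a trained low-rank update, W' = W + (α/r)·BA — fits in a few lines. A toy sketch with 2×2 matrices (real layers have thousands of dimensions per side):

```python
def matmul(A, B):
    """Plain-Python matrix multiply for the tiny example below."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Frozen base weight W (never updated during LoRA fine-tuning).
W = [[1.0, 0.0],
     [0.0, 1.0]]

# Rank-1 adapters: B is 2x1, A is 1x2. Their product B @ A is a full
# 2x2 update, but only 4 numbers were trained instead of all of W.
B_lora = [[0.5], [0.0]]
A_lora = [[0.0, 1.0]]
alpha, r = 2.0, 1  # scaling hyperparameters from the LoRA paper

delta = matmul(B_lora, A_lora)
scale = alpha / r
# Merge the scaled update into the frozen weight for inference.
W_merged = [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]
print(W_merged)  # [[1.0, 1.0], [0.0, 1.0]]
```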

M
4 terms

MCP (Model Context Protocol)

An open protocol developed by Anthropic that standardizes how AI models connect to external data sources and tools. MCP provides a universal interface for context integration — similar to how USB standardized device connections. It enables models to access files, databases, APIs, and development tools through a consistent protocol.

MIG (Multi-Instance GPU)

An NVIDIA technology that partitions a single GPU into multiple isolated instances, each with dedicated compute, memory, and cache. MIG is available on A100 and H100 GPUs and is essential for multi-tenant inference serving, allowing different models or users to share a GPU without interference.

MMLU (Massive Multitask Language Understanding)

A benchmark consisting of 57 academic subjects (from STEM to humanities) used to evaluate LLMs' knowledge and reasoning abilities. MMLU scores are frequently cited in model comparisons — GPT-4 scores ~86%, Claude 3.5 Sonnet scores ~88%. While widely used, MMLU is increasingly criticized for data contamination and question quality.

Multi-Modal

AI models that can process and generate multiple types of data — text, images, audio, video, or code. GPT-4V, Gemini, and Claude 3 are multi-modal: they can analyze images, understand charts, and read documents. Multi-modal capabilities are essential for real-world applications where information comes in diverse formats.

N
3 terms

Neural Network

A computational model inspired by biological neurons, consisting of interconnected layers of nodes (neurons) that learn to transform inputs into outputs through training. Deep neural networks (with many layers) are the foundation of modern AI. Transformers, CNNs, and RNNs are all types of neural network architectures.

NLP (Natural Language Processing)

The broad field of AI concerned with the interaction between computers and human language. NLP encompasses tasks like text classification, sentiment analysis, translation, summarization, and question answering. While LLMs have revolutionized NLP, the field predates them by decades and includes rule-based and statistical approaches.

NVIDIA

The dominant hardware provider for AI training and inference. NVIDIA's GPUs (A100, H100, B200) and CUDA software ecosystem power the vast majority of LLM workloads. Their market cap exceeded $3 trillion in 2024, at times making them the most valuable company in the world — a reflection of AI's compute demands.

O
2 terms

OpenAI

The AI research company that created GPT, ChatGPT, DALL-E, and Whisper. Founded in 2015 as a non-profit, it transitioned to a capped-profit model and became the most prominent commercial AI lab. OpenAI's API is the most widely used LLM API, and their models consistently rank among the most capable available.

ORPO (Odds Ratio Preference Optimization)

A preference optimization technique that combines SFT and alignment into a single training step. Unlike DPO which requires a separate SFT phase, ORPO uses an odds ratio-based penalty to directly align the model during supervised fine-tuning. This reduces training time and computational cost while achieving competitive results.

P
4 terms

PEFT (Parameter-Efficient Fine-Tuning)

A family of techniques that fine-tune only a small subset of a model's parameters while keeping the rest frozen. PEFT methods include LoRA, QLoRA, prefix tuning, and adapters. They reduce memory requirements by 10-100x compared to full fine-tuning, making it feasible to customize large models on modest hardware.

Perplexity

A metric that measures how well a language model predicts a sample of text. Lower perplexity means the model assigns higher probability to the actual text, indicating better language modeling. Perplexity is also the name of a popular AI search engine that uses LLMs to generate cited answers from web sources.
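The metric itself is just the exponential of the average negative log-probability the model assigned to each actual token:

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to the
    tokens that actually occurred. Lower is better."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that assigns probability 0.25 to every observed token has
# perplexity 4: it is "as confused as" a uniform choice among 4 options.
print(perplexity([0.25, 0.25, 0.25]))  # 4.0
```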

Prompt Engineering

The practice of crafting inputs (prompts) to elicit desired outputs from LLMs. Techniques include role prompting, chain-of-thought, few-shot examples, structured output formatting, and constraint specification. Effective prompt engineering can dramatically improve output quality without any model changes.

Prompt Injection

A security vulnerability where malicious text in the input causes the LLM to ignore its instructions and follow the attacker's commands instead. Direct injection targets the user input; indirect injection hides malicious instructions in retrieved documents or tool outputs. No complete defense exists today.

Q
2 terms

QLoRA (Quantized Low-Rank Adaptation)

A technique that combines 4-bit quantization of the base model with LoRA fine-tuning. The base model is loaded in 4-bit precision (reducing memory by 4x), while LoRA adapters are trained in full precision. QLoRA enables fine-tuning a 65B parameter model on a single 48GB GPU — democratizing access to LLM customization.

Quantization

The process of reducing the numerical precision of model weights (e.g., from 16-bit float to 4-bit integer). Quantization shrinks model size and speeds up inference with minimal quality loss. Common methods include GPTQ, AWQ, and bitsandbytes. A 70B model in 4-bit quantization can run on consumer hardware.
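A minimal sketch of symmetric round-to-nearest 4-bit quantization — far simpler than GPTQ or AWQ, but it shows the core trade: each weight shrinks from 16 bits to 4, at a small accuracy cost:

```python
def quantize_4bit(weights):
    """Symmetric round-to-nearest quantization to the int4 range.
    Returns the integer codes plus the scale needed to decode them."""
    scale = max(abs(w) for w in weights) / 7  # int4 holds -8..7
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.21, -0.7, 0.07, 0.33]
q, scale = quantize_4bit(weights)
restored = dequantize(q, scale)
print(q)                                 # small integers, 4 bits each
print([round(w, 2) for w in restored])   # close to the originals
```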

R
3 terms

RAG (Retrieval-Augmented Generation)

An architecture that enhances LLM responses by first retrieving relevant documents from an external knowledge base, then feeding them as context to the model. RAG reduces hallucinations, enables access to private/recent data, and is more cost-effective than fine-tuning for factual tasks. It's the most common production LLM architecture.

ReAct Pattern (Reason + Act)

An agent architecture where the LLM alternates between reasoning (thinking about what to do) and acting (calling tools or taking actions). At each step, the model: (1) observes the current state, (2) thinks about the next action, (3) executes the action, (4) observes the result. This loop continues until the task is complete.
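The loop above can be sketched as runnable code. In a real agent the LLM produces each thought and action; here a scripted stand-in policy does, so only the control flow is real:

```python
def calculator(expression):
    # Toy tool. Never eval untrusted input in real systems.
    return str(eval(expression))

TOOLS = {"calculator": calculator}

def scripted_policy(observation):
    """Stand-in for the LLM: pick the next thought and action."""
    if observation is None:
        return ("I need to compute 17 * 23.", "calculator", "17 * 23")
    return (f"The result is {observation}. I can answer now.",
            "finish", observation)

def react_loop():
    observation = None
    while True:
        # Think about the current state, then choose an action.
        thought, action, arg = scripted_policy(observation)
        print("Thought:", thought)
        if action == "finish":
            return arg
        # Act with a tool, then observe the result on the next pass.
        observation = TOOLS[action](arg)

print(react_loop())  # 391
```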

RLHF (Reinforcement Learning from Human Feedback)

A training technique that aligns LLMs with human preferences using reinforcement learning. The process: (1) collect human preference data (which response is better?), (2) train a reward model on these preferences, (3) use PPO to optimize the LLM against the reward model. RLHF is how ChatGPT and Claude became helpful assistants.

S
5 terms

Sampling

The method by which an LLM selects the next token from its predicted probability distribution. Greedy sampling always picks the highest-probability token; random sampling introduces randomness. Temperature, top-k, and top-p (nucleus sampling) are parameters that control the randomness-creativity tradeoff.

Self-Attention

The specific attention mechanism used within Transformers where each token in a sequence attends to every other token (including itself) to compute its representation. Self-attention enables the model to capture dependencies regardless of distance in the sequence — a word at position 1 can directly attend to a word at position 1000.

SFT (Supervised Fine-Tuning)

The first stage of post-training alignment where a pre-trained model is fine-tuned on high-quality (instruction, response) pairs. SFT teaches the model to follow instructions and respond in a helpful, conversational manner. It typically precedes RLHF or DPO in the alignment pipeline.

Speculative Decoding

An inference optimization where a smaller, faster 'draft' model generates candidate tokens that are then verified in parallel by the larger target model. Since verification is faster than generation (it can be batched), this speeds up inference by 2-3x without changing output quality. Used in production by several frontier labs.

System Prompt

A special prompt placed at the beginning of the conversation that defines the AI's role, personality, constraints, and behavior guidelines. System prompts are invisible to the end user but heavily influence model behavior. They are the primary mechanism for customizing AI assistants for specific applications.

T
7 terms

Temperature

A sampling parameter that controls the randomness of LLM outputs. Temperature scales the logits before softmax: temperature=0 gives deterministic (greedy) output, temperature=1 gives the default distribution, and higher values increase randomness. Lower temperatures are best for factual tasks; higher for creative ones.
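The logit-scaling step is easy to show directly. With made-up logits, lower temperature visibly sharpens the distribution and higher temperature flattens it:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # toy next-token logits
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
# Low temperature concentrates mass on the top token (greedy-like);
# high temperature pushes the distribution toward uniform.
```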

Token

The fundamental unit of text that LLMs process. Tokens are typically subword units — 'unhappiness' might be split into ['un', 'happiness']. On average, 1 token is roughly 4 characters or 0.75 words in English. LLM pricing, context windows, and generation speed are all measured in tokens.

Tokenizer

The algorithm that converts raw text into tokens (and vice versa). Different models use different tokenizers — GPT uses tiktoken (BPE-based), LLaMA uses SentencePiece. Tokenizer choice affects model efficiency: a tokenizer trained on English may be 2-3x less efficient on non-English languages, increasing costs.

Tool Use

The capability of LLMs to interact with external tools and APIs to extend their abilities beyond text generation. Tool use enables models to execute code, search the web, query databases, send emails, and manipulate files. It's the bridge between language understanding and real-world action.

Top-k

A sampling method that restricts token selection to the k most probable next tokens. For example, top-k=50 means only the 50 highest-probability tokens are considered at each step, with the rest zeroed out. This prevents the model from selecting very unlikely tokens while maintaining diversity.

Top-p (Nucleus Sampling)

A sampling method that selects from the smallest set of tokens whose cumulative probability exceeds p. For example, top-p=0.9 means the model considers enough tokens to cover 90% of the probability mass. Unlike top-k (which is a fixed count), top-p adapts — it selects fewer tokens when the model is confident and more when uncertain.
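Selecting the nucleus is a short loop over tokens sorted by probability (the distributions below are invented; a real sampler would then renormalize and sample from the kept set):

```python
def nucleus(probs, p=0.9):
    """Smallest set of tokens whose cumulative probability reaches p,
    in descending-probability order."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append(token)
        cumulative += prob
        if cumulative >= p:
            break
    return kept

# Confident distribution: the nucleus is tiny.
print(nucleus({"cat": 0.85, "dog": 0.10, "hat": 0.05}, p=0.9))
# Uncertain distribution: the nucleus adapts and grows.
print(nucleus({"a": 0.3, "b": 0.3, "c": 0.2, "d": 0.2}, p=0.9))
```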

Transformer

The neural network architecture introduced in the 2017 paper 'Attention Is All You Need' that underpins virtually all modern LLMs. Transformers use self-attention to process entire sequences in parallel (unlike RNNs which process sequentially). This parallelism enables efficient training on massive datasets and is the key reason LLMs scale so well.

V
3 terms

Vector Database

A specialized database optimized for storing and querying high-dimensional vector embeddings. Vector databases use approximate nearest neighbor (ANN) algorithms like HNSW or IVF to find similar vectors efficiently. Popular options include Pinecone, Weaviate, Chroma, and pgvector. They are the storage backbone of RAG systems.

Vector Embedding

A numerical representation of data (text, image, audio) as a fixed-length array of floating-point numbers. Semantically similar items produce similar vectors, enabling mathematical comparison. Common dimensions range from 384 to 3072. The quality of embeddings directly determines the quality of retrieval in RAG systems.

vLLM

A high-performance open-source library for LLM inference serving. vLLM's key innovation is PagedAttention, which manages KV cache memory like an operating system manages virtual memory — eliminating memory waste from fragmentation. vLLM achieves 2-4x higher throughput than naive implementations and is widely used in production.

W
2 terms

Weights

The learned numerical parameters of a neural network that determine its behavior. An LLM like LLaMA 70B has 70 billion weights. During training, weights are adjusted via gradient descent to minimize the loss function. During inference, weights are fixed and used to compute predictions. Model size is typically measured by weight count.

Word2Vec

A pioneering word embedding technique from Google (2013) that learned vector representations of words such that semantic relationships were captured as vector arithmetic (e.g., 'king' - 'man' + 'woman' = 'queen'). While superseded by contextual embeddings from models like BERT, Word2Vec established the conceptual foundation for all modern embedding methods.

Z
1 term

Zero-Shot Learning

The ability of a model to perform a task it was never explicitly trained on, without any examples in the prompt. For instance, a model can classify sentiment or translate languages from just a natural language instruction. Zero-shot capability is an emergent property of large-scale pre-training — larger models are significantly better at zero-shot tasks.

Ready to go deeper?

These terms come alive when you apply them. Pick a learning path and start building real GenAI skills.