Unlocking AI Memory: A Deep Dive into Embeddings, Vector Stores, and RAG

2026-04-26

Large Language Models (LLMs) are powerful reasoning engines, but out of the box they leave three gaps for enterprise use: they have no knowledge of your private data, they can only read what fits in a limited context window, and the keyword search traditionally used to find relevant documents misses synonyms and paraphrases. To make AI truly useful for enterprise data, we must move beyond traditional keyword search and leverage Retrieval-Augmented Generation (RAG), which retrieves the most relevant passages by meaning and passes only those to the model.

In this post, we will explore the core concepts of RAG architectures covered in the Week 2 materials, including text embeddings, chunking strategies, and advanced indexing.

The Foundation: Text Embeddings and Vector Databases

To enable an LLM to search through private data based on meaning rather than exact keywords, we use Text Embeddings.

An embedding model (such as OpenAI's text-embedding-3-small) acts as an encoder that translates human text into a high-dimensional vector. For example, text-embedding-3-small converts a piece of text into an array of 1536 numbers by default. Representing text as numbers places it in a "vector space" where semantically similar concepts sit closer together, so "car" and "automobile" end up near each other even though they share no keywords.

To store and query these vectors, we use vector databases such as ChromaDB, Pinecone, or Weaviate, or a similarity-search library such as FAISS. A vector store keeps each embedding alongside the original text chunk and any metadata you attach to it, which can later be used to filter or cite results. When a user asks a question:

  • The user's query is converted into a vector.
  • The system compares the query vector with the stored vectors, typically using cosine similarity (the cosine of the angle between two vectors, where a value closer to 1 means a closer semantic match), and ranks the stored chunks by that score.
  • The most relevant chunks are retrieved and passed to the LLM as context, so that its response is grounded in the retrieved text.
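
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular vector database works internally: `toy_embed`, `VOCABULARY`, and the sample chunks are hypothetical stand-ins, and the word-count "embedding" does not capture synonyms the way a real embedding model does. Only the scoring and top-K logic carries over.

```python
import math
from collections import Counter

def toy_embed(text, vocabulary):
    """Hypothetical stand-in for an embedding model: word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vector, store, k=2):
    """store is a list of (chunk_text, vector) pairs; returns the k most similar chunks."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Illustrative data only.
VOCABULARY = ["refund", "policy", "days", "shipping", "password", "reset"]
CHUNKS = [
    "refund policy allows returns within 30 days",
    "shipping takes 5 business days",
    "reset your password from the login page",
]

# Offline step: embed every chunk and keep the vector next to its text.
STORE = [(chunk, toy_embed(chunk, VOCABULARY)) for chunk in CHUNKS]

# Online step: embed the query, score it against the store, keep the best match.
query_vector = toy_embed("how do I reset my password", VOCABULARY)
results = top_k(query_vector, STORE, k=1)
print(results)
```

In a real system, the retrieved chunks would then be placed into the LLM prompt as context.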

The Art of Chunking: Breaking Down Documents

You cannot feed a massive textbook or thousands of support tickets directly into an embedding model. Chunking is the critical offline process of splitting long documents into smaller, manageable units before they are embedded and stored.

The Week 2 materials highlight several primary chunking strategies:

  • Fixed-size chunking: The most straightforward method, which splits text into uniform character or token segments. To prevent cutting off important context at the boundaries, this method usually includes an overlap between chunks.
  • Recursive chunking: A common default strategy. It splits text hierarchically on natural boundaries, starting with paragraph breaks, then line breaks, then sentences, and finally words, and only falls back to a finer boundary when a piece is still larger than the size limit.
  • Semantic chunking: Instead of relying on character counts, this method uses the embedding model itself to detect when the actual topic or context changes, splitting the document precisely where there is a drop in semantic similarity.
  • Document structure-based chunking: This approach utilizes the inherent structure of the document, such as Markdown headers, HTML tags, or sections, to define boundaries.
  • LLM-based chunking: The most expensive method. It prompts an LLM to split the document into self-contained chunks, which can produce more coherent boundaries at the cost of consuming many tokens.
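
To make the first two strategies concrete, here is a minimal sketch of fixed-size chunking with overlap and of recursive chunking. The function names, the separator list, and the character-based size limit are assumptions chosen for illustration; libraries that implement these strategies typically count tokens and handle more edge cases. One simplification to note: when a piece is not merged back with its neighbor, the separator at that boundary is dropped.

```python
def fixed_size_chunks(text, chunk_size=200, overlap=50):
    """Split text into character windows; consecutive windows share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

# Coarsest boundary first: paragraphs, lines, sentences, words (an assumed ordering).
SEPARATORS = ("\n\n", "\n", ". ", " ")

def recursive_chunks(text, max_chars=200, separators=SEPARATORS):
    """Split on the coarsest separator; recurse with finer ones for oversized pieces."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No natural boundary left: fall back to a hard character cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    head, *rest = separators
    chunks, current = [], ""
    for piece in text.split(head):
        candidate = piece if not current else current + head + piece
        if len(candidate) <= max_chars:
            current = candidate  # keep merging small neighboring pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            # Still too large: retry this piece with the next, finer separator.
            chunks.extend(recursive_chunks(piece, max_chars, tuple(rest)))
    if current:
        chunks.append(current)
    return chunks

print(fixed_size_chunks("abcdefghij", chunk_size=4, overlap=2))
print(recursive_chunks("para one is short.\n\nsecond paragraph here", max_chars=20))
```

In the second call, the first paragraph fits the limit and is kept whole, while the second paragraph is too long and is split at a word boundary instead.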

Indexing Strategies: Organizing Data for Retrieval

Creating chunks and embeddings is only half the battle; how you organize those chunks dictates how fast and accurate your RAG system will be. Frameworks like LlamaIndex provide abstractions to index your data effectively.

Here are the five core indexing strategies to know:

  • Vector Index (Flat): The most common starting point for RAG. It embeds everything and performs a straightforward similarity search to return the top-K nearest chunks.
  • Summary Index: Stores full documents and relies on the LLM to scan and evaluate relevance.
  • Tree Index: A hierarchical approach in which summaries at the top lead down to the detailed chunks beneath them. It is well-suited to large datasets with a natural hierarchy, such as books or extensive employee handbooks.
  • Keyword Table Index: Operates like a traditional inverted index where an LLM or rules-engine extracts keywords from the documents, allowing for exact-match keyword routing.
  • Hybrid Retrieval: A widely recommended approach for production RAG systems. It combines the semantic matching of Vector Indexes with the exact matching of Keyword Indexes. The query is sent to both indexes, and the two ranked result lists are merged using techniques like Reciprocal Rank Fusion (RRF), which scores each chunk by its position in each list rather than by raw scores. This tends to be more accurate and robust than either index alone, because each one catches matches the other misses.
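
The merge step in hybrid retrieval can be illustrated with a short sketch of Reciprocal Rank Fusion. Each document's fused score is the sum of 1 / (k + rank) over every list it appears in. The constant k = 60 is a commonly used default rather than a required value, and the document ids and the two result lists below are made up for the example.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks start at 1
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical results from the two indexes, best match first.
vector_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that rank well in both lists (doc_a and doc_c here) rise to the top, while documents found by only one index are kept but ranked lower. Because only ranks are used, the two indexes do not need comparable score scales.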

By mastering embeddings, chunking your documents intelligently, and selecting the right indexing strategy, you can build AI agents that answer from your private data accurately and efficiently.