Vector data (Juan - Live)
Moved to PR #270. See preview.
The content has been updated since!
CrateDB natively supports vector embeddings for efficient similarity search using k-nearest neighbour (kNN) algorithms. This makes it a powerful engine for building AI-powered applications involving semantic search, recommendations, anomaly detection, and multimodal analytics, all in the simplicity of SQL.
Whether you’re working with text, images, sensor data, or any domain represented as high-dimensional embeddings, CrateDB enables real-time vector search at scale, in combination with other data types like full-text, geospatial, and time-series.
Data Type: VECTOR
CrateDB has a native VECTOR
type with the following key characteristics:
Fixed-length float arrays (e.g. 768, 1024, 2048 dimensions)
Supports HNSW (Hierarchical Navigable Small World) indexing for fast approximate search
Optimized for cosine, Euclidean, and dot-product similarity
Example: Define a Table with Vector Embeddings
CREATE TABLE documents (
id UUID PRIMARY KEY,
title TEXT,
content TEXT,
embedding VECTOR(FLOAT[768])
);
VECTOR(FLOAT[768])
declares a fixed-size vector column.You can ingest vectors directly or compute them externally and store them via SQL
Querying Vectors with SQL
Use the nearest_neighbors
predicate to perform similarity search:
SELECT id, title, content
FROM documents
ORDER BY embedding <-> [0.12, 0.73, ..., 0.01]
LIMIT 5;
This ranks results by vector similarity using the index.
Or, filter and rank by proximity:
SELECT id, title, content, embedding <-> [0.12, ..., 0.01] AS score
FROM documents
WHERE MATCH(content_ft, 'machine learning') AND author = 'Alice'
ORDER BY score
LIMIT 10;
Ingestion: Working with Embeddings
You can ingest vectors in several ways:
Precomputed embeddings from models like OpenAI, HuggingFace, or SentenceTransformers:
INSERT INTO documents (id, title, embedding) VALUES ('uuid-123', 'AI and Databases', [0.12, 0.34, ..., 0.01]);
Batched imports via
COPY FROM
using JSON or CSVCrateDB doesn't currently compute embeddings internally—you bring your own model or use pipelines that call CrateDB.
Performance & Scaling
Vector search uses HNSW: state-of-the-art ANN algorithm with logarithmic search complexity.
CrateDB parallelizes ANN search across shards/nodes.
Ideal for 100K to tens of millions of vectors; supports real-time ingestion and queries.
Integrations
Python / pandas / LangChain: CrateDB has native drivers and REST interface
Embedding models: Use OpenAI, HuggingFace, Cohere, or in-house models
RAG architecture: CrateDB stores vector + metadata + raw text in a unified store
Further Learning & Resources
CrateDB Docs – Vector Search
Blog: Using CrateDB for Hybrid Search (Vector + Full-Text)
CrateDB Academy – Vector Data
Last updated