AI integration
CrateDB is not just a real-time analytics database; it is also a powerful platform for feeding and interacting with machine learning models, thanks to its ability to store, query, and transform structured, unstructured, and vector data at scale using standard SQL.
Whether you're training models, running batch or real-time inference, or integrating with AI pipelines, CrateDB offers:
High-performance ingestion for time-series and sensor data
Real-time queries across structured and semi-structured data
SQL-powered transformations and filtering
Native support for embeddings via FLOAT_VECTOR
Why CrateDB for ML Use Cases?
Real-time ingestion: ingest millions of records per second from IoT, logs, or user behavior
Unified data: mix structured, full-text, vector, and JSON data
FLOAT_VECTOR: store and query high-dimensional embeddings
SQL transforms: use SQL for preprocessing, feature extraction, and filtering
ML integration: use CrateDB as a feature store or inference backend
Python & LangChain support: easily connect to training and inference pipelines
Common Machine Learning Patterns
Feature Engineering
Use SQL to build features dynamically from raw data.
SELECT
  user_id,
  AVG(duration) AS avg_session,
  COUNT(DISTINCT page) AS page_diversity
FROM sessions
GROUP BY user_id;
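For illustration, here is what that aggregation computes, sketched in plain Python over a few hypothetical session rows (in practice CrateDB does this server-side, at scale):

```python
from collections import defaultdict

# Hypothetical session rows: (user_id, duration, page)
sessions = [
    ("u1", 30, "home"),
    ("u1", 90, "pricing"),
    ("u2", 60, "home"),
]

# Mirror of the SQL: AVG(duration) and COUNT(DISTINCT page) per user
durations = defaultdict(list)
pages = defaultdict(set)
for user_id, duration, page in sessions:
    durations[user_id].append(duration)
    pages[user_id].add(page)

features = {
    user: {
        "avg_session": sum(d) / len(d),
        "page_diversity": len(pages[user]),
    }
    for user, d in durations.items()
}
```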
Training Dataset Extraction
Efficiently extract and filter relevant training data.
SELECT *
FROM telemetry
WHERE temperature > 80
  AND error_code IS NOT NULL
  AND ts BETWEEN NOW() - INTERVAL '7 days' AND NOW();
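As a rough Python equivalent of that filter (hypothetical in-memory rows and a fixed clock for the example; the real query runs inside CrateDB):

```python
from datetime import datetime, timedelta

now = datetime(2024, 1, 8)  # stands in for NOW() in the example
window_start = now - timedelta(days=7)

# Hypothetical telemetry rows: (temperature, error_code, ts)
telemetry = [
    (85.0, "E42", datetime(2024, 1, 7)),   # matches all three conditions
    (85.0, None,  datetime(2024, 1, 7)),   # no error code
    (70.0, "E42", datetime(2024, 1, 7)),   # temperature too low
    (90.0, "E13", datetime(2023, 12, 1)),  # outside the 7-day window
]

training_rows = [
    row for row in telemetry
    if row[0] > 80 and row[1] is not None and window_start <= row[2] <= now
]
```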
Store Embeddings
Save vector representations for documents or entities.
CREATE TABLE article_embeddings (
  id TEXT PRIMARY KEY,  -- e.g. from gen_random_text_uuid(); CrateDB has no dedicated UUID type
  content TEXT,
  embedding FLOAT_VECTOR(384)
);
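Vectors stored in a FLOAT_VECTOR column are typically compared by cosine similarity. A minimal, dependency-free sketch (illustrative 3-dimensional vectors rather than 384):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```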
Use CrateDB as a Feature Store
Centralize your features and use them in production models.
SELECT *
FROM user_features
WHERE last_active > NOW() - INTERVAL '1 day';
When to Use CrateDB in ML Pipelines
Feature store: store pre-computed features with SQL access
Real-time inference: serve vector-based results with KNN_MATCH
Experimentation: use SQL for fast slicing, filtering, and aggregations
Monitoring: track model performance, drift, or input quality
Data collection: capture telemetry, events, logs, and raw user data
Architecture Examples
Model Training Pipeline
[ Ingestion (sensors, APIs) ]
↓
[ CrateDB ]
(Real-time data lake)
↓
[ Python / Spark / Pandas ]
(Feature engineering, training)
↓
[ Model Registry / Serving ]
Real-Time Inference with Hybrid Queries
[ CrateDB ]
- JSON filters
- Full-text search
- FLOAT_VECTOR support
↓
[ SQL + KNN_MATCH ]
↓
[ Application Response ]
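The hybrid pattern above, filter on structured fields first and then rank by vector similarity, can be sketched in Python with a few hypothetical documents (inside CrateDB, the same idea is a WHERE clause combined with KNN_MATCH):

```python
# Hypothetical documents: metadata plus a pre-normalized embedding
docs = [
    {"id": 1, "lang": "en", "embedding": [1.0, 0.0]},
    {"id": 2, "lang": "de", "embedding": [0.0, 1.0]},
    {"id": 3, "lang": "en", "embedding": [0.6, 0.8]},
]

def hybrid_search(query_vec, lang, k):
    # 1. Structured filter (in SQL: WHERE lang = ?)
    candidates = [d for d in docs if d["lang"] == lang]
    # 2. Vector ranking by dot product (in SQL: KNN_MATCH, ordered by score)
    scored = sorted(
        candidates,
        key=lambda d: sum(q * e for q, e in zip(query_vec, d["embedding"])),
        reverse=True,
    )
    return [d["id"] for d in scored[:k]]
```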
Performance Tips for ML Scenarios
Use FLOAT_VECTOR for embeddings: store and access high-dimensional vectors efficiently
Normalize vectors before inference: improves ANN accuracy
Index only what you need: reduces overhead
Combine hybrid filters with vector search: improves performance and precision
Train offline, infer online: CrateDB is ideal for live inference from pre-trained models
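The normalization tip can be sketched as follows (pure Python; in practice you would normalize with numpy before writing embeddings to CrateDB):

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]
```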
Ecosystem & Integration
Python: crate-python for pandas and scikit-learn workflows
LangChain: native CrateDB vector store
Jupyter: ideal for experimentation and model development
OpenAI, Cohere, etc.: store and search their embeddings via SQL
Kafka: connect for real-time ingestion and prediction
Further Reading & Resources