AI integration

CrateDB is not just a real-time analytics database: it is also a powerful platform for feeding and interacting with machine learning models, thanks to its ability to store, query, and transform structured, unstructured, and vector data at scale using standard SQL.

Whether you're training models, running batch or real-time inference, or integrating with AI pipelines, CrateDB offers:

  • High-ingestion performance for time-series or sensor data

  • Real-time queries across structured and semi-structured data

  • SQL-powered transformations and filtering

  • Native support for embeddings via FLOAT_VECTOR


For more details on how CrateDB handles similarity search and embeddings, see the Vector Search use case.
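As a minimal sketch of the FLOAT_VECTOR support, a table for embeddings could be declared as follows. The table name, column names, and dimension (384) are illustrative; executing the DDL assumes the crate Python client and a reachable CrateDB node.

```python
# Minimal sketch: declaring a FLOAT_VECTOR column for embeddings.
# The documents table and the 384 dimension are illustrative.
DDL = """
CREATE TABLE IF NOT EXISTS documents (
    id TEXT PRIMARY KEY,
    title TEXT,
    embedding FLOAT_VECTOR(384)
)
"""

def create_table(dsn="http://localhost:4200"):
    # Assumes the crate package is installed and a CrateDB node
    # is reachable at the given DSN.
    from crate import client
    conn = client.connect(dsn)
    try:
        conn.cursor().execute(DDL)
    finally:
        conn.close()
```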


Why CrateDB for ML Use Cases?

  • Real-time ingestion: Ingest millions of records per second from IoT, logs, or user behavior

  • Unified data: Mix structured, full-text, vector, and JSON data

  • FLOAT_VECTOR: Store and query high-dimensional embeddings

  • SQL transforms: Use SQL for preprocessing, feature extraction, and filtering

  • ML integration: Use CrateDB as a feature store or inference backend

  • Python & LangChain support: Easily connect to training and inference pipelines


Common Machine Learning Patterns

Feature Engineering

Use SQL to build features dynamically from raw data.
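For example, per-device hourly features can be derived directly in SQL. This is a sketch: the sensor_readings table and its columns are hypothetical.

```python
# Sketch: deriving per-device, per-hour features in SQL.
# The sensor_readings table and its columns are hypothetical.
FEATURE_QUERY = """
SELECT
    device_id,
    DATE_TRUNC('hour', ts) AS hour,
    AVG(temperature) AS avg_temp,
    MAX(temperature) - MIN(temperature) AS temp_range,
    COUNT(*) AS n_readings
FROM sensor_readings
WHERE ts >= NOW() - INTERVAL '7 days'
GROUP BY device_id, DATE_TRUNC('hour', ts)
"""
```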

Training Dataset Extraction

Efficiently extract and filter relevant training data.
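One way to pull a filtered training set into pandas, sketched under the assumption that pandas and the sqlalchemy-cratedb dialect are installed; the device_features table is hypothetical.

```python
# Sketch: extracting a labeled training set into a pandas DataFrame.
TRAINING_QUERY = """
SELECT device_id, avg_temp, temp_range, label
FROM device_features
WHERE label IS NOT NULL
LIMIT 100000
"""

def load_training_frame(dsn="crate://localhost:4200"):
    # Assumes pandas and the sqlalchemy-cratedb dialect are installed
    # and a CrateDB node is reachable at the given DSN.
    import pandas as pd
    import sqlalchemy as sa
    engine = sa.create_engine(dsn)
    with engine.connect() as connection:
        return pd.read_sql(TRAINING_QUERY, connection)
```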

Store Embeddings

Save vector representations for documents or entities.
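A sketch of a bulk write, assuming the FLOAT_VECTOR table from above and the crate Python client (which uses "?" placeholders); all names are illustrative.

```python
# Sketch: writing embeddings with a parameterized bulk INSERT.
INSERT_SQL = "INSERT INTO documents (id, title, embedding) VALUES (?, ?, ?)"

def store_embeddings(rows, dsn="http://localhost:4200"):
    # rows: iterable of (id, title, vector) tuples; each vector is a list
    # of floats matching the FLOAT_VECTOR dimension of the column.
    from crate import client  # assumes a reachable CrateDB node
    conn = client.connect(dsn)
    try:
        conn.cursor().executemany(INSERT_SQL, list(rows))
    finally:
        conn.close()
```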


CrateDB supports high-dimensional vectors with FLOAT_VECTOR. To query these vectors for similarity-based inference, see Vector Search.

Use CrateDB as a Feature Store

Centralize your features and use them in production models.
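At serving time, the latest pre-computed features for an entity can be fetched with a simple keyed query. A sketch, assuming the hypothetical device_features table and the crate Python client:

```python
# Sketch: reading the latest pre-computed features for one entity
# at serving time. The device_features table is hypothetical.
SERVE_QUERY = """
SELECT avg_temp, temp_range, n_readings
FROM device_features
WHERE device_id = ?
ORDER BY hour DESC
LIMIT 1
"""

def features_for(device_id, dsn="http://localhost:4200"):
    from crate import client  # assumes a reachable CrateDB node
    conn = client.connect(dsn)
    try:
        cursor = conn.cursor()
        cursor.execute(SERVE_QUERY, (device_id,))
        return cursor.fetchone()
    finally:
        conn.close()
```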


When to Use CrateDB in ML Pipelines

  • Feature Store: Store pre-computed features with SQL access

  • Real-Time Inference: Serve vector-based results with KNN_MATCH

  • Experimentation: Use SQL for fast slicing, filtering, and aggregations

  • Monitoring: Track model performance, drift, or input quality

  • Data Collection: Capture telemetry, events, logs, and raw user data


Architecture Examples

Model Training Pipeline
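A sketch of one possible offline pipeline (extract features, train, persist), assuming pandas, scikit-learn, joblib, and the sqlalchemy-cratedb dialect; all table and column names are hypothetical.

```python
# Sketch: offline training pipeline backed by CrateDB.
TRAIN_SQL = """
SELECT avg_temp, temp_range, label
FROM device_features
WHERE label IS NOT NULL
"""

def train_pipeline(dsn="crate://localhost:4200", model_path="model.joblib"):
    # 1. Extract a labeled feature set from CrateDB.
    # 2. Train a simple classifier on it.
    # 3. Persist the model for the online inference service.
    import joblib
    import pandas as pd
    import sqlalchemy as sa
    from sklearn.linear_model import LogisticRegression

    engine = sa.create_engine(dsn)
    with engine.connect() as connection:
        df = pd.read_sql(TRAIN_SQL, connection)
    model = LogisticRegression().fit(df[["avg_temp", "temp_range"]], df["label"])
    joblib.dump(model, model_path)
    return model
```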

Real-Time Inference with Hybrid Queries
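A hybrid query combines a structured filter with vector search in one statement. This sketch uses CrateDB's KNN_MATCH(column, query_vector, k) predicate; the documents table and its category column are hypothetical.

```python
# Sketch: hybrid query combining a structured filter with vector search.
HYBRID_QUERY = """
SELECT id, title, _score
FROM documents
WHERE category = ?
  AND KNN_MATCH(embedding, ?, 10)
ORDER BY _score DESC
LIMIT 10
"""

def similar_documents(category, query_vector, dsn="http://localhost:4200"):
    # query_vector: list of floats matching the FLOAT_VECTOR dimension.
    from crate import client  # assumes a reachable CrateDB node
    conn = client.connect(dsn)
    try:
        cursor = conn.cursor()
        cursor.execute(HYBRID_QUERY, (category, query_vector))
        return cursor.fetchall()
    finally:
        conn.close()
```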


Performance Tips for ML Scenarios

  • Use FLOAT_VECTOR for embeddings: Store and access high-dimensional vectors efficiently

  • Normalize vectors before inference: Improves ANN accuracy

  • Index only what you need: Reduces overhead

  • Use hybrid filters + vector search: Improves performance and precision

  • Train offline, infer online: CrateDB is ideal for live inference from pre-trained models
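The normalization tip can be sketched in plain Python with no dependencies: unit-length vectors make dot-product and cosine similarity rankings agree, which typically helps ANN accuracy.

```python
import math

def l2_normalize(vec):
    # Scale a vector to unit length before storing or querying it;
    # zero vectors are returned unchanged to avoid division by zero.
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]

# l2_normalize([3.0, 4.0]) -> [0.6, 0.8]
```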


Ecosystem & Integration

  • Python: crate-python for pandas and scikit-learn

  • LangChain: Native CrateDB vector store

  • Jupyter: Ideal for experimentation and model development

  • OpenAI, Cohere, etc.: Store and search their embeddings via SQL

  • Kafka: Connect for real-time ingestion and prediction

