Hybrid search

While vector search provides powerful semantic retrieval based on machine learning models, it's not always optimal, especially when models are not fine-tuned for a specific domain. On the other hand, traditional full-text search (e.g., BM25 scoring) offers high precision on exact or keyword-based queries, with strong performance out of the box. Hybrid search blends these approaches, combining semantic understanding with keyword relevance to deliver more accurate, robust, and context-aware search results.

Hybrid search is particularly effective for Knowledge bases, Product or document search, Multilingual content search, FAQ bots and semantic assistants, and AI-powered search experiences. It allows applications to go beyond keyword matching, incorporating vector similarity while still respecting domain-specific terms.

CrateDB supports hybrid search by combining vector similarity search (kNN) and term-based full-text search (BM25) in a single SQL query. CrateDB lets you implement hybrid search natively in SQL using Common Table Expressions (CTEs) and scoring fusion techniques, such as:

  • Convex combination (weighted sum of scores)

  • Reciprocal Rank Fusion (RRF)


Supported Search Capabilities in CrateDB

Search Type
Function
Description

Vector search

KNN_MATCH()

Finds vectors closest to a given vector

Full-text search

MATCH()

Uses Lucene's BM25 scoring

Geospatial search

MATCH()

For shapes and points (see: Geospatial Search)

CrateDB enables all three through pure SQL, allowing flexible combinations and advanced analytics.


Example: Hybrid Search in SQL

Here’s a simple structure of a hybrid search query combining BM25 and vector results using a CTE:

WITH 
    vector_results AS (
        SELECT id, title, content, 
               _score AS vector_score
        FROM documents
        WHERE KNN_MATCH(embedding, [0.2, 0.1, ..., 0.3], 10)
    ),
    bm25_results AS (
        SELECT id, title, content, 
               _score AS bm25_score
        FROM documents
        WHERE MATCH(content, 'knn search')
    )

SELECT 
    v.id,
    v.title,
    bm25_score,
    vector_score,
    0.5 * bm25_score + 0.5 * vector_score AS hybrid_score
FROM 
    bm25_results b
JOIN 
    vector_results v ON v.id = b.id
ORDER BY 
    hybrid_score DESC
LIMIT 10;

You can adjust the weighting (0.5) depending on your desired balance between keyword precision and semantic similarity.


Sample Results

Hybrid Scoring (Convex Combination)

hybrid_score
bm25_score
vector_score
title

0.7440

1.0000

0.5734

knn_match(float_vector, float_vector, int)

0.4868

0.5512

0.4439

Searching On Multiple Columns

0.4716

0.5694

0.4064

array_position(...)

Reciprocal Rank Fusion (RRF)

final_rank
bm25_rank
vector_rank
title

0.03278

1

1

knn_match(float_vector, float_vector, int)

0.03105

7

2

Searching On Multiple Columns

0.03057

8

3

Usage

RRF rewards documents that rank highly across multiple methods, regardless of exact score values.

Last updated