feat: Infix filtering support #2685 #2813

Open
mdhaduk wants to merge 6 commits into typesense:v31 from mdhaduk:feat/2685-infix-filtering-support

mdhaduk commented Mar 4, 2026


feat: infix filtering support (field:*value*) #2685

Closes #2685


Typesense supports exact (field:value) and prefix (field:value*) filtering on string fields, but had no way to filter by substring: users couldn't express "give me all docs where this field contains this value."

This PR adds field:*value* syntax to filter_by, enabling infix/contains filtering on string and string array fields that have infix: true in their schema.

Changes:

  • Detection (filter_result_iterator.cpp): *value* is now detected as an infix pattern before the existing prefix (value*) check, avoiding misidentification.
  • Validation: Returns a 400 error if the field does not have infix: true, because the infix index is only built when explicitly enabled.
  • Lookup: Reuses the existing Index::search_infix() function (already used for query-time infix search). For multi-token filter values (e.g. *Chris P*), results
    are intersected (AND). Multiple filter values in a list (e.g. [*foo*, *bar*]) are OR'd together.
  • Result assembly: Results are stored as a flat filter_result_t (sorted uint32_t[]), the same pattern used by numeric and geo filters. Mixed infix + exact/prefix in a single OR list is handled by flattening posting lists and OR'ing with infix results before returning.
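
To illustrate why the detection order matters, here is a minimal sketch of classifying a single filter value. It is not the code in filter_result_iterator.cpp: `FilterValueKind`, `ClassifiedValue` and `classify_filter_value` are hypothetical names invented for this example. The point is only that a `*value*` pattern also ends in `*`, so a prefix check that runs first would misread it.

```cpp
#include <string>

// Hypothetical types for illustration only; not Typesense's actual API.
enum class FilterValueKind { Exact, Prefix, Infix };

struct ClassifiedValue {
    FilterValueKind kind;
    std::string needle;  // the value with wildcard markers stripped
};

ClassifiedValue classify_filter_value(const std::string& value) {
    const std::size_t n = value.size();
    // Check the infix form (*value*) first: it also ends in '*', so testing
    // the prefix form first would treat "*ris*" as the prefix "*ris".
    if (n >= 3 && value.front() == '*' && value.back() == '*') {
        return {FilterValueKind::Infix, value.substr(1, n - 2)};
    }
    // Prefix form: value*
    if (n >= 2 && value.back() == '*') {
        return {FilterValueKind::Prefix, value.substr(0, n - 1)};
    }
    return {FilterValueKind::Exact, value};
}
```

With this ordering, `*ris*` is classified as infix with needle `ris`, `Ma*` as prefix with needle `Ma`, and `Martin` as exact.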

Testing (CollectionFilteringTest.InfixFilterOnTextFields)

  • Basic single-token infix on string[] field (*ris* matches "Chris Evans", "Chris Parnell", "Chris Pine")
  • Infix on string field
  • No matches case
  • OR of multiple infix values ([*ris*, *art*])
  • Error on field without infix: true (400 response)
  • Mixed infix + exact in OR ([*ris*, Martin])
  • Mixed infix + prefix in OR ([*ris*, Ma*])
  • Infix combined with AND on another field (cast: *ris* && points: >60)

@tharropoulos
Contributor

CC: @kishorenc

@tharropoulos
Contributor

Is this ready for review?

@mdhaduk
Author

mdhaduk commented Mar 10, 2026

I'm a new contributor, so I wanted a review of this draft before submitting it as a PR, in case there are any obvious pitfalls.

Let me know if I should just submit it as a PR; I think the repo does not run a clean build for draft PRs.

@tharropoulos
Contributor

@happy-san Could you check this out?

mdhaduk marked this pull request as ready for review April 28, 2026 15:40
@mdhaduk
Author

mdhaduk commented Apr 28, 2026

Went ahead and just opened it since some time has passed!

@happy-san
Contributor

Hey @mdhaduk, the PR looks good! It is missing an implementation for handling lazy evaluation of filter_by, though. Let me know if you want to take a stab at that. What needs to be done:

@mdhaduk
Author

mdhaduk commented Apr 30, 2026

Hey @happy-san, thanks for taking a look! I’m happy to finish out the missing implementation for lazy evaluation of filter_by

mdhaduk added 2 commits April 30, 2026 16:42
…upport

Changes:
- include/index.h, src/index.cpp: add search_infix_leaves(); refactor
  search_infix() to call it
- src/filter_result_iterator.cpp: lazy infix path via
  search_infix_leaves() + posting_list_iterators; 400 error for
  multi-token; remove has_infix_results block; remove #include <algorithm>
- test/filter_test.cpp: add FilterTest.InfixLazyEvaluation (5 cases
  including multi-token 400 error)

Assumptions:
- Infix filter values must be single tokens. A value that tokenises to
  N > 1 tokens returns 400. This is safe: every existing test uses
  single-token patterns (*ris*, *pta*, etc.); no documented behaviour
  depended on multi-token AND semantics.
@mdhaduk
Author

mdhaduk commented Apr 30, 2026

Hi @happy-san, implemented the lazy evaluation as requested. Here's a summary of what changed:

Lazy evaluation path (filter_result_iterator.cpp)

  • Replaced the eager search_infix() call (which materialised all doc IDs into a flat array) with search_infix_leaves(), which returns art_leaf* pointers without merging doc IDs.
  • Each matching vocab token becomes one outer entry in posting_list_iterators, ORed lazily by the existing get_string_filter_next_match() via the CONTAINS comparator branch. No position verification is needed, which is correct for infix.
  • Removed the has_infix_results finalization block; the normal string_filter_ids_threshold check now applies uniformly to infix.
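
As a rough picture of what "ORed lazily" means here, the sketch below merges several sorted doc ID lists one ID at a time instead of materialising the union up front. It is a standalone illustration, not the logic of get_string_filter_next_match(): `LazyUnion` is a hypothetical class, and plain `std::vector<uint32_t>` lists stand in for the real posting lists.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical stand-in for a lazy OR over per-token posting lists.
// Each inner vector is one matching token's sorted doc ID list.
class LazyUnion {
public:
    explicit LazyUnion(std::vector<std::vector<uint32_t>> lists)
        : lists_(std::move(lists)), pos_(lists_.size(), 0) {
        // Seed the min-heap with the first doc ID of every non-empty list.
        for (std::size_t i = 0; i < lists_.size(); i++) {
            if (!lists_[i].empty()) heap_.push({lists_[i][0], i});
        }
    }

    // Writes the next distinct doc ID into `out`; returns false when exhausted.
    bool next(uint32_t& out) {
        if (heap_.empty()) return false;
        out = heap_.top().first;
        // Advance every list currently positioned on `out` so that an ID
        // present in several lists is reported only once.
        while (!heap_.empty() && heap_.top().first == out) {
            const std::size_t i = heap_.top().second;
            heap_.pop();
            if (++pos_[i] < lists_[i].size()) heap_.push({lists_[i][pos_[i]], i});
        }
        return true;
    }

private:
    using Entry = std::pair<uint32_t, std::size_t>;  // (doc ID, list index)
    std::vector<std::vector<uint32_t>> lists_;
    std::vector<std::size_t> pos_;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap_;
};
```

A caller pulls IDs with `while (it.next(id)) { ... }` and can stop early, which is the property an eager flat array does not give.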

search_infix_leaves() (index.h, index.cpp)

  • Extracted the scan logic from search_infix() into a new function that stops before merging doc IDs. search_infix() now delegates to it (no behaviour change for query-time infix search).
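
The shape of that refactor can be sketched on a toy vocabulary: one function performs only the scan and returns pointers to the matching posting lists, and the eager function delegates to it before merging. Everything here is an assumption made for illustration (`Vocab`, `find_matching_leaves`, `search_eager`, and a `std::map` in place of the real index); the actual signatures of search_infix() and search_infix_leaves() are not shown in this thread.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Toy vocabulary: token -> sorted doc IDs. Stands in for the real index.
using Vocab = std::map<std::string, std::vector<uint32_t>>;

// Scan step only: collect pointers to the posting lists of every token that
// contains `needle`, without touching the doc IDs themselves.
std::vector<const std::vector<uint32_t>*> find_matching_leaves(const Vocab& vocab,
                                                               const std::string& needle) {
    std::vector<const std::vector<uint32_t>*> leaves;
    for (const auto& entry : vocab) {
        if (entry.first.find(needle) != std::string::npos) {
            leaves.push_back(&entry.second);
        }
    }
    return leaves;
}

// Eager path: delegate to the scan, then merge into one sorted,
// de-duplicated doc ID list.
std::vector<uint32_t> search_eager(const Vocab& vocab, const std::string& needle) {
    std::vector<uint32_t> merged;
    for (const auto* ids : find_matching_leaves(vocab, needle)) {
        merged.insert(merged.end(), ids->begin(), ids->end());
    }
    std::sort(merged.begin(), merged.end());
    merged.erase(std::unique(merged.begin(), merged.end()), merged.end());
    return merged;
}
```

A lazy caller would use `find_matching_leaves()` directly and iterate the lists on demand, while existing callers of the eager function see the same merged result as before.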

Tests (test/filter_test.cpp)

  • Added FilterTest.InfixLazyEvaluation with 5 cases: basic lazy iteration, no-match, OR of two infix values, infix AND numeric, and a multi-token error case.

Assumptions made:

  • Multi-token infix (cast:*foo bar*) was removed and replaced with a 400 error. The infix index stores individual word tokens only, so substring search across word boundaries is architecturally impossible. The old path was producing semantically wrong AND results with no adjacency guarantee and had no test coverage. Users can write cast:*foo* && cast:*bar* as separate conditions instead.
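
The missing adjacency guarantee can be shown with a small sketch. It assumes a whitespace-only tokenizer and lowercase input, which is simpler than the real tokenizer, and all function names are hypothetical.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Whitespace split only; the real tokenizer is more involved (assumption).
std::vector<std::string> split_words(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    for (std::string w; in >> w;) words.push_back(w);
    return words;
}

// True when some single word of `text` contains `needle`; mirrors an index
// that stores individual word tokens only.
bool any_word_contains(const std::string& text, const std::string& needle) {
    for (const auto& w : split_words(text)) {
        if (w.find(needle) != std::string::npos) return true;
    }
    return false;
}

// Roughly the removed behaviour: AND the word-level matches of each token.
bool and_of_tokens(const std::string& text, const std::string& value) {
    for (const auto& tok : split_words(value)) {
        if (!any_word_contains(text, tok)) return false;
    }
    return true;
}

// The kind of guard that rejects multi-token values: exactly one token.
bool is_single_token(const std::string& value) {
    return split_words(value).size() == 1;
}
```

For the value `ris pin`, `and_of_tokens("pine chris", "ris pin")` is true even though "pine chris" does not contain the substring "ris pin", which is the kind of false positive the 400 error avoids.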

Comment thread test/collection_filtering_test.cpp Outdated
auto res_op = coll->search("*", {}, "name_no_infix: *foo*", {}, {}, {0}, 10, 1, FREQUENCY, {false});
ASSERT_FALSE(res_op.ok());
ASSERT_EQ(400, res_op.code());
ASSERT_NE(std::string::npos, res_op.error().find("infix"));
Contributor

We should match the error message verbatim here.

Comment thread test/filter_test.cpp Outdated
filter_tree_root, enable_lazy_evaluation);
ASSERT_FALSE(iter_multi_token.init_status().ok());
ASSERT_EQ(400, iter_multi_token.init_status().code());
ASSERT_NE(std::string::npos, iter_multi_token.init_status().error().find("single token"));
Contributor

Same here, we should match the error message verbatim.

Comment thread test/filter_test.cpp
// Test 5: Multi-token infix must return 400 — *ris pin* tokenizes to ["ris", "pin"].
// The infix index stores individual word tokens only; substring search across word
// boundaries is structurally impossible. Users should write cast:*ris* && cast:*pin* instead.
filter_op = filter::parse_filter_query("cast: *ris pin*", coll->get_schema(), store,
Contributor

Let's add a test case having cast: *ris* && cast: *in* which should match doc id: 5.

… case

- Replace substring find() checks with verbatim ASSERT_EQ on full error messages in FilterTest.InfixLazyEvaluation
- Add FilterTest.InfixLazyEvaluation Test 5: cast:*ris* && cast:*in*
@mdhaduk
Author

mdhaduk commented May 1, 2026

Just pushed the small updates regarding error messages + cast: *ris* && cast: *in* test case!

R"({"title": "Good Will Hunting", "cast": ["Matt Damon", "Ben Affleck"], "name_no_infix": "qux", "points": 83})"_json,
R"({"title": "Percy Jackson", "cast": ["Logan Lerman", "Alexandra Daddario"], "name_no_infix": "quux", "points": 59})"_json,
R"({"title": "Quantum Quest", "cast": ["Chris Pine"], "name_no_infix": "corge", "points": 52})"_json,
};
Contributor

Let's add these final tests as well

{
  "q": "will",
  "query_by": "title",
  "filter_by": "cast: [*an*, *in*]",
  "enable_lazy_filter": true
}

which should match document id: 2.

{
  "q": "will",
  "query_by": "title",
  "filter_by": "cast: ! [*an*, *in*]",
  "enable_lazy_filter": true
}

which should match document id: 3.

You can call search like this as well: https://github.com/typesense/typesense/blob/v31/test/collection_filtering_test.cpp#L181-L198

Contributor

@happy-san May 1, 2026

We also haven't tested what happens when a filter like cast:= *in* is sent. := is the exact match filter operator.

Author

That's a good point, will add this as well!

Author

> You can call search like this as well: https://github.com/typesense/typesense/blob/v31/test/collection_filtering_test.cpp#L181-L198

Ah I see because coll->search doesn't expose enable_lazy_filter...

Contributor

It does, but you would've ended up initializing way too much stuff to get to it https://github.com/typesense/typesense/blob/v31/include/collection.h#L1084

Author

> It does, but you would've ended up initializing way too much stuff to get to it https://github.com/typesense/typesense/blob/v31/include/collection.h#L1084

I see, I missed it. That makes sense.

…ests

Note: Tests 10 and 11 use collectionManager.do_search() with req_params to exercise the enable_lazy_filter HTTP parameter, which is not exposed through the coll->search()
@mdhaduk
Author

mdhaduk commented May 1, 2026

Updated collection_filtering_test with those three additional tests as outlined

@happy-san
Contributor

@kishorenc PR looks good to me. Ready for your review.

@mdhaduk
Author

mdhaduk commented May 4, 2026

Any update on this?


Development

Successfully merging this pull request may close these issues.

[Feature Request] Infix filtering support

3 participants