feat: Infix filtering support #2685#2813
Conversation
|
CC: @kishorenc |
|
Is this ready for review? |
|
New contributor, so I just wanted a review of this draft before submitting as PR in case of any obvious pitfalls Let me know if I should just submit as PR, I think repo does not run clean build on issue of draft PR |
|
@happy-san Could you check this out? |
|
Went ahead and just opened it since some time has passed! |
|
Hey @mdhaduk, The PR looks good! It is missing implementation for handling lazy evaluation of filter_by though. Let me know if you want to take a stab at that. What needs to be done:
|
|
Hey @happy-san, thanks for taking a look! I’m happy to finish out the missing implementation for lazy evaluation of |
…upport
Changes:
- include/index.h, src/index.cpp: add search_infix_leaves(); refactor
search_infix() to call it
- src/filter_result_iterator.cpp: lazy infix path via
search_infix_leaves() + posting_list_iterators; 400 error for
multi-token; remove has_infix_results block; remove #include <algorithm>
- test/filter_test.cpp: add FilterTest.InfixLazyEvaluation (5 cases
including multi-token 400 error)
Assumptions:
- Infix filter values must be single tokens. A value that tokenises to
N > 1 tokens returns 400. This is safe: every existing test uses
single-token patterns (*ris*, *pta*, etc.); no documented behaviour
depended on multi-token AND semantics.
|
Hi @happy-san, implemented the lazy evaluation as requested. Here's a summary of what changed: Lazy evaluation path (filter_result_iterator.cpp)
search_infix_leaves() (index.h, index.cpp)
Tests (test/filter_test.cpp)
assumptions made:
|
| auto res_op = coll->search("*", {}, "name_no_infix: *foo*", {}, {}, {0}, 10, 1, FREQUENCY, {false}); | ||
| ASSERT_FALSE(res_op.ok()); | ||
| ASSERT_EQ(400, res_op.code()); | ||
| ASSERT_NE(std::string::npos, res_op.error().find("infix")); |
There was a problem hiding this comment.
We should match the error message verbatim here.
| filter_tree_root, enable_lazy_evaluation); | ||
| ASSERT_FALSE(iter_multi_token.init_status().ok()); | ||
| ASSERT_EQ(400, iter_multi_token.init_status().code()); | ||
| ASSERT_NE(std::string::npos, iter_multi_token.init_status().error().find("single token")); |
There was a problem hiding this comment.
Same here, we should match the error message verbatim.
| // Test 5: Multi-token infix must return 400 — *ris pin* tokenizes to ["ris", "pin"]. | ||
| // The infix index stores individual word tokens only; substring search across word | ||
| // boundaries is structurally impossible. Users should write cast:*ris* && cast:*pin* instead. | ||
| filter_op = filter::parse_filter_query("cast: *ris pin*", coll->get_schema(), store, |
There was a problem hiding this comment.
Let's add a test case having cast: *ris* && cast: *in* which should match doc id: 5.
… case - Replace substring find() checks with verbatim ASSERT_EQ on full error messages in FilterTest.InfixLazyEvaluation - Add FilterTest.InfixLazyEvaluation Test 5: cast:*ris* && cast:*in*
|
Just pushed the small updates regarding error messages + |
| R"({"title": "Good Will Hunting", "cast": ["Matt Damon", "Ben Affleck"], "name_no_infix": "qux", "points": 83})"_json, | ||
| R"({"title": "Percy Jackson", "cast": ["Logan Lerman", "Alexandra Daddario"], "name_no_infix": "quux", "points": 59})"_json, | ||
| R"({"title": "Quantum Quest", "cast": ["Chris Pine"], "name_no_infix": "corge", "points": 52})"_json, | ||
| }; |
There was a problem hiding this comment.
Let's add these final tests as well
{
"q": "will",
"query_by": "title",
"filter_by": "cast: [*an*, *in*]",
"enable_lazy_filter": true
}which should match document id: 2.
{
"q": "will",
"query_by": "title",
"filter_by": "cast: ! [*an*, *in*]",
"enable_lazy_filter": true
}which should match document id: 3.
You can call search like this as well: https://github.com/typesense/typesense/blob/v31/test/collection_filtering_test.cpp#L181-L198
There was a problem hiding this comment.
We also haven't tested what happens when a filter like cast:= *in* is sent. := is the exact match filter operator.
There was a problem hiding this comment.
That's a good point, will add this as well!
There was a problem hiding this comment.
You can call search like this as well: https://github.com/typesense/typesense/blob/v31/test/collection_filtering_test.cpp#L181-L198
Ah I see because coll->search doesn't expose enable_lazy_filter...
There was a problem hiding this comment.
It does, but you would've ended up initializing way too much stuff to get to it https://github.com/typesense/typesense/blob/v31/include/collection.h#L1084
There was a problem hiding this comment.
It does, but you would've ended up initializing way too much stuff to get to it https://github.com/typesense/typesense/blob/v31/include/collection.h#L1084
I see I missed it, that makes sense
…ests Note: Tests 10 and 11 use collectionManager.do_search() with req_params to exercise the enable_lazy_filter HTTP parameter, which is not exposed through the coll->search()
|
Updated |
|
@kishorenc PR looks good to me. Ready for your review. |
|
Any update on this? |
feat: infix filtering support (field:value) #2685
Closes #2685
Typesense supports exact (field:value) and prefix (field:value*) filtering on string fields, but had no way to filter by substring , users couldn't express "give me all docs where this field contains this value."
This PR adds field:value syntax to filter_by, enabling infix/contains filtering on string and string array fields that have
infix: truein their schema.Changes:
are intersected (AND). Multiple filter values in a list (e.g. [foo, bar]) are OR'd together.
Tesing (CollectionFilteringTest.InfixFilterOnTextFields)