Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
20 changes: 20 additions & 0 deletions src/query.rs
Original file line number Diff line number Diff line change
Expand Up @@ -65,6 +65,26 @@ impl Query {
Ok(format!("Query({:?})", self.get()))
}

pub(crate) fn __and__(&self, other: Query) -> Query {
let inner = tv::query::BooleanQuery::from(vec![
(tv::query::Occur::Must, self.inner.box_clone()),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we check if self or other already is a BooleanQuery of all-Must and apend ourselves to that instead of building trees of BooleanQuery?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason why would you want this feature? Complicated syntaxes like (query1 | query2) & query3 can be very troublesome to implement. Its also common to nest queries in Lucene.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fear for the reduced efficiency compared to having a flat list of subqueries. AFAIU, Tantivy will not optimize that query tree and its execution will directly mirror the structure of the input query.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we need to distinguish the presence of a boolean request, I afraid we need to setup dedicated Query struct but it will lead to big change in the API.

I was rather thinking about downcasting here. I also think we can keep this internal as an optimization instead of changing the API, i.e. downcast to see if we already have a boolean query with matching modifiers and append a clause instead of nesting.

Copy link
Contributor

@alex-au-922 alex-au-922 May 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I think the idea is fine, the optimization cannot be too deep due to the Occur implementation.

Currently we have 3 variants of Occur: Should, Must, MustNot. The optimization can be done when we have a tuple of (Occur, BooleanQuery) where the boolean query might contain one or more groups of Should, Must, MustNot.

Their mathematics goes like the symmetric matrix (Take the header as the index 0 Occur of the (Occur, BooleanQuery) tuple, while the rows Occur as the groups inside the BooleanQuery):

Should Must MustNot
Should Should (Must, [Should, ...]) (MustNot, [Should, ...])
Must (Should, [Must, ...]) Must MustNot
MustNot (Should, [MustNot, ...]) MustNot Must

This means that we can only flatten 1 layer those 5 types of (Occur, InnerOccurGroup), and for the flattening we need to extract the boolean query's inner occur group, am I right?

P.S. Why I remain (Must, [Should, ...]) and (MustNot, [Should, ...]) though they can be deduced by DeMorgan's rule is that when transforming them, the resultant query still have the same degree of nesting and there is no significant benefits so I just remain them as it is.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comment was only aimed at keeping the status quo, i.e. chaining clauses using & and | should result in a single boolean query as if the boolean_query method is used instead of a unbalanced tree.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think we can keep this internal as an optimization instead of changing the API.

So do you mean you expect whenever __and__ and __or__ are used, the above optimization evaluation rule applies while the original Query.boolean_query remains the status quo?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So do you mean you expect whenever and and or are used, the above optimization evaluation rule applies while the original Query.boolean_query remains the status quo?

No, I just want a & b & c and boolean_query([(Must, a), (Must, b), (Must, c)]) to produce the same internal representation using a single boolean query instead of the first form using two boolean queries.

Copy link
Contributor

@alex-au-922 alex-au-922 May 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I think the & and | are binary operations that they must evaluate to something first, no matter the order? Say for a & b & c then we must either evaluate to d & c where d = a & b or vice versa, and there must be nesting.

In case we optimistically flatten the queries with the & or | syntax, say a & b & c, while any of the queries a, b, c are boolean, the resultant query will be different from a boolean_query([(Must, a), (Must, b), (Must, c)]).

If we really want to adapt this syntax while producing the same internal representation, the Query.boolean_query method should also be flattened.

(tv::query::Occur::Must, other.inner.box_clone()),
]);
Query {
inner: Box::new(inner),
}
}

pub(crate) fn __or__(&self, other: Query) -> Query {
let inner = tv::query::BooleanQuery::from(vec![
(tv::query::Occur::Should, self.inner.box_clone()),
(tv::query::Occur::Should, other.inner.box_clone()),
]);
Query {
inner: Box::new(inner),
}
}

/// Construct a Tantivy's TermQuery
#[staticmethod]
#[pyo3(signature = (schema, field_name, field_value, index_option = "position"))]
Expand Down
26 changes: 26 additions & 0 deletions tests/tantivy_test.py
Original file line number Diff line number Diff line change
Expand Up @@ -64,6 +64,32 @@ def test_and_query(self, ram_index):

assert len(result.hits) == 1

def test_combine_queries(self, ram_index):
index = ram_index

query1 = ram_index.parse_query("title:men", ["title"])
query2 = ram_index.parse_query("body:summer", ["body"])

combined_and = query1 & query2

searcher = index.searcher()
result = searcher.search(combined_and, 10)

# This is an AND query, so it should return 0 results since summer isn't present
assert len(result.hits) == 0

combined_or = query1 | query2

result = searcher.search(combined_or, 10)

assert len(result.hits) == 1

double_combined = (query1 & query2) | query1

result = searcher.search(double_combined, 10)

assert len(result.hits) == 1

def test_and_query_numeric_fields(self, ram_index_numeric_fields):
index = ram_index_numeric_fields
searcher = index.searcher()
Expand Down