Skip to content

[POC] Support extract_snippets function in ES|QL #132549

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 29 commits into
base: main
Choose a base branch
from

Conversation

kderusso
Copy link
Member

@kderusso kderusso commented Aug 7, 2025

Provides a prototype implementation of the EXTRACT_SNIPPETS function in ES|QL, for discussion on the best approaches to pursue.

Here is an example of how to call this field:

POST _query?format=txt
{
  "query": """
  FROM books METADATA _score
  | EVAL snippets = extract_snippets(synopsis, "hobbit takes a ring on a long journey", 3, 10)
  | KEEP title, snippets, _score
  | SORT _score DESC 
  | LIMIT 10
  """
}

You can break apart multiple values using MV_EXPAND:

POST _query?format=txt
{
  "query": """
  FROM books METADATA _score
  | EVAL snippets = extract_snippets(synopsis, "hobbit takes a ring on a long journey", 1, 10)
  | MV_EXPAND snippets
  | KEEP title, snippets, _score
  | SORT _score DESC 
  | LIMIT 10
  """
}

EXTRACT_SNIPPETS will work on text and semantic_text fields, though semantic_text fields do not yet support char length so the whole chunk will be returned.

Some notes on this POC:

  • We've had some discussions on whether this should be a function or a command. This POC is currently implemented as a function. We think a function is OK for now, the only reason to use a command is the fact that we want to perform inference via an async call. As long as we only perform inference once (like we do for the semantic match query) a function should be OK.
  • This POC uses the data node to retrieve the field value for snippet extraction. This restriction could potentially be removed by using an in-memory Lucene index to perform the highlighting query if needed.

@kderusso
Copy link
Member Author

kderusso commented Aug 8, 2025

@carlosdelest @jimczi I plan to reach out to you next week RE: this PR.

Two main questions:

  1. Why is the QueryBuilderResolver not doing a full rewrite?
  2. Why is appending BytesRef only ever returning the first item? Does this need to be a List instead?

@kderusso kderusso force-pushed the kderusso/esql-extract-snippets branch 3 times, most recently from cb98553 to 2872969 Compare August 12, 2025 19:39
@kderusso kderusso force-pushed the kderusso/esql-extract-snippets branch from 2872969 to 5b9347c Compare August 12, 2025 19:40
Copy link
Member

@carlosdelest carlosdelest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good and seems doable 💯

I have some concerns about:

  • Removing the highlighters from the ToEvaluator interface
  • Need to have a way of moving the execution to the data nodes
  • Because of that, I keep wondering if using a command would be a better approach from the implementation PoV.

I'm sure we will get more clarity on those as we move forward 👍

var query = """
FROM test
| EVAL x = extract_snippets(content, "fox", 1, 10)
| SORT x
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's important that we push the evaluator to the data nodes - we'll need a physical optimization rule for that. Otherwise, the evaluator won't have shard context data and won't be able to run.

That may be tougher than it looks, as we will need to maintain the dependencies from other EVALs that use the produced snippet.

We will know we have succeeded when we can do the above query without SORT:

FROM test
| EVAL x = extract_snippets(content, "fox", 1, 10)

... and it works because it has been moved below a ExchangeExec to be pushed to the data nodes.

Highlighter highlighter = highlighters.getOrDefault(fieldType.getDefaultHighlighter(), new DefaultHighlighter());

// TODO: Consolidate these options with the ones built in the text similarity reranker
SearchHighlightContext.FieldOptions.Builder optionsBuilder = new SearchHighlightContext.FieldOptions.Builder();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a lot of construction for every appendMatch - Hopefully we can move this construction to the evaluator constructor and reuse it for every matching doc.

I don't like that appendMatch is so crowded with params - but I didn't see a better alternative at the time.

@@ -921,6 +923,10 @@ private static Map<String, Highlighter> setupHighlighters(Settings settings, Lis
return unmodifiableMap(highlighters.getRegistry());
}

public static Map<String, Highlighter> getStaticHighlighters() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's tough that the highlighter code is so deeply ingrained into the fetch phase 😢

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - this was the best/cleanest solution that I could come up with, if you have better suggestions I'd be happy to talk about them!

@@ -79,6 +82,11 @@ public FoldContext foldCtx() {
public List<ShardContext> shardContexts() {
return shardContexts;
}

@Override
public Map<String, Highlighter> highlighters() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I keep struggling with this - highlighters should not be something the toEvaluator method or other evaluators care about.

I wonder if making it part of the SearchContext or ShardContext would be a better option, so it's passed to the factory and is not a part of the ToEvaluator interface.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good idea - I made that move in 838b054 - I think that's better

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants