Skip to content

Conversation

benjaoming
Copy link
Contributor

@benjaoming benjaoming commented Mar 8, 2023

WIP - need to test this, just want to know that this is what you had in mind. I'll test and make it ready for review then. CC: @humitos @stsewd

  • Update test data
  • Implement known Sphinx behavior in a generic way
  • Figure out how much of SphinxDomain logic to disable
  • Open up follow-up issue with a plan for old SphinxDomain data + deprecating/removing the sphinx_domain app ?

Fixes: #9571

Copy link
Member

@stsewd stsewd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going in a great direction!

Sorry, you said to not review the code, but kind of had to look at the code p: so just left some notes in case is useful

@benjaoming benjaoming marked this pull request as ready for review March 13, 2023 17:04
@benjaoming benjaoming requested review from a team as code owners March 13, 2023 17:04
@benjaoming benjaoming requested review from agjohnson and humitos March 13, 2023 17:04
@benjaoming
Copy link
Contributor Author

benjaoming commented Mar 13, 2023

@stsewd I wasn't able to find any real-life Sphinx HTML output that I was sure would match the intention of this change. The implementation seems fine and solid, but I just wonder which <dl> we are looking with <dt id="foobar">? Can you quickly point to an example?

Edit: Maybe this is it? https://docs.readthedocs.io/en/stable/glossary.html

@benjaoming benjaoming requested a review from stsewd March 13, 2023 17:08
@stsewd
Copy link
Member

stsewd commented Mar 13, 2023

@stsewd I wasn't able to find any real-life Sphinx HTML output that I was sure would match the intention of this change. The implementation seems fine and solid, but I just wonder which <dl> we are looking with <dt id="foobar">? Can you quickly point to an example?

Edit: Maybe this is it? https://docs.readthedocs.io/en/stable/glossary.html

http domains https://docs.readthedocs.io/en/stable/api/v3.html#get--api-v3-projects-, and autodoc output

https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html#directive-automodule

@stsewd
Copy link
Member

stsewd commented Mar 13, 2023

https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html#directive-automodule is kind of an interesting case, it has several aliases, we may do some research if that's the standard way of doing that, and see how we can handle those cases, but also, something that can be done in another PR.

@benjaoming
Copy link
Contributor Author

Great! Thanks! I'll brew up some test cases based on this 👍

@benjaoming
Copy link
Contributor Author

@stsewd I added httpdomain and autodoc test data.

Could you give it a review again? The most important part is to understand if the JSON data pertaining sections is how you'd expect from the HTML.

@benjaoming benjaoming force-pushed the generic-html-parser-dls-remove-sphinx-domain branch from a15e620 to 562d18b Compare March 23, 2023 12:21
@benjaoming benjaoming requested a review from stsewd March 23, 2023 14:12
@benjaoming
Copy link
Contributor Author

benjaoming commented Apr 10, 2023

Just a note about removing elements that won't be indexed. And removing those print statements :)

As you can see, a lot of changes went in 😞

Kudos to your objection to the removal of <dl>s inside the loop: It turned out that by chance, when I was wrongly removing <dl>s inside the loop to avoid later indexing in the below sections, it made all the first results look good. But nodes got removed too early and never indexed... I had to look closer at the bigger test sets to realize this.

That also meant that parts of the approach wasn't accurate... I found out that "pre-removing" content in <dd> nodes before indexing them was the sensitive part. I wrestled with CSS selectors of selectolax/Modest. They don't cover the full subset of possible CSS selectors and aren't AFAICT really documented.. it made me go back and forth between different approaches, getting a bit superstitious at times :)

Finally, I split everything into a new _parse_dls method. Looking at it, it does things how I'd like to do read them now: "for each <dl>, do this".

@benjaoming benjaoming requested a review from stsewd April 10, 2023 15:38
Copy link
Member

@stsewd stsewd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can avoid assigning the random UUID. Everything else looks good.

There is also one thing we haven't taken into consideration, this is domain titles are currently indexed using the simple analyzer

'name': fields.TextField(
# Simple analyzer breaks on `.`,
# otherwise search results are too strict for this use case
analyzer='simple',
),

And sections use the default (standard) analyzer (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html). We are using the simple analyzer because mostly of the domains are in the form of foo.bar (and the simple analyzer splits that into foo and bar, and the default analyzer makes it a whole word).

I think I'm fine with that (maybe we should invest in partial matches more, like experiment with https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenizer.html?), but maybe @ericholscher has some opinions here.

@ericholscher
Copy link
Member

ericholscher commented Apr 10, 2023

And sections use the default (standard) analyzer (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html). We are using the simple analyzer because mostly of the domains are in the form of foo.bar (and the simple analyzer splits that into foo and bar, and the default analyzer makes it a whole word).

I think I'm fine with that (maybe we should invest in partial matches more, like experiment with https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenizer.html?), but maybe @ericholscher has some opinions here.

I don't have a great sense for the trade-offs here. I do know we had users who would complain because they couldn't search for things that include periods and get the good results. I think as long as we have a way to ensure the search query is mapped reasonably onto the results, it should be fine. If we index django.db.models as django db models, will a query for django.db.models from a user return it properly, or will it get messed up?

@stsewd
Copy link
Member

stsewd commented Apr 10, 2023

If we index django.db.models as django db models, will a query for django.db.models from a user return it properly, or will it get messed up?

Good question, so analyzers are applied to both, the indexed content and the query (by default the same analyzer used for index is used for search). So searching for django.db.models would be the same as searching for django db models. I can do a quick test to confirm.

@stsewd
Copy link
Member

stsewd commented Apr 10, 2023

Yep, confirmed, the only thing is that the highlight would be broken into three matches.

Screenshot 2023-04-10 at 17-14-26 Search conf py project test-builds_search-special-chars Read the Docs\

(this was searching conf.py)

Copy link
Member

@stsewd stsewd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just waiting for the analyzer stuff, otherwise everything else is good

@benjaoming
Copy link
Contributor Author

Awesome! Thanks for all the accurate and helpful feedback.

@stsewd I guess that #10184 will require a bit similar research to what you just did, so if there is anything else regarding search indexing and search results, we could take it up there.

Do we need to open another issue about the next pieces to remove? I'm guessing it'd be nice to only have the generic parser left if possible?

@benjaoming
Copy link
Contributor Author

benjaoming commented Apr 11, 2023

Just waiting for the analyzer stuff, otherwise everything else is good

Oh heh, now I got a bit confused - are you waiting for anything else regarding analyzer stuff? Or is that referring to previous analysis? It seems to be concluding that the search analyzer works, it's only the highlighting that was different?

@stsewd
Copy link
Member

stsewd commented Apr 11, 2023

Just waiting for the analyzer stuff, otherwise everything else is good

Oh heh, now I got a bit confused - are you waiting for anything else regarding analyzer stuff? Or is that referring to previous analysis? It seems to be concluding that the search analyzer works, it's only the highlighting that was different?

With this change we are changing from using the simple analyzer to using the default analyzer for sphinx domains (description lists now). The simple analyzer will break foo.bar into 2 words, and the default analyzer won't, so search results will be different. With the default analyzer searching for foo or bar will match foo.bar, but it won't when using the default analyzer.

….com:readthedocs/readthedocs.org into generic-html-parser-dls-remove-sphinx-domain
@benjaoming
Copy link
Contributor Author

@stsewd added on title and content, supposing that the issue of searching for conf.py will affect both fields.

@stsewd
Copy link
Member

stsewd commented Apr 11, 2023

I think it's probably worth a little more discussion before changing the analyzer, I'm fine merging this without the analyzer changes. Changing the analyzer will require a re-index.

@benjaoming
Copy link
Contributor Author

I think it's probably worth a little more discussion before changing the analyzer, I'm fine merging this without the analyzer changes. Changing the analyzer will require a re-index.

Check 👍 Reverted it.

So this should be done now, I mentioned the analyzer in #10184 for now, but we can also open a separate issue.

@benjaoming benjaoming merged commit 973a3e9 into main Apr 11, 2023
@benjaoming benjaoming deleted the generic-html-parser-dls-remove-sphinx-domain branch April 11, 2023 18:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

Search: index definition lists as sections

3 participants