Search: index <dl>s as sections and remove Sphinx domain logic #10128

benjaoming · 2023-03-08T16:53:42Z

WIP - need to test this, just want to know that this is what you had in mind. I'll test and make it ready for review then. CC: @humitos @stsewd

Update test data
Implement known Sphinx behavior in a generic way
Figure out how much of SphinxDomain logic to disable
Open up follow-up issue with a plan for old SphinxDomain data + deprecating/removing the sphinx_domain app ?

Fixes: #9571

readthedocs/search/parsers.py

stsewd

This is going in a great direction!

Sorry, you said not to review the code, but I kind of had to look at it :p so I just left some notes in case they are useful.

readthedocs/search/parsers.py

benjaoming · 2023-03-13T17:06:20Z

@stsewd I wasn't able to find any real-life Sphinx HTML output that I was sure would match the intention of this change. The implementation seems fine and solid, but I just wonder which <dl>s we are looking for, i.e. ones with <dt id="foobar">? Can you quickly point to an example?

Edit: Maybe this is it? https://docs.readthedocs.io/en/stable/glossary.html

readthedocs/search/tests/data/generic/in/basic.html

stsewd · 2023-03-13T17:18:21Z

> @stsewd I wasn't able to find any real-life Sphinx HTML output that I was sure would match the intention of this change. The implementation seems fine and solid, but I just wonder which <dl>s we are looking for, i.e. ones with <dt id="foobar">? Can you quickly point to an example?
>
> Edit: Maybe this is it? https://docs.readthedocs.io/en/stable/glossary.html

HTTP domain entries, e.g. https://docs.readthedocs.io/en/stable/api/v3.html#get--api-v3-projects-, and autodoc output:

https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html#directive-automodule

stsewd · 2023-03-13T17:29:56Z

https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html#directive-automodule is kind of an interesting case: it has several aliases. We may need to research whether that's the standard way of doing it and see how we can handle those cases, but that can also be done in another PR.

benjaoming · 2023-03-13T18:40:53Z

Great! Thanks! I'll brew up some test cases based on this 👍

readthedocs/search/tests/data/sphinx/in/httpdomain.json

readthedocs/search/tests/data/sphinx/out/autodoc.json

benjaoming · 2023-03-21T13:31:57Z

@stsewd I added httpdomain and autodoc test data.

Could you give it a review again? The most important part is to understand whether the JSON data pertaining to sections is what you'd expect from the HTML.

readthedocs/search/parsers.py

readthedocs/search/tests/data/sphinx/out/autodoc.json


readthedocs/search/tests/data/generic/in/basic.html

readthedocs/search/parsers.py

benjaoming · 2023-04-10T14:42:50Z

Just a note about removing elements that won't be indexed. And removing those print statements :)

As you can see, a lot of changes went in 😞

Kudos to your objection to the removal of <dl>s inside the loop: It turned out that by chance, when I was wrongly removing <dl>s inside the loop to avoid later indexing in the below sections, it made all the first results look good. But nodes got removed too early and never indexed... I had to look closer at the bigger test sets to realize this.

That also meant that parts of the approach weren't accurate... I found out that "pre-removing" content in <dd> nodes before indexing them was the sensitive part. I wrestled with the CSS selectors of selectolax/Modest. They don't cover the full set of possible CSS selectors and AFAICT aren't really documented. It made me go back and forth between different approaches, getting a bit superstitious at times :)

Finally, I split everything into a new _parse_dls method. Looking at it, it now does things the way I'd like to read them: "for each <dl>, do this".
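To illustrate the "for each <dl>, do this" idea, here is a minimal sketch, not the code from this PR. It uses the standard library's html.parser instead of selectolax, and the names DLSectionParser and parse_dl_sections are hypothetical. It assumes that each <dt> carrying an id becomes a section title and the <dd> that follows becomes its content; nested <dl>s are not handled.

```python
from html.parser import HTMLParser


class DLSectionParser(HTMLParser):
    """Collect one section per <dt id="..."> and the <dd> that follows it."""

    def __init__(self):
        super().__init__()
        self.sections = []    # one dict per <dt> that has an id
        self._current = None  # section being filled, or None
        self._target = None   # "title" inside <dt>, "content" inside <dd>

    def handle_starttag(self, tag, attrs):
        if tag == "dt":
            dt_id = dict(attrs).get("id")
            if dt_id:
                self._current = {"id": dt_id, "title": "", "content": ""}
                self.sections.append(self._current)
                self._target = "title"
            else:
                # A <dt> without an id has no anchor to link to: skip it.
                self._current = None
                self._target = None
        elif tag == "dd" and self._current is not None:
            self._target = "content"

    def handle_endtag(self, tag):
        if tag in ("dt", "dd"):
            self._target = None

    def handle_data(self, data):
        if self._current is not None and self._target:
            self._current[self._target] += data


def parse_dl_sections(html):
    """Return a list of {"id", "title", "content"} dicts (hypothetical shape)."""
    parser = DLSectionParser()
    parser.feed(html)
    parser.close()
    return [
        {key: " ".join(value.split()) for key, value in section.items()}
        for section in parser.sections
    ]


SAMPLE = """
<dl class="py function">
  <dt id="foo.bar"><code>foo.bar</code>(x)</dt>
  <dd><p>Return the bar of x.</p></dd>
  <dt>no anchor</dt>
  <dd>skipped</dd>
</dl>
"""

sections = parse_dl_sections(SAMPLE)
```

Only the first <dt> in SAMPLE has an id, so only that entry is turned into a section; the second pair is ignored because there would be nothing to link the search result to.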


stsewd

I think we can avoid assigning the random UUID. Everything else looks good.

There is also one thing we haven't taken into consideration: domain titles are currently indexed using the simple analyzer:

readthedocs.org/readthedocs/search/documents.py

Lines 119 to 123 in 32ffebe

    
    'name': fields.TextField(
        # Simple analyzer breaks on `.`,
        # otherwise search results are too strict for this use case
        analyzer='simple',
    ),

And sections use the default (standard) analyzer (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html). We are using the simple analyzer because most of the domains are in the form of foo.bar (the simple analyzer splits that into foo and bar, while the default analyzer keeps it as a single word).
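The difference can be approximated in plain Python. This is a rough sketch and not Elasticsearch itself: simple_analyze mimics the simple analyzer (split on anything that is not a letter, then lowercase), and standard_analyze_approx is a simplified stand-in for the standard analyzer that keeps a period between word characters inside the token. Both function names are hypothetical, and the real standard analyzer follows Unicode text segmentation rules that this regex only approximates.

```python
import re


def simple_analyze(text):
    """Approximate the `simple` analyzer: letters only, lowercased."""
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]


def standard_analyze_approx(text):
    """Rough stand-in for the `standard` analyzer.

    Assumption: a single "." or "'" between word characters does not
    split the token. The real analyzer has more rules than this.
    """
    return [token.lower() for token in re.findall(r"\w+(?:[.']\w+)*", text)]


name = "django.db.models"
simple_tokens = simple_analyze(name)
standard_tokens = standard_analyze_approx(name)
```

With the simple approximation the dotted name becomes three terms, while the standard approximation keeps it as one term, which is why a search for a single component behaves differently under the two analyzers.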

I think I'm fine with that (maybe we should invest in partial matches more, like experiment with https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenizer.html?), but maybe @ericholscher has some opinions here.
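As a sketch of what experimenting with edge n-grams could mean for partial matches: an edge n-gram tokenizer emits the prefixes of each token, so a partial query such as "mod" can match an indexed "models". The helper below is hypothetical, and the min_gram and max_gram values are illustrative choices, not settings from this project.

```python
def edge_ngrams(token, min_gram=2, max_gram=10):
    """Return the prefixes of `token` with lengths min_gram..max_gram."""
    upper = min(max_gram, len(token))
    return [token[:length] for length in range(min_gram, upper + 1)]


indexed_terms = set(edge_ngrams("models"))
partial_query_matches = "mod" in indexed_terms
```

The trade-off is index size: every token is stored once per prefix length, so the gram limits would need tuning before adopting something like this.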

readthedocs/search/parsers.py

ericholscher · 2023-04-10T21:35:49Z

> And sections use the default (standard) analyzer (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html). We are using the simple analyzer because most of the domains are in the form of foo.bar (the simple analyzer splits that into foo and bar, while the default analyzer keeps it as a single word).
>
> I think I'm fine with that (maybe we should invest in partial matches more, like experiment with https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenizer.html?), but maybe @ericholscher has some opinions here.

I don't have a great sense for the trade-offs here. I do know we had users who would complain because they couldn't search for things that include periods and get good results. I think as long as we have a way to ensure the search query is mapped reasonably onto the results, it should be fine. If we index django.db.models as django db models, will a query for django.db.models from a user return it properly, or will it get messed up?

stsewd · 2023-04-10T22:06:39Z

> If we index django.db.models as django db models, will a query for django.db.models from a user return it properly, or will it get messed up?

Good question. Analyzers are applied to both the indexed content and the query (by default, the same analyzer used for indexing is used for search). So searching for django.db.models would be the same as searching for django db models. I can do a quick test to confirm.

stsewd · 2023-04-10T22:18:57Z

Yep, confirmed, the only thing is that the highlight would be broken into three matches.


(this was searching conf.py)
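A small sketch of why the query still matches while the highlight is split. It assumes the same simple-style tokenization is applied at index and query time, requires every query term to be present (a stricter rule than a default match query), and wraps each matching term in <em> tags. The helper names are hypothetical and this only imitates the observed behaviour; it does not call Elasticsearch.

```python
import re

LETTERS = re.compile(r"[^\W\d_]+")


def simple_tokens(text):
    """Approximate the `simple` analyzer: letters only, lowercased."""
    return [match.group().lower() for match in LETTERS.finditer(text)]


def matches(query, indexed):
    """True if every analyzed query term occurs in the analyzed field."""
    indexed_terms = set(simple_tokens(indexed))
    return all(term in indexed_terms for term in simple_tokens(query))


def highlight(query, indexed):
    """Wrap each indexed term that also appears in the query in <em>."""
    query_terms = set(simple_tokens(query))

    def wrap(match):
        word = match.group()
        return f"<em>{word}</em>" if word.lower() in query_terms else word

    return LETTERS.sub(wrap, indexed)


found = matches("django.db.models", "django.db.models")
marked = highlight("django.db.models", "django.db.models")
```

Because the periods are never part of a term, each component is highlighted on its own instead of the dotted name being highlighted as a whole.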

stsewd

Just waiting for the analyzer stuff, otherwise everything else is good

benjaoming · 2023-04-11T06:25:54Z

Awesome! Thanks for all the accurate and helpful feedback.

@stsewd I guess that #10184 will require a bit similar research to what you just did, so if there is anything else regarding search indexing and search results, we could take it up there.

Do we need to open another issue about the next pieces to remove? I'm guessing it'd be nice to only have the generic parser left if possible?

benjaoming · 2023-04-11T06:28:03Z

> Just waiting for the analyzer stuff, otherwise everything else is good

Oh heh, now I got a bit confused - are you waiting for anything else regarding analyzer stuff? Or is that referring to previous analysis? It seems to be concluding that the search analyzer works, it's only the highlighting that was different?

stsewd · 2023-04-11T13:18:16Z

> > Just waiting for the analyzer stuff, otherwise everything else is good
>
> Oh heh, now I got a bit confused - are you waiting for anything else regarding analyzer stuff? Or is that referring to previous analysis? It seems to be concluding that the search analyzer works, it's only the highlighting that was different?

With this change we are switching from the simple analyzer to the default analyzer for Sphinx domains (description lists now). The simple analyzer will break foo.bar into 2 words, and the default analyzer won't, so search results will be different. With the simple analyzer, searching for foo or bar will match foo.bar, but it won't when using the default analyzer.


benjaoming · 2023-04-11T13:37:40Z

@stsewd I added the simple analyzer on both title and content, supposing that the issue of searching for conf.py will affect both fields.

stsewd · 2023-04-11T14:05:13Z

I think it's probably worth a little more discussion before changing the analyzer, I'm fine merging this without the analyzer changes. Changing the analyzer will require a re-index.

This reverts commit ffca7ad.

benjaoming · 2023-04-11T14:25:37Z

> I think it's probably worth a little more discussion before changing the analyzer, I'm fine merging this without the analyzer changes. Changing the analyzer will require a re-index.

Check 👍 Reverted it.

So this should be done now, I mentioned the analyzer in #10184 for now, but we can also open a separate issue.

benjaoming added 2 commits March 8, 2023 17:49

WIP: Draft on parsing <dl> as normal section

c84577a

Drafting what to remove

13c7946

benjaoming commented Mar 8, 2023


domain_data is no longer generated

780dec0

stsewd reviewed Mar 10, 2023


benjaoming added 2 commits March 13, 2023 15:03

Update generic logic, remove old Sphinx <dl> parsing

0968e01

Improve parsing with "General sibling combinator", add test case data

5797c6a

benjaoming marked this pull request as ready for review March 13, 2023 17:04

benjaoming requested review from a team as code owners March 13, 2023 17:04

benjaoming requested review from agjohnson and humitos March 13, 2023 17:04

auto-assign bot assigned benjaoming Mar 13, 2023

benjaoming commented Mar 13, 2023


benjaoming requested a review from stsewd March 13, 2023 17:08

Add httpdomain example

b87d3ee

benjaoming commented Mar 20, 2023


test case for Sphinx autodoc HTML

cc77f5e

benjaoming commented Mar 20, 2023


stsewd reviewed Mar 21, 2023


benjaoming added 3 commits March 23, 2023 11:46

Merge branch 'main' of github.com:readthedocs/readthedocs.org into generic-html-parser-dls-remove-sphinx-domain

71a4018

Remove entire block that was indexing Sphinx domains

612c687

Clean up remaining Sphinx domain search index

562d18b

benjaoming force-pushed the generic-html-parser-dls-remove-sphinx-domain branch from a15e620 to 562d18b Compare March 23, 2023 12:21

benjaoming requested a review from stsewd March 23, 2023 14:12

benjaoming commented Apr 10, 2023


benjaoming commented Apr 10, 2023


benjaoming added 4 commits April 10, 2023 16:16

Cleanup: Remove inaccurate comment

2d8d585

Cleanup: Select adjacent dd instead of iterating

952142a

Fix strange syntax

3de9e39

Do not accumulate lists: Yield indexed nodes and section content

c1a0287

benjaoming added 2 commits April 10, 2023 16:48

Merge branch 'main' of github.com:readthedocs/readthedocs.org into generic-html-parser-dls-remove-sphinx-domain

0231e6c

Appease "darker" lint

012e8dc

benjaoming requested a review from stsewd April 10, 2023 15:38

stsewd reviewed Apr 10, 2023


Reduce complexity: replace css selector with a Python look

4170338

stsewd approved these changes Apr 10, 2023


benjaoming mentioned this pull request Apr 11, 2023

Search: index from children nodes up to parent nodes #10184

Open

benjaoming added 2 commits April 11, 2023 15:36

Use "simple" analyzer on section contents

ffca7ad

Merge branch 'generic-html-parser-dls-remove-sphinx-domain' of github.com:readthedocs/readthedocs.org into generic-html-parser-dls-remove-sphinx-domain

8bc1450

Revert "Use "simple" analyzer on section contents"

17dfb8a

This reverts commit ffca7ad.

benjaoming merged commit 973a3e9 into main Apr 11, 2023

benjaoming deleted the generic-html-parser-dls-remove-sphinx-domain branch April 11, 2023 18:38

benjaoming mentioned this pull request Apr 26, 2023

Search: Only use generic parsers #10272

Closed

4 tasks

benjaoming mentioned this pull request Jun 21, 2023

Search: stop creating SphinxDomain objects #10451

Merged
