Commit d3af559

Keep h1 and other headings
Even though using h1 tags for sections inside an article is semantically wrong, a lot of websites are doing it anyway. So the idea here is to stop stripping headings, including h1 on Readability's side.

Fixes wallabag/wallabag#5805

Signed-off-by: Kevin Decherf <kevin@kdecherf.com>
1 parent 6689f19 commit d3af559

1 file changed: +5 -2 lines changed
src/Readability.php

@@ -395,12 +395,15 @@ public function prepArticle(\DOMNode $articleContent): void
         $this->clean($articleContent, 'object');
         $this->clean($articleContent, 'iframe');
         $this->clean($articleContent, 'canvas');
-        $this->clean($articleContent, 'h1');
 
         /*
-         * If there is only one h2, they are probably using it as a main header, so remove it since we
+         * If there is only one h1 or h2, they are probably using it as a main header, so remove it since we
          * already have a header.
          */
+        $h1s = $articleContent->getElementsByTagName('h1');
+        if (1 === $h1s->length && mb_strlen($this->getInnerText($h1s->item(0), true, true)) < 100) {
+            $this->clean($articleContent, 'h1');
+        }
         $h2s = $articleContent->getElementsByTagName('h2');
         if (1 === $h2s->length && mb_strlen($this->getInnerText($h2s->item(0), true, true)) < 100) {
             $this->clean($articleContent, 'h2');
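
The rule this change applies to h1 (and already applied to h2) can be tried outside the library with a minimal sketch like the one below. It is not the code from src/Readability.php: `dropLoneHeading` is a hypothetical helper, plain `textContent` with `trim()` stands in for Readability's `getInnerText()`, and direct node removal stands in for `clean()`. It assumes the `dom` and `mbstring` extensions are loaded.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of Readability.php): removes the $tag heading
 * only when exactly one exists and its trimmed text is shorter than $maxLen
 * characters. Returns true if a heading was removed.
 */
function dropLoneHeading(DOMDocument $doc, string $tag, int $maxLen = 100): bool
{
    $nodes = $doc->getElementsByTagName($tag);
    if (1 !== $nodes->length) {
        // Zero or several headings: treat them as section titles and keep them.
        return false;
    }

    $heading = $nodes->item(0);
    // Stand-in for getInnerText(): raw text content, trimmed.
    if (mb_strlen(trim($heading->textContent)) >= $maxLen) {
        return false;
    }

    $heading->parentNode->removeChild($heading);

    return true;
}

$doc = new DOMDocument();
$doc->loadHTML('<div><h1>Page title</h1><h2>Part one</h2><p>a</p><h2>Part two</h2><p>b</p></div>');

$removedH1 = dropLoneHeading($doc, 'h1'); // one short h1: treated as a duplicate title
$removedH2 = dropLoneHeading($doc, 'h2'); // two h2 elements: kept as section headings
```

With this input the single h1 is dropped while both h2 elements survive, which is the behaviour the commit message describes for pages that use several headings of the same level as section titles.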
