Avoid duplicate keyword search #1066

fancycode · 2025-01-08T13:37:40Z

The example file contains a huge object with this content:

210891 0 obj
<</Limits [(node00000002) (node00275919)]
/Names [(node00000002) 8591 0 R (node00000006) 190838 0 R (node00000007) 66313 0 R (node00000008) 24685 0 R (node00000009) 208328 0 R (node00000010) 163561 0 R (node00000011) 27134 0 R (node00000012) 126131 0 R (node00000013) 9895 0 R (node00000014) 101713 0 R (node00000015) 64243 0 R (node00000016) 118039 0 R (node00000017)
... continues for more than 5 MBytes ...
(node00275919) 209354 0 R]>>
endobj

This blows up the running time of the parsing code in

pdfcpu/pkg/pdfcpu/model/parse.go

Line 1243 in 4e0e0db

func DetectKeywords(line string) (endInd int, streamInd int, err error) {

Here the keywords are searched for again every time the start position is advanced past a string literal inside the array (e.g. (node00000002)), so the same text is scanned many times.

The PR fixes this by searching for the keywords only once and then adjusting the position if strings or comments are found before them. In my tests the example file now parses in about 3 seconds.

hhrutter · 2025-01-09T20:28:40Z

Sounds very cool.
Let me double-check your patch to this critical code section.

hhrutter · 2025-01-16T18:21:01Z

pkg/pdfcpu/read.go

@@ -1721,7 +1721,7 @@ func buffer(c context.Context, rd io.Reader) (buf []byte, endInd int, streamInd
 		growSize = min(growSize*2, maximumBufSize)
 		line := string(buf)

-		endInd, streamInd, err = model.DetectKeywords(line)
+		endInd, streamInd, err = model.DetectKeywordsWithContext(c, line)


Why don't we keep the original function name DetectKeywords and pass in the caller's context as the 2nd parameter? Does that make sense?

hhrutter

There is an (old) bug in posFloor:

if pos1 < pos2 {
    return pos2
}

should be

if pos1 < pos2 {
    return pos1
}

hhrutter · 2025-01-16T18:31:47Z

Happy to merge if you can take care of these 2 issues.

tmm1 · 2025-02-11T22:47:58Z

@fancycode ping

fancycode mentioned this pull request Jan 8, 2025

Improve the speed of reading basic PDF information #1057

Closed

hhrutter reviewed Jan 16, 2025

fancycode and others added 3 commits February 21, 2025 15:30

Don't search for keywords multiple times in the same text.

776a234

Abort keyword detection if Context has error (e.g. timed out).

f7a40fa

Fix model.posFloor

d94329a

hhrutter force-pushed the avoid-duplicate-keyword-search branch from 1d7b124 to d94329a Compare February 21, 2025 14:33

hhrutter merged commit eb132fb into pdfcpu:master Feb 21, 2025
15 checks passed
