Description
I noticed that go-git is using a ton of CPU and memory when doing simple operations like iterating through a tree on large repositories.
Test program
package main
import (
"log"
"os"
"github.com/go-git/go-git/v5"
)
func main() {
log.SetFlags(log.LstdFlags | log.Lmicroseconds)
repo, err := git.PlainOpen(os.Args[1]); if err != nil { panic(err) }
headRef, err := repo.Head(); if err != nil { panic(err) }
commit, err := repo.CommitObject(headRef.Hash()); if err != nil { panic(err) }
tree, err := commit.Tree(); if err != nil { panic(err) }
log.Println("Iterating tree")
for _, entry := range tree.Entries {
size, err := tree.Size(entry.Name); if err != nil { panic(err) }
log.Println(entry.Name, size)
}
log.Println("Done")
}
Test environment
go version go1.23.6 linux/arm64
M1 MacBook Air (2020)
Expected behavior
The entire process completes in about 0.1s.
2025/03/08 20:54:07.990596 Iterating tree
2025/03/08 20:54:07.990964 .get_maintainer.ignore 194
2025/03/08 20:54:07.990993 .gitattributes 105
2025/03/08 20:54:07.991031 .gitignore 2180
2025/03/08 20:54:07.991156 .mailmap 42849
2025/03/08 20:54:07.991184 .rustfmt.toml 369
2025/03/08 20:54:07.991209 COPYING 496
...
2025/03/08 20:54:07.992446 Done
Actual behvior
It takes about 1.3s.
2025/03/08 20:52:52.634226 Iterating tree
...
2025/03/08 20:52:52.635310 .get_maintainer.ignore 194
2025/03/08 20:52:52.635343 .gitattributes 105
2025/03/08 20:52:53.781688 .gitignore 2180 <-- spends a lot of time here
2025/03/08 20:52:53.783122 .mailmap 42849
2025/03/08 20:52:53.783330 .rustfmt.toml 369
2025/03/08 20:52:53.783372 COPYING 496
...
2025/03/08 20:52:53.809643 Done
In my scenario it's .gitignore
that's being slow to tree.Size
on.
Profiling
It looks like the main problem is how genOffsetHash uses maps. I'm no expert when it comes to the Go runtime so I'll leave it here.
First bad commit
commit bb5c35196ee55b82f6506d69257de0e59cd8c41d
Date: Sun Nov 17 00:34:11 2024 +0000
plumbing: packfile, Refactor Parser and Scanner logic
The changes focus on increasing thread-safety, simplifying the code base,
enabling support for sha256 and improving general time and space complexities.
(snip: some statistics showing fewer allocs)
Call stack
Interestingly, genOffsetHash is only called occasionally, and it takes a long time whenever it is called.
github.com/go-git/go-git/v5/plumbing/format/idxfile.(*MemoryIndex).genOffsetHash(0x400016ed88)
/home/runxiyu/go-git/plumbing/format/idxfile/idxfile.go:230 +0x308
github.com/go-git/go-git/v5/plumbing/format/idxfile.(*MemoryIndex).FindHash(0x400016ed88, 0x4ec70ae3)
/home/runxiyu/go-git/plumbing/format/idxfile/idxfile.go:191 +0x1d0
github.com/go-git/go-git/v5/plumbing/format/packfile.(*Packfile).getByOffset(0x40000a3180, 0x4ec70ae3)
/home/runxiyu/go-git/plumbing/format/packfile/packfile.go:175 +0x34
github.com/go-git/go-git/v5/plumbing/format/packfile.(*Packfile).getMemoryObject(0x40000a3180, 0x40001b4820)
/home/runxiyu/go-git/plumbing/format/packfile/packfile.go:310 +0x224
github.com/go-git/go-git/v5/plumbing/format/packfile.(*Packfile).objectFromHeader(0x40000a3180?, 0xf357138a8184676e?)
/home/runxiyu/go-git/plumbing/format/packfile/packfile.go:281 +0x44
github.com/go-git/go-git/v5/plumbing/format/packfile.(*Packfile).getByOffset(0x40000a3180, 0x73450f85)
/home/runxiyu/go-git/plumbing/format/packfile/packfile.go:189 +0xb0
github.com/go-git/go-git/v5/plumbing/format/packfile.(*Packfile).GetByOffset(0x40000a3180, 0x73450f85)
/home/runxiyu/go-git/plumbing/format/packfile/packfile.go:81 +0xd0
github.com/go-git/go-git/v5/plumbing/format/packfile.(*Packfile).GetSizeByOffset(0x40000a3180, 0x73450f85)
/home/runxiyu/go-git/plumbing/format/packfile/packfile.go:91 +0x34
github.com/go-git/go-git/v5/storage/filesystem.(*ObjectStorage).encodedObjectSizeFromPackfile(0x4000160398, {0xa6, 0x1e, 0x47, 0x78, 0xd0, 0x11, 0xcf, 0x70, 0x6e, ...})
/home/runxiyu/go-git/storage/filesystem/object.go:326 +0x228
github.com/go-git/go-git/v5/storage/filesystem.(*ObjectStorage).EncodedObjectSize(0x4000160398, {0xa6, 0x1e, 0x47, 0x78, 0xd0, 0x11, 0xcf, 0x70, 0x6e, ...})
/home/runxiyu/go-git/storage/filesystem/object.go:340 +0xb0
github.com/go-git/go-git/v5/plumbing/object.(*Tree).Size(0x400015e410, {0x400011ce40?, 0x2?})
/home/runxiyu/go-git/plumbing/object/tree.go:100 +0x88
main.main()
/home/runxiyu/crap/test-go-git.go:25 +0x2d8
func (idx *MemoryIndex) genOffsetHash() error {
defer idx.mu.Unlock()
idx.mu.Lock()
count, err := idx.Count()
if err != nil {
return err
}
idx.offsetHash = make(map[int64]plumbing.Hash, count)
idx.offsetHashIsFull = true
var hash plumbing.Hash
i := uint32(0)
for firstLevel, fanoutValue := range idx.Fanout {
mappedFirstLevel := idx.FanoutMapping[firstLevel]
for secondLevel := uint32(0); i < fanoutValue; i++ {
fmt.Println("secondLevel", secondLevel) <-- [1]
copy(hash[:], idx.Names[mappedFirstLevel][secondLevel*objectIDLength:])
offset := int64(idx.getOffset(mappedFirstLevel, int(secondLevel)))
idx.offsetHash[offset] = hash
secondLevel++
}
}
return nil
}
The second-level loop at [1] seems to be called idx.Count()
times, which is 10950545 in my case. Quite reasonably this slows things down.
genOffsetHash
was only called for the file that slowed things down. It wasn't called at all before the problematic commit.
Now, I think setting a map 10950545 times sounds problematic... but this seems to have always been the behavior here. I'm looking into what caused genOffsetHash
to be called.
func (idx *MemoryIndex) FindHash(o int64) (plumbing.Hash, error) {
var hash plumbing.Hash
var ok bool
fmt.Println(o) // I added these random debug prints because I can't use debuggers properly
fmt.Println(idx.offsetHash[o])
idx.mu.RLock()
if idx.offsetHash != nil {
if hash, ok = idx.offsetHash[o]; ok {
idx.mu.RUnlock()
return hash, nil
}
}
idx.mu.RUnlock()
// Lazily generate the reverse offset/hash map if required.
if !idx.offsetHashIsFull || idx.offsetHash == nil {
if err := idx.genOffsetHash(); err != nil {
return plumbing.ZeroHash, err
}
hash, ok = idx.offsetHash[o]
}
if !ok {
return plumbing.ZeroHash, plumbing.ErrObjectNotFound
}
return hash, nil
}
This has previously returned
1933905797
a61e4778d011cf706e6784818a1357f392f3a669
2025/03/08 21:38:56.182156 .gitignore 2180
but after the problematic commit it returns
1933905797
a61e4778d011cf706e6784818a1357f392f3a669
1933905797
a61e4778d011cf706e6784818a1357f392f3a669
1321667299
0000000000000000000000000000000000000000
Previously FindHash
is only called one time per file, but now it's being called at least two times for each file, and for .gitignore
it's called three times (with the last one returning a null hash).
Notes
It's on main but not in any releases yet