Description
Is there an existing issue for this?
- I have searched the existing issues
Task description
In BigramDictionary and WordDictionary of Lucene.Net.Analysis.SmartCn, the code that loads from files uses ByteBuffer to read int values as little endian. It was done this way in Java because the C-derived code uses little-endian byte order, but the default in Java is big endian. However, in .NET the default byte order is little endian, so we can eliminate these extra allocations by wrapping the FileStream in a BinaryReader and calling BinaryReader.ReadInt32(). The intBuffer allocation can be eliminated in addition to the allocations caused by ByteBuffer.Wrap() in each loop, but the main benefit will be to decouple us from ByteBuffer.
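A minimal sketch of the intended change follows; the method name, the dctFilePath parameter, and the single read are illustrative only, not the exact upstream loading loop:

using System.IO;

internal static class DictionaryLoadSketch
{
    // Sketch only: shows the BinaryReader pattern, not the real loading code.
    public static int ReadFirstCount(string dctFilePath)
    {
        using FileStream dctFile = new FileStream(dctFilePath, FileMode.Open, FileAccess.Read);
        using var reader = new BinaryReader(dctFile);

        // Before: a 4-byte intBuffer was filled from the stream and wrapped with
        // ByteBuffer.Wrap(intBuffer) on each iteration just to decode a little-endian int.
        // After: BinaryReader.ReadInt32() already reads little endian on .NET, so both the
        // intBuffer and the per-loop ByteBuffer.Wrap() allocations go away.
        return reader.ReadInt32();
    }
}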
In addition, in BigramDictionary, we can eliminate the calls to tmpword.ToCharArray() by using ReadOnlySpan<char>.

char[] carray = tmpword.ToCharArray();

can be changed to:

ReadOnlySpan<char> carray = tmpword.AsSpan();

And we should change the carray parameter to ReadOnlySpan<char> in the Hash1(), Hash2(), GetItemIndex(), GetBigramItemIndex(), and GetFrequency() methods. No other changes to those methods should be required.
Note: the wordItem_charArrayTable in WordDictionary could be changed to use ReadOnlyMemory<char>, but I don't think it is worth the added complexity, and it would diverge significantly from upstream. The references would have to be maintained for all of the strings to keep them in scope.
Tests
Note that this code loads from disk and in the past has only been tested manually. There is a comment block here:
lucenenet/src/Lucene.Net.Analysis.SmartCn/Hhmm/BigramDictionary.cs
Lines 106 to 163 in a0578d6
// LUCENENET conversion note:
// The data in Lucene is stored in a proprietary binary format (similar to
// .NET's BinarySerializer) that cannot be read back in .NET. Therefore, the
// data was extracted using Java's DataOutputStream using the following Java code.
// It can then be read in using the LoadFromInputStream method below
// (using a DataInputStream instead of a BinaryReader), and saved
// in the correct (BinaryWriter) format by calling the SaveToObj method.
// Alternatively, the data can be loaded from disk using the files
// here(https://issues.apache.org/jira/browse/LUCENE-1629) in the analysis.data.zip file,
// which will automatically produce the .mem files.
//public void saveToOutputStream(java.io.DataOutputStream stream) throws IOException
//{
//    // save wordIndexTable
//    int wiLen = wordIndexTable.length;
//    stream.writeInt(wiLen);
//    for (int i = 0; i < wiLen; i++)
//    {
//        stream.writeShort(wordIndexTable[i]);
//    }
//    // save charIndexTable
//    int ciLen = charIndexTable.length;
//    stream.writeInt(ciLen);
//    for (int i = 0; i < ciLen; i++)
//    {
//        stream.writeChar(charIndexTable[i]);
//    }
//    int caDim1 = wordItem_charArrayTable == null ? -1 : wordItem_charArrayTable.length;
//    stream.writeInt(caDim1);
//    for (int i = 0; i < caDim1; i++)
//    {
//        int caDim2 = wordItem_charArrayTable[i] == null ? -1 : wordItem_charArrayTable[i].length;
//        stream.writeInt(caDim2);
//        for (int j = 0; j < caDim2; j++)
//        {
//            int caDim3 = wordItem_charArrayTable[i][j] == null ? -1 : wordItem_charArrayTable[i][j].length;
//            stream.writeInt(caDim3);
//            for (int k = 0; k < caDim3; k++)
//            {
//                stream.writeChar(wordItem_charArrayTable[i][j][k]);
//            }
//        }
//    }
//    int fDim1 = wordItem_frequencyTable == null ? -1 : wordItem_frequencyTable.length;
//    stream.writeInt(fDim1);
//    for (int i = 0; i < fDim1; i++)
//    {
//        int fDim2 = wordItem_frequencyTable[i] == null ? -1 : wordItem_frequencyTable[i].length;
//        stream.writeInt(fDim2);
//        for (int j = 0; j < fDim2; j++)
//        {
//            stream.writeInt(wordItem_frequencyTable[i][j]);
//        }
//    }
//}
that contains info about how to test it.
It would be great if we could automate the tests, but that will require figuring out how to create a small dictionary file to load so we don't add multiple MB of data to the test project. A few KB is all that is required to ensure it loads from disk correctly. We would also need to devise a way to determine that loading was successful, which will require some analysis and exploration.
Note also that the Kuromoji project loads dictionary data in a similar manner and also doesn't have automated tests.