Lucene.Net corrupt indexes

lucene.net

 

 

DotLucene is the .NET port of the Java Lucene API. It is still an open source project, with a smaller community. Many classes still need to be implemented, especially those specific to the Java world. Many open source projects are built on top of DotLucene, such as SubText, Lucandra.Net (Lucene and Cassandra) and the awesome RavenDB.
I’ve been using it since 2008 and I want to thank everyone who helped revive this project after two years of hibernation.

When playing with the .NET implementation, I ran into an ambiguous exception saying “Read Past EOF”. Contributors always suspect corrupt indexes. That was not my case: I double-checked using the CheckIndex.Status class and it did not reveal any problem. I went through the Lucene.Net DLL code and focused on the method throwing the error, Refill():


private void Refill()
{
    long start = bufferStart + bufferPosition;
    long end = start + _bufferSize;
    if (end > Length())
        // don't read past EOF
        end = Length();
    int newLength = (int) (end - start);

    if (newLength <= 0) // The error is thrown here
        throw new System.IO.IOException("read past EOF");

    if (buffer == null)
    {
        NewBuffer(new byte[_bufferSize]); // allocate buffer lazily
        SeekInternal(bufferStart);
    }
    ReadInternal(buffer, 0, newLength);
    bufferLength = newLength;
    bufferStart = start;
    bufferPosition = 0;
}
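
To make the failing condition concrete, here is a minimal sketch of the length computation as a standalone VB.NET function. RefillLength and its parameter names are hypothetical, chosen for illustration only; they are not part of Lucene.Net.

```vbnet
Imports System

Module RefillArithmetic
    ' Hypothetical helper mirroring the arithmetic in Refill():
    ' how many bytes can be read from "start" without passing the end of the file.
    Function RefillLength(start As Long, bufferSize As Integer, fileLength As Long) As Integer
        Dim endPos As Long = Math.Min(start + bufferSize, fileLength)
        Return CInt(endPos - start)
    End Function

    Sub Main()
        Console.WriteLine(RefillLength(0, 1024, 100))   ' 100: a normal read
        Console.WriteLine(RefillLength(100, 1024, 100)) ' 0: Refill() would throw
        Console.WriteLine(RefillLength(150, 1024, 100)) ' -50: Refill() would throw
    End Sub
End Module
```

In other words, the exception only says that the reader was asked for a position at or beyond the end of the file it has open. A corrupt index can cause that, but so can anything else that hands the reader a wrong position.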

To check whether your indexes are corrupt or not, you can use the CheckIndex class as below:

Dim checker As CheckIndex = New CheckIndex(idxdir)
Dim status As CheckIndex.Status = checker.CheckIndex_Renamed_Method()

The status holds many attributes that describe the health of your indexes. Mine shows this:
[Screenshot: CheckIndex.Status]
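
A few of those attributes can also be read in code. This is a minimal sketch, not a reference: IsIndexClean is a hypothetical helper, and the field names clean, numSegments and missingSegments are assumed from the 3.0.x port, so check them against the DLL you are using.

```vbnet
Imports System
Imports Lucene.Net.Index

Module IndexHealth
    ' Hypothetical helper: runs CheckIndex on an already opened directory
    ' and returns True when no problem is reported.
    Function IsIndexClean(idxdir As Lucene.Net.Store.Directory) As Boolean
        Dim checker As New CheckIndex(idxdir)
        Dim status As CheckIndex.Status = checker.CheckIndex_Renamed_Method()

        ' Field names assumed from the 3.0.x port; verify against your version.
        Console.WriteLine("segments: " & status.numSegments)
        Console.WriteLine("missing segments file: " & status.missingSegments)
        Return status.clean
    End Function
End Module
```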
So the indexes are healthy, no locks are left behind (you can also check for file locks with Lock Hunter), and read/write rights are set up correctly. So what was happening?

 

Simply, a coding error ^^’


Dim pth As String = Application.StartupPath + INDEX_PATH + "\" + entityName
Dim idxdir As Lucene.Net.Store.Directory = SimpleFSDirectory.Open(New IO.DirectoryInfo(pth))
Dim crawler As IndexSearcher = New IndexSearcher(idxdir, False)

' and later, when iterating through the results (i is the counter)

Dim scrd As ScoreDoc = hts.ScoreDocs(i)

' the error occurs here
Dim doc As Document = crawlr.Doc(scrd.Doc)

It took two hours and 25 pages to find the problem: crawler is the index searcher that fetches a given entity (entityName), whereas crawlr (without the “e” before the “r”) is a global index searcher used somewhere else. That’s all.
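
One way to make this kind of mix-up harder is to pass the searcher into the code that loads the documents, so the hits and the documents always come from the same searcher. This is only a sketch under assumptions: LoadDocuments and its parameters are hypothetical names, and Search(query, n) returning a TopDocs is assumed from the 3.0.x API; ScoreDocs and Doc are used as in the snippet above.

```vbnet
Imports System.Collections.Generic
Imports Lucene.Net.Documents
Imports Lucene.Net.Search

Module SearchHelpers
    ' Hypothetical helper: the searcher that produced the hits is the only
    ' searcher in scope, so a global one cannot be picked up by a typo.
    Function LoadDocuments(searcher As IndexSearcher, query As Query, maxHits As Integer) As List(Of Document)
        Dim docs As New List(Of Document)
        Dim hits As TopDocs = searcher.Search(query, maxHits)

        For Each scrd As ScoreDoc In hits.ScoreDocs
            ' The doc id is looked up in the same searcher that returned it.
            docs.Add(searcher.Doc(scrd.Doc))
        Next

        Return docs
    End Function
End Module
```

With the searcher passed as a parameter there is no need for a global variable with a near-identical name, and a doc id is never looked up in a searcher it did not come from.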

A good lesson: diff your team’s updates before starting any bug fix!
