Salem's Euphoria

Sharing Experience

Lucene.Net corrupt indexes


DotLucene is the .NET version of the Java Lucene API. It is still an open source project, with a smaller community. Many classes still need to be implemented, especially those specific to the Java world. Many open source projects are built on top of DotLucene, such as SubText, Lucandra.Net (Lucene and Cassandra) and the awesome RavenDB.
I’ve been using it since 2008 and I want to thank everyone who helped revive this project after two years of hibernation.

When playing with the .NET implementation, I ran into an ambiguous exception saying “Read Past EOF”. Contributors always suspected corrupt indexes. That was not my case: I double-checked using the CheckIndex class, and the CheckIndex.Status it returned did not reveal any problem. I went through the Lucene.Net DLL code and focused on the method throwing the error, Refill():


private void Refill()
{
    long start = bufferStart + bufferPosition;
    long end = start + _bufferSize;
    if (end > Length())
        // don't read past EOF
        end = Length();
    int newLength = (int) (end - start);

    if (newLength <= 0) // The error is thrown here
        throw new System.IO.IOException("read past EOF");

    if (buffer == null)
    {
        NewBuffer(new byte[_bufferSize]); // allocate buffer lazily
        SeekInternal(bufferStart);
    }
    ReadInternal(buffer, 0, newLength);
    bufferLength = newLength;
    bufferStart = start;
    bufferPosition = 0;
}
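
To make the failing check concrete, here is a minimal sketch of the same arithmetic in VB.NET. BytesToRead is a hypothetical helper written for this post, not a Lucene.Net method, and the file sizes are made-up numbers. It only shows that the exception fires whenever the read position is at or beyond the end of the file, whether or not the file is corrupt.

```vbnet
Imports System

Module RefillSketch
    ' Hypothetical helper mirroring the arithmetic in Refill(); not part of Lucene.Net.
    ' Returns how many bytes a refill could read, or throws the same way Refill() does.
    Function BytesToRead(bufferStart As Long, bufferPosition As Integer,
                         bufferSize As Integer, fileLength As Long) As Integer
        Dim start As Long = bufferStart + bufferPosition
        Dim stopPos As Long = start + bufferSize
        If stopPos > fileLength Then stopPos = fileLength ' clamp to the end of the file
        Dim newLength As Integer = CInt(stopPos - start)
        If newLength <= 0 Then Throw New IO.IOException("read past EOF")
        Return newLength
    End Function

    Sub Main()
        ' Position 80 in a 100-byte file: the read is shortened to the remaining 20 bytes.
        Console.WriteLine(BytesToRead(80, 0, 1024, 100))

        ' Position 160 in the same file: nothing is left to read, so the helper throws.
        Try
            BytesToRead(160, 0, 1024, 100)
        Catch ex As IO.IOException
            Console.WriteLine(ex.Message)
        End Try
    End Sub
End Module
```

So a perfectly healthy file produces this error as soon as something asks for an offset it does not contain.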

To check whether your indexes are corrupt or not, you can use the CheckIndex class as below:

Dim checker As CheckIndex = New CheckIndex(idxdir) ' idxdir is the Lucene.Net.Store.Directory of the index
Dim status As CheckIndex.Status = checker.CheckIndex_Renamed_Method()

The status holds many attributes that describe the health of your indexes. Mine showed this:
[Screenshot: CheckIndex.Status]
So the indexes are healthy, no locks are left behind (check for locks on the files with Lock Hunter as well), and RW rights are set up correctly. So what was happening?

 

Simply, a coding error ^^’


Dim pth As String = Application.StartupPath + INDEX_PATH + "\" + entityName
Dim idxdir As Lucene.Net.Store.Directory = SimpleFSDirectory.Open(New IO.DirectoryInfo(pth))
Dim crawler As IndexSearcher = New IndexSearcher(idxdir, False)

' ...and later, when iterating through the results (i is the counter)

Dim scrd As ScoreDoc = hts.ScoreDocs(i)

' The error occurs here: crawlr is not the crawler declared above
Dim doc As Document = crawlr.Doc(scrd.Doc)

Two hours and 25 pages to find the problem… crawler is the index searcher that fetches a given entity (entityName), whereas crawlr (without the “e” before the “r”) is a global index searcher used somewhere else. So the document ID came from one index and was looked up in another. That’s all.
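
My reading of why this surfaces as “read past EOF” rather than something more obvious: a document ID is only meaningful inside the index that produced it. The sketch below is a toy model, not Lucene.Net code. FetchDoc and both lists are hypothetical stand-ins, and I am assuming the global index simply held fewer documents than the entity index.

```vbnet
Imports System
Imports System.Collections.Generic

Module WrongSearcherSketch
    ' Hypothetical stand-in for a searcher's document lookup:
    ' a document ID is just a position in that index's own list.
    Function FetchDoc(index As List(Of String), docId As Integer) As String
        If docId < 0 OrElse docId >= index.Count Then
            ' The ID points past the end of this index's stored data.
            Throw New IO.IOException("read past EOF")
        End If
        Return index(docId)
    End Function

    Sub Main()
        Dim entityIndex As New List(Of String) From {"a", "b", "c", "d", "e"}
        Dim globalIndex As New List(Of String) From {"x", "y"}

        Dim docId As Integer = 4 ' a hit returned by searching entityIndex

        ' Same index that produced the hit: works.
        Console.WriteLine(FetchDoc(entityIndex, docId))

        ' Different, smaller index: the ID does not exist there.
        Try
            FetchDoc(globalIndex, docId)
        Catch ex As IO.IOException
            Console.WriteLine(ex.Message)
        End Try
    End Sub
End Module
```

Both indexes are healthy here, which matches what CheckIndex reported in my case. Had the other index been larger, the lookup would presumably have returned an unrelated document without any error, which would have been even harder to spot.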

A good lesson: diff your team’s updates before starting any bug fix!


Author: Salem Ben Afia

Big Data & Java developer, Search Engine Architect, Lucene Expert
