Web based Lucene application clustering – Part I
There are many resources on the web about clustering Lucene applications. The main restrictions relate to write access and to synchronization between sites. Solr and Elasticsearch have made significant efforts in this field. I tried both solutions, but I still ran into index accuracy problems (mainly duplicate entries in the indexes). In this article I will try to set up the architecture and the protocol for safe Lucene index synchronization in an existing Spring and Hibernate Search application.
Note that this post is neither an RFC nor an official specification document. It is written just for fun and to help newbies. There are proprietary and commercial tools that can do this work, but I always suggest a DIY approach.
In my case, the following schema best describes the situation.
The main aim of this work is to make Lucene searches return the same results, as often as possible, whichever entry point of the system is queried. Moreover, the deployment schema must ensure 24/7 service availability. This kind of work seems intended for sensitive data that must be kept synchronized. Well, that’s not the case for me, even though I’m doing it (for pleasure).
The labels displayed on the links describe the actions performed by each server. For better presentation, I stretched the messages along the links. The process begins with the servers subscribing to the synchronization flow. The entity we want to synchronize will be called CONTRACT.
1. Hey Master, I’m your slave right now:
Once the slave, the rescue, or whatever we call it, is installed, the administrator of this server must configure two parameters:
- Main server IP and port
- Synchronization inquiry delay, or polling interval (POLLING_INTERVAL)
The administrator then enables the polling service. “Rescue 1” pings the main server to tell it that it has subscribed to the synchronization service. Meanwhile, the rescue server continues to act as an independent server. Users can access the server and update data.
2. Ok, will keep this in mind.
When the main server receives the ping information (IP, port, server name, started since, application version), it persists the calling server's data into a database table called DLL_SRVRS. The server generates an ID (a sequence) on the persist operation and sends it to the rescue server.
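Steps 1 and 2 can be sketched in a few lines of Java. This is only a minimal sketch under assumptions: an in-memory map stands in for the DLL_SRVRS table, a counter stands in for the database sequence, and the class and method names (ServerPing, ServerRegistry, subscribe) are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

/** Ping payload sent by a rescue server when it subscribes. */
final class ServerPing {
    final String ip;
    final int port;
    final String serverName;
    final long startedSinceMillis;   // epoch millis; this encoding is an assumption
    final String applicationVersion;

    ServerPing(String ip, int port, String serverName,
               long startedSinceMillis, String applicationVersion) {
        this.ip = ip;
        this.port = port;
        this.serverName = serverName;
        this.startedSinceMillis = startedSinceMillis;
        this.applicationVersion = applicationVersion;
    }
}

/** In-memory stand-in for the DLL_SRVRS table on the main server. */
final class ServerRegistry {
    private final AtomicLong sequence = new AtomicLong(); // plays the database sequence
    private final Map<Long, ServerPing> servers = new LinkedHashMap<>();

    /** Persists the calling server and returns the generated ID to send back. */
    synchronized long subscribe(ServerPing ping) {
        long id = sequence.incrementAndGet();
        servers.put(id, ping);
        return id;
    }

    synchronized ServerPing find(long id) {
        return servers.get(id);
    }

    synchronized int size() {
        return servers.size();
    }
}
```

In a real application the map would be replaced by a persisted entity, and the returned ID is what the rescue server later puts in its delta lines.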
3. Hey Boy, any news?
Periodically, the main server asks the rescue nodes for updates. The calculation of the polling interval (POLLING_INTERVAL) is discussed later in this article.
4. Yes Master, here’s my delta.
If the rescue node has updates, it replies to the main server with an XML file called a delta. The delta structure can be described as follows:
- All contract entity fields
- Operation type (update, delete, insert)
- Operation timestamp on the rescue server
- Server ID
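As a data structure, the delta could look like the sketch below. The names (DeltaLine, Delta, OperationType) are hypothetical, and the contract fields are kept as a plain name-to-value map because the real CONTRACT entity is not described here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

enum OperationType { INSERT, UPDATE, DELETE }

/** One line of a delta: a single change made to a CONTRACT on a rescue node. */
final class DeltaLine {
    final Map<String, String> contractFields; // field name -> value (assumed representation)
    final OperationType operation;
    final long rescueTimestampMillis;         // when the operation happened on the rescue node
    final long serverId;                      // ID handed out by the main server on subscription

    DeltaLine(Map<String, String> contractFields, OperationType operation,
              long rescueTimestampMillis, long serverId) {
        // Defensive copy so later changes to the caller's map do not alter the line.
        this.contractFields = Collections.unmodifiableMap(new LinkedHashMap<>(contractFields));
        this.operation = operation;
        this.rescueTimestampMillis = rescueTimestampMillis;
        this.serverId = serverId;
    }
}

/** A delta is an ordered list of lines; an empty one means "no news". */
final class Delta {
    private final List<DeltaLine> lines = new ArrayList<>();

    void add(DeltaLine line) {
        lines.add(line);
    }

    boolean isEmpty() {
        return lines.isEmpty();
    }

    List<DeltaLine> lines() {
        return Collections.unmodifiableList(lines);
    }
}
```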
When the main server receives a non-empty delta file, it turns the delta lines into database rows. We will call a row in this table a “Dsync”. On top of the delta line, a Dsync contains additional information:
- Source server: server where this operation was done
- Broadcast list: list of servers that have not received the update yet; the source server is excluded. If the update was made on the main node, all rescue nodes are listed for broadcast on the initial insert.
- Processed: whether the main server has executed this update on the main database or not.
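The rule for the initial broadcast list is easy to get wrong, so here is a small sketch of it. It assumes server IDs are numbers and that the main node has an ID of its own that never appears among the rescue IDs (0 in the usage below); the Dsync class is trimmed to the bookkeeping fields and leaves out the contract fields.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

/** A delta line stored as a row, reduced here to its bookkeeping fields. */
final class Dsync {
    final String operation;         // "insert", "update" or "delete"
    final long sourceServerId;      // server where the operation was done
    final Set<Long> broadcastList;  // servers that have not received the update yet
    boolean processed;              // has this server applied the update to its own database?

    private Dsync(String operation, long sourceServerId, Set<Long> broadcastList) {
        this.operation = operation;
        this.sourceServerId = sourceServerId;
        this.broadcastList = broadcastList;
        this.processed = false;
    }

    /**
     * Builds a new Dsync on the main server. Every rescue node goes into the
     * broadcast list except the one the update came from. If the update was
     * made on the main node, nothing is removed and all rescue nodes are listed.
     */
    static Dsync create(String operation, long sourceServerId,
                        Collection<Long> rescueServerIds) {
        Set<Long> broadcast = new LinkedHashSet<>(rescueServerIds);
        broadcast.remove(sourceServerId);
        return new Dsync(operation, sourceServerId, broadcast);
    }
}
```

For example, with rescue nodes 1, 2 and 3, an update coming from node 2 is broadcast to 1 and 3, while an update made on the main node is broadcast to all three.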
5. Hey Boss, anything for me?
Another scheduled task (I will definitely use Quartz for it) is the agent-initiated inquiry. Every N minutes, each rescue server asks the main server for updates. “N” is the rescue server's polling interval, set up at install time.
6. Yes, I have updates for you. Let me know when you receive them.
The main server checks the Dsync table for updates that have not yet been sent to and received by the calling rescue server. The delta file size must be limited (MAX_DELTA_SIZE) because of network performance and restrictions.
7. Received, I’m integrating it.
Once the delta is received, the slave server must acknowledge it to the main server with the delta size information, something like a checksum. On the slave's answer, the main server removes the slave's ID from the broadcast list.
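Steps 6 and 7 on the main server side could be sketched like this. It is a simplified model, not the final implementation: rows live in a list instead of a table, OutboxRow is a Dsync reduced to an ID and a broadcast list, and the "checksum" is just the number of rows, which is an assumption about what the delta size information contains.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** A Dsync reduced to what the broadcast bookkeeping needs. */
final class OutboxRow {
    final long id;
    final Set<Long> broadcastList; // servers that still have to receive this row

    OutboxRow(long id, Set<Long> broadcastList) {
        this.id = id;
        this.broadcastList = new HashSet<>(broadcastList);
    }
}

final class MasterOutbox {
    private final List<OutboxRow> rows = new ArrayList<>();
    private final int maxDeltaSize; // MAX_DELTA_SIZE, in rows

    MasterOutbox(int maxDeltaSize) {
        this.maxDeltaSize = maxDeltaSize;
    }

    void add(OutboxRow row) {
        rows.add(row);
    }

    /** Step 6: rows still owed to the calling server, capped at MAX_DELTA_SIZE. */
    List<OutboxRow> nextDeltaFor(long serverId) {
        List<OutboxRow> delta = new ArrayList<>();
        for (OutboxRow row : rows) {
            if (delta.size() == maxDeltaSize) {
                break;
            }
            if (row.broadcastList.contains(serverId)) {
                delta.add(row);
            }
        }
        return delta;
    }

    /**
     * Step 7: the server leaves the broadcast lists only if the size it
     * acknowledges matches what was sent; otherwise the rows stay pending
     * and are offered again on the next poll.
     */
    boolean acknowledge(long serverId, List<OutboxRow> sentDelta, int acknowledgedSize) {
        if (acknowledgedSize != sentDelta.size()) {
            return false;
        }
        for (OutboxRow row : sentDelta) {
            row.broadcastList.remove(serverId);
        }
        return true;
    }
}
```

Because a row is only marked as delivered after the acknowledgement, a lost answer leads to a resend rather than a lost update, so the receiving side has to tolerate getting the same row twice.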
Slaves have Dsyncs too. The received delta is saved as Dsyncs and then processed by another scheduled task, which can be delayed (PROCESS_INTERVAL).
All servers must have a cleaning job that periodically purges the growing Dsync tables (CLEANER_INTERVAL).
The delays mentioned above must be configurable for each server. Intervals may vary from one slave server to another.
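A per-server configuration of the three intervals might look like the sketch below. It uses the JDK's ScheduledExecutorService only as a stand-in for a real scheduler, reads values in seconds from a Properties object, and falls back to the values this article settles on for my use case (60 seconds, 3 minutes, 24 hours). The property keys reuse the parameter names; the class name and the choice of seconds as unit are assumptions.

```java
import java.time.Duration;
import java.util.Properties;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Per-server synchronization intervals, loaded from that server's own configuration. */
final class SyncIntervals {
    final Duration pollingInterval;
    final Duration processInterval;
    final Duration cleanerInterval;

    SyncIntervals(Duration pollingInterval, Duration processInterval, Duration cleanerInterval) {
        this.pollingInterval = pollingInterval;
        this.processInterval = processInterval;
        this.cleanerInterval = cleanerInterval;
    }

    /** Reads the intervals (in seconds) and applies defaults for missing keys. */
    static SyncIntervals from(Properties props) {
        return new SyncIntervals(
                seconds(props, "POLLING_INTERVAL", 60),
                seconds(props, "PROCESS_INTERVAL", 180),
                seconds(props, "CLEANER_INTERVAL", 86_400));
    }

    private static Duration seconds(Properties props, String key, long defaultSeconds) {
        String raw = props.getProperty(key, Long.toString(defaultSeconds));
        return Duration.ofSeconds(Long.parseLong(raw.trim()));
    }

    /** Schedules the three jobs, each one with its own fixed delay between runs. */
    void schedule(ScheduledExecutorService scheduler,
                  Runnable pollJob, Runnable processJob, Runnable cleanJob) {
        scheduleEvery(scheduler, pollJob, pollingInterval);
        scheduleEvery(scheduler, processJob, processInterval);
        scheduleEvery(scheduler, cleanJob, cleanerInterval);
    }

    private static void scheduleEvery(ScheduledExecutorService scheduler,
                                      Runnable job, Duration interval) {
        long millis = interval.toMillis();
        // Fixed delay: the next run starts only after the previous one has finished.
        scheduler.scheduleWithFixedDelay(job, millis, millis, TimeUnit.MILLISECONDS);
    }
}
```

A fixed delay (rather than a fixed rate) is chosen here so that a slow run never overlaps with the next one.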
Since the main aim of this work is to make data available from every site, the polling interval (POLLING_INTERVAL) must be as short as possible. Considerations to keep in mind while defining this interval:
- You should not put a heavy workload on the agents, so that they can continue doing their job as usual (we are talking about synchronization transparency here)
- You should not poll so often that you consume all the network bandwidth. MAX_DELTA_SIZE also has an impact here.
In my use case, it should be something like 60 seconds.
- You must give the server enough time to process 5 delta files between two processing start events. If PROCESS_INTERVAL is not correctly calculated, agents will end up with a constantly growing Dsync table and may never finish that work.
In my use case this interval should be around 3 minutes if MAX_DELTA_SIZE is 100 rows.
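To check whether a set of values can keep up, a rough back-of-the-envelope helper is enough. The sketch below assumes the worst case where every poll brings back a full delta, and it needs a measured per-row processing cost (millisPerRow), which is a number you have to find out on your own servers; the class and method names are hypothetical.

```java
/** Rough sanity check for POLLING_INTERVAL, PROCESS_INTERVAL and MAX_DELTA_SIZE. */
final class IntervalCheck {

    /** Worst-case number of rows received between two processing runs. */
    static long worstCaseRowsPerRun(long pollingSeconds, long processSeconds, int maxDeltaSize) {
        // Number of polls that fit in one processing interval, rounded up.
        long pollsPerRun = (processSeconds + pollingSeconds - 1) / pollingSeconds;
        return pollsPerRun * maxDeltaSize;
    }

    /** True when one run can finish its worst-case batch before the next run starts. */
    static boolean isSustainable(long pollingSeconds, long processSeconds,
                                 int maxDeltaSize, long millisPerRow) {
        long rows = worstCaseRowsPerRun(pollingSeconds, processSeconds, maxDeltaSize);
        return rows * millisPerRow <= processSeconds * 1000L;
    }
}
```

With a 60 second polling interval, a 3 minute processing interval and 100 rows per delta, up to 300 rows can pile up between two runs. At 500 ms per row that is 150 seconds of work, which fits; at 700 ms per row it is 210 seconds, and the Dsync table keeps growing. Under this model the condition roughly comes down to MAX_DELTA_SIZE times the per-row cost staying below POLLING_INTERVAL.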
Cleaning job intervals can be long compared to POLLING_INTERVAL and PROCESS_INTERVAL. The Dsyncs to clean are:
- On the slave side: processed Dsyncs (processed = 0)
- On the master side: Dsyncs that are processed (processed = 0) and acknowledged as received by all listening servers (the broadcast list is empty).
I set it to 24 hours (a daily job).
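The two cleaning rules translate into two small predicates. In this sketch the processed state is modelled as a boolean, whatever value the column actually holds in the database, and the table is a plain list; StoredDsync and DsyncCleaner are hypothetical names.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** A Dsync reduced to the two fields the cleaning job looks at. */
final class StoredDsync {
    final boolean processed;        // true once the update has been applied locally
    final Set<Long> broadcastList;  // always empty on a slave

    StoredDsync(boolean processed, Set<Long> broadcastList) {
        this.processed = processed;
        this.broadcastList = new HashSet<>(broadcastList);
    }
}

final class DsyncCleaner {

    /** Slave rule: a processed Dsync is no longer needed. */
    static boolean cleanableOnSlave(StoredDsync row) {
        return row.processed;
    }

    /** Master rule: processed and acknowledged by every listening server. */
    static boolean cleanableOnMaster(StoredDsync row) {
        return row.processed && row.broadcastList.isEmpty();
    }

    /** Removes the cleanable rows from the table and returns how many were deleted. */
    static int clean(List<StoredDsync> table, boolean onMaster) {
        int before = table.size();
        table.removeIf(row -> onMaster ? cleanableOnMaster(row) : cleanableOnSlave(row));
        return before - table.size();
    }
}
```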
This parameter depends on network performance and on database update frequency.
One more hint about the delta file: if a contract attribute is named “ContractActivationDate”, the XML elements of the delta file should not use the original attribute name; something like “cad” is better, because it reduces the delta file size. Also, special XML characters in attribute values must be escaped to keep the XML valid (field values may contain “>” or “<”).
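Both hints fit in a small helper. This is a hand-rolled sketch to show the idea, not a replacement for a proper XML library: the wrapping element name "c" and the fallback to the original field name are assumptions, and field names are assumed to be valid XML element names already.

```java
import java.util.Map;

final class DeltaXml {

    /** Escapes the five characters that have a predefined XML entity. */
    static String escape(String value) {
        // "&" must be replaced first, otherwise the other entities get escaped twice.
        return value.replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;")
                    .replace("\"", "&quot;")
                    .replace("'", "&apos;");
    }

    /** Writes a single element with an escaped text value. */
    static String element(String name, String value) {
        return "<" + name + ">" + escape(value) + "</" + name + ">";
    }

    /** Writes one contract, using the short name of a field when one is defined. */
    static String contract(Map<String, String> fields, Map<String, String> shortNames) {
        StringBuilder xml = new StringBuilder("<c>");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            String name = shortNames.getOrDefault(field.getKey(), field.getKey());
            xml.append(element(name, field.getValue()));
        }
        return xml.append("</c>").toString();
    }
}
```

The receiving side needs the same short-name table to map “cad” back to “ContractActivationDate”, so the table has to be shared by all servers.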
The post title is Web based Lucene application clustering… and Lucene was not even mentioned in this whole process. Yes, on purpose. Synchronizing Lucene indexes is very sensitive work when they are tied to database objects. I preferred to do the synchronization on top of Lucene and Hibernate Search to avoid concurrent access problems. The master and the slaves process every Dsync with the same method, so the result should be the same in all Lucene repositories.