1 – Why would someone use Solr to search on a wide-column database (HBase)?
The power of HBase search (scans) is not in filters; it is all about rowkey design. To take full advantage of HBase, you must know all your search queries at the moment of designing your database, so that all the “search” intelligence goes into your rowkeys. But what if you don’t know all your search criteria at the beginning? What if you need to add extra search criteria later? Would you create a new “view” of the data with another rowkey strategy? And what would you do if your client needs to search by proximity, or wants a “did you mean”-style feature?
There is no better answer to this question than “it depends”.
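To make the rowkey argument concrete, here is a minimal sketch of a composite rowkey design. The table name `events`, the user id, and the key layout `<userId>#<reversed epoch>` are all hypothetical examples, not anything from a real schema:

```shell
#!/bin/sh
# Hypothetical composite rowkey: <userId>#<reversed epoch seconds>.
# Reversing the timestamp makes a user's newest events sort first,
# so a cheap prefix scan answers "latest events for user X".
user="user42"
ts=1500000000
rev=$((9999999999 - ts))   # reversed timestamp for newest-first ordering
rowkey="${user}#${rev}"
echo "$rowkey"             # prints user42#8499999999

# In the HBase shell, a prefix scan over this design would look like:
#   scan 'events', {STARTROW => 'user42#', STOPROW => 'user42$'}
# Any query NOT anchored on userId (proximity, fuzzy matching, ...)
# falls outside the rowkey, and that is exactly where Solr comes in.
```

The point is that this key answers only the queries it was designed for; every new access pattern either needs a second table with a different key or an external index such as Solr.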
2 – Why didn’t we use Ambari for Solr deployment?
It is not officially integrated, it does not bring any added value, and it adds complexity to the ambari-agent scripts (which would have to be altered manually for this use case).
The scope of this post does not cover Hadoop/HBase setup. I assume that you have a running HBase environment with a master (HMaster) and two region servers (rs1 and rs2).
I’ll be using the HDP2.5 release from HortonWorks setup on CentOS 7.2.
1 – Setup Solr
Actually, I don’t want Ambari to manage my Solr instance: we have some specific configurations to add, and we don’t want to alter the default ambari-agent behaviour.
sudo rpm --import http://public-repo-1.hortonworks.com/HDP-SOLR-2.5-100/repos/centos6/RPM-GPG-KEY/RPM-GPG-KEY-Jenkins
cd /etc/yum.repos.d/
sudo wget http://public-repo-1.hortonworks.com/HDP-SOLR-2.5-100/repos/centos7/hdp-solr.repo
sudo yum install lucidworks-hdpsearch
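Once the package is installed, Solr still has to be started in SolrCloud mode so that collections are registered in ZooKeeper. A sketch, assuming the HDP Search install path used throughout this post and the ZK quorum `hmaster:2181` (adjust both to your environment; `hbaseCollection` is the collection name the indexer targets later):

```shell
# Start Solr in SolrCloud mode, pointing it at the ZooKeeper quorum
/opt/lucidworks-hdpsearch/solr/bin/solr start -c -z hmaster:2181

# Create the collection the HBase indexer will write to
/opt/lucidworks-hdpsearch/solr/bin/solr create -c hbaseCollection
```

Running without `-c` starts Solr in standalone mode, and the indexer’s `solr.zk`/`solr.collection` parameters would have nothing to resolve against.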
If Solr and HBase are not on the same machine (distributed architecture), you will probably face this ZK problem:
[WARN ][15:14:42,409][host:2181)] org.apache.zookeeper.ClientCnxn - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
java.io.IOException: Failed to connect with Zookeeper within timeout 30000, connection string: localhost:2181
    at com.ngdata.hbaseindexer.util.zookeeper.StateWatchingZooKeeper.<init>(StateWatchingZooKeeper.java:109)
    at com.ngdata.hbaseindexer.util.zookeeper.StateWatchingZooKeeper.<init>(StateWatchingZooKeeper.java:73)
    at com.ngdata.hbaseindexer.cli.BaseIndexCli.connectWithZooKeeper(BaseIndexCli.java:92)
    at com.ngdata.hbaseindexer.cli.BaseIndexCli.run(BaseIndexCli.java:79)
    at com.ngdata.hbaseindexer.cli.AddIndexerCli.run(AddIndexerCli.java:50)
    at com.ngdata.hbaseindexer.cli.BaseCli.run(BaseCli.java:69)
    at com.ngdata.hbaseindexer.cli.AddIndexerCli.main(AddIndexerCli.java:30)
Notice here that HBase-indexer is trying to reach ZooKeeper on localhost instead of the remote ZooKeeper host.
Do not forget the --zookeeper param in a distributed HBase-Indexer setup:
/opt/lucidworks-hdpsearch/hbase-indexer/bin/hbase-indexer add-indexer \
  -n hbaseindexer \
  -c /opt/lucidworks-hdpsearch/hbase-indexer/indexdemo-indexer.xml \
  -cp solr.zk=hmaster:2181 \
  -cp solr.collection=hbaseCollection \
  --zookeeper hmaster:2181
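After adding the indexer, you can confirm it was actually registered in ZooKeeper with the hbase-indexer CLI (same install path and ZK address as above; adjust to your environment):

```shell
# List the indexers registered in ZooKeeper; the one added above
# should appear by its name, "hbaseindexer"
/opt/lucidworks-hdpsearch/hbase-indexer/bin/hbase-indexer list-indexers \
  --zookeeper hmaster:2181
```

If the indexer does not show up here, nothing will be pushed to Solr no matter what you write into HBase.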
While testing the Hadoop database, HBase, I noticed that dedicated mirrors were very slow. Here’s another link using 4shared.
By the way, if you see this error while trying to start HBase (./start-hbase.sh):
$ ./start-hbase.sh
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/apache/hadoop/hbase/util/HBaseConfTool : Unsupported major.minor version 51.0
Class version 51.0 corresponds to Java 7, so be sure that you’re running the minimum required JDK (1.7) and that HBASE_HOME is set. I found this line in hbase-env.sh:
# The java implementation to use. Java 1.7+ required.
# export JAVA_HOME=/usr/java/jdk1.6.0/
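A minimal fix is to uncomment that line and point it at a Java 7+ installation. The JDK and HBase paths below are examples only; substitute your actual locations:

```shell
# hbase-env.sh — point HBase at a JDK 1.7+ install
# (both paths below are examples; use your real JDK and HBase locations)
export JAVA_HOME=/usr/java/jdk1.7.0_80
export HBASE_HOME=/opt/hbase
```

With JAVA_HOME corrected, ./start-hbase.sh should no longer hit the UnsupportedClassVersionError above.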