Salem's Euphoria

Setup HBase Indexer (Part 2)

1 – Why would someone use Solr to search on a wide-column database (HBase)?

The power of HBase search (scans) does not lie in filters; it is all about rowkey design. To take full advantage of HBase, you must know all your search queries at the moment you design your database, so that you can put all the “search” intelligence into your rowkeys. But what if you don’t know all your search criteria at the beginning? What if you need to add extra search criteria later? Would you create a new “view” of the data with another rowkey strategy? And what would you do if your client needs to search by “proximity”, or wants “did you mean”-style suggestions?

There is no better answer to this question than “it depends”.
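To make the rowkey point concrete, here is a minimal sketch (in Python, with made-up entities) of how a composite rowkey bakes one query pattern into the key: prefix scans by user are cheap, but any criterion not encoded in the key (say, searching by product) would require a full scan or a second table.

```python
# Composite rowkey: <user_id>#<reversed_timestamp> -- supports the query
# "latest orders of a given user" as a cheap prefix scan.
MAX_TS = 10**13

def make_rowkey(user_id: str, ts_millis: int) -> str:
    # Reversing the timestamp makes newer rows sort first lexicographically.
    return f"{user_id}#{MAX_TS - ts_millis:013d}"

def prefix_scan(rows: dict, prefix: str) -> list:
    # Stand-in for an HBase scan with a start/stop row on the prefix.
    return [k for k in sorted(rows) if k.startswith(prefix)]

rows = {
    make_rowkey("alice", 1_600_000_000_000): "order-1",
    make_rowkey("alice", 1_700_000_000_000): "order-2",
    make_rowkey("bob",   1_650_000_000_000): "order-3",
}

# All of alice's orders, newest first -- but there is no cheap way to
# scan by product: that criterion was never encoded in the key.
print(prefix_scan(rows, "alice#"))
```

This is exactly the limitation that pushes people toward a secondary index such as Solr: the rowkey answers the questions it was designed for, and nothing else.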


2 – Why did we not use Ambari for Solr deployment?

It is not officially integrated, it does not bring any added value, and it adds more complexity to the ambari-agent scripts (which would have to be altered manually for this use case).

3 – Why the Lucidworks distribution?

Simply because it ships hbase-indexer with it and is tested against these versions of the HDP components.

4 – How does Solr detect updates on HBase?

Lucidworks docs:

When using Solr with HDPSearch, you should run Solr in SolrCloud mode, which provides central configuration for a cluster of Solr servers, automatic load balancing and fail-over for queries, and distributed index replication.

SolrCloud relies on Apache ZooKeeper to coordinate requests between the nodes of the cluster. It’s recommended to use the ZooKeeper ensemble running with HDP 2.5.x for this purpose.

This is also why we need to add the list of all the running ZooKeeper servers in our environment. In my own experience, I have always set up a ZooKeeper ensemble of at least 3 serving instances (always an odd number).

5 – What is this “hbase.zookeeper.quorum”?

A quite long story, but Ed. J. Yoon made it shorter 😉

According to Wikipedia, Quorum is the minimum number of members of a deliberative body necessary to conduct the business of that group. Ordinarily, this is a majority of the people expected to be there, although many bodies may have a lower or higher quorum.
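In HBase terms, the “quorum” is simply the comma-separated list of ZooKeeper hosts that HBase (and the indexer) connect to. A minimal hbase-site.xml fragment might look like this (host names are placeholders for your own ensemble):

```xml
<property>
  <name>hbase.zookeeper.quorum</name>
  <!-- Comma-separated list of ZooKeeper servers; always an odd number of nodes -->
  <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
</property>
<property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>2181</value>
</property>
```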

6 – replication.source.ratio?

Despite its name, this setting is not about volume: it is the fraction of the peer cluster’s region servers that a source region server will pick as potential replication sinks (the hbase-indexer registers itself as such a replication peer). The description “maximum number of hlog entries to replicate in one go” actually belongs to a related setting, replication.source.nb.capacity: if that batch is large, and a consumer takes a while to process the events, the HBase rpc call will time out!!
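For reference, here is a hedged hbase-site.xml fragment with both replication settings; the values follow the hbase-indexer tutorial, so tune them for your own cluster:

```xml
<property>
  <!-- Consider every peer region server (here, the indexer) as a replication sink -->
  <name>replication.source.ratio</name>
  <value>1.0</value>
</property>
<property>
  <!-- Max number of hlog entries shipped per replication RPC -->
  <name>replication.source.nb.capacity</name>
  <value>1000</value>
</property>
```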

7 – Why “data_driven_schema_configs” when creating a Solr collection?

This is one way to tell Solr to create a “schemaless” collection. We are indexing a “NoSQL” schemaless database, so we don’t want to constrain it with a pre-built (managed) Solr schema.

The problem here is that Solr fields are filled through a mapping file, which is written and managed manually. So would the “DBA” have to alter this file every time a new HBase column appears?


The hbase-indexer configuration file supports dynamic (wildcard) mappings. It can define a field this way:

<field value="f0:a*" name="all_f0_fields_starting_with_a" type="string"/>

Here, the Solr field “all_f0_fields_starting_with_a” will collect every cell of the family “f0” whose qualifier starts with “a”, merged into one field.

This sounds OK to me because when searching in Solr, not all the HBase attributes are useful; we can instead specify patterns for the searchable fields/columns. For example, column qualifiers starting with “sa_” will go to “field0” in Solr:

<field value="f0:sa_*" name="field0" type="string"/>
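Putting it together, a minimal indexer definition (table name and field names here are illustrative, not from a real deployment) could look like:

```xml
<?xml version="1.0"?>
<!-- Maps HBase columns to Solr fields; mapping-type="row" produces
     one Solr document per HBase row -->
<indexer table="my_table" mapping-type="row">
  <field value="f0:sa_*" name="field0" type="string"/>
  <field value="f0:a*" name="all_f0_fields_starting_with_a" type="string"/>
</indexer>
```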

Be patient, the worst is yet to come 🙂


Author: Salem Ben Afia

Big Data & Java developer, Search Engine Architect, Lucene Expert
