Category Archives: BigData
If you want to remove an Ambari-driven install of HDP components, you will have to do it manually.
Actually, Ambari gives you a way to uninstall services and remove hosts, but this feature assumes that the component installation completed successfully. What if the process fails somewhere during installation?
In my case, I collected the commands from many posts and grouped them into a single script. You can find the script on my GitHub.
By the way, this is a very well written manual on how to set up HDP 2.5.
1 – Why would someone use Solr to search on a wide-column database (HBase)?
The power of HBase searches (scans) does not lie in filters; it is all about rowkey design. If you want to take full advantage of HBase, you must know all your search queries at the moment of designing your database. This way, you put all the “search” intelligence in your rowkeys. But what if you don’t know all your search criteria at the beginning? What if you need to add extra search criteria? Would you create a new “view” of the data with another rowkey strategy? What would you do if your client needs to search by “proximity” or wants a “did you mean” style of suggestion?
There is no answer to this question other than “it depends”.
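To make the rowkey argument concrete, here is a minimal sketch in plain Python (no HBase client involved). It only imitates the fact that HBase keeps rows sorted by rowkey; the `customer|timestamp` layout and the helper names are hypothetical, chosen for illustration only.

```python
from bisect import bisect_left


def make_rowkey(customer_id, timestamp):
    # Hypothetical composite rowkey: the leading field drives what can be scanned cheaply.
    return f"{customer_id}|{timestamp:013d}"


def prefix_scan(sorted_keys, prefix):
    """Imitate a rowkey prefix scan: jump to the first matching key, stop at the first miss."""
    start = bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches


def full_scan_filter(sorted_keys, predicate):
    """Imitate a filter on a non-leading field: every key has to be inspected."""
    return [key for key in sorted_keys if predicate(key)]


keys = sorted(
    make_rowkey(customer, ts)
    for customer, ts in [("cust42", 3), ("cust07", 1), ("cust42", 1), ("cust421", 5)]
)

# Query anticipated by the rowkey design: all rows of one customer.
by_customer = prefix_scan(keys, "cust42|")

# Query the design did not anticipate: all rows with a given timestamp.
by_timestamp = full_scan_filter(keys, lambda key: key.endswith("|" + f"{1:013d}"))
```

The first query touches only a contiguous range of keys, while the second has to walk the whole table. That second kind of query is where a secondary index such as Solr becomes attractive.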
2 – Why did we not use Ambari for Solr deployment?
It is not officially integrated, it does not bring any added value, and it adds complexity to the ambari-agent scripts (which must be altered manually for this use case).
The scope of this post does not cover Hadoop/HBase setup. I assume that you have a running HBase environment with a master (HMaster) and two region servers (rs1 and rs2).
I’ll be using the HDP 2.5 release from Hortonworks, set up on CentOS 7.2.
1 – Setup Solr
Actually, I don’t want Ambari to manage my Solr instance because we have some specific configurations to add and we won’t alter the default ambari-agent behaviour.
sudo rpm --import http://public-repo-1.hortonworks.com/HDP-SOLR-2.5-100/repos/centos6/RPM-GPG-KEY/RPM-GPG-KEY-Jenkins
cd /etc/yum.repos.d/
sudo wget http://public-repo-1.hortonworks.com/HDP-SOLR-2.5-100/repos/centos7/hdp-solr.repo
sudo yum install lucidworks-hdpsearch
If Solr and HBase are not on the same machine (distributed architecture), you will probably face this ZK problem:
[WARN ][15:14:42,409][host:2181)] org.apache.zookeeper.ClientCnxn - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
java.io.IOException: Failed to connect with Zookeeper within timeout 30000, connection string: localhost:2181
    at com.ngdata.hbaseindexer.util.zookeeper.StateWatchingZooKeeper.<init>(StateWatchingZooKeeper.java:109)
    at com.ngdata.hbaseindexer.util.zookeeper.StateWatchingZooKeeper.<init>(StateWatchingZooKeeper.java:73)
    at com.ngdata.hbaseindexer.cli.BaseIndexCli.connectWithZooKeeper(BaseIndexCli.java:92)
    at com.ngdata.hbaseindexer.cli.BaseIndexCli.run(BaseIndexCli.java:79)
    at com.ngdata.hbaseindexer.cli.AddIndexerCli.run(AddIndexerCli.java:50)
    at com.ngdata.hbaseindexer.cli.BaseCli.run(BaseCli.java:69)
    at com.ngdata.hbaseindexer.cli.AddIndexerCli.main(AddIndexerCli.java:30)
Notice that HBase-Indexer is trying to reach ZooKeeper on localhost (connection string: localhost:2181).
Do not forget the --zookeeper parameter in a distributed HBase-Indexer setup:
/opt/lucidworks-hdpsearch/hbase-indexer/bin/hbase-indexer add-indexer \
  -n hbaseindexer \
  -c /opt/lucidworks-hdpsearch/hbase-indexer/indexdemo-indexer.xml \
  -cp solr.zk=hmaster:2181 \
  -cp solr.collection=hbaseCollection \
  --zookeeper hmaster:2181
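Since the failure above is a plain “Connection refused”, it can help to check from the Solr host that the ZooKeeper address you pass is actually reachable before registering the indexer. The sketch below uses only the Python standard library; the helper names are mine, and it performs only a TCP connect, so it says nothing about ZooKeeper’s health beyond the port being open. It also ignores an optional chroot suffix in the connection string.

```python
import socket


def parse_zk_connect_string(connect_string, default_port=2181):
    """Split 'host1:2181,host2:2181' into (host, port) pairs (no chroot support)."""
    pairs = []
    for part in connect_string.split(","):
        part = part.strip()
        if not part:
            continue
        host, _, port = part.partition(":")
        pairs.append((host, int(port) if port else default_port))
    return pairs


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name resolution failures.
        return False


# Example usage (hmaster is the host name used in this post):
#   for host, port in parse_zk_connect_string("hmaster:2181"):
#       print(host, port, "reachable" if is_reachable(host, port) else "NOT reachable")
```

If the check fails for the address you intend to use, fix the network or ZooKeeper side first; if it only fails for localhost, the missing --zookeeper parameter is the likely culprit.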
This SlideShare introduction is quite interesting. It explains how the K-Means algorithm works.
(Credit to Subhas Kumar Ghosh)
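For readers who prefer code to slides, here is a minimal pure-Python sketch of the standard K-Means loop (assign each point to its nearest centroid, then recompute the centroids). It is not taken from the slides or from Mahout; the function name, the seeded random initialisation and the toy data are my own assumptions.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd-style K-Means on tuples of numbers; returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # assumption: random initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centroids[i])),
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break                                   # assignments are stable
        centroids = new_centroids
    return centroids, clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
```

On this toy data the two well-separated groups end up in separate clusters. Real data usually needs several random restarts and a sensible choice of k.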
One common problem with Hadoop is the unexplained hang when running a sample job. For instance, I’ve been testing Mahout (cluster-reuters) on a multinode Hadoop cluster (1 namenode, 2 slaves). A sample trace in my case looks like this:
15/10/17 12:09:06 INFO YarnClientImpl: Submitted application application_1445072191101_0026
15/10/17 12:09:06 INFO Job: The url to track the job: http://master.phd.net:8088/proxy/application_1445072191101_0026/
15/10/17 12:09:06 INFO Job: Running job: job_1445072191101_0026
15/10/17 12:09:14 INFO Job: Job job_1445072191101_0026 running in uber mode : false
15/10/17 12:09:14 INFO Job: map 0% reduce 0%
The jobs web console showed State = ACCEPTED, Final Status = UNDEFINED, and Tracking UI = UNASSIGNED.
The first thing I suspected was a warning thrown by the hadoop binary:
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
It had absolutely nothing to do with my problem: I rebuilt it from source, but the job still hung.
I reviewed the namenode logs: nothing special. Then the various YARN logs (resourcemanager, nodemanager): no problems. The slaves’ logs: same thing.
As the Hadoop logs gave little information, I searched the net for similar problems. It seems that this is a common problem related to memory configuration. I wonder why such problems are not yet logged (Hadoop 2.6). I even analyzed the memory consumption using JConsole, but nothing about it was alarming.
All the machines used are CentOS 6.5 virtual machines hosted on an i7-G5 laptop with 16 GB of RAM. After connecting and configuring the three machines, I realized that the allotted disk space (15 GB for the namenode, 6 GB for the slaves) was not enough. Checking the disk space usage (df -h) showed that only 3% of the disk space was available on the two slaves. This could have been an issue, but Hadoop reports such errors.
Looking in yarn-site.xml, I remembered that I had given YARN about 2 GB to run these test jobs.
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>2024</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>512</value>
</property>
I tried doubling this value on the namenode:
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>512</value>
</property>
Then I propagated the changes to the slaves and restarted everything (stop-dfs.sh && stop-yarn.sh && start-yarn.sh && start-dfs.sh).
Then it worked, finally 🙂
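One plausible reading of why the extra memory helped is simple container arithmetic. The sketch below assumes that YARN rounds each container request up to a multiple of yarn.scheduler.minimum-allocation-mb (my understanding of the default scheduler behaviour), and that the job used 1536 MB for the MapReduce ApplicationMaster and 1024 MB per map task. Those two request sizes are assumptions, since the post does not show mapred-site.xml. This is a back-of-the-envelope check, not a diagnosis.

```python
import math


def normalize_request(request_mb, min_alloc_mb):
    """Round a container request up to a multiple of the minimum allocation."""
    steps = max(1, math.ceil(request_mb / min_alloc_mb))
    return steps * min_alloc_mb


def containers_per_node(node_mb, request_mb, min_alloc_mb):
    """How many containers of one size fit in a given memory budget."""
    return node_mb // normalize_request(request_mb, min_alloc_mb)


MIN_ALLOC_MB = 512       # yarn.scheduler.minimum-allocation-mb, from the post
AM_REQUEST_MB = 1536     # assumption: ApplicationMaster request, not shown in the post
MAP_REQUEST_MB = 1024    # assumption: per-map-task request, not shown in the post

for node_mb in (2024, 4096):  # yarn.nodemanager.resource.memory-mb before and after
    am_mb = normalize_request(AM_REQUEST_MB, MIN_ALLOC_MB)
    left_mb = node_mb - am_mb
    maps = containers_per_node(left_mb, MAP_REQUEST_MB, MIN_ALLOC_MB)
    print(f"node={node_mb} MB: {left_mb} MB left beside the AM, room for {maps} map container(s)")
```

Under these assumptions, a 2024 MB node that hosts the ApplicationMaster has 488 MB left, which is not enough for a single 1024 MB map container, while a 4096 MB node has room for two. With only two slaves, that leaves very little headroom for the job to make progress.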
Now I’m trying to visualize the clustering results using Gephi.