Ambari uninstall scripts

 

If you want to remove HDP components that were installed through Ambari, you will have to do it manually.

Ambari does give you a way to uninstall services and remove hosts, but this feature assumes that the component installation completed successfully. What if the process fails somewhere during the install?

In my case, I collected all the commands from many posts and grouped them into one single script. You can find the script on my GitHub.
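For illustration, here is a minimal sketch of the kind of cleanup commands such a script groups together. This is not the actual script from my GitHub; the package names, service users and paths are assumptions that depend on which HDP components you actually installed, so review every line before running it as root.

# stop Ambari processes first
ambari-agent stop
ambari-server stop

# remove Ambari and HDP packages (adjust the list to your installed components)
yum remove -y ambari-server ambari-agent
yum remove -y "hadoop*" "hbase*" "zookeeper*" "hive*" "oozie*"

# clean leftover state, configuration and logs (paths may differ on your cluster)
rm -rf /var/lib/ambari-server /var/lib/ambari-agent
rm -rf /etc/hadoop /etc/hbase /etc/zookeeper
rm -rf /var/log/hadoop /var/log/hbase /var/log/zookeeper

# remove the service users created by the install
userdel -r hdfs
userdel -r hbase
userdel -r zookeeper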

By the way, this is a very well-written manual on how to set up HDP 2.5.

Setup HBase Indexer (Part 2)

1 – Why would someone use Solr to search on a wide-column database (HBase)?

The power of HBase search (scans) does not lie in filters; it is all about the rowkey design. If you want to take full advantage of HBase, you must know all your search queries at the moment of designing your database. This way, you put all the “search” intelligence into your rowkeys. But what if you don’t know all your search criteria at the beginning? What if you need to add extra search criteria later? Would you create a new “view” of the data with another rowkey strategy? What would you do if your client needs to search by “proximity” or wants a “did you mean” style of search?

There is no better answer to these questions than “it depends”.
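To make the rowkey point concrete, here is a small illustration in the HBase shell. It assumes a hypothetical “articles” table whose rowkeys are prefixed with the publication year: a year-based search is then a cheap range scan, while any criterion that is not encoded in the rowkey falls back to a full-table filter, or to an external index such as Solr.

# fast: the year is encoded in the rowkey, so this is a simple range scan
scan 'articles', {STARTROW => '2009_', STOPROW => '2009_~'}

# slow: the title is not in the rowkey, so HBase has to filter every row
scan 'articles', {FILTER => "ValueFilter(=, 'substring:Semantic')"}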

 

2 – Why did we not use Ambari for Solr deployment?

It is not officially integrated, it does not bring any added value, and it adds more complexity to the ambari-agent scripts (which would have to be altered manually for this use case).


Setup HBase Indexer (Part 1)

Prerequisites:

The scope of this post does not cover the Hadoop/HBase setup. I assume that you have a running HBase environment with a Master (HMaster) and two region servers (rs1 and rs2).

I’ll be using the HDP 2.5 release from Hortonworks, set up on CentOS 7.2.
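As a quick sanity check before going further, you can ask HBase for its status from the HMaster node (this assumes the hbase client is on your PATH):

echo "status 'simple'" | hbase shell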

1 – Setup Solr

Actually, I don’t want Ambari to manage my Solr instance, because we have some specific configurations to add and we don’t want to alter the default ambari-agent behaviour.

sudo rpm --import http://public-repo-1.hortonworks.com/HDP-SOLR-2.5-100/repos/centos6/RPM-GPG-KEY/RPM-GPG-KEY-Jenkins
cd /etc/yum.repos.d/
sudo wget http://public-repo-1.hortonworks.com/HDP-SOLR-2.5-100/repos/centos7/hdp-solr.repo
sudo yum install lucidworks-hdpsearch
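Once the package is installed, I start Solr in SolrCloud mode against the cluster’s existing ZooKeeper. The install path and the ZooKeeper address below are assumptions based on my setup (lucidworks-hdpsearch lands under /opt/lucidworks-hdpsearch, and my ZooKeeper runs on hmaster); adjust them to yours:

# start Solr in cloud mode, pointing at the existing ZooKeeper quorum
/opt/lucidworks-hdpsearch/solr/bin/solr start -c -z hmaster:2181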


HBase, Zookeeper and Solr

 

If Solr and HBase are not on the same machine (distributed architecture), you will probably face this ZK problem:

[WARN ][15:14:42,409][host:2181)] org.apache.zookeeper.ClientCnxn - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)

java.io.IOException: Failed to connect with Zookeeper within timeout 30000, 
connection string: localhost:2181
 at com.ngdata.hbaseindexer.util.zookeeper.StateWatchingZooKeeper.<init>(StateWatchingZooKeeper.java:109)
 at com.ngdata.hbaseindexer.util.zookeeper.StateWatchingZooKeeper.<init>(StateWatchingZooKeeper.java:73)
 at com.ngdata.hbaseindexer.cli.BaseIndexCli.connectWithZooKeeper(BaseIndexCli.java:92)
 at com.ngdata.hbaseindexer.cli.BaseIndexCli.run(BaseIndexCli.java:79)
 at com.ngdata.hbaseindexer.cli.AddIndexerCli.run(AddIndexerCli.java:50)
 at com.ngdata.hbaseindexer.cli.BaseCli.run(BaseCli.java:69)
 at com.ngdata.hbaseindexer.cli.AddIndexerCli.main(AddIndexerCli.java:30)

Notice here that the HBase Indexer is trying to reach ZooKeeper on localhost.
Do not forget the --zookeeper parameter in a distributed HBase Indexer setup:
/opt/lucidworks-hdpsearch/hbase-indexer/bin/hbase-indexer add-indexer -n hbaseindexer -c /opt/lucidworks-hdpsearch/hbase-indexer/indexdemo-indexer.xml -cp solr.zk=hmaster:2181 -cp solr.collection=hbaseCollection --zookeeper hmaster:2181
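To check that the indexer was registered against the right ZooKeeper, you can list the indexers afterwards (same assumed ZooKeeper address as above):

/opt/lucidworks-hdpsearch/hbase-indexer/bin/hbase-indexer list-indexers --zookeeper hmaster:2181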


CentOS 7.2 VM hangs on startup

Host: Windows 10
Guest: CentOS 7.2 x64
Tool: Oracle VirtualBox 5.2

Problem:

For an unknown reason, your VM won’t start up and hangs on boot. You have tried, like me, many VBoxManage commands, converting the raw image to VDI, fixing file access rights, … but you’re still stuck on this screen:

CentOS hangs here

 

Solution:

When your machine gives you the boot menu, press “e” to enter the GRUB configuration. Look for the “rhgb” parameter in the kernel boot command and delete it.


This parameter should be on the line starting with “linux16”.

Add the parameter “systemd.unit=multi-user.target” at the end of the same line.

Press Ctrl+X to boot the VM; it should now start correctly (in non-graphical, multi-user mode).
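Note that editing the entry from the GRUB menu only applies to that one boot. If the workaround works for you and you want to make it permanent, the usual CentOS 7 way is sketched below (run inside the guest as root, after removing “rhgb” from GRUB_CMDLINE_LINUX in /etc/default/grub):

# regenerate the GRUB configuration with the edited kernel command line
grub2-mkconfig -o /boot/grub2/grub.cfg

# boot to the text (multi-user) target by default instead of the graphical one
systemctl set-default multi-user.target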

Logstash as CSV pump to ElasticSearch


Ever wondered if Logstash can pump comma-separated data into Elasticsearch? Yes, the whiskers man can do more!

This configuration was tested against Logstash 2.4.0 and Elasticsearch 2.4.1 on Windows 7.

Our input data are tab-separated values describing academic articles, formatted as:

CODE   TITLE   YEAR

An excerpt from our input:


W09-2307    Discriminative Reordering with Chinese Grammatical Relations Features    2009
W04-2607    Non-Classical Lexical Semantic Relations    2004
W01-1314    A System For Extraction Of... And Semantic Constraints    2001
W04-1910    Bootstrapping Parallel Treebanks    2004
W09-3306    Evaluating a Statistical CCG Parser on Wikipedia    2009

I created a file named tab-articles.conf to process this input:


input {
  file {
    path => ["D:/csv/*.txt"]
    type => "core2"
    start_position => "beginning"
  }
}

filter {
  csv {
    columns => ["code","title","year"]
    separator => "	"
  }
}

output {
  elasticsearch {
    action => "index"
    hosts => ["localhost"]
    index => "papers-%{+YYYY.MM.dd}"
    workers => 1
  }
}

Note that the separator field of the csv filter does not contain “\t” (the escaped TAB character), but the raw value of a TAB, exactly as it appears in the input file.
Note also that the output server attribute is [hosts] and not [host].

Now, check that your Elasticsearch server is running, and run the following command:


logstash.bat -f tab-articles.conf

If you already have a file called articles.txt under D:\csv, it won’t be injected into ES, because Logstash is mainly intended for log parsing, and thus acts by default as a “tail -f” reader.

So, after starting Logstash, copy your file into the configured input directory.
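If you would rather have Logstash re-read files that already exist, a common workaround (an assumption on my side, I did not need it for this post) is to disable the sincedb bookkeeping of the file input:

input {
  file {
    path => ["D:/csv/*.txt"]
    type => "core2"
    start_position => "beginning"
    sincedb_path => "NUL"    # Windows; use "/dev/null" on Linux
  }
}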

Nutch & Solr 6

Intro

This “project” is about setting up Apache Nutch (v1.12) with Solr (6). The main change between this configuration and older ones is that Solr 6 no longer uses the schema.xml file for document parsing. Instead, Solr uses a managed schema, whose config file starts with:


<!-- Solr managed schema - automatically generated - DO NOT EDIT -->

So, just ignore that warning. Back up the genuine file and start editing. This file is usually located under {SOLR_INSTALL_DIR}/server/solr/{core.dir}/conf/

In my case, for instance, it is /opt/solr621/server/solr/nutchdir/conf/.

You can use the file attached to this project or edit it yourself by adding the fields from the schema.xml under {NUTCH_INSTALL}/conf.
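Before editing, I back up the generated file; a simple copy like the one below does it (the path is the example core from above, adjust it to yours):

cp /opt/solr621/server/solr/nutchdir/conf/managed-schema /opt/solr621/server/solr/nutchdir/conf/managed-schema.orig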
Using NUTCH

Let /opt/nutch112 be our install directory for Nutch. We’ll be crawling Amazon.com for some Best Sellers Rank (BSR) pages. To crawl using Apache Nutch, follow these steps:


mkdir /opt/nutch112/urls /opt/nutch112/amzcom

echo https://www.amazon.com/Best-Sellers-Sports-Outdoors/zgbs/sporting-goods/ref=zg_bs_nav_0 > /opt/nutch112/urls/seeds.txt

bin/nutch inject amzcom/db urls

bin/nutch generate amzcom/db amzcom/segs

s1=`ls -d amzcom/segs/2* | tail -1`

bin/nutch fetch $s1

bin/nutch parse $s1

bin/nutch updatedb amzcom/db $s1

bin/nutch generate amzcom/db amzcom/segs

s1=`ls -d amzcom/segs/2* | tail -1`

bin/nutch fetch $s1

You can repeat the generate/fetch/parse/updatedb steps as many times as you want.

Once done with fetching, invert the links:


bin/nutch invertlinks amzcom/linkdb -dir amzcom/segs

Now, you can either dump the binary segments to see their content, read the link database to show the parsed links, or move the data to Solr. To dump physical files from the stored segments:


bin/nutch dump -segment amzcom/segs/ -outputDir amzcom/dump0/

To read the fetched links from the latest segment:

bin/nutch readlink $s1 -dump amzcom/dumplnk

To migrate the data to Solr, Solr must be running and you must have at least one core configured with the updated managed-schema:

bin/nutch solrindex http://slmsrv:8983/solr/nutch amzcom/db/ -linkdb amzcom/linkdb/ amzcom/segs/20160924132600 -filter -normalize
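Once indexing finishes, a quick way to check that documents actually landed in Solr is to query the core directly; slmsrv and the nutch core name are the same assumptions as in the command above:

curl "http://slmsrv:8983/solr/nutch/select?q=*:*&rows=5&wt=json"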

 

De-Googling myself! (Step 2)

After kicking Gmail out, there are still some problems with transferring my mail. I managed to get a full backup using this old but quite efficient tool: Gmail Backup. I think I’ll be using Thunderbird for desktop use.

 


The next big thing to deal with is moving my GDrive data somewhere else, outside my computer, and it must not be Dropbox. I’m not sure yet that it is the best option, but I like the way Kim Dotcom deals with his problems. So, I decided to stick with megaupload v2: Mega.

Cloudwards suggests sync.com as the best alternative to Dropbox, even though Mega offers 50GB of forever-free storage (vs 5GB for sync.com). That is what I need for the moment. Moreover, Mega has a valuable tool called MEGAsync. So, all I had to do was download my drive to a single directory using the desktop Google Drive app, then point that directory as a sync source to MEGAsync… and keep my computer connected for one night :).

De-Googling myself! (Step 1)

Many of us will never realize how Google-dependent we are until we reach the maximum free storage capacity.


But what if Google decides to remove the free offer? What could prevent them from doing it? WhatsApp tried, but backed down after a while… and Google is not WhatsApp. Did you ever think about how much Google knows about you?

So I decided to start de-Googling myself. There are a lot of alternatives; I just have to be more patient/tolerant with the open-source ones and choose carefully. I uninstalled the greedy Google Chrome, downloaded all my files from Google Drive, and I’m now looking for a new solution for mail and remote storage.

Many “alternatives to” websites suggest using “mail.com”. It seems clean and quite interesting as a domain name. But when I receive this kind of message on registration, I’m not sure I can go further.


Then, I came across this beautiful “TutaNota”.


A replacement for Google Services. This is what I’m looking for.

And here is where you can reach me now: sba@keemail.me

Roadmap to master BigData World

Data scientist roadmap infographic (source: nirvacana.com)