This SlideShare introduction is quite interesting. It explains how the K-Means algorithm works.
(Credit to Subhas Kumar Ghosh)
One common problem with Hadoop is an unexplained hang when running a sample job. For instance, I’ve been testing Mahout (cluster-reuters) on a multi-node Hadoop cluster (1 namenode, 2 slaves). A sample trace in my case looks like this:
15/10/17 12:09:06 INFO YarnClientImpl: Submitted application application_1445072191101_0026
15/10/17 12:09:06 INFO Job: The url to track the job: http://master.phd.net:8088/proxy/application_1445072191101_0026/
15/10/17 12:09:06 INFO Job: Running job: job_1445072191101_0026
15/10/17 12:09:14 INFO Job: Job job_1445072191101_0026 running in uber mode : false
15/10/17 12:09:14 INFO Job: map 0% reduce 0%
The jobs web console showed the job with State = ACCEPTED, Final Status = UNDEFINED and Tracking UI = UNASSIGNED.
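The same three fields can also be read without the web console. Here is a minimal sketch, assuming the ResourceManager REST endpoint `/ws/v1/cluster/apps/<application id>` is reachable on the host and port shown in the tracking URL above. The helper names `summarize_app` and `fetch_app` are mine, not part of Hadoop.

```python
import json
from urllib.request import urlopen


def summarize_app(payload):
    """Pick the fields the web console shows out of a ResourceManager reply."""
    app = payload["app"]
    return {
        "state": app.get("state"),
        "finalStatus": app.get("finalStatus"),
        "trackingUI": app.get("trackingUI"),
    }


def fetch_app(rm_url, app_id):
    """Query the ResourceManager REST API for one application (needs network access)."""
    with urlopen(f"{rm_url}/ws/v1/cluster/apps/{app_id}") as resp:
        return summarize_app(json.load(resp))


# Example call, using the host and application id from the trace above:
# fetch_app("http://master.phd.net:8088", "application_1445072191101_0026")
```

Polling this in a loop is a cheap way to see whether the application ever leaves the ACCEPTED state after a configuration change.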
The first thing I suspected was a warning thrown by the hadoop binary:
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
It had absolutely nothing to do with my problem: I rebuilt the native library from source, but the job still hung.
I reviewed the namenode logs: nothing special. Then the various YARN logs (resourcemanager, nodemanager): no problems. The slaves’ logs: same thing.
As the Hadoop logs did not give much information, I searched the net for similar problems. It seems that this is a common problem related to memory configuration. I wonder why such problems are still not logged (Hadoop 2.6). I even analyzed the memory consumption using JConsole, but nothing looked alarming.
All the machines are CentOS 6.5 virtual machines hosted on an i7-G5 laptop with 16 GB of RAM. After connecting and configuring the three machines, I realized that the allocated disk space (15 GB for the namenode, 6 GB for the slaves) was not enough. Checking the disk usage (df -h) showed that only 3% of the disk space was still available on the two slaves. This could be an issue, but Hadoop normally reports such errors.
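The df -h check is easy to script so it can be repeated on each slave. Here is a minimal sketch using only Python's standard library. The function names and the 10% threshold are arbitrary choices of mine, not Hadoop settings.

```python
import shutil


def free_percent(path="/"):
    """Return the share of free space on the filesystem holding `path`, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def is_low_on_space(path="/", threshold=10.0):
    """True when less than `threshold` percent of the disk is free."""
    return free_percent(path) < threshold


print(f"{free_percent('.'):.1f}% free, low on space: {is_low_on_space('.')}")
```

Pointing `path` at the directory that holds the HDFS data or the YARN local directories gives the figure that matters for the cluster, rather than the root filesystem.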
Looking at yarn-site.xml, I remembered that I had given YARN only about 2 GB to run these test jobs:
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>2024</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>512</value>
</property>
I tried doubling this value on the namenode:
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>512</value>
</property>
Then I propagated the changes to the slaves and restarted everything (stop-dfs.sh && stop-yarn.sh && start-yarn.sh && start-dfs.sh).
It works, finally 🙂
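To get a feeling for why 2024 MB is tight, here is a rough back-of-the-envelope sketch. It assumes the scheduler rounds every request up to a multiple of yarn.scheduler.minimum-allocation-mb, and it assumes request sizes of 1536 MB for the MapReduce application master and 1024 MB per map task. Those two sizes are my assumptions (check mapred-site.xml for the real ones), and the function names are made up for the illustration.

```python
def round_up(request_mb, min_alloc_mb):
    """Round a request up to the next multiple of the minimum allocation."""
    return -(-request_mb // min_alloc_mb) * min_alloc_mb


def map_containers_next_to_am(node_mb, min_alloc_mb, am_mb=1536, map_mb=1024):
    """How many map containers fit on one node that also hosts the application master.

    am_mb and map_mb are assumed request sizes, not values from this post.
    Returns -1 when even the application master does not fit.
    """
    am = round_up(am_mb, min_alloc_mb)
    if am > node_mb:
        return -1
    return (node_mb - am) // round_up(map_mb, min_alloc_mb)


for node_mb in (2024, 4096):
    print(node_mb, map_containers_next_to_am(node_mb, 512))
# prints:
# 2024 0
# 4096 2
```

This only shows how little headroom the old value leaves on a node. It does not prove that this was the exact cause of the hang.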
Now I’m trying to visualize the clustering results using Gephi.
This simple MySQL command for the InnoDB engine may save you 30 minutes if you want to remove the duplicate copies of a row. Consider a table called “your_table_name” with many columns: column1, column2, column3, … Say you define a duplicate as a row that has the same values in column1 and column2 as another row. You just have to create a unique index on those columns.
ALTER IGNORE TABLE your_table_name ADD UNIQUE INDEX give_ur_index_a_name (column1, column2 );
Commit, and check your table again… the duplicate rows are gone!
Update: In some versions of MySQL, “ALTER IGNORE” won’t ignore the duplicate key problem. So you may have to run the following command first:
set session old_alter_table=1;
After adding the index, do not forget to set old_alter_table to “0” again.
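On MySQL versions where ALTER IGNORE is not available at all (the IGNORE clause was removed in 5.7.4), the same result can be reached in two steps: delete the extra rows first, then add the unique index. The sketch below is not the MySQL command above. It shows the idea with Python's built-in sqlite3 module on an in-memory table, so the sample rows are invented and SQLite's `rowid` stands in for whatever key your real table has.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table_name (column1 TEXT, column2 TEXT, column3 TEXT)")
conn.executemany(
    "INSERT INTO your_table_name VALUES (?, ?, ?)",
    [("a", "x", "first"), ("a", "x", "second"), ("b", "y", "third")],
)

# Keep the earliest row of each (column1, column2) group, delete the rest.
conn.execute(
    """
    DELETE FROM your_table_name
    WHERE rowid NOT IN (
        SELECT MIN(rowid) FROM your_table_name GROUP BY column1, column2
    )
    """
)

# With the duplicates gone, the unique index can be created without IGNORE.
conn.execute(
    "CREATE UNIQUE INDEX give_ur_index_a_name ON your_table_name (column1, column2)"
)

rows = conn.execute(
    "SELECT column1, column2, column3 FROM your_table_name ORDER BY column1"
).fetchall()
print(rows)  # [('a', 'x', 'first'), ('b', 'y', 'third')]
```

Unlike the one-line ALTER, this makes explicit which copy survives (here the one inserted first), which is worth deciding on purpose when the other columns differ.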
In SQL Server, if your scheduled maintenance plan includes a backup of your database and its transaction log, and it fails, you can lose a lot of disk space. Here’s how to reclaim that space quickly, but you should also revisit your maintenance plan… that’s the better idea.
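USE [DATABASE_NAME]
GO
CHECKPOINT
-- Truncate the log without backing it up.
-- Note: WITH NO_LOG was removed in SQL Server 2008, so this statement fails on newer versions.
BACKUP TRANSACTION [DATABASE_NAME] WITH NO_LOG
-- Shrink the log file to a target size of 5000 MB.
DBCC SHRINKFILE('[DATABASE_LOG_FILE]', 5000)
-- Shrink the whole database, leaving 10 percent free space.
DBCC SHRINKDATABASE('[DATABASE_NAME]', 10)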
This is a new series of posts I have always told myself to start. It is a reminder of useful Oracle queries, Unix commands and so on…
The first one is an Oracle SELECT statement that lists, for each foreign key of a table, its name, column, referenced table and referenced column.
SELECT c_list.CONSTRAINT_NAME as FK_NAME,
       substr(c_src.COLUMN_NAME, 1, 20) as SRC_COLUMN,
       c_dest.TABLE_NAME as DEST_TABLE,
       substr(c_dest.COLUMN_NAME, 1, 20) as DEST_COLUMN
FROM ALL_CONSTRAINTS c_list,
     ALL_CONS_COLUMNS c_src,
     ALL_CONS_COLUMNS c_dest
WHERE c_list.CONSTRAINT_NAME = c_src.CONSTRAINT_NAME
  AND c_list.OWNER = c_src.OWNER                        -- same schema as the foreign key
  AND c_list.R_CONSTRAINT_NAME = c_dest.CONSTRAINT_NAME
  AND c_list.R_OWNER = c_dest.OWNER                     -- schema of the referenced key
  AND c_src.POSITION = c_dest.POSITION                  -- pair up columns of composite keys
  AND c_list.CONSTRAINT_TYPE = 'R'                      -- 'R' = referential (foreign key)
  AND c_src.TABLE_NAME = '<your-table-here>'
GROUP BY c_list.CONSTRAINT_NAME,
         c_src.TABLE_NAME,
         c_src.COLUMN_NAME,
         c_dest.TABLE_NAME,
         c_dest.COLUMN_NAME;