
Showing posts from July, 2012

HOW TO RUN MAPREDUCE PROGRAMS USING ECLIPSE

Hadoop provides a plugin for Eclipse that helps us connect our Hadoop cluster to Eclipse. We can then run MapReduce jobs and browse HDFS from within Eclipse itself. But a few things need to be done in order to achieve that. Normally, it is said that we just have to copy hadoop-eclipse-plugin-*.jar to the eclipse/plugins directory in order to get things going. But unfortunately it did not work for me. When I tried to connect Eclipse to my Hadoop cluster it threw this error: An internal error occurred during: "Map/Reduce location status updater". org/codehaus/jackson/map/JsonMappingException You may face a different error, but it would be somewhat similar to this. This is because some required jars are missing from the plugin that ships with Hadoop. Then I tried a few things and they turned out to work. So I thought of sharing it, so that anybody else facing the same issue can try it out. Just try the steps outli…
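For context, this is roughly the "normal" installation the post refers to; the $HADOOP_HOME and Eclipse paths here are assumptions for illustration, and on Hadoop 1.x releases the plugin jar typically sits under contrib/eclipse-plugin:

# Copy the Map/Reduce plugin shipped with Hadoop into Eclipse's plugins directory
# (paths are examples - adjust them to where Hadoop and Eclipse live on your machine)
$ cp $HADOOP_HOME/contrib/eclipse-plugin/hadoop-eclipse-plugin-*.jar ~/eclipse/plugins/
# Restart Eclipse with -clean so the newly added plugin is picked up
$ ~/eclipse/eclipse -clean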

HOW TO SETUP AND CONFIGURE 'ssh' ON LINUX (UBUNTU)

SSH (Secure Shell) is a network protocol for secure data communication, remote shell services or command execution, and other secure network services between two networked computers, which it connects via a secure channel over an insecure network. The ssh server runs on one machine (the server) and the ssh client runs on another machine (the client). ssh has 2 main components:
1- ssh : the command we use to connect to remote machines - the client.
2- sshd : the daemon that runs on the server and allows clients to connect to it.
ssh is pre-enabled on Linux, but in order to start the sshd daemon, we need to install ssh first. Use this command to do that:
$ sudo apt-get install ssh
This will install ssh on your machine. In order to check if ssh is set up properly, do this:
$ which ssh
It will print this line on your terminal:
/usr/bin/ssh
$ which sshd
It will print this line on your terminal:
/usr/bin/sshd
SSH uses public-key cryptography to authenticate the remote co…
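As a minimal sketch of the passwordless login this builds on (assuming the default RSA key locations under ~/.ssh):

# Generate an RSA keypair with an empty passphrase for passwordless logins
$ ssh-keygen -t rsa -P ""
# Authorize the new public key for logins to this machine
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
# Test it - this should now log in without asking for a password
$ ssh localhost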

HOW TO CONFIGURE HADOOP

You can find countless posts on the same topic all over the internet, and most of them are really good. But quite often newbies face issues even after doing everything as specified. I was no exception. In fact, many times friends who are just starting their Hadoop journey call me up and tell me that they are facing issues even after doing everything in order. So I thought of writing down the things that worked for me. I am not going into detail, as there are many better posts that outline everything pretty well; I'll just show how to configure Hadoop on a single Linux box in pseudo-distributed mode.
Prerequisites:
1- Sun (Oracle) Java must be installed on the machine.
2- ssh must be installed and a keypair must already be generated.
NOTE: Ubuntu comes with its own Java compiler (i.e., OpenJDK), but Sun (Oracle) Java is the preferred choice for Hadoop. You can visit this link if you need some help on how to install it. NOTE: You can visit this link if…
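For a flavor of what pseudo-distributed mode involves, here is a minimal sketch of the classic Hadoop 1.x single-node configuration; the localhost ports (9000, 9001) are the conventional examples from that era, and $HADOOP_HOME stands in for wherever Hadoop is unpacked:

# conf/core-site.xml - point the default filesystem at a local HDFS instance
$ cat > $HADOOP_HOME/conf/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF
# conf/hdfs-site.xml - a single node can only hold one replica of each block
$ cat > $HADOOP_HOME/conf/hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
# conf/mapred-site.xml - run the JobTracker locally
$ cat > $HADOOP_HOME/conf/mapred-site.xml <<'EOF'
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
EOF
# Format the namenode once, then start all daemons
$ $HADOOP_HOME/bin/hadoop namenode -format
$ $HADOOP_HOME/bin/start-all.sh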

HOW TO INSTALL SUN(ORACLE) JAVA ON UBUNTU 12.04 IN 3 EASY STEPS

If you have upgraded to Ubuntu 12.04 or just made a fresh Ubuntu installation, you might want to install Sun (Oracle) Java on it. Although Ubuntu has its own JDK, OpenJDK, there are certain things that demand Sun (Oracle) Java. You can follow the steps shown below to do that -
1 - Add the "WEBUPD8" PPA:
hadoop@master:~$ sudo add-apt-repository ppa:webupd8team/java
2 - Update the repositories:
hadoop@master:~$ sudo apt-get update
3 - Begin the installation:
hadoop@master:~$ sudo apt-get install oracle-java7-installer
Now, to test whether the installation was OK, do this:
hadoop@master:~$ java -version
If everything went fine you should see something like this on your terminal:
hadoop@master:~$ java -version
java version "1.7.0_05"
Java(TM) SE Runtime Environment (build 1.7.0_05-b05)
Java HotSpot(TM) 64-Bit Server VM (build 23.1-b03, mixed mode)
hadoop@master:~$
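Hadoop will also need JAVA_HOME pointing at this installation. As a small follow-up sketch - the /usr/lib/jvm/java-7-oracle path is where the WebUpd8 package usually puts the JDK, but confirm it on your own box:

# The installer usually places the JDK here (an assumption - check with: ls /usr/lib/jvm)
$ echo 'export JAVA_HOME=/usr/lib/jvm/java-7-oracle' >> ~/.bashrc
$ source ~/.bashrc
$ echo $JAVA_HOME
/usr/lib/jvm/java-7-oracle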

BETWEEN OPERATOR IN HIVE

Hive is a wonderful tool for those who like to perform batch operations to process their large amounts of data residing on a Hadoop cluster, and who are comparatively new to the NoSQL world. Not only does it provide warehousing capabilities on top of a Hadoop cluster, but also a superb SQL-like interface that makes it very easy to use and makes task execution feel more familiar. But one thing that newbies like me always wanted to have is support for the BETWEEN operator in Hive. Since the release of version 0.9.0 earlier this year, Hive provides some new and very useful features, and the BETWEEN operator is one among those.
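As a quick illustration of the operator itself (the sales table and its columns are hypothetical, made up just for this example):

# BETWEEN is inclusive on both ends: this keeps rows where price is in [10, 20]
$ hive -e "SELECT id, price FROM sales WHERE price BETWEEN 10 AND 20;"
# Equivalent to the longer form you had to write before 0.9.0:
$ hive -e "SELECT id, price FROM sales WHERE price >= 10 AND price <= 20;"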