Posts

Showing posts from June, 2012

HBASE COUNTERS (PART I)

Apart from its various useful features, Hbase provides another advanced and useful feature called COUNTERS. Hbase gives us a mechanism to treat columns as counters. Counters allow us to increment a column value with the least possible overhead. Advantage of using Counters: to increment a value stored in a table the usual way, we would have to lock the row, read the value, increment it, write it back to the table, and finally remove the lock from the row so that it can be used by other clients. This could keep a row locked for a long period and may cause a clash between clients trying to access the same row. Counters help us overcome this problem, as increments are done under a single row lock, so write operations to a row are synchronized. Older versions of Hbase supported calls which involved one RPC per counter update, but the newer versions allow us to bundle multiple counters in a single RPC call. Counters are limited to a single row, though we can…
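
As a rough illustration of the counter API described above (a minimal sketch, not code from the post; the table name, column family, and qualifiers are invented, and it assumes an HBase 0.92-era client on the classpath):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Increment;
  import org.apache.hadoop.hbase.util.Bytes;

  public class CounterSketch {
      public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          HTable table = new HTable(conf, "counters");   // hypothetical table

          // Single counter: the increment is applied atomically on the server,
          // with no explicit lock/read/modify/write cycle on the client side.
          long hits = table.incrementColumnValue(Bytes.toBytes("20120620"),
                  Bytes.toBytes("daily"), Bytes.toBytes("hits"), 1L);
          System.out.println("hits = " + hits);

          // Newer API: bundle several counters of the same row into one RPC.
          Increment incr = new Increment(Bytes.toBytes("20120620"));
          incr.addColumn(Bytes.toBytes("daily"), Bytes.toBytes("hits"), 1L);
          incr.addColumn(Bytes.toBytes("daily"), Bytes.toBytes("misses"), 1L);
          table.increment(incr);   // returns a Result holding the new counter values

          table.close();
      }
  }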

FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Unexpected exception caught. NestedThrowables: java.lang.reflect.InvocationTargetException FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

Quite often the very first blow that people starting their Hive journey face is this exception: FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Unexpected exception caught. NestedThrowables: java.lang.reflect.InvocationTargetException FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask And the worst part is that even after Googling for hours and trying out different solutions provided by different people, this error just keeps on haunting them. The solution is simple. You just have to change the ownership and permissions of the Hive directories which you have created for warehousing your data. Use the following commands to do that: $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp $HADOOP_HOME/bin/hadoop fs -chmod 777 /tmp $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse $HADOOP_HOME/bin/hadoop fs -chmod 777 /user/hive/warehouse You are good to go now. Keep Hadooping. NOTE : Please do not forget to tell me…

HOW TO CONFIGURE HBASE IN PSEUDO DISTRIBUTED MODE ON A SINGLE LINUX BOX

If you have successfully configured Hadoop on a single machine in pseudo-distributed mode and are looking for some help to use Hbase on top of that, then you may find this writeup useful. Please let me know if you face any issue. Since you are able to use Hadoop, I am assuming you have all the pieces in place, so we'll directly start with the Hbase configuration. Please follow the steps shown below to do that: 1 - Download the Hbase release from one of the mirrors using the link shown below, then unzip it at some convenient location (I'll call this location HBASE_HOME from now on) - http://apache.techartifact.com/mirror/hbase/ 2 - Go to the /conf directory inside the unzipped HBASE_HOME and make these changes: - In the hbase-env.sh file modify these lines as shown: export JAVA_HOME=/usr/lib/jvm/java-6-sun export HBASE_REGIONSERVERS=/PATH_TO_YOUR_HBASE_FOLDER/conf/regionservers export HBASE_MANAGES_ZK=true…
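
The excerpt above cuts off mid-step, so purely for context: a pseudo-distributed setup typically also needs a few entries in HBASE_HOME/conf/hbase-site.xml along the lines below. This is a hedged sketch, not the post's own step; the HDFS address (localhost:9000) and the ZooKeeper quorum are assumptions that must match your Hadoop core-site.xml.

  <configuration>
    <property>
      <!-- where Hbase keeps its data; must point at your running HDFS -->
      <name>hbase.rootdir</name>
      <value>hdfs://localhost:9000/hbase</value>
    </property>
    <property>
      <!-- pseudo-distributed: master and region server run as separate processes -->
      <name>hbase.cluster.distributed</name>
      <value>true</value>
    </property>
    <property>
      <name>hbase.zookeeper.quorum</name>
      <value>localhost</value>
    </property>
  </configuration>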

HOW TO USE %{host} ESCAPE SEQUENCE IN FLUME-NG

Sometimes you may try to aggregate data from different sources and dump it into a common location, say your HDFS. In such a scenario it is useful to create a directory inside HDFS corresponding to each host machine. To do this, FLUME-NG provides a suitable escape sequence, %{host}. Unfortunately it was not working with early releases of FLUME-NG; in that case the only solution was to create a custom interceptor that adds a host header key to each event, along with the corresponding hostname as the header value. But, luckily, the folks at Cloudera did a great job and contributed an interceptor that provides this feature out of the box. Now we just have to add a few lines to our configuration file and we are good to go. For example, suppose we are collecting Apache web server logs from different hosts into a directory called flume inside HDFS. It would be quite fussy to figure out which log is coming from which host, so we'll use %{host} in our agent configuration file…
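
A minimal agent configuration sketch of the idea (the agent and component names, the Apache log path, and the NameNode address are made-up placeholders, and it assumes a Flume-NG build recent enough to ship the host interceptor):

  agent1.sources = apache-logs
  agent1.channels = mem-channel
  agent1.sinks = hdfs-sink

  # tail the local Apache access log
  agent1.sources.apache-logs.type = exec
  agent1.sources.apache-logs.command = tail -F /var/log/apache2/access.log
  agent1.sources.apache-logs.channels = mem-channel

  # the host interceptor stamps every event with a "host" header
  agent1.sources.apache-logs.interceptors = hostint
  agent1.sources.apache-logs.interceptors.hostint.type = host
  agent1.sources.apache-logs.interceptors.hostint.useIP = false

  agent1.channels.mem-channel.type = memory

  # %{host} expands to the value of the "host" header, giving one directory per machine
  agent1.sinks.hdfs-sink.type = hdfs
  agent1.sinks.hdfs-sink.channel = mem-channel
  agent1.sinks.hdfs-sink.hdfs.path = hdfs://localhost:9000/flume/%{host}
  agent1.sinks.hdfs-sink.hdfs.fileType = DataStream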

HOW TO MOVE DATA INTO AN HBASE TABLE USING FLUME-NG

The first Hbase sink was committed to the Flume 1.2.x trunk a few days ago. In this post we'll see how we can use this sink to collect data from a file stored in the local filesystem and dump this data into an Hbase table. We should have Flume built from the trunk in order to achieve that. If you haven't built it yet and are looking for some help, you can visit my other post that shows how to build and use Flume-NG at this link: http://cloudfront.blogspot.in/2012/06/how-to-build-and-use-flume-ng.html First of all we have to write the configuration file for our agent. This agent will collect data from the file and dump it into the Hbase table. A simple configuration file might look like this…
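
The post's own configuration file is cut off in this excerpt, so the following is only an illustrative sketch: the component names, the tailed file path, and the table and column family are invented, and it assumes the trunk build's org.apache.flume.sink.hbase.HBaseSink together with its SimpleHbaseEventSerializer.

  agent1.sources = tail-src
  agent1.channels = mem-channel
  agent1.sinks = hbase-sink

  # read the local file line by line
  agent1.sources.tail-src.type = exec
  agent1.sources.tail-src.command = tail -F /home/username/demo.log
  agent1.sources.tail-src.channels = mem-channel

  agent1.channels.mem-channel.type = memory

  # write each event into an Hbase table
  agent1.sinks.hbase-sink.type = org.apache.flume.sink.hbase.HBaseSink
  agent1.sinks.hbase-sink.channel = mem-channel
  agent1.sinks.hbase-sink.table = demo_table
  agent1.sinks.hbase-sink.columnFamily = cf
  agent1.sinks.hbase-sink.serializer = org.apache.flume.sink.hbase.SimpleHbaseEventSerializer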

HOW TO BUILD AND USE FLUME-NG

In this post we'll see how to build flume-ng from trunk and use it for data aggregation. Prerequisites: In order to do a hassle-free build we should have the following two things pre-installed on our box: 1- Thrift 2- Apache Maven-3.x.x Build the project: Once we are done with this, we have to build flume-ng from the trunk. Use the following commands to do this: $ svn co https://svn.apache.org/repos/asf/incubator/flume/trunk flume This will create a directory flume inside our /home/username/ directory. Now go inside this directory and start the build process: $ cd flume $ mvn3 install -DskipTests NOTE : If everything went fine you will receive a BUILD SUCCESS message after this. But sometimes you may get an error somewhat like this :…

HOW TO CHANGE THE DEFAULT KEY-VALUE SEPARATOR OF A MAPREDUCE JOB

The default MapReduce output format, TextOutputFormat, writes records as lines of text. Its keys and values may be of any type, since TextOutputFormat turns them into strings by calling toString() on them. Each key-value pair is separated by a tab character. We can change this separator to some character of our choice using the mapreduce.output.textoutputformat.separator property (in the older MapReduce API this was mapred.textoutputformat.separator). To do this you have to add this line in your driver, on the job's Configuration object - conf.set("mapreduce.output.textoutputformat.separator", ",");
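
For context, here is a minimal driver sketch (not from the post; it relies on the default identity mapper and reducer purely to show where the property is set):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class SeparatorDemo {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          // comma instead of the default tab between key and value in the output;
          // on older releases use "mapred.textoutputformat.separator" instead
          conf.set("mapreduce.output.textoutputformat.separator", ",");

          // with the identity mapper/reducer each output record is "<byte offset>,<line>"
          Job job = new Job(conf, "separator-demo");
          job.setJarByClass(SeparatorDemo.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
  }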

Error while executing MapReduce WordCount program (Type mismatch in key from map: expected org.apache.hadoop.io.Text, received org.apache.hadoop.io.LongWritable)

Quite often I see questions from people who are comparatively new to the Hadoop world, or just starting their Hadoop journey, saying that they get the error specified below while executing the traditional WordCount program: Type mismatch in key from map: expected org.apache.hadoop.io.Text, received org.apache.hadoop.io.LongWritable If you are also getting this error then you have to set your MapOutputKeyClass explicitly, like this: - If you are using the older MapReduce API: conf.setMapOutputKeyClass(Text.class); conf.setMapOutputValueClass(IntWritable.class); - And if you are using the new MapReduce API: job.setMapOutputKeyClass(Text.class); job.setMapOutputValueClass(IntWritable.class); REASON : The reason for this is that your MapReduce application might be using TextInputFormat as the InputFormat class, and this class generates keys of type LongWritable and values of type Text by default. But your application…
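
To make the reason concrete, here is a sketch of a new-API WordCount mapper (a generic illustration, not the asker's code): TextInputFormat feeds it LongWritable keys and Text values, but it emits Text keys and IntWritable values, which is exactly the mismatch the explicit setMapOutputKeyClass/setMapOutputValueClass calls resolve.

  import java.io.IOException;
  import java.util.StringTokenizer;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  // Input types come from TextInputFormat (byte offset + line); output types are
  // what the driver must declare via setMapOutputKeyClass/setMapOutputValueClass.
  public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
              word.set(itr.nextToken());
              context.write(word, ONE);
          }
      }
  }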

How to install maven3 on ubuntu 11.10

If you are trying to install the maven2 that comes shipped with your ubuntu 11.10 and it is not working as intended, you can try the following steps to install maven3: 1 - First of all, add the repository for maven3. Use the following command for this -      $ sudo add-apt-repository ppa:natecarlson/maven3 2 - Now update the repositories -       $ sudo apt-get update 3 - Finally, install maven3 -        $ sudo apt-get install maven3 NOTE : To check whether the installation was done properly or not, issue the following command -               $ mvn --version