Hadoop, Hive, Tez and S3 go to a bar....
Well, long story short, it doesn't go well for some time, until they get to know each other. That is the story below:
Hadoop and Hive
It is fairly easy to configure, and there is plenty of documentation out there. I am assuming Hive with MapReduce (MR) as the engine. Unlike Hadoop, you don't have to install Hive on every node of your cluster. The only tricky part is configuring Hive with an external/remote metastore, which on a cloud provider like AWS means something like PostgreSQL or MySQL running on RDS. Cloudera has very decent documentation on that. Here is the link: https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cdh_ig_hive_metastore_configure.html#topic_18_4_3__title_524
A couple of notes:
1. The upgrade scripts in Hive 2.1.1 still say hive-schema-2.1.0. That is fine.
2. Hive 2.1.1 has a small issue in the PostgreSQL scripts: the main script includes another script by a relative path. Open the main script, hive-schema-2.1.0.postgres.sql, and change the line "\i hive-txn-schema-2.1.0.postgres.sql;" to use an absolute path, so that it does not depend on the directory you run the script from.
3. The thrift setting (hive.metastore.uris) in hive-site.xml points to the server where you run the "hive --service metastore" command, not to RDS. The RDS endpoint goes in the javax.jdo.option.ConnectionURL property.
4. By default, Hive logs go into /tmp/ec2-user (on AWS). To set a custom log file name or directory, put a hive-log4j2.properties file in the $HIVE_HOME/conf directory. Sample entries look like this (I created the logs directory myself):
property.hive.log.dir = /home/ec2-user/apache-hive-2.1.1-bin/logs
property.hive.log.file = hive.log
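For note 2, the include line can also be patched with a one-off command instead of by hand. The following is only a minimal sketch and not part of the Hive distribution: the function name fix_schema_include is made up for illustration, and it assumes that the included script sits in the same directory as the main script and that the directory path contains no "|" or "&" characters.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrite a relative psql include such as
#   \i hive-txn-schema-2.1.0.postgres.sql;
# into an include with an absolute path, keeping a .bak copy of the original.
fix_schema_include() {
  local script="$1"
  local dir
  # Absolute path of the directory that holds the schema scripts.
  dir="$(cd "$(dirname "$script")" && pwd)"
  # Only lines of the form "\i <name>.sql;" with no slash in the name are touched.
  sed -i.bak -E "s|^\\\\i +([^/; ]+\.sql);|\\\\i ${dir}/\1;|" "$script"
}
```

You would call it once on the main script before loading the schema, for example fix_schema_include /path/to/hive-schema-2.1.0.postgres.sql, where the path is wherever your copy of the script lives.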
Hive with Tez
This is a bit more complicated. The documentation on the Apache site and elsewhere is incomplete, and errors are harder to debug. Nonetheless, the steps given on the Apache Tez website are valid; they just need a few clarifications:
1. The tar.gz file that goes into HDFS is the one in the "share" folder. Do not put the downloaded Apache Tez binary tarball itself in that path; it didn't work for me. I actually ended up compiling Tez from source and taking the file from that build, since the downloaded binary kept throwing weird class-not-found and incompatible-Java-version errors. If you go down that road on an AWS instance, remember:
a) The build requires Maven, Node.js, git and bower, and Node.js needs npm installed. Try building Tez only after installing all of these.
2. The unpacked Tez binary needs to be on every node/slave, and its folders $TEZ_HOME and $TEZ_HOME/lib have to be on the Hadoop classpath.
3. Important: HADOOP_CLASSPATH accepts /* to indicate all files in a folder, but this only worked for me when I set it in $HADOOP_HOME/etc/hadoop/hadoop-env.sh, not in .bash_profile.
4. Also, don't set TEZ_JARS to 'something/*'. Point it at the folder where the unpacked Tez jars went, and then add $TEZ_JARS/lib/* and $TEZ_JARS/* to HADOOP_CLASSPATH in the file mentioned above.
5. The Java version on the master and slaves (Hadoop nodes) has to match the one your Tez build was compiled for; otherwise you get Java major/minor version mismatch errors.
6. The Hadoop version has to be the right one for the Tez version you are building/using (check the Tez documentation).
7. tez.use.cluster.hadoop-libs: I set it to true, so that Tez uses the cluster's own Hadoop libraries, to be consistent.
8. Memory settings and Tez jobs that run forever:
a) Tez picks up its default memory and vcores settings from mapred-site.xml and yarn-site.xml. If your jobs run forever or just fail, this might be the cause. You can always override the container settings in tez-site.xml too.
b) If you get virtual memory errors, either set the right value in the YARN configuration (yarn.nodemanager.vmem-pmem-ratio) or disable the vmem check (yarn.nodemanager.vmem-check-enabled).
9. The new Tez shuffle handler: try it if you can (https://tez.apache.org/shuffle-handler.html). It cut execution time by 10-20% for me, but it doesn't seem to be included in the binaries. I built Tez from source, so I found it in the tez-aux-services-0.9.0.jar file.
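Points 3 and 4 boil down to a couple of lines in $HADOOP_HOME/etc/hadoop/hadoop-env.sh. Here is a minimal sketch, assuming the Tez jars were unpacked into /home/ec2-user/tez. That location is made up for illustration, so use your own.

```shell
# Sketch for $HADOOP_HOME/etc/hadoop/hadoop-env.sh.
# /home/ec2-user/tez is a hypothetical folder holding the unpacked Tez jars.
# TEZ_JARS is a plain folder path, with no trailing /*.
export TEZ_JARS="${TEZ_JARS:-/home/ec2-user/tez}"

# Append the Tez jars to whatever HADOOP_CLASSPATH already holds.
# The /* entries are stored literally in the variable: they are
# classpath wildcards, not shell globs.
export HADOOP_CLASSPATH="${HADOOP_CLASSPATH:+${HADOOP_CLASSPATH}:}${TEZ_JARS}/*:${TEZ_JARS}/lib/*"
```

The ${HADOOP_CLASSPATH:+...} form keeps any existing entries and avoids a stray leading colon when the variable was empty to begin with.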
Hive/Tez with S3
First and foremost, the documentation is scarce and confusing enough to discourage you, but it is out there in bits and pieces (https://wiki.apache.org/hadoop/AmazonS3). Here are my most important takeaways:
1. As of now (Aug 2017), Hadoop 2.7.3 uses the s3a protocol. When you give locations in Hive for your external tables, use that protocol (and not s3 or s3n).
2. Do NOT download and use a newer AWS SDK jar for S3 (as of today it is aws-java-sdk-1.11.174.jar). Use whatever Hadoop 2.7.3 shipped with: aws-java-sdk-1.7.4.jar.
3. The jars in the folder $HADOOP_HOME/share/hadoop/tools/lib need to be on the Hive classpath (not HADOOP_CLASSPATH). You can set this in hive-site.xml or in $HIVE_HOME/bin/hive-config.sh. I wasn't sure how to add a folder in the former, so I used the latter. It turns out this works slightly differently from HADOOP_CLASSPATH: you set HIVE_AUX_JARS_PATH to the name of the folder, not folder/*, like this:
export HIVE_AUX_JARS_PATH=$HADOOP_HOME/share/hadoop/tools/lib
4. The property fs.s3a.aws.credentials.provider in core-site.xml needs to be set to the right value, along with any related properties, depending on how the Hive EC2 instance authenticates to S3. You can use an access key/secret key or an instance profile; with an instance profile, this is the only property you have to set. In my case it was com.amazonaws.auth.InstanceProfileCredentialsProvider, which of course assumes that you have set up your Hive EC2 instance with an instance profile.
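Points 2 and 3 are easy to get wrong silently, so a quick sanity check before starting Hive can help. The sketch below is only an illustration and is not shipped with Hadoop or Hive: check_s3a_jars is a hypothetical helper, and it assumes the Hadoop 2.7.3 layout described above, where tools/lib holds a single aws-java-sdk jar.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the given folder holds exactly one
# aws-java-sdk jar, i.e. no newer SDK has been dropped in next to the
# one that shipped with Hadoop.
check_s3a_jars() {
  local lib_dir="$1"
  local count
  count="$(find "$lib_dir" -maxdepth 1 -name 'aws-java-sdk-*.jar' | wc -l | tr -d ' ')"
  if [ "$count" -ne 1 ]; then
    echo "expected exactly 1 aws-java-sdk jar in $lib_dir, found $count" >&2
    return 1
  fi
  # Print the single SDK jar that was found.
  find "$lib_dir" -maxdepth 1 -name 'aws-java-sdk-*.jar'
}
```

One way to use it is to guard the export from point 3: check_s3a_jars "$HADOOP_HOME/share/hadoop/tools/lib" && export HIVE_AUX_JARS_PATH="$HADOOP_HOME/share/hadoop/tools/lib"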