Monday, August 14, 2017

Hadoop, Hive, Tez and S3 go to a bar....

Well, long story short, it doesn't go well for some time, until they get to know each other. And then... that is the story below:

Hadoop and Hive

It is fairly easy to configure, and there is lots of documentation out there. I am assuming Hive with MR as the engine. Unlike Hadoop, you don't have to install Hive on every node of your cluster. The only tricky part is configuring Hive with an external/remote metastore, for example PostgreSQL or MySQL running on RDS if you are on a cloud provider like AWS. Cloudera has very decent documentation on that; here is the link: https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cdh_ig_hive_metastore_configure.html#topic_18_4_3__title_524

A couple of notes:

1. The upgrade scripts still say hive-schema-2.1.0 in Hive 2.1.1. That is fine.

2. Hive 2.1.1 has a small issue in the Postgres scripts: the main script includes another script by a relative path. Open the main script, hive-schema-2.1.0.postgres.sql, and change the line "\i hive-txn-schema-2.1.0.postgres.sql;" to use an absolute path, so that it does not depend on where you run the script from.

3. The thrift setting (hive.metastore.uris) in hive-site.xml points to the server where you run the "hive --service metastore" command, not to the RDS instance. RDS goes in the javax.jdo.option.ConnectionURL property.

4. Hive logs go to /tmp/ec2-user by default (on AWS). To set a custom log file name or directory, put hive-log4j2.properties in the $HIVE_HOME/conf directory. Sample entries would look like this (I created the logs directory myself):

property.hive.log.dir = /home/ec2-user/apache-hive-2.1.1-bin/logs
property.hive.log.file = hive.log


Hive with Tez

This is a bit more complicated. The documentation on the Apache site and other websites is not complete, and errors are harder to debug. Nonetheless, the steps given on the Apache Tez website are valid. Just a few clarifications:

1. The tar.gz file that goes into HDFS is in the "share" directory. Do not put the downloaded Apache Tez binary in the path; it didn't work for me. I actually ended up compiling Tez from source and building the file myself, since the binary kept throwing a lot of weird errors (class not found, incompatible Java version, etc.). If you go down that road on an AWS instance, remember:
      a) The build requires maven, node.js, git and bower, and node needs npm to be installed. Only after installing these should you try building the Tez binary.

2. The unpacked Tez binary needs to be on every node/slave, and its folders $TEZ_HOME and $TEZ_HOME/lib have to be on the Hadoop classpath.

3. Important: HADOOP_CLASSPATH can accept /* to indicate all files, but it only worked for me when I set it in $HADOOP_HOME/etc/hadoop/hadoop-env.sh, not in .bash_profile.

4. Also, don't set TEZ_JARS to 'something/*'. Just set it to the folder where the unpacked Tez jars went, and then add $TEZ_JARS/lib/* and $TEZ_JARS/* to the HADOOP_CLASSPATH in the file mentioned above.

5. The Java version on the master and slaves (the Hadoop nodes) has to match the one your Tez version was built for; otherwise you get Java major/minor version mismatch errors.

6. The Hadoop version should be the one the documentation lists for the Tez version you are building/using.

7. tez.use.cluster.hadoop-libs: I set it to true to be consistent.

8. Memory settings and Tez jobs that run forever:
     a) Tez picks up the default memory and vcores settings from mapred-site.xml and yarn-site.xml. If your jobs are running forever or just failing, this might be the cause. You can always override the container settings in tez-site.xml too.
     b) If you are getting virtual memory errors, set the right value in the YARN configuration (yarn.nodemanager.vmem-pmem-ratio) or disable the vmem check (yarn.nodemanager.vmem-check-enabled).

9. The new Tez shuffle handler: try it if you can - https://tez.apache.org/shuffle-handler.html
It cut the execution time by 10-20% for me, but it doesn't seem to be included in the binaries. I built Tez from source, so I found it in the tez-aux-services-0.9.0.jar file.


Hive/Tez with S3

First and foremost, the documentation is sparse or confusing enough to discourage you, but it is there in bits and pieces (https://wiki.apache.org/hadoop/AmazonS3). Here are my most important takeaways:

1. As of now (Aug 2017), Hadoop 2.7.3 uses the s3a protocol. When you give locations in Hive for your external tables, use that protocol (and not s3 or s3n).

2. Do NOT download/use the latest AWS SDK jar for S3 (as of today it was aws-java-sdk-1.11.174.jar). Use whatever Hadoop 2.7.3 shipped with: aws-java-sdk-1.7.4.jar.

3. The jars in the folder $HADOOP_HOME/share/hadoop/tools/lib need to be on the Hive classpath (not HADOOP_CLASSPATH). You can set that in hive-site.xml or in $HIVE_HOME/bin/hive-config.sh.
I wasn't sure how to add a folder in the former, so I used the latter. It turns out this works slightly differently from HADOOP_CLASSPATH: you set HIVE_AUX_JARS_PATH to the name of the folder, not folder/*, like

export HIVE_AUX_JARS_PATH=$HADOOP_HOME/share/hadoop/tools/lib

4. The property fs.s3a.aws.credentials.provider in core-site.xml needs to be set to the right value. Which related properties you also need depends on how the Hive EC2 instance authenticates to S3: you can use an access key/secret key or an instance profile, and with an instance profile this is the only property you have to set. In my case the value was com.amazonaws.auth.InstanceProfileCredentialsProvider, which of course assumes that you have set up your Hive EC2 instance with an instance profile.









Monday, January 04, 2016

The wheels of the Rail go round and round....

ok, ok, I know that is not the correct line :-)

But making a slightly complicated model relationship and form sure seemed to be like that in Ruby on Rails.

Here is the model - pretty simple at that:

1. You have players who can play 1-n sports, represented by a model called Player.
2. You have a number of sports, represented by a model called Sport.
3. A player can play a number of sports, and a given sport can be played by a number of players. The relationship between them is called PlayerSport. It is many-to-many, through this relationship table, which also stores certain attributes of the relationship itself: Rating and StartYear.

We want to create a single form for Players as well as allowing them to be associated with Sports in the same form, along with some of the details about this relationship as given in point 3 above.

Now the form only displays details about one mandatory sport while you register as a player (after all, if you do not play even one Sport, you can't really be a Player, right?). Using jQuery/JavaScript you can add the form inputs for more Sports dynamically, in case a Player plays more than one Sport. How do you create the form and the rest of the code in a relatively straightforward manner? Given below is the step-by-step implementation, narrowed down by reading tons of blogs and documentation, plus trial and error. I also list some of the failures and pitfalls to beware of.

Disclaimer: I am not a Ruby/Rails expert; I learned all this starting 2 months back. But I found it difficult to implement, so I thought I should document it somewhere.

Statement 1: I found (for a newbie perhaps) that using simple_form, other form gems, and cocoon isn't worth the time, especially since my problem was solved relatively quickly using standard Rails means and some jQuery. Also note that cocoon etc. expect a certain format in order to work; if you have a custom HTML form, that might mean more work than needed.

Statement 2: Tried and tested on Rails version 4.2.3

Statement 3: We do not create Sport rows/data through the form. Just new Players and their relationship with Sports.

Statement 4: Pay attention to the singular and plural forms of the words. They are significant in Rails.

Step 1 - Models:
First we need to create a many-to-many :through relationship. If you don't know what that is, please read the Rails guides for an intro (it's relatively easy). The code needed for that is given below:

class Player < ActiveRecord::Base
  # relationship for sports for a single player
  has_many :player_sports, foreign_key: "player_id"
  has_many :sports, through: :player_sports
  # the options belong to the same call, so the line must end with a comma
  accepts_nested_attributes_for :player_sports,
    reject_if: :all_blank, allow_destroy: true

  # ... other code not relevant here ...
end

class PlayerSport < ActiveRecord::Base
  belongs_to :player
  belongs_to :sport
  validates :player_id, presence: true
  #validates :sport_id, presence: true
  # accepts_nested_attributes_for :sports
end

class Sport < ActiveRecord::Base
  has_many :player_sports, foreign_key: "sport_id"
  has_many :players, through: :player_sports
end


A few clarifications here:
1. Some blogs mentioned you need inverse_of in the other models for the through: :player_sports relationship. That is not true for the purposes here, as the code above shows.
2. The validates and accepts_nested_attributes_for in the relationship model PlayerSport are not needed. In fact, validates gave me an issue when creating a new Player (and one or more PlayerSport relationship rows), since the player_id is blank while the Player record is not yet created. Seems like common sense.
3. Some sites mentioned creating an attr_accessor for the PlayerSport attributes in the Player model. I did not find that to be required, and I remember reading somewhere that accepts_nested_attributes_for creates it automatically. autosave isn't needed either.




Step 2 - Controllers:

For creating and saving a new record: nothing! I did nothing new in the Player controller when I added this relationship, beyond what is needed for the core Player itself. If you do not know what that is, then this is probably not the right place for you to start. But the code should look essentially like this (player_params is a method that whitelists the params, as described in the point below):

def create
  @player = Player.new(player_params)

  if @player.save
    # do something
  else
    flash[:danger] = @player.errors
    # do something else
  end
end

Permitting the attributes (strong parameters in Rails): this is the interesting part. You need to add player_sports_attributes, with an array of fields, to the parameters allowed for Player. It must be exactly that pluralized name, not player_sport or player_sport_attributes. The fields in the permit array can vary depending on what you have in the PlayerSport model, but _destroy is a special one needed for deleting relationship records.

def player_params
  params.require(:player).permit(:first_name, :last_name, ...other_fields...,
    :player_sports_attributes => [:id, :done, :player_id, :sport_id, :rating, :_destroy])
end
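To make the shape concrete, here is a plain-Ruby sketch (no Rails needed) of what this kind of whitelisting does to the submitted hash. The permit_nested helper and the sample values are hypothetical; it only mimics the idea behind strong parameters and is not how Rails implements permit.

```ruby
# Hypothetical stand-in for strong parameters, for illustration only.
# Keeps the allowed top-level keys and, for each nested row under
# nested_key, keeps only the allowed nested keys.
def permit_nested(params, allowed, nested_key, nested_allowed)
  result = params.select { |key, _| allowed.include?(key) }
  rows = params.fetch(nested_key, {})
  result[nested_key] = rows.each_with_object({}) do |(index, attrs), kept|
    kept[index] = attrs.select { |key, _| nested_allowed.include?(key) }
  end
  result
end

# Sample submission (made-up values) shaped like params[:player].
submitted = {
  "first_name" => "Ann",
  "admin" => "true", # not whitelisted, so it is dropped
  "player_sports_attributes" => {
    "0" => { "sport_id" => "1", "rating" => "7.0", "hacked" => "x" }
  }
}

clean = permit_nested(
  submitted,
  %w[first_name last_name],
  "player_sports_attributes",
  %w[id done player_id sport_id rating _destroy]
)
p clean
```

Note that the nested rows stay keyed by their index ("0" here), which is why the index in the field names matters later on.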





Step 3 - Form:
The main form - containing fields for Player is the standard one. In my case, I created the html layout in a table, with a <tr> representing a field. What is interesting is the part where you need to put the PlayerSport fields. That is achieved with the code below:
<tr>
  <td><%= f.label :comments %><br></td>
  <td><%= f.text_area :comments, class: 'form-control' %></td>
</tr> <!-- some html code to show the Player model fields -->

<%= f.fields_for 'player_sports_attributes', index: 0 do |builder| %>
  <%= render 'player_sport_fields', :f => builder %>
<% end %>

<tr id="addAnchorRow">
  <td>
    <a href="javascript: void(0)" id="addSportAnchor" onClick="addSport(this)">Add Another Sport</a>
  </td>
</tr>

<!-- the remaining code if any -->

Things to note here:
1. If you create the form with form_for and lay out the fields in an HTML <table>, the form_for must be outside the table. Otherwise the <tr> rows added dynamically (and thus the extra PlayerSport rows beyond the static one) would NOT be submitted in params to the controller. This is not a Rails issue but an HTML/DOM one; I am not even sure it is an issue, since I think the browser behaves as it should. I mention it just for completeness.
2. The index option is not needed, since Rails should automatically generate the form fields for the single static PlayerSport row (i.e. the one not created through jQuery) with index "0". Again, mentioned just for clarity.
3. If you try giving 'player_sports' instead of 'player_sports_attributes', the static fields disappear too: Rails would NOT generate anything. Again, this is standard behavior, I think.
4. The last part generates an <a> tag; clicking it should generate more fields. As you can see, it calls a JavaScript function when clicked, and while that is not unobtrusive JavaScript, it seemed simple enough to me.




Step 4 - Jquery/Javascript:
For the static PlayerSport row (mandatory in our case), the HTML generated by Rails has this format (one field shown below):

<select class="form-control" name="player[player_sports_attributes][0][sport_id]" id="player_player_sports_attributes_0_sport_id">..options...</select>

Through jQuery we need to:
0. Generate an almost identical row, containing the form fields for an additional PlayerSport record, and attach it to the <table> within the form_for.
1. Change or remove the ID so that we don't have form fields (<select> in this case) with duplicate IDs. I have seen blogs talking about changing the ID value, but I tried just removing it altogether, and the form was still submitted and the records created properly. So it is your choice here, I think. Would duplicate IDs even matter? I am not sure, since I saw Rails complaining more about the names of the fields than the IDs (form submission is driven by the name attribute, not the ID). But it is possible I missed something.
2. Change the [0] part in the name player[player_sports_attributes][0][sport_id] for the newly generated fields. You can change it to anything unique; [1] would be fine. I used $.now() to get the current time in milliseconds and replaced the "0" with that. This is important, since the params are submitted like:
Parameters: {..other params..., "player_sports_attributes"=>{"0"=>{"sport_id"=>"1", "rating"=>"7.0"}, "1451889076747"=>{"sport_id"=>"2", "rating"=>"1990"}}}, "commit"=>"Register Player"}
As you can see, the number inside the middle [] becomes a key in the submitted params hash.
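The renaming in point 2 is just string surgery on the field name. Here is a minimal sketch of the idea, written in plain Ruby rather than jQuery so it can be run standalone. The helper names reindex_field_name and group_by_index are hypothetical, and the grouping only roughly imitates how the names end up as keys in params; it is not Rails' actual parameter parser.

```ruby
# Swap the row index in a nested-attributes field name for a new unique key.
# Only the first [digits] segment is replaced, which is the row index in
# names of the form player[player_sports_attributes][0][sport_id].
def reindex_field_name(name, new_index)
  name.sub(/\[\d+\]/, "[#{new_index}]")
end

# Group submitted name/value pairs by row index, roughly the way the
# player_sports_attributes hash ends up keyed in params.
def group_by_index(pairs)
  pairs.each_with_object({}) do |(name, value), rows|
    match = name.match(/\[player_sports_attributes\]\[(\d+)\]\[(\w+)\]/)
    next unless match
    (rows[match[1]] ||= {})[match[2]] = value
  end
end

static_name = "player[player_sports_attributes][0][sport_id]"
# Stand-in for $.now(): current time in milliseconds.
unique_key = (Time.now.to_f * 1000).to_i
new_name = reindex_field_name(static_name, unique_key)

rows = group_by_index([
  [static_name, "1"],
  [new_name, "2"]
])
p rows.keys
```

If both rows kept the index 0, the second sport_id would overwrite the first under the same key, which is why the index has to be unique per row.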