Deployed HDP 2.3 Today!

the_Grinch Member Posts: 4,164
Haven't posted a topic in a while and figured why not post about my successful deployment of Hortonworks Data Platform 2.3? This was actually the third HDP deployment I have done. The first time was a pure nightmare! My configuration of the operating system on each of my 13 servers was terrible, and HDP exposed every one of those mistakes. So I wiped each of the servers, redeployed the operating system, and built a script to check all of the items I had missed (the script works really nicely, and I later found a bash script from Hortonworks that did a lot more than mine). I successfully deployed Hadoop and only had a few issues to contend with (I had to rewipe two servers and redeploy, though I later realized I could have fixed them without wiping). When I went to the Hortonworks Data Science course they had just released HDP 2.3, and at first I was against even looking at it. I had our cluster (HDP 2.2) up and Kerberized, and the only item I had issues with was deploying Ranger.
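For anyone curious, a pre-flight check script along those lines might look like this. The specific checks (swappiness, transparent huge pages, ntpd) are my assumptions based on common Hadoop prerequisites, not the exact items from the script described above:

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight checks before an HDP install. The settings checked
# here are common Hadoop OS prerequisites, used as examples only.

check() {  # label, expected, actual -> print OK or FAIL
  if [ "$2" = "$3" ]; then
    echo "OK   $1"
  else
    echo "FAIL $1 (want '$2', got '$3')"
  fi
}

check "vm.swappiness" "1" "$(sysctl -n vm.swappiness 2>/dev/null)"
check "transparent_hugepage" "never" \
  "$(sed -n 's/.*\[\(.*\)\].*/\1/p' /sys/kernel/mm/transparent_hugepage/enabled 2>/dev/null)"
check "ntpd" "active" "$(systemctl is-active ntpd 2>/dev/null)"
```

Run it on every node before kicking off Ambari and fix anything that prints FAIL.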

As I reviewed it, I saw how many changes they had made and how well HDP 2.3 would work for us. First, they enabled it to work with FreeIPA again. HDP 2.1 worked with FreeIPA, but HDP 2.2 changed Ambari to create the Kerberos keys automatically, and because of how FreeIPA handles key generation, Ambari couldn't run the commands it needed. HDP 2.3 makes it an option to either deploy the keys automatically or do it manually (so you can use FreeIPA). Second, they made the deployment of Ranger much easier and added Ranger support for components that previously couldn't use it. Finally, they added the ability to encrypt data at rest. While we don't have PII data, being able to say that we can restrict access (down to the cell level if need be), audit who does what, and encrypt the data we have will make us all feel better.

So I took two days and wiped all the servers again. I had read a lot about how difficult upgrading between versions can be, and since I had no data in the system it made sense to start fresh. With all the servers wiped, I set up Ambari with a local repo and started the deployment. I almost had a heart attack when everything went off without a hitch! Of course that was short-lived, because when I went to the dashboard two of my servers could not start HDFS (one of which held a big part of my storage capacity). I tried to start the service on both and they would fail. I reviewed the logs and found that, for some reason, the wrong accounts had been made owner of the HDFS folders. I changed the owner and tried to start the service again. Failed! Looked again and found that the cluster ID in one file on both servers was wrong, so I went into the VERSION file and adjusted the cluster ID. Started it and it failed again. This time I found that the node ID was wrong on some of the drives. Started it again and bam, one of the servers was running properly. But the other server still would not start, and I found that was because it was looking for 15 drives that did not exist. Because of those 15 missing drives the service would not start. I had to make a configuration group so I could adjust the config to use just two drives, and bam, I was up.
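The two DataNode fixes boil down to a couple of commands. A rough sketch (the data directory path and cluster ID are placeholders; check dfs.datanode.data.dir in your config for the real paths):

```shell
#!/usr/bin/env bash
# Sketch of the two DataNode fixes described above. Path and cluster ID are
# placeholders, not values from the actual cluster.
DATA_DIR=/grid/0/hadoop/hdfs/data

fix_owner() {   # HDFS data dirs must belong to the hdfs user or the DataNode fails
  if [ -d "$1" ]; then chown -R hdfs:hadoop "$1"; fi
}

fix_cluster_id() {   # rewrite the clusterID= line in a DataNode VERSION file
  sed -i "s/^clusterID=.*/clusterID=$2/" "$1"
}

fix_owner "$DATA_DIR"
if [ -f "$DATA_DIR/current/VERSION" ]; then
  fix_cluster_id "$DATA_DIR/current/VERSION" "CID-xxxx"
fi
```

The clusterID you paste in has to match the one the NameNode reports, otherwise the DataNode will refuse to register.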

I also want to point out that Hortonworks engineers are very helpful. I found that a set of their tutorials had a lot of issues, so I emailed the engineer and we had several conversations about a number of things. He was really great about helping me through some issues (with 2.2). I added that it didn't make a lot of sense to release tutorials that relied on their sandbox, because that doesn't help engineers in the real world. With version 2.3 they switched the tutorials to a single-node cluster, and that made a world of difference as well. Tomorrow I'll begin enabling Kerberos and possibly deploying Ranger. I'll keep everyone informed!
WIP:
PHP
Kotlin
Intro to Discrete Math
Programming Languages
Work stuff

Comments

  • the_Grinch Member Posts: 4,164
    Today I got Kerberos up and running! Since I am using FreeIPA, you have to do all the work of creating the needed user accounts and generating the principal keytabs yourself. Hortonworks provides a tutorial, but my concern was that I am working with 13 nodes, not just 1. My thought was that I would need to break the kerberos.csv into individual files and import one for each node. Very wrong! What I didn't realize was that every time you issue a key for a principal, the key changes. So when I tried to run the test commands after doing all of the work on each server, they would fail. I Googled for about two hours before I came across an unrelated article that discussed keys in Kerberos and explained that each generation changes the keys.

    So I went to the last server I had worked on and confirmed that the test commands worked there. I created a spreadsheet of which keytabs were where, with the thought that I would move them as needed. I knew this morning would be a nightmare if I had to do that. Thankfully, a Hortonworks engineer let me know that I could delete and reissue the keys (without separate files, since the script pulls the needed ones based on the host), and then copy only the keytabs for services whose keys don't change. The only catch is that I would need to change the permissions on the files (which the script does) for them to work. So what I did was go through the steps to generate the script and make the keys on one box. Then I would connect to another box, transfer the kerberos.csv, generate the script, but comment out the service principals (basically comment out anything that didn't have blah/host.realm.com). I'd run the script, which grabs the host-specific keys, and then copy the script under another name. From there I would look at which service keytabs I needed to copy over and comment out the ones I didn't. I adjusted the copied script to remove the keytab-issuing command and kept only the chown and chmod commands. Once I completed that, I restarted the NameNode and Secondary NameNode, then shut down all services and started them back up.
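    A condensed sketch of that per-host flow, for anyone following along. The realm, IPA server, and keytab paths here are made up; the generated script from the tutorial does the equivalent of this:

```shell
#!/usr/bin/env bash
# Condensed sketch of the per-host keytab workflow above. Realm, server and
# paths are placeholders. Key point: issuing a keytab rotates the key, so
# host-specific principals are re-issued on the node that owns them, while
# shared (headless) keytabs are copied once and only get their permissions reset.

fix_perms() {   # owner, mode, keytab -- what the tail of the generated script does
  chown "$1" "$3" && chmod "$2" "$3"
}

HOST=$(hostname -f)
REALM=EXAMPLE.COM

# Re-issue host-specific principals on the owning node (rotates the key):
# ipa-getkeytab -s ipa.example.com -p "nn/$HOST@$REALM" \
#   -k /etc/security/keytabs/nn.service.keytab

# Headless keytabs (e.g. hdfs.headless) get copied from the first node instead,
# then only the permissions are fixed:
# fix_perms hdfs:hadoop 440 /etc/security/keytabs/hdfs.headless.keytab
```

The ipa-getkeytab calls are commented out because they would rotate real keys; the point is that they only ever run on the node that keeps the resulting keytab.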

    Here I will point out that Accumulo had issues from the start, and I ultimately used the Ambari API to remove it since I won't be using it. Did a quick test to confirm that now only users in Kerberos could access Hadoop and whoa, success! Monday I will be working on deploying Ranger, and then I can begin to import data into my cluster!
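    For reference, removing a service through the Ambari REST API looks roughly like this (host, cluster name, and credentials are placeholders):

```shell
#!/usr/bin/env bash
# Rough shape of removing a service via Ambari's REST API (v1). Host, cluster
# name and credentials are placeholders. Ambari refuses to delete a running
# service, so it is stopped (put into state INSTALLED) first.
AMBARI=http://ambari.example.com:8080
CLUSTER=mycluster
AUTH=admin:admin

svc_url() {   # base, cluster, service -> the service endpoint
  echo "$1/api/v1/clusters/$2/services/$3"
}

# 1. Stop the service:
curl -su "$AUTH" -H 'X-Requested-By: ambari' -X PUT --connect-timeout 2 \
  -d '{"RequestInfo":{"context":"Stop Accumulo"},"Body":{"ServiceInfo":{"state":"INSTALLED"}}}' \
  "$(svc_url "$AMBARI" "$CLUSTER" ACCUMULO)" || true

# 2. Then delete it:
curl -su "$AUTH" -H 'X-Requested-By: ambari' -X DELETE --connect-timeout 2 \
  "$(svc_url "$AMBARI" "$CLUSTER" ACCUMULO)" || true
```

The X-Requested-By header is required by Ambari for any modifying request, or it rejects the call.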
  • paul78 Member Posts: 3,016
    Interesting to read your progress. Thanks for sharing. I am curious - I see that you are using Hortonworks. Did you look at the other commercial offerings such as Cloudera or Pivotal? I'm wondering if you could share any pros/cons versus Cloudera.
  • the_Grinch Member Posts: 4,164
    I also deployed Ranger today!!

    paul78 - When we began to go down this road we did explore several options, in particular Cloudera vs Hortonworks. What it really came down to for us was cost. While we more than likely could have gotten funding for Cloudera, it's a much harder sell to management when we don't know the outcome of the project. In my case, this project will be a multiyear deal, and we might end up deciding it's not worth doing. In that event we'd have purchased something we would never use. A lot of the things you really want in Cloudera cost money, whereas with Hortonworks every product is free and you only pay for support.

    As I see it, there are three advantages to using Cloudera. First, they have the most tutorials available on the web. In my search for answers, they seem to pop up far more often and in a more production-oriented sense (versus Hortonworks, whose tutorials always use the Sandbox, which isn't helpful in the production world). Second, Cloudera's community is much more active than Hortonworks'. Good luck getting an answer on the Hortonworks forums; Cloudera's forums seem to get answers rather quickly. Finally, in my area a lot of places have chosen Cloudera, so you have a better shot at getting a job.

    All that being said, it seems to me that Hortonworks has gone in the right direction. They've kept everything open source and haven't created anything proprietary (e.g. Impala). Also, with the advent of HDP 2.3, deployment is vastly smoother. Finally, Hortonworks has the most developers contributing to Hadoop in general. I will warn you their support is super expensive, but even they'll admit that beyond deployment they don't get many calls. I've gone to several of their training courses and the training is truly top notch. I'd much rather my agency pay for training than for support, because for the most part it's easy to figure out what the problem is.
  • paul78 Member Posts: 3,016
    Thanks for sharing your thoughts. I've been toying with some use-cases and Hadoop has an interesting ecosystem.
  • the_Grinch Member Posts: 4,164
    Hadoop is definitely one of the best things I've had a chance to work on. While it was not my idea (my former coworker suggested it and gave a high-level view of how it would work), I've been the one to fully deploy and implement it. As for the ecosystem, it seems to me that Pig and Hive are just about the only tools you need initially. I've been trying to get approval to attend the developer course, but I was able to play with Pig in the Data Science course. You can do amazing things with it, and its language (Pig Latin) is extremely easy to pick up. We took CSV files, imported them into Pig, and did some really cool stuff (sorting the data, summing it, showing highs and lows). While the sheets weren't "big" data, you could see how, for large CSV files, you would be able to crunch that data in a flash.
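    To give a taste of what that looks like, here's a hypothetical Pig script along those lines (the field names and the hands.csv file are invented for the example):

```shell
#!/usr/bin/env bash
# Write out a small Pig Latin script of the kind described above: load a CSV,
# group it, and compute totals plus highs and lows. Fields are invented.
cat > summary.pig <<'EOF'
-- load a CSV of (player, stake) records
hands = LOAD 'hands.csv' USING PigStorage(',') AS (player:chararray, stake:double);
by_player = GROUP hands BY player;
stats = FOREACH by_player GENERATE group AS player,
        SUM(hands.stake) AS total, MAX(hands.stake) AS high, MIN(hands.stake) AS low;
sorted = ORDER stats BY total DESC;
DUMP sorted;
EOF

# run it locally against a small file before pointing it at the cluster:
# pig -x local summary.pig
```

Running in local mode first (`pig -x local`) is a cheap way to debug the script before it ever touches the cluster.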

    A case I might be using Hadoop for involves the analysis of some hand history in a CSV. The file should end up being too big to open in Excel and more than likely will require some complicated analysis to make sense of it. I'm hoping that with Pig (and maybe Hive) I'll be able to do it faster than if one of our engineers reviewed it manually. Let me know how you proceed and (if you can) your use case - always interested to talk with people in (or moving into) the Big Data world.
  • paul78 Member Posts: 3,016
    My interest in Hadoop isn't for data analytics but more as a transaction processing store. I'm not convinced yet about Hadoop because I don't care for master/slave architectures, and I have a requirement for multiple-zone high availability. Admittedly, I have not done as much reading on Hadoop as on other distributed stores like MongoDB and Cassandra. I also have a need for tunable consistency - or at least one where I can control consistency depending on the transaction type.

    I'm enjoying reading about your experience though, so thanks for sharing.
  • the_Grinch Member Posts: 4,164
    Yeah, I don't believe Hadoop is quite up to transactional processing just yet. Technically you can do it, but I don't know if I would put it into production at this point. To your point about the master/slave thing, if I understand it properly (I haven't dealt with MongoDB) and it's along the lines of Elasticsearch, then you will still have a master and slaves. Granted, you can designate a lot of master-eligible nodes, so if one dies off, something steps in to take its place. That said, having dealt with nodes dying in the Elasticsearch world, there are still issues. The cluster will run, but there are issues that have to be addressed.

    In the Hadoop world there are things you can do to mitigate that (just like any system). First, your master is typically a powerhouse and is set up as such. In my case, we would have to have a major disaster for my master node to die. Second, you can enable NameNode HA, where a standby NameNode stays in sync with the active one; that way, if your master does die, the standby picks up right "about" where you left off. I believe in practice you won't have anything missing.

    I think in your case I would maybe set up a small experimental Hadoop cluster and use HBase to test. But I agree that at this point Mongo or Cassandra are the way to go.