Hadoop Setup - Ouch

the_Grinch Member Posts: 4,165
I took a training course on Hortonworks Hadoop a few months back, and yesterday/today I actually went about deploying it. Whoa, it was quite a nightmare, though not wholly HDP's fault. First, I went through and reloaded the server we were going to use for Ambari. I had set it up a few months back but paused on deploying for a few reasons (which I am ultimately glad I did). No matter what I did, I could not get the servers to respond: they'd connect via passwordless SSH, but would then time out during the installation phase. I spent about three hours trying and trying to get it to work, went home for the night, and started working again this morning.

As I continued to work, I started thinking about another issue I'd had that might have been affecting it. Knowing Ambari uses yum to do the install from a local repo, I remembered I'd been having an issue with my Spacewalk server when doing updates: yum still checks the mirror list, and because I don't allow the servers out to the internet, it hangs while checking the mirrors. I jumped on the firewall and let them out to the internet. Three of the seven servers instantly registered; I went back and rebooted the other four and, bam, they registered too.
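(The alternative to opening the firewall would have been to point yum only at repos the hosts can actually reach and stop it from consulting the public mirrorlist. Rough sketch below; the repo id, URL, and file name are placeholders, not my actual setup.)

    # Local repo pointing straight at an internal mirror (placeholder URL):
    cat > /etc/yum.repos.d/hdp-local.repo <<'EOF'
    [HDP-2.2-local]
    name=HDP 2.2 local mirror
    baseurl=http://repo.example.local/hdp/centos6/2.x/
    enabled=1
    gpgcheck=0
    EOF

    # In the stock CentOS repo files, comment out mirrorlist= so yum stops
    # trying to reach the internet, and rely on a reachable baseurl= instead:
    sed -i 's/^mirrorlist=/#mirrorlist=/' /etc/yum.repos.d/CentOS-Base.repo
    yum clean all && yum repolist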

The next step was actually installing the needed software. I set up which nodes would hold what, along with a few other items, kicked off the install, and got lots of failures across the board. Reviewing the logs, I found that the services default to directories the systems can't use. A few adjustments at least got me to the point where I could move to the next step (some of the services started, but most didn't). A couple more hours and I have just about everything running. My main issue now is that my NameNode won't start, but once I fix that it should clear up a bunch of the other problems I'm having. The error I am getting is below (in the event anyone has some ideas):

Fail: Execution of 'ulimit -c unlimited; su -s /bin/bash - hdfs -c 'export HADOOP_LIBEXEC_DIR=/usr/hdp/current/hadoop-client/libexec && /usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh --config /etc/hadoop/conf start namenode'' returned 1. starting namenode, logging to /var/log/hadoop/hdfs/hadoop-hdfs-namenode-Amaya.out

The problem is that I set the ulimit to unlimited and it still gives me that error.
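(For reference, setting ulimit in your own shell doesn't carry over to the hdfs user the daemon runs as, and "returned 1" says almost nothing by itself; checks roughly along these lines are the sort of thing to try, with the limit values below being just examples.)

    # See what limits the hdfs account actually gets (the daemon runs as hdfs,
    # not as the shell you typed ulimit into):
    su -s /bin/bash - hdfs -c 'ulimit -a'

    # Make the limits persistent for the hdfs user instead of per-shell
    # (example values only):
    cat >> /etc/security/limits.conf <<'EOF'
    hdfs  -  nofile  32768
    hdfs  -  nproc   65536
    EOF

    # The real cause is almost always spelled out in the files the error points at:
    tail -n 50 /var/log/hadoop/hdfs/hadoop-hdfs-namenode-Amaya.out
    tail -n 100 /var/log/hadoop/hdfs/hadoop-hdfs-namenode-*.log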

For anyone getting ready to deploy, good luck! I'm still very interested in it and it's amazing, but it's not for the faint of heart.
WIP:
PHP
Kotlin
Intro to Discrete Math
Programming Languages
Work stuff

Comments

  • philz1982 Member Posts: 978
    I deployed a Hadoop cluster to AWS. Talk about a major pain in the ass. I completely understand why people go through a software provider.
  • the_Grinch Member Posts: 4,165
    Haha I hear ya philz! Definitely like working on it, but was really looking to see if the ethernet cable and ceiling were strong enough for me to hang myself (kidding of course!)
  • the_Grinch Member Posts: 4,165
    Happy to report I got the Hadoop cluster up and running! I went into the logs on the troublesome server and saw that the process was trying to bind to an old IP address. I checked /etc/hosts and realized I hadn't updated it to reflect the changed IP. Set it up correctly and, boom, the NameNode was up. With that up, I was able to start all the other services and all is right with the world. The only thing left is to figure out why Ganglia isn't reporting stats for the other servers. Hope this helps anyone going down the Hadoop road!
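    (For anyone who hits the same thing, these are the sorts of checks that would have caught it; the hostnames and IPs below are made up.)

        # Forward/reverse resolution has to match what the box actually has;
        # the NameNode binds based on what these return:
        hostname -f
        getent hosts $(hostname -f)
        ip addr show   # compare against what /etc/hosts claims

        # /etc/hosts should carry the *current* address of every node, e.g.
        # (made-up values):
        #   10.10.1.11  master1.example.local  master1
        #   10.10.1.12  worker1.example.local  worker1
        grep -v '^#' /etc/hosts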
  • the_Grinch Member Posts: 4,165
    Let me tell you, want to see if you messed up any of your servers when you initially set them up? Install HDP. Every problem I have run into has been due to some setting on the server that I screwed up when I set them up months ago. At this point I am going to reinstall eight servers and start from scratch. I will say I definitely learned a lot along the way and it's nice to know I am getting my servers in proper working order.
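    (For reference, these are the usual host-level prep items for HDP on CentOS-style boxes; run as root and adjust to your environment.)

        getenforce                                       # SELinux: permissive/disabled for the install
        service iptables status                          # firewall off, or holes punched for Ambari/HDP ports
        service ntpd status && chkconfig ntpd on         # clocks in sync across the cluster
        cat /sys/kernel/mm/transparent_hugepage/enabled  # THP should read 'never'
        # (on some RHEL/CentOS 6 kernels the path is /sys/kernel/mm/redhat_transparent_hugepage/enabled)
        echo never > /sys/kernel/mm/transparent_hugepage/enabled
        umask                                            # 022 expected
        df -h                                            # enough room where HDP actually writes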
  • N2IT Inactive Imported Users Posts: 7,483
    Great attitude!

    Keep us posted, very interested in seeing how this turns out. Do you have analysts who are going to leverage this data, or some department? Just curious how this ties into your company on the back end.
  • NightShade03 Member Posts: 1,383
    Great to see your progress on this (and glad you are having fun with it). I will say this one thing though: when deploying Hadoop on-prem or in the cloud you should always use an automation tool. Building servers, configs, ensuring things are updated... all a major pain in the ***. Much easier to plan everything out on paper, build some Puppet/Chef/Ansible roles, and deploy! Obviously the pains of installing by hand help with the understanding piece and will help you learn how to plan/design better in the future too.
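    (To make that concrete, even a few ad-hoc Ansible commands go a long way; minimal sketch below, assuming an inventory file called hadoop-hosts, SSH access already in place, and whatever privilege-escalation flag your Ansible version uses.)

        ansible all -i hadoop-hosts -m ping
        ansible all -i hadoop-hosts -m yum     -a "name=ntp state=present"
        ansible all -i hadoop-hosts -m service -a "name=ntpd state=started enabled=yes"
        ansible all -i hadoop-hosts -m copy    -a "src=hdp-local.repo dest=/etc/yum.repos.d/hdp-local.repo"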
  • the_Grinch Member Posts: 4,165
    I have one full-time database guy and another guy who is pretty well versed in SQL, so they'll be the go-to for the everyday stuff. I plan on getting partially into that role as well, though I'll still be responsible for the overall operations aspect of the cluster. I've gotten a lot of perspective, and if I find I enjoy it (along with getting into DSU and completing my Master's in Applied Computer Science) I am considering looking at the MS in Analytics after.

    I'll give you some perspective on what we are doing:

    Network Monitoring (to a point) - I'm in regulation, and to ensure compliance we do a fair amount of network monitoring. This includes file integrity, NetFlow analysis, and log analysis. Ultimately we want to know if a change took place, when it happened, who did it, and whether it was approved. To do this we currently use OSSEC (file integrity and log analysis), Logstash (parsing of log data and NetFlow) to convert data to JSON and import it into Elasticsearch, and Elasticsearch (with Kibana) to run queries against the data. Elasticsearch is great for real-time data analysis, but ultimately long-term analysis needs to be done too (and I don't feel it is a good tool for financial analysis).
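    (To give a flavor of the Elasticsearch side: ad-hoc queries against the parsed data look roughly like this. The index pattern is the Logstash default; the field names are just examples, since they depend entirely on your Logstash filters.)

        curl -s 'http://localhost:9200/logstash-*/_search?pretty' -d '{
          "size": 10,
          "sort": [ { "@timestamp": { "order": "desc" } } ],
          "query": { "query_string": { "query": "type:netflow AND src_ip:10.1.2.3" } }
        }'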

    Financial analysis (kinda) - Another area of regulatory compliance revolves around reviewing specific data. Currently all of this is done via providers creating and distributing reports. Generally speaking, those reviewing this data are accountant-like people, so the reports are XML or Excel. They can do some manipulation and are very good at finding compliance issues, but there are problems. First, the sheer volume of data is insane; no person can truly sort through it all and paint an accurate picture (at least in a timely fashion). Second, there's a fair amount of faith placed in the providers that they are correctly reporting their data. Third, more insight is needed into how a provider arrived at the report they provided; in particular, those reviewing the data need a better understanding of how it was manipulated. Finally, there's undue cost on providers to comply with the reporting requirements. The data is only getting bigger and bigger (as an example, one XML report is running about 300 MB a day), and there is a period it needs to be retained for, which is another cost. They're also having to spend a lot of money to get the data into a valid report, and they're still making mistakes a year in.

    How do we solve all these problems? Hadoop. The plan is to sit with each provider and map out exactly what their databases look like. There will be a review and agreement (between my team, the accountants, and the provider) about which fields will provide the correct data for review. From there the provider will create a replica of the data (a day at a time), which we will pull and place into Hive. The replica will only contain 30 days of data (in the event there is an issue and we need to pull it again). This will also allow us to generate a monthly report ourselves and compare it with the report they are running. Via Pentaho (an amazing tool, check it out if you don't know about it) we will run the needed queries to generate the reports from the raw data we collected. Obviously it will take time to confirm that it is running properly, but once confirmed these reports will run automatically on a nightly basis. Through the Pentaho web interface the people who need the reports will be able to log in and review them (on top of, at some point, being able to run their own reports). We've already gotten a report for them to review, and let's just say they like it.

    The beauty of Hive is that the schema is applied on read, and through Pentaho the accountants will be able to look at not only canned reports but also create their own, without fear of messing up the data in the lake.
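    (To make the schema-on-read point concrete, the table definitions are basically external tables over files already sitting in HDFS; everything below - table name, columns, delimiter, location, date - is made up for illustration.)

        hive -e "
        CREATE EXTERNAL TABLE IF NOT EXISTS provider_x_daily (
          txn_id     STRING,
          txn_ts     STRING,
          account_id STRING,
          amount     DECIMAL(18,2)
        )
        PARTITIONED BY (ds STRING)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        LOCATION '/data/provider_x/daily';

        -- register a day of replicated data as a partition (placeholder date)
        ALTER TABLE provider_x_daily ADD IF NOT EXISTS PARTITION (ds='2015-01-31');
        "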

    Geolocation and long-term network analysis - Currently geolocation is one regulatory requirement. We lack the ability to run reports or perform analysis on our own, but with Hadoop that will change: we'll be able to load this data in and then perform the analysis ourselves. We've done it via Excel (we were able to prove that someone traveled a distance of thousands of miles in the span of 15 seconds), but we'd like to go further. Trending the network data is also a big thing. In one case we were able to track the movement of data through several different servers, and we'd like to be able to review that over a period of several months in search of APTs.
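    (That "thousands of miles in 15 seconds" check translates to Hive pretty directly; rough sketch below, with made-up table and column names, assuming timestamps stored as 'yyyy-MM-dd HH:mm:ss' strings.)

        hive -e "
        -- haversine distance (miles) between consecutive fixes per user,
        -- plus the seconds between them
        WITH ordered AS (
          SELECT user_id, event_ts, lat, lon,
                 LAG(lat)      OVER (PARTITION BY user_id ORDER BY event_ts) AS prev_lat,
                 LAG(lon)      OVER (PARTITION BY user_id ORDER BY event_ts) AS prev_lon,
                 LAG(event_ts) OVER (PARTITION BY user_id ORDER BY event_ts) AS prev_ts
          FROM geo_events
        )
        SELECT user_id, prev_ts, event_ts,
               3959 * 2 * ASIN(SQRT(
                 POW(SIN(RADIANS(lat - prev_lat) / 2), 2) +
                 COS(RADIANS(prev_lat)) * COS(RADIANS(lat)) *
                 POW(SIN(RADIANS(lon - prev_lon) / 2), 2))) AS miles,
               UNIX_TIMESTAMP(event_ts) - UNIX_TIMESTAMP(prev_ts) AS seconds
        FROM ordered
        WHERE prev_ts IS NOT NULL;
        "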

    Also, to plug Hortonworks a bit, HDP 2.2 has added ACID to Hive, which is very important in the enterprise world, along with the needed security and auditing features. Did I mention it is 100% free?

    That's a small breakdown of what we are doing and trying to accomplish. It's a huge undertaking being completed by a team of four (not including my boss). I've been able to set the tempo that we need to treat this like BBQ (cook it low and slow). They tend to like to rush things around here, but I've illustrated that while it will take a decent amount of time, once it is done correctly for one provider all the others will be simpler. Hopefully this is pretty clear ;)
  • the_Grinch Member Posts: 4,165
    Very solid advice! I have Spacewalk up and running for updates, but I definitely need to review and get going on better deployment strategies. I'll dig into this over the weekend!