Big Data - Hadoop

Anyone working in the Big Data realm? Seems I'll be heading down this path with work and I'm hoping to avoid any common pitfalls along the way. Also (not that I think I am leaving, but always best to think ahead) it would be nice to get some perspective on the job market. How often is one administering the cluster along with performing the data analysis? All very interesting and I'm cautiously excited about it, but I fear I might be in a little over my head.

Comments

NightShade03

There are a ton of different angles here. You can be the engineer that runs the cluster, the guy that deals with the data, or the security person managing...wait for it...the security of everything.

First you'll want to understand what type of big data you are working with. Is the org really using Hadoop or are there other BI and DW tools in play? If they are using Hadoop, is it native Hadoop or one of the commercial editions like Cloudera or Hortonworks? What vertical is your org in? Are there any compliance requirements?

As far as roles are concerned:

Hadoop Engineer - Install, configure, maintain, and anything else that relates to keeping the cluster online and functioning properly. Your biggest challenge will be ensuring that everything is online and operational, while constantly helping to add new hardware and extend features for a variety of other user types. The best tool at your disposal will be some form of monitoring platform.

Data Engineer - The person that will actually analyze the data. Surprisingly, most people that fall into this role know very little about the underlying mechanics of Hadoop. They are the ones that munge the data, write MapReduce code, and work with the results for different visualizations and findings.

Security Engineer - Your job will be focused on the trifecta: visibility, authentication/authorization, and data security itself. You need to know who has access to what data, how the data is partitioned, where the data is going, etc. This is a wide ranging role that requires a ton of security and operational knowledge.
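To give a flavor of the MapReduce code mentioned under the Data Engineer role, here is a minimal sketch of the map / shuffle / reduce pattern as a word count. It runs locally in plain Python and does not use any Hadoop API; the function names are made up for illustration, and a real job would more likely be written in Java or run through something like Hadoop Streaming.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

On a cluster the map and reduce calls would be spread across many machines; the point here is only the shape of the code the data folks end up writing.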

The job market for this area (and big data in general) is exploding with one of the largest skills gaps next to IT security. The bigger problem here is that many people in this field that want to do things from the Data Science or Data Engineer perspective are required to have a PhD. Now that's not to say you can't be a Hadoop engineer or the ops guy, but those making the big bucks and answering the most complex questions in different verticals are all in possession of a PhD.

If you haven't already, I highly suggest you check out the Coursera courses. They are free and give a pretty high level dive into Data Science and Big Data as a whole. Also, if you have other specific questions I'm more than happy to help/answer them.

the_Grinch

First, thank you so much for this awesome post! Definitely helped to define the roles that will be involved in this project. Right now we are pretty small so overall the team will probably be wearing a few hats in the beginning (three of us...one database guy...myself more systems/networks...and the third who can do lots of different things). We're lucky in that everyone who will need access will be on site thus outside access will be limited if not completely blocked. I see your point about weighing the need of Hadoop versus another product (or traditional database). I believe Hadoop is the way to go based on the amount of data and the type of data (syslogs to number crunching revenue numbers). Unfortunately, we have no budget (beyond the hardware) so I don't believe we could select a commercial product. Compliance wise we are covered so not concerned in that aspect.

I'm definitely going to hit you up as we go further down this path.

NightShade03

Anytime! Starting small is definitely helpful because many orgs usually throw tons of resources and money at it thinking that will solve the problem, but in reality it only makes it worse.

Hadoop is definitely useful for different data types since it doesn't require you to define a schema in the same sense that RDBMS systems do. As for the commercial products...don't write them off completely. They are based on open source and entirely free to use. They follow the Red Hat model and only require payment for enterprise support. The benefit here is that these vendors have taken all of the core components of Hadoop and packaged them up in a single distribution, which makes getting off of the ground 10x faster. Check out the Hortonworks Sandbox (personal favorite) which comes with 18 tutorials on various big data uses...all for free.
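A rough way to picture the "no schema up front" point: raw lines are stored exactly as they arrive, and each reader applies its own structure when it reads them. The sketch below is plain Python, not Hadoop, and the pipe-delimited record layout, field names, and sample lines are all invented for illustration.

```python
# Hypothetical raw store: mixed record types kept as plain text, no table definition.
RAW_STORE = [
    "syslog|2014-06-01T10:00:00|web01|sshd|Accepted password for alice",
    "revenue|2014-06-01|EMEA|1250.50",
    "syslog|2014-06-01T10:05:00|db01|cron|job finished",
]


def read_revenue(raw_lines):
    """Apply a revenue structure at read time, skipping every other record type."""
    for line in raw_lines:
        fields = line.split("|")
        if fields[0] == "revenue":
            yield {"date": fields[1], "region": fields[2], "amount": float(fields[3])}


def read_syslog(raw_lines):
    """Apply a syslog structure at read time; the message keeps any later pipes."""
    for line in raw_lines:
        fields = line.split("|", 4)
        if fields[0] == "syslog":
            yield {
                "timestamp": fields[1],
                "host": fields[2],
                "program": fields[3],
                "message": fields[4],
            }
```

Both readers work off the same untouched lines, which is why syslogs and revenue numbers can sit side by side without anyone designing tables first.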

the_Grinch wrote: »

I'm definitely going to hit you up as we go further down this path.

No problem! Happy to help!

the_Grinch

Sweet! I will definitely check it out!!!

UnixGuy

Well not sure if this helps but I manage a cluster, the users (Data analysts) do all the big data stuff. From my perspective, it's just a high performance cluster.

Edit: Forums don't allow me to give more reps+1 to NightShade!

NightShade03

I love cookies

Double chocolate chip please!

the_Grinch

We're going to try out Hortonworks! We learned with Elasticsearch that the ability to monitor the cluster is very important. Thanks for the suggestion!!

NightShade03

Very cool! ElasticSearch is another awesome tool that I love

the_Grinch

Haha, yeah we are really enjoying Elasticsearch. Finally have our cluster running properly and it has truly made a difference in our monitoring.

NightShade03

Another common big data strategy that orgs don't seem to be understanding is that tools like Splunk or ElasticSearch are supposed to be used for "hot data" or in other words, data that is available for quick search within a predefined window (usually up to 90 days). After the window, data should be offloaded to a SIEM or other long term storage solution (like HDFS) for historical analysis.

I'm actually in the middle of developing a 3 day workshop that discusses data analytics and how security ties into it...this thread is definitely part of the bigger picture.

the_Grinch

Haha, it's actually very funny you mention how Elasticsearch is meant for the realtime and then meant to be moved. The guy I work with (this entire setup has been his vision) has planned exactly as you have stated. We'll retain the Elasticsearch for a period of time and then move it into Hadoop after that window. Realtime = Elasticsearch and everything else will be Hadoop. Since you have experience with it, how do you perform your moves after 90 days? I know there is a "delete" in Elasticsearch, but it doesn't truly delete anything. Are you manually moving the files and importing them into your Hadoop cluster?

NightShade03

So technically the connection between Hadoop and ElasticSearch should always exist, which means even the hot data will be pushed to Hadoop in the same instance that it is also recorded in ElasticSearch/LogStash. The key here is in how you design the indices for ElasticSearch. One of the options is to create an index for each day or date. You can then create a cron job (or something similar) that will delete an index after it has reached a certain threshold. See this API call:

Delete Index
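As a rough sketch of what such a cron job could run, the Python below picks out daily indices that have aged past the retention window and builds the URL an HTTP DELETE would target. It assumes Logstash-style index names (`logstash-YYYY.MM.DD`) and a 90 day window; the function names and the base URL in the usage note are placeholders, and it deliberately makes no network calls.

```python
from datetime import date, datetime, timedelta

INDEX_PREFIX = "logstash-"  # assumption: one index per day, Logstash-style naming
DATE_FORMAT = "%Y.%m.%d"


def expired_indices(index_names, today, retention_days=90):
    """Return the daily indices whose date is older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in index_names:
        if not name.startswith(INDEX_PREFIX):
            continue  # leave non-daily indices alone
        try:
            day = datetime.strptime(name[len(INDEX_PREFIX):], DATE_FORMAT).date()
        except ValueError:
            continue  # suffix is not a date, so skip it rather than guess
        if day < cutoff:
            expired.append(name)
    return sorted(expired)


def delete_url(base_url, index_name):
    """Build the URL that an HTTP DELETE request would hit to drop a whole index."""
    return f"{base_url.rstrip('/')}/{index_name}"
```

For example, `delete_url("http://localhost:9200", name)` for each name returned by `expired_indices(...)` gives the list of requests to send. Make sure the same data has already landed in Hadoop before anything like this runs for real.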

Another option is to use some of the up and coming tools from GitHub which will do all of this for you:

https://github.com/elasticsearch/curator

This is only one method to accomplish this task, but definitely the easiest. Also don't forget that ELK is still maturing and relatively new to the market in comparison to things like Splunk which has been around for 6+ years at this point.

Krones

Our reporting team uses Wherescape RED with MS SQL, and the Business team accesses data cubes via Excel.

The raw data (sessions, tracking, etc.) is done with MySQL (and that is, well, quite a beast in and of itself - pretty big site).

Also, we may be evaluating Hortonworks soon. I just installed a basic Hadoop/Hive setup for our reporting team for now.

All above my head, but in good time.

Our data team tends to act as the DBAs but I end up helping with a lot of run of the mill domain and occasional admin tasks. Also learning my way around SSMS and running some queries. Fine with me (beats password resets in AD - haha) but I'm more interested in learning the NoSQL side versus Microsoft.

Sounds like you are on the right path. I think there is a definite distinction between DBA and Data Scientist. Cannot do it all.

the_Grinch

Yeah I will say this is some of the most interesting technology I have worked with and if we get everything running we will be doing some amazing things. I tend to think, at least initially, we'll be doing everything since our group will mainly be using it. As it grows we might then add or dedicate someone to it solely. Thanks again for all the help!

wes allen

NightShade03 wrote: »

Another common big data strategy that orgs don't seem to be understanding is that tools like Splunk or ElasticSearch are supposed to be used for "hot data" or in other words, data that is available for quick search within a predefined window (usually up to 90 days). After the window, data should be offloaded to a SIEM or other long term storage solution (like HDFS) for historical analysis.

This is from a security perspective, but I think most people that use Splunk/ELK and a SIEM tend to use the SIEM for 0-48 hours for alerting and incident workflow management, and Splunk/ELK for researching an incident and storing logs long term. Since most SIEMs use slow DBs, trying to search through 30+ days of logs can be painful, where Splunk has no problem.

the_Grinch

Splunk is definitely an amazing tool for sure. But you definitely pay a lot for the privilege of using it. ELK has given us the same performance at just the cost of the hardware we have it installed on. Now that we have everything set up properly we are having no issues going back to logs from March (these are logs for file changes, netflow, and about 120 servers with all types of other logs). We definitely need to get these logs off that box, but for now it will do.

NightShade03

Just as a follow up to this thread:

Recent Network World article talking about the demand for Data Scientists. There are 36,000 roles that need to be filled and only 6,000 people to fill them. A DS with a few years of experience gets 100 emails a day from recruiters and the average starting salary ranges from $200,000 to $300,000. Not bad for an in demand job!

Big Data scientists get 100 recruiter emails a day

the_Grinch

I think I know what Master's #2 will be.