HDP Data Science Course

the_Grinch Member Posts: 4,165
Found out on Friday that work has approved my boss and me attending the HDP Data Science course! I got our cluster up and running and am getting ready to start importing the data. I had to make the argument that training was needed: while we have some idea of how to get the information we are seeking, this course will better enable us to get it. Definitely going to be a fun-filled three days! Course description below:

Data Science for the Hortonworks Data Platform covers data science principles and techniques through lecture and hands-on experience. During this three-day course, students will learn the processes and practice of data science, including machine learning and natural language processing. Students will also learn the tools and programming languages used by data scientists, including Python, IPython, Mahout, Pig, NumPy, pandas, SciPy, Scikit-learn, the Natural Language Toolkit (NLTK), and Spark MLlib.

Upon completion of this course, students will be able to:
Recognize use cases for data science
Describe the architecture of Hadoop and YARN
Explain the differences between supervised and unsupervised learning
List the six machine learning tasks
Recognize use cases for clustering, outlier detection, affinity analysis, classification, regression, and recommendation
Use Mahout to run a machine learning algorithm on Hadoop
Write Pig scripts to transform data on Hadoop
Use Pig to prepare data for a machine learning algorithm
Write a Python script
Use NumPy to analyze big data
Use the data structure classes in the pandas library
Write a Python script that invokes a SciPy machine learning algorithm
Explain the options for running Python code on a Hadoop cluster
Write a Pig User Defined Function in Python
Use Pig streaming on Hadoop with a Python script
Write a Python script that invokes a scikit-learn machine learning algorithm
Use the k-nearest neighbor algorithm to predict values based on a training data set
Run a machine learning algorithm on a distributed data set on Hadoop
Describe use cases for Natural Language Processing (NLP)
Perform sentence segmentation on a large body of text
Perform part-of-speech tagging
Use the Natural Language Toolkit (NLTK) to implement NLP tasks and machine learning algorithms
Explain the components of a Spark application
Write a Spark application in Python
Run machine learning algorithms on Hadoop using Spark MLlib
WIP:
PHP
Kotlin
Intro to Discrete Math
Programming Languages
Work stuff

Comments

  • alxx Member Posts: 755
    That's a lot for three days! Overload time!

    pandas, SciPy, NumPy, and scikit-learn can each take months or longer to understand in any real depth.

    Have fun and Good luck!
    Goals CCNA by dec 2013, CCNP by end of 2014
  • the_Grinch Member Posts: 4,165
    Thanks for the heads up! Figure it will be a decent overview and then I can see where we head after.
  • yzT Member Posts: 365
    pretty good course, I'd like to do something similar :D
  • the_Grinch Member Posts: 4,165
    Class starts on Monday! I'll keep this thread up to date after each day.
  • UnixGuy Are we having fun yet? Mod Posts: 4,232
    Excellent!
    Certs: GPEN, GCFA, CISM, CRISC, RHCE
    In Progress: MBA
  • the_Grinch Member Posts: 4,165
    Day One is over and I have to say the course has been really interesting. I have taken a previous course with this company and the trainers have been amazing (I won't list the name here, but if you are wondering who the company is, feel free to PM me). Hortonworks made the decision (rightfully, in my opinion) to move away from Docker and VMware to NoMachine and AWS. In the ops course you would do simple tasks and they would take forever to run. You were also relying on the local IT guy to make sure certain things were loaded correctly (we only had a few issues in my ops course, but obviously that slows things down).

    First we covered what data science is and a few cases of where it is used. Typical examples here, most notably the "Target knew a girl was pregnant before her father did" case. It demonstrates how truly powerful data science can be. Next we covered what Hadoop is and how it all works together. Obviously, this was all review for me and a number of people in the class, but Hortonworks likes to have no pre-reqs, so they make sure to cover everything to level the playing field. The instructor provided a very good example for MapReduce and it's always nice to see everything again.

    Next we covered Mahout and ran some algorithms over movie data. It was really cool to see how to take data, munge it a bit, and then run the algorithm to get movie recommendations. In this section we learned about the various types of machine learning tasks (clustering, regression, etc.) and when you would use them. I will say there isn't a lot of depth, but to be honest no short training course will give you that. Finally, we covered Pig and it was awesome. Unfortunately, I was unable to attend the Pig/Hive course with my boss and another coworker, so this was my first foray into it. Really amazing stuff, and it definitely simplifies a lot of things. I could see applications in a lot of the data we work with currently. Pig would really simplify the generation of reports from some of the csv/xlsx files we get.
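    To give a feel for the "munge the data, then run a recommender" idea, here is a minimal sketch in plain Python. It is not Mahout and not the course lab: it is a simple item co-occurrence recommender, and the `likes` data and function names are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: user -> set of movies that user liked.
likes = {
    "ann": {"Alien", "Aliens", "Blade Runner"},
    "bob": {"Alien", "Aliens"},
    "cat": {"Alien", "Blade Runner", "Heat"},
}


def build_cooccurrence(likes):
    """Count how often each pair of movies is liked by the same user."""
    counts = defaultdict(lambda: defaultdict(int))
    for movies in likes.values():
        for a, b in combinations(sorted(movies), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(user, likes, counts, top_n=3):
    """Score unseen movies by co-occurrence with movies the user liked."""
    seen = likes[user]
    scores = defaultdict(int)
    for movie in seen:
        for other, n in counts[movie].items():
            if other not in seen:
                scores[other] += n
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda m: (-scores[m], m))[:top_n]


counts = build_cooccurrence(likes)
print(recommend("bob", likes, counts))
```

    On a real cluster the counting step is the part that gets distributed; the scoring logic stays about this simple.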

    Tomorrow we'll be spending a lot of time on Python, which will be a good thing. My instructor teaches it as a college course, so I know we'll get a lot out of it. I'm definitely going to push to be sent to the Pig/Hive course, as right now I only have one guy who can actually do it (it will be tough for my boss to work on it), and I'm pretty sure I can get it approved. Also, my instructor says he has some practice tests for the administration exam, so I might be throwing that on my plate at some point.
  • the_Grinch Member Posts: 4,165
    Day Two is behind me and boy was it a doozy! We did some more work with Pig and man, I like that product more and more. Just in the past two days I have seen why so many companies ask for Pig experience. It's a very important tool, but it's also not a difficult tool to work with. Pig Latin is very easy to follow, and if you are versed in SQL you'll understand the thought process. After that, we dove into Python. It was a pretty advanced overview of the language, along the lines of CodeSchool, but a bit deeper.

    It was nice to do some data analysis and truly see the power that this language has. After that we started with IPython, and I have to say, what an amazing suite of tools. To be able to easily build charts, graphs, and tables is truly a godsend. I could see being able to build simple test dashboards or quick/dirty dashboards that don't need to be too pretty. Again, this is truly an overview, but it definitely gives you a great base to jump from. Finally we worked with Python and Pig. Definitely amazing stuff and sort of hard to describe, but the gist is that you can write a Python script to process data from within Pig.
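    A rough sketch of what "a Python script that processes data from within Pig" can look like, written as a Pig user defined function. The file name `udfs.py`, the function `clean_title`, and the `title` field are made up for illustration. Pig supplies the `outputSchema` decorator when a file is registered with `USING jython`; the fallback below is only an assumption-free way to let the same file run under plain Python for testing.

```python
# Pig side (Pig Latin), assuming this file is saved as udfs.py:
#   REGISTER 'udfs.py' USING jython AS udfs;
#   cleaned = FOREACH movies GENERATE udfs.clean_title(title);

# Under Pig's Jython engine, outputSchema already exists. This fallback
# only kicks in when the file is run outside Pig.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            func.output_schema = schema  # keep the schema for inspection
            return func
        return wrap


@outputSchema("title:chararray")
def clean_title(raw):
    """Collapse whitespace and title-case a string; pass nulls through."""
    if raw is None:
        return None
    return " ".join(raw.split()).title()


print(clean_title("  the   godfather "))
```

    Keeping the function free of any Pig-specific calls means it can be unit tested locally before it ever touches the cluster.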

    I highly recommend this course if you are thinking about data science as a future career. It will give you an idea of what your day may consist of and let you make an informed decision about whether that's what you want your life to be.
  • yzT Member Posts: 365
    Hmm, is it just me, or are they focusing a lot on Hadoop and its related technologies? The Hadoop ecosystem is already old; Spark is the future due to Hadoop's lack of real-time processing. That's what I've been reading for a while.
  • the_Grinch Member Posts: 4,165
    They are definitely applying data science across the whole ecosystem. The point is that you want to select the right tool for the job. Spark is amazing, but how often do you truly need the real-time capabilities? Also, Hadoop has made some major strides with regard to real time. Real time by the second? No. But can you wait a minute? Hadoop is in the minute range at this point.

    Also, Spark does most of its work in memory, so performance depends heavily on the amount of RAM on the cluster, and you still need to store the data somewhere. I look at it from the standpoint of what I will be doing: grabbing data on a daily basis and running a report. Loading the daily data into Spark and processing it? Very possible. Running monthly, quarterly, and yearly? Probably looking at a Hive job. Plus, this isn't data I need in real time.
  • the_Grinch Member Posts: 4,165
    Had to travel home last night so I didn't get to post about the last day of class. It was a bit of a whirlwind day as I had to play catch-up on my labs. We covered running a Python script via Pig through streaming. This included both running it on the Pig server and sending out the work, and sending the script to all the nodes for processing. I will say I didn't quite see what the major difference was in choosing one way over the other. I suspect it is a performance difference, but depending on the job it might not really matter much. That part was related to machine learning; after it we covered natural language processing. This is interesting from an application standpoint, as my instructor covered a pretty good example. You could pull web pages and, using Hadoop along with NLP, extract all the executive names and business locations (as you would in a pentest). From there you could go a bit further and use that list to find more connected data. We ended with a very brief view of Spark. Spark is starting to become the go-to tool for a lot of people and I'm sure Hortonworks will be developing a course around it soon enough.
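    For the Pig streaming part, here is a minimal sketch of the kind of script involved, assuming tab-separated input. The script name `label.py`, the `title`/`rating` fields, and the 4.0 threshold are all made up for illustration; the script just reads records on stdin and writes records on stdout, which is all streaming needs. The `SHIP` clause in the comment is what sends the script out to the nodes.

```python
import sys

# Pig side (Pig Latin), assuming this file is saved as label.py:
#   DEFINE label_cmd `python label.py` SHIP('label.py');
#   labeled = STREAM ratings THROUGH label_cmd
#             AS (title:chararray, label:chararray);


def process_line(line):
    """Turn 'title<TAB>rating' into 'title<TAB>label'; None for bad rows."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 2:
        return None
    title, rating = fields
    try:
        label = "liked" if float(rating) >= 4.0 else "other"
    except ValueError:
        return None  # rating was not a number
    return title + "\t" + label


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Stream records from stdin to stdout, silently dropping bad rows."""
    for line in stdin:
        out = process_line(line)
        if out is not None:
            stdout.write(out + "\n")


# Only consume stdin when something is actually piped in.
if __name__ == "__main__" and not sys.stdin.isatty():
    main()
```

    The same script can be tried locally with something like `cat ratings.tsv | python label.py` before handing it to Pig.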

    Overall, I really enjoyed the class. It's a 10,000-foot view of data science and gives you ideas on what you might want to do with the data you are collecting. Obviously, in three days you won't become a data scientist, but it will give you the initial resources you need to develop a map to accomplish your goals.