Sunday, November 12, 2017

Data Camp

I have a data-science type of job. I want to take this opportunity to shamelessly plug Datacamp, an online series of courses designed to teach the student about various data science topics and the very practical aspects of programming languages. You can do a track in the Python language or the R language. (There are also SQL courses. SQL skills are very important because you need to be able to wrangle data, but you’ll need R or Python to actually fit some of the predictive models that make “data science” a real discipline.)

I thought I was an intermediate-to-advanced R user, but Datacamp really filled in some of the gaps in my knowledge. Like most R users (I think), I am self-taught. I learned the language doing things ad hoc, one thing at a time. Datacamp forced me to follow a well-structured curriculum. You can actually learn to do some pretty advanced things in R without learning the basics. But "the basics" make it easier to write efficient code that other R users can read. (See this hilarious example of an advanced R user being foiled by the basics. Basics which, by his own admission, he had learned and apparently forgotten.)

There are various "career tracks", depending on the level of skill you want to develop. There is a "data analyst" track for people who are basically Excel jockeys wanting to up their game a little. Then there is a "data scientist" track for people who want to build sophisticated predictive models using the latest modeling methods. There is a lot of overlap, too. If you finish the data analyst career track (which is 16 courses, each taking ~4 hours to complete), you're really not far from finishing the data scientist track (23 courses, 16 of which are in the data analyst track). Once you subscribe, you have unlimited access to all of their courses. Datacamp adds new courses all the time, some of which are extremely similar to existing courses. This kind of redundancy might seem pointless, but it's very helpful to get a different take on the same topic from a different set of instructors (each course is designed and led by one or two instructors). I would have missed quite a lot if I had said, "Meh. This was already covered in that other course." I've officially finished the Data Scientist career track and the (much shorter, overlapping) Machine Learning skills track, plus several unrelated courses. The material is excellent.

The courses are a series of videos, quizzes, and short programming exercises. When I say "short", I really mean five or six lines of programming. It's often fill-in-the-blank stuff. A video will introduce a statistical concept, then an R function that performs a related task. In the next window, you will be quizzed to see if you understand the concept. In the next window, you will have to enter the arguments of an R function, then examine the output. If this sounds like it's too much hand-holding, it's not. Some of the coding exercises took me a long time to figure out, and I still had to cheat. (If you're struggling with a programming exercise, you can ask for a hint. If you still don't get it, you can ask for the full solution and move on to the next exercise. I'm usually proud, but I was not above skipping some exercises. You get 100 experience points if you complete a programming exercise without help, 70 if you complete it after asking for the hint, and 0 if you ask for the full solution. Get to 100,000 xp, and you can cast fireball.)

At any rate, I don't think you're supposed to blow through the lessons and simply move on to the next thing. I think a practical use of Datacamp is more like the following. Go through all the lessons in some skill track in whatever order you please. At this point you don't know everything, but you have seen everything once. You have a mental index, so you can find stuff when you need it and review. When you need to use something (like that cool graphical trick for making vivid bar plots), you can review the videos, slides, quizzes, and programming exercises. You've already done the programming exercises, so your progress on those is saved. If you return to those lessons later, you will see the finished code. You can simply run the parts of the code you want to see and examine the output. The course that took you 4 hours to complete the first time can be reviewed in maybe 1 or 2 hours. Then you can apply your learnings to a cool work project and impress your boss. Or you can simply do spaced repetition and stay sharp on everything. I've re-done some lessons several times, even redoing entire courses.

A few pointers.

Use ?[function name] or (equivalently) help([function name]), which shows you documentation for the function you're using. Use it frequently. Use it even if you know how to do the programming exercise. There are a lot of arguments to any R function. There might be ten possible arguments, but you normally only enter two or three and the rest are handled by default values that you rarely have to mess with. Still, you should know what this function is doing.
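Here is a small sketch of what that looks like in practice. I'm using glm() purely as an example function, and glm_args is just a name I made up for the illustration.

```r
# Open the documentation for glm(); the two forms are equivalent
?glm
help(glm)

# Print the function's arguments and default values right in the console
args(glm)

# formals() gives the same information as a list you can poke at
glm_args <- names(formals(glm))
glm_args
```

Skimming the output of args() is a quick way to see how many arguments you have been leaving at their defaults.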

Use the ls() function to see what's in your workspace. (Just ls(). You can call it with no arguments.) For the programming exercises, data and models are already loaded into your workspace. Sometimes these are the products of previous programming exercises. It's really easy to forget what you're working on, particularly if you leave a lesson and come back to it later. Also, some of the more advanced courses give you fewer details about what you're supposed to do. Instead of saying "Enter 'price ~.' as the first argument, 'binomial' as the second argument, and 'diamonds' as the third argument of the glm() function..." it might say "Build a logistic regression on the data, in which 'price' is a function of all other variables." If you forget the context and don't know what "the data" refers to, ls() will show you what's in your workspace. Once you have that information...
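A minimal sketch of the idea, assuming a workspace with one data set and one model in it. The names cars_data and fit are hypothetical stand-ins for whatever a course has preloaded, and mtcars is one of R's built-in data sets.

```r
# Stand-ins for objects a course might have loaded for you (hypothetical names)
cars_data <- mtcars
fit <- lm(mpg ~ wt, data = cars_data)

# List the names of everything in the current workspace
workspace <- ls()
workspace

# ls() works fine with no arguments, but it does accept optional ones,
# such as a pattern to filter the names
ls(pattern = "^cars")
```

Once ls() has reminded you what "the data" is called, you can plug that name into the data argument of whatever modeling function the exercise asks for.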

Use the str() function on things in your workspace. (Short for "structure", not "string".) If you call it on data, it will tell you what kind of data you're looking at, how many rows, what are the fields, what are the first few elements of each field, etc. If you call it on a model that you've built in a previous exercise, it will tell you what kind of model it is, what kinds of parameters are present, etc. Like I said, some of the advanced courses don't hold your hand as much, so you have to look at what data and model items exist in your workspace. Call ls() to see what's in there and call str() on those objects to see what you're working with.
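For example, here is str() on a data set and on a model. This is only a sketch: I'm using the built-in mtcars data and a small logistic regression as stand-ins for whatever is in your workspace, and fit is a name I picked for the illustration.

```r
# A small logistic regression on a built-in data set (stand-in for a course model)
fit <- glm(am ~ wt, family = binomial, data = mtcars)

# On a data frame: number of rows and columns, column types, first few values
str(mtcars)

# On a model: a fitted model is a long list, so max.level = 1 keeps
# the output to a readable top-level summary
str(fit, max.level = 1)

# class() is the quick answer to "what kind of model is this?"
fit_class <- class(fit)
fit_class
```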

There's a "run code" and a "submit answer" button, both of which run all of the code in your window. If you want to run just one or two lines, highlight the code and hit control+enter. See what this does, and use str(), head(), and summary() to take a look at any items you've created. This is a useful exercise. It makes you pause and think about what you're doing. Sometimes a single line of code really has several nested functions, the results of which get passed as arguments to the next function, and so on. It's easy to miss what this is doing. It helps to run the intermediate steps in these chains.
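To make the point about nested functions concrete, here is a rough sketch using the built-in mtcars data. The variable names (roots, avg_root, result) are mine, not from any course.

```r
# One line, three nested calls: sqrt() feeds mean(), which feeds round()
round(mean(sqrt(mtcars$mpg)), 2)

# The same computation, run one step at a time
roots <- sqrt(mtcars$mpg)      # innermost call first
head(roots)                    # peek at the first few values
summary(roots)                 # quick numeric summary of the intermediate result

avg_root <- mean(roots)        # middle call
result <- round(avg_root, 2)   # outermost call
result
```

Highlighting just sqrt(mtcars$mpg) inside the one-liner and hitting control+enter gets you the same peek at the intermediate step without creating any new variables.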

Some exercises are really freaking aggravating. Did you know that a filled circle is different from a solid circle? I didn't. But I do now! Anyway, stick with it. I promise you, you'll get over these annoyances.

You can look at the slides from the videos while doing the coding exercises. This can be important, because often the video introduces a function, which is poorly explained or not introduced at all in the instructions for the coding exercise.

Don't be too proud to Google or check out Stack Overflow for help. Some of the exercises don't give you enough information, or perhaps they are drawing on a previous, long-forgotten lesson (possibly from a different course, possibly one that you haven't taken yet!). It's fine. Remember you can always take a hint and even show the full answer. You can always do a "post mortem" on these and see where you went wrong by googling the library or functions you're using.

A note on pricing. I paid $300 for a 1-year subscription. They have special deals all the time, so I have no idea what I'd be paying if I signed up today. (I think I narrowly missed some "half-price" special. Oh, well.) I'll probably renew at least for the next few years. This might sound like a lot of money, but it's trivial compared to the amount it can add to your salary. Maybe Datacamp isn't the right tool for you at this stage of your career. (There are other options like Coursera and Codecademy, but I know much less about these other services.) But anyone with aspirations of self-improvement and career advancement should be willing to spend a comparable sum for education, accreditation, technical practice, professional development, or whatever. Buy textbooks, do online classes, or do some form of accreditation. If you have this ethic of self-improvement, with the associated willingness to shell out your own cash to achieve it, you will go far. If you take an "I won't do this unless corporate pays for it" attitude... well, then you have a chip on your shoulder and it's hurting your career. Do some self-improvement on your own time and on your own dime, and you can eventually abandon that stingy corporation for a more enlightened one.
