How to start learning R (part 2)


Let’s keep talking about R! This is the second part of the post about how to start learning this programming language. In the first part, I tried to highlight that the most important thing about learning R is wanting to learn, no matter what your background is. It doesn’t matter if you have a technical or scientific background (although it helps) as long as you are willing to learn.

If you already know about programming or statistics, this Data Science specialization offered by Johns Hopkins University through Coursera will be much easier for you. If you don’t have any previous knowledge about it, don’t worry, you’ll learn. Just focus on wanting to learn and that’ll keep your motivation up.

Having said that, in this second part of the post I explain each of the courses in more detail. In the previous post I talked through what the 9 courses have in common, but I forgot to add a couple of tips that helped me with the quizzes and assignments:

– The typical phrase that we are all tired of hearing: Google is your best friend. If there is something you don’t know, google it. You could swap Google for Yahoo or the library in your town, it’s up to you. The point is that although most of the concepts are explained in the weekly videos, some parts of the quizzes and assignments require a bit of research to solve. That isn’t difficult, because if there is one thing always within our reach, it’s information.

– The forums and the other students in the course are also very good friends: every time I’ve asked anything in one of the forums, it’s been answered almost immediately. I know it’s online, but I’ve perceived a lot of fellowship and eagerness to help.

The specialization has 9 courses that can be done independently of each other. However, they are related, and having done one course makes the following ones easier. I’ve done 2 each month, in the same order in which they are listed on the Coursera site, which is also the order in which they are explained in this post.

1. The Data Scientist’s Toolbox:

This is the easiest course. It’s so easy that it’s worth starting the second one at the same time. The videos in the first week explain what the other courses are about, and the ones in the second week cover the tools that will be used during the specialization (Git, RStudio, etc.). It has 3 quizzes and a very simple assignment.

2. R Programming:

Probably the most difficult course for me, as I didn’t have any previous programming knowledge (HTML and CSS don’t count…). The videos are easy to understand, but my problems began when I started the first assignment. This course has 3 assignments: 1 in the 2nd week, 1 in the 3rd and 1 in the 4th. Something I didn’t know is that watching the videos of week 2 is very helpful for the assignment in week 1, and watching the videos of week 3 helps with the assignment in week 2 (and so on). You don’t usually need to do that in the other courses, but in this one it’s useful.

In the end I finished R Programming with distinction (you get it when your grade is higher than 90%). For me, everything started to make sense when I realised that a programming language is like any other language, and sometimes you just have to think in a logical way. Obviously, that only gets you as far as simple functions, because to have an advanced knowledge of R (as with any language) you need to learn more and gain more experience. Still, there are many (beautiful) things you can do with a basic knowledge of this language.

The next 3 courses, until the Statistical Inference one, weren’t difficult.

3. Getting and Cleaning Data:

Here you learn to clean and organise data sets. If they contain NAs, or you need to filter the data or subset part of it, this course explains how to do it. It has 3 quizzes and a mandatory assignment. There is also an optional assignment done with swirl that gives you extra points and helps to clarify the concepts learnt.
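To give a flavour of what that cleaning looks like, here is a minimal sketch in base R. The `scores` data frame and its columns are invented for illustration and don’t come from the course, and there are other ways to do the same thing.

```r
# Hypothetical data set, made up for this example only
scores <- data.frame(
  student = c("Ana", "Ben", "Cleo", "Dan"),
  quiz    = c(8, NA, 6, 9),
  week    = c(1, 1, 2, 2)
)

# Count the missing values (NAs) in one column
n_missing <- sum(is.na(scores$quiz))

# Keep only the rows that have no NAs in any column
complete <- scores[complete.cases(scores), ]

# Filter rows by a condition and keep a subset of the columns
week2 <- subset(complete, week == 2, select = c(student, quiz))

# Compute a mean that skips NAs instead of returning NA
mean_quiz <- mean(scores$quiz, na.rm = TRUE)
```

Printing `complete` and `week2` after each step is a good way to check what was dropped.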

4. Exploratory Data Analysis:

This course shows how to build plots in R to analyse the data. It’s very important because plotting comes up again in all the following courses, and it matters a lot in data analysis in general. As an analyst, making plots is essential when you start analysing the data, both to detect outliers and to understand the relationship between the different variables. Visual representation is always mentioned as part of communicating the findings, but it is just as important for you as the analyst at the start of the analysis.

It has 2 quizzes and 2 projects.
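As a rough idea of that first look at the data, the sketch below uses base R graphics on `mtcars`, a data set that ships with R. That choice is mine for illustration; it is not one of the course projects.

```r
# Boxplot of fuel consumption by number of cylinders:
# points drawn beyond the whiskers are candidate outliers
boxplot(mpg ~ cyl, data = mtcars,
        xlab = "Cylinders", ylab = "Miles per gallon")

# Scatter plot to see how two variables relate to each other
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# The values a boxplot of mpg would flag as outliers, as numbers
outliers <- boxplot.stats(mtcars$mpg)$out
```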

5. Reproducible Research:

This course shows how important it is that the analysis and the code are available to other people so they can reproduce the same research. It has 2 assignments and 2 quizzes.

6. Statistical Inference:

This course has a different teacher from the previous 5, and things get complicated again. Many think that it’s complicated because of the teacher, but for me it’s mostly the difficulty of the subject, and only to a much smaller extent the way it’s explained. Brian Caffo teaches this course and Roger D. Peng taught the previous ones. In the first part of the post you can find my opinion about the 3 teachers.

Statistical Inference is like going back over the notes you took at uni in your statistics subjects. In my case I had statistics as an optional subject during my degree in Advertising, and then as mandatory subjects when studying the Market Research degree. The course talks about probability, variance, distributions, confidence intervals, t-tests, p-values, etc. The videos and quizzes require more time than in the previous courses. I did this one at the same time as Reproducible Research, and tried to finish Reproducible Research in the first weeks so I could spend more time on Statistical Inference.

It has 4 quizzes, 1 mandatory assignment and 1 swirl assignment. The videos are available on Brian Caffo’s YouTube channel. There are also some videos called Homework that were created to help with the quizzes, which are kind of difficult.
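For anyone curious what a t-test, a p-value and a confidence interval look like in R, here is a small sketch using `sleep`, a data set that ships with R. It is my own illustration rather than a course exercise, and it treats the two groups as independent, which is a simplification.

```r
# Welch two-sample t-test: does the extra sleep differ between the two groups?
res <- t.test(extra ~ group, data = sleep)

# p-value of the test
res$p.value

# 95% confidence interval for the difference in means
res$conf.int
```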

7. Regression Models:

This course, together with the next one (Practical Machine Learning) and R Programming, was one of my favourites, as well as one of those that took me the most time. If you like statistics, you’ll really enjoy them. I would, though, remove from this course the mathematical part that is used to explain some of the functions. It’s optional this time (unlike in Statistical Inference), but it can make the videos confusing, and the course doesn’t go deep enough for you to really need the maths behind it. Despite that, Regression Models is a beautiful subject. The course is about analysing the relationship between a dependent variable and other independent variables, and it focuses a lot on linear models.

This course has 4 quizzes, 1 mandatory assignment, and 1 swirl assignment. As with the Statistical Inference course, the videos are available on Brian’s YouTube channel.
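A linear model of the kind described above takes very little code. The sketch below is my own example, fitting one on `mtcars` (a data set that ships with R); the choice of variables is arbitrary and not taken from the course.

```r
# Model fuel consumption (dependent variable) as a function of weight
fit <- lm(mpg ~ wt, data = mtcars)

# Intercept and slope of the fitted line
coef(fit)

# Full summary: coefficients, standard errors, p-values, R-squared
summary(fit)

# Predicted mpg for a hypothetical car weighing 3000 lbs (wt is in 1000 lbs)
predicted <- predict(fit, newdata = data.frame(wt = 3))
```

More independent variables can be added on the right-hand side of the formula, for example `mpg ~ wt + hp`.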

8. Practical Machine Learning:

Jeff Leek is the teacher in charge of this course. It could be said that this is the practical part of Regression Models. It focuses on building predictive models in R, including models such as Random Forest, Boosting, Forecasting, etc. It requires more work than the first courses, but it’s totally worth it. It has 4 quizzes and 1 project.

9. Developing Data Products:

Ninth and last course, and the one I’m currently taking. After doing the previous 3, I find this one much easier. To summarise a lot again, it shows tools to build interactive applications and dashboards with Shiny, rCharts or googleVis, and presentations with Slidify or RStudio Presenter. It has 3 quizzes and 1 assignment.

The specialization has a final project (the Data Science Capstone) that is only available once you finish the 9 courses. I haven’t been able to sign up for the project yet. I’ll share how it goes when I do!

I don’t have much more to add, just to highlight again that this specialization is for everyone who really wants to do it, even if your background is not technical. If you want to develop your career as an analyst, I totally recommend it. I won’t lie, it requires time and some parts aren’t easy, but nothing is impossible and it’s great to learn something new that you are passionate about.

As I said in the previous post, if anyone is interested in doing the specialization or has any questions, I’m always available on Twitter, LinkedIn, on this blog (you can leave a comment here) or by email 🙂

My story with R… how to start learning R


I said once that R was like being in love. I don’t remember where I said that, but I guess in that moment it made sense… or probably, as it’s happening now, it didn’t… maybe it’s just that I enjoyed learning R so much that it became part of absurd conversations in which this statistical programming language was the subject of many jokes.

Definitely if you’re not a geek (I like to call it freak in Spanish), the jokes about R (no matter how good they are… some were great) will seem like bad ones to you, but this post is not about humour (I know, you might have noticed), this is about learning. I have wanted to write this for a while because people in general believe that if you are not a geek or don’t have a technical background, learning R is very difficult. Well, it’s not. Wanting to learn is more important than whatever your background is.

I did a degree in Advertising at university, which is not technical at all. I think I chose that degree because back then I enjoyed writing and journalism had high unemployment. I won’t go into detail about my Advertising degree because I’m trying to sound positive here, but my point is that my background wasn’t scientific at the beginning, and it has become more so as I’ve spent more years working in data analysis. From my point of view, being a good analyst doesn’t rely on being more scientific or artistic, it depends on having an analytical mindset.

Learning R is much easier if you have some programming knowledge, and as I didn’t have any, I was a bit scared of starting the Data Science specialization that I’ve been studying for the last 4 months. The specialization is offered by Johns Hopkins University through Coursera. It has 9 courses plus a final project, and now I can say (after 4 months and with no programming knowledge at the beginning) that I’ve almost finished all the courses (I’m currently doing the last one).

If you want to learn R, I totally recommend this specialization. Also, as it’s through Coursera, you can do it for free. The only difference between the free and paid version is that you don’t get a certificate at the end with the free version.

In a different post I’ll explain each of the courses in more detail, but there are some things that they all have in common. They are all 4 weeks long, and each week there are videos that you’ll have to watch to do the quizzes and assignments. The weekly videos can be watched in 1 or 2 hours (per week), and the ideal way to go is to do the quiz about those videos right after watching them. Sometimes, after going through so many videos, the brain is pretty dead and it’s better to wait until the next day. Unless you enjoy those moments when you read and reread the same thing over and over without having any idea of what it says because your concentration decided to go somewhere else.

Apart from the quizzes (usually 4, one per week), each course has assignments or small projects. Most courses have only one assignment, which is due by the end of the third week. Most of them are also graded by the other students, so it’s very important to read the questions that need to be answered in each assignment (you can find them on the page where the assignment is submitted). The other students are usually very nice when grading, but if you’re missing any of the answers, they obviously can’t give you all the points. You will also have to grade 4 of the other students’ assignments during the 4th week. It’s a great way to learn because you get to see different ways to solve the same problem.

I’ve done 2 courses each month, so I’ll be done in less than 5 months, and after that I’ll have to wait until the final project of the specialization (the Data Science Capstone) is available. You can’t start the project until the 9 courses are completed. Doing 2 courses each month requires time but is totally feasible, and it helps that 1 of the two courses is always easier than the other.

The specialization has 3 teachers, Roger D. Peng, Jeff Leek, and Brian Caffo. Roger D. Peng teaches in the first 5 courses, which are specifically about R programming, about specific functions to clean the data, organise it, do easy calculations, plots, etc… I know I’m summarising a lot, but roughly that’s what they are about.

Brian Caffo teaches the courses about statistics (Statistical Inference and Regression Models) and the last course (Developing Data Products). I’ve read some criticism online of the way he teaches, and I have to agree that sometimes he makes the concepts more difficult than they really are, but statistics is not an easy subject to teach. He focuses too much on the mathematical explanation behind the formulas used to calculate the statistical concepts, and although that is important, it often isn’t necessary given the nature of the course. Also, the mathematical part is optional in the Regression Models course, where Brian says to skip the video if the student is not interested in learning about that. Despite all that, I wouldn’t say he’s a bad teacher at all, and the Developing Data Products course proves it.

Jeff Leek is the teacher responsible for the second-to-last course, Practical Machine Learning, which shows how to build predictive models in R in a more practical way than the courses about statistics do.

I personally liked the 3 teachers. The 3 of them show passion about what they do and it’s always a pleasure to listen to someone talking about something that they are passionate about.

R Programming, Regression Models and Practical Machine Learning were my favourite courses. I did the last 2 at the same time during Christmas, and thanks to being on holiday, I could spend more time on them. Otherwise my advice is not to do them together because, along with Statistical Inference, they take longer than the other courses.

As I said above, I’ll publish a different post explaining all the courses in detail, but if anyone has a question, I’m available on Twitter, LinkedIn, by leaving a message here or by email 🙂

R was an important part of the last months of my 2015 and I hope it stays that way during my 2016. It’s been a long time since I enjoyed learning something new so much. R is not like being in love, but it leaves you with the same great sensation that feeling passionate about something does.