R vs Python

Ever since Harvard Business Review called ‘data scientist’ the sexiest job of the 21st century, it seems everyone wants to pick up data science skills. In today’s increasingly ML-, AI- and data-driven world, data science is fast becoming a point of parity in many industries. The curious student stepping into the world of data science is told that it comprises many things – everything from data wrangling to building predictive models and visualizations. Invariably, the next question is: what tools and techniques can best prepare you for your data science journey? As is often the case, you are bombarded with terms you’ve never heard before. People might even suggest college courses on data analytics. However, not everyone has the time, will and resources to go back to college. So you decide to dust off your stats textbooks from high school, if you still happen to have them, and maybe even take a couple of online courses.

As you get your feet wet on the shores of data science, it becomes evident very quickly that data science is about a lot more than high school statistics. It also becomes clear that running clustering algorithms and random forests will take a lot more than pen, paper and the employment of your brain. The human brain is a remarkable processor that can do a lot of amazing things. However, while it was evolving on the savannah, the priority was not to build a processor that would be fantastic at performing multiple linear regression analysis in a matter of seconds. The priority was to build a brain flexible enough that, given sufficient time, it could develop systems to make life easier. Thankfully, we have had enough time to build such systems in the field of data science as well. The computing power that is now available at our fingertips is driving the data revolution. As you get started on that path, you come to understand the value and necessity of data science tools, and you need to choose a starting point. There are many data science tools out there, but if you are an individual trying to learn the basics, you may not want to pay for licensed software. Very often, the beginner data scientist narrows her choices to two words that are heard a lot in data science circles – R and Python. One is just one letter long and the other reminds you of a reptile most people wouldn’t exactly want to hug to sleep. Instantly you recognize that you are now a geek, if you weren’t one already. There’s no looking back now – time to decide which of the two is better for you.

Let’s first take a look at the origins of these two languages. R and Python are both open source programming languages that came into existence in the 90s. Both have huge numbers of people working on them in the background to keep them up to date, and, typical of the sharing economy that we live in today, decentralization is sacred. If a programmer in some corner of the world thinks that something can be done to improve the code, she can write that change and use it in her own application. If she wishes, she can even release it to the public so that others who share the same problem can also benefit from her efforts. The flipside is that there can be multiple variants of what are essentially the same things, and it may not always be easy to decide which ones are best suited for your needs. However, for most data scientists who are just starting out, this isn’t much of a hindrance. Both environments have strong data science capabilities, but what really separates the two (and consequently their user bases, to a large extent) is the type of projects they are used for. R is an implementation of the S programming language that took form in the pits of research, scientific and statistical communities. Python, on the other hand, is a multi-paradigm programming language that focuses on functionality and extensibility. It is clear that the core ideologies behind these two languages are vastly different. Let us look at the major differences between them.

The crucial differences between R and Python can largely be explained by their origin and evolution – and therefore by who used them in their early stages. R code tends to be written in a procedural, function-centric style. You have a large number of different functions that you can, and have to, use to get your tasks done. That very often means working with many libraries and knowing which library to use for the task at hand. A lot of these libraries may be designed to carry out just one type of activity, which makes them specific and small. However, knowing ‘when to use what’ is not as daunting as it might first seem, thanks to the terrific documentation and mutual help communities that are publicly available. The advantage is more flexibility in certain scenarios: you can work with small functions and tweak the code the way you want rather than having to work with huge objects and classes, which often makes your code lighter and nimbler. Python, on the other hand, is more object oriented than R, and encourages you to work in that manner. This means that you have large packages, often with wider capabilities than libraries in R. This makes it easier to keep track of all the packages and manage them. Plus, if it matters to you, names in Python are cooler – scrapy, pillow, twisted and pandas sound better than DMwR, rpart and e1071, don’t they? If you don’t care about cool names, one more thing about Python that stands out is how much more readable it is than R. As a result, the learning curve, especially for folks with little or no background in programming, is a lot less steep than with R. Similarly, if you want to develop a model from first principles, the wider capabilities of Python and its readability make it a much less intimidating prospect than R.
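
To make the point about object-oriented style and building a model from first principles a little more concrete, here is a minimal sketch in plain Python. The class name SimpleLinearRegression and the sample numbers are made up for illustration; this is one straightforward way to write ordinary least squares for a single predictor, not a reference implementation.

```python
class SimpleLinearRegression:
    """Ordinary least squares for one predictor: y = intercept + slope * x."""

    def __init__(self):
        self.slope = None
        self.intercept = None

    def fit(self, xs, ys):
        if len(xs) != len(ys) or len(xs) < 2:
            raise ValueError("need two equal-length sequences with at least two points")
        n = len(xs)
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        # slope = (sum of cross-deviations) / (sum of squared deviations of x)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        sxx = sum((x - mean_x) ** 2 for x in xs)
        if sxx == 0:
            raise ValueError("all x values are identical; slope is undefined")
        self.slope = sxy / sxx
        self.intercept = mean_y - self.slope * mean_x
        return self  # returning self allows fit() and predict() to be chained

    def predict(self, x):
        if self.slope is None:
            raise RuntimeError("call fit() before predict()")
        return self.intercept + self.slope * x


# Made-up points that lie exactly on y = 1 + 2x.
model = SimpleLinearRegression().fit([1, 2, 3, 4], [3, 5, 7, 9])
print(model.slope, model.intercept, model.predict(5))  # 2.0 1.0 11.0
```

The fitted state lives on the object and the behavior is exposed through fit() and predict(), which is the pattern many Python data science packages encourage.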

R has a lot more data science capabilities built into it by default than Python, which is not a surprise given its origins in statistics and research. For data science, Python relies on packages a lot more, even for what might be considered basic data science tasks. That is because it was not built to be a pure data science application, and this can be a good thing in many situations. If your requirements demand that you do more than just data analysis, Python can take you much further than R can. This does not make one better than the other. It just makes each better at what it was designed for, and whichever sounds like what you want to do determines which is the better fit for you. R has greater statistical support in general – whether you are talking about libraries within the program or experts within the community. Having said that, it is important to note that Python has been closing that gap recently.
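
As a rough illustration of where Python’s built-in support ends, the sketch below uses only the standard library’s statistics module, which covers descriptive statistics. The sample values are made up. For data frames, model fitting or plotting, Python users typically install third-party packages such as pandas, scikit-learn or statsmodels, whereas a default R session already provides functions such as mean(), median(), sd() and lm().

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample values

summary = {
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    # pstdev is the population standard deviation (divides by n, not n - 1)
    "population_sd": statistics.pstdev(scores),
}
print(summary)
```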

However, we know that data science is about a lot more than just statistics and machine learning. Obtaining data and automating tasks are just as important. Statistics and machine learning may be the engine of data science, but an engine on its own is often not enough. You need a system that can leverage the engine’s output and feed it what it needs. Here, Python has a clear advantage over R. Web scraping, complex workflows and programs are all better suited for development in Python than in R. If you want to automate your data science programs, once again Python is the better tool to wield. This means that if your analysis needs to be repeatable, and not just ad hoc, as is the case with a lot of real-time analyses and periodic reports, R may not give you the dexterity that Python will. Therefore, Python should perhaps be your first consideration if you want to deploy your code in a production environment. However, if you are looking for presentations and communications built on the ad hoc data analysis you perform, R is way ahead of Python at this point in time. So if, instead of productionizing your code, your requirement is to communicate your analysis more presentably and effectively, R may be the way forward for you.
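
What a repeatable analysis can look like is sketched below: the logic is wrapped in functions so that a scheduler or another program can rerun it on fresh data. The function names, the ‘region’ and ‘amount’ columns and the sample rows are all hypothetical, and only the standard library is used.

```python
import csv
import io
from collections import defaultdict


def summarize_sales(csv_text):
    """Return total amount per region from CSV text with 'region' and 'amount' columns."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["region"]] += float(row["amount"])
    return dict(totals)


def format_report(totals):
    """Render totals as plain-text lines, sorted by region name for stable output."""
    return "\n".join(
        f"{region}: {amount:.2f}" for region, amount in sorted(totals.items())
    )


# Hypothetical input; in a scheduled job this would be read from a file or database.
sample = "region,amount\nnorth,10.5\nsouth,4\nnorth,2.5\n"
report = format_report(summarize_sales(sample))
print(report)
```

Because the summary and the formatting are separate functions, the same code can feed a periodic report, a test suite or a larger pipeline without changes.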

So which one should you choose?
Both languages are solid for almost all data science challenges that may be thrown at you. Ultimately it is your requirements that will decide which one is a better fit for you. If all you want to do is dabble in data science and get a feel for it to decide whether it is something you might want to pursue more seriously, Python might be a better starting point, since the learning curve is gentler and it is a more generally useful tool than R, which is not very useful outside of statistics and data science. If, on the other hand, you are very serious about learning statistics and applying statistical models to data, and that is your primary requirement, it is very likely that R is the right tool for you. If you are also going to be doing a lot of ‘software engineering’, it may once again be Python that is best suited for you.

One important thing you may want to consider is which of these your colleagues are using. If you are new to data science, and you think you will depend a lot on the people around you as you strengthen your grip on data science, it can help to speak the same language as they do. Also, if your colleagues prefer one of these over the other, it may be an indication that it is better suited for the work that is done in your organization. If, like a lot of people today, you are seriously considering a career in data science, it seems almost inevitable that you will end up familiarizing yourself with both of these brilliant languages. You can always learn both and choose between them for a given task depending on the requirement – it doesn’t always have to be one or the other.

However, if you have no preferences, you do not know what you will do with your newfound interest in data science, and you would just like to know which is generally a better value for your time, you can perhaps look at the job market to tell you which of the two is in greater demand – which one might help land you the job you want. Indeed job trends suggest that although until 2014 there were more data science jobs available for people skilled in R, that has since shifted in favor of Python, with job postings mentioning Python almost 30% more frequently than R. Even on the data science platform Kaggle, Python has emerged as a clear winner in terms of the number of kernels written. Google Trends reflects this trend as well. The Institute of Electrical and Electronics Engineers (IEEE) published a ranking of the top ten programming languages of 2018. It placed Python at the very top of the list, followed by C++ and Java, with R all the way down in 7th place. This means that even outside of the realm of data science, Python is a fantastic tool and a very useful skill to have. All of this suggests that although R has its moments, as a general-purpose data science tool, at this point in time, Python is generally preferred over R.

In the future, the question of which one is better equipped to solve your problem may matter much less. In April 2018, with the announcement of Ursa Labs, a new era in data science technology was ushered in. The main goals of Ursa Labs are to make it easier for data scientists working in different programming languages to collaborate and to avoid redundant work by developers across languages. Ursa Labs will try to make sharing data and code with someone using another data science language easier, by creating new standards that work in all of them. Developers call this an improvement to “interoperability.” This is an initiative that is still in its infancy. All we can do for now is watch and see how this unfolds. It is certainly an exciting time to be a data scientist.
