Introduction
For many years, business data promised much and delivered few significant breakthroughs. In the last few years, however, it has come to the fore and shown that, when wielded by competent leaders, it can change the way businesses run and how business decisions are made. It has transformed whole industries (such as the advertising industry) and driven technology adoption as businesses try to keep up with the breakneck speed at which technology changes the way we do business. With large data centers and cloud-based solutions becoming ever more prevalent, it has become easier for businesses to digitize their operations. The rate at which data is generated and captured today is astounding, and even though storage technology keeps growing as fast as it always has, it has not kept up with that rate. To some extent, this has helped drive the adoption of cloud platforms. If you have been involved with the SAP technology environment over the last decade or so, the shift to SAP HANA and what drove it will be evident. The interesting thing about SAP HANA is that it is not your traditional transactional database: it has been designed and developed to be best suited for real-time analytics and real-time applications. Over the last few years, SAP’s marketing tent pole has shifted from NetWeaver to HANA, so much so that the former is barely mentioned in SAP’s marketing content anymore.
With the power of data analytics becoming ever more evident, it is no longer enough to record vast amounts of data and search through them efficiently if you cannot also analyze the data to derive insights that drive decision making. Advanced analytics processing within SAP HANA allows users to analyze and process their data within the platform itself, so you do not need to leave the SAP environment to do so. Combined with the in-memory search facility of SAP HANA, this puts real-time analytics at your fingertips. Dashboards, text analytics, machine learning and predictive modeling are all available within SAP HANA to help you solve a whole host of problems. However, not all businesses have the same problems to solve. There are some data analysis problems that most businesses share, and many of them can be tackled on SAP HANA itself. But what about the not-so-common problems? It would be inefficient to ship every such solution by default with SAP HANA, because that would make the platform clunky and heavy, which are highly undesirable traits in any application. This is where the facility to embed R code into the SAP HANA database context comes in handy. Being able to use such a powerful and widely used analytics programming language lets you customize your analysis of data to almost any level you want.
How does the integration work?
The first question one might ask is how the integration between SAP HANA and the R programming language works. For starters, the data has to be converted to a format that R can process and then transferred to R. This is achieved with a data exchange mechanism that transfers intermediate tables generated by SAP HANA directly into the vector-oriented data structures that R understands. The advantage of this approach is that it does not require an additional copy on the R side, as a tuple-based approach would. An important thing to note is that R is an open source programming language. It is independent of SAP and its products, and therefore it is not included in your installation of SAP HANA. It must be downloaded separately from the R open source community if it is to be integrated with SAP HANA. Rserve, a TCP/IP server, must also be installed from the open source community to act as the interface between SAP HANA and the R environment. SAP does not provide any kind of support for R, and it is up to the user of the SAP HANA installation to decide whether to integrate and configure SAP HANA with R.
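The difference between the two transfer styles can be illustrated with a minimal Python sketch. This is not SAP code: the table contents and all function names are hypothetical, and Python lists stand in for column vectors. It assumes the database already holds each column as one vector, so a vector-oriented handoff needs no reshaping, while a tuple-based handoff forces the receiving side to pivot rows into newly allocated columns.

```python
from typing import Dict, List, Tuple

Table = Dict[str, list]  # column name -> column vector

# Hypothetical intermediate table, held column-wise as a column store would.
column_store: Table = {
    "ID": [1, 2, 3],
    "REVENUE": [10.5, 20.0, 7.25],
}


def transfer_as_vectors(table: Table) -> Table:
    """Vector-oriented transfer: each column vector is handed over unchanged."""
    return {name: vector for name, vector in table.items()}


def transfer_as_tuples(table: Table) -> Tuple[List[str], List[tuple]]:
    """Tuple-based transfer: the table is shipped row by row."""
    names = list(table)
    rows = list(zip(*(table[name] for name in names)))
    return names, rows


def tuples_to_columns(names: List[str], rows: List[tuple]) -> Table:
    """Receiving side of a tuple-based transfer: rows must be pivoted into
    freshly allocated column vectors, which is the additional copy."""
    columns: Table = {name: [] for name in names}
    for row in rows:
        for name, value in zip(names, row):
            columns[name].append(value)
    return columns


frame = transfer_as_vectors(column_store)
names, rows = transfer_as_tuples(column_store)
rebuilt = tuples_to_columns(names, rows)

print(frame["REVENUE"] is column_store["REVENUE"])    # same vector, no pivot
print(rebuilt["REVENUE"] is column_store["REVENUE"])  # new vector after pivot
```

In a real deployment the data still crosses a network connection to Rserve, so object identity here only stands in for "no reshaping needed on the R side".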
The flow of control within the integration
Let us now take a look at how the code is executed and how the architecture of the integration functions. The R code is first wrapped in an RLANG procedure, which is then embedded into SAP HANA SQL code. An external R environment is used to execute the R code. From the database’s point of view, the R code is handled much like native operations such as joins and aggregations. Ultimately, the entire code is sent to the SAP HANA database as a query, which makes it an elegant way to run R code in an external environment. Although using R code in SAP HANA is quite simple, some complexity had to be built into the SAP HANA architecture to make it possible. SAP HANA makes use of data flow graphs that describe database execution plans. Any given node in the execution plan can be a native database operation or a custom operation. The R operator is one such custom operation: it consumes several input objects and returns a result table. Three systems are involved in the architecture: the R environment, the SAP HANA database and the SAP HANA application.
Let us take a look at how control switches between the different components of the architecture and how these components interact with each other. Imagine a token that represents the flow of control through the architecture. The token does not actually exist on the SAP HANA database; it is simply a representation of the flow that happens when the query containing the RLANG procedure is executed in SAP HANA. While the query is being processed, the token travels through the calculation model execution plan and at some point arrives at an R operator. The R operator contains an R client that serves as an interface between the calculation engine and Rserve. The token is directed through the R client and into Rserve, which carries it to the R host that is capable of running code written in the R programming language. When the token arrives at the R host, a dedicated R process is created. At this point, the input tables and the R code are pulled from the R client and processed to generate an output. Once the R process has completed, the token travels back to the R client in the calculation engine, which pulls the output R data frames into the calculation engine, where they are quickly and efficiently converted into a data structure that is compatible with SAP HANA. We therefore see that the overall control flow resides only on the SAP HANA database. The execution of R code is treated as an external process, which means that multiple R processes can be executed in parallel. Handling parallel processes within the R runtime is not a concern, since each new trigger of Rserve creates a dedicated R process.
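The journey of the token can be traced with a small Python simulation. This is only a sketch of the control flow described above, not how SAP HANA or Rserve is implemented: every class and function name is hypothetical, a Python function stands in for the embedded R code, and a counter stands in for the creation of a dedicated R process per request.

```python
import itertools
from typing import Callable, Dict, List

Frame = Dict[str, list]  # column name -> column vector, standing in for a data frame

trace: List[str] = []            # records where the "token" is at each step
_process_ids = itertools.count(1)


class RserveStub:
    """Stands in for Rserve: every request gets its own dedicated 'R process'."""

    def run(self, code: Callable[[Frame], Frame], inputs: Frame) -> Frame:
        pid = next(_process_ids)
        trace.append(f"R host: start dedicated process {pid}")
        output = code(inputs)  # the 'R code' runs outside the calculation engine
        trace.append(f"R host: process {pid} finished")
        return output


class ROperator:
    """Custom plan node whose embedded R client talks to Rserve."""

    def __init__(self, code: Callable[[Frame], Frame], rserve: RserveStub) -> None:
        self.code = code
        self.rserve = rserve

    def execute(self, inputs: Frame) -> Frame:
        trace.append("R client: hand R code and input tables to Rserve")
        frame = self.rserve.run(self.code, inputs)
        trace.append("R client: pull output data frame into calculation engine")
        return frame


def native_filter(table: Frame) -> Frame:
    """A native database operation: keep rows with a positive AMOUNT."""
    trace.append("calculation engine: native filter")
    keep = [i for i, value in enumerate(table["AMOUNT"]) if value > 0]
    return {name: [column[i] for i in keep] for name, column in table.items()}


def double_amount(frame: Frame) -> Frame:
    """Python stand-in for the embedded R code."""
    return {"AMOUNT": [value * 2 for value in frame["AMOUNT"]]}


def run_plan(table: Frame, r_operator: ROperator) -> Frame:
    """The calculation engine keeps control and visits each plan node in order."""
    filtered = native_filter(table)
    return r_operator.execute(filtered)


operator = ROperator(double_amount, RserveStub())
result = run_plan({"AMOUNT": [5, -1, 3]}, operator)
for step in trace:
    print(step)
```

Because each call to `RserveStub.run` draws a fresh process id, running the plan a second time shows up in the trace as a second, independent process, which mirrors why parallel requests do not need to be coordinated inside the R runtime.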
Designing and running RLANG procedures
We have already seen that R code is embedded into SQL queries to have it executed in SAP HANA. Let us now take a look at how that is implemented. The first thing you need to do is check whether you have the system privilege ‘CREATE R SCRIPT’ on SAP HANA. This is necessary if you wish to create or call R procedures. Next, you have to create tables in SAP HANA that share the same schema as the input tables. Depending on the requirements of the R code, you may have to add or remove columns from the tables. All of this is done using standard SQL commands such as CREATE, LIKE and DROP. Once the data tables and schemas are ready, the next step is to create a procedure. The inputs and outputs have to be stated explicitly when the SQL procedure is created. The syntax is as follows.
CREATE PROCEDURE <procedure name> (IN <input data frame 1> "<input table 1>", IN <input data frame 2> "<input table 2>", OUT <output data frame 1> "<output table 1>")
LANGUAGE RLANG AS
BEGIN
<insert R code here>
END;
CALL <procedure name>("<input table 1>", "<input table 2>", "<output table 1>") WITH OVERVIEW;
You must explicitly state which tables from the SAP HANA database are used as inputs and which tables hold the output values. The SAP HANA tables and the corresponding data frame names in the R process are paired directly in the signature of the SQL procedure. Once you know that, and if you have a background in SQL and R, all of this is very straightforward. “RLANG” tells the database that we are dealing with an R procedure. While the input tables are automatically converted into R data frames, it is the responsibility of the coder to ensure that the output objects are of type ‘data frame’ at the end of the procedure. You can also use the CALL command to trigger predefined RLANG procedures from a SQLScript procedure.
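To see what the template above looks like once the placeholders are filled in, here is a minimal Python sketch that renders a concrete statement pair as text. Everything specific in it is an assumption made for illustration: the helper functions, the procedure name MY_SQUARES, the tables PRIME and PRIME_SQR (each assumed to have a single NUMBER column) and the two-line R body. It only builds strings; it does not connect to SAP HANA.

```python
from typing import List, Tuple


def quote(identifier: str) -> str:
    """Wrap a table name in straight double quotes, escaping embedded quotes."""
    return '"' + identifier.replace('"', '""') + '"'


def render_rlang_procedure(
    name: str,
    inputs: List[Tuple[str, str]],   # (R data frame name, SAP HANA table name)
    outputs: List[Tuple[str, str]],  # (R data frame name, SAP HANA table name)
    r_code: str,
) -> str:
    """Fill in the CREATE PROCEDURE ... LANGUAGE RLANG template."""
    params = [f"IN {frame} {quote(table)}" for frame, table in inputs]
    params += [f"OUT {frame} {quote(table)}" for frame, table in outputs]
    return (
        f"CREATE PROCEDURE {name} ({', '.join(params)})\n"
        "LANGUAGE RLANG AS\n"
        "BEGIN\n"
        f"{r_code.strip()}\n"
        "END;"
    )


def render_call(name: str, tables: List[str]) -> str:
    """Fill in the CALL ... WITH OVERVIEW template."""
    arguments = ", ".join(quote(table) for table in tables)
    return f"CALL {name}({arguments}) WITH OVERVIEW;"


# Hypothetical R body: square every value and return the result as a data
# frame, since the output object must be a data frame when the body ends.
R_BODY = """
result <- as.data.frame(input1$NUMBER^2)
names(result) <- c("NUMBER")
"""

create_sql = render_rlang_procedure(
    "MY_SQUARES",
    inputs=[("input1", "PRIME")],
    outputs=[("result", "PRIME_SQR")],
    r_code=R_BODY,
)
call_sql = render_call("MY_SQUARES", ["PRIME", "PRIME_SQR"])

print(create_sql)
print(call_sql)
```

Note how the data frame names in the signature (`input1`, `result`) are the same names the R body reads from and assigns to; that pairing is what ties the SAP HANA tables to the R data frames.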
Conclusion
As we have seen, implementing R code as an RLANG procedure embedded into SQL queries is an elegant way to create and execute statistical modeling and machine learning algorithms in the SAP HANA database. For anyone with an understanding of SQL and R programming, the learning curve in integrating SAP HANA and R is minimal. Thanks to the similarities between the data structures of SAP HANA databases and R data frames, the conversion between them is highly efficient. The seamless integration of SAP HANA and R opens up a wide range of options for the data scientist working with SAP HANA. When the power of SAP HANA’s in-memory computing engine is combined with the wide range of tried and tested analytical capabilities of the R programming ecosystem, the possibilities of real-time business analytics are almost limitless.