For the most recent 24 Hours of PASS, the topic was Evolution of the Data Platform and I was honored to present A Gentle Introduction to Data Analytics in SQL Server 2016. (The recording is available here.)
The gist of my presentation was to describe analytics in general and how it compares and contrasts with business intelligence. I also covered some new capabilities that SQL Server 2016 brings to the table, including the ability to use R on data stored in SQL Server, and introduced Analytics in related technologies such as Power BI, data mining, and Azure Machine Learning. Here’s an awesome Sketchnote created and tweeted out by Matt Penny during the presentation which sums up the content very well.
I had been through the R demonstration several times on my computer before the day of the 24 Hours of PASS broadcast, having first worked up the R Script (which you can download here) for a preconference workshop on data analytics at SQL Saturday Nashville in January 2016. The night before the live presentation for 24 Hours of PASS, I confirmed that all was working well. And yet, lo and behold – demo fail during the presentation! The R script that I was trying to run in RStudio didn’t work!
Each time I tried to execute a RevoScaleR function (part of the R distribution available in SQL Server 2016), I got errors. For example, I tried to show documentation using help(RevoScaleR) and got the following message:
No documentation for ‘RevoScaleR’ in specified packages and libraries.
Or when I tried to show how to create a SQL Server data source object using the RxSqlServerData function, I got this message:
Error: could not find function “RxSqlServerData”
What the…? There was no time to sort it out on the spot, so I just carried on. Afterwards, however, I figured out what had happened. I had in fact changed ONE thing in my environment between the last run through of the demonstration and the presentation. Something I had no idea would have such a major effect.
You see, in RStudio, there is a setting for the R Version that you want to use. You access this on the Tools menu by selecting Global Options which I had originally set to C:\Program Files\Microsoft SQL Server\130\R_SERVER like this:
This had changed auto-magically when I decided that I wanted to demonstrate that you could use R inside Power BI Desktop. In that application, you open the File menu, select Options and Settings, select Options, and then R Scripting to set your R home directory as shown here.
Note the link “How to install R” which I had clicked and then used the link on that page to download Revolution Open (aka Microsoft R Open available for download here).
When I performed R Open installation, unbeknownst to me, the Global Options in RStudio changed to the path for R Open and consequently all the functions that required RevoScaleR packages weren’t available to RStudio. Had I thought about this possibility during the demonstration that failed, I could have fixed it quickly and carried on, but it would have been a wild guess and could have led down rabbit holes that I didn’t need to waste time on, so I made do by reviewing the results which I had thankfully already captured on slides.
As a follow-up to the presentation, here are questions posed by viewers followed by my responses:
Q: Is there a cost associated for front end user licenses to be able to access the power BI tool in Excel, or is it a free download?
A: You can download Power BI Desktop for free from http://powerbi.microsoft.com. To use comparable functionality inside Excel, you must have a license for Excel (standalone, Professional Plus, or Office 365 ProPlus).
Q: Is Data Analytics and sample db available for downloading on [the] Microsoft site?
A: I’m not sure specifically which samples you’re requesting with regard to “data analytics”, but.. here’s what I used in my demonstrations:
- AdventureWorksDW2014 – which I renamed as AdventureWorksDW to simplify my demos and make it easier (sort of) to re-use my scripts as versions change. You can download different versions at http://msftdbprodsamples.codeplex.com. The specific version doesn’t really matter because they all contain the vTargetMail view which I used throughout many of the demonstrations.
- I also used RStudio which is available for download from https://www.rstudio.com/products/rstudio/download2/ as a free tool. You can use any integrated development for R instead – such as R Tools for Visual Studio.
- I created an R Script that you can download here.
- For the data mining in Analysis Services, I had created a project based on the vTargetMail view and built out the project manually, but you can find a similar example in the sample database “Adventure Works Multidimensional Model SQL 2014 Full Database Backups.zip” which you can download here from Codeplex.
- I output the data from the vTargetMail view as a CSV file and then used Azure Machine Learning Studio online to explore the data visualization capabilities there as part of a logistic regression demo (not shown) to compare the results of machine learning to Analysis Services data mining among others.
Q: How should be a DBA prepared for R support in SQL2016? What all should they know?
- Reduce the memory for SQL Server – you need to leave more memory on the server for R than you usually would because R is a memory hog. You’ll need to experiment with this to find the right balance.
- Troubleshooting performance is challenging. Do analysis in R Studio (or IDE of choice) before moving code to SQL Server.
- Be aware that a lot of Windows users get created on the server in order for R to run – don’t worry about it.
Q: How is data analytics on SAS different from R support for SQL2016?
A: I’ve not used SAS (although I did go to their campus to teach MDX to their engineers once upon a time… beautiful campus!), so I can’t say with any specificity. SAS also has multiple products so it would be difficult to describe differences. In general – and please understand there may be something new in SAS that I don’t know about – I think the difference is that R in SQL 2016 allows you to run functions on data inside the database, thereby leveraging database resources and also server resources for parallelization. I’d love to get input from readers to expand on this topic.
Q: Recommend any good books/blogs/courses to get into data analytics? I’m a DBA wanting to see if data science is for me. Thanks!
A: There are so many options these days that it’s difficult to narrow down, but a good blog that I would start with is Buck Woody’s Backyard Data Science https://buckwoody.wordpress.com. I particularly like the book Charles Wheelan’s Naked Statistics for some great explanations of statistical concepts. For courses, you can find a lot of free content at Coursera and EdX. Many of these are meaty courses – not fluffy, but you can poke around a bit to see what they cover and see if anything sparks your interest. Because the courses are free, you don’t have to stay committed if you find that it isnt’ for you. And Microsoft recently announced the Microsoft Professional Degree in Data Science. You can sign up here to be notified when the program opens.
Q: Data mining has only been available in SSAS multidimensional cubes, and not Tabular models – not sure if I missed it, but will data mining become available in Tabular model?
A: I doubt that it would ever be, but I have no insight into the future on this point. There’s no benefit in particular to bundle data mining with Tabular when it’s already available on the same installation media. The end result is that you wind up with two instances of SSAS on a server (if you were to install both on the same server), but I can’t see that as a huge problem. If it were, then a distributed deployment would be the next option.