I’m a relatively new hockey fan. And I love data. Come join me on a series of posts as I see explore where a combination of these two interests leads.

Even if you’re not a hockey fan, along this journey I’ll introduce technologies, concepts, and techniques for data analytics that you can apply to the types of data that are more applicable to you.

In this post, I’ll use Power BI to start getting familiar with available data and answer a simple question. You can download a copy of the Power BI file that I created to see for yourself how I put together the data and the report.

Why Hockey?

My husband Dean is originally from Detroit and has long loved the Detroit Red Wings. When we started dating, he would encourage me to sit and watch games on TV which seemed almost constant because hockey has an incredibly long season compared to other sports. I watched, but I have to confess I didn’t have a lot of interest at the time. (Sshhh, don’t tell him I said that!)

One of the ways that I’m inclined to interpret the world is through data. And sports is rife with data! Ever hear of Moneyball? These days it’s common practice for a sports team to have an analytics team. I’ve not worked on a project with any of these teams (yet), although I did have a data analyst from the NFL in one of my classes years ago.

A Hockey Fan is Born

Fast forward a few years and I am a convert! My home town of Las Vegas got its first professional sports team and amazingly the first sport is hockey! Yes, hockey in the desert, go figure… The 2017-2018 season is the inaugural year for the Vegas Golden Knights, and oh what a year it has been! Far better and far more exciting than most people here ever imagined.

Dean and I bought season tickets hoping to enjoy a few good games (and see Dean’s beloved Red Wings whenever they come to town), but I must say most games have been more than good. And live NHL hockey is much more exciting in person than it is on TV. Now that I’m more familiar with the sport, I concede that watching on TV is now more appealing to me as well. My new goal (cough) is one day to meet not only the players, but also the analysts for the Vegas Golden Knights!

Where’s the Hockey Data?

During the games, I often think up questions that could be answered if only I had the data. So I went on a mission to find it and perform my own analyses. I’ve stumbled on a few databases that I need to find time to explore, but these database limit me to looking up information whereas I want to explore the raw data. The best way I have found so far is to use the free NHL statistics API: https://statsapi.web.nhl.com/api/v1. It’s undocumented, but I did find a YouTube video that provides an overview. Thanks to Kevin Sidwar (b|t), who produced that video, I now know that I can use access data by using the following API endpoints:

And that’s not all I can do… but it’s more than enough for today! I just have a simple question to answer, but at least I know I have a rich treasure trove to explore in future posts!

Gathering the Team Data

My first step is to use a Web data source using the URL https://statsapi.web.nhl.com/api/v1/teams to see what I have to work with. Then I’ll go through the following steps which I only highlight here (but I’ve included a video below each section if you’re new to Power BI):

  • Because the result is JSON, the query returns a list that I can then use the Convert Into Table transformation that appears on the ribbon.

  • And I need to remove the copyright from the query (but I copy and save it somewhere because I want to give attribution later in my report). I use the Remove Top Rows transform with a value of 1 to get rid of it.
  • My JSON values are still a list in my query results, so I expand the list and then expand the records.

In Power BI, a list is a structure that has a single column and multiple rows. If I were to create a list manually, it would look like this: {“item 1”, “item 2”, “item3”, 4}. Looks like an array to me. The elements in the list do not need to have the same data type, by the way. I’ll use lists in future posts to build out query parameters for dynamic reporting.

A record is a structure that has a single row and multiple columns. Just like a row in a relational table. When I click on a record in one of the rows, the user interface shows me the columns as rows below the table like this:

  • When I expand the records in the query (by clicking the Expand button to the right of the column name), I get columns in the table like this:

  • I still have some more expansions to do to get data for each team’s venue, division, conference, and franchise details. Part of me wants to normalize this data a bit and break these out into separate tables. For now, I’ll leave the data as is, other than renaming the columns for clarity.

In a few clicks, I was able to get all of this information into a single table. Nothing statistical here to analyze yet, but definitely the base information for each team that I can use to slice and dice later.

Gathering Game Scores and Team Records

One of my feelings at the game I attended this past week between the Knights and the Calgary Flames (on February 21, 2018) was that it was a pretty high scoring game as hockey goes. But I’m a newbie, so what do I know? I want to find the data to support (or refute) this hypothesis.

To keep things simple for now, I’m just going to analyze the scores for the current season and tackle other seasons later. The URL I need is https://statsapi.web.nhl.com/api/v1/schedule?season=20172018 which I use to create another query in Power BI.

Ooh, there’s so many things in here I want to explore, but I’ll save that for another day. I can always modify my query later. For now, I’m going to focus only on scores for the home and away teams for each game of the season.

Cleansing and Transforming the Data

Right now, the data is not ideal for analysis. Keeping in mind how I want to use the data, I need to perform some cleansing and transformation tasks. Any time I work with a new data source, I look to see if I need to do any of the following:

  • Remove unneeded rows or columns. Power BI stores all my data in memory when I have the PBIX file open. For optimal performance when it comes time to calculate something in a report and to minimize the overhead required for my reports, I need to get rid of anything I don’t need.
  • Expand lists or records. Whether I need to perform this step depends on my data source. I’ve noticed it more commonly in JSON data sources whenever there are multiple levels of nesting.
  • Rename columns. I prefer column names to be as short, sweet, and user friendly as possible. Short and sweet because the length of the name affects the width of the column in a report, and it drives me crazy when the name is ten miles long, but the value is an inch long—relatively speaking. User friendly is important because a report is pretty much useless if no one understands what a column value represents without consulting a data dictionary.
  • Rearrange columns. This step is mostly for me to look at things logically in the query editor. When the model is built, the fields in the model are listed alphabetically.
  • Set data types. The model uses data types to determine how to display data or how to use the data in calculations. Therefore, it’s important to get the data types set correctly in the Query Editor.

  • Add custom columns. This step is optional and depends on whether I want to create a new column in the query itself or in my model only. Given a choice, I lean towards building the custom column in the query. That way, if I need to use the information in multiple ways later, I have the logic defined once.

In this case, I’m adding two new columns:

Game Name, to concatenate the away team and home team names in addition to the date so that I have a user-friendly unique identifier for each game to use in my reports.

Total Goals, to calculate the sum of the away team score and the home team score. This is the key to answering my first question, what constitutes a high-scoring game?

  • Apply filters. This step is similar to the first one in which I remove the unneeded rows or columns. However, in that first case, I was identifying specific rows (as in the top n) or columns (by name). Now I need to consider conditions that determine whether to keep a row.

In my current data set, I can see that I have games included that haven’t happened yet. For my current purposes, I want to exclude those rows that have a date earlier than today, which of course is a moving target each day that I open the report. Power BI doesn’t allow me to set a filter based on today using the filter interface. Instead, I set the filter using an arbitrary date, and then open the Advanced Editor for the query and change the filter that uses a hard-coded date to a formula like this:

#”Filtered Rows” = Table.SelectRows(#”Changed Type2″, each [Game Date] < Date.From(DateTime.LocalNow()))

Restructuring the Data

I’ve spent twenty years building dimensional models, so that’s the way I tend to think about structuring data. I won’t spend time in this post explaining that approach. Just trust me on this one for now. Developing a data model depends on your end goal, and I’ll spend some time in one or more future posts on this topic.

Right now in my current query, I have away and home team data in separate columns which means I can’t easily build a report to see all scores – whether home or away – for a single team. I’d really rather have each row in my games table represent a single team. But I also don’t want to have rows that duplicate data for a single game. Gah!

To solve this problem, I’m going to duplicate the games query and then make adjustments to one version so that it contains only game data. The second version will contain the team data.

For the games query, I remove the columns that relate to teams, records, and scores (although I’ll keep the Total Goals column because that’s game-level detail, not team-level detail). Now I have this result:

And for the scores query – I remove the game-related columns, except for the Game PK column which I can use to tie to the two queries together later in a relationship.

But first, I’m going to take another step to duplicate the scores query – because what I really want to have is a single column each for teams, wins, losses, score, etc. I filter one version of the scores query to keep only the away team data and the other version to keep only the home team data. And I add a custom column to each query to designate it as home or away, and then combine the two queries to make a single final query that is really the one I’m going to relate to the game query later.

 

I clear the Enable Load on the scores away and scores home queries so they don’t get added to my model. I only need them as intermediate steps to build the final scores query.

After I click Close & Apply, Power BI automatically detects the relationships:

 

Answering the High Score Question

Now that my data is structured appropriately for my analysis, I can set up a report to view the data in different ways.

The final result contains the following visualizations:

  • Average of total goals. I set this up to calculate the average of total goals for all games to use as a baseline against which I can measure whether a game is relatively high-scoring or low-scoring. I also set this up to ignore any cross-filtering that might occur when I click on other visualizations in the report, because I want to see this value calculated for all games across the league for the season.
  • Table of games and total goals. Here I have a comprehensive list of games that I have sorted in descending order of total goals. When I click on a specific game in this list, the other visualizations on the page are filtered to show me the details for that game (except for the average of total goals).
  • Matrix of teams and scores by side, Away vs. Home. This matrix sums up the goals for each team based on whether the game was an away game or home game. When a game is selected in the first table, this matrix displays each team’s score for the selected game. When all games are listed, this isn’t a particularly interesting table to me, but it made me think about comparing the statistics for the season by side, which led me to create the next table.
  • Matrix of teams and score percentages by side, Away vs. Home. This matrix is a copy of the scores matrix, but I changed the values to display as a percentage of row total. I thought it would be interesting to see overall whether a team did better or worse at away versus home games. When a game is selected in the table, the values are 100% so it’s not interesting at that point, but it is interesting when you look at all teams even though I need to scroll through it. I don’t have a lot of space on this page. Yes, I can adjust the page size, but my primary interest at the moment is looking at high-scoring games – this was just an ancillary thought that I might pursue later in a new page of the report.

 Thinking out loud—Wouldn’t it be nice to hide or show an entire visualization based on some condition?

  • Column chart of teams and scores by side, Away vs. Home. Here I can see how the teams compare. I can write a whole post about this visualization in particular, but not today. Stay tuned. When I select a game, I can see the scores for each time represented visually in conjunction with the matrix above it.

My report provides a lot of information about game and team scores overall. In thinking about the game that sparked my question, which had a total goal of 10, I can see that there have been a LOT of games with higher scoring. On the other hand, I can also see the game was also well above average.

When I select the Flames @ Golden Knights game in the table, I can see the game score by team in the matrix and in the column chart.

Summary

Important takeaways in this post include the use of multiple features in Power BI:

  • Using JSON as a data source
  • Getting familiar with lists and records
  • General steps for cleansing and transforming data in the Query Editor
  • Restructuring data with filters, custom columns, and the use of the append query feature
  • Excluding intermediate queries from the Power BI model
  • Creating visualizations: card, table, matrix, and column chart
  • Displaying an average value rather than a sum
  • Displaying percentage of row totals
  • Cross-filtering
  • Adjusting interactions

Whew! That’s a lot! This post covered a lot of ground, and there’s so much more I want to cover! You’ll just have to come back again to see where I go with this.

If you have any questions or suggestions about what you’d like to see in the future, please add a comment below!

Click here to continue to Part 2 of this series.

You can see how I set up the report in the videos below.