A little this… a little that. The processes involved in a data analytics project require some alchemy, “a seemingly magical process of transformation, creation, or combination.”
As I work on my data project this week, I find more to explore and investigate and wrangle which is all necessary before I can get into more advanced techniques. Yet, while this little project of mine focuses on hockey data, the steps I need to perform and decisions I need to make are the same ones that would be necessary in a business-oriented project.
If you’re new to data analytics, you can learn the tricks of the trade by following along with me through this series to “see” a project in action. You can rewind to the beginning of the series by starting here and then following the pointers at the end of each page to the next post in the series.
What now?
For now, Power BI continues to my tool of choice for my project. My goals for today’s post are two-fold: 1) finish my work to address missing venues in the games table and 2) to investigate the remaining anomalies in the games and scores tables as I noted in my last post.
To recap, I noted the following data values that warranted further investigation :
- Total Goals minimum of 0 seems odd – because hockey games do not end in ties. I would expect a minimum of 0 so I need to determine why this number is appearing.
- Total Goals maximum of 29 seems high – it implies that either one team really smoked the opposing team or that both teams scored highly. I’d like to see what those games look like and validate the accuracy.
- Record Losses minimum of 0 seems odd also – that means at least one team has never had a losing season?
- Similarly, Record Wins minimum of 0 means one team has never won?
- Record OT minimum of 0 – I’m not sure how to interpret. I need to look.
- Score minimum of 0 seems to imply the same thing as Total Goals minimum of 0, which I have already noted seems odd.
One last “fix” for missing venues
In my last post, I created the Alt Venue column in the games table to display the the home team’s venue, but the Game Venue column still shows 127 games with a missing venue. I decide to clean up my data model a bit to “hide” these columns and create a new column called Arena because that’s the normal name by which we hockey fans call the place where we go to watch hockey.
Tip: This is an important part of reviewing your work with users in a business setting – make sure you’re using terms that people use, not the terms that are in the data structure.
To provide a value for this calculated column, I use the following DAX formula with the IF function to use the value in the Game Venue column, but use the Alt Venue value if the Game Venue is empty (or blank in DAX parlance):
Arena = if(isblank([Game Venue]), [Alt Venue], [Game Venue])
Then I can hide the Game Venue and Alt Venue columns and leave only the Arena column visible in the Fields list. To do this, I click the ellipsis button to the right of these columns and select Hide. I have the option of viewing the hidden fields or not by respectively selecting or clearing the View Hidden command on the context menu.
Tip: This capability is helpful when you have “supporting” columns that are used in calculations, but not necessary for reporting. It just shortens up the list of available fields to make it easier to find what you want as you’re creating a visual.
I adjust the table on the Games Categorical Variables page of my report by replacing Game Venue with Arena and confirm there are no blank values remaining.
More detective work
To analyze the anomalies mentioned above, I need to work through each table separately.
Min of Total Goals is 0
My first task is to figure out why Min of Total Goals is 0. This is weird to me because my understanding is that hockey games keep going until a team scores. Therefore, the Total Goals value can never be 0.
I start by setting up a page in which I can explore the details behind the summaries:
- I duplicate the Games Numerical Variables page (just right-click the page tab and select Duplicate Page) and rename it as Minimum Total Goals.
- Next, I remove the bottom table containing the Goals Spreads values.
- I then removed all the columns except Min of Total Goals.
- Then I add Game Name to get a detailed list of games.
- However, this list contains all games. I only want to focus on those with Min of Total Goals = 0 so I add a filter.
A lot of games still appear in the table. Hmm…
I know I have a table that shows the scores for each team on the Total Goals page. I can click on that table, click Copy in the ribbon, go back to my Games Analysis page, and click Paste on the ribbon.
Now when I select a game in the table containing Total Goals, I can see the scores by home and away teams in the newly pasted table. And indeed the two teams scored 0. Double hmm…
As a sanity check (a common practice I need to do for myself frequently, I confess!), I want to look at the NHL stats, but I need to put together the URL. However, I don’t want to have to remember how to put together the URL all the time, so I’ll create a calculated column in the Power BI model to construct it for me:
Game URL = “https://statsapi.web.nhl.com/api/v1/game/” & [Game ID] & “/feed/live”
Note: I could also have done this – the end result is the same:
Game URL = https://statsapi.web.nhl.com & [Game Link]
And then check this out! I can have Power BI convert this into a clickable URL for me by changing the Data Category to Web URL.
But wait – there’s more. I don’t really care what the URL is. I just want to click on it. So I can apply formatting to the table by setting the URL Icon toggle to On like this:
That way, I can save some screen real estate!
When I click the link, I get to the game page which lists a lot of data, but I can search for the terms “score” and “goal”. As far as I can see, there was in fact a tied score of 0. Hmm.. I can’t seem to find any official data online from NHL or other sports sites on this particular game because it’s preseason.
Note: I decided to add a Season URL to my model also, using the same approach as described above.
Season URL = “https://statsapi.web.nhl.com/api/v1/schedule?season=” & [Game Season]
Continuing on my sanity check quest, I pick a regular season game (one that’s begins later than September) such as the Atlanta Thrashers vs New York Rangers on 16 October 2003 and lo… it’s true. There CAN be a 0-0 score! Hmm.
But as I investigate further, I discover that this situation was true only until 2005 when the shootout was introduced.
“The new shootout rule guarantees a winner each game; ties have been eliminated.”
Therefore, the rule that there cannot be a 0-0 game is true now. Because this was my first season, I had no idea that the rule had changed at a previous point in time. I now know this rule does not apply to seasons prior to 2005-2006.
Tip: This is a perfect example of characteristics in the data that lead to questions which lead to discovery of new rules of which you may not be aware.
Now I can set up a table to view my data differently to determine if anomalies exist based on this new information:
- I need to add add a season filter, but now I discover I don’t have data types set correctly for Game Season, Season Start, and Season End.
- I decide to change the data types from text to whole number in the Query Editor because I might decide to reuse the query at a later date and want to have all my transformations in one spot rather than divided between the query and the model.
- In the model, I set the Default Summarization to Do Not Summarize for these three fields so that Power BI doesn’t try to do aggregations (sum, min, max, count, etc) on these values every time I include them in a visual.
- Now I add a filter to view seasons from 2005-2006 and later.
After adding a filter, I still see a bunch of games with Min of Total Goals = 0. Can we have a triple hmmm?
They look like preseason games for the 2006-2007 season. I add the Game Type column to the table to confirm, and yes – they are all preseason. Here’s a partial list:
So far, I haven’t turned up any reason via Internet search for this situation. Maybe they didn’t bother with shootouts in preseason games that year? I’ll file this away as a notation on my data and maybe someday I’ll learn the truth. [Update: As my friend and hockey buff extraordinaire Aaron Bertrand (b | t) pointed out in the comment section below, this is a prime case of dirty data.]
This analysis also explains the Min of Score = 0 also. So two anomalies “solved” for the investigative price of one.
Moving on to Max of Total Goals
I use a similar technique to explore the next scenario:
- I duplicate the Minimum Total Goals page and rename it as Maximum Total Goals.
- Then I adjust the table containing Total Goals and change the aggregation setting on the Min of Total Goals value so that it displays the maximum of Total Goals:
- I then clear the filters on the table.
- Next, I sort the Max of Total Goals in descending order:
The teams did indeed appear to collectively score 29 points with a score of 17-12. It’s a Game Type of A, which I haven’t decoded yet. I do know it’s not preseason (PR) or regular season (R). A little research uncovers that this is an All-Stars game. Aha!
Looks like the majority of games are All-Star games – and that means the Total Goals value seems logical. A game of stars is likely to have more scoring action than the normal mix of players. The exception seems to be Edmonton Oilers @ Chicago Blackhawks on 11 December 1985 which is a regular season game that apparently made history for having the most goals in a game. And still holds that distinction!
Ok – data validated on that front. Next…
Record Losses, Record Wins, and Record OTs
I duplicate the Scores Numerical Variables page and follow the procedure I describe above for the minimum total goals. When I see the detailed data, I see how to interpret this information and how I need to look at the data:
- At the time that a particular game was played, at least one of the teams had not yet lost to other teams in prior games.
- This scenario would not apply to an All-Star game, so I filter those out.
- Looking again – I see by the dates that Game Type P must be Playoffs, so I filter those out too.
- Same idea for Record Wins – at least one team had not won any games prior to the listed game.
- And for Record OTs – at least one team had no overtime scores prior to that game.
Here’s the result of my work so far – check out pages 14 to 17:
Recap
I have now completed my first pass through the data which required the calculation of summary statistics, a little data cleaning, and a lot of investigation to understand the “true” rules rather than rely on my rather limited knowledge of hockey. My investigation also allowed me to get a little closer to the data and understand the meaning of the statistics that I have so far.
Some key thoughts about the process described in this post that relate to any type of data analysis:
- Rename terms to match users’ “mental map” of the data. Don’t keep terms just because they come from the data.
- Don’t assume business rules apply equally to all the data all of the time. Rules have a way of changing. Not everyone remembers that they did or why. Document it when you find it!
- Dig into the detailed data behind the summarized values that look odd by confirming the details in your model match the source data. Do some research, ask questions. Be prepared where possible to explain why an unusual value is valid. What then? Clean it, delete it, ignore it, etc. It depends. How you use this information is situation-specific.
Also this post demonstrates several techniques you can use in Power BI:
- Clean up the fields visible in the model – hide them!
- Display a value conditionally using the IF function
- Concatenate static and dynamic values
- Convert a string of text to a Web URL
- Display a Web URL as an icon instead of text
In my next post, I plan to take another pass at the numeric statistics by analyzing the distribution of values.
5 Comments
[…] Stacia Varga has a post covering some of the yeoman’s work of data cleansing: […]
There’s no way all of those 2006 preseason games ended 0-0, even if they didn’t bother with a shootout (but they did). In fact, the September 24th game between Anaheim and LA ended 4-3 after a shootout (if you’re not a wiki conspiracy theorist, see https://en.wikipedia.org/wiki/2006%E2%80%9307_Anaheim_Ducks_season). So I think the data source smells funny, at least for the preseason games that year.
Indeed the data source is at issue here. And this is a perfect illustration of why it’s necessary to snoop around in the data before using it for critical decision-making. Also, why it’s necessary to engage with subject matter experts to review data that looks or smells funny. I haven’t had time to research these games further, but am not surprised and appreciate you calling this out. I have a lead on who is maintaining the API, so best case scenario, I can get this fixed at the API level. Worst case, I need to cobble together the missing data from other sources. Thanks, Aaron!
It’s definitely fun to snoop around the data and question validity, and sometimes it does require some help from outside of your wheelhouse. If this were basketball I would have been useless. 🙂 I consider myself pretty educated about hockey, but I learned a few things as a result too, so thanks!
[…] file – I removed all the report pages in which I was exploring the anomalies discussed in my last post and the post prior to […]