Shiny App


Introduction:

Over the past decade, interest in women’s sports has skyrocketed with the growing influence of social media and the heightened popularity of global stars. However, even with high-profile campaigns for equity in sports [1], it has been difficult to cultivate an audience and build a market for women’s sports when they receive minimal media coverage compared to their male counterparts [2]. Although major events such as the WNBA Finals are strong pulls for sports enthusiasts, the lack of nationally televised games and the limited marketing budgets for showcasing players throughout the year have been barriers for sports fans, especially established NBA fans, to consistently engage with the WNBA even when they are interested. Furthermore, although statistics can fuel sports passion and storytelling, data and advanced statistics for the WNBA only recently became easily accessible to the public [3]. Therefore, we seek not only to provide more convenient and accessible information on WNBA players, but also to promote sustained fan engagement and interaction with the league.

Our project aims to make the following contributions, which will be displayed in a public-facing Shiny app:

  1. Develop archetypes of current WNBA players based on each player’s abilities and overall performance-based statistics
    • Perform the same exploration on NBA players using the same variables to discover similarities and differences in the types of players in the two leagues
  2. Draw comparisons between WNBA players and NBA players
    • Each WNBA player with sufficient minutes will be matched with 4 similar NBA players based solely on their tendencies and playstyle

Ultimately, we believe that labeling each WNBA player with an archetype and developing an NBA player comparison can boost year-round engagement, bring the WNBA into the spotlight, and keep it there for years to come.


Data:

To define player archetypes in the WNBA and compare WNBA players to NBA players, data must be gathered on a seasonal basis. Using Basketball Reference [4], player statistics were gathered dating back to 2018 (the first year WNBA play-by-play and shot location data became available). Each observation represents a player’s statistics for a single season. Relevant variables included:

  • Per-100-possession statistics such as:
    • Points, rebounds, assists
    • Turnovers, field goals made, field goals attempted
  • Advanced statistics designed to demonstrate player tendencies and efficiency such as:
    • PER (player efficiency rating), Usage Rate, Win Shares
  • Play-by-play statistics such as:
    • Type of turnover (bad pass, lost ball)
    • Foul information (fouls drawn, fouls committed)
  • Shooting tendency data such as:
    • Percentage of shots at the rim, percentage of shots from 3
    • Percentage of shots that were assisted

This data allows both playstyle and effectiveness to be evaluated and considered when developing player archetypes and subsequently creating player comparisons between WNBA players and NBA players.


Cleaning the WNBA all stats dataset:

  • Gathered data from the 2018 - 2021 seasons
  • Players with minutes per game < 10 and games played < 5 were dropped to remove players who played an insignificant amount of time
    • Roughly 25% of players were removed based on an ECDF graph
  • TOT rows (a player’s combined totals across multiple teams) were eliminated because Basketball Reference miscalculated them
    • If a player played for multiple teams over the course of a season, we kept their statistics from the team on which they played the most minutes
  • Re-labeled all player positions to either “G”, “F”, or “C” for guards, forwards, and centers respectively
  • In the end: 475 observations
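The cleaning steps above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the row layout, the field names (`team`, `games`, `minutes`), and the example players are all hypothetical, and the minutes filter follows a literal reading of the rule (a player is dropped only when both thresholds are missed).

```python
# Hypothetical rows: one per player-season-team, plus the combined "TOT"
# row that appears for players who changed teams mid-season.
rows = [
    {"player": "Player A", "season": 2021, "team": "TOT", "games": 30, "minutes": 600},
    {"player": "Player A", "season": 2021, "team": "AAA", "games": 10, "minutes": 150},
    {"player": "Player A", "season": 2021, "team": "BBB", "games": 20, "minutes": 450},
    {"player": "Player B", "season": 2021, "team": "CCC", "games": 4, "minutes": 20},
    {"player": "Player C", "season": 2021, "team": "CCC", "games": 32, "minutes": 960},
]


def clean(rows, min_mpg=10, min_games=5):
    """Drop TOT rows, keep each player's highest-minutes team, filter low minutes."""
    # 1. Drop the combined TOT rows.
    team_rows = [r for r in rows if r["team"] != "TOT"]

    # 2. For each player-season, keep the team stint with the most minutes.
    best = {}
    for r in team_rows:
        key = (r["player"], r["season"])
        if key not in best or r["minutes"] > best[key]["minutes"]:
            best[key] = r

    # 3. Drop players who miss both thresholds (assumed literal reading of "and").
    kept = []
    for r in best.values():
        minutes_per_game = r["minutes"] / r["games"]
        if minutes_per_game < min_mpg and r["games"] < min_games:
            continue
        kept.append(r)
    return kept


cleaned = clean(rows)
print([(r["player"], r["team"]) for r in cleaned])
```

In this toy input, Player A keeps only the stint with team BBB, Player B is dropped for low playing time, and Player C is kept unchanged.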

Cleaning the NBA all stats dataset:

  • Gathered data from the 2018 - 2021 seasons
  • Players with fewer than 100 total minutes played were dropped to remove players who played an insignificant amount of time
    • Roughly 10-15% of players were removed based on minutes played
  • Removed 3 outliers who did not have a TOT column
  • Re-labeled all player positions to either “PG”, “SG”, “SF”, “PF”, or “C” for point guards, shooting guards, small forwards, power forwards, and centers respectively
  • In the end:
    • 2330 observations when 2022 season is included
    • 1834 observations when only 2018-2021 seasons are included

The following visualizations and remaining analyses are based on these subsets of players.


Exploratory Data Analysis (EDA)

Before modeling, the distributions of variables within the WNBA dataset were examined to better understand the relationships that are present:

Position Distribution:

Distribution of Minutes Per Game across WNBA Players:

Shot distance:

Field Goals:

Assist & Block Percentage:


Methods:

1. Developing Archetypes:

To choose which subset of variables was important in determining the archetypes, principal component analysis (PCA) was used to reduce the dimensionality of the feature space.

PCA:

  • Filtering WNBA all stats columns before PCA:
    • Only used data from the 2021 season
    • Filtered out non-numeric variables, percentages, and highly correlated variables (e.g., FG %, 2 point FG attempts, corner 3 attempts)
  • PCA results:
    • Top 10 dimensions explain 90% of the variability
    • Took the top 3 drivers of each of the top 10 PCs and removed repeated variables
    • Checked correlation matrix and filtered out highly correlated variables
  • Using PCA results, 19 variables were chosen to use in clustering:
    • Examples: average shot distance, total rebounds per 100 possessions, field goals made per 100 possessions
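The correlation-filtering step in the list above can be illustrated with a short sketch. It does not perform PCA itself; it only shows one way to drop highly correlated variables. The 0.9 threshold, the greedy keep-first ordering, the column names, and the values are assumptions made for illustration, not details taken from the project.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def drop_correlated(columns, threshold=0.9):
    """Greedily keep a variable only if it is not highly correlated
    (|r| >= threshold) with any variable already kept."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical per-player values for three candidate variables.
columns = {
    "fga_per100": [10.0, 14.0, 18.0, 22.0, 26.0],
    "fgm_per100": [4.5, 6.0, 8.5, 9.5, 12.0],
    "avg_shot_dist": [12.0, 5.0, 15.0, 4.0, 11.0],
}

selected = drop_correlated(columns)
print(selected)
```

Here field goals made track field goal attempts almost perfectly, so only one of the two survives, while average shot distance carries separate information and is kept.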

Clustering:

To allow for some uncertainty in the clustering results, a Gaussian Mixture Model (GMM) was used to yield soft assignments for clustering the players.
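To make "soft assignments" concrete, the sketch below computes the posterior probability of each mixture component for a single observation. It assumes a one-dimensional mixture with already-known parameters, which is a simplification: the model described here is multivariate and its parameters are estimated from the data. All parameter values are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def responsibilities(x, weights, means, sds):
    """Posterior probability of each mixture component given observation x."""
    weighted = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    total = sum(weighted)
    return [d / total for d in weighted]


# Hypothetical two-component mixture on a single variable.
weights = [0.5, 0.5]
means = [0.0, 4.0]
sds = [1.0, 1.0]

# A point halfway between the two means is split evenly between clusters,
# while a point sitting on the first mean belongs almost entirely to cluster 1.
probs = responsibilities(2.0, weights, means, sds)
clear_case = responsibilities(0.0, weights, means, sds)
print(probs, clear_case)
```

A hard-assignment method such as K-Means would place the halfway point in a single cluster; the mixture model instead reports that the assignment is ambiguous.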

  • Constructed archetypes by observing and comparing predominantly performance-based statistics from each cluster:
    • Position distribution
    • Shooting distance
    • Average points
    • Average free throw attempts, field goal attempts, field goal percentage
    • Average rebounds (total and offensive), assists, steals, blocks, turnovers, personal fouls
    • Average field goal attempts between 3-10 feet, 10-16 feet, and 16 feet to the 3-point line
    • Average 3-point attempts and percentage, true shooting percentage
    • Average defensive and offensive rating, win shares, and player efficiency rating

2. WNBA vs NBA Playstyle Comparisons

Before running a model to derive playstyle comparisons, variables related to player tendencies and playstyles were selected. These included:

  • Field goal attempts per 100 possessions
  • Free throw attempts per 100 possessions
  • 3 point attempts per 100 possessions
  • Rebounds per 100 possessions
  • Assist percentage
  • Steal percentage
  • Block percentage
  • Average shot distance

To develop a model that outputs an NBA comparison for a WNBA player’s playstyle, a Gaussian Mixture Model (GMM) was trained using the past 5 seasons of NBA data (2018-2022). In doing so, clusters of NBA players were created with corresponding probabilities of each player belonging to each cluster. WNBA player profiles consisting of the same variables were then fed into the model, similarly receiving probabilities of belonging to each cluster. To find the NBA players most similar to a WNBA player, the Euclidean distance between the WNBA player’s cluster probabilities and each NBA player’s cluster probabilities was calculated. The NBA players with the smallest distances were selected as the comparisons for the WNBA player of interest. A GMM was chosen over K-Means clustering to take advantage of soft assignments and the probabilities generated by a GMM.
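The matching step described above can be sketched as a nearest-neighbor search over cluster-probability vectors. The player names, the 3-cluster probabilities, and the function name `closest_players` are hypothetical; the actual model uses more clusters and real player profiles.

```python
import math


def closest_players(target_probs, candidates, k=4):
    """Return the names of the k candidates whose cluster-probability
    vectors are closest to target_probs in Euclidean distance."""
    ranked = sorted(
        candidates.items(),
        key=lambda item: math.dist(target_probs, item[1]),
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical cluster probabilities for four NBA players (3 clusters).
nba = {
    "NBA player 1": [0.90, 0.05, 0.05],
    "NBA player 2": [0.10, 0.80, 0.10],
    "NBA player 3": [0.70, 0.20, 0.10],
    "NBA player 4": [0.05, 0.05, 0.90],
}

# Hypothetical cluster probabilities for one WNBA player.
wnba_player = [0.80, 0.10, 0.10]

top2 = closest_players(wnba_player, nba, k=2)
print(top2)
```

Because the distance is computed on probabilities rather than on the raw statistics, two players who fall confidently into the same cluster are nearly indistinguishable under this measure, regardless of how different their underlying numbers are.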


Results:

1. Developing Archetypes:

After applying a Gaussian Mixture Model to the subset of variables informed by PCA, 5 clusters were returned for both the WNBA and the NBA. By combining basketball knowledge with a careful comparison of the cluster averages on all performance-based variables in our datasets, we assigned a simple archetype label to each cluster in each league. The resulting archetypes were nearly identical between the leagues: both the WNBA and NBA have reserves, traditional bigs, facilitators/shooters, and primary scorers/initiators as 4 of their 5 clusters. The only divergence was a 5th WNBA cluster labeled shooting threats compared to a 5th NBA cluster labeled roleplayers.

The full descriptions of the archetypes are displayed below:

WNBA

  1. RESERVES
    • Benchwarmers
      • High turnovers & personal fouls
      • Lowest field goal percentage, offensive rating, and offensive win shares
      • Examples*: Stephanie Watts, Kristine Anigwe
  2. TRADITIONAL BIGS
    • Rebounder & rim-protector
      • High total and offensive rebounds
      • High close distance shots/layups
      • High personal fouls
      • Examples: Brianna Turner, Monique Billings
  3. FACILITATORS/SHOOTERS
    • Ball handler
      • High assists & steals
      • Versatile shooter
      • Examples: Kelsey Plum, Jewell Loyd
  4. PRIMARY SCORERS/INITIATORS
    • Superstar & shot creator
      • Offensively skilled combo-forwards
      • Defensive versatility
      • Highest usage
      • Examples: A’ja Wilson, Breanna Stewart
  5. SHOOTING THREATS
    • Sharpshooter
      • High 3 point attempts and percentage
      • Low rebounds
      • Examples: Sue Bird, Kia Nurse

*Player examples are from the 2021 season


NBA

  1. RESERVES
    • Benchwarmers
      • High turnovers & personal fouls
      • Lowest offensive rating, offensive win shares and defensive win shares
      • Examples*: Luke Kornet, Frank Ntilikina
  2. TRADITIONAL BIGS
    • Rebounder & rim-protector
      • High total and offensive rebounds
      • High close distance shots/layups
      • High personal fouls
      • Examples: Rudy Gobert, DeAndre Jordan
  3. FACILITATORS/SHOOTERS
    • Ball handler
      • High assists, low rebounds
      • High 3 point attempts and percentage
      • Examples: Duncan Robinson, Buddy Hield
  4. PRIMARY SCORERS/INITIATORS
    • Superstars
      • Offensively skilled self-creators
      • Defensive versatility
      • Highest points, assists, usage, free throw attempts
      • Highest defensive and offensive win shares
      • Examples: Giannis Antetokounmpo, Stephen Curry
  5. ROLEPLAYERS
    • Versatile wings
      • Reliable shooters
      • Low usage
      • Examples: Robert Covington, Royce O’Neale

*Player examples are from the 2021 season


The following visualizations helped distinguish clusters and inform the archetype labeling:

*Points, 3 point attempts, rebounds, and assists are all per-100-possession player statistics

Points & 3 point attempts:

Rebounds & Assists:

Usage Percentage & Player Efficiency Rating (PER):

  • Usage % = an estimate of the percentage of team plays used by a player when they were on the floor

  • PER = a measure of per-minute production standardized such that the league average is 15

Offensive & Defensive Win Share (OWS/DWS):

  • OWS = an estimate of the number of wins contributed by a player due to offense

  • DWS = an estimate of the number of wins contributed by a player due to defense


Model uncertainty:

As with any model, the classification of observations into clusters involves uncertainty. In the Gaussian Mixture Model that we used, a player’s uncertainty is defined as \(1 - \max_i p_i\), where \(p_i\) is the probability that the player is assigned to cluster \(i\), for each of the 5 clusters. The plots below indicate the 3 players in each cluster who had the highest cluster assignment uncertainty in the 2021 season.
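A small sketch of the uncertainty definition above, using hypothetical players and probabilities. With 5 clusters the value ranges from 0 (the player belongs to one cluster with certainty) to 0.8 (all 5 clusters are equally likely).

```python
def uncertainty(probs):
    """Cluster assignment uncertainty: one minus the largest membership probability."""
    return 1.0 - max(probs)


# Hypothetical membership probabilities over 5 clusters.
players = {
    "Player A": [0.96, 0.01, 0.01, 0.01, 0.01],
    "Player B": [0.40, 0.30, 0.10, 0.10, 0.10],
    "Player C": [0.20, 0.20, 0.20, 0.20, 0.20],
}

# Rank players from most to least uncertain.
most_uncertain = sorted(players, key=lambda name: uncertainty(players[name]), reverse=True)
print(most_uncertain)
```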


2. WNBA vs NBA Player Comparison Based on Playstyle

The Gaussian Mixture Model built with NBA players using 8 playstyle variables produced 7 clusters. The Adjusted Rand Index between these clusters and the performance-based NBA clusters was 0.248. As part of the comparison process, the uncertainty for each player was also calculated. The 3 WNBA and NBA players with the highest uncertainty for each cluster are shown below (2021 season only).


Table of comparisons for 3 example players:

Shiny app examples for the 3 players in the table above:

Kelsey Plum

Allie Quigley

Breanna Stewart

Discussion:

  • The sample player comparisons listed in the table above do seem to pass the ‘eye test’. For example, Allie Quigley is known as a sharpshooter, very similar to her ‘closest’ NBA comparison, Payton Pritchard
  • The 10 archetypes defined using performance-based statistics suggest that the NBA and WNBA have nearly identical types of players
    • However, it is important to note that the nature/style of the games played in the 2 leagues is still very different
  • The WNBA and NBA clustering with performance-based variables produced 5 clusters each, while the NBA clustering with playstyle variables produced 7 clusters
    • This could be a result of variation among the observations in the overall data, but it could also suggest that multiple distinct playstyles contribute to a single performance-based archetype
  • While it is difficult to evaluate the results of unsupervised learning methods such as GMMs, the Adjusted Rand Index (ARI) can compare two classifications (such as the NBA performance-based and NBA playstyle-based clusterings). The ARI of 0.248 between these two classifications suggests that they agree more than would be expected by chance but are far from identical
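For readers unfamiliar with the Adjusted Rand Index, the sketch below computes it from two label vectors using the standard pair-counting formula. The label vectors are toy examples, not the project's cluster assignments. An ARI of 1 means the two partitions are identical up to relabeling, and values near 0 mean agreement is about what chance would produce.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two clusterings of the same observations.

    Undefined (division by zero) in the degenerate case where both
    clusterings put every observation in a single cluster.
    """
    n = len(labels_a)
    # Pairs of observations grouped together in both clusterings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs grouped together within each clustering separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_cells - expected) / (max_index - expected)


# Same partition with different label names: perfect agreement.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Partitions that never group the same pair together: worse than chance.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same, crossed)
```

Because the index only looks at which pairs of players are grouped together, it can compare clusterings with different numbers of clusters, such as 5 performance-based clusters against 7 playstyle clusters.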

So…is Breanna Stewart the LeBron James of the WNBA?

Based on our results, not exactly. LeBron James is not one of her top 4 comparisons shown in the Shiny app. However, he is her 5th comparison, so there are certainly similarities between the 2 players.

Limitations

  • Using a Gaussian Mixture Model assumes that each component follows a multivariate normal (Gaussian) distribution
  • The process of defining archetypes following clustering was not clear-cut: it relied mostly on a combination of cluster averages and knowledge of basketball
  • There were inconsistencies across seasons (comparing 2021 WNBA players to 2022 NBA players)
  • Calculating the Euclidean distance of cluster probabilities could lead to some comparisons that are considered erroneous
    • Euclidean distance is not designed for probability vectors
  • Trying to compare by playstyle across the WNBA and NBA is a flawed premise because the general style of play is different (e.g. lob threats in the NBA)

Next steps

  • Employ a different distance metric such as the Wasserstein metric, which is designed to work with probability distributions
  • Assess the work here by comparing it to other public WNBA/NBA archetypes or clustering results
  • Reverse the player comparison process by training the model on WNBA data first and subsequently determining comparisons for NBA players
  • Perform player clustering again once 2022 WNBA season is complete

Future work

  • Compute a metric such as regularized adjusted plus-minus (RAPM) to provide better evaluations of WNBA players
    • Utilize RAPM to create future projections of players
  • Allow for more user interactions within the app
    • Add ability for user inputs
  • Try utilizing a decision tree to create comparisons
  • Create a team building player-type evaluation tool

Acknowledgements:

We would like to first express our gratitude toward Carnegie Mellon’s Statistics & Data Science Department for providing us a great opportunity to complete a project on sports analytics. In particular, this work would not have been possible without the valuable guidance and support of Dr. Ron Yurko, the lead instructor and director of CMSACamp, as well as Maxsim Horowitz, senior data analyst for the Atlanta Hawks, for advising our project. We are also grateful to all of those with whom we have had the pleasure to work during this and other related projects, including our fellow students and teaching assistants.



  1. Carnegie Mellon University

  2. Harvard University

  3. St. Olaf College