Introduction:

Over the past decade, interest in women’s sports has skyrocketed with the growing influence of social media and heightened popularity of global stars. However, even with the appearance of high-profile campaigns for equity in sports¹, it has nonetheless been difficult to cultivate an audience and build a market for women’s sports when they receive minimal media coverage compared to their male counterparts². Although major tournaments such as the WNBA Finals are strong pulls for sports enthusiasts, the lack of nationally televised games and limited marketing budgets to showcase the players throughout the year has been a barrier for sports fans — especially established NBA fans — to consistently engage with the WNBA even when they are interested. Furthermore, although we have seen how statistics can fuel sports passion and storytelling, it was only recently that data and advanced statistics for the WNBA became easily accessible to the public³. Therefore, we seek to not only provide more convenient and accessible information on the WNBA players, but to also promote sustained fan engagement and interactions with the league as well.

Our project aims to make the following contributions, which will be displayed in a public facing Shiny App:

Develop archetypes of current WNBA players based on each player’s abilities and overall performance-based statistics

Perform the same exploration on NBA players using the same variables to discover similarities and differences in the type of players between the respective leagues

Draw comparisons between WNBA players and NBA players

Each WNBA player with sufficient minutes will be matched with 4 similar NBA players based SOLELY on their tendencies and playstyle

Ultimately, we believe that labeling each WNBA player with an archetype and developing an NBA player comparison can boost year-round engagement and bring the WNBA into the spotlight and keep them there for years to come.

Data:

To define player archetypes in the WNBA and compare WNBA players to NBA players, data must be gathered on a seasonal basis. Using Basketball Reference⁴, player statistics were gathered dating back to 2018 (the first year WNBA play-by-play and shot location became available). One observation represented a player’s statistics for a single season. Relevant variables included:

Per 100 possession statistics such as:
- Points, rebounds, assists
- Turnovers, field goals made, field goals attempted
Advanced statistics designed to demonstrate player tendencies and efficiency such as:
- PER (player efficiency rating), Usage Rate, Win Shares
Play by play statistics such as:
- Type of turnover (bad pass, lost ball)
- Foul information (fouls drawn, fouls committed)
Shooting tendency data such as:
- Percentage of shots at the rim, percentage of shots from 3
- Percentage of shots that were assisted

This data allows both playstyle and effectiveness to be evaluated and considered when developing player archetypes and subsequently creating player comparisons between WNBA players and NBA players.

Cleaning the WNBA all stats dataset:

Gathered data from the 2018 - 2021 seasons
Players with minutes per game < 10 and games played < 5 were dropped to remove players who’ve played an insignificant amount of time
- Roughly 25% of players were removed based on an ECDF graph
TOT rows were eliminated due to BBallRef miscalculations of TOT rows
- If a player played with multiple teams over the course of a season, we kept their statistics from teams on which they played more minutes
Re-labeled all player positions to either “G”, “F”, or “C” for guards, forwards, and centers respectively
In the end: 475 observations

Cleaning the NBA all stats dataset:

Gathered data from the 2018 - 2021 seasons
Players with less than 100 total minutes played were dropped to remove players who’ve played an insignificant amount of time
- Roughly 10-15% of players were removed based on minutes played
Removed 3 outliers who did not have a TOT column
Re-labeled all player positions to either “PG”, “SG”, “SF”, “PF”, or “C” for point guards, shooting guards, small forwards, power forwards, and centers respectively
In the end:
- 2330 observations when 2022 season is included
- 1834 observations when only 2018-2021 seasons are included

The following visualizations and remaining analyses are based on these subsets of players.

Exploratory Data Analysis (EDA)

Before modeling, the distributions of variables within the WNBA dataset were examined to better understand the relationships that are present:

Position Distribution:

Distribution of Minutes Per Game across WNBA Players:

Shot distance:

Field Goals:

Assist & Block Percentage:

Methods:

1. Developing Archetypes:

To choose which subset of variables were important in determining the archetypes, principal component analysis (PCA) was used to reduce the dimensionality of the feature space.

PCA:

Filtering WNBA all stats columns before PCA:
- Only used data from the 2021 season
- Filtered out non-numeric, percentages, and highly correlated variables (i.e., FG %, 2 point FG attempts, corner 3 attempts)
PCA results:
- Top 10 dimensions explain 90% of the variability
- Grabbed top 3 drivers in each of the top 10 PC’s and removed repeated variables
- Checked correlation matrix and filtered out highly correlated variables
Using PCA results, 19 variables were chosen to use in clustering:
- Examples: average shot distance, total rebounds per 100 possessions, field goals made per 100 possessions

Clustering:

To allow for some uncertainty in the clustering results, a Gaussian Mixture Model (GMM) was used to yield soft assignments for clustering the players.

Constructed archetypes by observing and comparing predominantly performance-based statistics from each cluster:
- Position distribution
- Shooting distance
- Average points
- Average free throw attempts, field goal attempts, field goal percentage
- Average rebounds (total and offensive), assists, steals, blocks, turnovers, personal fouls
- Average field goal attempts between 3-10 feet, 10-16 feet, 16 feet - 3’s
- Average 3-pointers attempts and percentage, true shooting percentage
- Average defensive and offensive rating, win share, and player efficiency rating

2. WNBA vs NBA Playstyle Comparisons

Before running a model to derive playstyle comparisons, variables related to player tendencies and playstyles were selected. These included:

Field goal attempts per 100 possessions
Free throw attempts per 100 possessions
3 point attempts per 100 possessions
Rebounds per 100 possessions
Assist percentage
Steal percentage
Block percentage
Average shot distance

To develop a model that outputs an NBA comparison for a WNBA player’s playstyle, a Gaussian Mixture Model (GMM) was trained using the past 5 seasons of NBA data (2018-2022). In doing so, clusters of NBA players were created with corresponding probabilities for each player belonging to each cluster. WNBA player profiles consisting of the same variables were then fed into the model, similarly receiving probabilities of belonging to each cluster. To derive the NBA player most similar to a WNBA player, the Euclidean distance between a WNBA player’s cluster probabilities and all NBA player’s cluster probabilities was calculated. The NBA players with the lowest corresponding distances of probabilities were selected as the comparisons for the WNBA player of interest. A GMM was chosen over K-Means clustering to take advantage of soft assignments and the probabilities generated by a GMM.

Results:

1. Developing Archetypes:

After applying a Gaussian Mixture Model to the subset of variables informed by using PCA, 5 clusters were returned for both the WNBA and NBA. Through the amalgamation of basketball knowledge and meticulously observing and comparing the cluster averages on all performance-based variables in our datasets, simple archetype labels were placed on the clusters in each league. After this process, our results indicated that the archetypes between the leagues were nearly identical — both the WNBA and NBA have reserves, traditional bigs, facilitators/shooters, and primary scores/initiators as 4 of their 5 clusters. The only divergence was a 5th WNBA cluster labeled shooting threats compared to a 5th NBA cluster labeled roleplayers.

The full descriptions of the archetypes are displayed below:

WNBA

RESERVES
- Benchwarmers
  - High turnovers & personal fouls
  - Lowest field goal percentage, offensive rating, and offensive win shares
  - Examples*: Stephanie Watts, Kristine Anigwe
TRADITIONAL BIGS
- Rebounder & rim-protector
  - High total and offensive rebounds
  - High close distance shots/layups
  - High personal fouls
  - Examples: Brianna Turner, Monique Billings
FACILITATORS/SHOOTERS
- Ball handler
  - High assists & steals
  - Versatile shooter
  - Examples: Kelsey Plum, Jewell Loyd
PRIMARY SCORERS/INITIATORS
- Superstar & shot creator
  - Offensive skilled combo-forwards
  - Defensive versatility
  - Highest usage
  - Examples: A’ja Wilson, Breanna Stewart
SHOOTING THREATS
- Sharpshooter
  - High 3 point attempts and percentage
  - Low rebounds
  - Examples: Sue Bird, Kia Nurse

*Player examples are from the 2021 season

NBA

RESERVES
- Benchwarmers
  - High turnovers & personal fouls
  - Lowest offensive rating, offensive win shares and defensive win shares
  - Examples*: Luke Kornet, Frank Ntilikina
TRADITIONAL BIGS
- Rebounder & rim-protector
  - High total and offensive rebounds
  - High close distance shots/layups
  - High personal fouls
  - Examples: Rudy Gobert, DeAndre Jordan
FACILITATORS/SHOOTERS
- Ball handler
  - High assists, low rebounds
  - High 3 point attempts and percentage
  - Examples: Duncan Robinson, Buddy Hield
PRIMARY SCORERS/INITIATORS
- Superstars
  - Offensive skilled self-creators
  - Defensive versatility
  - Highest points, assists, usage, free throw attempts
  - Highest defensive and offensive win share
  - Examples: Giannis Antetokounmpo, Stephen Curry
ROLEPLAYERS
- Versatile wings
  - Reliable shooters
  - Low usage
  - Examples: Robert Covington, Royce O’Neale

*Player examples are from the 2021 season

The following visualizations helped distinguish clusters and inform the archetype labeling:

*Points, 3 point attempts, rebounds, and assists are all player per 100 possession statistics

Points & 3 point attempts:

Rebounds & Assists:

Usage Percentage & Player Efficiency Rating (PER):

Usage % = an estimate of the percentage of team plays used by a player when they were on the floor
PER = a measure of per-minute production standardized such that the league average is 15

Offensive & Defensive Win Share (OWS/DWS):

OWS = an estimate of the number of wins contributed by a player due to offense
DWS = an estimate of the number of wins contributed by a player due to defense

Model uncertainty:

Like any other model, the classification of observations into clusters involves uncertainty. In the Gaussian Mixture Model that we used, uncertainty is defined as \(1 - max(p_i)\), where \(p_i\) are the corresponding probabilities for a player to be assigned to each of the 5 clusters. The plots below indicate the 3 players in each cluster who had the highest cluster assignment uncertainty in the 2021 season.

The link to our public-facing app is attached below:

Shiny App

Discussion:

The sample player comparisons listed in the table above do seem to pass the ‘eye test’. For example, Allie Quigley is known as a sharpshooter, very similar to her ‘closest’ NBA comparison Payton Pritchard
The 10 archetypes defined using performance-based statistics suggest that NBA and WNBA are nearly identical
- However, it is important to note that the nature/style of the games played in the 2 leagues are still very different
The WNBA and NBA clustering with performance-based variables produced 5 clusters each, while the NBA clustering with playstyle variables produced 7 variables
- This could be a result of variation among the observations in the overall data but could also suggest there could be multiple distinct playstyles that contribute to a single performance-based archetype
While it is difficult to measure results of unsupervised learning such as GMMs, the Adjusted Rand Index (ARI) can compare two classifications (such as the NBA performance based and NBA playstyle based clustering). The ARI of 0.248 between these two classifications suggests that they were not completely random partitions, but that they were far from identical clusters

So…is Breanna Stewart the Lebron James of the WNBA?

Based on our results, not exactly. Lebron James is not one of her top 4 comparisons shown on the Shiny App. However, he is her 5th comparison, so there are certainly similarities between the 2 players.

Limitations

Using a Gaussian Mixture Model makes the assumption that each component is Gaussian/multivariate normal distribution
The process of defining archetypes following clustering was not clear-cut: it relied mostly on a combination of cluster averages and knowledge of basketball
There were inconsistencies across seasons (comparing 2021 WNBA players to 2022 NBA players)
Calculating the Euclidean distance of cluster probabilities could lead to some comparisons that are considered erroneous
- Euclidean distance is not designed for probability vectors
Trying to compare by playstyle across the WNBA and NBA is a flawed premise because the general style of play is different (e.g. lob threats in the NBA)

Next steps

Employ a different distance metric such as the Wasserstein metric, which is designed to work with probability distributions
Assess the work here by comparing it to other public WNBA/NBA archetypes or clustering results
Reverse the player comparison process by training the model on WNBA data first and subsequently determining comparisons for NBA players
Perform player clustering again once 2022 WNBA season is complete

Future work

Produce a metric, regularized adjusted plus-minus (RAPM), to measure attributes and provide better evaluations of WNBA players
- Utilize RAPM to create future projections of players
Allow for more user interactions within the app
- Add ability for user inputs
Try utilizing a decision tree to create comparisons
Create a team building player-type evaluation tool

Acknowledgements:

We would like to first express our gratitude toward Carnegie Mellon’s Statistics & Data Science Department for providing us a great opportunity to complete a project on sports analytics. In particular, this work would not have been possible without the valuable guidance and support of Dr. Ron Yurko, the lead instructor and director of CMSACamp, as well as Maxsim Horowitz, senior data analyst for the Atlanta Hawks, for advising our project. We are also grateful to all of those with whom we have had the pleasure to work during this and other related projects, including our fellow students and teaching assistants.

References:

[1] https://www.teamheroine.com/blog/the-10-best-womens-sport-campaigns-of-2020

[2] https://www.si.com/sports-illustrated/2021/03/24/womens-sports-gender-study-discrepancy

[3] https://niemanreports.org/articles/covering-womens-sports/

[4] https://www.basketball-reference.com/wnba/years/2022_per_game.html

Carnegie Mellon University, amorai@cmu.edu ↩︎
Harvard University, mhombergbertley@college.harvard.edu ↩︎
St. Olaf College, noecke2@stolaf.edu ↩︎

Is Breanna Stewart the Lebron James of the WNBA? Developing WNBA and NBA Archetypes and Playstyle Comparisons

Amor Ai¹, Mykalyster Homberg², Andrew Noecker³

July 29, 2022

Introduction:

Data:

Exploratory Data Analysis (EDA)

Methods:

1. Developing Archetypes:

PCA:

Clustering:

2. WNBA vs NBA Playstyle Comparisons

Results:

1. Developing Archetypes:

WNBA

NBA

The link to our public-facing app is attached below:

Discussion:

So…is Breanna Stewart the Lebron James of the WNBA?

Limitations

Next steps

Future work

Acknowledgements:

References:

Is Breanna Stewart the Lebron James of the WNBA? Developing WNBA and NBA Archetypes and Playstyle Comparisons

Amor Ai1, Mykalyster Homberg2, Andrew Noecker3

July 29, 2022

Introduction:

Data:

Exploratory Data Analysis (EDA)

Methods:

1. Developing Archetypes:

PCA:

Clustering:

2. WNBA vs NBA Playstyle Comparisons

Results:

1. Developing Archetypes:

WNBA

NBA

2. WNBA vs NBA Player Comparison Based on Playstyle

Kelsey Plum

Allie Quigley

Breanna Stewart

The link to our public-facing app is attached below:

Discussion:

So…is Breanna Stewart the Lebron James of the WNBA?

Limitations

Next steps

Future work

Acknowledgements:

References:

Amor Ai¹, Mykalyster Homberg², Andrew Noecker³