Over the past decade, interest in women’s sports has skyrocketed with the growing influence of social media and the heightened popularity of global stars. However, even with high-profile campaigns for equity in sports [1], it has been difficult to cultivate an audience and build a market for women’s sports when they receive minimal media coverage compared to men’s sports [2]. Although major events such as the WNBA Finals are strong pulls for sports enthusiasts, the lack of nationally televised games and the limited marketing budgets for showcasing players throughout the year have been barriers for sports fans, especially established NBA fans, to engage consistently with the WNBA even when they are interested. Furthermore, although statistics can fuel sports passion and storytelling, data and advanced statistics for the WNBA only recently became easily accessible to the public [3]. We therefore seek not only to provide more convenient and accessible information on WNBA players, but also to promote sustained fan engagement and interaction with the league.
Our project aims to make the following contributions, which will be displayed in a public-facing Shiny App:
- Develop archetypes of current WNBA players based on each player’s abilities and overall performance-based statistics
- Perform the same exploration on NBA players using the same variables to discover similarities and differences in the types of players between the two leagues
- Draw comparisons between WNBA players and NBA players
- Each WNBA player with sufficient minutes will be matched with 4 similar NBA players based solely on their tendencies and playstyle
Ultimately, we believe that labeling each WNBA player with an archetype and developing an NBA player comparison can boost year-round engagement, bring the WNBA into the spotlight, and keep it there for years to come.
To define player archetypes in the WNBA and compare WNBA players to NBA players, data must be gathered on a seasonal basis. Using Basketball Reference [4], player statistics were gathered dating back to 2018 (the first year WNBA play-by-play and shot-location data became available). Each observation represents one player’s statistics for a single season. Relevant variables included:
This data allows both playstyle and effectiveness to be evaluated and considered when developing player archetypes and subsequently creating player comparisons between WNBA players and NBA players.
Cleaning the WNBA all stats dataset:
Cleaning the NBA all stats dataset:
The following visualizations and remaining analyses are based on these subsets of players.
Before modeling, the distributions of variables within the WNBA dataset were examined to better understand the relationships that are present:
Position Distribution:
Distribution of Minutes Per Game across WNBA Players:
Shot distance:
Field Goals:
Assist & Block Percentage:
To choose which subset of variables was important in determining the archetypes, principal component analysis (PCA) was used to reduce the dimensionality of the feature space.
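As a rough illustration of how PCA can inform variable selection, the sketch below standardizes a synthetic player-season table, fits a PCA, and lists the variables with the largest loadings on each retained component. The data, the column names, the 90% variance threshold, and the use of scikit-learn are all assumptions made for this example; none of them come from the original analysis.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a player-season table: 200 rows, 6 hypothetical stats.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))              # two hidden "skill" factors
loadings = rng.normal(size=(2, 6))
X = latent @ loadings + 0.1 * rng.normal(size=(200, 6))
feature_names = ["pts", "fg3a", "trb", "ast", "blk_pct", "usg_pct"]  # hypothetical

# Standardize so that no stat dominates purely because of its scale.
X_std = StandardScaler().fit_transform(X)

pca = PCA()
scores = pca.fit_transform(X_std)

# Cumulative share of variance explained by the leading components.
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.searchsorted(cum_var, 0.90) + 1)  # smallest k reaching 90% (assumed cutoff)

# The loadings show which original stats drive each retained component.
for k in range(n_keep):
    top = np.argsort(-np.abs(pca.components_[k]))[:3]
    print(f"PC{k + 1}:", [feature_names[j] for j in top])
```

Variables that load heavily on the retained components are natural candidates to carry forward into clustering.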
To allow for some uncertainty in the clustering results, a Gaussian Mixture Model (GMM) was used to yield soft assignments for clustering the players.
Before running a model to derive playstyle comparisons, variables related to player tendencies and playstyles were selected. These included:
To develop a model that outputs an NBA comparison for a WNBA player’s playstyle, a Gaussian Mixture Model (GMM) was trained on the past 5 seasons of NBA data (2018-2022). This created clusters of NBA players, with each player receiving a probability of belonging to each cluster. WNBA player profiles consisting of the same variables were then fed into the model and likewise received a probability of belonging to each cluster. To find the NBA players most similar to a WNBA player, the Euclidean distance between the WNBA player’s cluster probabilities and every NBA player’s cluster probabilities was calculated. The NBA players with the smallest distances were selected as the comparisons for the WNBA player of interest. A GMM was chosen over K-Means clustering because it yields soft assignments, and these per-cluster probabilities are what the distance calculation relies on.
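The matching procedure described above can be sketched as follows. This is a minimal illustration on synthetic data, not the original implementation: the arrays, the fixed number of mixture components, the choice to standardize with NBA-fitted scaling, and the helper `top_matches` are all assumptions introduced for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic stand-ins: 300 NBA and 40 WNBA player-seasons, 8 playstyle variables.
centers = rng.normal(scale=3.0, size=(7, 8))
nba_X = centers[rng.integers(0, 7, size=300)] + rng.normal(size=(300, 8))
wnba_X = centers[rng.integers(0, 7, size=40)] + rng.normal(size=(40, 8))

# Assumption: scale both leagues with statistics estimated from the NBA data.
scaler = StandardScaler().fit(nba_X)

# Fit the mixture on NBA players only (component count fixed here for simplicity).
gmm = GaussianMixture(n_components=7, random_state=0)
gmm.fit(scaler.transform(nba_X))

# Soft assignments: one row of cluster probabilities per player.
nba_probs = gmm.predict_proba(scaler.transform(nba_X))
wnba_probs = gmm.predict_proba(scaler.transform(wnba_X))


def top_matches(wnba_idx, k=4):
    """Return the k NBA row indices closest in cluster-probability space."""
    dists = np.linalg.norm(nba_probs - wnba_probs[wnba_idx], axis=1)
    order = np.argsort(dists)[:k]
    return order, dists[order]


idx, d = top_matches(0)
print("Closest NBA rows:", idx)
print("Distances:", d)
```

Because the distance is computed on probability vectors rather than raw statistics, two players in the same cluster with near-certain assignments will sit almost on top of each other, so ties among the closest matches are possible.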
After applying a Gaussian Mixture Model to the subset of variables informed by PCA, 5 clusters were returned for both the WNBA and the NBA. Drawing on basketball knowledge and a careful comparison of the cluster averages on all performance-based variables in our datasets, we placed simple archetype labels on the clusters in each league. Our results indicated that the archetypes in the two leagues were nearly identical: both the WNBA and the NBA have reserves, traditional bigs, facilitators/shooters, and primary scorers/initiators as 4 of their 5 clusters. The only divergence was a 5th WNBA cluster labeled shooting threats compared to a 5th NBA cluster labeled roleplayers.
The full descriptions of the archetypes are displayed below:
*Player examples are from the 2021 season
The following visualizations helped distinguish clusters and inform the archetype labeling:
*Points, 3-point attempts, rebounds, and assists are all per-100-possession player statistics
Points & 3 point attempts:
Rebounds & Assists:
Usage Percentage & Player Efficiency Rating (PER):
Usage % = an estimate of the percentage of team plays used by a player when they were on the floor
PER = a measure of per-minute production standardized such that the league average is 15
Offensive & Defensive Win Share (OWS/DWS):
OWS = an estimate of the number of wins contributed by a player due to offense
DWS = an estimate of the number of wins contributed by a player due to defense
Model uncertainty:
As with any model, the assignment of observations to clusters involves uncertainty. In the Gaussian Mixture Model that we used, a player’s uncertainty is defined as \(1 - \max_i p_i\), where \(p_i\) is the probability that the player belongs to cluster \(i\), for each of the 5 clusters. The plots below indicate the 3 players in each cluster who had the highest cluster assignment uncertainty in the 2021 season.
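To make the uncertainty definition concrete, the short sketch below computes one minus the largest cluster probability for a handful of players and picks the most uncertain members of each cluster. The probability matrix is invented for illustration, and `most_uncertain` is a hypothetical helper, not part of the original analysis.

```python
import numpy as np

# Hypothetical cluster probabilities: 6 players (rows) over 5 clusters (columns).
probs = np.array([
    [0.90, 0.05, 0.03, 0.01, 0.01],
    [0.40, 0.35, 0.15, 0.05, 0.05],
    [0.55, 0.30, 0.05, 0.05, 0.05],
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.10, 0.80, 0.05, 0.03, 0.02],
    [0.20, 0.50, 0.20, 0.05, 0.05],
])

# Uncertainty is one minus the largest cluster probability for each player.
uncertainty = 1.0 - probs.max(axis=1)

# Hard assignment: the cluster with the highest probability.
assigned = probs.argmax(axis=1)


def most_uncertain(cluster, n=3):
    """Row indices of the n most uncertain players assigned to `cluster`."""
    members = np.flatnonzero(assigned == cluster)
    ranked = members[np.argsort(-uncertainty[members])]
    return ranked[:n]


print(most_uncertain(0))  # rows 1, 2, 3 (uncertainties 0.60, 0.45, 0.30)
print(most_uncertain(1))  # rows 5, 4 (uncertainties 0.50, 0.20)
```

A player whose probability is split almost evenly between two clusters, like row 1 here, sits near an archetype boundary and is the kind of case the uncertainty plots highlight.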
The Gaussian Mixture Model built with NBA players using 8 playstyle variables produced 7 clusters. The Adjusted Rand Index between these playstyle clusters and the performance-based NBA clusters was 0.248, indicating limited overlap between the two groupings. As part of the comparison process, the uncertainty for each player was also calculated. The 3 WNBA and NBA players with the highest uncertainty for each cluster are shown below (2021 season only).
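For readers unfamiliar with the Adjusted Rand Index, the sketch below shows how it can be computed for two sets of cluster labels on the same players. The label vectors are made up for illustration, and the use of scikit-learn's `adjusted_rand_score` is an assumption about tooling rather than a description of the original workflow.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical cluster labels for the same 8 players under two clusterings.
performance_labels = [0, 0, 1, 1, 2, 2, 3, 3]
playstyle_labels = [1, 1, 0, 0, 2, 3, 3, 3]

# Values near 0 mean chance-level agreement; 1.0 means identical partitions.
ari = adjusted_rand_score(performance_labels, playstyle_labels)
print(f"ARI = {ari:.3f}")

# The index ignores label names: a pure relabeling scores a perfect 1.0.
relabeled = adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0])
print(f"ARI after relabeling = {relabeled:.1f}")
```

Because the index is corrected for chance and invariant to how clusters are numbered, it is suitable for comparing clusterings that use different numbers of clusters, as is the case for the 5 performance-based and 7 playstyle clusters.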
Table of comparisons for 3 example players:
Shiny app examples for the 3 players in the table above:
Based on our results, not exactly. LeBron James is not one of her top 4 comparisons shown on the Shiny App. However, he is her 5th comparison, so there are certainly similarities between the 2 players.
We would first like to express our gratitude to Carnegie Mellon’s Statistics & Data Science Department for providing us with a great opportunity to complete a project on sports analytics. In particular, this work would not have been possible without the valuable guidance and support of Dr. Ron Yurko, the lead instructor and director of CMSACamp, as well as Maxsim Horowitz, senior data analyst for the Atlanta Hawks, who advised our project. We are also grateful to all of those with whom we have had the pleasure to work during this and other related projects, including our fellow students and teaching assistants.
[1] https://www.teamheroine.com/blog/the-10-best-womens-sport-campaigns-of-2020
[2] https://www.si.com/sports-illustrated/2021/03/24/womens-sports-gender-study-discrepancy
[3] https://niemanreports.org/articles/covering-womens-sports/
[4] https://www.basketball-reference.com/wnba/years/2022_per_game.html
Carnegie Mellon University, amorai@cmu.edu
Harvard University, mhombergbertley@college.harvard.edu
St. Olaf College, noecke2@stolaf.edu