NBA Draft: Prediction of NBA Three-Point Shooting using College Shooting Data

Jan 10, 2022

In the NBA draft world, the debate about using college stats as NBA predictors is hotly contested and ever-present in prospect discourse. Nowhere is this debate more intense than when it comes to three-point shooting ability.

Some evaluators weight college three-point percentage heavily in their analysis: how is a player going to shoot well in the NBA if they couldn’t even shoot well in the easier NCAA? Others prefer to use the eye test, factoring in difficulty of shot attempts (are they off the dribble or catch-and-shoot?) as well as shot form (is there an awkward hitch; is the release slow or too low; can this guy get his shot off in the NBA?) in their evaluation of a prospect’s three-point competence.1 A third group of new-wave statheads sits patiently waiting for a lull in the conversation, then interjects to point out that actually, three-point volume in college matters more than percentage, and frankly, you should really be looking at NCAA free-throw percentage.

It goes without saying that the most thorough option is a holistic analysis, one that combines quantitative data with observations we still cannot adequately distill into numbers: a player’s role within the offense, their ability to get good looks, or how well their shooting form translates to the league.

But how close can we come to projecting a player’s NBA three-point performance using just easily-available college stats?

Here, we’re going to use a fairly rudimentary tool, linear regression, and three simple NCAA statistics: three-point attempts per game, three-point percentage, and free-throw percentage. These three variables have previously been shown to correlate with NBA three-point performance2, so I’ve chosen to use them here.

The analysis was performed using Ben Pierce’s NBA-NCAA Comparisons dataset, which included data from all players drafted between 1947 and 2018 and NBA data as of the 2018-19 season.3

To craft a dataset for linear regression, I filtered the data to exclude all players drafted before 1995, so as to select players from the “modern” era of three-point shooting, which is what we’re interested in projecting. The NBA’s decision to shorten the three-point line in 1994-95 serves as a rough proxy for when three-point shooting began its precipitous rise in importance as a basketball skill.

I also selected only players who played more than 20 NCAA games (sorry, Kyrie), attempted more than 1 NCAA three per game, made a three in the NBA, and spent at least 5 seasons in the NBA. This served two purposes: first, it selected a subset of players whose three-point shooting would actually matter to their NBA outlook4, and second, it created a sample of players with enough NCAA and NBA shooting volume to make meaningful predictions.5
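The filters above can be sketched in plain Python. This is only an illustration: the field names (`draft_year`, `ncaa_games`, and so on) are hypothetical stand-ins rather than the dataset's real column names, and the sample rows are made up.

```python
from dataclasses import dataclass


@dataclass
class Player:
    # Hypothetical field names; the real dataset's columns may differ.
    name: str
    draft_year: int
    ncaa_games: int
    ncaa_3pa_per_game: float
    nba_3pm: int
    nba_seasons: int


def keep(p: Player) -> bool:
    """Apply the filters described in the text."""
    return (
        p.draft_year >= 1995          # drop players drafted before 1995
        and p.ncaa_games > 20         # more than 20 NCAA games
        and p.ncaa_3pa_per_game > 1   # more than 1 NCAA three attempted per game
        and p.nba_3pm >= 1            # made at least one three in the NBA
        and p.nba_seasons >= 5        # at least 5 NBA seasons
    )


# Made-up rows, purely for illustration.
players = [
    Player("Shooter A", 2009, 104, 8.5, 2000, 10),
    Player("One-and-done B", 2011, 11, 3.0, 900, 8),   # too few NCAA games
    Player("Big C", 2005, 60, 0.1, 3, 14),             # too few NCAA attempts
    Player("Old-timer D", 1990, 120, 4.0, 500, 12),    # drafted before 1995
]

sample = [p for p in players if keep(p)]
print([p.name for p in sample])
```

Only the first made-up player survives all five filters, which is the kind of narrowing the text describes.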

To start, let’s create a correlation matrix to visualize the correlation between NBA 3P% and our NCAA variables of 3P%, 3PA/gm, and FT%. I’ve also created a fifth variable that’s basically an aggregate of 3P%, 3PA/gm, and FT%6 to see how it performs.

A correlation matrix displaying Pearson’s r for college 3P%, college 3PA per game, college FT%, and NBA 3P%.

Here’s a quick summary of meaningful takeaways from the correlation table:

Of our three original variables, NCAA 3P% correlates most strongly with NBA 3P%. NCAA FT% is a close second, followed by NCAA 3PA/gm in third place.
The aggregate variable we created, z_ncaa_sum, is a more effective predictor of NBA 3P% than any of our three individual variables (r = 0.42).
None of these values for Pearson’s r indicate a strong correlation. They vary from moderately weak (r = 0.30 for NCAA FT%) to moderately strong (r = 0.42 for z_ncaa_sum), but there’s no standout sole predictor.
At the same time, every one of these variables seems to have a legitimate positive correlation with NBA 3P%. As they increase, NBA 3P% increases generally, though the size of that increase varies.
It’s very unsurprising that one of the strongest correlations observed (r = 0.51) is between NCAA 3P% and NCAA FT%. Shooting, whether it’s from the arc or the line, is a transferable skill.
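For readers who want to reproduce this kind of table, here is a rough sketch of the two steps involved: building an aggregate like z_ncaa_sum by summing z-scores (as the footnote describes) and computing Pearson's r for every pair of variables. The data is synthetic, the variable names other than z_ncaa_sum are hypothetical, and the use of NumPy (and of the population standard deviation in the z-scores) is an assumption, so the numbers it prints will not match the ones above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-in data; NOT the real dataset.
skill = rng.normal(size=n)  # hypothetical latent "shooting skill"
ncaa_3p = 0.35 + 0.03 * skill + rng.normal(scale=0.03, size=n)
ncaa_ft = 0.74 + 0.05 * skill + rng.normal(scale=0.05, size=n)
ncaa_3pa = 4.0 + 1.0 * skill + rng.normal(scale=1.5, size=n)
nba_3p = 0.34 + 0.02 * skill + rng.normal(scale=0.04, size=n)


def zscore(x):
    """Standardize to mean 0 and (population) standard deviation 1."""
    return (x - x.mean()) / x.std()


# Aggregate variable: the sum of the three z-scores.
z_ncaa_sum = zscore(ncaa_3p) + zscore(ncaa_3pa) + zscore(ncaa_ft)

names = ["ncaa_3p", "ncaa_3pa", "ncaa_ft", "z_ncaa_sum", "nba_3p"]
data = np.vstack([ncaa_3p, ncaa_3pa, ncaa_ft, z_ncaa_sum, nba_3p])
corr = np.corrcoef(data)  # 5x5 matrix of Pearson's r (each row is a variable)

# Last column holds each variable's correlation with NBA 3P%.
for name, r in zip(names, corr[:, -1]):
    print(f"{name:>10}  r with nba_3p = {r:+.2f}")
```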

Now, let’s construct a linear regression model using just NCAA 3P%, FT%, and 3PA/gm to project NBA 3P%, and then test how it performs with our dataset.
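A minimal sketch of what such a fit can look like, assuming ordinary least squares via NumPy (the tooling actually used for this post isn't stated). The data and the "true" coefficients below are made up, so the output only shows the mechanics, not the real results.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Synthetic predictors; NOT the real dataset.
ncaa_3pa = rng.uniform(1.0, 9.0, size=n)
ncaa_3p = rng.normal(0.36, 0.04, size=n)
ncaa_ft = rng.normal(0.75, 0.07, size=n)

# Synthetic response built from made-up coefficients plus noise.
true_coefs = np.array([0.15, 0.005, 0.20, 0.10])  # intercept, 3PA/gm, 3P%, FT%
X = np.column_stack([np.ones(n), ncaa_3pa, ncaa_3p, ncaa_ft])
nba_3p = X @ true_coefs + rng.normal(scale=0.02, size=n)

# Ordinary least squares: choose beta to minimize ||X @ beta - y||^2.
beta, *_ = np.linalg.lstsq(X, nba_3p, rcond=None)

predicted = X @ beta
residuals = nba_3p - predicted  # actual minus predicted
ss_res = np.sum(residuals**2)
ss_tot = np.sum((nba_3p - nba_3p.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print("coefficients:", np.round(beta, 3))
print("R^2:", round(float(r_squared), 3))
```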

Ultimately, the linear regression model finds NCAA 3P% to be the most powerful predictor of NBA 3P% by a pretty significant margin. Interestingly, it’s more confident in NCAA 3PA/gm than NCAA FT% as a predictor in this model, even though the r-value for NCAA FT% was greater. This can be explained by the fact that FT% and 3P% tend to co-vary, so some of the variation in NBA 3P% that FT% accounts for is already accounted for by 3P%, while 3PA/gm is more of a “standalone” predictor.

Our equation looks something like this:

NBA 3P% = 0.168 + 0.004(NCAA 3PA/gm) + 0.209(NCAA 3P%) + 0.112(NCAA FT%)
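To get a feel for the equation, here it is as a small function with two worked examples. The coefficients come straight from the equation above; the assumption that percentages are entered as fractions (0.40 for 40%) is an inference from the size of the intercept, and both prospects are hypothetical.

```python
def predict_nba_3p(ncaa_3pa_per_game: float, ncaa_3p: float, ncaa_ft: float) -> float:
    """Evaluate the fitted equation from the text.

    Assumption: percentages are fractions (0.40 means 40%).
    """
    return 0.168 + 0.004 * ncaa_3pa_per_game + 0.209 * ncaa_3p + 0.112 * ncaa_ft


# Hypothetical strong prospect: 6 attempts per game, 40% from three, 85% at the line.
strong = predict_nba_3p(6.0, 0.40, 0.85)

# Hypothetical weak prospect: 1.5 attempts per game, 28% from three, 60% at the line.
weak = predict_nba_3p(1.5, 0.28, 0.60)

print(f"strong prospect: {strong:.3f}")
print(f"weak prospect:   {weak:.3f}")
```

The strong prospect works out to 0.168 + 0.024 + 0.0836 + 0.0952 = 0.3708, or about 37%. The weak one still lands near 30%, because the intercept alone contributes almost 17 points.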

OK, so how’d the model do?

Well, it depends on how we judge it. The biggest knock against the model is its r-squared value of 0.1788, which is comparatively low: the model accounts for only about 18% of the variation in NBA 3P% across players. However, look anywhere else and the returns seem positive.

The most telling information comes from the distribution of our residuals, calculated by subtracting the NBA 3P% that the model predicts for each player from that player’s actual NBA 3P%, so a negative residual means the model overestimated the player. Below is a density plot of the residuals, with lines drawn at the 10th, 50th, and 90th percentiles.

For the middle 80% of players in the dataset, the model was able to project NBA 3P% using just basic college box score data with a maximum error of about 4.5 percentage points in either direction. Shrink that to the middle 50%, and the error shrinks to just 2.1 percentage points in either direction, which is pretty on the nose for a very rudimentary model.
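The "middle 80%" and "middle 50%" figures are percentile ranges of the residuals. A quick sketch of that calculation, using synthetic numbers rather than the real data and taking residuals as actual minus predicted (so an overestimate shows up as a negative value):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic actual and predicted NBA 3P% values; NOT the real data.
predicted = rng.normal(0.35, 0.015, size=400)
actual = predicted + rng.normal(0.0, 0.035, size=400)

residuals = actual - predicted  # negative means the model overestimated

# The middle 80% sits between the 10th and 90th percentiles,
# the middle 50% between the 25th and 75th.
p10, p25, p50, p75, p90 = np.percentile(residuals, [10, 25, 50, 75, 90])
print(f"middle 80% of residuals: {p10:+.3f} to {p90:+.3f}")
print(f"middle 50% of residuals: {p25:+.3f} to {p75:+.3f}")

# Sanity check: roughly 80% of players fall inside the 10th-90th band.
share_within = np.mean((residuals >= p10) & (residuals <= p90))
print("share inside the 10th-90th band:", round(float(share_within), 3))
```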

The biggest issue the model seems to have is with projecting players on the lower end of the distribution. It systematically overestimates the shooting ability of players who score poorly on the projection metrics, which results in the density curve of residuals having a fat tail on the left side of the distribution.

This chart illustrates the model’s failure to project low-end shooters more clearly:

Plot displaying the standardized residual values for the linear regression model. The model breaks down below approximately the 16th percentile of prospects (z-score of -1 on the x-axis), overestimating NBA 3P performance for these players.

As illustrated by the plotted residuals, once the model gets to players whose projections fall about 1 standard deviation below the projection mean (the bottom 16% of projected NBA shooters), it starts overestimating their NBA three-point percentages.

These are career 24-26% three-point shooters that the model thinks will turn out to be career 30% guys. It seems like the linear regression model isn’t capable of adequately determining just how hard it will be for an already-poor college shooter to adjust to the more challenging NBA three-point shot.
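One way to check for this kind of low-end breakdown is to split players by the z-score of their projection and compare average standardized residuals on either side of z = -1. The sketch below does that on synthetic data with a deliberately built-in drop-off for low projections, so it illustrates the diagnostic rather than reproducing the result; the standardization is the simple residual-over-standard-deviation kind, which is an assumption about how the plot was made.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000

# Synthetic projections; NOT the real data.
predicted = rng.normal(0.35, 0.02, size=n)
z_pred = (predicted - predicted.mean()) / predicted.std()

# Made-up ground truth: below z = -1, actual shooting falls off faster
# than a straight line would predict.
penalty = np.where(z_pred < -1, 0.03 * (z_pred + 1), 0.0)
actual = predicted + penalty + rng.normal(0.0, 0.03, size=n)

residuals = actual - predicted           # negative means overestimated
std_resid = residuals / residuals.std()  # simple standardization, ignoring leverage

low = z_pred < -1
print("share of prospects below z = -1:", round(float(low.mean()), 3))
print("mean standardized residual, low group:", round(float(std_resid[low].mean()), 2))
print("mean standardized residual, the rest: ", round(float(std_resid[~low].mean()), 2))
```

A clearly negative average in the low group, next to an average near zero for everyone else, is the signature of the overestimation described above.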

Outside of this particular case, though, the model seems to hold up well in terms of accuracy. On the whole, it’s a fairly solid projector of NBA three-point shooting ability.

Here’s a plot of the model’s residuals compared to how many threes a player took in their NBA career:

Plot displaying residuals by number of NBA three-point attempts. For players with large numbers of three-point attempts, the data “stabilizes” slightly above the mean line. For players with a low volume of NBA three-point attempts, the data is more volatile and slightly skewed below the mean line.

This is an interesting plot to look at for a few reasons:

As you move left-to-right, the plotted points shift from being a varied cluster centered slightly below the mean line to a narrow set of points centered slightly above the mean line. This makes sense — as players take more threes, they trend more towards the “center” of the model’s projections, with the caveat that the guys who don’t take many threes are probably bad shooters while the guys who do take lots of threes are probably good shooters.
The biggest positive anomalies — guys who outperformed the model — are players commonly brought up when discussing the importance of the “eye test” in scouting. The most notable examples of this are Kawhi Leonard and Bruce Bowen. While the model misses Kawhi’s potential to improve his jumper based on a solid shot form, NBA analysts like MyNBADraft7 projected Leonard’s growth in that department: “He does show nice form on the jumper, so with extra work in the gym, he should be able to improve from the perimeter before he gets to the league.” Mike Miller and Klay Thompson are also examples of players who draft analysts had accurately pegged as top-end shooters, even though the model didn’t love them coming out of college.
This chart would look a lot different on the far-right end if it were made with data from 2021-22 — most importantly, Steph Curry, James Harden, and Damian Lillard would be more visible.

Another good way to evaluate the model’s efficacy is to look at its best and worst projected shooters coming out of college:

The model’s top projected 3P% players coming out of college.

The model’s worst predicted NBA three-point shooters in the dataset.

OK, so at the top end, our model kinda nailed it. It had Steph Curry pegged as the best shooter in the dataset and hit on Steve Nash, Damian Lillard, J.J. Redick and Kyle Korver as all being top-end NBA three-point shooters. Being blind to small-school bias (shoutout Davidson, Weber State, and Santa Clara) seems to be a benefit to our model; if the Timberwolves had known that Steph Curry would be the top shooter to enter the draft in 20 years, maybe they wouldn’t have passed on him for Jonny Flynn.

At the low end, it’s also fairly accurate: the model misses on Trevor Ariza, Al-Farouq Aminu, and Matt Barnes but nails its projections of NBA players like Antoine Walker and Rajon Rondo as poor three-point shooters, even when scouts foresaw the development of three-point range in those players’ futures.8,9

On the whole, a linear regression model that projects NBA 3P% from NCAA stats is a useful tool in the toolbox of player evaluation, but it should by no means be the be-all and end-all of prospect shooting evaluation. Other factors like a player’s shot diet (difficulty, range, catch-and-shoot vs. off-dribble, open vs. contested) and a player’s shot form should be factored into prospect analysis. However, disregarding college data’s ability to project NBA 3P% would be a mistake.

1. Recently, David Sajdak has done an excellent job of using tracking data to compare and contrast NBA shooting forms: https://davidsajdak8.shinyapps.io/shooting_similarity/

2. http://cs229.stanford.edu/proj2017/final-reports/5238212.pdf

3. You can access the dataset and the Python script used to scrape it here: https://data.world/bgp12/nbancaacomparisons

4. I don’t think the Bucks took Andrew Bogut first overall in 2005 (three NBA 3PM in fourteen seasons) with an eye on his stretch-big abilities.

5. It takes roughly 750 attempts for NBA three-point percentage to stabilize: https://fansided.com/2014/08/29/long-take-three-point-shooting-stabilize/

6. To compute z_ncaa_sum, I converted each player’s 3P%, 3PA/gm, and FT% to z-scores, and then summed each player’s z-scores across the three categories. This created a variable that can be roughly modeled by the normal distribution N(0,2.36).

7. https://www.nba.com/hornets/2011_draft_prospect_leonard.html

8. https://www.nbadraft.net/players/rajon-rondo/

9. http://www.ibiblio.org/craig/draft/1996_draft/scout/sf.html

Hoops and Hoos
