Binning Basketball Scores

November 14, 2012

Here's an example of working with binned data.  There was some confusion about the notes when I talked about using Minitab.  In an old version of this course we used Minitab (a popular statistics package), but you can now do everything in R; I'll remove the Minitab material for future classes.

Let's look at the number of points scored by a team in an NBA basketball game.  Is that a normally distributed variable?

From basketball-reference.com, specifically http://www.basketball-reference.com/teams/PHI/2011_games.html, I collect the number of points scored in each Sixers game of the 2010-2011 season.  Since I record the points scored by both the Sixers and their opponents, there are a total of 162 observations.

It seems convenient to bin the data into the bins 70-75, 75-80, …, 120-125.  I define the bin boundaries in a vector called bins:

bins = seq(70, 125, by=5)

My data is in the vector score.  I construct a histogram using the function truehist in the MASS package.

library(MASS)
truehist(score, breaks=bins)

Next I want to fit a normal comparison curve.  We find a summary of the score values:

summary(score)
Min. 1st Qu. Median Mean 3rd Qu. Max.
70.00 91.00 98.50 98.26 106.00 125.00

A reasonable estimate of the normal mean is the median, 98.50.  We can estimate the normal standard deviation by taking the difference of the quartiles and dividing by 1.35.  (Remember why?  For a normal distribution, the interquartile range is about 1.35 standard deviations.)  I overlay a normal curve using the curve function (dnorm is the normal density function; add = TRUE adds to the current plot).

m = 98.50; s = (106 - 91) / 1.35
curve(dnorm(x, m, s), add=TRUE, col="red", lwd=3)
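Where does the 1.35 come from?  For any normal distribution, the third and first quartiles sit about 0.6745 standard deviations above and below the mean, so the interquartile range is about 1.349 standard deviations.  You can check this in R with qnorm:

qnorm(0.75) - qnorm(0.25)   # IQR of the standard normal
[1] 1.34898

Dividing the sample IQR by this constant gives a robust estimate of the standard deviation that isn't thrown off by a few extreme scores.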

It looks like a pretty good fit, but we can explore further by inspecting residuals.

The fit.gaussian function in the LearnEDA package does all of the work.  You input the data, the bins, and the mean and standard deviation of the normal curve.

stuff = fit.gaussian(score, bins, m, s)

By displaying stuff, you'll see all of the things that are now available — we have the bin counts, the expected probabilities (under the normal curve), the expected counts, and the residuals defined by sqrt(count) - sqrt(expected).

stuff
$counts
 [1]  4  8  9 18 28 31 24 18 11  9  4

$probs
 [1] 0.01205618 0.03074140 0.06422667 0.10995273 0.15424505 0.17731324 0.16703208 0.12893940 0.08156254 0.04227676 0.01795559

$expected
 [1]  1.977214  5.041590 10.533174 18.032247 25.296189 29.079371 27.393261 21.146062 13.376257  6.933389  2.944716

$residual
 [1]  0.593865555  0.583078533 -0.245485097 -0.003798664  0.261970938  0.175235208 -0.334877689 -0.355844069 -0.340731788
[10]  0.366867012  0.283982432
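If you don't have the LearnEDA package installed, these quantities are easy to reproduce with base R.  Here is a sketch of the computation (not necessarily the exact code inside fit.gaussian, but it matches the definitions above):

# observed counts in each bin
counts = hist(score, breaks=bins, plot=FALSE)$counts
# normal probability of each bin: differences of the CDF at the bin edges
probs = diff(pnorm(bins, m, s))
# expected counts under the normal model
expected = length(score) * probs
# root residuals: sqrt(observed) - sqrt(expected)
residual = sqrt(counts) - sqrt(expected)

One small caveat: the probabilities computed this way won't sum exactly to one, since the normal curve puts a little mass below 70 and above 125.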

To check the suitability of the normal fit, we plot the residuals against, say, the bin midpoints.
(By the way, you can easily compute the double root residuals mentioned in the notes, defined by DRR = sqrt(2 + 4 OBS) - sqrt(1 + 4 EXP), since you have the vector of observed counts OBS and the vector of expected counts EXP.)

bin.mids = (bins[1:11] + bins[2:12])/2
plot(bin.mids, stuff$residual)
abline(h=0)
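The double root residuals mentioned above are a one-liner given the output of fit.gaussian.  A sketch, reusing the bin.mids vector just defined:

# double root residuals: DRR = sqrt(2 + 4 OBS) - sqrt(1 + 4 EXP)
OBS = stuff$counts
EXP = stuff$expected
DRR = sqrt(2 + 4 * OBS) - sqrt(1 + 4 * EXP)
plot(bin.mids, DRR)
abline(h=0)

The advantage of the DRR scale is that, when the model fits, these residuals behave roughly like standard normal deviates, so values beyond about ±2 are worth a closer look.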

When we look at residuals, we look for patterns and unusually large and small values.  Here it is harder to gauge “large or small” since these simple residuals don’t have the right scale.  But I do see one consistent pattern — the residuals are positive for bins at the extremes.  This indicates that the distribution of points scored has heavier tails than the normal.
