A confidence interval
is a range of values that gives the user a sense of how precisely a
statistic estimates a parameter. The most familiar use of a confidence
interval is likely the “margin of error” reported in news stories about
polls: “The margin of error is plus or minus 3 percentage points.” But
confidence intervals are useful in contexts that go well beyond that
simple situation.
Confidence intervals can be
used with distributions that aren't normal: distributions that are
highly skewed, or non-normal in some other way. But it's easiest to understand what they're
about in symmetric distributions, so the topic is introduced here. Don't
let that lead you to think that confidence intervals can be used only
with normal distributions.
The Meaning of a Confidence Interval
Suppose that you measured the
HDL level in the blood of 100 adults on a special diet and calculated a
mean of 50 mg/dl with a standard deviation of 20. You’re aware that the
mean is a statistic, not a population parameter, and that another sample
of 100 adults, on the same diet, would very likely return a different
mean value. Over many repeated samples, the grand mean—that is, the mean
of the sample means—would turn out to be very, very close to the
population parameter.
But your resources don’t
extend that far and you’re going to have to make do with just the one
statistic, the 50 mg/dl that you calculated for your sample. Assume,
too, that although the 20 you calculated for the sample standard
deviation is also a statistic, the population standard deviation is
known to be the same value, 20. You can make use of that standard
deviation and the number of
HDL values that you tabulated in order to get a sense of how much play
there is in that sample estimate.
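If the 100 HDL values were laid out in a worksheet range such as A1:A100 (the range address here is just for illustration), the two sample statistics would come from formulas along these lines:
=AVERAGE(A1:A100)
=STDEV.S(A1:A100)
In this example, the first formula returns the sample mean of 50 mg/dl and the second returns the sample standard deviation of 20.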
You get that sense of precision by constructing a
confidence interval around that mean of 50 mg/dl. Perhaps the interval
extends from 45 to 55. (And here you can see the relationship to “plus
or minus 3 percentage points.”) Does that tell you that the true
population mean is somewhere between 45 and 55?
No, it doesn’t, although it
might well be. Just as there are many possible samples that you might
have taken, but didn’t, there are many possible confidence intervals you
might have constructed around the sample means, but couldn’t. As you’ll
see, you construct your confidence interval in such a way that if you
took many more means and put confidence intervals around them, 95% of
the confidence intervals would capture the true population mean. As to
the specific confidence interval that you did construct, the probability
that the true population mean falls within the interval is either 1 or
0: either the interval captures the mean or it doesn’t.
However, it is more
rational to assume that the one confidence interval you constructed is one
of the 95% that capture the population mean than to assume that it's one of the 5% that don't.
So you would tend to believe, with 95% confidence, that the interval is
one of those that captures the population mean.
Although I’ve spoken of
95% confidence intervals in this section, you can also construct 90% or
99% confidence intervals, or any other degree of confidence that makes
sense to you in a particular situation. You’ll see next how your choices
when you construct the interval affect the nature of the interval
itself. It turns out that the discussion goes more smoothly if you're willing
to suspend your disbelief briefly: I'm going to ask you to
imagine a situation in which you know the standard deviation of a
measure in the population, but you don't know its mean in the
population. Those circumstances are a little odd but far from
impossible.
Constructing a Confidence Interval
A confidence interval on a mean, as described in the prior section, requires these building blocks:
The mean itself
The standard deviation of the observations
The number of observations in the sample
The level of confidence you want to apply to the confidence interval
Starting with the level
of confidence, suppose that you want to create a 95% confidence
interval: You want to construct it in such a way that if you created 100
confidence intervals, 95 of them would capture the true population
mean.
In that case, because you’re dealing with a normal distribution, you could enter these formulas in a worksheet:
=NORM.S.INV(0.025)
=NORM.S.INV(0.975)
The NORM.S.INV() function,
described in the prior section, returns the z-score that has to its left
the proportion of the curve’s area given as the argument. Therefore,
NORM.S.INV(0.025) returns −1.96. That’s the z-score that has 0.025, or
2.5%, of the curve’s area to its left.
Similarly, NORM.S.INV(0.975)
returns 1.96, which has 97.5% of the curve’s area to its left. Another
way of saying it is that 2.5% of the curve’s area lies to its right.
These figures are shown in Figure 1.
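If you want to check that relationship in the worksheet, the NORM.S.DIST() function runs in the other direction, from z-score to cumulative area. For example, the formula
=NORM.S.DIST(-1.96, TRUE)
returns approximately 0.025, the proportion of the curve's area that lies to the left of a z-score of −1.96.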
The area under the curve in Figure 1,
and between the values 46.1 and 53.9 on the horizontal axis, accounts
for 95% of the area under the curve. The curve, in theory, extends to
infinity to the left and to the right, so all possible values for the
population mean are included in the curve. Ninety-five percent of the
possible values lie within the 95% confidence interval between 46.1 and
53.9.
The figures 46.1 and 53.9
were chosen so as to capture that 95%. If you wanted a 99% confidence
interval (or some other interval more or less likely to be one of the
intervals that captures the population mean), you would choose different
figures. Figure 2 shows a 99% confidence interval around a sample mean of 50.
In Figure 2,
the 99% confidence interval extends from 44.8 to 55.2, a total of 2.6
points wider than the 95% confidence interval depicted in Figure 1.
If a hundred 99% confidence intervals were constructed around the means
of 100 samples, 99 of them (not 95 as before) would capture the
population mean. The additional confidence is provided by making the
interval wider. And that’s always the tradeoff in confidence intervals.
The narrower the interval, the more precisely you draw the boundaries,
but the fewer such intervals will capture the parameter in question
(here, that's the population mean). The broader the interval, the less precisely
you set the boundaries, but the larger the number of intervals that
capture the parameter.
Other than setting the
confidence level, the only factor that's under your control is the
sample size. You generally can't dictate that the standard deviation
be smaller, but you can take larger samples. The standard deviation used in a confidence interval around a sample
mean is not the standard deviation of the individual raw scores. It is
that standard deviation divided by the square root of the sample size,
and it is known as the standard error of the mean. The larger the sample,
the smaller the standard error, and the narrower the resulting confidence interval.
The data set used to create the charts in Figures 1 and 2
has a standard deviation of 20, known to be the same as the population
standard deviation. The sample size is 100. Therefore, the standard
error of the mean is 20/√100 = 20/10, or 2.
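In worksheet terms, if the standard deviation of 20 is in cell B1 and the sample size of 100 is in cell B2 (the cell addresses are arbitrary), the standard error comes from a formula such as
=B1/SQRT(B2)
which returns 2.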
To complete the
construction of the confidence interval, you multiply the standard error
of the mean by the z-scores that cut off the confidence level you’re
interested in. Figure 1, for example, shows a 95% confidence interval.
The interval must be constructed so that 95% of the area under the curve
lies within the interval. Therefore, 5% must lie outside the interval,
divided equally between the two tails: 2.5% in each.
Here’s where the NORM.S.INV() function comes into play. Earlier in this section, these two formulas were used:
=NORM.S.INV(0.025)
=NORM.S.INV(0.975)
They return the z-scores
−1.96 and 1.96, which form the boundaries for 2.5% and 97.5% of the unit
normal distribution, respectively. If you multiply each by the standard
error of 2, and add the sample mean of 50, you get 46.1 and 53.9, the
limits of a 95% confidence interval on a mean of 50 and a standard error
of 2.
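One way to assemble those pieces directly in the worksheet is with a pair of formulas such as these:
=50 + NORM.S.INV(0.025) * 2
=50 + NORM.S.INV(0.975) * 2
They return 46.1 and 53.9 (after a little rounding), the lower and upper limits of the 95% confidence interval.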
If you want a 99% confidence interval, use the formulas
=NORM.S.INV(0.005)
=NORM.S.INV(0.995)
to return −2.58 and 2.58. These
z-scores cut off one half of one percent of the unit normal distribution
at each end. The remainder of the area under the curve is 99%.
Multiplying each z-score by 2 and adding 50 for the mean results in 44.8
and 55.2, the limits of a 99% confidence interval on a mean of 50 and a
standard error of 2.
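To get those limits directly from the worksheet, you could use formulas such as
=50 + NORM.S.INV(0.005) * 2
=50 + NORM.S.INV(0.995) * 2
which return approximately 44.8 and 55.2.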
At this point it can help to
back away from the arithmetic and focus instead on the concepts. Any
z-score is some number of standard deviations—so a z-score of 1.96 is a
point that’s found at 1.96 standard deviations above the mean, and a
z-score of −1.96 is found 1.96 standard deviations below the mean.
Because the nature of the normal
curve has been studied so extensively, we know that 95% of the area
under a normal curve is found between 1.96 standard deviations below the
mean and 1.96 standard deviations above the mean.
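That figure is easy to confirm with the NORM.S.DIST() worksheet function. For example:
=NORM.S.DIST(1.96, TRUE) - NORM.S.DIST(-1.96, TRUE)
returns approximately 0.95, the proportion of the curve's area that lies between those two z-scores.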
When you want to put a confidence
interval around a sample mean, you start by deciding what percentage of
other sample means, if collected and calculated, you would want to fall
within that interval. So, if you decided that you wanted 95% of
possible sample means to be captured by your confidence interval, you
would put it 1.96 standard deviations above and below your sample mean.
But how large is the relevant
standard deviation? In this situation, the relevant units are themselves
mean values. You need to know the standard deviation not of the
original and individual observations, but of the means that are
calculated from those observations. That standard deviation has a
special name, the standard error of the mean.
Because of mathematical derivations and
long experience with the way the numbers behave, we know that a good,
close estimate of the standard deviation of the mean values is the
standard deviation of individual scores, divided by the square root of
the sample size. That’s the standard deviation you want to use to
determine your confidence interval.
In the example this
section has explored, the standard deviation is 20 and the sample size
is 100, so the standard error of the mean is 2. When you calculate 1.96
standard errors below the mean of 50 and above the mean of 50, you wind
up with values of 46.1 and 53.9. That’s your 95% confidence interval. If
you took another 99 samples from the population, 95 of 100 similar
confidence intervals would capture the population mean. It’s sensible to
conclude that the confidence interval you calculated is one of the 95
that capture the population mean. It’s not sensible to conclude that
it’s one of the remaining 5 that don’t.