Microsoft Excel 2010 : Understanding Frequency Distributions (part 1) - Using Frequency Distributions

1/12/2013 11:34:54 AM

In addition to charts that show two variables—such as numbers broken down by categories in a Column chart, or the relationship between two numeric variables in an XY chart—there is another sort of Excel chart that deals with one variable only. It’s the visual representation of a frequency distribution, a concept that’s absolutely fundamental to intermediate and advanced statistical methods.

A frequency distribution is intended to show how many instances there are of each value of a variable. For example:

The number of people who weigh 100 pounds, 101 pounds, 102 pounds, and so on.
The number of cars that get 18 miles per gallon (mpg), 19 mpg, 20 mpg, and so on.
The number of houses that cost between $200,001 and $205,000, between $205,001 and $210,000, and so on.

Because we usually round measurements to some convenient level of precision, a frequency distribution tends to group individual measurements into classes. Using the examples just given, two people who weigh 100.2 and 100.4 pounds might each be classed as 100 pounds; two cars that get 18.8 and 19.2 mpg might be grouped together at 19 mpg; and any number of houses that cost between $220,001 and $225,000 would be treated as being in the same price class.
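The grouping of measurements into classes can be sketched in a few lines of Python. This is only an illustration: the function name, the class width, and the sample weights are hypothetical and do not come from the text.

```python
from collections import Counter

def frequency_distribution(values, class_width=1.0):
    """Round each value to the nearest multiple of class_width and count the classes."""
    counts = Counter(round(v / class_width) * class_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical weights in pounds (not data from the figures)
weights = [100.2, 100.4, 101.1, 101.3, 101.4, 102.0]
dist = frequency_distribution(weights)

# Print a crude text version of the chart: one asterisk per instance
for value, count in dist.items():
    print(f"{value:>6.0f} lb: {'*' * count}")
```

Each class ends up with a count of the measurements rounded into it, which is what the vertical axis of a frequency distribution chart displays.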

As it’s usually shown, the chart of a frequency distribution puts the variable’s values on its horizontal axis and the count of instances on the vertical axis. Figure 1 shows a typical frequency distribution.

Figure 1. Typically, most records cluster toward the center of a frequency distribution.

You can tell quite a bit about a variable by looking at a chart of its frequency distribution. For example, Figure 1 shows the weights of a sample of 100 people. Most of them are between 140 and 180 pounds. In this sample, there are about as many people who weigh a lot (say, over 175 pounds) as there are whose weight is relatively low (say, up to 130). The range of weights—that is, the difference between the lightest and the heaviest weights—is about 85 pounds, from 116 to 200.

There are lots of ways that a different sample of people might provide a different set of weights than those shown in Figure 1. For example, Figure 2 shows a sample of 100 vegans—notice that the distribution of their weights is shifted down the scale somewhat from the sample of the general population shown in Figure 1.

Figure 2. Compared to Figure 1, the location of the frequency distribution has shifted to the left.

The frequency distributions in Figures 1 and 2 are relatively symmetric. Their general shapes are not far from the idealized normal “bell” curve, which depicts the distribution of many variables that describe living beings.

Still, many variables follow a different sort of frequency distribution. Some are skewed right (see Figure 3) and others left (see Figure 4).

Figure 3. A frequency distribution that stretches out to the right is called positively skewed.

Figure 4. Negatively skewed distributions are not as common as positively skewed distributions.

Figure 3 shows counts of the number of mistakes on individual Federal tax forms. It’s normal to make a few mistakes (say, one or two), and it’s abnormal to make several (say, five or more). This distribution is positively skewed.

Another variable, home prices, tends to be positively skewed, because although there’s a real lower limit (a house cannot cost less than $0) there is no theoretical upper limit to the price of a house. House prices therefore tend to bunch up between $100,000 and $200,000, with a few between $200,000 and $300,000, and fewer still as you go up the scale.

A quality control engineer might sample 100 ceramic tiles from a production run of 10,000 and count the number of defects on each tile. Most would have zero, one, or two defects, several would have three or four, and a very few would have five or six. This is another positively skewed distribution—quite a common situation in manufacturing process control.

Because true lower limits are more common than true upper limits, you tend to encounter more positively skewed frequency distributions than negatively skewed ones. But negatively skewed distributions certainly occur. Figure 4 might represent personal longevity: relatively few people die in their twenties, thirties, and forties, compared to the numbers who die in their fifties through their eighties.
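The direction of skew can also be checked numerically. The sketch below computes a population skewness coefficient (the mean cubed deviation divided by the cubed standard deviation) for made-up data. Both the data and the function are illustrative assumptions, and Excel's SKEW() function applies a small-sample adjustment, so its results will differ slightly.

```python
from statistics import mean

def skewness(values):
    """Population skewness: positive for a long right tail, negative for a long left tail."""
    m = mean(values)
    n = len(values)
    m2 = sum((x - m) ** 2 for x in values) / n  # variance
    m3 = sum((x - m) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical defect counts on 20 tiles: most near zero, a few high
defects = [0] * 7 + [1] * 6 + [2] * 3 + [3] * 2 + [4] + [6]

# Hypothetical ages at death: most high, a few low
ages_at_death = [24, 38, 55, 63, 68, 71, 74, 76, 78, 80, 81, 83, 84, 86, 88]

print("defects skewness:", skewness(defects))        # positive
print("longevity skewness:", skewness(ages_at_death))  # negative
```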

1. Using Frequency Distributions

It’s helpful to use frequency distributions in statistical analysis for two broad reasons. One concerns visualizing how a variable is distributed across people or objects. The other concerns how to make inferences about a population of people or objects on the basis of a sample.

Those two reasons help define the two general branches of statistics: descriptive statistics and inferential statistics. Along with descriptive statistics such as averages, ranges of values, and percentages or counts, the chart of a frequency distribution puts you in a stronger position to understand a set of people or things because it helps you visualize how a variable behaves across its range of possible values.

In the area of inferential statistics, frequency distributions based on samples help you determine the type of analysis you should use to make inferences about the population.

Visualizing the Distribution: Descriptive Statistics

It’s usually much easier to understand a variable—how it behaves in different groups, how it may change over time, and even just what it looks like—when you see it in a chart. For example, here’s the formula that defines the normal distribution:

$$u = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(X-\mu)^2 / (2\sigma^2)}$$

And Figure 5 shows the normal distribution in chart form.

Figure 5. The familiar normal curve is just a frequency distribution.

The formula itself is indispensable, but it doesn’t convey understanding. In contrast, the chart informs you that the frequency distribution of the normal curve is symmetric and that most of the records cluster around the center of the horizontal axis.

Note

The formula was developed by an 18th-century French mathematician named Abraham de Moivre. Excel simplifies it to this:

=NORMDIST(1,0,1,FALSE)

In Excel 2010, it’s this:

=NORM.S.DIST(1,FALSE)

Those are major simplifications.
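To see what the worksheet functions are computing, here is a minimal Python sketch of the same density formula. The function name and its default arguments are assumptions made for illustration. With X = 1, a mean of 0, and a standard deviation of 1, it evaluates the same quantity as the two Excel formulas in the Note.

```python
import math

def normal_density(x, mu=0.0, sigma=1.0):
    """Height of the normal curve at x for mean mu and standard deviation sigma."""
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / (sigma * math.sqrt(2 * math.pi))

print(round(normal_density(1), 6))  # 0.241971

# The curve is symmetric about the mean and highest at the mean
print(normal_density(-1) == normal_density(1))
print(normal_density(0) > normal_density(1))
```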

Again, personal longevity tends to bulge in the higher levels of its range (and therefore skews left as in Figure 4). Home prices tend to bulge in the lower levels of their range (and therefore skew right). The height of human beings creates a bulge in the center of the range, and is therefore symmetric and not skewed.

Some statistical analyses assume that the data comes from a normal distribution, and in some statistical analyses that assumption is an important one. Be aware, though, that if you want to analyze a skewed distribution there are ways to normalize it and therefore comply with the requirements of the analysis. In general, you can use Excel’s SQRT() and LOG() functions to help normalize a positively skewed distribution, because they compress the long right tail, and an exponentiation operator (for example, =A2^2 to square the value in A2) to help normalize a negatively skewed distribution, because it stretches the upper end of the scale.
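A minimal sketch of how such transformations change a distribution's shape, using hypothetical data and a population skewness measure that is not part of the original text. Compressing transformations such as the square root and the log pull in a long right tail, while squaring stretches the upper end of the scale and pulls a long left tail back toward symmetry.

```python
import math

def skewness(values):
    """Population skewness: mean cubed deviation over the cubed standard deviation."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((x - m) ** 2 for x in values) / n
    m3 = sum((x - m) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

# Hypothetical home prices in thousands of dollars (long right tail)
prices = [110, 120, 125, 130, 140, 150, 160, 180, 220, 300, 450, 900]
roots = [math.sqrt(x) for x in prices]   # like Excel's SQRT()
logs = [math.log10(x) for x in prices]   # like Excel's LOG(), base 10 by default

# Hypothetical ages at death (long left tail)
ages_at_death = [24, 38, 55, 63, 68, 71, 74, 76, 78, 80, 81, 83, 84, 86, 88]
squares = [x ** 2 for x in ages_at_death]  # like =A2^2

print("prices:", skewness(prices), "sqrt:", skewness(roots), "log:", skewness(logs))
print("ages:", skewness(ages_at_death), "squared:", skewness(squares))
```

A transformation rarely produces a perfectly normal result, so it is worth charting the transformed values before relying on them.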

Visualizing the Population: Inferential Statistics

The other general rationale for examining frequency distributions has to do with making an inference about a population, using the information you get from a sample as a basis. This is the field of inferential statistics.

A familiar example is the political survey. When a pollster announces that 53% of those who were asked preferred Smith, he is reporting a descriptive statistic. Fifty-three percent of the sample preferred Smith, and no inference is needed.

But when another pollster reports that the margin of error around that 53% statistic was plus or minus 3%, she is reporting an inferential statistic. She is extrapolating from the sample to the larger population and inferring, with some specified degree of confidence, that between 50% and 56% of all voters prefer Smith.

The size of the reported margin of error, plus or minus three percentage points (a six-point interval in all), depends in part on how confident the pollster wants to be. In general, the greater degree of confidence you want in your extrapolation, the greater the margin of error that you allow. If you’re on an archery range and you want to be virtually certain of hitting your target, you make the target as large as necessary.

Similarly, if the pollster wants to be 99.9% confident of her projection into the population, the margin might be so great as to be useless—say, plus or minus 20%. And it’s not headline material to report that somewhere between 33% and 73% of the voters prefer Smith.

But the size of the margin of error also depends on certain aspects of the frequency distribution in the sample of the variable. In this particular (and relatively straightforward) case, the accuracy of the projection from the sample to the population depends in part on the level of confidence desired (as just briefly discussed), in part on the size of the sample, and in part on the percent favoring Smith in the sample. The latter two issues, sample size and percent in favor, are both aspects of the frequency distribution you determine by examining the sample’s responses.
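The way those three factors combine can be sketched with the usual normal approximation for a sample proportion. The formula choice, the function name, and the sample size of 1,000 are assumptions made for illustration; the text does not say how the pollster's margin was calculated.

```python
import math
from statistics import NormalDist

def margin_of_error(p, n, confidence=0.95):
    """Normal-approximation margin of error for a sample proportion p with n respondents."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    return z * math.sqrt(p * (1 - p) / n)

p_smith = 0.53  # percent favoring Smith in the sample
n = 1000        # hypothetical sample size

print(margin_of_error(p_smith, n, 0.95))      # about 0.03
print(margin_of_error(p_smith, n, 0.999))     # wider: more confidence costs precision
print(margin_of_error(p_smith, 4 * n, 0.95))  # half as wide: four times the sample
```

Under this approximation, raising the confidence level widens the margin, and quadrupling the sample size cuts it in half.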

Of course, it’s not just political polling that depends on sample frequency distributions to make inferences about populations. Here are some other typical questions posed by empirical researchers:

What percent of the nation’s homes went into foreclosure last quarter?
What is the incidence of cardiovascular disease today among persons who took the pain medication Vioxx prior to its removal from the marketplace in 2004? Is that incidence reliably different from the incidence of cardiovascular disease among those who did not take the medication?
A sample of 100 cars from a particular manufacturer, made during 2010, had average highway gas mileage of 26.5 mpg. How likely is it that the average highway mpg, for all that manufacturer’s cars made during that year, is greater than 26.0 mpg?
Your company manufactures custom glassware and uses lasers to etch company logos onto wine bottles, tumblers, sales awards, and so on. Your contract with a customer calls for no more than 2% defective items in a production lot. You sample 100 units from your latest production run and find five that are defective. What is the likelihood that the entire production run of 1,000 units has a maximum of 20 that are defective?

In each of these four cases, the specific statistical procedures to use—and therefore the specific Excel tools—would be different. But the basic approach would be the same: Using the characteristics of a frequency distribution from a sample, compare the sample to a population whose frequency distribution is either known or founded in good theoretical work. Use the numeric functions in Excel to estimate how likely it is that your sample accurately represents the population you’re interested in.
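As one illustration of that approach, the glassware question can be approached with a binomial model: if the true defect rate were exactly the contractual 2%, how likely is a sample of 100 to contain five or more defective units? This is a sketch under stated assumptions, not the analysis the text prescribes. The binomial model ignores the finite lot size of 1,000 (a hypergeometric model would account for it), and the function name is hypothetical.

```python
from math import comb

def binomial_tail(k, n, p):
    """P(X >= k) when X follows a binomial distribution with n trials and success rate p."""
    below_k = sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k))
    return 1 - below_k

# Chance of five or more defects in a sample of 100 if the true rate is 2%
p_five_or_more = binomial_tail(5, 100, 0.02)
print(round(p_five_or_more, 3))
```

The result is roughly 0.05, so a sample this bad would be unusual if the lot really met the 2% standard. In Excel 2010, the comparable worksheet formula is =1-BINOM.DIST(4,100,0.02,TRUE).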
