2.3 Visualizing Data With Histograms

Visualizing the Distribution of the Data

Once we have collected data, it is important to create a visual display to get a sense of how the data is distributed. One important type of graph we will use extensively is the histogram. A histogram consists of adjoining bars. It has both a horizontal axis and a vertical axis. The horizontal axis is labeled with what the data represents. Each bar might represent a single value or a range of values. The vertical axis is labeled with either the frequency or the relative frequency (or percent frequency or probability). The graph will have the same shape with either label. A histogram can reveal the shape and the center of the data, and it gives a sense of the spread or variability of the data.

Most often, we will let a computer program or an Internet applet produce a histogram for us. We can then edit the title, axis labels, number of bars used, etc., to get a histogram that is both visually pleasing and allows for discussion and interpretation. It is instructive, however, to think through how a histogram is actually created, so we know what is happening behind the scenes when the computer generates a histogram.

There is a process to follow when we want to construct a histogram. First decide how many bars or intervals (also called bins or classes) represent the data. Many histograms consist of 5 to 15 bins or classes, for clarity. Each of the data values would fall into one of the bins or intervals. Choose a starting point for the first interval to be less than the smallest data value. A convenient starting point is a lower value carried out to one more decimal place than the value with the most decimal places. For example, if the data value with the most decimal places is 6.1 and this is the smallest value, a convenient starting point is 6.05 (6.1 – 0.05 = 6.05).

If all the data happen to be integers and the smallest value is 2, then a convenient starting point is 1.5 (2 – 0.5 = 1.5). When the starting point and other boundaries are carried to one additional decimal place, no data value will fall on a boundary. The next two examples go into detail about how to construct a histogram using continuous data and how to create a histogram using discrete data.
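The starting-point rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `starting_point` is hypothetical, and `Decimal` is used so the subtraction matches the hand calculation exactly.

```python
from decimal import Decimal

def starting_point(values):
    """Return a left boundary half a unit below the smallest value,
    where the unit is the finest decimal place used in the data."""
    decimals = [Decimal(str(v)) for v in values]
    # Largest number of decimal places among the data values.
    places = max(max(-d.as_tuple().exponent, 0) for d in decimals)
    # One place -> 0.05, zero places (integers) -> 0.5.
    half_unit = Decimal(5).scaleb(-(places + 1))
    return min(decimals) - half_unit

print(starting_point([6.1, 7.3, 8.0]))  # 6.05
print(starting_point([2, 5, 9]))        # 1.5
```

The two calls reproduce the 6.05 and 1.5 starting points worked out in the text; the other values in each list are made up for the illustration.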

Example 1: Creating a Histogram by Hand using Continuous Data

Heights of Soccer Players: The following data are the heights (in inches to the nearest half inch) of 100 male semiprofessional soccer players. Note that the heights are classified as continuous data, since height is a measured quantity.

60; 60.5; 61; 61; 61.5; 63.5; 63.5; 63.5; 64; 64; 64; 64; 64; 64; 64; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5; 66; 66; 66; 66; 66; 66; 66; 66; 66; 66; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67.5; 67.5; 67.5; 67.5; 67.5; 67.5; 67.5; 68; 68; 69; 69; 69; 69; 69; 69; 69; 69; 69; 69; 69.5; 69.5; 69.5; 69.5; 69.5; 70; 70; 70; 70; 70; 70; 70.5; 70.5; 70.5; 71; 71; 71; 72; 72; 72; 72.5; 72.5; 73; 73.5; 74

The smallest data value is 60. If we were to start the first interval at 60, then the data value 60 would fall on the boundary. We avoid letting data values land on a boundary because we want it to be clear which interval each data value belongs to. Instead, we shift the boundary to the left. Since the data values with the most decimal places have one decimal place (for instance, 61.5), we shift the starting point to the left by 0.05 units, so that no data value falls on a boundary between two intervals.

It is convenient in this case to calculate 60 – 0.05 = 59.95 and start the left-most interval at 59.95.  (Side Note:  Doing this allows us to make sure our intervals for the histogram avoid hitting any data points.  You don’t have to do this, but it’s convenient because it helps avoid issues with data points on the boundaries of the intervals.)

The largest data value is 74, so by the same rule the ending value is 74 + 0.05 = 74.05.

Next, calculate the width of each bar or class interval. To calculate this width, subtract the starting point from the ending value and divide by the number of bars. You must choose the number of bars you desire. Suppose you choose eight bars.

\frac{74.05-59.95}{8} \approx 1.76, so the width of each bar would be about 1.76. For convenience, we round this up to 2.

The boundaries for each interval:

  • 59.95
  • 59.95 + 2 = 61.95
  • 61.95 + 2 = 63.95
  • 63.95 + 2 = 65.95
  • 65.95 + 2 = 67.95
  • 67.95 + 2 = 69.95
  • 69.95 + 2 = 71.95
  • 71.95 + 2 = 73.95
  • 73.95 + 2 = 75.95
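The width calculation and the boundary list above can be checked with a short Python sketch. The variable names are illustrative, and `Decimal` is used only so that the arithmetic matches the hand calculation without floating-point noise.

```python
from decimal import Decimal

start, end, bars = Decimal("59.95"), Decimal("74.05"), 8

raw_width = (end - start) / bars   # 14.10 / 8
width = 2                          # rounded up to a convenient value, as in the text

# Nine boundaries delimit eight intervals.
boundaries = [start + i * width for i in range(bars + 1)]

print(raw_width)                    # 1.7625
print([str(b) for b in boundaries])
```

Because the width was rounded up to 2, the last boundary is 75.95 rather than 74.05, so the eight intervals still cover every data value.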

The heights 60 through 61.5 inches are in the interval 59.95–61.95. The heights that are 63.5 are in the interval 61.95–63.95. The heights that are 64 through 64.5 are in the interval 63.95–65.95. The heights 66 through 67.5 are in the interval 65.95–67.95. The heights 68 through 69.5 are in the interval 67.95–69.95. The heights 70 through 71 are in the interval 69.95–71.95. The heights 72 through 73.5 are in the interval 71.95–73.95. The height 74 is in the interval 73.95–75.95.
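As a check on the sorting described above, the following sketch counts how many heights land in each interval. The `tally` list is a compact restatement of the 100 heights listed earlier (each height paired with how many times it appears); the names are illustrative.

```python
# (height, count) pairs tallied from the data listed in this example
tally = [(60, 1), (60.5, 1), (61, 2), (61.5, 1), (63.5, 3), (64, 7),
         (64.5, 8), (66, 10), (66.5, 11), (67, 12), (67.5, 7), (68, 2),
         (69, 10), (69.5, 5), (70, 6), (70.5, 3), (71, 3), (72, 3),
         (72.5, 2), (73, 1), (73.5, 1), (74, 1)]
heights = [h for h, n in tally for _ in range(n)]

boundaries = [59.95 + 2 * i for i in range(9)]  # 59.95, 61.95, ..., 75.95

# No height sits on a boundary, so a strict comparison is enough.
frequencies = [sum(lo < h < hi for h in heights)
               for lo, hi in zip(boundaries, boundaries[1:])]

print(len(heights))   # 100
print(frequencies)    # [5, 3, 15, 40, 17, 12, 7, 1]
```

The fourth count, 40, belongs to the interval 65.95–67.95.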

The following relative frequency histogram displays the heights on the x-axis and relative frequency on the y-axis. Notice we can see at a glance that the data values cluster between 65 inches and 68 inches and that the data values range from about 60 inches to about 75 inches.

Histogram consists of 8 bars with the y-axis in increments of 0.05 from 0-0.4 and the x-axis in intervals of 2 from 59.95-75.95.

It is important to note that a frequency histogram would have the exact same shape. The only difference would be in how the vertical axis is labeled. Notice the tallest bar has a relative frequency of 0.4. There were 100 data values in the sample, so 0.4 of 100 is 40 data values. You can count the data values that fell between 65.95 and 67.95 to see that there were 40 values. A frequency histogram of this data would replace the 0.4 on the vertical scale with a 40, etc.
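The link between the two vertical scales can be illustrated with a small sketch. The counts below are the number of heights in each of the eight intervals, tallied from the data in this example; the variable names are illustrative.

```python
frequencies = [5, 3, 15, 40, 17, 12, 7, 1]  # counts per interval, tallied from the data
n = sum(frequencies)                        # 100 data values

# Dividing every count by n rescales the vertical axis only.
relative = [f / n for f in frequencies]
print(relative[3])   # 0.4, the tallest bar

# Multiplying a relative frequency by n recovers the count.
recovered = [round(r * n) for r in relative]
print(recovered == frequencies)   # True
```

Since every bar is divided by the same number, the bars keep the same proportions, which is why the two histograms have the same shape.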

We constructed the relative frequency histogram by hand, but a computer program or Internet applet that creates a histogram will most often label the x-axis using whole numbers, even though the intervals are calculated behind the scenes as we have described, so that each data value falls into exactly one interval. When the computer program marks the x-axis in whole numbers, it would be impossible to know into which interval a boundary data value fell. In general, we will not worry about whether the x-axis is marked using whole numbers or marked as we have specified, because the histogram is a general visual display that gives us a sense of where the data is centered and how it is spread. If we need the details, we have the data at hand.

Example 2: Creating a Histogram by Hand using Discrete Data

Books Purchased: The following data are the number of books bought by 50 part-time college students at this college. The number of books is discrete data, since books are counted. Create a frequency histogram.

1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 2; 2; 2; 2; 2; 2; 2; 2; 2; 2; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 4; 4; 4; 4; 4; 4; 5; 5; 5; 5; 5; 6; 6

Eleven students bought one book. Ten students bought two books. Sixteen students bought three books. Six students bought four books. Five students bought five books. Two students bought six books.
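The tally above can be reproduced with Python's `collections.Counter`. This is a minimal sketch; the string `raw` simply restates the 50 data values from this example.

```python
from collections import Counter

raw = ("1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 2; 2; 2; 2; 2; 2; 2; 2; 2; 2; "
       "3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; "
       "4; 4; 4; 4; 4; 4; 5; 5; 5; 5; 5; 6; 6")
books = [int(x) for x in raw.split(";")]

counts = Counter(books)   # maps each number of books to how often it occurs
print(len(books))         # 50
print(sorted(counts.items()))
# [(1, 11), (2, 10), (3, 16), (4, 6), (5, 5), (6, 2)]
```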

Because the data are positive integers, we can set up intervals so that every data value falls into exactly one interval: subtract 0.5 from 1, which is the smallest data value, and add 0.5 to 6, which is the largest data value. Then the first interval begins at 0.5 and the final interval ends at 6.5.

Next, calculate the width of each bar or class interval. If the data are discrete and there are not too many different values, a width that places the data values in the middle of the bar or class interval is the most convenient. Since the data consist of the numbers 1, 2, 3, 4, 5, and 6, and the left-most interval starting point is 0.5, a width of one unit places the data value 1 in the middle of the interval from 0.5 to 1.5, the data value 2 in the middle of the interval from 1.5 to 2.5, the 3 in the middle of the interval from 2.5 to 3.5, etc.

Calculate the number of bars as follows: \frac{6.5-0.5}{\text{number of bars}} = 1 where 1 is the width of a bar. Therefore, we would use 6 bars.
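The same setup can be written out as a short sketch, assuming a bar width of one unit as chosen above. The variable names are illustrative.

```python
smallest, largest, width = 1, 6, 1

start = smallest - 0.5   # 0.5
end = largest + 0.5      # 6.5

num_bars = int((end - start) / width)
edges = [start + i * width for i in range(num_bars + 1)]
midpoints = [(lo + hi) / 2 for lo, hi in zip(edges, edges[1:])]

print(num_bars)    # 6
print(edges)       # [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5]
print(midpoints)   # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```

Each data value 1 through 6 sits at the midpoint of its own bar, which is what makes this width convenient for discrete data.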

The following histogram displays the number of books on the horizontal axis and the frequency on the vertical axis. In this case the most frequent number of books purchased is 3, occurring {\frac{16}{50}} = 0.32 or 32% of the time, and purchasing fewer than 3 books occurs more often than purchasing more than 3 books.

Notice you can tell that there were 11 data values in the first interval, 10 data values in the second interval, 16 data values in the third interval, etc. This histogram is called a frequency histogram, but a relative frequency histogram would have the exact same shape. The only difference would be that the y-axis would be labeled using the relative frequencies. There were 50 data values, so a frequency of 2 would correspond to a relative frequency of {\frac{2}{50}} = 0.04, etc.

Histogram consists of 6 bars with the y-axis in increments of 2 from 0-16 and the x-axis in intervals of 1 from 0.5-6.5.
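The observations made about this histogram can be verified from the tallies. The sketch below uses the counts stated earlier in the example; `Fraction` keeps the relative frequencies exact, and the variable names are illustrative.

```python
from fractions import Fraction

counts = {1: 11, 2: 10, 3: 16, 4: 6, 5: 5, 6: 2}  # tallies stated in the example
n = sum(counts.values())                          # 50 students

relative = {k: Fraction(v, n) for k, v in counts.items()}

most_common = max(counts, key=counts.get)
fewer = sum(v for k, v in counts.items() if k < most_common)
more = sum(v for k, v in counts.items() if k > most_common)

print(most_common, float(relative[most_common]))  # 3 0.32
print(float(relative[6]))                         # 0.04
print(fewer, more)                                # 21 13
```

Twenty-one students bought fewer than 3 books and thirteen bought more than 3, which matches the comparison made in the text.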

Example 3: Construct a Frequency Histogram

Data were collected for children who were having frequent headaches. The data are the number of hours spent playing video games on a hand-held device or a computer over a weekend. Using this data set, construct a histogram.

Number of Hours Spent Playing Video Games on a Weekend
9.95 10 2.25 16.75 0
19.5 22.5 7.5 15 12.75
5.5 11 10 20.75 17.5
23 21.9 24 23.75 18
20 15 22.9 18.8 20.5

Answer: One possible histogram uses five intervals.

This is a histogram that matches the supplied data. The x-axis consists of 5 bars in intervals of 5 from 0 to 25. The y-axis is marked in increments of 1 from 0 to 10. The x-axis shows the number of hours spent playing video games on the weekends, and the y-axis shows the number of students.

The children in this sample have been experiencing headaches. We can see that the most frequent number of hours spent playing video games on a weekend falls in the highest interval, 20 hours or more. The data is more concentrated at a higher number of hours.

Notice that the horizontal axis labels the number of hours using whole numbers. This means some values in this data set fall on boundaries for the class intervals. A close look at the histogram shows that a value is counted in a class interval if it falls on the left boundary, but not if it falls on the right boundary. Different researchers may set up histograms for the same data in different ways. There is more than one correct way to set up a histogram.
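The left-boundary convention just described can be sketched as follows, assuming five intervals of width 5 from 0 to 25. The variable names are illustrative, and other software may treat boundary values differently.

```python
hours = [9.95, 10, 2.25, 16.75, 0,
         19.5, 22.5, 7.5, 15, 12.75,
         5.5, 11, 10, 20.75, 17.5,
         23, 21.9, 24, 23.75, 18,
         20, 15, 22.9, 18.8, 20.5]

edges = [0, 5, 10, 15, 20, 25]

# A value on a boundary is counted in the interval to its right: lo <= x < hi.
counts = [sum(lo <= x < hi for x in hours)
          for lo, hi in zip(edges, edges[1:])]

print(counts)   # [2, 3, 4, 7, 9]
```

Under this convention the values 10, 15, and 20 are each counted in the interval that begins at that value. A tool that counted boundary values on the right instead would give different bar heights for the same data.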

Videos

YouTube Khan Academy Video Histograms

License

Introduction to Statistics for Engineers Copyright © by Vikki Maurer & Jeff Crabill & Linn-Benton Community College is licensed under a Creative Commons Attribution 4.0 International License, except where otherwise noted.
