2.2 Measures of Location

Doctors keep track of the heights and weights of children as they grow. A common bit of information given to parents is the percentile for the child’s height or weight. A child in the 10th percentile for weight weighs less than most of the children of the same age. A child in the 80th percentile for height is taller than most of the children of the same age. What else can we say about these sorts of measures? Knowing how measurements like these are calculated will help us learn how to interpret and communicate them.

Percentiles and Quartiles

The common measures of location for a data set are percentiles and quartiles. Data sets often are presented in the order the data was collected. When a data set is ordered from smallest data value to largest, we can identify the middle of the data set to see what data values fall above or below that location. We can identify the data that fall into the lowest 10% or the middle 50% of all the data values, for example.

For a number p between 0 and 100, we define the pth percentile as the location for which p% of the data values are less than or equal to the pth percentile. For a child whose weight is in the 8th percentile, it means only 8% of all the children whose weights have been recorded have weights less than or equal to the child’s weight. It also means that 92% of all children whose weights have been recorded have weights greater than or equal to the child’s weight.

Quartiles are special percentiles. The first quartile, Q1, is the same as the 25th percentile, and the third quartile, Q3, is the same as the 75th percentile. The median, M, is called both the second quartile and the 50th percentile.

To calculate quartiles and percentiles, the data must be ordered from smallest to largest. Quartiles divide ordered data into quarters. Percentiles divide ordered data into hundredths. To score in the 90th percentile of an exam does not necessarily mean that you received 90% on the exam. It means that 90% of test scores are the same as or less than your score and 10% of the test scores are the same as or greater than your score.

Percentiles are useful for making comparisons. In the past, universities and colleges have used percentiles from standardized tests extensively. One instance is how colleges and universities used the standardized test called the SAT. A school would set a minimum percentile score and use it as an acceptance factor. For example, if a school accepted any student with an SAT score at or above the 75th percentile, it meant the accepted students scored as well as or better than 75% of all the students who took the test that year, so the school would accept the students who scored in the top 25% of those who took the test.

Percentiles are mostly used with very large populations. Therefore, if you were to say you scored in the 90th percentile, then you can interpret it by saying 90% of the test scores are less than your score, rather than saying “less than or equal to” yours. In a large data set it is appropriate and acceptable to eliminate the “or equal to” interpretation because removing one particular data value would not significantly change the calculation or percentile locations.

The median is a number that measures the “center” of the data. You can think of the median as the “middle value,” but it does not actually have to be one of the observed values. It is a number that separates ordered data into halves. Half the values are the same number or smaller than the median, and half the values are the same number or larger. For example, consider the following small set of data.

1; 11.5; 6; 7.2; 4; 8; 9; 10; 6.8; 8.3; 2; 2; 10; 1

Ordered from smallest to largest:

1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5

Since there are 14 observations, the median is between the seventh value, 6.8, and the eighth value, 7.2. To find the median, add the two values together and divide by two.

[latex]\frac{6.8+7.2}{2}=7[/latex]    The median is seven. Half of the values are smaller than seven and half of the values are larger than seven.
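As a quick check of the arithmetic above, here is a minimal Python sketch. It assumes the standard library's `statistics.median`, which averages the two middle values when the count is even; the variable names are only illustrative.

```python
from statistics import median

# The 14 observations from the text, in the order they were listed.
data = [1, 11.5, 6, 7.2, 4, 8, 9, 10, 6.8, 8.3, 2, 2, 10, 1]

ordered = sorted(data)   # order from smallest to largest
n = len(ordered)         # 14 observations, an even count

# With an even count, the median is the average of the two middle values.
lower_middle = ordered[n // 2 - 1]   # seventh value
upper_middle = ordered[n // 2]       # eighth value
by_hand = (lower_middle + upper_middle) / 2

print(ordered)
print(lower_middle, upper_middle, by_hand)
print(median(data))      # library result, for comparison with by_hand
```

Both the hand calculation and the library call give a median of seven for this data set.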

Quartiles are numbers that separate the data into quarters. Quartiles may or may not be part of the data. To find the quartiles, first find the median, also called the second quartile or 50th percentile. The first quartile, Q1, is the middle value of the lower half of the data, and the third quartile, Q3, is the middle value of the upper half of the data. To get the idea, consider the same data set:

1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5

The median or second quartile is seven. The lower half of the data are 1, 1, 2, 2, 4, 6, 6.8. There are seven data values, so the middle value of the lower half is two, which is the fourth data value.

1; 1; 2; 2; 4; 6; 6.8

The number two, which is part of the data, is the first quartile. One-fourth of the entire set of data values are the same as or less than two and three-fourths of the entire set of data values are more than two.

The upper half of the data is 7.2, 8, 8.3, 9, 10, 10, 11.5. The middle value of the upper half is nine.

The third quartile, Q3, is nine. Three-fourths (75%) of the ordered data set are less than nine. One-fourth (25%) of the ordered data set are greater than nine. The third quartile is part of the data set in this example.
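The median-of-halves procedure just described can be sketched in Python. The function name `quartiles_by_halves` is hypothetical, and one detail is an assumption: when the number of values is odd, the median itself is left out of both halves. Software packages often use other quartile conventions, so their answers may differ slightly.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, median, Q3) using the median-of-halves method.

    Assumption: when the number of values is odd, the median itself
    is left out of both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two data values")
    half = n // 2
    lower = ordered[:half]        # values below the median position
    upper = ordered[n - half:]    # values above the median position
    return median(lower), median(ordered), median(upper)

data = [1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5]
q1, q2, q3 = quartiles_by_halves(data)
print(q1, q2, q3)
```

For the 14 values above this reproduces Q1 = 2, a median of 7, and Q3 = 9.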

The Interquartile Range

The interquartile range is a number that indicates the spread of the middle half or the middle 50% of the data. It is the difference between the third quartile (Q3) and the first quartile (Q1).

[latex]\textrm{IQR} = Q_3 - Q_1[/latex]

Defining an “Outlier”

The IQR can help determine if a data value qualifies as an outlier, that is, a data value that is much larger or much smaller than the rest. There are formal definitions used to determine if a data value can be categorized as an outlier. One generally accepted test is to see if the data value is at least 1.5(IQR) greater than the third quartile or at least 1.5(IQR) smaller than the first quartile. In other words, a data value can be labeled as an outlier if it meets either of the following conditions:

[latex]\textrm{Data Value} \geq Q_{3} + 1.5( \textrm{IQR} )[/latex]

[latex]\text{Data Value} \leq Q_{1} - 1.5 ( \text{IQR})[/latex]

 

Example 1: Quartiles and Outlier

For the following 11 salaries of employees at a small company, calculate each quartile and the IQR and determine if any salaries seem to be outliers. The salaries are in dollars.

$33,000;  $64,500;  $28,000;  $54,000;  $72,000;  $68,500;  $69,000;  $42,000;  $54,000;  $120,000;  $40,500

Answers:

First order the 11 data from smallest to largest.

$28,000;  $33,000;  $40,500;  $42,000;  $54,000;  $54,000;  $64,500;  $68,500;  $69,000;  $72,000;  $120,000

Because there is an odd number of data values, the median is the middle, or 6th, salary. The first quartile is the 3rd salary, and the third quartile is the 9th salary.

Q1 = $40,500

Q2 = Median = $54,000

Q3 = $69,000

IQR = $69,000 – $40,500 = $28,500

(1.5)( IQR) = (1.5)($28,500) = $42,750

Q1 – (1.5)(IQR) = $40,500 – $42,750 = –$2,250

Q3 + (1.5)(IQR) = $69,000 + $42,750 = $111,750

Notice there is no low salary that would qualify as an outlier. However, $120,000 is more than $111,750, so $120,000 is a potential outlier. There is likely a good reason for a company to have such a large salary compared to the rest. This could be the salary of the president, for example. We cannot assume outliers are mistakes but they should be noted.
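The calculations in this example can be reproduced with a short Python sketch. It assumes the quartiles are taken as the median of each half with the overall median left out (which is how the 3rd and 9th salaries arise here); the variable names are illustrative only.

```python
from statistics import median

salaries = [33000, 64500, 28000, 54000, 72000, 68500,
            69000, 42000, 54000, 120000, 40500]

ordered = sorted(salaries)
half = len(ordered) // 2

# Quartiles as in the example: the median of each half, with the overall
# median left out of both halves because the count (11) is odd.
q1 = median(ordered[:half])
q3 = median(ordered[len(ordered) - half:])
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Flag values at or beyond either fence, matching the two inequalities.
outliers = [x for x in ordered if x <= lower_fence or x >= upper_fence]

print(q1, q3, iqr)
print(lower_fence, upper_fence)
print(outliers)
```

The only salary flagged is $120,000, in agreement with the conclusion above.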

These statistics we are calculating tell us a lot about data sets and can help interpret differences between data sets. We should stay mindful that when we are looking at the data from a sample, we are hoping to make generalizations that apply to the population. If we have a representative sample, then we can feel pretty confident that our sample information can be used to describe the population.

Example 2: Quartiles and Sample Comparisons

The calorie content of standard-sized hot dogs, whether classic-style or low-fat, varies by manufacturer. Consider the following data from five classic-style hot dogs and six low-fat hot dogs. Find the quartiles and the IQR for each data set and interpret the results.

Classic-Style Hot Dogs Calories: 160, 150, 240, 150, 248

Low-Fat Hot Dogs Calories: 80, 70, 60, 45, 70, 45

Answer:

For classic-style hot dogs, the ordered data is 150, 150, 160, 240, 248.
Q1 = 150, Q2 = 160, Q3 = 244, and IQR = 94

For low-fat hot dogs, the ordered data is 45, 45, 60, 70, 70, 80.
Q1 = 45, Q2 = 65, Q3 = 70, and IQR = 25

The IQR of 94 calories for the classic-style hot dogs compared to 25 calories for the low-fat hot dogs indicates that the calorie content is much more variable for the classic-style hot dogs. Half of the classic-style hot dogs are between 150 calories and 244 calories, while half of the low-fat hot dogs are between 45 calories and 70 calories.

In the last example, notice that there were data values that repeated. Two of the low-fat hot dogs sampled had 45 calories and two had 70 calories. When data values are repeated, every occurrence must be counted. Each is an individual data value that brings critical information to the sample. Sometimes data are summarized in a table that lists how often each value occurs. In that case, greater care must be taken to ensure the quartiles take the frequencies into account.

Example 3: Frequency Table and Quartiles

Fifty nursing students were asked how much sleep they get per school night (rounded to the nearest hour). The results are given in the following table. Find and interpret the quartiles.

| Amount of Sleep per School Night (Hours) | Frequency | Relative Frequency | Cumulative Relative Frequency |
|---|---|---|---|
| 4 | 2 | 0.04 | 0.04 |
| 5 | 5 | 0.10 | 0.14 |
| 6 | 7 | 0.14 | 0.28 |
| 7 | 12 | 0.24 | 0.52 |
| 8 | 14 | 0.28 | 0.80 |
| 9 | 7 | 0.14 | 0.94 |
| 10 | 3 | 0.06 | 1.00 |

Note: When we add up all of the frequencies in the table, we verify the sample contained 50 students. Each relative frequency is calculated by dividing the corresponding frequency by the sample size. The 2 students who slept for 4 hours each night make up [latex]{\frac{2}{50}=0.04}[/latex] or 4% of the sample. The 5 students who slept for 5 hours each night make up [latex]{\frac{5}{50}=0.10}[/latex] or 10% of the sample. The right-most “Cumulative Relative Frequency” column accumulates the relative frequencies as we go down the list, so the seven students who slept for 4 or 5 hours each night together make up 14% of the sample. By the time we have reached the last entry in the table, we have accumulated 100% of the sample.
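The two right-hand columns of the table can be rebuilt from the frequencies alone. The following Python sketch shows one way to do it; the dictionary layout and variable names are assumptions made for illustration.

```python
from itertools import accumulate

# Hours of sleep -> number of students, copied from the table.
frequency = {4: 2, 5: 5, 6: 7, 7: 12, 8: 14, 9: 7, 10: 3}

n = sum(frequency.values())                       # sample size
running_counts = list(accumulate(frequency.values()))

rows = []
for (hours, count), running in zip(frequency.items(), running_counts):
    relative = count / n
    # Divide the running count by n instead of adding rounded decimals.
    cumulative = running / n
    rows.append((hours, count, round(relative, 2), round(cumulative, 2)))

for row in rows:
    print(row)
```

Accumulating the counts first and dividing afterward keeps the last cumulative entry at exactly 1.00.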

Answer:

Find the first quartile. The first quartile is the same as the 25th percentile. Notice the 0.28 in the “cumulative relative frequency” column is close to what we want. Twenty-eight percent of the 50 data values is made up of the first 14 values. They include the two 4s, the five 5s, and the seven 6s. Twenty-five percent of 50 is 12.5, so the 25th percentile is between the 12th and 13th data values, both of which are six hours. The 25th percentile, or first quartile, is therefore 6 hours.

Find the median. Look again at the “cumulative relative frequency” column and find 0.52. This is close to what we are looking for. The median is the 50th percentile. Because there are 50 data values, the first 25 values make up the lower half of the data set and the last 25 values make up the upper half of the data set. There are 25 values at or below the median and 25 values at or above it. The median is between the 25th and 26th values, both of which are seven hours. Thus the median is 7 hours and we can conclude half of the nursing students sleep for at most 7 hours on a school night.

Find the third quartile. The third quartile is the same as the 75th percentile. You can “eyeball” this answer. If you look at the “cumulative relative frequency” column, you find 0.52 and 0.80. When you have all the fours, fives, sixes and sevens, you have 52% of the data. When you include all the 8s, you have 80% of the data. The 75th percentile, then, must be an eight. Another way to look at the problem is to find 75% of 50, which is 37.5, and round up to 38. The third quartile, Q3, is the 38th value, which is an eight. You can check this answer by counting the values. (There are 37 values below the third quartile and 12 values above.)
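The “eyeball” approach of scanning the cumulative column can be written as a small Python sketch. The function name `percentile_from_table` is hypothetical. It returns the first value whose cumulative percentage reaches p, so it does not average two neighboring values when p% of the data falls exactly on a boundary between them.

```python
def percentile_from_table(frequency, p):
    """Return the smallest value whose cumulative percentage reaches p.

    frequency: dict mapping each data value to its count.
    p: percentile between 0 and 100.
    """
    n = sum(frequency.values())
    running = 0
    for value in sorted(frequency):
        running += frequency[value]
        # Compare counts rather than decimals to avoid rounding problems.
        if running * 100 >= p * n:
            return value
    raise ValueError("p must be at most 100")

sleep = {4: 2, 5: 5, 6: 7, 7: 12, 8: 14, 9: 7, 10: 3}
q1 = percentile_from_table(sleep, 25)
med = percentile_from_table(sleep, 50)
q3 = percentile_from_table(sleep, 75)
print(q1, med, q3)
```

For the sleep data this gives 6, 7, and 8 hours, matching the three quartiles found above.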

If you were to do a little research, you would find several formulas for calculating percentiles. Let’s explore one such formula.

A Formula for Finding the kth Percentile

k = the percentile to be found (for example, k = 25 for the 25th percentile)

i = the index (position of a data value, after data has been ordered)

n = the total number of data values

Step 1: Order the data from smallest to largest.

Step 2: Calculate i = [latex]\frac{k}{100}(n+1)[/latex]

Step 3: Use the index i to determine in which position to look:

If i  is a whole number, then the kth percentile is the data value in the ith position in the ordered set of data.

If i is not a whole number, then average the data values in the two positions nearest to i: the position found by rounding i down and the position found by rounding i up.

 

Example 4: Using a Formula to Calculate Percentiles

The age in years when diagnosed with Type-1 diabetes was recorded for 29 patients taking part in an experimental drug treatment group. Previous studies indicated that half of adults are diagnosed after the age of 30. Find the 50th percentile and the first quartile.

Ages in Years:  18; 21; 22; 25; 26; 27; 29; 30; 31; 33; 36; 37; 41; 42; 47; 52; 55; 57; 58; 62; 64; 67; 69; 71; 72; 73; 74; 76; 77

Answer:

For the 50th percentile: k = 50 and n = 29. Calculate i, the value of the index.

i = [latex]\frac{k}{100}(n+1)=\frac{50}{100}(29+1) = 15[/latex]. Because i = 15 is an integer, the data value in the 15th position in the ordered data set is the 50th percentile. The 50th percentile is 47 years. This means half of the adults in this group were diagnosed after the age of 47 years.

The first quartile is the 25th percentile: k = 25 and n = 29. Calculate i, the value of the index.

i = [latex]{\frac{k}{100}(n+1)=\frac{25}{100}(29+1) = 7.5}[/latex], which is NOT a whole number, so we will average the data values in the 7th and 8th position. The age in the 7th position is 29 and the age in the 8th position is 30. The 25th percentile is  [latex]{\frac{29+30}{2}=29.5}[/latex] years. For the adults in this study, 75% were diagnosed after about age 30 years.
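The three steps of the formula, and the two results in this example, can be checked with a Python sketch. The function name `kth_percentile` is hypothetical. The text does not say what to do when the index falls before the first position or after the last one, so raising an error in that case is an assumption.

```python
from fractions import Fraction
from math import ceil, floor

def kth_percentile(values, k):
    """kth percentile using the index i = (k / 100) * (n + 1)."""
    ordered = sorted(values)                # Step 1: order the data
    n = len(ordered)
    i = Fraction(k) * (n + 1) / 100         # Step 2: exact index, no rounding
    # Assumption: an index outside positions 1..n is rejected.
    if i < 1 or i > n:
        raise ValueError("index falls outside the data")
    if i.denominator == 1:                  # Step 3: i is a whole number
        return ordered[int(i) - 1]          # positions in the text start at 1
    below = ordered[floor(i) - 1]           # position from rounding i down
    above = ordered[ceil(i) - 1]            # position from rounding i up
    return (below + above) / 2

ages = [18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47,
        52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77]

print(kth_percentile(ages, 50))   # index 15, the 15th value
print(kth_percentile(ages, 25))   # index 7.5, average of 7th and 8th values
```

Using exact fractions for the index avoids treating a value such as 7.5 as a whole number, or the reverse, because of floating-point rounding.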

Interpreting Percentiles, Quartiles, and Median

A percentile indicates the relative standing of a data value when data are sorted into numerical order from smallest to largest. About p% of the data values are less than or equal to the pth percentile. For example, 15% of data values are less than or equal to the 15th percentile.

  • Low percentiles always correspond to lower data values.
  • High percentiles always correspond to higher data values.
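To make the “less than or equal to” reading concrete, the following Python sketch counts the share of values at or below a given number, using the ages and the two percentiles from Example 4. The function name `percent_at_or_below` is hypothetical.

```python
def percent_at_or_below(values, x):
    """Percent of data values that are less than or equal to x."""
    values = list(values)
    if not values:
        raise ValueError("need at least one data value")
    return 100 * sum(1 for v in values if v <= x) / len(values)

ages = [18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47,
        52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77]

first_quartile_share = percent_at_or_below(ages, 29.5)   # 25th percentile
median_share = percent_at_or_below(ages, 47)             # 50th percentile
print(round(first_quartile_share, 1), round(median_share, 1))
```

With only 29 values the shares come out to about 24.1% and 51.7% rather than exactly 25% and 50%. In a small data set a percentile statement is approximate, and the approximation improves as the data set grows.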

A percentile may or may not correspond to a value judgment about whether it is “good” or “bad.” The interpretation of whether a certain percentile is “good” or “bad” depends on the context of the situation to which the data applies. In some situations, a low percentile would be considered “good”; in other contexts, a high percentile might be considered “good.” In many situations, there is no value judgment that applies.

Understanding how to interpret percentiles properly is important not only when describing data, but also when calculating probabilities in later chapters of this text.

Guideline

When writing the interpretation of a percentile in the context of the given data, be sure to include the following information:

  • the context of the situation being considered
  • the data value (value of the variable) that represents the percentile
  • the percent of individuals or items with data values below the percentile
  • the percent of individuals or items with data values above the percentile

Example 5:  Interpreting Quartiles and Percentiles

  1. On a timed exam, the first quartile for time it took to finish the exam was 35 minutes. Interpret the first quartile in the context of this situation.
  2. For the 100-meter dash, the third quartile for finishing times was 11.5 seconds. Interpret the third quartile in the context of the situation.
  3. On a 20 question test, the 70th percentile for number of correct answers was 16. Interpret the 70th percentile in the context of this situation.

Answers:

  1. Twenty-five percent of students finished the timed exam in 35 minutes or less. Seventy-five percent took at least 35 minutes. Notice we cannot attribute anything positive or negative about finishing faster or taking longer. We have no information about the results of the exam other than the times.
  2. Twenty-five percent of runners finished the race in 11.5 seconds or longer. Seventy-five percent of runners finished the race in 11.5 seconds or less. In this case, it is likely desirable to finish the race faster.
  3. Seventy percent of students answered 16 or fewer questions correctly on a 20-question test. Thirty percent of students answered 16 or more questions correctly. An instructor may use this information to generalize about the difficulty of the exam or the preparedness of the students, but we do not have any other information to offer such conclusions.




License


Introduction to Statistics for Engineers Copyright © by Vikki Maurer & Jeff Crabill & Linn-Benton Community College is licensed under a Creative Commons Attribution 4.0 International License, except where otherwise noted.