1.3 Sampling Methods and Data

When do we need a sample? The answer is, not always. There are times when we might be able to consider data from an entire population, without too much cost or without it taking too much time. If our interest or research question has to do with a population we can access completely, then we do not need to take a sample. Consider a street of residential homes. There might be the need to gather opinions about installing speed bumps, so surveying every person who lives along that street is possible. In that case, the data gathered comes from the entire population and we would just have to think about how to communicate and summarize the data.  There are other times when the population is too vast or impossible to gather information from. Consider wanting to know how a new cholesterol-lowering drug affects adults. It would be impossible to give every adult in the world the drug and measure its effectiveness. For situations in which gathering information about the population is impossible or unethical, we take a sample. This section explores vocabulary surrounding data and several common sampling methods used to collect data.

Data Types

Data falls into two general categories: qualitative or quantitative.

Qualitative data are the result of categorizing or describing attributes of a population. Hair color, blood type, ethnic group, the car a person drives, and the street a person lives on are examples of qualitative data. Qualitative data are generally described by words or letters. For instance, hair color might be black, dark brown, light brown, blonde, gray, or red. Blood type might be AB+, O-, or B+. Researchers often prefer to use quantitative data over qualitative data because it lends itself more easily to mathematical analysis. For example, it does not make sense to find an average hair color or blood type.

Quantitative data are always numbers. Quantitative data are the result of counting or measuring attributes of a population. Amount of money, pulse rate, weight, number of people living in your town, and number of students who take statistics are examples of quantitative data. Quantitative data may be either discrete or continuous.

All data that are the result of counting are called quantitative discrete data. These data take on only certain numerical values. If you count the number of phone calls you receive for each day of the week, you might get values such as zero, one, two, or three. The number of phone calls is discrete.

All data that are the result of measuring are quantitative continuous data, assuming that we can measure accurately. Measuring angles in radians, distances in meters, or mass in grams might result in such numbers as \frac{\pi}{6}, 0.23, or 154.375. The measurements can theoretically take on any value on an interval. For a package shipped by Amazon, the number of items in the package is quantitative discrete data, but the weight of the package is quantitative continuous data.

Example 1: Classifying Data

For each situation, classify the data as qualitative, quantitative continuous, or quantitative discrete.

  1. A survey was sent to a representative sample of homes in Oregon and the number of people living in the home was recorded.
  2. During flu season, body temperature of 100 emergency room patients is sampled.
  3. In a run of 10,000 square yards of fabric, the number of flaws is recorded.
  4. A language-learning app records all the languages users are actively accessing.
  5. An airline trying a new passenger loading method records the time it takes to get every passenger in their seats.
  6. At birth a baby’s weight is recorded in pounds.

Answers: (1) quantitative discrete, (2) quantitative continuous, (3) quantitative discrete, (4) qualitative, (5) quantitative continuous, (6) quantitative continuous

 

Omitting Categories and Missing Data

We will spend time in this course learning how to display and interpret data, as well as how to use it to make decisions. While we are discussing data and sampling, we should also say something about leaving data out or, more to the point, about not leaving data out.

Consider a sample of 24,382 people surveyed about ethnicity, which, as we know, is qualitative data. The results of the survey are provided in Table 1 and displayed in the bar graph labeled Figure 1. Notice that the total of the responses does not add to 24,382. Without looking carefully, we might miss that detail. The “Other/Unknown” category is completely missing. This missing category contains people who did not feel they fit into any of the ethnicity categories or who declined to respond entirely. This is important information. Is it a problem to leave off a category? In general, yes, it is a problem to leave off a category or leave out data entirely, especially when we want to summarize and interpret the data we are working with.

Table 1:
Ethnicity           Frequency   Percent
Asian               8,794       36.1%
Black               1,412       5.8%
Filipino            1,298       5.3%
Hispanic            4,180       17.1%
Native American     146         0.6%
Pacific Islander    236         1.0%
White               5,978       24.5%
TOTAL               22,044      90.4%
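
The gap between the total in Table 1 and the number of people surveyed can be checked with a little arithmetic. The following is a minimal Python sketch using the frequencies from Table 1; the variable names are chosen here only for illustration.

```python
# Frequencies copied from Table 1; the variable names are illustrative.
surveyed = 24382
frequencies = {
    "Asian": 8794,
    "Black": 1412,
    "Filipino": 1298,
    "Hispanic": 4180,
    "Native American": 146,
    "Pacific Islander": 236,
    "White": 5978,
}

reported = sum(frequencies.values())   # people who appear in the table
missing = surveyed - reported          # people left out of the table
missing_percent = round(100 * missing / surveyed, 1)

print("Reported in table:", reported)
print("Left out of table:", missing)
print("Percent left out: ", missing_percent)
```

A check like this is a quick way to notice that a category has been dropped before a table or graph is interpreted.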

Figure 1: Missing a Category

Now consider Figure 2, the bar graph that does contain the “Other/Unknown” category, which has a response percent of 9.6%. The “Other/Unknown” category is actually large compared to some of the other categories (Native American, 0.6%, Pacific Islander 1.0%). This is important to know when we think about what the data are telling us.

It can still be difficult to interpret the information displayed in Figure 2 because we must visually compare the sizes of the bars while also keeping track of the different categories. If we reorganize the bars from largest percentage to smallest percentage, we transform the graph into a Pareto chart. The graph in Figure 3 is a Pareto chart; because its bars are sorted from largest to smallest, it is easier to read and interpret.

Figure 2: All Categories Present

Figure 3: Pareto Chart
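
The reordering behind a Pareto chart amounts to sorting the categories by percentage. The sketch below assumes the percentages from Table 1 together with the 9.6% "Other/Unknown" category; the names `percents` and `pareto_order` are illustrative, and the code only prints the order of the bars rather than drawing the chart.

```python
# Percentages from Table 1 plus the 9.6% "Other/Unknown" category.
percents = {
    "Asian": 36.1,
    "Black": 5.8,
    "Filipino": 5.3,
    "Hispanic": 17.1,
    "Native American": 0.6,
    "Pacific Islander": 1.0,
    "White": 24.5,
    "Other/Unknown": 9.6,
}

# A Pareto chart orders the bars from largest to smallest.
pareto_order = sorted(percents.items(), key=lambda item: item[1], reverse=True)

for category, percent in pareto_order:
    print(f"{category:<18}{percent:>5.1f}%")
```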

In order to fully understand and interpret data, be sure to include all the data and every category, unless you know for certain that a data value is actually a mistake.

Random Sampling

Gathering information about an entire population often costs too much or is virtually impossible. Instead, we use a representative sample of the population. A representative sample should have the same characteristics as the population it is representing. Most statisticians use various methods of random sampling in an attempt to achieve this goal. This section describes a few of the most common methods, each of which has pros and cons. In each form of random sampling, each member of a population initially has an equal chance of being selected for the sample.

The easiest method to describe is called a simple random sample. If the simple random sampling technique is used, any group of n individuals is as likely to be chosen as any other group of n individuals. In other words, each sample of the same size has an equal chance of being selected.

For example, suppose your instructor wants to form a four-person team from a class of 30 students for a project. To choose a simple random sample of size four from the class, the instructor puts the 30 names on a numbered list, so that every name is associated with a number from 1 to 30, as displayed in Table 2. The instructor can then use a random number generator (an app or website, such as random.org) to generate four different random numbers from 1 to 30. This ensures that every possible combination of four students has exactly the same likelihood of being selected for the project.

Table 2: Class Roster With an Assigned Number

ID    Name
00    Anselmo
01    Bautista
02    Bayes
03    Cheng
04    Cuarismo
05    Cunningham
06    Fontecha
07    Hong
08    Hoobler
09    Jiao
...
30    Khan

In this case, the random number generator produced the following numbers: 9, 11, 21, and 3. The students selected for the sample are the students whose names are associated with those numbers.
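
The instructor's procedure can also be sketched in a few lines of Python instead of using a website. This is one possible approach, assuming the students are numbered 1 to 30; the variable names are illustrative, and the numbers drawn will differ from run to run.

```python
import random

class_size = 30   # students are numbered 1 to 30
team_size = 4     # size of the simple random sample

# random.sample draws without repeats, so no student is picked twice,
# and every group of four numbers is equally likely to be chosen.
chosen = random.sample(range(1, class_size + 1), team_size)

print("Selected student numbers:", sorted(chosen))
```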

Other Sampling Methods

Besides simple random sampling, there are other forms of sampling that involve a chance process for getting the sample. Other well-known random sampling methods are the stratified sample, the cluster sample, and the systematic sample.

To choose a stratified sample, divide the population into non-overlapping groups, called strata, and then take a proportional number from each stratum. For example, you could stratify (group) your college population by department and then choose a proportional (same percentage) simple random sample from each stratum (each department) to get a stratified random sample. To choose the simple random sample from each department, number the members of that department and then use simple random sampling to choose the same percentage of numbers from it. The members whose numbers are picked, taken together across all of the departments, make up the stratified sample.
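
A minimal Python sketch of a proportional stratified sample follows. The department names, their sizes, and the 10% sampling fraction are hypothetical values invented for illustration; they do not come from a real college.

```python
import random

# Hypothetical strata: department name -> numbered members of that department.
departments = {
    "Biology": list(range(1, 201)),      # members numbered 1 to 200
    "Engineering": list(range(1, 101)),  # members numbered 1 to 100
    "History": list(range(1, 51)),       # members numbered 1 to 50
}
fraction = 0.10  # assumed: the same percentage is taken from every stratum

stratified_sample = {}
for name, members in departments.items():
    k = round(fraction * len(members))   # proportional sample size
    # Simple random sample (no repeats) within this one stratum.
    stratified_sample[name] = random.sample(members, k)

for name, picked in stratified_sample.items():
    print(name, len(picked), sorted(picked))
```

Because the same fraction is used in every stratum, larger departments contribute more members to the sample, in proportion to their size.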

To choose a cluster sample, divide the population into clusters (groups) and then randomly select some of the clusters. All the members from these selected clusters are in the cluster sample. For example, suppose each department on a college campus is a single cluster. If you randomly sample four clusters (departments) from all of the clusters, the four departments make up the cluster sample and every person in each selected cluster is part of the sample. The clusters do not have to be the same size.
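
The cluster sample described above can be sketched in Python as follows. The six departments and their sizes are hypothetical, and the member labels are generated only so the example runs on its own.

```python
import random

# Hypothetical clusters: department name -> number of people in it.
cluster_sizes = {
    "Art": 3, "Biology": 5, "Chemistry": 4,
    "English": 6, "History": 2, "Math": 5,
}
clusters = {
    name: [f"{name}-{i}" for i in range(1, size + 1)]
    for name, size in cluster_sizes.items()
}

# Randomly select four whole clusters (departments).
selected = random.sample(list(clusters), 4)

# Everyone in a selected cluster is part of the sample.
cluster_sample = [person for name in selected for person in clusters[name]]

print("Selected departments:", selected)
print("Sample size:", len(cluster_sample))
```

Notice that the randomness applies to which departments are chosen, not to the individuals inside them, so the sample size depends on which clusters happen to be selected.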

To choose a systematic sample, randomly select a starting point and take every nth piece of data from a listing of the population. For example, suppose you have to do a phone survey. Your list contains 20,000 phone numbers. You must choose 400 phone numbers for the sample. Number the list of phone numbers 1 through 20,000 and use a simple random sample to pick a random number that represents the first element in the sample. From there choose every fiftieth (or other multiple) phone number thereafter until you have a total of 400 phone numbers (you might have to go back to the beginning of your list). Systematic sampling is frequently chosen because it is a simple method.
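
The phone survey example can be sketched in Python as shown below. This assumes the list positions are numbered 1 through 20,000 and that the step is found by dividing the list length by the sample size; the variable names are illustrative.

```python
import random

population_size = 20000   # phone numbers on the list
sample_size = 400         # phone numbers needed
step = population_size // sample_size   # 20,000 / 400 = 50

# Random starting position on the list (both endpoints are possible).
start = random.randint(1, population_size)

# Take every 50th position, wrapping back to the beginning of the list
# when the end is reached.
systematic_sample = [
    (start - 1 + i * step) % population_size + 1
    for i in range(sample_size)
]

print("Start:", start)
print("First five positions:", systematic_sample[:5])
```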

A type of sampling that is non-random is convenience sampling. Convenience sampling involves using results that are readily available. For example, a computer software store conducts a marketing study by interviewing potential customers who happen to be in the store browsing through the available software. The results of convenience sampling may be very good in some cases and highly biased (favor certain outcomes) in others.

Sampling data should be done very carefully. Collecting data carelessly can have devastating results. It is better for the person conducting the survey to select the sample respondents using a more random method.

Sampling With or Without Replacement

True random sampling is done with replacement. That is, once a member is picked, that member goes back into the population and thus may be chosen more than once. However, for practical reasons, in most populations, simple random sampling is done without replacement; that is, a member of the population may be chosen only once. Surveys are typically done this way. Most samples are taken from large populations, and the sample tends to be small in comparison to the population. In that case, sampling without replacement is approximately the same as sampling with replacement, because the chance of picking the same individual more than once with replacement is very low.

In a college population of 10,000 people, suppose you want to pick a sample of 1000 randomly for a survey. For any particular sample of 1000, if you are sampling with replacement,

  • the chance of picking the first person is 1000 out of 10,000 (0.1000);
  • the chance of picking a different second person for this sample is 999 out of 10,000 (0.0999);
  • the chance of picking the same person again is 1 out of 10,000 (very low).

If you are sampling without replacement,

  • the chance of picking the first person for any particular sample is 1000 out of 10,000 (0.1000);
  • the chance of picking a different second person is 999 out of 9999 (0.0999);
  • you do not replace the first person before picking the next person.

Compare the fractions \frac{999}{10,000} and \frac{999}{9999}. For accuracy, carry the decimal answers to four decimal places. To four decimal places, these numbers are equivalent (0.0999).

Sampling without replacement instead of sampling with replacement becomes a mathematical issue only when the population is small. For example, if the population is 25 people, the sample is ten, and you are sampling with replacement for any particular sample, then the chance of picking the first person is 10 out of 25, and the chance of picking a different second person is 9 out of 25 (you replace the first person).

If you sample without replacement, then the chance of picking the first person is 10 out of 25, and then the chance of picking the second person (who is different) is 9 out of 24 (you do not replace the first person).

Compare the fractions \frac{9}{25} and \frac{9}{24}. To four decimal places, \frac{9}{25} = 0.3600 and \frac{9}{24} = 0.3750. To four decimal places, these numbers are not equivalent.
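
The two comparisons above can be reproduced with a short Python sketch. The helper name `second_pick_chances` is hypothetical; it simply computes the chance of picking a different second person using the same fractions as the text.

```python
from fractions import Fraction

def second_pick_chances(population, sample):
    """Chance of picking a different second person for a particular sample,
    first with replacement and then without replacement."""
    with_replacement = Fraction(sample - 1, population)
    without_replacement = Fraction(sample - 1, population - 1)
    return with_replacement, without_replacement

# Large population (10,000 with a sample of 1000), then small (25 with a sample of 10).
for population, sample in [(10000, 1000), (25, 10)]:
    w, wo = second_pick_chances(population, sample)
    print(f"N={population}: with={float(w):.4f}, without={float(wo):.4f}")
```

For the large population the two values agree to four decimal places, while for the small population they do not, which matches the comparison made in the text.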

Sampling Errors and Bias

When we analyze data, it is important to be aware of sampling error, which is the difference between the population parameter and the sample statistic being used to estimate it. Do not forget that we typically cannot examine the entire population, so we must take a sample. When we look at the sample and summarize various aspects of the sample, we use that information as our best guess as to what we expect in the population. In other words, we use sample information, sample statistics, to generalize about the population parameters. Every sample will lead to slightly different results, so we cannot expect to predict aspects of the population perfectly.

For example, suppose you turn in five assignments and earn grades of 89, 75, 95, 80, and 65. If we find the average by adding up all the scores and dividing by the number of scores, then we see your average score is \frac{89+75+95+80+65}{5} = 80.8. If we take a sample of three scores, such as 65, 75, and 80, then the average score of the sample is \frac{65+75+80}{3} \approx 73.3. Notice that the sample average of 73.3 is our best guess at the population average of 80.8, but they are not exactly the same; the difference of about 7.5 points is the sampling error. There will always be sampling error when we use the sample to generalize about a population. As a general rule, the larger the sample, the smaller the sampling error.
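
To see how the sampling error changes from one sample to the next, the sketch below lists every possible sample of three scores from the five grades in the example and compares each sample average with the population average. The variable names are illustrative.

```python
from itertools import combinations
from statistics import mean

scores = [89, 75, 95, 80, 65]     # the "population" of five grades
population_mean = mean(scores)    # average of all five scores

# Every possible sample of three scores, paired with its sampling error
# (sample average minus population average).
errors = {
    sample: mean(sample) - population_mean
    for sample in combinations(scores, 3)
}

print("Population average:", population_mean)
for sample, error in errors.items():
    print(sample, f"average={mean(sample):.1f}", f"error={error:+.1f}")
```

Each of the ten possible samples gives a different average, so the size and direction of the sampling error depend on which sample happens to be drawn.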

In statistics, sampling bias is created when a sample is collected from a population and some members of the population are not as likely to be chosen as others (remember, each member of the population should have an equal chance of being chosen). When sampling bias occurs, incorrect conclusions can be drawn about the population being studied. A biased sample differs from the population, so using the information collected from the sample creates misleading conclusions about the population.

License

Introduction to Statistics for Engineers Copyright © by Vikki Maurer & Jeff Crabill & Linn-Benton Community College is licensed under a Creative Commons Attribution 4.0 International License, except where otherwise noted.
