Sampling

9.1 Purpose

Sampling is a technique for obtaining an estimate from a population by studying, measuring, or interviewing a subset (or sample) of that population. This chapter discusses basic sampling concepts.

9.2 Strengths, weaknesses, and limitations

A well-selected sample yields an estimate of the target parameters in much less time and at much less cost than studying, measuring, or interviewing the entire population (conducting a census). In any case, a true census is often unattainable because some of the entities to be studied, measured, or interviewed are unavailable or do not respond. A sample is sometimes more accurate than a census because obtaining numerous measurements introduces errors owing to fatigue, inaccurate or inconsistent data entry, and the use of less qualified personnel.

The sample answer, called an estimate, is almost never exactly the same as the corresponding population value. (This difference is called error.) Additionally, before a statistically valid sample can be selected, a great deal of information about the population must be available.

9.3 Input and related ideas

Before drawing a sample, it is necessary to define the specific information being sought and the population from which the sample will be drawn. For example, if an analyst needs information about perceived weaknesses in the existing sales order tracking system, the population would consist of all the people who use the existing system.

Sampling can be used to select the subset of a population to be interviewed (Chapter 8), the members of a JAD team (Chapter 14), or the members of an inspection team (Chapter 23). Sampling is an effective way to study an existing system by selecting the entities, transactions, occurrences, or personnel to be observed and measured. Sampling is an effective tool for estimating population characteristics when using such mathematical tools as simulation (Chapter 19) and queuing theory (Chapter 79). During the testing phase of the system development life cycle (Part VII), sampling is used to generate test data and select the specific events to be monitored. During the operation and maintenance phase (Part VIII), sampling is an effective tool for evaluating and monitoring performance and for implementing system controls (Chapter 77). For example, quality control is often implemented by taking random samples of a process. Sometimes the estimates generated by sampling a process are plotted on a control chart (Chapter 10) to determine if the process is in control.

9.4 Concepts

9.4.1 Why sample?

Every year, Consumer Reports magazine conducts tests on new automobiles and reports its findings to its readers. Given the (literally) millions of automobiles that roll off the assembly lines every year, testing the entire population would be incredibly time consuming, prohibitively expensive, and practically impossible, so the test results are based on a sample.

In many cases, testing a sample is actually more accurate than testing the entire population. A tester’s reactions and perceptions are likely to change between the first car and the tenth car, if only because of fatigue. Multiple tests mean considerable data, and data entry errors are inevitable. Multiple tests also imply multiple testers, not all of whom are equally skilled. Finally, the test conditions and criteria will almost certainly change over time. For example, if enough cars are crashed into a barrier, the barrier will eventually be deformed, thus changing the test conditions.

If the sample is drawn properly, it is reasonable to assume that the sample estimate reflects the population. The balance of this chapter discusses the process of drawing a good sample.

9.4.2 Sample size and sampling error

The difference between the sample estimate and the true population value is called error. As a general rule, the sampling error decreases as the sample size increases. For example, assuming a 95 percent confidence interval, a sample of 1,000 voters might predict the outcome of an election with an error of slightly more than plus or minus 3 percent. Increase the sample size to 4,000, and the error drops to plus or minus 1.5 percent, while a sample size of 10,000 reduces the error to less than plus or minus 1 percent.
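The election figures above can be approximated with the usual margin-of-error formula for a sample proportion. The sketch below is illustrative only: the function name is hypothetical, and it assumes a simple random sample, the normal approximation, and the worst-case proportion p = 0.5, none of which are stated in the text.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a sample proportion.

    Assumes a simple random sample and the normal approximation;
    p = 0.5 is the worst case (largest error) and z = 1.96
    corresponds to 95 percent confidence.
    """
    return z * math.sqrt(p * (1 - p) / n)

for n in (1_000, 4_000, 10_000):
    print(f"n = {n:>6}: +/- {100 * margin_of_error(n):.2f} percent")
```

Under those assumptions the error is roughly 3.1 percent for 1,000 voters, about 1.5 percent for 4,000, and just under 1 percent for 10,000. Quadrupling the sample size only halves the error, because the error shrinks with the square root of n.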

A useful formula for computing the sample size is

$$n = \left(\frac{z\sigma}{E}\right)^2 \qquad (9.1)$$

where z is a number from the normal distribution table that corresponds to the desired confidence interval, σ is the standard deviation of the population as estimated by the sample standard deviation, and E is the maximum acceptable error between the sample mean and the actual population mean. For a 95 percent confidence interval, use z = 1.96. For a 99 percent confidence interval, use z = 2.575. As a practical matter, one-fifth the sample range can be used as an estimate of the standard deviation.

For example, suppose you want to estimate the average amount of money a state university student spends on food and beverages in an average week. The maximum acceptable error is $2. Based on a preliminary sample, σ is estimated to be $8. The desired confidence interval is 95 percent. Plugging those numbers into Equation (9.1) suggests a sample size of

$$n = \left(\frac{1.96 \times 8}{2}\right)^2 = 7.84^2 \approx 61.47$$

or 62 students. (It is impossible to sample a fractional student, and rounding up yields a confidence interval slightly higher than 95 percent.) Assuming the students answer truthfully, averaging the weekly food expenditures of 62 randomly selected university students will yield a value that is within $2 of the population average with 95 percent confidence. To put it another way, there is a 0.95 probability that the sample mean will lie within $2 of the true mean. (Note: A real statistician would probably argue that the last statement is not technically correct, but in most cases it is a reasonable way to visualize a confidence interval.)
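The sample size calculation is easy to script. The following is a minimal sketch, assuming the standard formula in which the product of z and the standard deviation is divided by the maximum acceptable error and the result is squared; the function and parameter names are hypothetical.

```python
import math

def required_sample_size(z, sigma, max_error):
    """Smallest whole sample size that keeps the error within max_error.

    z         -- normal-table value for the desired confidence level
    sigma     -- estimated population standard deviation
    max_error -- maximum acceptable error (same units as sigma)
    """
    # Always round up: a fractional member cannot be sampled.
    return math.ceil((z * sigma / max_error) ** 2)

# Student food-spending example: sigma = $8, maximum error = $2.
print(required_sample_size(z=1.96, sigma=8.0, max_error=2.0))   # 95 percent
print(required_sample_size(z=2.575, sigma=8.0, max_error=2.0))  # 99 percent
```

With z = 1.96 the raw value is about 61.47, which rounds up to 62. Raising the confidence to 99 percent (z = 2.575) gives a raw value of about 106.09, or 107 students.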

9.4.3 Bias

Simply selecting the right sample size is not enough, however. For example, a sample taken outside an expensive restaurant and a sample taken outside a food bank will almost certainly yield two very different (and equally invalid) estimates of the weekly food expenditures of university students because those samples are likely to be biased. A biased sample systematically favors some members of the population over others. To cite another example, if a telephone book is used to select a sample, people with unlisted numbers, people who have recently moved into that telephone market, and people with no telephone are automatically excluded from the sample.

Non-response bias occurs when one or more members of the selected group are not included in the sample. A survey that includes information only from people who answer their telephones at a certain time of day excludes one subset of the population. Dismissing or excluding people who refuse to answer certain questions is another source of non-response bias. Be aware of non-response bias. Before taking a sample, study the sampling process, identify subsets of the population that might be excluded or choose not to participate, and adjust the sampling process as necessary.

9.4.4 Random sampling

One relatively easy way to avoid introducing bias is to sample randomly. A sample is considered random if each member of the population has the same chance of being selected. Random samples yield unbiased estimates. Generally, an unbiased estimate is high about half the time and low about half the time.

There are two commonly used techniques for selecting a random sample. If the population is small, the members (or slips of paper representing each member) can be mixed thoroughly and the sample selected directly (like bingo markers or lottery tickets). For larger populations, assign each member a number and use a random number generator or a table of random numbers to select the sample.
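For a numbered population, the second technique can be sketched with Python's standard `random` module. The employee list and the function name below are hypothetical; `random.Random.sample` draws without replacement, so no member is selected twice.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Draw k distinct members; every member has the same chance of selection."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(population, k)

# Hypothetical population: 200 numbered employees.
employees = [f"employee_{i:03d}" for i in range(1, 201)]
sample = simple_random_sample(employees, k=20, seed=42)
print(sample[:5])
```

Passing a seed is optional. It is useful when the selection must be documented or reproduced later, for example during an audit.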

9.4.5 Random-like samples

In cases where it is impossible or inconvenient to select a true random sample, the objective is to generate estimates that behave as though they were based on a random sample. The key to successful, almost random sampling is to avoid introducing bias. For example, imagine a grocer inspecting a shipment of fruit. An estimate based on a sample taken from a single box or even from the tops of several boxes is unlikely to accurately reflect the quality of all the fruit. However, if the grocer selects several boxes and then selects fruit from the top, the middle, and the bottom of each, the sample is likely to be random-like.

On an assembly line, selecting every tenth, hundredth, or thousandth item (generally, every nth item) as it flows by might be an effective way to select a random-like sample. Another option is to select every (m ± n)th item, where n is a random number (for example, every 100 ± 5th item).
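Both assembly-line schemes can be sketched in a few lines. The code below is an illustration, not a prescribed method: it interprets "every 100 ± 5th item" as a gap of 100 items plus or minus a random offset of up to 5 between successive selections, and the function and parameter names are hypothetical.

```python
import random

def systematic_positions(total_items, m, jitter=0, seed=None):
    """Return 1-based positions of the items to inspect.

    With jitter = 0 this selects every m-th item. With jitter > 0 each
    gap between selected items is m plus a random integer offset in
    [-jitter, +jitter].
    """
    if m < 1 or not 0 <= jitter < m:
        raise ValueError("need m >= 1 and 0 <= jitter < m")
    rng = random.Random(seed)
    positions, position = [], 0
    while True:
        position += m + rng.randint(-jitter, jitter)
        if position > total_items:
            return positions
        positions.append(position)

print(systematic_positions(1_000, m=100))                    # every 100th item
print(systematic_positions(1_000, m=100, jitter=5, seed=7))  # every 100 +/- 5th item
```

The random offset makes the selection less predictable, which matters if the process has a cycle that happens to line up with the fixed interval.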

Avoid predictability when sampling human beings, however, because it often introduces bias. For example, if the boss walks through the work area every hour on the hour, he or she is likely to find everyone hard at work. If another boss were to use a random number table to define the times for random visits to the work area, he or she would be likely to gain a more accurate picture of the employees’ work habits.

9.4.6 Stratified random sampling

With stratified random sampling, a population of size N is divided into m subgroups. Each subgroup is called a stratum, and each member of the population must lie in exactly one stratum. For example, dividing a group of people by sex yields two strata (male and female); dividing a group of voters into Democrat, Republican, Independent, and Socialist yields four strata; and comparing the products produced on the first, second, and third shifts calls for three strata. Samples are taken randomly within each stratum.

Stratified random sampling is important if the different strata have different means and/or different levels of variability. For example, suppose the newer, relatively inexperienced employees who work the third shift produce markedly more errors than the people who work the other two shifts. In such cases, stratified sampling tends to yield more accurate estimates than simple random sampling.
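A stratified random sample can be drawn by grouping the population by stratum and then sampling randomly within each group. The sketch below uses a hypothetical work force of 200 people on three shifts; the function name, the data layout, and the per-stratum sample sizes are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(members, stratum_of, sizes, seed=None):
    """Draw a simple random sample of sizes[name] members from each stratum.

    members    -- iterable of population members
    stratum_of -- function mapping a member to its (single) stratum name
    sizes      -- dict of stratum name -> number of members to draw
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for member in members:
        strata[stratum_of(member)].append(member)  # exactly one stratum each
    return {name: rng.sample(strata[name], k) for name, k in sizes.items()}

# Hypothetical work force: (shift, employee number) pairs.
workers = ([("first", i) for i in range(100)]
           + [("second", i) for i in range(60)]
           + [("third", i) for i in range(40)])
sizes = {"first": 10, "second": 6, "third": 4}
sample = stratified_sample(workers, stratum_of=lambda w: w[0], sizes=sizes, seed=1)
print({name: len(chosen) for name, chosen in sample.items()})
```

Because each stratum is sampled separately, a small but error-prone group such as the third shift is guaranteed to be represented.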

9.4.6.1 Proportional allocation

One technique for distributing a sample across several strata is called proportional allocation. If 200 employees are distributed over three shifts with 100 on first shift, 60 on second shift, and 40 on third shift, a reasonable sample distribution might be 50 percent first shift, 30 percent second shift, and 20 percent third shift.
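Proportional allocation is simple arithmetic: each stratum receives the same share of the sample that it holds of the population. A minimal sketch follows, assuming a hypothetical total sample of 40 employees (the text does not specify one); the function name is likewise hypothetical.

```python
def proportional_allocation(stratum_sizes, total_sample):
    """Split total_sample across strata in proportion to stratum size."""
    population = sum(stratum_sizes.values())
    return {name: round(total_sample * size / population)
            for name, size in stratum_sizes.items()}

shifts = {"first": 100, "second": 60, "third": 40}
print(proportional_allocation(shifts, total_sample=40))
# {'first': 20, 'second': 12, 'third': 8}
```

In general the rounded counts may not add up exactly to the requested total, so a real implementation would need a rule for assigning the remainder.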

9.4.6.2 Optimal allocation

If one stratum exhibits significantly more variability than the others, proportionally more samples should be taken from the inconsistent stratum. Also, if one stratum is more costly to measure or interview than another, proportionally fewer samples should be taken from the expensive stratum.

Optimal allocation is a technique for distributing a sample across several strata that considers variability and cost. The optimum allocation formula is

$$\frac{n_i}{n} = \frac{W_i \sigma_i / \sqrt{C_i}}{\sum_j W_j \sigma_j / \sqrt{C_j}}$$

where n_i is the number of samples in stratum i, n is the total sample size, W_i is the fraction of the population in stratum i, σ_i is the standard deviation of stratum i, and C_i is the cost to sample stratum i. The formula calculates a relatively larger sample size for a given stratum if its variability (measured by σ_i) is higher than average or if the cost of sampling from that stratum is lower than average.

For example, suppose n, the total sample size, is 500. The population is divided among three strata, with costs to sample of $3, $4, and $5 per item for strata 1, 2, and 3 respectively (C₁ = $3, C₂ = $4, and C₃ = $5). Stratum 1 contains 50 percent of the population (W₁ = 0.5), stratum 2 contains 30 percent of the population (W₂ = 0.3), and stratum 3 contains 20 percent of the population (W₃ = 0.2). Finally, the estimated standard deviations for the three strata are σ₁ = 1.5, σ₂ = 2, and σ₃ = 2.5.

First calculate

$$\sum_i \frac{W_i \sigma_i}{\sqrt{C_i}} = \frac{W_1 \sigma_1}{\sqrt{C_1}} + \frac{W_2 \sigma_2}{\sqrt{C_2}} + \frac{W_3 \sigma_3}{\sqrt{C_3}}$$

$$= \frac{0.5(1.5)}{\sqrt{3}} + \frac{0.3(2)}{\sqrt{4}} + \frac{0.2(2.5)}{\sqrt{5}}$$

$$\approx 0.433 + 0.300 + 0.224 = 0.957.$$

Next, compute

$$n_1/n = 0.433/0.957 = 0.452$$

$$n_2/n = 0.300/0.957 = 0.314$$

$$n_3/n = 0.224/0.957 = 0.234.$$

Those numbers suggest that n₁ (the stratum 1 sample size) should be 45.2 percent (or 226 units) of the total sample size (500 items), n₂ should be 31.4 percent (or 157 units), and n₃ should be 23.4 percent (or 117 units).
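The optimal allocation arithmetic can be checked with a short script. This sketch assumes the allocation is proportional to each stratum's population fraction times its standard deviation, divided by the square root of its sampling cost, as in the worked example; the function and variable names are hypothetical.

```python
import math

def optimal_allocation(weights, sigmas, costs, total_sample):
    """Allocate total_sample across strata in proportion to W * sigma / sqrt(C).

    weights -- fraction of the population in each stratum
    sigmas  -- estimated standard deviation of each stratum
    costs   -- cost to sample one item from each stratum
    """
    terms = [w * s / math.sqrt(c) for w, s, c in zip(weights, sigmas, costs)]
    total = sum(terms)
    return [total_sample * t / total for t in terms]

allocation = optimal_allocation(
    weights=[0.5, 0.3, 0.2],
    sigmas=[1.5, 2.0, 2.5],
    costs=[3.0, 4.0, 5.0],
    total_sample=500,
)
print([round(a) for a in allocation])  # [226, 157, 117]
```

The three stratum sizes sum to the full sample of 500, which is a quick sanity check on any allocation.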

9.5 Key terms

Bias: Any factor that systematically favors some members of the population over others when a sample is drawn.
Census: A set of measurements (or interviews) for every element of a population.
Confidence interval: A range of numbers around an estimate that contains the corresponding population parameter with the stated probability. For example, a 95 percent confidence interval for an estimate of the population mean is a range of numbers that contains the population mean with 95 percent certainty.
Error: The difference between the value of a parameter as estimated by a sample and the actual value of that parameter for the entire population.
Estimate: A value of a parameter determined by a sample.
Mean: An arithmetic average; the sum of all the observations divided by the number of observations.
Non-response bias: A form of bias that occurs when one or more members of the selected group are not included or choose not to participate in the sample.
Population: The entire set of relevant entities or measurements.
Random sample: A sample in which each item in the population has the same chance of being selected.
Range: The difference between the highest value and the lowest value in a set of measurements.
Sample: A selected subset of a population.
Standard deviation: The square root of the variance.
Strata: The set of subgroups in a stratified random sample.
Stratified random sampling: A random sampling technique in which the population is divided into subgroups called strata such that each element of the population lies in exactly one stratum; samples are taken randomly within each stratum.
Stratum: A single subgroup in a stratified random sample.
Unbiased estimate: An estimate that is high about half the time and low about half the time.
Variance: The average of the squared differences between the individual population values and the population mean.

9.6 Software

Random number tables are found in many statistics textbooks and in the software packages that accompany those books. Random number functions are found in most spreadsheet programs. SAS users can generate random observations from a binomial distribution (RANBIN), an exponential distribution (RANEXP), a normal distribution (RANNOR), a Poisson distribution (RANPOI), or a uniform distribution (RANUNI). Minitab for Windows users should check the Random Data submenu on the Calc pull-down menu.

