Sampling I: Probability Samples

Joe Ripberger

Review

Total Survey Error (TSE)

  • Variance: variability of estimates across different samples
  • Bias: systematic deviation of an estimate from the true value

Sources of Survey Error (TSE)

  • Specification (validity) error: difference between the concept the researcher intends to measure and the construct actually captured by the survey question
  • Measurement error: difference between the value a survey question records and the true value for the respondent, due to question wording, interviewer effects, mode, recall, or response biases
  • Processing error: difference introduced during data handling, such as coding, keying, editing, or weighting, that causes the stored data to deviate from the respondent’s intended answer

Sources of Survey Error (TSE)

  • Coverage error: difference between the target population and the sampling frame; occurs when some population units have no chance of selection or are erroneously included
  • Sampling error: difference between an estimate based on a sample and the true population value that arises because only a subset of units is observed
  • Nonresponse error: difference between respondents and nonrespondents that leads to estimates not representing the intended sample or population
  • Adjustment error: difference between the adjusted (e.g., weighted or imputed) survey estimates and the true population values that arises when the adjustment procedures are misspecified or imperfect

Total Survey Quality (TSQ)

  • Accuracy: total survey error is minimized
  • Credibility: data are considered trustworthy by the survey community
  • Comparability: demographic, spatial, and temporal comparisons are valid
  • Usability/interpretability: documentation is clear and metadata are well-managed
  • Relevance: data satisfy users' needs
  • Accessibility: access to the data is user friendly
  • Timeliness/punctuality: data deliveries adhere to schedules
  • Completeness: data are rich enough to satisfy the analysis objectives without undue burden on respondents
  • Coherence: estimates from different sources can be reliably combined

Covering the population and selecting who to sample

Basic Concepts

  • Census: a complete enumeration of all units in the target population. Every individual (or household, organization, etc.) in the population is asked to provide information, so there is no sampling
    • Example: the 2020 Decennial Census (331,449,281 respondents; 99.98% of households)
  • Sample survey: a study in which information is collected from only a subset of units drawn from the population, selected through a sampling process. Results from this smaller group are then used to make inferences about the entire population
    • Example: the 2023 American Community Survey (ACS) (2,108,185 respondents)

2020 Census

  • By the time work on the 2020 Census ends in 2024, it will have cost $13.7 billion, below the Census Bureau’s original estimate of $15.6 billion
    • Response data \(\rightarrow\) $5.6 billion
    • Address frame \(\rightarrow\) $600 million
  • Source: GAO 2023

Survey Sample

  1. Identify the target population: define the group you want to generalize to
  2. Construct the sampling frame: compile or select the list of population units
  3. Select the sample: randomly draw units from the frame
  4. Contact sampled units: invite them to participate in the survey
  5. Collect responses: record answers from those who complete the survey
  6. Account for nonresponse: adjust if necessary
  7. Analyze data and generalize: make inferences about the population

Construct the Sampling Frame

  • When compiling or selecting a sampling frame, it is important to consider and, when possible, measure the coverage rate—the proportion of the target population that is included in the frame
    • Low coverage rates increase the risk of coverage error, because some members of the target population have no chance of being sampled
  • Coverage is affected by:
    1. Who has access to the survey mode
    2. What lists or frames are available

Access to the Survey Mode

  • Many large surveys use multi-mode designs to maximize coverage rates and reduce nonresponse
    • Face-to-face: highest coverage, costly
    • Mail: broad reach, lower cost, slower turnaround
    • Telephone: historically landline, then landline + cell, now mostly cell
    • Web/internet: increasingly the first option, inexpensive and fast
  • Can be concurrent (same time) or sequential (in sequence)
    • Example (ACS sequential): web → mail → phone → in-person

Common Sampling Frames (Lists)

  • Address-based sample (ABS) frames: samples drawn from the U.S. Postal Service Computerized Delivery Sequence File (CDSF)
    • Covers nearly all U.S. households and supports multi-mode designs (mail, web, phone, in-person)
  • Random-digit dialing (RDD) frames: samples of telephone numbers (landline and/or cell) generated randomly
    • Historically common in telephone surveys but declining with fewer landlines and lower response rates
  • Area probability sampling (area frame): geographic units (e.g., census tracts, blocks) are sampled first, then households/individuals within them
    • Not a frame per se, but a method for constructing a frame; often used in face-to-face surveys (cluster sampling)
  • Internet sample frames?

Area Frame

Coverage Error

Warning

Coverage error may occur when the sampling frame does not fully match the target population. Units may be missing from the frame, or ineligible units may be included. As a result, some members of the target population have no chance of being selected, which may cause bias in survey estimates.

However, bias arises only if the excluded or ineligible units would have responded systematically differently to survey questions than those who were included.

Select the Sample (Sampling Design)

  • Probability sampling: each member of the sampling frame has a known, nonzero chance of being included in the sample
    • Simple random sampling (SRS): every unit in the frame has an equal probability of selection, and each possible sample of a given size is equally likely (without replacement)
    • SRS is subject to sampling error (variance), which we measure using standard errors (SEs) and margins of error (MoEs)
      • \(SE(\hat{p}_{SRS}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\)
        • Finite population correction (FPC): \(SE(\hat{p}_{SRS}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \;\cdot\; \sqrt{1 - \tfrac{n}{N}}\)
        • Use when sample covers more than ~5% of the population (sampling fraction)
      • \(MoE = z \times SE\)
      • Confidence interval: \(\hat{p} \pm MoE\)
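The SE, FPC, MoE, and confidence-interval formulas above can be sketched in a few lines of Python (the sample values are illustrative, not from the slides):

```python
import math

def srs_se(p_hat, n, N=None):
    """SE of a proportion under SRS; applies the FPC when a population size N is supplied."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    if N is not None:
        se *= math.sqrt(1 - n / N)  # FPC matters when the sampling fraction n/N exceeds ~5%
    return se

# Illustrative estimate: 52% support from a sample of 1,000
p_hat, n = 0.52, 1000
se = srs_se(p_hat, n)
moe = 1.96 * se                   # 95% confidence level
ci = (p_hat - moe, p_hat + moe)   # confidence interval: p_hat +/- MoE
```

With the FPC, `srs_se(0.52, 1000, N=10_000)` returns a smaller SE than the uncorrected version, because the sample covers 10% of that (hypothetical) population.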

Cluster Sampling

  • Cluster sampling: groups (clusters) of units, such as households within census blocks, are randomly selected first, and then units within those clusters are randomly sampled
    • Most often used for face-to-face or field surveys because it reduces travel and listing costs
    • Especially useful when a complete sampling frame of individuals or households is not available; instead, a frame of geographic areas (e.g., census blocks, tracts) can be used to construct the sample (area probability sampling)
    • Individuals within the same cluster tend to be more alike (correlated), which can increase sampling error compared to a simple random sample of the same size (design effect)

Design Effect

  • Design effect (DEFF, \(d^2\)): the ratio of the variance of a survey estimate under the actual design to the variance under a simple random sample (SRS) of the same size. DEFF = 1 for SRS; it is usually greater than 1 for complex samples, though it can be less than 1 in efficient stratified designs
    • \(SE(\hat{p}_{SRS}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\)
    • \(SE(\hat{p}_{cluster}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \;\cdot\; \sqrt{\,1 + (m - 1)\rho\,}\), where \(m\) = average number of sampled units per cluster and \(\rho\) = intraclass correlation (ICC, rate of homogeneity) — how similar people within a cluster are
      • \(SE(\hat{p}_{cluster}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n_{\text{eff}}}}\), where \(n_{\text{eff}} = \frac{n}{1 + (m - 1)\rho}\)
    • If \(\rho\) = 0: no similarity within clusters → SE same as SRS
    • If \(\rho\) > 0: people in the same cluster are more alike → effective sample size is smaller → SE grows
    • If clusters are large (\(m\) big) or \(\rho\) is high, the design effect can be large → estimates are much less precise
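A minimal sketch of the design-effect arithmetic above (the cluster size and ICC values are assumptions chosen for illustration):

```python
import math

def deff(m, rho):
    """Design effect for cluster sampling: 1 + (m - 1) * rho."""
    return 1 + (m - 1) * rho

def cluster_se(p_hat, n, m, rho):
    """SRS standard error inflated by the square root of the design effect."""
    return math.sqrt(p_hat * (1 - p_hat) / n) * math.sqrt(deff(m, rho))

# 1,000 respondents drawn as 50 clusters of m = 20, with ICC rho = 0.05
n, m, rho = 1000, 20, 0.05
n_eff = n / deff(m, rho)  # effective sample size: 1000 / 1.95, roughly 513
```

Even a modest ICC of 0.05 nearly halves the effective sample size here, which is one reason cluster designs try to keep clusters small or numerous.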

Stratified Sampling

  • Stratified sampling: the population is divided into strata (subgroups), and random samples are drawn from each stratum
    • Proportionate stratified sample: the sample sizes in each stratum are chosen so that they are proportional to the stratum’s share of the population
    • Disproportionate stratified sample: the sample sizes in each stratum are not proportional to the stratum’s population share (often done to ensure adequate numbers for subgroup analysis)
  • Used for all survey modes; often to:
    • Improve precision when strata are internally homogeneous
    • Ensure representation of key subgroups (e.g., small minorities, geographic areas)
    • Allow separate estimates for important subpopulations
  • Design effects in stratified samples often reflect gains in efficiency (DEFF < 1) when strata are homogeneous, but can reflect losses (DEFF > 1) when heavy weighting from disproportionate sampling is required
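The two allocation schemes can be illustrated with a short sketch (the strata names, population shares, and sample sizes are hypothetical):

```python
# Proportionate allocation: each stratum's sample size matches its population share.
pop_shares = {"urban": 0.60, "suburban": 0.30, "rural": 0.10}
n_total = 1000

proportionate = {s: round(share * n_total) for s, share in pop_shares.items()}
# {"urban": 600, "suburban": 300, "rural": 100}

# Disproportionate allocation: oversample the small rural stratum so it has
# enough cases for separate subgroup estimates (weights are then needed for
# population-level estimates).
disproportionate = {"urban": 500, "suburban": 250, "rural": 250}
```

The rural stratum grows from 100 to 250 cases under the disproportionate design, improving subgroup precision at the cost of weighting (and possibly DEFF > 1) for overall estimates.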

Sample Size

  • Optimal sample size is determined by:
    • Desired precision (margin of error / width of confidence interval)
    • Desired confidence level (e.g., 95% → \(z \approx 1.96\))
    • Estimated variability in the population (largest at \(\hat{p} = 0.5\) for proportions)
    • Population size (important when sampling fraction is large → FPC applies)
    • Design effect (DEFF) for complex samples (clustered or weighted designs)
    • Budget and logistical constraints

Sample Size

  • \(SE(\hat{p}_{SRS}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\)
  • \(MoE = z \times SE\)
  • Solving for \(n\): \[n = \frac{z^2 \, \hat{p}(1-\hat{p})}{MoE^2} = \frac{z^2 \times p \times q}{MoE^2}\]
  • Suppose you want to estimate the proportion of Norman, OK residents who support a new public safety policy. You want a 95% confidence level (z = 1.96) and a margin of error of ±3% (MoE = 0.03). Assume maximum variability (p = 0.5, q = 0.5)

Sample Size

  • Given: 95% confidence → \(z=1.96\); \(MoE=0.03\); worst case variability \(p=0.5\), \(q=1-p=0.5\)
  • Formula: \(n \;=\; \frac{z^2\times p \times q}{MoE^2}\)
  • Math: \(n \;=\; \frac{(1.96)^2 \times 0.5 \times 0.5}{(0.03)^2} = \frac{3.8416 \times 0.25}{0.0009} = \frac{0.9604}{0.0009} \approx 1067.11\)
  • Answer: Round up to \(n = 1068\) respondents
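The calculation above generalizes to a one-line helper mirroring the slide's formula:

```python
import math

def sample_size(moe, p=0.5, z=1.96):
    """Required n for a proportion: z^2 * p * (1 - p) / MoE^2, rounded up."""
    return math.ceil(z**2 * p * (1 - p) / moe**2)

n = sample_size(moe=0.03)  # 1068, matching the worked example
```

Tightening the margin of error is expensive: `sample_size(moe=0.02)` requires 2,401 respondents, more than double the n for a 3% margin.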

Sample Size

Activity

Groups

  1. Ben and Bulbul (American Community Survey—ACS)
  2. Laken and Vanessa (Current Population Survey—CPS)
  3. Nate and Charlie (National Health Interview Survey—NHIS)
  4. Lauren and Renata (European Social Survey—ESS)
  5. Riley and Abby (Current Employment Statistics—CES)
  6. Elizabeth and Joy (General Social Survey—GSS)
  7. Alexis and Anna (American National Election Studies—ANES)

Instructions

Goal: Identify and summarize key features of a real survey’s design.

  1. Find a Survey
    • Open the survey_documentation folder in the class Dropbox, find your survey
  2. Use ChatGPT to learn a bit about the survey
    • Find a short description of what the survey is, its main focus, and what kinds of questions it asks
  3. Read the Documentation
    • Look for details about who the survey covers and how it was conducted
  4. Extract Key Elements
    • Target population, sample frame, sample design, sample size, and survey mode
  5. Summarize
    • Write 3–5 sentences describing the sample frame and sample design
  6. Share Back
    • Be ready to report-out to the rest of the class
Time: 45:00

Report-out

Time: 03:00