Key Takeaways
- Simple Random Sampling (SRS) requires a complete list of the population (sampling frame) and uses random selection where each unit has equal probability, resulting in unbiased estimators with variance proportional to (1 - n/N) * S^2 / n
- In SRS, the standard error of the mean is sqrt[(1 - n/N) * (sigma^2 / n)], which decreases as sample size n increases, demonstrated in simulations with N=10000, n=500 yielding SE=0.15
- A 2018 study on election polling using SRS from 50,000 voters showed a margin of error of ±3.1% at 95% confidence, outperforming quota sampling by 1.2%
- Stratified Random Sampling divides population into homogeneous strata based on key variables, allocating sample proportional or optimal (Neyman) to minimize variance
- Optimal (Neyman) allocation in stratified sampling: n_h = n * N_h * sigma_h / sum(N_i sigma_i), reduces var(mean) by 30-50% vs SRS
- In NHANES survey, stratified by age/sex/region, precision gain 25% over SRS for BMI estimates
- Systematic sampling selects every kth unit after random start r (1<=r<=k), period k=N/n, simple and spread out
- Systematic sampling variance approx SRS if no periodicity, but if period matches k, bias up to 50%
- In manufacturing QC, systematic every 10th item n=100 from 1000, detects trends better, efficiency 1.1x SRS
- Cluster sampling groups population into clusters (natural like schools, blocks), randomly selects clusters then samples within, reduces travel cost
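- Single-stage cluster: select m out of M clusters and survey them fully, var(mean) = (1-f_c) S_c^2 / m for equal-size clusters, with f_c = m/M and S_c^2 the variance between cluster means; there is no within-cluster sampling term, and a positive ICC inflates the variance relative to SRS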
- Two-stage cluster: random clusters, SRS within, common in surveys, efficiency depends on ICC rho<0.1 good
- Convenience sampling relies on easy access subjects, high bias/volatility, no probability
- Snowball sampling for hidden populations: referrals, e.g., 500 drug users from 5 seeds, reach 95% network
- Quota sampling: fills quotas by subgroups like stratified but non-random select within, bias 10-20% higher
The blog post explains several probability sampling methods (simple random, stratified, systematic, cluster) and non-probability methods, with their formulas and applications.
Cluster Sampling
- Cluster sampling groups population into clusters (natural like schools, blocks), randomly selects clusters then samples within, reduces travel cost
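- Single-stage cluster: select m out of M clusters and survey them fully, var(mean) = (1-f_c) S_c^2 / m for equal-size clusters, with f_c = m/M and S_c^2 the variance between cluster means; there is no within-cluster sampling term, and a positive ICC inflates the variance relative to SRS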
- Two-stage cluster: random clusters, SRS within, common in surveys, efficiency depends on ICC rho<0.1 good
- In DHS, 30 clusters per stratum, 20 hh/cluster, design effect DEFF=1.8 for fertility
- ICC estimation: rho = (DEFF-1)/(b-1), b=avg cluster size; e.g., rho=0.05 with b=21 gives DEFF=2, which doubles the n needed
- PPS cluster: per-draw selection prob p_i = M_i / sum_j M_j (M_i = size of cluster i), variance lower for unequal sizes
- Cost model: travel between clusters dominates, optimal m=10-20 clusters saves 50% vs SRS
- School survey: 50 schools x 30 students, var height mean DEFF=2.1, rho=0.04
- Multi-stage cluster: PSU>SSU>households, used in Census ACS, precision similar SRS lower cost
- R survey package: svydesign(id=~psu,strata=~stratum), svytotal variance accounts for DEFF
- In agriculture, village clusters n=40 x 20 farms, yield DEFF=1.5
- Optimal cluster size b = sqrt((C_b / C_e) * (1 - rho) / rho), C_b = cost per cluster (between), C_e = cost per element
- Simulation: rho=0.1, M=1000 clusters size 50, m=50 clusters n=500 within, var 1.8x SRS
- Health cluster trials: 20 clusters/arm, power 80% for 10% effect, ICC=0.02
- Urban vs rural clusters: DEFF 2.5 rural high homog
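- Variance approx (two-stage): for equal clusters of size B with b units sampled per cluster, Var(mean) = (1-f_c) S_b^2 / m + (1-f_w) S_w^2 / (m b), with f_c = m/M, f_w = b/B, S_b^2 the variance between cluster means and S_w^2 the variance within clusters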
- GPS cluster centroids, spatial autocorr rho=0.15 inflates DEFF 1.3
- Compared stratified: cluster higher var but 3-5x cheaper per unit
- In marketing, zip code clusters, penetration rate SE 25% higher but cost 60% less
- Replication method for var est in unequal clusters, CV<15%
- Wildlife surveys: aerial cluster counts, detection prob 0.7, DEFF=3.2
- Pandemic surveillance: county clusters, incidence DEFF=4.1 high spatial corr
- Optimal allocation clusters prop sqrt(cost var), efficiency gain 20%
- NFHS India: 3-stage cluster PSU/village/hh, response 92%
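The relationship between ICC, cluster size, and design effect quoted in the bullets above can be sketched in a few lines. This is a minimal illustration, assuming equal-size clusters and the usual approximation DEFF = 1 + (b - 1) * rho; the function names are hypothetical and not taken from any survey package.

```python
def design_effect(rho, b):
    """Approximate design effect for equal clusters of size b and ICC rho."""
    return 1.0 + (b - 1.0) * rho


def icc_from_deff(deff, b):
    """Invert the same approximation: rho = (DEFF - 1) / (b - 1)."""
    return (deff - 1.0) / (b - 1.0)


def effective_sample_size(n, deff):
    """Size of the SRS that would give the same precision as n clustered units."""
    return n / deff


# School survey figures from the list above: 50 schools x 30 students, rho = 0.04.
deff = design_effect(rho=0.04, b=30)
print(round(deff, 2))                               # 2.16
print(round(effective_sample_size(50 * 30, deff)))  # 694
```

Under this approximation the 1,500 sampled students carry roughly the information of 694 independently sampled ones, which is why a small ICC still matters once clusters are large.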
Non-Probability Sampling
- Convenience sampling relies on easy access subjects, high bias/volatility, no probability
- Snowball sampling for hidden populations: referrals, e.g., 500 drug users from 5 seeds, reach 95% network
- Quota sampling: fills quotas by subgroups like stratified but non-random select within, bias 10-20% higher
- Judgmental/Purposive: expert picks, e.g., 50 key informants, validity high for qualitative depth
- Volunteer self-selected: response rate voluntary, e.g., online polls 5-10%, selection bias +15% enthusiasm skew
- In market research, convenience mall intercepts n=400, cost $5/unit vs prob $25, but MOE unreliable ±8%
- Accidental/Haphazard: first encountered, e.g., street interviews, rep error 25% for attitudes
- Respondent-driven sampling (RDS): dual incentives, weights by network size, HIV prevalence bias corrected to ±3%
- Time-location sampling: venues by time, e.g., MSM surveys, coverage 70%
- In social media, hashtag convenience sample 10k tweets, sentiment accuracy 82% vs prob 91%
- Quota vs prob: 2016 election polls quota error 5% Trump support overestimate
- Purposive for case studies: 12 extreme cases, theory building insights 90% confirmed
- Snowball generations: 1st=seeds, 2nd=referrals, convergence after 3 waves RDS estimator unbiased if assumptions
- Online panels opt-in: 1M members, quota filled, but professional liars bias 10%
- Convenience in pilots: n=50 quick test hypotheses, power 60% but directional ok
- Multistage non-prob: quota at levels, e.g., city>street>hh, speed high coverage low
- Bias adjustment propensity weighting in non-prob, reduces diff to prob by 50%
- In ethnography, convenience key informants snowball to 30, saturation reached
- Amazon MTurk convenience workers n=1000 cheap $0.10 each, demographics skew young 70%
- Quota internet: fill gender/age/ethnicity fast, but low SES underrep 20%
- Sequential sampling non-prob: add until criterion, e.g., adverse events 5 cases stop
- In journalism vox pops convenience 20 street people, viral but rep ±15%
- Network snowball for rare diseases: 200 patients from clinics, prevalence proxy
- Hybrid prob+non-prob: non-prob calibrate to prob margins, error halved
- Focus groups purposive 8-10 homog, qual insights deep quant breadth low
- Clickstream convenience web traffic n=50k visitors, behavior bias tech-savvy +30%
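The calibration idea mentioned above (weighting a non-probability sample to known population margins) can be sketched with simple post-stratification weights. This is only an illustration: the group labels, population shares, and function names are hypothetical, and real adjustments usually involve several variables and propensity models.

```python
from collections import Counter


def poststratification_weights(sample_groups, population_shares):
    """Weight each respondent by population share / sample share of their group."""
    counts = Counter(sample_groups)
    n = len(sample_groups)
    return [population_shares[g] / (counts[g] / n) for g in sample_groups]


def weighted_mean(values, weights):
    """Weighted average of the observed values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical convenience sample that skews young (70% young, 30% older),
# calibrated to an assumed population that is 40% young and 60% older.
groups = ["young"] * 7 + ["older"] * 3
answers = [1] * 7 + [0] * 3          # 1 = agrees with some statement
weights = poststratification_weights(groups, {"young": 0.4, "older": 0.6})

print(round(sum(answers) / len(answers), 2))       # 0.7 (unweighted)
print(round(weighted_mean(answers, weights), 2))   # 0.4 (calibrated)
```

Weighting of this kind only corrects for the variables used to form the groups; selection effects within a group are left untouched.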
Simple Random Sampling
- Simple Random Sampling (SRS) requires a complete list of the population (sampling frame) and uses random selection where each unit has equal probability, resulting in unbiased estimators with variance proportional to (1 - n/N) * S^2 / n
- In SRS, the standard error of the mean is sqrt[(1 - n/N) * (sigma^2 / n)], which decreases as sample size n increases, demonstrated in simulations with N=10000, n=500 yielding SE=0.15
- A 2018 study on election polling using SRS from 50,000 voters showed a margin of error of ±3.1% at 95% confidence, outperforming quota sampling by 1.2%
- SRS variance for proportion p is p(1-p)/n * (1-n/N); the finite population correction reduces the variance by 10% (the SE by about 5%) when n/N=0.1
- In agricultural surveys, SRS of 384 farms from 5000 estimated yield mean with 4.2% relative error, compared to 6.1% for systematic
- Monte Carlo simulations (10,000 runs) show SRS mean squared error (MSE) = 0.021 for population variance 1.0, n=100, N=1000
- SRS implementation in R using sample() function achieves exact equal probability, tested on datasets up to 1M units with <0.01% deviation
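- Historical lesson from the 1936 Literary Digest poll: a huge mail-in sample failed because its frame (telephone directories, automobile registrations, its own subscribers) and low response skewed the results, while Gallup's far smaller quota sample called the election correctly, highlighting frame importance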
- For skewed populations, SRS unbiased but high variance; bootstrap SRS reduces CI width by 15% in n=200 samples
- SRS sample size formula n = [Z^2 * p * (1-p) / E^2] / [1 + (Z^2 * p * (1-p) / (E^2 * N))], yields n=385 for 95% CI, 5% error, p=0.5, N infinite
- In quality control, SRS of 50 items from 1000 batch detects defect rate 5% with power 0.82 at alpha=0.05
- Comparative study: SRS vs cluster, SRS relative efficiency 1.25 for urban populations N=50000, n=1000
- SRS with replacement has variance sigma^2/n; without replacement it is multiplied by (1-n/N), so the variances differ by 10% (the SEs by about 5%) when n=10%N
- In epidemiological studies, SRS from 10,000 cohort gave prevalence estimate 12.3% ±1.8%, gold standard for unbiasedness
- Software comparison: Python random.sample() vs SAS PROC SURVEYSELECT, SRS equivalence >99.9% in 1M trials
- SRS cost per unit lowest in digital frames (e.g., $0.50/unit for email lists), but high for physical
- Bias in SRS=0 theoretically, but frame coverage error up to 10% in mobile surveys
- For multinomial, SRS chi-square test power 0.75 for n=300, detecting deviations >5%
- SRS in big data: subsampling 1% of 1B records approximates population mean within 0.5% error 95% time
- Historical evolution: Fisher’s 1925 design-based inference formalized SRS variance estimation
- In finance, SRS of 500 transactions from 50k detects fraud rate 2.1% ±0.9%
- SRS non-response adjustment via weighting reduces bias by 40% in household surveys
- Power analysis: SRS n=106 for 80% power, effect size 0.5, alpha=0.05 two-sided t-test
- SRS in ecology: 200 plots from 5000 estimated species richness bias <1%
- Comparative variance: SRS var(mean)=0.04 vs stratified 0.025 for same n=400
- SRS lottery draw fairness: 99.99% uniformity in 1M simulated Powerball draws
- In marketing, SRS email survey response 25%, margin error 4.9% for n=400
- SRS finite correction factor (1-n/N)=0.95 for n=500,N=10000, reduces SE by about 2.5% (sqrt(0.95)=0.975)
- Bootstrap SRS 1000 resamples CI width 10% narrower than normal approx for n=50 skewed data
- SRS in auditing: 95% confidence detects overstatement >5% with n=156 from 5000
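The standard-error and sample-size formulas listed above can be checked with a short sketch. It assumes sampling without replacement from a complete frame and uses only the Python standard library; the function names and the seed are illustrative choices, not part of any cited study.

```python
import math
import random


def srs_standard_error(s, n, N):
    """SE of the sample mean under SRS without replacement (with FPC)."""
    return math.sqrt((1 - n / N) * s ** 2 / n)


def sample_size_for_proportion(z, p, e, N=None):
    """n = n0 / (1 + n0 / N) with n0 = z^2 p (1 - p) / e^2; N=None means infinite."""
    n0 = z ** 2 * p * (1 - p) / e ** 2
    if N is None:
        return math.ceil(n0)
    return math.ceil(n0 / (1 + n0 / N))


# Draw an SRS of 500 unit indices from a frame of 10,000 (seed fixed for repeatability).
rng = random.Random(42)
sample = rng.sample(range(10_000), 500)
print(len(sample), len(set(sample)))                         # 500 500

# Share by which the FPC shrinks the SE when n=500, N=10,000.
print(round(100 * (1 - math.sqrt(1 - 500 / 10_000)), 1))     # 2.5

print(sample_size_for_proportion(1.96, 0.5, 0.05))           # 385
print(sample_size_for_proportion(1.96, 0.5, 0.05, N=5000))   # 357
```

The last two lines show how the required n drops once the population is small enough for the finite population correction to matter.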
Stratified Sampling
- Stratified Random Sampling divides population into homogeneous strata based on key variables, allocating sample proportional or optimal (Neyman) to minimize variance
- Optimal (Neyman) allocation in stratified sampling: n_h = n * N_h * sigma_h / sum(N_i sigma_i), reduces var(mean) by 30-50% vs SRS
- In NHANES survey, stratified by age/sex/region, precision gain 25% over SRS for BMI estimates
- Proportional allocation: n_h = (N_h / N) * n, variance sum w_h^2 sigma_h^2 / n_h, unbiased and simple
- Disproportional stratified: oversample rare strata, e.g., 2x minorities, post-stratify weights, bias <1%
- Neyman allocation simulation: var reduction 42% for strata variances 1:4:9, n=300 total
- In education research, stratified by school type, estimated graduation rate 78.2% ±1.2% vs SRS ±2.1%
- Post-stratification adjustment: raking to census margins reduces bias by 35% in polls
- Cluster vs stratified: stratified RE=1.8 for health surveys, N=100k
- Software: R survey package svydesign(id=~1,strata=~stratum), svymean SE 20% lower than SRS
- In market research, stratified by income quintiles, brand preference precision +40%
- Variance formula: Var(\bar{y}_st) = sum_h W_h^2 (1 - f_h) S_h^2 / n_h, with W_h = N_h/N and f_h = n_h/N_h
- Census 2020 used stratified for undercount adjustment, improved accuracy 15% for minorities
- Optimal vs proportional: for CVs 0.2,0.8, optimal var 60% of prop, n_h total 400
- In clinical trials, stratified randomization reduces imbalance P<0.01 for 4 strata, n=200
- Multistage stratified: PSUs clustered within strata, cost efficiency 2.5x SRS
- Bias analysis: perfect strata homogeneity var->0, real data 10-20% gain
- In environmental monitoring, stratified by pollution zones, mean contaminant ±5% vs SRS ±12%
- Cost-optimal sample size per stratum: n_h = n * (N_h sigma_h / sqrt(C_h)) / sum_i (N_i sigma_i / sqrt(C_i)), C_h = per-unit cost in stratum h, minimizes variance for a fixed budget
- Gallup polls stratify by state/urban, MOE ±2% for n=1500
- Variance estimation: with replacement clusters in strata, SRS within, df adjustment
- In genomics, stratified by ancestry, allele freq precision 2x SRS
- Cost-benefit: strata travel cost saved 30%, total survey cost down 22%
- Adaptive stratification: dynamic n_h allocation, var reduction extra 10%
- In agriculture, stratified by soil type, yield var 35% lower, n=500
- Political polling: stratified quota hybrid, accuracy 85% vs SRS 72% in 2020 elections
- Stratified PPS: prob prop size within strata, efficiency +50% rare events
- In HR surveys, stratified by department, satisfaction score SE=1.2 vs 2.8 SRS
- Multilevel stratified: regions>districts>blocks, used in DHS surveys, precision 1.5x
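Proportional and Neyman allocation, and the resulting variance of the stratified mean, can be compared with a small sketch. The stratum sizes and standard deviations below are hypothetical, allocations are left unrounded for clarity, and the function names are illustrative.

```python
def proportional_allocation(n, sizes):
    """n_h = n * N_h / N."""
    N = sum(sizes)
    return [n * Nh / N for Nh in sizes]


def neyman_allocation(n, sizes, sds):
    """n_h = n * N_h * sigma_h / sum_i(N_i * sigma_i)."""
    total = sum(Nh * Sh for Nh, Sh in zip(sizes, sds))
    return [n * Nh * Sh / total for Nh, Sh in zip(sizes, sds)]


def stratified_variance(sizes, sds, alloc):
    """Var(mean_st) = sum_h W_h^2 * (1 - n_h / N_h) * S_h^2 / n_h."""
    N = sum(sizes)
    return sum(
        (Nh / N) ** 2 * (1 - nh / Nh) * Sh ** 2 / nh
        for Nh, Sh, nh in zip(sizes, sds, alloc)
    )


# Hypothetical population: three strata with increasing spread.
sizes = [5000, 3000, 2000]
sds = [1.0, 2.0, 3.0]
n = 340

prop = proportional_allocation(n, sizes)   # 170, 102, 68
ney = neyman_allocation(n, sizes, sds)     # 100, 120, 120

print(stratified_variance(sizes, sds, prop))
print(stratified_variance(sizes, sds, ney))
```

Neyman allocation shifts sample toward the more variable strata, so its variance is never larger than the proportional one for the same total n; in practice the n_h values would be rounded and capped at N_h.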
Systematic Sampling
- Systematic sampling selects every kth unit after random start r (1<=r<=k), period k=N/n, simple and spread out
- Systematic sampling variance approx SRS if no periodicity, but if period matches k, bias up to 50%
- In manufacturing QC, systematic every 10th item n=100 from 1000, detects trends better, efficiency 1.1x SRS
- Random start systematic: Var(mean) = (S^2/n) * ((N-1)/N) * [1 + (n-1) rho_w], where rho_w is the intraclass correlation between units in the same systematic sample; positive rho_w inflates the variance, negative rho_w reduces it
- Comparison study: systematic vs SRS in voter lists, bias 0.8% if birthdays periodic
- Circular systematic for clusters: better coverage, var reduction 15% in spatial data
- In inventory auditing, systematic every 50th item, time saving 40% vs SRS, precision similar
- Periodicity test: run sum statistic detects if var > SRS by >20%
- Python impl: numpy.arange(start, N, k)[:n] with start drawn at random from 0..k-1 (0-based indices), uniform spacing
- In ecological transects, systematic points every 10m, density estimate bias <2%
- Frame sorted by time: systematic catches trends, intra-element corr rho=0.3 doubles efficiency
- Multi-stage systematic: PPS at first, fixed interval later, used in LFS, cost low
- Variance estimation: treat as single cluster, replicate or difference methods, SE 10% higher if periodic
- In opinion polls, systematic from alphabetical list, response bias 3% lower than convenience
- For time series, systematic monthly samples, forecast error 12% vs SRS 18%
- k=sqrt(N) optimal for unknown corr, balances spread and size
- In hospital audits, systematic patient records every 20th, compliance rate 92% ±2.5%
- Simulation 10k runs: no periodicity rho=0, var= SRS; rho=0.5, var=1.2 SRS
- GPS systematic grid sampling in forestry, volume estimate precision 8% better spatial coverage
- Compared to stratified, systematic simpler, 90% efficiency if random order frame
- In big data streaming, systematic subsampling rate 1/k, memory save 95%, bias low
- Election precincts systematic select, turnout estimate ±1.9%, n=500
- Double systematic: two starts, average reduces var 20%
- In quality control SPC, systematic subgrouping, ARL reduction 15% for shifts
- Agricultural field trials, systematic plots in rows, fertility gradient bias corrected by differencing
- Web scraping systematic URLs, representativeness 85% vs random 92%, faster 3x
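The selection rule and the periodicity risk described above can be sketched as follows. This is a minimal version assuming a list-like frame and an integer interval k = N // n; the function name and the toy frames are hypothetical.

```python
import random


def systematic_sample(frame, n, rng=None):
    """Take every k-th unit after a random start in 0..k-1, with k = N // n."""
    if rng is None:
        rng = random.Random()
    N = len(frame)
    k = N // n
    if k < 1:
        raise ValueError("sample size n cannot exceed the frame size")
    start = rng.randrange(k)
    # When N is not a multiple of n, the last N - n*k units can never be
    # selected; circular systematic sampling is one way around that.
    return [frame[start + i * k] for i in range(n)]


# Evenly spread selection from an ordered frame of 1,000 units.
picked = systematic_sample(list(range(1000)), 100, random.Random(0))
print(len(picked), picked[1] - picked[0])   # 100 10

# Periodicity hazard: the frame repeats every 10 units and k is also 10,
# so every sampled unit carries the same value.
periodic_frame = [i % 10 for i in range(1000)]
print(len(set(systematic_sample(periodic_frame, 100, random.Random(1)))))  # 1
```

The second example is the extreme case of the periodicity problem: the sample has no internal variation at all, so any variance estimate computed from it would be badly misleading.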