Brief introduction to survey data analysis ideas
This is a short set of bullet-point notes by Chris Wild.
In almost all of the data analysis you have learned to do, the computer programs essentially assume that your observations come from a random sample from some process or infinite population.
Technically, for a random sample all observations are "independent and identically distributed", or in practice in the survey context:
- all individuals have same probability of selection
- individuals are selected independently of one another
- Note: Sampling at random from a finite population with replacement meets these conditions but we don't sample that way in practice
Survey data
- Survey data is typically obtained using more complicated random sampling schemes
- that do not meet the technical requirements of "a random sample"
- Survey samples typically use stratified sampling, cluster sampling, etc.
- i.e., they use complex sampling designs
- Units are sampled from finite populations without replacement
- Different individuals may even have different probabilities of being sampled!
- If you use standard programs for survey data, all the answers you get can be grossly wrong
- Wrong estimates, wrong standard errors, wrong confidence intervals, wrong p-values, ...
- Special programs know how to do these things properly
- But you have to tell the program how you obtained your data
Why not just do a simple random sample?
(e.g. get a list of all the people and draw a random sample without replacement)
- After all it's a really simple idea -- so why does no one ever do it with face-to-face interview surveys?
- The main reason is that it costs too much
- No one could ever afford all the travel time and travel costs to deliver interviewers to the doors of the randomly selected houses
- If we get someone to a location we want them to do as much of their work in the close vicinity of that location as possible
- Ridiculous extreme: "sample" all New Zealanders by taking everyone from a few streets in one suburb of Auckland
- Really cheap but completely useless, unrepresentative "sample of all NZers"
- So we will need to find out about how to make sensible tradeoffs
What do Agencies (e.g. Stats NZ) want to estimate from their data?
- Mainly means, totals, proportions (percentages) and ratios
- For the whole survey population and also broken out by subgroups
What do medical and social science researchers want to estimate from their data?
As for Agencies and also things like ...
- regressions
- logistic regressions etc.
All that is new here is that we use special programs designed for survey data and the program needs to be told how the sampling was done.
Apart from that it is pretty much business as usual
What is simple random sampling?
- Involves sampling without replacement
- all possible samples are equally likely to be chosen
- Thus, each unit/person in the population is selected with equal probability
- To take a simple random sample you need a list of all units in the population (sampling frame)
- Each observation unit is assigned a number and a sample is selected so that each unit has the same chance of occurring in the sample
- can be thought of as “drawing numbers from a hat”
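As a minimal sketch of the "numbers from a hat" idea, the Python snippet below draws a simple random sample without replacement from a made-up sampling frame. The frame, the function name `simple_random_sample` and the sample size are illustrative assumptions, not part of these notes.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units from the sampling frame, without replacement.

    random.Random.sample gives every subset of size n the same chance of
    being chosen, which is the defining property of a simple random sample.
    """
    rng = random.Random(seed)  # the seed only makes the illustration repeatable
    return rng.sample(frame, n)

# Hypothetical sampling frame: a numbered list of 1000 households.
frame = [f"household_{i}" for i in range(1, 1001)]
sample = simple_random_sample(frame, 50, seed=1)

# Both counts are 50: no household is drawn twice.
print(len(sample), len(set(sample)))
```

Note that the sketch only works because the whole frame is available as a list, which is exactly the requirement stated above.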
Strengths
- Requires no information other than sampling frame
- e.g. no assumptions about the distribution of population values
- Reasonably efficient when we do not have much prior knowledge
- Widely accepted as being “fair”, unbiased
- Simple theory and analysis
- Can use standard software if sample size is less than about 10% of population size
- Otherwise may need “finite population corrections” (fpc)
Weaknesses
- Often expensive and time-consuming
- Makes no use of any additional information we might have about the population
- Sampling frame may be difficult to obtain
- requires an accurate list of the whole population which may be impossible to get
- Very sensitive to non-response and other non-sampling errors
Elements of most survey sampling designs used in practice
- Sampling without replacement
- Complex sampling, some or all of:
- Stratified sampling
- Cluster sampling
- Unequal probabilities of selection (in some designs)
Why sample without replacement?
- Drawing a unit out of a hat, measuring it, putting it back in the hat, and then measuring it again on some subsequent draw seldom makes any practical sense
What are the consequences of ignoring sampling without replacement in the analysis?
- The usual standard errors of estimates of characteristics of the finite population are too big if the sample size n makes up a substantial fraction of the population size N
- Roughly: actual standard error ≈ usual standard error × sqrt(1 − n/N)
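The rough correction above can be sketched in a few lines of Python. The function name `fpc_standard_error` and the numbers used are assumptions chosen purely for illustration.

```python
import math

def fpc_standard_error(usual_se, n, N):
    """Apply the rough finite population correction sqrt(1 - n/N).

    usual_se: standard error reported by a standard (non-survey) program
    n: sample size, N: population size
    """
    if not 0 < n <= N:
        raise ValueError("need 0 < n <= N")
    return usual_se * math.sqrt(1 - n / N)

# Sampling 10% of the population shrinks the standard error by only about 5%.
print(round(fpc_standard_error(2.0, 100, 1000), 3))   # about 1.897
# Sampling half of the population shrinks it by about 29%.
print(round(fpc_standard_error(2.0, 500, 1000), 3))   # about 1.414
# A complete census (n = N) leaves no sampling error at all.
print(fpc_standard_error(2.0, 1000, 1000))            # 0.0
```

The first case shows why the correction can usually be ignored when the sample is a small fraction of the population.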
What is stratified sampling?
- Divide the population into non-overlapping groups, called: strata (singular=stratum)
- so that each unit belongs to one, and only one, of the strata (groups)
- Take a sample of units from within each/every stratum (group)
- (e.g. the strata could correspond to geographical regions, or to age groups)
- Note: Stratified sampling tends to subdivide the population into a relatively small number of groups (then called strata), whereas cluster sampling tends to involve a larger number of groups (then called clusters). They differ in how we then use these groups when we draw samples.
- If we are thinking in terms of strata, it is because we plan to collect data from every one of the groups.
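A minimal Python sketch of drawing a stratified sample follows. The population, the region labels and the helper name `stratified_sample` are hypothetical; the point is only that every stratum contributes a sample of its own.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, n_per_stratum, seed=None):
    """Draw a simple random sample, without replacement, inside every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        # Each unit belongs to exactly one stratum.
        strata[stratum_of(unit)].append(unit)
    # Every stratum is sampled; none is skipped.
    return {name: rng.sample(members, n_per_stratum[name])
            for name, members in strata.items()}

# Hypothetical population of (person_id, region): 100 "north", 300 "south".
population = [(i, "north" if i % 4 == 0 else "south") for i in range(400)]
picked = stratified_sample(population, lambda u: u[1],
                           {"north": 10, "south": 10}, seed=2)

for region, people in picked.items():
    print(region, len(people))
```

Here the two regions get equal sample sizes even though their population sizes differ, so the sampling fractions are unequal (10% in the north, about 3% in the south).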
Why do stratified sampling?
- We can use it to increase the precision of estimates (i.e. reduce their standard errors)
- We may have a way of defining strata so that individuals within the same stratum tend to be more similar (homogeneous) and those from different strata are more different (heterogeneous). (e.g. if we are interested in incomes in Auckland, stratifying on suburb would tend to do this)
In this case, sensible stratified sampling leads to more precise estimates of quantities relating to the whole population than a simple random sample (i.e. estimates which have smaller standard errors, giving narrower confidence intervals …)
- It can provide some protection against bad (unlucky) samples
- We can ensure that the sample proportions in groups we particularly care about are the same as the population proportions (e.g. sample 50 males and 50 females. If we randomly sampled 100 people the sample proportions of males and females we got could be quite far from 50-50)
- We may want to report at the level of the strata (e.g. report the mean income for each region) and want to control how much data is collected in each stratum
- e.g. if we want to report incomes for Maori, Pacifica, European, Asian and Other we may want to sample the same numbers of people from each group so that all of these estimates have similar accuracies.
- It allows us to use different sampling methods in each stratum
- (e.g. telephone in rural areas and face-to-face interviews in cities)
- Interviewers can be trained to deal well with a particular stratum
- It often makes good practical sense to include more of “the big important units”
- (e.g. take all of the very large companies, sample 30% of the midsize companies and 5% of small companies)
What are the consequences of ignoring stratified sampling in the analysis?
- Standard errors reported from standard (non-survey) programs tend to be too big
- Estimates relating to the whole population from standard programs are often wrong
- They tend to be wrong unless the proportions of the total sample size allotted to each stratum closely approximate the corresponding proportions of the population that belong to that stratum
- i.e. unless each nj/n is approximately equal to Nj/N (“proportional allocation”)
- Here nj is the number sampled in stratum j, n=∑nj, and Nj is the population number in stratum j. (The population size is N=∑Nj.)
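To see why whole-population estimates go wrong without proportional allocation, the Python sketch below compares the ordinary pooled mean with a stratified estimate that weights each stratum mean by Nj/N. The data, the stratum names and the function names are invented for illustration; the weighting by Nj/N is the usual stratified estimator of a mean, shown here as a sketch rather than as what any particular survey program does.

```python
def pooled_mean(samples):
    """Mean of all sampled values, ignoring strata (what a standard program does)."""
    values = [y for ys in samples.values() for y in ys]
    return sum(values) / len(values)

def stratified_mean(samples, pop_sizes):
    """Weight each stratum's sample mean by its population share Nj/N."""
    N = sum(pop_sizes.values())
    return sum(pop_sizes[j] / N * (sum(ys) / len(ys))
               for j, ys in samples.items())

# Hypothetical data: equal sample sizes (nj/n = 0.5 each) but Nj/N = 0.8 and 0.2.
samples = {"urban": [8, 9, 10, 11, 12], "rural": [18, 19, 20, 21, 22]}
pop_sizes = {"urban": 800, "rural": 200}

print(pooled_mean(samples))                 # 15.0, over-represents the rural stratum
print(stratified_mean(samples, pop_sizes))  # about 12, respects the population shares
```

With proportional allocation (for example 8 urban and 2 rural observations) the two functions would agree, which is the condition described above.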
What is cluster sampling?
- Think in terms of all units in the entire population being subdivided into non-overlapping groups called clusters, usually on the basis of physical proximity (close together)
- (e.g. if units are households we could treat all houses in the same street as forming a cluster, or all pupils in the same school could be a cluster)
- A cluster sample would select a sample of clusters from a list of all of the clusters and then select all of the units from the selected clusters
- (e.g. sample streets from a list of streets and then take all houses in the sampled streets)
- Multistage cluster sampling employs the clustering idea at several levels
- (e.g. sample schools from a list of schools and, for each selected school, sample classes from the list of classes in that school and then either take all or a sample of students from each of the selected classes. OR select towns, then census blocks within towns, then households within census blocks and then, finally, people within households)
- Note: Cluster sampling tends to employ a relatively large number of groups (then called clusters) whereas stratified sampling tends to involve a small number of groups (then called strata). They differ in how we then use these groups in our sampling plan. If we are thinking in terms of strata, it is because we plan to collect data from each and every group.
If we are thinking in terms of clusters, it is because we plan to collect data only from a sampled subset of the groups.
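The streets-and-houses example can be sketched in Python as follows. The street and house names and the function name `one_stage_cluster_sample` are made up; the sketch shows a one-stage design in which clusters are sampled and then every unit in a chosen cluster is taken.

```python
import random

def one_stage_cluster_sample(clusters, n_clusters, seed=None):
    """Sample whole clusters, then keep every unit in each sampled cluster."""
    rng = random.Random(seed)
    # The random draw is over cluster names, not over individual units.
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}

# Hypothetical frame: 20 streets (clusters) with 5 houses (units) each.
streets = {f"street_{s}": [f"house_{s}_{h}" for h in range(1, 6)]
           for s in range(1, 21)}
sampled = one_stage_cluster_sample(streets, 3, seed=3)

# 3 streets are visited and all 15 of their houses are included.
print(len(sampled), sum(len(houses) for houses in sampled.values()))
```

Only the list of streets is needed up front; house lists are required for the selected streets alone. A multistage version would replace "keep every unit" with a further random draw inside each selected cluster.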
Why do cluster sampling?
- It can be much cheaper than simple random sampling
- Units in a cluster are closer together (e.g. reducing travelling time)
- We can obtain information from a single source (which also reduces costs)
- So we can often get more accuracy for the same cost (or the same accuracy for a reduced cost)
- We don’t need a complete sampling frame of all individuals in the population, only lists of clusters and then lists of units (or sub-clusters) for the selected clusters only
- If we want to do interventions, we can often only apply them at the level of the cluster
- E.g. use different teaching methods on different classes
What are the consequences of ignoring cluster sampling?
- Cluster sampling generally leads to
- positive correlations between units in the same cluster
- An effective sample size which is smaller than the total number of units observed
- We have “less information” than we would from a simple random sample with the same number of units in it
- The effective sample size can be closer to the number of clusters sampled than to the number of units finally obtained
- Design effects (actually 1/d.eff) give indications of efficiency loss (described in later lectures)
- Standard errors reported from standard (non-survey) programs tend to be too small
- As a result, nominal 95% confidence intervals cover the true value less than 95% of the time
- Estimates from standard programs relating to the whole population are often wrong
- A simple (one-stage) cluster sample selects a sample of clusters from a list of all of the clusters and then takes all of the units from the selected clusters
- e.g. sample streets from a list of streets and then take all houses in the sampled streets
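The point made earlier about effective sample size can be illustrated with a small Python sketch. It uses a common textbook approximation for the design effect, d.eff ≈ 1 + (m − 1)ρ, where m is the (equal) cluster size and ρ is the correlation between units in the same cluster. This formula is not given in these notes and is an assumption of the sketch, as are the function names and numbers.

```python
def design_effect(cluster_size, icc):
    """Approximate design effect, assuming equal-sized clusters.

    icc is the within-cluster (intra-cluster) correlation, between 0 and 1.
    """
    return 1 + (cluster_size - 1) * icc

def effective_sample_size(n_units, cluster_size, icc):
    """Size of the simple random sample carrying about the same information."""
    return n_units / design_effect(cluster_size, icc)

# Hypothetical survey: 50 clusters of 20 units each, 1000 units in total.
for icc in (0.0, 0.1, 1.0):
    print(icc, round(effective_sample_size(1000, 20, icc), 1))
```

With no within-cluster correlation the effective sample size is the full 1000; with a correlation of 0.1 it drops to roughly 345; and when units in a cluster are identical (correlation 1) it equals 50, the number of clusters.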