The data originate from the Demographic and Health Survey program. Deepankar Basu provided an SPSS file containing data for 1000 Indian mothers; the data set described here is supplied as a space-delimited ASCII text file. It contains demographic information for each mother: age, years of education, social status, and whether she is employed outside the home. It also records the gender of each of the mother's children and a count of how many of those children are currently alive.
The data set contains the following variables:
age - The mother's age.
edu - The number of years of formal schooling that the mother has received.
alive - How many of the mother's children are still alive.
middle - Whether the mother is middle-class.
poor - Whether the mother is poor.
work - Whether the mother works for pay outside the home.
cord1 to cord14 - The gender of the mother's first child, second child, etc., up to the fourteenth child.
The data set is provided as a space-delimited ASCII text file called india.txt.
The file indianMothersMeta.xml provides a StatDataML description of the data set.
The first task is to read the data set into R. The result should look like this:
  cord1 cord2 cord3 cord4 cord5 cord6 cord7 cord8 cord9 cord10 cord11 cord12 cord13 cord14 age edu alive middle poor work
1     2     1     1    NA    NA    NA    NA    NA    NA     NA     NA     NA     NA     NA  30   0     3      0    1    1
2     2    NA    NA    NA    NA    NA    NA    NA    NA     NA     NA     NA     NA     NA  32   0     1      0    1    0
3     2     1     1     1     2    NA    NA    NA    NA     NA     NA     NA     NA     NA  28   0     5      0    1    1
4     2     1     1     2    NA    NA    NA    NA    NA     NA     NA     NA     NA     NA  39   0     4      1    0    0
5     1     1    NA    NA    NA    NA    NA    NA    NA     NA     NA     NA     NA     NA  20   0     2      1    0    0
6     2     1     1    NA    NA    NA    NA    NA    NA     NA     NA     NA     NA     NA  25   0     3      1    0    1
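One way to read such a file is with read.table(). The sketch below is only an illustration: because india.txt itself is not available here, it writes the six mothers shown above to a temporary file first. The presence of a header line and the use of "NA" for missing values are assumptions about the real file, and the object name india is chosen for illustration.

```r
# Stand-in for india.txt: the six mothers listed above, written to a
# temporary file. The header line and "NA" as the missing-value marker
# are assumptions about the layout of the real file.
cords <- paste0("cord", 1:14)
header <- paste(c(cords, "age", "edu", "alive", "middle", "poor", "work"),
                collapse = " ")
rows <- c(
  "2 1 1 NA NA NA NA NA NA NA NA NA NA NA 30 0 3 0 1 1",
  "2 NA NA NA NA NA NA NA NA NA NA NA NA NA 32 0 1 0 1 0",
  "2 1 1 1 2 NA NA NA NA NA NA NA NA NA 28 0 5 0 1 1",
  "2 1 1 2 NA NA NA NA NA NA NA NA NA NA 39 0 4 1 0 0",
  "1 1 NA NA NA NA NA NA NA NA NA NA NA NA 20 0 2 1 0 0",
  "2 1 1 NA NA NA NA NA NA NA NA NA NA NA 25 0 3 1 0 1"
)
path <- tempfile(fileext = ".txt")
writeLines(c(header, rows), path)

# For the real data this would be read.table("india.txt", header = TRUE).
india <- read.table(path, header = TRUE)
head(india)
```

If the real file has no header line, header = FALSE together with col.names would be needed instead.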
The original format, with one column (variable) for each child, is inefficient: most mothers have nowhere near 14 children, so the table contains large swathes of missing values. The table below shows the number of mothers with 1, 2, 3, etc. children:
  1   2   3   4   5   6
 61 447 296 125  52  19
So, in fact, there is no need (in this particular set of 1000 Indian mothers) for any column beyond cord6.
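A table of family sizes like the one above can be obtained by counting the non-missing cord values in each row. The following is a minimal sketch on toy data: the object name india is an assumption, and the six mothers are those from the listing earlier, not the full data set.

```r
# Toy data: child genders for the six mothers listed earlier
# (1 = boy, 2 = girl, NA = no such child).
cords <- paste0("cord", 1:14)
m <- matrix(NA_integer_, nrow = 6, ncol = 14, dimnames = list(NULL, cords))
m[1, 1:3] <- c(2L, 1L, 1L)
m[2, 1]   <- 2L
m[3, 1:5] <- c(2L, 1L, 1L, 1L, 2L)
m[4, 1:4] <- c(2L, 1L, 1L, 2L)
m[5, 1:2] <- c(1L, 1L)
m[6, 1:3] <- c(2L, 1L, 1L)
india <- as.data.frame(m)

# Number of children per mother = number of non-missing cord columns.
nChildren <- rowSums(!is.na(india[cords]))

# Number of mothers with 1, 2, 3, ... children.
table(nChildren)
```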
A more efficient approach would be just to have one row for each child. We want to produce output like that below:
  age edu alive middle poor work id  born value
1  30   0     3      0    1    1  1 cord1     2
2  32   0     1      0    1    0  2 cord1     2
3  28   0     5      0    1    1  3 cord1     2
4  39   0     4      1    0    0  4 cord1     2
5  20   0     2      1    0    0  5 cord1     1
6  25   0     3      1    0    1  6 cord1     2
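The reshaping can be sketched in base R by stacking the cord columns and repeating the mother-level columns alongside them. This is one possible approach under stated assumptions, not necessarily the method intended by the exercise: the name india for the wide data frame is hypothetical, the data are the six mothers from the earlier listing, and dropping rows with a missing value is a choice made here.

```r
# Toy wide-format data: the six mothers from the earlier listing.
cords <- paste0("cord", 1:14)
m <- matrix(NA_integer_, nrow = 6, ncol = 14, dimnames = list(NULL, cords))
m[1, 1:3] <- c(2L, 1L, 1L)
m[2, 1]   <- 2L
m[3, 1:5] <- c(2L, 1L, 1L, 1L, 2L)
m[4, 1:4] <- c(2L, 1L, 1L, 2L)
m[5, 1:2] <- c(1L, 1L)
m[6, 1:3] <- c(2L, 1L, 1L)
india <- data.frame(m,
                    age = c(30, 32, 28, 39, 20, 25), edu = 0,
                    alive = c(3, 1, 5, 4, 2, 3),
                    middle = c(0, 0, 0, 1, 1, 1),
                    poor = c(1, 1, 1, 0, 0, 0),
                    work = c(1, 0, 1, 0, 0, 1))

# Identify each mother, then repeat her columns once per cord column.
india$id <- seq_len(nrow(india))
others <- setdiff(names(india), cords)
n <- nrow(india)
indiaLong <- data.frame(
  india[rep(seq_len(n), times = length(cords)), others],
  born = factor(rep(cords, each = n), levels = cords),
  value = unlist(india[cords], use.names = FALSE)
)

# Keep one row per actual child and renumber the rows.
indiaLong <- indiaLong[!is.na(indiaLong$value), ]
rownames(indiaLong) <- NULL
head(indiaLong)
```

Keeping born as a factor with all fourteen levels means later cross-tabulations still show cord7 to cord14, with zero counts.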
The gender variable in the data set (called "value" in the reshaped format) is just a numeric value: 1 or 2. This would be much better as a factor with meaningful level labels.
Create a new variable called "gender" in the reshaped indiaLong data frame: a factor with levels "boy" (value equals 1) and "girl" (value equals 2).
The result should look like this:
  age edu alive middle poor work id  born value gender
1  30   0     3      0    1    1  1 cord1     2   girl
2  32   0     1      0    1    0  2 cord1     2   girl
3  28   0     5      0    1    1  3 cord1     2   girl
4  39   0     4      1    0    0  4 cord1     2   girl
5  20   0     2      1    0    0  5 cord1     1    boy
6  25   0     3      1    0    1  6 cord1     2   girl
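A minimal sketch of the recoding, using factor() with explicit levels and labels. The small indiaLong below is a toy stand-in containing only the six first-born children shown above.

```r
# Toy stand-in for indiaLong: the six first-born children shown above.
indiaLong <- data.frame(id = 1:6,
                        born = "cord1",
                        value = c(2, 2, 2, 2, 1, 2))

# Map the numeric codes to labelled factor levels: 1 -> "boy", 2 -> "girl".
indiaLong$gender <- factor(indiaLong$value,
                           levels = c(1, 2),
                           labels = c("boy", "girl"))
indiaLong
```

Giving levels explicitly fixes the order of the labels even if one of the codes happens not to occur in the data.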
This exercise generates some simple counts to look at the number of girls versus the number of boys, under various conditions.
How many boys/girls overall?
 boy girl
1421 1296
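An overall count like this is a one-way table of the gender factor. The sketch below uses toy data (the 18 children of the six mothers listed earlier), so its counts differ from those of the full data set.

```r
# Toy data: gender codes for the 18 children of the six mothers listed
# earlier (1 = boy, 2 = girl), one list element per mother.
kids <- list(c(2, 1, 1), 2, c(2, 1, 1, 1, 2), c(2, 1, 1, 2), c(1, 1), c(2, 1, 1))
indiaLong <- data.frame(value = unlist(kids))
indiaLong$gender <- factor(indiaLong$value, levels = c(1, 2),
                           labels = c("boy", "girl"))

# One-way frequency table of gender.
genderCounts <- table(indiaLong$gender)
genderCounts
```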
How many boys/girls cross-classified by birth order?
        gender
born     boy girl
  cord1  527  473
  cord2  479  460
  cord3  261  231
  cord4  111   85
  cord5   33   38
  cord6   10    9
  cord7    0    0
  cord8    0    0
  cord9    0    0
  cord10   0    0
  cord11   0    0
  cord12   0    0
  cord13   0    0
  cord14   0    0
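A two-way table of birth order against gender can be sketched with table(). The data here are a toy version (the 18 children of the six mothers listed earlier), so the counts are illustrative only.

```r
# Toy data: one list element per mother, giving her children's gender
# codes in birth order (1 = boy, 2 = girl).
kids <- list(c(2, 1, 1), 2, c(2, 1, 1, 1, 2), c(2, 1, 1, 2), c(1, 1), c(2, 1, 1))
n <- lengths(kids)
indiaLong <- data.frame(
  # sequence(n) numbers each mother's children 1, 2, 3, ...
  born = factor(paste0("cord", sequence(n)), levels = paste0("cord", 1:14)),
  value = unlist(kids)
)
indiaLong$gender <- factor(indiaLong$value, levels = c(1, 2),
                           labels = c("boy", "girl"))

# Cross-classify birth order by gender; unused factor levels appear as zeros.
byBirthOrder <- table(born = indiaLong$born, gender = indiaLong$gender)
byBirthOrder
```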
How many boys/girls cross-classified by social status?
            gender  boy girl
middle poor
0      0              3    1
       1            410  370
1      0           1008  925
       1              0    0
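A layout like the one above, with middle and poor as row variables and gender across the columns, is what ftable() produces from a three-way table. The sketch below assumes that is how the output was generated and uses toy data (the six mothers listed earlier), so the counts are illustrative only.

```r
# Toy data: one list element per mother (1 = boy, 2 = girl), with the
# mother-level middle and poor indicators repeated for each child.
kids <- list(c(2, 1, 1), 2, c(2, 1, 1, 1, 2), c(2, 1, 1, 2), c(1, 1), c(2, 1, 1))
n <- lengths(kids)
indiaLong <- data.frame(
  id = rep(1:6, times = n),
  middle = rep(c(0, 0, 0, 1, 1, 1), times = n),
  poor = rep(c(1, 1, 1, 0, 0, 0), times = n),
  value = unlist(kids)
)
indiaLong$gender <- factor(indiaLong$value, levels = c(1, 2),
                           labels = c("boy", "girl"))

# Three-way counts, then flattened so gender runs across the columns.
statusCounts <- xtabs(~ middle + poor + gender, data = indiaLong)
ftable(statusCounts)
```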