We start by downloading the Chilean Census 2017 from ECLAC website (link valid as of 2024-10-01):
url <- "https://redatam.org/cdr/descargas/censos/poblacion/CP2017CHL.zip"
zip <- "CP2017CHL.zip"
if (!file.exists(zip)) {
download.file(url, zip, method = "wget")
}
Now we can extract the files:
# install.packages("archive")
dout <- basename(zip)
dout <- sub("\\.zip$", "", dout)
archive::archive_extract(zip, dir = dout)
You can use unzip()
from base R, but this file in
particular gave me an error. The archive
package is a
wrapper around ‘libarchive’ that provides multi-format archive and
compression support.
The REDATAM files are now stored in the CP2017CHL
directory. We can read the REDATAM dictionary file (DIC or DICX for this
particular case):
library(redatam)
fout <- "chile2017.rds"
if (!file.exists(fout)) {
chile2017 <- read_redatam("CP2017CHL/BaseOrg16/CPV2017-16.dicx")
saveRDS(chile2017, fout)
} else {
chile2017 <- readRDS(fout)
}
One of the many possibilities with this census is to obtain the number of houses with overcrowding. For this, the Secretary for Social Development and Family (Ministerio de Desarrollo Social y Familia) divides the number of people residing in a dwelling and the number of bedrooms in the dwelling, with the special case of adding one to studio apartments and similar units, and the result is discretized as follows.
According to the census documentation in the previous ZIP file, this
consists in dividing the variables cant_pers
and
p04
from the vivienda
(housing) table to then
discretize the result. The documentation also states that we must join
the vivienda
table with zonaloc
(zones),
area
, distrito
(district) and
communa
(municipality) to match each house with its
corresponding municipality. This can be done with
dplyr
:
library(dplyr)
overcrowding <- chile2017$comuna %>%
select(ncomuna, comuna_ref_id) %>%
inner_join(
chile2017$distrito %>%
select(distrito_ref_id, comuna_ref_id)
) %>%
inner_join(
chile2017$area %>%
select(area_ref_id, distrito_ref_id)
) %>%
inner_join(
chile2017$zonaloc %>%
select(zonaloc_ref_id, area_ref_id)
) %>%
inner_join(
chile2017$vivienda %>%
select(zonaloc_ref_id, vivienda_ref_id, cant_per, p04) %>%
mutate(
p04 = case_when(
p04 == 98 ~ NA_integer_,
p04 == 99 ~ NA_integer_,
TRUE ~ p04
)
) %>%
filter(!is.na(p04))
) %>%
mutate(
overcrowding = case_when(
p04 >=1 ~ cant_per / p04,
p04 ==0 ~ cant_per / (p04 + 1)
)
) %>%
mutate(
overcrowding_discrete = case_when(
overcrowding < 2.5 ~ "No Overcrowding",
overcrowding >= 2.5 & overcrowding < 3.5 ~ "Mean",
overcrowding >= 3.5 & overcrowding < 5 ~ "High",
overcrowding >= 5 ~ "Critical"
)
) %>%
group_by(comuna = ncomuna, overcrowding_discrete) %>%
count()
Now we can filter for any municipality of our interest, for example:
# A tibble: 4 × 3
# Groups: comuna, overcrowding_discrete [4]
comuna overcrowding_discrete n
<fct> <chr> <int>
1 VITACURA Critical 9
2 VITACURA High 18
3 VITACURA Mean 174
4 VITACURA No Overcrowding 26752
# A tibble: 4 × 3
# Groups: comuna, overcrowding_discrete [4]
comuna overcrowding_discrete n
<fct> <chr> <int>
1 LA PINTANA Critical 497
2 LA PINTANA High 1112
3 LA PINTANA Mean 4522
4 LA PINTANA No Overcrowding 39163