Reading REDATAM files in R

We start by downloading the Chilean Census 2017 from ECLAC website (link valid as of 2024-10-01):

url <- "https://redatam.org/cdr/descargas/censos/poblacion/CP2017CHL.zip"
zip <- "CP2017CHL.zip"

if (!file.exists(zip)) {
  download.file(url, zip, method = "wget")
}

Now we can extract the files:

# install.packages("archive")
dout <- basename(zip)
dout <- sub("\\.zip$", "", dout)
archive::archive_extract(zip, dir = dout)

You can use unzip() from base R, but this file in particular gave me an error. The archive package is a wrapper around ‘libarchive’ that provides multi-format archive and compression support.

The REDATAM files are now stored in the CP2017CHL directory. We can read the REDATAM dictionary file (DIC or DICX for this particular case):

library(redatam)

fout <- "chile2017.rds"

if (!file.exists(fout)) {
  chile2017 <- read_redatam("CP2017CHL/BaseOrg16/CPV2017-16.dicx")
  saveRDS(chile2017, fout)
} else {
  chile2017 <- readRDS(fout)
}

One of the many possibilities with this census is to obtain the number of houses with overcrowding. For this, the Secretary for Social Development and Family (Ministerio de Desarrollo Social y Familia) divides the number of people residing in a dwelling and the number of bedrooms in the dwelling, with the special case of adding one to studio apartments and similar units, and the result is discretized as follows.

No overcrowding [0, 2.5)
Mean [2.5, 3.5)
High [3.5, 4.9)
Critical [5, Inf)

According to the census documentation in the previous ZIP file, this consists in dividing the variables cant_pers and p04 from the vivienda (housing) table to then discretize the result. The documentation also states that we must join the vivienda table with zonaloc (zones), area, distrito (district) and communa (municipality) to match each house with its corresponding municipality. This can be done with dplyr:

library(dplyr)

overcrowding <- chile2017$comuna %>%
  select(ncomuna, comuna_ref_id) %>%
  inner_join(
    chile2017$distrito %>%
    select(distrito_ref_id, comuna_ref_id)
  ) %>%
  inner_join(
    chile2017$area %>%
      select(area_ref_id, distrito_ref_id)
  ) %>%
  inner_join(
    chile2017$zonaloc %>%
      select(zonaloc_ref_id, area_ref_id)
  ) %>%
  inner_join(
    chile2017$vivienda %>%
      select(zonaloc_ref_id, vivienda_ref_id, cant_per, p04) %>%
      mutate(
        p04 = case_when(
          p04 == 98 ~ NA_integer_,
          p04 == 99 ~ NA_integer_,
          TRUE ~ p04
        )
      ) %>%
      filter(!is.na(p04))
  ) %>%  
  mutate(
    overcrowding = case_when(
      p04 >=1 ~ cant_per / p04,
      p04 ==0 ~ cant_per / (p04 + 1)
    )
  ) %>% 
  mutate(
    overcrowding_discrete = case_when(
      overcrowding  < 2.5                      ~ "No Overcrowding",
      overcrowding >= 2.5 & overcrowding < 3.5 ~ "Mean",
      overcrowding >= 3.5 & overcrowding < 5   ~ "High",
      overcrowding >= 5                        ~ "Critical"
    )
  ) %>%
  group_by(comuna = ncomuna, overcrowding_discrete) %>%
  count()

Now we can filter for any municipality of our interest, for example:

overcrowding %>%
  filter(comuna == "VITACURA")

overcrowding %>%
  filter(comuna == "LA PINTANA")

# A tibble: 4 × 3
# Groups:   comuna, overcrowding_discrete [4]
  comuna   overcrowding_discrete     n
  <fct>    <chr>                 <int>
1 VITACURA Critical                  9
2 VITACURA High                     18
3 VITACURA Mean                    174
4 VITACURA No Overcrowding       26752

# A tibble: 4 × 3
# Groups:   comuna, overcrowding_discrete [4]
  comuna     overcrowding_discrete     n
  <fct>      <chr>                 <int>
1 LA PINTANA Critical                497
2 LA PINTANA High                   1112
3 LA PINTANA Mean                   4522
4 LA PINTANA No Overcrowding       39163