Chapter 2 Data sources

The first dataset that caught our attention was the ARED (Authoritarian Ruling Elites Database), published on Harvard Dataverse.

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/QZ9BSA

This dataset contains data on the ruling elites of 22 authoritarian regimes around the world, spanning different continents and time periods.

The dataset primarily highlights the demographics, time in power, and exit method of each of these elites.

We have also found this repository of datasets on democracy very helpful. In particular, we have used the Democracy and Dictatorship Dataset (PACL).

https://xmarquez.github.io/democracyData/reference/pacl.html

We also used Gapminder's democracy index, found here:

https://www.gapminder.org/data/documentation/democracy-index/

We are not including the two datasets above in this repository since they are fairly large; they can easily be downloaded from the links provided.
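For convenience, here is a minimal sketch of how the two datasets could be loaded, assuming the democracyData R package from the repository linked above (which appears to export the PACL table as a pacl object) and a locally downloaded copy of the Gapminder spreadsheet; the file path is a placeholder.

```r
# A hedged sketch of loading the two large datasets; assumes the
# democracyData package and a locally downloaded Gapminder file
# (the path below is a placeholder, not the actual file name).
# install.packages("remotes")
# remotes::install_github("xmarquez/democracyData")
library(democracyData)

pacl_data <- pacl  # PACL appears to ship as a bundled data frame

dem_index <- readxl::read_excel("data/gapminder_democracy_index.xlsx")
```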

The code used to generate GIFs from these two datasets also takes a while to run. While the code is provided in this bookdown, we have chosen to embed the GIFs created by our code in the .Rmd files instead of running the code directly.
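For illustration, one common way to embed a pre-rendered GIF in an .Rmd file is through knitr::include_graphics(); the file name below is a placeholder for the GIFs our code produces.

```r
# Embed a pre-rendered GIF instead of re-running the animation code;
# the file path is a placeholder for the GIF our code generates.
knitr::include_graphics("images/world_map_animation.gif")
```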

2.0.1 Data transformation for ARED datasets

The data set on authoritarian government elites is published via the Harvard Dataverse project. Each elite occupies one row in the data set, and the columns include both demographic and professional information about the elites.

The original data set was published as 23 separate files. For this project, we have condensed them into one unified file with consistent columns and value formats. The result is a single table suitable for cross-regime comparison.

The files have undergone considerable cleaning in order to be ready for the cross-regime analysis we intend to perform. Here are some of the actions we have taken (a code sketch follows the list):

  1. We have replaced missing values with “na” for most columns.

  2. For columns with dates as values, we reduced each entry to just the year, since many entries are missing the exact date of the event. In the grand scheme of things, exact dates rarely add to our insights.

  3. We added a Personal ID column to give each elite a unique identifier. This column would be called the “primary key” in an SQL analysis.

  4. We rearranged the columns so they are consistent across the regimes. Logically, there are three groups of information, and the columns are arranged in the following order:

  • “regime information”,
  • “elite personal information”,
  • “elite career information”.
  5. We have removed some uninformative columns from the file, including some with a significant amount of missing values and some with calculated values.

  6. Names of the elites can be tricky given that they are written inconsistently. For the Chinese elites, there is no comma between the first and last names. For the Laotian elites, there is no comma in names because, culturally, there is no last name.

  7. Fonts and formats (such as the fill of cells) were inconsistent across the files. We have made them uniform.
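As referenced above, here is a minimal sketch of the first few cleaning steps in dplyr. The input frame ared_raw and the column names (entry_date, exit_date, regime, name) are hypothetical stand-ins for the actual ARED columns.

```r
library(dplyr)

# A sketch of the cleaning steps listed above; ared_raw and its
# column names are hypothetical stand-ins for the real ARED data.
ared_clean <- ared_raw %>%
  # Step 1: replace missing values with "na" in character columns
  mutate(across(where(is.character), ~ tidyr::replace_na(.x, "na"))) %>%
  # Step 2: keep only the year from date-valued columns
  mutate(across(c(entry_date, exit_date), lubridate::year)) %>%
  # Step 3: give each elite a unique identifier (the "primary key")
  mutate(personal_id = row_number()) %>%
  # Step 4: order columns as regime info, then personal/career info
  select(regime, personal_id, name, everything())
```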

2.0.2 Data transformation for world maps

To create the world map, I added a fill column to both data sets that corresponds to the fill color of each government type on a map of the world; this is accomplished using case_when. I also fixed up some country names, in both Excel and R, so that the countries in my data sets match the country names in R's built-in map_data. This matters because I then left join the downloaded data set with the one in R (which holds the countries' coordinates for the map) on the country name. The last bit of data transformation I performed for the world map was to change the year column from double to integer, which cleaned up the trailing decimals that popped up as the years cycled in the animation.
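A minimal sketch of these steps, assuming a data frame regime_data with country, regime_type, and year columns; the names and the color choices are hypothetical.

```r
library(dplyr)
library(ggplot2)  # map_data() comes from ggplot2

# A sketch of the map preparation; regime_data and its columns are
# hypothetical stand-ins for the downloaded datasets.
world <- map_data("world")

map_df <- regime_data %>%
  # Fill column driving the map colors, built with case_when()
  mutate(fill = case_when(
    regime_type == "dictatorship" ~ "firebrick",
    regime_type == "democracy"    ~ "steelblue",
    TRUE                          ~ "grey80"
  )) %>%
  # Integer years avoid trailing decimals as the animation cycles
  mutate(year = as.integer(year)) %>%
  # Country names were fixed beforehand so this join finds matches
  left_join(world, by = c("country" = "region"))
```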

2.0.3 Data transformation for continents bar graph

The data transformation for the bar graph was done using the same regime data set loaded earlier. This time I added a new column to the data frame that calculates the fraction of dictatorships out of all countries, grouped by continent. The only other data transformation I performed on this data was to remove the NA values.
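A minimal sketch of that grouped calculation, assuming the same hypothetical regime_data frame with a continent column and an is_dictatorship indicator.

```r
library(dplyr)

# A sketch of the bar-graph preparation; regime_data, continent, and
# is_dictatorship are hypothetical stand-ins for the actual columns.
bar_df <- regime_data %>%
  # Drop the NA values before computing the fractions
  filter(!is.na(continent), !is.na(is_dictatorship)) %>%
  group_by(continent) %>%
  # New column: fraction of countries that are dictatorships, per continent
  mutate(frac_dictatorship = mean(is_dictatorship == 1)) %>%
  ungroup()
```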