Seeking the Truth of a Colombian Tragedy: An Interview with Fox Fellow Valentina Rozo-Angel

September 28, 2022

How can we apply data science to human rights? More precisely, how does data science help us to approach the truth of tragic historical events? Moreover, how do national tragedies connect with our lives?

In this enlightening conversation, Valentina Rozo-Angel addresses all these matters. Valentina is a 2022-23 Fox Fellow at the MacMillan Center. She comes from the University of San Andres in Argentina, where she is a student in the Master of Data Program, and she also holds an M.A. in Economics from the University of Los Andes, Colombia.

You were born and raised in Bogota. How do you remember Bogota in your childhood?

It was not a safe city. As a teenager, I remember my parents always saying: “Do not walk alone. Do not come back home late at night.”

Mainly because of crime, or do you also have memories of political violence?

Mostly crime. However, when I was twelve, the FARC guerrilla bombed the social club El Nogal, in Bogota’s most privileged area. I knew my mom was at the club. I called her desperately, but she did not pick up the phone. I remember the fear and the desperation. Fortunately, she was not at the club at the exact time of the bombing. However, the day after, I found out that some classmates’ parents had died in the attack. That was my only first-hand experience with political violence.

So, your academic interest came up later?

Much later, indeed. I began to work full-time on the conflict three years ago. I aim to estimate the number of victims of the armed conflict in Colombia. We work to know how many people died, but also how many were kidnapped, disappeared, recruited, and displaced by the conflict.

What is the most significant challenge in accessing information? I imagine it is a process that will never end.

It is exactly like that. There will never be a complete database of victims. No country, person, or institution can reach every corner of the country to document the total number of corpses. It is simply not possible. Many corpses were thrown into the river or burned and will never be found. If an entire family is killed, no witnesses are left to report the massacre. There will never be a database with records of all violations. So, I use statistical and machine learning methods to estimate the actual number of victims.

Can you speak further about machine learning?

It combines statistics and computing to answer complex questions that humans cannot answer on their own. We compare more than one hundred databases, which is an impossible task to do by hand. The first step is record linkage, or deduplication. If two institutions register the same victim, that victim may be counted twice, and we want to avoid that duplication. The machine learns from the decisions a person would make when comparing two records: it identifies the similarities and differences between them and learns to replicate the human judgment about whether they refer to the same victim. The second step is to analyze what information is incomplete. The databases have name, surname, gender, ethnicity, etc., but they often do not record every variable of interest, so we fill in the missing data. The machine makes this calculation by comparing records. Finally, we address the fact that not all victims are registered in the databases. In this step, we use a statistical method called capture-recapture, which estimates how many victims might have been left out of the databases. Having an estimate is important because, in the end, we do not know how many there are.
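To make the steps above concrete, here is a minimal Python sketch of deduplication followed by capture-recapture. It is an illustration only, not the project's actual pipeline: it uses the simplest two-list Lincoln-Petersen estimator, whereas a project that integrates more than a hundred databases would need multi-list methods and learned matching. The function names, the name-plus-year matching key, and the records are all invented for the example.

```python
def normalize(record):
    """Crude matching key (an assumption): lowercased, whitespace-collapsed name plus year."""
    name, year = record
    return (" ".join(name.lower().split()), year)


def lincoln_petersen(list_a, list_b):
    """Two-list capture-recapture estimate of the total number of people.

    If list A holds n_a distinct people, list B holds n_b, and m appear in
    both, the estimated total is n_a * n_b / m.
    """
    keys_a = {normalize(r) for r in list_a}  # deduplicate within list A
    keys_b = {normalize(r) for r in list_b}  # deduplicate within list B
    overlap = len(keys_a & keys_b)           # people found in both lists
    if overlap == 0:
        raise ValueError("no overlap between lists; the estimate is undefined")
    return len(keys_a) * len(keys_b) / overlap


# Toy, invented records (name, year). These are not real data.
list_a = [("Ana Perez", 1999), ("ana  perez", 1999), ("Luis Gomez", 2001),
          ("Marta Ruiz", 2003), ("Jose Diaz", 2005)]
list_b = [("Luis Gomez", 2001), ("MARTA RUIZ", 2003), ("Carla Soto", 2007)]

estimate = lincoln_petersen(list_a, list_b)
documented = len({normalize(r) for r in list_a + list_b})
print(f"documented: {documented}, estimated total: {estimate:.0f}")
```

In this toy case the two lists document five distinct people between them, and the estimator suggests a total of six, meaning one person appears in neither list. The smaller the overlap between lists, the larger the estimated number of undocumented people.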

It’s unrealistic to expect to establish a final number.

Yes, but we can still get an accurate range. For example, as for people killed as a consequence of the armed conflict in Colombia, the top of the range is 850,000 people murdered. The lowest estimate is 400,000.

So you say that it is a fact that at least 400,000 people were killed in Colombia because of the conflict?

Yes, at least 400,000 between 1985 and 2018.

Is there any other current work comparable somewhere else?

No. It is also the most extensive human rights and data science project ever, related to war or not. Never before have more than a hundred human rights databases been integrated to answer a question of under-registration.

So, we have between 400,000 and 850,000 people killed. I understand they came from diverse backgrounds. I would like to know about the most common profile of victims in terms of gender, age, and ethnicity. Did you find any particularly striking patterns?

Unfortunately, the data is not very good at recording ethnicity. This variable is not usually documented. Consequently, groups that have historically been made invisible and discriminated against are also underrepresented in the final count. The same goes for LGBTQ people. Sex is recorded but not gender. We only know if the victims were men or women. Besides that, we see the victim’s age, where the murder occurred, and the alleged perpetrator.

Knowing in advance that the final number will never be known, how will you decide when you finish the project?

When we decide on the final range. Even though the range between 400,000 and 850,000 is vast, we must embrace uncertainty if we cannot reduce the gap further. It is essential to acknowledge that the number could go up to 850,000, even though we do not know it as a fact. We may have 450,000 uncounted people. No one documented them because they were mostly indigenous people, Afro-Colombians, or people who lived in a remote region, twenty hours by canoe from a town.

How has this research changed your perception of life?

It made me recognize my privilege. As I told you at the beginning, my closest personal encounter with the war was a bombing, and at the end of the day, I was not a direct victim of the conflict. Before this project, I was not aware of that privilege. I think it is a very Bogotan phenomenon, because the war was much worse in other Colombian regions, and it is typical of my generation. We do not have a memory of it. I have always lived in a very violent country, but I have been lucky and privileged not to be directly affected by it.

Finally, as a Fox Fellow, what do you plan to do during your time at Yale?

Colombia has two large but significantly different databases of the armed conflict and its victims, so there is always a fight about which one we should use. People think they should take only one of them, and they favor the one that better helps them prove what they wanted. As a Fox Fellow, I will take both and use another machine learning method, clustering algorithms, to identify patterns of conflict. When defining conflict, we usually work with a predefined hypothesis, such as “the conflict is different depending on the department in which it happened” or “I think the conflict is different for men and women.” In contrast, clustering algorithms free us from working under a prior hypothesis. Instead, we enter information into the algorithm, and it decides what the different groups are and which variables differentiate them. So using the two large databases together is an innovative approach to understanding the conflict.
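As an illustration of how a clustering algorithm finds groups without being told what they are, here is a minimal Python sketch of k-means, one common clustering method. The interview does not say which algorithm will be used, so k-means is an assumption made for the example, as are the function name, the naive initialization, and the six invented data points with two numeric features each.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means: returns one cluster label per point."""
    # Naive initialization (a simplification): use the first k points as centroids.
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        # (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


# Toy, invented points with two numeric features each. These are not real data.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels = kmeans(points, k=2)
print(labels)
```

No hypothesis about the groups is supplied here: the algorithm receives only the points and the number of clusters, and it separates the first three points from the last three on its own. With real data, choosing the features and the number of clusters would still require careful judgment.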

Interview by Francisco Ángeles, francisco.angeles@yale.edu