[Tuning In] Peter Bol on creating the China Biographical Database and the power of digital humanities

Dr. Peter Bol is the Charles H. Carswell Professor of East Asian Languages and Civilizations at Harvard University. His research is centered on the history of China’s cultural elites at the national and local levels from the 7th to 17th century.

Professor Bol directs the China Biographical Database project, which is maintained by Harvard University, Academia Sinica, and Peking University. This online relational database currently contains some 350,000 historical figures and is being expanded to include all biographical data in China’s historical records from the last 2,000 years.

Peter Bol, professor and database director

KrASIA (Kr): Can you tell us about the origins of the China Biographical Database (CBD)?

Peter Bol (PB): It began in the 1990s with some initial work by a social historian named Robert Hartwell, whom I knew. He decided that if he should pass away, he would give everything to Harvard, and I promised to continue his work. I didn’t see that as a likely prospect, so I said yes because you don’t want to insult somebody.

Then he upped and passed away quite unexpectedly and the materials came to us. In around 2005, almost 10 years after his death, I decided we really needed to take the 20,000 people in the database that he had compiled and somehow try and create something that other people could use and build upon.

But I also knew that he tended to make a lot of mistakes. So I arranged with colleagues at Academia Sinica in Taiwan and at Peking University to collaborate, clean up the database, and release it to the world.

Kr: How did you organize all of that data so that it became a functional database?

PB: We did that and found an enormous number of mistakes, but along the way I happened to talk to one of my colleagues in computer science, and he asked whether I had heard of regular expressions, named entity recognition, and other things like this. I hadn’t. He said that if you have digital text, there are ways of mining it automatically, and you don’t have to do it manually.

That’s where it really took off. We brought in Michael Fuller, who studies Chinese literature at the University of California, Irvine.

We brought in a graduate student in computer science who could write these regular expressions, and one of my graduate students took on the job as a project manager. Today, we have almost 1 million people in the pipeline to be disambiguated and contextualized in the database.
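
To make the idea of mining digital text concrete, here is a minimal sketch of the kind of regular expression such work might rely on. It is not the project's actual pattern. It assumes a biography opens with the formula "name 字 courtesy name, place 人", and the two sample sentences are illustrative text written in the style of dynastic-history biographies. The field names (`name`, `courtesy`, `place`) are hypothetical.

```python
import re

# One CJK ideograph (basic block only; an assumption that keeps the sketch short).
HAN = r"[\u4e00-\u9fff]"

# Assumed formula: <name>字<courtesy name>，<native place>人
PATTERN = re.compile(
    rf"(?P<name>{HAN}{{2,4}})字(?P<courtesy>{HAN}{{1,3}})，(?P<place>{HAN}{{2,6}})人"
)

# Illustrative sample text, not drawn from the database itself.
text = "王安石字介甫，撫州臨川人。蘇軾字子瞻，眉州眉山人。"

# Each match becomes one structured record.
records = [match.groupdict() for match in PATTERN.finditer(text)]

for record in records:
    print(record)
```

A pattern this simple breaks as soon as the wording varies or a name itself contains the character 字, which is one reason extracted records would still need human review.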

Kr: What are the data-sharing norms with your international colleagues?

PB: At the Institute of History and Philology at Academia Sinica, which is where the premodern historians reside, colleagues saw some value in this and they agreed to do two things. First of all, they gave us some important, very reliable digital texts, and we shared those digital texts with the group at the Institute for Ancient Chinese History at Peking University.

One of my colleagues at Peking University had some graduate students who could meet some university obligations by working on the project. So we put together an editorial group at the university, and we had our computer science group here at Harvard who would generate texts for our colleagues at Peking University to look over and make corrections. We were also iterating the whole time and trying to improve computational methods.

Then the Institute of History and Philology at Academia Sinica funded the creation of online databases in around 2008, so we not only had it as a standalone database on standalone computers, but you could also go on the web and perform searches. So it has really been a wonderful tripartite collaboration with everyone contributing.

Kr: How has the database evolved?

PB: This is a relational database. We think that is important because it is a way of modeling people’s lives by asking complex queries. Here’s an example: How many people passed the Chinese civil service exam in a certain period, and how many of those people were related to each other by marriage? You can ask that kind of question. It is not meant to function as a biographical dictionary; it is meant to function as a way of looking at large amounts of data to see change over time and space.
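
The example question above can be sketched as a pair of SQL queries. This is a minimal illustration using Python's built-in `sqlite3` module. The table names (`person`, `exam_pass`, `marriage_tie`), their columns, and all rows are hypothetical placeholders, not the database's real schema or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE exam_pass (person_id INTEGER, year INTEGER);
CREATE TABLE marriage_tie (person_a INTEGER, person_b INTEGER);

-- Placeholder rows, invented for the sketch.
INSERT INTO person VALUES (1, 'Person A'), (2, 'Person B'),
                          (3, 'Person C'), (4, 'Person D');
INSERT INTO exam_pass VALUES (1, 1057), (2, 1061), (3, 1070), (4, 1150);
INSERT INTO marriage_tie VALUES (1, 2), (3, 4);
""")

def count_passers(start, end):
    """How many people passed the exam between start and end (inclusive)?"""
    sql = ("SELECT COUNT(DISTINCT person_id) FROM exam_pass "
           "WHERE year BETWEEN ? AND ?")
    return conn.execute(sql, (start, end)).fetchone()[0]

def married_passer_pairs(start, end):
    """Which pairs of passers in that period are tied by marriage?"""
    sql = """
        SELECT pa.name, pb.name
        FROM marriage_tie m
        JOIN exam_pass ea ON ea.person_id = m.person_a
                         AND ea.year BETWEEN ? AND ?
        JOIN exam_pass eb ON eb.person_id = m.person_b
                         AND eb.year BETWEEN ? AND ?
        JOIN person pa ON pa.id = m.person_a
        JOIN person pb ON pb.id = m.person_b
        ORDER BY pa.name
    """
    return conn.execute(sql, (start, end, start, end)).fetchall()

print(count_passers(1050, 1100))
print(married_passer_pairs(1050, 1100))
```

The join is what a biographical dictionary cannot do: it combines two separate kinds of facts about people (exam records and marriage ties) in a single question.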

We currently have built-in export functions so you can export to a geographic information system (GIS) program, a mapping program, or to social network analysis programs. In principle, we should be able to do that online, so you can do a query online and you can map it right away to see your distribution and download the data as you choose. Developing these capabilities online is going to be important.

There are different kinds of visualization. In fact, there is a group at a university in China that is very interested in visualization. They have been using the CBD to do experiments and figure out how you can create multiple kinds of visualization to get more information. For example, they used various kinds of network analysis to discover likely connections between people that are not given in the database already.
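
The interview does not say which network methods that group used. As one simple, commonly used possibility, the sketch below suggests unrecorded connections by counting shared acquaintances (a "common neighbors" heuristic). The function name `suggest_links`, the `min_shared` threshold, and the single-letter people are all hypothetical.

```python
from itertools import combinations

def suggest_links(edges, min_shared=2):
    """Suggest unlinked pairs that share at least `min_shared` neighbors."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    suggestions = []
    for a, b in combinations(sorted(neighbors), 2):
        if b in neighbors[a]:
            continue  # already connected in the data
        shared = neighbors[a] & neighbors[b]
        if len(shared) >= min_shared:
            suggestions.append((a, b, len(shared)))
    # Strongest candidates (most shared neighbors) first.
    return sorted(suggestions, key=lambda s: -s[2])

# Placeholder network: A and B never appear together but both know C and D.
edges = [("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "E")]
print(suggest_links(edges))
```

A suggestion of this kind is only a lead for a historian to check against the sources, not evidence that two people actually knew each other.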

Kr: What are the future plans for the project?

PB: We are cautiously moving into the area of crowdsourcing to see if we can figure out ways of developing a following of people who are interested in history and have the knowledge to contribute information.

The biggest obstacle we have is not one of collaboration between people and universities and countries, but of disambiguation. The Chinese language has an enormous number of characters. Initially, we thought this wouldn’t be a big problem because people would have unique names.

It turns out that at certain points in Chinese history, certain kinds of names are very popular. Since China has a much smaller number of surnames than we find in the West, we would run into a situation where we would have 50–60 people in a 100-year period with the same surname and given name. Unless you had complete data on all of those people, figuring out who is who is much more of a challenge than we originally thought.
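
A first step toward handling this problem is simply finding the records that collide. The sketch below is a minimal illustration, not the project's method: it flags records that share a full name and fall within 100 years of each other. The record layout `(id, name, year)` and the function name `name_collisions` are assumptions, and "Zhang San" and "Li Si" are generic placeholder names.

```python
from collections import defaultdict

def name_collisions(records, window=100):
    """Return {name: [record ids]} for same-name records within `window` years."""
    by_name = defaultdict(list)
    for rec_id, name, year in records:
        by_name[name].append((year, rec_id))

    flagged = {}
    for name, entries in by_name.items():
        entries.sort()  # chronological order
        clashing = set()
        # Comparing neighbors in sorted order is enough: if any two records
        # are within the window, some adjacent pair is as well.
        for (y1, id1), (y2, id2) in zip(entries, entries[1:]):
            if y2 - y1 <= window:
                clashing.update((id1, id2))
        if clashing:
            flagged[name] = sorted(clashing)
    return flagged

records = [
    (1, "Zhang San", 1050),
    (2, "Zhang San", 1090),
    (3, "Zhang San", 1400),
    (4, "Li Si", 1060),
]
print(name_collisions(records))
```

Flagging is the easy part. Deciding whether records 1 and 2 describe one person or two still needs further evidence, such as native place, offices held, or kin.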

Kr: What is the significance of digital humanities?

PB: How do we run the research cycle for the humanities in a digital environment? If we think about how we find topics to study or discuss, how we define questions, how we gather data, how we store it, how we analyze it, how we disseminate it, all of this can happen in a digital environment with digital methods.

Digital humanities are simply the humanities being conducted in a digital environment. There’s so much we can do. Particularly since March, when universities and libraries shut down, we have had to do a lot of work in a digital environment, both in research and teaching.

Digital humanities include various free, open access utilities that people can use to study the past. They allow us to answer questions that were impractical to answer before. We are not good at looking at large quantities of things at once, but now we are able to do it with digital methods.

Kr: Two-thirds of public libraries in the US don’t have a formal digitization strategy. What are the biggest challenges to digitization that libraries face?

PB: It would be so good for public libraries if they had complete access to digital collections.

If you think about academic publishing and university presses, they were created to disseminate knowledge coming out of universities that was not commercially viable. The model progressed to the point where, instead, these university presses would begin to generate profits.

I remember a meeting with people from the university press where a professor said to them, “You were originally a means to disseminate knowledge. You are now an obstacle to its dissemination.” That kind of sums it up.

Academic work should be open access, particularly academic work that is funded by grants or foundations. The reason why it can’t be made freely available is that we haven’t set up the institutional mechanisms to do that. Meanwhile, publishing still costs money. You have editors, quality control, peer reviews, and so on.

Looking into the future, the dissemination of knowledge favors a digital environment. The marginal cost after the initial editing of a digital work is minimal compared to creating and shipping physical book copies.

Kr: How has China’s education system been able to digitize academic resources?

PB: The explosion of university education and quality education in the country meant that there was desperate demand for access to intellectual resources. Digitization was a solution to that.