MARC Fields in the HathiTrust

At the HathiTrust Research Center, we’re often asked about metadata coverage for the nearly 15 million HathiTrust records. Though we provide additional derived features like author gender and language inference, the metadata is generally what arrives from institutional providers via HathiTrust. HathiTrust provides some baseline guidelines for partners, but beyond those, what you can expect depends on what each institutional partner provides.

For a sense of what that is, below is a list of the most common MARC fields. Crunching this didn’t involve any special access through the Research Center: you could easily access the same records via HathiTrust’s Bibliographic API (and hey, some code!).
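As a rough illustration of that kind of tally, here is a minimal sketch in Python. It is not the code behind this post. It assumes the records are available as MARCXML strings, and `fetch_marcxml` assumes that the Bibliographic API's "full" endpoint lives at the URL shown and returns JSON whose `records` entries carry a `marc-xml` key; check both against the API documentation before relying on them. The function names are made up for this example.

```python
import json
import urllib.request
import xml.etree.ElementTree as ET
from collections import Counter


def tags_in_record(marcxml):
    """Return the set of MARC field tags present in one MARCXML document."""
    root = ET.fromstring(marcxml)
    tags = set()
    for el in root.iter():
        # Drop any XML namespace prefix, e.g. '{http://www.loc.gov/MARC21/slim}datafield'
        local_name = el.tag.rsplit("}", 1)[-1]
        if local_name in ("controlfield", "datafield") and "tag" in el.attrib:
            tags.add(el.attrib["tag"])
    return tags


def field_coverage(marcxml_records):
    """Map each MARC tag to the fraction of records that contain it at least once."""
    counts = Counter()
    total = 0
    for marcxml in marcxml_records:
        total += 1
        counts.update(tags_in_record(marcxml))
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.most_common()}


def fetch_marcxml(htid):
    """Fetch MARCXML for one HathiTrust volume id.

    ASSUMPTION: the URL pattern and the 'records' / 'marc-xml' keys reflect
    the Bibliographic API's full-record JSON response; verify before use.
    """
    url = f"https://catalog.hathitrust.org/api/volumes/full/htid/{htid}.json"
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    return [
        rec["marc-xml"]
        for rec in data.get("records", {}).values()
        if "marc-xml" in rec
    ]
```

Feeding `field_coverage` the MARCXML for a random sample of volume ids gives a per-field coverage fraction, which is the kind of number listed in the full post. Counting each tag once per record (rather than every occurrence) keeps repeatable fields such as subject headings from inflating the figures.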

The good news is that at the scale of the HathiTrust’s collection, even a small random fraction of the full collection is sufficient to see many quantitative patterns. We see this often at the Research Center, where we work largely at aggregate levels over thousands or millions of volumes: term frequencies, topic models, and other types of distributions converge much, much earlier. The bad news is that you can’t be sure the missing data is missing at random, so the records that do include a field may be a biased subset. To check that, you’ll have to look more closely at the field you’re interested in.

This data is presented for curiosity with little commentary, but I will offer one pro-tip: If you’re looking to get a thematic classification of a volume, the ~50% coverage of Library of Congress call numbers is a good place to start.


Your First Twitter Bot, in 20 minutes

Creating a Twitter bot is a great exercise for formalizing a simple concept in a concrete implementation. Some of the best bots demonstrate this simplicity: a nugget of an idea, with the nuance in the details. Implementing a bot usually requires some programming, some data wrangling, and a server. However, it can be easier. By patching together some open datasets and a hosted version of a generative grammar, I’ll describe how to build a simple bot in 20 minutes.