A small but useful tip today on using IPython notebooks for a git project README while keeping an auto-generated version in the Markdown format that GitHub prefers.
At the HathiTrust Research Center, we’re often asked about metadata coverage for the nearly 15 million HathiTrust records. Though we provide additional derived features, like author gender and language inference, the metadata is generally what arrives from institutional providers via HathiTrust. HathiTrust sets some baseline guidelines for partners, but beyond those, coverage depends on what each institutional partner supplies.
For a sense of what that coverage looks like, below is a list of the most common MARC fields. Crunching this didn’t involve any special access through the Research Center: you could easily pull the same records via HathiTrust’s Bibliographic API (and hey, some code!).
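As a sketch of what that access looks like, the Bibliographic API returns JSON for a record given a standard identifier. A minimal URL builder, assuming the `catalog.hathitrust.org/api/volumes/...` endpoint pattern and an illustrative OCLC number:

```python
import json
from urllib.request import urlopen

def bib_api_url(id_type, id_value, level="brief"):
    """Build a HathiTrust Bibliographic API URL.

    id_type: an identifier type such as 'oclc', 'lccn', 'isbn', or 'htid'.
    level:   'brief' or 'full' record detail.
    """
    return ("https://catalog.hathitrust.org/api/volumes/"
            f"{level}/{id_type}/{id_value}.json")

url = bib_api_url("oclc", "424023")
print(url)
# Fetching requires network access; uncomment to try it:
# record = json.load(urlopen(url))
# print(record["records"])
```

From there it is a matter of looping over a sample of identifiers and tallying which MARC fields appear in each returned record.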
The good news is that at the scale of the HathiTrust collection, even a small random fraction of the full collection is sufficient to see many quantitative patterns. We see this often at the Research Center, where we work largely at aggregate levels over thousands or millions of volumes: term frequencies, topic models, and other types of distributions converge much, much earlier. The bad news is that you can’t assume fields are missing at random: records with missing data may differ systematically from the rest. For that, you’ll have to look more closely at whichever field you’re interested in.
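The convergence claim is easy to check with a toy simulation: estimate how often a field appears from increasingly large random samples and watch the estimate stabilize. The 30% rate and population size below are invented for illustration, not HathiTrust figures:

```python
import random

random.seed(42)

# Toy "collection": 1 means a record carries some field, 0 means it doesn't.
# The true rate is 30%; at scale, modest samples recover it closely.
population = [1] * 300_000 + [0] * 700_000
random.shuffle(population)

# Estimates from larger and larger random samples converge on 0.3.
for n in (100, 1_000, 10_000, 100_000):
    sample = random.sample(population, n)
    print(n, sum(sample) / n)
```

The same logic applies to term frequencies or any other aggregate: the estimate tightens quickly with sample size, but only if the sampling is actually random.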
This data is presented for curiosity with little commentary, but I will offer one pro tip: if you’re looking for a thematic classification of a volume, the ~50% coverage of Library of Congress call numbers is a good place to start.
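To act on that tip, the thematic signal in an LC call number is its leading class letters (for example, “PS” is American literature). A minimal sketch of pulling them out, assuming call numbers arrive as plain strings:

```python
import re

# LC call numbers open with 1-3 class letters, e.g. 'PS3545.I5365' -> 'PS'.
LC_CLASS = re.compile(r"^([A-Z]{1,3})")

def lc_class(call_number):
    """Return the leading LC class letters, or None if none are present."""
    match = LC_CLASS.match(call_number.strip().upper())
    return match.group(1) if match else None

print(lc_class("PS3545.I5365 Z6"))  # -> PS
print(lc_class("813.54"))           # Dewey number, not LC -> None
```

Mapping the extracted letters to subject labels then only requires a lookup table of the LC classification outline.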
I think it was the Pres. at dawn with the Spin Back Knuckle.
— bad Clue guesses (@BadClues) September 6, 2015
Creating a Twitter bot is a great exercise for formalizing a simple concept in a concrete implementation. Some of the best bots demonstrate this simplicity: a nugget of an idea, with the nuance in the details. Implementing a bot usually requires some programming, some data wrangling, and a server. However, it can be easier: by patching together some open datasets and a hosted version of a generative grammar, I’ll describe how to build a simple bot in 20 minutes. Continue reading Your First Twitter Bot, in 20 minutes
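A “hosted version of a generative grammar” points at tools in the Tracery family, where `#key#` placeholders expand to random choices. The bot’s pattern can be sketched with a tiny expander; the rules below are invented stand-ins in the spirit of the tweet above, not the actual bot’s grammar:

```python
import random

# Tracery-style grammar: '#key#' expands to a random choice from rules[key].
rules = {
    "origin":  ["I think it was #suspect# #where# with the #weapon#."],
    "suspect": ["the Pres.", "Col. Mustard", "the butler"],
    "where":   ["at dawn", "in the study", "on the veranda"],
    "weapon":  ["Spin Back Knuckle", "candlestick", "lead pipe"],
}

def expand(symbol, rules, rng=random):
    """Expand a grammar symbol, recursively resolving '#key#' placeholders."""
    text = rng.choice(rules[symbol])
    while "#" in text:
        start = text.index("#")
        end = text.index("#", start + 1)
        key = text[start + 1:end]
        text = text[:start] + expand(key, rules, rng) + text[end + 1:]
    return text

print(expand("origin", rules))
```

The nuance is all in the rule lists; the expander itself stays a few lines long, which is why hosted services can run the whole bot from a single grammar file.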
With data from NYPL Labs’ What’s on the Menu?
I just put up a modest reference repository with various slices of data on US names. It includes an estimate of name frequencies among US-born people alive today, built by cross-referencing baby names data with the 2014 population age distribution, along with gender probabilities by name. Find it on GitHub.
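The cross-referencing step is essentially a weighted sum: births carrying a name in each year, scaled by the fraction of that birth cohort still alive (and US-resident) in 2014. A toy sketch with invented numbers; the real repository works from SSA baby-names data and Census age distributions:

```python
# Toy inputs: births by (name, year) and the share of each birth cohort
# represented in the 2014 population. All numbers are invented.
births = {
    ("Dorothy", 1930): 60_000,
    ("Dorothy", 1990): 2_000,
    ("Emma",    1990): 10_000,
}
alive_share = {1930: 0.10, 1990: 0.97}

def living_estimate(name, births, alive_share):
    """Estimate how many people with `name` are alive today."""
    return sum(count * alive_share[year]
               for (n, year), count in births.items() if n == name)

print(living_estimate("Dorothy", births, alive_share))
```

This is why a name popular in the 1930s can rank below a less common 1990s name among the living: the cohort weights do the work.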