DPLA

Here’s what I did this summer:
First, some details. The Digital Public Library of America (DPLA) is a set of discussions that began formally in late 2010 (by formally I mean at the point money was changing hands in the name of the project). Their goal was to imagine what a national digital library would look like in a US context, toward a future goal of realizing it.
The project, of course, has been an easy target for criticism. Some say that it is too ambitious to realize (often accompanied by a reference to NSDL and a shiver). Public librarians are concerned about the appropriation of ‘public’ in DPLA’s name and that it will affect their funding, while academic libraries are concerned about a rich history of experience being missed. Others worry about the lack of a target audience: a solution in search of a problem. And of course, there is the usual concern (perceived or real) of Harvard “discovering” an occupied land and putting their flag down.
However, criticism is becoming increasingly difficult to lob, primarily due to the efforts of John Palfrey and the steering committee to make this an open, democratic process. They don’t run from the criticism but have been courting it, encouraging knowledgeable parties to speak up about concerns and past experiences. The task that the DPLA is trying to achieve is a tall one, but they’re sincerely trying.
One of the ways of democratizing the DPLA development was with a summer Beta Sprint, where volunteer teams would have two months to develop a deliverable that covered an aspect of what a DPLA would look like, in the form of software, mockups, proposals or other research. They received over 60 statements of interest, nearly 40 submissions, and chose nine exemplary submissions to present at a plenary in Washington DC (including ours!).
I participated in a joint submission between DLF and IMLS DCC, where we reworked the content and experience of the IMLS DCC into a demonstration of how its content model could fit into the larger picture of DPLA. IMLS DCC is a large cultural heritage aggregation, holding item records for over a million library and museum items from thousands of collections across the US.
Our DLF/IMLS DCC beta sprint was generously funded by the Mellon Foundation, with Rachel Frick and Carole Palmer as the PIs. Under their guidance, Richard Urban and I spent our summer in coffee shops and labs conceptualizing how IMLS DCC would look and function for a more public audience (compared to its regular scholarly audience). A big chunk of this was overhauling the user interface to better emphasize content, adding social elements to the site, and adding focus on a browsing mode of use. Jacob Jett and Katrina Fenlon, incoming and outgoing DCC project managers, were also strongly involved in describing DCC’s past and overseeing backend changes, alongside the princely technical work of Tim Cole and programmer Winston Jansz.
The Beta Sprint model of working proved to be a rewarding experience, fulfilling in the ability it gave us to make quick decisions and realize them. At the end, we came out with a number of changes to integrate back into the DCC, both in the interface and in the backend optimizations we had made for the sprint. It’s not a sustainable model in the long term, since the temptation to cut corners would only harm a project that is in constant sprint, but the deadline hanging over us helped us fall into sync (as the team communally descended into sleep-deprived madness). It reminded me of my Multimedia BA, and the comradeship and fun of the end-of-term late-night sessions with half of the program’s students designing/coding/editing away together.
Building Topic Models through Selective Document Exclusion
Earlier this month, I attended the ASIS&T 2011 Annual Meeting where, much to my delight, the paper I co-authored with Miles Efron and Katrina Fenlon was selected for the Best Paper Award.
In Building Topic Models in a Federated Digital Library through Selective Document Exclusion, we presented a way to improve the coherence of algorithmically derived topical models.
The work stems from topic modeling we were doing, first with PLSA and later LDA, in our IMLS DCC research group. The system we are working with brings together cultural heritage content from over a thousand institutions and, as a result, contains quite diverse and often problematic metadata. This noise presents problems for inferring strongly coherent topic models, so Miles came up with the successful idea of identifying and removing topically weak documents from topic training. The paper outlines how this was done and the outcomes.
I encourage you to look through the full paper, which is fairly accessible, or the press release.
When to Ask for Help
I’m presenting a talk on evaluating project appropriateness for crowdsourcing today. I’ll be sharing my notes below.
Slides: CrowdsourcingDH2011
For those interested, this talk relates heavily to my MA thesis on the motivations of users in crowdsourcing.
Crowdsourcing Swift
Earlier this week I led a class on the topic of Crowdsourcing. Since our discussion was focused primarily on Human Computation research, I took the opportunity to show a live demonstration of Mechanical Turk.
After some thought about appropriate perception-based tasks that could be outsourced to workers and return meaningful results between the beginning and end of class, I settled on a modern-day rewrite of Jonathan Swift’s A Modest Proposal.
If you haven’t read Swift’s famous 18th-century satire, I encourage you to do so. In it, Swift describes the plights of the impoverished in Ireland before offering a solution: for the poor to sell their children to the wealthy for food. The brilliance of the piece is in the cold rhetoric being used to argue for such a shocking proposition.
Of course, a modern read of A Modest Proposal as a satire is different from a completely naive read, one where the reveal of the proposal is truly a shock and where there’s a risk that a reader may not recognize it as satire at all.
This is why the idea of paying workers to rewrite it in plain English, sentence-by-sentence with no context, provided much amusement. What would these workers think, looking at a sentence written in such unassuming prose and deciphering it, only to realize that it is about cannibalism? Even better, I suspected most wouldn’t realize it, except for those rewriting a select few sentences.
I compartmentalized the task into two steps: rewriting and voting. To add limitations to the task of rewriting and constrain turkers from simply offering back the same line, I had the rewrites done as tweets, which is to say written in 140 characters or less. Each line was rewritten either two or three times (starting with three, I lowered the count after observing less noise than expected) before being promoted to the voting stage. In the voting stage, workers were presented with the original sentence and its rewrites, and chose the best one.
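For readers who want to picture the two-step flow, here is a minimal Python sketch of it. It is not the PHP and MySQL code used for the experiment; the names `Sentence`, `is_valid_rewrite`, and `rewrites_needed` are hypothetical, the explicit check against verbatim copies is an assumption layered on top of the 140-character limit, and picking the winner by simple plurality is only one plausible reading of "chose the best one".

```python
from collections import Counter
from dataclasses import dataclass, field

TWEET_LIMIT = 140  # character cap described in the post


def is_valid_rewrite(original: str, rewrite: str) -> bool:
    """Accept a rewrite only if it fits in a tweet and is not a copy of the original."""
    cleaned = rewrite.strip()
    if not cleaned or len(cleaned) > TWEET_LIMIT:
        return False
    # Assumption: verbatim copies are rejected outright.
    return cleaned.lower() != original.strip().lower()


@dataclass
class Sentence:
    original: str
    rewrites_needed: int = 3  # the post started at three and later dropped to two
    rewrites: list = field(default_factory=list)
    votes: Counter = field(default_factory=Counter)

    def ready_for_voting(self) -> bool:
        """A sentence is promoted once it has collected enough rewrites."""
        return len(self.rewrites) >= self.rewrites_needed

    def add_rewrite(self, text: str) -> bool:
        """Store a rewrite; return False if it is invalid or no more are needed."""
        if self.ready_for_voting() or not is_valid_rewrite(self.original, text):
            return False
        self.rewrites.append(text.strip())
        return True

    def vote(self, index: int) -> None:
        """Record one worker's vote for the rewrite at the given position."""
        if not self.ready_for_voting():
            raise ValueError("sentence has not reached the voting stage")
        if not 0 <= index < len(self.rewrites):
            raise IndexError("no such rewrite")
        self.votes[index] += 1

    def winner(self):
        """Return the rewrite with the most votes, or None if nobody has voted."""
        if not self.votes:
            return None
        best_index, _ = self.votes.most_common(1)[0]
        return self.rewrites[best_index]
```

A sentence collects rewrites through `add_rewrite` until `ready_for_voting` returns true, after which `vote` and `winner` cover the second stage. Ties go to whichever tied rewrite received a vote first, which is an arbitrary choice made for this sketch.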
The rewriting and voting modules were written in PHP and MySQL over the weekend, and then modified to fit into Mechanical Turk tasks using Amazon’s Command Line Tools. I paid $0.11 for each rewritten sentence and $0.02 for each vote. At 64 sentences, this cost around twenty dollars, though the rewriting wage was notably higher than comparable tasks on the site.
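As a rough sanity check on that figure, the rewriting cost can be bounded from the numbers above. The sketch below works in whole cents to avoid floating-point rounding. The number of votes collected per sentence is not stated in the post, so `votes_per_sentence` is a hypothetical parameter, and any service fees are left out.

```python
SENTENCES = 64
REWRITE_CENTS = 11  # $0.11 per rewritten sentence
VOTE_CENTS = 2      # $0.02 per vote


def rewrite_cost_cents(rewrites_per_sentence: int) -> int:
    """Total rewriting cost if every sentence gets the same number of rewrites."""
    return SENTENCES * rewrites_per_sentence * REWRITE_CENTS


def vote_cost_cents(votes_per_sentence: int) -> int:
    """Total voting cost; votes_per_sentence is an assumed value, not from the post."""
    return SENTENCES * votes_per_sentence * VOTE_CENTS


low = rewrite_cost_cents(2)   # every sentence rewritten twice
high = rewrite_cost_cents(3)  # every sentence rewritten three times
print(f"rewriting: ${low / 100:.2f} to ${high / 100:.2f}")
```

Since the run mixed two and three rewrites per sentence, the actual rewriting cost falls somewhere between the two bounds, with voting added on top.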
I have a somewhat hesitant relationship with paid human computation. Crowdsourcing with volunteers forces the organizer to be considerate of the crowds and offer them a satisfying intrinsic reward, but once you’re paying them it’s easy to see people as simply labour, because they are. Though this isn’t inherently bad, it introduces a slippery slope to an exploitative relationship. A Modest Proposal criticizes such dehumanization of citizens by using a systems-level look, appropriate considering the experiment was partially a response to Soylent, a Microsoft Office plug-in for outsourcing document proofing on Turk.
The crowdsourced Swift is on Twitter now, repulsing people with his views over the upcoming week. Follow him at @swiftsays.
Trench Raid at Roclincourt

I recently rediscovered a fun project that I did a few years ago as an undergrad.
The Body Snatchers: Trench Raid at Roclincourt was an animation of an early trench raid in the First World War. As part of Dr. Kathy Garay’s Peace and War in the 20th Century project, I worked with Gord Beck, map specialist at the McMaster University Library, to bring some of the rich materials from the project to life.
It has some rough edges, as could be expected from a student project, but it’s a lot more fun than I had remembered. Try it out.
Blog Consolidation
Yet, blogging is a useful exercise. With research, it forces you to polish it for a general audience and adapt it to a medium that encourages conciseness. With other projects, it forces you to actually finish them and present them, leaving a record and not just a memory. Thus, don’t expect me to post on the other two blogs anymore, but I will make an effort to occasionally share trivialities and snippets here on my homepage.











