Posts About My Research
June 17th, 2012
Last month, I gave a presentation about paid crowdsourcing in the humanities at SDH-SEMI. Below are my notes.
The rhetorical model in the humanities is appreciation: we believe that by paying attention to an object of interest, we can explore it, find new dimensions within it, notice things about it that have never been noticed before, and increase its value.
In a 2004 talk, John Unsworth characterized the dominant model of the humanities as one of appreciation: rigorous and qualitative. By examining a work from multiple angles and in multiple contexts, we believe we can “notice things about it that have never been noticed before, and increase its value.” Such research does not scale as readily as quantitative work: qualitative undertakings, ones of concentrated appreciation, are constrained by the amount of human involvement available.
However, as we explore new ways to utilize our digital environment for humanities research, so-called ‘big data’ approaches are becoming not only possible but inevitable. The archival efficiency of computers, coupled with the digitization efforts of historians, librarians, and digital humanists, has resulted in endless bytes of data to understand and call our own, while the offline limitations of scale have left a large set of questions thus far unexplored.
“In these democratic days, any investigation into the trustworthiness and peculiarities of popular judgments is of interest.” (Francis Galton, “Vox Populi,” 1907)
There are numerous approaches for scaled-up humanities research. Today, I’ll speak of one in particular: crowdsourcing. In doing so, I’ll describe how crowdsourcing is currently being undertaken and share a project of my own – one where semi-anonymous online users rewrote Jonathan Swift’s A Modest Proposal – as one approach to a crowdsourcing workflow.
June 11th, 2012
This week, I have a poster at JCDL detailing an algorithm I wrote for reconciling redundant annotations by multiple annotators.
In the last few years, the Internet’s effectiveness at soliciting large groups of people for broad, abstract, and subjective tasks has become strikingly clear. In addition, a growing body of literature is showing that agreement among contributors doing redundant work can meet or even surpass the quality of a single expert.
The strength of crowdsourcing often lies in the large number of participants, meaning that the selection criterion is often simply willingness to take part. As a result, there is no guarantee of reliability among participants, who are self-selected and face few consequences. Majority voting works, and is often robust enough – especially for low-bandwidth tasks such as binary ratings. However, with less constrained tasks, fewer redundant raters per task, or the possibility of majority-voting ties, we need more robust ways to estimate a true annotation from multiple amateur classifications. Furthermore, even if the data is reliable, research built on amateur annotations needs assurances of that reliability.
For this purpose, I designed an algorithm that assigns a reliability score to individual amateur raters and subsequently weights their annotations when voting among multiple redundant ratings. To separate the effect of individual users from the incidental effects of who they rate alongside, the algorithm takes an iterative approach, repeating the process of assigning user scores and settling votes multiple times, with each step influenced by the previous.
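The iterative idea can be sketched as follows. This is an illustrative reconstruction, not the exact algorithm from the poster: the agreement-based scoring rule, the function names, and the data layout (`{item: [(rater, label), ...]}`) are all assumptions made for the sake of a runnable example.

```python
from collections import defaultdict

def weighted_vote(ratings, scores):
    """For each item, pick the label with the highest total rater weight."""
    consensus = {}
    for item, votes in ratings.items():
        tally = defaultdict(float)
        for rater, label in votes:
            tally[label] += scores[rater]
        consensus[item] = max(tally, key=tally.get)
    return consensus

def iterate_reliability(ratings, n_iters=10):
    """ratings: {item: [(rater, label), ...]} with redundant raters per item.

    Alternates between (1) settling votes with the current reliability
    weights and (2) re-scoring each rater by their rate of agreement
    with the weighted consensus, so each step is influenced by the last.
    """
    raters = {r for votes in ratings.values() for r, _ in votes}
    scores = {r: 1.0 for r in raters}  # start from uniform reliability
    for _ in range(n_iters):
        consensus = weighted_vote(ratings, scores)
        agree, total = defaultdict(int), defaultdict(int)
        for item, votes in ratings.items():
            for rater, label in votes:
                total[rater] += 1
                agree[rater] += int(label == consensus[item])
        scores = {r: agree[r] / total[r] for r in raters}
    return weighted_vote(ratings, scores), scores
```

On a toy example with three raters, a rater who usually disagrees with the others ends up down-weighted, and the consensus follows the agreeing pair even on items where weights matter.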
Below is some adapted text from my JCDL poster; you can also read the accompanying two-page paper [PDF].
December 3rd, 2011
Here’s what I did this summer:
First, some details. The Digital Public Library of America (DPLA) is a set of discussions that began formally in late 2010 (by formally, I mean the point at which money was changing hands in the name of the project). Their goal was to imagine what a national digital library would look like in a US context, with the eventual goal of realizing it.
October 29th, 2011
Earlier this month, I attended the ASIS&T 2011 Annual Meeting where our paper was selected for the Best Paper Award.
In Building Topic Models in a Federated Digital Library through Selective Document Exclusion, we presented a way to improve the coherence of algorithmically derived topic models.
The work stems from topic modeling we were doing, first with PLSA and later with LDA, in our IMLS DCC research group. The system we are working with brings together cultural heritage content from over a thousand institutions and, as a result, contains quite diverse and often problematic metadata. This noise makes it difficult to infer strongly coherent topic models, so Miles came up with the successful idea of identifying topically weak documents and excluding them from topic model training. The paper outlines how this was done and the outcomes.
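The paper’s actual exclusion criterion isn’t reproduced here, but as an illustration of the general idea, one simple proxy for a “topically weak” document is a flat, high-entropy topic distribution: a document that spreads its probability mass thinly over many topics tells the model little. The threshold and function names below are assumptions for the sketch.

```python
import math

def topic_entropy(dist):
    """Shannon entropy (in bits) of a document's topic distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def select_documents(doc_topics, max_entropy=2.0):
    """Keep documents whose topic mass is concentrated (low entropy);
    exclude diffuse ones before retraining the topic model."""
    return [doc for doc, dist in doc_topics.items()
            if topic_entropy(dist) <= max_entropy]
```

A document concentrated on one or two topics passes the filter, while one spread uniformly over eight topics (entropy of 3 bits) would be excluded at this threshold.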