An Iterative Reliability Measure for Semi-Anonymous Annotators

This week, I have a poster at JCDL detailing an algorithm I wrote for reconciling redundant annotations by multiple annotators.

In the last few years, it has been a jolting realization how effective the Internet is at recruiting large groups of people for large, abstract, and subjective tasks. In addition, a growing body of literature shows that agreement among contributors doing redundant work can meet or even surpass the quality of a single expert.

The strength of crowdsourcing often lies in the large number of participants, meaning that the selection criterion is often simply willingness to participate. As a result, there is no guarantee of reliability among participants, who are self-selected and face few consequences. Majority voting works, and is often robust enough, especially for low-bandwidth tasks such as binary ratings. However, with less constrained tasks, fewer redundant raters per task, or a possibility of majority-voting ties, we need more robust ways to estimate a true annotation from multiple amateur classifications. Furthermore, even if the data is reliable, if you are doing research on amateur annotations, you need reassurance of that reliability.

For this purpose, I designed an algorithm for assigning a reliability score to individual amateur raters and subsequently weighting their annotations in voting among multiple redundant ratings. In order to separate out the effect of individual users from the incidental effects of who they rate alongside, this algorithm uses an iterative approach, repeating the process of assigning user scores and setting votes multiple times, each step influenced by the previous.

Below is some adapted text from my JCDL poster. You can also read the accompanying two-page paper [PDF].

Crowdsourcing Annotations

One of the challenges in annotating digital work is reconciling the breadth of the undertaking with the human resources required for high-quality data. Such tasks can be arduous and do not automate effectively. In recent years, an alternative has emerged in the concept of crowdsourcing: the large-scale collaboration of online users on a task. Crowdsourcing is being used for perception-based tasks such as classification and encoding, using multiple independent annotations to offset the loss in quality relative to expert raters.

Sometimes crowdsourcing contributors are volunteers. To offer two examples that I quite like, Galaxy Zoo enlists amateur astronomers to classify images of galaxies and Trove allows interested visitors to correct and annotate OCR transcriptions.

For more tedious tasks, contributors can be paid. The dominant approach to doing so online is through micropayment-based crowdsourcing. Services have emerged to simplify this process, such as Crowdflower and Amazon’s Mechanical Turk.

As digital libraries mature and grow their audience, they are well-positioned to crowdsource corrections, metadata improvements, and additional annotation.

The Project

The impetus for this study was a practical need: a gold standard tagged corpus where one did not exist. In related research on sentiment analysis of microblogging messages, parallel corpora were required in order to train and test a classifier.

Each tagging task asked for the opinion expressed on a specific topic (e.g. “What is the opinion of ‘Obama’ in the following tweet…”). There were five possible ratings: three for classification (negative, neutral, positive) and two for flagging (spam and incoherent). Thus, the corpus was a collection of Twitter messages classified by online workers. Each tweet was annotated three times according to the sentiment expressed.

To evaluate the quality of the Turk ratings, I tagged a parallel set of oracle classifications myself. In collecting annotations, levels of tagging confidence were not permitted (e.g. definitely negative, likely negative). This was to avoid offering a lazy answer that excuses imperfect responses. To measure whether this was an acceptable design choice, the parallel tagged set did include a measure of certainty, for comparison with the lack of one in the Turker task. This study found no notable difference in classification quality between clear-cut annotations and difficult ones. Even for uncertain questions, annotators tend to make the right choice when forced to make one.

How can you trust contributions from untrained, self-selected annotators?

The Problem

Raters are self-selected and semi-anonymous paid workers, resulting in unknown trustworthiness and reliability. They’re not necessarily bad workers, but you don’t know whether they are good.

Generally, people try to make meaningful contributions when asked to. However, the crowdsourcing dynamic allows for the possibility of:

  • Malicious contributors. Contributors who vandalize or, more often with paid crowdsourcing, try to cheat the system.
  • Unreliable/sloppy contributors. Those who are hurried, inexperienced, or inattentive. Referring to Mechanical Turk workers, Bernstein et al. (2010) call these Lazy Turkers.
  • Confused contributors. Annotators who do not follow the codebook or misunderstand it, causing problems of inter-annotator reliability.

To account for the uncertain reliability of online annotators, this study outlines an iterative algorithm that develops a trust score for users and adjusts certainty in their ratings accordingly.

How Does it Work?

The proposed iterative algorithm works by repeating two steps.

1. Calculate a ‘confidence’ score for each possible rating of each document. If zero raters choose option A, for example, you have 0% confidence in that rating. If all raters choose option B, you have 100% confidence in that rating. The confidence rating takes into account user reliability scores.

2. Calculate a user reliability score for each user. This is based on the average confidence of all the ratings that the annotator makes.

3. GOTO step 1.
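The loop above can be sketched in a few lines of Python. This is only a minimal illustration, not the exact scoring function from the paper: it assumes that confidence is the reliability-weighted share of the vote for each label, that every user starts with a reliability of 1.0, and that a fixed number of iterations is enough. The function names, the `(user, document, label)` tuple layout, and the toy ratings are all made up for the example.

```python
from collections import defaultdict


def score_confidence(ratings, reliability):
    """Step 1: confidence of each label on each document.

    Assumption: confidence is the share of the reliability-weighted vote.
    """
    weight = defaultdict(lambda: defaultdict(float))
    for user, doc, label in ratings:
        weight[doc][label] += reliability[user]
    confidence = {}
    for doc, by_label in weight.items():
        total = sum(by_label.values())
        confidence[doc] = {
            label: (w / total if total > 0 else 0.0)
            for label, w in by_label.items()
        }
    return confidence


def score_reliability(ratings, confidence):
    """Step 2: a user's reliability is the mean confidence of their ratings."""
    per_user = defaultdict(list)
    for user, doc, label in ratings:
        per_user[user].append(confidence[doc][label])
    return {user: sum(vals) / len(vals) for user, vals in per_user.items()}


def iterate_reliability(ratings, iterations=5):
    """Step 3: repeat steps 1 and 2, starting from equal trust in everyone."""
    reliability = {user: 1.0 for user, _, _ in ratings}
    confidence = {}
    for _ in range(iterations):
        confidence = score_confidence(ratings, reliability)
        reliability = score_reliability(ratings, confidence)
    return confidence, reliability


def best_labels(confidence):
    """Pick the highest-confidence label for each document."""
    return {doc: max(by_label, key=by_label.get)
            for doc, by_label in confidence.items()}


# Hypothetical toy data: (user, document, label)
ratings = [
    ("ann", "t1", "positive"), ("bob", "t1", "positive"), ("cat", "t1", "negative"),
    ("ann", "t2", "neutral"), ("bob", "t2", "neutral"), ("cat", "t2", "positive"),
    ("ann", "t3", "negative"), ("bob", "t3", "negative"), ("cat", "t3", "negative"),
]

confidence, reliability = iterate_reliability(ratings)
print(best_labels(confidence))
print(reliability)
```

In this toy data, cat disagrees with the other two raters on two of three tweets, so her reliability score drops with each pass and her votes count for less in the next one.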

Here’s a hideous flowchart that people are fond of.

Scenario

Consider this tweet:

Is Obama schizophrenic? Asking Israel to stop settlements, but increasing US ‘aid’ to them… Advice: Put 3bn$+ where your mouth is, now!

What if two (bad) raters say that it expresses a neutral opinion about Obama, and one (good) rater says that it is negative?

Adding up the votes and listening to the majority will be incorrect, because the tweet isn’t neutral. Not only would we have a bad rating, but the system now wrongly thinks that the good rater is less reliable.
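To make the failure concrete, here is a minimal sketch of plain majority voting applied to the three ratings in this scenario. The function name `majority_vote` and the choice to return `None` on a tie are illustrative assumptions, not part of the study.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most common label, or None when the top count is tied."""
    counts = Counter(labels).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: majority voting cannot decide
    return counts[0][0]


# Two bad raters say neutral, one good rater says negative.
scenario = ["neutral", "neutral", "negative"]
print(majority_vote(scenario))

# Three raters, three different answers: a tie.
print(majority_vote(["positive", "negative", "neutral"]))
```

Every vote counts the same here, so the two neutral ratings win regardless of who made them, and a three-way split gives no answer at all.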

Confidence scoring function

If the two bad raters are consistently bad, the iterative algorithm will diminish their influence and use the good rater’s classification. Additionally, the iterations separate the influence of co-raters on a rater. Now the good rater won’t be punished for disagreeing with two other raters, because the algorithm knows that they were bad raters.
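As a rough worked example, suppose earlier iterations have left each bad rater with a reliability of 0.4 and the good rater with 0.9. These numbers are invented for illustration, and treating confidence as the reliability-weighted share of the vote is a simplifying assumption rather than the exact scoring function from the paper.

```latex
% Hypothetical reliability scores: r_bad = 0.4 (two raters), r_good = 0.9
\begin{align*}
\mathrm{conf}(\text{neutral})  &= \frac{0.4 + 0.4}{0.4 + 0.4 + 0.9} = \frac{0.8}{1.7} \approx 0.47 \\
\mathrm{conf}(\text{negative}) &= \frac{0.9}{0.4 + 0.4 + 0.9} = \frac{0.9}{1.7} \approx 0.53
\end{align*}
```

Under these assumed scores, the single negative rating outweighs the two neutral ones, even though it loses a simple head count two to one.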
