Digital Humanities
Interview on planning crowd projects
A few weeks ago, Rodney Echols interviewed me about my thesis work for Ahm Sayin’. Check it out.
Words in the Wild

All our days are so unprofitable while they pass, that ’tis wonderful where or when we ever got anything of this which we call wisdom, poetry, virtue. We never got it on any dated calendar day. Some heavenly days must have been intercalated somewhere, like those that Hermes won with dice of the Moon, that Osiris might be born. It is said all martyrdoms looked mean when they were suffered. Every ship is a romantic object, except that we sail in. Embark, and the romance quits our vessel and hangs on every other sail in the horizon. Our life looks trivial, and we shun to record it. – Ralph Waldo Emerson
Garry and I have recently taken to collecting words for the Dictionary of Words in the Wild. Since then, I’ve found a curious side-effect in myself: I have a greater appreciation of urban beauty. I find myself noticing and appreciating the little quirks of the city in a way that I have not before, constantly regretting not having a camera with me, regretting my inability to share the beauty. Words in the wild are by definition a human effect on nature, but when you really start looking, you appreciate this not as a tainting of nature but as a sort of growth of it.
It seems that the act of observation has changed the nature of what I set out to observe. In trying to digitize an object, we bestow upon it a status that had never been intended. This strikes me as perhaps a broader rule. In trying to digitize history, we fret about how, if at all, we can retain the essence of the object being digitized. Yet perhaps it’s not the product that matters as much as the process. In seeing the shortcomings of the copy against the original, we reach toward a greater appreciation of the original than was often intended upon its creation.

Historical Reblogging
There’s a fascinating form of online narrative that’s emerged recently. For lack of a better term, I’ve taken to calling it historical reblogging.
Historical reblogging is the gradual publishing of historical texts online over a timeline that parallels the timeline of the texts. Specifically, date-specific artifacts such as letters and journals map wonderfully to the format of blogging. We’re used to seeing such collections all at once, but in historical reblogging, items are revealed gradually, adding more authenticity to the stories.
I first came across this idea while involved with the Peace and War in the 20th Century project. Though I never saved them, I saw a number of archives that were blogging soldiers’ letters home. The idea began picking up steam for other archives too, such as The Orwell Diaries, which blogs George Orwell’s diary entries from [today minus 70 years]. Even microblogging got in on the action with one-sentence journals, such as those of Depression-era farmgirl Genevieve Spencer and a 99-year-old (in 1974) Great-Gram Pratt.
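The scheduling rule behind this kind of project is simple enough to sketch. What follows is only an illustration, not how any of the archives above actually publish: the function names, the sample entries, and the choice to move 29 February to 28 February in non-leap years are all my own assumptions.

```python
from datetime import date


def shifted(original: date, years: int) -> date:
    """Move a date forward by a whole number of years.

    Assumption: an entry written on 29 February is published on
    28 February when the target year is not a leap year.
    """
    try:
        return original.replace(year=original.year + years)
    except ValueError:
        return original.replace(year=original.year + years, day=28)


def due_entries(entries, today: date, offset_years: int = 70):
    """Return the (original_date, text) pairs whose anniversary falls on `today`."""
    return [(d, text) for d, text in entries if shifted(d, offset_years) == today]


# Made-up diary entries, purely for illustration.
ENTRIES = [
    (date(1939, 9, 1), "First hypothetical entry."),
    (date(1939, 9, 2), "Second hypothetical entry."),
]

for original_date, text in due_entries(ENTRIES, today=date(2009, 9, 1)):
    print(original_date.isoformat(), text)
```

Run once a day with the current date as `today`, something like this would reveal each entry exactly `offset_years` after it was written.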
One of my favourite examples is the journals of Jordan Mechner, creator of the classic video game Prince of Persia. Last year, he began gradually posting his journals from 1985-1989, offering an intimate look into the development of his most famous game. What makes it fascinating is the familiarity with the people, games, and companies mentioned, but my familiarity is removed in the sense that they’re artifacts of the past (I was born in 1986). Reading these journals makes that past that much more accessible. Anybody who has played Prince of Persia will feel a pang of excitement at seeing the original reference video for the protagonist’s movements, and reading about the subsequent attempts to digitize it and trace it in tiny 8-bit pixels.
Today’s blogs are yesterday’s journals. The only difference is that in the past, we only gained access to material once it had been bestowed some historical significance, always after the fact, and always by the will of the gatekeepers. Today, with millions blogging, history is out there, right now, being made. One day, we’ll be looking back at the teenage blogs of the next great artist and peering in at their beginnings. We just don’t see it yet.
Toward Meaningful Computing
Toward Meaningful Computing makes the case for Humanities Computing to work closely with the Computer Sciences, to help teach computers to understand meaning in data. However, such a view is problematic for a number of reasons. Primarily, it assumes that computers can come to understand meaning in an encompassing way. The premise, then, begins to sound like this: computer scientists have hit a wall in their quest to answer all of life’s questions through binary, and digital humanists are needed to provide the secret recipe of human-encoded meaning, so that computer scientists can go on with their job.
Does that sound troublesome to you? It certainly does to me.
A Detour from Theory
In reading Chris Anderson’s The End of Theory, I found myself constantly swinging between agreement and reaction. In it, Anderson writes that as the Internet becomes an enormous corpus of human history, modelling is becoming irrelevant. The reason for my mixed feelings is that while I disagree with the conclusion (or the extent of it), I very much understand the points Anderson makes to get there.
Anderson is certainly more qualified to speak on the scientific method than I am. He was a scientist for many years, before becoming a writer for the very well-respected Nature and Science, eventually settling in as Editor-in-Chief of Wired. Yet, his argument appears to be unnecessarily broad. The scientific method won’t die off, as there are still many uses in which it will reveal knowledge in traditional ways. However, as we reach problems that the scientific method cannot help with, the digital age’s gradual progress toward a corpus of human data may help. (A sidenote: perhaps Anderson intentionally exaggerated his arguments to spur discussion. If this is the case, given the reactionary comments on the Wired page, it was not particularly successful.)
First reading the article, I kept thinking: Does having more data not allow you to create more relevant models? If a large corpus of data reveals something, do we need to retread the steps every time we intend to build on that knowledge? Contrary to my first reading, though, I think this is the exact point Anderson is making. If we have a large set of data (one nearly incomprehensible in scale, encompassing our present and our past), we are safer in assuming that knowledge derived from that data is sound. In other words, as our corpus of tangible human knowledge grows, it makes fallibilism (the bane of my existence) increasingly irrelevant as uncertainty decreases. If something is correct a thousand times out of a thousand, yes, it could still be incorrect on that 1001st time, but it’s more reliable than if it had simply been right ten times out of ten. As we reach problems that we simply cannot test for absolute certainty, correlation will have to do. What Anderson appears to suggest is that, given the size of the data, this isn’t as bad as it sounds.
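To put rough numbers on the thousand-versus-ten intuition, here is a small sketch. It is not Anderson’s argument, just a standard statistical bound, and it assumes something the Internet-as-corpus rarely guarantees: that every observation is an independent trial with the same chance of failure. The function name is my own.

```python
def failure_rate_upper_bound(n_successes: int, confidence: float = 0.95) -> float:
    """Upper bound on the per-trial failure rate after n straight successes.

    Assumes independent trials with a fixed failure probability p, and
    solves (1 - p) ** n = 1 - confidence for p.
    """
    if n_successes <= 0:
        raise ValueError("need at least one observed success")
    return 1.0 - (1.0 - confidence) ** (1.0 / n_successes)


for n in (10, 1000):
    print(n, round(failure_rate_upper_bound(n), 4))
# Ten out of ten still leaves room for a failure rate of roughly 0.26;
# a thousand out of a thousand narrows that to roughly 0.003.
```

The bound never reaches zero, so the 1001st case can still surprise us, but it shrinks quickly as the evidence piles up.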
A Billion eBooks
In regard to the “billion ebooks” discussion going on at Humanist, I have nothing to add that hasn’t already been said by Stephen Ramsay.
The wrong Wikipedia argument
“Scepticism about Wikipedia’s basic viability made some sense back in 2001; there was no way to predict, even with the first rush of articles, that the rate of creation and the average quality would both remain high, but today those objections have taken on the flavor of the apocryphal farmer beholding his first giraffe and exclaiming, ‘Ain’t no such animal!’ Wikipedia’s utility for millions of users has been settled; the interesting questions are elsewhere.”
- Clay Shirky, Here Comes Everybody, p.117
In my work on crowdsourcing, my advisors warn me to be careful of how I speak about Wikipedia around academics, because scholars are still divided on it. Clay Shirky’s quote perfectly encapsulates the situation: if it is clear that it works and that it works well, the question shouldn’t be “does it work?” Rather, we should be asking why it works. Kevin Kelly suggests that Wikipedia is “impossible in theory, but possible in practice”: shouldn’t we be tweaking our theories then? Perhaps, then, the issue is that if an expert were to praise Wikipedia as reliable, they would undermine society’s need for experts. Larry Sanger, creator/co-founder of Wikipedia, says no, but it’s certainly food for thought.
Numbers
Last week I wrote about the idea of trying to model the self by collecting a series of self-revelations and trying to organize them in a way where they may reveal insights that one had not previously considered.
This American Life had a whole episode earlier this year on trying to quantify things that should not be quantified. Quite appropriately, they have a series of stories of people who’ve tried experiments like the one I suggested and the lessons they learned. Read the synopsis and listen to the episode at This American Life – Numbers. As with every episode of the show, it’s highly recommended.
Artistic visualisation
This isn’t as much a direct response to Rockwell and Bradley’s Printing in Sand as a reaction to it. As I read through their embrace of scientific visualisation, a thought that I’ve been tossing around came to mind again. If visualising data should be concise, precise and easy, is there any place for the abstract? That is to say, the artistic, the random, and the unfamiliar? Last year, Wordle struck a chord with the masses, despite it not providing much meaning beyond a pretty word count. Perhaps there’s a place in our hearts for the puzzle graph, where we don’t know immediately what’s going on, but we like to savour the time of figuring it out.
Why does iTunes Genius suck so much?
I’ve recently been mulling over the question, “Why does Genius suck so much?” and the implications that it has.
Genius is the playlist generation tool in Apple’s iTunes music software. You choose a song that you’re in the mood for, and it creates an entire playlist of similar songs. Essentially, it’s a recommender system: if you like x, you’ll like y. The problem is that you get a very narrow point of view, with very little genre skipping and no pleasantly clever surprises.
What sets Genius apart from other song recommender systems is that it’s essentially powered by the crowds. Apple has the luxury of a rich data set of habits and ratings, and it appears to factor heavily into the recommendations. Yet algorithmic playlist generators were creating better results years before Genius came on the scene. So, what does this mean for the crowd?
The fact that computers can be better than humans in understanding art is off-putting. I’m still working through this problem, but here are some thoughts toward untangling it.
Ratings data is emotionless. When you rate a song 1 or 5, you’re giving it a universal ‘like’/‘dislike’. This data doesn’t factor in the mood of the song or the emotion of the listener. It is all very removed from circumstance. As I suggested to Bill Turkel, perhaps such simple crowd-based recommendations are better for high-level suggestions, like artists you may like, but useless at the micro-level (unless the data that crowds are contributing is more specific to the topic of recommendations). In contrast, technology can be quite effective at interpreting the types and patterns of sound which represent an emotion. Certainly it can’t easily understand whether a song is good, but if you want a slow, jazzy rock song, that’s fairly achievable. This is something in which music recommendation is fairly unique, as sound is easier to interpret than thousands of movie plots or millions of book themes would be.
Despite this, perhaps the most-cited example of a good music recommender is Pandora, which is an internet radio based on the Music Genome Project (MGP). The MGP does use humans to categorize songs, having professionals tag each song with over 400 tags and using an algorithm to weigh the values. Pandora’s success shows that humans are indeed effective at understanding music, given that they’re looking at it in the right way.
There’s also the effect of popular media that makes human-based recommendations unbalanced. If a lot of people like Coldplay, the range of music that it will be recommended for will be broad. This additionally creates an echo loop where popular music simply grows in popularity. Inversely, it is very difficult for new music to enter the loop. If everybody that likes The Strokes likes Yeah Yeah Yeahs, the recommender will reinforce this, brushing aside any similar new bands.
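A toy example makes the echo loop easier to see. This is only a sketch of a naive co-occurrence recommender, not how Genius or any real service works (I don’t know their internals); the listener libraries, the made-up artist names, and the normalisation I chose are all assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical listener libraries; "New Band", "Other A" and "Other B" are made up.
LIBRARIES = [
    ["The Strokes", "Yeah Yeah Yeahs", "Coldplay"],
    ["The Strokes", "Yeah Yeah Yeahs", "Coldplay"],
    ["The Strokes", "Coldplay"],
    ["The Strokes", "New Band"],
    ["Coldplay", "Other A"],
    ["Coldplay", "Other B"],
    ["Coldplay"],
]


def recommend(seed, libraries, normalise=False):
    """Rank artists that share a library with `seed`.

    Raw mode scores by the number of shared listeners. Normalised mode
    divides that by the other artist's total listeners, so an artist
    cannot win simply by being in everyone's library.
    """
    artist_counts = Counter()
    pair_counts = Counter()
    for library in libraries:
        artists = sorted(set(library))
        artist_counts.update(artists)
        pair_counts.update(combinations(artists, 2))

    scores = {}
    for (a, b), shared in pair_counts.items():
        if seed not in (a, b):
            continue
        other = b if a == seed else a
        scores[other] = shared / artist_counts[other] if normalise else shared
    # Highest score first; ties broken alphabetically.
    return sorted(scores, key=lambda name: (-scores[name], name))


print(recommend("The Strokes", LIBRARIES))
print(recommend("The Strokes", LIBRARIES, normalise=True))
```

On this toy data the raw ranking puts Coldplay first and the new band last, while the normalised ranking reverses that. Normalising has its own flaw, since an artist with a single listener gets a perfect score, which is part of why the balance of the algorithm matters so much.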
However, such problems are limited to the balance of the algorithm. Last.fm, which tracks everything its users listen to, is fairly effective in recommending similar music. Also, because of their detailed information on what a user has listened to, they can suggest less-listened-to songs. Though they don’t offer playlist generation, I wouldn’t put this beyond their abilities.
So where do crowds factor in here? If anything, Pandora suggests that this is best left to professionals. Certainly, you can’t get that sort of exhaustivity with crowds. The answer may lie in reliability. Large groups would be able to make much simpler connections, but on a larger and more verified scale. When I make a playlist with Lou Reed’s Walk on the Wild Side, I always follow it with Urge Overkill’s Girl, You’ll Be a Woman Soon. The songs are linked very little, but there’s something in me that recognizes the similarly cool feeling that I feel. If you could somehow capture millions of these sorts of links, that could lead somewhere.
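As a thought experiment, capturing those links could be as simple as counting which song people place directly after which. The sketch below is hypothetical throughout: the playlists are invented, and the `min_support` threshold is just one guess at what “more verified” might mean in practice.

```python
from collections import Counter

# Invented playlists; "Song A" and "Song B" are placeholders.
PLAYLISTS = [
    ["Walk on the Wild Side", "Girl, You'll Be a Woman Soon", "Song A"],
    ["Song B", "Walk on the Wild Side", "Girl, You'll Be a Woman Soon"],
    ["Walk on the Wild Side", "Song A"],
]


def follow_links(playlists):
    """Count how often each song is placed directly after another."""
    links = Counter()
    for playlist in playlists:
        for current, following in zip(playlist, playlist[1:]):
            links[(current, following)] += 1
    return links


def next_song(song, links, min_support=2):
    """Most common follow-up to `song`, ignoring links seen fewer than
    `min_support` times so that one person's quirk is not treated as a
    shared feeling. Returns None when no link is common enough."""
    candidates = {b: n for (a, b), n in links.items() if a == song and n >= min_support}
    if not candidates:
        return None
    return max(candidates, key=lambda b: (candidates[b], b))


print(next_song("Walk on the Wild Side", follow_links(PLAYLISTS)))
```

With millions of playlists rather than three, the threshold would do the verifying: a pairing only surfaces once enough people have independently made it.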
(Cross-posted to Crowdstorming. Leave any comments there.)
