The Joint Conference on Digital Libraries (JCDL) took place at George Washington University this month. It was my first JCDL, and the mix of information science and computer science was quite enjoyable. I was there with a poster, which I’ve previously described.
The highlights of the conference were the three solid keynote speakers. Jason Scott – lately of the Internet Archive – kicked off the conference with All You Cared about Is Gone and All Your Friends Are Dead: The Fun Frolic of Preservation Activism. Scott spoke about his own background as an accidental archivist, starting with BBSs and CD-ROM mailers, and how his desire to preserve online ephemera eventually led to the Archive Team.
The Archive Team is a renegade group of data archivers, finding sites which are being shuttered or are expected to disappear in the near future, and trying their best to archive the data. Started in response to Scott’s call to arms when AOL Hometown was shut down, the Archive Team is a clever, capable crew, one of which Scott claims to be primarily the mascot. Among their most important saves, the Archive Team was able to capture a significant portion of Geocities – the hosted web space that everyone and their mother had before blogs – before it was taken down by Yahoo. Their current attempt to back up Apple’s MobileMe service is the largest effort yet, eclipsing Geocities considerably.
I don’t intend to focus simply on Scott, nor could I do his message more justice than his own writing, including Eviction, or the Coming Datapocalypse and Datapocalypso!. There is also a good recent article about Jason Scott and the Archive Team, Fire in the Library, in the MIT Technology Review. His talk set the tone for the entire conference, with constant callbacks from presenters.
The second keynote speaker was Carole Goble, presenting The Reality of Reproducibility for in silico [ppt slides]. Goble prefaced her talk by encouraging us all to read Borgman’s The Conundrum of Sharing Research Data (2011), which covers similar ground.
“An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment, [the complete data] and the complete set of instructions which generated the figures.” – David Donoho, “Wavelab and Reproducible Research,” 1995 (via Carole Goble)
Goble described the magnitude of the problem with black boxes in research. While scientists write papers about their work and findings, reproducing their results is often difficult, if not impossible. This isn’t necessarily because the research is questionable – though Goble does touch on this issue – but often because there is not enough information to reproduce the research. As more research depends on computer programs and code, reproducibility should be trivial. Yet, when code is not shared and ‘publication’ is interpreted as a description of work rather than a self-contained contribution of one’s work to society, it inhibits reproducibility and consequently casts suspicion on the findings. As Goble put it, papers should be executable. To this end, she argues for a Reproducible Research System (RRS), consisting of a Reproducible Research Environment (RRE) for aggregating computational tools and a Reproducible Research Publisher (RRP) for publishing in a way that lets a paper connect to an RRE and be executed.
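Goble’s RRS/RRE/RRP terminology is architectural, but the underlying idea is concrete enough to sketch. Below is a minimal, hypothetical illustration in Python of what an ‘executable’ analysis might bundle: the raw data, the method that produces the reported numbers, and a record of the environment and input provenance. The file names and the statistic computed are my own placeholders, not from her talk.

```python
# A minimal sketch of the "executable paper" idea: the analysis, its input
# data, and a record of the environment travel together, so a reader can
# re-run the result-generating step rather than trust a static description.
# File names and the computed statistic are hypothetical placeholders.
import csv
import hashlib
import json
import platform
import statistics
import sys
from pathlib import Path

DATA_FILE = Path("measurements.csv")   # hypothetical bundled dataset (column: "value")
RESULTS_FILE = Path("results.json")

def sha256(path: Path) -> str:
    """Fingerprint the input data so re-runs can verify they used the same file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def analyse(path: Path) -> dict:
    """The 'method' half of reproducibility: recompute the reported numbers from raw data."""
    with path.open(newline="") as f:
        values = [float(row["value"]) for row in csv.DictReader(f)]
    return {"n": len(values), "mean": statistics.mean(values), "stdev": statistics.stdev(values)}

if __name__ == "__main__":
    record = {
        "result": analyse(DATA_FILE),
        "data_sha256": sha256(DATA_FILE),   # provenance of the data
        "python": sys.version,              # provenance of the environment
        "platform": platform.platform(),
    }
    RESULTS_FILE.write_text(json.dumps(record, indent=2))
    print(json.dumps(record["result"], indent=2))
```

The point of the sketch is only that data, method, and environment are packaged as one re-runnable unit; a real RRE would of course go much further (workflows, dependencies, figures).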
“We need a paradigm that makes it simple, even for scientists who do not themselves program, to perform and publish reproducible computational research.”
George Dyson gave the closing keynote with his talk, The Sensible Moment: 1680 – 2012. Based around his research for Turing’s Cathedral: The Origins of the Digital Universe, Dyson gave us a computing history of memory. There were many interesting tidbits here, such as Francis Bacon’s binary cipher, where letters were encoded in 5-bit blocks. Also notable was Robert Hooke’s back-of-the-envelope calculation of how many thoughts (ideas) a human is capable of, and Alfred Smee’s wonderfully concise description of consciousness.
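As an aside, Bacon’s scheme is simple enough to sketch: each letter becomes a block of five binary symbols (traditionally written ‘a’ and ‘b’). The snippet below is my own illustration of the idea, using the modern 26-letter variant rather than Bacon’s original 24-letter alphabet.

```python
# A quick sketch of the 5-bit encoding Dyson mentioned: Bacon's biliteral
# cipher maps each letter to a block of five binary symbols ('a'/'b').
# This is the 26-letter variant; Bacon's original conflated I/J and U/V.
def bacon_encode(text: str) -> str:
    blocks = []
    for ch in text.upper():
        if ch.isalpha():
            n = ord(ch) - ord("A")  # letter index 0..25
            blocks.append(format(n, "05b").translate(str.maketrans("01", "ab")))
    return " ".join(blocks)

print(bacon_encode("Turing"))  # T -> 'baabb', and so on
```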
Moving into the 20th century, Dyson described how philosophy and science fiction – such as H.G. Wells’s World Brain, a giant encyclopedia of human knowledge – eventually led to the reality of the EDVAC and ENIAC. Dyson outlined the pioneering work of Turing in conceptualizing stored memory and the subsequent work by von Neumann in generalizing it beyond a single long string of ticker tape. There are apparently debates over whether von Neumann was aware of Turing’s work, which Dyson found to be probable given that von Neumann’s copy of the journal where Turing published his paper was nearly falling apart. Dyson spent the rest of the talk focused on von Neumann and his colleagues at the Institute for Advanced Study (where, incidentally, his own father Freeman Dyson worked for 40 years). Entertaining computer logbooks (“I can be just as stubborn as this thing”), personal letters, memos (such as the humanists at IAS complaining about the computer scientists consuming too much sugar in their tea), meeting notes, and payroll sheets all paint a picture of early computer science. Through many tea-fueled nights in the late forties and early fifties, computing progressed from ticker-tape input and 53k of memory (worldwide!) to a research tool that supported work on everything from atoms to the universe.

Below are some of the common themes at this year’s JCDL. Keep in mind that as I sort through my notes, this list may grow or change.
Chain of Ideas and Dependencies
“Dependency is the root of decay.”
A recurring theme was the fragility of interdependent objects and processes. This was the central theme of Goble’s talk, which warned of the problems that arise when a piece of the research whole is missing. Goble thus advocates for fully self-contained research environments, and for publishing environments that allow research to be executed within a paper. Jason Scott tangentially touched on the idea of dependencies in the context of preservation. Joking that humans are only good at destroying things and at entertainment, Scott points out that every time something survives undestroyed, it represents an unbroken chain of humans agreeing to preserve it. He was speaking to the wonder of human cooperation, but his statement is equally effective at shifting focus to a) the fragility of the preservation process and b) the importance of a record’s history in understanding how that item has survived time. Interestingly, Dyson’s keynote abstract begins with a 1680 quote by Robert Hooke: ”There is as it were a continued Chain of Ideas coyled up in the Repository of the Brain, the first end of which is farthest removed from the Center or Seat of the Soul where the Ideas are formed, which is always the Moment present when considered.” This recalls, for me, Dostoevsky’s Man from Underground, who said that “every primary cause at once draws after itself another still more primary, and so on to infinity. That is just the essence of every sort of consciousness and reflection.” There is a compelling idea here, of how thoughts depend on ever more primal thoughts, each one a necessary step in arriving at the most conscious thought. What if we were to apply this thinking to data, akin to Jason Scott’s note, where provenance is a more primary building block of our current data, and what we currently have is a blueprint for a richer future object?
Black boxes and the cloud
Both Scott and Goble warned of black boxes, where data goes in and cannot easily be pulled out. For Goble, this connects to problems with scientific reproducibility, while Scott looks at black boxes in terms of ownership and preservation. Users generate content, put it up into black boxes in the cloud, and there it stays. Perhaps the most apt example of a black box brought up during the conference was that of URL shorteners, third-party sites that take long URLs and shorten them. When the shortened URL is resolved, a user first goes to the URL shortener’s server, which forwards them to the desired URL. Jason Scott spoke of the URL shortener tr.im, which shut down, killing many links in the process. Not only are URL shorteners an ephemeral third party that adds a dependency in the middle, but they also create the data curation problem of duplicate references to the same links.
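For illustration, resolving a short link is just a matter of following the shortener’s HTTP redirect, which is exactly the extra dependency being criticized: if the shortener’s server disappears (as tr.im’s did), the chain breaks. A rough sketch, with a hypothetical short URL, is below.

```python
# A small sketch of the dependency a URL shortener adds: the short link is
# only meaningful as long as the shortener's server is around to answer with
# an HTTP redirect to the real destination.  The example URL is hypothetical.
import urllib.request

def resolve(short_url: str) -> str:
    """Follow redirects and return the final URL the shortener points at."""
    # urllib follows 301/302 redirects automatically; geturl() reports the
    # URL of the last hop, i.e. the destination hidden behind the short link.
    with urllib.request.urlopen(short_url) as response:
        return response.geturl()

print(resolve("https://example-shortener.invalid/abc"))  # hypothetical short link
```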
Your metadata is my data
In the Big Data panel, James Snyder (of the Library of Congress’ National Audiovisual Conservation Center) pointed out that one person’s data is another person’s metadata. Because Snyder works with audiovisual materials, supplemental items such as movie scripts are metadata for his AV work. In other contexts, a script itself may be considered useful to scholarship. Robert Towne’s screenplay for Chinatown is a work all on its own, valuable not just as the basis for Polanski’s film but also as a significant example of the art of screenwriting and for understanding Polanski’s divergences from the script. Borgman made a similar statement in her presentation, saying that data is in the eye of the beholder and that one person’s background data is another person’s foreground data (1).
Ownership over data
“If you tell people they can upload their content, you should have a clear and distinct way for them to retrieve their content. People do it ad-hoc as they can, but the abilities of most people, the people without an engineering degree or years of experience, to get back what they put up is minimal. It’s not that important. We should make it important.”
Throughout JCDL, we would often wax philosophical about ownership of data. Jason Scott shared concerns about people putting their intellectual material on websites that mishandle it, lose it, or treat it as if it were their own. Goble talked about policy constraints that restrict data sharing, particularly touchy for publicly funded research. One admittedly easy question that was often asked (posed to Jason Scott, Cathy Marshall, and the Big Data panel) was about people who have accepted the ephemerality of their online content and would not like their material archived. This scenario is not simply rhetorical; Scott admitted having heard from numerous people about this concern. Cathy Marshall also quoted writers complaining about the permanence the LoC Twitter archive gives to initially ephemeral content. The LoC’s solution, as asked of them by Twitter, is to restrict access to the Twitter message archive. Jason Scott offered the utilitarian view on the issue, explaining that while there are people who have public online content but do not want it preserved, their number is dwarfed by the number of people who appreciate an archive or having their own work preserved within it. During the Digging into Data panel, somebody asked how content owners/creators/digitizers can make their data appropriate for research, convenient for the sorts of large projects seen in Digging into Data. In response, one of the participants asked for better documentation of data, and Cassidy Sugimoto argued for more reliable identity disambiguation.
“With legislation or even the kind of peer pressure you all used to dupe people into Creative Commons, you could have this be the norm from this point forward.”
Not a research issue as much as a political issue – albeit one with consequences for future research – the keynote speakers advocated for better data practices through a culture that celebrates or requires sensitivity to them. Scott’s writing is more explicit about the need for peer pressure, saying that companies need to offer export tools as a matter of principle and that users should likewise demand them. Peer pressure is essentially how Google’s Data Liberation Front functions internally, I believe, convincing product teams to make it easy for users to import or export their data.
“If the methods used to generate the data are algorithmic, the method itself may need to be captured and curated.”
–Digital Curation Guide
Reproducibility was a large theme in Carole Goble’s keynote, but it was largely ignored in the presentations that I attended. Perhaps reproducibility is an easier topic to think about when considering big-picture issues than it is when performing day-to-day research. Goble made a strong case for considering archiving not just as a process of keeping data, but also of keeping processes. ”Scientific reproducibility => data + method,” one of her slides argued. Reproducibility is – or is supposed to be – the cornerstone of the scientific method. It should be possible to confidently conduct research on the backs of prior studies, something that is not always the case. In a recent study, a cancer research lab tried to reproduce 53 landmark studies, but found that 47 of them could not be replicated. Difficulties in replicating results can occur for a number of reasons, from a scarcity of necessary details in the published work to more disingenuous or dishonest reasons. In addition to complicating future scholarship, Goble warns of increased distrust in scientific work as problematic published results accumulate. Goble also spoke of policy restraints on data sharing: pseudo sharing, strings-attached sharing, “just enough” sharing, embargoes, and data flirting.
There were two panels that dealt with big data: one about Big Data challenges in general, and the other about the Digging into Data challenge. The Big Data panel discussed the definition of big data and use cases of how ‘big data’ manifests itself in practice. Leslie Johnston, Chief of Repository Development at the Library of Congress, pointed out that the definition of ’big data’ is fluid. “It used to mean data that is too plentiful to work with easily,” but how much data we can ‘work with easily’ changes daily, as do the scales at which we surpass that threshold.

Among the discussions the Big Data panel focused on, consideration was given to the LoC Twitter archive, which will be available to researchers later this year. Even though the tweets being received from Twitter are exclusively public tweets, not everybody likes the idea of their ephemera being archived, so the Library of Congress has to balance privacy and openness in opening up the archive. It also has to decide what should be held for its own researchers and what it should provide to others as research support activities – balancing data research with data stewardship. The Twitter archive makes you realize the Internet’s daunting role in generating new data. We spent centuries ignoring low culture and pedestrian history, and now, only a few decades after culturally realizing the value of such materials, we’re producing unbelievable amounts of them: everyday data on our culture and the people of the time, which appears fairly ephemeral on the surface but which may contribute to a larger understanding of the zeitgeist. Where does one draw the line on preservation of online culture, and how can we know that in drawing a line we’re not losing material that is of interest to future scholarship?

Naturally, Jason Scott’s own work touches on big data, as the Archive Team finds itself running up against increasingly high walls of data. The Geocities scrape was for a while the largest torrent on The Pirate Bay, and their archiving of MobileMe has required more data than Brewster Kahle had initially budgeted for all of the Internet Archive for the full year. When asked about when one stops collecting, Scott did not have a definitive answer, but he did say that at a certain point, for sanity, you have to resign yourself to not being able to save everything. Still, my inclination is to agree with Scott’s approach, which is to do the best that we can.

Christine Borgman’s short paper presentation (Data, Data Use, & Inquiry: A new point of view on data curation) touched on the question of what is kept or should be kept, finding that background data is often not considered important to save. This point is aggravated by the fact that background data is generally not cited, further obscuring its usefulness. James Snyder, also on the Big Data panel, pointed out that the Library of Congress is required to maintain its A/V materials in a readable format for about 150 years (approximately… I’m going from memory). For ”the active and ongoing management of data throughout its entire lifecycle of interest and usefulness to scholarship” (Cragin et al. 2007), Snyder’s mandate and its resulting problems are archetypal.

In the Digging into Data panel, Stuart Dempster of JISC talked about “smart digitisation,” a more selective approach to digitisation. Brett Bobley of the NEH added that “Our ability to digitize has outstripped our ability to analyze” (source).
It is interesting that this message is coming from funders, because I would have expected that preservation and digitisation should come before analysis. I’m not complaining – funders behind ‘analysis’ means that they’re in my end of the court – but I am surprised. From my point of view, conducting research on increasingly large corpora and dabbling in crowdsourcing for large-scale qualitative analysis, our ability to digitise seems to be lagging behind our ability to analyze. Google and Microsoft’s ability to digitise, however… One of the Digging into Data projects, presented by Tom Ewing, tracked the flow of influenza information through newspapers in the early 20th century. Such projects offer a promise of understanding past societies in ways similar to how our modern, context-rich society is being understood. Ewing’s project recalls modern big-data public health projects – like Google Flu Trends – as well as media zeitgeist projects – such as Information is Beautiful’s Mountains out of Molehills chart of global scares (led by swine flu and bird flu).
The cost of preserving what you need / Loss of important noise in conversion
George Dyson found that von Neumann’s copy of Turing’s paper was falling apart, especially compared to the condition of the other London Mathematical Society journals that his team had. Dyson points out that incidental ‘noise’ like the worn-out spine of a book is lost with digitisation. However, it’s not a necessary loss. The condition of works can be recorded, but it comes down to the costs involved with digitisation. You can always include more metadata, but it comes at a cost and possibly a loss in breadth. I wonder if it would be feasible to classify data by completeness, though it is likely that nobody would admit to sub-par data. Still, such an idea isn’t far off from Tim Berners-Lee’s five-star linked open data classification scheme. Also, losing important ‘noise’ is not strictly an issue with digitisation, but rather of conversion between formats. Losing information about a print journal by digitising it is no less problematic than printing and photocopying a governor’s FOIA’d emails; however, it is the former conversion that is generally done. With data over the years, as formats come and go, conversion happens multiple times, so Dyson’s point about physical-to-digital loss is just as important for digital-to-digital. Dyson’s story of the crumbling journal also brings up the issue of books as artifacts. Where digitisation projects are often concerned with works and expressions, in the FRBR sense, Dyson reminds us that sometimes the individual item is important.
Digital snake oil
Dyson’s experience in conducting his research had him pestering the Institute for Advanced Study repeatedly for physical records and digging through numerous basements. The impression I was left with is that digitization lulls us into a false sense of comfort. Everything’s been digitised, right? At the same time, when it comes to historical materials that institutions don’t have room for – ones languishing in basements next to the water heater – digitized work may be the lower-cost solution, though with an initial overhead cost. Dyson’s talk touched on false digital comfort in another way. When ruminating on how early computer pioneers would view today’s computers, he suggested that they would see them with predictable wonder, but also with surprise at how similar today’s computing environment looks to that of their time. Perhaps a bold statement, but a reminder from Dyson not to accept the current way in computing as the best way (paraphrased, via).
Data from Anonymous Online Workers
Finally, this year’s JCDL had a fair degree of interest in crowdsourcing. Much of it, unfortunately, did not deal with clever uses or with issues of preservation or curation. Rather, the general focus was on the big roadblock to crowdsourcing in academia: that it does not fall within our traditional measures of reliability. Papers by Jin Ha Lee and Xiao Hu (nominated for Best Paper and without a doubt the best-received paper of the conference) and Cathy Marshall et al., as well as my own poster presentation, tended to focus on the act of selling crowdsourcing: showing how it can be reliable for scholarship and why. Elsewhere, Goble briefly talked about a (non-anonymous, volunteer) crowdsourcing project that she leads, myExperiment.org. On the other end of the spectrum, James Snyder advocated getting humans out of the preservation workflow as much as possible, for reasons of quality. While we often try to compare crowdsourcing workers/contributors to experts, Snyder points out that even experts introduce errors.
More JCDL Notes
Notes on digital preservation and Jason Scott’s JCDL keynote, Robin Camille
Hany Salah Eldeen’s Conference Report
JCDL Tweets Archive
JCDL Conference Schedule