Click the image for a map of topics in the Day of Digital Humanities. I meant to share this on Twitter, but it makes photos look like garbage.
I wrote a simple word count tool the other day. It adds a ‘word count’ option to the context menu (i.e. right-click menu) when you select text in Chrome. Install here.
A general purpose productivity tracker
I just published Progress Bar Timer in the Chrome Web Store. It lets you set up general purpose trackers in the form of progress bars. There are counters, timers, and clocks. Code is at GitHub, so feel free to submit bug reports and suggestions.
The application was designed toward my productivity habits but – spurred by the sense of the public eye on GitHub – I’ve tried to make it useful to others. My favorite use has been to combine a counter and a clock side-by-side. For example, during my field exam, I maintained a bar of word count progress alongside a bar showing where in the two-week writing period I was.
In Historical Note: Information Retrieval and the Future of an Illusion (1988) Don Swanson offers his experienced perspective on IR and problems that we’ve ignored. He suggests that we explicate so-called ‘postulates of impotence’: statements of what cannot be done.
Swanson offers some postulates of impotence himself, as well as postulates of fertility. I like the idea of a research community formalizing the lies they’ve been telling themselves, so I’ve reproduced a truncated version of his postulates below.
In his postulates of impotence (PI), Swanson argues that fully automatic indexing and retrieval is not effectively possible. While he admits that computing brings many benefits of scale and speed, his PIs seek to remind us that this doesn’t necessarily mean that we are better at retrieval.
In his postulates of fertility (PF), Swanson offers a little-explored area where IR can help: making connections between disparate information that had not been considered previously. He cites scientific fields as a place where there is limited discussion and citation across field boundaries, but where doing so is extremely useful.
Do you have your own postulates of impotence for your field?
The Topic Modeling workshop is going on today at MITH and I’m back in Champaign writing my Field Exam. Still, in the excitement of #dhtopic, I thought I would share my script for processing WordPress exports for use with MALLET.
Use it in the following way:
usage: process.py [-h] [--split SPLIT_ON] input output

positional arguments:
  input             Location of WordPress XML export file.
  output            Location of file export. Overwrites existing.

optional arguments:
  -h, --help        show this help message and exit
  --split SPLIT_ON  [post|author] Define what constitutes a document.
                    Default 'post'.
For --split, choose either post (a document representation is the words of a post) or author (a document representation is all the words that an author has written).
For example, using the Day of DH 2012 data,
python process.py 00-day-of-dh-2012-combined.xml ddh12-mallet-authors.txt \
    --split "author"
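For anyone curious what the post/author split amounts to, here is a minimal sketch. It is not the actual process.py: the function names (clean, documents, to_mallet_line) are hypothetical, and it assumes the export stores the author in dc:creator and the body in content:encoded, and that MALLET reads one document per line as "name label text".

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Namespaces assumed to be used by the WordPress export for author and body.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "content": "http://purl.org/rss/1.0/modules/content/",
}


def clean(html):
    """Strip tags and collapse whitespace so a document fits on one line."""
    text = re.sub(r"<[^>]+>", " ", html or "")
    return re.sub(r"\s+", " ", text).strip()


def documents(xml_text, split_on="post"):
    """Yield (name, label, text) tuples, one per document.

    split_on="post": each post is its own document, labeled by author.
    split_on="author": all of an author's posts are joined into one document.
    """
    root = ET.fromstring(xml_text)
    by_author = defaultdict(list)
    for i, item in enumerate(root.iter("item")):
        author = item.findtext("dc:creator", default="", namespaces=NS)
        # Name and label must not contain whitespace in the one-line format.
        author = re.sub(r"\s+", "_", author.strip()) or "unknown"
        text = clean(item.findtext("content:encoded", default="", namespaces=NS))
        if not text:
            continue
        if split_on == "post":
            yield (f"post{i}", author, text)
        else:
            by_author[author].append(text)
    for author, texts in by_author.items():
        yield (author, author, " ".join(texts))


def to_mallet_line(doc):
    """Format a (name, label, text) tuple as one line of import-file input."""
    return " ".join(doc)
```

Writing to_mallet_line(doc) for each document, one per line, gives a file in the shape that the import step expects.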
From there, topics can be built however you’d like using MALLET’s command line interface. For example, to import the file:
bin/mallet import-file --input ddh12-mallet-authors.txt \
    --output ddh12authors.mallet --keep-sequence --remove-stopwords
Hope that’s useful. My own pet project is still on-going, but I’ll polish it up after my exam.
p.s. Because of my field exam, I’m putting this up without bells and whistles (or double-checking that it even still works). Leave a comment if you have a) problems, b) fixes, or c) want to provide extra details where I neglected them.
This weekend I finally read Edward Tufte’s The Visual Display of Quantitative Information (1983). Tufte is sometimes a polarizing figure (based on the trends of the day or in reaction to his thornier arguments), but at his best he communicates the potential of statistical graphics cohesively and convincingly. The concepts in this book are well-known by many of us at this point, and my embarrassment at waiting so long to read the source was quickly replaced by delight at its contents.
This fall, I’ve been brushing up on important or classic readings in my field, in preparation for my field exam. Since many of them are interesting, I hope to occasionally share summaries. Today’s classic is Ben Shneiderman’s Tree visualization with tree-maps: 2-d space-filling approach (1992).
Shneiderman introduces tree-maps: a two-dimensional approach for representing hierarchical tree structures. Today, of course, tree-maps are commonplace, with uses such as making sense of a budget, showing popularity of stories in the news, and – its original use – finding space-hogging files in a filesystem. Each node represents quantitative data as a rectangle, sized in proportion to its quantity among all the children of its parent. Hierarchies are shown by nesting other rectangles within the parent node’s rectangle.
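To make the sizing and nesting concrete, here is a minimal sketch of a slice-and-dice layout, which splits each rectangle among its children and alternates the split direction at each level of the hierarchy. The dictionary format (name, value, children) and the function names are my own assumptions for illustration, not code from the paper.

```python
def size(node):
    """Quantity of a node: its own value for a leaf, else the sum of its children."""
    children = node.get("children")
    if children:
        return sum(size(child) for child in children)
    return node["value"]


def layout(node, x, y, w, h, depth=0, out=None):
    """Return a list of (name, x, y, w, h) rectangles for node and its descendants.

    Each child gets a share of its parent's rectangle proportional to its
    quantity among its siblings. Even depths split the width, odd depths
    split the height, so nested levels alternate direction.
    """
    if out is None:
        out = []
    out.append((node["name"], x, y, w, h))
    total = size(node)
    offset = 0.0  # fraction of the parent already handed out to earlier siblings
    for child in node.get("children", []):
        share = size(child) / total if total else 0.0
        if depth % 2 == 0:
            layout(child, x + offset * w, y, share * w, h, depth + 1, out)
        else:
            layout(child, x, y + offset * h, w, share * h, depth + 1, out)
        offset += share
    return out


# A tiny hypothetical filesystem: sizes are made-up numbers.
tree = {
    "name": "home",
    "children": [
        {"name": "photos", "value": 2},
        {"name": "docs", "children": [
            {"name": "thesis", "value": 1},
            {"name": "notes", "value": 1},
        ]},
    ],
}
rects = layout(tree, 0, 0, 100, 60)
```

Here photos takes the left half of the 100 by 60 canvas, docs takes the right half, and thesis and notes are stacked inside the docs rectangle.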