I’m an information scientist with a digital humanities background, specializing in large-scale text analysis, crowd systems, and information retrieval over novel datasets.
With data from NYPL Labs’ What’s on the Menu?
I just put up a modest reference repository with various slices of data on US names. It includes gender probabilities by name and an estimate of names among US-born citizens today, made by cross-referencing baby names data with the population age distribution for 2014. Find it on GitHub.
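As a rough illustration of what cross-referencing the two sources could look like, here is a minimal R sketch. It is one plausible way to do it, not necessarily how the repository does it: the toy numbers are made up, and the column names (`year`, `name`, `count`, `age`, `people`) are assumptions. The idea is to take each name's share of its birth-year cohort and scale it by how many people of that age are in the 2014 population.

```r
# Made-up toy numbers, for illustration only
births <- data.frame(
  year  = c(1950, 1950, 2000, 2000),
  name  = c("Mary", "Linda", "Mary", "Emily"),
  count = c(600, 400, 100, 900)
)
population_2014 <- data.frame(
  age    = c(64, 14),   # ages in 2014 of people born in 1950 and 2000
  people = c(800, 950)
)

# Share of each birth-year cohort that carries each name
cohort_totals <- aggregate(count ~ year, data = births, FUN = sum)
names(cohort_totals)[2] <- "cohort_births"
births <- merge(births, cohort_totals, by = "year")
births$share <- births$count / births$cohort_births

# Scale each share by the size of that cohort in the 2014 population
births$age <- 2014 - births$year
births <- merge(births, population_2014, by = "age")
births$estimate <- births$share * births$people

# Sum across birth years to get one estimate per name
name_estimates <- aggregate(estimate ~ name, data = births, FUN = sum)
```

A sketch like this ignores people whose names are too rare to appear in the baby names data, as well as immigration and emigration, so the result is only an estimate.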
The richness of language can be under-appreciated because of its mundane nature. James Somers’s essay “You’re probably using the wrong dictionary” recently turned me on to old dictionaries, which, with their colorful descriptions and honest uncertainties, gratify much more than what we’ve come to expect of dictionaries. While modern dictionaries give you matter-of-fact descriptions of words you don’t know, older dictionaries have a vivid, more exciting style that is equally likely to enlighten you about words you do know. Tracking down references made by John McPhee about his own dictionary, Somers recommends Webster’s Revised Unabridged 1913 dictionary.
Reading Webster’s 1913 is a satisfying exercise. What strikes me most, however, are the descriptions of slang, colloquialisms, and vulgarities. These are terms or uses that are informal or conversational; the dictionary’s etymology for slang notes its roots in ‘having no just reason for being.’ With these entries, a work now seen as a record of American English is defining language which, by its own description, is “unauthorized”.
The tension results in a wonderful series of entries, some that are very familiar to us:
When analyzing anonymous user data in a team, I often take an extra step to help discussion: converting user identifiers to popular English name pseudonyms.
Pseudonyms tend to make the data more welcoming to team members who aren’t working directly with it, and they help you follow trends and outliers. They also help with visual sanity checks during analysis: names are simply easier to remember than arbitrary identifiers, which makes problems easier to spot when inspecting the data.
Popular baby names are readily available from the Social Security Administration, and I usually keep a derivative text list handy. In the simplest case, you can convert each unique ID into a name. When I want to safeguard against name assignments changing as the data changes, I’ll save the ID-to-name conversions in a basic CSV.
Below is a very basic example written in R to show how easy it is to do:
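A minimal sketch of this kind of conversion, assuming a short stand-in list of names and some made-up user data. The `names_pool` vector, the `events` data frame, and the `assign_pseudonyms` helper are all hypothetical names chosen for illustration. When a `mapping_file` path is given, earlier assignments are read back from the CSV and reused, so a user keeps the same name as new IDs arrive.

```r
# Hypothetical inputs: `names_pool` stands in for a list of popular baby names;
# `events` is made-up user data keyed by an anonymous ID.
names_pool <- c("Emma", "Noah", "Olivia", "Liam", "Sophia", "Mason")
events <- data.frame(
  user_id = c("a91f", "07bc", "a91f", "e3d2", "07bc"),
  clicks  = c(3, 10, 4, 1, 7),
  stringsAsFactors = FALSE
)

# Build an ID -> name lookup table, reusing saved assignments when present
assign_pseudonyms <- function(ids, names_pool, mapping_file = NULL) {
  mapping <- data.frame(user_id = character(0), pseudonym = character(0),
                        stringsAsFactors = FALSE)
  if (!is.null(mapping_file) && file.exists(mapping_file)) {
    # Read everything as text so IDs such as "007" keep their leading zeros
    mapping <- read.csv(mapping_file, stringsAsFactors = FALSE,
                        colClasses = "character")
  }
  new_ids <- setdiff(unique(ids), mapping$user_id)
  unused  <- setdiff(names_pool, mapping$pseudonym)
  if (length(new_ids) > length(unused)) {
    stop("Not enough unused names for the number of new IDs")
  }
  if (length(new_ids) > 0) {
    mapping <- rbind(mapping,
                     data.frame(user_id = new_ids,
                                pseudonym = unused[seq_along(new_ids)],
                                stringsAsFactors = FALSE))
  }
  if (!is.null(mapping_file)) {
    write.csv(mapping, mapping_file, row.names = FALSE)
  }
  mapping
}

# Simplest case: no saved mapping, one name per unique ID
mapping <- assign_pseudonyms(events$user_id, names_pool)
events$name <- mapping$pseudonym[match(events$user_id, mapping$user_id)]
```

Names are handed out in list order here, which keeps the result reproducible; shuffling `names_pool` first with `sample()` would avoid always giving the most popular names to the earliest IDs.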
Motorola’s now-discontinued MotoACTV sportswatch gives you the commendable option to download all your running routes.
With a touch of data hacking, some manual editing to remove redundant routes, and some beautiful map tiles from Stamen, I ended up with a nice record of the places that I visited in 2012/13 and the parts of my town explored.