I just put up a modest reference repository with various slices of data on US names. It includes gender probabilities by name and an estimate of names among US-born citizens today, made by cross-referencing baby names data with the population age distribution for 2014. Find it on GitHub.
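To make the cross-referencing idea concrete, here is a rough Python sketch of one way such an estimate could be computed. It is not necessarily how the repository does it. The sketch assumes that a name's share of births in a given year carries over unchanged to the people of that age in 2014, which ignores immigration, differences in mortality, and names too rare to be published. The function names, the data layout, and the tiny numbers are all made up for illustration.

```python
from collections import defaultdict

def estimate_living(births, population_by_age, year=2014):
    """Estimate living people per (name, sex).

    births: {(name, sex, birth_year): count}   (hypothetical layout)
    population_by_age: {age_in_years: people alive at that age in `year`}
    """
    # Total recorded births per year, used to turn counts into shares.
    total_births = defaultdict(int)
    for (_, _, birth_year), count in births.items():
        total_births[birth_year] += count

    estimate = defaultdict(float)
    for (name, sex, birth_year), count in births.items():
        age = year - birth_year
        share = count / total_births[birth_year]
        # Assumption: the name's share of births equals its share of survivors.
        estimate[(name, sex)] += share * population_by_age.get(age, 0)
    return dict(estimate)

def gender_probability(estimate, name):
    """Probability that a person with `name` is female; None if the name is unseen."""
    female = estimate.get((name, "F"), 0.0)
    male = estimate.get((name, "M"), 0.0)
    total = female + male
    return female / total if total else None

# Toy, made-up numbers: one birth year and one age bucket.
toy_births = {
    ("Jordan", "M", 2000): 300,
    ("Jordan", "F", 2000): 100,
    ("Alex", "M", 2000): 600,
}
toy_population = {14: 2000}  # people aged 14 in 2014

toy_estimate = estimate_living(toy_births, toy_population)
jordan_female = gender_probability(toy_estimate, "Jordan")
```

With real data, `births` would come from the published baby name counts and `population_by_age` from a census age table. The arithmetic stays the same.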
The richness of language can be under-appreciated because of its mundane nature. James Somers’s essay You’re probably using the wrong dictionary recently turned me on to old dictionaries, which – with colorful descriptions and honest uncertainties – gratify much more than what we’ve come to expect of dictionaries. While modern dictionaries give you matter-of-fact descriptions of words you don’t know, older dictionaries have a vivid, more exciting style that is equally likely to enlighten you about words you do know. Tracking down references made by John McPhee about his own dictionary, Somers recommends Webster’s Revised Unabridged 1913 dictionary.
Reading Webster’s 1913 is a satisfying exercise. What strikes me most, however, are the descriptions of slang, colloquialisms, and vulgarities. These are terms or uses which are informal, conversational; the dictionary’s etymology for slang notes its roots in ‘having no just reason for being.’ With these entries, a work now seen as a record of American English is defining language which, by its own description, is “unauthorized”.
The tension results in a wonderful series of entries, some that are very familiar to us:
When analyzing anonymous user data in a team, I often take an extra step to help discussion: converting user identifiers to popular English name pseudonyms.
Pseudonyms tend to make the data more welcoming to team members who aren’t working directly with it, and they help you follow trends and outliers. They also help with visual sanity checks during analysis: names are easier to remember than raw identifiers, which helps you spot problems when inspecting the data.
Popular baby names are readily provided by the Social Security Administration, and I usually keep a derivative text list handy. In the simplest case, you can convert each unique ID into a name. When I want to safeguard against name assignments changing as the data changes, I’ll save the ID-to-name conversions in a basic CSV.
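A minimal sketch of that workflow, written here in Python rather than R: each new ID gets the next unused name, and the mapping can be saved to and reloaded from a CSV so that earlier assignments never change. The function names, the `id`/`name` column headers, and the short name list are hypothetical, and IDs are assumed to be strings (which is how they come back from a CSV).

```python
import csv
import os

def load_mapping(path):
    """Read previously saved ID -> name pairs; empty dict if no file exists yet."""
    if not os.path.exists(path):
        return {}
    with open(path, newline="") as f:
        return {row["id"]: row["name"] for row in csv.DictReader(f)}

def save_mapping(mapping, path):
    """Write the ID -> name pairs as a two-column CSV with a header row."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "name"])
        writer.writerows(mapping.items())

def assign_pseudonyms(user_ids, names, mapping=None):
    """Return a mapping where known IDs keep their name and new IDs get the next unused one."""
    mapping = dict(mapping or {})
    used = set(mapping.values())
    # Lazily walk the name list, skipping names that are already taken.
    unused = (name for name in names if name not in used)
    for uid in user_ids:
        if uid in mapping:
            continue
        try:
            name = next(unused)
        except StopIteration:
            raise ValueError("more unique IDs than available names") from None
        mapping[uid] = name
        used.add(name)
    return mapping

# Illustrative name list; in practice this would be read from a longer text file.
names = ["Mary", "James", "Linda", "Robert"]

first = assign_pseudonyms(["u93", "u17", "u93"], names)
# Later, new data arrives: existing IDs keep their names.
later = assign_pseudonyms(["u17", "u55"], names, first)
```

Calling `save_mapping(later, path)` after each run and `load_mapping(path)` before the next one keeps the pseudonyms stable across analyses.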
Below is a very basic example written in R to show how easy it is to do:
Motorola’s now-discontinued MotoACTV sportswatch gives you the commendable option to download all your running routes.
With a touch of data hacking, some manual editing to remove redundant routes, and some beautiful map tiles from Stamen, I ended up with a nice record of the places I visited in 2012/13 and the parts of my town that I explored.
How small can a crowdsourcing contribution be?
At November’s CrowdCamp workshop, a group of us got together and prototyped a number of sample systems to see how low-effort crowdsourcing would work. We posted a report at Follow the Crowd.
Our prototypes were silly at times, but they helped us think about which combinations of low-effort input methods and non-distracting user contexts would make this kind of crowdsourcing work.
The ideas we prototyped, available on GitHub, include:
- A binary tweeting interface that lets you type sentences by repeatedly choosing between common words
- A passive image voting interface that captures a user’s smile as a ‘like’
- A browser extension proof-of-concept that lets a worker complete tasks while a page is loading
- A hot-or-not style interface for choosing the better of two choices. The twist is that you’re choosing using affirmative grunts, so you can play it while listening (or pretending to listen?) to somebody!
Details are at Follow the Crowd. The team was Jeff Bigham, Kotaro Hara, Rajan Vaish, Haoqi Zhang, and myself.
Based on an XKCD comic, the Up-Goer Five text editor only lets you write using the one thousand most common words in English. Here are my attempts to describe what I do in crowdsourcing and information retrieval using only common words.
How do you find something from a lot of written stuff? If there are hundreds of things or more, you can’t look at all of them! One way we can find things is to use the words to understand the ideas. Then, when you search with a question, we can find the ideas that you are asking about and find the written things that answer your question. However, words and ideas aren’t exactly the same thing, so we look for ways to make better how a computer understands the ideas in your question and in the stuff you’re looking through.
When people get together on computers, they make fun, cool, and strange things. My job is to understand why they do it, and how we can work together to fix problems in the same ways.