A blog about linguistics and the beauty of language.

Metadata: Hundreds of recordings, but what are they about?

At the end of each day of fieldwork, I sit down and “deal with my data”. No matter how tired I am, no matter how much I want to put it off, it’s an essential part of my daily routine. The first thing I do is back up my recordings. And while my files are transferring, I go over my notes and make sure that all of my metadata is in order. Metadata is all the information about how the recording was made — who was there, where we were, what we talked about, what audio recorder I used, etc. Metadata is data about data, and it’s a crucial aspect of good data management.

Proper metadata serves many roles. For me as a researcher, it means I can always identify which handwritten notes go with which recordings, and I can remember which person told me what when. But more importantly, when I archive my data, metadata makes the resources more accessible to other researchers. Ideally, when some future linguist looks at my recordings, they’ll be able to reconstruct the research context and methodology. This means they can properly interpret what the data means and how it should be analyzed.

In my practice, I record metadata in three ways. First, at the start of each session, I record relevant information in my field notebook. My metadata for my first research session this summer looks like this:

On page one of a notebook, there's the following text: 'May HP (Researcher), Moises GG (Speaker), June 1, 2019, SJTZ, San Jerónimo Tlacochahuaya, Archive + CC = YES, Zoom H5 w/ XYH5 mic + Shure headset, SD005, Multi, Folder01, Zoom01, 190601-200918'. Below the metadata you can see the first notes: the word 'dig' in a box, with the Zapotec 'gal rga'an' and the note '(infinitive/nomlz)'.

May Helena Plumb, NB005, page 1

There’s lots of information here! First I’ve noted the two people involved in this research session: me (MayHP) and Moisés García Guzmán. At the end of the notebook, I include a list of name abbreviations and brief biographical sketches. I also note the date and the language being documented. In this case, the language is San Jerónimo Tlacochahuaya Zapotec, which I abbreviate “SJTZ” (again, there’s a list of language abbreviations at the back of the book). I also include the town where we were working, and I note whether Moisés gave permission for me to archive this material publicly. I recorded this session, so I also note information about the recording, including the audio recorder and microphones I used, and the SD card where the file was recorded. There’s more information about the recording equipment (you guessed it) at the back of the notebook.
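If you wanted to keep the same session header in a machine-readable form alongside the notebook, one possible shape is sketched below in Python. This is only an illustration: the class name `SessionMetadata` and all of its field names are hypothetical, and the values are simply copied from the notebook page described above.

```python
from dataclasses import dataclass, asdict
from datetime import date
from typing import Tuple


@dataclass(frozen=True)
class SessionMetadata:
    """One notebook-style session header (field names are hypothetical)."""
    researcher: str               # name abbreviation, expanded at the back of the notebook
    speaker: str                  # name abbreviation of the speaker
    session_date: date
    language_code: str            # e.g. "SJTZ"
    location: str                 # town where the session took place
    archive_permission: bool      # "Archive + CC = YES" in the notebook
    recorder: str
    microphones: Tuple[str, ...]
    sd_card: str
    timestamp: str                # recording timestamp, e.g. "190601-200918"


# Values taken from the notebook page described above.
session = SessionMetadata(
    researcher="MayHP",
    speaker="MoisesGG",
    session_date=date(2019, 6, 1),
    language_code="SJTZ",
    location="San Jerónimo Tlacochahuaya",
    archive_permission=True,
    recorder="Zoom H5",
    microphones=("XYH5", "Shure headset"),
    sd_card="SD005",
    timestamp="190601-200918",
)

# asdict() turns the record into a plain dictionary, e.g. for a spreadsheet row.
record = asdict(session)
```

A frozen dataclass is used here so that a session header cannot be changed by accident once it has been entered.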

The number string at the end of the metadata above, “190601-200918”, is the timestamp of the recording. I use that, in conjunction with the name of the speaker, as a way to uniquely name each audio file. When I transfer the files to my computer and organize them, I name them all with a common system to make them easy to search and identify. Each file is named with a language code, the timestamp, the name of the speaker, and a short description of the file contents.

A screenshot of files named with the schema defined above, for example 'SJTZ_190602-181041_MoisesGG_verb_paradigms_tam_Tr1.wav'

Recordings in my "SJTZ Verbs" folder. I discuss some of this data in my post about Tlacochahuaya Zapotec "tense".
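The naming schema described above (language code, timestamp, speaker, short description) could be sketched in Python roughly as follows. The function names `build_filename` and `parse_filename` are made up for this example, and reading the timestamp as YYMMDD-HHMMSS is an assumption based on the session date and the example file names, not a documented format.

```python
from datetime import datetime

# Assumed timestamp layout: two-digit year, month, day, then hour, minute, second.
TIMESTAMP_FORMAT = "%y%m%d-%H%M%S"


def build_filename(language_code, recorded_at, speaker, description, extension="wav"):
    """Join the four parts of the schema with underscores."""
    timestamp = recorded_at.strftime(TIMESTAMP_FORMAT)
    # Replace whitespace in the description so the name has no spaces.
    slug = "_".join(description.split())
    return f"{language_code}_{timestamp}_{speaker}_{slug}.{extension}"


def parse_filename(filename):
    """Split a file name built by build_filename back into its parts."""
    stem, _, extension = filename.rpartition(".")
    # Split only three times: the description itself may contain underscores.
    language_code, timestamp, speaker, description = stem.split("_", 3)
    return {
        "language_code": language_code,
        "recorded_at": datetime.strptime(timestamp, TIMESTAMP_FORMAT),
        "speaker": speaker,
        "description": description,
        "extension": extension,
    }


name = build_filename(
    "SJTZ", datetime(2019, 6, 2, 18, 10, 41), "MoisesGG", "verb paradigms tam Tr1"
)
parts = parse_filename(name)
```

Because the language code, timestamp, and speaker abbreviation never contain underscores in this sketch, the name can be split back into its parts later, which is what makes the files easy to search and sort.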

Finally, I also keep a spreadsheet that catalogs all my files, and the metadata that is required to archive them. I archive my data at the Archive of the Indigenous Languages of Latin America (AILLA), and they have a specific set of metadata fields you’re required to fill out, such as the length of the audio file and the role (e.g. researcher, speaker, or translator) of each person involved. My files in AILLA are also organized into folders (called “Resources”) for different grammatical or cultural topics.

The spreadsheet headings include 'Entered by', 'File name', 'Media Type', 'Resource Name', and 'Media language 1'. The media type for each file is 'sound_recording', and the resource name is 'SJTZ Verbs'. The first media language is 'Zapotec, Western Tlacolula Valley'.

A small section of my very large metadata spreadsheet. You can read more about my archive collection here.
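For anyone who would rather generate catalog rows than type them, here is a minimal Python sketch that writes a row using only the five headings visible in the screenshot. It is not AILLA's full set of required fields, and the helper names `catalog_row` and `write_catalog`, as well as the 'Entered by' value, are assumptions for illustration.

```python
import csv
import io

# Only the headings visible in the screenshot; the real spreadsheet has more.
FIELDS = ["Entered by", "File name", "Media Type", "Resource Name", "Media language 1"]


def catalog_row(file_name, entered_by, resource_name, language,
                media_type="sound_recording"):
    """Build one catalog row as a dictionary keyed by spreadsheet heading."""
    return {
        "Entered by": entered_by,
        "File name": file_name,
        "Media Type": media_type,
        "Resource Name": resource_name,
        "Media language 1": language,
    }


def write_catalog(rows, handle):
    """Write the header and rows as CSV to any open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Write to an in-memory buffer here; a real catalog would use an open file.
buffer = io.StringIO()
write_catalog(
    [
        catalog_row(
            file_name="SJTZ_190602-181041_MoisesGG_verb_paradigms_tam_Tr1.wav",
            entered_by="MayHP",  # hypothetical value
            resource_name="SJTZ Verbs",
            language="Zapotec, Western Tlacolula Valley",
        )
    ],
    buffer,
)
```

The language name contains a comma, so it is worth using a real CSV writer rather than joining strings by hand: the `csv` module quotes that field automatically.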

It’s a lot of information to record at the end of each day, but it’s worth it to do it now! Last year I didn’t fully settle on a file-naming schema/organizational strategy until after I finished my fieldwork, and it caused a huge delay in archiving my data. Now that I have more practice, it all goes much more smoothly.

Want to learn more? Metadata is a very big topic, and people have lots of different ideas of what good metadata practice looks like. Here are a few starting points: