Friday, 2 August 2013

Meta-what?!
 
For my first blog post (you do have to be cool eventually!), I have decided to write about a topic that is dear to my heart. It was dear to me back when I worked in software development (particularly in database administration), and it still is, albeit in a different context.

So today we talk about metadata and why the people who produce data should care more about it.
 
Some Context...
I am currently a PhD candidate in Information Systems (IS). One of the things I often have to do is read and review literature on different topics. This literature comes from a variety of sources and is accessed via a variety of online journal portals (EBSCOHost, ABI, ACM, IEEE Xplore, etc.). As I browse through journals and run searches, I gather a number of articles, which I download in PDF format and try to remember to add to my EndNote (or Zotero, your choice) library. After a few days of research I can end up with a hundred or so PDFs saved on my computer, which I then have to read, annotate, and so on.
 
Couple of issues here:
  1. These portals are built like 1990s online shopping websites. What I mean by that is that navigation is clumsy at best and downright frustrating at times (e.g., expired proxy sessions, slow downloads)
  2. Often saving the document and its citation info are two separate tasks
  3. These documents contain no useful metadata (see below for why that matters)
  4. These portals have Application Programming Interfaces (APIs) available, but these are not public and are reserved for institutions etc.
  5. When I search for articles, I am not in the mindset of organizing everything neatly already. I am wading through tons of (virtual) papers and cannot be bothered to save everything, add it to EndNote etc.
So what is metadata?
Metadata is "data about data". In other words, it gives some information about the actual data that users will, well, use. It may seem trivial and rather unnecessary, but from the perspective of an outsider, metadata is not only cool, it can be crucial. It can be simple and fixed, such as specific header information you can store in a PDF (e.g., title, author, copyright info). And it can be more complicated, such as extensible metadata on database schemas. This is where I want to draw a parallel with my previous employment. Metadata is not documentation, but it provides important clues about what the data can help you achieve. The important part is that it is packaged alongside the data itself but resides in a "logical" space that is separate from it.

What do I do with it?
Well, currently I am sitting on a pile of about 150 PDF files, named using the following pattern: author1_author2_year.pdf. This is useful and, in a sense, somewhat akin to metadata. For instance, I have macros that generate template Microsoft Word documents containing a neat table with a hyperlink to each PDF, where I can enter review info for each file. This way I concentrate on the task of reviewing the literature and not on formatting documents to do so. This is because as a programmer (or ex-programmer, I'll let you decide on the technicalities), I am inherently lazy regarding repetitive tasks. I like to automate these as much as possible. Plus it makes for a healthy distraction from reading papers :-).
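To show why a naming pattern like that is "somewhat akin to metadata", here is a minimal Python sketch that splits such a file name back into authors and year. The regular expression, the function name and the sample file names are assumptions made up for illustration; this is not the macro code mentioned above.

```python
import re
from pathlib import Path

# Assumed pattern: one or more author names, then a 4-digit year,
# all separated by underscores, e.g. "smith_jones_2010.pdf".
FILENAME_PATTERN = re.compile(r"^(?P<authors>.+)_(?P<year>\d{4})$")


def parse_pdf_name(filename):
    """Return {'authors': [...], 'year': int}, or None if the name does not fit the pattern."""
    match = FILENAME_PATTERN.match(Path(filename).stem)
    if match is None:
        return None
    return {
        "authors": match.group("authors").split("_"),
        "year": int(match.group("year")),
    }
```

A name that does not follow the convention simply comes back as None, which is exactly the weakness of keeping this information in the file name instead of inside the document.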

My issue here is that PDFs can store useful metadata, but when nobody fills it in, well, that capability is useless. If there were actual metadata and the APIs were accessible (ok, I could even do without that one), I could program something automatic like the following:
  1. Read PDF metadata
  2. Look up reference online using provider API
  3. Download citation info
  4. Add it to my EndNote library
  5. Cross-check that all PDFs are in my EndNote library and ready to be reviewed in my Word documents
Now that would be neat. Unfortunately, the lack of metadata and of accessible APIs prevents me from doing so. So I have to painfully open each PDF and do all of that by hand. I'll find a way to automate part of it, but it will be clumsy.
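For the curious, the five steps above could be wired together roughly like this. It is only a sketch under loud assumptions: read_metadata and lookup_citation are hypothetical callables you would have to supply (the first would wrap whatever PDF library you use, the second a provider API that I do not have access to), and the EndNote library is stood in for by a plain dictionary.

```python
from pathlib import Path


def build_review_queue(pdf_paths, read_metadata, lookup_citation, library):
    """Run the five steps for each PDF and return the files that still need manual work.

    read_metadata(path)    -> dict with a 'title' key, or None   (hypothetical)
    lookup_citation(title) -> dict with a 'key' entry, or None   (hypothetical)
    library                -> dict of citation key -> citation, updated in place
    """
    unresolved = []
    for path in pdf_paths:
        metadata = read_metadata(path)                          # step 1
        title = (metadata or {}).get("title")
        citation = lookup_citation(title) if title else None    # steps 2 and 3
        if citation is None:
            # No usable metadata or no match online: back to doing it by hand.
            unresolved.append(Path(path).name)
            continue
        library.setdefault(citation["key"], citation)           # step 4
    return unresolved                                            # step 5
```

Everything hinges on step 1 returning a title. With empty metadata every file lands in the unresolved list, which is pretty much my situation today.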

So, what can we learn from this?
Well, metadata can be very useful. Back when I worked in software development, I used it a lot on Microsoft SQL Server schemas (using their Extended Properties) to create a sort of mini-descriptor on all tables that could be useful when automating programming tasks (e.g., automatic cleanup of archiving tables). And this is where I go back to something I read on other websites about "services", "APIs" and so on. If you are going to publish APIs for clients, students, or whichever outsiders need to interact with your services, metadata is not just neat, I think it is pretty crucial. Documentation will only get you so far. Metadata can be used as a sort of "online documentation" that programmers can actually use in their code when interacting with your data. Regular documentation cannot do that. And it's not just good for them, it can be good for you too. For example, you may reduce resource contention, because the metadata you provide lets consumers discard irrelevant information early and avoid requesting extra, unnecessary data.
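To make the mini-descriptor idea a bit more concrete, here is a small Python sketch of a cleanup job that decides what to do purely from metadata. The table names, the "role" and "retention_days" properties and the function are all hypothetical; in SQL Server the descriptors would live in Extended Properties rather than in a Python dictionary.

```python
from datetime import date, timedelta

# Hypothetical descriptors, standing in for properties stored alongside each table.
TABLE_METADATA = {
    "orders": {"role": "live"},
    "orders_archive": {"role": "archive", "retention_days": 365},
    "audit_archive": {"role": "archive", "retention_days": 90},
}


def cleanup_plan(metadata: dict, today: date) -> dict:
    """Return {table: cutoff_date} for every table tagged as an archive."""
    plan = {}
    for table, props in metadata.items():
        if props.get("role") != "archive":
            continue  # the metadata says this table is not ours to purge
        plan[table] = today - timedelta(days=props["retention_days"])
    return plan
```

The point of the design is that adding a new archive table requires no change to the job itself: you tag the table and the automation picks it up.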

Now if you'll excuse me, I have to get back to these PDFs...