Using MTurk with directEDGAR

I was helping a PhD student use directEDGAR to snip some tables from a filing.  Because the data is for his dissertation and he is just getting started on it, I am going to be discreet and not go into details about the data he is trying to collect.

While the table snipping went well, we observed a significant problem with the normalization.  Specifically, for the approximately 20,000 tables there were 23,000 unique column headings.  The reason is that the disclosure is not highly regulated, so most companies choose their own labels for the columns as well as how they structure the data.
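To give a sense of the scale of that problem, here is a minimal sketch of how you might tally unique column headings across extracted tables.  The heading values shown are hypothetical; only the counting idea is the point:

```python
from collections import Counter

# Hypothetical: each extracted table is represented by its list of column headings.
tables = [
    ["Fiscal Year", "Total Awards", "Vested"],
    ["Year", "Awards Granted", "Shares Vested"],
    ["FY", "Grants", "Vested Shares"],
    # ... roughly 20,000 more tables in the real data
]

heading_counts = Counter(h for table in tables for h in table)
print(f"{len(heading_counts)} unique headings across {len(tables)} tables")

# Mapping each variant to a canonical label by hand is the monumental
# task that pushed the project toward crowd-sourcing instead.
```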

The PhD student actually needed at most four values from each table, so we went back and forth about alternatives to normalizing the data; working with that many column headings was going to be a monumental task.  Ultimately he decided to use Amazon's Mechanical Turk (MTurk) platform to crowd-source the collection of the data.  Because I have been intrigued with this service since I first heard about it, we volunteered to help him.  I wanted the experience of working with a real use case to discover how flexible the framework was.

While it was a significant time investment (maybe 40-60 hours), I came away with a real appreciation of how our tools can facilitate and simplify data collection when paired with MTurk, and a strong appreciation for their platform itself.

We needed to display the snips from the SEC filings to the workers and give them a way to enter data from the displayed table.  To maximize worker efficiency, this all had to happen on one page.  Once we understood this, it was easy enough to take one of Amazon's stock template pages and make the modifications we needed to suit his data collection needs.

Here is a mockup of the data collection page we worked on (note, the table displayed is not the data he is collecting).

[Screenshot: mockup of the MTurk data collection page]

It was really neat to set this up because their platform simplifies the management of the data collection process.  The values year1 and year2 are variables determined by the source table.  The "Data Value #" fields are the labels for the data he was looking for.  The worker stays on one page, finds the data values, enters them, and hits submit.  On submission, Amazon's back end processes the data for him and stores it with indicators for the source table.  In our tests it took each of us less than a minute to identify and transcribe the data and move to the next table.
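For readers curious about the plumbing, here is a minimal sketch of creating one such task with boto3, Python's AWS SDK.  The layout ID, the parameter names (year1, year2, snip_url), the reward, and the URLs are all assumptions for illustration; the actual HIT was built from one of Amazon's stock templates as described above:

```python
import boto3

# Connect to the MTurk sandbox for testing (drop endpoint_url for production).
mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# Hypothetical layout parameters: the years shown in the source table and
# the URL of the snipped table image.  HITLayoutId refers to a template
# built in the MTurk requester UI, analogous to the mockup above.
hit = mturk.create_hit(
    Title="Transcribe four values from an SEC filing table",
    Description="Read the displayed table and enter the requested values.",
    Reward="0.10",
    MaxAssignments=2,                 # two workers per table for consistency checks
    AssignmentDurationInSeconds=600,
    LifetimeInSeconds=86400,
    HITLayoutId="3XXXXXXXXXXXXXXXXXXXXXXXXXXXXX",   # placeholder layout ID
    HITLayoutParameters=[
        {"Name": "year1", "Value": "2012"},
        {"Name": "year2", "Value": "2013"},
        {"Name": "snip_url", "Value": "https://example.com/snips/0001.png"},
    ],
    RequesterAnnotation="table-0001",  # ties the results back to the source table
)
print("Created HIT", hit["HIT"]["HITId"])
```

One HIT per snipped table, each carrying its source-table annotation, is what lets the back end hand the results back already keyed to the original filing.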

You can see I obfuscated the instructions.  While they look busy in this view, they collapse: once a worker has gotten comfortable with the data collection process, they can close them and have a much cleaner work area.  We also hosted a web page for him with more than 20 marked-up examples of the different forms of the data and detailed explanations of what he is looking for.  The detailed instructions link to that web page.

For quality control he plans to have a significant number of the tables done by two different workers so he can check the transcriptions for consistency.
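As a sketch of what that consistency check might look like, the submitted assignments for each HIT can be pulled back with boto3 and compared field by field.  The answer field names come from whatever the HIT layout defines, so treat this as an illustration rather than his actual audit code:

```python
import boto3
import xml.etree.ElementTree as ET

mturk = boto3.client("mturk", region_name="us-east-1")

# Namespace used by MTurk's QuestionFormAnswers XML.
NS = {"t": "http://mechanicalturk.amazonaws.com/"
           "AWSMechanicalTurkDataSchemas/2005-10-01/QuestionFormAnswers.xsd"}

def answers(assignment):
    """Parse one assignment's QuestionFormAnswers XML into a {field: value} dict."""
    root = ET.fromstring(assignment["Answer"])
    return {
        a.find("t:QuestionIdentifier", NS).text: a.find("t:FreeText", NS).text
        for a in root.findall("t:Answer", NS)
    }

def consistent(hit_id):
    """True when both workers entered identical values for every field."""
    resp = mturk.list_assignments_for_hit(
        HITId=hit_id, AssignmentStatuses=["Submitted", "Approved"]
    )
    submitted = [answers(a) for a in resp["Assignments"]]
    return len(submitted) == 2 and submitted[0] == submitted[1]

# HITs whose two transcriptions disagree get flagged for manual review.
```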

This would have been much more difficult to crowd-source without his ability to isolate the specific table for review.  If he had instead passed a whole document and required the workers to find the table, the cost would have been significantly higher and, I suspect, the results much harder to audit.  Further, crowd-sourcing rather than sending the work to a data service will probably get him his results much faster.

I see a lot of possibilities for helping some of our customers with this approach.  They can isolate snips or blocks of text and embed them directly into a data collection page with these tools.