Hand Collection of Data – Unavoidable at Times – Also Update on Tagging

I received a really nice compliment today – one of the nicest I have received in a while. I’m always uncomfortable asking people to shill for directEDGAR, so I won’t ask this faculty member if I can use their name. However, the following comment came from an individual who began using directEDGAR as a PhD student and is now in their second year as a faculty member at another client school.

“This is very helpful. Thanks Burch! You always know how to reduce the time I need to spend hand-collecting data, which I sincerely appreciate!”

I’m not sharing this to brag. This faculty member had a data collection problem and was wondering whether the best option was to send a research assistant to the EDGAR website to hand collect the data they needed. This problem is exactly the reason I developed directEDGAR in the first place. We were trying to collect audit fee data for a couple of papers. We didn’t have any other source for that data, so it had to be hand collected. (Yes, there was life before AA.) Today we are very focused on adding more automation features, but the foundation of data collection has to start with search to find the data, followed by an evaluation of the costs and benefits of hand collecting it versus applying some of our more sophisticated tools or using Python to capture it.

Whenever we try to capture data we always start with a careful review of 50–100 filings to learn about the idiosyncratic ways the combination of people and the tools they use to create a filing impacts the disclosure. If you were to study audit fees carefully you would find enough variability to make you pull some hair out. Some registrants disclose the fees in a block of text; others disclose them in a table where the column headings are the periods the fees apply to. The last relatively common form is a table with the fee categories as the column headings and the time periods as the row labels. Then you have those registrants who disclose fees using one pattern for a number of years and then switch to another. I almost forgot those cases where the fees are reported in one of the table forms but the DEF 14A or 10-K contains an image of the table rather than an actual table. There are other forms of disclosure as well.
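When the text-block form is worded consistently enough, a quick Python pass can pick off some of those cases before anything goes to hand collection. Here is a minimal sketch of the idea – not one of our tools, just an illustration – and the pattern only covers wording variants you have actually reviewed; anything it misses falls through to hand collection:

import re

# Sketch only: pull fee amounts out of the text-block form, where each
# category is disclosed in its own sentence or paragraph, e.g.
# "Audit fees for fiscal 2012 were $412,000."
FEE_PATTERN = re.compile(
    r"(audit[- ]related fees|audit fees|tax fees|all other fees)"  # category
    r"[^$]{0,200}?"                                                # filler words
    r"\$\s?([\d,]+)",                                              # dollar amount
    re.IGNORECASE | re.DOTALL,
)

def extract_text_fees(filing_text):
    """Return {fee category: amount} for mentions found in a text block."""
    fees = {}
    for category, amount in FEE_PATTERN.findall(filing_text):
        fees.setdefault(category.lower(), int(amount.replace(",", "")))
    return fees

print(extract_text_fees("Audit fees were $412,000 and tax fees were $38,500."))
# {'audit fees': 412000, 'tax fees': 38500}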

Once we learn about the disclosure forms we don’t worry too much about who uses which form. Instead we consider each disclosure form and decide the best strategy for it. So for example – if we were collecting audit fees we would use the TableExtraction & Normalization tools to extract all of the tables we could, using known variants of the words/phrases likely to appear in the tables (audit fees; audit-related; tax fees). We would then note the CIKs of the firms in our sample that were still missing this data.
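That last bookkeeping step is simple set arithmetic. A sketch, with hypothetical file names – one CSV holding the CIKs of the full sample and one holding the CIKs the table extraction actually covered:

import csv

def load_ciks(path, column="cik"):
    # Read a CSV and return the set of CIKs found in the named column.
    with open(path, newline="") as fh:
        return {int(row[column]) for row in csv.DictReader(fh)}

sample_ciks = load_ciks("sample_firms.csv")             # firms we need data for
extracted_ciks = load_ciks("extracted_fee_tables.csv")  # firms the tables covered

missing = sorted(sample_ciks - extracted_ciks)
print(len(missing), "firms need hand collection")

with open("hand_collection_ciks.txt", "w") as fh:
    fh.write("\n".join(str(cik) for cik in missing))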

So now we are getting to the hint I provided the person who offered the compliment. They asked if there was a better way than sending a research assistant to EDGAR to collect an item of data that was not going to be easy to capture using one of our tools – this item just has to be hand collected. If I have my list of CIKs and I need to collect data that is not disclosed in a form that allows the use of more automated tools, I can still speed up the hand collection by a factor of at least 10 compared to visiting EDGAR.

A significant amount of the time spent hand collecting from EDGAR goes to entering the CIK/name into the front search box, clicking through the list of filings to locate the correct filing, and then opening the filing to find the disclosure. Another significant amount of time is required to transcribe the relevant metadata into Excel. These parts of the process have to be handled manually when you visit EDGAR. With directEDGAR these steps are handled much more efficiently so you can focus on the data. Going back to the audit fee data – we have to hand collect those disclosures that are in text or in an image. So once we have used our tools to the extent practical we shift to hand collection. But there are a couple of tricks.

First we run a search, filtering on the CIK list of our hand collection sample. This loads the documents we want into the application. For audit fees we would typically just use the phrase audit related. At this stage we have already saved between 30 and 60 seconds per filing because we do not have to type each name into the EDGAR search pane and click through the filing lists. (As I was writing this I timed myself just locating and opening filings from three different registrants and then finding the area in the documents where the audit fees were disclosed. I averaged 45 seconds per filing.)

[Screenshot: CIK basic search for audit related]

Next we need to create a data collection worksheet. The SummaryExtraction feature provides a starting point. I know that in most cases the audit fees reported in text are broken out by fee category in different paragraphs, so I want to collect the fee data more or less in the manner it is disclosed – one row for each type of fee. Since the most common division of fees is AUDIT/AUDIT RELATED/TAX/OTHER/TOTAL, I am going to duplicate the SummaryExtraction block four times (I want five copies), add the column headings I need, and add in the fee categories (a scripted version of this step follows the screenshot below). Once this is done I am ready to hand collect the data.

[Screenshot: Worksheet for Hand Collection]
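If you would rather script the duplication than copy and paste in Excel, here is a minimal sketch using pandas, assuming the SummaryExtraction output has been exported to a CSV with one metadata row per filing (the file and column names are hypothetical):

import pandas as pd

summary = pd.read_csv("summary_extraction.csv")  # hypothetical export: one row per filing

FEE_CATEGORIES = ["AUDIT", "AUDIT RELATED", "TAX", "OTHER", "TOTAL"]

# Repeat each filing's metadata row once per fee category (five copies),
# label the copies, and add an empty column for the hand-collected amount.
worksheet = summary.loc[summary.index.repeat(len(FEE_CATEGORIES))].copy()
worksheet["fee_category"] = FEE_CATEGORIES * len(summary)
worksheet["fee_amount"] = ""

worksheet.to_csv("hand_collection_worksheet.csv", index=False)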

The observations in this worksheet are ordered the same way as the documents in the search application. At this point I have a job with a workflow that makes it easy to delegate. If I don’t have someone I can delegate this task to, at least a lot of the tedious steps have been removed. I still have to enter numbers and click between filings in the application, but no unnecessary steps are involved.

Suppose you are laser focused and more efficient than I am. Assume you can complete the steps required to find the data in a filing on EDGAR averaging 30 seconds per filing. Using our platform will still save you about 125 minutes on the small sample of audit fee disclosures that have to be hand collected – 30 seconds saved per filing across roughly 250 filings.

The punch line is this – don’t get too frustrated when you have to hand collect data. Sometimes it is unavoidable. If you cannot come up with a reasonable workflow, send us an email and we will help you identify the best strategy.


Tagging update. We have processed a large amount of new metadata to add to our 10-K filings. Of course it was more complicated than I hoped. We finally have code that is resilient enough for us to trust (3 exceptions out of 45,646 filings examined). And we have a special piece we are adding as well (more on that later). I am getting ready to update the metadata in our current 10-K filings back to January 2011. Because of our architecture we can complete this step without any downtime to our service. If I can finish this over the weekend, we will replace the existing indexes with new indexes that include the metadata the weekend of the 17th. If that happens as anticipated I will post a warning here, because the 10-K indexes will be unavailable while they are updated.

There is going to be one slight complication during the update period. When you run a search you are not searching the documents – you are searching an index of the words in the documents as of the last time the index was created. When you extract documents you are extracting each document as it exists on disk at the time you do the extraction. During the update process the existing indexes will be based on the old tagged documents, so if you run a search and then do an extraction while this is going on, the documents you extract will have the new metadata tags embedded in the HTML. This should not be significant, but there will be differences.

As a technical note, we embed the metadata tags between the closing body tag (</body>) and the closing html tag (</html>). They are of the form:

</body>
<meta name="DOCTYPE" content="10K">
<meta name="SICCODE" content="3841">
<meta name="CNAME" content="RETRACTABLE TECHNOLOGIES INC">
<meta name="FYEND" content="1231">
</html>

In addition to injecting the new metadata, we are going to create a new table object for direct access that has all of the metadata we can collect about our 10-K filings and their filers. We have to balance a number of issues when deciding whether to add additional fields into the 10-K filings themselves. If there is a field you want for your research and we don’t have it available in the filings, we can at least make it available through our PreprocessedExtraction feature. More on that as we move forward.


On a personal note – the reason for the delay is that Patrick and I are going to be visiting colleges on the East Coast in a week. Man, he has grown up fast!
