Another Example of Why CIK Filtering is Critical

I was engaged in an email exchange with a potential new client who specifically asked about extracting the text content from ALL 10-K filings. The upper-case emphasis is mine – ALL is a trigger word for me (as my son would say) because I then feel obligated to explain why ALL is not always the best answer.

To make sure you have the context – our platform provides direct access to the raw text of the documents we have indexed (the markup is left behind). More and more researchers are using this raw text to test various hypotheses and to train AI systems. So people ask me – how can I get ALL?

The problem with ALL is that many SEC registrants do not have securities whose prices are readily available. So if you get ALL, at some point the filings you can’t match to some measure of value are toast.

I can understand if folks think, well – it is easy enough to push that button, worry about filtering later, and discard what I don’t need. But it is actually very costly to collect more data than you need. Your time is the most expensive part of any research project.

The prospect wanted specifics – how much time will it take to download ALL 10-K filings? To answer this question I logged into an instance and ran a search to identify all 10-K (and EXHIBIT-13) filings made from 1/1/2016 to 12/31/2017. There were a bit more than 17,000 10-Ks in this window. I set a timer, pushed the magic button, and two hours and nine minutes later I had all 17,000+ raw text files ready to save to my local computer. That is not horrible time-wise – it just works – but it took longer than it needed to, because almost half of those filings will not match to other data if you are testing a value/security-price hypothesis. In my analysis I told our prospect that the system delivered on average 133 filings per minute.

However, since I was triggered, I ran a second test. This time I extracted only the 10-K filings made in 2018 – a bit more than 8,200 filings, so roughly half the size of the first group. How much time do you think it took to extract those 10-Ks? In my test it took 32 minutes – a rate of about 256 files per minute! Almost twice as fast.

Why this significant rate difference? A small part of it might or might not be due to butterflies flapping their wings in my backyard the second time I ran it. The biggest factor driving the timing difference is a complicated but cool memory issue in Windows. (I’m going to be nerdy here.) Like most applications, we use OS system hooks to do the tasks we want to accomplish. Windows manages memory and all that cool stuff so we can focus on our goals. The cool thing is that Windows retains memory references to everything that is done until – usually – you close the application. Finally, the punchline: when Windows runs out of RAM it starts using disk as memory – it writes pages of memory to disk and keeps a nifty table to figure out where things are. The problem is that once you overrun RAM and disk paging comes into play, there is a substantial slowdown in the work you are doing. Our instance disks are fast, but they are much slower than RAM.
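If you want to watch this pressure build during a long batch job, here is a minimal sketch of the idea – my illustration, not part of our application – using the psutil package (the 90% threshold is an arbitrary assumption):

import psutil

def memory_pressure_report():
    # Report RAM usage so you can spot when Windows is likely to start paging.
    vm = psutil.virtual_memory()
    print(f"RAM in use: {vm.percent:.0f}% "
          f"({vm.used / 2**30:.1f} GiB of {vm.total / 2**30:.1f} GiB)")
    if vm.percent > 90:  # arbitrary threshold - tune for your instance
        print("Warning: near RAM capacity - paging to disk will slow the work.")

# Call this periodically inside a long extraction loop, e.g. every 1,000 files.
memory_pressure_report()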

My general rule of thumb is that once you have manipulated about 10,000 10-K filings (which is a lot), the manipulation of the next one is considerably slower than the manipulation of any earlier one. This is a heuristic – there are other factors involved – but I have used our application a lot. In the first experiment, when I extracted the 17.2K filings, the first 8,600 took about 50 minutes; the second 8,600 took 69 minutes. I told you this was cool – the second group was roughly 38% slower. One of the other factors in play is that in the first case the application had 17K +/- documents to keep track of, while in the second case it was tracking only 8.2K; in the first experiment, less memory was available for document extraction from the very beginning.

So by CIK filtering you reduce your total workload (and the time you need to spend watching the process) substantially. Yes, I did compare two years to one year. But remember – I suspect you can’t reliably match half of the 10-K filings from any one year to security-price data. I suspect a filter of total assets greater than zero and the existence of common stock for each year of your analysis would substantially reduce the filings you extract.
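To make the idea concrete, here is a hypothetical sketch of filtering a filing index against a CIK list before extracting anything – the file layout and column name are my assumptions, not our application’s API:

import csv

def load_cik_sample(path):
    # Read a one-column CSV of CIKs, e.g. those that matched to price data.
    with open(path, newline="") as f:
        return {row[0].lstrip("0") for row in csv.reader(f) if row}

def filter_filings(index_rows, ciks):
    # Keep only filings whose CIK is in the research sample.
    return [row for row in index_rows if row["cik"].lstrip("0") in ciks]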

Pushing the button is easy – waiting for data that you will not use can be expensive!

What Does ‘Since’ Mean in the Audit Report?

I thought I would be wrapping up our injection of new metadata into our 10-K filings today. However, I ran into an interesting snag. I discovered that despite an auditor reporting that they have audited the financial statements since some date, their first audit report might be dated either before or after that date.

Here is an example – Core Laboratories N.V.’s current auditor is KPMG. KPMG reports in the FYE 12/31/2020 10-K that they “have served as the Company’s auditor since 2015.” The same phrase is repeated in the 10-Ks for FYE 12/31/2018 and 12/31/2019.

Mandatory tenure reporting began in 2018, so earlier 10-Ks have no SINCE declaration or statement.

When I read the 10-K, I presumed that KPMG began auditing Core Laboratories’ financial statements in 2015 and that they would have been the signatories of the 12/31/2015 audit report in the 10-K released in early 2016.

This was not the case. The audit report in the 10-K released in 2016 for the 12/31/2015 FYE was signed by PricewaterhouseCoopers LLP. I then wondered if KPMG meant they had re-audited the 12/31/2015 FYE financial statements after becoming Core Laboratories’ auditor. This was also not the case – the first audit report from KPMG explicitly reports that their audit was for the financial statements for the FYE 12/31/2016.

This was confusing to me, so I went to find the 8-K that reported the change of auditor. (To find all 8-Ks reporting auditor changes, use the search (ITEM_4.01 contains(YES)) and (DOCTYPE contains(8K)).) The 8-K is interesting and helped me understand why KPMG reports that they have been the auditor since 2015. Here is a link to the 8-K: Core Laboratories AUCHANGE 8K.

Core Laboratories dismissed PwC on 4/29/2015. However, the dismissal was effective upon the issuance of the reports (financial and ICFR audits) for 12/31/2015. KPMG was appointed (and an engagement letter was signed) on 4/29/2015, with their appointment to be effective 1/1/2016.

I discovered this while working on some final touches to impute SINCE values, hoping (actually assuming) that we could rely on the SINCE value reported from 2018 to the present to populate the prior SINCE fields. I was getting ready to punch the button to approve this logic, but I decided to test it first. Basically, the test was to establish whether the auditors matched the SINCE value – was KPMG the auditor of Core Laboratories in 2015? I would say they were not.
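For what it is worth, the test itself is simple to script. Here is a hypothetical pandas sketch of the consistency check described above – the table layout (one row per CIK/fiscal year, with the signing auditor and the claimed SINCE value) is my assumption:

import pandas as pd

def flag_since_mismatches(reports):
    # reports columns: cik, fiscal_year, auditor, auditor_since
    # Every (cik, year, auditor) combination that actually signed a report.
    signed = set(zip(reports["cik"], reports["fiscal_year"], reports["auditor"]))
    claims = reports.dropna(subset=["auditor_since"])
    # Flag claims where the auditor did not sign a report for the SINCE year
    # (e.g. KPMG claiming 2015 at Core Laboratories, first report FYE 2016).
    mask = [
        (c, int(y), a) not in signed
        for c, y, a in zip(claims["cik"], claims["auditor_since"], claims["auditor"])
    ]
    return claims[mask]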

So now we have to sort this out and make sure we have the right tests to validate the declarations made in the 10-K. Our intention is for the SINCE value to represent the first FY for which the auditor signed the audit report in the 10-K. Despite KPMG’s declaration that they have audited Core Laboratories since 2015, we will change that value to 2016, the year of the first audit report they signed.

New Tagging Examples

While I am slightly behind the schedule I shared in my last post, we are making progress. I have been hesitant to share exactly what this new tagging scheme would look like until now. Below are two examples of the new metadata we will be injecting into the 10-Ks and EX-13s. At present I do not plan to alter the metadata we inject into the other exhibits except to add the EDGARLINK value. My initial thought is that you can access the metadata associated with the 10-K if you need some value for data collected from an exhibit.

Below is the metadata we will add to Slack’s 10-K that was filed on 3/12/2020. (We use Slack internally and I love it).

<meta name="SIC" content="7385">
<meta name="FYE" content="0131">
<meta name="CONAME" content="SLACK TECHNOLOGIES, INC.">
<meta name="ACCEPTANCETIME" content="20200312163209">
<meta name="ZIPCODE" content="94105">
<meta name="ENTITYADDRESSCITYORTOWN" content="SAN FRANCISCO">
<meta name="ENTITYADDRESSSTATEORPROVINCE" content="CA">
<meta name="ENTITYSMALLBUSINESS" content="FALSE">
<meta name="ENTITYEMERGINGGROWTHCOMPANY" content="TRUE">
<meta name="ENTITYSHELLCOMPANY" content="FALSE">
<meta name="ENTITYPUBLICFLOAT" content="7200000000">
<meta name="PUBLICFLOATDATE" content="20190731">
<meta name="ENTITYFILERCATEGORY" content="NAF">
<meta name="ENTITYPUBLICSHARESDATE" content="20200229">
<meta name="ENTITYPUBLICSHARESLABEL_1" content="CommonClassA">
<meta name="ENTITYPUBLICSHARESCOUNT_1" content="362046257">
<meta name="ENTITYPUBLICSHARESLABEL_2" content="CommonClassB">
<meta name="ENTITYPUBLICSHARESCOUNT_2" content="194761524">
<meta name="AUDITOR" content="KPMG">
<meta name="AUDITREPORTDATE" content="20200312">
<meta name="AUDITORSINCE" content="2015">
<meta name="AUDITORCITY" content="SAN FRANCISCO">
<meta name="AUDITORSTATE" content="CALIFORNIA">
<meta name="EDGARLINK" content="https://www.sec.gov/Archives/edgar/data/1764925/000176492520000251/a1312010-k.htm">

Below is the metadata we will add to Peloton’s 10-K filed on 9/10/2020. Note that the acceptance time indicates the RDATE for this filing would be 20200911, since it was accepted after 5:30 pm on 9/10. (No, I don’t have a Peloton bike!)

<meta name="SIC" content="3600">
<meta name="FYE" content="0630">
<meta name="CONAME" content="PELOTON INTERACTIVE, INC.">
<meta name="ACCEPTANCETIME" content="20200910180637">
<meta name="ZIPCODE" content="10001">
<meta name="ENTITYADDRESSCITYORTOWN" content="NEW YORK">
<meta name="ENTITYADDRESSSTATEORPROVINCE" content="NY">
<meta name="ENTITYSMALLBUSINESS" content="FALSE">
<meta name="ENTITYEMERGINGGROWTHCOMPANY" content="FALSE">
<meta name="ICFRAUDITORATTESTATIONFLAG" content="FALSE">
<meta name="ENTITYSHELLCOMPANY" content="FALSE">
<meta name="ENTITYPUBLICFLOAT" content="6281462442">
<meta name="PUBLICFLOATDATE" content="20191231">
<meta name="ENTITYFILERCATEGORY" content="NAF">
<meta name="ENTITYPUBLICSHARESDATE" content="20200831">
<meta name="ENTITYPUBLICSHARESLABEL_1" content="CommonClassA">
<meta name="ENTITYPUBLICSHARESCOUNT_1" content="239427396">
<meta name="ENTITYPUBLICSHARESLABEL_2" content="CommonClassB">
<meta name="ENTITYPUBLICSHARESCOUNT_2" content="49261234">
<meta name="AUDITOR" content="ERNST & YOUNG">
<meta name="AUDITREPORTDATE" content="20200910">
<meta name="AUDITORSINCE" content="2017">
<meta name="AUDITORCITY" content="NEW YORK">
<meta name="AUDITORSTATE" content="NEW YORK">
<meta name="EDGARLINK" content="https://www.sec.gov/Archives/edgar/data/1639825/000163982520000122/pton-20200630.htm">

There are two immediate implications of these changes. First, if you do a Summary or Context extraction, these values will be included in the results: the name value will be the column heading and the content value will be the row value. Second, you can filter search results by the content values. Clearly you are not going to want to filter by EDGARLINK, but the ability to filter by ENTITYFILERCATEGORY will help you more efficiently identify registrants subject to particular disclosure requirements.

To identify all of those that have multiple classes of stock we would just add the following to our search (ENTITYPUBLICSHARESLABEL_2 contains(*)). The Fields menu will list all of these fields so you don’t have to memorize the labels we have used.

We were asked to provide the EDGARLINK so you can map/match data you collect from our platform with data collected from other platforms that provide the accession number or a direct link to the filing. The EDGARLINK value can be parsed easily in Excel to give you the accession number.
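If you prefer Python to Excel, here is a minimal sketch; it assumes the EDGARLINK follows the standard EDGAR archive layout (/Archives/edgar/data/<CIK>/<accession-without-dashes>/<file>), which both examples above do:

from urllib.parse import urlparse

def accession_from_edgarlink(link):
    # Path parts: Archives / edgar / data / <CIK> / <accession> / <file>
    parts = urlparse(link).path.strip("/").split("/")
    raw = parts[4]  # e.g. '000176492520000251'
    return f"{raw[:10]}-{raw[10:12]}-{raw[12:]}"

print(accession_from_edgarlink(
    "https://www.sec.gov/Archives/edgar/data/1764925/000176492520000251/a1312010-k.htm"))
# 0001764925-20-000251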

Right now the constraint is AUDITOR – at present we have auditor data back to 2011. We have been improving our collection strategy for this field and hope to accelerate the collection process in the coming months. The special challenge in collecting this value is those cases where the signature is an image file, and we want the location and audit report date as well. So even though many of you might be able to pull this out of AA, others can’t, and we think this is a valuable field when controlling for disclosure.

We will not initially be able to add the AUDITORSINCE value for many filings with auditor changes prior to 2018, because that will require a separate effort to identify that data value. Procter & Gamble has been audited by Deloitte since 1890 – so we can trivially add that field to all of their 10-K filings. But we have only 533 CIKs with an AUDITORSINCE value prior to 1994. We have 2,457 that have had the same auditor since 2010.

My neighbors and some colleagues share an inside joke: I have stated many times that doing X is like making bread – it is a process, rather than something that can happen perfectly the first time. As we move back into older filing windows there are many complications and challenges associated with identifying the metadata values (like making bread). Thus, while I expect the addition of the values to be relatively easy to manage moving forward, I do anticipate some unexpected challenges as we attempt to add this data to historical filings. (Fortunately we have directEDGAR to support our work!)

Hand Collection of Data – Unavoidable at Times – Also Update on Tagging

I received a really nice compliment today – one of the nicest I have received in a while. I’m always uncomfortable asking people to shill for directEDGAR so I won’t ask this faculty member if I can use their name. However, the following comment came from an individual who began using directEDGAR as a PhD student and is now in their second year as a faculty member at another client school.

“This is very helpful. Thanks Burch! You always know how to reduce the time I need to spend hand-collecting data, which I sincerely appreciate!”

I’m sharing this not to brag. This faculty member had a data collection problem and was wondering if the best option was to send a research assistant to the EDGAR website to hand collect the data they need. This problem is exactly why I initially developed directEDGAR. We were trying to collect audit fee data for a couple of papers. We didn’t have any other source for that data, so it had to be hand collected. (Yes, there was life before AA.) Today we are very focused on adding more automation features, but the foundation of data collection has to start with search to find the data, and then evaluating the costs and benefits of hand collecting it versus applying some of our more sophisticated tools or using Python to capture it.

Whenever we try to capture data we always start with a careful review of 50–100 filings to learn how the idiosyncratic combination of people and the tools they use to create the filing affects the disclosure. If you were to study audit fees carefully you would find enough variability to perhaps make you pull some hair out. Some registrants disclose the fees in a block of text; others disclose the fees in a table where the column headings are the periods the fees apply to. The last relatively common form is a table with the fee categories as the column headings and the time periods as the row labels. Then you have registrants who disclose fees using one pattern for a number of years and then switch to another. I almost forgot those cases where fees are reported in one of the table forms but the DEF 14A or 10-K contains an image of the table rather than the actual table. There are other forms of disclosure as well.

Once we learn about the disclosure forms we don’t worry too much about who uses what form. Instead we consider each disclosure form and decide the best strategy for it. For example, if we were collecting audit fees we would use the TableExtraction & Normalization tools to extract all the tables we could, using known variants of the words/phrases likely to appear in the tables (audit fees; audit-related; tax fees). We would record the CIKs of the firms we wanted this data from that were missing.

So now we are getting to the hint I provided the person who offered the compliment. They asked if there was a better way than sending a research assistant to EDGAR to collect an item of data that was not going to be easy to collect with one of our tools – this item just has to be hand collected. If I have my list of CIKs and I need to collect data that is not disclosed in a form that allows the use of more automated tools, I can still speed up the hand collection by a factor of at least 10 compared to visiting EDGAR.

A significant amount of the time used for hand collection from EDGAR goes to entering the CIK/NAME into the front search box, clicking through the list of filings to locate the correct one, and then opening up the filing to find the disclosure. Another significant amount of time is required to transcribe the relevant metadata into Excel. These parts of the process have to be handled manually when you visit EDGAR. With directEDGAR these steps are handled much more efficiently so you can focus on the data. Going back to the audit fee data – we have to hand collect those disclosures that are in text or in an image. So once we have used our tools to the extent practical, we shift to hand collection. But there are a couple of tricks.

First we run a search, filtering on the CIK list of our hand-collection sample. This loads the documents we want into the application. For audit fees we would typically just use the phrase audit related. At this stage we have saved between 30 and 60 seconds per filing because we do not have to type the name into the EDGAR search pane, click through filings . . . (As I was writing this I timed myself locating and opening filings from three different registrants and then finding the place in the documents where the audit fees were disclosed. I averaged 45 seconds per filing.)

CIK basic search for audit related

Next we need to create a data collection worksheet. The SummaryExtraction feature provides a starting point. I know that in most cases the audit fees that are in text are reported by fee category in different paragraphs. So I want to collect the fee data more or less in the manner it is disclosed – one row for each type of fee. To do that – since the most common division of fees is AUDIT/AUDIT RELATED/TAX/OTHER/TOTAL – I am going to duplicate the Summary Extraction block four times (I want five copies). I will add the column headings I need, and I am going to add in the fee categories. Once this is done I am ready to hand-collect the data.

Worksheet for Hand Collection
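If you would rather script this step than copy blocks by hand, here is a hypothetical pandas sketch of the same worksheet – the column names and the SummaryExtraction export format are my assumptions:

import pandas as pd

FEE_CATEGORIES = ["AUDIT", "AUDIT RELATED", "TAX", "OTHER", "TOTAL"]

def build_collection_sheet(summary_df):
    # Repeat each SummaryExtraction row once per fee category (five copies).
    summary_df = summary_df.reset_index(drop=True)
    sheet = summary_df.loc[summary_df.index.repeat(len(FEE_CATEGORIES))].copy()
    sheet["FEE_CATEGORY"] = FEE_CATEGORIES * len(summary_df)
    sheet["FEE_AMOUNT"] = ""  # filled in by hand while clicking through filings
    return sheet

# build_collection_sheet(pd.read_excel("summary_extraction.xlsx")).to_excel(
#     "hand_collection_worksheet.xlsx", index=False)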

The observations in this worksheet are ordered the same way as the documents in the search application. At this point I have a job with a workflow that makes it easy to delegate. If I don’t have someone to delegate this task to, at least a lot of the tedious steps have been removed. I still have to enter numbers and click between filings in the application. But no unnecessary steps are involved.

Suppose you are laser focused and more efficient than I am. Assume you can complete the steps required to find the data in a filing on EDGAR averaging 30 seconds per filing. Using our platform will still save you 125 minutes (30 seconds saved on each of roughly 250 filings) for the small sample of audit fees that have to be hand collected.

The punch line: don’t get too frustrated when you have to hand collect data. Sometimes it is unavoidable. If you cannot imagine a reasonable workflow, send an email and we will help you identify the best strategy.


Tagging update. We have processed a large amount of new metadata to add to our 10-K filings. Of course it was more complicated than I had hoped. We finally have code that is resilient enough for us to trust (3 exceptions out of 45,646 filings examined). And we have a special piece we are adding as well (more on that later). I am getting ready to update the metadata in our current 10-K filings back to January 2011. Because of our architecture we can complete this step without any downtime to our service. If I can finish this over the weekend, we will replace the existing indexes with new indexes that include the metadata the weekend of the 17th. If that happens as anticipated I will post a warning here, because the 10-K indexes will be unavailable while they are updated.

There is going to be one slight complication during the update period. When you run a search you are not searching documents – you are searching an index of the words in the documents as of the last time the index was created. When you extract documents you are extracting each document as it exists on disk at the time of the extraction. During the update process the existing indexes will be based on the old tagged documents. If you run a search and then do an extraction while this is going on, the documents you extract will have the new metadata tags embedded in the html. This should not be significant, but there will be differences.

As a technical note, we embed the metadata tags between the closing body tag (</body>) and the closing html tag (</html>). They are of the form:

</body>
<meta name="DOCTYPE" content="10K">
<meta name="SICCODE" content="3841">
<meta name="CNAME" content="RETRACTABLE TECHNOLOGIES INC">
<meta name="FYEND" content="1231">
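If you work with the extracted files in Python, a minimal sketch for reading these tags back out might look like this (the regex simply keys off the name/content pattern shown above):

import re

META_RE = re.compile(r'<meta name="([^"]+)" content="([^"]*)">')

def read_injected_metadata(html_path):
    # Return the injected metadata as a dict, e.g. {'DOCTYPE': '10K', ...}.
    with open(html_path, encoding="utf-8", errors="replace") as f:
        text = f.read()
    # The tags are appended after the closing body tag, so scan only the tail.
    tail = text[text.rfind("</body>"):]
    return dict(META_RE.findall(tail))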

In addition to injecting the new metadata, we are going to create a new table object for direct access that contains all of the metadata we can collect about our 10-K filings and their filers. We have to balance a number of issues when deciding whether to add additional fields into the 10-K filings themselves. If there is a field you want for your research and we do not have it available in the filings, we can at least make it available through our PreprocessedExtraction feature. More on that as we move forward.


On a personal note – the reason for the delay is because Patrick and I are going to be visiting colleges on the East Coast in a week. Man he has grown up fast!