Collecting Data Non-Numeric Data from Tables & Towards Building a Code Repository for Python Users

One of my current projects is to identify committee members for a large set of accelerated filers. I have looked at adding this as a data item to our platform but we have too many other issues that have a higher priority. However, I need this data for one of my own research projects. One of the special challenges is the lack of uniform disclosure practices.

A growing number of registrants are providing a tabular disclosure of the committee assignments of their board. When we can snip these tables from the documents then the committee assignments are pretty easy to merge with director compensation data available through the platform. For example – here is the summary of the committees as reported in Air Products & Chemicals 2016 DEF 14A (CIK 2969) (we snipped the table so the view is in the SmartBrowser):

Air Products & Chemicals 2016 Committee Membership Disclosure

Here is how the director compensation data from the same proxy filing looks when pulled from directEDGAR:

Air Products & Chemicals 2016 Director Compensation Data

Notice that the names are the same form, this is not always the case but our experience is that registrants maintain significant internal consistency in how they name their directors across multiple tables. Because the table that has the committee composition has ASCII characters we can Dehydrate/Rehydrate and normalize the disclosure:

Normalized Committee Data

We can then merge the committee data with the compensation data! Everything above can be created from directEDGAR. It gets more challenging when we identify tables that do not have ASCII values as the indicators. For example, here is the committee disclosure for ABBOTT LABS (CIK 1800):

Abbott Labs Committee Disclosure (2016)

The M and the C are image files. It looks really nice but it is hard to collect data from this table because when the table is snipped the image is gone. Here is an image of the table after it has been snipped:

ABBOTT LABS – Committee table after snipping

The ALT TEXT for the image is something called GRAPHIC. And to add insult to injury the ALT TEXT for the image for both CHAIR and MEMBERS are the same. If we DEHYDRATE/REHYDRATE tables like this – we will have the column headings, the names and the meeting frequency. We will not have values for the committee assignments. We can fix that using Python. We can convert this table to:

ABBOTT LABS – Committee after some Python magic!

Once the table has been converted into this form, the Dehydrator/Rehydrator processes allows conversion of the table into more useful data.

In addition to cases where I found images in the table, I also found cases where the registrant is using special (non-ASCII) symbols (check-marks, check-boxes, bullets . . .) to indicate committee membership.

Committee Table for CIK 49196 (Huntington Bancshares)

I needed a solution to conform the tables with images and non-ASCII characters because I really don’t want my research assistant doing work that can be automated. While I am still scoping out this work I think we are seeing about 400 of these per-year in our sample and with about ten years of data that is approximately 4,000 cases of asking someone to manually review these tables and enter the committee details into a spreadsheet. Not only do I not want to ask someone to do this work, I don’t want to manage it. I suspect you don’t either.

I came up with a solution and I started thinking about our mission – to reduce the effort you spend collecting data from EDGAR filings. We need to share this code. I know many of our clients use our search platform to find relevant filings and then download filings to work their own Python magic. I don’t see any reason why you should have to sort out a solution if we have one that we are using either in-production or in our own work.

In my solution I have not tried to map the images or special characters to their meaning based on a key to the table or even a translation of their meaning. Instead I am making an assumption that the frequency of the image/symbol has meaning. We are simply identifying the unique symbols/image and replacing them with an integer. The integer has no meaning other than the order in which they were found. My assumption is that there will be fewer symbols for chairs than for members. In the image above the members are indicated by “1” and the chairs by a “2”. If Mr. Alpern was the chair of the Nominations and Governance committee then the values would be reversed. We can identify frequency in the next step and work with the data to normalize if more.

Sometimes this is not enough. There are registrants who use a different image for each director/role. If Burch Kealey is a member of the compensation committee – there is one image for that and a different image to represent his membership in the audit committee and then a third image that conveys that he is the chair of the finance committee. In this case there might be 20+ different integers in the results. There is nothing we can do about this choice and so for those we have to go back to the source document to sort out the committee membership.

If you are using our platform and also augmenting our tools with Python – I want to make this type of code available so you can focus on data collection. Sometime in the next six months I would like to set something up on BitBucket so we can share useful code in a more structured way. I don’t have time to sort all of the intricacies of that solution but I have an intermediate solution. The intermediate solution is that I have created a folder on our platform (on the S:\ drive) called AvailableCode. Inside the folder I will save zipped folders that will have some commented code and some example input files that can be transformed with the code.

While I intend to provide comments in the code – my comments will be focused more on identifying the entry and exit points and how you might need to expand the code rather than necessarily fully ‘teaching’ Python.

Code can always be improved. One of the challenges is that there are choices that registrants make when preparing their files. We often don’t learn of the choices until we run the code against a set of new files and get an unexpected result. The code I will make available makes adjustments only for the cases I have seen while working on a particular set of files. For example, below are the binary representations of the various special symbols I have found in these committee tables for my sample.

process_symbols = {b'\xc3\x97', b'\xe2\x9c\x93', b'\xe2\x97\x8f',
b'\xc2\xb7', b'\xc2\x95', b'\xc3\xbc', b'\xe2\x97\x8f', b'\xe2\x9c\x94', b'\xef\x82\x80',b'\xe2\x80\xa2',b'\xc3\x96'}

If you use this code you may find tables with another symbol that was used to indicate committee membership or leadership. If you do then it is just necessary to add that symbol to the set of process_symbols and the code will then correct for that symbol. It would be great if you would share with us new symbols so we can add them to the code as well. Similarly, so far we have identified WingDings and WebDings as challenging fonts to process. There might be others. You will identify the existence of others if you are using this code on a different collection of tables and after processing you find something like the image from Abbott Labs after processing, something non-intelligible. Look at the html and it should not be difficult to identify the problem. If you are brand new at working with Python in this manner – send the table to support@ . . . and we will help you learn how to modify the code to accommodate this issue.

That brings me to my final point. If you have been thinking about learning Python to help your data collection – I think you will find this helpful. To get started I would suggest installing PyCharm. While they offer a free academic license – if you get comfortable using it – the professional license is reasonably priced and we know that nothing is really free. In addition – while Google is a great resource – to be effective using Python at some point you are going to want to understand the code – if I were trying to learn Python today I would pick tutorials/course from RealPython. The pricing seems reasonable and their organization and scope of coverage looks like it maps into what I think an academic researcher might want to learn. The code I am making available first requires we find, open and read files. You can easily look at the code and intuit what is going on. To understand the options available though I suspect this 30 minute course/lesson would be useful Reading and Writing Files. What I like about their material is that I can approach it with a question in my mind – search for relevant material based on my question and then pick through the options. When I first started (and to this day) I try to break my tasks down to the smallest steps and then Google for information. My first real line of Python was to open a file – so I search for “How to open a file in Python”.

One of the next posts I make will describe how to use Python to clean the noise from tables we have snipped. It is really challenging to snip committee tables – so I used the TableSnipper to snip a set of tables based on some gross characteristics of the tables and then I wrote some additional code to clean up the noise from those results – again to save my research assistant from having to trudge through a lot of unnecessary tables. I need to clean up the code a bit more. When I have finished cleaning the code so it can be useful for filtering a more generic set of tables I will add it to the folder on the platform.

Final note – to use the example bundles from S:\ navigate to the S drive on the platform. Open the AvailableCode folder and identify the zip folder you want to copy. For security reasons, when you select the folder and right-click with your mouse you will get a warning message:

Warning Message

Even though you are not opening the zip file – that message will appear. Click OK. The then right-click context menu will be available. Note – you cannot extract the contents on S as you don’t have write privileges. You can copy and then paste to the Temporary Files folder.

Once they are in that folder you can download them to your computer (review the download steps in this video at about 3:50 Accessing Files ). If you have Python installed you should only need to change the paths to then run the file and see the transformation. You can use the same code to transform your own files as needed. I should note that I am running Python 3.9 and I am using LXML directly (not through BeautifulSoup).

Stand-Alone SmartBrowser Installer Available

A colleague of mine at the University of Nebraska at Omaha needs Say-On-Pay vote results. The disclosure of these votes is relatively structured. This will allow us to write some Python code to extract and normalize the results for a large number from the filings. However, there are significantly diminishing returns to writing code to get every last one. For those that the code writing effort is too bothersome we will have our RA (a very diligent graduate student) review the documents and transcribe the data. We try to be conscious of the tedium of this type of work and reduce as many steps as we can. Our SmartBrowser offers an efficient way to review the relevant documents so she can focus on the data collection rather than making too many mouse-clicks.

The first step of this process is to extract the relevant documents from the platform. Below is an image of the results of the first search for this sample.

Searching for Say-on-Pay Votes

The search string was (DOCTYPE contains(8K*)) and (ITEM_5.07 contains(YES)) and compensation. The first parameter limits the search to 8-K or 8-K/A since we have not seen many results where the disclosure is in an exhibit. The second parameter limits the search to ITEM 5.07 as that is the item code for Submission of Matters to a Vote of Security Holders. The final parameter is to make sure that the word compensation is mentioned in the document. Clearly this is not a perfect search – it provides us a place to get started.

We want to extract the actual 8-K – I want to take advantage of the HTML structure to help better parse the votes so we are using the DocumentExtraction feature. An explanation/example of extracting documents and them moving them to your local computer (so you can share them with your RA or so you can write code to manipulate them) is available from this Video (DocumentExtraction and transfer are illustrated beginning at about 2:58). I will observe that you might want to also do a SummaryExtraction to generate a CSV file with all of the metadata about each document. In addition I would generate a listing of all of the documents that are extracted using the FileListing tool from the Utilities menu

Using the File Listing Utility

This file is useful because we have to change the name of the files extracted so as to prevent name collision and to make sure you can map the document back to the filing. If the files in the selected folder have the directEDGAR file name convention the utility will parse apart the name components (CIK, RDATE, . . .).

We wrote some code to parse out and normalize the votes for those cases where we have a very high level of certainty about the disclosure. In this sample there were about 300 specific filings where it is easier to manually collect the data. Now we are ready to use the SmartBrowser. The stand-alone version of the SmartBrowser offers a convenient way to review a large number of htm/txt files that are named per our convention (CIK-RDATE-CDATE-etc). It has all of the features of the version embedded in the platform.

Here is a screenshot of Boston Beer’s disclosure:

Say-on-Pay Voting Results Disclosure for Boston Beer

Like the version in the platform the stand-alone version of the SmartBrowser has many of the features of a standard web browser that are available from the right-click menu. In addition you can increase/decrease font size (CTRL + (either +/-)). A CTRL+F allows you to search within the document that is displayed. Remember that this is the entire 8-K so being able to search for Compensation and then jumping to that location will reduce the workload considerably.

The stand-alone SmartBrowser installs on Windows 7+. It does not require any license validation and has no embedded security – other than it will not display documents without our naming convention. (You can get around that by renaming your documents if you wanted to).

The bottom-line is that this allows us to get our research assistant to help without having to tie them to significant training on the use of the platform. We provide them a USB drive with the installer for the SmartBrowser, the documents, an Excel file with the metadata they need (CIK etc) and they can review these documents and focus on what is important – collecting the voting results.

This is the accounting faculty member in me speaking. I figure it takes anywhere from 30 seconds to 2 minutes to take a CIK to EDGAR, find the right 8-K filing, open the document and find the disclosure. We are saving them somewhere between 150 to 600 minutes (2.5-10 hours) of key strokes/mistakes and boredom on this sample. If you don’t have access to the SmartBrowser installer please send an email to support@ . . .

Another use of ANDANY

In the weeds again – the following is a bit dense – but details matter.

One of the reasons our application is so versatile is that we have more search operators than any other SEC Filing Search Platform. We have the standard Boolean operators (AND, OR and NOT), We have proximity operators (W/N and PRE/N). We have document operators (XFIRSTWORD and XLASTWORD). And then there is the ANDANY operator.

Here’s the problem – suppose we have a list of the members the Russell 3000 from the beginning of EDGAR until now. We have a set of search terms and we want to do two things. First, for each year determine if the companies in our sample filed a 10-K and then determine the frequency of some bag of words in each filing.

With directEDGAR’s ANDANY operator and CIK filtering tool I can do this in one step. The search I would construct would be ((DOCTYPE contains(10K*)) or (DOCTYPE contains(EX13)) ANDANY (my bag of words/phrases [with appropriate operators]). Because I want to limit the results to those filers identified as members of the R3000 I would have to use a CIK input file.

In this case the ANDANY operator essentially sets the contents of the ANDANY component component of the search as secondary and the documents as primary. This search would first identify all 10-K and Exhibit-13s that were filed by those CIKs in my sample. And then it would also identify the existence and their frequency of each of the words/phrases in each of the documents that were returned.

This is particularly important if we need to use a dummy variable for instance if some of our sample filed a 10-K in this window but did not use any of the words/phrases that were included in the bag of words.

This is different from a very similar search ((DOCTYPE contains(10K*)) or (DOCTYPE contains(EX13)) AND (my bag of words/phrases [with appropriate operators]). This second search will only return 10-K (and Exhibit-13) documents that have at least one of my bag of words/phrases. With this second search I then have to separately determine/identify those cases where my sample filed a 10-K (and therefore did not include anything that matched the search inside the ANDANY operator.

Let’s get further in the weeds and make this a 2 step process. Suppose we only want results if the company was a member of the R3000 for a particular year. For example, Ultra Petroleum (CIK 1022646) joined the R3000 in 2017. However – they have filed 10-Ks since 2002. Further, they were deleted in 2019. In this case, let us assume that we have a list of CIK/years that represent the precise fiscal years that we want to research. Ultra Petroleum is on our list for 2017 and 2018. Given the large number of additions and deletions and the fact that companies can be added, removed and then added back this is a complex list.

Step 1 in this case would be to just do a CIK filtered search for (DOCTYPE contains(10K*)) or (DOCTYPE contains(EX13). This will provide me a list of all 10-K and Exhibit 13s for my CIK sample. I would merge this with my RUSSELL 3000 composition list to identify only those 10-K/EX-13 filings for the CIK-YEAR pairs I want to research. In this case I would end up with 10-K filings for 2017 and 2018 for Ultra Petroleum.

The merged list identifies the exact documents I want to search – we know they exist but to keep it all in the output I would run the same search as above ((DOCTYPE contains(10K*)) or (DOCTYPE contains(EX13)) ANDANY (my bag of words/phrases [with appropriate operators]) – but rather than using CIK filtering I would use CIK-DATE filtering. The first output file has the balance sheet date for each filing – so I would do CIK-DATE filtering using the CDATE (the balance sheet date) as the value for the MATCHDATE parameter. I would set a zero (0) day window using the CDATE (balance sheet date) since we are confident about the existence of these documents (remember – we pulled these in step 1):

Filtering on CIK DATE

By running this 2 step process I have identified the exact 10-K/EX-13 that pertain to the period that the registrant was a member of the R3000. While I could have done this in one step there is a bit of a problem – balance sheet dates change over time for two reasons. First – registrants change their fiscal year-end. Second there are enough registrants with a 52/53 week year that it can get messy (Pepsico reports this in their 2013 10-K: In 2011, we had an additional week of results (53rd week). Our fiscal year ends on the last Saturday of each December, resulting in an additional week of results every five or six years.

Finally, this is the kind of search that requires/benefits from the use of Historical CIKs. When looking at a time-series we can lose observations if we do not account for CIK assignment changes because of entity changes (Google -> Alphabet). By using the Historical CIK option then our input file would be adjusted to include the CIK-YEAR pairs for any associated CIKs. To keep this post from getting any denser – please review this post for an explanation of how the Historical CIK feature modifies the request file (Historical CIK Mapping).

Using the New Metadata

The above video demonstrates the use of the new metadata in a search. To access the right indexes you will need to first do an index update – this is also illustrated in the video.

Please note – the search I illustrate above is just a simple search so I could avoid getting lost in the weeds. You can use as many of the new fields in a search as you need. For example – here is a search for documents with the word revenue within 500 words after the phrase critical audit matter* if Ernst (& Young) or Deloitte (and Touche) were the auditor and if the document was filed by a Large Accelerated Filer

((critical audit matter*) pre/500 revenue)  and ((AUDITOR contains(ernst)) or (AUDITOR contains(deloitt*)))  and (FilerCategory contains(LAF)) 

There is more coming!

Finally! (Well, Come Monday)

Lookie here – all of this metadata to be available on directEDGAR beginning on 8/30/2021:

<meta name="SICODE" content="7370">
<meta name="FYEND" content="1231">
<meta name="CNAME" content="ALPHABET INC.">
<meta name="FILINGDATE" content="20200204">
<meta name="ACCEPTANCE" content="20200203210359">
<meta name="ZIP" content="94043">
<meta name="DOCTYPE" content="10K">
<meta name="SECPATH" content="https://www.sec.gov/Archives/edgar/data/1652044/000165204420000008/goog10-k2019.htm">
<meta name="AddressCityOrTown content="MOUNTAIN VIEW">
<meta name="CurrentReportingStatus content="YES">
<meta name="SmallBusiness content="FALSE">
<meta name="WellKnownSeasonedIssuer content="YES">
<meta name="EmergingGrowthCompany content="FALSE">
<meta name="FilerCategory content="LAF">
<meta name="ShellCompany content="FALSE">
<meta name="AddressStateOrProvince content="CA">
<meta name="VoluntaryFilers content="NO">
<meta name="PublicFloat" content="663000000000.0">
<meta name="FloatDate" content="2019-06-28">
<meta name="CommonStockSharesOutstanding_1" content="299895185">
<meta name="ShareDate" content="2020-01-27">
<meta name="SecurityName_1" content="CommonClassA">
<meta name="CommonStockSharesOutstanding_2" content="46411073">
<meta name="SecurityName_2" content="CommonClassB">
<meta name="CommonStockSharesOutstanding_3" content="340979832">
<meta name="SecurityName_3" content="CapitalClassC">
<meta name="AUDITOR" content="ERNST & YOUNG">
<meta name="REPORT_DATE" content="2/3/2020">
<meta name="LOCATION_CITY" content="SAN JOSE">
<meta name="LOCATION_STATE" content="CALIFORNIA">
<meta name="SINCE" content="1999">

There is a lot of noise above (and I don’t think Jimmy Buffett is noise)- what is that? That is the new metadata block that was added to the 10-K Alphabet Inc. submitted to the SEC on 2/4/2020. This has taken longer than I had hoped – one of the reasons for the delay is the number of our clients who asked how to identify the auditor caused me to do some research. I learned that the availability of auditor data is spotty prior to the disclosure of audit fees beginning in 2001/2002. I know a number of our clients were using our search tools to find/identify the auditor and so that made me decide it was worth the effort to add the information we could about the auditor.

I think the most common request before this was either the accession number of the source file (accession.txt) or the path to the documents returned from a search. I will be happy to take feedback if you feel like the accession number should be added as a direct piece of metadata. However, as I think about balancing everything I have initially determined that since the accession number follows the CIK and is before the actual file name ( 000165204420000008 ) – it is easy enough to parse out of either a SummaryExtraction or ContextExtraction product created by the platform.

When available in a document, all of this metadata is accessible if you do a search and then do a SummaryExtraction of the results. Clearly some fields are not likely to be useful for searching, like (ACCEPTANCE). Some, though, are likely to be very useful for constructing your search – remember that disclosure requirements vary by FilerCategory. For instance, a LARGE ACCELERATED FILER (LAF) has an accelerated filing schedule relative to other filers (10-K is due 15 days sooner than a 10-K for an Accelerated Filer) and carries the largest disclosure burden of all filers.

To access the new fields, press the “fields” button. The Select Field box will become available, which allows you to populate the Value field in the interface. While you can use as many fields as you’d like in a search, you have to add them one at a time. This image shows the Select Field tool for the new Y2016-Y2020 index:

Field Listing Full 10K

The case of the listed fields is an indicator of the source of the field. If the field name is all upper case – we generated the value from some artifact during the processing or outside the system. If the field name is a sequence of words with the first letter in each word capitalized – we captured the field value from the filing in more or less an automated process.

We are adding less metadata to the exhibits included with the 10-K. Basically, we will be leaving out everything after the SECPATH value. Here is the metadata that is going to be embedded in the first Exhibit 10 filed with the 10-K referenced above:

<meta name="SICODE" content="7370">
<meta name="FYEND" content="1231">
<meta name="CNAME" content="ALPHABET INC.">
<meta name="FILINGDATE" content="20200204">
<meta name="ACCEPTANCE" content="20200203210359">
<meta name="ZIP" content="94043">
<meta name="DOCTYPE" content="EX10">
<meta name="SECPATH" content="https://www.sec.gov/Archives/edgar/data/1652044/000165204420000008/googexhibit10081.htm">

One of the consequences of not including the full field set that is attached to the 10-K filing into the exhibits is that it is not yet possible to directly search for EXHIBIT 10s that were included in the 10-K filings of a LARGE ACCELERATED FILER. Instead you would have to first run a search for (DOCTYPE contains(10K)) and (FilerCategory contains (LAF)). This will identify all 10-K filings made by LAFs. Do a summary extraction, save that file and pull the CIKs. Use that CIK list to do a CIK filtered search for (DOCTYPE contains(EX10)) and your search will only return EXHIBIT 10s that were included in a 10-K filing made by a LARGE ACCELERATED FILER.

The following table contains a list of the metadata labels as well as their definition. If you are a current filer you will receive an email that includes a third column (SOURCE) which describes how the value was determined.

METALABELDEFINITION
SICCODEStandard Industrial Classification Code (as assigned by the SIC)
 FYENDFiscal Year End for the most recent balance sheet included in the 10-K
 CNAMECompany Name
 FILINGDATEThe date the filing was submitted to EDGAR
 ACCEPTANCEThe date-time stamp associated with the acceptance of the filing by the EDGAR system
 DOCTYPEThe registrant is required to classify each document in a filing – this  tag indicates the classification of the document assigned by the registrant.
 SECPATHThe full path to the filing on EDGAR
 VoluntaryFilersYES/NO to indicate if the registrant is making this filing on a voluntary basis.
 SmallBusinessTRUE/FALSE to indicate if the registrant meets the SEC definition of a Small Business Filer
 ShellCompanyTRUE/FALSE to indicate if the registrant meets the SEC definition of a Shell Company
 FilerCategoryThe registrants conclusion regarding their classification per the SEC’s filer category classification definitions 
 FilerCategory     LAF – Large Accelerated Filer
 FilerCategory     AF – Accelerated Filer
 FilerCategory     NAF – Non-Accelerated Filer
 FilerCategory     SRC – Smaller Reporting Company
 FilerCategory     SRAF – Smaller Reporting Accelerated Filer (this is actually not a valid classification but it has been used by a number of registrants – there are 55 10Ks with this self-reported label)
 EmergingGrowthCompanyTRUE/FALSE to indicate if the filer meets the definition of an emerging growth company
 WellKnownSeasonedIssuerYES/NO to indicate whether or not the filer meets the definition of a Well Known Seasoned Issuer
 CurrentReportingStatusYes/No to indicate if  the registrant is current on their mandated filing obligations
 AddressStateOrProvinceThe State or Province of the headquarters of the filer
 AddressCityOrTownThe name of the City or Town of the headquarters of the filer
 ZIPThe ZIP/POSTAL code component of the address of the filer
 PublicFloatThe aggregate market value of the shares of the registrant held by non-affiliates as of the last day of the registrants most recent second quarter (if multiple float values are reported we sum them to maintain consistency – we do validation checks to catch the cases where a total is reported as well as the float for each class)
 FloatDateThe date used to determine the public float
 CommonStockSharesOutstanding_1The number of shares reported as outstanding – if there are multiple share types/classes reported this is the first listed.
 ShareDateThe measurement date which is the latest practical date (closest to the filing date) of the 10-K
 SecurityName_1If provided, the name of the security whose common shares outstanding are listed as CommonStockSharesOutstanding_1
 CommonStockSharesOutstanding_2The number of shares reported as outstanding – if there are multiple share types/classes reported this is the security listed second.
 SecurityName_2If provided, the name of the security whose common shares outstanding are listed as CommonStockSharesOutstanding_2
 CommonStockSharesOutstanding_3The number of shares reported as outstanding – if there are multiple share types/classes reported this is the security listed second.
 SecurityName_3If provided, the name of the security whose common shares outstanding are listed as CommonStockSharesOutstanding_2
 IcfrAuditorAttestationFlagTRUE/FALSE to indicate whether the filing includes an attestation by the/an external auditor on the Internal Controls over Financial Reporting
 AUDITORThe name of the auditor of the most recent financial statements (conforming to the FYEND tag)
 REPORT_DATEThe audit report date
 LOCATION_CITYThe city location of the auditor
 LOCATION_STATEThe state/country of the auditor.
 SINCEThe tenure of the auditor

This is only the beginning of our work on improving the opportunity to add fields to the filings. Right now the team has auditor details back to 2007 – we will collect all the way to 1994. We have identified a way to add an ACCEPTANCE datetime stamp to the earlier filings. While it is not available in the index or hdr files for most filings prior to about 2002 we have determined how to identify a very good proxy for this value. We will be redoing the 2021 10-K filings in the next week or so to add in the same meta data and then we will next redo the 2010 – 2015 10-K filings. We have also been working on supplementing the self-reported IcfrAuditorAttestationFlag field and are close to being able to add this value to a significant number of filings.

Details matter a lot. We have standardized the auditor names but we have not yet made any attempt to roll back auditor name changes. We cannot consider doing this until all of the auditor data has been collected. If you are an existing client and need the mappings for the auditor name standardization send me an email. We will not be fully populating all of the SINCE fields until all of the auditor data has been collected. We tried to use some algorithms to add this value when it was not reported but we were not happy with the results. Specifically there are too many cases of auditors reporting this value as the year they signed an engagement letter with the registrant – not the first year they audited financial statements.

Another Details matter a lot point – there are some registrants that report more than three classes of stock. There was one who reported nine values for CommonStockSharesOutstanding (indicating 9 classes of securities). But when we analyzed these we discovered that once we moved past 3 the results were fairly dicey with respect to trying to use the reported values in a meaningful way. If you are interested – we identified 41 unique CIKs that reported more than 3 classes of securities. Here is a link to the interactive data presentation for one (Strategic Student Housing & Senior Trust Inc). The only registrant I could identify where I thought this value was useful was Molson Coors (great product!) but in weighing where to start and recognizing that we would be adding two additional columns that would only be meaningful for one CIK – that seemed to be too much. We can revisit this based on your feedback.

We are doing something else that is pretty cool but I think I need to keep it under my hat for the moment. I just don’t want you to think this is it. There is always more!

As an aside one of the things I would still like to do is to push all of this field data into a database – make an isolated copy of the database and then provide you with the tools to access it for a data analytics class. I look at some of the stuff that is being used and I wonder whether or not those databases give students a real sense of the complexity and ‘messiness’ of real data. For example – there is no category SMALLER REPORTING ACCELERATED FILER – we only learn it exists by testing the data and establishing that some filings did not have a valid value for Filer Category. I could imagine creating a database that combines this metadata with other data (such as compensation) to give students a richer experience working with real financial data. Poke me if this interests you.

I will send a direct email to you as soon as this is complete. As of 5:45 PM 8/28 – this is the current status of the indexing operation:

Deep Into the Weeds – Filing Differences

When I speak to people about EDGAR they tend to view it as a static archive. I have known for a long time it is not. The filing that was present yesterday may not be the same filing that is available today. When we process our filings a key part of creating the filing architecture is our use of a date from the header file that we label as the RDATE. The RDATE is the YYYYMMDD that appears in the first line of the header file. This can differ from the Filing Date as reported on the SEC landing page for the filing as well as in the header. Here is an example of a portion of a header file for a 10-K filing made by CIK 1357878 (PEPTIDE TECHNOLOGIES INC):

<SEC-HEADER>0001472375-20-000004.hdr.sgml : 20200225
<ACCEPTANCE-DATETIME>20200123172636
<PRIVATE-TO-PUBLIC>
<ACCESSION-NUMBER>0001472375-20-000004
<TYPE>10-K/A
<PUBLIC-DOCUMENT-COUNT>7
<PERIOD>20190331
<FILING-DATE>20200123
<DATE-OF-FILING-DATE-CHANGE>20200123

Our system labels every document and artifact from this filing with the 1357878-R20200225-C20191231-F04 prefix. When I am presenting or discussing our architecture I often use the phrase Reveal Date to describe the RDATE. Many people always question our choice of using this particular date rather than the filing date (20200123). My argument for this is that if you are running an event study we know that the RDATE represents the last modification to the filing. The SEC anchors on the filing date because the registrant has a legal obligation that is determined to be satisfied based at least partly on the filing date (dense I know). But if you are using EDGAR data for a research study and you want to measure market response to the information in the filing – the date choice gets murkier.

I have historically not paid much attention to these issues other than the deliberations at the beginning on our architecture. Having said that I have wanted to provide the filing date as a piece of metadata and this is now allowed with our new architecture. In my mind if the RDATE differs significantly from the SEC Filing Date I think that is reason to exclude the data from the filing from an event study. So when we provide the RDATE and the Filing Date we are given you some useful information – if those dates differ significantly it might be worthwhile doing your event study analysis with and without those observations and then another option is to run it with one date and then the other.

An interesting thing happened today as I was working on our restructuring. In the past we ran two systems, one for our academic clients and one for our commercial clients who want same day access to new 10-K and Proxy filings. The work we did for academics was done on a local computer – the work for our commercial clients has largely been in AWS since about 2016. As a result of developing the new cloud architecture for our academic clients these systems have mostly merged. For the academic systems the filings were only updated quarterly or semi-annually and we distributed those to the client platforms on USB drives. (Now the academic clients have access to the same collection used by our commercial clients).

As a result of our past practices I have two copies of many 10-K and proxy filings. One copy that was captured during the business day soon after it was made available on EDGAR and then I have the copy that we pulled separately for our academic clients. This has led to some interesting discoveries. The filing above is a case in point.

The filing version that was captured and delivered to our commercial clients was made on 1/23/2020 (this date matches the Filing Date above) And then there is the version that was captured on 5/17/2020 for delivery to our academic clients. The version that was captured for our commercial clients has this prefix 1357878-R20200123-C20190331-F04. To be clear – our commercial clients received 1357878-R20200123-C20190331-F04 and our academic clients received 1357878-R20200225-C20190331-F04 – so the question is – what was different.

The version delivered to our academic clients included a copy of a letter the registrant submitted to the SEC confirming that the amendment conformed to the observations the SEC had made about the original 10-K filing submitted on June 24, 2019. This letter was not included in the original accession.txt file available on EDGAR on 1/23/2020. However, it was included in the accession.txt file that became available on 2/25/2020. To be specific this file (link to EDGAR file) was not available in the original filing. Confusing? For an EDGAR filing junkie like me it is kind of cool. It also supports my decision more than a decade ago to use the RDATE in the identifier rather than the filing date.

So far I have identified 106 cases where the original filing differs from the current filing out of approximately 42,000 10K filings I have examined. I am noticing these changes while trying frantically to finish our transition so I have not delved into them too much. If I see anything really interesting I suspect I am going to try to develop a paper on the differences. This one seemed the easiest to communicate about.

Word List Feature

I was on a call with a potential client today and I remembered to show them an often over-looked feature of our platform. We index the documents that are in the search engine. The indexes that process your search contain a mapping between words and document locations. When you submit a search the parser checks your words and adds some magic to identify those documents that meet your search criteria.

Because of this, you can have the application check for word existence as you are typing. To activate this feature use the Word List Index pull down to select one of the indexes you are searching.

Accessing the Word List

Then, as you type words into the Search Builder – the application will display the frequency of the word or similar words – do you know anyone else named Burch? In my conversation I wondered how many instances of their name existed (I will not share it here) and then we compared that count with the instances of the name Burch. I was actually surprised that there were more instances of Burch (but all are last names).

Actual Word Frequency

That was a silly use of this feature. However, when you are trying to exhaustively count all instances of some concept -using this feature might alert you to cases where the registrant made a typo. In the image below I am exploring to establish whether or not there might be cases where material weakness disclosures might be under-counted because of spelling errors. As you can I identified some variants of weaknesses.

This is a small detail – but as I often say – details matter.

Another Example of Why CIK Filtering is Critical

I was engaged in an email exchange with a potential new client and they specifically asked about extracting the text content from ALL 10-K filings. I added the bold upper-case – to me the word ALL is a trigger word (as my son would say) because I have to then feel motivated to explain why ALL is not always the best answer.

To make sure you have the context – our platform will provide direct access to the raw text of the documents we have indexed (the markup is left behind). More and more researchers are using this raw text to test various hypotheses and to train AI systems. So people ask me – how can I get ALL.

The problem with ALL is that too many SEC registrants are not going to have securities whose prices are readily available. So if you get ALL then at some point the ones that you can’t match to some measure of value are toast.

I can understand if folks think well – it is easy enough to push that button and worry about filtering later. I can discard those that I don’t need. But actually it is very costly to you to collect more data then you need. Your time is the most expensive part of any research project.

The prospect wanted to know specifics – how much time will it take me to download ALL 10-K filings. To answer this questions I logged into an instance and ran a search to identify all 10-K (and EXHIBIT-13) filings made from 1/1/2016 to 12/31/2017. There were a bit more than 17,000 10-Ks in this window. I set a time and pushed the magic button and in two hours and nine minutes later I had all 17,000+ raw text files ready to save to my local computer. That is not horrible time wise – it just works but – it took longer than it needed to because almost 1/2 of those will not match to other data if trying to test a value/security price hypothesis. In my analysis I told our prospect that the system delivered on average 133 filings per-minutes.

However, since I was triggered I ran a second test. This time I only extracted 10-K filings made in 2018. There were a bit more than 8,200 filings. So roughly this is half the size of the first group. How much time do you think it took to extract those 10-Ks? In my test it took 32 minutes – or a rate of about 256 files per minute! Almost twice as fast.

Why this significant rate difference? A small part of it might or might not be due to butterflies flapping their wings in my backyard the second time I did it. The biggest factor that drives up that timing difference is a complicated but cool memory issue in Windows. (I’m going to be nerdy here) Like most applications – we use OS system hooks to do the tasks we want to accomplish. Windows manages memory and all that cool stuff so we can focus on our goals. The cool think is that Windows retains memory references to everything that is done until – usually you close an application. Finally the punchline – when Windows runs out of RAM memory it starts using disk memory – so it writes all of that memory stuff to disk and has a nifty table it uses to figure out where things are. The problem is that once you overrun memory and the disk memory comes into play there is a substantial slowdown of the work you are doing. Our instance disks are fast but they are much slower than RAM.

My general rule of thumb is that once you have manipulated about 10,000 10-K filings (which is a lot) the manipulation of the next one is considerably slower than the manipulation of any earlier ones. This a heuristic, there are other factors involved – but I have used our application a lot. In the first experiment – when I extracted the 17.2K filings – the first 8,600 took about 50 minutes. The second 69 minutes. I told you this was cool – the second group was roughly 38% slower. One of the other factors in play is that in the first case the application has 17K +/- documents available and in the second case it was only keeping track of 8.2K. Less memory was available for the document extraction from the beginning.

So by CIK filtering you are reducing your total work load (and the time you need to pay attention to the process) substantially. Yes, I did compare two years to one year. But remember – I suspect you can’t match half of the 10-K filings from any one year to security prices data in a reliable way. I suspect a filter of total assets greater than zero and the existence of common stock for each year of your analysis would substantially reduce the filings you would extract.

Pushing the button is easy – waiting for data that you will not use can be expensive!

What Does ‘Since’ Mean in The Audit Report?

I thought I would be wrapping up our injection of new metadata into our 10-K filings today. However, I ran into an interesting snag. I discovered that despite an auditor reporting that they have audited the financial statements since some date, their first audit report might be either prior to or after that date.

Here is an example – Core Laboratories N. V.’s current auditor is KPMG. KPMG reports in the FYE 12/31/2020 10-K that they “have served as the Company‚Äôs auditor since 2015.” This same phrase is repeated in the 10-K for the FYE 12/31/2018 and 12/31/2019.

Mandatory tenure reporting began in 2018, so prior 10-K’s have no SINCE declaration or statement.

When I read the 10-K, I presumed that KPMG began auditing Core Laboratories’ financial statements in 2015 and that they would have been the signatories of the 12/31/2015 audit report in the 10-K released in early 2016.

This was not the case. The audit report in the 10-K released in 2016 for the 12/31/2015 FYE was signed by PricewaterhouseCoopers LLP. I then wondered if KPMG meant they had re-audited the 12/31/2015 FYE financial statements after becoming Core Laboratories’ auditor. This was also not the case – the first audit report from KPMG explicitly reports that their audit was for the financial statements for the FYE 12/31/2016.

This was confusing to me, so I went to find the 8-K that reported the change of auditor (to find all 8-K reporting on auditor changes use the search (ITEM_4.01 contains(YES)) and (DOCTYPE contains(8K)). ) The 8-K is interesting and helped me understand why KPMG is reporting that they have been the auditor since 2015. Here is a link to the 8-K: Core Laboratories AUCHANGE 8K.

Core Laboratories dismissed PwC on 4/29/2015. However, the dismissal was effective upon the issuance of the reports (financial and ICFR audits) for 12/31/2015. KPMG was appointed (and an engagement letter was signed) on 4/29/2015, with their appointment to be effective 1/1/2016.

I discovered this as I was working on some final touches to impute SINCE values hoping (actually assuming) that we could rely on the SINCE value that was reported from 2018 to the present to populate prior SINCE fields. I was getting ready to punch the button to approve this logic but I decided to test it. Basically, the test was to establish whether the auditors matched the SINCE value – was KPMG the auditor of Core Laboratories in 2015? I would say they were not.

So now, we have to sort this out and make sure we have the right tests to validate the declarations made in the 10-K. It is our intention to have the SINCE value represent the first FY the auditor signed the audit report in the 10-K. Despite KPMG’s declaration that they have audited Core Laboratories since 2015, we will change that value to 2016, the year they of the first audit report they signed..

New Tagging Examples

While I am slightly behind the schedule I shared in my last post – we are making progress. I have been hesitant to share exactly what this new tagging scheme would look like until now. Below are two examples of the new metadata we will be injecting into the 10-K and EX-13s. At the present time I do not plan to alter the metadata we inject into the the other exhibits except to add the EDGARLINK value. My initial thought is that you can access the metadata associated with the 10-K if you need some value for data collected from an exhibit.

Below is the metadata we will add to Slack’s 10-K that was filed on 3/12/2020. (We use Slack internally and I love it).

<meta name="SIC" content="7385">
<meta name="FYE" content="0131">
<meta name="CONAME" content="SLACK TECHNOLOGIES, INC.">
<meta name="ACCEPTANCETIME" content="20200312163209">
<meta name="ZIPCODE" content="94105">
<meta name="ENTITYADDRESSCITYORTOWN" content="SAN FRANCISCO">
<meta name="ENTITYADDRESSSTATEORPROVINCE" content="CA">
<meta name="ENTITYSMALLBUSINESS" content="FALSE">
<meta name="ENTITYEMERGINGGROWTHCOMPANY" content="TRUE">
<meta name="ENTITYSHELLCOMPANY" content="FALSE">
<meta name="ENTITYPUBLICFLOAT" content="7200000000">
<meta name="PUBLICFLOATDATE" content="20190731">
<meta name="ENTITYFILERCATEGORY" content="NAF">
<meta name="ENTITYPUBLICSHARESDATE" content="20200229">
<meta name="ENTITYPUBLICSHARESLABEL_1" content="CommonClassA">
<meta name="ENTITYPUBLICSHARESCOUNT_1" content="362046257">
<meta name="ENTITYPUBLICSHARESLABEL_2" content="CommonClassB">
<meta name="ENTITYPUBLICSHARESCOUNT_2" content="194761524">
<meta name="AUDITOR" content="KPMG">
<meta name="AUDITREPORTDATE" content="20200312">
<meta name="AUDITORSINCE" content="2015">
<meta name="AUDITORCITY" content="SAN FRANCISCO">
<meta name="AUDITORSTATE" content="CALIFORNIA">
<meta name="EDGARLINK" content="https://www.sec.gov/Archives/edgar/data/1764925/000176492520000251/a1312010-k.htm">

Below is the metadata we will add to Peloton’s 10-K filed on 9/10/2020. Note the acceptance time indicates the RDATE for this filing would be 20200911 since it was accepted after 5:30 pm on 9/10. (No I don’t have a Peloton bike!)

<meta name="SIC" content="3600">
<meta name="FYE" content="0630">
<meta name="CONAME" content="PELOTON INTERACTIVE, INC.">
<meta name="ACCEPTANCETIME" content="20200910180637">
<meta name="ZIPCODE" content="10001">
<meta name="ENTITYADDRESSCITYORTOWN" content="NEW YORK">
<meta name="ENTITYADDRESSSTATEORPROVINCE" content="NY">
<meta name="ENTITYSMALLBUSINESS" content="FALSE">
<meta name="ENTITYEMERGINGGROWTHCOMPANY" content="FALSE">
<meta name="ICFRAUDITORATTESTATIONFLAG" content="FALSE">
<meta name="ENTITYSHELLCOMPANY" content="FALSE">
<meta name="ENTITYPUBLICFLOAT" content="6281462442">
<meta name="PUBLICFLOATDATE" content="20191231">
<meta name="ENTITYFILERCATEGORY" content="NAF">
<meta name="ENTITYPUBLICSHARESDATE" content="20200831">
<meta name="ENTITYPUBLICSHARESLABEL_1" content="CommonClassA">
<meta name="ENTITYPUBLICSHARESCOUNT_1" content="239427396">
<meta name="ENTITYPUBLICSHARESLABEL_2" content="CommonClassB">
<meta name="ENTITYPUBLICSHARESCOUNT_2" content="49261234">
<meta name="AUDITOR" content="ERNST & YOUNG">
<meta name="AUDITREPORTDATE" content="20200910">
<meta name="AUDITORSINCE" content="2017">
<meta name="AUDITORCITY" content="NEW YORK">
<meta name="AUDITORSTATE" content="NEW YORK">
<meta name="EDGARLINK" content="https://www.sec.gov/Archives/edgar/data/1639825/000163982520000122/pton-20200630.htm">

There are two immediate implications of these changes. First, if you do a Summary or Context extraction – these values will be included in the results. The name value will be the column heading and the content value will be the row value. The second implication is that you can filter search results by the values of the content. Clearly you are not going to want to filter by the EDGARLINK – but the ability to filter by ENTITYFILERCATEGORY will help you more efficiently identify those subject to particular disclosure requirements.

To identify all of those that have multiple classes of stock we would just add the following to our search (ENTITYPUBLICSHARESLABEL_2 contains(*)). The Fields menu will list all of these fields so you don’t have to memorize the labels we have used.

We were asked to provide the EDGARLINK to allow you to map/match data you collect from our platform with data collected from other platforms that provide the accession number or a direct link to the filing. The EDGARLINK value can be parsed easily in Excel to give you the accession number.

Right now the constraint is AUDITOR – we have auditor data back to 2011 at the present time. We have been improving our collection strategy for this field and hope to accelerate the collection process in the coming months. The special challenge in collecting this value are those cases where the signature is an image file and then we want the location and audit report date as well. So even though many of you might be able to pull this out from AA but others can’t and we think this is a valuable field when controlling for disclosure.

We will not be able to initially add the AUDITORSINCE value for many filings with auditor changes prior to 2018 because that is going to require some separate effort to identify that data value. Procter & Gamble has been audited by Deloitte since 1890 – so we can trivially add that field to all of their 10-K filings. But we only have 533 CIK’s that have an AUDITORSINCE value prior to 1994. We have 2,457 that have had the same auditor since 2010.

My neighbors and some colleagues share an inside joke, I have stated many times that doing X is like making bread – it is a process rather than something that can happen perfectly the first time. As we move back into older filing windows there are many complications and challenges associated with identifying the metadata values (like making bread). Thus – while I expect the addition of the values to be relatively easy to manage moving forward I do anticipate some unexpected challenges as we attempt to add this data to historical filings. (Fortunately we have directEDGAR to support or work!)