10-K History – Data Filtering

Whenever I visit clients or respond to emails about data collection, I always try to make the point that it is critical to identify the sample based on strict criteria. Doing so minimizes the inevitable chase at the end for missing data and reduces the processing of the inevitable edge cases. No matter how clearly the disclosure requirements are set out in SEC regulations or the Accounting Standards Codification, some proportion of SEC filers will get 'creative' in the form of their disclosure. When they get creative, data collection becomes much more tedious because we have to identify the structure of their disclosure before we can sort out how to capture the data. If we can precisely identify the sample firms before we turn to collecting the data items, we reduce the effort spent chasing odd forms of the disclosure.

I am helping a client understand how to use our platform to collect a data item that is disclosed only in the 10-K. It is not a required disclosure in any other filing, and it is unlikely to be disclosed anywhere else (I ran some tests and could not find this item in the combined millions of filings that are searchable with directEDGAR). For this reason I encourage our users to review the regulations (either the SEC disclosure requirement as set out in the Code of Federal Regulations or the Accounting Standards Codification).

So our client is trying to collect a particular data item, and their sample was derived from some other financial data source. It may seem a reasonable presumption that if a company has data available from some other financial data source, then there should be a 10-K with this disclosure.

In this particular case there are three problems with the sample from our client. The first is that some of the sample firms have public data because they have public debt. While they file a 10-K, they might not include some data items in it because the disclosure requirements differ by the nature of the laws that establish their filing obligations (ABS issuers versus public debt only versus common stock). So while these companies file a 10-K, they will not have the particular disclosure our client is trying to collect. The second problem is that some sample firms may not have had a filing obligation at the time they showed up in the sample. The third problem is that some of the sample firms are foreign registrants whose filing obligations differ substantially; they have the option to file 20-Fs and 6-Ks rather than the expected 10-K/Qs and 8-Ks (as well as a myriad of other filing differences).

The most common way to determine whether a company has publicly traded equity is to look for evidence in one of the other data sources that would normally be used to source some of the data for research. I suggest that because there is not an easy way to determine from SEC filings whether a company has publicly traded common stock. In other words, there is not really an easy way using directEDGAR to establish whether a filer has publicly traded common stock. For example, I played around with some searches to identify those 10-K filers that are privately held and struggled, because this is not a mandated disclosure. One search I tried was to search all 10-K filings, within the first 800 words of the document, for registrant or issuer or company within 10 words of the phrase "privately held".
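For what it's worth, the proximity idea can be roughed out in a few lines outside the platform. This is only a sketch of the logic, not our query syntax, and the sample sentence is invented:

```python
import re

def near_privately_held(text, window_words=800, proximity=10):
    """Check whether 'registrant', 'issuer', or 'company' appears within
    `proximity` words of the phrase 'privately held', looking only at the
    first `window_words` words of the document."""
    words = re.findall(r"[A-Za-z']+", text.lower())[:window_words]
    anchors = {"registrant", "issuer", "company"}
    # Word positions where the two-word phrase 'privately held' starts
    phrase_hits = [i for i in range(len(words) - 1)
                   if words[i] == "privately" and words[i + 1] == "held"]
    anchor_hits = [i for i, w in enumerate(words) if w in anchors]
    return any(abs(a - p) <= proximity
               for p in phrase_hits for a in anchor_hits)

sample = "The registrant is a privately held corporation based in California."
print(near_privately_held(sample))  # True
```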


Some of the results (LEVI STRAUSS and CINEMARK USA) were exactly what I was looking for; those registrants are (or were) privately held. However, many of the results were not what I was looking for. Therefore, if I needed to collect data from companies that had public equity, the best way to define the sample would be to use another tool to determine whether they do have public equity.

The second and third issues that needed to be addressed are whether the company filed 10-Ks (since that is the filing that contains the data we are looking for) in the window needed for this study. We can use directEDGAR's 10-K Filing History archive to establish whether a company has filed 10-Ks and for what period. Our client had a list of approximately 13,000 CIK-YEAR observations representing 3,862 unique CIKs. I used their list of unique CIKs to create a request file to determine the 10-K availability for their sample. This file helped me in two ways. First, for some of their sample CIK-YEAR pairs, the date they were trying to collect data for fell after the last (or before the first) date of the 10-K filings. For example, they needed something from a 10-K filed by CIK 737644 after 1/1/2001. The problem is that this CIK filed their last 10-K in 1997 (I determined this by using the 10-K history file results).
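Once the history results come back, checking each CIK-YEAR observation against the first and last 10-K dates is straightforward. A minimal sketch, assuming a dict keyed by CIK (the 1997 last filing for CIK 737644 is from this post; the first-filing date and exact dates are illustrative):

```python
from datetime import date

def out_of_window(sample, history):
    """Flag (cik, year) observations with no 10-K filed in that year.

    sample:  list of (cik, year) pairs
    history: {cik: (first_10k_date, last_10k_date)}
    """
    flagged = []
    for cik, year in sample:
        first, last = history.get(cik, (None, None))
        if first is None or year < first.year or year > last.year:
            flagged.append((cik, year))
    return flagged

# CIK 737644 filed its last 10-K in 1997, so a 2001 observation is flagged.
history = {737644: (date(1985, 3, 29), date(1997, 3, 28))}  # first date illustrative
print(out_of_window([(737644, 2001), (737644, 1996)], history))  # [(737644, 2001)]
```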


They can use the result file to determine whether there is a 10-K filing within the time span for which they need to collect data. Even better, the process also creates a file called missing.csv (clever name) that lists the CIKs from the request file for which no 10-K has ever been filed. There were 477 CIKs from their original list of 3,862 that had never filed a 10-K.
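The missing.csv logic amounts to a set difference. A sketch (the CIKs are drawn from examples elsewhere in these posts; the variable names are mine):

```python
# CIKs the client requested vs. CIKs with at least one 10-K on record
requested = {3133, 737644, 1650107, 1467623}
has_10k_history = {3133, 737644}

missing = sorted(requested - has_10k_history)
print(missing)  # [1467623, 1650107]
```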

So while we could not use directEDGAR to establish whether any firms in the sample lacked publicly traded equity, we could use it to establish whether they filed any 10-Ks and for what period. The advantage of doing this work at the beginning is that we can more precisely define the data we should expect to collect.




Auditor Tenure Fast Collection

One of the projects we have been working on is enhancing our audit fee data. Frankly, the current presentation is lousy and not terribly useful. So we sat down, developed a plan, and identified some additional fields to collect as part of improving the value of this data.

One of the fields we need to add is tenure. For those of you not aware, a rule change promulgated by the PCAOB in 2017 requires the tenure (auditor since) to be disclosed, and it has normally been included in the 10-K or Exhibit 13. Since the disclosure is required, it is much more standardized than the instances of auditor tenure disclosed voluntarily in the proxy; the most common form is "have served as the Company's auditor since YYYY".

I wanted to collect that data myself so I could describe the process to our data team, and set out to do so using our Search, Extraction and Normalization Platform.

First, I needed to search for the phrase "auditor since":


As you can see, I found 4,979 instances of that phrase (my universe was all 10-K filings made since 1/1/2019). Next I needed to extract and normalize the phrase and convert any number after the phrase "auditor since" into the data value. I used the ContextNormalization feature as you can see in this next image:


The Extraction Pattern translated into English means: extract the context, and if a number is found following the phrase "auditor since", place it in the csv file in the column labeled audit_tenure.
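Outside the platform, the normalization step can be approximated with a short regular expression. This is a rough stand-in for the Extraction Pattern, not the platform's actual syntax:

```python
import re

# Capture a four-digit year following the phrase "auditor since"
PATTERN = re.compile(r"auditor\s+since\s+(\d{4})", re.IGNORECASE)

def audit_tenure(context):
    match = PATTERN.search(context)
    return match.group(1) if match else None

print(audit_tenure("have served as the Company's auditor since 1987."))  # 1987
print(audit_tenure("has been our auditor since the Company's inception"))  # None
```

Contexts like the second example are exactly the exceptions that have to be keyed by hand.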

So I invested a total of maybe 5 minutes. Let's look at the results:


The context is available for review, and the application normalized it to extract the year value for us to add to our new audit fee data.

There were some exceptions that had to be handled manually (110/4,979). Let's look at those:


As you can see, these registrants deviated from the standard disclosure, so I had to review the context and just key in the year value. I am very comfortable working with our tools and in this context, so it only took me about 20 minutes to review and key those missing values.

In total I spent roughly 45 minutes to capture this data value. I spent about that much time trying to sort out how to describe the process in this blog post. (When we upload the new audit fee data presentation the tenure field will actually be labeled SINCE.)



Version Release Coming!

2018 has been a busy year. If you look back you will see that we added new data tables (Insider Trading, 10-K Filing History, and Form D). In May we released 4.0.4, which was the foundation for adding those tables because it significantly sped up the delivery of the artifacts we process. We also started adding the AGE and SINCE variables to director compensation and have been reworking our beneficial ownership tables to better deliver the data when a filer has multiple classes of stock. We were also able to move our extraction of the Effective Tax Rate Reconciliation table to near real-time delivery rather than batch updates.

But it doesn't stop. When we deliver the filings and indexes for the last 2018 filings we will be including a new version of our application. There are some important improvements coming with this version.

We added a ZOOM box for you to use to build your search phrases. Our search engine can parse really large search phrases (think more than 32,000 characters). While you can build the search phrase in Notepad or a similar application, we decided to add a bigger box inside the application.


We added a feature that allows you to identify/specify tables using specific words or values. A key feature of our platform has always been that you can extract tables from the search results and then manage the data extraction from those tables. One constraint imposed by that strategy is that all the tables across all the documents had to share some consistent phrase, term, or value. There are, though, cases where registrants report the data in a unique fashion, making it difficult to actually access the tables. So we developed a process that allows you to review a set of search results and then specify, for each document, a unique value the application will use to identify and extract the relevant table(s).


We set up the foundation to quietly handle corporate actions that lead to a change in CIK but not in the entity. This one is a little complicated to explain, so bear with me. If you access most of the leading sources of financial data and collect data for Alphabet, the time series of Alphabet will extend back to the first 10-K filing made by Google in 2005. However, if you go to EDGAR and use the CIK returned by that financial data service, the first 10-K filing made by Alphabet was filed in 2016. Because of a reorganization and a merger, Alphabet became the successor issuer to Google, but they have different CIKs. With our new update, when you submit a CIK for one of those companies (either Google's 1288776 or Alphabet's 1652044) our application will quietly add the complementary CIK when you select the new option Include Historical CIKs. This feature has been added to every menu item that allows you to specify the CIK.
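Conceptually, the mapping behaves like a two-way lookup. A minimal sketch (the Google/Alphabet CIK pair is from this post; the function and structure are illustrative, not our implementation):

```python
# Successor/predecessor CIK pairs (Google <-> Alphabet)
CIK_MAP = {
    1288776: 1652044,  # Google   -> Alphabet (successor issuer)
    1652044: 1288776,  # Alphabet -> Google   (predecessor)
}

def expand_ciks(cik, include_historical=True):
    """Return the requested CIK plus any mapped historical CIK."""
    ciks = {cik}
    if include_historical and cik in CIK_MAP:
        ciks.add(CIK_MAP[cik])
    return sorted(ciks)

print(expand_ciks(1652044))  # [1288776, 1652044]
```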


When you first install it, the application will add a special file that maps the CIKs. However, because corporate actions that lead to this phenomenon continue, we have added a control in the Options menu that allows you to update the mappings at your convenience.


Another important change improves the overall usability of the application by adding keyboard shortcuts for every single menu item. In earlier versions of our software we only had keyboard shortcuts for the most used features; now every menu item can be accessed through the keyboard (without using the mouse). For example, to access the Search control you only need to press Alt+Z then Tab, and you are ready to key in your search phrase.

We also fixed some minor bugs and tweaked the licensing validation process a bit. In earlier versions, if our delivery server was occupied when you started the application, the application would wait until the server was free before completely starting. That hesitation should be gone now.



New Data Type 10K_HISTORY Coming Soon

I was doing a demo with a prospective client last week and had a typical experience. I submitted a request file, supplied by the client, for director compensation data for 2008. They had identified a sample of registrants and wanted to see how to access data related to their sample. I used the file while they were watching and pulled the data; unfortunately, the missing-cik-year report listed 67 CIKs for which no data was available for 2008. I am glad we provide this summary so our users don't have to muck around to identify which of their sample is missing. Of course, the next question is why are they missing?

Usually I will take them on a tour of EDGAR for a few of the listed CIKs and show that there are no filings for the time period. The first CIK in the list of missing values was 3133 (AMSOUTH BANCORPORATION). The reason director compensation for AMSOUTH for 2008 is not available is readily apparent from this image:


Our potential client observed that they understood but would like a more concrete way to establish whether or not data should be available.  I absolutely understood that and this issue has been bothering me for a while.

One alternative we considered was to find all of the delisting notices (Form 15-12B) and create some summary of data from those filings. Unfortunately, too many registrants do not actually file a 15-12B. There is another problem as well: sometimes data is missing because the registrant has not yet registered, or even if they have registered, they may not yet be obligated to file the reports that contain some of the data objects our clients are trying to collect.

The solution we have settled on for the time being is to create a summary file that lists, for every CIK that has ever filed any form of the 10-K, the date of their first 10-K filing and the date of their most recent 10-K filing. We have done so and uploaded this data to our distribution server. We are calling this data type 10K_HISTORY, and it should be visible in our Extraction\Preprocessed data window before 12/1.

Because this is a snapshot at a point in time, we are setting this data up with an RDATE of 20180101. This means your request file will need the value 2018 in the YEAR column for every CIK you want to check.
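Building such a request file is a few lines of scripting. A sketch assuming a simple two-column CSV (the CIKs are from examples in this post; check the application for the exact headers it expects):

```python
import csv

# Pair every CIK with YEAR 2018, matching the 20180101 RDATE snapshot
ciks = [3133, 737644, 1650107]

with open("request.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["CIK", "YEAR"])
    for cik in ciks:
        writer.writerow([cik, 2018])
```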

When you submit the request file we will return a results file that includes the following headings:


Note: the references to FIRST_FILING and RECENT_FILING refer specifically to any form of a 10-K filing (10-KSB, 10-K405, etc.), so they are not necessarily the first filing the registrant made on EDGAR. The balance sheet date values are the balance sheet dates the filings cover.

We hope this makes it easier to understand why you might be missing data. Rather than having to inspect EDGAR for relevant dates, you can use your missing report to construct a request file to check for missing values. Here is a screenshot taken after testing the process with the results I alluded to at the beginning of this post:


We hope this makes data validation more efficient and less painful.  Since each CIK has only one row of data this should be quick data to access and act on.

Of course, there are always catches. I had 67 observations missing data for 2008. I submitted all 67 CIKs, and the results included another missing report for 5 CIKs. It turns out these 5 CIKs have never filed any form of 10-K. For example, one of the CIKs belonged to COCA-COLA EUROPEAN PARTNERS PLC (1650107); they file 20-F and 6-K forms. Another belonged to DROPBOX (1467623); they just went public in 2018 and have yet to file a 10-K.

Institutional Trading – the “Whales”

We are often asked to provide access to the 13F-HR reports filed by institutional managers. While we have provided these filings optimized for our platform, ultimately what users want is the data organized in a way that is meaningful. This is something I personally have wanted to do for a long time. However, the problem has always been matching the name of the issuer to some more useful identifier. The actual 13F filings list the issuer name and their CUSIP (Committee on Uniform Security Identification Procedures) assigned identifier. We needed a way to map the CUSIP back to the Central Index Key (CIK) assigned by the SEC. Early in our life I approached CUSIP Global Services (a division of S&P) about licensing the CUSIP data so we could link filings to CUSIPs to CIKs. The cost was prohibitive, so we were stuck there.

Recently we discovered another way to map the CUSIP to CIK and after extensive testing are confident in the mappings we generate.  Because of this we have started processing the 13F-HR reports.

What we are doing is aggregating all of the data by report quarter and issuer. The SEC requires institutional managers with more than $100 million in securities under management to disclose their holdings on the 13F within 45 calendar days after the end of the quarter. We parse all filings made within each window and then combine the data for each individual issuer from all of the filings into one single report. So a report for, say, Conagra for the 4th quarter of 2018 will be available soon after February 15, 2019 (the deadline for the report). Of course, we expect to periodically update these summary reports when amendments are filed by the managers.
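The aggregation itself is a group-and-sum across all managers' filings for the quarter. A sketch with invented manager names and share counts (the field names are illustrative, not our actual schema):

```python
from collections import defaultdict

# One record per issuer per 13F-HR filing in the quarter (values invented)
filings = [
    {"manager": "Manager A", "issuer": "CONAGRA BRANDS", "shares": 40_000_000},
    {"manager": "Manager B", "issuer": "CONAGRA BRANDS", "shares": 34_000_000},
    {"manager": "Manager B", "issuer": "OTHER ISSUER",   "shares": 5_000_000},
]

# Combine every manager's position into one total per issuer
by_issuer = defaultdict(int)
for row in filings:
    by_issuer[row["issuer"]] += row["shares"]

print(by_issuer["CONAGRA BRANDS"])  # 74000000
```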

Our platform already provides access to the beneficial ownership data as reported in the DEF 14A/10-K. The institutional ownership data complements the beneficial ownership table because the beneficial ownership table only contains details about owners of more than 10% of the equity along with the ownership of directors and officers. The 10% beneficial owners reported in the proxy are a subset of the institutional owners reporting on the 13F. For example, CAG's DEF 14A reports only two beneficial owners (other than management), BlackRock and Vanguard, with a total of 74 million shares. Our analysis of the 13F filings shows total institutional holdings of approximately 305 million shares for the 6/30 quarter. That is almost a four-fold increase, and it represents more than 75% of the approximately 390 million shares outstanding as of the end of June 2018.

By combining the data from our beneficial ownership tables for directors and officers with the institutional ownership reports, our users will have better measures of these important characteristics of the distribution of equity.

I think we should have a pilot of the institutional ownership data available before the middle of October. You will know the data is available by starting the application and using the Extraction\Preprocessed feature. When the pilot data is available there will be a new entry, 13F_PILOT, in the Data Tables listing.

If you would like to dive into one of the files while we are completing our final testing please send me an email and I will make one available (for our clients only).


Pilot Test of Ownership Data Over – The Real ‘Stuff’ Now Loading

On August 3rd we loaded a test run of ownership data. Our initial focus was roughly the S&P 500 for the period from 2013 to 2017. We received some feedback on the initial data load and made some changes to the way the columns are displayed and ordered. We also added some additional fields (including the ACCESSION-NUMBER of the filings) so that you can more easily explore the original source file if you have questions about the data.

Friday we took down the initial data and started loading replacement data with new columns and more CIKs. We have approximately 8,000 CIKs and data going back to 2008 in the loading process that is running right now. This is massive (it represents normalized data from more than 1,500,000 ownership filings). Again, the data is organized by ISSUER-YEAR. The data in each ISSUER-YEAR file is sorted by RPTOWNERCIK and FILING DATE.

We have preserved all of the footnotes and their association with individual data items. Each row represents either a NONDERIVATIVE TRANSACTION, NONDERIVATIVE HOLDING, DERIVATIVE TRANSACTION, DERIVATIVE HOLDING, or a REMARK. The type of each row is indicated by the value reported in the datatype column.

The full load may take the balance of this week.  We will then identify any missing CIKs and fill in the data back to 2003.

Form 3, 4 & 5 Filing Data on Line!

One of the key drivers of our new architecture was to let us more easily expand the range of data values we deliver through the Search, Extraction & Normalization Engine. We are experiencing that benefit right now: we just started delivering normalized data from Form 3, 4 & 5 filings.

The SEC lays out the obligations of officers, directors, and other Section 16 persons to report their holdings and transactions in their company's securities, as well as derivative instruments where some security of the company is the underlying value determinant. The reporting mechanisms used today are Form 3 (initial report of holdings), Form 4 (transactions and other events that affect holdings), and Form 5 (annual statement of holdings). In about May 2003 these forms became machine readable when the SEC required them to be filed in both HTML and XML formats. As an aside, prior to the introduction of the XML format these forms were only available through EDGAR from a limited number of companies; most filers chose to file a paper copy.

So this morning we released a pilot test of normalized data from these forms.  This has been a big undertaking (today there are more than five million ownership forms available through EDGAR).  Our initial pilot is focused on the S&P 500 for the period from 2013 to 2017.

One of the big struggles with this data was deciding how to organize it for you to call. What we decided was to prepare the data by COMPANY-CIK/YEAR. So if you submit a request for Abbott Labs (CIK 1800) for 2017, we will deliver back the data extracted from every Form 3, 4, and 5 filed during the period from 1/1/2017 through 12/31/2017 by any person who filed one of those forms.

Each line of the results represents one reporting event for one person. So if a reporting person describes 3 non-derivative and 2 derivative transactions in one filing, the result file will have 5 lines; each line reports all of the values included in the form.
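In structural terms, the flattening works roughly like this (the field names and values are illustrative, not our actual column headings):

```python
# A filing with 3 non-derivative and 2 derivative transactions -> 5 rows
form = {
    "rptownercik": 1234567,  # illustrative reporting-person CIK
    "nonderivative": [{"shares": 100}, {"shares": 200}, {"shares": 300}],
    "derivative":    [{"shares": 50}, {"shares": 75}],
}

rows = []
for kind in ("nonderivative", "derivative"):
    for txn in form[kind]:
        row = {"rptownercik": form["rptownercik"],
               "datatype": kind.upper() + " TRANSACTION"}
        row.update(txn)          # form-level values repeat on every line
        rows.append(row)

print(len(rows))  # 5
```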

To access this data, create a request file (three columns: CIK, YEAR, and PF) and from the Extraction menu select Extraction\Preprocessed. Once the request file has been validated and you select the Read Input button, the Pull section will populate with the latest list of available tables. The parsed ownership data is delivered when you select SECTION_16_ANNUAL_SUMMARY.


There can be as many as 150 unique column headings in the result file, depending on the number of footnotes included for the transactions. This is critical: we attach any relevant footnote to the transaction the footnote elaborates on.

I will be honest: our labeling for the footnotes is a bit tedious, but we think it necessary to provide clarity as to which part of the form a footnote should be considered with. For a little hint at these complications, consider this image from a partial Form 4 filed by John Klinck, an EVP of State Street (link to full form).


There are two footnotes that explain/elaborate on the value reported for the Amount of Shares Beneficially Owned Following Reported Transaction. Note that this entry describes a non-derivative transaction. The footnotes are indicated in the following manner: the text associated with the footnote carrying the (2) indicator is in a column labeled transactionshares.footnote, and the footnote indicated by the (3) value is labeled transactionshares.footnote_1. In short, we add an index value to every footnote after the first one associated with a particular data entry. If an entry in a form has 4 footnotes, the last one listed will be indexed with _3. Footnotes are associated with a specific data entry, keyed to that data value, and reported in the row where that data value is reported.
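The naming rule can be expressed in a few lines. A sketch (the helper function is mine; the column-name convention is the one described above, and the footnote texts are placeholders):

```python
def footnote_columns(field, notes):
    """Map footnote texts to indexed column names: <field>.footnote,
    <field>.footnote_1, <field>.footnote_2, ..."""
    cols = {}
    for i, text in enumerate(notes):
        suffix = ".footnote" if i == 0 else f".footnote_{i}"
        cols[field + suffix] = text
    return cols

cols = footnote_columns("transactionshares",
                        ["Footnote (2) text.", "Footnote (3) text."])
print(sorted(cols))  # ['transactionshares.footnote', 'transactionshares.footnote_1']
```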

These forms allow the respondent to include REMARKS. A remark applies to the entire form, so what we ended up doing is including remarks in a separate row. Initially we thought to include them next to each transaction but decided they might be more useful if they could be easily isolated. We include a column that describes the nature of the content for each row (datatype).


This allows you to very quickly isolate and review any particular type of data.  All of the identifying information for each person associated with each form is included in each row.  There are indicators for the relationship between the reporting person and the issuer (isdirector/isofficer/isother).  The reporting person’s CIK is included so you can match back to our compensation data.
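For example, filtering on the datatype column isolates any one row type. The rows here are invented; the datatype values are the ones listed above:

```python
rows = [
    {"datatype": "NONDERIVATIVE TRANSACTION", "shares": 100},
    {"datatype": "REMARK", "remark": "Illustrative remark text."},
    {"datatype": "DERIVATIVE HOLDING", "shares": 5000},
]

remarks = [r for r in rows if r["datatype"] == "REMARK"]
print(len(remarks))  # 1
```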

I am really excited about this update.  If you work with this data and have observations that would help us improve the utility please send me an email.  (burch [yada] directedgar.com).