Auditor Tenure Fast Collection

One of the projects we have been working on is enhancing our audit fee data.  Frankly the current presentation is lousy and not terribly useful.  So we sat down and developed a plan and identified some additional fields we needed to collect to include in the audit fee data as part of improving the value of this data.

One of the fields we need to add is auditor tenure.  For those of you not aware, a rule change promulgated by the PCAOB in 2017 requires the tenure (auditor since) to be disclosed, and the disclosure has normally been included in the 10-K or Exhibit 13.  Since the disclosure is required it is much more standardized than the instances of auditor tenure disclosed voluntarily in the Proxy; the most common form is “have served as the Company’s auditor since YYYY”.

I wanted to collect that data myself so I could describe the process to our data team, and I set out to do so using our Search, Extraction and Normalization Platform.

First, I needed to search for the phrase auditor since.


As you can see I found 4,979 instances of that phrase (my universe was all 10-K filings that have been made since 1/1/2019).  Next I needed to extract and normalize the phrase and convert any number after the phrase “auditor since” into the data value.  I used the ContextNormalization feature as you can see in this next image:


The Extraction Pattern translated into English means – extract the context and if a number is found following the phrase auditor since place it in the csv file in the column labeled audit_tenure.
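For readers who think in code, the same idea can be sketched in a few lines of Python.  This is only an illustration of the logic described above, not the platform's actual Extraction Pattern; the regular expression and the function name `extract_audit_tenure` are my own assumptions.

```python
import re

# Hypothetical pattern: the phrase "auditor since" followed, within a few
# non-digit characters, by a four-digit year. This is NOT the platform's
# Extraction Pattern, only a stand-in for the same idea.
AUDITOR_SINCE = re.compile(
    r"auditor\s+since\D{0,20}?((?:19|20)\d{2})", re.IGNORECASE
)


def extract_audit_tenure(context):
    """Return the year that follows 'auditor since' as an int, or None."""
    match = AUDITOR_SINCE.search(context)
    return int(match.group(1)) if match else None


# Made-up context strings: one standard disclosure and one that deviates.
contexts = [
    "We have served as the Company's auditor since 1998.",
    "We have served as the auditor since the Company's initial public offering.",
]
for text in contexts:
    # A None result corresponds to a row that needs manual review.
    print(extract_audit_tenure(text), "|", text)
```

Contexts where no year follows the phrase come back as `None`, which is the kind of row that ends up in the manual-review pile.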

So I invested a total of maybe 5 minutes.  Let's look at the results:


The context is available for review and the application normalized the context to extract the year value for our use to add to our new audit fee data.

There are some exceptions that had to be manually handled (110/4979).  Let's look at those:


As you can see – these registrants deviated from the standard disclosure and so I had to review the context and just key in the year value.  I am very comfortable working with our tools and in this context so it only took me about 20 minutes to review and key those missing values.

In total I spent roughly 45 minutes to capture this data value.  I spent about that much time trying to sort out how to describe the process in this blog post.    (When we upload the new audit fee data presentation the tenure field will actually be labeled as SINCE).



Version Release Coming!

2018 has been a busy year.  If you look back you will see that we added new data tables (Insider Trading, 10-K Filing History and Form D data).  In May we released 4.0.4, which was the foundation for allowing us to add the additional data tables as it significantly sped up the delivery of the artifacts we process.  We also started adding the AGE and SINCE variables to director compensation and have been reworking our beneficial ownership tables to better deliver the data when a filer has multiple classes of stock.  We were able to move our extraction of the Effective Tax Rate Reconciliation table to a near real-time delivery rather than batch updates.

But it doesn’t stop.  When we deliver the filings and indexes for the last 2018 filings we will be including a new version of our application.  There are some important improvements coming with this version.

We added a ZOOM box for you to use to build your search phrases.  Our search engine can parse really large (think more than 32,000 character) search phrases.  While you can build the search phrase in Notepad or a similar application we decided to add a bigger box to use in the application.


We added a feature to allow you to identify/specify tables using specific words or values.  A key feature of our platform has always been that you can extract tables from the search results and then manage the data extraction from those tables.  One constraint that was imposed by that strategy is that all the tables across all the documents had to have some consistent phrase/term or value.  There are though cases where the registrants report the data in a unique fashion and so it was difficult to actually access tables.  So we developed a process that allows you to review a set of search results and then specify for each document a specific unique value the application will then use to identify and extract the relevant table(s).


We set up the foundation to quietly handle corporate actions that lead to a change in CIK but not the entity.  This one is a little complicated to explain so bear with me.  If you access most of the leading data sources for financial data and collect data for Alphabet, the time series for Alphabet will extend back to the first 10-K filing made by Google in 2005.  However, if you go to EDGAR and use the CIK returned by the financial data service, the first 10-K filing you will find for Alphabet was filed in 2016.  Because of a reorganization and a merger Alphabet became the successor issuer to Google but they have different CIKs.  With our new update, when you submit a CIK for one of those companies (either Google’s 1288776 or Alphabet’s 1652044) and select the new option Include Historical CIKs, our application will quietly add the complementary CIK.  This feature has been added to every menu item that allows you to specify the CIK.
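To illustrate what adding the complementary CIK means, here is a minimal Python sketch.  The grouping structure and the function name `include_historical_ciks` are hypothetical and say nothing about how the application stores its mapping file; only the Google/Alphabet pair comes from the example above.

```python
# Each set groups CIKs that belong to the same underlying entity.
# Only the Google/Alphabet pair is taken from the post; the data
# structure itself is an assumption made for illustration.
SUCCESSOR_GROUPS = [
    {1288776, 1652044},  # Google and its successor issuer Alphabet
]


def include_historical_ciks(ciks):
    """Return the requested CIKs plus any complementary CIKs, sorted."""
    expanded = set(ciks)
    for group in SUCCESSOR_GROUPS:
        if expanded & group:      # any member of the group was requested
            expanded |= group     # so pull in the rest of the group
    return sorted(expanded)


print(include_historical_ciks([1652044]))  # Alphabet alone also brings in Google
print(include_historical_ciks([1800]))     # a CIK with no mapping is unchanged
```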


When you first install, the application will add a special file that maps the CIKs.  However, because corporate actions that lead to this phenomenon continue, we have added a control in the Options menu that allows you to update the mappings at your convenience.


Another important change we added was to improve the usability of the application overall by adding keyboard shortcuts for every single menu item.  In the earlier versions of our software we only had keyboard shortcuts for the most used features – now every menu item can be accessed through the keyboard (without use of the mouse).  For example – to access the Search control it is only necessary to press Alt+Z + Tab and you are ready to key in your search phrase.

We also fixed some minor bugs and tweaked the licensing validation process a bit.  In earlier versions if our delivery server was occupied and you started the application the application would wait until the server was free before completely starting.  That hesitation should be gone now.



New Data Type 10K_HISTORY Coming Soon

I was doing a demo with a prospective client last week and had a typical experience – I submitted a request file for director compensation data for 2008 that the client supplied.  They had identified a sample of registrants and wanted to see how to access data related to their sample.  I used the file while they were watching and pulled the data – unfortunately the missing-cik-year report listed 67 CIKs for which no data was available for 2008.  I am glad we provide this summary so our users don’t have to muck around and identify which of their sample is missing.  Of course the next question is why are they missing?

Usually I will take them on a tour of EDGAR for a few of the listed CIKs and show that there are no filings for the time period.  The first listed CIK in the list of missing values was 3133 (AMSOUTH BANCORPORATION).  The reason why director compensation for AMSOUTH for 2008 is not available is readily apparent from this image:


Our potential client observed that they understood but would like a more concrete way to establish whether or not data should be available.  I absolutely understood that and this issue has been bothering me for a while.

One alternative we considered was to try to find all of the delisting notices (15-12B) and create some summary of data from those filings.  Unfortunately – too many registrants do not actually file a 15-12B.  Further – there is another problem – sometimes data is missing because the registrant has not yet registered or even if they have registered they may not yet be obligated to file the reports that contain some of the data objects our clients are trying to collect.

The solution that we have settled on for the time being is to create a summary file that lists, for every CIK that has ever filed any form of the 10-K, the date of their first 10-K filing and the date of their most recent 10-K filing.  We have done so and uploaded this data to our distribution server.  We are calling this data type 10-K_HISTORY, and it should be visible in our ExtractionPreprocessed data window before 12/1.
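Conceptually the summary is just a minimum and a maximum filing date per CIK.  The sketch below shows that reduction in Python; the function name, the input shape and the sample rows are made up for illustration and are not our production code.

```python
from datetime import date


def build_10k_history(filings):
    """Reduce (cik, filing_date) pairs to {cik: (first_filing, recent_filing)}.

    `filings` is assumed to already be limited to 10-K form types
    (10-K, 10-KSB, 10-K405 and so on).
    """
    history = {}
    for cik, filed in filings:
        first, recent = history.get(cik, (filed, filed))
        history[cik] = (min(first, filed), max(recent, filed))
    return history


# Made-up rows for two fictional CIKs, in no particular order.
sample = [
    (1, date(2009, 3, 2)),
    (2, date(2015, 2, 20)),
    (1, date(2004, 3, 10)),
    (1, date(2017, 2, 27)),
]
for cik, (first, recent) in sorted(build_10k_history(sample).items()):
    print(cik, first.isoformat(), recent.isoformat())
```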

Because this is a snapshot at a point in time we are setting this data up with an RDATE of 20180101.  This means that your request file will have to have the value of 2018 in the YEAR column for every CIK you want to check.
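If your missing report is already a list of CIKs, turning it into a request file can be scripted.  The following Python sketch writes a CIK/YEAR file with 2018 on every row.  The exact layout is an assumption: it shows only the two columns discussed here, and your request files for other data types may need additional columns.

```python
import csv
import io


def build_history_request(missing_ciks, year=2018):
    """Return CSV text with one CIK/YEAR row per distinct CIK."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["CIK", "YEAR"])          # assumed column headings
    for cik in sorted(set(missing_ciks)):     # drop duplicates, stable order
        writer.writerow([cik, year])
    return buffer.getvalue()


# 3133 (AMSOUTH) and 1650107 appear in this post; the duplicate is deliberate.
print(build_history_request([3133, 1650107, 3133]))
```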

When you submit the request file we will return a results file that includes the following headings:


Note: the references to FIRST_FILING and RECENT_FILING are specifically references to any form of a 10-K filing (10KSB, 10K405 etc).  So they are not really the first filing the registrant made on EDGAR.  The balance sheet date values are the balance sheet date that the filing covers.

We hope this makes it easier to understand why you might be missing data.  So rather than having to inspect EDGAR for relevant dates you can use your missing report to construct a request file to then check for missing values.  Here is a screen shot after testing the process with the results I alluded to at the beginning of this post:


We hope this makes data validation more efficient and less painful.  Since each CIK has only one row of data this should be quick data to access and act on.

Of course there are always catches.  I had 67 observations that were missing data for 2008.  I submitted all 67 CIKs and the results included another missing report for 5 CIKs.  Well it turns out that these 5 CIKs have never filed any form of 10-K.  For example, one of the CIKs belonged to COCA-COLA EUROPEAN PARTNERS PLC (1650107).  They file 20-F and 6-K forms.  Another belonged to DROPBOX (1467623) – they just went public in 2018 and have yet to file a 10-K.

Institutional Trading – the “Whales”

We often are asked to provide access to the 13F-HR reports filed by institutional managers.  While we have provided these filings optimized for our platform, ultimately what users want is the data organized in a way that is meaningful.  This is something I personally have wanted to do for a long time.  However, the problem has always been matching the name of the issuer to some more useful identifier.  The actual 13F filings list the issuer name and their CUSIP (Committee on Uniform Security Identification Procedures) assigned identifier.  We needed a way to map the CUSIP back to the Central Index Key (CIK) assigned by the SEC.  Early in our life I approached CUSIP Global Services (a division of S&P) about licensing the CUSIP so we could link filings to CUSIPs to CIKs.  The cost was prohibitive so we have been stuck there.

Recently we discovered another way to map the CUSIP to CIK and after extensive testing are confident in the mappings we generate.  Because of this we have started processing the 13F-HR reports.

What we are doing is aggregating all of the data by report quarter and issuer.  The SEC requires institutional investors with more than $100 million in securities under management to disclose their holdings in the 13F within 45 calendar days after the end of the quarter.  We are parsing all filings made within each window and then combining the data for each individual issuer from all of the filings into one single report.  So a report for say Conagra for the 4th quarter of 2018 will be available soon after February 15, 2019 (the deadline for the report).  Of course we expect to have to periodically update these summary reports when amendments are filed by the managers.
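The aggregation step can be pictured as a group-by on report quarter and issuer.  Here is a rough Python sketch of that idea; the field names (`report_quarter`, `issuer_cik`, `manager_cik`, `shares`) and the sample rows are hypothetical, and handling of amendments is deliberately left out.

```python
from collections import defaultdict


def aggregate_holdings(rows):
    """Combine per-manager 13F rows into one summary per (quarter, issuer).

    Returns {(report_quarter, issuer_cik): {"shares": int, "managers": int}}.
    """
    totals = defaultdict(lambda: {"shares": 0, "managers": set()})
    for row in rows:
        key = (row["report_quarter"], row["issuer_cik"])
        totals[key]["shares"] += row["shares"]
        totals[key]["managers"].add(row["manager_cik"])
    return {
        key: {"shares": value["shares"], "managers": len(value["managers"])}
        for key, value in totals.items()
    }


# Made-up holdings: two managers reporting on the same fictional issuer.
rows = [
    {"report_quarter": "2018Q4", "issuer_cik": 111, "manager_cik": 9001, "shares": 500},
    {"report_quarter": "2018Q4", "issuer_cik": 111, "manager_cik": 9002, "shares": 250},
    {"report_quarter": "2018Q4", "issuer_cik": 222, "manager_cik": 9001, "shares": 100},
]
for key, summary in sorted(aggregate_holdings(rows).items()):
    print(key, summary)
```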

Our platform already provides access to the beneficial ownership data as reported in the DEF 14A/10-K.  The institutional ownership data complements the beneficial ownership table because the beneficial ownership table only contains details about owners of more than 10% of the equity and the ownership of directors and officers.  The beneficial owners of 10% are a subset of the institutional ownership reported in the proxy.  For example, CAG’s DEF 14A reports only two beneficial owners (other than management), Blackrock and Vanguard, with a total of 74 million shares.  Our analysis of the 13F filings shows total holdings by institutions to be approximately 305 million shares for the 6/30 quarter.  This is roughly four times the 74 million shares reported in the proxy, and it also represents more than 75% of the approximately 390 million shares outstanding as of the end of June 2018.

By combining the data from our beneficial ownership tables for directors and officers with the institutional ownership reports we are starting to generate, our users will have better measures of these important characteristics of the distribution of equity.

I think we should have a pilot of the institutional ownership data available before the middle of October.  You will know this data is available by starting the application and using the Extraction\Preprocessed feature.  When the pilot data is available there will be a new entry 13F_PILOT in the Data Tables listing.

If you would like to dive into one of the files while we are completing our final testing please send me an email and I will make one available (for our clients only).


Pilot Test of Ownership Data Over – The Real ‘Stuff’ Now Loading

On August 3rd we loaded a test run of ownership data.  Our initial focus was roughly the S&P 500 for the period from 2013 to 2017.  We had some feedback on the initial data load and made some changes to the way the columns are displayed and ordered.  We also added some additional fields (including the ACCESSION-NUMBER of the filings) so that you can more easily explore the original source file if you have questions about the data.

Friday we took down the initial data and started loading replacement data with new columns and more CIKs.  We have approximately 8,000 CIKs and data going back to 2008 in the loading process that is running right now.  This is massive (it represents normalized data from more than 1,500,000 ownership filings).  Again, the data is organized by ISSUER – YEAR.  The data in each ISSUER – YEAR file is sorted by RPTOWNERCIK – FILING DATE.

We have preserved all of the footnotes and their association with individual data items.  Each row represents either a NONDERIVATIVE TRANSACTION, NONDERIVATIVE HOLDING, DERIVATIVE TRANSACTION, DERIVATIVE HOLDING or a REMARK.  The datatype for the row is indicated by the value reported in the datatype column.

The full load may take the balance of this week.  We will then identify any missing CIKs and fill in the data back to 2003.

Form 3, 4 & 5 Filing Data on Line!

One of the key drivers of our new architecture was to allow us to more easily expand the range of data values we deliver through the Search, Extraction & Normalization Engine.  We are experiencing that benefit right now – we just started delivering  normalized data from Form 3, 4 & 5 filings.

The SEC lays out the obligations of officers, directors and other Section 16 persons to report their holdings and transactions in their company’s securities and also derivative instruments where some security of the company is the underlying value determinant.  The reporting mechanism that is used today is Form 3 (initial report of holding), Form 4 (transactions and other events that affect holdings) and Form 5 (annual statement of holdings).  In about May 2003 these forms were made machine readable when the SEC required them to be filed using both an HTML and XML format.  As an aside, prior to the introduction of the XML form these forms were only available through EDGAR from a limited number of companies; most filers chose to file a paper copy.

So this morning we released a pilot test of normalized data from these forms.  This has been a big undertaking (today there are more than five million ownership forms available through EDGAR).  Our initial pilot is focused on the S&P 500 for the period from 2013 to 2017.

One of the big struggles about this data was deciding how to organize it for you to call.  What we decided to do was prepare the data by COMPANY-CIK /YEAR.  So if you submit a request for Abbott Labs (CIK 1800) for 2017 we will deliver back the data extracted from every Form 3/4 and 5 filed by any person who filed one of those forms during the period from 1/1/2017 through 12/31/2017.

Each line of the results represents one reporting event for one person.  So if a reporting person describes 3 non-derivative and 2 derivative transactions in one filing the result file will have 5 lines – each line reports all of the values included in the form.
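The one-line-per-reporting-event layout can be sketched as follows.  The field names other than `datatype` and `rptownercik`, and all of the sample values, are hypothetical; the point is only that the filing-level identifiers repeat on every row.

```python
def filing_to_rows(filing):
    """Emit one row per reported entry, repeating the filing identifiers."""
    shared = {k: filing[k] for k in ("issuer_cik", "rptownercik", "filing_date")}
    rows = []
    for datatype, entries in filing["entries"].items():
        for entry in entries:
            rows.append({**shared, "datatype": datatype, **entry})
    return rows


# Made-up filing: 3 non-derivative and 2 derivative transactions.
filing = {
    "issuer_cik": 1800,
    "rptownercik": 9999999,       # fictional reporting person
    "filing_date": "2017-03-01",  # fictional date
    "entries": {
        "NONDERIVATIVE TRANSACTION": [{"shares": 100}, {"shares": 200}, {"shares": 300}],
        "DERIVATIVE TRANSACTION": [{"shares": 10}, {"shares": 20}],
    },
}
for row in filing_to_rows(filing):
    print(row)
```

With the sample filing above the function yields five rows, matching the 3 + 2 example in the text.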

To access this data create a request file (three columns: CIK, YEAR and PF) and from the Extraction menu select ExtractionPreprocessed.  Once the request file has been validated and you select the Read Input button the Pull section will populate with the latest list of available tables.  The parsed ownership data is delivered when you select SECTION_16_ANNUAL_SUMMARY.


There can be as many as 150 unique column headings in the result file, depending on the number of footnotes that are included for the transactions.  It is critical that we attach any relevant footnote to the transaction the footnote elaborates on.

I will be honest – our labeling for the footnotes is a bit tedious – but we think necessary to provide clarity as to what part of the form the footnote should be considered with.  For a little hint at these complications consider this image from a partial Form 4 filed by John Klinck an EVP for State Street (link to full form).


There are two footnotes to explain/elaborate on the value reported for the Amount of Shares Beneficially Owned Following Reported Transaction.  Note that this entry describes a non-derivative transaction.  These footnotes are indicated in the following manner – the text that is associated with the footnote with the (2) indicator is in a column labeled transactionshares.footnote.  The footnote indicated by the (3) value is labeled transactionshares.footnote_1.  In short we are adding an index value to all footnotes associated with a data entry after the first footnote for that particular data entry.  If an entry in a form has 4 footnotes then the last one listed will be indexed with a _3.  Footnotes are associated with a specific data entry and are thus keyed to that data value and are reported in the row that that data value is reported.
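The labeling rule is easier to see in code.  This small Python sketch generates the column labels for a data entry with a given number of footnotes; the function name `footnote_columns` is mine, but the labels follow the rule described above.

```python
def footnote_columns(field, count):
    """Column labels for `count` footnotes attached to one data entry.

    The first footnote is '<field>.footnote'; each later one appends an
    index starting at _1.
    """
    return [
        f"{field}.footnote" if i == 0 else f"{field}.footnote_{i}"
        for i in range(count)
    ]


print(footnote_columns("transactionshares", 2))
print(footnote_columns("transactionshares", 4))
```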

These forms allow the respondent to include REMARKS.  A remark applies to the entire form and so what we ended up doing is including these in a separate row.  Initially we thought to include them next to each transaction but decided that these might be more useful if they could be easily  isolated.  We include a column that describes the nature of the content for each row (datatype).  


This allows you to very quickly isolate and review any particular type of data.  All of the identifying information for each person associated with each form is included in each row.  There are indicators for the relationship between the reporting person and the issuer (isdirector/isofficer/isother).  The reporting person’s CIK is included so you can match back to our compensation data.
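As a quick illustration, isolating one kind of row from a result file takes only a few lines of Python.  The sample data and the `remark_text` column are made up, and I am assuming the datatype values appear as plain strings such as REMARK; check your own result file for the exact spelling.

```python
import csv
import io


def rows_of_type(csv_text, datatype):
    """Return the rows whose datatype column equals `datatype`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["datatype"] == datatype]


# Made-up result file with a hypothetical remark_text column.
sample_csv = (
    "rptownercik,datatype,remark_text\n"
    "9999999,NONDERIVATIVE TRANSACTION,\n"
    "9999999,REMARK,Example remark for the whole form\n"
    "9999999,DERIVATIVE HOLDING,\n"
)
for row in rows_of_type(sample_csv, "REMARK"):
    print(row["rptownercik"], row["remark_text"])
```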

I am really excited about this update.  If you work with this data and have observations that would help us improve the utility please send me an email.  (burch [yada]



FORM D & More About RDATE

Dropbox went public on March 23 2018.  However – their first public filing was made almost a decade earlier – on July 7, 2009 when they reported raising more than seven million dollars from eight investors.  Their initial SEC filing was made on a Form D – their trading name at the time was Evenflow.  They filed a second Form D in 2014 describing an offering of securities totaling 450 million – by the time they made the filing they had raised 325 million of that 450 million total expected.

These were the only EDGAR filings from Dropbox until their draft registration statements were made public on 2/23/2018 (the same date as their S-1).

We have had several requests for Form D filing data. So I am delighted to report that we are in the final stages of delivering all the data extracted from the Form D filings to our distribution server.  When the Form D data goes live a new entry will appear in the ExtractionPreprocessed Data Pull List.  To access the data all you will need is a standard request file with the CIK, YEAR and PF focus values.

Traditionally the YEAR value should be the year the EDGAR filing was made available.  Of course this raises an unexpected complication for our users because we would not expect you to have any prior knowledge of a particular filing year for a Form D.  Therefore, to make this easier we are initially going to load all Form D filings with an RDATE of 6/30/2018.  We will add an additional field in the data, RDATE_2, which will be the actual date the filing was made available through EDGAR.  So when you build the request file for Form D data the YEAR column will only need to have 2018.  You will then receive all Form D data for each CIK in your request list.

The column headings are going to reflect the location of the data item in the original form.  Here is an image from one of Dropbox’s Form D filings.  The section in the image is the section of the filing that reports on the issuer.


Here is an image of the data captured from this original form.  Please note that the data has been transposed for this post; when you access the data the column headings are the values in column A:


Since each data item in this section corresponds to details about the Primary Issuer the phrase primaryissuer is concatenated to other parts of the name with a period (.).  Since there are other addresses in the form the label issueraddress is used to make clear that the address information relates to the primary issuer rather than another address detail in the form (for example an address component of a related person).
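The naming scheme amounts to flattening a nested form into dotted column names.  Here is a minimal Python sketch of that flattening.  Only `primaryissuer` and `issueraddress` come from the description above; the other keys and all values are placeholders I made up.

```python
def flatten(section, prefix=""):
    """Flatten nested dicts into {dotted.column.name: value}."""
    columns = {}
    for key, value in section.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            columns.update(flatten(value, name))  # recurse into sub-sections
        else:
            columns[name] = value
    return columns


# Placeholder form content; the inner key names are hypothetical.
form = {
    "primaryissuer": {
        "entityname": "EXAMPLE ISSUER INC",
        "issueraddress": {"city": "EXAMPLE CITY", "stateorcountry": "XX"},
    }
}
for column, value in flatten(form).items():
    print(column, "=", value)
```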

Our initial push of Form D data was more focused than we would have liked.  Honestly I am trying to set up the automated processes.  Thus, rather than filling in all of the more than 300,000 Form D filings we identified those filers that have ever filed a 10-K.  This significantly reduced the initial testing load to a few more than 14,000 original documents.  As of today (8/2/2018)  the parsed data from the issuer section and the offering details from these filings has been pushed to our distribution system.  Sometime in the next 12 or so hours your client will update with the new option:


As I noted above – we are artificially setting the year value for these files as 2018.  The objective is to prevent you from having to guess when your particular registrant made a Form D filing.  So your request file should look something like:


Frankly, I was surprised by some of the filings.  Note that Abbott Labs (CIK:1800) is in the above list.  They have filed 3 Form Ds – but the amount of capital raised seems minuscule compared to their market capitalization or book value (one offering of $70,000).  I have noticed that some filers are using the capital to fund executive pension plans.  All of these details and others are available in the way we have made this data available for your research.

As I have described in our help materials and elsewhere we use the term RDATE to indicate when a filing was made available to users of EDGAR.  Note that I did not describe the date as a filing date – this was purposeful.  The filing date reflects the date a filing was either received by EDGAR or at an SEC filing office.  For most filings the RDATE will generally also be the filing date but there are enough discrepancies to pay attention and manage dates by the RDATE rather than the filing date.  While setting up the code for processing the Form D filings I observed instances where the difference between the filing date and the RDATE was substantial.  For example – here is an image from the  header page for a filing made by Diffusion Pharmaceuticals:


Notice the Filing Date of 7/3/2006 but also notice the date next to the SEC_HEADER tag  – 6/15/2015.  This is the date the filing was made public through EDGAR despite the filing date of 10/9/2007.  If you wanted to do an event study (imagine wanting to determine how public company market values were affected by news of a successful funding round for a competitor) then the  6/15/2015 date is probably the correct date to use or perhaps provides evidence that this observation should be considered for deletion given the gap between the filing date and the RDATE.  We use the SEC-HEADER date to assign the RDATE to each filing we process.  If there is a value for PERIOD in the header we use that for the CDATE as the header tag PERIOD maps into the index file CONFORMED PERIOD OF REPORT.  If there is no PERIOD value we use the filing date for the CDATE.  This is particularly important for the comment letters (form type UPLOAD)
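The date rules in the paragraph above can be summarized in a short Python sketch.  The function name `assign_dates` and the `gap_days` helper field are my own additions for illustration; the sketch assumes the three header values have already been parsed into dates.

```python
from datetime import date


def assign_dates(header_date, filing_date, period=None):
    """Apply the RDATE/CDATE rules described above.

    RDATE comes from the SEC-HEADER date.  CDATE is the PERIOD value
    when the header has one, otherwise the filing date.
    """
    rdate = header_date
    cdate = period if period is not None else filing_date
    # Hypothetical extra field: days between filing and public availability,
    # useful for screening observations before an event study.
    gap_days = (rdate - filing_date).days
    return {"RDATE": rdate, "CDATE": cdate, "gap_days": gap_days}


# Dates as read from the header image discussed above; no PERIOD value.
result = assign_dates(header_date=date(2015, 6, 15), filing_date=date(2006, 7, 3))
print(result["RDATE"], result["CDATE"], result["gap_days"])
```

A large `gap_days` value is the signal that an observation may deserve a second look before it goes into an event study.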