Amazing Research Deserves Amazing Rewards

A while ago I was searching for a way to acknowledge researchers who cite directEDGAR as part of their data collection efforts.  I wanted to do something because I think those citations in the academic journals are a key factor in establishing our legitimacy with the academic market.  So we started a program (one that I at least think is neat) where we order ice-cream from a local ice-cream shop and have the cartons customized with the name of the paper and the names of any of the authors who are our clients.  Check out the images below.


The image above is the ice-cream delivered to Professor Lubomir P. Litov, a member of the Finance Department at the University of Oklahoma.  His (co-authored) paper Lead Independent Directors: Good governance or window dressing? was accepted by the Journal of Accounting Literature for their December 2019 issue.

Here is an image of the ice-cream we sent to Professor Matt Ege, Jennifer Glenn, and Professor John Robinson, faculty and a PhD student in the Department of Accounting at Texas A&M University.  Their paper Unexpected SEC Resource Constraints and Comment Letter Quality was accepted by Contemporary Accounting Research (CAR) in May, though no publication date has been reported yet on the CAR website.


One more: one of my colleagues at the University of Nebraska at Omaha, Professor Erin Bass, had an ice-cream surprise recently.  She cited directEDGAR in her paper Top Management Team Diversity, Equality, and Innovation: A Multilevel Investigation of the Health Care Industry, recently accepted for publication by the Journal of Leadership & Organizational Studies.

While this may seem a little strange (who would make ice-cream with an academic research title on the label?), I will observe that one of our authors reported a couple of years ago that it really made their kids want to read the paper.  They (the kids) were bragging to all of their friends that their dad had ice-cream named after him!  What could be better than that?

A little back story on eCreamery, the amazing company that handles all of the details.  The company was started by two women in Omaha in 2007.  They have been on Shark Tank, and Mr. Buffett (that's Warren, not Jimmy) has been known to pop in with some of his best buds.  eCreamery is only about 15 blocks from my house, and when we go there we can count on the line being out the door.  Clearly, the best SEC filing search engine customers deserve only the best ice-cream.

Now for the fine print.  Whether or not we send ice-cream as the result of a citation is strictly at our discretion.  There are times we do not send ice-cream.  The biggest reason is that none of the authors is employed by one of our clients.  Another is that the authors are at one of our international schools, or in Alaska or Hawaii (eCreamery will not guarantee a frozen delivery outside the continental US; go figure).  This program can be stopped at any time.

10-K History – Data Filtering

Whenever I visit clients or respond to emails about data collection I always try to make the point that it is critical to identify the sample based on strict criteria.  Doing so minimizes the inevitable chase at the end for missing data and reduces the processing of the inevitable edge cases.  No matter how clearly the disclosure requirements are set out in SEC regulations or the Accounting Standards Codification, some proportion of SEC filers will get 'creative' in the form of their disclosure.  When they get creative, data collection becomes much more tedious because we have to identify the structure of their disclosure before we can sort out how to capture the data.  If we can precisely identify the sample firms before we turn to collecting the data items, we will reduce the effort we spend chasing the odd forms of the disclosure.

I am helping a client understand how to use our platform to collect a data item that is disclosed only in the 10-K.  It is not a required disclosure in any other filing, and it is not likely to be disclosed elsewhere voluntarily (I did some tests and could not find this item disclosed in the combined millions of filings that are searchable with directEDGAR).  On a related note, I do encourage our users to review the regulations (either the SEC disclosure requirement as set out in the Code of Federal Regulations or the Accounting Standards Codification).

Our client is trying to collect a particular data item, and their sample was derived from another financial data source.  It may seem a reasonable presumption that if a company has data available from some other financial data source then there should be a 10-K with this disclosure.

In this particular case there are three problems with the sample from our client.  The first is that some of the sample firms have public data because they have public debt.  While they file a 10-K, they might not include some data items in it because the disclosure requirements differ by the nature of the laws that establish their filing obligations (ABS issuers versus public-debt-only issuers versus common stock issuers).  So while these companies file a 10-K, they will not have the particular disclosure our client is trying to collect.  The second problem is that some of the sample firms may not have had a filing obligation at the time they showed up in the sample.  The third problem is that some of the sample firms are foreign registrants whose filing obligations differ substantially; they have the option to file 20-Fs and 6-Ks rather than the expected 10-K/Qs and 8-Ks (along with a myriad of other filing differences).

The most common way to determine if a company has publicly traded equity is to look for evidence in one of the other data sources that would normally be used to source some of the data for research.  I suggest that because there is not an easy way to determine from SEC filings whether a company has publicly traded common stock.  In other words, there is not really an easy way using directEDGAR to establish whether a filer has publicly traded common stock.  For example, I played around with some searches to identify those 10-K filers that are privately held and struggled, because this is not a mandated disclosure.  One search I tried was to search all 10-K filings for registrant or issuer or company within 10 words of the phrase privately held, limited to the first 800 words of the document.
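For readers who want to experiment outside our platform, a rough approximation of that proximity search can be sketched in plain Python.  This is not our search engine's syntax or implementation, just an illustrative stand-in: the function name and the simple whitespace tokenization are my own assumptions.

```python
import re

def looks_privately_held(text, window=10, scope_words=800):
    """Rough approximation of the proximity search: does 'registrant',
    'issuer', or 'company' appear within `window` words of the phrase
    'privately held', inside the first `scope_words` words of the text?"""
    words = re.findall(r"[A-Za-z']+", text.lower())[:scope_words]
    # Find each spot where 'privately' is immediately followed by 'held'.
    anchors = [i for i, w in enumerate(words)
               if w == "privately" and i + 1 < len(words) and words[i + 1] == "held"]
    targets = {"registrant", "issuer", "company"}
    for pos in anchors:
        lo, hi = max(0, pos - window), min(len(words), pos + 2 + window)
        if targets & set(words[lo:pos]) or targets & set(words[pos + 2:hi]):
            return True
    return False

sample = "The registrant is privately held and files reports solely because of outstanding public debt."
print(looks_privately_held(sample))  # True
```

As the post notes, hits like this still need manual review, because many matches turn out not to describe the filer itself.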


Some of the results (LEVI STRAUSS and CINEMARK USA) were exactly what I was looking for; those registrants are (or were) privately held.  However, many of the results were not.  Therefore, if I needed to collect data from companies that had public equity, the best way to define the sample would be to use another tool to determine whether they do have public equity.

The second and third issues that needed to be addressed are whether or not the company filed 10-Ks (since that is the filing that contains the data we are looking for) in the window needed for this study.  We can use directEDGAR's 10-K Filing History archive to establish whether a company has filed 10-Ks, and for what period.  Our client had a list of approximately 13,000 CIK-YEAR observations representing 3,862 unique CIKs.  I used their list of unique CIKs to create a request file to determine the 10-K availability for their sample.  This file helped me in two ways.  First, for some of their sample CIK-YEAR pairs, the date they were trying to collect data for was after (or before) the last (or first) date of the 10-K filings.  For example, they needed something from a 10-K filed by CIK 737644 after 1/1/2001.  The problem is that this CIK filed its last 10-K in 1997 (I determined this from the 10-K history file results).


They can use the result file to determine if there is a 10-K filing within the time span in which they need to collect the data.  Even better, the process also creates a file called missing.csv (clever name) that lists the CIKs from the request file for which no 10-K filing has ever been made.  There were 477 CIKs from their original list of 3,862 that had never filed a 10-K.
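The screening logic described above can be sketched in a few lines of Python.  The in-memory `history` dictionary and the `classify` function are hypothetical stand-ins for the 10-K Filing History results file; the date for CIK 737644 other than the 1997 year is invented for illustration.

```python
from datetime import date

# Hypothetical in-memory version of the 10-K history results: for each CIK,
# the dates of its first and most recent 10-K filings.
history = {
    737644: (date(1994, 3, 30), date(1997, 3, 28)),   # last 10-K filed in 1997
    320193: (date(1994, 12, 13), date(2022, 10, 28)),
}

def classify(cik, needed_after):
    """Explain why a CIK-YEAR observation may yield no 10-K data."""
    if cik not in history:
        return "never filed a 10-K"          # would land in missing.csv
    first, last = history[cik]
    if needed_after > last:
        return "needed date is after the last 10-K"
    if needed_after < first:
        return "needed date is before the first 10-K"
    return "10-K should exist in this window"

print(classify(737644, date(2001, 1, 1)))  # needed date is after the last 10-K
print(classify(999999, date(2001, 1, 1)))  # never filed a 10-K
```

Running every CIK-YEAR pair through a check like this up front is exactly the kind of sample pruning the post is advocating.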

So while we could not use directEDGAR to establish which firms in their sample did not have publicly traded equity, we could use it to establish whether they filed any 10-Ks, and for what period.  The advantage of doing this work at the beginning is that we can more precisely define the data we should expect to collect.




Auditor Tenure Fast Collection

One of the projects we have been working on is enhancing our audit fee data.  Frankly, the current presentation is lousy and not terribly useful.  So we sat down, developed a plan, and identified some additional fields to collect in order to improve the value of this data.

One of the fields we need to add is tenure.  For those of you not aware, a rule change promulgated by the PCAOB in 2017 requires auditor tenure ("auditor since") to be disclosed, and the disclosure has normally been included in the 10-K or Exhibit 13.  Since the disclosure is required, it is much more standardized than the instances of auditor tenure disclosed voluntarily in the proxy; the most common form is "have served as the Company's auditor since YYYY".

I wanted to collect that data myself so I could describe the process to our data team, and I set out to do so using our Search, Extraction and Normalization Platform.

First – I needed to search for the phrase auditor since


As you can see, I found 4,979 instances of that phrase (my universe was all 10-K filings made since 1/1/2019).  Next I needed to extract and normalize the phrase and convert any numbers after the phrase "auditor since" into the data value.  I used the ContextNormalization feature, as you can see in this next image:


The Extraction Pattern, translated into English, means: extract the context, and if a number is found following the phrase auditor since, place it in the CSV file in the column labeled audit_tenure.
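For readers who think in regular expressions, the normalization step behaves roughly like the sketch below.  This is not the platform's actual pattern syntax, just an assumed regex equivalent with a hypothetical function name.

```python
import re

# Capture a four-digit year immediately following the phrase "auditor since".
PATTERN = re.compile(r"auditor since\s+(\d{4})", re.IGNORECASE)

def audit_tenure(context):
    """Return the year following 'auditor since', or None for the
    non-standard disclosures that need manual review."""
    m = PATTERN.search(context)
    return int(m.group(1)) if m else None

print(audit_tenure("have served as the Company's auditor since 2002."))  # 2002
print(audit_tenure("has been the Company's auditor for over a decade"))  # None
```

The `None` cases correspond to the 110 exceptions discussed below that had to be keyed in by hand.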

So I invested a total of maybe 5 minutes.  Let's look at the results:


The context is available for review and the application normalized the context to extract the year value for our use to add to our new audit fee data.

There are some exceptions that had to be handled manually (110/4,979).  Let's look at those:


As you can see, these registrants deviated from the standard disclosure, so I had to review the context and just key in the year value.  I am very comfortable working with our tools in this context, so it only took me about 20 minutes to review and key those missing values.

In total I spent roughly 45 minutes to capture this data value.  I spent about that much time trying to sort out how to describe the process in this blog post.  (When we upload the new audit fee data presentation, the tenure field will actually be labeled SINCE.)



Version Release Coming!

2018 has been a busy year.  If you look back you will see that we added new data tables (Insider Trading, 10-K Filing History, and Form D data).  In May we released 4.0.4, which was the foundation for adding the additional data tables because it significantly sped up the delivery of the artifacts we process.  We also started adding the AGE and SINCE variables to director compensation and have been reworking our beneficial ownership tables to better deliver the data when a filer has multiple classes of stock.  And we were able to move our extraction of the Effective Tax Rate Reconciliation table to near real-time delivery rather than batch updates.

But it doesn’t stop.  When we deliver the filings and indexes for the last 2018 filings we will be including a new version of our application.  There are some important improvements coming with this version.

We added a ZOOM box for you to use to build your search phrases.  Our search engine can parse really large (think more than 32,000 characters) search phrases.  While you can build the search phrase in Notepad or a similar application, we decided to add a bigger box to use within the application.


We added a feature to allow you to identify/specify tables using specific words or values.  A key feature of our platform has always been that you can extract tables from the search results and then manage the data extraction from those tables.  One constraint imposed by that strategy is that all the tables across all the documents had to share some consistent phrase, term, or value.  There are, though, cases where registrants report the data in a unique fashion, making it difficult to actually access the tables.  So we developed a process that allows you to review a set of search results and then specify, for each document, a unique value the application will then use to identify and extract the relevant table(s).


We set up the foundation to quietly handle corporate actions that lead to a change in CIK but not the entity.  This one is a little complicated to explain, so bear with me.  If you access most of the leading sources for financial data and collect data for Alphabet, the time series of Alphabet will extend back to the first 10-K filing made by Google in 2005.  However, if you go to EDGAR and use the CIK returned from that financial data service, you will find that the first 10-K filing made by Alphabet was filed in 2016.  Because of a reorganization and a merger, Alphabet became the successor issuer to Google, but they have different CIKs.  With our new update, when you submit a CIK for one of those companies (either Google's 1288776 or Alphabet's 1652044) our application will quietly add the complementary CIK when you select the new option Include Historical CIKs.  This feature has been added to every menu item that allows you to specify the CIK.


When you first install it, the application will add a special file that maps the CIKs.  However, because the corporate actions that lead to this phenomenon continue, we have added a control in the Options menu that allows you to update the mappings at your convenience.
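Conceptually, the Include Historical CIKs option works like the small sketch below.  The list structure and function name are my own assumptions, not the application's actual mapping file format; the Google/Alphabet pair comes from the example above.

```python
# Illustrative mapping entries; the real mapping file ships with the
# application and is refreshed from the Options menu.
CIK_LINKS = [
    (1288776, 1652044),  # Google -> Alphabet (successor issuer)
]

def expand_ciks(requested):
    """Given a set of requested CIKs, add the complementary predecessor or
    successor CIKs, mimicking the Include Historical CIKs option."""
    expanded = set(requested)
    for old, new in CIK_LINKS:
        if old in expanded or new in expanded:
            expanded.update((old, new))
    return expanded

print(sorted(expand_ciks({1652044})))  # [1288776, 1652044]
```

Either CIK in a linked pair pulls in the other, so a request for Alphabet quietly covers the Google-era filings as well.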


Another important change was to improve the overall usability of the application by adding keyboard shortcuts for every single menu item.  In earlier versions of our software we only had keyboard shortcuts for the most used features; now every menu item can be accessed through the keyboard (without use of the mouse).  For example, to access the Search control it is only necessary to press Alt+Z + Tab and you are ready to key in your search phrase.

We also fixed some minor bugs and tweaked the licensing validation process a bit.  In earlier versions, if our delivery server was occupied when you started the application, the application would wait until the server was free before completely starting.  That hesitation should be gone now.



New Data Type 10K_HISTORY Coming Soon

I was doing a demo with a prospective client last week and had a typical experience: I submitted a request file, supplied by the client, for director compensation data for 2008.  They had identified a sample of registrants and wanted to see how to access data related to their sample.  I used the file while they were watching and pulled the data; unfortunately, the missing-cik-year report listed 67 CIKs for which no data was available for 2008.  I am glad we provide this summary so our users don't have to muck around to identify which of their sample is missing.  Of course the next question is: why are they missing?

Usually I will take them on a tour of EDGAR for a few of the listed CIKs and show that there are no filings for the time period.  The first CIK in the list of missing values was 3133 (AMSOUTH BANCORPORATION).  The reason director compensation for AMSOUTH for 2008 is not available is readily apparent from this image:


Our potential client said they understood, but would like a more concrete way to establish whether or not data should be available.  I absolutely understood that; this issue has been bothering me for a while.

One alternative we considered was to try to find all of the delisting notices (Form 15-12B) and create some summary of data from those filings.  Unfortunately, too many registrants do not actually file a 15-12B.  There is another problem as well: sometimes data is missing because the registrant has not yet registered, or, even if they have registered, they may not yet be obligated to file the reports that contain some of the data objects our clients are trying to collect.

The solution we have settled on for the time being is to create a summary file that lists, for every CIK that has ever filed any form of the 10-K, the date of their first 10-K filing and the date of their most recent 10-K filing.  We have done so and uploaded this data to our distribution server.  We are calling this data type 10K_HISTORY; it should be visible in our Extraction\Preprocessed data window before 12/1.

Because this is a snapshot at a point in time, we are setting this data up with an RDATE of 20180101.  This means that your request file will have to have the value 2018 in the YEAR column for every CIK you want to check.
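Building that request file is mechanical.  Here is a minimal sketch using Python's csv module; the CIK and YEAR column headers follow the description above, while the specific CIK values are just examples drawn from this post.

```python
import csv
import io

# Because 10K_HISTORY carries an RDATE of 20180101, every row of the request
# file uses 2018 in the YEAR column, regardless of the study's actual window.
ciks = [3133, 1650107, 1467623]

buf = io.StringIO()  # stand-in for the request file on disk
writer = csv.writer(buf)
writer.writerow(["CIK", "YEAR"])
for cik in ciks:
    writer.writerow([cik, 2018])

print(buf.getvalue())
```

In practice you would write to a real file path and feed it to the Extraction\Preprocessed request process.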

When you submit the request file we will return a results file that includes the following headings:


Note: the FIRST_FILING and RECENT_FILING values refer specifically to any form of a 10-K filing (10-KSB, 10-K405, etc.).  So they are not necessarily the first filing the registrant made on EDGAR.  The balance sheet date values are the balance sheet date that the filing covers.

We hope this makes it easier to understand why you might be missing data.  Rather than having to inspect EDGAR for relevant dates, you can use your missing report to construct a request file and then check for missing values.  Here is a screenshot from testing the process with the results I alluded to at the beginning of this post:


We hope this makes data validation more efficient and less painful.  Since each CIK has only one row of data this should be quick data to access and act on.

Of course, there are always catches.  I had 67 observations that were missing data for 2008.  I submitted all 67 CIKs, and the results included another missing report for 5 CIKs.  Well, it turns out that these 5 CIKs have never filed any form of 10-K.  For example, one of the CIKs belonged to COCA-COLA EUROPEAN PARTNERS PLC (1650107).  They file 20-F and 6-K forms.  Another belonged to DROPBOX (1467623); they just went public in 2018 and have yet to file a 10-K.

Institutional Trading – the “Whales”

We are often asked to provide access to the 13F-HR reports filed by institutional managers.  While we have provided these filings optimized for our platform, ultimately what users want is the data organized in a way that is meaningful.  This is something I personally have wanted to do for a long time.  However, the problem has always been matching the name of the issuer to some more useful identifier.  The actual 13F filings list the issuer name and their CUSIP (Committee on Uniform Security Identification Procedures) assigned identifier.  We needed a way to map the CUSIP back to the Central Index Key (CIK) assigned by the SEC.  Early in our company's life I approached CUSIP Global Services (a division of S&P) about licensing the CUSIP data so we could link filings to CUSIPs to CIKs.  The cost was prohibitive, so we were stuck there.

Recently we discovered another way to map the CUSIP to CIK and after extensive testing are confident in the mappings we generate.  Because of this we have started processing the 13F-HR reports.

What we are doing is aggregating all of the data by report quarter and issuer.  The SEC requires institutional investment managers with more than $100 million in securities under management to disclose their holdings on Form 13F within 45 calendar days after the end of the quarter.  We parse all filings made within each window and then combine the data for each individual issuer from all of the filings into one single report.  So a report for, say, Conagra for the 4th quarter of 2018 will be available soon after February 15, 2019 (the deadline for the report).  Of course, we expect to have to periodically update these summary reports when amendments are filed by the managers.
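The aggregation step described above amounts to a group-by over (quarter, issuer).  A minimal sketch, assuming hypothetical parsed 13F rows (the manager names, CUSIP placeholders, and share counts below are invented):

```python
from collections import defaultdict

# Hypothetical parsed 13F rows: (manager, report quarter, issuer CUSIP, shares).
rows = [
    ("Manager A", "2018Q4", "CUSIP_X", 1_000_000),
    ("Manager B", "2018Q4", "CUSIP_X", 250_000),
    ("Manager A", "2018Q4", "CUSIP_Y", 500_000),
]

# Combine every manager's position in the same issuer for the same quarter
# into a single total, mirroring the issuer-level summary reports.
totals = defaultdict(int)
for manager, quarter, cusip, shares in rows:
    totals[(quarter, cusip)] += shares

print(totals[("2018Q4", "CUSIP_X")])  # 1250000
```

The real pipeline also has to resolve each CUSIP to a CIK (the mapping discussed above) and fold in amendments as they arrive.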

Our platform already provides access to the beneficial ownership data as reported in the DEF 14A/10-K.  The institutional ownership data complements the beneficial ownership table because the beneficial ownership table only contains details about owners of more than 5% of the equity and the ownership of directors and officers.  The institutional beneficial owners reported in the proxy are a subset of the institutional owners who report on Form 13F.  For example, CAG's DEF 14A reports only two beneficial owners (other than management), Blackrock and Vanguard, with a total of 74 million shares.  Our analysis of the 13F filings shows total holdings by institutions to be approximately 305 million shares for the 6/30 quarter.  This is roughly a four-fold increase, and it also represents more than 75% of the approximately 390 million shares outstanding as of the end of June 2018.

By combining the data from our beneficial ownership tables for directors and officers with the institutional ownership reports, our users will have better measures of these important characteristics of the distribution of equity.

I think we should have a pilot of the institutional ownership data available before the middle of October.  You will know the data is available by starting the application and using the Extraction\Preprocessed feature.  When the pilot data is available there will be a new entry, 13F_PILOT, in the Data Tables listing.

If you would like to dive into one of the files while we are completing our final testing please send me an email and I will make one available (for our clients only).


Pilot Test of Ownership Data Over – The Real ‘Stuff’ Now Loading

On August 3rd we loaded a test run of ownership data.  Our initial focus was roughly the S&P 500 for the period from 2013 to 2017.  We got some feedback on the initial data load and made some changes to the way the columns are displayed and ordered.  We also added some additional fields (including the ACCESSION-NUMBER of the filings) so that you can more easily explore the original source file if you have questions about the data.

On Friday we took down the initial data and started loading replacement data with the new columns and more CIKs.  We have approximately 8,000 CIKs, and data going back to 2008, in the loading process that is running right now.  This is massive (it represents normalized data from more than 1,500,000 ownership filings).  Again, the data is organized by ISSUER-YEAR.  The data in each ISSUER-YEAR file is sorted by RPTOWNERCIK and FILING DATE.

We have preserved all of the footnotes and their association with individual data items.  Each row represents either a NONDERIVATIVE TRANSACTION, NONDERIVATIVE HOLDING, DERIVATIVE TRANSACTION, DERIVATIVE HOLDING, or a REMARK.  The type of each row is indicated by the value reported in the datatype column.
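Because the row type lives in the datatype column, filtering an ISSUER-YEAR file down to one kind of record is a one-liner.  A minimal sketch with a made-up slice of data (the shares column and the specific values are hypothetical; RPTOWNERCIK, FILING DATE, and datatype are the fields named above):

```python
import csv
import io

# Hypothetical slice of an ISSUER-YEAR ownership file.
data = """RPTOWNERCIK,FILING_DATE,datatype,shares
1111111,2017-02-01,NONDERIVATIVE TRANSACTION,5000
1111111,2017-02-01,DERIVATIVE HOLDING,12000
2222222,2017-03-15,REMARK,
"""

# Keep only the non-derivative transaction rows.
reader = csv.DictReader(io.StringIO(data))
transactions = [r for r in reader if r["datatype"] == "NONDERIVATIVE TRANSACTION"]
print(len(transactions))  # 1
```

The same pattern works for pulling out remarks or derivative holdings; only the value compared against the datatype column changes.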

The full load may take the balance of this week.  We will then identify any missing CIKs and fill in the data back to 2003.