Accelerated Extraction of Executive Compensation Data

Thanks to some hard work on the part of our team, we have just turned on Near Real-Time delivery of Executive and Director Compensation data, as well as the 10-K Item sections.  We used to process these filings and extract the data during a midnight run; now we process them just a few seconds or minutes after they are made available on EDGAR.

This is one of the struggles I have with marketing.  What exactly is real-time?  We can’t actually state that the data is made available concurrently with the filing, since the filing has to happen first.  However, we are now much closer to real-time than we have ever been before.

So for example, if you were interested in collecting or analyzing Executive Compensation data from ENERNOC INC (CIK 1244937) (The World Leader in Energy Intelligence Software), our platform made their normalized Executive Compensation data available for analysis about two minutes after their proxy filing was visible on EDGAR.

Here is an image of the raw data as it appeared in their proxy filing:


Here is an image of the processed results (I had to trim some of the identifiers out of this image):


Notice the addition of the officers’ CIKs as well as their gender.  This data was available just two short minutes after the proxy was available from EDGAR.

New Data Coming to EDGAR: Form C

The Jumpstart Our Business Startups Act (the JOBS Act) required the SEC to reduce the regulatory burden on issuers seeking to raise money through crowdfunding and similar platforms.  While it took the SEC a bit more than three years to define their regulatory approach, the new regulations go into effect in May 2016.  The regulation as it appears in the Federal Register is available here (Form C).

Over time this will provide an opportunity to test some unique research questions.  Academic researchers have struggled for years to collect data on companies in their early stages.  These rules will provide access to that data for companies that seek, and are successful in raising, capital from the public through these alternate, early-stage channels.  Companies will have to file annual financial statements, though it appears that they will have a range of choices with respect to the level of detail and the nature of the certification associated with those financial statements (from certification by the CEO to a full audit).  That range of options alone suggests some interesting opportunities to measure the value of an audit.  Potentially, researchers will have data on how markets perceive the value of an audit if there is variation in the choices these early-stage companies make when they go public.

We will be watching for the initial Form C filings in May.  If I have not mentioned this yet, the next step in our development schedule is to give our users the ability to use our platform to access any filings at any time.

Our Search versus Web Search

There is a critical difference between directEDGAR’s search architecture and that of all of our competitors.  Our search runs locally; theirs runs on the web.  We are the only SEC filing search provider that uses local search.

So why are we different?  That is an easy question to answer: we just want to save you time.  A web search might initially look faster.  If I type the phrase audit committee met into my Google search box, I get access to over 15 million documents in 0.40 seconds.  But now what am I going to do?

Google lists ten results per page.  Some of our competitors list 50 to 100 per page.  Suppose I want to browse the results at random: I have to click the Next button at the bottom of the page repeatedly to move to each new batch.  This can quickly get tiring.

By localizing the search we can push all of the results to your computer, so you can look at any search result almost instantly.  In the image below, the search found over 58,000 instances of our search phrase, and with the scroll bar we can easily ‘look’ at any result we like.


And while the example image above shows 58,646 documents, it is usually not the case that our users want all 58,646 results.  Instead, most users want to limit the results to a particular subset of registrants.  That is a difficult task to accomplish on a web platform.  I have seen some of our competitors allow searching by a limited number of CIKs, but I have not yet seen anyone provide a robust way to search with a large set of CIKs.  With our application, a CIK list of any size (well, truthfully, this has not been fully tested; the most we had ever tested with before was 12,000 CIKs) can be pasted into the filter box, or a file with the CIKs can be selected.

For this blog post I decided to set a new record and used 14,314 unique CIKs!  The beauty of our search is that the Search, Extraction & Normalization Engine will not only limit the search to those specific CIKs, it will also generate a list of all CIKs that did not have any documents in the search results.  Again, the last time we checked, none of our competitors offers this critical feature.
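The filtering-plus-miss-reporting idea can be described with simple set logic.  The sketch below is purely illustrative and is not directEDGAR’s actual implementation; the result records and the "cik" field are assumptions made for the example, and the leading-zero normalization mirrors how EDGAR CIKs are often padded.

```python
def filter_by_ciks(results, cik_list):
    """Return (hits, misses): hits are search results whose CIK appears in
    cik_list; misses are the supplied CIKs with no matching document.
    CIKs are normalized by stripping leading zeros so that '0001244937'
    and '1244937' compare equal."""
    wanted = {str(c).lstrip("0") for c in cik_list}
    hits = [r for r in results if str(r["cik"]).lstrip("0") in wanted]
    found = {str(r["cik"]).lstrip("0") for r in hits}
    misses = sorted(wanted - found)
    return hits, misses

# Hypothetical result set: only one of the two requested CIKs has a document.
results = [{"cik": "1244937", "doc": "def14a.htm"},
           {"cik": "320193", "doc": "def14a.htm"}]
hits, misses = filter_by_ciks(results, ["1244937", "0000099999"])
print(len(hits), misses)  # 1 hit; CIK 99999 is reported as a miss
```

The key point is the set difference: anything in the supplied list that never appears among the hits is, by construction, a registrant with no matching documents.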

Here is an image of the CIK filtering process where we add the CIKs to the search.



And after the search the application reports not only the number of documents but the number of missing CIKs.


None of this would be useful if we could not report back to you which CIKs had missing documents.  That is easily available by hitting the View Misses button.  As you can see in the image below, we provide options so you can save the results directly to a file or select and paste them somewhere else.


While I could provide additional examples, I think I have supported my primary point: our localized search architecture is better because we can add more features than if we were serving search results over the web.  With a local search we can give you almost instant access to any search result, and we can implement robust CIK filtering that not only limits documents but also reports back those that were missing.

New Web Site Finally Live

What a chore; our new website is finally live.  The fact of the matter is that I am not that creative, and it has been hard to figure out how to describe directEDGAR’s feature set in a compact manner.  One thing that really helped: while redesigning our software we decided to call it the Search, Extraction & Normalization Engine.  While that is a mouthful, it helped us better understand how to organize the website (new website).

We still have a way to go to make our advantages clearer.  But this new website gives us a platform that will make that easier.

andany Search Operator – Deep Dive

With the introduction of the latest version of our software we were able to take advantage of dtSearch’s amazing development work and incorporate their andany operator.  The andany operator solves a very tricky search problem.  Consider this problem: you want to find a recipe to use up some bananas, but you are sick of eating banana bread.

The standard response would be to search for banana not banana bread.  However, suppose someone has a recipe for the best banana pudding in the world, but it is on the same page as a comment about how tired they are of banana bread.  Your search expression would exclude that page.

The alternative search would be banana andany banana bread.  This search first finds all pages that mention the word banana, and then counts and identifies all instances of the phrase banana bread on those pages.  If the frequency counts are the same, you can conclude that the only mentions of banana on a particular page are in the context of banana bread.  However, if the frequency count of banana is greater than the frequency count of banana bread, you just might want to explore that page.
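The frequency comparison behind this reasoning can be sketched in a few lines.  This is not dtSearch’s implementation of andany, just an illustration of the count-and-compare logic described above; the sample page text is made up.

```python
import re

def andany_counts(text, word, phrase):
    """Count standalone-word and phrase occurrences in a page of text.
    A word count greater than the phrase count means the word appears
    at least once outside the excluded phrase."""
    word_n = len(re.findall(r"\b" + re.escape(word) + r"\b", text, re.IGNORECASE))
    phrase_n = len(re.findall(re.escape(phrase), text, re.IGNORECASE))
    return word_n, phrase_n

page = ("I am so tired of banana bread. "
        "Try this banana pudding instead: mash three ripe bananas.")
w, p = andany_counts(page, "banana", "banana bread")
print(w, p, w > p)  # 2 1 True -- 'banana' appears beyond 'banana bread'
```

A simple `not` filter would have discarded this page entirely, even though one of its two mentions of banana has nothing to do with banana bread.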

For a more on-target example, consider the problem faced by Li, Lundholm and Minnis.  They wanted to identify competitive intensity as described in 10-K filings.  They used code they developed to identify all 10-K filings with instances of words rooted in competition and counted their frequency in each document.  Their code also allowed them to count instances of phrases like less competition.  This would be hard to replicate without access to their code, because the only search one could construct without the andany operator would be competition and not less competition.  So if we assume our search focus is a document/paragraph, that search will find all documents/paragraphs that contain the word competition but exclude any that also contain the phrase less competition.

The same search using the andany operator would be competition andany less competition.  This search would return documents/paragraphs selected on the basis of the word competition and also count all instances of the phrase less competition.

The Li et al. paper was an inspiration for finding a search engine that could support their research strategy without having to learn to program to collect the same data.  In the tests we ran to validate the superb value that dtSearch would bring to directEDGAR, we found a significant number of documents that would be excluded from the results by the not operator even though their counts of competition words far exceeded those of phrases like less competition.

Version Coming Soon – Indexing

We shipped Version 4.0 of our application to clients with the updates for 12/31/2015.  While it included some substantial enhancements, it was not fully complete.  Part of the help documentation had not been finished, and we had still not been able to port everything over from the ExtractionEngine.  We have made substantial progress since then and should start releasing the update in the next week.

One of the key features we added with this release is user-controlled indexing.  For a long time our software has had the ability to download particular sections of 10-K filings (Risk Factors, Mine Safety, etc.), and with the new platform we offered a very direct way for you to isolate particular documents from the repository.  While the application included a word parser, one of its weaknesses was that we did not have a way for you to use the full set of search features on either the 10-K sections or the isolated documents you retrieved.  With our new indexing feature you can now build your own custom indexes and then use all of the features of the Search, Extraction & Normalization Engine on the index or the documents in it.

To use the indexing feature you point the application to a folder of documents.  The documents have to be named in the CIK-RDATE-CDATE pattern.  This is not a concern, since every document our processes create is named in that pattern.  Once the folder with the documents has been selected, the application sets up the hierarchical directory structure, moves the documents into CIK\RDATE-CDATE folders, and then builds a full-text index.
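The folder-restructuring step can be illustrated with a short script.  This is a sketch of the layout described above, not the application’s actual code; in particular, the assumption that the first hyphen separates the CIK from the two dates (e.g. 1244937-20160415-20160418.htm) is mine, and the real parser may handle the pattern differently.

```python
import shutil
from pathlib import Path

def organize_for_indexing(src, dest):
    """Move files named CIK-RDATE-CDATE.* from src into dest/CIK/RDATE-CDATE/.
    Assumes the first hyphen in the file stem separates the CIK from the
    RDATE-CDATE portion (an assumption for this sketch)."""
    for f in Path(src).iterdir():
        if not f.is_file() or "-" not in f.stem:
            continue  # skip anything not matching the naming pattern
        cik, dates = f.stem.split("-", 1)  # '1244937', '20160415-20160418'
        target = Path(dest) / cik / dates
        target.mkdir(parents=True, exist_ok=True)
        shutil.move(str(f), str(target / f.name))
```

Once the documents are laid out this way, a full-text indexer pointed at the destination folder can associate every hit with a registrant and a filing date directly from the directory path.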

So for the first time, if you want to search only the Risk Factors sections of 10-Ks, you can now do so.

Even more exciting, we included the indexing feature as a prelude to offering you the ability to access any EDGAR filing.  More on this later.