Standalone SmartBrowser and Release Coming Soon

We have had several requests to allow users to provide copies of our software to freelancers who would help with some data cleaning activities.  We just can’t do that.  However, we have identified a middle ground: we are working on a standalone version of the SmartBrowser.

The SmartBrowser allows users to cycle through a directory with files (tables or documents) that were created by our application without having to open the files manually. Here is an image of the SmartBrowser:


Some of the key features include:

  1. Parsing of file to identify the
    1. CIK
    2. RDATE
    3. CDATE
  2. Text box where you can type a CIK so that the list loads at that position
  3. Ability to select any CIK; again, the list loads at the position of the selected CIK
  4. Full set of controls to allow you to move through the files including
    1. Previous – loads the last file reviewed
    2. Next – loads the next file in the queue
    3. Delete Current File – deletes the file that is being displayed
    4. Move Current File – creates a subdirectory named ForReview and moves the current file into that subdirectory
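
For readers curious what the Move Current File control amounts to, here is a minimal Python sketch of the same idea. It is an illustration only, not the SmartBrowser's own code; the function name `move_for_review` is made up for this example, and the only detail taken from the description above is the `ForReview` subdirectory name.

```python
import shutil
from pathlib import Path


def move_for_review(current_file, subdir_name="ForReview"):
    """Move current_file into a ForReview subdirectory beside it.

    Hypothetical helper: it mimics the behavior described for the
    Move Current File control, not the application's actual code.
    """
    current_file = Path(current_file)
    target_dir = current_file.parent / subdir_name
    target_dir.mkdir(exist_ok=True)  # create the subdirectory on first use
    destination = target_dir / current_file.name
    shutil.move(str(current_file), str(destination))
    return destination
```

Calling `move_for_review` on a file returns the new path inside `ForReview`, so a reviewer (or a script) can keep a running list of what was set aside.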

Additional features not displayed include a full text search capability as well as the ability to snap the table that is displayed into Excel.

We think when you want to use Amazon Mechanical Turk or freelancers to help isolate specific data you can email them the relevant snipped tables or documents that you have extracted with the new SmartBrowser.  Having access to the SmartBrowser will allow them to focus directly on data collection rather than file management.

The release will add features to smooth out the installation on managed desktops (those environments where users do not have installation rights) and for scenarios where multiple users share one computer with the Search, Extraction & Normalization Engine accessible to all logged-in users.  We will also push out the code update that addresses files being renamed as htm files when they are actually in txt format.

Update 7/20/2016 – now available.  For a download link please email Dr. Burch Kealey.


Old (but New) Compensation Data

We started pushing Executive Compensation data included in Proxies/10-Ks for Fiscal Years that ended before 12/15/2006 to our distribution server today.  This data is not easily comparable to data filed after these rule changes.  The SEC specifically prepared for this lack of comparability by noting “. . .companies will not be required to “restate” compensation or related person transaction disclosure for fiscal years for which they previously were required to apply our rules prior to the effective date of today’s amendments. This means, for example, that only the most recent fiscal year will be required to be reflected in the revised Summary Compensation Table when the new rules and amendments applicable to the Summary Compensation Table become effective, and therefore the information for years prior to the most recent fiscal year will not have to be presented at all.”  If you have downloaded EC tables from our server from Proxies and 10-Ks filed in 2007 and 2008 you should have noticed how those tables typically only included one and then two years of data.  The SEC’s decision to allow a fresh-start with the new disclosure requirements explains that pattern.

I am highlighting this because it is important to think about the validity of a compensation data time series that crosses these two reporting regimes.  There are significant measurement and disclosure differences before and after a registrant’s adoption of the new form disclosures.  For example, registrants were allowed to delay the adoption of FAS 123R until the fiscal year in which they became subject to the new disclosure regime (SEC Press Release).

To highlight and make sure our clients are aware of the differences in measurement we decided not to normalize this data to the same structure we normalize post-2006 data.  For example, prior to the regulatory changes there was not a total column.  We are not including a total column.  The total is really indeterminate because of some significant variability in how individual companies measured equity type awards.  Most companies only reported the number of options granted – there was no monetary value established.

To understand the significance of these differences please compare the Summary Compensation Tables from Abbott Laboratories’ 2006 and 2007 Proxy filings.


Notice that the option column reports the number of “Securities Underlying Options/SARs”.  Also notice that there is no total reported.

Here is the Summary Compensation Table from their 2007 Proxy  (first year under the new disclosure rules).


One year of data and a monetary value (determined by application of FAS 123R) for the Option Awards.  Further, despite similar column headings, the process used to determine the amounts included in most of the other columns changed substantially.  For example, the new rules state that “As we proposed, compensation that is earned, but for which payment will be deferred, must be included in the salary, bonus or other column, as appropriate.”

To illustrate the results from pulling this data from our server the next image has the summary data file created using a request file for 2006 and 2007.


As I noted earlier, we are not adding a calculated Total for data that does not originally include a total.

Where to Start

At least twice a month I get an email from someone starting a new project looking for some direction.  My first questions back to them are almost always about the work they have done to identify their sample and the disclosure requirements that their sample is subject to.  These are critical first steps that often get overlooked.  Our experience is that spending some time at the early stages of a project addressing these questions can significantly reduce the stress and uncertainty of data collection because it allows you to get away from the nagging question – why did I not find [some data item]?

While I described sample selection and disclosure requirements as separate steps, they are really hard to separate from one another because they are so intertwined.  One area where they are separate, though, is coverage in the most common databases used with EDGAR data.  For competitive reasons I probably can’t name the two elephants in the room but most of you know who they are.  Their products do not cover all SEC registrants.  While their populations may represent more than 98% of the capitalization of the US equity markets, in sheer numbers that is just a subset of all of the SEC registrants (best guess, less than 1/2).

Thus I think a critical first step is always to use their filtering tools to identify companies in their population that have the data items you need and meet your sample criteria.  One important decision is whether you need sample companies that have publicly traded equity.  Commercial databases include entities that do not have any publicly traded equity but are SEC registrants.  It is not enough to confirm that they have a Central Index Key (CIK).  A related complication is the case where the commercial database includes data for the entity that has publicly traded equity as well as data for the subsidiaries that have filing obligations (usually because of publicly traded debt).  One example of this I like to share is Entergy.  Here is a shot of the bottom of the landing page for their 2015 FYE 10-K.


While I can only display five, there are seven registrants who simultaneously filed that same 10-K.  Only CIK 65984 (ENTERGY CORP /DE/ – the second listed) has publicly traded equity.  The other entities have various other securities that require public disclosure of their financial performance but do not have any proxy reporting requirements.

It took a long time to get here – the point is that without careful filtering users can get frustrated because they will not find compensation or director information for any of the other entities.  Some compensation data for subsidiary officers will be listed in the DEF 14A of Entergy Corp simply because their compensation flows to the parent and they will be among the five highest compensated officers of the combined entity.

Another area that is tricky has to do with regulatory disclosure changes.  Until SOX, Director Compensation disclosures were generally about the schedule used to compensate directors rather than precise details about compensation paid to individual directors.  The rules governing Director Compensation can be reviewed here.  The rules became effective for fiscal years that ended after 12/15/2006.  When trying to build a sample of Director Compensation data, keep in mind that companies like Apple (late-September FYE, 52/53 week FYs) did not provide a compensation table until their proxy filing on January 23, 2008.  Without investing time in understanding these disclosure requirements before developing their sample, users can expend considerable effort fruitlessly searching Proxy and then 10-K filings for data that is not available.
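
The effective-date rule is simple enough to encode as a sample filter. Here is a minimal Python sketch; the function name `under_new_disclosure_rules` is my own, and the only fact it relies on is the 12/15/2006 cutoff stated above.

```python
from datetime import date

# Cutoff from the text: the rules apply to fiscal years ending AFTER 12/15/2006.
NEW_RULES_CUTOFF = date(2006, 12, 15)


def under_new_disclosure_rules(fiscal_year_end):
    """Return True if a fiscal year ending on this date falls under the new rules."""
    return fiscal_year_end > NEW_RULES_CUTOFF


# Illustrative fiscal year ends (assumed dates, for the example only).
calendar_year_company = date(2006, 12, 31)
late_september_company = date(2006, 9, 30)

print(under_new_disclosure_rules(calendar_year_company))
print(under_new_disclosure_rules(late_september_company))
```

A calendar-year company is covered starting with its 2006 fiscal year, while a company with a late-September year end is not covered until the fiscal year that ends in 2007, which is why its first table shows up in a proxy filed in early 2008.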

One more example has to do with the filing status of issuers.  The SEC has a graduated filing schedule and disclosure requirements based on the filing status of the registrants.  Companies that meet the definition of a Smaller Reporting Company have a choice to meet the full requirements of Regulation S-K or scaled disclosure requirements (described here).  In total there are 12 differences in the disclosure requirements for companies that qualify for and elect the relief available under these regulations.   These scaled disclosure requirements include the opportunity to omit  disclosures for Risk Factors (Item 1A) as well as reduce content in many other areas of a filing.  The SEC noted in 2008 that approximately 1/2 of 10-K filers would be eligible for this relief.

If you are trying to collect data from an area of the 10-K covered by these scaled disclosure opportunities your sample is going to be greatly affected by decisions made by the registrants.  We know that at least 1/3 of the 10-K filers at any one time do not provide disclosure about their Risk Factors through the relief offered under the scaled disclosure requirements for smaller reporting companies.  Here is a screenshot of a 10-K form from one registrant in 2016 that chose that disclosure regime.


The frustrating part about the scaled disclosure is that companies implement their choices in a number of ways.  The above screenshot makes the choice clear.  Other registrants might use the word omitted, which is also clear.  Our experience though is that more than half will just leave the space blank with no words to indicate the reason for the omission.

There is a lot of detail in this post.  The point I am trying to make is simple though – it is really in your best interest to start research that requires the collection of data from EDGAR filings with a careful filtering of your commercial dataset.  Apply as many filters as you can so that you do not spend any time looking for (or cleaning) data that is not going to ultimately be used.


Managing Complex Searches

I was helping someone build a complex search today.  The goal was to identify all 10-K filings with some mention of Last-in, First-out inventory and confirm whether the auditor for the companies that mention LIFO belonged to a particular set of auditors.  Their auditor list was not too large (27 names).  Because we were not initially sure exactly what search to construct the search string started getting messy. We were not seeing some results that we expected and I wanted to clean it up so that our customers could better understand some of the choices we were making.

Unfortunately, our search box only displays the most recent 100 characters (including spaces) but our search parser will parse search strings with as many as 16,000 characters.  While the amount of the input that is visible is limited, our application does allow users to paste in a search string that they have built outside the application.

So I need to take a break and throw in a plug for a really useful (and free) utility called Notepad++.  It has an amazing feature set.  If you are writing code (Python and Perl and many others) Notepad++ offers syntax highlighting.  It is a lot easier on memory than the included Notepad in the sense that it opens large files much more easily.  It has an amazing search facility.  One of the reasons I use it is that it helps me balance parentheses when I am writing a complex search.

When I start getting confused with the search I am trying to construct I stop trying to build it in our search box and switch to Notepad++.

As I mentioned above, I started building the search inside the Search, Extraction & Normalization Engine.  Here is a screenshot of part of the search; we can only see about 100 characters at a time.


I was not seeing results from some auditing firms that I expected and so I decided the best thing to do was to build the search in Notepad++.  If you place your mouse in the Search Phrase box and use the right-click button there is a context choice – select all.  So I did that to select and then copy my existing search phrase from our application to Notepad++ (NPP).


I pasted it squished down so you can see that there are 732 characters in my search.  Just pasting it into NPP isn’t enough; I want to see the search.  So I started adding line breaks after operators or terms.  Here is an image of the search reorganized in NPP:


Organizing the search this way sure makes it a lot easier to read.  I have two clumsily drawn circles in the image up there.  The smaller one at the top illustrates the parentheses matching.  If your cursor is next to a parenthesis it will change color (you can set the color you want, the default is red) and the matching parenthesis will also light up.  That is a really useful feature that I use a lot when building nested searches.  You can easily tell if you have unbalanced parentheses by placing the cursor next to one and then scrolling up or down to look for another with the same highlighting.  If you don’t see one then you have an unbalanced parenthesis.
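
If you would rather have a script do the checking, the same balance test is easy to sketch in Python. This is just an illustration of the idea, not something built into our application or into Notepad++. The function name `find_unbalanced` and the sample search string are made up for the example, and the string is not meant to show our exact search syntax.

```python
def find_unbalanced(search):
    """Return the positions of parentheses that have no matching partner."""
    open_positions = []  # positions of "(" still waiting for a ")"
    unmatched = []       # positions of ")" that never had a "("
    for index, char in enumerate(search):
        if char == "(":
            open_positions.append(index)
        elif char == ")":
            if open_positions:
                open_positions.pop()  # this ")" closes the most recent "("
            else:
                unmatched.append(index)
    # Any "(" left on the stack was never closed.
    return sorted(unmatched + open_positions)


# Made-up search fragment with a missing closing parenthesis.
sample = '(lifo or "last in") andany (fifo'
print(find_unbalanced(sample))
```

An empty list means every parenthesis has a partner; otherwise the numbers are the character positions to go look at.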

The second, longer red ‘circle’ illustrates that I think it is always a best practice to indent each new line of search content one space.  The reason for this is if you hit return and then start typing in the first column of the new line the parser will run together the last word (or operator) of the previous line with the first word (or operator) of the current line.
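
Another way to sidestep the run-together problem is to flatten the search to a single line before pasting it, joining the lines with a space. Here is a minimal Python sketch under that assumption. The function name `flatten_search` is hypothetical, and the 16,000 character limit is the parser limit mentioned earlier in this post.

```python
MAX_SEARCH_LENGTH = 16000  # parser limit mentioned earlier in the post


def flatten_search(multiline_search):
    """Join the lines of a search with single spaces so no terms run together."""
    parts = [line.strip() for line in multiline_search.splitlines()]
    flat = " ".join(part for part in parts if part)  # skip blank lines
    if len(flat) > MAX_SEARCH_LENGTH:
        raise ValueError(
            f"search is {len(flat)} characters; the limit is {MAX_SEARCH_LENGTH}"
        )
    return flat
```

Because each line is stripped and then rejoined with exactly one space, it does not matter whether the lines were indented in the editor.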

Once the search was in NPP it was easy to analyze.  The problem we were having turned out to be an unbalanced parenthesis and typos (read misspellings).  Once those were corrected we just selected our search (as it was displayed in NPP – we did not have to get rid of the line breaks) and then pasted it back into the search box.


Notice that the line breaks are preserved.  The parser will take care of the line breaks when we submit the search.

For those of you new to our software, the andany operator is not used in the process of selecting documents.  Rather, it is used to count instances of other words or phrases in the documents selected with the other operators.  While our focus in this search was on LIFO, we also want to inspect references to FIFO to make sure we properly identify LIFO adopters (that explains the first andany operator).  The second, in front of the list of auditors, is there to identify the auditor(s) associated with the companies that do mention LIFO in their 10-K.

In the search results displayed below – there were 2,439 documents found.  These documents were selected because:

  1. They had either the phrase “last in” (hyphenated or not)
  2. or they had the term LIFO
  3. and they were 10-K documents (not exhibits)
  4. and I set a date filter to select only filings made between 1/1/2012 and 5/30/2016

Because of the andany operators we could check on their auditor and find their context for FIFO if they mentioned either in the document.
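
To make the distinction between selecting and counting concrete, here is a rough Python sketch of the logic. It is only an illustration of the criteria listed above, not how our search engine is implemented. The names `is_selected` and `count_terms` are made up, and the auditor name in the usage lines is a placeholder rather than one taken from the actual list.

```python
import re
from datetime import date

# Selection terms from the list above: "last in" (hyphenated or not) or LIFO.
SELECT_PATTERN = re.compile(r"\blast[- ]in\b|\blifo\b", re.IGNORECASE)


def is_selected(text, form_type, filing_date,
                start=date(2012, 1, 1), end=date(2016, 5, 30)):
    """Apply the selection criteria: form type, date window, and search terms."""
    return (
        form_type == "10-K"
        and start <= filing_date <= end
        and SELECT_PATTERN.search(text) is not None
    )


def count_terms(text, terms):
    """Count occurrences of extra terms; these counts never affect selection."""
    lowered = text.lower()
    return {term: lowered.count(term.lower()) for term in terms}


doc = "Inventories are valued using the last-in, first-out (LIFO) method; some use FIFO."
if is_selected(doc, "10-K", date(2014, 3, 1)):
    print(count_terms(doc, ["FIFO", "KPMG"]))
```

The point of the split is that a document with no FIFO mention and no matching auditor name is still selected; it simply shows zero counts for those terms.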