Old (but New) Compensation Data

We started pushing Executive Compensation data included in Proxies/10-Ks for Fiscal Years that ended before 12/15/2006 to our distribution server today.  This data is not easily comparable to data filed under the SEC’s revised compensation disclosure rules, which apply to later fiscal years.  The SEC specifically prepared for this lack of comparability by noting “. . .companies will not be required to “restate” compensation or related person transaction disclosure for fiscal years for which they previously were required to apply our rules prior to the effective date of today’s amendments. This means, for example, that only the most recent fiscal year will be required to be reflected in the revised Summary Compensation Table when the new rules and amendments applicable to the Summary Compensation Table become effective, and therefore the information for years prior to the most recent fiscal year will not have to be presented at all.”  If you have downloaded EC tables from our server from Proxies and 10-Ks filed in 2007 and 2008 you should have noticed that those tables typically included only one and then two years of data.  The SEC’s decision to allow a fresh start with the new disclosure requirements explains that pattern.

I am highlighting this because it is important to think about the validity of a compensation data time series that crosses these two reporting regimes.  There are significant measurement and disclosure differences before and after a registrant’s adoption of the new form disclosures.  For example, registrants were allowed to delay the adoption of FAS 123R until the fiscal year in which they became subject to the new disclosure regime (SEC Press Release).

To make sure our clients are aware of the differences in measurement, we decided not to normalize this data to the same structure we use for post-2006 data.  For example, prior to the regulatory changes there was not a total column.  We are not including a total column.  The total is really indeterminate because of some significant variability in how individual companies measured equity-type awards.  Most companies only reported the number of options granted – there was no monetary value established.

To understand the significance of these differences please compare the Summary Compensation Tables from Abbott Laboratories’ 2006 and 2007 Proxy filings.


Notice that the option column reports the number of “Securities Underlying Options/SARs”.  Also notice that there is no total reported.

Here is the Summary Compensation Table from their 2007 Proxy  (first year under the new disclosure rules).


One year of data and a monetary value (determined by application of FAS 123R) for the Option Awards. Further, despite similar column headings, the process for determining the amounts included in most of the other columns changed substantially.  For example, the new rules state that “As we proposed, compensation that is earned, but for which payment will be deferred, must be included in the salary, bonus or other column, as appropriate.”

To illustrate the results from pulling this data from our server the next image has the summary data file created using a request file for 2006 and 2007.


As I noted earlier, we are not adding a calculated Total for data that does not originally include a total.

Where to Start

At least twice a month I get an email from someone starting a new project looking for some direction.  My first questions back to them are almost always about the work they have done to identify their sample and the disclosure requirements that their sample is subject to.  These are critical first steps that often get overlooked.  Our experience is that spending some time at the early stages of a project addressing these questions can significantly reduce the stress and uncertainty of data collection because it allows you to get away from the nagging question – why did I not find [some data item]?

While I described sample selection and disclosure requirements as multiple steps, they are really hard to separate from one another because they are so intertwined.  One area where they are separate, though, is coverage in the most common databases used with EDGAR data.  For competitive reasons I probably can’t name the two elephants in the room but most of you know who they are.  Their products do not cover all SEC registrants.  While their populations may represent more than 98% of the capitalization of the US equity markets, in sheer numbers that is just a subset of all of the SEC registrants (best guess, less than 1/2).

Thus I think a critical first step is always to use their filtering tools to identify companies in their population that have the data items you need and meet your sample criteria.  One important filtering decision is whether you need sample companies that have publicly traded equity.  Commercial databases include entities that do not have any publicly traded equity but are SEC registrants.  It is not enough to confirm that they have a Central Index Key (CIK).  A related complication is the case where the commercial database includes data for the entity that has publicly traded equity as well as data for the subsidiaries that have filing obligations (usually because of publicly traded debt).  One example of this I like to share is Entergy.  Here is a shot of the bottom of the landing page for their 2015 FYE 10-K.


While I can only display five, there are seven registrants who simultaneously filed that same 10-K.  Only CIK 65984 (ENTERGY CORP /DE/ – the second listed) has publicly traded equity.  The other entities have various other securities that require public disclosure of their financial performance but do not have any proxy reporting requirements.

It took a long time to get here – the point is that without careful filtering users can get frustrated because they will not find compensation or director information for any of the other entities.  Some compensation data for subsidiary officers will be listed in the DEF 14A of Entergy Corp simply because their compensation flows to the parent and they will be among the five highest compensated officers of the combined entity.

Another area that is tricky has to do with regulatory disclosure changes.  Until SOX, Director Compensation disclosures were generally about the schedule used to compensate directors rather than precise details about compensation to individual directors.  The rules governing Director Compensation can be reviewed here.  The rules became effective for fiscal years that ended after 12/15/2006.  When trying to build a sample of Director Compensation data, companies like Apple (late-September FYE, 52/53 week FYs) do not provide a compensation table until their proxy filing on January 23, 2008.  Users who do not invest time in understanding these disclosure requirements before developing their sample can expend considerable effort fruitlessly searching Proxy and then 10-K filings for data that is not available.
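One way to make this check routine is to tag every fiscal year end in a sample with the disclosure regime it falls under before searching for any data.  The Python below is only a minimal sketch and is not part of our software: the function name is made up for illustration, the example dates are hypothetical, and it follows this post's wording of the cutoff (fiscal years that ended after 12/15/2006), so confirm how the rule treats the boundary date itself before relying on it.

```python
from datetime import date

# Cutoff as worded in this post: the new rules apply to fiscal years
# that ended after 12/15/2006. Treatment of the boundary date itself
# is an assumption here.
NEW_RULES_CUTOFF = date(2006, 12, 15)


def disclosure_regime(fiscal_year_end: date) -> str:
    """Return 'new' if the fiscal year ended after the cutoff, else 'old'."""
    return "new" if fiscal_year_end > NEW_RULES_CUTOFF else "old"


# Hypothetical fiscal year ends, not taken from any particular filing.
for fye in (date(2006, 9, 30), date(2006, 12, 31), date(2007, 9, 29)):
    print(fye.isoformat(), disclosure_regime(fye))
```

A late-September fiscal year end in 2006 lands in the old regime, which is why a company on that calendar does not show an individual Director Compensation table until the proxy covering its following fiscal year.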

One more example has to do with the filing status of issuers.  The SEC has a graduated filing schedule and disclosure requirements based on the filing status of the registrants.  Companies that meet the definition of a Smaller Reporting Company have a choice to meet the full requirements of Regulation S-K or scaled disclosure requirements (described here).  In total there are 12 differences in the disclosure requirements for companies that qualify for and elect the relief available under these regulations.   These scaled disclosure requirements include the opportunity to omit  disclosures for Risk Factors (Item 1A) as well as reduce content in many other areas of a filing.  The SEC noted in 2008 that approximately 1/2 of 10-K filers would be eligible for this relief.

If you are trying to collect data from an area of the 10-K covered by these scaled disclosure opportunities your sample is going to be greatly affected by decisions made by the registrants.  We know that at least 1/3 of the 10-K filers at any one time do not provide disclosure about their Risk Factors through the relief offered under the scaled disclosure requirements for smaller reporting companies.  Here is a screenshot of a 10-K form from one registrant in 2016 that chose that disclosure regime.


The frustrating part about the scaled disclosure is that companies implement their choices in a number of ways.  The above screenshot makes it clear.  Other registrants might use the word “omitted”, which is also clear.  Our experience though is that more than half will just leave the space blank with no words to indicate the reason for the omission.
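If you have already extracted the Item 1A text for each filing, a rough first pass can sort sections into these cases before you read any of them.  This is a sketch under assumptions: the function name, the phrase list, and the 50-word threshold are all illustrative choices of mine, not rules from the SEC or from our software, and the input is assumed to be the section body with the heading already removed.

```python
import re

# Phrases that suggest an explicit omission. Illustrative, not exhaustive.
EXPLICIT_OMISSION = re.compile(
    r"\b(not\s+applicable|not\s+required|omitted|smaller\s+reporting\s+compan(?:y|ies))\b",
    re.IGNORECASE,
)


def classify_risk_factor_section(section_text: str, min_words: int = 50) -> str:
    """Label Item 1A text as 'blank', 'explicit_omission', 'short_unclear' or 'disclosed'."""
    words = section_text.split()
    if not words:
        # Nothing at all under the heading.
        return "blank"
    if len(words) < min_words:
        # Short sections are either a stated omission or need a manual look.
        if EXPLICIT_OMISSION.search(section_text):
            return "explicit_omission"
        return "short_unclear"
    return "disclosed"
```

Anything labeled `short_unclear` is worth opening by hand, since a short section could be a cross-reference to another part of the filing rather than an omission.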

There is a lot of detail in this post.  The point I am trying to make is simple though – it is really in your best interest to start research that requires the collection of data from EDGAR filings with a careful filtering of your commercial dataset.  Apply as many filters as you can so that you do not spend any time looking for (or cleaning) data that is not going to ultimately be used.


Managing Complex Searches

I was helping someone build a complex search today.  The goal was to identify all 10-K filings with some mention of Last-in, First-out inventory and confirm whether the auditor for the companies that mention LIFO belonged to a particular set of auditors.  Their auditor list was not too large (27 names).  Because we were not initially sure exactly what search to construct the search string started getting messy. We were not seeing some results that we expected and I wanted to clean it up so that our customers could better understand some of the choices we were making.

Unfortunately, our search box only displays the most recent 100 characters (including spaces) but our search parser will parse search strings with as many as 16,000 characters.  While the amount of the input that is visible is limited our application does allow users to paste in their search string that they might build outside the application.

So I need to take a break and throw in a plug for a really useful (and free) utility called Notepad++.  It has an amazing feature set.  If you are writing code (Python and Perl and many others) Notepad++ offers syntax highlighting.  It is a lot easier on memory than the included Notepad in the sense that it opens large files much more easily.  It has an amazing search facility.  One of the reasons I use it is that it helps me balance parentheses when I am writing a complex search.

When I start getting confused with the search I am trying to construct I stop trying to build it in our search box and switch to Notepad++.

As I mentioned above, I started building the search inside the Search, Extraction & Normalization Engine. Here is a screenshot of part of the search; we can only see about 100 characters at a time.


I was not seeing results from some auditing firms that I expected and so I decided the best thing to do was to build the search in Notepad++.  If you place your mouse in the Search Phrase box and use the right-click button there is a context choice – select all.  So I did that to select and then copy my existing search phrase from our application to Notepad++ (NPP).


I pasted it squished down so you can see that there are 732 characters in my search.  Just pasting it into NPP isn’t enough, I want to see the search.  So I started adding line breaks after operators or terms. Here is an image of the search reorganized in NPP:


Organizing the search this way sure makes it a lot easier to read.  I have two clumsily drawn circles in the image up there.  The smaller one at the top illustrates the parentheses matching.  If your cursor is next to a parenthesis it will change color (you can set the color you want, the default is red) and the matching parenthesis will also light up.  That is a really useful feature that I use a lot when building nested searches. You can easily tell if you have unbalanced parentheses by placing the cursor next to one and then scrolling up or down to look for another with the same highlighting.  If you don’t see one then you have an unbalanced parenthesis.
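The same check can be scripted if you would rather not hunt for highlighting by eye.  The function below is a minimal sketch in Python (the name is mine, and it is not part of our application): it walks the search string once, keeps a stack of open parentheses, and reports the character offset of every parenthesis left without a partner.  It does not know anything about our search syntax, so a parenthesis inside a quoted phrase would be counted like any other.

```python
from typing import List


def find_unbalanced_parentheses(search: str) -> List[int]:
    """Return the character offsets of parentheses that have no partner."""
    open_positions = []  # offsets of '(' still waiting for a ')'
    unmatched = []       # offsets of ')' that arrived with nothing open
    for offset, char in enumerate(search):
        if char == "(":
            open_positions.append(offset)
        elif char == ")":
            if open_positions:
                open_positions.pop()
            else:
                unmatched.append(offset)
    # Whatever is still open at the end is also unmatched.
    return sorted(unmatched + open_positions)
```

An empty list means the parentheses balance; otherwise the offsets tell you where to put the cursor.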

The second, longer red ‘circle’ illustrates that I think it is always a best practice to indent each new line of search content one space.  The reason for this is if you hit return and then start typing in the first column of the new line the parser will run together the last word (or operator) of the previous line with the first word (or operator) of the current line.
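If you prefer to collapse the search back to a single line before pasting, joining the lines with a space guards against the same run-together problem.  This is a sketch only: the function name is hypothetical, and the 16,000-character limit is simply the parser limit mentioned earlier in this post.

```python
MAX_SEARCH_LENGTH = 16_000  # parser limit quoted earlier in this post


def flatten_search(multiline_search: str) -> str:
    """Join the lines of a search with single spaces so no terms run together."""
    pieces = [line.strip() for line in multiline_search.splitlines()]
    flat = " ".join(piece for piece in pieces if piece)
    if len(flat) > MAX_SEARCH_LENGTH:
        raise ValueError(
            f"search is {len(flat)} characters; limit is {MAX_SEARCH_LENGTH}"
        )
    return flat
```

Because every line is stripped and rejoined with exactly one space, it does not matter whether the lines were indented in the editor.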

Once the search was in NPP it was easy to analyze.  The problem we were having turned out to be an unbalanced parenthesis and typos (read misspellings).  Once those were corrected we just selected our search (as it was displayed in NPP – we did not have to get rid of the line breaks) and then pasted it back into the search box.


Notice that the line breaks are preserved.  The parser will take care of the line breaks when we submit the search.

For those of you new to our software, the andany operator is not used in the process of selecting documents.  Rather it is used to count instances of other words or phrases in the documents selected with the other operators.  While our focus in this search was on LIFO we also want to inspect references to FIFO to make sure we properly identify LIFO adopters (that explains the first andany operator).  The second, in front of the list of auditors, is to identify the auditor(s) associated with the companies who do mention LIFO in their 10-K.

In the search results displayed below – there were 2,439 documents found.  These documents were selected because:

  1. They had either the phrase “last in” (hyphenated or not)
  2. or they had the term LIFO
  3. and they were 10-K documents (not exhibits)
  4. and I set a date filter to select only filings made between 1/1/2012 and 5/30/2016

Because of the andany  operators we could check on their auditor and find their context for FIFO if they mentioned either in the document.
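For readers who think in code, the logic of that search can be imitated in a few lines of Python.  This is not how our parser is implemented; it is a sketch that assumes each document is a dict with hypothetical keys `text`, `form` and `filed`.  The selection terms decide which documents are kept, and the extra terms are only counted afterwards, which is the role andany plays.

```python
import re
from datetime import date

# Selection criteria from the list above: "last in" (hyphenated or not) or LIFO.
SELECT = re.compile(r"\blast[\s-]+in\b|\blifo\b", re.IGNORECASE)


def select_and_count(documents, count_terms, start, end):
    """Select 10-K documents mentioning LIFO in a date window, then count other terms.

    `documents` is an iterable of dicts with the hypothetical keys
    'text' (str), 'form' (str) and 'filed' (datetime.date).
    """
    results = []
    for doc in documents:
        if doc["form"] != "10-K":
            continue  # skip exhibits and other document types
        if not (start <= doc["filed"] <= end):
            continue  # outside the date filter
        if not SELECT.search(doc["text"]):
            continue  # no LIFO mention, so not selected
        # Counting never affects selection; it only annotates selected documents.
        counts = {
            term: len(
                re.findall(r"\b" + re.escape(term) + r"\b", doc["text"], re.IGNORECASE)
            )
            for term in count_terms
        }
        results.append({"filed": doc["filed"], "counts": counts})
    return results
```

A document that mentions only FIFO is never selected, while a selected document with a zero count for FIFO or for every auditor name is still returned.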


Temporary Bug Fix

If you are using the Search, Extraction & Normalization Engine to download Item Sections of 10-K filings you might have noticed some strange formatting issues with some of those.  Here is an example:


This is a txt file.  However, because of a coding mistake I made, the file will be saved in your working directory as an htm file.  I don’t know what I was thinking.  The file comes from our server as a txt file but I inserted a line to rename it with the htm extension.

This has no effect on using other components of our software on the file.  That is you still get a valid word count and word frequencies, you can still index the files, they are just going to look ugly in the search results.  Here is the result of running the file through the Extraction/CountWordFrequencies part of the application.


I have fixed that part of the code.  I am not intending to push out the fix until we finish the next update (sort of a roll-your-own directEDGAR).  However, if you would like the fix sooner just email me.  This particular fix does not require a complete new install of directEDGAR – I will send you a small file to install in the application folder.

Thank Goodness for directEDGAR

I have been spending the last couple of days working on some Python Regular Expressions to try to reduce the time it takes to snip ITEM sections of 10-K filings.  When I sat down to start this task  I opened up a few 10-K filings to remind myself of some of the peculiar details associated with their structure.

There are a couple of important details about the way we have set up our filing structure that really make this job a lot simpler than the alternative of spidering EDGAR based on reading an index to identify 10-K filings.  I want to share those in this post because these details are important.

First, I want to focus on 10-K forms, not the entire 10-K filing.  If I were spidering EDGAR or trying to download filings I would have to figure out what part of the txt file is the 10-K form and what part is not the 10-K form.  With directEDGAR’s Search, Extraction & Normalization Engine I can just ask for a list of 10-K forms and get their paths. This is important because when I am writing a regular expression for ITEM within one or more spaces of SOME NUMBER or a particular NUMBER/LETTER combination (7a) I don’t want to have to think through the code to exclude false positives like the ones in the following image:


With the Search, Extraction & Normalization Engine I can get a list of the full file paths to all 10-K forms (excluding the exhibits) in about 2 minutes.  This saves me hours of time and lets me focus on my real task.

Another real benefit of using our architecture is that it allows me to visually inspect the original filing very easily when I am not getting the results I am looking for.  As an example, I was still not getting the results I needed from a significant but small minority of 10-K forms.  Thus, I decided to inspect them to see why my Regular Expression was not identifying the end of the MD&A section of these filings.  Since directEDGAR stores the files locally I could just get a print of one of the paths and paste it into my browser:


That particular path is for a filing that had an ITEM 7 and had ITEM 7A but I could not identify the ITEM 7A starting point (which is the ending point for the ITEM 7).  The problem is a bit more convoluted – but my Regular Expression was not working on this filing.  So I need to inspect the filing to understand the problem.  By having the file stored locally I can easily find and review it.  Here is what I found:


A silly colon.  So far I had found a dot (.) and one or more dashes (-) following the item designation but I had not yet seen a colon.  Seeing this led me to modify my Regular Expression to anticipate the possibility of a colon in the line I am looking for.
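To make the idea concrete, here is a minimal Python sketch of a heading pattern that tolerates a dot, one or more dashes, or a colon after the item designation.  It is not the Regular Expression I actually use; the pattern names and the helper function are made up for illustration, and real filings need more handling (a table of contents entry, for example, would match first).

```python
import re

# Separators seen after the item designation: dot, hyphen, en/em dash, colon.
SEPARATORS = r"[.:\-\u2013\u2014]*"

# "ITEM 7" at the start of a line, but not "ITEM 7A" or "ITEM 70".
ITEM_7 = re.compile(
    r"^\s*ITEM\s+7(?![0-9A])\s*" + SEPARATORS, re.IGNORECASE | re.MULTILINE
)
# "ITEM 7A" at the start of a line, followed by any of the separators.
ITEM_7A = re.compile(
    r"^\s*ITEM\s+7A\s*" + SEPARATORS, re.IGNORECASE | re.MULTILINE
)


def extract_item_7(text):
    """Return the text between the ITEM 7 heading and the ITEM 7A heading, or None."""
    start = ITEM_7.search(text)
    if start is None:
        return None
    # Look for ITEM 7A only after the ITEM 7 heading.
    end = ITEM_7A.search(text, start.end())
    if end is None:
        return None
    return text[start.end():end.start()].strip()
```

Keeping the separators in one character class means the next oddity (a semicolon, say) is a one-character change rather than a new pattern.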

I have pasted the new Regular Expression and the output (I am not the world’s best REGEX creator so don’t fixate on the one below) so you can see that I am now successfully pointing to the ITEM 7A position I need in the document to pull the ITEM 7.


Without directEDGAR’s advanced tools and architecture I would have to first download the document (accession-number.txt file).  Next I would have to find some way to focus my work on just the 10-K form of the document and then I would have to develop some mechanism to visually inspect the original document to determine the cause of missing or unexpected results.  All of these frictions are handled for you with directEDGAR.



Accelerated Extraction of Executive Compensation Data

Because of some hard work on the part of our team we just turned on  Near Real-Time delivery of Executive and Director Compensation data as well as the 10-K Item sections.  We were processing these filings and extracting the data during a midnight run – now we are processing them just a few seconds or minutes after they are made available on EDGAR.

This is one of the struggles I have with marketing.  What exactly is real-time?  We can’t actually state that the data is made available concurrent with the filing since the filing has to happen first.  However, we are now much closer to real-time than we have ever been before.

So for example, if you were interested in collecting or analyzing Executive Compensation data from ENERNOC INC (CIK 1244937) (The World Leader in Energy Intelligence Software), our platform made their normalized Executive Compensation data available for analysis about two minutes after their proxy filing was visible on EDGAR.

Here is an image of the raw data as it appeared in their proxy filing:


Here is an image of the processed results  (I had to trim some of the identifiers out of this image).


Notice the addition of the CIK of the officers as well as their gender.  This data was available just two short minutes after their proxy was available from EDGAR.

New Data Coming to EDGAR: Form C

The Jumpstart Our Business Startups Act (the JOBS Act) required the SEC to reduce the regulatory burden on issuers seeking to raise money through crowdfunding and similar platforms.  While it took the SEC a bit more than three years to define their regulatory approach, the new regulations go into effect in May 2016.  The regulation as it appears in the Federal Register is available here (Form C).

Over time this will provide an opportunity to test some unique research questions.  Academic researchers have struggled for years to collect data on companies in their early stages.  These rules will provide access to that data for companies that seek and are successful raising capital from the public under these alternate and early stages of capital raising.  Companies are going to have to file annual financial statements – though it appears that they will have a range of choices with respect to the level of detail and the nature of the certification associated with those financial statements (from certification by the CEO to an audit).  That range of options alone suggests some interesting opportunities to measure the value of an audit.  Potentially, researchers will have data on how markets perceive the value of an audit if there is variation in the choices these early stage companies make when they go public.

We will be watching for the initial filings of Form C’s in May.  If I have not noted this yet, the next step in our development schedule is to provide our users the ability to use our platform to access any filings at any time.