Custom Indexing with directEDGAR

One of the most common requests we get from clients is to search just specific sections of 10-K filings.  The sections they most often want to index are the Risk Factors (Item 1A) and the MD&A (Item 7).  With our new release this is now possible.  You first download the item sections to your local computer and then run the indexing software built into the Search, Extraction & Normalization Engine.

The process to download the various Items from the 10-K to your local computer is described in the Help file for the Search, Extraction & Normalization Engine.  Briefly, you need a CSV file with a list of CIKs, the years and the period focus (PF).  I am going to cheat in this example and use the list of most recently filed 10-Ks to generate my request file.
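
If you would rather build the request file with a script, here is a minimal sketch.  The header names (CIK, YEAR, PF) and the PF value "FY" are assumptions for illustration only; the Help file is the authority on the exact layout the application expects.

```python
import csv


def write_request_file(path, rows):
    """Write (CIK, year, period focus) rows to a CSV request file.

    The header names CIK, YEAR and PF are assumptions for this sketch;
    check the Help file for the layout the application actually expects.
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["CIK", "YEAR", "PF"])
        for cik, year, pf in rows:
            writer.writerow([cik, year, pf])


# Hypothetical request: Monsanto's CIK, 2016, with an assumed PF value of "FY".
requests = [(1110783, 2016, "FY")]
write_request_file("request.csv", requests)
```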


You can see my request file in the image above.  However, to show off a bit I also included a screenshot of our internal application that tracks the handling of each filing.  The image above is a bit fuzzy – the first CIK in the list is Monsanto (1110783).  If you go to EDGAR and check for the filing date and time of their 2016 10-K you will see that it was filed on EDGAR on 10/19/2016 at 3:31 CT (4:31 ET).  The details in our log above tell me that it was pushed to our server at 3:53 CT (4:53 ET).  So this Risk Factors section was available to our clients less than 30 minutes after the filing was made on EDGAR.  Pretty amazing.

Once the request file has been created, the application interface has the controls to query our server and pull the actual Risk Factors sections for the list of CIK-YEAR pairs.


The process runs very fast.  Typically we can push more than 500 files a minute to your local computer.  However, time of day, our network and your network load all affect the download speed.  When the process is complete all the Risk Factors sections for your CIK-YEAR pairs are in the directory you specified.  They are named using our standard notation: the CIK, the date the filing was made available on EDGAR and, for these particular filings, the balance sheet date.  Each of these files can be individually viewed using our SmartBrowser.


As you can see we have controls to allow you to cycle through these quickly.  However, if you in fact want to have the full range of directEDGAR features to use with these files they need to be indexed.  The indexing process requires that you select a collection of files and specify a destination for the files.  The application will handle all of the intermediate steps.


When the indexing is complete the application automatically adds the index to a directEDGAR_CUSTOM index library and that library is then selected as the active library. So it is only necessary to select the index and begin searching.


The search results will have the hit highlighting and hit-to-hit navigation features.  You can extract context around search results, extract tables, get word counts – in short all of the features associated with the primary indexes are made available on these custom indexes.  And while I ran a simple search above, you can run as complex a search as you need because you have the full power of our search engine managing the background processes.



Using MTurk with directEDGAR

I was helping a PhD student use directEDGAR to snip some tables from a filing.  Because the data is for his dissertation and he is just really getting going on it I am going to be discreet and not go into details about the data he is trying to collect.

While the table snipping went well we observed a pretty significant problem with the Normalization.  Specifically – for the approximately 20,000 tables there were 23,000 unique column headings.  The reason for this is that the disclosure is not highly regulated and so most companies choose their own labels for the columns as well as how they structure the data.
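
To get a feel for how scattered the headings are before committing to a normalization plan, a quick count helps.  This is only a sketch: it assumes the snipped tables have been saved as CSV files with the column headings in the first row, which may not match how your tables are stored.

```python
import csv
from collections import Counter
from pathlib import Path


def count_column_headings(table_dir):
    """Count how often each column heading appears across CSV tables.

    Assumes each table is a CSV file whose first row holds the column
    headings; that layout is an assumption for this sketch.
    """
    counts = Counter()
    for csv_path in sorted(Path(table_dir).glob("*.csv")):
        with open(csv_path, newline="", encoding="utf-8") as handle:
            header = next(csv.reader(handle), [])
        # Collapse case and whitespace so trivially different labels match.
        counts.update(" ".join(h.split()).lower() for h in header if h.strip())
    return counts
```

len(count_column_headings(folder)) gives the number of distinct headings, and .most_common(20) shows which labels are shared widely enough to be worth normalizing.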

The PhD student actually needed at most four values from each table and so we parried back and forth about alternatives to normalizing the data.  Working with that many column headings was going to turn into a monumental task.  Ultimately he decided to use Amazon’s Mechanical Turk (MTurk) platform as a way to crowd-source the collection of the data.  Because I have been intrigued with this service since I first heard about it we volunteered to help him.  I wanted the experience of working with a real use-case to discover how flexible the framework was.

While it was a pretty significant time investment (maybe 40-60 hours) I came away with a real appreciation of how our tools can facilitate and simplify data collection when used with their platform.  I also came away with a strong appreciation for their platform.

We needed to be able to display the snip from the SEC filings to the workers and provide a way for workers to enter data from the displayed table.  To maximize worker efficiency this all had to be done on one page.   Once we understood this it was easy enough to take one of their stock code pages and make the modifications we needed to suit his data collection needs.

Here is a mockup of the data collection page we worked on (note, the table displayed is not the data he is collecting).


It was really neat to set this up because their platform simplifies the management of the data collection process.  The values year1 and year2 are variables that  are determined by the source table.  The Data Value # are the labels for the data he was looking for.  The worker gets to stay on one page, find the data values, enter them and hit submit.  When submitted Amazon’s back end processes the data for him and stores it with indicators for the source table.  In our tests it was taking us each less than a minute to identify and transcribe the data and move to the next one.
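
As I understand their batch workflow, the page template carries placeholders such as ${year1} and ${year2}, and each row of an uploaded CSV file fills them in for one task.  Here is a minimal sketch of building that input file.  The column names and the example URL are hypothetical; they only need to match whatever placeholders your own page uses.

```python
import csv
import io


def build_batch_csv(tables):
    """Return CSV text with one row per table to be shown to a worker.

    The column names (table_url, year1, year2) are hypothetical; they must
    match the ${...} placeholders used in your own page template.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["table_url", "year1", "year2"],
        lineterminator="\n",
    )
    writer.writeheader()
    for table in tables:
        writer.writerow(table)
    return buffer.getvalue()


# One hypothetical task: a hosted table snip and the two years it reports.
example = build_batch_csv([
    {"table_url": "https://example.com/snips/table_0001.htm", "year1": 2015, "year2": 2014},
])
print(example)
```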

You can see I obfuscated the instructions.  While they look busy in this view, they collapse.  Once a user has gotten comfortable with the data collection process they can close them and then have a much cleaner work area.  We also hosted a web page for him with more than 20 marked-up examples of the different forms of the data and detailed explanations of what he is looking for.  The web page is linked to the detailed instructions.

For quality control he plans to have a significant number of these done by two different workers so he can then check for consistency.
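
The consistency check itself can be simple.  The sketch below assumes each submission has been reduced to a (table id, worker id, values) record, which is a hypothetical layout and not the format of Amazon's results file.

```python
from collections import defaultdict


def find_disagreements(submissions):
    """Group worker answers by table and return tables whose answers differ.

    Each submission is a (table_id, worker_id, values) tuple, where values
    holds the transcribed numbers. This record layout is hypothetical.
    """
    by_table = defaultdict(dict)
    for table_id, worker_id, values in submissions:
        by_table[table_id][worker_id] = tuple(values)

    disagreements = {}
    for table_id, answers in by_table.items():
        # Only tables seen by at least two workers can be compared.
        if len(answers) >= 2 and len(set(answers.values())) > 1:
            disagreements[table_id] = answers
    return disagreements
```

Tables that come back from this function are the ones worth a manual look or a third worker.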

This would have been a lot more difficult to crowd-source without his ability to isolate the specific table for review.  If instead he had passed a document and required the workers to find the table the cost would have been significantly higher and I suspect much more difficult to audit.  Further, having the opportunity to crowd-source rather than send it to a data service will probably get him his results much faster.

I see a lot of possibilities for helping some of our customers with this approach.  They can isolate snips or blocks of text and embed them directly into a data collection page with these tools.



Standalone SmartBrowser and Release Coming Soon

We have had several requests for allowing users to provide copies of our software to freelancers who would help with some data cleaning activities.  We just can’t do that.  However, we have identified a middle ground.  We are working on a standalone version of the SmartBrowser.

The SmartBrowser allows users to cycle through a directory with files (tables or documents) that were created by our application without having to open the files manually. Here is an image of the SmartBrowser:


Some of the key features include:

  1. Parsing of file to identify the
    1. CIK
    2. RDATE
    3. CDATE
  2. Text box so you can type CIK so list will load at that position
  3. Ability to select any CIK, again list will load at the position of the selected CIK
  4. Full set of controls to allow you to move through the files including
    1. Previous – loads the last file reviewed
    2. Next – loads the next file in the queue
    3. Delete Current File – deletes the file that is being displayed
    4. Move Current File – creates a subdirectory named ForReview and moves the current file into that subdirectory
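
For readers who script their own file handling, the Move Current File behavior described above can be approximated as follows.  This is a sketch of the idea and not the SmartBrowser's own code.

```python
import shutil
from pathlib import Path


def move_for_review(file_path, subdir_name="ForReview"):
    """Move a file into a ForReview subdirectory beside it, creating it if needed."""
    source = Path(file_path)
    target_dir = source.parent / subdir_name
    target_dir.mkdir(exist_ok=True)
    destination = target_dir / source.name
    shutil.move(str(source), str(destination))
    return destination
```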

Additional features not displayed include a full text search capability as well as the ability to snap the table that is displayed into Excel.

We think when you want to use Amazon Mechanical Turk or freelancers to help isolate specific data you can email them the relevant snipped tables or documents that you have extracted, along with the new SmartBrowser.  Having access to the SmartBrowser will allow them to focus directly on data collection rather than file management.

The release will add features to smooth out the installation on managed desktops (those environments where users do not have installation rights) and for scenarios where multiple users share one computer with the Search, Extraction & Normalization Engine accessible to all logged-in users.  We will also push out the update to the code to address the files that are being renamed as htm files when they are actually txt format.

Update 7/20/2016 – the release is now available.  For a download link please email Dr. Burch Kealey.


Old (but New) Compensation Data

We started pushing Executive Compensation data included in Proxies/10-Ks for Fiscal Years that ended before 12/15/2006 to our distribution server today.  This data is not easily comparable to data filed after these rule changes.  The SEC specifically prepared for this lack of comparability by noting “. . .companies will not be required to “restate” compensation or related person transaction disclosure for fiscal years for which they previously were required to apply our rules prior to the effective date of today’s amendments. This means, for example, that only the most recent fiscal year will be required to be reflected in the revised Summary Compensation Table when the new rules and amendments applicable to the Summary Compensation Table become effective, and therefore the information for years prior to the most recent fiscal year will not have to be presented at all.”  If you have downloaded EC tables from our server from Proxies and 10-Ks filed in 2007 and 2008 you should have noticed how those tables typically only included one and then two years of data.  The SEC’s decision to allow a fresh-start with the new disclosure requirements explains that pattern.

I am highlighting this because it is important to think about the validity of a compensation data time series that crosses these two reporting regimes.  There are significant measurement and disclosure differences before and after a registrant’s adoption of the new form disclosures.  For example, registrants were allowed to delay the adoption of FAS 123R until the fiscal year in which they became subject to the new disclosure regime (SEC Press Release).

To make sure our clients are aware of the differences in measurement we decided not to normalize this data to the same structure we use for post-2006 data.  For example, prior to the regulatory changes there was not a total column.  We are not including a total column.  The total is really indeterminate because of some significant variability in how individual companies measured equity type awards.  Most companies only reported the number of options granted – there was no monetary value established.

To understand the significance of these differences please compare the Summary Compensation Tables from Abbott Laboratories’ 2006 and 2007 Proxy filings.


Notice that the option column reports the number of “Securities Underlying Options/SARs”.  Also notice that there is no total reported.

Here is the Summary Compensation Table from their 2007 Proxy  (first year under the new disclosure rules).


One year of data and a monetary value (determined by application of FAS 123R) for the Option Awards.  Further, despite similar column headings, the process used to determine the amounts included in most of the other columns changed substantially.  For example, the new rules state that “As we proposed, compensation that is earned, but for which payment will be deferred, must be included in the salary, bonus or other column, as appropriate.”

To illustrate the results from pulling this data from our server the next image has the summary data file created using a request file for 2006 and 2007.


As I noted earlier, we are not adding a calculated Total for data that does not originally include a total.

Where to Start

At least twice a month I get an email from someone starting a new project looking for some direction.  My first questions back to them are almost always about the work they have done to identify their sample and the disclosure requirements that their sample is subject to.  These are critical first steps that often get overlooked.  Our experience is that spending some time at the early stages of a project addressing these questions can significantly reduce the stress and uncertainty of data collection because it allows you to get away from the nagging question – why did I not find [some data item]?

While I described the sample selection and disclosure requirements as multiple steps they are really hard to separate from one another because they are so intertwined.  One area where they are separate though is coverage in the most common databases used with EDGAR data.  For competitive reasons I probably can’t name the two elephants in the room but most of you know who they are.  Their products do not cover all SEC registrants.  While their populations may represent more than 98% of the capitalization of the US equity markets, in sheer numbers that is just a subset of all of the SEC registrants (best guess, less than 1/2).

Thus I think a critical first step is always to use their filtering tools to identify companies in their population that have the data items you need and meet your sample criteria.  One important decision is whether you need sample companies that have publicly traded equity.  Commercial databases include entities that are SEC registrants but do not have any publicly traded equity.  It is not enough to confirm that they have a Central Index Key (CIK).  A related complication is the case where the commercial database includes data for the entity that has publicly traded equity as well as data for its subsidiaries that have filing obligations (usually because of publicly traded debt).  One example of this I like to share is Entergy.  Here is a shot of the bottom of the landing page for their 2015 FYE 10-K.


While I can only display five, there are seven registrants who simultaneously filed that same 10-K.  Only CIK 65984 (ENTERGY CORP /DE/ – the second listed) has publicly traded equity.  The other entities have various other securities that require public disclosure of their financial performance but do not have any proxy reporting requirements.

It took a long time to get here – the point is that without careful filtering users can get frustrated because they will not find compensation or director information for any of the other entities.  Some compensation data for subsidiary officers will be listed in the DEF 14A of Entergy Corp simply because their compensation flows to the parent and they will be among the five highest compensated officers of the combined entity.

Another area that is tricky has to do with regulatory disclosure changes.  Until SOX, Director Compensation disclosures were generally about the schedule used to compensate directors rather than precise details about compensation to individual directors.  The rules governing Director Compensation can be reviewed here.  The rules became effective for fiscal years that ended after 12/15/2006.  When trying to build a sample of Director Compensation data, note that companies like Apple (late-September FYE, 52/53-week fiscal years) do not provide a compensation table until their proxy filing on January 23, 2008.  Users who do not invest time in understanding these disclosure requirements before developing their sample can expend considerable effort fruitlessly searching Proxy and then 10-K filings for data that is not available.
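
The cutoff is easy to apply mechanically once you have fiscal year end dates for your sample.  The sketch below is illustrative only; the function name is made up, and treating a fiscal year that ends exactly on 12/15/2006 as falling under the new rules is an assumption you should confirm against the rule text.

```python
from datetime import date

# Fiscal years ending before this date fall under the old disclosure rules.
NEW_RULES_CUTOFF = date(2006, 12, 15)


def disclosure_regime(fiscal_year_end):
    """Classify a fiscal year end as falling under the old or new rules.

    Treating the cutoff date itself as 'new' is an assumption of this sketch.
    """
    return "new" if fiscal_year_end >= NEW_RULES_CUTOFF else "old"


# A late-September fiscal year end misses the cutoff in 2006 ...
print(disclosure_regime(date(2006, 9, 30)))   # old
# ... so the first year under the new rules ends in late September 2007,
print(disclosure_regime(date(2007, 9, 29)))   # new
# while a calendar-year company is covered starting with fiscal 2006.
print(disclosure_regime(date(2006, 12, 31)))  # new
```

This is why a late-September filer's first new-style table shows up in a proxy filed in early 2008, a full year after calendar-year companies.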

One more example has to do with the filing status of issuers.  The SEC has a graduated filing schedule and disclosure requirements based on the filing status of the registrants.  Companies that meet the definition of a Smaller Reporting Company have a choice to meet the full requirements of Regulation S-K or scaled disclosure requirements (described here).  In total there are 12 differences in the disclosure requirements for companies that qualify for and elect the relief available under these regulations.   These scaled disclosure requirements include the opportunity to omit  disclosures for Risk Factors (Item 1A) as well as reduce content in many other areas of a filing.  The SEC noted in 2008 that approximately 1/2 of 10-K filers would be eligible for this relief.

If you are trying to collect data from an area of the 10-K covered by these scaled disclosure opportunities your sample is going to be greatly affected by decisions made by the registrants.  We know that at least 1/3 of the 10-K filers at any one time do not provide disclosure about their Risk Factors through the relief offered under the scaled disclosure requirements for smaller reporting companies.  Here is a screenshot of a 10-K form from one registrant in 2016 that chose that disclosure regime.


The frustrating part about the scaled disclosure is that companies choose to implement their choices in a number of ways.  The above screenshot makes it clear.  Other registrants might use the word omitted, which is also clear.  Our experience though is that more than half will just leave the space blank with no words to indicate the reason for the omission.
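
If you have downloaded a batch of Item 1A files, a rough first pass at sorting them into these cases might look like the sketch below.  The keyword list and the 50-word threshold are assumptions I picked for illustration, not tested cutoffs, so spot-check the results.

```python
import re


def classify_risk_factor_text(text, min_words=50):
    """Roughly classify an Item 1A section as 'disclosed', 'omitted' or 'blank'.

    The keywords and the min_words threshold are assumptions for this sketch.
    """
    words = re.findall(r"[A-Za-z']+", text)
    if len(words) >= min_words:
        return "disclosed"
    # Short sections that explain themselves use wording like these phrases.
    pattern = r"omitted|not required|not applicable|smaller reporting compan"
    if re.search(pattern, text.lower()):
        return "omitted"
    # Nothing but a heading (or nothing at all) and no stated reason.
    return "blank"
```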

There is a lot of detail in this post.  The point I am trying to make is simple though – it is really in your best interest to start research that requires the collection of data from EDGAR filings with a careful filtering of your commercial dataset.  Apply as many filters as you can so that you do not spend any time looking for (or cleaning) data that is not going to ultimately be used.


Managing Complex Searches

I was helping someone build a complex search today.  The goal was to identify all 10-K filings with some mention of Last-in, First-out inventory and confirm whether the auditor for the companies that mention LIFO belonged to a particular set of auditors.  Their auditor list was not too large (27 names).  Because we were not initially sure exactly what search to construct the search string started getting messy. We were not seeing some results that we expected and I wanted to clean it up so that our customers could better understand some of the choices we were making.

Unfortunately, our search box only displays the most recent 100 characters (including spaces) but our search parser will parse search strings with as many as 16,000 characters.  While the amount of the input that is visible is limited our application does allow users to paste in their search string that they might build outside the application.

So I need to take a break and throw in a plug for a really useful (and free) utility called Notepad++.  It has an amazing feature set.  If you are writing code (Python, Perl and many others) Notepad++ offers syntax highlighting.  It is a lot easier on memory than the included Notepad in the sense that it opens large files much more easily.  It has an amazing search facility.  One of the reasons I use it is that it helps me balance parentheses when I am writing a complex search.

When I start getting confused with the search I am trying to construct I stop trying to build it in our search box and switch to Notepad++.

As I mentioned above, I started building the search inside the Search, Extraction & Normalization Engine.  Here is a screenshot of part of the search; we can only see about 100 characters at a time.


I was not seeing results from some auditing firms that I expected and so I decided the best thing to do was to build the search in Notepad++.  If you place your mouse in the Search Phrase box and use the right-click button there is a context choice – select all.  So I did that to select and then copy my existing search phrase from our application to Notepad++ (NPP).


I pasted it squished down so you can see that there are 732 characters in my search.  Just pasting it into NPP isn’t enough; I want to see the search.  So I started adding line breaks after operators or terms.  Here is an image of the search reorganized in NPP:


Organizing the search this way sure makes it a lot easier to read.  I have two clumsily drawn circles in the image up there.  The smaller one at the top illustrates the parentheses matching.  If you are next to a parenthesis it will change color (you can set the color you want, the default is red) and the matching parenthesis will also light up.  That is a really useful feature that I use a lot when building nested searches.  You can easily tell if you have unbalanced parentheses by placing the cursor next to one and then scrolling up or down to look for another with the same highlighting.  If you don’t see one then you have an unbalanced parenthesis.
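
If you build searches in a script rather than an editor, the same check takes only a few lines.  This sketch only tracks the parentheses themselves; it knows nothing about our search syntax and does not treat parentheses inside quoted phrases specially.

```python
def find_unbalanced_parentheses(search):
    """Return the positions of parentheses in the search that have no match."""
    open_positions = []  # stack of '(' positions still waiting for a ')'
    unmatched = []       # ')' positions that had no '(' to close
    for index, char in enumerate(search):
        if char == "(":
            open_positions.append(index)
        elif char == ")":
            if open_positions:
                open_positions.pop()
            else:
                unmatched.append(index)
    return sorted(unmatched + open_positions)


print(find_unbalanced_parentheses("(a or b) and (c"))      # [13]
print(find_unbalanced_parentheses("(lifo or (last in))"))  # []
```

An empty list means every parenthesis has a partner; otherwise the numbers are the character positions to inspect.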

The second, longer red ‘circle’ illustrates that I think it is always a best practice to indent each new line of search content one space.  The reason for this is if you hit return and then start typing in the first column of the new line the parser will run together the last word (or operator) of the previous line with the first word (or operator) of the current line.
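
An alternative to indenting by hand is to flatten the search before pasting it, putting exactly one space wherever there was a line break.  The helper below is a hypothetical convenience and not part of our application.

```python
def flatten_search(multiline_search):
    """Join the lines of a search into one line separated by single spaces."""
    parts = [line.strip() for line in multiline_search.splitlines()]
    # Skip empty lines so blank spacing in the editor adds nothing.
    return " ".join(part for part in parts if part)


print(flatten_search("(lifo\nor last in)\nand 10-K"))  # (lifo or last in) and 10-K
```

Because every line break becomes a space, the last word of one line can never run into the first word of the next.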

Once the search was in NPP it was easy to analyze.  The problem we were having turned out to be an unbalanced parenthesis and typos (read misspellings).  Once those were corrected we just selected our search (as it was displayed in NPP – we did not have to get rid of the line breaks) and then pasted it back into the search box.


Notice that the line breaks are preserved.  The parser will take care of the line breaks when we submit the search.

For those of you new to our software, the andany operator is not used in the process of selecting documents.  Rather it is used to count instances of other words or phrases in the documents selected with the other operators.  While our focus in this search was on LIFO we also want to inspect references to FIFO to make sure we properly identify LIFO adopters (that explains the first andany operator).  The second, in front of the list of auditors is to identify the auditor(s) associated with the companies who do mention LIFO in their 10-K.

In the search results displayed below – there were 2,439 documents found.  These documents were selected because:

  1. They had either the phrase “last in” (hyphenated or not)
  2. or they had the term LIFO
  3. and they were 10-K documents (not exhibits)
  4. and I set a date filter to select only filings made between 1/1/2012 and 5/30/2016

Because of the andany  operators we could check on their auditor and find their context for FIFO if they mentioned either in the document.
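
To make the distinction between selecting and counting concrete, here is a plain Python analogy.  It is not how our search engine works internally and it uses simple substring matching; the document names and the auditor name are made up.

```python
import re


def select_and_count(documents, selecting_terms, counted_terms):
    """Select documents by one set of terms and count a second set within them.

    documents maps a name to its text. All terms should be lowercase.
    """
    results = {}
    for name, text in documents.items():
        # Treat hyphens as spaces so "last-in" and "last in" both match.
        lowered = text.lower().replace("-", " ")
        if not any(term in lowered for term in selecting_terms):
            continue  # counted terms alone never select a document
        results[name] = {
            term: len(re.findall(re.escape(term), lowered))
            for term in counted_terms
        }
    return results


docs = {
    "a": "We use LIFO and FIFO. Audited by Example LLP.",
    "b": "FIFO only.",
    "c": "Inventories are valued on a last-in, first-out basis.",
}
print(select_and_count(docs, ["lifo", "last in"], ["fifo", "example llp"]))
```

Document b mentions FIFO but is not returned, because FIFO is only a counted term here.  That is the same role andany plays in the search above.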


Temporary Bug Fix

If you are using the Search, Extraction & Normalization Engine to download Item Sections of 10-K filings you might have noticed some strange formatting issues with some of those.  Here is an example:


This is a txt file.  However, because of a coding mistake I made, the file will be saved in your working directory as an htm file.  I don’t know what I was thinking.  The file comes from our server as a txt file but I inserted a line to rename it with the htm extension.

This has no effect on using other components of our software on the file.  That is, you still get a valid word count and word frequencies and you can still index the files; they are just going to look ugly in the search results.  Here is the result of running the file through the Extraction/CountWordFrequencies part of the application.


I have fixed that part of the code.  I am not intending to push out the fix until we finish the next update (sort of a roll-your-own directEDGAR).  However, if you would like the fix sooner just email me.  This particular fix does not require a complete new install of directEDGAR; I will send you a small file to install in the application folder.
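
If you would rather clean up files you have already downloaded, a script along these lines could rename them.  It is a workaround sketch and not the fix itself.  The test for "looks like HTML" is a heuristic I am assuming is good enough, so run it on a copy of your directory first.

```python
import re
from pathlib import Path

# Heuristic (an assumption): a real HTML file shows a common tag early on.
HTML_TAG = re.compile(r"<\s*(html|body|div|p|table|br|font)\b", re.IGNORECASE)


def restore_txt_extension(directory):
    """Rename .htm files that contain no HTML markup back to .txt."""
    renamed = []
    for path in sorted(Path(directory).glob("*.htm")):
        head = path.read_text(encoding="utf-8", errors="ignore")[:2000]
        if HTML_TAG.search(head):
            continue  # looks like genuine HTML, leave it alone
        target = path.with_suffix(".txt")
        if not target.exists():  # never overwrite an existing file
            path.rename(target)
            renamed.append(target)
    return renamed
```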