Minor Bug: Duplicate Rows in Request File

I was testing the download of our new Director Vote results data and discovered a new bug.  Yuck.  If your request file has two (or more) rows with the same CIK, YEAR and PF values, the application will exit as if everything is OK, but it will not consolidate the data rows and will not complete the extraction from our server.  I apologize for this; I only discovered it by accident today.  Since we are preparing an update to ship to your IT folks in the next week or two, this should be corrected by then.
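
Until the fix ships, one workaround is to remove duplicate rows from the request file before running the extraction.  The following is a minimal sketch, not part of the application.  It assumes the request file is a CSV whose header row uses the column names CIK, YEAR and PF; the sample rows and the PF value "FY" are made up for illustration.

```python
import csv
import io

KEY_COLUMNS = ("CIK", "YEAR", "PF")

def dedupe_request_rows(rows):
    """Return rows with only the first occurrence of each (CIK, YEAR, PF) kept."""
    seen = set()
    unique = []
    for row in rows:
        # Strip whitespace so " 1800" and "1800" count as the same CIK.
        key = tuple(row[col].strip() for col in KEY_COLUMNS)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Illustrative request file contents (made-up rows, one duplicate).
sample = "CIK,YEAR,PF\n1800,2016,FY\n1110783,2016,FY\n1800,2016,FY\n"
rows = list(csv.DictReader(io.StringIO(sample)))
clean = dedupe_request_rows(rows)
print(len(rows), "rows in,", len(clean), "rows out")  # 3 rows in, 2 rows out
```

To use this on a real file, read it with csv.DictReader, pass the rows through dedupe_request_rows, and write the result back out with csv.DictWriter.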

Proxy Voting Results Available Soon

This past year we have fielded a number of requests for help extracting Proxy voting results.  The results of director elections are pretty easy to access with the TableExtraction and Normalization routines built into the Search, Extraction & Normalization Engine.  The other votes are more difficult because the votes are not usually labeled in the table, rather they are described in the preceding paragraph.


In the image above the votes relating to compensation and the approval of the auditor are described in a paragraph independent of the table itself.  Our clients who have attended our Python Bootcamps would be able to use some of the tricks we have shown them to identify the table of votes but the logic to extend the Normalization component of the Engine is tricky to implement in the Engine and we have not yet sorted out the complications.

Thus we have been working on adding these votes to our Preprocessed Data Feed.  Initially we will add the Director vote results but soon after we will also add the Advisory Vote on Compensation, the Approval of the Auditor and the results from the vote on the Frequency of an Advisory Vote on Executive Compensation.  Once those are addressed we will add the votes on shareholder proposals.  With the director votes we are adding their CIK as well as their Gender.  Here is an example of the output you can expect with the Director Vote results.  These results came from an 8-K filing made by Abbott Laboratories on 5/4/2016 that reported on the actions taken during their annual meeting on 4/29/2016 (these data points are included in the results we push to our clients).  The PERSON-CIK value is the CIK assigned to that individual by the SEC if they have ownership or other reporting requirements.


One of the advantages of including the director CIK as an identifier is that it simplifies tracking directors (and officers) across entities.  While I was playing with the results I decided to see how many directors in one of our test files had multiple directorships.  This would not be so easy to determine if we had to sort by name.  For example, Mr. Maffei (who held directorships in six entities in our test sample) is reported as both Greg Maffei and Gregory B. Maffei.  When we can sort by PERSON-CIK, identifying multiple directorships is more straightforward.
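
To make the idea concrete, here is a minimal sketch of counting directorships by PERSON-CIK rather than by name.  The PERSON-CIK column name comes from the output described above; the CIK and NAME column names for the registrant and the director, and all of the values shown, are assumptions made up for illustration.

```python
from collections import defaultdict

def directorships_by_person(rows):
    """Map each PERSON-CIK to the set of registrant CIKs where that person appears."""
    boards = defaultdict(set)
    for row in rows:
        person = row["PERSON-CIK"].strip()
        if person:  # blank when the SEC has not assigned the individual a CIK
            boards[person].add(row["CIK"].strip())
    return boards

# Made-up rows: the same person appears under two spellings of his name.
rows = [
    {"CIK": "100", "PERSON-CIK": "9001", "NAME": "Greg Maffei"},
    {"CIK": "200", "PERSON-CIK": "9001", "NAME": "Gregory B. Maffei"},
    {"CIK": "200", "PERSON-CIK": "9002", "NAME": "Another Director"},
]
multi = {p: sorted(c) for p, c in directorships_by_person(rows).items() if len(c) > 1}
print(multi)  # {'9001': ['100', '200']}
```

Grouping on the identifier means the two name spellings collapse into one person without any name matching.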


Our plan is to build the archive but also add this to our daily processing.  We fully expect that by the time the 2017 Proxy Season begins in earnest we will be able to push the vote results to our distribution server within minutes of the filing of the 8-K that reports the voting results.

The initial batch of the results of director elections will be available within the next two weeks.  While I will announce it here, you will also be able to tell what data is available by creating a request file per the instructions and looking at the list of available Governance Data Tables.  The Director Voting results will be listed in that schedule of available Data Tables.




Accessing 8-K Filing Reason Codes is as easy as Pie

I apologize; this post is going to be a bit denser than many of my others because we are going to take a deep dive into a unique feature of directEDGAR.  When a registrant has an obligation to file an 8-K, the SEC requires them to use a conformed set of reason codes to describe all of the events that are being reported on in the 8-K.  A detailed list of the current 8-K filing reason codes can be found here.

One of the driving forces leading to the development of directEDGAR was that I was trying to find all reported instances of auditor changes in 2001.  While many of these are reported in the business press, there was no systematic way to identify all 8-Ks filed with a particular conformed code (say auditor change).  Thus from the very beginning we included tagging in the 8-K filing so a user could search for and find all 8-Ks filed for a particular reason code.  While I thought that was great, it became clear very soon afterwards that many researchers needed to control for all of the reasons any particular 8-K was filed (there can be as many as nine reason codes attached to any one 8-K filing).  We had clients asking how to get all reason codes for all 8-K filings.  There was no way to automatically extract those in one step.  Rather, they had to do independent searches for each filing reason code and then consolidate the final results.  This was happening often enough that we started maintaining a file that we would make available when someone asked for the codes.

Thus I knew when we developed our new application we needed to make it straightforward for a user to identify all 8-K filing reason codes for any set of 8-K filings the user wanted to extract.  I think we accomplished this in a very powerful way.  To illustrate, I want to model the data collection of the Holder, Karim, Lin and Pinsker (2016) paper “Do Material Weaknesses In Information Technology-Related Internal Controls Affect Firms’ 8-K Filing Timeliness And Compliance?”  I hate to say that Professor Holder had to collect his data the old-fashioned way.  I hope it helps him to know that his work was one of the inspirations for this added feature in our latest version of the Search, Extraction & Normalization Engine.

Holder et al. needed to check the filing date relative to the event date and all filing reason codes attached to each 8-K filing.  Using, I will search for all 8-K filings made in 2016 up to early June.  Here is a screen shot of the initial search.


Our SummaryExtraction feature extracts metadata about each document returned from the search as well as all of the document tags we add to each document.  This is done from the menu and requires that you specify the location where you want the CSV file stored.  Here is a screenshot of the SummaryExtraction from this 8-K search.


It is unfortunate that I cannot display the full width of the file.  There are 42 columns.  Eleven columns identify the filing and include the word count for the filing as well as some metadata about the filer (SIC-CODE, FYE, CIK, Name).  The rest of the columns describe the filing.  The RDATE is the date the filing was made available to the public through EDGAR.  The CDATE is the Conformed Date, which is required to be the date of the earliest event reported in the 8-K.  The focus of the Holder et al. paper was whether the lag between the CDATE and the RDATE offers insights about material internal control weaknesses.

The remaining columns list all of the possible reason codes in the form ITEM_REASON_CODE.  A YES in the column indicates that the reason code was attached to that particular 8-K.  At the current time there are 31 possible reason codes for any particular 8-K filing.
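
As a rough illustration of how the SummaryExtraction output could be used, the sketch below computes the lag between CDATE and RDATE and lists the reason codes flagged for one filing.  The RDATE and CDATE column names and the YES flag come from the description above.  The date format (YYYYMMDD), the exact reason-code column names such as ITEM_5.07, and the sample row are assumptions, so check them against your own CSV file.

```python
from datetime import datetime

def summarize_filing(row, date_format="%Y%m%d"):
    """Return (days between CDATE and RDATE, list of flagged reason-code columns)."""
    rdate = datetime.strptime(row["RDATE"], date_format).date()
    cdate = datetime.strptime(row["CDATE"], date_format).date()
    # A reason code applies when its ITEM_ column holds YES.
    codes = [col for col, val in row.items()
             if col.startswith("ITEM_") and val.strip().upper() == "YES"]
    return (rdate - cdate).days, codes

# Made-up row with hypothetical column names for three reason codes.
row = {"CIK": "1800", "RDATE": "20160504", "CDATE": "20160429",
       "ITEM_5.07": "YES", "ITEM_9.01": "", "ITEM_2.02": ""}
lag, codes = summarize_filing(row)
print(lag, codes)  # 5 ['ITEM_5.07']
```

Applied to every row returned by csv.DictReader, this yields one lag and one list of reason codes per 8-K.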

I want to point out that we have already collected all of the information we need from our population of 8-K filings to replicate the Holder et al. paper.  Those two steps (search for 8-K filings and then use the SummaryExtraction feature) took a grand total of about 10 seconds.

You might not need to replicate Holder et al.  However, a number of the requests we have received in the past for this data came from users who needed to control for 8-K filing events in an event study.  In that case you might want to filter on CIK instead of pulling all events.
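
If you do pull all events and want to narrow them afterwards, filtering the CSV rows on CIK takes only a few lines.  This is a minimal sketch; the CIK column name comes from the file described above, while the handling of leading zeros and the sample rows are assumptions.

```python
def filter_by_cik(rows, sample_ciks):
    """Keep only the rows whose CIK is in the event-study sample."""
    # Strip leading zeros so "0000001800" and 1800 match each other.
    wanted = {str(c).lstrip("0") for c in sample_ciks}
    return [r for r in rows if r["CIK"].strip().lstrip("0") in wanted]

# Made-up rows; ITEM_5.07 is a hypothetical reason-code column name.
rows = [
    {"CIK": "0000001800", "ITEM_5.07": "YES"},
    {"CIK": "1110783", "ITEM_5.07": ""},
    {"CIK": "12345", "ITEM_5.07": "YES"},
]
kept = filter_by_cik(rows, [1800, "1110783"])
print([r["CIK"] for r in kept])  # ['0000001800', '1110783']
```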

(Forgive the corny title; it is getting close to Thanksgiving.)






Custom Indexing with directEDGAR

One of the common requests we get from our clients is that they want to search just specific sections of 10-K filings.  The most common sections they look to index are the Risk Factors (Item 1A) and the MD&A (Item 7).  With our release of this is now possible.  You have to download the item sections to your local computer and then you have to run the indexing software built into the Search Extraction & Normalization Engine.

The process to download the various Items from the 10-K to your local computer is described in the Help file with the Search Extraction & Normalization engine.  Briefly, you need a CSV file with a list of CIKs, the years and the period focus (PF).  I am going to cheat in this example and use the list of most recently filed 10-Ks to generate my request file.


You can see my request file in the image above.  However, to show off a bit I also included a screenshot of our internal application that tracks the handling of each filing.  The image above is a bit fuzzy – the first CIK in the list is Monsanto (1110783).  If you go to EDGAR and check for the filing date and time of their 2016 10-K you will see that it was filed on EDGAR on 10/19/2016 at 3:31 CT (4:31 ET).  The details in our log above tell me that it was pushed to our server at 3:53 CT (4:53 ET).  So this Risk Factors section was available for our clients in less than 30 minutes after the filing was made on EDGAR.  Pretty amazing.

Once the request file has been created, the application interface provides the controls to query our server and pull the actual Risk Factors sections for the list of CIK-YEAR pairs.


The process runs very fast; typically we can push more than 500 files a minute to your local computer.  However, the time of day, our network load and your network load all affect the download speed.  When the process is complete all the risk factor sections for your CIK-YEAR pairs are in the directory you specified.  They are named using our standard notation, which includes the CIK, the date the filing was made available on EDGAR and, for these particular filings, the balance sheet date.  Each of these files can be individually viewed using our SmartBrowser.


As you can see, we have controls to allow you to cycle through these quickly.  However, if you in fact want to have the full range of directEDGAR features to use with these files, they need to be indexed.  The indexing process requires that you select a collection of files and specify a destination for the files.  The application will handle all of the intermediate steps.


When the indexing is complete the application automatically adds the index to a directEDGAR_CUSTOM index library and that library is then selected as the active library. So it is only necessary to select the index and begin searching.


The search results will have the hit highlighting and hit-to-hit navigation features.  You can extract context around search results, extract tables, get word counts – in short all of the features associated with the primary indexes are made available on these custom indexes.  And while I ran a simple search above, you can run as complex a search as you need because you have the full power of our search engine managing the background processes.



Using MTurk with directEDGAR

I was helping a PhD student use directEDGAR to snip some tables from a filing.  Because the data is for his dissertation and he is just really getting going on it, I am going to be discreet and not go into details about the data he is trying to collect.

While the table snipping went well we observed a pretty significant problem with the Normalization.  Specifically – for the approximately 20,000 tables there were 23,000 unique column headings.  The reason for this is that the disclosure is not highly regulated and so most companies choose their own labels for the columns as well as how they structure the data.
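
To get a feel for the scale of the problem, it helps to count how many distinct headings remain after trivial differences (case, punctuation, spacing) are removed.  The sketch below is only an illustration of that count, not the Normalization component of the Engine.  It assumes the column headings of each snipped table have already been read into a list of strings, and the headings shown are made up.

```python
import re
from collections import Counter

def normalize_heading(text):
    """Lowercase and collapse punctuation and whitespace to merge trivial variants."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def heading_counts(tables):
    """Count how many tables use each normalized column heading."""
    counts = Counter()
    for headings in tables:
        # Use a set so a heading repeated within one table is counted once.
        counts.update({normalize_heading(h) for h in headings})
    return counts

# Made-up headings from two tables that label the same data differently.
tables = [
    ["Name", "Fair Value ($)", "Shares"],
    ["NAME", "Fair value", "Number of Shares"],
]
counts = heading_counts(tables)
print(len(counts), "distinct headings after normalization")
```

Even in this tiny example "Shares" and "Number of Shares" stay separate, which is the kind of variation that simple cleanup cannot resolve.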

The PhD student actually needed at most four values from each table and so we parried back and forth about alternatives to normalizing the data.  Working with that many column headings was going to turn into a monumental task.  Ultimately he decided to use Amazon’s Mechanical Turk (MTurk) platform as a way to crowd-source the collection of the data.  Because I have been intrigued with this service since I first heard about it we volunteered to help him.  I wanted the experience of working with a real use-case to discover how flexible the framework was.

While it was a pretty significant time investment (maybe 40-60 hours) I came away with a real appreciation of how our tools can facilitate and simplify data collection when used with their platform.  I also came away with a strong appreciation for their platform.

We needed to be able to display the snip from the SEC filings to the workers and provide a way for workers to enter data from the displayed table.  To maximize worker efficiency this all had to be done on one page.   Once we understood this it was easy enough to take one of their stock code pages and make the modifications we needed to suit his data collection needs.

Here is a mockup of the data collection page we worked on (note, the table displayed is not the data he is collecting).


It was really neat to set this up because their platform simplifies the management of the data collection process.  The values year1 and year2 are variables that are determined by the source table.  The Data Value # entries are the labels for the data he was looking for.  The worker gets to stay on one page, find the data values, enter them and hit submit.  When submitted, Amazon’s back end processes the data for him and stores it with indicators for the source table.  In our tests it was taking each of us less than a minute to identify and transcribe the data and move to the next one.
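
The page variables work like placeholders that are filled in from one row of input data per table.  The sketch below previews that substitution locally with Python's string.Template, which uses the same ${name} placeholder style.  It is an assumption-laden mockup: the HTML fragment, the table_html column and the input field names are hypothetical, and on the real platform the substitution happens on Amazon's side, not in your own code.

```python
from string import Template

# Hypothetical fragment of a data collection page with three placeholders.
PAGE = Template(
    "<div>${table_html}</div>\n"
    "<label>Data Value 1 (${year1}) <input name='value1_year1'></label>\n"
    "<label>Data Value 1 (${year2}) <input name='value1_year2'></label>\n"
)

def preview(row):
    """Render one task locally to check that every placeholder has a value."""
    return PAGE.substitute(row)  # raises KeyError if a column is missing

# Made-up input row: one snipped table plus the two years shown in it.
row = {"table_html": "<table><tr><td>...</td></tr></table>",
       "year1": "2015", "year2": "2014"}
print(preview(row))
```

Running every input row through preview before uploading is a cheap way to catch a table with a missing year.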

You can see I obfuscated the instructions.  While they look busy in this view, they collapse.  Once a user has gotten comfortable with the data collection process they can close them and have a much cleaner work area.  We also hosted a web page for him with more than 20 marked-up examples of the different forms of the data and detailed explanations of what he is looking for.  The web page is linked to the detailed instructions.

For quality control he plans to have a significant number of these done by two different workers so he can then check for consistency.
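
A consistency check of this kind can be as simple as grouping the submissions by source table and flagging any field where the workers disagree.  This is a minimal sketch of that idea, not his actual procedure; the table_id, worker, value1 and value2 field names and the sample submissions are hypothetical.

```python
from collections import defaultdict

def find_disagreements(submissions, fields):
    """Return {table_id: [fields on which the workers' entries differ]}."""
    by_table = defaultdict(list)
    for sub in submissions:
        by_table[sub["table_id"]].append(sub)
    flagged = {}
    for table_id, subs in by_table.items():
        if len(subs) < 2:
            continue  # only one worker saw this table, nothing to compare
        # Ignore thousands separators and stray spaces when comparing entries.
        differing = [f for f in fields
                     if len({s[f].replace(",", "").strip() for s in subs}) > 1]
        if differing:
            flagged[table_id] = differing
    return flagged

# Made-up submissions: two workers per table, one transposed digit in t1.
submissions = [
    {"table_id": "t1", "worker": "A", "value1": "1,200", "value2": "35"},
    {"table_id": "t1", "worker": "B", "value1": "1200", "value2": "53"},
    {"table_id": "t2", "worker": "A", "value1": "7", "value2": "8"},
    {"table_id": "t2", "worker": "C", "value1": "7", "value2": "8"},
]
flagged = find_disagreements(submissions, ["value1", "value2"])
print(flagged)  # {'t1': ['value2']}
```

Only the flagged tables then need a manual look or a third worker.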

This would have been a lot more difficult to crowd-source without his ability to isolate the specific table for review.  If instead he had passed a document and required the workers to find the table the cost would have been significantly higher and I suspect much more difficult to audit.  Further, having the opportunity to crowd-source rather than send it to a data service will probably get him his results much faster.

I see a lot of possibilities for helping some of our customers with this approach.  They can isolate snips or blocks of text and embed them directly into a data collection page with these tools.



Standalone SmartBrowser and Release Coming Soon

We have had several requests for allowing users to provide copies of our software to freelancers who would help with some data cleaning activities.  We just can’t do that.  However, we have identified a middle ground.  We are working on a standalone version of the SmartBrowser.

The SmartBrowser allows users to cycle through a directory with files (tables or documents) that were created by our application without having to open the files manually. Here is an image of the SmartBrowser:


Some of the key features include:

  1. Parsing of file to identify the
    1. CIK
    2. RDATE
    3. CDATE
  2. Text box so you can type CIK so list will load at that position
  3. Ability to select any CIK, again list will load at the position of the selected CIK
  4. Full set of controls to allow you to move through the files including
    1. Previous – loads the last file reviewed
    2. Next – loads the next file in the queue
    3. Delete Current File – deletes the file that is being displayed
    4. Move Current File – creates a subdirectory named ForReview and moves the current file into that subdirectory
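
For readers who want the same triage behavior in their own scripts, the Move Current File control described above can be approximated in a few lines.  This is a sketch of the described behavior only, not the SmartBrowser's code; the function name and the sample file name are made up.

```python
import shutil
import tempfile
from pathlib import Path

def move_for_review(file_path):
    """Move file_path into a ForReview subdirectory beside it; return the new path."""
    file_path = Path(file_path)
    review_dir = file_path.parent / "ForReview"
    review_dir.mkdir(exist_ok=True)  # created on first use, reused afterwards
    return Path(shutil.move(str(file_path), str(review_dir / file_path.name)))

# Demonstrate on a throwaway directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    snip = Path(tmp) / "example_table.htm"
    snip.write_text("<table></table>")
    moved = move_for_review(snip)
    result = (moved.parent.name, moved.exists(), snip.exists())
print(result)  # ('ForReview', True, False)
```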

Additional features not displayed include a full text search capability as well as the ability to snap the table that is displayed into Excel.

We think that when you want to use Amazon Mechanical Turk workers or freelancers to help isolate specific data, you can email them the relevant snipped tables or documents that you have extracted, together with the new SmartBrowser.  Having access to the SmartBrowser will allow them to focus directly on data collection rather than file management.

The release will add features to smooth out the installation on managed desktops (those environments where users do not have installation rights) and for scenarios where multiple users share one computer with the Search, Extraction & Normalization Engine accessible to all logged-in users.  We will also push out the update to the code to address the files that are being renamed as htm files when they are actually txt format.

Update 7/20/2016 – is now available – for a download link please email Dr. Burch Kealey  (bkealey@directedgar.com)


Old (but New) Compensation Data

We started pushing Executive Compensation data included in Proxies/10-Ks for Fiscal Years that ended before 12/15/2006 to our distribution server today.  This data is not easily comparable to data filed under the executive compensation disclosure rules the SEC adopted in 2006, which apply to fiscal years ending on or after 12/15/2006.  The SEC specifically prepared for this lack of comparability by noting “. . .companies will not be required to “restate” compensation or related person transaction disclosure for fiscal years for which they previously were required to apply our rules prior to the effective date of today’s amendments. This means, for example, that only the most recent fiscal year will be required to be reflected in the revised Summary Compensation Table when the new rules and amendments applicable to the Summary Compensation Table become effective, and therefore the information for years prior to the most recent fiscal year will not have to be presented at all.”  If you have downloaded EC tables from our server from Proxies and 10-Ks filed in 2007 and 2008 you should have noticed how those tables typically only included one and then two years of data.  The SEC’s decision to allow a fresh-start with the new disclosure requirements explains that pattern.

I am highlighting this because it is important to think about the validity of a compensation data time series that crosses these two reporting regimes.  There are significant measurement and disclosure differences before and after a registrant’s adoption of the new form disclosures.  For example, registrants were allowed to delay the adoption of FAS 123R until the fiscal year in which they became subject to the new disclosure regime (SEC Press Release).

To make sure our clients are aware of the differences in measurement, we decided not to normalize this data to the same structure we use for post-2006 data.  For example, prior to the regulatory changes there was not a total column.  We are not including a total column.  The total is really indeterminate because of some significant variability in how individual companies measured equity type awards.  Most companies only reported the number of options granted – there was no monetary value established.

To understand the significance of these differences please compare the Summary Compensation Tables from Abbott Laboratories’ 2006 and 2007 Proxy filings.


Notice that the option column reports the number of “Securities Underlying Options/SARs”.  Also notice that there is no total reported.

Here is the Summary Compensation Table from their 2007 Proxy  (first year under the new disclosure rules).


One year of data and a monetary value (determined by application of FAS 123R) for the Option Awards.  Further, despite similar column headings, the process for determining the amounts included in most of the other columns changed substantially.  For example, the new rules state that “As we proposed, compensation that is earned, but for which payment will be deferred, must be included in the salary, bonus or other column, as appropriate.”

To illustrate the results from pulling this data from our server the next image has the summary data file created using a request file for 2006 and 2007.


As I noted earlier, we are not adding a calculated Total for data that does not originally include a total.