Finally, the 4.04 Release Is in Sight

When I set out to accomplish some home improvement project, my wife and my amazing journeyman helper (my son) ask me how long I think it will take.  I set some number of hours, days, or weeks, and invariably it takes two to three times longer than planned.  I, of course, can keep my cheer during the process, but my lovely bride sometimes gets annoyed because the disruption and mess last longer than planned, and my helper gets frustrated because, after the initial rush of excitement, he would probably prefer hanging out with other 13-year-old boys (maybe girls too, not sure) rather than grinding away with his doting elderly dad.

The same thing happens when we plan a new release for directEDGAR.  We can easily see the start and the finish lines; however, we never really understand how complex the terrain is until we are in the middle of the project.  I was hoping to release Version 4.04 by July; it was then pushed back to January, and we are finally near the end.  It should be released within the next two weeks.

So what is new?

  • We improved the speed of access to our artifacts by a factor of 50 or more.  Downloading, say, 5,000 Executive Compensation tables or MDA Snippets used to take upwards of 30 minutes to an hour; now the same task takes less than one minute!  To get the stats for this blog post I ran a test extracting MDA for 5,000 CIK-YEAR pairs: the old interface took 42 minutes, the new interface took 23 seconds!
  • We tweaked Context Extraction to ignore the fields that were used in the search.  Fields serve only to focus the search, and all fields are automatically included in every context or summary output anyway.  This reduces the bulk of the Context Extraction output, so you no longer have to filter the CSV file after the extraction to remove irrelevant context lines.
  • We added a feature that generates a CSV file containing all the metadata about the directEDGAR files and artifacts in a specific folder.  This is for those cases when you access some artifact or copy files from the main repository to another location and need details about the files and the filers.  You select a directory, every file in the directory is listed, and we parse the file names to give you the CIK, RDATE, and other attributes of the artifacts (see the sketch after this list).  If a file is not a directEDGAR artifact we provide only the file name.
  • We made the DateFilter persistent.  Our original behavior was probably a bad design choice: because the date filter settings are ‘hidden’ once set, we decided in the previous version to always reset the date filter to the default (no dates selected) after each search.  Some users have expressed a preference for the filter to persist across searches, so we worked out how to let you know the date filter is set for each search, and you can clear it if needed.
  • We improved the way you set values for some of the metadata filters in your searches.  Those values can now be set as you select the filter item from the list, rather than forcing you to go back into the search box, find the open parenthesis, set your cursor, and then type.
  • We made some bug fixes, including:
    • Making sure the SmartBrowser opens in the right directory when a process finishes and triggers it to open.
    • When there is a problem with an input file (a missing column, or perhaps the file is still open and in use), we let you know and give you the opportunity to close the file before moving forward, rather than reporting an error that forces you to shut down.
    • Making sure you can stop a search or other process that is running when you hit the Stop button.
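
For the folder metadata CSV mentioned in the third bullet, here is a rough sketch of the underlying idea in Python.  The filename pattern (CIK and RDATE at the start of the name) and the columns written out are assumptions made for illustration, not our actual naming convention or schema.

```python
# Sketch only: assumes artifact names begin with "CIK-RDATE"; the real feature
# parses additional fields and adds filer details.
import csv
import re
from pathlib import Path

ARTIFACT_NAME = re.compile(r"^(?P<cik>\d+)-(?P<rdate>\d{8})")  # hypothetical pattern

def inventory_folder(folder: str, out_csv: str) -> None:
    """Write one CSV row per file in `folder`, filling CIK/RDATE when the name parses."""
    with open(out_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["FILENAME", "CIK", "RDATE"])
        for path in sorted(Path(folder).iterdir()):
            if not path.is_file():
                continue
            match = ARTIFACT_NAME.match(path.name)
            if match:
                writer.writerow([path.name, match["cik"], match["rdate"]])
            else:
                writer.writerow([path.name, "", ""])  # not a directEDGAR artifact
```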

The best thing about this release is the improvement to artifact access.  When we started making director and executive compensation tables available directly, we imagined several hundred thousand artifacts; we are now over 2.5 million and adding more than one hundred thousand per month.  Our system was not designed for this, and we had a lot more to learn about managing delivery than we imagined when we started.  However, some amazing folks working behind the scenes developed an infrastructure that allows for a considerable amount of growth.  This is great because we are going to add new artifacts on the new infrastructure; we have a large number of ideas we are waiting to implement until this release is pushed out to our customers.

Yes, we sometimes modify reported compensation!

Our system is designed to validate many characteristics of the director and executive compensation tables we extract and normalize.  One of the validation steps is to confirm that the reported total matches the sum of the components, with the acceptable absolute difference set at $10,000.
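
Conceptually the check is simple.  Here is a minimal sketch in Python; the component column names are generic placeholders rather than our actual schema:

```python
# Minimal sketch of the total-vs-components validation; column names are placeholders.
TOLERANCE = 10_000  # acceptable absolute difference, in dollars

COMPONENT_COLUMNS = ["SALARY", "BONUS", "STOCK", "OPTIONS", "NONEQUITY", "NQDEFCOMP", "OTHER"]

def passes_total_check(row: dict) -> bool:
    """True when the reported TOTAL is within TOLERANCE of the sum of the components."""
    component_sum = sum(row.get(col, 0) or 0 for col in COMPONENT_COLUMNS)
    return abs(row["TOTAL"] - component_sum) <= TOLERANCE
```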

While the reason for many of the exceptions is easy to identify (the table reports 123.456 when the value should be 123,456), others require more analysis and, ultimately, some judgment.  I was working today and our system reported a $33,538 difference between the sum of the components and the reported total for CIK 1074902.  Here is the original table:



The reported total for Mr. Meilstrup in 2016 is $215,199; the sum of the components is $248,737, a difference of $33,538.

We reviewed the document and the table and wondered whether the difference could be explained by the repetition of $45,793 in the NONEQUITY and NQDEFCOMP columns.  This seemed particularly likely since the NONEQUITY amount reported for 2016 was more than three times the amount reported for the other officers of similar rank.  We therefore substituted the $12,255 amount reported in 2016 for the other Executive Vice Presidents, and the difference between the reported total and the sum of the components dropped to $0.
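
In code form, the arithmetic behind that judgment call looks like this (the figures are the ones quoted above):

```python
# Figures from the table above (Mr. Meilstrup, 2016).
reported_total = 215_199
components_as_reported = 248_737              # includes $45,793 in NONEQUITY

print(components_as_reported - reported_total)   # 33538 (the flagged difference)

# Replace NONEQUITY with the $12,255 reported for the other Executive Vice Presidents.
components_adjusted = components_as_reported - 45_793 + 12_255
print(components_adjusted - reported_total)      # 0 (the difference disappears)
```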

Our final data push for this bank reflects the change we made.


Why can’t I find the table you claim is in the filing?

Since we deliver the source table with our compensation downloads, you might occasionally see data that looks like a mismatch.  This most often happens with smaller reporting companies.  Here is an example: the table below is from CIK 845819, a smaller reporting company that also qualifies for some of the disclosure relief offered by the JOBS Act.  Note that this table was labeled in the original filing as their Summary Compensation Table.


This table triggered flags in our system, so it was set aside for one of our team to inspect and validate the data.  After reading the company’s 10-Ks for 2017 and 2016, we determined that this was a combined Executive and Director Compensation table.  As a result, we parsed it into two different tables: the EC data went into one and the DC data into the other.

Here is the Normalized EC data:


All of the rows relating to past and present directors were eliminated from the normalized results.  We had to inspect the filings to identify the officers’ titles, and those were added.  If an officer had no further association with the company past a particular year, we deleted the row.  Since the EC data is supposed to cover the most recent three years, we did leave in individuals who had an association within that three-year period.

For the DC data, the company identified only two individuals who held a non-executive director position in 2017, so we deleted all other individuals.  The normalized DC data is here:


However, because we want to be fully transparent about the process we use to provide the data, we need to trick our system a bit.  When you download DC data for this company, the CSV results will look like what is illustrated above (two people only), but the htm source file delivered with it will be the same one delivered with the EC data, except that the TID value will be changed from 26 to 27.  If you then audit the table extraction it might seem confusing, because table 27 in the sequence is not the same table as the one displayed at the top of this page.  We have to assign a TID that is different from the EC TID.

This kind of manipulation happens rarely, and it tends to be the result of disclosures made by Smaller Reporting Companies.  The more common manipulation we have to make with Accelerated Filers is stitching two consecutive tables together when the company splits a table across pages.  Here is an example of that situation (this image is from a tool in our workflow, so the tables were actually separated in the original document):


We combined those, so when you download this data you will see the following table, which does not exist in the original document.


Fortunately, this does not happen often, maybe one in every 10,000 tables?
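
Once the two fragments have been parsed, the stitching step itself is conceptually just an append.  Here is a simplified sketch assuming the fragments are pandas data frames with matching columns; the real pipeline works on the HTML tables directly:

```python
# Simplified sketch: append a continuation fragment to the first fragment,
# dropping a repeated header row if the second fragment carries one.
import pandas as pd

def stitch_split_table(top: pd.DataFrame, bottom: pd.DataFrame) -> pd.DataFrame:
    """Combine two fragments of one table that was split across a page break."""
    if list(map(str, bottom.iloc[0])) == list(map(str, top.columns)):
        bottom = bottom.iloc[1:]  # the second fragment repeats the column headers
    return pd.concat([top, bottom], ignore_index=True)
```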


Effective Tax Rate Reconciliation Extraction and Normalization Update

We decided in June to begin tuning our platform to automatically Extract and Normalize the Effective Tax Rate Reconciliation.  We did this because we know many of our clients are interested in testing hypotheses about the factors affecting the Effective Tax Rate, and it is certainly an important issue as Congress contemplates significant changes to the tax code.

We have made significant progress.  Most of the credit goes to Manish Pokhrel, who is probably pulling his hair out right now because he just discovered another exception to what we originally imagined would be a smooth and easy process.

I want to caution you that we are not yet at the same level of perfection we have reached with some of the other artifacts we process, but we are really close.  Today at 4:07 ENANTA PHARMACEUTICALS filed their latest 10-K.  Here is what the ETR table looked like in their filing:



Soon after we grabbed the filing, and after much angst, hand-wringing, and some black magic, our system made the raw data available to our clients for immediate use in their research and analysis in the following form.


The amazing thing, I think, is that in addition to this latest filing, we have every ETR table ENANTA has filed since they began filing 10-Ks in 2013.

Over 2 million EDGAR filing artifacts!

That’s right: our pre-parsing and normalization tools have made available over two million observations of various data items from EDGAR filings.  This is a huge number.  It sort of blew me away when I received a message this morning from one of our team members, who is in the process of provisioning new storage space.  He was running a test involving just 10-K artifacts and had logged more than 1.7 million items.  Combined with our PROXY-extracted artifacts, the total jumps to more than two million.

We are provisioning new space and working on a new delivery architecture because the volume of incoming artifacts and the number of outgoing requests have outgrown our existing system.  We are getting ready to add insider trading data matched to the Named Executive Officers and Directors; that addition alone will add approximately another million items.  We have also been parsing the older 10-Ks into item-number sections to match the availability of the newer ones, and that process should add at least another million separate files for download.

Once our new storage space is provisioned and working, we will turn back to finishing a new version of the Search, Extraction and Normalization Engine.  One of the key goals of this project is to improve your download speed by a factor of at least four.  Our existing architecture does not allow parallel access to our data repository; the new platform will let us design the application to run multiple simultaneous connections between your desktop and our repository for data access.
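
To give a sense of what multiple simultaneous connections buys you, here is a toy sketch in Python.  The fetch_artifact function is a stand-in for a single repository request, not part of any real API:

```python
# Toy illustration of parallel downloads; fetch_artifact simulates one repository request.
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_artifact(artifact_id: str) -> bytes:
    """Stand-in for a single download (here just a half-second delay)."""
    time.sleep(0.5)
    return b""

def fetch_all(artifact_ids, max_connections: int = 4) -> list:
    """Run up to max_connections downloads at once instead of strictly one after another."""
    with ThreadPoolExecutor(max_workers=max_connections) as pool:
        return list(pool.map(fetch_artifact, artifact_ids))

# With four connections, 100 half-second requests finish in roughly 13 seconds
# instead of 50, which is the factor-of-four improvement described above.
```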


Extraction & Normalization of Board Meetings

I had an email from a faculty researcher who needed to capture the frequency of board meetings for a sample of companies and wanted some help.  As I was setting up the system, I decided this would be a worthwhile post to help illustrate why Search is not enough: you need Extraction and Normalization.

While there are many ways the frequency of board meetings can be expressed, I know from past review that one form of the expression is ‘The Board met N times in YYYY’.  So that was the basis of my first search:


We found 796 relevant documents; now to EXTRACT & NORMALIZE those findings.  Just select the ContextNormalization feature from the menu and specify the inputs:


After pressing the Okay button, the results will soon be available in the Output Folder.  They include enough detail to create an audit trail back to the original document, and they also contain the data that is needed:


I highlighted three of the rows to drive home the point that this is a versatile tool.  It can work with various forms of number expressions.
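
To illustrate why handling different number forms matters, here is a toy sketch of the kind of normalization involved.  The phrase pattern and word list are illustrative only; the actual ContextNormalization logic is much richer:

```python
# Toy sketch: pull a meeting count out of a sentence whether it is written
# as a digit ("met 7 times") or a number word ("met four times").
import re

WORD_TO_INT = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
               "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11, "twelve": 12}

PATTERN = re.compile(r"\bboard(?: of directors)? (?:met|held) (\w+)", re.IGNORECASE)

def meeting_count(sentence: str):
    """Return the meeting count found in the sentence, or None if no match."""
    match = PATTERN.search(sentence)
    if not match:
        return None
    token = match.group(1).lower()
    return int(token) if token.isdigit() else WORD_TO_INT.get(token)

print(meeting_count("The Board met four times in 2016"))                # 4
print(meeting_count("The Board of Directors met 7 times during 2016"))  # 7
```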

From start to finish this took me about three minutes.  The hard part is to continue and find the other ways this concept can be expressed.  I tried another form, ‘Board held’.  This returned more results:


I would use the same strategy as before to Extract and Normalize; here is a peek at the results:


Again, a user intent on capturing the meeting frequency for a large sample is going to have to learn how the concept is expressed (clearly there are other ways to express it) and continue with the alternatives until they have identified the forms of expression in their sample.  However, once they apply that knowledge, our tools can help them very rapidly convert those search results into data.

Always Interesting Issues in Compensation

We Extract and Normalize the Executive and Director compensation data whether it is reported in the 10-K or the DEF 14A.  Compensation is a required disclosure in the 10-K, but companies can take the relief offered by the CFR and choose to incorporate it by reference to the DEF 14A (proxy) if the proxy is expected to be filed within 120 days of the fiscal year end.

We have started seeing more and more discrepancies between what is filed in the 10-K and what is ultimately reported in the proxy.  These discrepancies are not usually very large, but they are interesting.  Argos Therapeutics Inc filed their proxy today (here is a link).  On May 1, 2017 they filed an amendment to their 10-K (a 10-K/A) which appears to have been filed solely to include Items 10 through 14.

When their filing was made today we captured the EC data and our system triggered an event because the table in the proxy covered the same reporting periods as the table in the 10-K/A and the totals did not match.  Here is the data that was reported in the 10-K/A:


Here is the EC data as it was reported in the proxy:


The total for each year has changed, and it appears that the differences can be explained by the amounts reported for Other.  Looking more closely at the description of the Other amount, it appears that they modified the description between the two filings.  Here is the language used in the proxy:

The description of Other in the 10-K/A does not mention 401(k) matching contributions; otherwise it matches verbatim the language used in the DEF 14A.  Now that we can explain the discrepancy, we will remove the prior data and update it with the new table.
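
For the curious, the kind of cross-filing check that triggers this review is straightforward.  A minimal sketch, assuming the normalized totals from each filing are stored in dictionaries keyed by officer and year (the dictionaries and key layout are illustrative, not our actual data structures):

```python
# Minimal sketch of a cross-filing consistency check: flag officer-year totals
# that appear in both the 10-K/A and the proxy but do not agree.

def flag_total_discrepancies(totals_10ka: dict, totals_proxy: dict) -> dict:
    """Map each (officer, year) key present in both filings to its pair of
    reported totals, keeping only the keys where the totals differ."""
    shared_keys = totals_10ka.keys() & totals_proxy.keys()
    return {key: (totals_10ka[key], totals_proxy[key])
            for key in shared_keys
            if totals_10ka[key] != totals_proxy[key]}
```

Any keys returned by a check like this become the rows one of our team then reviews by hand, as happened here.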