Over 2 million EDGAR filing artifacts!

That’s right – our pre-parsing and normalization tools have made available more than two million observations of various data items from EDGAR filings.  That number blew me away when I received a message this morning from one of our team members who is provisioning new storage space.  He was running a test involving just 10-K artifacts and had logged more than 1.7 million items.  Combining that with our PROXY-extracted artifacts, the total jumps to more than two million.

We are provisioning new space and working on a new delivery architecture because our existing system can no longer manage either the volume of incoming artifacts or the number of outgoing requests.  We are getting ready to add insider trading data matched to the Named Executive Officers and the Directors.  That addition alone will add approximately another million items. We have also been parsing the older 10-Ks into item-number sections to conform to our existing availability of the newer ones – that process should add at least another million separate files for download.

Once our new storage space is provisioned and working, we will turn back to finishing a new version of the Search, Extraction and Normalization Engine.  One of the key goals of this project is to improve your download speed by a factor of at least four.  Our existing architecture does not allow for parallel access to our data repository.  Our new platform will allow us to design the application to run multiple simultaneous connections between your desktop and our repository for data access.
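To make the idea concrete, here is a minimal sketch of parallel retrieval using Python’s standard library.  The URLs and the four-connection limit are illustrative stand-ins – this is not our client code, just the pattern the new platform enables.

    from concurrent.futures import ThreadPoolExecutor
    import urllib.request

    def fetch(url):
        # Download one artifact; each call holds its own connection.
        with urllib.request.urlopen(url) as response:
            return url, response.read()

    # Hypothetical artifact URLs - stand-ins for repository paths.
    artifact_urls = [
        'https://example.com/artifacts/filing-1.txt',
        'https://example.com/artifacts/filing-2.txt',
    ]

    # Four simultaneous connections instead of one serial stream.
    with ThreadPoolExecutor(max_workers=4) as pool:
        for url, payload in pool.map(fetch, artifact_urls):
            print(url, len(payload))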


Extraction & Normalization of Board Meetings

I had an email from a faculty researcher who needed to capture the frequency of board meetings for a sample of companies and wanted some help.  While setting up the system I decided this would make a worthwhile post to help illustrate why Search is not enough – you need Extraction and Normalization.

While there are many ways the concept of reporting the frequency of board meetings can be expressed, I know from past review that one form of the expression is ‘The Board met N times in YYYY’.  So that was the basis of my first search:


We found 796 relevant documents – now to EXTRACT & NORMALIZE those findings.  Just select the ContextNormalization feature from the menu and specify the inputs:


After pressing the Okay button the results will soon be available in the Output Folder.  The results include enough details to create an audit trail back to the original document and they also have the data that is needed:


I highlighted three of the rows to drive home the point that this is a versatile tool.  It can work with various forms of number expressions.
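To illustrate the kind of work the normalization step does, here is a minimal sketch in Python that converts both digit and word forms of a meeting count into a number.  The pattern and word list are deliberately tiny – our engine handles far more variation – and the example sentences are made up.

    import re

    # Spelled-out counts mapped to integers (a small illustrative subset).
    WORDS = {'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5,
             'six': 6, 'seven': 7, 'eight': 8, 'nine': 9, 'ten': 10,
             'eleven': 11, 'twelve': 12}

    PATTERN = re.compile(r'Board met (\w+) times? in (\d{4})', re.I)

    def normalize(sentence):
        """Return (meeting_count, year), or None when the form does not match."""
        m = PATTERN.search(sentence)
        if not m:
            return None
        count = m.group(1).lower()
        count = int(count) if count.isdigit() else WORDS.get(count)
        return count, int(m.group(2))

    print(normalize('The Board met seven times in 2016'))  # (7, 2016)
    print(normalize('The Board met 11 times in 2015'))     # (11, 2015)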

From start to finish this took me about three minutes.  The hard part is to continue and find the other ways this concept will be expressed.  I tried another form, ‘Board held’.  This returned more results:


I would use the same strategy as before to Extract and Normalize – here is a peek at the results:


Again – a user intent on capturing the meeting frequency of a large sample is going to have to learn how the concept is expressed (clearly there are other ways to express it) and continue with the alternatives until they have identified the forms of expression in their sample.  Once they have that knowledge, however, our tools can help them very rapidly convert those search results into data.

Always Interesting Issues in Compensation

We Extract and Normalize the Executive and Director compensation data whether it is reported in the 10-K or the DEF 14A.  Compensation is a required disclosure in the 10-K, but companies can take the relief offered by the CFR and choose to incorporate it by reference to the DEF 14A (proxy) if the proxy is expected to be filed within 120 days after the end of the fiscal year covered by the 10-K.

We have started seeing more and more discrepancies between what is filed in the 10-K and what is ultimately reported in the proxy.  These discrepancies are not usually very large, but they are interesting.  Argos Therapeutics Inc filed their proxy today.  On May 1, 2017 they filed an amendment to their 10-K (a 10-K/A) that appears to have been filed solely to include Items 10–14.

When their filing was made today we captured the EC data and our system triggered an event because the table in the proxy covered the same reporting periods as the table in the 10-K/A and the totals did not match.  Here is the data that was reported in the 10-K/A:


Here is the EC data as it was reported in the proxy:


The total for each year has changed, and it appears that the differences can be explained by the amounts reported for Other.  Looking more closely at the description of the Other amount, it appears they modified the description between the two filings.  Here is the language used in the proxy:

The description of Other in the 10-K/A does not mention 401(k) matching contributions – otherwise it matches verbatim the language used in the DEF 14A.  Now that we can explain the discrepancy, we will remove the prior data and update with the new table.

The Numbers Don’t Add Up!

Today at 10:59 ET Brown-Forman (CIK 14693) filed their proxy.  I was particularly happy because we had a call scheduled at 11:30 with a potential new client, and I find it keeps their attention when we can demonstrate our processes working with real-time filings for companies they are familiar with.  Given Brown-Forman’s status as a Fortune 1000 company, I was chortling to myself that this was perfect.

Unfortunately it was not as perfect as I would have liked.  Soon after the filing was made I received a notification that there was a TOTAL error in the Executive Compensation table.  Here is the original table:


We run several validation tests on the data as we parse the document – one of those, of course, is that the components tie to the reported total.  There are a variety of edge cases where the total may not match because of a small mistake by the registrant; one example we see is a registrant accidentally inserting a period instead of a comma.  We flag all math errors for human intervention.

In this case we can’t identify the reason for the error.  The components of the 2017 compensation reported for Mr. McCallum sum to $2,518,874.  The reported total is $2,578,874.
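The check itself is simple arithmetic.  Here is a hedged sketch of the kind of validation that fired, reduced to the two figures quoted above – the real pipeline sums every parsed component column for every row.

    def total_discrepancy(components, reported_total):
        # A non-zero gap between the reported total and the summed
        # components is flagged for human review.
        return reported_total - sum(components)

    # Mr. McCallum's 2017 row, with the components pre-summed.
    print(total_discrepancy([2_518_874], 2_578_874))  # 60000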


Our next step is to scrutinize the filing to see if there is some discussion of the $60,000 amount.  We couldn’t find anything, so we have sent an email to their Investor Relations Department and are waiting for a response.  If we don’t get a response within a week we will push the table as is.

These kinds of addition errors happen infrequently – but then we discovered another one about half an hour later when ALJ Regional Holdings (CIK 1438731) filed their proxy while I was on the call with the client.  In this case the error was present in two years of data (2016 and 2015).


The interesting thing is that when they filed their proxy last year, the reported total for Mr. Reisch was $785,250 – which is the sum of the reported components.  However, they are now reporting $804,000 as the total for 2015.  Which document is correct?  As with Brown-Forman, we sent off an email and hope to find out soon.

The potential client was impressed that we had the infrastructure to address these issues.

Where is that Comp Data?

One of the things we pride ourselves on is what we think is the fastest and most comprehensive delivery of Executive and Director Compensation data on the planet (a little hyperbole never hurts).  To showcase that, we have been working to add a modal window to our website so that a visitor to our prime site will see the most recent comp table we have processed.

I had an interesting question the other day from a visitor who wondered why we were displaying the compensation data from SPRINT in the middle of the day on 6/20 when four other issuers had reported EC data since Sprint filed their proxy at 5:00 pm on 6/19.  If we were so good, where was that comp data?

Here is the sequence of filings:


So when I received the query early on the 20th I had to look.  We deliberately did not push out the DC/EC or Audit Fee data for those entities because of an interesting issue.  MITCHAM INDUSTRIES and DOCUMENT SECURITY SYSTEMS INC both filed a 10-K earlier and both reported that data in their 10-Ks, so our system flagged the new data as \SAME CONTENT\DIFFERENT DOCUMENT.

Here is the EC table from MITCHAM INDUSTRIES as reported in their 10-K on 5/31:


Here is the EC table from their Proxy filed on 6/20:


Here is the data from their 10-K after we normalized it and made it available on 5/31:


Further, FUNDVANTAGE TRUST’s and CHROMADEX CORP’s DEF 14As related to Special Meetings and did not include summary compensation data.  It was not until SPEEDEMISSIONS INC filed their proxy at 10:20 (CT) that we had new data.

We ultimately do not replace existing data with new data when it matches the content from a previous filing.
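One simple way to implement that rule is to fingerprint each normalized table and compare fingerprints across filings.  The sketch below hashes a canonical serialization of the rows; the row layout and names are hypothetical, not our actual schema.

    import hashlib

    def table_fingerprint(rows):
        # Sort and serialize the rows canonically so that identical
        # content always yields an identical digest.
        canonical = '|'.join(f'{name}:{year}:{total}'
                             for name, year, total in sorted(rows))
        return hashlib.sha256(canonical.encode()).hexdigest()

    tenk_table  = [('Officer A', 2017, 500_000), ('Officer B', 2017, 400_000)]
    proxy_table = [('Officer A', 2017, 500_000), ('Officer B', 2017, 400_000)]

    if table_fingerprint(tenk_table) == table_fingerprint(proxy_table):
        print(r'\SAME CONTENT\DIFFERENT DOCUMENT - keep existing data')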

While it would be silly of me to claim we are perfect, I did dodge that bullet as we did have the most timely compensation data available.


History Feature

When I was working on the last post I realized I had never shown our Search History feature.  I use this all the time as it saves me significant time and effort when working with a complex search that I am refining.

The Search History is available from the Search menu under the (wait for it) History tab.  It pulls up all of the search history for your local user id since the last time you cleared it.  When you select Search History, the Search History control appears:


You can modify any of the fields in this control, or you can select the Open Search button and all of the relevant fields of the application will be populated with these values – you can then modify them as needed.


Critical Benefit of directEDGAR Architecture

Anyone who has written code to work with SEC filings understands how messy the filings can actually be despite their apparent standardization.  The 10-K is a form that the filer is supposed to fill out.  But when you start writing regular expressions to find and parse sections of these filings for a large sample, it becomes clear very quickly how much variation there is in the presentation and language of what is seemingly very standardized.

I am working to improve our extraction of ITEM 7, Management’s Discussion and Analysis of Financial Condition and Results of Operations.  Beginning in 1999 the SEC amended the 10-K to include Quantitative and Qualitative Disclosures About Market Risk.  The specific requirement states clearly:


Since Item 7A follows Item 7, if we want to isolate Item 7 we need to find the beginning of Item 7A as the cut-off for Item 7.

Based on the description above we should be able to write a straightforward regular expression to identify this line.  While there are different flavors, we use Python, and to keep it general we could write something like ‘\n[ \t]*?ITEM[ \t]*?7A\.[ \t]*?Quantitative[ \t]*?and[ \t]*?Qualitative’ with re.I.  I am looking for places in the document where a new line begins with zero or more spaces or tabs, followed by the word ITEM, an unknown number of spaces and tabs, the characters 7A and a period, and so on.  With re.I I am allowing the case of the words to vary.  This expression gives some flexibility even though I would not necessarily expect to have to account for any variability, because we are working with a form, not a document that someone creates from scratch.
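Here is that first pass as runnable Python.  The filing text is a stand-in – in practice the pattern runs against the full text of each localized 10-K.

    import re

    # First-pass pattern: a newline, optional spaces/tabs, ITEM, optional
    # whitespace, 7A and a period, then the expected section title.
    ITEM_7A = re.compile(
        r'\n[ \t]*?ITEM[ \t]*?7A\.[ \t]*?Quantitative[ \t]*?and[ \t]*?Qualitative',
        re.I)

    filing_text = ('...end of the Item 7 discussion...\n'
                   'ITEM 7A. Quantitative and Qualitative Disclosures About Market Risk')

    match = ITEM_7A.search(filing_text)
    if match:
        item7_body = filing_text[:match.start()]  # Item 7 ends where 7A begins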

Using the regex above I was able to successfully identify the beginning of the ITEM 7A section of the 10-K in 79% of the sample.  That sounds wonderful, except that for the one year I am working with I have 4,170 observations.  Thus, I still have to deal with another 750 observations.  So the question comes down to this: how am I going to determine the correct way to identify the beginning of ITEM 7A for those 750 companies?

To determine the actual data presentation I am going to have to review the documents for which I know there should be an ITEM 7A section of the 10-K but for which my original expression could not find one. It doesn’t matter how I expect the data to be reported – it only matters how the data is actually reported.  I have to look at the missing filings and sort out what words they used to identify that section.

That is easy enough with directEDGAR.  Since we localize all the filings, I am actually working with a local file version of the 10-K, and from the code I can easily view the path to a missing observation.  Here is the first interesting variant I discovered:


So instead of QUANTITATIVE AND QUALITATIVE this registrant decided to reverse the language in the form.  An interesting question is – well, how many did that?  It turns out 95 did (roughly two percent of the total and 1/7th of the missing values).  This discovery leads me to account for it in my regex, so it gets a bit more complicated: ‘\n[ \t]*?ITEM[ \t]*?7A\.[ \t]*?(Qualitative|Quantitative)[ \t]*?and[ \t]*?(Qualitative|Quantitative)’ with re.I.  In this version I am inspecting for lines with QUANTITATIVE or QUALITATIVE, followed by AND, and then again QUANTITATIVE or QUALITATIVE.
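In runnable form the revision is a small change to the pattern, and it now accepts both orderings (the test strings below are made up):

    import re

    # Revised pattern: allow Quantitative/Qualitative in either order.
    ITEM_7A_V2 = re.compile(
        r'\n[ \t]*?ITEM[ \t]*?7A\.[ \t]*?(Qualitative|Quantitative)'
        r'[ \t]*?and[ \t]*?(Qualitative|Quantitative)',
        re.I)

    for line in ('\nITEM 7A. Quantitative and Qualitative Disclosures',
                 '\nItem 7A. Qualitative and Quantitative Disclosures'):
        print(bool(ITEM_7A_V2.search(line)))  # True, True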

Because I can trivially look at the 10-K documents with directEDGAR I can very easily identify the missing observations and modify my code as needed until the point of diminishing returns.  What I mean here is that I am interested in collecting data, not writing the perfect regular expression (I will leave that to others more skilled).  In one of the missing cases I saw this:


So Small Business Issuers are exempt.  This tells me that I should do a search using the Search, Extraction and Normalization Engine to find those filings with the phrase not provided quantitative.  Because not is a search operator, it has to be qualified to be used as a search word.  This is accomplished by appending a tilde (~) to the word, so the search phrase I am going to use initially to identify those 10-K filings where the issuer claims an exemption is not~ provided quantitative.  Not a very fruitful search – only four filings (2 different CIKs).

My ultimate regular expression got very complicated (I even had to account for six different spelling mistakes).  However, the process to create it was not complicated, because every time I had missing values I could look at the search results on my screen and immediately zero in on the more challenging documents to learn how their disclosure was different.


If I were using some other tool this would have been more tedious because I would have to visit the SEC website, type in the CIK of the company I am missing and then locate the troublesome 10-K.  Certainly this is doable, but it takes my focus off my goal – collecting data for my research.