Anyone who has written code to work with SEC filings understands how messy the filings can be despite their apparent standardization. The 10-K is a form that the filer is supposed to fill out. But when you start writing regular expressions to find and parse sections of these filings for a large sample, it becomes clear very quickly how much variation there is in the presentation and language of what is seemingly a very standardized document.
I am working to improve our extraction of ITEM 7, Management’s Discussion and Analysis of Financial Condition and Results of Operations. Beginning in 1999 the SEC amended the 10-K to include Quantitative and Qualitative Disclosures About Market Risk. The specific requirement states clearly:
Since this new item, Item 7A, directly follows Item 7, isolating Item 7 means finding the beginning of Item 7A and using it as the cut-off.
Based on the description above we should be able to write a straightforward regular expression to identify this line. While there are different flavors, we use Python, and to keep it general we should write something like '\n[ \t]*?ITEM[ \t]*?7A\.[ \t]*?Quantitative[ \t]*?and[ \t]*?Qualitative', re.I. I am looking for all parts of the document that are new lines that may begin with zero or more spaces or tabs, followed by the word ITEM, then an unknown number of spaces and tabs, the number 7 . . . With re.I I am allowing the case of the words to vary. This expression gives some flexibility, even though I would not necessarily expect to have to account for any variability because we are working with a form, not a document that someone creates from scratch.
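A minimal sketch of how the expression above might be applied in Python. The sample text is a hypothetical stand-in for a filing, and the variable names are illustrative, not part of any directEDGAR code.

```python
import re

# The pattern from the text: a newline, optional spaces/tabs, then the
# Item 7A heading, matched without regard to case.
ITEM_7A = re.compile(
    r"\n[ \t]*?ITEM[ \t]*?7A\.[ \t]*?Quantitative[ \t]*?and[ \t]*?Qualitative",
    re.I,
)

# Hypothetical snippet standing in for the text of a 10-K.
sample = (
    "...end of the discussion of results of operations.\n"
    "  Item 7A.  Quantitative and Qualitative Disclosures About Market Risk\n"
    "We are exposed to interest rate risk...\n"
)

match = ITEM_7A.search(sample)

# Offset where Item 7A begins (the newline before the heading), or None
# when the heading is not found. This offset is the cut-off for Item 7.
item_7a_start = match.start() if match else None
```

Because the pattern starts with a newline, the match offset points at the line break just before the heading, so everything before that offset belongs to the preceding section.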
Using the REGEX from above I was able to successfully identify the beginning of the ITEM 7A section of the 10-K in 79% of the sample. That sounds wonderful, except that for the one year I am working with I have 4,170 observations. Thus, I still have to deal with another 750 observations. So the question comes down to this: how am I going to determine the correct way to identify the beginning of ITEM 7A for 750 companies?
To determine the actual data presentation I am going to have to review the documents for which I know there should be an ITEM 7A section of the 10-K but could not find one with my original expression. It doesn’t matter how I expect the data to be reported; it only matters how the data is actually reported. I have to look at the missing filings and sort out the words they used to identify that section.
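One way to triage the misses, sketched under assumptions: the filings are held in a dict keyed by a hypothetical identifier, and a deliberately looser pattern (not from the original workflow) pulls any line that starts with "Item 7A" so the variants can be counted before reading individual documents.

```python
import re
from collections import Counter

# The original, strict heading pattern.
STRICT = re.compile(
    r"\n[ \t]*?ITEM[ \t]*?7A\.[ \t]*?Quantitative[ \t]*?and[ \t]*?Qualitative",
    re.I,
)
# Looser, illustrative pattern: any line starting with "Item 7A",
# whatever wording follows on that line.
LOOSE = re.compile(r"^[ \t]*ITEM[ \t]*7A\b[^\n]*", re.I | re.M)


def heading_variants(filings):
    """Count Item 7A heading lines in filings the strict pattern misses."""
    counts = Counter()
    for text in filings.values():
        if STRICT.search(text):
            continue  # already handled by the original expression
        for m in LOOSE.finditer(text):
            # Collapse whitespace and case so near-identical lines group together.
            counts[" ".join(m.group(0).split()).upper()] += 1
    return counts


# Hypothetical filings, invented for illustration only.
filings = {
    "a": "text\nItem 7A. Quantitative and Qualitative Disclosures About Market Risk\n",
    "b": "text\nITEM 7A.  QUALITATIVE AND QUANTITATIVE DISCLOSURES ABOUT MARKET RISK\n",
    "c": "text\nItem 7A. Qualitative and Quantitative Disclosures about Market Risk\n",
}
variants = heading_variants(filings)
```

Sorting the result with `variants.most_common()` puts the most frequent unexpected wording first, which suggests where a change to the expression would recover the most observations.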
That is easy enough with directEDGAR. Since we localize all the filings, I am actually working with a local file version of the 10-K, and so using the code I can easily view the path to a missing observation. Here is the first interesting variant I discovered:
So instead of QUANTITATIVE AND QUALITATIVE this registrant decided to reverse the language in the form. An interesting question is how many did that. It turns out 95 did (roughly two percent of the total and 1/7th of the missing values). This discovery leads me to account for this in my REGEX, so it gets a bit more complicated: '\n[ \t]*?ITEM[ \t]*?7A\.[ \t]*?(Qualitative|Quantitative)[ \t]*?and[ \t]*?(Qualitative|Quantitative)', re.I. In this version I am looking for lines with QUANTITATIVE or QUALITATIVE followed by AND and then again QUANTITATIVE or QUALITATIVE.
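A short sketch of the revised expression at work. The three headings are hypothetical examples; the third is included only to show a line the pattern still rejects.

```python
import re

# Revised pattern from the text: either word may come first.
ITEM_7A_EITHER_ORDER = re.compile(
    r"\n[ \t]*?ITEM[ \t]*?7A\.[ \t]*?(Qualitative|Quantitative)"
    r"[ \t]*?and[ \t]*?(Qualitative|Quantitative)",
    re.I,
)

# Hypothetical heading lines, invented for illustration only.
headings = [
    "\nItem 7A. Quantitative and Qualitative Disclosures About Market Risk",
    "\nITEM 7A. QUALITATIVE AND QUANTITATIVE DISCLOSURES ABOUT MARKET RISK",
    "\nItem 7A. Market Risk Disclosures",
]

# True where the heading is recognized, False otherwise.
results = [bool(ITEM_7A_EITHER_ORDER.search(h)) for h in headings]
```

Note that the two alternations are independent, so the pattern would also accept a heading that repeats the same word on both sides of "and". For the purpose of locating the section start that looseness is probably harmless.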
Because I can trivially look at the 10-K documents with directEDGAR I can very easily identify the missing observations and modify my code as needed until the point of diminishing returns. What I mean here is that I am interested in collecting data, not writing the perfect regular expression (I will leave that to others more skilled). In one of the missing cases I saw this:
So Small Business Issuers are exempt. This tells me that I should do a search using the Search Extraction and Normalization Engine to find those filings with the phrase "not provided quantitative". Because "not" is a search operator, it has to be qualified before it can be used as a search word. This is accomplished by appending a tilde (~) to the word, so the search phrase I am going to initially use to identify those 10-K filings where the issuer claims an exemption is "not~ provided quantitative". Not a very fruitful search: only four filings (2 different CIKs).
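For readers working with plain local text files, the same phrase check can be roughly approximated in Python. This is only a stand-in and not the directEDGAR search engine; the function name and the sample sentences are hypothetical.

```python
import re

# Local stand-in for the phrase search: the three words in order,
# separated by any whitespace (including a line break), ignoring case.
EXEMPTION_PHRASE = re.compile(r"\bnot\s+provided\s+quantitative\b", re.I)


def claims_exemption(text):
    """Return True if the text contains the phrase 'not provided quantitative'."""
    return EXEMPTION_PHRASE.search(text) is not None


# Hypothetical sentences, invented for illustration only.
exempt = (
    "As a small business issuer, the Company has not\n"
    "provided quantitative and qualitative disclosures."
)
regular = "Item 7A. Quantitative and Qualitative Disclosures About Market Risk"

flags = [claims_exemption(exempt), claims_exemption(regular)]
```

Filings flagged this way could be set aside from the missing observations, since no Item 7A heading search will succeed if the issuer omitted the section.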
My ultimate regular expression got very complicated (I even had to account for six different spelling mistakes). However, the process to create it was not complicated because every time I had missing values, I could look in the search results on my screen and immediately zero in on the more challenging documents to learn how their disclosure was different.