If you are using the Search, Extraction & Normalization Engine to download Item Sections of 10-K filings, you might have noticed strange formatting in some of them. Here is an example:
This is a txt file. However, because of a coding mistake I made, the file will be saved in your working directory as an htm file. I don’t know what I was thinking. The file comes from our server as a txt file, but I inserted a line that renames it with the htm extension.
This has no effect on using other components of our software on the file. That is, you still get a valid word count and word frequencies, and you can still index the files; they are just going to look ugly in the search results. Here is the result of running the file through the Extraction/CountWordFrequencies part of the application.
I have fixed that part of the code. I do not intend to push out the fix until we finish the next update (sort of a roll-your-own directEDGAR). However, if you would like the fix sooner, just email me. This particular fix does not require a complete new install of directEDGAR; I will send you a small file to install in the application folder.
I have been spending the last couple of days working on some Python Regular Expressions to try to reduce the time it takes to snip ITEM sections of 10-K filings. When I sat down to start this task I opened up a few 10-K filings to remind myself of some of the peculiar details associated with their structure.
There are a couple of important details about the way we have set up our filing structure that really make this job a lot simpler than the alternative of spidering EDGAR based on reading an index to identify 10-K filings. I want to share those in this post because these details are important.
First, I want to focus on 10-K forms, not the entire 10-K filing. If I were spidering EDGAR or trying to download filings, I would have to figure out which part of the txt file is the 10-K form and which part is not. With directEDGAR’s Search, Extraction & Normalization Engine I can just ask for a list of 10-K forms and get their paths. This is important because when I am writing a regular expression for ITEM followed by one or more spaces and then some number or a particular number/letter combination (such as 7A), I don’t want to have to think through code to exclude false positives like the ones in the following image:
With the Search, Extraction & Normalization Engine I can get a list of the full file paths to all 10-K forms (excluding the exhibits) in about 2 minutes. This saves me hours of time and lets me focus on my real task.
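To make the pattern described above concrete, here is a minimal Python sketch of a regular expression for ITEM followed by one or more spaces and a number or number/letter combination. It is an illustration only, not the expression used in directEDGAR; the names `ITEM_HEADING` and `find_item_headings` are hypothetical, and anchoring the match to the start of a line is an assumption made here to skip in-sentence references such as "see Item 1A".

```python
import re

# Hypothetical pattern: optional indentation, the word ITEM, one or more
# spaces or tabs, then a one- or two-digit number with an optional letter
# (for example 7 or 7A). MULTILINE makes ^ match at the start of each line.
ITEM_HEADING = re.compile(
    r"^[ \t]*ITEM[ \t]+(\d{1,2}[A-Z]?)\b",
    re.IGNORECASE | re.MULTILINE,
)


def find_item_headings(text):
    """Return (item_label, character_offset) pairs for heading-like matches."""
    return [(m.group(1).upper(), m.start()) for m in ITEM_HEADING.finditer(text)]


sample = (
    "ITEM 7. MANAGEMENT'S DISCUSSION AND ANALYSIS\n"
    "See the discussion in Item 1A above.\n"
    "Item 7A  Quantitative and Qualitative Disclosures\n"
)
headings = find_item_headings(sample)
print(headings)
```

The in-sentence mention of Item 1A on the second line is not reported because it does not begin the line. A pattern like this still matches exhibit text or anything else that happens to start a line with ITEM, which is why working from a list of 10-K forms only keeps the expression simpler.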
Another real benefit of using our architecture is that it allows me to visually inspect the original filing very easily when I am not getting the results I am looking for. As an example, I was still not getting the results I needed from a small but significant minority of 10-K forms. Thus, I decided to inspect them to see why my Regular Expression was not identifying the end of the MD&A section of these filings. Since directEDGAR stores the files locally, I could just print one of the paths and paste it into my browser:
That particular path is for a filing that had both an ITEM 7 and an ITEM 7A, but I could not identify the ITEM 7A starting point (which is the ending point for ITEM 7). The problem is a bit more convoluted than that, but the short version is that my Regular Expression was not working on this filing, so I needed to inspect the filing to understand why. Because the file is stored locally, I can easily find and review it. Here is what I found:
A silly colon. So far I had found a dot (.) and one or more dashes (–) following the item designation, but I had not yet seen a colon. Seeing this led me to modify my Regular Expression to anticipate the possibility of a colon in the line I am looking for.
I have pasted the new Regular Expression and the output so you can see that I am now successfully pointing to the ITEM 7A position I need in the document to pull ITEM 7. (I am not the world’s best REGEX creator, so don’t fixate on the one below.)
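For readers who want to experiment with the same idea, here is a minimal Python sketch of a heading pattern that tolerates a dot, a colon, or one or more dashes after the item designation, and then uses the ITEM 7A position as the end of ITEM 7. This is not the expression referred to above; `heading_pattern` and `extract_item_7` are hypothetical names, and the sketch assumes the first ITEM 7 heading it meets is the real one.

```python
import re


def heading_pattern(designation):
    """Build a line-anchored pattern for a heading such as ITEM 7 or ITEM 7A.

    After the designation the pattern allows optional spaces and then an
    optional dot, colon, or run of dashes (hyphen, en dash, or em dash).
    The negative lookahead stops "7" from matching the start of "7A".
    """
    return re.compile(
        r"^[ \t]*ITEM[ \t]+"
        + re.escape(designation)
        + r"(?![0-9A-Za-z])[ \t]*(?:[.:]|[-\u2013\u2014]+)?",
        re.IGNORECASE | re.MULTILINE,
    )


def extract_item_7(text):
    """Return the text from the ITEM 7 heading up to the ITEM 7A heading."""
    start = heading_pattern("7").search(text)
    if start is None:
        return None
    # Look for ITEM 7A only after the ITEM 7 heading.
    end = heading_pattern("7A").search(text, start.end())
    if end is None:
        return None
    return text[start.start():end.start()]


sample = (
    "ITEM 7. MANAGEMENT'S DISCUSSION AND ANALYSIS\n"
    "Revenue increased during the year.\n"
    "ITEM 7A: QUANTITATIVE AND QUALITATIVE DISCLOSURES ABOUT MARKET RISK\n"
    "Interest rate risk is discussed here.\n"
)
item_7 = extract_item_7(sample)
print(item_7)
```

Many filings repeat the item headings in a table of contents, so taking the first match, as this sketch does, will not always land on the section itself; real filings need extra handling for that case.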
Without directEDGAR’s advanced tools and architecture, I would first have to download the document (the accession-number.txt file). Next, I would have to find some way to focus my work on just the 10-K form within the document, and then I would have to develop some mechanism to visually inspect the original document to determine the cause of missing or unexpected results. All of these frictions are handled for you with directEDGAR.