I received an interesting email from a client this week. They are trying to match comment letters (SEC Form UPLOAD) to the subject 10-K filing. The timing was perfect because I wanted a unique use case for a presentation at the University of Nebraska at Omaha to highlight how directEDGAR’s feature set can accelerate data collection – especially if you are handy with coding in Python. I claimed in the announcement for the presentation that with directEDGAR I could accomplish something like this probably ten times faster than by relying on Python alone.
The nice thing about comment letters is that they generally identify the filing in the subject area of the letter. I began by running a search over all comment letters for FORM 10 K* or FORM 10K*. The 10 K* variant captures amendments and 10-KT filings; the FORM 10K* variant captures any 10KSB, 10-KT, and 10K405 filings. To make sure a 10-K form or variant was the subject of the comment letter rather than an incidental mention, I restricted the search to letters where the reference appeared within the first 200 words of the document. My final search was xfirstword pre/200 ((form 10 K*) or (form 10K*)). This search identified 79,213 comment letters.
You can see in the image above that I found 79,213 comment letters in my older copy of our filings in one minute and 15 seconds. I observe in those comment letters that the FORM reference is immediately followed by the date the filing was made. I want to capture that date, and I don’t want to spend a chunk of time doing it. One of the features of our application is that we create a text (TXT) version of every document and load it into the index. This is good because most of the comment letters are PDF files. I do not need to mess with a library in Python to extract the text – I can just use the EXTRACTION\DocumentExtraction TextOnly feature of the application to dump a well-named, text-only version of each letter into a folder on my computer.
That took about fifteen minutes – at the end of the fifteen minutes I had a text version of the 79K+ documents available on my desktop, ready for use with Python. Below is an image of the directory – each file is named using our CIK-RDATE-CDATE-stuff convention, which is useful for maintaining an audit trail back to the comment letter.
I want to scan some of these to make sure I understand their structure, so I used the SmartBrowser to quickly review the output.
Now I am ready to write some code to parse the specific filing type and the filed date out of each letter. I am fairly new to Python 3.9 – we anchored on 2.7 many moons ago. Since I expect most of our users are using one of the more modern versions of Python, I wrote the code in 3.9.0. My general strategy is to read the first 30 lines of each text file into a list, find the line that begins with the word FORM, and confirm it has 10-K (actually I am going to look for FORM 10). I will then inspect the line to see if it begins with the word FILED. If I find the word FILED I will split the line and try to convert the remainder into a date. I am also going to convert the date in the line that has FORM 10-K into a date object – this is our CDATE, better known as the balance-sheet date. Below is an image of the lines I processed for this particular filing. I spent approximately 30 minutes on the code – it is not perfect, but my goal in this experiment is to demonstrate how our platform combined with your skills can accelerate your work.
Notice – I am trying to avoid going into the weeds here. If you get a chance to replicate what I am doing, you will see why I made the decisions I did to identify the relevant dates, and why I allowed for the possibility that the FILED date could be on the next line after the line that describes the FORM, or on the second line after it.
I want to observe that these documents were created by humans, so there are errors – some of the dates fall outside the bounds of what would be expected:
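A cheap sanity check catches most of these outliers before they pollute the match step. The thresholds below are illustrative assumptions (EDGAR electronic filing begins in the early 1990s, and a 10-K is normally filed after, and within roughly 18 months of, its balance-sheet date):

```python
from datetime import date

def plausible(filed, cdate, today=date(2021, 1, 1)):
    """Flag (filed, cdate) pairs that fall outside reasonable bounds.
    The cutoffs are assumptions for illustration, not hard rules."""
    if not (date(1993, 1, 1) <= filed <= today):
        return False            # filed date outside the EDGAR era
    if cdate >= filed:
        return False            # balance-sheet date should precede filing
    if (filed - cdate).days > 550:
        return False            # ~18 months; a stale reference is suspect
    return True
```

Pairs that fail the check can be routed to a manual-review pile rather than silently dropped.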
Now that I have the balance-sheet date (CDATE) and the dissemination date, I am ready to access the actual filings. We have a CIK-DATE search feature in our application that limits the results to documents/filings made by a unique list of CIK-DATE pairs – we can set a window around the date parameter if we choose. Since the SEC’s filing date may in some cases deviate from the dissemination date, I am going to set a five-day window.
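If you wanted to reproduce that windowed match in plain Python, the core is just a tolerance comparison on (CIK, date) pairs. The tuple shapes below are assumptions chosen for the sketch, not the application’s internal format:

```python
from datetime import date

def within_window(target, candidate, days=5):
    """True if candidate falls within +/- `days` of target - mirrors the
    five-day window set on the CIK-DATE search."""
    return abs((candidate - target).days) <= days

def match_filings(letter_pairs, filings, days=5):
    """Keep filings whose (cik, filed_date) lands within the window of a
    (cik, dissemination_date) pair taken from the comment letters."""
    matched = []
    for cik, fdate in filings:
        for pcik, pdate in letter_pairs:
            if cik == pcik and within_window(pdate, fdate, days):
                matched.append((cik, fdate))
                break
    return matched
```

For 79K+ letters you would index `letter_pairs` by CIK in a dict rather than nest loops, but the logic is the same.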
The search/filtering process on this type of search can take a bit longer since we have to filter on both CIK and date – but we are way ahead. In the test I just ran it took 12 minutes, so a bit less than two hours invested in total. I am not going to claim that this is the end point – there were some exceptions I could clean up – but my goal was to demonstrate the possibility. In summary, I spent less than two hours to go from an idea to a result. You can see in the image below that I can ‘touch’ the filing mentioned in that comment letter above (as well as all of Abbott’s other 10-K filings that were the subject of a comment letter).
I am going to share the code here. I am brand new to 3.9 and some of the conventions are different from 2.7, so if you are an expert 3.9 coder and want to improve the code – share your results. Otherwise, for those of you just starting out who want to play with the interaction of directEDGAR and Python, this seems like a worthwhile place to muck around. My goal with this exercise was to highlight another way our platform can accelerate your data collection if you just think outside the box.

Rough Code to Parse Comment Letters