I received an email this morning asking a really interesting question – how can I use directEDGAR to identify the auditor and the location of the auditor’s office in 10-Ks filed before 2000? This data object is not readily available from any source that I am aware of.
Step 1 of the process was to go look at some 10-K filings – I used the following search (CNAME contains(CONAGRA)) and (DOCTYPE contains(10K*)) – I simply wanted to review how this disclosure was made in Conagra’s 10-K. I selected Conagra because I have had many students take an internship and so their name came to mind first. Here is what the disclosure looked like:
So our client is looking to capture the name of the auditor and the location of their office – my immediate thought was that I could search for auditors by name, the name of the states and require that the auditor, state name and a date be in close proximity to one another. If the search was successful I would then extract the context. I had to go do some research to identify the name of the audit firms that existed during the span of time they are trying to collect this data for. This list is not meant to be exhaustive but as a starting point I came up with these auditors: (anderson or ernst or kpmg or waterhouse or pricewaterhousecoopers or coopers or pwc or deloitte or bdo or mcgladrey or grant or baker or crowe). Given that the audit report spans from 1/1/1995 to 12/31/2001 – I need a date search parameter – date(1/1/1995 to 12/31/2001). The magic here is that our index parser recognizes dates in US form. I also need to set the search to focus on 10-K filings as well as Exhibit 13 filings. Sometimes the audit report is included in the Exhibit 13 rather than the body of the 10-K.
Here is the search string I put together for my first stab at collecting this data:
date(1/1/1995 to 12/31/2001)
kpmg or (. . . more auditor names)
Arizona or (. . . more state/location names)
There are basically four parts to this search. First, the date span, then auditor names and the state name. These first 3 parts need to be grouped together since we have set some proximity parameters for these particular items. I then have the document restrictions. I need the document restrictions so I don’t find the content in a consent filing.
I want to keep a fairly tight context for the extraction so I set the Context option span to be 5 words (before to after). In my initial naive pass the search identified 41,567 documents. An example of the extracted context is here:
|accepted accounting principles. /s/ Arthur Anderson LLP – —————————— Boise, Idaho. February 2, 1998 <PAGE> UNAUDITED RESULTS OF QUARTERLY|
Clearly I need to make some improvements. I need to identify other names for auditors. Further, some state names may be abbreviated in some filings or the auditor may be domiciled in another country so I need to play with adding state abbreviations, names of countries or large international cities where I expect to find results. But the hard work is done – now we just need to experiment – identify auditors, places and setting the right distance between the date/auditor/location parameters.
The context extraction has an identifier for the critical values (name of state and the auditor name) and it has the actual context. I deleted a lot of the columns to focus on the relevant context for this particular example.
|accepted accounting principles. /s/ Arthur Anderson LLP – —————————— Boise, Idaho. February 2, 1998 <PAGE> UNAUDITED RESULTS OF QUARTERLY||1||1|
The next step is to parse out Boise – but this should not be difficult in Excel. It would be even easier in Python – but it is definitely doable in Excel.
I told the client – this was one of the most interesting searches I have performed in some time. As an after note – I built the search in Notepad++ in my first draft I had the parentheses wrong – using Notepad++ it was easy to keep track of the grouping.