Using the date search capability

I received an email this morning asking a really interesting question – how can I use directEDGAR to identify the auditor and the location of the auditor’s office in 10-Ks filed before 2000? This data object is not readily available from any source that I am aware of.

Step 1 of the process was to go look at some 10-K filings – I used the following search (CNAME contains(CONAGRA)) and (DOCTYPE contains(10K*)) – I simply wanted to review how this disclosure was made in Conagra’s 10-K. I selected Conagra because I have had many students take an internship there and so their name came to mind first. Here is what the disclosure looked like:

Conagra 10-K Deloitte signature.

So our client is looking to capture the name of the auditor and the location of the auditor’s office – my immediate thought was that I could search for auditor names and state names and require that the auditor, state name and a date be in close proximity to one another. If the search was successful I would then extract the context. I had to do some research to identify the names of the audit firms that existed during the span of time they are trying to collect this data for. This list is not meant to be exhaustive but as a starting point I came up with these auditors: (anderson or ernst or kpmg or waterhouse or pricewaterhousecoopers or coopers or pwc or deloitte or bdo or mcgladrey or grant or baker or crowe). Given that the audit reports of interest are dated from 1/1/1995 to 12/31/2001, I need a date search parameter – date(1/1/1995 to 12/31/2001). The magic here is that our index parser recognizes dates in US form. I also need to set the search to focus on 10-K filings as well as Exhibit 13 filings, since the audit report is sometimes included in the Exhibit 13 rather than the body of the 10-K.

Here is the search string I put together for my first stab at collecting this data:

date(1/1/1995 to 12/31/2001)

(anderson or
ernst or
kpmg or (. . . more auditor names)


(Alabama or
Alaska or
Arizona or (. . . more state/location names)


(DOCTYPE contains(10k*))
(DOCTYPE contains(EX13))


There are basically four parts to this search: the date span, the auditor names, the state names and the document restrictions. The first three parts need to be grouped together since we have set some proximity parameters for these particular items. I need the document restrictions so I don’t find the content in a consent filing.
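Because the auditor and state lists get long, it can help to assemble the or-groups programmatically and check the parentheses before pasting the search in. The following is only a sketch: the helper names are hypothetical, the term lists are shortened, and it stops at building the individual groups because the operators that tie the groups together (including the proximity settings) are not shown here.

```python
def or_group(terms):
    """Wrap search terms in one parenthesized group joined by 'or'."""
    return "(" + " or ".join(terms) + ")"


def parens_balanced(text):
    """Return True when every '(' in text has a matching ')'."""
    depth = 0
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its '('
                return False
    return depth == 0


# Shortened term lists; the full lists used in the search are longer.
auditor_group = or_group(["anderson", "ernst", "kpmg"])
state_group = or_group(["Alabama", "Alaska", "Arizona"])

print(auditor_group)
print(state_group)
print(parens_balanced(auditor_group + " " + state_group))
```

The balance check is the same bookkeeping a bracket-matching editor does visually, so it is mostly useful when the search string is generated rather than typed.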

I want to keep a fairly tight context for the extraction so I set the Context option span to be 5 words (before to after). In my initial naive pass the search identified 41,567 documents. An example of the extracted context is here:

accepted accounting principles. /s/ Arthur Anderson LLP – —————————— Boise, Idaho. February 2, 1998 <PAGE> UNAUDITED RESULTS OF QUARTERLY
Context Extraction from Date/Auditor Search

Clearly I need to make some improvements. I need to identify other names for auditors. Further, some state names may be abbreviated in some filings, or the auditor may be domiciled in another country, so I need to play with adding state abbreviations, names of countries or large international cities where I expect to find results. But the hard work is done – now we just need to experiment with identifying auditors and places and setting the right distance between the date/auditor/location parameters.

The context extraction has an identifier for the critical values (name of state and the auditor name) and it has the actual context. I deleted a lot of the columns to focus on the relevant context for this particular example.

accepted accounting principles. /s/ Arthur Anderson LLP – —————————— Boise, Idaho. February 2, 1998 <PAGE> UNAUDITED RESULTS OF QUARTERLY 11

The next step is to parse out Boise – but this should not be difficult in Excel. It would be even easier in Python – but it is definitely doable in Excel.
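For anyone taking the Python route, here is a minimal sketch of that parsing step. It assumes the context always reads "City, State" followed by a US-style long date, which will not hold for every filing; the STATES list is shortened, the function name is made up for this example, and the signature rule in the sample context is abbreviated.

```python
import re
from datetime import datetime

# Shortened list for illustration; a real run would need every state name
# (and probably abbreviations and foreign locations as well).
STATES = ["Alabama", "Alaska", "Arizona", "Idaho", "Nebraska", "New York"]
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# One or more capitalized words, a comma, a known state, then a long-form date.
PATTERN = re.compile(
    r"(?P<city>[A-Z][A-Za-z.]*(?: [A-Z][A-Za-z.]*)*), "
    r"(?P<state>" + "|".join(re.escape(s) for s in STATES) + r")\W+"
    r"(?P<date>(?:" + MONTHS + r") \d{1,2}, \d{4})"
)


def parse_office(context):
    """Return (city, state, report_date) from an extracted context, or None."""
    match = PATTERN.search(context)
    if match is None:
        return None
    report_date = datetime.strptime(match.group("date"), "%B %d, %Y").date()
    return match.group("city"), match.group("state"), report_date


context = ("accepted accounting principles. /s/ Arthur Anderson LLP - "
           "Boise, Idaho. February 2, 1998 <PAGE> UNAUDITED RESULTS OF QUARTERLY")
print(parse_office(context))
```

Contexts that return None are the ones worth reading by hand, since they usually point to an abbreviation or a location the list does not cover yet.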

I told the client this was one of the most interesting searches I have performed in some time. As an afternote – I built the search in Notepad++. In my first draft I had the parentheses wrong, and using Notepad++ made it easy to keep track of the grouping.

Two New Filings Types Now Available

One of the critical benefits of our new platform is the ease with which we can distribute/add new filings to the search repository. Historically we have supported research projects when one of our clients has asked for a special directEDGAR repository to be created for a specific project. Of all of the filings available on EDGAR the one type that has been asked for most often has been S-1 filings. When we have created these I have been reluctant to push them out to everyone because of the level of coordination it would have required with your IT support. All of the coordination issues go away with our new platform.

I had a client reach out to me last week and ask if we could make S-1 filings and DRS (Draft Registration Statements) filings available to them. We already had an archive of S-1 filings so I just needed to update the archive and transfer it to our new platform. The DRS filings were ones I was not familiar with but they look particularly interesting for research into IPOs as well as governance. A simple description of the DRS filings is that they are S-1s confidentially filed while the registrant is sorting out the IPO process. We are including letters from the registrants in response to SEC comments in the DRS folder with the DOCTYPE tag DRSLETTER. Note – the original SEC Comment letters that prompt the response from the registrant have been available in the UPLOAD folder as they have been released. Since the DRS filings have only been available since Q4 2012 there are not that many of them. Both sets of filings are now available on our new platform.

If you are using our new platform – remember that your instance is loaded with your preferences and we maintain your configuration settings in the cloud. Thus you have to update the index library to have access. The process is pretty straightforward and is described in the Help under Indexing; the specific topic is labeled Index Updates.

Help Content Indexing\Index Updates

Once you have updated the index library the S1MASTER and DRSMASTER indexes are available for use. You can then use our amazing search as well as the full range of Extraction and Normalization tools on these filings. In the image below I ran a search to identify all DRSLETTER (DRSLTR) type documents.

Search for DRS LETTER documents using the DOCTYPE search field

Using directEDGAR in the classroom – Pivot Tables and Gender Diversity

(Note – there is a sample data file at the end so you can use it for class)

Our director of data products (Manish Pokhrel) shared this NY Times article with me this morning: “Diversity Push Barely Budges Boards to 12.5%, Survey Finds”. Manish pushed this to me for three reasons. First, we have discussed adding some measure of ethnic diversity to our director compensation data. However, we have hesitated because of concerns that we could make a mistake by relying strictly on pictures. Second, the study was completed by one of our clients – though I am not making any claims about how much of our data was actually used in this study.

The third reason is that I have mentioned before that directEDGAR can be useful in the classroom to teach skills and to provoke discussion about important business issues. Most business schools have classes where students learn how to use some of the more advanced features of Excel – including the use of pivot tables.

When I read the article it seemed to me that, given the attention to issues of diversity, creating an opportunity for students to muck around with at least one dimension of this issue while improving their data skills would be a winner.

I decided to create a data file for this analysis – but rather than focus on the largest 3,000 US companies I decided to start with the largest 500 as of June 30, 2020. There is such a diversity in size from the largest to the 3,000th largest that I am not sure the comparisons are meaningful. To provide some context – the largest company by market cap was/is Apple which had a market cap of over one trillion dollars – the 3,000th largest had a market cap of 75 million dollars. The 500th largest had a market cap of about 8 billion dollars.

I started with the 500 largest and created a request file to pull the most recent DC data. For most of the firms the most recent was 2019 – however there were a number that have reported FY 2020. The most recent data was from Cintas who filed today. And then I pulled the data for these firms from five years ago. So if their most recent DC data was 2020 – I pulled 2015 data. If their most recent data was 2019 I pulled 2014 data.

I lost 37 registrants because they did not have DC data available five years earlier (GoDaddy reported 2019 DC data in 2020 but their first year of DC data was reported for their 2015 FY).

The results are interesting. The highlights include the fact that 39 did not have any female directors in the first year of this analysis. Only one did not have any for the final year – but there is a caveat. The lone holdout was Liberty Broadband Corp – a woman joined their board in early 2020, but there were no female directors for 2019.

There were 905 women out of a total of 3,721 non-executive directors in the first year of analysis (2014/2015). However this distribution changed to 1,415 out of 3,458 in the final year.

There were 56 registrants where the proportion of female directors declined across the five year span. However, I hesitate to draw any inferences about the declines without a close review of the facts. For example – Howmet Aerospace (CIK 4281) had 4/11 female directors in 2014 but only 1/12 in 2019. However – they were involved in a complicated corporate restructuring (Alcoa became Arconic became Howmet Aerospace). The primary aluminum business was spun off to a new entity (named Alcoa) and many of the women leaders became directors of the new entity. Finally, Howmet shareholders elected 3 women to the board at the 2020 annual meeting.

I apologize, I got a little deeper into the weeds with this post than I intended. The bottom line is that this is an interesting issue and I intend to show my students how to organize this data using a pivot table. We can then have a discussion about what this means for them. Should students use evidence of a company’s diversity when they evaluate if they should accept a position with the company? It certainly gives them more interesting questions to ask in the interview – can you explain why your board composition does not reflect the community?

If you are a directEDGAR client – you can pull the data yourself. However, if you don’t want to or you are just interested in mucking around with this data – I have made available an xlsx file with this data summarized. I included the Summary – I will however, remove the Summary worksheet before I pass it to my students to create the pivot tables. Since the data includes the compensation as well as the SIC of the companies it is possible to create some different cuts/tables of the data.
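If you would rather prototype the pivot outside Excel, the idea can be sketched in a few lines of Python. The rows and the column names (CIK, PERIOD, GENDER) below are made up for illustration and may not match the headers in the xlsx file; the point is only the count-by-row-and-column step that a pivot table performs.

```python
from collections import defaultdict

# Made-up director rows for illustration; not the data in the file.
rows = [
    {"CIK": 101, "PERIOD": "first", "GENDER": "M"},
    {"CIK": 101, "PERIOD": "first", "GENDER": "M"},
    {"CIK": 101, "PERIOD": "first", "GENDER": "F"},
    {"CIK": 202, "PERIOD": "first", "GENDER": "M"},
    {"CIK": 101, "PERIOD": "final", "GENDER": "M"},
    {"CIK": 101, "PERIOD": "final", "GENDER": "F"},
    {"CIK": 202, "PERIOD": "final", "GENDER": "F"},
    {"CIK": 202, "PERIOD": "final", "GENDER": "M"},
]


def pivot_counts(rows, index, column):
    """Count rows for each (index value, column value) pair, pivot-table style."""
    table = defaultdict(lambda: defaultdict(int))
    for row in rows:
        table[row[index]][row[column]] += 1
    return {key: dict(counts) for key, counts in table.items()}


def female_share(counts):
    """Share of directors tagged 'F' within one pivot row."""
    total = sum(counts.values())
    return counts.get("F", 0) / total if total else 0.0


by_period = pivot_counts(rows, "PERIOD", "GENDER")
for period, counts in by_period.items():
    print(period, counts, f"{female_share(counts):.0%}")
```

Swapping "PERIOD" for another column such as the SIC code gives a different cut of the same data, which is the same move as dragging a different field into the Rows area in Excel.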

Legal stuff – this data may only be used for non-commercial purposes. We make no warranty regarding its fitness for any particular purpose. We hold a copyright on this data and hope you will respect it. Here is the link to the Excel file (Director Compensation file)

Testing new field in DC data

As I have described before we have been working to identify those cases where registrants have gone through some type of reorganization that has led to the creation of a new entity that has the successor filing obligations of the original entity.

You have had the ability to use the mapping we have created when you use CIK filtering for tasks with directEDGAR. So if you have a file that has Alphabet in the sample (CIK 1652044) but you are requesting data for a time period when Google was the reporting entity (prior to 2016) our application would return data for both CIKs.

The problem was that you would not know why data for CIK 1288776 (Google’s CIK) was included in the results (of course you would if your sample had one CIK but not if you had hundreds or thousands of CIKs).

To address this problem we are going to add a new field (or fields) to our data: ALT_CIK_#. In English that would read as Alternate CIK #. In most cases there is only one alternate CIK (Google -> Alphabet, Oracle -> Oracle). There are cases though where there are several (CIK 23498 -> CIK 1636023 -> CIK 1732845).
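One way to picture the mapping behind a field like this is a successor table that gets walked in both directions. The sketch below is not our implementation: it assumes each CIK has at most one predecessor and one successor, the function names are made up, and the order in which alternates are numbered is a guess.

```python
# Predecessor CIK -> successor CIK, using the pairs mentioned in this post.
SUCCESSOR = {
    1288776: 1652044,   # Google -> Alphabet
    23498: 1636023,
    1636023: 1732845,
}


def related_ciks(cik, successor=SUCCESSOR):
    """Return every CIK in the same reorganization chain, oldest first."""
    predecessor = {new: old for old, new in successor.items()}
    start = cik
    seen = {start}
    # Walk back to the oldest entity in the chain.
    while start in predecessor and predecessor[start] not in seen:
        start = predecessor[start]
        seen.add(start)
    # Walk forward collecting each successor in turn.
    chain = [start]
    while chain[-1] in successor and successor[chain[-1]] not in chain:
        chain.append(successor[chain[-1]])
    return chain


def alt_cik_fields(cik):
    """Build ALT_CIK_1, ALT_CIK_2, ... for the other CIKs in the chain."""
    others = [c for c in related_ciks(cik) if c != cik]
    return {f"ALT_CIK_{i}": c for i, c in enumerate(others, start=1)}


print(related_ciks(1636023))
print(alt_cik_fields(1652044))
```

Keeping the original CIK on each row and adding the alternates beside it means every observation can still be traced back to the filing it came from.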

We are testing this now and will roll out comprehensively when we have finished the cloud shift. However, at times you will see this new field in some data you extract from our preprocessed core data. You will also see ALT_CIK as a new metadata field in the DEF14 search extractions on the cloud platform.

To illustrate this – suppose I am trying to access director compensation data and I create a request file using the CIK for Alphabet – I know that director compensation data was available beginning in 2007 and so the request file looks like the following:

Request file using just Alphabet’s CIK

So after I have created the request file I use the application to select it and to select the Include Historical CIKs checkbox as illustrated in the next image:

Using directEDGAR to extract Director Compensation data

Once the inputs have been selected hit the Okay button and the application will work the magic. The results file will have all of the usual data as well as this new field – in the image below I hid the data (CASH/STOCK etc) to make it easier to see the CIK (the value associated with the filing) and the ALT_CIK_1 field.

Results for Director Compensation with ALT_CIK field

I will observe that one thought was to replace the CIK with the successor CIK. I don’t want to do that because you could not audit/trace back to the source document the data came from. Further, I can imagine there will be cases where it is significant to control for this shift. We will be pushing this out throughout the platform as soon as we can.

We’re getting closer

Using directEDGAR in my browser

I’m a bit giddy with excitement – the video above illustrated identifying all 8-Ks filed as earnings announcements for a sample of approximately 2,800 registrants. An interesting number of them (555) did not have an earnings announcement filed in the span I was looking at. My output included summary details about the filings and I was able to save those who did not have an 8-K filing as a separate list for review.

Those of you who use directEDGAR already are familiar with the ease of this task. I was talking to a client yesterday who recounted the misery of trying to identify these by using code and the SEC website when he was a PhD student. There have been more than 315,000 8-K filings made since 1/1/2016 (my time period) and for their study they had to inspect each one to determine its relevance. I need to ask him how long that took.

In addition – sometimes you need access that is more native to your work environment rather than dealing with the limitations of a cloud experience. So we are also providing a native application mode that will more closely match the experience of working off your desktop. We’ve got that handled as well. The following video illustrates using our application to collect the data where the registrant uses language such as Our audit committee met N times blah blah blah.

Native Application Mode – Context Normalization

Remember – while it looks like directEDGAR is running on my desktop in the above video – it is actually running in Oregon or Virginia – but it brings all of the features to your local environment.

Rabbit Holes – Why we don’t use the SEC labeled filing date with directEDGAR

If you’ve dived into directEDGAR you know that we have two key dates for searching filings. One we label the RDATE (R for revealed) – the label is weak in the sense that when I was developing our initial infrastructure I should have called it the DDATE (D for dissemination).

The RDATE is to provide you with some relative certainty regarding the date that the filing became available to EDGAR users. So for example if you wanted to study market response to filings then you would want to know the date the filing was revealed/made available or disseminated through the EDGAR platform. This is actually not the filing date as reported on EDGAR. I have a great example to illustrate this.

For background we are doing some work right now to identify any gaps in our director and executive compensation data. Specifically we ran some code to identify those cases where a registrant is missing one or more years in the time series of the compensation data we have available.
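The gap check itself is simple to express. Here is a minimal sketch, assuming the data has been reduced to (CIK, year) pairs; the CIKs and years below are placeholders rather than actual registrants, and the function names are made up for this example.

```python
def missing_years(years):
    """Return the years absent between the first and last observed year."""
    observed = set(years)
    if not observed:
        return []
    return [y for y in range(min(observed), max(observed) + 1)
            if y not in observed]


def find_gaps(records):
    """Map each CIK to its missing years; CIKs with no gaps are left out."""
    years_by_cik = {}
    for cik, year in records:
        years_by_cik.setdefault(cik, set()).add(year)
    gaps = {}
    for cik, years in years_by_cik.items():
        missing = missing_years(years)
        if missing:
            gaps[cik] = missing
    return gaps


# Placeholder data: registrant 111 skips one year, registrant 222 is complete.
records = [(111, 2007), (111, 2008), (111, 2009), (111, 2010), (111, 2012),
           (222, 2015), (222, 2016), (222, 2017)]
print(find_gaps(records))
```

A check like this only flags candidates, and as the rest of this post shows, which date field the years are taken from changes what counts as a gap.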

I was double checking the code results in preparation for assigning the data collection to one of our interns. The first registrant I identified was CIK 3906. We had DC data from proxy filings with RDATEs from 2007 to 2010 and then we had a result from an RDATE of 2012. So I presumed that we were missing 2011 and looked to see if I could sort out why we would have missed that particular year. First stop was EDGAR – and I see yes – there was a DEF 14A “filed” in 2007.

So then I switch to my network copy of directEDGAR – I want to sort out why we would have missed this observation and not just collect the data. So I open up the correct folder and I don’t see a folder that looks right. In the image below I would expect a folder in the sequence where the arrow is pointed. The folder above is the PRE 14A (we don’t use these for comp data as too often they will not have the complete data).

When I was comparing our archive with EDGAR I also realized that there were not any proxy filings on EDGAR with a 2012 date – the most recent proxy was filed in 2010.

I’ve been here before. I have gotten emails from clients who have found an occasional 8-K or 10-K filing where the dates have not matched to data they’ve collected elsewhere or matched to the filing date for the forms on EDGAR. I’ve shown them where we pulled our dates from . . . My point to them was that I trusted our dates but I’ve never tried to prove that the filings were not actually available on the filing dates. However, two days ago I was doing some arranging in my office and I came across a copy of directEDGAR that was about ten years old. We did a significant rebuild of our platform beginning in 2015 that was distributed to our customers in early 2016. The software changed and we also did a complete rebuild of the filing archive so we could use our new search engine. You may not recall but one constraint we were dealing with in the ‘old’ days was the two gigabyte file size limitation imposed by 32 bit Windows (64 bit Windows was very uncommon when we started). This affected the size of the indexes and so our filings were organized in two year folders.

Below is an image from the correct folder – as you can see this version of directEDGAR was created in June 2010.

The filing I am looking for is not there. Another alternative is that I missed that filing when I constructed that version of directEDGAR. However, I am confident that is not the case as we did significant testing to make sure we knew how to capture the filings based on the indexes – that is if a filing was of the type that we wanted and it was listed in the index in that period we captured it.

However if you go to EDGAR right now and access the index (master.idx) for Q2 2007 the filing that is under discussion is listed there.

Does this mean we missed it? No, actually the SEC indexes are not static. The EDGAR code platform modifies them frequently. It looks like the Q2/2007 master.idx file (which is the one we use) was last modified in September 2014.

When we do an update we don’t just pull the latest index. Instead we pull all of the indexes (all the way back to 1993) and compare the new indexes to the last version of the index we have stored in the cloud and then we pull those filings that are not listed in the previous/archive index (no matter the stated filing date). My recollection is that in the last update we did there were more than 3,000 filings that have an SEC ‘FILING DATE’ before 12/31/2019 that we had to capture and add to directEDGAR because they were not listed in our last comparison of indexes.
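The comparison boils down to a set difference keyed on the filing path. The sketch below is not our production code: it assumes the pipe-delimited master.idx layout (CIK|Company Name|Form Type|Date Filed|Filename), the function names are made up, and the two sample indexes contain invented rows.

```python
def parse_index(text):
    """Map filename -> (cik, form_type, date_filed) for each data row."""
    entries = {}
    for line in text.splitlines():
        parts = line.strip().split("|")
        # Skip the header, the dashed separator and blank lines.
        if len(parts) != 5 or not parts[0].isdigit():
            continue
        cik, _name, form_type, date_filed, filename = parts
        entries[filename] = (int(cik), form_type, date_filed)
    return entries


def newly_listed(archived_text, current_text):
    """Return entries in the current index that the archived copy lacks."""
    archived = parse_index(archived_text)
    current = parse_index(current_text)
    return {name: current[name] for name in current.keys() - archived.keys()}


# Invented rows: the current copy of an old quarter gained one filing.
archived_idx = (
    "CIK|Company Name|Form Type|Date Filed|Filename\n"
    "--------------------------------------------\n"
    "1000|EXAMPLE CO|10-K|2007-03-01|edgar/data/1000/0000000000-07-000001.txt\n"
)
current_idx = archived_idx + (
    "1000|EXAMPLE CO|DEF 14A|2007-04-20|edgar/data/1000/0000000000-07-000002.txt\n"
)
print(newly_listed(archived_idx, current_idx))
```

Keying on the filename rather than the stated filing date is what lets a late-appearing filing surface even though its date falls in a quarter that was processed long ago.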

So back to the punch line – we were actually not missing any compensation data for this filer. We had comp data from a 10-K/A that was filed in 2007 (RDATE 20070809 which also matches the filing date). The 2012 RDATE comp data we have correctly reports the YEAR (2006) that the data relates to. We have the complete time series of DC data for this registrant so we just have to delete the as-reported data with an RDATE in 2012 since it duplicates the data included in the 10-K/A that was disseminated earlier. The filing in question was not available until 2012.

Of course the question is – why was this filing not disseminated until 2012? While I can’t fully answer that question – our research indicates that this often happens when one or another of the filings made by the company is under review by the SEC. In this case it seems that the PRE 14 was reviewed and the registrant responded to the points raised by the SEC in a letter associated with the DEF 14A. I can’t conclude anything more. However, I am very confident that the filing was not disseminated until 2012. If I do a search in directEDGAR PROXYMASTER I can find more than 200 filings with an RDATE more than 1,000 days greater than the CDATE (normally on proxy filings the RDATE is before the CDATE). When I spot check these (only 3) I see exactly what I saw with the filing from CIK 3906. Our CDATE matches, the filings were not available on the older version of directEDGAR and there is CORRESP included with the filing.
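If you have extracted the RDATE and CDATE for a set of filings, the same screen can be reproduced outside the search. This is a rough sketch that assumes both dates are stored as YYYYMMDD strings (the RDATE above is shown that way; the CDATE format is an assumption), with invented rows and made-up function names.

```python
from datetime import datetime


def days_between(cdate, rdate):
    """Number of days from CDATE to RDATE, both given as YYYYMMDD strings."""
    fmt = "%Y%m%d"
    return (datetime.strptime(rdate, fmt) - datetime.strptime(cdate, fmt)).days


def late_disseminations(filings, threshold_days=1000):
    """Keep filings whose RDATE trails the CDATE by more than the threshold."""
    return [f for f in filings
            if days_between(f["CDATE"], f["RDATE"]) > threshold_days]


# Invented rows: the first was disseminated years after its conformed date.
filings = [
    {"CIK": 111, "CDATE": "20070501", "RDATE": "20120315"},
    {"CIK": 222, "CDATE": "20100510", "RDATE": "20100401"},
]
print(late_disseminations(filings))
```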

In summation – EDGAR is a ‘living’ thing. As I noted earlier – when we distributed the last update we identified more than 3,000 filings that were not listed in the previous indexes but were in the version of the indexes we accessed in early January.

Data Delay – Interesting Problem

I was fully expecting to begin making the Director-Relationship data available by now. However, we have run into some really interesting problems that we are having to sort through. We made an assumption that there was a one-to-one relationship between a Central Index Key and a person when a person has an SEC reporting obligation. However, as we were aggregating our director data to organize it for the relationship data presentation our data guru (Manish Pokhrel) discovered this was not true.

Manish was trying to create various integrity tests before we made the final merge and in one of the scenarios he tested he discovered that there are approximately 40 people who have multiple CIKs. Here is a screenshot of the SEC landing page for Dr. Glimcher (who was on the board of Bristol-Myers Squibb from 1997 to 2017).

Glimcher SEC Landing Page

Clearly these look to be the same person – if you follow the links and read her biographies in the related filings it becomes clear that yes, Dr. Glimcher ended up with two unique CIKs.

The problem is that we have one CIK associated with some instances of her compensation (and ownership data) for some filings and the other CIK associated with other instances. For the compensation data and the relationship data to have the most value we need to standardize it.

The decision we made last night is that we are going to use the most recent CIK of these individuals. This means we have to go back through the compensation data and replace any instances where the older CIK value is included as the PERSON-CIK. I will observe that other cases of this are not as clear cut as Dr. Glimcher’s.
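The replacement step can be sketched as a lookup table built from the groups of duplicate CIKs. Everything below is illustrative: the CIK values are invented, the helper names are made up, and "most recent" is taken here to mean the CIK with the latest filing date, which is only one way to read the rule.

```python
def canonical_map(groups):
    """Map every CIK in each duplicate group to the group's most recent CIK.

    Each group is a dict of {person_cik: latest filing date as YYYYMMDD}.
    """
    mapping = {}
    for group in groups:
        keep = max(group, key=group.get)  # CIK with the latest filing date
        for cik in group:
            mapping[cik] = keep
    return mapping


def standardize(rows, mapping):
    """Return copies of the rows with PERSON-CIK replaced where needed."""
    return [
        {**row, "PERSON-CIK": mapping.get(row["PERSON-CIK"], row["PERSON-CIK"])}
        for row in rows
    ]


# Invented example: one person holds CIKs 5001 and 5002.
groups = [{5001: "20091231", 5002: "20170301"}]
rows = [
    {"PERSON-CIK": 5001, "YEAR": 2008},
    {"PERSON-CIK": 5002, "YEAR": 2016},
    {"PERSON-CIK": 7000, "YEAR": 2016},
]
print(standardize(rows, canonical_map(groups)))
```

Returning new rows instead of editing in place keeps the original values around for auditing, which matters for the cases that are not as clear cut.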

This has really been an interesting exercise. This is the first time we have pulled all of our compensation data at one time and tried to do some deep analysis. All of our previous integrity analysis has focused on one individual company and a fairly limited time series at a time. We have over 69,000 unique directors identified (NAME-PERSON-CIK). So as you can imagine it is a special challenge to find ways to cross validate the data.

Bottom line is we need to do some more testing – not too much more but we are still trying to identify ways to make sure this resulting data is clean. We also have to sort out how to make sure we propagate a specific CIK for a person through our system. I want to make sure that when you download our ownership transaction data, director votes data, our beneficial ownership data (I can’t remember where else we use the PERSON-CIK) you get clear links across time and between entities.

Complete Redesign of directEDGAR’s Delivery Modality – 10 Beta Testers Needed

Yesterday we inaugurated our first substantial test platform for our new delivery system. After the January 2021 update there will not be any more updates delivered through the mail. Instead we are using some technology from Amazon to host and deliver our application in the cloud. We expect to begin transitioning clients to this platform on a voluntary basis early in the 4th quarter and while we will make the final 2020 update available to allow those that want to transition a bit slower – that will be the final update.

The AWS AppStream service is amazing and provides us the opportunity to improve the timeliness of filing access (think of near immediate) as well as relieve your IT staff of their responsibility for local management of directEDGAR. The best thing about this change is that we do not have to impose any of the limits that come from a web based search. We can continue to provide the absolute best search experience and make the platform available to you anywhere – anytime.

I am looking for 10 beta testers from our users. If you are interested please send me an email. As a beta tester you can help us make sure we have some good user feedback on the experience. I will note I am already excited about this experience – searching and rendering is about 40% faster than on my local computer. The people who will get the most out of this experience initially will be those who need access to the most timely 10-K and DEF 14A filings. I expect the full directEDGAR content to be available by the middle to the end of this month.

We also need some advice or suggestions regarding the addition of metadata to our filings. Right now we include the SEC dissemination date (RDATE) as well as the conformed date (CDATE) and the item codes for each 8-K filing. We include word counts and issuer details like SIC code and FYE. We also provide doctype that drills down to the conformed exhibit code. I am intending to add the actual filing date/time stamp (most filings made on Friday after 5:30 have a Monday RDATE as they are not disseminated until 6:00 AM on Monday). I also want to add filer status (LARGE ACCELERATED, ACCELERATED etc). Is there anything else that you think should be added? If you remember – the metadata provides additional filtering opportunities. The idea is we can search documents for search phrases and words and then additionally filter on the metadata to provide even more focused results.
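To show why the time stamp matters, here is a rough sketch of how an expected RDATE could be estimated from a filing time stamp. Only the Friday-after-5:30 case comes from the paragraph above; applying the same cutoff to other weekdays, rolling weekend filings forward and ignoring holidays are all assumptions made for this example, and the time stamp is treated as a naive Eastern time value.

```python
from datetime import datetime, time, timedelta

CUTOFF = time(17, 30)  # assumed same-day dissemination cutoff


def expected_rdate(filed_at):
    """Estimate the dissemination date for a filing time stamp."""
    day = filed_at.date()
    if filed_at.time() > CUTOFF:
        day += timedelta(days=1)      # after the cutoff: next day
    while day.weekday() >= 5:         # Saturday or Sunday: roll forward
        day += timedelta(days=1)
    return day


# September 11, 2020 was a Friday.
print(expected_rdate(datetime(2020, 9, 11, 18, 0)))   # after the cutoff
print(expected_rdate(datetime(2020, 9, 11, 16, 0)))   # before the cutoff
```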

More Data (Insider Affiliations and Peer Data)

As I have noted before – when we process compensation data (as well as insider trading, beneficial ownership, director votes and other data that includes people) we add the PERSON-CIK and SEC-NAME of the people that are the subject or are included in the data table. We do this so you can link people across time and entities.

However, I had some discussions with clients recently and they made a critical observation – this data would be difficult for them to create because to actually capture all companies a person is affiliated with they would have to dump all of our data and then organize it. Thus we decided to create a new data artifact. We are going to create an affiliation table that lists all other SEC registrants that each officer and director of a public company is affiliated with.

For example consider Roxanne Austin who is a director at a number of public companies. If you look at this summary of Ms. Austin’s SEC filings you can see that she has a reporting obligation because of her service on the boards of five registrants.

While her other directorships are described in her biographical information in the various proxy statements each of the biography paragraphs is formatted differently and there is some variation in the form of the various company names so this data is difficult to parse and convert to a useful form. Here is her biography from the DEF 14A of Abbott Laboratories. It mentions all of the companies listed above but it takes significant programming to link her to those companies. While the language is pretty clear (Ms. Austin currently serves on the . . .) this is described differently in other filings.

We have this data because we assign the PERSON-CIK to each director of a public company – even if they do not have a filing obligation (ownership interest) in the filer. We have wanted to organize this data for some time but have struggled with sorting out exactly how to deliver it in a way that is useful. Our initial effort is going to standardize this by calendar year. Clearly one of the challenges is that the fiscal year of these companies can vary and board service or employment can start or finish at any time. We decided that the best way to organize this is by issuer CIK and CALENDAR YEAR. Here is an image of what the data would look like after sorting it by PERSON-CIK. (I’ve left some columns out of this image).
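As a rough picture of the table, think of one row per person, issuer and calendar year. The sketch below uses invented CIKs and service spans, made-up function names, and a simplified input (first and last calendar year of service) rather than anything drawn from our files.

```python
def affiliation_rows(service):
    """Expand (person_cik, issuer_cik, first_year, last_year) spans into
    one (person_cik, issuer_cik, calendar_year) row per year of service."""
    rows = set()
    for person, issuer, first_year, last_year in service:
        for year in range(first_year, last_year + 1):
            rows.add((person, issuer, year))
    return sorted(rows)


def other_affiliations(rows, person, issuer, year):
    """List the other issuers a person is tied to in a given calendar year."""
    return sorted(i for p, i, y in rows
                  if p == person and y == year and i != issuer)


# Invented spans: person 9001 serves on three boards with overlapping years.
service = [
    (9001, 100, 2016, 2018),
    (9001, 200, 2017, 2019),
    (9001, 300, 2019, 2019),
]
rows = affiliation_rows(service)
print(other_affiliations(rows, person=9001, issuer=100, year=2017))
```

Sorting by person first, as in the image, puts all of a director’s board seats for a year on adjacent rows.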

This data should be available in the next two weeks. More details on how to access it will be posted once it is live.

We have been working on PEER data for a really long time. It has been a real challenge. We are finally making some progress, enough to describe it – not quite enough to set a date. This and Non-GAAP earnings reconciliations are some of the most frequently requested data objects. A number of our clients have or are using our tools to extract and normalize the Non-GAAP earnings reconciliations but the peer data is more complicated. Here is an image of the peer group listing for Abbott Laboratories from their 2018 proxy filing.

We faced two major challenges trying to identify this data and normalize it. One is that there is no indication inside the table (and that data is in a table) that these are peer companies. The second challenge is that the name is not very useful in the form that is reported in the table. We’ve finally developed some processes to help collect this data systematically. We are converting it into a form that will allow you to access the CIK of each peer.

I will post an update on this later.

Trying to join the modern age! YouTube Videos

I’ve been lucky recently to have a chance to talk to some clients. One of the thoughts I heard most frequently is that they want access to use/feature information in a more direct fashion. Thus, I have decided to start trying to create use videos.

The videos will show up in the sidebar. With the initial set I intend to illustrate the process of searching for, extracting and normalizing audit committee meeting frequency data. The first video is perhaps a little too long but my son Patrick was nice enough to say he thought I stayed on topic. I will try to keep these focused and try to directly address a specific question.

I have to do some experimenting – my goal is not to alert those of you who subscribed to this blog each time a new video is posted. Rather I hope the videos show up silently in the sidebar so that when you have a question or want to review some feature you can visit our new YouTube Channel. Of course those of you that want to learn about directEDGAR’s great features so you can join our client base – feel free to watch those videos as well!