Teaser 2 – Version 4.6 Event-Study-Style Filtering

Our date filtering has taken a giant step forward.  Previously you could filter a search by dates, but you had to apply the same date filter to the entire set of CIKs in your search.  Version 4.6 adds the capability to set a discrete date filtering window around each individual CIK-DATE pair.

You need to supply a CSV file with a CIK column and a MATCHDATE column.  The values in the CIK column need to be the CIK values for your sample – no left padding of zeros – just the integer form of the CIK.  The values in the MATCHDATE column need to be dates in the form MM/DD/YYYY or M/D/YYYY.  (For our international users: if the standard date form in your locale is D/M/YYYY, our application will expect that form – whatever date format your version of Windows defines as ‘normal’.)
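If you are generating this input file programmatically, here is a minimal sketch using Python's csv module.  The file name and the extra TICKER column are illustrative; the only requirements are the exactly named CIK and MATCHDATE columns:

```python
import csv

# Hypothetical sample file. The only required columns are CIK and MATCHDATE;
# additional columns (TICKER here) are allowed.
rows = [
    {"CIK": 1800,   "MATCHDATE": "3/15/2012",  "TICKER": "ABT"},
    {"CIK": 1800,   "MATCHDATE": "11/2/2015",  "TICKER": "ABT"},   # a CIK may repeat
    {"CIK": 320193, "MATCHDATE": "10/30/2019", "TICKER": "AAPL"},
]

with open("cik_matchdate.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["CIK", "MATCHDATE", "TICKER"])
    writer.writeheader()
    writer.writerows(rows)
```

Note the CIKs are written as plain integers with no zero padding, and the dates use the short form your locale expects.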

Here is an image of a valid input file – notice there can be additional data in the file.  The columns do not need to be adjacent, but they need to be clean in the sense that the CIK and MATCHDATE column headings must match exactly (no extra spaces, no different capitalization, etc.).  You can have multiple CIK-DATE pairs – in the image below I have two different MATCHDATE values for CIK 1800.


Once you have a CIK-DATE file created, start the application, check the Use CIK button in the CIK/Date filter section, and then select the Set CIK/DATE File button below the search box.


The user controls to manage the selection of the source file will become active.


Use the Browse button to navigate to and select the input file.  Notice in the Range Days area of the control you can specify the number of days before the MATCHDATE separately from the number of days after it.  Thus you can have a lop-sided window (0 to +180) or a symmetrical window (-45 to +45).  You also have the option to match on the RDATE or the CDATE.  Once you have selected the options the Okay button becomes active; select it to update the application with your input.
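To make the window arithmetic concrete, here is a small sketch (in Python, not the application's code) of the date test implied by the Range Days settings; treating the boundaries as inclusive is my assumption:

```python
from datetime import datetime, timedelta

def in_window(filing_date: str, match_date: str,
              days_before: int, days_after: int) -> bool:
    """Return True if filing_date falls inside
    [match_date - days_before, match_date + days_after] (assumed inclusive).
    Dates are M/D/YYYY strings, as in the input file."""
    f = datetime.strptime(filing_date, "%m/%d/%Y")
    m = datetime.strptime(match_date, "%m/%d/%Y")
    return m - timedelta(days=days_before) <= f <= m + timedelta(days=days_after)

# A lop-sided window (0 to +180) versus a symmetrical window (-45 to +45):
print(in_window("4/1/2012", "3/15/2012", 0, 180))    # True: filed 17 days after
print(in_window("1/15/2012", "3/15/2012", 45, 45))   # False: filed 60 days before
```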

Once you have fully defined your search you can hit the Search button.  The application will return only those documents that matched your search criteria, matched by CIK, and were filed within the date range you specified for the CIK-MATCHDATE pairs in the input file.


I want to observe that the search time represents filtering a base collection of over one million documents down to the 7,192 that matched my criteria.  Because this is research we understand how critical it is to be able to identify those CIK-DATE pairs that did not match any filings.  There is a View Misses button on the application that, when selected, will provide a list of those unmatched pairs.


Notice the Save as CSV button – selecting that will give you the chance to save these results to a file for manipulation, review, or re-submission with a different span.  The CSV file will have the CIK and MATCHDATE columns.  The file will not contain any of the other data values from your original submission file.






Teaser 1 – Version 4.6

We are finishing testing and working on the documentation for the next version of the ExtractionEngine.  We have added some new features that we hope will help with your research.  I will share these as we finish the testing and the documentation for each of the features.

The first feature we have completed is the ability to extract only the text from your search results.  While there is increasing interest in using text mining tools and performing sentiment analysis, one blocker has been that most text mining software assumes or requires that the input be plain text.  Since most SEC filings are in HTML, our clients have observed that it is painful to extract the text from a set of documents.

We added a new item to the Extraction menu – DocumentExtraction TextOnly.


If you select that option a new UI will start and give you the option to specify the destination directory where you would like the text form of the documents to be saved.  Once you have specified the destination directory and hit the Okay button, text-only copies of the original documents returned from the search will be placed in the specified directory.  Each document is named in the CIK-RDATE-CDATE-~ form, so you have an audit trail back to the source document.  Here is an image of a directory with a collection of filings that have been converted to text.



Here is a partial image of an 8-K filing in html format:


Here is an image of the same document after conversion to text:


Since we presume that these txt files will be used as inputs to some additional processing system, we have not added line breaks.  The only line breaks in the file are those indicated by the HTML code, so the files are a bit painful to read.  However, all of the text is present.
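For readers curious what such a conversion involves, here is a highly simplified sketch of the idea using Python's standard html.parser.  This is not the application's actual converter, and the set of tags treated as implying a line break is my assumption:

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Keep character data, drop tags, and emit a newline only where the
    HTML itself implies one (a simplification of the behavior described above)."""
    BREAKS = {"p", "div", "br", "tr", "table"}   # assumed break-implying tags

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BREAKS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        return "".join(self.parts)

parser = TextOnly()
parser.feed("<html><body><p>Item 8.01 <b>Other Events</b></p>"
            "<p>On June 1 the registrant entered into...</p></body></html>")
print(parser.text())
```

Inline tags such as b or span contribute no break, which is why the output runs together exactly as the converted filings described above do.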



Update to Audit Opinion Work

We have now completed the end-to-end linkage to make the audit opinions available for download automatically.  Specifically, we have connected all of the code pieces required to identify the 10-K filings made by Large Accelerated Filers, extract the audit opinion from the filing, and then push it to the server that handles your requests when submitted.  Because of the nature of the resources we are using, these reports are currently available around 3:00 AM the morning after they are filed. That is, if any are filed on Monday, the audit reports should be available to you at around 3:00 AM on Tuesday.

We are also working on providing a summary of the Critical Audit Matters described in the opinions as well as additional meta-data about the opinion.  If you would like to review the current status of this summary detail please follow this link (CAM_SUMMARY).  Below is an image of the current data columns in this file (columns A-E have meta-data that can be used to tie these details back to the actual 10-K filing where the opinion was found).


We are specifically identifying the type of audit opinion (REPORT_TYPE) where the CAM are described (FINANCIAL or COMBINED, where COMBINED includes both the opinion on the financial statements and the opinion on internal control).  We are also indicating the nature of the opinion; if an exception is noted it will be reported in the EXCEPTION column.  CAM_COUNT is an integer that indicates the number of CAMs listed in the opinion.  Finally, we are currently listing the title the auditor uses to describe each CAM.

Here are some fun facts (only accounting researchers/faculty would call these fun facts) from our initial analysis:

Distribution of CAM counts by registrant:


Five audit reports made mention of 4 CAMs, and 20 listed 3.  Here are the five registrants whose audit reports listed four CAMs:


The most common CAMs address measurement and impairment issues related to intangible assets (Goodwill generally, but also the allocation across various types of intangibles).  Both tax and revenue issues are also very prominent in the collection.  I think what surprised me the most is that there are mentions of inventory valuation issues in eight of the opinions.  I am not sure why I am surprised by this, as our profession has roots in several frauds involving massive inventory misstatement.

We are isolating the CAM with the plan to create a separate file for each CAM mentioned in an opinion.  We have some ideas as to why this separation will be useful – but more on that later.  We are also developing a standardized taxonomy with the intention of delivering both the original language of the CAMs as described by the auditor and a more parsimonious description that would make using these as inputs to empirical models a bit simpler.  To make this a bit clearer, here are four descriptions of Goodwill Impairment Assessments:

Goodwill Impairment Assessment – Cortland U.S. Reporting Unit
Goodwill Impairment Assessment
Goodwill Impairment Assessment – Adtalem Brazil Reporting Unit
Goodwill Impairment Assessment – Company-Owned Reporting Unit

A scan of these suggests these are all similar in nature and could be standardized to allow better sorting and coding.



Working with the CAM Audit Reports

A few days ago I posted that we had accomplished our goal of making the audit reports available for large accelerated filers on a timely basis.  I promised in that post to describe how to use those audit reports with our application.  In this post I describe those steps.

First you have to access the audit reports.  This is accomplished by submitting a request file with the CIK, YEAR and PF of the source document (in this case the 10-K).  Since the requirement to disclose CAM only applies to large accelerated filers with FYE ending after 6/30/2019, identifying the right filers can be tricky.  Here is a link to the latest model request file, which lists the CIKs of all registrants who meet the criteria and who have released a 10-K since the implementation date (CAM_REQUEST_LINK).  For more direct instruction on how to download the actual audit reports please review this blog post.  For this post I will assume you have already downloaded the audit reports.

When you have finished downloading the audit reports they will be stored in the directory you specified in the ExtractionPreprocessed user interface.  Each audit report is an independent document and is named CIK-RDATE-CDATE-F##-22.htm, where CIK is the Central Index Key of the issuer, RDATE is the date the 10-K filing was made available on EDGAR, and CDATE is the balance sheet date.  The two digits following F are the last two digits of the original filing accession number, and the 22 represents our internal text artifact number for the audit reports.
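If you want to recover the pieces of these file names programmatically, a regular expression along these lines should work.  Note that I am assuming RDATE and CDATE appear as eight-digit YYYYMMDD values, and the example file name is made up:

```python
import re

# Pattern for the CIK-RDATE-CDATE-F##-22.htm convention described above.
# The YYYYMMDD date layout is an assumption on my part.
NAME = re.compile(
    r"(?P<cik>\d+)-(?P<rdate>\d{8})-(?P<cdate>\d{8})-F(?P<acc>\d{2})-22\.htm$"
)

m = NAME.match("320193-20191031-20190928-F06-22.htm")   # hypothetical file name
if m:
    print(m.group("cik"), m.group("rdate"), m.group("cdate"), m.group("acc"))
```

The trailing -22 anchor also gives you a cheap way to confirm a file really is one of the audit report artifacts before processing it.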


If you want to review these documents individually, select the SmartBrowser feature in our software and navigate to the directory that has these artifacts.  The SmartBrowser will load the list of files and provide a list of the CIKs that are present in the left panel.  You can select any individual CIK to review that document, or begin at that particular point and move forward.


However, if you would rather use the full features of our platform with these opinions you need to index them.  To prepare for indexing you need to create a new directory on your computer where the audit reports can be saved and the index can be created.  In this example I am creating a new directory at F:\myTemp\DEMO_CAM_INDEX.  Once the directory has been created, select Create Index from the Indexing item on the menu bar.


When the Create Index panel loads simply select the directory that has the original audit reports as the Source Files Directory to Index.  Select the directory you created for the destination as the Destination Directory and then select the Create Index button.


The indexing process will begin – there will be some messaging as the indexing progresses, including the file being processed and the steps where the indexer pauses to save the partial indexes.  When complete the application will report Indexing Complete.


When you hit the Close button the application focus will switch back to the main component.  However, the active Index Library will switch from the library you were using to the directEDGAR_Custom library.  Your newly created index will be the last index in the list of custom indexes.


Select the index and start searching – you can now use all of the features of our application that are applicable.  For instance – suppose we want to identify all audit reports that mention revenue recognition:  a great initial search would be “revenue recognition”


As you notice in the screenshot – we have not injected any of our standard meta-data into the audit reports yet – that is why the company name is not yet displayed in the search panel.  We will make the code changes so this happens automatically in the next two weeks.  When that is done we will re-do these initial audit reports so the name is visible in the search panel and is reported in the Summary or Context extractions you might run.

To give you a quick example of using our platform's broader feature set with these artifacts, I will walk you through the process of extracting and normalizing the auditor tenure.  I know that tenure is generally reported with language similar to We have served as the company’s auditor since YYYY.  To find that language I am just searching for auditor since.  However, before I do that I am going to adjust the span of the ContextExtraction to five words – I want to minimize the noise in the output.  From the File menu select Options. . .


From the Options panel, Context is the first item.  Select it, replace the current value with the number 5, and make sure the Words radio button is selected.


Once you have adjusted the span – hit the OK button and then from the Normalization menu item select ContextNormalization.  There are three parameters to specify.  First, since we are working with the search results displayed in the application select the radio button next to Current Results.  You also need to specify an output directory – two files will be created and saved in the directory that you specify/select.  Finally you need to describe the nature of the normalization.


In this case we need the number that follows the string auditor since and we want to save it in a column with the heading tenure in the results.  When you have specified the parameters hit the Okay button.  When the application closes the ContextNormalization panel there will be two new files in the folder that you specified.


The FileToProcess.csv has the context from the search – the file with the date-timestamp appended has the results after the context has been normalized.


As you can see in the image above – the auditor tenure has been normalized from the context and is reported in the column cleverly labeled tenure.
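For readers who prefer to see the normalization spelled out, here is a sketch of the same idea in Python.  The context snippets are invented, and this regex is my reading of the rule ("the number that follows auditor since"), not the application's implementation:

```python
import re

# Pull the four-digit year that follows "auditor since" from each context snippet.
TENURE = re.compile(r"auditor since\s+(\d{4})", re.IGNORECASE)

contexts = [   # hypothetical five-word contexts from the extraction
    "served as the Company's auditor since 1985.",
    "as the auditor since 2013",
]

tenures = [int(m.group(1)) for s in contexts if (m := TENURE.search(s))]
print(tenures)   # → [1985, 2013]
```

Snippets without the pattern simply drop out, which mirrors how a normalization run leaves non-matching contexts blank.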

We are in the midst of revamping our audit fee data and will be adding tenure as a field to that data (as well as the auditor name).  We will also work on linking the fee data with the audit reports so stay tuned for more updates.







Audit Reports with Critical Audit Matters Now Available to Download

As many of you know the PCAOB mandated that the auditors of large accelerated filers with fiscal years ending after 6/30/2019 include a description and other facts about Critical Audit Matters in the audit report.  These started becoming available in July.  We began overhauling some parts of our infrastructure to parse out these reports and make them available for direct download from our platform.  The initial work has been completed and these are now available.

Here is a screenshot of the Critical Audit Matters section of Apple’s audit report from their 10-K filed last week.  If you are not familiar with our application, the audit report is displayed in our SmartBrowser – which allows our users to review htm and txt documents with intuitive features to advance through a collection of documents.


These are interesting reading and I suspect there are some great opportunities for research with these reports.  I will note that I have already used some of the discussions from some earlier ones to help me make some points with my Intermediate Accounting classes.  There is something really salient for students when they read about the challenges of auditing revenue for a company that has multiple performance obligations and has to recognize revenue over time.

You should be able to see these reports listed in the artifact list when you create a request file – they should show up as the last item in the Text Sections list of artifacts.


If you have a properly organized request file, select AUDIT_REPORT from the ExtractionPreprocessed menu and hit the Okay button; our server will process your request and deliver these snips to your desktop.

Of course the immediate problem is identifying those companies that are Large Accelerated Filers with a FYE after 6/30/2019.  To make this step a little easier we will maintain a list of filers who meet that criterion and make it available to you as we update the archive.  A current list organized as a request file is available here.

Thus, if you save the request file you can then use it with the ExtractionPreprocessed feature to download these audit reports.  Right now we are still running the code on a batched basis.  In the near future we will automate this so these reports are available within 15 or so minutes after the source document (usually a 10-K) has been filed.

In the very near future I will make a new post that describes how you can use the built-in indexing engine of our application to build indexes of these documents so you can search them for relevant content.  Here is an image of one of my tests when I was searching for revenue recognition as a critical audit matter.


Again – I will provide more information later – but this feature is now live.



Amazing Research Deserves Amazing Rewards

A while ago I was searching for a way to acknowledge researchers who cite directEDGAR as part of their data collection efforts.  I wanted to do something because I think those citations in the academic journals are a key factor in establishing our legitimacy with the academic market.  So we started a program (that I at least think is neat) where we order ice-cream from a local ice-cream shop and have the cartons customized with the name of the paper and the names of any of the authors who are our clients.  Check out the images below.


The image above is the ice-cream delivered to Professor Lubomir P. Litov, a member of the Finance Department at the University of Oklahoma.  His (co-authored) paper Lead Independent Directors: Good governance or window dressing? was accepted by the Journal of Accounting Literature for their December 2019 issue.

Here is an image of the ice-cream we sent to Professor Matt Ege, Jennifer Glenn and Professor John Robinson, faculty and a PhD student in the Department of Accounting at Texas A&M University.  Their paper Unexpected SEC Resource Constraints and Comment Letter Quality was accepted by Contemporary Accounting Research (CAR) in May, though a publication date has not yet been reported on the CAR website.


One more: Professor Erin Bass, one of my colleagues at the University of Nebraska at Omaha, recently had an ice-cream surprise.  She cited directEDGAR in her paper Top Management Team Diversity, Equality, and Innovation: A Multilevel Investigation of the Health Care Industry, recently accepted for publication by the Journal of Leadership & Organizational Studies.

While this may seem a little strange (who would make ice-cream with an academic research title on the label?), I will observe that one of our authors reported a couple of years ago that it really made their kids want to read the paper with the title.  They (the kids) were bragging to all of their friends that their dad had ice-cream named after him!!  What could be better than that?

A little back story on eCreamery, the amazing company that handles all of the details.  The company was started by two women in Omaha in 2007.  They have been on Shark Tank, and Mr. Buffett (that’s Warren, not Jimmy) has been known to pop in with some of his best buds.  eCreamery is only about 15 blocks from my house, and when we go there we can count on the line being out the door.  Clearly, the best SEC filing search engine customers deserve only the best ice-cream.

So for the fine print.  The decision as to whether or not we send ice-cream as the result of a citation is strictly at our discretion.  There are times we do not send ice-cream – the big one being when none of the authors is employed by one of our clients.  Another is when the authors are at one of our international schools or in Alaska or Hawaii (eCreamery will not guarantee a frozen delivery outside the continental US – go figure).  This program can be stopped at any time.

10-K History – Data Filtering

Whenever I visit clients or respond to emails about data collection I always try to make the point that it is super critical to identify the sample based on strict criteria, to minimize the inevitable chase at the end for missing data and to minimize the processing of the inevitable edge cases.  No matter how structured the disclosure requirements set out in SEC regulations or the Accounting Standards Codification are, it is inevitable that some proportion of SEC filers will get ‘creative’ in the form of their disclosure.  When they get creative, data collection becomes much more tedious, as it becomes necessary to identify the structure of their disclosure before we can sort out how to capture the data.  If we can precisely identify the sample firms before we turn to collecting the data items, we reduce the effort spent chasing the odd forms of the disclosure.

I am helping a client understand how to use our platform to collect a data item that is disclosed only in the 10-K – it is not a required disclosure in any other filing and it is not likely to be disclosed in any other filing (I ran some tests and could not find this item disclosed in the combined millions of filings that are searchable with directEDGAR).  Related to this, I encourage our users to review the regulations (either the SEC disclosure requirements as set out in the Code of Federal Regulations or the Accounting Standards Codification).

So our client is trying to collect a particular data item; their sample was derived from another financial data source.  It may seem a natural presumption that if a company has data available from some other financial data source then there should be a 10-K with this disclosure.

In this particular case there are three problems with the sample from our client.  The first is that some of the sample firms have public data because they have public debt.  While they file a 10-K, they might not include some data items because the disclosure requirements differ by the nature of the laws that establish their filing obligations (ABS issuers versus public-debt-only versus common stock).  So while these companies file a 10-K, they will not have the particular disclosure our client is trying to collect.  The second problem is that the sample firms may not have had a filing obligation at the time they showed up in the sample.  The third problem is that some of their sample are foreign registrants whose filing obligations differ substantially – they have the option to file 20-Fs and 6-Ks rather than the expected 10-K/Qs and 8-Ks (as well as a myriad of other filing differences).

The most common way to determine if a company has publicly traded equity is to look for evidence in one of the other data sources that would normally be used to source data for research.  I suggest that because there is not an easy way, using SEC filings, to determine if a company has publicly traded common stock.  In other words, there is not really an easy way using directEDGAR to establish whether a filer has publicly traded common stock.  For example, I played around with some searches to identify those 10-K filers that are privately held and struggled, because this is not a mandated disclosure.  One search I tried was to search all 10-K filings for the existence, within the first 800 words of the beginning of the document, of registrant or issuer or company within 10 words of the phrase privately held.


Some of the results (LEVI STRAUSS and CINEMARK USA) were exactly what I was looking for – those registrants are (or were) privately held.  However, many of the results were not.  Therefore, if I needed to collect data from companies that had public equity, the best way to define the sample would be to use another tool to determine whether they do have public equity.

The second and third issues that needed to be addressed are whether the company filed 10-Ks (since that is the filing that contains the data we are looking for) in the window needed for this study.  We can use directEDGAR’s 10-K Filing History archive to establish whether a company has filed 10-Ks and for what period.  Our client had a list of approximately 13,000 CIK-YEAR observations, which represented 3,862 unique CIKs.  I used their list of unique CIKs to create a request file to determine the 10-K availability for their sample.  This file helped me in two ways.  First, for some of their sample CIK-YEAR pairs the date they were trying to collect data for was after (or before) the last (or first) date of the 10-K filings.  For example, they needed something from a 10-K filed by CIK 737644 after 1/1/2001.  The problem is that this CIK filed their last 10-K in 1997 (I determined this using the 10-K history file results).


They can use the result file to determine if there is a 10-K filing within the time span in which they need to collect the data.  Even better, the process also creates a file called missing.csv (clever name) that lists the CIKs from the request file for which no 10-K filing was ever filed.  There were 477 CIKs from their original list of 3,862 that had never filed a 10-K.
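The same coverage check is easy to reproduce by hand if you ever need to: compare the CIKs you requested against the CIKs present in the history results and write out the misses.  The output file name matches the one described above, but the CIK sets in this sketch are invented:

```python
import csv

requested = {737644, 320193, 1800}   # CIKs in the request file (invented sets)
returned  = {320193, 1800}           # CIKs present in the 10-K history results

# CIKs that never produced a 10-K filing.
missing = sorted(requested - returned)
with open("missing.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["CIK"])
    writer.writerows([c] for c in missing)

print(missing)   # → [737644]
```

Doing this set difference up front, rather than after data collection, is exactly the point of the post: it tells you which observations can never be filled before you spend any effort on them.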

So while we could not use directEDGAR to establish whether any in their sample did not have publicly traded equity, we could use it to establish whether they filed any 10-Ks, and for what period.  The advantage of doing this work at the beginning is that we can more precisely define the data we should expect to collect.