SpecifiedTables Extraction Feature

We are working on updating our archive of Director Votes data and as I was training our newest intern (Patrick Kealey) on the process I realized I had never posted here about the SpecifiedTables Extraction tool.

Those of you who have used our Table Snipper know that it relies on the existence of uniform but unique language in a table. For example more than 90% of Executive Compensation tables have the words Salary, Year, Position and Total. While there are other tables that have some of those words there are very few other tables that have all of those words. Thus identification and extraction of that table is pretty straightforward.

The results of shareholder votes are reported in 8-K filings with the ITEM_CODE 5.07. So it is relatively easy to identify the actual source documents that summarize the votes. However, the table that describes the results of the votes cast for director election tend to fall into two categories. In our testing we have learned that about 30% of the tables include some form of the word NOMINEE and – in more than 99% of the tables that contain NOMINEE – the table is reporting the results of the election of directors. Here is an example from Pepsico (CIK 77476).

Pepsico Election Results

If the word Nominee is used in the table – the base TableSnipper makes it a trivial exercise to extract those tables. The other 70% are more problematic because they actually do not have any language that defines the column with the names of the candidates. For example, here is the summary Apple reported in 2019

Apple Election Results

We can’t use any of the other column headings to identify this table because every other proposal that was submitted to a vote uses those same column headings as illustrated in this image:

Apple Shareholder Vote Results

Since we can’t use specific language to identify a large number of tables the only alternative is to take a modified approach using the SpecifiedTables feature of the Search Extraction & Normalization Engine. Step 1 is to identify the relevant documents that have the data we need to collect using the Search feature. Step 2 is to do a SummaryExtraction which generates a CSV file that has the metadata about the documents returned from the search. One of the columns in the csv file is the FILENAME – this column has the full path to the actual source document. We generally retain the CIK, CNAME and FILENAME columns and delete all other columns. We add a new column – our practice is to name it DataValue – but it can be named anything you like.

Example Data Collection Worksheet

Remember – the listing in the SummaryExtraction file matches the order of the search results in the application. There is a one-to-one correspondence between the items listed in the SummaryExtraction file and the items listed in the application.

Search Results Listing

Now we need to review the documents and identify one value from each document that is contained in the director votes table. While this might seem tedious – the alternative is not so pretty. The fact that the SummaryExtraction file aligns with the listing of search results makes this task easy to describe to others. The requirement that they only have to capture one value from the table and transcribe it into the csv file makes it very manageable. Some of the search results do not report on director vote results – we leave those rows blank – to delete at the end. Here are the results of the transcription.

Transcription of Data from Search to Summary

Once the transcription is complete – delete all of the rows that report search results that did not have relevant data and save the file. Then use the Extraction\ExtractionPreprocessed tool on the application to select the csv file and specify the column that contains the data value that you want the application to use to identify the relevant table.

SpecifiedTables Feature of Application

When all of the parameters have been specified, select the Okay button. The application will use the FILENAME to access the relevant documents and find and extract the relevant tables based on the values specified in this request like file. If there are multiple tables with the same value specified in the request file – each of the tables will be extracted and labeled uniquely. The labeling follows the same pattern of all artifact labeling we use in directEDGAR to create an audit trail back to the source document. Here is Apple’s director votes table after we completed the process.

Apple’s Director Votes Table

Once the tables have been extracted we can then use the Dehydrator/Rehydration process to normalize the votes. Here is part of the output after Patrick finished normalizing the output:

Apple Director Votes after Dehydration/Rehydration

Obviously you are probably not going to want to invest effort into collecting Director Vote data since we are making that accessible through the platform. (Our data will include PERSON-CIK, TENURE and GENDER!). However – this process is the same process you should use when you are trying to collect data from tables that cannot be uniquely identified by some common words. We are often asked about collecting non-GAAP earning reconciliations. That project is not on our list at the moment but this process would make that very approachable and manageable.