Schedule Support & Tagging Update

I want to reduce the cost/barriers of getting support. To that end we have made a few updates. First, we have an email account that is monitored almost 24-7 because we have some very knowledgeable team members who work in other time zones. If you are stuck and want to see if we can immediately address your problem please send an email to support x@x directedgar dot com. Hopefully my attempt to avoid getting unwanted emails does not confuse you.

Next – how about the ability to schedule some quality time together to either address a specific problem or work on a general strategy for a particular project? I found a tool that allows us to schedule that quality time without too much effort. If you visit this scheduling page you can see my availability and pick a time that works for you. If none of the available times are suitable send me an email directly.

In early March I shared how to access a test index that had additional metadata to enhance your search or to provide more useful context for the results. Our goal was (is) to automate the addition of this metadata to our platform and back fill our older indexes with this data. It has been a process and while I want to get into the weeds with some of the special challenges – boring you is not likely to keep you reading. I will share that for me to be willing to go live with this we established an internal goal. The code had to work on all 10-K filings made in the first quarter of 2013, 2017 and 2021 with no errors. Errors in this case meaning that when an exception occurred – the possible cases for the exception were exhaustively evaluated and if we could not code a resolution then we could label the error in a meaningful way. Further the error cases have to be less than 1/2 of 1% (0.005) of the processed filings.
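The go-live standard described above can be written down as a small check. This is only a sketch: the function name and the counts in the example are made up for illustration and are not figures from our test runs.

```python
def meets_go_live_standard(processed, labeled_errors, unlabeled_errors=0,
                           max_error_rate=0.005):
    """Return True when a test quarter satisfies the internal standard.

    processed        -- number of 10-K filings run through the code
    labeled_errors   -- exceptions we could not resolve but could label meaningfully
    unlabeled_errors -- exceptions with no meaningful label (must be zero)
    """
    if processed <= 0:
        raise ValueError("processed must be positive")
    if unlabeled_errors > 0:
        return False
    # The labeled error cases have to be less than 1/2 of 1% of the filings.
    return labeled_errors / processed < max_error_rate


# Hypothetical counts, for illustration only.
print(meets_go_live_standard(processed=4000, labeled_errors=15))  # 0.375% of filings
print(meets_go_live_standard(processed=4000, labeled_errors=25))  # 0.625% of filings
```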

We finally have code that achieves those standards for Q1 2013 and 2017. We are going to run a test on Q1 2021 in the next several days to confirm that the results hold after a bit more careful error handling. So we are close. I am personally excited about this because I think you should be able to define your search by some of the metadata we are adding to the filings. I have had so many queries about identifying firms with dual classes of stock (see for example the effort described in this paper The Rise of Dual-Class Stock IPOS) – it should be trivial and I think we are going to make it trivial. I have already described that filer status affects disclosure in a number of ways. Size is often used as a proxy but why shouldn’t you be able to directly access filer status since it is the determinant of a registrant’s reporting obligations?

In some ways I am glad this has taken so long because we have had other questions about firm characteristics that we think are worthy to add as metadata. As a result, we have actually been actively collecting some other measures that we are going to include in our new metadata injection.

One critical piece of information is that I determined we cannot safely add some of the additional metadata to 10-K/A. The problem is that registrants are inconsistent about the reference point for the measurement of these values. I have seen registrants report their filer status for the balance sheet date of the financial statements included in the amendment. There have also been cases where their filer status has changed and so even though a scan of the amendment indicates they are following the disclosure regime of their prior filer status they have reported their current filer status on the face of their 10-K/A. There have been cases where registrants report their public float for the end of the second quarter prior to the balance sheet of the included financial statements but there have also been cases where the public float is for a trading date close to the filing date of the amendment. Finally there have also been cases where the public float is pulled from the end of the second quarter of their most recent 10-K and that 10-K was not amended.

There are still a lot more details to share. I will provide a fuller explanation when we move this to production. I am now predicting that we should be adding all of the new metadata to 2021 10-K filings in production in about two weeks. That will give us the insight we need to determine how best to back fill this data to our archives.

Minor Error – Temporary Work-Around

Two days ago a faculty member at Texas A&M reported that they were getting an unexpected error message. They prepared a request file to use the ExtractionPreprocessed feature. If you are not aware – the request files are limited to 20,000 CIK-YEAR pairs. The client reported that they had a request file with 19,999 CIK-YEAR pairs but when they submitted the file the request was blocked and they were getting the dreaded – File Too Large message.

File Too Large Error Message

I asked them to send me the file and I was trying all kinds of tricks to sort out the reason for the error. I failed to ask (or even consider) if they had checked the Include Historical CIKs box. I was focused on analyzing the file and any hidden attributes of the file rather than looking at the problem with a bit more open mind.

Fortunately (for me) Antonis Kartapanis (another TAMU accounting faculty member) was in the email chain and actively paying attention to the conversation. Antonis sent a message suggesting that the issue was caused by the selection of the Include Historical CIKs checkbox. And sure enough – I had not been checking the box, while the TAMU faculty member who was having the problem had been. I didn’t think to ask. Antonis tried with the box checked and then with it unchecked.

As a reminder – when the box is checked the application calls home and adds additional rows to the in-memory version of the file if your request file has a successor or predecessor CIK. For example, suppose you used CapitalIQ to create a sample and Alphabet was in your sample (along with many others). The CIK associated with Alphabet is 1652044. Perhaps you are trying to collect Director Compensation data from 2011 to 2021 and so you have 11 lines in the file relating to CIK 1652044.

Request file with Alphabet’s CIK

Once you have selected the artifact you want to pull the application loads your request file, removes duplicates, and if you have checked the Include Historical CIKs checkbox it reviews your file to determine if you have CIKs that need to be augmented. If any are present it first checks whether the predecessor/successor CIK-YEAR pair is already in the file – if it is not, it extends the file with the new pairs. In the case of the request in the image above the application will extend the file by adding new rows for CIK 1288776. The in-memory version of the file will now have 22 rows of CIK-YEAR pairs.

Extended Request File

Now the application will check the size of the final file. And this was the source of the problem. The augmented file exceeded the 20,000 CIK-YEAR pair limit because of the addition of the predecessor/successor CIK mappings. In a perfect world we would only add the CIK-YEAR pairs that are relevant and remove from the file those that are not. If we were doing that with this file the file would have CIK 1652044 for 2016-2021 and CIK 1288776 for years 2011-2015. (This is on the list but it is a long list).
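To make the difference concrete, here is a rough sketch of the two behaviors using the Alphabet example. It is not the application's code. The dictionary layout and function names are invented for illustration, it only handles the predecessor direction, and the 2016 cutover year is simply taken from the year ranges mentioned above.

```python
MAX_PAIRS = 20_000

# Hypothetical mapping entry: successor CIK -> (predecessor CIK, first successor year).
SUCCESSIONS = {1652044: (1288776, 2016)}


def extend_all(pairs, successions):
    """What happens today: add a predecessor pair for every year requested."""
    extended = list(dict.fromkeys(pairs))  # remove duplicates, keep order
    seen = set(extended)
    for cik, year in list(extended):
        if cik in successions:
            new_pair = (successions[cik][0], year)
            if new_pair not in seen:
                seen.add(new_pair)
                extended.append(new_pair)
    return extended


def extend_relevant(pairs, successions):
    """The 'perfect world' version: keep only the CIK that applies to each year."""
    result, seen = [], set()
    for cik, year in dict.fromkeys(pairs):
        if cik in successions and year < successions[cik][1]:
            cik = successions[cik][0]
        if (cik, year) not in seen:
            seen.add((cik, year))
            result.append((cik, year))
    return result


request = [(1652044, year) for year in range(2011, 2022)]  # 11 rows
print(len(extend_all(request, SUCCESSIONS)))       # rows counted against the limit today
print(len(extend_relevant(request, SUCCESSIONS)))  # rows if only relevant pairs were kept
print(len(extend_all(request, SUCCESSIONS)) <= MAX_PAIRS)
```

With the trimmed approach the in-memory file never grows beyond the size of the original request, so a 19,999 pair file would stay under the limit.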

If you’ve stuck with me so far – I think an easy fix would be to limit your request file to maybe around 18,000 CIK-YEAR pairs per-cycle until we come up with a more elegant solution. I’m so glad Antonis was paying attention. I think I would have beaten my head against the monitor for many more hours before I thought to ask the magic question – are you using the Include Historical CIKs checkbox?
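If you build your request files with Python, splitting a large list into cycles of 18,000 pairs takes only a few lines. A minimal sketch follows. The function name is made up and the CIK values in the example are placeholders, not real registrants.

```python
def split_pairs(pairs, batch_size=18_000):
    """Split a list of (CIK, YEAR) pairs into batches no larger than batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [pairs[i:i + batch_size] for i in range(0, len(pairs), batch_size)]


# Made-up request with 19,999 pairs built from placeholder CIK numbers.
request = [(1_000_000 + n, 2020) for n in range(19_999)]
batches = split_pairs(request)
print([len(batch) for batch in batches])  # [18000, 1999]
```

Each batch can then be saved as its own request file and submitted separately, which leaves headroom for any rows the Include Historical CIKs option adds.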

Executive Compensation – Some Big Numbers – Big Gender Gap

The New York Times had a summary article about the levels of Executive Compensation reported so far this year. Here is a link to the article (NYT Comp Article) by David Gelles. The article mentioned that the observations were based on proxy filings made through April 24, 2021. We had already seen some larger numbers reported in 10-K and 10-K/A filings, but the article mentioned that their data was limited to data collected from proxy filings. Since 4/30 was the deadline for the Part III (or proxy) filing we decided to take a bit of a dive into the compensation data we have processed and share some observations.

Before I start describing some of our findings – I have made an XLSX file available – the link is at the bottom of this post. I would appreciate a citation if you use any of the data included in the file.

Our top 20 is very different from the top 20 reported in the article because we included those whose compensation was reported in a 10-K or 10-K/A filing as well as in filings made by the 4/30 deadline. The highest paid executive in the NYT article, Mr. Richison of Paycom at 211 million, was only 6th on our list of all executives when we include those disclosures made in 10-K filings. Here is our top 20:

Top 20 Earners as Reported in SEC Filings Through April 30 2021

The NYT article referenced above focused on TOTAL compensation. I had already seen some really large bonus numbers – bonuses and salary tend to be the most certain forms of compensation (the amount realized is generally the amount reported) so I decided to dig into our database to identify all individuals who earned a bonus amount equal to or greater than $1,000,000. We identified 549 individuals who met this criterion. The largest bonus was granted to Anthony Hsieh. He was awarded a bonus of 42.5 million dollars. Mr. Hsieh is the CEO of LOANDEPOT. There was very little explanation in their 10-K for the bonus (10-K Link). “The amounts reported in this column reflect special one-time discretionary bonuses. Our board of directors and our CEO participated in the determination of the special bonus allocations.” Two other executives at Loandepot earned bonuses that placed them in the top 10 (Patrick Flanagan and Jeff Dergurahian were each awarded a 12.6 million dollar bonus). Here is a list of the top 20 bonus awards reported so far:

Top 20 Bonus Awards Reported in SEC Filings through 4/30/2021

As I was comparing the two lists above something struck me – there were no women listed in the top 20 of total earned compensation and only 2 made the top 20 of bonuses. So that made me curious and I decided to sort based on GENDER (we include a GENDER field in the data file below).

There are no women in the top 40 of total compensation. The first woman is Ruth Porat at number 45. Ms. Porat is the CFO of Google – she made the list because of a stock grant that was measured at more than 50 million dollars. As a matter of fact there are only two women in the top 100 (Ms. Porat and Carrie Wheeler the CFO of OpenDoor).

Of the 1,048 individuals represented in the set of those earning more than 10 million dollars – only 82 of them are women. However the total amounts earned by women in this data amounted to only 1.36 billion. Men on the other hand earned 22.1 billion. So women represented 7.8% of those earning more than 10 million but their gross earnings represented only 5.8% of the total 23.47 billion earned by this group of executives.
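For anyone who wants to check the arithmetic, the shares above can be reproduced from the counts and totals in the paragraph. The per-person averages at the end are simply derived from those same numbers. They are not separately reported figures, and they ignore the handful of duplicate people in the file.

```python
# Figures taken from the paragraph above (dollar amounts in billions).
total_people = 1048
women = 82
men = total_people - women

women_earned = 1.36
men_earned = 22.1
group_earned = 23.47  # as reported; 1.36 + 22.1 differs slightly due to rounding

share_of_people = 100 * women / total_people
share_of_dollars = 100 * women_earned / group_earned
print(round(share_of_people, 1))   # 7.8
print(round(share_of_dollars, 1))  # 5.8

# Derived averages in millions of dollars per person.
print(round(1000 * women_earned / women, 1))
print(round(1000 * men_earned / men, 1))
```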

Only 60 women earned a bonus of 1 million or more. The average bonus earned by these women was 2.3 million. 487 men earned a bonus greater than 1 million – the average bonus earned by men was 2.8 million (2.7 million if you disregard the eye-popping 42.5 million that was awarded to Mr. Hsieh).

There are some caveats. The data pulled represents all data for either the 2020 or 2021 fiscal year end. So for example a company whose fiscal year ended in February 2021 – if they have reported compensation for 2021 then it was considered to test whether or not it met the 1 or 10 million threshold. If they have not yet reported for their most recent fiscal year then we tested their previous fiscal year that ended in 2020. But it is a bit more complicated than that. Target’s year-end is 2/1, Best Buy’s is 1/30. Target tends to report earlier than Best Buy so we have data for the year ended 2/1/2021 for Target. But Target labels that as 2020 data. We also have data labeled for Best Buy for 2020 but because of the way Best Buy labels their data it is actually for the year-ended 1/30/2020. Based on their historical filing practices I think Best Buy is probably going to report sometime today or tomorrow. The actual date the data was disseminated through the SEC EDGAR platform is in a field labeled RDATE in the file.

There are some duplicate people in the file – yes you can collect two pay checks from different companies in the same year.

Here is a link to the Excel file (directEDGAR Compensation Summary 2020/2021). There are a number of fields that exist for audit purposes. Gender is included as well as the CIK (Central Index Key) of the individuals. We use the CIK of the individuals internally to track them and simplify the matching process across entities.

Finally, as I was working on this I was reminded of the push to introduce more data analytics to the accounting curriculum. We have had some internal discussions about sorting out how to make our data store accessible directly rather than through the application. The notion here is that if you are a business faculty member who needs to help students become more comfortable with using Python and similar technologies this collection of data might be a natural fit to teach students how to use those tools. I have had a preliminary discussion with one faculty member at one of our client schools already. It would be interesting to learn if others have this interest. I had to use SQL statements and do a couple of transformations using dictionaries to find and organize the data to create the form of the data as it is in the spreadsheet. I also had to do some consistency and error checks so there is a lot to muck around with for learning purposes.

All data included in the file was extracted from SEC filings and normalized using directEDGAR’s proprietary platform. We process other types of data as well as offering an amazing search engine with more operators and more ways to filter results than any other on the market. Our search engine is augmented with unique tools that allow users to Extract and Normalize to create the inputs they need for their analytical and research projects.

Enhancing Search Focus with Numeric Ranges

I received an interesting email from a client this morning asking about including a dollar sign in a search to cut down on the noise from a search since they wanted disclosures only when a monetary amount was reported in proximity to the search phrase. Rather than share their search I will describe another similar search. Suppose you want to find the amounts reported as expenditures for research and development. A natural starting point would be to search for research and~ development. Note the ~ appended to a search operator causes the search engine to treat the word as a term not an operator. A search for research and~ development returns the phrase research and development. The search phrase research and development though returns any document with both words (no proximity constraint).

The problem (as indicated in the email) with the search research and~ development is that the phrase could exist in many places in a document without a disclosure of the amounts.

For example Apple used the phrase nine times in their 2017 10-K – most of the hits were noise – The Company believes ongoing investment in research and development (“R&D”), marketing and advertising is critical to the development and sale of innovative products, services and technologies.

If we add a number range constraint to the search we can significantly reduce the noise. Our application does not index dots, dashes, commas, dollar signs . . . but we do index number groups. We can search for ranges of numbers by inserting the lower and upper bounds of the range separated by 2 ~ symbols.

To achieve the goal of identifying disclosures that might describe the amount of expenditures for research and development I proposed this search (research and~ development) w/10 1~~999. Clearly this search will take longer because the search engine is going to have to inspect every instance of the R&D phrase for proximity to any number in the range 1 to 999. But it will significantly reduce the noise from the first search. The first search yielded 19,391 documents when applied to the 10-K filings filed in 2016-2020. The second search returned only 13,466 documents. The noise is of course not completely eliminated but it is greatly reduced.

To use a number range in your search remember that dollar signs, dots and commas are not indexed. However the digits are. So a search for a number in the range of $1 to $999,999,999,999 can be reduced to 1~~999. If a number is found that meets the criteria – only the digits to the left of the decimal or comma will be highlighted (but they can be extracted with the Context Extraction feature).
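If it helps to see why 1~~999 is enough, here is a small Python imitation of the idea. It is not how our search engine is implemented. The tokenizer, the function names, and the way the 10 word window is counted are all assumptions made for illustration, and the sample sentence with a dollar amount is made up.

```python
import re

# Keep runs of letters and runs of digits; punctuation and $ are dropped,
# so "$1,234,567.89" becomes the number groups 1, 234, 567 and 89.
TOKEN_RE = re.compile(r"[A-Za-z]+|\d+")


def tokens(text):
    return [t.lower() for t in TOKEN_RE.findall(text)]


def phrase_near_number(text, phrase, low, high, window=10):
    """True if any number group in [low, high] is within `window` tokens of phrase."""
    words = tokens(text)
    target = phrase.lower().split()
    n = len(target)
    for i in range(len(words) - n + 1):
        if words[i:i + n] != target:
            continue
        nearby = words[max(0, i - window):i] + words[i + n:i + n + window]
        if any(w.isdigit() and low <= int(w) <= high for w in nearby):
            return True
    return False


print(tokens("$1,234,567.89"))

noise = ("The Company believes ongoing investment in research and development, "
         "marketing and advertising is critical.")
amount = "Research and development expense was $1,234,567 for the year."  # made-up sentence

print(phrase_near_number(noise, "research and development", 1, 999))
print(phrase_near_number(amount, "research and development", 1, 999))
```

Because every dollar amount breaks into groups of at most three digits, at least one group of any nonzero amount falls between 1 and 999, which is why the upper bound never needs to be larger.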

directEDGAR & Python – Great Complements! Matching Comment Letters to Subject 10-K Filings

I received an interesting email from a client this week. They are trying to match comment letters (SEC Form UPLOAD) to the subject 10-K filing. The timing was perfect because I wanted to find a unique use case for a presentation at the University of Nebraska at Omaha to highlight how directEDGAR’s feature set can really accelerate data collection – especially if you are handy with coding in Python. I made a claim in the announcement for the presentation that with directEDGAR I could accomplish something probably ten times faster than relying just on Python.

The nice thing about comment letters is that they generally identify the filing in the subject area of the letter. I decided to begin this work by running a search over all comment letters for FORM 10 K* or FORM 10K*. I used the 10 K* to make sure I captured amendments and 10-KT. The FORM 10K* search was to capture any 10KSB, 10-KT and 10K405 filings. To make sure the search was focused on 10-K forms and variants as the subject of the comment letter I restricted the search to those where the references were within the first 200 words of the first word of the document. My final search was xfirstword pre/200 ((form 10 K*) or (form 10K*)) . This search identified 79,213 comment letters.

Search Results Displaying Comment Letters Relating to 10-K filings

You can see in the image above I found 79,213 comment letters in my older copy of our filings in one minute and 15 seconds. I observe in those comment letters that the FORM reference is immediately followed by the date the filing was made. I want to capture that and I don’t want to spend a chunk of time to do this. One of the features of our application is that we create a text (txt) version of every document and load it into the index. This is good because most of the comment letters are pdf files. I do not need to mess with a library in Python to extract the text – I can just use the EXTRACTION\DocumentExtraction TextOnly feature of the application to dump a well named text only version of the letter into a folder on my computer.

That took about fifteen minutes – at the end of the fifteen minutes I had a text version of the 79K+ documents available on my desktop ready for use with Python. Below is an image of the directory – each file is named using our CIK-RDATE-CDATE-stuff convention so that is useful for making sure I have an audit trail back to the comment letter.

Text Version of Comment Letters Ready for Processing

I want to scan some of these to make sure I understand their structure. Thus I used the SmartBrowser to quickly review the output.

Reviewing the Text Version of the Comment Letters Using the SmartBrowser

Now I am ready to write some code to parse out the specific filing type and the filed date from the filing. I am fairly new to Python 3.9 – we anchored on Python 2 many moons ago. Since I am expecting most of our users are using one of the more modern versions of Python I wrote the code in 3.9.0. My general strategy is to read the first 30 lines of the text files as a list – find the line that begins with the word FORM – confirm it has 10-K (actually I am going to look for FORM 10). I will then inspect the line to see if it begins with the word FILED. If I find the word FILED I will split and try to convert the remainder of the line into a date. I am also going to convert the date in the line that has FORM 10-K into a date object as well – this is our CDATE or better known as the balance sheet date. Below is an image of the lines I processed for this particular filing. I spent approximately 30 minutes on the code – and it is not perfect but my goal in this experiment is to demonstrate how our platform with your skills can accelerate your work.

Parsing Example from Reading Comment Letters

Notice – I am trying to avoid going into the weeds here. If you get a chance to replicate what I am doing you will see why I made the decisions I did to identify the relevant dates as well as allow for the possibility that the RDATE could be on the next line after the line that described the FORM or the second line after.
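For readers who want a starting point, here is a compact sketch of the strategy described above. It is not the code I ran. The function names, the regular expressions, the assumption that dates look like "December 31, 2009", and the sample letter lines are all illustrative.

```python
import re
from datetime import datetime

DATE_RE = re.compile(r"[A-Za-z]+ \d{1,2}, ?\d{4}")
FORM_RE = re.compile(r"FORM\s+(10-?K[A-Z0-9/]*)", re.IGNORECASE)


def find_date(text):
    """Return the first 'Month D, YYYY' date in text as a date object, else None."""
    for match in DATE_RE.finditer(text):
        cleaned = re.sub(r",\s*", ", ", match.group(0))
        for fmt in ("%B %d, %Y", "%b %d, %Y"):
            try:
                return datetime.strptime(cleaned, fmt).date()
            except ValueError:
                pass
    return None


def parse_letter_header(lines, max_lines=30):
    """Find the FORM 10 line near the top of a letter and pull out CDATE and RDATE."""
    head = [line.strip() for line in lines[:max_lines]]
    for i, line in enumerate(head):
        if not line.upper().startswith("FORM 10"):
            continue
        form = FORM_RE.match(line)
        cdate = find_date(line)  # balance sheet date on the FORM line
        rdate = None
        # FILED may be on the FORM line itself or on one of the next two lines.
        for candidate in head[i:i + 3]:
            pos = candidate.upper().find("FILED")
            if pos != -1:
                rdate = find_date(candidate[pos:])
                if rdate:
                    break
        return {
            "form": form.group(1).upper() if form else None,
            "cdate": cdate,
            "rdate": rdate,
            # Letters are typed by humans, so flag dates that cannot be right.
            "plausible": bool(cdate and rdate and cdate < rdate),
        }
    return None


# Made-up header lines in the general shape of a comment letter.
sample = [
    "Example Company, Inc.",
    "Re:  Example Company, Inc.",
    "Form 10-K for the fiscal year ended December 31, 2009",
    "Filed February 26, 2010",
]
print(parse_letter_header(sample))
```

In practice you would call `parse_letter_header` on `open(path).readlines()` for each text file in the extraction folder and set aside anything that returns `None` or is not plausible for manual review.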

I want to observe – these documents were created by humans – there are errors – some of the dates are outside the bounds of what would be expected:

Typo in Comment Letter

Now that I have the balance sheet date (CDATE) and the dissemination date I am ready to access the actual filings. We have a CIK-DATE search feature on our application that will limit the results to documents/filings made by a unique list of CIK-DATE pairs – we can set a window around the date parameter if we choose. Since the SEC’s filing date may deviate in some cases from the dissemination date I am going to set a five day window.
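The idea behind the window can be sketched in a few lines. This is not how the application filters. I am assuming here that the window means five days on either side of the target date, and the CIK, dates, and filing labels are made up.

```python
from datetime import date


def within_window(target, candidate, days=5):
    """True when candidate falls within `days` of target in either direction."""
    return abs((candidate - target).days) <= days


def match_filings(pairs, filings, days=5):
    """pairs: (cik, date) tuples; filings: (cik, filing_date, label) tuples."""
    wanted = {}
    for cik, target in set(pairs):  # unique CIK-DATE pairs
        wanted.setdefault(cik, []).append(target)
    return [
        filing for filing in filings
        if any(within_window(t, filing[1], days) for t in wanted.get(filing[0], []))
    ]


# Hypothetical data: one CIK-DATE pair parsed from a letter, three filings on file.
pairs = [(1234567, date(2010, 2, 26))]
filings = [
    (1234567, date(2010, 3, 1), "10-K filed three days after the letter's date"),
    (1234567, date(2011, 2, 25), "10-K for the following year"),
    (7654321, date(2010, 2, 26), "another registrant's 10-K"),
]
for filing in match_filings(pairs, filings):
    print(filing)
```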

Setting Application Parameters to Find Specific 10-Ks referenced in Comment Letters.

The search/filtering process on this type of search can take a bit longer since we have to filter on CIK and date – but we are way ahead – in the test I just ran it took 12 minutes. So roughly a bit less than two hours invested. I am not going to claim that this is the end point. There were some exceptions I could clean up – but my goal was to demonstrate the possibility. In summary I spent less than two hours to go from an idea to a result. You can see in the image below – I can ‘touch’ the filing mentioned in that comment letter above (as well as all of Abbott’s other 10-K filings that were the subject of a comment letter).

I am going to share the code here. I am brand new to 3.9 and some of the conventions are different from Python 2 so if you are an expert 3.9 coder and want to improve the code – share your results. Otherwise – for those of you just starting out and want to play with the interaction of directEDGAR and Python this seems like a worthwhile place to muck around. My goal with this exercise was to highlight another way our platform can accelerate your data collection if you just think outside the box.

Rough Code to Parse Comment Letters

Update to Historic CIK Mapping File

This post is a bit wonkish – but update instructions are near the end of this post. We made an update to our CIK mapping file late last week. This is the file the application uses to retrieve filings by companies that have completed some reorganization that triggers filings under a new CIK. Our file maps between the new and the old CIK so if you are trying to match data based on an old CIK but the registrant is filing under a new CIK the application will retrieve the data you requested even if you are using the new (or old) CIK.

For example – the entity now known as The Walt Disney Company files under the current CIK 1744489. Prior to their acquisition/merger with subsidiaries of The Fox Corporation they (Disney) made filings under CIK 1001039 (from late 1995 until late 2019). Prior to 1995 they filed under CIK 29082.

Our mapping file was developed to anticipate you collecting some data from another source that might have one or more of these CIK as the key and you wanting to match that data to some data you would anticipate finding in their filings. If you use the CIK filtering feature or if you retrieve any of our preprocessed data based on CIK – the application interface will have a box to check asking whether you want to use historical CIKs in your search/retrieval. The image below shows the check box to select if you were to run a search and included a CIK filtering file.

Include Historical CIKs Check Box

It is your option to determine if you want only data for your CIK file or if you want us to augment your CIK list with the values from our mapping file. If you select the Include Historical CIKs option the application will augment your CIK list with all of the additional CIKs that have been associated with the entity. So for example, if you have CIK 1744489 in your sample the application will automatically add CIK 29082 and CIK 1001039 to the in-memory version of your list as it processes the list for the task. If you have a request file that has CIK 1744489 and the YEAR values of 2021, 2020, 2019 and 2018 – the application will extend your list to include each of the additional CIKs and the YEAR values you identified for your original CIK list. To make this clear – the image below has the original request in black and the extended list as determined by the application in red.
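In code terms the augmentation amounts to something like the sketch below. The real mapping file has its own format and the application does this internally, so the data structure and function names here are purely illustrative. The CIKs are the Disney values from the example.

```python
# One "family" of CIKs per entity (illustrative layout, not the real mapping file).
CIK_FAMILIES = [{1744489, 1001039, 29082}]


def build_lookup(families):
    """Map every CIK to the full set of CIKs associated with the same entity."""
    lookup = {}
    for family in families:
        for cik in family:
            lookup[cik] = family
    return lookup


def augment(pairs, families):
    """Return (original pairs, pairs added from the mapping) without duplicates."""
    lookup = build_lookup(families)
    original = list(dict.fromkeys(pairs))
    seen = set(original)
    added = []
    for cik, year in original:
        for related in sorted(lookup.get(cik, ())):
            if (related, year) not in seen:
                seen.add((related, year))
                added.append((related, year))
    return original, added


request = [(1744489, year) for year in (2021, 2020, 2019, 2018)]
original, added = augment(request, CIK_FAMILIES)
print(len(original), "rows requested")
print(len(added), "rows added in memory")
print(added[:2])
```

Because the lookup is keyed on every CIK in the family, the same rows come back whether your source data used the current CIK or one of the older ones.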

CIK/YEAR request augmented by the application

The values in red are not added to your version of the list – but they are used by the application. However, any missing values (in red or black) will be reported in the missing_cik_year_pairs.csv file at the end of the search/extraction. Sorry for getting lost in the details – but they are important. The real reason for this post is to make sure you remember to periodically update the mapping file on your version of directEDGAR and since we just updated the file this is a perfect time for you to update yours – the process is simple.

From the File menu selection select Options and then select Update Historical CIK. Press the Perform Update button.

Options panel – Update Historical CIK

The application will call home, license validity will be established and our server will return a copy of the latest mapping file which will be saved for use by the application. When the process is complete (usually a second or two) a confirmation message will appear.

Update Successful response message

We are still working/struggling to communicate with you about the results – which current CIKs mapped to an older CIK so you are not fully surprised by the fact that you asked for say the MDA from CIK 1744489 from 2018 but instead you received the MDA from CIK 1001039. The challenge is how to do this in real time – for instance APA (CIK 1841666) became the successor registrant to Apache Corp (CIK 6769) on March 1, 2021. While adding the CIK matching to the mapping file is trivial, it is much more complicated to go back and embed a new CIK in all of the prior documents.

Interesting Issue – We need to start thinking about our GENDER indicator.

I was introducing a new intern to one of our internal tools that we use for data processing. We have a number of dashboards that are populated with data when there is a missing value or if the system populates a field with an unexpected value. One of those dashboards is triggered when we are processing compensation data. If a new director or executive has not been included in any prior filing we are not likely to have a GENDER value. In this case the data shows up in a queue for someone to review and code.

When one of the team has to review a filing to identify the right code to use we provide them a link to the document in a dashboard with a place to enter the value they determine is correct. We have developed some proprietary tools to scan the document to identify the use of some specific person titles (Mr, Mrs, Ms) and the names that follow those title words. In addition – the parts of the sentences that contain personal pronouns are included in the dashboard with the referent names. If there are multiple titles or pronouns associated with one name (MR. KEALEY is the husband of MRS. KEALEY) these are flagged.
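The title scan and the flagging rule can be illustrated with a toy version. Our internal tools are proprietary and do considerably more than this, so treat the regular expression, the names, and the M/F codes below as assumptions made for the sketch. The sample text uses made-up people apart from the Kealey sentence quoted above.

```python
import re
from collections import defaultdict

# Title (any capitalization) followed by a capitalized surname.
TITLE_RE = re.compile(r"\b((?i:Mrs|Mr|Ms))\.?\s+([A-Z][A-Za-z'-]+)")
TITLE_CODE = {"mr": "M", "mrs": "F", "ms": "F"}


def scan_titles(text):
    """Collect the set of codes implied by the titles used with each surname."""
    found = defaultdict(set)
    for title, surname in TITLE_RE.findall(text):
        found[surname.upper()].add(TITLE_CODE[title.lower()])
    return dict(found)


def needs_review(text):
    """Surnames that appear with conflicting titles and so go to a person."""
    return {name: codes for name, codes in scan_titles(text).items() if len(codes) > 1}


sample = ("MR. KEALEY is the husband of MRS. KEALEY. "
          "Ms. Rivera joined the board in 2015. Mr. Chen retired.")
print(scan_titles(sample))
print(needs_review(sample))
```

A name with no title at all never shows up in this scan, which is the case that ends up in the manual queue.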

My intern wondered if we were making a mistake by limiting our search to those honorifics/titles. He became even more concerned when I explained the process we followed when we were not able to identify the GENDER using the dashboard. In those cases (when there is no gender-related title or gender-explicit personal pronoun (he, she, her, him, his) associated with a person in a filing – yes it happens more than you would believe) we Google for a reference or image of the person. We make a determination based on a search result that contains the name of the person and the name of the entity that made the filing. Historically we have really thought it made our job easier if one of the search results has a picture of the person (presumably we did not find a picture in the proxy).

If you haven’t sorted out the problem yet, his question was – is this appropriate in a world where people are willing/prefer to identify as something other than the binary classification we use?

We try to add an indicator of GENDER to allow our research clients to test hypotheses on the association between various dimensions of firms and the characteristics of their executives and board. If we don’t sort out how to identify those executives and directors who believe that the binary classification of their personage does not reflect some dimension of their identity we are failing to provide the right data. That was an awkward sentence but it reflects the truth and the problem.

While this is important I am stymied at the moment about how to move forward. I did a search in proxy (DEF 14A) filings made since 2005 for words that I thought would help identify these cases (gay, queer, lesbian, non-binary, LGBTQ). This seemed like a reasonable start to me – however a recent article in the New York Times made me less than confident that I had enough knowledge to create the right search (more on that later).

My first search was limited to filings made from 2005 to 2015. I found only 324 documents from 169 companies. There were only 135 documents from 69 companies where the context indicated that the finding was something other than about a Mr. or Mrs. Gay or an address (Gay Avenue, Gay St.). What was even more interesting was that there was no mention of directors with one or more of these attributes. In all of the filings where these words existed the context was generally related to a statement about the registrant’s view of human rights and the value they place on a diverse workforce. The only other context for these words was when the words were included in shareholder proposals. I found no language used to indicate a person might prefer a non-binary classification.

However, when we move forward to the filings made in 2016 – 2020 some filers started indicating that some of their directors were LGBTQ. Usually this was disclosed in a Diversity matrix. One interesting thing about this disclosure is that it is not consistent across filers, even if they have a common board member. Specifically I have found examples where one board has a category to indicate which members of the board identify as members of the LGBTQ community – another filer with the same board member does not provide any indication about this characteristic of their board members. I will also observe that 2 registrants began providing a diversity matrix in 2019 that includes a non-binary classification related to gender. So far in 2021 (through 3/25) there are 4 registrants that have included this dimension in their diversity matrix. Despite the addition of this dimension to these matrices there is no indication of a director using this classification in any of the examples I could find.

This is something we are going to have to think about. Note – the only information I have found so far is just an indication that some board member identifies as LGBTQ. I have not yet identified information that would indicate a board member would classify themselves as other than F/M.

In an earlier paragraph I observed that I ran a search using words that I consider relevant. There was a really interesting New York Times article in the paper on 3/21 (the article first appeared online on 3/15), Who is making sure the A.I. Machines Aren’t Racist. I think I have read this article four times now. It is relevant to this problem – how can we authoritatively use language to classify a person who identifies as non-binary? This is kind of tricky. I keep telling myself that our first step is to focus on the personal pronouns used in text that describes the person. We can code for that and bring it up for review whenever there is more than one gender indicator or there are none that we are currently familiar with. The article though reminds me that we can get this wrong and need to be humble about the steps we intend to follow.
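To make the pronoun step concrete, here is a minimal Python sketch of the review rule described above. It is not our production code. The pronoun lists, the function name classify_by_pronouns and the return shape are all assumptions made for illustration, and the lists are deliberately short. The point is only the routing rule: exactly one indicator gets a label, anything else goes to a person.

```python
import re

# Hypothetical pronoun groups for illustration only; not an exhaustive list.
PRONOUN_GROUPS = {
    "F": {"she", "her", "hers", "herself"},
    "M": {"he", "him", "his", "himself"},
}


def classify_by_pronouns(text):
    """Return (label, needs_review) for a passage describing one person.

    A label is assigned only when exactly one pronoun group appears.
    Zero groups or more than one group means no guess is made and the
    passage is flagged for manual review.
    """
    # Tokenize into whole lowercase words so "the" never matches "he".
    words = set(re.findall(r"[a-z]+", text.lower()))
    found = sorted(
        code for code, pronouns in PRONOUN_GROUPS.items() if words & pronouns
    )
    if len(found) == 1:
        return found[0], False
    return None, True


print(classify_by_pronouns("She has chaired the audit committee since 2015."))
print(classify_by_pronouns("Pat Smith has served on the board since 2019."))
```

The second call contains no indicator we recognize, so it comes back flagged for review rather than guessed at.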

My current plan is to anticipate more nuanced disclosures about this attribute of board members and executives. Once we start finding evidence that is relatively unambiguous about diversity I think I am going to try to reach out and communicate with the filer to confirm our interpretation of their language. If we receive some positive confirmation then we will begin expanding the values used in the GENDER field.

I thought about putting up some example images of the disclosures we have found. I have decided that the people making these disclosures are intentionally making them to readers of the proxy – they are not necessarily making an announcement to the world. Thus I have decided not to provide examples at least until I can have a conversation with one or more of the directors who have made/supported the disclosure choices made by their companies.

A related problem is whether or not we propagate a disclosure choice made by one director to all of the registrants they are associated with. Imagine director FIRST_NAME LAST_NAME is identified as a member of the LGBTQ community and prefers the use of XIE as disclosed or apparent in proxy filings by FIRST_COMPANY in 2021. They are also a director of SECOND_COMPANY in 2018 – 2021 but the proxies for SECOND_COMPANY make no mention of this attribute. In the proxies for SECOND_COMPANY all references to FIRST_NAME LAST_NAME are made using either male or female pronouns. Do we change the GENDER from M/F in the data for SECOND_COMPANY or do we honor the disclosure choices they made in SECOND_COMPANY filings?

Bottom line is – this is now on our radar. I will indicate once we start needing to expand the values we report in the GENDER field. Right now I can’t imagine pointing to specific entities or people or even the documents where we drew the data to make our inferences. However, I am always happy to engage with you and take feedback on the classification decisions we ultimately make.

Missing PERSON-CIK and related metadata

During our normalization processes for Director Compensation we try to match the named directors with existing data to add the PERSON-CIK, SEC-NAME, AGE and SINCE (a measure of tenure). If we cannot match the data using our automatic processes the data table is shunted aside so a person can review and attempt to match manually. If we are not successful after reviewing the filing, prior filings and other source documents then we add some signal to the field to indicate the values we could not identify.

Certara Inc (CIK 1827090) filed their first 10-K on March 15, 2021. This filing included DC and EC data. We are missing values for some of the people listed in their DC table. Certara filed a draft registration statement in October 2020 and an S-1 in November. We have learned that many pre-IPO directors do not follow the registrant into the public market. In Certara’s case two directors who received compensation in 2020 resigned before the filing of their initial S-1, so there is limited information about these directors, other than their names, in the S-1 filing and in the initial 10-K. If directors resign before the S-1 filing and their holdings are less than 5% they have no reporting obligation under Section 16. In these cases we try to discover if the individuals had some reporting obligation because of their relationship with another entity. This requires more than just a name match though.

In the case of Certara we can identify data for William E Klitgaard since he has a history on EDGAR. He has served as a director of Syneos Health since 2017. In his biography Syneos makes an explicit mention of his position on Certara’s board: “Mr. Klitgaard currently serves on the board of directors and Audit Committee at Certara, a leading drug development consultancy . . .” This explicit mention will allow us to update the DC data of Certara with Mr. Klitgaard’s CIK.

However, Certara’s other director who resigned before their IPO was Edmundo Muniz. Their S-1 filing indicates that Dr. Muniz was their CEO from 2014 until early 2020. There is not enough additional biographical information about him to allow us to link him to another SEC registrant. In cases like this we search EDGAR for his name and then try to find some additional evidence to link him to the reporting company (Certara). But in this case there is no Muniz in EDGAR whose profile comes close to matching the profile that we would expect for Dr. Muniz. Therefore – in this case we will be able to add his GENDER and his tenure (SINCE). We will not be able to add values for PERSON-CIK, SEC-NAME or AGE.

Periodically we will inspect EDGAR for cases like Dr. Muniz. Specifically we will look for name matches on EDGAR and if we find one we will then review the filings for the registrant to try to determine if the Dr. Muniz associated with that registrant is the same as the Dr. Muniz mentioned in Certara’s filings. It is much easier when they make an explicit mention of the other company (Certara in this case) in the biography of Dr. Muniz. Because he was not a director when Certara was a public company they are not obligated to make mention of that fact. If there is a close name match and the other company operates in a related industry we will use Google to try to determine if the individuals are the same or not. Until we can do that though the fields will continue to indicate that data is not available.

There are cases where a director might be affiliated with one or more public companies for long periods of time but never trigger a Section 16 filing obligation. We see this most often with banks and then foreign entities that are cross-listed into the US. I’ve been meaning to research the legal basis for the limited reporting by foreign executives and board members to learn why they are exempt. I’ll save that for another time. We also see it when the director is the board representative/proxy for some security holder. In that case the filings usually indicate that all compensation is passed through to the named director’s employer (the security holder).

New MetaData Testing – Available Now

With our filings now hosted in the cloud we can add new metadata to filings without having to ship out 6.0 TB hard drives to our clients. We are running a test right now with the addition of a number of new fields. This initial test is limited to 10-K filings made between 1/1/2016 and 12/31/2020 that are not amended. This test index is now available (10KTAGTEST Y2016-Y2020).

The fields we are adding in this initial test are described in the schedule below:

Field Name | Description
ACCEPTANCE | DATE/TIME of EDGAR system recording the ACCEPTANCE (but not the dissemination) of the filing. This value is in the form YYYYMMDDHHMMSS where hour is based on a 24 hour time representation. You will note that we historically have used the dissemination date so any filings made after YYYYMMDD1730 have an associated RDATE one business day later.
FILERCATEGORY | The entity reported filer category as defined by the SEC (see table below for codes).
COMMONSTOCKSHARESOUTSTANDING | The number of common shares outstanding as of the latest practical date. This is the value reported on the cover sheet and is reported in shares.
COMMONSTOCK_DATE | The date as reported by the registrant that the number of common shares were calculated.
PUBLICFLOAT | The label perhaps is not precise. This is the market value of all shares held by non-affiliates as of the last business day of the registrant’s second quarter.
FLOAT_DATE | This is the date the public float was measured.
ICFRAUDIT | An indication as to whether or not the auditor of the financial statements also issued an opinion on the internal controls over financial reporting. The values are TRUE/FALSE. This flag is initially only available for registrants who adopted the new 10-K form in 2020. While compliance was required for all 10-K filings made after 4/27/2020 we have noticed a number of registrants who did not conform to the requirements. This tag will initially be included only in filings where the registrant met their reporting obligation though we expect to backfill once our testing is complete.

Table Describing New Metadata Tags added to 10-K filings
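If you want to work with the ACCEPTANCE value in your own code, the sketch below shows one way to parse the YYYYMMDDHHMMSS form and estimate the RDATE using the 17:30 rule from the schedule. This is an illustration rather than our processing code. The function names are made up for this example, and the business-day logic skips weekends only. It assumes no market holidays and does not handle a filing accepted on a weekend.

```python
from datetime import datetime, time, timedelta

# Per the ACCEPTANCE description: filings accepted after 17:30 carry an
# RDATE one business day later.
CUTOFF = time(17, 30)


def parse_acceptance(value):
    """Parse an ACCEPTANCE value in YYYYMMDDHHMMSS form (24 hour clock)."""
    return datetime.strptime(value, "%Y%m%d%H%M%S")


def next_business_day(day):
    """Return the next weekday after day (assumption: holidays ignored)."""
    nxt = day + timedelta(days=1)
    while nxt.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        nxt += timedelta(days=1)
    return nxt


def estimated_rdate(value):
    """Estimate the dissemination date (RDATE) from an ACCEPTANCE value."""
    accepted = parse_acceptance(value)
    if accepted.time() > CUTOFF:
        return next_business_day(accepted.date())
    return accepted.date()


# A filing accepted Friday 3/6/2020 at 17:45 rolls to Monday 3/9/2020.
print(estimated_rdate("20200306174500").isoformat())
```

Comparing the estimate against the RDATE already attached to a filing is a quick way to see which filings came in after the cutoff.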

The values for FILERCATEGORY and their meaning are:

CODE | MEANING
LAF | Large Accelerated Filer
SRC | Smaller Reporting Company
AF | Accelerated Filer
NAF | Non-Accelerated Filer
SRAF | Smaller Reporting Accelerated Filer

Explanation of Codes used in NEW FILERCATEGORY Tag

Having these codes embedded in the filings now allows you to use them in your searches when appropriate. For example – suppose you want to find all 10-K filings made by Accelerated Filers who reported that they included an ICFR audit from their auditor. To include metadata in a search we use the fields button on the application to select the appropriate field. We need to specify two fields for the search described above. First, we want to select the FILERCATEGORY field and specify AF for Accelerated Filer.

Selecting a field to use in a search.

Once we type in AF for Accelerated Filer and hit the OK button the search pane in the application will populate with (FILERCATEGORY contains(AF)). To also limit the search to those filings that included an ICFR audit indication on the face of their 10-K we need to add the AND operator. Then use the fields button to select the ICFRAUDIT field and enter the word TRUE (valid values for this field are TRUE/FALSE).

Selecting the ICFRAUDIT field to use in a search.


After entering TRUE and then hitting the OK button our search is now (FILERCATEGORY contains(AF)) and (ICFRAUDIT contains(TRUE)). So we are looking for any 10-K filing made by an accelerated filer that indicated their auditor performed an audit of internal controls.
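If you keep your searches in a script or a notes file rather than building them through the fields button each time, the search text can be assembled the same way. The Python sketch below only formats the string shown above so it can be pasted into the search pane. It does not call the application, and the helper names field_clause and build_search are made up for this illustration.

```python
# Codes taken from the FILERCATEGORY table above.
FILER_CATEGORY_CODES = {"LAF", "SRC", "AF", "NAF", "SRAF"}


def field_clause(field, value):
    """Format one clause the way the search pane displays it."""
    return f"({field} contains({value}))"


def build_search(filer_category, icfr_audit):
    """Combine a FILERCATEGORY clause and an ICFRAUDIT clause with 'and'."""
    if filer_category not in FILER_CATEGORY_CODES:
        raise ValueError(f"unknown FILERCATEGORY code: {filer_category}")
    flag = "TRUE" if icfr_audit else "FALSE"
    clauses = [
        field_clause("FILERCATEGORY", filer_category),
        field_clause("ICFRAUDIT", flag),
    ]
    return " and ".join(clauses)


print(build_search("AF", True))
```

Checking the code against the table before building the string catches a mistyped category before it produces an empty result set.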

Whenever you perform a search and create a SummaryExtraction or a ContextExtraction the application always includes all of the available metadata from the filings that were returned in your search. Therefore you don’t have to specify the metadata in the search – all of the available metadata is included as columns in the output file. My search returned 16 documents. I created a SummaryExtraction from this search. If you want to review those results follow this link (Example SummaryExtraction)

Our processor code for the filings has historically added the same metadata to every document that is included in a filing. So all of the exhibits associated with a 10-K filing had the same tags as the 10-K filing. In this initial test we are only tagging the parent document (the 10-K) with the additional metadata. The exhibits will continue to include the existing metadata. For testing purposes the only documents in the 10KTAGTEST index folder are 10-K filings (no amendments or exhibits). The original 10-K and exhibits are still in the usual place. Your feedback will help us determine whether we should add additional details to the individual exhibits.

To access this index from your instance – follow the steps described in the help to update your index libraries (File\Options, Index Library, Generate Index Library). Remember – you don’t have to do any browsing – when the Main Index Options is visible the Main Index Library Folder path should be visible – hit the Generate Library button.

Index Library Selection Tool – once visible – confirm the path exists in the Main Index Library Folder hit Generate Library

Unfortunately there is a catch. We have 3,300 or so 10-Ks filed in this window for which we cannot yet authoritatively confirm one or another item of metadata. This seems to be mostly related to those cases where the registrant has two or more classes of common stock. Our challenge is matching the name of the common stock class with the shares outstanding. Right now we have identified some cases where we are matching the wrong name to the count. This is a critical issue and one we are still working on. This problem is why we are still running in test mode rather than in production. I am hoping we do not have to resort to manual matching. For these 10-Ks we have decided not to add metadata for the attributes we cannot confirm yet. So you will see cases where the public float is reported but there is no data about the shares outstanding.

Finally – please remember that these values are reported by the registrant. There are bound to be registrant errors. This is a new area for us and our immediate focus is on capturing the values as reported. I have a very significant degree of confidence that when the registrant reports it is an ACCELERATED FILER we have captured that correctly. Whether or not they are is another issue. Also remember that the meaning of some metadata values will change across time. The authoritative source for an explanation of any specific terms/fields is always the SEC’s rule making disclosures.

Your feedback on additional metadata to include will be appreciated. As I was working on this I had a conversation with a client who expressed an interest in having a hyperlink to the document on EDGAR. This is the second time I have heard that this could be beneficial so we will add this field in the next iteration (less than a month I hope). So if you think some other fields would be useful please let us know.

Additional Proxy Forms Available

We have made five additional proxy type filings available through the platform. The specific filing types are DFAN14A, DEFC14A, DEFC14C, DEFM14A, DEFM14C. In some cases these filings are made by investor groups to communicate with the existing shareholders about their activities. Because these filings might be made by these third parties we have added the INVESTOR_CIK and INVESTOR_NAME tags to the filings.

The addition of these tags provides an opportunity to close the loop between an investor taking a position that triggers a SC 13D filing obligation and efforts by the investor to exert influence over the management/strategy/board of the issuer.

A brute force matching of these is not difficult. Step 1 would require that you search the SC 13D archive using xfirstword and (DOCTYPE contains(SC*)). This search identifies all SC 13 filings and amendments and leaves out the exhibits. The exhibits are not needed to identify the issuer/investor relationship. Use the SummaryExtraction feature to pull a listing of the filings – this listing will include all of the metadata attached to each filing.

Create a new column in the file (I named my new column ISSUER_INVESTOR) and use Excel’s CONCATENATE function to concatenate the value in the CIK column with the value in the INVESTOR_CIK column – note include an underscore or dash between the two values.

Full dump of SEC 13D filings with ISSUER_INVESTOR identifier created to match across filings.

To match these to particular filings run an analogous search in one of these other filing types. For example, I could run the following search over the DFAN index collection xfirstword and (DOCTYPE contains(DFAN*)) to identify all DFAN filings made – create the same ISSUER_INVESTOR identifier and then use VLOOKUP to match the ISSUER_INVESTOR combinations in the two filing types. So for instance – this strategy allowed me to match the SC 13D filings STARBOARD VALUE LLP filed related to their investment in DARDEN and then identify the associated DFAN filings that were also filed by STARBOARD.
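The same match can be done outside of Excel. The Python sketch below builds the ISSUER_INVESTOR key from the CIK and INVESTOR_CIK columns and keeps the DFAN rows whose key also appears in the SC 13D listing, which is the role VLOOKUP plays above. The function names are made up for this example, and the CIK values in the sample rows are placeholders, not real filers. Stripping leading zeros is my assumption about how you might want to normalize the CIKs. In practice the rows would come from reading the two SummaryExtraction files with csv.DictReader.

```python
def issuer_investor_key(row):
    """Build the ISSUER_INVESTOR key: issuer CIK, underscore, investor CIK."""
    issuer = row["CIK"].strip().lstrip("0")
    investor = row["INVESTOR_CIK"].strip().lstrip("0")
    return f"{issuer}_{investor}"


def match_filings(sc13d_rows, dfan_rows):
    """Return the DFAN rows whose issuer/investor pair appears in SC 13D rows."""
    sc13d_keys = {issuer_investor_key(row) for row in sc13d_rows}
    return [row for row in dfan_rows if issuer_investor_key(row) in sc13d_keys]


# Placeholder rows for illustration only; these CIKs are made up.
sc13d_rows = [
    {"CIK": "1000001", "INVESTOR_CIK": "2000001"},
    {"CIK": "1000002", "INVESTOR_CIK": "2000002"},
]
dfan_rows = [
    {"CIK": "0001000001", "INVESTOR_CIK": "2000001"},  # same pair, zero padded
    {"CIK": "1000003", "INVESTOR_CIK": "2000001"},     # same investor, other issuer
]

for row in match_filings(sc13d_rows, dfan_rows):
    print(issuer_investor_key(row))
```

Using a set of keys means the lookup stays fast even with the full dump of SC 13D filings.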

Presumably we are doing this at scale and not looking at individual (one-off) documents. So if we have identified, for example, all DFAN filings that are associated with an SC 13D INVESTOR filer we can isolate those by using the CIK-DATE feature of the application to run a new search with the issuer CIK and the RDATE of the filings that were identified as a match (there were 3,884 DFAN filings associated with an INVESTOR who filed an SC 13D).

The new filings are available now – to access them please use the File Options Index Library Generate Library command to access the latest filings. Further – if you have not used the Zoom feature when trying to select an index – try it.

Index Library listing using the Zoom feature.