DC Database Updates to Occur Weekly

Our team hit the commit button on a new ‘job’ early this morning that will update the DC database available through the application each Sunday morning at 1:14 AM CT. Getting to this stage was a bigger project than we initially imagined, but we learned a lot, and I hope what we learned translates well to getting the EC data ready for the same schedule.

To see the benefit of this consider that Apple Inc (CIK:320193) filed their proxy last Thursday. As you can see from the image below – their Director Compensation data is ready for use:

To be absolutely clear – that image came from using the DB Query feature on our application as I was writing this post.

If you are curious as to why we are only updating weekly – read on!

We process the data in two stages. First, we have the processes that extract and normalize the data as it was reported in the table in the filing. During this phase we also populate the SEC_NAME, PERSON_CIK and GENDER fields. This is what gets immediately distributed to the old platform and to our API customers. Overnight we then attempt to automatically populate the AGE and SINCE fields. If we can’t populate those fields using data that exists in our databases we have to queue these up for a human to populate after reviewing the source document. It can get really challenging to decide whether or not to pass on these fields. So by waiting until Sunday we are hoping that it is more likely that we will have finished whatever review is required and either populated those fields or signaled NOT AVAILABLE. I will warn though that during busy periods (peak proxy filing ‘season’ runs from late March until the end of April) we will get further behind.

We do the easy ones first and set the hard ones aside until we can devote the right expertise to reviewing the source documents. An easy one is where the disclosure of age and tenure is included in the source document. However, there are more than 400 cases each year where this data is not disclosed for an individual director. Whether we have populated those fields or not, the data will be available. If we are still hopeful that we can get a measure of AGE or SINCE then the field will be blank. If we have determined that we are not likely to find evidence to populate the field we will indicate that with a NOT AVAILABLE – MONTH YEAR message.
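If you work with the DC data in code, the three states described above (populated, still under review, and given up) can be separated with a small helper. This is only a minimal sketch: the function name field_status is hypothetical, the sample values are made up, and it assumes the marker always begins with the words NOT AVAILABLE.

```python
def field_status(value):
    """Classify an AGE or SINCE value as 'populated', 'pending' or 'not available'."""
    text = "" if value is None else str(value).strip()
    if text == "":
        # a blank field means the review is still open
        return "pending"
    if text.upper().startswith("NOT AVAILABLE"):
        # assumption: the marker always starts with these two words
        return "not available"
    return "populated"


# illustrative values only, not taken from the database
for sample in ["47", "", "NOT AVAILABLE - JANUARY 2023"]:
    print(repr(sample), "->", field_status(sample))
```

A blank should therefore be read as "check again after a later Sunday update" rather than as a permanent gap.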

Let me bore you with an example. OneWater Marine Inc. filed a proxy on 1/13/2023. A Mr. Greg A Shell was listed as a director but there were no details about his age or when his tenure began. It turns out he resigned in November because of a change in his primary employment. We found the announcement of his appointment to the board in March. To find his age we had to cross our fingers and hope to find another disclosure of age with a date to see if we could extrapolate to the proxy filing date. We were lucky to find an S-1 filing for an unrelated entity that reported his age (44) as of 12/31/2019. So this field could be populated (47). I think it took 20 minutes to find this fact. While it may seem obvious to populate the SINCE value with the year the director shows up, we have determined that doing so introduces error.
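The extrapolation in that example is simple date arithmetic, and a short sketch makes the bounds explicit. The function names here are hypothetical and this is not our production code. It only assumes that an age reported as of one date pins the birthday down to a one-year window.

```python
from datetime import date, timedelta


def _same_day_other_year(d, year):
    # Feb 29 does not exist in every year, so fall back to Feb 28
    try:
        return d.replace(year=year)
    except ValueError:
        return d.replace(year=year, day=28)


def _age_on(birth, when):
    # completed years between birth and when
    return when.year - birth.year - ((when.month, when.day) < (birth.month, birth.day))


def age_range(known_age, as_of, target):
    """Return (lowest, highest) possible age on target, given an age reported as of as_of."""
    # youngest case: the person turned known_age exactly on as_of
    latest_birth = _same_day_other_year(as_of, as_of.year - known_age)
    # oldest case: the person turns known_age + 1 the day after as_of
    earliest_birth = _same_day_other_year(as_of, as_of.year - known_age - 1) + timedelta(days=1)
    return _age_on(latest_birth, target), _age_on(earliest_birth, target)


# age 44 as of 12/31/2019, proxy filed 1/13/2023
print(age_range(44, date(2019, 12, 31), date(2023, 1, 13)))  # (47, 48)
```

The lower bound of 47 holds unless the birthday falls between January 1 and January 13, which is why 47 is the value that was populated.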

Another issue to be sensitive to is that this data can change over time. For example, AGE and SINCE values will be constantly updated. But there are even bigger issues that can cause change. Items included in Part III of the 10-K are often omitted because the issuer will indicate that they are going to take advantage of the grace they are allowed to incorporate the information included in the proxy if it is filed within 120 days after the fiscal year-end. However, if something causes the company to delay the filing of the proxy (DEF 14A) then they are obligated to include the information from Items 10 – 14 in an amended 10-K (10-K/A). Further, there are issuers who will make the initial disclosure in a 10-K and then still file a proxy by the deadline (or later). We parse and normalize the data when it is filed and only respond to multiples when a second filing is made that includes the same data. It is a little complicated and tedious to fully describe how we handle these. The impact is that once a duplicate is filed we have some tests that run. If the first disclosure was in a 10-K or 10-K/A and the second is a DEF 14A we pull the first disclosure and rotate in the second disclosure. If both disclosures were from a DEF 14A we will remove both from the database until someone can verify the reasons for the second. If the second is because the original meeting date was canceled then we use the data from the second disclosure. If the second disclosure is because of a special meeting we confirm the originally scheduled meeting took place and delete the second disclosure and push the first disclosure back into rotation. There are even other cases. I told you this was tedious!
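The rules above can be sketched as a small decision function. This is a simplified illustration only: the function name, the action strings and the review_reason values are hypothetical, and the "other cases" mentioned above all fall through to manual review.

```python
PROXY = "DEF 14A"
ANNUAL_REPORTS = {"10-K", "10-K/A"}


def resolve_duplicate(first_form, second_form, review_reason=None):
    """Return the action to take when a second filing repeats the same data."""
    if first_form in ANNUAL_REPORTS and second_form == PROXY:
        # pull the 10-K disclosure and rotate in the proxy disclosure
        return "use second"
    if first_form == PROXY and second_form == PROXY:
        if review_reason is None:
            # both come out of the database until someone verifies the reason
            return "hold both for review"
        if review_reason == "meeting canceled":
            return "use second"
        if review_reason == "special meeting":
            # only after confirming the originally scheduled meeting took place
            return "use first"
    # anything else needs a person to look at it
    return "manual review"


print(resolve_duplicate("10-K/A", "DEF 14A"))
print(resolve_duplicate("DEF 14A", "DEF 14A"))
print(resolve_duplicate("DEF 14A", "DEF 14A", "special meeting"))
```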

There are more exciting things coming in the next weeks. Stay posted.

More Errors in SEC Filings

I have been pounding on the XBRL data as we intend to make some subset of this available. One of the issues that is particularly important to me is to improve the way we make metadata available to you to organize your search. I don’t want to get bogged down in a deep discussion about our plans at this moment. I would just like to observe that we for sure want the Document and Entity Information (DEI) at your fingertips so you can better manage your search.

I created two datasets: one has all of the DEI numeric data and the other has all of the text data. We knew some of the numeric values had errors. For example, did you know that eBay once reported that their public float was $31,354,367,947,000,000,000? Anyway, I was playing with the dataset in an SQLite browser tool and decided to test a filter that sets the ICFR flag to false and the EntityFilerCategory to LARGE ACCELERATED FILER. I expected none or just a few matches. Instead, the database returned 97 entities where the DEI table indicated that the entity was a LARGE ACCELERATED FILER (LAF) and the IcfrAuditorAttestationFlag was set to false. Because LAFs are required to have an audit of their internal controls I thought this was curious. Here is an image of part of that query.
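For readers who want to run the same kind of filter from Python rather than a browser tool, here is a minimal sketch. The table name dei and the sample rows are made up for illustration. Only the two DEI element names come from the discussion above, and the way values are stored (for example 'false' as text) is an assumption.

```python
import sqlite3

# hypothetical in-memory table standing in for the DEI text dataset
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dei (CIK TEXT, EntityFilerCategory TEXT, IcfrAuditorAttestationFlag TEXT)"
)
conn.executemany(
    "INSERT INTO dei VALUES (?, ?, ?)",
    [
        ("1", "Large Accelerated Filer", "true"),
        ("2", "Large Accelerated Filer", "false"),
        ("3", "Non-accelerated Filer", "false"),
    ],
)

# large accelerated filers whose attestation flag is false
query = """
    SELECT CIK FROM dei
    WHERE upper(EntityFilerCategory) = 'LARGE ACCELERATED FILER'
      AND lower(IcfrAuditorAttestationFlag) = 'false'
"""
flagged = [row[0] for row in conn.execute(query)]
conn.close()
print(flagged)  # ['2']
```

Normalizing case inside the query guards against filers that report the category with inconsistent capitalization.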

I of course am hugely curious about that result. Frankly, my first thought was that this was another unexpected issue in the source files that I would have to pore through to sort out. The code to manage this process was really tricky because of filer-specific idiosyncratic choices. Guess what – most of these were coded wrong, either by an employee of the filer making an error or some hiccup between the filer’s form creation and their EDGAR-IZING software. For example, you can see clearly in this audit report that the auditor issued an unqualified opinion on the Company’s internal control over financial reporting.

In some of the cases the auditor concluded that there were one or more material weaknesses in the internal controls, but my read of the EDGAR Filer Manual suggests to me that the flag is to indicate whether or not there was an audit of internal controls, not the results of the audit.

My quick scan (using directEDGAR of course) indicates that 91 of the 97 cases were miscoded by the filers. My present thought is that we need to push this data out as-reported. I am going to muck around with it a bit more before we push it into the platform. Once again, because of the updates, once it is available it will be visible in the Database to Query area of the database tool.

I will observe that DraftKings is the result of a SPAC deal and I think based on the timing etc of their acquisition of the old DraftKings business that they were exempt from the requirement to have an audit of internal controls for their 2021 financials.

I hope you have a Happy New Year.

Back to Basics

While we spend a significant amount of time trying to develop new features the truth of the matter is that directEDGAR is still my preferred way to review filings, especially when I have to hand collect something. I am sharing this because in the last three weeks I have had several users ask an unrelated question and in a subsequent conversation it turns out that they were also visiting EDGAR to review some filings for data. My sense from these conversations was that they thought the effort of going to EDGAR was less than firing up directEDGAR because their N was relatively small. My rule of thumb has generally been that if I need to look at more than 5 filings of one type I save more time by using directEDGAR versus visiting EDGAR – and frankly with the new front-end on EDGAR I am starting to think my N should be 2.

I have 37 proxy statements filed in 2022 I need to review this morning from 18 different filers. The short story is I need to establish why we have multiple proxies for these particular filers and how we should code the DC/EC data that was pulled from these filings. I hope this does not sound too arrogant but I feel like I am pretty adept at navigating EDGAR efficiently. So when I assert that using directEDGAR is much faster than using the EDGAR interface, I am already factoring in my experience.

If I use the EDGAR interface to review these proxies I have to start by keying/pasting in the CIK and remembering to hit the TAB key (rather than enter argh!).

I land on the new landing page and because I need to scan each of the two (or in one case three) proxies I need to hit the Classic version link to load the no-frills listing of filings in date order. From there I need to key in DEF 14A to load the proxies.

Next, I need to click on the filing links to open to the landing page for the filing and then I need to click on the link to the actual filings before I can review and draw the conclusion I need to draw. Even describing the process is tedious. Type in CIK, hit Classic version link, type DEF 14A, select each relevant link to open and then hit the actual proxy filing link before I can review the filing.

Using directEDGAR has a slightly higher start-up cost because I have to log-in and wait for my session to load but that start-up cost becomes meaningless when we consider the need to look at multiple filings from multiple filers. Here is a video I made that demonstrates the way I would more efficiently access the same collection of documents.

Using directEDGAR to access the same set of filings – it is just more efficient.

While I grant you that you have to log-in; that time cost is quickly mitigated by the fact that all of the documents you need to review are available. Remember – the SummaryExtraction tool creates a list of the documents with all of the required metadata in the same order the documents are displayed in the Search Results frame so you have a ready tool to use to make the appropriate notes that you need from your review.

Frankly, with the changes to EDGAR over the last year I try to avoid it even more. I will go to EDGAR to look at one or two filings. Maybe a year ago I might have gone to look at ten or so. With the latest interface I get frustrated when the CIK will not come up in the search box the first time or I can’t get a name match if I am using names.

Historical CIK Mapping

I received a really nice email from Professor Richard Price (Richard is an accounting professor at OU) after yesterday’s post. One of the things we discussed in the exchange is the pain associated with entity/security changes etc. Richard brought up a great example: Disney has had 3 CIKs (their current is 1744489, their immediate past 1001039 and their oldest 29082). We added a feature some time ago that automatically maps whatever CIK you bring to the platform with these others if you specify that you want to include historic CIKs. We did this because some of the tools you use only keep the most current even if we can reasonably conclude that the new entity is the successor issuer. When they transitioned from CIK 1001039 to 1744489, Disney filed an 8-K12B that plainly states: “This Current Report on Form 8-K is being filed for the purpose of establishing Disney as the successor issuer to Old Disney pursuant to Rule 12g-3(a)”. As a refresher on this issue please see the news about our 4.0.5 release (4.0.5 Release Announcement). One of the enhancements in the latest version is that you used to have to manually update the mapping file; that file now gets updated automatically when you start the application.

Anyway, Richard made an interesting observation about the need to link the CUSIP-CIK mapping to the historical CIK mapping that I understood but am not sure how to implement. Therefore I decided to make the historical CIK mapping more available. There is another new database available on the platform: CIK_MAPPING. This has the content of the json file we use internally to map CIKs. Richard did not say this directly but I inferred from his comments that this will be useful when CUSIPs and CIKs change. At the present time we are not going to try to modify the CUSIP-CIK file to reflect any information in the CIK_MAPPING file. Both are relatively small and by making those available we are giving you the information you need to review and make the choices that are appropriate for your research.

I hope this is a clear-headed example. Suppose you come to our platform with a list of CUSIPs and included in the list is 38259P508. This is the CUSIP assigned to GOOGLE CLASS A. You are sensitive to the fact that this is complicated so you try to match on the first 6 characters in our database and you find that there is an entry for 38259P706 (GOOGLE CLASS C) and the mapped CIK is 1288776.

So you then run some searches using that CIK and you observe that some observations are missing. So you then take that CIK and investigate the CIK_MAPPING database and you discover that that CIK exists and has a related CIK 1652044.

Well, now this gets interesting – do we have all of the data we need from our original data source (where the CUSIP was the identifier)? We can check because we can use the CIK we just found to see if there is a matching CUSIP.

As you can see there is data, the CUSIP in the image above reflects the CUSIP assigned to ALPHABET INC CLASS C CAPITAL STOCK. So this provides a mechanism/process by which you can be more comfortable about establishing the time series.

I of course understand that you don’t want to work with these one CUSIP/CIK at a time and that is why they are available through the Query tool on the platform. Once again, if you select the database and hit the Execute button without specifying any parameters you can get the entire database available. Use the Save Results button to save as a csv file.
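Once both databases are saved as csv files, the walk-through above can be automated. The sketch below is a minimal illustration and makes assumptions you should check against the exported files: the CUSIP rows are shown as dictionaries with the BASE_CUSIP, CUSIP and CIK fields, and the CIK_MAPPING content is represented as a plain dictionary from a CIK to its related CIKs, which may not match the actual column layout.

```python
def ciks_for_cusip(cusip, cusip_rows, cik_mapping):
    """Return every CIK reachable from a CUSIP via its six-character issuer base."""
    base = cusip[:6].upper()
    # step 1: match on the issuer portion of the CUSIP
    direct = {row["CIK"] for row in cusip_rows if row["BASE_CUSIP"] == base}
    # step 2: add any predecessor/successor CIKs from the historical mapping
    related = {rel for cik in direct for rel in cik_mapping.get(cik, [])}
    return sorted(direct | related)


# sample data taken from the example above; the dictionary layout is hypothetical
cusip_rows = [{"BASE_CUSIP": "38259P", "CUSIP": "38259P706", "CIK": "1288776"}]
cik_mapping = {"1288776": ["1652044"], "1652044": ["1288776"]}

print(ciks_for_cusip("38259P508", cusip_rows, cik_mapping))  # ['1288776', '1652044']
```

Searching with the full list of CIKs returned here is what fills in the observations that go missing when only the first CIK is used.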

I will observe that these observations are tough to identify, I added a few more today while I was working on this post – and it took three hours to identify the two (really four) new entries. This is not work that can be delegated. Matching CUSIPS to CIKs can be delegated but it is still tedious and it has to be fit in around our primary work.

Notice that I am not that worried about security type – since our focus is on getting access to SEC identifiers from CUSIPs – I know we have cases where the nine digit CUSIP in our database relates to a derivative security (Call/Put) but the six digit maps to the entity. That is what I ultimately care about. It is not perfect but I hope it helps your research.


I am making available a new database on the platform that has three fields: BASE_CUSIP, CUSIP and CIK. As you have probably seen, we have parsed the 13F filings and have been trying to link the CUSIP to the issuer CIK. CUSIPs are hard to get as they are issued by the American Bankers Association and managed by Standard and Poor’s. What we did to identify the mapping between CUSIPs and CIKs is we parsed the SC 13G and SC 13D filings as they contain CUSIP values.

First Page of SC 13G/A Filed to Disclose Holdings in BioScrip by Heartland Advisors in 2007

Based on the metadata associated with this filing we could determine that this related to the company that was named BioScrip, whose CIK is 1014739. Because of the number of SC 13 filings that were made with this CIK we feel pretty comfortable that this mapping is correct.

Of course it is never that easy as you well know. BioScrip was acquired in a reverse merger type arrangement in 2019 and while the legal entity with respect to filing with the SEC remained fixed, the shares were replaced with new shares and with those new shares a new CUSIP was issued (68404L201) as can be seen in the image from an SC 13 filing made in 2022.

Since our interest is in making it easier for you to search for SEC filing content, we believe this mapping will help you when you want to identify filings made by some security issuer when you have the CUSIP and not the CIK. From our perspective, the fact that the CUSIP changes is more or less irrelevant. If you have data that is indexed on CUSIP for this company that relates to security information prior to August 2019 you will have the CUSIP value 09069N108 – that maps to CIK 1014739. If your data is more recent (in 2020 for example) you would have CUSIP 68404L201 – that also maps to CIK 1014739. And actually, if you have data prior to early 2005 you might have CUSIP 553044108 (the company was known as MIM CORP then). So we have one CIK that maps to three CUSIPs. We are pretty confident that we have enough evidence to draw the conclusion that the mapping is appropriate/reasonable. Specifically there were more than 40 confirming SC 13 filings with CUSIP 553044108. There were more than 100 SC 13 filings with CUSIP 09069N108 and more than 15 with CUSIP 68404L201. I would argue that the evidence is reasonable. However, you need to understand that we just analyzed the filings. We tested the validity of the CUSIPs we found and created tests relating to characteristics of the filing, the filer and the issuer. However, we don’t know what we don’t know. I have been trying to find errors. I suspect there can be some, but we have done extensive testing.

Remember, the first six characters of the CUSIP are issuer specific, so we also added the BASE_CUSIP value(s) (09069N, 553044 and 68404L) as an additional field, just in-case. We clearly do not have every CIK mapped to a CUSIP. For some issuers they do not have any SC 13 filings. For others, the evidence is not strong enough (5 filings total, 3 with one CUSIP and 2 with another). There are those that report the SEDOL of the underlying security rather than the CUSIP of the ADR. We are also working with another data source that will yield some additional mappings.
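To make the idea of "enough evidence" concrete, here is a minimal sketch of counting confirming filings per CUSIP-CIK pair and keeping only the well-supported pairs. It is not the actual test suite described above: the function name, the threshold of 10 filings and the sample counts are all assumptions chosen so that the BioScrip pairs pass and a 3-versus-2 split does not.

```python
from collections import Counter


def supported_mappings(filings, min_filings=10):
    """Build BASE_CUSIP/CUSIP/CIK rows for pairs seen in at least min_filings filings.

    filings is an iterable of (cik, cusip) pairs, one per SC 13 filing.
    """
    counts = Counter(filings)
    rows = []
    for (cik, cusip), n in sorted(counts.items()):
        if n >= min_filings:
            # the first six characters identify the issuer
            rows.append({"BASE_CUSIP": cusip[:6], "CUSIP": cusip, "CIK": cik})
    return rows


# illustrative counts only ("more than 40", "more than 100", "more than 15")
sample = (
    [("1014739", "553044108")] * 41
    + [("1014739", "09069N108")] * 101
    + [("1014739", "68404L201")] * 16
    # a made-up issuer with weak, conflicting evidence (3 filings vs 2)
    + [("9999999", "AAAAAA101")] * 3
    + [("9999999", "BBBBBB202")] * 2
)

for row in supported_mappings(sample):
    print(row)
```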

Finally, you can filter on CIK – to access the entire listing just hit the Execute button without setting any operators/criteria. And of course you can export the results using the Save Results button.

End of Year Index Changes Coming – no action needed by you!

We are starting to prep for the end of the calendar year. We will consolidate all Y2021/Y2022 indexes into a Y2021 – Y2025 index and then create new Y2023 indexes for all filings made after 12/31/2022. With the software update we did earlier this year – you will not have to do anything. If you start the application after we complete the update around 1/1/2023 the old indexes will be merged and renamed. It does mean that if you are using one or more of our artifacts that includes the FILENAME field you might have to either manually change the path value or rerun your search on the new consolidated index.

To be clear – suppose you used our platform to identify the 8-K filings made so far in 2022 that related to an auditor change.

If you ran the summary extraction from this search the FILENAME field will have the path to the folder Y2022-Y2022 as you can see in the next image:

Once we consolidate and re-index that path will no longer be valid. However, it will be in a very predictable location – we can replace the Y2022-Y2022 component of the path with Y2021-Y2025 and the file will be reachable using Python or one of the filepath filtering features.
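If you have saved FILENAME values from an earlier summary extraction, the replacement described above is a one-line string operation in Python. This is a minimal sketch. The function name is hypothetical and the example path is made up to show the shape of the change, not a real location on the platform.

```python
def updated_path(path, old="Y2022-Y2022", new="Y2021-Y2025"):
    """Swap the old index folder name for the consolidated one."""
    return path.replace(old, new)


# hypothetical FILENAME value for illustration only
old_path = r"S:\EXAMPLE_INDEX\Y2022-Y2022\EXAMPLE_FILING.htm"
print(updated_path(old_path))
```

The same function can be mapped over the FILENAME column of a saved csv before you feed the paths to your own code.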

I am particularly excited that when you open the application when this process is complete you will not have to do anything to access the new indexes – I can’t screenshot it right now because it has to happen – but the new index will be listed for you to select. Before the last update you would have had to run an Index Update through the options menu.

Significant Update to 13F Share Data – Read About Consequences of Chunking versus not Chunking

We completed a fairly intense update to the 13F Share Data available on our platform. First, unbeknownst to us there were CUSIPS that sometimes had lower-case characters. Note to self, remember that filers make mistakes. We fixed that issue. More importantly we have been working to identify more CUSIP-CIK mappings. We succeeded and so we updated the 13F share data with the CIK if we were able to identify the CIK of the underlying issuer. The first problem only affected about 100K rows, the new mapping adds about 22 million new rows with the CIK as a value. This was significant, we almost doubled the number of observations that have a CUSIP-CIK mapping. We now have over 45 million observations with a CUSIP-CIK match. We still need to identify the CIK-CUSIP mapping for about 17 million rows.

This gets dense now. One challenge with the next group is that some CUSIPs for these have been associated with two or more CIKs and we have to sort out how to make sure we map these correctly. As an example of one of these cases, CUSIP 00206R102 belongs to the entity known today as AT&T (CIK – 732717), which was formerly known as Southwestern Bell Corporation. However, we found evidence that it was attached (wrongly) to securities issued by the prior entity known as AT&T (CIK – 5907) that was acquired by Southwestern Bell Corporation. I am confused as I am writing this because, well, it is confusing. We are still trying to develop the correct algorithm to assign the correct CIK to these cases. Another challenge is that some of the CUSIPs are missing one or more leading or trailing zeros (0). We actually think this group will be the next one we update because we have parsed all of the SEC’s List(s) of 13F Securities and believe we can use these to address this group. Our plan is that for those CUSIPs that are missing one or more leading 0s we will check every other existing, known CUSIP to see if the non-zero characters form a subset of some other CUSIP value. For example, APPLE’s CUSIP is 037833100. We will try to confirm that there is no other CUSIP that contains the sequence 378331. So with that evidence and some fuzzy name-matching we hopefully can conclude that when the reported CUSIP is 378331 and the name is close to APPLE that the proper CIK to map to in those cases is 320193. But all of this takes time.
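The zero-padding plan can be sketched in a few lines. This is only an illustration of the idea, not the algorithm we will ship: the function name, the layout of the known-CUSIP dictionary, the 0.6 similarity cutoff and the use of difflib for the fuzzy name match are all assumptions.

```python
from difflib import SequenceMatcher


def resolve_short_cusip(reported_cusip, reported_name, known, min_name_ratio=0.6):
    """Return the CIK for a zero-stripped CUSIP, or None if the evidence is ambiguous.

    known maps a full nine-character CUSIP to a (issuer name, CIK) tuple.
    """
    core = reported_cusip.upper().strip("0")
    # every known CUSIP that contains the non-zero sequence
    hits = [(name, cik) for cusip, (name, cik) in known.items() if core in cusip]
    if len(hits) != 1:
        # no candidate, or more than one CUSIP contains the sequence
        return None
    name, cik = hits[0]
    ratio = SequenceMatcher(None, reported_name.upper(), name.upper()).ratio()
    return cik if ratio >= min_name_ratio else None


# APPLE's CUSIP and CIK are from the text; the second entry is made up
known = {
    "037833100": ("APPLE INC", "320193"),
    "123456789": ("EXAMPLE CORP", "0000000"),
}
print(resolve_short_cusip("378331", "APPLE INC COM", known))  # 320193
print(resolve_short_cusip("378331", "ZEBRA TECH", known))     # None
```

Requiring exactly one containing CUSIP and a reasonable name match keeps the sketch conservative: anything ambiguous is left unmapped rather than guessed.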

I will observe that this is not quite as grim as it sounds. Some of the securities listed in the 13F HR where we are missing the CIK are derivative securities (ETF, trust shares and the like). We are trying to carefully identify these. I am going to go out on a limb and say, from reviewing some of this data, that at least 60% of the missing CIK values are from these types of derivative securities. ISHARES has more than 600 derivative securities listed in the Q3 2022 list of 13F Securities that are well represented in the holdings data. We may end up adding a field to flag these. As I am writing this I am waiting on some code to pull all unmatched pairs and their frequency to sort out our steps going forward.

In the meantime, here is another code example of working with the as reported 13FSHAREDATA. I prepared this because of some encouragement to add additional explanatory comments. This is available on S:\PythonCode. I am going to prep a short video that demonstrates using this code. This example presumes that you have a list of CIKs that you want to use to identify relevant holdings. You can trivially modify the code below to pull based on CUSIP if you supply a list of CUSIPs and change the key word cik to cusip in this code example where the word cik is present.

In the code below I demonstrate querying the database by quarter. I have written/spoken about my practice of trying to profile the optimal balance between chunk size and time. In this example I had 1,149 CIKs in my list. The output files contained a total of 10,657,575 observations. It took me 16 minutes to pull by quarter using one of our client instances. I modified the code to try pulling the same sample without conditioning on quarter. After 90 minutes I finally killed the job. It is not a CPU issue, it is a memory issue. I am having to stop myself from droning on here, it is kind of cool – I have 1,149 CIKs, each is defined/exists in one place in the memory stack and then there are references to their location in the data . . .! I will stop! Anyway – here is some code – and as noted above, it is available on S:\PythonCode.

import sqlite3
import csv

db_path = 'S:\\directEDGAR_DATA\\13FSHAREDATA.db'

# I highly recommend that you pull by year/quarter it ultimately will be faster
# than even pulling by year because you are not having to use the page file memory as much
# I am always messing around with trying to minimize total time for some operation and
# it is just true that this is a delicate balance

years = [str(year) for year in range(2013, 2022)]
qtrs = ['-03-31', '-06-30', '-09-30', '-12-31']

# suppose you have a list of CIKs and you want the individual reported transactions relating
# to the filers in that list. I am assuming below that your list is in a text file with no header
# and is in the Temporary Files folder on the instance you are working from.
# I am also assuming that your CIKs are not left-padded with 0 - if they are open the list in EXCEL and then save as
# a text file

# A folder that you created to contain the results
DEST_FOLDER = "C:\\PhotonUser\\My Files\\Temporary Files\\13Fdata\\"

with open(r"D:\PhotonUser\My Files\Temporary Files\sample_cik.txt", 'r') as fh:
    my_cik_list = fh.readlines()

# there is going to be a carriage return/line feed after each observation - this will remove those
my_cik_list = [cik.strip() for cik in my_cik_list]

# we need to turn the list into a tuple as that is the object type that an sqlite query prefers
my_ciks = tuple(my_cik_list)

for year in years:
    for qtr in qtrs:
        # we are going to save after each cycle - modify if you want to save less frequently
        # the problem you will face is that about one year of data will exceed the 'capacity' of
        # EXCEL if you have a large number of CIKs or CUSIPS
        results = []
        period = year + qtr
        conn = sqlite3.connect(db_path)
        # useful feature - preserves the mapping between the COLUMN and VALUE - so the COLUMN NAME is persistent
        conn.row_factory = sqlite3.Row
        cur = conn.cursor()
        # the tuple of CIKs is formatted directly into the IN clause - this needs at least two CIKs in the list
        cur.execute(
            f"""SELECT * from THIRTEENFSHAREDATA where periodofreport_DATE = '{period}' and cik in {my_ciks} """)
        rows = cur.fetchall()
        conn.close()
        # this just lets us see progress
        print(len(rows), period)
        if len(rows) == 0:
            print("no observations", period)
            # nothing to write for this quarter - move on to the next one
            continue
        for row in rows:
            d_row = dict(row)
            results.append(d_row)

        column_names = [k for k in results[0].keys()]

        dest_file = DEST_FOLDER + '13FHOLDINGS_' + period + '.csv'
        # I am assuming you have created a folder to contain the results - see above

        with open(dest_file, 'w', newline='') as outref:
            # the writer needs the open file object, not the path
            my_writer = csv.DictWriter(outref, fieldnames=column_names)
            my_writer.writeheader()
            my_writer.writerows(results)

Off Topic (Somewhat) but Great Friday Reading

I say off-topic because this post is not about a new feature or update to our platform. Jack Ciesielski was the creator/founder of the Analyst’s Accounting Observer. Jack and his team read financial statements and unwound practices that they thought were dubious to create what he thought was the better representation of key metrics. Jack ‘retired’ a while ago though he still serves on the EITF, the Investor Advisory Group of the PCAOB and the CFA Institute’s Corporate Disclosure Policy Council. I was fortunate to meet Jack as they were early directEDGAR customers (they were with us when we had to mail CDs with updates and they were a pilot customer when we toyed with providing NAS devices to our customers).

I reached out to Jack a bit ago to see how he was doing in retirement. I guess he can’t stop thinking about accounting/markets and business because he shared that he transitioned from the Analyst’s Accounting Observer to a weekly newsletter “The Weekly Reader”. In this Jack has curated some interesting reads as well as a brief take on why they made it into the newsletter. He also offers a couple of bonus rounds. Here is a link to the latest version (Jack Ciesielski WR 11/11/2022) – if you would like to be added to Jack’s mailing list just send him an email (jciesielski [some important symbol here] accountingobserver.com). At least with Jack we can be assured that he is not trying to make money off of your email address!

Quick Update

Earlier this week I explained that we would move all of our archive of ITEM_1A to the platform and index them. This was completed, and soon after I had a message wondering if we could do the same with ITEM_7 (affectionately known as MD&A). This was completed overnight. I did change the index name. Originally I named the Risk Factors archive RISKFACTORS. When I finished the MD&A and looked at the placement in the index list, I decided I was making this unnecessarily complicated. So the names were changed to ITEM_RISKFACTORS and ITEM_MDA.

New Items for search.

And yes, you can use all of the existing features with these indexes. Specifically – if you want to run a CIK-DATE limited search, you need to have a file with the column headings CIK and MATCHDATE, there can be other columns but those columns must exist.

CIK DATE file for Search

Notice in the file above, I have multiple instances of the same CIK but different dates. That is because for my research example I want to establish whether or not a MDA exists in a window around each of those unique dates with the search phrase(s) I will use in my search. Once the file is created and available select the CIK/DATE Filter checkbox and hit the Set CIK/Date File to activate the file selector tool.

Selecting CIK/Matchdate file

After you select the file remember to specify the range and the specific date you want to use. The RDATE represents the dissemination date (which is often but not always the filed date) and the CDATE represents the balance sheet date AS REPORTED in the filing header. Once you have set the parameters hit the Okay button and enter your Search term(s) phrases into the search box before you hit the Perform Search button.

Expert tip – if you only want the existence of the document rather than identifying those with specific words/phrases use the search operator XFIRSTWORD (or XLASTWORD if you prefer).

Remember as well that the platform will generate a list of CIK-DATE pairs that did not match any criteria for the search. It could either be that they did not have the relevant search terms or if they did, the MATCHDATE you specified as well as the window did not match what is available.
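If your CIK-DATE pairs start out in Python rather than a spreadsheet, the input file for the filter can be written with the csv module. This is a minimal sketch under stated assumptions: the function name, the output file name and the sample dates are made up, and the date format shown is a guess, so use whatever format the file selector accepts.

```python
import csv
import os
import tempfile


def write_cik_date_file(path, rows):
    """Write a csv with the two required column headings, CIK and MATCHDATE."""
    with open(path, "w", newline="") as fh:
        # extra keys in a row are dropped so only the required columns are written
        writer = csv.DictWriter(fh, fieldnames=["CIK", "MATCHDATE"], extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)


# the same CIK can appear more than once with different dates
rows = [
    {"CIK": "320193", "MATCHDATE": "2021-03-15"},
    {"CIK": "320193", "MATCHDATE": "2022-06-30"},
    {"CIK": "1014739", "MATCHDATE": "2020-01-31"},
]

out_path = os.path.join(tempfile.gettempdir(), "cik_matchdate.csv")
write_cik_date_file(out_path, rows)
print(out_path)
```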

Risk Factors Separate Index Available Now

Let me get to the punchline first – and then I will add some detail. If you log into our platform today and use the Zoom button next to the Library listing tool and then scroll down almost to the bottom you will see the new RISKFACTORS index:

Accessing Risk Factors Index

This index has the collection of the ITEM 1A Risk Factors we have previously parsed from 10-K filings. While these are available from our ExtractionPreprocessed feature and the application has a built in Custom-Indexing feature two things happened in the past week that caused me to move these to the application drive.

First, I had an interesting and long discussion with Professor David Lont at Otago University. David has been a directEDGAR user from almost the beginning. Honestly, some of the research goals he has shared with me in the past were critical factors that caused us to figure out how to add the ability to filter searches by CIK and unique date pairs. In our conversation last week David was not wishing for the ability to search the Risk Factors section. Instead, he was sharing a story about one of his junior colleague's general frustration with using directEDGAR. It was interesting, and painful, to listen to these observations. When I think of hard/easy I am probably always measuring the difference between where we are now and what I had to do to collect purchase price allocations to establish the amount of goodwill recognized in acquisitions by reading 10-Ks on microfiche. That is probably not the right measuring stick!

Okay, so I had that discussion with David (and it was great to talk to him) and then two days ago a new user at another client school asked about searching within the Risk Factors section of the 10-K. Initially, my thought was to direct him to the sections in the documentation regarding ExtractionPreprocessed and then the section on Custom Indexing. But my conversation with David had been running around in my head and I thought to myself – that is adding unnecessary work to the process. What I needed to do was to just take the time to move a copy of the Risk Factors snips to the application and create an index just for those documents. The ExtractionPreprocessed was developed when we distributed our content to you and left it to your IT experts to update the indexes etc. We knew that there was no way they could be expected to manage the more than one million files that have accumulated over time. So we set up the system for you to pull the files and added the indexing feature so you could build your own index and search the snips.

I get that this might seem convoluted (remember my starting place).

However, there is a caveat. We have some other projects that are fairly significant in scope and I just don’t have the resources right now to do everything I wish I could do with these snips. At the moment we have not injected any metadata into the snips like we normally do with our documents. Any search results will have blank fields for the CNAME, DOCTYPE, FYEND and SIC. However, the search results will have the word count as well as the CIK, RDATE, CDATE and FVAL to provide a way to match back to the original 10-K filings that the content was pulled from. I would actually like to put in the path to the original 10-K as a metadata field with the word count from the 10-K as another field.
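Until that metadata is injected, you can fill the blank fields yourself by matching on the identifiers that are present. The sketch below assumes you have two sets of rows (for example, read from exported results with csv.DictReader): one from the RISKFACTORS index and one from a search of the full 10-Ks that carries the usual metadata. The function name, the choice of CIK, RDATE and CDATE as the key, and every sample value are hypothetical.

```python
BLANK_FIELDS = ("CNAME", "DOCTYPE", "FYEND", "SIC")

def fill_metadata(snip_rows, filing_rows, key_fields=("CIK", "RDATE", "CDATE")):
    """Copy the blank metadata fields onto each snip row from its matching 10-K row."""
    lookup = {tuple(row[k] for k in key_fields): row for row in filing_rows}
    merged = []
    for snip in snip_rows:
        source = lookup.get(tuple(snip[k] for k in key_fields), {})
        row = dict(snip)
        for field in BLANK_FIELDS:
            # Leave the field blank when no matching 10-K row was found.
            row[field] = source.get(field, "")
        merged.append(row)
    return merged

# Hypothetical rows for illustration only.
snips = [
    {"CIK": "1234567", "RDATE": "20220215", "CDATE": "20211231", "WORDCOUNT": "8421"},
    {"CIK": "7654321", "RDATE": "20220301", "CDATE": "20211231", "WORDCOUNT": "5310"},
]
filings = [
    {"CIK": "1234567", "RDATE": "20220215", "CDATE": "20211231",
     "CNAME": "EXAMPLE CO", "DOCTYPE": "10-K", "FYEND": "1231", "SIC": "9999"},
]
merged = fill_metadata(snips, filings)
for row in merged:
    print(row)
```

The second snip has no matching 10-K row in this sample, so its metadata fields stay blank rather than being guessed.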

One immediately nice thing about these is that you can get just the text by completing a search and then hitting the DocumentExtraction TextOnly item from the Extraction item on the menu bar. Thus if you want to use the raw text as an input for some process – that is immediately available.

If I were going to parse these to identify the individual listed items, I would prefer the htm version with the formatting preserved so I could use the tags to help identify the listed risk factors. As always, the DocumentExtraction feature will dump an original copy of the source file into the directory you specify.
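As a starting point for that kind of parsing, here is a minimal sketch that pulls out text wrapped in emphasis tags as candidate risk factor headings. The heuristic is an assumption on my part: many filers bold or italicize each risk factor caption, but plenty use styled span or font tags instead, which this sketch ignores. The class name and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser

class HeadingCollector(HTMLParser):
    """Collect text found inside emphasis tags as candidate risk factor headings."""

    EMPHASIS_TAGS = {"b", "strong", "i", "em", "u"}

    def __init__(self):
        super().__init__()
        self.depth = 0        # how many emphasis tags are currently open
        self.buffer = []      # text seen inside the current emphasized run
        self.candidates = []  # finished candidate headings

    def handle_starttag(self, tag, attrs):
        if tag in self.EMPHASIS_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.EMPHASIS_TAGS and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Collapse whitespace and keep non-empty runs only.
                text = " ".join("".join(self.buffer).split())
                if text:
                    self.candidates.append(text)
                self.buffer = []

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)

# Made-up snippet standing in for an extracted ITEM 1A htm file.
sample = (
    "<p><b>We depend on a small number of customers.</b></p>"
    "<p>Loss of any one of them could reduce revenue.</p>"
    "<p><b><i>Our costs may rise.</i></b></p>"
    "<p>Input prices are volatile.</p>"
)
collector = HeadingCollector()
collector.feed(sample)
collector.close()
risk_headings = collector.candidates
print(risk_headings)
```

On real filings you would want to review the candidates, since emphasized phrases inside the body text will be picked up along with the actual captions.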

FYI – Risk factors were only required after 12/1/2005. We are capturing the ITEM 1A section as it exists, so this can be complicated for those entities that reference another document or another location in the 10-K. We are missing some, and it is part of our workflow to determine if those can be captured. Further, while entity classification has changed across time, there are entity classes that are not obligated to provide separate disclosures about risk factors. Sometimes they simply remove ITEM 1A from their 10-K; other times it is left in and they include language that indicates they are exempt.

Before I sat down to write this post I started the process of moving the MDA archive we have – there will be an MDA index soon (probably by Sunday 10/23/22). I also expect to move over the Business section as well.