I had a PhD student ask a really interesting question last week. Because I don’t want to disclose their research goals, it took me a bit of time to come up with a good analogy. Their search had more than 100 search terms. They did a summary extraction and were scanning the summary file, whose columns list the terms and the number of hits found in each document. They would then periodically look at the source document in our viewer to check how their words/phrases were actually used in context. The problem they identified was that they started to see cases where their search term was in the document but was not used in the right context.
So my example is going to be this: suppose I want to find all 10-K filings that mention Texas. I believe that if the word Texas is reported in the 10-K, that provides strong evidence that the company has operations of some sort in the state. However, once I start scanning the results I find plenty of cases where the word TEXAS is in a filing but is used as part of a noun phrase or other construction that does not actually name Texas as a location of operations. For example, West Texas Intermediate is a benchmark used for pricing oil transactions. There are also mentions of Texas Instruments or Texas Pacific as a competitor. So the question is: if we don’t know the context of the word in use, how can we be sure the word is actually signifying what we hope it signifies? In other words, the existence of the word may not be sufficient evidence that the instance is meaningful in our case. Further, we do not know in advance all of the possible noun phrases and proper names that include the word TEXAS, so we can’t exclude them or account for them in our search. (If you do know all of the proper nouns and noun phrases to exclude in advance, then modify the search to account for those with the ANDANY operator: TEXAS andany (TEXAS INSTRUMENTS OR WEST TEXAS INTERMEDIATE OR . . .).)
Here is an image of the search results. The filer is TEJON RANCH CO. The name has a Texas sort of ring to it, but they are a real estate development and agribusiness company. They realize some royalties from mineral rights leases on their land, which is in California. They appear to have no operations in Texas, but the royalties they receive seem to be tied to West Texas Intermediate.
One way to get a better handle on how TEXAS is used in a document is to do a ContextExtraction (this is the label we use on the platform) and set a really tight span. In this particular case I suggested to the PhD student that they set a span of 1 ‘Words’ as illustrated in the next image.
By doing so and scanning the results it becomes clear that there is a lot of noise in assuming that the mere mention of the word TEXAS in a 10-K filing is meaningful. We find cases where Texas Instruments is mentioned as a competitor. There are cases where Texas A&M and other universities with the word Texas in their name have a patent relationship with the registrant, or where one of the executive officers earned a degree at the university. Restricting the context to one word may not be the best choice in every case. That is okay because it is cheap to rerun and alter the span to test alternative strategies.
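For readers who want to see what a tight span does mechanically, here is a minimal Python sketch of the idea. It is not how the platform's ContextExtraction is implemented; the function name `extract_context`, the crude whitespace tokenizer, and the sample sentences are all made up for illustration.

```python
def extract_context(text, term, span=1):
    """Return each occurrence of `term` with `span` words on either side."""
    # Crude tokenizer (an assumption): split on whitespace and strip
    # common punctuation from the edges of each token.
    tokens = [t.strip(".,;:()\"'") for t in text.split()]
    tokens = [t for t in tokens if t]
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == term.lower():
            start = max(0, i - span)
            hits.append(" ".join(tokens[start:i + span + 1]))
    return hits


# Invented sample text, not taken from any real filing.
sample = (
    "Royalties are tied to West Texas Intermediate prices. "
    "We compete with Texas Instruments. "
    "Our plant in Houston, Texas employs 400 people."
)

for phrase in extract_context(sample, "Texas"):
    print(phrase)
```

With a span of 1, each hit is reduced to the word on either side of TEXAS, which is usually enough to tell a benchmark or a company name apart from a location. Changing `span` is the analogue of rerunning the extraction with a wider window.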
The point is that there is no way I could have known in advance all of the ways the word Texas might be used in the filing and be confident that the use of Texas was evidence of corporate activity in Texas. But by extracting the limited context and scanning it I can more confidently look for ways to better measure evidence of business activity.
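Scanning goes faster if the extracted contexts are tallied so the most common constructions rise to the top. The sketch below shows one way to do that. It assumes the context snippets have already been read into a list of strings; the function name `tally_contexts` and the sample snippets are hypothetical, and the actual layout of the extraction file is not assumed here.

```python
from collections import Counter


def tally_contexts(snippets):
    """Count identical context snippets, ignoring case and extra spaces."""
    normalized = (" ".join(s.split()).upper() for s in snippets)
    counts = Counter(s for s in normalized if s)
    # most_common() sorts by count, highest first.
    return counts.most_common()


# Invented snippets standing in for rows of an extraction file.
snippets = [
    "West Texas Intermediate",
    "west  texas intermediate",
    "with Texas Instruments",
    "Houston Texas employs",
]

for phrase, n in tally_contexts(snippets):
    print(n, phrase)
```

A tally like this only suggests which phrases deserve a closer look. Deciding which of them count as evidence of operations in the state still takes reading the results.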
I will disclose that in our exchange, they were wondering if this was the point they needed to start learning Python. I do encourage folks to start learning Python – but this is not a problem well solved by Python. We had the context around the word TEXAS from every 10-K filed from 2016-2020 in a csv file about four minutes after we started. Now it is going to take some effort to learn what should be included or excluded to make sure their measure reflects what they hope it reflects. Being able to look at these results is what is going to give them the understanding they need to move forward.