I was helping someone build a complex search today. The goal was to identify all 10-K filings that mention Last-in, First-out (LIFO) inventory and confirm whether the auditor for each company that mentions LIFO belonged to a particular set of auditors. Their auditor list was not too large (27 names). Because we were not initially sure exactly what search to construct, the search string started getting messy. We were not seeing some results that we expected, and I wanted to clean it up so that our customers could better understand some of the choices we were making.
Unfortunately, our search box only displays the most recent 100 characters (including spaces), but our search parser will parse search strings with as many as 16,000 characters. While the amount of the input that is visible is limited, our application does allow users to paste in a search string that they build outside the application.
So I need to take a break and throw in a plug for a really useful (and free) utility called Notepad++. It has an amazing feature set. If you are writing code (Python, Perl and many others), Notepad++ offers syntax highlighting. It is a lot easier on memory than the included Notepad, in the sense that it opens large files much more easily. It has an amazing search facility. One of the reasons I use it is that it helps me balance parentheses when I am writing a complex search.
When I start getting confused with the search I am trying to construct, I stop trying to build it in our search box and switch to Notepad++.
As I mentioned above, I started building the search inside the Search, Extraction & Normalization Engine. Here is a screenshot of part of the search; we can only see about 100 characters at a time.
I was not seeing results from some auditing firms that I expected, so I decided the best thing to do was to build the search in Notepad++. If you place your mouse in the Search Phrase box and right-click, the context menu offers a Select All choice. I used that to select and then copy my existing search phrase from our application to Notepad++ (NPP).
I pasted it squished down so you can see that there are 732 characters in my search. Just pasting it into NPP isn’t enough; I want to be able to read the search. So I started adding line breaks after operators or terms. Here is an image of the search reorganized in NPP:
Organizing the search this way sure makes it a lot easier to read. I have two clumsily drawn circles in the image up there. The smaller one at the top illustrates the parentheses matching. If your cursor is next to a parenthesis, it will change color (you can set the color you want, the default is red) and the matching parenthesis will also light up. That is a really useful feature that I use a lot when building nested searches. You can easily tell if you have unbalanced parentheses by placing the cursor next to one and then scrolling up or down to look for another with the same highlighting. If you don’t see one, then you have an unbalanced parenthesis.
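If you would rather check a long search string with a script than by eye, the same idea can be sketched in a few lines of Python. This is only an illustration of parenthesis matching, not part of our application or of NPP; the function name and the sample string are made up for the example, and the sample is not meant to show our actual search syntax.

```python
def find_unbalanced_parens(search):
    """Return a sorted list of (position, character) for every
    parenthesis in the search string that has no partner."""
    open_positions = []  # positions of '(' still waiting for a ')'
    unmatched = []       # ')' characters that had nothing to close
    for pos, ch in enumerate(search):
        if ch == "(":
            open_positions.append(pos)
        elif ch == ")":
            if open_positions:
                open_positions.pop()  # closes the most recent '('
            else:
                unmatched.append((pos, ch))
    # Any '(' left on the stack was never closed.
    unmatched.extend((pos, "(") for pos in open_positions)
    return sorted(unmatched)


# Hypothetical sample string with one '(' that is never closed.
sample = "(lifo (last in) (fifo)"
for pos, ch in find_unbalanced_parens(sample):
    print(f"Unmatched {ch!r} at character {pos}")
```

An empty result means every parenthesis has a partner. It does not tell you the grouping is the one you intended, so the visual check in NPP is still worth doing.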
The second, longer red ‘circle’ illustrates what I think is always a best practice: indent each new line of search content one space. The reason for this is that if you hit return and then start typing in the first column of the new line, the parser will run together the last word (or operator) of the previous line with the first word (or operator) of the current line.
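To see why the leading space matters, here is a tiny Python sketch. It assumes the line breaks are simply dropped with nothing put in their place, which matches the behavior described above but is my simplification, not the parser's actual code; the function name is hypothetical.

```python
def drop_line_breaks(lines):
    # Assumption for illustration: each line break is removed
    # and nothing (not even a space) is inserted in its place.
    return "".join(lines)


# Second line typed starting in the first column: the words collide.
print(drop_line_breaks(["last in", "first out"]))

# Second line indented one space: the words stay separate.
print(drop_line_breaks(["last in", " first out"]))
```

The first call gives "last infirst out" and the second gives "last in first out", which is the whole reason for the one-space indent.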
Once the search was in NPP it was easy to analyze. The problem we were having turned out to be an unbalanced parenthesis and typos (read misspellings). Once those were corrected we just selected our search (as it was displayed in NPP – we did not have to get rid of the line breaks) and then pasted it back into the search box.
Notice that the line breaks are preserved. The parser will take care of the line breaks when we submit the search.
For those of you new to our software, the andany operator is not used in the process of selecting documents. Rather, it is used to count instances of other words or phrases in the documents selected with the other operators. While our focus in this search was on LIFO, we also want to inspect references to FIFO to make sure we properly identify LIFO adopters (that explains the first andany operator). The second andany operator, in front of the list of auditors, is there to identify the auditor(s) associated with the companies that do mention LIFO in their 10-K.
The search results displayed below show that 2,439 documents were found. These documents were selected because:
- They had either the phrase “last in” (hyphenated or not)
- or they had the term LIFO
- and they were 10-K documents (not exhibits)
- and I set a date filter to select only filings made between 1/1/2012 and 5/30/2016
Because of the andany operators, we could check each company's auditor and see the context for FIFO whenever either was mentioned in the document.