Extraction Engine finally released!!

Extraction Engine

It is done and wow!! This is the first general-purpose extraction tool for HTML files. It was designed to help you with two tasks. First, it provides a systematic and incredibly efficient way to extract raw data from HTML filings stored on the directEDGAR drives. Second, it helps you easily conform the column headings. The result of the process is a CSV file with identifiers for the company and the documents that the data were extracted from, under the column headings you choose.

The Extraction Engine should help you collect data from any table that you can describe (using keywords and Boolean operators) in a set of filings you specify using the ISYS-Runtime search tool. I have successfully extracted and conformed more than 20,000 lines of data in under four hours.
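To make the two tasks concrete, here is a minimal Python sketch of the general idea: filter tables by keywords, map raw column headings to the headings you choose, and write a CSV tagged with company and document identifiers. This is not the Extraction Engine's actual code or interface. Every name in it (`HEADING_MAP`, `table_matches`, `conform_rows`, `to_csv`, and the sample identifiers) is a hypothetical chosen for illustration.

```python
import csv
import io

# Hypothetical mapping from raw headings, as they might appear in filings,
# to the conformed headings a researcher chooses. Illustrative only.
HEADING_MAP = {
    "fees earned or paid in cash": "cash_fees",
    "fees earned": "cash_fees",
    "stock awards": "stock_awards",
    "total": "total",
}


def normalize(heading):
    """Lowercase and collapse whitespace so near-identical headings match."""
    return " ".join(heading.lower().split())


def table_matches(text, all_of=(), none_of=()):
    """Crude Boolean filter: every all_of term present, no none_of term present."""
    lowered = text.lower()
    has_all = all(term in lowered for term in all_of)
    has_none = not any(term in lowered for term in none_of)
    return has_all and has_none


def conform_rows(company_id, document_id, headings, rows, heading_map=HEADING_MAP):
    """Yield dicts keyed by conformed headings, tagged with source identifiers."""
    conformed = [heading_map.get(normalize(h)) for h in headings]
    for row in rows:
        record = {"company_id": company_id, "document_id": document_id}
        for name, value in zip(conformed, row):
            if name is not None:  # drop columns that have no chosen heading
                record[name] = value
        yield record


def to_csv(records, fieldnames):
    """Write records to CSV text, leaving missing columns blank."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

For example, a table whose raw headings are "Fees Earned or Paid in Cash", "Stock Awards", and "Notes" would come out with the columns `cash_fees` and `stock_awards`, and the unmapped "Notes" column would be dropped.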

Imagine that one day you pull the director compensation and beneficial ownership tables from three years of proxy statements for 1,500 companies. The next day, imagine you want the details reported in the deferred tax footnote for a different set of 3,000 registrants. All of these tasks can be completed in hours instead of weeks or months.

I hope to finish adding the ability to extract and manage specific blocks of text next.


Welcome to my blog about all things to do with using SEC filings efficiently and effectively to collect and manage data for academic and other research purposes. I hope to focus on three main areas:

  1. keeping you informed about planned upgrades and changes to directEDGAR
  2. posting information that will help you get the most out of your subscription
  3. keeping you updated on regulatory changes that could affect a data collection strategy

I hope you will visit my blog often enough to keep abreast of changes, and that you will share your experiences and suggestions as the content develops.