Fetch Technologies uses just two measurements to describe performance: accuracy and coverage. Both were covered in part I: accuracy describes how well a web extraction system extracts data fields from web pages, and coverage describes how well it retrieves the targeted web pages. In this post, we describe how sampling is used to gather data for accuracy and coverage measurements.
Sampling
In practice, it’s difficult to compute the exact accuracy and coverage because our answer key is normally incomplete. That is, if we want to know precisely how a system performed on a website, we would need to know, for every target page on the entire site, what the extracted values should be. Because this is usually impractical, we normally test the system on a set of sample pages. There are some important things to remember about sample pages. First, the set of sample pages should be randomly selected. That is, there should be no bias in how the pages are collected. Second, in order for our measurements to reflect the true performance of the system, we need to collect “enough” sample pages. If our sample set is too small, then we are in danger of obtaining a measurement that is too high or too low, due to bad luck. The larger the sample set, the better our chances of getting an accurate measurement.
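As a minimal sketch of unbiased selection, the snippet below draws a uniform random sample without replacement. It assumes that a list of candidate target-page URLs is already available (for example, from a crawl); the `draw_sample` helper and the `example.com` URLs are hypothetical and used only for illustration.

```python
import random


def draw_sample(page_urls, sample_size, seed=None):
    """Return `sample_size` pages chosen uniformly at random, without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return rng.sample(list(page_urls), sample_size)


# Hypothetical candidate pages for one site (an assumption for this example).
all_pages = [f"https://example.com/item/{i}" for i in range(1000)]
sample = draw_sample(all_pages, 50, seed=42)
print(len(sample))
```

Recording the seed alongside the results is one way to let the same sample be re-examined later, for instance after the extraction rules change.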
How big should our sample set be? Statisticians have been studying this problem for many years. There are formulas available to help determine an appropriate sample size, depending on how confident we want to be. For instance, suppose we want our accuracy to be at least 80% and we find that on 10 sample pages accuracy is 90%. Unfortunately, we can’t be very confident that the true accuracy is over 80%, because 10 sample pages is a very small sample. However, if we use 50 sample pages and measure accuracy at 90% (for instance), then statistical formulas tell us that we can be highly confident that the accuracy is at least 80%, if not higher. In fact, statistics allow us to define what we mean by “highly confident” in a very precise way. Information about statistical sampling can be found online (Wikipedia has good articles about sampling) or in statistics textbooks.
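One way to make the 10-page versus 50-page comparison concrete is a lower confidence bound on the true accuracy. The sketch below uses the one-sided Wilson score bound at 95% confidence. That formula is our choice for illustration, not necessarily the one behind the numbers above, and it assumes each sample page counts as a single correct-or-incorrect trial. The name `wilson_lower_bound` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_lower_bound(correct, n, confidence=0.95):
    """One-sided Wilson score lower bound on the true success rate."""
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(confidence)  # about 1.645 for 95%
    p_hat = correct / n
    centre = p_hat + z * z / (2 * n)
    margin = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return (centre - margin) / (1 + z * z / n)


small = wilson_lower_bound(9, 10)   # 90% measured on 10 pages
large = wilson_lower_bound(45, 50)  # 90% measured on 50 pages
print(f"{small:.2f} {large:.2f}")
```

Under these assumptions the bound is roughly 0.65 for 10 pages and roughly 0.81 for 50 pages, so only the larger sample supports the claim that accuracy is at least 80%.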
Performance Measurements and Statistics
Measuring both accuracy and coverage gives us a clear picture of where the system is going wrong when errors occur. For instance, if both coverage measures look good but the accuracy is low, then presumably the system is correctly identifying the target pages but poorly extracting the data from these pages. On the other hand, if %extra is high, %missing is low, and accuracy is poor, then presumably the problem is that the system is collecting too many pages due to an overly general description of the target pages.
For the statistically minded, we note that our coverage measures are directly related to the terms recall and precision used in the scientific community. Specifically, %missing is the same as 1 – Recall, and %extra is the same as 1 – Precision. Also, statisticians often refer to extra pages as false positives and to missing pages as false negatives. Similarly, the correctly retrieved pages are referred to as true positives, and the pages that the system correctly determined not to retrieve are called true negatives.

Figure 4: Relationship of Missing and Extra to False Positives and False Negatives
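The relationship between the coverage measures and recall and precision can be sketched in a few lines. The code assumes that %missing is the share of target pages that were not retrieved and that %extra is the share of retrieved pages that were not targets, which is what the identities with 1 – Recall and 1 – Precision imply. The function name and the page identifiers are hypothetical.

```python
def coverage_measures(target_pages, retrieved_pages):
    """Compute %missing and %extra (as fractions) plus recall and precision."""
    target, retrieved = set(target_pages), set(retrieved_pages)
    true_positives = target & retrieved
    missing = target - retrieved  # false negatives
    extra = retrieved - target    # false positives
    return {
        "pct_missing": len(missing) / len(target) if target else 0.0,
        "pct_extra": len(extra) / len(retrieved) if retrieved else 0.0,
        "recall": len(true_positives) / len(target) if target else 1.0,
        "precision": len(true_positives) / len(retrieved) if retrieved else 1.0,
    }


# Hypothetical sample: four target pages, five retrieved pages.
result = coverage_measures({"a", "b", "c", "d"}, {"b", "c", "d", "e", "f"})
print(result)
```

Here one of the four target pages is missing and two of the five retrieved pages are extra, so %missing is 25% (recall 75%) and %extra is 40% (precision 60%).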
The approach that we have described in this series of blog posts works well when there is a fixed number of data items being extracted from the target pages. However, there are more complex cases which make evaluation more difficult. For instance, sometimes we want to extract a list of data values from target pages. This makes it hard to measure accuracy, because the number of items in the list may vary from page to page. Fortunately, the coverage measures we introduced previously can be used very broadly.
In particular, we can create an answer key with all the data values that should result from the extraction process, and simply evaluate how many of the answer-key values are missing from the extracted results and how many extra values were extracted that are not in the answer key. This gives us a basic way to measure performance, but the downside is that we don’t compute a separate accuracy measure, and as a result it can be harder to figure out what went wrong if performance is poor. For instance, if there is a target list that includes the data value “John Smith”, and the system incorrectly extracts the result as “John”, then this will count as both a missing value (because “John Smith” is missing) and an extra value (because “John” is not a target value), which can be confusing.
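The “John Smith” case can be reproduced with a small sketch that compares extracted values against an answer key. It assumes exact string matching and treats both lists as multisets so that repeated values are counted. The `compare_values` helper and the second name in the example are hypothetical.

```python
from collections import Counter


def compare_values(answer_key, extracted):
    """Return (missing, extra) value lists, using exact string matching."""
    key_counts, got_counts = Counter(answer_key), Counter(extracted)
    # Counter subtraction keeps only positive counts.
    missing = sorted((key_counts - got_counts).elements())  # in key, not extracted
    extra = sorted((got_counts - key_counts).elements())    # extracted, not in key
    return missing, extra


# One truncated value is counted once as missing and once as extra.
missing, extra = compare_values(["John Smith", "Mary Jones"], ["John", "Mary Jones"])
print(missing, extra)
```

A single partially correct value therefore lowers both coverage measures at once, which is why this style of evaluation makes the cause of an error harder to pin down.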
The basic approach described in these posts has several virtues. First, it is relatively straightforward and practical to implement. Second, when extraction issues occur, they can be relatively easily pinpointed and debugged. Finally, the approach can be described in terms of standard statistical metrics, and if necessary can be extended to handle more complex situations.