Fetch Technologies Blog

Extracting Value from Chaos

Posted on: June 29th, 2011 by Timo Kissel

IDC’s 2011 Digital Universe Study, titled “Extracting Value from Chaos,” quantifies the amount of data being generated today: “In 2011, the amount of information created and replicated will surpass 1.8 zettabytes (1.8 trillion gigabytes) – growing by a factor of 9 in just five years.” And, according to IDC, “enterprises have some liability for 80% of information in the digital universe at some point in its digital life.”

While this might seem like a daunting challenge for IT executives, it really is a great opportunity to elevate the role of IT from being an internal service provider to being a driver of business transformation and business growth. This means that the IT executive’s job is to not just provide the infrastructure and tools to gather, manage and process this data, but to partner with business unit managers to educate them on the availability of all these new sources of data and the analytics available to extract business-relevant insight.

This insight might range from a deeper operational understanding of cost drivers (by giving fine-grained metrics on supply chain or workflow costs), to a deeper understanding of pricing (for example, by providing near-real-time pricing information from the web) to better understanding consumer sentiment on the web and uncovering new business opportunities.

As the IDC report notes, “Taking a lead in big data efforts provides the CIO with an opportunity to be the most significant strategic partner for a business unit or even drive a transformation of the entire enterprise.”

Managing this explosion of available data is difficult enough for IT executives, so elevating the discussion of “big data” from the “how” to the “why” will help focus these efforts on the areas of transformative business value from the get-go – and will hopefully unleash creative insights that will drive significant growth and make your company more productive, and just maybe will transform a whole industry!


Artificial Intelligence in Background Checking

Posted on: May 18th, 2011 by Cathy Finley

Jerry Thurber, President of Tandem Select, blogs about the use of artificial intelligence in background screening:

http://backgroundcheck.tandemselect.com/bid/53806/Artificial-Intelligence-in-Background-Checking-A-Quick-Introduction

Here’s the full post:

Artificial intelligence (AI) has been a standard tool in many industries for a decade or more. Recently artificial intelligence has begun to gain a foothold in the background checking industry. AI, or machine learning, is a useful way to train machines to perform tasks humans do when the task is not strictly the same every time. In other words, the machine has to be smart enough to know how to adjust to basic changes in the process.

An example of this in background checking occurs when you use AI “agents” (or robots if you prefer) to access and retrieve criminal history data. A growing volume of criminal history data is maintained in secure websites that are made available for background checking and pre-employment screening. These sites require that a person log into the site, enter their credentials, then search for a specific name to see if there is a criminal record for that person.

Artificial intelligence suits sites like these because the exact navigation for finding these records differs from site to site and from search to search. For example, if I search for Jane Doe’s criminal record in Colorado, I may need to search through three or four pages to cover all the information I need. But if I then search for John Doe’s criminal record on that same Colorado criminal history site, I may find something that takes me off in a different direction and requires me to look at information that wasn’t relevant for Jane Doe but is relevant for John Doe’s criminal history. In other words, the search is situational. Jane Doe’s search required different navigation than John Doe’s search. When you use an artificial intelligence tool for a search like this, it can be “trained” to see these situational anomalies and it can adjust to make sure it follows the correct search path.

These trained agents have been found to be very accurate in their search execution. In one assessment, Fetch Technologies worked with one of its customers and found that the AI search was 10% more accurate than human beings conducting the same searches. Machines don’t forget to look at every page or fail to navigate through all the records; they don’t get tired or type the wrong information. Once trained, they do their job completely, every time. The other advantage of AI tools is that they don’t have to sleep. They can search records 24 hours a day, seven days a week. At Tandem Select, we have used AI tools to improve our accuracy and our turnaround time. When we can use artificial intelligence for criminal history site searches, we reduce turnaround time from several hours to, usually, several seconds. At Tandem Select, we think artificial intelligence helps our clients get better, faster, more accurate criminal history results.


Plugging into Web APIs

Posted on: May 17th, 2011 by Greg Barish

Web services continue to explode at an amazing rate. Programmable Web, which aggregates info about the “Web as a Platform”, now contains a directory of 3200+ Web APIs. There are countless other private/undocumented APIs available as well that can be leveraged. With these APIs, you can do everything from social media monitoring to general prediction and classification tasks, from crowdsourced intelligence to automated chat messaging.

Web services can help power your Internet data management strategy in multiple ways. Specifically, these APIs can:

Bootstrap the information gathering process. Obtaining site data directly from APIs is often the most efficient and most comprehensive method. Social media APIs like Twitter and Facebook are good examples; both encourage developers to leverage their APIs as a way of keeping up to date with the rapid information flow on their networks. Organizations like Sunlight Labs have helped fuel the distribution of government-related data via API. Finally, there are lots of structured data marketplaces emerging.

Enrich data you have already collected. Grow new data from existing data. For example, you can geocode addresses, estimate positive/negative sentiment, or extract named entities from unstructured text. Deriving more data from the data you already have enhances the insight about the overall data set. For more prediction-oriented tasks, you can engage multiple APIs to implement a committee-based approach to data analysis.

Provide notification or trigger further external processing. Many Web APIs allow one to write to an external system, including the ability to upload data, send e-mail or SMS messages, or post social media status updates in response to data “events”. The ability to transmit data using APIs can facilitate overall system integration; downstream systems can be triggered directly or simply poll transmission targets for new data.
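
The committee-based approach mentioned under data enrichment can be sketched as a simple majority vote. This is only an illustration: the three `service_*` functions below are hypothetical local stand-ins for calls to independent sentiment APIs, not real services, and the word lists are invented for the example.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple

Classifier = Callable[[str], str]

POSITIVE_WORDS = ("great", "love", "good")   # illustrative word lists only
NEGATIVE_WORDS = ("bad", "hate", "broken")


def service_a(text: str) -> str:
    """Stand-in for API #1: positive only if a positive word appears."""
    lowered = text.lower()
    return "positive" if any(w in lowered for w in POSITIVE_WORDS) else "negative"


def service_b(text: str) -> str:
    """Stand-in for API #2: negative only if a negative word appears."""
    lowered = text.lower()
    return "negative" if any(w in lowered for w in NEGATIVE_WORDS) else "positive"


def service_c(text: str) -> str:
    """Stand-in for API #3: compares positive and negative word counts."""
    lowered = text.lower()
    pos = sum(lowered.count(w) for w in POSITIVE_WORDS)
    neg = sum(lowered.count(w) for w in NEGATIVE_WORDS)
    return "positive" if pos > neg else "negative"


def committee_vote(text: str, members: Sequence[Classifier]) -> Tuple[str, float]:
    """Return the majority label and the share of members that agreed."""
    votes = Counter(member(text) for member in members)
    label, count = votes.most_common(1)[0]
    return label, count / len(members)


committee = [service_a, service_b, service_c]
for sample in ("I love this great product", "Nice enough"):
    print(sample, "->", committee_vote(sample, committee))
```

Using an odd number of members with two labels avoids ties; the agreement share gives a rough signal of how much to trust each derived value.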

Fundamentally, Web APIs expand the model of how data can be processed in a Web-connected system. You now have many options to choose from: (a) you can leverage functionality you have already built (e.g., your existing source code libraries, SQL functions/procedures), (b) you can build it from scratch yourself, (c) you can license 3rd party software to help you build it and integrate it into your system, and (d) now, with the emergence of these services, you can connect to it directly via network APIs (e.g., REST-style). The unique options afforded by (d) include platform normalization (HTTP/S becomes the common medium), real-time information gathering, and the potential of seamless dependency evolution (services can be frequently upgraded while maintaining a common API contract).

At Fetch, the forthcoming edition of our Fetch Live Access platform makes it easier to integrate network APIs into your Internet data management solution. We call these integration points “plug-ins”. We have developed an extensible model of building plug-ins and, by marrying it with our existing agent technology, we make it a simple matter to leverage great Web APIs as part of agent processing, whether it’s for purposes of information gathering, data enrichment, or workflow integration. Web services continue to offer new and exciting ways to empower Internet data management, while fueling the Internet as a true distributed application platform. It’s amazing how much cutting-edge functionality is so easily accessible. Now all you need to do is… plug-in!


The Data Frame

Posted on: May 11th, 2011 by Timo Kissel

Today, John Battelle posted a great blog entry about looking at the Web 2.0 landscape through the “data frame”, and I’m excited that this will be the theme for the Web 2.0 Summit 2011 later this year. I couldn’t agree with his point of view more: Web 2.0 was all about making the web a bidirectional engagement medium, and all this engagement resulted in an incredible accumulation of data from users and about users, and as he points out, we now need to classify and organize all this data to make use of it more easily.

In addition to all this consumer-centric data, let’s not forget that there are lots of additional classes of data that have moved onto the web – for example, government data, public records, prices – data that was previously locked in databases and file systems behind firewalls. I predict that being able to access and normalize these additional classes of data will be a key ingredient for additional insights by correlating it with all this not-previously-available consumer-centric data that John is discussing in his blog post. Unexpected correlations between disparate data sets lead to unexpected insights!


Measuring the Performance of a Web Extraction System: Part II

Posted on: May 3rd, 2011 by Steven Minton

Fetch Technologies uses just two measurements to describe performance: accuracy and coverage. Accuracy, which describes how well a web extraction system is performing in extracting data fields from web pages, and coverage, which describes how well a web extraction system is performing in retrieving the targeted web pages, were both covered in part I. In this post, we describe how sampling is used to gather data for accuracy and coverage measurements.

 

Sampling

In practice, it’s difficult to compute the exact accuracy and coverage because our answer key is normally incomplete. That is, if we want to know precisely what the performance of a system was on a website, we would need to know, for every target page on the entire site, what the extracted values should be. Because this is usually impractical, we normally test the system on a set of sample pages. There are some important things to remember about sample pages. First, the set of sample pages should be randomly selected. That is, there should be no bias in terms of how the pages are collected. Second, in order for our measurements to be reflective of the true performance of the system, we need to collect “enough” sample pages. If our sample set is too small, then we are in danger of obtaining a measurement that is too high or too low, due to bad luck. The larger the sample set, the better our chances are of getting an accurate measurement.

How big should our sample set be? Statisticians have been studying this problem for many years. There are formulas available to help determine an appropriate sample size depending on how confident we want to be. For instance, suppose we want our accuracy to be at least 80% and we find that on 10 sample pages accuracy is 90%. Unfortunately, we can’t really be very confident that the true accuracy is really over 80%, because 10 sample pages is a very small sample. However, if we use 50 sample pages and measure accuracy at 90% (for instance), then statistical formulas tell us that we can be highly confident that the accuracy is at least 80%, if not higher. In fact, statistics allow us to define what we mean by “highly confident” in a very precise way. Information about statistical sampling can be found on-line – Wikipedia has good articles about sampling – or from statistics textbooks.
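
One way to make the 10-page versus 50-page comparison concrete is an exact one-sided binomial test. The sketch below is an assumption about which formula applies, not necessarily the one used at Fetch: it treats each sample page as an independent pass/fail trial and asks how likely a result of 90% or better would be if the true accuracy were only 80%.

```python
from math import comb


def prob_at_least(successes: int, n: int, p: float) -> float:
    """P(X >= successes) for X ~ Binomial(n, p)."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(successes, n + 1)
    )


# If the true accuracy were only 80%, how often would we still observe 90%?
small_sample = prob_at_least(9, 10, 0.80)    # 9 of 10 pages correct
large_sample = prob_at_least(45, 50, 0.80)   # 45 of 50 pages correct

print(f"10 pages: {small_sample:.3f}")   # roughly 0.38
print(f"50 pages: {large_sample:.3f}")   # roughly 0.05
```

With 10 pages, a 90% result would turn up more than a third of the time even at 80% true accuracy, so it tells us little. With 50 pages the same result would occur only about one time in twenty, which is the usual threshold for 95% confidence.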

 

Performance Measurements and Statistics

Measuring both accuracy and coverage enables us to gain a clear understanding of any problems the system may be experiencing when errors are encountered. For instance, if both coverage measures look good but the accuracy is low then presumably the system is correctly identifying the target pages, but poorly extracting the data from these pages. On the other hand, if the %extra is high, %missing is low, and accuracy is poor, then presumably the problem is that the system is collecting too many pages due to an overly-general description of the target pages.

For the statistically minded, we note that our coverage measures are directly related to the terms recall and precision used in the scientific community. Specifically, %missing is the same as 1 – Recall, and %extra is the same as 1 – Precision. Also, statisticians often refer to extra pages as false positives and to missing pages as false negatives. Similarly, the correctly retrieved pages are referred to as true positives, and the pages that the system correctly determined not to retrieve are called true negatives.

Figure 4: Relationship of Missing and Extra to False Positives and False Negatives
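
The two identities follow directly from the Part I definitions once missing pages are counted as false negatives (FN), extra pages as false positives (FP), and correctly retrieved pages as true positives (TP). The abbreviations are introduced here only for the derivation; the numeric check reuses the Part I example of 48 correct, 3 extra, and 2 missing pages.

```latex
\begin{align*}
\%\text{missing} &= \frac{FN}{TP + FN} = 1 - \frac{TP}{TP + FN} = 1 - \text{Recall} \\
\%\text{extra}   &= \frac{FP}{TP + FP} = 1 - \frac{TP}{TP + FP} = 1 - \text{Precision} \\[4pt]
\text{Example: } & \text{Recall} = \tfrac{48}{50} = 96\%, \quad \%\text{missing} = \tfrac{2}{50} = 4\% \\
                 & \text{Precision} = \tfrac{48}{51} \approx 94\%, \quad \%\text{extra} = \tfrac{3}{51} \approx 6\%
\end{align*}
```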

The approach that we have described in this series of blog posts works well when there are a fixed number of data items being extracted from the target pages. However, there are more complex cases which make evaluation more difficult. For instance, sometimes we want to extract a list of data values from target pages. This makes it hard to measure accuracy, because the number of items in the list may vary from page to page. Fortunately, the coverage measures we introduced previously can be used very broadly.
In particular, we can create an answer key with all the data values that should result from the extraction process, and simply count how many answer-key values are missing from the extracted results and how many extra values appear in the results but not on the answer key. This gives us a basic way to measure performance, but the downside is that we don’t compute a separate accuracy measure, and as a result it can be harder to figure out what went wrong if performance is poor. For instance, if there is a target list that includes the data value “John Smith”, and the system incorrectly extracts the result as “John”, then this will count as both a missing value (because “John Smith” is missing) and an extra value (because “John” is not a target value), which can be confusing.
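
The “John Smith” case can be reproduced with a short sketch. The function name and the second list entry (“Mary Jones”) are made up for illustration; the comparison uses multisets so that repeated values in a list are counted correctly.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def missing_and_extra(
    answer_key: Sequence[str], extracted: Sequence[str]
) -> Tuple[List[str], List[str]]:
    """Compare extracted list values against the answer key.

    Returns (missing, extra): answer-key values that were not extracted,
    and extracted values that are not on the answer key.
    """
    key_counts = Counter(answer_key)
    got_counts = Counter(extracted)
    missing = list((key_counts - got_counts).elements())
    extra = list((got_counts - key_counts).elements())
    return missing, extra


answer_key = ["John Smith", "Mary Jones"]   # hypothetical target list
extracted = ["John", "Mary Jones"]          # truncated first value

missing, extra = missing_and_extra(answer_key, extracted)
print("missing:", missing)   # the full name was never extracted
print("extra:", extra)       # the truncated name is not a target value
```

A single truncation error shows up twice, once in each list, which is exactly the confusion described above.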

The basic approach described in these posts has several virtues. First, it is relatively straightforward and practical to implement. Second, when extraction issues occur, they can be relatively easily pinpointed and debugged. Finally, the approach can be described in terms of standard statistical metrics, and if necessary can be extended to handle more complex situations.


Measuring the Performance of a Web Extraction System: Part I

Posted on: April 25th, 2011 by Steven Minton

Extracting data from websites is a complex process, and evaluating how well such systems perform can be difficult. In this post, we describe the approach Fetch Technologies uses in measuring the performance of an extraction system. Fetch Technologies uses just two measurements to describe performance: accuracy and coverage. Accuracy describes how well a web extraction system performs in extracting data fields from web pages. Coverage describes how well a web extraction system performs in retrieving the targeted web pages. This approach was developed by our scientists after years of experience with both commercial tests and scientific evaluations reported in leading academic journals.

 


Accuracy

First, let’s consider the basics. Let’s assume that we want to extract several data fields from a set of web pages. A data field is a piece of information that may appear on a page, for instance a price on a page from a retail site, an article title on a page from a news site, or a phone number from an individual’s home page. The data in a field may be a string, a number, a URL, or any other type of data. A data field is the smallest unit of extracted data. Note that an address might be extracted as four separate data fields, such as street address, city, state and zip, or it might be extracted as a single data field containing the entire string. Figure 1 shows a page from a retail store, TractorSupply.com, showing three target fields: “Product Title”, “Price” and “Product Description”.

 

Figure 1: Sample Webpage Showing Data Fields

For a fixed set of web pages, it is useful to create an answer key that specifies the target values for the data fields. That is, for every page the answer key indicates the correct value for each data field that should be extracted.

Once we have an answer key, we can characterize the performance of an extractor in terms of its accuracy over a set of pages. Accuracy is simply the number of correct values extracted out of the total number of target values.

 

$$\text{Accuracy} = \frac{\text{number of correct values extracted}}{\text{total number of target values}}$$

For instance, suppose there are 200 pages and 5 data fields, and the extractor gets 990 of the 1000 values correct, and the remaining 10 values wrong. Then the accuracy would be 99%.

$$\text{Accuracy} = \frac{990}{1000} = 99\%$$
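
The accuracy definition above can be computed mechanically once an answer key exists. The sketch below assumes a simple layout that is not taken from the original post: both the answer key and the extraction results are dictionaries mapping a page identifier to a dictionary of field values. The page and field names are placeholders.

```python
from typing import Dict

PageFields = Dict[str, Dict[str, str]]


def accuracy(answer_key: PageFields, extracted: PageFields) -> float:
    """Fraction of target values that the extractor got exactly right."""
    total = 0
    correct = 0
    for page, fields in answer_key.items():
        got = extracted.get(page, {})
        for field, target in fields.items():
            total += 1
            if got.get(field) == target:
                correct += 1
    return correct / total


# 200 pages x 5 fields = 1000 target values, as in the example above.
answer_key = {
    f"page-{i}": {f"field-{j}": f"value-{i}-{j}" for j in range(5)}
    for i in range(200)
}

# Copy the key, then corrupt one field on each of the first 10 pages.
extracted = {page: dict(fields) for page, fields in answer_key.items()}
for i in range(10):
    extracted[f"page-{i}"]["field-0"] = "wrong"

print(f"accuracy = {accuracy(answer_key, extracted):.0%}")   # 990 of 1000 correct
```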


Coverage

Accuracy is a valuable metric for describing extraction performance, but it is not the whole story. It is most appropriate for describing how well an extractor works on a given type of page, where each such page is reasonably expected to contain a target value for each field. Often, however, we have to navigate through a site to identify the target pages we want to extract target values from. For example, on a retail site, our target pages might be “product detail pages” each of which contains a target value for the fields “product name”, “price” and “product description”. On some sites the target pages might only be a small minority of the site’s pages. For instance, on TractorSupply.com, we might only be interested in extracting data from pages in the “Lawn and Garden” category.

Coverage can be measured by counting the number of missing pages and the number of extra pages retrieved by the system, as shown in Figure 2.

 

Figure 2: Conceptual Illustration of Coverage

A system with perfect coverage would identify all the target pages, and only the target pages, as shown in Figure 3.

 

Figure 3: Conceptual Illustration of Perfect Coverage

These can be expressed as percentages.

 

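$$\%\text{missing} = \frac{\text{number of missing pages}}{\text{number of target pages}} \qquad \%\text{extra} = \frac{\text{number of extra pages}}{\text{number of pages identified}}$$

For instance, if there are 50 target pages, and the system identifies 51 pages, where 48 are correct and 3 are extra, then we say that 4% are missing (2 missing pages out of 50 target pages) and 6% are extra (3 extra pages out of the 51 pages identified).

$$\%\text{missing} = \frac{2}{50} = 4\% \qquad \%\text{extra} = \frac{3}{51} \approx 6\%$$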
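
Given the set of target pages and the set of pages the system actually retrieved, both coverage measures reduce to set differences. The sketch below reproduces the 50-page example; the URL patterns are placeholders invented for illustration.

```python
from typing import Set, Tuple


def coverage(target_pages: Set[str], retrieved_pages: Set[str]) -> Tuple[float, float]:
    """Return (%missing, %extra) as fractions between 0 and 1."""
    missing = target_pages - retrieved_pages   # targets the system never found
    extra = retrieved_pages - target_pages     # retrieved pages that are not targets
    pct_missing = len(missing) / len(target_pages)
    pct_extra = len(extra) / len(retrieved_pages)
    return pct_missing, pct_extra


# 50 target pages; the system finds 48 of them plus 3 non-target pages.
target_pages = {f"/product/{i}" for i in range(50)}
retrieved_pages = {f"/product/{i}" for i in range(48)} | {
    "/about", "/contact", "/careers"
}

pct_missing, pct_extra = coverage(target_pages, retrieved_pages)
print(f"missing: {pct_missing:.0%}, extra: {pct_extra:.0%}")
```

Note that the two percentages use different denominators: missing pages are measured against the targets, extra pages against what was retrieved.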

In the next post, we’ll describe how sampling is used to gather data for accuracy and coverage measurements.


Five Nines

Posted on: April 22nd, 2011 by Timo Kissel

Yesterday’s outage of Amazon’s Web Services reminded all of us who are in the business of providing cloud-based services just how fragile this still-emerging technology can be. When we at Fetch decided to move our applications into the cloud, public cloud technology was still too immature to be a viable option for us, so we built our own private cloud with the features we needed to support our business. I am proud to say that our private cloud has delivered five nines (99.999%) high availability for us to host our cloud-based applications – and while we are expanding our use of cloud technology to use public clouds when appropriate to the business need, I can sleep better at night knowing that our own private cloud has delivered on the uptime we need to serve our customers well!
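
For a sense of scale, an availability percentage translates into a downtime budget with one line of arithmetic. The sketch below assumes a 365-day year and counts all downtime equally; the function name is just for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability: float) -> float:
    """Minutes of downtime per year permitted at a given availability."""
    return (1 - availability) * MINUTES_PER_YEAR


five_nines = allowed_downtime_minutes(0.99999)
three_nines = allowed_downtime_minutes(0.999)

print(f"99.999%: {five_nines:.2f} minutes per year")    # about 5.26 minutes
print(f"99.9%:   {three_nines / 60:.2f} hours per year")  # about 8.76 hours
```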


Bringing The Web to Big Data

Posted on: April 13th, 2011 by Timo Kissel

“Big Data” is a much-buzzed-about topic of conversation these days, and many of these conversations have been around the mechanics of how to deal with large amounts of data. Lots of progress has been made in terms of storage and processing (esp. around highly parallel approaches like map-reduce), and there is a plethora of hardware and software, both open source and proprietary, to get the infrastructure in place for Big Data projects. This Big Data infrastructure can easily be used to enable traditional data processing jobs to run both faster and cheaper – and to enable these traditional data processing jobs to run on much larger data sets. Looking at the roster of speakers and topics of a few recent conferences around Big Data, it is apparent that the issue of Big Data infrastructure is still an area that has lots of room for innovation and progress for years to come.

But what’s more exciting to me is the use of this Big Data infrastructure to glean novel insights by using new approaches, algorithms, and analytics that simply weren’t feasible before. For example, there is interesting work being done around the visualization and exploration of Big Data sets, which is critical in an era in which “the data will tell the story.” This story will be told via computer-assisted exploration of Big Data sets by humans that use the pattern recognition capabilities that our human brains have evolved – our brains are very good at finding emerging patterns in data that would otherwise go unnoticed. This is another instance of using computers to do what they’re good at (tireless processing of large amounts of information) and using humans to do what we’re good at – pattern recognition, creativity and insight – albeit now at a scale that would be impossible for us to execute without these novel tools.

Once a Big Data infrastructure is in place, the next question becomes what data to feed into this infrastructure. For many problems, the answer to this question is obvious (albeit perhaps incomplete). For example, a large retailer could have a ready supply of Big Data by gathering transaction metrics across its network of locations, and it would be able to gain near-real-time insight into the performance of its operations. Things become much more interesting, though, when we open the aperture of our Big Data lens and look across the largest source of information that there is – the Web. Imagine being able to get near-real-time insight into the global pricing trends of something that you sell – or of something that competes with something you sell. Who is lowering their prices? When? What retailers are running out of inventory? Who is running a sales campaign, and slashing prices in your segment? By how much? Getting near-real-time answers to these questions will not only allow companies to make pricing decisions more intelligently, it will also make the global market a more efficient one, benefitting consumers in the process.

We at Fetch specialize in getting data from the web quickly, efficiently, and at scale, and we deliver it to our customers in a way that is directly usable by them for further processing (for example, feeding into their Big Data systems). I am excited that we are part of this growing Big Data ecosystem, and I am looking forward to collaborating with our customers and partners to do our part to make sense of this explosion of data around us!



Timo Kissel, April 11, 2011.


Spatial Match helping Realtors compete with Zillow and Trulia

Posted on: April 7th, 2011 by Cathy Finley

Interesting new play in the real estate space – powered by Fetch:

Spatial Match gives Realtors® competitive edge
Wednesday, April 6, 2011
By Kim Shindle

A new program for websites gives Realtors® a competitive edge over mega-sites like Zillow and Trulia.

“We recognize that it’s been a challenge for independent Realtors® without mega budgets to compete for consumers’ time when larger companies like Zillow and Trulia have such engaging websites,” said Grant Gould, vice-president of SpatialMatch®.com.

That’s why founders Gould and CEO John Perkins set out to create an information-rich mini-portal to be embedded into Realtors®’ websites to give consumers all the information they have been navigating off-site to get. The result for the Realtor® website is more interaction, more time-on-site, and ultimately more loyalty and deal flow from consumers.

“Most consumers start their search for a home online. When they go from a site like Zillow or Trulia to most Realtors®’ sites, the latter sites lack the deeper localized information that the consumers want prior to making their purchase decisions,” Gould said. “SpatialMatch® was designed to be embedded into the Realtor’s® site and gives consumers a way to research all aspects of areas and neighborhoods including a lifestyle search that helps them create a file of what’s important in their day-to-day life.”

For example, if consumers want to live in a specific school district, they can select and search for homes using that as a choice. If they want to live within 10 miles of their office, they can enter that address and any others as custom addresses and include them in the search. Or if they want to stop at Starbucks every morning and want it within five miles of their home, they can choose that as well. “We have 100 different databases feeding information into the matches and we continue to integrate new data monthly,” Gould explained.

See the rest of the article


The Private Cloud: Perfect & Practical (via Enterprise Efficiency)

Posted on: March 24th, 2011 by Rick Parker

If you could build the perfect network — and the perfect IT organization — what would it look like, and how would it work? That may sound like an impossible task, but it’s not. It’s the true potential of the private cloud, and I’m amazed that more IT leaders aren’t taking advantage of it.

The perfect network would cost a fraction of what it costs companies to run their networks today. It wouldn’t just offer “five nines” of reliability — it would offer 100 percent reliability. It would be simple to manage, and it would scale quickly to any size.

What if you could do this now? The answer is you can. It’s not the IT of the future; it is today’s IT, if you want it. At Fetch Technologies, we have used Cloud IT to save more than $500,000 in purchase costs and $35,000 in monthly recurring costs.

Read the rest of Rick’s private cloud manifesto at Enterprise Efficiency
