Extract Reference

PureLoad Logo
PureTest 5.2
January 2015
http://www.pureload.com
support@pureload.com

Documentation Index -- Task Reference

Introduction

The basic extract tasks supports various techniques to extract data from a response (generated by any task) and assign the extracted data to a variable. This document describes more about the various techniques used.

Using Regular Expressions

The tasks ExtractTask and HttpExtractTask  optionally supports using regular expressions to extract data.  Using the tasks you specify a start and an end pattern and all data  between these patterns are extracted.

In this section we will give references to more information about the regular expression language used ans well as a few examples.

Regular Expressions

Regular expressions are a way to describe a set of strings based on common characteristics shared by each string in the set. The version of Regular Expressions used by the extract tasks is the standard as supported by Java, which is most similar to that found in Perl.

For details on the regular-expression constructs, see the Java Pattern class documentation. In addition, the Java Tutorial on Regular Expressions might be useful.

Example

Let's say that we have an HTML content as follows:

<html>
<body>
<!-- ... -->
<a title="OK" href="/Controller?action=view&cat=4">Have
    Fun</a>
<a title="OK" href="/Controller?action=view&cat=5">Get It
    All</a>
<!-- ... -->
</body>
</html>

Now we want to extract the "href" that corresponds to the "Get It All" string. To handle this we specify the following parameters to the ExtractTask:

extract task

The somewhat tricky part here is the Start regular expression:

(?s)
Enable dotall mode. In dotall mode, the expression '.' matches any character, including a line terminator. By default '.' does not match line terminators.
Note: not really required in this example, but included here since it's very common that '.' is used even over line boundaries.
(href=")
This is the pattern we look for. I.e 'href="' is the start of the data we want to extract
(?=\S*Get\s*It\s*All)
The syntax (?=X) is a non-capturing lookahead construct, used to specify what is followed after the href=" pattern. In this case we want to find any non-whitespace character (\S), followed by Get It All, separated by whiite space characters (\s*);

The End pattern is often easy as in this example, simply a double quote (").

RegexTask

There is also a specialized RegexTask which stores capturing groups in variables. This allows for more straightforward regular expressions. The above example can be expressed as:
RegexTask
It is also possible to use more than one variable by entering a comma separated list of variable names. The number of variables must match the number of capturing groups. For example, to store both href values in the above HTML example as variable href1 and href2:
RegexTask

Using regular expression is powerful, but also sometimes quite complicated. An option is in many cases using XPath and HtmlXpathExtractTask (see the example below).

Using XPath

The tasks XmlXPathExtractTask and HtmlXpathExtractTask uses XML Path Language (XPath) expressions to define what data to be extracted. The difference between the two tasks is that XmlXPathExtractTask can only be used with well-formed XML and (optionally) namespaces. HtmlXpathExtractTask on the other hand uses a "relaxed" XML parser to allow extract using XPath even if XML/XHTML/HTML content isn't well formed. I.e. a typical HTML/XHTML page.

In this section we will give references to more information about XPath as well as a few examples.

XPath Expressions

XPath uses path expressions to select nodes or node-sets in an XML document. For more information about XPath and XPath expressions we recommend any XPath tutorials, for example: http://www.w3schools.com/XPath/.

XPath Example

For a simple example, let say that we have a response as follows:

<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
    <book category="COOKING">
        <title lang="en">Everyday Italian</title>
        <author>Giada De Laurentiis</author>
        <year>2005</year>
        <price>30.00</price>
    </book>
    <book category="CHILDREN">
        <title lang="en">Harry Potter</title>
        <author>J K. Rowling</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="WEB">
        <title lang="en">XQuery Kick Start</title>
        <author>James McGovern</author>
        <author>Per Bothner</author>
        <author>Kurt Cagle</author>
        <author>James Linn</author>
        <author>Vaidyanathan Nagarajan</author>
        <year>2003</year>
        <price>49.99</price>
    </book>
    <book category="WEB">
        <title lang="en">Learning XML</title>
        <author>Erik T. Ray</author>
        <year>2003</year>
        <price>39.95</price>
    </book>
</bookstore>

From this response we want to extract data. A few examples:

We want to extract XPath Expression Result
title of the first book
/bookstore/book[1]/title Everyday Italian
price of the last book /bookstore/book[last()]/price
39.95
title of 1'st book with price greater than 35
/bookstore/book[price>35][1]/title
XQuery Kick Start
1'st author of the 1'st book in the WEB category
/bookstore/book[@category='WEB'][1]/author[1]
James McGovern

XPath Example, using Namespaces

When XPath is used to extract data from an Web Service response (such as SOAP) it's common that XML Namespaces are used. When namespaces are used it's important that you understand the basics regarding namespaces.

The following is a simple SOAP XML response, using namespaces:

<?xml version="1.0"?>
<SOAP-ENV:Envelope
    xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
    <SOAP-ENV:Body>
        <m:ThumbnailResponse
            xmlns:m="http://ast.amazonaws.com/doc/2006-05-15/">
            <aws:Response
                xmlns:aws="http://ast.amazonaws.com/doc/2006-05-15/">
                <aws:OperationRequest>
                    <aws:RequestId>3f8ceabd-2d15-47f0-b35e-d52ee868a4a6</aws:RequestId>
                </aws:OperationRequest>
                <aws:ThumbnailResult>
                    <aws:Thumbnail
                        Exists="true">http://location_to_thumbnail_for_www.alexa.com</aws:Thumbnail>
                    <aws:RequestUrl>www.alexa.com</aws:RequestUrl>
                </aws:ThumbnailResult>
                <m:ResponseStatus>
                    <m:StatusCode>Success</m:StatusCode>
                </m:ResponseStatus>
            </aws:Response>
            <aws:Response
                xmlns:aws="http://ast.amazonaws.com/doc/2006-05-15/">
                <aws:OperationRequest>
                    <aws:RequestId>3f8ceabd-2d15-47f0-b35e-d52ee868a4a6</aws:RequestId>
                </aws:OperationRequest>
                <aws:ThumbnailResult>
                    <aws:Thumbnail
                        Exists="true">http://location_to_thumbnail_for_www.amazon.com</aws:Thumbnail>
                    <aws:RequestUrl>www.amazon.com</aws:RequestUrl>
                </aws:ThumbnailResult>
                <m:ResponseStatus>
                    <m:StatusCode>Success</m:StatusCode>
                </m:ResponseStatus>
            </aws:Response>
        </m:ThumbnailResponse>
    </SOAP-ENV:Body>
</SOAP-ENV:Envelope>

From this response we want to extract data. To do this we first have to define namespace prefix mapping (a parameter to the XmlXPathExtractTask). Let's use the following mapping:

SOAP-ENV http://schemas.xmlsoap.org/soap/envelope/
m
http://ast.amazonaws.com/doc/2006-05-15/
aws
http://ast.amazonaws.com/doc/2006-05-15/

Now we can extract data as in the following examples:

extract 1'st thumbnail url
XPath Expression: //m:ThumbnailResponse/aws:Response[1]/aws:ThumbnailResult/aws:Thumbnail
Result: http://location_to_thumbnail_for_www.alexa.com
extract thumbnail url for www.amazon.com
XPath Expression: //aws:Response/aws:ThumbnailResult[aws:RequestUrl='www.amazon.com']/aws:Thumbnail
Result: http://location_to_thumbnail_for_www.amazon.com'

HtmlXpathExtractTask and Relaxed HTML/XML Parser

The HtmlXpathExtractTask is based on XPath, but uses a relaxed XML parser, meaning that it is suitable for HTML/XHTML content.

As an example, let's say that we have the same content as in previous example, where we uses regular expressions:

<html>
<body>
<!-- ... -->
<a title="OK" href="/Controller?action=view&cat=4">Have
    Fun</a>
<a title="OK" href="/Controller?action=view&cat=5">Get It
    All</a>
<!-- ... -->
</body>
</html>

Now we want to extract the "href" that correspons to the "Have Fun" string. To handle this we specify the following parameters to the HtmlXpathExtractTask:

extract task

The only thing that complicates the XPath expression is that we need to handle that the "Have Fun" string we are looking for contains whitespace (including a line break). The solution is to use the XPath function normalize-space().



Copyright © 2015 PureLoad Software Group AB. All rights reserved.