Harvester: a Pattern Matching Language

Harvester is a pattern matching language designed to make it easy to scrape web pages, i.e. to extract data from an HTML page.

Its first implementation, written in the programming language Lua, was used in a plugin for PlayOn.

Let’s start with a simple example. Assume there is a page at URL http://example.com/videos.htm which contains the following link:

<a href='http://example.com/myvideo.wmv'>Surfing on the beach</a>

We can extract the URL and the title using the following Lua code:

local page = GetURL("http://example.com/videos.htm")
local harvester = newHarvester("<a href='{value url}'>{value title}</a>")
local data = harvester.harvest(page)
VideoResource(data.title, data.url, …etc)

What did we do? First, we created a harvester based on a pattern that describes the values we are looking for and where in the file they are located. The pattern contains the literal strings that we expect to find in the file as well as two {value} tokens that represent the values to be extracted. Then we used the harvester to “harvest” the values into a table we called data. The names of the fields in the table (url and title) match the names in the {value} tokens.

Note: The VideoResource() call adds a selectable item to PlayOn.
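
To make the idea concrete, here is a minimal sketch in plain Lua of how a single-line pattern containing {value} tokens could be matched against a page. It is not Harvester's actual implementation: the function name harvest_simple is hypothetical, and it handles only literal strings and {value} tokens.

```lua
-- Hypothetical helper, not part of Harvester: walks through the pattern,
-- finds each literal string in the text, and stores whatever lies between
-- two literals under the name given in the {value} token.
local function harvest_simple(pattern, text)
  local result, pos = {}, 1
  local pending, value_start      -- value token waiting for its closing literal
  local i = 1
  while i <= #pattern do
    local ts, te, name = pattern:find("{value%s+([%w_]+)}", i)
    local literal = pattern:sub(i, (ts or #pattern + 1) - 1)
    if literal ~= "" then
      local s, e = text:find(literal, pos, true)  -- plain find, no Lua patterns
      if not s then return nil end                -- literal missing: no match
      if pending then
        result[pending] = text:sub(value_start, s - 1)
        pending = nil
      end
      pos = e + 1
    end
    if not ts then break end
    pending, value_start = name, pos
    i = te + 1
  end
  if pending then result[pending] = text:sub(value_start) end
  return result
end

local page = "<p><a href='http://example.com/myvideo.wmv'>Surfing on the beach</a></p>"
local data = harvest_simple("<a href='{value url}'>{value title}</a>", page)
print(data.url, data.title)
```

The field names in the returned table come straight from the {value} tokens, which is the behaviour described above.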

The example above is very simple and you could achieve the same result with a couple of str_get_between() calls. But Harvester is much more powerful, so let’s look at a more realistic example. Assume that the file contains not just one link but many, and that we are interested only in the links inside a specific HTML table. We would use the following pattern to create the harvester. Note that I use the Lua long-string delimiters [[ and ]], which allow a pattern string to span several lines:

local harvester = newHarvester( [[
   <div class="content">
   {group videotable}
      <table id="videos">
         {repeat video}
            <a href='{value url}'>{value title}</a>
         {/repeat}
      </table>
   {/group}
]] )

The Lua code to extract the data from the file would look like:

local page = GetURL("http://example.com/videos.htm")
local data = harvester.harvest(page)

and the data variable would contain the following fields:

data.videotable.video[1].title='MAX Geheugentrainer'
data.videotable.video[1].url='http://example.com/vf/10459046.wmv'
data.videotable.video[2].title='Missie MAX'
data.videotable.video[2].url='http://example.com/vf/10460553.wmv'
data.videotable.video[3].title='Nederland in Beweging!'
data.videotable.video[3].url='http://example.com/vf/10459043.wmv'
data.videotable.video[4].title='NOS Jeugdjournaal met gebarentolk'
data.videotable.video[4].url='http://example.com/vf/10459049.wmv'

As you can see, the structure of data matches the structure of the pattern. If a {repeat} is nested inside a {group}, it becomes a sub-field of the group’s field.
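
Because the result is an ordinary nested Lua table, it can be walked with the usual table idioms. The sketch below builds a small table by hand in the shape shown above (only two of the four entries, to keep it short) and iterates over it. The nil check on the group follows the rule that an absent group yields nil; the "or {}" fallback for the repeat is a defensive assumption, since the text does not say what an unmatched {repeat} produces.

```lua
-- Hand-built stand-in for the table returned by harvester.harvest(page).
local data = {
  videotable = {
    video = {
      { title = "MAX Geheugentrainer", url = "http://example.com/vf/10459046.wmv" },
      { title = "Missie MAX",          url = "http://example.com/vf/10460553.wmv" },
    },
  },
}

local lines = {}
if data.videotable then                               -- an absent group is nil
  for i, video in ipairs(data.videotable.video or {}) do
    lines[#lines + 1] = string.format("%d. %s -> %s", i, video.title, video.url)
  end
end
print(table.concat(lines, "\n"))
```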

Harvester Language Overview

A Harvester pattern consists of strings and tokens.

A token is a piece of text enclosed in the token delimiter characters, which are by default { and }. For example:

{repeat videodata}

Valid tokens are {value}, {group} … {/group}, {repeat} … {/repeat}, {delim ..} and {* … *}, as well as the special tokens {param} and {{ .. }}. They are explained in detail below.

A string is any piece of text that you expect to encounter literally in the page being harvested. It cannot contain any tokens, except the special tokens.

Basic tokens

{value name}

The purpose of Harvester is to extract data from the file that is being harvested. This is done using the {value} token. The result contains a field called name that holds the extracted text.

{group name} … {/group}

A group describes a section of the harvested page. The first element within the group must be a string, which determines the start of the group. If the last element is also a string, it determines the end of the group. If the first string is not found in the harvested file, the group is deemed absent and the corresponding value in the result is nil.

Example:

{group videotable}
    <table id="videos">
       {* ... etc .. *}
    </table>
{/group}
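
The begin and end rule for groups can be sketched in plain Lua as follows. The helper find_group is hypothetical and not part of Harvester. Two behaviours are assumptions made for this sketch only: a group without an end string runs to the end of the text, and a group whose end string is missing is treated as absent.

```lua
-- Hypothetical helper: returns the section of text that a group covers,
-- or nil when the group's begin string does not occur.
local function find_group(text, begin_str, end_str)
  local s, e = text:find(begin_str, 1, true)    -- plain find, no Lua patterns
  if not s then return nil end                  -- begin string missing: group absent
  if not end_str then return text:sub(s) end    -- assumption: runs to end of text
  local _, e2 = text:find(end_str, e + 1, true)
  if not e2 then return nil end                 -- assumption: treated as absent
  return text:sub(s, e2)
end

local page = '<div><table id="videos"><tr>x</tr></table></div>'
local section = find_group(page, '<table id="videos">', '</table>')
local missing = find_group(page, '<table id="audio">', '</table>')
print(section, missing)
```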

{repeat name} … {/repeat}

A repeating group is very similar to a group, but it may occur more than once in the harvested file. The corresponding value in the result is a table with one entry per occurrence found.

Example:

{repeat video}
    <a href=" .......... </a>
{/repeat}
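
The difference between a group and a repeating group can be illustrated with a loop that keeps searching after each occurrence. This is only a sketch of the idea in plain Lua; find_all is a hypothetical helper, not a Harvester function, and it simply collects the text between each pair of begin and end strings.

```lua
-- Hypothetical helper: collects the text between every begin_str/end_str
-- pair, in the order the pairs occur in the text.
local function find_all(text, begin_str, end_str)
  local entries, pos = {}, 1
  while true do
    local s, e = text:find(begin_str, pos, true)   -- plain find, no Lua patterns
    if not s then break end
    local s2, e2 = text:find(end_str, e + 1, true)
    if not s2 then break end
    entries[#entries + 1] = text:sub(e + 1, s2 - 1)
    pos = e2 + 1                                   -- continue after this occurrence
  end
  return entries
end

local page = "<a href='a.wmv'>A</a> <a href='b.wmv'>B</a>"
local urls = find_all(page, "<a href='", "'>")
print(#urls, urls[1], urls[2])
```

As with {repeat}, the result is a table with one entry per occurrence, and an empty table when nothing is found.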

Advanced tokens

{delim xy}

The default token delimiters { and } were chosen because they are unlikely to occur in HTML. However, that’s not true for JavaScript and JSON. The {delim xy} token allows you to change the delimiters according to the (section of the) file you are harvesting. x and y represent the new left and right delimiters.

For example:

{delim <>}
title":{"$":"<value title>"
durationSecs":{"$":<value duration>}

If there is HTML further down in the file, you can reset the delimiters using

<delim {}>
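
To show why switching delimiters helps, the following plain-Lua sketch lists the tokens in a pattern while honouring delim changes. It is an illustration only, not how Harvester tokenises patterns; list_tokens and escape are hypothetical helpers. Note how the braces in the JSON fragment are not mistaken for tokens while the delimiters are < and >.

```lua
-- Escape every punctuation character so it is taken literally in a Lua pattern.
local function escape(s)
  return (s:gsub("%p", "%%%0"))
end

-- Hypothetical helper: returns the body of every token in the pattern,
-- switching delimiters whenever a "delim xy" token is seen.
local function list_tokens(pattern)
  local left, right = "{", "}"
  local tokens, pos = {}, 1
  while true do
    local s, e, body = pattern:find(escape(left) .. "(.-)" .. escape(right), pos)
    if not s then break end
    tokens[#tokens + 1] = body
    local l, r = body:match("^delim%s+(.)(.)$")
    if l then left, right = l, r end
    pos = e + 1
  end
  return tokens
end

local pattern = [[
{delim <>}
title":{"$":"<value title>"
durationSecs":{"$":<value duration>}
<delim {}>
<a href='{value url}'>
]]
local tokens = list_tokens(pattern)
print(table.concat(tokens, " | "))
```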

Comments

{* … *}

Sometimes it is useful to include comments in a pattern. Comments are ignored and serve only as documentation.

For example:

<body>
{* If there is more than one page, there is a page selector at the top of each page *}
{group pageselect}
<select>
<option value="{value pagenum}">
</select>
{/group}

Strings and special tokens

A string contains regular text (including blanks) but may also contain the “special” tokens described in this section. These special tokens are considered part of the string they are adjacent to. For example, the following pattern consists of a single string:

<a href="{param vname}.htm">{param vname}</a>

All other tokens, including comments, have the same effect as a newline, i.e. they terminate the string that precedes them.

Note that newlines in the input text cannot be matched explicitly.

{param name}

This special token allows you to insert a value into the pattern at runtime. This makes it possible to reuse a pattern, or to adapt it to input collected during harvesting. For example:

{group my}
    <id>{param scriptId}</id>
    {* etc... *}
{/group}

The parameter value must be set before the harvest() function is called:

harvester.setparam("scriptId", "MYSCRIPT");
data = harvester.harvest(page)

In this case, the begin string of the group is <id>MYSCRIPT</id>.
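
The substitution itself can be pictured as a simple text replacement performed before matching. The sketch below shows one way to do that in plain Lua; expand_params is a hypothetical helper and not Harvester's setparam() mechanism. Leaving unknown parameters untouched is an assumption of this sketch.

```lua
-- Hypothetical helper: replaces each {param name} token with the value
-- stored under that name in params. Tokens without a value are left as-is.
local function expand_params(pattern, params)
  return (pattern:gsub("{param%s+([%w_]+)}", function(name)
    return params[name]        -- returning nil keeps the original token
  end))
end

local begin_str = expand_params("<id>{param scriptId}</id>", { scriptId = "MYSCRIPT" })
print(begin_str)
```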

{{ .. }} (pattern match)

Simple strings only allow you to match fixed data. If you want to use Lua’s pattern matching facilities, enclose a Lua pattern in double token delimiters. If your pattern contains “}}”, make sure to escape it using the “%” character (i.e. replace “}}” with “%}%}”).

For example, the following string will match any hyperlink to a .gif file whose name consists only of digits:

<a href="{{%d+}}.gif">
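
One way to picture what happens with such a string is to translate it into a single Lua pattern: the fixed text is escaped so that it matches literally, and the part between the double delimiters is passed through unchanged. The sketch below does this in plain Lua. It is an assumption about how the feature could work, not Harvester's implementation, and to_lua_pattern is a hypothetical helper.

```lua
-- Hypothetical helper: escapes the literal parts of a Harvester string and
-- keeps the {{ .. }} parts as raw Lua patterns.
local function to_lua_pattern(str)
  local parts, pos = {}, 1
  while pos <= #str do
    local s, e, pat = str:find("{{(.-)}}", pos)
    local literal = str:sub(pos, (s or #str + 1) - 1)
    parts[#parts + 1] = (literal:gsub("%p", "%%%0"))  -- escape punctuation
    if not s then break end
    parts[#parts + 1] = pat                           -- raw Lua pattern
    pos = e + 1
  end
  return table.concat(parts)
end

local p = to_lua_pattern('<a href="{{%d+}}.gif">')
print(('<a href="12345.gif">'):find(p) ~= nil)   -- numeric file name
print(('<a href="abc.gif">'):find(p) ~= nil)     -- non-numeric file name
```

Escaping the literal parts matters because characters such as “.” have a special meaning in Lua patterns.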

Newlines

In patterns, newlines separate strings and tokens and are therefore significant. A newline basically acts as a wildcard (i.e. don’t care): whatever appears in the input between the string before the newline and the string after it is skipped. As an example, consider the following input file:

<li>Today's news <a href="videos/today.wmv">watch now</a>

The following pattern will not find a match, because it is a single string and the input has the text “Today's news” between <li> and <a href, not just a blank:

<li> <a href="{value url}"

but the following pattern will find the match:

<li>
<a href="{value url}"

Note that comments play the same role and the following pattern will also find the match:

<li>{* whatever *}<a href="{value url}"

An empty comment will work as well:

<li>{**}<a href="{value url}"
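
The effect of a newline or comment can be sketched in plain Lua as matching a list of strings one after another, each search starting where the previous one ended. This is an illustration of the rule only; match_in_order is a hypothetical helper, not a Harvester function.

```lua
-- Hypothetical helper: true when every string occurs in the text, in order.
-- Whatever lies between two consecutive strings is ignored.
local function match_in_order(text, strings)
  local pos = 1
  for _, str in ipairs(strings) do
    local _, e = text:find(str, pos, true)   -- plain find, no Lua patterns
    if not e then return false end
    pos = e + 1
  end
  return true
end

local input = [[<li>Today's news <a href="videos/today.wmv">watch now</a>]]

-- One string: must occur literally, so it fails on this input.
local single = match_in_order(input, { '<li> <a href="' })
-- Two strings, as produced by a newline or a comment between them.
local split = match_in_order(input, { '<li>', '<a href="' })
print(single, split)
```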