
Surgeon

DOM extraction expression evaluator.

Configuration

Name | Description | Default value
evaluator | HTML parser and selector engine. Possible values: cheerio, browser. Use cheerio if you are running Surgeon in Node.js. Use browser if you are running Surgeon in a browser or headless browser (e.g. PhantomJS). | cheerio

Cookbook

Unless redefined, all examples assume the following initialisation:

import surgeon from 'surgeon';

/**
 * @param configuration {@see https://github.com/gajus/surgeon#configuration}
 */
const x = surgeon();

Note:

For simplicity, the strict equality operator (===) is used in the examples to denote deep equality.

Extract single node

The default behaviour of a query is to match a single node and extract the value of its textContent property.

const document = `
  <div class="title">foo</div>
`;

x('.title')(document) === 'foo';
x('.title {1}[0]')(document) === 'foo';
x('.title {0,1}[0]')(document) === 'foo';
x('.title {1,1}[0]')(document) === 'foo';

Extract multiple nodes

To extract multiple nodes, you need to specify a quantifier expression.

const document = `
  <div class="title">foo</div>
  <div class="title">bar</div>
  <div class="title">baz</div>
`;

const result = x('.title {0,}')(document);

result === [
  'foo',
  'bar',
  'baz'
];

Nested expression

Surgeon queries can be nested. The result of the parent query becomes the root element of the descendant query.

const document = `
  <article>
    <div class='title'>foo title</div>
    <div class='body'>foo body</div>
  </article>
  <article>
    <div class='title'>bar title</div>
    <div class='body'>bar body</div>
  </article>
`;

const result = x('article {0,}', {
  body: x('.body'),
  title: x('.title')
})(document);

result === [
  {
    body: 'foo body',
    title: 'foo title'
  },
  {
    body: 'bar body',
    title: 'bar title'
  }
];
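
The composition described above can be pictured without a DOM. The following is a minimal sketch, not Surgeon's implementation: nest is a hypothetical helper, and plain objects stand in for DOM nodes.

```javascript
// Hypothetical helper (not part of Surgeon): runs every child query against
// each node produced by the parent query.
const nest = (parentQuery, children) => {
  return (root) => {
    return parentQuery(root).map((node) => {
      const entries = Object.entries(children).map(([name, childQuery]) => {
        // The node matched by the parent is the root of the child query.
        return [name, childQuery(node)];
      });

      return Object.fromEntries(entries);
    });
  };
};

// Plain objects stand in for DOM nodes in this sketch.
const page = {
  articles: [
    {body: 'foo body', title: 'foo title'},
    {body: 'bar body', title: 'bar title'}
  ]
};

const extractArticles = nest((root) => root.articles, {
  body: (node) => node.body,
  title: (node) => node.title
});

const articles = extractArticles(page);
```

Here articles holds one object per parent match, each with the body and title keys produced by the child queries.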

Validation

Validation is performed using a regular expression.

const document = `
  <div class="title">foo</div>
`;

x('.title', /foo/)(document) === 'foo';

If the regular expression does not match the data, an InvalidDataError error is thrown (see Error handling).
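
The check can be pictured as follows. This is a minimal sketch under stated assumptions, not Surgeon's source: validate is a hypothetical helper, and the InvalidDataError class is a local stand-in for the one exported by Surgeon.

```javascript
// Stand-in for the InvalidDataError exported by Surgeon; defined locally so
// that the sketch is self-contained.
class InvalidDataError extends Error {}

// Hypothetical helper: returns the value unchanged if it matches the rule,
// otherwise throws InvalidDataError.
const validate = (value, rule) => {
  if (!rule.test(value)) {
    throw new InvalidDataError('Value "' + value + '" does not match ' + rule);
  }

  return value;
};

const accepted = validate('foo', /foo/);

let caught = null;

try {
  validate('foo', /bar/);
} catch (error) {
  // The failed validation is captured here for inspection.
  caught = error;
}
```

In this sketch the rule is unanchored, so /foo/ also accepts 'foobar'; use /^foo$/ to require an exact match.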

Conventions

Quantifier expression

A quantifier expression is used to assert that the query matches a set number of nodes.

The default quantifier expression value is {1}.

Syntax

Name | Syntax
Fixed quantifier | {n} where n is an integer >= 1
Greedy quantifier | {n,m} where n >= 0 and m >= n
Greedy quantifier | {n,} where n >= 0
Greedy quantifier | {,m} where m >= 1

If this looks familiar, it's because I have adopted the syntax from the regular expression language. However, unlike in a regular expression, a quantifier in a Surgeon selector produces an error (UnexpectedResultCountError) if the number of matched nodes falls outside the quantifier range.

Example

.title {1}
.title {0,1}
.title {0,}
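
The quantifier syntax and the out-of-range error can be sketched as a small parser plus an assertion. This is an illustration only: parseQuantifier and assertResultCount are hypothetical names, and the error class is a local stand-in for Surgeon's UnexpectedResultCountError.

```javascript
// Stand-in for Surgeon's UnexpectedResultCountError, defined locally.
class UnexpectedResultCountError extends Error {}

// Hypothetical helper: converts "{n}", "{n,m}", "{n,}" or "{,m}" into a range.
const parseQuantifier = (expression) => {
  const match = /^\{(\d*)(,?)(\d*)\}$/.exec(expression);

  if (!match || (match[1] === '' && match[3] === '')) {
    throw new SyntaxError('Invalid quantifier expression: ' + expression);
  }

  const [, lower, comma, upper] = match;

  if (comma === '') {
    // Fixed quantifier {n}: n must be an integer >= 1.
    const n = Number(lower);

    if (n < 1) {
      throw new RangeError('Fixed quantifier must be >= 1: ' + expression);
    }

    return {max: n, min: n};
  }

  // Greedy quantifier: a missing bound defaults to 0 or Infinity.
  const min = lower === '' ? 0 : Number(lower);
  const max = upper === '' ? Infinity : Number(upper);

  if (max < min || (lower === '' && max < 1)) {
    throw new RangeError('Invalid quantifier range: ' + expression);
  }

  return {max, min};
};

// Hypothetical helper: throws if the match count is outside the range.
const assertResultCount = (count, expression) => {
  const {max, min} = parseQuantifier(expression);

  if (count < min || count > max) {
    throw new UnexpectedResultCountError(
      'Expected ' + expression + ' matches, found ' + count + '.'
    );
  }
};

assertResultCount(1, '{1}');
assertResultCount(3, '{0,}');
```

Both calls at the end pass silently; assertResultCount(2, '{1}') would throw the stand-in UnexpectedResultCountError.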

Accessor expression

An accessor expression can be used to return a single item from an array of matches. An accessor expression must follow a quantifier expression.

The default accessor expression value is [0]. The default applies only if a quantifier expression is not specified. If a quantifier expression is specified, then by default all matches are returned.

Syntax

[n] where n is a zero-based index.

Example

.title {1}[0]
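
The defaults described above can be sketched as follows. This is an illustration under stated assumptions, not Surgeon's parser: parseQuery and pick are hypothetical helpers, the accessor is assumed to directly follow the quantifier as in the example, and a plain array stands in for the matched nodes.

```javascript
// Hypothetical helper: splits "selector {quantifier}[accessor]" into parts
// and applies the documented defaults.
const parseQuery = (query) => {
  const match = /^(.+?)(?:\s+(\{\d*,?\d*\})(?:\[(\d+)\])?)?$/.exec(query.trim());

  if (!match) {
    throw new SyntaxError('Invalid query: ' + query);
  }

  const [, selector, quantifier, accessor] = match;

  if (quantifier === undefined) {
    // No quantifier: default to {1} and the accessor [0].
    return {index: 0, quantifier: '{1}', selector};
  }

  // Quantifier without accessor: all matches are returned (index is null).
  return {
    index: accessor === undefined ? null : Number(accessor),
    quantifier,
    selector
  };
};

// Hypothetical helper: returns one item, or every match if index is null.
const pick = (matches, index) => {
  return index === null ? matches : matches[index];
};

const single = pick(['foo'], parseQuery('.title').index);
const all = pick(['foo', 'bar', 'baz'], parseQuery('.title {0,}').index);
```

Here single is the string 'foo', while all is the full array of three matches.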

Attribute selector

An attribute selector is used to select the value of an HTMLElement attribute.

Syntax

@n where n is the attribute name.

Example

.title@data-id

Property selector

A property selector is used to select the value of an HTMLElement property.

Syntax

@.n where n is the property name.

Example

.title@.textContent
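
The difference between the attribute selector (@n) and the property selector (@.n) can be sketched as below. This is a minimal illustration, not Surgeon's implementation: parseExtractor and readValue are hypothetical helpers, the expression is assumed to contain no quantifier, NotFoundError is a local stand-in for the error exported by Surgeon, and a stub object stands in for an HTMLElement.

```javascript
// Stand-in for Surgeon's NotFoundError, defined locally.
class NotFoundError extends Error {}

// Hypothetical helper: separates the CSS selector from the @n / @.n suffix.
const parseExtractor = (expression) => {
  const at = expression.lastIndexOf('@');

  if (at === -1) {
    // Without a suffix, the textContent property is extracted.
    return {name: 'textContent', selector: expression, type: 'property'};
  }

  const selector = expression.slice(0, at);
  const target = expression.slice(at + 1);

  return target.startsWith('.')
    ? {name: target.slice(1), selector, type: 'property'}
    : {name: target, selector, type: 'attribute'};
};

// Hypothetical helper: reads the attribute or property from an element.
const readValue = (element, extractor) => {
  const value = extractor.type === 'attribute'
    ? element.getAttribute(extractor.name)
    : element[extractor.name];

  if (value === null || value === undefined) {
    throw new NotFoundError('Cannot find ' + extractor.type + ' ' + extractor.name);
  }

  return value;
};

// Stub standing in for <div class="title" data-id="42">foo</div>.
const element = {
  getAttribute: (name) => (name === 'data-id' ? '42' : null),
  textContent: 'foo'
};

const id = readValue(element, parseExtractor('.title@data-id'));
const text = readValue(element, parseExtractor('.title@.textContent'));
```

With the stub above, id is '42' and text is 'foo'; requesting an attribute the stub does not define throws the stand-in NotFoundError.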

Error handling

Surgeon can throw several types of error. Use the instanceof operator to determine the error type.

Name | Description
NotFoundError | Thrown when an attempt is made to retrieve a non-existent attribute or property.
UnexpectedResultCountError | Thrown when a quantifier expression is not satisfied.
InvalidDataError | Thrown when the resulting data does not pass validation.

Example:

import {
  InvalidDataError
} from 'surgeon';

const document = `
  <div class="title">foo</div>
`;

try {
  x('.title', /bar/)(document);
} catch (error) {
  if (error instanceof InvalidDataError) {
    // Handle data validation error.
  } else {
    throw error;
  }
}

Debugging

Surgeon uses debug to provide additional debugging information.

To enable Surgeon debug output, run the program with the DEBUG=surgeon:* environment variable set.

FAQ

What's the difference from x-ray?

x-ray is a web scraping library.

The primary difference between Surgeon and x-ray is that Surgeon does not implement an HTTP request layer. I consider this an advantage for the reasons that I have described in the following x-ray issue.
