Xidel - HTML/XML/JSON data extraction tool


Xidel is a command line tool to download and extract data from HTML/XML pages as well as JSON APIs.

  1. Print all URLs found by a Google search:

    xidel https://www.google.de/search?q=test --extract "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']"
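
To make the expression in this example easier to follow, here is a minimal Python sketch of what extract(...) followed by the [. != ''] filter appears to do. It is not Xidel code. The href values are hypothetical inputs, the helper extract is a stand-in written for this sketch, and it assumes that Xidel's extract yields an empty string when the pattern does not match (which the filter in the example suggests).

```python
import re

# Hypothetical href values, shaped like result links of the form /url?q=...&...
# They are illustrative inputs, not captured from a real results page.
hrefs = [
    "/url?q=https://example.org/&sa=U",
    "/search?q=test&start=10",
    "/url?q=https://example.com/page&sa=U",
]

def extract(text, pattern, group):
    """Return the given capture group of the first match, or '' if none (assumed behaviour)."""
    m = re.search(pattern, text)
    return m.group(group) if m else ""

# Rough equivalent of: //a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']
candidates = [extract(h, r"url[?]q=([^&]+)&", 1) for h in hrefs]
urls = [u for u in candidates if u != ""]

print(urls)  # ['https://example.org/', 'https://example.com/page']
```

The second href contains no url?q= part, so it produces an empty string and is dropped by the filter.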

  2. Print the title of all pages found by a Google search and download them:

    xidel https://www.google.de/search?q=test --follow "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']" --extract //title --download '{$host}/'

  3. Generally follow all links on a page and print the titles of the linked pages:
    • With XPath: xidel https://example.org -f //a -e //title
    • With CSS selectors: xidel https://example.org -f "css('a')" --css title
    • With pattern matching: xidel https://example.org -f "<a>{.}</a>*" -e "<title>{.}</title>"
  4. Another pattern matching example:

    If you have an example.xml file like <x><foo>ood</foo><bar>IMPORTANT!</bar></x>,
    you can read the important part with: xidel example.xml -e "<x><foo>ood</foo><bar>{.}</bar></x>"
    (This also checks that the element containing "ood" is there, and fails otherwise.)

  5. Calculate something with XPath using arbitrary precision arithmetic:

    xidel -e "(1 + 2 + 3) * 1000000000000 + 4 + 5 + 6 + 7.000000008"
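
Why arbitrary precision matters for this expression can be seen by evaluating the same numbers outside Xidel. The following Python sketch is only an illustration and does not reproduce Xidel's output format: it compares exact decimal arithmetic (Python's decimal module, whose default 28 significant digits are enough here) with ordinary 64-bit floating point.

```python
from decimal import Decimal

# Exact decimal arithmetic: every intermediate value stays a decimal number.
exact = (Decimal(1) + 2 + 3) * 1000000000000 + 4 + 5 + 6 + Decimal("7.000000008")

# Binary floating point: at this magnitude a 64-bit double cannot represent
# the trailing 0.000000008, so it is rounded away.
approx = (1 + 2 + 3) * 1000000000000 + 4 + 5 + 6 + 7.000000008

print(exact)   # 6000000000022.000000008
print(approx)  # 6000000000022.0
```

The result needs 22 significant digits, which is more than a double can hold, so only the decimal computation keeps the fractional part.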

  6. Print all newest Stack Overflow questions with title and URL using pattern matching on their RSS feed:

    xidel http://stackoverflow.com/feeds -e "<entry><title>{title:=.}</title><link>{uri:=@href}</link></entry>+"

  7. Print all Reddit comments of a user, with HTML and URL:

    xidel "https://www.reddit.com/user/username/" --extract "<t:loop><div class='usertext-body'><div>{outer-xml(.)}</div></div><ul class='flat-list buttons'><a><t:s>link:=@href</t:s>permalink</a></ul></div></div></t:loop>" --follow "<a rel='nofollow next'>{.}</a>?"

  8. Check if your Reddit letter is red:
    • Web scraping, combining CSS, XPath, JSONiq, and automatic form evaluation:

      xidel https://reddit.com -f "form(css('form.login-form')[1], {'user': '$your_username', 'passwd': '$your_password'})" -e "css('#mail')/@title"

    • Using the Reddit API:

      xidel -d "user=$your_username&passwd=$your_password&api_type=json" https://ssl.reddit.com/api/login --method GET 'https://www.reddit.com/api/me.json' -e '($json).data.has_mail'

  9. Use XQuery to create an HTML table of odd and even numbers:

    Windows cmd: xidel --xquery "<table>{for $i in 1 to 1000 return <tr><td>{$i}</td><td>{if ($i mod 2 = 0) then 'even' else 'odd'}</td></tr>}</table>" --output-format xml
    Linux/Powershell: xidel --xquery '<table>{for $i in 1 to 1000 return <tr><td>{$i}</td><td>{if ($i mod 2 = 0) then "even" else "odd"}</td></tr>}</table>' --output-format xml
    (Xidel itself supports ' and "-quotes on all platforms, but ' does not escape <> in Windows' cmd, and " does not escape $ in the Linux shells)

  10. Export variables to the shell:

    Linux/bash: eval "$(xidel https://site -e 'title:=//title' -e 'links:=//a/@href' --output-format bash)"
    This sets the bash variable $title to the title of the page and $links becomes an array of all links there.

    Windows cmd: FOR /F "delims=" %%A IN ('xidel https://site -e "title:=//title" -e "links:=//a/@href" --output-format cmd') DO %%A
    This sets the batch variable %title% to the title of the page and %links% becomes an array of all links there.

  11. Reading JSON:
    • Read the 10th array element: xidel file.json -e '$json(10)'
    • Read all array elements: xidel file.json -e '$json()'
    • Read property "foo" and then "bar" with JSONiq notation: xidel file.json -e '$json("foo")("bar")'
    • Read property "foo" and then "bar" with dot notation: xidel file.json -e '($json).foo.bar'
    • Read property "foo" and then "bar" with XPath-like notation: xidel file.json -e '$json/foo/bar'
    • Mixed example: xidel file.json -e '$json("abc")()().xyz/(u,v)'

      This would read all the numbers from e.g. {"abc": [[{"xyz": {"u": 1, "v": 2}}], [{"xyz": {"u": 3}}, {"xyz": {"u": 4}} ]]}.
      All selectors are sequence-transparent, i.e. you can use the same selector to read something from one value as to read it from several values. Arrays are converted to sequences with ().

    Using XPath 3.1 syntax (requires Xidel 0.9.9):
    • Read the 10th array element: xidel file.json -e '$json?10'
    • Read all array elements: xidel file.json -e '$json?*'
    • Read property "foo" and then "bar" with 3.1 notation: xidel file.json -e '$json?foo?bar'
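
The idea of sequence-transparent selectors in the mixed example above can be sketched in Python. This is an emulation for illustration, not how Xidel is implemented: the helpers prop and members are hypothetical names introduced here, and each one maps a whole list of values to a new list, so the same step works for one value or for many.

```python
import json

doc = json.loads(
    '{"abc": [[{"xyz": {"u": 1, "v": 2}}], [{"xyz": {"u": 3}}, {"xyz": {"u": 4}}]]}'
)

def prop(seq, name):
    """Look up a property on every object in the sequence; skip values without it."""
    return [item[name] for item in seq if isinstance(item, dict) and name in item]

def members(seq):
    """Replace every array in the sequence by its members, like () in the example."""
    return [m for item in seq if isinstance(item, list) for m in item]

# Step-by-step emulation of: $json("abc")()().xyz/(u,v)
seq = [doc]
seq = prop(seq, "abc")   # one value: the outer array
seq = members(seq)       # two inner arrays
seq = members(seq)       # three objects
seq = prop(seq, "xyz")   # three "xyz" objects
numbers = [v for obj in seq for v in prop([obj], "u") + prop([obj], "v")]

print(numbers)  # [1, 2, 3, 4]
```

Objects that lack a "v" property simply contribute nothing, which is why no special case is needed for the second and third object.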
  12. Convert table rows and columns to a CSV-like format:

    xidel https://site -e '//tr / string-join(td, ",")'

    string-join((...)) can generally be used to output some values in a single line.

    In the example, tr / string-join calls string-join once for every row.
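
The "once per row" behaviour can be illustrated with a small Python sketch. It is not Xidel code; the table is a hypothetical, well-formed input, and the standard library XML parser stands in for Xidel's HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical input table used only for this sketch.
table = ET.fromstring(
    "<table>"
    "<tr><td>a</td><td>b</td><td>c</td></tr>"
    "<tr><td>1</td><td>2</td><td>3</td></tr>"
    "</table>"
)

# Rough equivalent of: //tr / string-join(td, ",")
# The join runs once per row, so each row becomes one output line.
lines = [
    ",".join("".join(td.itertext()) for td in tr.findall("td"))
    for tr in table.iter("tr")
]

print("\n".join(lines))
# a,b,c
# 1,2,3
```

As in the Xidel example, cells are joined without any quoting, so a cell that itself contains a comma would not be escaped; that is why the result is only CSV-like.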
  13. Modify/Transform an HTML file, e.g. to mark all links as bold (requires Xidel 0.9.9):

    Windows cmd:

    xidel --html your-file.html --xquery "x:replace-nodes(/, //a, function($e) {
       $e/<a style='{string-join((@style, 'font-weight: bold'), '; ')}'>{@* except @style, node()}</a>
    })" > your-output-file.html
    Linux/Powershell:
    xidel --html your-file.html --xquery 'x:replace-nodes(/, //a, function($e) { 
       $e/<a style="{string-join((@style, "font-weight: bold"), "; ")}">{@* except @style, node()}</a> 
    })' > your-output-file.html

    This example combines three important syntaxes:

    • x:replace-nodes(/, //a, function($e) { .. }): This applies an anonymous function to every a-element (link) in the HTML document, whereby that element is stored in the variable $e and is replaced by the return value of the function.
    • <a>{@* except @style, node()}</a>: This creates a new a-element that has the same children, descendants and attributes as the current element, but removes the style-attribute.
    • style="{string-join((@style, "font-weight: bold"), "; ")}": This creates a new style-attribute by appending "font-weight: bold" to the old value of the attribute. A separating "; " is inserted, if (and only if) that attribute already existed.
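
The way string-join handles a missing style-attribute can be sketched in Python. This is an illustration only: bold_style is a hypothetical helper, and None stands in for the empty sequence that @style yields when the attribute does not exist.

```python
def bold_style(old_style):
    """Join the old style value (None if the attribute is absent) with the new rule."""
    parts = [p for p in (old_style, "font-weight: bold") if p is not None]
    return "; ".join(parts)

print(bold_style(None))          # font-weight: bold
print(bold_style("color: red"))  # color: red; font-weight: bold
```

Because an absent attribute contributes no item to the sequence, there is nothing to separate and no "; " is emitted.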

There is various documentation available:

The last official release is Xidel 0.9.8, but a Xidel 0.9.9 development version is published irregularly as a preview for the next release, in two sets of builds: one for Windows, Linux (>= Ubuntu 20.10) and Android, and one for Windows, Linux (Ubuntu 20.04) and Mac. It is recommended to use the 0.9.9 version, since it contains bug fixes, is more performant, and partially supports XPath/XQuery 3.1. In that version, most of the JSONiq syntax has been replaced by the XPath 3.1 JSON syntax. It will be published officially once all of XPath/XQuery 3.1 is implemented.

Usually, you can just extract the zip/deb and call Xidel, or copy it to some place in your PATH,
because it consists of a single binary without any external dependencies, except the standard system libraries (i.e. Windows API or respectively libc).
However, for HTTPS connections on Linux, OpenSSL (including openssl-dev) and libcrypto are also required. For Unicode collations, libicu is required.

There are already nightly preview builds of the next version, Xidel 0.9.9. There are builds compiled on a build server (on Ubuntu 20.04; for Linux GLIBC < 2.34, Windows and Mac) and compiled locally (on Ubuntu 21.10; for Linux GLIBC >= 2.34, Android, and Windows).

You can also test it online on a webpage or directly by sending a request to the cgi service like https://www.videlibri.de/cgi-bin/xidelcgi?data=<html><title>foobar</title></html>&extract=//title&raw=true.

The source history is stored in a Mercurial repository together with the VideLibri source and dependencies, licensed as GPLv3+. There are mirrors on GitHub and GitLab. These mirrors have the Xidel source only, so in order to compile it you need to download the dependencies from their own repositories first. Or use the above source tarball, which also contains the dependencies.

The source then needs to be compiled with FreePascal.
In a Unix-like shell, you compile it by calling ./build.sh, which just calls FreePascal. If you want to call FreePascal directly yourself, you can use fpc xidel.pas, in which case you need to pass the paths to all directories of the source using the -Fu and -Fi options.
Alternatively, Xidel can be compiled using the Lazarus IDE. For this, install components/pascal/internettools.lpk in Lazarus, then open programs/internet/xidel/xidel.lpi and click on Run\Compile.

Pronunciation: To say the name "Xidel" in English, you say "excited" with a silent "C" and "D", followed by an "L". In German, you just say it as it is written.

You can join the Xidel mailing list, have discussions on SourceForge or follow @bibliothekapp.

Author: Benito van der Zander, benito_NOSPAM_benibela.de, www.benibela.de
(Please do not ask me how to scrape your website. Ask how to do something with Xidel instead. I know Xidel, I do not know your website. The point of the tool is to make it easy for anyone to parse any webpage. Scraping every webpage myself does not scale well.)
