256 Kilobytes

The Basics to Web Scraping with cURL and XPath

Articles in Web Scraping, Data Analysis | By August R. Garcia

Published Fri, 28 Jun 2019 09:02:04 -0700 | Last updated Tue, 02 Jul 2019 17:37:33 -0700

Most of the methods used to scrape the web require setting up some files or other bullshit. However, there’s a better solution that works both for quick and easy casual scraping and for massive bulk scraping.


Edit: For the lazy, this is the basic setup:

curl https://www.example.com | xmllint --html -xpath "//title/node()" -

The rest of the guide goes into more depth on applications, input and output sources, loops, and other use cases.


Most of the methods used to scrape the web require setting up some bullshit. Whether that’s a web scraping script using Python, a Quality Hack with Google Sheets, a High Quality sitemap trick, or a custom script for your shitty scraper site, creating all these files and shit probably makes you want to buy cheap rope online.

However, there’s a better solution that works both for quick and easy casual scraping and for massive bulk scraping. Specifically, using the cURL utility and a few XPath queries.

The Basics to Web Scraping with cURL and XPath

What is cURL?

cURL is a command line utility (and also a scripting library) that can be used to make headless HTTP requests to URLs. Basically, running a simple command like the following:

curl https://www.example.com

Will make a request to that webpage and give you the raw HTML, which looks like this:

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 50px;
        background-color: #fff;
        border-radius: 1em;
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        body {
            background-color: #fff;
        }
        div {
            width: auto;
            margin: 0 auto;
            border-radius: 0;
            padding: 1em;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is established to be used for illustrative examples in documents. You may use this
    domain in examples without prior coordination or asking for permission.</p>
    <p><a href="http://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>

How to Scrape a Webpage into a File with cURL

The example above prints the scraped HTML directly to the terminal. If you want to save the output to a file instead, you can use this command:

curl https://www.example.com > example.html

Which will create a file called "example.html" and store the output of the curl command (which will be the webpage's contents) into that file.
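As a side note, cURL can also write to a file on its own without the shell redirect; as far as I know this is equivalent (-o names the output file and -s hides the progress meter):

curl -s -o example.html https://www.example.com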

What is XPath?

XPath is a standard syntax that can be used to extract arbitrary data from XML documents. Incidentally, HTML is (basically) a subset of XML, which means that XPath queries can be used to extract data from HTML pages. For example, this XPath query can be used to extract all title tags from an HTML file:

//title

You can find a "cheat sheet" with a number of common XPath examples, tricks, and pieces of syntax on this webpage.
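For a quick flavor of what those look like, here are a few more queries in the same vein (the class name in the last one is just a made-up example, not something from a real page):

//a/@href                                (every link URL on the page)
//h1/node()                              (the text inside each h1 tag)
//meta[@name='viewport']/@content        (the content of the viewport meta tag)
//div[@class='content']//p               (paragraphs inside a div with that class)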

Using XPath to Parse XML from the Terminal (Using XMLLint)

There are a number of command line utilities that can be used for running XPath queries. The one that we are using here is XMLLint; you could replace this with basically any other arbitrary command line utility for parsing XML, although you'll want to make sure it can work with HTML. A basic example of XMLLint usage for parsing HTML is below:

xmllint --html -xpath "//title" example.html

This line of code will read the input from the file "example.html" (which is the code saved under "How to Scrape a Webpage into a File with cURL" above), parse it, and return all of the title tags (in this case, there's just the one):

<title>Example Domain</title>
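The same approach works for attributes as well; for example, to pull the href of every link out of that same file:

xmllint --html -xpath "//a/@href" example.html

Which, for the example.com page saved above, should print something along the lines of:

href="http://www.iana.org/domains/example"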

Combining cURL with XPath/XMLLint

To simplify the series of commands above, the following command can be used, which provides the same end result:

curl https://www.example.com | xmllint --html -xpath "//title" -

The pipe character passes the output of one arbitrary utility to another, which allows the full process to be run in one go instead of having to save a file and then run a second command. Note that xmllint requires the "-" (its read-from-stdin argument) to be at the end of the command for the piped input to be handled correctly.
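The same pattern works with any other query; for instance, based on the HTML shown earlier, this should print just "Example Domain":

curl https://www.example.com | xmllint --html -xpath "//h1/node()" -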

Working with a List of URLs

Of course, to be able to scrape data efficiently, you probably want to work with a list of URLs. To do this, first create a file with a list of URLs (one URL per line), such as this file:

https://www.example.com/
http://www.genericwebsite.com/
https://www.computercoins.website/

To loop through the content of a file using the terminal/BASH, a basic example is below:

cat urls.txt | while read line; do echo $line; done

This will pass the contents of urls.txt to a loop. In this case, all the script does with each line is print it out to the terminal. To incorporate the cURL and XMLLint script above, you can use the following:

cat urls.txt | while read line; do curl $line | xmllint --html -xpath "//title" - ; done > multiple-results.txt

Which gives you this result (saved to the file "multiple-results.txt"):

<title>Example Domain</title><title>Generic Website</title><title>ComputerCoins Online</title>

These are all on the same line, whereas we want to have separators between them. To do this, we can make a slight modification to the script to echo a linebreak after each run of XMLLint:

cat urls.txt | while read line; do curl $line | xmllint --html -xpath "//title" - && echo "" ; done > multiple-results.txt

Which gives us this result:

<title>Example Domain</title>
<title>Generic Website</title>
<title>ComputerCoins Online</title>

To get only the contents within the tags (and not the tags as well), you can use the node() selector:

cat urls.txt | while read line; do curl $line | xmllint --html -xpath "//title/node()" - && echo "" ; done > multiple-results-real.txt

Which gives you this output:

Example Domain
Generic Website
ComputerCoins Online

A Real-World Demonstration: Scraping Yelp for Lead Generation

Anyway, it's time to apply this quality method of web scraping to a real-world scenario. Specifically, the goal is to:

  1. Scrape Yelp pages
  2. Scrape the business' official website, if one is listed
  3. If they don't have a website listed, leave a blank line

Note: Apparently Yelp is salty about web scrapers, so if you do this particular example, probably avoid getting your home IP banned. If you're trying to get data from Yelp specifically, there is also an API. Anyway, this is a cURL and XPath guide, not a Yelp API tutorial.

The List of URLs

Here's the list of individual URLs that we're scraping, saved as yelp-urls.txt:

https://www.yelp.com/biz/kwon-dentistry-gresham-3
https://www.yelp.com/biz/advantage-dental-portland
https://www.yelp.com/biz/brock-herriges-dmd-portland

Scraping Websites from the List of URLs

The scraping process for these URLs is fairly similar to the process used previously:

cat yelp-urls.txt | while read line; do curl $line | xmllint --html -xpath "//div[@class='mapbox']//span[@class='biz-website js-biz-website js-add-url-tagging']/a/node()" - | cat - && echo "" ; done > "output/yelp-output--$(date +%s).txt"

A few additions were made:

  • Add an additional pipe into "cat -"; otherwise, a blank line would not be added for lines that came back with no result.
  • Change the filename to put the output into a subdirectory (output/[files]) for organization; make sure the folder/directory actually exists before running the script (see the sketch after this list).
  • Generate the filename based on the current UNIX timestamp (the number of seconds since January 1, 1970) so that a new file is created on each run of the script. 
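To put those notes into practice, here's a lightly modified version of the same command: the mkdir -p just makes sure the output directory exists (and won't complain if it already does), and the sleep 2 is an optional throttle given the note about Yelp above:

mkdir -p output
cat yelp-urls.txt | while read line; do curl $line | xmllint --html -xpath "//div[@class='mapbox']//span[@class='biz-website js-biz-website js-add-url-tagging']/a/node()" - | cat - && echo "" ; sleep 2 ; done > "output/yelp-output--$(date +%s).txt"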

Running the script provides the following result (the second URL has no listed website):

greshamsmiles.com

periopdx.com

Very nice.

Scraping and Saving Multiple Pieces of Data with cURL

While the above is decent, perhaps you want to put multiple pieces of data into your result file. Fortunately, you can in fact do that; one of the easiest ways is to output a CSV.

How to Pass Data into a CSV from the Terminal

As you may know, a CSV file consists of comma separated values. This is true in the most literal way possible. Here's a thing you can do:

  • Open a text file (in Notepad, Gedit, or similar)
  • Copy paste the text below into the file:
    • one,two,three
    • four,five,six
  • Save the file as "data.txt"
  • Close the file
  • Change the filename to "data.csv"
  • Open the file again
  • Congratulations, that data is now in some cells

In case the implications are still unclear, this means that you can save the output of arbitrary terminal commands as CSV files easily, provided that the output is formatted correctly (cells split by commas and rows split by line breaks). For example: 

echo -e "one, two, three\nfour,five,six" > output.csv

Does exactly what you would expect (-e flag added so that the \n will be treated as a newline).
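Relatedly, ">>" appends to a file instead of overwriting it, so the same CSV could also be built up one row at a time:

echo "one,two,three" > output.csv
echo "four,five,six" >> output.csv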

Adding the Yelp Source URLs to the Result Output CSV

With the above in mind, we can now modify the previously written script as follows:

cat yelp-urls.txt | while read line; do curl $line | xmllint --html -xpath "//div[@class='mapbox']//span[@class='biz-website js-biz-website js-add-url-tagging']/a/node()" - | cat - && echo ",$line" ; done > "output/yelp-output--$(date +%s).csv"

This will give us the same result as before, but with the URLs saved as well. Here's that result as a table:

greshamsmiles.com    https://www.yelp.com/biz/kwon-dentistry-gresham-3
(no website)         https://www.yelp.com/biz/advantage-dental-portland
periopdx.com         https://www.yelp.com/biz/brock-herriges-dmd-portland

And as the raw text if we change the extension to .txt:

greshamsmiles.com,https://www.yelp.com/biz/kwon-dentistry-gresham-3
,https://www.yelp.com/biz/advantage-dental-portland
periopdx.com,https://www.yelp.com/biz/brock-herriges-dmd-portland

Using this concept, this process can be altered further to gather additional pieces of data to include in the generated CSV, such as company addresses, email addresses, or phone numbers. 
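As a rough sketch of what that could look like (note: the biz-phone selector below is a guess at Yelp's markup rather than anything verified, so treat it as a placeholder):

cat yelp-urls.txt | while read line; do
    # Fetch the page once, then run multiple XPath queries against the same HTML
    html=$(curl -s "$line")
    site=$(echo "$html" | xmllint --html -xpath "//div[@class='mapbox']//span[@class='biz-website js-biz-website js-add-url-tagging']/a/node()" - 2>/dev/null)
    phone=$(echo "$html" | xmllint --html -xpath "//span[@class='biz-phone']/node()" - 2>/dev/null)
    # One CSV row per URL: website, phone number, source URL
    echo "$site,$phone,$line"
done > "output/yelp-output--$(date +%s).csv"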

In Conclusion

Hell yeah. What a high quality way to scrape assorted data without having to make a bunch of bullshit script files. Also, see this follow up post on crawling and scraping DuckDuckGo.


Comment by August R. Garcia (Site Owner):

Crawling/Scraping Reddit with cURL

Seems like you can use the process described above for Reddit. Basic examples:

curl -s -A "256Kilobytes Bot Test/Debug" curl --user-agent "reddit scraper example" https://www.reddit.com/r/MildlyInteresting > "reddit/yelp-output--$(date +%s).html"

Or to get the pages as JSON data (can add ".json" to the end of any URL):

curl -s -A "256Kilobytes Bot Test/Debug" curl --user-agent "reddit scraper example" https://www.reddit.com/r/MildlyInteresting.json > "reddit/yelp-output--$(date +%s).json"

Seems that you must (?) set a user-agent to get results returned correctly. Looks like there's also a rate limit of something along the lines of one request per two seconds. 
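Putting those pieces together, a loop over a few subreddits that (hopefully) stays under that rate limit could look something like the below; the subreddits past the first one are just arbitrary examples:

mkdir -p reddit
for sub in MildlyInteresting linux DataHoarder; do
    curl -s -A "256Kilobytes Bot Test/Debug" "https://www.reddit.com/r/$sub.json" > "reddit/$sub--$(date +%s).json"
    sleep 2
done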


Comment by August R. Garcia (Site Owner):

Have been experimenting with cURL scraping from the terminal some more. One issue that has come up a few times with XMLLint is getting the set of matches into a form that's easy to loop over, or that at least has a separator between results. A similar utility that you might consider is hxselect, which allows a separator to be specified. Ex:

curl https://www.example.com | hxselect -s "\n" p

Syntax uses CSS selectors rather than XPath.
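For example, to pull the text of every link on a page, one per line (-c prints only the contents of each match, similar to node() in XPath; piping through hxnormalize -x first, also from the html-xml-utils package, helps when the raw HTML isn't well-formed enough for hxselect on its own):

curl -s https://www.example.com | hxnormalize -x | hxselect -c -s "\n" a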


Comment by August R. Garcia (Site Owner):

Basic DuckDuckGo result scraping:

curl https://duckduckgo.com/html?q=asdf | awk '{$1=$1};1' | sed '/^$/d' | hxselect -c a.result__url |  sed '/^$/d'
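One minor gotcha: search queries with spaces or other special characters need to be URL-encoded, and cURL can handle that itself via -G plus --data-urlencode rather than hand-building the q= parameter:

curl -s -G https://duckduckgo.com/html --data-urlencode "q=web scraping with curl" | hxselect -c -s "\n" a.result__url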

