256 Kilobytes

While tools like cURL are useful for fetching raw HTML from webpages, they do not execute JavaScript. cURL downloads only the resource at the specified URL and will not fetch or load any linked external files, such as CSS stylesheets. For requests that also render JavaScript and CSS, or that otherwise interact with webpages, there are various other tools and libraries that can be used.

PhantomJS vs. Headless Chrome/Chromium vs. Headless Firefox

PhantomJS

  • + Extremely straightforward and lightweight. Basically the same as writing vanilla JavaScript in any other context.
  • + Long history (has existed since at least 2011), so there is a large amount of documentation, relevant forum threads, etc.
  • - No longer being maintained
  • - Environment may differ from "real" browsers, which can potentially result in discrepancies for testing.

Headless Chrome/Chromium

  • + Best command-line interface out-of-the-box.
  • + The --dump-dom flag makes integrating headless Chrome/Chromium requests with shell scripts easy; no helper/wrapper script files needed.
  • + Well maintained
  • + Great documentation
  • + Support for exporting/saving requests as PDFs with the --print-to-pdf flag
  • - Something something muh Google is evil

Headless Firefox

  • + Mozilla is probably a less evil company than Google
  • + Solid documentation

Overall:

  • PhantomJS was basically "the" library for headless browser automation until recently, but it is no longer maintained, and the release of headless Firefox and Chrome suggests that it is unlikely to come back to life.
  • Running headless instances of Chrome/Chromium seems to be the "best" option both in terms of out-of-the-box features and long-term support.
  • Headless Firefox seems adequate, but inferior to working with headless Chrome, unless you specifically need to use Firefox.

PhantomJS

PhantomJS has been around since at least 2011 and is, basically, the first popularized headless, scriptable web browser. While it is no longer being maintained (as of 2018), it is still a solid package with good documentation.

PhantomJS Example: Making "Not-Headless" Requests to Google Search Pages

Setup and Installation

  1. Install PhantomJS. Running sudo apt-get install phantomjs worked fine on Ubuntu Linux. If you're using a different operating system, the installation steps may vary. See the PhantomJS installation page for more information.
  2. Fuck around. If you want to familiarize yourself with PhantomJS further with basic examples, look at the quickstart guide, which is extremely straightforward.

Create the Script

Create a new file called reacharound.js and paste the following code into it:

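var page = require('webpage').create(),
        system = require('system'),
        address;

// system.args[0] is the script name, so a length of 1 means no URL was given.
if (system.args.length === 1) {
        console.log('Usage: reacharound.js [some URL]');
        phantom.exit(1);
} else {
        address = system.args[1];
        page.open(address, function(status) {
                if (status !== 'success') {
                        console.log('FAIL to load the address');
                        // Exit here as well; otherwise PhantomJS keeps running after a failed load.
                        phantom.exit(1);
                } else {
                        // page.content holds the page's HTML after its scripts have run.
                        console.log(page.content);
                        phantom.exit();
                }
        });
}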

Then, save and exit the file.

Running the Script

To run the script, execute the following command (the URL is quoted so that the shell does not try to interpret the ? character):

phantomjs reacharound.js "https://www.google.com/search?q=dog"

This prints out the page's code after loading and executing any JS.

Running the Script as a "Reacharound"

Since PhantomJS can easily be run from the terminal, the output of the reacharound.js script above can also be piped or redirected between various Bash/shell scripts. For example, this command:

phantomjs reacharound.js "https://www.google.com/search?q=dog" > reacharound-js-output.html

will load the same page, but save the result into a file. This could be incorporated into other scripts, such as a search engine result monitoring script or local directory scraper.

Other Notes and Observations

Command-Line Flags

There are various command-line flags listed in this documentation file, which can cover some use cases without the need to write the equivalent logic into a script.

User-Agent

The default user-agent is:

  • Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/538.1 (KHTML, like Gecko) PhantomJS/2.1.1 Safari/538.1

It can be changed as specified in this settings documentation file.
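
A minimal sketch of overriding the user-agent, assuming the page.settings.userAgent property from the PhantomJS web page settings. The file name custom-ua.js, the helper chooseUserAgent, and the fallback user-agent string are illustrative values, not anything PhantomJS ships with. The setting needs to be assigned before page.open is called.

```javascript
// custom-ua.js (hypothetical name)
// Run with: phantomjs custom-ua.js <URL> [user-agent string]
// Written in ES5 syntax, since PhantomJS 2.x lacks most newer JavaScript features.

// Example fallback value only; substitute whatever user-agent you need.
var FALLBACK_UA = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 ' +
        '(KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36';

// args[0] is the script name, args[1] the URL, args[2] an optional user-agent.
function chooseUserAgent(args) {
        return args.length > 2 ? args[2] : FALLBACK_UA;
}

// The phantom global only exists when running under PhantomJS.
if (typeof phantom !== 'undefined') {
        var system = require('system');
        var page = require('webpage').create();

        if (system.args.length < 2) {
                console.log('Usage: custom-ua.js <URL> [user-agent]');
                phantom.exit(1);
        } else {
                // Must be set before the request is made.
                page.settings.userAgent = chooseUserAgent(system.args);
                page.open(system.args[1], function (status) {
                        if (status !== 'success') {
                                console.log('FAIL to load the address');
                                phantom.exit(1);
                        } else {
                                console.log(page.content);
                                phantom.exit();
                        }
                });
        }
}
```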

Detecting PhantomJS

There is a quality article on how to detect, at least in theory, whether requests are from PhantomJS, as well as an associated presentation.

Whether there are any sites that actually use these methods is unclear.

Headless Chrome/Chromium

With PhantomJS no longer being maintained, headless Chrome, initially released in 2017, is one of the most common solutions at the time of writing.

Installation and Setup

  1. Install Chrome and/or Chromium. Running sudo apt-get install chromium-browser should work on Ubuntu Linux. Google Chrome itself is not in Ubuntu's default repositories; it is typically installed from Google's .deb package or apt repository, where the package is named google-chrome-stable.
  2. [Troubleshooting]. Open Chromium/Chrome at least once before attempting to launch it headless. When writing this article, I ran into an error with Chromium right after it was first installed on this machine, which fixed itself after launching Chromium once normally to trigger the first-launch "setup" page.
  3. More documentation. See the "getting started" guide from Google linked above.

Basic Headless Requests

Get Page Source

Here's an extremely basic example that loads a page, runs the JS, and then saves the resulting source code to an HTML file. The "getting started" guide linked under "installation and setup" writes this command as "chrome --headless..." (via a shell alias) instead of "google-chrome --headless..." or "chromium-browser --headless..."; the binary name depends on your system.

google-chrome --headless --dump-dom "https://www.google.com/search?q=dog" > google-search-headless.html

Or:

chromium-browser --headless --dump-dom "https://www.google.com/search?q=dog" > google-search-headless.html

Get Page as a PDF

google-chrome --headless --print-to-pdf "https://www.google.com/search?q=dog"

Other Notes and Observations

Puppeteer

For more advanced scripting and automation, note that there is a Node.js library called Puppeteer that can be used to control headless Chrome programmatically. Alternatively, headless Chrome can be driven with tools like Selenium WebDriver.

Headless Firefox

Headless Firefox is basically the same as headless Chrome, but for Firefox. It was also released in 2017.

Installation and Setup

Basic Headless Requests

firefox -headless --screenshot "https://duckduckgo.com/?q=rare+clowns"

This saves a screenshot of the page as screenshot.png in the current directory. Also see the list of command line options for other flags that can be used.

Other Observations

There are no other observations about headless Firefox. It works correctly. Its CLI has fewer "out of the box" tricks for common tasks than headless Chrome's. If you're using it for more complex testing or automation, you're probably combining it with something like Selenium.

Other Shit

In Conclusion

Hell yeah. It's time to use some or all of these various libraries for purposes that are relevant to your use case.

Posted by August R. Garcia, 08 July, 2019 23:03 PDT

Of note is that (a few, but not many!) 3rd parties can detect that there is a headless browser at play here:

https://antoinevastel.com/bot%20detection/2018/01/17/detect-chrome-headless-v2.html

http://geocar.sdf1.org/browser-verification.html

There are countermeasures that can help with varying levels of success.

Posted by jimdigriz, 09 July, 2019 08:23 PDT

Thenkle for this. I kiss Selenium every night before I go to sleep and pray to it in my dreams. Basically what I do is:

package chrome;
       public class HeadlessTesting {
            public static void main(String[] args) throws IOException{

and then I add rest of the code thanks to my sleepless VA from Sachsenhausen.

Posted by Scuffed Dog, 09 July, 2019 17:27 PDT

This is brilliant sir.

Posted by Hash Brown, 10 July, 2019 17:44 PDT