Tag Archives: node.js

node.js — using cheerio.js to find all script elements in a page

Finding <script> nodes in a page

Why.. why? Just because it’s useful when pages had dynamic content in javascript. Is there a way to subsequently evaluate the javascript parsed.. that’s for another article, but for now, I’m going to assume you have node.js installed, and you have at least come idea of how to use it.

The idea

Finding all the <script> nodes in an HTML page, rendered using

‘request.get()’

.

In the example, url (in this case www.amazon.com) is resolved and the HTML loaded. The loaded HTML is then passed to cheerio using this expression:

var $ = cheerio.load(html,{ normalizeWhitespace: false, xmlMode: false, decodeEntities: true });

.. then iterated upon using the .each( ..) object method.

$(‘script’).each( function () {…

In the very simple example the follows the script is logged to the console (STDOUT) for display. In an more advanced and useful implementation, the returned javascript would be interacted with, parsed or some other action taken.

The Script

// MAKE REQUIREMENTS
var request = require(‘request’);
var cheerio = require(‘cheerio’);

// Local Vars
var url = ‘https://www.amazon.com’;

// Define the requests default params
var request = request.defaults({
jar: true,
headers: { ‘User-Agent’: ‘Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:24.0) Gecko/20100101 Firefox/24.0’ },
})

// execute request and parse all the javascript blocks
request(formUrl, function (error, response, html) {
if (!error && response.statusCode == 200) {

// load the html into cheerio
var $ = cheerio.load(html,{ normalizeWhitespace: false, xmlMode: false, decodeEntities: true });

// iterate on all of the JS blocks in the page
$(‘script’).each( function () {
console.log(‘JS: %s’,$(this).text());
});
}
else {
console.log(‘ERR: %j\t%j’,error,response.statusCode);
}
});

End

node.js — parse page title (simple example)

node.js — Toolkit of the Code Gods!!

Or, so some would have you believe. Is it pretty awesome, YES. I’m I sold on it yet, NO. But it’s growing on me.

Since parsing webpages has been my business for nearly 15 years now, I’ve used a lot of tools and strategies, but it was only recently I decided to try out node.js for a few of my projects.

Starting with node.js

If you are new to node.js, go check out these URLS here. They more than successfully cover getting started with node.

The goal here is to answer a question that for some reason eluded my best searches for code examples. I thought I had the syntax dialed but still saw some strange responses. This page will show you definitively how to get a page title. Every time (every time the page loads at least).

How I parsed the title off a page

Here is how I did it, using cheerio and request:

/*
* MAKE REQUIREMENTS
*/
var request = require(‘request’);
var cheerio = require(‘cheerio’);

/*
* Handle Commandline Params
*/
var url = process.argv[2];

/*
* Local Vars
*/
// Define the requests default params
var request = request.defaults({
headers: { ‘User-Agent’: ‘Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:24.0) Gecko/20100101 Firefox/24.0’ },
})

// DO THE WORK!!
request(url, function (error, response, html) {
if (!error && response.statusCode == 200) {
var $ = cheerio.load(html,{ normalizeWhitespace: true, decodeEntities: true });
var title = $(‘title’).text();
console.log(“TITLE: %j”,title);
}
else {
console.log(‘ERR: %j\t%j’,error,response.statusCode);
}
});

Running the example from the command line looks like this (I’m using the 1st available parameter to pass my URL, hard-coding is for fools):

node get.title.js http://www.yahoo.com
TITLE: “Yahoo”

Conclusion

The reason I’ve posted this blog, is that this specific node.js cheerio syntax was not clearly specd:

var title = $(‘title’).text();

Enjoy toying around with node.js to parse your super-awesome pages.