Finding <script> nodes in a page
Why.. why? Just because it’s useful when pages had dynamic content in javascript. Is there a way to subsequently evaluate the javascript parsed.. that’s for another article, but for now, I’m going to assume you have node.js installed, and you have at least come idea of how to use it.
The idea
Finding all the <script> nodes in an HTML page, rendered using
‘request.get()’
.
In the example, url (in this case www.amazon.com) is resolved and the HTML loaded. The loaded HTML is then passed to cheerio using this expression:
var $ = cheerio.load(html,{ normalizeWhitespace: false, xmlMode: false, decodeEntities: true });
.. then iterated upon using the .each( ..) object method.
$(‘script’).each( function () {…
In the very simple example the follows the script is logged to the console (STDOUT) for display. In an more advanced and useful implementation, the returned javascript would be interacted with, parsed or some other action taken.
The Script
// MAKE REQUIREMENTS
var request = require(‘request’);
var cheerio = require(‘cheerio’);// Local Vars
var url = ‘https://www.amazon.com’;// Define the requests default params
var request = request.defaults({
jar: true,
headers: { ‘User-Agent’: ‘Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:24.0) Gecko/20100101 Firefox/24.0’ },
})// execute request and parse all the javascript blocks
request(formUrl, function (error, response, html) {
if (!error && response.statusCode == 200) {// load the html into cheerio
var $ = cheerio.load(html,{ normalizeWhitespace: false, xmlMode: false, decodeEntities: true });// iterate on all of the JS blocks in the page
$(‘script’).each( function () {
console.log(‘JS: %s’,$(this).text());
});
}
else {
console.log(‘ERR: %j\t%j’,error,response.statusCode);
}
});
David,
You mentioned there is a way to run JS code within cheerio. I am being vexed by exactly this problem. Have you ever written a post about it?
Thanks,
Chao