There are growing needs to access Puppeteer's raw `page` object both before and after requests. My implementation of the `newpage` event was a mistake for the following three reasons:
- You cannot pass values retrieved from the `page` object to the crawling results.
- You cannot access the `page` object after requests to get cookie values, console logs, etc.
- You cannot return a `Promise`, so you have to deal with race conditions.
Thus, I'd like to introduce a new `customCrawl` feature, hoping to replace the `newpage` event with it. It goes like this:
```js
const HCCrawler = require('headless-chrome-crawler');

(async () => {
  const crawler = await HCCrawler.launch({
    customCrawl: async (page, crawl) => {
      // You can access the page object before requests
      await page.setRequestInterception(true);
      page.on('request', request => {
        if (request.url().endsWith('/')) {
          request.continue();
        } else {
          request.abort();
        }
      });
      // The result contains options, links, cookies, etc.
      const result = await crawl();
      // You can access the page object after requests
      result.content = await page.content();
      // You need to extend and return the crawled result
      return result;
    },
    onSuccess: result => {
      console.log(`Got ${result.content} for ${result.options.url}.`);
    },
  });
  await crawler.queue('https://example.com/');
  await crawler.onIdle();
  await crawler.close();
})();
```
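The contract of `customCrawl` — receive the `page` object and a `crawl` function, await `crawl()`, extend its result, and return it — can be sketched with plain functions and no browser. Here `fakePage` and `fakeCrawl` are hypothetical stand-ins for the real Puppeteer page and built-in crawl step:

```javascript
// Hypothetical stand-ins for the Puppeteer page object and the
// crawler's built-in crawl step (both are normally provided by the library).
const fakePage = { content: async () => '<html>example</html>' };
const fakeCrawl = async () => ({ options: { url: 'https://example.com/' }, links: [] });

// A customCrawl-style wrapper: await the built-in crawl, then
// extend the result with values read from the page afterwards.
const customCrawl = async (page, crawl) => {
  const result = await crawl();
  result.content = await page.content();
  return result;
};

const resultPromise = customCrawl(fakePage, fakeCrawl);
resultPromise.then(result => {
  console.log(`Got ${result.content} for ${result.options.url}.`);
});
```

Because the wrapper returns a `Promise`, the caller can reliably sequence work after it, which is exactly what the `newpage` event could not offer.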
Fixes https://github.com/yujiosaka/headless-chrome-crawler/issues/254
Fixes https://github.com/yujiosaka/headless-chrome-crawler/issues/256
Fixes https://github.com/yujiosaka/headless-chrome-crawler/pull/233