There are growing needs to access Puppeteer's raw `page` object both before and after requests. My implementation of the `newpage` event was a mistake for the following three reasons:
- You cannot pass values retrieved from the `page` object to the crawling results.
- You cannot access the `page` object after requests to get cookie values, console logs, etc.
- You cannot return a `Promise`, so you have to deal with race conditions.
Thus, I'd like to introduce a new `customCrawl` feature, hoping to replace the `newpage` event with it. It goes like this:
```js
const HCCrawler = require('headless-chrome-crawler');

(async () => {
  const crawler = await HCCrawler.launch({
    customCrawl: async (page, crawl) => {
      // You can access the page object before requests
      await page.setRequestInterception(true);
      page.on('request', request => {
        if (request.url().endsWith('/')) {
          request.continue();
        } else {
          request.abort();
        }
      });
      // The result contains options, links, cookies, etc.
      const result = await crawl();
      // You can access the page object after requests
      result.content = await page.content();
      // You need to extend and return the crawled result
      return result;
    },
    onSuccess: result => {
      console.log(`Got ${result.content} for ${result.options.url}.`);
    },
  });
  await crawler.queue('https://example.com/');
  await crawler.onIdle();
  await crawler.close();
})();
```
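The contract of `customCrawl` — receive the `page` object and a `crawl` function, await `crawl()`, extend its result, and return it — can be sketched with plain functions and no browser. Here `fakePage` and `fakeCrawl` are hypothetical stand-ins for the real Puppeteer page and built-in crawl step:

```javascript
// Hypothetical stand-ins for the Puppeteer page object and the
// crawler's built-in crawl step (both are normally provided by the library).
const fakePage = { content: async () => '<html>example</html>' };
const fakeCrawl = async () => ({ options: { url: 'https://example.com/' }, links: [] });

// A customCrawl-style wrapper: await the built-in crawl, then
// extend the result with values read from the page afterwards.
const customCrawl = async (page, crawl) => {
  const result = await crawl();
  result.content = await page.content();
  return result;
};

const resultPromise = customCrawl(fakePage, fakeCrawl);
resultPromise.then(result => {
  console.log(`Got ${result.content} for ${result.options.url}.`);
});
```

Because the wrapper returns a `Promise`, the caller can reliably sequence work after it, which is exactly what the `newpage` event could not offer.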
Fixes https://github.com/yujiosaka/headless-chrome-crawler/issues/254
Fixes https://github.com/yujiosaka/headless-chrome-crawler/issues/256
Fixes https://github.com/yujiosaka/headless-chrome-crawler/pull/233