Modern websites are rarely static HTML pages. Many rely heavily on JavaScript to render content dynamically, which makes scraping tools that only fetch the raw HTML ineffective.
Puppeteer solves this problem by controlling a real Chromium browser, allowing you to interact with pages exactly like a real user.
This guide explains how to build a clean, reusable Puppeteer scraping project that can extract data from any website URL.
Why Puppeteer for Web Scraping?
Puppeteer provides:
- Full browser automation
- JavaScript rendering support
- DOM access after page load
- Ability to wait for dynamic elements
- Interaction with real user flows
It is especially useful for:
- SPAs (React, Vue, Angular)
- Pages with infinite scroll
- APIs hidden behind UI rendering
- Protected or delayed content loading
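For data that reaches the page through background API calls, it is often simpler to read the JSON responses the page already requests than to parse the rendered DOM. A minimal sketch, assuming a hypothetical helper named `collectJsonResponses` and a URL fragment such as `/api/` that identifies the calls of interest (neither is part of the project described in this guide):

```javascript
// Hypothetical helper: records the JSON bodies of responses whose URL contains
// `urlFragment`. Call it before page.goto() so early responses are not missed.
function collectJsonResponses(page, urlFragment) {
  const captured = [];
  page.on('response', async (response) => {
    try {
      const contentType = response.headers()['content-type'] || '';
      const matches = response.url().includes(urlFragment) &&
        contentType.includes('application/json');
      if (matches) {
        captured.push({ url: response.url(), body: await response.json() });
      }
    } catch (err) {
      // Some bodies are unavailable (redirects, aborted requests); skip those.
    }
  });
  return captured;
}
```

The returned array fills up as responses arrive, so it is worth waiting for a selector or a short delay after navigation before reading it.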
Project Structure
A clean scraping project should not be a single file. Instead, it should be modular:
puppeteer-scraper/
│
├── src/
│   ├── browser.js
│   ├── scraper.js
│   ├── selectors.js
│   └── utils.js
│
├── scripts/
│   └── runScraper.js
│
├── config/
│   └── sites.js
│
├── package.json
└── README.md
Step 1: Install Dependencies
npm init -y
npm install puppeteer
Step 2: Browser Initialization Module
Instead of launching a browser in every file, we centralize it.
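// src/browser.js
const puppeteer = require('puppeteer');
// Launches the single browser instance shared by the whole run.
async function launchBrowser() {
  return await puppeteer.launch({
    // "new" opts in to Chrome's new headless mode on older Puppeteer releases;
    // newer releases select the same mode with `headless: true`.
    headless: "new",
    // These flags disable Chromium's sandbox. They are commonly needed in Docker
    // or when running as root; remove them where the sandbox works.
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  });
}
module.exports = { launchBrowser };
Step 3: Generic Scraper Logic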
This is the core reusable scraping engine.
// src/scraper.js
// Navigates to `url`, optionally waits for `waitSelector`, then extracts one
// record per element matching `itemSelector`.
async function scrapePage(page, url, waitSelector, itemSelector = '[data-scrape-item]') {
  // A finite timeout keeps one unresponsive site from hanging the run (0 disables it).
  await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 30000 });
  if (waitSelector) {
    await page.waitForSelector(waitSelector, { timeout: 15000 });
  }
  // The callback runs inside the page, so the selector is passed in as an argument.
  const data = await page.evaluate((selector) => {
    const text = (el, childSelector) =>
      el.querySelector(childSelector)?.innerText.trim() || '';
    return Array.from(document.querySelectorAll(selector)).map(el => ({
      title: text(el, '[data-title]'),
      description: text(el, '[data-description]'),
      meta: text(el, '[data-meta]')
    }));
  }, itemSelector);
  return data;
}
module.exports = { scrapePage };
Step 4: Site Configuration (Reusable Approach)
Instead of hardcoding selectors, we define them in a config file.
module.exports = [
  {
    name: "Example Site 1",
    url: "https://example.com/page-1",
    waitSelector: ".content-container",
    itemSelector: ".item-card"
  },
  {
    name: "Example Site 2",
    url: "https://example.com/page-2",
    waitSelector: "#mainContent",
    itemSelector: ".result-item"
  }
];
Step 5: Utility Functions
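src/utils.js is the place for small helpers shared by the runner. One candidate, sketched here as an assumption rather than as part of the original project, is a check that every entry in config/sites.js carries the fields the scraper reads. The name `validateSites` and the list of required fields are illustrative.

```javascript
// Hypothetical helper: returns a list of human-readable problems found in the
// site config. An empty list means every entry looks usable.
function validateSites(sites) {
  const problems = [];
  sites.forEach((site, index) => {
    const label = site.name || `entry #${index}`;
    for (const field of ['name', 'url', 'itemSelector']) {
      if (typeof site[field] !== 'string' || site[field].trim() === '') {
        problems.push(`${label}: missing "${field}"`);
      }
    }
    // Only check the scheme when a non-empty url was actually provided.
    const hasUrl = typeof site.url === 'string' && site.url.trim() !== '';
    if (hasUrl && !/^https?:\/\//.test(site.url)) {
      problems.push(`${label}: url must start with http:// or https://`);
    }
  });
  return problems;
}
```

Calling it at the top of the runner and exiting when the list is non-empty surfaces typos before a browser is launched. The logging helpers used by the runner are: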
function logStart(siteName) {
  console.log(`Starting scraping: ${siteName}`);
}
function logEnd(siteName) {
  console.log(`Finished scraping: ${siteName}`);
}
module.exports = { logStart, logEnd };
Step 6: Runner Script (Main Entry Point)
This script connects everything together.
// scripts/runScraper.js
const { launchBrowser } = require('../src/browser');
const { scrapePage } = require('../src/scraper');
const sites = require('../config/sites');
const { logStart, logEnd } = require('../src/utils');
(async () => {
  const browser = await launchBrowser();
  try {
    const page = await browser.newPage();
    for (const site of sites) {
      // Errors are caught per site so one failure does not stop the run.
      try {
        logStart(site.name);
        const results = await scrapePage(
          page,
          site.url,
          site.waitSelector,
          site.itemSelector
        );
        console.log(site.name, results);
        logEnd(site.name);
      } catch (err) {
        console.error(`Error scraping ${site.name}`, err);
      }
    }
  } finally {
    // Close the browser even if something outside the per-site try block throws.
    await browser.close();
  }
})();
Step 7: Running the Project
node scripts/runScraper.js
Handling Real-World Challenges
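1. Dynamic Content Loading
Some sites load data only after scrolling or after background API calls complete. The simplest option is a fixed delay:
// page.waitForTimeout() was deprecated and later removed from Puppeteer;
// a plain timer works in any version.
await new Promise(resolve => setTimeout(resolve, 3000));
Or wait for a specific condition, which is usually more reliable than a fixed delay:
await page.waitForFunction(() => {
  return document.querySelectorAll('.item').length > 5;
});
2. Pagination Handling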
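Pagination through a "next" control follows one cycle: read the items on the current page, click the control if it exists, wait for the new content, and repeat. A sketch of that cycle as a reusable function, with a page cap so a misbehaving site cannot loop forever. The function name, option names, and defaults are assumptions for illustration.

```javascript
// Hypothetical helper: collects item text page by page until no "next"
// element is found or `maxPages` pages have been read.
async function scrapeAllPages(page, options) {
  const { itemSelector, nextSelector, maxPages = 20, settleMs = 2000 } = options;
  const all = [];
  for (let pageNo = 1; pageNo <= maxPages; pageNo++) {
    // $$eval runs the callback in the page against every matching element.
    const items = await page.$$eval(itemSelector, els =>
      els.map(el => el.innerText.trim())
    );
    all.push(...items);

    const next = await page.$(nextSelector); // null when no element matches
    if (!next) break;
    await next.click();
    // Fixed pause while the next page renders; a selector wait is often better.
    await new Promise(resolve => setTimeout(resolve, settleMs));
  }
  return all;
}
```

The same cycle written inline, without the page cap: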
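let hasNext = true;
while (hasNext) {
  const data = await page.evaluate(() => {
    return Array.from(document.querySelectorAll('.item'))
      .map(el => el.innerText);
  });
  console.log(data);
  hasNext = await page.evaluate(() => {
    const btn = document.querySelector('.next-button');
    // Stop when the button is missing or disabled; otherwise the loop never ends.
    if (btn && !btn.disabled) {
      btn.click();
      return true;
    }
    return false;
  });
  if (hasNext) {
    // Give the next page time to render. page.waitForTimeout() is no longer
    // available in recent Puppeteer releases, so a plain timer is used.
    await new Promise(resolve => setTimeout(resolve, 2000));
  }
}
3. Infinite Scroll Handling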
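Infinite-scroll pages append content as the viewport nears the bottom, so the scraper has to keep scrolling until nothing new arrives. One variant, sketched under the assumption that a stable `document.body.scrollHeight` means the feed is exhausted: jump to the bottom, give the page time to load, and stop once the height no longer changes or a round limit is reached. `scrollToEnd` and its options are hypothetical names.

```javascript
// Hypothetical helper: scrolls to the bottom repeatedly and resolves with the
// number of rounds that loaded new content (or maxRounds if the cap was hit).
async function scrollToEnd(page, { maxRounds = 30, settleMs = 500 } = {}) {
  let previousHeight = 0;
  for (let round = 0; round < maxRounds; round++) {
    // Runs in the page: scroll to the current bottom, then report the height.
    const height = await page.evaluate(() => {
      window.scrollTo(0, document.body.scrollHeight);
      return document.body.scrollHeight;
    });
    if (height === previousHeight) {
      return round; // nothing was appended since the previous round
    }
    previousHeight = height;
    // Let lazy-loaded content arrive before measuring again.
    await new Promise(resolve => setTimeout(resolve, settleMs));
  }
  return maxRounds;
}
```

An alternative that scrolls in small increments from inside the page: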
await page.evaluate(async () => {
  await new Promise((resolve) => {
    let totalHeight = 0;
    const distance = 100;
    const timer = setInterval(() => {
      window.scrollBy(0, distance);
      totalHeight += distance;
      if (totalHeight >= document.body.scrollHeight) {
        clearInterval(timer);
        resolve();
      }
    }, 200);
  });
});
Best Practices
- Always wait for selectors before scraping
- Keep selectors configurable (avoid hardcoding)
- Separate scraping logic from browser logic
- Handle errors per site
- Avoid blocking the event loop
- Respect robots.txt and legal boundaries
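Per-site error handling can go one step further by retrying transient failures, such as navigation timeouts, before giving up on a site. A minimal sketch, assuming a hypothetical `withRetry` helper. The attempt count and delay are arbitrary starting points, not recommendations from Puppeteer.

```javascript
// Hypothetical helper: runs `task` up to `attempts` times, waiting longer
// after each failure, and rethrows the last error if every attempt fails.
async function withRetry(task, { attempts = 3, baseDelayMs = 1000 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await task(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < attempts) {
        // Linear backoff: 1x, 2x, 3x ... the base delay.
        await new Promise(resolve => setTimeout(resolve, baseDelayMs * attempt));
      }
    }
  }
  throw lastError;
}
```

In the runner, the scrape call could then become `await withRetry(() => scrapePage(page, site.url, site.waitSelector))` inside the existing per-site try block.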
Conclusion
Puppeteer is a powerful tool for scraping modern websites that rely heavily on JavaScript. By structuring your project properly, you can create scalable and reusable scraping systems that work across different websites with minimal changes.
With this architecture, you can easily extend your scraper to:
- Multiple websites
- Scheduled scraping jobs
- Database storage (MongoDB, PostgreSQL)
- API-based data pipelines
- Headless automation systems