Web Scraping Any Website with Puppeteer

Fatih Delice

Modern websites are rarely static HTML pages. Many rely heavily on JavaScript to render content dynamically, so tools that only fetch and parse the initial HTML response miss much of what a visitor actually sees.

Puppeteer solves this problem by controlling a real Chromium browser, allowing you to interact with pages exactly like a real user.

This guide explains how to build a clean, reusable Puppeteer scraping project that can be pointed at different websites by changing configuration rather than code.


Why Puppeteer for Web Scraping?

Puppeteer provides:

  • Full browser automation
  • JavaScript rendering support
  • DOM access after page load
  • Ability to wait for dynamic elements
  • Interaction with real user flows

It is especially useful for:

  • SPAs (React, Vue, Angular)
  • Pages with infinite scroll
  • Data fetched from APIs that only becomes visible once the UI renders
  • Content that is gated behind an interaction or loads after a delay

Project Structure

A clean scraping project should not be a single file. Instead, it should be modular:

puppeteer-scraper/
├── src/
│   ├── browser.js
│   ├── scraper.js
│   ├── selectors.js
│   └── utils.js
├── scripts/
│   └── runScraper.js
├── config/
│   └── sites.js
├── package.json
└── README.md

Step 1: Install Dependencies

npm init -y
npm install puppeteer

Step 2: Browser Initialization Module

Instead of launching a browser in every file, we centralize it.

src/browser.js
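const puppeteer = require('puppeteer');

// Launches one shared Chromium instance for the whole run.
async function launchBrowser() {
    return puppeteer.launch({
        // In recent Puppeteer releases (v22+), `true` selects the new headless
        // mode, so the older "new" value is no longer needed.
        headless: true,
        // These flags are often required in Docker or CI, but they weaken
        // Chromium's isolation. Drop them when you do not need them.
        args: ['--no-sandbox', '--disable-setuid-sandbox']
    });
}

module.exports = { launchBrowser };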
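Centralizing the launch also makes it easy to centralize cleanup. The sketch below is an assumption layered on top of the module above, not part of the original project: `withBrowser` is a hypothetical helper that accepts any launch function (such as `launchBrowser`) plus a task, and closes the browser even when the task throws.

```javascript
// Hypothetical helper (not part of Puppeteer): runs `task` with a browser
// obtained from `launch` and always closes that browser afterwards.
async function withBrowser(launch, task) {
    const browser = await launch();
    try {
        return await task(browser);
    } finally {
        // Runs on success and on error, so no Chromium process is left behind.
        await browser.close();
    }
}

// Example usage, assuming launchBrowser from src/browser.js:
// const title = await withBrowser(launchBrowser, async (browser) => {
//     const page = await browser.newPage();
//     await page.goto('https://example.com');
//     return page.title();
// });
```

Passing the launch function in as an argument keeps the helper independent of Puppeteer itself, which also makes it easy to exercise with a stub browser.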

Step 3: Generic Scraper Logic

This is the core reusable scraping engine.

src/scraper.js
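// Navigates to `url`, optionally waits for `waitSelector`, then extracts one
// record per element matching `itemSelector`.
async function scrapePage(page, url, waitSelector, itemSelector = '[data-scrape-item]') {
    // A finite timeout keeps one hung navigation from stalling the whole run
    // (timeout: 0 would wait forever).
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 60000 });

    if (waitSelector) {
        await page.waitForSelector(waitSelector, { timeout: 15000 });
    }

    // The callback runs inside the page, so the selector has to be passed in
    // as an argument; it cannot be captured from the Node.js scope.
    const data = await page.evaluate((selector) => {
        return Array.from(document.querySelectorAll(selector)).map(el => ({
            title: el.querySelector('[data-title]')?.innerText || '',
            description: el.querySelector('[data-description]')?.innerText || '',
            meta: el.querySelector('[data-meta]')?.innerText || ''
        }));
    }, itemSelector);

    return data;
}

module.exports = { scrapePage };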
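The engine above reads its fields from fixed `data-*` attributes, which few real sites provide. As a minimal sketch of a more flexible extraction step, the hypothetical `extractItems` function below takes the item selector and a map of field names to selectors. The function name, the `fields` parameter, and the example selectors are assumptions for illustration; `page.$$eval` is the Puppeteer call doing the work.

```javascript
// Hypothetical variant of the extraction step: selectors are passed in
// instead of being fixed inside the function.
// `fields` maps an output key to a CSS selector relative to each item,
// e.g. { title: 'h2', price: '.price' } (illustrative names only).
async function extractItems(page, itemSelector, fields) {
    // page.$$eval runs the callback in the page against all matching elements;
    // extra arguments (here `fields`) are serialized and handed to the callback.
    return page.$$eval(itemSelector, (elements, fieldMap) => {
        return elements.map((el) => {
            const record = {};
            for (const [key, selector] of Object.entries(fieldMap)) {
                const node = el.querySelector(selector);
                // Missing fields become empty strings rather than errors.
                record[key] = node ? node.textContent.trim() : '';
            }
            return record;
        });
    }, fields);
}

// Example usage:
// const rows = await extractItems(page, '.item-card', { title: 'h2', price: '.price' });
```

This sketch uses `textContent` rather than `innerText`, so it also returns text from elements hidden by CSS; swap it back if only visible text is wanted.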

Step 4: Site Configuration (Reusable Approach)

Instead of hardcoding selectors, we define them in a config file.

config/sites.js
module.exports = [
    {
        name: "Example Site 1",
        url: "https://example.com/page-1",
        waitSelector: ".content-container",
        itemSelector: ".item-card"
    },
    {
        name: "Example Site 2",
        url: "https://example.com/page-2",
        waitSelector: "#mainContent",
        itemSelector: ".result-item"
    }
];
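Because every run depends on this file, it can help to check each entry before a browser is launched. The following is a sketch only: `validateSite` is a hypothetical function, and the rules it enforces (non-empty strings, http or https URLs) are assumptions about what a usable entry looks like.

```javascript
// Hypothetical validator for entries shaped like those in config/sites.js.
// Returns a list of human-readable problems; an empty list means the entry is usable.
function validateSite(site) {
    const problems = [];

    if (typeof site.name !== 'string' || site.name.trim() === '') {
        problems.push('name must be a non-empty string');
    }

    try {
        // The URL constructor throws on malformed input.
        const parsed = new URL(site.url);
        if (!['http:', 'https:'].includes(parsed.protocol)) {
            problems.push(`unsupported protocol: ${parsed.protocol}`);
        }
    } catch {
        problems.push(`invalid url: ${site.url}`);
    }

    for (const key of ['waitSelector', 'itemSelector']) {
        if (typeof site[key] !== 'string' || site[key].trim() === '') {
            problems.push(`${key} must be a non-empty string`);
        }
    }

    return problems;
}

// Example usage:
// for (const site of require('../config/sites')) {
//     const problems = validateSite(site);
//     if (problems.length > 0) console.warn(site.name, problems);
// }
```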

Step 5: Utility Functions

src/utils.js
function logStart(siteName) {
    console.log(`Starting scraping: ${siteName}`);
}
 
function logEnd(siteName) {
    console.log(`Finished scraping: ${siteName}`);
}
 
module.exports = { logStart, logEnd };
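Two more small helpers tend to be useful in a module like this. Both are hypothetical additions rather than part of the original file: `sleep` is a promise-based pause, and `timed` wraps an async task and reports how long it took.

```javascript
// Promise-based pause, usable anywhere a fixed wait is needed.
function sleep(ms) {
    return new Promise((resolve) => setTimeout(resolve, ms));
}

// Hypothetical wrapper: runs an async task, logs how long it took,
// and returns the task's result unchanged. `log` defaults to console.log.
async function timed(label, task, log = console.log) {
    const startedAt = Date.now();
    try {
        return await task();
    } finally {
        // Logged on success and on failure alike.
        log(`${label} took ${Date.now() - startedAt} ms`);
    }
}

// Example usage:
// const results = await timed(site.name, () => scrapePage(page, site.url, site.waitSelector));
```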

Step 6: Runner Script (Main Entry Point)

This script connects everything together.

scripts/runScraper.js
const { launchBrowser } = require('../src/browser');
const { scrapePage } = require('../src/scraper');
const sites = require('../config/sites');
const { logStart, logEnd } = require('../src/utils');
 
(async () => {
    const browser = await launchBrowser();

    try {
        const page = await browser.newPage();

        for (const site of sites) {
            try {
                logStart(site.name);

                const results = await scrapePage(
                    page,
                    site.url,
                    site.waitSelector,
                    site.itemSelector
                );

                console.log(site.name, results);

                logEnd(site.name);
            } catch (err) {
                // A failure on one site should not stop the remaining sites.
                console.error(`Error scraping ${site.name}`, err);
            }
        }
    } finally {
        // Always close the browser, otherwise the Chromium process lingers.
        await browser.close();
    }
})();
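The per-site try/catch keeps one failure from ending the run, but transient problems such as a slow response are often worth a second attempt. The sketch below shows one possible approach under stated assumptions: `withRetry` is a hypothetical helper, and the attempt count and doubling delay are illustrative defaults, not recommendations from the original article.

```javascript
// Hypothetical retry helper: calls `task` up to `attempts` times, waiting
// longer after each failure (delayMs, 2 * delayMs, 4 * delayMs, ...).
async function withRetry(task, { attempts = 3, delayMs = 1000 } = {}) {
    let lastError;

    for (let attempt = 1; attempt <= attempts; attempt++) {
        try {
            return await task(attempt);
        } catch (err) {
            lastError = err;
            if (attempt < attempts) {
                const wait = delayMs * 2 ** (attempt - 1);
                await new Promise((resolve) => setTimeout(resolve, wait));
            }
        }
    }

    // Every attempt failed: surface the most recent error to the caller.
    throw lastError;
}

// Example usage inside the runner loop (scrapePage comes from src/scraper.js):
// const results = await withRetry(
//     () => scrapePage(page, site.url, site.waitSelector),
//     { attempts: 3, delayMs: 2000 }
// );
```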

Step 7: Running the Project

node scripts/runScraper.js

Handling Real-World Challenges

1. Dynamic Content Loading

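Some sites load data only after scrolling or after background API calls finish. The bluntest option is a fixed pause. Recent Puppeteer releases no longer provide page.waitForTimeout(), so a plain timer does the job:

await new Promise(resolve => setTimeout(resolve, 3000));

A fixed pause is a guess, though. It is usually more reliable to wait for a specific condition: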

await page.waitForFunction(() => {
    return document.querySelectorAll('.item').length > 5;
});
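The condition above hardcodes both the selector and the count. A parameterized version might look like the sketch below, where `waitForItemCount` is a hypothetical wrapper and the 15-second default is an assumption. It relies on `page.waitForFunction` accepting an options object followed by arguments for the in-page callback.

```javascript
// Hypothetical helper: resolves once at least `minCount` elements match
// `selector`, or rejects when `timeoutMs` elapses.
function waitForItemCount(page, selector, minCount, timeoutMs = 15000) {
    // Arguments after the options object are passed to the in-page callback,
    // because that callback cannot see variables from the Node.js scope.
    return page.waitForFunction(
        (sel, min) => document.querySelectorAll(sel).length >= min,
        { timeout: timeoutMs },
        selector,
        minCount
    );
}

// Example usage:
// await waitForItemCount(page, '.item', 6);
```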

2. Pagination Handling

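let hasNext = true;

while (hasNext) {
    const data = await page.evaluate(() => {
        return Array.from(document.querySelectorAll('.item'))
            .map(el => el.innerText);
    });

    console.log(data);

    // Click "next" inside the page; stop when the button is missing or disabled.
    hasNext = await page.evaluate(() => {
        const btn = document.querySelector('.next-button');
        if (btn && !btn.disabled) {
            btn.click();
            return true;
        }
        return false;
    });

    if (hasNext) {
        // page.waitForTimeout() is gone from recent Puppeteer releases,
        // so pause with a plain timer while the next page loads.
        await new Promise(resolve => setTimeout(resolve, 2000));
    }
}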
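The loop above prints each page and keeps going for as long as a next button exists. A reusable variant usually needs to return the collected items and stop after a bounded number of pages. The sketch below is one way to do that; `collectPages`, its option names, and the default cap of 20 pages are assumptions for illustration. It uses `page.$$eval` to read items and `page.$` to look up the next button.

```javascript
// Hypothetical pagination loop with a page cap. `wait` is injected so the
// caller decides how to pause after a click (fixed delay, selector wait, ...).
async function collectPages(page, { itemSelector, nextSelector, maxPages = 20, wait }) {
    const all = [];

    for (let pageNo = 1; pageNo <= maxPages; pageNo++) {
        // Collect the text of every item on the current page.
        const items = await page.$$eval(itemSelector, (els) =>
            els.map((el) => el.textContent.trim())
        );
        all.push(...items);

        // Do not click through to a page that will never be read.
        if (pageNo === maxPages) break;

        // page.$ resolves to null when nothing matches the selector.
        const next = await page.$(nextSelector);
        if (!next) break;

        await next.click();
        await wait(page);
    }

    return all;
}

// Example usage with a fixed two-second pause between pages:
// const items = await collectPages(page, {
//     itemSelector: '.item',
//     nextSelector: '.next-button:not([disabled])',
//     wait: () => new Promise((resolve) => setTimeout(resolve, 2000))
// });
```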

3. Infinite Scroll Handling

await page.evaluate(async () => {
    await new Promise((resolve) => {
        let totalHeight = 0;
        const distance = 100;
        const timer = setInterval(() => {
            window.scrollBy(0, distance);
            totalHeight += distance;
 
            if (totalHeight >= document.body.scrollHeight) {
                clearInterval(timer);
                resolve();
            }
        }, 200);
    });
});
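On a feed that keeps loading new content, the loop above may never reach the bottom, because the page height grows as fast as the scroll position. An alternative sketch, assuming a hypothetical `scrollUntilStable` helper with illustrative defaults, jumps to the bottom repeatedly and stops once the height no longer changes or a round limit is hit.

```javascript
// Hypothetical scroll loop: jumps to the bottom repeatedly and stops when the
// page height stops growing or `maxRounds` is reached.
// Returns the number of scroll rounds performed.
async function scrollUntilStable(page, { maxRounds = 30, pauseMs = 1000 } = {}) {
    let previousHeight = await page.evaluate(() => document.body.scrollHeight);

    for (let round = 1; round <= maxRounds; round++) {
        await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
        // Give lazy-loaded content time to arrive before measuring again.
        await new Promise((resolve) => setTimeout(resolve, pauseMs));

        const currentHeight = await page.evaluate(() => document.body.scrollHeight);
        if (currentHeight === previousHeight) {
            return round; // no new content appeared after this scroll
        }
        previousHeight = currentHeight;
    }

    return maxRounds;
}

// Example usage:
// await scrollUntilStable(page, { maxRounds: 20, pauseMs: 1500 });
```

The pause length is a trade-off: too short and the loop may stop before slow content arrives, too long and every round costs extra time.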

Best Practices

  • Always wait for selectors before scraping
  • Keep selectors configurable (avoid hardcoding)
  • Separate scraping logic from browser logic
  • Handle errors per site
  • Avoid blocking the event loop (use awaited waits rather than busy loops or synchronous I/O)
  • Respect robots.txt and legal boundaries

Conclusion

Puppeteer is a powerful tool for scraping modern websites that rely heavily on JavaScript. By structuring your project properly, you can create scalable and reusable scraping systems that work across different websites with minimal changes.

With this architecture, you can easily extend your scraper to:

  • Multiple websites
  • Scheduled scraping jobs
  • Database storage (MongoDB, PostgreSQL)
  • API-based data pipelines
  • Headless automation systems