
How to Scrape Google AI Overview — 2026 (Playwright)

Step-by-step 2026 guide to scrape Google AI Overview with Playwright & Puppeteer: runnable code, detection heuristics, proxy rotation, and JSON output.


If you need a reliable, repeatable way to detect and parse AI Overview modules, this guide shows how to build it with Playwright (Python) and Puppeteer (Node.js). We’ll focus on implementation: navigation, localization (hl/gl/uule), detection heuristics that don’t rely on brittle selectors, structured JSON output, proxy rotation, and containerized scale. Note: automated access to Google Search can violate Google’s Terms; review the current language before proceeding and operate at your own risk.

  • Terms reference: see the relevant section in Google’s Terms of Service about automated access and machine-readable instructions in robots.txt. Read the official document here: Google Terms of Service.

  • Litigation context: for a sense of recent enforcement posture in the U.S., see the docket for Google LLC v. SerpApi, LLC (filed Dec 2025) on CourtListener: Google v. SerpApi docket.

Preflight checklist

  • OS and runtimes: Python 3.10+ and Node.js 18+; Docker 24+ optional but recommended for CI/CD.

  • Headless browsers: Playwright (Chromium) and Puppeteer; ensure you can run headless locally.

  • Proxies: residential/ISP pool with auth; budget for rotation and bandwidth.

  • Queries and locales: a CSV/JSON of queries and desired locales; plan hl, gl, and optional uule values (a minimal example follows this list).

  • Storage: S3-compatible bucket or filesystem for HTML/screenshot artifacts and JSON outputs.
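
A minimal queries file might look like this (the shape is illustrative; adapt the fields to your pipeline):

[
    { "query": "best electric bikes 2026", "hl": "en", "gl": "us" },
    { "query": "beste e-bikes 2026", "hl": "de", "gl": "de", "uule": "" }
  ]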

Minimal Playwright (Python): navigate, detect, parse, serialize

This minimal script runs a localized query, waits for the page to settle, detects an AI Overview using structural heuristics, and serializes the output to JSON. It also captures the HTML and a screenshot for provenance.

# requirements: playwright>=1.43 (or current), pydantic>=2
  # quickstart: pip install playwright pydantic && playwright install --with-deps
  
  import asyncio, re, time
  from pathlib import Path
  from urllib.parse import quote_plus
  from typing import List, Optional
  from pydantic import BaseModel, Field
  from playwright.async_api import async_playwright
  
  class Citation(BaseModel):
      url: str
      anchor: Optional[str] = None
  
  class AIOPayload(BaseModel):
      schema_version: str = "1.0.0"
      query: str
      url: str
      locale: str
      detected: bool
      summary: Optional[str] = None
      bullets: List[str] = Field(default_factory=list)
      citations: List[Citation] = Field(default_factory=list)
      screenshot_path: Optional[str] = None
      html_path: Optional[str] = None
      fetched_at: float
  
  SEARCH_URL = "https://www.google.com/search?q={q}&hl={hl}&gl={gl}{uule}"
  
  # Optional: supply uule like "&uule=w+CAIQICINVW5pdGVkIFN0YXRlcw" (construct separately)
  
  async def detect_ai_overview(page):
      # Heuristic detection inside the page context: avoid brittle CSS classes.
      script = """
      () => {
        function textContent(el){return (el?.innerText||'').trim();}
  
        // Candidate regions with landmark roles and known headings
        const regions = Array.from(document.querySelectorAll('[role="region"], section'));
  
        let best = null;
        for (const r of regions) {
          const tt = (r.getAttribute('aria-label')||'') + ' ' + textContent(r.querySelector('h2,h3,strong,header'));
          const t = tt.toLowerCase();
          if (t.includes('overview') || t.includes('ai overview') || t.includes('ai-generated')) {
            // Basic sanity: look for multiple outbound links, list-like blocks, or Q&A chunks
            const links = Array.from(r.querySelectorAll('a[href^="http"]'));
            if (links.length >= 3) {
              best = r; break;
            }
          }
        }
  
        if (!best) return null;
  
        // Extract simple model: summary, bullets, citations
        const summary = textContent(best.querySelector('div, p'));
        const bullets = Array.from(best.querySelectorAll('li')).slice(0, 10).map(li => textContent(li)).filter(Boolean);
        const citations = Array.from(best.querySelectorAll('a[href^="http"]'))
          .filter(a => !a.href.includes('google.com'))
          .slice(0, 20)
          .map(a => ({ url: a.href, anchor: textContent(a) }));
  
        return { summary, bullets, citations };
      }
      """
      return await page.evaluate(script)
  
  async def fetch_one(query: str, hl: str = 'en', gl: str = 'us', uule: str = '') -> AIOPayload:
      async with async_playwright() as pw:
          browser = await pw.chromium.launch(headless=True)
          context = await browser.new_context(
              user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
              viewport={"width": 1366, "height": 1000}
          )
          page = await context.new_page()
  
          url = SEARCH_URL.format(q=quote_plus(query), hl=hl, gl=gl, uule=("&uule="+uule if uule else ""))
          await page.goto(url, wait_until="domcontentloaded")
  
          # Handle consent if present (role/text-based to avoid brittle selectors)
          try:
              consent = page.get_by_role("button", name=re.compile(r"accept|agree", re.I)).first
              await consent.click(timeout=3000)
          except Exception:
              pass
  
          # networkidle can stall on long-polling pages; bound the wait and continue
          try:
              await page.wait_for_load_state("networkidle", timeout=15000)
          except Exception:
              pass
  
          # Optional: short wait/jitter; AIO often lazy-loads
          await page.wait_for_timeout(1000)
  
          data = await detect_ai_overview(page)
  
          # Artifacts (create the output directory first so writes don't fail)
          Path("artifacts").mkdir(exist_ok=True)
          ts = int(time.time()*1000)
          html_path = f"artifacts/{ts}_{hl}_{gl}_page.html"
          screenshot_path = f"artifacts/{ts}_{hl}_{gl}_page.png"
          await page.screenshot(path=screenshot_path, full_page=True)
          html = await page.content()
          with open(html_path, 'w', encoding='utf-8') as f:
              f.write(html)
  
          await context.close()
          await browser.close()
  
          payload = AIOPayload(
              query=query,
              url=url,
              locale=f"{hl}-{gl}",
              detected=bool(data),
              summary=(data or {}).get('summary'),
              bullets=(data or {}).get('bullets', []),
              citations=[Citation(**c) for c in (data or {}).get('citations', [])],
              screenshot_path=screenshot_path,
              html_path=html_path,
              fetched_at=time.time()
          )
          print(payload.model_dump_json(indent=2))
          return payload
  
  if __name__ == '__main__':
      asyncio.run(fetch_one("best electric bikes 2026", hl='en', gl='us'))
  

Notes and hardening tips

  • Use role/text locators for consent banners. Prefer Playwright locators over brittle CSS classnames.

  • For exact location bias, consider uule. Community libraries can help you generate valid values; one example encoder exists here: SerpApi uule converter. Validate outputs empirically; a minimal encoder sketch follows this list.

  • Playwright’s auto-waiting and locators improve reliability; review the official guidance in Playwright docs on locators.
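
For convenience, here is a minimal sketch of the community-documented uule recipe (a fixed prefix, a length-key character, then the base64-encoded canonical place name). encode_uule is a hypothetical helper, and the canonical name must match Google's geotargeting list, so validate outputs empirically:

import base64

  # Community-documented recipe: "w+CAIQICI" + length-key char + base64(canonical name).
  # The key character's index equals the byte length of the canonical place name
  # (a single-char key covers names up to 63 bytes). Validate against live SERPs.
  _UULE_KEY = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"

  def encode_uule(canonical_name: str) -> str:
      raw = canonical_name.encode("utf-8")
      b64 = base64.b64encode(raw).decode("ascii").rstrip("=")
      return "w+CAIQICI" + _UULE_KEY[len(raw)] + b64

  # encode_uule("United States") -> "w+CAIQICINVW5pdGVkIFN0YXRlcw" (the example above)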

Minimal Puppeteer (Node.js) with stealth and JSON output

This example mirrors the Python flow and adds a stealth plugin to reduce obvious headless signals.

// package.json deps: "puppeteer-extra", "puppeteer-extra-plugin-stealth", "puppeteer", "zod"
  // quickstart: npm i puppeteer puppeteer-extra puppeteer-extra-plugin-stealth zod
  // note: the ESM imports below require "type": "module" in package.json (or a .mjs file)
  import fs from 'node:fs/promises';
  import path from 'node:path';
  import { fileURLToPath } from 'node:url';
  import puppeteer from 'puppeteer-extra';
  import StealthPlugin from 'puppeteer-extra-plugin-stealth';
  import { z } from 'zod';
  
  puppeteer.use(StealthPlugin());
  
  const __filename = fileURLToPath(import.meta.url);
  const __dirname = path.dirname(__filename);
  
  const Citation = z.object({ url: z.string(), anchor: z.string().optional() });
  const Payload = z.object({
    schema_version: z.literal('1.0.0'),
    query: z.string(),
    url: z.string(),
    locale: z.string(),
    detected: z.boolean(),
    summary: z.string().optional(),
    bullets: z.array(z.string()),
    citations: z.array(Citation),
    screenshot_path: z.string().optional(),
    html_path: z.string().optional(),
    fetched_at: z.number()
  });
  
  const SEARCH = (q, hl, gl, uule='') => `https://www.google.com/search?q=${encodeURIComponent(q)}&hl=${hl}&gl=${gl}${uule?`&uule=${uule}`:''}`;
  
  async function detectAIO(page){
    return await page.evaluate(() => {
      const pickText = el => (el?.innerText || '').trim();
      const regions = Array.from(document.querySelectorAll('[role="region"], section'));
      let candidate = null;
      for (const r of regions) {
        const tt = ((r.getAttribute('aria-label')||'') + ' ' + pickText(r.querySelector('h2,h3,strong,header'))).toLowerCase();
        if (tt.includes('overview') || tt.includes('ai overview') || tt.includes('ai-generated')) {
          const links = r.querySelectorAll('a[href^="http"]');
          if (links.length >= 3) { candidate = r; break; }
        }
      }
      if (!candidate) return null;
      const summary = pickText(candidate.querySelector('div,p'));
      const bullets = Array.from(candidate.querySelectorAll('li')).slice(0,10).map(li => pickText(li)).filter(Boolean);
      const citations = Array.from(candidate.querySelectorAll('a[href^="http"]'))
        .filter(a => !a.href.includes('google.com'))
        .slice(0,20)
        .map(a => ({ url: a.href, anchor: pickText(a) }));
      return { summary, bullets, citations };
    });
  }
  
  async function runOne(query, hl='en', gl='us', uule=''){
    const browser = await puppeteer.launch({ headless: true, args: ['--no-sandbox','--disable-setuid-sandbox','--disable-dev-shm-usage'] });
    const page = await browser.newPage();
    await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36');
    await page.setViewport({ width: 1366, height: 1000 });
  
    const url = SEARCH(query, hl, gl, uule);
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 60000 });
  
    // Consent handling with role/text heuristics
    try {
      await page.waitForSelector('button', { timeout: 3000 });
      const buttons = await page.$$('button');
      for (const b of buttons) {
        const txt = (await (await b.getProperty('innerText')).jsonValue() || '').toLowerCase();
        if (txt.includes('accept') || txt.includes('agree')) { await b.click(); break; }
      }
    } catch {}
  
    // network idle may never settle on SERPs; swallow the timeout and continue
    await page.waitForNetworkIdle({ idleTime: 1000, timeout: 15000 }).catch(() => {});
  
    const data = await detectAIO(page);
  
    const ts = Date.now();
    const artDir = path.join(__dirname, 'artifacts');
    await fs.mkdir(artDir, { recursive: true });
    const htmlPath = path.join(artDir, `${ts}_${hl}_${gl}_page.html`);
    const shotPath = path.join(artDir, `${ts}_${hl}_${gl}_page.png`);
    await page.screenshot({ path: shotPath, fullPage: true });
    await fs.writeFile(htmlPath, await page.content(), 'utf8');
  
    await browser.close();
  
    const payload = {
      schema_version: '1.0.0',
      query, url, locale: `${hl}-${gl}`, detected: !!data,
      summary: data?.summary, bullets: data?.bullets || [], citations: data?.citations || [],
      screenshot_path: shotPath, html_path: htmlPath, fetched_at: ts / 1000
    };
  
    console.log(JSON.stringify(Payload.parse(payload), null, 2));
    return payload;
  }
  
  runOne('best electric bikes 2026').catch(e => { console.error(e); process.exit(1); });
  

Why this works when you scrape Google AI Overview

  • We anchor detection to landmarks (role/aria, headings) and corroborate with link density to reduce selector brittleness.

  • We persist artifacts (HTML + screenshot) for auditing and regression tests.

  • We output a stable JSON schema (versioned) to support downstream analytics.

Detection patterns: mutation observation, artifacts, and schema

Layout shifts and lazy loads are common. A simple mutation observer can capture late-rendered blocks to improve recall.

// Inject after initial network idle; collect late content for ~2s
  await page.evaluate(() => {
    const target = document.body;
    const hits = [];
    const isAIO = (node) => {
      const t = (node.getAttribute?.('aria-label') || '').toLowerCase();
      return t.includes('overview') || t.includes('ai-generated');
    };
    const observer = new MutationObserver((mutations) => {
      for (const m of mutations) {
        for (const n of m.addedNodes) {
          if (!(n instanceof HTMLElement)) continue;
          if (isAIO(n)) {
            const links = n.querySelectorAll('a[href^="http"]');
            if (links.length >= 3) hits.push(n.outerHTML);
          }
        }
      }
    });
    observer.observe(target, { childList: true, subtree: true });
    setTimeout(() => observer.disconnect(), 2000);
    window.__AIO_HITS__ = hits;
  });
  

After this window, you can read window.__AIO_HITS__ and parse any late matches, as sketched below. Keep storing HTML and screenshots on every run; they’re invaluable when your detectors miss.
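
For example, reading the hits back with Playwright (Python); this sketch assumes the observer snippet above was injected, and data refers to the detection result from the earlier script:

# Give the ~2s observation window time to elapse, then collect late hits.
  await page.wait_for_timeout(2500)
  late_hits = await page.evaluate("() => window.__AIO_HITS__ || []")
  if late_hits and not data:
      # Re-run your extraction logic on the captured outerHTML fragments.
      print(f"late AI Overview candidates: {len(late_hits)}")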

Suggested JSON schema (extend as needed)

{
    "schema_version": "1.0.0",
    "query": "string",
    "url": "string",
    "locale": "hl-gl",
    "detected": true,
    "summary": "string | null",
    "bullets": ["string"],
    "citations": [ { "url": "string", "anchor": "string|null" } ],
    "images": [ { "url": "string", "alt": "string|null" } ],
    "provenance": {
      "screenshot_path": "string",
      "html_path": "string",
      "fetched_at": 1736700000
    }
  }
  

Anti-bot and proxy strategy (production notes)

  • Favor residential/ISP proxies over datacenter IPs; rotate per browser context and track health (success rates, response times, ban codes). Practical rotation patterns are discussed in practitioner guides like ScrapFly’s overview on rotation strategies: proxy rotation patterns. A rotation-with-backoff sketch follows this list.

  • Use human-like pacing and jitter; bound retries with exponential backoff on 429/403.

  • Keep consent handling role/text-based and conditional; don’t pin to CSS classnames that change often.

  • For Puppeteer in containers or serverless, a maintained Chromium build helps avoid glibc/headless pitfalls; Sparticuz provides a battle-tested distribution: Sparticuz/chromium.
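
As a sketch of rotation plus bounded backoff (PROXIES is a hypothetical pool, and it assumes fetch_one from the Playwright example is extended to accept a proxy dict and forward it to Playwright's launch/new_context proxy option):

import asyncio, random

  # Hypothetical pool; substitute your provider's endpoints and credentials.
  PROXIES = [
      {"server": "http://proxy-1.example.com:8000", "username": "u", "password": "p"},
      {"server": "http://proxy-2.example.com:8000", "username": "u", "password": "p"},
  ]

  async def fetch_with_rotation(query, max_attempts=4):
      for attempt in range(max_attempts):
          proxy = random.choice(PROXIES)  # naive; weight by measured health in production
          try:
              # Assumes fetch_one(query, proxy=...) forwards the dict to
              # pw.chromium.launch(proxy=proxy) or browser.new_context(proxy=proxy).
              return await fetch_one(query, proxy=proxy)
          except Exception:
              # Exponential backoff with jitter after blocks/timeouts (e.g., 429/403).
              await asyncio.sleep((2 ** attempt) + random.uniform(0, 1))
      raise RuntimeError(f"all attempts failed for: {query}")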

Scheduling, concurrency, and Docker

A simple Dockerfile makes headless browser runs reproducible in CI/CD. For Playwright, prefer their official base images to ensure system deps and browsers are present.

# Playwright container for Python scraper
  FROM mcr.microsoft.com/playwright/python:v1.43.0
  WORKDIR /app
  COPY requirements.txt ./
  RUN pip install -r requirements.txt
  COPY . .
  # Optional: enable tracing and artifacts dir
  ENV PLAYWRIGHT_BROWSERS_PATH=/ms-playwright
  CMD ["python", "playwright_scraper.py"]
  

Operational tips

  • Use a queue (e.g., Redis + Celery for Python or BullMQ for Node) and cap concurrency per container so you don’t starve CPU/memory; a minimal in-process cap is sketched after this list.

  • Consider capturing Playwright traces on failure for later inspection; see the workflow guidance in the official docs: Playwright documentation.

  • For Node fleets, tune Chromium flags like --disable-dev-shm-usage and --no-sandbox in containerized environments.
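
Before adopting a full queue, a minimal in-process cap with asyncio already prevents resource starvation; a sketch reusing fetch_one from the Playwright example:

import asyncio

  async def run_batch(queries, max_concurrency=3):
      sem = asyncio.Semaphore(max_concurrency)  # cap simultaneous browser sessions

      async def bounded(q):
          async with sem:
              return await fetch_one(q)  # fetch_one as defined earlier

      return await asyncio.gather(*(bounded(q) for q in queries), return_exceptions=True)

  # usage: asyncio.run(run_batch(["best electric bikes 2026", "best e-bike brands"]))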

Testing, monitoring, and drift management

  • Record fixtures: store HTML and screenshots alongside parsed JSON; add unit tests that validate parsing functions against these fixtures (see the sketch after this list).

  • Canary runs: maintain a small, daily set of queries across locales/devices to detect UI shifts early.

  • Drift alerts: if detection rates drop beyond a threshold or parsing fails schema validation, alert the on-call and trigger a selector re-discovery task.

  • Analytics: to tie outputs to marketing KPIs, you can map “detected presence” and “citation coverage” to visibility metrics; a practical framework is outlined here: AI search KPI frameworks.
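
A fixture test can replay stored HTML through the same detector. A sketch, assuming detect_ai_overview from the Playwright example and a hypothetical artifacts/fixture_aio.html:

import asyncio
  from pathlib import Path
  from playwright.async_api import async_playwright

  async def test_detector_against_fixture():
      html = Path("artifacts/fixture_aio.html").read_text(encoding="utf-8")
      async with async_playwright() as pw:
          browser = await pw.chromium.launch(headless=True)
          page = await browser.new_page()
          await page.set_content(html)           # replay the stored SERP offline
          data = await detect_ai_overview(page)  # same detector as the live path
          await browser.close()
      assert data is not None and data["citations"], "detector regressed on fixture"

  asyncio.run(test_detector_against_fixture())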

Troubleshooting playbook

  • Frequent 429/403 blocks: reduce concurrency, add jitter, and rotate proxies more aggressively; verify your user agent and viewport variety; ensure cookies persist for session reuse.

  • No AI Overview detected but present visually: increase post-load wait, enable the mutation observer, and review artifacts; broaden heuristics (e.g., allow synonyms for headings or adjust link count thresholds).

  • Headless failures in CI: use the official Playwright images or a maintained Chromium build; ensure shared memory size and sandbox settings are compatible in your runtime.

Next steps and alternatives

  • Policy reminder: review the most recent language in the official Google Terms of Service and consult counsel about your organization’s risk tolerance. For a recent U.S. case relevant to scraping posture, see the Google v. SerpApi docket.

  • If you don’t want to maintain scraping code but still need AI answer monitoring and KPIs, consider a GEO/AEO monitoring platform. Disclosure: Geneo is our product; it focuses on tracking AI answer presence, citations, and sentiment across engines for agencies and brands.

  • For teams continuing the DIY path, extend your stack with full localization support (uule encoders like the community project linked above), a proxy health service, and nightly canary runs. When you aggregate results, tie presence and citation share to pipeline/revenue metrics using a KPI framework like the one linked earlier.

Secondary references worth bookmarking