Scraping w/ Puppetteer and a Rotating Proxy

Recently while scraping, I’ve been getting captchas and eventually blocked by websites (occasionally, although I space out my scraping to not overload services). So, my first thought was to use some free public APIs that serve a list of proxies to use. After querying the API, I would then query the proxy to check it’s health. If the server responded to me, I would then attempt to use the proxy. But, most of the time I would get errors:

Screenshot 2023-10-05 at 11.50.34 PM.png

Screenshot 2023-10-05 at 11.50.26 PM.png

Essentially, the connection would fail or the request would be refused from the website I’m scraping. Maybe they are blacklisted or being overused by others? Not sure here, but I needed to find something that would work. So I caved and starting looking at paid services. I eventually came across WebShare, which gives you 10 proxies and the functionality to auto rotate them. Pretty sweet. Immediately, this worked! Instead of getting errors, puppeteer popped up without any captchas (for a while of course).

Screenshot 2023-10-06 at 12.05.26 AM.png

javascript

// scraper.js
const browser = await puppeteer.launch({
  headless: false,
  args: [`--proxy-server=p.webshare.io:80`],
})
const page = (await browser.pages())[0]

// authenticate with proxy
await page.authenticate({
  username: process.env.PROXY_USERNAME,
  password: process.env.PROXY_PASSWORD,
})

javascript

// .env
// this file should be in your .gitignore if you're using git to keep it out of source control
PROXY_USERNAME = "<YourUsername>"
PROXY_PASSWORD = "<YourPassword>"

Scraping w/ Puppetteer and a Rotating Proxy

Python Environment Setup on Mac

What Makes a Good Life?