Goutte is a battle-tested PHP web scraping library built on Guzzle and Symfony's BrowserKit and DomCrawler components. This reference covers its main capabilities, from installation through troubleshooting.
Installation
Composer:
composer require fabpot/goutte
Client Configuration
Set user agent:
$client = new Goutte\Client();
$client->setHeader('User-Agent', 'Firefox');
Set timeouts (Goutte has no timeout setters of its own; configure the underlying Guzzle client):
$guzzle = new \GuzzleHttp\Client([
    'connect_timeout' => 30, // seconds to establish the connection
    'read_timeout' => 90,    // seconds of idle time while reading
]);
$client->setClient($guzzle);
Handle cookies:
$client->getCookieJar()->set(new \Symfony\Component\BrowserKit\Cookie('session', 'foo'));
Custom client:
$stack = \GuzzleHttp\HandlerStack::create();
$guzzle = new \GuzzleHttp\Client(['handler' => $stack]);
$goutteClient = new Goutte\Client();
$goutteClient->setClient($guzzle);
Making Requests
GET request:
$crawler = $client->request('GET', 'https://example.com/products');
POST request:
$crawler = $client->request('POST', 'https://example.com/login', ['username' => '', 'password' => '']);
Upload files (BrowserKit takes a $_FILES-style array; there is no FormData class in PHP):
$crawler = $client->request('POST', 'https://example.com/upload', [], [
    'photo' => ['name' => basename($path), 'type' => 'image/jpeg', 'tmp_name' => $path, 'error' => UPLOAD_ERR_OK, 'size' => filesize($path)],
]);
Attach session:
$client->getCookieJar()->set(\Symfony\Component\BrowserKit\Cookie::fromString($sessionCookie));
Follow redirects:
$client->followRedirects(true);
$crawler = $client->request('GET', $url); // follows redirects
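You can also cap how many redirects are followed via BrowserKit's setMaxRedirects:
$client->setMaxRedirects(5); // stop after 5 consecutive redirects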
Selecting Elements
CSS selector:
$els = $crawler->filter('div > span.title');
XPath expression:
$els = $crawler->filterXPath('//h1[@class="headline"]');
Combining CSS and XPath:
$crawler->filterXPath('//div')->filter('span.title');
Matching text:
$crawler->filterXPath('//p[contains(text(), "Hello")]');
Pagination links (selectLink matches by link text directly):
$link = $crawler->selectLink('Next Page')->link();
$crawler = $client->click($link);
Extracting Data
Get text:
$text = $el->text();
Get HTML:
$html = $el->html();
Get outer HTML:
$html = $el->outerHtml();
Get attribute:
$url = $el->attr('href');
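To pull text and attributes from every matched node in one pass, DomCrawler's extract() helps; '_text' is its token for node text (the selector here is illustrative):
$rows = $crawler->filter('a.product')->extract(['_text', 'href']); // e.g. [['First product', '/p/1'], ...]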
Get raw response (from the client, not the crawler):
$response = $client->getResponse();
Interacting with Pages
Click link:
$link = $crawler->selectLink('Next')->link();
$crawler = $client->click($link);
Submit form:
$form = $crawler->selectButton('Submit')->form();
$crawler = $client->submit($form);
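Field values can be overridden at submit time by passing them as the second argument (field names here are examples):
$crawler = $client->submit($form, ['username' => 'user', 'password' => 'pass']);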
Upload file (BrowserKit file fields have an upload() helper; don't assign an UploadedFile object):
$form = $crawler->selectButton('Upload')->form();
$form['file']->upload('/path/to/file');
$crawler = $client->submit($form);
Scroll page: Goutte never executes JavaScript, so it cannot scroll or trigger lazy-loading; use a real browser driver such as Symfony Panther for that (see "Scraping JavaScript Sites" below).
Handling Responses
Check status code:
$statusCode = $client->getResponse()->getStatusCode();
if ($statusCode === 200) {
// Success
}
Get response headers:
$headers = $client->getResponse()->getHeaders();
Get response body:
$html = $client->getResponse()->getContent();
Debugging and Logging
Debug client (Guzzle's debug option dumps verbose transfer output to STDOUT):
$client->setClient(new \GuzzleHttp\Client(['debug' => true]));
Log requests:
$logger = new \Monolog\Logger('goutte');
$logger->pushHandler(new \Monolog\Handler\StreamHandler('goutte.log'));
$stack = \GuzzleHttp\HandlerStack::create();
$stack->push(\GuzzleHttp\Middleware::log($logger, new \GuzzleHttp\MessageFormatter()));
$client->setClient(new \GuzzleHttp\Client(['handler' => $stack]));
Mocking Responses
Mock response (Goutte's constructor takes no Guzzle config; attach the mocked client via setClient):
use GuzzleHttp\Handler\MockHandler;
$mock = new MockHandler([
    new \GuzzleHttp\Psr7\Response(200, ['Content-Type' => 'text/html'], '<html>...</html>')
]);
$handler = \GuzzleHttp\HandlerStack::create($mock);
$client = new Goutte\Client();
$client->setClient(new \GuzzleHttp\Client(['handler' => $handler]));
Rate Limiting
Limit per second (Guzzle ships no throttle middleware; one third-party option is spatie/guzzle-rate-limiter-middleware):
$stack = \GuzzleHttp\HandlerStack::create();
$stack->push(\Spatie\GuzzleRateLimiterMiddleware\RateLimiterMiddleware::perSecond(10)); // 10 requests per second
$client->setClient(new \GuzzleHttp\Client(['handler' => $stack]));
Dynamic throttling (\SomeProvider\DynamicThrottleMiddleware is a placeholder for your own middleware):
$stack = \GuzzleHttp\HandlerStack::create();
$stack->push(new \SomeProvider\DynamicThrottleMiddleware());
$client = new Goutte\Client();
$client->setClient(new \GuzzleHttp\Client(['handler' => $stack]));
Asynchronous Requests
Concurrent requests (Goutte itself is synchronous; go through its underlying Guzzle client for async work):
$guzzle = $client->getClient();
$promises = [
    'page1' => $guzzle->requestAsync('GET', 'https://page1.com'),
    'page2' => $guzzle->requestAsync('GET', 'https://page2.com')
];
$results = \GuzzleHttp\Promise\Utils::unwrap($promises);
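Each unwrapped result is a PSR-7 response; to keep using CSS selectors on it, feed the body to a DomCrawler (a small sketch):
foreach ($results as $name => $response) {
    $crawler = new \Symfony\Component\DomCrawler\Crawler((string) $response->getBody());
    echo $name, ': ', $crawler->filter('title')->text(), PHP_EOL;
}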
Real World Use Cases
Using with Other Libraries
Integrate with Symfony DomCrawler for more advanced filtering (the crawler Goutte returns is already a DomCrawler instance, so this is only needed to build a fresh crawler from raw HTML):
$crawler = $client->request('GET', 'https://example.com');
$domCrawler = new \Symfony\Component\DomCrawler\Crawler();
$domCrawler->addHtmlContent($client->getResponse()->getContent());
$filtered = $domCrawler->filter('div.content');
Batching and Concurrency
Improve efficiency for large scrapes by batching requests. Goutte has no batch client of its own; Guzzle's Pool, driven by the underlying client, fills that role:
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;
$requests = function () {
    yield new Request('GET', 'https://page1.com');
    yield new Request('GET', 'https://page2.com');
};
$pool = new Pool($client->getClient(), $requests(), [
    'concurrency' => 10,
    'fulfilled' => function ($response, $index) {
        // parse $response->getBody() with a DomCrawler
    },
]);
$pool->promise()->wait();
Scrape in parallel for performance using Guzzle promises:
$guzzle = $client->getClient();
$promises = [
    'page1' => $guzzle->requestAsync('GET', 'https://page1.com'),
    'page2' => $guzzle->requestAsync('GET', 'https://page2.com')
];
$results = \GuzzleHttp\Promise\Utils::settle($promises)->wait();
Best Practices
Respect robots.txt (Goutte does not check it for you; a parser such as spatie/robots-txt can gate requests):
$robots = \Spatie\Robots\Robots::create();
if ($robots->mayIndex('https://example.com/products')) {
    $crawler = $client->request('GET', 'https://example.com/products');
}
Implement rate limiting (see the spatie middleware under "Rate Limiting"):
$stack->push(\Spatie\GuzzleRateLimiterMiddleware\RateLimiterMiddleware::perSecond(10)); // 10 rps
Avoid overloading servers by capping concurrency (Pool's 'concurrency' option, shown above):
$pool = new Pool($client->getClient(), $requests(), ['concurrency' => 10]); // at most 10 in flight
Scraping JavaScript Sites
Goutte cannot render JavaScript, so hand those pages to a headless browser. One option is PuPHPeteer (nesk/puphpeteer), a PHP bridge to Puppeteer:
$puppeteer = new \Nesk\Puphpeteer\Puppeteer();
$browser = $puppeteer->launch();
$page = $browser->newPage();
$page->goto('https://example.com');
$html = $page->content();
$browser->close();
Symfony Panther, which shares Goutte's BrowserKit API, is another option.
Persisting Scraped Data
Save to JSON file:
$data = $crawler->filter('.listing')->each(function ($node) {
return $node->text();
});
file_put_contents('listings.json', json_encode($data));
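The same extracted array can go to CSV with PHP's built-in fputcsv (one listing per row in this sketch):
$fh = fopen('listings.csv', 'w');
foreach ($data as $row) {
    fputcsv($fh, [$row]);
}
fclose($fh);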
Debugging Tips
Enable Guzzle debug logging (Middleware::log requires a MessageFormatter; the log level comes third):
$stack->push(\GuzzleHttp\Middleware::log(
    $logger,
    new \GuzzleHttp\MessageFormatter(\GuzzleHttp\MessageFormatter::DEBUG),
    \Psr\Log\LogLevel::DEBUG
));
Inspect headers and response codes:
$response = $client->getResponse();
$headers = $response->getHeaders();
$statusCode = $response->getStatusCode();
Proxy and User Agent Rotation
Rotate user agents to avoid blocks:
$agents = ['Firefox', 'Chrome', /* ... */];
$client->setHeader('User-Agent', $agents[array_rand($agents)]);
Use proxies for IP rotation (Guzzle has no setProxy method; pass the proxy request option instead):
$client = new Goutte\Client();
$client->setClient(new \GuzzleHttp\Client(['proxy' => 'http://104.198.224.19:8080'])); // replace with your proxy's scheme://host:port
Real World Examples
Scrape pricing data:
$prices = $crawler->filter('.price')->each(function ($node) {
    return $node->text();
});
Extract contact info:
$contacts = $crawler->filter('.contact-list')->each(function ($node) {
    return $node->filter('a')->each(function ($link) {
        return $link->text();
    });
});
Troubleshooting
Handling captchas:
// Option 1: Use a service like AntiCaptcha
// Option 2: Rotate proxies and retry on detection
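A minimal sketch of option 2, assuming a hypothetical looksLikeCaptcha() helper and your own proxy list:
$proxies = ['http://proxy1:8080', 'http://proxy2:8080']; // placeholder proxies
foreach ($proxies as $proxy) {
    $client->setClient(new \GuzzleHttp\Client(['proxy' => $proxy]));
    $crawler = $client->request('GET', $url);
    if (!looksLikeCaptcha($client->getResponse()->getContent())) { // hypothetical detector
        break; // real page received; stop rotating
    }
}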
Scraping paginated content:
while ($crawler->selectLink('Next')->count() > 0) {
    $link = $crawler->selectLink('Next')->link();
    $crawler = $client->click($link);
    // Scrape the current page here
}