Have you ever needed to find all the URLs on a website? There are many reasons why you might need to do this. In this article, we'll explore some common use cases and provide you with various methods to achieve your goal. Let's dive in!
Use Cases for Finding All URLs on a Website
- SEO Analysis: Analyzing a website's structure and content for SEO purposes.
- Broken Link Detection: Identifying broken links to improve user experience and website health.
- Competitive Analysis: Researching competitor websites to understand their content strategy.
- Web Scraping: Collecting data from a website for research or analysis.
- Website Migration: Ensuring all pages are properly redirected during a website migration.
- Content Audit: Reviewing and organizing a website's content.
- Backlink Analysis: Discovering all pages on a website that have inbound links.
Methods for Finding All URLs
Google Site Search
One quick and easy method to find URLs on a website is to use Google's site search feature. Here's how it works:
- Go to Google.com.
- In the search bar, type site:example.com, replacing example.com with the domain you want to search.
- Press Enter to see a list of indexed URLs for that domain.
Google will return a list of pages it has indexed for the specified domain. However, keep in mind that this method may not provide a complete list of all URLs, as some pages might not be indexed by Google.
Sitemaps and robots.txt
Another way to discover URLs on a website is by checking its sitemap and robots.txt files. These files provide valuable information about a website's structure and content.
Sitemap
A sitemap is an XML file that lists all the important pages of a website. It helps search engines understand the website's structure and content. Common sitemap filenames include:
- sitemap.xml
- sitemap_index.xml
- sitemap.xml.gz
To find a website's sitemap, try appending these filenames to the domain URL. For example: https://example.com/sitemap.xml
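If you'd rather pull those URLs out programmatically, here's a minimal Python sketch, assuming the requests library and a placeholder sitemap location, that fetches a sitemap and prints every <loc> entry:
# Minimal sketch: fetch a sitemap and print every <loc> entry.
# The sitemap URL below is a placeholder; swap in the one you actually find.
import requests
import xml.etree.ElementTree as ET

sitemap_url = 'https://example.com/sitemap.xml'
response = requests.get(sitemap_url, timeout=10)
root = ET.fromstring(response.content)

# Sitemap files use the sitemaps.org XML namespace; match on the tag suffix
# so both regular sitemaps and sitemap index files are handled.
for element in root.iter():
    if element.tag.endswith('loc') and element.text:
        print(element.text.strip())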
robots.txt
The robots.txt file is used to tell search engine crawlers which pages they should or shouldn't crawl. It can also contain the location of the website's sitemap. To find the robots.txt file, simply append /robots.txt to the domain URL, for example: https://example.com/robots.txt
Look for lines that contain Sitemap: followed by a URL; they point to the website's sitemap files.
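As an illustration, here's a small Python sketch, again assuming the requests library and a placeholder domain, that downloads robots.txt and prints any sitemap locations it declares:
# Minimal sketch: read robots.txt and print any declared sitemap locations.
import requests

robots_url = 'https://example.com/robots.txt'  # placeholder domain
response = requests.get(robots_url, timeout=10)

for line in response.text.splitlines():
    if line.lower().startswith('sitemap:'):
        # The sitemap URL follows the "Sitemap:" prefix.
        print(line.split(':', 1)[1].strip())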
Open Source Spider Tools
For a more comprehensive approach to finding all URLs on a website, you can use open source spider tools. These tools crawl a website and discover all the accessible pages. Some popular options include:
- Scrapy (Python): A powerful web crawling and scraping framework.
- BeautifulSoup (Python): A library for parsing HTML and XML documents, typically paired with an HTTP client such as requests to build simple crawlers.
- Puppeteer (JavaScript): A Node.js library for controlling a headless Chrome browser.
These tools provide more flexibility and control over the crawling process compared to the previous methods. They allow you to customize the crawling behavior, handle dynamic content, and extract URLs based on specific patterns or criteria.
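As a quick taste of what these tools look like in practice, here's a minimal Scrapy spider sketch. The spider name, domain, and output file are placeholders, and a real crawl would usually also set a depth limit and politeness options:
# Minimal Scrapy spider sketch: follows links on one site and records each URL.
import scrapy

class UrlSpider(scrapy.Spider):
    name = 'url_spider'                    # hypothetical spider name
    allowed_domains = ['example.com']      # keeps the crawl on one site
    start_urls = ['https://example.com']

    def parse(self, response):
        for href in response.css('a::attr(href)').getall():
            # Record the absolute URL, then keep crawling from it.
            yield {'url': response.urljoin(href)}
            yield response.follow(href, callback=self.parse)

# Run with: scrapy runspider url_spider.py -o urls.json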
Advanced Examples
in Python:
Here's a basic Python script that crawls a website and finds all URLs up to a specified depth using the requests and BeautifulSoup libraries:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

def is_valid_url(url):
    # A URL counts as valid if it has both a scheme and a host.
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)

def get_urls(url, depth):
    urls = set()
    visited_urls = set()

    def crawl(url, current_depth):
        if current_depth > depth:
            return
        visited_urls.add(url)
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        for link in soup.find_all('a'):
            href = link.get('href')
            if href is None:
                continue
            # Resolve the link against the current page to get an absolute URL.
            absolute_url = urljoin(url, href)
            if is_valid_url(absolute_url) and absolute_url not in visited_urls:
                urls.add(absolute_url)
                crawl(absolute_url, current_depth + 1)

    crawl(url, 0)
    return urls

# Example usage
start_url = 'https://example.com'
depth = 2
urls = get_urls(start_url, depth)

print(f"Found {len(urls)} URLs:")
for url in urls:
    print(url)
Here's how the script works:
- We define two helper functions: is_valid_url(url), which checks that a URL has both a scheme and a host, and get_urls(url, depth), which runs the crawl and returns the results.
- Inside the get_urls function, we initialize two sets: urls to store the discovered URLs and visited_urls to keep track of the URLs that have been visited.
- We define an inner function called crawl(url, current_depth) that performs the actual crawling recursively: it fetches the page, parses it with BeautifulSoup, resolves each link to an absolute URL, and follows links that haven't been visited yet.
- We start the crawling process by calling crawl(url, 0) with the starting URL and an initial depth of 0.
- Finally, we return the set of discovered URLs.
In the example usage, we specify the starting URL as https://example.com and a depth of 2, then print every URL the crawler found.
Note: Make sure to install the required libraries by running pip install requests beautifulsoup4.
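The script above follows every valid link it finds, including links to external sites. If you want to keep the crawl on a single domain, a small helper along these lines could be added; the is_same_domain name is my own and not part of the original script:
# Hypothetical helper (not in the script above): keep the crawl on one domain.
from urllib.parse import urlparse

def is_same_domain(url, base_url):
    # Compare hostnames so external links are recorded but not followed.
    return urlparse(url).netloc == urlparse(base_url).netloc

# Inside crawl(), recursion could then be guarded with something like:
#     if is_same_domain(absolute_url, start_url):
#         crawl(absolute_url, current_depth + 1)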
in PHP:
Now, here's a basic PHP script that crawls a website and finds all URLs up to a specified depth using PHP's built-in DOMDocument and DOMXPath classes:
<?php

function isValidUrl($url) {
    return filter_var($url, FILTER_VALIDATE_URL) !== false;
}

function getUrls($startUrl, $depth) {
    $urls = [];
    $visitedUrls = [];

    // Use a closure so the crawler can capture $urls, $visitedUrls, and $depth,
    // and call itself recursively through $crawl.
    $crawl = function ($url, $currentDepth) use (&$urls, &$visitedUrls, &$crawl, $depth) {
        if ($currentDepth > $depth) {
            return;
        }
        $visitedUrls[] = $url;

        $html = @file_get_contents($url);
        if ($html === false) {
            return;
        }

        $dom = new DOMDocument();
        @$dom->loadHTML($html);
        $xpath = new DOMXPath($dom);
        $links = $xpath->query('//a');

        foreach ($links as $link) {
            $href = $link->getAttribute('href');
            if ($href === '') {
                continue;
            }
            $absoluteUrl = getAbsoluteUrl($url, $href);
            if (isValidUrl($absoluteUrl) && !in_array($absoluteUrl, $visitedUrls)) {
                $urls[] = $absoluteUrl;
                $crawl($absoluteUrl, $currentDepth + 1);
            }
        }
    };

    $crawl($startUrl, 0);
    return $urls;
}

function getAbsoluteUrl($baseUrl, $relativeUrl) {
    // Already absolute? Return it unchanged.
    if (filter_var($relativeUrl, FILTER_VALIDATE_URL) !== false) {
        return $relativeUrl;
    }

    $parsedBaseUrl = parse_url($baseUrl);
    $scheme = $parsedBaseUrl['scheme'];
    $host = $parsedBaseUrl['host'];
    $path = $parsedBaseUrl['path'] ?? '/';

    if (strpos($relativeUrl, '/') === 0) {
        // Root-relative link.
        return $scheme . '://' . $host . $relativeUrl;
    } else {
        // Path-relative link: resolve against the base path's directory.
        $basePath = dirname($path);
        return $scheme . '://' . $host . $basePath . '/' . $relativeUrl;
    }
}

// Example usage
$startUrl = 'https://example.com';
$depth = 2;
$urls = getUrls($startUrl, $depth);

echo "Found " . count($urls) . " URLs:\n";
foreach ($urls as $url) {
    echo $url . "\n";
}
Here's how the PHP script works:
- We define three functions: isValidUrl() to validate URLs, getUrls() to run the crawl, and getAbsoluteUrl() to resolve relative links against the current page.
- Inside the getUrls function, we initialize two arrays: $urls to store the discovered URLs and $visitedUrls to keep track of the URLs that have been visited.
- We define a closure called $crawl($url, $currentDepth) that performs the actual crawling recursively: it downloads the page, parses it with DOMDocument, resolves each href to an absolute URL, and follows links that haven't been visited yet.
- We start the crawling process by calling $crawl($startUrl, 0) with the starting URL and an initial depth of 0.
- Finally, we return the array of discovered URLs.
In the example usage, we specify the starting URL as https://example.com and a depth of 2, then print every URL the crawler found.
in Node:
Here's a basic Node.js script that crawls a website and finds all URLs up to a specified depth using the axios and cheerio libraries:
const axios = require('axios');
const cheerio = require('cheerio');
const { URL } = require('url');

async function isValidUrl(url) {
  try {
    const response = await axios.head(url);
    return response.status === 200;
  } catch (error) {
    return false;
  }
}

async function getUrls(startUrl, depth) {
  const urls = new Set();
  const visitedUrls = new Set();

  async function crawl(url, currentDepth) {
    if (currentDepth > depth) {
      return;
    }
    visitedUrls.add(url);

    try {
      const response = await axios.get(url);
      const $ = cheerio.load(response.data);
      const links = $('a');

      for (let i = 0; i < links.length; i++) {
        const href = $(links[i]).attr('href');
        if (href === undefined) {
          continue;
        }
        // Resolve the link against the current page to get an absolute URL.
        const absoluteUrl = new URL(href, url).toString();
        if (!visitedUrls.has(absoluteUrl) && (await isValidUrl(absoluteUrl))) {
          urls.add(absoluteUrl);
          await crawl(absoluteUrl, currentDepth + 1);
        }
      }
    } catch (error) {
      console.error(`Error crawling ${url}: ${error.message}`);
    }
  }

  await crawl(startUrl, 0);
  return Array.from(urls);
}

// Example usage
const startUrl = 'https://example.com';
const depth = 2;

getUrls(startUrl, depth)
  .then((urls) => {
    console.log(`Found ${urls.length} URLs:`);
    urls.forEach((url) => {
      console.log(url);
    });
  })
  .catch((error) => {
    console.error(`Error: ${error.message}`);
  });
Here's how the Node.js script works:
- We import the required libraries: axios for making HTTP requests, cheerio for parsing HTML, and the URL class from the built-in url module for URL handling.
- We define two functions: isValidUrl(url), which sends a HEAD request to check that a URL responds successfully, and getUrls(startUrl, depth), which runs the crawl.
- Inside the getUrls function, we initialize two sets: urls to store the discovered URLs and visitedUrls to keep track of the URLs that have been visited.
- We define an inner async function called crawl(url, currentDepth) that performs the actual crawling recursively: it fetches the page, loads it into cheerio, resolves each href to an absolute URL, and follows links that haven't been visited yet.
- We start the crawling process by calling crawl(startUrl, 0) with the starting URL and an initial depth of 0.
- Finally, we return the discovered URLs by converting the urls set to an array using Array.from().
In the example usage, we specify the starting URL as https://example.com and a depth of 2, then print every URL the crawler found.
Note: Make sure to install the required libraries by running npm install axios cheerio.
in Rust:
Here's a basic Rust program that crawls a website and finds all URLs up to a specified depth using the reqwest and scraper crates:
use reqwest::Client;
use scraper::{Html, Selector};
use std::collections::HashSet;
use std::future::Future;
use std::pin::Pin;
use url::Url;

async fn is_valid_url(client: &Client, url: &str) -> bool {
    match client.head(url).send().await {
        Ok(response) => response.status().is_success(),
        Err(_) => false,
    }
}

async fn get_urls(start_url: &str, depth: usize) -> HashSet<String> {
    let client = Client::new();
    let mut urls = HashSet::new();
    let mut visited_urls = HashSet::new();

    // Async recursion in Rust requires a boxed future, so `crawl` is a regular
    // function that returns one explicitly.
    fn crawl<'a>(
        client: &'a Client,
        url: &'a str,
        depth: usize,
        urls: &'a mut HashSet<String>,
        visited_urls: &'a mut HashSet<String>,
    ) -> Pin<Box<dyn Future<Output = ()> + 'a>> {
        Box::pin(async move {
            if depth == 0 {
                return;
            }
            visited_urls.insert(url.to_string());

            match client.get(url).send().await {
                Ok(response) => {
                    let body = match response.text().await {
                        Ok(body) => body,
                        Err(_) => return,
                    };
                    // Collect hrefs first so the parsed document is dropped
                    // before we await inside the loop.
                    let hrefs: Vec<String> = {
                        let document = Html::parse_document(&body);
                        let selector = Selector::parse("a").unwrap();
                        document
                            .select(&selector)
                            .filter_map(|element| element.value().attr("href"))
                            .map(|href| href.to_string())
                            .collect()
                    };

                    for href in hrefs {
                        // Resolve the link against the current page.
                        let absolute_url = match Url::parse(url).and_then(|base| base.join(&href)) {
                            Ok(joined) => joined.to_string(),
                            Err(_) => continue,
                        };
                        if is_valid_url(client, &absolute_url).await
                            && !visited_urls.contains(&absolute_url)
                        {
                            urls.insert(absolute_url.clone());
                            crawl(client, &absolute_url, depth - 1, urls, visited_urls).await;
                        }
                    }
                }
                Err(e) => {
                    eprintln!("Error crawling {}: {}", url, e);
                }
            }
        })
    }

    crawl(&client, start_url, depth, &mut urls, &mut visited_urls).await;
    urls
}

#[tokio::main]
async fn main() {
    let start_url = "https://example.com";
    let depth = 2;
    let urls = get_urls(start_url, depth).await;

    println!("Found {} URLs:", urls.len());
    for url in urls {
        println!("{}", url);
    }
}
Here's how the Rust program works:
- We use the reqwest library for making HTTP requests and the scraper library for parsing HTML.
- We define two functions: is_valid_url, which sends a HEAD request to check that a URL responds successfully, and get_urls, which runs the crawl.
- Inside the get_urls function, we create a new Client instance and initialize two HashSets: urls to store the discovered URLs and visited_urls to keep track of the URLs that have been visited.
- We define an inner function called crawl that performs the actual crawling recursively; because async recursion in Rust requires a boxed future, it returns a Pin<Box<dyn Future>>.
- We start the crawling process by calling crawl with the start_url, depth, and mutable references to urls and visited_urls.
- Finally, we return the urls set containing the discovered URLs.
In the main function, we specify the starting URL as https://example.com and a depth of 2, then print every URL the crawler found.
Note: Make sure to add the following dependencies to your Cargo.toml file:
[dependencies]
reqwest = "0.11"
scraper = "0.12"
tokio = { version = "1.0", features = ["full"] }
url = "2.2"
in Java:
Here's a basic Java program that crawls a website and finds all URLs up to a specified depth using the jsoup library:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;

public class WebCrawler {

    private static boolean isValidUrl(String url) {
        try {
            URL obj = new URL(url);
            HttpURLConnection conn = (HttpURLConnection) obj.openConnection();
            conn.setRequestMethod("HEAD");
            int responseCode = conn.getResponseCode();
            return responseCode == HttpURLConnection.HTTP_OK;
        } catch (IOException e) {
            return false;
        }
    }

    private static Set<String> getUrls(String startUrl, int depth) {
        Set<String> urls = new HashSet<>();
        Set<String> visitedUrls = new HashSet<>();
        crawl(startUrl, depth, urls, visitedUrls);
        return urls;
    }

    private static void crawl(String url, int depth, Set<String> urls, Set<String> visitedUrls) {
        if (depth == 0) {
            return;
        }
        visitedUrls.add(url);
        try {
            Document document = Jsoup.connect(url).get();
            Elements links = document.select("a[href]");
            for (Element link : links) {
                String href = link.attr("abs:href");
                if (isValidUrl(href) && !visitedUrls.contains(href)) {
                    urls.add(href);
                    crawl(href, depth - 1, urls, visitedUrls);
                }
            }
        } catch (IOException e) {
            System.err.println("Error crawling " + url + ": " + e.getMessage());
        }
    }

    public static void main(String[] args) {
        String startUrl = "https://example.com";
        int depth = 2;
        Set<String> urls = getUrls(startUrl, depth);

        System.out.println("Found " + urls.size() + " URLs:");
        for (String url : urls) {
            System.out.println(url);
        }
    }
}
Here's how the Java program works:
- We use the jsoup library for making HTTP requests and parsing HTML.
- We define two methods: isValidUrl, which sends a HEAD request to check that a URL responds successfully, and getUrls, which runs the crawl.
- Inside the getUrls method, we initialize two HashSets: urls to store the discovered URLs and visitedUrls to keep track of the URLs that have been visited.
- We define a recursive method called crawl that performs the actual crawling: it fetches the page with jsoup, selects every a[href] element, and follows absolute links that haven't been visited yet.
- We start the crawling process by calling crawl with the startUrl, depth, and references to urls and visitedUrls.
- Finally, we return the urls set containing the discovered URLs.
In the main method, we specify the starting URL as https://example.com and a depth of 2, then print every URL the crawler found.
Note: Make sure to add the following dependency to your project:
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.14.3</version>
</dependency>
in C++:
Here's a basic C++ program that crawls a website and finds all URLs up to a specified depth using the libcurl and gumbo-parser libraries:
#include <iostream>
#include <string>
#include <unordered_set>

#include <curl/curl.h>
#include <gumbo.h>

// libcurl write callback: append the downloaded bytes to a std::string.
static std::size_t WriteCallback(void *contents, std::size_t size, std::size_t nmemb, void *userp) {
    ((std::string*)userp)->append((char*)contents, size * nmemb);
    return size * nmemb;
}

bool isValidUrl(const std::string& url) {
    CURL *curl = curl_easy_init();
    if (curl) {
        curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
        curl_easy_setopt(curl, CURLOPT_NOBODY, 1L);  // HEAD-style request
        CURLcode res = curl_easy_perform(curl);
        long responseCode = 0;
        curl_easy_getinfo(curl, CURLINFO_RESPONSE_CODE, &responseCode);
        curl_easy_cleanup(curl);
        return res == CURLE_OK && responseCode == 200;
    }
    return false;
}

// Walk the Gumbo parse tree and collect the href of every <a> element.
void extractUrls(GumboNode *node, std::unordered_set<std::string>& urls, const std::string& baseUrl) {
    if (node->type != GUMBO_NODE_ELEMENT) {
        return;
    }
    GumboAttribute *href = nullptr;
    if (node->v.element.tag == GUMBO_TAG_A &&
        (href = gumbo_get_attribute(&node->v.element.attributes, "href"))) {
        std::string absoluteUrl = href->value;
        // Naive resolution: prefix relative links with the base URL.
        if (absoluteUrl.substr(0, 4) != "http") {
            absoluteUrl = baseUrl + "/" + absoluteUrl;
        }
        if (isValidUrl(absoluteUrl)) {
            urls.insert(absoluteUrl);
        }
    }
    GumboVector *children = &node->v.element.children;
    for (unsigned int i = 0; i < children->length; ++i) {
        extractUrls(static_cast<GumboNode*>(children->data[i]), urls, baseUrl);
    }
}

void crawl(const std::string& url, int depth, std::unordered_set<std::string>& urls, std::unordered_set<std::string>& visitedUrls) {
    if (depth == 0 || visitedUrls.count(url) > 0) {
        return;
    }
    visitedUrls.insert(url);

    CURL *curl = curl_easy_init();
    if (curl) {
        std::string htmlContent;
        curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &htmlContent);
        CURLcode res = curl_easy_perform(curl);
        curl_easy_cleanup(curl);

        if (res == CURLE_OK) {
            GumboOutput *output = gumbo_parse(htmlContent.c_str());
            // Collect this page's links separately so we don't modify `urls`
            // while iterating over it.
            std::unordered_set<std::string> pageUrls;
            extractUrls(output->root, pageUrls, url);
            gumbo_destroy_output(&kGumboDefaultOptions, output);

            for (const std::string& newUrl : pageUrls) {
                urls.insert(newUrl);
                crawl(newUrl, depth - 1, urls, visitedUrls);
            }
        }
    }
}

std::unordered_set<std::string> getUrls(const std::string& startUrl, int depth) {
    std::unordered_set<std::string> urls;
    std::unordered_set<std::string> visitedUrls;
    crawl(startUrl, depth, urls, visitedUrls);
    return urls;
}

int main() {
    std::string startUrl = "https://example.com";
    int depth = 2;
    std::unordered_set<std::string> urls = getUrls(startUrl, depth);

    std::cout << "Found " << urls.size() << " URLs:" << std::endl;
    for (const std::string& url : urls) {
        std::cout << url << std::endl;
    }
    return 0;
}
Here's how the C++ program works:
- We use the libcurl library for making HTTP requests and the gumbo-parser library for parsing HTML.
- We define several functions: WriteCallback to collect downloaded data, isValidUrl to check that a URL responds with HTTP 200, extractUrls to walk the parse tree and collect links, crawl to drive the recursion, and getUrls to kick everything off.
- Inside the getUrls function, we initialize two unordered_sets: urls to store the discovered URLs and visitedUrls to keep track of the URLs that have been visited.
- We start the crawling process by calling crawl with the startUrl, depth, and references to urls and visitedUrls.
- Inside the crawl function: we download the page with libcurl, parse it with gumbo_parse, collect that page's links, and recursively crawl each new URL with a reduced depth.
- Finally, we return the urls set containing the discovered URLs.
In the main function, we specify the starting URL as https://example.com and a depth of 2, then print every URL the crawler found.
Note: Make sure to link against the libcurl and gumbo-parser libraries when compiling.
Conclusion
Finding all URLs on a website is a common task with various use cases. Whether you need to analyze a website's structure, detect broken links, or collect data for research, there are several methods available to accomplish this goal.
You can start with a simple Google site search to get a quick overview of indexed pages. Checking the website's sitemap and robots.txt files can provide valuable insights into its structure and important pages.
For a more comprehensive approach, open source spider tools like Scrapy, BeautifulSoup, and Puppeteer offer powerful capabilities to crawl websites and extract URLs programmatically.