When working with text data in Python, you may need to identify and extract URLs (web addresses) from strings and text documents. Python's standard library provides everything needed to detect, validate, and extract links from text.
Using Regular Expressions
One of the most common ways to find URLs is with regular expressions (regex). Here is an example regex pattern that will match most URLs:
import re

url_regex = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"

text = "Visit my blog at https://www.myblog.com and my wiki at http://example.wiki.org!"

# The pattern contains capturing groups, so re.findall() would return
# tuples of groups; use finditer() and take the full match instead.
print([match.group(0) for match in re.finditer(url_regex, text)])
This will print out a list of all matches:
['https://www.myblog.com', 'http://example.wiki.org']
The regex matches HTTP and HTTPS URLs, addresses that start with a bare "www." prefix, and scheme-less domains followed by a path (e.g. "example.com/about"). It also avoids swallowing trailing punctuation like the "!" in the example above.
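If you only care about explicit http(s) links, a much simpler pattern is often enough and easier to maintain. The pattern below is an illustrative alternative, not the one above; it matches anything starting with "http://" or "https://" and trims common trailing punctuation:

```python
import re

# Simpler, assumption-laden pattern: only explicit http(s) URLs.
# The final character class stops the match before trailing
# punctuation such as "." or "!".
simple_url_regex = re.compile(r"https?://[^\s<>\"']+[^\s<>\"'.,!?)]", re.IGNORECASE)

text = "Visit my blog at https://www.myblog.com and my wiki at http://example.wiki.org!"
urls = simple_url_regex.findall(text)
print(urls)
```

Because this pattern has no capturing groups, findall() returns the matched strings directly.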
Validating URLs
We can take it a step further and validate that extracted strings are well-formed URLs using Python's urllib.parse module:
from urllib.parse import urlparse

def is_valid_url(url):
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc])
    except ValueError:
        return False

print(is_valid_url("https://example.com"))  # True
print(is_valid_url("example"))  # False
This checks that the parsed URL has both a scheme like "https" and a network location (the domain); a string missing either is rejected. Note that this only verifies structure, not that the address actually exists or responds.
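The two steps compose naturally: extract candidate URLs with a regex, then keep only the ones urlparse accepts. The helper below is a hypothetical sketch combining them; the simple scheme-only pattern and the punctuation-stripping step are assumptions, not part of the code above:

```python
import re
from urllib.parse import urlparse

def is_valid_url(url):
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc])
    except ValueError:
        return False

def extract_valid_urls(text):
    """Find http(s) candidates, trim trailing punctuation, validate each."""
    candidates = re.findall(r"https?://[^\s<>\"']+", text)
    trimmed = (url.rstrip(".,!?;)") for url in candidates)
    return [url for url in trimmed if is_valid_url(url)]

text = "See https://example.com, and also http://test.org!"
print(extract_valid_urls(text))
```

Trimming before validating matters: a candidate like "https://example.com," would otherwise carry the comma into the parsed result.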
Practical Usage
Some use cases where you may want to find URLs: scraping links out of downloaded web pages, cleaning or moderating user-submitted text, and mining log files for referenced resources.
The key is choosing the right technique for your data source and end goal. Regex offers flexibility, but complex patterns can be slow and can produce false positives when run over large or messy inputs.
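One way to keep regex extraction manageable at scale is to compile the pattern once and process the input as a stream of lines rather than one giant string. This is a minimal sketch under those assumptions; the pattern and the sample lines are illustrative:

```python
import re

# Compile once; re-compiling per line would waste work on large inputs.
URL_RE = re.compile(r"https?://[^\s<>\"']+")

def iter_urls(lines):
    """Yield URLs from an iterable of text lines (e.g. an open file)."""
    for line in lines:
        yield from URL_RE.findall(line)

sample = [
    "first line with https://a.example/page\n",
    "no links here\n",
    "two here: http://b.example and https://c.example/x?q=1\n",
]
print(list(iter_urls(sample)))
```

Because iter_urls is a generator, it works the same way on a list of strings or on a file object, without loading the whole document into memory.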
Hopefully this gives you a starter kit for effectively detecting links in text with Python!