When working with text data in Python, you may need to identify and extract URLs (web addresses) from strings and text documents. Python's standard library provides everything needed to detect, validate, and extract links from text.
Using Regular Expressions
One of the most common ways to find URLs is with regular expressions (regex). Here is an example regex pattern that will match most URLs:
import re

url_regex = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"

text = "Visit my blog at https://www.myblog.com and my wiki at http://example.wiki.org!"

# The pattern contains capturing groups, so re.findall() would return
# tuples of groups; use finditer() and take the full match instead.
print([match.group(0) for match in re.finditer(url_regex, text)])
This will print out a list of all matches:
['https://www.myblog.com', 'http://example.wiki.org']
The regex matches HTTP and HTTPS URLs, addresses that start with a bare "www." prefix, and scheme-less domains followed by a path (e.g. "example.com/about"). It also avoids swallowing trailing punctuation like the "!" in the example above.
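If you only care about explicit http(s) links, a much simpler pattern is often enough and easier to maintain. The pattern below is an illustrative alternative, not the one above; it matches anything starting with "http://" or "https://" and trims common trailing punctuation:

```python
import re

# Simpler, assumption-laden pattern: only explicit http(s) URLs.
# The final character class stops the match before trailing
# punctuation such as "." or "!".
simple_url_regex = re.compile(r"https?://[^\s<>\"']+[^\s<>\"'.,!?)]", re.IGNORECASE)

text = "Visit my blog at https://www.myblog.com and my wiki at http://example.wiki.org!"
urls = simple_url_regex.findall(text)
print(urls)
```

Because this pattern has no capturing groups, findall() returns the matched strings directly.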
Validating URLs
We can take it a step further and validate that extracted strings are well-formed URLs using Python's urllib.parse module:
from urllib.parse import urlparse

def is_valid_url(url):
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc])
    except ValueError:
        return False

print(is_valid_url("https://example.com"))  # True
print(is_valid_url("example"))  # False
This checks that the parsed URL has both a scheme like "https" and a network location (the domain); a string missing either is rejected. Note that this only verifies structure, not that the address actually exists or responds.
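The two steps compose naturally: extract candidate URLs with a regex, then keep only the ones urlparse accepts. The helper below is a hypothetical sketch combining them; the simple scheme-only pattern and the punctuation-stripping step are assumptions, not part of the code above:

```python
import re
from urllib.parse import urlparse

def is_valid_url(url):
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc])
    except ValueError:
        return False

def extract_valid_urls(text):
    """Find http(s) candidates, trim trailing punctuation, validate each."""
    candidates = re.findall(r"https?://[^\s<>\"']+", text)
    trimmed = (url.rstrip(".,!?;)") for url in candidates)
    return [url for url in trimmed if is_valid_url(url)]

text = "See https://example.com, and also http://test.org!"
print(extract_valid_urls(text))
```

Trimming before validating matters: a candidate like "https://example.com," would otherwise carry the comma into the parsed result.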
Practical Usage
Some use cases where you may want to find URLs: scraping links out of downloaded web pages, cleaning or moderating user-submitted text, and mining log files for referenced resources.
The key is choosing the right technique for your data source and end goal. Regex offers flexibility, but complex patterns can be slow and can produce false positives when run over large or messy inputs.
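One way to keep regex extraction manageable at scale is to compile the pattern once and process the input as a stream of lines rather than one giant string. This is a minimal sketch under those assumptions; the pattern and the sample lines are illustrative:

```python
import re

# Compile once; re-compiling per line would waste work on large inputs.
URL_RE = re.compile(r"https?://[^\s<>\"']+")

def iter_urls(lines):
    """Yield URLs from an iterable of text lines (e.g. an open file)."""
    for line in lines:
        yield from URL_RE.findall(line)

sample = [
    "first line with https://a.example/page\n",
    "no links here\n",
    "two here: http://b.example and https://c.example/x?q=1\n",
]
print(list(iter_urls(sample)))
```

Because iter_urls is a generator, it works the same way on a list of strings or on a file object, without loading the whole document into memory.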
Hopefully this gives you a starter kit for effectively detecting links in text with Python!