Regular Expressions (Regex) - Complete Guide
Table of Contents
- Introduction
- Basic Syntax
- Metacharacters
- Character Classes
- Quantifiers
- Anchors
- Groups and Capturing
- Lookaheads and Lookbehinds
- Flags/Modifiers
- Common Patterns
- Examples
- Best Practices
- Tools and Testing
Introduction
Regular expressions (regex or regexp) are powerful pattern-matching tools used to search, match, and manipulate text. They provide a concise and flexible way to identify strings of text based on specific patterns.
When to Use Regex
- Text validation (emails, phone numbers, passwords)
- Data extraction from logs or documents
- Search and replace operations
- Input sanitization
- Parsing structured text
Basic Syntax
Literal Characters
Most characters in regex match themselves literally:
hello
Matches: "hello" in "hello world"
Case Sensitivity
By default, regex is case-sensitive:
Hello
Matches: "Hello" but not "hello"
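As a minimal sketch of literal matching and case sensitivity, the following uses Python's standard `re` module. The sample text and variable names are illustrative assumptions, not part of the guide.

```python
import re

text = "Hello world, hello regex"

# A literal pattern matches exactly those characters, so lowercase
# "hello" skips the capitalized "Hello" at the start.
first = re.search(r"hello", text)
print(first.group(), first.start())   # hello 13

# With re.IGNORECASE the same pattern matches both spellings.
all_matches = re.findall(r"hello", text, flags=re.IGNORECASE)
print(all_matches)                    # ['Hello', 'hello']
```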
Metacharacters
Metacharacters have special meanings in regex and must be escaped with \ to match literally.
Character | Meaning | Example |
---|---|---|
. | Any character except newline | a.c matches "abc", "axc" |
^ | Start of string/line | ^hello matches "hello" at start |
$ | End of string/line | world$ matches "world" at end |
* | Zero or more | ab* matches "a", "ab", "abbb" |
+ | One or more | ab+ matches "ab", "abbb" but not "a" |
? | Zero or one | ab? matches "a", "ab" |
\ | Escape character | \. matches literal "." |
`|` | OR operator | `cat|dog` matches "cat" or "dog" |
[] | Character class | [abc] matches "a", "b", or "c" |
() | Grouping | (ab)+ matches "ab", "abab" |
{} | Quantifier | a{2,4} matches "aa", "aaa", "aaaa" |
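To illustrate why escaping matters, here is a small sketch in Python. It compares an unescaped dot with an escaped one and uses the standard `re.escape` helper to build the escaped form; the `price` value and variable names are made up for the example.

```python
import re

price = "3.50"

# Unescaped: "." is a metacharacter, so this pattern also accepts "3x50".
loose = re.compile(r"3.50")
# Escaped: "\." matches only a literal dot.
strict = re.compile(r"3\.50")

print(bool(loose.fullmatch("3x50")))    # True
print(bool(strict.fullmatch("3x50")))   # False

# re.escape produces the escaped form of arbitrary text.
escaped = re.escape(price)
print(escaped)                          # 3\.50
```

`re.escape` is most useful when the literal text comes from user input or configuration and may contain metacharacters.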
Character Classes
Basic Character Classes
[abc] # Matches 'a', 'b', or 'c'
[a-z] # Matches any lowercase letter
[A-Z] # Matches any uppercase letter
[0-9] # Matches any digit
[a-zA-Z] # Matches any letter
[a-zA-Z0-9] # Matches any alphanumeric character
Negated Character Classes
[^abc] # Matches any character except 'a', 'b', or 'c'
[^0-9] # Matches any non-digit character
Predefined Character Classes
Class | Equivalent (ASCII) | Description |
---|---|---|
\d | [0-9] | Any digit |
\D | [^0-9] | Any non-digit |
\w | [a-zA-Z0-9_] | Any word character |
\W | [^a-zA-Z0-9_] | Any non-word character |
\s | [ \t\r\n\f] | Any whitespace |
\S | [^ \t\r\n\f] | Any non-whitespace |
Some flavors extend these classes to Unicode by default, so the equivalents above hold exactly only in ASCII mode.
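The shorthand classes can be tried out with a short Python sketch. One caveat specific to Python 3: for `str` patterns, `\d`, `\w` and `\s` are Unicode-aware by default, so the example passes `re.ASCII` to get the ASCII-only ranges listed in the table. The sample string is an assumption for illustration.

```python
import re

sample = "Order 66: total_due = 42 USD\n"

# re.ASCII restricts \d, \w and \s to their ASCII definitions.
digits = re.findall(r"\d+", sample, flags=re.ASCII)
words = re.findall(r"\w+", sample, flags=re.ASCII)
no_space = re.sub(r"\s", "", sample, flags=re.ASCII)

print(digits)     # ['66', '42']
print(words)      # ['Order', '66', 'total_due', '42', 'USD']
print(no_space)   # Order66:total_due=42USD

# Without re.ASCII, \d also accepts non-ASCII decimal digits.
arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE
unicode_hit = re.fullmatch(r"\d", arabic_three)
ascii_hit = re.fullmatch(r"\d", arabic_three, flags=re.ASCII)
```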
Quantifiers
Basic Quantifiers
Quantifier | Meaning | Example |
---|---|---|
* | 0 or more | a* matches "", "a", "aa", "aaa" |
+ | 1 or more | a+ matches "a", "aa", "aaa" |
? | 0 or 1 | a? matches "", "a" |
Specific Quantifiers
{n} # Exactly n times
{n,} # n or more times
{n,m} # Between n and m times
Examples:
\d{3} # Exactly 3 digits
\d{3,} # 3 or more digits
\d{3,5} # Between 3 and 5 digits
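The quantifiers above can be checked with a brief Python sketch. The input strings are arbitrary examples; note how `findall` consumes text left to right, so a long digit run can yield several matches.

```python
import re

# *, + and ? applied to the preceding "b".
star = [bool(re.fullmatch(r"ab*", s)) for s in ("a", "ab", "abbb")]
plus = [bool(re.fullmatch(r"ab+", s)) for s in ("a", "ab", "abbb")]
opt = [bool(re.fullmatch(r"ab?", s)) for s in ("a", "ab", "abb")]

print(star)   # [True, True, True]
print(plus)   # [False, True, True]
print(opt)    # [True, True, False]

numbers = "12 123 1234567"
exactly_three = re.findall(r"\d{3}", numbers)
three_or_more = re.findall(r"\d{3,}", numbers)
three_to_five = re.findall(r"\d{3,5}", numbers)

print(exactly_three)   # ['123', '123', '456']
print(three_or_more)   # ['123', '1234567']
print(three_to_five)   # ['123', '12345']
```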
Greedy vs Non-Greedy
.* # Greedy: matches as much as possible
.*? # Non-greedy: matches as little as possible
.+? # Non-greedy: one or more, but as few as possible
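A small Python sketch of the difference, using a made-up log line. The greedy form runs to the last closing bracket, while the non-greedy form stops at the first one.

```python
import re

log = "[INFO] [db] started"

# Greedy: ".*" expands as far as it can, then backs off to the last "]".
greedy = re.findall(r"\[.*\]", log)
# Non-greedy: ".*?" expands only until the first "]" that lets the match succeed.
lazy = re.findall(r"\[.*?\]", log)

print(greedy)   # ['[INFO] [db]']
print(lazy)     # ['[INFO]', '[db]']

# ".+?" still needs at least one character.
shortest = re.search(r"a.+?", "abcd").group()
print(shortest)  # ab
```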
Anchors
Position Anchors
Anchor | Meaning | Example |
---|---|---|
^ | Start of string/line | ^Hello |
$ | End of string/line | world$ |
\b | Word boundary | \bword\b |
\B | Non-word boundary | \Bword\B |
Examples
^Hello$ # Matches exactly "Hello"
\bcat\b # Matches "cat" as a whole word
\d+$ # Matches digits at end of string
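The anchor examples above might be exercised in Python as follows. The test strings are illustrative assumptions.

```python
import re

# ^ and $ pin the pattern to both ends of the string.
exact = bool(re.search(r"^Hello$", "Hello"))
not_exact = bool(re.search(r"^Hello$", "Hello world"))

# \b requires a word/non-word transition on each side of "cat".
whole_words = re.findall(r"\bcat\b", "cat concat cats cat.")
# \B requires the opposite: "cat" must sit inside a longer word.
inner = re.findall(r"\Bcat\B", "concatenate cat")

# $ anchors the digits to the end of the string.
trailing = re.search(r"\d+$", "invoice 2024").group()

print(exact, not_exact)   # True False
print(whole_words)        # ['cat', 'cat']
print(inner)              # ['cat']
print(trailing)           # 2024
```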
Groups and Capturing
Basic Groups
(abc) # Capturing group
(?:abc) # Non-capturing group
Backreferences
(hello) \1 # Matches "hello hello"
(\w+) \1 # Matches repeated words like "the the"
Named Groups
(?<name>\w+) # Named group (some flavors)
(?P<name>\w+) # Named group (Python)
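A short sketch of groups in Python, which uses the `(?P<name>...)` spelling for named groups. The sample strings and the `year`/`month` group names are assumptions chosen for illustration.

```python
import re

# With a capturing group, findall returns the group's text (the last
# repetition); with a non-capturing group it returns the whole match.
captured = re.findall(r"(ab)+", "ababab ab")
whole = re.findall(r"(?:ab)+", "ababab ab")
print(captured)   # ['ab', 'ab']
print(whole)      # ['ababab', 'ab']

# Backreference: \1 must repeat whatever group 1 matched.
repeat = re.search(r"(\w+) \1", "it was the the best")
print(repeat.group(0), "|", repeat.group(1))   # the the | the

# Named groups are read back by name.
date = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "released 2024-05")
print(date.group("year"), date.groupdict())
```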
Lookaheads and Lookbehinds
Positive Lookahead
\d+(?=px) # Matches digits followed by "px"
Negative Lookahead
\d+(?!\d|px) # Matches a full run of digits NOT followed by "px" (plain \d+(?!px) still matches "10" in "100px")
Positive Lookbehind
(?<=\$)\d+ # Matches digits preceded by "$"
Negative Lookbehind
(?<![\$\d])\d+ # Matches a full run of digits NOT preceded by "$" (plain (?<!\$)\d+ still matches "00" in "$100")
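The following Python sketch tries the lookarounds on sample strings (invented for this example). It also shows a common surprise: a negative lookaround next to a quantified `\d+` can succeed on part of a number after backtracking, unless the assertion also excludes digits.

```python
import re

css = "width: 100px; height: 50em"
prices = "cost $100 or 200 units"

followed_by_px = re.findall(r"\d+(?=px)", css)
preceded_by_dollar = re.findall(r"(?<=\$)\d+", prices)
print(followed_by_px)       # ['100']
print(preceded_by_dollar)   # ['100']

# Naive negatives match fragments of the numbers they were meant to skip.
naive_ahead = re.findall(r"\d+(?!px)", css)
naive_behind = re.findall(r"(?<!\$)\d+", prices)
print(naive_ahead)          # ['10', '50']
print(naive_behind)         # ['00', '200']

# Excluding digits in the assertion forces whole-number matches.
strict_ahead = re.findall(r"\d+(?!\d|px)", css)
strict_behind = re.findall(r"(?<![$\d])\d+", prices)
print(strict_ahead)         # ['50']
print(strict_behind)        # ['200']
```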
Flags/Modifiers
Flag | Meaning | Example |
---|---|---|
i | Case insensitive | /hello/i matches "Hello" |
g | Global (find all matches) | /cat/g finds all "cat" |
m | Multiline | ^ and $ also match at the start and end of each line |
s | Dot matches newline | . also matches \n |
x | Extended (ignore whitespace) | Allows whitespace and comments in the pattern |
Flag support varies by flavor; for example, JavaScript has no x flag.
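In Python the same flags are passed as constants from the `re` module rather than as letters after the pattern. Python has no `g` flag; `re.findall` and `re.finditer` return all matches instead. A minimal sketch with an invented three-line string:

```python
import re

text = "Cat\ncat\nCAT"

# i -> re.IGNORECASE
any_case = re.findall(r"cat", text, flags=re.IGNORECASE)

# m -> re.MULTILINE: ^ and $ match at every line, not only the string ends.
whole_string = re.findall(r"^cat$", text, flags=re.IGNORECASE)
per_line = re.findall(r"^cat$", text, flags=re.IGNORECASE | re.MULTILINE)

# s -> re.DOTALL: "." also matches "\n".
dot_default = re.search(r"a.c", "a\nc")
dot_all = re.search(r"a.c", "a\nc", flags=re.DOTALL)

print(any_case)       # ['Cat', 'cat', 'CAT']
print(whole_string)   # []
print(per_line)       # ['Cat', 'cat', 'CAT']
print(dot_default is None, dot_all is not None)   # True True
```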
Common Patterns
Email Validation
^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
Phone Number (US)
^\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})$
URL
^https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)$
Password Strength
^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$
- At least 8 characters
- At least one lowercase letter
- At least one uppercase letter
- At least one digit
- At least one special character
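One way to wrap the password pattern in Python is sketched below. The function name `is_strong_password` and the test passwords are hypothetical. Note that the final character class also restricts which characters are allowed at all, so a password containing `#` is rejected.

```python
import re

PASSWORD_RE = re.compile(
    r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$"
)

def is_strong_password(candidate: str) -> bool:
    """Return True if candidate satisfies every rule in the pattern."""
    return PASSWORD_RE.fullmatch(candidate) is not None

print(is_strong_password("Passw0rd!"))    # True
print(is_strong_password("password1!"))   # False: no uppercase letter
print(is_strong_password("Sh0rt!"))       # False: fewer than 8 characters
print(is_strong_password("Passw0rd#"))    # False: "#" is not in the allowed set
```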
IP Address
^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$
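A possible Python wrapper for the IPv4 pattern, with the helper name `is_ipv4` chosen for illustration. The sketch also shows that the pattern accepts octets with leading zeros, which may or may not be desirable.

```python
import re

IPV4_RE = re.compile(
    r"^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
    r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$"
)

def is_ipv4(text: str) -> bool:
    """Return True if text is four dot-separated octets in the range 0-255."""
    return IPV4_RE.fullmatch(text) is not None

print(is_ipv4("192.168.1.1"))     # True
print(is_ipv4("256.1.1.1"))       # False: 256 is out of range
print(is_ipv4("1.2.3"))           # False: only three octets
print(is_ipv4("01.02.03.004"))    # True: leading zeros are accepted
```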
Date (MM/DD/YYYY)
^(0[1-9]|1[0-2])\/(0[1-9]|[12][0-9]|3[01])\/\d{4}$
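The date pattern checks only the shape of the text, so it accepts impossible dates such as 02/30/2024. A hedged sketch of one way to combine it with a calendar check in Python, using the standard `datetime.strptime`; the helper names are assumptions.

```python
import re
from datetime import datetime

DATE_RE = re.compile(r"^(0[1-9]|1[0-2])/(0[1-9]|[12][0-9]|3[01])/\d{4}$")

def has_date_shape(text: str) -> bool:
    """Regex check only: MM/DD/YYYY with plausible month and day ranges."""
    return DATE_RE.fullmatch(text) is not None

def is_real_date(text: str) -> bool:
    """Regex check plus a calendar check via datetime.strptime."""
    if not has_date_shape(text):
        return False
    try:
        datetime.strptime(text, "%m/%d/%Y")
    except ValueError:
        return False
    return True

print(has_date_shape("02/30/2024"), is_real_date("02/30/2024"))   # True False
print(has_date_shape("12/31/1999"), is_real_date("12/31/1999"))   # True True
print(has_date_shape("13/01/2024"))                               # False
```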
HTML Tags
<\/?[\w\s]*>|<.+[\W]>
Examples
Extract Information
# Extract email addresses
[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
# Extract phone numbers
\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}
# Extract hashtags
#\w+
# Extract URLs
https?:\/\/[^\s]+
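The four extraction patterns above can be applied with `re.findall`, as in this sketch. The message text, addresses, and phone number are invented sample data.

```python
import re

message = (
    "Contact ana@example.com or bob.smith@mail.example.org. "
    "Call (555) 123-4567. #regex #tips https://example.com/docs?page=2"
)

emails = re.findall(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", message)
phones = re.findall(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}", message)
hashtags = re.findall(r"#\w+", message)
urls = re.findall(r"https?://[^\s]+", message)

print(emails)     # ['ana@example.com', 'bob.smith@mail.example.org']
print(phones)     # ['(555) 123-4567']
print(hashtags)   # ['#regex', '#tips']
print(urls)       # ['https://example.com/docs?page=2']
```

The URL pattern stops only at whitespace, so punctuation that directly follows a URL in running text would be captured as part of it.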
Validation
# Validate credit card (basic)
^\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}$
# Validate time (24-hour)
^([01]?[0-9]|2[0-3]):[0-5][0-9]$
# Validate hex color
^#([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})$
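A minimal Python sketch that wraps the three validation patterns in a single helper. The `VALIDATORS` table and `is_valid` function are hypothetical names. The credit card pattern checks formatting only; it does not verify a checksum.

```python
import re

VALIDATORS = {
    "card": re.compile(r"^\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}$"),
    "time": re.compile(r"^([01]?[0-9]|2[0-3]):[0-5][0-9]$"),
    "hex_color": re.compile(r"^#([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})$"),
}

def is_valid(kind: str, value: str) -> bool:
    """Return True if value matches the pattern registered under kind."""
    return VALIDATORS[kind].fullmatch(value) is not None

print(is_valid("card", "1234 5678 9012 3456"))   # True
print(is_valid("card", "1234-5678-9012-345"))    # False: last group too short
print(is_valid("time", "23:59"))                 # True
print(is_valid("time", "24:00"))                 # False: hour out of range
print(is_valid("hex_color", "#1A2b3C"))          # True
print(is_valid("hex_color", "#ffff"))            # False: 4 hex digits
```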
Text Processing
# Collapse extra whitespace (replace each match with a single space)
\s+
# Split on punctuation
[.!?]+
# Find repeated words
\b(\w+)\s+\1\b
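The whitespace, punctuation, and repeated-word patterns above could be used with `re.sub`, `re.split`, and `re.findall` roughly as follows. The input strings are made up for the example.

```python
import re

# Collapse runs of whitespace to one space, then trim the ends.
messy = "  too   many\t spaces \n here  "
clean = re.sub(r"\s+", " ", messy).strip()
print(clean)       # too many spaces here

# Split on runs of sentence punctuation and drop empty pieces.
pieces = re.split(r"[.!?]+", "Wait... what? Yes!")
sentences = [p.strip() for p in pieces if p.strip()]
print(sentences)   # ['Wait', 'what', 'Yes']

# findall returns the captured word for each doubled occurrence.
doubled = re.findall(r"\b(\w+)\s+\1\b", "this is is a test test")
print(doubled)     # ['is', 'test']
```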
# Match quoted strings
"[^"]*"
Best Practices
1. Keep It Simple
- Start with simple patterns and build complexity gradually
- Break complex patterns into smaller, testable parts
2. Use Character Classes
# Good
[0-9]
# Less clear
(0|1|2|3|4|5|6|7|8|9)
3. Escape Special Characters
# To match literal dots
\.
# To match literal brackets
\[|\]
4. Use Non-Greedy Quantifiers When Needed
# Greedy (may capture too much)
<.*>
# Non-greedy (better for HTML tags)
<.*?>
5. Optimize Performance
- Use anchors (^, $) when possible
- Avoid excessive backtracking
- Use atomic groups for performance-critical applications
6. Comment Complex Patterns
# Using extended mode (x flag)
(?x)
^ # Start of string
(?=.*[a-z]) # Must contain lowercase
(?=.*[A-Z]) # Must contain uppercase
(?=.*\d) # Must contain digit
.{8,} # At least 8 characters
$ # End of string
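In Python, extended mode corresponds to `re.VERBOSE` (the inline `(?x)` form also works at the start of a pattern). The sketch below compiles the commented pattern and shows that unescaped whitespace is ignored in this mode. The names `STRONG` and `is_strong` are illustrative.

```python
import re

STRONG = re.compile(
    r"""
    ^               # start of string
    (?=.*[a-z])     # must contain a lowercase letter
    (?=.*[A-Z])     # must contain an uppercase letter
    (?=.*\d)        # must contain a digit
    .{8,}           # at least 8 characters
    $               # end of string
    """,
    re.VERBOSE,
)

def is_strong(candidate: str) -> bool:
    return STRONG.search(candidate) is not None

print(is_strong("Passw0rdX"))   # True
print(is_strong("password1"))   # False: no uppercase letter

# In verbose mode a bare space is ignored; escape it to match a real space.
spaced = re.compile(r"a b", re.VERBOSE)
literal_space = re.compile(r"a\ b", re.VERBOSE)
print(bool(spaced.fullmatch("ab")), bool(literal_space.fullmatch("a b")))   # True True
```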
7. Test Thoroughly
- Test with both matching and non-matching cases
- Consider edge cases (empty strings, special characters)
- Validate against real-world data
Tools and Testing
Online Regex Testers
- Regex101 (regex101.com) - Comprehensive with explanations
- RegExr (regexr.com) - Visual regex builder
- RegexPal (regexpal.com) - Simple testing
Programming Language Integration
JavaScript
const pattern = /\d+/g;
const text = "abc 123 def 456";
const matches = text.match(pattern); // ["123", "456"]
Python
import re
pattern = r'\d+'
text = "abc 123 def 456"
matches = re.findall(pattern, text) # ['123', '456']
PHP
$pattern = '/\d+/';
$text = "abc 123 def 456";
preg_match_all($pattern, $text, $matches); // $matches[0] is ["123", "456"]
Common Pitfalls
- Catastrophic backtracking - Avoid nested quantifiers such as (a+)+, especially on untrusted input
- Greediness - Use non-greedy quantifiers when appropriate
- Case sensitivity - Remember to use the case-insensitive flag when needed
- Escaping - Escape metacharacters whenever they should match literally
- Testing - Always test with edge cases
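As an illustration of the nested-quantifier pitfall, this Python sketch compares a nested pattern with a flat rewrite that accepts the same strings. It deliberately uses only short inputs: in a backtracking engine, the nested form can take exponentially longer on a long run of "a" followed by a non-matching character. The names `risky`, `safe`, and `same_verdict` are assumptions for the example.

```python
import re

risky = re.compile(r"^(a+)+$")   # nested quantifiers: many ways to split the run
safe = re.compile(r"^a+$")       # flat rewrite that accepts the same strings

def same_verdict(text: str) -> bool:
    """True if both patterns agree on whether text matches."""
    return bool(risky.match(text)) == bool(safe.match(text))

samples = ["", "a", "aaaa", "aaab", "baaa"]
agree = all(same_verdict(s) for s in samples)
print(agree)   # True
```

Flattening the pattern removes the ambiguity about how the run of "a" is divided between the inner and outer quantifier, which is what causes the excessive backtracking.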
This guide provides a comprehensive foundation for working with regular expressions. Practice with real examples and gradually build complexity as you become more comfortable with the syntax and concepts.