Regular Expressions (Regex) - Complete Guide

Table of Contents

  1. Introduction
  2. Basic Syntax
  3. Metacharacters
  4. Character Classes
  5. Quantifiers
  6. Anchors
  7. Groups and Capturing
  8. Lookaheads and Lookbehinds
  9. Flags/Modifiers
  10. Common Patterns
  11. Examples
  12. Best Practices
  13. Tools and Testing

Introduction

Regular expressions (regex or regexp) are powerful pattern-matching tools used to search, match, and manipulate text. They provide a concise and flexible way to identify strings of text based on specific patterns.

When to Use Regex

  • Text validation (emails, phone numbers, passwords)
  • Data extraction from logs or documents
  • Search and replace operations
  • Input sanitization
  • Parsing structured text

Basic Syntax

Literal Characters

Most characters in regex match themselves literally:

hello

Matches: "hello" in "hello world"

Case Sensitivity

By default, regex is case-sensitive:

Hello

Matches: "Hello" but not "hello"
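
A minimal sketch of literal matching and case sensitivity, assuming Python's standard re module. The sample strings are made up for illustration.

```python
import re

# A literal pattern matches the same characters anywhere in the text.
literal = re.search(r"hello", "hello world")
found = literal.group(0) if literal else None

# Matching is case-sensitive unless a flag says otherwise.
case_sensitive = re.search(r"Hello", "hello world")            # None
case_insensitive = re.search(r"Hello", "hello world", re.IGNORECASE)

print(found)                       # hello
print(case_sensitive)              # None
print(case_insensitive.group(0))   # hello
```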

Metacharacters

Metacharacters have special meanings in regex and need to be escaped with \ to match literally.

Character  Meaning                       Example
.          Any character except newline  a.c matches "abc", "axc"
^          Start of string/line          ^hello matches "hello" at start
$          End of string/line            world$ matches "world" at end
*          Zero or more                  ab* matches "a", "ab", "abbb"
+          One or more                   ab+ matches "ab", "abbb" but not "a"
?          Zero or one                   ab? matches "a", "ab"
\          Escape character              \. matches literal "."
|          OR operator                   a|b matches "a" or "b"
[]         Character class               [abc] matches "a", "b", or "c"
()         Grouping                      (ab)+ matches "ab", "abab"
{}         Quantifier                    a{2,4} matches "aa", "aaa", "aaaa"
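
The effect of escaping can be sketched in Python as follows. The strings are illustrative; re.escape is the standard-library helper for escaping an arbitrary literal.

```python
import re

price_text = "Total: 3.50 (approx.)"

# Unescaped, "." matches any character, so the pattern also accepts "3x50".
dot_is_wildcard = re.search(r"3.50", "3x50") is not None        # True

# Escaped by hand, "\." matches only a literal dot.
escaped_dot_rejects = re.search(r"3\.50", "3x50") is None       # True

# re.escape() escapes the metacharacters in an arbitrary literal string.
needle = "(approx.)"
escaped = re.escape(needle)
match = re.search(escaped, price_text)

print(escaped)          # \(approx\.\)
print(match.group(0))   # (approx.)
```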

Character Classes

Basic Character Classes

[abc]       # Matches 'a', 'b', or 'c'
[a-z]       # Matches any lowercase letter
[A-Z]       # Matches any uppercase letter
[0-9]       # Matches any digit
[a-zA-Z]    # Matches any letter
[a-zA-Z0-9] # Matches any alphanumeric character

Negated Character Classes

[^abc]      # Matches any character except 'a', 'b', or 'c'
[^0-9]      # Matches any non-digit character

Predefined Character Classes

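Class  Equivalent       Description
\d     [0-9]            Any digit
\D     [^0-9]           Any non-digit
\w     [a-zA-Z0-9_]     Any word character
\W     [^a-zA-Z0-9_]    Any non-word character
\s     [ \t\r\n\f\v]    Any whitespace
\S     [^ \t\r\n\f\v]   Any non-whitespace

The equivalents shown assume ASCII matching. Unicode-aware engines (for example, Python 3 on text strings) also match non-ASCII digits, letters, and whitespace with \d, \w, and \s.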
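
A short Python sketch of the predefined classes. It also shows how the re.ASCII flag narrows \w to the ASCII set; the sample text is an assumption chosen to include one non-ASCII letter.

```python
import re

# "caf\u00e9_7" contains a non-ASCII letter (e with acute accent).
sample = "Order 42 shipped\tto caf\u00e9_7"

digits = re.findall(r"\d+", sample)                 # ['42', '7']
words_unicode = re.findall(r"\w+", sample)          # last item: 'caf\u00e9_7'
words_ascii = re.findall(r"\w+", sample, re.ASCII)  # last items: 'caf', '_7'
parts = re.split(r"\s+", sample)                    # split on any whitespace run

print(digits)
print(words_unicode)
print(words_ascii)
print(parts)
```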

Quantifiers

Basic Quantifiers

Quantifier  Meaning    Example
*           0 or more  a* matches "", "a", "aa", "aaa"
+           1 or more  a+ matches "a", "aa", "aaa"
?           0 or 1     a? matches "", "a"

Specific Quantifiers

{n}         # Exactly n times
{n,}        # n or more times
{n,m}       # Between n and m times

Examples:

\d{3}       # Exactly 3 digits
\d{3,}      # 3 or more digits
\d{3,5}     # Between 3 and 5 digits
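
As a rough illustration in Python, the {n,m} form can be checked against a few candidate strings. Note the difference between requiring the whole string to match and finding the pattern anywhere.

```python
import re

three_to_five = re.compile(r"\d{3,5}")

candidates = ["12", "123", "12345", "123456"]

# fullmatch() requires the whole string to fit the pattern.
accepted = [s for s in candidates if three_to_five.fullmatch(s)]

# search() finds the pattern anywhere, so "123456" still yields a 5-digit hit.
partial = three_to_five.search("123456").group(0)

print(accepted)  # ['123', '12345']
print(partial)   # 12345
```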

Greedy vs Non-Greedy

.*          # Greedy: matches as much as possible
.*?         # Non-greedy: matches as little as possible
.+?         # Non-greedy: one or more, but as few as possible
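
The difference is easiest to see on a string with several possible end points. A minimal Python sketch, with an illustrative HTML fragment:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* runs to the last ">" in the string.
greedy = re.findall(r"<.*>", html)

# Non-greedy: .*? stops at the first ">" that completes a match.
lazy = re.findall(r"<.*?>", html)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```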

Anchors

Position Anchors

Anchor  Meaning               Example
^       Start of string/line  ^Hello
$       End of string/line    world$
\b      Word boundary         \bword\b
\B      Non-word boundary     \Bword\B

Examples

^Hello$     # Matches exactly "Hello"
\bcat\b     # Matches "cat" as a whole word
\d+$        # Matches digits at end of string
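
A sketch of these anchors in Python. One flavor-specific detail is assumed here: in Python, $ also matches just before a single trailing newline, so re.fullmatch is the stricter check.

```python
import re

# ^ and $ pin the pattern to the start and end of the string.
exact = re.search(r"^Hello$", "Hello") is not None               # True
embedded = re.search(r"^Hello$", "Hello world") is not None      # False

# In Python, $ also matches before a trailing newline; fullmatch() does not.
trailing_newline = re.search(r"^Hello$", "Hello\n") is not None  # True
strict = re.fullmatch(r"Hello", "Hello\n") is not None           # False

# \b only matches where a word character meets a non-word character.
text = "cat concatenate cat's bobcat"
whole_word_starts = [m.start() for m in re.finditer(r"\bcat\b", text)]

trailing_digits = re.search(r"\d+$", "order 66").group(0)

print(whole_word_starts)  # [0, 16]
print(trailing_digits)    # 66
```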

Groups and Capturing

Basic Groups

(abc)       # Capturing group
(?:abc)     # Non-capturing group

Backreferences

(hello) \1  # Matches "hello hello"
(\w+) \1    # Matches repeated words like "the the"

Named Groups

(?<name>\w+)        # Named group (some flavors)
(?P<name>\w+)       # Named group (Python)
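
The grouping forms above can be tried out in Python as follows. The date format and sample sentences are assumptions for illustration only.

```python
import re

# Named groups (Python syntax) give each captured part a label.
date = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})",
                 "Released 2024-03-15")
fields = date.groupdict()

# A backreference (\1) must repeat exactly what group 1 captured.
repeated = re.search(r"\b(\w+) \1\b", "it was the the best")
repeated_word = repeated.group(1)

# findall() returns the group's content when the pattern has a capturing group,
# and the whole match when the group is non-capturing.
capturing = re.findall(r"(ab)+", "ab abab")        # ['ab', 'ab']
non_capturing = re.findall(r"(?:ab)+", "ab abab")  # ['ab', 'abab']

print(fields)         # {'year': '2024', 'month': '03', 'day': '15'}
print(repeated_word)  # the
```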

Lookaheads and Lookbehinds

Positive Lookahead

\d+(?=px)   # Matches digits followed by "px"

Negative Lookahead

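\d+(?!px)     # Intended: digits NOT followed by "px"; backtracking still matches "10" in "100px"
\d+(?!\d|px)  # Stricter: the whole run of digits must not be followed by "px"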

Positive Lookbehind

(?<=\$)\d+  # Matches digits preceded by "$"

Negative Lookbehind

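(?<!\$)\d+      # Intended: digits NOT preceded by "$"; still matches "00" in "$100"
(?<![\$\d])\d+  # Stricter: the run of digits must not start right after "$"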
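
A Python sketch of all four lookarounds on made-up sample strings. It also shows a common surprise: a negative lookaround next to \d+ can still match part of a number unless the pattern also forbids an adjacent digit.

```python
import re

css = "width: 100px; margin: 20em; padding: 5px"

px_values = re.findall(r"\d+(?=px)", css)         # digits followed by "px"
naive_not_px = re.findall(r"\d+(?!px)", css)      # backtracks into "100"
strict_not_px = re.findall(r"\d+(?!\d|px)", css)  # whole number must qualify

bill = "cost $45, qty 12"

after_dollar = re.findall(r"(?<=\$)\d+", bill)
naive_not_dollar = re.findall(r"(?<!\$)\d+", bill)       # picks up "5" from "$45"
strict_not_dollar = re.findall(r"(?<![\$\d])\d+", bill)

print(px_values)          # ['100', '5']
print(naive_not_px)       # ['10', '20']
print(strict_not_px)      # ['20']
print(after_dollar)       # ['45']
print(naive_not_dollar)   # ['5', '12']
print(strict_not_dollar)  # ['12']
```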

Flags/Modifiers

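Flag  Meaning                       Example
i     Case insensitive              /hello/i matches "Hello"
g     Global (find all matches)     /cat/g finds every "cat"
m     Multiline                     ^ and $ match at the start and end of each line
s     Dot matches newline           . also matches \n
x     Extended (ignore whitespace)  Allows whitespace and comments in the pattern

Flag support varies by flavor. For example, g is a JavaScript flag, while Python uses functions such as re.findall instead.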
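
In Python the same flags are passed as constants from the re module rather than as letters after the pattern. A minimal sketch, using a made-up log string:

```python
import re

log = "error: disk full\nERROR: retry\nok"

# i -> re.IGNORECASE
any_case = re.findall(r"error", log, re.IGNORECASE)

# m -> re.MULTILINE: ^ matches at the start of every line, not just the string.
first_words = re.findall(r"^\w+", log, re.MULTILINE)
first_word_only = re.findall(r"^\w+", log)

# s -> re.DOTALL: "." is allowed to match the newline between the two lines.
dot_default = re.search(r"full.ERROR", log)
dot_all = re.search(r"full.ERROR", log, re.DOTALL)

print(any_case)         # ['error', 'ERROR']
print(first_words)      # ['error', 'ERROR', 'ok']
print(first_word_only)  # ['error']
```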

Common Patterns

Email Validation

^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$

Phone Number (US)

^\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})$
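
One possible way to use the three capture groups is to normalize accepted numbers to a single format. In this sketch, normalize_phone is a hypothetical helper name, not part of any library.

```python
import re

US_PHONE = re.compile(r"^\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})$")

def normalize_phone(raw):
    """Return the number as NNN-NNN-NNNN, or None if it does not match."""
    m = US_PHONE.match(raw.strip())
    if m is None:
        return None
    # groups() holds area code, exchange, and line number.
    return "-".join(m.groups())

print(normalize_phone("(555) 123-4567"))  # 555-123-4567
print(normalize_phone("555.123.4567"))    # 555-123-4567
print(normalize_phone("555-1234"))        # None
```

Because each parenthesis is optional on its own, the pattern also accepts unbalanced input such as "(555 123-4567".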

URL

^https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)$

Password Strength

^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$
  • At least 8 characters
  • At least one lowercase letter
  • At least one uppercase letter
  • At least one digit
  • At least one special character
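
The rules listed above can be exercised with a small Python check. The function name is_strong and the sample passwords are assumptions for illustration.

```python
import re

STRONG = re.compile(
    r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$"
)

def is_strong(password):
    """True if the password satisfies every lookahead and the length rule."""
    return STRONG.match(password) is not None

print(is_strong("Passw0rd!"))  # True
print(is_strong("password"))   # False: no uppercase, digit, or special character
print(is_strong("Passw0rd#"))  # False: "#" is not in the allowed special set
print(is_strong("Pw0rd!"))     # False: fewer than 8 characters
```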

IP Address

^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$
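
The pattern repeats the same octet expression four times. One way to keep that readable in Python is to build it from a named piece; OCTET and is_ipv4 are illustrative names.

```python
import re

# One octet: 250-255, 200-249, or 0-199 (leading zeros allowed).
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"

# Three "octet." pairs followed by a final octet; {{3}} renders as {3}.
IPV4 = re.compile(rf"^(?:{OCTET}\.){{3}}{OCTET}$")

def is_ipv4(text):
    return IPV4.match(text) is not None

print(is_ipv4("192.168.1.1"))   # True
print(is_ipv4("256.1.1.1"))     # False: 256 is out of range
print(is_ipv4("1.2.3"))         # False: only three octets
print(is_ipv4("01.02.03.004"))  # True: the pattern tolerates leading zeros
```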

Date (MM/DD/YYYY)

^(0[1-9]|1[0-2])\/(0[1-9]|[12][0-9]|3[01])\/\d{4}$
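
This pattern checks only the shape of the date, so it accepts impossible dates such as 02/30/2024. A sketch of one way to combine it with a calendar check in Python (is_real_date is a hypothetical helper; the slashes need no escaping in Python):

```python
import re
from datetime import datetime

DATE_SHAPE = re.compile(r"^(0[1-9]|1[0-2])/(0[1-9]|[12][0-9]|3[01])/\d{4}$")

def is_real_date(text):
    """Shape check with the regex, then a calendar check with strptime."""
    if not DATE_SHAPE.match(text):
        return False
    try:
        datetime.strptime(text, "%m/%d/%Y")
    except ValueError:
        return False
    return True

print(DATE_SHAPE.match("02/30/2024") is not None)  # True: shape is fine
print(is_real_date("02/30/2024"))                  # False: no such day
print(is_real_date("12/31/1999"))                  # True
```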

HTML Tags

<\/?[\w\s]*>|<.+[\W]>

Examples

Extract Information

# Extract email addresses
[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}

# Extract phone numbers
\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}

# Extract hashtags
#\w+

# Extract URLs
https?:\/\/[^\s]+
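
The four extraction patterns can be run over a sample string with re.findall. The text below is made up for illustration.

```python
import re

post = ("Contact jane.doe@example.com or call (555) 123-4567. "
        "Docs: https://example.com/guide #regex #tips")

emails = re.findall(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", post)
phones = re.findall(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}", post)
hashtags = re.findall(r"#\w+", post)
urls = re.findall(r"https?://[^\s]+", post)

print(emails)    # ['jane.doe@example.com']
print(phones)    # ['(555) 123-4567']
print(hashtags)  # ['#regex', '#tips']
print(urls)      # ['https://example.com/guide']
```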

Validation

# Validate credit card (basic)
^\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}$

# Validate time (24-hour)
^([01]?[0-9]|2[0-3]):[0-5][0-9]$

# Validate hex color
^#([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})$

Text Processing

# Collapse extra whitespace (replace each match with a single space)
\s+

# Split on punctuation
[.!?]+

# Find repeated words
\b(\w+)\s+\1\b

# Match quoted strings
"[^"]*"

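The text-processing patterns above are usually paired with a substitution or split call. A minimal Python sketch, with illustrative sample strings:

```python
import re

# Collapse runs of whitespace into one space.
messy = "This  is   a \t test."
collapsed = re.sub(r"\s+", " ", messy).strip()

# Split on sentence punctuation, dropping empty pieces.
pieces = re.split(r"[.!?]+", "Stop! Who goes there? Me...")
sentences = [p.strip() for p in pieces if p.strip()]

# Find words that appear twice in a row.
dupes = re.findall(r"\b(\w+)\s+\1\b", "Paris in the the spring is is nice")

# Pull out double-quoted strings, quotes included.
quoted = re.findall(r'"[^"]*"', 'say "hi" and "bye"')

print(collapsed)  # This is a test.
print(sentences)  # ['Stop', 'Who goes there', 'Me']
print(dupes)      # ['the', 'is']
print(quoted)     # ['"hi"', '"bye"']
```
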
Best Practices

1. Keep It Simple

  • Start with simple patterns and build complexity gradually
  • Break complex patterns into smaller, testable parts

2. Use Character Classes

# Good
[0-9]

# Less clear
(0|1|2|3|4|5|6|7|8|9)

3. Escape Special Characters

# To match literal dots
\.

# To match literal brackets
\[|\]

4. Use Non-Greedy Quantifiers When Needed

# Greedy (may capture too much)
<.*>

# Non-greedy (better for HTML tags)
<.*?>

5. Optimize Performance

  • Use anchors (^, $) when possible
  • Avoid excessive backtracking
  • Use atomic groups or possessive quantifiers where the flavor supports them

6. Comment Complex Patterns

# Using extended mode (x flag)
(?x)
^                    # Start of string
(?=.*[a-z])         # Must contain lowercase
(?=.*[A-Z])         # Must contain uppercase
(?=.*\d)            # Must contain digit
.{8,}               # At least 8 characters
$                   # End of string
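
In Python, the same commented layout can be written with a triple-quoted raw string and the re.VERBOSE flag. This is a sketch; PASSWORD_RULES is an illustrative name.

```python
import re

PASSWORD_RULES = re.compile(
    r"""
    ^                 # start of string
    (?=.*[a-z])       # must contain a lowercase letter
    (?=.*[A-Z])       # must contain an uppercase letter
    (?=.*\d)          # must contain a digit
    .{8,}             # at least 8 characters
    $                 # end of string
    """,
    re.VERBOSE,
)

print(PASSWORD_RULES.match("Abcdefg1") is not None)  # True
print(PASSWORD_RULES.match("abcdefg1") is not None)  # False: no uppercase letter
```

In verbose mode, unescaped whitespace and "#" outside a character class are ignored by the engine, so a literal space or "#" must be escaped or placed inside brackets.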

7. Test Thoroughly

  • Test with both matching and non-matching cases
  • Consider edge cases (empty strings, special characters)
  • Validate against real-world data

Tools and Testing

Online Regex Testers

  • Regex101 (regex101.com) - Comprehensive with explanations
  • RegExr (regexr.com) - Visual regex builder
  • RegexPal (regexpal.com) - Simple testing

Programming Language Integration

JavaScript

const pattern = /\d+/g;
const text = "abc 123 def 456";
const matches = text.match(pattern); // ["123", "456"]

Python

import re
pattern = r'\d+'
text = "abc 123 def 456"
matches = re.findall(pattern, text)  # ['123', '456']

PHP

$pattern = '/\d+/';
$text = "abc 123 def 456";
preg_match_all($pattern, $text, $matches); // $matches[0] is ["123", "456"]

Common Pitfalls

  1. Catastrophic backtracking - Avoid nested quantifiers such as (a+)+, which can take exponential time on input that almost matches
  2. Greediness - Use non-greedy quantifiers when a greedy one captures too much
  3. Case sensitivity - Remember to use the case-insensitive flag when needed
  4. Escaping - Escape metacharacters whenever they should match literally
  5. Testing - Always test with edge cases
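
The first pitfall can be observed with a deliberately small experiment in Python. The input length of 18 is an assumption chosen to keep the run short; with a backtracking engine, each extra "a" roughly doubles the work of the nested pattern, so larger values should be tried with care.

```python
import re
import time

nested = re.compile(r"^(a+)+$")  # nested quantifiers
flat = re.compile(r"^a+$")       # accepts the same strings without nesting

def time_match(pattern, text):
    """Return (match result, elapsed seconds) for one match attempt."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start

# The trailing "b" makes the match fail, forcing the engine to backtrack.
almost = "a" * 18 + "b"
nested_result, nested_secs = time_match(nested, almost)
flat_result, flat_secs = time_match(flat, almost)

print(nested_result, flat_result)  # None None
print(f"nested: {nested_secs:.4f}s, flat: {flat_secs:.6f}s")
```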

This guide provides a comprehensive foundation for working with regular expressions. Practice with real examples and gradually build complexity as you become more comfortable with the syntax and concepts.