Most analysts know SQL regex exists and can’t quite remember the syntax, so they reach for Python. The result is a 200-line notebook doing what a six-line warehouse query would have done. This is that six-line reference: six patterns, each one a real production question. This article is the cheatsheet we hand new analysts on day one: six concrete patterns, each one a real question we’ve actually been asked to answer.
In This Article
- What regex is, in one paragraph
- 1. Count the vowels in each string
- 2. Count the consonants (via negation)
- 3. Strings with exactly three digits
- 4. Phone numbers with two or more dashes
- 5. Names that start with a capital letter
- 6. Phone numbers matching a strict pattern
- Cross-dialect notes
- When not to use regex in SQL
- Takeaway
We’ll use PostgreSQL’s regexp_matches() throughout because it’s the cleanest of the SQL regex APIs and because every modern warehouse (Snowflake, BigQuery, Redshift, ClickHouse) has a near-identical equivalent. Where the syntax diverges, we’ll flag it.
What regex is, in one paragraph
A regular expression is a search pattern. The pattern bat matches bat, combat, and battalion. The pattern [aeiou] matches any single vowel. The pattern ^[A-Z] matches any string that starts with an uppercase letter. regexp_matches(column, pattern, 'g') returns one row per match — that’s what makes it useful in SQL: count the rows, you get the count of matches.
The 'g' flag is “global” — find all matches, not just the first. In SQL it’s the one you want 90% of the time.
1. Count the vowels in each string
The first thing every analyst writes when they discover regex.
with example_table as (
select * from (values
(1, 'google'),
(2, 'yahoo'),
(3, 'bing'),
(4, 'rambler')
) as map(id, source_type)
)
select source_type, count(1) as vowel_count
from (
select *, regexp_matches(source_type, '([aeiou])', 'g') as pattern
from example_table
) as t
group by source_type;
The pattern ([aeiou]) matches one vowel. With the 'g' flag, regexp_matches emits one row per vowel found, so count(1) after group by source_type gives you the per-string vowel count.
Where it fails: Case sensitivity. [aeiou] doesn’t match A, E, etc. Either uppercase the column first with lower(source_type) or expand the pattern to [aeiouAEIOU]. The first is more reliable when the data has UTF-8 vowels (é, ü) that the bracket class doesn’t cover.
2. Count the consonants (via negation)
The naive way: list every consonant in a bracket class. The elegant way: invert. [^aeiou] matches any character that is not a vowel, which on a lowercased alphabetic string equals the consonant count.
with example_table as (
select * from (values
(1, 'google'), (2, 'yahoo'), (3, 'bing'), (4, 'rambler')
) as map(id, source_type)
)
select source_type, count(1) as consonant_count
from (
select *, regexp_matches(source_type, '([^aeiou])', 'g') as pattern
from example_table
) as t
group by source_type;
The ^ inside [ ] is the negation operator. Outside brackets it means “start of string.” Two completely different meanings for the same character. This is a source of bugs, not elegance.
Where it fails: This pattern matches digits, spaces, punctuation, and anything else that isn’t a vowel. On the string google.com it counts the period and the m as “consonants.” For real consonant counting, tighten to [bcdfghjklmnpqrstvwxyz] or filter to alphabetic first with [^aeiou[:alpha:]] — but at that point you’ve lost the elegance and might as well enumerate.
3. Strings with exactly three digits
A more interesting question: which rows contain exactly three digit characters? This is where HAVING earns its keep.
with example_table as (
select * from (values
(1, '1a2s3d'),
(2, 'qw12e'),
(3, 'q56we1651qwe'),
(4, 'qw4e2')
) as map(id, source_type)
)
select source_type, count(*) as digit_count
from (
select *, regexp_matches(source_type, '\d', 'g') as pattern
from example_table
) as t
group by source_type
having count(*) = 3;
\d is shorthand for [0-9]. Same idea: emit a row per digit, group, then filter the groups by total count.
In our example only row 1 (1a2s3d, three digits) survives the HAVING. Row 2 has two digits, row 3 has six, row 4 has two — all filtered out.
Where it fails: \d in some SQL dialects matches Unicode digits, not just ASCII 0-9. If your column might contain Arabic-Indic digits (٠١٢) or Devanagari (०१२), you’ll get matches you didn’t expect. Use [0-9] if you mean ASCII-only.
4. Phone numbers with two or more dashes
Switching from synthetic data to a real analyst task: find phone numbers in messy free-text input that are missing the canonical formatting. A useful proxy: how many dashes does the string contain?
with example_table as (
select * from (values
(1, '617-555-0142'),
(2, '6175550142'),
(3, '+16175550142'),
(4, '1-617-555-0142')
) as map(id, source_type)
)
select source_type,
case when count(1) >= 2 then 'True' else 'False' end as has_two_dashes
from (
select *, regexp_matches(source_type, '-', 'g') as pattern
from example_table
) as t
group by source_type;
Note: this query silently drops rows with zero dashes (row 2 here). To include them you need a left join back to the original table, since regexp_matches returns no rows for non-matches and the grouping has nothing to anchor on.
select e.source_type,
case when count(t.pattern) >= 2 then 'True' else 'False' end as has_two_dashes
from example_table e
left join lateral (
select regexp_matches(e.source_type, '-', 'g') as pattern
) t on true
group by e.source_type;
The LATERAL join runs regexp_matches per row and keeps the row even when no dashes are found. This is the pattern we use in production.
Where it fails: This counts the number of dashes, not whether the dashes are in valid positions. ---abc--- has six dashes and would pass. A real phone-format validator needs a positional pattern, which is Task 6 below.
5. Names that start with a capital letter
A common data-cleaning task: find rows where the value looks like it was capitalized correctly. ^[A-Z] does it.
with example_table as (
select * from (values
(1, 'alex'),
(2, 'Alex'),
(3, 'Vasya'),
(4, 'petya')
) as map(id, source_type)
)
select source_type
from (
select *, regexp_matches(source_type, '^[A-Z]', 'g') as pattern
from example_table
) as t
group by source_type;
This returns only Alex and Vasya. The ^ outside brackets means “anchored to the start of the string”; [A-Z] is the bracket class for uppercase ASCII letters.
Where it fails: Names like O'Brien, MacDonald, de la Vega, or 李明 need different patterns. ^[A-Z] will miss 李明 entirely (no ASCII uppercase) and miss de la Vega (lowercase start, but a proper name). If you’re cleaning a multinational customer table, do not lean on this pattern as a “valid name” check — use it as a “probably formatted as Title Case” check and treat it as a heuristic, not a rule.
6. Phone numbers matching a strict pattern
The most useful regex an analyst writes: validate a phone-number column. North American numbers have a few common formats — +1XXXXXXXXXX, (XXX) XXX-XXXX, XXX-XXX-XXXX. Catching the canonical form is straightforward:
with example_table as (
select * from (values
(1, '+1(617)5550142'),
(2, '+16175550142'),
(3, '6175550142'),
(4, '555-01-42')
) as map(id, source_type)
)
select source_type
from (
select *, regexp_matches(source_type, '^\+1[0-9]{10}$', 'g') as pattern
from example_table
) as t
group by source_type;
The pattern reads: anchored to the start (^), a literal + (escaped because + is a metacharacter), a literal 1, then exactly ten digits ([0-9]{10}), anchored to the end ($).
Of our four rows, only row 2 (+16175550142) matches. Row 1 has parentheses; row 3 is missing the +1; row 4 is malformed.
A looser pattern that accepts both +16175550142 and (617) 555-0142:
regexp_matches(source_type, '^(\+1)?[\s\-\(\)]*[0-9]{3}[\s\-\(\)]*[0-9]{3}[\s\-]?[0-9]{4}$', 'g')
We’ve shipped both versions in client engagements. The looser pattern catches more real-world entries; the strict one is what you want for a contact-form validator where you’re rejecting bad input.
Where it fails: International numbers. The moment your dataset has +44, +33, +86, +357, the ^\+1 anchor breaks. For multinational datasets, use a library like Google’s libphonenumber from a UDF instead of regex — phone-number validation is one of those problems that looks like regex and is actually a parsing tree.
Cross-dialect notes
| Dialect | Function | Notes |
|---|---|---|
| PostgreSQL | regexp_matches(col, pattern, 'g') | Returns table-of-arrays; covered above |
| Snowflake | regexp_count(col, pattern) | Returns scalar count, simpler for tasks 1–3 |
| BigQuery | REGEXP_EXTRACT_ALL(col, pattern) | Returns array; use ARRAY_LENGTH for count |
| Redshift | regexp_count(col, pattern) | Like Snowflake; older Redshift uses POSIX |
| ClickHouse | extractAll(col, pattern) | Returns array; PCRE2 syntax |
| MySQL 8+ | REGEXP_LIKE(col, pattern) / REGEXP_REPLACE(...) | No “match all” — workaround with subqueries |
| SQLite | regexp operator (extension), no built-in count | Usually not the right tool for this |
The “count of matches” idiom we used above for PostgreSQL is regexp_count(col, pattern) in Snowflake / Redshift / Oracle, and ARRAY_LENGTH(REGEXP_EXTRACT_ALL(col, pattern)) in BigQuery. Same idea, three syntaxes.
When not to use regex in SQL
Three rules we apply when reviewing analyst code:
- If the warehouse has a dedicated function, use that.
STARTS_WITH(col, 'pref')beatsregexp_matches(col, '^pref')for readability and speed. Same forCONTAINS,ENDS_WITH,SPLIT_PART. - If the pattern is a parsing problem, use a UDF. Phone numbers, email addresses, URLs, dates in heterogeneous formats — these have battle-tested libraries. Regex makes them a thirty-line query that breaks on the first edge case.
- If the query runs on >100M rows, benchmark. Regex is slower than literal-string operations by 2–10×. On Snowflake or BigQuery with elastic compute it doesn’t matter much. On a self-hosted Postgres or a ClickHouse table without skip indexes, it absolutely does.
Takeaway
Six patterns cover most analyst regex work:
[aeiou]— character class[^aeiou]— negated character class\d,[0-9]— digitsHAVING count(*) = N— exact count via grouping^[A-Z]— anchor + class for “starts with”^...$with literal escapes and{N}quantifiers — strict format validation
Memorize those, know which dialect you’re in, and most of the “I need to do this in Python” answers turn back into one-line SQL queries. The remaining cases — internationalized names, multi-country phone numbers, real email validation — are honest exceptions; reach for a UDF or pre-process them upstream, not a longer regex.