Data Engineering

Regular Expressions in SQL: Six Patterns Analysts Actually Use

· 13 min read

Most analysts know SQL regex exists and can’t quite remember the syntax, so they reach for Python. The result is a 200-line notebook doing what a six-line warehouse query would have done. This is that six-line reference: six patterns, each one a real production question. This article is the cheatsheet we hand new analysts on day one: six concrete patterns, each one a real question we’ve actually been asked to answer.

In This Article

  1. What regex is, in one paragraph
  2. 1. Count the vowels in each string
  3. 2. Count the consonants (via negation)
  4. 3. Strings with exactly three digits
  5. 4. Phone numbers with two or more dashes
  6. 5. Names that start with a capital letter
  7. 6. Phone numbers matching a strict pattern
  8. Cross-dialect notes
  9. When not to use regex in SQL
  10. Takeaway

We’ll use PostgreSQL’s regexp_matches() throughout because it’s the cleanest of the SQL regex APIs and because every modern warehouse (Snowflake, BigQuery, Redshift, ClickHouse) has a near-identical equivalent. Where the syntax diverges, we’ll flag it.

What regex is, in one paragraph

A regular expression is a search pattern. The pattern bat matches bat, combat, and battalion. The pattern [aeiou] matches any single vowel. The pattern ^[A-Z] matches any string that starts with an uppercase letter. regexp_matches(column, pattern, 'g') returns one row per match — that’s what makes it useful in SQL: count the rows, you get the count of matches.

The 'g' flag is “global” — find all matches, not just the first. In SQL it’s the one you want 90% of the time.

1. Count the vowels in each string

The first thing every analyst writes when they discover regex.

with example_table as (
  select * from (values
    (1, 'google'),
    (2, 'yahoo'),
    (3, 'bing'),
    (4, 'rambler')
  ) as map(id, source_type)
)
select source_type, count(1) as vowel_count
from (
  select *, regexp_matches(source_type, '([aeiou])', 'g') as pattern
  from example_table
) as t
group by source_type;

The pattern ([aeiou]) matches one vowel. With the 'g' flag, regexp_matches emits one row per vowel found, so count(1) after group by source_type gives you the per-string vowel count.

Where it fails: Case sensitivity. [aeiou] doesn’t match A, E, etc. Either uppercase the column first with lower(source_type) or expand the pattern to [aeiouAEIOU]. The first is more reliable when the data has UTF-8 vowels (é, ü) that the bracket class doesn’t cover.

2. Count the consonants (via negation)

The naive way: list every consonant in a bracket class. The elegant way: invert. [^aeiou] matches any character that is not a vowel, which on a lowercased alphabetic string equals the consonant count.

with example_table as (
  select * from (values
    (1, 'google'), (2, 'yahoo'), (3, 'bing'), (4, 'rambler')
  ) as map(id, source_type)
)
select source_type, count(1) as consonant_count
from (
  select *, regexp_matches(source_type, '([^aeiou])', 'g') as pattern
  from example_table
) as t
group by source_type;

The ^ inside [ ] is the negation operator. Outside brackets it means “start of string.” Two completely different meanings for the same character. This is a source of bugs, not elegance.

Where it fails: This pattern matches digits, spaces, punctuation, and anything else that isn’t a vowel. On the string google.com it counts the period and the m as “consonants.” For real consonant counting, tighten to [bcdfghjklmnpqrstvwxyz] or filter to alphabetic first with [^aeiou[:alpha:]] — but at that point you’ve lost the elegance and might as well enumerate.

3. Strings with exactly three digits

A more interesting question: which rows contain exactly three digit characters? This is where HAVING earns its keep.

with example_table as (
  select * from (values
    (1, '1a2s3d'),
    (2, 'qw12e'),
    (3, 'q56we1651qwe'),
    (4, 'qw4e2')
  ) as map(id, source_type)
)
select source_type, count(*) as digit_count
from (
  select *, regexp_matches(source_type, '\d', 'g') as pattern
  from example_table
) as t
group by source_type
having count(*) = 3;

\d is shorthand for [0-9]. Same idea: emit a row per digit, group, then filter the groups by total count.

In our example only row 1 (1a2s3d, three digits) survives the HAVING. Row 2 has two digits, row 3 has six, row 4 has two — all filtered out.

Where it fails: \d in some SQL dialects matches Unicode digits, not just ASCII 0-9. If your column might contain Arabic-Indic digits (٠١٢) or Devanagari (०१२), you’ll get matches you didn’t expect. Use [0-9] if you mean ASCII-only.

4. Phone numbers with two or more dashes

Switching from synthetic data to a real analyst task: find phone numbers in messy free-text input that are missing the canonical formatting. A useful proxy: how many dashes does the string contain?

with example_table as (
  select * from (values
    (1, '617-555-0142'),
    (2, '6175550142'),
    (3, '+16175550142'),
    (4, '1-617-555-0142')
  ) as map(id, source_type)
)
select source_type,
       case when count(1) >= 2 then 'True' else 'False' end as has_two_dashes
from (
  select *, regexp_matches(source_type, '-', 'g') as pattern
  from example_table
) as t
group by source_type;

Note: this query silently drops rows with zero dashes (row 2 here). To include them you need a left join back to the original table, since regexp_matches returns no rows for non-matches and the grouping has nothing to anchor on.

select e.source_type,
       case when count(t.pattern) >= 2 then 'True' else 'False' end as has_two_dashes
from example_table e
left join lateral (
  select regexp_matches(e.source_type, '-', 'g') as pattern
) t on true
group by e.source_type;

The LATERAL join runs regexp_matches per row and keeps the row even when no dashes are found. This is the pattern we use in production.

Where it fails: This counts the number of dashes, not whether the dashes are in valid positions. ---abc--- has six dashes and would pass. A real phone-format validator needs a positional pattern, which is Task 6 below.

5. Names that start with a capital letter

A common data-cleaning task: find rows where the value looks like it was capitalized correctly. ^[A-Z] does it.

with example_table as (
  select * from (values
    (1, 'alex'),
    (2, 'Alex'),
    (3, 'Vasya'),
    (4, 'petya')
  ) as map(id, source_type)
)
select source_type
from (
  select *, regexp_matches(source_type, '^[A-Z]', 'g') as pattern
  from example_table
) as t
group by source_type;

This returns only Alex and Vasya. The ^ outside brackets means “anchored to the start of the string”; [A-Z] is the bracket class for uppercase ASCII letters.

Where it fails: Names like O'Brien, MacDonald, de la Vega, or 李明 need different patterns. ^[A-Z] will miss 李明 entirely (no ASCII uppercase) and miss de la Vega (lowercase start, but a proper name). If you’re cleaning a multinational customer table, do not lean on this pattern as a “valid name” check — use it as a “probably formatted as Title Case” check and treat it as a heuristic, not a rule.

6. Phone numbers matching a strict pattern

The most useful regex an analyst writes: validate a phone-number column. North American numbers have a few common formats — +1XXXXXXXXXX, (XXX) XXX-XXXX, XXX-XXX-XXXX. Catching the canonical form is straightforward:

with example_table as (
  select * from (values
    (1, '+1(617)5550142'),
    (2, '+16175550142'),
    (3, '6175550142'),
    (4, '555-01-42')
  ) as map(id, source_type)
)
select source_type
from (
  select *, regexp_matches(source_type, '^\+1[0-9]{10}$', 'g') as pattern
  from example_table
) as t
group by source_type;

The pattern reads: anchored to the start (^), a literal + (escaped because + is a metacharacter), a literal 1, then exactly ten digits ([0-9]{10}), anchored to the end ($).

Of our four rows, only row 2 (+16175550142) matches. Row 1 has parentheses; row 3 is missing the +1; row 4 is malformed.

A looser pattern that accepts both +16175550142 and (617) 555-0142:

regexp_matches(source_type, '^(\+1)?[\s\-\(\)]*[0-9]{3}[\s\-\(\)]*[0-9]{3}[\s\-]?[0-9]{4}$', 'g')

We’ve shipped both versions in client engagements. The looser pattern catches more real-world entries; the strict one is what you want for a contact-form validator where you’re rejecting bad input.

Where it fails: International numbers. The moment your dataset has +44, +33, +86, +357, the ^\+1 anchor breaks. For multinational datasets, use a library like Google’s libphonenumber from a UDF instead of regex — phone-number validation is one of those problems that looks like regex and is actually a parsing tree.

Cross-dialect notes

DialectFunctionNotes
PostgreSQLregexp_matches(col, pattern, 'g')Returns table-of-arrays; covered above
Snowflakeregexp_count(col, pattern)Returns scalar count, simpler for tasks 1–3
BigQueryREGEXP_EXTRACT_ALL(col, pattern)Returns array; use ARRAY_LENGTH for count
Redshiftregexp_count(col, pattern)Like Snowflake; older Redshift uses POSIX
ClickHouseextractAll(col, pattern)Returns array; PCRE2 syntax
MySQL 8+REGEXP_LIKE(col, pattern) / REGEXP_REPLACE(...)No “match all” — workaround with subqueries
SQLiteregexp operator (extension), no built-in countUsually not the right tool for this

The “count of matches” idiom we used above for PostgreSQL is regexp_count(col, pattern) in Snowflake / Redshift / Oracle, and ARRAY_LENGTH(REGEXP_EXTRACT_ALL(col, pattern)) in BigQuery. Same idea, three syntaxes.

When not to use regex in SQL

Three rules we apply when reviewing analyst code:

  1. If the warehouse has a dedicated function, use that. STARTS_WITH(col, 'pref') beats regexp_matches(col, '^pref') for readability and speed. Same for CONTAINS, ENDS_WITH, SPLIT_PART.
  2. If the pattern is a parsing problem, use a UDF. Phone numbers, email addresses, URLs, dates in heterogeneous formats — these have battle-tested libraries. Regex makes them a thirty-line query that breaks on the first edge case.
  3. If the query runs on >100M rows, benchmark. Regex is slower than literal-string operations by 2–10×. On Snowflake or BigQuery with elastic compute it doesn’t matter much. On a self-hosted Postgres or a ClickHouse table without skip indexes, it absolutely does.

Takeaway

Six patterns cover most analyst regex work:

  • [aeiou] — character class
  • [^aeiou] — negated character class
  • \d, [0-9] — digits
  • HAVING count(*) = N — exact count via grouping
  • ^[A-Z] — anchor + class for “starts with”
  • ^...$ with literal escapes and {N} quantifiers — strict format validation

Memorize those, know which dialect you’re in, and most of the “I need to do this in Python” answers turn back into one-line SQL queries. The remaining cases — internationalized names, multi-country phone numbers, real email validation — are honest exceptions; reach for a UDF or pre-process them upstream, not a longer regex.

Keep reading

Enjoyed this article?

Get weekly data strategy insights delivered to your inbox.

Get in Touch

Let's Discuss Your Project

Book a 30-minute discovery call. We'll assess your data maturity and recommend the right approach — no strings attached.

Book a Discovery Call →
Need help with your data strategy? Book a Discovery Call →