oracle concepts for you: Regular Expressions in Oracle

What Are Regular Expressions?

Regular expressions (normally abbreviated regex or regexp) are sequences of characters that form a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. “find and replace” like operations. The concept of regular expressions came into existence around 1950s through an American mathematician named Stephen Kleene. These expressions were used first time in UNIX. In today’s world, regular expressions are widely supported in many programming languages like Oracle, Java, .NET, Perl, Python, C++, Ruby, AWK etc.

Regular Expressions in Oracle:

Regular Expressions came into Oracle with 10g database. Oracle Regular Expressions provide a simple yet powerful mechanism for rapidly describing patterns and greatly simplifies the way in which you search, extract, format, and otherwise manipulate text in the database. Oracle DB implements regular expression support with a set of Oracle Database SQL functions and conditions that enable you to search and manipulate string data. You can use these functions in any environment that supports Oracle Database SQL. You can use these functions on a text literal, bind variable, or any column that holds character data such as CHAR, NCHAR, CLOB, NCLOB, NVARCHAR2, and VARCHAR2 (but not LONG).
Here are those condition and functions supported by Oracle:

Oracle Condition/Function	Description
REGEXP_LIKE	It is similar to the `LIKE` condition, except `REGEXP_LIKE` performs regular expression matching instead of the simple pattern matching performed by `LIKE`. It searches a character column for a pattern. Use this function in the `WHERE` clause of a query to return rows matching a regular expression. The condition is also valid in a constraint or as a PL/SQL function returning a Boolean.
REGEXP_INSTR	It extends the functionality of the `INSTR` function by letting you search a string for a regular expression pattern. It searches a string for a given occurrence of a regular expression pattern and returns an integer indicating the position in the string where the match is found. You can specify which occurrence you want to find and the start position.
REGEXP_REPLACE	It extends the functionality of the `REPLACE` function by letting you search a string for a regular expression pattern. It searches for a pattern in a character column and replaces each occurrence of that pattern with the specified string.
REGEXP_SUBSTR	It extends the functionality of the `SUBSTR` function by letting you search a string for a regular expression pattern. It returns the substring matching the regular expression pattern that you specify.
REGEXP_COUNT	It complements the functionality of the `REGEXP_INSTR` function by returning the number of times a pattern occurs in a source string. The function evaluates strings using characters as defined by the input character set. It returns an integer indicating the number of occurrences of pattern. If no match is found, then the function returns 0.

Here are the syntax for those condition and functions:

Oracle Condition/Function	Syntax
REGEXP_LIKE	REGEXP_LIKE(search_string, pattern [,match_option])
REGEXP_INSTR	REGEXP_INSTR(search_string, pattern [, position [, occurrence [, return_option [, match_option]]]])
REGEXP_REPLACE	REGEXP_REPLACE(search_string, pattern [,replace_string [, position [, occurrence [, match_option]]]])
REGEXP_SUBSTR	REGEXP_SUBSTR(search_string, pattern [, position [, occurrence [, match_option]]])
REGEXP_COUNT	REGEXP_COUNT(search_string, pattern [,position] [,match_option])

Where:

search_string: the string you are searching in.
pattern: A regular expression. (Max 512 bytes).
position: The character position from which you want the search to begin.
occurrence: The occurrence you are searching for. If you want to return the second match for the regular expression, this is a 2.
return_option: 0 means return the pattern’s beginning position. 1 mean return the ending character position.
match_option: can have one of these values:
‘c’: case-sensitive (default);
‘i’: case-insensitive;
‘n’: Allow match-any-character operator (.);
‘m’: Treat source string as multiple line. For every line ^ and $ will be used for the beginning and ending of the line.

Oracle Database follows the exact syntax and matching semantics for these operators as defined in the IEE POSIX (Portable Operating System Interface) standard for matching ASCII (English language) data.
Here is the list of POSIX Metacharacters that can be used Oracle Database Regular Expressions:

Syntax	Description	Example
.	Matches any character in the supported character set except NULL.	a.b matches abb, acb, adb etc.
+	Matches one or more occurrences	a+ matches a, aa, aaa
?	Matches zero or one occurrence	ab?c matches `abc` and `ac`, but does not match `abbc`.
*	Matches zero or more occurrences	`ab*c` matches `ac`, `abc`, and `abbc.`
^	Matches the beginning of a string by default. In multiline mode, it matches the beginning of any line anywhere within the source string.	`^abc` matches `abc` in the string `abcxyz` but does not match `abc` in `xyzabc`.
$	Matches the end of a string by default. In multiline mode, it matches the end of any line anywhere within the source string.	`$abc` matches `abc` in the string `xyzabc` but does not match `abc` in `abcxyz`.
\	Escape character. Use a backslash to search for a character that is normally threated as a metacharacter.	The expression `\+` searches for the plus character (`+`). It matches the plus in the string `abc+xyz.`
[]	Bracket expression for specifying a matching list that should match any one of the expressions represented in the list.	`[abc]` matches the first character in the strings `all`, `bill`, and `cold`, but does not match any characters in `doll`.
[^ ]	Non matching character list. Matches none of the expressions represented in the list	The expression `[^abc]` matches the character `d` in `abcdef`, but not the character `a`, `b`, or `c`.
\|	Matches one of the alternatives.	`a\|b` matches character `a` or character `b`.
*`[:`class`:]`*	Matches any character belonging to the specified POSIX character class. [:alnum:] Alphanumeric characters [:alpha:] Alphabetic characters [:ascii:] ASCII characters [:blank:] Space and tab [:cntrl:] Control characters [:digit:] Digits [:graph:] Visible characters [:lower:] Lowercase letters [a-z] [:print:] Visible characters and spaces [:punct:] Punctuation and symbols. [:space:] All whitespace characters, including line breaks [:upper:] Uppercase letters [:word:] Word characters [:xdigit:] Hexadecimal digits	The expression `[[:upper:]]+` searches for one or more consecutive uppercase characters.
*`[.`element`.]`*	Specifies a collating element to use in the regular expression.	The expression `[[.ch.]]` searches for the collating element `ch` and matches `ch` in string `chabc`, but does not match `cdefg`.
{m}	Matches exactly m times	`a{3}` matches the strings `aaa`, but does not match `aa`.
{m,}	Matches at least m times	`a{3,}` matches the strings `aaa` and `aaaa`, but does not match `aa`.
{m,n}	Matches at least m times but no more than n times	`a{3,5}` matches the strings `aaa`, `aaaa`, and`aaaaa`, but does not match `aa`.
\n	The backreference expression (n is a digit between 1 and 9) matches the nth subexpression enclosed between ‘(‘ and ‘)’ preceding the \n	The expression `(abc\|def)xy\1` matches the strings `abcxyabc` and `defxydef`, but does not match `abcxydef` or`abcxy`.
( )	Grouping expression, treated as a single subexpression	`(abc)?def` matches the optional string `abc`, followed by `def`. Thus, the expression matches `abcdefghi` and `def`, but does not match `ghi`.

Oracle Database also provides built-in support for some of the most heavily used Perl regular expression extensions that are not included in the POSIX standard but do not conflict with it. Here is the list that describes Perl-influenced metacharacters supported in Oracle Database regular expression functions and conditions.

Reg. Exp.	Description
\d	A digit character. It is equivalent to the POSIX class [[:digit:]].
\D	A non-digit character. It is equivalent to the POSIX class [^[:digit:]].
\w	A word character, which is defined as an alphanumeric or underscore (_) character. It is equivalent to the POSIX class [[:alnum:]_].
\W	A non-word character. It is equivalent to the POSIX class [^[:alnum:]_].
\s	A whitespace character. It is equivalent to the POSIX class [[:space:]].
\S	A non-whitespace character. It is equivalent to the POSIX class[^[:space:]].
\A	Only at the beginning of a string. In multi-line mode, that is, when embedded newline characters in a string are considered the termination of a line, \A does not match the beginning of each line.
\Z	Only at the end of string or before a newline ending a string. In multi-line mode, that is, when embedded newline characters in a string are considered the termination of a line, \Z does not match the end of each line.
\z	Only at the end of a string.
*?	The preceding pattern element 0 or more times (non-greedy). Note that this quantifier matches the empty string whenever possible.
+?	The preceding pattern element 1 or more times (non-greedy).
??	The preceding pattern element 0 or 1 time (non-greedy). Note that this quantifier matches the empty string whenever possible.
{n}?	The preceding pattern element exactly n times (non-greedy). In this case{n}? is equivalent to {n}.
{n,}?	The preceding pattern element at least n times (non-greedy).
{n,m}?	At least n but not more than m times (non-greedy). Note that {0,m}?matches the empty string whenever possible.