How to use regular expressions

Regular expressions give many people a headache: a pattern is hard to read at a glance, let alone to write. Let's explore regular expressions step by step, from the basics to more advanced use.

P.S. This article is aimed at readers with no prior knowledge of regular expressions.

A regular expression can be loosely defined as a way of describing patterns for matching strings. For its origins, see: regular expression

Basic usage

Take the string ABC12345ABC1234AB12C. How would I extract the letters from it?

1. Build an array of all the letters: [A, B, C ... Z]

2. Convert the string into an array of characters and iterate over it

3. If the current character is in the letter array, keep it; if not, move on to the next character
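The three steps above can be sketched without any regular expression. This is only an illustration: the function name `extract_letters` is hypothetical, and it assumes, like the list [A, B, C ... Z], that only uppercase ASCII letters count.

```python
import string


def extract_letters(text):
    """Collect the letters of `text` without using regular expressions."""
    # Step 1: build the collection of letters A..Z.
    letters = set(string.ascii_uppercase)
    found = []
    # Step 2: walk through the string one character at a time.
    for ch in text:
        # Step 3: keep letters, skip everything else.
        if ch in letters:
            found.append(ch)
    return found


print(extract_letters("ABC12345ABC1234AB12C"))
# ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C']
```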

The steps above roughly describe how to do it without regular expressions. How would we write it with a regular expression?

First, we want to match letters, so we need to know how letters are expressed in a regular expression.

[a-z] // Matches any lowercase letter
[A-Z] // Matches any uppercase letter
[a-zA-Z] // Matches any letter
[0-9] // Matches any digit
[0-9\.\-] // Matches any digit, period, or minus sign
[ \f\r\t\n] // Matches any whitespace character

From the list above, you can see that [A-Z] represents the letters A to Z. When we test the expression [A-Z] against our string, every letter is matched, one character at a time.
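As a quick check, the character classes can be tried with Python's standard `re` module. Python is only an assumption here (the article does not name a language); the variable names are illustrative.

```python
import re

text = "ABC12345ABC1234AB12C"

# A character class matches exactly one character per match.
letters = re.findall(r"[A-Z]", text)
digits = re.findall(r"[0-9]", text)

print(letters)  # ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C']
print(digits)   # ['1', '2', '3', '4', '5', '1', '2', '3', '4', '1', '2']
```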

With that simple description in hand, let's go deeper. What if I want to match each run of consecutive letters instead of single letters? For that we need another kind of expression: metacharacters that say how many times the preceding element may repeat.

Let's take a look at the metacharacters we need:

character

description

{n}

n is a non-negative integer. Matches exactly n times. For example, 'o{2}' cannot match the 'o' in "Bob", but it can match the two o's in "food".

{n,}

n is a non-negative integer. Matches at least n times. For example, 'o{2,}' cannot match the 'o' in "Bob", but it can match all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.

{n,m}

Both m and n are non-negative integers, where n <= m. Matches at least n times and at most m times. For example, "o{1,3}" will match the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no space between the comma and the two numbers.

+

Matches the preceding subexpression one or more times. For example, 'zo+' can match "zo" and "zoo", but not "z". + is equivalent to {1,}.

?

Matches the preceding subexpression zero or one time. For example, "do(es)?" can match "do" or "does". ? is equivalent to {0,1}.

.

Matches any single character except "\n". To match any character including '\n', use a pattern like "(.|\n)".
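The examples in the table above can be checked with a short sketch. It assumes Python's `re` module; `first_match` is a hypothetical helper that returns the first matched text or None.

```python
import re


def first_match(pattern, text):
    """Return the text of the first match of `pattern` in `text`, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


print(first_match(r"o{2}", "Bob"))                 # None
print(first_match(r"o{2}", "food"))                # oo
print(first_match(r"o{2,}", "foooood"))            # ooooo
print(first_match(r"o{1,3}", "fooooood"))          # ooo
print(first_match(r"zo+", "z"))                    # None
print(first_match(r"zo+", "zoo"))                  # zoo
print(first_match(r"do(es)?", "does"))             # does
print(first_match(r".", "\n"))                     # None
print(first_match(r"(.|\n)", "\n") == "\n")        # True
```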

In the table above we find the metacharacters we need, {n,} and +, so our expression can be written as [A-Z]{1,} or [A-Z]+

We can now write a simple regular expression. Applied to our string, the expression above matches:

ABC
ABC
AB
C
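The result above can be reproduced with a few lines, again assuming Python's `re` module:

```python
import re

text = "ABC12345ABC1234AB12C"

# One match per run of consecutive uppercase letters.
runs = re.findall(r"[A-Z]+", text)
print(runs)  # ['ABC', 'ABC', 'AB', 'C']

# {1,} is just another spelling of +.
same = re.findall(r"[A-Z]{1,}", text)
print(same == runs)  # True
```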

In this result, ABC and AB both satisfy the regular expression. But why is the first match ABC rather than AB or A? After all, the following result would also satisfy the expression:

AB
C
ABC
AB
C

Greedy mode

Regular expressions match as much as they can by default. This is the so-called greedy match.

In other words, when several matches are possible, a quantifier consumes as many characters as it can while still letting the overall pattern match.
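A small sketch of the difference, assuming Python's `re` module. The non-greedy form uses the `?` suffix described in the reference table later in this article.

```python
import re

# Greedy: + takes as many characters as possible.
greedy = re.search(r"o+", "oooo").group(0)
# Non-greedy: +? takes as few characters as possible.
lazy = re.search(r"o+?", "oooo").group(0)
print(greedy)  # oooo
print(lazy)    # o

text = "ABC12345ABC1234AB12C"
# With the non-greedy quantifier every letter becomes its own match.
lazy_runs = re.findall(r"[A-Z]+?", text)
print(lazy_runs)  # ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C']
```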

Hands-on: removing HTML tags from HTML

 <ul class="dropdown-menu">
    <li><a href="#" class="dropdown-header">Business functions</a>
    </li>
    <li><a href="#">Information creation</a>
    </li>
    <li><a href="#">Information query</a>
    </li>
    <li><a href="#">Information Management</a>
    </li>
    <li role="separator" class="divider"></li>
    <li><a href="#" class="dropdown-header">System features</a>
    </li>
    <li><a href="#">Settings</a>
    </li>
</ul>

To remove the HTML tags from the markup above, the first step is to match them. We know that an HTML tag starts with "<" and ends with ">".

1. So the pattern has the shape "<[some arbitrary characters]>"

2. The "some arbitrary characters" part can be any character, as in

 <a href="blog.laofu.online">Fu Wei’s web blog</a> 

3. Following the analysis in step 2, we pick the metacharacter ".", which stands for any character, so the pattern can be written as <.+>

4. The pattern <.+> does find the HTML tags, but it also produces an unexpected result: because matching is greedy, the match also swallows the text between the tags, which is the text we want to keep

5. To fix this, we can replace "some arbitrary characters" with "some characters that cannot start or end an HTML tag"

6. Following that analysis, we change the pattern to <[^<>]+>

With the final pattern, every tag is matched on its own and the text between the tags is left untouched.
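The two patterns can be compared on one line of the markup. This is a minimal sketch assuming Python's `re` module; the names `TAG`, `greedy_hits`, `tags`, and `plain` are illustrative. It is not a full HTML parser: a ">" inside an attribute value, for example, would confuse it.

```python
import re

html = '<li><a href="#" class="dropdown-header">Business functions</a></li>'

# Greedy <.+> runs from the first "<" to the last ">" on the line.
greedy_hits = re.findall(r"<.+>", html)
print(greedy_hits)
# ['<li><a href="#" class="dropdown-header">Business functions</a></li>']

# <[^<>]+> cannot cross a "<" or ">", so each tag is matched separately.
TAG = re.compile(r"<[^<>]+>")
tags = TAG.findall(html)
print(tags)
# ['<li>', '<a href="#" class="dropdown-header">', '</a>', '</li>']

# Replacing every tag with the empty string leaves only the text.
plain = TAG.sub("", html)
print(plain)  # Business functions
```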

Relevant information

Complete list of regular expression syntax

character

description

\

Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, "n" matches the character "n", while "\n" matches a newline character. The sequence "\\" matches "\" and "\(" matches "(".

^

Match the beginning of the input string. If the Multiline property of the RegExp object is set, ^ also matches the position after "\n" or "\r".

$

Match the end position of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before "\n" or "\r".

*

Matches the preceding subexpression zero or more times. For example, "zo*" can match "z" as well as "zoo". * is equivalent to {0,}.

+

Matches the preceding subexpression one or more times. For example, "zo+" can match "zo" and "zoo", but not "z". + is equivalent to {1,}.

?

Matches the preceding subexpression zero or one time. For example, "do(es)?" can match "do" or "does". ? is equivalent to {0,1}.

{n}

n is a non-negative integer. Matches exactly n times. For example, "o{2}" cannot match the "o" in "Bob", but it can match the two o's in "food".

{n,}

n is a non-negative integer. Matches at least n times. For example, "o{2,}" cannot match the "o" in "Bob", but it can match all the o's in "foooood". "o{1,}" is equivalent to "o+". "o{0,}" is equivalent to "o*".

{n,m}

Both m and n are non-negative integers, where n <= m. Matches at least n times and at most m times. For example, "o{1,3}" will match the first three o's in "fooooood". "o{0,1}" is equivalent to "o?". Note that there must be no space between the comma and the two numbers.

?

When this character immediately follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), the match becomes non-greedy. Non-greedy mode matches as little of the searched string as possible, while the default greedy mode matches as much of it as possible. For example, for the string "oooo", "o+?" will match a single "o", and "o+" will match all the o's.

.

Match any single character except "\n". To match any character including "\n", use a pattern like "(.|\n)".

(pattern)

Matches the pattern and captures the match. The captured matches can be retrieved from the resulting Matches collection: VBScript uses the SubMatches collection, and JScript uses the $0...$9 properties. To match parenthesis characters, use "\(" or "\)".

(?:pattern)

Matches the pattern but does not capture the match; that is, this is a non-capturing match and the result is not stored for later use. This is useful when combining parts of a pattern with the "or" character (|). For example, "industr(?:y|ies)" is a more concise expression than "industry|industries".

(?=pattern)

Positive lookahead: matches the search string at any position where a string matching the pattern begins. This is a non-capturing match, that is, the match is not stored for later use. For example, "Windows(?=95|98|NT|2000)" can match "Windows" in "Windows2000", but cannot match "Windows" in "Windows3.1". A lookahead does not consume characters: after a match occurs, the search for the next match starts immediately after the last match, not after the characters that made up the lookahead.

(?!pattern)

Negative lookahead: matches the search string at any position where a string that does not match the pattern begins. This is a non-capturing match, that is, the match is not stored for later use. For example, "Windows(?!95|98|NT|2000)" can match "Windows" in "Windows3.1" but cannot match "Windows" in "Windows2000". A lookahead does not consume characters: after a match occurs, the search for the next match starts immediately after the last match, not after the characters that made up the lookahead.

(?<=pattern)

Positive lookbehind: similar to positive lookahead, but in the opposite direction. For example, "(?<=95|98|NT|2000)Windows" can match "Windows" in "2000Windows" but cannot match "Windows" in "3.1Windows".

(?<!pattern)

Negative lookbehind: similar to negative lookahead, but in the opposite direction. For example, "(?<!95|98|NT|2000)Windows" can match "Windows" in "3.1Windows" but cannot match "Windows" in "2000Windows".

x|y

Matches x or y. For example, "z|food" can match "z" or "food". "(z|f)ood" matches "zood" or "food".

[xyz]

Character set. Matches any one of the enclosed characters. For example, "[abc]" can match the "a" in "plain".

[^xyz]

Negated character set. Matches any character that is not enclosed. For example, "[^abc]" can match the "p" in "plain".

[a-z]

Character range. Matches any character in the specified range. For example, "[a-z]" can match any lowercase alphabetic character from "a" to "z".

[^a-z]

Negated character range. Matches any character that is not in the specified range. For example, "[^a-z]" can match any character that is not in the range "a" to "z".

\b

Matches a word boundary, that is, the position between a word and a space. For example, "er\b" can match the "er" in "never", but it cannot match the "er" in "verb".

\B

Matches a non-word boundary. "er\B" can match the "er" in "verb", but it cannot match the "er" in "never".

\cx

Matches the control character specified by x. For example, \cM matches a Control-M or carriage return character. The value of x must be in the range A-Z or a-z. Otherwise, c is treated as a literal "c" character.

\d

Match a digit character. Equivalent to [0-9].

\D

Match a non-digit character. Equivalent to [^0-9].

\f

Matches a form feed character. Equivalent to \x0c and \cL.

\n

Matches a newline character. Equivalent to \x0a and \cJ.

\r

Matches a carriage return character. Equivalent to \x0d and \cM.

\s

Matches any whitespace character, including spaces, tabs, form feeds, etc. Equivalent to [ \f\n\r\t\v].

\S

Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].

\t

Matches a tab character. Equivalent to \x09 and \cI.

\v

Matches a vertical tab character. Equivalent to \x0b and \cK.

\w

Matches any word character including underscore. Equivalent to "[A-Za-z0-9_]".

\W

Match any non-word character. Equivalent to "[^A-Za-z0-9_]".

\xn

Matches n, where n is a hexadecimal escape value. The hexadecimal escape value must be exactly two digits long. For example, "\x41" matches "A". "\x041" is equivalent to "\x04" followed by "1". This allows ASCII codes to be used in regular expressions.

\num

Matches num, where num is a positive integer. This is a backreference to a captured match. For example, "(.)\1" matches two consecutive identical characters.

\n

Identifies an octal escape value or a backreference. If at least n subexpressions were captured before \n, then n is a backreference. Otherwise, if n is an octal digit (0-7), then n is an octal escape value.

\nm

Identifies an octal escape value or a backreference. If at least nm subexpressions were captured before \nm, then nm is a backreference. If at least n subexpressions were captured before \nm, then n is a backreference followed by the literal m. If neither condition holds and n and m are both octal digits (0-7), \nm matches the octal escape value nm.

\nml

If n is an octal digit (0-3), and both m and l are octal digits (0-7), the octal escape value nml is matched.

\un

Matches n, where n is a Unicode character represented by four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).
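A few entries from the table can be tried out as follows. This sketch assumes Python's `re` module, whereas the table describes the VBScript/JScript flavor, so small differences exist. One of them: Python only accepts lookbehind patterns of fixed width, so the lookbehind below uses the single alternative "2000" instead of "95|98|NT|2000". All variable names are illustrative.

```python
import re

# Capturing group vs. non-capturing group.
captured = re.search(r"industr(y|ies)", "industries").groups()
uncaptured = re.search(r"industr(?:y|ies)", "industries").groups()
print(captured)    # ('ies',)
print(uncaptured)  # ()

# Lookahead: "Windows" only if followed (or not followed) by a version.
pos_ahead = re.search(r"Windows(?=95|98|NT|2000)", "Windows2000")
neg_ahead = re.search(r"Windows(?!95|98|NT|2000)", "Windows2000")
print(pos_ahead.group(0))  # Windows (the "2000" is not part of the match)
print(neg_ahead)           # None

# Lookbehind (fixed width in Python).
behind = re.search(r"(?<=2000)Windows", "2000Windows")
not_behind = re.search(r"(?<=2000)Windows", "3.1Windows")
print(behind.group(0))  # Windows
print(not_behind)       # None

# Backreference: \1 repeats whatever group 1 captured.
doubled = re.search(r"(.)\1", "hello").group(0)
print(doubled)  # ll

# Word boundary \b and non-boundary \B.
print(bool(re.search(r"er\b", "never")))  # True
print(bool(re.search(r"er\b", "verb")))   # False
print(bool(re.search(r"er\B", "verb")))   # True
print(bool(re.search(r"er\B", "never")))  # False
```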

Commonly used regular expressions

username

/^[a-z0-9_-]{3,16}$/

password

/^[a-z0-9_-]{6,18}$/

Hexadecimal value

/^#?([a-f0-9]{6}|[a-f0-9]{3})$/

E-mail

/^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/
/^[a-z\d]+(\.[a-z\d]+)*@([\da-z](-[\da-z])?)+(\.{1,2}[a-z]+)+$/

URL

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/

IP address

/((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?)/
/^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/

HTML tags

/^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/

Delete code // comments

(?<!http:|\S)//.*$

The range of Chinese characters in Unicode encoding

/^[\u2E80-\u9FFF]+$/
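Some of the patterns above can be wrapped in a small validation helper. This is only a sketch assuming Python's `re` module: the surrounding slashes of the /.../ literals are dropped, and the names `USERNAME`, `HEX_VALUE`, `IPV4`, `CJK`, and `is_valid` are hypothetical. `fullmatch` is used so that a trailing newline is not accepted by `$`.

```python
import re

USERNAME = re.compile(r"^[a-z0-9_-]{3,16}$")
HEX_VALUE = re.compile(r"^#?([a-f0-9]{6}|[a-f0-9]{3})$")
IPV4 = re.compile(
    r"^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
    r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$"
)
CJK = re.compile(r"^[\u2E80-\u9FFF]+$")


def is_valid(pattern, value):
    """Return True if the whole of `value` matches `pattern`."""
    return pattern.fullmatch(value) is not None


print(is_valid(USERNAME, "my-us3r_n4m3"))  # True
print(is_valid(USERNAME, "ab"))            # False (too short)
print(is_valid(HEX_VALUE, "#a3c113"))      # True
print(is_valid(HEX_VALUE, "#4d82h4"))      # False ("h" is not a hex digit)
print(is_valid(IPV4, "192.168.1.1"))       # True
print(is_valid(IPV4, "256.1.1.1"))         # False (256 is out of range)
print(is_valid(CJK, "正则表达式"))           # True
print(is_valid(CJK, "regex"))              # False
```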

Author: Laofu (laofu_online@163.com). If you find any copyright issue or technical error, please point it out. Free to reprint under the Creative Commons 3.0 license (attribution, non-commercial, no derivatives).

Reference: https://cloud.tencent.com/developer/article/1368320 How to use regular expressions-Cloud + Community-Tencent Cloud