JavaScript regular expressions: a concise summary of key points

1. Definition

A regular expression (abbreviated "regex" or "regexp") is a pattern, written as a string, that describes a set of strings following a certain syntactic rule. It is matched from left to right and can be used to search, replace, and extract text.
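
As a minimal sketch of the three uses named above (search, replace, extract), the snippet below applies one pattern through the standard String and RegExp methods. The sample text and variable names are made up for illustration.

```javascript
// Hypothetical sample text; any string would do.
const text = "Order 66 shipped, order 7 pending";
const digits = /\d+/g; // one or more digits, every occurrence

// Search: is there a number, and where does the first one start?
const hasNumber = /\d+/.test(text);       // true
const firstIndex = text.search(/\d+/);    // 6

// Replace: mask every number.
const masked = text.replace(digits, "#"); // "Order # shipped, order # pending"

// Extract: collect every number as a string.
const numbers = text.match(digits);       // ["66", "7"]

console.log(hasNumber, firstIndex, masked, numbers);
```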

2. Components

  • Ordinary characters: printable characters, usually digits, uppercase and lowercase letters, and some punctuation marks such as , _ and so on
  • Metacharacters: characters with a special meaning, such as the escape character \, ^, |, and so on
  • Modifiers (flags): specify additional matching strategies. They are written after the closing delimiter rather than inside the expression; for example, /abc/g matches every occurrence of abc

3. Metacharacters

3.1 Matching characters

MetacharacternamedescriptionStringRegular
.Dot operatorMatches a character only once , one of all characters except the newline character. Note that only/. represents the dot itself, and (.) or .{1} or (.){1,2} are all dot operatorscar/.ar/g => car
[]Character set/classOnly match a character once , you can use - internally to specify the range of the character set. Note that the character set does not care about the order.a3 d3 .3/[a-cA-C.]3/g => a3 d3 .3
[^]Negative character setOnly negate a character once , and use - internally to specify the range of the negated character set. Note that the negated character set does not care about the order.a3 d3 .3/[^a-cA-C.]3/g => a3 d3 .3
\wMatch a character only once , one of all alphanumeric underscores, equivalent to [a-zA-Z0-9_]word/\w/g => word , note that it is matched four times, not one-time matching word
\WMatch a character only once , one of all non-alphanumeric underscores, equivalent to [^\w]*()/\W/g => *() , note that it is matched three times, not a one-time match *()
\dMatch a character only once , one of all digits, equivalent to [0-9]2021/\d/g => 2021 , note that it is matched four times, not 2021 at one time
\DMatch a character only once , one of all non-digits, equivalent to [^\d]abc/\D/g => abc , note that it is matched three times, not abc one time
\sMatch a character only once , one of all spaces, equivalent to [\t\n\f\r\p{Z}]\t\n/\s/g => \t\n , note that it is matched twice, not one time match\t\n
\SMatch a character only once , one of all non-spaces, equivalent to [^\s]abc/\S/g => abc , note that it matches three times, not one-time match abc
\fMatch a form feed
\nMatches a newline character
\rMatches a carriage return
\tMatches a tab
\vMatches a vertical tab
\pMatch CR/LF (equivalent to/r\n), used to match DOS line terminator
\Escape characterUsed to match some reserved characters and Chinese , such as [] () {}. * + ?^ $ |\test\u6d4b\u8bd5 => test

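Note: to capture acddfs in 3acddfs23, use /[a-z]+/ rather than /[a-z]/, because without a quantifier the pattern captures only one character. Keep in mind that \w also matches digits, so /\w/ captures 3 and /\w+/ captures the whole string 3acddfs23.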
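
The single-character behavior of these classes is easy to check in a console. The following is a small sketch using the sample strings from this section; the variable names are illustrative only.

```javascript
const sample = "3acddfs23";

// Without a quantifier, a class matches exactly one character.
const oneWordChar = sample.match(/\w/)[0];     // "3" (\w also matches digits)
const allWordChars = sample.match(/\w+/)[0];   // "3acddfs23"
const lettersOnly = sample.match(/[a-z]+/)[0]; // "acddfs"

// With the g flag, a single-character class yields one match per character.
const perChar = "word".match(/\w/g);           // ["w", "o", "r", "d"]

// Character set versus negated character set on the same input.
const inSet = "a3 d3 .3".match(/[a-cA-C.]3/g);     // ["a3", ".3"]
const notInSet = "a3 d3 .3".match(/[^a-cA-C.]3/g); // ["d3"]

console.log(oneWordChar, allWordChars, lettersOnly, perChar, inSet, notInSet);
```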

3.2 Number of matches (quantifier)

| Metacharacter | Description | Example string | Regex => matches |
|---|---|---|---|
| {n,m} | The character or group before {n,m} must repeat at least n times and at most m times. | abababab | /[ab]{2,3}/ => aba (first match; with the g flag: aba, bab, ab) |
| ? | Equivalent to {0,1}. | abababab | /[ab]?/ => a |
| + | Equivalent to {1,}. | abababab | /[ab]+/ => abababab |
| * | Equivalent to {0,}; typically used for text that may or may not be present. | abababab | /[ab]*/ => abababab |
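
A short sketch of the quantifiers in the table, run against the same abababab string. The extra "xab" input is an assumption added to show that * can succeed with zero repetitions.

```javascript
const s = "abababab";

// {2,3}: greedy, so each match takes three characters while it can.
const range = s.match(/[ab]{2,3}/g);        // ["aba", "bab", "ab"]

// ?: zero or one occurrence.
const optional = s.match(/[ab]?/)[0];       // "a"

// +: one or more occurrences.
const oneOrMore = s.match(/[ab]+/)[0];      // "abababab"

// *: zero or more, so it can match the empty string at index 0.
const zeroOrMore = "xab".match(/[ab]*/)[0]; // ""

console.log(range, optional, oneOrMore, JSON.stringify(zeroOrMore));
```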

3.3 Matching position

| Metacharacter | Name | Description | Example string | Regex => matches |
|---|---|---|---|---|
| ^ | Anchor | Asserts that the match starts at the beginning of the input string. | abc | /^a/ => a (/^b/ does not match) |
| $ | Anchor | Asserts that the match ends at the end of the input string. It is written after the text it anchors. | abc | /c$/ => c (/b$/ does not match) |
| \b | Word boundary | Asserts that the position is a word boundary. | cat dcatd | /\bcat\b/ => the standalone cat only |
| \B | Non-word boundary | Asserts that the position is not a word boundary. | cat dcatd | /\Bcat\B/ => the cat inside dcatd only |
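
The position metacharacters can be checked with a few one-liners. This is only a sketch; the match index is used to show which occurrence of cat was found.

```javascript
// ^ and $ anchor the match to the start and end of the input.
const startsWithA = /^a/.test("abc"); // true
const startsWithB = /^b/.test("abc"); // false
const endsWithC = /c$/.test("abc");   // true

// \b requires a word boundary on both sides: only the standalone "cat".
const wholeWord = "cat dcatd".match(/\bcat\b/);
// \B requires the absence of a boundary: only the "cat" inside "dcatd".
const inside = "cat dcatd".match(/\Bcat\B/);

console.log(startsWithA, startsWithB, endsWithC, wholeWord.index, inside.index); // true false true 0 5
```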

3.4 Logic processing

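  • | (or operator): matches the expression on either side of it. In the string The car is the, /(t|T)he/g matches The and the
  • [^] and (?!exp) (not): negation is expressed with a negated character set or a negative assertion. A bare ! is an ordinary character, so /(t|!T)he/g means "the or !The" and matches only the final the. In the string The car is the, /[^T]he/g also matches only the final the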
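
The sketch below checks alternation and the two usual ways of expressing "not" (a negated character set and a negative lookahead) against the sample sentence. It also shows that a bare ! inside a pattern is treated as a literal character.

```javascript
const sentence = "The car is the";

// Alternation: either branch may match.
const either = sentence.match(/(t|T)he/g);       // ["The", "the"]

// Negated character set: any character except T, followed by "he".
const notCapitalT = sentence.match(/[^T]he/g);   // ["the"]

// A bare ! is literal, so this means "the" or "!The".
const literalBang = sentence.match(/(t|!T)he/g); // ["the"]

// Negative lookahead: "he" that is not followed by a space.
const heAtEnd = sentence.match(/he(?! )/g);      // ["he"]

console.log(either, notCapitalT, literalBang, heAtEnd);
```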

3.5 Sub-expression/grouping

(exp)

A subexpression is an expression enclosed in parentheses, also called a group. Depending on its purpose, a group is either capturing or non-capturing. Groups support backreferences, zero-width assertions, and other features, and make it possible to build very complex regular expressions.

4. Modifiers

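| Modifier | Description | Example string | Regex => matches |
|---|---|---|---|
| i | Ignore case. | The the is | /The/gi => The, the |
| g | Global search: find every match instead of stopping at the first. | The the is | /The/gi => The, the |
| m | Multi-line modifier: changes the anchors ^ and $ from matching at the boundaries of the whole string to matching at the boundaries of each line. | cat\nsat (two lines) | /.at$/g => sat; /.at$/gm => cat, sat |
| s | Single-line modifier: allows the dot operator to match newline characters. | cat\nsat (two lines) | /.+/s => cat\nsat (the whole string, including the newline) |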
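
The effect of each modifier can be seen by running the same pattern with and without it. A minimal sketch, using the two-line cat/sat string from the table:

```javascript
const lines = "cat\nsat";

// i and g: ignore case, and return every match.
const allCases = "The the is".match(/The/gi);   // ["The", "the"]
const firstOnly = "The the is".match(/the/i)[0]; // "The"

// m: without it, $ matches only at the very end of the input.
const lastLineOnly = lines.match(/.at$/g);      // ["sat"]
const everyLine = lines.match(/.at$/gm);        // ["cat", "sat"]

// s: without it, the dot stops at the newline.
const dotStops = lines.match(/.+/)[0];          // "cat"
const dotAll = lines.match(/.+/s)[0];           // "cat\nsat"

console.log(allCases, firstOnly, lastLineOnly, everyLine, dotStops, JSON.stringify(dotAll));
```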

5. Capture group and non-capturing group

Subexpressions fall into two types: capturing groups and non-capturing groups. A capturing group matches and also stores the matched text; it can be given a name and is automatically assigned a group number. A non-capturing group only matches: the text is not stored and no group number is assigned.

5.1 Capture group

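| Pattern | Name | Example string | Regex => matches |
|---|---|---|---|
| (exp) | Normal capture group | go go | /(go)/ => go (the first one) |
| (?<name>exp) | Named capture group. JavaScript supports only this form; (?'name'exp) belongs to other engines. | go go | /(?<name>go) \k<name>/ => go go |
| \number | Backreference to a numbered capture group | go go | /(go) \1/ => go go |
| \k<name> inside the pattern, $<name> in a replacement string | Backreference to a named capture group. The \k'name' and $'name' forms are not JavaScript syntax. | go go | /(?<name>go) \k<name>/ => go go |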

1. Additional concepts

  • Backreference (also called a back reference): the text captured by a capture group can be referenced not only by program code outside the regular expression but also inside the regular expression itself; the latter kind of reference is a backreference. Backreferences are typically used to find or constrain repetitions, or to require that paired identifiers agree. Note that a backreference repeats the text that was actually matched, not the subexpression itself!
  • Balancing group: a further use of capture groups, but JavaScript does not implement it.

2. Numbering rules

  • Every capture group, normal or named, has a group number. In JavaScript, numbers are assigned in a single pass from left to right, in the order of the opening parentheses, starting from 1; named groups are numbered together with normal ones. (Some other engines, such as .NET, number the normal groups first and then continue with the named groups.) Group number 0 represents the entire match.
  • In JavaScript, \1 or \k<name> is used inside the regular expression, and $1 or $<name> is used in the replacement string, that is, the second argument of replace.
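
The following sketch exercises numbered and named groups, backreferences inside the pattern, references in a replacement string, and the group numbering that a JavaScript engine reports. The date-like sample string and the group names are assumptions chosen for illustration.

```javascript
// Backreferences inside the pattern: \1 and \k<word> repeat the captured text.
const repeated = /(go) \1/.test("go go");                 // true
const notRepeated = /(go) \1/.test("go to");              // false
const namedRepeat = /(?<word>go) \k<word>/.test("go go"); // true

// Named groups are available by name and by number.
const datePattern = /(?<year>\d{4})-(?<month>\d{2})/;
const m = "2021-07".match(datePattern);
// m[0] is the whole match; m[1] and m.groups.year are the same capture.

// Replacement string: $<name> and $1 refer to captured text.
const swapped = "2021-07".replace(datePattern, "$<month>/$1"); // "07/2021"

// Numbering follows the opening parentheses from left to right,
// whether or not a group is named.
const mixed = "xyz".match(/(?<first>x)(y)(?<third>z)/);

console.log(m.groups.year, m[2], swapped, mixed[1], mixed[2], mixed[3]); // 2021 07 07/2021 x y z
```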

5.2 Non-capturing group

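| Pattern | Name | Example string | Regex => matches |
|---|---|---|---|
| (?:exp) | Normal non-capturing group | go go | /(?:go) \1/ => no match, because there is no group 1 to reference (with the u flag this is a syntax error) |
| (?=exp), (?!exp), (?<=exp), (?<!exp) | Zero-width assertion non-capturing group | | |
| (?#comment) | Comment group; not supported in JavaScript | | |

  • A normal non-capturing group (?:exp) matches exactly like a normal group (exp), except that it is not assigned a group number and cannot be backreferenced, which saves memory.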
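
A brief sketch of the difference: a non-capturing group takes part in matching but leaves no entry in the result array. The last part assumes an engine that supports the u flag, where a backreference to a missing group is rejected when the pattern is compiled.

```javascript
// Two capturing groups: the result holds the whole match plus two captures.
const capturing = "go go".match(/(go) (go)/);       // length 3

// The first group is non-capturing, so only one capture is stored.
const nonCapturing = "go go".match(/(?:go) (go)/);  // length 2

// With the u flag, \1 without a group 1 is a SyntaxError.
let backrefError = null;
try {
  new RegExp("(?:go)\\1", "u");
} catch (e) {
  backrefError = e;
}

console.log(capturing.length, nonCapturing.length, backrefError instanceof SyntaxError); // 3 2 true
```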

6. Zero-width assertion

A zero-width assertion, as its name suggests, is a match of zero width: it matches a position rather than characters. It is a kind of non-capturing group. It attaches a condition (an assertion) to a position, requiring that the characters before or after that position satisfy the condition for the overall match to succeed. The four position metacharacters described in section 3.3 (^, $, \b, \B) are also zero-width assertions.

6.1 Basic form

(? operator exp), written without spaces, for example (?=exp)
  • operator: the condition keyword, one of =, !, <=, <!. Each operator determines whether exp is tested ahead of or behind the current position, and whether it must or must not match
  • exp: the expression that is tested at that position

6.2 Terminology

The names of the zero-width assertions below sound intimidating, but they merely label different ways of checking a position, and what they do is not complicated.

6.2.1 Forms

| Pattern | Name | Description |
|---|---|---|
| (?=exp) | Zero-width positive lookahead assertion | The characters after the asserted position must match exp |
| (?<=exp) | Zero-width positive lookbehind assertion | The characters before the asserted position must match exp |
| (?!exp) | Zero-width negative lookahead assertion | The characters after the asserted position must not match exp |
| (?<!exp) | Zero-width negative lookbehind assertion | The characters before the asserted position must not match exp |

Note: a zero-width assertion only matches a position! The characters that satisfy the assertion do not appear in the final result, which is why it counts as a non-capturing group.

6.2.2 Examples

| String | Expression | Result |
|---|---|---|
| <div>this is video.</div> | /(?<=<div>)(.)+(?=<\/div>)/ | this is video. |
| <div>this is video.</div> | /<(?!\/)\w+>/ | <div> |
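
The examples can be reproduced as follows. This is a sketch that assumes an engine with lookbehind support; the last pattern and its price-like input are an extra illustration of negative lookbehind and do not come from the table.

```javascript
const html = "<div>this is video.</div>";

// Lookbehind and lookahead fix the position but are not part of the match.
const inner = html.match(/(?<=<div>).+(?=<\/div>)/)[0]; // "this is video."

// Negative lookahead: a tag whose name is not preceded by a slash.
const openingTags = html.match(/<(?!\/)\w+>/g);         // ["<div>"]

// Negative lookbehind: digits not preceded by "$" or by another digit.
const notDollar = "$10 and 20".match(/(?<![$\d])\d+/g); // ["20"]

console.log(inner, openingTags, notDollar);
```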

7. Greedy and non-greedy modes

Greedy and non-greedy modes affect the matching behavior of subexpressions or characters modified by quantifiers. More precisely, they change the meaning of the quantifier, and the two modes are opposite extremes.

7.1 Classification

1. Quantifier

  • Greedy quantifiers (also called match-first quantifiers): the quantifiers {m,n}, {m,}, ?, *, and +
  • Lazy quantifiers (also called ignore-first quantifiers): a greedy quantifier followed by ?, that is {m,n}?, {m,}?, ??, *?, and +?

2. Mode

  • Greedy mode: a subexpression or character modified by a greedy quantifier, such as (exp)+. Provided the entire expression still matches, it matches as much as possible.
  • Non-greedy mode: also known as lazy mode; a subexpression or character modified by a lazy quantifier, such as (exp)+?. Provided the entire expression still matches, it matches as little as possible. Non-greedy mode is supported only by some NFA engines.

7.2 Difference between (exp){m,n} and (exp){m,n}?

(exp){m,n} matches exp as many times as possible, up to the upper limit n; (exp){m,n}? matches exp as few times as possible once the lower limit m is met.

The quantifiers ?, +, and * can each be written in {m,n} form. For example, (exp)+ is (exp){1,}: match as much text fitting exp as possible. (exp)+? is (exp){1,}?: once the lower limit of one exp is met, match as little as possible. Likewise, * corresponds to {0,} and ? to {0,1}.
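
A small sketch of how the upper and lower limits drive the result. The inputs are arbitrary strings chosen so that the limits are visible.

```javascript
const run = "aaaa";

// Greedy: go toward the upper limit. Lazy: stop at the lower limit.
const greedyRange = run.match(/a{2,3}/)[0]; // "aaa"
const lazyRange = run.match(/a{2,3}?/)[0];  // "aa"

const greedyPlus = run.match(/a+/)[0];      // "aaaa"
const lazyPlus = run.match(/a+?/)[0];       // "a"
const lazyStar = run.match(/a*?/)[0];       // "" (the lower limit is zero)

// A lazy quantifier still expands when the rest of the pattern requires it.
const forced = "aaab".match(/a+?b/)[0];     // "aaab"

console.log(greedyRange, lazyRange, greedyPlus, lazyPlus, JSON.stringify(lazyStar), forced);
```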

7.3 Examples

  • Source string: aa<div>test1</div>bb<div>test2</div>cc
  • Expression 1: /<div>.*<\/div>/ => <div>test1</div>bb<div>test2</div>
  • Expression 2: /<div>.*?<\/div>/ => <div>test1</div>
  • Expression 3: /<div>.*?<\/div>/g => <div>test1</div> and <div>test2</div>

Expression 1 uses greedy mode. Once the first </div> is matched, the entire expression could already succeed, but because the quantifier is greedy the engine keeps trying to extend the match to the right, looking for a longer substring that still matches. After the second </div> there is nothing further to the right that can match, so matching ends with the result <div>test1</div>bb<div>test2</div>.

Expression 2 uses non-greedy mode. Once the first </div> is matched, the entire expression succeeds, and because the quantifier is non-greedy the match ends there without trying further to the right. The result is <div>test1</div>.

Expression 3 also uses non-greedy mode. Once the first </div> is matched, the entire expression succeeds. Because the g modifier requests a global search, the engine then continues to the right looking for further substrings that match the expression. After the second </div> there is nothing further to the right that can match, so the search ends after two matches: <div>test1</div> and <div>test2</div>.

The explanation above is written from the user's point of view to make the behavior easy to understand; the actual matching process is not this simple. Generally speaking, non-greedy matching performs worse than greedy matching.
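
The three expressions can be run directly. A minimal sketch, writing the closing tag as </div> and escaping its slash inside the regex literal:

```javascript
const source = "aa<div>test1</div>bb<div>test2</div>cc";

// Expression 1: greedy, extends to the last </div>.
const greedy = source.match(/<div>.*<\/div>/)[0];
// "<div>test1</div>bb<div>test2</div>"

// Expression 2: lazy, stops at the first </div>.
const lazy = source.match(/<div>.*?<\/div>/)[0];
// "<div>test1</div>"

// Expression 3: lazy and global, one match per element.
const lazyAll = source.match(/<div>.*?<\/div>/g);
// ["<div>test1</div>", "<div>test2</div>"]

console.log(greedy, lazy, lazyAll);
```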

8. JavaScript compatibility

ES2018 added: lookbehind assertions ((?<=exp) and (?<!exp)), named capture groups, the s (single-line, or dotAll) modifier, and Unicode property escapes (\p{...}, with the u flag).
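
Where older engines must be supported, one possible approach is to probe for these features at run time. The sketch below is an assumption-laden illustration, not an established API: supportsPattern is a hypothetical helper that relies on the RegExp constructor throwing a SyntaxError for syntax or flags the engine does not understand.

```javascript
// Hypothetical helper: true if the engine can compile the pattern with the flags.
function supportsPattern(source, flags = "") {
  try {
    new RegExp(source, flags);
    return true;
  } catch (e) {
    return false;
  }
}

// Probe the ES2018 additions; every value is true on a current engine.
const es2018Support = {
  lookbehind: supportsPattern("(?<=a)b"),
  namedGroups: supportsPattern("(?<name>a)"),
  dotAll: supportsPattern(".", "s"),
  unicodeProperty: supportsPattern("\\p{Lu}", "u"),
};

console.log(es2018Support);
```

Building the patterns from strings matters here: a regex literal with unsupported syntax would fail when the script is parsed, before any try/catch could run.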