Strings: Regular Expressions (Quantifiers)

Leefs, 2019-12-23

### Preface

This post covers Section 13.6.3, Quantifiers, of *Thinking in Java*.

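### Concepts

Quantifier: specifies how many times a part of a regular expression is repeated.

**Types**

> + Greedy: quantifiers are greedy unless altered by another option. A greedy expression matches as much of the input as the pattern allows. A typical source of bugs is assuming that a pattern will match only the first possible group of characters, when a greedy quantifier will in fact keep going until it has matched the longest possible string.
> + Reluctant: specified with a question mark, this quantifier matches the minimum number of characters needed to satisfy the pattern. It is also called lazy, minimal matching, non-greedy, or ungreedy.
> + Possessive: when the book was written, this type was available only in Java; other engines such as PCRE and Perl have since added it. As a regular expression is applied to a string, it generates many states so that it can backtrack if the match fails. Possessive quantifiers do not keep those intermediate states and so prevent backtracking. They are often used to keep a regular expression from running away, which can also make it execute more efficiently.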

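**Summary**

`greedy`: matches as much as possible, giving characters back only when needed;

`reluctant`: lazy, matches as little as possible;

`possessive`: matches as much as possible and never gives anything back (the Chinese term is also rendered as `支配型`);

These three modes are modifiers applied to a quantifier; they determine how the repetition is matched.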

| Greedy   | Reluctant | Possessive | Meaning                                        |
| -------- | --------- | ---------- | ---------------------------------------------- |
| `X?`     | `X??`     | `X?+`      | `X`, zero or one time                          |
| `X*`     | `X*?`     | `X*+`      | `X`, zero or more times                        |
| `X+`     | `X+?`     | `X++`      | `X`, one or more times                         |
| `X{n}`   | `X{n}?`   | `X{n}+`    | `X`, exactly `n` times                         |
| `X{n,}`  | `X{n,}?`  | `X{n,}+`   | `X`, at least `n` times                        |
| `X{n,m}` | `X{n,m}?` | `X{n,m}+`  | `X`, at least `n` but no more than `m` times   |
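To make the three columns of the table more concrete, here is a minimal sketch that wraps the same `{1,3}` quantifier in a capturing group and reports how much of the input it ended up with in each mode. The class name `QuantifierModes` and the helper `group1` are made up for this illustration; only `java.util.regex` from the JDK is assumed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierModes {

    /** Returns what group 1 captured if the WHOLE input matches, otherwise null. */
    public static String group1(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "aaaaa";

        // Followed by a*, which accepts whatever is left over.
        System.out.println(group1("(a{1,3})(a*)", input));   // greedy takes the maximum: aaa
        System.out.println(group1("(a{1,3}?)(a*)", input));  // reluctant takes the minimum: a
        System.out.println(group1("(a{1,3}+)(a*)", input));  // possessive takes the maximum: aaa

        // Followed by a literal "aaa", which needs three of the five letters.
        System.out.println(group1("(a{1,3})aaa", input));    // greedy gives one back: aa
        System.out.println(group1("(a{1,3}+)aaa", input));   // possessive refuses to: null
    }
}
```

The last two lines are where greedy and possessive diverge: both start by taking three letters, but only the greedy quantifier gives one back so that the trailing `aaa` can still match.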

We start with three basic regular expressions: the letter `a` followed by `?`, `*`, or `+`, all of them greedy. First, let's see what happens with the empty string `""`:

```java
Enter your regex: a?
Enter input string to search: 
I found the text "" starting at index 0 and ending at index 0.

Enter your regex: a*
Enter input string to search: 
I found the text "" starting at index 0 and ending at index 0.

Enter your regex: a+
Enter input string to search: 
No match found.
```

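### Zero-length matches

In the example above, the first two expressions match because `a?` and `a*` both allow zero occurrences of `a`. The start and end indexes are both 0, which differs from the examples seen so far. The empty string `""` has length 0, so the only place a match can occur is index 0. This is called a zero-length match.

A zero-length match can occur in several places: in an empty input string, at the beginning of the input, after the last character of the input, or between any two characters. Zero-length matches are easy to spot because they start and end at the same index.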
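The console transcripts in this post come from an interactive test harness. A minimal stand-in might look like the sketch below, assuming only `java.util.regex` from the JDK; the class name `ZeroLengthMatchDemo` and both helpers are hypothetical. It records each match as `start-end:text` and detects a zero-length match by checking that `start()` equals `end()`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ZeroLengthMatchDemo {

    /** Collects every match as "start-end:text" (hypothetical helper for this sketch). */
    public static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.start() + "-" + m.end() + ":" + m.group());
        }
        return found;
    }

    /** A match is zero-length when it starts and ends at the same index. */
    public static boolean hasZeroLengthMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            if (m.start() == m.end()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        for (String regex : new String[] {"a?", "a*", "a+"}) {
            System.out.println(regex + " on \"\"  -> " + findAll(regex, ""));
            System.out.println(regex + " on \"a\" -> " + findAll(regex, "a"));
        }
    }
}
```

For the empty input, `a?` and `a*` each report the single entry `0-0:` and `a+` reports an empty list.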

Let's look at a few more examples of zero-length matches. With the single letter `a` as input, something interesting shows up:

```java
Enter your regex: a?
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.

Enter your regex: a*
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.

Enter your regex: a+
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.
```

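All three quantifiers find the letter `a`, but the first two also find a zero-length match at index 1, that is, after the last character of the input. The matcher first finds the `a` between index 0 and index 1, then keeps searching until it can find no more matches. Depending on the quantifier, the "nothing" at the end of the input may or may not produce a match.

Now let's enter the letter `a` five times in a row: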

```java
Enter your regex: a?
Enter input string to search: aaaaa
I found the text "a" starting at index 0 and ending at index 1.
I found the text "a" starting at index 1 and ending at index 2.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "a" starting at index 3 and ending at index 4.
I found the text "a" starting at index 4 and ending at index 5.
I found the text "" starting at index 5 and ending at index 5.

Enter your regex: a*
Enter input string to search: aaaaa
I found the text "aaaaa" starting at index 0 and ending at index 5.
I found the text "" starting at index 5 and ending at index 5.

Enter your regex: a+
Enter input string to search: aaaaa
I found the text "aaaaa" starting at index 0 and ending at index 5.
```

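The expression `a?` produces one match per letter, because it matches zero or one `a`, and then a zero-length match at index 5. The expression `a*` finds two matches: the first covers all the consecutive `a` letters, and the second is the zero-length match at index 5, the end of the input. Finally, `a+` matches all the `a` letters and ignores the "nothing" at the end.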
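The match counts described above can be checked with a few lines of code. This sketch assumes Java 9 or later, because it relies on `Matcher.results()`; the class name `MatchCountDemo` and the helper `countMatches` are invented for the example.

```java
import java.util.regex.Pattern;

public class MatchCountDemo {

    /** Counts all matches, zero-length ones included. Needs Java 9+ for Matcher.results(). */
    public static long countMatches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).results().count();
    }

    public static void main(String[] args) {
        String input = "aaaaa";
        System.out.println(countMatches("a?", input)); // 6: five letters plus the empty match at the end
        System.out.println(countMatches("a*", input)); // 2: the whole run plus the empty match at the end
        System.out.println(countMatches("a+", input)); // 1: the whole run only
    }
}
```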

Now consider what the first two expressions do when they meet a letter other than `a`, for example the `b` in `ababaaaab`.

Here is the example:

```java
Enter your regex: a?
Enter input string to search: ababaaaab
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "" starting at index 3 and ending at index 3.
I found the text "a" starting at index 4 and ending at index 5.
I found the text "a" starting at index 5 and ending at index 6.
I found the text "a" starting at index 6 and ending at index 7.
I found the text "a" starting at index 7 and ending at index 8.
I found the text "" starting at index 8 and ending at index 8.
I found the text "" starting at index 9 and ending at index 9.

Enter your regex: a*
Enter input string to search: ababaaaab
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "" starting at index 3 and ending at index 3.
I found the text "aaaa" starting at index 4 and ending at index 8.
I found the text "" starting at index 8 and ending at index 8.
I found the text "" starting at index 9 and ending at index 9.

Enter your regex: a+
Enter input string to search: ababaaaab
I found the text "a" starting at index 0 and ending at index 1.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "aaaa" starting at index 4 and ending at index 8.
```

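The letter `b` appears at indexes 1, 3, and 8, and the output shows a zero-length match at each of those positions, plus one more at index 9, the end of the input. The expression `a?` is not looking for the letter `b`; it only looks for the presence or absence of `a`. If the quantifier allows zero occurrences, every position in the input that is not an `a` produces a zero-length match. The remaining `a` letters are matched according to the rules described earlier.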
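To see only where the zero-length matches land, the matches can be filtered by their text. The following is a small sketch under the same assumptions as before (JDK only); `EmptyMatchPositions` and `emptyMatchIndexes` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmptyMatchPositions {

    /** Returns the start index of every zero-length match of regex in input. */
    public static List<Integer> emptyMatchIndexes(String regex, String input) {
        List<Integer> indexes = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            if (m.group().isEmpty()) {
                indexes.add(m.start());
            }
        }
        return indexes;
    }

    public static void main(String[] args) {
        String input = "ababaaaab";
        System.out.println(emptyMatchIndexes("a?", input)); // [1, 3, 8, 9]
        System.out.println(emptyMatchIndexes("a*", input)); // [1, 3, 8, 9]
        System.out.println(emptyMatchIndexes("a+", input)); // []
    }
}
```

Indexes 1, 3, and 8 are the positions of `b`; index 9 is the end of the input.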

To match a pattern exactly n times, put the number inside braces:

```java
Enter your regex: a{3}
Enter input string to search: aa
No match found.

Enter your regex: a{3}
Enter input string to search: aaa
I found the text "aaa" starting at index 0 and ending at index 3.

Enter your regex: a{3}
Enter input string to search: aaaa
I found the text "aaa" starting at index 0 and ending at index 3.
```

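The expression `a{3}` matches three `a` letters in a row. The first test fails because the input does not contain enough `a` letters. The second test contains exactly three, so it produces one match. The third test also produces one match, because the input begins with three `a` letters; whatever follows is irrelevant to the first match. If the pattern appears again after that point, the later substrings produce further matches: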

```java
Enter your regex: a{3}
Enter input string to search: aaaaaaaaa
I found the text "aaa" starting at index 0 and ending at index 3.
I found the text "aaa" starting at index 3 and ending at index 6.
I found the text "aaa" starting at index 6 and ending at index 9.
```

To require at least `n` occurrences, add a comma after the number:

```java
Enter your regex: a{3,}
Enter input string to search: aaaaaaaaa
I found the text "aaaaaaaaa" starting at index 0 and ending at index 9.
```

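With the same nine `a` letters there is now only one match, because a run of nine `a` letters also satisfies "at least three `a` letters".

To set an upper limit as well, add a second number inside the braces: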

```java
Enter your regex: a{3,6} // at least 3 and at most 6 occurrences of a
Enter input string to search: aaaaaaaaa
I found the text "aaaaaa" starting at index 0 and ending at index 6.
I found the text "aaa" starting at index 6 and ending at index 9.
```

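Here the first match stops at the upper limit of six characters. The second match covers what is left, which happens to be three `a` letters, the minimum this pattern allows. If the input were one character shorter, there would be no second match, because only two `a` letters would remain.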
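The three brace forms can be compared side by side by looking only at the length of each match. This is a sketch rather than the harness used for the transcripts; `BoundedQuantifierDemo` and `matchLengths` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundedQuantifierDemo {

    /** Returns the length of each successive match of regex in input. */
    public static List<Integer> matchLengths(String regex, String input) {
        List<Integer> lengths = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            lengths.add(m.end() - m.start());
        }
        return lengths;
    }

    public static void main(String[] args) {
        String nine = "aaaaaaaaa";
        System.out.println(matchLengths("a{3}", nine));         // [3, 3, 3]
        System.out.println(matchLengths("a{3,}", nine));        // [9]
        System.out.println(matchLengths("a{3,6}", nine));       // [6, 3]
        // One letter fewer: the two leftover letters are too short for a second match.
        System.out.println(matchLengths("a{3,6}", "aaaaaaaa")); // [6]
    }
}
```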

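### Quantifiers on capturing groups and character classes

Until now we have only tested quantifiers on input containing a single character. In fact, a quantifier attaches to just one character at a time, so the expression `abc+` means "the letter `a`, followed by the letter `b`, followed by one or more `c`", not "`abc` one or more times". Quantifiers can, however, also attach to character classes and capturing groups: `[abc]+` means "`a` or `b` or `c`, one or more times", that is, any combination of those three letters, while `(abc)+` means the group `abc` as a whole, one or more times, for example `abcabcabc`.

Let's look at a concrete example that requires the group `dog` three times in a row.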

```java
Enter your regex: (dog){3}
Enter input string to search: dogdogdogdogdogdog
I found the text "dogdogdog" starting at index 0 and ending at index 9.
I found the text "dogdogdog" starting at index 9 and ending at index 18.

Enter your regex: dog{3}
Enter input string to search: dogdogdogdogdogdog
No match found.
```

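The first example finds two matches, each made of three consecutive `dog`s, because the quantifier applies to the whole capturing group. Without the parentheses the match fails, because the quantifier `{3}` then applies only to the letter `g`.

Similarly, we can apply a quantifier to a whole character class: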

```java
Enter your regex: [abc]{3}
Enter input string to search: abccabaaaccbbbc
I found the text "abc" starting at index 0 and ending at index 3.
I found the text "cab" starting at index 3 and ending at index 6.
I found the text "aaa" starting at index 6 and ending at index 9.
I found the text "ccb" starting at index 9 and ending at index 12.
I found the text "bbc" starting at index 12 and ending at index 15.

Enter your regex: abc{3}
Enter input string to search: abccabaaaccbbbc
No match found.
```

In the first example the quantifier `{3}` applies to the whole character class; in the second it applies only to the letter `c`.
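The scope of a quantifier is easiest to check with whole-string matching, where the answer is simply true or false. The sketch below uses the static `Pattern.matches` method; the class name `QuantifierScopeDemo` and the wrapper `fullMatch` are made up for the illustration.

```java
import java.util.regex.Pattern;

public class QuantifierScopeDemo {

    /** True when regex matches the ENTIRE input, not just a part of it. */
    public static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // The quantifier binds to the single character in front of it.
        System.out.println(fullMatch("abc+", "abccc"));       // true
        System.out.println(fullMatch("abc+", "abcabc"));      // false
        System.out.println(fullMatch("dog{3}", "doggg"));     // true

        // Parentheses make the quantifier apply to the whole group.
        System.out.println(fullMatch("(abc)+", "abcabc"));    // true
        System.out.println(fullMatch("(dog){3}", "dogdogdog")); // true

        // Brackets make it apply to "any one of these characters".
        System.out.println(fullMatch("[abc]+", "cabbac"));    // true
        System.out.println(fullMatch("(abc)+", "cabbac"));    // false
    }
}
```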

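### Differences among greedy, reluctant, and possessive quantifiers

There are subtle differences among greedy, reluctant, and possessive quantifiers.

A greedy quantifier tries to eat the entire input on its first attempt. If that attempt fails, the matcher gives back the last character and tries again, repeating this until it finds a match or has no characters left to give back. Depending on the quantifier in the expression, the last thing it tries to match is one or zero characters.

A reluctant quantifier takes the opposite approach: it starts at the beginning of the input and eats one character at a time, trying for a match after each one. The last thing it tries is the entire input.

A possessive quantifier eats the entire input and makes one attempt only. Unlike a greedy quantifier, it never backs off, even when the match fails as a result.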
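The three strategies can be imitated by hand for an expression of the form "`.*` followed by a literal tail". The sketch below is a deliberately simplified model of a single match attempt starting at index 0; it is not how `java.util.regex` is implemented, and `BacktrackSketch` and its methods are hypothetical names. Each method returns how many characters the `.*` part ends up holding, or -1 if the attempt fails.

```java
public class BacktrackSketch {

    /** Greedy: eat everything first, then give back one character at a time. */
    public static int greedyPrefix(String input, String tail) {
        for (int eaten = input.length(); eaten >= 0; eaten--) {
            if (input.startsWith(tail, eaten)) {
                return eaten;
            }
        }
        return -1;
    }

    /** Reluctant: eat nothing first, then take one more character at a time. */
    public static int reluctantPrefix(String input, String tail) {
        for (int eaten = 0; eaten <= input.length(); eaten++) {
            if (input.startsWith(tail, eaten)) {
                return eaten;
            }
        }
        return -1;
    }

    /** Possessive: eat everything and make exactly one attempt. */
    public static int possessivePrefix(String input, String tail) {
        int eaten = input.length();
        return input.startsWith(tail, eaten) ? eaten : -1;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        System.out.println(greedyPrefix(input, "foo"));     // 10
        System.out.println(reluctantPrefix(input, "foo"));  // 1
        System.out.println(possessivePrefix(input, "foo")); // -1
    }
}
```

The greedy and reluctant loops visit the same candidate positions in opposite order, which is why they can settle on different matches; the possessive version has no loop at all.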

Consider the following example:

```
Enter your regex: .*foo  // greedy, the default in Java
Enter input string to search: xfooxxxxxxfoo
I found the text "xfooxxxxxxfoo" starting at index 0 and ending at index 13.

Enter your regex: .*?foo  // reluctant
Enter input string to search: xfooxxxxxxfoo
I found the text "xfoo" starting at index 0 and ending at index 4.
I found the text "xxxxxxfoo" starting at index 4 and ending at index 13.

Enter your regex: .*+foo // possessive
Enter input string to search: xfooxxxxxxfoo
No match found.
```

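The first example uses the greedy quantifier `.*` to find "anything", zero or more times, followed by the letters `f`, `o`, `o`. Because the quantifier is greedy, `.*` first eats the entire input. At that point the overall expression cannot succeed, because the last three letters (`f`, `o`, `o`) have already been eaten. The matcher therefore backs off one character at a time until the rightmost `foo` has been given back, at which point the match succeeds and the search ends.

The second example is reluctant, so it starts by eating nothing. Because `foo` does not appear at the very beginning of the input, it is forced to eat the first letter (`x`), which produces the first match, from index 0 to 4. The search then continues from index 4 until the input is exhausted, and finds a second match from index 4 to 13.

The third example finds no match because the quantifier is possessive. The entire input is eaten by `.*+`, leaving nothing to satisfy the `foo` at the end of the expression. Use a possessive quantifier when you want to seize all of something without ever backing off. When no match can be found, it generally performs somewhat better than the equivalent greedy quantifier, because it does no backtracking.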
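To see exactly what the quantifier "ate" in each case, `.*` can be wrapped in a capturing group. The following is a minimal sketch with hypothetical names (`ConsumedByDotStar`, `eaten`); it looks at the first match only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ConsumedByDotStar {

    /** Returns what group 1 captured in the first match, or null when nothing matches. */
    public static String eaten(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        System.out.println(eaten("(.*)foo", input));   // greedy: xfooxxxxxx
        System.out.println(eaten("(.*?)foo", input));  // reluctant: x
        System.out.println(eaten("(.*+)foo", input));  // possessive: null
    }
}
```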

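*Appendix: [original reference article](https://blog.csdn.net/renfufei/article/details/79029624)*

Tags: Java, Thinking in Java, strings, regular expressions

Unless otherwise noted, all articles on this blog are the author's original work.

If you repost this article, please cite the source: https://lilinchao.com/archives/326.html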
