13 December 2020

Sets and ranges [...]

Several characters or character classes inside square brackets [...] mean "search for any character among the given ones".

Sets

For instance, [eao] means any of the 3 characters: 'a', 'e', or 'o'.

That's called a set. Sets can be used in a regexp along with regular characters:

// find [t or m], and then "op"
alert( "Mop top".match(/[tm]op/gi) ); // "Mop", "top"

Please note that although there are multiple characters in the set, they correspond to exactly one character in the match.

So the example below gives no matches:

// find "V", then [o or i], then "la"
alert( "Voila".match(/V[oi]la/) ); // null, no matches

The pattern searches for:

  • V,
  • then one of the letters [oi],
  • then la.

So there would be a match for Vola or Vila.
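The "one set, one character" rule can be checked with a small sketch. It uses console.log instead of alert so that it also runs outside a browser; the second pattern, with two sets in a row, is our own illustration and not part of the example above.

```javascript
// A set stands for exactly one character in the match.
const onePattern = /V[oi]la/;

// "Vola" and "Vila" fit the pattern, "Voila" has one letter too many.
const oneSetResults = ["Vola", "Vila", "Voila"].map(word => onePattern.test(word));
console.log(oneSetResults); // [ true, true, false ]

// Two sets in a row consume two characters, so "Voila" fits.
const twoSetsResult = /V[oi][oi]la/.test("Voila");
console.log(twoSetsResult); // true
```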

Ranges

Square brackets may also contain character ranges.

For instance, [a-z] is a character in the range from a to z, and [0-5] is a digit from 0 to 5.

In the example below we're searching for "x" followed by two digits or letters from A to F:

alert( "Exception 0xAF".match(/x[0-9A-F][0-9A-F]/g) ); // xAF

Here [0-9A-F] has two ranges: it searches for a character that is either a digit from 0 to 9 or a letter from A to F.

If we'd like to look for lowercase letters as well, we can add the range a-f: [0-9A-Fa-f]. Or add the flag i.
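A minimal sketch comparing the three variants; the sample string is made up for illustration. Note that the flag i also makes the literal x case-insensitive, which the explicit range a-f does not.

```javascript
// Hypothetical sample text with upper-case, lower-case and invalid hex digits.
const hexText = "Colors: 0xAF, 0xaf, 0xZZ";

const upperOnly = hexText.match(/x[0-9A-F][0-9A-F]/g);
const bothCases = hexText.match(/x[0-9A-Fa-f][0-9A-Fa-f]/g);
const withFlagI = hexText.match(/x[0-9A-F][0-9A-F]/gi);

console.log(upperOnly); // [ 'xAF' ]
console.log(bothCases); // [ 'xAF', 'xaf' ]
console.log(withFlagI); // [ 'xAF', 'xaf' ]
```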

We can also use character classes inside [...].

For instance, if we'd like to look for a word character \w or a hyphen -, then the set is [\w-].

Combining multiple classes is also possible, e.g. [\s\d] means "a space character or a digit".
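Both sets can be tried on short strings. The sample strings and the + quantifier (one or more repetitions) are our additions for illustration.

```javascript
// [\w-] : a word character or a hyphen; "+" repeats the set one or more times.
const wordOrHyphen = "well-known, 2-in-1!".match(/[\w-]+/g);
console.log(wordOrHyphen); // [ 'well-known', '2-in-1' ]

// [\s\d] : a space character or a digit.
const spaceOrDigit = "a1 b2".match(/[\s\d]/g);
console.log(spaceOrDigit); // [ '1', ' ', '2' ]
```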

Character classes are shorthands for certain character sets

For instance:

  • \d is the same as [0-9],
  • \w is the same as [a-zA-Z0-9_],
  • \s is the same as [\t\n\v\f\r ], plus a few other rare Unicode space characters.
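The equivalences in the list above can be spot-checked on a sample string. This is only a sketch on one made-up input, not a proof; the last two lines use the non-breaking space U+00A0 as one example of the "rare" space characters that \s matches and the explicit set does not.

```javascript
// Hypothetical sample containing letters, digits, an underscore, a tab and a newline.
const sample = "Item_42:\tok\n";

// Compare the matches of each shorthand class with its explicit set.
const sameDigits = String(sample.match(/\d/g)) === String(sample.match(/[0-9]/g));
const sameWord = String(sample.match(/\w/g)) === String(sample.match(/[a-zA-Z0-9_]/g));
const sameSpace = String(sample.match(/\s/g)) === String(sample.match(/[\t\n\v\f\r ]/g));
console.log(sameDigits, sameWord, sameSpace); // true true true

// \s covers more than the explicit set, e.g. the non-breaking space.
const nbspShorthand = /\s/.test("\u00A0");
const nbspExplicit = /[\t\n\v\f\r ]/.test("\u00A0");
console.log(nbspShorthand, nbspExplicit); // true false
```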

Example: multi-language \w

As the character class \w is a shorthand for [a-zA-Z0-9_], it can't find Chinese characters, Cyrillic letters, etc.

We can write a more universal pattern that looks for word characters in any language. That's easy with Unicode properties: [\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}].

Let's decipher it. Similar to \w, we're making a set of our own that includes characters with the following Unicode properties:

  • Alphabetic (Alpha): for letters,
  • Mark (M): for accents,
  • Decimal_Number (Nd): for digits,
  • Connector_Punctuation (Pc): for the underscore '_' and similar characters,
  • Join_Control (Join_C): two special codes, 200c and 200d, used in ligatures, e.g. in Arabic.

An example of use:

let regexp = /[\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]/gu;

let str = `Hi 你好 12`;

// finds all letters and digits:
alert( str.match(regexp) ); // H,i,你,好,1,2

Of course, we can edit this pattern: add Unicode properties or remove them. Unicode properties are covered in more detail in the article Unicode: flag "u" and class \p{...}.
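As one possible edit, adding the + quantifier turns the set into a whole-word matcher. The sketch below compares it with plain \w+ on a string that mixes Latin letters, Chinese characters and digits; the variable names are ours.

```javascript
// The same Unicode-aware set as above, repeated one or more times.
const universalWord = /[\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]+/gu;
const mixedText = "Hi 你好 12";

const universalMatches = mixedText.match(universalWord);
const asciiMatches = mixedText.match(/\w+/g);

console.log(universalMatches); // [ 'Hi', '你好', '12' ]
console.log(asciiMatches);     // [ 'Hi', '12' ]
```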

Unicode properties aren't supported in IE

Unicode properties \p{...} are not implemented in IE. If we really need them, we can use the XRegExp library.

Or just use ranges of characters in a language that interests us, e.g. [а-я] for Cyrillic letters.
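A minimal sketch of the range approach for lowercase Cyrillic letters, on a made-up string. One caveat worth knowing: the range follows character codes, and the letter ё (U+0451) lies outside а-я (U+0430 to U+044F), so it has to be added to the set separately if needed.

```javascript
// Lowercase Cyrillic letters from а (U+0430) to я (U+044F).
const cyrillicLower = /[а-я]+/g;

const cyrillicFound = "привет, world".match(cyrillicLower);
console.log(cyrillicFound); // [ 'привет' ]

// ё (U+0451) is not inside the range а-я.
const rangeHasYo = /[а-я]/.test("ё");
console.log(rangeHasYo); // false
```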

Excluding ranges

Besides normal ranges, there are "excluding" ranges that look like [^...].

They are denoted by a caret character ^ at the start and match any character except the given ones.

For instance:

  • [^aeyo]: any character except 'a', 'e', 'y' or 'o'.
  • [^0-9]: any character except a digit, the same as \D.
  • [^\s]: any non-space character, the same as \S.

The example below looks for any characters except letters, digits and spaces:

alert( "alice15@gmail.com".match(/[^\d\sA-Z]/gi) ); // @ and .
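Excluding ranges are also handy with replace, to strip everything we don't want. A sketch, assuming a phone number written with brackets and dashes (the number itself is made up):

```javascript
// Remove every character that is neither a digit nor a plus sign.
const rawPhone = "+7(903)-123-45-67";
const cleanedPhone = rawPhone.replace(/[^\d+]/g, "");

console.log(cleanedPhone); // +79031234567
```

Inside the brackets the plus sign is a plain character, so it needs no escaping.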

Escaping in [...]

Usually when we want to find exactly a special character, we need to escape it, like \. for a dot. And if we need a backslash, then we use \\, and so on.

In square brackets we can use the vast majority of special characters without escaping:

  • Symbols . + ( ) never need escaping.
  • A hyphen - is not escaped at the beginning or the end (where it does not define a range).
  • A caret ^ is only escaped at the beginning (where it means exclusion).
  • The closing square bracket ] is always escaped (if we need to look for that symbol).

In other words, all special characters are allowed without escaping, except when they mean something for square brackets.

A dot . inside square brackets means just a dot. The pattern [.,] would look for one of the characters: either a dot or a comma.

In the example below the regexp [-().^+] looks for one of the characters -().^+:

// No need to escape
let regexp = /[-().^+]/g;

alert( "1 + 2 - 3".match(regexp) ); // Matches +, -

...But if you decide to escape them "just in case", then there would be no harm:

// Escaped everything
let regexp = /[\-\(\)\.\^\+]/g;

alert( "1 + 2 - 3".match(regexp) ); // also works: +, -
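The cases where escaping does matter can be sketched as well. The set below is our own illustration: the caret is not at the beginning, the hyphen is at the end, and only the closing bracket is escaped.

```javascript
// Set of four characters: a, ^ (literal, not first), ] (escaped), - (literal, last).
const specialSet = /[a^\]-]/g;
const specialHits = "a^]-b".match(specialSet);
console.log(specialHits); // [ 'a', '^', ']', '-' ]

// A backslash inside the brackets still has to be written as \\.
const hasBackslash = /[\\]/.test("C:\\dir");
console.log(hasBackslash); // true
```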

Ranges and flag "u"

If there are surrogate pairs in the set, flag u is required for them to work correctly.

For instance, let's look for [𝒳𝒴] in the string 𝒳:

alert( '𝒳'.match(/[𝒳𝒴]/) ); // shows a strange character, like [?]
// (the search was performed incorrectly, half-character returned)

The result is incorrect, because by default regular expressions "don't know" about surrogate pairs.

The regular expression engine thinks that [𝒳𝒴] is not two, but four characters:

  1. left half of 𝒳 (1),
  2. right half of 𝒳 (2),
  3. left half of 𝒴 (3),
  4. right half of 𝒴 (4).

We can see their codes like this:

for (let i = 0; i < '𝒳𝒴'.length; i++) {
  alert('𝒳𝒴'.charCodeAt(i)); // 55349, 56499, 55349, 56500
}

So, the example above finds and shows the left half of 𝒳.

If we add flag u, then the behavior will be correct:

alert( '𝒳'.match(/[𝒳𝒴]/u) ); // 𝒳

A similar situation occurs when looking for a range, such as [𝒳-𝒴].

If we forget to add flag u, there will be an error:

'𝒳'.match(/[𝒳-𝒴]/); // Error: Invalid regular expression

The reason is that without flag u surrogate pairs are perceived as two characters, so [𝒳-𝒴] is interpreted as [<55349><56499>-<55349><56500>] (every surrogate pair is replaced with its codes). Now it's easy to see that the range 56499-55349 is invalid: its starting code 56499 is greater than the end 55349. That's the formal reason for the error.

With the flag u the pattern works correctly:

// look for characters from 𝒳 to 𝒵
alert( '𝒴'.match(/[𝒳-𝒵]/u) ); // 𝒴
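The whole section can be condensed into one sketch. To keep it independent of file encoding, the two characters are built from their code points with String.fromCodePoint and the patterns are assembled with new RegExp; both choices are ours and not part of the examples above.

```javascript
const scriptX = String.fromCodePoint(0x1d4b3); // 𝒳, stored as the pair 55349, 56499
const scriptY = String.fromCodePoint(0x1d4b4); // 𝒴, stored as the pair 55349, 56500

// Without "u" the set holds four halves, so only the left half of 𝒳 is returned.
const halfMatch = scriptX.match(new RegExp(`[${scriptX}${scriptY}]`))[0];
// With "u" the set holds two whole characters.
const fullMatch = scriptX.match(new RegExp(`[${scriptX}${scriptY}]`, "u"))[0];
console.log(halfMatch.length, fullMatch.length); // 1 2

// Without "u" the range reads as 56499-55349, which is out of order.
let rangeFailure = null;
try {
  new RegExp(`[${scriptX}-${scriptY}]`);
} catch (err) {
  rangeFailure = err;
}
console.log(rangeFailure instanceof SyntaxError); // true

// With "u" the same range is valid and contains 𝒴.
const rangeWithU = new RegExp(`[${scriptX}-${scriptY}]`, "u").test(scriptY);
console.log(rangeWithU); // true
```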

Tasks

We have a regexp /Java[^script]/.

Does it match anything in the string Java? In the string JavaScript?

Answers: no, yes.

  • In the string Java it doesn't match anything, because [^script] means "any character except given ones". So the regexp looks for "Java" followed by one such symbol, but there's a string end, no symbols after it.

    alert( "Java".match(/Java[^script]/) ); // null
  • Yes, because the [^script] part matches the character "S". It's not one of script. As the regexp is case-sensitive (no i flag), it treats "S" as a different character from "s".

    alert( "JavaScript".match(/Java[^script]/) ); // "JavaS"
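To see how much the answer depends on case sensitivity, here is a sketch that runs the same pattern with and without the flag i. With the flag, "S" counts as one of the excluded letters, so nothing matches.

```javascript
// Case-sensitive: "S" is not in the excluded set, so it can follow "Java".
const sensitiveMatch = "JavaScript".match(/Java[^script]/);
console.log(sensitiveMatch[0]); // JavaS

// Case-insensitive: "S" is treated like "s", which is excluded.
const insensitiveMatch = "JavaScript".match(/Java[^script]/i);
console.log(insensitiveMatch); // null
```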

The time can be in the format hours:minutes or hours-minutes. Both hours and minutes have 2 digits: 09:00 or 21-30.

Write a regexp to find time:

let regexp = /your regexp/g;
alert( "Breakfast at 09:00. Dinner at 21-30".match(regexp) ); // 09:00, 21-30

P.S. In this task we assume that the time is always correct, there's no need to filter out bad strings like "45:67". Later we'll deal with that too.

Answer: \d\d[-:]\d\d.

let regexp = /\d\d[-:]\d\d/g;
alert( "Breakfast at 09:00. Dinner at 21-30".match(regexp) ); // 09:00, 21-30

Please note that the dash '-' has a special meaning in square brackets, but only between other characters, not when it's at the beginning or at the end, so we don't need to escape it.
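A short sketch of that note: the dash works the same at the beginning, at the end, or escaped, while a dash between two characters silently becomes a range. The last pattern and its sample string are our own illustration.

```javascript
const timeText = "Breakfast at 09:00. Dinner at 21-30";

// Dash first, dash last, dash escaped: all three find the same times.
const timeVariants = [/\d\d[-:]\d\d/g, /\d\d[:-]\d\d/g, /\d\d[:\-]\d\d/g];
const timeResults = timeVariants.map(re => timeText.match(re).join(","));
console.log(timeResults); // [ '09:00,21-30', '09:00,21-30', '09:00,21-30' ]

// Between two characters the dash defines a range: here from "+" (code 43) to ":" (code 58),
// which also covers the comma and the dot.
const accidentalRange = "a,b.c".match(/[+-:]/g);
console.log(accidentalRange); // [ ',', '.' ]
```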
