rule

Bubs@lemmy.zip · 8 days ago

As someone who doesn’t know any regex syntax, is there any simple explanation for what the expression on the board does?

NaibofTabr@infosec.pub · edited 8 days ago

Basic concept: the purpose of regex is to search input text for matching patterns of characters.

Assuming this is correct (including the spaces):

/ ^(\d{3}) - (\d{2}) - (\d{5}) \2 \1$ / g

Then:

/ The first forward slash is the delimiter which tells the code that this is the start of the regex (start interpreting the expression after this).

^ The caret marks the beginning of the text string being searched or, when the multi-line flag (m) is set, the beginning of each line of text. Any match found by the following regex must therefore start at the beginning of the input text (or of a line), not somewhere in the middle of it.

(\d{3}) This is the first group for matching actual text characters. The \d matches any single digit (0-9). The {3} attached to it means that there must be exactly 3 digits adjacent to each other, no more, no less.

_-_ (underscore indicating that there is a space in the original expression) This must match a [space][dash][space] as literal characters.

(\d{2}) As before, this matches two adjacent digits. This is the second matching group.

_-_ Same as above, [space][dash][space].

(\d{5}) Same as the two patterns before, this matches five adjacent digits. This is the third group.

_\2 The [space] here matters, indicating that there must be a space character between the previously matched group of five digits and the following match group \2, which says to match the same text as the most recently matched 2nd group. In this case the second group would be (\d{2}), so this must match the same two digits as were matched by (\d{2}) in the same order.

_\1 Similar to the above, this must match a [space] and then the same text that was most recently matched by the first group. In this case that would be the (\d{3}).

$ This is the counterpart of the ^: it matches the end of the input text or, with the multi-line flag (m), the end of each line of text. This means that there cannot be any more characters in the input text (or on that line) after the last characters that match the specified pattern.

/ g The / is again a delimiter, indicating the end of the regex. The g means “global”, which instructs the code to search the entire input text for all possible matches and return all of them at the end of the search (default regex behavior is to search until the first match, then stop and return that result).

So example matches would look like this:
111 - 22 - 33333 22 111
012 - 01 - 01234 01 012
987 - 98 - 98765 98 987

But this would not match:
11 - 222 - 33333 222 11 (incorrect numbers of digits in the first and second groups)
012 - 01 - 01234 10 012 (the second group of 2 digits does not match the first group of 2 digits)
987-98-9876598987 (spaces are missing)
111 22 33333 22 111 (dashes are missing)
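
A quick way to check the examples above is to run them through the pattern. This is only a minimal sketch in Python, assuming the pattern is exactly as written on the board with the literal spaces kept. Python has no /.../g literal syntax, so the delimiters and the flag are dropped, and the name SERIAL is made up for illustration:

```python
import re

# Pattern from the board, literal spaces kept. SERIAL is a made-up name.
SERIAL = re.compile(r"^(\d{3}) - (\d{2}) - (\d{5}) \2 \1$")

should_match = [
    "111 - 22 - 33333 22 111",
    "012 - 01 - 01234 01 012",
    "987 - 98 - 98765 98 987",
]
should_not_match = [
    "11 - 222 - 33333 222 11",   # wrong digit counts in groups 1 and 2
    "012 - 01 - 01234 10 012",   # \2 must repeat "01", not "10"
    "987-98-9876598987",         # spaces are missing
    "111 22 33333 22 111",       # dashes are missing
]

for text in should_match:
    assert SERIAL.search(text) is not None

for text in should_not_match:
    assert SERIAL.search(text) is None

# Groups are numbered left to right by their opening parenthesis.
m = SERIAL.search("987 - 98 - 98765 98 987")
print(m.groups())  # ('987', '98', '98765')
```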

Speculation:
The matched string looks like a serial number or part number or something like that, so probably the use case for this regex is to search through a file containing a long list of such numbers all separated on new lines of text, to find specific ones (for some reason). Maybe numbers that match this pattern are invalid, or maybe only numbers that match this pattern are valid and everything else that might be in the file needs to be removed.

Based on this I think the end is actually wrong and should be / gm (m for multi-line) to allow for searching (and returning) multiple lines of input text. Otherwise, this should be part of code which splits the lines of the input text file into individual strings and then feeds them through the regex one at a time - but if that’s the case then using the g (global) flag doesn’t really make sense.
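
To make the difference concrete, here is a rough Python sketch of both approaches. The three-line input is made-up sample data, re.finditer stands in for the g flag, and re.MULTILINE is Python's equivalent of m:

```python
import re

PATTERN = r"^(\d{3}) - (\d{2}) - (\d{5}) \2 \1$"

# Hypothetical file contents: one candidate number per line.
text = "\n".join([
    "111 - 22 - 33333 22 111",
    "012 - 01 - 01234 10 012",
    "987 - 98 - 98765 98 987",
])

# Without the multi-line flag, ^ and $ anchor to the whole string,
# so a three-line input cannot match at all.
no_flag = [m.group(0) for m in re.finditer(PATTERN, text)]

# With re.MULTILINE, ^ and $ anchor to each line.
multiline = [m.group(0) for m in re.finditer(PATTERN, text, flags=re.MULTILINE)]

# Alternative: split into lines and test each one separately.
per_line = [line for line in text.splitlines() if re.search(PATTERN, line)]

print(no_flag)    # []
print(multiline)  # first and third lines only
print(per_line)   # same two lines
```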

With thanks to https://regex101.com/

Bubs@lemmy.zip · 8 days ago

Solid explanation

NaibofTabr@infosec.pub · 8 days ago

Updated because I clicked the reply button before it was actually done.

Bubs@lemmy.zip · 8 days ago

Oh lordy, that’s just a tad bit longer XD

expr@piefed.social · 8 days ago

Traditionally, the global flag is used to mean global within a line, meaning all matches in a line.

NaibofTabr@infosec.pub · 8 days ago

Right, but this expression has an explicit ^ and $, so if there’s anything else in the input line besides a single instance of the pattern, it won’t match. This makes the g kind of pointless: there can’t possibly be multiple instances of the pattern in the same line and still return a valid match.
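
A small Python sketch of that point, using a made-up line that holds two instances of the pattern back to back (re.finditer plays the role of g here, returning every match rather than just the first):

```python
import re

# One line holding two instances of the pattern (made-up sample data).
line = "111 - 22 - 33333 22 111 987 - 98 - 98765 98 987"

anchored = r"^(\d{3}) - (\d{2}) - (\d{5}) \2 \1$"
unanchored = r"(\d{3}) - (\d{2}) - (\d{5}) \2 \1"

anchored_hits = [m.group(0) for m in re.finditer(anchored, line)]
unanchored_hits = [m.group(0) for m in re.finditer(unanchored, line)]

print(anchored_hits)    # [] because the anchors reject the extra text
print(unanchored_hits)  # both instances are found
```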

Randelung@lemmy.world · 8 days ago

TIL groups can be used to look for repeating strings.

NigelFrobisher@aussie.zone · 8 days ago

Never knew the repeat group bit. Can’t really think of a practical use case for it though…

glibg10b@lemmy.zip · 8 days ago

I use it in Vim. Sometimes you want to rename a variable that’s present multiple times in the same line

NigelFrobisher@aussie.zone · 8 days ago

True, regex is nothing if not an everything tool!

exu@feditown.com · 8 days ago

Why not just match for the variable and use /g?

glibg10b@lemmy.zip · 7 days ago

I assumed that’s what they meant by “group bit”. I guess maybe they were talking about capture groups

a_jeering_serpent@sopuli.xyz · 8 days ago

(?:\d{3}-){2}(?:\d{4}) would match a ten-digit US-format phone number, though I’d recommend writing the \d{3}- part out twice literally instead of using a repeat, for maintainability reasons. Regex needs no assistance being terse and obtuse, and humans need time to understand regex patterns, even ones they wrote not long ago. Make that part easier on your collaborators, and treat your past and future selves like remote asynchronous collaborators, always.
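
For comparison, a quick Python sketch of the compact form next to a spelled-out one. The spelled-out pattern is my reading of "write it out twice", and the sample numbers are made up:

```python
import re

# The compact form from above, and a spelled-out equivalent.
compact = re.compile(r"(?:\d{3}-){2}(?:\d{4})")
spelled_out = re.compile(r"\d{3}-\d{3}-\d{4}")

samples = ["555-123-4567", "555-1234-567", "5551234567"]

# Both patterns should agree on every sample.
results = [
    (s, bool(compact.fullmatch(s)), bool(spelled_out.fullmatch(s)))
    for s in samples
]
for s, a, b in results:
    print(s, a, b)
```

Note that (?:...) is a non-capturing group, so unlike the (\d{3}) groups on the board it cannot be referred back to with \1 or \2.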

espurr@sopuli.xyz · 8 days ago

Bubs@lemmy.zip · 8 days ago

Fair enough lol