the original said reggae but i misread it as regex and got this idea lol

original comic artist: thisstupidtwink@insta

  • Bubs@lemmy.zip
    8 days ago

    As someone who doesn’t know any regex syntax, is there any simple explanation for what the expression on the board does?

    • NaibofTabr@infosec.pub
      7 days ago

      Basic concept: the purpose of regex is to search input text for matching patterns of characters.

      Assuming this is correct (including the spaces):

      / ^(\d{3}) - (\d{2}) - (\d{5}) \2 \1$ / g

      Then:

      / The first forward slash is the delimiter which tells the code that this is the start of the regex (start interpreting the expression after this).

      ^ The caret marks the beginning of the text being searched or, when the m (multi-line) flag is set, the beginning of any line within it. Any match found by the following regex must therefore start there, not somewhere in the middle of the text.

      (\d{3}) This is the first group for matching actual text characters. The \d matches any single digit (0-9). The {3} attached to it means that there must be exactly 3 digits adjacent to each other, no more, no less.

      _-_ (underscore indicating that there is a space in the original expression) This must match a [space][dash][space] as literal characters.

      (\d{2}) As before, this matches two adjacent digits. This is the second matching group.

      _-_ Same as above, [space][dash][space].

      (\d{5}) Same as the two patterns before, this matches five adjacent digits. This is the third group.

      _\2 The [space] here matters, indicating that there must be a space character between the previously matched group of five digits and the following match group \2, which says to match the same text as the most recently matched 2nd group. In this case the second group would be (\d{2}), so this must match the same two digits as were matched by (\d{2}) in the same order.

      _\1 Similar to the above, this must match a [space] and then the same text that was most recently matched by the first group. In this case that would be the (\d{3}).

      $ This is the counterpart of the ^: it matches the end of the input text or, with the m flag, the end of a line of text. This means that there cannot be any more characters in the input text after the last characters that match the specified pattern.

      / g The / is again a delimiter, indicating the end of the regex. The g means “global”, which instructs the code to search the entire input text for all possible matches and return all of them at the end of the search (default regex behavior is to search until the first match, then stop and return that result).

      So example matches would look like this:
      111 - 22 - 33333 22 111
      012 - 01 - 01234 01 012
      987 - 98 - 98765 98 987

      But this would not match:
      11 - 222 - 33333 222 11 (incorrect numbers of digits in the first and second groups)
      012 - 01 - 01234 10 012 (the second group of 2 digits does not match the first group of 2 digits)
      987-98-9876598987 (spaces are missing)
      111 22 33333 22 111 (dashes are missing)
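
The examples above can be checked with a short script. This is only a sketch in Python rather than the JavaScript-style literal on the board: the / delimiters and the g flag are dropped, and the spaces just inside the delimiters are assumed not to be part of the pattern. The names BOARD_PATTERN and is_match are made up for this illustration.

```python
import re

# Assumption: the board's pattern, minus the / delimiters and the g flag.
BOARD_PATTERN = re.compile(r"^(\d{3}) - (\d{2}) - (\d{5}) \2 \1$")


def is_match(text: str) -> bool:
    """Return True if the whole string fits the board's pattern."""
    return BOARD_PATTERN.search(text) is not None


should_match = [
    "111 - 22 - 33333 22 111",
    "012 - 01 - 01234 01 012",
    "987 - 98 - 98765 98 987",
]

should_not_match = [
    "11 - 222 - 33333 222 11",   # wrong digit counts in groups 1 and 2
    "012 - 01 - 01234 10 012",   # "10" is not a repeat of group 2 ("01")
    "987-98-9876598987",         # spaces are missing
    "111 22 33333 22 111",       # dashes are missing
]

for candidate in should_match + should_not_match:
    print(f"{candidate!r}: {is_match(candidate)}")
```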


      Speculation:
      The matched string looks like a serial number or part number or something like that, so probably the use case for this regex is to search through a file containing a long list of such numbers all separated on new lines of text, to find specific ones (for some reason). Maybe numbers that match this pattern are invalid, or maybe only numbers that match this pattern are valid and everything else that might be in the file needs to be removed.

      Based on this I think the end is actually wrong and should be / gm (m for multi-line) to allow for searching (and returning) multiple lines of input text. Otherwise, this should be part of code which splits the lines of the input text file into individual strings and then feeds them through the regex one at a time - but if that’s the case then using the g (global) flag doesn’t really make sense.
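
To illustrate the point about the m flag, here is a sketch in Python, where re.MULTILINE plays the role of m and re.finditer roughly plays the role of g. The sample text and the variable names are invented for the example.

```python
import re

PATTERN = r"^(\d{3}) - (\d{2}) - (\d{5}) \2 \1$"

# Hypothetical file contents: one candidate number per line.
text = "\n".join([
    "111 - 22 - 33333 22 111",
    "not a serial number",
    "987 - 98 - 98765 98 987",
])

# Without MULTILINE, ^ and $ anchor to the whole input, so nothing matches.
whole_input = [m.group(0) for m in re.finditer(PATTERN, text)]

# With MULTILINE, ^ and $ also anchor at line breaks, so every line is tested.
per_line = [m.group(0) for m in re.finditer(PATTERN, text, flags=re.MULTILINE)]

# The alternative from the comment: split into lines, then test each one.
split_first = [line for line in text.splitlines() if re.search(PATTERN, line)]

print(whole_input)
print(per_line)
print(split_first)
```

In the split-first variant each string can contain at most one match, which is why a global search adds nothing there.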


      With thanks to https://regex101.com/

      • expr@piefed.social
        8 days ago

        Traditionally, the global flag is used to mean global within a line, meaning all matches in a line.

        • NaibofTabr@infosec.pub
          8 days ago

          Right, but this expression has an explicit ^ and $, so if there’s anything else in the input line besides a single instance of the pattern, it won’t match. This makes the g kind of pointless: there can’t possibly be multiple instances of the pattern in the same line and still return a valid match.

        • glibg10b@lemmy.zip
          8 days ago

          I use it in Vim. Sometimes you want to rename a variable that’s present multiple times in the same line

        • a_jeering_serpent@sopuli.xyz
          7 days ago

          (?:\d{3}-){2}(?:\d{4}) would match a ten-digit US-format phone number, though I’d recommend writing the three-digit group out twice literally instead of using a repeat, for maintainability reasons. Regex needs no assistance being terse and obtuse; humans need time to understand regex patterns, even ones they wrote not long ago. Make that part easier on your collaborators, and treat your past and future selves like remote asynchronous collaborators, always.
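
A small sketch of the two spellings side by side, in Python. The sample numbers and the names compact and spelled_out are invented for the illustration; fullmatch is used so that the whole string has to fit.

```python
import re

# The compact form from the comment, using a repeated non-capturing group.
compact = re.compile(r"(?:\d{3}-){2}(?:\d{4})")

# The same pattern with the repeated part written out literally.
spelled_out = re.compile(r"\d{3}-\d{3}-\d{4}")

samples = ["555-867-5309", "555-8675-309", "5558675309"]

# Map each sample to (compact result, spelled-out result).
results = {
    s: (compact.fullmatch(s) is not None, spelled_out.fullmatch(s) is not None)
    for s in samples
}

for sample, (a, b) in results.items():
    print(f"{sample}: compact={a} spelled_out={b}")
```

Both forms accept and reject the same samples here; the difference is only in how quickly a reader can see the expected shape.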

  • morto@piefed.social
    8 days ago

    A couple of years ago, there was a guy wanting to use some LLM-based agentic whatever to extract a specific piece of information from a group of research articles in PDF. His justification was that it was a complex kind of information that needed to be extracted. After the same task was assigned to me, I dumped the LLM and did the same thing, with fewer errors and much faster, by just using some regex patterns with pdfgrep. It can be complicated, but it’s so powerful! And I don’t even know shit about regex, I just used a search engine and some trial and error lol

    • jjj@lemmy.blahaj.zone
      8 days ago

      Is there a version of regex with comments?

      I mean one would typically insert it as a literal in another language but if there are flexible macros it could be done without any runtime cost/standard reinventing.
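
Several engines have an "extended" or "verbose" mode that ignores whitespace in the pattern and allows # comments. A sketch using Python's re.VERBOSE, applied to the board regex from this thread as an assumed example. The name SERIAL is made up, and literal spaces are written as [ ] because verbose mode would otherwise drop them.

```python
import re

# Verbose mode: whitespace in the pattern is ignored and # starts a comment.
SERIAL = re.compile(
    r"""
    ^
    (\d{3})      # group 1: exactly three digits
    [ ]-[ ]      # literal space, dash, space
    (\d{2})      # group 2: exactly two digits
    [ ]-[ ]      # literal space, dash, space
    (\d{5})      # group 3: exactly five digits
    [ ]\2        # a space, then the same text as group 2
    [ ]\1        # a space, then the same text as group 1
    $
    """,
    re.VERBOSE,
)

print(SERIAL.search("111 - 22 - 33333 22 111") is not None)
print(SERIAL.search("111-22-33333 22 111") is not None)
```

Note that in Python the pattern is still compiled at run time, so this addresses readability rather than the runtime-cost part of the question.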

  • Jayjader@jlai.lu
    8 days ago

    Backreferences are bad for performance and make the grammar being matched irregular, if I remember my comp sci classes. I don’t think they should be taught in a Regex 101 class.

    • a_jeering_serpent@sopuli.xyz
      7 days ago

      Are you thinking of lookarounds? Backreferences do have some performance impact, but a lookahead or lookbehind much more so. That definitely breaks the regularity, but I’m not sure that applies to backreferences (which may be my own ignorance). Performance-wise, unmatched lookarounds are the least performant, getting worse as the size of the corpus increases. A positive lookahead/lookbehind has to scan all the text before or after the assertion to determine match failure, and likewise negatives must do the same to determine match success. Greedier matching also amplifies things here (do you want just the first match or all of them?).

      I’m more fluent in regex syntaxes than in the implementation details of any specific regex engine, so please correct me if you know I’m wrong, both for my own edification and so that when I share things going forward I’m sharing the most accurate information that I can.

      • Jayjader@jlai.lu
        7 days ago

        Turns out we’re both off the mark: it’s catastrophic backtracking that is “dangerously” vulnerable to performance issues. Something as simple as (a+)+b is enough to trigger the “bad” behavior. I assume you can achieve it with backreferences and lookarounds as well.

        This video gives a good breakdown of what exactly is going on inside a compiled regex automata that encodes such a case: https://www.youtube.com/watch?v=gITmP0IWff0
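
A rough way to see the effect on a backtracking engine, sketched with Python's re module. The function name and the input sizes are arbitrary choices for the illustration, and the timings depend on the machine and the engine, so none are quoted here. On an engine like this one, the time for the failing match should grow steeply as the input gets longer.

```python
import re
import time

# The nested-quantifier pattern mentioned above.
NESTED = re.compile(r"(a+)+b")


def time_failed_match(n: int) -> float:
    """Time how long the engine takes to give up on n 'a's with no 'b'."""
    text = "a" * n  # no trailing "b", so the match has to fail
    start = time.perf_counter()
    result = NESTED.match(text)
    elapsed = time.perf_counter() - start
    assert result is None
    return elapsed


# Kept deliberately small: each extra character makes the failure slower.
for n in (10, 14, 18):
    print(f"n={n}: {time_failed_match(n):.4f}s")
```

When the input does end in b, the same pattern matches right away; it is the failing case that forces the engine to try the many ways of splitting the run of a's.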

        • a_jeering_serpent@sopuli.xyz
          7 days ago

          Thank you friend! I honestly had almost forgotten that you could use + on a group (in extended syntax, I think?) like you can with *. In my experience I find lots of * groups, and I do my best to convert those to a range, e.g. {3,5}. When you can’t, you can typically at least still use an open range with a floor {3,} or a ceiling {,5}. I’m a big fan of explicit constraints when you have enough information to set them. It’s another good maintainability practice in my experience: the clearer the regex, the less example data you need to understand the intention.

          I especially like e.g. Ruby’s regexp.x flag that lets you ignore literal newlines and whitespace in the pattern (not to be confused with regexp.X which does the same but for the corpus), so you can split your pattern over multiple lines. I like to use indentation when it helps readability, and that also allows a multi-line comment header indented the same way. Sometimes you can even set inline comments, depending on language/engine/syntax.

          For significant whitespace in the pattern, wrap each whitespace character in a character class containing only itself, e.g. [ ][ ] for two literal spaces to match. This is also how I handle patterns for e.g. sed or grep in bash/zsh, which have their own whitespace semantics, to get whitespace literals in your patterns without the need to escape anything. The non-literal part of the pattern doesn’t change, and the literal part gets substituted in, piped through something like sed -E 's/(.)/[\1]/g'
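
The bounded ranges mentioned above can be tried out like this. It is a sketch in Python, whose re module accepts all three forms, including {,5} as shorthand for {0,5}; other engines may treat {,5} differently, so that part is an assumption to check per engine. The helper name fits is made up.

```python
import re


def fits(pattern: str, text: str) -> bool:
    """Return True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# Closed range: between three and five digits.
print(fits(r"\d{3,5}", "12"), fits(r"\d{3,5}", "1234"), fits(r"\d{3,5}", "123456"))

# Open range with a floor: at least three digits.
print(fits(r"\d{3,}", "12"), fits(r"\d{3,}", "123456"))

# Open range with a ceiling: at most five digits (including none).
print(fits(r"\d{,5}", ""), fits(r"\d{,5}", "12345"), fits(r"\d{,5}", "123456"))
```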

  • Dæmon S.@catodon.rocks
    8 days ago

    TIL, through a meme (yep, memes can be very teaching, too), that I can reuse capture groups in a recurring manner inside a RegExp (I didn’t know about the \(number) thing, but I readily inferred, due to past experience with using \(number) in KDE Kate’s Regexp replace, it had something to do with “this position must contain the nth group verbatim”, opened the DevTools, tried .match with a fixed version of the meme’s regex (i.e. without the invalid spaces) and a random phone sequence my mind conjured out of thin air, and voilà, the slash-number thing indeed behaved as I guessed it would behave). So… Thanks to whoever made the meme because TIL thanks to you!

    /^Be( )not\1(a)fr\2id$/ (Biblically-accurate RegExp).
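
For anyone who wants to try that pattern outside the browser, a minimal sketch in Python. The test strings and the name ANGEL are made up for the example.

```python
import re

# Group 1 captures a space and group 2 captures "a"; \1 and \2 repeat them.
ANGEL = re.compile(r"^Be( )not\1(a)fr\2id$")

match = ANGEL.search("Be not afraid")
print(match.groups() if match else None)

# A different separator in the second position breaks the \1 backreference.
print(ANGEL.search("Be not-afraid"))
```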

    • kartoffelsaft@programming.dev
      8 days ago

      This is a cool feature of a lot of regex implementations but I will warn you that reusing capture groups in a match means it’s impossible for any regex engine to guarantee a reasonable upper bound (best they can do is O(n!) I think? I’d have to look up the details). In a replacement string this is a non-issue because there’s no way they can recurse out of control.

      Edit: found the video I originally heard this from: https://www.youtube.com/watch?v=gITmP0IWff0

      • ChickenLadyLovesLife@lemmy.world
        7 days ago

        I was a programmer for 25 years and I never needed to use regex for anything. Also never once needed to write a sorting algorithm. My favorite sorting algorithm was the “SORT BY” clause in SQL.

    • adarza@piefed.ca
      8 days ago

      you will find regex in numerous things that have nothing to do with writing code. i don’t even need to leave firefox to find several instances in the addons i have installed.

    • SkunkWorkz@lemmy.world
      7 days ago

      I just ask an LLM if I ever need to create a regex query. Which is almost never, hence I don’t understand regex. Many programmers don’t understand regex. Like, when do you ever need regex if you program a game engine?