• bash aesthetics question: special characters in reg exp in [[ ... =~ ... ]]

    From Kenny McCormack@21:1/5 to All on Mon Jul 22 21:59:11 2024
    Note: this is just a question of aesthetics. Functionally, it all works as expected.

    Sample bash code:

    f="$(fortune)" # Get some multi-line output into "f"
    # Look for foo followed by bar on the same line
    [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar"

    The point is you need the "anything other than a newline" or else it might match foo on one line and bar on a later line. The above is the only way I could figure out to express a newline in the particular flavor of reg exps
    used by the =~ operator.

    The problem is that if the above is in a function, when you list out the function with "type funName", the \n has already been digested and
    converted to a hard newline. This makes the listing look strange. I'd
    rather see "\n".

    Is there any way to get this?

    --
    Shikata ga nai...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Kenny McCormack on Mon Jul 22 22:47:59 2024
    On 2024-07-22, Kenny McCormack <gazelle@shell.xmission.com> wrote:
    The problem is that if the above is in a function, when you list out the function with "type funName", the \n has already been digested and
    converted to a hard newline. This makes the listing look strange. I'd rather see "\n".

    I see what you mean:

    $ test() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
    $ set | grep -A 4 '^test'
    test ()
    {
    [[ "$f" =~ foo[^'
    ']*bar ]] && echo "foo bar"
    }

    Is there any way to get this?

    Patch Bash so that when it's listing code, any items that need '...'
    quoting and that contain control characters are printed as $'...'
    syntax with escape sequences.

    Someone who had their original code as '
    ' might not want that. It has to be an option.

    If Bash stored a bit in the code indicating "this word was produced
    using $ syntax", then it could be recovered accordingly.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Arti F. Idiot@21:1/5 to Kenny McCormack on Mon Jul 22 22:00:00 2024
    On 7/22/24 3:59 PM, Kenny McCormack wrote:
    Note: this is just a question of aesthetics. Functionally, it all works as expected.

    Sample bash code:

    f="$(fortune)" # Get some multi-line output into "f"
    # Look for foo followed by bar on the same line
    [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar"

    The point is you need the "anything other than a newline" or else it might match foo on one line and bar on a later line. The above is the only way I could figure out to express a newline in the particular flavor of reg exps used by the =~ operator.

    The problem is that if the above is in a function, when you list out the function with "type funName", the \n has already been digested and
    converted to a hard newline. This makes the listing look strange. I'd rather see "\n".

    Is there any way to get this?


    Not sure this really addresses your 'type funcName' query but maybe
    somewhat better output from 'type funcName' ? :

    ...
    regex=$(printf 'foo[^\n]*bar')
    [[ "$f" =~ $regex ]] && echo "foo bar"

    Kind of wish the regex string could be bracketed by "/" as in awk.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kenny McCormack@21:1/5 to Arti F. Idiot on Tue Jul 23 05:33:31 2024
    In article <v7n9s1$2p39$1@nnrp.usenet.blueworldhosting.com>,
    Arti F. Idiot <addr@is.invalid> wrote:
    ...
    Not sure this really addresses your 'type funcName' query but maybe
    somewhat better output from 'type funcName' ? :

    ...
    regex=$(printf 'foo[^\n]*bar')
    [[ "$f" =~ $regex ]] && echo "foo bar"

    Yes. I think there are actually some other situations like this (i.e.,
    issues involving using =~) - where putting the reg exp into a variable and
    then using the variable works better. It seems to be a common solution.
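    A concrete sketch of that idiom (my own illustration; the sample string and variable names are invented, not from the thread):

    ```shell
    # Invented sample data; the point is the variable-based =~ idiom.
    f=$'one foo and bar here\nanother line'
    # $'...' turns \n into a hard newline once, at assignment time, so the
    # variable carries the character while the code stays readable.
    re=$'foo[^\n]*bar'
    if [[ $f =~ $re ]]; then
        echo "matched: ${BASH_REMATCH[0]}"
    fi
    ```

    Note that the variable must stay unquoted on the right of =~; writing [[ $f =~ "$re" ]] makes bash treat the pattern as a literal string.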

    Kind of wish the regex string could be bracketed by "/" as in awk.

    Yes. I've often wished the same. It is annoying that you have to escape
    any spaces in the regexp (since it isn't delimited, like reg exps are in
    most other languages [not just AWK]).

    Incidentally, here's another situation - involving using $'' but not
    involving =~. In another script, I have:

    ctrla=$'\001'

    Then I use $ctrla thereafter. But when the function is listed, the above
    line comes out as:

    ctrla=$''

    Unless I pipe it into "less", in which case it displays as:

    ctrla=$'^A'

    (with the ^A in reverse video). The point being that, as before with \n,
    there is a hard 001 character in there, not a graphic representation of it
    (as there should, IMHO, be).
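    An aside the thread doesn't mention: bash's printf %q re-quotes a string in shell-reusable form, so the hard control character can at least be displayed as an escape on demand:

    ```shell
    # printf %q emits a shell-reusable quoting of its argument; control
    # characters come back as $'...' escapes instead of raw bytes.
    ctrla=$'\001'
    printf '%q\n' "$ctrla"    # prints $'\001'
    ```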

    Agreeing with what Kaz wrote, I'm not objecting to there being hard
    characters in the internal representation of the code, but rather, I am
    saying that when it is displayed (e.g., by the "type" command), it should be rendered in a visible way.

    And, yet, changing gears once again, I don't quite understand why you can't write [\n] with =~. You have to write [$'\n']. It's not like that in most other languages (e.g., AWK).

    Which all kind of echoes back to the other recent thread in this NG about regular expressions vs. globs. The cold hard fact is that there really is
    no such thing as "regular expressions" (*), since every language, every program, every implementation of them, is quite different.

    (*) As an abstract concept, separate from any specific implementation.

    --
    Trump - the President for the rest of us.

    https://www.youtube.com/watch?v=JSkUJKgdcoE

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Kaz Kylheku on Tue Jul 23 11:48:11 2024
    On 23.07.2024 00:47, Kaz Kylheku wrote:
    On 2024-07-22, Kenny McCormack <gazelle@shell.xmission.com> wrote:
    The problem is that if the above is in a function, when you list out the
    function with "type funName", the \n has already been digested and
    converted to a hard newline. This makes the listing look strange. I'd
    rather see "\n".

    I see what you mean:

    $ test() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
    $ set | grep -A 4 '^test'
    test ()
    {
    [[ "$f" =~ foo[^'
    ']*bar ]] && echo "foo bar"
    }

    Is there any way to get this?

    Of course (and out of curiosity) I tried that display detail as well
    in Kornshell to see how it behaves, and using a different command to
    display it...


    With my (old?) bash:

    $ f() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
    $ typeset -f f
    f ()
    {
    [[ "$f" =~ foo[^'
    ']*bar ]] && echo "foo bar"
    }


    The same with ksh:

    $ f() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
    $ typeset -f f
    f() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }


    And for good measure also in zsh:

    % f() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
    % typeset -f f
    f () {
    [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar"
    }


    Both seem to show "better aesthetics". Too bad it doesn't help for
    your bash context.

    Janis

    [...]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kenny McCormack@21:1/5 to janis_papanagnou+ng@hotmail.com on Tue Jul 23 11:46:19 2024
    In article <v7nu8t$15bon$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    ...
    Both (ksh & zsh) seem to show "better aesthetics".

    Indeed, it does. That is how it should work.

    Too bad it doesn't help for your bash context.

    Alas, it doesn't.

    --
    In American politics, there are two things you just don't f*ck with:

    1) Goldman Sachs
    2) The military/industrial complex

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Kenny McCormack on Tue Jul 23 16:44:37 2024
    On 23.07.2024 13:46, Kenny McCormack wrote:
    In article <v7nu8t$15bon$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    ...
    Both (ksh & zsh) seem to show "better aesthetics".

    Indeed, it does. That is how it should work.

    BTW, it's interesting that bash and zsh both reformat (sort
    of pretty-print) the code (when using 'typeset -f'), only
    that zsh keeps that literal '\n'. This may show a way (by
    zsh example) how to follow Kaz' suggestion of patching the
    bash. (But, frankly, I'm not sure it was meant seriously.)

    But ksh displays it as it had been typed in; a raw format.
    If you define your function, say, as multi-line code you
    also see it that way, there's no processing at that point
    (or the original retained as copy). I didn't expect that.

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kenny McCormack@21:1/5 to janis_papanagnou+ng@hotmail.com on Tue Jul 23 16:13:40 2024
    In article <v7ofkl$18d66$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    On 23.07.2024 13:46, Kenny McCormack wrote:
    In article <v7nu8t$15bon$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    ...
    Both (ksh & zsh) seem to show "better aesthetics".

    Indeed, it does. That is how it should work.

    BTW, it's interesting that bash and zsh both reformat (sort
    of pretty-print) the code (when using 'typeset -f'), only
    that zsh keeps that literal '\n'. This may show a way (by
    zsh example) how to follow Kaz' suggestion of patching the
    bash. (But, frankly, I'm not sure it was meant seriously. (see ** below))

    Yes. ksh seems to dump it out literally as is (as it was typed), but bash (and, I guess also zsh - I have zero knowledge or experience of zsh) pretty prints it. But it seems zsh does a prettier print than bash.

    One thing that bash does that's annoying is that it puts semicolons at the end of (almost) every line. I have, on occasion, had to recover a function from
    the bash pretty print (*), and one of the things that needs to be done is
    to remove those extraneous semicolons.

    (*) BTW, the command I use is "type". I.e., "type funName" displays the function definition of function funName. That seems to be the same as your
    use of "typeset".
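    That clean-up can be sketched roughly as follows (my own example; "recover" is a made-up function, and the sed is only a first approximation - it would also strip a semicolon that legitimately ends a line):

    ```shell
    # Made-up function to demonstrate bash's reserialization.
    recover() {
        echo one
        echo two
    }
    # "declare -f" (like "type") prints bash's pretty-printed form, with
    # semicolons appended to most command lines; the sed strips them again.
    declare -f recover | sed 's/;$//'
    ```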

    But ksh displays it as it had been typed in; a raw format.
    If you define your function, say, as multi-line code you
    also see it that way, there's no processing at that point
    (or the original retained as copy). I didn't expect that.

    Yep. Note also that bash reformats something like:

    cmd1 &&
    cmd2 &&
    cmd3

    to:

    cmd1 && cmd2 && cmd3

    which is annoying.

    (**) I've hacked the bash source code for less. So, yeah, it is possible.

    --
    The randomly chosen signature file that would have appeared here is more than 4 lines long. As such, it violates one or more Usenet RFCs. In order to remain in compliance with said RFCs, the actual sig can be found at the following URL:
    http://user.xmission.com/~gazelle/Sigs/ThePublicGood

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Kenny McCormack on Tue Jul 23 18:48:42 2024
    On 23.07.2024 18:13, Kenny McCormack wrote:
    In article <v7ofkl$18d66$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    On 23.07.2024 13:46, Kenny McCormack wrote:

    One thing that bash does that's annoying is that it puts semicolons at the end of (almost) every line.

    Ouch!

    I have, on occasion, had to recover a function from
    the bash pretty print (*), and one of the things that needs to be done is
    to remove those extraneous semicolons.

    (*) BTW, the command I use is "type". I.e., "type funName" displays the function definition of function funName. That seems to be the same as your use of "typeset".

    I started tests with 'type' but the result was something undesirable
    (forgot already what it was), so I tried the 'typeset -f' which had
    better results (with ksh, zsh, at least).

    Actually I was just playing around, since your post made me curious.
    (I almost never inspect function definitions using one method or the
    other. The interesting functions are non-trivial and already tested,
    so interactively looking them up makes no sense for me. And other
    functions are part of shell programs, either monolithic or used as
    lib.) But as a side-effect of my tries I noticed another bug in the
    ksh93u+m shell that I'm using. :-/ (But I'm digressing.)


    But ksh displays it as it had been typed in; a raw format.
    If you define your function, say, as multi-line code you
    also see it that way, there's no processing at that point
    (or the original retained as copy). I didn't expect that.

    Yep. Note also that bash reformats something like:

    cmd1 &&
    cmd2 &&
    cmd3

    to:

    cmd1 && cmd2 && cmd3

    which is annoying.

    Indeed. It reminds me of the philosophy that I often noticed in MS (and
    nowadays also in Linux software, sadly) contexts; they seem to think
    their auto-changes are better than the intention of the programmer.


    (**) I've hacked the bash source code for less. So, yeah, it is possible.

    Ah, okay. (Would not be my preferred way. :-)

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kenny McCormack@21:1/5 to janis_papanagnou+ng@hotmail.com on Tue Jul 23 17:15:41 2024
    In article <v7omtd$19ng6$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    ...
    Indeed. It reminds me of the philosophy that I often noticed in MS (and
    nowadays also in Linux software, sadly) contexts; they seem to think
    their auto-changes are better than the intention of the programmer.

    The overall plan is to turn programming into a minimum wage job. That's
    why they are starting to call it "coding" and make it sound like something anybody can do.

    So, they have to take as much as possible of the choice/initiative out of it. Make it the modern equivalent of a factory job.

    --
    After Using Gender Slur Against AOC, GOP Rep. Yoyo Won't Apologize 'For Loving God'.

    That's so sweet...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Kenny McCormack on Tue Jul 23 18:34:52 2024
    On 2024-07-23, Kenny McCormack <gazelle@shell.xmission.com> wrote:
    Which all kind of echoes back to the other recent thread in this NG about regular expressions vs. globs. The cold hard fact is that there really is
    no such thing as "regular expressions" (*), since every language, every program, every implementation of them, is quite different.

    (*) As an abstract concept, separate from any specific implementation.

    Yes, there are regular expressions as an abstract concept. They are part
    of the theory of automata. Much of the research went on up through the
    1960's. The * operator is called the "Kleene star". https://en.wikipedia.org/wiki/Kleene_star

    In the old math/CS papers about regular expressions, regular expressions
    are typically represented in terms of some input symbol alphabet
    (usually just letters a, b, c ...) and only the operators | and *,
    and parentheses (other than when advanced operators are being discussed,
    like intersection and complement, which are not easily constructed from
    these.)

    I think character classes might have been a pragmatic invention in
    regex implementations. The theory doesn't require [a-c] because
    that can be encoded as (a|b|c).

    The ? operator is not required because (R)? can be written (R)(R)*.

    Escaping is not required because the operators and input symbols are
    distinct; the idea that ( could be an input symbol is something that
    occurs in implementations, not in the theory.

    Regex implementors take the theory and adjust it to taste,
    and add necessary details such as character escape sequences for
    control characters, and escaping to allow the operator characters
    themselves to be matched. Plus character classes, with negation
    and ranges and all that.

    Not all implementations follow solid theory. For instance, the branch
    operator | is supposed to be commutative. There is no difference
    between R1|R2 and R2|R1. But in many implementations (particularly backtracking ones like PCRE and similar), there is a difference: these implementations implement R1|R2|R3 by trying the expressions in left-to-right order, stopping at the first match.

    This matters when regexes are used for matching a prefix of the input;
    if the regex is interpreted according to the theory, it should match
    the longest possible prefix; it cannot ignore R3, which matches
    thousands of symbols, because R2 matched three symbols.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Janis Papanagnou on Tue Jul 23 18:20:14 2024
    On 2024-07-23, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    On 23.07.2024 00:47, Kaz Kylheku wrote:
    On 2024-07-22, Kenny McCormack <gazelle@shell.xmission.com> wrote:
    The problem is that if the above is in a function, when you list out the function with "type funName", the \n has already been digested and
    converted to a hard newline. This makes the listing look strange. I'd
    rather see "\n".

    I see what you mean:

    $ test() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
    $ set | grep -A 4 '^test'
    test ()
    {
    [[ "$f" =~ foo[^'
    ']*bar ]] && echo "foo bar"
    }

    Is there any way to get this?

    Of course (and out of curiosity) I tried that display detail as well
    in Kornshell to see how it behaves, and using a different command to
    display it...


    With my (old?) bash:

    $ f() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
    $ typeset -f f
    f ()
    {
    [[ "$f" =~ foo[^'
    ']*bar ]] && echo "foo bar"
    }


    The same with ksh:

    $ f() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
    $ typeset -f f
    f() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }


    And for good measure also in zsh:

    % f() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
    % typeset -f f
    f () {
    [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar"
    }

    It bolsters the argument that Bash could use a fix to be this
    way also.

    zsh preserves the original syntax. So it is saving information
    in the stored code about how the datum was represented in
    the source code:

    % f() { [[ "$f" =~ foo[^'
    ']*bar ]] && echo "foo bar" ; }
    sun-go% typeset -f f
    f () {
    [[ "$f" =~ foo[^'
    ']*bar ]] && echo "foo bar"
    }

    I can understand why an implementor wouldn't want to save this.

    If the code that we see in "typeset" is the actual code that
    executes, it means that in the $'...' case, zsh has to process
    the escape sequences, whereas bash has expanded them out upfront.

    If the code that we see in "typeset" is not the actual code
    that executes, then that requires extra storage. The Bash
    project might be reluctant to imitate that strategy.

    Oh look, zsh preserves comments:

    sun-go% f() { # f function
    function> :
    function> }
    sun-go% typeset -f f
    f () {
    # f function
    :
    }

    I doubt that when f is called, it's actually dealing with the
    lexical details any more like comments; it's storing some
    compiled version of the code along with the source, more likely.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ben Bacarisse@21:1/5 to Kaz Kylheku on Wed Jul 24 00:51:44 2024
    Kaz Kylheku <643-408-1753@kylheku.com> writes:

    On 2024-07-23, Kenny McCormack <gazelle@shell.xmission.com> wrote:
    Which all kind of echoes back to the other recent thread in this NG about
    regular expressions vs. globs. The cold hard fact is that there really is no such thing as "regular expressions" (*), since every language, every
    program, every implementation of them, is quite different.

    (*) As an abstract concept, separate from any specific implementation.

    Yes, there are regular expressions as an abstract concept. They are part
    of the theory of automata. Much of the research went on up through the 1960's. The * operator is called the "Kleene star". https://en.wikipedia.org/wiki/Kleene_star

    In the old math/CS papers about regular expressions, regular expressions
    are typically represented in terms of some input symbol alphabet
    (usually just letters a, b, c ...) and only the operators | and *,
    and parentheses (other than when advanced operators are being discussed,
    like intersection and complement, which are not easily constructed from these.)

    I think character classes might have been a pragmatic invention in
    regex implementations. The theory doesn't require [a-c] because
    that can be encoded as (a|b|c).

    The ? operator is not required because (R)? can be written (R)(R)*.

    (Aside: the choice is arbitrary but + would be a more "Unixy" choice for
    that operator.)

    Escaping is not required because the operators and input symbols are distinct; the idea that ( could be an input symbol is something that
    occurs in implementations, not in the theory.

    Regex implementors take the theory and adjust it to taste,
    and add necessary details such as character escape sequences for
    control characters, and escaping to allow the operator characters
    themselves to be matched. Plus character classes, with negation
    and ranges and all that.

    Not all implementations follow solid theory. For instance, the branch operator | is supposed to be commutative. There is no difference
    between R1|R2 and R2|R1. But in many implementations (particularly backtracking ones like PCRE and similar), there is a difference: these implementations implement R1|R2|R3 by trying the expressions in left-to-right order, stopping at the first match.

    This matters when regexes are used for matching a prefix of the input;
    if the regex is interpreted according to the theory, it should match
    the longest possible prefix; it cannot ignore R3, which matches
    thousands of symbols, because R2 matched three symbols.

    This is more a consequence of the different views. In the formal
    theory there is no notion of "matching". Regular expressions define
    languages (i.e. sets of sequences of symbols) according to a recursive
    set of rules. The whole idea of an RE matching a string is from their
    use in practical applications.

    --
    Ben.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Ben Bacarisse on Wed Jul 24 03:25:14 2024
    On 2024-07-23, Ben Bacarisse <ben@bsb.me.uk> wrote:
    Kaz Kylheku <643-408-1753@kylheku.com> writes:
    This matters when regexes are used for matching a prefix of the input;
    if the regex is interpreted according to the theory, it should match
    the longest possible prefix; it cannot ignore R3, which matches
    thousands of symbols, because R2 matched three symbols.

    This is more a consequence of the different views. In the formal
    theory there is no notion of "matching". Regular expressions define languages (i.e. sets of sequences of symbols) according to a recursive
    set of rules. The whole idea of an RE matching a string is from their
    use in practical applications.

    Under the set view, we can ask, what is the longest prefix of
    the input which belongs to the language R1|R2. The answer is the
    same for R2|R1, which denote the same set, since | corresponds
    to set union.

    Broken regular expressions identify the longest prefix, except
    when the | operator is used; then they just identify a prefix,
    not necessarily longest.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Kenny McCormack on Wed Jul 24 11:53:00 2024
    On 23.07.2024 19:15, Kenny McCormack wrote:
    In article <v7omtd$19ng6$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    ...
    Indeed. It reminds me of the philosophy that I often noticed in MS (and
    nowadays also in Linux software, sadly) contexts; they seem to think
    their auto-changes are better than the intention of the programmer.

    The overall plan is to turn programming into a minimum wage job. That's
    why they are starting to call it "coding" and make it sound like something anybody can do.

    And sometimes it doesn't even appear as coding; during a small episode
    in Java I could observe people just clicking together pieces of code
    from drop-down menus in Eclipse.


    So, they have to take as much as possible of the choice/initiative out of it. Make it the modern equivalent of a factory job.

    Interesting comparison. But, yes.

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Kaz Kylheku on Wed Jul 24 11:41:48 2024
    On 23.07.2024 20:34, Kaz Kylheku wrote:

    [...]

    In the old math/CS papers about regular expressions, regular expressions
    are typically represented in terms of some input symbol alphabet
    (usually just letters a, b, c ...) and only the operators | and *,
    and parentheses (other than when advanced operators are being discussed,
    like intersection and complement, which are not easily constructed from these.)

    I think character classes might have been a pragmatic invention in
    regex implementations. The theory doesn't require [a-c] because
    that can be encoded as (a|b|c).

    While formally we can restrict ourselves to some basic elements, it's
    quite inconvenient in practice. I recall that in Compiler Construction
    and Automata Theory we regularly used the 'all-but' operator for
    complementing input symbol sets, and also obvious abbreviations
    for larger sets of symbols (like 'digits', etc.). Not only in
    practical regexp implementations but also in education, certainly
    no one wants to waste time.


    The ? operator is not required because (R)? can be written (R)(R)*.

    ITYM: (R)+


    Escaping is not required because the operators and input symbols are distinct; the idea that ( could be an input symbol is something that
    occurs in implementations, not in the theory.

    Regex implementors take the theory and adjust it to taste,
    and add necessary details such as character escape sequences for
    control characters, and escaping to allow the operator characters
    themselves to be matched. Plus character classes, with negation
    and ranges and all that.

    There are (at least) two different types of such adjustments. One
    is the convenience enhancements (like '+', the multiplicity
    operator '{m,n}', Perl's '\d' and '\D', etc.) that, from a complexity
    perspective, all stay within the same [theoretical] class of the
    Regular Languages. (There's another type of extension that we find
    in implementations that leaves that language class.)


    Not all implementations follow solid theory. For instance, the branch operator | is supposed to be commutative. There is no difference
    between R1|R2 and R2|R1. But in many implementations (particularly backtracking ones like PCRE and similar), there is a difference: these implementations implement R1|R2|R3 by trying the expressions in left-to-right order, stopping at the first match.

    In my book it's not necessary to follow "solid theory"; if, and only
    if, it's documented, correctly/sensibly implemented, and the
    implications are made clear to the programmer.

    There are two common examples that come to mind. I think it's okay if
    there's support for, e.g., back-references if it's clearly stated
    that you should not expect that this code will run in O(N), linear
    complexity. But there have been implementations - don't recall if it
    was in Java, Perl, or both - where they implemented "generalizations"
    of "Regexp-processings" that had the runtime-effect that even for
    *true* Regular Expressions you had no O(N) guarantee any more; and
    this is IMO a fatal decision! If the programmer uses only RE elements
    from the class of Regular Languages, O(N) complexity should always
    be guaranteed.


    This matters when regexes are used for matching a prefix of the input;
    if the regex is interpreted according to the theory, it should match
    the longest possible prefix; it cannot ignore R3, which matches
    thousands of symbols, because R2 matched three symbols.

    I think we should differentiate the matching process (implementation
    specific syntax for formulas of REs, matching implementation methods)
    from the Regular Languages theory; the complete strings of text that
    we match are in general not part of the Regular Language, the Regular Expression only specifies the subset of (matching) strings as part of
    the respective Regular Language. And, WRT complexity, we should also
    be aware that the O(N) in Regular Languages is the complexity of the
    "match", not the length of the data that is skimmed for a match. The
    various algorithms combine these complexities in supposedly efficient
    ways. Some RE parsers failed.

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ben Bacarisse@21:1/5 to Kaz Kylheku on Wed Jul 24 14:17:14 2024
    Kaz Kylheku <643-408-1753@kylheku.com> writes:

    On 2024-07-23, Ben Bacarisse <ben@bsb.me.uk> wrote:
    Kaz Kylheku <643-408-1753@kylheku.com> writes:
    This matters when regexes are used for matching a prefix of the input;
    if the regex is interpreted according to the theory, it should match
    the longest possible prefix; it cannot ignore R3, which matches
    thousands of symbols, because R2 matched three symbols.

    This is more a consequence of the different views. In the formal
    theory there is no notion of "matching". Regular expressions define
    languages (i.e. sets of sequences of symbols) according to a recursive
    set of rules. The whole idea of an RE matching a string is from their
    use in practical applications.

    Under the set view, we can ask, what is the longest prefix of
    the input which belongs to the language R1|R2. The answer is the
    same for R2|R1, which denote the same set, since | corresponds
    to set union.

    What is "the input" in the set view? The set view is simply a recursive definition of the language.

    Broken regular expressions identify the longest prefix, except
    when the | operator is used; then they just identify a prefix,
    not necessarily longest.

    What is a "broken" RE in the set view?

    --
    Ben.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Ben Bacarisse on Wed Jul 24 18:35:51 2024
    On 2024-07-24, Ben Bacarisse <ben@bsb.me.uk> wrote:
    Kaz Kylheku <643-408-1753@kylheku.com> writes:

    On 2024-07-23, Ben Bacarisse <ben@bsb.me.uk> wrote:
    Kaz Kylheku <643-408-1753@kylheku.com> writes:
    This matters when regexes are used for matching a prefix of the input;
    if the regex is interpreted according to the theory, it should match
    the longest possible prefix; it cannot ignore R3, which matches
    thousands of symbols, because R2 matched three symbols.

    This is more a consequence of the different views. In the formal
    theory there is no notion of "matching". Regular expressions define
    languages (i.e. sets of sequences of symbols) according to a recursive
    set of rules. The whole idea of an RE matching a string is from their
    use in practical applications.

    Under the set view, we can ask, what is the longest prefix of
    the input which belongs to the language R1|R2. The answer is the
    same for R2|R1, which denote the same set, since | corresponds
    to set union.

    What is "the input" in the set view? The set view is simply a recursive
    definition of the language.

    It is a separate string under consideration.

    We have a set, and are asking the question "what is the longest prefix
    of the given string which is a member of the set".

    Broken regular expressions identify the longest prefix, except
    when the | operator is used; then they just identify a prefix,
    not necessarily longest.

    What is a "broken" RE in the set view?

    Inconsistency in being able to answer the question "what is the longest
    prefix of the string which is a member of the set".

    Broken regexes contain a pitfall: they deliver the right answer
    for expressions like ab*. If the input is "abbbbbbbc",
    they identify the entire "abbbbbbb" prefix. But if the branch
    operator is used, as in "a|ab*", oops, they short-circuit.
    The "a" matches a prefix of the input, and so that's done; no need
    to match the "ab*" part of the branch.

    The "a" prefix is in the language described by the expression; a
    set element has been identified. But it's not the longest one.

    It is an inconsistency. If the longest match is not required, why
    bother finding one for "ab*"; for that expression, the "a" prefix could
    also just be returned.
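    The difference is easy to demonstrate from the shell: bash's =~ uses
    POSIX ERE (leftmost-longest), while Perl-style engines try branches left
    to right and stop at the first one that succeeds. A sketch (the grep -P
    line assumes a GNU grep built with PCRE support):

    ```shell
    re='a|ab*'

    # POSIX ERE (bash's =~ / regexec): the longest overall match wins.
    [[ abbbbbbbc =~ $re ]] && echo "${BASH_REMATCH[0]}"   # → abbbbbbb

    # Perl-style (PCRE): the first branch that matches wins.
    echo abbbbbbbc | grep -oP "$re"                       # → a
    ```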

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

  • From Ben Bacarisse@21:1/5 to Kaz Kylheku on Wed Jul 24 22:28:32 2024
    Kaz Kylheku <643-408-1753@kylheku.com> writes:

    On 2024-07-24, Ben Bacarisse <ben@bsb.me.uk> wrote:
    Kaz Kylheku <643-408-1753@kylheku.com> writes:

    On 2024-07-23, Ben Bacarisse <ben@bsb.me.uk> wrote:
    Kaz Kylheku <643-408-1753@kylheku.com> writes:
    This matters when regexes are used for matching a prefix of the input;
    if the regex is interpreted according to the theory, it should match
    the longest possible prefix; it cannot ignore R3, which matches
    thousands of symbols, because R2 matched three symbols.

    This is more a consequence of the different views. In the formal
    theory there is no notion of "matching". Regular expressions define
    languages (i.e. sets of sequences of symbols) according to a recursive
    set of rules. The whole idea of an RE matching a string is from their
    use in practical applications.

    Under the set view, we can ask, what is the longest prefix of
    the input which belongs to the language R1|R2. The answer is the
    same for R2|R1, which denote the same set, since | corresponds
    to set union.

    What is "the input" in the set view? The set view is simply a recursive
    definition of the language.

    It is a separate string under consideration.

    We have a set, and are asking the question "what is the longest prefix
    of the given string which is a member of the set".

    It's better, then (as in the latter wording), not to use a term from the "implementation" view of REs.

    Broken regular expressions identify the longest prefix, except
    when the | operator is used; then they just identify a prefix,
    not necessarily longest.

    What is a "broken" RE in the set view?

    Inconsistency in being able to answer the question "what is the longest prefix of the string which is a member of the set".

    Broken regexes contain a pitfall: they deliver the right answer
    for expressions like ab*. If the input is "abbbbbbbc",
    they identify the entire "abbbbbbb" prefix. But if the branch
    operator is used, as in "a|ab*", oops, they short-circuit.
    The "a" matches a prefix of the input, and so that's done; no need
    to match the "ab*" part of the branch.

    I don't see any "pitfall". The answer to your question "what is the
    longest prefix of the given string which is a member of the set" is not
    "a", and nothing in either the formal definition of the language
    "a|ab*" nor in the wording of the question is a pitfall. The longest
    prefix of "abbbbbbbc" that is in the language "a|ab*" is, unambiguously,
    "abbbbbbb".

    The "a" prefix is in the language described by the expression; a
    set element has been identified. But it's not the longest one.

    Yes. But there is no "pitfall" and the RE is not "broken" in any formal
    sense at all.

    An implementation might be broken and there are pitfalls to look out for
    when viewing REs as patterns to match, but that's my whole point. This
    is all about the "other" view, not the view of REs as defining formal languages.

    It is an inconsistency. If the longest match is not required, why
    bother finding one for "ab*"; for that expression, the "a" prefix could
    also just be returned.

    We could, of course, ask about other prefixes of "abbbbbbbc" that are in
    the language "a|ab*". I don't see anything inconsistent here at all.
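    Those per-prefix membership questions can be asked directly; a minimal
    bash sketch (the expression is anchored so each test is a whole-string
    membership check against the language a|ab*):

    ```shell
    s=abbbbbbbc
    re='^(a|ab*)$'

    # Print every prefix of s that is a member of the language a|ab*.
    for ((i = 1; i <= ${#s}; i++)); do
        p=${s:0:i}
        [[ $p =~ $re ]] && echo "$p"
    done
    # → a, ab, abb, ..., abbbbbbb (8 prefixes; "abbbbbbbc" itself is not a member)
    ```

    The set itself does not privilege any of these members; "longest prefix"
    is a question we ask about the set, not a property of the expression.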

    --
    Ben.
