• bash aesthetics question: special characters in reg exp in [[ ... =~ ... ]]

    From Kenny McCormack@21:1/5 to All on Mon Jul 22 21:59:11 2024
    Note: this is just a question of aesthetics. Functionally, it all works as expected.

    Sample bash code:

    f="$(fortune)" # Get some multi-line output into "f"
    # Look for foo followed by bar on the same line
    [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar"

    The point is you need the "anything other than a newline" or else it might match foo on one line and bar on a later line. The above is the only way I could figure out to express a newline in the particular flavor of reg exps
    used by the =~ operator.

    The problem is that if the above is in a function, when you list out the function with "type funName", the \n has already been digested and
    converted to a hard newline. This makes the listing look strange. I'd
    rather see "\n".

    Is there any way to get this?

    --
    Shikata ga nai...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Kenny McCormack on Mon Jul 22 22:47:59 2024
    On 2024-07-22, Kenny McCormack <gazelle@shell.xmission.com> wrote:
    The problem is that if the above is in a function, when you list out the function with "type funName", the \n has already been digested and
    converted to a hard newline. This makes the listing look strange. I'd rather see "\n".

    I see what you mean:

    $ test() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
    $ set | grep -A 4 '^test'
    test ()
    {
    [[ "$f" =~ foo[^'
    ']*bar ]] && echo "foo bar"
    }

    Is there any way to get this?

    Patch Bash so that when it's listing code, any items that need '...'
    quoting and that contain control characters are printed as $'...'
    syntax with escape sequences.

    Someone who had their original code as '
    ' might not want that. It has to be an option.

    If Bash stored a bit in the code indicating "this word was produced
    using $ syntax", then it could be recovered accordingly.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Arti F. Idiot@21:1/5 to Kenny McCormack on Mon Jul 22 22:00:00 2024
    On 7/22/24 3:59 PM, Kenny McCormack wrote:
    Note: this is just a question of aesthetics. Functionally, it all works as expected.

    Sample bash code:

    f="$(fortune)" # Get some multi-line output into "f"
    # Look for foo followed by bar on the same line
    [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar"

    The point is you need the "anything other than a newline" or else it might match foo on one line and bar on a later line. The above is the only way I could figure out to express a newline in the particular flavor of reg exps used by the =~ operator.

    The problem is that if the above is in a function, when you list out the function with "type funName", the \n has already been digested and
    converted to a hard newline. This makes the listing look strange. I'd rather see "\n".

    Is there any way to get this?


    Not sure this really addresses your 'type funcName' query but maybe
    somewhat better output from 'type funcName' ? :

    ...
    regex=$(printf 'foo[^\n]*bar')
    [[ "$f" =~ $regex ]] && echo "foo bar"

    Kind of wish the regex string could be bracketed by "/" as in awk.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kenny McCormack@21:1/5 to Arti F. Idiot on Tue Jul 23 05:33:31 2024
    In article <v7n9s1$2p39$1@nnrp.usenet.blueworldhosting.com>,
    Arti F. Idiot <addr@is.invalid> wrote:
    ...
    Not sure this really addresses your 'type funcName' query but maybe
    somewhat better output from 'type funcName' ? :

    ...
    regex=$(printf 'foo[^\n]*bar')
    [[ "$f" =~ $regex ]] && echo "foo bar"

    Yes. I think there are actually some other situations like this (i.e.,
    issues involving using =~) - where putting the reg exp into a variable and
    then using the variable works better. It seems to be a common solution.
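    A concrete sketch of that idiom (my own illustration; the sample string and variable names are invented, not from the thread):

    ```shell
    # Invented sample data; the point is the variable-based =~ idiom.
    f=$'one foo and bar here\nanother line'
    # $'...' turns \n into a hard newline once, at assignment time, so the
    # variable carries the character while the code stays readable.
    re=$'foo[^\n]*bar'
    if [[ $f =~ $re ]]; then
        echo "matched: ${BASH_REMATCH[0]}"
    fi
    ```

    Note that the variable must stay unquoted on the right of =~; writing [[ $f =~ "$re" ]] makes bash treat the pattern as a literal string.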

    Kind of wish the regex string could be bracketed by "/" as in awk.

    Yes. I've often wished the same. It is annoying that you have to escape
    any spaces in the regexp (since it isn't delimited, like reg exps are in
    most other languages [not just AWK]).

    Incidentally, here's another situation - involving using $'' but not
    involving =~. In another script, I have:

    ctrla=$'\001'

    Then I use $ctrla thereafter. But when the function is listed, the above
    line comes out as:

    ctrla=$''

    Unless I pipe it into "less", in which case it displays as:

    ctrla=$'^A'

    (with the ^A in reverse video). The point being that, as before with \n,
    there is a hard 001 character in there, not a graphic representation of it
    (as there should, IMHO, be).
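    An aside the thread doesn't mention: bash's printf %q re-quotes a string in shell-reusable form, so the hard control character can at least be displayed as an escape on demand:

    ```shell
    # printf %q emits a shell-reusable quoting of its argument; control
    # characters come back as $'...' escapes instead of raw bytes.
    ctrla=$'\001'
    printf '%q\n' "$ctrla"    # prints $'\001'
    ```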

    Agreeing with what Kaz wrote, I'm not objecting to there being hard
    characters in the internal representation of the code, but rather, I am
    saying that when it is displayed (e.g., by the "type" command), it should be rendered in a visible way.

    And, yet, changing gears once again, I don't quite understand why you can't write [\n] with =~. You have to write [$'\n']. It's not like that in most other languages (e.g., AWK).

    Which all kind of echoes back to the other recent thread in this NG about regular expressions vs. globs. The cold hard fact is that there really is
    no such thing as "regular expressions" (*), since every language, every program, every implementation of them, is quite different.

    (*) As an abstract concept, separate from any specific implementation.

    --
    Trump - the President for the rest of us.

    https://www.youtube.com/watch?v=JSkUJKgdcoE

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Kaz Kylheku on Tue Jul 23 11:48:11 2024
    On 23.07.2024 00:47, Kaz Kylheku wrote:
    On 2024-07-22, Kenny McCormack <gazelle@shell.xmission.com> wrote:
    The problem is that if the above is in a function, when you list out the
    function with "type funName", the \n has already been digested and
    converted to a hard newline. This makes the listing look strange. I'd
    rather see "\n".

    I see what you mean:

    $ test() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
    $ set | grep -A 4 '^test'
    test ()
    {
    [[ "$f" =~ foo[^'
    ']*bar ]] && echo "foo bar"
    }

    Is there any way to get this?

    Of course (and out of curiosity) I tried that display detail as well
    in Kornshell to see how it behaves, and using a different command to
    display it...


    With my (old?) bash:

    $ f() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
    $ typeset -f f
    f ()
    {
    [[ "$f" =~ foo[^'
    ']*bar ]] && echo "foo bar"
    }


    The same with ksh:

    $ f() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
    $ typeset -f f
    f() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }


    And for good measure also in zsh:

    % f() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
    % typeset -f f
    f () {
    [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar"
    }


    Both seem to show "better aesthetics". Too bad it doesn't help for
    your bash context.

    Janis

    [...]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kenny McCormack@21:1/5 to janis_papanagnou+ng@hotmail.com on Tue Jul 23 11:46:19 2024
    In article <v7nu8t$15bon$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    ...
    Both (ksh & zsh) seem to show "better aesthetics".

    Indeed, it does. That is how it should work.

    Too bad it doesn't help for your bash context.

    Alas, it doesn't.

    --
    In American politics, there are two things you just don't f*ck with:

    1) Goldman Sachs
    2) The military/industrial complex

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Kenny McCormack on Tue Jul 23 16:44:37 2024
    On 23.07.2024 13:46, Kenny McCormack wrote:
    In article <v7nu8t$15bon$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    ...
    Both (ksh & zsh) seem to show "better aesthetics".

    Indeed, it does. That is how it should work.

    BTW, it's interesting that bash and zsh both reformat (sort
    of pretty-print) the code (when using 'typeset -f'), only
    that zsh keeps that literal '\n'. This may show a way (by
    zsh example) how to follow Kaz' suggestion of patching the
    bash. (But, frankly, I'm not sure it was meant seriously.)

    But ksh displays it as it had been typed in; a raw format.
    If you define your function, say, as multi-line code you
    also see it that way, there's no processing at that point
    (or the original retained as copy). I didn't expect that.

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kenny McCormack@21:1/5 to janis_papanagnou+ng@hotmail.com on Tue Jul 23 16:13:40 2024
    In article <v7ofkl$18d66$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    On 23.07.2024 13:46, Kenny McCormack wrote:
    In article <v7nu8t$15bon$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    ...
    Both (ksh & zsh) seem to show "better aesthetics".

    Indeed, it does. That is how it should work.

    BTW, it's interesting that bash and zsh both reformat (sort
    of pretty-print) the code (when using 'typeset -f'), only
    that zsh keeps that literal '\n'. This may show a way (by
    zsh example) how to follow Kaz' suggestion of patching the
    bash. (But, frankly, I'm not sure it was meant seriously. (see ** below))

    Yes. ksh seems to dump it out literally as is (as it was typed), but bash (and, I guess also zsh - I have zero knowledge or experience of zsh) pretty prints it. But it seems zsh does a prettier print than bash.

    One thing that bash does that's annoying is that it puts semicolons at the end of (almost) every line. I have, on occasion, had to recover a function from
    the bash pretty print (*), and one of the things that needs to be done is
    to remove those extraneous semicolons.

    (*) BTW, the command I use is "type". I.e., "type funName" displays the function definition of function funName. That seems to be the same as your
    use of "typeset".
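    That clean-up can be sketched roughly as follows (my own example; "recover" is a made-up function, and the sed is only a first approximation - it would also strip a semicolon that legitimately ends a line):

    ```shell
    # Made-up function to demonstrate bash's reserialization.
    recover() {
        echo one
        echo two
    }
    # "declare -f" (like "type") prints bash's pretty-printed form, with
    # semicolons appended to most command lines; the sed strips them again.
    declare -f recover | sed 's/;$//'
    ```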

    But ksh displays it as it had been typed in; a raw format.
    If you define your function, say, as multi-line code you
    also see it that way, there's no processing at that point
    (or the original retained as copy). I didn't expect that.

    Yep. Note also that bash reformats something like:

    cmd1 &&
    cmd2 &&
    cmd3

    to:

    cmd1 && cmd2 && cmd3

    which is annoying.

    (**) I've hacked the bash source code for less. So, yeah, it is possible.

    --
    The randomly chosen signature file that would have appeared here is more than 4 lines long. As such, it violates one or more Usenet RFCs. In order to remain in compliance with said RFCs, the actual sig can be found at the following URL:
    http://user.xmission.com/~gazelle/Sigs/ThePublicGood

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Kenny McCormack on Tue Jul 23 18:48:42 2024
    On 23.07.2024 18:13, Kenny McCormack wrote:
    In article <v7ofkl$18d66$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    On 23.07.2024 13:46, Kenny McCormack wrote:

    One thing that bash does that's annoying is that it puts semicolons at the end of (almost) every line.

    Ouch!

    I have, on occasion, had to recover a function from
    the bash pretty print (*), and one of the things that needs to be done is
    to remove those extraneous semicolons.

    (*) BTW, the command I use is "type". I.e., "type funName" displays the function definition of function funName. That seems to be the same as your use of "typeset".

    I started tests with 'type' but the result was something undesirable
    (forgot already what it was), so I tried the 'typeset -f' which had
    better results (with ksh, zsh, at least).

    Actually I was just playing around, since your post made me curious.
    (I almost never inspect function definitions using one method or the
    other. The interesting functions are non-trivial and already tested,
    so interactively looking them up makes no sense for me. And other
    functions are part of shell programs, either monolithic or used as
    lib.) But as a side-effect of my tries I noticed another bug in the
    ksh93u+m shell that I'm using. :-/ (But I'm digressing.)


    But ksh displays it as it had been typed in; a raw format.
    If you define your function, say, as multi-line code you
    also see it that way, there's no processing at that point
    (or the original retained as copy). I didn't expect that.

    Yep. Note also that bash reformats something like:

    cmd1 &&
    cmd2 &&
    cmd3

    to:

    cmd1 && cmd2 && cmd3

    which is annoying.

    Indeed. It reminds me of the philosophy that I often noticed in MS (and
    nowadays also in Linux software, sadly) contexts; they seem to think
    their auto-changes are better than the intention of the programmer.


    (**) I've hacked the bash source code for less. So, yeah, it is possible.

    Ah, okay. (Would not be my preferred way. :-)

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kenny McCormack@21:1/5 to janis_papanagnou+ng@hotmail.com on Tue Jul 23 17:15:41 2024
    In article <v7omtd$19ng6$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    ...
    Indeed. It reminds me of the philosophy that I often noticed in MS (and
    nowadays also in Linux software, sadly) contexts; they seem to think
    their auto-changes are better than the intention of the programmer.

    The overall plan is to turn programming into a minimum wage job. That's
    why they are starting to call it "coding" and make it sound like something anybody can do.

    So, they have to take as much as possible of the choice/initiative out of it. Make it the modern equivalent of a factory job.

    --
    After Using Gender Slur Against AOC, GOP Rep. Yoyo Won't Apologize 'For Loving God'.

    That's so sweet...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Kenny McCormack on Tue Jul 23 18:34:52 2024
    On 2024-07-23, Kenny McCormack <gazelle@shell.xmission.com> wrote:
    Which all kind of echoes back to the other recent thread in this NG about regular expressions vs. globs. The cold hard fact is that there really is
    no such thing as "regular expressions" (*), since every language, every program, every implementation of them, is quite different.

    (*) As an abstract concept, separate from any specific implementation.

    Yes, there are regular expressions as an abstract concept. They are part
    of the theory of automata. Much of the research went on up through the
    1960's. The * operator is called the "Kleene star". https://en.wikipedia.org/wiki/Kleene_star

    In the old math/CS papers about regular expressions, regular expressions
    are typically represented in terms of some input symbol alphabet
    (usually just letters a, b, c ...) and only the operators | and *,
    and parentheses (other than when advanced operators are being discussed,
    like intersection and complement, which are not easily constructed from
    these.)

    I think character classes might have been a pragmatic invention in
    regex implementations. The theory doesn't require [a-c] because
    that can be encoded as (a|b|c).

    The ? operator is not required because (R)? can be written (R)(R)*.

    Escaping is not required because the operators and input symbols are
    distinct; the idea that ( could be an input symbol is something that
    occurs in implementations, not in the theory.

    Regex implementors take the theory and adjust it to taste,
    and add necessary details such as character escape sequences for
    control characters, and escaping to allow the operator characters
    themselves to be matched. Plus character classes, with negation
    and ranges and all that.

    Not all implementations follow solid theory. For instance, the branch
    operator | is supposed to be commutative. There is no difference
    between R1|R2 and R2|R1. But in many implementations (particularly backtracking ones like PCRE and similar), there is a difference: these implementations implement R1|R2|R3 by trying the expressions in left-to-right order, stopping at the first match.

    This matters when regexes are used for matching a prefix of the input;
    if the regex is interpreted according to the theory, it should match
    the longest possible prefix; it cannot ignore R3, which matches
    thousands of symbols, because R2 matched three symbols.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Janis Papanagnou on Tue Jul 23 18:20:14 2024
    On 2024-07-23, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    On 23.07.2024 00:47, Kaz Kylheku wrote:
    On 2024-07-22, Kenny McCormack <gazelle@shell.xmission.com> wrote:
    The problem is that if the above is in a function, when you list out the function with "type funName", the \n has already been digested and
    converted to a hard newline. This makes the listing look strange. I'd
    rather see "\n".

    I see what you mean:

    $ test() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
    $ set | grep -A 4 '^test'
    test ()
    {
    [[ "$f" =~ foo[^'
    ']*bar ]] && echo "foo bar"
    }

    Is there any way to get this?

    Of course (and out of curiosity) I tried that display detail as well
    in Kornshell to see how it behaves, and using a different command to
    display it...


    With my (old?) bash:

    $ f() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
    $ typeset -f f
    f ()
    {
    [[ "$f" =~ foo[^'
    ']*bar ]] && echo "foo bar"
    }


    The same with ksh:

    $ f() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
    $ typeset -f f
    f() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }


    And for good measure also in zsh:

    % f() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
    % typeset -f f
    f () {
    [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar"
    }

    It bolsters the argument that Bash could use a fix to be this
    way also.

    zsh preserves the original syntax. So it is saving information
    in the stored code about how the datum was represented in
    the source code:

    % f() { [[ "$f" =~ foo[^'
    ']*bar ]] && echo "foo bar" ; }
    sun-go% typeset -f f
    f () {
    [[ "$f" =~ foo[^'
    ']*bar ]] && echo "foo bar"
    }

    I can understand why an implementor wouldn't want to save this.

    If the code that we see in "typeset" is the actual code that
    executes, it means that in the $'...' case, zsh has to process
    the escape sequences, whereas bash has expanded them out upfront.

    If the code that we see in "typeset" is not the actual code
    that executes, then that requires extra storage. The Bash
    project might be reluctant to imitate that strategy.

    Oh look, zsh preserves comments:

    sun-go% f() { # f function
    function> :
    function> }
    sun-go% typeset -f f
    f () {
    # f function
    :
    }

    I doubt that when f is called, it's actually dealing with the
    lexical details any more like comments; it's storing some
    compiled version of the code along with the source, more likely.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ben Bacarisse@21:1/5 to Kaz Kylheku on Wed Jul 24 00:51:44 2024
    Kaz Kylheku <643-408-1753@kylheku.com> writes:

    On 2024-07-23, Kenny McCormack <gazelle@shell.xmission.com> wrote:
    Which all kind of echoes back to the other recent thread in this NG about
    regular expressions vs. globs. The cold hard fact is that there really is no such thing as "regular expressions" (*), since every language, every
    program, every implementation of them, is quite different.

    (*) As an abstract concept, separate from any specific implementation.

    Yes, there are regular expressions as an abstract concept. They are part
    of the theory of automata. Much of the research went on up through the 1960's. The * operator is called the "Kleene star". https://en.wikipedia.org/wiki/Kleene_star

    In the old math/CS papers about regular expressions, regular expressions
    are typically represented in terms of some input symbol alphabet
    (usually just letters a, b, c ...) and only the operators | and *,
    and parentheses (other than when advanced operators are being discussed,
    like intersection and complement, which are not easily constructed from these.)

    I think character classes might have been a pragmatic invention in
    regex implementations. The theory doesn't require [a-c] because
    that can be encoded as (a|b|c).

    The ? operator is not required because (R)? can be written (R)(R)*.

    (Aside: the choice is arbitrary but + would be a more "Unixy" choice for
    that operator.)

    Escaping is not required because the operators and input symbols are distinct; the idea that ( could be an input symbol is something that
    occurs in implementations, not in the theory.

    Regex implementors take the theory and adjust it to taste,
    and add necessary details such as character escape sequences for
    control characters, and escaping to allow the operator characters
    themselves to be matched. Plus character classes, with negation
    and ranges and all that.

    Not all implementations follow solid theory. For instance, the branch operator | is supposed to be commutative. There is no difference
    between R1|R2 and R2|R1. But in many implementations (particularly backtracking ones like PCRE and similar), there is a difference: these implementations implement R1|R2|R3 by trying the expressions in left-to-right order, stopping at the first match.

    This matters when regexes are used for matching a prefix of the input;
    if the regex is interpreted according to the theory, it should match
    the longest possible prefix; it cannot ignore R3, which matches
    thousands of symbols, because R2 matched three symbols.

    This is more a consequence of the different views. In the formal
    theory there is no notion of "matching". Regular expressions define
    languages (i.e. sets of sequences of symbols) according to a recursive
    set of rules. The whole idea of an RE matching a string is from their
    use in practical applications.

    --
    Ben.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Ben Bacarisse on Wed Jul 24 03:25:14 2024
    On 2024-07-23, Ben Bacarisse <ben@bsb.me.uk> wrote:
    Kaz Kylheku <643-408-1753@kylheku.com> writes:
    This matters when regexes are used for matching a prefix of the input;
    if the regex is interpreted according to the theory, it should match
    the longest possible prefix; it cannot ignore R3, which matches
    thousands of symbols, because R2 matched three symbols.

    This is more a consequence of the different views. In the formal
    theory there is no notion of "matching". Regular expressions define languages (i.e. sets of sequences of symbols) according to a recursive
    set of rules. The whole idea of an RE matching a string is from their
    use in practical applications.

    Under the set view, we can ask, what is the longest prefix of
    the input which belongs to the language R1|R2. The answer is the
    same for R2|R1, which denote the same set, since | corresponds
    to set union.

    Broken regular expressions identify the longest prefix, except
    when the | operator is used; then they just identify a prefix,
    not necessarily longest.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Kenny McCormack on Wed Jul 24 11:53:00 2024
    On 23.07.2024 19:15, Kenny McCormack wrote:
    In article <v7omtd$19ng6$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    ...
    Indeed. It reminds me of the philosophy that I often noticed in MS (and
    nowadays also in Linux software, sadly) contexts; they seem to think
    their auto-changes are better than the intention of the programmer.

    The overall plan is to turn programming into a minimum wage job. That's
    why they are starting to call it "coding" and make it sound like something anybody can do.

    And sometimes it doesn't even appear as coding; during a small episode
    in Java I could observe people just clicking together pieces of code
    from drop-down menus in Eclipse.


    So, they have to take as much as possible of the choice/initiative out of it. Make it the modern equivalent of a factory job.

    Interesting comparison. But, yes.

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Kaz Kylheku on Wed Jul 24 11:41:48 2024
    On 23.07.2024 20:34, Kaz Kylheku wrote:

    [...]

    In the old math/CS papers about regular expressions, regular expressions
    are typically represented in terms of some input symbol alphabet
    (usually just letters a, b, c ...) and only the operators | and *,
    and parentheses (other than when advanced operators are being discussed,
    like intersection and complement, which are not easily constructed from these.)

    I think character classes might have been a pragmatic invention in
    regex implementations. The theory doesn't require [a-c] because
    that can be encoded as (a|b|c).

    While formally we can restrict ourselves to some basic elements, it's
    quite inconvenient in practice. I recall that in Compiler Construction
    and Automata Theory we regularly used the 'all-but' operator for
    complementing input symbol sets, and also obvious abbreviations
    for larger sets of symbols (like 'digits', etc.). Not only in
    practical regexp implementations but also in education, certainly
    no one wants to waste time.


    The ? operator is not required because (R)? can be written (R)(R)*.

    ITYM: (R)+


    Escaping is not required because the operators and input symbols are distinct; the idea that ( could be an input symbol is something that
    occurs in implementations, not in the theory.

    Regex implementors take the theory and adjust it to taste,
    and add necessary details such as character escape sequences for
    control characters, and escaping to allow the operator characters
    themselves to be matched. Plus character classes, with negation
    and ranges and all that.

    There are (at least) two different types of such adjustments. One
    is the convenience enhancements (like '+', the multiplicity
    operator '{m,n}', Perl's '\d' and '\D', etc.) that, from a complexity
    perspective, all stay within the same [theoretical] class of the
    Regular Languages. (There's another type of extension that we find
    in implementations that leaves that language class.)


    Not all implementations follow solid theory. For instance, the branch operator | is supposed to be commutative. There is no difference
    between R1|R2 and R2|R1. But in many implementations (particularly backtracking ones like PCRE and similar), there is a difference: these implementations implement R1|R2|R3 by trying the expressions in left-to-right order, stopping at the first match.

    In my book it's not necessary to follow "solid theory"; if, and only
    if, it's documented, correctly/sensibly implemented, and the
    implications are made clear to the programmer.

    There are two common examples that come to mind. I think it's okay if
    there's support for, e.g., back-references if it's clearly stated
    that you should not expect that this code will run in O(N), linear
    complexity. But there have been implementations - don't recall if it
    was in Java, Perl, or both - where they implemented "generalizations"
    of "Regexp-processings" that had the runtime-effect that even for
    *true* Regular Expressions you had no O(N) guarantee any more; and
    this is IMO a fatal decision! If the programmer uses only RE elements
    from the class of Regular Languages, O(N) complexity should always
    be guaranteed.


    This matters when regexes are used for matching a prefix of the input;
    if the regex is interpreted according to the theory, it should match
    the longest possible prefix; it cannot ignore R3, which matches
    thousands of symbols, because R2 matched three symbols.

    I think we should differentiate the matching process (implementation
    specific syntax for formulas of REs, matching implementation methods)
    from the Regular Languages theory; the complete strings of text that
    we match are in general not part of the Regular Language, the Regular Expression only specifies the subset of (matching) strings as part of
    the respective Regular Language. And, WRT complexity, we should also
    be aware that the O(N) in Regular Languages is the complexity of the
    "match", not the length of the data that is skimmed for a match. The
    various algorithms combine these complexities in supposedly efficient
    ways. Some RE parsers failed.

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ben Bacarisse@21:1/5 to Kaz Kylheku on Wed Jul 24 14:17:14 2024
    Kaz Kylheku <643-408-1753@kylheku.com> writes:

    On 2024-07-23, Ben Bacarisse <ben@bsb.me.uk> wrote:
    Kaz Kylheku <643-408-1753@kylheku.com> writes:
    This matters when regexes are used for matching a prefix of the input;
    if the regex is interpreted according to the theory, it should match
    the longest possible prefix; it cannot ignore R3, which matches
    thousands of symbols, because R2 matched three symbols.

    This is more a consequence of the different views. In the formal
    theory there is no notion of "matching". Regular expressions define
    languages (i.e. sets of sequences of symbols) according to a recursive
    set of rules. The whole idea of an RE matching a string is from their
    use in practical applications.

    Under the set view, we can ask, what is the longest prefix of
    the input which belongs to the language R1|R2. The answer is the
    same for R2|R1, which denote the same set, since | corresponds
    to set union.

    What is "the input" in the set view? The set view is simply a recursive definition of the language.

    Broken regular expressions identify the longest prefix, except
    when the | operator is used; then they just identify a prefix,
    not necessarily longest.

    What is a "broken" RE in the set view?

    --
    Ben.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Ben Bacarisse on Wed Jul 24 18:35:51 2024
    On 2024-07-24, Ben Bacarisse <ben@bsb.me.uk> wrote:
    Kaz Kylheku <643-408-1753@kylheku.com> writes:

    On 2024-07-23, Ben Bacarisse <ben@bsb.me.uk> wrote:
    Kaz Kylheku <643-408-1753@kylheku.com> writes:
    This matters when regexes are used for matching a prefix of the input;
    if the regex is interpreted according to the theory, it should match
    the longest possible prefix; it cannot ignore R3, which matches
    thousands of symbols, because R2 matched three symbols.

    This is more a consequence of the different views. In the formal
    theory there is no notion of "matching". Regular expressions define
    languages (i.e. sets of sequences of symbols) according to a recursive
    set of rules. The whole idea of an RE matching a string is from their
    use in practical applications.

    Under the set view, we can ask, what is the longest prefix of
    the input which belongs to the language R1|R2. The answer is the
    same for R2|R1, which denote the same set, since | corresponds
    to set union.

    What is "the input" in the set view? The set view is simply a recursive
    definition of the language.

    It is a separate string under consideration.

    We have a set, and are asking the question "what is the longest prefix
    of the given string which is a member of the set".

    Broken regular expressions identify the longest prefix, except
    when the | operator is used; then they just identify a prefix,
    not necessarily longest.

    What is a "broken" RE in the set view?

    Inconsistency in being able to answer the question "what is the longest
    prefix of the string which is a member of the set".

    Broken regexes contain a pitfall: they deliver the right answer
    for expressions like ab*. If the input is "abbbbbbbc",
    they identify the entire "abbbbbbb" prefix. But if the branch
    operator is used, as in "a|ab*", oops, they short-circuit.
    The "a" matches a prefix of the input, and so that's done; no need
    to match the "ab*" part of the branch.

    The "a" prefix is in the language described by the expression; a
    set element has been identified. But it's not the longest one.

    It is an inconsistency. If the longest match is not required, why
    bother finding one for "ab*"; for that expression, the "a" prefix could
    also just be returned.
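    The difference is easy to demonstrate from the shell: bash's =~ uses
    POSIX ERE (leftmost-longest), while Perl-style engines try branches left
    to right and stop at the first one that succeeds. A sketch (the grep -P
    line assumes a GNU grep built with PCRE support):

    ```shell
    re='a|ab*'

    # POSIX ERE (bash's =~ / regexec): the longest overall match wins.
    [[ abbbbbbbc =~ $re ]] && echo "${BASH_REMATCH[0]}"   # → abbbbbbb

    # Perl-style (PCRE): the first branch that matches wins.
    echo abbbbbbbc | grep -oP "$re"                       # → a
    ```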

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

  • From Ben Bacarisse@21:1/5 to Kaz Kylheku on Wed Jul 24 22:28:32 2024
    Kaz Kylheku <643-408-1753@kylheku.com> writes:

    On 2024-07-24, Ben Bacarisse <ben@bsb.me.uk> wrote:
    Kaz Kylheku <643-408-1753@kylheku.com> writes:

    On 2024-07-23, Ben Bacarisse <ben@bsb.me.uk> wrote:
    Kaz Kylheku <643-408-1753@kylheku.com> writes:
    This matters when regexes are used for matching a prefix of the input;
    if the regex is interpreted according to the theory, it should match
    the longest possible prefix; it cannot ignore R3, which matches
    thousands of symbols, because R2 matched three symbols.

    This is more a consequence of the different views. In the formal
    theory there is no notion of "matching". Regular expressions define
    languages (i.e. sets of sequences of symbols) according to a recursive
    set of rules. The whole idea of an RE matching a string is from their
    use in practical applications.

    Under the set view, we can ask, what is the longest prefix of
    the input which belongs to the language R1|R2. The answer is the
    same for R2|R1, which denote the same set, since | corresponds
    to set union.

    What is "the input" in the set view? The set view is simply a recursive
    definition of the language.

    It is a separate string under consideration.

    We have a set, and are asking the question "what is the longest prefix
    of the given string which is a member of the set".

    It's better, then (as in the latter wording), not to use a term from the "implementation" view of REs.

    Broken regular expressions identify the longest prefix, except
    when the | operator is used; then they just identify a prefix,
    not necessarily longest.

    What is a "broken" RE in the set view?

    Inconsistency in being able to answer the question "what is the longest prefix of the string which is a member of the set".

    Broken regexes contain a pitfall: they deliver the right answer
    for expressions like ab*. If the input is "abbbbbbbc",
    they identify the entire "abbbbbbb" prefix. But if the branch
    operator is used, as in "a|ab*", oops, they short-circuit.
    The "a" matches a prefix of the input, and so that's done; no need
    to match the "ab*" part of the branch.

    I don't see any "pitfall". The answer to your question "what is the
    longest prefix of the given string which is a member of the set" is not
    "a", and nothing in either the formal definition of the language
    "a|ab*" nor in the wording of the question is a pitfall. The longest
    prefix of "abbbbbbbc" that is in the language "a|ab*" is, unambiguously,
    "abbbbbbb".

    The "a" prefix is in the language described by the expression; a
    set element has been identified. But it's not the longest one.

    Yes. But there is no "pitfall" and the RE is not "broken" in any formal
    sense at all.

    An implementation might be broken and there are pitfalls to look out for
    when viewing REs as patterns to match, but that's my whole point. This
    is all about the "other" view, not the view of REs as defining formal languages.

    It is an inconsistency. If the longest match is not required, why
    bother finding one for "ab*"; for that expression, the "a" prefix could
    also just be returned.

    We could, of course, ask about other prefixes of "abbbbbbbc" that are in
    the language "a|ab*". I don't see anything inconsistent here at all.
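    Those per-prefix membership questions can be asked directly; a minimal
    bash sketch (the expression is anchored so each test is a whole-string
    membership check against the language a|ab*):

    ```shell
    s=abbbbbbbc
    re='^(a|ab*)$'

    # Print every prefix of s that is a member of the language a|ab*.
    for ((i = 1; i <= ${#s}; i++)); do
        p=${s:0:i}
        [[ $p =~ $re ]] && echo "$p"
    done
    # → a, ab, abb, ..., abbbbbbb (8 prefixes; "abbbbbbbc" itself is not a member)
    ```

    The set itself does not privilege any of these members; "longest prefix"
    is a question we ask about the set, not a property of the expression.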

    --
    Ben.
