Wrong time of night for doing regex?

Hartmut W Sager

4 Jan 2020

11 a.m.

This might be the wrong time of night for doing regex (i.e., my mistake), or my trusty Vedit text editor has a bug in its regex implementation.

Original search string:
^(From AncientBBS[1-2])\s+(Sun|Mon|Tue|Wed|Thu|Fri|Sat)[\s,]+(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+([0-9][0-9]|\s[0-9])[\s,]+(19[0-9][0-9])[\s,]+([0-9][0-9]:[0-9][0-9]:[0-9][0-9])\s*$

Replacement string:
<Nah, skip it>

The above search string gives a syntax error. I am a bit suspicious of the ([0-9][0-9]|\s[0-9]) group re operator precedence of the "or", and proceeded to stepwise simplification to narrow it down. I finally got down to:

Search string:
(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+([0-9])[\s,]+

Replacement string:
\1\s0\2\s

The new search works fine (as did some of the previous stepwise simplified ones), but the replacements are baffling me. The line

From AncientBBS1 Thu Jan 2, 1986 20:50:00

gets changed to

From AncientBBS1 Thu 02 1986 20:50:00

I.e., the variable \1 seems to get lost. In my previous stepwise simplified cases, multiple variables got lost when the search worked at all.

Why am I doing this? I need to massage some old BBS messages into the clunky mbox format, whose date format (on the "From " line) of "Tue Nov 05 19:02:00 1985" is particularly illogical. Be that as it may, the two sources of these messages I am processing had further sloppiness in their dates, done by some ancient BBS bozos. I did successfully fix a lot of that already with regex.

Hartmut W Sager - Tel +1-204-339-8331
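
For readers following along in a PCRE-style engine, the reordering described above might look like the minimal Python sketch below. It is not Vedit syntax and not the poster's actual search/replace pair. The function name `to_mbox_from_line` is hypothetical, and the day group is simplified to `[0-9]{1,2}` (an assumption) so that one- and two-digit days both match after variable whitespace.

```python
import re

# The original search string, split over several lines for readability.
# Assumption: the day group ([0-9][0-9]|\s[0-9]) is simplified to [0-9]{1,2}.
FROM_LINE = re.compile(
    r"^(From AncientBBS[1-2])\s+"
    r"(Sun|Mon|Tue|Wed|Thu|Fri|Sat)[\s,]+"
    r"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+"
    r"([0-9]{1,2})[\s,]+"
    r"(19[0-9][0-9])[\s,]+"
    r"([0-9][0-9]:[0-9][0-9]:[0-9][0-9])\s*$"
)


def to_mbox_from_line(line: str) -> str:
    """Reorder a sloppy "From " line into the "Tue Nov 05 19:02:00 1985" layout."""
    m = FROM_LINE.match(line)
    if m is None:
        return line  # leave anything unrecognised untouched
    sender, weekday, month, day, year, clock = m.groups()
    # Zero-pad the day and move the year behind the time of day.
    return f"{sender} {weekday} {month} {int(day):02d} {clock} {year}"


print(to_mbox_from_line("From AncientBBS1   Thu Jan  2, 1986 20:50:00"))
```

Lines that do not match are returned unchanged, so the sketch can be run over a whole file without touching message bodies.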

Mark Campbell

4 Jan

4:56 p.m.

I don't think you can use \s in the replacement regex as it has no special meaning there. In my local testing with perl, it seems to treat it as a literal escape for the letter s. What tool are you using to run the regex?

Substitute in a space, seems to work as expected:

2020-01-04 10:45:30 ~ TOR-M001 %: ccat test | perl -pe 's/(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+([0-9])[\s,]+/\1 0\2 /'
From AncientBBS1 Thu Jan 07 1986 20:50:00
2020-01-04 10:45:35 ~ TOR-M001 %: ccat test
From AncientBBS1 Thu Jan 7, 1986 20:50:00

If each line has a fixed length up to the day, it might be easier (and more readable) to match on position, with something like s/^(.{23}) (\d),/\1 0\2/, if I'm understanding what you want to do (prepend 0s to dates and remove the comma).
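
A rough Python equivalent of the Perl one-liner above, as a sketch only. It assumes Python 3.7 or later, where an unknown letter escape such as `\s` in the replacement template raises `re.error` instead of being passed through. The variable names are made up for illustration, and the sample line is the one from the test file shown above.

```python
import re

MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
pattern = re.compile(rf"({MONTHS})\s+([0-9])[\s,]+")

line = "From AncientBBS1 Thu Jan 7, 1986 20:50:00"

# Literal spaces in the replacement, as in the Perl test.
fixed = pattern.sub(r"\1 0\2 ", line)
print(fixed)

# \s in the replacement template is not a whitespace shorthand here.
try:
    pattern.sub(r"\1\s0\2\s", line)
    backslash_s_accepted = True
except re.error:
    backslash_s_accepted = False
print("\\s accepted in replacement:", backslash_s_accepted)
```

The point is only that the meaning of `\s` on the replacement side depends on the engine: Perl yields a plain "s", Python rejects it, and Vedit is reported in this thread to treat it as a space.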

Roundtable mailing list Roundtable@muug.ca https://muug.ca/mailman/listinfo/roundtable

Hartmut W Sager

6:24 p.m.

Hi Mark,

Actually, "\s" is a single space in a replacement string too, like in a search string. Almost all the escaped codings are quite fine in the replacement string too, though not nearly as many are needed there as are needed in the search string.

Thanks for your other thoughts too. I did figure out the problem, and in my main reply (to myself), you'll see a detailed explanation.

Hartmut W Sager - Tel +1-204-339-8331

Hartmut W Sager

6:52 p.m.

> What tool are you using to run the regex?

Oops, I forgot to answer that. Vedit (the text editor) runs regex internally. I don't know whether they programmed that part themselves, or are using code from elsewhere.

> if I'm understanding what you want to do (prepend 0s to dates and remove the comma).

In the simplified test case, yes (plus de-blanking the extra blanks), but in the real case, it's much more than that, including a re-ordering of "fields" to match the mbox spec.

Hartmut W Sager - Tel +1-204-339-8331

Hartmut W Sager

7:02 p.m.

> What might be easier (and more readable) is if each line has a fixed length from the beginning, you can match perhaps a little more clearly by doing something like s/^(.{23}) (\d),/\1 0\2/

No, due to great sloppiness in the original data (and two sources), the line lengths vary considerably, with/without commas in a few places, variable multi-space sequences, etc.

Hartmut W Sager - Tel +1-204-339-8331

Dan Martin

4:59 p.m.

Hi Hartmut

I am not familiar with your replacement syntax \1\s0\2\s

Rubular shows the groups as:
1 From AncientBBS1
2 Thu
3 Jan
and 3 others

and for the truncated expression:
1 Jan
2 2

I find rubular a convenient online tool for checking regex https://rubular.com/

-Dan

Hartmut W Sager

6:24 p.m.

Hi Dan,

"\s" is a single space, "0" is just "0", and "\1" and "\2" are variables that reference parts/segments of the search string.

Thanks for the Rubular tip-off. Being a classic hard-core programmer, I'm not used to those kinds of tools, but I might look at Rubular. I did figure out the problem, and in my main reply (to myself), you'll see a detailed explanation.

Hartmut W Sager - Tel +1-204-339-8331

Hartmut W Sager

6:23 p.m.

After tons more experimenting, I figured it out! But I don't know whether it's a bug or a feature in Vedit, or proper regex behaviour (various online regex documentation didn't help at all).

It turns out, at least in this regex implementation, that a pair of enclosing parentheses can only serve one of two purposes, not both, at the same time. Those two purposes are:

1. Mark a group that can then be referred to by a variable like "\3" in the replacement string.
2. Enclose a group with alternation (regex terminology) containing several alternatives separated by the "or" operator "|".

Furthermore, at least in this regex implementation, even the type-2 usage (above) increments the "\nnn" counter for variables that can be used in the replacement string, even though the matching "\nnn" variable cannot actually be used in the replacement string!

The solution I figured out (and tested - it works): Enclose the search segment in double (nested) parentheses "((" and "))", and the outer parentheses are then a type-1 usage which can be referenced in the replacement string. But you have to make sure you use the correct "\nnn" variable by numbering the opening parentheses "(" strictly from left to right (which is normal in regex). This unfortunately exhausts the 9 variables "\1" thru "\9" more rapidly.

E.g.
Search string: abc((def|ghi))jkl\s(mn[0-9])op((qrs|tuv))xy([0-9])z
Replacement string: Can use variables \1, \3, \4, \6, but not \2, \5.
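
The left-to-right numbering of opening parentheses can be checked in any engine that reports its groups. The sketch below uses Python's `re` module purely as a stand-in (an assumption; it says nothing about Vedit internals), with a made-up input string that satisfies the example pattern. In Python all six groups are usable, so the inner and outer pairs simply capture the same text.

```python
import re

pattern = re.compile(r"abc((def|ghi))jkl\s(mn[0-9])op((qrs|tuv))xy([0-9])z")
sample = "abcdefjkl mn7optuvxy3z"  # made-up input that satisfies the pattern

match = pattern.fullmatch(sample)

# Groups are numbered by the position of their opening parenthesis.
for number, text in enumerate(match.groups(), start=1):
    print(f"\\{number} = {text!r}")
```

Groups 1 and 2 both hold "def" and groups 4 and 5 both hold "tuv", which matches the numbering given in the post: \1, \3, \4 and \6 are the outer or plain groups.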

Hartmut W Sager - Tel +1-204-339-8331

Trevor Cordes

5 Jan

10:10 a.m.

On 2020-01-04 Hartmut W Sager wrote:

> It turns out, at least in this regex implementation, that a pair of enclosing parentheses can only serve one of two purposes, not both, at the same time. Those two purposes are:
>
> 1. Mark a group that can then be referred to by a variable like "\3" in the replacement string.
> 2. Enclose a group with alternation (regex terminology) containing several alternatives separated by the "or" operator "|".

That's just plain evil. Nasty!

The de facto standard is (obviously) PCRE and your program (you said vi?) is obviously not PCRE. I'd be shocked if vi doesn't offer you some way to replace the regex engine? Or at least out-source the regex work to a filter? Not sure, I don't use vi.

In PCRE each () serves both purposes, unless you use (?:) in which case you only get purpose #2 (and save CPU cycles).
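
A small sketch of that difference, using Python's `re` module, which shares the `(?:...)` syntax with PCRE. The sample line is the one from the original post; the variable names are illustrative only.

```python
import re

MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
line = "From AncientBBS1 Thu Jan 2, 1986 20:50:00"

# Plain parentheses: alternation AND a numbered capture.
capturing = re.search(rf"({MONTHS})\s+([0-9])[\s,]+", line)

# (?:...) groups the alternation without using up a group number.
grouping_only = re.search(rf"(?:{MONTHS})\s+([0-9])[\s,]+", line)

print(capturing.groups())      # month and day are both captured
print(grouping_only.groups())  # only the day is captured
```

With the non-capturing form the day digit becomes group 1, which is handy when the number of usable backreferences is limited.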

The others are correct: using \s on the right-hand side is not PCRE. In PCRE, \s means "(most) any whitespace" in the regex, and will be just "s" in the substitution.

PCRE = One Ring^H^H^H^HRegex to rule them all. Most programs with regex use the PCRE library now, or give the option, and if you always use -P with grep you'll basically never have to touch another substandard regex engine again! :-) All the perl-haters might find it amusing that they use "perl" on a daily basis because of PCRE :-) (Well, sort of.)

> I am a bit suspicious of the ([0-9][0-9]|\s[0-9]) group re operator precedence of the "or"

In most (all?) regex engines (especially PCRE; but pretty sure all!) the rule is "first, most". So the order you put your alternates may matter. In the above case, order probably doesn't matter because things surrounding that bit must be space/comma. Order matters in things where surrounding bits can match the same bits, and things like eating escaped chars, like backslash-escaped double-quotes in CSVs: /"(\\"|[^"])+"/ works, but /"([^"]|\\")+"/ doesn't.
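
The effect of alternative order on escaped quotes can be reproduced with a short sketch. It uses Python's `re` module, a backtracking engine like PCRE, and writes the escaped-quote alternative as `\\"` so that the regex matches a literal backslash followed by a quote. The sample field is made up for illustration.

```python
import re

field = r'"a\"b",next'  # the characters: "a\"b",next

escaped_first = re.compile(r'"(\\"|[^"])+"')  # try the escaped quote first
escaped_last = re.compile(r'"([^"]|\\")+"')   # try "any non-quote" first

good = escaped_first.search(field).group(0)
bad = escaped_last.search(field).group(0)

print(good)  # the whole quoted field
print(bad)   # stops at the escaped quote
```

In the second pattern the backslash is consumed by `[^"]`, so the escaped-quote alternative never gets a chance and the closing quote matches too early.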

As always, the O'Reilly regex book is an amazing way to fully understand exactly what is going on and will really open a lot of eyes!!

Hartmut W Sager

6 Jan

9:35 a.m.

Thanks, Trevor, for your useful comments. As a result, I've spent some time in the PCRE regex documentation, and have discovered just how feeble the regex implementation is in my Vedit (no, not vi!) text editor. Even tonight, I've run into more problems.

Other than the lousy regex implementation, though, Vedit has served me well continuously since 1982 (with a large number of upgrades of course).

Hartmut W Sager - Tel +1-204-339-8331
