RegEx help needed (sed?)


Kevin McGregor

29 Nov 2016

3:42 p.m.

I'm using sed to massage some input. Specifically, I have input lines like

aaaaaaaaaa/BBBBB_ccc@00000 or aaaaaaaaaa/BBBBB@00000

and I want the output to always be

BBBBB

I've got most of it, but I can't figure out how to get rid of anything at the end of the line after EITHER the underscore OR the '@' (including either of those two characters).

Is this possible in one expression in sed? I can do it with piping the input through sed twice but I was wondering if one pass would do it. Currently I'm using

sed "s/^.*\/\(.*\)[_@].*$/\1/"

Which doesn't get rid of the "_ccc" when it appears, just the "@00000".

Suggestions?


Trevor Cordes

1 Dec

6:29 a.m.

On 2016-11-29 Kevin McGregor wrote:

...

I'm using sed to massage some input. Specifically, I have input lines like

aaaaaaaaaa/BBBBB_ccc@00000 or aaaaaaaaaa/BBBBB@00000

and I want the output to always be

BBBBB

The others pegged it with the "greedy" *.

I highly recommend using a pcre engine (or just perl!), as it gives you non-greedy *?, which can give you performance improvements and also save you some trouble/syntax. In fact, in tons of (most?) cases, people who aren't regex ninjas are really thinking non-greedy, so plunking non-greedy in everywhere usually DWIM.

...

sed "s/^.*\/\(.*\)[_@].*$/\1/"

The reason the above (and any greedy .* solution) is inefficient is that the first .* will eat up every char to the end of string, then backtrack testing each one for the / char. Same with the 2nd .*, it will eat every char to end of string then backtrack looking for a _ or @! Inefficient. The only time this doesn't matter is the last .* since you're looking for the EOS anyhow.
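For anyone who would rather stay in sed, here is a minimal sketch of a one-pass alternative. It swaps the greedy capture for a negated character class, [^_@]*, which cannot run past the first underscore or at-sign. The function names greedy and one_pass are made up for illustration; the sample lines are the ones from the question. The escaped form of the original expression is assumed to be what was actually run.

```shell
#!/bin/sh
# greedy and one_pass are hypothetical helper names, not from the thread.

# Greedy version from the question: the capture extends to the LAST _ or @
# on the line, so "_ccc" survives.
greedy() {
    printf '%s\n' "$1" | sed 's/^.*\/\(.*\)[_@].*$/\1/'
}

# One-pass alternative: [^_@]* stops at the first _ or @ it meets,
# so the capture never has to back off from the end of the line.
one_pass() {
    printf '%s\n' "$1" | sed 's/^.*\/\([^_@]*\).*$/\1/'
}

greedy   'aaaaaaaaaa/BBBBB_ccc@00000'   # prints BBBBB_ccc
one_pass 'aaaaaaaaaa/BBBBB_ccc@00000'   # prints BBBBB
one_pass 'aaaaaaaaaa/BBBBB@00000'       # prints BBBBB
```

One difference to be aware of: on a line with no _ or @ after the slash, one_pass prints everything after the slash, whereas the original expression would not match and would leave the line unchanged.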

May not seem like much, but if you were processing really large strings (I work on multi-MB strings a lot) or GB-sized lists of these lines then you'd see a difference. In fact, with large inputs, you can cause a regex engine to "hang" pretty easily with intermixed .*'s due to backtracking. Like /.*..*..*/ type things, which can happen even when they aren't so obvious. .*? will save you in many of those cases.

Plus, using perl we can make the code *much* terser. I love terse! Terse is good.

# shortest method, and fastest as we are yanking rather than modifying
# perl allows any char delimiter so we don't get backslashitis
use 5.18.0; # gives us say, which is print with NL appended
say((m#/([^_@]*)#)[0]);

# more like the sed call (replacement) but with non-greedy:
print s#.*?/(.*?)[_@].*#$1#r;

# or use from a cmd line:
perl -ne 'print ((m#/([^_@]*)#)[0]."\n")'

Perl is far and away the best program for working with regexes: they are 100% native, first-class types in perl, there's no need to mess with most quoting, you get a choice of delimiter, etc. Every time I have to use pcre in other languages (php, etc.) I want to vomit with how ugly the syntax gets.

I highly highly highly highly recommend everyone who ever needs to do this stuff read the O'Reilly Mastering Regular Expressions book. It'll make you grok the "greedy" and backtracking stuff like it's second nature. Might be the best book they've ever put out.

Just my $1.02!

roundtable@muug.ca

1 comments

2 participants

tags (0)

participants (2)

Kevin McGregor
Trevor Cordes