I have files with the following structure:
garbage
garbage
garbage
output start
.. good data
.. good data
.. good data
.. good data
output end
garbage
garbage
garbage
How can I extract the good data from the file, trimming the garbage from the beginning and end?
The following works just fine, but it's dirty: I have to pick an arbitrarily large number for the "before" and "after" values.
grep -A 999999 "output start" <infile> | grep -B 999999 "output end" > newfile
Can anyone come up with something more elegant?
OTTOMH:
perl -n -e 'BEGIN {$state = 0} $state = 1 if ($state == 0 and /output start/); $state = 2 if ($state == 1 and /output end/); print if ($state == 1)' < infile > outfile
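Spelled out over several lines, the same state machine reads a little easier (identical logic, just reformatted with comments):

perl -ne '
    BEGIN { $state = 0 }                            # 0 = before, 1 = inside, 2 = after
    $state = 1 if $state == 0 and /output start/;   # enter the region on the start marker
    $state = 2 if $state == 1 and /output end/;     # leave it on the end marker
    print if $state == 1;                           # note: prints the start marker but not the end one
' < infile > outfile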
I'll bet there's a shorter AWK version though.
Sean
Oh, and good glaven, there is a shorter AWK one:
awk '/output start/,/output end/' < infile > outfile
Sean
The AWK version is functionally identical, and not much shorter or any more elegant:
awk '/output start/ {s=1};{if (s==1) print $0};/output end/ {s=0}'
(The perl version can generally be made that small, too.)
I would instead suggest sed(1), since this is precisely what it's designed for:
sed -n '/output start/,/output end/p' < infile
-Adam
I may have misinterpreted the question before. If you want the "output start" and "output end" marker lines in the output (which I guess your grep pipeline would do), then Adam's sed script will do that. Mine, using the "d" commands, will output only the data in between. The shortest awk script to do the same would be:
awk '/output start/{s=1};s==1;/output end/{s=0};'
or
awk '/output end/{s=0};s==1;/output start/{s=1};'
The first is a simplification of Adam's, which outputs the output marker lines, while the second, using the same statements in the opposite order, suppresses the markers. Of perl, awk and sed, I suspect sed is the most lightweight, and probably the quickest, unless perl can outperform sed on larger files. awk has a reputation for being pretty slow. I tend to favour sed unless awk or perl makes the job a lot easier.
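To make the difference concrete, here's what the two variants print on a toy file (the name "sample" is made up for the demonstration):

$ printf 'junk\noutput start\ndata 1\ndata 2\noutput end\njunk\n' > sample
$ awk '/output start/{s=1};s==1;/output end/{s=0};' sample
output start
data 1
data 2
output end
$ awk '/output end/{s=0};s==1;/output start/{s=1};' sample
data 1
data 2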
Gilles
Adam and I were having an offline discussion, and some testing shows that AWK outperforms SED by a slight margin:
[sean@bob tmp]$ W=/usr/share/dict/words
[sean@bob tmp]$ (tail -1000 $W; echo output start; cat $W; echo output end; head -1000 $W) > infile
[sean@bob tmp]$ wc -l infile
481831 infile
[sean@bob tmp]$ time awk '/output start/,/output end/' < infile > /dev/null

real    0m0.411s
user    0m0.393s
sys     0m0.016s

[sean@bob tmp]$ time sed -n '/output start/,/output end/p' < infile > /dev/null

real    0m0.678s
user    0m0.631s
sys     0m0.029s
I ran it a bunch more times and the results were similar. YMMV, benchmarks are lies, etc.
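For reference, the whole comparison packs into a short, re-runnable script. The perl line is an extra for curiosity's sake, not one of the tools timed above; it uses perl's flip-flop range operator, the direct analogue of the awk range:

#!/bin/sh
# Rebuild the test file (assumes a wordlist at /usr/share/dict/words).
W=/usr/share/dict/words
(tail -1000 $W; echo output start; cat $W; echo output end; head -1000 $W) > infile
# Time each approach; all three print the region between the markers.
time awk '/output start/,/output end/' < infile > /dev/null
time sed -n '/output start/,/output end/p' < infile > /dev/null
time perl -ne 'print if /output start/ .. /output end/' < infile > /dev/null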
Sean
I noticed that the sed version was much faster than the grep hack I tried (all those pipes!) but I didn't time it.
All things considered, I award "awk" the prize of "most elegant" for its three fewer characters in the command string and slight performance edge!
Thanks guys!
John
Interesting! Which version of awk did you test? I have to admit I haven't looked into awk performance in quite some time. My early experience, on older Unix systems (pre-Linux), confirmed what I had read about awk being pretty slow. But I seem to recall that even on older Linux systems, gawk wasn't exactly speedy either. I imagine the GNU awk developers must have remedied that since, though, if that is indeed what you were testing.
Searching online for discussions of awk performance turned up one from 2002 suggesting gawk was much faster than nawk, and another from this past August suggesting the opposite. Perhaps the developers of the two have been leap-frogging each other with optimizations to their code?
I used whatever was on my Fedora 13 box:
[sean@bob ~]$ awk --version
GNU Awk 3.1.8
[sean@bob ~]$ sed --version
GNU sed version 4.2.1
The difference gets much bigger if you use a more complex regexp.
[sean@bob tmp]$ time awk '/.*output.*start.*/,/.*output.*end.*/' < infile > /dev/null

real    0m0.450s
user    0m0.393s
sys     0m0.010s

[sean@bob tmp]$ time sed -n '/.*output.*start.*/,/.*output.*end.*/p' < infile > /dev/null

real    0m1.726s
user    0m1.495s
sys     0m0.017s
Awk didn't seem to blink an eye. Strangely enough, the beginning and ending .*'s, though completely superfluous, seem to throw sed for a loop, even if the middle .* is replaced with a space; presumably they defeat some shortcut sed can otherwise take when a pattern begins with literal text.
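Dropping the superfluous .*'s gets the original times back. And if the markers are known to be whole lines, anchoring the patterns gives a stricter match while keeping the expressions just as simple (untimed here, but it shouldn't be any worse):

sed -n '/^output start$/,/^output end$/p' < infile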
Sean
On 2010-11-10 Sean Walberg wrote:
Adam and I were having an offline discussion, and some testing shows that AWK outperforms SED by a slight margin:
I know it's an old thread... but I had to have a go at you awk/sed weenies. ;-)
My solution is a perl regex:
perl -e '$/=undef;open I,$ARGV[0];$_=<I>;/(?:^|\n)(output start\n.*\noutput end\n)/s and print $1' infile
It's not a filter (requires a filename) but could probably easily be made into one.
I recall reading in perl books that perl's regex engine was faster than sed's or awk's, and the above takes advantage of the whole-file slurp that setting $/ to undef allows.
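For what it's worth, -0777 sets $/ to undef from the command line, so something like this should turn it into a filter (untested sketch, same regex):

perl -0777 -ne 'print $1 if /(?:^|\n)(output start\n.*\noutput end\n)/s' < infile > outfile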
On my computer the awk/sed/perl times compare like so:
time sed -n '/output start/,/output end/p' < infile > /dev/null
0.264+0.002c 0:00.26s 100.0% 0+0<774k | 1+39cs 0+259pg 0sw 0sg

time awk '/output start/,/output end/' < infile > /dev/null
0.183+0.003c 0:00.18s 100.0% 0+0<774k | 1+28cs 0+298pg 0sw 0sg

time perl -e '$/=undef;open I,$ARGV[0];$_=<I>;/(?:^|\n)(output start\n.*\noutput end\n)/s and print $1' infile > /dev/null
0.032+0.017c 0:00.05s 80.0% 0+0<8168k | 1+19cs 0+4196pg 0sw 0sg
Wow! But yikes, look at the mem usage. Good thing RAM is plentiful these days. In 1980 sed would be the better bet for sure.
sed '1,/^output start/d; /^output end/,$d' < infile > newfile
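This looks like the "d" command script mentioned earlier, which drops the marker lines as well as the garbage. Read as two deletions: everything from line 1 through the start marker goes, then everything from the end marker to end of file. Split out with -e for clarity, it's the same command:

sed -e '1,/^output start/d' -e '/^output end/,$d' < infile > newfile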