I used whatever was on my Fedora 13 box:
[sean@bob ~]$ awk --version
GNU Awk 3.1.8
[sean@bob ~]$ sed --version
GNU sed version 4.2.1
The difference gets much bigger if you use a more complex regexp.
[sean@bob tmp]$ time awk '/.*output.*start.*/,/.*output.*end.*/' < infile > /dev/null

real    0m0.450s
user    0m0.393s
sys     0m0.010s

[sean@bob tmp]$ time sed -n '/.*output.*start.*/,/.*output.*end.*/p' < infile > /dev/null

real    0m1.726s
user    0m1.495s
sys     0m0.017s
Awk didn't seem to blink an eye. Strangely, the leading and trailing .*'s are completely superfluous, yet they throw sed for a loop; the slowdown persists even when the middle .* is replaced with a space.
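A quick way to confirm the superfluous anchors are the culprit (a sketch, not benchmarked here; results will vary by machine) is to drop them and re-time sed against the same infile built in the quoted message below:

# Functionally identical range match without the leading/trailing .*;
# an unanchored regex already matches anywhere in the line, so if the
# anchors are to blame, this should be back in awk's territory.
time sed -n '/output.*start/,/output.*end/p' < infile > /dev/null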
Sean
On Wed, Nov 10, 2010 at 12:37 PM, Gilles Detillieux <grdetil@scrc.umanitoba.ca> wrote:
Interesting! Which version of awk did you test? I have to admit I haven't looked into awk performance in quite some time. My early experience, on older Unix systems (pre-Linux), confirmed what I had read about awk being pretty slow. But I seem to recall that even on older Linux systems, gawk wasn't exactly speedy then either. I imagine the GNU awk developers must have remedied that since, though, if that is indeed what you were testing.
Searching online for discussions of awk performance turned up one from 2002 suggesting gawk was much faster than nawk, and another from this past August suggesting the opposite. Perhaps the developers of the two have been leap-frogging each other with optimizations to their code?
On 11/10/2010 11:56 AM, Sean Walberg wrote:
Adam and I were having an offline discussion, and some testing shows that AWK outperforms SED by a slight margin:
[sean@bob tmp]$ W=/usr/share/dict/words
[sean@bob tmp]$ (tail -1000 $W; echo output start; cat $W; echo output end; head -1000 $W) > infile
[sean@bob tmp]$ wc -l infile
481831 infile
[sean@bob tmp]$ time awk '/output start/,/output end/' < infile > /dev/null

real    0m0.411s
user    0m0.393s
sys     0m0.016s

[sean@bob tmp]$ time sed -n '/output start/,/output end/p' < infile > /dev/null

real    0m0.678s
user    0m0.631s
sys     0m0.029s
I ran it a bunch more times and the results were similar. YMMV, benchmarks are lies, etc.
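For anyone who wants to repeat the runs less manually, a minimal sketch (assuming bash and the same infile as above; timings left to the reader):

# Let `time` report each of five passes per tool and eyeball the
# spread, rather than trusting any single run.
for i in 1 2 3 4 5; do
    time awk '/output start/,/output end/' < infile > /dev/null
done
for i in 1 2 3 4 5; do
    time sed -n '/output start/,/output end/p' < infile > /dev/null
done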
Sean
On Wed, Nov 10, 2010 at 11:32 AM, Gilles Detillieux <grdetil@scrc.umanitoba.ca> wrote:
I may have misinterpreted the question before. If you want the "output
start" and "output end" marker lines in the output (which I guess your
grep pipeline would do), then Adam's sed script will do that. Mine, using
the "d" commands, will output only the data in between.

The shortest awk script to do the same would be:

awk '/output start/{s=1};s==1;/output end/{s=0};'

or

awk '/output end/{s=0};s==1;/output start/{s=1};'

The first is a simplification of Adam's, which outputs the marker lines,
while the second, using the same statements in the opposite order,
suppresses the markers.

Of perl, awk and sed, I suspect sed is the most lightweight, and probably
the quickest, unless perl can outperform sed on larger files. awk has a
reputation for being pretty slow. I tend to favour sed unless awk or perl
makes the job a lot easier.

Gilles

On 11/10/2010 11:13 AM, Adam Thompson wrote:
> The AWK version is functionally identical, and not very much shorter,
> or any more elegant:
>
> awk '/output start/ {s=1};{if (s==1) print $0};/output end/ {s=0}'
>
> (the perl version can generally be made that small, too.)
>
> I would instead suggest sed(1), since this is precisely what it's
> designed for:
>
> sed -n '/output start/,/output end/p' < infile
>
> -Adam
>
> From: roundtable-bounces@muug.mb.ca [mailto:roundtable-bounces@muug.mb.ca]
> On Behalf Of Sean Walberg
> Sent: Wednesday, November 10, 2010 10:56
> To: Continuation of Round Table discussion
> Subject: Re: [RndTbl] Command line challenge: trim garbage from start
> and end of a file.
>
> OTTOMH:
>
> perl -n -e 'BEGIN {$state = 0} $state = 1 if ($state == 0 and /output
> start/); $state = 2 if ($state == 1 and /output end/) ; print if
> ($state == 1)' < infile > outfile
>
> I'll bet there's a shorter AWK version though.
>
> Sean
>
> On Wed, Nov 10, 2010 at 10:51 AM, John Lange <john@johnlange.ca> wrote:
>
> I have files with the following structure:
>
> garbage
> garbage
> garbage
> output start
> .. good data
> .. good data
> .. good data
> .. good data
> output end
> garbage
> garbage
> garbage
>
> How can I extract the good data from the file, trimming the garbage
> from the beginning and end?
>
> The following works just fine, but it's dirty because I don't like the
> fact that I have to pick an arbitrarily large number for the "before"
> and "after" values.
>
> grep -A 999999 "output start" <infile> | grep -B 999999 "output end" > newfile
>
> Can anyone come up with something more elegant?
>
> --
> John Lange
> www.johnlange.ca
--
Sean Walberg <sean@ertw.com>  http://ertw.com/
--
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 0J9  (Canada)

_______________________________________________
Roundtable mailing list
Roundtable@muug.mb.ca
http://www.muug.mb.ca/mailman/listinfo/roundtable