I found myself needing a type of -limit -quit option in find. I couldn't see a built-in way to do it, even with GNU find. GNU find does let you count to 1 and quit, simply by using -quit, but not count to X then quit.
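For reference, the count-to-1 case with GNU find is just (the path and test here are placeholders):

  find /path -type f -print -quit

which prints the first match and stops.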
Why do I want to quit at all? Why not just do find|wc -l? The dirs I'm scanning have about 200k files and are sometimes over NFS. Either way, a full find|wc takes a long time and a lot of resources, especially if the find has to stat() each file (for mtime, etc.). With find|wc, one of my find commands took 10+ minutes. With my new method, it's a few seconds.
Here's the best solution I could think up. It's sub-optimal I'm sure (requires execs and a temp file), but I couldn't see an easier way to do it within the confines of find (without writing my own find, which I didn't want to do in this case).
See the find command example on lines 6-7 of the script (the usage comment). Arg 1 is a temp-file path (normal race-condition safety precautions apply; see the mktemp sketch after the script). Arg 2 is the number to count to.
cat find-count-helper
#!/usr/bin/perl -w
#
# allows a type of counting short-circuit in find
# much faster in huge dirs than doing a find | wc -l
# use:
# find path \( -name 'exclude-dir' -prune \) -o -type f -print -exec \
#   /usr/local/script/find-count-helper /tmp/unique-temp-file 5 \; -quit
# will find the first 5 matching files then quit

$ENV{'SHELL'}='/bin/bash';

$file = $ARGV[0];   # temp file holding the running count
$max  = $ARGV[1];   # quit after this many matches

# read the current count; treat a missing or empty file as zero
$_ = `cat $file 2>/dev/null`;
chop;
$_ = 0 if !$_;
$_++;

# at the limit: clean up and exit 0 (true) so find's -quit fires
if ($_ >= $max) {
    unlink $file;
    exit 0;
}

# below the limit: save the count and exit 1 (false) so -quit is skipped
open(O, '>', $file) or die;
print O "$_\n";
close(O);
exit 1;
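For the temp-file race precautions, here's a sketch of a full invocation that uses mktemp(1) to create a private counter file (the path, prune pattern, and count of 1000 are just examples):

  tmp=$(mktemp) || exit 1
  find /path \( -name 'exclude-dir' -prune \) -o -type f -print \
      -exec /usr/local/script/find-count-helper "$tmp" 1000 \; -quit \
      | wc -l
  rm -f "$tmp"   # the helper unlinks it at the limit; this cleans up
                 # the case where fewer than 1000 files matched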
-----Original Message-----
From: roundtable-bounces@muug.mb.ca [mailto:roundtable-bounces@muug.mb.ca] On Behalf Of Trevor Cordes
Sent: Saturday, November 05, 2011 10:00
To: MUUG RndTbl
Subject: [RndTbl] fast counting with find
> I found myself needing a type of -limit -quit option in find. [...]
> With find|wc, one of my find commands took 10+ minutes. With my new
> method, it's a few seconds.
Doesn't " find /path -args | head -1000 | wc -l" give you nearly the same result? It may generate more disk i/o in the background (depending on pipe buffering and signalling semantics) but should just as fast when used interactively.
(For the pedantic among us, that should read "find /path -args -print | head -n 1000 | wc -l", since specifying the line count directly as an option to head(1) is deprecated in POSIX.)
head(1) will exit immediately upon counting 1000 (or whatever the maximum is) lines, which will generate SIGPIPE to find the next time it write(2)s to the pipe, and find will then (more or less) immediately exit. This can generate a bit of extra I/O: head(1) blocks in read(2) until the pipe(7) fills enough to satisfy the call, read(2)s from the pipe, counts to X and terminates; in the interim, find(1) has continued processing to fill the next BUFSIZ's worth of pipe(7). Meanwhile, wc never receives SIGPIPE, because it is the reader rather than the writer; it processes normally and exits when head(1)'s termination produces EOF on its input pipe.
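In bash you can watch this happen via PIPESTATUS, which records each stage's exit status (141 = 128 + SIGPIPE); the directory below is a placeholder:

  find /some/big/dir -type f -print | head -n 1000 | wc -l
  echo "${PIPESTATUS[@]}"   # typically "141 0 0": find was killed by SIGPIPE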
Based on a naïve test I just ran, the amount of extra disk I/O involved is below the threshold of human measurement, but the numbers I generated indicate that the pipe writer produces anywhere between another 100 bytes and 4 KB of output. Four kilobytes of find(1) output could easily represent a couple dozen megabytes of disk I/O, or even more in pathological cases. However, this isn't an issue (as long as you don't have disk IOPS contention from other processes) because the wall time remains the same.
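One crude way to reproduce that sort of naïve test is to tee find's output to a file and compare its line count against head's limit (the directory is a placeholder, and the exact overshoot will vary with pipe buffering):

  find /some/big/dir -type f -print | tee /tmp/all-output | head -n 1000 > /dev/null
  wc -l < /tmp/all-output   # anything over 1000 is output find produced
                            # after head already had what it needed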
Or am I misinterpreting what you want to accomplish altogether?
-Adam
On 11/05/2011 05:11 PM, Adam Thompson wrote:
Doesn't " find /path -args | head -1000 | wc -l" give you nearly the same result? It may generate more disk i/o in the background (depending on pipe buffering and signalling semantics) but should just as fast when used interactively.
(For the pedantic among us, that should read "find /path -args -print | head -n 1000 | wc -l" since direct specification of the line count to head(1) in option-style syntax is deprecated in POSIX.)
I regularly use sed Nq (where N is a number) instead of head because sed 100q is universal, and head sometimes requires -n and sometimes doesn't, and that's annoying.
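With sed standing in for head, Adam's pipeline becomes (the path, test, and count are placeholders):

  find /path -type f -print | sed 1000q | wc -l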
It seems like limiting the number of matches may not be the goal after all; perhaps it would be better to limit the resources that find uses. E.g., with recent coreutils you can limit how long it runs with the timeout(1) command.
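A sketch of that time-budget approach (the 30-second limit is arbitrary): timeout(1) kills find when the budget expires, and wc still prints the partial count it received up to that point:

  timeout 30 find /path -type f -print | wc -l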
The difference becomes immediately obvious if you think about what option you'd like added to findutils to give the desired result: one that stops find when the number of matches is reached, or one that stops find after some number of paths has been seen. Depending on the options supplied, these two could behave very differently.
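With the tools already in this thread, the two semantics would look roughly like this (the paths and the '*.log' pattern are placeholders):

  # stop after 1000 matches: head only ever sees matching paths
  find /path -name '*.log' -print | head -n 1000 | wc -l

  # stop after 1000 paths seen: cap everything find prints, then count
  # the matches among just those first 1000 paths
  find /path -print | head -n 1000 | grep -c '\.log$'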
Peter