-----Original Message-----
From: roundtable-bounces@muug.mb.ca [mailto:roundtable-bounces@muug.mb.ca] On Behalf Of Trevor Cordes
Sent: Saturday, November 05, 2011 10:00
To: MUUG RndTbl
Subject: [RndTbl] fast counting with find
I found myself needing a type of -limit -quit option in find. I couldn't see a built-in way to do it, even with GNU find. GNU find does let you count to 1 and quit, simply by using -quit, but not count to X then quit.
Why do I want to quit at all? Why not just do find|wc -l? The dirs I'm scanning have about 200k files and are sometimes over NFS. Either way, a full find|wc takes a long time and a lot of resources, especially if the find has to do a stat (for mtime, etc). With find|wc my 1 find command took 10+ mins. With my new method, it's a few seconds.
Here's the best solution I could think up. It's sub-optimal I'm sure (requires execs and a temp file), but I couldn't see an easier way to do it within the confines of find (without writing my own find, which I didn't want to do in this case).
Doesn't "find /path -args | head -1000 | wc -l" give you nearly the same result? It may generate more disk I/O in the background (depending on pipe buffering and signalling semantics) but should be just as fast when used interactively.
(For the pedantic among us, that should read "find /path -args -print | head -n 1000 | wc -l" since direct specification of the line count to head(1) in option-style syntax is deprecated in POSIX.)
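To make that concrete, here is a rough sketch of the capped count wrapped in a shell function. The name count_up_to and the throwaway demo directory are my own inventions for illustration, not anything find(1) provides, and it assumes no file names contain newlines (each of those would be counted as more than one line).

```shell
#!/bin/sh
# count_up_to is a hypothetical helper name, not part of find(1).
# Usage: count_up_to LIMIT PATH [find expression...]
count_up_to() {
    limit=$1
    shift
    # head exits after $limit lines; wc counts whatever head passed along.
    # tr strips the padding some wc implementations put before the number.
    find "$@" -print | head -n "$limit" | wc -l | tr -d ' '
}

# Demo on a throwaway directory holding 50 empty files.
demo_dir=$(mktemp -d)
i=1
while [ "$i" -le 50 ]; do
    : > "$demo_dir/file$i"
    i=$((i + 1))
done

capped=$(count_up_to 10 "$demo_dir" -type f)    # stops at the limit
total=$(count_up_to 1000 "$demo_dir" -type f)   # limit never reached
echo "capped=$capped total=$total"
rm -rf "$demo_dir"
```

A result equal to the limit only tells you there are at least that many matches; anything below the limit is the exact count.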
head(1) will exit immediately upon counting 1000 (or whatever maximum number of) lines. The next time find tries to write(2) to the pipe, which now has no reader, it receives SIGPIPE and (more or less) immediately exits. So this might generate more I/O: if the read(2) in head(1) blocks until the pipe(7) fills enough to satisfy it, head(1) read(2)s from the pipe, counts to X and terminates, but in the interim find(1) has continued processing to fill the next BUFSIZ's worth of pipe(7)... Meanwhile, wc does not receive SIGPIPE because it only reads from the pipe (the signal goes to writers); it processes normally and exits when head(1) terminates and it sees EOF on the pipe.
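If you want to watch the SIGPIPE behaviour without a big directory tree, something like the following sketch should do. It uses yes(1) as a stand-in for a find(1) that never runs out of output, and the status_file temp file is just my way of capturing the writer's exit status in plain sh (in bash you could look at PIPESTATUS instead).

```shell
#!/bin/sh
status_file=$(mktemp)

# The subshell runs the writer and then records its exit status in a file,
# because $? after a pipeline only reports the last command (tr here).
# stderr is discarded in case the writer reports EPIPE instead of dying.
lines=$( (yes "some/path"; echo "$?" > "$status_file") 2>/dev/null \
         | head -n 5 | wc -l | tr -d ' ')
writer_status=$(cat "$status_file")
rm -f "$status_file"

echo "lines=$lines writer_status=$writer_status"
```

The reader side sees exactly 5 lines. The writer's status is typically 141 (128 plus 13, the signal number of SIGPIPE on most systems), meaning it was killed by the signal; if SIGPIPE happens to be ignored in your environment, the writer gets EPIPE from write(2) instead and exits with some other non-zero status.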
Based on a naïve test I just ran, the amount of extra disk I/O involved is too small for a human to notice, but the numbers I generated indicated that the pipe writer generates anywhere between another 100 bytes and 4 KB of output. Four kilobytes of find(1) output could easily indicate a couple dozen megabytes of disk I/O, or even more in pathological cases. However, this isn't an issue (as long as you don't have disk IOPS contention from other processes) because the wall time remains the same.
Or am I misinterpreting what you want to accomplish altogether?
-Adam