Shortly after I came up with the idea, I realized it could mostly be done as a Unix pipeline.
Although I hadn't used the option before, man find indicated that find could indeed return the size of the file and nothing else. Trying to write this article on my home machine, I discovered that this is a feature of GNU find, not available on the Mac. So on other machines you may need to do more, maybe use ls -l or have find print out the number of blocks a file takes up ... less accurate, less complete, but sufficient for a quick proof of concept.
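A minimal sketch of this first stage. It assumes that GNU find can be recognized by its --version output, and it falls back to reading the size column of ls -l elsewhere. The function name file_sizes is made up for illustration and is not the original command.

```shell
# Print the size in bytes of every regular file under a directory, one per line.
# file_sizes is a hypothetical helper name, not part of the original pipeline.
file_sizes() {
    dir=${1:-.}
    if find --version 2>/dev/null | grep -q GNU; then
        # GNU find: the %s directive of -printf is the file size in bytes.
        find "$dir" -type f -printf '%s\n'
    else
        # Fallback for other finds: field 5 of ls -l is the size in bytes.
        # Assumes owner and group names contain no spaces.
        find "$dir" -type f -exec ls -l {} + | awk '{ print $5 }'
    fi
}

# Usage: file_sizes /some/directory
```

The fallback is slower and more fragile than -printf, but it is enough for a quick proof of concept.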
find is printing a series of file sizes, one per line. My original thought was to take the logarithm of the size and truncate to an integer. But Perl's log only calculates the natural logarithm (log base e), so I would need to manually divide that by log base e of 10. After clobbering myself over the head for ten minutes trying to achieve that, I realized that the number of digits in the size IS the integer portion of log base 10, plus one, which makes 10 raised to the digit count an upper limit the size can never reach.
perl -n reads the input line by line, and applies the -e expression to each line. Specifying perl5.10 (or later) and using -E instead of -e allows me to use say, which supplies the \n, and calling length with no argument spares an explicit $_. I SHOULD chomp the newline off the input before getting its length, but I can simply subtract 1. I could subtract the character now, but I found it easier to do it later.
The output of the Perl component is a series of lines, each with a number specifying how many digits appear in the file length.
sort orders them, obviously, and uniq -c replaces multiple instances of a value with a single instance and the number of times that value appears.
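The counting step can be sketched as below. Using sort -n rather than plain sort is an assumption on my part: it keeps a length of 10 or more after 9 instead of before 2. The name tally is invented for illustration.

```shell
# Group identical lines and prefix each group with its count.
# tally is a hypothetical helper name.
tally() {
    # uniq -c only merges adjacent duplicates, so the input must be sorted first.
    sort -n | uniq -c
}

printf '%s\n' 4 2 4 6 2 4 | tally
# Three lines of output: count 2 for value 2, count 3 for value 4, count 1 for value 6.
# The amount of leading padding before each count varies between uniq implementations.
```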
Little Lord Flaunteroy would chomp off the newlines at the end of each line, and eliminate the leading spaces used by uniq -c. But I'm planning to split each line on space characters, to separate the count and value fields. By splitting on one-or-more spaces, the leading spaces, however many there may be, generate a single leading field with no data, which I just ignore. In real code I would parenthesize the right-hand expression and use square brackets to slice off the values I want. In a one-liner, it's simpler to add a dummy variable. Use the digit-count as an exponent to obtain an unreachable upper limit ... don't forget to drop the value by one, to make up for counting the newline a few stages back. A test with an empty file, or at least one with fewer than ten characters in it, will remind you to make that adjustment. All that's left is to output the results.