Are your files SM? M? XL? the procedural XL solution

I presented this challenge at work as a beat-the-winter-blahs activity.

But then I felt I couldn't present a merely operative solution. As the local Perl geek, I felt required to demonstrate that Perl could do clean, well written solutions, not just line noise. I wanted a balance between advanced Perl features and easy comprehension. On the Unix command line, I prefer piping commands together, rather than using a bunch of intermediate files. Similarly in Perl I advocate using map{} and grep{}, and feeding the output of one function into the input of another. I consider those standard idioms any programmer needs to learn.

On the other hand, there were a couple of situations where exceptions and oddities in the code made me uncomfortable. Pulling things into subroutines such as calc_field_width() simplified things and made the code clearer, I think. I guess the lesson of that is that short subroutines are better. I knew that!

I'm open to suggestions for alternative ways of doing things, if you have better ideas.

The program goes through the file hierarchy, storing data in a data structure , %stats.

0 HASH(0x1009a1000)
3 => HASH(0x1009d0c90)
'N' => 32
'sum' => 16995
4 => HASH(0x1009d0e40)
'N' => 17
'sum' => 57691
5 => HASH(0x1009d0ea0)
'N' => 3
'sum' => 35747

The '3', '4', '5 refer to the number of digits in the file size, i.e., '3' refers to file sizes in the range 100-999. 'N' is the number of files in this size range,'sum' is the sum of all the actual file sizes for the range.

The finale output shows there are 166 zero-byte files ... perhaps they were created but never populated, due to some error, or perhaps they communicated some binary data merely by their presence. Sometimes compilers use zero-length files to indicate the extent of their progress. Files with sizes between 1 and 9 bytes number 9, while 327 are under 100 bytes. This blog is too narrow, so I cut the table short, but I do have four 45GB files which store data comparing the performance ofsudoku-solving algorithms.

size N sum avg
---- - --- ---
< 1 166 0 0
< 10 59 411 7
< 100 327 17,840 55
< 1,000 3,800 1,939,738 510
< 10,000 8,231 31,002,459 3,767
< 100,000 3,575 109,346,315 30,586


#!/usr/local/bin/perl5.10
# Determine file counts for size categories 1-9, 10-99
#

use warnings;
use strict;
use feature ':5.10';

use Readonly;
use Getopt::Long;
use Data::Dumper;

# --------------------------------------------------------
# Constants
#

Readonly my $DOT              => q{.};
Readonly my $DOTDOT           => q{..};
Readonly my $ESCAPE_BACKSLASH => q{\\};
Readonly my $LASTFIELD        => -1;
Readonly my $SPACE            => q{ };
Readonly my $TAB              => qq{\t};

Readonly my $COMMIFY_WIDTH   => 3;
Readonly my $ROUNDING_OFFSET => 1.0 / 2;

# -------------------------------------------------------
# Main
#
my %options = ( root => q{~} );

GetOptions( \%options, 'root=s' )
  or die "Error parsing command line arguments.\n";

my $stats = process_sizes( $options{root} );
say Dumper($stats) if exists $ENV{DEBUG};
display_summary($stats);

# -------------------------------------------------------
# Subroutines
#

# Explore directories, accumulating file sizes. Initialize
# dirs list with the one root, add subdirs, processing
# recursively.
#
sub process_sizes {
    my (@dirs) = @_;

    my %stats;
    while (@dirs) {
        process_one_dir( \%stats, \@dirs );
    }
    return \%stats;
}

# ........................................................
# For any directory, use glob() to get a list of all the
# files in the directory.  Ignore '.' & '..'.
#
# Spaces in directory names need to be escaped, otherwise
# glob() will think the string contains multiple 
# candidates to match.

#
sub process_one_dir {
    my ( $stats, $dirs ) = @_;

    my $one_dir = shift @{$dirs};
    say "process_one_dir => $one_dir"
        if $ENV{DEBUG};
    $one_dir =~ s{(\s)}{$ESCAPE_BACKSLASH$1}g;

  ENTRY:
    while ( my $entry = glob "$one_dir/*" ) {
       next ENTRY
          if $entry eq $DOT or $entry eq $DOTDOT;
        process_one_entry( $entry, $dirs, $stats );
    }
}

# ........................................................
# Stick subdirectories onto the @dirs array for recursive
# searching.  File require additional processing based on
# their size.
#
# Note: specify the file name only on the first stats()
# call on a file ( -d or other -X ). Use '_' as the file
# name in additional calls to re-use the data rather than
# making a new call.
#
sub process_one_entry {
    my ( $entry, $dirs, $stats ) = @_;

    if ( -d $entry ) {
        push @{$dirs}, $entry;
    }
    else {
        say "${TAB}record_one_size => $entry"
            if $ENV{DEBUG};
        record_one_size( $stats, -s _ );
    }
}

# ........................................................
# For empty files, update a special '0' category. Use the
# length of the size of non-empty files to determine the
# category ( equivalent to ceiling( log10( size ))), and
# keep track of the number of files and the sum of the
# bytes for the category.

#

sub record_one_size {

    my ( $stats, $size ) = @_;



    my $len = $size == 0 ? 0 : length $size;

    $stats->{$len}{N}++;

    $stats->{$len}{sum} += $size;

}



# ........................................................

#

sub print_header {

    my ($fmt) = @_;



    my @fields = (qw/size N sum avg/);

    printf $fmt, @fields;

    printf $fmt, map { q{-} x length $_ } @fields;

}



# .........................................................

# For each size category, get desired field, either 'N'

# or 'sum'.  Sort these and grab the largest value to

# determine how wide the field needs to be.

#

sub calc_field_width {

    my ( $stats, $tags, $field ) = @_;



    my @values = map { $stats->{$_}{$field} } @{$tags};

    my $width = length( ( sort { $a <=> $b }

                        @values )[$LASTFIELD] );

    return $width;

}



# ........................................................

# Field widths need an extra 33% to allow for commafying

# numbers.  The calulation will try to put a comma to the

# left of the entire number if leftmost group has three

# digits, so drop one comma in that case.  But don't drop

# comma if the width is zero, because that would mmake the

# width negative.

#

sub allow_for_commas {

    my ($width) = @_;



    my $commas = int $width / $COMMIFY_WIDTH;

    $commas--

        if $commas && $width % $COMMIFY_WIDTH == 0;

    return $width + $commas;

}



# ........................................................

# The keys of the hash are the exponents of the size

# categories, i.e., the log of the size, 0, 1, 2, ... Sort

# these and take the largest to determine how wide size

# field has to be.  Determine the width for the N and sum

# fields.  Reuse the 'sum' field width for the average,

# since it can only be less or the same width.

#

# To generate the format string, take an array of the

# %fwidth keys and use map{} to extract the corresponding

# fwidth value, stick them inside the format '%s' and join

# the various pieces using spaces.  Add 33% padding to

# widths to allow for commas.

#

sub gen_fmt {

    my ($stats) = @_;



    my @tags = sort { $a <=> $b } keys %{$stats};

    my %fwidth;

    # size is N zeroes plus leading '1'

    $fwidth{size} = 1 + $tags[$LASTFIELD];

    $fwidth{count} = calc_field_width $stats, \@tags, 'N');

    $fwidth{sum}   = calc_field_width $stats, \@tags, 'sum');



    my $fmt = join $SPACE,

              map { '%' . allow_for_commas( $fwidth{$_} ) . 's' }

                  (qw/size count sum sum/);

    $fmt .= "\n";



    return ( $fmt, \@tags );

}



# ........................................................

# commify a number. Traditional solution see 'perldoc

# perlfaq5 q.13'

#

sub commify {

my $text = reverse $_[0];

$text =~ s/(\d\d\d)(?=\d)(?!\d*\.)/$1,/g;

return scalar reverse $text;

}



# ........................................................

#

sub print_line {

    my ( $fmt, $stats, $log ) = @_;



    my $size = '1' . '0' x $log;

    printf $fmt, map { commify($_) } (

        $size,

        $stats->{$log}{N},

        $stats->{$log}{sum},

        int $ROUNDING_OFFSET

            + $stats->{$log}{sum} / $stats->{$log}{N}

    );

}



# ........................................................

# All the field widths need to be dynamic, since largest

# file might be terabytes in size, there could be millions

# of files in some size range, the sum could be huge

# Calculate the format string once, then reuse it with
# leading spaces for the header, and a leading '< ' for
# the actual sizes. Everything is ( or can be ) a string in
# perl.
#
sub display_summary {
my ($stats) = @_;

my ( $fmt, $tags ) = gen_fmt($stats);
print_header(" $fmt");

for my $width ( @{$tags} ) {
print_line( "< $fmt", $stats, $width );
}
}
# --------------------------------------------------------
# End of File

If you aren't Happy
Which Dwarf Are You?

Search This Blog