Saturday, March 5, 2011

Are your files SM? M? XL? the procedural XL solution

I presented this challenge at work as a beat-the-winter-blahs activity.

But then I felt I couldn't present a merely operative solution. As the local Perl geek, I felt required to demonstrate that Perl could do clean, well written solutions, not just line noise. I wanted a balance between advanced Perl features and easy comprehension. On the Unix command line, I prefer piping commands together, rather than using a bunch of intermediate files. Similarly in Perl I advocate using map{} and grep{}, and feeding the output of one function into the input of another. I consider those standard idioms any programmer needs to learn.

On the other hand, there were a couple of situations where exceptions and oddities in the code made me uncomfortable. Pulling things into subroutines such as calc_field_width() simplified things and made the code clearer, I think. I guess the lesson of that is that short subroutines are better. I knew that!

I'm open to suggestions for alternative ways of doing things, if you have better ideas.

The program goes through the file hierarchy, storing data in a data structure , %stats.

0  HASH(0x1009a1000)
   3 => HASH(0x1009d0c90)
      'N' => 32
      'sum' => 16995
   4 => HASH(0x1009d0e40)
      'N' => 17
      'sum' => 57691
   5 => HASH(0x1009d0ea0)
      'N' => 3
      'sum' => 35747


The '3', '4', '5 refer to the number of digits in the file size, i.e., '3' refers to file sizes in the range 100-999. 'N' is the number of files in this size range,'sum' is the sum of all the actual file sizes for the range.

The finale output shows there are 166 zero-byte files ... perhaps they were created but never populated, due to some error, or perhaps they communicated some binary data merely by their presence. Sometimes compilers use zero-length files to indicate the extent of their progress. Files with sizes between 1 and 9 bytes number 9, while 327 are under 100 bytes. This blog is too narrow, so I cut the table short, but I do have four 45GB files which store data comparing the performance ofsudoku-solving algorithms.


            size     N           sum           avg
            ----     -           ---           ---
<              1   166             0             0
<             10    59           411             7
<            100   327        17,840            55
<          1,000 3,800     1,939,738           510
<         10,000 8,231    31,002,459         3,767
<        100,000 3,575   109,346,315        30,586



#!/usr/local/bin/perl5.10
# Determine file counts for size categories 1-9, 10-99
#

use warnings;
use strict;
use feature ':5.10';

use Readonly;
use Getopt::Long;
use Data::Dumper;

# --------------------------------------------------------
# Constants
#

Readonly my $DOT              => q{.};
Readonly my $DOTDOT           => q{..};
Readonly my $ESCAPE_BACKSLASH => q{\\};
Readonly my $LASTFIELD        => -1;
Readonly my $SPACE            => q{ };
Readonly my $TAB              => qq{\t};

Readonly my $COMMIFY_WIDTH   => 3;
Readonly my $ROUNDING_OFFSET => 1.0 / 2;

# -------------------------------------------------------
# Main
#
my %options = ( root => q{~} );

GetOptions( \%options, 'root=s' )
  or die "Error parsing command line arguments.\n";

my $stats = process_sizes( $options{root} );
say Dumper($stats) if exists $ENV{DEBUG};
display_summary($stats);

# -------------------------------------------------------
# Subroutines
#

# Explore directories, accumulating file sizes. Initialize
# dirs list with the one root, add subdirs, processing
# recursively.
#
sub process_sizes {
    my (@dirs) = @_;

    my %stats;
    while (@dirs) {
        process_one_dir( \%stats, \@dirs );
    }
    return \%stats;
}

# ........................................................
# For any directory, use glob() to get a list of all the
# files in the directory.  Ignore '.' & '..'.
#
# Spaces in directory names need to be escaped, otherwise
# glob() will think the string contains multiple 
# candidates to match.
#
sub process_one_dir {
    my ( $stats, $dirs ) = @_;

    my $one_dir = shift @{$dirs};
    say "process_one_dir => $one_dir"
        if $ENV{DEBUG};
    $one_dir =~ s{(\s)}{$ESCAPE_BACKSLASH$1}g;

  ENTRY:
    while ( my $entry = glob "$one_dir/*" ) {
       next ENTRY
          if $entry eq $DOT or $entry eq $DOTDOT;
        process_one_entry( $entry, $dirs, $stats );
    }
}

# ........................................................
# Stick subdirectories onto the @dirs array for recursive
# searching.  File require additional processing based on
# their size.
#
# Note: specify the file name only on the first stats()
# call on a file ( -d or other -X ). Use '_' as the file
# name in additional calls to re-use the data rather than
# making a new call.
#
sub process_one_entry {
    my ( $entry, $dirs, $stats ) = @_;

    if ( -d $entry ) {
        push @{$dirs}, $entry;
    }
    else {
        say "${TAB}record_one_size => $entry"
            if $ENV{DEBUG};
        record_one_size( $stats, -s _ );
    }
}

# ........................................................
# For empty files, update a special '0' category. Use the
# length of the size of non-empty files to determine the
# category ( equivalent to ceiling( log10( size ))), and
# keep track of the number of files and the sum of the
# bytes for the category.
#
sub record_one_size {
    my ( $stats, $size ) = @_;

    my $len = $size == 0 ? 0 : length $size;
    $stats->{$len}{N}++;
    $stats->{$len}{sum} += $size;
}

# ........................................................
#
sub print_header {
    my ($fmt) = @_;

    my @fields = (qw/size N sum avg/);
    printf $fmt, @fields;
    printf $fmt, map { q{-} x length $_ } @fields;
}

# .........................................................
# For each size category, get desired field, either 'N'
# or 'sum'. Sort these and grab the largest value to
# determine how wide the field needs to be.
#
sub calc_field_width {
    my ( $stats, $tags, $field ) = @_;

    my @values = map { $stats->{$_}{$field} } @{$tags};
    my $width = length( ( sort { $a <=> $b }
                        @values )[$LASTFIELD] );
    return $width;
}

# ........................................................
# Field widths need an extra 33% to allow for commafying
# numbers. The calulation will try to put a comma to the
# left of the entire number if leftmost group has three
# digits, so drop one comma in that case. But don't drop
# comma if the width is zero, because that would mmake the
# width negative.
#
sub allow_for_commas {
    my ($width) = @_;

    my $commas = int $width / $COMMIFY_WIDTH;
    $commas--
        if $commas && $width % $COMMIFY_WIDTH == 0;
    return $width + $commas;
}

# ........................................................
# The keys of the hash are the exponents of the size
# categories, i.e., the log of the size, 0, 1, 2, ... Sort
# these and take the largest to determine how wide size
# field has to be. Determine the width for the N and sum
# fields. Reuse the 'sum' field width for the average,
# since it can only be less or the same width.
#
# To generate the format string, take an array of the
# %fwidth keys and use map{} to extract the corresponding
# fwidth value, stick them inside the format '%s' and join
# the various pieces using spaces. Add 33% padding to
# widths to allow for commas.
#
sub gen_fmt {
    my ($stats) = @_;

    my @tags = sort { $a <=> $b } keys %{$stats};
    my %fwidth;
    # size is N zeroes plus leading '1'
    $fwidth{size} = 1 + $tags[$LASTFIELD];
    $fwidth{count} = calc_field_width $stats, \@tags, 'N');
    $fwidth{sum}   = calc_field_width $stats, \@tags, 'sum');

    my $fmt = join $SPACE,
              map { '%' . allow_for_commas( $fwidth{$_} ) . 's' }
                  (qw/size count sum sum/);
    $fmt .= "\n";

    return ( $fmt, \@tags );
}

# ........................................................
# commify a number. Traditional solution see 'perldoc
# perlfaq5 q.13'
#
sub commify {
my $text = reverse $_[0];
$text =~ s/(\d\d\d)(?=\d)(?!\d*\.)/$1,/g;
return scalar reverse $text;
}

# ........................................................
#
sub print_line {
    my ( $fmt, $stats, $log ) = @_;

    my $size = '1' . '0' x $log;
    printf $fmt, map { commify($_) } (
        $size,
        $stats->{$log}{N},
        $stats->{$log}{sum},
        int $ROUNDING_OFFSET
            + $stats->{$log}{sum} / $stats->{$log}{N}
    );
}

# ........................................................
# All the field widths need to be dynamic, since largest
# file might be terabytes in size, there could be millions
# of files in some size range, the sum could be huge

# Calculate the format string once, then reuse it with 
# leading spaces for the header, and a leading '< ' for 
# the actual sizes. Everything is ( or can be ) a string in 
# perl. 
#  
sub display_summary {
    my ($stats) = @_; 


    my ( $fmt, $tags ) = gen_fmt($stats); 
    print_header(" $fmt");


    for my $width ( @{$tags} ) { 
        print_line( "< $fmt", $stats, $width ); 
    } 

# -------------------------------------------------------- 
# End of File

No comments: