Skip to main content

Are your files SM? M? XL? the procedural XL solution

I presented this challenge at work as a beat-the-winter-blahs activity.

But then I felt I couldn't present a merely operative solution. As the local Perl geek, I felt required to demonstrate that Perl could do clean, well written solutions, not just line noise. I wanted a balance between advanced Perl features and easy comprehension. On the Unix command line, I prefer piping commands together, rather than using a bunch of intermediate files. Similarly in Perl I advocate using map{} and grep{}, and feeding the output of one function into the input of another. I consider those standard idioms any programmer needs to learn.

On the other hand, there were a couple of situations where exceptions and oddities in the code made me uncomfortable. Pulling things into subroutines such as calc_field_width() simplified things and made the code clearer, I think. I guess the lesson of that is that short subroutines are better. I knew that!

I'm open to suggestions for alternative ways of doing things, if you have better ideas.

The program goes through the file hierarchy, storing data in a data structure , %stats.

0  HASH(0x1009a1000)
   3 => HASH(0x1009d0c90)
      'N' => 32
      'sum' => 16995
   4 => HASH(0x1009d0e40)
      'N' => 17
      'sum' => 57691
   5 => HASH(0x1009d0ea0)
      'N' => 3
      'sum' => 35747


The '3', '4', '5 refer to the number of digits in the file size, i.e., '3' refers to file sizes in the range 100-999. 'N' is the number of files in this size range,'sum' is the sum of all the actual file sizes for the range.

The finale output shows there are 166 zero-byte files ... perhaps they were created but never populated, due to some error, or perhaps they communicated some binary data merely by their presence. Sometimes compilers use zero-length files to indicate the extent of their progress. Files with sizes between 1 and 9 bytes number 9, while 327 are under 100 bytes. This blog is too narrow, so I cut the table short, but I do have four 45GB files which store data comparing the performance ofsudoku-solving algorithms.


            size     N           sum           avg
            ----     -           ---           ---
<              1   166             0             0
<             10    59           411             7
<            100   327        17,840            55
<          1,000 3,800     1,939,738           510
<         10,000 8,231    31,002,459         3,767
<        100,000 3,575   109,346,315        30,586



#!/usr/local/bin/perl5.10
# Determine file counts for size categories 1-9, 10-99
#

use warnings;
use strict;
use feature ':5.10';

use Readonly;
use Getopt::Long;
use Data::Dumper;

# --------------------------------------------------------
# Constants
#

Readonly my $DOT              => q{.};
Readonly my $DOTDOT           => q{..};
Readonly my $ESCAPE_BACKSLASH => q{\\};
Readonly my $LASTFIELD        => -1;
Readonly my $SPACE            => q{ };
Readonly my $TAB              => qq{\t};

Readonly my $COMMIFY_WIDTH   => 3;
Readonly my $ROUNDING_OFFSET => 1.0 / 2;

# -------------------------------------------------------
# Main
#
my %options = ( root => q{~} );

GetOptions( \%options, 'root=s' )
  or die "Error parsing command line arguments.\n";

my $stats = process_sizes( $options{root} );
say Dumper($stats) if exists $ENV{DEBUG};
display_summary($stats);

# -------------------------------------------------------
# Subroutines
#

# Explore directories, accumulating file sizes. Initialize
# dirs list with the one root, add subdirs, processing
# recursively.
#
sub process_sizes {
    my (@dirs) = @_;

    my %stats;
    while (@dirs) {
        process_one_dir( \%stats, \@dirs );
    }
    return \%stats;
}

# ........................................................
# For any directory, use glob() to get a list of all the
# files in the directory.  Ignore '.' & '..'.
#
# Spaces in directory names need to be escaped, otherwise
# glob() will think the string contains multiple 
# candidates to match.
#
sub process_one_dir {
    my ( $stats, $dirs ) = @_;

    my $one_dir = shift @{$dirs};
    say "process_one_dir => $one_dir"
        if $ENV{DEBUG};
    $one_dir =~ s{(\s)}{$ESCAPE_BACKSLASH$1}g;

  ENTRY:
    while ( my $entry = glob "$one_dir/*" ) {
       next ENTRY
          if $entry eq $DOT or $entry eq $DOTDOT;
        process_one_entry( $entry, $dirs, $stats );
    }
}

# ........................................................
# Stick subdirectories onto the @dirs array for recursive
# searching.  File require additional processing based on
# their size.
#
# Note: specify the file name only on the first stats()
# call on a file ( -d or other -X ). Use '_' as the file
# name in additional calls to re-use the data rather than
# making a new call.
#
sub process_one_entry {
    my ( $entry, $dirs, $stats ) = @_;

    if ( -d $entry ) {
        push @{$dirs}, $entry;
    }
    else {
        say "${TAB}record_one_size => $entry"
            if $ENV{DEBUG};
        record_one_size( $stats, -s _ );
    }
}

# ........................................................
# For empty files, update a special '0' category. Use the
# length of the size of non-empty files to determine the
# category ( equivalent to ceiling( log10( size ))), and
# keep track of the number of files and the sum of the
# bytes for the category.
#
sub record_one_size {
    my ( $stats, $size ) = @_;

    my $len = $size == 0 ? 0 : length $size;
    $stats->{$len}{N}++;
    $stats->{$len}{sum} += $size;
}

# ........................................................
#
sub print_header {
    my ($fmt) = @_;

    my @fields = (qw/size N sum avg/);
    printf $fmt, @fields;
    printf $fmt, map { q{-} x length $_ } @fields;
}

# .........................................................
# For each size category, get desired field, either 'N'
# or 'sum'. Sort these and grab the largest value to
# determine how wide the field needs to be.
#
sub calc_field_width {
    my ( $stats, $tags, $field ) = @_;

    my @values = map { $stats->{$_}{$field} } @{$tags};
    my $width = length( ( sort { $a <=> $b }
                        @values )[$LASTFIELD] );
    return $width;
}

# ........................................................
# Field widths need an extra 33% to allow for commafying
# numbers. The calulation will try to put a comma to the
# left of the entire number if leftmost group has three
# digits, so drop one comma in that case. But don't drop
# comma if the width is zero, because that would mmake the
# width negative.
#
sub allow_for_commas {
    my ($width) = @_;

    my $commas = int $width / $COMMIFY_WIDTH;
    $commas--
        if $commas && $width % $COMMIFY_WIDTH == 0;
    return $width + $commas;
}

# ........................................................
# The keys of the hash are the exponents of the size
# categories, i.e., the log of the size, 0, 1, 2, ... Sort
# these and take the largest to determine how wide size
# field has to be. Determine the width for the N and sum
# fields. Reuse the 'sum' field width for the average,
# since it can only be less or the same width.
#
# To generate the format string, take an array of the
# %fwidth keys and use map{} to extract the corresponding
# fwidth value, stick them inside the format '%s' and join
# the various pieces using spaces. Add 33% padding to
# widths to allow for commas.
#
sub gen_fmt {
    my ($stats) = @_;

    my @tags = sort { $a <=> $b } keys %{$stats};
    my %fwidth;
    # size is N zeroes plus leading '1'
    $fwidth{size} = 1 + $tags[$LASTFIELD];
    $fwidth{count} = calc_field_width $stats, \@tags, 'N');
    $fwidth{sum}   = calc_field_width $stats, \@tags, 'sum');

    my $fmt = join $SPACE,
              map { '%' . allow_for_commas( $fwidth{$_} ) . 's' }
                  (qw/size count sum sum/);
    $fmt .= "\n";

    return ( $fmt, \@tags );
}

# ........................................................
# commify a number. Traditional solution see 'perldoc
# perlfaq5 q.13'
#
sub commify {
my $text = reverse $_[0];
$text =~ s/(\d\d\d)(?=\d)(?!\d*\.)/$1,/g;
return scalar reverse $text;
}

# ........................................................
#
sub print_line {
    my ( $fmt, $stats, $log ) = @_;

    my $size = '1' . '0' x $log;
    printf $fmt, map { commify($_) } (
        $size,
        $stats->{$log}{N},
        $stats->{$log}{sum},
        int $ROUNDING_OFFSET
            + $stats->{$log}{sum} / $stats->{$log}{N}
    );
}

# ........................................................
# All the field widths need to be dynamic, since largest
# file might be terabytes in size, there could be millions
# of files in some size range, the sum could be huge

# Calculate the format string once, then reuse it with 
# leading spaces for the header, and a leading '< ' for 
# the actual sizes. Everything is ( or can be ) a string in 
# perl. 
#  
sub display_summary {
    my ($stats) = @_; 


    my ( $fmt, $tags ) = gen_fmt($stats); 
    print_header(" $fmt");


    for my $width ( @{$tags} ) { 
        print_line( "< $fmt", $stats, $width ); 
    } 

# -------------------------------------------------------- 
# End of File

Comments

Popular posts from this blog

Perl5, Moxie and Enumurated Data Types

Moxie - a new object system for Perl5 Stevan Little created the Moose multiverse to upgrade the Perl 5 programming language's object-oriented system more in line with the wonderfull world of Perl 6. Unfortunately, it's grown into a bloated giant, which has inspired light-weight alternatives Moos, Moo, Mo, and others. Now he's trying to create a modern, efficient OO system that can become built into the language. I've seen a few of his presentations at YAPC (Yet Another Perl Conference, now known as TPC, The Perl Conference), among them ‎p5 mop final final v5 this is the last one i promise tar gz While the package provides some POD documentation about the main module, Moxie, it doesn't actually explain the enum package, Moxie::Enum. But delving into the tests directory reveals its secrets. Creating an Enum package Ranks { use Moxie::Enum; enum by_ARRAY => qw( unused 2 3 4 5 6 7 8 9 10 J Q K A ); enum by_HASH => { 2 => 2, 3 =...

Creating Perl5 Objects with Moxie

Having in the previous article prepared data types for car suits and card ranks, I can now combine them to provide a playing card class, using Stevan Little's Moxie module (version 0.04, so definitely early days.) The goal is to provide an object-oriented paradigm to the Perl 5 programming language which is more sophisticated, more powerful and less verbose than manually bless() -ing hashes. To achieve that goal it needs to be faster and light-weight compared to Moose. Currently, Moxie.pm and and MOP.pm are add-on modules, but eventually, when they are more complete, when the wrinkles have been ironed out, and when they have gained acceptance and a community of users, they might be merged into the Perl core. One significant feature of Moxie is that it reduces boilerplate code. You don't have to specify warnigns or strict . As well, the features or the perl you are using are enabled, among them say , state , signatures , and post_deref . A Simple Moxie Class packag...

Book review: 390+ Python Interview Questions and Answers

I downloaded a preview portion of 390+ Python MCQs from Anazon, thinking reading through it would help me advance my Python skills beyond what I have learned from Harvard’s online CS50P (Python) course. I’m an experienced program looking to add a new skill to my repertoire, and while the course covered many significant aspects of Python programming, there are many other details to perfect, such as best practices, developing packages, and so on. The book is written by Manish Dnyandeo Salunke, who claims 15 years experience in IT,  but it is not clear who published it. It is obvious no one edited it, or verified the correctness of the questions, answers and explanations. Amazon allowed me to download a sample of (I think) 57 questions. Roughly half of these were wrong, and some of the others struck me as irrelevant. The maximum allowed length for an identifier, apparently, is 79 characters. Anything over 20 characters should be considered unusual, so sufficient to say the limit is se...