Skip to main content

Are your files SM? M? XL? the procedural XL solution

I presented this challenge at work as a beat-the-winter-blahs activity.

But then I felt I couldn't present a merely operative solution. As the local Perl geek, I felt required to demonstrate that Perl could do clean, well written solutions, not just line noise. I wanted a balance between advanced Perl features and easy comprehension. On the Unix command line, I prefer piping commands together, rather than using a bunch of intermediate files. Similarly in Perl I advocate using map{} and grep{}, and feeding the output of one function into the input of another. I consider those standard idioms any programmer needs to learn.

On the other hand, there were a couple of situations where exceptions and oddities in the code made me uncomfortable. Pulling things into subroutines such as calc_field_width() simplified things and made the code clearer, I think. I guess the lesson of that is that short subroutines are better. I knew that!

I'm open to suggestions for alternative ways of doing things, if you have better ideas.

The program goes through the file hierarchy, storing data in a data structure , %stats.

0  HASH(0x1009a1000)
   3 => HASH(0x1009d0c90)
      'N' => 32
      'sum' => 16995
   4 => HASH(0x1009d0e40)
      'N' => 17
      'sum' => 57691
   5 => HASH(0x1009d0ea0)
      'N' => 3
      'sum' => 35747


The '3', '4', '5 refer to the number of digits in the file size, i.e., '3' refers to file sizes in the range 100-999. 'N' is the number of files in this size range,'sum' is the sum of all the actual file sizes for the range.

The finale output shows there are 166 zero-byte files ... perhaps they were created but never populated, due to some error, or perhaps they communicated some binary data merely by their presence. Sometimes compilers use zero-length files to indicate the extent of their progress. Files with sizes between 1 and 9 bytes number 9, while 327 are under 100 bytes. This blog is too narrow, so I cut the table short, but I do have four 45GB files which store data comparing the performance ofsudoku-solving algorithms.


            size     N           sum           avg
            ----     -           ---           ---
<              1   166             0             0
<             10    59           411             7
<            100   327        17,840            55
<          1,000 3,800     1,939,738           510
<         10,000 8,231    31,002,459         3,767
<        100,000 3,575   109,346,315        30,586



#!/usr/local/bin/perl5.10
# Determine file counts for size categories 1-9, 10-99
#

use warnings;
use strict;
use feature ':5.10';

use Readonly;
use Getopt::Long;
use Data::Dumper;

# --------------------------------------------------------
# Constants
#

Readonly my $DOT              => q{.};
Readonly my $DOTDOT           => q{..};
Readonly my $ESCAPE_BACKSLASH => q{\\};
Readonly my $LASTFIELD        => -1;
Readonly my $SPACE            => q{ };
Readonly my $TAB              => qq{\t};

Readonly my $COMMIFY_WIDTH   => 3;
Readonly my $ROUNDING_OFFSET => 1.0 / 2;

# -------------------------------------------------------
# Main
#
my %options = ( root => q{~} );

GetOptions( \%options, 'root=s' )
  or die "Error parsing command line arguments.\n";

my $stats = process_sizes( $options{root} );
say Dumper($stats) if exists $ENV{DEBUG};
display_summary($stats);

# -------------------------------------------------------
# Subroutines
#

# Explore directories, accumulating file sizes. Initialize
# dirs list with the one root, add subdirs, processing
# recursively.
#
sub process_sizes {
    my (@dirs) = @_;

    my %stats;
    while (@dirs) {
        process_one_dir( \%stats, \@dirs );
    }
    return \%stats;
}

# ........................................................
# For any directory, use glob() to get a list of all the
# files in the directory.  Ignore '.' & '..'.
#
# Spaces in directory names need to be escaped, otherwise
# glob() will think the string contains multiple 
# candidates to match.
#
sub process_one_dir {
    my ( $stats, $dirs ) = @_;

    my $one_dir = shift @{$dirs};
    say "process_one_dir => $one_dir"
        if $ENV{DEBUG};
    $one_dir =~ s{(\s)}{$ESCAPE_BACKSLASH$1}g;

  ENTRY:
    while ( my $entry = glob "$one_dir/*" ) {
       next ENTRY
          if $entry eq $DOT or $entry eq $DOTDOT;
        process_one_entry( $entry, $dirs, $stats );
    }
}

# ........................................................
# Stick subdirectories onto the @dirs array for recursive
# searching.  File require additional processing based on
# their size.
#
# Note: specify the file name only on the first stats()
# call on a file ( -d or other -X ). Use '_' as the file
# name in additional calls to re-use the data rather than
# making a new call.
#
sub process_one_entry {
    my ( $entry, $dirs, $stats ) = @_;

    if ( -d $entry ) {
        push @{$dirs}, $entry;
    }
    else {
        say "${TAB}record_one_size => $entry"
            if $ENV{DEBUG};
        record_one_size( $stats, -s _ );
    }
}

# ........................................................
# For empty files, update a special '0' category. Use the
# length of the size of non-empty files to determine the
# category ( equivalent to ceiling( log10( size ))), and
# keep track of the number of files and the sum of the
# bytes for the category.
#
sub record_one_size {
    my ( $stats, $size ) = @_;

    my $len = $size == 0 ? 0 : length $size;
    $stats->{$len}{N}++;
    $stats->{$len}{sum} += $size;
}

# ........................................................
#
sub print_header {
    my ($fmt) = @_;

    my @fields = (qw/size N sum avg/);
    printf $fmt, @fields;
    printf $fmt, map { q{-} x length $_ } @fields;
}

# .........................................................
# For each size category, get desired field, either 'N'
# or 'sum'. Sort these and grab the largest value to
# determine how wide the field needs to be.
#
sub calc_field_width {
    my ( $stats, $tags, $field ) = @_;

    my @values = map { $stats->{$_}{$field} } @{$tags};
    my $width = length( ( sort { $a <=> $b }
                        @values )[$LASTFIELD] );
    return $width;
}

# ........................................................
# Field widths need an extra 33% to allow for commafying
# numbers. The calulation will try to put a comma to the
# left of the entire number if leftmost group has three
# digits, so drop one comma in that case. But don't drop
# comma if the width is zero, because that would mmake the
# width negative.
#
sub allow_for_commas {
    my ($width) = @_;

    my $commas = int $width / $COMMIFY_WIDTH;
    $commas--
        if $commas && $width % $COMMIFY_WIDTH == 0;
    return $width + $commas;
}

# ........................................................
# The keys of the hash are the exponents of the size
# categories, i.e., the log of the size, 0, 1, 2, ... Sort
# these and take the largest to determine how wide size
# field has to be. Determine the width for the N and sum
# fields. Reuse the 'sum' field width for the average,
# since it can only be less or the same width.
#
# To generate the format string, take an array of the
# %fwidth keys and use map{} to extract the corresponding
# fwidth value, stick them inside the format '%s' and join
# the various pieces using spaces. Add 33% padding to
# widths to allow for commas.
#
sub gen_fmt {
    my ($stats) = @_;

    my @tags = sort { $a <=> $b } keys %{$stats};
    my %fwidth;
    # size is N zeroes plus leading '1'
    $fwidth{size} = 1 + $tags[$LASTFIELD];
    $fwidth{count} = calc_field_width $stats, \@tags, 'N');
    $fwidth{sum}   = calc_field_width $stats, \@tags, 'sum');

    my $fmt = join $SPACE,
              map { '%' . allow_for_commas( $fwidth{$_} ) . 's' }
                  (qw/size count sum sum/);
    $fmt .= "\n";

    return ( $fmt, \@tags );
}

# ........................................................
# commify a number. Traditional solution see 'perldoc
# perlfaq5 q.13'
#
sub commify {
my $text = reverse $_[0];
$text =~ s/(\d\d\d)(?=\d)(?!\d*\.)/$1,/g;
return scalar reverse $text;
}

# ........................................................
#
sub print_line {
    my ( $fmt, $stats, $log ) = @_;

    my $size = '1' . '0' x $log;
    printf $fmt, map { commify($_) } (
        $size,
        $stats->{$log}{N},
        $stats->{$log}{sum},
        int $ROUNDING_OFFSET
            + $stats->{$log}{sum} / $stats->{$log}{N}
    );
}

# ........................................................
# All the field widths need to be dynamic, since largest
# file might be terabytes in size, there could be millions
# of files in some size range, the sum could be huge

# Calculate the format string once, then reuse it with 
# leading spaces for the header, and a leading '< ' for 
# the actual sizes. Everything is ( or can be ) a string in 
# perl. 
#  
sub display_summary {
    my ($stats) = @_; 


    my ( $fmt, $tags ) = gen_fmt($stats); 
    print_header(" $fmt");


    for my $width ( @{$tags} ) { 
        print_line( "< $fmt", $stats, $width ); 
    } 

# -------------------------------------------------------- 
# End of File

Comments

Popular posts from this blog

Creating Perl5 Objects with Moxie

Having in the previous article prepared data types for car suits and card ranks, I can now combine them to provide a playing card class, using Stevan Little's Moxie module (version 0.04, so definitely early days.) The goal is to provide an object-oriented paradigm to the Perl 5 programming language which is more sophisticated, more powerful and less verbose than manually bless()-ing hashes. To achieve that goal it needs to be faster and light-weight compared to Moose. Currently, Moxie.pm and and MOP.pm are add-on modules, but eventually, when they are more complete, when the wrinkles have been ironed out, and when they have gained acceptance and a community of users, they might be merged into the Perl core.

One significant feature of Moxie is that it reduces boilerplate code. You don't have to specify warnigns or strict. As well, the features or the perl you are using are enabled, among them say, state, signatures, and post_deref.
A Simple Moxie Class package Card { …

Perl5, Moxie and Enumurated Data Types

Moxie - a new object system for Perl5 Stevan Little created the Moose multiverse to upgrade the Perl 5 programming language's object-oriented system more in line with the wonderfull world of Perl 6. Unfortunately, it's grown into a bloated giant, which has inspired light-weight alternatives Moos, Moo, Mo, and others. Now he's trying to create a modern, efficient OO system that can become built into the language.

I've seen a few of his presentations at YAPC (Yet Another Perl Conference, now known as TPC, The Perl Conference), among them ‎p5 mop final final v5 this is the last one i promise tar gz<. So I was delighted to recently see an announcement of the module Moxie, and decided to try implementing a card game.

While the package provides some POD documentation about the main module, Moxie, it doesn't actually explain the enum package, Moxie::Enum. But delving into the tests directory reveals its secrets.
Creating an Enum package Ranks { use Moxie::Enum; …

Adventures in Autovivification

Having recently started a new job, I was exposed to old code with multi-step tests against autovivification in multi-level hashes. You get used to the code you have seen, but in a new environment it‘s irritating and jarring.
Moose does not generally have the problem, first because class structure is pre-declared, because values are accessed using accessor functions rather than directly, and because responsibility is delegated down to attributes, avoiding long chains. On the other hand, Moose has it's own overhead, so hand-rolled objects, and bare structures still have their use.
If you don‘t protect against autovivification, then mis-spelling a key, or referencing keys which haven‘t been instantiated in this instance, causes those keys to instantly come into existence.
#!/usr/bin/perl use warnings; use strict; use Data::Dump 'dump'; use 5.024; my $var = {key1 => {key2 => {key3 => 'a'}}}; say dump $var; if ( $var->{key1}{key2}{key3b}[13]{foob…