Pulling back from the archives: this is a repost of a previous blog post, with minor edits to include a more recent version of the app. This time it's ripped from a guest spot at The Nubby Admin, a fantastic blog from a fellow tech nerd.
I spend a lot of time doing text-based data processing. A lot of time. During an analysis, I often want to do things like look at 'Top Talkers', 'Most Frequent Visitors', or really anything that boils down to a list of unique identifiers sorted by count. As a result, I've translated two actions into a series of pipes (concrete examples follow the list):
- What's the count of events per thingy:
| sort | uniq -c | sort -n
- Who has been doing whatever:
| sort -u
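For instance, against a hypothetical web server log (access.log and the client IP being the first field are assumptions for illustration, not from the original post), those two actions look like:

awk '{print $1}' access.log | sort | uniq -c | sort -n

awk '{print $1}' access.log | sort -u

The first prints hits per IP with the busiest at the bottom; the second prints just the unique IPs.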
This tends to work pretty well in most cases. Today, however, was not one of those cases. While attempting to get a list of unique MACs, I started out with a source (i.e. non-uniqued) 16GB text file with one MAC per line. This is where things got annoying. Muscle memory kicked in, and since this matched Action #2, I ran the following command:
cat macs_all.txt | sort -u > macs_unique.txt
I expected it to take a few minutes, so I went back to the other things I was doing and let it run. I checked back 15 minutes later, and it was still running. Waited 5 more minutes...still running. When the command had been running for 45 minutes, I got fed up and decided that I could do better.

Perl, being my go-to tool, came to the rescue in the form of hashes. I won't go into gritty detail, but a Perl hash is a data structure that consists of a set of key/value pairs. Whenever you assign a value to a key, it adds an entry for that key if it doesn't exist, or updates the value if it does. Since a key can't appear in the same hash more than once, it makes for a pretty good hack for generating a unique list. The full source of the app is included at the end, but the basic data flow (sketched just after this list) is:
- Read a MAC from the input
- If the MAC already exists in the hash, skip to the next; else add it
- At EOF, print all the MACs
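To make that concrete, here's a minimal sketch of the hash trick. This is not the full fast_uniq.pl from the end of the post, just the core idea; it reads MACs from stdin or a file argument:

#!/usr/bin/perl
use strict;
use warnings;

my %macs;
while (my $mac = <>) {
    chomp $mac;
    $macs{$mac} = 1;    # duplicate keys just overwrite, so uniqueness is free
}
print "$_\n" for keys %macs;    # at EOF, dump the unique set

Since Perl returns hash keys in no particular order, the output is unique but unsorted, which is exactly the trade-off described below.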
This worked significantly better for me. The output was not sorted, but that's fine; I didn't need it sorted, only unique. The timing information looked a lot better too.
The times can't really be directly compared, since the output from fast_uniq.pl isn't actually sorted.
Given the pretty substantial difference, I think we can reasonably accept that fast_uniq.pl is better in this use case. After seeing this, I'm tempted to add some functionality so I can stop using both sort and uniq entirely.
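For what it's worth, the same hash makes Action #1 nearly free as well. Here's a sketch of what that added functionality might look like (again, an illustration, not code from the actual script):

my %count;
while (my $mac = <>) {
    chomp $mac;
    $count{$mac}++;    # the hash value doubles as the event count
}
# least frequent first, mimicking sort | uniq -c | sort -n
print "$count{$_} $_\n" for sort { $count{$a} <=> $count{$b} } keys %count;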
I'm interested to hear if anyone else has done something similar, or to have you explain to me how much my code sucks.