“My Neckbeard Grew Three Sizes That Day” or How I Beat a GNU tool with Perl

Pulling back from the archives, this is a repost of a previous blog post, with minor edits to include a more recent version of the app. This time it's ripped from a guest spot at The Nubby Admin, a fantastic blog from a fellow tech nerd.

I spend a lot of time doing text-based data processing. A lot of time. During an analysis, I often want to do things like look at 'Top Talkers', 'Most Frequent Visitors', or really anything that boils down to a list of unique identifiers sorted by count. As a result, I've translated two actions into a series of pipes:

  1. What's the count of events per thingy: | sort | uniq -c | sort -n
  2. Who has been doing whatever: | sort -u
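For concreteness, here's what those two actions look like on a toy input. (visitors.txt is a made-up example file, not from the original analysis.)

```shell
# Toy input: one visitor per line, with duplicates.
printf 'alice\nbob\nalice\nalice\ncarol\nbob\n' > visitors.txt

# Action 1: event count per visitor, least frequent first.
sort visitors.txt | uniq -c | sort -n

# Action 2: just the unique visitors.
sort -u visitors.txt
```

Action 1 prints carol, bob, then alice with counts 1, 2, and 3; Action 2 prints the three names once each.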

This tends to work pretty well in most cases. Today, however, was not one of those cases. While attempting to get a list of unique MACs, I started out with a source (i.e. non-uniqued) 16GB text file with one MAC per line. This is where things got annoying. Muscle memory kicked in, and since this matched Action #2, I ran the following command:

cat macs_all.txt | sort -u > macs_unique.txt

I expected it to take a few minutes, so I went back to the other things I was doing and let it run. I checked back 15 minutes later, and it was still running. Waited 5 minutes...still running. When the command had been running for 45 minutes, I got fed up and decided that I could do better. Perl, being my go-to tool, came to the rescue in the form of hashes. I won't go into gritty detail, but a Perl hash is a data structure consisting of a list of key/value pairs. Whenever you assign a value to a key, it adds an entry for the key if it doesn't exist, or updates the value if it does. Since a key cannot appear in the same hash multiple times, it makes for a pretty good hack for generating a unique list. The full source of the app is included at the end, but the basic data flow is:

  • Read MAC from input
  • If the MAC already exists in the hash, skip it; otherwise, add it
  • If EOF print all MACs

This worked significantly better for me. The output was not sorted, but that's fine: I didn't need it sorted, only unique. The timing information looked a lot better, too.

packs@node1:~> time cat macs_all.txt | sort -u > macs_unique.txt

real    181m12.417s
user    176m13.926s
sys     1m42.335s
packs@node1:~> time cat macs_all.txt | ./fast_uniq.pl > macs_fast_uniqed.txt

real    8m9.074s
user    7m28.176s
sys     0m46.271s

The times can't really be compared directly, since the output from fast_uniq.pl isn't actually sorted. Given the pretty substantial difference, though, I think we can reasonably accept that fast_uniq.pl is better for this use case. After seeing this, I'm tempted to add enough functionality that I can stop using both sort and uniq entirely.

I'm interested to hear if anyone else has done something similar, or to have you explain to me how much my code sucks.

#!/usr/bin/perl

use strict;
use warnings;
use Getopt::Long;
use Pod::Usage;

# Setup main variables from script arguments
my ($unique, $sort, $count, $quiet, $DEBUG);

# Grab the options passed on the command line.
GetOptions (
  "unique|u" => \$unique,              # flag
  "sort|s" => \$sort,                  # flag
  "count|c" => \$count,                # flag
  "quiet|q" => \$quiet,                # flag
  "verbose|v" => \$DEBUG,              # flag
  "help|?|h" => sub { pod2usage(1); }, # flag
) or pod2usage("$0: Unrecognized program argument.");

if( !( defined($unique) or defined($sort) or defined($count) ) ) {
  pod2usage("$0:  Required argument missing.");
}

my ( %hash, $key );

if ( $unique ) {
  print "Entered unique loop\n" if $DEBUG;

  while( my $mac = <> ) {
    chomp $mac;
    next unless length $mac;   # skip blank lines

    $hash{$mac}++;             # first sighting creates the key
  }

  # Do all the prints here
  if ( $sort ) {
    print "Entered sorted print loop\n" if $DEBUG;

    for $key ( sort keys %hash ) {
      if ( $count ) {
        print "$hash{$key} \t$key\n";
      }
      else {
        print "$key\n";
      }
    }
  }
  else {
    print "Entered unsorted print loop\n" if $DEBUG;
    for $key (keys %hash) {
      if ( $count ) {
        print "$hash{$key} \t$key\n";
      }
      else {
        print "$key\n";
      }
    }
  }
}

if ( $sort and not $unique ) {
  # Join to an array and sort for print
  print "So you want sorted but uniqued data, eh? Patience my dear, patience.\n"
}

__END__

=head1 NAME

fast_uniq - A reimplementation of GNU sort and uniq that performs faster for deduplication

=head1 SYNOPSIS

fast_uniq.pl [options] [file ...]

 Options:
   -u, --unique     Deduplicate entries in the input
   -s, --sort       Print the output ASCII sorted
   -c, --count      Print the output along with duplication counts
   -v, --verbose    Print debugging information
   -h, --help       Brief help message


=head1 AUTHOR

Scott Pack

=cut