Pulling back from the archives, this is a repost of a previous blog post, with minor edits to include a more recent version of the app. This time it was ripped from a guest spot at The Nubby Admin, a fantastic blog from a fellow tech nerd.
I spend a lot of time doing text-based data processing. A lot of time. During an analysis, I often want to do things like look at 'Top Talkers', 'Most Frequent Visitors', or really anything that amounts to a list of unique identifiers sorted by count. As a result, I've translated two common actions into a series of pipes:
- What's the count of events per thingy:
| sort | uniq -c | sort -n
- Who has been doing whatever:
| sort -u
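To make those concrete, here's a toy run against a hypothetical visitors.txt with one IP per line. The data is made up; the shape of the output is what matters:

$ sort visitors.txt | uniq -c | sort -n
      1 10.0.0.7
      3 10.0.0.12
      9 10.0.0.5
$ sort -u visitors.txt
10.0.0.12
10.0.0.5
10.0.0.7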
This tends to work pretty well in most cases. Today, however, was not one of those cases. While attempting to get a list of unique MACs, I started out with a source (i.e. non-uniqued) 16GB text file with one MAC per line. This is where things got annoying. Muscle memory kicked in, and since this matched Action #2, I ran the following command:
cat macs_all.txt | sort -u > macs_unique.txt
I expected it to take a few minutes, so I went back to the other things I was doing and let it go. I checked back 15 minutes later, and it was still running. Waited 5 minutes... still running. When the command had been running for 45 minutes, I got fed up and decided that I could do better. Perl, being my go-to tool, came to the rescue in the form of hashes. I won't go into gritty detail, but a Perl hash is a data structure that consists of a list of key/value pairs. Whenever you assign a value to a key, it will add an entry for the key if it doesn't exist, or update the value if it does. Since a key cannot appear in the same hash multiple times, it makes for a pretty good hack to generate a unique list. The full source of the app is included at the end, but the basic data flow is:
- Read a MAC from input
- If the MAC exists in the list, skip it; otherwise add it
- At EOF, print all MACs
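The dedup core itself fits in a few lines. Here's a minimal sketch of just that idiom; the full app at the end wraps option handling around it:

#!/usr/bin/perl
use strict;
use warnings;

my %seen;
while ( my $mac = <> ) {
    chomp $mac;
    $seen{$mac}++;   # first sighting creates the key; repeats just bump the count
}
# Hash keys are unique by construction. Note the output order is arbitrary.
print "$_\n" for keys %seen;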
This worked significantly better for me. The output was not sorted, but that's fine; I didn't need it sorted, only unique. The timing information looked a lot better too:
packs@node1:~> time cat macs_all.txt | sort -u > macs_unique.txt

real    181m12.417s
user    176m13.926s
sys     1m42.335s

packs@node1:~> time cat macs_all.txt | ./fast_uniq.pl > macs_fast_uniqed.txt

real    8m9.074s
user    7m28.176s
sys     0m46.271s
The times can't really be directly compared, since the output from fast_uniq.pl isn't actually sorted.
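If you want an apples-to-apples run, the script can sort its own output via the flags described below; something like this should do it, though I haven't timed that variant:

time cat macs_all.txt | ./fast_uniq.pl -u -s > macs_fast_uniqed_sorted.txt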
Given the pretty substantial difference, I think we can reasonably accept that fast_uniq.pl is better for this use case. After seeing this, I'm tempted to add some functionality so I can stop using both sort and uniq entirely.
I'm interested to hear whether anyone else has done something similar, or to have someone explain to me how much my code sucks.
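For reference, once it's saved as fast_uniq.pl, the flags map back onto my two pipeline idioms like this:

./fast_uniq.pl -u -c macs_all.txt | sort -n   # roughly: sort | uniq -c | sort -n
./fast_uniq.pl -u -s macs_all.txt             # roughly: sort -u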
#!/usr/bin/perl -w
use strict;
use Getopt::Long;
use Pod::Usage;

# Setup main variables from script arguments
my ($unique, $sort, $count, $quiet, $DEBUG);

# Grab the options passed on the command line.
GetOptions(
    "unique|u"  => \$unique,                # flag
    "sort|s"    => \$sort,                  # flag
    "count|c"   => \$count,                 # flag
    "quiet|q"   => \$quiet,                 # flag
    "verbose|v" => \$DEBUG,                 # flag
    "help|?|h"  => sub { pod2usage(1); },   # flag
) or pod2usage("$0: Unrecognized program argument.");

# At least one mode of operation must be requested.
if ( !( defined($unique) or defined($sort) or defined($count) ) ) {
    pod2usage("$0: Required argument missing.");
}
my ( %hash, $key );

if ( $unique ) {
    print "Entered unique loop\n" if $DEBUG;

    # Read every MAC into the hash; duplicate keys collapse automatically,
    # and the value keeps a running count for --count mode.
    while ( my $mac = <> ) {
        next unless $mac;    # skip empty input lines
        chomp $mac;
        $hash{$mac}++;
    }
    # Do all the prints here
    if ( $sort ) {
        print "Entered sorted print loop\n" if $DEBUG;
        for $key ( sort keys %hash ) {
            if ( $count ) {
                print "$hash{$key} \t$key\n";
            }
            else {
                print "$key\n";
            }
        }
    }
    else {
        print "Entered unsorted print loop\n" if $DEBUG;
        for $key ( keys %hash ) {
            if ( $count ) {
                print "$hash{$key} \t$key\n";
            }
            else {
                print "$key\n";
            }
        }
    }
}
# Sort-only mode isn't implemented yet.
if ( $sort and not $unique ) {
    # Join to an array and sort for print
    print "So you want sorted but not uniqued data, eh? Patience my dear, patience.\n";
}
__END__

=head1 NAME

fast_uniq - Reimplementation of GNU sort and GNU uniq to perform faster

=head1 SYNOPSIS

fast_uniq.pl [options] [file ...]

 Options:
   -u, --unique    Deduplicate entries in the input
   -s, --sort      Print the output ASCII sorted
   -c, --count     Print the output along with duplication counts
   -v, --verbose   Print debugging information
   -h, --help      Brief help message

=head1 AUTHOR

Scott Pack

=cut