Pulling back from the archives: this is a repost of a previous blog post, with minor edits to include a more recent version of the app. This time it's ripped from a guest spot at The Nubby Admin, a fantastic blog from a fellow tech nerd.
I spend a lot of time doing text-based data processing. A lot of time. During an analysis, I often want to do things like look at 'Top Talkers', 'Most Frequent Visitors', or really anything that boils down to a list of unique identifiers sorted by count. As a result, I've translated two actions into a series of pipes (concrete examples follow the list):
- What's the count of events per thingy:
| sort | uniq -c | sort -n
- Who has been doing whatever:
| sort -u
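For instance, against a hypothetical web server log (access.log and the client IP being the first field are assumptions for illustration, not from the original post), those two actions look like:

awk '{print $1}' access.log | sort | uniq -c | sort -n

awk '{print $1}' access.log | sort -u

The first prints hits per IP with the busiest at the bottom; the second prints just the unique IPs.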
This tends to work pretty well in most cases. Today, however, was not one of those cases. While attempting to get a list of unique MACs, I started out with a source (i.e. non-uniqued) 16GB text file with one MAC per line. This is where things got annoying. Muscle memory kicked in, and since this matched Action #2, I ran the following command:
cat macs_all.txt | sort -u > macs_unique.txt
I expected it to take a few minutes, so I went back to the other things I was doing and let it run. I checked back 15 minutes later, and it was still running. Waited 5 more minutes...still running. When the command had been running for 45 minutes, I got fed up and decided that I could do better.

Perl, being my go-to tool, came to the rescue in the form of hashes. I won't go into gritty detail, but a Perl hash is a data structure that consists of a set of key/value pairs. Whenever you assign a value to a key, it adds an entry for that key if it doesn't exist, or updates the value if it does. Since a key can't appear in the same hash more than once, it makes for a pretty good hack for generating a unique list. The full source of the app is included at the end, but the basic data flow (sketched just after this list) is:
- Read a MAC from the input
- If the MAC already exists in the hash, skip to the next; else add it
- At EOF, print all the MACs
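To make that concrete, here's a minimal sketch of the hash trick. This is not the full fast_uniq.pl from the end of the post, just the core idea; it reads MACs from stdin or a file argument:

#!/usr/bin/perl
use strict;
use warnings;

my %macs;
while (my $mac = <>) {
    chomp $mac;
    $macs{$mac} = 1;    # duplicate keys just overwrite, so uniqueness is free
}
print "$_\n" for keys %macs;    # at EOF, dump the unique set

Since Perl returns hash keys in no particular order, the output is unique but unsorted, which is exactly the trade-off described below.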
This worked significantly better for me. The output was not sorted, but that's fine; I didn't need it sorted, only unique. The timing information looked a lot better too.
The times can't really be directly compared, since the output from fast_uniq.pl isn't actually sorted.
Given the pretty substantial difference, I think we can reasonably accept that fast_uniq.pl is better in this use case. After seeing this, I'm tempted to add some functionality so I can stop using both sort and uniq entirely.
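For what it's worth, the same hash makes Action #1 nearly free as well. Here's a sketch of what that added functionality might look like (again, an illustration, not code from the actual script):

my %count;
while (my $mac = <>) {
    chomp $mac;
    $count{$mac}++;    # the hash value doubles as the event count
}
# least frequent first, mimicking sort | uniq -c | sort -n
print "$count{$_} $_\n" for sort { $count{$a} <=> $count{$b} } keys %count;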
I'm interested to hear if anyone else has done something similar, or to have you explain to me how much my code sucks.