One of the most common data analysis tasks I do in Unix is something
like
cat wines | sort | uniq -c | sort -nr
Given an input file with a million bottles of wine in it, this shows
me how many bottles of each type I have. It works for other things
besides wine. In fact, it works for a lot of things, and I've been
doing this for 15 years.
But the first sort is really inefficient. It is only there because uniq collapses adjacent duplicate lines, so the input has to be sorted before uniq can count anything. So for big inputs I use a little Python script, countuniq.py. It does the same thing but more efficiently. Remarkably useful tool.
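For illustration, here is a minimal sketch of how a script like this could work. It is not the actual countuniq.py: it assumes the script reads lines from standard input and tallies them in a hash table (Python's collections.Counter), and the function names count_lines and format_counts are made up for this example.

```python
#!/usr/bin/env python3
"""Sketch of a countuniq-style tool: count identical lines without sorting the input."""
import sys
from collections import Counter


def count_lines(lines):
    """Return a Counter mapping each distinct line (newline stripped) to its count."""
    return Counter(line.rstrip("\n") for line in lines)


def format_counts(counts):
    """Yield 'count line' rows, most frequent first, roughly like `uniq -c | sort -nr`."""
    # most_common() sorts only the distinct lines, not the whole input.
    for line, n in counts.most_common():
        yield f"{n:7d} {line}"


# Only read stdin when something is piped in, so running the file bare
# does not block waiting on the terminal.
if __name__ == "__main__" and not sys.stdin.isatty():
    for row in format_counts(count_lines(sys.stdin)):
        print(row)
```

Saved under a name like countuniq.py, it would be used as cat wines | python3 countuniq.py. The only sorting left is over the distinct lines, which is usually a much smaller set than the input. The tradeoff is memory: the counter holds one entry per distinct line, so this pays off most when the same values repeat a lot. Lines with equal counts may come out in a different order than the sort pipeline gives.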