Everybody already knows how great the file-handling tools on *NIX operating systems are. I’m constantly amazed by how much you can get done in a single command line. For example, imagine a process that does some data processing on a remote machine and logs all the communication between the client and the server machine in one file. Now, imagine this process running for 30+ hours, producing a 220+ MB log file.

After the process is done, your boss wants some kind of report: how many entries were processed, how many were processed successfully, how many errors there were and which errors they were. Not much of a problem when working on a UNIX machine:

A:    cat out.txt | grep 'COMMAND SUCCESS' | wc -l
B:    cat out.txt | grep 'COMMAND FAILED' | wc -l

… and just to make sure, let’s check if A+B = C:

C:    cat in.txt | wc -l
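
The A+B = C check can also be left to the shell instead of doing it by eye. What follows is only a rough sketch: the temporary directory and the three-line sample files are made up so that it runs on its own, and grep -c is used as a shorthand for grep | wc -l. Point it at the real in.txt and out.txt to use it.

```shell
# Made-up sample data so the sketch is self-contained;
# swap these for the real in.txt and out.txt.
workdir=$(mktemp -d)
printf 'req1\nreq2\nreq3\n' > "$workdir/in.txt"
printf 'COMMAND SUCCESS\nCOMMAND FAILED\nCOMMAND SUCCESS\n' > "$workdir/out.txt"

# grep -c counts matching lines, same as grep ... | wc -l
a=$(grep -c 'COMMAND SUCCESS' "$workdir/out.txt")
b=$(grep -c 'COMMAND FAILED' "$workdir/out.txt")
# The arithmetic expansion strips the padding some wc versions print
c=$(( $(wc -l < "$workdir/in.txt") ))

if [ $((a + b)) -eq "$c" ]; then
    echo "OK: $a + $b = $c"
else
    echo "MISMATCH: $a + $b != $c"
fi
```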

Now, let’s report on errors:

cat out.txt | grep 'ERROR_CODE:' | sort | uniq

returns a list of errors:

ERROR_CODE: 10065
ERROR_CODE: 11245
ERROR_CODE: 19543

and now let’s find out how many of each we got:

cat out.txt | grep 'ERROR_CODE: 10065' | wc -l
cat out.txt | grep 'ERROR_CODE: 11245' | wc -l
cat out.txt | grep 'ERROR_CODE: 19543' | wc -l
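
One grep per error code gets tedious once there are more than a handful of codes. A possible shortcut is to let uniq -c do the counting in one pass. The sketch below runs on an invented five-line log in a temporary directory (both are assumptions for illustration, not the real out.txt):

```shell
# Invented sample log; the real out.txt would go here instead.
workdir=$(mktemp -d)
printf '%s\n' \
    'ERROR_CODE: 11245' \
    'COMMAND SUCCESS' \
    'ERROR_CODE: 10065' \
    'ERROR_CODE: 11245' \
    'ERROR_CODE: 19543' > "$workdir/out.txt"

# sort groups identical lines, uniq -c prefixes each distinct line
# with its count, and sort -rn puts the most frequent error first.
counts=$(grep 'ERROR_CODE:' "$workdir/out.txt" | sort | uniq -c | sort -rn)
echo "$counts"
```

Each output line is a count followed by the error line itself, so the three separate commands above collapse into one.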

Email to the boss, and we’re done. All good? Great. But then another email comes in: “Could you please send me a list of all the IDs that caused error #11245?” Sure, no problem:

cat out.txt | grep -B 7 'ERROR_CODE: 11245' | grep 'REQUEST_ID' | awk '{ print $2; }' | sed 's/REQUEST_ID:\([0-9]*\)/\1/' > ids.txt

Let’s explain this one a bit:

  • the initial request that was sent to the system was logged 7 lines before the ERROR_CODE (therefore the “-B 7”)
  • the line with the request had the following format:
    START REQUEST_ID:XXXXX SOME_OTHER_STUFF
    (therefore the awk part)
  • with sed we just extracted the number from the request_id column
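
To see the steps above in action without a 220 MB log, here is a small sketch on a fabricated file. The log_request helper, the "detail line" filler and the request IDs 111, 222 and 333 are all invented for illustration; only the layout (request line, six lines in between, then the ERROR_CODE line) follows the description above. Note that -B is a grep extension found in GNU and BSD grep, so this assumes one of those.

```shell
workdir=$(mktemp -d)

# Hypothetical helper: writes one request in the layout described above,
# with the START line exactly 7 lines before the ERROR_CODE line.
log_request() {
    echo "START REQUEST_ID:$1 SOME_OTHER_STUFF"
    for i in 1 2 3 4 5 6; do
        echo "detail line $i"
    done
    echo "ERROR_CODE: $2"
}

{
    log_request 111 11245
    log_request 222 10065
    log_request 333 11245
} > "$workdir/out.txt"

# 1. grep -B 7 keeps each matching error line plus the 7 lines before it
# 2. the second grep keeps only the request lines from that context
# 3. awk picks the second column (REQUEST_ID:XXXXX)
# 4. sed strips the REQUEST_ID: prefix, leaving the number
grep -B 7 'ERROR_CODE: 11245' "$workdir/out.txt" \
    | grep 'REQUEST_ID' \
    | awk '{ print $2 }' \
    | sed 's/REQUEST_ID:\([0-9]*\)/\1/' > "$workdir/ids.txt"

cat "$workdir/ids.txt"
```

On this sample, ids.txt ends up with 111 and 333, and request 222 (which failed with a different code) is left out. One caveat: if two errors were logged fewer than 7 lines apart, the context windows would overlap and could pull in a neighbouring request’s ID.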

Can it get any more powerful than this?