The Command Line on Steroids
Let’s dig a little deeper into the command line. Often there are arguments made about the usefulness of the command line interface (CLI) versus a GUI tool for analysis. I would argue that in the case of large sets of regimented data, the CLI can sometimes be faster and more flexible than many GUI tools available today.
As an example, we will look at a set of log files from a single Unix system. We are not going to analyze them for any sort of smoking gun. The point here is to illustrate the ability of the CLI to organize and parse through data by using pipes to string a series of commands together, obtaining the desired output. Follow along with the example, and keep in mind that getting anywhere near proficient with this will require a great deal of reading and practice. The payoff is enormous.
Create a directory called "logs" and download the file logs.tar.gz into that directory:
ftp://ftp.hq.nasa.gov/pub/ig/ccd/linuxintro/logs.tar.gz
As always, have a look at the contents of the archive before haphazardly writing the contents to your drive:
tar tzvf logs.tar.gz
-rw------- root/root 8296 2003-10-29 16:14:49 messages
-rw------- root/root 8302 2003-10-29 16:17:38 messages.1
-rw------- root/root 8293 2003-10-29 16:19:32 messages.2
-rw------- root/root 4694 2003-10-29 16:23:18 messages.3
-rw------- root/root 1215 2003-10-29 16:23:33 messages.4
The archive contains five log files from a Unix system. The messages logs contain entries from a variety of sources, including the kernel and other applications. The numbered files result from log rotation: as the logs fill, they are rotated and eventually deleted. On most Unix systems, the logs are found in /var/log/ or /var/adm/.
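On a live Linux system, the current and rotated logs can be listed with a simple glob (a hedged illustration; exact log names and locations vary by distribution):

ls -l /var/log/messages*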
Untar the file:
tar xzvf logs.tar.gz
Let’s have a look at one log entry:
cat messages | head -1
Nov 17 04:02:14 localhost123 syslogd 1.4.1: restart.
Each line in the log files begins with a date and time stamp. Next comes the hostname, followed by the name of the application that generated the log message. Finally, the actual message is printed.
Let's assume these logs are from a victim system, and we want to analyze them and parse out the useful information. We are not going to worry about what we are actually seeing here; our objective is to understand how to boil the information down to something useful.
First of all, rather than parsing each file individually, let's try to analyze all the logs at one time. They are all in the same format, and essentially they comprise one large log. We can use the cat command to add all the files together and send them to standard output. If we work on that data stream, then we are essentially making one large log out of all five logs. Can you see a potential problem with this?
cat messages* | less
If you look at the output, you will see that the dates ascend and then jump to an earlier date and then start to ascend again. This is because the later log entries are added to the bottom of each file, so as the files are added together, the dates appear to be out of order. What we really want to do is stream each file backwards so that they get added together with the most recent date in each file at the top instead of at the bottom. In this way, when the files are added together they are in order. To accomplish this, we use tac (yes, that's cat backwards).
tac messages* | less
Beautiful. The dates are now in order. We can now work on the stream of log entries as if they were one large (in order) file.
We will introduce a new command, awk, to help us view specific fields from the log entries, in this case the dates. awk is an extremely powerful command. The version most often found on Linux systems is gawk (GNU awk). While we are going to use it as a standalone command, awk is actually a programming language in its own right and can be used to write scripts for organizing data. Our concentration will be centered on awk's "print" function. See man awk for more details.
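To give a sense of awk as a language rather than just a field printer, here is a small sketch of our own (not part of the original exercise) that uses an awk associative array to count how many entries appear for each date; the first two fields of each entry are the month and day, as explained below (note that the output order of awk's for..in loop is not guaranteed):

tac messages* | awk '{count[$1" "$2]++} END {for (d in count) print d, count[d]}' | less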
Sets of repetitive data can often be divided into columns or "fields", depending on the structure of the file. In this case, the fields in the log files are separated by simple white space (awk's default field separator). The date comprises the first two fields (month and day).
tac messages* | awk '{print $1 " " $2}' | less
Feb 8
Feb 8
Feb 8
…
This command will stream all the log files (each one from bottom to top) and send the output to awk, which will print the first field (month - $1), followed by a space (" "), followed by the second field (day - $2). This shows the month and day for every entry. Suppose I just want to see one of each date when an entry was made; I don't need to see repeating dates. I ask to see one of each unique line of output with uniq (uniq collapses adjacent duplicate lines, which works here because the stream is already in date order):
tac messages* | awk '{print $1 " " $2}' | uniq | less
Feb 8
Nov 22
Nov 21
Nov 20
…
This removes repeated dates, and shows me just those dates with log activity. If a particular date is of interest, I can grep the logs for that particular date (note there are two spaces between "Nov" and "4"; one space will not work):
tac messages* | grep "Nov  4"
Nov 4 17:41:27 localhost123 sshd(pam_unix)[27630]: session closed for user root
Nov 4 17:41:27 localhost123 sshd[27630]: Received disconnect from 1xx.183.221.214: 11: Disconnect requested by Windows SSH Client.
…
Of course, we have to keep in mind that this would give us any lines where the string "Nov  4" resided, not just in the date field. To be more explicit, we could say that we only want lines that start with "Nov  4", using the "^":
tac messages | grep "^Nov  4"
Also, if we don't know that there are two spaces between "Nov" and "4", we can tell grep to look for any number of spaces between the two:
tac messages | grep "^Nov[ ]*4"
The above grep expression translates to "lines starting (^) with the string "Nov", followed by zero or more (*) of the preceding character (a space, given by [ ]), followed by a 4". Obviously, this is a complex subject. Knowing how to use regular expressions will give you huge flexibility in sorting through and organizing large sets of data.
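As a hedged aside, GNU grep also supports extended regular expressions with the -E switch, where + means one or more of the preceding character (as opposed to *, which means zero or more). An equivalent command requiring at least one space between "Nov" and "4" would be:

tac messages | grep -E "^Nov +4"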
As we look through the log files, we may come across entries that appear suspect. Perhaps we need to gather all the entries that we see containing the string “Did not receive identification string from <IP>” for further analysis.
tac messages* | grep "identification string" | less
Nov 17 19:26:43 localhost123 sshd[2019]: Did not receive identification string from 200.92.72.129
Nov 18 18:55:06 localhost123 sshd[11204]: Did not receive identification string from 62.66.248.243
…
Now we just want the date (fields 1 and 2), the time (field 3), and the remote IP address that generated the log entry. The IP address is the last field. Rather than count each word in the entry to get to the field number of the IP, we can simply use "$NF". In awk, NF holds the number of fields in the current line, so $NF gives the value of the last field. Since the IP is the last field, its field number is equal to the number of fields:
tac messages* | grep "identification string" | awk '{print $1" "$2" "$3" "$NF}' | less
Nov 17 19:26:43 200.92.72.129
Nov 18 18:55:06 62.66.248.243
Nov 20 14:13:11 200.83.114.131
…
We can add some tabs (“\t”) in place of spaces to make it more readable:
tac messages* | grep "identification string" | awk '{print $1" "$2"\t"$3"\t"$NF}' | less
Nov 17 19:26:43 200.92.72.129
Nov 18 18:55:06 62.66.248.243
Nov 20 14:13:11 200.83.114.131
…
This can all be redirected to an analysis log or text file for easy addition to a report (note that "> report.txt" creates the report file, and ">> report.txt" appends to it):
echo "Localhost123: Log entries from /var/log/messages" > report.txt
echo "\"Did not receive identification string\":" >> report.txt
tac messages* | grep "identification string" | awk '{print $1" "$2"\t"$3"\t"$NF}' >> report.txt
We can also get a sorted (sort) list of the unique (-u) IP addresses involved in the same way:
echo "Unique IP addresses:" >> report.txt
tac messages* | grep "identification string" | awk '{print $NF}' | sort -u >> report.txt
less report.txt
The resulting list of IP addresses can also be fed to a script that does nslookup or whois database queries.
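A minimal sketch of such a script, assuming a whois client is installed (the output filename is just illustrative):

tac messages* | grep "identification string" | awk '{print $NF}' | sort -u |
while read ip; do
    # whois query for each unique address; nslookup or host would work similarly
    echo "=== $ip ==="
    whois "$ip"
done > ip_lookups.txt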
As with all the exercises in this document, we have just sampled the abilities of the Linux command line. It all seems somewhat convoluted to the beginner. After some practice and experience with different sets of data, you will find that you can glance at a file and say "I want that information", and be able to write a quick piped command to get what you want in a readable format in a matter of seconds. As with all language skills, the Linux command line "language" is perishable. Keep a good reference handy and remember that you might have to look up syntax a few times before it becomes second nature.