Name: pkthisto
Version: 0.1.2, October 28th 2001
Author: gj_armitage@yahoo.com
Copyright (c) 2001, Grenville Armitage
1. Summary:
A packet traffic analysis program specifically designed
for generating inter-packet arrival time histograms,
packet length histograms, and packet rate plots for
UDP/IP traffic. I originally wrote this to assist me in
analysing online game traffic in and out of QuakeIII Arena
servers. pkthisto can take input from real time capture
(where Berkeley packet filter, BPF, is supported by the OS),
tcpdump tracefiles, or NAI Sniffer tracefiles.
Although pkthisto was originally developed under MS Visual
C++ 6.0 in a Win32 environment, it migrated to FreeBSD4.3
for the real time capture mode and development continued in
that environment using KDevelop 1.4. Real time capture is
currently not available under Win32. (See section 6 of
this README for details on compiling under MS Visual C++.)
pkthisto is released under the GNU General Public License,
Version 2, 1991.
pkthisto compiles 'out of the box' under FreeBSD4.3
(and Win32 with lesser functionality). Standard C
libraries are sufficient; there are no additional
packages to install (modulo the possible need to recompile
your kernel for BPF support). pkthisto does not come with
any additional run-time libraries encumbered by other
licenses.
2. Installation:
The current development environment for pkthisto is
FreeBSD4.3 with KDevelop 1.4, an X11-based C/C++
development tool. This distribution contains tools to
create an appropriate makefile, with which you
can generate a running executable. (I have not verified
whether pkthisto can or cannot be compiled under anything
other than FreeBSD4.3 or Win32, but I'd be interested in
hearing experiences.) The following installation steps
apply to *nix environments. (See section 6 for compiling
under Win32.)
The basic distribution is a gzipped tarfile named
pkthisto-0.1.2.tar.gz, which creates a subdirectory
./pkthisto-0.1.2 when gunzipped/untar'ed. Once the
tarfile is unpacked, perform the following steps:
> cd ./pkthisto-0.1.2
> ./configure
> cd pkthisto
> make
"./configure" will spend a minute or so inspecting
your system, compiler settings, etc and generating
appropriate makefiles. Once this has completed successfully,
you move into the source subdirectory and run "make"
to actually compile pkthisto.
You can then either copy pkthisto to somewhere more
convenient in your path, or use "make install" to
automatically copy pkthisto into /usr/local/bin.
(An alternate installation location can be specified
during the configuration stage. If you wish to install
into //bin then execute "./configure --prefix=//"
instead of "./configure" before compiling. "make install"
will then copy pkthisto to //bin/pkthisto.)
Executing "make clean" in ./pkthisto-0.1.2/pkthisto will
subsequently remove all intermediate object files.
KDevelop 1.4's pkthisto.kdevprj file is also supplied,
in case it helps you do further development of pkthisto.
3. Using pkthisto
Currently pkthisto takes a stream of IP packets and automatically
identifies each unique UDP/IP flow. For every active flow,
pkthisto creates histograms of inter-packet arrival times
and IP packet lengths. These histograms can be dumped to
disk in stages, or dumped all at once at the end.
pkthisto can analyse traffic from one of three sources:
- Real time packet capture through a local Ethernet
interface if your OS supports the BPF driver
(e.g. FreeBSD4.3)
- Raw tracefile generated by tcpdump (any platform)
- Raw tracefile generated by NAI Sniffer (any platform,
so long as it is in ".enc" rather than ".cap" format)
Before starting pkthisto you'll need to create an appropriate
configuration file. Configuration options are described in
the file ./pkthisto-0.1.2/conf-demo.txt
If your config file is named "conf.txt", start pkthisto
with:
pkthisto -c conf.txt
(pkthisto will default to looking for configuration
file 'conf_ph.txt' if the '-c' option is not supplied.)
If you require real time packet capture, pkthisto needs to
run with sufficient privileges to open your BPF/Ethernet
driver (and optionally, enough privileges to establish
promiscuous mode operation on the BPF driver, unless the
"no_promiscuous" configuration file option has been set).
If you are reading from a pre-existing tracefile, you only
require enough privileges to read the tracefile and write
new files to the current directory.
pkthisto supports a 'checkpoint' facility whereby it dumps
to disk the current flows and their histograms at regular
intervals. This is primarily useful during real time capture
mode - it keeps a record that is likely to survive a crash
of pkthisto or the machine on which it is running, and reduces
the amount of memory pkthisto keeps allocated at any given
time. Checkpoint intervals are set in the configuration file
with the "checkpoint_interval" option. Checkpoints may also
be forced during real time capture mode by sending the running
process a SIGUSR1 signal ("kill -USR1 ").
When reading from disk pkthisto concludes at the end of the
tracefile, and writes a final checkpoint to disk. When doing
real time capture, pkthisto can be configured to conclude after
a certain number of packets have been seen (use the "total_pkts"
configuration file option). During real time capture, pkthisto
also concludes gracefully (doing a final checkpoint before
exiting) when it receives a SIGTERM signal ("kill -TERM ").
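The two signals described above can also be sent from a script
rather than from the shell. The following is a minimal Python
sketch, assuming you already know the process ID of the running
pkthisto. The function names are illustrative only and are not
part of pkthisto.

```python
import os
import signal


def force_checkpoint(pid):
    """Ask a running pkthisto to write a checkpoint now (SIGUSR1)."""
    os.kill(pid, signal.SIGUSR1)


def stop_gracefully(pid):
    """Ask a running pkthisto to do a final checkpoint and exit (SIGTERM)."""
    os.kill(pid, signal.SIGTERM)
```

Both calls need the same privileges that "kill" would need for
the target process.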
4. Output files
pkthisto generates output across three levels of files.
The top level summary of pkthisto's activities is collected
in a file called:
gtout0.txt
The second level of files reflects a summary of each flow.
Filenames are one of three forms:
gtoutxxx.txt, gtoutxxxFSd.txt, or gtoutxxxTSd.txt
(where "xxx" is a decimal integer value uniquely identifying
each flow.)
The latter two forms occur when pkthisto has been informed
of a specific IP address and UDP port that is a game
server host. Servers are specified with a "specific_server"
configuration file option, and multiple servers may be
tracked.
Thus, "gtout80FS1.txt" represents the 80th flow seen by
pkthisto, a flow carrying UDP/IP packets coming *from*
the 1st server pkthisto knows about. Conversely
"gtout1035TS1.txt" is the 1035th flow, representing
UDP/IP packets going *to* the 1st server pkthisto knows
about.
The form "gtoutxxx.txt" occurs for flows that do not
appear to be going to, or coming from, a known server.
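When post-processing a directory of output, it can help to
decode these second level file names mechanically. The sketch
below is one possible way to do so in Python. The function name
and the returned tuple layout are assumptions made for
illustration and are not part of pkthisto.

```python
import re

# gtout<flow>.txt, gtout<flow>FS<server>.txt, or gtout<flow>TS<server>.txt
_FLOW_FILE = re.compile(r"^gtout(\d+)(?:(FS|TS)(\d+))?\.txt$")


def parse_flow_filename(name):
    """Return (flow_id, direction, server), or None if name is not a flow file.

    direction is "from" for FS (packets from the server), "to" for TS
    (packets to the server), and None for flows not tied to a known server.
    """
    if name == "gtout0.txt":
        # The top level summary file, not a per-flow file.
        return None
    m = _FLOW_FILE.match(name)
    if m is None:
        return None
    flow_id = int(m.group(1))
    if m.group(2) is None:
        return (flow_id, None, None)
    direction = "from" if m.group(2) == "FS" else "to"
    return (flow_id, direction, int(m.group(3)))
```

For example, "gtout80FS1.txt" decodes to flow 80, direction
"from", server 1.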
The third level of files are the actual histograms and
running statistics themselves. Part of the file name is
derived from the second level files that summarize each
flow. The prefix identifies the file's contents:
Length histograms start with "LH-"
Cumulative Length distributions start with "CLH-"
Inter-arrival histograms start with "IH-"
Cumulative inter-arrival distributions start with "CIH-"
Estimated bit rate plots start with "RATE-"
Estimated packet rate plots start with "PPS-"
If requested by the user (with the "dump_sizes" option)
the following files will also be created:
Plot of lowest 5% of packet sizes start with "SIZEL-"
Plot of median of packet sizes start with "SIZEM-"
Plot of upper 95% of packet sizes start with "SIZEU-"
Plot of size ratio start with "SIZER-"
The size related output files are optional because they
are easily derived from the information in the LH-*
and CLH-* files anyway.
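The prefixes listed above lend themselves to a simple lookup
when sorting through output files. The Python sketch below
assumes that the text after the prefix is the name of the
flow's second level summary file. The table and function names
are illustrative only.

```python
# Prefix -> description, taken from the lists above.
PREFIXES = [
    ("CLH-", "cumulative length distribution"),
    ("CIH-", "cumulative inter-arrival distribution"),
    ("LH-", "length histogram"),
    ("IH-", "inter-arrival histogram"),
    ("RATE-", "estimated bit rate plot"),
    ("PPS-", "estimated packet rate plot"),
    ("SIZEL-", "lowest 5% of packet sizes"),
    ("SIZEM-", "median of packet sizes"),
    ("SIZEU-", "upper 95% of packet sizes"),
    ("SIZER-", "size ratio"),
]


def classify_output_file(name):
    """Return (description, flow_summary_filename), or None if no prefix matches."""
    for prefix, description in PREFIXES:
        if name.startswith(prefix):
            return (description, name[len(prefix):])
    return None
```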
Every histogram covers only a finite number of packets, so
that during post-analysis we can get a reasonable sense
of how the histograms (distributions) changed over time.
By default, up to 2000 packets make up each histogram, but
this can be modified by the "max_pkts_per_histo"
configuration option.
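The idea of building one histogram per bounded run of packets
can be illustrated with a short Python sketch. This is not
pkthisto's actual code: the one-byte bin width and the function
name are assumptions chosen to keep the example small.

```python
from collections import Counter


def length_histograms(packet_lengths, max_pkts_per_histo=2000):
    """Split a flow's packet lengths into consecutive histograms.

    Each histogram counts at most max_pkts_per_histo packets, so the
    returned list shows how the length distribution changes over time.
    """
    histograms = []
    for start in range(0, len(packet_lengths), max_pkts_per_histo):
        chunk = packet_lengths[start:start + max_pkts_per_histo]
        # Counter maps each packet length to the number of times it occurred.
        histograms.append(Counter(chunk))
    return histograms
```

A flow of 4500 packets would therefore yield three histograms
with the default limit: two full ones and a final partial one.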
Thus, histogram and cumulative distribution files (which
are derived from the histograms) will contain
separate sequences of data representing individual
histograms in chronological order. Each sequence is
preceded by a short tag containing an integer histogram
number. Use this histogram number to find information
about each histogram within the flow's associated
gtoutxxx.txt, gtoutxxxFSd.txt, or gtoutxxxTSd.txt
file.
4.1 Format of gtout0.txt
The top level file gtout0.txt is updated with a summary
of pkthisto's ongoing activities each time a checkpoint
occurs. When pkthisto starts, it logs key operating parameters
to gtout0.txt, and then begins gathering traffic.
Checkpoints are marked by the line of the form:
**Checkpoint NN, [XX,YY]
where NN is an integer representing the number of checkpoints
so far, XX is the number of unique UDP/IP flows seen so far,
and YY is the number of flows that have been released
from memory because they never saw enough packets over the
previous two checkpoint periods to be considered relevant.
The line also records when the checkpoint occurred, in
ASCII format (for example, "Wed Sep 19 09:52:12 2001").
Following the "**Checkpoint" line is a sequence of entries
summarizing each active flow, with the form:
: srcaddr:port -> dstaddr:port: NNN elements, YYY min
where the leading field (before the first colon) is the flow's
filename (e.g. "gtout890FS1.txt"),
srcaddr:port and dstaddr:port are the IP address (in w.x.y.z
notation) and UDP port for the flow's source and destination
respectively, NNN is the number of packets seen in the flow
so far, and YYY is a floating point number representing the
number of minutes the flow has been active since it was first
detected.
A flow is only mentioned during a checkpoint if it has been
active during the period of time since the previous checkpoint.
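A per-flow summary line of this form can be picked apart with a
regular expression. The Python sketch below assumes the field
layout described above with flexible whitespace. The function
name and dictionary keys are illustrative, and the sample line
in the usage note is made up.

```python
import re

_FLOW_LINE = re.compile(
    r"^\s*(?P<file>\S+):\s+"
    r"(?P<src>\d+\.\d+\.\d+\.\d+):(?P<sport>\d+)\s*->\s*"
    r"(?P<dst>\d+\.\d+\.\d+\.\d+):(?P<dport>\d+):\s*"
    r"(?P<pkts>\d+)\s+elements,\s*(?P<minutes>[\d.]+)\s+min"
)


def parse_flow_summary(line):
    """Return a dict describing one flow summary line, or None if it does not match."""
    m = _FLOW_LINE.match(line)
    if m is None:
        return None
    return {
        "file": m.group("file"),
        "src": (m.group("src"), int(m.group("sport"))),
        "dst": (m.group("dst"), int(m.group("dport"))),
        "packets": int(m.group("pkts")),
        "minutes": float(m.group("minutes")),
    }
```

For instance, the invented line
"gtout890FS1.txt: 192.168.0.5:27960 -> 10.1.2.3:27961: 4123 elements, 12.5 min"
yields file "gtout890FS1.txt", 4123 packets, and 12.5 minutes.
Any text after "min" is ignored.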
A checkpoint ends with a line of the form:
**EndCheckpoint NN [XX dumped, YY stored]
where NN is the checkpoint number, XX is the number of flows
dumped, and YY represents the number of 'flows' currently stored
in memory (idle flows, or flows that haven't yet seen enough
packets to warrant dumping to disk).
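The closing line of a checkpoint is equally easy to recognize
when scanning gtout0.txt. The following Python sketch assumes
the exact wording shown above. The function name is
illustrative only.

```python
import re

_END_CHECKPOINT = re.compile(
    r"^\*\*EndCheckpoint\s+(\d+)\s*\[(\d+)\s+dumped,\s*(\d+)\s+stored\]"
)


def parse_end_checkpoint(line):
    """Return (checkpoint_number, flows_dumped, flows_stored), or None."""
    m = _END_CHECKPOINT.match(line)
    if m is None:
        return None
    return (int(m.group(1)), int(m.group(2)), int(m.group(3)))
```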
The final checkpoint differs in that "**Checkpoint" is
replaced with "**FinalCheckpoint", summary information on
*all* flows seen during the pkthisto run is dumped, and
followed with the line "**Checkpointing complete." In
addition, the summary lines have ", started " appended
to the "YYY min" field.
Note that when doing real time capture, the date and time fields
represent the local host's internal date and time clock. When
reading from a stored tracefile, date and time are taken from the
timestamps embedded in the file.
4.2 Format of gtoutxxx.txt, gtoutxxxFSd.txt, and gtoutxxxTSd.txt
Each flow has one of these files, which begin with:
Src: srcaddr:port
Dst: dstaddr:port
Start:
The source and destination IP/UDP information defines
the end points of the flow, the start field identifies
the time at which the flow was first seen (to a resolution
of one second, and accuracy dependent on the system clock).
The rest of the file is a sequence of summaries for
each histogram, updated each time a checkpoint occurs.
Checkpoints begin with the line:
Checkpoint: NN start
and end with the line
Checkpoint: NN end
where NN is an integer number representing the checkpoint (and
matches the "Checkpoint NN" in gtout0.txt).
Between the start and end fields are a sequence of summaries
for each histogram stored during the time since the previous
checkpoint. They are of the following form (all on one line):
Histo:HHH: BB - EE min, avg LL bytes, RR kbps, PP pps,
l/m/h XX/YY/ZZ, MMM pkts, QQ err
Where HHH is a positive integer uniquely identifying the
histogram, BB and EE are the timestamps of the first and last
packets making up the histogram (relative to the flow's
"Start:" time), LL is the average IP packet length during
this interval, RR is the total kbits of IP packets divided
by the length of the interval in seconds, PP is the number
of packets per second, XX/YY/ZZ represent the low 5%,
median, and upper 95% packet sizes, MMM is the number of
packets in the interval, and QQ is the number of packets
that were longer than 800 bytes (considered outside the range
of the length histograms, since they are almost certainly
unrelated to real time game play traffic).
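The "Histo:" summary lines can be turned into records for
further analysis. The Python sketch below assumes the field
order described above, with plain decimal numbers and flexible
whitespace. The function name and dictionary keys are
illustrative, and the sample line in the usage note is made up.

```python
import re

_NUM = r"(\d+(?:\.\d+)?)"
_HISTO_LINE = re.compile(
    r"Histo:\s*(\d+):\s*" + _NUM + r"\s*-\s*" + _NUM + r"\s*min,\s*"
    r"avg\s+" + _NUM + r"\s*bytes,\s*" + _NUM + r"\s*kbps,\s*"
    + _NUM + r"\s*pps,\s*"
    r"l/m/h\s+" + _NUM + "/" + _NUM + "/" + _NUM
    + r",\s*(\d+)\s*pkts,\s*(\d+)\s*err"
)


def parse_histo_line(line):
    """Return a dict for one "Histo:" summary line, or None if it does not match."""
    m = _HISTO_LINE.search(line)
    if m is None:
        return None
    g = m.groups()
    return {
        "histo": int(g[0]),          # HHH
        "begin_min": float(g[1]),    # BB
        "end_min": float(g[2]),      # EE
        "avg_bytes": float(g[3]),    # LL
        "kbps": float(g[4]),         # RR
        "pps": float(g[5]),          # PP
        "low": float(g[6]),          # XX, low 5% packet size
        "median": float(g[7]),       # YY
        "high": float(g[8]),         # ZZ, upper 95% packet size
        "pkts": int(g[9]),           # MMM
        "err": int(g[10]),           # QQ
    }
```

For instance, the invented line
"Histo:3: 1.50 - 3.25 min, avg 72.4 bytes, 11.6 kbps, 20.0 pps, l/m/h 60/70/95, 2000 pkts, 4 err"
yields histogram 3 covering minutes 1.5 to 3.25 of the flow.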
Note that HHH counts histograms relative to the flow for
which the histograms apply. It only increments, and may
sometimes increment by more than one between histograms
and checkpoints. The reason is that sometimes a flow
goes idle after a period of being active, and the histogram
being created at the time the flow went idle may see only
a handful of packets. During operation, pkthisto reclaims
memory from histograms that will never be dumped to
disk. When the flow again becomes sufficiently active,
histograms will be dumped to disk. However, the internal
histogram identifier HHH will have incremented a number
of times, reflecting the number of histograms that were
begun and then released due to the flow being too idle.
The minimum number of packets that must be seen to make
a valid histogram defaults to 100, and can be changed
with the "min_pkts_in_flow" option.
The "flow_max_milliseconds" option specifies how many
milliseconds can elapse between packets of the same flow
before the flow is considered to have become idle. The
default is 500 ms.
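The interaction of these two options can be pictured with a
simplified Python sketch. It is an illustration of the idea,
not pkthisto's implementation: it splits a flow's packet
timestamps at idle gaps and keeps only runs with at least the
minimum number of packets (whether the real threshold is "at
least" or "more than" is an assumption here). The function name
is hypothetical.

```python
def split_active_bursts(timestamps_ms, flow_max_milliseconds=500,
                        min_pkts_in_flow=100):
    """Split sorted packet timestamps (in ms) into runs separated by idle gaps.

    A gap longer than flow_max_milliseconds ends the current run. Runs with
    fewer than min_pkts_in_flow packets are discarded.
    """
    bursts = []
    current = []
    for t in timestamps_ms:
        if current and t - current[-1] > flow_max_milliseconds:
            bursts.append(current)
            current = []
        current.append(t)
    if current:
        bursts.append(current)
    return [b for b in bursts if len(b) >= min_pkts_in_flow]
```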
4.3 Format of LH-, CLH-, IH-, and CIH- files
Histograms generated by pkthisto can take one of two
forms: pure ASCII (as sequences of "X Y" data points that
can be fed to xgraph or copied directly into spreadsheets)
or compressed ASCII (lines of encoded text, where the X
axis is implied and the Y axis data points are compressed
into two-digit base64 values). The compressed ASCII format
provides substantial disk space savings during long term,
real time traffic traces.
By default pkthisto generates pure ASCII output. However,
the configuration file option "compressed_output" forces
pkthisto to generate compressed ASCII output instead.
The LH-, CLH-, IH-, and CIH- files are affected by this
choice.
Details of these two formats can be found in the
file ./pkthisto/HistoFormats.txt
4.4 Format of remaining RATE- and PPS- files
These files contain "X Y" pairs where the X value
is the time axis (represented as an integer number
of seconds since 1/1/1970) and Y is either the bit
rate (in kbits per second, measured as the number of
IP packet bits per time interval) or actual packets per
second.
"X Y" pairs are generated per histogram. X represents
the start time of the histogram.
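Since the X values are seconds since 1/1/1970, they convert
directly to calendar time. The Python sketch below reads "X Y"
pairs from a RATE- or PPS- file and returns them with UTC
timestamps. It assumes one pair per line and silently skips
anything else. The function name is illustrative only.

```python
from datetime import datetime, timezone


def read_xy_pairs(lines):
    """Parse "X Y" lines into (UTC datetime, float) tuples.

    Lines that do not consist of an integer followed by a number are skipped.
    """
    points = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue
        try:
            x, y = int(parts[0]), float(parts[1])
        except ValueError:
            continue
        points.append((datetime.fromtimestamp(x, tz=timezone.utc), y))
    return points
```

An open file object can be passed in directly, since iterating
over it yields one line at a time.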
4.5 Special files relating to server traffic
When one or more ipaddr:port pairs are specified as probable
servers, pkthisto also tracks the aggregate flows to and
from each specified server. These flows are known as
-All2Serv-NN and -Serv2All-NN, where NN represents which one
of the specified servers (starting at 1). All the per-flow
files discussed in sections 4.1 to 4.4 have siblings for
the aggregate server flows, for example:
gtout-All2Serv-1.txt,
gtout-Serv2All-1.txt,
LH-gtout-All2Serv-1.txt,
CLH-gtout-All2Serv-1.txt,
...etc...
Flows TO the server are logged as coming from 0.0.0.0:0, while
flows FROM the server are logged as going to 0.0.0.0:0.
5. What pkthisto ignores
During real-time capture pkthisto installs BPF filter code
to ignore Ethernet frames that do not carry UDP/IP packets.
TCP/IP flows are ignored.
When reading from existing tracefiles, pkthisto simply assumes
all Ethernet frames have been pre-filtered to contain only
UDP/IP packets.
Any flow that never sees more than 100 (or min_pkts_in_flow)
packets within 800 (or flow_max_milliseconds) milliseconds will
be forgotten. It will never be mentioned in gtout0.txt
and never have histograms dumped during checkpoints. (This
should filter out transient 'flows' due to game launchers
such as GameSpy3D semi-regularly probing a server.)
6. Compiling for Win32 platforms
Although pkthisto was originally developed under MS Visual
C++ 6.0 in a Win32 environment, it was ported to FreeBSD4.3
for the real time capture mode, and development continued
from there.
Where I discovered differences between Visual C++ 6.0
in a Win32 environment and KDevelop 1.4/gcc in a FreeBSD4.3
environment, I've used conditional compilation directives.
The flag WIN32 should be set for Win32-compatible code, and
unset for FreeBSD4.3 (or equivalent) environments.
You will need to specifically link against ws2_32.lib (add
under Project->Settings->Linker if you're using
MS Visual C++) for Win32 (to bring in inet_ntoa()
functions).
I've supplied sample Visual C++ 6.0 project/workspace files
./pkthisto-0.1.2/pkthisto.dsp and ./pkthisto-0.1.2/pkthisto.dsw
If you have Visual C++, you should be able to use WinZip
(or similar) to unpack/untar the pkthisto distribution,
then go into the ./pkthisto-0.1.2 folder and double-click
on pkthisto.dsw to start Visual C++. Tell Visual C++
to "build" and a Win32 version of pkthisto should be built.
(At least, it worked for me on a Windows 2000 system.
No promises it'll work on every Win32 platform, although
I imagine it should.) The pkthisto executable must be
run from a console window (or from within Visual C++).
NOTE: pkthisto does not currently provide support for real
time capture in a Win32 environment. Perhaps later. Tcpdump
and NAI Sniffer tracefile analysis are supported.
7. Bugs, things TODO, Conclusions
No doubt there are bugs, and it would be lovely to use
a more compressed output format to save diskspace on
longer real time captures. And naturally, this README
file is not complete. Future releases of pkthisto may
extend real time capture support beyond just BPF devices,
and include Win32 environments.
The ultimate source of information is, of course, the
source code. Fortunately it is a relatively small program.
Enjoy!
gj_armitage@yahoo.com
(geocities.com/gj_armitage)