Name: pkthisto
Version: 0.1.2, October 28th 2001
Author: gj_armitage@yahoo.com

Copyright (c) 2001, Grenville Armitage

1. Summary:

 A packet traffic analysis program specifically designed
 for generating inter-packet arrival time histograms,
 packet length histograms, and packet rate plots for
 UDP/IP traffic. I originally wrote this to assist me in
 analysing online game traffic in and out of QuakeIII Arena
 servers. pkthisto can take input from real time capture
 (where Berkeley packet filter, BPF, is supported by the OS),
 tcpdump tracefiles, or NAI Sniffer tracefiles.

 Although pkthisto was originally developed under MS Visual
 C++ 6.0 in a Win32 environment, it migrated to FreeBSD4.3
 for the real time capture mode and development continued in
 that environment using KDevelop 1.4. Real time capture is
 currently not available under Win32. (See section 6 of
 this README for details on compiling under MS Visual C++.)

 pkthisto is released under the GNU General Public License,
 Version 2, 1991.

 pkthisto compiles 'out of the box' under FreeBSD4.3
 (and Win32 with lesser functionality). Standard C
 libraries are sufficient; there are no additional
 packages to install (modulo the possible need to recompile
 your kernel for BPF support). pkthisto does not come with
 any additional run-time libraries encumbered by other
 licenses.


2. Installation:

 The current development environment for pkthisto is
 FreeBSD4.3 with KDevelop 1.4, an X11-based C/C++
 development tool. This distribution contains tools to
 create an appropriate makefile, with which you
 can generate a running executable. (I have not verified
 whether pkthisto can or cannot be compiled under anything
 other than FreeBSD4.3 or Win32, but I'd be interested in
 hearing experiences.)  The following installation steps
 apply to *nix environments. (See section 6 for compiling
 under Win32.)

 The basic distribution is a gzipped tarfile named
 pkthisto-0.1.2.tar.gz, which creates a subdirectory
 ./pkthisto-0.1.2 when gunzipped/untar'ed. Once the
 tarfile is unpacked, perform the following steps:

 > cd ./pkthisto-0.1.2
 > ./configure
 > cd pkthisto
 > make

 "./configure" will spend a minute or so inspecting
 your system, compiler settings, etc., and generating
 appropriate makefiles. Once this has completed successfully,
 you move into the source subdirectory and run "make"
 to actually compile pkthisto.

 You can then either copy pkthisto to somewhere more
 convenient in your path, or use "make install" to
 automatically copy pkthisto into /usr/local/bin.
 (An alternate installation location can be specified
 during the configuration stage. If you wish to install
 into /<dir>/bin then execute "./configure --prefix=/<dir>/"
 instead of "./configure" before compiling. "make install"
 will then copy pkthisto to /<dir>/bin/pkthisto.)

 Executing "make clean" in ./pkthisto-0.1.2/pkthisto will
 subsequently remove all intermediate object files.

 KDevelop 1.4's pkthisto.kdevprj file is also supplied,
 in case it helps you do further development of pkthisto.

3. Using pkthisto

 Currently pkthisto takes a stream of IP packets and automatically
 identifies each unique UDP/IP flow. For every active flow,
 pkthisto creates histograms of inter-packet arrival times
 and IP packet lengths. These histograms can be dumped to
 disk in stages, or dumped all at once at the end.

 pkthisto can analyse traffic from one of three sources:

  - Real time packet capture through a local Ethernet
    interface if your OS supports the BPF driver
    (e.g. FreeBSD4.3)
  - Raw tracefile generated by tcpdump (any platform)
  - Raw tracefile generated by NAI Sniffer (any platform,
    so long as it is in ".enc" rather than ".cap" format)

 Before starting pkthisto you'll need to create an appropriate
 configuration file. Configuration options are described in
 the file ./pkthisto-0.1.2/conf-demo.txt

 If your config file is named "conf.txt", start pkthisto
 with:

  pkthisto -c conf.txt

  (pkthisto will default to looking for configuration
  file 'conf_ph.txt' if the '-c' option is not supplied.)

 If you require real time packet capture, pkthisto needs to
 run with sufficient privileges to open your BPF/Ethernet
 driver (and optionally, enough privileges to establish
 promiscuous mode operation on the BPF driver, unless the
 "no_promiscuous" configuration file option has been set).

 If you are reading from a pre-existing tracefile, you only
 require enough privileges to read the tracefile and write
 new files to the current directory.

 pkthisto supports a 'checkpoint' facility whereby it dumps
 to disk the current flows and their histograms at regular
 intervals. This is primarily useful during real time capture
 mode - it keeps a record that is likely to survive a crash
 of pkthisto or the machine on which it is running, and reduces
 the amount of memory pkthisto keeps allocated at any given
 time. Checkpoint intervals are set in the configuration file
 with the "checkpoint_interval" option. Checkpoints may also
 be forced during real time capture mode by sending the running
 process a SIGUSR1 signal ("kill -USR1 <pid>").

 When reading from disk, pkthisto concludes at the end of the
 tracefile, and writes a final checkpoint to disk. When doing
 real time capture, pkthisto can be configured to conclude after
 a certain number of packets have been seen (use the "total_pkts"
 configuration file option). During real time capture, pkthisto
 also concludes gracefully (doing a final checkpoint before
 exiting) when it receives a SIGTERM signal ("kill -TERM <pid>").
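
 The two signals described above can also be sent from a
 script. The following is a minimal sketch only, for *nix
 hosts; the names ACTIONS and signal_pkthisto are hypothetical
 helpers and are not part of pkthisto.

```python
import os
import signal

# Actions described in this README and the signals that trigger them.
ACTIONS = {
    "checkpoint": signal.SIGUSR1,  # force an immediate checkpoint
    "stop": signal.SIGTERM,        # final checkpoint, then exit
}


def signal_pkthisto(pid, action):
    """Send the signal for `action` to the process with id `pid`."""
    os.kill(pid, ACTIONS[action])
```

 For example, signal_pkthisto(1234, "checkpoint") is equivalent
 to running "kill -USR1 1234", where 1234 stands in for the
 process id of your running pkthisto.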


4. Output files

 pkthisto generates output across three levels of files.
 The top level summary of pkthisto's activities is collected
 in a file called:

 	gtout0.txt

 The second level of files reflects a summary of each flow.
 Filenames are one of three forms:

 	gtoutxxx.txt, gtoutxxxFSd.txt, or gtoutxxxTSd.txt

 (where "xxx" is a decimal integer value uniquely identifying
 each flow.)

 The latter two forms occur when pkthisto has been informed
 of a specific IP address and UDP port that is a game
 server host. Servers are specified with a "specific_server"
 configuration file option, and multiple servers may be
 tracked.

 Thus, "gtout80FS1.txt" represents the 80th flow seen by
 pkthisto, a flow carrying UDP/IP packets coming *from*
 the 1st server pkthisto knows about. Conversely
 "gtout1035TS1.txt" is the 1035th flow, representing
 UDP/IP packets going *to* the 1st server pkthisto knows
 about.

 The form "gtoutxxx.txt" occurs for flows that do not
 appear to be going to, or coming from, a known server.
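
 For post-analysis scripts it can help to decode these
 filenames mechanically. The sketch below assumes the names
 follow exactly the three forms listed above; the function
 name and the direction labels are my own, not pkthisto's.

```python
import re

# gtout<flow>.txt, gtout<flow>FS<server>.txt or gtout<flow>TS<server>.txt
FLOW_FILE = re.compile(r"^gtout(\d+)(?:(FS|TS)(\d+))?\.txt$")


def parse_flow_filename(name):
    """Return (flow_number, direction, server_number), or None.

    direction is "from_server", "to_server", or None for flows
    that are not associated with a known server.
    """
    m = FLOW_FILE.match(name)
    if m is None:
        return None
    flow = int(m.group(1))
    if m.group(2) is None:
        return (flow, None, None)
    direction = "from_server" if m.group(2) == "FS" else "to_server"
    return (flow, direction, int(m.group(3)))
```

 Note that gtout0.txt also matches this pattern; it is the top
 level summary rather than a flow, so callers should skip it.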

 The third-level files are the actual histograms and
 running statistics themselves. Part of the file name is
 derived from the second level files that summarize each
 flow. The prefix identifies the file's contents:

 	Length histograms start with "LH-"
 	Cumulative Length distributions start with "CLH-"
 	Inter-arrival histograms start with "IH-"
 	Cumulative inter-arrival distributions start with "CIH-"
 	Estimated bit rate plots start with "RATE-"
 	Estimated packet rate plots start with "PPS-"
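
 As a rough sketch, the third level names for a flow can be
 listed by prepending each prefix to the flow's second level
 filename. This assumes the pattern seen in the section 4.5
 example ("LH-gtout-All2Serv-1.txt") holds for every flow, and
 that the CIH prefix carries a trailing hyphen like the others
 (as the heading of section 4.3 suggests). The helper name is
 hypothetical.

```python
# Prefix -> description, as listed in this README.
PREFIXES = {
    "LH-": "length histogram",
    "CLH-": "cumulative length distribution",
    "IH-": "inter-arrival histogram",
    "CIH-": "cumulative inter-arrival distribution",
    "RATE-": "estimated bit rate plot",
    "PPS-": "estimated packet rate plot",
}


def third_level_files(flow_file):
    """Map each expected third level filename to its description."""
    return {prefix + flow_file: desc for prefix, desc in PREFIXES.items()}
```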

 If requested by the user (with the "dump_sizes" option)
 the following files will also be created:

 	Plots of the lowest 5% of packet sizes start with "SIZEL-"
 	Plots of the median of packet sizes start with "SIZEM-"
 	Plots of the upper 95% of packet sizes start with "SIZEU-"
 	Plots of the size ratio start with "SIZER-"
 	
 The size related output files are optional because they
 are easily derived from the information in the LH-*
 and CLH-* files anyway.
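
 To illustrate that derivation, here is a minimal sketch that
 recovers the low 5%, median, and upper 95% packet sizes from
 a length histogram held in memory as a {length: count}
 mapping. The exact percentile rule pkthisto uses is not
 documented here, so the "smallest length whose cumulative
 count reaches the target" rule below is an assumption.

```python
def cumulative_counts(histogram):
    """Turn {length: count} into a sorted list of (length, running_total)."""
    running = 0
    out = []
    for length in sorted(histogram):
        running += histogram[length]
        out.append((length, running))
    return out


def size_at_percent(histogram, percent):
    """Smallest length whose cumulative count reaches `percent` of all packets."""
    cum = cumulative_counts(histogram)
    total = cum[-1][1]
    for length, count in cum:
        # Integer comparison avoids floating point rounding at the boundary.
        if count * 100 >= percent * total:
            return length


def low_median_high(histogram):
    """Return the (5%, 50%, 95%) packet sizes for one histogram."""
    return tuple(size_at_percent(histogram, p) for p in (5, 50, 95))
```

 With a made-up histogram of {40: 10, 60: 70, 100: 15, 300: 5}
 this returns (40, 60, 100).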

 Every histogram covers only a finite number of packets, so
 that during post-analysis we can get a reasonable sense
 of how the histograms (distributions) changed over time.
 By default, up to 2000 packets make up each histogram, but
 this can be modified by the "max_pkts_per_histo"
 configuration option.

 Thus, histogram and cumulative distribution files (which
 are derived from the histograms) will contain
 separate sequences of data representing individual
 histograms in chronological order. Each sequence is
 preceded by a short tag containing an integer histogram
 number. Use this histogram number to find information
 about each histogram within the flow's associated
 gtoutxxx.txt, gtoutxxxFSd.txt, or gtoutxxxTSd.txt
 file.

4.1 Format of gtout0.txt

 The top level file gtout0.txt is updated with a summary
 of pkthisto's ongoing activities each time a checkpoint
 occurs. When pkthisto starts, it logs key operating parameters
 to gtout0.txt, and then begins gathering traffic.

 Checkpoints are marked by the line of the form:

   **Checkpoint NN, [XX,YY] <date&time>

 where NN is an integer representing the number of checkpoints
 so far, XX is the number of unique UDP/IP flows seen so far,
 and <date&time> identifies when the checkpoint occurred,
 in ASCII format (for example, "Wed Sep 19 09:52:12 2001").
 YY represents the number of flows that have been released
 from memory because they never saw enough packets over the
 previous two checkpoint periods to be considered relevant.
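
 A post-processing script might pick these lines out of
 gtout0.txt as sketched below. This assumes the timestamp
 follows the bracketed counts on the same line and uses the
 ASCII format of the example above; the function name and
 dictionary keys are my own. It also accepts the
 "**FinalCheckpoint" variant.

```python
import re
from datetime import datetime

CHECKPOINT = re.compile(
    r"^\*\*(Final)?Checkpoint (\d+), \[(\d+),\s*(\d+)\]\s*(.*)$")


def parse_checkpoint(line):
    """Parse a '**Checkpoint NN, [XX,YY] ...' line, or return None."""
    m = CHECKPOINT.match(line.strip())
    if m is None:
        return None
    stamp = m.group(5).strip()
    # Timestamp format as in "Wed Sep 19 09:52:12 2001" (English locale).
    when = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y") if stamp else None
    return {
        "final": m.group(1) is not None,
        "number": int(m.group(2)),
        "flows_seen": int(m.group(3)),
        "flows_released": int(m.group(4)),
        "when": when,
    }
```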

 Following the "**Checkpoint" line is a sequence of entries
 summarizing each active flow, with the form:

  <filename>: srcaddr:port -> dstaddr:port: NNN elements, YYY min

 where <filename> is the flow's filename (e.g. "gtout890FS1.txt"),
 srcaddr:port and dstaddr:port are the IP address (in w.x.y.z
 notation) and UDP port for the flow's source and destination
 respectively, NNN is the number of packets seen in the flow
 so far, and YYY is a floating point number representing the
 number of minutes the flow has been active since it was first
 detected.
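
 The per-flow summary lines can be decoded in the same way.
 This is a sketch under the assumption that the line starts
 with the flow's filename followed by a colon and that spacing
 matches the form shown above; the addresses and ports in the
 usage note are invented for illustration.

```python
import re

FLOW_LINE = re.compile(
    r"^(\S+):\s*(\d+\.\d+\.\d+\.\d+):(\d+)\s*->\s*"
    r"(\d+\.\d+\.\d+\.\d+):(\d+):\s*(\d+) elements,\s*([\d.]+) min")


def parse_flow_line(line):
    """Parse one flow summary line from gtout0.txt, or return None."""
    m = FLOW_LINE.match(line.strip())
    if m is None:
        return None
    return {
        "file": m.group(1),
        "src": (m.group(2), int(m.group(3))),
        "dst": (m.group(4), int(m.group(5))),
        "packets": int(m.group(6)),
        "minutes": float(m.group(7)),
    }
```

 For example, the line "gtout890FS1.txt: 10.0.0.1:27960 ->
 10.0.0.2:1027: 5321 elements, 12.5 min" yields 5321 packets
 over 12.5 minutes from 10.0.0.1 port 27960.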

 A flow is only mentioned during a checkpoint if it has been
 active during the period of time since the previous checkpoint.

 A checkpoint ends with a line of the form:

   **EndCheckpoint NN [XX dumped, YY stored]

 where NN is the checkpoint number, XX is the number of flows
 dumped, and YY represents the number of 'flows' currently stored
 in memory (idle flows, or flows that haven't yet seen enough
 packets to warrant dumping to disk).

 The final checkpoint differs in that "**Checkpoint" is
 replaced with "**FinalCheckpoint", summary information on
 *all* flows seen during the pkthisto run is dumped, and
 the line "**Checkpointing complete." follows. In
 addition, the summary lines have ", started <date&time>"
 appended to the "YYY min" field.

 Note that when doing real time capture, the <date&time> fields
 represent the local host's internal date and time clock. When
 reading from a stored tracefile, date&time is taken from the
 timestamps embedded in the file.


4.2 Format of gtoutxxx.txt, gtoutxxxFSd.txt, and gtoutxxxTSd.txt

 Each flow has one of these files, which begin with:

  Src: srcaddr:port
  Dst: dstaddr:port
  Start: <date&time>

 The source and destination IP/UDP information defines
 the end points of the flow; the Start field identifies
 the time at which the flow was first seen (to a resolution
 of one second, and accuracy dependent on the system clock).

 The rest of the file is a sequence of summaries for
 each histogram, updated each time a checkpoint occurs.
 Checkpoints begin with the line:

  Checkpoint: NN start <date&time>

 and end with the line

  Checkpoint: NN end

 where NN is an integer number representing the checkpoint (and
 matches the "Checkpoint NN" in gtout0.txt).

 Between the start and end fields is a sequence of summaries
 for each histogram stored during the time since the previous
 checkpoint. They are of the following form (all on one line):

  Histo:HHH: BB - EE min, avg LL bytes, RR kbps, PP pps,
  		 l/m/h XX/YY/ZZ, MMM pkts, QQ err


 Where HHH is a positive integer uniquely identifying the
 histogram, BB and EE are the timestamps of the first and last
 packets making up the histogram (relative to the flow's
 "Start:" time), LL is the average IP packet length during
 this interval, RR is the total kbits of IP packets divided
 by the length of the interval in seconds, PP is the number
 of packets per second, XX/YY/ZZ represent the low 5%,
 median, and upper 95% packet sizes, MMM is the number of
 packets in the interval, and QQ is the number of packets
 that were longer than 800 bytes (considered outside the range
 of the length histograms, since they are almost certainly
 unrelated to real time game play traffic).
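
 The "Histo:" lines can be pulled apart with a regular
 expression. The sketch below assumes the fields appear
 exactly in the order shown above, that BB/EE/LL/RR/PP may be
 written with a decimal point, and that the size and packet
 counts are integers; the dictionary keys are my own naming.

```python
import re

HISTO = re.compile(
    r"Histo:(\d+):\s*([\d.]+)\s*-\s*([\d.]+) min,\s*avg ([\d.]+) bytes,"
    r"\s*([\d.]+) kbps,\s*([\d.]+) pps,\s*l/m/h (\d+)/(\d+)/(\d+),"
    r"\s*(\d+) pkts,\s*(\d+) err")


def parse_histo_line(line):
    """Parse one 'Histo:' summary line into a dict, or return None."""
    m = HISTO.search(line)
    if m is None:
        return None
    begin_min, end_min = float(m.group(2)), float(m.group(3))
    return {
        "histogram": int(m.group(1)),
        "begin_min": begin_min,
        "end_min": end_min,
        "duration_s": (end_min - begin_min) * 60.0,  # BB/EE are in minutes
        "avg_bytes": float(m.group(4)),
        "kbps": float(m.group(5)),
        "pps": float(m.group(6)),
        "low_med_high": (int(m.group(7)), int(m.group(8)), int(m.group(9))),
        "packets": int(m.group(10)),
        "oversize": int(m.group(11)),
    }
```

 The duration_s value is convenient for cross-checking a line:
 packets divided by duration_s should be close to the reported
 pps figure.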

 Note that HHH counts histograms relative to the flow for
 which the histograms apply. It only increments, and may
 sometimes increment by more than one between histograms
 and checkpoints. The reason is that sometimes a flow
 goes idle after a period of being active, and the histogram
 being created at the time the flow went idle may see only
 a handful of packets. During operation, pkthisto reclaims
 memory from histograms that will never be dumped to
 disk. When the flow again becomes sufficiently active,
 histograms will be dumped to disk. However, the internal
 histogram identifier HHH will have incremented a number
 of times, reflecting the number of histograms that were
 begun and then released due to the flow being too idle.

 The minimum number of packets that must be seen to make
 a valid histogram defaults to 100, and can be changed
 with the "min_pkts_in_flow" option.

 The "flow_max_milliseconds" option specifies how many
 milliseconds can elapse between packets of the same flow
 before the flow is considered to have become idle. The
 default is 500 ms.
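
 To make the interaction of these two options concrete, here
 is a simplified sketch of the rule as described in this
 section. It is not pkthisto's actual code: it works on a
 plain list of packet arrival times in milliseconds, and it
 assumes a gap must be strictly greater than
 flow_max_milliseconds to count as idle.

```python
def split_on_idle(timestamps_ms, flow_max_milliseconds=500):
    """Split sorted packet times (ms) into bursts separated by idle gaps."""
    bursts = []
    current = []
    for t in timestamps_ms:
        # A gap longer than the limit ends the current burst.
        if current and t - current[-1] > flow_max_milliseconds:
            bursts.append(current)
            current = []
        current.append(t)
    if current:
        bursts.append(current)
    return bursts


def valid_bursts(timestamps_ms, min_pkts_in_flow=100,
                 flow_max_milliseconds=500):
    """Keep only the bursts with enough packets to be worth reporting."""
    return [b for b in split_on_idle(timestamps_ms, flow_max_milliseconds)
            if len(b) >= min_pkts_in_flow]
```

 For instance, 150 packets spaced 20 ms apart followed two
 seconds later by 3 more packets gives two bursts, of which
 only the first passes the 100 packet minimum.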


4.3 Format of LH-, CLH-, IH-, and CIH- files

 Histograms generated by pkthisto can take one of two
 forms: pure ASCII (as sequences of "X Y" data points that
 can be fed to xgraph or copied directly into spreadsheets)
 or compressed ASCII (lines of encoded text, where the X
 axis is implied and the Y axis data points are compressed
 into two-digit base64 values). The compressed ASCII format
 provides substantial disk space savings during long term,
 real time traffic traces.

 By default pkthisto generates pure ASCII output. However,
 the configuration file option "compressed_output" forces
 pkthisto to generate compressed ASCII output instead.
 The LH-, CLH-, IH-, and CIH- files are affected by this
 choice.

 Details of these two formats can be found in the
 file ./pkthisto/HistoFormats.txt


4.4 Format of remaining RATE- and PPS- files

 These files contain "X Y" pairs where the X value
 is the time axis (represented as an integer number
 of seconds since 1/1/1970) and Y is either the bit
 rate (in kbits per second, measured as the number of
 IP packet bits per time interval) or actual packets per
 second.

 "X Y" pairs are generated per histogram. X represents
 the start time of the histogram.
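
 A minimal reader for these files might look like the sketch
 below. It assumes one whitespace separated "X Y" pair per
 line, silently skips anything that does not parse as such,
 and interprets X as UTC seconds since 1/1/1970; the function
 name is hypothetical.

```python
from datetime import datetime, timezone


def parse_xy(lines):
    """Yield (utc_datetime, value) for each well-formed "X Y" line."""
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blanks and anything that is not an "X Y" pair
        try:
            x, y = int(parts[0]), float(parts[1])
        except ValueError:
            continue  # skip pairs that are not numeric
        yield datetime.fromtimestamp(x, tz=timezone.utc), y
```

 Passing an open RATE- or PPS- file object to parse_xy()
 yields points that can be handed to a plotting tool.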


4.5 Special files relating to server traffic

 When one or more ipaddr:port pairs are specified as probable
 servers, pkthisto also tracks the aggregate flows to and
 from each specified server. These flows are known as
 -All2Serv-NN and -Serv2All-NN, where NN represents which one
 of the specified servers (starting at 1). All the per-flow
 files discussed in sections 4.1 to 4.4 have siblings for
 the aggregate server flows, for example:

   gtout-All2Serv-1.txt,
   gtout-Serv2All-1.txt,
   LH-gtout-All2Serv-1.txt,
   CLH-gtout-All2Serv-1.txt,
   ...etc...

 Flows TO the server are logged as coming from 0.0.0.0:0, while
 flows FROM the server are logged as going to 0.0.0.0:0.

5. What pkthisto ignores

 During real-time capture pkthisto installs BPF filter code
 to ignore Ethernet frames that do not carry UDP/IP packets.
 TCP/IP flows are ignored.

 When reading from existing tracefiles, pkthisto simply assumes
 all Ethernet frames have been pre-filtered to contain only
 UDP/IP packets.

 Any flow that never sees more than 100 (or min_pkts_in_flow)
 packets within 800 (or flow_max_milliseconds) milliseconds will
 be forgotten. It will never be mentioned in gtout0.txt
 and never have histograms dumped during checkpoints. (This
 should filter out transient 'flows' due to game launchers
 such as GameSpy3D semi-regularly probing a server.)


6. Compiling for Win32 platforms

 Although pkthisto was originally developed under MS Visual
 C++ 6.0 in a Win32 environment, it was ported to FreeBSD4.3
 for the real time capture mode, and development continued
 from there.

 Where I discovered differences between Visual C++ 6.0
 in a Win32 environment and KDevelop 1.4/gcc in a FreeBSD4.3
 environment, I've used conditional compilation directives.
 The flag WIN32 should be set for Win32-compatible code, and
 unset for FreeBSD4.3 (or equivalent) environments.

 You will need to specifically link against ws2_32.lib (add
 under Project->Settings->Linker if you're using
 MS Visual C++) for Win32 (to bring in functions such as
 inet_ntoa()).

 I've supplied sample Visual C++ 6.0 project/workspace files
 ./pkthisto-0.1.2/pkthisto.dsp and ./pkthisto-0.1.2/pkthisto.dsw
 If you have Visual C++, you should be able to use WinZip
 (or similar) to unpack/untar the pkthisto distribution,
 then go into the ./pkthisto-0.1.2 folder and double-click
 on pkthisto.dsw to start Visual C++. Tell Visual C++
 to "build" and a Win32 version of pkthisto should be built.
 (At least, it worked for me on a Windows 2000 system.
 No promises it'll work on every Win32 platform, although
 I imagine it should.) The pkthisto executable must be
 run from a console window (or from within Visual C++).

 NOTE: pkthisto does not currently provide support for real
 time capture in a Win32 environment. Perhaps later. Tcpdump
 and NAI Sniffer tracefile analysis are supported.


7. Bugs, things TODO, Conclusions

 No doubt there are bugs, and it would be lovely to use
 a more compressed output format to save diskspace on
 longer real time captures. And naturally, this README
 file is not complete. Future releases of pkthisto may
 extend real time capture support beyond just BPF devices,
 and include Win32 environments.

 The ultimate source of information is, of course, the
 source code. Fortunately it is a relatively small program.
 Enjoy!


gj_armitage@yahoo.com