TCP provides a connection-based, reliable byte-stream service to applications. Microsoft networking relies upon the TCP transport for logon, file and print sharing, replication of information between domain controllers, transfer of browse lists, and other common functions. It can only be used for one-to-one communications. TCP uses a checksum on both the headers and data of each segment to reduce the chance of network corruption going undetected.
TCP Receive Window Size Calculation
The TCP receive window size is the amount of receive
data (in bytes) that can be buffered at one time on a connection. The sending
host can send only that amount of data before waiting for an acknowledgment
and window update from the receiving host. The Windows NT 3.5x TCP/IP stack
was designed to tune itself in most environments. Instead of using
a hard-coded default receive window size, TCP adjusts to even increments
of the MSS (maximum segment size) negotiated during connection setup. Matching
the receive window to even increments of the MSS increases the percentage
of full-sized TCP segments utilized during bulk data transmission. The
receive window size defaults in the following manner:
PMTU (Path Maximum Transmission Unit) Discovery
PMTU discovery is described in RFC1191. When a connection is established,
the two hosts involved exchange their TCP maximum segment size (MSS) values.
The smaller of the two MSS values is used for the connection. The MSS for
a system is usually the MTU at the link layer minus 40 bytes for the IP
and TCP headers.
Figure 2: MTU versus MSS
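The MSS arithmetic described above can be sketched as follows. This is a minimal illustration, assuming 20-byte IP and TCP headers with no options; the function names are hypothetical and are not part of any Windows NT interface.

```python
IP_HEADER_BYTES = 20   # IPv4 header without options
TCP_HEADER_BYTES = 20  # TCP header without options


def mss_for_mtu(mtu: int) -> int:
    """MSS a host would announce for a link with the given MTU."""
    return mtu - IP_HEADER_BYTES - TCP_HEADER_BYTES


def connection_mss(local_mtu: int, remote_mtu: int) -> int:
    """The smaller of the two announced MSS values is used."""
    return min(mss_for_mtu(local_mtu), mss_for_mtu(remote_mtu))


# An Ethernet host (MTU 1500) talking to an FDDI host (MTU 4352):
# the Ethernet side limits the connection to 1460-byte segments.
print(connection_mss(1500, 4352))
```

The 1460-byte result matches the segment length that appears in the Ethernet traces later in this document.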
When TCP segments are destined to a non-local network, the "don't fragment"
bit is set in the IP header. Any router or media along the path may have
an MTU that differs from that of the two hosts. If a network segment is encountered
with an MTU that is too small for the IP datagram being routed, the router
will attempt to fragment the datagram accordingly. Upon attempting to do
so, it will find that the "don't fragment" bit in the IP header is set.
At this point, the router should inform the sending host with an ICMP destination
unreachable message that the datagram can't be forwarded further without
fragmentation. Most routers will also specify the MTU that is allowed for
the next hop by putting the value for it in the low-order 16 bits of the
ICMP header field that is labeled "unused" in the ICMP specification. See
RFC1191, section 4, for the format of this message. Upon receiving this
ICMP error message, TCP adjusts its MSS for the connection to the specified
MTU minus the TCP and IP header size, so that any further packets sent
on the connection will be no larger than the maximum size that can traverse
the path without fragmentation. The minimum MTU permitted by RFCs is 68
bytes, and this limit is enforced by Windows NT TCP.
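The MSS adjustment described above might look like the sketch below. It assumes that the 68-byte minimum is applied by clamping the reported MTU and that the MSS only ever shrinks in response to this message; both are assumptions made for illustration, and the function name is hypothetical.

```python
MIN_MTU = 68          # minimum MTU noted in the text
HEADER_OVERHEAD = 40  # 20-byte IP header + 20-byte TCP header, no options


def mss_after_frag_needed(current_mss: int, next_hop_mtu: int) -> int:
    """Return the MSS to use after a 'fragmentation needed' report."""
    # Assumption: a reported MTU below the floor is raised to the floor.
    mtu = max(next_hop_mtu, MIN_MTU)
    new_mss = mtu - HEADER_OVERHEAD
    # Assumption: this event never increases the MSS.
    return min(current_mss, new_mss)


# An FDDI-sized MSS reduced after a router reports an Ethernet next hop.
print(mss_after_frag_needed(4312, 1500))
```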
Some non-compliant routers may silently drop IP datagrams that cannot
be fragmented, or may not correctly report their next-hop MTU. If this
occurs, it may be necessary to make a configuration change to the PMTU
detection algorithm. There are two registry changes that can be made to
the TCP/IP stack in Windows NT 3.5x to work around these problematic routers.
These registry entries are described in more detail in Appendix A:
EnablePMTUBHDetect – Adjusts the PMTU discovery algorithm to attempt
to detect these "black hole" routers. Black Hole detection is disabled
by default.
EnablePMTUDiscovery – Completely enables or disables the PMTU discovery
mechanism. When PMTU discovery is disabled, an MTU of 576 bytes is used
for all non-local destination addresses. PMTU discovery is enabled by default.
The PMTU between two systems can be discovered manually using ping
with the -f (don't fragment) switch as follows:
ping -f -n <number of pings> -l <size> <destination ip address>
As shown in the example below, the size parameter can be varied until
the MTU is found. Note that the size parameter used by ping is the size
of the data buffer to send, not including headers. The ICMP header consumes
8 bytes, and the IP header would normally be 20 bytes. In the Ethernet case
below, the link layer MTU is the maximum-sized ping buffer (1472 bytes) plus
28, or 1500 bytes:
C:\temp>ping -f -n 1 -l 1472 10.57.8.1
Pinging 10.57.8.1 with 1472 bytes of data:
Reply from 10.57.8.1: bytes=1472 time<10ms TTL=30
C:\temp>ping -f -n 1 -l 1473 10.57.8.1
Pinging 10.57.8.1 with 1473 bytes of data:
Packet needs to be fragmented but DF set.
In the example shown above, the router returned an ICMP error message
that ping interpreted for us. If the router had been a "black hole" router,
the ping would simply not be answered once its size exceeded the MTU that
the router could handle. Ping can be used in this manner to detect such
a router.
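Varying the size parameter by hand amounts to a search for the largest buffer that still gets a reply. The sketch below shows that search as a binary search. The probe callback is hypothetical: it stands in for running ping -f -n 1 -l <size> and checking for a reply, and a fake probe is used here so the example runs without a network. The upper bound of 65500 is an assumed search limit.

```python
from typing import Callable

ICMP_HEADER_BYTES = 8
IP_HEADER_BYTES = 20


def find_path_mtu(probe: Callable[[int], bool],
                  low: int = 0, high: int = 65500) -> int:
    """Find the largest ping buffer that passes, then add 28 header bytes.

    Assumes probe(low) succeeds and that success is monotonic in size.
    """
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid        # mid fits; look for something larger
        else:
            high = mid - 1   # mid is too big; look lower
    return low + ICMP_HEADER_BYTES + IP_HEADER_BYTES


def fake_ethernet_probe(size: int) -> bool:
    """Stand-in for a real ping: succeeds up to 1472 data bytes."""
    return size <= 1472


print(find_path_mtu(fake_ethernet_probe))
```

With a "black hole" router in the path, a real probe would time out rather than return an error, so a failed probe alone cannot distinguish the two cases.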
A sample ICMP destination unreachable error message is shown below:
+ FRAME: Base frame properties
+ FDDI: Length = 77
+ LLC: UI DSAP=0xAA SSAP=0xAA C
+ SNAP: ETYPE = 0x0800
+ IP: ID = 0x0; Proto = ICMP; Len: 56
ICMP: Destination Unreachable, Destination: 199.199.40.125
ICMP: Packet Type = Destination Unreachable
ICMP: Unreachable Code = Fragmentation Needed, DF Flag Set
ICMP: CheckSum = 0x8ABF
ICMP: Data: Number of data bytes remaining = 28 (0x001C)
00000: 50 00 60 8C 14 C7 0E 00 00 0C 1A EB C0 AA AA 03
00010: 00 00 00 08 00 45 00 00 38 00 00 00 00 FF 01 D3
00020: 36 C7 C7 2C 01 C7 C7 2C FE 03 04 8A BF 00 00 05
00030: C7 45 00 05 F8 55 24 40 00 1F 01 1B D7 C7 C7 2C
00040: FE C7 C7 28 7D 08 00 00 75 01 00 63 00
Network Monitor did not parse the MTU suggestion in this frame, but
it is visible in the hex portion of the trace as the bytes 05 C7 at offsets
0x2F and 0x30. This error was generated by using ping -f -l 2000 on an
FDDI-based host to send a large datagram through a router to an Ethernet
host. When the router tried to place the large frame onto the Ethernet
segment, it found that fragmentation was not allowed, so it returned the
error message indicating that the largest datagram that could be forwarded
is 0x05C7, or 1479 bytes.
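As a rough check, the eight ICMP header bytes from the frame above (offsets 0x29 through 0x30 in the hex dump) can be decoded by hand. The field layout follows RFC1191, section 4; the variable names are chosen for this illustration only.

```python
import struct

# ICMP header bytes copied from the hex dump above.
icmp_header = bytes.fromhex("03 04 8A BF 00 00 05 C7")

# type (8 bits), code (8 bits), checksum (16 bits),
# unused (16 bits), next-hop MTU (16 bits), all in network byte order.
icmp_type, code, checksum, unused, next_hop_mtu = struct.unpack(
    "!BBHHH", icmp_header
)

print(icmp_type, code, hex(checksum), next_hop_mtu)
```

Type 3 with code 4 is "Destination Unreachable, Fragmentation Needed", and the low-order 16 bits of the formerly unused field carry the 1479-byte MTU suggestion.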
Dead Gateway Detection
Dead gateway detection is used to allow TCP to detect failure of the
default gateway, and to make an adjustment to the IP routing table to use
another default gateway. The Microsoft TCP/IP stack uses the TRIGGERED
RESELECTION method described in RFC816. When TCP has tried to send a packet
through the default gateway one-half of TcpMaxDataRetransmissions times, it
advises IP to switch to the next default gateway in the list and try that
one. Additional default gateways can be configured in
the TCP/IP Advanced Configuration screen in the network control panel.
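The gateway switch can be pictured with the sketch below. It is not the stack's actual logic: it assumes that "one-half" is computed with integer division and that the gateway list wraps around after the last entry, neither of which is stated in the text. The function name is hypothetical.

```python
def next_gateway_index(current: int, gateway_count: int,
                       failed_attempts: int,
                       max_data_retransmissions: int = 5) -> int:
    """Return the index of the default gateway to use for the next attempt."""
    # Assumption: "one-half" uses integer division (5 // 2 == 2).
    threshold = max(1, max_data_retransmissions // 2)
    if gateway_count > 1 and failed_attempts >= threshold:
        # Assumption: the list wraps around after the last gateway.
        return (current + 1) % gateway_count
    return current


# Two configured gateways: the switch happens after the second failure.
print(next_gateway_index(0, 2, failed_attempts=1))
print(next_gateway_index(0, 2, failed_attempts=2))
```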
Re-transmission Behavior
TCP starts a re-transmission timer when each outbound segment is handed
down to IP. If no acknowledgment has been received for the data in a given
segment before the timer expires, then the segment is retransmitted, up
to TcpMaxDataRetransmissions times. The default value for this parameter
is 5.
The re-transmission timer is initialized to 3 seconds when a TCP connection
is established; however, it is adjusted "on the fly" to match the characteristics
of the connection using Smoothed Round Trip Time (SRTT) calculations as
described in RFC793. The timer for a given segment is doubled after each
re-transmission of that segment. Using this algorithm, TCP tunes itself
to the "normal" delay of a connection. TCP connections over high-delay
links will take much longer to time out than those over low-delay links.
The following trace clip shows the re-transmission algorithm for two
hosts connected over Ethernet on the same subnet. An FTP file transfer
was in progress, when the receiving host was disconnected from the network.
Since the SRTT for this connection was very small, the first re-transmission
was sent after about one-half second. The timer was then doubled for each
of the re-transmissions that followed. After the fifth re-transmission,
the timer is once again doubled, and if no acknowledgment is received before
it expires, then the transfer is aborted.
delta  source ip    dest ip      pro  flags    description
0.000  10.57.10.32  10.57.9.138  TCP  .A....,  len: 1460, seq: 8043781, ack: 8153124, win: 8760
0.521  10.57.10.32  10.57.9.138  TCP  .A....,  len: 1460, seq: 8043781, ack: 8153124, win: 8760
1.001  10.57.10.32  10.57.9.138  TCP  .A....,  len: 1460, seq: 8043781, ack: 8153124, win: 8760
2.003  10.57.10.32  10.57.9.138  TCP  .A....,  len: 1460, seq: 8043781, ack: 8153124, win: 8760
4.007  10.57.10.32  10.57.9.138  TCP  .A....,  len: 1460, seq: 8043781, ack: 8153124, win: 8760
8.130  10.57.10.32  10.57.9.138  TCP  .A....,  len: 1460, seq: 8043781, ack: 8153124, win: 8760
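The doubling pattern in the trace can be reproduced with a short calculation. This sketch assumes an idealized first timeout of 0.5 seconds (the trace shows about 0.521) and the default of 5 retransmissions; the function name is hypothetical.

```python
def retransmission_intervals(first_timeout: float,
                             max_retransmissions: int = 5):
    """Return (waits, final_wait).

    waits[i] is the timer value that expires just before retransmission
    i + 1; final_wait is the last doubled timer, after which the
    connection is aborted if no acknowledgment arrives.
    """
    waits = []
    timeout = first_timeout
    for _ in range(max_retransmissions):
        waits.append(timeout)
        timeout *= 2  # the timer doubles after each retransmission
    return waits, timeout


waits, final_wait = retransmission_intervals(0.5)
print(waits, final_wait, sum(waits) + final_wait)
```

Under these assumptions the waits are 0.5, 1, 2, 4, and 8 seconds, followed by a final 16-second wait, so the transfer is aborted roughly 31.5 seconds after the original send.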
TCP Keepalive Messages
A TCP keepalive packet is simply an "ack" with the sequence number
set to one less than the current sequence number for the connection. A
system receiving one of these acks should respond with an ack for the current
sequence number. Keepalives can be used to verify that the computer at
the remote end of a connection is still available. TCP keepalives can be
sent once every KeepAliveTime (defaults to 7,200,000 milliseconds or two
hours), if no other data or higher level keepalives have been carried over
the TCP connection. If there is no response to a keepalive, it is repeated
once every KeepAliveInterval, which defaults to 1 second.
NetBT connections, such as those used by many Microsoft networking components,
send NetBIOS keepalives more frequently, so normally no TCP keepalives
will be sent on a NetBIOS connection. TCP keepalives are disabled by default,
but Windows Sockets applications may enable them using setsockopt().
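Enabling keepalives from an application might look like the following. The example uses Python's socket module, which passes the option through to the platform's sockets implementation; it is a sketch of the setsockopt() call only, and the socket is never connected.

```python
import socket

# Create a TCP socket and request keepalives on it.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Read the option back; a non-zero value means keepalives are enabled.
keepalive_enabled = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
print(keepalive_enabled)

sock.close()
```

The timing of the keepalives is still governed by the system-wide KeepAliveTime and KeepAliveInterval values described above.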
Slow Start Algorithm and Congestion Avoidance
When a connection is established, TCP treads lightly at first in order
to assess the bandwidth of the connection and to avoid overflowing the
receiving host or any other devices/links in the path. The send window
is set to one TCP segment, and if that is acknowledged, then it is doubled
to two segments. If those are acknowledged, then it is doubled again and
so on until the amount of data being sent per burst reaches the size of
the receive window on the remote host. At that point, the slow start algorithm
is no longer in use and flow control is governed by the receive window.
However, at any time during transmission, congestion could still occur
on a connection. If this happens (evidenced by the need to re-transmit),
a congestion avoidance algorithm is used to reduce the send window size
temporarily, and to grow it back towards the receive window size more slowly.
Slow start and congestion avoidance are discussed further in RFC1122.
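The growth of the send window during slow start, as described above, can be sketched as follows. This models only the simplified per-burst doubling from the text, not the full algorithm, and the function name is hypothetical.

```python
def slow_start_bursts(mss: int, receive_window: int):
    """Return the send window, in bytes, used for each successive burst."""
    sizes = []
    window = mss  # start with a single segment
    while window < receive_window:
        sizes.append(window)
        window *= 2  # doubled each time the previous burst is acknowledged
    sizes.append(receive_window)  # from here on the receive window governs
    return sizes


# Ethernet-sized segments and the 8760-byte window seen in the traces.
print(slow_start_bursts(1460, 8760))
```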
Silly Window Syndrome (SWS)
Silly Window Syndrome is described in RFC1122 as follows:
In brief, SWS is caused by the receiver advancing the right window edge
whenever it has any new buffer space available to receive data and by the
sender using any incremental window, no matter how small, to send more
data [TCP:5]. The result can be a stable pattern of sending tiny data segments,
even though both sender and receiver have a large total buffer space for
the connection...
Windows NT TCP/IP implements SWS avoidance per RFC1122 by
not sending more data until there is a sufficient window size advertised
by the receiving end to send a full segment. It also implements SWS avoidance
on the receive end of a connection by not opening the receive window in
increments of less than a TCP segment.
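The two rules above can be written as a pair of small checks. This is a simplification that follows only the wording of the preceding paragraph, not the complete RFC1122 rules, and the function names are hypothetical.

```python
def sender_may_send(advertised_window: int, mss: int) -> bool:
    """Sender side: wait until a full segment fits in the advertised window."""
    return advertised_window >= mss


def window_to_advertise(currently_advertised: int,
                        free_buffer: int, mss: int) -> int:
    """Receiver side: open the window only by at least one full segment."""
    if free_buffer - currently_advertised >= mss:
        return free_buffer
    return currently_advertised


print(sender_may_send(1000, 1460))
print(window_to_advertise(0, 500, 1460))
print(window_to_advertise(0, 2920, 1460))
```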
Nagle Algorithm
Windows NT TCP/IP implements the Nagle algorithm described in RFC896.
The purpose of this algorithm is to reduce the number of "tiny" segments
sent, especially on high-delay (remote) links. The Nagle algorithm allows
only one small segment to be outstanding at a time without acknowledgment.
If more small segments are generated while awaiting the ack for the first
one, then these segments are coalesced into one larger segment. Any full-sized
segment is always transmitted immediately, assuming there is a sufficient
receive window available. The Nagle algorithm is effective in reducing
the number of packets sent by interactive applications, such as telnet,
especially over slow links.
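The coalescing behavior can be illustrated with a small simulation. It assumes one-byte writes, a fixed delay between sending a segment and receiving its ack, and segments that never reach full size; the function name and the timings are hypothetical.

```python
def nagle_segments(write_times, ack_delay):
    """Group one-byte writes into segments under the Nagle rule.

    write_times: increasing times at which the application writes one byte.
    ack_delay: time from sending a segment to receiving its ack.
    Returns a list of (send_time, bytes_in_segment).
    """
    segments = []
    pending = 0      # bytes held back while a small segment is outstanding
    ack_due = None   # arrival time of the outstanding ack, or None if idle
    for t in write_times:
        # Process acks that arrived before this write.
        while ack_due is not None and ack_due <= t:
            if pending:
                segments.append((ack_due, pending))  # flush coalesced bytes
                ack_due += ack_delay
                pending = 0
            else:
                ack_due = None
        if ack_due is None:
            segments.append((t, 1))  # nothing outstanding: send at once
            ack_due = t + ack_delay
        else:
            pending += 1             # hold the byte until the ack arrives
    if pending:
        segments.append((ack_due, pending))
    return segments


# Ten keystrokes, one per time unit, with acks taking 3.5 time units.
print(nagle_segments(range(10), 3.5))
```

In this run the ten keystrokes leave the host in four segments rather than ten.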
The Nagle algorithm can be observed in the following trace captured
by Microsoft Network Monitor. The trace was captured by using PPP to dial
up an Internet provider at 9600 BPS. A Telnet (character mode) session
was established, then the "y" key was held down on the Windows NT Workstation.
At all times, one segment was sent, and further "y" characters were held
by the stack until an acknowledgment was received for the previous segment.
In this example, 3 to 4 "y" characters were saved up each time and sent
together in one segment. The Nagle algorithm resulted in a large savings
in the number of packets sent: the count was reduced by a factor of about three.
Time  Source IP     Dest IP       Prot   Description
0.644 204.182.66.83 199.181.164.4 TELNET To Server From Port = 1901
0.144 199.181.164.4 204.182.66.83 TELNET To Client With Port = 1901
0.000 204.182.66.83 199.181.164.4 TELNET To Server From Port = 1901
0.145 199.181.164.4 204.182.66.83 TELNET To Client With Port = 1901
0.000 204.182.66.83 199.181.164.4 TELNET To Server From Port = 1901
0.144 199.181.164.4 204.182.66.83 TELNET To Client With Port = 1901
...
Each segment contained several of the "y" characters. The first segment
is shown more fully parsed below, and the data portion is pointed out in
the hex at the bottom.
***********************************************************************
Time  Source IP     Dest IP       Prot   Description
0.644 204.182.66.83 199.181.164.4 TELNET To Server From Port = 1901
+ FRAME: Base frame properties
+ ETHERNET: ETYPE = 0x0800 : Protocol = IP: DOD Internet Protocol
+ IP: ID = 0xEA83; Proto = TCP; Len: 43
+ TCP: .AP..., len: 3, seq:1032660278, ack: 353339017, win: 7766, src: 1901 dst: 23 (TELNET)
TELNET: To Server From Port = 1901
TELNET: Telnet Data
D2 41 53 48 00 00 52 41 53 48 00 00 08 00 45 00 .ASH..RASH....E.
00 2B EA 83 40 00 20 06 F5 85 CC B6 42 53 C7 B5 .+..@. .....BS..
A4 04 07 6D 00 17 3D 8D 25 36 15 0F 86 89 50 18 ...m..=.%6....P.
1E 56 1E 56 00 00 79 79 79                      .V.V..yyy
                                                      ^^^
                                                      data
Windows Sockets applications can disable the Nagle algorithm for their
connection(s) by setting the TCP_NODELAY socket option. However,
this practice should be avoided unless absolutely necessary as it increases
network utilization. Some network applications may not perform well
if their design does not take into account the effects of transmitting
large numbers of small packets and the Nagle algorithm.
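For completeness, setting the option looks like the sketch below. It uses Python's socket module to show the setsockopt() call on an unconnected socket; it is an illustration, not a recommendation to disable the algorithm.

```python
import socket

# Create a TCP socket and disable the Nagle algorithm on it.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Read the option back; a non-zero value means Nagle is disabled.
nagle_disabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
print(nagle_disabled)

sock.close()
```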
Throughput Considerations
TCP was designed to provide optimum performance over varying link conditions.
Actual throughput for a link is dependent on a number of variables, but
the most important factors are:
How to Troubleshoot Basic TCP/IP Problems in Windows NT 4.0
The Transport Driver Interface (TDI) was developed to provide greater flexibility and functionality than is provided by existing interfaces such as NetBIOS and Windows Sockets. The TDI is exposed by all Windows NT transport providers. The TDI specification describes the set of primitive functions by which transport drivers and TDI clients communicate, and the call mechanisms used for accessing them. Currently, the TDI is kernel-mode only.
The Windows NT redirector and server both use TDI
directly, rather than going through the NetBIOS mapping layer. By doing
so, they are not subject to many of the restrictions imposed by NetBIOS,
such as the 254 session limit.
The NT Executive is in charge of creating system worker threads, and
it initializes them in the routine ExpWorkerInitialization(), which uses
both the system size and the product type to determine how many threads
to create. An NT system has three types of worker threads, each aimed at
different priorities of work:
However, by changing Registry settings under the key \hkey_local_machine\system\currentcontrolset\control\session manager\executive, an administrator can direct a workstation to have just as many, or even more, worker threads than a default server configuration. The AdditionalCriticalWorkerThreads value under this key controls the number of extra critical worker threads that are created, and you can set it to a value up to 16. Similarly, AdditionalDelayedWorkerThreads controls the number of extra delayed worker threads created, and it can also be a value up to 16.
Once worker threads are started, they sleep until an item that they
need to process is placed on a work queue. The form of sleep that Server
worker threads perform differs from that of Workstation worker threads:
Server threads sleep with their stacks locked into memory, whereas Workstation
worker threads can have their stacks paged to disk. This optimization means
that Server worker threads are generally more responsive when work arises,
because they never have a delay reading their stacks in from the disk.
However, Server threads always contribute to the in-memory footprint of
the operating system.
TRACERT: IP system diagnostic command
The TRACERT command reports each router or gateway crossed by a TCP/IP packet on its way to another host.
Tracert works by sending ICMP echo requests to an IP address, starting with a TimeToLive (TTL) value of one in the IP header and incrementing it by one on each request, and analyzing the ICMP errors that get returned. Each succeeding echo request should get one hop further into the network before the TTL field reaches 0 and an ICMP Time Exceeded error is returned by the router attempting to forward it. Tracert simply prints out an ordered list of the routers in the path that returned these error messages. If the -d (don't do a DNS lookup on each IP address) switch is used, then the IP address of the near-side interface of the routers is reported.
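The TTL mechanism can be pictured with a small simulation over a known path. No packets are sent; the path, the addresses (taken from documentation address ranges), and the function name are all hypothetical.

```python
def trace(path, max_hops=30):
    """Simulate tracert over a known path.

    path lists the near-side router addresses in order, ending with the
    destination host itself.
    """
    hops = []
    for ttl in range(1, min(max_hops, len(path)) + 1):
        # A request sent with TTL = n expires at the n-th router on the
        # path, unless the n-th device is the destination itself.
        responder = path[ttl - 1]
        reached = ttl == len(path)
        reply = "Echo Reply" if reached else "Time Exceeded"
        hops.append((ttl, responder, reply))
    return hops


for hop in trace(["192.0.2.1", "198.51.100.1", "203.0.113.9"]):
    print(hop)
```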
The file system runtime performs an interesting optimization that Microsoft has introduced with NT 4.0: To preserve long file names in the face of legacy 16-bit applications that would otherwise destroy them, the NT 4.0 file system supports the notion of long file name tunneling. Tunneling is necessary when a 16-bit application, such as a word processor, maintains the current version of a document in a temporary file. When the user saves the document, the program may delete the original and rename the temporary to the original file's name.
In the absence of tunneling, the renaming of the temporary file replaces the original long filename with the short-name form. When tunneling is in effect, the file system typically remembers delete operations for 15 seconds, and if a new short filename file is created with the name of a file that has recently been deleted, the file is automatically assigned the long name of the recently deleted file. On a server, the default number of remembered delete operations is 1024, but on a workstation, the number is only 256.
You can explain this difference if you assume that servers are likely to serve file systems to large numbers of clients that will probably have much more activity over short periods than a workstation file system will have. You can override the default number in the Registry value \hkey_local_machine\system\currentcontrolset\control\file system\maximumtunnelentries. Also, under the same key, by setting the value MaximumTunnelEntryAgeInSeconds, you can tune the time-based window of recall for delete operations.
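The tunneling behavior can be modeled as a small, bounded cache of recent deletes. This is a conceptual sketch only: the class, its methods, and the file names are hypothetical, it keys entries on the short name alone, and it says nothing about how the NT file system actually stores this information. The defaults mirror the workstation values quoted above.

```python
from collections import OrderedDict


class TunnelCache:
    """Remembers recent deletes so a re-created short name regains its long name."""

    def __init__(self, max_entries=256, max_age_seconds=15):
        self.max_entries = max_entries
        self.max_age_seconds = max_age_seconds
        self._deleted = OrderedDict()  # short name -> (long name, delete time)

    def remember_delete(self, short_name, long_name, now):
        self._deleted.pop(short_name, None)
        self._deleted[short_name] = (long_name, now)
        while len(self._deleted) > self.max_entries:
            self._deleted.popitem(last=False)  # drop the oldest entry

    def name_for_create(self, short_name, now):
        """Return the name a newly created file should receive."""
        long_name, deleted_at = self._deleted.pop(short_name, (None, None))
        if long_name is not None and now - deleted_at <= self.max_age_seconds:
            return long_name
        return short_name


cache = TunnelCache()
cache.remember_delete("QUARTE~1.DOC", "Quarterly Report.doc", now=0)
print(cache.name_for_create("QUARTE~1.DOC", now=5))
```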