mtu
Geoff Huston
gih at apnic.net
Tue Feb 3 03:11:59 CET 2009
These are interesting comments, Bernhard, and on the whole I agree with
them.
A few comments in response:
1 - On large packet sizes. It seems anomalous that the IEEE has been
unable to reach the level of consensus that would allow
standardization of packet frames of size > 1500 octets. In a world
where the LAN carriage rate has advanced from 10Mbps to 10Gbps, a
comparable packet size would be 1.5M octets, yet 1500 is as large as
the standards world has got. Part of the problem is that there are
competing pressures here: carriage efficiency in terms of multiplexing
multiple independent streams tends to favour a smaller packet size for
applications with constant time/data requirements (voice, interactive
access, even video), while larger packet sizes tend to work to the
advantage of large-volume, non-time-based transfers (which, although
fewer in number in terms of stream counts, are still larger in terms
of byte volumes).
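Just to make that scaling point concrete, here is the back-of-the-envelope
arithmetic as a small Python sketch (illustrative only - the chosen rates
and the linear scaling rule are simply my assumptions for the comparison):

# Serialization time of a 1500-octet frame at various line rates, and
# what a rate-proportional MTU would look like. Illustrative numbers only.
MTU = 1500  # octets, unchanged since 10Mbps Ethernet

for name, bps in [("10M", 10e6), ("100M", 100e6), ("1G", 1e9), ("10G", 10e9)]:
    serialization_us = MTU * 8 / bps * 1e6
    scaled_mtu = int(MTU * bps / 10e6)   # scale the MTU with the line rate
    print(f"{name:>4}: 1500-octet frame = {serialization_us:8.2f} us on the wire; "
          f"rate-proportional MTU = {scaled_mtu:>9,} octets")

At 10Gbps the frame is on the wire for a little over a microsecond, and the
rate-proportional MTU comes out at the 1.5M octets mentioned above.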
2 - on the interaction between packet sizes and TCP transport. The
basic mathematical model is
BW = MSS / ( RTT*sqrt(1.33*p) + RTO*p*(1+32*p^2)*min(1, 3*sqrt(0.75*p)) )

(I hope that asciified ok)
or
BW = C * (MSS/RTT) * (1/sqrt(p))
where p is the packet loss rate.
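For anyone who wants to play with the model, here is my transcription of
both forms as a small Python sketch (the RTO of 1 second and the constant
C of 1.22 are just placeholder values of mine, and b=2 is the delayed-ACK
factor that gives the 1.33 and 0.75 constants above):

from math import sqrt

def bw_full(mss, rtt, p, rto=1.0, b=2):
    # Full model: throughput in bytes/sec. mss in bytes, rtt/rto in
    # seconds, p = packet loss rate, b = packets covered per ACK.
    return mss / (rtt * sqrt(2 * b * p / 3)
                  + rto * min(1.0, 3 * sqrt(3 * b * p / 8)) * p * (1 + 32 * p ** 2))

def bw_simplified(mss, rtt, p, c=1.22):
    # Simplified form: BW = C * (MSS/RTT) * (1/sqrt(p))
    return c * (mss / rtt) * (1 / sqrt(p))

# e.g. 100ms RTT, 0.1% packet loss, 1440-byte MSS
print(bw_full(1440, 0.1, 0.001), bw_simplified(1440, 0.1, 0.001))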
This suggests that throughput is directly proportional to MSS, but
that's misleading in some ways - the issue is the packet loss rate, p.
If p is itself proportional to the packet size, then the expectation
that larger packet sizes directly produce better performance does not
exactly hold. Another way to model TCP congestion performance is to
take the "sawtooth" pattern of TCP and note that the area under the
sawtooth is related to the available flow capacity of the connection,
not the packet quantization level (a smaller packet size produces a
higher-frequency oscillation, but not necessarily a greatly reduced
mean value around which TCP oscillates). i.e. smaller packet sizes are
not necessarily going to mean greatly reduced network throughput rates
in many practical cases.
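A rough numerical illustration of that point, under the (simplifying)
assumption that p grows strictly linearly with packet size - in which case
the simplified model says the throughput gain goes only with the square
root of the MSS ratio:

from math import sqrt

base_mss, base_p = 1440, 0.001   # arbitrary reference point, my numbers

for mss in (1440, 2880, 8960):
    p = base_p * mss / base_mss                   # assume p linear in MSS
    gain = (mss / sqrt(p)) / (base_mss / sqrt(base_p))
    print(f"MSS {mss:5d}: {mss/base_mss:.2f}x the packet size, "
          f"{gain:.2f}x the modelled throughput")

Doubling the packet size buys roughly a 41% gain in the modelled
throughput under that assumption, not a 100% one.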
3 - the problem is close to "your own network". The example I wrote up
for the article was interesting in that the problem was close neither
to me as the client nor to the server - the problem was in the transit
path. Now I can't generalize about this because my sources of data are
limited. Intuition tends to say that per-packet filters are more
likely to be at the edge than in the interprovider core, so close to
the edge seems like a good guess, but the case that I looked at in
detail tended to suggest otherwise with IPv6 - maybe it's because of
the strange behaviours of a network that still contains a reasonable
amount of tunnelling in its transit paths.
4 - tunnels - tunnels are just SO strange when you look at the corner
cases - what happens with ICMP messages that originate within the
tunnel, for example, and the treatment of the DF bit. I am still
scratching my head over my local IPv6-in-IPv4 tunnel:
interface Tunnel0
no ip address
ipv6 address <something>
tunnel source 10.0.0.1
tunnel destination 10.0.0.2
tunnel mode ipv6ip
Tunnel0 is up, line protocol is up
  Hardware is Tunnel
  MTU 1514 bytes, BW 9 Kbit, DLY 500000 usec,
     reliability 255/255, txload 1/255, rxload 1/255
1514?
what a strange default selection!
I also don't understand the encapsulation behaviour on tunnel ingress -
but maybe that's just me! i.e. when an IPv6 packet that's too big for a
tunnel gets wrapped in an IPv4 wrapper where IPv4 fragmentation is
allowed, should the ingress router simply accept the packet and
fragment it?
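The MTU arithmetic side of it is at least easy to sketch - this little
fragment just replays the failure mode you describe below, using your
4470-octet POS example (the 20-octet figure is the plain protocol-41
IPv4 header that ipv6ip encapsulation adds; GRE would add more):

IPV4_OVERHEAD = 20   # octets added by plain IPv6-in-IPv4 encapsulation

def tunnel_ipv6_mtu(ipv4_mtu_to_endpoint):
    # the IPv6 MTU most implementations derive for the tunnel interface
    return ipv4_mtu_to_endpoint - IPV4_OVERHEAD

# Tunnel endpoints on a 4470-octet POS path -> the tunnel advertises an
# IPv6 MTU of 4450, so a full 1500-octet IPv6 packet is happily accepted...
print(tunnel_ipv6_mtu(4470))                      # 4450

# ...but on the wire it is a 1520-octet IPv4 packet, which hits the first
# genuine 1500-octet hop and must be fragmented (or dropped if DF is set).
inner = 1500
outer = inner + IPV4_OVERHEAD
print(outer, "octets ->", "fragment or drop" if outer > 1500 else "fits")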
On the whole I'd stand by the general advice that a dual-homed server
will have fewer client "problems" if it uses a more conservative
approach to MTU selection, one that allows for somewhere between 40
and 60 octets to be added to an IPv6 packet in transit without causing
the packet to hit a 1500-octet fragmentation choke point, with all the
consequent issues of ICMPv6 coherency that we seem to have in today's
networks. The tradeoff appears to be one of performance, and I am not
convinced that the marginal difference in theoretical performance with
the slightly larger MTU is worth the pain. Your view may well be
different, of course!
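In concrete numbers that conservative setting works out to something like
the following (just the arithmetic - the 40 to 60 octet allowance is the
headroom mentioned above, and the MSS values assume plain IPv6 and TCP
headers with no options):

LINK_MTU = 1500
IPV6_HEADER, TCP_HEADER = 40, 20

for headroom in (40, 60):            # octets that may be added in transit
    server_mtu = LINK_MTU - headroom
    mss = server_mtu - IPV6_HEADER - TCP_HEADER
    print(f"allow {headroom} octets of encapsulation -> "
          f"advertise MTU {server_mtu}, TCP MSS {mss}")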
Geoff
On 03/02/2009, at 10:13 AM, Bernhard Schmidt wrote:
> On Tue, Feb 03, 2009 at 08:13:04AM +1100, Geoff Huston wrote:
>
> Hello Geoff,
>
>> It seems that the most pragmatic advice is to use a lower-than-1500 MTU
>> from the outset for IPv6 servers. More details of what I found are at
>> http://ispcolumn.isoc.org/2009-01/mtu6.html
>
> Great summary, thanks a lot for this excellent article!
>
> However, I cannot agree with your conclusion. On the one hand we see a
> lot of people, especially in the research networks lobbying for
> networks that are transparent for 4470 byte or ~9000 byte frames to
> reduce overhead and increase (host) throughput with >>GE rates, and on
> the other hand we cannot even get 1500 byte reliably and propose to set
> the MTU of servers (all servers?) to a lower value? That can't be right.
>
> We need to rat out those issues before it's too late. We have been
> running our webservers (http://www.lrz.de, granted it's not really
> high profile) v6-enabled with MTU 1500 for three years now, same for
> MX and other services. We have some thousand clients all running on
> MTU 1500 links (and thus sending out MSS 1440 and relying on IPv6
> pMTU discovery) and have not heard complaints. If you are repeatedly
> having issues chances are high that the problem is close to your own
> network, which might make debugging and contacting the appropriate
> parties a bit easier.
>
> The most common issues are IPv6 tunnels on interprovider links. Most
> implementations (including Cisco and Juniper) set the IPv6 MTU of the
> tunnel to the IPv4 MTU (to the tunnel destination) minus 20/24 bytes
> of overhead. Which is a good default when you have the standard 1500
> byte core links, but bad when your tunnelbox is connected with (for
> example) POS (4470 bytes) or some jumbo-enabled ethernet link. IPv6
> MTU is set to 4450 bytes, a native 1500 byte IPv6 packet comes in,
> gets encapsulated, sent as 1520 byte IPv4 packet through the core and
> then dies at the 1500 byte IX fabric to the peer. Bam!
>
> I've seen this issue multiple times now. These defaults are service
> affecting. Even worse, two or three times when I told the engineering
> folks of the affected network about the problem they did not fix the
> tunnel immediately or turned it down, but kept it running. After all
> they see traffic through it, so it can't be broken. But they broke a
> lot of connections through that tunnel without even noticing it. So a
> lot of education about pMTU and the effects of broken pMTU due to
> misconfigured tunnels is still necessary. Your article, although a bit
> lengthy for a quick slap, is very helpful in this.
>
> If you are having issues and you have a Linux box around, try
> tracepath6. It is a really great tool to find MTU issues on routers,
> and notifying the affected party of this problem helps you and a lot
> of other people.
>
> OBtracerouteoftheday:
>
> traceroute to www.rfc-editor.org (2001:1878:400:1:214:4fff:fe67:9351) from 2001:4ca0:0:f000:211:43ff:fe7e:3a76, port 80, from port 60366, 30 hops max, 60 byte packets
>  1  vl-23.csr2-2wr.lrz-muenchen.de (2001:4ca0:0:f000::2)  0.400 ms  0.302 ms  0.282 ms
>  2  vl-3051.csr1-2wr.lrz-muenchen.de (2001:4ca0:0:51::1)  0.530 ms  0.400 ms  0.336 ms
>  3  xr-gar1-te1-3-108.x-win.dfn.de (2001:638:c:a003::1)  0.543 ms  0.393 ms  0.362 ms
>  4  2001:638:c:c043::2 (2001:638:c:c043::2)  8.272 ms  8.133 ms  7.889 ms
>  5  dfn.rt1.fra.de.geant2.net (2001:798:14:10aa::1)  7.826 ms  7.808 ms  7.834 ms
>  6  abilene-wash-gw.rt1.fra.de.geant2.net (2001:798:14:10aa::12)  100.938 ms  119.152 ms  106.065 ms
>  7  2001:468:ff:209::2 (2001:468:ff:209::2)  117.410 ms  117.312 ms  117.330 ms
>  8  2001:468:ff:204:8000::2 (2001:468:ff:204:8000::2)  127.892 ms  127.959 ms  127.867 ms
>  9  2001:468:ff:407::2 (2001:468:ff:407::2)  174.721 ms  173.358 ms  173.695 ms
> 10  2001:468:ff:716::1 (2001:468:ff:716::1)  173.303 ms  185.809 ms  174.810 ms
> 11  kreonet-1-lo-jmb-706.sttlwa.pacificwave.net (2001:504:b:10::6)  169.214 ms  169.212 ms  169.147 ms
> 12  cenichpr-1-is-jmb-778.snvaca.pacificwave.net (2001:504:b:88::129)  184.891 ms  184.551 ms  184.509 ms
> 13  lax-hpr--svl-hpr-10ge.cenic.net (2001:468:e00:403::1)  184.587 ms  184.534 ms  184.519 ms
> 14  2607:f380::4:0:103 (2607:f380::4:0:103)  180.625 ms  180.597 ms  180.691 ms
> 15  2001:1878:8::2 (2001:1878:8::2)  180.895 ms  180.827 ms  180.773 ms
> 16  www.rfc-editor.org (2001:1878:400:1:214:4fff:fe67:9351)  181.112 ms [open]  181.000 ms  180.865 ms
>
> Am I the only one shuddering when I see Kreonet in the middle of a
> Europe-to-US trace? Fortunately traffic stays within .us and does not
> take a detour, but that might just be my luck today.
>
> Bernhard