mtu
Geoff Huston
gih at apnic.net
Tue Feb 3 03:11:59 CET 2009
These are interesting comments, Bernhard, and on the whole I agree with
them.
A few comments in response:
1 - On large packet sizes. It seems anomalous that the IEEE has been
unable to reach the level of consensus that would allow
standardization of packet frames of size > 1500 octets. In a world
where the LAN carriage rate has advanced from 10Mbps to 10Gbps, a
comparable packet size would be 1.5M octets, yet 1500 is as large as
the standards world has got. Part of the problem is that there are
competing pressures here: carriage efficiency in terms of multiplexing
multiple independent streams tends to favour a smaller packet size for
applications with constant time/data requirements (voice, interactive
access, even video), while larger packet sizes tend to work to the
advantage of large-volume, non-time-based transfers (which, although
fewer in number in terms of stream counts, are still larger in terms
of byte volumes).
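Just to make that scaling point concrete, here is the back-of-the-envelope
arithmetic as a small Python sketch (illustrative only - the chosen rates
and the linear scaling rule are simply my assumptions for the comparison):

# Serialization time of a 1500-octet frame at various line rates, and
# what a rate-proportional MTU would look like. Illustrative numbers only.
MTU = 1500  # octets, unchanged since 10Mbps Ethernet

for name, bps in [("10M", 10e6), ("100M", 100e6), ("1G", 1e9), ("10G", 10e9)]:
    serialization_us = MTU * 8 / bps * 1e6
    scaled_mtu = int(MTU * bps / 10e6)   # scale the MTU with the line rate
    print(f"{name:>4}: 1500-octet frame = {serialization_us:8.2f} us on the wire; "
          f"rate-proportional MTU = {scaled_mtu:>9,} octets")

At 10Gbps the frame is on the wire for a little over a microsecond, and the
rate-proportional MTU comes out at the 1.5M octets mentioned above.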
2 - on the interaction between packet sizes and TCP transport. The
basic mathematical model is
BW = MSS / ( RTT*sqrt(1.33*p) + RTO*p*(1+32*p^2)*min(1, 3*sqrt(0.75*p)) )

(I hope that asciified ok)
or
BW = C * (MSS/RTT) * (1/sqrt(p))
where p is the packet loss rate.
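For anyone who wants to play with the model, here is my transcription of
both forms as a small Python sketch (the RTO of 1 second and the constant
C of 1.22 are just placeholder values of mine, and b=2 is the delayed-ACK
factor that gives the 1.33 and 0.75 constants above):

from math import sqrt

def bw_full(mss, rtt, p, rto=1.0, b=2):
    # Full model: throughput in bytes/sec. mss in bytes, rtt/rto in
    # seconds, p = packet loss rate, b = packets covered per ACK.
    return mss / (rtt * sqrt(2 * b * p / 3)
                  + rto * min(1.0, 3 * sqrt(3 * b * p / 8)) * p * (1 + 32 * p ** 2))

def bw_simplified(mss, rtt, p, c=1.22):
    # Simplified form: BW = C * (MSS/RTT) * (1/sqrt(p))
    return c * (mss / rtt) * (1 / sqrt(p))

# e.g. 100ms RTT, 0.1% packet loss, 1440-byte MSS
print(bw_full(1440, 0.1, 0.001), bw_simplified(1440, 0.1, 0.001))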
This suggests that throughput is directly proportional to MSS, but
that's misleading in some ways - the issue is the packet loss rate, p.
If p is itself proportional to the packet size, then the expectation
that larger packet sizes directly produce better performance does not
exactly hold. Another way to model TCP congestion performance is to
take the "sawtooth" pattern of TCP and note that the area under the
sawtooth is related to the available flow capacity of the connection,
not the packet quantization level (a smaller packet size produces a
higher-frequency oscillation, but not necessarily a greatly reduced
mean value around which TCP oscillates). i.e. smaller packet sizes are
not necessarily going to mean greatly reduced network throughput rates
in many practical cases.
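A rough numerical illustration of that point, under the (simplifying)
assumption that p grows strictly linearly with packet size - in which case
the simplified model says the throughput gain goes only with the square
root of the MSS ratio:

from math import sqrt

base_mss, base_p = 1440, 0.001   # arbitrary reference point, my numbers

for mss in (1440, 2880, 8960):
    p = base_p * mss / base_mss                   # assume p linear in MSS
    gain = (mss / sqrt(p)) / (base_mss / sqrt(base_p))
    print(f"MSS {mss:5d}: {mss/base_mss:.2f}x the packet size, "
          f"{gain:.2f}x the modelled throughput")

Doubling the packet size buys roughly a 41% gain in the modelled
throughput under that assumption, not a 100% one.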
3 - the problem is close to "your own network". The example I wrote up
for the article was interesting in that the problem was close neither
to me as the client nor to the server - the problem was in the transit
path. Now I can't generalize about this because my sources of data are
limited. Intuition tends to say that per-packet filters are more
likely to be at the edge than in the interprovider core, so close to
the edge seems like a good guess, but the case that I looked at in
detail tended to suggest otherwise with IPv6 - maybe it's because of
the strange behaviours of a network that still contains a reasonable
amount of tunnelling in its transit paths.
4 - tunnels - tunnels are just SO strange when you look at the corner
cases - what happens with ICMP messages that originate within the
tunnel, for example, and the treatment of the DF bit. I am still
scratching my head over my local IPv6-in-IPv4 tunnel:
interface Tunnel0
no ip address
ipv6 address <something>
tunnel source 10.0.0.1
tunnel destination 10.0.0.2
tunnel mode ipv6ip
Tunnel0 is up, line protocol is up
  Hardware is Tunnel
  MTU 1514 bytes, BW 9 Kbit, DLY 500000 usec,
     reliability 255/255, txload 1/255, rxload 1/255
1514?
what a strange default selection!
I also don't understand the encapsulation behaviour on tunnel ingress -
but maybe that's just me! i.e. when an IPv6 packet that's too big for a
tunnel gets wrapped in an IPv4 wrapper where IPv4 fragmentation is
allowed, should the ingress router simply accept the packet and
fragment it?
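The MTU arithmetic side of it is at least easy to sketch - this little
fragment just replays the failure mode you describe below, using your
4470-octet POS example (the 20-octet figure is the plain protocol-41
IPv4 header that ipv6ip encapsulation adds; GRE would add more):

IPV4_OVERHEAD = 20   # octets added by plain IPv6-in-IPv4 encapsulation

def tunnel_ipv6_mtu(ipv4_mtu_to_endpoint):
    # the IPv6 MTU most implementations derive for the tunnel interface
    return ipv4_mtu_to_endpoint - IPV4_OVERHEAD

# Tunnel endpoints on a 4470-octet POS path -> the tunnel advertises an
# IPv6 MTU of 4450, so a full 1500-octet IPv6 packet is happily accepted...
print(tunnel_ipv6_mtu(4470))                      # 4450

# ...but on the wire it is a 1520-octet IPv4 packet, which hits the first
# genuine 1500-octet hop and must be fragmented (or dropped if DF is set).
inner = 1500
outer = inner + IPV4_OVERHEAD
print(outer, "octets ->", "fragment or drop" if outer > 1500 else "fits")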
On the whole I'd stand by the general advice that a dual-homed server
will have fewer client "problems" if it uses a more conservative
approach to MTU selection, one that allows for somewhere between 40
and 60 octets to be added to an IPv6 packet in transit without causing
the packet to hit a 1500-octet fragmentation choke point, with all the
consequent issues of ICMPv6 coherency that we seem to have in today's
networks. The tradeoff appears to be one of performance, and I am not
convinced that the marginal difference in theoretical performance with
the slightly larger MTU is worth the pain. Your view may well be
different, of course!
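In concrete numbers that conservative setting works out to something like
the following (just the arithmetic - the 40 to 60 octet allowance is the
headroom mentioned above, and the MSS values assume plain IPv6 and TCP
headers with no options):

LINK_MTU = 1500
IPV6_HEADER, TCP_HEADER = 40, 20

for headroom in (40, 60):            # octets that may be added in transit
    server_mtu = LINK_MTU - headroom
    mss = server_mtu - IPV6_HEADER - TCP_HEADER
    print(f"allow {headroom} octets of encapsulation -> "
          f"advertise MTU {server_mtu}, TCP MSS {mss}")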
Geoff
On 03/02/2009, at 10:13 AM, Bernhard Schmidt wrote:
> On Tue, Feb 03, 2009 at 08:13:04AM +1100, Geoff Huston wrote:
>
> Hello Geoff,
>
>> It seems that the most pragmatic advice is to use a lower-than-1500 MTU
>> from the outset for IPv6 servers. More details of what I found are at
>> http://ispcolumn.isoc.org/2009-01/mtu6.html
>
> Great summary, thanks a lot for this excellent article!
>
> However, I cannot agree with your conclusion. On the one hand we see a
> lot of people, especially in the research networks lobbying for
> networks that are transparent for 4470 byte or ~9000 byte frames to
> reduce overhead and increase (host) throughput with >>GE rates, and on
> the other hand we cannot even get 1500 byte reliably and propose to set
> the MTU of servers (all servers?) to a lower value? That can't be right.
>
> We need to rat out those issues before it's too late. We have been
> running our webservers (http://www.lrz.de, granted it's not really
> high profile) v6-enabled with MTU 1500 for three years now, same for
> MX and other services. We have some thousand clients all running on
> MTU 1500 links (and thus sending out MSS 1440 and relying on IPv6
> pMTU discovery) and have not heard complaints. If you are repeatedly
> having issues chances are high that the problem is close to your own
> network, which might make debugging and contacting the appropriate
> parties a bit easier.
>
> The most common issues are IPv6 tunnels on interprovider links. Most
> implementations (including Cisco and Juniper) set the IPv6 MTU of the
> tunnel to the IPv4 MTU (to the tunnel destination) minus 20/24 bytes
> of overhead. Which is a good default when you have the standard 1500
> byte core links, but bad when your tunnelbox is connected with (for
> example) POS (4470 bytes) or some jumbo-enabled ethernet link. IPv6
> MTU is set to 4450 bytes, a native 1500 byte IPv6 packet comes in,
> gets encapsulated, sent as 1520 byte IPv4 packet through the core and
> then dies at the 1500 byte IX fabric to the peer. Bam!
>
> I've seen this issue multiple times now. These defaults are service
> affecting. Even worse, two or three times when I told the engineering
> folks of the affected network about the problem they did not fix the
> tunnel immediately or turned it down, but kept it running. After all
> they see traffic through it, so it can't be broken. But they broke a
> lot of connections through that tunnel without even noticing it. So a
> lot of education about pMTU and the effects of broken pMTU due to
> misconfigured tunnels is still necessary. Your article, although a bit
> lengthy for a quick slap, is very helpful in this.
>
> If you are having issues and you have a Linux box around, try
> tracepath6. It is a really great tool to find MTU issues on routers,
> and notifying the affected party of this problem helps you and a lot
> of other people.
>
> OBtracerouteoftheday:
>
> traceroute to www.rfc-editor.org (2001:1878:400:1:214:4fff:fe67:9351) from 2001:4ca0:0:f000:211:43ff:fe7e:3a76, port 80, from port 60366, 30 hops max, 60 byte packets
>  1  vl-23.csr2-2wr.lrz-muenchen.de (2001:4ca0:0:f000::2)  0.400 ms  0.302 ms  0.282 ms
>  2  vl-3051.csr1-2wr.lrz-muenchen.de (2001:4ca0:0:51::1)  0.530 ms  0.400 ms  0.336 ms
>  3  xr-gar1-te1-3-108.x-win.dfn.de (2001:638:c:a003::1)  0.543 ms  0.393 ms  0.362 ms
>  4  2001:638:c:c043::2 (2001:638:c:c043::2)  8.272 ms  8.133 ms  7.889 ms
>  5  dfn.rt1.fra.de.geant2.net (2001:798:14:10aa::1)  7.826 ms  7.808 ms  7.834 ms
>  6  abilene-wash-gw.rt1.fra.de.geant2.net (2001:798:14:10aa::12)  100.938 ms  119.152 ms  106.065 ms
>  7  2001:468:ff:209::2 (2001:468:ff:209::2)  117.410 ms  117.312 ms  117.330 ms
>  8  2001:468:ff:204:8000::2 (2001:468:ff:204:8000::2)  127.892 ms  127.959 ms  127.867 ms
>  9  2001:468:ff:407::2 (2001:468:ff:407::2)  174.721 ms  173.358 ms  173.695 ms
> 10  2001:468:ff:716::1 (2001:468:ff:716::1)  173.303 ms  185.809 ms  174.810 ms
> 11  kreonet-1-lo-jmb-706.sttlwa.pacificwave.net (2001:504:b:10::6)  169.214 ms  169.212 ms  169.147 ms
> 12  cenichpr-1-is-jmb-778.snvaca.pacificwave.net (2001:504:b:88::129)  184.891 ms  184.551 ms  184.509 ms
> 13  lax-hpr--svl-hpr-10ge.cenic.net (2001:468:e00:403::1)  184.587 ms  184.534 ms  184.519 ms
> 14  2607:f380::4:0:103 (2607:f380::4:0:103)  180.625 ms  180.597 ms  180.691 ms
> 15  2001:1878:8::2 (2001:1878:8::2)  180.895 ms  180.827 ms  180.773 ms
> 16  www.rfc-editor.org (2001:1878:400:1:214:4fff:fe67:9351)  181.112 ms [open]  181.000 ms  180.865 ms
>
> Am I the only one shuddering when I see Kreonet in the middle of a
> Europe-to-US trace? Fortunately traffic stays within .us and does not
> take a detour, but that might just be my luck today.
>
> Bernhard