Some very nice broken IPv6 networks at Google and Akamai

Tue Nov 11 20:52:50 CET 2014

On 2014-11-11 20:32, Pim van Pelt wrote:
> Hoi,
> 
> 2014-11-11 11:13 GMT-08:00 Jeroen Massar <jeroen at massar.ch>:
>> As stated, the MSS clamping is just hiding the real problems. It does
>> not properly resolve anything.
> You are simply wrong about this statement.

There is nothing wrong with that statement even though you removed the
context.

Let's dissect my sentence then:

 "the MSS clamping is just hiding the real problems"

Indeed, with MSS clamping stuff "works" for TCP, but with a much lower
MTU. But when you start using anything non-TCP, stuff is magically broken.

Hence, you are hiding the problem and do not resolve anything *PROPERLY*.
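
To put numbers on that, here is a minimal sketch of the arithmetic. The
function names are made up for illustration, and the header sizes assume
an IPv6 header without extension headers and a TCP header without options:

```python
IPV6_HEADER = 40  # bytes, fixed IPv6 header, no extension headers (assumption)
TCP_HEADER = 20   # bytes, TCP header without options (assumption)
UDP_HEADER = 8    # bytes

def clamped_mss(assumed_mtu):
    """MSS a clamping device would advertise for the MTU it assumes."""
    return assumed_mtu - IPV6_HEADER - TCP_HEADER

def udp_packet_fits(payload, path_mtu):
    """True if a UDP datagram with this payload fits the path MTU unfragmented."""
    return IPV6_HEADER + UDP_HEADER + payload <= path_mtu

print(clamped_mss(1500))            # 1440
print(clamped_mss(1280))            # 1220
print(udp_packet_fits(1400, 1280))  # False
```

The clamp only rewrites the MSS option in TCP SYNs, so the UDP case still
depends on the sender learning the path MTU some other way, such as from
an ICMPv6 PTB.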

> MSS clamping effectively
> resolves issues with PMTUD by reducing its necessity in the first
> place. I think I'm the ninth person to point that out in this thread?
> 
> The reason why operators resort to MSS clamping, is because they then
> take end to end delivery reliability into their own hands, and have
> more control over the flow of their data onto the internet.

I *FULLY* understand WHY big providers are resorting to it:
 it fixes the problem in the short term. And that is great.

Still, it does not *PROPERLY* solve the problems that are in the
network. Which is exactly what I write above.

On top of that, it reduces every link on the Internet to 1280 or whatever
magic MTU the one clamping the MSS thinks the link might have.

> The "real
> problems" you bring up, are almost impossible to address without
> explicit cooperation from all affected parties - this is a method that
> does not scale, and is not considered a winning strategy by operators
> who wish to actually see their packets reach the intended recipient.

But by hiding it, it won't be resolved ever. The IPv6 network is not
that large yet, eyeballs are not everywhere yet. But really, if the
"operators" want to give up already on resolving these kinds of issues,
then well, just give up.

But maybe I should suggest a different approach to all the large
network operators:

  Collect information on which network sources you do properly
  receive ICMPv6 PTBs from and which sources you do not, but where
  you can notice that

Google loves doing 1 in 10,000 or whatever connection experiments, thus
not breaking "too much". Such an experiment should thus be easy to do
and would give quite a lot of empirical data.
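
As a rough illustration of what such bookkeeping could look like, here is
a minimal sketch. Everything in it is hypothetical: the function names,
hashing a connection ID to pick the sample, and aggregating sources per
/48 are assumptions for the example, not a description of how any
operator actually does this.

```python
import ipaddress
import zlib
from collections import defaultdict

SAMPLE_RATE = 10_000  # the "1 in 10,000" figure; tune as needed

def in_sample(conn_id, rate=SAMPLE_RATE):
    """Deterministically select roughly 1 in `rate` connections."""
    return zlib.crc32(conn_id.encode()) % rate == 0

def source_prefix(addr, length=48):
    """Aggregate a source address to its covering prefix (/48 is an assumption)."""
    return str(ipaddress.ip_network(f"{addr}/{length}", strict=False))

# prefix -> [sampled connections, connections for which a PTB was received]
stats = defaultdict(lambda: [0, 0])

def record(addr, ptb_received):
    """Tally one sampled connection and whether an ICMPv6 PTB came back."""
    entry = stats[source_prefix(addr)]
    entry[0] += 1
    if ptb_received:
        entry[1] += 1
```

Prefixes with many sampled connections but no PTBs at all would then be
the candidates worth a closer look.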

And as you have the ear of all the operators, and as everybody
wants proper connectivity to the Big G, maybe that could help fix the
problem by opening everybody's eyes? (Instead of silently keeping them
closed...)

>> If I had not commented about this problem,
>> it would never have come to light... maybe in several years when nothing
>> could have been done anymore. But today, we still can fix things.
>
> While it's great that you noticed it, I think it's mendacious to claim
> that if you had not commented on the problem, it would not have come
> to light. 

Of course. There will always be *somebody* who will discover it.

Note that I was not the one who noticed the issues, there were various
folks on both HE.net and SixXS forums complaining about it; I just
brought it to the attention here so that the people who are able to fix
it are able to fix it.

But as it is a hidden thing, it would have taken some time before
somebody was able to get somebody to state that this MSS clamping was
happening in the first place.

> Also, I'd like you to keep in mind that there can be some
> significant wall clock time between noticing a problem and completing
> its resolution, in a network or server deployment of sufficient size
> this can take a while.

I fully accept that and have also stated several times already that it
is great that Google folks have come forward with more information and
that they have resolved the issue.

> So saying "they just did not notice it this
> time around and thus it took a while for them to wake up (timezones :)
> figure out what it is and fix the issue" carries little merit if you
> don't actually know what happened, how it got noticed, or how it got
> resolved.

Unfortunately not everybody can see the insides of a big machine.

It would be great if those big machines would be a bit more open about
problems they are having and how they are attacking those so that other
folks don't have to repeat the same mistakes or are grasping in the dark
about things.

Note, I already wrote "timezones" as I assume that the folks who
are actually working on it also sleep once in a while.

>> Noting problems and properly fixing them are important.
> but but but ... the problem *was* fixed, and whether you like it or
> not, it was fixed by restoring the intended behavior of MSS clamping
> in the affected Google servers after they had a regression.

I've already acked several times that the problem has been resolved.
AGAIN: THANK YOU FOR SOLVING THAT.

But: your version of "fix" works for TCP and it uses a suboptimal MTU.

Hence, it is not fully fixed. Yes, it is fixed for the services
affected, but the moment that you start globally using QUIC, which
runs over UDP, the fix is not good enough anymore until they implement
work-arounds again for the same issue.

> It's fine if you want to practice pedantry, and I applaud your
> persistence. But you must understand that ipv6-ops being network
> engineers by trade will in general resort to "doing that, which
> actually gets the job done".

Patching up stuff is great, it makes things work indeed.

But let's also think about longer term effects and how to resolve this
problem in the longer term.

Or do you want to get a call again later on for the same issue and not
have a resolution ready?

And for everybody still reading, another suggested long term fix:
 http://www.ietf.org/id/draft-massar-v6man-mtu-label-00.txt

But hey, I am just suggesting things so that your work does not call you
out of bed....

Please, let's fix this issue properly.

Greets,
 Jeroen