How we discovered a hidden problem using the Website Performance Test

Website Performance Test: Part 3

This is the third in a series of articles about our new SamKnows Website Performance Test, the most comprehensive and accurate way to measure your website’s performance.

Shortly after developing the test, we saw Sky customers had experienced a significant drop in their internet performance.  The problem proved obscure and hard to find, but not too small for us!

Process: Identify, research, and validate

Looking at the Website Performance Test results in SamKnows One, our internet measurement platform, we quickly identified the root cause and alerted Sky straight away.  Sky were then able to deploy a software update to restore smooth internet for all their customers.  However, Sky were not alone in experiencing this issue, and many other internet service providers (ISPs) around the world are still seeing their network performance negatively impacted by the same problem.

This article documents our investigation and its findings.

When exploring the UK’s overall internet performance in SamKnows One, we noticed a large difference in website loading times for Sky and Virgin Media.  To show this clearly, we generated a line chart in SamKnows One to look at the time it took for a web page to load and be visually complete (Figure 1). 

Figure 1: Website Performance Test, time to visually complete

Unsure of the underlying cause, we checked to see if any other metrics demonstrated a similar pattern. Average DNS lookup times were also volatile and deviated from other UK ISPs (Figure 2).

Figure 2: Website Performance Test, average DNS lookup time

To understand the issue on a test-agent level, we looked at the HTTP archive. 

The HTTP archive revealed that several DNS lookups had lookup times clustered at the 5 and 10 second boundaries, and even at 20 seconds (at which point the lookups fail). We took this to suggest that the results were recorded just after a DNS timeout had occurred and the lookup had been automatically retried. This would certainly affect people’s experience when accessing websites, especially when a web page has lots of visuals that need to be fetched from multiple servers.
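A minimal sketch of this kind of check, assuming the HTTP archive is available as standard HAR JSON (where each entry carries a `timings.dns` value in milliseconds, or -1 when no lookup took place). The function name, the tolerance, and the sample data are illustrative assumptions, not part of the Website Performance Test itself.

```python
# Assumed timeout boundaries in seconds, taken from the lookup times
# described above; real values depend on the client's resolver settings.
TIMEOUT_BOUNDARIES_S = (5.0, 10.0, 20.0)


def slow_dns_entries(har, tolerance_s=0.5):
    """Return (url, dns_seconds, boundary) for every HAR entry whose DNS
    time falls just after one of the assumed timeout boundaries."""
    suspects = []
    for entry in har["log"]["entries"]:
        dns_ms = entry.get("timings", {}).get("dns", -1)
        if dns_ms is None or dns_ms < 0:
            continue  # -1 means no DNS lookup was needed for this request
        dns_s = dns_ms / 1000.0
        for boundary in TIMEOUT_BOUNDARIES_S:
            if boundary <= dns_s < boundary + tolerance_s:
                suspects.append((entry["request"]["url"], dns_s, boundary))
                break
    return suspects


# Hypothetical sample data in HAR shape (not real measurements).
sample_har = {"log": {"entries": [
    {"request": {"url": "https://example.com/"}, "timings": {"dns": 21}},
    {"request": {"url": "https://example.com/ad.js"}, "timings": {"dns": 10033}},
    {"request": {"url": "https://example.com/img.png"}, "timings": {"dns": -1}},
]}}

for url, seconds, boundary in slow_dns_entries(sample_har):
    print(f"{url}: {seconds:.3f}s (just after the {boundary:.0f}s boundary)")
```

Lookups that land just after a boundary, rather than being spread evenly across the range, are what point to a timeout followed by a retry.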

The issue is likely to appear larger in the Website Performance Test’s results than in those of a typical browser test, because the Website Performance Test doesn’t rely on a local DNS cache: the cache is disabled in its web engine so that it always measures page load from scratch. However, the issue would certainly affect real users as well, so the problem was worth pursuing.

What we tested and how

Further debugging showed that many different domains were affected and that most websites had high DNS lookup times. However, the Website Performance Test can only confirm the presence of an issue, not the root cause.  

To dig deeper, we wrote a separate DNS lookup tool where we could alter the number of requests, and their concurrency, via the command line.

It turns out that when running DNS lookups in series, we were unable to see the problem. But when we introduced lookup concurrency, which is what a real web browser does, the issue was clearly visible:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
FQDN                            IP               Lookup time (s)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
dart.l.doubleclick.net          216.58.206.134   0.020740
pagead46.l.doubleclick.net      216.58.214.2     10.032715
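The general shape of such a lookup tool can be sketched as follows. This is not the tool used in the investigation; it is a small illustration built on Python's standard library, and the function names and command-line flags (`-n`, `-c`) are assumptions. It resolves names through the operating system's resolver, so any OS-level DNS cache may hide repeat lookups.

```python
import argparse
import socket
import time
from concurrent.futures import ThreadPoolExecutor


def timed_lookup(name, resolver=socket.getaddrinfo):
    """Resolve one name and return (name, ip_or_None, seconds_taken)."""
    start = time.monotonic()
    try:
        results = resolver(name, None)
        ip = results[0][4][0]  # first address of the first result
    except OSError:  # socket.gaierror is a subclass of OSError
        ip = None
    return name, ip, time.monotonic() - start


def run_lookups(names, concurrency, resolver=socket.getaddrinfo):
    """Resolve all names with `concurrency` worker threads.
    A concurrency of 1 performs the lookups in series."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(lambda n: timed_lookup(n, resolver), names))


def main(argv=None):
    parser = argparse.ArgumentParser(description="Time DNS lookups")
    parser.add_argument("names", nargs="+", help="domain names to resolve")
    parser.add_argument("-n", "--requests", type=int, default=1,
                        help="how many times to repeat the list of names")
    parser.add_argument("-c", "--concurrency", type=int, default=1,
                        help="number of lookups in flight at once")
    args = parser.parse_args(argv)
    names = args.names * args.requests
    for name, ip, secs in run_lookups(names, args.concurrency):
        print(f"{name:<32}{ip or 'FAILED':<17}{secs:.6f}")
```

Calling `main(["-c", "1", ...])` and then `main(["-c", "8", ...])` with the same list of names gives a simple series-versus-concurrent comparison of the kind described above.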

Our packet capture revealed that the second DNS request didn’t receive a response. The client’s DNS resolver tried again after 5 seconds, using the configured IPv6 interface and another IPv4 request, before receiving a response to the DNS query.

The packet capture also revealed that the client was using a local DNS server, in this case a home router with a LAN IP address of 192.168.0.1. Home routers typically operate a DNS forwarder that forwards requests to the DNS server(s) within the ISP’s network. When a computer, or other internet-enabled device, joins the home network, the router uses DHCP to “tell” it what IP address to use, and supplies the router’s own IP address as the DNS server. This means that all DNS requests will be sent to the router by default, which is a very common configuration for home networks.

On a hunch that the home router could be the problem, we reconfigured the client device to bypass the DNS server in the home router and to contact the ISP’s externally hosted DNS servers instead. When we repeated the test with this setup, the problem disappeared: regardless of the level of lookup concurrency, lookups were fast and performance was stable.
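One way to make this kind of comparison without changing a device's network settings is to send a DNS query straight to a chosen server. The sketch below is an assumption about how that could be done, not a description of how the client was actually reconfigured. It builds a minimal A-record query in the RFC 1035 wire format and times the reply over UDP; the function names are hypothetical, and it performs no retries of its own.

```python
import secrets
import socket
import struct
import time


def build_query(name, query_id=None):
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    if query_id is None:
        query_id = secrets.randbelow(0x10000)
    # Header: ID, flags (0x0100 = recursion desired), 1 question,
    # and zero answer, authority and additional records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Each label is encoded as a length byte followed by its characters.
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    # Question: name, terminating zero byte, QTYPE=1 (A), QCLASS=1 (IN).
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)


def time_query(server, name, timeout=5.0):
    """Send one query to `server` on UDP port 53.
    Return seconds until a matching reply arrives, or None on timeout."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        sock.sendto(query, (server, 53))
        try:
            while True:
                reply, _ = sock.recvfrom(512)
                if reply[:2] == query[:2]:  # transaction ID matches
                    return time.monotonic() - start
        except socket.timeout:
            return None
```

Timing `time_query("192.168.0.1", name)` against the same call with the address of one of the ISP's resolvers (which depends on the network and is not given here), while several queries are in flight, would show whether the router's forwarder is the slow hop.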

Conclusion

From our investigation, it was clear that Sky customers’ internet connections were fine, but a concurrency limitation in the home routers’ DNS server was negatively affecting their web-browsing performance. Designed for intuitive analysis, SamKnows One visualises your end-to-end performance data and lets you track your network health over time. Only by visualising your end-to-end internet performance for individual customer homes can you find hidden problems, such as this one, and know how to fix them.

Sky have since started to roll out their software update to fix this problem, and we can already see a dramatic improvement to webpage loading times. However, it should be noted that Sky is not the only ISP to experience this problem, and that it reaches far beyond the UK to service providers around the world.

Sky vs Virgin improvement for web-page loading time