Analyzing Internet reliability remotely with probing-based techniques

Date

2018

Abstract

Internet reliability for home users is increasingly important as a variety of services that we use migrate to the Internet. Yet, we lack authoritative measures of residential Internet reliability. Measuring reliability requires the detection of Internet outage events experienced by home users. But residential Internet outages are rare events. Further, they can affect relatively few users. Thus, detecting residential Internet outages requires broad and longitudinal measurements of individual users' Internet connections. However, such measurements of Internet reliability are challenging to obtain accurately and at scale.

Probing-based remote outage detection techniques can scale, but their accuracy is questionable. These techniques detect Internet outages across time as well as across the IPv4 address space by sending active probes, such as pings and traceroutes, to users' IP addresses and using the probe responses to infer Internet connectivity. However, they can infer false outages because their foundational assumption, that the lack of a response to an active probe indicates failure, is sometimes invalid. In this dissertation, I show how to use probing-based techniques to measure residential Internet reliability by defending the following thesis: It is possible to remotely and accurately detect substantial outages experienced by any device with a stable public IP address that typically responds to active probes, and to use these outages to compare reliability across ISPs, media-types, geographical areas, and weather conditions.

In the first part of the dissertation, I address the inaccuracy of probing-based techniques' detected outages and show how to use probe responses to correctly detect outages. I illustrate two scenarios where the lack of response to an active probe is not indicative of failure. In the first scenario, responses are delayed beyond the prober's timeout, leading these techniques to infer packet-loss instead of delay. In the second scenario, these techniques can falsely infer packet-loss when the address they are probing gets dynamically reassigned. I examine how often delayed responses and dynamic reassignment occur across ISPs to quantify the inaccuracy of these techniques. I show how outages can be inferred correctly even in networks with dynamic reassignment using complementary datasets that can reveal whether an address was dynamically reassigned before, during, and after a detected outage for that address.
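
The first scenario can be illustrated with a minimal sketch, which is not the dissertation's implementation. It assumes each probe is summarized by a round-trip time in seconds, with `None` meaning no response ever arrived; the function name, the sample values, and the two timeouts are hypothetical. The sketch shows how the same probe responses yield different apparent loss rates depending on the prober's timeout.

```python
from typing import Optional, Sequence


def apparent_loss_rate(rtts: Sequence[Optional[float]], timeout_seconds: float) -> float:
    """Fraction of probes that a prober with the given timeout counts as lost.

    A probe counts as lost if no response arrived at all (None) or if the
    response arrived only after the timeout had expired.
    """
    if not rtts:
        raise ValueError("need at least one probe")
    lost = sum(1 for rtt in rtts if rtt is None or rtt > timeout_seconds)
    return lost / len(rtts)


# Hypothetical round-trip times: two fast responses, one response delayed
# to 4 seconds, and one probe that never received a response.
sample_rtts = [0.05, 0.2, 4.0, None]

short_timeout_loss = apparent_loss_rate(sample_rtts, timeout_seconds=3.0)
long_timeout_loss = apparent_loss_rate(sample_rtts, timeout_seconds=60.0)
```

With the shorter timeout, the delayed response is indistinguishable from a lost one, so the apparent loss rate is higher than with the longer timeout.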

In the second part of the dissertation, I motivate why the detection of individual addresses' outages is necessary for analyzing residential reliability. An individual address typically represents one residential customer; therefore, detecting outages for individual addresses can allow capturing even small outages. Prior probing-based techniques focus upon the detection of edge network outages affecting a substantial set of addresses belonging to a BGP prefix or to a /24 address block. Here, I quantitatively demonstrate the extent to which prior techniques can miss residential outages. I show that even individual address outages occur rarely in most networks. When multiple simultaneous outages of related individual addresses occur, there is likely a common underlying cause. With this insight, I develop and evaluate an approach to find outage events that are statistically unlikely to have occurred independently. I show that the majority of such events do not affect entire /24 address blocks or BGP prefixes, and are therefore not likely to be detected by existing techniques which look for outages at these granularities.
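
One way to make "statistically unlikely to have occurred independently" concrete is a binomial tail probability. The sketch below is an illustration under stated assumptions, not necessarily the approach evaluated in the dissertation: it assumes each of `n` related addresses fails independently within a time bin with the same probability `p_outage`, and it flags an event when the chance of seeing `k` or more simultaneous outages falls below a hypothetical threshold `alpha`.

```python
from math import comb


def prob_at_least_k(n: int, k: int, p_outage: float) -> float:
    """P[X >= k] for X ~ Binomial(n, p_outage)."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    return sum(
        comb(n, i) * p_outage**i * (1.0 - p_outage) ** (n - i)
        for i in range(k, n + 1)
    )


def is_unlikely_if_independent(n: int, k: int, p_outage: float, alpha: float = 1e-6) -> bool:
    """True when k or more simultaneous outages among n addresses would be
    rarer than alpha if addresses failed independently (assumed model)."""
    return prob_at_least_k(n, k, p_outage) < alpha


# Hypothetical example: 256 related addresses, each with a 0.1% chance of
# an outage in a given time bin. Twenty simultaneous outages are flagged;
# a single outage is not.
flag_many = is_unlikely_if_independent(n=256, k=20, p_outage=0.001)
flag_one = is_unlikely_if_independent(n=256, k=1, p_outage=0.001)
```

Note that the 20 affected addresses in this example cover far less than a full /24 block, which is the kind of event that block-level detection can miss.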

In the final part of the dissertation, I show how to use individual addresses' outages detected by probing-based techniques to assess Internet reliability across media-types, geographical areas, and weather conditions. Individual outages are not direct measures of reliability: they can occur independently because users disable equipment or can be observed falsely due to dynamic address renumbering. I use the insight that the statistical change in outage rate in different challenging environments (e.g., thunderstorms) can quantitatively expose actual outage “inflation”. I show how to study the effect of a challenging environment upon the reliability of a group of addresses by analyzing the inflation in that group's outage rate while the environment is present.
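
The idea of outage-rate inflation can be sketched as a comparison between a group's outage rate under a challenging environment and its baseline rate. The abstract does not give the exact definition, so the code below is an assumption: it measures rates as outages per address-hour and reports inflation both as a difference and as a ratio. All names and numbers are hypothetical.

```python
from typing import Optional, Tuple


def outage_rate(outage_count: int, address_hours: float) -> float:
    """Outages per address-hour of observation."""
    if address_hours <= 0:
        raise ValueError("address_hours must be positive")
    return outage_count / address_hours


def outage_inflation(
    condition_outages: int,
    condition_address_hours: float,
    baseline_outages: int,
    baseline_address_hours: float,
) -> Tuple[float, Optional[float]]:
    """Return (excess rate, rate ratio) for a group of addresses.

    The excess is the condition rate minus the baseline rate. The ratio is
    None when the baseline rate is zero, since it is then undefined.
    """
    condition_rate = outage_rate(condition_outages, condition_address_hours)
    baseline_rate = outage_rate(baseline_outages, baseline_address_hours)
    excess = condition_rate - baseline_rate
    ratio = condition_rate / baseline_rate if baseline_rate > 0 else None
    return excess, ratio


# Hypothetical counts: 30 outages over 1,000 address-hours of thunderstorm
# versus 100 outages over 10,000 address-hours of baseline weather.
excess, ratio = outage_inflation(30, 1000.0, 100, 10000.0)
```

Because outages caused by users disabling equipment or by address renumbering should occur at roughly the same rate regardless of weather, they largely cancel in the difference, which is why a change in rate is more informative than the raw outage count.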

This dissertation's contributions will help achieve comprehensive measurements of Internet reliability that can be used to identify vulnerable networks and their challenges, inform which enhancements can help networks improve reliability, and evaluate the efficacy of deployed enhancements over time.
