Internet-Draft | Observability & Network Measurement | August 2021 |
Arkko & Kühlewind | Expires 3 February 2022 |
A key problem regarding network quality is determining which parts of the network contribute to various performance aspects. In this paper we propose the inclusion of observability and built-in measurement capabilities in networks. This needs to be taken into account when designing protocols, and there is a need for a standardized way to request and exchange such measurements, securely and without exposing privacy-sensitive data.¶
This paper is a position paper submission to the IAB Measuring Network Quality for End-Users workshop.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 3 February 2022.¶
Copyright (c) 2021 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.¶
Network quality measurements look at the performance and capabilities of a network, network monitoring reports on the health of the network, and troubleshooting comes into play when a problem is detected. All three depend heavily on the availability of good network measurements.¶
Network monitoring often relies largely on passive observations of traffic and of changes in traffic behavior. Active measurements can enhance this information and probe for specific occurrences. However, active measurement is usually limited to the operator's own network; for certain access networks it may extend to the end user, but it usually does not cover the whole path. Network quality measurements, on the other hand, are far too often focused on momentary throughput measurements.¶
Because of these constraints, as also noted at the IAB workshop on COVID-19 network impacts, often neither users nor service providers have the ability to understand where a particular problem arises or what the limiting factor for better performance is [I-D.iab-covid19-workshop]:¶
"It's clear that it's difficult for application providers and operators to isolate problems. Is a problem due to the local WiFi, the access network, cloud network, etc.? Metrics from access points would help, but in general lack of observability into the network as a whole is a real concern when it comes to debugging performance issues."¶
Further, it's often the behaviour of the applications themselves that makes observability difficult. As noted in [I-D.iab-covid19-workshop]:¶
"These types of applications use surprising amounts of Forward Error Correction (FEC). Applications hide lots of loss to ensure a good user experience. This makes it harder to observe problems. The network can be behaving poorly, but experience can be good enough. Resiliency measures can improve the user experience but hide severe problems. There may be a missing feedback loop between application developers and operators."¶
A key problem that many of us have struggled with is determining which parts of the network contribute to various performance aspects. For instance, a poor-quality video conference may be due to issues in the server system, in the cloud farm that it runs on, somewhere along the path(s), in the user's home network or WiFi, or in the user's equipment.¶
Some information relating to issues can be determined with commonly available tools such as traceroute, but in general, it is difficult to know where the issues are, in particular without collaboration from at least some parts of the associated network paths.¶
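As a rough illustration of what such tools can and cannot show, the following Python sketch takes per-hop round-trip times of the kind traceroute reports and picks the hop with the largest increase over the previous responding hop. The function name, hop names, and values are made up for illustration. Per-hop round-trip times are noisy, and hops that do not answer leave gaps, so the result is a hint at best.

```python
from typing import List, Optional, Tuple


def largest_rtt_increase(
    hops: List[Tuple[str, Optional[float]]]
) -> Optional[Tuple[str, float]]:
    """Return (hop, increase_ms) for the hop with the largest RTT increase
    over the previous responding hop, or None if no hop answered.

    hops: ordered (name, rtt_ms) pairs; rtt_ms is None for a silent hop.
    """
    best = None
    prev_rtt = 0.0
    for name, rtt in hops:
        if rtt is None:
            # Silent hop: its contribution is folded into the next
            # responding hop, which is one reason the result is only a hint.
            continue
        increase = rtt - prev_rtt
        if best is None or increase > best[1]:
            best = (name, increase)
        prev_rtt = rtt
    return best


# Hypothetical path; "core" did not answer the probes.
example = [
    ("home-gw", 2.0),
    ("access", 9.0),
    ("core", None),
    ("peer", 48.0),
    ("server", 50.0),
]
```

For the hypothetical path above, `largest_rtt_increase(example)` points at "peer", but the increase it reports also covers the silent "core" hop, so the tool alone cannot say which of the two segments is responsible.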
Problems with observability do not stop here, however. The ability of the network to support important features, such as carrying specific transport protocols, using IPv6, exchanging information with applications when congestion is detected, or using secure DNS protocols, is not always readily apparent. It is certainly not something that is easily uncovered merely by performing a throughput measurement.¶
In addition, some things that can be visible to a host provide no indication of how something is treated in the network. For instance, the host may be able to determine that the first link in the network uses encryption, hiding the traffic from outsiders. However, this provides no indication of whether the user's information is secure beyond the first link. For instance, subsequent links can be unprotected.¶
Collecting the right measurement data is a major challenge. Another challenge is how to interpret the data correctly and provide it to the entities that can act on it, either for troubleshooting or for improving future usage. To enable everybody to draw the right conclusions, it is especially important to correlate data from different sources. For example, a network operator might not see increased loss if the endpoints adapt their rate accordingly; still, network optimization could help certain traffic utilize the available resources more efficiently.¶
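The operator example above can be sketched in a few lines of Python. This is only an illustration of why two views need to be combined: the class names, field names, thresholds, and result strings are all assumptions made for this sketch and are not taken from any existing tool or specification.

```python
from dataclasses import dataclass


@dataclass
class OperatorView:
    loss_rate: float       # fraction of packets lost on the monitored link


@dataclass
class EndpointView:
    desired_kbps: float    # rate the application would like to send at
    achieved_kbps: float   # rate it settled on after adapting


def interpret(op: OperatorView, ep: EndpointView,
              loss_threshold: float = 0.01,
              shortfall_threshold: float = 0.2) -> str:
    """Combine both views; the thresholds are arbitrary illustrative values."""
    if ep.desired_kbps <= 0:
        raise ValueError("desired_kbps must be positive")
    shortfall = 1.0 - ep.achieved_kbps / ep.desired_kbps
    lossy = op.loss_rate >= loss_threshold
    constrained = shortfall >= shortfall_threshold
    if lossy and constrained:
        return "visible congestion"
    if constrained:
        # The endpoint backed off, so the operator alone sees little loss.
        return "hidden constraint"
    if lossy:
        return "loss absorbed by the application"
    return "no issue indicated"
```

With an operator-side loss rate of 0.1% and an endpoint that wanted 4000 kbit/s but settled on 1500 kbit/s, neither view alone indicates a problem, while the combination returns "hidden constraint".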
Shared measurement data may relate to observations about a particular flow at different points along a path, but it may also be about aggregate information relating to the overall traffic situation (such as queue or congestion status), or the capabilities of the network nodes.¶
To exchange data, standardized definitions of measurements (e.g., [RFC7679]) and communication protocols are needed, as well as standardized formats (e.g., QLOG [I-D.ietf-quic-qlog-main-schema]).¶
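To make the role of shared definitions concrete, here is a minimal Python sketch in the spirit of the one-way delay metric of [RFC7679]: the delay of a packet is its receive time minus its send time, and it is treated as undefined when the packet never arrives. The function names and the JSON field names are assumptions made for this sketch; they are not taken from [RFC7679] or from QLOG.

```python
import json
from statistics import median
from typing import Dict, Optional


def one_way_delays(sent: Dict[int, float],
                   received: Dict[int, float]) -> Dict[int, Optional[float]]:
    """Map packet id -> one-way delay in seconds, or None if never received.

    Assumes sender and receiver clocks are synchronized; any clock error
    shows up directly as error in the result.
    """
    return {pid: (received[pid] - t_sent if pid in received else None)
            for pid, t_sent in sent.items()}


def summarize(delays: Dict[int, Optional[float]]) -> str:
    """Serialize a small report; the field names are illustrative only."""
    defined = [d for d in delays.values() if d is not None]
    report = {
        "metric": "one-way-delay",
        "unit": "s",
        "samples": len(delays),
        "lost": len(delays) - len(defined),
        "median": median(defined) if defined else None,
    }
    return json.dumps(report, sort_keys=True)
```

Two parties can only compare such reports if they agree on what "delay" means, how lost packets are counted, and which units and field names are used, which is the point of standardizing both the metric and the format.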
A high-quality network is capable of leveraging a number of features and connectivity options, such as:¶
For each of these categories, there may be additional parameters that are of interest as well, such as various timing parameters related to how long NAT or firewall entries are kept.¶
Discovering these capabilities can generally be accomplished in two ways: either by probing whether a given mechanism exists and what its characteristics are, or by asking the network. Some information may be available in router discovery packets and DHCP responses.¶
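The probing approach can be sketched as follows in Python. The example probes whether a TCP connection can be established over IPv6, and a small runner records the outcome of each probe. The function names and result labels are hypothetical. A failed probe is reported as "not observed" rather than "unsupported", since a single failure does not prove that the capability is absent on the path.

```python
import socket
from typing import Callable, Dict


def ipv6_tcp_probe(host: str, port: int = 443, timeout: float = 2.0) -> bool:
    """Try a TCP connection over IPv6 only; True if one is established."""
    infos = socket.getaddrinfo(host, port, socket.AF_INET6, socket.SOCK_STREAM)
    for family, socktype, proto, _canonname, sockaddr in infos:
        with socket.socket(family, socktype, proto) as s:
            s.settimeout(timeout)
            try:
                s.connect(sockaddr)
                return True
            except OSError:
                continue  # try the next address, if any
    return False


def run_probes(probes: Dict[str, Callable[[], bool]]) -> Dict[str, str]:
    """Run each probe and record 'supported', 'not observed', or 'unknown'."""
    results = {}
    for name, probe in probes.items():
        try:
            results[name] = "supported" if probe() else "not observed"
        except OSError:
            # Includes name resolution failures: the probe could not run.
            results[name] = "unknown"
    return results
```

A caller might invoke `run_probes({"ipv6-tcp-443": lambda: ipv6_tcp_probe("example.com")})`. Timing parameters such as NAT binding lifetimes would need longer-running probes and are not covered by this sketch.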
However, knowing in advance whether a certain path supports a certain service is difficult, in part because paths change dynamically. Built-in measurement capabilities that collect information on the fly can help to detect capabilities or problems that require additional troubleshooting.¶
Some aspects of security are readily visible to end hosts. The host knows what end-to-end protocols and security it runs, and it may have internal APIs to determine what type of connectivity security is being applied. For instance, both WiFi and mobile network stacks on the end host are aware of what security is being applied.¶
Other aspects of security may have to be discovered. For instance, the network may offer a DNS resolver address, but whether that resolver supports a secure protocol may have to be discovered through a protocol mechanism such as [I-D.ietf-add-ddr].¶
But even when a particular network connectivity or support protocol is found to employ security, it provides no indication of how the user's information is treated by the server in question or by the rest of the network. For instance, the host may communicate securely with a DNS resolver that still leaks the user's browsing history to outsiders.¶
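As an illustration of the resolver discovery step mentioned above, the mechanism in [I-D.ietf-add-ddr] has the client ask the resolver it was given for SVCB records at the special name "_dns.resolver.arpa". The Python sketch below only builds such a query in DNS wire format. Sending it to the resolver, parsing the answer, and verifying the designated resolver are not shown, and the function name is hypothetical.

```python
import struct

SVCB = 64   # DNS resource record type SVCB
IN = 1      # DNS class IN


def build_ddr_query(query_id: int = 0) -> bytes:
    """Build a DNS query for SVCB records at _dns.resolver.arpa."""
    # Header: ID, flags (only RD set), one question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: length-prefixed labels terminated by the root label.
    qname = b"".join(
        bytes([len(label)]) + label
        for label in (b"_dns", b"resolver", b"arpa")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", SVCB, IN)
```

Even a successful discovery of this kind only tells the host that an encrypted transport to the resolver is available; as noted above, it says nothing about how the resolver treats the queries it receives.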
In some cases it may be possible for a network node to provide an attestation that it runs a particular software and does not leak information outside a trusted execution environment (see [I-D.arkko-dns-confidential]).¶
In general, endpoints would benefit not only from seeing claims about specific features or performance, but also from getting some assurance that the claims are valid. Similarly, endpoints need to be careful about exposing information related to the user to the network (see, e.g., the advice in [RFC8558]). This needs to be considered in protocol design.¶
To address this problem and improve visibility of network quality, we need to consider observability and built-in measurement capabilities when designing protocols and networks. We need standardized definitions and ways to request and exchange such measurement data, and this needs to happen securely and without exposing privacy-sensitive data.¶
The ability to observe the behaviour of the Internet connection extends far beyond immediate or momentary speed measurements. In particular, localising a problem is challenging as multiple parties are involved. As such, built-in measurement capabilities and ways to exchange measurement data securely are the basis for improved observability.¶
Some of the tools that may assist in better observability include¶
The authors would like to thank the participants of the 2020 IAB COVID-19 Network Impacts Workshop for interesting discussions in this problem space.¶