Thursday, September 24, 2009

Net Neutrality Explained--Again! Differently!

In the wake of FCC Chairman Genachowski’s speech, there’s been a lively network neutrality discussion going on over at Obsidian Wings, here, here, and (though I didn’t comment on these) here and here and here.

I’m still very concerned that the technical issues associated with network neutrality are falling through the cracks. Since there appears to be a limit on how big a comment you can post over at OW, I thought I’d spend some time and describe some of the issues.

The net neutrality debate of course dates back quite a while. A good summary can be had by looking at the Wu and Lessig brief to the FCC from 2003. To boil this all down, and to describe what the FCC is worried about, there seem to be three major bones of contention:

  1. That network management policies be open, so that the public can understand how traffic is being managed by any particular ISP. This is a pretty simple transparency issue.

  2. That access to an ISP’s network be available to all content providers, so that the ISP can’t favor, for example, its own content over the content of one of its competitors, or so that small providers have equal access to big providers. Let’s call this location neutrality.

  3. Finally, that ISPs practice application neutrality as well as location neutrality. The idea here is to guarantee that the internet is available as a public utility, with access guaranteed for any application--and protocol--that comes along.

Item #1 is just motherhood and apple pie, mod obfuscating enough information that an attacker can't exploit it. Transparency is good.

Item #2 was stimulated by perfectly reasonable public policy concerns. We don’t want the big providers to get bigger at the expense of the little guys, and we don’t want giant media conglomerates using their vertical content and media integration as a weapon against more horizontal competitors. All great.

At first blush, item #3 seems to make lots of sense. We want the internet to be available to the next clever fellow who invents the next killer app, right? He should be able to count on well-defined access to the underlying network infrastructure, right? Wu and Lessig use the analogy of the power grid to make this point. An electronics manufacturer can count on the power grid to deliver 110V, 60 Hz current anywhere in the US. Why shouldn’t the internet provide the same platform for producers of network applications?

Well, there’s a problem. In fact, I can think of two problems, one pretty simple and the other definitely not-so-simple.

The first problem is that, while we think of our ISPs providing us internet service, what they mostly provide is web service. In 2007, HTTP traffic comprised 46% of all internet traffic. Back in the late 90’s that number was probably closer to 85%.

But in between, a little thing called peer-to-peer (P2P) file sharing came along, with BitTorrent being the killer app.

The vast majority of ISP subscribers use the web for pretty much everything, so ISPs optimized their traffic engineering for web applications. Web access is very simple, but it’s highly asymmetric: You make a small request and you get back a large amount of data. ISPs therefore heavily biased their network traffic toward downloading data from the core network, rather than uploading data to it. I’m currently on Time Warner Roadrunner, and my download speed is 15 Mbps, but my upload speed is only 1 Mbps.

But a relatively small number of users use the internet for P2P file sharing. That application is so bandwidth-intensive that in 2007 it accounted for 37% of all traffic.

Access ISPs hate BitTorrent. BitTorrent uploads and downloads nearly symmetrically, and it uploads and downloads a lot. If you’re on a DOCSIS cable system for internet and you have a P2P aficionado in your neighborhood, there’s a pretty good chance that he’s consuming a sizable chunk of the upload bandwidth available.

ISPs attempted to solve this problem by throttling BitTorrent. All they have to do is drop the occasional packet, or even de-prioritize the traffic at the router, and BitTorrent uploads and downloads slow to a crawl. Torrent-heads responded by encrypting their BitTorrent flows, so it was harder to do packet inspection to discover which flows were true P2P traffic.

And then Comcast decided to send TCP reset messages to BitTorrent flows, causing them to abort. And, as if this weren’t guaranteed to cause unbridled rage, they then denied they were doing it. Until they got caught, of course.

BitTorrent is, of course, a fine poster child for application neutrality. If the FCC were to adopt an app neutrality policy, ISPs would no longer be able to throttle BitTorrent. They would probably have to respond by changing the download/upload bandwidth mix, which would require deploying a lot more network equipment and force the price of broadband up. Maybe that wouldn’t be so bad. I wouldn’t be happy about it, but it wouldn’t destroy the internet or stifle innovation or any of those things that anti-net neutrality folks get the vapors about.

But the second problem will do those things. The poster child application for this problem is voice and video over the internet (VVoIP), but the problem really applies to any communication application that needs to have real time communication flows. We’re about to go down the rabbit hole here, folks. Any of you who faint at the sight of internet acronyms should leave the room now.

So far, we’ve talked about the web and P2P apps, both of which use the internet’s Transmission Control Protocol (TCP). TCP was invented back in the late 1970’s and has been the dominant internet transport ever since. It provides an end-to-end reliable byte stream between two applications. To do so, it takes the data handed to it by an application and chops it up into Internet Protocol (IP) packets, encapsulating each one with enough information that the receiver’s TCP can re-assemble the original byte stream. If packets get dropped, or duplicated, or arrive out of order, that information is sufficient to put everything back together or, if necessary, request that the sender re-send some of the packets.
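If you want to see just how much work TCP hides, here’s a minimal sketch in Python (the host and request are just for illustration): the application writes and reads a byte stream, and all the packetizing, retransmission, and re-assembly happen invisibly in the kernel.

```python
import socket

# TCP presents the application with a reliable byte stream. The kernel's
# TCP stack handles chopping it into packets, retransmitting lost ones,
# and re-assembling everything in order at the far end.
def fetch(host: str, port: int = 80) -> bytes:
    with socket.create_connection((host, port)) as sock:
        # One write; TCP may split this across any number of IP packets.
        sock.sendall(b"GET / HTTP/1.0\r\nHost: " + host.encode() + b"\r\n\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)  # bytes arrive in order, gaps already filled
            if not data:            # empty read means the sender closed
                break
            chunks.append(data)
    return b"".join(chunks)

print(fetch("example.com")[:80])
```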

TCP accomplishes its reliable transfer with two additional, highly important properties. One is flow control: this simply means that, if the receiver runs out of memory into which to receive the byte stream, it has a way to tell the sender to stop sending until memory becomes available. The other property is called congestion control. These two properties, or more precisely the lack of them in other transport protocols, are going to be a problem for application neutrality.

Everybody knows that the internet is built out of routers, which are pretty easy to understand at a basic level. A router receives IP packets (not messages) from one or more network interfaces, stores them into memory, then forwards them as fast as it can out some other set of network interfaces. There can be more input interfaces than output interfaces, or the inputs can be faster than the outputs. When this happens, packets build up in memory until the router runs out of space. Unlike TCP, the IP packets that the router deals with don’t have flow control, so the router can only drop them on their little pointed packet heads, and the receiver has to decide what to do about the missing data.
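Here’s a toy model of a router output queue, just to make that behavior concrete. (This is an illustrative sketch, not how real router hardware works.)

```python
from collections import deque

class DropTailQueue:
    """Toy router output queue: store packets until the buffer is full,
    then silently drop new arrivals. IP gives the router no way to tell
    senders to slow down."""

    def __init__(self, capacity: int):
        self.buf = deque()
        self.capacity = capacity
        self.dropped = 0

    def enqueue(self, packet) -> bool:
        if len(self.buf) >= self.capacity:
            self.dropped += 1   # no flow control at the IP layer: just drop
            return False
        self.buf.append(packet)
        return True

    def dequeue(self):
        # Forward the oldest stored packet, if any, out an output interface.
        return self.buf.popleft() if self.buf else None
```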

When TCP packets get dropped, the receiver simply never sends an acknowledgment (ACK) packet for them back to the sender. After a while, the sender decides to re-transmit its currently unacknowledged packets. But now imagine that a router gets congested, so that a whole bunch of TCP connections lose some packets simultaneously. Odds are, all of the TCP senders will re-send data at the same time, making the router even more congested. If they keep doing this, the router undergoes what we somewhat euphemistically call “congestive collapse.” You’d probably say that the internet gets broken real bad.

To avoid this problem, TCP has a congestion avoidance algorithm called “slow start.” I’m not going to go into this in great detail (you can look it up), but it works something like this: When TCP starts sending data, it will only send one or two packets at a time without waiting for an ACK packet. If it gets an ACK and it has more stuff to send, it then doubles the number of packets it’s willing to send without an ACK, and so on, up to some fairly large number of packets that it’s willing to send. But if it loses even a single ACK, it infers that it’s encountered congestion and drops all the way back to sending one or two packets at a time, then gradually works its way back up, but only to some average value where it knows that it’s likely to start losing ACKs. As a result, TCP senders send a lot less data to congested routers until the congestion condition clears for some reason.
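For the curious, here’s roughly what that window logic looks like as a toy Python sketch. Real TCP implementations (see RFC 5681) count bytes rather than packets and have more machinery, but this captures the doubling-then-backing-off shape I just described.

```python
class SlowStartSender:
    """Toy model of TCP's congestion window (cwnd), counted in packets.
    Real TCP (RFC 5681) counts bytes and has more states; this is just
    the basic shape of the algorithm."""

    def __init__(self):
        self.cwnd = 2        # start by sending only a couple of packets
        self.ssthresh = 64   # where doubling gives way to gentle probing

    def on_ack_round(self):
        if self.cwnd < self.ssthresh:
            self.cwnd *= 2   # slow start: double the window each round trip
        else:
            self.cwnd += 1   # congestion avoidance: creep upward carefully

    def on_loss(self):
        self.ssthresh = max(self.cwnd // 2, 2)  # remember where trouble hit
        self.cwnd = 2        # drop all the way back and climb again
```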

It’s pretty simple, and it works surprisingly well. It works so well that ISPs often implement a separate algorithm on their routers called “random early drop” (RED). RED is designed to smooth out congestion so that not all TCP streams drop into slow start at the same time. Simultaneous backoff is very inefficient, and it leads to something called tail-drop synchronization, which, suffice it to say, is yet another way that a bunch of TCP senders can unintentionally gang up on a poor defenseless router and send it off to gibber in the corner.
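The RED idea, as a toy sketch with made-up threshold numbers: as the average queue grows, the router drops a random sprinkling of packets, so senders back off at different times instead of in lockstep.

```python
import random

def red_should_drop(avg_queue: float, min_th: float = 5.0,
                    max_th: float = 15.0, max_p: float = 0.1) -> bool:
    """Toy RED decision: below min_th, never drop; above max_th, always
    drop; in between, drop with a probability that ramps up linearly,
    so TCP senders hit slow start at different times."""
    if avg_queue < min_th:
        return False
    if avg_queue >= max_th:
        return True
    p = max_p * (avg_queue - min_th) / (max_th - min_th)
    return random.random() < p
```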

This is why, when your neighbor is uploading porn via BitTorrent, you’re only muttering under your breath about how slow the network is, as opposed to calling your ISP and telling them that it’s broken. BitTorrent uses TCP and therefore obeys slow start, which keeps the router network merely close to overload instead of actually overloaded.

Note that this also works perfectly well when you're watching video on YouTube or Hulu. In this case, the video is being sent from a server that reads it off of the disk and sends it over HTTP (which uses TCP). The receiver receives the stream of bytes that make up the video, decodes them, then renders them on your screen and speakers.

But there’s a wrinkle. Imagine that your browser started playing the video as soon as the first packet arrived. As long as the video stream continued to be received at exactly the right rate, each new video frame and audio sample would arrive just in time to be rendered, and you’d see perfect video and hear perfect audio. Maybe the sender can send the video faster than the receiver can play it. That’s no problem; eventually the receiver will run out of memory to store the byte stream, and TCP flow control will kick in.

But now imagine that the sender is sending at exactly the right rate, with many packets between each ACK, and your video stream suddenly gets subjected to TCP slow start because an ACK gets lost. Suddenly, instead of sending one packet right after another, the sender can send only one or two packets before it has to wait for an ACK, which will at least cause “jitter”, an irregularity in the spacing of the received TCP packets. If there’s enough jitter, the receiver will run dry: the next video or audio frame won’t be available when it needs to be rendered. Your video will freeze, or your audio will sound like it’s coming from the bottom of the sea, or through a fan.

Fortunately, this streaming video is stored on disk--it's not occurring in real time. So all the receiver has to do to avoid this problem most of the time is to wait a couple of seconds before playing the video out. Then, if there’s jitter, the data that’s already in the buffer will tide the player over until more data is received.
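A sketch of that buffering trick, with an assumed two-second preroll and a made-up bit rate: don’t start playing until a cushion of data has arrived, and from then on jitter eats into the cushion instead of stalling the video.

```python
from collections import deque

class PlayoutBuffer:
    """Toy playout buffer: delay the start of playback until a cushion
    of data has arrived, so later jitter is absorbed by the cushion."""

    def __init__(self, preroll_seconds: float = 2.0, bitrate_bps: int = 2_000_000):
        self.preroll_bytes = int(preroll_seconds * bitrate_bps / 8)
        self.buf = deque()
        self.buffered = 0
        self.playing = False

    def on_receive(self, chunk: bytes):
        self.buf.append(chunk)
        self.buffered += len(chunk)
        if not self.playing and self.buffered >= self.preroll_bytes:
            self.playing = True          # only start once the cushion exists

    def next_chunk(self):
        if not (self.playing and self.buf):
            return None                  # buffer ran dry: a visible glitch
        chunk = self.buf.popleft()
        self.buffered -= len(chunk)
        return chunk
```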

But what if there is no disk? What if the source of the video and audio is another human being? Two additional constraints now apply:
  1. The sender can’t ever send the stream faster than real time to make up for past or future jitter, and

  2. The receiver can’t wait to render the video or audio. If it does, the two humans can’t have a conversation. Research shows that human conversation starts to suffer (people tend to try to talk at the same time a lot) when the delay from the time that sound leaves a person’s lips to when it reaches the other person’s ears is more than about 150 milliseconds.

Now jitter becomes the dominant constraint on the quality of the real time application. If you transport the voice and video over TCP, it can induce fairly large amounts of jitter, sometimes more than a second’s worth, which translates into more than a second that the receiver has to delay.

Since TCP is largely unsuitable, the IETF, those fine folks who recommend protocols for the internet, invented something called the Real-time Transport Protocol (RTP). RTP is quite different from TCP. It’s designed to get media from the sender to the receiver as fast as possible, with enough information added so that the receiver can decide if any data is missing, and so it can reconstruct the timing of the stream, playing it out at exactly the same rate at which it was captured by the sender. But this real-time behavior comes at a cost: RTP can't re-send dropped data, and it has neither flow control nor congestion control.
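To make that concrete, here’s a much-simplified sketch of the RTP idea in Python. The real header (RFC 3550) also carries a version, payload type, SSRC, and more, but the heart of it is a sequence number for detecting loss and a timestamp for reconstructing timing.

```python
import struct

# Simplified RTP-style header: a sequence number (to detect loss and
# reordering) and a media timestamp (to reconstruct the original timing).
HEADER = struct.Struct("!HI")  # network byte order: uint16 seq, uint32 ts

def pack_packet(seq: int, timestamp: int, payload: bytes) -> bytes:
    return HEADER.pack(seq & 0xFFFF, timestamp & 0xFFFFFFFF) + payload

def unpack_packet(data: bytes):
    seq, ts = HEADER.unpack_from(data)
    # The receiver plays the payload out according to ts, and conceals
    # (rather than re-requesting) any sequence numbers that never arrive.
    return seq, ts, data[HEADER.size:]
```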

At the router level, RTP packets are just like TCP packets: They arrive, they get stored, and they get sent. And, just like TCP packets, they’ll get dropped if the router is congested.

Dropping a whole bunch of RTP packets in a row is bad. The receiver is capable of losing a packet here and there and hiding the fact from the user. But if you lose a lot of packets, at some point the user sees the video break up or the audio start to drop out, or echo, or sound like something out of a satanic ritual.

Even if the router doesn’t drop RTP packets, they still have to wait their turn to get sent. When your aforementioned neighbor has decided to download or upload porn, there may be hundreds of his TCP porn packets to each one of your voice and video packets. When the RTP packets get delayed behind other traffic, that can cause jitter. This makes the receiver have to buffer more data before it can start to play out, which translates to delay, which translates to reduced quality.

The other extreme is bad, too. Since RTP doesn't have any congestion control, if too many people fire off real time applications at the same time, the router is going to get swamped, and there's no way for it to tell RTP senders to shut up. Instead, the network has to perform some kind of admission control before the RTP application starts up. This is a kind of "mother may I?" step, where some service owned by the ISP does some accounting (very, very complicated accounting, it turns out) and decides whether your video will be the straw that breaks the camel's back.
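Here’s a toy version of that accounting, with a made-up “real time share” of the link; real admission control schemes are, as I said, very much more complicated.

```python
class AdmissionController:
    """Toy 'mother may I?' check: track the real time bandwidth reserved
    on a link and refuse flows that would push it past a safe fraction.
    Real schemes do far more elaborate accounting than this."""

    def __init__(self, link_bps: int, rt_fraction: float = 0.3):
        self.budget_bps = int(link_bps * rt_fraction)  # real time share of link
        self.reserved_bps = 0

    def request(self, flow_bps: int) -> bool:
        if self.reserved_bps + flow_bps > self.budget_bps:
            return False           # your call would break the camel's back
        self.reserved_bps += flow_bps
        return True

    def release(self, flow_bps: int):
        self.reserved_bps = max(0, self.reserved_bps - flow_bps)
```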

(At this point, some of you are no doubt yelling at your screen, "There ain't no stinkin' admission control when I use Skype, man! You're crazy!" Well, both of those statements may be true, but the reason that there's no admission control in the public net is that nobody is currently running very much high-def live video in the public net. These admission control schemes are used all over enterprise networks and are a key component of any modern VoIP enterprise PBX. When the public net grows enough video--or other types of real time applications--the ISPs will have to do this.)

Once we have admission control, we still have to have some way to let RTP packets be less likely to get dropped and to “jump the line” when the queue of packets waiting to be sent is too long. There are many ways to accomplish this, but the most popular is to mark the RTP packets with something that the router can recognize in the incoming IP packet. This marking is placed in an IP header field called the “differentiated services code point” (DSCP). The best way to think of DSCP is that it’s a priority that goes with the packet.
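An endpoint can apply such a marking itself. Here’s a sketch using Python’s socket API, marking a UDP socket with EF (Expedited Forwarding, DSCP 46), the code point commonly used for voice; the address and port are placeholders, and IP_TOS availability varies by platform.

```python
import socket

# DSCP lives in the top six bits of the old IP TOS byte, so the value
# gets shifted left by two. IP_TOS works on Linux; other platforms differ.
DSCP_EF = 46  # Expedited Forwarding, commonly used for voice

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, DSCP_EF << 2)
sock.sendto(b"rtp payload goes here", ("192.0.2.1", 5004))  # placeholder address
```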

Routers don’t have to obey the DSCP markings on packets, but they can if they wish to provide differentiated quality of service (QoS). But there’s yet another problem here.

Imagine that I’m an ISP and I announce to all my customers and to the various content providers, “I’m going to support DSCP on my routers.” What is the likely response? It’s usually something like, “Yippee! DSCP! If I mark my packets with high-priority DSCP, my application will get better service!” You get a form of DSCP inflation, where the markings mean less and less, because there’s no marginal cost attached to using them. Pretty soon, the applications that really need differentiated QoS are crowded out by the ones that don’t really need it but decide to use it to give themselves an edge.

ISPs can solve this problem by charging for QoS, or they can find a way to enforce admission control. They may charge their customers for a premium plan, just like they charge for faster modem speeds today. Or they may charge by the amount of DSCP-marked data that gets sent or delivered.

However the ISP does it, you’ll now understand that the terms “differentiated QoS” and “application neutrality” are not the best of friends. And yet the fact remains: real time applications will simply stop working without DSCP if the network becomes congested.

If application neutrality becomes an FCC-mandated regulation, there is simply no way to provide the real time services that are one of the major sources of innovation on the internet today. Note that VoIP works today because its bit rate is quite low, but even today you can wind up with significant delay. Live video is in its infancy and I fear that it won’t live to childhood if app neutrality is required. Beyond that, there are lots of real time applications that could grow to be significant. There’s obviously real time gaming, which is already taking off, albeit without tight real time constraints yet. How about tele-operation of industrial robots? Or surgical robots in underdeveloped regions where it’s hard to get a top-notch surgeon to visit? How about a service where somebody else drives your car for you, or merely prevents you from crashing? All of these applications, plus many more that nobody’s thought of, will be in jeopardy.

So what are the consequences of forgoing mandated app neutrality? Well, the big one is that application developers have to think about how the internet works before they engineer something new. Is this really so bad? Don’t engineers do that already?

Of course they do. If they don’t, two things happen. First, their application may not work. But even if it does work, it may be a sufficiently bad citizen that ISPs hate it. (Cf. BitTorrent above.) Even worse, users that don't use it but are affected by it may hate it. Unless it’s really, really useful, it’s unlikely to gain any traction.

On the other hand, sometimes a useful application comes along that requires new features in the internet, like the real time apps I've been talking about. Those new features aren’t free, but ISPs will implement them if there’s a business case for them. That can’t happen if application neutrality forbids the ISP from innovating features that have to be constrained to a particular class of traffic. Each new class of traffic comes with its own engineering requirements. The ISP has to be free to implement those requirements and choose a business model that makes it worthwhile.

Update 9/24/09 11:07 PM: Fixed some typos and broken links.

Update 10/7/09 5:23 PM: Yet another net neutrality thread here.
