Video — thirty years of breaking the web's assumptions
For a long time, video on the web didn’t really have anything to do with the web itself. The O.G. solutions available around the turn of the millennium were the RealPlayer and QuickTime plugins, both of which are largely best forgotten to avoid triggering any old traumas. (If you remember how the <embed> and <object> tags worked, then you are officially old, congratulations — may your knees outlive the QuickTime brand which still lingers as a zombie in the macOS video player app.)
Browser plugins were effectively independent applications that could render into a frame within a web page. Like a blank zone on a medieval map that reads “Here Be Dragons”, they could contain absolutely anything. As native programs with full access to the operating system, plugins could freely mess around behind the browser’s back and without its knowledge: create sockets, read and write files, send the contents of your hard drive to an IP address in Belarus. Not coincidentally, these were also the golden years of malware.
The “sockets” and “render into a frame” parts were useful for video, though. At that time, browsers didn’t have any APIs that could render even basic 2D graphics in realtime (Canvas wouldn’t be introduced by Apple’s Safari until some years later). The only networking interface was making an HTTP request. This assumed a single immediate response, but it could tortuously be stretched into a realtime connection in “long polling” mode, where the server left the response open and kept writing to it. Some early webcams even used this to send a video feed as a series of JPEG frames, but not all browsers supported this kind of live-updating image.
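To make the JPEG trick a bit more concrete: one common way such feeds were served was an HTTP response with the content type multipart/x-mixed-replace, where every new part replaces the previous image. Here is a minimal sketch in Python of how the response body could be framed. The function name `mjpeg_parts` and the boundary string are made up for illustration, and a real server would also need to pace the frames and handle disconnects.

```python
from typing import Iterable, Iterator

# Hypothetical boundary token; any string that cannot appear in the headers works.
BOUNDARY = b"frame"

# Sent once as the Content-Type header of the (never-ending) HTTP response.
CONTENT_TYPE = b"multipart/x-mixed-replace; boundary=" + BOUNDARY


def mjpeg_parts(frames: Iterable[bytes]) -> Iterator[bytes]:
    """Wrap each encoded JPEG frame as one part of the multipart body."""
    for jpeg in frames:
        yield b"".join([
            b"--", BOUNDARY, b"\r\n",
            b"Content-Type: image/jpeg\r\n",
            b"Content-Length: ", str(len(jpeg)).encode("ascii"), b"\r\n",
            b"\r\n",          # blank line ends the part headers
            jpeg,
            b"\r\n",
        ])
```

The server keeps the connection open and writes one such part per captured frame. A browser that understands the format swaps the displayed image each time a part completes, which is exactly the "live-updating image" behavior that was not universally supported.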
A browser plugin didn’t have to play any of those games. It could open a proper socket to receive encoded video and audio data, decode those in native C code, write pixels into the frame it had within the browser window, and play sound to the user’s audio device. Presto, streaming video playback. And for the socket part, a crucial implementation detail was that a native program could use UDP instead of TCP.
Most Internet protocols including HTTP are built on TCP. It’s highly resilient to lost packets: if/when some data goes missing as it bounces around the Internet’s series of beautiful shiny pipes, the server will eventually find out and will retransmit those TCP packets. This is generally the right thing to do when you’re loading search results or viewing flight details or receiving a WhatsApp message from your long-lost teenage love — you want the data to arrive regardless of whether it takes 30 milliseconds or three seconds.
But for live video streaming, you mostly don’t care about frames that should have been displayed three seconds ago. That’s ancient history, and there’s no point in retransmitting that data. UDP is a more basic kind of Internet protocol where the sender doesn’t retransmit automatically; instead, the receiver has to make its own decisions about what to do with data that might be incomplete. It’s more complicated, but it lets you keep latency low and spend bandwidth only on data that’s still worth showing.
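The kind of decision a UDP receiver has to make can be sketched in a few lines of Python. This is an illustration only, not how any particular player works: the names `Packet` and `LiveReceiver` are invented, the 200 ms limit in the usage below is arbitrary, and comparing timestamps like this assumes the sender and receiver clocks agree, which real protocols have to work around.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Packet:
    seq: int        # sequence number assigned by the sender
    send_ms: int    # when the frame was captured, in milliseconds
    payload: bytes  # encoded video data


class LiveReceiver:
    """Shows fresh frames and silently drops anything late or out of order."""

    def __init__(self, max_age_ms: int) -> None:
        self.max_age_ms = max_age_ms
        self.last_seq = -1
        self.dropped = 0

    def accept(self, pkt: Packet, now_ms: int) -> Optional[bytes]:
        """Return the payload to display, or None if the packet is useless by now."""
        too_old = now_ms - pkt.send_ms > self.max_age_ms
        already_passed = pkt.seq <= self.last_seq
        if too_old or already_passed:
            self.dropped += 1  # no retransmission request, just move on
            return None
        self.last_seq = pkt.seq
        return pkt.payload
```

A short usage example: packet 2 goes missing, packet 3 is shown anyway, and when packet 2 finally turns up it is discarded instead of holding up playback.

```python
rx = LiveReceiver(max_age_ms=200)
rx.accept(Packet(1, 0, b"a"), now_ms=50)      # shown
rx.accept(Packet(3, 100, b"c"), now_ms=150)   # shown, gap at seq 2 is ignored
rx.accept(Packet(2, 50, b"b"), now_ms=160)    # dropped: playback already moved past it
```

With TCP the application never gets to make this call, because the stack holds back packet 3 until packet 2 has been retransmitted and delivered.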
The browser plugin that put all this together in just the right way to enable novel audiovisual experiences on the web was Macromedia Flash. Precursors like QuickTime were basically just player applications that happened to be running inside a web page, but Flash let web developers create complete interactive experiences where video was just one element. It also helped that the Flash plugin shipped pre-installed on an astonishing 96% of all computers in 2003 (if there were such a thing as Tech Oscars, Macromedia would honestly deserve a posthumous Lifetime Achievement Award for “Best Partnerships Management”).
Flash shipped a video codec called Sorenson Spark as early as 2002, but nobody liked it very much: it was inefficient, slow, and had an expensive closed-source license. The video floodgates opened when Flash added a codec called VP6 in 2005. This was also the year when YouTube launched. (Indeed, YouTube’s video player would default to Flash for a full decade until 2015.)
For the next five years, everybody was mostly happy to keep web video in the desktop-centric Flash plugin sandbox. Flash soon added support for the H.264 video codec, which remains incredibly popular for all kinds of video applications today. In 2009, a Russian teenager named Andrey Ternovskiy launched an app called ChatRoulette that proved Flash could do real-time two-way video calls in a browser. No extra install, just open a link to talk on video with a complete stranger. This capability opened the door to completely new browser-based experiences (as well as completely new ways to involuntarily see someone’s genitals).
Meanwhile in Cupertino, California, a clock was ticking down. Apple’s iPhone was a runaway success and its browser intentionally didn’t have Flash. In 2010, Steve Jobs published an open letter with the polite title “Thoughts on Flash”, whose essence could be summarized by paraphrasing Cato the Elder: “Flash delenda est”, Flash must die. The future of computing was increasingly on mobile. Video would have to find another medium for its delivery on the web.
At this point Microsoft’s stranglehold on browsers was fully gone, Google was investing billions in making their Chrome browser support real apps, and the long-hibernating web standards bodies had been roused to life. HTML5 was standardizing features like Apple's Canvas API, which was already a de facto browser feature. That standard also included a <video> tag, but it was for playing pre-stored video (the kind you can buffer before playback and seek while playing). What could be done to enable apps like ChatRoulette to be built on the real web without Flash?
The answer was called WebRTC. It would bring the full glory of UDP-based audiovisual connections into web standards. The task turned out to be incredibly complicated: Google introduced the project in 2009 (when ChatRoulette was hot) and it finally shipped version 1.0 in 2021 (when COVID-19 was hot). One of the reasons why WebRTC took so long was that it’s fundamentally unlike the rest of the web’s APIs.
Around the same 2009-2015 time frame, the web had already been sprouting new protocol limbs in the form of WebSockets. This API allows a two-way connection between a client and a server to be kept open without resorting to long polling tricks, but it's still implemented on top of TCP. In practice, WebSockets let you deliver fairly small amounts of data in an orderly fashion to a receiver who’s guaranteed to get it all while the connection is open. But that’s hardly useful for high-bandwidth video calls.
WebRTC not only introduced the chaos of UDP where packets can arrive (or not) out of order and with varying latencies, but it also attempted to solve the problem of peer-to-peer (P2P) connections on the web. Before WebRTC, the notion of a web page without a server would have been meaningless. But WebRTC theoretically lets two browser clients talk to each other without ever involving a centralized intermediary. The reality is a lot messier, and most real WebRTC apps use central relay servers because it’s practically mandatory if you want to provide a service that actually works for end users… Still, P2P remains a core part of the API.
Furthermore, WebRTC was built on top of an existing complex stack of real-time internet technologies. This avoided reinventing the wheel, but it means developers trying to build directly on top of WebRTC need to deal with a lot of protocol details that come from a different era and mindset than more recent web APIs.
All this is to say that making a WebRTC-based web app is fundamentally different from other kinds of web apps, and making a WebRTC server is so different from a traditional request-response HTTP application server that they have almost nothing in common.
And this concludes our dive into 30+ years of web history since Netscape introduced the browser plugin API. The dragons of unrestricted native code have been contained, but meanwhile the world of web video has become enormously more complicated with so many layers of open protocols, different codecs, and legacy technologies that can’t be entirely avoided.
I’ve been watching digital video and the web for all of those years. This blog is my attempt to write down some of the things I’ve learned, and offer some notions for what might be happening over the next few years. And yes, the unavoidable specter of AI haunts this space too — but I’ll try to focus on realistic productive applications rather than pie-in-the-sky demos (those infamous “Hollywood is dead!” demo reels that already decorate the tombstones of failed Gen AI video apps).
New killer apps for web video have taken off in distinct waves: YouTube in 2005, Netflix in 2011, live streaming in 2016, video conferencing in 2020. We’re probably on the cusp of another wave, and I believe this one will be strongly driven by open source and open models. It’s exciting times ahead. I hope to see you here next week.