Infrequently Noted

The Performance Inequality Gap, 2023

When digital is society's default, slow is exclusionary.

December 19, 2022

This is part six of the six-part series "The Performance Inequality Gap"

TL;DR: To serve users at the 75^th percentile (P75) of devices and networks, we can now afford ~150KiB of HTML/CSS/fonts and ~300-350KiB of JavaScript (gzipped). This is a slight improvement on last year's budgets, thanks to device and network improvements. Meanwhile, sites continue to send more script than is reasonable for 80+% of the world's users, widening the gap between the haves and the have-nots. This is an ethical crisis for frontend.

2023 Content Targets
Desktop
- Devices
- Networks
Mobile
- Devices
- Networks
Developing Your Own Targets
The Performance Inequality Gap is Growing

Last month, I had the honour of joining what seemed like the entire web performance community at performance.now() in Amsterdam.

The talks are up on YouTube behind a paywall, but my slides are mirrored here^[1]:

The talk, like this post, is an update on network and CPU realities this series has documented since 2017. More importantly, it is also a look at what the latest data means for our collective performance budgets.

2023 Content Targets #

In the interest of brevity, here's what we should be aiming to send over the wire per page in 2023 to reach interactivity in less than 5 seconds on first load:^[2]^[3]

~150KiB of HTML, CSS, images, and render-blocking font resources
No more than ~300-350KiB of JavaScript

This implies a heavy JS payload, which most new sites suffer from for reasons both bad and beyond the scope of this post. With a more classic content profile — mostly HTML and CSS — we can afford much more in terms of total data, because JavsScipt is still the costliest way to do things and CPUs at the global P75 are not fast.

These estimates also assume some serving discipline, including:

No more than two HTTP connections, implying HTTP/2
Compressing text resources
Reasonable content structure

These targets are anchored to global estimates for networks and devices at the 75^th percentile^[4].

More on how those estimates are constructed in a moment, but suffice to say, it's messy. Where the data is weak, we should always prefer conservative estimates.

Based on trends and historical precedent, there's little reason for optimism that things are better than they seem. Indeed, misplaced optimism about disk, network, and CPU resources is the background music to frontend's lost decade.

Interaction-to-Next Paint measures page responsivness, and shows a large gap between desktop and mobile today — Per the 2022 Web Almanac, which pulls data from real-world devices via the CrUX dataset, today's web offers poor performance for the majority of users who are on mobile devices.

It is not an exaggeration to say that modern frontend is so enamoured of post-scarcity fairy tales that it is mortgaging the web's future for another night drinking at the JavaScript party.

We're burning our inheritance and polluting the ecosystem on shockingly thin, perniciously marketed claims of "speed" and "agility" and "better UX" that have not panned out at all. Instead, each additional layer of JavaScript cruft has dragged us further from living within the limits of what we can truly afford.

No amount of framework vendor happy talk can hide the reality that we are sending an escalating and unaffordable amount of JavaScript.

This isn't working for users or for businesses that hire developers hopped up Facebook's latest JavaScript hopium. A correction is due.

Desktop #

In years past, I haven't paid as much attention to the situation on desktops. But researching this year's update has turned up sobering facts that should colour our collective understanding.

Devices #

From Edge's telemetry, we see that nearly half of devices fall into our "low-end" designation, which means that they have:

HDDs (not SSDs)
2-4 CPU cores
4GB RAM or less

Add to this the fact that desktop devices have a lifespan between five and eight years, on average. This means the P75 device was sold in 2016.

As this series has emphasised in years past, Average Selling Price (ASP) is performance destiny. To understand our P75 device, we must imagine what the ASP device was at the P75 age.^[5] That is, what was the average device in 2016? It sure wasn't a $2,000 M1 MacBook Pro, that's for sure.

No, it was a $600-$700 device. Think (best-case) 2-core, 4-thread married to slow, spinning rust.

Networks #

Desktop-attached networks are hugely variable worldwide, including in the U.S., where the shocking effects of digital red-lining continue this day. And that's on top of globally uncompetitive service, thanks to shockingly lax regulation and legalised corruption.

As a result, we are stuck with sticking to our highly conservative estimates for bandwidth in line with WebPageTest's throttled Cable profile of 5Mbps bandwidth and ~25ms RTT.

Speeds will be much slower than advertised in many areas, particularly for rural users.

Mobile #

We've been tracking the mobile device landscape more carefully over the years and, as with desktop, ASPs today are tomorrow's performance destiny. Thankfully, device turnover is faster, with the average handset surviving only three to four years.

Devices #

Without beating around the bush, our ASP 2019 device was an Android that cost between $300-$350, new and unlocked. It featured poor single and multi-core performance, and the high-end experience has continued to pull away from it since:

<em>Tap for a larger version.</em><br>Updated Geekbench five single-core scores for each mobile price point. TL;DR: your iPhone isn't real life. — *Tap for a larger version.*
Updated Geekbench five single-core scores for each mobile price point. TL;DR: your iPhone isn't real life.

<em>Tap for a larger version.</em><br>Android ecosystem <abbr title='system-on-chip'>SoC</abbr>s fare slightly better on multi-core performance, but the Performance Inequality Gap is growing there, too. — *Tap for a larger version.*
Android ecosystem SoCs fare slightly better on multi-core performance, but the Performance Inequality Gap is growing there, too.

As you can see, the gap is widening, in part because the high end has risen dramatically in price.

The best analogue you can buy for a representative P75 device today are ~$200 Androids from the last year or two, such as the Samsung Galaxy A50 and the Nokia G11.

These devices feature:

Eight slow, big.LITTLE ARM cores (A75+A55, or A73+A53) built on last-generation processes with very little cache
4GiB of RAM
4G radios

These are depressingly similar specs to devices I recommended for testing in 2017. Qualcomm has some 'splainin to do.

5G is still in its early diffusion phase, and the inclusion of a 5G radio is hugely randomising for device specs at today's mid-market price-point. It'll take a couple of years for that to settle.

Networks #

Trustworthy mobile network data is challenging to acquire. Geographic differences create huge effects that we can see as variability in various global indexes. This variance forces us towards the bottom of the range when estimating our baseline, as mobile networks are highly contextual.

Triangulating from both speedtest.net and OpenSignal data (which has declined markedly in usefuleness), we're also going to maintain our global network baseline from last year:

9Mbps bandwidth
170ms RTT

This is a higher bandwidth estimate than might be reasonable, but also a higher RTT to cover the effects of high network behaviour variance. I'm cautiously optimistic that we'll be able to bump one or both of these numbers in a positive direction next year. But they stay put for now.

Developing Your Own Targets #

You don't have to take my word for it. If your product behavior or your own team's data or market research suggests different tradeoffs, then it's only right to set your own per-product baseline.

For example, let's say you send more HTML and less JavaScript, or your serving game is on lock and all critical assets load over a single H/2 link. How should your estimates change?

Per usual, I've also updated the rinky-dink live model that you can use to select different combinations of device, network, and content type.

<em>Tap to try the interactive version.</em> — *Tap to try the interactive version.*

The Performance Inequality Gap is Growing #

Essential public services are now delivered primarily through digital channels in many countries. This means what the frontend community celebrates and promotes has a stochastic effect on the provision of those services — which leads to an uncomfortable conversation because, taken as a whole, it isn't working. Responsible public sector organisations must develop HTML-first, progressive enhancement guidance in stark opposition to the "frontend consensus".

This is an indictment: modern frontend's fascinationg with towering piles of JavasScript complexity is not delivering better experiences for most users.

For a genuinely raw example, consider California, the state where I live. In early November, it was brought to my attention that CA.gov "felt slow", so I gave it a look. It was bad on my local development box, so I put it under the WebPageTest microscope. The results were, to be blunt, a travesty.

How did this happen? Well, per the new usual, overly optimistic assumptions about the state of the world accreted until folks at the margins were excluded.

In the case of CA.gov, it was an official Twitter embed that, for some cursed reason, had been built using React, Next.js, and the full parade of modern horrors. Removing the embed, along with code optimistically built in a pre-JS-bloat era that blocked rendering until all resources were loaded, resulted in a dramatic improvement:

Thanks to some quick work by the CA.gov team, the experience of the site radically improved between early November and mid-December, giving Californians easier access to critical information.

This is not an isolated incident. These sorts of disasters have been arriving on my desk with shocking frequency for years.

Nor is this improvement a success story, but rather a cautionary tale about the assumptions and preferences of those who live inside the privilege bubble. When they are allowed to set the agenda, folks who are less well-off get hurt.

It wasn't the embed engineer getting paid hundreds of thousands of dollars a year to sling JavaScript who was marginalised by this gross misapplication of overly complex technology. No, it was Californians who could least afford fast devices and networks who were excluded. Likewise, it hasn't been those same well-to-do folks who have remediate the resulting disasters. They don't even clean up their own messes.

Frontend's failure to deliver in today's mostly-mobile, mostly-Android world is shocking, if only for the durability of the myths that sustain the indefensible. We can't keep doing this.

As they say, any trend that can't continue won't.

Apologies for the lack of speaker notes in this deck. If there's sufficient demand, I can go back through and add key points. Let me know if that would help you or your team over on Mastodon. ↩︎
Since at least 2017, I've grown increasingly frustrated at the way we collectively think about the tradeoffs in frontend metrics. Per this year's post on a unified theory of web performance, it's entirely possible to model nearly every interaction in terms of a full page load (and vice versa).

What does this tell us? Well, briefly, it tells us that the interaction loop for each interaction is only part of the story. Recall the loop's phases:
1. Interactive (ready to handle input)
2. Receiving input
3. Acknowledging input, beginning work
4. Updating status
5. Work ends, output displayed
6. GOTO 1
Now imagine we collect all the interactions a user performs in a session (ignoring scrolling, which is nearly always handled by the browser unless you screw up), and then we divide the total set of costs incurred by the number of turns through the loop.

Since our goal is to ensure users complete each turn through the loop with low latency and low variance, we can see the colourable claim for SPA architectures take shape: by trading off some initial latency, we can reduce total latency and variance. But this also gives rise to the critique: OK, but does it work?

The answer, shockingly, seems to be "no" — at least not as practised by most sites adopting this technology over the past decade.

The web performance community should eventually come to a more session-depth-weighted understanding of metrics and goals. Still, until we pull into that station, per-page-load metrics are useful. They model the better style of app construction and represent the most actionable advice for developers. ↩︎
The target that this series has used consistently has been reaching a consistently interactive ("TTI") state in less than 5 seconds on the chosen device and network baseline.

This isn't an ideal target.

First, even with today's the P75 network and device, we can aim higher (lower?) and get compelling experiences loaded and main-thread clean in much less than 5 seconds.

Second, this target was set in covnersation back in 2016 in preparation for a Google I/O talk, based on what was then possible. At the time, this was still not ambitious enough, but the impact of an additional connection shrunk the set of origins that could accomplish the feat significantly.

Lastly, P75 is not where mature teams and developers spend their effort. Instead, they're looking up the percentiles and focusing on P90+, and so for mature teams looking to really make their experiences sing, I'd happily recommend that you target 5 second TTI at P90 instead. It's possible, and on a good stack with a good team and strong management, a goal you can be proud to hit. ↩︎
Looking at the P75 networks and devices may strike mature teams and managers as a sandbagged goal and, honestly, I struggle with this.

On the one hand, yes, we should be looking into the higher percentiles. But weaker goals aren't within reach for most teams today. If we moved the ecosystem to a place where it could reliably hit these limits and hold them in place for a few years, the web would stand a significantly higher chance of remaining relevant.

On the other hand, these difficulties stack. Additive error means that targeting the combination P75 network and P75 device likely puts you north of P90 in the experiential distribution, but it's hard to know. ↩︎
Data-minded folks will be keenly aware that simply extrapolating from average selling price over time can lead to some very bad conclusions. For example, what if device volumes fluctuate significantly? What if, in more recent years, ASPs fluctuate significantly? Or what if divergence in underlying data makes comparison across years otherwise unreliable.

These are classic questions in data analysis, and thankfully the PC market has been relatively stable in volumes, prices, and segmentation, even through the pandemic.

As covered later in this post, mobile is showing signs of heavy divergence in properties by segment, with the high-end pulling away in both capability and price. This is happening even as global ASPs remain relatively fixed, due to the increases in low-end volume over the past decade. Both desktop and mobile are staying within a narrow Average Selling Price band, but in both markets (though for different reasons), the P75 is not where folks looking only at today's new devices might expect it to be.

In this way, we can think of the Performance Inequality Gap as being an expression of Alberto Cairo's visual data lessons: things may look descriptively similar at the level of movement of averages between desktop and mobile, but the underlying data tells a very different story. ↩︎

Apple Is Not Defending Browser Engine Choice

June 23, 2022
Updated: June 27, 2022

This is part six of the six-part series "Browser Choice Must Matter"

Gentle reader, I made a terrible mistake. Yes, that's right: I read the comments on a MacRumors article. At my age, one knows better. And yet.

As penance for this error, and for being short with Miguel, I must deconstruct the ways Apple has undermined browser engine diversity. Contrary to claims of Apple partisans, iOS engine restrictions are not preventing a "takeover" by Chromium — at least that's not the primary effect. Apple uses its power over browsers to strip-mine and sabotage the web, hurting all engine projects and draining the web of future potential.

As we will see, both the present and future of browser engine choice are squarely within Cupertino's control.

Apple's Long-Standing Policies Are Anti-Diversity
"WebKit couldn't compete if it had to."
Browsers Are Big Business
WebKit Is No Charity
Choices, Choices
- Recent Developments
How Apple Gutted Mozilla's Chances
Back Of The Napkin
The Best Kind Of Correct
What Now?

Apple's Long-Standing Policies Are Anti-Diversity #

A refresher on Apple's iOS browser policies:

From iOS 2.0 in '08 to iOS 14 in late '20, Apple would not allow any browser but Safari to be the default.
For 14 years and counting, Apple has prevented competing browsers from bringing their own engines, forcing vendors to build skins over Apple's WebKit binary, which has historically been slower, less secure, and lacking in features.
Apple will not even allow competing browsers to provide different runtime flags to WebKit. Instead, Fruit Co. publishes a paltry set of options that carry an unmistakable odour of first-party app requirements.
Apple continues to self-preference through exclusive API access for Safari; e.g., the ability to install PWAs to the home screen, implement media codecs, and much else.

Defenders of Apple's monopoly offer hard-to-test claims, but many boil down to the idea that Apple's product is inferior by necessity. This line is frankly insulting to the good people that work on WebKit. They're excellent engineers; some of the best, pound for pound, but there aren't enough of them. And that's a choice.

"WebKit couldn't compete if it had to." #

Nobody frames it precisely this way; instead they'll say, if WebKit weren't mandated, Chromium would take over, or Google would dominate the web if not for the WebKit restriction. That potential future requires mechanisms of action — something to cause Safari users to switch. What are those mechanisms? And why are some commenters so sure the end is nigh for WebKit?

Past swings away from OS default browsers have hinged on the new features, better performance, improved security, and superior site compatibility. Marketing and distribution play a prominent role but have been indecisive in recent browser battles. The leads of OS incumbents are not insurmountable because browsers are commodities with relatively low switching costs. Better products tend to win.

Destkop <abbr>OS</abbr>es have long created a vibrant market for browser choice, enabling competitors not tied to OS defaults to flourish over the years. — Destkop OSes have long created a vibrant market for browser choice, enabling competitors not tied to OS defaults to flourish over the years.

Apple's prohibition on iOS browser engine competition has drained the potential of browser choice to deliver improvements. Without the ability to differentiate on features, security, performance, privacy, and compatibility, what's to sell? A slightly different UI? That's meaningful, but identically feeble web features cap the potential of every iOS browser. Nobody can pull ahead, and no product can offer future-looking capabilities that might make the web a more attractive platform.

This is working as intended:

<a href='https://developer.apple.com/app-store/review/guidelines/#2.5.6'>Apple's policies</a> explicitly prevent meaningful competition between browsers on iOS. In 2022, you can have any default you like, as long as it's as buggy as Safari. — Apple's policies explicitly prevent meaningful competition between browsers on iOS. In 2022, you can have any default you like, as long as it's as buggy as Safari.

Of all the reasons to switch browsers, compatibility is often the most compelling. Major sites asking users to switch is incredibly effective in aggregate.

"Compatibility" describes both a browser's ability to display existing content and developers' ability to rely on a set of features across browsers. Standards support is a sub-point of this latter issue but acts as a trailing indicator of engine quality.^[1]

On OSes with browser competition, sites can recommend browsers with engines that cost less to support or unlock crucial capabilities. However, developers are loathe to do this; turning away users isn't a winning growth strategy, and prompting visitors to switch is passé.

Still, in extremis, missing features and the parade of showstopping bugs render some services impossible to deliver. In these cases, suggesting an alternative beats losing users entirely.

But what if there's no better alternative? This is the situation that Apple has engineered on iOS. Cui bono? — who benefits?

All iOS browsers present as Safari to developers. There's no point in recommending a better browser because none is available. The combined mass of all iOS browsing pegged to the trailing edge means that folks must support WebKit or decamp for Apple's App Store, where it hands out capabilities like candy, but at a shocking price.

iOS's mandated inadequacy has convinced some that when proper engine choice is possible, an overwhelming majority of users will switch to browsers with better engines. This will, in turn, cause developers to skimp on testing for Safari, making it inevitable that browsers based on WebKit and other minority engines cannot compete. Or so the theory goes.

But is it predestined?

Perhaps some users will switch, but browser market changes take a great deal of time, and Apple enjoys numerous defences.

To the extent that Apple wants to win developers and avoid losing users, it has plenty of time.

It took over five years for Chrome to earn majority share on Windows with a superior product, and there's no reason to think iOS browser share will move faster. Regulatory mandates about engine choice will also take more than a year to come into force, giving Apple plenty of time to respond and improve the competitiveness of its engine. And that's the lower bound.

Apple's pattern of malaicious compliance will likely postpone true choice even futher. As Apple fights tooth-and-nail to prevent alternative browser engines, it will try to create ambiguity about vendor's ability to ship their best products worldwide, potentially delaying high-cost investment in ports with uncertain market reach.

Cupertino may also try to create arduous processes that force vendors to individually challenge the lack of each API, one geography at a time. In the best case, time will still be lost to this sort of brinksmanship. This is time that Apple can use to improve WebKit and Safari to be properly competitive.

Why would developers recommend alternatives if Safari adds features, improves security, prioritises performance, and fumigates for showstopping bugs? Remember: developers don't want to prompt users to switch; they only do it under duress. The features and quality of Safari are squarely in Apple's control.

So, given that Apple has plenty of time to catch up, is it a rational business decision to invest enough to compete?

Browsers Are Big Business #

Browsers are both big business and industrial-scale engineering projects. Hundreds of folks are needed to implement and maintain a competitive browser with specialisations in nearly every area of computing. World-class experts in graphics, networking, cryptography, databases, language design, VM implementation, security, usability (particularly usable security), power management, compilers, fonts, high-performance layout, codecs, real-time media, audio and video pipelines, and per-OS specialisation are required. And then you need infrastructure; lots of it.

How much does all of this cost? A reasonable floor comes from Mozilla's annual reports. The latest consolidated financials (PDF) are from 2020 and show that, without marketing expenses, Mozilla spends between $380 and $430 million US per year on software development. Salaries are the largest category of these costs (~$180-210 million), and Mozilla economises by hiring remote employees without offering large bonuses and stock-based compensation.

From this data, we can assume a baseline cost to build and maintain a competitive, cross-platform browser at $450 million per year.

Browser vendors fund their industrial-scale software engineering projects through integrations. Search engines pay browser makers for default placement within their products. They, in turn, make a lot of money because browsers send them transactional and commercial intent searches as part of the query stream.

Advertisers bid huge sums to place ads against keywords in these categories. This market, in turn, funds all the R&D and operational costs of search engines, including "traffic acquisition costs" like browser search default deals.^[2]

How much money are we talking about? Mozilla's $450 million in annual revenue comes from approximately 8% of the desktop market and negligible mobile share. Browsers are big, big business.

WebKit Is No Charity #

Despite being largely open source, browsers and their engines are not loss leaders.

Safari, in particular, is wildly profitable. The New York Times reported in late 2020 that Google now pays Apple between $8-12 billion per year to remain Safari's default search engine, up from $1 billion in 2014. Other estimates put the current payments in the $15 billion range. What does this almighty torrent of cash buy Google? Searches, preferably of the commercial intent sort.

Mobile accounts for two-thirds of web traffic (or thereabouts), making outsized iOS adoption among wealthy users particularly salient to publishers and advertisers. Google's payments to Apple are largely driven by the iPhone rather than its niche desktop products where effective browser competition has reduced the influence of Apple's defaults.

Against considerable competition, Safari was used by 52% of visitors to <abbr>US</abbr> Government websites from macOS devices from <time datetime='2022-03-06'>March 6<sup>th</sup></time> to <time datetime='2022-04-04'>April 4<sup>th</sup>, 2022</time> — Against considerable competition, Safari was used by 52% of visitors to US Government websites from macOS devices from March 6^th to April 4^th, 2022

The influence of a dozen years of suppressed browser choice is evident on iOS, where Safari is used 90% of the time. Apple's policies caused Mozilla to delay producing an iOS browser for seven years, and its de minimus iOS share (versus 3.6% on macOS) is a predictable result.

iOS represents 75% of all visits to <abbr>US</abbr> Government websites from Apple <abbr>OS</abbr>es — iOS represents 75% of all visits to US Government websites from Apple OSes

Even with Apple's somewhat higher salaries per engineer, the skeleton staffing of WebKit, combined with the easier task of supporting fewer platforms, suggests that Apple is unlikely to spend considerably more than Mozilla does on browser development. In 2014, Apple would have enjoyed a profit margin of 50% if it had spent half a billion on browser engineering. Today, that margin would be 94-97%, depending on which figure you believe for Google's payments.

In absolute terms, that's more profit than Apple makes selling Macs.

Compare Cupertino's 3-6% search revenue reinvestment in the web with Mozilla's near 100% commitment, then recall that Mozilla has consistently delivered a superior engine to more platforms. I don't know what's more embarrassing: that some folks argue with a straight face that Apple is trying hard to build a good browser, or that it is consistently overmatched in performance, security, and compatibility by a plucky non-profit foundation that makes just ~5% of Apple's web revenue.

Choices, Choices #

Steve Jobs launched Safari for Windows in the same WWDC keynote that unveiled the iPhone.

Commenters often fixate on the iPhone's original web-based pitch, but don't give Apple stick for reducing engine diversity by abandoning Windows three versions later.

Today, Apple doesn't compete outside its home turf, and when it has agency, it prevents others from doing so. These are not the actions of a firm that is consciously attempting to promote engine diversity. If Apple is an ally in that cause, it is only by accident.

Theories that postulate a takeover by Chromium dismiss Apple's power over a situation it created and recommits to annually through its budgeting process.

This is not a question of resources. Recall that Apple spends $85 billion per year on stock buybacks^[3], $15 billion on dividends, enjoys free cash flow larger than the annual budgets of 47 nations, and retain tens of billions of dollars of cash on hand.^[4] And that's to say nothing of Apple's $100+ billion in non-business-related long-term investments.

Even if Safari were a loss leader, Apple would be able to avoid producing a slower, stifled, less secure, famously buggy engine without breaking the bank.

Apple needs fewer staff to deliver equivalent features because Safari supports fewer OSes. The necessary investments are also R&D expenses that receive heavy tax advantages. Apple enjoys enviable discounts to produce a credible browser, but refuses to do so.

Unlike Microsoft's late and underpowered efforts with IE 7-11, Safari enjoys tolerable web compatibility, more than 90% share on a popular OS, and an unheard-of war chest with which to finance a defence. The postulated apocalypse seems far away and entirely within Apple's power to forestall.

Recent Developments #

One way to understand the voluntary nature of Safari's poor competitiveness is to put Cupertino's recent burst of effort in context.

When regulators and legislators began asking questions in 2019, a response was required. Following Congress' query about default browser choice, Apple quietly allowed it through iOS 14 (however ham-fistedly) the following year. This underscores Apple's gatekeeper status and the tiny scale of investment required to enable large changes.

In the past six months, the Safari team has gone on a veritable hiring spree. This month's WWDC announcements showcased returns on that investment. By spending more in response to regulatory pressure, Apple has eviscerated notions that it could not have delivered a safer, more capable, and competitive browser many years earlier.

Safari's incremental headcount allocation has been large compared to the previous size of the Safari team, but in terms of Apple's P&L, it's loose change. Predictably, hiring talent to catch up has come at no appreciable loss to profitability.

The competitive potential of any browser hinges on headcount, and Apple is not limited in its ability to hire engineering talent. Recent efforts demonstrate that Apple has been able to build a better browser all along and, year after year, chose not to.

How Apple Gutted Mozilla's Chances #

For over a dozen years, setting any browser other than Safari as the iOS default was impossible. This spotted Safari a massive market share head-start. Meanwhile, restrictions on engine choice continue to hamstring competitors, removing arguments for why users should switch. But don't take my word for it; here's the recent "UK CMA Final Report on Mobile Ecosystems" summarising submissions by Mozilla and others (pages 154-155):

5.48 The WebKit restriction also means that browser vendors that want to use Blink or Gecko on other operating systems have to build their browser on two different browser engines. Several browser vendors submitted that needing to code their browser for both WebKit and the browser engine they use on Android results in higher costs and features being deployed more slowly.

5.49 Two browser vendors submitted that they do not offer a mobile browser for iOS due to the lack of differentiation and the extra costs, while Mozilla told us that the WebKit restriction delayed its entrance into iOS by around seven years

That's seven years of marketing, feature iteration, and brand loyalty that Mozilla sacrificed on the principle that if they could not bring their core differentiator, there was no point.

It would have been better if Mozilla had made a ruckus, rather than hoping the world would notice its stoic virtue, but thankfully the T-rex has roused from its slumber.

Given the hard times the Mozilla Foundation has found itself in, it seems worth trying to quantify the costs.

To start, Mozilla must fund a separate team to re-develop features atop a less-capable runtime. Every feature that interacts with web content must be rebuilt in an ad-hoc way using inferior tools. Everything from form autofill to password management to content blocking requires extra resources to build for iOS. Not only does this tax development of the iOS product, it makes coordinated feature launches more costly across all ports.

Most substantially, iOS policies against default browser choice — combined with "in-app-browser" and search entry point shenanigans — have delayed and devalued browser choice.

Until late 2020, users needed to explicitly tap the Firefox icon on the home screen to get back to their browser. Naïvely tapping links would, instead, load content in Safari. This split experience causes a sort of pervasive forgetfulness, making the web less useful.

Continuous partial amnesia about browser-managed information is bad for users, but it hurts browser makers too. On OSes with functional competition, convincing a user to download a new browser has a chance of converting nearly all of their browsing to that product. iOS (along with Android and Facebook's mobile apps) undermine this by constantly splitting browsing, ignoring the user's default. When users don't end up in their browser, searches occur through it less often, affecting revenue. Web developers also experience this as a reduction in visible share of browsing from competing products, reducing incentives to support alternative engines.

A foregetful web also hurts publishers. Ad bid rates are suppressed, and users struggle to access pay-walled content when browsing is split. The conspicuious lack of re-engagement features like Push Notifications are the rotten cherry on top, forcing sites to push users to the App Store where Apple doesn't randomly log users out, or deprive publishers of key features.

Users, browser makers, web developers, and web businesses all lose. The hat-trick of value destruction.

Back Of The Napkin #

The pantomime of browser choice on iOS has created an anaemic, amnesiac web. Tapping links is more slogging than surfing when autofill fails, passwords are lost, and login state is forgotten. Browsers become less valuable as the web stops being a reliable way to complete tasks.

Can we quantify these losses?

Estimating lost business from user frustration and ad rate depression is challenging. But we can extrapolate what a dozen years of choice might have meant for Mozilla from what we know about how Apple monetises the web.

For the purposes of argument, let's assume Mozilla would be paid for web traffic at the same rate as Apple; $8-15 billion per year for ~75% share of traffic from Apple OSes.

If the traffic numbers to US government websites are reasonable proxies for the iOS/macOS traffic mix (big "if"s), then equal share for Firefox on iOS to macOS would be worth $215-400 million per year.^[5] Put differently; there's reason to think that Mozilla would not have suffered layoffs if Apple were an ally of engine choice.

Apple's policies have made the web a less compelling ecosystem, its anti-competitive behaviour have driven up costs for browser makers, and it has simultaneously starved them of revenue. If Apple are friends of engine diversity, who needs enemies?

The Best Kind Of Correct #

There is a narrow, fetid sense in which Apple's influence is nominally pro-diversity. Having anchored a significant fraction of web traffic at the trailing edge, businesses that do not decamp for the App Store may feel obliged to support WebKit.

This is a malignant form of diversity, not unlike other lagging engines through the years that harmed users and web-based businesses by externalizing costs. But on OSes with true browser choice alternatives were meaningful. Consider the loathed memory of IE 6, a browser that overstayed its welcome by nearly a decade. For as bad as it was, folks could recommend alternatives. Plugins also allowed us to transparently upgrade the platform.

Before the rise open-source engines, the end of one browser lineage may have been a deep loss to ecosystem diversity, but in the past 15 years, the primary way new engines emerge has been through forks and remixing.

But the fact of an engine being different does not make that difference valuable, and WebKit's differences are incremental. Sure, Blink now has a faster layout engine, better security, more features, and fewer bugs, but like WebKit, it is also derived from KHTML. Both engines are forks and owe many present-day traits to their ancestors.

The history of browsers includes many forks and remixes. It's naïve to think that will end if iOS becomes hospitable to browser competition. After all, it has been competition that spurred engine improvements and forks.

Today's KHTML descendants are not the end of the story. Future forks are possible. New codebases can be built from parts. Indeed, there's already valuable cross-pollination in code between Gecko, WebKit, and Chromium. Unlike the '90s and early 2000s, diversity can arrive in valuable increments through forking and recombination.

What's necessary for leading edge diversity, however, is funding.

By simultaneously taking a massive pot of cash for browser-building off the table, returning the least it can to engine development, and preventing others from filling the gap, Apple has foundationally imperilled the web ecosystem by destroying the utility of a diverse population of browsers and engines.

Apple has agency. It is not a victim, and it is not defending engine diversity.

What Now? #

A better, brighter future for the web is possible, and thanks to belated movement by regulators, increasingly likely. The good folks over at Open Web Advocacy are leading the way, clearly explaining to anyone who will listen both what's at stake and what it will take to improve the situation.

Investigations are now underway worldwide, so if you think Apple shouldn't be afraid of a bit of competition if it will help the web thrive, consider getting involved. And if you're in the UK or do business there, consider helping the CMA help the web before July 22nd, 2022. The future isn't written yet, and we can change it for the better.

Many commenters come to debates about compatibility and standards compliance with a mistaken view of how standards are made. As a result, they perceive vendors with better standards conformance (rather than content compatibility) to occupy a sort of moral high ground. They do not. Instead, it usually represents a broken standards-setting process.

This can happen for several reasons. Sometimes standards bodies shutter, and the state of the art moves forward without them. This presents some risk for vendors that forge ahead without the cover of an SDO's protective IP umbrella, but that risk is often temporary and measured. SDOs aren't hard to come by; if new features are valuable, they can be standardised in a new venue. Alternatively, vendors can renovate the old one if others are interested in the work.

More often, working groups move at the speed of their most obstinate participants, uncomfortably prolonging technical debates already settled in the market and preventing definitive documentation of the winning design. In other cases, a vendor may play games with intellectual property claims to delay standardisation or lure competitors into a patent minefield (as Apple did with Touch Events).

At the leading edge, vendors need space to try new ideas without the need for the a priori consensus represented by a standard. However, compatibility concerns expressed by developers take on a different tinge over time.

When the specific API details and capabilities of ageing features do not converge, a continual tax is placed on folks trying to build sites using features from that set. When developers stress the need for compatibility, it is often in this respect.

Disingenuous actors sometimes try to misrepresent this interest and claim that all features must become standards before they are introduced in any engine. This interpretation runs against the long practice of internet standards development and almost always hides an ulterior motive.

The role of standards is to consolidate gains introduced at the leading edge through responsible competition. Vendors that fail to participate constructively in this process earn scorn. They bring ignominy upon their houses by failing to bring implementations in line with the rough (documented and tested) consensus or by playing the heel in SDOs to forestall progress they find inconvenient.

Vendors like Apple. ↩︎
In the financial reports of internet businesses, you will see the costs to acquire business through channels reported as "Traffic Acquisition Costs" or "TACM". Many startups report their revenue "excluding TAC" or "ex-TAC". These are all ways of saying, "we paid for lead generation", and search engines are no different. ↩︎
This is money Apple believes it cannot figure out a way to invest in its products. That's literally what share buybacks indicate. They're an admission that a company is not smart enough to invest the money in something productive. Buybacks are attractive to managers because they create artificial scarcity for shares to drive up realised employee compensation — their own included. Employees who are cheesed to realise that their projects are perennially short-staffed are encouraged not to make a stink through RSU appreciation. Everyone gets a cut, RSU-rich managers most of all. ↩︎
Different analysts use different ways of describing Apple's "cash on hand". Some analysts lump in all marketable securities, current and non-current, which consistently pushes the number north of $150 billion. Others report only the literal cash value on the books ($34 billion as of May 2020). All of this means that it can require more context to compare the numbers in Apple's consolidated financial statements (PDF) with public reporting on them.

The picture is also clouded by changes in the way Apple manages its cash horde. Over the past two years, Apple has begun to draw from this almighty pile of dollars and spend more to inflate its stock price through share buybacks and dividends. This may cast Apple as more cash-poor than it is. A better understanding of the actual situation is derived from free cash flow. Perhaps Apple will continue to draw down from its tall cash mountain to inflate its stock price via buybacks, but that's not a material change in the amount Apple can potentially spend on improving its products. ↩︎
Since this post first ran, several commenters have noted a point I considered while writing, but omitted in order to avoid heaping scorn on a victim; namely that Mozilla's management has been asleep at the switch regarding the business of its business.

Historically, when public records were available for both Opera and Mozilla, it was easy to understand how poorly Mozilla negotiated with search partners. Under successive leaders, Mozilla negotiated deals that led to payments less than as half as much per point of share. There's no reason to think MoCo's negotiating skills have improved dramatically in recent years. Apple, therefore, is likely to caputre much more revenue per search than an install of Firefox.

But even if Mozilla only made 1/3 of Apple's haul for equivalent use, the combined taxes of iOS feature re-development and loss of revenue would be material to the Mozilla Foundation's bottom line.

Obviously, to get that share, Mozilla would need to prioritise mobile, which it has not done. This is a deep own-goal and a point of continued sadness for me.

A noble house reduced to rubble is a tragedy no matter who demolishes the final wall. Management incompetence is in evidence, and Mozilla's Directors are clearly not fit for purpose.

But none of that detracts from what others have done to the Foundation and the web, and it would be just as wrong to claim Mozilla should have been perfect in ways its enemies and competitors were not. ↩︎

A Management Maturity Model for Performance

May 9, 2022

This is part five of the six-part series "The Performance Inequality Gap"

Since 2015 I have been lucky to collaborate with more than a hundred teams building PWAs and consult on some of the world's largest sites. Engineers and managers on these teams universally want to deliver great experiences and have many questions about how to approach common challenges. Thankfully, much of what once needed hand-debugging by browser engineers has become automated and self-serve thanks to those collaborations.

Despite advances in browser tooling, automated evaluation, lab tools, guidance, and runtimes, teams I've worked with consistently struggle to deliver minimally acceptable performance with today's popular frameworks. This is not a technical problem per se — it's a management issue, and one that teams can conquer with the right frame of mind and support.

What is Performance?
Value Propositions
Protecting the Commons
Levels of Performance Management Maturity
Uneven Steps, Regression, & False Starts
The Role of Senior Management
- Questions for Senior Managers
- "o11y, But Make it Performance"

What is Performance? #

It may seem a silly question, but what is performance, exactly?

This is a complex topic, but to borrow from a recent post, web performance expands access to information and services by reducing latency and variance across interactions in a session, with a particular focus on the tail of the distribution (P75+). Performance isn't a binary and there are no silver bullets.

Only teams that master their systems can make intentional trade-offs. Organisations that serve their tools will tread water no matter how advanced their technology, while groups that understand and intentionally manage their systems can succeed on any stack.^[1]

Value Propositions #

The value of performance is deeply understood within a specific community and in teams that have achieved high maturity. But outside those contexts it can be challenging to communicate. One helpful lens is to view the difference between good and bad performance as a gap between expectations and reality.

For executives that value:

Revenue
Performance is a significant revenue contributor. To the extent that a product performs poorly, a fraction of revenue will be lost and the narrower the funnel becomes for all types of conversions.^[2]
Engagement
Poor performance has a well-documented relationship to reduced engagement. The space between good and bad performance is the opportunity for a competitor's product to fill the same opening. An under-appreciated aspect of this equation is variance. When a product often performs well but sometimes is slow to respond, users lose trust and use that service less. Consistent performance matters just as much as low average latency.
Design
Poor performance is a gap between the responsiveness of Figma mockups and lived brand reality. Product that perform well are closer to the approved design spec. Many brands build their reputations on fanatical attention to detail about the products they sell and the physical environments they're sold in. A laggy digital experience is out of alignment with these brand values.
Accessibility
Performance is the foundation of access. Users experiencing slow services may be unable to access them in the first place, mooting other work poured into improving a11y. Reduced network and device capacity correlate with other access challenges. a11y that isn't founded on consistently excellent performance can easily veer into performative, rather than effective, territory.

Performance is rarely the single determinant of product success, but it can be the margin of victory. Improving latency and reducing variance allows teams to test other product hypotheses with less noise. A senior product leader recently framed a big performance win as creating space that allows us to be fallible in other areas.

Protecting the Commons #

Like accessibility, security, UI coherence, privacy, and testability, performance is an aggregate result. Any single component of a system can regress latency or create variance, which means that like other cross-cutting product properties, performance must be managed as a commons. The approaches that work over time are horizontal, culturally-based, and require continual investment to sustain.

Teams I've consulted with are too often wrenched between celebration over launching "the big rewrite" and the morning-after realisation that the new stack is tanking business metrics.

Now saddled with the excesses of npm, webpack, React, and a thousand promises of "great performance" that were never critically evaluated, it's easy for managers to lose hope. These organisations sometimes spiral into recrimination and mistrust. Where hopes once flourished, the horrors of a Bundle Buddy readout looms. Who owns this code? Why is it there? How did it get away from the team so quickly?

Many "big rewrite" projects begin with the promise of better performance. Prototypes "seem fast", but nobody's actually benchmarking them on low-end hardware. Things go fine for a while, but when sibling teams are brought in to integrate late in the process, attention to the cumulative experience may suffer. Before anyone knows it, the whole thing is as slow as molasses, but "there's no going back"... and so the lemon launches with predictably sour results.

In the midst of these crises, thoughtful organisations begin to develop a performance management discipline. This, in turn, helps to create a culture grounded in high expectations. Healthy performance cultures bake the scientific method into their processes and approaches; they understand that modern systems are incredibly complex and that nobody knows everything — and so we learn together and investigate the unknown to develop an actionable understanding.

Products that maintain a healthy performance culture elevate management of latency, variance, and other performance attributes to OKRs because they understand how those factors affect the business.

Levels of Performance Management Maturity #

Performance management isn't widely understood to be part of what it means to operate a high-functioning team. This is a communcation challenge with upper management, but also a potential differentiator or even a strategic advantage. Teams that develop these advantages progress through a hierarchy of management practice phases. In drafting this post, I was pointed to similar work developed independently by others^[3]; that experienced consultants have observed similar trends helps give me confidence in this assessment:

Level 0: Bliss #

Hear no evil, see no evil, speak no evil. — Photo by von Vix

Level 0 teams do not know they have a problem. They may be passively collecting some data (e.g., through one of the dozens of analytics tools they've inevitably integrated over the years), but nobody looks at it. It isn't anyone's job description to do so.

Folks at this level of awareness might also simply assume that "it's the web, of course it's slow" and reach for native apps as a panacea (they aren't). The site "works" on their laptops and phones. What's the problem?

Management Attributes #

Managers in Level 0 teams are unaware that performance can be a serious product problem; they instead assume the technology they acquired on the back of big promises will be fine. This blindspot usually extends up to the C-suite. They do not have latency priorities and they uncritically accept assertions that a tool or architecture is "performant" or "blazing fast". They lack the technical depth to validate assertions, and move from one framework to another without enunciating which outcomes are good and which are unacceptable. Faith-based product management, if you will.

Level 0 PMs fail to build processes or cultivate trusted advisors to assess the performance impacts of decisions. These organisations often greenlight rewrites because we can hire easily for X, and we aren't on it yet. These are vapid narratives, but Level 0 managers don't have the situational awareness, experience, or confidence to push back appropriately.

These organisations may perform incidental data collection (from business analytics tools, e.g.) but are inconsistently reviewing performance metrics or considering them when formulating KPIs and OKRs.

Level 1: Fire Fighting #

Shit's on fire, yo. — Photo by Jay Heike

At Level 1, managers will have been made aware that the performance of the service is unacceptable.^[4]

Service quality has degraded so much that even fellow travelers in the tech privilege bubble^[4:1] have noticed. Folks with powerful laptops, new iPhones, and low-latency networks are noticing, which is a very bad sign. When an executive enquires about why something is slow, a response is required.

This is the start of a painful remediation journey that can lead to heightened performance management maturity. But first, the fire must be extinguished.

Level 1 managers will not have a strong theory about what's amiss, and an investigation will commence. This inevitably uncovers a wealth of potential metrics and data points to worry about; a few of those will be selected and tracked throughout the remediation process. But were those the right ones? Will tracking them from now on keep things from going bad? The first firefight instills gnawing uncertainty about what it even means to "be fast". On teams without good leadership or a bias towards scientific inquiry, it can be easy for Level 1 investigations to get preoccupied with one factor while ignoring others. This sort of anchoring effect can be overcome by pulling in external talent, but this is often counter-intuitive and sometimes even threatening to green teams.

Competent managers will begin to look for more general "industry standard" baseline metrics to report against their data. The industry's default metrics are moving to a better place, but Level 1 managers are unequipped to understand them deeply. Teams at Level 1 (and 2) may blindly chase metrics because they have neither a strong, shared model of their users, nor an understanding of their own systems that would allow them to focus more tightly on what matters to the eventual user experience. They aren't thinking about the marginal user yet, so even when they do make progress on directionally aligned metrics, nasty surprises can reoccur.

Low levels of performance management maturity are synonymous with low mastery of systems and an undeveloped understanding of user needs. This leaves teams unable to quickly track down culprits when good scores on select metrics fail to consistently deliver great experiences.

Management Attributes #

Level 1 teams are in transition, and managers of those teams are in the most fraught part of their journey. Some begin an unproductive blame game, accusing tech leads of incompetence, or worse. Wise PMs will perceive performance remediation work as akin to a service outage and apply the principles of observability culture, including "blameless postmortems".

It's never just one thing that's amiss on a site that prompts Level 1 awareness. Effective managers can use the collective learning process of remediation to improve a team's understanding of its systems. Discoveries will be made about the patterns and practices that lead to slowness. Sharing and celebrating these discoveries is a crucial positive attribute.

Strong Level 1 managers will begin to create dashboards and request reports about factors that have previously caused problems in the product. Level 1 teams tend not to staff or plan for continual attention to these details, and the systems often become untrustworthy.

Teams can get stuck at Level 1, treating each turn through a Development ➡️ Remediation ➡️ Celebration loop as "the last time". This is pernicious for several reasons. Upper management will celebrate the first doused fire but will begin to ask questions about the fourth and fifth blazes. Are their services just remarkably flammable? Is there something wrong with their team? Losing an organisation's confidence is a poor recipe for maximising personal or group potential.

Next, firefighting harms teams, and doubly so when management is unwilling to adopt incident response framing. Besides potential acrimony, each incident drains the team's ability to deliver solutions. Noticeably bad performance is an expression of an existing feature working below spec, and remediation is inherently in conflict with new feature development. Level 1 incidents are de facto roadmap delays.

Lastly, teams stuck in a Level 1 loop risk losing top talent. Many managers imagine this is fine because they're optimising for something else, e.g. the legibility of their stack to boot camp grads. A lack of respect for the ways that institutional knowledge accelerates development is all too common.

It's difficult for managers who do not perceive the opportunities that lie beyond firefighting to comprehend how much stress they're placing on teams through constant remediation. Fluctuating between Levels 1 and 0 ensures a team never achieves consistent velocity, and top performers hate failing to deliver.

The extent to which managers care about this — and other aspects of the commons, such as a11y and security — is a reasonable proxy for their leadership skills. Line managers can prevent regression back to Level 0 by bolstering learning and inquiry within their key personnel, including junior developers who show a flair for performance investigation.

Level 2: Global Baselines & Metrics #

Think globally, then reset. — The global baseline isn't what folks in the privilege bubble assume.

Thoughtful managers become uncomfortable as repeated Level 1 incidents cut into schedules, hurt morale, and create questions about system architecture. They sense their previous beliefs about what's "reasonable" need to be re-calibrated... but against what baseline?

It's challenging for teams climbing the maturity ladder to sift through the many available browser and tool-vendor data points to understand which ones to measure and manage. Selected metrics are what influence future investments, and identifying the right ones allows teams to avoid firefighting and prevent blindspots.

A diagram of the W3C Navigation Timing timline events — Browsers provide a *lot* of data about site performance. Harnessing it requires a deep understanding of the product and its users.

Teams looking to grow past Level 1 develop (or uncover they already had) Real User Monitoring ("RUM data") infrastructure in previous cycles. They will begin to report to management against these aggregates.

Against the need for quicker feedback and a fog of metrics, managers who achieve Level 2 maturity look for objective, industry-standard reference points that correlate with business success. Thankfully, the web performance community has been busy developing increasingly representative and trustworthy measurements. Still, Level 2 teams will not yet have learned to live with the dissatisfaction that lab measurements cannot always predict a system's field behavior. Part of mastery is accepting that the system is complex and must be investigated, rather than fully modeled. Teams at Level 2 are just beginning to learn this lesson.

Strong Level 2 managers acknowledge that they don't know what they don't know. They calibrate their progress against studies published by peers and respected firms doing work in this area. These data points reflect a global baseline that may (or may not) be appropriate for the product in question, but they're significantly better than nothing.

Management Attributes #

Managers who bring teams to Level 2 spread lessons from remediation incidents, create a sense of shared ownership over performance, and try to describe performance work in terms of business value. They work with their tech leads and business partners to adopt industry-standard metrics and set expectations based on them.

Level 2 teams buy or build services that help them turn incidental data collection into continual reporting against those standard metrics. These reports tend to focus on averages and may not be sliced to focus on specific segments (e.g., mobile vs. desktop) and geographic attributes. Level 2 (and 3) teams may begin drowning in data, with too many data points being collected and sliced. Without careful shepherding to uncover the most meaningful metrics to the business, this can engender boredom and frustration, leading to reduced focus on important RUM data sources.

Strong Level 2 managers will become unsatisfied with how global rules of thumb and metrics fail to map directly into their product's experience and may begin to look for better, more situated data that describe more of the user journeys they care about. The canniest Level 2 managers worry that their teams lack confidence that their work won't regress these metrics.

Teams that achieve Level 2 competence can regress to Level 1 under product pressure (removing space to watch and manage metrics), team turnover, or assertions that "the new architecture" is somehow "too different" to measure.

Level 3: P75+, Site-specific Baselines & Metrics #

Level 3 teams are starting to fly the plane instead of being passengers on an uncomfortable journey — Photo by Launde Morel

The unease of strong Level 2 management regarding metric appropriateness can lead to Level 3 awareness and exploration. At this stage, managers and TLs become convinced that the global numbers they're watching "aren't the full picture" — and they're right!

At Level 3, teams begin to document important user journeys within their products and track the influence of performance across the full conversion funnel. This leads to introducing metrics that aren't industry-standard, but are more sensitive and better represent business outcomes. The considerable cost to develop and validate this understanding seems like a drop in the bucket compared to flying blind, so Level 3 teams do it, in part, to eliminate the discomfort of being unable to confidently answer management questions.

Substantially enlightened managers who reach Level 3 will have become accustomed to percentile thinking. This often comes from their journey to understand the metrics they've adopted at Levels 1 and 2. The idea that the median isn't the most important number to track will cause a shift in the internal team dialogue. Questions like, "Was that the P50 number?" and "What does it look like at P75 and P90?" will become part of most metrics review meetings (which are now A Thing (™).

Percentiles and histograms become the only way to talk about RUM data in teams that reach Level 3. Most charts have three lines — P75, P90, and P95 — with the median, P50, thrown in as a vanity metric to help make things legible to other parts of the organisation that have yet to begin thinking in distributions.

Treating data as a distribution fundamentally enables comparison and experimentation because it creates a language for describing non-binary shifts. Moving traffic from one histogram bucket to another becomes a measure of success, and teams at Level 3 begin to understand their distributions are nonparametric, and they adopt more appropriate comparisons in response.

Management Attributes #

Level 3 managers and their teams are becoming scientists. For the first time, they will be able to communicate with confidence about the impact of performance work. They stop referring to "averages", understand that medians (P50) can tell a different story than the mean, and become hungry to explore the differences in system behavior at P50 and outlying parts of the distribution.

Significant effort is applied to the development and maintenance of custom metrics and tools. Products that do not report RUM data in more sliceable ways (e.g., by percentile, geography, device type, etc.) are discarded for those that better support an investigation.

Teams achieving this level of discipline about performance begin to eliminate variance from their lab data by running tests in "less noisy" environments than somewhere like a developer's laptop, a shared server, or a VM with underlying system variance. Low noise is important because these teams understand that as long as there's contamination in the environment, it is impossible to trust the results. Disaster is just around the corner when teams can't trust tests designed to keep the system from veering into a bad state.

Level 3 teams also begin to introduce a critical asset to their work: integration of RUM metrics reporting with their experimentation frameworks. This creates attribution for changes and allows teams to experiment with more confidence. Modern systems are incredibly complex, and integrating this experimentation into the team's workflow only intensifies as groups get ever-more sophisticated moving forward.

Teams can regress from Level 3 because the management structures that support consistent performance are nascent. Lingering questions about the quality of custom metrics can derail or stall progress, and some teams can get myopic regarding the value of RUM vs. lab data (advanced teams always collect both and try to cross-correlate, but this isn't yet clear to many folks who are new to Level 3). Viewing metrics with tunnel vision and an unwillingness to mark metrics to market are classic failure modes.

Level 4: Variance Control & Regression Prevention #

Level 4 teams are beginning to understand and manage the tolerances of their service. — Photo by Mastars

Strong Level 3 managers will realise that many performance events (both better and worse than average) occur along a user journey. This can be disorienting! Everything one thought they knew about how "it's going" is invalidated all over again. The P75 latency for interaction (in an evenly distributed population) isn't the continuous experience of a single user; it's every fourth tap!

Suddenly, the idea of managing averages looks naive. Medians have no explanatory power and don't even describe the average session! Driving down the median might help folks who experience slow interactions, but how can the team have any confidence about that without constant management of the tail latency?

This new understanding of the impact that variance has on user experiences is both revelatory and terrifying. The good news is that the tools that have been developed to this point can serve to improve even further.

Level 4 teams also begin to focus on how small, individually innocuous changes add up to a slow bleed that can degrade the experience over time. Teams that have achieved this sort of understanding are mature enough to forecast a treadmill of remediation in their future and recognise it as a failure mode. And failure modes are avoidable with management processes and tools, rather than heroism or blinding moments of insight.

Management Attributes #

Teams that achieve Level 4 maturity almost universally build performance ship gates. These are automated tests that watch the performance of PRs through a commit queue, and block changes that tank the performance of important user flows. This depends on the team having developed metrics that are known to correlate well with user and business success.

This implies all of the maturity of the previous levels because it requires a situated understanding of which user flows and scenarios are worth automating. These tests are expensive to run, so they must be chosen well. This also requires an investment in infrastructure and continuous monitoring. Making performance more observable, and creating a management infrastructure that avoids reactive remediation is the hallmark of a manager who has matured to Level 4.

Many teams on the journey from Level 3 to 4 will have built simpler versions of these sorts of gates (bundle size checks, e.g.). These systems may allow for small continuous increases in costs. Over time, though, these unsophisticated gates become a bad proxy for performance. Managers at Level 4 learn from these experiences and build or buy systems to watch trends over time. This monitoring ought to include data from both the lab and the field to guard against "metric drift". These more sophisticated monitoring systems also need to be taught to alert on cumulative, month-over-month and quarter-over-quarter changes.

Level 4 maturity teams also deputise tech leads and release managers to flag regressions along these lines, and reward them for raising slow-bleed regressions before they become crises. This responsibility shift, backed up by long-run investments and tools, is one of the first stable, team-level changes that can work against cultural regression. For the first time, the team is considering performance on longer time scales. This also begins to create organisational demand for latency budgeting and slowness to be attributed to product contributions.

Teams that achieve Level 4 maturity are cautious acquirers of technology. They manage on an intentional, self-actualised level and value an ability to see through the fog of tech fads. They do bake-offs and test systems before committing to them. They ask hard questions about how any proposed "silver bullets" will solve the problems that they have. They are charting a course based on better information because they are cognizant that it is both valuable and potentially available.

Level 4 teams begin to explicitly staff a "performance team", or a group of experts whose job it is to run investigations and drive infrastructure to better inform inquiry. This often happens out of an ad-hoc virtual team that forms in earlier stages but is now formalised and has long-term staffing.

Teams can quickly regress from Level 4 maturity through turnover. Losing product leaders that build to Level 4 maturity can set groups back multiple maturity levels in short order, and losing engineering leaders who have learned to value these properties can do the same. Teams are also capable of losing this level of discipline and maturity by hiring or promoting the wrong people. Level 4 maturity is cultural and cultures need to be defended and reinforced to maintain even the status quo.

Level 5: Strategic Performance #

Level 5 teams have understood the complexity of their environment and can make tradeoffs with confidence. — Photo by Colton Sturgeon

Teams that fully institutionalise performance management come to understand it as a strategic asset.

These teams build management structures and technical foundations that grow their performance lead and prevent cultural regressions. This includes internal training, external advocacy and writing^[5], and the staffing of research work to explore the frontier of improved performance opportunities.

Strategic performance is a way of working that fully embeds the idea that "faster is better", but only when it serves user needs. Level 5 maturity managers and teams will gravitate to better-performing options that may require more work to operate. They have learned that fast is not free, but it has cumulative value.

These teams also internally evangelise the cause of performance. Sibling teams may not be at the same place, so they educate about the need to treat performance as a commons. Everyone benefits when the commons is healthy, and all areas of the organisation suffer when it regresses.

Level 5 teams institute "latency budgets" for fractional feature rollouts. They have structures (such as managers or engineering leadership councils) that can approve requests for non-latency-neutral changes that may have positive business value. When business leaders demand the ability to ram slow features into the product, these leaders are empowered to say no.

Lastly, Level 5 teams are focused on the complete user journey. Teams in this space can make trades intelligently, moving around code and time within a system they have mastered to ensure the best possible outcomes in essential flows.

Management Attributes #

Level 3+ team behaviours are increasingly illegible to less-advanced engineers and organisations. At Level 5, serious training and guardrails are required to integrate new talent. Most hires will not yet share the cultural norms that a strategically performant organisation uses to deliver experiences with consistent quality.^[6]

Strategy is what you do differently from the competition, and Level 5 teams understand their way of working is a larger advantage than any single optimisation. They routinely benchmark against their competition on important flows and can understand when a competitor has taken the initiative to catch up (it rarely happens through a single commit or launch). These teams can respond at a time of their choosing because their lead will have compounded. They are fully out of firefighting mode.

Level 5 teams do not emerge without business support. They earn space to adopt these approaches because the product has been successful (thanks in part to work at previous levels). Level 5 culture can only be defended from a position of strength. Managers in this space are operating for the long term, and performance is understood to be foundational to every new feature or improvement.

Teams at Level 5 degrade more slowly than at previous levels, but it does happen. Sometimes, Level 5 teams are poor communicators about their value and their values, and when sibling teams are rebuffed, political pressure can grow to undermine leaders. More commonly, enough key people leave a Level 5 team for reasons unrelated to performance management, like when the hard-won institutional understanding of what it takes to excel is lost. Sometimes, simply failing to reward continual improvement can drive folks out. Level 5 managers need to be on guard regarding their culture and their value to the organisation as much as the system's health.

Uneven Steps, Regression, & False Starts #

It's possible for strong managers and tech leads to institute Level 1 discipline by fiat. Level 2 is perhaps possible on a top-down basis in a small or experienced team. Beyond that, though, maturity is a growth process. Progression beyond global baseline metrics requires local product and market understanding. TLs and PMs need to become curious about what is and isn't instrumented, begin collecting data, then start the directed investigations necessary to uncover what the system is really doing in the wild. From there, tools and processes need to be built to recreate those tough cases on the lab bench in a repeatable way, and care must be taken to continually re-validate those key user journeys against the evolving product reality.

Advanced performance managers build groups that operate on mutual trust to explore the unknown and then explain it out to the rest of the organisation. This means that advancement through performance maturity isn't about tools.

Managers who get to Level 4 are rare, but the number who imagine they are could fill stadiums because they adopted the technologies that high-functioning leaders encourage. But without the trust, funding to enquire and explore, and an increasingly fleshed-out understanding of users at the margins, adopting a new monitoring tool is a hollow expenditure. Nothing is more depressing than managerial cosplay.

It's also common for teams to take several steps forward under duress and regress when heroics stop working, key talent burns out, and the managerial focus moves on. These aren't fatal moments, but managers need to be on the lookout to understand if they support continual improvement. Without a plan for an upward trajectory, product owners are putting teams on a loop of remediation and inevitable burnout... and that will lead to regression.

The Role of Senior Management #

Line engineers want to do a good job. Nobody goes to work to tank the product, lose revenue, or create problems for others down the line. And engineers are trained to value performance and quality. The engineering mindset is de facto optimising. What separates Level 0 firefighting teams from those that have achieved self-actualised Level 5 execution is not engineering will; it's context, space, and support.

Senior management sending mixed signals about the value of performance is the fastest way to degrade a team's ability to execute. The second-fastest is to use blame and recrimination. Slowness has causes, but the solution isn't to remove the folks that made mistakes, but rather to build structures that support iteration so they can learn. Impatience and blame are not assets or substitutes for support to put performance consistently on par with other concerns.

Teams that reach top-level performance have management support at the highest level. Those managers assume engineers want to do a good job but have the wrong incentives and constraints, and it isn't the line engineer's job to define success — it's the job of management.

Questions for Senior Managers #

Senior managers looking to help their teams climb the performance management maturity hill can begin by asking themselves a few questions:

Do we understand how better performance would improve our business?
- Is there a shared understanding in the leadership team that slowness costs money/conversions/engagement/customer-success?
- Has that relationship been documented in our vertical or service?
- Do we know what "strategic performance" can do for the business?
What constraints have we given the team?
- Do they have a device-class or network condition target?
- Can engineers rely on those targets to negotiate with other parts of the organisation (e.g., sales, marketing, etc.)?
Have we developed a management fluency wth histograms and distributions over time?
- Do we write OKRs for performance?
- Are they phrased in terms of marginal device and network targets, as well as distributions?
What support do we give teams that want to improve performance?
- Do folks believe they can appeal directly to you if they feel the system's performance will be compromised by other decisions?
- Can folks (including PMs, designers, and SREs — not just engineers) get promoted for making the site faster?
- Can middle managers appeal to performance as a way to push back on feature requests?
- Are there systems in place for attributing slowness to changes over time?
- Can teams win kudos for consistent, incremental performance improvement?
- Can a feature be blocked because it might regress performance?
- Can teams easily acquire or build tools to track performance?
What support do we give mid-level managers who push back on shiny tech in favour of better performance?
- Have we institutionalised critial questions for adopting new technologies?
- Are aspects of the product commons (e.g., uptime, security, privacy, a11y, performance) managed in a coherent way?
- Do managers get as much headcount and funding to make steady progress as they would from proposing rewrites?
Have we planned to staff a performance infrastructure team?
- It's the job of every team to monitor and respond to performance challenges, but will there be a group that can help wrangle the data to enable everyone to do that?
- Can any group in the organisation serve as a resource for other teams that are trying to get started in their latency and variance learning journeys?

The answers to these questions help organisations calibrate how much space they have created to scientifically interrogate their systems. Computers are complex, and as every enterprise becomes a "tech company", becoming intentional about these aspects is as critical as building DevOps and Observability to avoid downtime.

It's always cheaper in the long run to build understanding than it is to fight fires, and successful management can create space to unlock their team's capacity.

"o11y, But Make it Performance" #

Mature technology organisations may already have and value a discipline to manage performance: "Site Reliability Engineering" (SRE), aka "DevOps", aka "Observability". These folks manage and operate complex systems and work to reduce failures, which looks a lot like the problems of early performance maturity teams.

These domains are linked: performance is just another aspect of system mastery, and the tools one builds to manage approaches like experimental, flagged rollouts need performance to be accounted for as a significant aspect of the success of a production spike.

Senior managers who want to build performance capacity can push on this analogy. Performance is like every other cross-cutting concern; important, otherwise un-owned, and a chance to differentiate. Managers have a critical role to forge solidarity between engineers, SREs, and other product functions to get the best out of their systems and teams.

Everyone wants to do a great job; it's the manager's role to define what that means.

It takes a village to keep my writing out of the ditch, so my deepest thanks go to Annie Sullivan, Jamund Ferguson, Andy Tuba, Barry Pollard, Bruce Lawson, Tanner Hodges, Joe Liccini, Amiya Gupta, Dan Shappir, Cheney Tsai, and Tim Kadlec for their invaluable comments and corrections on drafts of this post.

High-functioning teams can succeed with any stack, but they will choose not to. Good craftsmen don't blame their tools, nor do they carry deficient implementations.

Per Kellan Elliot-McCrea's classic "Questions for new technology", this means that high-functioning teams will not be on the shiniest stack. Teams choices that are highly correlated with hyped solutions are a warning sign, not an asset. And while "outdated" systems are unattractive, they also don't say much at all about the quality of the product or the team.

Reading this wrong is a sure tell of immature engineers and managers, whatever their title. ↩︎
An early confounding factor for teams trying to remediate performance issues is that user intent matters a great deal, and thus the value of performance will differ based on context. Users who have invested a lot of context with a service will be less likely to bounce based on bad performance than those who are "just browsing". For example, a user that has gotten to the end of a checkout flow or are using a government-mandated system may feel they have no choice. This isn't a brand or service success case (failing to create access is always a failure), but when teams experience different amounts of elasticity in demand vs. performance, it's always worth trying to understand the user's context and intent.

Users that "succeed" but have a bad time aren't assets for a brand or service, they're likely to be ambasassadors for any other way to accomplish their tasks. That's not great, long-term, for a team or for their users. ↩︎
Some prior art was brought to my attention by people who reviewed earlier drafts of this post; notably this 2021 post by the Splunk team and the following tweet by the NCC Group from 2016 (as well as a related PowerPoint presentation):

NCC Group Web Perf @NCCGroupWebperf

Where are you on the #webperf maturity model? ow.ly/miAi3020A9G #perfmatters
3 10:04 AM · Jul 7, 2016

It's comforting that we have all independently formulated roughly similar framing. People in the performance community are continually learning from each other, and if you don't take my formulation, I hope you'll consider theirs. ↩︎
Something particularly problematic about modern web development is the way it has reduced solidarity between developers, managers, and users. These folks now fundamentally experience the same sites differently, thanks to the shocking over-application of client-side JavaScript to every conceivable problem.

This creates structural illegibility of budding performance crises in new, uncomfortably exciting ways.

In the desktop era, developers and upper management would experience sites through a relatively small range of screen sizes and connection conditions. JavaScript was applied in the breach when HTML and CSS couldn't meet a need.^[7] Techniques like Progressive Enhancement ensured that the contribution of CPU performance to the distribution of experiences was relatively small. When content is predominantly HTML, CSS, and images, browsers are able to accelerate processing across many cores and benefit from the ability to incrementally present the results.

By contrast, JavaScript-delivered UI strips the browser of its ability to meaningfully reorder and slice up work so that it prioritises responsiveness and smooth animations. JavaScript is the fuck it, we'll do it live way to construct UI, and stresses the relative performance of a single core more than competing approaches. Because JavaScript is, byte for byte, the most expensive thing you can ask a browser to process, this stacks the difficulty involved in doing a good job on performance. JavaScript-driven UI is inherently working with a smaller margin for error, and that means today's de facto approach of using JavaScript for roughly everything leaves teams with much less headroom.

Add this change in default architecture to the widening gap between the high end (where all developers and managers live) and the median user. It's easy to understand how perfectly mistimed the JavaScript community's ascendence has been. Not since the promise of client-side Java has the hype cycle around technology adoption been more out of step with average usability.

Why has it gone this badly?

In part because of the privilege bubble. When content mainly was markup, performance problems were experienced more evenly. The speed of a client device isn't the limiting site speed factor in an HTML-first world. When database speed or server capacity is the biggest variable, issues affect managers and executives at the same rate they impact end users.

When the speed of a device dominates, wealth correlates heavily with performance. This is why server issues reliably get fixed, but JavaScript bloat has continued unabated for a decade. Rich users haven't borne the brunt of these architectural shifts, allowing bad choices to fly under the radar much longer which, in turn, increase the likelihood of expensive remediation incidents.

Ambush by JavaScript is a bad time, and when managers and execs only live in the privilege bubble, it's users and teams who suffer most. ↩︎ ↩︎
Managers may fear that by telling everyone about how strategic and important performance has become to them, that their competitiors will wise up and begin to out-execute on the same dimension.

This almost never happens, and the risks are low. Why? Because, as this post exhaustively details, the problems that prevent the competition from achieving high-functioning performance are not strictly technical. They cannot — and more importantly, will not — adopt tools and techniques you evangelise because it is highly unlikely that they are at a maturity level that would allow them to benefit. In many cases, adding another tool to the list for a Level 1-3 team to consider can even slow down and confound them.

Strategic performance is hard to beat because it is hard to construct at a social level. ↩︎
Some hires or transfers into Level 5 teams will not easily take to shared performance values and training.

Managers should anticipate pushback from these quarters and learn to re-assert the shared cultural norms that are critical to success.

There's precious little space in a Level 5 team for résumé-oriented development because a focus on the user has evacuated the intellectual room that hot air once filled. Thankfully, this can mostly be avoided through education, support, and clear promotion criteria that align to the organisation's evolved way of working.

Nearly everyone can be taught, and great managers will be on the lookout to find folks who need more support. ↩︎
Your narrator built JavaScript frameworks in the desktop era; it was a lonely time compared to the clogged market for JavaScript tooling today. The complexity of what we were developing for was higher than nearly every app I see today; think GIS systems, full PIM (e.g., email, calendar, contacts, etc.) apps, complex rich text editing, business apps dealing with hundreds of megabytes worth of normalised data in infinite grids, and BI visualisations.

When the current crop of JavaScript bros tells you they need increased complexity because business expectations are higher now, know that they are absolutely full of it. The mark has barely moved in most experiences. The complexity of apps is not much different, but the assumed complexity of solutions is. That experiences haven't improved for most users is a shocking indictment of the prevailing culture. ↩︎

Cache and Prizes

Serious Platforms Don't Play Favourites

March 31, 2022

If you work on a browser, you will often hear remarks like, Why don't you just put [popular framework] in the browser?

This is a good question — or at least it illuminates how browser teams think about tradeoffs. Spoiler: it's gnarly.

Before we get into it, let's make the subtext of the proposal explicit:

Libraries provided this way will be as secure and privacy-preserving as every other browser-provided API.
Browsers will cache tools popular among vocal, leading-edge developers.
There's plenty of space for caching the most popular frameworks.
Developers won't need to do work to realise a benefit.

None of this holds.

The best available proxy data also suggests that shared caches would have a minimal positive effect on performance. So, it's an idea that probably won't work the way anyone wants it to, likely can't do much good, and might do a lot of harm.^[1]

Understanding why requires more context, and this post is an attempt to capture the considerations I've been outlining on whiteboards for more than a decade.

Trust Falls #

Every technical challenge in this design space is paired with an even more daunting governance problem. Like other successful platforms, the web operates on a finely tuned understanding of risk and trust, and deprecations are particularly explosive.^[2] As we will see, pressure to remove some libraries to make space for others will be intense.

That's the first governance challenge: who will adjudicate removals? Browser vendors? And if so, under what rules? The security implications alone mean that whomever manages removals will need to be quick, fair, and accurate. Thar be dragons.

Removal also creates a performance cliff for content that assumes cache availability. Vendors have a powerful incentive to "not break the web ". Since cache sizes will be limited (more on that in a second), the predictable outcome will be slower library innovation. Privileging a set of legacy frameworks will create disincentives to adopt modern platform features and reduce the potential value of modern libraries. Introducing these incentives would be an obvious error for a team building a platform.

Entropic Forces #

An unstated, but ironclad, requirement of shared caches is that they are uniform, complete, and common.

Browsers now understand the classic shared HTTP cache behaviour as a privacy bug.

Cache timing is a fingerprinting vector that is fixed via cache partitioning by origin. In the modern cache-partitioned world, if a user visits alice.com, which fetches https://example.com/cat.gif, then visits bob.com, which displays the same image, it will be requested again. Partitioning by origin ensures that bob.com can't observe any details about that user's browsing history. De-duplication can prevent multiple copies on disk, but we're out of luck on reducing network costs.

A shared library cache that isn't a fingerprinting vector must work like one big zip file: all or nothing. If a resource is in that bundle — and enough users have the identical bundle — using a resource from it on both alice.com and bob.com won't leak information about the user's browsing history.

Suppose a user has only downloaded part of the cache. A browser couldn't use resources from it, lest missing resources (leaked through a timing side channel) uniquely identify the user. These privacy requirements put challenging constraints on the design and pratical effectiveness of a shared cache.

Variations on a Theme #

But the completeness constraint is far from our only design challenge. When somebody asks, Why don't you just put [popular framework] in the browser?, the thoughtful browser engineer's first response is often which versions?

This is not a dodge.

The question of precaching JavaScript libraries has been around since the Prototype days. It's such a perennial idea that vendors have looked into the real-world distribution of library use. One available data source is Google's hosted JS libraries service.

Last we looked, jQuery (still the most popular JS tool by a large margin) showed usage almost evenly split between five or six leading versions, with a long tail that slowly tapers. This reconfirms observations from the HTTP Archive, published by Steve Souders in 2013.

TL;DR?: Many variants of the same framework will occur in the cache's limited space. Because a few "unpopular" legacy tools are large, heavily used, and exhibit flat distribution of use among their top versions, the breadth of a shared cache will be much smaller than folks anticipate.

The web is not centrally managed, and sites depend on many different versions of many different libraries. Browsers are unable to rely on semantic versioning or trust file names because returned resources must contain the exact code that developers request. If browsers provide a similar-but-slightly-different file, things break in mysterious ways, and "don't break the web" is Job #1 for browsers.

Plausible designs must avoid keying on URLs or filenames. Another sort of opt-in is required because:

It's impossible to capture most use of a library with a small list of URLs.
Sites will need to continue to serve fallback copies of their dependencies.^[3]
File names are not trustworthy indicators of file contents.

Subresource Integrity (SRI) to the rescue! SRI lets developers add a hash of the resource they're expecting as a security precaution, but we could re-use these assertions as a safe cache key. Sadly, relatively few sites deploy SRI today, with growth driven by a single source (Shopify), meaning developers aren't generally adopting it for their first-party dependencies.

It turns out this idea has been circulating for a while. Late in the drafting of this post, it was pointed out to me by Frederik Braun that Brad Hill had considered the downsides of a site-populated SRI addressable cache back in 2016. Since that time, SRI adoption has yet to reach a point where it can meaningfully impact cache hit rates. Without pervasive SRI, it's difficult to justify the potential download size hit of a shared cache, not to mention the ongoing required updates.

The size of the proposed cache matters to vendors because browser binary size increases negatively impact adoption. The bigger the download, the less likely users are to switch browser. The graphs could not be more explicit: downloads and installs fail as binaries grow.

Browser teams aggressively track and manage browser download size, and scarce engineering resources are spent to develop increasingly exotic mechanisms to distribute only the code a user needs. They even design custom compression algorithms to make incremental patch downloads smaller. That's how much wire size matters. Shared caches must make sense within this value system.

Back of the Napkin #

So what's a reasonable ballpark for a maximum shared cache? Consider:

Browser engineers are byte misers.
That's because any cache would need to be sized to the expected free space of the marginal (P95) user.
Firefox for Android is between 70 and 85 MiB in compressed APK form.
Chrome for Android varies in size between 55 and 75 MiB.
Release-over-release growth is often less than 1 MiB.
Cache growth will be difficult and slow.

The last point is crucial because it will cement the cache's contents for years, creating yet another governance quagmire. The space available for cache growth will be what's left for the marginal user after accounting for increases in the browser binary. Most years, that will be nothing. Browser engineers won't give JS bundles breathing room they could use to win users.

Given these constraints, it's impossible to imagine a download budget larger than 20 MiB for a cache. It would also be optional (not bundled with the browser binary) and perhaps fetched on the first run (if resources are available). Having worked on browsers for more than a decade, I think it would be shocking if a browser team agreed to more than 10 MiB for a feature like this, especially if it won't dramatically speed up the majority of existing websites. That is unlikely given the need for SRI annotations or the unbecoming practice of speeding up high-traffic sites, but not others. These considerations put tremendous downward pressure on the prospective size budget for a cache.

20 MiB over the wire expands to no more than 100 MiB of JS with aggressive Brotli compression. A more likely 5 MiB (or less) wire size budget provides something like ~25 MiB of on-disk size. Assuming no source maps, this may seem like a lot, but recall that we need to include many versions of popular libraries. A dozen versions of jQuery, jQuery UI, and Moment (all very common) burn through this space in a hurry.

One challenge for fair inclusion is the tension between site-count-weighting and traffic-weighting. Crawler-based tools (like the HTTP Archive's Almanac and BuiltWith) give a distorted picture. A script on a massive site can account for more traffic than many thousands of long-tail sites. Which way should the policy go? Should a cache favour code that occurs in many sites (the long tail), or the code that has the most potential to improve the median page load?

Thar be more dragons.

Today's Common Libraries Will Be Underrepresented #

Anyone who has worked on a modern JavaScript front end is familiar with the terrifyingly long toolchains that dominate the landscape today. Putting aside my low opinion of the high costs and slow results, these tools all share a crucial feature: code motion.

Modern toolchains ensure libraries are transformed and don't resemble the files that arrive on disk from npm. From simple concatenation to the most sophisticated tree-shaking, the days of downloading library.min.js and SFTP-ing one's dependencies to the server are gone. A side effect of this change has been a reduction in the commonality of site artifacts. Even if many sites depend on identical versions of a library, their output will mix that code with bits from other tools in ways that defeat matching hashes.

Advocates for caches suggest this is solvable, but it is not — at least in any reasonable timeframe or in a way that is compatible with other pressures outlined here. Frameworks built in an era that presumes transpilers and npm may never qualify for inclusion in a viable shared cache.

Common Sense #

Because of privacy concerns, caches will be disabled for a non-trivial number of users. Many folks won't have enough disk space to include a uniform and complete version, and the cache file will be garbage collected under disk pressure for others.

Users at the "tail" tend not to get updates to their software as often as users in the "head" and "torso" of the distribution. Multiple factors contribute, including the pain of updates on slow networks and devices, systems that are out of disk space, and harmful interactions with AV software. One upshot is that browsers will need to delete or disable shared caches for users in this state so they don't become fingerprinting vectors. A simple policy would be to remove caches from service after a particular date, creating a performance cliff that disproportionately harms those on the slowest devices.

First, Do No Harm #

Code distributed to every user is a tax on end-user resources, and because caches must be uniform, complete, and common, they will impose a regressive tax. The most enfranchised users with the most high-performance devices and networks will feel their initial (and ongoing) download impacted the least. In contrast, users on the margins will pay a relatively higher price for any expansions of the cache over time and any differential updates to it.

Induced demand is real, and it widens inequality, rather than shrinking it.

If a shared cache is to do good, it must do good for those in the tail and on the margins, not make things worse for them.

For all of the JS ecosystem's assertions that modern systems are fast enough to support the additional overhead of expensive parallel data structures, slow diffing algorithms, and unbounded abstractions, nearly all computing growth has occurred over the past decade at the low end. For most of today's users, the JS community's shared assumptions about compute and bandwidth abundance have been consistently wrong.

The dream of a global, shared, user-subsidised cache springs from the same mistaken analysis about the whys and wherefores of client-side computing. Perhaps one's data centre is post-scarcity, and maybe the client side will be too, someday. But that day is not today. And it won't be this year either.

Cache Back #

Speaking of governance, consider that a shared cache would suddenly create disincentives to adopt anything but the last generation of "winner" libraries. Instead of fostering innovation, the deck would be stacked against new and innovative tools that best use the underlying platform. This is a double disincentive. Developers using JS libraries would suffer an additional real-world cost whenever they pick a more modern option, and browser vendors would feel less pressure to integrate new features into the platform.

As a strategy to improve performance for users, significant questions remain unanswered. Meanwhile, such a cache poses a clear and present danger to the cause of platform progress. The only winners are the makers of soon-to-be obsolete legacy frameworks. No wonder they're asking for it.

Own Goals #

At this point, it seems helpful to step back and consider that the question of resource caching may have different constituencies with different needs:

Framework Authors may be proposing caching to reduce the costs to their sites or users of their libraries.
End Users may want better caching of resources to speed up browsing.

For self-evident security and privacy reasons, browser vendors will be the ones to define the eventual contents of a shared cache and distribute it. Therefore, it will be browser imperatives that drive its payload. This will lead many to be surprised and disappointed at the contents of a fair, shared global cache.

First, to do best by users, the cache will likely be skewed away from the tools of engaged developers building new projects on the latest framework because the latest versions of libraries will lack the deployed base to qualify for inclusion. Expect legacy versions of jQuery and Prototype to find a much established place in the cache than anything popular in "State Of" surveys.

Next, because it will be browser vendors that manage the distribution of the libraries, they are on the hook for creating and distributing derivative works. What does this mean? In short, a copyright and patent minefield. Consider the scripts most likely to qualify based on use: code for embedding media players, analytics scripts, and ad network bootstraps. These aren't the scripts that most people think of when they propose that "[b]rowsers should ship with the top 1 GiB of npm built-in", but they are the sorts of code that will have the highest cache hit ratios.

Plus, they're also copyrighted and unlikely to be open-source. The scale of headaches this would create is difficult to overstate.

Browsers build platform features through web standards, not because it's hard to agree on technical details (although it is), but because vendors need the legal protections that liberally licensed standards provide.^[4] These protections are the only things standing between hordes of lawyers and the wallets of folks who build on the web platform. Even OSS isn't enough to guarantee the sorts of patent license and pooling that Standards Development Organisations (SDOs) like the W3C and IETF provide.

A reasonable response would be to have caches constrain themselves to non-copyleft OSS, rendering them less effective.

And, so we hit bedrock. If the goal isn't to make things better for users at the margins, why bother? Serving only the enfranchised isn't what the web is about. Proposals that externalise governance and administrative costs are also unattractive. Without a credible plan for deprecation and removal, why wade into this space? It's an obvious trap.

"That's Just Standardisation With Extra Steps!" #

Another way to make libraries smaller is to upgrade the platform. New platform features usually let developers remove code, which can reduce costs. This is in the back of browser engineers minds when asked, "Why don't you just put Library X in the browser?". A workable cache proposal has most of the same problems as standards development, after all:

Licensing limitations
Challenging deprecations
Required opt-in to benefit
Limits on what can be added
Difficulties agreeing across browsers about what to include

Why build this new tool when the existing ones are likely to be as effective, if not more so?

Compression is a helpful lens for thinking about caches and platform APIs. The things that platforms integrate into the de facto computing base are nouns and verbs. As terms become common, they no longer need to be explained every time.

Requests to embed libraries into the web's shared computing base is a desire to increase their compression ratio.

Shared precaching is an inefficient way to accomplish the goal. If we can identify common patterns being downloaded frequently, we can modify the platform to include standardised versions. Either way, developers need to account for situations when the native implementations aren't available (polyfilling).

Given that a shared cache system will look nothing like the dreams of those who propose them, it's helpful to instead ask why browser vendors are moving so slowly to improve the DOM, CSS, data idioms, and many other core areas of the platform in response to the needs expressed by libraries.

Thoughtful browser engineers are right to push back on shared caches, but the slow pace of progress in the DOM (at the hands of Apple's under-funding of the Safari team) has been a collective disaster. If we can't have unicorns, we should at least be getting faster horses.

Is There a Version That Could Work? #

Perhaps, but it's unlikely to resemble anything that web developers want. First, let us stipulate some of the constraints we've already outlined on caches:

Sites will likely need to opt-in.
They can't be both large and fair.
Will not rev fast or grow quickly.
Comprised of multiple versions of a few legacy libraries.

To be maximally effective, we might want a cache to trigger for sites that haven't opted-in via SRI. A bloom filter could elide SRI annotations for high-traffic files, but this presents additional governance and operational challenges.

Only resources served as public and immutable (as observed by a trusted crawler) can possibly be auto-substituted. A browser that is cavalier enough to attempt to auto-substitute under other conditions deserves all of the predictable security vulnerabilities it will create.

An auto-substitution URL list will take space, and must also be uniform, complete, and common for privacy reasons. This means that the list itself is competing for space with libraries. This creates real favouritism challenges.

A cache designed to do the most good will need mechanisms to work against induced demand. Many policies could achieve this, but the simplest might be for vendors to disable caches for developers and users on high-end machines. We might also imagine adding scaled randomness to cache hits: the faster the box, the more often a cache hit will silently fail.

Such policies won't help users stuck at the tail of the distribution, but might add a pinch of balance to a system that could further slow progress on the web platform.

A workable cache will also need a new governance body within an existing OSS project or SDO. The composition of such a working group will be fraught. Rules that ensure representation by web developers (not framework authors) and browser vendors can be postulated, but governance will remain challenging. How the interests of security researchers and users on the margins are represented are open problems.

So, could we add a cache? If all the challenges and constraints outlined above are satisfied, maybe. But it's not where I'd recommend anyone who wants to drive the web forward invest their time — particularly if they don't relish chairing a new, highly contentious working group.

Thanks to Eric Lawrence, Laurie Voss, Fred K. Schott, Frederik Braun, and Addy Osmani for their insightful comments on drafts of this post.

My assessment of the potential upside of this sort of cache is generally negative, but in the interest of fairness, I should outline some ways in which pre-caching scripts could enable them to be sped up:
- Scripts downloaded this way can be bytecode cached on the device (at the cost of some CPU burn), but this will put downward pressure on both the size of the cache (as bytecode takes 4-8× the disk space of script text) and on cache growth (time spent optimizing potentially unused scripts is a waste).
- The benefits of download time scale with the age of the script loading technique. For example, using a script from a third-party CDN requires DNS, TCP, TLS, and HTTP handshaking to a new server, all of which can be shortcut. The oldest sites are the most likely to use this pattern, but are also most likely to be unmaintained.
↩︎
Case in point: it was news last year when the Blink API owners^[5] approved a plan to deprecate and remove the long-standing alert(), confirm(), and prompt() methods from within cross-origin <iframe>s. Not all <iframe>s would be affected, and top-level documents would continue to function normally. The proposal was scapular — narrowly tailored to address user abuse while reducing collateral damage.

The change was also shepherded with care and caution. The idea was floated in 2017, aired in concrete form for more than a year, and our friends at Mozilla spoke warmly of it. WebKit even implemented the change. This deprecation built broad consensus and was cautiously introduced.

It blew up anyway.^[6]

Overnight, influential web developers — including voices that regularly dismiss the prudential concerns of platform engineers — became experts in platform evolution, security UX, nested event loops in multi-process applications, Reverse Origin Trials, histograms, and Chromium's metrics. More helpfully, collaboration with affected enterprise sites is improving the level of collective understanding about the risks. Changes are now on hold until the project regains confidence through this additional data collection.

This episode and others like it reveal that developers expect platforms to be conservative. Their trust in browsers comes from the expectation that the surface they program to will not change, particularly regarding existing and deployed code.

And these folks weren't wrong. It is the responsibility of platform stewards to maintain stability. There's even a helpful market incentive attached: browsers that don't render all sites don't have many users. The fast way to lose users is to break sites, and in a competitive market, that means losing share. The compounding effect is for platform maintainers to develop a (sometimes unhelpful) belief that moving glacially is good per se.

A more durable lesson is that, like a diamond, anything added to the developer-accessible surface of a successful platform may not be valuable — but it is forever.^[7] ↩︎
With the advent of H/2 connection re-use and partitioned caches, hosting third-party libraries has become an anti-pattern. It's always faster to host files from your server, which means a shared cache shouldn't encourage users to centralise on standard URLs for hosted libraries, lest they make the performance of the web even worse when the cache is unavailable. ↩︎
For an accessible introduction to the necessity of SDOs and recent history of modern technical standard development, I highly recommend Open Standards and the Digital Age by Andrew L. Russell (no relation). ↩︎
Your humble narrator serves as a Blink API OWNER and deserves his share of the blame for the too-hasty deprecation of alert(), confirm(), and prompt().

In Blink, the buck stops with us, not the folks proposing changes, and this was a case where we should have known that our lingering "enterprise blindness" ^[6:1] in the numbers merited even more caution, despite the extraordinary care taken by the team. ↩︎
Responsible browser projects used to shoot from the hip when removing features, which often led to them never doing it due to the unpredictability of widespread site breakage.

Thankfully, this is no longer the case, thanks to the introduction of anonymised feature instrumentation and metrics. These data sets are critical to modern browser teams, powering everything from global views of feature use to site-level performance reporting and efforts like Core Web Vitals.

One persistent problem has been what I've come to think of as "enterprise blindness".

In the typical consumer scenario, users are prompted to opt-in to metrics and crash reporting on the first run. Even if only a relatively small set of users participate, the Law of Large Numbers ensures our understanding of these histograms is representative across the billions of pages out there.

Enterprises, meanwhile, tend to roll out software for their users and simultaneously push down policy configurations to machines that disable metrics reporting. The result is that enterprise users and the web applications they frequent are dramatically under-reported in the public stats.

Given the data at hand, the team deprecating cross-origin <iframe> prompts was being responsible. It's just that the data had a blind spot, one whose size has been maddeningly difficult to quantify. ↩︎ ↩︎
Forever, give or take half a decade. ↩︎

Towards a Unified Theory of Web Performance

March 1, 2022

This is part four of the six-part series "The Performance Inequality Gap"

Note: This post first ran as part of Sergey Chernyshev and Stoyan Stefanov's indispensible annual series. It's being reposted here for completeness, but if you care about web performance, make sure to check out the whole series and get subscribed to their RSS feed to avoid missing any of next year's posts.

In a recent Perf Planet Advent Calendar post, Tanner Hodges asked for what many folks who work in the space would like for the holidays: a unified theory of web performance.

I propose four key ingredients:

Definition: What is "performance" beyond page speed? What, in particular, is "web performance"?

Purpose: What is web performance trying to accomplish as a discipline? What are its goals?

Principles: What fundamental truths are guiding the discipline and moving it forward?

Practice: What does it look like to work on web performance? How do we do it?

This is a tall order!

A baseline theory, doctrine, and practicum represent months of work. While I don't have that sort of space and time at the moment, the web performance community continues to produce incredible training materials, and I trust we'll be able to connect theory to pracice once we roughly agree on what web performance is and what it's for.

This Is for Everyone #

Embedded in the term "web performance" is the web, and the web is for humans.

That assertion might start an argument in the wrong crowd, but 30+ years into our journey, attempts to promote a different first-order constituency are considered failures, as the Core Platform Loop predicts. The web ecosystem grows or contracts with its ability to reach people and meet their needs with high safely and low friction.

Taking "this is for everyone" seriously, aspirational goals for web performance emerge. To the marginal user, performance is the difference between access and exclusion.

The mission of web performance is to expand access to information and services.

Page Load Isn't Special #

It may seem that web performance comprises two disciplines:

Optimising page load
Optimising post-load interactions

The tools of performance investigators in each discipline overlap to some degree but generally feel like separate concerns. The metrics that we report against implicitly cleave these into different "camps", leaving us thinking about pre- and post-load as distinct universes.

But what if they aren't?

Consider the humble webmail client.

Here are two renderings of the same Gmail inbox in different architectural styles: one based on Ajax, and the other on "basic" HTML:

The Ajax version of Gmail with two messages — The Ajax version of Gmail loads 4.8MiB of resources, including 3.8MiB of JavaScript to load an inbox containing two messages.

The same inbox in Gmail's 'simple HTML' mode — The 'basic' HTML version of Gmail loads in 23KiB, including 1.3KiB of JavaScript.

The difference in weight between the two architectures is interesting, but what we should focus on is the per interaction loop. Typing gmail.com in the address bar, hitting Enter, and becoming ready to handle the next input is effectively the same interaction in both versions. One of these is better, and it isn't the experience of the "modern" style.

These steps inform a general description of the interaction loop:

The system is ready to receive input.
Input is received and processed.
Progress indicators are displayed.
Work starts; progress indicators update.
Work completes; output is displayed.
The system is ready to receive input.

Tradeoffs In Depth #

Consider the next step of our journey, opening the first message. The Ajax version leaves most of the UI in place, whereas the HTML version performs a full page reload. Regardless of architecture, Gmail needs to send an HTTP request to the server and update some HTML when the server replies. The chief effect of the architectural difference is to shift the distribution of latency within the loop.

Some folks frame performance as a competition between Team Local (steps 2 & 3) and Team Server (steps 1 & 4). Today's web architecture debates (e.g. SPA vs. MPA) embody this tension.

Team Local values heap state because updating a few kilobytes of state in memory can, in theory, involve less work to return to interactivity (step 5) while improving the experience of steps 2 and 3.

Intuitively, modifying a DOM subtree should generate less CPU load and need less network traffic than tearing down the entire contents of a document, asking the server to compose a new one, and then parsing/rendering it along with all of its subresources. Successive HTML documents tend to be highly repetitive, after all, with headers, footers, and shared elements continually re-created from source when navigating between pages.

But is this intuitive understanding correct? And what about the other benefits of avoiding full page refreshes, like the ability to animate smoothly between states?

Herein lies our collective anxiety about front-end architectures: traversing networks is always fraught, and so we want to avoid it being jarring. However, the costs to deliver client-side logic that can cushion the experience from the network latency remain stubbornly high. Improving latency for one scenario often degrades it for another. Despite partisan protests, there are no silver bullets; only complex tradeoffs that must be grounded in real-world contexts — in other words, engineering.

As a community, we aren't very good at naming or talking about the distributional effects of these impacts. Performance engineers have a fluency in histograms and percentiles that the broader engineering community could benefit from as a lens for thinking about the impacts of design choices.

Given the last decade of growth in JavaScript payloads, it's worth resetting our foundational understanding of these relative costs. Here, for instance, are the network costs of transitioning from the inbox view of Gmail to a message:

Displaying the first message requires 82KiB of network traffic in the Ajax version of Gmail, half of which are images embedded in the message.

Displaying a message in the 'basic' HTML version requires a full page refresh.

Despite fully reloading the page, the HTML version of Gmail consumes fewer network resources (~70KiB) and takes less overall time to return to interaction.

Objections to the comparison are legion.

First, not all interactions within an email client modify such a large portion of the document. Some UI actions may be lighter in the Ajax version, especially if they operate exclusively in the client-side state. Second, while avoiding a full-page refresh, steps 2, 3, and 4 in our interaction loop can be communicated with greater confidence and in a less jarring way. Lastly, by avoiding an entire back-and-forth with the server for all UI states, it's possible to add complex features — like chat and keyboard accelerators — in a way that doesn't incur loss of context or focus.

The deeper an app's session length and the larger the number of "fiddly" interactions a user may perform, the more attractive a large up-front bundle can be to hide future latency.

This insight gives rise to a second foundational goal for web performance:

We expand access by reducing latency and variance across all interactions in a user's session to more reliably return the system to an interactive state.

For sites with low interaction depths and short sessions, this implies that web performance engineering might remove as much JavaScript and client-side logic as possible. For other, richer apps, performance engineers might add precisely this sort of payload to reduce session-depth-amortised latency and variance. The tradeoff is contextual and informed by data and business goals.

No silver bullets, only engineering.

Medians Don't Matter #

Not all improvements are equal. To understand impacts, we must learn to think in terms of distributions.

Our goal is to minimise latency and variance in the interactivity loop... but for whom? Going back to our first principle, we understand that performance is the predicate to access. This points us in the right direction. Performance engineers across the computing industry have learned the hard way that the sneaky, customer-impactful latency is waaaaay out in the tail of our distributions. Many teams have reported making performance better at the tail only to see their numbers get worse upon shipping improvements. Why? Fewer bouncing users. That is, more users who get far enough into the experience for the system to boot up in order to report that things are slow (previously, those users wouldn't even get that far).

Tail latency is paramount. Doing better for users at the median might not have a big impact on users one or two sigmas out, whereas improving latency and variance for users at the 75th percentile ("P75") and higher tend to make things better for everyone.

As web performance engineers, we work to improve the tail of the distribution (P75+) because that is how we make systems accessible, reliable, and equitable.

A Unified Theory #

And so we have the three parts of a uniform mission, or theory, of web performance:

The mission of web performance is to expand access to information and services.
We expand access by reducing latency and variance across all interactions in a user's session to more reliably return the system to an interactive state.
We work to improve the tail of the distribution (P75+) because that is how we make systems accessible, reliable, and equitable.

Perhaps a better writer can find a pithier way to encapsulate these values.

However they're formulated, these principles are the north star of my performance consulting. They explain tensions in architecture and performance tradeoffs. They also focus teams more productively on marginal users, which helps to direct investigations and remediation work. When we focus on getting back to interactive for the least enfranchised, the rest tends to work itself out.

Older Posts