RSS Parrot

🦜 Marc Brooker's Blog

@brooker.co.za.blog@rss-parrot.net

I'm an automated parrot! I relay a website's RSS feed to the Fediverse. Every time a new post appears in the feed, I toot about it. Follow me to get all new posts in your Mastodon timeline! Brought to you by the RSS Parrot.

---

Site URL: brooker.co.za/blog/

Feed URL: brooker.co.za/blog/rss.xml

Posts: 157

Followers: 1

It's time to be right.

Published: April 30, 2026 00:00

It’s time to be right. Outcomes continue to matter. Earlier this week, I spoke at AI Dev 26. This is what I spoke about there. I’ve been making money, in some form, building software for nearly 30 years. The last five months have been the most exciting…

Spec Driven Development isn't Waterfall

Published: April 9, 2026 00:00

Spec Driven Development isn’t Waterfall Write down what you mean. After spending a few months writing (e.g. on the Kiro Blog), and speaking (e.g. Real Python Podcast, SE Radio) about spec-driven development, I’ve noticed a common misconception: spec…

What about juniors?

Published: March 25, 2026 00:00

What about juniors? Start at the beginning. Last week I wrote about how the role of the most senior tech ICs has changed. Today, I wanted to share some thoughts on a more difficult topic: how the role of junior software engineers, folks just starting out…

My heuristics are wrong. What now?

Published: March 20, 2026 00:00

My heuristics are wrong. What now? More words. More meaning? Some people who ask me for advice get a lot of words in reply. Sometimes, those responses aren’t specific to my particular workplace, and so I share them here. In the past, I’ve written about…

Music To Build Agents By

Published: March 18, 2026 00:00

Music To Build Agents By I don't have this problem, because I don't use a mouse. Press play, then start reading: Want to learn how to think about agent policy? Start with Goethe’s Der Zauberlehrling. So come along, you old broomstick! Dress…

SFQ: Simple, Stateless, Stochastic Fairness

Published: February 25, 2026 00:00

SFQ: Simple, Stateless, Stochastic Fairness Roll the dice. Paul E. McKenney’s 1990 paper Stochastic Fairness Queuing contains one of my favorite little algorithms for distributed systems. Stochastic Fairness Queuing is a way to stochastically isolate…

You Are Here

Published: February 7, 2026 00:00

You Are Here Where to next? The cost of turning written business logic into code has dropped to zero. Or, at best, near-zero. The cost of integrating services and libraries, the plumbing of the code world, has dropped to zero. Or, at best, near-zero. …

Pass@k is Mostly Bunk

Published: January 21, 2026 00:00

Pass@k is Mostly Bunk Exponentially better results? I'll take three! Measuring the success of AI agents isn’t easy. It’s very sensitive to what success means, it can require a lot of samples, it’s highly context sensitive. Generally hard. So it…

Agent Safety is a Box

Published: January 12, 2026 00:00

Agent Safety is a Box Keep a lid on it. Before we start, let’s cover some terms so we’re thinking about the same thing. This is a post about AI agents, which I’ll define (riffing off Simon Willison) as: An AI agent runs models and tools in a loop to…

On the success of 'natural language programming'

Published: December 16, 2025 00:00

On the success of ‘natural language programming’ Specifications, in plain speech. I believe that specification is the future of programming. Over the last four decades, we’ve seen the practice of building programs, and software systems grow closer and…

What Does a Database for SSDs Look Like?

Published: December 15, 2025 00:00

What Does a Database for SSDs Look Like? Maybe not what you think. Over on X, Ben Dicken asked: What does a relational database designed specifically for local SSDs look like? Postgres, MySQL, SQLite and many others were invented in the 90s and…

What Now? Handling Errors in Large Systems

Published: November 20, 2025 00:00

What Now? Handling Errors in Large Systems More options means more choices. Cloudflare’s deep postmortem for their November 18 outage triggered a ton of online chatter about error handling, caused by a single line in the postmortem: .unwrap() …

Why Strong Consistency?

Published: November 18, 2025 00:00

Why Strong Consistency? Eventual consistency makes your life harder. When I started at AWS in 2008, we ran the EC2 control plane on a tree of MySQL databases: a primary to handle writes, a secondary to take over from the primary, a handful of read…

DSQL: Simplifying Architectures

Published: November 2, 2025 00:00

DSQL: Simplifying Architectures Complexity is a choice. While we were designing and building Aurora DSQL, we spent a lot of time thinking about our experience building and running database-backed systems. We saw that building great, fast,…

Fixing UUIDv7 (for database use-cases)

Published: October 22, 2025 00:00

Fixing UUIDv7 (for database use-cases) How do I even balance a V7? RFC9562 defines UUID Version 7. This has made a lot of people very angry and been widely regarded as a bad move. More seriously, UUIDv7 has received a lot of criticism, despite…

Is Systems Research Really Just About Making Numbers Bigger?

Published: October 12, 2025 00:00

Is Systems Research Really Just About Making Numbers Bigger? The Barbarian F.C. of systems research would be pretty cool. Lots of folks online have been talking about Barbarians at the Gate: How AI is Upending Systems Research by Cheng, Liu, Pan, et al…

Locality, and Temporal-Spatial Hypothesis

Published: October 5, 2025 00:00

Locality, and Temporal-Spatial Hypothesis Good fences make good neighbors? Last week at PGConf NYC, I had the pleasure of hearing Andres Freund talking about the great work he’s been doing to bring async IO to Postgres 18. One particular result…

Seven Years of Firecracker

Published: September 18, 2025 00:00

Seven Years of Firecracker Time flies like an arrow. Fruit flies like a banana. Back at re:Invent 2018, we shared Firecracker with the world. Firecracker is open source software that makes it easy to create and manage small virtual machines. At the time,…

Dynamo, DynamoDB, and Aurora DSQL

Published: August 15, 2025 00:00

Dynamo, DynamoDB, and Aurora DSQL Names are hard, ok? People often ask me about the architectural relationship between Amazon Dynamo (as described in the classic 2007 SOSP paper), Amazon DynamoDB (the serverless distributed NoSQL database from AWS),…

LLMs as Parts of Systems

Published: August 12, 2025 00:00

LLMs as Parts of Systems Towers of Hanoi is a boring game, anyway. Over on the Kiro blog, I wrote a post about Kiro and the future of AI spec-driven software development, looking at where I think the space of AI-agent-powered development tools is going.…

Career advice, or something like it

Published: June 20, 2025 00:00

Career advice, or something like it Cynicism is bad. If I could offer you a single piece of career advice, it’s this: avoid negativity echo chambers. Every organization and industry has watering holes where the whiners hang out. The cynical. The jaded.…

Systems Fun at HotOS

Published: June 2, 2025 00:00

Systems Fun at HotOS One day somebody will tell me what systems means. Last week I attended HotOS for the first time. It was super fun. Just the kind of conference I like: single-track, a mix of academic and industry, a mix of normal practical ideas…

Good Performance for Bad Days

Published: May 20, 2025 00:00

Good Performance for Bad Days Good things are good, one finds. Two weeks ago, I flew to Toronto to give one of the keynotes at the International Conference on Performance Evaluation. It was fun. Smart people. Cool dark squirrels. Interesting…

Decomposing Aurora DSQL

Published: April 17, 2025 00:00

Decomposing Aurora DSQL Riffing, I guess. Earlier today, Alex Miller wrote an excellent blog post titled Decomposing Transaction Systems. It’s one of the best things I’ve read about transactions this year, maybe the best. You should read it now. In…

One or Two? How Many Queues?

Published: March 25, 2025 00:00

One or Two? How Many Queues? Very applied queue theory. There’s a well-known rule of thumb that one queue is better than two. When you’ve got people waiting to check out at the supermarket, having a single shared queue improves utilization and…

What Fekete's Anomaly Can Teach Us About Isolation

Published: February 5, 2025 00:00

What Fekete’s Anomaly Can Teach Us About Isolation Is it just fancy write skew? In the first draft of yesterday’s post, the example I used was one that showed Fekete’s anomaly. After drafting, I realized the example distracted too much from the…

Versioning versus Coordination

Published: February 4, 2025 00:00

Versioning versus Coordination Spoiler: Versioning Wins. Today, we’re going to build a little database system. For availability, latency, and scalability, we’re going to divide our data into multiple shards, have multiple replicas of each shard,…

Snapshot Isolation vs Serializability

Published: December 17, 2024 00:00

Snapshot Isolation vs Serializability Getting into some fundamentals. In my re:Invent talk on the internals of Aurora DSQL I mentioned that I think snapshot isolation is a sweet spot in the database isolation spectrum for most kinds of applications.…

DSQL Vignette: Wait! Isn't That Impossible?

Published: December 6, 2024 00:00

DSQL Vignette: Wait! Isn’t That Impossible? Laws of physics are real. In today’s post, I’m going to look at how Aurora DSQL is designed for availability, and how we work within the constraints of the laws of physics. If you’d like to learn more about…

DSQL Vignette: Transactions and Durability

Published: December 5, 2024 00:00

DSQL Vignette: Transactions and Durability The hard half of a database system? In today’s post, I’m going to look at the other half of what’s under the covers of Aurora DSQL, our new scalable, active-active, SQL database. If you’d like to learn more…

DSQL Vignette: Reads and Compute

Published: December 4, 2024 00:00

DSQL Vignette: Reads and Compute The easy half of a database system? In today’s post, I’m going to look at half of what’s under the covers of Aurora DSQL, our new scalable, active-active, SQL database. If you’d like to learn more about the product…

DSQL Vignette: Aurora DSQL, and A Personal Story

Published: December 3, 2024 00:00

DSQL Vignette: Aurora DSQL, and A Personal Story It's happening. In this morning’s re:Invent keynote, Matt Garman announced Aurora DSQL. We’re all excited, and some extremely excited, to have this preview release in customers’ hands. Over the next few…

Ten Years of AWS Lambda

Published: November 14, 2024 00:00

Ten Years of AWS Lambda Everything starts somewhere. Today, Werner Vogels shared his annotated version of the original AWS Lambda PRFAQ. This is a great inside look into how product development happens at AWS - the real working backwards process in…

Garbage Collection and Metastability

Published: August 14, 2024 00:00

Garbage Collection and Metastability Cleaning up is hard to do. I’ve written a lot about stability and metastability, but haven’t touched on one other common cause of metastability in large-scale systems: garbage collection. GC is great. Garbage…

Resource Management in Aurora Serverless

Published: July 29, 2024 00:00

Resource Management in Aurora Serverless Systems, big and small. My favorite thing about distributed systems is how they allow us to solve problems at multiple levels: single process problems, single machine problems, multi-machine problems, and…

Let's Consign CAP to the Cabinet of Curiosities

Published: July 25, 2024 00:00

Let’s Consign CAP to the Cabinet of Curiosities CAP? Again? Still? Brewer’s CAP theorem, and Gilbert and Lynch’s formalization of it, is the first introduction to hard trade-offs for many distributed systems engineers. Going by the vast amounts of ink…

Not Just Scale

Published: June 4, 2024 00:00

Not Just Scale Bookmarking this so I can stop writing it over and over. It seems like everywhere I look on the internet these days, somebody’s making some form of the following argument: You don’t need distributed systems! Computers are so fast these…

It's always TCP_NODELAY. Every damn time.

Published: May 9, 2024 00:00

It’s always TCP_NODELAY. Every damn time. It's not the 1980s anymore, thankfully. The first thing I check when debugging latency issues in distributed systems is whether TCP_NODELAY is enabled. And it’s not just me. Every distributed system builder I…
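For readers who want to run the check this teaser describes, here is a minimal Python sketch using the standard `socket` module. The helper names `nodelay_enabled` and `enable_nodelay` are hypothetical and not taken from the post; the sketch only shows how the option (which disables Nagle's algorithm) can be read and set on a TCP socket.

```python
import socket


def nodelay_enabled(sock: socket.socket) -> bool:
    """Return True if TCP_NODELAY is set on this TCP socket."""
    # getsockopt returns an integer; non-zero means the option is on.
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0


def enable_nodelay(sock: socket.socket) -> None:
    """Set TCP_NODELAY so small writes are sent without waiting to be batched."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)


if __name__ == "__main__":
    # An unconnected socket is enough to inspect and change the option.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print("before:", nodelay_enabled(s))
        enable_nodelay(s)
        print("after:", nodelay_enabled(s))
```

Many client libraries expose their own switch for this option, so in practice it is worth checking the library's configuration before reaching for the raw socket.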

MemoryDB: Speed, Durability, and Composition.

Published: April 25, 2024 00:00

MemoryDB: Speed, Durability, and Composition. Blocks are fun. Earlier this week, my colleagues Yacine Taleb, Kevin McGehee, Nan Yan, Shawn Wang, Stefan Mueller, and Allen Samuels published Amazon MemoryDB: A fast and durable memory-first cloud database.…

Formal Methods: Just Good Engineering Practice?

Published: April 17, 2024 00:00

Formal Methods: Just Good Engineering Practice? Yes. The answer is yes. In your face, Betteridge. Earlier this week, I did the keynote at TLA+ conf 2024 (watch the video or check out the slides). My message in the keynote was something I have believed to…

Finding Needles in a Haystack with Best-of-K

Published: March 25, 2024 00:00

Finding Needles in a Haystack with Best-of-K Keep track of those needles. As I’ve written about before, best of two and best of k are surprisingly powerful tools for load balancing in distributed systems. I have deployed them many times in…

The Builder's Guide to Better Mousetraps

Published: March 4, 2024 00:00

The Builder’s Guide to Better Mousetraps A little rubric for making a tough decision. Some people who ask me for advice at work get very long responses. Sometimes, those responses aren’t specific to my particular workplace, and so I share them here. In…

Better Benchmarks Through Graphs

Published: February 12, 2024 00:00

Better Benchmarks Through Graphs Isn't the ambiguity in the word *graphs* fun? This is a blog post version of a talk I gave at the Northwest Database Society meeting last week. The slides are here, but I don’t believe the talk was recorded. I…

How Do You Spend Your Time?

Published: February 6, 2024 00:00

How Do You Spend Your Time? Career advice, or something like it. Some people who ask me for advice at work get very long responses. Sometimes, those responses aren’t specific to my particular workplace, and so I share them here. In the past, I’ve…

Pat's Big Deal, and Transaction Coordination

Published: January 23, 2024 00:00

Pat’s Big Deal, and Transaction Coordination Working together towards a common goal. I have a lot of opinions about Pat Helland’s CIDR’24 paper Scalable OLTP in the Cloud: What’s the BIG DEAL?. Most importantly, I like the BIG DEAL that he proposes:…

What is Scalability Anyway?

Published: January 18, 2024 00:00

What is Scalability Anyway? Do words mean things? Why? What does scalable mean? As systems designers, builders, and researchers, we use that word a lot. We kind of all use it to mean that same thing, but not super consistently. Some include scaling both…

Why Aren't We SIEVE-ing?

Published: December 15, 2023 00:00

Why Aren’t We SIEVE-ing? Captain, we are being scanned! Long-time readers of this blog will know that I have mixed feelings about caches. On one hand, caching is critical to the performance of systems at every layer, from CPUs to storage to whole…

It's About Time!

Published: November 27, 2023 00:00

It’s About Time! What's the time? Time to get a watch. My friend Al Vermeulen used to say time is for the amusement of humans. Al’s sentiment is still the common one among distributed systems builders: real wall-clock physical time is great for…

Optimism vs Pessimism in Distributed Systems

Published: October 18, 2023 00:00

Optimism vs Pessimism in Distributed Systems What—Me Worry? Avoiding coordination is the one fundamental thing that allows us to build distributed systems that out-scale the performance of a single machine. When we build systems that avoid coordinating,…

Writing For Somebody

Published: September 21, 2023 00:00

Writing For Somebody Who's there? Sometimes I write long emails to people at work. Sometimes those emails are generally interesting, and not work-specific at all. Sometimes I share those emails here on my blog. This may be one of those times. Always…

Exponential Value at Linear Cost

Published: September 8, 2023 00:00

Exponential Value at Linear Cost What a deal! Binary search is kind of a magical thing. With each additional search step, the size of the haystack we can search doubles. In other words, the value of a search is exponential in the amount of…
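The doubling claim in this teaser can be checked with a small sketch. The code below is an illustration, not code from the post: `binary_search` counts how many probes it makes, and `max_steps` is a hypothetical helper giving the worst case for `n` items, which grows by one each time `n` roughly doubles.

```python
def binary_search(sorted_items, target):
    """Return (index, steps); index is -1 when target is absent."""
    lo, hi = 0, len(sorted_items) - 1
    steps = 0
    while lo <= hi:
        steps += 1  # one probe of the haystack per loop iteration
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid, steps
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, steps


def max_steps(n):
    """Worst-case probes for n >= 1 items: floor(log2(n)) + 1."""
    return n.bit_length()


if __name__ == "__main__":
    for n in (1_000, 2_000, 4_000, 1_000_000):
        print(n, "items need at most", max_steps(n), "steps")
```

Put the other way around, `k` probes are enough for a haystack of up to `2**k - 1` items, so each extra step roughly doubles what can be searched.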

On The Acoustics of Cocktail Parties

Published: August 25, 2023 00:00

On The Acoustics of Cocktail Parties Only parties of well-mannered guests will be considered. If you, like me, tend to practice punctual arrival at parties, you’ve likely noticed that most parties start out quiet. Folks are talking in small groups,…

Invariants: A Better Debugger?

Published: July 28, 2023 00:00

Invariants: A Better Debugger? 🎵Some things never change🎵 Like many of my blog posts, this started out as a long email to a colleague. I expanded it here because I thought folks might find it interesting. I don’t tend to use debuggers. I’m not against…

My Favorite Bits of OSDI/ATC'23

Published: July 13, 2023 00:00

My Favorite Bits of OSDI/ATC’23 Talking to 3D people is cool again. This week brought USENIX ATC’23 and OSDI’23 together in Boston. While I’ve followed OSDI and ATC papers for years, it’s the first time I’ve been to either of them (I’ve been to NSDI…

Bélády's Anomaly Doesn't Happen Often

Published: June 23, 2023 00:00

Bélády’s Anomaly Doesn’t Happen Often Anomaly is a really fun word. Try saying it ten times. It was 1969. The Summer of Love wasn’t raging, Hendrix was playing the anthem, and Forrest Gump was running rampant. In New York, IBM researchers Bélády, Nelson,…

What is a container?

Published: June 19, 2023 00:00

What is a container? What are words, anyway? A common cause of confusion and miscommunication I see is different people using different definitions of words. Sometimes the definitions are subtly different (as with availability). Sometimes they’re…

Container Loading in AWS Lambda

Published: May 23, 2023 00:00

Container Loading in AWS Lambda Slap shot? Back in 2019, we started thinking about how to allow Lambda customers to use container images to deploy their Lambda functions. In theory this is easy enough: a container image is an image of a filesystem, just…

Open and Closed, Omission and Collapse

Published: May 10, 2023 00:00

Open and Closed, Omission and Collapse Were you born in a cave? This, from Open Versus Closed: A Cautionary Tale by Schroeder et al is one of the most important concepts in systems performance: Workload generators may be classified as based on a…

The Four Hobbies, and Apparent Expertise

Published: April 20, 2023 00:00

The Four Hobbies, and Apparent Expertise Around the end of high school, I started to get really into photography. My friend (let’s call him T) was also into it, which should have been great fun. But it wasn’t. Going shooting with him was never…

Surprising Scalability of Multitenancy

Published: March 23, 2023 00:00

Surprising Scalability of Multitenancy When most folks talk about the economics of cloud systems, their focus is on automatically scaling for long-term seasonality: changes on the order of days (fewer people buy things at night), weeks (fewer people…

False Sharing versus Perfect Placement

Published: March 7, 2023 00:00

False Sharing versus Perfect Placement This is part 3 of an informal series on database scalability. The previous parts were on NoSQL, and Hot Keys. In the last installment, we looked at hot keys and how they affect the theoretical peak scale a…

Hot Keys, Scalability, and the Zipf Distribution

Published: February 7, 2023 00:00

Hot Keys, Scalability, and the Zipf Distribution the: so hot right now. Does your distributed database (or microservices architecture, or queue, or whatever) scale? It’s a good question, and often a relevant one, but almost impossible to answer. To…

NoSQL: The Baby and the Bathwater

Published: January 30, 2023 00:00

NoSQL: The Baby and the Bathwater Is this a database? This is a bit of an introduction to a long series of posts I’ve been writing about what, fundamentally, it is that makes databases scale. The whole series is going to take me a long time, but…

Erasure Coding versus Tail Latency

Published: January 6, 2023 00:00

Erasure Coding versus Tail Latency There are zero FEC puns in this post, against my better judgement. Jeff Dean and Luiz Barroso’s paper The Tail At Scale popularized an idea they called hedging, simply sending the same request to multiple places and…

Under My Thumb: Insight Behind the Rules

Published: December 15, 2022 00:00

Under My Thumb: Insight Behind the Rules My left thumb is exactly 25.4mm wide. Starting off in a new field, you hear a lot of rules of thumb. Rules for estimating things, thinking about things, and (ideally) simplifying tough decisions. When I started in…

Lambda Snapstart, and snapshots as a tool for system builders

Published: November 29, 2022 00:00

Lambda Snapstart, and snapshots as a tool for system builders Clones. Yesterday, AWS announced Lambda Snapstart, which uses VM snapshots to reduce cold start times for Lambda functions that need to do a lot of work on start (starting up a language…

Amazon's Distributed Computing Manifesto

Published: November 22, 2022 00:00

Amazon’s Distributed Computing Manifesto Manifesto made manifest. In the Johannesburg of 1998, I was rocking a middle parting, my friend group was abuzz about the news that there was water (and therefore monsters) on Europa, and all the cool kids were…

Writing Is Magic

Published: November 8, 2022 00:00

Writing Is Magic Magic can be dangerous. Sometimes when folks ask me for advice at work, I write them very long emails to answer their question. Sometimes, those emails are generally interesting and not work-specific, so I share them here. A couple days…

Give Your Tail a Nudge

Published: October 21, 2022 00:00

Give Your Tail a Nudge Tricks are fun. We all care about tail latency (also called high percentile latency, also called those times when your system is weirdly slow). Simple changes that can bring it down are valuable, especially if they don’t come…

Atomic Commitment: The Unscalability Protocol

Published: October 4, 2022 00:00

Atomic Commitment: The Unscalability Protocol 2PC is my enemy. Let’s consider a single database system, running on one box, good for 500 requests per second.

┌───────────────────┐
│     Database      │
│(good for 500 rps) │
└───────────────────┘

…

Histogram vs eCDF

Published: September 2, 2022 00:00

Histogram vs eCDF Accumulation is a fun word. Histograms are a rightfully popular way to present data like latency, throughput, object size, and so on. Histograms avoid some of the difficulties of picking a summary statistic, or group of statistics,…

What is Backoff For?

Published: August 11, 2022 00:00

What is Backoff For? Back off man, I'm a scientist. Years ago I wrote a blog post about exponential backoff and jitter, which has turned out to be enduringly popular. I like to believe that it’s influenced at least a couple of systems to add jitter, and…

Getting into formal specification, and getting my team into it too

Published: July 29, 2022 00:00

Getting into formal specification, and getting my team into it too Getting started is the hard part Sometimes I write long email replies to people at work asking me questions. Sometimes those emails seem like they could be useful to more than just the…

The DynamoDB paper

Published: July 12, 2022 00:00

The DynamoDB paper The other database called Dynamo This week at USENIX ATC’22, a group of my colleagues from the AWS DynamoDB team are going to be presenting their paper Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL…

Formal Methods Only Solve Half My Problems

Published: June 2, 2022 00:00

Formal Methods Only Solve Half My Problems At most half my problems. I have a lot of problems. The following is a one-page summary I wrote as a submission to HPTS’22. Hopefully it’s of broader interest. Formal methods, like TLA+ and P, have proven to be…

What is a simple system?

Published: May 3, 2022 00:00

What is a simple system? Is this pretentious? Why do I need cryptography when I could simply hide the contents of my communications rotating every letter by 13? Why do I need a distributed storage system when I could simply store my files on this one…

Simple Simulations for System Builders

Published: April 11, 2022 00:00

Simple Simulations for System Builders Even the most basic numerical methods can lead to surprising insights. It’s no secret that I’m a big fan of formal methods. I use P and TLA+ often. I like these tools because they provide clear ways to communicate…

Fixing retries with token buckets and circuit breakers

Published: February 28, 2022 00:00

Fixing retries with token buckets and circuit breakers Throttle yourself before you DoS yourself. After my last post on circuit breakers, a couple of people reached out to recommend using circuit breakers only to break retries, and still send normal…

Will circuit breakers solve my problems?

Published: February 16, 2022 00:00

Will circuit breakers solve my problems? Maybe, but you need to know what problem you're trying to solve first. A couple of weeks ago, I started a tiny storm on Twitter by posting this image, and claiming that retries (mostly) make things worse in…

Software Deployment, Speed, and Safety

Published: January 31, 2022 00:00

Software Deployment, Speed, and Safety There's one right answer that applies in all situations, as always. Disclaimer: Sometime around 2015, I wrote AWS’s official internal guidance on balancing deployment speed and safety. This blog post is not that.…

DynamoDB's Best Feature: Predictability

Published: January 19, 2022 00:00

DynamoDB’s Best Feature: Predictability Happy birthday! It’s 10 years since the launch of DynamoDB, Amazon’s fast, scalable, NoSQL database. Back when DynamoDB launched, I was leading the team rethinking the control plane of EBS. At the time, we had a…

The Bug in Paxos Made Simple

Published: November 16, 2021 00:00

The Bug in Paxos Made Simple There's not really a bug in Paxos, but clickbait is fun. Over the last few weeks, I’ve been picking up the excellent P programming language, a language for modelling and specifying distributed systems. One of the first things…

Serial, Parallel, and Quorum Latencies

Published: October 20, 2021 00:00

Serial, Parallel, and Quorum Latencies Why are they letting me write Javascript? I’ve written before about the latency effects of series (do X, then Y), parallel (do X and Y, wait for them both), and quorum (do X, Y and Z, return when two of them are…
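The three composition patterns named in this teaser can be written down as a short sketch. This is not the post's own code (its subtitle suggests that was JavaScript). The function names are hypothetical, the quorum rule assumes "return when k of the calls have completed", and the exponential latency distribution in the demo is an arbitrary choice for illustration.

```python
import random


def serial(latencies):
    """Do X, then Y: total latency is the sum of the parts."""
    return sum(latencies)


def parallel(latencies):
    """Do X and Y, wait for both: total latency is the slowest part."""
    return max(latencies)


def quorum(latencies, k):
    """Start all calls, return once k have completed: the k-th fastest part."""
    return sorted(latencies)[k - 1]


def mean_latency(combine, n_calls, trials=10_000, seed=1):
    """Average combined latency over random trials (assumed exponential, mean 1.0)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += combine([rng.expovariate(1.0) for _ in range(n_calls)])
    return total / trials


if __name__ == "__main__":
    print("serial of 2:    ", mean_latency(serial, 2))
    print("parallel of 2:  ", mean_latency(parallel, 2))
    print("quorum 2 of 3:  ", mean_latency(lambda ls: quorum(ls, 2), 3))
```

Swapping in a measured latency distribution for `rng.expovariate` is the natural next step if the goal is to compare the patterns for a real service.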

Caches, Modes, and Unstable Systems

Published: August 27, 2021 00:00

Caches, Modes, and Unstable Systems Best practices are seldom the best. Is your system having scaling trouble? A bit too slow? Sending too much traffic to the database? Add a caching layer! After all, caches are a best practice and a standard way to…

My Proposal for Arecibo: Drones

Published: August 11, 2021 00:00

My Proposal for Arecibo: Drones With apologies to real radio astronomers Last night I finally got around to watching Grady Hillhouse’s excellent video on the collapse of the Arecibo Telescope. At the end of Grady’s video he says: I hope eventually…

Latency Sneaks Up On You

Published: August 5, 2021 00:00

Latency Sneaks Up On You And is a bad way to measure efficiency. As systems get big, people very reasonably start investing more in increasing efficiency and decreasing costs. That’s a good thing, for the business, for the environment, and often for the…

Metastability and Distributed Systems

Published: May 24, 2021 00:00

Metastability and Distributed Systems What if computer science had different parents? There’s no more time-honored way to get things working again, from toasters to global-scale distributed systems, than turning them off and on again. The reasons that…

Tail Latency Might Matter More Than You Think

Published: April 19, 2021 00:00

Tail Latency Might Matter More Than You Think A frustratingly qualitative approach. Tail latency, also known as high-percentile latency, refers to high latencies that clients see fairly infrequently. Things like: “my service mostly responds in around…

Redundant against what?

Published: April 14, 2021 00:00

Redundant against what? Threat modeling thinking to distributed systems. There’s basically one fundamental reason that distributed systems can achieve better availability than single-box systems: redundancy. The software, state, and other things needed…

What You Can Learn From Old Hard Drive Adverts

Published: March 25, 2021 00:00

What You Can Learn From Old Hard Drive Adverts The single most important trend in systems. Adverts for old computer hardware, especially hard drives, are a fun staple of computer forums and the nerdier side of the internet. For example, a couple days…

Incident Response Isn't Enough

Published: February 22, 2021 00:00

Incident Response Isn’t Enough Single points of failure become invisible. Postmortems, COEs, incident reports. Whatever your organization calls them, when done right they are a popular and effective way of formalizing the process of digging into system…

The Fundamental Mechanism of Scaling

Published: January 22, 2021 00:00

The Fundamental Mechanism of Scaling It's not Paxos, unfortunately. A common misconception among people picking up distributed systems is that replication and consensus protocols—Paxos, Raft, and friends—are the tools used to build the largest and most…

Quorum Availability

Published: January 6, 2021 00:00

Quorum Availability It's counterintuitive, but is it right? In our paper Millions of Tiny Databases, we say this about the availability of quorum systems of various sizes: As illustrated in Figure 4, smaller cells offer lower availability in the face…

Getting Big Things Done

Published: October 19, 2020 00:00

Getting Big Things Done In one particular context. A while back, a colleague wanted to make a major change in the design of a system, the sort of change that was going to take a year or more, and many tens of person-years of effort. They asked me how to…

Consensus is Harder Than It Looks

Published: October 5, 2020 00:00

Consensus is Harder Than It Looks And it looks pretty hard. In his classic paper How to Build a Highly Available System Using Consensus Butler Lampson laid out a pattern that’s become very popular in the design of large-scale highly-available systems.…

Focus on the Good Parts

Published: September 2, 2020 00:00

Focus on the Good Parts Skepticism and cynicism can get in your way. Back in May, I wrote Reading Research: A Guide for Software Engineers, answering common questions I get about why and how to read research papers. In that post, I wrote about three…

Surprising Economics of Load-Balanced Systems

Published: August 6, 2020 00:00

Surprising Economics of Load-Balanced Systems The M/M/c model may not behave like you expect. I have a system with c servers, each of which can only handle a single concurrent request, and has no internal queuing. The servers sit behind a load balancer,…

A Story About a Fish

Published: July 28, 2020 00:00

A Story About a Fish Nothing's more boring than a fishing story. In the 1930s, Marjorie Latimer was working as a museum curator in East London. Not the eastern part of London as one may expect. This East London is a small city on South Africa’s south…

Code Only Says What it Does

Published: June 23, 2020 00:00

Code Only Says What it Does Only loosely related to what it should do. Code says what it does. That’s important for the computer, because code is the way that we ask the computer to do something. It’s OK for humans, as long as we never have to modify or…

Some Virtualization Papers Worth Reading

Published: June 8, 2020 00:00

Some Virtualization Papers Worth Reading A short, and incomplete, survey. A while back, Cindy Sridharan asked on Twitter for pointers to papers on the past, present and future of virtualization. I picked a few of my favorites, and given the popularity of…

~ 57 additional posts are not shown ~