No More Hustleporn: "Spanner: Google’s Globally-Distributed Database" by Corbett et al (OSDI '12) is a seminal paper from Google that I believe kick-started the last decade of distributed SQL databases such as @Cockroach


Tweet by Pekka Enberg (penberg@fosstodon.org)
https://twitter.com/penberg


   @penberg:  
 
     
      "Spanner: Google’s Globally-Distributed Database" by Corbett et al (OSDI '12) is a seminal paper from Google that I believe kick-started the last decade of distributed SQL databases such as
 
         @CockroachDB  
 
      ,
 
         @Yugabyte  
 
      , and others. So let's dig into what the paper has to say!
 
  🧐🧵1/
 

   @penberg:  
 
     
      What is Spanner? Spanner is a globally-distributed database, which operates at Google scale, but provides general-purpose transactions and SQL as its query language. It provides external consistency and shards data across multiple data centers with Paxos. 2/
 
             @penberg:  
 
     
      Why is this important? Google has a long history of building data stores such as Bigtable, Megastore, and others, for their large-scale workloads, but what they discovered that most applications just want SQL. 3/
 
             @penberg:  
 
     
      But it's not just SQL, but the whole transactional database experience: "[...] it is better to have application programmers deal with performance problems due to overuse of transactions [...], rather than always coding around the lack of transactions." 4/
 
             @penberg:  
 
     
      In the paper, the authors give an example use case: their advertisement backend used a manually sharded MySQL cluster, which was hard to scale. With Spanner, they wanted to retain the SQL interface, but make the thing scale. 5/
 
             @penberg:  
 
     
      The paper only discusses Spanner architecture at a high level. The important bit is that a semi-relational model is layered on top of what is effectively a key-value store. This is an architecture that
 
         @CockroachDB  
 
      also has. 6/
 
             @penberg:  
 
     
      They key-value store consists of a number of spans managed by spanservers. Each span consists of a number of tablets (100 to 1000 per server according to the paper), which provide the basic mechanism for replication and data placement. 7/
 
             @penberg:  
 
     
      A tablet is a bag (a set that allows duplicates) of "(key, timestamp) → value" mappings. Each tablet is managed by an independent Paxos state machine, which provides replication. A tablet is stored in a specific geographic location in full. 8/
 
             @penberg:  
 
     
      At SQL level, application developers can declare hierarchy between SQL tables (using the INTERLEAVE IN extension), which make the two tables live in the same tablet, allowing efficient access to a hierarchy of tables. 9/
 
             @penberg:  
 
     
      Spanner does provide distributed transactions with 2PC too, but they're slow(er) in multi-region configurations because of the geographic latency. That's why application developers are expected to model the schema in a way that doesn't require distributed transactions. 10/
 
             @penberg:  
 
     
      The way Spanner makes all of this work is via a technique they call TrueTime, which provides bounded uncertainty for physical clocks. Without going into the gory details, TrueTime allows the system to produce monotonic timestamps efficiently. 11/
 
             @penberg:  
 
     
      TrueTime requires the use of custom hardware, only available to Google at the time. However, the need for custom hardware has gone away with the invention or Hybrid Logical Clocks that
 
         @CockroachDB  
 
      , for example, uses. 12/
 
             @penberg:  
 
     
      The paper also describes a bunch of Paxos related optimizations that make running lots and lots of Paxos state machines practical, which you will have to read the paper to find out about. 13/
 
             @penberg:  
 
     
      The paper also makes an interesting point about the downsides of layering SQL on top of a key-value store: "node-local data structures have [...] poor performance on complex SQL [...], because they were designed for [...] key-value access". 14/
 
             @penberg:  
 
     
      I never quite understood *why* modern distributed SQL databases are built on top of key-value stores, but reading that makes me think it sort of happened accidentally: the systems were developed from bottom to up. 15/
 
             @penberg:  
 
     
      They already had globally-distributed key-value stores, but ended up building SQL on top of it because of application requirements. Then when others saw that the approach was practical, they ran with it. Perhaps the systems would look different if they were built for SQL? 16/
 
             @penberg:  
 
     
      Anyway, if this got you interested, go read the whole paper. The next paper I think I will read is "CockroachDB: The Resilient Geo-Distributed SQL Database" by Taft et al (SIGMOD '20), which I believe is a modern take on the Spanner architecture. 17/17