Sharding the blockchain
Vlad Zamfir
I am going to talk about blockchain sharding. This is one of the scaling solutions that I like the most, and the one I have done the most work on. In case you don't know who I am, I am a researcher mostly with the ethereum project. The basic idea of sharding is that in a non-sharded blockchain, every node does every validation. In a sharded solution, nodes hold a subset of the state and a subset of the blockchain. By state we mean UTXOs. Instead of everyone redundantly doing the same work, we're going to share the load, but still have an economic assurance even though not everyone validates every transaction.
I am going to give a high-level overview; the details would take too long. We can get orders of magnitude more transactions per second if everyone isn't validating everything. We still have authentication from the genesis block using proof-of-work; it's all proof-of-work from the genesis block, so it's the same trust model. There's a clear appeal.
There are challenges to sharding the blockchain. We need to assign miners to shards in a way that stops miners from one shard jumping to another to produce invalid blocks, so we need to sample the mining power, which presents a problem. We need to split the state space into shards, we need to process transactions within shards, and then we need to deal with the attack vectors that appear when not everyone is checking everything.
Securely sampling mining power needs to happen. We can do it, but it means changing the block header. The key to doing this is non-outsourceable proof-of-work. Basically it is proof-of-work where the private key from the coinbase has to be on the machine that produces the proof-of-work. You're required to sign a nonce in the block header before hashing, so it's not enough that the coinbase corresponds to a private key the mining pool holds somewhere; the machine doing the hashing has to have the private key to sign with. It represents a change to the block structure, and it requires ASICs for non-outsourceable proof-of-work, with a backwards-compatible transition that everyone switches to after enough people have those ASICs. It lets you estimate the mining power that various people have: proof-of-work proofs give you a lower bound, so you know they don't have less mining power than they claim.
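To make that concrete, here is a minimal Python sketch of the idea. The serialization and helper names are my own illustrative assumptions, and an HMAC stands in for a real ECDSA signature with the coinbase key so the sketch runs without external libraries; the point is just that every hashing attempt needs a signature, so the key has to sit on the mining machine.

```python
import hashlib
import hmac
import os

# HMAC stands in for a real ECDSA signature with the coinbase private key,
# so the sketch runs without external libraries; the serialization is made up.
def sign_nonce(coinbase_privkey: bytes, nonce: int) -> bytes:
    return hmac.new(coinbase_privkey, nonce.to_bytes(8, "big"), hashlib.sha256).digest()

def mine_non_outsourceable(header: bytes, coinbase_privkey: bytes, target: int):
    """Grind nonces, but every attempt needs a signature over the nonce,
    so whoever runs this loop has to hold the coinbase private key."""
    nonce = 0
    while True:
        sig = sign_nonce(coinbase_privkey, nonce)  # key required per attempt
        digest = hashlib.sha256(header + sig + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce, sig
        nonce += 1

privkey = os.urandom(32)
nonce, sig = mine_non_outsourceable(b"block-header-prefix", privkey, 2 ** 244)
print("found nonce", nonce)
```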
We need a way to estimate these, maybe a treechain-style approach where anyone can mine parts of the tree and less hashpower mines the bottom of the tree, so everyone can get their transactions in. Then the amount of hashing power that anyone can use will be bounded by this estimate, so that you can't pretend you have more hashpower than you do, and you can't bring huge amounts of hashrate onto a shard to overwhelm it. First you estimate the hashrate of all these miners, then sample them, then assign them to shards. We need to securely sample the hashrate, otherwise the hashrate might be overwhelming on a certain shard.
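A toy version of the estimate-and-bound step might look like the sketch below. The formula just converts counted proofs at a known target into an expected-hashes-per-second lower bound; the slack factor is an assumption of mine, not part of the proposal.

```python
def hashrate_lower_bound(num_proofs: int, target: int, window_seconds: float,
                         hash_bits: int = 256) -> float:
    """Lower-bound estimate of a coinbase's hashrate from the number of
    non-outsourceable proofs it produced at a given target over a time window."""
    expected_hashes_per_proof = 2 ** hash_bits / target
    return num_proofs * expected_hashes_per_proof / window_seconds

def shard_work_allowed(claimed_work: float, estimate: float,
                       window_seconds: float, slack: float = 1.5) -> bool:
    """A coinbase may not contribute more work to its shard than its top-level
    estimate allows (with some slack for variance)."""
    return claimed_work <= slack * estimate * window_seconds

rate = hashrate_lower_bound(num_proofs=12, target=2 ** 240, window_seconds=600)
print(rate, shard_work_allowed(claimed_work=rate * 600, estimate=rate, window_seconds=600))
```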
Fault tolerance analysis: if the majority are honest, then you can basically split up the pie into lots of little slices, randomly reorder the slices, and sample them. The larger the number of slices you have, the higher the probability of having a majority of correct nodes in any given shard. Just from the fault tolerance analysis, if we have enough slices of the pie we can make more shards and still have a very high probability of a correct majority on every shard. This is important because if a majority of byzantine nodes end up on a shard, they might make an invalid block, and other miners won't be aware of that.
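Here is a small sketch of that calculation, the cumulative binomial I mention next. It assumes each sampled slice is independently Byzantine with some fixed probability, which is only an approximation of sampling without replacement.

```python
from math import comb

def p_shard_majority_byzantine(slices_per_shard: int, p_byzantine: float) -> float:
    """Probability that a randomly sampled shard of `slices_per_shard` slices of
    mining power contains a Byzantine majority, assuming each slice is Byzantine
    independently with probability `p_byzantine` (cumulative binomial tail)."""
    k_min = slices_per_shard // 2 + 1
    return sum(
        comb(slices_per_shard, k) * p_byzantine ** k * (1 - p_byzantine) ** (slices_per_shard - k)
        for k in range(k_min, slices_per_shard + 1)
    )

# e.g. one third of total hashpower Byzantine, 101 slices per shard:
print(p_shard_majority_byzantine(101, 1 / 3))  # small, on the order of 1e-4
```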
We can do a cumulative binomial distribution to see the probability of any one shard having more than half byzantine, and we can make sure that probability is very low when we place samples of mining power into shards. Then we need to shard the UTXO space. If two transactions spend the same UTXO, only one of them can get in; you can process transactions concurrently if they spend different UTXOs. So we shard the UTXO set into mutually exclusive sets. Transactions that spend inputs from shard i are mined by the miners we placed in shard i, so there are going to be transactions they process in their shard with outputs coming from or going to other shards, and they only need to check that the spending is valid. Then we check that the outputs are valid, and we use proof-of-work as economic proof that transaction outputs were spent properly.
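One simple way to partition the UTXO set into mutually exclusive sets, sketched here as an assumption of mine rather than the exact scheme from the talk, is to hash the outpoint:

```python
import hashlib

NUM_SHARDS = 16  # illustrative; the real number would come from the top-level chain

def shard_of_utxo(txid: bytes, vout: int, num_shards: int = NUM_SHARDS) -> int:
    """Deterministically map an outpoint (txid, vout) to a shard, so the UTXO
    set is partitioned into mutually exclusive sets."""
    h = hashlib.sha256(txid + vout.to_bytes(4, "little")).digest()
    return int.from_bytes(h, "big") % num_shards

# A transaction spending inputs that live in shard i is mined by the miners
# assigned to shard i; its outputs may map to other shards and get grouped there.
print(shard_of_utxo(b"\x00" * 32, 1))
```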
So basically there are two levels of blocks. Top-level blocks deal with estimating people's mining power, and low-level blocks in the shards produce the outputs. Everyone will be a light client of the shards they aren't in, so they won't process those transactions; ideally the headers of the shard blocks will have merkle roots for the UTXO state changes, and we can talk more about that later. The shard blocks will also group together outputs that are going to different shards, so that those shards can update their UTXO sets to include the new outputs.
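Just to make the two-level structure concrete, here is a hypothetical shape for the two kinds of blocks; the field names are my assumptions, not the actual design.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class ShardHeader:
    shard_id: int
    prev_shard_block: bytes           # hash of the previous block in this shard
    utxo_delta_root: bytes            # merkle root of the UTXO state changes
    outgoing_roots: Dict[int, bytes]  # destination shard id -> root of outputs sent there
    pow_nonce: int
    pow_signature: bytes              # signature making the proof-of-work non-outsourceable

@dataclass
class TopLevelBlock:
    prev_top_block: bytes
    hashrate_estimates: Dict[bytes, float]  # coinbase pubkey -> estimated hashrate
    shard_headers: Dict[int, ShardHeader]   # SPV-style commitment to every shard
```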
We estimate the mining power of coinbases, we sample the coinbases into shards, we split up the UTXOs, and then we validate the work and format of the low-level block headers. The top-level block is just SPV of all the low-level blocks: where a normal block has transactions, the top-level block instead has block headers for the different shards, and those shard blocks hold the groups of transactions.
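Using the hypothetical shapes from the sketch above, the top-level check might look like this; `header_work_ok` is an illustrative stand-in for whatever work check the shards actually use.

```python
import hashlib
from typing import Dict

def header_work_ok(header: ShardHeader, target: int) -> bool:
    # Sketch serialization; hash the fields the same way shard miners would.
    payload = (header.prev_shard_block + header.utxo_delta_root +
               header.pow_signature + header.pow_nonce.to_bytes(8, "big"))
    return int.from_bytes(hashlib.sha256(payload).digest(), "big") < target

def validate_top_level(block: TopLevelBlock, shard_targets: Dict[int, int]) -> bool:
    """The top level only checks the work and format of the low-level headers
    (SPV of every shard), never the shard transactions themselves."""
    return all(
        h.shard_id == sid and header_work_ok(h, shard_targets[sid])
        for sid, h in block.shard_headers.items()
    )
```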
https://github.com/vbuterin/scalability_paper/blob/master/scalability.pdf
We have this top-level chain, and the more overhead we allow there, the more shards we can add: the number of shards is linear in the overhead of the top-level chain. So if we assume that half of a node's bandwidth capacity goes to validating the top level and half goes to processing transactions on its shard, then a linear increase in a node's computational power (say from increasing the block size) leads to a linear increase in the number of shards and a linear increase in the transactions per shard. That gives us O(n²) scaling of transactions per second. At the moment we get O(n): every node has to keep up, so we're limited linearly by the weakest node. Here we are still limited by the weakest node, but throughput is O(n²) in its capacity, superlinear instead of linear. Increasing the block size has a bigger effect with sharding than without.
Just to illustrate this: if we picture the computational power so that the beige color represents the overhead for checking the top-level chain and the other colors represent the overhead for checking the shards, then when we increase the computational power, the top-level chain can keep track of more shards, and we get a linear increase in transactions per second because we added another shard. The shards also get longer, because we insist that we spend half of the capacity validating block headers and half validating transactions in our shard. So we get this square increase in the number of transactions we can do when we increase the capacity of a single individual node.
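As a toy calculation, with the half-and-half assumption above and made-up unit constants:

```python
def total_tps(node_capacity: float, shards_per_unit: float = 1.0,
              tps_per_shard_per_unit: float = 1.0) -> float:
    """Half of a node's capacity follows the top-level chain (so the number of
    shards is linear in capacity) and half validates one shard (so per-shard
    throughput is also linear in capacity); the product is quadratic."""
    half = node_capacity / 2
    return (shards_per_unit * half) * (tps_per_shard_per_unit * half)

print(total_tps(2) / total_tps(1))  # doubling node capacity -> 4x total throughput
```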
Security problems: one problem is output corralling. On every shard you have an independent market for transaction fees, so if someone corrals lots of outputs into a shard it becomes expensive to put outputs there; but you can move your outputs out into another shard. Invalid blocks are another problem, where a miner might produce an invalid block in a shard, and if we don't find out about it, other people might process outputs from that shard going to another shard. This is especially bad because you could convince rational nodes in the shard to go along with it, whereas the fault tolerance analysis assumes nodes are faulty ahead of time, before the sampling. There are challenge-response protocols that deal with this cleanly: if you mine an invalid block and someone challenges you, they can prove it, and if people don't agree, there's a big incentive for people to download that block and verify it, because they gain from the market where people are betting on whether it is invalid by betting on the right side. Another problem is dealing with unavailable blocks, where you don't know if the block is invalid or not, but you just can't find it. There are challenge-response protocols here too; you want to show that the data is unavailable, but you can't prove that data is unavailable. What you definitely won't do is build a block on top of their block, because if you build on top of an invalid block you won't get a block reward, because of shared invalidity. There are ways to deal with this as well.
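As a flavour of what a challenge could carry, here is a minimal sketch of one kind of fraud proof, a double-spend inside a shard block. In a full protocol each transaction would come with a merkle branch against the shard header, which I've left out, and all the types here are my own illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple

Outpoint = Tuple[bytes, int]  # (txid, vout)

@dataclass
class Tx:
    inputs: List[Outpoint]
    outputs: List[bytes]

def fraud_proof_double_spend(tx_a: Tx, tx_b: Tx) -> bool:
    """True if two distinct transactions spend a common outpoint, i.e. the shard
    block containing both is provably invalid and should be reverted."""
    return tx_a is not tx_b and bool(set(tx_a.inputs) & set(tx_b.inputs))

op = (b"\x11" * 32, 0)
print(fraud_proof_double_spend(Tx([op], [b"a"]), Tx([op], [b"b"])))  # True
```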
Basically some of the takeaways are that we can do more transactions per second without changing the trust model; we count the total work from the genesis block, and instead of only looking at the headers on the top-level chain, we also look at the headers from the shards. The work is non-outsourceable, and we need that to do the sampling and assign miners to shards. The protocol is complex, especially when we start talking about challenge-response: what happens if something goes wrong in a shard? More attention needs to go there; it's a bit complex. Another thing I should mention is that if you find an invalid block, it needs to be reverted, and so do the blocks that spent its outputs, so there could be a good number of block reversions as a result of finding only one invalid block. It's very important to pay attention to other shards too, not just the shard you are in. I don't know, I hope that you guys can maybe form an opinion, and that you will talk to me about it. Thanks a lot.