Advances in block propagation

Greg Maxwell

2017-11-27

slides: https://people.xiph.org/~greg/gmaxwell-sf-prop-2017.pdf

video: https://www.youtube.com/watch?v=EHIuuKCm53o

https://en.bitcoin.it/wiki/User:Gmaxwell/block_network_coding

efficient block transfer: https://people.xiph.org/~greg/efficient.block.xfer.txt

low latency block transfer: https://people.xiph.org/~greg/lowlatency.block.xfer.txt

https://people.xiph.org/~greg/mempool_sync_relay.txt

https://people.xiph.org/~greg/weakblocks.txt

compact blocks FAQ https://bitcoincore.org/en/2016/06/07/compact-blocks-faq/

some history https://www.reddit.com/r/btc/comments/557ihg/what_are_the_arguments_against_xthin_blocks/d88q5l4/

more history https://www.reddit.com/r/btc/comments/6p076l/segwit_only_allows_170_of_current_transactions/dkmugw5/

https://twitter.com/kanzure/status/944656354307919873


Introduction

I'll be talking about some recent advances in block propagation and why this stuff is important. I am going to talk about how the original bitcoin p2p protocol works, and it doesn't work that way for the most part anymore. I will talk about the history of this, including fast block relay protocol, block network coding, and then building up to where the state of the art is today.

There are two things that people talk about when they are talking about the resource usage of block propagation. One is the block propagation cost to node operators: they use bandwidth and CPU power to propagate blocks. This is obviously a factor. The original bitcoin protocol was particularly inefficient with the transmission of blocks. You would operate as a node on the network and receive transactions as they came in, and then when a block was found you would receive the block, which included the transactions you had already received. That roughly doubles the amount of data that needs to be sent. But it's not a doubling of the node's overall bandwidth usage, because it turns out that nodes do things other than just relay blocks, and those other things are even less efficient than block propagation. In particular, the process that bitcoin uses to relay transactions is really inefficient. The standard bitcoin protocol method of relaying a transaction is to announce to your peers "hey, I know bitcoin txid 'deadbeef'", then a peer responds with "please send me the transaction with txid 'deadbeef'", and that's roughly 38 bytes of INV traffic on the wire per transaction, while a transaction is maybe 200 bytes. So you're using more than 10% of the bandwidth just to announce transactions. Even if you got rid of block relay entirely, the reduction in a bitcoin node's bandwidth would only be about 12%. Although 12% is not nothing: bitcoin is a highly optimized system, and we grab every improvement we can.
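To make those numbers concrete, here is a small back-of-the-envelope sketch using the round figures from the talk; the peer count is a made-up parameter, not something from the talk:

```python
# Back-of-the-envelope sketch of legacy INV/getdata announcement overhead,
# using the round numbers quoted in the talk. The peer count is hypothetical.
INV_BYTES = 38        # per-transaction announcement traffic on the wire (approx.)
TX_BYTES = 200        # a "typical" small transaction, per the talk
PEERS = 8             # hypothetical number of connections exchanging announcements

# On a single link, announcing costs INV_BYTES against TX_BYTES of payload:
single_link_overhead = INV_BYTES / (INV_BYTES + TX_BYTES)
print(f"per-link announcement overhead: {single_link_overhead:.0%}")

# A node exchanges announcements with many peers but only transfers the full
# transaction once, so announcements dominate as the peer count grows:
total = PEERS * INV_BYTES + TX_BYTES
print(f"with {PEERS} peers, announcements are {PEERS * INV_BYTES / total:.0%} of tx-relay bytes")
```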

The other problem with the resource usage is that the old bitcoin block propagation was really bursty. You would use a steady amount of bandwidth as transactions came in, but when a block was found you would suddenly use a megabyte of bandwidth all at once. Back at Mozilla, I could tell which of my colleagues were bitcoin users because during video chat their video would stall out in time with blocks appearing on the bitcoin network. So they would have to turn off their bitcoin nodes.

That behavior was a real usability factor in how comfortable it was to run a node on residential broadband. A few years ago there was a lot of noise on the internet about bufferbloat, where excessive buffering in routers causes excessive latency, and it has still not been fixed. A big block sent all at once is basically the perfect storm for bufferbloat. So it makes your VOIP mess up.

The bigger concern for block propagation is: what is the latency for distributing a block all around the network, and in particular to all of the hashpower? This is a concern for two major reasons. One is that if propagation takes a long time relative to the interval between blocks, then there will be more forks in the network and confirmations are less useful; it becomes more likely that confirmations get reorged out. So this has an impact on bitcoin's interblock interval, currently 10 minutes, which is a nice and safe number, far away from convergence failures. In order to lower that number, block propagation needs to be up to the challenge, so that's one reason it's important. The other reason is that block propagation time creates "progress" in mining. Ideally, mining is supposed to work like a perfectly fair lottery: if you have 10% of the network hashrate, you should mine 10% of the blocks. When you introduce delays into block propagation, mining works more like a race instead of a fair lottery. In a race, the fastest runner will always win unless the next fastest is very close in speed; if one runner goes at 10% speed and another at 30% speed, the 10% one will never win the race. Mining is supposed to be like a lottery, and propagation delay makes it more like a race. The reason is that if I don't know about the latest blocks, then my blocks won't extend them and I'm starting behind; and if other miners don't know about my blocks, they won't mine on top of them either.

Why is progress in mining bad?

So why is "progress" in mining bad? As I was saying, there are incentives such that every participant should find a proportion of blocks equal to their hashrate proportion. There is a centralization pressure: if blocks take a longer time to propagate, then the bigger the consolidation of hashrate, and the bigger the advantage they have, and there is no system limit to this. You are more centralized, you can buy more hashpower and get even bigger. Miners have the opportunity to choose, to collaborate into a single pool. I'll talk more about this.

This is often misunderstood. When propagation comes up on reddit, there are vaguely nationalist sentiments, the "screw the Chinese with their slow connections" kind of things people say. I think that's wrong, because the block propagation issue isn't better or worse for people with faster or slower connections; it's better or worse for people with more hashrate. So if it were the case that people in China had very limited bandwidth, they would actually gain from this problem, so long as they had a lot of hashpower, which in fact they do.

In general, this is a problem that we need to overkill. When we talk about how we spend resources in the community, there's many cases where we can half-ass a solution and that works just fine. It's true for a lot of things. But other things we need to solve well. Then there's a final set of things that we need to nuke from orbit. The reason why I think that block propagation is something that we should overkill is because we can't directly observe its effects. If block propagation is too slow, then it's causing mining centralization, and we won't necessarily see that happening.

Why won't miners solve block propagation?

https://www.youtube.com/watch?v=EHIuuKCm53o&t=7m30s

The problem adversely affects miners, so why won't they solve it? Well, one reason is that bad propagation is actually good for larger miners, who also happen to be the ones in an economic position to work on the problem in the first place. I don't think that any large miner today or in the recent past has been intentionally exploiting this. But there's an effect where, if you're benefiting from something and it's profitable for you, you might not notice that anything is wrong; you're not going to go "hmm, why am I making too much money?" and then go investigate. I mean, it takes a kind of weird person to do that. I am one of those weird people. But for the most part, sensible people don't work like that, and that's an effect worth keeping in mind.

Also, miner speciality isn't protocol development. If you ask them to solve this problem, then they are going to solve it by doing it the miner way-- the miner's tool is to get lots of electricity, lots of hardware, contracts, etc. There are tools in the miner toolbelt, but they are not protocol design. In the past, we saw miners centralize to pools. For example, if their pool had too high of an orphan rate, they would move to another (larger) pool with a lower orphan rate. This is absolutely something that we have seen happen in the past, and I'll talk about what we did to help stop those problems.

Another thing miners have been doing is sometimes extending chains while completely blind: instead of waiting for the block to be propagated, they learn just enough from another pool to get the header, and they extend it without validating, so they can start mining sooner. The bip66 soft-fork activation showed us that likely a majority of the hashrate on the network was mining without verifying. Mining without verifying is benign so long as nothing goes wrong. Unfortunately, SPV clients make a very strong security assumption that miners are validating and are economically incentivized to validate. The miners' incentives don't actually work out like that, because they figure blocks aren't invalid very often. If you are using an SPV client and accepting millions of dollars in transactions, that's possibly bad for you, right?

Bitcoin is designed to have a system of incentives and we expect participants to follow their incentives. However, this isn't a moral judgement. We need to make it so that breaking the system isn't the most profitable thing to do, and improving block propagation is one of the ways we can do that.

Sources of latency

https://www.youtube.com/watch?v=EHIuuKCm53o&t=10m45s

Why does it take a while for a block to propagate? There are many sources of latency when relaying a block through the network. Creating a block template to hand out is something that we can improve in Bitcoin Core through better software engineering, not protocol changes. In earlier versions it would take a few seconds; today it's a few milliseconds, and that's through clever software optimization. The miner then dispatches the block template to their mining hardware, and that can take entire seconds in some setups; it's not in my remit to control that, it's something for mining hardware manufacturers to fix. When you are purchasing mining equipment, you should ask the manufacturer for these numbers. These delays can be fixed with better firmware.

You also need to distribute your solution to peers outside of your own network. You have to send the data over the wire. There are protocol roundtrips, like INV and getdata. There are also TCP roundtrips, which are invisible to people, and people don't realize how slow this is. Before we had the block propagation improvements, we measured block propagation on the bitcoin network based on block size and orphan rate, and we found that the aggregate transmission rate through the network was only around 750 kilobits/second. This was because of node latency and also because of packet loss and TCP behavior; all these things together made it so that even though nodes had fast connections, propagation between nodes was only about 750 kbit/sec.

One of the things that is very important to keep in mind is that if you're going to send data all around the world, there will be high latency hops like from here to China will be more than a few dozen milliseconds. And there are also block cascades from peer to peer to peer.

Obviously, in the original bitcoin protocol, every time you relayed a block you would first validate it. In v0.14 and v0.15 there have been speedups for validation at the tip through better caching. These more advanced propagation techniques can work without fully validating the blocks, though.

Original bitcoin block relay protocol

This is a visualization of the original bitcoin block relay protocol. Time goes down. A block comes in, there's a chunk of validation, the node sends an INV message saying "hey, I have this block", the other node responds with "please give me this block", and then the node sends a megabyte of data. This is really simple, and it has some efficiency in that you don't send a 1 MB block to the same node multiple times. But it's inefficient: a node typically already knows 99.953% of the transactions in a block before the block arrives. As I mentioned before, in addition to the roundtrip in this INV/getdata/block sequence, the underlying TCP protocol adds additional roundtrips in many cases. And you have to wait for validation.
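As a rough illustration of why this path is slow, here is a toy per-hop latency model; the 750 kbit/s figure is the aggregate effective rate mentioned earlier, while the RTT, validation time, and hop count are made-up placeholders:

```python
# Toy latency model for the legacy INV -> getdata -> block path, per hop.
# The 750 kbit/s effective rate is the aggregate figure mentioned earlier;
# the RTT, validation time, and hop count are made-up placeholders.
def legacy_hop_seconds(block_bytes=1_000_000, rtt=0.150, validate=0.5,
                       effective_bits_per_sec=750_000):
    announce = 1.5 * rtt                                  # INV, getdata, first byte of block
    transfer = block_bytes * 8 / effective_bits_per_sec   # serialization at the effective rate
    return validate + announce + transfer

# Blocks cross several hops before reaching distant hashpower, and the
# per-hop delays add up:
per_hop = legacy_hop_seconds()
print(f"per hop: {per_hop:.1f} s, across 4 hops: {4 * per_hop:.1f} s")
```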

Bloom filtered block relay

One of the earliest things that people worked on: Pieter and Matt observed that bip37 filtered block messages could be used to transmit a block while eliminating the transactions the peer had already been sent. In 2013, testing suggested that this actually slowed down transmission due to various overheads; blocks were much smaller then, and it was a different network than today. The filtering also assumed you would remember all the transactions you had ever been sent, even those you had discarded, which is obviously impossible. But this was the start of an idea: we got it implemented, tried it out, and it inspired more work and more refined ideas.

Protocols vs networks and the fast block relay protocol

Around this time, we started to see high orphan rates in the network that contributed to pool consolidation. There was a bit of a panic here among us, we were worried that the bitcoin network would catch fire and die. And when we saw ghash.io creeping up to majority hashrate on a single pool, that was a big issue.

https://www.youtube.com/watch?v=EHIuuKCm53o&t=16m50s

Matt's idea was to work on a fast block relay network where he would spin up a bunch of really fast, well-maintained servers.

He would encourage miners to connect to these nodes. The theory was that relay across the network was slow because miners, for commercial reasons, are hesitant to connect directly with each other: if you know your competitor's IP address, you can DoS attack them or be attacked. So he set it up and it seemed to work. But having nodes alone would not be enough to solve the problem, so he developed some custom protocols. This has resulted in a bit of confusion: sometimes "the relay network" refers to relay protocols like compact blocks and FIBRE, and sometimes to Matt Corallo's privately operated network, because his network became a testbed for protocol changes. So I just want to emphasize that I'm not really talking about Matt's relay network here; I want to talk about relay protocols, such as can be deployed on the bitcoin p2p network or already have been, and not necessarily Matt's private network, which sometimes uses the same protocols.

So the first widespread production use of faster block relay protocols is what I call the fast block relay protocol (2013-2015). The idea behind it is basically that you have some hub node, Alice, and she streams every transaction she sees to you, Bob, and by protocol contract both ends remember the last 10,000 transactions that Alice has streamed. When a block shows up, Alice can send short 2-byte indices to say which of those transactions are in the block. You don't have to send a megabyte of data; you send only two bytes per transaction. If there's a transaction that Alice has never sent, she sends a zero and then the explicit transaction. One of the great things about this is that Alice doesn't have to ask permission to send this data: it doesn't hurt you if she just streams it out to you anyway. This doesn't require a round trip; the block is transferred in half the roundtrip time. The downside is that Alice has to send you every transaction on the network, redundantly, and this protocol has quite a lot of overhead if you have many Alices. Some miners ran their own copies of this as well. Because of the high overhead, it's really only suited to a hub-and-spoke topology.
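A toy sketch of that index scheme: both ends keep the last 10,000 transactions Alice streamed, and a block is encoded as 2-byte indices into that shared window, with zero as an escape for a transaction that was never streamed. The encoding details here are made up for illustration; they are not the actual wire format.

```python
import struct

WINDOW = 10_000  # both sides remember the last 10k streamed transactions

def encode_block(block_txs, streamed):
    """streamed: recently streamed txs (most recent last), shared by both ends.
    Each block tx becomes a 2-byte 1-based index into that window, or the
    escape value 0 followed by the explicit transaction bytes."""
    index = {tx: i + 1 for i, tx in enumerate(streamed[-WINDOW:])}
    out = bytearray()
    for tx in block_txs:
        i = index.get(tx, 0)
        out += struct.pack(">H", i)
        if i == 0:                                   # never streamed: send explicitly
            out += struct.pack(">I", len(tx)) + tx
    return bytes(out)

def decode_block(data, streamed):
    window = streamed[-WINDOW:]
    txs, pos = [], 0
    while pos < len(data):
        (i,) = struct.unpack_from(">H", data, pos); pos += 2
        if i == 0:
            (n,) = struct.unpack_from(">I", data, pos); pos += 4
            txs.append(data[pos:pos + n]); pos += n
        else:
            txs.append(window[i - 1])
    return txs

streamed = [bytes([k]) * 50 for k in range(200)]     # pretend these were streamed earlier
block = streamed[3:10] + [b"novel-tx-bytes"]         # mostly known txs plus one novel one
assert decode_block(encode_block(block, streamed), streamed) == block
```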

Block network coding (2014)

I created an omnibus protocol design where I took every good idea I could find and wrote a page I called block network coding (2014), which I dare you to read. I came up with some ideas that I could argue achieved communication-optimal block propagation, in the sense that it can't be made more efficient than that. When we tried to implement some of these ideas, they turned out to be very CPU inefficient. It doesn't matter how small you make the block if you're going to spend 4 CPU seconds decoding the thing once it shows up. So there are many ideas all mixed together and combined in that page, and almost all of them have since made it into production systems, but it required a lot of engineering to make them fast and to find the ones that paid for their cost in both implementation and CPU.

bip152 compact blocks

The next big advancement in block propagation started from ideas in 2013 and would eventually get written up and become bip152 compact blocks. The primary goal of bip152 was to eliminate the overhead from block relay and get rid of the double transaction transmission. A lot of the developers basically prioritized doing bip152 to negate some of the potential negative effects of the segwit capacity increase, by making it more efficient to transmit blocks in advance of segwit's activation. So as I said, compact blocks is really designed to lower bandwidth usage, not latency, but it lowers latency too as a side effect. There's a design draft document that is linked to, slides, etc. In December 2015, I wrote a long high-level design document; Matt made it better, and it turned into bip152 in April. Today, compact blocks are used by more than 97% of the nodes on the network, and it's what carries blocks on the bitcoin network.

Compact blocks has two modes: non-high-bandwidth and high-bandwidth. So what's a compact block? The idea is that we want to exploit what the far end, the remote party, already knows, even though we don't know exactly what that is. So a compact block sends the block header, the transaction count, an 8-byte nonce which I'll talk about in a minute, and then a sequence of short IDs, 6 bytes each, one for every transaction in the block. A short id is effectively a hash of the txid and a salt. It can also send some transactions explicitly, like the coinbase transaction, which you know the far end can't have yet, so you just send it outright. When you receive one of these compact blocks, you scan your mempool and other sources of transactions you might have lying around and see if you get any matches. This uses siphash as the main hash function, because it's very fast, and it needs to be fast because you have to scan through your whole mempool with that salt looking for matches. This scheme only needs to send 6 bytes per transaction in a block, which is less efficient than the previous protocol's 2 bytes, but it doesn't require a hub-and-spoke topology; it's about 3x less efficient, but still way more efficient than sending full transactions. As a result there is some complexity, because the short 48-bit ids could at least theoretically have collisions, and the protocol has to figure out, when something's missing, how to fill it in.

I mentioned that the short ids are salted, and the reason is pretty straightforward. Even if the short ids were 8 bytes long, so even less efficient, without a salt it would be easy for a troublemaker to efficiently produce two transactions with the same short id. You could then spam the network with these and jam up block propagation, and a miner could in fact do this to create a profit, potentially. So in the compact block scheme the short id is salted using the header of the block the transaction is found in, and when an attacker announces their transactions to the network, they can't have known that salt, because the block hasn't been found yet, by definition. There is also an additional salt, the nonce, provided by each party that generates a compact block. The idea of that additional salt is that if, just by dumb luck, some pair of transactions in the block or in the mempool happens to collide under one peer's salt, the resulting delay only affects that particular link, say between Alice and Bob; because different links use different salts, the block just routes around it through the fastest path.

High-bandwidth mode starts off with the receiving node asking a peer to use high-bandwidth mode. Then at some arbitrary time in the future a block shows up and the compact block is sent immediately. In the diagram, the validation box has moved down: validation happens at the same time you're sending out the compact block. The reason this is okay is that the protocol is defined to work this way, and high-bandwidth mode is only used between peers that will validate for themselves, so you won't be handing unvalidated blocks to SPV clients. Most of the time, some 85% of the time, that's where the protocol stops: the block shows up, the high-bandwidth compact block peer sends it to you, and you're done.
But if there was a hash collision, or you didn't have a transaction, you have to disambiguate: you ask with a getblocktxn message, and the remote side responds with the missing transactions you asked for. The low-bandwidth mode works similarly, but you validate the block before announcing it. Just like the original block transmission, you say "I have this block" and they respond with "please send me the compact block". That adds an extra roundtrip, but it's more bandwidth efficient. In high-bandwidth mode, if you ask multiple peers for high-bandwidth service, you might get multiple copies of the compact block, which on average is maybe 13 kilobytes. That wastes a little bandwidth, but it's much faster.
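Here is a rough sketch of the salted short-id scheme, following bip152 from memory: the header plus the per-block nonce are hashed to produce siphash keys, and the short id is the low 6 bytes of SipHash-2-4 over the txid. The `siphash_2_4` helper below is only a stand-in so the sketch runs (Python's standard library doesn't expose SipHash); treat the details as illustrative rather than normative.

```python
import hashlib
import struct

def siphash_2_4(k0: int, k1: int, data: bytes) -> int:
    # Placeholder: stands in for a real SipHash-2-4 implementation (which is
    # what bip152 uses; use a proper library in practice). This is NOT SipHash,
    # it just keys a sha256 so the sketch runs end to end.
    return int.from_bytes(
        hashlib.sha256(struct.pack("<QQ", k0, k1) + data).digest()[:8], "little")

def short_id_keys(header80: bytes, nonce: int):
    # The per-compact-block salt: hash the block header together with the
    # sender's nonce and take two 64-bit words as the siphash keys.
    h = hashlib.sha256(header80 + struct.pack("<Q", nonce)).digest()
    return struct.unpack_from("<QQ", h)

def short_id(k0: int, k1: int, txid: bytes) -> int:
    # 6-byte short id: keyed hash of the txid, keeping the low 48 bits.
    return siphash_2_4(k0, k1, txid) & 0xFFFFFFFFFFFF

def match_mempool(short_ids, mempool_txids, k0, k1):
    # Receiver side: scan everything we have lying around for matches.
    lookup = {}
    for txid in mempool_txids:
        lookup.setdefault(short_id(k0, k1, txid), []).append(txid)
    matched, missing = {}, []
    for pos, sid in enumerate(short_ids):
        candidates = lookup.get(sid, [])
        if len(candidates) == 1:
            matched[pos] = candidates[0]
        else:
            missing.append(pos)   # absent, or a rare collision: request via getblocktxn
    return matched, missing
```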

Bitcoin Core asks your 3 most recent block-relaying peers to be your high-bandwidth compact block peers. We found that, the vast, vast majority of the time, the first peer to relay a new block to you is one of the last three peers that did so before, in terms of the network topology. This redundancy also has other positive effects. In the original bitcoin protocol you can be DoS attacked by a peer that says "I have a block"; you ask for it, they never send it, and the timeouts are long. The redundant transmission in high-bandwidth mode means you bypass those delays and timeouts.

Summarizing compact blocks and its two modes: the high-bandwidth mode takes a half roundtrip time (just sending in one direction), up to one roundtrip if you're missing transactions. The non-high-bandwidth mode takes 1.5 to 2.5 roundtrip times. Whether you're in high-bandwidth mode or not, and whether you take one roundtrip or 2.5, in all cases compact blocks saves about 99% of the bandwidth of block relay, which is on the order of 12% of a node's total bandwidth. Further protocols can't really meaningfully improve a node's overall bandwidth: if you halve what's left, you're only halving the remaining 1% of that 12%. But latency could be improved. Compact blocks in the worst case can take 2.5 roundtrip times, and for a long haul link around the world that's really long. The fast block relay protocol showed that it could be better.

What about xthin?

There's another protocol called xthin, which I'll comment on briefly because otherwise people are going to ask me about it. It was a parallel development: basically Matt and Pieter did this bloom-filtered block stuff in 2013, Mike Hearn resurrected it and made a patch for Bitcoin-XT which didn't work particularly well, and the BU people picked up Mike Hearn's work, apparently unaware of all the other development that had been in progress on this; developers don't always communicate well... It's a very similar protocol to compact blocks. Some BU advocates have irritated me pretty extensively by arguing that compact blocks was copied from xthin. Who cares, and it wasn't; they were just developed in parallel. It does have some major differences. One is that it uses 8-byte, 64-bit short ids, and it doesn't salt them, which means it has a vulnerability where you can easily construct a collision. It turns out that a bunch of the BU developers and their advocates didn't know about the birthday paradox and how easy it makes creating collisions. They argued it wasn't feasible to construct a 64-bit collision, and I had a bunch of fun on reddit for two days responding to every post with collisions generated bespoke from their messages, because they said it would take hours to generate them; I was generating them in about 3 seconds or something.

The other thing xthin does differently from compact blocks is that it can't be used in the high-bandwidth mode where you send an unsolicited block. With xthin, there's always an INV message, and the response to the INV message sends a bloom filter of the receiving node's mempool, basically saying "here's an approximate list of the transactions I know about". As a result of that bloom filter, which is typically on the order of 20 kB, most of the time there's no need for an extra roundtrip to fetch missing transactions, because the sender knows which transactions the receiver is missing and sends them along. So the way I look at this optimization is that it costs a constant one roundtrip time (because you can't use high-bandwidth mode), plus bandwidth, plus CPU, plus attack surface, in order to save a roundtrip less than 15% of the time, because high-bandwidth mode has no extra roundtrip about 85% of the time. I don't think that's useless; it would be useful for the non-high-bandwidth use case, but it's a lot more code and attack surface for a relatively small improvement. What's particularly interesting about xthin is that, because of political drama, it was rushed into production so they could claim they had it first, and it resulted in at least three exploited crash bugs that knocked out every Bitcoin Unlimited node on the network, and every Bitcoin Classic fork node (not Bitcoin Core nodes). And when they fixed some of those bugs later, they introduced a bug that could cause nodes to get split from the network: in particular, if you produced a short id collision, the node would get stuck. A combination of two vulnerabilities; there's something to learn from that.

What about "Xpedied"?

The BU folks also have this thing called "Xpedited", which was basically a response to bip152, because their forwarding without a high-bandwidth mode is much slower. It's a manually configured mode that works almost identically to bip152 high-bandwidth mode: it doesn't send a bloom filter, and it sends without asking. But since it requires manual configuration, I don't think anyone really ever used it, and I'm confident it's not used today, because their software doesn't support segwit, only miners would use this, and miners need software that supports segwit. It doesn't have a spec either. I think it uses the same vulnerable short id scheme, but I'm not actually sure without having read the code. But it's a thing, so now you know.

Better latency than bip152?

I mentioned above that I don't think you can do much better on bandwidth, but you can certainly do better on latency, and we care a lot about the worst case on latency. If miners can gain from not cooperating, by filling a block full of novel transactions that you have never seen before, then we should expect that sooner or later someone is going to do it. In the worst case, bip152 behaves just like the original protocol: it will send a megabyte of data, ultimately, but with one more roundtrip, so in the worst case it's potentially worse. On fast low-latency links, bip152's extra roundtrip is irrelevant, so you could say that bip152 is all you need there. But the 2.5 roundtrip worst case is really awful internationally, especially because long haul international links typically have on the order of 1% packet loss, which makes TCP stall.

Erasure coding

https://www.youtube.com/watch?v=EHIuuKCm53o&t=35m20s

I want to take a quick detour into some fundamentals. I think most people are at least vaguely familiar with the concept of an erasure code. This is what RAID does with your disks. Basically you can take n blocks of data and encode them into n+k blocks such that any n out of the n+k are sufficient to recover the whole thing. This is what RAID6 does, for example: you've got two extra drives, and as long as no more than two fail, you can recover all your data. It might seem a little magical, but if you think about the k=1 case, you can just use XOR and work out on paper how it works: you compute your redundant block as the XOR of all the other blocks. This can be done for any n+k. So you could split a bitcoin block into a thousand pieces such that any 500 fragments are enough to recover the whole block. But doing that efficiently in software is not easy. If you use a Reed-Solomon code, which is the canonical, optimally space-efficient tool for this, it's very slow when n and k are large. There are other techniques we could use which aren't perfectly efficient, they sometimes require an extra block or two, but they are very fast. So this is an erasure code; keep this construct in mind.
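Here is the k = 1 XOR case worked out as a toy sketch (equal-sized blocks for simplicity):

```python
# The k = 1 case of an erasure code, using XOR as described above:
# n data blocks plus one parity block; any n of the n+1 survive -> recover all.
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(blocks):
    parity = blocks[0]
    for blk in blocks[1:]:
        parity = xor_bytes(parity, blk)
    return blocks + [parity]              # n + 1 blocks on the wire / on disk

def recover(received):
    """received: n+1 slots, at most one of which is None (lost)."""
    if None not in received:
        return received[:-1]
    lost = received.index(None)
    acc = None
    for i, blk in enumerate(received):
        if i != lost:
            acc = blk if acc is None else xor_bytes(acc, blk)
    received[lost] = acc                  # the XOR of the survivors is the lost block
    return received[:-1]

data = [b"aaaa", b"bbbb", b"cccc", b"dddd"]
coded = encode(data)
coded[2] = None                           # lose any single block
assert recover(coded) == data
```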

Network coding

Another concept, which various highly academic groups in the IETF like to talk about, is network coding: an application of linear codes and erasure codes to multicast and broadcast. The idea is that if you want to send a piece of data to a bunch of different nodes, you can split it into parts, apply erasure coding, and spread the pieces over the network so that you never send the same data twice; you make optimal use of your links, all the peers in the network collaborate to recombine the data, and as soon as they have enough of it they are done. You can think of it like bittorrent swarming, but it can be more communication-efficient because you don't need to locate a correct source: everyone is a correct source. For actual usage in bittorrent, computational efficiency is more important. For block relay, though, we don't really want a lot of back-and-forth two-way communication, so we need to not have to know where our sources are.

Bitcoin FIBRE network

So this brings us to bitcoin FIBRE (fast internet block relay engine), a protocol heavily derived from the block network coding writeup I did earlier. The idea of FIBRE is basically this: you start with bip152 compact blocks high-bandwidth mode, but you modify the short ids to also include the transaction sizes. Then you transmit over UDP instead of TCP, and you apply erasure coding to the block. You take the 1 MB block and expand it into 8 MB of redundant data, and you don't send the original block data at all; you only send the redundant data, and you stream it out to different peers, and all of those peers stream it out to each other. So you send each peer a chunk, and the peers all send each other the chunks; every chunk is useful because no chunk is ever sent twice. Then, when the receiver gets this data, they use the size-augmented short ids in the compact block to lay the block out in memory, with holes for the transactions they didn't know about, and they use the erasure code to fill in the holes and complete the block. Effectively this lets you communicate a block to a remote end when you have no idea what they know, but if they do know something, they can make good use of it. You send it out and you don't need a response; they are guaranteed to get the block so long as they receive at least as many packets as they were missing. You never need to hear back from them, you never take a roundtrip delay, just a one-way delay.
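A very rough sketch of that receive path: lay the block out using the size-augmented short ids, fill what you can from your own mempool, and count how many chunk-sized pieces are still unknown; once at least that many repair chunks have arrived, hand everything to an erasure decoder. The message fields here are hypothetical placeholders, not FIBRE's actual structures.

```python
def layout_block(announced, mempool, chunk_size=1152):
    """announced: list of (short_id, size) pairs in block order (hypothetical fields).
    mempool: dict mapping short_id -> raw transaction bytes we already have.
    Returns the partially filled block and the set of chunk indices still unknown."""
    block, holes = bytearray(), []
    for sid, size in announced:
        start = len(block)
        tx = mempool.get(sid)
        block += tx if tx is not None else bytes(size)   # zero-filled hole
        if tx is None:
            holes.append((start, start + size))
    missing = {off // chunk_size
               for lo, hi in holes
               for off in range(lo - lo % chunk_size, hi, chunk_size)}
    return bytes(block), missing

def try_reconstruct(announced, mempool, repair_chunks, chunk_size=1152):
    # Decoding becomes possible once the received repair chunks at least match
    # the number of unknown chunks; until then, just keep listening. No round
    # trip is ever needed. The actual erasure decode is left to a real library.
    block, missing = layout_block(announced, mempool, chunk_size)
    ready = len(repair_chunks) >= len(missing)
    return block, missing, ready

announced = [(0xAAAA, 250), (0xBBBB, 300), (0xCCCC, 180)]    # made-up ids and sizes
mempool = {0xAAAA: b"\x01" * 250, 0xCCCC: b"\x03" * 180}     # we're missing 0xBBBB
block, missing, ready = try_reconstruct(announced, mempool, repair_chunks={}, chunk_size=256)
print(len(block), missing, ready)   # 730 bytes laid out, chunks {0, 1, 2} unknown, not ready yet
```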

So FIBRE lets you build a single globally-distributed node with very, very fast relay of blocks internally. What this actually achieves in practice, on Matt's relay network which uses it, is a 95th percentile latency of under 16 ms over the speed-of-light delay. The speed-of-light delay is how long it takes a single byte to get from one end of the world to the other. So 95% of the time the system is able to transmit the entire block within 16 ms of that, which is pretty good, though it could be better.

Q: .... pacific ocean ...

A: Yes, oceans are big.

So this scheme has a lot of advantages, right? It's unconditional, unidirectional transport: you never need a roundtrip, unless a node is just offline and didn't get any of the packets. But it needs to be used only within a single administrative domain, because part of the reason it is so fast is that nodes relay packets between each other without being able to validate them (they can't reconstruct the block yet); they just forward packets as soon as they come in. That would be a DoS vector if it were open to the public. It also has absurd bandwidth overheads: the production FIBRE network sends 8x the total amount of block data. But unlike the overheads in TCP-based protocols and the original bitcoin p2p relay, these overheads don't come at the expense of additional delay; they lower delay. You don't have to receive all 8 copies, you just need to receive the equivalent of one copy, and most of that comes out of your local mempool anyway.

The way Matt uses this protocol in his own network is that he has a bunch of nodes on well-maintained systems, they speak FIBRE amongst each other, and then they speak regular bip152 compact blocks high-bandwidth mode to anyone who is connected to them. Because these bip152 links are short, with everyone connecting to their closest FIBRE node, the fact that bip152 can take extra roundtrips doesn't matter much in the end.

What can we do better?

Of course, there's the drive to do better. Like I said, we want to overkill this. There are lots of tools in the toolbelt that could be used to make some of this stuff better. One of those tools is set reconciliation. This is based on technology very similar to the erasure coding stuff. The idea is that I have a set, say a list of transactions, and you have one too; we don't know what each other's sets are, but we assume they are really similar, with maybe some minor differences. It turns out there are protocols where someone can send a single message to another party with a similar set, without knowing how similar, and the message only has to be the size of the symmetric set difference: the number of entries I send is the number of things I have that you don't, plus the number of things you have that I don't. There are protocols that allow you, with no more data than that, to recover the other party's set: I send you one message, and now you know my set. There are algorithms for this that are communication-optimal, meaning there is no way to do it more efficiently in terms of communication. They are based on interpolating the roots of ratios of polynomials. They are similar to RS codes, but more expensive to decode: they take CPU time cubic in the set difference, and all of the operations are expensive. There are some newer techniques such as IBLT (invertible bloom lookup table) which are fast, linear in the size of the set difference, but they have bad constant factors in communication, so they send more data. If your set differences are small and the entries are small, then with IBLT you may have to send something like 5x the set difference to get a very reliable code.
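Since implementing an IBLT decoder is recommended later in this talk as a fun exercise, here is a toy sketch of one; the table size, hash construction, and checksum are simplified, and 64-bit integers stand in for txids.

```python
import hashlib

def _h(data: bytes, salt: int) -> int:
    return int.from_bytes(hashlib.sha256(bytes([salt]) + data).digest()[:8], "little")

class IBLT:
    """Toy invertible bloom lookup table over 64-bit keys (stand-ins for txids)."""
    def __init__(self, cells=64, hashes=3):
        self.m, self.k = cells, hashes
        self.count = [0] * cells
        self.key_xor = [0] * cells
        self.chk_xor = [0] * cells

    def _cells(self, key: int):
        kb = key.to_bytes(8, "little")
        width = self.m // self.k                       # one partition per hash function
        return [i * width + _h(kb, i) % width for i in range(self.k)]

    def _chk(self, key: int) -> int:
        return _h(key.to_bytes(8, "little"), 255)      # per-key checksum

    def _add(self, key: int, sign: int):
        for c in self._cells(key):
            self.count[c] += sign
            self.key_xor[c] ^= key
            self.chk_xor[c] ^= self._chk(key)

    def insert(self, key: int):
        self._add(key, +1)

    def subtract(self, other: "IBLT"):
        for i in range(self.m):
            self.count[i] -= other.count[i]
            self.key_xor[i] ^= other.key_xor[i]
            self.chk_xor[i] ^= other.chk_xor[i]

    def peel(self):
        """Decode a subtracted table into (only_in_self, only_in_other), or None."""
        mine, theirs = set(), set()
        progress = True
        while progress:
            progress = False
            for i in range(self.m):
                if self.count[i] in (1, -1) and self.chk_xor[i] == self._chk(self.key_xor[i]):
                    key, sign = self.key_xor[i], self.count[i]
                    (mine if sign > 0 else theirs).add(key)
                    self._add(key, -sign)              # remove the peeled key everywhere
                    progress = True
        if any(self.count) or any(self.key_xor):
            return None                                # table too small for the difference
        return mine, theirs

# Alice sends a small IBLT of her txid set; Bob subtracts his own and peels.
alice, bob = IBLT(), IBLT()
a_set = {1000 + i for i in range(500)} | {1, 2, 3}
b_set = {1000 + i for i in range(500)} | {7, 8}
for x in a_set:
    alice.insert(x)
for x in b_set:
    bob.insert(x)
alice.subtract(bob)
print(alice.peel())   # ({1, 2, 3}, {7, 8}) with high probability
```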

Applying set reconciliation?

So how could you use set reconciliation technology? Back in the early efficient block transfer design doc that I wrote, I described how you could use set reconciliation to send the short ids in a compact block with much less bandwidth. But the big challenge is that you still have to communicate the transaction order, because the order is normative in a block and miners can use whatever order they want, subject to the transaction dependency graph. And also, ... unaware that I had previously suggested it, others went and implemented this using IBLT. Their protocol assumes you have changed bitcoin's consensus rules so that transactions must be listed in a particular order. But right now in bitcoin, miners order transactions by how profitable they are to the miner, which is a function the receiver knows; and because the ordering runs from most income to least income, my pool software can fit transactions into blocks by dropping the last couple of transactions. That would be broken if you had a required order. Using set reconciliation this way also ignores latency in general; in particular, the graphene work only talks about bandwidth, never about latency. And since it's ignoring latency, it probably would have been better to use polynomial set reconciliation instead of IBLT. But in any case: on one hand I'm really excited about set reconciliation because it's cool rocket science; on the other hand, the IBLT algorithm in particular is a great programming project, basically any programmer could implement it and it's magic, and I would actually recommend implementing IBLT decoders if you want to do something fun, but this technology doesn't really seem like it would lower latency in any normal application of block propagation. Any further bandwidth improvements are shaving off fractions of a percent: the graphene stuff, for example ((talk)), is going to save only a couple hundred kilobytes per day on a node. It's not that exciting. If the set reconciliation fails, you pay a roundtrip; there's no way to incrementally increase the size of an IBLT; there are negatives there. The bandwidth of relaying blocks is 99% solved, and using set reconciliation there is trying to improve something that's already mostly solved. But I did say only 99% solved, so there are areas for improvement.

So can we do better?

There are places where bandwidth improvements would still be useful. 20 kB doesn't take long to send over normal links, but there are cases where getting this down will improve latency. And there are some things we could do which are not necessarily useful for block propagation but would improve bandwidth overall.

When is 20 kB going to impact latency? Well, one important use case is satellites. The blockstream satellites are broadcasting blocks right now. This is to improve the partitioning resistance of bitcoin: you can receive blocks over the radio and then use your local bandwidth only for sending out transactions. The link speed on the satellite is only 80 kbit/sec, which is enough to stay in sync with the bitcoin network. We use software-defined radio, so we can increase or decrease bandwidth with some tradeoffs. A 20 kB compact block means it takes a minimum of 2 seconds to send a block across that link, which isn't attractive for mining; you could mine over it, but it's not particularly good. Another area that could be improved here is transaction compression. The early fast block relay protocol used LZMA compression, and it turned out to be worthless, because compressors don't do much when you use them on small amounts of data, and the faster compressors do even less. You can get good compression if you compress one or more blocks at a time, but that's not useful for transaction relay, where transactions are sent one at a time. It's also possible to construct alternative serializations for bitcoin transactions: you can convert from a normal transaction to the alternative serialization losslessly, and you could apply it to the whole of bitcoin's history. Pieter and I figured one out that can reduce the size of the entire history of bitcoin to 28%. It's CPU intensive, and it only works on a per-transaction basis, so there are even more optimizations that could be done. If this were used, it could make block storage smaller, transactions smaller, and block propagation smaller, but the real effect for users would probably be on transaction relay.

Set reconciliation for transaction relay

Another idea that would be really powerful is set reconciliation for transaction relay. It's not very useful for block relay, but the transactions a node knows about are naturally a set. In 2013, I posted on bitcointalk about a scheme where, instead of a node offering each transaction to all of its peers, you would offer it to one peer, and sometimes to an additional peer based on a coinflip. If you did this alone, a transaction wouldn't get very far in the network. But if nodes also periodically ran set reconciliation with each other, then any time a transaction got stuck in one part of the network, it would be able to break out. This would make transaction relay really efficient, because it's low bandwidth.

Template delta compression

Another thing I have been working on is template-delta compression. The big cost in set reconciliation for blocks is sending the order of the transactions. What if you sent a block template in advance of getting a block? You send the template over to someone, and then instead of sending the block when it arrives, you send the difference, including the permutation difference. If I have an efficient way of encoding the difference between the block I was expecting and the block I actually got, then maybe you can propagate blocks pretty quickly. With bip152 we made a stateless protocol that works with many peers; but what if we introduce state again so we can send differences? I designed a fast compressor for typical permutations, using the AV1 range coder, which I worked on in a previous life. What I found with the scheme I have so far is that, looking at the difference between a 30-second-old template and the block that actually shows up, the median size is 2916 bytes, the mean is 3041 bytes, and 22% of blocks fit in less than one IP packet. Saving bandwidth isn't necessarily important in many cases, but for the blockstream satellites this can bring block transmission down from whole seconds to a few hundred milliseconds.
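A structural sketch of the template-delta idea (not the range-coded encoder described above): the block is expressed as "take the next expected transaction", "skip an expected transaction", or "here is a transaction the template didn't predict". A real encoder would entropy-code these ops; all names here are made up.

```python
def encode_delta(template, block):
    """Encode `block` (txids in block order) against `template` (expected order).
    Ops: ("take",)      copy the next remaining template entry
         ("skip",)      drop the next template entry
         ("tx", txid)   a transaction the template didn't predict (or badly out of order)"""
    pos = {txid: i for i, txid in enumerate(template)}
    ops, t = [], 0
    for txid in block:
        i = pos.get(txid)
        if i is not None and i >= t:
            ops.extend([("skip",)] * (i - t))   # template entries the block passed over
            ops.append(("take",))
            t = i + 1
        else:
            ops.append(("tx", txid))
    return ops

def decode_delta(template, ops):
    block, t = [], 0
    for op in ops:
        if op[0] == "take":
            block.append(template[t]); t += 1
        elif op[0] == "skip":
            t += 1
        else:
            block.append(op[1])
    return block

template = ["a", "b", "c", "d", "e", "f"]       # what I expected 30 seconds ago
block    = ["a", "c", "d", "x", "f"]            # what actually showed up
assert decode_delta(template, encode_delta(template, block)) == block
```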

Template deltas for pre-consensus and weak block schemes

The other reason I think this delta stuff is interesting, beyond the satellite, is that there's a set of proposals out there that I call pre-consensus. This whole block propagation problem exists because mining is a race to announce blocks. What if miners could get together in advance and decide what to include in the next block? Then when a block is found, all you have to send is the difference from what the pre-consensus said it would contain, plus things like the payout amount. There's bitcoin-ng, but it's not very bitcoiny: it introduces miner identities, and you can't slot it into bitcoin without very deep economic and technical changes. But you could do pre-consensus in a bitcoin-compatible way, like with weak blocks, where miners use a p2pool-like sharechain that pre-commits to the transactions they intend to include in the future; it comes more frequently than blocks, and when a block shows up it just references one of the weak blocks. This is good for template deltas: the difference would be negligible.

Known improvements possible....

There are other known improvements that are possible that people have been talking about, like the FIBRE swarming block transfer, which today only works within a single administrative domain; it could be swarmed over the whole bitcoin network by making blocks commit to the chunks that get sent out. But the error correction code would have to be part of the bitcoin consensus rules, and the validation would require CPU time. I don't know how reasonable it would be to do that, and there are some downsides as well: if you want to prove that a particular packet is one of the ones the miner authorized, the SPV proof or hash tree for that would be quite large.

Consensus could require a particular transaction order that obeyed dependencies, which would make set reconciliation more efficient, but it would break truncation for miners. When a miner prioritizes transactions, those transactions end up at the top of the block. Bitcoin nodes could re-sort transactions into a more predictable order, but it's unclear if this is worth implementing given its complexity. I think one of the big surprises is that there are a lot of ideas that sound awesome, but once you implement them they make things worse. We benefited a lot from Matt's relay network because we were able to take these ideas and check whether they work. It was only his nodes, so if an idea was bad we could change it, and we weren't rushing out tech to the bitcoin network that could be used, say, to crash every node on the network. Some of the ideas turned out to be slow or whatever, but they could be fixed or scaled back.

Q&A

https://www.youtube.com/watch?v=EHIuuKCm53o&t=57m24s

Please wait for the mic if you have a question.

Q: On your favorite topic of how to make compact blocks even smaller or something like that; so, the thing where you're getting these like 2 byte indexes of transactions, would it be more efficient to send a compactified bitfield?

A: Yes.

Q: And then the permutation. And you want to get things down to fitting into a single packet. Isn't that just 2kb for a permutation?

A: If the permutation is uniform, yes. But we have a strong prior for what the order should look like, which is why I am able to get to that size with my difference scheme. The ordering is basically fee rate and ancestor fee rate. My encoding scheme is efficient for "here's some stuff out of order and the rest is in order". The 2-byte thing could be made more efficient, but the tradeoff is CPU time: reading 2 bytes and indexing is really fast. You could use an arithmetic coder and do the things my template difference encoder does, but it's quite a bit slower.

Q: As far as reducing bandwidth used by relaying transactions, a pretty good network topology for doing that is full nodes become part of some club and each transaction becomes part of a club. So you stream it out to the club. And then you have one peer in that club...

A: There are many papers on efficient gossip networks. One thing that work doesn't address is byzantine fault tolerant gossip. The efficient gossip networks tend to be ones where a single malicious peer can black out all your transactions. In bitcoin you can have nodes that are completely malicious, and it won't really slow down transaction propagation much. So far academia hasn't been super useful for concrete schemes. I think the low-fanout plus set reconciliation scheme is byzantine robust, but I don't have a strong proof of this yet.

Q: What's the current propagation delay across the network and what's a good target for that?

A: We have lost our ability to measure it, because the orphan rate is really low at this point. It's hard to say. There are some stats people get from polling random peers; Christian Decker has a chart. It's gotten much better, which is nice to see, but that doesn't look at miners, just peers. The best way we have to look at block propagation times from miners is how often blocks get orphaned. We overshot on some of this work: we improved some of these things but the orphan rate wasn't going down. Then miners upgraded to segwit-capable software, and the orphan rate went down; we went something like 5000 blocks without seeing a single orphan, which would otherwise be very unlikely. I think the numbers are pretty good on the network now. But the challenge is: will propagation time still be adequate once we start producing blocks with more data in them?

sipa: The fact that we are seeing low orphan rates on the network means either our technology is awesome and we have solved the propagation delay..... or all miners are working together and it's entirely centralized and we've failed. So it's hard to tell.

A: Before all of this optimization, there were only like 2 or 3 orphans per day, so it wasn't a very precise indicator in the first place, but it was one of the best we had, because it directly measures what we're trying to solve.

Q: Is block propagation time bad on its own, or is just because of the effect on orphan rate?

A: It's a little bad on its own, independent of the orphan rate, because it means you take longer to see what the status of the network is. But suppose you had a scheme where there are no orphans. There are radical rethinkings of the blockchain with no orphans, like schemes based on DAGs, where blocks aren't a straight linked list; there's a forest of blocks, transactions can occur in multiple places, and there's some arbitration technique for dealing with conflicts. But the problem there is low throughput, because you end up sending transaction data redundantly multiple times.

Q: It seems like the fact that there's this arbitrary permutation encoded in a block kind of sucks.

A: It's something I would like to fix, but, like in genetics, it's commonly the case that there are strange, seemingly useless effects that turn out to be vital because other things have developed a dependency on them. In bitcoin, miners really do truncate blocks. If we take that away, would they care? It's hard to get miners to talk about protocol details. You could require some fixed order, subject to the dependency graph, which is easy to enforce inside node software but hard otherwise. If I were doing bitcoin design today, I would probably require such an order.

Q: When you said block propagation got better with segwit, is that because segwit helped it?

A: No. It turns out that there were some large miners running really old software, like Bitcoin Unlimited, which was forked from a really old bitcoin version that didn't have the caching improvements or any of the other improvements. There weren't enough miners running improved software. Then those miners switched to modern Bitcoin Core, and the orphan rate fell through the floor. It was difficult to tell beforehand, because slow propagation doesn't just cause the slow miners to have orphans, it causes everyone to have orphans. So you couldn't really say "ah, Antpool has lots of orphans and Antpool is running BU" or whatever...

Q: So we're in a stable time for network conditions. If we see major network partitioning, like the GFW or Syria or Turkey... is there a way to measure what happens if the internet becomes a lot more fractured?

A: This is one of those things that is important but also hard to measure. It's not like the internet in the developed world fractures every day, so we don't have a lot of data about how fractures or partitions impact the bitcoin network. So we need to overkill partition resistance. Satellites are one way. We could do more network monitoring in the community, but it's not anybody's job to run a bitcoin network operations center, so unless someone like BlueMatt steps up and monitors things like he does, we don't get broad network data and nobody looks at it. There are some things, like twitches we can see in cdecker's block propagation data; if there were a widespread internet outage, we might learn something about bitcoin's network partition resistance.

Q: My question here... you mentioned BlueMatt's network and how it has been useful for implementing and iterating on these things... so what's the difference between what Matt did on his network and regtest or testnet? Why has it provided such huge benefits, and why not replicate that in the normal...?

A: For all of these tools and techniques we did lots of testing, sandboxing, and simulated environments; there are Linux dummynets to simulate packet loss and latency. But the problem with these things is that they only let you deal with known unknowns. You already know you need to measure the situation with packet loss, but what about the actual network out there? And what about user factors? We learned a lot about interacting with miners: what motivates them, what interfaces they prefer, how sustainable manual configuration is. Production usage for things like this is no replacement for a test, but a test is no replacement for production usage. And there was no real risk that Matt's network would break bitcoin; the worst that could happen is that Matt's network would stop working, and that's the worst it could do.

Q: Can there be multiple relay networks?

A: There are in fact multiple relay networks. Some miners run their own relay networks between their data centers, and they peer with themselves and friends. There's a social dynamics question of how well multiple of these can scale, since they all require manual configuration. Matt would love it if someone else would run a well-maintained relay network. He was previously getting donations to fund it, but the effort to collect the donations each month was more than it was worth to him. And it turns out there are bitcoin miners that don't want to pay bitcoin for things; there were people that only wanted to pay with wechat or something. So he pays out of pocket, and it was faster before, when we had better funding for it.

Q: ...

A: We have tested other heuristics. Compact blocks can send transactions opportunistically; it always sends the coinbase transaction because it knows you don't have it. I have tested a heuristic that basically works like "if a transaction was surprising to me, then I should assume it's surprising to my peers too". This is an obvious area that could be investigated. Why do misses occur in compact blocks? One reason is that a transaction arrived just milliseconds after the block; another reason is that there was a double spend, and you have a different version than the one the miner confirmed. We addressed this somewhat in Bitcoin Core: we don't just use the mempool to reconstruct blocks from compact blocks, we also keep an extra pool that includes double spends and replacements for exactly that reason.

Q: There is some research about gigabyte-sized blocks. The issue here-- have you tried this yourself?

A: I haven't tried that on Matt's network, but in private test labs, sure. Being able to run large blocks over this stuff is not a surprise; it's designed to handle it. The problem is the overall scaling impact on the network, not just whether my one node on super fast hardware can keep up. The bloom filter that xthin sends adds a superlinear scaling component that is worse than compact blocks. You can run a gigabyte block, but good luck keeping a node affordably running on that for any length of time...

Q: I was wondering about transaction throughput on the bitcoin network. I have heard it's like 10 or 18 with segwit. In practice it's like 3?

A: The throughput depends on what transactions people are producing. Prior to segwit, people started to do batch sends, meaning that transaction throughput is a poor metric in terms of capacity of the network. The number of transactions can go down, but the number of UTXOs or the people using it goes up. Segwit is only being used by about 10% of the transactions so far. If you're not using segwit and you care about transaction fees, you should use segwit, it's a massive decrease in fees. I haven't looked today-- but in the past couple of days, the fees were around 5 sat/byte... Part of the design of segwit was to avoid creating a system shock by adding lots of capacity and causing a fee market collapse or something. There's an incentive to use segwit that goes up when fees go higher, and we get more capacity when there's more demand.


forward erasure correction codes https://en.bitcoin.it/wiki/User:Gmaxwell/block_network_coding