Date: Wed, 24 Feb 2021 17:18:32 +1000
From: Anthony Towns <aj@erisian.com.au>
To: Jeremy <jlrubin@mit.edu>,
 Bitcoin Protocol Discussion <bitcoin-dev@lists.linuxfoundation.org>
Message-ID: <20210224071832.qhsc2fhw5cf7tybs@erisian.com.au>
References: <20210222101632.j5udrgtj2aj5bw6q@erisian.com.au>
 <7B0D8EE4-19D9-4686-906C-F762F29E74D4@mattcorallo.com>
 <CABm2gDrbKdxMuKdwYh0HBXNUxhqWtq1x2oLC0Ni=dbfP8b_a7Q@mail.gmail.com>
 <CABm2gDp5dRTrPEqPfrjOeeYBn6RMS=HFMbtkJ+MM0SMVnSfK5A@mail.gmail.com>
 <CAD5xwhg0pWJykWUusdoNQd60L9_MgCzzky1dvViLERADMcoysQ@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <CAD5xwhg0pWJykWUusdoNQd60L9_MgCzzky1dvViLERADMcoysQ@mail.gmail.com>
User-Agent: NeoMutt/20170113 (1.7.2)
Subject: Re: [bitcoin-dev] Yesterday's Taproot activation meeting on
 lockinontimeout (LOT)
Precedence: list

On Mon, Feb 22, 2021 at 06:10:34PM -0800, Jeremy via bitcoin-dev wrote:
> Not responding to anyone in particular, but it strikes me that one can think
> about the case where a small minority (let's say H = 20%?) of nodes

I don't think that's a good way to try to look at things -- number of
nodes has some impacts, but they're relatively minor (pun deflected).

I think the things to look at are (from most to least important):

 (1) what the price indicates / what people buying/selling BTC want
 (2) what hashpower does
 (3) what nodes do

Here's a concrete example to help justify that ordering. Suppose
that for whatever reason nobody is particularly interested in running
lockinontimeout=true -- only 0.1% of nodes are doing it and they're not
the "economic majority" in any way. In addition, 15% of hashpower have
spent almost the entire signalling period not bothering to upgrade and
thus haven't been signalling and have been blocking activation.

Suppose further that there are futures/prediction markets setup so that
people can bet on taproot activation (eg the bitfinex chain split tokens,
or some sort of DeFi contracts), and the result is that there's some
decent profits to be made if it does activate, enough to tempt >55%
of hashpower into running with lockinontimeout=true. That way those
miners can be confident it will activate, take up contracts in the
futures/predictions markets, and be confident they'll win and get a
big payday. (Note that this means the people on the other side of those
contracts are betting that taproot *doesn't* activate)

Once a majority of hashpower is running lockinontimeout=true, it then
makes sense for the remaining hashpower to both signal for activation
and also run lockinontimeout=true -- otherwise they risk their blocks
being orphaned if too many blocks don't signal, and they build on top
of one.  Figuring out that a majority of hashpower is/will be running
lockinontimeout=true can be done either by a coinbase message or by
bip91-style signalling.

In that scenario, you end up with >90% of hashpower running with
lockinontimeout=true, even if only a token number of nodes in the wild
are doing the same.


It's possible to do estimates of what happens if a majority of miners
are using lockinontimeout=true, and the numbers end up pretty wild.

With 90% of miners signalling and running lockinontimeout=true, if the
remaining 10% don't signal, they can expect to lose around 3% of revenue
($2M) due to blocks getting orphaned. If the numbers are 85% running
lockinontimeout=true, and 15% not signalling, the non-signallers can
expect to lose about 37% of revenue ($38M) during the retarget period
prior to timeout. If 60% of miners are doing spy-mining for up to 90s,
they would expect to lose 0.9% of their spy-mining revenue ($2.5M). If
60% of hashpower is running lockinontimeout=true, while 40% don't
signal, the non-signallers will forego ~83% of revenue ($320M) due to
their blocks being orphaned, and if 60% of miners spy-mine for 90s, they
should expect to lose 5% of revenue ($10M) over the same period. Dollar
figures based on 6.25BTC/block at $50k per BTC.

https://gist.github.com/ajtowns/fbcf30ed9d0e1708fdc98a876a04ff69#file-forced_signalling_chaos_cost_sim-py

Note that if miners simply accept those losses and don't take any
action to prevent it, very long reorgs are to be expected -- in the 15%
non-signalling scenario, you'd expect to see a 5-block reorg; in the 40%
non-signalling scenario, you'd get reorgs of 60+ blocks. (Only people
not running lockinontimeout=true would see the blocks being reorged out,
of course)


So I think focussing on how many nodes have a particular lockinontimeout
setting can be pretty misleading.

> # 80% on LOT=false, 20% LOT=True
> - Case 1: Activates ahead of time anyways

That's the case where >90% of hashpower is signalling, and everything
works fine.

> - Case 2: Fails to Activate before timeout...
> 20% *may* fork off with LOT=true.

Anyone running with lockinontimeout=true will refuse to follow a chain
where lockin hasn't been reached by the timeout height; so if the most
work chain meets that condition, lockinontimeout=true nodes will refuse
to follow it; either getting stuck with no confirmations at all, or
following a lower work chain that does (or can) reach lockin by timeout
height.

> Bitcoin hashrate reduced, chance of multi
> block reorgs at time of fork relatively high, especially if network does not
> partition.

If the most-work chain fails to activate, and only a minority of
hashrate is running lockinontimeout=true, the chance of multiblock
reorgs is actually pretty low. The way it would play out in detail,
with say 20% of hashpower not signalling and 40% of hashpower running
lockinontimeout=true:

  * the chain reaches the last retarget period; lockinontimeout=false
    nodes stay in STARTED, lockinontimeout=true nodes switch to
    MUST_SIGNAL

  * for the first ~1009 blocks, everyone stays in sync, but block ~1010
    becomes the 202nd non-signalling block, meaning that the 60% of
    hashpower on lockinontimeout=false is now one block ahead of the 40%
    of hashpower on lockinontimeout=true

  * it's possible that the 40% have a lucky run and get ahead of the 60%
    chain causing a reorg. But in that case the within about 5 blocks,
    another non-signalling block will be mined and the 60% will be ahead
    again. So the 40% of lockinontimeout=true hashpower has to keep with
    with miners that have 150% of their hashrate for ~1000 blocks in
    order for everyone to end up on a locked in chain, which is
    vanishingly unlike.

Even if you set the percentage not signalling to 11% and the percent of
hashpower running lockinontimeout=true to 48%, by my count you only get
about a 27% chance of ending up reaching lockin on the most work chain.
With the 40%/20% figures above it's a flat 0.0%.

https://gist.github.com/ajtowns/fbcf30ed9d0e1708fdc98a876a04ff69#file-test_disaster-py

It's possible that the 60% will take some action to prevent their blocks
being reorged out if the 40% do get lucky. One option would be for them
to set lockinontimeout=true -- then we quickly get back to the "almost
all hashpower ends up running lockinontimeout=true" and activation is
certain. But they could just as easily decide that one getblockchaininfo
reports a softfork isn't possible, they won't reorg to a chain where it
is possible unless it's 2 or 4 or 6 or whatever blocks longer.

> # 80% on LOT=true, 20% LOT=False
> - Case 1: Activates ahead of time Anyways
> No issues.

This is same case where there's plenty of signalling and it's irrelevant
what the setting for lockinontimeout is...

> - Case 2: Fails to Activate before timeout...

I'm not sure what you mean by "before timeout" here -- if you mean
it reaches the MUST_SIGNAL phase, with 80% of hashpower running
lockinontimeout=true, then things work out okay: even assuming that all
20% that are not running lockinontimeout=true are also not signalling,
then the miners who don't signal will lose up to 56% of their revenue for
the MUST_SIGNAL period (~$80M) , and if some of the lockinontimeout=true
miners do spy-mining and build on top of non-signalling blocks, they
may lose something like 1.7% of their revenue as well. In addition we
might see reorgs of up to ~10 blocks as this resolves itself. That's a
significant loss for the miners who are out of consensus, and the
liklihood of large reorgs will make doing business with bitcoin harder,
but that at least is all able to be coped with.

But if you mean the most work chain reaches the timeout height without
achieving locked in state, because the majority of miners aren't running
lockinontimeout=true, then the 80% of nodes running lockinontimeout=true
will be stalled, and unable to process transactions, until they downgrade.

If that ever occurred, it would be an astounding disaster, and I hope
the first thing people would do is decide never to run any software by
whoever proposed, ACKed or merged the PR that resulted in 80% of nodes
running with lockinontimeout=true.

*Because* it would be such a disaster to effectively run a
denial-of-service attack on 80% of nodes, it's plausible that price
signals would indicate to miners that it will be much more profitable to
run lockinontimeout=true, preventing that from occuring. But people can
make profits out of disasters too -- it might be that people will figure
"oh, the price will crash if this happens, so it'll be a chance to get
some cheap bitcoins, and maybe put competing miners out of business so
I can buy their ASICs off them for cheap too!"

> My overall summary is thus:
> 1) People care what Core releases because we assume the majority will likely
> run it. If core were a minority project, we wouldn't really care what core
> released.

That seems very backwards to me. I'd put it as: people run core because
it makes good, conservative decisions on what features to add. If
"choose your own consensus rules" were what the market wanted, then
Bitcoin Unlimited or similar would be what everyone was running.

If core were to change that policy and push risky changes, I'd hope
that users would be able to recognise this, and would switch to an
implementation that continues to emphasise safe, conservative policies.

> 2) People are upset with LOT=true being suggested as release parameters because
> of the narrative that it puts devs in control.

If users will just run whatever core devs release, even if it involves
contentious changes to consensus rules, then the core devs are in control.

> 3) LOT=true having a sizeable minority running it presents major issues to
> majority LOT=false in terms of lost blocks during the final period and in terms
> of a longer term fork.

As above, I think this scenario is easy to avoid if it were to
eventuate.

> 4) Majority LOT=true has no long term instability on consensus (majority LOT=
> true means the final period always activates, any instability is short lived +
> irrational).

The instability occurs if the lockinontimeout=true chain stalls or is
overtaken by a more-work non-activating chain, then users running nodes
with that parameter set will stop their nodes, and reinstall/reconfigure
it to set lockinontimeout=false.

> 5) On the balance, the safer parameter to release *seems* to be LOT=true. But
> because devs are sensitive to control narrative, LOT=false is preferred by
> devs.

I think that conclusion is based on a few shakey assumptions; particularly
that people won't downgrade/reinstall back to lockinontimeout=false
and that miners will be be pretty naive about allowing their blocks to
be orphaned.

> 6) Almost paradoxically, choosing a less safe option for a narrative reason is
> more of a show of dev control than choosing a more safe option despite
> appearances.

Going all-in on a bluff can be a good bet 9 times out of 10, while still
being a net negative because of the 1 time out of 10 when you lose. In
the examples above, the "80% of nodes running the default client can no
longer follow the blockchain without manual intervention" is the "lose
it all scenario", even if "taproot" is probably one of the 9/10 cases,
not the 1/10 case.

> 7) This all comes down to if we think that a reasonable number of important
> nodes will run LOT=true.

What nodes run (as compared to hashpower, or as compared to what people
want to buy/sell) is the least important factor in working out what's
going to happen.

> As a plan of action, I think that means that either:
> A) Core should release LOT=true, as a less disruptive option given stated
> community intentions to do LOT=true
> B) Core  community should vehemently anti-advocate running LOT=true to ensure
> the % is as small as possible
> C) Do nothing
> D) Core community should release LOT=false and vehemently advocate manually
> changing to LOT=true to ensure the % is supermajority, but leaving it as a user
> choice.

I think these are all a bit terrible as plans of action -- "core should
release X, then advocate Y" is really not playing to core's strengths.
Far better for devs to focus on writing/debugging code, analysing the
way things work, making tests, and adding mitigations for risks.

Better for bloggers and podcasters and the twitterati to do the advocacy,
and core to stick to working on code and saying "no, there are significant
technical risks to doing that that we don't yet have mitigations for"
when people advocate for risky things.

My view is more along the lines of:

 - the setting for lockinontimeout will not matter until around July 2022,
   (though maybe as early as May 2022 if blocks come really fast) either
   technically or even as a game theory incentive

 - lockinontimeout=true has consensus implications, and depending on
   the response by miners can cause network interruptions like long
   chains of reorgs. At best, it hasn't had the same level of review as
   taproot, and some experienced developers aren't comfortable with it
   as it stands. Those seem like pretty good reasons not to deploy it
   immediately, IMO.

 - the lockinontimeout=true code we've got doesn't do (at least) two
   things that the bip148 client did that help avoid bad cases:
     - ensure preferential connections to other nodes setting
       lockinontimeout=true to prevent network splits if the
       non-activating chain is longer during/after the MUST_SIGNAL phase
     - cope with rewinding the chain to the best lockinontimeout=true valid
       block, in the event a node is upgraded to lockinontimeout=true
       from either lockinontimeout=false or a version of bitcoind that
       doesn't have activation parameters set at all

I think it makes more sense to:

 1) release lockinontimeout=false code with a view to reconsidering it
    at about ~6 months (so prior to the 23.0 release)

 2) do more review of lockinontimeout=true code to ensure everyone
    understands what behaviours are likely

 3) add support for the features from the bip148 client, along with
    any other mitigations we think of, assuming we can do so in a way
    that's safe and sane

 4) work with miners and mining pools to ensure that if
    lockinontimeout=true does get used they know how to minimise
    disruption and losses due to orphaning, etc.

That gives us about 6 months work on (2) and (3), and probably 9-12
months to work on (4), and it's all technical rather than advocacy and
popularity contests. Six, nine or twelve months should be plenty of time
to get pretty clear indications of what both the market in general
thinks about things, and what miners are thinking.

I think if lockinontimeout=true weren't new code, and devs, miners and
users widely understood its potential behaviours and risks, and we didn't
have safety features that were still on the todo list, then there'd be
a good argument for doing lockinontimeout=true from day 1. I could see
that being the case for the next soft-fork, assuming it gets a similar
amount of review prior to deployment as taproot has had, eg. But, to me,
taking a more cautious approach seems more sensible today.

> If I had to summarize the emotional dynamic among developers [...]

(Fortunately, you don't have to do that...)

Cheers,
aj