7e/3c18bda156825c0dd1a33fb8342dc6317b1d42


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322

Return-Path: <bitcoin-dev@wuille.net>
Received: from fraxinus.osuosl.org (smtp4.osuosl.org [140.211.166.137])
 by lists.linuxfoundation.org (Postfix) with ESMTP id 56A1CC013B
 for <bitcoin-dev@lists.linuxfoundation.org>;
 Sat,  5 Dec 2020 22:59:33 +0000 (UTC)
Received: from localhost (localhost [127.0.0.1])
 by fraxinus.osuosl.org (Postfix) with ESMTP id 4504486448
 for <bitcoin-dev@lists.linuxfoundation.org>;
 Sat,  5 Dec 2020 22:59:33 +0000 (UTC)
X-Virus-Scanned: amavisd-new at osuosl.org
Received: from fraxinus.osuosl.org ([127.0.0.1])
 by localhost (.osuosl.org [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id Y-avnzBxFZyi
 for <bitcoin-dev@lists.linuxfoundation.org>;
 Sat,  5 Dec 2020 22:59:31 +0000 (UTC)
X-Greylist: from auto-whitelisted by SQLgrey-1.7.6
Received: from mail-40133.protonmail.ch (mail-40133.protonmail.ch
 [185.70.40.133])
 by fraxinus.osuosl.org (Postfix) with ESMTPS id BA39C86403
 for <bitcoin-dev@lists.linuxfoundation.org>;
 Sat,  5 Dec 2020 22:59:30 +0000 (UTC)
Date: Sat, 05 Dec 2020 22:59:17 +0000
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=wuille.net;
 s=protonmail2; t=1607209166;
 bh=lIneUli+abdrodiB7DA3J/aTnv9ymOUCh696f5389tM=;
 h=Date:To:From:Reply-To:Subject:In-Reply-To:References:From;
 b=J98S5AiliYUqahgBcjL1pdJKR/WTB4sa7D78gAZb6SA3qQcGtuXJQQKIMEOBx9IiB
 8fMa7zD/1eBvq/fgYODKa/jWh85vF0hohpPewyLJ4Zh/GHQ3ZVMjUvnLjAtAA6RyIq
 mW4mo79s/I7Vdzik1mQX3HK3EbdrEFjn6LLpq/jueQwsGOT6n9hgf7f/VUe3hGfz/L
 lk8pJwSkum7UGdbOp996PC5UQCQWFm8hg/MO7f+0wJHecIT41uPaWT6Q5/N0teMw4z
 hhBkZWIUh9Dk3g4ATTJAwbOS2vJhoz4LrjBT+cxgFlIa8zirS41czJC8khtIa9Xf0Y
 f2kiC0y8gTsEg==
To: Rusty Russell <rusty@rustcorp.com.au>,
 Bitcoin Protocol Discussion <bitcoin-dev@lists.linuxfoundation.org>
From: Pieter Wuille <bitcoin-dev@wuille.net>
Reply-To: Pieter Wuille <bitcoin-dev@wuille.net>
Message-ID: <Txf3NOgbjhU0PtTMxTd9AWNuR8WpOwramvcClf5D-vqxKbE7MT4fi1YLY-xcbC9MBr3c1AqgmHETSMhNnbu4-fZ8Lcoe23etgo05P34b0v4=@wuille.net>
In-Reply-To: <p6i6YqXjOBastpss12gNUcJGGqomgPiLXOIyob71VptqVwJwcFwrd4m8Mad4RDnAhSyFXAZqsD67fW0kS3NayQ6k6dB-h2_V7vl7RBxmvvA=@wuille.net>
References: <87imblmutl.fsf@rustcorp.com.au>
 <p6i6YqXjOBastpss12gNUcJGGqomgPiLXOIyob71VptqVwJwcFwrd4m8Mad4RDnAhSyFXAZqsD67fW0kS3NayQ6k6dB-h2_V7vl7RBxmvvA=@wuille.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Mailman-Approved-At: Sun, 06 Dec 2020 04:04:00 +0000
Subject: Re: [bitcoin-dev] Progress on bech32 for future Segwit Versions
	(BIP-173)
X-BeenThere: bitcoin-dev@lists.linuxfoundation.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: Bitcoin Protocol Discussion <bitcoin-dev.lists.linuxfoundation.org>
List-Unsubscribe: <https://lists.linuxfoundation.org/mailman/options/bitcoin-dev>, 
 <mailto:bitcoin-dev-request@lists.linuxfoundation.org?subject=unsubscribe>
List-Archive: <http://lists.linuxfoundation.org/pipermail/bitcoin-dev/>
List-Post: <mailto:bitcoin-dev@lists.linuxfoundation.org>
List-Help: <mailto:bitcoin-dev-request@lists.linuxfoundation.org?subject=help>
List-Subscribe: <https://lists.linuxfoundation.org/mailman/listinfo/bitcoin-dev>, 
 <mailto:bitcoin-dev-request@lists.linuxfoundation.org?subject=subscribe>
X-List-Received-Date: Sat, 05 Dec 2020 22:59:33 -0000

> On Wednesday, October 7, 2020 5:21 PM, Rusty Russell via bitcoin-dev bitc=
oin-dev@lists.linuxfoundation.org wrote:
>
> > I propose an alternative to length restrictions suggested by
> > Russell in https://github.com/bitcoin/bips/pull/945: use the
> > https://gist.github.com/sipa/a9845b37c1b298a7301c33a04090b2eb variant,
> > unless the first byte is 0.
>
> Hi all,
>
> starting a slight side-thread here.

Another update, and a longer write-up.


Introduction
------------

Bech32's checksum algorithm was designed to be strong against substitution
errors, but it also provides some protection against more general classes o=
f
errors. The final constant M that is XOR'ed into the checksum influences th=
at
protection. BIP173 today uses M=3D1, but it is now known that this has a
weakness: if the final character is a "p", any number of "q" characters can=
 be
inserted or erased right before it, without invalidating the checksum.

As it was recognized that other constants do not have this issue, the obvio=
us
question is whether this is the only possible type of weakness, and if not,=
 if
there is an optimal constant to use that minimizes the largest number of
weaknesses.

Since my last mail I've realized that it is actually possible to analyse th=
e
behavior of these final constants under a wide variety of error classes
(substitutions, deletions, insertions, swaps, duplications) programatically=
.
Greg Maxwell and I have used this to perform an exhaustive analysis of cert=
ain
error patterns for all 2^30 possible M values, selected a number of criteri=
a
to optimize for, and conclude that we should use as constant:

  M =3D 0x2bc830a3

The code used to do this analysis, as well as the code used to verify some
expected properties of the final result, and more, can be found on
https://gist.github.com/sipa/14c248c288c3880a3b191f978a34508e

See results_final.txt to see how this constant compares with the previously
suggested constants 1, 0x3fffffff, and 0x3fefffff.

Properties
----------

If we define an error as a deletion of one position, a swap of adjacent
positions, a substitution in a specific position with a random character, a=
n
insertion of one random character in a specific position, or a duplication =
of
the character in a specific position, then this M constant above gives us t=
he
following properties:

* For generic HRPs and errors that don't affect the first 6 data characters=
,
  or alternatively averaged over all HRPs (see details futher):
  * Always detected:
    * (P) Up to 4 substitution errors (true for any constant).
    * (Q) Any valid code with the new constant, fed without modification to
          software that uses the old constant 1 (true for any constant).
  * Any error pattern has failure to detect probability <=3D 2^-30:
    * (L) Any number of errors restricted to a single window of up to 4
          characters.
    * (B) Up to two errors restricted to a window of up to 68 characters.
    * (D) Any one error made to a valid code with the new constant, and fed=
 to
          software that uses the old constant 1
  * Most error patterns have probability <=3D 2^-30:
    * (C) Up to two errors in general: out of 23926796 such error patterns,
          0.0040% have probability 2^-25.
    * (N) Up to three errors restricted to a window of up to 69 characters:
          out of 284708444 such patterns, 0.033% have probability 2^-25.
    * (O) Up to three errors in general: out of 295744442 such error patter=
ns,
          0.034% have probability 2^-25; 0.000065% have probability 2^-20.
    * (G) Up to two errors made to a valid code with the new constant, and =
fed
          to software that uses the old constant 1: out of 2831622 such err=
or
          patterns, 0.048% have probability 2^-25.

* Specifically for the bc1 HRP, with the BIP173 length restrictions:
  * Always detected:
    * (R) Up to 4 substitution errors (true for any constant).
    * (A) Up to 3 substitution errors made to a valid code with the new
          constant, and fed to software that uses the old constant 1.
  * Any error pattern has failure to detect probability <=3D 2^-30:
    * (E) Any one error.
    * (F) Any one error made to a valid code with the new constant, and fed=
 to
          software that uses the old constant 1.
    * (H) Up to two errors restricted to a window of 28 characters.
  * Most error patterns have probability <=3D 2^-30:
    * (J) Up to two errors in general: out of 455916 such error patterns,
          0.039% have probability 2^-25; 0.0053% have 2^-20.
    * (K) Any number of errors restricted to a window of 4 characters: out =
of
          5813139 such error patterns, 0.0016% have probability 2^-25.
    * (M) Up to three errors: out of 50713466 such error patterns, 0.078% h=
ave
          probability 2^-25; 0.00063% have 2^-20.
    * (I) Up to two errors made to a valid code with the new constant, and =
fed
          to software that uses the old constant 1: out of 610683 such erro=
r
          patterns, 0.092% have probability 2^-25; 0.00049% have probabilit=
y
          2^-20.

To give an idea of what these probabilities mean, consider the known BIP173
insertion issue. It admits an error pattern of 1 error (insertion in
penultimate position) that has a failure to detect probability of 2^-10:
it requires the final character to be 'p', and the inserted character to be
'q'. Assuming those are both random, we have a chance of 1 in 32*32 to hit =
it.
Note that the choice of *what* the error pattern is (whether it's insertion=
,
and where) isn't part of our probabilities: we try to make sure that *every=
*
pattern behaves well, not just randomly chosen ones, because presumably hum=
ans
make some kinds of errors more than others, and we don't know which ones.

All the analyzed patterns above are guaranteed to be detected with probabil=
ity
2^-20 or better (and most are 2^-30). Of course, if we'd search for even
larger classes of errors, say any 4 independent errors of any type, we woul=
d
probably discover patterns with worse probabilities, but at that point the
probability of the pattern itself being hit should be taken into account.

The selection was made based on these same properties:
* Start with the set of all 2^30 constants.
* The generic properties (L), (B), (D), (C), (N), (O), and (G) were selecte=
d
  for by rejecting all constants that left any worse error patterns (e.g.
  all codes for which patterns matching (N) existed with failure probabilit=
y
  above 2^-25 were removed). All these restrictions are as strong as they
  can be: making them over longer strings, wider windows, or more errors wi=
th
  the same restrictions removes all remaining constants. This leaves us wit=
h
  just 12054 acceptable constants.
* The same was then done for the bc1/BIP173 specific properties (A), (E), (=
J),
  (F), (H), (K), (M), and (I). This reduces the set further to 79 acceptabl=
e
  constants. The full analysis output for all of these can be found in
  output.txt.
* Finally, the constant with the minimal number of worst-probability patter=
ns
  was chosen for the generic property (N). The single constant 0x2bc830a3
  remains.
* This solution and a few of its expected properties were then validated us=
ing
  a simple program that makes random errors (see the monte_carlo.py file).


Technical details
-----------------

For the purpose of this analysis, define an "error pattern" as a starting
length (of a valid string consisting of otherwise randomly chosen character=
s)
combined with a sequence of the following (in this order):
* 0 or more deletions of characters at specific positions (without
  constraining what those characters are)
* 0 or more swaps of characters at specific positions with the character
  following it
* 0 or more substitutions of characters at specific positions with a unifor=
mly
  randomly selected character
* 0 or more insertions of uniformly randomly selected characters at specifi=
c
  positions
* 0 or more duplications of characters at specific positions (including
  duplications of characters inserted/substituted in the previous steps)

Examples:
* "Start with a random valid 58 character string, remove the 17th character=
,
  swap the 11th character with the 12th character, and insert a random
  character in the 24th position" is an error pattern.
* "Replace the 17th through 24th characters in a 78 character string with
  'aardvark'" is not an error pattern, because substituted characters have =
to
  be random, and can't be specific values.

Given such a pattern, assign variable names to every input character, and t=
o
every inserted/substituted character. For example, the pattern "Start with
a 6 character string, delete the 1st character, swap the 2nd and 3rd
character, and insert a random character between those" would be represente=
d
as [v0 v1 v2 v3 v4 v5] and [v1 v3 v6 v2 v4 v5]. Treat these variables as
elements of GF(32), and write out the equations that both the first and sec=
ond
list have a valid checksum. Due to the fact that BCH codes are linear, this=
 is
just a linear set of equations over GF(32), and we can use Gaussian
elimination to find the size of the solution space. If the input and output
are the same length, we need to subtract the number of solutions for which =
the
input and output are exactly the same, which is easy to find with another s=
et
of equations. Now compute the ratio of this number divided by (32^numvars /
32^6), where the 32^6 is due to the precondition that the input string is
valid. This gives us the probability of failure, assuming input and output =
are
random, apart from the known relation between the two, and the fact that bo=
th
are valid.

This technique has an important limitation: it can only reason about random=
ly-
chosen input strings, and the presence of the HRP and version numbers at th=
e
start violates that assumption. These are not random, and we're forced to
make one of these concessions:
1) Ignore the problem, and treat the HRP as random. This lets us derive
   properties that hold over all potential HRPs on average, but will thus f=
ail
   to account for the possibility that for a small numbers of potential HRP=
s
   some error patterns may exist that behave worse. For technical reasons,
   this averaging makes all constants behave identically for error patterns
   that don't change the length of the string. Given that substitution/swap
   only errors are already dealt with well due to the BCH design this is
   perhaps not too important. One exception is frame-shifting errors (a
   deletion in one place compensated with an insertion in another place).
2) Restrict analysis to error patterns that don't affect the first 6 actual
   characters. Doing so "masks" the effect of the HRP completely.
3) Do analysis for specific HRPs only, allowing much more accurate statemen=
ts,
   but HRP-specific ones that may not hold for every HRP.

Our final selection primarily optimizes for 1) and 2) as those benefit all
potential uses of the encoding, but do optimize for 3) the "bc1" prefix
specifically (and the BIP173 length restriction) as a tiebreaker.

The code for this can be found under the link above, in const_analysis.cpp.

Cheers,

--
Pieter