Return-Path: Received: from fraxinus.osuosl.org (smtp4.osuosl.org [140.211.166.137]) by lists.linuxfoundation.org (Postfix) with ESMTP id 56A1CC013B for ; Sat, 5 Dec 2020 22:59:33 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by fraxinus.osuosl.org (Postfix) with ESMTP id 4504486448 for ; Sat, 5 Dec 2020 22:59:33 +0000 (UTC) X-Virus-Scanned: amavisd-new at osuosl.org Received: from fraxinus.osuosl.org ([127.0.0.1]) by localhost (.osuosl.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id Y-avnzBxFZyi for ; Sat, 5 Dec 2020 22:59:31 +0000 (UTC) X-Greylist: from auto-whitelisted by SQLgrey-1.7.6 Received: from mail-40133.protonmail.ch (mail-40133.protonmail.ch [185.70.40.133]) by fraxinus.osuosl.org (Postfix) with ESMTPS id BA39C86403 for ; Sat, 5 Dec 2020 22:59:30 +0000 (UTC) Date: Sat, 05 Dec 2020 22:59:17 +0000 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=wuille.net; s=protonmail2; t=1607209166; bh=lIneUli+abdrodiB7DA3J/aTnv9ymOUCh696f5389tM=; h=Date:To:From:Reply-To:Subject:In-Reply-To:References:From; b=J98S5AiliYUqahgBcjL1pdJKR/WTB4sa7D78gAZb6SA3qQcGtuXJQQKIMEOBx9IiB 8fMa7zD/1eBvq/fgYODKa/jWh85vF0hohpPewyLJ4Zh/GHQ3ZVMjUvnLjAtAA6RyIq mW4mo79s/I7Vdzik1mQX3HK3EbdrEFjn6LLpq/jueQwsGOT6n9hgf7f/VUe3hGfz/L lk8pJwSkum7UGdbOp996PC5UQCQWFm8hg/MO7f+0wJHecIT41uPaWT6Q5/N0teMw4z hhBkZWIUh9Dk3g4ATTJAwbOS2vJhoz4LrjBT+cxgFlIa8zirS41czJC8khtIa9Xf0Y f2kiC0y8gTsEg== To: Rusty Russell , Bitcoin Protocol Discussion From: Pieter Wuille Reply-To: Pieter Wuille Message-ID: In-Reply-To: References: <87imblmutl.fsf@rustcorp.com.au> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Mailman-Approved-At: Sun, 06 Dec 2020 04:04:00 +0000 Subject: Re: [bitcoin-dev] Progress on bech32 for future Segwit Versions (BIP-173) X-BeenThere: bitcoin-dev@lists.linuxfoundation.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: Bitcoin Protocol Discussion List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 05 Dec 2020 22:59:33 -0000 > On Wednesday, October 7, 2020 5:21 PM, Rusty Russell via bitcoin-dev bitc= oin-dev@lists.linuxfoundation.org wrote: > > > I propose an alternative to length restrictions suggested by > > Russell in https://github.com/bitcoin/bips/pull/945: use the > > https://gist.github.com/sipa/a9845b37c1b298a7301c33a04090b2eb variant, > > unless the first byte is 0. > > Hi all, > > starting a slight side-thread here. Another update, and a longer write-up. Introduction ------------ Bech32's checksum algorithm was designed to be strong against substitution errors, but it also provides some protection against more general classes o= f errors. The final constant M that is XOR'ed into the checksum influences th= at protection. BIP173 today uses M=3D1, but it is now known that this has a weakness: if the final character is a "p", any number of "q" characters can= be inserted or erased right before it, without invalidating the checksum. As it was recognized that other constants do not have this issue, the obvio= us question is whether this is the only possible type of weakness, and if not,= if there is an optimal constant to use that minimizes the largest number of weaknesses. Since my last mail I've realized that it is actually possible to analyse th= e behavior of these final constants under a wide variety of error classes (substitutions, deletions, insertions, swaps, duplications) programatically= . Greg Maxwell and I have used this to perform an exhaustive analysis of cert= ain error patterns for all 2^30 possible M values, selected a number of criteri= a to optimize for, and conclude that we should use as constant: M =3D 0x2bc830a3 The code used to do this analysis, as well as the code used to verify some expected properties of the final result, and more, can be found on https://gist.github.com/sipa/14c248c288c3880a3b191f978a34508e See results_final.txt to see how this constant compares with the previously suggested constants 1, 0x3fffffff, and 0x3fefffff. Properties ---------- If we define an error as a deletion of one position, a swap of adjacent positions, a substitution in a specific position with a random character, a= n insertion of one random character in a specific position, or a duplication = of the character in a specific position, then this M constant above gives us t= he following properties: * For generic HRPs and errors that don't affect the first 6 data characters= , or alternatively averaged over all HRPs (see details futher): * Always detected: * (P) Up to 4 substitution errors (true for any constant). * (Q) Any valid code with the new constant, fed without modification to software that uses the old constant 1 (true for any constant). * Any error pattern has failure to detect probability <=3D 2^-30: * (L) Any number of errors restricted to a single window of up to 4 characters. * (B) Up to two errors restricted to a window of up to 68 characters. * (D) Any one error made to a valid code with the new constant, and fed= to software that uses the old constant 1 * Most error patterns have probability <=3D 2^-30: * (C) Up to two errors in general: out of 23926796 such error patterns, 0.0040% have probability 2^-25. * (N) Up to three errors restricted to a window of up to 69 characters: out of 284708444 such patterns, 0.033% have probability 2^-25. * (O) Up to three errors in general: out of 295744442 such error patter= ns, 0.034% have probability 2^-25; 0.000065% have probability 2^-20. * (G) Up to two errors made to a valid code with the new constant, and = fed to software that uses the old constant 1: out of 2831622 such err= or patterns, 0.048% have probability 2^-25. * Specifically for the bc1 HRP, with the BIP173 length restrictions: * Always detected: * (R) Up to 4 substitution errors (true for any constant). * (A) Up to 3 substitution errors made to a valid code with the new constant, and fed to software that uses the old constant 1. * Any error pattern has failure to detect probability <=3D 2^-30: * (E) Any one error. * (F) Any one error made to a valid code with the new constant, and fed= to software that uses the old constant 1. * (H) Up to two errors restricted to a window of 28 characters. * Most error patterns have probability <=3D 2^-30: * (J) Up to two errors in general: out of 455916 such error patterns, 0.039% have probability 2^-25; 0.0053% have 2^-20. * (K) Any number of errors restricted to a window of 4 characters: out = of 5813139 such error patterns, 0.0016% have probability 2^-25. * (M) Up to three errors: out of 50713466 such error patterns, 0.078% h= ave probability 2^-25; 0.00063% have 2^-20. * (I) Up to two errors made to a valid code with the new constant, and = fed to software that uses the old constant 1: out of 610683 such erro= r patterns, 0.092% have probability 2^-25; 0.00049% have probabilit= y 2^-20. To give an idea of what these probabilities mean, consider the known BIP173 insertion issue. It admits an error pattern of 1 error (insertion in penultimate position) that has a failure to detect probability of 2^-10: it requires the final character to be 'p', and the inserted character to be 'q'. Assuming those are both random, we have a chance of 1 in 32*32 to hit = it. Note that the choice of *what* the error pattern is (whether it's insertion= , and where) isn't part of our probabilities: we try to make sure that *every= * pattern behaves well, not just randomly chosen ones, because presumably hum= ans make some kinds of errors more than others, and we don't know which ones. All the analyzed patterns above are guaranteed to be detected with probabil= ity 2^-20 or better (and most are 2^-30). Of course, if we'd search for even larger classes of errors, say any 4 independent errors of any type, we woul= d probably discover patterns with worse probabilities, but at that point the probability of the pattern itself being hit should be taken into account. The selection was made based on these same properties: * Start with the set of all 2^30 constants. * The generic properties (L), (B), (D), (C), (N), (O), and (G) were selecte= d for by rejecting all constants that left any worse error patterns (e.g. all codes for which patterns matching (N) existed with failure probabilit= y above 2^-25 were removed). All these restrictions are as strong as they can be: making them over longer strings, wider windows, or more errors wi= th the same restrictions removes all remaining constants. This leaves us wit= h just 12054 acceptable constants. * The same was then done for the bc1/BIP173 specific properties (A), (E), (= J), (F), (H), (K), (M), and (I). This reduces the set further to 79 acceptabl= e constants. The full analysis output for all of these can be found in output.txt. * Finally, the constant with the minimal number of worst-probability patter= ns was chosen for the generic property (N). The single constant 0x2bc830a3 remains. * This solution and a few of its expected properties were then validated us= ing a simple program that makes random errors (see the monte_carlo.py file). Technical details ----------------- For the purpose of this analysis, define an "error pattern" as a starting length (of a valid string consisting of otherwise randomly chosen character= s) combined with a sequence of the following (in this order): * 0 or more deletions of characters at specific positions (without constraining what those characters are) * 0 or more swaps of characters at specific positions with the character following it * 0 or more substitutions of characters at specific positions with a unifor= mly randomly selected character * 0 or more insertions of uniformly randomly selected characters at specifi= c positions * 0 or more duplications of characters at specific positions (including duplications of characters inserted/substituted in the previous steps) Examples: * "Start with a random valid 58 character string, remove the 17th character= , swap the 11th character with the 12th character, and insert a random character in the 24th position" is an error pattern. * "Replace the 17th through 24th characters in a 78 character string with 'aardvark'" is not an error pattern, because substituted characters have = to be random, and can't be specific values. Given such a pattern, assign variable names to every input character, and t= o every inserted/substituted character. For example, the pattern "Start with a 6 character string, delete the 1st character, swap the 2nd and 3rd character, and insert a random character between those" would be represente= d as [v0 v1 v2 v3 v4 v5] and [v1 v3 v6 v2 v4 v5]. Treat these variables as elements of GF(32), and write out the equations that both the first and sec= ond list have a valid checksum. Due to the fact that BCH codes are linear, this= is just a linear set of equations over GF(32), and we can use Gaussian elimination to find the size of the solution space. If the input and output are the same length, we need to subtract the number of solutions for which = the input and output are exactly the same, which is easy to find with another s= et of equations. Now compute the ratio of this number divided by (32^numvars / 32^6), where the 32^6 is due to the precondition that the input string is valid. This gives us the probability of failure, assuming input and output = are random, apart from the known relation between the two, and the fact that bo= th are valid. This technique has an important limitation: it can only reason about random= ly- chosen input strings, and the presence of the HRP and version numbers at th= e start violates that assumption. These are not random, and we're forced to make one of these concessions: 1) Ignore the problem, and treat the HRP as random. This lets us derive properties that hold over all potential HRPs on average, but will thus f= ail to account for the possibility that for a small numbers of potential HRP= s some error patterns may exist that behave worse. For technical reasons, this averaging makes all constants behave identically for error patterns that don't change the length of the string. Given that substitution/swap only errors are already dealt with well due to the BCH design this is perhaps not too important. One exception is frame-shifting errors (a deletion in one place compensated with an insertion in another place). 2) Restrict analysis to error patterns that don't affect the first 6 actual characters. Doing so "masks" the effect of the HRP completely. 3) Do analysis for specific HRPs only, allowing much more accurate statemen= ts, but HRP-specific ones that may not hold for every HRP. Our final selection primarily optimizes for 1) and 2) as those benefit all potential uses of the encoding, but do optimize for 3) the "bc1" prefix specifically (and the BIP173 length restriction) as a tiebreaker. The code for this can be found under the link above, in const_analysis.cpp. Cheers, -- Pieter