MIME-Version: 1.0
In-Reply-To: <CANg-TZC2NHfGR3mfm4VuuZMbwxkJzP69OmWhLvOD2Zq8GWejnw@mail.gmail.com>
References: <CANg-TZC2NHfGR3mfm4VuuZMbwxkJzP69OmWhLvOD2Zq8GWejnw@mail.gmail.com>
Date: Fri, 1 Nov 2013 18:41:53 -0500
Message-ID: <CAJfRnm6mjm5Oy5YFM9vqC487AjtVG2NNzNg+GXaB1p2j7JtcGA@mail.gmail.com>
From: Allen Piscitello <allen.piscitello@gmail.com>
To: Brooks Boyd <boydb@midnightdesign.ws>
Content-Type: multipart/alternative; boundary=f46d044402a274e2de04ea261c61
Cc: Bitcoin Development <bitcoin-development@lists.sourceforge.net>
Subject: Re: [Bitcoin-development] BIP39 word list

--f46d044402a274e2de04ea261c61
Content-Type: text/plain; charset=ISO-8859-1

The problem with this is that you might have word A which is similar to B,
but B is also similar to C. So we scrub B from the list, someone enters B,
and we have no way to know if it means A or C. It leads to a much more
complicated scheme to ensure that all errors are correctable.

Scrubbing A, B, and C is preferable, since it leads to no ambiguity and
there is no need to try to correct an error.
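
To make the A/B/C case concrete, here is a minimal Python sketch. It is an
illustration only, not the filter used to build the BIP39 list: the rule in
words_similar (equal length, every position holding equal or look-alike
characters) and the helper names are assumptions made for this example, and
"fat" and "zoo" are made-up sample words. The character pairs are the ones
from the matrix quoted below.

```python
# Character pairs listed as similar in the quoted matrix, written compactly:
# "ac" stands for ('a', 'c'), and so on.
SIMILAR = {frozenset(pair) for pair in (
    "ac ae ao bd bh bp bq br ce cg cn co cq cu dg dh do dp dq ef eo "
    "fi fj fl fp ft gj go gp gq gy hk hl hm hn hr ij il it iy jl jp jq jy "
    "kx lt mn mw nu nz op oq ou ov pq pr qy sz uv uw uy vw vy"
).split()}


def chars_similar(x, y):
    """True if two characters are equal or form a listed look-alike pair."""
    return x == y or frozenset((x, y)) in SIMILAR


def words_similar(a, b):
    """Assumed rule: distinct words of equal length whose characters are
    equal or look-alike at every position."""
    return (a != b and len(a) == len(b)
            and all(chars_similar(x, y) for x, y in zip(a, b)))


def candidates(entered, wordlist):
    """Words in the list that the entered word could be a misreading of."""
    return sorted(w for w in wordlist if words_similar(entered, w))


def scrub_clusters(words):
    """Keep only words that are not similar to any other word in the input."""
    return sorted(w for w in words
                  if not any(words_similar(w, other) for other in words))


if __name__ == "__main__":
    # A = "cat", B = "eat", C = "fat": B is similar to both, A and C are not.
    print(words_similar("cat", "eat"), words_similar("eat", "fat"),
          words_similar("cat", "fat"))
    # Scrub only B, then someone enters B: two equally valid guesses.
    print(candidates("eat", ["cat", "fat"]))
    # Scrub the whole cluster instead: nothing left to confuse.
    print(scrub_clusters(["cat", "eat", "fat", "zoo"]))
```

With only "eat" scrubbed, an entered "eat" maps back to both "cat" and "fat",
so it cannot be corrected. Scrubbing all three leaves no word that another
could be mistaken for.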

On Fri, Nov 1, 2013 at 3:14 PM, Brooks Boyd <boydb@midnightdesign.ws> wrote:

> I was inspired to join the mailing list to comment on some of these
> discussions about BIP39, which I think will have great use in the Bitcoin
> community and outside it as a way to transcribe binary data.
>
> The one thought I had, as the discussions about similar characters are
> resulting in culling words from the list, is that it only helps to validate
> input, not help the user if it is incorrect.
>
> For example, if both "cat" and "eat" were in the word list, and someone
> wrote down "eat", but later mis-translated it and put "cat" back into
> the translator, the result would be a checksum error; "cat" is a different
> number, so the checksum would fail.
>
> As it currently stands, "cat" would not be a valid word ("eat" is the real
> word, and no other number is "cat"), so the translator can throw a
> different error which is more helpful (i.e. "'cat' isn't a valid word
> choice"), but still doesn't get the user to the proper translation.
>
> What about if the wordlist included those "words that are so similar to
> each other that we only kept one of them" and had them all refer to the
> same number? I propose the wordlist have the possibility of multiple words
> on a single line, with the first word on the line being the "primary" or
> "real" word to be used, with the other similar words being included so that
> a translation program, if it wanted to assist the user, could fix their
> input for them (verbosely or not), along the lines of "'cat' isn't a valid
> word choice; assuming you meant 'eat', which is valid". You might still hit
> a checksum error if that similar word is still the wrong word, but as it
> stands now, I know you culled a bunch of words from the wordlist as "too
> similar", but if I want to try and help the user fix a bad input, I need to
> write a translation program with a full English dictionary alongside the
> BIP39 dictionary.
>
> I'd be willing to create a pull request for such an update, but before I
> delve into that, does this sound like a good idea? I could see it devolving
> into a slippery slope if every number in the 2048 set had a dozen word
> variations (misspellings, similar words, slang terms for the real word,
> etc.) which could get confusing of how similar is similar enough to be
> added as an alternate, and the standard would need to be clear that when
> translating binary to words, you only use the "main" word for that row, not
> any of the variations.
>
> MidnightLightning
>
>
> > I've just pushed updated wordlist which is filtered to similar
> > characters taken from this matrix.
> > BIP39 now consider following character pairs as similar:
> >         similar = (
> >             ('a', 'c'), ('a', 'e'), ('a', 'o'),
> >             ('b', 'd'), ('b', 'h'), ('b', 'p'), ('b', 'q'), ('b', 'r'),
> >             ('c', 'e'), ('c', 'g'), ('c', 'n'), ('c', 'o'), ('c', 'q'), ('c', 'u'),
> >             ('d', 'g'), ('d', 'h'), ('d', 'o'), ('d', 'p'), ('d', 'q'),
> >             ('e', 'f'), ('e', 'o'),
> >             ('f', 'i'), ('f', 'j'), ('f', 'l'), ('f', 'p'), ('f', 't'),
> >             ('g', 'j'), ('g', 'o'), ('g', 'p'), ('g', 'q'), ('g', 'y'),
> >             ('h', 'k'), ('h', 'l'), ('h', 'm'), ('h', 'n'), ('h', 'r'),
> >             ('i', 'j'), ('i', 'l'), ('i', 't'), ('i', 'y'),
> >             ('j', 'l'), ('j', 'p'), ('j', 'q'), ('j', 'y'),
> >             ('k', 'x'),
> >             ('l', 't'),
> >             ('m', 'n'), ('m', 'w'),
> >             ('n', 'u'), ('n', 'z'),
> >             ('o', 'p'), ('o', 'q'), ('o', 'u'), ('o', 'v'),
> >             ('p', 'q'), ('p', 'r'),
> >             ('q', 'y'),
> >             ('s', 'z'),
> >             ('u', 'v'), ('u', 'w'), ('u', 'y'),
> >             ('v', 'w'), ('v', 'y')
> >         )
> > Feel free to review and comment current wordlist, but I think we're
> > slowly moving forward final list.
> > slush
>
> _______________________________________________
> Bitcoin-development mailing list
> Bitcoin-development@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/bitcoin-development

--f46d044402a274e2de04ea261c61--