MIME-Version: 1.0
In-Reply-To: <oCNGbVElAQCJ1bEmwLXLzIVec0ZoOA2Ar3vkOc1a0GW12h78bhMi_W4n3pCdDt7hJyPFoMRb0U1T5Wx5uQl4oo6zeQtjKs0MdAXGtvLw1SQ=@protonmail.com>
References: <CAPweEDx4wH_PG8=wqLgM_+RfTQEUSGfax=SOkgTZhe1FagXF9g@mail.gmail.com>
 <oCNGbVElAQCJ1bEmwLXLzIVec0ZoOA2Ar3vkOc1a0GW12h78bhMi_W4n3pCdDt7hJyPFoMRb0U1T5Wx5uQl4oo6zeQtjKs0MdAXGtvLw1SQ=@protonmail.com>
From: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date: Wed, 3 Feb 2021 13:24:13 +0000
Message-ID: <CAPweEDy7Xf3nD1mfyX5MmtsGX=1sd5=gsLosZ=bYavJ0BZyy3g@mail.gmail.com>
To: ZmnSCPxj <ZmnSCPxj@protonmail.com>
Content-Type: multipart/alternative; boundary="000000000000da8d8405ba6e8095"
Cc: Bitcoin Protocol Discussion <bitcoin-dev@lists.linuxfoundation.org>
Subject: Re: [bitcoin-dev] Libre/Open blockchain / cryptographic ASICs
Precedence: list

--000000000000da8d8405ba6e8095
Content-Type: text/plain; charset="UTF-8"

(hi folks do cc me, i am subscribed digest, thank you for doing that,
ZmnSCPxj)

On Wednesday, February 3, 2021, ZmnSCPxj <ZmnSCPxj@protonmail.com> wrote:
> Good morning Luke,
>
> I happen to have experience designing digital ASICs, mostly pipelined
data processing.
> However my experience is limited to larger geometries and in
SystemVerilog.

larger geometries for a hardware wallet ASIC is ok (as long as it is not
retail based and trying to run e.g. RSA, taking so long to complete that
the retail customer walks out)

> On the technical side, as I understand it (I have been out of that
industry for 4 years now, so my knowledge may be obsolete)

not at all! still very valuable

> as you approach lower geometries, you also start approaching analog
design.

yyeah i could intuitively tell/guess there might be something like this
which would throw a spanner in the works, it is why the grant request i put
in specifically excluded data-dependent constant time analysis and also
power analysis.


> In our case we were already manually laying out gates and flip-flops (or
replacing flip-flops with level-triggered latches and being extra careful
with clocks) to squeeze performance (and area) ...

ya-howw :)


> Many of the formal correctness proofs were really about the formal
equivalence of the netlist to the RTL; the correctness of the RTL was
"proved" by simulation testing.

thanks to Symbiyosys we are using formal proofs much more extensively, as
effectively a 100% coverage replacement for unit tests.

an example is popcount.  we did two versions.  one is a recursive tree
algorithm, almost impossible to read and understand what the hell it does.

the other is a total braindead 1-liner "x = x + input[i]", rubbish
performance though.

running a formal proof on these gave us 100% confidence that the complex
optimised version does the damn job.


yes we still do unit tests, these are more "demo code".

now, the caveat is that you have to have a model of the "dut" (device under
test) against which to compare, and if the dut is ridiculously complex then
the formal model variant, which has to do the same job, ends up equally as
complex (or effectively a duplicate of the dut) and the exercise is a bit
of a waste of time...

...*unless*... there happens to be other implementations out there.  then
the proof can be run against those and everybody wins through collaboration.


now, here's why i put in the NLnet Grant request to explore going back to
the mathematics of crypto-primitives.

many ISAs e.g. intel AVX2 have added GFMULT8 etc etc because that does
S-Boxes for Rijndael.  they have gone mad by analysing algorithms trying to
fit them to standard ISAs.

nobody does Rijndael S-Boxes any way other than 256-entry lookup tables
because no standard ISA has general-purpose Galois Field Multiply.

consequently implementations in assembler get completely divorced from the
original mathematics on which the cryptographic algorithm was based.

the approach i would like to take is, "hang on a minute: how far would you
get if you actually added *general-purpose* instructions that *directly*
provided the underlying mathematical principles, and then wrapped a
Vector-Matrix Engine around them?".

would this drastically simplify algorithms to the point where *READABLE* c
code compiles directly to opcodes that run screamingly fast, outperforming
hand-optimised SIMD code using standard ISAs?

then, given the Formal Correctness approach above, can we verify that the
mathematically-related opcodes do the job?


> (to be fair, there were tools to force you to improve coverage by
injecting faults to your RTL, e.g. it would virtually flip an `&&` to an
`||` and if none of your tests signaled an error it would complain that
your test coverage sucked.)

nice!

> Things might have changed.

nah.  this is such a complex area, run by few incumbent players, that
innovation is rare.  not least, innovation is different and cannot be
trusted by the Foundries!


> A good RTL would embed SystemVerilog Assertions or PSL Assertions as well.
> Some formal verification tools can understand a subset of SystemVerilog
Assertions / PSL assertions and validate that your RTL conformed to the
assertions, which would probably help cut down on the need for RTL
simulation.

interesting.

> Overall, my understanding is that smaller geometries are needed only if
you want to target a really high performance / unit cost and performance /
energy consumption ratios.
> That is, you would target smaller geometries for mining.

yes.

> If you need a secure tr\*sted computing module that does not need to be
fast or cheap, just very accurate to the required specification, the larger
geometries should be fine and you would be able to live almost entirely in
RTL-land without diving into netlist and layout specifications.

hardware wallet ASICs.

i concur.

> A wrinkle here is that licenses for tools from tr\*sted vendors like
Synopsys or Cadence are ***expensive***.

yes they are :)  we are currently working with Sorbonne University LIP6.fr
and Staf Verhaegen from Chips4Makers, trying a different approach:
coriolis2.

this will do fine up to 130nm (skywater).  beyond that, mmm, we need a few
more years.

> What is more, you should really buy two sets of licenses, e.g. do logic
synthesis with Synopsys and then formal verification with Cadence, because
you do not want to fully tr\*st just one vendor.

interesting, good advice.

> Synthesis in particular is a black box and each vendor keeps their
particular implementations and tricks secret.

sigh.  i think that's partly because they have to insert diodes, and
buffers, and generally mess with the netlist.

i was stunned to learn that in a 28nm ASIC, 50% of it is repeater-buffers!

plus, they make an awful lot of money, it is good business.

> Pointing some funding at the open-source Icarus Verilog might also fit,
as it lost its ability to do synthesis more than a decade ago due to
inability to maintain.

ah i didn't know it could do synthesis at all! i thought it was simulation
only.

> Note as well that I heard (at the time when I was in the industry) that
some foundries will not even accept a netlist unless it was created by a
synthesis tool from one of the major vendors (Synopsys, Cadence, Mentor
Graphics, maybe more I have forgotten since).

yes i heard this too, they don't want their time wasted: after all they
only make money by selling wafers, and if they can't sell any they have to
run empty wafers to keep the equipment at operating temperature.

if you book a slot 18 months in advance and the RTL doesn't work during
testing 3 months before the deadline they may not be able to find someone
else in time.

anything to reduce the risk there is good, so i totally get why.

thank you for the insights and the discussion, really appreciated.

l.


-- 
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68

--000000000000da8d8405ba6e8095
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

(hi folks do cc me, i am subscribed digest, thank you for doing that, ZmnSC=
Pxj)<br><br>On Wednesday, February 3, 2021, ZmnSCPxj &lt;<a href=3D"mailto:=
ZmnSCPxj@protonmail.com">ZmnSCPxj@protonmail.com</a>&gt; wrote:<br>&gt; Goo=
d morning Luke,<br>&gt;<br>&gt; I happen to have experience designing digit=
al ASICs, mostly pipelined data processing.<br>&gt; However my experience i=
s limited to larger geometries and in SystemVerilog.<br><br>larger geometri=
es for a hardware wallet ASIC is ok (as long as it is not retail based and =
trying to run e.g. RSA, taking so long to complete that the retail customer=
 walks out)<br><br>&gt; On the technical side, as I understand it (I have b=
een out of that industry for 4 years now, so my knowledge may be obsolete)<=
br><br>not at all! still very valuable<br><br>&gt; as you approach lower ge=
ometries, you also start approaching analog design.<br><br>yyeah i could in=
tuitively tell/guess there might be something like this which would throw a=
 spanner in the works, it is why the grant request i put in specifically ex=
cluded data-dependent constant time analysis and also power analysis.<br><b=
r><br>&gt; In our case we were already manually laying out gates and flip-f=
lops (or replacing flip-flops with level-triggered latches and being extra =
careful with clocks) to squeeze performance (and area) ...<br><br>ya-howw :=
)<br><br><br>&gt; Many of the formal correctness proofs were really about t=
he formal equivalence of the netlist to the RTL; the correctness of the RTL=
 was &quot;proved&quot; by simulation testing.<br><br>thanks to Symbiyosys =
we are using formal proofs much more extensively, as effectively a 100% cov=
erage replacement for unit tests.<br><br>an example is popcount. =C2=A0we d=
id two versions. =C2=A0one is a recursive tree algorithm, almost impossible=
 to read and understand what the hell it does.<br><br>the other is a total =
braindead 1-liner &quot;x =3D x + input[i]&quot;, rubbish performance thoug=
h.<br><br>running a formal proof on these gave us 100% confidence that the =
complex optimised version does the damn job.<br><br><br>yes we still do uni=
t tests, these are more &quot;demo code&quot;.<br><br>now, the caveat is th=
at you have to have a model of the &quot;dut&quot; (device under test) agai=
nst which to compare, and if the dut is ridiculously complex then the forma=
l model variant, which has to do the same job, ends up equally as complex (=
or effectively a duplicate of the dut) and the exercise is a bit of a waste=
 of time...<br><br>...*unless*... there happens to be other implementations=
 out there. =C2=A0then the proof can be run against those and everybody win=
s through collaboration.<br><br><br><br>now, here&#39;s why i put in the NL=
net Grant request to explore going back to the mathematics of crypto-primit=
ives.<br><br>many ISAs e.g. intel AVX2 have added GFMULT8 etc etc because t=
hat does S-Boxes for Rijndael. =C2=A0they have gone mad by analysing algori=
thms trying to fit them to standard ISAs.<br><br>nobody does Rijndael S-Box=
es any way other than 256-entry lookup tables because no standard ISA has g=
eneral-purpose Galois Field Multiply.<br><br>consequently implementations i=
n assembler get completely divorced from the original mathematics on which =
the cryptographic algorithm was based.<br><br>the approach i would like to =
take is, &quot;hang on a minute: how far would you get if you actually adde=
d *general-purpose* instructions that *directly* provided the underlying ma=
thematical principles, and then wrapped a Vector-Matrix Engine around them?=
&quot;.<br><br>would this drastically simplify algorithms to the point wher=
e *READABLE* c code compiles directly to opcodes that run screamingly fast,=
 outperforming hand-optimised SIMD code using standard ISAs? <br><br>then, =
given the Formal Correctness approach above, can we verify that the mathema=
tically-related opcodes do the job?<br><br><br>&gt; (to be fair, there were=
 tools to force you to improve coverage by injecting faults to your RTL, e.=
g. it would virtually flip an `&amp;&amp;` to an `||` and if none of your t=
ests signaled an error it would complain that your test coverage sucked.)<b=
r><br>nice!<br><br>&gt; Things might have changed.<br><br>nah. =C2=A0this i=
s such a complex area, run by few incumbent players, that innovation is rar=
e. =C2=A0not least, innovation is different and cannot be trusted by the Fo=
undries!<br><br><br>&gt; A good RTL would embed SystemVerilog Assertions or=
 PSL Assertions as well.<br>&gt; Some formal verification tools can underst=
and a subset of SystemVerilog Assertions / PSL assertions and validate that=
 your RTL conformed to the assertions, which would probably help cut down o=
n the need for RTL simulation.<br><br>interesting.<br><br>&gt; Overall, my =
understanding is that smaller geometries are needed only if you want to tar=
get a really high performance / unit cost and performance / energy consumpt=
ion ratios.<br>&gt; That is, you would target smaller geometries for mining=
.<br><br>yes.<br><br>&gt; If you need a secure tr\*sted computing module th=
at does not need to be fast or cheap, just very accurate to the required sp=
ecification, the larger geometries should be fine and you would be able to =
live almost entirely in RTL-land without diving into netlist and layout spe=
cifications.<br><br>hardware wallet ASICs.<br><br>i concur.<br><br>&gt; A w=
rinkle here is that licenses for tools from tr\*sted vendors like Synopsys =
or Cadence are ***expensive***.<br><br>yes they are :) =C2=A0we are current=
ly working with Sorbonne University LIP6.fr and Staf Verhaegen from Chips4M=
akers, trying a different approach: coriolis2.<br><br>this will do fine up =
to 130nm (skywater). =C2=A0beyond that, mmm, we need a few more years.<br><=
br>&gt; What is more, you should really buy two sets of licenses, e.g. do l=
ogic synthesis with Synopsys and then formal verification with Cadence, bec=
ause you do not want to fully tr\*st just one vendor.<br><br>interesting, g=
ood advice.<br><br>&gt; Synthesis in particular is a black box and each ven=
dor keeps their particular implementations and tricks secret.<br><br>sigh. =
=C2=A0i think that&#39;s partly because they have to insert diodes, and buf=
fers, and generally mess with the netlist.<br><br>i was stunned to learn th=
at in a 28nm ASIC, 50% of it is repeater-buffers!<br><br>plus, they make an=
 awful lot of money, it is good business.<br><br>&gt; Pointing some funding=
 at the open-source Icarus Verilog might also fit, as it lost its ability t=
o do synthesis more than a decade ago due to inability to maintain.<br><br>=
ah i didn&#39;t know it could do synthesis at all! i thought it was simulat=
ion only.<br><br>&gt; Note as well that I heard (at the time when I was in =
the industry) that some foundries will not even accept a netlist unless it =
was created by a synthesis tool from one of the major vendors (Synopsys, Ca=
dence, Mentor Graphics, maybe more I have forgotten since).<br><br>yes i he=
ard this too, they don&#39;t want their time wasted: after all they only ma=
ke money by selling wafers, and if they can&#39;t sell any they have to run=
 empty wafers to keep the equipment at operating temperature.<br><br>if you=
 book a slot 18 months in advance and the RTL doesn&#39;t work during testi=
ng 3 months before the deadline they may not be able to find someone else i=
n time.<br><br>anything to reduce the risk there is good, so i totally get =
why.<br><br>thank you for the insights and the discussion, really appreciat=
ed.<br><br>l.<br><br><br>-- <br>---<br>crowd-funded eco-conscious hardware:=
 <a href=3D"https://www.crowdsupply.com/eoma68" target=3D"_blank">https://w=
ww.crowdsupply.com/eoma68</a><br><br>

--000000000000da8d8405ba6e8095--