Date: Sun, 14 Feb 2021 00:27:36 +0000
To: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
From: ZmnSCPxj <ZmnSCPxj@protonmail.com>
Reply-To: ZmnSCPxj <ZmnSCPxj@protonmail.com>
Message-ID: <PrPIKz5dz72OO3dKSJS0Qjp-Gk3WYG4uA75UnK2Bze1H1XvSQOsCfDOBlPwG5UtOql8W6yv_cDWLsIFjJj60QunxD_NB__6HU7uOs1cxQMc=@protonmail.com>
In-Reply-To: <CAPweEDz0AsvcbYnS2o3KL6snvUV67JpFawruq0gpcWwcTc4npQ@mail.gmail.com>
References: <CAPweEDx4wH_PG8=wqLgM_+RfTQEUSGfax=SOkgTZhe1FagXF9g@mail.gmail.com>
 <oCNGbVElAQCJ1bEmwLXLzIVec0ZoOA2Ar3vkOc1a0GW12h78bhMi_W4n3pCdDt7hJyPFoMRb0U1T5Wx5uQl4oo6zeQtjKs0MdAXGtvLw1SQ=@protonmail.com>
 <CAPweEDy7Xf3nD1mfyX5MmtsGX=1sd5=gsLosZ=bYavJ0BZyy3g@mail.gmail.com>
 <puUth0RIvY16I3ghjUiTkIPJQEKETPLZrm2QiiELW8AheIGIin29u5RkztTXIeYIK0xg2UIbsx6m-TpkJU2BvmVyYYr_BYbCdIQSk2t7TkU=@protonmail.com>
 <CAPweEDz0AsvcbYnS2o3KL6snvUV67JpFawruq0gpcWwcTc4npQ@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Cc: Bitcoin Protocol Discussion <bitcoin-dev@lists.linuxfoundation.org>,
 Libre-Soc General Development <libre-soc-dev@lists.libre-soc.org>
Subject: Re: [bitcoin-dev] Libre/Open blockchain / cryptographic ASICs
Precedence: list

Good morning Luke,

> > Another point to ponder is test modes.
> > In mass production you need test modes.
>
> > (Sure, an attacker can try targeted ESD at the `TESTMODE` flip-flop repeatedly, but this risks also flipping other scan flip-flops that contain the data that is being extracted, so this might be sufficient protection in practice.)
>
> if however the ASIC can be flipped into TESTMODE and yet it carries on
> otherwise working, an algorithm can be re-run and the exposed data
> will be clean.

But in most testmodes I have seen (and designed) all clocks are driven externally from a different pin (usually the serial interface) when in testmode.
If the CPU clock is now controlled by the attacker, how do you run any kind of algorithm?

(This could be an artifact of how my old design company designed testmodes, YMMV.)

Really the concern here is that if testmode is entered while the CPU has key material loaded into registers or caches, and those registers/caches are in the scan chain, then it is possible to exfiltrate data.
It does not matter if the chip is now in a mode that cannot execute algorithms: if it was doing any kind of computation involving privkeys (including say deriving its public key so that PC-side hardware can get the `xpub`) then key material may be in scan chain registers, the clock is now controlled by the attacker, and possibly scan mode as well (which disables combinational circuitry, thus none of your algorithms can run).
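
As a rough sketch of that concern, here is a toy Python model of a scan chain.
It is not any real chip or the design I worked on; `ScanChain`, `load_functional`, `scan_shift` and `exfiltrate` are made-up names, and a 16-bit register stands in for key material.

```python
# Toy model of a scan chain; all names here are hypothetical.

class ScanChain:
    """Flip-flops that become one long shift register in scan mode."""

    def __init__(self, width):
        self.flops = [0] * width

    def load_functional(self, value):
        # Normal operation: the CPU writes a value (e.g. key material)
        # into the registers, most significant bit first.
        width = len(self.flops)
        self.flops = [(value >> (width - 1 - i)) & 1 for i in range(width)]

    def scan_shift(self, scan_in=0):
        # Scan mode: one externally driven clock pulse shifts the chain
        # by one position and exposes the last flop on the scan-out pin.
        scan_out = self.flops[-1]
        self.flops = [scan_in] + self.flops[:-1]
        return scan_out


def exfiltrate(chain):
    # The attacker owns the clock, so simply pulses it once per flop.
    bits = [chain.scan_shift() for _ in range(len(chain.flops))]
    # Bits come out last-flop-first, i.e. least significant bit first.
    return sum(bit << i for i, bit in enumerate(bits))
```

If `chain.load_functional(0xBEEF)` happened just before testmode was entered, `exfiltrate(chain)` recovers `0xBEEF` without the CPU executing a single instruction.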

>
> > If you are really going to open-source the hardware design then the layout
> > is also open and attackers can probably target specific chip area for ESD
> > pulse to try a flip-flop upset, so you need to be extra careful.
>
> this is extremely valuable advice. in the followup [1] you describe a
> gating method: this we have already deployed on a couple of places in
> case the Libre Cell Library (also being developed at the same time by
> Staf Verhaegen of Chips4Makers) causes errors: we do not want, for
> example, an error in a Cell Library to cause a permanent HI which
> locks us from being able to perform testing of other areas of the
> ASIC.
>
> the idea of being able to actually randomly flip bits inside an ASIC
> from outside is both hilarious and entirely news to me, yet it sounds
> to be exactly the kind of thing that would allow an attacker to
> compromise a hardware wallet. potentially destructively, mind, but
> compromise all the same.

Certainly, outside of the old company, I have seen many experts strongly protest against a design philosophy which assumes that any flip-flop could randomly switch.

Yet the design philosophy within the old company always had this assumption, supposedly (according to in-company lore) because previous engineers had actually found the hard way that random bitflips did occur, and for e.g. automobile chips the risk was too great to not have strong mitigations:

* State machines had to force unused states into known states.
  For example a state machine with 3 states needs 2 bits of state, but 2 bits of state is actually 4 states, so there is a 4th unused state.
  * Not all state machines needed this rule but during planning we had to identify state machines that needed this rule, and often we just targeted having 2^n states just to ensure that there were no unused states.
  * I even suggested the use of ECC encoding for important state machines and it was something being investigated at the time I left.
* State machines that otherwise did not need the above rule were strongly encouraged to clear state at display frame vsync.
  This ensured that any unexpected states they had would only last up to one display frame, which was considered acceptable.
* Flip-flops that held settings were periodically reloaded at each display frame vsync from a flash memory (which apparently is a lot more immune to bitflips).
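
To illustrate the first rule, a minimal Python sketch of the 3-states-in-2-bits example.
The state names and transition table are invented for illustration, and the "stays stuck" behaviour of the unprotected version is an assumption: in real hardware what an unused state does depends on how the logic was synthesized.

```python
# Hypothetical 3-state machine stored in 2 bits; encoding 0b11 is unused.
IDLE, LOAD, RUN = 0b00, 0b01, 0b10
NEXT_STATE = {IDLE: LOAD, LOAD: RUN, RUN: IDLE}


def step_unprotected(state):
    # No rule for 0b11: in this model an upset into it is never left.
    return NEXT_STATE.get(state, state)


def step_protected(state):
    # The unused encoding is forced back into a known state.
    return NEXT_STATE.get(state, IDLE)


def run(step, state, cycles):
    # Clock the state machine for the given number of cycles.
    for _ in range(cycles):
        state = step(state)
    return state
```

A single bitflip in `RUN` (`RUN ^ 0b01`) gives `0b11`.
`run(step_unprotected, 0b11, 10)` is still `0b11` after ten clocks, while `run(step_protected, 0b11, 1)` is back at `IDLE` after one.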

It could be an artifact as well that the company had its own in-house foundry rather than delegate out to TSMC or whatnot --- maybe the technology we had was just suckier than state-of-the-art so bitflips were more common.

The reason why this stuck in my mind is because at one time we had a DS test where shooting the ESD gun could sometimes cause the chip to fail (blank display) until reset, when the expectation was that at most it would flicker for one display frame.
And afterwards we had to go through the entire RTL looking for which state machine or settings register was the culprit.
I even wrote a little Verilog-PLI plugin that would inject deterministically random data into flip-flops in the model to try to catch it.
Eventually we found a bunch of possible root causes, and on the next DS iteration testing we had fun shooting the chip with the ESD gun over and over again and sighing in relief that the display was not failing for more than one frame.
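
The idea behind that kind of injection can be sketched in Python.
This is not the PLI plugin itself: `inject_upsets` and the flip-flop names are hypothetical, and flipping single bits (rather than writing arbitrary random data) is a simplifying assumption.

```python
import random


def inject_upsets(flops, seed, count):
    """Flip `count` distinct flip-flops chosen by a seeded PRNG.

    `flops` maps a flip-flop name to its current bit.  The same seed
    always picks the same flip-flops, so a failing run can be replayed.
    """
    rng = random.Random(seed)
    # Sort the names so the choice does not depend on dict ordering.
    victims = rng.sample(sorted(flops), count)
    upset = dict(flops)  # leave the caller's model state untouched
    for name in victims:
        upset[name] ^= 1
    return upset, victims
```

Because the PRNG is seeded, a seed that makes the simulated display go blank can be re-run as often as needed while narrowing down which flip-flop is responsible.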

The chip was a display driver for automotive, apparently at the time cars were starting to transition to using LCDs for things like speedometer and accelerometer rather than physical dials.
And of course the display suddenly switching off while the car is running at high speed due to some extra-powerful pulse elsewhere was potentially dangerous and could distract the driver, so that is why we were paranoid about such sudden bitflips potentially leading to such a massive cascade of upsets as to make the display fail permanently.

I think being excessively cautious for cryptographic chips should be standard as well.
And certainly test mode exfiltration of data is always an issue; JTAG is a very standard way of reading memory.

Regards,
ZmnSCPxj