Received-SPF: pass (sog-mx-2.v43.ch3.sourceforge.com: domain of gmail.com
	designates 209.85.213.180 as permitted sender)
	client-ip=209.85.213.180;
	envelope-from=gustav.simonsson@gmail.com;
	helo=mail-ig0-f180.google.com; 
MIME-Version: 1.0
In-Reply-To: <CANEZrP1+=JY0RGEMvm9iL09L-tZAWqsSOOwFaroYUKkWumx+xg@mail.gmail.com>
References: <CANEZrP25N7W_MeZin_pyVQP5pC8bt5yqJzTXt_tN1P6kWb5i2w@mail.gmail.com>
	<0720C223-E9DD-4E76-AD6F-0308CA5B5289@gmail.com>
	<CAAS2fgTGDzPFDP=ii08VXcXYpWr2akYWxqJCNHW-ABuN=ESc8A@mail.gmail.com>
	<7E50E1D6-3A9F-419B-B01E-50C6DE044E0F@gmail.com>
	<CAAS2fgScLKgq8_V0oVpvP1gYAKxiyVNGVWA86XfecSmPqsMKUg@mail.gmail.com>
	<CANEZrP1+=JY0RGEMvm9iL09L-tZAWqsSOOwFaroYUKkWumx+xg@mail.gmail.com>
Date: Sat, 8 Mar 2014 20:29:15 +0100
Message-ID: <CANeYco-Zno1xAETTFoYA12K2TAqJN9u+ttEEuNgATjcLuUek+w@mail.gmail.com>
From: Gustav Simonsson <gustav.simonsson@gmail.com>
To: Mike Hearn <mike@plan99.net>
Content-Type: multipart/alternative; boundary=047d7bdc0cf4cbf37904f41d628e
Cc: Bitcoin Development <bitcoin-development@lists.sourceforge.net>
Subject: Re: [Bitcoin-development] New side channel attack that can recover
 Bitcoin keys
Precedence: list

--047d7bdc0cf4cbf37904f41d628e
Content-Type: text/plain; charset=ISO-8859-1

While there is no mention of virtualization in the side-channel article,
the FLUSH+RELOAD paper [1] mentions virtualization and claims the clflush
instruction works not only towards processes on the same OS, but also
against processes in a separate guest OS if executed on the host OS (type 2
hypervisor) [2]. It also works if executed from within another guest OS
(though that reduces the efficiency of the attack) [3].

Both the authors [4] and Vulnerability Note VU#976534 [5] claim disabling
hypervisor memory page de-duplication prevents the attack. This could
perhaps be a first step for bitcoin companies running their software on
shared hosts; demand their allocated instances to be on hosts with this
disabled. Question is how common it is for virtualization providers to
offer that as an option.

TRESOR is is only applicable if running in a non-virtualized OS [6].

While TRESOR only implements AES, it seems it could work for ECDSA as well,
as they use the four x86 debug registers to fit a 256 bit privkey [7] for
the entire machine uptime, and then use other registers when doing the
actual AES ops. They use the Intel AES-NI instruction set though, and since
there is no corresponding instruction set for EC extra work would be
required to manually implement EC math in assembler.

They actually do what Mike Hearn mentioned and disable preemption in Linux
(their code runs in kernel space; they patched the kernel) to ensure
atomicity. Not only do they manage to protect against memory attacks (and
RAM/cache timing attacks) from other processes running on the same host,
but even from root on the same host (from userland, the debug registers are
only accessible through ptrace, which they patched, and they also disabled
LKM & KMEM).

One could imagine different levels of TRESOR-like ECDSA with different
tradeoffs of complexity vs security. For example, if one is fine with
keeping the privkey(s) in RAM but want to avoid cache timing attacks, the
signing could be implemented as a userspace program holding key(s) in RAM
together with a kernel module providing a syscall for signing. Signing is
then run with preemption using only x86 registers for intermediate data and
then using e.g. movntps [8] to write to RAM without data being cached. The
benefit of this compared with the full TRESOR approach is that it would not
require a patched kernel, only a kernel module. It would also be simpler to
implement compared to keeping the privkey in the debug registers for the
entire machine uptime, especially if multiple privkeys are used. It would
not protect against root though, since an adversary getting root could load
their own kernel module and read the registers.

To handle multiple keys (maybe as one-time-use) and get full TRESOR
benefits, one could perhaps (with the original TRESOR approach, i.e. with
patched kernel) store a BIP 0032 starting string / seed + counter in the
debug registers and have the atomic kernel code generate new keys and do
the signing.

Cheers,
Gustav Simonsson

1. http://eprint.iacr.org/2013/448.pdf
2. Page 1 of [1]
3. page 5 of [1]
4. page 8 (end of conclusions section) of [1]
5. http://www.kb.cert.org/vuls/id/976534
6. page 8, "3.2 Hardware compatibility",
https://www.usenix.org/legacy/event/sec11/tech/full_papers/Muller.pdf
7. page 3, "2.2 Key Management" of [6]
8. page 1041 of
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf


On Thu, Mar 6, 2014 at 9:38 AM, Mike Hearn <mike@plan99.net> wrote:

> I'm wondering about whether (don't laugh) moving signing into the kernel
> and then using the MTRRs to disable caching entirely for a small scratch
> region of memory would also work. You could then disable pre-emption and
> prevent anything on the same core from interrupting or timing the signing
> operation.
>
> However I suspect just making a hardened secp256k1 signer implementation
> in userspace would be of similar difficulty, in which case it  would
> naturally be preferable.
>
>
> On Wed, Mar 5, 2014 at 11:25 PM, Gregory Maxwell <gmaxwell@gmail.com>wrote:
>
>> On Wed, Mar 5, 2014 at 2:14 PM, Eric Lombrozo <elombrozo@gmail.com>
>> wrote:
>> > Everything you say is true.
>> >
>> > However, branchless does reduce the attack surface considerably - if
>> nothing else, it significantly ups the difficulty of an attack for a
>> relatively low cost in program complexity, and that might still make it
>> worth doing.
>>
>> Absolutely. I believe these things are worth doing.
>>
>> My comment on it being insufficient was only that "my signer is
>> branchless" doesn't make other defense measures (avoiding reuse,
>> multsig with multiple devices, not sharing hardware, etc.)
>> unimportant.
>>
>> > As for uniform memory access, if we avoided any kind of heap
>> allocation, wouldn't we avoid such issues?
>>
>> No. At a minimum to hide a memory timing side-channel you must perform
>> no data dependent loads (e.g. no operation where an offset into memory
>> is calculated). A strategy for this is to always load the same values,
>> but then mask out the ones you didn't intend to read... even that I'd
>> worry about on sufficiently advanced hardware, since I would very much
>> not be surprised if the processor was able to determine that the load
>> had no effect and eliminate it! :) )
>>
>> Maybe in practice if your data dependencies end up only picking around
>> in the same cache-line it doesn't actually matter... but it's hard to
>> be sure, and unclear when a future optimization in the rest of the
>> system might leave it exposed again.
>>
>> (In particular, you can't generally write timing sign-channel immune
>> code in C (or other high level language) because the compiler is
>> freely permitted to optimize things in a way that break the property.
>> ... It may be _unlikely_ for it to do this, but its permitted-- and
>> will actually do so in some cases--, so you cannot be completely sure
>> unless you check and freeze the toolchain)
>>
>> > Anyhow, without having gone into the full details of this particular
>> attack, it seems the main attack point is differences in how squaring and
>> multiplication (in the case of field exponentiation) or doubling and point
>> addition (in the case of ECDSA) are performed. I believe using a branchless
>> implementation where each phase of the operation executes the exact same
>> code and accesses the exact same stack frames would not be vulnerable to
>> FLUSH+RELOAD.
>>
>> I wouldn't be surprised.
>>
>>
>> ------------------------------------------------------------------------------
>> Subversion Kills Productivity. Get off Subversion & Make the Move to
>> Perforce.
>> With Perforce, you get hassle-free workflows. Merge that actually works.
>> Faster operations. Version large binaries.  Built-in WAN optimization and
>> the
>> freedom to use Git, Perforce or both. Make the move to Perforce.
>>
>> http://pubads.g.doubleclick.net/gampad/clk?id=122218951&iu=/4140/ostg.clktrk
>> _______________________________________________
>> Bitcoin-development mailing list
>> Bitcoin-development@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/bitcoin-development
>>
>
>
>
> ------------------------------------------------------------------------------
> Subversion Kills Productivity. Get off Subversion & Make the Move to
> Perforce.
> With Perforce, you get hassle-free workflows. Merge that actually works.
> Faster operations. Version large binaries.  Built-in WAN optimization and
> the
> freedom to use Git, Perforce or both. Make the move to Perforce.
>
> http://pubads.g.doubleclick.net/gampad/clk?id=122218951&iu=/4140/ostg.clktrk
> _______________________________________________
> Bitcoin-development mailing list
> Bitcoin-development@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/bitcoin-development
>
>

--047d7bdc0cf4cbf37904f41d628e
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr">While there is no mention of virtualization in the side-ch=
annel article, the FLUSH+RELOAD paper [1] mentions virtualization and claim=
s the clflush instruction works not only towards processes on the same OS, =
but also against processes in a separate guest OS if executed on the host O=
S (type 2 hypervisor) [2]. It also works if executed from within another gu=
est OS (though that reduces the efficiency of the attack) [3].<br>
<br>Both the authors [4] and Vulnerability Note VU#976534 [5] claim disabli=
ng hypervisor memory page de-duplication prevents the attack. This could pe=
rhaps be a first step for bitcoin companies running their software on share=
d hosts; demand their allocated instances to be on hosts with this disabled=
. Question is how common it is for virtualization providers to offer that a=
s an option.<br>
<br>TRESOR is is only applicable if running in a non-virtualized OS [6].<br=
><br>While TRESOR only implements AES, it seems it could work for ECDSA as =
well, as they use the four x86 debug registers to fit a 256 bit privkey [7]=
 for the entire machine uptime, and then use other registers when doing the=
 actual AES ops. They use the Intel AES-NI instruction set though, and sinc=
e there is no corresponding instruction set for EC extra work would be requ=
ired to manually implement EC math in assembler.<br>
<br>They actually do what Mike Hearn mentioned and disable preemption in Li=
nux (their code runs in kernel space; they patched the kernel) to ensure at=
omicity. Not only do they manage to protect against memory attacks (and RAM=
/cache timing attacks) from other processes running on the same host, but e=
ven from root on the same host (from userland, the debug registers are only=
 accessible through ptrace, which they patched, and they also disabled LKM =
&amp; KMEM).<br>
<br>One could imagine different levels of TRESOR-like ECDSA with different =
tradeoffs of complexity vs security. For example, if one is fine with keepi=
ng the privkey(s) in RAM but want to avoid cache timing attacks, the signin=
g could be implemented as a userspace program holding key(s) in RAM togethe=
r with a kernel module providing a syscall for signing. Signing is then run=
 with preemption using only x86 registers for intermediate data and then us=
ing e.g. movntps [8] to write to RAM without data being cached. The benefit=
 of this compared with the full TRESOR approach is that it would not requir=
e a patched kernel, only a kernel module. It would also be simpler to imple=
ment compared to keeping the privkey in the debug registers for the entire =
machine uptime, especially if multiple privkeys are used. It would not prot=
ect against root though, since an adversary getting root could load their o=
wn kernel module and read the registers.<br>
<br>To handle multiple keys (maybe as one-time-use) and get full TRESOR ben=
efits, one could perhaps (with the original TRESOR approach, i.e. with patc=
hed kernel) store a BIP 0032 starting string / seed + counter in the debug =
registers and have the atomic kernel code generate new keys and do the sign=
ing.<br>
<br>Cheers,<br>Gustav Simonsson<br><br>1. <a href=3D"http://eprint.iacr.org=
/2013/448.pdf">http://eprint.iacr.org/2013/448.pdf</a><br>2. Page 1 of [1]<=
br>3. page 5 of [1]<br>4. page 8 (end of conclusions section) of [1]<br>5. =
<a href=3D"http://www.kb.cert.org/vuls/id/976534">http://www.kb.cert.org/vu=
ls/id/976534</a><br>
6. page 8, &quot;3.2 Hardware compatibility&quot;, <a href=3D"https://www.u=
senix.org/legacy/event/sec11/tech/full_papers/Muller.pdf">https://www.useni=
x.org/legacy/event/sec11/tech/full_papers/Muller.pdf</a><br>7. page 3, &quo=
t;2.2 Key Management&quot; of [6]<br>
8. page 1041 of <a href=3D"http://www.intel.com/content/dam/www/public/us/e=
n/documents/manuals/64-ia-32-architectures-software-developer-manual-325462=
.pdf">http://www.intel.com/content/dam/www/public/us/en/documents/manuals/6=
4-ia-32-architectures-software-developer-manual-325462.pdf</a><br>
<br></div><div class=3D"gmail_extra"><br><br><div class=3D"gmail_quote">On =
Thu, Mar 6, 2014 at 9:38 AM, Mike Hearn <span dir=3D"ltr">&lt;<a href=3D"ma=
ilto:mike@plan99.net" target=3D"_blank">mike@plan99.net</a>&gt;</span> wrot=
e:<br>
<blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p=
x #ccc solid;padding-left:1ex"><div dir=3D"ltr">I&#39;m wondering about whe=
ther (don&#39;t laugh) moving signing into the kernel and then using the MT=
RRs to disable caching entirely for a small scratch region of memory would =
also work. You could then disable pre-emption and prevent anything on the s=
ame core from interrupting or timing the signing operation.<div>

<br></div><div>However I suspect just making a hardened secp256k1 signer im=
plementation in userspace would be of similar difficulty, in which case it =
&nbsp;would naturally be preferable.</div></div><div class=3D"HOEnZb"><div =
class=3D"h5">
<div class=3D"gmail_extra"><br>
<br><div class=3D"gmail_quote">On Wed, Mar 5, 2014 at 11:25 PM, Gregory Max=
well <span dir=3D"ltr">&lt;<a href=3D"mailto:gmaxwell@gmail.com" target=3D"=
_blank">gmaxwell@gmail.com</a>&gt;</span> wrote:<br><blockquote class=3D"gm=
ail_quote" style=3D"margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-le=
ft:1ex">

<div>On Wed, Mar 5, 2014 at 2:14 PM, Eric Lombrozo &lt;<a href=3D"mailto:el=
ombrozo@gmail.com" target=3D"_blank">elombrozo@gmail.com</a>&gt; wrote:<br>
&gt; Everything you say is true.<br>
&gt;<br>
&gt; However, branchless does reduce the attack surface considerably - if n=
othing else, it significantly ups the difficulty of an attack for a relativ=
ely low cost in program complexity, and that might still make it worth doin=
g.<br>


<br>
</div>Absolutely. I believe these things are worth doing.<br>
<br>
My comment on it being insufficient was only that &quot;my signer is<br>
branchless&quot; doesn&#39;t make other defense measures (avoiding reuse,<b=
r>
multsig with multiple devices, not sharing hardware, etc.)<br>
unimportant.<br>
<div><br>
&gt; As for uniform memory access, if we avoided any kind of heap allocatio=
n, wouldn&#39;t we avoid such issues?<br>
<br>
</div>No. At a minimum to hide a memory timing side-channel you must perfor=
m<br>
no data dependent loads (e.g. no operation where an offset into memory<br>
is calculated). A strategy for this is to always load the same values,<br>
but then mask out the ones you didn&#39;t intend to read... even that I&#39=
;d<br>
worry about on sufficiently advanced hardware, since I would very much<br>
not be surprised if the processor was able to determine that the load<br>
had no effect and eliminate it! :) )<br>
<br>
Maybe in practice if your data dependencies end up only picking around<br>
in the same cache-line it doesn&#39;t actually matter... but it&#39;s hard =
to<br>
be sure, and unclear when a future optimization in the rest of the<br>
system might leave it exposed again.<br>
<br>
(In particular, you can&#39;t generally write timing sign-channel immune<br=
>
code in C (or other high level language) because the compiler is<br>
freely permitted to optimize things in a way that break the property.<br>
... It may be _unlikely_ for it to do this, but its permitted&mdash; and<br=
>
will actually do so in some cases&mdash;, so you cannot be completely sure<=
br>
unless you check and freeze the toolchain)<br>
<div><br>
&gt; Anyhow, without having gone into the full details of this particular a=
ttack, it seems the main attack point is differences in how squaring and mu=
ltiplication (in the case of field exponentiation) or doubling and point ad=
dition (in the case of ECDSA) are performed. I believe using a branchless i=
mplementation where each phase of the operation executes the exact same cod=
e and accesses the exact same stack frames would not be vulnerable to FLUSH=
+RELOAD.<br>


<br>
</div>I wouldn&#39;t be surprised.<br>
<div><div><br>
---------------------------------------------------------------------------=
---<br>
Subversion Kills Productivity. Get off Subversion &amp; Make the Move to Pe=
rforce.<br>
With Perforce, you get hassle-free workflows. Merge that actually works.<br=
>
Faster operations. Version large binaries. &nbsp;Built-in WAN optimization =
and the<br>
freedom to use Git, Perforce or both. Make the move to Perforce.<br>
<a href=3D"http://pubads.g.doubleclick.net/gampad/clk?id=3D122218951&amp;iu=
=3D/4140/ostg.clktrk" target=3D"_blank">http://pubads.g.doubleclick.net/gam=
pad/clk?id=3D122218951&amp;iu=3D/4140/ostg.clktrk</a><br>
_______________________________________________<br>
Bitcoin-development mailing list<br>
<a href=3D"mailto:Bitcoin-development@lists.sourceforge.net" target=3D"_bla=
nk">Bitcoin-development@lists.sourceforge.net</a><br>
<a href=3D"https://lists.sourceforge.net/lists/listinfo/bitcoin-development=
" target=3D"_blank">https://lists.sourceforge.net/lists/listinfo/bitcoin-de=
velopment</a><br>
</div></div></blockquote></div><br></div>
</div></div><br>-----------------------------------------------------------=
-------------------<br>
Subversion Kills Productivity. Get off Subversion &amp; Make the Move to Pe=
rforce.<br>
With Perforce, you get hassle-free workflows. Merge that actually works.<br=
>
Faster operations. Version large binaries. &nbsp;Built-in WAN optimization =
and the<br>
freedom to use Git, Perforce or both. Make the move to Perforce.<br>
<a href=3D"http://pubads.g.doubleclick.net/gampad/clk?id=3D122218951&amp;iu=
=3D/4140/ostg.clktrk" target=3D"_blank">http://pubads.g.doubleclick.net/gam=
pad/clk?id=3D122218951&amp;iu=3D/4140/ostg.clktrk</a><br>__________________=
_____________________________<br>

Bitcoin-development mailing list<br>
<a href=3D"mailto:Bitcoin-development@lists.sourceforge.net">Bitcoin-develo=
pment@lists.sourceforge.net</a><br>
<a href=3D"https://lists.sourceforge.net/lists/listinfo/bitcoin-development=
" target=3D"_blank">https://lists.sourceforge.net/lists/listinfo/bitcoin-de=
velopment</a><br>
<br></blockquote></div><br></div>

--047d7bdc0cf4cbf37904f41d628e--