binary-transparency

Contours for binary transparency

Mustafa

https://twitter.com/kanzure/status/1014167797205815297

I am going to talk about binary transparency. What is it? Let's suppose that you have an Android phone or an iPhone and you download some software from the App Store or Google Play. How do you know that the apk or the software that you're being given is the same piece of software that is being given to everyone else, and that Google or Apple hasn't specifically given you a bad version of that software because they were threatened or compelled to do so by a court order, or because their keys were compromised?

These kinds of problems have happened before. In 2012, there was a case of NSA malware called Flame that used a rogue Microsoft binary signing certificate to infect users via Windows update. Microsoft didn't know what was being signed because there was no binary transparency there.

In 2016, there was the famous case where the FBI wanted to unlock or decrypt the iPhone of a suspected terrorist. The FBI got a court order against Apple to make them sign a backdoored version of iOS that they could put on to the iPhone, which would bypass the passcode ratelimiting feature so that the FBI could make thousands of guesses per second to decrypt that person's iPhone.

Binary transparency is not to be confused with reproducible builds, which solve a different problem: how do you make sure that the binary was compiled from certain source code? Binary transparency is rather: how do you know that this binary is the same one that everyone else is being given? And if it's not the same one, then you need some assurance that this is public, so that everyone knows about it. If Apple were to give you a backdoored version of that binary because they were compelled by the FBI, then that should be transparent and everyone should know about it, so that it can't be done in secret.

To understand why this is a good use of bitcoin's proof-of-work, and why you can't just use a merkle tree, I am going to talk about what a basic model of binary transparency might look like.

People have been talking about using merkle trees to create verifiable logs since the 90s. You would have an append-only log built on a merkle tree or hashchain. Some company that wants to be audited would write its actions to that log, and then the end user who wants to audit the log would read it, or get an inclusion proof of certain things in that log. Many different systems for this have been proposed in the past few years.
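To make the log idea concrete, here is a minimal sketch of a merkle tree over log entries with an inclusion proof. It is an illustration under assumptions, not the construction any particular system uses: the function names, the `0x00`/`0x01` domain-separation prefixes, and the rule of promoting an unpaired node to the next level are all choices made for this example.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def leaf_hash(entry: bytes) -> bytes:
    # Prefix assumed for this sketch so a leaf can never be confused with an inner node.
    return _h(b"\x00" + entry)


def node_hash(left: bytes, right: bytes) -> bytes:
    return _h(b"\x01" + left + right)


def _next_level(level):
    # Pair up neighbours; an unpaired last node is promoted unchanged.
    out = []
    for i in range(0, len(level), 2):
        if i + 1 < len(level):
            out.append(node_hash(level[i], level[i + 1]))
        else:
            out.append(level[i])
    return out


def merkle_root(entries):
    if not entries:
        raise ValueError("log is empty")
    level = [leaf_hash(e) for e in entries]
    while len(level) > 1:
        level = _next_level(level)
    return level[0]


def inclusion_proof(entries, index):
    """Return a list of (sibling_hash, sibling_is_left) steps for entries[index]."""
    level = [leaf_hash(e) for e in entries]
    proof = []
    while len(level) > 1:
        sibling = index ^ 1
        if sibling < len(level):
            proof.append((level[sibling], sibling < index))
        level = _next_level(level)
        index //= 2
    return proof


def verify_inclusion(entry, proof, root):
    acc = leaf_hash(entry)
    for sibling, sibling_is_left in proof:
        acc = node_hash(sibling, acc) if sibling_is_left else node_hash(acc, sibling)
    return acc == root
```

The auditor only needs the root and a proof whose length grows with the logarithm of the log size, not the whole log. Note that nothing here stops the log maintainer from handing different roots to different people, which is the problem discussed next.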

What happens if the log is forked, and the person responsible for maintaining that log or ledger gives different people a different version of that log with different things inside of it? In the example of the Apple case, what would happen if someone forked their log and gave the other log to someone else? That's equivocation. We all know how bitcoin deals with this, using proof-of-work to make sure there is network consensus on what the actual ledger or blockchain is.

There's retroactive transparency (certificate transparency does this) and then proactive transparency like bitcoin and byzantine fault tolerance.

Certificate transparency is a system created by Google for use in Chrome to deal with the problem of certificate authorities signing rogue certificates. If you want to have an SSL certificate accepted by Google Chrome as valid, then your certificate authority has to make sure it submits every certificate to a log server, which uses an append-only merkle tree, so that every single certificate ever signed by a certificate authority is actually transparent and everyone knows about it. If some certificate authority signs a certificate for google.com, then everyone is supposed to know about it.

There's a method to deal with equivocation in retroactive transparency, called gossiping. There's no consensus mechanism or proof-of-work. There's a central log server created by Google, and some other certificate authorities run some as well. The way they deal with the case where someone might fork the log and give different views of the log is through gossiping. Auditors gossip with each other about the different merkle roots that they have been receiving for the logs. If they compare with each other and realize the log server was being dishonest and giving people different views of the log, then that would be detectable, and the log server would be identified as malicious. The key thing to recognize is that this does not prevent logs from being forked. This does not prevent log equivocation. A log server could still do it, but gossiping makes it possible to detect.
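The comparison step of gossiping can be sketched in a few lines. This is a simplified illustration, not how any deployed system does it: the tuple layout and the function name are hypothetical, and it only compares roots reported for the same tree size. Comparing views at different sizes would additionally need consistency proofs, which are left out here.

```python
from collections import defaultdict


def find_equivocation(observations):
    """Group gossiped tree heads and report conflicting ones.

    observations: iterable of (auditor, log_id, tree_size, root) tuples,
    a layout assumed for this sketch.
    Returns {(log_id, tree_size): {root: [auditors who saw it]}} for every
    log position where auditors were shown more than one root.
    """
    seen = defaultdict(dict)
    for auditor, log_id, tree_size, root in observations:
        seen[(log_id, tree_size)].setdefault(root, []).append(auditor)
    return {key: roots for key, roots in seen.items() if len(roots) > 1}
```

If the result is non-empty, the auditors hold two roots for the same log at the same size, which is evidence that the log server equivocated. The check runs after the fact, so it detects a fork rather than preventing it.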

A lot of people say, if you want transparency then just use certificate transparency without a blockchain. I don't think it's as simple as that. For transparency systems to be useful, you need a way to not trust the log server. At the moment, gossiping isn't really practical and hasn't been implemented yet, due to performance issues. Google Chrome right now is supposed to be checking for inclusion proofs, but gossiping hasn't been implemented, so right now the log server is completely trusted. In the context of binary transparency, retroactive transparency plus gossiping is infeasible, because if you're trying to update a binary that has root access, which are the most important binaries to keep transparent, it can disable the gossiping mechanism after you execute the binary. So once you execute that binary on your device, fraud from the log server would never be discovered. It's also specifically unsuitable for devices that have a low level of resources, or that can easily be eclipse attacked and prevented from gossiping to other nodes in the first place.

And then there's the other way of doing it, which is proactive transparency. The idea of proactive transparency is to make equivocation difficult to do in the first place, not simply to make it detectable. A basic way of doing that is to use a special signature scheme or just use a consensus mechanism for that. A bunch of people have to sign every single update to the log, for every single new merkle root in the log. But how do you select those people in a Sybil-resistant way? You would still need governance, which I would imagine that people here would want to avoid.

Bitcoin uses crypto-economic incentives. Creating a different view of the blockchain would be costly. You either have to do a 51% attack or an eclipse attack.

Suppose you wanted to build a system to do binary transparency using the bitcoin blockchain. Let's look at the threat model. You have services, like software and application developers, which provide applications to the end user. They create and issue the software. They might submit software to some software repository somewhere, which would be the authority. In this threat model, the authority is the party responsible for authenticating or approving the software, like the app store or the debian package system. Then you have the monitor, who is responsible for inspecting updates to the log to see if there might be something fishy there. If you go to the log and you see a hash of a binary there, and nobody has seen that binary distributed anywhere to anyone, then it might look like an obvious backdoor. And then you have auditors, which are responsible for checking that certain binaries are in the log on behalf of users that want to install those updates. In practice, auditors and users might be the same party: an auditor could be a piece of software on a user's computer.

We assume the authority is completely untrusted and could be compromised at any point. Their code signing keys might be compromised, or they might be compelled to sign something. The auditor and the user might be compromised after acting upon some log entry. For example, after verifying that an update is in the log and installing it, the update could disable the gossiping mechanism. Also, the local network for auditors and users is untrusted. You can't make any assumptions about the bitcoin nodes that you're communicating with: they might be malicious, and they might be hiding blocks from you that are being distributed to the rest of the network. They might be doing an eclipse attack where they feed you only specific blocks, which might have less PoW than the blocks on the rest of the network, to hide certain information or transactions from you.

Under this threat model, let's assume that the authority has some known bitcoin address that can be associated with its code signing certificate. This bitcoin address would act as a root-of-trust for that authority. It could be hardcoded in software or you could use some public-key infrastructure. It doesn't matter.

The service could send a binary to an authority to get it approved and accepted into some kind of app store. You could have an inclusion proof structure where the authority publishes transactions that contain a merkle root of a batch of updates, or a batch of binaries, that it wants to commit to. It can simply include that root in the OP_RETURN field of a transaction, or whatever your favorite way is to include arbitrary data in bitcoin transactions.
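As a rough sketch of what such a commitment could look like on the wire, the snippet below builds the output script for an OP_RETURN output that carries a 32-byte batch merkle root, and reads it back. The function names are hypothetical and the bare-root layout (no protocol tag in front of the root) is an assumption of this example, not something the talk specifies. Building, signing and broadcasting the surrounding transaction from the authority's address is out of scope here.

```python
from typing import Optional

OP_RETURN = 0x6A  # Bitcoin script opcode that marks an output as unspendable data


def commitment_script(batch_root: bytes) -> bytes:
    """Script bytes for an OP_RETURN output committing to a 32-byte batch root."""
    if len(batch_root) != 32:
        raise ValueError("expected a 32-byte merkle root")
    # For pushes of 1 to 75 bytes, the push opcode is simply the length byte.
    return bytes([OP_RETURN, len(batch_root)]) + batch_root


def extract_commitment(script: bytes) -> Optional[bytes]:
    """Return the committed root if the script has the layout above, else None."""
    if len(script) == 34 and script[0] == OP_RETURN and script[1] == 32:
        return script[2:]
    return None
```

A monitor or auditor that finds a transaction signed by the authority's address could run `extract_commitment` over its output scripts to recover the batch root.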

Now the client wants to install a software update. The authority distributes the software update, along with an inclusion proof that this binary was actually committed to in the bitcoin blockchain. The client, or the auditor, then validates that inclusion proof. To do that, the auditor has to run an SPV client, download all the bitcoin block headers, and verify the inclusion proof against them.
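The checks an SPV-style auditor performs can be sketched as follows: each 80-byte header must satisfy its own proof-of-work target, each header must point at the hash of the previous one, and the transaction carrying the commitment must hash up to the merkle root stored in a header. This is a simplified sketch under assumptions: the function names are made up for the example, difficulty retargeting rules are not checked, and `make_header` is a hypothetical helper that grinds a nonce against a very easy target just so the example can be run without real chain data.

```python
import hashlib
import struct


def dsha(data: bytes) -> bytes:
    """Double SHA-256, as used for bitcoin block and transaction hashes."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def bits_to_target(bits: int) -> int:
    """Expand the compact 'bits' header field into the full target (sign bit ignored)."""
    exponent = bits >> 24
    mantissa = bits & 0x007FFFFF
    if exponent <= 3:
        return mantissa >> (8 * (3 - exponent))
    return mantissa << (8 * (exponent - 3))


def header_ok(header: bytes) -> bool:
    """Check that a single 80-byte header meets the target it declares."""
    if len(header) != 80:
        return False
    bits = struct.unpack_from("<I", header, 72)[0]
    return int.from_bytes(dsha(header), "little") <= bits_to_target(bits)


def chain_ok(headers) -> bool:
    """Check proof-of-work and prev-hash linkage across a list of headers."""
    prev_hash = None
    for header in headers:
        if not header_ok(header):
            return False
        if prev_hash is not None and header[4:36] != prev_hash:
            return False
        prev_hash = dsha(header)
    return True


def tx_in_block(txid: bytes, branch, header: bytes) -> bool:
    """Fold a merkle branch of (sibling, sibling_is_left) steps up to the header's root."""
    acc = txid
    for sibling, sibling_is_left in branch:
        acc = dsha(sibling + acc) if sibling_is_left else dsha(acc + sibling)
    return acc == header[36:68]


def make_header(prev_hash: bytes, merkle_root: bytes, bits: int = 0x207FFFFF) -> bytes:
    """Toy helper (assumption): grind a nonce against an easy, regtest-style target."""
    nonce = 0
    while True:
        header = (struct.pack("<I", 1) + prev_hash + merkle_root
                  + struct.pack("<III", 0, bits, nonce))
        if header_ok(header):
            return header
        nonce += 1
```

Because these checks only look at proof-of-work and linkage, a node that controls everything the auditor sees can still feed it a valid-looking but lower-work chain. That is why the cost of an eclipse attack matters later in the talk.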

The monitor is responsible for monitoring the log and seeing if there's anything fishy in there. The monitor simply watches the bitcoin blockchain for any transactions that have been signed by that authority's bitcoin address, and then looks at the binary batch merkle root to see what has been committed there. Then it could inspect the binaries and see if there's anything interesting there, or maybe there's a binary there that nobody has heard about or that hasn't actually been distributed.

There's an interesting problem here, which is similar to the problem of censorship resistance that petertodd was talking about in the previous talk. Well, what if the authority publishes or commits to a merkle root of binaries but doesn't actually publish the data behind those leaves? What if they don't publish the binaries themselves? You would see a hash on the bitcoin blockchain, but you wouldn't know what that hash corresponds to.

This could be detected as misbehavior: if you ask around and nobody has a copy of those binaries, and they weren't being distributed on the authority's website or in the app store, then socially it would become clear that something fishy is going on. This is all quite social. There's no systematic or technical way of saying there's a malicious binary in there. This is a social method of auditing the authority.

What if, however, you actually want to force the authority, or to have some assurance that the authority that is publishing merkle roots for binaries is actually publishing those binaries too? If you are downloading an update from the app store that is a targeted backdoor for you, then it would be nice to have some assurance that the binary of that update is known to everyone, so that if it turns out to be malicious then you could inspect that binary and see what was malicious about it. There's no easy solution to that, unless you do something stupid, like publish the binaries themselves to the bitcoin blockchain. In the context of linux repositories, there's a natural solution, because with repositories like the debian repository there are many mirrors around. If you assume that those mirrors are a form of sybil resistance, and that there are no sybils among them, then you can do some special scheme where the auditor asks a bunch of these mirrors whether they have the data for this commitment. If they say yes, then the auditor would accept the inclusion proofs for the updates, and if they say no, then it would reject the updates. For an auditor to verify the inclusion proof, it would have to get the state of an archival node, which is basically responsible for downloading all of the data from the authority and making sure that it actually exists. The auditor can simply tell it the latest block hash for the authority that has committed to those specific binaries.
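The mirror check described above could look something like the sketch below. Everything concrete in it is an assumption made for illustration: mirrors are modelled as plain callables that answer whether they hold the data behind a batch root, and the auditor accepts once a chosen number of them say yes. The talk does not say how many mirrors should be asked or how many must agree, so `threshold` is a parameter here.

```python
from typing import Callable, Iterable


def data_available(
    batch_root: bytes,
    mirrors: Iterable[Callable[[bytes], bool]],
    threshold: int,
) -> bool:
    """Accept a commitment only if enough mirrors claim to hold its data.

    Each mirror is a callable (hypothetical interface) that returns True if it
    has the binaries behind batch_root. A mirror that errors out counts as "no".
    """
    confirmations = 0
    for has_batch in mirrors:
        try:
            if has_batch(batch_root):
                confirmations += 1
        except Exception:
            continue
    return confirmations >= threshold
```

An auditor would run this before accepting an inclusion proof, and reject the update if it returns `False`. The scheme is only as good as the assumption that the mirrors are independent of the authority.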

What would it cost to attack a system like this? For a 51% attack, it would be about $2b in hardware using Antminer S9's. For an eclipse attack, if auditors only require 6 confirmations, then the hardware cost is only $8.3m, plus $100k in electricity costs. And it requires per-device block headers.

To implement this with the debian package repositories, there were about ~1.7 terabytes of package data, with 1040 package updates daily. So you need about 4 bitcoin transactions per day, which seems reasonable. There is about ~1.3 kilobytes of bandwidth and storage overhead per package, for storing and sending inclusion proofs. Users would need to run SPV clients and download ~39 megabytes of block header data, but they only need to store the hashes, which is ~15.9 megabytes of data. Going forward, this would be 11.5 kilobytes/day to download, and 5.6 kilobytes/day to store. This seems pretty reasonable to me.
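Some of these figures can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes 80-byte block headers, 32-byte hashes, and one block every ten minutes. The batch size and proof depth are derived from the 1040 updates and 4 transactions per day quoted above, and are only an illustration of how the batches might be sized.

```python
import math

HEADER_BYTES = 80        # size of a bitcoin block header
HASH_BYTES = 32          # size of a SHA-256 hash
BLOCKS_PER_DAY = 24 * 6  # one block per ten minutes on average

# Ongoing SPV cost per day.
download_per_day = HEADER_BYTES * BLOCKS_PER_DAY  # bytes of new headers
store_per_day = HASH_BYTES * BLOCKS_PER_DAY       # bytes if only hashes are kept

# Batching figures from the talk: 1040 package updates over about 4 transactions.
updates_per_day = 1040
transactions_per_day = 4
batch_size = updates_per_day // transactions_per_day
proof_depth = math.ceil(math.log2(batch_size))    # sibling hashes per batch proof
batch_proof_bytes = proof_depth * HASH_BYTES

print(f"download: {download_per_day / 1000:.1f} kB/day")
print(f"store:    {store_per_day / 1000:.1f} kB/day")
print(f"batch of {batch_size} packages, {proof_depth} hashes "
      f"({batch_proof_bytes} bytes) per batch proof")
```

The download figure matches the 11.5 kilobytes/day above. The hash-only storage estimate comes out a little under the 5.6 kilobytes/day quoted in the talk. The batch proof is only one part of the ~1.3 kilobytes per package, which also has to cover proving that the transaction is in a block.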

https://github.com/musalbas/contour

Q: What about custom builds for everyone? You would need formal proof of software behavior, tied to the particular build.