BlockSci: a Platform for Blockchain Science and Exploration
Harry Kalodner, Princeton University (Arvind Narayanan etc.)
Alright, well, my name is Harry Kalodner, I am a student at Princeton University, and I am here to tell you about a tool I built along with some colleagues at Princeton which can be used for analyzing the blockchain in different ways. So far most of the talks today have been constructive, about new ways to use the blockchain and protocols that slightly modify bitcoin in order to improve privacy or scaling. I am going to take a step back and look at scaling demands beyond ... to look at why are blocks full?
Why do we need analytics? We need to be able to motivate the scaling solutions. It's very cool to just present an idea and say it would increase throughput in this way. But we want to be able to know how that is going to affect actual use cases of bitcoin. And so, using analytics, we can identify new areas for improvement by getting a better understanding of what kinds of transactions and use cases people have in practice. We can categorize these cases to see whether scaling affects some use cases more than others. It's important to keep a general view of the whole ecosystem rather than getting swamped in the technical details and understanding the circumstances in which they occur.
I am going to start by talking about some interesting economic questions we can ask about bitcoin. These are all use cases for blocksci. The second part of my talk gives some more details about how blocksci is built and what blocksci is.
Can we tell the difference between organic and artificial demand for bitcoin? For instance, are blocks totally full? Is the mempool overloaded because there are a lot of people doing transactions? Or is there a spam attack going on? When you're counting transactions in the mempool, it might look the same by that metric alone.
We can differentiate between different types of demand for block space. If we're looking at how much bitcoin is moving in a given day, well, the total number is going to have a lot of noise in it. It's going to include change addresses, people sending money to themselves, all sorts of stuff. So we want to isolate how much bitcoin is actually being transferred between people and businesses and so on. So is it just churn? Also, is it a store of value, or a medium of exchange? What is the viscosity of bitcoin? Are people just investors, which seems like a good idea right now, or are people buying coffee? Is there serious exchange of currency for goods going on? And how do we tell organic demand from a wide base of users apart from malicious spam? Is there just one particular business doing something that generates a lot of transaction load? It's not that hard to overload the network.
One thing we can look at is the velocity of bitcoin. This idea is just looking at how much bitcoin has moved in a day, like bitcoin days destroyed. We can look at output amounts and add them up for a single given day. In that raw total, there's no rhyme or reason as to what goes on in bitcoin. People are sending money in no discernible pattern and you can't really learn anything. But we can clean this up a lot and try to figure out how much money is actually being moved in bitcoin. We can discount self-churn. That's change addresses, that's send-to-self, and to do that, you need a good heuristic understanding of the blockchain, based on whatever current state-of-the-art address clustering and linkability you have.
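The self-churn adjustment described above can be sketched in plain Python. This is only an illustration under stated assumptions, not BlockSci's implementation: the `Tx` record, the `cluster_of` mapping, and the toy addresses are hypothetical, and the address clustering is assumed to come from some separate heuristic.

```python
from collections import defaultdict
from typing import NamedTuple

class Tx(NamedTuple):
    day: str        # e.g. "2017-11-04"
    inputs: list    # addresses being spent from
    outputs: list   # (address, amount_in_satoshi) pairs

def daily_volume(txs, cluster_of):
    """Return (raw, adjusted) per-day output totals.

    `cluster_of` maps address -> cluster id (hypothetical output of an
    address-clustering heuristic). Outputs paid back to a cluster that also
    funded the transaction (change, send-to-self) count as self-churn and
    are left out of the adjusted total.
    """
    raw = defaultdict(int)
    adjusted = defaultdict(int)
    for tx in txs:
        # Unclustered addresses fall back to being their own cluster.
        senders = {cluster_of.get(a, a) for a in tx.inputs}
        for address, amount in tx.outputs:
            raw[tx.day] += amount
            if cluster_of.get(address, address) not in senders:
                adjusted[tx.day] += amount
    return dict(raw), dict(adjusted)

# Toy data: alice pays bob 30 and receives 70 back at a change address.
clusters = {"alice1": "alice", "alice_change": "alice", "bob1": "bob"}
txs = [Tx("2017-11-04", ["alice1"], [("bob1", 30), ("alice_change", 70)])]
raw, adjusted = daily_volume(txs, clusters)
```

In this toy case the raw total counts all 100 units of output, while the adjusted total keeps only the 30 that actually changed hands. How good the adjusted number is depends entirely on the quality of the clustering.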
What's interesting here is that if you look at our orange adjusted line, you see something that makes more sense. Bitcoin has been fairly steady in its demand and just recently, this year, started a gradual rise. This is what I like to think about: people sending each other bitcoin in exchange for goods or services or what have you. Another thing you can look at is how much of bitcoin's volume is people going to exchanges, buying it, and holding on to it. One interesting thing we did here is that we can correlate between... the blue line is the percentage of outputs that have moved in the last month. Are the outputs just sitting? Or are people spending their UTXOs? We correlated that with trade volume from exchanges and their data feeds. The spikes are extremely correlated. Basically any time a lot of bitcoin is being moved around, right now it looks like people buying and selling on exchanges, based on this correlation.
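As a rough illustration of the kind of correlation check mentioned above, here is a minimal Pearson correlation in plain Python. The talk does not say which statistic was used, so Pearson is an assumption, and the two series are made-up numbers rather than data from the talk.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of the same length (at least 2 points)")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)

# Made-up illustrative numbers, not measurements from the talk.
pct_outputs_moved = [4.0, 5.0, 12.0, 6.0, 15.0]
exchange_volume = [40.0, 50.0, 120.0, 60.0, 150.0]
r = pearson(pct_outputs_moved, exchange_volume)
```

A value of `r` near 1 means the two series spike together. It says nothing on its own about which one drives the other.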
Turning to the details of how blocksci helps figure this out: one thing we implemented on top of blocksci, and again I haven't described blocksci yet, is the ability to implement clustering heuristics. There was mention in earlier talks that there's a lot you can do, starting from work from a while ago, to link addresses together and try to understand what wallets... to see a single user controlling multiple addresses. We can look at change addresses, address reuse, shared inputs, and there are other specific methods we can use, like estimating what algorithm was used for coin selection, etc. And there are a lot of choices to make. There's nothing clear. Having a system where we can try out multiple combinations of heuristics to see what effects they have is a powerful thing. Compare this to most of the current state-of-the-art clustering, which in its defense is leaps and bounds ahead of what we have produced in blocksci: those services have a lot of data sources, whereas we only look at the blockchain. But they don't give you control over how addresses are linked together. Blocksci allows you to build arbitrary heuristics for connecting addresses for ...
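To make the idea of pluggable clustering heuristics concrete, here is a small sketch using a union-find structure and the shared-inputs heuristic. It is plain Python and does not use BlockSci's API. The transaction dictionaries, the function names, and the toy addresses are all hypothetical.

```python
class UnionFind:
    """Minimal disjoint-set structure over hashable items."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

def shared_inputs(tx):
    """Heuristic: all input addresses of one transaction share an owner."""
    ins = tx["inputs"]
    return [(ins[0], other) for other in ins[1:]]

def cluster(txs, heuristics):
    """Apply each heuristic to each transaction and merge the linked addresses."""
    uf = UnionFind()
    for tx in txs:
        for address in tx["inputs"] + tx["outputs"]:
            uf.find(address)  # register every address, even unlinked ones
        for heuristic in heuristics:
            for a, b in heuristic(tx):
                uf.union(a, b)
    groups = {}
    for address in list(uf.parent):
        groups.setdefault(uf.find(address), set()).add(address)
    return sorted(sorted(group) for group in groups.values())

# Toy transactions: A and B are spent together, then B and D are spent together.
txs = [
    {"inputs": ["A", "B"], "outputs": ["C"]},
    {"inputs": ["B", "D"], "outputs": ["E"]},
]
clusters = cluster(txs, [shared_inputs])
```

Because a heuristic is just a function that returns pairs of linked addresses, trying a different combination means passing a different list to `cluster`.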
The real sell for blocksci is that it is capable of clustering the entire blockchain in under 10 minutes. You can assign different weights to different heuristics and try out different rules to get new results.
There are a bunch of tools out there for analyzing bitcoin data in different ways. All the ones I have seen have a number of different problems. There are a lot of closed source services out there that will give you interesting data about the blockchain, but you have no way of validating or verifying that data. As someone who is interested in decentralized systems, it seems weird to trust a central party about bitcoin analytics when we have the blockchain data ourselves. Many of the tools have limited functionality: they were specifically designed to calculate some statistic, and their use stops there. So making a tool that can answer any sort of question that you might want to pose is a really powerful thing. Further, there are a lot of tools that run into problems of insufficient performance. Anyone who has tried to import the bitcoin blockchain into a general purpose database knows it's not fast, and it's huge. So having a tool that gets around that issue, I think that might be better. So the solution that solves all of those issues is... blocksci.
I am going to throw up the architecture diagram here. I won't go into all the details. We have a paper out there. The code is also public on github. I want to talk about the big picture of how... ... we have a custom database format, designed by hand to be compact and highly localized. We use a protocol-independent format so that it can be made to work with bitcoin but also bitcoin testnet or litecoin or namecoin or any number of other blockchains. On the other side of that, we have the parser, which actually does the work of producing this data format. That has the benefit, which some other tools have but I think hasn't been done super nicely, of providing incremental updates: as new blocks come in, it can update the database and handle reorgs, which are not fun to handle, in a way that provides a static view of the blockchain to our analysis library.
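One simple way to think about incremental updates with reorg handling is sketched below. This is an assumption-laden illustration, not a description of how BlockSci's parser works: the chain is modeled as a plain list of block hashes, and `apply_update` is a hypothetical helper.

```python
def apply_update(stored, node_chain):
    """Bring `stored` (block hashes, oldest first) in line with `node_chain`.

    Blocks after the last common block are rolled back (a reorg), then the
    node's newer blocks are appended. Returns (rolled_back, appended).
    """
    common = 0
    limit = min(len(stored), len(node_chain))
    while common < limit and stored[common] == node_chain[common]:
        common += 1
    rolled_back = len(stored) - common
    del stored[common:]                  # drop blocks no longer on the best chain
    stored.extend(node_chain[common:])   # append everything new
    return rolled_back, len(node_chain) - common

# Toy example: block "b" was replaced by "c", and "d" was mined on top of it.
stored = ["genesis", "a", "b"]
result = apply_update(stored, ["genesis", "a", "c", "d"])
```

A real parser would also have to undo whatever derived data the rolled-back blocks contributed, which is the part that makes reorgs unpleasant.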
Just a small shot of what it's like to use blocksci:
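import blocksci
import pandas
# `chain` is assumed to be an already-loaded blocksci.Blockchain object.
fees = [sum(block.fees) for block in chain.range('2017')]  # total fees per 2017 block
times = [block.time for block in chain.range('2017')]      # matching block timestamps
converter = blocksci.CurrencyConverter()
df = pandas.DataFrame({"Fee": fees}, index=times)
df = converter.satoshi_to_currency_df(df, chain)  # convert satoshi amounts to fiat currency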
We want to be able to hand this over to economists and get their insights, and give them a tool that they can really use to do good work. So that's really my favorite thing about blocksci, the broad applicability of it. We have this python interface for it, and for people who know python it's quite easy to use. As you may have seen in the previous example, we can easily incorporate external data feeds. We can convert between BTC and your local currency. We have support for standard script types. For a given pubkey, we can find everywhere it appears on the blockchain, and you can see if it was pay-to-pubkey, pay-to-scripthash, inside of P2WPKH, we can deduplicate on all of those because of this nice database format.
The big thing with blocksci, and the thing I want to highlight, is the performance. I'm really excited about this: we can iterate over every input and output in the blockchain in about 10.3 seconds, and it scales pretty linearly. So what that means is that the processing capability here... and this is all on a single machine, we just run on an EC2 instance with 64 GB of memory. So nothing hard to set up. Any of you could do that, it's available online. So blocksci enables you to scale what you're looking at to levels that other tools can't achieve.
So just a few tidbits on how we achieve this performance. We have a highly customized data format, which is all coded in C++. We use memory mapping, which allows you to directly load files into memory and allows the OS to efficiently decide what to load and how to access your data. It's all in C++, which gives it serious speed. And a caveat to the earlier performance slide: the python interface is great to use, but without compiling with cython, it's a few orders of magnitude slower. You have to use the C++ library if you want to achieve these speeds. Any reasonable query can be completed in about half an hour.
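The memory-mapping idea can be shown in miniature with Python's standard `mmap` and `struct` modules. The fixed-width record layout here (a 64-bit value plus a 32-bit address id) is purely hypothetical and is not BlockSci's actual file format. The point is only that fixed-size records in a mapped file give cheap random access and fast sequential scans, with the OS deciding which pages to load.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed-width record: (value in satoshi: uint64, address id: uint32).
RECORD = struct.Struct("<QI")

def write_records(path, records):
    """Write the records back to back, with no per-record framing."""
    with open(path, "wb") as f:
        for value, address_id in records:
            f.write(RECORD.pack(value, address_id))

def read_record(mm, index):
    """Random access: record i starts at byte offset i * RECORD.size."""
    return RECORD.unpack_from(mm, index * RECORD.size)

def total_value(mm):
    """Sequential scan over every record in the mapped file."""
    return sum(value for value, _ in RECORD.iter_unpack(mm))

path = os.path.join(tempfile.mkdtemp(), "outputs.dat")
write_records(path, [(5000, 1), (2500, 2), (7500, 1)])
with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    second = read_record(mm, 1)
    total = total_value(mm)
```

Nothing is parsed or copied up front: the file is mapped, and only the bytes that are actually touched get paged in.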
So far I have focused on scalability implications related to bitcoin. I want to take a second to mention a few other things that blocksci can do. We can look into the privacy of multisig. One of the purported uses of multisig was for organizations to manage their funds so that they can distribute keys to multiple people. But this allows leakage of private data: you can see some keys, access patterns, etc. We looked at places where the keys used for a multisig changed slightly. For instance, if there was a 3-of-3 multisig, we found locations where there was an input and output where only one of the 3 keys changed. And that is a fairly substantial violation of organizational privacy if outside observers can tell when... it might just be a key swap, or maybe you booted someone out of control of the funds. You can explore this with blocksci, and you can see how much money is in outputs that exhibit these signs. So looking at the last few years, there have recently been as many as 10k transactions per month with these properties, and a fair amount of money involved in them. These are serious privacy violations.
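The multisig key-swap pattern described above can be sketched as a simple set comparison. This is a plain-Python illustration with hypothetical data structures, where each multisig is an `(m, keys)` pair. It is not the query used in the talk and does not use BlockSci's API.

```python
def differs_by_one_key(old_keys, new_keys):
    """True if both key sets have the same size and exactly one key was replaced."""
    old, new = set(old_keys), set(new_keys)
    return len(old) == len(new) and len(old - new) == 1

def find_key_swaps(txs):
    """Return (txid, removed_key, added_key) for each single-key change.

    A hit is an input/output pair in the same transaction with the same
    signature threshold m whose key sets differ in exactly one key.
    """
    hits = []
    for tx in txs:
        for m_in, keys_in in tx["multisig_inputs"]:
            for m_out, keys_out in tx["multisig_outputs"]:
                if m_in == m_out and differs_by_one_key(keys_in, keys_out):
                    removed = (set(keys_in) - set(keys_out)).pop()
                    added = (set(keys_out) - set(keys_in)).pop()
                    hits.append((tx["txid"], removed, added))
    return hits

# Toy data: t1 replaces k3 with k4 in a 3-of-3 multisig; t2 keeps the same keys.
txs = [
    {"txid": "t1",
     "multisig_inputs": [(3, ["k1", "k2", "k3"])],
     "multisig_outputs": [(3, ["k1", "k2", "k4"])]},
    {"txid": "t2",
     "multisig_inputs": [(3, ["k1", "k2", "k3"])],
     "multisig_outputs": [(3, ["k1", "k2", "k3"])]},
]
swaps = find_key_swaps(txs)
```

An outside observer running this kind of check learns which key left and which key joined, which is exactly the organizational leak the talk describes.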
"When the cookie meets the blockchain: Privacy risks of web payments via cryptocurrencies"
"BlockSci: Design and applications of a blockchain analysis platform"
Q&A
Q: What indexes do you need on Bitcoin Core?
A: The default setup. You don't need txindex. We parse all of the block files ourselves without using any sort of indexing. The reason for this is that we would like to be able to have blocksci running concurrently with a node. This is not possible with bitcoin core's current leveldb indexes, which only support a single reader.