From fabrice.drouin at acinq.fr Tue Sep 20 13:07:14 2016
From: fabrice.drouin at acinq.fr (Fabrice Drouin)
Date: Tue, 20 Sep 2016 15:07:14 +0200
Subject: [Lightning-dev] Testing a Flare-like routing implementation on 2500 AWS nodes
In-Reply-To:
References:
Message-ID:

Edit: our link to the Flare white paper should be
[1] http://bitfury.com/content/5-white-papers-research/whitepaper_flare_an_approach_to_routing_in_lightning_network_7_7_2016.pdf

Fabrice

On 18 September 2016 at 14:58, Pierre wrote:
> TLDR: it kind of works (there is an 80+% chance of finding a route to
> any other node in ~500ms).
>
> Hello guys,
>
> Flare [1] is a routing protocol for the Lightning Network which combines
> local neighborhood maps and paths to remote beacons. It is really
> interesting and the simulations are promising, so we wanted to give it a
> try and actually implement it in Eclair. We ended up with a close
> variant [2], and tested it on 2500 nodes on AWS.
>
> The main difference between our implementation and the original paper is
> the way nodes communicate with each other. The paper assumes that "there
> exists a way for any two nodes in LN to communicate with each other,
> perhaps with some prior setup (such as a distributed hashtable overlay
> maintained among LN nodes that would store mapping of node identifiers
> to their current IP addresses)". In our implementation there is no DHT:
> inter-node communication only occurs over existing open channels, and
> all routing-related messages are onion-encrypted and routed themselves,
> just like htlc messages. So a node can't talk directly to a node it
> doesn't already have a route to. This certainly has a cost, but it
> should also improve privacy, and overall we think that using a DHT isn't
> necessary. Also, the assumption that a node is reachable from the
> outside is rather strong, particularly for mobile/light wallets, whereas
> by communicating over channels a node can actively participate in the
> global routing as soon as it has more than two channels.
>
> The neighbor discovery/beacon selection is pretty similar to what's in
> the paper, but a little bit simplified:
> - nodes don't ack table update messages (this is probably not needed
>   since updates occur between adjacent nodes and any disconnection will
>   be noticed)
> - nodes don't tell other nodes that they have been chosen as beacons
>   (there is no beacon_set message). A given node knows it is a candidate
>   because it received a beacon_req message, but it won't know whether it
>   was eventually selected.
>
> We only focused on static ranking (finding a route formed of open
> channels between two given nodes), so it is very possible (probable?)
> that a route is not actually usable because the channels are not
> balanced. Basically we made the assumption that the network graph is
> undirected, which is not true and is a pretty strong assumption.
>
> The route selection has also been simplified. Flare proposes a 3-step
> process:
> (a) sender uses its own routing table
> (b) sender requests the receiver's table and uses a combination of the
>     two tables
> (c) sender iterates over nodes it knows that are close to the receiver,
>     and requests their tables
> In our test we only did (b), which is a strong tradeoff: it means that
> the time allowed to find a route is short and predictable, but it is
> very suboptimal in terms of probability of finding a route.
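Edit: to make step (b) concrete, here is a minimal Scala sketch of what
the sender does once it has the receiver's table: merge the two tables
into one undirected graph and run a shortest-path search on it. All the
names here (RouteSketch, findRoute, ...) are made up for illustration;
this is not the actual FlareRouter code. With unit edge weights Dijkstra
degenerates to the plain BFS used below.

  import scala.annotation.tailrec

  object RouteSketch {
    type NodeId = String

    // Merge the sender's table with the one embedded in the receiver's
    // "payment request", treating channels as undirected edges (the
    // static-ranking assumption made above), and search for a path.
    def findRoute(senderTable: Set[(NodeId, NodeId)],
                  receiverTable: Set[(NodeId, NodeId)],
                  source: NodeId,
                  target: NodeId): Option[List[NodeId]] = {
      val adj: Map[NodeId, Set[NodeId]] =
        (senderTable ++ receiverTable)
          .flatMap { case (a, b) => Seq(a -> b, b -> a) }
          .groupMap(_._1)(_._2)
          .withDefaultValue(Set.empty)

      // breadth-first search; each candidate path is kept in reverse
      // order (head = node reached last)
      @tailrec
      def bfs(frontier: Vector[List[NodeId]],
              visited: Set[NodeId]): Option[List[NodeId]] =
        frontier match {
          case (path @ (node :: _)) +: rest =>
            if (node == target) Some(path.reverse)
            else {
              val next = adj(node) -- visited
              bfs(rest ++ next.map(_ :: path), visited ++ next)
            }
          case _ => None
        }

      bfs(Vector(List(source)), Set(source))
    }
  }

Steps (a) and (c) would reuse the same search, only with a different set
of tables merged into the graph.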
> Below are the results of a few tests we ran on AWS. The procedure was
> as follows:
> 1) start a certain number of m1.medium instances with a standard Linux
>    AMI
> 2) after startup each instance runs an init script that downloads the
>    Oracle JDK and our jar, then runs the jar
> 3) from a control server, we link nodes to each other following a given
>    graph, using json-rpc commands
> 4) wait a few minutes for the gossip to calm down (it can be pretty
>    intense with radius=3 [3] and our implementation is not optimized)
> 5) pick 1000 random routes (random sender, random receiver); for each
>    route, using json-rpc, we (a) ask the receiver for a "payment
>    request" (H + routing_table) and then (b) ask the sender to find a
>    route for this payment request (without actually making a payment)
> 6) collect stats
>
> Caveats:
> - nodes are connected all at once (step 3 above), and don't periodically
>   reconcile their own routing table with that of their neighbors
>   (neighbor_reset in the paper). This leads to nodes not knowing all the
>   nodes in their radius, and probably has a significant negative impact
>   on the results; we figured this out afterwards :-s
> - there is no "proof of channels" (in our test, nodes are not lying to
>   each other about existing channels)
> - onion messages are not actually encrypted (although channel
>   communications are)
> - it is a "LAN" setup, so things wouldn't be that smooth in reality
>   (although we are testing static routes, so it doesn't actually matter
>   that much)
> - smallworld graphs may seem a little bit optimistic in terms of actual
>   network topology
> - ...
>
> Parameters and results (probability of finding a route to any given
> node):
>
> test  nodes  r  beacon_count  graph        result  t_avg
> A     1000   2  10            sw1000_6_02  87%
> B     2000   2  10            sw2000_6_02  76%     305ms
> C     2500   2  10            sw2500_4_01  ~20%
> D     2500   3  10            sw2500_4_01  43%     377ms
> E     2500   3  10            sw2500_6_02  83%     532ms
>
> Note: a swX_Y_0Z graph is a smallworld graph generated using the Watts
> and Strogatz model with n=X, k=Y and beta=0.Z.

(Edit: a Scala sketch of this Watts-Strogatz generation follows after
the conclusion below.)

> Network size (known nodes):
>
> test  adjacent  all
> A
> B     6.8       62.3
> C
> D     4.0       58.8
> E     6.0       97.0
>
> Note: expected values for column 'all' should be D=4^3=64+ and
> E=6^3=216+, see the first caveat above. The good thing is that radius=2
> might actually be enough, just like the paper says!
>
> Beacons:
>
> test  dist_avg  dist_max  dist_var
> A
> B     7.4       39        29.4
> C
> D     9.2       96        79.5
> E     8.1       45        36.0
>
> Route lengths:
>
> test  len_avg  len_max  len_var
> A     5.4      26.0     3.3
> B     6.2      30.0     6.2
> C
> D     17.9     74.0     109.5
> E     7.3      28.0     5.6
>
> Response time (this includes the sender requesting the receiver's
> routing table and running Dijkstra):
>
> test  t_avg  t_max
> A
> B     305ms  5158ms
> C
> D     377ms  7809ms
> E     532ms  5128ms
>
> Note: the high max times are probably related to the use of
> poor-performance instances that sometimes behave erratically
> (unfortunately we don't have the variance).
>
> In conclusion, we showed that our modified version of Flare actually
> works on a decent-size network. Because of the bug mentioned in the
> caveats, the value of the radius parameter should be taken with caution.
> One of our main concerns is the fact that the graph is in fact directed,
> and that at creation channels are actually one-way. This might make
> finding routes very difficult at bootstrapping.
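Edit: for those who want to reproduce the test topologies, here is a
short Scala sketch of the Watts-Strogatz generation behind the swX_Y_0Z
graphs mentioned above (a ring lattice of n nodes, each linked to its k
nearest neighbours, with every edge rewired with probability beta). This
illustrates the model; it is not the script we actually used:

  import scala.util.Random

  object SmallWorld {
    // Returns an undirected edge set; the few duplicate edges that
    // rewiring can produce are simply collapsed by the Set.
    def generate(n: Int, k: Int, beta: Double,
                 rng: Random = new Random): Set[(Int, Int)] = {
      // ring lattice: node i is linked to its k/2 successors on the ring
      // (and hence to its k/2 predecessors, since edges are undirected)
      val lattice = for {
        i <- 0 until n
        j <- 1 to k / 2
      } yield (i, (i + j) % n)

      // rewire each edge with probability beta: keep the source
      // endpoint, redraw the destination uniformly, avoiding self-loops
      lattice.map { case (i, j) =>
        if (rng.nextDouble() < beta) {
          var t = rng.nextInt(n)
          while (t == i) t = rng.nextInt(n)
          (i, t)
        } else (i, j)
      }.toSet
    }
  }

For example, the sw2500_6_02 graph above would correspond to
generate(2500, 6, 0.2).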
> Cheers,
>
> Pierre
>
> [1] http://bitfury.com/content/5-white-papers-research/bitfury_whitepaper_shared_send_untangling_in_bitcoin_8_24_2016.pdf
> [2] https://github.com/ACINQ/eclair/blob/wip-flare/eclair-node/src/main/scala/fr/acinq/eclair/router/FlareRouter.scala
> [3] https://s3-eu-west-1.amazonaws.com/acinq/public/network_out.PNG
>
> _______________________________________________
> Lightning-dev mailing list
> Lightning-dev at lists.linuxfoundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/lightning-dev

--
ACINQ SAS, Siège Social 10 rue de Penthièvre 75008 PARIS. Capital 51 437
Euros, 804 203 792 RCS Paris – APE 6202A – SIRET 804 203 792 00010 – TVA
FR 34 804203792.