From fabrice.drouin at acinq.fr Tue Sep 20 13:07:14 2016
From: fabrice.drouin at acinq.fr (Fabrice Drouin)
Date: Tue, 20 Sep 2016 15:07:14 +0200
Subject: [Lightning-dev] Testing a Flare-like routing implementation on 2500 AWS nodes
In-Reply-To:
References:
Message-ID:

Edit: our link to the Flare white paper should be
[1] http://bitfury.com/content/5-white-papers-research/whitepaper_flare_an_approach_to_routing_in_lightning_network_7_7_2016.pdf

Fabrice

On 18 September 2016 at 14:58, Pierre wrote:
> TLDR: it kind of works (there is an 80+% chance of finding a route to
> any other node in ~500ms).
>
> Hello guys,
>
> Flare [1] is a routing protocol for the Lightning Network which combines
> local neighborhood maps and paths to remote beacons. It is really
> interesting and the simulations are promising, so we wanted to give it a
> try and actually implement it in Eclair. We ended up with a close
> variant [2], and tested it on 2500 nodes on AWS.
>
> The main difference between our implementation and the original paper is
> the way nodes communicate with each other. The paper assumes that "there
> exists a way for any two nodes in LN to communicate with each other,
> perhaps with some prior setup (such as a distributed hashtable overlay
> maintained among LN nodes that would store mapping of node identifiers
> to their current IP addresses)". In our implementation there is no DHT:
> inter-node communication only occurs over existing open channels, and
> all routing-related messages are onion-encrypted and routed themselves,
> just like htlc messages. So a node can't talk directly to a node it
> doesn't already have a route to. This certainly has a cost, but it
> should also improve privacy, and overall we think that using a DHT isn't
> necessary. Also, the assumption that a node is reachable from the
> outside is rather strong, particularly for mobile/light wallets, whereas
> by communicating over channels a node can actively participate in the
> global routing as soon as it has more than two channels.
>
> The neighbor discovery/beacon selection is pretty similar to what's in
> the paper, but a little bit simplified:
> - nodes don't ack table update messages (this is probably not needed
>   since updates occur between adjacent nodes and any disconnection will
>   be noticed)
> - nodes don't tell other nodes that they have been chosen as beacons
>   (there is no beacon_set message). A given node knows it is a candidate
>   because it received a beacon_req message, but it won't know whether it
>   was eventually selected.
>
> We only focused on static ranking (finding a route formed of open
> channels between two given nodes), so it is very possible (probable?)
> that a route is not actually usable because the channels are not
> balanced. Basically we made the assumption that the network graph is
> undirected, which is not true and is a pretty strong assumption.
>
> The route selection has also been simplified. Flare proposes a 3-step
> process:
> (a) sender uses its own routing table
> (b) sender requests the receiver's table and uses a combination of the
>     two tables
> (c) sender iterates over nodes it knows that are close to the receiver,
>     and requests their tables
> In our test we only did (b), which is a strong tradeoff: it means that
> the time allowed to find a route is short and predictable, but it is
> very suboptimal in terms of probability of finding a route.
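Edit: to make step (b) concrete, here is a minimal Scala sketch of what
the sender does once it has the receiver's table: merge the two tables
into one undirected graph and run a shortest-path search on it. All the
names here (RouteSketch, findRoute, ...) are made up for illustration;
this is not the actual FlareRouter code. With unit edge weights Dijkstra
degenerates to the plain BFS used below.

  import scala.annotation.tailrec

  object RouteSketch {
    type NodeId = String

    // Merge the sender's table with the one embedded in the receiver's
    // "payment request", treating channels as undirected edges (the
    // static-ranking assumption made above), and search for a path.
    def findRoute(senderTable: Set[(NodeId, NodeId)],
                  receiverTable: Set[(NodeId, NodeId)],
                  source: NodeId,
                  target: NodeId): Option[List[NodeId]] = {
      val adj: Map[NodeId, Set[NodeId]] =
        (senderTable ++ receiverTable)
          .flatMap { case (a, b) => Seq(a -> b, b -> a) }
          .groupMap(_._1)(_._2)
          .withDefaultValue(Set.empty)

      // breadth-first search; each candidate path is kept in reverse
      // order (head = node reached last)
      @tailrec
      def bfs(frontier: Vector[List[NodeId]],
              visited: Set[NodeId]): Option[List[NodeId]] =
        frontier match {
          case (path @ (node :: _)) +: rest =>
            if (node == target) Some(path.reverse)
            else {
              val next = adj(node) -- visited
              bfs(rest ++ next.map(_ :: path), visited ++ next)
            }
          case _ => None
        }

      bfs(Vector(List(source)), Set(source))
    }
  }

Steps (a) and (c) would reuse the same search, only with a different set
of tables merged into the graph.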
> Below are the results of a few tests we ran on AWS. The procedure was
> as follows:
> 1) start a certain number of m1.medium instances with a standard Linux
>    AMI
> 2) after startup each instance runs an init script that downloads the
>    Oracle JDK and our jar, then runs the jar
> 3) from a control server, we link nodes to each other following a given
>    graph, using json-rpc commands
> 4) wait a few minutes for the gossip to calm down (it can be pretty
>    intense with radius=3 [3] and our implementation is not optimized)
> 5) pick 1000 random routes (random sender, random receiver); for each
>    route, using json-rpc, we (a) ask the receiver for a "payment
>    request" (H + routing_table) and then (b) ask the sender to find a
>    route for this payment request (without actually making a payment)
> 6) collect stats
>
> Caveats:
> - nodes are connected all at once (step 3 above), and don't periodically
>   reconcile their own routing table with that of their neighbors
>   (neighbor_reset in the paper). This leads to nodes not knowing all the
>   nodes in their radius, and probably has a significant negative impact
>   on the results; we figured this out afterwards :-s
> - there is no "proof of channels" (in our test, nodes are not lying to
>   each other about existing channels)
> - onion messages are not actually encrypted (although channel
>   communications are)
> - it is a "LAN" setup, so things wouldn't be that smooth in reality
>   (although we are testing static routes, so it doesn't actually matter
>   that much)
> - smallworld graphs may seem a little bit optimistic in terms of actual
>   network topology
> - ...
>
> Parameters and results (probability of finding a route to any given
> node):
>
> test  nodes  r  beacon_count  graph        result  t_avg
> A     1000   2  10            sw1000_6_02  87%
> B     2000   2  10            sw2000_6_02  76%     305ms
> C     2500   2  10            sw2500_4_01  ~20%
> D     2500   3  10            sw2500_4_01  43%     377ms
> E     2500   3  10            sw2500_6_02  83%     532ms
>
> Note: a swX_Y_0Z graph is a smallworld graph generated using the Watts
> and Strogatz model with n=X, k=Y and beta=0.Z.

(Edit: a Scala sketch of this Watts-Strogatz generation follows after
the conclusion below.)

> Network size (known nodes):
>
> test  adjacent  all
> A
> B     6.8       62.3
> C
> D     4.0       58.8
> E     6.0       97.0
>
> Note: expected values for column 'all' should be D=4^3=64+ and
> E=6^3=216+, see the first caveat above. The good thing is that radius=2
> might actually be enough, just like the paper says!
>
> Beacons:
>
> test  dist_avg  dist_max  dist_var
> A
> B     7.4       39        29.4
> C
> D     9.2       96        79.5
> E     8.1       45        36.0
>
> Route lengths:
>
> test  len_avg  len_max  len_var
> A     5.4      26.0     3.3
> B     6.2      30.0     6.2
> C
> D     17.9     74.0     109.5
> E     7.3      28.0     5.6
>
> Response time (this includes the sender requesting the receiver's
> routing table and running Dijkstra):
>
> test  t_avg  t_max
> A
> B     305ms  5158ms
> C
> D     377ms  7809ms
> E     532ms  5128ms
>
> Note: the high max times are probably related to the use of
> poor-performance instances that sometimes behave erratically
> (unfortunately we don't have the variance).
>
> In conclusion, we showed that our modified version of Flare actually
> works on a decent-size network. Because of the bug mentioned in the
> caveats, the value of the radius parameter should be taken with caution.
> One of our main concerns is the fact that the graph is in fact directed,
> and that at creation channels are actually one-way. This might make
> finding routes very difficult at bootstrapping.
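Edit: for those who want to reproduce the test topologies, here is a
short Scala sketch of the Watts-Strogatz generation behind the swX_Y_0Z
graphs mentioned above (a ring lattice of n nodes, each linked to its k
nearest neighbours, with every edge rewired with probability beta). This
illustrates the model; it is not the script we actually used:

  import scala.util.Random

  object SmallWorld {
    // Returns an undirected edge set; the few duplicate edges that
    // rewiring can produce are simply collapsed by the Set.
    def generate(n: Int, k: Int, beta: Double,
                 rng: Random = new Random): Set[(Int, Int)] = {
      // ring lattice: node i is linked to its k/2 successors on the ring
      // (and hence to its k/2 predecessors, since edges are undirected)
      val lattice = for {
        i <- 0 until n
        j <- 1 to k / 2
      } yield (i, (i + j) % n)

      // rewire each edge with probability beta: keep the source
      // endpoint, redraw the destination uniformly, avoiding self-loops
      lattice.map { case (i, j) =>
        if (rng.nextDouble() < beta) {
          var t = rng.nextInt(n)
          while (t == i) t = rng.nextInt(n)
          (i, t)
        } else (i, j)
      }.toSet
    }
  }

For example, the sw2500_6_02 graph above would correspond to
generate(2500, 6, 0.2).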
> Cheers,
>
> Pierre
>
> [1] http://bitfury.com/content/5-white-papers-research/bitfury_whitepaper_shared_send_untangling_in_bitcoin_8_24_2016.pdf
> [2] https://github.com/ACINQ/eclair/blob/wip-flare/eclair-node/src/main/scala/fr/acinq/eclair/router/FlareRouter.scala
> [3] https://s3-eu-west-1.amazonaws.com/acinq/public/network_out.PNG
>
> _______________________________________________
> Lightning-dev mailing list
> Lightning-dev at lists.linuxfoundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/lightning-dev

--
ACINQ SAS, Siège Social 10 rue de Penthièvre 75008 PARIS. Capital 51 437
Euros, 804 203 792 RCS Paris – APE 6202A – SIRET 804 203 792 00010 – TVA
FR 34 804203792.