From pm+lists at acinq.fr Sun Sep 18 12:58:03 2016
From: pm+lists at acinq.fr (Pierre)
Date: Sun, 18 Sep 2016 14:58:03 +0200
Subject: [Lightning-dev] Testing a Flare-like routing implementation on 2500 AWS nodes
Message-ID:

TLDR: It kind of works (there is an 80+% chance of finding a route to any other node in ~500ms).

Hello guys,

Flare[1] is a routing protocol for the Lightning Network which combines local neighborhood maps and paths to remote beacons. It is really interesting and the simulations are promising, so we wanted to give it a try and actually implement it in Eclair. We ended up with a close variant [2], and tested it on 2500 nodes on AWS.

The main difference between our implementation and the original paper is in the way nodes communicate with each other. The paper assumes that "there exists a way for any two nodes in LN to communicate with each other, perhaps with some prior setup (such as a distributed hashtable overlay maintained among LN nodes that would store mapping of node identifiers to their current IP addresses)". In our implementation there is no DHT: inter-node communication only occurs over existing open channels, and all routing-related messages are onion-encrypted and routed themselves, just like htlc messages. So a node can't talk directly to a node it doesn't already have a route to. This certainly has a cost, but it should also improve privacy, and overall we think that using a DHT isn't necessary. Also, assuming that a node is reachable from the outside is rather strong, particularly for mobile/light wallets, whereas by communicating over channels a node can actively participate in global routing as soon as it has more than two channels.

The neighbor discovery / beacon selection is pretty similar to what's in the paper, but a little bit simplified:
- nodes don't ack table update messages (acks are probably not needed since updates occur between adjacent nodes and any disconnection will be noticed)
- nodes don't tell other nodes that they have been chosen as beacons (there is no beacon_set message). A given node knows it is a candidate because it received a beacon_req message, but it won't know whether it was eventually selected.

We only focused on static ranking (finding a route formed of open channels between two given nodes), so it is very possible (probable?) that a route is not actually usable because the channels are not balanced. Basically we assumed that the network graph is undirected, which is not true and is a pretty strong assumption.

The route selection has also been simplified. Flare proposes a 3-step process:
(a) the sender uses its own routing table
(b) the sender requests the receiver's table and uses a combination of the two tables
(c) the sender iterates over nodes it knows that are close to the receiver, and requests their tables
In our test we only did (b), which is a strong tradeoff: the time allowed to find a route is short and predictable, but it is very suboptimal in terms of the probability of finding a route.
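To make step (b) a little more concrete, here is a rough, simplified sketch of what the sender could do (illustrative Scala only, with made-up names and types; the real code is in FlareRouter [2]): merge its own table with the table attached to the payment request, then run a shortest-path search (Dijkstra with unit edge weights, i.e. fewest hops) over the merged, undirected graph.

  object RouteFinder {
    type NodeId = String

    // Merge the two channel lists into an undirected adjacency map.
    def merge(senderTable: Set[(NodeId, NodeId)], receiverTable: Set[(NodeId, NodeId)]): Map[NodeId, Set[NodeId]] =
      (senderTable ++ receiverTable).foldLeft(Map.empty[NodeId, Set[NodeId]]) {
        case (adj, (a, b)) =>
          adj
            .updated(a, adj.getOrElse(a, Set.empty[NodeId]) + b)
            .updated(b, adj.getOrElse(b, Set.empty[NodeId]) + a)
      }

    // Dijkstra with unit weights; returns the node list from source to target, if any.
    def findRoute(adj: Map[NodeId, Set[NodeId]], source: NodeId, target: NodeId): Option[List[NodeId]] = {
      import scala.collection.mutable
      val dist = mutable.Map(source -> 0)
      val prev = mutable.Map.empty[NodeId, NodeId]
      val queue = mutable.PriorityQueue((0, source))(Ordering.by[(Int, NodeId), Int](_._1).reverse)
      while (queue.nonEmpty) {
        val (d, u) = queue.dequeue()
        if (u == target) {
          // rebuild the path by walking predecessors back to the source
          var path = List(target)
          while (path.head != source) path = prev(path.head) :: path
          return Some(path)
        }
        if (d == dist(u)) { // skip stale queue entries
          for (v <- adj.getOrElse(u, Set.empty[NodeId]) if d + 1 < dist.getOrElse(v, Int.MaxValue)) {
            dist(v) = d + 1
            prev(v) = u
            queue.enqueue((d + 1, v))
          }
        }
      }
      None
    }
  }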
Below are the results of a few tests we ran on AWS. The procedure was as follows:
1) start a certain number of m1.medium instances with a standard Linux AMI
2) after startup, the instances run an init script that downloads the Oracle JDK and our jar, then runs the jar
3) from a control server, link nodes to each other following a given graph, using json-rpc commands
4) wait a few minutes for the gossip to calm down (it can be pretty intense with radius=3 [3] and our implementation is not optimized)
5) pick 1000 random routes (random sender, random receiver), and for each route, using json-rpc, (a) ask the receiver for a "payment request" (H + routing_table) and then (b) ask the sender to find a route for this payment request (without actually making a payment)
6) collect stats

Caveats:
- nodes are connected all at once (step 3 above), and don't periodically reconcile their own routing table with that of their neighbors (neighbor_reset in the paper). This leads to nodes not knowing all the nodes in their radius, and probably has a significant negative impact on results; we figured this out afterwards :-s
- there is no "proof of channels" (nodes are not lying to each other about existing channels)
- onion messages are not actually encrypted (although channel communications are)
- it is a "LAN" setup so things wouldn't be that smooth in reality (although we are testing static routes so it doesn't actually matter that much)
- smallworld graphs may seem a little bit optimistic in terms of actual network topology
- ...

Parameters and results (probability of finding a route to any given node):

test  nodes  r  beacon_count  graph        result  t_avg
A     1000   2  10            sw1000_6_02  87%
B     2000   2  10            sw2000_6_02  76%     305ms
C     2500   2  10            sw2500_4_01  ~20%
D     2500   3  10            sw2500_4_01  43%     377ms
E     2500   3  10            sw2500_6_02  83%     532ms

Note: a swX_Y_0Z graph is a smallworld graph generated using the Watts and Strogatz model with n=X, k=Y and beta=0.Z

Network size (known nodes):

test  adjacent  all
A
B     6.8       62.3
C
D     4.0       58.8
E     6.0       97.0

Note: expected values for column 'all' should be D=4^3=64+ and E=6^3=216+, see the first caveat above. The good thing is that a radius=2 might actually be enough, just like the paper says!

Beacons:

test  dist_avg  dist_max  dist_var
A
B     7.4       39        29.4
C
D     9.2       96        79.5
E     8.1       45        36.0

Route lengths:

test  len_avg  len_max  len_var
A     5.4      26.0     3.3
B     6.2      30.0     6.2
C
D     17.9     74.0     109.5
E     7.3      28.0     5.6

Response time (this includes the sender requesting the receiver's routing table and running a Dijkstra):

test  t_avg   t_max
A
B     305ms   5158ms
C
D     377ms   7809ms
E     532ms   5128ms

Note: the high max times are probably related to the use of poor-performance instances that sometimes behave erratically (unfortunately we don't have the variance).

In conclusion, we show that our modified version of Flare actually works on a decent-size network. Because of the bug mentioned in the caveats, the value of the radius parameter should be taken with caution. One of our main concerns is the fact that the graph is in fact directed and that, at creation, channels are actually one-way. This might make finding routes very difficult during bootstrapping.

Cheers,

Pierre

[1] http://bitfury.com/content/5-white-papers-research/bitfury_whitepaper_shared_send_untangling_in_bitcoin_8_24_2016.pdf
[2] https://github.com/ACINQ/eclair/blob/wip-flare/eclair-node/src/main/scala/fr/acinq/eclair/router/FlareRouter.scala
[3] https://s3-eu-west-1.amazonaws.com/acinq/public/network_out.PNG
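P.S. In case it helps to picture the test graphs: a swX_Y_0Z graph could be generated along the lines of the sketch below (a rough, illustrative Watts-Strogatz construction in Scala, not the script we actually used). Start from a ring lattice where each node is connected to its k/2 nearest neighbours on each side, then rewire each edge with probability beta.

  import scala.util.Random

  object SmallWorld {
    // Illustrative only: one simple way to build a Watts-Strogatz smallworld graph
    // with n nodes, k nearest neighbours and rewiring probability beta.
    def wattsStrogatz(n: Int, k: Int, beta: Double, rng: Random = new Random()): Set[(Int, Int)] = {
      // ring lattice: node i is linked to its k/2 nearest neighbours on each side
      val lattice = for {
        i <- 0 until n
        j <- 1 to k / 2
      } yield (i, (i + j) % n)
      // rewire each edge (i, j) with probability beta: keep i, pick a new random endpoint
      lattice.foldLeft(Set.empty[(Int, Int)]) { case (edges, (i, j)) =>
        if (rng.nextDouble() < beta) {
          val candidates = (0 until n).filter(t => t != i && !edges((i, t)) && !edges((t, i)))
          if (candidates.nonEmpty) edges + ((i, candidates(rng.nextInt(candidates.size))))
          else edges + ((i, j))
        } else edges + ((i, j))
      }
    }
  }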