From bastien at acinq.fr  Fri Aug  2 08:30:30 2019
From: bastien at acinq.fr (Bastien TEINTURIER)
Date: Fri, 2 Aug 2019 10:30:30 +0200
Subject: [Lightning-dev] Improving Lightning Network Pathfinding Latency
 by Path Splicing and Other Real-Time Strategy Game Techniques
In-Reply-To: <h8QnQr2f565M2Kds8BcD5Cd8DYVgX_D9Le7j73aMJenfLudQbjGGFpYqsl1VktMVGIiXh17rjIQJ9cpCOIoWFrSdTdRyw9Dfd5bnxpV463o=@protonmail.com>
References: <K5X732PWnUY2qH_s39wSBT1mX_7kt2McTVvFcW1HvpghHVZccNCyZB9R_-HUoAOHX4wqM-T1j7Rs94y8apb3fmlP4AOLM8UwP92VsPYEYxU=@protonmail.com>
	<CAO3Pvs8uR1q-QycLuybh0HYLK6sauueJY8WrsB5jBG+q1LAX5Q@mail.gmail.com>
	<wiX65OpOUF68a4OAqX0LoCqJaMsEAiYvMI9EwSOMXiEZW-wkyzvTpzu1P0GgoC1RKYjLpP-0CM7etvHmkn9tixoNanEFAiGLIjxi3zSgluQ=@protonmail.com>
	<CACdvm3OgLcxpVWMf2ANPq=DR7e_2EaqXXMGa5HoSWzx0GZOmVQ@mail.gmail.com>
	<h8QnQr2f565M2Kds8BcD5Cd8DYVgX_D9Le7j73aMJenfLudQbjGGFpYqsl1VktMVGIiXh17rjIQJ9cpCOIoWFrSdTdRyw9Dfd5bnxpV463o=@protonmail.com>
Message-ID: <CACdvm3PCt+b+9f2r7XesqCQp_86YaxiVODyfCZwy+umGjJVsJg@mail.gmail.com>

Good morning ZmnSCPxj,

The channel that is failing is then the channel *after* the error-reporting
> node (assuming bit `NODE` (`0x2000`) is not set in the `failure_code`: if
> it is a node-level error we should back off by one node and mark the erring
> node as unreliable).
>
> Indeed, the other insight here is that, if we were able to receive an
> error report from forwarding node N, this implies that every node and
> channel between us and node N is reliable.
> `permuteroute` reuses this prefix, since it is known-reliable.
>

I think this is more subtle than that. The thread I linked provided more
details, but in many cases you can't decide whether you should blacklist
only the channel *after* the failing node or also the channel *before* the
failing node. And it's even worse than that, if a node before the failing
one is malicious, it can force some next node to fail (by simply holding
the HTLC until close to the expiry) and in that case you should also
blacklist
some of the nodes *before* the failing node. And note that malicious nodes
with that behavior would happily forward the error onion because it
directly incriminates someone else.

I agree that we should optimize for the most common use-case (which
probably means ignoring these malicious node scenario for now), but I think
it's important to keep them in mind. At some point people will attack the
network so we need to give some thoughts about potential attacks and make
sure our algorithms can heal properly.
But that's not the most important discussion for this thread so let's
shelve that for now :).

The *real* issue is that costs are *both* fixed and proportional.
> So we need to select a balancing factor between the fixed and proportional
> costs.
>

I fully agree with that.

We can assume "past performance is an indicator of future performance" and
> record the average payment size of the user in order to determine how to
> balance the fixed and proportional costs.
> Picking an example value of say 1mBTC at the start, when the user has not
> used the node yet, seems reasonable.
>

I also agree this sounds reasonable, this is what we had in mind as a
starting point for eclair.

Most solutions to the network flow problem seem to require an accurate view
> of flows at each node, which we do not have.
>

Interesting, but for the first hop (local channels) we have the exact
balance available for sending, and for next hops we can consider the
channels
balanced (with a random perturbation of X%). The combination of that
and retries could provide interesting results (I plan on testing that on
realistic simulations of the network, I can't know for sure if this will
work until then).

My first implementation of MPP for eclair uses an algorithm similar to
flocking.
I think your last suggestion of using something similar to `permuteroute`
can be interesting to try too.
I'll give that a shot if we're not satisfied with the results of the
flocking implementation. Is it what you plan on doing for MPP in
c-lightning?

Cheers,
Bastien


Le ven. 2 ao?t 2019 ? 01:02, ZmnSCPxj <ZmnSCPxj at protonmail.com> a ?crit :

> Good morning Bastien,
>
> >
> > I think that the main points that make routing in the Lightning Network
> different from game path-finding algorithms are:
> >
> > -   Paths are consumed by payments, effectively moving available balance
> between sides of a channel
> > -   The routing algorithm doesn't know remote channels balance
> allocation (and that changes constantly)
> > -   The cost of a path depends on the value you're sending (proportional
> fees)
> > -
> > -   This encourages algorithms not to search for an optimal solution
> (because an optimal solution on outdated/incomplete data doesn't even make
> sense) but rather fast and good enough solutions with retries
>
> I believe the differences are smaller than you might initially think.
>
> Units move around on the map and a pathfinding algorithm cannot predict
> how the *other* units owned by allied players will be, once the current
> units asking for a path have moved along the path.
> i.e. the algorithm does not know how remote tiles are occupied (and that
> changes constantly)
>
> Faster units really should be able to walk around slower units, because
> there is often a tradeoff between speed and combat effectiveness and a
> player asking a faster unit to move probably is depending on their speed.
> i.e. paths can be blocked by slower units, effectively becoming
> slow-moving obstacles that need to be worked around.
>
> And so on.
>
> >
> > There are a few technicalities that might be a problem for some of your
> suggestions, I'm interested in your opinion on how to address them.
> >
> > For `permuteroute`, you mention the following pre-requisite:
> >
> > > the original payer is informed of which node reported the failure and
> which channel failed.
> >
> > We currently don't have a solution for reliable error reporting, as
> pointed out in [1].
> > I think making progress on this thread would be interesting and useful
> for routing heuristics.
>
> We already have sufficiently-good error reporting on route failures:
> https://github.com/lightningnetwork/lightning-rfc/blob/master/04-onion-routing.md#returning-errors
>
> > When an origin node receives an error message matching a transfer it
> initiated
> > (i.e. it cannot return-forward the error any further) it generates the
> `ammag`
> > and `um` keys for each hop in the route.
> > It then iteratively decrypts the error message, using each hop's `ammag`
> > key, and computes the HMAC, using each hop's `um` key.
>
> The number of iterations of decryption is how distant the error-reporting
> node is from the payer.
> Knowing the entire route, we can know which node reported the error.
> The channel that is failing is then the channel *after* the
> error-reporting node (assuming bit `NODE` (`0x2000`) is not set in the
> `failure_code`: if it is a node-level error we should back off by one node
> and mark the erring node as unreliable).
> The node reporting the error is the node that we start the limited-range
> search to "heal" the path.
> `permuteroute` does not *need* better error reporting than we already have.
>
> Of course, if a node is malingering and does not report anything, there is
> nothing we can do, but this is unavoidable anyway and does not prevent use
> of `permuteroute` in other cases either.
>
> Indeed, the other insight here is that, if we were able to receive an
> error report from forwarding node N, this implies that every node and
> channel between us and node N is reliable.
> `permuteroute` reuses this prefix, since it is known-reliable.
>
> >
> > I thought about path pre-computation and path caching, but what bothered
> me is that the cost depends on the amount you want to send.
> > When pre-computing / caching, you have to either ignore that completely
> (which can be fine, I don't think trying to always find the most
> > cost-efficient route is a reasonable goal) or take into account some
> kind of "universal" factor that works for most amounts. How did you take
> > that into account in your pre-computation experiments?
>
> Just to be clear: I have not run any experiments.
> I work on Lightning in a hobbyist capacity, am a small-time HODLer, and
> cannot even afford to run a mainnet LN node (which is why I had to ask Rene
> to time `getroute`).
> I mostly get by on sheer code review, tons of armchair reasoning, and
> sheer force of will.
>
> While costs depend on the amount you want to send, we observe that there
> are three main costs:
>
> * Risk of locking up funds for `cltv_delta` blocks
> * `fee_proportional_millionths`
> * `fee_base_msatoshi`
>
> Of these, the first two are proportional to the value being sent.
> So if you double the value, you double the cost, but you also double this
> same cost on *every* alternate path.
> And pathfinding algorithms do not judge the absolute cost, but the
> relative cost of every path.
> In short the cases below are equivalent and given the same map and the
> same source and destination, will result in the same path (assuming your
> variables do not overflow, of course):
>
> * Plains cost 1 movement point, Forests cost 2 movement point
> * Plains cost 2 movement point, Forests cost 4 movement point
> * Plains cost 3241 movement point, Forests cost 6482 movement point
>
> The issue is not so much that costs are proportional to the value being
> sent.
> The *real* issue is that costs are *both* fixed and proportional.
> So we need to select a balancing factor between the fixed and proportional
> costs.
>
> We can assume "past performance is an indicator of future performance" and
> record the average payment size of the user in order to determine how to
> balance the fixed and proportional costs.
> Picking an example value of say 1mBTC at the start, when the user has not
> used the node yet, seems reasonable.
>
> Using the average value here, assuming the distribution of values the user
> sends is the same in the future, minimizes the error between the cached
> result vs the actual resulting fees.
>
> Again, the point is that we need some sort of way to set limits on the
> fees and risk the user has for payments, hence the need for `maxfeepercent`
> and `maxdelay`.
> We cannot reliably get perfect paths on potentially-stale data anyway.
> So I think we can just use whatever path we find using precomputation and
> caching (using some "example value"), and then do a double-check that the
> generated path does not get past `maxfeepercent` and `maxdelay`.
> If the generated path gets past the limits, we fall back to a
> non-precomputed search: given a good-enough "example value" this should be
> rare in practice.
>
> >
> > I do agree that multi-part payments and trampoline (hierarchical
> routing) can offer a lot of room for algorithmic improvements and your
> > ideas on how to leverage them resonate with mine.
> >
> > An interesting thing to note is that trampoline (in the current proposal
> at least) allows trampoline nodes to leverage multi-part payments
> > "at each hop", meaning that a trampoline node can join/split arbitrarily
> an incoming payment to reach the next trampoline node.
>
> I agree, this is an interesting thing.
>
> > While implementing a first version of multi-part payments, I realized
> that they need to be tightly integrated to the routing algorithm.
> > Since each payment "consumes" a path, potentially "stealing" it from
> other payments, a naive implementation of multi-part payments
> >  would try to use different paths for each sub-payment, but that's an
> inefficient way of solving it. Working on multi-part payments made
> > me think that maybe our routing problem is more similar to a circulation
> or network flow problem [2] rather than path-finding. Have you
> > thought about this? If so what is your opinion?
>
> Most solutions to the network flow problem seem to require an accurate
> view of flows at each node, which we do not have.
> For multi-part, this is actually similar to the issue of sending a blob of
> units from one location to another, while keeping the units in a cohesive
> blob without them forming a line of units where everyone walks nearly the
> exact same path.
> (older RTSs tend to form these lines when sending blobs of units on
> long-distance trips, with the downside that on reaching the combat location
> units come to battle one at a time rather than all at once, reducing the
> impact of the blob; players learned to micromanage these lines near the
> combat location so that at least the first entry into the attack is a small
> group of units rather than a solitary one.)
> Walking nearly the exact same path is, incidentally, the thing we want to
> avoid in multi-part payments, incidentally, so we have a similar problem as
> RTS games with lines of units going into battle one-by-one.
>
> Hence, why I proposed the use of flocking, which is a technique used to
> retain cohesion of a blob of units.
> For example, some RTS games have a concept of putting their units "in
> formation", which is actually just a way to excuse the flocking behavior to
> the player.
>
> Another solution is to use `permuteoute`.
> Run normal single-pathfinding algo.
> Find the smallest-capacity channels in the returned route and
> `permuteroute` around those channels, resulting in an alternate route that
> avoids the smallest-capacity channels (which are more likely to fail when
> multiple payments run through them) but shares the rest of the path with
> the original route.
> Keep doing this until `permuteroute` fails, then we have a bunch of
> alternate routes we can try to use for multi-part routing.
>
> Regards,
> ZmnSCPxj
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.linuxfoundation.org/pipermail/lightning-dev/attachments/20190802/1cd3530b/attachment-0001.html>