From: ZmnSCPxj at protonmail.com (ZmnSCPxj)
Date: Sat, 15 Jun 2019 02:53:16 +0000
Subject: [Lightning-dev] Improve Lightning payment reliability through better error attribution

Good morning Joost,

> Yes that is accurate, although using the time difference between receiving the `update_add_htlc` and sending back the `update_fail_htlc` would work too. It would then include the node's processing time.

It would not work safely.
A node can only propagate an `update_fail_htlc` if the downstream `update_fail_htlc` has been irrevocably committed by `revoke_and_ack`.
See BOLT spec about this.

Suppose we have a route A -> B -> C.
C sends `update_fail_htlc` immediately, but dallies on `revoke_and_ack`.
B cannot send `update_fail_htlc` to A yet, because C can still drop the previous B-C channel state onchain (it is not yet revoked, that is what the `revoke_and_ack` will later do).
If B sends `update_fail_htlc` to A as soon as it receives `update_fail_htlc` from C, A can use the new A-B channel state onchain, while at the same time C drops the previous B-C channel state onchain.
The new A-B channel state returns the HTLC to A, while the previous B-C channel state has the HTLC still claimable by C, causing B to lose funds.

For `update_fulfill_htlc`, B can immediately propagate to A (without waiting for `revoke_and_ack` from C) since C is already claiming the money.

Since B cannot report the `update_fail_htlc` immediately, its timer should still be running.

Suppose we counted only up to `update_fail_htlc` and not up to the `revoke_and_ack`.
If C sends `update_fail_htlc` immediately, then the `update_add_htlc`->`update_fail_htlc` time reported by B would be fast.
But if C then does not send `revoke_and_ack`, B cannot safely propagate `update_fail_htlc` to A, so the time reported by A will be slow.
This sudden jump in time between A and B will be blamed on A and B, while C goes unpunished.
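
To make the timing argument concrete, here is a minimal sketch in Python.
It is not taken from any implementation: the names `FailTimeline`, `b_reported_time`, `time_seen_at_a` and `unexplained_gap`, and the millisecond figures, are made up for illustration.
It also assumes B forwards messages with zero processing time and ignores the `commitment_signed` round trips.

```python
from dataclasses import dataclass


@dataclass
class FailTimeline:
    """Times in ms on B's clock for one failed HTLC on route A -> B -> C."""
    add_sent_to_c: int   # B sends update_add_htlc to C
    fail_from_c: int     # B receives update_fail_htlc from C
    revoke_from_c: int   # B receives the revoke_and_ack that makes the removal irrevocable


def b_reported_time(t: FailTimeline, wait_for_revoke: bool) -> int:
    """Time B reports for the B-C hop, under either timer rule."""
    stop = t.revoke_from_c if wait_for_revoke else t.fail_from_c
    return stop - t.add_sent_to_c


def time_seen_at_a(t: FailTimeline) -> int:
    """Time until A gets the failure: B may only propagate after revoke_and_ack."""
    return t.revoke_from_c - t.add_sent_to_c


def unexplained_gap(t: FailTimeline, wait_for_revoke: bool) -> int:
    """Delay that B's report does not account for, so it lands on the A-B pair."""
    return time_seen_at_a(t) - b_reported_time(t, wait_for_revoke)


# C fails the HTLC after 100 ms but dallies 30 s on revoke_and_ack.
slow_c = FailTimeline(add_sent_to_c=0, fail_from_c=100, revoke_from_c=30_000)

print(unexplained_gap(slow_c, wait_for_revoke=False))  # 29900: blamed on A and B
print(unexplained_gap(slow_c, wait_for_revoke=True))   # 0: delay is reported on the B-C hop
```

With the timer stopped at `update_fail_htlc`, almost the whole delay caused by C shows up as a gap between A and B; with the timer stopped at `revoke_and_ack`, B's report covers it.
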

That is why, for failures, we ***must*** wait for `revoke_and_ack`.
The node must report the time when it can safely propagate the error report upstream, not the time it receives the error report.

For payment fulfillment, `update_fulfill_htlc` is fine without waiting for `revoke_and_ack` since it is always reported immediately upstream anyway.

See my discussion about "fast forwards": https://lists.linuxfoundation.org/pipermail/lightning-dev/2019-April/001986.html

> I think we could indeed do more with the information that we currently have and gather some more by probing. But in the end we would still be sampling a noisy signal. More scenarios to take into account, less accurate results and probably more non-ideal payment attempts. Failed, slow or stuck payments degrade the user experience of lightning, while "fat errors" arguably don't impact the user in a noticeable way.

Fat errors just give you more information when a problem happens for a "real" payment.
But the problem still occurs on the "real" payment and user experience is still degraded.

Background probing gives you the same information **before** problems happen for "real" payments.

Regards,
ZmnSCPxj