From ZmnSCPxj at protonmail.com  Sat Jun 15 02:53:16 2019
From: ZmnSCPxj at protonmail.com (ZmnSCPxj)
Date: Sat, 15 Jun 2019 02:53:16 +0000
Subject: [Lightning-dev] Improve Lightning payment reliability through
	better error attribution
In-Reply-To: <CAJBJmV9t2ygmn2o6bFCwdXpebVKHAQebZUUdt9QfTugcdsCzhA@mail.gmail.com>
References: <CAJBJmV-TGo0sE2-3GVtDvewj8E=ONd9bv-2bRqjkjV870qrCDQ@mail.gmail.com>
	<CACdvm3OXibyyBJW9NgK_o0m3W0VK0bodpnZ3a4+UdgP+Jux45w@mail.gmail.com>
	<VLqGxfXFptkC42VUIja6DsiVTFfFF3M2CqJifBOdb0bMHTKySFbm-tVl-y8GWCuNSh4qIriy0EAiv3n0j_8jJiEdgC8aI6ZdeQIdHDGQjP0=@protonmail.com>
	<CAJBJmV-Wg5KAhVsgVJJv8Bp52HDMP6K0vd5t+Gyekxrn1n0DkA@mail.gmail.com>
	<CAJBJmV--+RYNJaH=g10EKgGh==47jBwO922PpevoQTzgyocw=A@mail.gmail.com>
	<CAJBJmV-WEDjZW8S5Ud=6NpZcgC+3Eu56piBVHVd3Eb_JA-Gr+g@mail.gmail.com>
	<Cc2L0OrkDyHH7y1t6--ndZY63XWDgEhTWEyPYFdGhFcBdEPAWSXw0jvKsVM3hdzLCNJy2mUxuPtJtdSmoydkJyuv-CRG8yuW3zv6l5QRMpk=@protonmail.com>
	<CAJBJmV9t2ygmn2o6bFCwdXpebVKHAQebZUUdt9QfTugcdsCzhA@mail.gmail.com>
Message-ID: <iYEps2Zv0odIFX6VU2rXqH_5nfGeDNAJj-l-296rI3pZnYIZ1MfgFZaCWGdhUW7D8w487v44kPgY8M1mZDOb-9O3VJ9U7du_RcwLFmvoxRI=@protonmail.com>

Good morning Joost,

> Yes that is accurate, although using the time difference between receiving the `update_add_htlc` and sending back the `update_fail_htlc` would work too. It would then include the node's processing time.

It would not work safely.
A node can only propagate an `update_fail_htlc` if the downstream `update_fail_htlc` has been irrevocably committed by `revoke_and_ack`.
See the BOLT spec (BOLT #2, on forwarding HTLCs) about this.

Suppose we have a route A -> B -> C.
C sends `update_fail_htlc` immediately, but dallies on `revoke_and_ack`.
B cannot send `update_fail_htlc` to A yet, because C can still drop the previous B-C channel state onchain (it is not yet revoked, that is what the `revoke_and_ack` will later do).
If B sends `update_fail_htlc` to A as soon as it receives `update_fail_htlc` from C, A can use the new A-B channel state onchain, while at the same time C drops the previous B-C channel state onchain.
The new A-B channel state returns the HTLC to A, while the previous B-C channel state has the HTLC still claimable by C, causing B to lose funds.

For `update_fulfill_htlc` B can immediately propagate to A (without waiting for `revoke_and_ack` from C) since C is already claiming the money.
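
A minimal sketch of the propagation rule described above, from B's point of view.
The `DownstreamHtlc` class and its field names are hypothetical and only model this discussion; they are not structures from the BOLT spec or from any implementation, and the commitment exchange is simplified to a single "old state revoked" flag.

```python
from dataclasses import dataclass


@dataclass
class DownstreamHtlc:
    """B's view of one HTLC it offered to C (hypothetical model)."""
    fail_received: bool = False      # C sent update_fail_htlc
    fulfill_received: bool = False   # C sent update_fulfill_htlc
    old_state_revoked: bool = False  # C sent revoke_and_ack for the state
                                     # that still contained the HTLC


def can_propagate_upstream(htlc: DownstreamHtlc) -> bool:
    """Return True if B can safely send the result to A."""
    if htlc.fulfill_received:
        # C is already claiming the money, so B propagates immediately.
        return True
    if htlc.fail_received:
        # C could still drop the old B-C state onchain until it is revoked.
        return htlc.old_state_revoked
    return False
```

With this model, a failure that C has sent but not yet revoked stays held at B, while a fulfillment passes through at once.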

Since B cannot report the `update_fail_htlc` immediately, its timer should still be running.
Suppose we counted only up to `update_fail_htlc` and not up to the `revoke_and_ack`.
If C sends `update_fail_htlc` immediately, then the `update_add_htlc`->`update_fail_htlc` time reported by B would be fast.
But if C then does not send `revoke_and_ack`, B cannot safely propagate `update_fail_htlc` to A, so the time seen by A will be slow.
This sudden jump in time between A and B will be blamed on A and B, while C goes unpunished.
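
To put hypothetical numbers on this, here is a small sketch.
The timestamps and function names are assumptions made up for illustration, and network latency and B's own processing time are ignored.

```python
# Hypothetical timeline in seconds, counted from when B offers the HTLC to C.
C_SENDS_FAIL = 1.0     # C sends update_fail_htlc immediately
C_SENDS_REVOKE = 60.0  # C dallies on revoke_and_ack


def time_reported_by_b(wait_for_revoke: bool) -> float:
    """Time B reports for the B-C hop under each counting rule."""
    return C_SENDS_REVOKE if wait_for_revoke else C_SENDS_FAIL


def time_seen_by_a() -> float:
    """B can only fail the HTLC back to A after C's revoke_and_ack."""
    return C_SENDS_REVOKE


def unexplained_gap(wait_for_revoke: bool) -> float:
    """Delay that shows up between A and B but is not covered by B's report."""
    return time_seen_by_a() - time_reported_by_b(wait_for_revoke)


# Counting only up to update_fail_htlc leaves 59 seconds pinned on A and B.
print(unexplained_gap(wait_for_revoke=False))
# Counting up to revoke_and_ack leaves nothing unexplained; the delay sits on the B-C hop.
print(unexplained_gap(wait_for_revoke=True))
```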

That is why, for failures, we ***must*** wait for `revoke_and_ack`.
The node must report the time when it can safely propagate the error report upstream, not the time it receives the error report.
For payment fulfillment, `update_fulfill_htlc` is fine without waiting for `revoke_and_ack` since it is always reported immediately upstream anyway.

See my discussion about "fast forwards": https://lists.linuxfoundation.org/pipermail/lightning-dev/2019-April/001986.html

> I think we could indeed do more with the information that we currently have and gather some more by probing. But in the end we would still be sampling a noisy signal. More scenarios to take into account, less accurate results and probably more non-ideal payment attempts. Failed, slow or stuck payments degrade the user experience of lightning, while "fat errors" arguably don't impact the user in a noticeable way.

Fat errors just give you more information when a problem happens for a "real" payment.
But the problem still occurs on the "real" payment and user experience is still degraded.

Background probing gives you the same information **before** problems happen for "real" payments.

Regards,
ZmnSCPxj