Skip to content

Conversation

omkar3141
Copy link
Contributor

Improve the network message cache to be aware of network keys to prevent false duplicate detection across different subnets. This ensures that messages with the same source address and sequence number but from different network keys are not incorrectly identified as duplicates, as it can happen in certain cases. See ES-27446.

The approach uses credential pointer which never changes during runtime of the node as a proxy for netkey identification. This approach is simplified but still solves the need and avoids major changes to msg_cache_add() function as NID/NetKey index information is not sufficiently unambiguous at the time when this API is called. Because the information is hidden in two sub[] arrays within rx context and to decipher this info (to arrive at correct NID/NetKey) you will need several checks to figure out which state of the key refresh procedure is active, and which friendship/LPN function is active, and exactly which credential is being used.

Improve the network message cache to be aware of network keys to
prevent false duplicate detection across different subnets. This
ensures that messages with the same source address and sequence
number but from different network keys are not incorrectly
identified as duplicates, as it can happen in certain cases.
See ES-27446.

The approach uses credential pointer which never changes during runtime
of the node as a proxy for netkey identification. This approach is
simplified but still solves the need and avoids major changes
to msg_cache_add() function as NID/NetKey index information is not
sufficiently unambiguous at the time when this API is called. Because
the information is hidden in two sub[] arrays within rx context and
to decipher this info (to arrive at correct NID/NetKey) you will need
several checks to figure out which state of the key refresh procedure
is active, and which friendship/LPN function is active, and exactly
which credential is being used.

Signed-off-by: Omkar Kulkarni <omkar.kulkarni@nordicsemi.no>
@omkar3141 omkar3141 force-pushed the omku/update_net_cache_with_netkeyidx_p2 branch from f37629f to b6c3dda Compare October 1, 2025 15:21
Copy link

sonarqubecloud bot commented Oct 1, 2025

@alxelax
Copy link
Contributor

alxelax commented Oct 2, 2025

LGTM, please convert it to "Ready for review"

@@ -402,6 +406,7 @@ static void bt_mesh_net_local(struct k_work *work)
while ((node = sys_slist_get(&bt_mesh.local_queue))) {
struct loopback_buf *buf = CONTAINER_OF(node, struct loopback_buf, node);
struct bt_mesh_net_rx rx = {
.cred = NULL,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessary.

C99 Standard, ISO/IEC 9899:1999, Section 6.7.8 (Initialization), paragraph 21:

"If there are fewer initializers in a brace-enclosed list than there are elements or members of an aggregate, or fewer characters in a string literal used to initialize an array of known size than there are elements in the array, the remainder of the aggregate shall be initialized implicitly the same as objects that have static storage duration."


return bt_mesh_net_decrypt(&cred->enc, out, BT_MESH_NET_IVI_RX(rx),
proxy) == 0;
int err = bt_mesh_net_decrypt(&cred->enc, out, BT_MESH_NET_IVI_RX(rx), proxy);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move err definition to the beginning of the function.

uint32_t src : 15, /* MSb of source is always 0 */
seq : 17;
seq : 17;
const struct bt_mesh_net_cred *cred;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When cache is quite big, we can run into this situation:

  • device A receives message M from device B (in direct range) over subnet A with cred pointer A
  • right after this, Configuration Manager removes device A from subnet A and adds to subnet B
  • subnets in the code is an array, so the code will allocate first empty space in the array, which will be the address of subnet A.
  • a subnet bridge device which is in range of device A (but not in range of device B), eventually with some delay receives message M from device B and bridges it over subnet B.
  • device A receives the bridged message B over subnet B, it sees that SRC and SEQ are the same as for previously received message, cred pointer points to the same address and drops the message.
  • the first message M over subnet A was rejected by device A because there is no appkey attached to the netkey of this subnet A. But message B sent over subnet B should have been processed because it has appkey attached to the netkey of that subnet B.

This a very theoretical case, but there is a risk of using pointer to the key here as it points to an address that can be reused. Just wanted to highlight this.
key_ptr_msg_cache

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using array pointer can also let a message pass the message cache if secondary key was used to decrypt message M for the first time, but the KR procedure completed, key pointers were swapped and now the same message M relayed with some delay will go in again because cred pointers won't match.

Copy link
Contributor

@alxelax alxelax Oct 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using array pointer can also let a message pass the message cache if secondary key was used to decrypt message M for the first time, but the KR procedure completed, key pointers were swapped and now the same message M relayed with some delay will go in again because cred pointers won't match.

I guess after swapping keys, deobfuscation gives garbage address and sequence number since privacy is swapped too. There will not be relaying rings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess after swapping keys, deobfuscation gives garbage address and sequence number since privacy is swapped too. There will not be relaying rings.

The same key will be used. It will be on a different place (keys[0] instead of keys[1]). But pointer will change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is not issues of pointer usage. Network cache should be purged after removing subnetwork (friendship termination???). It prevents considering deprecated data for networking.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Network cache should be purged after removing subnetwork (friendship termination???).

But this is not the case for key refresh.

Purging or updating message cache may work, but I guess there is some extra risk in this as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, then storing subnetwork array index instead should work. It will solve swapping keys issue.
Additionally, BT_MESH_KEY_DELETED event handler in net.c should purge network cache.
It solves your described scenarios

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use NetKey index the same way here as pointers but instead of them? If we assume that the probability of generating SEQ and SRC as garbage is unlikely by any other key, then using netkey index is the same as using pointer to keys that were used to successfully decrypt the last message.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems NetKey index should be sufficient.

Copy link
Contributor Author

@omkar3141 omkar3141 Oct 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue with NetKey index is arriving at it in the API in a way that does not break API is problematic (because of friendship and LPNs). I had made prototype with that first but discarded it for simplicity of this approach. Concern about using old pointer address as index is valid, but this really far-fetched case. The problem gets resolved as soon as a the same message with a new SEQ number arrives. So, even if message is incorrectly ignored at this case, the system continues function as expected on the next same (or different) model message from same source with different SEQ. What this solves is a systematic failure that always happens when stable network configuration exists and implementations are behaving in certain ways.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants