Skip to content

Conversation

@guillaumemichel
Copy link
Contributor

@guillaumemichel guillaumemichel commented Jan 31, 2023

This PR defines the Spec for the Double Hash DHT.

All feedback and contributions are welcome.

The Migration section is still WIP.

cc: @yiannisbot, @masih, @ischasny , @willscott, @noot

@guillaumemichel guillaumemichel requested a review from a team as a code owner January 31, 2023 14:36
@lidel lidel self-requested a review February 6, 2023 14:35
3. Client sends a DHT lookup request for `KeyPrefix` to these DHT servers. The request contains a flag to specify whether Client wants the `multiaddrs` associated with the `CPPeerID` or not.
4. The DHT servers find the 20 closest `PeerID`s to `KeyPrefix` in XOR distance (see [algorithm](#closest-keys-to-a-key-prefix)). Add these `PeerID`s and their associated multiaddresses (if applicable) to the `message` that will be returned to Client.
5. The DHT servers search if there are entries matching `KeyPrefix` in their Provider Store.
6. For all entries `HASH2` of the Provider Store where `HASH2[:len(KeyPrefix)]==KeyPrefix`, add to `message` the following encrypted payload: `EncPeerID || SERVERNONCE || AESGCM(ServerKey, SERVERNONCE, TS || Signature || multiaddrs)`, `SERVERNONCE` being a randomly generated 12-byte array, for `multiaddrs` being the multiaddresses associated with `CPPeerID` (if applicable) if the `multiaddrs` were requested by Client. If more than `MatchLimit` distinct `HASH2`s match the requested `KeyPrefix`, the DHT Server doesn't return any Provider Record, and adds the number of `HASH2` matching `KeyPrefix` along with its own `MatchLimit` to `message`.
Copy link
Member

@masih masih Feb 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • The word "encrypted" suggests the entire message is encrypted again, and technically the message is partially encrypted, i.e. SERVERNONCE is not encrypted. Aside from that Enc prefix in EncPeerID and AESGSM(...) is enough to clarify which bits are. So i'd suggest removing this word.

  • Where did multiaddrs come from? There is no mention of addresses in "Publish Process".

    • It would be great to spell out to the reader that addresses are propagated on other/existing mechanisms unencrypted, etc.
  • The response to double hashed lookup in IPNI is incompatible with what's proposed here; IPNI returns:

    [0xa2, Nonce, AESGCM(MH, Nonce, PeerID + ContextID)].

    Whereas, DHT returns:

    [0xa2, Nonce, AESGCM(MH, Nonce, CPPeerID)] || SERVERNONCE || AESGCM(ServerKey, SERVERNONCE, TS || Signature || multiaddrs)

    Two things to figure out and clarify here please:

    1. The DHT response contains two encrypted values sandwiched between an unencrypted value that is SERVERNONCE. How would a client know where in a given message one section ends and the other starts? The spec here should clarify this.

    2. Given two messages, one from IPNI and one from DHT how would a client know whether any remaining bytes are ContextID from IPNI or SEVERNONCE || AESGCS... from DHT? Maybe the clarification above would solve this. The key thing to point out here is that we should facilitate programatic compatibility and graceful response decoding between the two routing systems for the first section of the message, i.e. what this document refers to as EncPeerID.

      Depending on the clarification of the first item, we could assume that if the user reads a multihash from the the decrypted EncPeerID and no bytes remain then the record has no context ID and therefore is from DHT. If this is a safe assumption It would be great to document the ability to specify "optional" bytes after CPPeerID which should be ignored by the DHT flavour of lookup. Or something along those lines.

  • The second ecrypted section should follow the same pattern re varint prepend for the same reasons it was added to EncPeerID section I think :)

Suggested change
6. For all entries `HASH2` of the Provider Store where `HASH2[:len(KeyPrefix)]==KeyPrefix`, add to `message` the following encrypted payload: `EncPeerID || SERVERNONCE || AESGCM(ServerKey, SERVERNONCE, TS || Signature || multiaddrs)`, `SERVERNONCE` being a randomly generated 12-byte array, for `multiaddrs` being the multiaddresses associated with `CPPeerID` (if applicable) if the `multiaddrs` were requested by Client. If more than `MatchLimit` distinct `HASH2`s match the requested `KeyPrefix`, the DHT Server doesn't return any Provider Record, and adds the number of `HASH2` matching `KeyPrefix` along with its own `MatchLimit` to `message`.
6. For all entries `HASH2` of the Provider Store where `HASH2[:len(KeyPrefix)]==KeyPrefix`, add to `message` the following payload: `EncPeerID || SERVERNONCE || AESGCM(ServerKey, SERVERNONCE, TS || Signature || multiaddrs)`, `SERVERNONCE` being a randomly generated byte array of length 12, for `multiaddrs` being the multiaddresses associated with `CPPeerID` (if applicable) if the `multiaddrs` were requested by Client. If more than `MatchLimit` distinct `HASH2`s match the requested `KeyPrefix`, the DHT Server doesn't return any Provider Record, and adds the number of `HASH2` matching `KeyPrefix` along with its own `MatchLimit` to `message`.

Cc @ischasny @willscott

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The word "encrypted" suggests the entire message is encrypted again, and technically the message is partially encrypted, i.e. SERVERNONCE is not encrypted. Aside from that Enc prefix in EncPeerID and AESGSM(...) is enough to clarify which bits are. So i'd suggest removing this word.

  1. Agreed and done.

Where did multiaddrs come from? There is no mention of addresses in "Publish Process".

  1. The multiaddrs come from the node's libp2p Peerstore. This address is learnt at the moment where the Content Provider Publishes the Provider record, or during any other interaction with this peer. Note that the multiaddrs may be stale. Currently the multiaddrs only stay in the peerstore for 30 min, but we will be extending the TTL Increase provider Multiaddress TTL libp2p/go-libp2p-kad-dht#795.

I will add a sentence about it.

The DHT response contains two encrypted values sandwiched between an unencrypted value that is SERVERNONCE. How would a client know where in a given message one section ends and the other starts? The spec here should clarify this.

3i. Good point! We should add the size of encrypted values in front of them. I will add it to the document.

Given two messages, one from IPNI and one from DHT how would a client know whether any remaining bytes are ContextID from IPNI or SEVERNONCE || AESGCS... from DHT? Maybe the clarification above would solve this. The key thing to point out here is that we should facilitate programatic compatibility and graceful response decoding between the two routing systems for the first section of the message, i.e. what this document refers to as EncPeerID.

3ii. Maybe the best option to have a common format for all content routers would be to use 0xa501 || Nonce || payload_len || AESGCM(MH, Nonce, CPPeerID) as a common format, and append additional fields to enable composability. The additional fields could be AESGCM(MH, Nonce+1, ContextID) for IPNI and 0xa501 || SERVERNONCE || payload_len || AESGCM(ServerKey, SERVERNONCE, TS || Signature || multiaddrs) for the DHT.

It would be great to have some type of id or varint, uniquely identifying each content router and its versions to include in front of the additional fields. WDYT @masih ?

The second ecrypted section should follow the same pattern re varint prepend for the same reasons it was added to EncPeerID section I think :)

  1. Agreed and done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great to have some type of id or varint, uniquely identifying each content router and its versions to include in front of the additional fields.

I agree, it makes sense to use the simimarl multicodec patter for the inner encrypted payload. That then should differentiate what unencrypted record the client is dealing with. So far I think we have two kinds of routing records: DHT and IPNI

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, it makes sense to use the simimarl multicodec patter for the inner encrypted payload. That then should differentiate what unencrypted record the client is dealing with. So far I think we have two kinds of routing records: DHT and IPNI

We could start a new repository (e.g multiformats/multirouter-codec) specifying a codec and associated varint for different content routers and their versions, each content router will probably change the metadata format over time. We don't need to use the same multiformats/multicodec repo, as we would never use the existing records.
WDYT @masih ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure. I see the whole point of multicodec CSV table to be its singularity. It is already context agnostic: it contains codes from lots of systems that are self-contained/separate from one-another.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we could add additional codes there to define the formats used in the DHT and IPNI.

e.g the format for the DHT would be ENC_FNC_VARINT || SERVERNONCE || payload_len || AESGCM(ServerKey, SERVERNONCE, TS || Signature || multiaddrs).

The format of the payload would be EncPeerID || DHTv0_FORMAT_CODEC || 0x8040 || SERVERNONCE || payload_len || AESGCM(ServerKey, SERVERNONCE, TS || Signature || multiaddrs)

Copy link
Member

@masih masih Feb 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of being able to tell what the record is without having to decrypt it for performance reasons.

  • Is this something to worry about in terms of revealing information in the context of privacy?
  • having two multicodecs MULTICODEC || 0x8040 in a row seems slightly odd: the first MULTICODEC then needs to specify how the second multicodec should be interpreted, i.e. ecrypted and non-encryted forms of the same record kind, which means we need some multicodec to say "not encrypted" (which is also slightly odd).
    • At which point we might as well have a single multicodec to combine the two, e.g. aes-gcm-256-dht-v0-format. However, doing this means for every encrypted record kind we need a new multicodec.
      • This is not necessarily bad; it just means none of the existing multicodecs are reused. For example, imagine introducing privacy-enabled IPNS records. There is already a multicodec for IPNS records. If we want to have those records encrypted with AESGCM256, we'd need to introduce a whole new codec for aes-gcm-256-ipns-records.
      • But if we nest the multicodecs then we don't need to redefine codeces: any data could be encrypted, and after decryption read a varint as usual to infer what the remaining bytes mean.
      • I have no hard feelings here. Either works as long as one can explicitly infer payload types instead of having to implicitly know where the payload came from.
      • Maybe @aschmahmann or @lidel have alternative suggestions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed I think that it makes sense to combine both multicodecs into a single one. dht-lookup-response-metadata-format-v0 is defined with aes-gcm-256, and when we decide to change the encryption algorithm, we create dht-lookup-response-metadata-format-v1 referencing a new encryption algorithm. We need to make sure to have a wide range reserved for this usage in the multicodec, as formats will probably change over time, and there are multiple content routers.

There is nothing to hide in the data format, so it isn't a problem to have the codec in the clear. The only information a curious user could guess is the number of multiaddrs that the DHT Server has in its peerstore for the Content Provider.

Copy link
Member

@yiannisbot yiannisbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent writeup @guillaumemichel ! Thanks! Left some editorial comments and some questions. A couple of graphs would help, but these can come at a second stage.

guillaumemichel and others added 5 commits February 14, 2023 09:19
Co-authored-by: Masih H. Derkani <m@derkani.org>
Co-authored-by: Yiannis Psaras <52073247+yiannisbot@users.noreply.github.com>
Co-authored-by: Yiannis Psaras <52073247+yiannisbot@users.noreply.github.com>
Co-authored-by: Yiannis Psaras <52073247+yiannisbot@users.noreply.github.com>
Co-authored-by: Yiannis Psaras <52073247+yiannisbot@users.noreply.github.com>
guillaumemichel and others added 2 commits February 28, 2023 12:04
Co-authored-by: Jorropo <jorropo.pgm@gmail.com>
Co-authored-by: Jorropo <jorropo.pgm@gmail.com>
Co-authored-by: Dennis Trautwein <dennis.trautwein@posteo.de>
guillaumemichel and others added 10 commits February 28, 2023 13:53
Co-authored-by: Jorropo <jorropo.pgm@gmail.com>
Co-authored-by: Jorropo <jorropo.pgm@gmail.com>
Co-authored-by: Jorropo <jorropo.pgm@gmail.com>
Co-authored-by: Jorropo <jorropo.pgm@gmail.com>
Co-authored-by: Jorropo <jorropo.pgm@gmail.com>
Co-authored-by: Jorropo <jorropo.pgm@gmail.com>
Co-authored-by: Jorropo <jorropo.pgm@gmail.com>
Co-authored-by: Jorropo <jorropo.pgm@gmail.com>
Co-authored-by: Jorropo <jorropo.pgm@gmail.com>
Co-authored-by: Jorropo <jorropo.pgm@gmail.com>
3. Client sends a DHT lookup request for `KeyPrefix` to these DHT servers. The request contains a flag to specify whether Client wants the `multiaddrs` associated with the `CPPeerID` or not.
4. The DHT servers find the 20 closest `PeerID`s to `KeyPrefix` in XOR distance (see [algorithm](#closest-keys-to-a-key-prefix)). Add these `PeerID`s and their associated multiaddresses (if applicable) to the `message` that will be returned to Client.
5. The DHT servers search if there are entries matching `KeyPrefix` in their Provider Store.
6. For all entries `HASH2` of the Provider Store where `HASH2[:len(KeyPrefix)]==KeyPrefix`, add to `message` the following encrypted payload: `EncPeerID || SERVERNONCE || AESGCM(ServerKey, SERVERNONCE, TS || Signature || multiaddrs)`, `SERVERNONCE` being a randomly generated 12-byte array, for `multiaddrs` being the multiaddresses associated with `CPPeerID` (if applicable) if the `multiaddrs` were requested by Client. If more than `MatchLimit` distinct `HASH2`s match the requested `KeyPrefix`, the DHT Server doesn't return any Provider Record, and adds the number of `HASH2` matching `KeyPrefix` along with its own `MatchLimit` to `message`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great to have some type of id or varint, uniquely identifying each content router and its versions to include in front of the additional fields.

I agree, it makes sense to use the simimarl multicodec patter for the inner encrypted payload. That then should differentiate what unencrypted record the client is dealing with. So far I think we have two kinds of routing records: DHT and IPNI

- **Provider Record** is defined as a pointer to the storage location of some content identified by `CID` or `HASH2`. A Provider Record consists on the following fields: [`EncPeerID`, `TS`, `Signature`].
- **Provider Store** is the data structure on the DHT Servers used to store the Provider Records. Its structure is a [nested key-value store](#provider-store): `HASH2` -> `ServerKey` -> `CPPeerID` -> Provider Record.

### Wire formats
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you're waiting for the implementation to be further along, but having the proposed wire formats for the RPC calls (and effectively a new version of https://github.com/libp2p/specs/tree/master/kad-dht for this new version) would be helpful to understand some of the concrete changes and give feedback.

If you're going to be doing a network fork here it might also be easy enough to make some other wire format changes that will improve how the network operates.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: 🏃 In Progress

Development

Successfully merging this pull request may close these issues.

10 participants