Commit b05f341

Transfer-v2: Updated proposal
1 parent b971841 commit b05f341

1 file changed

+162
-59
lines changed

file-transfer-protocol.md

Lines changed: 162 additions & 59 deletions
@@ -10,18 +10,19 @@ them before sending through the rendezvous server to the peer).
1010

1111
## Application version
1212

13-
The main key in the `app_version` object is called `abilities`, which is an array of strings. The known values are: `["transfer-v1", "transfer-v2"]`. Unknown values and keys have to be accepted by every client. An ability may specify additional hints to store in the object as well. If the value is empty (`{}`), `{abilities = ["transfer-v1"];}` must be assumed for backwards compatibility. `transfer-v1` SHOULD always be supported.
13+
The main key in the `app_version` object is called `abilities`, which is an array of strings. The known values are: `["transfer-v1", "transfer-v2"]`. Unknown values and keys have to be accepted by every client. An ability may specify additional hints to store in the object as well. If the value is empty (`{}`), `{abilities = ["transfer-v1"];}` must be assumed for backwards compatibility. `transfer-v1` should always be supported.
1414

15-
The sender gets to pick a protocol version and capabilities based on the version information of the peer. The receiver distinguishes which protocol is used on the first incoming message.
15+
The sender gets to pick a protocol version and capabilities based on the version information of the peer. The receiver distinguishes which protocol is used on the first incoming message. (Therefore, different protocol versions must be distinguishable on the first message.)
1616

1717
**Example value:**
1818

1919
```json
2020
{
21-
abilities: ["transfer-v1", "transfer-v2"],
22-
transfer-v2-hints: {
23-
supported-formats: ["tar.zst"]
24-
}
21+
"abilities": ["transfer-v1", "transfer-v2"],
22+
"transfer-v2": {
23+
"supported-formats": ["plain", "zst"],
24+
"transit-abilities": ["direct-tcp-v1", "relay-v1"]
25+
}
2526
}
2627
```
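
As a rough illustration of the backwards-compatibility rule above, the following Python sketch shows how a client might interpret a peer's `app_version` object. The function names `peer_abilities` and `pick_protocol` are hypothetical, and treating a missing `transfer-v2` dictionary as "fall back to version 1" is an assumption, not something the text mandates.

```python
def peer_abilities(app_version: dict) -> list:
    """Return the peer's abilities, assuming transfer-v1 when none are given."""
    abilities = app_version.get("abilities")
    # An empty object ({}) must be read as {"abilities": ["transfer-v1"]}.
    if not isinstance(abilities, list) or not abilities:
        return ["transfer-v1"]
    # Unknown values must be accepted, so they are kept rather than rejected.
    return [a for a in abilities if isinstance(a, str)]


def pick_protocol(app_version: dict) -> str:
    """Sender-side choice of protocol based on the peer's version information."""
    hints = app_version.get("transfer-v2")
    # Assumption: without the accompanying dictionary we fall back to v1.
    if "transfer-v2" in peer_abilities(app_version) and isinstance(hints, dict):
        return "transfer-v2"
    return "transfer-v1"
```

Unknown keys in the object are simply ignored by this sketch, which matches the requirement that every client accepts them.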
2728

@@ -112,103 +113,205 @@ the Transit connection. The final ack of the received data is sent through
112113
the Transit object, as a UTF-8-encoded JSON-encoded dictionary with `ack: ok`
113114
and `sha256: HEXHEX` containing the hash of the received data.
114115

115-
## Transfer v2 (proposal)
116+
## Transfer v2
116117

117-
A v2 of the file transfer protocol got invented to add the following features:
118+
Version 2 of the file transfer protocol was designed to add the following features:
118119

119120
- Resumable transfers after a connection interruption
120121
- No need to build a temporary zip file; for both speed and space efficiency reasons. Also zip has a lot of other subtle limitations.
122+
<!-- - Allow for multiple transfer from both sides using a single connection -->
121123

122-
The feature of sending text messages (without a transit connection), on the other hand, got removed.
123-
124-
### Basic protocol
124+
The feature of sending text messages (without a transit connection), on the other hand, was removed (version 1 serves us well for that purpose).
125+
All transfers may contain multiple files: This covers both the "single file" use
126+
case as well as the "folder" use case.
125127

126-
The sender sends an offer, which contains a list of all the files, their size, modification time, and a transfer identifier that can be used to resume connections. The attempt to send the same files twice should use with the same identifier. How it is generated is an implementation detail, the suggested method is to either store it locally or to use the hash of the absolute path of the folder being sent.
128+
### Application version
127129

128-
The receiver responses either with either a `"transfer rejected"` error of with an acknowledgement. The acknowledgement may contain a list of byte offsets, one for each file, which will tell the sender from where to resume the transfer.
130+
Setting the `transfer-v2` ability also requires providing a `transfer-v2` dictionary with the following values:
131+
`supported-formats` (see below) and `transit-abilities`, which is the same as `abilities-v1` in the version 1 specification. The transit abilities are exchanged earlier than in version 1 so that the `transit` message may
132+
only contain the hints for abilities both sides support, which avoids wasting effort.
129133

130-
Both do the negotiation to open a transit relay. The process to doing so is slightly different from the one in the first version. The set of supported abilities is already delivered during the file offer/ack. Thus, the `transit` message only contains the hints for methods both sides support. Both side try to connect to every hint of the other side, the sender will then confirm the first one that succeeded.
134+
#### Supported formats
131135

132-
The sender then sends the requested bytes over the relay using one of the supported formats. Afterwards, it sends a message with checksums. The receiver then closes the connections, optionally with sending an error message on a checksum mismatch.
136+
Known formats are `plain` and `zst`. The former indicates uncompressed data and
137+
must be supported by all clients; all other formats are optional. TODO
138+
At the moment, the only supported format is `zst`. The details are up to the sender; a low compression level is recommended.
133139

134-
#### Supported formats
140+
### Overview
135141

136-
At the moment, the only supported format is `tar.zst`. The files are sent bundled as a tar ball, compressed with zstd. The details are up to the sender; a low compression level is recommended. Only the files requested by the sender must be sent, and only the bytes starting from the requested offset must be contained.
142+
Both sides immediately negotiate a transit connection. Once established, they start communicating over it and close
143+
the rendezvous connection. All messages over the relay connection are encoded using [msgpack](https://msgpack.org/) instead of JSON
144+
to allow binary payloads. (All protocol examples in this document will use JSON for readability.)
137145

138-
### The structs in detail
146+
- The sender starts by sending an offer. The receiver accepts it and receives the bytes.
147+
- The receiver rejects the offer by closing the connection with an error.
148+
- The connection is closed once all accepted files have been transferred (and checked).
139149

140-
#### Send offer
150+
### Transit hints
141151

142-
File paths must be normalized and relative to the root of the sent folder. If the sender's file system does not support modification times, `mtime` must be constant (preferably `0`). Sending a file is the same as sending a directory with a single file. `directory-name` is the name of the directory being sent. It must be present unless `files` contains exactly one item. `files` must not be empty.
152+
This is the first and (usually) also last message sent over the Wormhole connection.
153+
As the first message, it is the distinguisher for version 2 file transfer. As the last message, all following communication uses the transit connection, encoded using `msgpack`.
154+
Both sides then close their Wormhole connection as soon as transit is established.
155+
The message type is `transit-v2` and it is equivalent to the v1 `transit` message,
156+
except that it only contains the hints (the abilities have already been sent earlier).
143157

144158
```json
145159
{
146-
"offer-v2": {
147-
"directory-name": "<string, optional>",
148-
"files": [
149-
{
150-
"path": "<string>",
151-
"size": "<integer>",
152-
"mtime": "<integer>"
153-
}
154-
],
155-
"transit-abilities": "<list, subset of ['direct-tcp-v1', 'relay-v1', 'tor-tcp-v1']>"
156-
};
160+
"transit-v2": {
161+
"hints-v1": [ ]
162+
}
157163
}
158164
```
159165

160-
#### Receive ack
166+
### Send offer
167+
168+
A send offer has only one entry, but that entry may contain a recursive directory
169+
structure. If the top level entry is not a file, receiving clients may display
170+
the offer either as a single folder or as a list of files.
171+
172+
File names may be *arbitrary* (but UTF-8 encoded); it is up to the receiver to
173+
sanitize them. Handling of unsupported file names is implementation specific,
174+
but could for example be realized through escaping or rejection of the offer.
161175

162-
`files` contains a mapping from file (index) to offset (bytes). If omitted, all files must be sent.
176+
If the sender's file system does not support modification times, `mtime` must be constant (preferably `0`).
177+
`files` must not be empty. If there are multiple files, `directory-name` may be set to mark
178+
this transfer as a directory instead of a loose collection of files. If it is not present, `path`
179+
must have a depth of one, i.e. only contain the file name.
180+
The `format` must be one that both sides support.
181+
182+
`type` must be one of `"regular-file"`, `"directory"` and `"symlink"`. Regular
183+
files have an additional `size` field (in bytes) and a transfer `id`. Directories have a
184+
`content` field, which contains a list of direct children. Symlinks have a
185+
`target` path.
163186

164187
```json
165188
{
166-
"answer-v2": {
167-
"files": {
168-
"<integer>": "<integer>"
169-
},
170-
"transit-abilities": "<list of ability strings>"
171-
}
189+
"offer-v2": {
190+
//"transfer-name": "<string, optional>",
191+
"content": {
192+
"type": "<string>",
193+
"name": "<string>",
194+
"mtime": "<integer>",
195+
"format": "<string>",
196+
197+
},
198+
}
172199
}
173200
```
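
To make the recursive structure more concrete, here is a small Python sketch that walks an offer and lists every regular file with its path, transfer `id` and `size`. It only assumes the fields named in the text above (`type`, `name`, `content`, `id`, `size`); the function name `regular_files` and the use of `/` as a display separator are illustrative choices.

```python
def regular_files(entry: dict, prefix: str = ""):
    """Yield (path, transfer_id, size) for every regular file below `entry`."""
    path = prefix + entry["name"]
    kind = entry["type"]
    if kind == "regular-file":
        yield path, entry["id"], entry["size"]
    elif kind == "directory":
        # Directories list their direct children in `content`.
        for child in entry["content"]:
            yield from regular_files(child, path + "/")
    # Symlinks only carry a `target` and contribute no payload.


offer = {
    "type": "directory", "name": "photos", "mtime": 0,
    "content": [
        {"type": "regular-file", "name": "a.jpg", "mtime": 0, "size": 10, "id": "id-a"},
        {"type": "symlink", "name": "latest", "mtime": 0, "target": "a.jpg"},
        {"type": "directory", "name": "raw", "mtime": 0, "content": [
            {"type": "regular-file", "name": "b.raw", "mtime": 0, "size": 20, "id": "id-b"},
        ]},
    ],
}
files = list(regular_files(offer))
```

A receiving client would still have to sanitize each `name` before using the resulting paths on disk.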
174201

175-
#### Transit hints
202+
If a transfer fails midway, we don't want to re-transmit unnecessary data when
203+
a second attempt is made. The idea is that when a transfer fails, the receiver
204+
stores the IDs along with the partially transferred data. On the second attempt,
205+
the sender should reuse the transfer IDs so that the receiver can tell it already
206+
has part of the data, therefore only requesting what it does not yet have.
207+
208+
Transfer IDs are opaque strings to the receiver; how they are generated is an
209+
implementation detail of the sender. However, the following points should be taken
210+
into consideration:
211+
212+
- Sending the same files or folder twice results in the same identifiers
213+
- When making transfer IDs content addressed, they should not leak any information
214+
about the data to anybody except the receiver.
215+
- All hashes in use should be salted, the salt should be kept private by the
216+
sender and rotated regularly.
217+
- The transfer ID should have sufficiently high entropy to avoid collisions.
218+
- At least 256 bits are recommended
219+
- Due to the purpose of allowing retransfers, no data
220+
- Since the goal is to facilitate retransfers after a failure, no further
221+
information needs to be stored on success.
222+
- Retransfers after failure are expected to happen more or less immediately. The
223+
data need not be kept around longer than a few hours, at most days.
224+
- False negatives lead to additional retransfer of data, while false positives
225+
result in a transfer failure due to hash mismatch. Therefore, try to keep the
226+
ID generation as conservative as possible.
227+
- Simply using fresh random IDs for everything is an acceptable strategy.
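
One possible way to meet these points is a keyed hash over file metadata, sketched below in Python. This is only an illustration: deriving the ID from path, size and mtime (rather than from file content) is an assumption, and the names `new_salt`, `transfer_id` and `random_transfer_id` are hypothetical. The salt acts as an HMAC key that never leaves the sender.

```python
import hashlib
import hmac
import secrets


def new_salt() -> bytes:
    """Fresh private salt; the sender would rotate this regularly."""
    return secrets.token_bytes(32)


def transfer_id(salt: bytes, absolute_path: str, size: int, mtime: int) -> str:
    """Stable 256-bit ID: same file and same salt give the same ID."""
    message = f"{absolute_path}\0{size}\0{mtime}".encode("utf-8", "surrogateescape")
    # HMAC keeps the ID meaningless to anyone who does not know the salt.
    return hmac.new(salt, message, hashlib.sha256).hexdigest()


def random_transfer_id() -> str:
    """The simple alternative: a fresh random 256-bit ID, with no resumption."""
    return secrets.token_hex(32)
```

Including size and mtime in the hashed message errs on the conservative side: a modified file gets a new ID and is retransferred in full instead of failing later on a hash mismatch.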
228+
229+
### Receive ack
230+
231+
`files` contains a mapping from transfer ID to offset (bytes).
232+
An offer may be rejected using an `error` message.
176233

177-
Note that the hints for abilities added in the future might follow a different schema. The discriminant is `type`.
234+
```json
235+
{
236+
"answer-v2": {
237+
"files": {
238+
"<string>": "<integer>"
239+
},
240+
}
241+
}
242+
```
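
A receiver could assemble the `files` mapping from whatever partial data it kept, roughly as in the following Python sketch. The helper name `build_answer`, listing every offered ID explicitly, and restarting from offset 0 when the stored data is larger than the announced size are all assumptions made for illustration.

```python
def build_answer(offered: dict, partial_sizes: dict) -> dict:
    """offered: transfer ID -> announced size; partial_sizes: ID -> bytes on disk."""
    files = {}
    for file_id, size in offered.items():
        have = partial_sizes.get(file_id, 0)
        # Assumption: more stored bytes than announced means the stored data
        # cannot belong to this file version, so the file is requested anew.
        files[file_id] = have if have <= size else 0
    return {"answer-v2": {"files": files}}


answer = build_answer({"id-a": 100, "id-b": 50}, {"id-a": 40, "id-b": 70})
```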
243+
244+
### Payload transfer
245+
246+
After receiving the ack, the sender transfers the payload according to the `format`. For each file, the data stream
247+
must start at the offset requested by the receiver. A `payload-v2` message contains only the (compressed) bytes as value.
178248

179249
```json
180250
{
181-
"transit-v2": [
182-
{
183-
"type": "<ability string>",
184-
"hostname": "<string>",
185-
"port": "<tcp port>",
186-
"priority": "<number, usually [0..1], optional, default 0.5>"
187-
},
188-
]
251+
"payload-v2": {
252+
"id": "<string>",
253+
"payload": "<bytes>",
254+
}
189255
}
190256
```
191257

192-
#### Checksums
258+
The payload must not exceed 64 KiB per message. The receiver keeps track of the received bytes (after
259+
decompression according to the format), and errors out if the sender exceeds the announced amount by more than 5%. Note that due to
260+
file system smear, sending a different amount of bytes than announced is rather common (hence
261+
the 5%). Errors will be caught using checksums later on.
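
The two rules in this paragraph can be sketched in Python as follows. The names `MAX_PAYLOAD`, `payload_chunks` and `ReceivedBytes` are hypothetical, and counting the 5% tolerance against the announced size alone (ignoring any resumption offset) is an assumption.

```python
MAX_PAYLOAD = 64 * 1024  # payload limit per message, in bytes


def payload_chunks(wire_data: bytes, offset: int = 0):
    """Split already-encoded data into payloads of at most 64 KiB each."""
    for start in range(offset, len(wire_data), MAX_PAYLOAD):
        yield wire_data[start:start + MAX_PAYLOAD]


class ReceivedBytes:
    """Receiver-side counter of decompressed bytes for one file."""

    def __init__(self, announced: int):
        self.announced = announced
        self.received = 0

    def add(self, count: int) -> None:
        self.received += count
        # Integer form of "received > announced * 1.05".
        if self.received * 100 > self.announced * 105:
            raise ValueError("sender exceeded the announced size by more than 5%")
```

Staying within the tolerance does not mean the data is correct; that is left to the checksums.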
262+
263+
### Checksums
193264

194-
`tar-file-sha256` is the lowerhex-encoded sha256sum of all transferred bytes of the tar file.
265+
At the end of the transfer, *both* sides send their checksums. That way, they do not need to communicate any further
266+
to exchange their opinion: they can both calculate themselves whether things went wrong or not and only need to notify
267+
the user. Once the checksums are exchanged, the transfer is complete and the connection is closed.
195268

196-
TODO maybe some per file integrity check?
269+
There is a per-file integrity check. `wire-sha256` is the (binary) sha256sum of all transferred payload bytes (i.e. before decompression). `sha256` is the sha256sum of the *entire* file, including bytes before the resumption offset.
197270

198271
```json
199272
{
200-
"transfer-ack-v2": {
201-
"tar-file-sha256": "<string>"
202-
}
273+
"transfer-ack-v2": {
274+
"wire-sha256": "<bytes>",
275+
"files": [
276+
{
277+
"id": "<string>",
278+
"size": "<integer>",
279+
"sha256": "<bytes>",
280+
}
281+
],
282+
}
203283
}
204284
```
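
The following Python sketch shows how either side might build its `transfer-ack-v2` message and compare it with the peer's. It is a simplified illustration: `transfer_ack` and `acks_match` are hypothetical names, and whole files are held in memory here, whereas a real client would hash incrementally while streaming.

```python
import hashlib


def transfer_ack(wire_chunks, files: dict) -> dict:
    """wire_chunks: payload bytes as sent on the wire, in order.
    files: transfer ID -> complete file content, including resumed bytes."""
    wire = hashlib.sha256()
    for chunk in wire_chunks:
        wire.update(chunk)  # hashed before decompression
    return {
        "transfer-ack-v2": {
            "wire-sha256": wire.digest(),
            "files": [
                {"id": file_id, "size": len(content),
                 "sha256": hashlib.sha256(content).digest()}
                for file_id, content in files.items()
            ],
        }
    }


def acks_match(mine: dict, theirs: dict) -> bool:
    """Both sides run this locally; no further message is exchanged."""
    return mine == theirs
```

Because the digests are raw bytes rather than hex strings, this relies on the msgpack encoding used on the transit connection.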
205285

286+
### A note about file system handling
287+
288+
File systems are hard. To achieve consistent and sane behavior across implementations and
289+
systems, applications should pay attention to the following details:
290+
291+
- Symlinks are preserved by default when sending directories
292+
- Hardlinks and reflinks may be resolved/duplicated at any point
293+
- Permissions are not preserved by default (use rsync for that instead).
294+
- The sender's mtime should be preserved, unless it is zero
295+
- Extended file attributes (xattrs) are not preserved
296+
- Files may have been modified between transfers. Checking the modification time
297+
is necessary, but not sufficient.
298+
- To avoid file system hacking: The receiver must check for malicious file paths
299+
and invalid/unsupported character sequences. Symlinks *must not* be followed.
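
As a minimal example of the last point, a receiver could validate each entry `name` as a single path component before creating anything on disk. The Python sketch below uses a hypothetical helper `is_safe_name`; the exact set of rejected names is an assumption and a real client will likely need additional, platform-specific rules (for example reserved names on Windows).

```python
def is_safe_name(name: str) -> bool:
    """Accept only names that stay inside their parent directory."""
    if name in ("", ".", ".."):
        return False
    # Separators would let one entry escape into another directory;
    # NUL bytes are rejected by most file system APIs.
    if "/" in name or "\\" in name or "\0" in name:
        return False
    return True
```

Rejecting the whole offer and escaping the offending name are both valid reactions to a `False` result, as noted in the section on send offers.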
300+
301+
### When to resume
302+
303+
On a failed attempt, the receiver may decide to keep the partially transferred data in
304+
anticipation of the transfer being tried again soon. The receiver can use the `answer` message
305+
to exert some control over which bytes the sender will send again. It is also free to decide
306+
when a transfer should be resumed instead of being started anew. However, not every failure
307+
may be recovered from, forcing a full retransfer:
308+
309+
-
310+
311+
### Random notes
312+
206313
## Future Extensions
207314

208-
* "command mode": establish the connection, *then* figure out what we want to
209-
use it for, allowing multiple files to be exchanged, in either direction.
210-
This is to support a GUI that lets you open the wormhole, then drop files
211-
into it on either end.
212315
* some Transit messages being sent early, so ports and Onion services can be
213316
spun up earlier, to reduce overall waiting time
214317
* transit messages being sent in multiple phases: maybe the transit
