Description
In the CBOR codec (src/cbor/), implement export to a string in
CBOR Extended Diagnostic Notation (EDN). See the specification:
CBOR Extended Diagnostic Notation (EDN)
Abstract
This document formalizes and consolidates the definition of the
Extended Diagnostic Notation (EDN) of the Concise Binary Object
Representation (CBOR), addressing implementer experience.
Replacing EDN's previous informal descriptions, it updates RFC 8949,
obsoleting its Section 8, and RFC 8610, obsoleting its Appendix G.
It also specifies and uses registry-based extension points, using one
to support text representations of epoch-based dates/times and of IP
addresses and prefixes.
2. Overview over CBOR Extended Diagnostic Notation (EDN)
CBOR is a binary interchange format. To facilitate documentation and
debugging, and in particular to facilitate communication between
entities cooperating in debugging, this document defines a simple
human-readable diagnostic notation. All actual interchange always
happens in the binary format.
Note that diagnostic notation truly was designed as a diagnostic
format; it originally was not meant to be parsed. Therefore, no
formal definition (as in ABNF) was given in the original documents.
Recognizing that formal grammars can aid interoperation of tools and
usability of documents that employ EDN, Section 5 now provides ABNF
definitions.
EDN is a true superset of JSON as it is defined in [STD90] in
conjunction with [RFC7493] (that is, any interoperable [RFC7493] JSON
text also is an EDN text), extending it both to cover the greater
expressiveness of CBOR and to increase its usability.
EDN borrows the JSON syntax for numbers (integer and floating-point,
Section 2.4), certain simple values (Section 2.8), UTF-8 [STD63] text
strings, arrays, and maps (maps are called objects in JSON; the
diagnostic notation extends JSON here by allowing any data item in
the map key position).
As EDN is used for truly diagnostic purposes, its implementations MAY
support generation and possibly ingestion of EDN for CBOR data items
that are well-formed but not valid. It is RECOMMENDED that an
implementation enables such usage only explicitly by configuration
(such as an API or CLI flag). Validity of CBOR data items is
discussed in Section 5.3 of RFC 8949 [STD94], with basic validity
discussed in Section 5.3.1 of RFC 8949 [STD94], and tag validity
discussed in Section 5.3.2 of RFC 8949 [STD94]. Tag validity is more
likely a subject for individual application-oriented extensions,
while the two cases of basic validity (for text strings and for maps)
are addressed in Sections 2.5.7 and 2.6.2 under the heading of
validity.
The rest of this section provides an overview over specific features
of EDN, starting with certain common syntactical features and then
going through kinds of CBOR data items roughly in the order of CBOR
major types. Any additional detailed syntax discussion needed has
been deferred to Section 5.1.
2.3. Encoding Indicators
Sometimes it is useful to indicate in the diagnostic notation which
of several alternative representations were actually used; for
example, a data item written »1.5« by a diagnostic decoder might have
been encoded as a half-, single-, or double-precision float.
The convention for encoding indicators is that anything starting with
an underscore and all immediately following characters that are
alphanumeric or underscore is an encoding indicator, and can be
ignored by anyone not interested in this information. For example, _
or _3.
Encoding indicators are always optional.
Encoding indicators are placed immediately to the right of the data
item or of a syntactic feature that can stand for the data item the
encoding of which the encoding indicator is controlling. Table 1
provides examples for encoding indicators used with various kinds of
data items.
+====+=====================+
| mt | examples            |
+====+=====================+
| 0  | 1_1, 0x4711_3       |
+----+---------------------+
| 1  | -1_1                |
+----+---------------------+
| 2  | 'A'_1               |
+----+---------------------+
| 3  | "A"_1               |
+----+---------------------+
| 4  | [_1 "bar"]          |
+----+---------------------+
| 5  | {_1 "bar": 1}       |
+----+---------------------+
| 6  | 1_1(4711)           |
+----+---------------------+
| 7  | 1.5_2, 0x4711p+03_3 |
+----+---------------------+
Table 1: Examples of Encoding Indicators for Different Data Items
(mt = major type)
(In the following, an abbreviation of the form ai=nn gives nn as the
numeric value of the field additional information, the low-order 5
bits of the initial byte: see Section 3 of RFC 8949 [STD94]. This
field is used in encoding the "argument", i.e., the value, tag, or
length; ai=0 to ai=23 mean that the value of the ai field immediately
is the argument, ai=24 to ai=27 mean that the argument is carried
in 2^(ai-24) (1, 2, 4, or 8) additional bytes, and ai=31 means that
indefinite length encoding is used.)
An underscore followed by a decimal digit n indicates that the
preceding item (or, for arrays and maps, the item starting with the
preceding bracket or brace) was encoded with an additional
information value of ai=24+n. For example, 1.5_1 is a half-precision
floating-point number (2^1 = 2 additional bytes or 16 bits), while
1.5_3 is encoded as double precision (2^3 = 8 additional bytes or 64
bits).
The encoding indicator _ is an abbreviation of what would in full
form be _7, which is not used. Therefore, an underscore _ on its own
stands for indefinite length encoding (ai=31). (Note that this
encoding indicator is only available behind the opening brace/bracket
for map and array (Section 2.6.1): strings have a special syntax
streamstring for indefinite length encoding except for the special
cases ''_ and ""_ (Section 2.5.4).)
The encoding indicators _0 to _3 can be used to indicate ai=24 to
ai=27, respectively; they therefore stand for 1, 2, 4, and 8 bytes of
additional information (ai) following the initial byte in the head of
the data item. (The abbreviation of _7 into _ was discussed above.
_4 to _6 are not currently used in CBOR, but will be available if and
when CBOR is extended to make use of ai=28 to ai=30.)
Surprisingly, Section 8.1 of RFC 8949 [STD94] does not address ai=0
to ai=23 — the assumption seems to have been that preferred
serialization (Section 4.1 of RFC 8949 [STD94]) will be used when
converting CBOR diagnostic notation to an encoded CBOR data item, so
leaving out the encoding indicator for a data item with a preferred
serialization will implicitly use ai=0 to ai=23 if that is possible.
The present specification allows making this explicit:
_i ("immediate") stands for encoding with ai=0 to ai=23, i.e., it
indicates that the argument is encoded directly in the initial byte
of the CBOR item.
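The mapping from encoding indicators to additional-information values described above can be sketched as follows. This is a minimal illustration, not part of the specification; the function names `indicator_to_ai` and `argument_bytes` are hypothetical.

```python
def indicator_to_ai(indicator, argument=None):
    """Map an EDN encoding indicator to a CBOR additional-information value."""
    if indicator == "_":
        return 31                      # indefinite length (short for _7)
    if indicator == "_i":
        # "immediate": the argument itself sits in the initial byte
        if argument is None or not 0 <= argument <= 23:
            raise ValueError("_i needs an argument in the range 0..23")
        return argument
    if indicator in ("_0", "_1", "_2", "_3"):
        return 24 + int(indicator[1])  # ai=24..27
    raise ValueError(f"unsupported encoding indicator: {indicator!r}")


def argument_bytes(ai):
    """Number of bytes following the initial byte for a given ai."""
    return 2 ** (ai - 24) if 24 <= ai <= 27 else 0
```

For instance, `1.5_1` corresponds to `indicator_to_ai("_1")`, that is ai=25, with two argument bytes. The sketch rejects _4 to _6, which the text reserves for possible future use.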
While no pressing use for further values for encoding indicators
comes to mind, this is an extension point for EDN; Section 6.2
defines a registry for additional values.
Encoding Indicators are discussed in further detail in Section 2.5.4
for indefinite length strings and in Section 2.6.1 for arrays and
maps.
2.4. Numbers
In addition to JSON's decimal number literals, EDN provides
hexadecimal, octal, and binary number literals in the usual
C-language notation (octal with 0o prefix present only).
Numbers composed only of digits (of the respective base) are
interpreted as CBOR integers (major type 0/1, or where the number
cannot be represented in this way, major type 6 with tag 2/3). A
leading "+" sign is a no-op, and a leading "-" sign inverts the sign
of the number. So 0, 000, +0 all represent the same integer zero, as
does -0. Similarly, 1, 001, +1 and +0001 all stand for the same
integer one, and -1 and -0001 both designate the same integer minus
one.
Using a decimal point (.) and/or an exponent (e for decimal, p for
hexadecimal) turns the number into a floating point number (major
type 7) instead, irrespective of whether it is an integral number
mathematically. Note that, in floating point numbers, 0.0 is not the
same number as -0.0, even if they are mathematically equal.
In Table 2, all the items on a row are the same number (also shown in
CBOR, hexadecimally), but they are distinct from items in a different
row.
+========================================+===================+
| EDN                                    | CBOR hex          |
+========================================+===================+
| 4711, 0x1267, 0o11147, 0b1001001100111 | 19 1267 # uint    |
+----------------------------------------+-------------------+
| 1.5, 0.15e1, 15e-1, 0x1.8p0, 0x18p-4   | F9 3E00 # float16 |
+----------------------------------------+-------------------+
| 0, +0, -0                              | 00 # uint         |
+----------------------------------------+-------------------+
| 0.0, +0.0                              | F9 0000 # float16 |
+----------------------------------------+-------------------+
| -0.0                                   | F9 8000 # float16 |
+----------------------------------------+-------------------+
Table 2: Example Sets of Equivalent Notations for Some Numbers
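The integer/floating-point distinction in Table 2 can be illustrated with a small sketch. It assumes the input is a single finite number literal without encoding indicator; `parse_edn_number` is a hypothetical helper that leans on Python's own number parsing and does not implement the full grammar of Section 5.1.

```python
import re

# Digits only (in the respective base), with an optional sign: an integer.
_INT = re.compile(r"^[+-]?(0[xX][0-9a-fA-F]+|0[oO][0-7]+|0[bB][01]+|[0-9]+)$")


def parse_edn_number(text):
    """Return int for integer literals, float for everything else."""
    if _INT.match(text):
        sign = -1 if text[0] == "-" else 1
        digits = text.lstrip("+-")
        base = {"0x": 16, "0o": 8, "0b": 2}.get(digits[:2].lower(), 10)
        magnitude = int(digits[2:], base) if base != 10 else int(digits, 10)
        return sign * magnitude
    if text.lstrip("+-")[:2].lower() == "0x":
        return float.fromhex(text)     # hexadecimal float with p exponent
    return float(text)                 # decimal point and/or e exponent
```

With this sketch, `-0` yields the integer zero, while `-0.0` yields a float whose sign bit is set, matching the separate rows of the table.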
The non-finite floating-point numbers Infinity, -Infinity, and NaN
are written exactly as in this sentence (this is also a way they can
be written in JavaScript, although JSON does not allow them).
See Section 5.1, Paragraph 7, Item 3 for additional details of the
EDN number syntax.
(Note that literals for further number formats, e.g., for
representing rational numbers as fractions, or for NaNs with non-zero
payloads, can be added as application-oriented literals. Background
information beyond that in [STD94] about the representation of
numbers in CBOR can be found in the informational document
[I-D.bormann-cbor-numbers].)
2.5. Strings
CBOR distinguishes two kinds of strings: text strings (the bytes in
the string constitute UTF-8 [STD63] text, major type 3), and byte
strings (CBOR does not further characterize the bytes that constitute
the string, major type 2).
2.5.1. Text String Literals
EDN notates text strings in a form compatible to that of notating
text strings in JSON (i.e., as a double-quoted string literal), with
a number of usability extensions. In JSON, no control characters are
allowed to occur directly in text string literals; if needed, they
can be specified using escapes such as \t or \r. In EDN, string
literals additionally can contain newlines (LINEFEED U+000A), which
are copied into the resulting string like other characters in the
string literal. To deal with variability in platform presentation of
newlines, any carriage return characters (U+000D) that may be present
in the EDN string literal are not copied into the resulting string
(see Section 5.1, Paragraph 7, Item 2). No other control characters
can occur directly in a string literal, and the handling of escaped
characters (\r etc.) is as in JSON.
JSON's escape scheme for characters that are not on Unicode's basic
multilingual plane (BMP) is cumbersome (see Section 7 of RFC 8259
[STD90]). EDN keeps it, but also adds the syntax \u{NNN} where NNN
is the Unicode scalar value as a hexadecimal number. This means the
following are equivalent (the first o is escaped as \u{6f} for no
particular reason):
"D\u{6f}mino's \u{1F073} + \u{2318}" # \u{}-escape 3 chars
"Domino's \uD83C\uDC73 + \u2318" # escape JSON-like
"Domino's 🁳 + ⌘" # unescaped
2.5.2. Byte String Literals
EDN adds a number of ways to notate byte strings, some of which
provide detailed access to the bits within those bytes (see
Section 2.5.5). However, quite often, byte strings carry bytes that
can be meaningfully notated as UTF-8 text (Section 2.5.3).
2.5.3. Single-Quoted String Literals
Analogously to text string literals delimited by double quotes, EDN
allows the use of single quotes (without a prefix) to express byte
string literals with UTF-8 text; for instance, the following are
equivalent:
'hello world'
h'68656c6c6f20776f726c64'
The escaping rules of JSON strings are applied equivalently for
text-based byte string literals, e.g., \\ stands for a single backslash
and \' stands for a single quote. However, to facilitate parsing, in
single-quoted strings EDN excludes certain escaping mechanisms
available for double-quoted strings:
- \/ is an escape in JSON that is available for EDN text strings as
  well to ensure all JSON texts are EDN literals. Since EDN's
  single-quoted strings do not occur in JSON, this legacy
  compatibility feature is not available for them.
- \u-based escapes are not available for characters in the range
  from U+0020 to U+007e (essentially, printable ASCII).
Single-quoted string literals can occur unprefixed and stand for the
byte string that encodes its text string value (the "content"), or be
prefixed by what looks like an application-extension prefix (see
Section 2.1).
In a prefixed string literal, the text content of the single-quoted
string literal is not used directly as a byte string, but is further
processed in a way that is defined by the meaning given to the
prefix. Depending on the prefix, the result of that processing can,
but need not be, a byte string value.
Prefixed string literals (which are always single-quoted after the
prefix) are used both for base-encoded byte string literals (see
Section 2.5.5) and for application-oriented extension literals (see
Section 2.1, called app-string). (Additional kinds of base-encoded
string literals can be defined as application-oriented extension
literals by registering their prefixes; there is no fundamental
difference between the two predefined base-encoded string literal
prefixes (h, b64) and any such potential future extension literal
prefixes.)
2.5.4. Encoding Indicators of Strings
For indefinite length encoding, strings (byte and text strings) have
a special syntax streamstring. This is used (except for the special
cases ''_ and ""_ below) to notate their detailed composition into
individual "chunks" (Section 3.2.3 of RFC 8949 [STD94]), by
representing the individual chunks in sequence within parentheses,
each optionally followed by a comma, with an encoding indicator _
immediately after the opening parenthesis: e.g., (_ h'0123', h'4567')
or (_ "foo", "bar"). The overall type (byte string or text string)
of the string is provided by the types of the individual chunks,
which all need to be of the same type (Section 3.2.3 of RFC 8949
[STD94]).
For an indefinite-length string with no chunks inside, (_ ) would be
ambiguous as to whether a byte string (encoded 0x5fff) or a text
string (encoded 0x7fff) is meant and is therefore not used. The
basic forms ''_ and ""_ can be used instead and are reserved for the
case of no chunks only --- not as short forms for the (permitted, but
not really useful) encodings with only empty chunks, which need to be
notated as (_ ''), (_ ""), etc., when it is desired to preserve the
chunk structure.
2.5.5. Base-Encoded Byte String Literals
Besides the unprefixed byte string literals that are analogous to
JSON text string literals, EDN provides base-encoded byte string
literals. These are notated as prefixed string literals that carry
one of the base encodings [RFC4648], without padding, i.e., the base
encoding is enclosed in a single-quoted string literal, prefixed by
»h« for base16 or »b64« for base64 or base64url (the actual encodings
of the latter do not overlap, so the string remains unambiguous).
For example, the byte string consisting of the four bytes 12 34 56 78
(given in hexadecimal here) could be written h'12345678' or
b64'EjRWeA'.
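As a rough sketch of how the content of the two predefined prefixes could be decoded (blank space and comments are not handled here), using only the Python standard library; `decode_h` and `decode_b64` are hypothetical names:

```python
import base64


def decode_h(content):
    """Decode the content of an h'' literal (base16)."""
    return bytes.fromhex(content)


def decode_b64(content):
    """Decode the content of a b64'' literal (base64 or base64url, unpadded)."""
    # Map the base64url alphabet onto the classic one, then restore padding.
    classic = content.replace("-", "+").replace("_", "/")
    padded = classic + "=" * (-len(classic) % 4)
    return base64.b64decode(padded, validate=True)
```

Both `decode_h("12345678")` and `decode_b64("EjRWeA")` give the four bytes 12 34 56 78.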
Examples often benefit from some blank space (spaces, line breaks) in
byte string literals. In certain EDN prefixed byte string literals,
blank space is ignored; for instance, the following are equivalent:
h'48656c6c6f20776f726c64'
h'48 65 6c 6c 6f 20 77 6f 72 6c 64'
h'4 86 56c 6c6f
20776 f726c64'
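A minimal way to honor ignored blank space, assuming comments have already been stripped, is to drop all whitespace before decoding. The helper name `decode_h_with_blanks` is hypothetical.

```python
def decode_h_with_blanks(content):
    """Decode h'' content, ignoring spaces and line breaks anywhere."""
    # str.split() with no argument splits on any run of whitespace,
    # so joining the pieces removes blanks even inside a byte pair.
    compact = "".join(content.split())
    return bytes.fromhex(compact)
```

All three notations above then decode to the same eleven bytes, the ASCII text "Hello world".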
The internal syntax of prefixed single-quote literals such as h'' and
b64'' can also allow comments as blank space (see Section 2.2).
h'68656c6c6f20776f726c64'
h'68 65 6c /doubled l!/ 6c 6f # hello
20 /space/
77 6f 72 6c 64' /world/
Slash characters are part of the base64 classic alphabet (see Table 1
in Section 4 of [RFC4648]), and they therefore need to be in the b64''
set of characters that contribute to the byte string. Therefore,
only end-of-line comments are available in b64 byte string literals.
b64'/base64 not a comment/ but one follows # comment'
h'FDB6AC 7BAE27A2D69CA2699E9EDFDBBADA2779FA25 968C2C'
These two byte string literals stand for the same byte string; the
deliberately confusing base64 content starts with b64'/bas' which is
the same as h'FDB6AC' and ends with b64'lows' which is the same as
h'968C2C'.
2.5.6. CBOR Sequence Literals
In diagnostic notation, a sequence of zero or more CBOR data item
literals can be enclosed in << and >>, optionally prefixed by an
application-extension prefix; we speak of sequence literals. EDN
mainly deals with individual data items, not with CBOR sequences
[RFC8742], so the CBOR sequence represented by the sequence literal
needs to be further processed to obtain the value of the literal.
Prefixed sequence literals refer to the application extension (see
Section 2.1) identified by the prefix and apply the extension to its
sequence content, resulting in a single data item. This data item
may be a string or may not (always) be, depending on the definition
of the application extension.
An unprefixed sequence literal applies CBOR encoding to the data
items in its content, taken as a CBOR sequence. The value of the
literal thus is a byte string with the encoded content; we also speak
of embedded CBOR. For instance, each pair of columns in the
following are equivalent:
<<1>> h'01'
<<1, 2>> h'0102'
<<"hello", null>> h'65 68656c6c6f f6'
<<>> h''
2.6. Arrays and Maps
EDN borrows the JSON syntax for arrays and maps. (Maps are called
objects in JSON.)
For maps, EDN extends the JSON syntax by allowing any data item in
the map key position (before the colon).
JSON requires the use of a comma as a separator character between the
elements of an array as well as between the members (key/value pairs)
of a map. (These commas also were required in the original
diagnostic notation defined in [STD94] and [RFC8610].) The separator
commas are now optional in the places where EDN syntax allows commas.
(Stylistically, leaving out the commas is more idiomatic when they
occur at line breaks.)
In addition, EDN also allows, but does not require, a trailing comma
before the closing bracket/brace, enabling an easier to maintain
"terminator" style of their use.
In summary, the following eight examples are all equivalent:
[1, 2, 3]
[1, 2, 3,]
[1 2 3]
[1 2 3,]
[1 2, 3]
[1 2, 3,]
[1, 2 3]
[1, 2 3,]
as are
{1: "n", "x": "a"}
{1: "n", "x": "a",}
{1: "n" "x": "a"}
etc.
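The comma rule (each item may be followed by one optional comma) can be sketched for the simple case of a flat array of decimal integers. `parse_int_array` is a hypothetical, deliberately restricted helper; it does not cover nesting, comments, or other item types.

```python
import re

_TOKEN = re.compile(r"\s*(-?\d+|[\[\],])")


def parse_int_array(text):
    """Parse a flat array of integers with optional separator commas."""
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    if len(tokens) < 2 or tokens[0] != "[" or tokens[-1] != "]":
        raise ValueError("not a bracketed array")
    items, previous_was_item = [], False
    for token in tokens[1:-1]:
        if token == ",":
            if not previous_was_item:      # rejects "[,1]" and "[1,,2]"
                raise ValueError("a comma must directly follow an item")
            previous_was_item = False
        elif token in ("[", "]"):
            raise ValueError("nested arrays are outside this sketch")
        else:
            items.append(int(token))
            previous_was_item = True
    return items
```

All eight array notations listed above yield `[1, 2, 3]` with this sketch.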
| CDDL's comma separators in the equivalent contexts (CDDL
| groups) are entirely optional (and actually are terminators,
| which together with their optionality allows them to be used
| like separators as well, or even not at all). In summary,
| comma use is now aligned between EDN and CDDL, in a fully
| backwards compatible way.
2.6.1. Encoding Indicators of Arrays and Maps
A single underscore can be written after the opening brace of a map
or the opening bracket of an array to indicate that the data item was
represented in indefinite-length format. For example, [_ 1, 2]
contains an indicator that an indefinite-length representation was
used to represent the data item [1, 2].
At the same position, encoding indicators for specifying the size of
the array or map head for definite-length format can be used instead,
specifically _i or _0 to _3. For example [_0 false, true] can be
used to specify the encoding of the array [false, true] as 98 02 f4
f5.
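The effect of the indicator on the array head can be sketched for an array of booleans. This is an illustration only; `encode_bool_array` is a hypothetical name and handles no other item types.

```python
def encode_bool_array(items, indicator="_i"):
    """Encode a list of booleans as a CBOR array, honoring the indicator."""
    body = bytes(0xF5 if item else 0xF4 for item in items)
    if indicator == "_":
        # indefinite length: initial byte 0x9f, items, then "break" 0xff
        return b"\x9f" + body + b"\xff"
    if indicator == "_i":
        if len(items) > 23:
            raise ValueError("_i requires at most 23 items")
        return bytes([0x80 | len(items)]) + body
    if indicator in ("_0", "_1", "_2", "_3"):
        n = int(indicator[1])
        head = bytes([0x80 | (24 + n)]) + len(items).to_bytes(2 ** n, "big")
        return head + body
    raise ValueError(f"unsupported encoding indicator: {indicator!r}")
```

`encode_bool_array([False, True], "_0")` produces 98 02 f4 f5, while `"_i"` gives the shorter 82 f4 f5.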
2.6.2. Validity of Maps
As discussed at the start of Section 2, EDN implementations MAY
support generation and possibly ingestion of EDN for CBOR data items
that are well-formed but not valid (Section 5.3 of RFC 8949 [STD94]).
For maps, this is relevant for map keys that occur more than once, as
in:
{1: "to", 1: "fro"}
2.7. Tags
A tag is written as a decimal unsigned integer for the tag number,
followed by the tag content in parentheses; for instance, a date in
the format specified by RFC 3339 (ISO 8601) could be notated as:
0("2013-03-21T20:04:00Z")
or the equivalent epoch-based time as the following:
1(1363896240)
The tag number can be followed by an encoding indicator giving the
encoding of the tag head. For example:
1_1(1363896240)
(assuming preferred encoding for the tag content) is encoded as
d9 0001 # tag(1)
1a 514b67b0 # unsigned(1363896240)
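The bytes above can be reproduced with a short sketch, which also checks that the epoch value matches the RFC 3339 string from the earlier example. The helper `tag_head` is a hypothetical name.

```python
from datetime import datetime, timezone


def tag_head(tag_number, indicator):
    """Head of a tag (major type 6) for indicator _n, n in 0..3."""
    n = int(indicator.lstrip("_"))
    if not 0 <= n <= 3:
        raise ValueError("only _0 to _3 are handled in this sketch")
    return bytes([0xC0 | (24 + n)]) + tag_number.to_bytes(2 ** n, "big")


# 2013-03-21T20:04:00Z as seconds since 1970-01-01T00:00:00Z
epoch = int(datetime(2013, 3, 21, 20, 4, 0, tzinfo=timezone.utc).timestamp())

# Preferred encoding of the content: unsigned integer with a 4-byte argument
content = bytes([0x1A]) + epoch.to_bytes(4, "big")
encoded = tag_head(1, "_1") + content
```

Here `encoded.hex()` is `d900011a514b67b0`; without the `_1` indicator, preferred serialization would encode the tag head in the single byte c1.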
2.8. Simple values
EDN uses JSON syntax for the simple values True (»true«), False
(»false«), and Null (»null«). Undefined is written »undefined« as in
JavaScript.
These and all other simple values can be given as "simple()" with the
appropriate integer in the parentheses. For example, »simple(42)«
indicates major type 7, value 42, and »simple(0x14)« indicates
»false«, as does »simple(20)« or »simple(0b10100)«.
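For the export direction, a formatter might prefer the named forms where they exist and fall back to "simple()" otherwise. A minimal sketch, with `format_simple` as a hypothetical name:

```python
_NAMED = {20: "false", 21: "true", 22: "null", 23: "undefined"}


def format_simple(value):
    """Render a CBOR simple value (major type 7) as EDN text."""
    # Simple values 24..31 cannot occur in well-formed CBOR.
    if not 0 <= value <= 255 or 24 <= value <= 31:
        raise ValueError(f"not a simple value: {value}")
    return _NAMED.get(value, f"simple({value})")
```

`format_simple(0x14)` returns `false`, and `format_simple(42)` returns `simple(42)`.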
3. Application-Oriented Extension Literals
This document extends the syntax used in diagnostic notation to also
enable application-oriented extensions. This section defines a
number of application-oriented extensions.
3.1. The "dt" Extension
The application-extension identifier "dt" is used to notate a date/
time literal that can be used as an Epoch-Based Date/Time as per
Section 3.4.2 of RFC 8949 [STD94].
The content of the literal is a single Standard Date/Time String as
per Section 3.4.1 of RFC 8949 [STD94], as a text or byte string.
The value of the literal is a number representing the result of a
conversion of the given Standard Date/Time String to an Epoch-Based
Date/Time. If fractional seconds are given in the text (production
time-secfrac in Figure 4), the value is a floating-point number; the
value is an integer number otherwise. In the all-upper-case variant
of the app-prefix, the value is enclosed in a tag number 1.
Each row of Table 3 shows an example of "dt" notation and equivalent
notation not using an application-extension identifier.
+================================+==============+
| dt literal                     | plain EDN    |
+================================+==============+
| dt'1969-07-21T02:56:16Z'       | -14159024    |
+--------------------------------+--------------+
| dt'1969-07-21T02:56:16.0Z'     | -14159024.0  |
+--------------------------------+--------------+
| dt'1969-07-21T02:56:16.5Z'     | -14159023.5  |
+--------------------------------+--------------+
| dt<<'1969-07-21T02:56:16.5Z'>> | -14159023.5  |
+--------------------------------+--------------+
| dt<<"1969-07-21T02:56:16.5Z">> | -14159023.5  |
+--------------------------------+--------------+
| DT'1969-07-21T02:56:16Z'       | 1(-14159024) |
+--------------------------------+--------------+
Table 3: dt and DT literals vs. plain EDN
See Section 5.2.3 for an ABNF definition for the content of dt
literals.
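The conversion behind the "dt" rows of Table 3 can be sketched as follows. The regular expression is a simplification of the RFC 3339 syntax (it is not the ABNF of Section 5.2.3), leap seconds are not handled, and the names `dt_value` and `dt_literal` are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

_RFC3339 = re.compile(
    r"^(\d{4})-(\d{2})-(\d{2})[Tt](\d{2}):(\d{2}):(\d{2})(\.\d+)?"
    r"([Zz]|[+-]\d{2}:\d{2})$"
)
_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def dt_value(text):
    """Epoch-based value: float if fractional seconds are given, else int."""
    match = _RFC3339.match(text)
    if match is None:
        raise ValueError(f"not an RFC 3339 date/time: {text!r}")
    year, month, day, hour, minute, second = (int(g) for g in match.groups()[:6])
    fraction, zone = match.group(7), match.group(8)
    offset = timedelta(0)
    if zone not in ("Z", "z"):
        sign = 1 if zone[0] == "+" else -1
        offset = sign * timedelta(hours=int(zone[1:3]), minutes=int(zone[4:6]))
    moment = datetime(year, month, day, hour, minute, second,
                      tzinfo=timezone(offset))
    delta = moment - _EPOCH
    whole = delta.days * 86400 + delta.seconds   # exact integer seconds
    return whole + float("0" + fraction) if fraction else whole


def dt_literal(text, tagged=False):
    """Plain-EDN text for dt'...' (tagged=False) or DT'...' (tagged=True)."""
    value = repr(dt_value(text))
    return f"1({value})" if tagged else value
```

Keeping the whole seconds as an integer means `dt'1969-07-21T02:56:16Z'` stays an integer, while the `.0Z` variant becomes a floating-point number, as in the table.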
3.2. The "ip" Extension
The application-extension identifier "ip" is used to notate an IP
address literal that can be used as an IP address as per Section 3 of
[RFC9164].
The content of the literal is a single IPv4address or IPv6address as
per Section 3.2.2 of [RFC3986], as a text or byte string.
With the lower-case app-string prefix ip, the value of the literal is
a byte string representing the binary IP address. With the upper-
case app-string prefix IP, the literal is such a byte string tagged
with tag number 54, if an IPv6address is used, or tag number 52, if
an IPv4address is used.
As an additional case, the upper-case app-string prefix IP'' can be
used with an IP address prefix such as 2001:db8::/56 or 192.0.2.0/24,
with the equivalent tag as its value. (Note that [RFC9164]
representations of address prefixes need to implement the truncation
of the address byte string as described in Section 4.2 of [RFC9164];
see example below.) For completeness, the lower-case variant
ip'2001:db8::/56' or ip'192.0.2.0/24' stands for an unwrapped
[56,h'20010db8'] or [24,h'c00002']; however, in this case the
information on whether an address is IPv4 or IPv6 often needs to come
from the context.
Note that this application-extension provides no direct
representation of the "Interface format" defined in Section 3.1.3 of
[RFC9164], an address combined with an optional prefix length and an
optional zone identifier, and therefore no way to reference a zone
identifier at all. (If needed, this format can be put together by
building their structures explicitly, e.g., an interface format
without a zone identifier can be represented as in
52([ip'192.0.2.42',24]), or an interface format with zone identifier
42 as in 54([ip'fe80::0202:02ff:ffff:fe03:0303',64,42]).)
Each row of Table 4 shows an example of "ip" notation and equivalent
notation not using an application-extension identifier.
+====================+=========================================+
| ip literal         | plain EDN                               |
+====================+=========================================+
| ip'192.0.2.42'     | h'c000022a'                             |
+--------------------+-----------------------------------------+
| ip<<'192.0.2.42'>> | h'c000022a'                             |
+--------------------+-----------------------------------------+
| IP'192.0.2.42'     | 52(h'c000022a')                         |
+--------------------+-----------------------------------------+
| IP'192.0.2.0/24'   | 52([24,h'c00002'])                      |
+--------------------+-----------------------------------------+
| ip'2001:db8::42'   | h'20010db8000000000000000000000042'     |
+--------------------+-----------------------------------------+
| IP'2001:db8::42'   | 54(h'20010db8000000000000000000000042') |
+--------------------+-----------------------------------------+
| IP'2001:db8::/64'  | 54([64,h'20010db8'])                    |
+--------------------+-----------------------------------------+
Table 4: ip and IP literals vs. plain EDN
See Section 5.2.4 for an ABNF definition for the content of ip
literals.
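The rows of Table 4 can be reproduced with Python's standard `ipaddress` module. This is a sketch under the assumption that trailing zero bytes of a prefix are simply removed, as the table suggests; `ip_literal` is a hypothetical name, and the accepted address syntax is that of `ipaddress`, not the ABNF of Section 5.2.4.

```python
import ipaddress


def ip_literal(text, tagged=False):
    """Plain-EDN text for ip'...' (tagged=False) or IP'...' (tagged=True)."""
    if "/" in text:
        # strict=True rejects prefixes that have bits set beyond the length
        network = ipaddress.ip_network(text, strict=True)
        version = network.version
        truncated = network.network_address.packed.rstrip(b"\x00")
        plain = f"[{network.prefixlen},h'{truncated.hex()}']"
    else:
        address = ipaddress.ip_address(text)
        version = address.version
        plain = f"h'{address.packed.hex()}'"
    if not tagged:
        return plain
    tag = 54 if version == 6 else 52
    return f"{tag}({plain})"
```

For example, `ip_literal("192.0.2.0/24", tagged=True)` returns `52([24,h'c00002'])`.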
3.3. The "hash" Extension
The application-extension identifier "hash" is used to notate the
input to a cryptographic hash function as well as identify such a
hash function to obtain a byte string that represents the output of
that hash function.
The content of the literal is a string, optionally followed by either
an integer or a text string that identifies the hash function in the
COSE Algorithms registry of the CBOR Object Signing and Encryption
(COSE) registry group [IANA.cose], either by the identifier (value:
integer or string), or, if no algorithm is registered with this
value, by its name used in the registry. If the second item is not
given, the default algorithm used is -16 ("SHA-256").
No uppercase variant prefix is defined for the application-extension
identifier "hash".
+===============+====================================+
| hash literal  | plain EDN                          |
+===============+====================================+
| hash<<'foo'>> | h'2C26B46B68FFC68FF99B453C1D304134 |
|               | 13422D706483BFA0F98A5E886266E7AE'  |
+---------------+------------------------------------+
| hash'foo'     | h'2C26B46B68FFC68FF99B453C1D304134 |
|               | 13422D706483BFA0F98A5E886266E7AE'  |
+---------------+------------------------------------+
| hash<<'foo',  | h'2C26B46B68FFC68FF99B453C1D304134 |
| -16>>         | 13422D706483BFA0F98A5E886266E7AE'  |
+---------------+------------------------------------+
| hash<<'foo',  | h'2C26B46B68FFC68FF99B453C1D304134 |
| "SHA-256">>   | 13422D706483BFA0F98A5E886266E7AE'  |
+---------------+------------------------------------+
| hash<<'foo',  | h'F7FBBA6E0636F890E56FBBF3283E524C |
| -44>>         | 6FA3204AE298382D624741D0DC663832   |
|               | 6E282C41BE5E4254D8820772C5518A2C   |
|               | 5A8C0C7F7EDA19594A7EB539453E1ED7'  |
+---------------+------------------------------------+
| hash<<'foo',  | h'F7FBBA6E0636F890E56FBBF3283E524C |
| "SHA-512">>   | 6FA3204AE298382D624741D0DC663832   |
|               | 6E282C41BE5E4254D8820772C5518A2C   |
|               | 5A8C0C7F7EDA19594A7EB539453E1ED7'  |
+---------------+------------------------------------+
Table 5: hash literals vs. plain EDN
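The values in Table 5 can be checked with `hashlib`. The lookup table below covers only the two algorithms that appear in the table and is an assumption of this sketch, not a rendering of the full COSE Algorithms registry; `hash_literal` is a hypothetical name.

```python
import hashlib

# COSE identifier or registry name -> hashlib algorithm name (partial)
_COSE_HASHES = {
    -16: "sha256", "SHA-256": "sha256",
    -44: "sha512", "SHA-512": "sha512",
}


def hash_literal(data, algorithm=-16):
    """Plain-EDN text for hash<<data, algorithm>>; default is SHA-256."""
    name = _COSE_HASHES[algorithm]
    digest = hashlib.new(name, data).hexdigest().upper()
    return f"h'{digest}'"
```

`hash_literal(b"foo")` and `hash_literal(b"foo", "SHA-256")` return the same h'' literal, matching the first four rows of the table.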
4. Stand-in Representations in Binary CBOR
In some cases, an EDN consumer cannot construct actual CBOR items
that represent the CBOR data intended for eventual interchange. This
document defines stand-in representation for two such cases:
- The EDN consumer does not know (or does not implement) an
  application-extension identifier used in the EDN document
  (Section 4.1) but wants to preserve the information for a later
  processor.
- The generator of some EDN intended for human consumption (such as
  in a specification document) may not want to include parts of the
  final data item, destructively replacing complete subtrees or
  possibly just parts of a lengthy string by elisions
  (Section 4.2).
Implementation note: Typically, the ultimate applications will fail
if they encounter tags unknown to them, which the ones defined in
this section likely are. Where chains of tools are involved in
processing EDN, it may be useful to fail earlier than at the ultimate
receiver in the chain unless specific processing options (e.g.,
command line flags) are given that indicate which of these stand-ins
are expected at this stage in the chain.
4.1. Handling unknown application-extension identifiers
When ingesting CBOR diagnostic notation, any application-oriented
extension literals are usually decoded and transformed into the
corresponding data item during ingestion. If an application-
extension is not known or not implemented by the ingesting process,
this is usually an error and processing has to stop.
However, in certain cases, it can be desirable to exceptionally carry
an uninterpreted application-oriented extension literal in an
ingested data item, allowing to postpone its decoding to a specific
later stage of ingestion.
This specification defines a CBOR Tag for this purpose: The
Diagnostic Notation Unresolved Application-Extension Tag, tag number
CPA999 (Section 6.5). The content of this tag is an array of a text
string for the application-extension identifier, and another array:
- For app-strings, the second array contains a single item, a text
  string containing the text notated by the single-quoted string in
  the app-string.
- For app-sequences, the second array contains zero or more items,
  which represent each item in the sequence contained in the
  app-sequence.
For example, cri'https://example.com' can be represented as /CPA/
999(["cri", ["https://example.com"]]), or hash<<"data", -44>> as
/CPA/ 999(["hash", ["data", -44]]).
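A sketch of how an ingesting tool might emit this stand-in as EDN text. It uses 999 only as the placeholder written in the examples above (the actual tag number is still to be assigned), relies on JSON text strings being valid EDN, and supports only items that `json.dumps` can render (text strings, integers, booleans, null). Both function names are hypothetical.

```python
import json


def unresolved_app_string(prefix, content):
    """Stand-in for prefix'content': the inner array holds one text string."""
    return f"/CPA/ 999([{json.dumps(prefix)}, [{json.dumps(content)}]])"


def unresolved_app_sequence(prefix, items):
    """Stand-in for prefix<<items>>: the inner array holds each item."""
    rendered = ", ".join(json.dumps(item) for item in items)
    return f"/CPA/ 999([{json.dumps(prefix)}, [{rendered}]])"
```

`unresolved_app_sequence("hash", ["data", -44])` reproduces the second example above.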
If a stage of ingestion is not prepared to handle the Unresolved
Application-Extension Tag, this is an error and processing has to
stop, as if this stage had been ingesting an unknown or unimplemented
application-extension literal itself.
4.2. Handling information deliberately elided from an EDN document
When using EDN for exposition in a document or on a whiteboard, it is
often useful to be able to leave out parts of an EDN document that
are not of interest at that point of the exposition.
To facilitate this, this specification supports the use of an
ellipsis (notated as three or more dots in a row, as in ...) to
indicate parts of an EDN document that have been elided (and
therefore cannot be reconstructed).
Upon ingesting EDN as a representation of a CBOR data item for
further processing, the occurrence of an ellipsis usually is an error
and processing has to stop.
However, it is useful to be able to process EDN documents with
ellipses in the automation scripts for the documents using them.
This specification defines a CBOR Tag that can be used in the
ingestion for this purpose: The Diagnostic Notation Ellipsis Tag, tag
number CPA888 (Section 6.5). The content of this tag either is
- null (indicating a data item entirely replaced by an ellipsis),
  or it is
- an array, the elements of which are alternating between fragments
  of a string and the actual elisions, represented as ellipses
  carrying a null as content.
Elisions can stand in for entire subtrees, e.g. in:
[1, 2, ..., 3]
{ "a": 1,
"b": ...,
...: ...
}
A single ellipsis (or key/value pair of ellipses) can imply eliding
multiple elements in an array (members in a map); if more detailed
control is required, a data definition language such as CDDL can be
employed. (Note that the stand-in form defined here does not allow
multiple key/value pairs with an ellipsis as a key: the CBOR data
item would not be valid.)
Subtree elisions can be represented in a CBOR data item by using
/CPA/888(null) as the stand-in:
[1, 2, 888(null), 3]
{ "a": 1,
"b": 888(null),
888(null): 888(null)
}
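For an exporter, the mapping above can be run in reverse: a data item carrying the stand-in tag is printed as an ellipsis. The following is a minimal sketch in Python, not the codec's actual API. The Tag class, the to_edn function and the constant ELLIPSIS_TAG are hypothetical names, 888 is the placeholder tag number used in the text, and only a small subset of data items is handled.

```python
from dataclasses import dataclass
from typing import Any

ELLIPSIS_TAG = 888  # placeholder number, written /CPA/888 in the text


@dataclass(frozen=True)
class Tag:
    """Hypothetical in-memory model of a CBOR tag: number plus content."""
    number: int
    content: Any


def to_edn(item: Any) -> str:
    """Render a small subset of CBOR data items as EDN text."""
    if isinstance(item, Tag):
        if item.number == ELLIPSIS_TAG and item.content is None:
            return "..."  # subtree elision stand-in becomes an ellipsis
        return f"{item.number}({to_edn(item.content)})"
    if item is None:
        return "null"
    if isinstance(item, bool):  # must be tested before int
        return "true" if item else "false"
    if isinstance(item, int):
        return str(item)
    if isinstance(item, str):
        # Assumption: the text contains nothing that needs escaping.
        return f'"{item}"'
    if isinstance(item, list):
        return "[" + ", ".join(to_edn(x) for x in item) + "]"
    if isinstance(item, dict):
        pairs = (f"{to_edn(k)}: {to_edn(v)}" for k, v in item.items())
        return "{" + ", ".join(pairs) + "}"
    raise TypeError(f"unsupported item: {item!r}")
```

With these assumptions, to_edn([1, 2, Tag(888, None), 3]) yields the first example again, and a map using Tag(888, None) as both key and value yields the ...: ... member.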
Elisions also can be used as part of a (text or byte) string:
{ "contract": "Herewith I buy" + ... + "gned: Alice & Bob",
"bytes_in_IRI": 'https://a.example/' + ... + '&q=Übergrößenträger',
"signature": h'4711...0815',
}
The example "contract" combines string concatenation via the +
operator (Section 5.1) with ellipses; while the example "signature"
uses special syntax that allows the use of ellipses between the bytes
notated inside h'' literals.
String elisions can be represented in a CBOR data item by a stand-in
that wraps an array of string fragments alternating with ellipsis
indicators:
{ "contract": /CPA/888(["Herewith I buy", 888(null),
"gned: Alice & Bob"]),
"bytes_in_IRI": 888(['https://a.example/', 888(null),
'&q=Übergrößenträger']),
"signature": 888([h'4711', 888(null), h'0815']),
}
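A string stand-in can be printed back as a concatenation of fragments and ellipses. The sketch below is one possible approach in Python under stated assumptions: Tag, fragment_to_edn and elided_string_to_edn are hypothetical names, 888 is the placeholder tag number from the text, text fragments are assumed to need no escaping, and byte strings are written as separate h'' literals joined with + rather than with the in-literal ellipsis form.

```python
from dataclasses import dataclass
from typing import Any

ELLIPSIS_TAG = 888  # placeholder number, written /CPA/888 in the text


@dataclass(frozen=True)
class Tag:
    """Hypothetical in-memory model of a CBOR tag: number plus content."""
    number: int
    content: Any


def fragment_to_edn(fragment: Any) -> str:
    """Render one element of the stand-in array."""
    is_ellipsis = (
        isinstance(fragment, Tag)
        and fragment.number == ELLIPSIS_TAG
        and fragment.content is None
    )
    if is_ellipsis:
        return "..."
    if isinstance(fragment, str):
        # Assumption: the text contains nothing that needs escaping.
        return '"' + fragment + '"'
    if isinstance(fragment, bytes):
        return "h'" + fragment.hex() + "'"
    raise TypeError(f"unsupported fragment: {fragment!r}")


def elided_string_to_edn(tag: Tag) -> str:
    """Render 888([...]) as fragments and ellipses joined by ' + '."""
    if tag.number != ELLIPSIS_TAG or not isinstance(tag.content, list):
        raise ValueError("not a string elision stand-in")
    return " + ".join(fragment_to_edn(f) for f in tag.content)
```

For instance, the "contract" stand-in above would be rendered as "Herewith I buy" + ... + "gned: Alice & Bob", and the "signature" stand-in as h'4711' + ... + h'0815'.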
Note that the use of elisions is different from "commenting out" EDN
text, e.g.:
{ "signature": h'4711/.../0815',
# ...: ...
}
The consumer of this EDN will ignore the comments and therefore will
have no idea after ingestion that some information has been elided;
validation steps may then simply fail instead of being informed about
the elisions.
5. ABNF Definitions
This section collects grammars in ABNF form ([STD68] as extended in
[RFC7405]) that serve to define the syntax of EDN and some
application-oriented literals.
Bormann Expires 8 January 2026 [Page 28]
Internet-Draft CBOR Extended Diagnostic Notation (EDN) July 2025
Implementation note: The ABNF definitions in this section are
intended to be useful in a Parsing Expression Grammar (PEG) parser
interpretation (see Appendix A of [RFC8610] for an introduction into
PEG).
5.1. Overall ABNF Definition for Extended Diagnostic Notation
This subsection provides an overall ABNF definition for the syntax of
CBOR extended diagnostic notation.
For simplicity, the internal parsing for the built-in EDN prefixes is
specified in the same way. ABNF definitions for h'' and b64'' are
provided in Section 5.2.1 and Section 5.2.2. However, the prefixes
b32'' and h32'' are not in wide use and an ABNF definition in this
document could therefore not be based on implementation experience.
seq = S [item *(MSC item) SOC]
one-item = S item S
item = map / array / tagged
/ number / simple
/ string / streamstring
string1 = (tstr / bstr) spec
string1e = string1 / ellipsis
ellipsis = 3*"." ; "..." or more dots
string = string1e *(S "+" S string1e)
number = (hexfloat / hexint / octint / binint
/ decnumber / nonfin) spec
sign = "+" / "-"
decnumber = [sign] (1*DIGIT ["." *DIGIT] / "." 1*DIGIT)
["e" [sign] 1*DIGIT]
hexfloat = [sign] "0x" (1*HEXDIG ["." *HEXDIG] / "." 1*HEXDIG)
"p" [sign] 1*DIGIT
hexint = [sign] "0x" 1*HEXDIG
octint = [sign] "0o" 1*ODIGIT
binint = [sign] "0b" 1*BDIGIT
nonfin = %s"Infinity"
/ %s"-Infinity"
/ %s"NaN"
simple = %s"false"
/ %s"true"
/ %s"null"
/ %s"undefined"
/ %s"simple(" S item S ")"
uint = "0" / DIGIT1 *DIGIT
tagged = uint spec "(" S item S ")"
app-prefix = lcalpha *lcldh ; including h and b64
/ ucalpha *ucldh ; tagged variant, if defined
app-string = app-prefix sqstr
app-sequence = app-prefix "<<" seq ">>"
sqstr = SQUOTE *single-quoted SQUOTE
bstr = app-string / sqstr / app-sequence / embedded
; app-string/-sequence could be any type
tstr = DQUOTE *double-quoted DQUOTE
embedded = "<<" seq ">>"
array = "[" (specms S item *(MSC item) SOC / spec S) "]"
map = "{" (specms S keyp *(MSC keyp) SOC / spec S) "}"
keyp = item S ":" S item
; We allow %x09 HT in prose, but not in strings
blank = %x09 / %x0A / %x0D / %x20
non-slash = blank / %x21-2e / %x30-7F / NONASCII
non-lf = %x09 / %x0D / %x20-7F / NONASCII
comment = "/" *non-slash "/"
/ "#" *non-lf %x0A
; optional space
S = *blank *(comment *blank)
; mandatory space
MS = (blank/comment) S
; mandatory comma and/or space
MSC = ("," S) / (MS ["," S])
; optional comma and/or space
SOC = S ["," S]
; check semantically that strings are either all text or all bytes
; note that there must be at least one string to distinguish
streamstring = "(_" MS string *(MSC string) SOC ")"
spec = ["_" *wordchar]
specms = ["_" *wordchar MS]
double-quoted = unescaped
/ SQUOTE
/ "\" escapable-d
single-quoted = unescaped
/ DQUOTE
/ "\" escapable-s
escapable1 = %s"b" ; BS backspace U+0008
/ %s"f" ; FF form feed U+000C
/ %s"n" ; LF line feed U+000A
/ %s"r" ; CR carriage return U+000D
/ %s"t" ; HT horizontal tab U+0009
/ "\" ; \ backslash (reverse solidus) U+005C
escapable-d = escapable1
/ DQUOTE
/ "/" ; / slash (solidus) U+002F (JSON!)
/ (%s"u" hexchar) ; uXXXX U+XXXX
escapable-s = escapable1
/ SQUOTE
/ (%s"u" hexchar-s) ; uXXXX U+XXXX
hexchar = "{" (1*"0" [ hexscalar ] / hexscalar) "}"
/ non-surrogate
/ two-surrogate
non-surrogate = ((DIGIT / "A"/"B"/"C" / "E"/"F") 3HEXDIG)
/ ("D" ODIGIT 2HEXDIG )
two-surrogate = high-surrogate "\" %s"u" low-surrogate
high-surrogate = "D" ("8"/"9"/"A"/"B") 2HEXDIG
low-surrogate = "D" ("C"/"D"/"E"/"F") 2HEXDIG
hexscalar = "10" 4HEXDIG / HEXDIG1 4HEXDIG
/ non-surrogate / 1*3HEXDIG
; single-quote hexchar-s: don't allow 0020..007e
hexchar-s = "{" (1*"0" [ hexscalar-s ] / hexscalar-s) "}"
/ non-surrogate-s
/ two-surrogate
non-surrogate-s = "007F" ; rubout
/ "00" ("0"/"1"/"8"/"9"/HEXDIGA) HEXDIG
/ "0" HEXDIG1 2HEXDIG
/ non-surrogate-1
non-surrogate-1 = ((DIGIT1 / "A"/"B"/"C" / "E"/"F") 3HEXDIG)
/ ("D" ODIGIT 2HEXDIG )
hexscalar-s = "10" 4HEXDIG / HEXDIG1 4HEXDIG
/ non-surrogate-1 / HEXDIG1 2HEXDIG
/ ("1"/"8"/"9"/HEXDIGA) HEXDIG
/ "7F"
/ HEXDIG1
; Note that no other C0 characters are allowed, including %x09 HT
unescaped = %x0A ; new line
/ %x0D ; carriage return -- ignored on input
/ %x20-21
; omit 0x22 "
/ %x23-26
; omit 0x27 '
/ %x28-5B
; omit 0x5C
/ %x5D-7F
/ NONASCII
DQUOTE = %x22 ; " double quote
SQUOTE = "'" ; ' single quote
DIGIT = %x30-39 ; 0-9
DIGIT1 = %x31-39 ; 1-9
ODIGIT = %x30-37 ; 0-7
BDIGIT = %x30-31 ; 0-1
HEXDIGA = "A" / "B" / "C" / "D" / "E" / "F"
; Note: double-quoted strings as in "A" are case-insensitive in ABNF
HEXDIG = DIGIT / HEXDIGA
HEXDIG1 = DIGIT1 / HEXDIGA
lcalpha = %x61-7A ; a-z
lcldh = lcalpha / DIGIT / "-"
ucalpha = %x41-5A ; A-Z
ucldh = ucalpha / DIGIT / "-"
ALPHA = lcalpha / ucalpha
wordchar = "_" / ALPHA / DIGIT ; [_a-z0-9A-Z]
NONASCII = %x80-D7FF / %xE000-10FFFF
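When exporting a text string, the output has to match the tstr rule. The following Python sketch shows one way to do that, assuming that the escape introducer in double-quoted is the backslash (U+005C). The function name quote_tstr is hypothetical, and the input is assumed to contain no lone surrogates. Line feed is permitted unescaped by the grammar; this sketch escapes it anyway to keep the output on one line, and it escapes carriage return because an unescaped one is ignored on input.

```python
# Single-character escapes from the escapable1 rule, plus the double quote.
_SHORT_ESCAPES = {
    "\b": "\\b",
    "\f": "\\f",
    "\n": "\\n",
    "\r": "\\r",
    "\t": "\\t",
    "\\": "\\\\",
    '"': '\\"',
}


def quote_tstr(text: str) -> str:
    """Return text as a double-quoted EDN text string literal."""
    out = ['"']
    for ch in text:
        if ch in _SHORT_ESCAPES:
            out.append(_SHORT_ESCAPES[ch])
        elif ord(ch) < 0x20:
            # Remaining C0 controls are not allowed unescaped; use the
            # four-digit \uXXXX form (the non-surrogate alternative).
            out.append(f"\\u{ord(ch):04X}")
        else:
            # %x20-7F except the quote and backslash, and NONASCII,
            # may appear as they are.
            out.append(ch)
    out.append('"')
    return "".join(out)
```

The single quote is left as it is, since double-quoted allows SQUOTE directly; a byte-string exporter using sqstr would need the mirror-image treatment.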