
AVRO-4135: [C] full type name for json encoded unions #3373


Open · wants to merge 1 commit into main

Conversation

steven-aerts
Contributor

What is the purpose of the change

Like Java, use the full type name when encoding the branch types of unions.
This prevents ambiguities when decoding this JSON-encoded field back to Avro with, for example, the Java library.
Without this patch, the Java JsonDecoder cannot read a union whose records carry a namespace entry when the JSON was generated by the Avro C library.

Verifying this change

This change adds tests and can be verified by running test_avro_4135.

Documentation

  • Does this pull request introduce a new feature? no

@github-actions bot added the Java (Pull Requests for Java binding), build and C labels on May 6, 2025
Like Java, use the full type name when encoding the types for unions.
This prevents ambiguities when decoding this JSON-encoded field back
to Avro with, for example, the Java library.
@github-actions bot removed the Java (Pull Requests for Java binding) and build labels on May 6, 2025
@KalleOlaviNiemitalo
Contributor

Can you show an example of the ambiguous JSON data and the Avro schema with which it was encoded?

https://avro.apache.org/docs/1.12.0/specification/#json-encoding stipulates:

otherwise it is encoded as a JSON object with one name/value pair whose name is the type’s name and whose value is the recursively encoded value. For Avro’s named types (record, fixed or enum) the user-specified name is used, for other types the type name is used.

That specifically says "name" rather than "fullname". If the JSON encoders now should output the fullname of the type as the property name, then the specification should be changed.

This likewise says "name" rather than "fullname":

Unions may not contain more than one schema with the same type, except for the named types record, fixed and enum. For example, unions containing two array types or two map types are not permitted, but two types with different names are permitted. (Names permit efficient resolution when reading and writing unions.)

Moreover, the schema resolution part of the specification says that schemas match if:

both schemas are records with the same (unqualified) name

Suppose the specification were changed to allow multiple identically named record schemas as members of the same union, as long as they are in different namespaces. If such a union schema is the reader's schema, both record schemas can match the writer's schema, in which case the first matching schema must be used in schema resolution, even if the other record schema has the same namespace as the writer's schema. That seems an undesirable result to me.

@steven-aerts
Contributor Author

steven-aerts commented May 6, 2025

@KalleOlaviNiemitalo the ambiguity you describe is contained in AVRO-2287.

What we are seeing is that the C library goes for the name interpretation and the Java library goes for the fullname interpretation.

To reproduce this with an example, I used the schema at the bottom of this message.

To encode:

  • avro_value_to_json will output {"field": {"r2": {}}}.
  • The Java JsonEncoder, on the other hand, outputs {"field": {"space.r2": {}}} (see the sketch below).
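
For reference, a minimal Java sketch (my own, using the generic API; the class name Avro4135Encode is hypothetical) that reproduces the JsonEncoder output shown above for the example schema at the bottom of this message:

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.io.JsonEncoder;

public class Avro4135Encode {
    public static void main(String[] args) throws Exception {
        // Same schema as at the bottom of this message.
        Schema schema = new Schema.Parser().parse(
                "{\"type\": \"record\", \"name\": \"r\", \"fields\": ["
                + "  {\"name\": \"field\", \"type\": [\"null\","
                + "    {\"type\": \"record\", \"name\": \"r2\", \"namespace\": \"space\", \"fields\": []}]}]}");
        Schema r2 = schema.getField("field").schema().getTypes().get(1);

        // Build a record whose union field holds an empty space.r2 record.
        GenericRecord record = new GenericData.Record(schema);
        record.put("field", new GenericData.Record(r2));

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        JsonEncoder encoder = EncoderFactory.get().jsonEncoder(schema, out);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();
        System.out.println(out);  // prints {"field":{"space.r2":{}}}
    }
}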

On the decoding side:

C does not have an implementation for decoding the JSON encoding.

However, when you try to read {"field": {"r2": {}}} with the Java JsonDecoder for this schema, you get:

Exception in thread "main" org.apache.avro.AvroTypeException: Unknown union branch r2
	at org.apache.avro.io.JsonDecoder.readIndex(JsonDecoder.java:434)
	at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:282)
	at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:188)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161)
	at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:260)
	at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:248)
	at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:180)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161)

This shows the implementation differences between the two languages; Java clearly expects a fully qualified name for union branches.

This patch proposes to line up the behavior for C with that of Java.

Would you prefer the other way around? Are you willing to accept a patch on the Java side which allows reading the C encoding?

Example schema:

{
    "type": "record",
    "name": "r",
    "fields": [
        {
            "name": "field",
            "type": [
                "null",
                {
                    "type": "record",
                    "name": "r2",
                    "namespace": "space",
                    "fields": []
                }
            ]
        }
    ]
}

Java code used to generate the exception:

// Assumes `schema` holds the parsed example schema above.
var decoder = DecoderFactory.get().jsonDecoder(schema, "{\"field\": {\"r2\": {}}}");
var data = new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
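
A self-contained version of that repro (the class name Avro4135Decode is my own; otherwise only the public Avro Java API is used), contrasting the qualified branch name the Java JsonDecoder accepts with the unqualified one the C library currently emits:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;

public class Avro4135Decode {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
                "{\"type\": \"record\", \"name\": \"r\", \"fields\": ["
                + "  {\"name\": \"field\", \"type\": [\"null\","
                + "    {\"type\": \"record\", \"name\": \"r2\", \"namespace\": \"space\", \"fields\": []}]}]}");
        GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);

        // Qualified branch name, as written by the Java JsonEncoder: decodes fine.
        GenericRecord ok = reader.read(null,
                DecoderFactory.get().jsonDecoder(schema, "{\"field\": {\"space.r2\": {}}}"));
        System.out.println(ok);

        // Unqualified branch name, as currently written by the C library:
        // throws org.apache.avro.AvroTypeException: Unknown union branch r2
        reader.read(null,
                DecoderFactory.get().jsonDecoder(schema, "{\"field\": {\"r2\": {}}}"));
    }
}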

@steven-aerts
Contributor Author

@KalleOlaviNiemitalo in the meantime I checked how all the other language bindings interpret the spec.

All the language bindings that have a JSON encoder seem to fully qualify their types; only the C binding deviates here. So this patch lines up its behavior with the other language bindings.

Binding   JSON union encoding   JSON union decoding
C         unqualified           N/A
C++       qualified             qualified
csharp    qualified [1]         qualified
java      qualified             qualified
js        qualified             qualified and unqualified [2]
perl      N/A                   N/A
PHP       N/A                   N/A
python    N/A                   N/A
ruby      N/A                   N/A
rust      N/A?                  N/A?

[1]: it is possible to configure a non-standard encoding where the union value is not wrapped in an object. That variant is not decodable.
[2]: the unqualified form is described in the code as being less strict. I think this is done because it is part of the standard flow for encoding any JavaScript object to (binary) Avro, giving developers who encode unions a little less to type.

Again, if you think decoders should be more lenient, like the js decoder, I do not mind working out a patch for the Java bindings.

steven-aerts added a commit to steven-aerts/avro that referenced this pull request May 7, 2025
Support unqualified type references for unions when decoding them from
JSON.
AVRO-2287 makes it unclear whether the type reference for a JSON-encoded
union needs to be qualified or not.
Today all encoders use the fully qualified type names, except the C JSON
encoder, which uses the unqualified name.

In this patch we make the Java JSON decoder more lenient and let it
fall back to unqualified type names when no qualified type name
matches. This matches the behavior currently implemented in the
JavaScript JSON decoder.

This patch is an alternative to apache#3373, which proposes
to update the C library instead.
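
A minimal sketch of the kind of fallback that commit describes, using only the public Schema API; the helper class and its name are hypothetical and this is not the actual #3374 change:

import java.util.List;
import org.apache.avro.Schema;

final class UnionBranchResolver {
    // Resolve a union branch from the JSON property name: try the fully
    // qualified name first, then fall back to the unqualified name.
    static int resolve(Schema union, String label) {
        List<Schema> branches = union.getTypes();
        for (int i = 0; i < branches.size(); i++) {
            if (branches.get(i).getFullName().equals(label)) {
                return i;
            }
        }
        for (int i = 0; i < branches.size(); i++) {
            if (branches.get(i).getName().equals(label)) {
                return i;
            }
        }
        return -1;  // unknown union branch
    }
}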
@steven-aerts
Contributor Author

@KalleOlaviNiemitalo in #3374 I propose adding support for unqualified type resolution in the Java JsonDecoder.
