A pipeline system #890

@faassen

I'm going to sketch out a pipeline/middleware system for quick_xml to explore whether there is any interest in taking this further. I started to build this but realized that it's difficult to implement without access to quick_xml internals, in particular the NamespaceResolver infrastructure. Pipelines are still possible without this infrastructure, but they then require writing to intermediate XML strings and parsing them again.

Here's what I envision:

Use cases

  • I want to write canonicalized XML output. I'd like to layer that over the quick_xml Serde serializer without having to write to a string first.

  • I want to take just an element and all its descendants in XML and post-process it (for instance applying canonicalization). I'd like to do that without having to serialize that to a separate XML string as an intermediate step.

  • The quick_xml deserializer isn't namespace-aware. You can use prefixes in annotations in Rust types, but what if the input uses other prefixes? We could inject middleware that rewrites the prefixes to the set known by our types.
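To make the prefix rewriting use case concrete, here is a minimal sketch of the core lookup such middleware would do. Everything in it is an assumption for illustration: `rewrite_prefix` is a hypothetical helper, names are plain strings rather than quick_xml types, and default namespaces are ignored.

```rust
use std::collections::HashMap;

/// Rewrite the prefix of a qualified name such as `x:item`.
///
/// `in_scope` maps prefixes used by the input document to namespace URIs.
/// `expected` maps namespace URIs to the prefixes our Rust types know.
/// Returns `None` if the prefix is unbound or the namespace is unknown.
/// (Hypothetical helper, not part of quick_xml.)
fn rewrite_prefix(
    qname: &str,
    in_scope: &HashMap<&str, &str>,
    expected: &HashMap<&str, &str>,
) -> Option<String> {
    match qname.split_once(':') {
        // Unprefixed names are passed through unchanged in this sketch.
        None => Some(qname.to_string()),
        Some((prefix, local)) => {
            let uri = *in_scope.get(prefix)?;
            let new_prefix = *expected.get(uri)?;
            Some(format!("{}:{}", new_prefix, local))
        }
    }
}
```

A real middleware step would apply this to element and attribute names in each start and end event, and would need the resolver state to know which prefixes are in scope.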

Traits

Reader trait

There's an abstract Reader trait. In reality we may need multiple ones to cover all the bases - a slice reader, a buffered reader, a slice ns reader, and a buffered ns reader.

The idea is that you can use this instead of a concrete reader.

I'm most interested in the most complex scenario, where I have access to namespace prefixes and the like. In the rest of the story I will pretend there's only a single reader, just for convenience (and it might be a place to start anyway).

The ns reader trait should implement prefixes, and the various resolve_ methods.
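As a rough illustration of the shape I have in mind, here is a self-contained sketch. The `Event` enum is a simplified stand-in for quick_xml's event type, and `XmlRead`, `VecReader` and `resolve_prefix` are hypothetical names that do not exist in quick_xml today.

```rust
use std::collections::VecDeque;

// Simplified stand-in for quick_xml's event type, so the sketch compiles alone.
#[derive(Debug, Clone, PartialEq)]
#[allow(dead_code)]
enum Event {
    Start(String),
    End(String),
    Text(String),
    Eof,
}

// Hypothetical reader trait: pipeline steps would be written against this
// instead of a concrete reader.
trait XmlRead {
    /// Pull the next event; `Event::Eof` signals the end of input.
    fn read_event(&mut self) -> Event;

    /// (prefix, namespace URI) bindings currently in scope, innermost last.
    fn prefixes(&self) -> Vec<(String, String)>;

    /// Resolve a prefix to its namespace URI; the innermost binding wins.
    fn resolve_prefix(&self, prefix: &str) -> Option<String> {
        self.prefixes()
            .into_iter()
            .rev()
            .find(|(p, _)| p == prefix)
            .map(|(_, uri)| uri)
    }
}

// Toy implementation that replays a fixed list of events.
struct VecReader {
    events: VecDeque<Event>,
    bindings: Vec<(String, String)>,
}

impl XmlRead for VecReader {
    fn read_event(&mut self) -> Event {
        self.events.pop_front().unwrap_or(Event::Eof)
    }

    fn prefixes(&self) -> Vec<(String, String)> {
        self.bindings.clone()
    }
}
```

A real trait would probably borrow instead of cloning and return a `Result`; this only shows which operations a pipeline step would rely on.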

Writer trait

Similarly, there's a Writer trait. This can be much simpler: it takes write_event. I think it also needs something to initialize known namespace prefixes (more about why later), and perhaps it's handy to be able to ask whether writing has already begun, so we know whether to initialize or not.
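A minimal sketch of what that could look like, again with a simplified stand-in `Event` type. `XmlWrite`, `init_prefixes`, `has_started` and `Collector` are hypothetical names, not existing quick_xml API.

```rust
// Simplified stand-in for quick_xml's event type.
#[derive(Debug, Clone, PartialEq)]
#[allow(dead_code)]
enum Event {
    Start(String),
    End(String),
    Text(String),
    Eof,
}

// Hypothetical writer trait.
trait XmlWrite {
    /// Accept one event.
    fn write_event(&mut self, event: Event);

    /// Seed the writer with (prefix, namespace URI) bindings that are already
    /// in scope. Returns false if writing has begun and the call was ignored.
    fn init_prefixes(&mut self, prefixes: &[(String, String)]) -> bool;

    /// Whether any event has been written yet.
    fn has_started(&self) -> bool;
}

// Toy implementation that collects events in memory.
struct Collector {
    events: Vec<Event>,
    prefixes: Vec<(String, String)>,
}

impl XmlWrite for Collector {
    fn write_event(&mut self, event: Event) {
        self.events.push(event);
    }

    fn init_prefixes(&mut self, prefixes: &[(String, String)]) -> bool {
        if self.has_started() {
            return false;
        }
        self.prefixes.extend_from_slice(prefixes);
        true
    }

    fn has_started(&self) -> bool {
        !self.events.is_empty()
    }
}
```

Whether a late `init_prefixes` call should be ignored, as here, or treated as an error is an open design choice.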

Interesting implementations

(Ns)Reader

(Ns)Reader implements the Reader trait.

Writer

The Writer implements the Writer trait. This is so that our pipeline can end in a Writer if we want XML output.

It may have a special feature to take the prefixes it gets and declare them on the outer element it writes, if that element doesn't already have such declarations (or this may live in a little middleware).

Pipeline step

A pipeline step takes a Reader and a Writer (trait implementations), pulls events from the reader and writes events to the writer. This is very similar to what you'd do now from a concrete reader to a concrete writer.
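For example, a step that drops whitespace-only text could look roughly like this. The `Event` enum is a simplified stand-in for quick_xml's event type, and the `XmlRead` and `XmlWrite` traits are reduced, hypothetical versions of the traits described above.

```rust
// Simplified stand-in for quick_xml's event type.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    Start(String),
    End(String),
    Text(String),
    Eof,
}

// Reduced, hypothetical reader and writer traits.
trait XmlRead {
    fn read_event(&mut self) -> Event;
}

trait XmlWrite {
    fn write_event(&mut self, event: Event);
}

// Toy reader: replay a vector of events, then report Eof.
impl XmlRead for std::vec::IntoIter<Event> {
    fn read_event(&mut self) -> Event {
        self.next().unwrap_or(Event::Eof)
    }
}

// Toy writer: collect events in a vector.
impl XmlWrite for Vec<Event> {
    fn write_event(&mut self, event: Event) {
        self.push(event);
    }
}

/// A pipeline step: pull events from `reader` until Eof, drop text that
/// consists only of whitespace, and forward everything else to `writer`.
fn strip_whitespace(reader: &mut impl XmlRead, writer: &mut impl XmlWrite) {
    loop {
        match reader.read_event() {
            Event::Eof => break,
            Event::Text(t) if t.trim().is_empty() => {}
            other => writer.write_event(other),
        }
    }
}
```

The step knows nothing about where events come from or where they go, which is what makes steps composable.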

Buffer

A pipeline step may take a single event and ignore it, or split it into multiple events. We don't want complicated state management inside pipeline steps; they should just deal with a reader and a writer. So we need something that implements both Reader and Writer: a Buffer, which holds events in a deque.

When a pipeline pulls from a buffer, the buffered events will be returned first, until the buffer is empty. Then the buffer invokes its reader to put more events into its buffer.

It also manages namespaces separately using NamespaceResolver, because pipeline steps could do interesting things to namespaces (this is in fact one of my use cases in canonicalization and prefix rewriting). A buffer can also be initialized with prefixes when writing starts, because a buffer may apply to a subset of the whole document.
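Here is a rough sketch of the deque part of such a Buffer. It makes several simplifying assumptions: `Event` is a stand-in for quick_xml's event type, the `XmlRead` and `XmlWrite` traits are reduced hypothetical versions, the step is modeled as a closure that handles one upstream event at a time, and the NamespaceResolver bookkeeping is left out entirely.

```rust
use std::collections::VecDeque;

// Simplified stand-in for quick_xml's event type.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    Start(String),
    End(String),
    Text(String),
    Eof,
}

// Reduced, hypothetical reader and writer traits.
trait XmlRead {
    fn read_event(&mut self) -> Event;
}

trait XmlWrite {
    fn write_event(&mut self, event: Event);
}

// Toy upstream reader: replay a vector of events, then report Eof.
impl XmlRead for std::vec::IntoIter<Event> {
    fn read_event(&mut self) -> Event {
        self.next().unwrap_or(Event::Eof)
    }
}

/// Hypothetical buffer: `step` turns one upstream event into zero or more
/// queued events.
struct Buffer<R, F> {
    queue: VecDeque<Event>,
    source: R,
    step: F,
}

impl<R: XmlRead, F: FnMut(Event, &mut VecDeque<Event>)> XmlRead for Buffer<R, F> {
    fn read_event(&mut self) -> Event {
        loop {
            // Buffered events are returned first.
            if let Some(event) = self.queue.pop_front() {
                return event;
            }
            // Queue is empty: pull one upstream event through the step.
            match self.source.read_event() {
                Event::Eof => return Event::Eof,
                event => (self.step)(event, &mut self.queue),
            }
        }
    }
}

impl<R, F> XmlWrite for Buffer<R, F> {
    fn write_event(&mut self, event: Event) {
        self.queue.push_back(event);
    }
}

/// Example step: split text into one event per word; whitespace-only text
/// therefore disappears. Other events pass through unchanged.
fn split_words(event: Event, out: &mut VecDeque<Event>) {
    match event {
        Event::Text(t) => {
            out.extend(t.split_whitespace().map(|w| Event::Text(w.to_string())))
        }
        other => out.push_back(other),
    }
}
```

The loop in `read_event` is what lets a step emit nothing for an event: the buffer simply keeps pulling until something is queued or the upstream reader is exhausted.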

Splitter

This splits a single stream of events into multiple streams of events, based on some criterion on BytesStart (and namespace info). All events up to and including the matching end tag will be streamed to a specific pipeline. This way you can efficiently select one or more parts of the document for further processing.

This takes a hashmap of pipeline names to Writer implementations (a hashmap, because how many pipelines should exist can in many cases only be determined at runtime) and a function that, given a BytesStart and namespace information, determines which pipeline name the element belongs to (or that it should not be piped through at all). Because it takes Writer implementations, we can put in a Buffer with a pipeline step under it.
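A minimal sketch of the routing logic, under stated assumptions: `Event` is a simplified stand-in for quick_xml's event type, the selector only sees the tag name (no BytesStart or namespace information), and `XmlWrite`, `Splitter`, `feed` and `select_items` are hypothetical names. Events outside any selected subtree are simply dropped here.

```rust
use std::collections::HashMap;

// Simplified stand-in for quick_xml's event type.
#[derive(Debug, Clone, PartialEq)]
#[allow(dead_code)]
enum Event {
    Start(String),
    End(String),
    Text(String),
    Eof,
}

// Reduced, hypothetical writer trait.
trait XmlWrite {
    fn write_event(&mut self, event: Event);
}

impl XmlWrite for Vec<Event> {
    fn write_event(&mut self, event: Event) {
        self.push(event);
    }
}

/// Hypothetical splitter: routes selected subtrees to named pipelines.
struct Splitter<W, F> {
    pipelines: HashMap<String, W>,
    select: F,
    // Pipeline currently receiving events, and the nesting depth inside it.
    active: Option<(String, usize)>,
}

impl<W: XmlWrite, F: Fn(&str) -> Option<String>> Splitter<W, F> {
    fn feed(&mut self, event: Event) {
        if let Some((name, depth)) = self.active.take() {
            // Inside a selected subtree: track nesting and forward everything.
            let depth = match &event {
                Event::Start(_) => depth + 1,
                Event::End(_) => depth - 1,
                _ => depth,
            };
            if let Some(writer) = self.pipelines.get_mut(&name) {
                writer.write_event(event);
            }
            if depth > 0 {
                self.active = Some((name, depth));
            }
            return;
        }
        // Outside any subtree: only a start tag can open a pipeline.
        let selected = match &event {
            Event::Start(tag) => (self.select)(tag),
            _ => None,
        };
        if let Some(name) = selected {
            if let Some(writer) = self.pipelines.get_mut(&name) {
                writer.write_event(event);
                self.active = Some((name, 1));
            }
        }
    }
}

/// Example criterion: send every `item` element to the "items" pipeline.
fn select_items(tag: &str) -> Option<String> {
    if tag == "item" { Some("items".to_string()) } else { None }
}
```

In the real thing, the map values would typically be Buffers with a pipeline step under them rather than plain vectors.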

Namespace resolver

Right now NsReader is already a bit of a pipeline step on top of Reader. If we had a "namespace resolving" pipeline step we could generalize that. You could start a pipeline without namespaces and add namespace resolution as needed. I'm not entirely sure this is worth it, as I still think you need a reader trait that supports prefixes and resolve_, since you'd want to write your pipeline steps against those.

Related topics

This relates to #611 and #881 as those would enable pipeline support for Serde (de)serialization.

Next steps

We need to answer a bunch of questions:

  • Do we want to support these use cases with quick_xml at all?

  • Would it make sense to have this implemented in quick_xml or by another crate?

  • If by another crate, can we make quick_xml open up its APIs sufficiently to support this? The big blocker is NamespaceResolver, as without it, it becomes really difficult to implement Buffer.
