
Conversation

@bentsherman (Member) commented Dec 3, 2023

This PR is an exploration intended to add support for static types for process inputs and outputs.

TODO:

  • refactor runtime classes to be independent of DSL
  • separate process inputs/outputs from env/file/stdin/stdout declarations in the runtime
  • move process input channel merge logic to CombineManyOp
  • refactor TaskProcessor to accept a single merged input channel
  • move task output collect logic to Task*Collector classes
  • move helper classes for TaskConfig into generic LazyHelpers module for lazy evaluation
  • add static type syntax to DSL
  • add type validation to task processor
  • make sure resume works
  • static types for workflow takes/emits?
  • statically typed methods for process directives?
  • nullable paths (Nullable input/output paths #4293)?
  • unit tests

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
netlify bot commented Dec 3, 2023

Deploy Preview for nextflow-docs-staging canceled.

🔨 Latest commit: 1cd6fce
🔍 Latest deploy log: https://app.netlify.com/sites/nextflow-docs-staging/deploys/660a9cdb7f6bbf00098b125d

@bentsherman (Member, Author) commented Dec 9, 2023

Now that I have developed the builder classes for process inputs and outputs and refactored the TaskProcessor accordingly, I think it is possible to bring static types to the DSL.

The key insight is to decouple the staging and unstaging of files/envs/stdin/stdout from the actual inputs and outputs declaration. I have been able to greatly simplify the runtime code by doing this, but a bonus is that it allows you to use arbitrary types.

In its raw form, it would look like this:

process FOO {
    input:
    take 'sample_id'
    take 'files'

    env('SAMPLE_ID') { sample_id }
    path { files }

    output:
    env 'SAMPLE_ID'
    path '$file1', 'file.txt', arity: '1'

    emit { sample_id }
    emit { stdout }
    emit { [env('SAMPLE_ID'), path('$file1')] }
    emit { new Sample(sample_id, path('$file1')) }
}

This is a bit verbose, but the output envs and files need to be declared immediately so that Nextflow can unstage them in the task wrapper script (whereas the emits aren't evaluated until after the task is completed). But, as you can see, it allows you to take and emit whatever types you want. You could imagine the take method having a type option and then verifying the type at runtime.
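For instance, the type option mentioned above might look something like this (a hypothetical sketch — the `type` option is not implemented in this PR):

```groovy
process FOO {
    input:
    // hypothetical 'type' option on take, verified at runtime
    take 'sample_id', type: String
    take 'files', type: List

    // ...
}
```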

I think we can recover the existing DSL syntax on top of this API with some AST transforms and/or wrapper methods, but I still need to try this. So something like this in the current DSL:

process FOO {
    input:
    val my_val
    env SAMPLE_ID
    path 'file1.txt'
    tuple val(sample_id), path('file2.txt')

    output:
    val my_val
    env SAMPLE_ID
    path 'file1.txt'
    tuple val(sample_id), path('file2.txt')
}

Should be automatically translated to:

process FOO {
    input {
    take 'my_val' // $in0
    take '$in1'
    take '$in2'
    take '$in3'

    env('SAMPLE_ID') { $in1 }
    path('file1.txt') { $in2 }
    var('sample_id') { $in3[0] }
    path('file2.txt') { $in3[1] }
    }

    output {
    env('SAMPLE_ID')
    path('$file0', 'file1.txt')
    path('$file1', 'file2.txt')

    emit { my_val }
    emit { env('SAMPLE_ID') }
    emit { path('$file0') }
    emit { [sample_id, path('$file1')] }
    }
}

Another option might be to evaluate the emits before task execution and generate the outputs ahead of time, since the input vars are already defined. Calling env() / path() / stdout() would return a wrapper object that is bound to the final output after the task completes, so you at least wouldn't have to define every env/path output twice. This is basically what the tuple output does, and it works fine because Nextflow constructs the tuple directly, whereas with static types the user defines the emitted object.
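A sketch of that alternative (hypothetical — assumes env() / path() / stdout() return lazy placeholders rather than final values):

```groovy
process FOO {
    output:
    // hypothetical: emits are evaluated before the task runs;
    // path('$file1') returns a placeholder object that is bound
    // to the real output file once the task completes
    emit { new Sample(sample_id, path('$file1')) }
}
```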

@bentsherman (Member, Author)

Putting those speculations aside, I have refactored the existing DSL to use the new builders, establishing a clear boundary between the DSL and runtime. I have not added any new features to the DSL, but this PR lays the groundwork for future enhancements.

If we want to support static types in the DSL, I think there is a range of options:

  1. The take / emit syntax shown above is the most explicit and verbose for the user, but the implementation is simple and supports arbitrary objects

  2. We could go wild and add some kind of pattern matching syntax (see Rust, OCaml, and Python 3.10+ for examples). Likely the most difficult to implement, but it would be the most concise for the user and would also support arbitrary objects

  3. Maybe we don't need to support arbitrary objects. Maybe it would be enough to support flat lists with tuple, flat records with record, and perhaps flat maps with map. If so, it would just be a minor extension of the current syntax.

Note that if we add an alternative interface like the annotations, (3) is the obvious choice because users can fall back to the more verbose programmatic syntax if they need to do something that the DSL doesn't support.
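As a rough illustration, option (3) might extend the current qualifiers along these lines (hypothetical syntax — the record and map qualifiers are invented here for illustration):

```groovy
process FOO {
    input:
    tuple val(sample_id), path(reads)   // flat list, as today
    record val(id), path(files)         // hypothetical flat record
    map val('key'), path('reads')       // hypothetical flat map

    // ...
}
```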

@bentsherman changed the title from "Annotation API" to "Separate DSL from task processor" on Dec 11, 2023
@bentsherman (Member, Author) commented Dec 11, 2023

I have renamed this PR to reflect its primary purpose. It is basically ready, and it works with nf-core/rnaseq without any changes. I may add a few more small changes, and I obviously need to update the tests, but I'd like to reach consensus on this PR first.

To facilitate the review of this PR, here is a summary of the essential and non-essential (but still merge-worthy) changes:

Essential

  • Refactor runtime classes to be independent of the DSL
    • lays the groundwork for writing Nextflow pipelines in Python
  • Separate process inputs/outputs from env/file/stdin/stdout declarations in the runtime (i.e. DSL is unchanged)
    • lays the groundwork for supporting static types in the DSL

Non-essential

  • Move process input channel merge logic to CombineManyOp
  • Refactor TaskProcessor to accept a single merged input channel
  • Move task output collect logic to Task*Collector classes
  • Move helper classes for TaskConfig into generic LazyHelpers module for lazy binding

While I rebuilt many classes from scratch, many of them ended up aligning nicely with existing classes. Here is a rough mapping:

  • ProcessConfig -> much logic moved to ProcessBuilder, ProcessConfigBuilder, ProcessDsl
  • InputsList / OutputsList -> ProcessInputs / ProcessOutputs
  • BaseInParam / BaseOutParam -> ProcessInput / ProcessOutput
  • FileInParam / FileOutParam -> ProcessFileInput / ProcessFileOutput
  • TaskProcessor -> some logic moved to CombineManyOp and Task*Collector classes
  • TaskConfig -> some logic moved to LazyHelper

I am happy to carve out pieces of this PR and merge them separately if that would make things easier; it was just simpler to do it all at once in order to validate the overall approach.

@bentsherman (Member, Author)

Current proposed syntax for static types:

  • inputs are just method parameters
    • use directives to stage env, files (stage name), stdin
    • input paths are automatically detected and staged, including nested
  • outputs are just variable declarations with assignment
  • use AST xform (eventually custom parser) to translate DSL syntax to runtime API calls
  • new syntax supports record types
  • previous syntax can be kept as a shorthand, ease the migration
// shorthand for @ValueObject class Sample { ... }
// should replace @ValueObject in custom parser
record Sample {
  String id
  List<Path> files
}

process FOO {
  // stageAs only needed if staging as different name
  env('SAMPLE_ID') { my_tuple[0] }
  stageAs('file1.txt') { my_file }
  stdin { my_stdin }
  stageAs('file2.txt') { my_tuple[1] }

  input:
  // additional behavior provided by directives
  // can support named args, default value in the future
  int my_val
  Path my_file
  String my_stdin
  List my_tuple // can be accessed in body via my_tuple[0], etc
  Sample my_record // custom record type!

  // previous syntax equivalent
  // doesn't require extra directives for env, stdin, files
  // can't be used for record types though
  val my_val
  path 'file1.txt'
  stdin /* my_stdin */
  tuple env(SAMPLE_ID), path('file2.txt')

  output:
  // r-value will be wrapped in closure by AST xform
  // r-value can be anything! even a function defined elsewhere!
  // closure delegate provides env(), stdout(), path() to unstage from task environment
  int my_val // must be assigned in body if no assignment here
  Path my_file = path('file1.txt') // maybe could be called file() like the script function?
  String my_stdout = stdout()
  List my_tuple = tuple( env('SAMPLE_ID'), path('file2.txt') )
  Sample my_record = new Sample( env('SAMPLE_ID'), path('file2.txt') )

  // previous syntax equivalent
  // can't be used for record types though
  val my_val
  path 'file1.txt'
  stdout /* my_stdout */
  tuple env(SAMPLE_ID), path('file2.txt')

  script:
  // ...
}

@bentsherman (Member, Author)

Side note regarding maps. This PR will enable you to use maps instead of tuples or record types, but it's not as convenient. Because Nextflow doesn't know which map values are files, it can't automatically stage files like with tuples and records, so you'd have to use the stageAs directive to declare any file inputs:

process foo {
  stageAs { sample.files }

  input:
  Map sample // [ id: String, files: List<Path> ] (but Nextflow doesn't know this)

  script:
  // ...
}

IMO it's much better to use records anyway because of the explicit typing, and you could still have a meta-map in the record if you need to have arbitrary key-value pairs.
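Using the record shorthand from earlier, such a record with a meta-map might look like this (a sketch, not final syntax):

```groovy
// hypothetical: typed record with a free-form meta-map
record Sample {
  Map meta          // arbitrary key-value pairs, e.g. [id: 'S1', single_end: false]
  List<Path> files  // typed paths that Nextflow can detect and stage
}
```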

@stevekm (Contributor) commented May 24, 2024

This PR looks really cool, but I had some questions.

Is the "record" type something new in this PR, or is it something we can already use? It's not entirely clear which aspects described here are new in this PR vs. illustrating currently available features.

> This PR will enable you to use maps instead of tuples or record types, but it's not as convenient. Because Nextflow doesn't know which map values are files

> [ id: String, files: List<Path> ] (but Nextflow doesn't know this)

Naive question, but can Nextflow just iterate through the map values, detect objects of type Path (e.g. [ "some_file": file("foo.txt") ]), and stage them appropriately? Is it using a different method to detect Path attributes of a record object for staging?

@bentsherman (Member, Author)

Record types are sort of already supported:

@ValueObject
class Sample {
  Map meta
  List<Path> reads
}

But Nextflow doesn't know how to stage files from a record type. You have to use tuples for this so that you can say exactly where the files are in the tuple using the path qualifier.

Right now, this feature will most likely be folded into DSL3, which we are still discussing but which will focus on better static type checking.

And in one of my experiments with a potential DSL3 syntax (see nf-core/fetchngs#309), I found that oftentimes you don't even need record types in the process definition. In that proposal, you call a process within an operator with individual values, rather than with channels, which gives you more control over how to pass arguments. Take this example:

ftp_samples |> map { meta ->
    def fastq = [ file(meta.fastq_1), file(meta.fastq_2) ]
    SRA_FASTQ_FTP ( meta, fastq, sra_fastq_ftp_args )
}

I don't really need to pass a record type here when I could just pass the individual values directly. I might still pass around records at the workflow level, just to keep related things bundled together, but when calling a specific process, I could just pass in the values that the process needs. So I think this syntax, if it is accepted by the team, will reduce the need for tuples and records and such in the process definition.

@hanslovp-gne

Thank you for exploring static types for inputs/outputs. This is something I am very excited about!

I have a question regarding static typing of tuples: we are currently using tuple (instead of each) to facilitate parameter sweeps with channels, e.g. something like this:

input:
tuple path(embedding), val(numComponents), val(classColumn)

where the process would be called with a channel combined from 3 channels:

combinedChannel = embeddingChannel.combine(numComponentsChannel).combine(classColumnChannel)

With the new proposed syntax, would we be able to add type annotations to each element of a tuple? I see

List my_tuple

above, but I don't know if there is a way to specify the exact type of each element in my_tuple. Or would we need to define a record with the appropriate members instead, and map the combinedChannel into a channel of such records?

@bentsherman (Member, Author)

Hi @hanslovp-gne, the proposed syntax has evolved a bit since this PR, but I think you would still be able to do parameter sweeps in a nice way.

The process inputs would be defined like:

input:
embedding: Path
num_components: Integer
class_column: String
...

Then you would do some kind of cross / combine as before to build up a channel of maps, where each map contains the above inputs as keys:

ch_inputs = ch_embeddings
  .cross(ch_num_components)
  .cross(ch_class_column)
  .map { embedding, num_components, class_column ->
    [ embedding: embedding, num_components: num_components, class_column: class_column ]
  }

EVALUATE( ch_inputs )

I was originally skeptical about using maps, but used this way I think we can make them work nicely. Basically, treat the process inputs as one big map with a different type for each key.

@hanslovp-gne

Thank you @bentsherman, that looks awesome!

@bentsherman mentioned this pull request Aug 27, 2025
@bentsherman (Member, Author)

Closing in favor of #6368

@bentsherman bentsherman deleted the ben-programmatic-api branch August 27, 2025 17:59
Successfully merging this pull request may close these issues: Using custom objects with paths