Static types for process inputs/outputs #4553

Conversation
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Now that I have developed the builder classes for process inputs and outputs and refactored the `TaskProcessor` accordingly, I think it is possible to bring static types to the DSL. The key insight is to decouple the staging and unstaging of files/envs/stdin/stdout from the actual input and output declarations. Doing this has allowed me to greatly simplify the runtime code, and as a bonus it allows you to use arbitrary types. In its raw form, it would look like this:

```groovy
process FOO {
    input:
    take 'sample_id'
    take 'files'
    env('SAMPLE_ID') { sample_id }
    path { files }

    output:
    env 'SAMPLE_ID'
    path '$file1', 'file.txt', arity: '1'
    emit { sample_id }
    emit { stdout }
    emit { [env('SAMPLE_ID'), path('$file1')] }
    emit { new Sample(sample_id, path('$file1')) }
}
```

This is a bit verbose, but the output envs and files need to be declared immediately so that Nextflow can unstage them in the task wrapper script (whereas the emits aren't evaluated until after the task has completed). But, as you can see, it allows you to take and emit whatever types you want.

I think we can recover the existing DSL syntax on top of this API with some AST transforms and/or wrapper methods, but I still need to try this. So something in the current DSL:

```groovy
process FOO {
    input:
    val my_val
    env SAMPLE_ID
    path 'file1.txt'
    tuple val(sample_id), path('file2.txt')

    output:
    val my_val
    env SAMPLE_ID
    path 'file1.txt'
    tuple val(sample_id), path('file2.txt')
}
```

Should be automatically translated to:

```groovy
process FOO {
    input {
        take 'my_val' // $in0
        take '$in1'
        take '$in2'
        take '$in3'
        env('SAMPLE_ID') { $in1 }
        path('file1.txt') { $in2 }
        var('sample_id') { $in3[0] }
        path('file2.txt') { $in3[1] }
    }
    output {
        env('SAMPLE_ID')
        path('$file0', 'file1.txt')
        path('$file1', 'file2.txt')
        emit { my_val }
        emit { env('SAMPLE_ID') }
        emit { path('$file0') }
        emit { [sample_id, path('$file1')] }
    }
}
```

Another option might be to evaluate the emits before task execution and generate the outputs ahead of time, since the input vars are already defined, but calling …
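The deferred-emit idea can be illustrated with a small, self-contained Groovy sketch. This is not the actual Nextflow implementation — `EmitContext`, `resolveEmit`, and the field names are invented here — it only shows how a closure delegate could supply `env()` and `path()` once the task results are known:

```groovy
import java.nio.file.Path
import java.nio.file.Paths

// Hypothetical context exposing a completed task's results to emit closures
class EmitContext {
    Map<String, String> taskEnv   // env values captured from the task
    Map<String, Path> taskFiles   // unstaged output files, keyed by token

    String env(String name) { taskEnv[name] }
    Path path(String token) { taskFiles[token] }
}

// Evaluate an emit closure with the context as its delegate, so env() and
// path() inside the closure resolve against the completed task's results
def resolveEmit(Closure emit, EmitContext ctx) {
    def c = (Closure) emit.clone()
    c.delegate = ctx
    c.resolveStrategy = Closure.DELEGATE_FIRST
    c.call()
}

// usage: the closure is declared at definition time but evaluated late
def emitDecl = { [env('SAMPLE_ID'), path('$file1')] }
def ctx = new EmitContext(
    taskEnv: [SAMPLE_ID: 'sample1'],
    taskFiles: ['$file1': Paths.get('file.txt')])
assert resolveEmit(emitDecl, ctx) == ['sample1', Paths.get('file.txt')]
```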
Putting those speculations aside, I have refactored the existing DSL to use the new builders, establishing a clear boundary between the DSL and runtime. I have not added any new features to the DSL, but this PR lays the groundwork for future enhancements. If we want to support static types in the DSL, I think there is a range of options:

Note that if we add an alternative interface like the annotations, (3) is the obvious choice, because users can fall back to the more verbose programmatic syntax if they need to do something that the DSL doesn't support.
I have renamed this PR to reflect its primary purpose. It is basically ready, and it works with nf-core/rnaseq without any changes. I may add a few more small changes and obviously need to update the tests, but I'd like to reach consensus on this PR first. To help facilitate the review, here is a summary of the essential and non-essential (but still merge-worthy) changes:

Essential

Non-essential

While I did build many new classes from scratch, many of them ended up aligning nicely with existing classes; here is a rough mapping:

I am happy to carve out pieces of this PR and merge them separately if that would make things easier; it was just simpler to do it all at once in order to validate the overall approach.
Current proposed syntax for static types:

```groovy
// shorthand for @ValueObject class Sample { ... }
// should replace @ValueObject in custom parser
record Sample {
    String id
    List<Path> files
}

process FOO {
    // stageAs only needed if staging as different name
    env('SAMPLE_ID') { my_tuple[0] }
    stageAs('file1.txt') { my_file }
    stdin { my_stdin }
    stageAs('file2.txt') { my_tuple[1] }

    input:
    // additional behavior provided by directives
    // can support named args, default value in the future
    int my_val
    Path my_file
    String my_stdin
    List my_tuple // can be accessed in body via my_tuple[0], etc
    Sample my_record // custom record type!

    // previous syntax equivalent
    // doesn't require extra directives for env, stdin, files
    // can't be used for record types though
    val my_val
    path 'file1.txt'
    stdin /* my_stdin */
    tuple env(SAMPLE_ID), path('file2.txt')

    output:
    // r-value will be wrapped in closure by AST xform
    // r-value can be anything! even a function defined elsewhere!
    // closure delegate provides env(), stdout(), path() to unstage from task environment
    int my_val // must be assigned in body if no assignment here
    Path my_file = path('file1.txt') // maybe could be called file() like the script function?
    String my_stdout = stdout()
    List my_tuple = tuple( env('SAMPLE_ID'), path('file2.txt') )
    Sample my_record = new Sample( env('SAMPLE_ID'), path('file2.txt') )

    // previous syntax equivalent
    // can't be used for record types though
    val my_val
    path 'file1.txt'
    stdout /* my_stdout */
    tuple env(SAMPLE_ID), path('file2.txt')

    script:
    // ...
}
```
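For illustration, a hypothetical workflow invocation of a typed process might look like this. The channel name, sample values, and the assumption of a process taking a single `Sample` input (with a positional record constructor) are all invented for this sketch, not part of the PR:

```groovy
workflow {
    // each channel element is a Sample record, matching a typed Sample input
    ch_samples = Channel.of(
        new Sample('sampleA', [file('a_1.fq'), file('a_2.fq')]),
        new Sample('sampleB', [file('b_1.fq'), file('b_2.fq')])
    )
    FOO( ch_samples )
}
```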
Side note regarding maps. This PR will enable you to use maps instead of tuples or record types, but it's not as convenient. Because Nextflow doesn't know which map values are files, it can't automatically stage files as it does with tuples and records, so you'd have to use the `stageAs` directive:

```groovy
process foo {
    stageAs { sample.files }

    input:
    Map sample // [ id: String, files: List<Path> ] (but Nextflow doesn't know this)

    script:
    // ...
}
```

IMO it's much better to use records anyway because of the explicit typing, and you could still have a meta-map in the record if you need arbitrary key-value pairs.
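As a sketch of that record-plus-meta-map pattern, using the `record` shorthand proposed earlier in this thread (field names are illustrative):

```groovy
// meta holds arbitrary key-value pairs; files is explicitly typed,
// so a typed runtime could stage it automatically
record Sample {
    Map meta          // e.g. [ id: 'sampleA', strandedness: 'auto' ]
    List<Path> files
}
```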
This PR looks really cool, but I had some questions. Is the "record" type something new to this PR, or is it something that we can already use? It's not entirely clear which aspects described here are new in this PR vs. illustrating currently available methods.

Naive question, but can Nextflow just iterate through the map values and detect objects of type `Path`?
Record types are sort of already supported:

```groovy
@ValueObject
class Sample {
    Map meta
    List<Path> reads
}
```

But Nextflow doesn't know how to stage files from a record type. You have to use tuples for this, so that you can say exactly where the files are in the tuple using the `path` qualifier.

Right now, this feature will most likely be folded into DSL3, which we are still discussing but which will focus on better static type checking. And in one of my experiments with a potential DSL3 syntax (see nf-core/fetchngs#309), I found that oftentimes you don't even need record types in the process definition. In that proposal, you call a process within an operator with individual values, rather than with channels, which gives you more control over how to pass arguments. Take this example:

```groovy
ftp_samples |> map { meta ->
    def fastq = [ file(meta.fastq_1), file(meta.fastq_2) ]
    SRA_FASTQ_FTP ( meta, fastq, sra_fastq_ftp_args )
}
```

I don't really need to pass a record type here when I could just pass the individual values directly. I might still pass around records at the workflow level, just to keep related things bundled together, but when calling a specific process, I could just pass in the values that the process needs. So I think this syntax, if it is accepted by the team, will reduce the need for tuples and records in the process definition.
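On the earlier question of detecting `Path` values inside a map: conceptually, a staging pass could walk a value recursively and collect anything that is a `Path`. A minimal Groovy sketch of that idea — `collectPaths` is an invented helper, not how Nextflow is implemented today:

```groovy
import java.nio.file.Path
import java.nio.file.Paths

// Recursively collect Path objects from an arbitrary value,
// as a hypothetical staging pass over a map could do
def collectPaths(value) {
    if (value instanceof Path)
        return [value]
    if (value instanceof Map)
        return value.values().collectMany { collectPaths(it) }
    if (value instanceof Collection)
        return value.collectMany { collectPaths(it) }
    return []
}

def sample = [ id: 'sampleA', files: [Paths.get('a_1.fq'), Paths.get('a_2.fq')] ]
assert collectPaths(sample).size() == 2
```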
Thank you for exploring static types for inputs/outputs. This is something I am very excited about! I have a question regarding static typing of tuples: we are currently using

where the process would be called with a channel combined from 3 channels:

With the new proposed syntax, would we be able to add type annotations to each element of a tuple? I see

above, but I don't know if there is a way to specify the exact types of each element in
Hi @hanslovp-gne, the proposed syntax has evolved a bit since this PR, but I think you would be able to do parameter sweeps in a nice way. The process inputs would be defined like:

```groovy
input:
embedding: Path
num_components: Integer
class_column: String
// ...
```

Then you would do some kind of cross / combine as before to build up a channel of maps, where each map contains the above inputs as keys:

```groovy
ch_inputs = ch_embeddings
    .cross(ch_num_components)
    .cross(ch_class_column)
    .map { embedding, num_components, class_column ->
        [ embedding: embedding, num_components: num_components, class_column: class_column ]
    }

EVALUATE( ch_inputs )
```

I was originally skeptical about using maps, but used this way I think we can make them work nicely. Basically, treat the process inputs as one big map with a different type for each key.
Thank you @bentsherman, that looks awesome!
Closing in favor of #6368 |
This PR is an exploration intended to add support for static types for process inputs and outputs.
TODO: