Move PeriodicStream into periodicsequence. #35412
Conversation
Assigning reviewers: R: @jrmccluskey for label python.
Note: If you would like to opt out of this review, comment with one of the bot's available commands. The PR bot will only process comments in the main thread (not review comments).
I proposed some changes that I think we should do regardless, but if we do them it also makes it really clean to consolidate under PeriodicImpulse (we'd just be adding the `data` parameter).
The output mode of the DoFn is based on the input `data`:

- **None**: If `data` is None (by default), the output element will be the
Can we still output a TimestampedValue here for consistency? Then we could remove
`'MapToTimestamped' >> beam.Map(lambda tt: TimestampedValue(tt, tt))`
from PeriodicImpulse.
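A minimal sketch of that suggestion, assuming the process method already computes a per-element `output_ts` (the helper name is illustrative, not the actual code in this PR):

```python
from apache_beam.transforms.window import TimestampedValue
from apache_beam.utils.timestamp import Timestamp


def _emit_timestamped(output_ts):
  # Yield the element already wrapped as TimestampedValue(value, timestamp),
  # so the separate 'MapToTimestamped' step would no longer be needed.
  ts = output_ts if isinstance(output_ts, Timestamp) else Timestamp.of(output_ts)
  yield TimestampedValue(ts, ts)
```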
Sure. But notice that this will introduce a breaking change to the pipeline DAG.
This is a good point. You could get around this by moving most of this logic into the map transform instead, but I don't think this is necessary if you don't think it is cleaner (I won't block on this).
As we discussed offline, the logic of using pre-timestamped values has to be put inside `ImpulseSeqGenDoFn` to ensure the watermark estimate is consistent with the event times. I updated the code to run this map step when `data` is not specified, just to keep the DAG compatible, but I am open to any other suggestions.
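For context, a rough sketch of that compatibility approach; the wiring and constructor handling below are simplified assumptions, not the actual diff:

```python
import apache_beam as beam
from apache_beam.transforms.periodicsequence import ImpulseSeqGenDoFn
from apache_beam.transforms.window import TimestampedValue


class _PeriodicImpulseSketch(beam.PTransform):
  """Illustrative only: shows where the conditional map step would live."""
  def __init__(self, start, stop, interval, data=None):
    self.start, self.stop, self.interval, self.data = start, stop, interval, data

  def expand(self, pbegin):
    result = (
        pbegin
        | 'ImpulseElement' >> beam.Create([(self.start, self.stop, self.interval)])
        | 'GenSequence' >> beam.ParDo(ImpulseSeqGenDoFn()))
    if self.data is None:
      # Only the no-data path keeps the historical 'MapToTimestamped' node,
      # so existing pipeline DAGs stay unchanged; with `data`, the DoFn
      # already emits timestamped values.
      result = result | 'MapToTimestamped' >> beam.Map(
          lambda tt: TimestampedValue(tt, tt))
    return result
```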
current_watermark = watermark_estimator.current_watermark()
if current_watermark is None or output_ts > current_watermark:
  # ensure watermark is monotonic
  watermark_estimator.set_watermark(output_ts)
I think this watermark estimation won't work if the data has provided event times.
For example, imagine we have the following data:
(data=foo, event_time=10)
(data=foo, event_time=15)
(data=foo, event_time=1)
When we see the first element, we'd increment the watermark to 10. But this would immediately mean that the 3rd element is late.
The problem is even more severe when you consider repeating data. If you have repeating data, the only valid watermark you can advance to is `min(all_event_times)`.
I'd propose the following changes:
- If you repeat data, that data cannot have event times associated with it (otherwise watermark estimation is basically impossible). This would be a validation step at transform construction.
- If you don't repeat data but there are associated event times, then when emitting `data[i]`, set the watermark to `min(data[i+1:].map(lambda d: d.event_time))`. That will ensure a valid watermark (see the sketch after this comment).
Note that this would allow you to move the logic into the map transform if you want to, resolving the problems described in https://github.com/apache/beam/pull/35412/files#r2164781118
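A small sketch of the two proposed rules, assuming `data` arrives as `(event_time, value)` tuples (the helper names are made up for illustration):

```python
from apache_beam.utils.timestamp import MAX_TIMESTAMP


def _validate_data(data, repeat):
  # Construction-time check: repeated data must not carry its own event
  # times, otherwise no watermark could ever safely advance.
  if repeat and any(isinstance(d, tuple) for d in data):
    raise ValueError('Pre-timestamped elements cannot be combined with repeat.')


def _watermark_after(data, i):
  # After emitting data[i], the watermark may only advance to the smallest
  # event time still to be emitted; anything higher would make later
  # elements late.
  remaining = [event_time for event_time, _ in data[i + 1:]]
  return min(remaining) if remaining else MAX_TIMESTAMP
```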
I consider this a feature that lets users specify late data.
As mentioned in another thread, I think the event-time setting logic has to be in the DoFn so that the event time in the elements is consistent with the watermark estimate.
Thanks - this generally lgtm, remaining comments are minor. Thanks for being patient/flexible here
'''
:param start_timestamp: Timestamp for first element.
:param stop_timestamp: Timestamp after which no elements will be output.
:param fire_interval: Interval in seconds at which to output elements.
:param apply_windowing: Whether each element should be assigned to
  individual window. If false, all elements will reside in global window.
:param data: The sequence of elements to emit into the PCollection.
  The elements can be raw values or pre-timestamped tuples in the format
  `(apache_beam.utils.timestamp.Timestamp, value)`.
Could you describe the watermark semantics of pre-timestamped data here please?
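For illustration, a hypothetical usage of the new `data` parameter with pre-timestamped tuples; the exact watermark behaviour for out-of-order timestamps is whatever the discussion above settles on:

```python
import apache_beam as beam
from apache_beam.transforms.periodicsequence import PeriodicImpulse
from apache_beam.utils.timestamp import Timestamp

# Hypothetical example: each (Timestamp, value) tuple sets the element's
# event time; see the watermark discussion above for out-of-order timestamps.
data = [
    (Timestamp(10), 'a'),
    (Timestamp(15), 'b'),
    (Timestamp(1), 'c'),
]

with beam.Pipeline() as p:
  _ = (
      p
      | PeriodicImpulse(fire_interval=1, data=data)
      | beam.Map(print))
```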
start_timestamp: Union[Timestamp, float] = Timestamp.now(),
stop_timestamp: Union[Timestamp, float] = MAX_TIMESTAMP,
I'm fine with allowing floats here, but could you explain what float means in the pydoc? Alternatively, we can just keep Timestamp as the type.
I see `ImpulseSeqGenDoFn` accepts floats as its input elements:

beam/sdks/python/apache_beam/transforms/periodicsequence.py (lines 109 to 111 in 95b28b1):

start, _, interval = element
if isinstance(start, Timestamp):
I don't have a strong preference on this, so let's keep Timestamp as the type then.
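For reference, a float here would just be seconds since the Unix epoch; a tiny sketch of the normalization the snippet above implies (the exact conversion inside the DoFn is an assumption):

```python
from apache_beam.utils.timestamp import Timestamp

start = 1700000000.5  # float: seconds since the Unix epoch
if not isinstance(start, Timestamp):
  # Timestamp.of accepts int/float seconds (or an existing Timestamp).
  start = Timestamp.of(start)
```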
Thanks, this LGTM
Thanks! Could you take a final look at the docstring changes I made?
Thanks - this looks good to me
The lint problem (https://github.com/apache/beam/actions/runs/15926733133/job/44925603939?pr=35412) seems to be related to the Vertex AI change. @claudevdm could you take a look?
I have #35463 - it should be safe to ignore for this PR though.
Sounds good. Thanks!
A follow-up PR of #35300 to address the concern at #35300 (comment)