Skip to content

Conversation

DtDamianGrzes
Copy link
Contributor

Fixed callbacks' first iteration bug with scheduler refactor to use APScheduler.
While analyzing the bug I realized we could simplify the scheduler implementation and use APScheduler, which handles ThreadPoolExecutors on its own. Scheduling jobs in intervals allowed to remove iterative scheduling of each consecutive callback in their implementation.

Bug description:
The _callback_iteration method performs the following tasks:

Submits the callback for execution to the ThreadPoolExecutor

Schedules the next iteration of the callback

Both tasks run in parallel in separate threads.

During the scheduling of the next iteration, the method relies on a start_timestamp, which is initially set when the callback instance is created. On the first execution of the callback, this timestamp is updated to the current time.

Since callback execution and iteration scheduling occur in parallel, there is a possibility of reading the original start_timestamp (set during instance creation) instead of its updated value after the first execution.

If the difference between the initial start_timestamp and the current time exceeds the configured interval between callback iterations, the next iteration might be scheduled in the past. This situation causes the callback method to execute immediately, which is unintended.

@DtDamianGrzes DtDamianGrzes requested review from Copilot and afurche July 25, 2025 18:22
Copilot

This comment was marked as outdated.

if not self._is_fastcheck:
try:
self._heartbeat_iteration()
# TODO: is it surely okay to shedule hearbeat this way? Originaly it was scheduled in the very same scheduler, which would starve heartbeat if any callback took too long
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some doubts about this part. I feel like the heartbeat should have the lowest priority, but it doesn't make much sense if most callbacks have been run in separate threads.
For now and for me it seems to be okayish implementation to the previous one, but I'm curious about your opinion.

@DtDamianGrzes DtDamianGrzes force-pushed the feature/scheduler_refactor_and_callbacks_fix branch from 3e822d9 to 41dea63 Compare July 25, 2025 18:27
@dynatrace-extensions dynatrace-extensions deleted a comment from Copilot AI Jul 25, 2025
@DtDamianGrzes DtDamianGrzes requested a review from Copilot July 25, 2025 18:28
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the scheduler implementation to use APScheduler instead of the standard library's sched module, which also fixes a race condition bug where callbacks were being scheduled for execution based on a potentially stale timestamp during the first iteration.

  • Replaced custom scheduler with APScheduler BackgroundScheduler and IntervalTrigger
  • Fixed race condition in callback first iteration by removing manual timestamp tracking
  • Updated tests to use the new scheduler API and removed threading test dependencies

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
dynatrace_extension/sdk/extension.py Core scheduler refactor replacing sched with APScheduler and removing manual timestamp management
dynatrace_extension/sdk/callback.py Removed manual iteration tracking and timestamp calculation methods
dynatrace_extension/sdk/status.py Added mockable datetime function for testing
tests/sdk/test_extension.py Updated tests to use new scheduler API and removed threading dependencies
tests/sdk/test_status.py Updated test setup/teardown and scheduler calls for new API
tests/sdk/test_endpoints_sfm.py Replaced freezegun with manual datetime mocking and updated scheduler usage
pyproject.toml Removed freezegun dependency
Comments suppressed due to low confidence (3)

tests/sdk/test_extension.py:138

  • This comment indicates the test expectation changed due to implementation details. Consider if this test is still validating the intended behavior or if it needs adjustment.
        # call_count expected to be 2 because scheduled `query` method used the original, nonmocked _run_callback

tests/sdk/test_extension.py:236

  • The test logic has changed significantly from the original version. Verify that this still properly tests the callback scheduling behavior it was intended to validate.
        self.assertEqual(extension._scheduled_callbacks[2].executions_total, 1)

@DtDamianGrzes DtDamianGrzes force-pushed the feature/scheduler_refactor_and_callbacks_fix branch from 41dea63 to 4560f98 Compare July 25, 2025 18:32
@dynatrace-extensions dynatrace-extensions deleted a comment from Copilot AI Jul 25, 2025
@dynatrace-extensions dynatrace-extensions deleted a comment from Copilot AI Jul 25, 2025
@dynatrace-extensions dynatrace-extensions deleted a comment from Copilot AI Jul 25, 2025
@dlopes7
Copy link
Member

dlopes7 commented Aug 11, 2025

I am concerned about this, it is a major rewrite of the scheduling for very little benefit.

It also introduces a dependency, we have agreed that the SDK should not have any dependencies, this can introduce conflicts for extension dependencies.

@DtDamianGrzes DtDamianGrzes force-pushed the feature/scheduler_refactor_and_callbacks_fix branch from b359f8a to 3bcd629 Compare August 11, 2025 15:36
@DtDamianGrzes
Copy link
Contributor Author

DtDamianGrzes commented Aug 12, 2025

Hey @dlopes7 , thanks for the comment, I partly understand your doubts. Let's split those into to separate discussions maybe.

I am concerned about this, it is a major rewrite of the scheduling for very little benefit.
I agree that's not a little change, but I also believe that, looking closely, it's a simplification of what we already did. sched doesn't provide a way to schedule callbacks in intervals. We do that on our own, additionally manually implementing queuing, holding and calculating "next timestamps", delegating jobs to their executors (which required additional level of callbacks) and so on. Altogether I saw that as something a bit overcomplicated and possibly prone to bugs (as in the described example). While reading the current version it seemed to me that's quite a lot of manual implementation for something rather simple and I expected there should be some module that provides that in a ready-to-use fashion. APScheduler seems to do exactly what we did manually and is quite commonly used.
Regarding the risk of this change, I'd say that if any issues occurred after merging this commit I'd opt for reverting it and trying to fix the issues. That would require rebuilding of extensions though, so I'm not sure if it's a good approach.

Let me know how all of that resonates with you. I'm not going to fight for this PR just for a sake of what's been already implemented.

It also introduces a dependency, we have agreed that the SDK should not have any dependencies, this can introduce conflicts for extension dependencies.

That's for sure something to consider, but I'm not really familiar with discussions you had. Could you shed some more light on the topic?

Please add anyone you want from DXS if you'd like involve someone into the discussion.

@dlopes7
Copy link
Member

dlopes7 commented Aug 12, 2025

We had discussions with @PiotrMeller and @MaciejGrynda at the beginning of the SDK development. What happened in EF1 is that the dependencies that were included with the RemotePluginModule eventually caused conflicts with extension dependencies.

That is one of the reasons why we don't use "requests" for instance to make HTTP calls, that would not only increase the bundle size but also cause people to either NOT declare requests as a dependency of their extension, or be unable to pin a different version of requests, or even worse break the SDK because they introduce a specific version of requests.

This dependency (APScheduler) is not as wide used as requests, but we are breaking an interesting rule of having no dependencies on the SDK to fix a minor inconvenience (the timestamp race condition), which I believe could be fixed directly in the code.

And although APSCheduler fixes some things we do manually it also does a ton of other things we will never use, seems like an overkill for a simple issue

@DtDamianGrzes
Copy link
Contributor Author

DtDamianGrzes commented Aug 12, 2025

@dlopes7 , thanks for the answer. I agree with this reasoning and will close the PR after sync with team.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants