
Stabilising video depth #6


Open
calledit opened this issue Apr 9, 2025 · 6 comments

Comments

@calledit

calledit commented Apr 9, 2025

Hi! To start, really great work with UniK3D!

I am trying to generate depth video and it works great, but I have issues with frame-to-frame stability. There is quite a lot of variation (flicker) between frames even when the real depth is more or less static. The issue is also visible in the UniK3D README banner.

Would it be possible to add some way to stabilise the output in the next release of the model?

One way could be to implement the option of having a low-resolution depth map and a mask as an extra input on top of the RGB input; 32x32 resolution would probably be plenty, and even 16x16 might be enough.

For video you could then take the depth output of the last frame and feed it in as a prompt together with a mask for dynamic areas (a mask for the parts of the input depth map that should be ignored due to unknown movements).

If the camera rotates you could then use basic external camera tracking to cancel out the rotation in the depth prompt and mask, so the model would not need to care about that type of stuff.

This would also work great for a use case where you know the distance to a single thing in the frame. You could mask everything except that thing, specify its depth, and the network would be able to work off that known depth. Say you knew the middle pixel was 25.4 meters away; such a feature would let you help the model out.
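
(A minimal sketch of the prompt format described above, assuming PyTorch tensors; the function, shapes and channel layout are hypothetical and not UniK3D's actual interface.)

```python
import torch
import torch.nn.functional as F

def build_prompted_input(rgb, prev_depth=None, valid_mask=None, prompt_res=32):
    """rgb: (B, 3, H, W); prev_depth and valid_mask: (B, 1, H, W) or None."""
    B, _, H, W = rgb.shape
    if prev_depth is None:
        # No prompt available (e.g. the first frame): feed zeros and an all-zero mask.
        depth_prompt = rgb.new_zeros(B, 1, H, W)
        mask_prompt = rgb.new_zeros(B, 1, H, W)
    else:
        # Low-resolution prompt: 32x32 (or even 16x16) to anchor the scale.
        d_small = F.interpolate(prev_depth, size=(prompt_res, prompt_res),
                                mode="bilinear", align_corners=False)
        m_small = F.interpolate(valid_mask.float(), size=(prompt_res, prompt_res),
                                mode="nearest")
        depth_prompt = F.interpolate(d_small, size=(H, W), mode="bilinear",
                                     align_corners=False)
        mask_prompt = F.interpolate(m_small, size=(H, W), mode="nearest")
    # 5-channel input: RGB + depth prompt + mask. The network (or at least the
    # decoder) would need retraining to accept the extra channels.
    return torch.cat([rgb, depth_prompt, mask_prompt], dim=1)
```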

Does this sound like a reasonable idea? I understand this will require retraining the network (or at least the decoder), unless one wants to add the depth input into the encoder too.

@lpiccinelli-eth
Owner

Hey, thanks for your explanation and suggestion!
What you suggest makes total sense, and we have an ongoing project targeting exactly this use case, e.g. with previous depth and/or flow as additional information to stabilise depth 😅, and it actually stabilises depth really well.
I will try to release it soon and add the news once it is released!

@Dr0mp

Dr0mp commented Apr 28, 2025

Wouldn't an average between the past "frame" and the current "frame" have a stabilising effect on the points' positions, maybe with 1 to 4 buffered frames?

@calledit
Author

calledit commented Apr 28, 2025

Wouldn't an average between the past "frame" and the current "frame" have a stabilising effect on the points' positions, maybe with 1 to 4 buffered frames?

Sure it would, but it would only be usable if your scene is completely static,
or if you know which parts of the frame are static and mask everything that is not static. But that would still leave the dynamic parts of the image "unstable".

The point of using the model for this is that it will be able to stabilise the dynamic parts of the frame using "knowledge" about distances in the static parts of the frame.

Static meaning "background" or specifically things that don't move in the video.

Dynamic meaning stuff that does move in the video, like people, cars, animals and so on.
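
(For reference, a minimal sketch of the buffered-averaging idea with a static-region mask, assuming per-frame depth maps as NumPy arrays; the class and its interface are hypothetical, not part of UniK3D or the toolbox.)

```python
from collections import deque
import numpy as np

class DepthSmoother:
    """Average the last few depth frames, but only where the scene is static."""

    def __init__(self, n_frames=4):
        self.buffer = deque(maxlen=n_frames)  # 1 to 4 buffered frames

    def __call__(self, depth, static_mask):
        """depth: (H, W) float array; static_mask: (H, W) bool, True where static."""
        self.buffer.append(depth)
        averaged = np.mean(np.stack(self.buffer), axis=0)
        # Static regions get the temporal average; dynamic regions keep the
        # current frame, so they stay responsive but remain "unstable".
        return np.where(static_mask, averaged, depth)
```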

@Dr0mp

Dr0mp commented Apr 28, 2025

This would assume a static camera?

Oh, I think I understand your approach: mask the dynamic parts (based on a noise threshold to avoid false positives) and add them over the static, so the static remains fixed while the dynamic is continuously updated. Did I get that right?

And for pivoting the camera we would need some external tracking data (either generated or from camera sensors)?

While my approach would just settle down the jiggling overall, I think these could work together.
I am curious, because I work more with visual coding than Python, and these techniques, as long as they are coded in pixels, can be done on the fly in TouchDesigner at a prototyping level.

@calledit
Author

Oh, I think I understand your approach: mask the dynamic parts (based on a noise threshold to avoid false positives) and add them over the static, so the static remains fixed while the dynamic is continuously updated. Did I get that right?

Kind of like that. You would need to track the camera movement and account for its movement by subtracting it from the static depth. But tracking the camera is easy and can be done pretty much flawlessly these days, using some AI-powered tracker like mega-sam.
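
(A rough illustration of that rotation compensation, not code from the thread: assuming you have the relative rotation R from an external tracker, the camera intrinsics K, and the previous frame's per-ray distance map, a pure rotation leaves each ray's distance unchanged, so the prompt can be warped with the homography K R K^-1. The function and conventions below are assumptions for illustration.)

```python
import cv2
import numpy as np

def warp_distance_prompt(prev_distance, K, R):
    """prev_distance: (H, W) per-ray distance map from the previous frame.
    K: (3, 3) camera intrinsics.
    R: (3, 3) rotation taking previous-camera coordinates to current-camera coordinates."""
    h, w = prev_distance.shape
    # Pixels move under pure rotation according to the homography K @ R @ K^-1;
    # the distance values themselves are unchanged, so we only re-sample them.
    H = (K @ R @ np.linalg.inv(K)).astype(np.float64)
    warped = cv2.warpPerspective(prev_distance.astype(np.float32), H, (w, h),
                                 flags=cv2.INTER_LINEAR, borderValue=0.0)
    # Pixels that rotated out of view have no prompt; mark them invalid.
    valid = cv2.warpPerspective(np.ones((h, w), np.float32), H, (w, h),
                                flags=cv2.INTER_NEAREST) > 0.5
    return warped, valid
```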

But I don't think trying this is worth it without underlying ML model support; it will never look good enough to be usable. It would be better to use one of the less accurate video models that already exist. Even if they are "less accurate", their truly stable output will still beat any post-processing you can apply to the frames to stabilise them.

@calledit
Author

If the camera rotates you could then use basic external camera tracking to cancel out the rotation in the depth prompt and mask, so the model would not need to care about that type of stuff.

Thought I could add some info on that.

To cancel out basic camera rotation you could use the tracking tools in metric_depth_video_toolbox that are based on cotracker3 to do this:

python track_points_in_video.py --color_video ~/input.mp4 --nr_iterations 4 --steps_bewtwen_track_init 30

The result is a tracking file called ~/input.mp4_tracking.json which contains a list of tracked points for the video.
To cancel out camera rotation you simply need to move each frame in the opposite direction of the average point movement for that frame. If the scene contains lots of movement you might want to mask that out (for many scenes that probably won't be needed, but should it be needed the metric DVT contains tools for that too).
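
(A rough sketch of computing that per-frame correction from the tracking file. The JSON schema is assumed here, namely a list of tracks where each track is a list of [frame, x, y] entries; check the toolbox's actual output format before relying on this.)

```python
import json
import numpy as np

def per_frame_correction(tracking_file):
    """Return, per frame index, the shift that cancels the average point motion."""
    with open(tracking_file) as f:
        tracks = json.load(f)  # assumed: list of tracks of [frame, x, y] entries
    motion = {}
    for track in tracks:
        pts = {int(frame): np.array([x, y], dtype=float) for frame, x, y in track}
        for frame in pts:
            if frame - 1 in pts:
                motion.setdefault(frame, []).append(pts[frame] - pts[frame - 1])
    # Moving each frame opposite to the mean tracked-point motion cancels the
    # apparent rotation (assuming the scene is mostly static or masked).
    return {f: -np.mean(d, axis=0) for f, d in sorted(motion.items())}
```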
