Conversation

@ayoussf commented Jul 25, 2025

Allows passing custom estimation_options to localize_inloc.py, while maintaining the default behaviour if not provided.
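For illustration, a minimal sketch of the intended interface, assuming pycolmap's estimate_and_refine_absolute_pose as the backend; the wrapper name and exact plumbing here are assumptions, not the literal diff:

```python
import pycolmap

def localize(points2D, points3D, camera, estimation_options=None):
    # Fall back to pycolmap's defaults when no options are given, so
    # existing callers of localize_inloc.py keep the current behaviour.
    if estimation_options is None:
        estimation_options = pycolmap.AbsolutePoseEstimationOptions()
    return pycolmap.estimate_and_refine_absolute_pose(
        points2D, points3D, camera, estimation_options=estimation_options
    )
```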

I have also modified the Markdown in pipeline_InLoc.ipynb to specify additional details. If this is unwanted, I can revert to the original markdown.

@ayoussf (Author) commented Jul 30, 2025

Hi @Phil26AT,

I wanted to share a few observations regarding recent pycolmap updates (around v3.10.0 and later). Since these updates, I've been unable to reproduce InLoc's results, not only for SuperGlue and LightGlue but also for other models I've tested (e.g., LoFTR).

As an example, after freshly cloning the InLoc repository and running the provided notebook without making any changes, I obtained the following results with SuperGlue:

  • DUC1: 44.4 / 66.7 / 79.8
  • DUC2: 50.4 / 73.3 / 77.1

These numbers were fairly consistent across multiple machines.

For context, when replacing pycolmap.estimate_and_refine_absolute_pose with poselib.estimate_absolute_pose (roughly as sketched after the numbers below), I saw improved results on DUC2:

  • DUC1: 44.9 / 67.7 / 79.3
  • DUC2: 56.5 / 77.1 / 78.6
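
For reference, the swap looks roughly like the sketch below; the camera dict follows PoseLib's Python-binding format, and the reprojection threshold here is an assumption rather than the exact value I used:

```python
import poselib

def localize_with_poselib(points2D, points3D, camera_dict):
    # camera_dict in PoseLib's format, e.g.
    # {"model": "SIMPLE_PINHOLE", "width": w, "height": h, "params": [f, cx, cy]}
    ransac_opt = {"max_reproj_error": 48.0}  # threshold is an assumption
    bundle_opt = {}  # PoseLib's default refinement settings
    pose, info = poselib.estimate_absolute_pose(
        points2D, points3D, camera_dict, ransac_opt, bundle_opt
    )
    return pose, info  # pose.R / pose.t; info includes the inlier mask
```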

Moreover, I can still reliably reproduce results on Aachen v1.1, where I got the following with default settings:

SuperGlue:

  • Day: 90.3 / 96.2 / 99.4
  • Night: 76.4 / 90.1 / 100.0

LightGlue:

  • Day: 90.3 / 96.2 / 99.2
  • Night: 77.5 / 91.6 / 99.5

Thus, I am not entirely sure of the reason for this behaviour on the InLoc dataset specifically.

I hope this helps, and I'm happy to run additional tests if needed.

@Phil26AT (Collaborator)

Hi @ayoussf, thank you for reopening this PR.

Thank you for reporting detailed statistics on InLoc, and great that Aachen v1.1 is reproducible again. I reran the pipeline (SP+SG) without changes and also got similar results. However, results with the temporal pairs (used on the leaderboard) were reproducible to within ~2%:

| Method | InLoc DUC1 | InLoc DUC2 | Retrieval |
| --- | --- | --- | --- |
| Leaderboard | 46.5 / 65.7 / 78.3 | 52.7 / 72.5 / 79.4 | NetVLAD top 40 |
| pycolmap 3.12.0 | 43.9 / 66.2 / 79.3 | 51.1 / 74.0 / 77.1 | NetVLAD top 40 |
| Leaderboard | 49.0 / 68.7 / 80.8 | 53.4 / 77.1 / 82.4 | NetVLAD top 40 (temporal) |
| pycolmap 3.12.0 | 47.0 / 68.2 / 80.3 | 52.7 / 77.9 / 80.2 | NetVLAD top 40 (temporal) |

Note that this dataset is fairly small and that pose estimation is non-deterministic (I get fluctuations of 1-2%), so small differences might not be significant.

I checked the changelog and realized that some default parameters in the estimation options have changed (old vs. new defaults). It might be worth trying those.
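
For example, the options can be pinned explicitly rather than relying on the defaults; a rough sketch using pycolmap's AbsolutePoseEstimationOptions (the values below are placeholders, not the actual old defaults):

```python
import pycolmap

# Pin the RANSAC parameters explicitly so results do not silently shift
# when pycolmap changes its defaults (placeholder values, not old defaults).
opts = pycolmap.AbsolutePoseEstimationOptions()
opts.ransac.max_error = 12.0
opts.ransac.min_inlier_ratio = 0.01
opts.ransac.confidence = 0.9999
```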

Could you maybe post the last pycolmap version where results were reproducible for you, together with the actual numbers?

I will also try to run the pipeline with older versions over the next days.

@ayoussf (Author) commented Jul 31, 2025

@Phil26AT Thank you for the detailed response.

Unfortunately, I cannot pinpoint the exact pycolmap version I used, as I have since switched machines and no longer have access to the old environment to fully reproduce it. What I can say with certainty is that it was prior to v3.11.0.

Over the coming days, I will rerun the evaluations for SP+SG and SP+LG across pycolmap versions 0.6.0 through 3.12.3. Starting from v3.11.0, I will also include results using both the old and new estimation options for consistency. I will share the updated results in a new comment on this PR.

Lastly, I am aware there is a notebook for InLoc evaluation; however, if you prefer, I can create hloc/pipelines/InLoc with a pipeline.py script to stay consistent with the other dataset evaluations.

@ayoussf (Author) commented Aug 3, 2025

Hello @Phil26AT,

Following up on my previous comment, I conducted evaluations on the InLoc dataset using both SP+SG and SP+LG across pycolmap versions 0.6.0 to 3.12.3.

I am aware that earlier versions do not include the new RANSAC estimation options. However, for consistency, I evaluated each pycolmap version using both the new and old estimation settings.

Since a Markdown table would not fit in a comment, I attached the results as a figure below. To ensure fairness:

  • SP+SG and SP+LG features and matches were computed only once to avoid introducing sources of variation.

  • Absolute Pose Estimation was rerun for each pycolmap version using the precomputed features and matches.

From these results:

  • It appears I was incorrect about the results being irreproducible: they are in fact fairly consistent across all versions.

  • The results are also deterministic (no fluctuations) within the version groups (0.6.0–3.10.0), (3.11.0–3.11.1), and (3.12.0–3.12.3).

I also included Poselib results for comparison:

  • On DUC1, pycolmap is similar to Poselib.

  • On DUC2, there is a noticeable gap at the (0.25m, 2°) / (0.5m, 5°) thresholds. I understand that COLMAP has integrated PoseLib into its absolute pose estimation pipeline, and since the focal length is known for the InLoc evaluation, I assume COLMAP directly uses PoseLib's P3P solver. The difference observed on DUC2 may therefore come from the refinement step in each library. However, I could be mistaken, and the discrepancy might have a completely different cause (see the sketch below).
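
To isolate this, one could feed identical correspondences through both backends and compare only the estimation/refinement stage; a hedged sketch, where the camera construction and the reprojection threshold are assumptions:

```python
import pycolmap
import poselib

def compare_backends(points2D, points3D, f, w, h):
    # Identical correspondences through both estimators, so any systematic
    # difference comes from RANSAC/refinement rather than from matching.
    cam = pycolmap.Camera(
        model="SIMPLE_PINHOLE", width=w, height=h, params=[f, w / 2.0, h / 2.0]
    )
    ret_pycolmap = pycolmap.estimate_and_refine_absolute_pose(
        points2D, points3D, cam
    )
    cam_dict = {"model": "SIMPLE_PINHOLE", "width": w, "height": h,
                "params": [f, w / 2.0, h / 2.0]}
    pose_poselib, info = poselib.estimate_absolute_pose(
        points2D, points3D, cam_dict, {"max_reproj_error": 48.0}, {}
    )
    return ret_pycolmap, pose_poselib, info
```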

If you would like to double-check the evaluation code used, I have created a pycolmap_test branch in my fork.

I hope this is helpful, and apologies for the earlier confusion regarding the reproducibility of InLoc results.

[Figure: PyCOLMAP_Tests — SP+SG and SP+LG results on InLoc across pycolmap versions, with old/new estimation options and PoseLib for comparison]
