👷 SD-VLM: Spatial Measuring and Understanding with Depth-Encoded Vision-Language Models

📢 [Project Page] [arXiv] [Data] [Model Zoo]

We are excited to announce that our paper has been accepted to NeurIPS 2025!


Install

  1. Clone this repository

     git clone https://github.com/cpystan/SD-VLM.git
     cd SD-VLM

  2. Install the package

     conda create -n sdvlm python=3.10 -y
     conda activate sdvlm
     pip install --upgrade pip  # enable PEP 660 support
     pip install -e .

  3. Install additional packages for training

     pip install -e ".[train]"
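
A quick sanity check after installation, assuming the editable install above succeeded, is to import the package from Python:

    # Minimal post-install check: these imports are used in the Quick Start below.
    from llava.model.builder import load_pretrained_model
    from llava.mm_utils import get_model_name_from_path
    print("llava imports OK")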

Quick Start With HuggingFace

    import copy

    import torch
    from PIL import Image

    from llava.model.builder import load_pretrained_model
    from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
    from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
    from llava.conversation import conv_templates

    model_path = "cpystan/SD-VLM-7B"

    tokenizer, model, image_processor, context_len = load_pretrained_model(
        model_path=model_path,
        model_base=None,
        model_name=get_model_name_from_path(model_path)
    )

    # Build the prompt. The question is just an example; the conversation template
    # is assumed to be the LLaVA-1.5 default ("llava_v1"), since SD-VLM builds on
    # that checkpoint.
    question = "How far apart are the chair and the table?"
    conv = conv_templates["llava_v1"].copy()
    conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\n" + question)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()

    # Load the input image (replace with your own path) and keep an unprocessed
    # copy, which SD-VLM's generate() consumes through the ori_imgs argument.
    image = Image.open("path/to/your/image.jpg").convert('RGB')
    ori_img = copy.deepcopy(image)
    image_tensor = process_images([image], image_processor, model.config)[0]

    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()

    temperature = 0.2
    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=image_tensor.unsqueeze(0).half().to(input_ids.device),
            image_sizes=[image.size],
            do_sample=True if temperature > 0 else False,
            temperature=temperature,
            top_p=None,
            num_beams=1,
            ori_imgs=[ori_img],
            max_new_tokens=1024,
            use_cache=True,
        )
    response = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
    print(response)

MSMU (Massive Spatial Measuring and Understanding) Dataset

For instruction tuning, please download train.parquet of the MSMU dataset from [HuggingFace].

For evaluation on MSMU-Bench, please download test.parquet of the MSMU dataset from [HuggingFace].
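
Both splits are parquet files, so they can be inspected with pandas before training or evaluation. A minimal sketch (assumes the files have been downloaded locally; no assumptions are made about the column layout):

    import pandas as pd

    # Peek at the instruction-tuning split; the same works for test.parquet.
    df = pd.read_parquet("train.parquet")
    print(len(df), "samples")
    print(df.columns.tolist())  # inspect the available fields
    print(df.iloc[0])           # first record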

Training

SD-VLM inherits the instruction-tuning pipeline of LLaVA and starts from the well-established LLaVA-1.5-7B checkpoint. GPU requirements are relatively low: SD-VLM can be trained with LoRA on 8 V100 GPUs.

  1. LoRA Finetuning (official setting)

     sh scripts/v1_5/finetune_task_lora.sh

  2. Non-LoRA Finetuning

     sh scripts/v1_5/finetune_task.sh

Some arguments in the scripts need to be modified (a quick path sanity check is sketched after this list):

  • --model_name_or_path: path to the checkpoint of LLaVA-1.5-7B
  • --data_path: path to the train set of MSMU
  • --vision_tower: path to clip-vit-large-patch14-336
  • --depth_path: path to depth_anything_v2_vitl
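
Before launching either script, it can help to verify that the four paths above resolve. A minimal sketch with placeholder paths (substitute the values you actually set in the script):

    from pathlib import Path

    # Placeholder paths for illustration; replace with your own locations.
    paths = {
        "--model_name_or_path": "checkpoints/llava-v1.5-7b",
        "--data_path": "data/MSMU/train.parquet",
        "--vision_tower": "checkpoints/clip-vit-large-patch14-336",
        "--depth_path": "checkpoints/depth_anything_v2_vitl",
    }
    for arg, path in paths.items():
        status = "ok" if Path(path).exists() else "MISSING"
        print(f"{arg:<22} {path} [{status}]")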

🚧 Status: Coming Soon

More details are coming soon.
