InstructCrop

Official code release: InstructCrop — Teaching Multimodal Large Language Models to Crop Aesthetic Images

1. Installation

  1. Install dependencies:
pip install -r requirements.txt
pip install flash-attn --no-build-isolation

Alternatively, set up the same environment as LLaVA: https://github.com/haotian-liu/LLaVA

  2. Build rod_align (required for anchor/grid utilities). Refer to https://github.com/HuiZeng/Grid-Anchor-based-Image-Cropping-Pytorch for build instructions. The repository contains build scripts under rod_align/ (e.g., make.sh, setup.py).

  3. Download a base model for --base_model_path. You need a compatible base LLaVA model from the LLaVA MODEL_ZOO: https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md

Pass the path of the downloaded base model via --base_model_path when running inference.py.
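Before running inference, it can help to confirm that the installed packages are visible to the interpreter. The following is a minimal sketch, not part of the repo: the helper name missing_modules is hypothetical, and the import names torch and flash_attn are assumptions about what requirements.txt and the flash-attn package provide.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that this interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names: "torch" (from the LLaVA-style environment)
    # and "flash_attn" (the import name of the flash-attn package).
    missing = missing_modules(["torch", "flash_attn"])
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All checked modules are importable.")
```

The check only looks the modules up without importing them, so it stays fast and does not touch the GPU.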

2. Cropping model

Setup:

  • Compile and install the rod_align dependencies.
  • Place downloaded model weights in an accessible path for inference scripts.

Architecture:

  • Base Model (LLaVA-v1.6 + LoRA): multimodal backbone handling image + prompt.
  • Cropping Model (Stage 1): generates initial candidate boxes.
  • Instruct Model (Stage 2): refines boxes for high precision.
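The data flow between the two stages can be sketched as follows. This is an illustration under stated assumptions, not the repo's API: the Box type, the function signatures, and the toy stages are all hypothetical stand-ins, and the real classes (Cropping_LICA, Instruct_Model) may be wired together differently.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass(frozen=True)
class Box:
    """Axis-aligned crop box in pixel coordinates (hypothetical format)."""
    x1: float
    y1: float
    x2: float
    y2: float


# Hypothetical stage signatures, not the repo's actual classes.
ProposeFn = Callable[[Any], List[Box]]  # Stage 1: image -> candidate boxes
RefineFn = Callable[[Any, Box], Box]    # Stage 2: image + candidate -> refined box


def two_stage_crop(image: Any, propose: ProposeFn, refine: RefineFn) -> List[Box]:
    """Run Stage 1 to get candidates, then pass each one through Stage 2."""
    return [refine(image, box) for box in propose(image)]


# Toy stand-ins so the sketch runs without any model weights.
def toy_propose(image: Any) -> List[Box]:
    return [Box(0, 0, 100, 100)]


def toy_refine(image: Any, box: Box) -> Box:
    # Shrink the candidate by 5 pixels on every side.
    return Box(box.x1 + 5, box.y1 + 5, box.x2 - 5, box.y2 - 5)
```

Calling two_stage_crop(None, toy_propose, toy_refine) returns one refined box per Stage 1 candidate.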

Key files:

  • Inference entry: inference.py
  • Model implementations: model/text_baseline.py (contains LICAForCausalLM, Cropping_LICA, Instruct_Model)
  • LLaVA utilities: model/llava/
  • rod_align tools: rod_align/

3. Weights status

  • The Stage 1 (cropping) checkpoint is already released with this repo (no extra download required). Pass the path of the provided checkpoint via --checkpoint_path.
  • Download (Baidu Pan): checkpoint (extraction code: y8qm)
  • Hugging Face–style (organized) weights for easy from_pretrained loading are being prepared and will be published soon.
  • Stage 2 / Instruct weights are available in the provided weights folder or through the released links in the repo's weights section.

4. Inference (single image) — explainable outputs implemented

This project now implements explainable-text inference together with box prediction: the pipeline produces both an explanatory text describing the crop rationale and the final crop coordinates/visualization.

Run from project root:

python inference.py \
  --base_model_path <path_to_base_llava_model> \
  --checkpoint_path <stage1_checkpoint.pth> \
  --instruct_ckpt_path <stage2_checkpoint.pth> \
  --vision_tower openai/clip-vit-large-patch14 \
  --image_path <image_or_url> \
  --output_dir ./inference_results \
  --device cuda \
  --precision bf16 \
  --input_size 256 \
  --out_dim 256
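Since inference.py takes a single image, a small wrapper can repeat the command above over a folder. The sketch below is an assumption-laden convenience, not part of the repo: build_command and run_folder are hypothetical helpers, the *.jpg glob is an arbitrary choice, and only the flags shown above are used.

```python
import subprocess
import sys
from pathlib import Path


def build_command(image_path, base_model_path, checkpoint_path,
                  instruct_ckpt_path, output_dir="./inference_results"):
    """Build the argument list for one inference.py call, mirroring the command above."""
    return [
        sys.executable, "inference.py",
        "--base_model_path", str(base_model_path),
        "--checkpoint_path", str(checkpoint_path),
        "--instruct_ckpt_path", str(instruct_ckpt_path),
        "--vision_tower", "openai/clip-vit-large-patch14",
        "--image_path", str(image_path),
        "--output_dir", str(output_dir),
        "--device", "cuda",
        "--precision", "bf16",
        "--input_size", "256",
        "--out_dim", "256",
    ]


def run_folder(image_dir, **model_paths):
    """Run inference.py once per .jpg file in image_dir (run from the project root)."""
    for image in sorted(Path(image_dir).glob("*.jpg")):
        # check=True stops the loop on the first failing image.
        subprocess.run(build_command(image, **model_paths), check=True)
```

Each call starts a new process and reloads the model, so this is simple rather than fast.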

Notes:

  • Base model: the current implementation uses LLaVA-1.6-7B; support for v1.5, v1, and other versions will be added to the repo later.
  • The inference script produces:
    • Visualization: ./inference_results/<image_basename>_vis.jpg
    • Text explanation and coordinates: ./inference_results/<image_basename>_res.txt
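To find the outputs for a given input programmatically, the naming pattern above can be reproduced as in the sketch below. It assumes that <image_basename> means the file name without its extension and that the input is a local path; expected_outputs is a hypothetical helper, not a function from the repo.

```python
from pathlib import Path


def expected_outputs(image_path, output_dir="./inference_results"):
    """Return (visualization_path, result_text_path) for a local image path."""
    # Assumption: the basename used by inference.py excludes the extension.
    stem = Path(image_path).stem
    out = Path(output_dir)
    return out / f"{stem}_vis.jpg", out / f"{stem}_res.txt"
```

For example, expected_outputs("photos/cat.png") yields paths ending in cat_vis.jpg and cat_res.txt under the output directory.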

6. Acknowledgements

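This project leverages and builds upon:

  • LLaVA: https://github.com/haotian-liu/LLaVA
  • Grid-Anchor-based-Image-Cropping-Pytorch (rod_align): https://github.com/HuiZeng/Grid-Anchor-based-Image-Cropping-Pytorch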

7. Quick references

  • Inference script: inference.py
  • Model code: model/text_baseline.py
  • LLaVA tools: model/llava/
  • rod_align: rod_align/

[MM 2025] Official code release of our paper "InstructCrop: Teaching Multimodal Large Language Models to Crop Aesthetic Images"
