Official code release: InstructCrop — Teaching Multimodal Large Language Models to Crop Aesthetic Images
- Install dependencies:
    pip install -r requirements.txt
    pip install flash-attn --no-build-isolation
  Alternatively, set up the same environment as LLaVA: https://github.com/haotian-liu/LLaVA
- Build rod_align (required for anchor/grid utilities). Refer to: https://github.com/HuiZeng/Grid-Anchor-based-Image-Cropping-Pytorch
  The repository contains build scripts under rod_align/ (e.g., make.sh, setup.py).
- Download the base model for --base_model_path. Obtain a compatible base LLaVA model listed in the LLaVA MODEL_ZOO: https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md
  Pass the path of the downloaded base model to --base_model_path when running inference.py.
Setup:
- Compile and install the rod_align dependencies.
- Place downloaded model weights in an accessible path for inference scripts.
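Before running inference, it can help to confirm that the pieces from the setup steps are actually on disk. The following is a minimal sketch, not part of the repo: the helper name `find_missing` and the example paths are hypothetical, and the check only tests that each path exists (it does not verify that rod_align compiled successfully).

```python
from pathlib import Path
from typing import Dict, List


def find_missing(paths: Dict[str, str]) -> List[str]:
    """Return the labels whose paths do not exist on disk."""
    return [label for label, p in paths.items() if not Path(p).exists()]


# Hypothetical locations; replace them with your own paths.
required = {
    "rod_align directory": "rod_align",
    "base LLaVA model": "/path/to/base_llava_model",
    "Stage 1 checkpoint": "/path/to/stage1_checkpoint.pth",
    "Stage 2 checkpoint": "/path/to/stage2_checkpoint.pth",
}

for label in find_missing(required):
    print(f"missing: {label}")
```

Run it from the project root so that the relative `rod_align` path resolves correctly.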
Architecture:
- Base Model (LLaVA-v1.6 + LoRA): multimodal backbone handling image + prompt.
- Cropping Model (Stage 1): generates initial candidate boxes.
- Instruct Model (Stage 2): refines boxes for high precision.
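The coarse-to-fine data flow described above can be illustrated with a small sketch. This is not the repo's implementation: `Box`, `run_two_stage`, `toy_propose`, and `toy_refine` are hypothetical stand-ins, and the real Stage 1 and Stage 2 models live in model/text_baseline.py. The sketch only shows candidates being proposed, refined one by one, and clipped to the image bounds.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned crop box in pixel coordinates, (x1, y1) to (x2, y2)."""
    x1: float
    y1: float
    x2: float
    y2: float

    def clamp(self, width: int, height: int) -> "Box":
        """Clip the box so that it stays inside a width x height image."""
        return Box(
            min(max(self.x1, 0), width),
            min(max(self.y1, 0), height),
            min(max(self.x2, 0), width),
            min(max(self.y2, 0), height),
        )


def run_two_stage(
    size: Tuple[int, int],
    propose: Callable[[Tuple[int, int]], List[Box]],
    refine: Callable[[Box], Box],
) -> List[Box]:
    """Stage 1 proposes candidate boxes; Stage 2 refines each of them."""
    width, height = size
    candidates = propose(size)
    return [refine(box).clamp(width, height) for box in candidates]


# Toy stand-ins (assumptions for illustration only, not the real models).
def toy_propose(size: Tuple[int, int]) -> List[Box]:
    w, h = size
    return [Box(w // 10, h // 10, w - w // 10, h - h // 10)]


def toy_refine(box: Box) -> Box:
    return Box(box.x1 - 100, box.y1, box.x2, box.y2 + 100)


final_boxes = run_two_stage((640, 480), toy_propose, toy_refine)
```

Passing the two stages in as callables keeps the sketch independent of any particular model class.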
Key files:
- Inference entry: inference.py
- Model implementations: model/text_baseline.py (contains LICAForCausalLM, Cropping_LICA, Instruct_Model)
- LLaVA utilities: model/llava/
- rod_align tools: rod_align/
- The Stage 1 (cropping) checkpoint has already been released with this repo (no extra download required). Pass the path of the provided checkpoint to --checkpoint_path.
- Download (Baidu Pan): checkpoint (extraction code: y8qm)
- Hugging Face-style (organized) weights for easy from_pretrained loading are being prepared and will be published soon.
- Stage 2 / Instruct weights are included in the provided weights folder or are available via the released links in the repo's weights section.
This project now implements explainable-text inference together with box prediction: the pipeline produces both an explanatory text describing the crop rationale and the final crop coordinates/visualization.
Run from project root:
python inference.py \
--base_model_path <path_to_base_llava_model> \
--checkpoint_path <stage1_checkpoint.pth> \
--instruct_ckpt_path <stage2_checkpoint.pth> \
--vision_tower openai/clip-vit-large-patch14 \
--image_path <image_or_url> \
--output_dir ./inference_results \
--device cuda \
--precision bf16 \
--input_size 256 \
--out_dim 256

Notes:
- Base model: the current implementation uses LLaVA-1.6-7B; support for v1.5, v1, and other versions will be added to the repo later.
- The inference script produces:
  - Visualization: ./inference_results/<image_basename>_vis.jpg
  - Text explanation and coordinates: ./inference_results/<image_basename>_res.txt
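For running several images, the command and its output locations can be assembled programmatically. The sketch below uses only the flags and file-name patterns listed above. The helper names `build_command` and `expected_outputs` are hypothetical, and it assumes `<image_basename>` means the file name without its extension, which the source does not state explicitly.

```python
from pathlib import Path
from typing import List, Tuple


def build_command(
    image_path: str,
    base_model_path: str,
    checkpoint_path: str,
    instruct_ckpt_path: str,
    output_dir: str = "./inference_results",
) -> List[str]:
    """Build the inference.py argument list with the flags shown above."""
    return [
        "python", "inference.py",
        "--base_model_path", base_model_path,
        "--checkpoint_path", checkpoint_path,
        "--instruct_ckpt_path", instruct_ckpt_path,
        "--vision_tower", "openai/clip-vit-large-patch14",
        "--image_path", image_path,
        "--output_dir", output_dir,
        "--device", "cuda",
        "--precision", "bf16",
        "--input_size", "256",
        "--out_dim", "256",
    ]


def expected_outputs(
    image_path: str, output_dir: str = "./inference_results"
) -> Tuple[Path, Path]:
    """Return the (visualization, text result) paths for one image.

    Assumption: <image_basename> is the file name without its extension.
    """
    stem = Path(image_path).stem
    out = Path(output_dir)
    return out / f"{stem}_vis.jpg", out / f"{stem}_res.txt"
```

From the project root, each command can be executed with `subprocess.run(build_command(...), check=True)`, after which `expected_outputs` indicates where to look for the results.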
This project leverages and builds upon:
- LLaVA: https://github.com/haotian-liu/LLaVA
- Grid-Anchor-based-Image-Cropping-Pytorch (rod_align): https://github.com/HuiZeng/Grid-Anchor-based-Image-Cropping-Pytorch

- Inference script: inference.py
- Model code: model/text_baseline.py
- LLaVA tools: model/llava/
- rod_align: rod_align/