InstructCrop

Official code release: InstructCrop — Teaching Multimodal Large Language Models to Crop Aesthetic Images

1. Installation

Install dependencies:

pip install -r requirements.txt
pip install flash-attn --no-build-isolation

or set up the same environment as LLaVA: https://github.com/haotian-liu/LLaVA

Build rod_align (required for the anchor/grid utilities): see https://github.com/HuiZeng/Grid-Anchor-based-Image-Cropping-Pytorch, which provides build scripts under rod_align/ (e.g., make.sh, setup.py).

Download a base model for --base_model_path: obtain a compatible LLaVA base model from the LLaVA MODEL_ZOO: https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md

Pass the path of the downloaded base model as --base_model_path when running inference.py.
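
Before running anything, it can help to confirm that the key packages are importable. The following is a minimal sketch, not part of this repo: the helper name missing_modules and the module list are assumptions (torch because LLaVA is PyTorch-based, flash_attn as the import name of the flash-attn package).

```python
import importlib.util

# Assumed module list; adjust it to match requirements.txt.
REQUIRED_MODULES = ("torch", "flash_attn")


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

The check only looks up the modules without importing them, so it does not verify that the rod_align extension was compiled or that a GPU is available.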

2. Cropping model

Setup:

Compile and install the rod_align dependencies.
Place downloaded model weights in an accessible path for inference scripts.

Architecture:

Base Model (LLaVA-v1.6 + LoRA): multimodal backbone handling image + prompt.
Cropping Model (Stage 1): generates initial candidate boxes.
Instruct Model (Stage 2): refines boxes for high precision.
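
The coarse-to-fine flow above can be pictured roughly as follows. This is a control-flow sketch only: Box, clamp, two_stage_crop and the two stand-in functions are hypothetical names for illustration and are not the classes in model/text_baseline.py.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned crop box in pixel coordinates, (x1, y1) to (x2, y2)."""
    x1: float
    y1: float
    x2: float
    y2: float


def clamp(box: Box, width: int, height: int) -> Box:
    """Clip a box so that it lies inside the image bounds."""
    return Box(
        max(0.0, min(box.x1, width)),
        max(0.0, min(box.y1, height)),
        max(0.0, min(box.x2, width)),
        max(0.0, min(box.y2, height)),
    )


def two_stage_crop(
    size: Tuple[int, int],
    propose: Callable[[int, int], List[Box]],
    refine: Callable[[Box], Box],
) -> List[Box]:
    """Run a proposal stage, then refine and clip every candidate."""
    width, height = size
    candidates = propose(width, height)  # stand-in for Stage 1
    return [clamp(refine(b), width, height) for b in candidates]  # stand-in for Stage 2


# Hypothetical stand-ins, used only to exercise the control flow.
def centre_proposal(width: int, height: int) -> List[Box]:
    return [Box(width / 10, height / 10, 9 * width / 10, 9 * height / 10)]


def shift_left(box: Box) -> Box:
    return Box(box.x1 - 100, box.y1, box.x2 - 100, box.y2)


boxes = two_stage_crop((400, 200), centre_proposal, shift_left)
```

In the real pipeline both stages are learned models; here they are plain functions so the composition can be read and run without any weights.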

Key files:

Inference entry: inference.py
Model implementations: model/text_baseline.py (contains LICAForCausalLM, Cropping_LICA, Instruct_Model)
LLaVA utilities: model/llava/
rod_align tools: rod_align/

3. Weights status

The Stage 1 (cropping) checkpoint has already been released with this repo (no extra download required). Pass the path of the provided checkpoint as --checkpoint_path.
Download (Baidu Pan): checkpoint (Baidu Pan, extraction code: y8qm)
Hugging Face-style (organized) weights for easy from_pretrained loading are being prepared and will be published soon.
Stage 2 / Instruct weights are included in the provided weights folder or available via the released links in the repo's weights section.

4. Inference (single image) — explainable outputs implemented

This project now implements explainable-text inference together with box prediction: the pipeline produces both an explanatory text describing the crop rationale and the final crop coordinates/visualization.

Run from project root:

python inference.py \
  --base_model_path <path_to_base_llava_model> \
  --checkpoint_path <stage1_checkpoint.pth> \
  --instruct_ckpt_path <stage2_checkpoint.pth> \
  --vision_tower openai/clip-vit-large-patch14 \
  --image_path <image_or_url> \
  --output_dir ./inference_results \
  --device cuda \
  --precision bf16 \
  --input_size 256 \
  --out_dim 256
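
To process several images, the same command can be assembled from Python. The sketch below assumes only the flags shown above; build_inference_command and run_batch are hypothetical helper names, not functions provided by this repo.

```python
import subprocess
import sys
from typing import Iterable, List


def build_inference_command(
    base_model_path: str,
    checkpoint_path: str,
    instruct_ckpt_path: str,
    image_path: str,
    output_dir: str = "./inference_results",
) -> List[str]:
    """Assemble the argument list for inference.py using the flags shown above."""
    return [
        sys.executable, "inference.py",
        "--base_model_path", base_model_path,
        "--checkpoint_path", checkpoint_path,
        "--instruct_ckpt_path", instruct_ckpt_path,
        "--vision_tower", "openai/clip-vit-large-patch14",
        "--image_path", image_path,
        "--output_dir", output_dir,
        "--device", "cuda",
        "--precision", "bf16",
        "--input_size", "256",
        "--out_dim", "256",
    ]


def run_batch(image_paths: Iterable[str], **model_paths: str) -> None:
    """Run inference.py once per image; raises CalledProcessError on failure."""
    for image_path in image_paths:
        cmd = build_inference_command(image_path=image_path, **model_paths)
        subprocess.run(cmd, check=True)
```

Each call starts a new process and reloads the models, so this is convenient rather than fast; run it from the project root so that inference.py resolves.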

Notes:

Model base: LLaVA-1.6-7B is used for the current implementation; support for v1.5, v1, and other versions will be added to the repo later.
The inference script produces:
- Visualization: ./inference_results/<image_basename>_vis.jpg
- Text explanation and coordinates: ./inference_results/<image_basename>_res.txt
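
A small helper can locate these two files for a given input. This is a sketch under one assumption that should be checked against inference.py: that <image_basename> means the file name without its extension. The name expected_outputs is hypothetical.

```python
from pathlib import Path
from typing import Tuple


def expected_outputs(image_path: str, output_dir: str = "./inference_results") -> Tuple[Path, Path]:
    """Return the (visualization, result text) paths for one input image."""
    stem = Path(image_path).stem  # assumption: basename excludes the extension
    out = Path(output_dir)
    return out / f"{stem}_vis.jpg", out / f"{stem}_res.txt"


vis_path, res_path = expected_outputs("photos/cat.png")
if res_path.exists():
    # Print the explanation and coordinates written by the inference script.
    print(res_path.read_text(encoding="utf-8"))
```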

6. Acknowledgements

This project leverages and builds upon:

LLaVA: https://github.com/haotian-liu/LLaVA
Grid-Anchor-based-Image-Cropping-Pytorch (rod_align): https://github.com/HuiZeng/Grid-Anchor-based-Image-Cropping-Pytorch

7. Quick references

Inference script: inference.py
Model code: model/text_baseline.py
LLaVA tools: model/llava/
rod_align: rod_align/
