Pipeline to investigate structured reasoning and instruction adherence in multimodal LLMs
benchmark robustness grounding out-of-distribution neuro-symbolic robustness-verification instruction-following trustworthy-ai large-language-models faithfulness hallucination-detection agentic-ai llm-alignment agentic-evaluation agentic-reasoning deterministic-eval
Updated May 28, 2026 - Python