LLMEval
FDU-NLP · AAAI 2024 · EMNLP 2025 · ACL 2026

Comprehensive Evaluation for Large Language Models

LLMEval is a research initiative from the Fudan NLP Lab that builds rigorous and fair evaluation frameworks for LLMs, covering 13+ academic disciplines and medical AI with a bank of 220,000+ generative questions.

- 4 papers at AAAI / EMNLP / ACL
- 59 LLMs benchmarked
- 220K questions in LLMEval-Fair
- 265 GitHub stars

Featured Research

Our latest publications in LLM evaluation.

Under submission · 2026

LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening

LLMEval Team, Fudan NLP Lab (anonymous authors during peer review)

LLMEval-Logic is a Chinese logical reasoning benchmark built through a three-stage audit pipeline: (a) annotators authored items forward from real-world stories rather than templating backward from formulas; (b) a hand-written rubric checklist together with the Z3 SMT solver double-audited every natural-language-to-first-order-logic translation; and (c) a closed-loop adversarial-hardening agent workflow discarded items that turned out to be too easy. The dataset has two paired splits: LLMEval-Logic-Base (single-question propositional-logic and first-order-logic items with Z3-verified answers, gold formalisations, and atom-level NL-to-FL rubrics) and LLMEval-Logic-Hard (multi-question and sub-question items covering enumeration, counting, uniqueness, alternative-solution, and counterfactual reasoning). Three independent runs of 14 frontier LLMs under thinking and no-thinking configurations show the strongest model reaching only 37.5% Item Accuracy on the Hard split, leaving substantial headroom for frontier reasoning research. Following the contamination-resistant tradition of LLMEval-Fair, only 80% of the corpus is released publicly; the remaining 20% is held out as a private contamination-resistant test set maintained by the Fudan NLP Lab.
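The solver double-audit described above can be illustrated with a toy entailment check. The paper uses the Z3 SMT solver; the pure-Python truth-table sketch below (all formulas and names are hypothetical, not from the benchmark) shows the underlying idea: a gold answer is verified when the premises entail it and refute its negation.

```python
from itertools import product

# Toy propositional item: "If it rains, the game is cancelled. It rains.
# Is the game cancelled?" Formulas map a truth assignment (dict) to bool.
rain = lambda v: v["rain"]
cancelled = lambda v: v["cancelled"]
premises = lambda v: (not rain(v) or cancelled(v)) and rain(v)

def entails(premises, conclusion, atoms=("rain", "cancelled")):
    """Premises entail the conclusion iff no truth assignment
    satisfies the premises while falsifying the conclusion."""
    for values in product([False, True], repeat=len(atoms)):
        v = dict(zip(atoms, values))
        if premises(v) and not conclusion(v):
            return False
    return True

assert entails(premises, cancelled)                       # gold answer verified
assert not entails(premises, lambda v: not cancelled(v))  # opposite answer rejected
```

An SMT solver performs the same check symbolically (premises AND NOT conclusion is unsatisfiable), which scales to first-order formalisations where truth tables do not.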

logical reasoning · propositional logic · first-order logic · Z3 / SMT · adversarial hardening · contamination-resistant
ACL 2026 Main Conference · 2026

LLMEval-Fair: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models

Ming Zhang*†, Yujiong Shen*, Jingyi Deng*, Yuhui Wang*, Huayu Sha, Kexin Tan, Qiyuan Peng, Yue Zhang, Junzhe Wang, Shichun Liu, Yueyuan Huang, Jingqi Tong, Changhao Jiang, Yilong Wu, Zhihao Zhang, Mingqi Wu, Mingxu Chai, Zhiheng Xi, Shihan Dou, Tao Gui, Qi Zhang†, Xuanjing Huang (* Equal Contribution, † Corresponding Author)

LLMEval-Fair addresses robustness and fairness concerns in LLM evaluation through a 30-month longitudinal study. Built on a proprietary bank of 220,000 graduate-level questions across 13 academic disciplines, it dynamically samples unseen test sets for each evaluation run. Its automated pipeline ensures integrity via contamination-resistant data curation, a novel anti-cheating architecture, and a calibrated LLM-as-a-judge process achieving 90% agreement with human experts. A study of nearly 60 leading models reveals performance ceilings and exposes data contamination vulnerabilities undetectable by static benchmarks.
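The dynamic-sampling idea can be sketched as follows, assuming a simple ID-based question bank (all names are hypothetical; the actual LLMEval-Fair pipeline is proprietary and not described at this level of detail here):

```python
import random

def sample_run(question_ids, already_used, k, seed=None):
    """Draw a fresh test set of k questions that no previous
    evaluation run has seen, so no fixed benchmark can leak."""
    rng = random.Random(seed)
    unseen = [q for q in question_ids if q not in already_used]
    picked = rng.sample(unseen, k)
    already_used.update(picked)
    return picked

bank = list(range(220_000))   # stands in for the 220K-question bank
used = set()
run1 = sample_run(bank, used, k=5, seed=0)
run2 = sample_run(bank, used, k=5, seed=1)
assert not set(run1) & set(run2)   # successive runs never overlap
```

Because each run draws from the unseen remainder of a private bank, a model that memorised earlier test sets gains nothing, which is what lets the longitudinal study expose contamination invisible to static benchmarks.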

evaluation · fairness · robustness · generative QA · longitudinal study
EMNLP 2025 Findings · 2025

LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation

Ming Zhang*, Yujiong Shen*, Zelin Li*, Huayu Sha, Binze Hu, Yuhui Wang, Chenhao Huang, Shichun Liu, Jingqi Tong, Changhao Jiang, Mingxu Chai, Zhiheng Xi, Shihan Dou, Tao Gui, Qi Zhang†, Xuanjing Huang† (* Equal Contribution, † Corresponding Author)

LLMEval-Med is a physician-validated benchmark for evaluating LLMs on real-world clinical tasks. It covers five core medical areas (Medical Knowledge, Language Understanding, Reasoning, Ethics & Safety, and Text Generation) with 2,996 questions drawn from real electronic health records and expert-designed clinical scenarios. An automated evaluation pipeline built on expert-developed checklists is validated through human-machine agreement analysis, and 13 LLMs spanning specialized, open-source, and closed-source categories are evaluated.
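A hypothetical sketch of checklist-based scoring and a simple human-machine agreement rate (the real pipeline's checklists and judge are far richer; every name and data point below is illustrative only):

```python
def checklist_score(answer: str, checklist: list) -> float:
    """Fraction of expert checklist items mentioned in a model answer."""
    hits = sum(item.lower() in answer.lower() for item in checklist)
    return hits / len(checklist)

def percent_agreement(judge_labels, human_labels) -> float:
    """Raw agreement rate between the automated judge and physician labels."""
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)

answer = "Start empirical antibiotics and obtain blood cultures before dosing."
checklist = ["blood cultures", "antibiotics"]
assert checklist_score(answer, checklist) == 1.0
assert percent_agreement([1, 1, 0, 1], [1, 0, 0, 1]) == 0.75
```

Validating the automated judge against physician labels in this way is what justifies running the checklist pipeline at scale without per-item expert review.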

medical · clinical · physician validation · LLM-as-Judge

Participate in LLMEval

LLMEval is open to the public. Feel free to explore our code and data on GitHub, or contact us for collaboration.