
AirQA

Evaluating Scientific Question Answering in Realistic Scenarios
ICLR 2026 Poster 🔥

About AirQA

AirQA is a human-annotated, multi-modal, multi-task question answering dataset comprising 1,246 examples and 13,956 papers in the domain of artificial intelligence.

It contains 4 different question types (single, multiple, retrieval, comprehensive) and 5 different element categories (text, table, image, formula, metadata), with 19 parameterized Python functions to support customized evaluation.
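As a hypothetical illustration of what one such parameterized evaluation function might look like (the function name, signature, and parameters below are assumptions for illustration, not the actual AirQA API), an order-insensitive string-list match could be sketched as:

```python
def eval_string_set_match(pred, gold, ignore_case=True):
    """Order-insensitive match between a predicted and a gold list of
    strings, with case sensitivity as a tunable parameter.

    Hypothetical sketch: not the dataset's real evaluation code.
    """
    # Normalize each string before comparing the two collections as sets.
    norm = (lambda s: s.strip().lower()) if ignore_case else (lambda s: s.strip())
    return {norm(s) for s in pred} == {norm(s) for s in gold}
```

A suite of such functions, each parameterized by the expected answer type, lets a single evaluation script cover heterogeneous answer formats at the instance level.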


News

  • 2026-02-26: Our code and data are now publicly available! See above for the links.

Why AirQA?

While LLMs excel at atomic tasks, complex scientific inquiry remains a challenge: it requires long-horizon reasoning rather than merely processing the information it is given.

AirQA bridges this gap by evaluating three critical capabilities: understanding both textual and paratextual elements, actively retrieving relevant papers from scratch, and applying long-term planning to orchestrate these tasks iteratively.

Unlike benchmarks focusing on isolated skills, AirQA emphasizes combining these abilities under realistic scenarios, shifting the focus from what a model knows to how it plans, retrieves, and synthesizes information over multiple steps.

Notably, even advanced LLMs solve fewer than half of the tasks when given access to external sources, and merely 5% with the questions alone, highlighting the significant challenges posed by AirQA.

Acknowledgements

We would like to thank Haoran Wang, Jingyi Zhang, Ye Wang, Yuxun Miao, Danyang Zhang, Hanqi Li, Zichen Zhu, Situo Zhang, Senyu Han, Dingye Liu, Wenjie Sun, Hanchong Zhang, Nan Jiang, Liangtai Sun, Da Ma, Hankun Wang, and Zhihan Li for their careful annotation on the AirQA dataset. The website and submission guidelines are greatly inspired by BIRD-SQL and Spider 2.0, and we thank them for their contributions.

Data Examples

  • Type: single
    Question: On which downstream tasks does CLiCoTEA outperform other models in terms of zero-shot performance on the IGLUE benchmark?
    Answer format: Your answer should be a Python list of strings; every string is the abbreviation of a downstream task type mentioned in the paper.

  • Type: multiple
    Question: According to this survey, what are the three most recent decoder-only LLMs for NL2Code? How many programming languages do their training datasets each contain?
    Answer format: Your answer should be a Python dictionary of 3 key-value pairs, where each key is a string (the LLM) and each value is the number of programming languages.

  • Type: retrieval
    Question: Which paper unifies reinforcement learning and imitation learning methods under a dual framework?
    Answer format: Your answer should be the exact title of the paper WITHOUT ANY OTHER EXPLANATION.

  • Type: comprehensive
    Question: Among the text-to-SQL papers in ACL 2023, which one achieves the best test-suite accuracy on the SPIDER dataset? Tell me the paper title and the corresponding test accuracy.
    Answer format: Your answer should be a Python list of length two, with the first element being the title string and the second being a float, the accuracy rounded to 3 decimals.
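For instance, the "comprehensive" format above (a two-element list holding a title string and a 3-decimal float) could be validated by a small checker like the following sketch. The function name and tolerance are assumptions for illustration, not the dataset's actual evaluation code:

```python
def check_title_and_accuracy(answer, gold_title, gold_accuracy, tol=5e-4):
    """Validate a [title, accuracy] answer: the title must match the gold
    title case-insensitively, and the float must equal the gold accuracy
    rounded to 3 decimals, within a small numeric tolerance.

    Hypothetical sketch, not the dataset's real checker.
    """
    # Reject anything that is not a two-element Python list.
    if not (isinstance(answer, list) and len(answer) == 2):
        return False
    title, acc = answer
    return (isinstance(title, str)
            and isinstance(acc, float)
            and title.strip().lower() == gold_title.strip().lower()
            and abs(acc - round(gold_accuracy, 3)) < tol)
```

Structuring checks this way keeps the answer-format contract (type, length, rounding) machine-verifiable rather than relying on fuzzy string comparison alone.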

Have Questions?

Ask questions on our GitHub Issues page or contact Tiancheng Huang for more information.

Citation

@misc{huang2025airqacomprehensiveqadataset,
    title={AirQA: A Comprehensive QA Dataset for AI Research with Instance-Level Evaluation}, 
    author={Tiancheng Huang and Ruisheng Cao and Yuxin Zhang and Zhangyi Kang and Zijian Wang and Chenrun Wang and Yijie Luo and Hang Zheng and Lirong Qian and Lu Chen and Kai Yu},
    year={2025},
    eprint={2509.16952},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2509.16952}, 
}
Leaderboard

The leaderboard covers the complete AirQA dataset of 1,246 examples. Since we continually check dataset quality and welcome contributions, the scores may change slightly over time.

Rank Method Score
1 Agentic Hybrid + Gemini-2.5-Pro 44.14
2 Agentic Hybrid + GPT-4o 35.96
3 Agentic Hybrid + Qwen2.5-72B-Instruct 35.07
4 Agentic Hybrid + DeepSeek-R1 29.29
5 Agentic Hybrid + Fine-tuned Qwen2.5-7B-Instruct 24.07
6 Qwen2.5-72B-Instruct 4.65
7 GPT-4o 4.41