About AirQA
AirQA is a human-annotated multi-modal multi-task question answering dataset, which encompasses 1,246 examples and 13,956 papers in the domain of artificial intelligence.
It contains 4 different question types (single, multiple, retrieval, comprehensive) and 5 different element categories (text, table, image, formula, metadata), with 19 parameterized Python functions to support customized evaluation.
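The 19 parameterized evaluation functions themselves are not reproduced here; as a rough illustration of the idea (the function name, signature, and parameter are hypothetical, not the actual AirQA API), an instance-level metric configurable per example might look like:

```python
# Hypothetical sketch of a parameterized evaluation function in the spirit
# of AirQA's instance-level metrics; names and signature are illustrative.

def eval_string_set_match(pred, gold, ignore_case=True):
    """Check that two lists contain the same strings, order-insensitively.

    The `ignore_case` parameter makes the check configurable per instance,
    which is what "parameterized" means in this context.
    """
    norm = (lambda s: s.strip().lower()) if ignore_case else (lambda s: s.strip())
    return {norm(s) for s in pred} == {norm(s) for s in gold}

print(eval_string_set_match(["VQA", "NLI"], ["nli", "vqa"]))  # True
```

Each AirQA example would then name one such function plus its parameters, so answers in different formats (lists, dicts, floats, titles) can be scored uniformly.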
News
- 2026-02-26: Our code and data are now publicly available! See above for the links.
Why AirQA?
AirQA fills a gap in existing benchmarks by evaluating three critical capabilities: understanding both textual and paratextual elements, actively retrieving relevant papers from scratch, and applying long-term planning to orchestrate these tasks iteratively.
Unlike benchmarks focusing on isolated skills, AirQA emphasizes combining these abilities under realistic scenarios, shifting the focus from what a model knows to how it plans, retrieves, and synthesizes information over multiple steps.
Notably, even advanced LLMs solve less than half of the tasks when given external sources, dropping to merely 5% on the questions alone, which highlights the significant challenges posed by AirQA.
Acknowledgements
We would like to thank Haoran Wang, Jingyi Zhang, Ye Wang, Yuxun Miao, Danyang Zhang, Hanqi Li, Zichen Zhu, Situo Zhang, Senyu Han, Dingye Liu, Wenjie Sun, Hanchong Zhang, Nan Jiang, Liangtai Sun, Da Ma, Hankun Wang, and Zhihan Li for their careful annotation on the AirQA dataset. The website and submission guidelines are greatly inspired by BIRD-SQL and Spider 2.0, and we thank them for their contributions.
Data Examples
| Type | Question | Answer Format |
|---|---|---|
| single | On which downstream tasks does CLiCoTEA outperform other models in terms of zero-shot performance on the IGLUE benchmark? | Your answer should be a Python list of strings, where every string is the abbreviation of a downstream task type mentioned in the paper. |
| multiple | According to this survey, what are the three most recent decoder-only LLMs for NL2Code? How many programming languages do their training datasets each contain? | Your answer should be a Python dictionary of 3 key-value pairs, where each key is a string naming the LLM and each value is the number of programming languages. |
| retrieval | Which paper unifies reinforcement learning and imitation learning methods under a dual framework? | Your answer should be the exact title of the paper WITHOUT ANY OTHER EXPLANATION. |
| comprehensive | Among the text-to-SQL papers in ACL 2023, which one achieves the best testsuite accuracy on the SPIDER dataset? Tell me the paper title and corresponding test accuracy. | Your answer should be a Python list of length two, with the first one being the title string and the second one being a float, the accuracy rounded to 3 decimals. |
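Since answers must be exact Python objects (lists, dictionaries, floats) or plain title strings, a submission can be sanity-checked locally before evaluation. A minimal sketch, assuming nothing about the official tooling (the helper name is hypothetical):

```python
import ast

def parse_answer(raw):
    """Parse a model's raw answer string into a Python object.

    Falls back to the raw string for retrieval-type answers, which are
    plain paper titles rather than Python literals.
    """
    try:
        return ast.literal_eval(raw.strip())
    except (ValueError, SyntaxError):
        return raw.strip()

# A 'comprehensive'-style answer: [title, accuracy rounded to 3 decimals]
ans = parse_answer('["Some Paper Title", 0.785]')
print(type(ans).__name__, ans[1])  # list 0.785
```

A retrieval-type answer such as a bare paper title fails `literal_eval` and is kept verbatim, matching the "exact title WITHOUT ANY OTHER EXPLANATION" format above.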
Have Questions?
Citation
@misc{huang2025airqacomprehensiveqadataset,
      title={AirQA: A Comprehensive QA Dataset for AI Research with Instance-Level Evaluation},
      author={Tiancheng Huang and Ruisheng Cao and Yuxin Zhang and Zhangyi Kang and Zijian Wang and Chenrun Wang and Yijie Luo and Hang Zheng and Lirong Qian and Lu Chen and Kai Yu},
      year={2025},
      eprint={2509.16952},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.16952},
}
Leaderboard
Results on the full AirQA dataset of 1,246 examples. Since we continually check dataset quality and welcome contributions, scores may change slightly over time.
| Rank | Method | Score |
|---|---|---|
| 1 | Agentic Hybrid + Gemini-2.5-Pro | 44.14 |
| 2 | Agentic Hybrid + GPT-4o | 35.96 |
| 3 | Agentic Hybrid + Qwen2.5-72B-Instruct | 35.07 |
| 4 | Agentic Hybrid + DeepSeek-R1 | 29.29 |
| 5 | Agentic Hybrid + Fine-tuned Qwen2.5-7B-Instruct | 24.07 |
| 6 | Qwen2.5-72B-Instruct | 4.65 |
| 7 | GPT-4o | 4.41 |