Dr. Jinxu Zhang | Document Image Analysis | Research Excellence Award

Harbin Institute of Technology | China

Dr. Jinxu Zhang is a researcher at the Harbin Institute of Technology specializing in multimodal understanding, Document Visual Question Answering (DocVQA), and multimodal large language models. His work focuses on advancing key technologies for interpreting complex, multi-form, and multi-page documents, contributing significantly to the fields of document intelligence and machine reading systems.

He has completed, and continues to contribute to, the National Natural Science Foundation of China (NSFC) project on Key Technologies of Multi-form Document VQA. His research outputs include six SCI/Scopus-indexed publications, with a total of 41 citations, an h-index of 2, and an i10-index of 2. His contributions appear in top-tier venues such as ACM Multimedia (CCF-A), EMNLP Findings (CCF-B), Information Fusion (SCI, IF 15.5), and IEEE Transactions on Learning Technologies. His notable works CREAM, DocRouter, DocAssistant, and DREAM introduce innovative solutions for hierarchical multimodal retrieval, prompt-guided vision transformers, mixture-of-experts connectors, and robust reasoning strategies for document comprehension.

Dr. Zhang’s patented work on an intelligent question-answering system for multi-form documents further extends his impact toward practical, deployable intelligent document systems. His research emphasizes coarse-to-fine retrieval, key-region reading, step-wise reasoning, and efficient multimodal fusion. He also incorporates Reinforcement Learning–based data enhancement and Chain-of-Thought (CoT) construction to improve model reasoning in multi-page document analysis.

He actively collaborates with university researchers in multimodal understanding, document analysis, OCR, and deep learning, fostering interdisciplinary innovation. His work contributes to building reliable and generalizable document intelligence systems with broad societal applications, including education, digital governance, business automation, and large-scale knowledge management.

Dr. Zhang continues to advance the frontier of intelligent document analysis through sustained research, model innovation, and high-impact scholarly contributions.

Profiles: Scopus | Google Scholar

Featured Publications

1. Liu, M., Zhang, J., Nyagoga, L. M., & Liu, L. (2023). Student-AI question cocreation for enhancing reading comprehension. IEEE Transactions on Learning Technologies, 17, 815–826. Cited By: 28

2. Zhang, J., Yu, Y., & Zhang, Y. (2024). CREAM: Coarse-to-fine retrieval and multi-modal efficient tuning for document VQA. In Proceedings of the 32nd ACM International Conference on Multimedia (pp. 925–934). Cited By: 13

3. Zhang, J., Fan, Q., & Zhang, Y. (2025). DocAssistant: Integrating key-region reading and step-wise reasoning for robust document visual question answering. In Findings of the Association for Computational Linguistics: EMNLP 2025 (pp. 3496–3511).

4. Zhang, J., Fan, Q., Yu, Y., & Zhang, Y. (2025). DREAM: Integrating hierarchical multimodal retrieval with multi-page multimodal language model for documents VQA. In Proceedings of the 33rd ACM International Conference on Multimedia (pp. 4213–4221).

5. Zhang, J., & Zhang, Y. (2025). DocRouter: Prompt guided vision transformer and Mixture of Experts connector for document understanding. Information Fusion, 122, Article 103206.

Dr. Zhang’s research advances the global frontier of intelligent document understanding by enabling machines to accurately interpret complex, multi-page documents with human-level reasoning. His innovations in multimodal fusion, retrieval, and robust VQA architectures support applications in scientific research, digital governance, education, and automated knowledge management. Ultimately, his work drives the development of reliable, scalable, and socially beneficial AI systems that enhance information accessibility and decision-making worldwide.