Adaptive OCR Engine Selection and Evaluation for Multi-Format Government Document Digitization
DOI:
https://doi.org/10.69987/AIMLR.2026.70103
Keywords:
Optical Character Recognition, Government Document Digitization, Adaptive Engine Selection, Multi-Format Recognition
Abstract
Government agencies worldwide face mounting pressure to digitize archival records to improve public access and administrative efficiency. Multi-format documents pose distinct challenges: printed text, handwritten annotations, tabular structures, and degraded scans from historical archives often appear within the same collection. This research establishes a comprehensive evaluation framework for OCR engines applied to heterogeneous government documents. We systematically compare nine OCR systems, covering both traditional engines and deep learning approaches, and add a dedicated table-structure baseline (CascadeTabNet) for the tabular-structure experiments. An adaptive selection strategy dynamically routes each document to the most suitable engine based on automated format classification and quality assessment. Experimental validation on 15,000 government documents shows a 23.7% improvement in accuracy over single-engine baselines while keeping processing time on the order of seconds per page. Integrating large language models for quality verification reduces manual annotation costs by 41% and the number of required reviews by 44.7%. The proposed methodology provides data-driven implementation guidelines for government digitization initiatives, addressing scalability and cost-performance trade-offs.
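The routing idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the engine labels, the 0.5 quality threshold, the `PageProfile` type, and the `select_engine` function are all hypothetical, and the format label and quality score are assumed to come from upstream classifiers that are not shown here.

```python
from dataclasses import dataclass

# Hypothetical engine labels; the abstract does not name the routed engines.
ENGINE_BY_FORMAT = {
    "printed": "traditional_ocr",
    "handwritten": "deep_learning_ocr",
    "table": "table_structure_model",
}
FALLBACK_ENGINE = "deep_learning_ocr"  # assumed choice for hard or unknown pages
QUALITY_THRESHOLD = 0.5  # assumed cutoff on a 0..1 scan-quality score


@dataclass
class PageProfile:
    doc_format: str  # label from an assumed format classifier
    quality: float   # score from an assumed quality assessor, 0..1


def select_engine(page: PageProfile) -> str:
    """Choose an engine label from the page's format and scan quality."""
    # Degraded printed pages are sent to the fallback engine instead of
    # the default printed-text engine.
    if page.doc_format == "printed" and page.quality < QUALITY_THRESHOLD:
        return FALLBACK_ENGINE
    # Otherwise route by format; unrecognized formats use the fallback.
    return ENGINE_BY_FORMAT.get(page.doc_format, FALLBACK_ENGINE)
```

In this sketch a clean printed page is routed to the traditional engine, while a degraded printed page or a page with an unrecognized format goes to the fallback. A real deployment would replace the lookup table and the fixed threshold with values derived from the evaluation results.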

