publications | A.D (David) Le

2026

Thesis

A System for Extracting Mathematical Expressions from Document Images (Master Thesis)

Anh Duy Le

May 2026

@misc{le2026thesis,
  title = {A System for Extracting Mathematical Expressions from Document Images (Master Thesis)},
  author = {Le, Anh Duy},
  month = may,
  year = {2026},
  publisher = {Zenodo},
  doi = {10.5281/zenodo.20427923},
  url = {https://doi.org/10.5281/zenodo.20427923},
}

WACV
CONSTANT: Towards High-Quality One-Shot Handwriting Generation with Patch Contrastive Enhancement and Style-Aware Quantization (Oral, Award Finalist)

Anh-Duy Le, Van-Linh Pham, Thanh-Nam Vo, Xuan Toan Mai, and Tuan-Anh Tran

In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026

One-shot styled handwriting image generation, despite impressive results in recent years, remains challenging because the intricate and diverse characteristics of human handwriting are difficult to capture from a single reference image. Existing methods still struggle to generate visually appealing, realistic handwritten images and to adapt to complex, unseen writer styles, as they struggle to isolate invariant style features (e.g., slant, stroke width, curvature) while ignoring irrelevant noise. To tackle this problem, we introduce Patch Contrastive Enhancement and Style-Aware Quantization via Denoising Diffusion (CONSTANT), a novel diffusion-based method for one-shot handwriting generation. CONSTANT leverages three key innovations: 1) a Style-Aware Quantization (SAQ) module that models style as discrete visual tokens capturing distinct concepts; 2) a contrastive objective that ensures these tokens are well separated and meaningful in the style embedding space; and 3) a latent patch-based contrastive objective ($L_{\text{LatentPCE}}$) that improves quality and local structure by aligning multiscale spatial patches of generated and real features in latent space. Extensive experiments and analysis on benchmark datasets in multiple languages, including English, Chinese, and our proposed ViHTGen dataset for Vietnamese, demonstrate that our method outperforms state-of-the-art approaches in adapting to new reference styles and producing highly detailed images. Code is available on GitHub.

@inproceedings{le2026constant,
  title = {CONSTANT: Towards High-Quality One-Shot Handwriting Generation with Patch Contrastive Enhancement and Style-Aware Quantization (Oral, Award Finalist)},
  author = {Le, Anh-Duy and Pham, Van-Linh and Vo, Thanh-Nam and Mai, Xuan Toan and Tran, Tuan-Anh},
  booktitle = {IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year = {2026},
  publisher = {IEEE/CVF},
}

2023

ICDAR
Formerge: Recover Spanning Cells in Complex Table Structure Using Transformer Network (Poster)

Nam Quan Nguyen, Anh Duy Le, Anh Khoa Lu, Xuan Toan Mai, and Tuan Anh Tran

In International Conference on Document Analysis and Recognition (ICDAR), 2023

Table structure recognition (TSR) is indispensable in a robust document analysis system. Recently, the split-and-merge approach has attracted many researchers working on the TSR problem. It is a two-stage method: first, split the table region along row and column separators to obtain the grid cells of the table; then, recover spanning cells by merging some of the grid cells to complete the table structure. Most recent proposals focus on the first stage, with few solutions for the merge task. Therefore, this paper proposes Formerge, a novel method that recovers spanning cells using Transformer networks. The model contains a Transformer encoder and two parallel left-right/top-down decoders. Given the grid structure output by a split branch, Formerge extracts cell features with RoIAlign and passes them through the encoder to enhance the features before decoding to detect spanning cells. Our technique outperforms other methods on two benchmark datasets, SciTSR and ICDAR19-cTDaR modern.

@inproceedings{nguyen2023formerge,
  title = {Formerge: Recover Spanning Cells in Complex Table Structure Using Transformer Network (Poster)},
  author = {Nguyen, Nam Quan and Le, Anh Duy and Lu, Anh Khoa and Mai, Xuan Toan and Tran, Tuan Anh},
  booktitle = {International Conference on Document Analysis and Recognition (ICDAR)},
  year = {2023},
  publisher = {Springer},
}

2022

DICTA
A Hybrid Vision Transformer Approach for Mathematical Expression Recognition (Oral)

Anh Duy Le, Van Linh Pham, Vinh Loi Ly, Nam Quan Nguyen, Huu Thang Nguyen, and Tuan Anh Tran

In International Conference on Digital Image Computing: Techniques and Applications (DICTA), 2022

One of the crucial challenges in document analysis is mathematical expression recognition. Unlike text recognition, which deals only with images of one-dimensional structure, mathematical expression recognition is a much more complicated problem because of its two-dimensional structure and varying symbol sizes. In this paper, we propose using a Hybrid Vision Transformer (HVT) with 2D positional encoding as the encoder to extract the complex relationships between symbols from the image. A coverage attention decoder is used to better track the attention history and handle the under-parsing and over-parsing problems. We also show the benefit of using the [CLS] token of ViT as the initial embedding of the decoder. Experiments on the IM2LATEX-100K dataset show the effectiveness of our method, which achieves a BLEU score of 89.94 and outperforms current state-of-the-art methods.

@inproceedings{le2022hybrid,
  title = {A Hybrid Vision Transformer Approach for Mathematical Expression Recognition (Oral)},
  author = {Le, Anh Duy and Pham, Van Linh and Ly, Vinh Loi and Nguyen, Nam Quan and Nguyen, Huu Thang and Tran, Tuan Anh},
  booktitle = {International Conference on Digital Image Computing: Techniques and Applications (DICTA)},
  pages = {1--7},
  year = {2022},
  publisher = {IEEE},
  doi = {10.1109/DICTA56598.2022.10034626},
}