A.D (David) Le

I am an AI researcher and engineer from Vietnam, currently working at the intersection of generative modeling, multimodal AI, and controllable visual generation.

My research journey began with visual text intelligence, especially OCR, handwriting recognition, and mathematical expression recognition. These problems taught me how difficult it is for AI systems to understand fine-grained visual patterns, spatial structure, and content constraints. More recently, my work has shifted toward generative modeling. In my latest research, I study one-shot handwriting generation with diffusion models, focusing on how to capture complex writer styles from a single reference image while preserving textual content and local visual details.

Beyond academic research, I have spent several years building AI systems in industry. At Viettel AI, I have worked on OCR, eKYC, document processing, handwriting-related problems, information extraction, and generative data synthesis for real-world applications. This experience has shaped my research style: I care not only about proposing new models, but also about building systems that are robust, scalable, and useful in practice.

I am now interested in broader questions in compositional and controllable generative AI. How can generative models compose multiple constraints at inference time? How can style, content, structure, and realism be represented modularly? How can diffusion models, energy-based models, and multimodal representations be combined to build more flexible generative systems?

My long-term goal is to contribute to the next generation of generative AI systems: models that are not only high-quality, but also controllable, interpretable, efficient, and useful across real-world domains.

news

Apr 29, 2026	Invited to submit an extended version of our WACV paper to the journal Machine Vision and Applications.
Jan 22, 2026	Our WACV 2026 paper CONSTANT was selected for an oral presentation.
Nov 11, 2025	One paper accepted at WACV 2026!

selected publications

Thesis

A System for Extracting Mathematical Expressions from Document Images (Master Thesis)

Anh Duy Le

May 2026

@misc{le2026thesis,
  title = {A System for Extracting Mathematical Expressions from Document Images (Master Thesis)},
  author = {Le, Anh Duy},
  month = may,
  year = {2026},
  publisher = {Zenodo},
  doi = {10.5281/zenodo.20427923},
  url = {https://doi.org/10.5281/zenodo.20427923},
}

WACV
CONSTANT: Towards High-Quality One-Shot Handwriting Generation with Patch Contrastive Enhancement and Style-Aware Quantization (Oral, Award Finalist)

Anh-Duy Le, Van-Linh Pham, Thanh-Nam Vo, and 2 more authors

In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026

One-shot styled handwriting image generation, despite impressive results in recent years, remains challenging because it is difficult to capture the intricate and diverse characteristics of human handwriting from a single reference image. Existing methods still struggle to generate visually appealing, realistic handwritten images and to adapt to complex, unseen writer styles, and they have difficulty isolating invariant style features (e.g., slant, stroke width, curvature) while ignoring irrelevant noise. To tackle this problem, we introduce Patch Contrastive Enhancement and Style-Aware Quantization via Denoising Diffusion (CONSTANT), a novel diffusion-based method for one-shot handwriting generation. CONSTANT leverages three key innovations: 1) a Style-Aware Quantization (SAQ) module that models style as discrete visual tokens capturing distinct concepts; 2) a contrastive objective that ensures these tokens are well separated and meaningful in the style embedding space; 3) a latent patch-based contrastive objective (L_LatentPCE) that improves quality and local structure by aligning multiscale spatial patches of generated and real features in latent space. Extensive experiments and analysis on benchmark datasets in multiple languages, including English, Chinese, and our proposed ViHTGen dataset for Vietnamese, show that our method adapts to new reference styles and produces highly detailed images better than state-of-the-art approaches. Code is available on GitHub.

@inproceedings{le2026constant,
  title = {CONSTANT: Towards High-Quality One-Shot Handwriting Generation with Patch Contrastive Enhancement and Style-Aware Quantization (Oral, Award Finalist)},
  author = {Le, Anh-Duy and Pham, Van-Linh and Vo, Thanh-Nam and Mai, Xuan Toan and Tran, Tuan-Anh},
  booktitle = {IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year = {2026},
  publisher = {IEEE/CVF},
}

ICDAR
Formerge: Recover Spanning Cells in Complex Table Structure Using Transformer Network (Poster)

Nam Quan Nguyen, Anh Duy Le, Anh Khoa Lu, and 2 more authors

In International Conference on Document Analysis and Recognition (ICDAR), 2023

The table structure recognition (TSR) task is indispensable in a robust document analysis system. Recently, the split-and-merge approach has attracted many researchers working on the TSR problem. It is a two-stage method: first, the table region is split by row/column separators to obtain the table's grid cells; then spanning cells are recovered by merging some grid cells to complete the table structure. Most recent proposals focus on the first stage, with few solutions for the merge task. Therefore, this paper proposes Formerge, a novel method that recovers spanning cells using Transformer networks. The model contains a Transformer encoder and two parallel left-right/top-down decoders. Given the grid structure output by a split branch, Formerge extracts cell features with RoIAlign and passes them to the encoder to enhance the features before decoding to detect spanning cells. Our technique outperforms other methods on two benchmark datasets, SciTSR and ICDAR19-cTDaR modern.

@inproceedings{nguyen2023formerge,
  title = {Formerge: Recover Spanning Cells in Complex Table Structure Using Transformer Network (Poster)},
  author = {Nguyen, Nam Quan and Le, Anh Duy and Lu, Anh Khoa and Mai, Xuan Toan and Tran, Tuan Anh},
  booktitle = {International Conference on Document Analysis and Recognition (ICDAR)},
  year = {2023},
  publisher = {Springer},
}

DICTA
A Hybrid Vision Transformer Approach for Mathematical Expression Recognition (Oral)

Anh Duy Le, Van Linh Pham, Vinh Loi Ly, and 3 more authors

In International Conference on Digital Image Computing: Techniques and Applications (DICTA), 2022

One of the crucial challenges in document analysis is mathematical expression recognition. Unlike text recognition, which deals only with images of one-dimensional structure, mathematical expression recognition is a much more complicated problem because of its two-dimensional structure and varying symbol sizes. In this paper, we propose using a Hybrid Vision Transformer (HVT) with 2D positional encoding as the encoder to extract the complex relationships between symbols from the image. A coverage attention decoder is used to better track the attention history and handle the under-parsing and over-parsing problems. We also show the benefit of using the [CLS] token of ViT as the initial embedding of the decoder. Experiments on the IM2LATEX-100K dataset show the effectiveness of our method, which achieves a BLEU score of 89.94 and outperforms current state-of-the-art methods.

@inproceedings{le2022hybrid,
  title = {A Hybrid Vision Transformer Approach for Mathematical Expression Recognition (Oral)},
  author = {Le, Anh Duy and Pham, Van Linh and Ly, Vinh Loi and Nguyen, Nam Quan and Nguyen, Huu Thang and Tran, Tuan Anh},
  booktitle = {International Conference on Digital Image Computing: Techniques and Applications (DICTA)},
  pages = {1--7},
  year = {2022},
  publisher = {IEEE},
  doi = {10.1109/DICTA56598.2022.10034626},
}