Invention Grant
- Patent Title: System and method for text normalization using atomic tokens
-
Application No.: US16542514Application Date: 2019-08-16
-
Publication No.: US10997964B2Publication Date: 2021-05-04
- Inventor: Ladan Golipour , Alistair D. Conkie
- Applicant: AT&T Intellectual Property I, L.P.
- Applicant Address: US GA Atlanta
- Assignee: AT&T Intellectual Property I, L.P.
- Current Assignee: AT&T Intellectual Property I, L.P.
- Current Assignee Address: US GA Atlanta
- Main IPC: G10L13/00
- IPC: G10L13/00 ; G10L13/10

Abstract:
A system, method and computer-readable storage devices are for normalizing text for ASR and TTS in a language-neutral way. The system described herein divides Unicode text into meaningful chunks called “atomic tokens.” The atomic tokens strongly correlate to their actual pronunciation, and not to their meaning. The system combines the tokenization with a data-driven classification scheme, followed by class-determined actions to convert text to normalized form. The classification labels are based on pronunciation, unlike alternative approaches that typically employ Named Entity-based categories. Thus, this approach is relatively simple to adapt to new languages. Non-experts can easily annotate training data because the tokens are based on pronunciation alone.
Information query