WO2014138756A1 - System and method for automatically adding diacritics to Vietnamese text - Google Patents

System and method for automatically adding diacritics to Vietnamese text

Info

Publication number
WO2014138756A1
Authority
WO
WIPO (PCT)
Prior art keywords
phrase
text
word
vocabulary
user
Prior art date
Application number
PCT/VN2013/000005
Other languages
English (en)
Inventor
Thi Mai Huong DANG
Viet Hai NGUYEN
Original Assignee
Dang Thi Mai Huong
Nguyen Viet Hai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dang Thi Mai Huong, Nguyen Viet Hai
Publication of WO2014138756A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation

Definitions

  • The present invention relates to diacritization of Vietnamese text, and more particularly to an automatic diacritization system and method that support typing and editing Vietnamese on a physical or virtual keyboard.
  • The Vietnamese alphabet is based on the Latin alphabet. However, in Vietnamese, each vowel may have from 6 to 18 variants using up to two levels of diacritical marks.
  • The letter a has 18 diacritical variants: a à á ả ã ạ ă ằ ắ ẳ ẵ ặ â ầ ấ ẩ ẫ ậ
  • Another important characteristic of the Vietnamese language is that a word may consist of one or more syllables separated from each other, which means that blank characters or spaces between syllables do not represent word boundaries as is customary in English and many other languages.
  • The multiple, separated-syllable nature of words in Vietnamese means that Vietnamese text written without diacritics is even more ambiguous and more prone to being misunderstood and misinterpreted.
  • This object is achieved by designing a user interface component that keeps track of the movement of the user's typing cursor to predict the user's intention based on the current and historical context: when the user is in typing mode or has just finished typing a word, a phrase or a sentence; and when the user is in editing mode or is about to correct a syllable.
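The ambiguity described above can be illustrated with a short Python sketch. It builds the 18 variants of the letter a from Unicode combining marks and shows that several distinct Vietnamese syllables collapse to the same undiacritized form. The function names `variants_of_a` and `strip_diacritics` are hypothetical and do not come from the patent; only the standard-library `unicodedata` module is used.

```python
import unicodedata

# Combining marks for "no tone" plus the five Vietnamese tone marks:
# grave, acute, hook above, tilde, dot below.
TONES = ["", "\u0300", "\u0301", "\u0309", "\u0303", "\u0323"]
# Combining marks that change the vowel itself: none, breve, circumflex.
SHAPES = ["", "\u0306", "\u0302"]


def variants_of_a():
    """Return the 18 precomposed variants of 'a' (3 vowel shapes x 6 tones)."""
    return [
        unicodedata.normalize("NFC", "a" + shape + tone)
        for shape in SHAPES
        for tone in TONES
    ]


def strip_diacritics(text):
    """Remove Vietnamese diacritics, e.g. 'Việt' -> 'Viet'."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # The letter d-with-stroke has no canonical decomposition, so map it by hand.
    return stripped.replace("đ", "d").replace("Đ", "D")


if __name__ == "__main__":
    print(" ".join(variants_of_a()))
    # Six different syllables share one undiacritized spelling.
    print({strip_diacritics(w) for w in ["ma", "má", "mà", "mả", "mã", "mạ"]})
```

Stripping is the easy direction; the system described here addresses the reverse direction, where one undiacritized syllable has many candidate restorations.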
  • The user vocabulary, used by the language model for automatic diacritization of text, can optionally be shared among users. As such, any improvement to the automatic diacritization benefits all users who share it.
  • The system allows users to manually edit or correct an incorrectly diacritized syllable using a popular Vietnamese typing method such as Telex, VNI or VIQR.
  • The system allows the user to tap on the incorrectly diacritized syllable and then choose the correct diacritized syllable from a pop-up list of word correction options.
  • A system and method for diacritization of text includes: detecting phrase ending characters entered by the user and diacritizing the most recently entered phrase by employing an optimization solver to search for the diacritized phrase with the highest score; detecting special characters entered by the user and adding, removing or changing diacritics of a word previously entered or diacritized by employing the TELEX, VNI or VIQR typing method; and building, updating and maintaining a vocabulary of phrases with scores.
  • A system and method for diacritization of text on an electronic device with a touch screen keyboard includes: detecting phrase ending characters entered by the user and diacritizing a previously entered phrase by employing a solver to search for the diacritized phrase with the highest score; detecting a touch event on a previously entered word to show a list of word correction options; and replacing the previously entered word with a user-selected correction option.
  • FIG. 1 is a diagram representation of the system for automatic
  • FIG. 2 is a diagram representation of the system for automatic
  • FIG. 3 is a block diagram showing the logic to determine when automatic diacritization of text is performed and when manual diacritization by the user is initiated.
  • FIG. 4 is a block diagram showing the logic for manually editing or correcting text with diacritics on a touch device according to the present invention.
  • FIG. 5 is a block diagram showing the steps to automatically diacritize text, according to the present invention.
  • The user interface event handler 100 is responsible for handling and processing all text editing and typing
  • The manual diacritizer 101 is invoked when the user presses a special key on the keyboard to manually add diacritics to the text.
  • The corresponding typing method among the three implemented typing methods, TELEX, VNI and VIQR, is chosen.
  • [weroasdfjxz] are the special keys for TELEX method
  • [0123456789] are for VNI
  • ⁇ ⁇ (+ ⁇ ' /?] are for VIQR.
  • the auto diacritizer 102 may be invoked to decide if the recently entered phrase needs to be diacritized. After analyzing the context, auto diacritizer 102 may invoke the solver 104 to search in the vocabulary 106 for words or phrases matching user input, generate candidate phrases and select the solution with the highest score.
  • The logic for auto diacritization of text in the solver 104 is described in detail later with reference to FIG. 5.
  • the solver 104 can be configured by using the data manager 103.
  • Solver configuration options may include the solver type, for example either exact or approximate maximum matching as evidenced by the highest score, or the type of language model used, for example either a 3-gram or a 5-gram language model.
  • The data manager 103 is also responsible for maintaining the text corpus 107. Through the data manager 103, the functionalities of the score calculator 105 can be invoked to train the language model that is implemented by the vocabulary 106.
  • The score calculator 105 is responsible for updating the language model and calculating, for words and phrases, the corresponding scores, which are based on statistical characteristics such as word frequency counts and n-gram probabilities derived from the text corpus 107.
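As a rough illustration of how scores could be derived from a text corpus, the following Python sketch counts word frequencies and bigrams and turns them into smoothed log-probabilities. It is a simplification under stated assumptions: the description mentions 3-gram and 5-gram models, whereas this sketch uses bigrams for brevity, and the function names, the add-alpha smoothing and the toy corpus are all hypothetical.

```python
import math
from collections import Counter


def train_scores(corpus):
    """Count unigrams and bigrams in a corpus given as lists of tokens."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    return unigrams, bigrams


def bigram_log_prob(prev, word, unigrams, bigrams, alpha=1.0):
    """Add-alpha smoothed log P(word | prev); smoothing is an assumption."""
    vocab_size = len(unigrams)
    numerator = bigrams[(prev, word)] + alpha
    denominator = unigrams[prev] + alpha * vocab_size
    return math.log(numerator / denominator)


if __name__ == "__main__":
    # Toy corpus, already tokenized into syllables (illustrative only).
    corpus = [
        ["tôi", "đi", "học"],
        ["tôi", "đi", "làm"],
        ["tôi", "di", "chuyển"],
    ]
    uni, bi = train_scores(corpus)
    print(uni["tôi"], bi[("tôi", "đi")])
    print(bigram_log_prob("tôi", "đi", uni, bi) > bigram_log_prob("tôi", "di", uni, bi))
```

Retraining after the corpus changes amounts to calling `train_scores` again, which mirrors the role the data manager plays in triggering the score calculator.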
  • FIG. 2 shows a diagram representing an alternative embodiment of the system for auto diacritization of text as implemented on touch devices, for example a mobile phone.
  • a user interacts with the device via a keyboard and the result of that interaction is shown on a display.
  • a user may express an intention to edit an incorrectly diacritized portion of text in the touch screen by tapping or touching that portion of the text.
  • the keyboard and display event handler 200 is responsible for handling all user interactions with the touch device.
  • the touch editor 201 and the auto diacritizer 202 are responsible for processing all text editing and typing activities handled by the keyboard and display event handler 200.
  • The logic for manual editing of diacritics in the touch editor 201 is described in detail later with reference to FIG. 4.
  • the auto diacritizer 202 is generally invoked when user enters a phrase ending character.
  • the auto diacritizer 202 then analyzes the context to decide if the recently entered phrase needs diacritization. After that the auto diacritizer 202 may invoke the solver 204 to search in the vocabulary 205 for words or phrases matching user input, generate candidate phrases and select the phrase with the highest score as the solution.
  • The solver 204 can be configured by the user of the touch device using the settings 203. The user can also use the settings 203 to manually update the vocabulary 205.
  • FIG. 3 provides a block diagram showing the logic to determine when automatic diacritization of a text phrase and when manual diacritization should be performed.
  • User interactions with the device are first looked at by the user interface event handler 100.
  • In step 301, the user input is tested to determine whether it is a key press event. If it is not, control is returned to the user interface event handler 100.
  • In step 302, the input character is tested to determine whether it is a phrase ending character. If the input character is a phrase ending character, then in step 303 the recently entered text is analyzed to determine whether diacritics are needed for any portion of the phrase. If the answer is positive, the backend process for auto diacritization of the phrase is initiated in the next step 305 and control is then returned to the user interface event handler 100. If in the test 303 it is determined that no
  • If in step 302 the input character is not a phrase ending character, it is tested again in step 304 to determine whether it is one of the special characters used by one of the manual text typing methods listed, TELEX, VNI or VIQR, to add, change or remove diacritics. Further context analysis may also be done to determine that the adjacent text is not a foreign language word and potentially needs diacritics. If the test 304 is positive, the manual diacritizer is invoked in the next step 306. If in step 304 it is determined that the input character is not related to any of the manual typing methods, or that the adjacent text is a foreign word, control is returned to the user interface event handler 100.
  • The automatic diacritization of a phrase initiated in step 305 may be performed while the user is typing. Only when the diacritized text is available, in step 307, is the corresponding undiacritized text in the text area of the display replaced, in step 308, by the diacritized text returned from the solver 104 in FIG. 1.
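The decision logic of FIG. 3 can be sketched as a single dispatch function. This is a minimal illustration, not the patented implementation: the set of phrase ending characters is an assumption (the description does not enumerate them), the VIQR key set is omitted, and the outcome when test 303 is negative is assumed to be a return to the event handler. The name `dispatch` and the string labels it returns are hypothetical.

```python
# Special keys per typing method, as listed in the description.
SPECIAL_KEYS = {
    "TELEX": set("weroasdfjxz"),
    "VNI": set("0123456789"),
}

# Assumption: the description does not list the phrase ending characters.
PHRASE_ENDINGS = set(".,;:!?\n")


def dispatch(event_type, char, method="TELEX",
             needs_diacritics=True, adjacent_is_foreign=False):
    """Decide what to do with one user input event (steps 301 to 306)."""
    if event_type != "keypress":                      # step 301
        return "return_to_handler"
    if char in PHRASE_ENDINGS:                        # step 302
        if needs_diacritics:                          # step 303
            return "auto_diacritize"                  # step 305
        return "return_to_handler"                    # assumed outcome
    special = SPECIAL_KEYS.get(method, set())
    if char in special and not adjacent_is_foreign:   # step 304
        return "manual_diacritize"                    # step 306
    return "return_to_handler"


if __name__ == "__main__":
    print(dispatch("keypress", "."))   # phrase finished
    print(dispatch("keypress", "s"))   # TELEX tone key
    print(dispatch("keypress", "b"))   # ordinary letter
```

In a real input method, `needs_diacritics` and `adjacent_is_foreign` would come from the context analysis the description mentions; here they are plain parameters so the control flow stays visible.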
  • JavaScript and XML (AJAX), a modern web technology for building fluid user interface in web applications. Instead of using XML (Extensible Markup
  • JavaScript Object Notation (JSON) can also be used for transmitting the diacritized text result between the solver, shown as block 104 on the server side, and the auto diacritizer 102, which is responsible for delivering the diacritized text to the user interface event handler 100.
  • the use of AJAX and JSON may not be necessary if all the system components reside in the same device.
  • the diacritized text result can also be transmitted synchronously between the solver 204 and the auto diacritizer 202.
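A JSON exchange of the kind mentioned above might look like the following Python sketch. The payload field names (`phrase_id`, `original`, `diacritized`, `score`) and both function names are hypothetical; the description does not specify a message format. The second function mirrors step 308 by replacing the undiacritized phrase once the result arrives.

```python
import json


def encode_result(phrase_id, original, diacritized, score):
    """Serialize a solver result; all field names are illustrative."""
    payload = {
        "phrase_id": phrase_id,
        "original": original,
        "diacritized": diacritized,
        "score": score,
    }
    # ensure_ascii=False keeps Vietnamese characters readable in the payload.
    return json.dumps(payload, ensure_ascii=False)


def apply_result(text, payload):
    """Replace the first occurrence of the undiacritized phrase in the text."""
    result = json.loads(payload)
    return text.replace(result["original"], result["diacritized"], 1)


if __name__ == "__main__":
    message = encode_result(1, "toi di hoc", "tôi đi học", 0.9)
    print(message)
    print(apply_result("toi di hoc. ", message))
```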
  • Users of a touch device may want to correct a word or a syllable that is missing diacritics or is incorrectly diacritized.
  • To edit an incorrectly diacritized syllable, the user may tap or touch that syllable, and a list of possible options for correcting the diacritized syllable is shown. The user can then choose one of the 3 top-ranked options with the highest scores. If the option the user wants is not among these 3 options, the user can tap to choose from the other available options.
  • FIG. 4 provides a block diagram showing the logic for manually editing or correcting text with diacritics, on a touch device, by the touch editor 201, according to the present invention.
  • the software component handling all touch events is again depicted as the keyboard and display event handler 200, which may alternatively be understood as an infinite loop checking if a user touch event is taking place.
  • the manual editing process is started only when the test 401 is positive, typically when user taps or touches the text area.
  • A syllable s may be identified as the object of the touch event; the phrase P enclosing s is considered the context for the event.
  • In step 403, the function suggest word calculates and returns a sorted list of correction options for s.
  • The option list is then shown to the user in step 406 as a popup menu, with the 3 most probable options at the top of the list and other options, if any exist, available via a touch on the list.
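The suggestion step can be sketched in Python as a lookup followed by a sort on score. This is a simplified illustration: the vocabulary layout (undiacritized syllable mapped to scored diacritized options), the scores, and the helper `split_for_popup` are assumptions, and the sketch ignores the enclosing phrase P that the description treats as context.

```python
def suggest_word(syllable, vocabulary):
    """Return diacritized options for a syllable, highest score first."""
    options = vocabulary.get(syllable, {})
    return sorted(options, key=options.get, reverse=True)


def split_for_popup(options, top_n=3):
    """Split options into those shown at the top and those behind a tap."""
    return options[:top_n], options[top_n:]


if __name__ == "__main__":
    # Illustrative scores only; not taken from any real corpus.
    vocabulary = {
        "ma": {"mà": 0.40, "má": 0.20, "ma": 0.15, "mã": 0.10, "mạ": 0.05},
    }
    top, more = split_for_popup(suggest_word("ma", vocabulary))
    print(top)
    print(more)
```

A context-aware version would rescore each option against the neighbouring words of P before sorting, for instance with the n-gram scores held in the vocabulary.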
  • FIG. 5 is a block diagram showing the steps to automatically diacritize text, according to the present invention. These are the steps implemented in the solver 104 in FIG. 1.
  • In step 500, the input undiacritized text, a phrase, is tokenized into syllables.
  • In step 501, for each syllable si in the input text, a search is performed for all words or phrases in the system vocabulary 106 whose undiacritized form matches si.
  • In step 502, a list of all possible phrases is generated, based on the list of matching words for each syllable si and the order of the syllables si in the input text.
  • In step 503, the list of possible phrases is filtered and only grammatically valid phrases are kept. Note that each matching word is associated with a score, and the score of a phrase may be calculated as the sum or product of the scores of its components. These candidate phrases are fed into an optimization solver in step 504, which calculates and returns the phrase with the maximal score.
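Steps 500 to 504 can be sketched end to end in Python. This is a minimal illustration under several assumptions: the vocabulary maps single undiacritized syllables to scored options (the description also allows multi-syllable words and phrases), the grammatical filter of step 503 is omitted, the phrase score is the product of per-syllable scores, and the solver is a brute-force search over all combinations. The vocabulary contents and the name `diacritize` are hypothetical.

```python
import math
from itertools import product

# Illustrative vocabulary: undiacritized syllable -> {diacritized form: score}.
VOCAB = {
    "toi": {"tôi": 0.6, "tối": 0.3, "tới": 0.1},
    "di": {"đi": 0.8, "di": 0.2},
    "hoc": {"học": 0.9, "hóc": 0.1},
}


def diacritize(phrase, vocab=VOCAB):
    """Return the highest-scoring diacritized version of a phrase."""
    syllables = phrase.split()                              # step 500
    # Unknown syllables are kept unchanged with a neutral score.
    matches = [vocab.get(s, {s: 1.0}) for s in syllables]   # step 501
    candidates = product(*(m.items() for m in matches))     # step 502
    # Step 503 (grammatical filtering) is omitted in this sketch.

    def score(candidate):
        return math.prod(p for _, p in candidate)

    best = max(candidates, key=score)                       # step 504
    return " ".join(word for word, _ in best)


if __name__ == "__main__":
    print(diacritize("toi di hoc"))
```

Enumerating every combination grows exponentially with phrase length, so a practical solver would more likely use dynamic programming or beam search over n-gram scores; the brute-force search is kept here only because it maps directly onto the listed steps.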

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Input From Keyboards Or The Like (AREA)

Abstract

The invention relates to systems and methods for automatically adding diacritics to Vietnamese text entered using a physical or virtual computer keyboard. According to certain embodiments, the invention provides a method for automatically adding diacritics to Vietnamese text, the method comprising: detecting a phrase ending character and automatically adding diacritics to a previously entered phrase while the user can continue entering other phrases; and detecting a special character or, on a virtual computer keyboard, a touch event on a previously entered word, to allow diacritics to be added manually to Vietnamese text.
PCT/VN2013/000005 2013-03-07 2013-04-12 System and method for automatically adding diacritics to Vietnamese text WO2014138756A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
VN1-2013-00719 2013-03-07
VN201300719 2013-03-07

Publications (1)

Publication Number Publication Date
WO2014138756A1 (fr) 2014-09-12

Family

ID=48446707

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/VN2013/000005 WO2014138756A1 (fr) System and method for automatically adding diacritics to Vietnamese text 2013-03-07 2013-04-12

Country Status (1)

Country Link
WO (1) WO2014138756A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220188515A1 (en) * 2019-03-27 2022-06-16 Qatar Foundation For Education, Science And Community Development Method and system for diacritizing arabic text

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080077396A1 (en) * 2006-09-27 2008-03-27 Wen-Lian Hsu Typing Candidate Generating Method for Enhancing Typing Efficiency
US20110087484A1 (en) * 2009-10-08 2011-04-14 Electronics And Telecommunications Research Institute Apparatus and method for detecting sentence boundaries
US20130006613A1 (en) * 2010-02-01 2013-01-03 Ginger Software, Inc. Automatic context sensitive language correction using an internet corpus particularly for small keyboard devices
US20130050098A1 (en) * 2011-08-31 2013-02-28 Nokia Corporation User input of diacritical characters

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
MINH TRUNG NGUYEN ET AL: "Vietnamese Diacritics Restoration as Sequential Tagging", COMPUTING AND COMMUNICATION TECHNOLOGIES, RESEARCH, INNOVATION, AND VISION FOR THE FUTURE (RIVF), 2012 IEEE RIVF INTERNATIONAL CONFERENCE ON, IEEE, 27 February 2012 (2012-02-27), pages 1 - 6, XP032138192, ISBN: 978-1-4673-0307-1, DOI: 10.1109/RIVF.2012.6169816 *
MOUSTAFA ELSHAFEI ET AL: "Statistical Methods for Automatic Diacritization of Arabic Text", 18 November 2000 (2000-11-18), pages 1 - 8, XP002624239, Retrieved from the Internet <URL:http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.128.3763> [retrieved on 20110221] *
RUHI SARIKAYA ET AL: "Maximum Entropy Modeling for Diacritization of Arabic Text", INTERSPEECH 2006 -INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING) INTERSPEECH 2006 - ICSLP : NINTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING ; PITTSBURGH, PENNSYLVANIA, USA, SEPTEMBER 17 - 21, 2006 / ISCA, BONN : ISCA, 2006, DE, 17 September 2006 (2006-09-17), pages 145 - 148, XP008163575 *
TUAN ANH LUU ET AL: "A Pointwise Approach for Vietnamese Diacritics Restoration", ASIAN LANGUAGE PROCESSING (IALP), 2012 INTERNATIONAL CONFERENCE ON, IEEE, 13 November 2012 (2012-11-13), pages 189 - 192, XP032339752, ISBN: 978-1-4673-6113-2, DOI: 10.1109/IALP.2012.18 *

Similar Documents

Publication Publication Date Title
Fowler et al. Effects of language modeling and its personalization on touchscreen typing performance
US9009030B2 (en) Method and system for facilitating text input
US7895030B2 (en) Visualization method for machine translation
US9542385B2 (en) Incremental multi-word recognition
US20190012076A1 (en) Typing assistance for editing
US10762293B2 (en) Using parts-of-speech tagging and named entity recognition for spelling correction
CN103026318B (zh) Input method editor
CN102439540B (zh) Input method editor
US7707515B2 (en) Digital user interface for inputting Indic scripts
US20080133444A1 (en) Web-based collocation error proofing
US20130321267A1 (en) Dynamically changing a character associated with a key of a keyboard
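JP5228633B2 (ja) Electronic dictionary device and program
JP6404511B2 (ja) Translation support system, translation support method, and translation support program
CN104850543A (zh) Voice dialogue support device and voice dialogue support method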
US20140380169A1 (en) Language input method editor to disambiguate ambiguous phrases via diacriticization
Sharma et al. Word prediction system for text entry in Hindi
US11899904B2 (en) Text input system with correction facility
Kumar et al. Design and implementation of nlp-based spell checker for the tamil language
Alharbi et al. The effects of predictive features of mobile keyboards on text entry speed and errors
Sarcar et al. Eyeboard++ an enhanced eye gaze-based text entry system in Hindi
WO2014138756A1 (fr) System and method for automatically adding diacritics to Vietnamese text
KR101582155B1 (ko) Character input method and system allowing easy character correction, and recording medium and file distribution system
Herbig et al. Improving the multi-modal post-editing (MMPE) CAT environment based on professional translators’ feedback
JP2019215936A (ja) Automatic translation device and automatic translation program
Toba et al. Efficient Speech-Recognition Error-Correction Interface for Japanese Text Entry on Smartwatches

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13723394

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13723394

Country of ref document: EP

Kind code of ref document: A1