WO2022060439A1 - Language autodetection from non-character sub-token signals - Google Patents

Language autodetection from non-character sub-token signals

Info

Publication number
WO2022060439A1
Authority
WO
WIPO (PCT)
Prior art keywords
language
score
corpus
word
text string
Prior art date
Application number
PCT/US2021/035563
Other languages
English (en)
French (fr)
Inventor
Andrew Stuart Glass
Margaret Hope Magnus
Roland Radtke
Original Assignee
Microsoft Technology Licensing, LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.): 2020-09-17
Filing date: 2021-06-03
Publication date: 2022-03-24
Application filed by Microsoft Technology Licensing, LLC
Priority to CN202180063398.1A (published as CN116194925A)
Publication of WO2022060439A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/263: Language identification

Definitions

  • Text input on computing devices increasingly depends on language-specific processing to refine and respond to user intent. Such processing depends on a core assumption that the language of the text being entered is known. These systems perform poorly when the assumed language does not match the entered text. To address this, systems may use a pre-processing step to identify the language of the incoming text strings.
  • In examples, one or more of the language detection models may be applied to the text string.
  • A match score between the text string and each language corresponding to an applied language detection model may be determined based on the prefixes and suffixes included in the words of the text string and on the syllables included in those words, where a syllable is defined as an optional legal initial consonant sequence as defined in the model, followed by an obligatory legal vowel sequence as defined by the model, followed by an optional legal final consonant sequence as defined in the model (a syllabification sketch in this sense follows this list).
  • A legal word or stem is one that consists solely of a contiguous sequence of legal syllables.
  • FIG. 2 is a schematic diagram of a computing environment illustrating the training of a language detection model.
  • FIG. 3 illustrates a computing environment for the processing of an exemplary word from a corpus by a plurality of processing engines encompassed in a language detection training engine.
  • FIG. 4 illustrates various components of a language detection model.
  • FIGS. 6 and 7 are simplified diagrams of a mobile computing device with which aspects of the disclosure may be practiced.
  • FIG. 8 is a block diagram illustrating example physical components of a computing device with which aspects of the disclosure may be practiced.
  • FIG. 9 is a simplified block diagram of a distributed computing system in which aspects of the present disclosure may be practiced.
  • Non-limiting examples of the present disclosure describe systems, methods and devices for determining a language of a text string input into a construct of a computing device.
  • The construct that the text string is input into may comprise an operating system shell construct, an application construct, or an application service (e.g., cloud-based service) construct.
  • An indication to analyze text utilizing the language detection models described herein may be received by a language detection service.
  • The language detection service may be incorporated and executed on a local computing device and/or in the cloud.
  • The indication may comprise a determination that a text input has not previously been received by a user account in a construct where a text string input is currently being received, and/or that a user account has not set up a preferred language for a computing device, application, or service.
  • The language detection service described herein may simply apply the language detection models periodically, or whenever a text input is received in one or more computing constructs.
  • The language detection service may apply one or more of the language detection models to the text string and calculate a match score for each word in the string against each language corresponding to the models that are applied.
  • The scores for each word may be summed or otherwise functionally processed to generate a total match score for the text string and the language (a scoring sketch in this style follows this list).
  • If a determination is made that a match score between a text string and a language meets or exceeds a threshold value, a determination may be made that the text string is definitely in that language, possibly/probably in that language, or definitely not in that language.
  • The language detection service may determine that the language associated with the highest ranked match score from amongst a plurality of scores from a plurality of language detection models is the language of the text string.
  • Various follow-up actions may be performed based on a determination that a text string is probably in a specific language.
  • For example, a language processing model (e.g., an intent determination processing model, a spellcheck processing model, a grammar check processing model, etc.) that is specific to the determined language may be applied to the text string.
  • The type of language processing model that is applied in the specific language may be determined based on the application or computing shell construct that the text string was received in.
  • The follow-up action may comprise downloading one or more linguistic libraries or models in the specific language from a cloud database to a local computing device on which the text string was received.
  • A language detection service, which is generically illustrated in language detection models sub-environment 116, may be hosted and executed in the cloud (e.g., by server computing device 112), and/or the language detection service may be hosted and executed by a local computing device (e.g., computing device 102).
  • One of the technical advantages of the language detection models described herein is that they are sufficiently small (e.g., in terms of memory requirements and storage footprint) that they can easily reside on the limited storage provided by local computing devices, and they therefore need not necessarily be executed in the cloud.
  • Language detection models sub-environment 116 includes language detection module 124, and follow-up action module 126.
  • The elements described in relation to language detection models sub-environment 116 may be encompassed in the language detection service.
  • Language detection module 124 may comprise one or more processing engines that are applied to input text strings, in association with language model data (e.g., from language model data store 120), to determine the language of the input text strings. Additional details regarding the application of language models to text strings are described below in relation to FIG. 4.
  • Follow-up action module 126 may cause one or more language libraries or language models (in language X) from the full suite of language assets 114 to be downloaded to computing device 102 from the cloud.
  • A local computing device need only download and store language libraries and language models that are likely to be utilized by users of the local computing device.
  • As illustrated, a user has entered text string 106 on canvas 104 of application 103.
  • FIG. 2 is a schematic diagram of a computing environment 200 illustrating the training of a language detection model.
  • Computing environment 200 includes corpus 202, affix detection training engine 204, syllabifier token detection training engine 212, final weighting engine 234, and final weighted individual language model 250.
  • Affix detection training engine 204 includes suffix training engine 208, prefix training engine 210 and manual review of the preliminary affix lists extracted from corpus 202.
  • Syllabifier token detection training engine 212 includes word-initial consonants 222, vowels and vowel sequences 224 and word-final consonants 226.
  • Stem- or word-initial and stem- or word-final consonant clusters may be used to determine syllable-initial and syllable-final consonant clusters for that language, because it is rare for a word-internal syllable to end in a consonant cluster that cannot also end a word.
  • The resulting word stem (e.g., the characters minus the identified/extracted prefix and/or suffix) must be at least a threshold number of characters long and include at least one vowel. If a resulting word stem is determined not to be at least the threshold number of characters long and include at least one vowel, that word and/or the prefix or suffix that has been identified for that specific word may be rejected from the training process and shorter prefixes/suffixes tested (an affix-extraction sketch illustrating this check follows this list).
  • Consonant and vowel sequence candidates may be manually reviewed, as illustrated by consonant and vowel sequence review 232. That is, a person familiar with the language may manually review the unique consonant and vowel sequences and discard uncommon ones and any that result from noise in the training data (e.g., proper nouns, foreign nouns).
  • Vowel sequences that span a syllable boundary may be split across syllables (e.g., "ayo" in "mayor" may be split into "ay" and "o").
  • The training may determine that "re-" is a prefix but then discard it, because "nd" is not in the list of consonant sequences that can legally begin an English word. This step may also be utilized to avoid falsely counting "-ion" as a suffix, as in "lion".
  • An additional step may be performed when processing the suffixes. If a suffix begins with a vowel and the preceding consonant sequence cannot end the word, a determination may be made as to whether there is a legal final/initial consonant combination. For example, the suffix "-ation" (as in "amalgamation") is found when parsing the word "filtration". After stripping "-ation", the remaining substring is "filtr", whose trailing consonant sequence cannot legally end an English word, so the check looks for a legal final/initial split (a sketch of this check follows this list).
  • If final weighting engine 234 cannot build a legal word utilizing the steps described above (e.g., the word does not have any vowels, or starts with a cluster that is illegal in English, like "kjenne"), the word may be ignored for training purposes.
  • Prefix sequence identification engine 310 is applied to exemplary subword 302D, "antidisestablish", resulting in the identification of prefixes 311 ("anti" and "dis"), which are stripped, leaving the remaining characters 312, which form the stem 302F ("establish").
  • Each of the tokens identified by the engines illustrated in FIG. 3 may be added to a token list in the language model (e.g., in final weighted individual language model 250) and have their weights normalized once the engines have been applied to the remaining words in corpus 202.
  • Weighted prefixes and prefix sequences 404 include the character strings that were identified via prefix training engine 210, and which may have had their weights adjusted via application of one or more operations associated with prefix review 230.
  • Weighted suffixes and suffix sequences 406 include character strings that were identified via suffix training engine 208, and which may have had their weights adjusted via application of one or more operations associated with suffix review 228.
  • Weighted legal initial consonant clusters 408 include initial consonant cluster strings that were identified via syllabifier token detection training engine 212, and which may have had their weights adjusted via application of one or more operations associated with consonant and vowel sequence review 232.
  • Weighted legal vowel sequences 412 include vowels and vowel sequences that were identified via syllabifier token detection training engine 212, and which may have had their weights adjusted via application of one or more operations associated with consonant and vowel sequence review 232.
  • One or more language detection models, such as language detection model 402, for one or more languages may be applied to the text string.
  • A score for the string for the candidate language may be obtained based on the fit of the string to the frequencies of prefixes, suffixes, and syllables in language detection model 402. The presence of syllables not occurring in the model strongly indicates that the string is not a match with the language of the model.
  • A string may be tested against multiple candidate language models, and the scores for each model may be compared to obtain a confidence score for the language of the string (a model-comparison sketch follows this list).
  • FIG. 5A is a method 500A for determining a language of a text string based on application of a single syllable-based language detection model and performing a follow-up action based on the determining.
  • The method 500A begins at a start operation and flow moves to operation 502A.
  • A language detection model for a first language is maintained.
  • The language detection model may comprise a first list comprising identities of a plurality of prefixes from a corpus of the first language, and weights for each of the plurality of prefixes.
  • The language detection model may further comprise a second list comprising identities of a plurality of suffixes from the corpus, and weights for each of the plurality of suffixes.
  • The weights may correspond to a frequency of the prefixes and suffixes in the corpus (a sketch of such a weighted-list model structure follows this list).
  • A follow-up action is performed based on the determination that the language match score meets the threshold value.
  • The follow-up action may comprise applying a language processing engine that is specific to the first language to the text string.
  • Performing the follow-up action may comprise downloading a language package library for the first language to a computing device that the text string was initially input to.
  • The language package library for the first language may comprise an embeddings library (e.g., a BERT library, an ELMo library).
  • A preprocessing step may quickly accept a word because it is in the common word list, or may reject the word because it does not contain a vowel.
  • FIG. 5D is a method 500D for determining a language of a text string based on application of a plurality of language detection models and performing a follow-up action based on the determining.
  • The method 500D begins at a start operation and flow moves to operation 502D.
  • The first language detection model may further comprise a list of prefixes and suffixes from the corpus of the first language and weights for each of those prefixes and suffixes.
  • The first language detection model may further comprise a list of common words from the corpus of the first language and weights for each of those common words. The weights may correspond to a frequency of each of the text units (tokens) in the corpus of the first language.
  • Communication media may be embodied by computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and includes any information delivery media.
  • The term "modulated data signal" may describe a signal that has one or more characteristics set or changed in such a manner as to encode information in the signal.
  • Communication media may include wired media, such as a wired network or direct-wired connection, and wireless media, such as acoustic, radio frequency (RF), infrared, and other wireless media.
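
The sketches below are illustrative only and do not reproduce the trained models described above. First, a minimal syllabification sketch in the sense defined earlier: a word is accepted only if it decomposes into syllables consisting of an optional legal initial consonant sequence, an obligatory legal vowel sequence, and an optional legal final consonant sequence. The cluster and vowel-sequence inventories are small hypothetical samples, not lists produced by the training described in this disclosure.

```python
# A minimal syllabification sketch with hypothetical English-like inventories.
# The real lists and their weights would come from corpus training.

VOWELS = set("aeiou")

LEGAL_INITIAL = {"", "b", "c", "d", "f", "g", "h", "l", "m", "n", "p", "r",
                 "s", "t", "v", "w", "st", "str", "tr", "pr", "pl", "fl", "br"}
LEGAL_VOWEL_SEQ = {"a", "e", "i", "o", "u", "ai", "ea", "ee", "oo", "ou"}
LEGAL_FINAL = {"", "b", "d", "g", "k", "l", "m", "n", "p", "r", "s", "t",
               "x", "st", "nd", "nt", "ll", "sh", "ck"}


def _runs(word):
    """Split a word into alternating consonant/vowel runs."""
    runs, kind = [], None
    for ch in word:
        k = "v" if ch in VOWELS else "c"
        if k == kind:
            runs[-1] = (k, runs[-1][1] + ch)
        else:
            runs.append((k, ch))
            kind = k
    return runs


def syllabify(word):
    """Return a list of syllables, or None if the word cannot be built from
    legal (initial cluster, vowel sequence, final cluster) syllables."""
    runs = _runs(word.lower())
    if not any(k == "v" for k, _ in runs):
        return None                       # a legal word must contain a vowel
    syllables, onset = [], ""
    for i, (kind, text) in enumerate(runs):
        if kind == "c":
            if i == 0:                    # word-initial cluster is an onset
                if text not in LEGAL_INITIAL:
                    return None
                onset = text
            elif i == len(runs) - 1:      # word-final cluster closes the last syllable
                if text not in LEGAL_FINAL:
                    return None
                syllables[-1] += text
            else:                         # internal cluster: split into coda | onset
                for cut in range(len(text) + 1):
                    if text[:cut] in LEGAL_FINAL and text[cut:] in LEGAL_INITIAL:
                        syllables[-1] += text[:cut]
                        onset = text[cut:]
                        break
                else:
                    return None
        else:                             # a vowel run starts a new syllable
            if text not in LEGAL_VOWEL_SEQ:
                return None
            syllables.append(onset + text)
            onset = ""
    return syllables


if __name__ == "__main__":
    print(syllabify("astound"))   # ['a', 'stound'] with these toy lists
    print(syllabify("kjenne"))    # None: 'kj' is not a legal initial cluster
```

A return value of None corresponds to the case described above in which a sub-token signal absent from the model strongly indicates a non-match.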
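
Next, a rough sketch of the affix-stripping check used during training: a candidate prefix or suffix is accepted for a corpus word only if the remaining stem is at least a threshold number of characters long and contains a vowel; otherwise shorter affixes are tried. The candidate lists and the threshold below are hypothetical.

```python
# Hypothetical candidate affix lists; in the described training these would be
# preliminary lists extracted from the corpus and then manually reviewed.
CANDIDATE_PREFIXES = ["anti", "dis", "re", "un"]
CANDIDATE_SUFFIXES = ["ation", "ion", "ing", "ed", "s"]

MIN_STEM_LEN = 3                 # assumed threshold for illustration
VOWELS = set("aeiouy")


def has_vowel(s):
    return any(ch in VOWELS for ch in s)


def strip_affixes(word):
    """Return (prefixes, stem, suffixes) for one corpus word, rejecting any
    affix whose removal would leave a stem that is too short or vowel-less."""
    prefixes, suffixes, stem = [], [], word.lower()

    changed = True
    while changed:                                    # strip prefixes, longest first
        changed = False
        for p in sorted(CANDIDATE_PREFIXES, key=len, reverse=True):
            rest = stem[len(p):]
            if stem.startswith(p) and len(rest) >= MIN_STEM_LEN and has_vowel(rest):
                prefixes.append(p)
                stem = rest
                changed = True
                break

    changed = True
    while changed:                                    # strip suffixes, longest first
        changed = False
        for s in sorted(CANDIDATE_SUFFIXES, key=len, reverse=True):
            rest = stem[: len(stem) - len(s)]
            if stem.endswith(s) and len(rest) >= MIN_STEM_LEN and has_vowel(rest):
                suffixes.insert(0, s)
                stem = rest
                changed = True
                break

    return prefixes, stem, suffixes


if __name__ == "__main__":
    # Mirrors the "antidisestablish" example: prefixes are stripped to a stem.
    print(strip_affixes("antidisestablish"))   # (['anti', 'dis'], 'establish', [])
    print(strip_affixes("lion"))               # ([], 'lion', []): '-ion' is rejected
                                               # because the stem 'l' is too short
                                               # and contains no vowel
```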
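
A sketch of the additional suffix check: when a vowel-initial suffix such as "-ation" is stripped and the stem's trailing consonant sequence (e.g., "ltr" in "filtr") cannot legally end a word, the sequence is tested for a split into a legal word-final cluster followed by a legal initial cluster. The inventories here are hypothetical samples.

```python
VOWELS = set("aeiou")

# Hypothetical sample inventories; the trained model would supply these.
LEGAL_FINAL = {"", "l", "n", "r", "t", "st", "nd", "nt", "ll"}
LEGAL_INITIAL = {"", "b", "d", "f", "l", "m", "n", "p", "r", "s", "t",
                 "tr", "str", "pr", "fl"}


def trailing_consonants(stem):
    """Return the maximal run of consonants at the end of the stem."""
    i = len(stem)
    while i > 0 and stem[i - 1] not in VOWELS:
        i -= 1
    return stem[i:]


def suffix_boundary_ok(stem, suffix):
    """Accept a vowel-initial suffix only if the stem's trailing consonant
    sequence can legally end a word, or can be split into a legal final
    cluster followed by a legal initial cluster."""
    if not suffix or suffix[0] not in VOWELS:
        return True                 # check only applies to vowel-initial suffixes
    cluster = trailing_consonants(stem)
    if cluster in LEGAL_FINAL:
        return True
    return any(cluster[:cut] in LEGAL_FINAL and cluster[cut:] in LEGAL_INITIAL
               for cut in range(1, len(cluster)))


if __name__ == "__main__":
    # "filtration" minus "-ation" leaves "filtr"; "ltr" cannot end a word,
    # but it splits as "l" + "tr", so the suffix is accepted here.
    print(suffix_boundary_ok("filtr", "ation"))   # True
    print(suffix_boundary_ok("filtr", "ness"))    # True (consonant-initial, not checked)
```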
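
A sketch of the weighted-list model structure suggested by the description: lists of prefixes, suffixes, consonant clusters, vowel sequences, and common words, each mapped to weights corresponding to corpus frequency and normalized once training completes. The class and field names are assumptions for illustration, not names used in the disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict


def _normalize(counts: Dict[str, int]) -> Dict[str, float]:
    """Turn raw corpus counts into frequency weights summing to 1.0."""
    total = sum(counts.values()) or 1
    return {token: n / total for token, n in counts.items()}


@dataclass
class LanguageDetectionModel:
    """Weighted token lists for one language, keyed by sub-token type."""
    language: str
    prefixes: Dict[str, float] = field(default_factory=dict)
    suffixes: Dict[str, float] = field(default_factory=dict)
    initial_clusters: Dict[str, float] = field(default_factory=dict)
    vowel_sequences: Dict[str, float] = field(default_factory=dict)
    final_clusters: Dict[str, float] = field(default_factory=dict)
    common_words: Dict[str, float] = field(default_factory=dict)

    @classmethod
    def from_counts(cls, language, **raw_counts):
        """Build a model from raw corpus counts, normalizing each list."""
        return cls(language, **{name: _normalize(c) for name, c in raw_counts.items()})


# Example with toy counts (hypothetical values, not trained data):
english = LanguageDetectionModel.from_counts(
    "en",
    prefixes={"re": 120, "un": 80, "anti": 10},
    suffixes={"ing": 200, "ion": 90, "ed": 150},
    vowel_sequences={"a": 500, "e": 700, "ea": 60},
)
print(english.prefixes)   # weights proportional to corpus frequency
```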
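
A simplified scoring sketch: each word is scored against a model's weighted token lists (with a quick accept for common words and a quick reject for vowel-less words, as noted above), and the per-word scores are summed into a total match score for the string. Maximal vowel runs stand in for full syllabification here, and the toy model weights are hypothetical.

```python
import re


def word_score(word, model):
    """Score one word against one language model (a dict of weighted token
    lists). Unknown sub-token signals pull the score down sharply."""
    word = word.lower()
    if word in model["common_words"]:
        return 1.0 + model["common_words"][word]      # quick accept
    if not re.search(r"[aeiouy]", word):
        return -1.0                                   # quick reject: no vowel
    score = 0.0
    for prefix, w in model["prefixes"].items():
        if word.startswith(prefix):
            score += w
    for suffix, w in model["suffixes"].items():
        if word.endswith(suffix):
            score += w
    # Treat maximal vowel runs as a cheap stand-in for syllable nuclei.
    for nucleus in re.findall(r"[aeiouy]+", word):
        if nucleus in model["vowel_sequences"]:
            score += model["vowel_sequences"][nucleus]
        else:
            score -= 0.5                              # unknown nucleus: strong penalty
    return score


def string_score(text, model):
    """Total match score for a text string: the per-word scores summed."""
    return sum(word_score(w, model) for w in re.findall(r"[^\W\d_]+", text))


if __name__ == "__main__":
    toy_en = {
        "common_words": {"the": 0.05, "and": 0.04},
        "prefixes": {"re": 0.02, "un": 0.01},
        "suffixes": {"ing": 0.03, "ion": 0.02},
        "vowel_sequences": {"a": 0.1, "e": 0.1, "i": 0.08, "o": 0.08, "ea": 0.01},
    }
    print(string_score("the reading room", toy_en))
```

Penalizing unknown vowel runs mirrors the observation above that syllables absent from the model strongly indicate a non-match.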
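
Finally, a sketch of comparing scores across multiple language models and classifying the best match against thresholds as definitely, probably, or definitely not in a language. The threshold values and the toy scorer are illustrative assumptions only.

```python
def detect_language(text, models, score_fn, definite=5.0, probable=1.0):
    """Score a string against several language models, rank the results, and
    classify the best match. Threshold values here are illustrative."""
    ranked = sorted(((score_fn(text, m), lang) for lang, m in models.items()),
                    reverse=True)
    best_score, best_lang = ranked[0]
    if best_score >= definite:
        verdict = "definitely"
    elif best_score >= probable:
        verdict = "probably"
    else:
        verdict = "definitely not"
    return best_lang, verdict, ranked


if __name__ == "__main__":
    # A trivial stand-in scorer: counts how often a model's tokens occur in the
    # string (the per-word scoring above would be used in its place).
    def toy_score(text, model):
        return sum(text.count(tok) * w for tok, w in model.items())

    models = {
        "en": {"the": 1.0, "ing": 0.5, "tion": 0.5},
        "de": {"der": 1.0, "ung": 0.5, "sch": 0.5},
    }
    print(detect_language("the meeting information", models, toy_score))
```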

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)
  • Character Discrimination (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
PCT/US2021/035563 2020-09-17 2021-06-03 Language autodetection from non-character sub-token signals WO2022060439A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202180063398.1A CN116194925A (zh) 2020-09-17 2021-06-03 Language autodetection from non-character sub-token signals

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US17/024,428 2020-09-17
US17/024,428 US11361158B2 (en) 2020-09-17 2020-09-17 Language autodetection from non-character sub-token signals

Publications (1)

Publication Number Publication Date
WO2022060439A1 true WO2022060439A1 (en) 2022-03-24

Family

ID=76695832

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2021/035563 WO2022060439A1 (en) 2020-09-17 2021-06-03 Language autodetection from non-character sub-token signals

Country Status (3)

Country Link
US (3) US11361158B2 (en)
CN (1) CN116194925A (zh)
WO (1) WO2022060439A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115099832A (zh) * 2022-06-29 2022-09-23 广州华多网络科技有限公司 Abnormal user detection method and apparatus, device, medium, and product
CN117556817B (zh) * 2024-01-10 2024-05-24 国开启科量子技术(安徽)有限公司 Quantum-circuit-based method, apparatus, and device for detecting text generated by large models


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10366424B2 (en) * 2014-06-04 2019-07-30 Nuance Communications, Inc. Medical coding system with integrated codebook interface
US11438683B2 (en) * 2020-07-21 2022-09-06 Apple Inc. User identification using headphones

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060229865A1 (en) * 2005-04-07 2006-10-12 Richard Carlgren Method and system for language identification
US20090326918A1 (en) * 2008-06-26 2009-12-31 Microsoft Corporation Language Detection Service

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TOMMI JAUHIAINEN ET AL: "Automatic Language Identification in Texts: A Survey", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 23 April 2018 (2018-04-23), XP081425274 *

Also Published As

Publication number Publication date
US20220083734A1 (en) 2022-03-17
US11630951B2 (en) 2023-04-18
US20230252235A1 (en) 2023-08-10
CN116194925A (zh) 2023-05-30
US11361158B2 (en) 2022-06-14
US20220309242A1 (en) 2022-09-29
US11947909B2 (en) 2024-04-02

Similar Documents

Publication Publication Date Title
CN108287858B (zh) 自然语言的语义提取方法及装置
US10290299B2 (en) Speech recognition using a foreign word grammar
US11947909B2 (en) Training a language detection model for language autodetection from non-character sub-token signals
US9229924B2 (en) Word detection and domain dictionary recommendation
KR100714769B1 (ko) 서면 텍스트로부터의 조정가능 신경망 기반 언어 식별
US10878817B2 (en) Systems and methods for generating comedy
US7277029B2 (en) Using language models to expand wildcards
US8463598B2 (en) Word detection
US20150279366A1 (en) Voice driven operating system for interfacing with electronic devices: system, method, and architecture
CN101815996A (zh) 检测名称实体和新词
KR20160008480A (ko) 명칭을 강인하게 태깅하는 방법 및 시스템
CN112270167B (zh) 角色标注方法、装置、电子设备和存储介质
EP1617409A1 (en) Multimodal method to provide input to a computing device
KR102364401B1 (ko) 문맥형 음성-구동 딥 북마킹
CN110808032A (zh) 一种语音识别方法、装置、计算机设备及存储介质
CN114631094A (zh) 智能电子邮件标题行建议和重制
CN110020429B (zh) 语义识别方法及设备
Lee et al. Impact of out-of-vocabulary words on the twitter experience of blind users
CN112149403A (zh) 一种确定涉密文本的方法和装置
Celikkaya et al. A mobile assistant for Turkish
US20220414334A1 (en) Post-model filtering of predictive text
CN114281969A (zh) 答复语句推荐方法、装置、电子设备及存储介质
CN114911896A (zh) 基于语音的搜索方法及相关设备
CN113268984A (zh) 文本处理方法、装置、存储介质及电子设备
CN114547306A (zh) 一种数据处理方法、装置、电子设备及计算机可读介质

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 21736088

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 21736088

Country of ref document: EP

Kind code of ref document: A1