WO2022060439A1 - Language autodetection from non-character sub-token signals - Google Patents
- Publication number
- WO2022060439A1 (PCT/US2021/035563)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- language
- score
- corpus
- word
- text string
- Prior art date
Links
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/263—Language identification
Definitions
- Text input on computing devices increasingly depends on language-specific processing to refine and respond to user intent. Such processing depends on a core assumption that the language of the text being entered is known. These systems perform poorly when the assumed language does not match the entered text. To address this, systems may use a pre-processing step to identify the language of the incoming text strings.
- one or more of the language detection models may be applied to the text string.
- a match score between the text string and each language corresponding to an applied language detection model may be determined based on the prefixes and suffixes included in the words of the text string and on the syllables included in those words, where a syllable is defined as an optional legal initial consonant sequence, followed by an obligatory legal vowel sequence, followed by an optional legal final consonant sequence, each as defined in the model.
- a legal word or stem is one that consists solely of a contiguous sequence of legal syllables.
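The syllable definition above can be sketched as a small recursive parser. The consonant and vowel inventories below are illustrative stand-ins for the model's trained, weighted token lists, not actual English data:

```python
# Illustrative (not exhaustive) token inventories; a trained model would
# learn these from a corpus, each with a frequency-derived weight.
LEGAL_INITIAL = {"", "b", "br", "f", "fl", "l", "m", "st", "str", "t", "tr"}
LEGAL_VOWELS = {"a", "ai", "e", "ea", "ee", "i", "io", "o", "ou", "u"}
LEGAL_FINAL = {"", "b", "ck", "l", "ll", "m", "n", "nd", "r", "st", "t"}

def parse_syllables(word):
    """Return a list of (initial, vowel, final) syllables covering the whole
    word, or None if the word is not a contiguous sequence of legal
    syllables (i.e., not a legal word or stem)."""
    if word == "":
        return []
    for i_len in range(min(3, len(word)), -1, -1):
        initial = word[:i_len]
        if initial not in LEGAL_INITIAL:
            continue
        rest = word[i_len:]
        for v_len in range(min(2, len(rest)), 0, -1):
            vowel = rest[:v_len]
            if vowel not in LEGAL_VOWELS:
                continue
            body = rest[v_len:]
            for f_len in range(min(2, len(body)), -1, -1):
                final = body[:f_len]
                if final not in LEGAL_FINAL:
                    continue
                tail = parse_syllables(body[f_len:])
                if tail is not None:
                    return [(initial, vowel, final)] + tail
    return None
```

A word such as "filter" parses into two syllables, while a vowel-less string such as "brck" fails, making it an illegal word under this toy inventory.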
- FIG. 2 is a schematic diagram of a computing environment illustrating the training of a language detection model.
- FIG. 3 illustrates a computing environment for the processing of an exemplary word from a corpus by a plurality of processing engines encompassed in a language detection training engine.
- FIG. 4 illustrates various components of a language detection model.
- FIGS. 6 and 7 are simplified diagrams of a mobile computing device with which aspects of the disclosure may be practiced.
- FIG. 8 is a block diagram illustrating example physical components of a computing device with which aspects of the disclosure may be practiced.
- FIG. 9 is a simplified block diagram of a distributed computing system in which aspects of the present disclosure may be practiced.
- Non-limiting examples of the present disclosure describe systems, methods and devices for determining a language of a text string input into a construct of a computing device.
- the construct that the text string is input in may comprise an operating system shell construct, an application construct, or an application service (e.g., cloud-based service) construct.
- An indication to analyze text utilizing the language detection models described herein may be received by a language detection service.
- the language detection service may be incorporated and executed on a local computing device and/or in the cloud.
- the indication may comprise a determination that a text input has not previously been received by a user account in a construct where a text string input is currently being received, and/or that a user account has not set up a preferred language for a computing device, application or service.
- the language detection service described herein may simply apply the language detection models periodically or whenever a text input is received in one or more computing constructs.
- the language detection service may apply one or more of the language detection models to the text string and calculate a match score for each word in the string to each language corresponding to the models that are applied.
- the scores for each word may be summed or otherwise functionally processed to generate a total match score for the text string and the language.
- based on whether a match score between a text string and a language meets or exceeds one or more threshold values, a determination may be made that the text string is definitely in that language, possibly/probably in that language, or definitely not in that language.
- the language detection service may determine that a language associated with a highest ranked match score from amongst a plurality of scores from a plurality of language detection models is the language of the text string.
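The scoring flow described in these steps might be sketched as follows; the per-word scoring scheme, the `UNSEEN` penalty, and the threshold value are illustrative assumptions, since the disclosure does not fix them:

```python
UNSEEN = -1.0  # assumed penalty when no affix in the model matches a word

def word_score(word, model):
    """Hypothetical per-word score: the weight of the best matching prefix
    and suffix under the model, with a penalty on each side that has no
    match."""
    prefix = max((w for p, w in model["prefixes"].items() if word.startswith(p)),
                 default=UNSEEN)
    suffix = max((w for s, w in model["suffixes"].items() if word.endswith(s)),
                 default=UNSEEN)
    return prefix + suffix

def string_score(text, model):
    # Per-word scores are summed into a total match score for the string.
    return sum(word_score(w, model) for w in text.lower().split())

def detect_language(text, models, threshold=0.0):
    """Return the language with the highest total score, or None when even
    the best score falls below the threshold."""
    scores = {lang: string_score(text, m) for lang, m in models.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None

# Tiny example models (invented weights, for illustration only).
models = {
    "en": {"prefixes": {"re": 0.5, "un": 0.3}, "suffixes": {"ing": 0.6, "ed": 0.4}},
    "xx": {"prefixes": {"zu": 0.5}, "suffixes": {"heit": 0.5}},
}
```

Here "rebooting unending" matches the English model's affixes on every word, while a string matching no model at all falls below the threshold and yields no detection.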
- Various follow-up actions may be performed based on a determination that a text string is probably in a specific language.
- a language processing model (e.g., an intent determination processing model, a spellcheck processing model, a grammar check processing model, etc.) that is specific to the determined language may be applied to the text string.
- the type of language processing model that is applied in the specific language may be determined based on an application or computing shell construct that the text string was received in.
- the follow-up action may comprise downloading one or more linguistic libraries or models in the specific language from a cloud database to a local computing device on which the text string was received.
- a language detection service which is generically illustrated in language detection models sub-environment 116, may be hosted and executed in the cloud (e.g., by server computing device 112), and/or the language detection service may be hosted and executed by a local computing device (e.g., computing device 102).
- a local computing device (e.g., computing device 102)
- one of the technical advantages of the language detection models described herein is that they are sufficiently small (e.g., in memory requirements and storage footprint) that they can easily reside on the limited storage of local computing devices, and they therefore need not necessarily be executed in the cloud.
- Language detection models sub-environment 116 includes language detection module 124, and follow-up action module 126.
- the elements described in relation to language detection models sub-environment 116 may be encompassed in the language detection service.
- Language detection module 124 may comprise one or more processing engines that are applied to input text strings, in association with language model data (e.g., from language model data store 120), to determine the language of the input text strings. Additional details regarding the application of language models to text strings are described below in relation to FIG. 4.
- follow-up action module 126 may cause one or more language libraries or language models (in language X) from full suite of language assets 114 to be downloaded to computing device 102 from the cloud.
- a local computing device need only download and store language libraries and language models that are likely to be utilized by users of the local computing device.
- a user has entered text string 106 on canvas 104 of application 103.
- FIG. 2 is a schematic diagram of a computing environment 200 illustrating the training of a language detection model.
- Computing environment 200 includes corpus 202, affix detection training engine 204, syllabifier token detection training engine 212, final weighting engine 234, and final weighted individual language model 250.
- Affix detection training engine 204 includes suffix training engine 208, prefix training engine 210 and manual review of the preliminary affix lists extracted from corpus 202.
- Syllabifier token detection training engine 212 includes word-initial consonants 222, vowels and vowel sequences 224 and word-final consonants 226.
- Stem- or word-initial and stem- or word-final consonant clusters may be used to determine syllable-initial and syllable-final consonant clusters for that language, because it is rare for a word-internal syllable to end in a consonant cluster that cannot also end a word.
- the resulting word stem (e.g., the characters minus the identified/extracted prefix and/or suffix) must be at least a threshold number of characters long and must include at least one vowel. If a resulting word stem does not meet both requirements, that word and/or the prefix or suffix identified for that specific word may be rejected from the training process and shorter prefixes/suffixes tested.
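The stem-validation rule can be sketched as a small check; `MIN_STEM_LEN` and the vowel set are assumed values, as the text leaves the threshold unspecified:

```python
MIN_STEM_LEN = 3          # assumed threshold; the disclosure leaves it open
VOWELS = set("aeiouy")    # treating "y" as a vowel, an assumption

def accept_affix_candidate(word, prefix="", suffix=""):
    """Accept a candidate prefix/suffix for this word only if the remaining
    stem is at least MIN_STEM_LEN characters long and contains at least one
    vowel; otherwise the candidate is rejected and shorter affixes would be
    tried."""
    stem = word[len(prefix): len(word) - len(suffix)]
    return len(stem) >= MIN_STEM_LEN and any(c in VOWELS for c in stem)
```

For example, stripping "re-" and "-ing" from "rebuilding" leaves the valid stem "build", while stripping "-ed" from "red" leaves only "r" and is rejected.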
- consonant and vowel sequence candidates may be manually reviewed, as illustrated by consonant and vowel sequence review 232. That is, a person familiar with the language may manually review the unique consonant and vowel sequences and discard uncommon ones and any that result from noise in the training data (e.g., proper nouns, foreign nouns).
- vowel sequences that span a syllable boundary may be split across syllables (e.g., “ayo” in “mayor” may be split into “ay” and “o”).
- the training may determine that “re-” is a prefix, but then discard it, because “nd” is not in the list of consonant sequences that can legally begin an English word. This step may also be utilized to avoid falsely counting the suffix as “-ion”, as in “lion”.
- an additional step may be performed when processing the suffixes. If a suffix begins with a vowel and the preceding consonant sequence cannot end the word, a determination may be made as to whether there is a legal final/initial consonant combination. For example, the suffix “-ation” (as in “amalgamation”) is found when parsing the word “filtration”. After stripping “-ation”, the remaining substring is “filtr”.
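The final/initial consonant combination check from the "filtration" example might look like the following sketch; the cluster inventories here are illustrative, not the model's learned lists:

```python
VOWELS = set("aeiou")
# Illustrative cluster inventories; the trained model would carry weighted,
# corpus-derived versions of these lists.
LEGAL_WORD_FINAL = {"d", "l", "ld", "m", "n", "nd", "r", "rd", "st", "t"}
LEGAL_WORD_INITIAL = {"b", "br", "f", "fl", "fr", "n", "r", "st", "str", "t", "tr"}

def trailing_consonants(stem):
    """Return the consonant cluster at the end of a stem."""
    i = len(stem)
    while i > 0 and stem[i - 1] not in VOWELS:
        i -= 1
    return stem[i:]

def plausible_before_vowel_suffix(stem):
    """After stripping a vowel-initial suffix such as '-ation', the stem's
    trailing consonant cluster must either legally end a word, or split into
    a legal final cluster plus a legal initial cluster, as 'filtr' does
    ('l' + 'tr')."""
    cluster = trailing_consonants(stem)
    if not cluster or cluster in LEGAL_WORD_FINAL:
        return True
    for i in range(1, len(cluster)):
        if cluster[:i] in LEGAL_WORD_FINAL and cluster[i:] in LEGAL_WORD_INITIAL:
            return True
    return False
```

Under this toy inventory, "filtr" is accepted because "ltr" splits into "l" + "tr", while a cluster with no legal split is rejected.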
- if final weighting engine 234 cannot build a legal word utilizing the steps described above (e.g., the word does not have any vowels, or starts with a cluster that is illegal in English, like “kjenne”), the word may be ignored for training purposes.
- Prefix sequence identification engine 310 is applied to exemplary subword 302D, “antidisestablish”, resulting in the identification of prefixes 311 (“anti” and “dis”), which are stripped from the remaining characters 312, which is the stem 302F (“establish”).
- Each of the tokens identified by the engines illustrated in FIG. 3 may be added to a token list in the language model (e.g., in final weighted individual language model 250) and have their weights normalized once the engines have been applied to the remaining words in corpus 202.
- Weighted prefixes and prefix sequences 404 include the character strings that were identified via prefix training engine 210, and which may have had their weights adjusted via application of one or more operations associated with prefix review 230.
- Weighted suffixes and suffix sequences 406 include character strings that were identified via suffix training engine 208, and which may have had their weights adjusted via application of one or more operations associated with suffix review 228.
- Weighted legal initial consonant clusters 408 include initial consonant cluster strings that were identified via syllabifier token detection training engine 212, and which may have had their weights adjusted via application of one or more operations associated with consonant and vowel sequence review 232.
- Weighted legal vowel sequences 412 include vowels and vowel sequences that were identified via syllabifier token detection training engine 212, and which may have had their weights adjusted via application of one or more operations associated with consonant and vowel sequence review 232.
- one or more language detection models such as language detection model 402 for one or more languages may be applied to the text string.
- a score for the string under a candidate language may be obtained based on how well the string fits the frequencies of prefixes, suffixes, and syllables in language detection model 402. The presence of syllables that do not occur in the model strongly indicates that the string is not a match with the language of the model.
- a string may be tested against multiple candidate language models and the scores for each model may be compared to obtain a confidence score for the language of the string.
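One plausible way (not prescribed by the disclosure) to compare scores across candidate models and obtain a confidence score is a softmax over the per-model scores:

```python
import math

def language_confidence(scores):
    """Convert per-model match scores into relative confidences that sum to
    1.0, via a numerically stable softmax. The choice of softmax is an
    assumption; the text only says scores 'may be compared'."""
    peak = max(scores.values())
    exps = {lang: math.exp(s - peak) for lang, s in scores.items()}
    total = sum(exps.values())
    return {lang: e / total for lang, e in exps.items()}
```

A model whose raw score clearly dominates receives most of the probability mass, which can then be compared against a confidence threshold.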
- FIG. 5A is a method 500A for determining a language of a text string based on application of a single syllable-based language detection model and performing a follow-up action based on the determining.
- the method 500A begins at a start operation and flow moves to operation 502A.
- a language detection model for a first language is maintained.
- the language detection model may comprise a first list comprising identities of a plurality of prefixes from a corpus of the first language, and weights for each of the plurality of prefixes.
- the language detection model may further comprise a second list comprising identities of a plurality of suffixes from the corpus, and weights for each of the plurality of suffixes.
- the weights may correspond to a frequency of the prefixes and suffixes in the corpus.
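A minimal sketch of such a model as a data structure, with weights normalized from raw corpus counts (the layout and field names are assumptions, not the disclosure's):

```python
from dataclasses import dataclass

def weights_from_counts(counts):
    """Normalize raw corpus counts into frequency-derived weights."""
    total = sum(counts.values())
    return {token: n / total for token, n in counts.items()}

@dataclass
class LanguageDetectionModel:
    """A plausible in-memory layout for the per-language model described
    above: token lists (prefixes, suffixes) each with a weight
    corresponding to the token's frequency in the corpus."""
    language: str
    prefixes: dict   # prefix -> weight
    suffixes: dict   # suffix -> weight

# Example instance with invented counts, for illustration only.
en = LanguageDetectionModel(
    language="en",
    prefixes=weights_from_counts({"re": 3, "un": 1}),
    suffixes=weights_from_counts({"ing": 2, "ed": 2}),
)
```

Keeping the model as two small weight tables is consistent with the stated advantage that these models are compact enough to reside on a local device.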
- a follow-up action is performed based on the determination that the language match score meets the threshold value.
- the follow-up action may comprise applying a language processing engine that is specific to the first language to the text string.
- performing the follow-up action may comprise downloading a language package library for the first language to a computing device that the text string was initially input to.
- the language package library for the first language may comprise an embeddings library (e.g., a BERT library, an ELMo library).
- a preprocessing step may quickly accept a word because it is in the common word list or may reject the word because it does not contain a vowel.
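That preprocessing step can be sketched as a three-way pre-check; the common-word list here is a placeholder for the model's learned per-language list:

```python
COMMON_WORDS = {"the", "and", "of", "to"}   # illustrative placeholder list
VOWELS = set("aeiouy")

def prefilter(word):
    """Fast pre-check before full scoring:
    True  -> accept immediately (word is on the common-word list),
    False -> reject immediately (word contains no vowel),
    None  -> fall through to the full syllable-based scoring."""
    w = word.lower()
    if w in COMMON_WORDS:
        return True
    if not any(c in VOWELS for c in w):
        return False
    return None
```

This keeps the expensive syllable parse off the hot path for the most frequent words and for strings that cannot be legal words at all.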
- FIG. 5D is a method 500D for determining a language of a text string based on application of a plurality of language detection models and performing a follow-up action based on the determining.
- the method 500D begins at a start operation and flow moves to operation 502D.
- the first language detection model may further comprise a list of prefixes and suffixes from the corpus of the first language and weights for each of those prefixes and suffixes.
- the first language detection model may further comprise a list of common words from the corpus of the first language and weights for each of those common words. The weights may correspond to a frequency of each of the text units (tokens) in the corpus of the first language.
- Communication media may be embodied by computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and includes any information delivery media.
- the term “modulated data signal” may describe a signal that has one or more characteristics set or changed in such a manner as to encode information in the signal.
- communication media may include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency (RF), infrared, and other wireless media.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202180063398.1A CN116194925A (zh) | 2020-09-17 | 2021-06-03 | Language autodetection from non-character sub-token signals |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/024,428 | 2020-09-17 | ||
US17/024,428 US11361158B2 (en) | 2020-09-17 | 2020-09-17 | Language autodetection from non-character sub-token signals |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022060439A1 true WO2022060439A1 (en) | 2022-03-24 |
Family
ID=76695832
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2021/035563 WO2022060439A1 (en) | 2020-09-17 | 2021-06-03 | Language autodetection from non-character sub-token signals |
Country Status (3)
Country | Link |
---|---|
US (3) | US11361158B2 (zh) |
CN (1) | CN116194925A (zh) |
WO (1) | WO2022060439A1 (zh) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115099832A (zh) * | 2022-06-29 | 2022-09-23 | 广州华多网络科技有限公司 | Abnormal user detection method, and apparatus, device, medium and product therefor |
CN117556817B (zh) * | 2024-01-10 | 2024-05-24 | 国开启科量子技术(安徽)有限公司 | Quantum-circuit-based method, apparatus and device for detecting text generated by large models |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060229865A1 (en) * | 2005-04-07 | 2006-10-12 | Richard Carlgren | Method and system for language identification |
US20090326918A1 (en) * | 2008-06-26 | 2009-12-31 | Microsoft Corporation | Language Detection Service |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10366424B2 (en) * | 2014-06-04 | 2019-07-30 | Nuance Communications, Inc. | Medical coding system with integrated codebook interface |
US11438683B2 (en) * | 2020-07-21 | 2022-09-06 | Apple Inc. | User identification using headphones |
- 2020
- 2020-09-17 US US17/024,428 patent/US11361158B2/en active Active
- 2021
- 2021-06-03 CN CN202180063398.1A patent/CN116194925A/zh active Pending
- 2021-06-03 WO PCT/US2021/035563 patent/WO2022060439A1/en active Application Filing
- 2022
- 2022-06-13 US US17/839,330 patent/US11630951B2/en active Active
- 2023
- 2023-04-17 US US18/301,341 patent/US11947909B2/en active Active
Non-Patent Citations (1)
Title |
---|
TOMMI JAUHIAINEN ET AL: "Automatic Language Identification in Texts: A Survey", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 23 April 2018 (2018-04-23), XP081425274 * |
Also Published As
Publication number | Publication date |
---|---|
US20220083734A1 (en) | 2022-03-17 |
US11630951B2 (en) | 2023-04-18 |
US20230252235A1 (en) | 2023-08-10 |
CN116194925A (zh) | 2023-05-30 |
US11361158B2 (en) | 2022-06-14 |
US20220309242A1 (en) | 2022-09-29 |
US11947909B2 (en) | 2024-04-02 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21736088 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 21736088 Country of ref document: EP Kind code of ref document: A1 |