WO2022240358A1 - System and method for training a culturally-specific assisting language learning model - Google Patents


Info

Publication number
WO2022240358A1
Authority
WO
WIPO (PCT)
Prior art keywords
input
inputs
language
error correction
culture
Prior art date
Application number
PCT/SG2022/050301
Other languages
French (fr)
Inventor
Jun Yao Francis LEE
Original Assignee
National University Of Singapore
Priority date
Filing date
Publication date
Application filed by National University Of Singapore filed Critical National University Of Singapore
Publication of WO2022240358A1 publication Critical patent/WO2022240358A1/en

Classifications

    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B 19/00 Teaching not covered by other main groups of this subclass
    • G09B 19/06 Foreign languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Definitions

  • the present invention relates, in general terms, to systems and methods for training a culturally-specific assisted language learning model.
  • the present invention also relates to using that model for assisted learning of a language.
  • the present disclosure focuses on describing a culturally-specific assisted language learning system and method.
  • the system is a real-time grammar correction system with sentence error classification that is tunable to the culture of the foreign language learner.
  • speaking practice is afforded by utilising speech recognition technologies coupled with mouth visualizations for students to mimic.
  • an input language culture may be Bahasa Melayu.
  • Language culture may instead refer to two or more languages or dialects that are sufficiently similar that the common errors in translation to a target language are common to each of the two or more languages or dialects.
  • the input language culture may include Bahasa Melayu and Bahasa Indonesia.
  • the target language culture may be a single language or dialect, or two or more languages or dialects that are sufficiently similar that the same common errors are found by second language learners attempting to translate into each of the two or more languages or dialects.
  • a system for training a culturally-specific assisted language learning model comprising: memory; an error correction module; and one or more processors, the memory storing instructions that, when executed by the one or more processors, cause the one or more processors to: receive a set of inputs, the set of inputs comprising first inputs from a user of an input language culture, each first input comprising an imperfect use of one or more words in a target language culture and, for each first input, a second input comprising a proper use of the one or more words in the target language culture.
  • the terms “improper use” or “imperfect use” are intended to convey uses of the target language by users, such as native users, of the input language, that have routine errors such as spelling errors, as well as translations (e.g. from a third input, being in the input language culture, which may also be used for training, to identify common errors in translation) in which each word is correctly directly translated, but the resulting input in the target language is incorrect since it does not take into account the influence of the context or grammar of a sentence or pronunciation.
  • the term “proper use” in this context refers to a correct translation into the target language.
  • the first inputs and second inputs may comprise multi-word inputs.
  • a multi word input can be a sentence, phrase, part-sentence and other multi-word structures, whether verbal or written, in which translation of one or more words relies on the grammar of the multi-word input or upon the other words in the multi-word input, or both. This type of input recognises that some words may have a direct translation that differs from the translation that would be used in the context of a sentence or other multi-word input.
  • training the error correction module may comprise training the error correction module to determine corrections needed to be made to each first input to arrive at the corresponding proper use.
  • the instructions may cause the one or more processors to train the error correction module to learn one or more differences between a grammatical sentence structure of the imperfect use and a grammatical sentence structure of the proper use, and to identify a proper use of words in the target language culture based on the one or more differences.
  • the instructions may cause the one or more processors to train the error correction module by applying a one-hot encoding categorical parameter to the first inputs, the one-hot encoding categorical parameter specifying the input language culture.
  • the set of inputs may comprise inputs from two or more input language cultures, and the instructions cause the one or more processors to apply the one-hot encoding parameter to train the error correction module to learn respective sets of common errors in use of the target language culture for each input language culture.
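The one-hot encoding categorical parameter described above can be sketched as follows. This is a minimal illustration, not part of the disclosure: the language-culture set, function names and example sentences are assumptions.

```python
# Minimal sketch of the one-hot encoding categorical parameter that tags
# each training input with the learner's input language culture.
# The language list and helper names are illustrative assumptions.
LANGUAGE_CULTURES = ["bahasa", "mandarin", "japanese"]  # hypothetical set

def one_hot(language_culture: str) -> list:
    """Return a one-hot vector identifying the input language culture."""
    vec = [0.0] * len(LANGUAGE_CULTURES)
    vec[LANGUAGE_CULTURES.index(language_culture)] = 1.0
    return vec

# Each training example pairs an imperfect use (first input) with the
# proper use (second input), tagged with the learner's language culture.
def make_example(imperfect: str, proper: str, culture: str) -> dict:
    return {"source": imperfect, "target": proper, "culture": one_hot(culture)}

example = make_example("I very excited", "I am very excited", "mandarin")
```

Tagging every training pair this way lets a single error correction module learn a separate set of common errors per input language culture, as described above.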
  • the imperfect translations and proper translations may be written translations.
  • imperfect use of the first input into the target language culture comprises a speech input of a person whose native tongue is in the input language culture (mispronounced inputs).
  • proper use of the target language culture may comprise a speech input of a person whose native tongue is in the target language culture (correctly pronounced inputs).
  • the instructions may cause the one or more processors to train the error correction module to identify the relationships between inputs in the set of inputs by identifying mispronounced phones by determining: phones of each speech input; and a difference between phones used in the mispronounced inputs and phones used in the correctly pronounced inputs.
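The phone-difference determination can be sketched with a simple sequence alignment, assuming phone sequences have already been produced by a speech recognition front end. The phone labels below are illustrative assumptions.

```python
# Sketch of identifying mispronounced phones by aligning the phone
# sequence of a learner's speech input against the correctly pronounced
# reference. Uses stdlib difflib; the speech-to-phone transcription step
# is assumed to come from an existing speech recognition model.
from difflib import SequenceMatcher

def mispronounced_phones(learner_phones, reference_phones):
    """Return (learner_phones, reference_phones) slices that differ."""
    diffs = []
    matcher = SequenceMatcher(a=learner_phones, b=reference_phones)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((learner_phones[i1:i2], reference_phones[j1:j2]))
    return diffs

# e.g. a learner substituting an /r/-like phone for /l/ in "light"
diffs = mispronounced_phones(["r", "ai", "t"], ["l", "ai", "t"])
```

Each differing pair then identifies a mispronounced phone for which the visualiser can display the correct mouth position.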
  • the system may further comprise a visualizer for visually demonstrating, for each mispronounced phone, a mouth position for properly pronouncing the mispronounced phone.
  • the visualizer may be configured to use animations to visually demonstrate the mouth position for properly pronouncing mispronounced phones.
  • the system may further comprise an activities system.
  • the instructions may cause the one or more processors to produce an activity, at the activities system, corresponding to each common error, the activity comprising a practice exercise for correcting the respective common error.
  • Also disclosed herein is the use of a system as described above, comprising: receiving, at the one or more processors, a further input in the target language culture from a speaker of the input language culture, the further input corresponding to a predetermined prompt in the input language culture; applying the trained model to the further input to generate a multi-word proper use based on the further input and the common errors; and outputting the proper use, indicating corrections of the common errors identified in the further input.
  • Also disclosed herein is the use of a system as described above, comprising: receiving, at the one or more processors, a further speech input in the target language culture from a speaker of the input language culture, the further speech input corresponding to a predetermined prompt; applying the trained model to the further speech input to identify mispronounced phones; and outputting a visualization of each mispronounced phone.
  • Also disclosed herein is a method for training a culturally-specific assisted language learning model, comprising: receiving a set of inputs, the set of inputs comprising first inputs from a user of an input language culture, each first input comprising an imperfect use of one or more words in a target language culture.
  • embodiments of the present invention provide a culture- specific automatic grammar error correcting method and system.
  • by incorporating the native language of the user, e.g. using one-hot encoding during the learning process, the resulting model (i.e. machine learning model) gives more informed corrections. This allows second language learners to get immediate feedback without relying on teachers.
  • embodiments of the invention provide for the identification of common errors including the classification of sentence errors.
  • target language learners can identify the types of mistakes they are making.
  • embodiments of the present invention assist with pronunciation of phones in a second language.
  • Figure 1 is a system for each of training and using a culturally-specific assisted language learning model in accordance with present teachings;
  • Figure 2 is a method, implemented on the system of Figure 1, for training a culturally-specific assisted language learning model for multi-word input (e.g. sentence) error detection;
  • Figure 3 illustrates the difference between direct, imperfect use and proper use in the context of a sentence being translated from Japanese to English;
  • Figure 4 is an illustrative embodiment of an interface with which a user can interact to undertake culturally-specific assisted language learning; and
  • Figure 5 is a flowchart illustrating the broad steps in the use of the system of Figure 1 in the correction of speech mispronunciations.
  • the systems and methods train an error correction module to detect various types of errors.
  • the error correction module is trained to detect sentence errors. Sentence errors can result from incorrect grammar, from direct translation of words without regard to the changes in translation resulting from the context of the sentence, or both.
  • the error correction module is trained to detect phone mispronunciation. When a mispronounced phone is detected, a visualisation of the mouth position for correct pronunciation is displayed to the user.
  • the error correction module is trained to detect both types of error along with, if desired, standard translational errors identified by existing translation systems.
  • the term "translation" may be used herein to refer to a user of an input language culture formulating a sentence or input in their mind and attempting to produce a corresponding sentence or input in the target language - i.e. translating the sentence.
  • the system and method described herein may only receive the target language input, yet it is a translation.
  • the terms "imperfect use" and "proper use" are interchangeable with the terms "imperfect translation" and "proper translation" respectively.
  • the computer device 100 may be a mobile computer device such as a smart phone, a wearable device, a palm-top computer, a multimedia Internet-enabled cellular telephone, an on-board computing system or any other computing system, a mobile device such as an iPhone™ manufactured by Apple™, Inc. or one manufactured by LG™, HTC™ and Samsung™, for example, or other device.
  • the mobile computer device 100 includes the following components:
  • memory, e.g. non-volatile (non-transitory) memory 104;
  • random access memory (RAM) 108;
  • a transceiver component 112 that includes N transceivers.
  • Although the components depicted in Figure 1 represent physical components, Figure 1 is not intended to be a hardware diagram. Thus, many of the components depicted in Figure 1 may be realized by common constructs or distributed among additional physical components. Moreover, it is certainly contemplated that other existing and yet-to-be developed physical components and architectures may be utilized to implement the functional components described with reference to Figure 1.
  • the display 102 generally operates to provide a presentation of content to a user, and may be realized by any of a variety of displays (e.g., CRT, LCD, HDMI, micro-projector and OLED displays).
  • the display 102 may display visualisations of mouth positions of mispronounced phones detected by the model 118.
  • the visualisations may be static images or animations and may or may not be accompanied with an audible simulation of the correct pronunciation of each mispronounced phone.
  • non-volatile data storage 104 functions to store (e.g., persistently store) data and executable code.
  • the system architecture may be implemented in memory 104, or by instructions stored in memory 104 - e.g. memory 104 may be a computer readable storage medium for storing instructions that, when executed by processor(s) 110, cause the processor(s) 110 to perform the method 200 described with reference to Figure 2.
  • the non-volatile memory 104 includes bootloader code, modem software, operating system code, file system code, and code to facilitate the implementation of components, well known to those of ordinary skill in the art, which are not depicted nor described for simplicity.
  • the non-volatile memory 104 is realized by flash memory (e.g., NAND or ONENAND memory), but it is certainly contemplated that other memory types may be utilized as well. Although it may be possible to execute the code from the non-volatile memory 104, the executable code in the non-volatile memory 104 is typically loaded into RAM 108 and executed by one or more of the N processing components 110.
  • the N processing components 110 in connection with RAM 108 generally operate to execute the instructions stored in non-volatile memory 104.
  • the N processing components 110 may include a video processor, modem processor, DSP, graphics processing unit (GPU), and other processing components.
  • the transceiver component 112 includes N transceiver chains, which may be used for communicating with external devices via wireless networks. These can also be used to receive the inputs in both the input language culture and target language culture, to specify the input language culture or target language culture and so on.
  • Each of the N transceiver chains may represent a transceiver associated with a particular communication scheme.
  • each transceiver may correspond to protocols that are specific to local area networks, cellular networks (e.g., a CDMA network, a GPRS network, a UMTS networks), and other types of communication networks.
  • the system 100 of Figure 1 includes or communicates with a model 118, being a model comprising or constituting the trained error correction module.
  • model 118 may be part of the system 100 or, as shown, the system 100 may form a client terminal through which a user interacts with the model 118.
  • the system 100 may also be connected to any other appliance, such as an external server, a scanner, or any other resource from which, for example, documents can be sourced for numerical analysis.
  • the system 100 also includes an activities system or module 120.
  • the task of the activities system 120 is to produce activities corresponding to the common errors that a particular second (i.e. target) language learner needs to focus on.
  • Non-transitory computer-readable medium 104 includes both computer storage medium and communication medium including any medium that facilitates transfer of a computer program from one place to another.
  • a storage medium may be any available medium that can be accessed by a computer.
  • the system 100 may be used for training a culturally-specific assisted language learning model.
  • Memory 104 may store instructions that, when executed by the N processing components 110, cause the N processing components 110 to perform the method 200 of Figure 2, which broadly includes:
  • Step 202: receiving a set of inputs;
  • Step 204: training the error correction module; and
  • Step 206: outputting the trained model.
  • Step 202 comprises receiving a set of inputs.
  • the set of inputs includes first inputs and, for each first input, a second input.
  • the first input comprises an imperfect use of one or more words in the target language.
  • the second input is or includes a proper use (or proper translation) of one or more words of the target language, corresponding to the imperfect use.
  • a "proper" use is an accurate translation such as one that would be made by a person who is proficient in the target language.
  • the system 100 may also take a third input that is or includes an input in the input language culture corresponding to the first input and second input.
  • the first and second inputs enable the system to identify differences between a translation made by a person whose native tongue is the input language but who has not yet mastered the target language, and a translation made by a person who is proficient in the target language.
  • the inputs may be text-based inputs for written error correction, or verbal or speech inputs for pronunciation error correction. Therefore, the N transceiver elements 112 may include receiver elements such as a microphone or keyboard.
  • Step 204 involves training the error correction module to identify relationships between the inputs - i.e. between improper and proper uses of the target language by a person proficient in the input language, or an input comprising the input language and the inputs comprising the imperfect and proper translations into the target language.
  • the relationships that are identified are based on the input language. This enables the error correction module to identify common errors made by second language learners whose native tongue is the first (i.e. input) language, when translating into or using the target language - hereinafter referred to as "common errors". Therefore, the relationships identified between the inputs include errors commonly made during translation of words of the input language culture into the target language culture.
  • the result of step 204 is that a trained error correction module is produced.
  • This trained error correction module is outputted in a model for translating between the input language and target language.
  • Steps 202 to 206 result in a model that is specifically configured to identify errors that arise when people who speak a particular first language are attempting to translate into or use a particular second language.
  • the model is therefore more accurately able to model the errors that challenge speakers of the particular first language when performing translations.
  • the present error correction module may be, or comprise, a grammar error correction model that, in use, takes a multi-word input including the sentence to be corrected and the native language of the learner as input. This is achieved by augmenting an existing automatic grammar error correction network and retraining for the native language using a one-hot encoding categorical parameter.
  • the one-hot encoding categorical parameter specifies the input language or native language of the learner.
  • the one-hot encoding categorical parameter can be used as an additional label to retrain a pre-existing automatic grammar error correction network - e.g. a neural network with the one-hot encoding categorical parameters being densely connected to a layer in the neural network.
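This dense connection can be illustrated with a toy sketch. This is not the disclosed network: the hidden size, number of cultures, random weights and the additive conditioning are all assumptions made purely for illustration.

```python
# Toy sketch of densely connecting a one-hot language-culture parameter
# into a layer of an existing grammar error correction network: the
# one-hot vector is passed through its own dense projection and added to
# the sentence encoder's hidden features. All shapes/weights are assumed.
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 8        # hidden size of the existing encoder layer (assumed)
N_CULTURES = 3    # number of input language cultures (assumed)

W_culture = rng.normal(size=(N_CULTURES, HIDDEN))  # dense projection weights

def condition_on_culture(hidden_state, culture_vec):
    """Inject the language-culture signal into an encoder hidden state."""
    culture_features = culture_vec @ W_culture  # dense connection
    return hidden_state + culture_features      # culture-conditioned features

h = rng.normal(size=HIDDEN)            # stand-in encoder output
culture_vec = np.eye(N_CULTURES)[1]    # learner's culture, e.g. index 1
conditioned = condition_on_culture(h, culture_vec)
```

With such conditioning, the same network weights can produce corrections tuned to whichever language culture the one-hot vector selects.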
  • the resultant machine learning model outputted at Step 206 will be tuned towards a particular learner's native language or language culture.
  • path 208 can be applied when the first and second inputs are sentences. All along the path 208, training the error correction module involves learning proper translations or uses of the target language based on proper uses of the target language and imperfect uses of the target language by a person proficient in the input language - this can also include learning proper translations from an input sentence in an input language (i.e. third input) to a corresponding sentence in the target language - Step 210.
  • Step 210 involves identifying a proper translation for one or more words in each input sentence based on other words in the input sentence and the input language.
  • a sentence may be translated from Japanese to English with the translation of a particular word meaning either "tension" or "excitement" depending on other words in the sentence.
  • the 'other words in the sentence' may be words taken from the sentence in the input language or the target language. This is because the sentence in the input language contains all of the information needed for proper translation of that sentence to a sentence in the target language.
  • Steps 204 and 210 may leverage off training or retraining an automatic grammar error correction network.
  • the error correction module can be trained to learn differences between the grammatical sentence structures used in the input language, or in improper uses of the target language by people proficient in the input language, and those used in the target language. This can further assist with proper translation and use as tense, plural or singular form and other information can be extracted from grammatical information.
  • step 202 can include the receipt of a set of inputs that includes inputs from two or more languages.
  • the N processing components 110 then apply the one-hot encoding parameter to differentiate between input languages so that the error correction module can learn a set of common errors in use of the target language by speakers of the input language and/or in translation of the input language into the target language.
  • a one-hot encoding categorical parameter may be applied for the target language.
  • the error correction module can concurrently learn from a large number of inputs from various input languages and target languages. The error correction module can therefore learn features for each respective input language and target language across all inputs for those languages, regardless of the language pair to which each input belongs.
  • the error correction module can concurrently learn a set of common errors for each input language-target language pair.
  • the resulting model 118 can be used to correct errors in translations, and freely generated sentences (uses of the target language), attempted by second language learners in the target language.
  • the model 118 may be used by a speaker being provided a prompt (e.g. a sentence) in the input language, the speaker attempts to produce a sentence in the target language based on the prompt (i.e. a further input to the system 100 after training the error correction module) - Step 212.
  • the speaker may attempt to produce an input in the target language without reference to a prompt.
  • the speaker will generally be a native speaker of the input language - i.e. a second language learner.
  • the system 100 receives the input at Step 214 and applies the trained model 118 to the attempted sentence to generate a multi-word corrected sentence based on the attempted sentence and the common errors. For the second language learner to then learn from any errors, the corrected sentence is outputted, indicating corrections of the translation errors - Step 216.
  • the user interface can come in various forms such as a browser-based application or a native application on a computer.
  • the interface 400 includes a prompt 402, presently prompting the user to generate a sentence in the target language using a word in the target language, presently "assume".
  • the interface 400 also provides a definition 404 of the word "assume" in the input language.
  • the user inputs an attempted use of the word in a sentence in the input field 406 and clicks "SUBMIT" - 408.
  • the interface 400 displays a corrected output 410 and a set of errors 412 in the attempted use of the word "assume", which include common errors made in translating from the input language to the target language, along with routine error correction such as spelling correction.
  • the system accessed through the interface 400 therefore corrects any mistakes in the user input (imperfect use) and shows the errors detected and changes that have been made to transition from the user's input to the corrected sentence (proper use). This can be achieved by passing the user generated sentence as input to the trained model 118 (which now operates as an automated grammar error correction machine learning model) and outputting the corrected sentence as output. If the corrected sentence is equivalent to the original sentence (i.e. the user generated sentence), the original sentence is deemed to be correct and no further action is taken. If the corrected sentence is not equivalent to the original sentence, this indicates that the original sentence is erroneous.
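The compare-and-report logic described above can be sketched as follows. Here `fake_model` is a hypothetical stand-in for the trained model 118, and stdlib difflib is used to surface the word-level changes.

```python
# Sketch of the comparison step: pass the user's sentence through the
# trained correction model; if the corrected sentence differs, report
# the word-level changes. `correct` stands in for the trained model 118.
from difflib import ndiff

def check_sentence(user_sentence: str, correct) -> dict:
    corrected = correct(user_sentence)
    if corrected == user_sentence:
        # Corrected output equals the input: the sentence is deemed correct.
        return {"correct": True, "corrected": corrected, "changes": []}
    # Otherwise, collect the inserted/removed words for display to the learner.
    changes = [d for d in ndiff(user_sentence.split(), corrected.split())
               if d.startswith(("- ", "+ "))]
    return {"correct": False, "corrected": corrected, "changes": changes}

# Hypothetical stand-in model that fixes one common error.
fake_model = lambda s: s.replace("very excited", "am very excited")
result = check_sentence("I very excited", fake_model)
```

The `changes` list corresponds to the set of errors 412 shown on the interface, while `corrected` corresponds to the corrected output 410.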
  • the types of errors are classified as one of the common errors found in translations from the input language to the target language, uses of the target language based on a prompt or uses of the target language in freely generated inputs.
  • information about the original sentence, correction and error types can be recorded and saved either locally or on the cloud. Over time, history can be generated for the second language learner to help them monitor progress. Moreover an activity record can be generated from which the system can understand the English proficiency of the second language learner and, depending on the second language learner's history of errors made, the relevant practices and questions will be suggested to have the learner work on their weaknesses.
  • the activity system can produce an activity corresponding to each common error, or each common error made by the particular second language learner.
  • the activity includes a practice exercise for correcting the respective common error. The selection of follow-up practice exercises, from all exercises for correcting the common mistakes, can be achieved by simple statistical analysis of the frequency of mistakes in a particular activity - for example, in the event that the second language learner makes a particular common error more frequently than any other common error, then the activity can be directed to the correction of that particular common error.
  • selection of the activity can also be achieved using a weighted combination of factors such as mistakes made in another activity. For example, responses provided in multi-choice exercises, sentence creation exercises, and translation exercises can each yield different information about the strengths and weaknesses of a student. That information can be used to identify further activities to refine those strengths or improve those weaknesses.
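One possible sketch of such a weighted selection is shown below; the activity weights and error labels are illustrative assumptions, not values from the disclosure.

```python
# Sketch of activity selection: score each common error by a weighted
# combination of how often the learner made it across activity types,
# then focus the next activity on the highest-scoring error.
from collections import Counter

# Hypothetical weights: e.g. errors in free translation may be weighted
# more heavily than errors in multiple-choice exercises.
ACTIVITY_WEIGHTS = {"multi_choice": 0.5, "sentence_creation": 1.0, "translation": 1.5}

def pick_focus_error(history):
    """history: list of (activity_type, error_label) mistake records."""
    scores = Counter()
    for activity_type, error in history:
        scores[error] += ACTIVITY_WEIGHTS.get(activity_type, 1.0)
    return scores.most_common(1)[0][0]

history = [("translation", "tense"), ("multi_choice", "article"),
           ("sentence_creation", "tense")]
focus = pick_focus_error(history)  # "tense": 1.5 + 1.0 outweighs "article": 0.5
```

The frequency-only approach described earlier is the special case where every activity type has weight 1.0.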
  • the system 100 may also be used for verbal or speech error correction.
  • the system 100 can be used for pronunciation analysis, visualisation and feedback.
  • spoken languages differ in the number of phones through which the language is conveyed.
  • a phone in the phonetics definition refers to any distinct speech sound or gesture, regardless of whether the exact sound is critical to the meanings of words.
  • the English language has 24 consonant sounds and 20 vowel sounds.
  • Mandarin has 23 consonant sounds and 24 vowel sounds whilst the Japanese language has 15 consonant sounds and 5 vowel sounds. This is one of the reasons why it is difficult to learn pronunciations of new languages that have phones that are outside of a second language learner's native language.
  • for a Japanese ESL learner, it can be difficult to pronounce English words because the English language has more phones than their native language, Japanese. Additionally, Japanese has many loan words from English, for example "chocolate" becomes "チョコレート" or "chokoreto". When learning these words in English, ESL learners in Japan often find themselves using the borrowed form of the English word because these words have been modified to be easier for native Japanese speakers to pronounce.
  • Modern speech recognition systems are trained to detect and differentiate phones that are found in a particular language through the audio data that is fed through the machine learning system during the training process. This method, however, fails to properly pick up the mispronounced phones that are outside the language that the speech recognition model was trained on.
  • the error correction module can be trained to be a phone detection system that is able to recognize the spoken phones of the second language learner and provide a visualization in the form of an animated lip, teeth, and tongue model.
  • the model 118 can be configured as a speech recognition model that is able to recognise phones in the mispronounced inputs in the target language, made by a speaker of the input language, and in the correctly pronounced inputs made by a speaker of the target language.
  • the display 102 comprises a visualiser, or animated phoneme visualiser, allowing second language learners to identify differences in mouth positions (lips, teeth, tongue, etc.) required to pronounce a given word accurately.
  • the input received at Step 202 may therefore comprise improper use in the form of speech inputs of a person whose native tongue is the input language - i.e. mispronounced inputs.
  • the term "improper use” or “imperfect use” can be used to refer to an erroneous translation of a prompt sentence given to the user in the input language, improper pronunciation of a sentence in the target language displayed to the user, or other erroneous response in the target language to a prompt given in either the input language or the target language.
  • the error correction module may then be trained at Step 204 to identify mispronounced phones.
  • the error correction module may learn phones of each input or speech segment - Step 222 - these can be given in one or both of the input language and target language - and determine differences between phones used in the mispronounced inputs and phones used in the correctly pronounced inputs - Step 224.
  • the display 102 of the system 100 of Figure 1 includes a visualiser.
  • the visualiser visually demonstrates a mouth position for each mispronounced phone.
  • the visualiser produces a static image of the correct mouth position and in other instances the visualiser produces an animation for demonstrating the mouth position.
  • the static image or animation may be accompanied by an audio feed of the correct pronunciation for the phone.
  • the system can receive a further speech input generated by a user in the target language at Step 214.
  • the further speech input is in the target language and may correspond to a predetermined prompt
  • the prompt given at Step 21 in the input language may correspond to a prompt in the target language that the user simply needs to read, or the further speech input may be freely formed.
  • the trained model is applied to the further speech input to identify mispronounced phones. Thereafter, at Step 218, a visualization of each mispronounced phone is outputted.
  • Speech recognition can be achieved using existing open-source speech recognition models to detect the words (or sounds) that the learner is articulating and convert them into phones. For each of the phones, there is a corresponding keyframe representing the positions of the various mouth parts (lips, teeth, tongue, etc.). Where an animation is provided, the animation will transition to the keyframe of each detected phone. For example, where pronunciation of a phone requires the mouth to move between multiple keyframes (images of the correct mouth positions for creating a sound), the animation may interpolate or warp from one keyframe to the next. This provides a smooth transition between keyframes.
  • the learner is given a word or a phrase to pronounce through a microphone or other audio device. Their pronunciation audio is recorded for processing by the system - step 502.
  • the speech is then analysed by a speech recognition module (Step 504) and the phones are then picked out (Step 506).
  • the phone is subsequently modelled on a mouth model - Step 508.
  • the learner is able to play back and compare the differences between the correct pronunciation and their pronunciation through both auditory and visual feedback. Instructions to guide the learner to understand the mistakes and subtle differences between similar sounding phones are also presented to the user.
  • 22 - R roll the tongue to the back of the mouth without pushing onto the upper palate.
  • the system 100 can also rank the aptitude of different learners and the difficulty of language learning tasks. With reference to method 200, and particularly step 226, by analysing the performance of a particular learner and mistakes across many students of a similar language proficiency in the target language, the machine learning model may identify or approximate the aptitude of the learner in comparison to his or her peers. Further, an activity can also be given a difficulty rating based on how many learners the system predicts would make mistakes while attempting it.
  • the activities could include video conferencing with peers to provide feedback on their conversational skills and similar to the aforementioned activities, the errors made during the sessions can be recorded and evaluated. This will again provide more context for the system to recommend activities and questions that target the learner's weaknesses.
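The phone comparison described at Steps 222 and 224 can be illustrated with a minimal sketch: the learner's recognised phone sequence is aligned against a reference pronunciation, and differing phones are flagged. The alignment method (Python's standard `difflib`) and the ARPAbet-style phone labels are illustrative assumptions, not taken from the specification.

```python
from difflib import SequenceMatcher

def find_mispronounced_phones(spoken, reference):
    """Align the learner's phone sequence against the reference
    pronunciation and return (operation, spoken, expected) tuples
    for each phone that differs - a simplified stand-in for the
    phone comparison of Steps 222-224."""
    matcher = SequenceMatcher(a=spoken, b=reference, autojunk=False)
    errors = []
    for op, a0, a1, b0, b1 in matcher.get_opcodes():
        if op == "equal":
            continue  # phones match the reference; nothing to flag
        errors.append((op, spoken[a0:a1], reference[b0:b1]))
    return errors

# A common substitution: pronouncing "rice" with /l/ instead of /r/.
print(find_mispronounced_phones(["L", "AY", "S"], ["R", "AY", "S"]))
# [('replace', ['L'], ['R'])]
```

Each flagged phone could then be passed to the visualiser to display the correct mouth position.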
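Where the visualiser animates between keyframes, the interpolation described above can be sketched as a linear blend of mouth-position parameters. The parameter names (lip opening, tongue height) are hypothetical; a production visualiser would warp actual keyframe images rather than numeric parameters.

```python
def interpolate_keyframes(kf_a, kf_b, t):
    """Linearly blend two mouth-position keyframes at fraction
    t in [0, 1], giving the smooth transition between images of
    correct mouth positions described for the animation."""
    return {k: kf_a[k] + t * (kf_b[k] - kf_a[k]) for k in kf_a}

# Halfway between an open-mouth /ah/ shape and a closed /m/ shape
# (illustrative parameter values):
ah = {"lip_opening": 1.0, "tongue_height": 0.2}
m = {"lip_opening": 0.0, "tongue_height": 0.5}
mid = interpolate_keyframes(ah, m, 0.5)
print(round(mid["lip_opening"], 3))  # 0.5
```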


Abstract

Disclosed is a system for training and using a culturally-specific assisted language learning model. The system receives a set of inputs comprising first inputs from an input language culture and, for each first input, a corresponding second input and third input being respectively a proper translation and improper translation of the first input into a target language culture. The system includes an error correction module trained to identify relationships between inputs in the set of inputs, based on the input language culture, to produce a trained error correction module. The relationships comprise errors commonly made during translation of words of the input language culture into the target language culture (common errors). The system outputs a model comprising the trained error correction module, for translating between the input language culture and target language culture.

Description

SYSTEM AND METHOD FOR TRAINING A CULTURALLY-SPECIFIC ASSISTING LANGUAGE LEARNING MODEL
Technical Field
The present invention relates, in general terms, to systems and methods for training a culturally-specific assisted language learning model. The present invention also relates to using that model for assisted learning of a language.
Background
In modern society, there is very high interconnectedness both electronically and physically. The growth of international travel and multinational corporations has further increased the frequency of interaction between people from different language backgrounds.
A result of the high interconnectedness is an increased need to learn foreign languages - i.e. languages other than the native tongue of the speaker. Taking English as an example, English as a Second Language (ESL) learners have various struggles when learning English such as the lack of opportunities to practice and receive feedback. In Japan, a country with many ESL learners, one critical problem is that the teachers are often unable to provide personalized feedback for each learner.
It can also be daunting when attempting to converse in a foreign language. Speakers tend to lack confidence in utilizing foreign languages beyond the classroom because they lack the practice and are afraid to make mistakes. Moreover, proper pronunciation is difficult to learn alone, since there is no mechanism to obtain feedback on whether proper pronunciation has been used.
It is desirable that there be provided a system and/or method for overcoming or reducing at least one of the above-described problems with learning foreign
languages.
Summary
The present disclosure focuses on describing a culturally-specific assisted language learning system and method. In some cases, the system is a real-time grammar correction system with sentence error classification that is tunable to the culture of the foreign language learner. In some embodiments, speaking practice is afforded by utilising speech recognition technologies coupled with mouth visualizations for students to mimic.
In the description that follows, the term "language culture" is used. That term can refer to a single language or dialect. For example, an input language culture may be Bahasa Melayu. Language culture may instead refer to two or more languages or dialects that are sufficiently similar that the common errors in translation to a target language are common to each of the two or more languages or dialects. For example, the input language culture may include Bahasa Melayu and Bahasa Indonesia. Similarly, for the target language, the target language culture may be a single language or dialect, or two or more languages or dialects that are sufficiently similar that the same common errors are found by second language learners attempting to translate into each of the two or more languages or dialects.
Disclosed is a system for training a culturally-specific assisted language learning model, comprising: memory; an error correction module; and one or more processors, the memory storing instructions that, when executed by the one or more processors, cause the one or more processors to: receive a set of inputs, the set of inputs comprising first inputs from a user of an input language culture, each first input comprising an imperfect use of one or more words in a target language culture and, for
each first input, a second input comprising a proper use of one or more words in the target language culture, the proper use being a correction of the imperfect use; train the error correction module to identify relationships between inputs in the set of inputs, based on the input language culture, to produce a trained error correction module, the relationships comprising errors commonly made during use of the target language culture by a user of the input language culture (common errors); and output the model, the model comprising the trained error correction module, for correcting imperfect uses of words in the target language culture.
Note: the terms "improper use" or "imperfect use" are intended to convey uses of the target language by users, such as native users, of the input language, that have routine errors such as spelling errors, as well as translations (e.g. from a third input, being in the input language culture, which may also be used for training, to identify common errors in translation) in which each word is correctly directly translated, but the resulting input in the target language is incorrect since it does not take into account the influence of the context or grammar of a sentence or pronunciation. Similarly, the term "proper use" in this context refers to a correct translation into the target language.
The first inputs and second inputs may comprise multi-word inputs. A multi-word input can be a sentence, phrase, part-sentence or other multi-word structure, whether verbal or written, in which translation of one or more words relies on the grammar of the multi-word input or upon the other words in the multi-word input, or both. This type of input recognises that some words may have a direct translation that differs from the translation that would be used in the context of a sentence or other multi-word input.
In this case, training the error correction module may comprise training the error correction module to determine corrections needed to be made to each
first input to form a proper use of the target language, based on the second input. This can include identifying a proper use for one or more words in each multi-word first input based on other words in that multi-word first input and the input language culture. The instructions may cause the one or more processors to train the error correction module to learn one or more differences between a grammatical sentence structure of the imperfect use and a grammatical sentence structure of the proper use, and to identify a proper use of words in the target language culture based on the one or more differences.
The instructions may cause the one or more processors to train the error correction module by applying a one-hot encoding categorical parameter to the first inputs, the one-hot encoding categorical parameter specifying the input language culture. The set of inputs may comprise inputs from two or more input language cultures, and the instructions cause the one or more processors to apply the one-hot encoding parameter to train the error correction module to learn respective sets of common errors in use of the target language culture for each input language culture.
The imperfect translations and proper translations may be written translations. In other embodiments, imperfect use of the first input into the target language culture comprises a speech input of a person whose native tongue is in the input language culture (mispronounced inputs). Similarly, proper use of the target language culture may comprise a speech input of a person whose native tongue is in the target language culture (correctly pronounced inputs). The instructions may cause the one or more processors to train the error correction module to identify the relationships between inputs in the set of inputs by identifying mispronounced phones by determining: phones of each speech input; and a difference between phones used in the mispronounced inputs and phones used in the correctly pronounced inputs.
The system may further comprise a visualizer for visually demonstrating, for each mispronounced phone, a mouth position for properly pronouncing the mispronounced phone. The visualizer may be configured to use animations to visually demonstrate the mouth position for properly pronouncing mispronounced phones.
The system may further comprise an activities system. In this case, the instructions may cause the one or more processors to produce an activity, at the activities system, corresponding to each common error, the activity comprising a practice exercise for correcting the respective common error.
Also disclosed herein is the use of a system as described above, comprising: receiving, at the one or more processors, a further input in the target language culture from a speaker of the input language culture, the further input corresponding to a predetermined prompt in the input language culture; applying the trained model to the further input to generate a multi-word proper use based on the further input and the common errors; and outputting the proper use, indicating corrections of the common errors identified in the further input.
Also disclosed herein is the use of a system as described above, comprising: receiving, at the one or more processors, a further speech input in the target language culture from a speaker of the input language culture, the further speech input corresponding to a predetermined prompt; applying the trained model to the further speech input to identify mispronounced phones; and outputting a visualization of each mispronounced phone.
Also disclosed herein is a method for training a culturally-specific assisted language learning model, comprising: receiving a set of inputs, the set of inputs comprising first inputs from a user of an input language culture, each first input comprising an imperfect use
of one or more words in a target language culture and, for each first input, a second input comprising a proper use of one or more words in the target language culture, the proper use being a correction of the imperfect use; training an error correction module to identify relationships between inputs in the set of inputs, based on the input language culture, to produce a trained error correction module, the relationships comprising errors commonly made during use of the target language culture by a user of the input language culture (common errors); and outputting the model, the model comprising the trained error correction module, for correcting imperfect uses of words in the target language culture.
Existing automatic grammar error correction systems are generic for users of all cultures - there is no additional specification of the native language of the user. For second language learners, such as English as a second language (ESL) learners, depending on their native language, the mistakes that they make are likely to be similar across other second language learners with the same native language. One reason for this is the use of grammar-translation methods when learning second languages. For example, for English, students learn the grammatical rules of English and translate words and sentences from their native language to English and vice-versa. This often results in direct word-for-word translations that do not take into account the context of the sentence in which each word appears. This produces a direct translation but not a translation into the words a native speaker of the second language would use. Such translations are imperfect translations.
Advantageously, embodiments of the present invention provide a culture-specific automatic grammar error correcting method and system. By indicating the native language of the user during training - e.g. using one-hot encoding during the learning process - the resulting model (i.e. machine learning model) gives more informed corrections. This allows second language learners to get immediate feedback without relying on teachers.
Advantageously, embodiments of the invention provide for the identification of common errors including the classification of sentence errors. By identifying common errors, target language learners can identify the types of mistakes they are making.
Advantageously, embodiments of the present invention assist with pronunciation of phones in a second language. In particular, embodiments provide phone pronunciation visualization and feedback.
Brief description of the drawings
Embodiments of the present invention will now be described, by way of non-limiting example, with reference to the drawings in which:
Figure 1 is a system for each of training and using a culturally-specific assisted language learning model in accordance with present teachings;
Figure 2 is a method, implemented on the system of Figure 1, for training a culturally-specific assisted language learning model for multi-word input (e.g. sentence) error detection;
Figure 3 illustrates the difference between direct, imperfect use and proper use in the context of a sentence being translated from Japanese to English;
Figure 4 is an illustrative embodiment of an interface with which a user can interact to undertake culturally-specific assisted language learning; and
Figure 5 is a flowchart illustrating the broad steps in the use of the system of Figure 1 in the correction of speech mispronunciations.
Detailed description
In the description that follows, systems and methods are proposed for culturally-specific assisted language learning. The systems and methods train an error correction module to detect various types of errors. In some embodiments, the error correction module is trained to detect sentence errors. Sentence errors can result from incorrect grammar, from direct translation of words without regard to the changes in translation resulting from the context of the sentence, or both. In other embodiments, the error correction module is trained to detect phone mispronunciation. When a mispronounced phone is detected, a visualisation of the mouth position for correct pronunciation is displayed to the user. In still further embodiments, the error correction module is trained to detect both types of error along with, if desired, standard translational errors identified by existing translation systems.
The term "translation" may be used herein to refer to a user of an input language culture formulating a sentence or input in their mind and attempting to produce a corresponding sentence or input in the target language - i.e. translating the sentence. The system and method described herein may only receive the target language input, yet it is a translation. As such, unless context dictates otherwise, the terms "imperfect use" and "proper use" are interchangeable with the terms "imperfect translation" and "proper translation" respectively.
Such a system is shown in the block diagram in Figure 1. In particular, the system is represented by an exemplary computer device 100 in which embodiments of the invention, particularly method 200 of Figure 2, may be practiced. The computer device 100 may be a mobile computer device such as a smart phone, a wearable device, a palm-top computer, and multimedia Internet enabled cellular telephones, an on-board computing system or any other computing system, a mobile device such as an iPhone ™ manufactured by Apple™, Inc or one manufactured by LG™, HTC™ and Samsung™, for example, or other device.
As shown, the mobile computer device 100 includes the following components
in electronic communication via a bus 106:
(a) a display 102;
(b) memory - e.g. non-volatile (non-transitory) memory 104;
(c) random access memory ("RAM") 108;
(d) N processing components 110;
(e) a transceiver component 112 that includes N transceivers;
(f) user controls 114; and
(g) machine learning model - i.e. model comprising the trained error correction module 118.
Although the components depicted in Figure 1 represent physical components, Figure 1 is not intended to be a hardware diagram. Thus, many of the components depicted in Figure 1 may be realized by common constructs or distributed among additional physical components. Moreover, it is certainly contemplated that other existing and yet-to-be developed physical components and architectures may be utilized to implement the functional components described with reference to Figure 1.
The display 102 generally operates to provide a presentation of content to a user, and may be realized by any of a variety of displays (e.g., CRT, LCD, HDMI, micro-projector and OLED displays). The display 102 may display visualisations of mouth positions of mispronounced phones detected by the model 118. The visualisations may be static images or animations and may or may not be accompanied with an audible simulation of the correct pronunciation of each mispronounced phone.
In general, the non-volatile data storage 104 (also referred to as non-volatile memory) functions to store (e.g., persistently store) data and executable code. The system architecture may be implemented in memory 104, or by instructions stored in memory 104 - e.g. memory 104 may be a computer readable storage medium for storing instructions that, when executed by processor(s) 110, cause the processor(s) 110 to perform the method 200 described with reference to Figure 2.
In some embodiments for example, the non-volatile memory 104 includes bootloader code, modem software, operating system code, file system code, and code to facilitate the implementation of components well known to those of ordinary skill in the art, which are not depicted nor described for simplicity.
In many implementations, the non-volatile memory 104 is realized by flash memory (e.g., NAND or ONENAND memory), but it is certainly contemplated that other memory types may be utilized as well. Although it may be possible to execute the code from the non-volatile memory 104, the executable code in the non-volatile memory 104 is typically loaded into RAM 108 and executed by one or more of the N processing components 110.
The N processing components 110 in connection with RAM 108 generally operate to execute the instructions stored in non-volatile memory 104. As one of ordinary skill in the art will appreciate, the N processing components 110 may include a video processor, modem processor, DSP, graphics processing unit (GPU), and other processing components.
The transceiver component 112 includes N transceiver chains, which may be used for communicating with external devices via wireless networks. These can also be used to receive the inputs in both the input language culture and target language culture, to specify the input language culture or target language culture and so on. Each of the N transceiver chains may represent a transceiver associated with a particular communication scheme. For example, each transceiver may correspond to protocols that are specific to local area networks, cellular networks (e.g., a CDMA network, a GPRS network, a UMTS networks), and other types of communication networks.
The system 100 of Figure 1 includes or communicates with a model 118, being a model comprising or constituting the trained error correction module. The
model 118 may be part of the system 100 or, as shown, the system 100 may form a client terminal through which a user interacts with the model 118.
The system 100 may also be connected to any other appliance, such as an external server, a scanner, or any other resource from which, for example, documents can be sourced for numerical analysis.
The system 100 also includes an activities system or module 120. The task of the activities system 120 is to produce activities corresponding to the common errors that a particular second (i.e. target) language learner needs to focus on.
It should be recognized that Figure 1 is merely exemplary and in one or more exemplary embodiments, the functions described herein may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code encoded on a non-transitory computer-readable medium 104. Non-transitory computer-readable medium 104 includes both computer storage medium and communication medium including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a computer.
The system 100 may be used for training a culturally-specific assisted language learning model. Memory 104 may store instructions that, when executed by the N processing units 110, cause the N processing units 110 to perform the method 200 of Figure 2 which broadly includes:
Step 202: receiving a set of inputs;
Step 204: training the error correction module; and
Step 206: outputting the trained model.
For the purposes of illustration, the ensuing description will be made with reference to a first language or input language and a second language or target
language. However, it will be appreciated that the term "language" can be substituted for "language culture" unless context dictates otherwise.
Step 202 comprises receiving a set of inputs. The set of inputs includes first inputs and, for each first input, a second input. The first input comprises an imperfect use of one or more words in the target language. The second input is or includes a proper use (or proper translation) of one or more words of the target language, corresponding to the imperfect use. A "proper" use is an accurate translation such as one that would be made by a person who is proficient in the target language. The system 100 may also take a third input that is or includes an input in the input language culture corresponding to the first input and second input. The first and second inputs enable the system to identify differences between a translation made by a person whose native tongue is the input language but who has not yet mastered the target language, and a translation made by a person who is proficient in the target language.
The inputs may be text-based inputs for written error correction, or verbal or speech inputs for pronunciation error correction. Therefore, the N transceiver elements 112 may include receiver elements such as a microphone or keyboard.
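The structure of the set of inputs received at Step 202 might be represented as follows. The field names are illustrative assumptions rather than terms from the specification, and the third input is optional since training can proceed from the imperfect/proper pair alone.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrainingExample:
    """One element of the set of inputs received at Step 202."""
    first_input: str            # imperfect use of the target language
    second_input: str           # proper use - the correction of first_input
    third_input: Optional[str]  # optional corresponding input-language sentence
    input_language: str         # native language of the user, e.g. "ja"

# The loan-word example of Figure 3, expressed as a training example:
example = TrainingExample(
    first_input="I get rising tension when I hear this song",
    second_input="I get so excited when I hear this song",
    third_input=None,
    input_language="ja",
)
```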
Step 204 involves training the error correction module to identify relationships between the inputs - i.e. between improper and proper uses of the target language by a person proficient in the input language, or an input comprising the input language and the inputs comprising the imperfect and proper translations into the target language. The relationships that are identified are based on the input language. This enables the error correction module to identify common errors made by second language learners whose native tongue is the first (i.e. input) language, when translating into or using the target language - hereinafter referred to as "common errors". Therefore, the relationships identified between the inputs include errors commonly made during translation of words of the input language culture into the target language culture.
Per Step 206, the result of step 204 is that a trained error correction module is produced. This trained error correction module is outputted in a model for translating between the input language and target language.
Broadly speaking, Steps 202 to 206 result in a model that is specifically configured to identify errors that arise when people who speak a particular first language are attempting to translate into or use a particular second language. The model is therefore more accurately able to model the errors that challenge speakers of the particular first language when performing translations.
In contrast, existing automatic grammar error correction systems are generic for users of all cultures - i.e. there is no additional specification of the native language of the user. For ESL learners, for example, depending on their native language, the mistakes that they make are likely to be similar across other ESL learners with the same native language. One reason for this is the use of the grammar-translation methods when learning English - students learn the grammatical rules of English and translate words and sentences from their native language to English and vice-versa. These translations are often direct, word-for-word translations with the translated words simply being rearranged to meet the grammar requirements of the target language. Such translations are made without regard to the different translations of some words depending on grammar or context.
As an example, take the Japanese sentence この歌を聞くとテンションが上がる. This translates to "I get so excited when I hear this song". This sentence contains the phrase テンションが上がる, which translates literally or directly to "the tension is rising". However, the phrase more correctly translates to "I get so excited". A possible mistake a Japanese ESL learner might make would be to incorrectly translate この歌を聞くとテンションが上がる to "I get rising tension when I hear this song", as illustrated by the arrows 300 in Figure 3. The sentence is grammatically correct, but a native speaker of the target language, when taking into consideration the grammar or
the words surrounding each word being translated (i.e. the context of a multi-word input), would make a different translation as reflected by arrows 302.
This illustrates the problems of mistakes arising from the usage in some languages of loan words that have had their meanings borrowed and modified. In the absence of information about the learner's native language, automatic sentence grammar correction systems will likely provide incorrect translations.
The present error correction module may be, or comprise, a grammar error correction model that, in use, takes a multi-word input including the sentence to be corrected and the native language of the learner as input. This is achieved by augmenting an existing automatic grammar error correction network and retraining for the native language using a one-hot encoding categorical parameter. The one-hot encoding categorical parameter specifies the input language or native language of the learner. The one-hot encoding categorical parameter can be used as an additional label to retrain a pre-existing automatic grammar error correction network - e.g. a neural network with the one-hot encoding categorical parameters being densely connected to a layer in the neural network. The resultant machine learning model outputted at Step 206 will be tuned towards a particular learner's native language or language culture.
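As a sketch of the one-hot encoding categorical parameter described above, the learner's native language can be encoded as a vector and concatenated with the features fed into the network, so that it can be densely connected to a subsequent layer. The language list and feature values here are illustrative assumptions, not taken from the specification.

```python
LANGUAGES = ["ja", "ko", "ms", "zh"]  # hypothetical supported input languages

def one_hot(language):
    """One-hot encoding categorical parameter specifying the
    learner's native (input) language."""
    vec = [0.0] * len(LANGUAGES)
    vec[LANGUAGES.index(language)] = 1.0
    return vec

def conditioned_features(sentence_features, language):
    # Concatenating the one-hot vector lets it be densely connected
    # to a layer of the grammar error correction network.
    return sentence_features + one_hot(language)

print(conditioned_features([0.12, 0.87], "ja"))
# [0.12, 0.87, 1.0, 0.0, 0.0, 0.0]
```

With this conditioning, a single network can be trained on inputs from several language cultures while still learning culture-specific common errors.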
For illustration purposes, the term "sentence" will be used hereafter. However, that term may be replaced with "multi-word input" or another type of multi-word input, unless context dictates otherwise.
With further reference to Figure 2, path 208 can be applied when the first and second inputs are sentences. All along the path 208, training the error correction module involves learning proper translations or uses of the target language based on proper uses of the target language and imperfect uses of the target language by a person proficient in the input language - this can also include learning proper translations from an input sentence in an input language (i.e. third input) to a corresponding sentence in the target language - Step 210.
Rather than being a word-for-word translation, Step 210 involves identifying a proper translation for one or more words in each input sentence based on other words in the input sentence and the input language. To use the example given in Figure 3, a sentence may be translated from Japanese to English with the translation of a particular word meaning either "tension" or "excitement" depending on other words in the sentence. The 'other words in the sentence' may be words taken from the sentence in the input language or the target language. This is because the sentence in the input language contains all of the information needed for proper translation of that sentence to a sentence in the target language.
As mentioned above, Steps 204 and 210 may leverage training or retraining of an automatic grammar error correction network. To that end, the error correction module can be trained to learn differences between the grammatical sentence structures used in the input language, or in improper uses of the target language by people proficient in the input language, and those used in the target language. This can further assist with proper translation and use, as tense, plural or singular form and other information can be extracted from grammatical information.
As a result of the one-hot encoding categorical parameter, Step 202 can include the receipt of a set of inputs that includes inputs from two or more languages. The processing components 110 then apply the one-hot encoding parameter to differentiate between input languages, so that the error correction module can learn a set of common errors in use of the target language by speakers of the input language and/or in translation of the input language into the target language. Similarly, a one-hot encoding categorical parameter may be applied for the target language. As a result, the error correction module can concurrently learn from a large number of inputs from various input languages and target languages. The error correction module can therefore learn features for each respective input language and target language across all inputs for those languages, without regard to the language to which, or from which, they are to be translated. Similarly, the error correction module can concurrently learn a set of common errors for each input language-target language pair.
Once the error correction module is trained, the resulting model 118 can be used to correct errors in translations, and in freely generated sentences (uses of the target language), attempted by second language learners in the target language. With further reference to Figure 2, the model 118 may be used by providing a speaker with a prompt (e.g. a sentence) in the input language; the speaker attempts to produce a sentence in the target language based on the prompt (i.e. a further input to the system 100 after training the error correction module) - Step 212. Alternatively, the speaker may attempt to produce an input in the target language without reference to a prompt. In this case, the speaker will generally be a native speaker of the input language - i.e. a second language learner. The system 100 receives the input at Step 214 and applies the trained model 118 to the attempted sentence to generate a multi-word corrected sentence based on the attempted sentence and the common errors. So that the second language learner can then learn from any errors, the corrected sentence is outputted, indicating corrections of the translation errors - Step 216.
An example of an interface for using the model 118 is shown in Figure 4. The user interface can come in various forms such as a browser-based application or a native application on a computer. The interface 400 includes a prompt 402, presently prompting the user to generate a sentence in the target language using a word in the target language, presently "assume". The interface 400 also provides a definition 404 of the word "assume" in the input language. The user inputs an attempted use of the word in a sentence in the input field 406 and clicks "SUBMIT" - 408. The interface 400 then displays a corrected output 410 and a set of errors 412 in the attempted use of the word "assume", which include common errors made in translating from the input language to the target language, along with routine error correction such as spelling correction.
The system accessed through the interface 400 therefore corrects any mistakes in the user input (imperfect use) and shows the errors detected and changes that have been made to transition from the user's input to the corrected sentence (proper use). This can be achieved by passing the user generated sentence as input to the trained model 118 (which now operates as an automated grammar error correction machine learning model) and outputting the corrected sentence as output. If the corrected sentence is equivalent to the original sentence (i.e. the user generated sentence), the original sentence is deemed to be correct and no further action is taken. If the corrected sentence is not equivalent to the original sentence, this indicates that the original sentence is erroneous. If the original sentence is erroneous, the types of errors are classified as one of the common errors found in translations from the input language to the target language, uses of the target language based on a prompt, or uses of the target language in freely generated inputs. To use translation from Japanese to English as an example, there are 12 common mistakes made by Japanese ESL learners.
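The equivalence check and change reporting described above can be sketched with a token-level diff. The patent does not specify the comparison algorithm, so the following is an illustrative sketch using Python's standard difflib:

```python
import difflib

def report_corrections(original: str, corrected: str) -> list:
    """Compare the learner's sentence with the model's corrected sentence.
    Returns an empty list when the two are equivalent (the input is deemed
    correct); otherwise returns human-readable descriptions of each change
    made in transitioning from the imperfect use to the proper use."""
    orig_tokens, corr_tokens = original.split(), corrected.split()
    edits = []
    matcher = difflib.SequenceMatcher(a=orig_tokens, b=corr_tokens)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "replace":
            edits.append("replace '%s' with '%s'"
                         % (" ".join(orig_tokens[i1:i2]), " ".join(corr_tokens[j1:j2])))
        elif op == "delete":
            edits.append("delete '%s'" % " ".join(orig_tokens[i1:i2]))
        elif op == "insert":
            edits.append("insert '%s'" % " ".join(corr_tokens[j1:j2]))
    return edits
```

In a full system each reported change would additionally be classified against the set of common errors learned for the relevant input language-target language pair.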
To facilitate progressive learning, information about the original sentence, the correction and the error types can be recorded and saved either locally or on the cloud. Over time, a history can be generated for the second language learner to help them monitor progress. Moreover, an activity record can be generated from which the system can gauge the English proficiency of the second language learner and, depending on the second language learner's history of errors, suggest relevant practices and questions to have the learner work on their weaknesses.
After each activity, or after building a history of errors, further activities can be tailored to the second language learner. With reference to Figure 1, the activity system can produce an activity corresponding to each common error, or each common error made by the particular second language learner. The activity includes a practice exercise for correcting the respective common error. The selection of follow-up practice exercises, from all exercises for correcting the
common mistakes, can be achieved by simple statistical analysis of the frequency of mistakes in a particular activity - for example, in the event that the second language learner makes a particular common error more frequently than any other common error, then the activity can be directed to the correction of that particular common error. In other circumstances, selection of the activity can also be achieved using a weighted combination of factors such as mistakes made in another activity. For example, responses provided in multi-choice exercises, sentence creation exercises, and translation exercises can each yield different information about the strengths and weaknesses of a student. That information can be used to identify further activities to refine those strengths or improve those weaknesses.
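The frequency-based selection just described can be sketched as follows. The error-type names and the weighting scheme are illustrative assumptions rather than details given in the patent:

```python
from collections import Counter

def select_next_activity(error_history: list, weights=None) -> str:
    """Pick the common-error type the learner should practise next.
    With no weights this is the simple frequency analysis described above:
    the most frequent error wins. Optional per-error weights (e.g. derived
    from mistakes in other activity types) bias the choice."""
    weights = weights or {}
    counts = Counter(error_history)
    # score = raw frequency, optionally scaled by a per-error weight
    return max(counts, key=lambda e: counts[e] * weights.get(e, 1.0))
```

For example, a learner whose history is dominated by a hypothetical "article_omission" error would be directed to an activity correcting that error, unless a weight from another exercise type tips the balance elsewhere.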
Some examples of common errors are set out in Table 1.
Table 1 (content provided as images in the original publication).
The system 100 may also be used for verbal or speech error correction. In particular, the system 100 can be used for pronunciation analysis, visualisation and feedback. In this regard, spoken languages differ in the number of phones with which the language is conveyed. A phone (in the phonetics definition) refers to any distinct speech sound or gesture, regardless of whether the exact sound is critical to the meanings of words.
As shown in Table 2, the English language has 24 consonant sounds and 20 vowel sounds. In comparison, Mandarin has 23 consonant sounds and 24 vowel sounds whilst the Japanese language has 15 consonant sounds and 5 vowel sounds. This is one of the reasons why it is difficult to learn pronunciations of new languages that have phones that are outside of a second language learner's
native language. Learners of a language with more phones than their native language will find it difficult to differentiate and mimic the foreign phones.
Language | Consonant sounds | Vowel sounds
English | 24 | 20
Mandarin | 23 | 24
Japanese | 15 | 5

Table 2: languages and phones for each of consonant and vowel sounds
Taking a Japanese ESL learner for example, it can be difficult for an ESL learner to pronounce English words because the English language has more phones than their native language, Japanese. Additionally, Japanese has many loan words from English; for example, "chocolate" becomes "チョコレート" or "chokoreto". When learning these words in English, ESL learners in Japan often find themselves using the borrowed form of the English word because these words have been modified to be easier for native Japanese speakers to pronounce.
Modern speech recognition systems are trained to detect and differentiate phones that are found in a particular language through the audio data that is fed through the machine learning system during the training process. This method, however, fails to properly pick up the mispronounced phones that are outside the language that the speech recognition model was trained on.
To provide automated pronunciation feedback, the error correction module can be trained to be a phone detection system that is able to recognize the spoken phones of the second language learner and provide a visualization in the form of an animated lip, teeth, and tongue model.
In particular, the model 118 can be configured as a speech recognition model that is able to recognise phones in the mispronounced inputs in the target language, made by a speaker of the input language, and correctly pronounced
inputs in the target language. Moreover, the display 102 comprises a visualiser, or animated phoneme visualiser, allowing second language learners to identify differences in mouth positions (lips, teeth, tongue, etc.) required to pronounce a given word accurately.
With reference to Figure 2, the input received at Step 202 may therefore comprise improper use in the form of speech inputs of a person whose native tongue is the input language - i.e. mispronounced inputs. As used in this context, the term "improper use" or "imperfect use" can be used to refer to an erroneous translation of a prompt sentence given to the user in the input language, improper pronunciation of a sentence in the target language displayed to the user, or other erroneous response in the target language to a prompt given in either the input language or the target language.
The error correction module may then be trained at Step 204 to identify mispronounced phones. With reference to path 220, the error correction module may learn phones of each input or speech segment - Step 222 - these can be given in one or both of the input language and the target language - and determine differences between phones used in the mispronounced inputs and phones used in the correctly pronounced inputs - Step 224.
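The phone-difference determination of Step 224 can be sketched as a sequence alignment between the phones recognised in the learner's speech and a reference pronunciation. The patent does not specify the alignment method, and the ARPAbet-style phone symbols below are placeholders:

```python
import difflib

def mispronounced_phones(learner: list, reference: list) -> list:
    """Align the phones recognised in the learner's speech (mispronounced
    input) against the reference phones (correctly pronounced input) and
    return (expected, actual) pairs for each mismatch. A missing phone is
    reported with an empty string as the 'actual' value."""
    pairs = []
    matcher = difflib.SequenceMatcher(a=reference, b=learner)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "replace":
            pairs.extend(zip(reference[i1:i2], learner[j1:j2]))
        elif op == "delete":  # phone absent from the learner's speech
            pairs.extend((p, "") for p in reference[i1:i2])
    return pairs
```

For instance, "right" (/R AY T/) pronounced with an initial "L" sound would yield a single mismatch pair of expected "R" against actual "L", which could then drive the visualiser described below.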
The display 102 of the system 100 of Figure 2 includes a visualiser. The visualiser usually demonstrates a mouth position for each mispronounced phone. In some instances, the visualiser produces a static image of the correct mouth position and in other instances the visualiser produces an animation for demonstrating the mouth position. In each case, the static image or animation may be accompanied by an audio feed of the correct pronunciation for the phone.
Therefore, after training the error correction module at Step 204, outputting the resulting model at Step 206, the system can receive a further speech input generated by a user in the target language at Step 214. The further speech input is in the target language and may correspond to a predetermined prompt
in the input language that the user then seeks to translate, may correspond to a prompt in the target language that the user simply needs to read, or may be a freely formed speech input. At Step 216, the trained model is applied to the further speech input to identify mispronounced phones. Thereafter, at Step 218, a visualization of each mispronounced phone is outputted.
Speech recognition can be achieved using existing open-source speech recognition models to detect the words (or sounds) that the learner is articulating and convert them into phones. For each phone, there will be a corresponding keyframe representing the positions of the various mouthparts (lips, teeth, tongue, etc.) for that phone. Where an animation is provided, the animation will transition to the respective keyframe. For example, where pronunciation of a phone requires the mouth to move between multiple keyframes (images of the correct mouth positions for creating a sound), the animation may interpolate or warp from one keyframe to the next. This provides a smooth transition between keyframes.
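The interpolation between keyframes can be sketched as a linear blend of mouth-part positions. The part names and scalar positions below are illustrative assumptions; a production visualiser would likely use full 2D/3D vertex data:

```python
def interpolate_keyframes(start: dict, end: dict, t: float) -> dict:
    """Linearly interpolate between two mouth-position keyframes.
    Each keyframe maps a mouth part to a scalar position; t runs from
    0.0 (start keyframe) to 1.0 (end keyframe), producing the smooth
    transition described above."""
    return {part: (1 - t) * start[part] + t * end[part] for part in start}
```

Sampling t at successive animation frames (e.g. t = 0.0, 0.25, 0.5, ...) yields intermediate mouth positions between the two keyframes.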
This process is shown schematically in flow chart 500 of Figure 5. The learner is given a word or a phrase to pronounce through a microphone or other audio device. Their pronunciation audio is recorded for processing by the system - Step 502. Using the model trained at Step 204 of method 200, the speech is then analysed by a speech recognition module (Step 504) and the phones are then picked out (Step 506). The phone is subsequently modelled on a mouth model - Step 508. With the recorded audio, the learner is able to play back and compare the differences between the correct pronunciation and their pronunciation through both auditory and visual feedback. Instructions to guide the learner to understand the mistakes and subtle differences between similar sounding phones are also presented to the user.
One such instruction, endeavouring to help learners differentiate between the "L" and "R" sounds in the words "light" and "right", could be:
- L: the tongue is pointed and touching the back of the upper teeth; and
- R: roll the tongue to the back of the mouth without pushing onto the upper palate.
The system 100 can also rank the aptitude of different learners and the difficulty of language learning tasks. With reference to method 200, and particularly step 226, by analysing the performance of a particular learner and mistakes across many students of a similar language proficiency in the target language, the machine learning model may identify or approximate the aptitude of the learner in comparison to his or her peers. Further, an activity can also be given a difficulty rating based on how many learners the system predicts would make mistakes while attempting it.
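The difficulty rating just described can be sketched as the fraction of learners predicted (or observed) to make a mistake while attempting an activity. This is an illustrative formulation; the patent does not fix a particular rating formula:

```python
def activity_difficulty(attempts: list) -> float:
    """Rate an activity's difficulty as the proportion of learners who made
    (or are predicted to make) a mistake while attempting it. `attempts`
    holds one True/False entry per learner (True = mistake made)."""
    if not attempts:
        return 0.0  # no data: treat as unrated rather than divide by zero
    return sum(attempts) / len(attempts)
```

An activity that 3 out of 4 learners get wrong would thus be rated 0.75, and harder activities could be reserved for learners whose aptitude ranking exceeds their peers'.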
Besides supporting learning by the learner alone on the platform, the activities could include video conferencing with peers to provide feedback on conversational skills and, similarly to the aforementioned activities, the errors made during these sessions can be recorded and evaluated. This again provides more context for the system to recommend activities and questions that target the learner's weaknesses.
It will be appreciated that many further modifications and permutations of various aspects of the described embodiments are possible. Accordingly, the described aspects are intended to embrace all such alterations, modifications, and variations that fall within the spirit and scope of the appended claims.
Throughout this specification and the claims which follow, unless the context requires otherwise, the word "comprise", and variations such as "comprises" and "comprising", will be understood to imply the inclusion of a stated integer or step or group of integers or steps but not the exclusion of any other integer or step or group of integers or steps.
The reference in this specification to any prior publication (or information derived from it), or to any matter which is known, is not, and should not be
taken as an acknowledgment or admission or any form of suggestion that that prior publication (or information derived from it) or known matter forms part of the common general knowledge in the field of endeavour to which this specification relates.

Claims
1. A system for training a culturally-specific assisted language learning model, comprising:
memory;
an error correction module; and
one or more processors,
the memory storing instructions that, when executed by the one or more processors, cause the one or more processors to:
receive a set of inputs, the set of inputs comprising first inputs from a user of an input language culture, each first input comprising an imperfect use of one or more words in a target language culture and, for each first input, a second input comprising a proper use of one or more words in the target language culture, the proper use being a correction of the imperfect use;
train the error correction module to identify relationships between inputs in the set of inputs, based on the input language culture, to produce a trained error correction module, the relationships comprising errors commonly made during use of the target language culture by a user of the input language culture (common errors); and
output the model, the model comprising the trained error correction module, for correcting imperfect uses of words in the target language culture.
2. The system of claim 1, wherein at least one of: the first language culture is a first language; and the second language culture is a second language.
3. The system of claim 1 or 2, wherein:
the first inputs and second inputs comprise multi-word inputs, wherein training the error correction module comprises: training the error correction module to determine corrections needed to be made to each first input to form a proper use of the target language, based on the second input, including identifying a proper use for one or more words in each multi-word first input based on other words in each multi-word first input and the input language culture.

The system of claim 3, wherein the instructions cause the one or more processors to train the error correction module to learn one or more differences between a grammatical sentence structure of the imperfect uses and a grammatical sentence structure of proper uses, and to identify a proper use of words in the target language culture based on the one or more differences.

The system of claim 1, wherein the instructions cause the one or more processors to train the error correction module by applying a one-hot encoding categorical parameter to the first inputs, the one-hot encoding categorical parameter specifying the input language culture.

The system of claim 5, wherein the set of inputs comprises inputs from two or more input language cultures, and the instructions cause the one or more processors to apply the one-hot encoding parameter to train the error correction module to learn respective sets of common errors in use of the target language culture for each input language culture.

The system of any one of claims 1, 2, 5 and 6, wherein imperfect use of the target language culture comprises a speech input of a person whose native tongue is in the input language culture (mispronounced inputs).
The system of claim 7, wherein each proper use comprises a speech input of a person whose native tongue is in the target language culture (correctly pronounced inputs).

The system of claim 7 or 8, wherein the instructions cause the one or more processors to train the error correction module to identify the relationships between inputs in the set of inputs by identifying mispronounced phones by determining: phones of each speech input; and a difference between phones used in the mispronounced inputs and phones used in the correctly pronounced inputs.

The system of any one of claims 7 to 9, further comprising a visualizer for visually demonstrating, for each mispronounced phone, a mouth position for properly pronouncing the mispronounced phone.

The system of claim 10, wherein the visualizer is configured to use animations to visually demonstrate the mouth position for properly pronouncing mispronounced phones.

The system of any one of claims 7 to 11, further comprising an activities system, wherein the instructions cause the one or more processors to produce an activity, at the activities system, corresponding to each common error, the activity comprising a practice exercise for correcting the respective common error.

Use of a system according to claim 3 or 4, comprising: receiving, at the one or more processors, a further input in the target language culture from a speaker of the input language culture, the further input corresponding to a predetermined prompt in the input language culture;
applying the trained model to the further input to generate a multi-word proper use based on the further input and the common errors; and outputting the proper use, indicating corrections of the common errors identified in the further input.

Use of a system according to any one of claims 7 to 12, comprising: receiving, at the one or more processors, a further speech input in the target language culture from a speaker of the input language culture, the further speech input corresponding to a predetermined prompt; applying the trained model to the further speech input to identify mispronounced phones; and outputting a visualization of each mispronounced phone.

A method for training a culturally-specific assisted language learning model, comprising: receiving a set of inputs, the set of inputs comprising first inputs from a user of an input language culture, each first input comprising an imperfect use of one or more words in a target language culture and, for each first input, a second input comprising a proper use of one or more words in the target language culture, the proper use being a correction of the imperfect use; training an error correction module to identify relationships between inputs in the set of inputs, based on the input language culture, to produce a trained error correction module, the relationships comprising errors commonly made during use of the target language culture by a user of the input language culture (common errors); and outputting the model, the model comprising the trained error correction module, for correcting imperfect uses of words in the target language culture.
The method of claim 16, wherein: the first inputs and second inputs comprise multi-word inputs; and wherein training the error correction module comprises training the error correction module to determine corrections needed to be made to each third input to form a proper translation of the first input, based on the first input and second input, including identifying a proper translation for one or more words in each multi-word input based on other words in the multi-word input and the first language culture.

The method of claim 17, wherein training the error correction module comprises training the error correction module to learn one or more differences between a grammatical sentence structure of the imperfect uses and a grammatical sentence structure of the proper uses, and to identify a proper use of words in the target language culture based on the one or more differences.

The method of claim 16, wherein: imperfect use of the target language culture comprises a speech input of a person whose native tongue is in the input language culture
(mispronounced inputs); and proper use of the target language culture comprises a speech input of a person whose native tongue is in the target language culture
(correctly pronounced inputs).

The method of claim 19, wherein training the error correction module to identify the relationships between inputs in the set of inputs comprises training the error correction module to identify mispronounced phones by determining: phones of each speech input; and a difference between phones used in the mispronounced inputs and phones used in the correctly pronounced inputs.
PCT/SG2022/050301 2021-05-11 2022-05-10 System and method for training a culturally-specific assisting language learning model WO2022240358A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
SG10202104939P 2021-05-11
SG10202104939P 2021-05-11

Publications (1)

Publication Number Publication Date
WO2022240358A1 true WO2022240358A1 (en) 2022-11-17

Family

ID=84029905

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SG2022/050301 WO2022240358A1 (en) 2021-05-11 2022-05-10 System and method for training a culturally-specific assisting language learning model

Country Status (1)

Country Link
WO (1) WO2022240358A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050257147A1 (en) * 2000-03-31 2005-11-17 Microsoft Corporation Spell checker with arbitrary length string-to-string transformations to improve noisy channel spelling correction
US20070206017A1 (en) * 2005-06-02 2007-09-06 University Of Southern California Mapping Attitudes to Movements Based on Cultural Norms
CN106610930A (en) * 2015-10-22 2017-05-03 科大讯飞股份有限公司 Foreign language writing automatic error correction method and system
US20180308474A1 (en) * 2012-06-29 2018-10-25 Rosetta Stone Ltd. Systems and methods for modeling l1-specific phonological errors in computer-assisted pronunciation training system
CN112364990A (en) * 2020-10-29 2021-02-12 北京语言大学 Method and system for realizing grammar error correction and less sample field adaptation through meta-learning


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22807954

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22807954

Country of ref document: EP

Kind code of ref document: A1