CN108831212B - Auxiliary device and method for oral teaching - Google Patents
- Publication number
- CN108831212B CN108831212B CN201810689188.3A CN201810689188A CN108831212B CN 108831212 B CN108831212 B CN 108831212B CN 201810689188 A CN201810689188 A CN 201810689188A CN 108831212 B CN108831212 B CN 108831212B
- Authority
- CN
- China
- Prior art keywords
- processing unit
- standard
- syllable
- group
- weighted
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09B—EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
- G09B5/00—Electrically-operated educational appliances
- G09B5/06—Electrically-operated educational appliances with both visual and audible presentation of the material to be studied
- G09B5/065—Combinations of audio and video presentations, e.g. videotapes, videodiscs, television systems
-
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09B—EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
- G09B19/00—Teaching not covered by other main groups of this subclass
- G09B19/06—Foreign languages
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
Abstract
The invention relates to an auxiliary device and method for spoken language teaching. The device comprises a processing unit together with a storage unit, a voice input unit and an output unit that are each electrically connected with the processing unit. The storage unit stores a plurality of pieces of spoken language learning data, each comprising a training sentence, a plurality of standard syllable groups and weighting information; the training sentence is composed of a plurality of sequentially arranged standard words, and the standard syllable groups correspond to the standard words and are obtained by parsing the standard words in advance. The device has strong speech processing capability: it can accurately distinguish the user's pronunciation, evaluate the accuracy of the user's speech, and present virtual reality images to the user through voice interaction. The interactivity is strong, the user experience is good, and the sense of immersion is strong, which raises the user's interest in spoken language learning.
Description
Technical Field
The invention belongs to the technical field of spoken language teaching, and particularly relates to an auxiliary device and method for spoken language teaching.
Background
Spoken language practice is an essential part of learning a foreign language, and spoken language teaching is an important means of supporting it. The spoken language teaching approaches of the prior art mainly include audio-visual teaching, software teaching and real-person (on-site or remote) teaching. These prior-art approaches have the following disadvantages. Audio-visual teaching: the mode offers no interactivity with the user. Software teaching: it is difficult to achieve a sense of presence in a real spoken-language situation, so the user experience is poor. Real-person teaching: it is expensive, and some users feel too embarrassed in front of a real teacher to open their mouths.
Disclosure of Invention
In view of the above problems in the prior art, an object of the present invention is to provide an auxiliary device and method for spoken language teaching that avoid the above technical drawbacks.
In order to achieve the above object, the present invention provides the following technical solutions:
a spoken language teaching auxiliary device comprises a processing unit, and a storage unit, a voice input unit and an output unit which are respectively electrically connected with the processing unit.
Further, the output unit comprises a display screen and a loudspeaker which are respectively electrically connected with the processing unit.
Further, the voice input unit includes a microphone.
Further, the storage unit stores a plurality of pieces of spoken language learning data. Each piece of spoken language learning data includes a training sentence, a plurality of standard syllable groups and weighting information; the training sentence is composed of a plurality of sequentially arranged standard words, and the standard syllable groups correspond to the standard words and are obtained by parsing the standard words in advance. The weighting information includes a weighted word, which corresponds to one of the standard words in the training sentence, and a weighted value corresponding to that weighted word.
Further, the processing unit controls the output unit to output a training sentence. When receiving a voice from the voice input unit, the processing unit parses the voice to obtain a to-be-judged sentence composed of a plurality of to-be-judged words, together with a plurality of to-be-judged syllable groups respectively corresponding to the to-be-judged words. When the processing unit determines that at least one of the standard syllable groups does not match the to-be-judged syllable groups, it takes the at least one standard word corresponding to the at least one non-matching syllable group as at least one unmatched word. The processing unit uses a language identification model to generate, from the training sentence and the at least one unmatched word, an original score related to the at least one unmatched word. When the processing unit determines that the at least one unmatched word corresponds to the weighted word, it takes the weighted value corresponding to the weighted word as a target weighted value. The processing unit then generates a weighted score from the original score and the target weighted value, and controls the output unit to output an evaluation of the voice according to the weighted score.
A spoken language teaching auxiliary method, implemented by the spoken language teaching auxiliary device described above, comprises the following steps:
(A) the processing unit controls the output unit to output a training sentence;
(B) when receiving a voice from the voice input unit, the processing unit parses the voice to obtain a to-be-judged sentence composed of a plurality of to-be-judged words, together with a plurality of to-be-judged syllable groups respectively corresponding to the to-be-judged words;
(C) when the processing unit determines that at least one of the standard syllable groups does not match the to-be-judged syllable groups, it takes the at least one standard word corresponding to the at least one non-matching syllable group as at least one unmatched word;
(D) the processing unit uses a language identification model to generate, from the training sentence and the at least one unmatched word, an original score related to the at least one unmatched word;
(E) when the processing unit determines that the at least one unmatched word corresponds to the weighted word, it takes the weighted value corresponding to the weighted word as a target weighted value; and
(F) the processing unit generates a weighted score from the original score and the target weighted value, and controls the output unit to output an evaluation of the voice according to the weighted score.
Further, the step (C) includes:
(c1) when the processing unit determines that the standard syllable groups do not correspond one-to-one to the to-be-judged syllable groups, it judges the at least one standard syllable group without a corresponding to-be-judged syllable group as not matching;
(c2) when the processing unit determines that the standard syllable groups respectively correspond to the to-be-judged syllable groups, it determines whether each standard syllable group is identical to its corresponding to-be-judged syllable group, and judges the at least one standard syllable group that differs from its corresponding to-be-judged syllable group as not matching.
Further, in the step (F), the processing unit controls the display screen and/or the speaker to output the evaluation, and controls the display screen to display the at least one unmatched word.
Further, before the step (A), the processing unit controls the display screen to display a virtual reality image of the multimedia data according to the multimedia data included in the spoken language learning data.
Further, the language identification model is an N-Gram model.
The spoken language teaching auxiliary device and method provided by the invention offer strong speech processing capability: they can accurately distinguish the user's pronunciation, evaluate the accuracy of the user's speech, and present virtual reality images to the user through voice interaction. The interactivity is strong, the user experience is good, and the sense of immersion is strong, which raises the user's interest in spoken language learning and well satisfies the needs of practical application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A spoken language teaching auxiliary device comprises a storage unit, a voice input unit, an output unit and a processing unit; the storage unit, the voice input unit and the output unit are each electrically connected with the processing unit.
The storage unit stores a plurality of pieces of spoken language learning data. Each piece of spoken language learning data includes a training sentence, a plurality of standard syllable groups and weighting information; the training sentence is composed of a plurality of sequentially arranged standard words, and the standard syllable groups correspond to the standard words and are obtained by parsing the standard words in advance. The weighting information includes a weighted word, which corresponds to one of the standard words in the training sentence, and a weighted value corresponding to that weighted word.
The processing unit controls the output unit to output a training sentence. When the processing unit receives a voice from the voice input unit, it parses the voice to obtain a to-be-judged sentence composed of a plurality of to-be-judged words, together with a plurality of to-be-judged syllable groups respectively corresponding to the to-be-judged words. When the processing unit determines that at least one of the standard syllable groups does not match the to-be-judged syllable groups, it takes the at least one standard word corresponding to the at least one non-matching syllable group as at least one unmatched word. The processing unit uses a language identification model (here an N-Gram language model) to generate, from the training sentence and the at least one unmatched word, an original score related to the at least one unmatched word. When the processing unit determines that the at least one unmatched word corresponds to the weighted word, it takes the weighted value corresponding to the weighted word as a target weighted value. The processing unit then generates a weighted score from the original score and the target weighted value, and controls the output unit to output an evaluation of the voice according to the weighted score.
The output unit comprises a display screen and a speaker, each electrically connected with the processing unit; the processing unit controls the display screen and/or the speaker to output the evaluation, and controls the display screen to display the at least one unmatched word.
When determining whether the standard syllable groups match the to-be-judged syllable groups, the processing unit first judges whether the standard syllable groups correspond one-to-one to the to-be-judged syllable groups; any standard syllable group without a corresponding to-be-judged syllable group is judged as not matching.
When the standard syllable groups do respectively correspond to the to-be-judged syllable groups, the processing unit further determines whether each standard syllable group is identical to its corresponding to-be-judged syllable group, and judges the at least one standard syllable group that differs from its corresponding to-be-judged syllable group as not matching.
Each piece of spoken language learning data also includes multimedia data containing a virtual reality image related to the training sentence. Before controlling the output unit to output the training sentence of a piece of learning data, the processing unit controls the display screen to display the virtual reality image of the multimedia data included in that piece of learning data.
The voice input unit comprises a microphone. The spoken language teaching auxiliary device further includes a helmet-shaped shell; the storage unit, the voice input unit, the output unit and the processing unit are all mounted on this shell, so the device is a head-mounted display device that can be worn on the head.
The processing unit is provided with a speech recognition engine comprising three main modules. A foreign language dictionary module: this embodiment uses English, so an English dictionary is required as the standard. A word analysis module: parses each word (vocabulary entry) in the foreign language dictionary into a series of syllables. A language model module: analyzes, according to the usage habits of different languages, the proportions in which words of the language appear in succession.
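As a small illustration of the word analysis module, the sketch below looks words up in a stand-in dictionary and returns their syllable series; the entries and the syllable notation are illustrative assumptions, not the device's actual English dictionary.

```python
# Illustrative stand-in for the foreign language dictionary and word analysis
# modules; entries and syllable notation are assumptions made for this sketch.
ENGLISH_SYLLABLES = {
    "caught": ("kɔːt",),
    "catch": ("kætʃ",),
    "temperature": ("tem", "pər", "ə", "tʃər"),
}

def parse_word_to_syllables(word):
    """Word analysis module: return the syllable group (a tuple of syllable
    strings) for a word, or None if the word is not in the dictionary."""
    return ENGLISH_SYLLABLES.get(word.lower())
```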
A spoken language teaching auxiliary method is realized by the spoken language teaching auxiliary device, and comprises the following steps:
(A) the processing unit controls the output unit to output a training sentence;
(B) when receiving a voice from the voice input unit, the processing unit parses the voice to obtain a to-be-judged sentence composed of a plurality of to-be-judged words, together with a plurality of to-be-judged syllable groups respectively corresponding to the to-be-judged words;
(C) when the processing unit determines that at least one of the standard syllable groups does not match the to-be-judged syllable groups, it takes the at least one standard word corresponding to the at least one non-matching syllable group as at least one unmatched word;
(D) the processing unit uses a language identification model to generate, from the training sentence and the at least one unmatched word, an original score related to the at least one unmatched word;
(E) when the processing unit determines that the at least one unmatched word corresponds to the weighted word, it takes the weighted value corresponding to the weighted word as a target weighted value; and
(F) the processing unit generates a weighted score from the original score and the target weighted value, and controls the output unit to output an evaluation of the voice according to the weighted score.
In the step (F), the processing unit controls the display screen and/or the speaker to output the evaluation, and controls the display screen to display the at least one unmatched word.
The step (C) comprises the following sub-steps; a minimal code sketch of the matching and scoring flow follows the list:
(c1) when the processing unit determines that the standard syllable groups do not correspond one-to-one to the to-be-judged syllable groups, it judges the at least one standard syllable group without a corresponding to-be-judged syllable group as not matching; and
(c2) when the processing unit determines that the standard syllable groups respectively correspond to the to-be-judged syllable groups, it determines whether each standard syllable group is identical to its corresponding to-be-judged syllable group, and judges the at least one standard syllable group that differs from its corresponding to-be-judged syllable group as not matching.
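The following is a minimal sketch of steps (C) through (F) under stated assumptions: syllable groups are modeled as tuples of syllable strings, original_score_fn stands in for the N-Gram language identification model, and the rule for combining the original score with the target weighted value is not disclosed, so a simple multiplication is assumed.

```python
# Sketch of steps (C)-(F); the data layout and the score-combination rule
# (multiplication) are assumptions, not the patent's actual implementation.
def judge_and_score(standard, to_judge, weighted_word, weighted_value,
                    original_score_fn):
    """standard / to_judge: lists of (word, syllable_group) pairs."""
    unmatched = []
    for i, (word, group) in enumerate(standard):
        # (c1): a standard group with no corresponding to-be-judged group;
        # (c2): corresponding groups that are not identical.
        if i >= len(to_judge) or group != to_judge[i][1]:
            unmatched.append(word)
    # Step (D): the language model produces the original score.
    score = original_score_fn([w for w, _ in standard], unmatched)
    # Steps (E)-(F): apply the target weighted value if the weighted word missed.
    if weighted_word in unmatched:
        score *= weighted_value  # combination rule assumed
    return unmatched, score
```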
Before the step (A), the processing unit controls the display screen to display the virtual reality image of the multimedia data according to the multimedia data contained in the spoken language learning data.
In step (D), the language identification model used is an N-Gram language model. In NLP (Natural Language Processing), an N-Gram model can be used to predict words or to evaluate whether a sentence is reasonable, and it can also be used to evaluate the degree of difference between two strings.
Assume S represents a meaningful sentence composed of a sequence of words w1, w2, w3, ..., wn in a particular order, where n is the length of the sentence. The probability that S appears in a text corpus, written P(S), is:
P(S) = P(w1, w2, w3, ..., wn) = P(w1) P(w2|w1) P(w3|w1, w2) ... P(wn|w1, w2, ..., wn-1);
this method of calculation has the following defects:
The parameter space is too large: the conditional probability P(wn|w1, w2, ..., wn-1) has far too many possible parameters to estimate, so it is of no practical use.
The data are severely sparse: a great many word combinations never appear in the corpus at all, so their maximum-likelihood probability estimates are 0. The end result is that the model can compute a non-zero probability for only a few sentences, while most sentences are assigned probability 0.
In order to solve the problem of the oversized parameter space, a Markov assumption is introduced: the probability of any word depends only on the limited word or words preceding it. If the occurrence of a word depends only on the single word before it, the model is called a bigram:
P(S) = P(w1, w2, w3, ..., wn) ≈ P(w1) P(w2|w1) P(w3|w2) ... P(wn|wn-1);
assuming that the occurrence of a word depends only on the two words before it, the model is called a trigram:
P(S) = P(w1, w2, w3, ..., wn) ≈ P(w1) P(w2|w1) P(w3|w1, w2) ... P(wn|wn-2, wn-1);
in general, the N-Gram model assumes that the probability of occurrence of the current word depends only on the N-1 words preceding it. These probability parameters can be estimated from a large-scale corpus; for example, the trigram probability is:
P(wi|wi-2, wi-1) ≈ count(wi-2 wi-1 wi) / count(wi-2 wi-1).
the data smoothing technology is an important means for constructing a highly robust language model, and the effect of data smoothing is related to the scale of a training corpus. The smaller the training corpus scale is, the more remarkable the data smoothing effect is; the larger the training corpus, the less pronounced or even negligible the effect of data smoothing. The purpose of data smoothing is two: one is to make the sum of all the N-Gram probabilities 1; one is to make all the N-Gram probabilities not 0. The main strategy is to properly reduce the probability of events occurring in the training sample and then assign the reduced probability density to events that do not occur in the training corpus. In practice, there are many kinds of smoothing algorithms, such as: add-one smoothing; Witten-Bell smoothing; Good-Turing smoothing; KatzBackoff; stupid backkoff.
String distance defined based on the N-Gram model:
the key to fuzzy matching is how to measure the "difference" between two words (or strings), which is often referred to as their "distance". In addition to defining the edit distance between two strings (typically computed with the Needleman-Wunsch algorithm or the Smith-Waterman algorithm), the Ngram distance between them can also be defined. Suppose there is a string S; the Ngram of the string denotes the word segments obtained by cutting the original string into pieces of length N, that is, all the substrings of length N in S. If there are two strings and their Ngrams are computed separately, the Ngram distance between the two strings can be defined in terms of the number of substrings they have in common. But simply counting common substrings is clearly insufficient, since this ignores the problem that a difference in the lengths of the two strings can cause. Take the strings girl and girlfriend: the number of common substrings they share is obviously equal to the number girl shares with itself, but we cannot conclude from this that girl and girlfriend are an equivalent match. To solve this problem, some scholars have proposed defining the Ngram distance on non-repeated Ngram segments, expressed by the following formula:
|GN(S1)|+|GN(S2)|-2×|GN(S1)∩GN(S2)|;
where |GN(S1)| is the number of elements in the Ngram set of string S1, and the value of N generally takes 2 or 3. Segmenting the strings Gorbachev and Gorbechyov with N = 2, for example, gives the following results:
Gorbachev: Go or rb ba ac ch he ev
Gorbechyov: Go or rb be ec ch hy yo ov
Combining these with the above formula, the distance between the two strings is 8 + 9 - 2 × 4 = 9. Obviously, the smaller the distance between two strings, the closer they are; when two strings are exactly equal, the distance between them is 0.
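This distance is straightforward to compute; below is a minimal sketch using Python sets for the non-repeated Ngram segments (the function names are ours):

```python
def ngram_set(s, n=2):
    """All distinct length-n substrings of s (non-repeated Ngram segmentation)."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_distance(s1, s2, n=2):
    """|GN(S1)| + |GN(S2)| - 2 * |GN(S1) ∩ GN(S2)|"""
    g1, g2 = ngram_set(s1, n), ngram_set(s2, n)
    return len(g1) + len(g2) - 2 * len(g1 & g2)

print(ngram_distance("gorbachev", "gorbechyov"))  # 8 + 9 - 2*4 = 9
```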
Using the Ngram model to evaluate whether the statement is reasonable:
from a statistical point of view, a sentence S in natural language can be made up of any string of words, but the probability P(S) may be large or small. Given the beginning of a sentence, it is possible to guess what the next word should be, for example:
the large green ___ . (mountain or tree?)
Kate swallowed the large green ___ . (pill or broccoli?)
suppose now that there is a corpus as follows, where <s1> <s2> are sentence-start markers and </s2> </s1> are sentence-end markers:
1. <s1> <s2> yes no no no no yes </s2> </s1>
2. <s1> <s2> no no no yes yes yes no </s2> </s1>
The probability of the following sentence is evaluated:
<s1> <s2> yes no no yes </s2> </s1>
Calculating the probability with the trigram model gives:
P(yes|<s1>, <s2>) = 1/2, P(no|<s2>, yes) = 1, P(no|yes, no) = 1/2, P(yes|no, no) = 2/5, P(</s2>|no, yes) = 1/2, P(</s1>|yes, </s2>) = 1.
The resulting probability is: 1/2 × 1 × 1/2 × 2/5 × 1/2 × 1 = 0.05.
The spoken language teaching auxiliary device is a helmet-type wearable device with a helmet-shaped shell. In use, after the user puts on the helmet shell, the display screen sits directly in front of the user's eyes; the head can turn freely, and a real-time, 360-degree three-dimensional virtual scene with characters is visible through the viewing window of the shell. A virtual character serves as the user's spoken language practice partner. Through the speaker on the shell, the user hears the spoken language teaching scenario and course content customized in advance by the program. When the program asks the user to interact by speaking, the user simply talks, and the program receives the user's voice (the spoken input) through a microphone mounted on the shell. The customized program carries a speech recognition module that interprets the voice input received by the microphone in real time, evaluates it in real time, and feeds the evaluation result ("very good", "to be strengthened", and so on) back to the user, so the user learns immediately where his or her spoken language falls short. Through a record/replay function, the user can also compare his or her speech in detail with the standard speech pre-recorded in the program by a real person, so as to improve spoken language ability.
The storage unit stores text data (i.e. conversation lessons) that the user is required to input in spoken form; the user speaks so that the microphone receives the audio data. Through the speech recognition engine, the processing unit converts the pre-customized text data into a recognition template against which the spoken input is compared. After the user speaks, the processing unit parses the audio of the spoken input into a series of syllables through the word analysis module of the speech recognition engine, and then converts the syllables into a spoken language sample through the language model module. The processing unit compares the spoken language sample with the preset recognition template. During data comparison, a predefined weight parameter is applied and fuzzy logic produces a floating-point number between 0 and 1 representing the degree of sameness between the spoken language sample and the recognition template: 0 means the two are completely different, 1 means they are identical, and 0.7 means they are about 70% the same.
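The description does not detail the fuzzy logic itself; as an assumed stand-in, a sequence similarity ratio such as Python's difflib yields the same kind of 0-to-1 score:

```python
from difflib import SequenceMatcher

def comparison_score(sample, template):
    """Degree of sameness between the spoken sample and the recognition
    template as a float in [0, 1]: 0 = completely different, 1 = identical.
    SequenceMatcher is an assumed stand-in for the engine's fuzzy logic."""
    return SequenceMatcher(None, sample, template).ratio()

score = comparison_score("let's go to the three clock show",
                         "let's go to the three o'clock show")
print(round(score, 2))  # close to 1 for near-identical sentences
```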
Data comparison can be performed by a speech recognition engine in two ways:
1) Semantic analysis and comparison: the sentences in the recognition template are first split into single words, each single word is tagged with a different label according to its characteristics, and the semantic structure of the sentence is found from the words before and after it. The sentence read aloud by the user is then compared with the sentence structure in the recognition template to analyze whether the semantic structures of the two sentences are similar; the algorithm gives a score between 0 and 1, and the higher the score, the higher the similarity. For example:
example 1
□ Sentence in the recognition template: Let's go to the three o'clock show.
□ User 1: Let's go to the show.
□ User 2: Let's go to the three clock show.
□ Analysis results: user 1 misses the important time "three o'clock" and gets only 0.5. User 2 reads "o'clock" as "clock", but since the words before and after it are spoken correctly, a relatively high score of 0.8 is given.
Example 2
□ Sentence in the recognition template: Uh-huh, then go right to the next block and the post office'll be on your left.
□ User 1: Uh-huh, then go right to the next block and the police office be on your right.
□ User 2: Uh-huh, go to the next block, the post office be on your left.
□ Analysis results: user 1's sentence structure is similar to the text, but the post office and the police office are different places and the direction is wrong, so the score is a low 0.4. User 2 omits the words "then", "right", "and", "will", and so on, but the meaning is similar to the text, so a good score of 0.7 is given.
Example 3
□ Sentence in the recognition template: Thanks! Can I get off at any stop along the route?
□ User 1: Thank! Can I get off stop up the road?
□ User 2: Thanks! Can I get on at any stop along the route?
□ Analysis results: the two users' sentences are similarly close to the template, but user 2 reads the key phrase "get off" as "get on", which is completely contrary to the meaning of the text, so a very large deduction is applied and the score is only 0.6.
2) Single-word weight comparison: the sentence in the recognition template and the sentence spoken by the user are compared word by word, and the score is adjusted according to the importance weight of each single word (a sketch of this correction follows the example). For example:
□ Sentence in the recognition template: I guess you've caught a cold. Have you checked your temperature?
□ User: I guess you've catch a cold. Have you checked your temperature?
□ Analysis results: the user reads "caught" as "catch". The original score is 0.91, and since "caught" is a weighted word, the weighting calculation corrects the score to 0.78.
Comparison evaluation: the processing unit assigns a score to the result of the data comparison and feeds it back to the user. For example, if the result of the data comparison after the user's spoken input is above 0.8, the system gives the evaluation "very good"; if the result is below 0.4, the system gives the evaluation "to be strengthened", and the user can practice repeatedly.
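A minimal sketch of this threshold mapping; the feedback for the middle band (between 0.4 and 0.8) is not specified in the description, so a placeholder is used:

```python
def evaluation(score):
    """Map the 0-1 comparison score to user feedback, using the thresholds
    given in the description."""
    if score > 0.8:
        return "very good"
    if score < 0.4:
        return "to be strengthened"
    return "keep practicing"  # placeholder: the middle band is unspecified

print(evaluation(0.91))  # very good
print(evaluation(0.35))  # to be strengthened
```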
Degree of difference: according to the result of the speech recognition engine's data comparison, the processing unit can mark the places where the compared data do not match, so that the user knows which words the system judged as not matching during the "spoken input" stage (the non-matching words in the text data can be marked in a special color and displayed on the display screen in real time).
The spoken language teaching auxiliary device and method provided by the invention offer strong speech processing capability: they can accurately distinguish the user's pronunciation, evaluate the accuracy of the user's speech, and present virtual reality images to the user through voice interaction. The interactivity is strong, the user experience is good, and the sense of immersion is strong, which raises the user's interest in spoken language learning and well satisfies the needs of practical application.
The above-mentioned embodiments only express particular implementations of the present invention, and their description is specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, and these fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.
Claims (8)
1. An auxiliary device for spoken language teaching, characterized by comprising a processing unit, and a storage unit, a voice input unit and an output unit which are respectively electrically connected with the processing unit;
the storage unit stores a plurality of pieces of spoken language learning data, each piece of spoken language learning data comprising a training sentence, a plurality of standard syllable groups and weighting information, wherein the training sentence is composed of a plurality of sequentially arranged standard words, and the standard syllable groups correspond to the standard words and are obtained by parsing the standard words in advance; the weighting information has a weighted word and a weighted value corresponding to the weighted word, the weighted word corresponding to one of the standard words included in the training sentence;
the processing unit controls the output unit to output a training sentence; when receiving a voice from the voice input unit, the processing unit parses the voice to obtain a to-be-judged sentence composed of a plurality of to-be-judged words and a plurality of to-be-judged syllable groups respectively corresponding to the to-be-judged words; when the processing unit judges that at least one of the standard syllable groups does not match the to-be-judged syllable groups, it takes the at least one standard word corresponding to the at least one non-matching syllable group as at least one unmatched word; the processing unit generates an original score related to the at least one unmatched word from the training sentence and the at least one unmatched word using a language identification model; when the processing unit judges that the at least one unmatched word corresponds to the weighted word, it takes the weighted value corresponding to the weighted word as a target weighted value; and the processing unit generates a weighted score from the original score and the target weighted value, and controls the output unit to output an evaluation of the voice according to the weighted score.
2. The spoken language teaching aid of claim 1, wherein the output unit includes a display screen and a speaker electrically connected to the processing unit, respectively.
3. The device of claim 1, wherein the voice input unit comprises a microphone.
4. A spoken language teaching assistance method implemented by the spoken language teaching assistance apparatus according to any one of claims 1 to 3, comprising the steps of:
(A) the processing unit controls the output unit to output a training sentence;
(B) when receiving a voice from the voice input unit, the processing unit parses the voice to obtain a to-be-judged sentence composed of a plurality of to-be-judged words and a plurality of to-be-judged syllable groups respectively corresponding to the to-be-judged words;
(C) when the processing unit judges that at least one of the standard syllable groups does not match the to-be-judged syllable groups, it takes the at least one standard word corresponding to the at least one non-matching syllable group as at least one unmatched word;
(D) the processing unit generates an original score related to the at least one unmatched word from the training sentence and the at least one unmatched word using a language identification model;
(E) when the processing unit judges that the at least one unmatched word corresponds to the weighted word, it takes the weighted value corresponding to the weighted word as a target weighted value; and
(F) the processing unit generates a weighted score from the original score and the target weighted value, and controls the output unit to output an evaluation of the voice according to the weighted score.
5. The spoken language teaching assistance method of claim 4, wherein the step (C) comprises:
(c1) when the processing unit judges that the standard syllable groups do not correspond one-to-one to the to-be-judged syllable groups, it judges the at least one standard syllable group without a corresponding to-be-judged syllable group as not matching;
(c2) when the processing unit judges that the standard syllable groups respectively correspond to the to-be-judged syllable groups, it determines whether each standard syllable group is identical to its corresponding to-be-judged syllable group, and judges the at least one standard syllable group that differs from its corresponding to-be-judged syllable group as not matching.
6. The spoken language teaching assistance method of claim 4, wherein in step (F), the processing unit controls the display screen and/or the speaker to output the evaluation and controls the display screen to display the at least one unmatched word.
7. The spoken language teaching assistance method of claim 4, wherein before step (A), the processing unit controls the display screen to display a virtual reality image of the multimedia data according to the multimedia data included in the spoken language learning data.
8. The spoken language teaching assistance method of claim 4 wherein the language identification model is an N-Gram model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810689188.3A CN108831212B (en) | 2018-06-28 | 2018-06-28 | Auxiliary device and method for oral teaching |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810689188.3A CN108831212B (en) | 2018-06-28 | 2018-06-28 | Auxiliary device and method for oral teaching |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108831212A CN108831212A (en) | 2018-11-16 |
CN108831212B true CN108831212B (en) | 2020-10-23 |
Family
ID=64133588
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810689188.3A Active CN108831212B (en) | 2018-06-28 | 2018-06-28 | Auxiliary device and method for oral teaching |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108831212B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109243230A (en) * | 2018-11-21 | 2019-01-18 | 汕头市美致模型有限公司 | A kind of interactive teaching system based on toy robot |
CN111639217A (en) * | 2020-05-12 | 2020-09-08 | 广东小天才科技有限公司 | Spoken language rating method, terminal device and storage medium |
GB2613563A (en) * | 2021-12-03 | 2023-06-14 | Learnlight Uk Ltd | Apparatus, computing device and method for speech analysis |
Family Cites Families (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1064464C * | 1991-12-12 | 2001-04-11 | 致远科技股份有限公司 | Speech processing system based on multiple evaluation function |
US5340316A (en) * | 1993-05-28 | 1994-08-23 | Panasonic Technologies, Inc. | Synthesis-based speech training system |
US7657221B2 (en) * | 2005-09-12 | 2010-02-02 | Northwest Educational Software, Inc. | Virtual oral recitation examination apparatus, system and method |
TW200900969A * | 2007-06-27 | 2009-01-01 | Inventec Besta Co Ltd | Chinese character pronunciation learning apparatus with pronunciation correction function and method thereof |
CN102169642B (en) * | 2011-04-06 | 2013-04-03 | 沈阳航空航天大学 | Interactive virtual teacher system having intelligent error correction function |
TWI432179B (en) * | 2011-08-12 | 2014-04-01 | Taipei Veterans General Hospital Vac | Interactive speech testing and training platform |
CN102930866B (en) * | 2012-11-05 | 2014-05-21 | 广州市神骥营销策划有限公司 | Evaluation method for student reading assignment for oral practice |
CN103151042B (en) * | 2013-01-23 | 2016-02-24 | 中国科学院深圳先进技术研究院 | Full-automatic oral evaluation management and points-scoring system and methods of marking thereof |
JP2014164261A (en) * | 2013-02-27 | 2014-09-08 | Canon Inc | Information processor and information processing method |
CN103578465B (en) * | 2013-10-18 | 2016-08-17 | 威盛电子股份有限公司 | Speech identifying method and electronic installation |
CN104599680B (en) * | 2013-10-30 | 2019-11-26 | 语冠信息技术(上海)有限公司 | Real-time spoken evaluation system and method in mobile device |
CN103559892B (en) * | 2013-11-08 | 2016-02-17 | 科大讯飞股份有限公司 | Oral evaluation method and system |
CN104778865A (en) * | 2014-01-14 | 2015-07-15 | 王萍丽 | Method for conducting spoken language correction through speech recognition technology and language learning machine |
CN104810017B (en) * | 2015-04-08 | 2018-07-17 | 广东外语外贸大学 | Oral evaluation method and system based on semantic analysis |
TWM529913U (en) * | 2016-06-22 | 2016-10-01 | Yu Da University Of Science And Technology | Language learning system |
CN106875941B (en) * | 2017-04-01 | 2020-02-18 | 彭楚奥 | Voice semantic recognition method of service robot |
CN106875764B (en) * | 2017-04-26 | 2020-03-31 | 北京大生在线科技有限公司 | Virtual reality foreign language learning system based on network and control method |
CN107464476A (en) * | 2017-09-03 | 2017-12-12 | 佛山神航科技有限公司 | A kind of instrument for aiding in English study |
CN108052499B (en) * | 2017-11-20 | 2021-06-11 | 北京百度网讯科技有限公司 | Text error correction method and device based on artificial intelligence and computer readable medium |
-
2018
- 2018-06-28 CN CN201810689188.3A patent/CN108831212B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN108831212A (en) | 2018-11-16 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right |
Effective date of registration: 20240314 Address after: No. 7 Gongyuan Road, Jiangdu District, Yangzhou City, Jiangsu Province, 225200 Patentee after: He Guangyao Country or region after: China Address before: 518057, Room 405, Building A, Zhongke Neng R&D Center, Yuexing Sixth Road, Yuehai Street, Nanshan District, Shenzhen, Guangdong Province Patentee before: SHENZHEN LANGEASE EDUCATION TECHNOLOGY Co.,Ltd. Country or region before: China |