CN112151018A - Voice evaluation and voice recognition method, device, equipment and storage medium - Google Patents

Voice evaluation and voice recognition method, device, equipment and storage medium

Info

Publication number
CN112151018A
CN112151018A
Authority
CN
China
Prior art keywords
word
character
voice
speech
recognition model
Prior art date
Legal status
Pending
Application number
CN201910496211.1A
Other languages
Chinese (zh)
Inventor
张平
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201910496211.1A
Publication of CN112151018A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for particular use, for comparison or discrimination
    • G10L15/04: Speech recognition; Segmentation; Word boundary detection
    • G10L15/142: Speech recognition; Speech classification or search using statistical models; Hidden Markov Models [HMMs]
    • G10L15/16: Speech recognition; Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

A method, apparatus, device and storage medium for voice evaluation and voice recognition are disclosed. Prompt information is output to prompt a user to utter speech for a test text, where the test text includes one or more characters/words; the speech is received; the speech is recognized based on the character/word recognition model corresponding to each character/word in the test text, the character/word recognition model being used to recognize whether the speech matches its corresponding character/word; and the speech is evaluated based on the recognition result. Thus, when evaluating multilingual mixed speech (such as mixed English and Chinese speech), the character/word recognition models can be selected directly, without switching between language recognizers, and character/word-level recognition of the speech to be evaluated yields a more accurate recognition result.

Description

Voice evaluation and voice recognition method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of voice interaction, and in particular, to a method, an apparatus, a device, and a storage medium for voice evaluation and voice recognition.
Background
Education is a perennial topic that parents value ever more highly, and education institutions for children have multiplied. Offline education, however, is costly: parents must shuttle children to and from class, which consumes much of their time and is subject to inconveniences such as traffic and weather, and offline tuition is expensive. Online education is therefore bound to replace offline education and become a mainstream teaching method.
With the overall advance of quality-oriented education, language education for children is receiving increasing attention. How to evaluate children's pronunciation is key to realizing online language education.
On the one hand, children may receive education in multiple languages (such as Chinese and English) at the same time, so evaluating the multilingual voice data produced by children is currently a difficulty. For example, when the speech uttered by a child contains both English words and Chinese words, existing schemes must use different language recognizers to recognize the child's speech and switch between them during recognition, which is complicated to implement.
On the other hand, because the pronunciation characteristics and habits of children differ across ages, general-purpose ASR (Automatic Speech Recognition) technology cannot reliably recognize children's pronunciation and therefore cannot score it accurately. Moreover, the scoring logic and points of attention differ across texts and teaching materials, so combing the scoring logic and modeling the algorithms would require large amounts of teacher resources and algorithm development.
Therefore, a more effective speech evaluation scheme is needed to provide technical support for on-line language education of children.
Disclosure of Invention
An object of the present disclosure is to provide a speech evaluation scheme that offers technical support for solving at least one of the above technical problems.
According to a first aspect of the present disclosure, a speech evaluation method is provided, including: outputting prompt information, the prompt information being used to prompt a user to utter speech for a test text, where the test text includes one or more characters/words; receiving the speech; recognizing the speech based on the character/word recognition model corresponding to each character/word in the test text, the character/word recognition model being used to recognize whether the speech matches the character/word corresponding to that model; and evaluating the speech based on the recognition result.
Optionally, the test text includes a plurality of characters/words, and the method further includes: segmenting the speech to obtain a plurality of audio segments, where the step of recognizing the speech based on the character/word recognition models corresponding to the characters/words in the test text includes: recognizing each audio segment based on a plurality of character/word recognition models to determine the character/word corresponding to each audio segment, where each character/word recognition model corresponds to one of the plurality of characters/words.
Optionally, the character/word recognition model is a hidden Markov model, and the step of recognizing each audio segment based on the plurality of character/word recognition models includes: extracting features from each audio segment to obtain a feature sequence of the audio segment; inputting the feature sequence into the plurality of character/word recognition models to obtain, for each model, the probability that the model generates the feature sequence; and, when there is a character/word recognition model whose probability exceeds a first threshold, recognizing the audio segment as the character/word corresponding to the model with the largest probability.
Optionally, the character/word recognition model is a hidden Markov model, the test text includes a single character/word, and the step of recognizing the speech based on the character/word recognition model corresponding to that character/word includes: extracting features from the speech to obtain a feature sequence of the speech; inputting the feature sequence into the character/word recognition model corresponding to the character/word to obtain the probability that the model generates the feature sequence; and, when the probability exceeds a second threshold, recognizing the speech as that character/word.
Optionally, feature extraction is performed based on a neural network.
Optionally, the step of evaluating the speech based on the recognition result includes: evaluating the speech according to the difference between the characters/words recognized from the speech and the characters/words in the test text; and/or evaluating the speech according to the similarity, output by the character/word recognition model, between the speech and the corresponding character/word.
Optionally, the method further includes: calculating, based on a feature distribution model, the similarity between the speech and the audio feature distribution that the model characterizes; and adjusting the evaluation result according to this similarity, where the evaluation result is negatively correlated with the similarity.
Optionally, the feature distribution model is a gaussian mixture model.
Optionally, the method further includes: acquiring training data, where the training data includes voice data for a plurality of characters/words; and training the character/word recognition model of each character/word based on the voice data of that character/word.
Optionally, the method further includes: training a feature distribution model based on the voice data of the plurality of characters/words, where the feature distribution model characterizes the audio feature distribution of that voice data.
Optionally, the method further includes: screening the voice data of each character/word based on the feature distribution model, where the step of training the character/word recognition model of a character/word based on its voice data includes: training the character/word recognition model based on the screened voice data.
Optionally, the plurality of characters/words includes characters/words in one or more languages.
According to a second aspect of the present disclosure, a speech evaluation method is further provided, including: receiving speech; recognizing the speech based on a character/word recognition model, the character/word recognition model being used to recognize whether the speech matches the character/word corresponding to the model; and evaluating the speech based on the recognition result.
Optionally, the method further includes: calculating, based on a feature distribution model, the similarity between the speech and the audio feature distribution that the model characterizes; and adjusting the evaluation result according to this similarity, where the evaluation result is negatively correlated with the similarity.
According to a third aspect of the present disclosure, there is also provided a speech recognition method, including: receiving speech; and recognizing the speech based on a character/word recognition model, the model being used to recognize whether the speech matches its corresponding character/word.
According to a fourth aspect of the present disclosure, there is also provided a speech evaluation apparatus, including: an output module for outputting prompt information, the prompt information being used to prompt a user to utter speech for a test text, where the test text includes one or more characters/words; a receiving module for receiving the speech; a recognition module for recognizing the speech based on the character/word recognition model corresponding to each character/word in the test text, the model being used to recognize whether the speech matches its corresponding character/word; and an evaluation module for evaluating the speech based on the recognition result.
According to a fifth aspect of the present disclosure, there is also provided a speech evaluation apparatus, including: a receiving module for receiving speech; a recognition module for recognizing the speech based on a character/word recognition model, the model being used to recognize whether the speech matches its corresponding character/word; and an evaluation module for evaluating the speech based on the recognition result.
According to a sixth aspect of the present disclosure, there is also provided a voice interaction device, including: a first output module for outputting prompt information, the prompt information being used to prompt a user to utter speech for a test text, where the test text includes one or more characters/words; a receiving module for receiving the speech; a recognition module for recognizing the speech based on the character/word recognition model corresponding to each character/word in the test text, the model being used to recognize whether the speech matches its corresponding character/word; and an evaluation module for evaluating the speech based on the recognition result.
Optionally, the voice interaction device further comprises: and the second output module is used for outputting the voice teaching data.
Optionally, the voice interaction device is a smart speaker or a smart watch.
According to a seventh aspect of the present disclosure, there is also provided a speech recognition apparatus, including: a receiving module for receiving speech; and a recognition module for recognizing the speech based on a character/word recognition model, the model being used to recognize whether the speech matches its corresponding character/word.
According to an eighth aspect of the present disclosure, there is also presented a computing device comprising: a processor; and a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform a method as set forth in the first or second aspect of the disclosure.
According to a ninth aspect of the present disclosure, there is also provided a non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform a method as set forth in the first or second aspect of the present disclosure.
When the speech evaluation scheme of the present disclosure evaluates multilingual mixed speech to be evaluated (such as mixed English and Chinese speech), the character/word recognition models can be selected directly, without considering switching between language recognizers, and character/word-level recognition of the speech to be evaluated yields a more accurate recognition result.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in greater detail exemplary embodiments thereof with reference to the attached drawings, in which like reference numerals generally represent like parts throughout.
FIG. 1 shows a schematic flow diagram of a speech assessment scheme according to an embodiment of the present disclosure.
Fig. 2 shows a schematic block diagram of the structure of a voice interaction device according to an embodiment of the present disclosure.
FIG. 3 is a schematic flow chart diagram illustrating a speech assessment method in a text-related scenario according to an embodiment of the present disclosure.
FIG. 4 is a schematic flow chart diagram illustrating a speech assessment method in a text-independent scenario according to an embodiment of the present disclosure.
FIG. 5 shows a schematic flow diagram of a training process for an HMM model, a GMM model.
Fig. 6 shows a schematic block diagram of the structure of a speech evaluation device according to an embodiment of the present disclosure.
Fig. 7 shows a schematic block diagram of the structure of a speech evaluation device according to another embodiment of the present disclosure.
Fig. 8 shows a schematic block diagram of the structure of a speech recognition apparatus according to an embodiment of the present disclosure.
FIG. 9 shows a schematic structural diagram of a computing device according to an embodiment of the invention.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The present disclosure provides a language-independent speech evaluation scheme. Based on this scheme, even without a recognizer for the corresponding language, the content of the speech uttered by a user (especially a child user) can be recognized according to pronunciation similarity and scored against a simple scoring standard, which greatly reduces the data cost and algorithm development cost of speech evaluation.
FIG. 1 shows a schematic flow diagram of a speech assessment scheme according to an embodiment of the present disclosure.
As shown in fig. 1, the present disclosure may train a character/word recognition system including a plurality of character/word recognition models in advance. Each character/word recognition model corresponds to one character/word and is used to recognize whether the speech (i.e., the speech to be evaluated) matches the character/word corresponding to that model.
Optionally, the character/word recognition system may also include a filler model, sometimes called a garbage model. The filler model absorbs linguistic phenomena other than the characters/words covered by the recognition models, including out-of-vocabulary (OOV) words and common non-linguistic phenomena (such as background noise, coughing, and wheezing). In the present disclosure, the filler model may be regarded as a special character/word recognition model, with out-of-vocabulary words and the various non-linguistic phenomena regarded as a special "character/word" that the filler model recognizes.
The plurality of character/word recognition models included in the character/word recognition system may correspond to characters/words belonging to different languages. For example, character/word recognition model 1 may correspond to the English word "apple", and character/word recognition model 2 may correspond to the Chinese word for "hoe". That is, the present disclosure does not distinguish languages; it trains a recognition model for each specific character/word, with the character/word as the unit.
Taking the application to a children education scene as an example, the word/phrase recognition system corresponding to vocabularies of different scales can be trained for children of different age groups or different grades. For example, a vocabulary library composed of common words in multiple languages such as chinese and english may be constructed for children aged 3 to 6 years, and corresponding word recognition models may be trained for each word in the vocabulary library. For another example, a corresponding vocabulary library may be constructed for children of different grades according to vocabulary requirements of different grades, the vocabulary library may include words/phrases in multiple languages such as chinese and english, and a corresponding word/phrase recognition model may be trained for each word/phrase in the vocabulary library. The training process for the word/phrase recognition model will be described below, and will not be described in detail here.
Children of different ages differ in pronunciation characteristics and habits. For example, young children often speak with jumbled word order: for the sentence "I want to eat an apple", they will often produce a sequence more like "apple, want eat". If these pronunciation characteristics of the user (especially a child user) are ignored and the speech is recognized directly with existing ASR technology, a correct recognition result is hard to obtain.
The present disclosure trains a recognition model for each specific character/word, with the character/word as the unit. On the one hand, the trained character/word recognition models perform character/word-level recognition of the speech to be evaluated, yielding a more accurate recognition result without having to account for word order. On the other hand, when evaluating multilingual mixed speech (for example, mixed English and Chinese speech), the character/word recognition models can be selected directly, without considering switching between language recognizers. The present disclosure therefore improves recognition accuracy while greatly reducing the data cost and algorithm development cost of speech evaluation.
The word/word recognition model described in this disclosure may determine whether a voice matches a corresponding word/word by comparing pronunciation similarities of the voice and the corresponding word/word. That is, the word/word recognition model may be used to determine the similarity of the pronunciation of the word/word corresponding to the voice and the word/word corresponding thereto, and based on the similarity, it may be determined whether the voice is pronunciation data for the word/word corresponding to the word/word recognition model. As an example, the word/word recognition Model may be a Hidden Markov Model (HMM). Alternatively, the word/word recognition model may be other model structures, such as a Gaussian Mixture Model (GMM), a machine learning model, and so forth.
After the speech to be evaluated is recognized by the character/word recognition model, the speech to be evaluated can be evaluated based on the recognition result to obtain an evaluation result. Wherein, the evaluation result can be output to the user in a score or other forms. For the evaluation process, reference may be made to the following description, which is not repeated herein.
As shown in fig. 1, the present disclosure may also adjust the evaluation result by using a feature distribution model. The feature distribution model is used to characterize the audio feature distribution. The audio feature distribution represented by the feature distribution model may refer to audio feature distribution of a large amount of speech, that is, general audio feature distribution. That is, the feature distribution model is used to fit the feature distribution of a large number of voices. As an example, the feature distribution Model may be a Gaussian Mixture Model (GMM), one of which may be trained on a large amount of speech data.
The feature distribution model can be used to calculate the similarity between the speech to be evaluated and the audio feature distribution it characterizes, and the evaluation result can be adjusted according to this similarity, with the evaluation result negatively correlated with the similarity. That is, the larger the similarity computed by the feature distribution model, the lower the finally adjusted evaluation score. As an example, the similarity output by the feature distribution model may be subtracted from the similarity output by the character/word recognition model, the difference serving as the similarity between the speech and the recognized character/word, and the final evaluation result may be computed from this difference.
The character/word recognition model measures the similarity between the speech and its corresponding character/word, while the feature distribution model characterizes the feature distribution of a large amount of speech data. Applying a negative-correlation adjustment (such as subtraction) of the feature-distribution similarity to the recognition-model similarity reduces, or even eliminates, the influence of environmental noise and generic speech features, so the adjusted similarity better highlights the difference between the speech to be evaluated and the pronunciation of the specific character/word, and thus more accurately reflects their pronunciation similarity.
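As a minimal illustrative sketch (not part of the patent text), the subtraction can be expressed with log-likelihoods. It assumes the character/word model is an hmmlearn GaussianHMM and the feature distribution model is a scikit-learn GaussianMixture; both library choices and the length normalization are assumptions.

```python
import numpy as np

def adjusted_similarity(word_hmm, background_gmm, features):
    # Log-likelihood of the features under the character/word HMM
    # (hmmlearn's score() returns the total log-likelihood).
    word_ll = word_hmm.score(features)
    # Total log-likelihood under the background GMM
    # (scikit-learn's score_samples() is per frame, so sum over frames).
    bg_ll = np.sum(background_gmm.score_samples(features))
    # Negative-correlation adjustment: subtract the generic-audio score;
    # length-normalize so long and short utterances are comparable
    # (the normalization is an assumption, not specified in the patent).
    return (word_ll - bg_ll) / len(features)
```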
So far, the implementation flow of the speech evaluation scheme of the present disclosure is briefly described with reference to fig. 1. The following further describes aspects of the present disclosure.
The voice evaluation scheme can be applied to various voice interaction devices, such as devices which can be suitable for intelligent sound boxes, intelligent watches (such as child watches), mobile phones and the like and support voice interaction functions.
Fig. 2 shows a schematic block diagram of the structure of a voice interaction device according to an embodiment of the present disclosure. Wherein the functional blocks of the voice interaction device can be implemented by hardware, software, or a combination of hardware and software that implement the principles of the present invention. It will be appreciated by those skilled in the art that the functional blocks described in fig. 2 may be combined or divided into sub-blocks to implement the principles of the invention described above. Thus, the description herein may support any possible combination, or division, or further definition of the functional modules described herein.
Referring to fig. 2, the voice interaction device 200 includes a first output module 210, a receiving module 220, a recognition module 230, and an evaluation module 240. Optionally, the voice interaction device 200 may further include a second output module.
The voice interaction device 200 can support two working modes of voice teaching and voice evaluation, and the two working modes can be switched automatically by the voice interaction device 200 or switched according to the operation of a user.
In the voice teaching mode, the voice interaction device 200 may output the voice teaching data to the user by using the second output module, where the voice teaching data may be pronunciation data of new words or new poems, or pronunciation data of articles or poems. Taking the english word teaching scenario as an example, the voice interaction device 200 may output the voice teaching data like "apple" and "banana".
The speech evaluation is divided into speech evaluation under a text-related scene and speech evaluation under a text-unrelated scene. The speech evaluation in the context of text correlation refers to speech evaluation for a specified test text, where the test text may include one or more words and, in the case where the test text includes multiple words, the multiple words may belong to different languages (such as english and chinese). The speech evaluation under the text-independent scene does not specify the test content, but can evaluate the speech freely uttered by the user.
FIG. 3 is a schematic flow chart diagram illustrating a speech assessment method in a text-related scenario according to an embodiment of the present disclosure. The speech evaluation method shown in fig. 3 can be executed by the speech interaction device 200 shown in fig. 2.
Referring to fig. 3, in step S310, a prompt message may be output, for example, by the first output module 210 in the voice interaction apparatus 200.
The prompt information prompts the user to utter speech for a test text, where the test text includes one or more characters/words. The prompt information may take various forms, such as text, voice, or images. For example, if the prompt is the voice message "please recite Min Nong", the test text is the content of the poem "Min Nong"; if the prompt is the voice message "how is apple read in English", the test text is "apple".
In step S320, the voice may be received, for example, by the receiving module 220 in the voice interaction device 200. The received speech is the speech which is sent by the user aiming at the test text, namely the speech to be evaluated.
In step S330, the speech may be recognized, for example, by the recognition module 230 in the speech interaction device 200, based on the word/word recognition model corresponding to the word/word in the test text.
When the test text includes a plurality of characters/words, the speech may be segmented into a plurality of audio segments, where each audio segment can be viewed as pronunciation data for one character/word. Segmentation can be performed in a number of ways. As an example, since a speaker usually pauses between words, the speech may be divided into audio segments by analyzing the silences in it: first locate each silent interval, then check whether its duration exceeds a predetermined threshold, and if so, cut the speech at that silent interval. In this way the speech is cut into a plurality of audio segments, as in the sketch below.
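One possible realization of this pause-based segmentation, sketched with librosa (a library choice not specified by the patent; top_db and min_pause_s stand in for the silence criterion and the predetermined duration threshold):

```python
import librosa

def segment_speech(path, top_db=30, min_pause_s=0.3, sr=16000):
    """Cut an utterance into word-level audio segments at long pauses."""
    y, sr = librosa.load(path, sr=sr)
    # Non-silent intervals: frames quieter than (peak - top_db) dB count as silence.
    voiced = librosa.effects.split(y, top_db=top_db)
    segments, start, end = [], None, None
    for s, e in voiced:
        if start is None:
            start, end = s, e
        elif (s - end) / sr < min_pause_s:
            end = e                      # pause below threshold: same segment
        else:                            # pause exceeds threshold: cut here
            segments.append(y[start:end])
            start, end = s, e
    if start is not None:
        segments.append(y[start:end])
    return segments, sr
```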
After obtaining the plurality of audio segments, each audio segment may be recognized based on a plurality of word/word recognition models to determine a word/word corresponding to each audio segment, wherein each word/word recognition model corresponds to one word/word of the plurality of words/words included in the test text. That is to say, in the case that the test text includes a plurality of characters/words, the character/word recognition model corresponding to each character/word in the test text may be selected to recognize the speech to be evaluated.
In the case where the test text includes a single word/word, the received speech may be recognized directly based on a word/word recognition model corresponding to the word/word to determine whether the speech corresponds to the word/word.
Take the case where the character/word recognition model is an HMM and the test text includes a plurality of characters/words. The speech is first segmented into audio segments, and features are extracted from each segment, for example with a neural network. The feature sequence of each segment is then input into each of a plurality of HMMs, one HMM per character/word. Each HMM outputs a probability value representing the probability that this HMM generates the feature sequence; the magnitude of this probability represents the similarity between the audio segment and the character/word corresponding to the HMM. The probability values (i.e., similarities) are compared with a predetermined threshold (called the first threshold for ease of distinction). If all probabilities are below the first threshold, the segment is deemed not to match any character/word in the test text; if some probability exceeds the first threshold, the segment is deemed to be pronunciation data for some character/word in the test text. As an example, when one or more models exceed the first threshold, the segment is recognized as the character/word of the model with the largest probability. The first threshold may be determined during HMM training.
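A hedged sketch of this step, again assuming hmmlearn HMMs. MFCCs are substituted here for the neural-network features mentioned in the text, purely for self-containedness:

```python
import librosa
import numpy as np

def extract_features(segment, sr):
    # MFCCs stand in for the neural-network features described above.
    return librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=13).T  # (frames, 13)

def recognize_segment(segment, sr, word_hmms, first_threshold):
    """word_hmms: {character/word: trained HMM}, one model per word of the
    test text (the filler model can be included as another entry)."""
    feats = extract_features(segment, sr)
    # Per-frame log-likelihood of the segment under each word model.
    scores = {w: m.score(feats) / len(feats) for w, m in word_hmms.items()}
    best_word, best_score = max(scores.items(), key=lambda kv: kv[1])
    # Below the first threshold, the segment matches no word in the test text.
    return best_word if best_score > first_threshold else None
```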
Take the case where the character/word recognition model is an HMM and the test text includes a single character/word. Features are extracted from the speech, for example with a neural network, and the feature sequence is input into the HMM corresponding to the character/word of the test text. The HMM outputs a probability value representing the probability that it generates the feature sequence, whose magnitude represents the similarity between the speech and the character/word. This probability is compared with a predetermined threshold (called the second threshold for ease of distinction): if it exceeds the second threshold, the speech is recognized as the character/word, i.e., as pronunciation data for the character/word in the test text; otherwise the speech is deemed not to match it. The second threshold may be determined during HMM training and may be the same as or different from the first threshold.
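The single-word case reduces to a verification, sketched below under the same assumptions (reusing extract_features from the previous sketch):

```python
def verify_single_word(speech, sr, word_hmm, second_threshold):
    """Single-word test text: accept the utterance if its per-frame
    log-likelihood under the word's HMM exceeds the second threshold."""
    feats = extract_features(speech, sr)
    return word_hmm.score(feats) / len(feats) > second_threshold
```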
Optionally, during recognition a filler (garbage) model may also be selected to absorb audio segments corresponding to out-of-vocabulary words and/or non-linguistic phenomena in the speech. That is, the filler model may participate in recognition alongside the selected character/word recognition models, acting as one more character/word recognition model. For the specific recognition process, see the description above.
In step S340, the speech may be evaluated based on the recognition result, for example, by the evaluation module 240 in the speech interaction device 200.
After the recognition result is obtained, the speech can be evaluated according to the difference between the characters/words recognized from the speech and the characters/words in the test text. Difference here refers to a difference in number and, optionally, in position: the evaluation compares whether the speech uttered by the user contains all the characters/words of the test text, and may also take into account whether the order of characters/words in the speech matches their order in the test text.
Additionally or alternatively, the speech can be evaluated according to the similarity between the speech and the corresponding character/word as output by the character/word recognition model (called the first similarity for ease of distinction). The model in question is the one corresponding to the recognized character/word, and the similarity it outputs is the probability value that the speech belongs to that character/word.
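A hedged sketch of one plausible scoring rule combining both criteria above, coverage of the test text and model similarity. The 50/50 weighting, the 100-point scale, and the assumption that similarities are already mapped to [0, 1] are all illustrative choices not fixed by the patent:

```python
def evaluate_speech(recognized, test_words, first_similarities, w=0.5):
    """recognized: words decoded from the segments (None where unmatched);
    test_words: the characters/words of the test text;
    first_similarities: per-segment model similarities in [0, 1]."""
    hits = {r for r in recognized if r is not None}
    coverage = len(hits & set(test_words)) / len(set(test_words))
    pronunciation = (sum(first_similarities) / len(first_similarities)
                     if first_similarities else 0.0)
    return 100 * (w * coverage + (1 - w) * pronunciation)  # score out of 100
```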
To reduce the influence of environmental noise on the evaluation and further improve the accuracy of the result, the present disclosure can also adjust the evaluation result using the feature distribution model. For the feature distribution model and the adjustment principle, see the description above.
As an example, when the test text includes a plurality of characters/words, the feature distribution model may compute a similarity for each segmented audio segment (called the second similarity for ease of distinction). For a segment recognized as some character/word, the second similarity output by the feature distribution model is subtracted from the first similarity output by the corresponding character/word recognition model, and the evaluation is performed on the resulting difference. The specific evaluation logic, and the adjustment when the test text includes a single character/word, are analogous and not repeated here.
FIG. 4 is a schematic flow chart diagram illustrating a speech assessment method in a text-independent scenario according to an embodiment of the present disclosure. The speech evaluation method shown in fig. 4 can be executed by the speech interaction device 200 shown in fig. 2.
Referring to fig. 4, in step S410, a voice may be received, for example, by the receiving module 220 in the voice interaction device 200. The received voice is the voice to be evaluated sent by the user.
In step S420, a speech may be recognized based on the word/word recognition model, for example, by the recognition module 230 in the speech interaction device 200.
The character/word recognition model is used for recognizing whether the voice is matched with the character/word corresponding to the character/word recognition model. For the word/phrase recognition model, see the above description, and are not described in detail here.
In a text-independent scene, the speech can be compared with each character/word recognition model in the character/word recognition system respectively to determine the character/word recognition model most similar to the speech, and the character/word corresponding to the character/word recognition model is used as a speech recognition result. For example, the word/word corresponding to the word/word recognition model with the highest output probability may be used as the recognition result of the speech.
Considering that the speech may include a plurality of characters/words, after the speech is received, the speech may be further segmented to segment the speech into a plurality of audio segments, each audio segment may be regarded as pronunciation data for one character/word, and the segmentation process of the speech may refer to the above related description, which is not described herein again.
For each obtained audio segment, feature extraction can be performed on the audio segment to obtain a feature sequence of the audio segment, and then the feature sequence can be respectively input into each character/word recognition model to obtain a recognition result of each audio segment. As an example, for each audio segment, the word/word corresponding to the word/word recognition model with the largest output probability value (i.e., similarity) may be selected as the recognition result of the audio segment.
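A hedged sketch of the text-independent case under the same assumptions as before (hmmlearn models, the extract_features helper from the earlier sketch): every segment is scored against every model in the system, and the best-scoring character/word is kept.

```python
def recognize_free_speech(segments, sr, all_word_hmms):
    """Score each segment against every character/word model in the system
    (the filler model can be included as just another entry) and keep the
    most similar word for each segment."""
    results = []
    for seg in segments:
        feats = extract_features(seg, sr)
        word, score = max(
            ((w, m.score(feats) / len(feats)) for w, m in all_word_hmms.items()),
            key=lambda kv: kv[1])
        results.append((word, score))
    return results
```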
In step S430, the speech is evaluated based on the recognition result.
As an example, the speech may be evaluated according to the similarity between the speech and the corresponding character/word output by the character/word recognition model, i.e., the model corresponding to the recognized character/word; the similarity it outputs is the probability value that the speech belongs to that character/word.
In addition, the evaluation result can be adjusted according to the characteristic distribution model. For the feature distribution model and the adjustment principle, see the above description, and are not described herein again.
Training of models
In the following, taking the word/word recognition model as the HMM model and the feature distribution model as the GMM model as an example, the training process of the word/word recognition model and the feature distribution model is exemplarily described.
FIG. 5 shows a schematic flow diagram of a training process for an HMM model, a GMM model.
Referring to fig. 5, voice data is collected at step S510.
The collected speech data may be pronunciation data for each character/word in different languages from different users, and its scale depends on the application scenario. For example, when training a model for evaluating the speech of children aged 2 to 6, a vocabulary library of common characters/words in multiple languages such as Chinese and English may first be constructed, and then corresponding voice data collected for each character/word in the library. As an example, a predetermined duration of voice data (e.g., 100 hours) may be collected for each language.
After the raw speech data is collected, the speech data may be preprocessed to obtain training data. For example, the collected voice data may be labeled to obtain voice data corresponding to each word/phrase. Wherein each word/phrase may correspond to a plurality of pieces of voice data.
Optionally, the collected speech data may also include pronunciation data for out-of-vocabulary words and various non-linguistic phenomena.
In step S520, the GMM model is trained.
After the speech data is collected, the GMM model may be trained first. For example, a GMM model may be trained that fits the feature distribution of the acquired speech data. The structure and training process of the GMM model are well known in the art and will not be described herein.
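A minimal sketch of this step with scikit-learn (a library choice not specified by the patent); the component count and diagonal covariance are assumed hyper-parameters:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_background_gmm(all_utterance_features, n_components=64):
    """Fit a GMM to the pooled frames of every collected utterance so that
    it characterizes the general audio feature distribution.
    all_utterance_features: list of (n_frames, n_dims) arrays."""
    X = np.vstack(all_utterance_features)   # (total_frames, n_dims)
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          max_iter=100, random_state=0)
    gmm.fit(X)
    return gmm
```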
At step S530, the available data is filtered using the GMM model.
Before the HMMs are trained, the trained GMM can be used to screen the training data and remove less reliable samples. As an example, for the voice data of the same character/word, the GMM may be used to cull utterances whose feature distribution deviates markedly from the rest.
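One way such screening could look, sketched under the same scikit-learn assumption; keeping a fixed fraction of the best-scoring utterances is an assumed policy, not taken from the patent:

```python
def screen_word_data(word_utterances, background_gmm, keep_fraction=0.9):
    """For one character/word, drop the utterances whose mean per-frame
    log-likelihood under the background GMM is lowest, i.e. whose feature
    distribution differs most from the bulk of the data."""
    scored = [(u, background_gmm.score(u)) for u in word_utterances]
    scored.sort(key=lambda p: p[1], reverse=True)
    keep = max(1, int(len(scored) * keep_fraction))
    return [u for u, _ in scored[:keep]]
```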
In step S540, HMM models are trained based on the speech data of the same word/phrase. The specific training process of the HMM model is well-known in the art and will not be described herein.
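A hedged sketch of per-word HMM training with hmmlearn (an assumed library; the state count and iteration budget are assumed hyper-parameters):

```python
import numpy as np
from hmmlearn import hmm

def train_word_hmm(word_utterances, n_states=5, n_iter=20):
    """Train one HMM per character/word from the (screened) utterances of
    that word. word_utterances: list of (n_frames, n_dims) feature arrays."""
    X = np.vstack(word_utterances)               # concatenated feature frames
    lengths = [len(u) for u in word_utterances]  # frames per utterance
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                            n_iter=n_iter)
    model.fit(X, lengths)
    return model
```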
A plurality of HMM models corresponding to different words/phrases can be trained using the training method shown in fig. 5. And a plurality of words/words corresponding to the plurality of HMM models may belong to different languages. The final trained HMM model and GMM model can be used in the speech evaluation scheme of the present disclosure.
Optionally, when the collected speech data also includes pronunciation data for out-of-vocabulary words and various non-linguistic phenomena, a filler (garbage) model may be trained from that data.
As an example, a Universal Background Model (UBM) may be trained before the HMMs. The UBM can be regarded as an HMM trained on a large amount of audio data; training the HMM for a specific character/word on the basis of the UBM ensures robust model parameters, improves the recognition accuracy of the model, and reduces training time.
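This UBM initialization could be sketched as follows, again assuming an hmmlearn GaussianHMM; clearing init_params so that fitting starts from the UBM's parameters is one plausible mechanism, not the patent's prescribed one:

```python
import copy
import numpy as np

def adapt_word_hmm_from_ubm(ubm, word_utterances, n_iter=10):
    """Start the word HMM from the universal background model's parameters
    instead of a random initialization, then refine on the word's data."""
    model = copy.deepcopy(ubm)
    model.init_params = ""   # keep the UBM parameters as the starting point
    model.n_iter = n_iter
    X = np.vstack(word_utterances)
    model.fit(X, lengths=[len(u) for u in word_utterances])
    return model
```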
After training, the recognition performance of the HMMs can be measured, for example with a scheme combining recall and precision. Recall is the ratio of the number of correctly recognized characters/words to the total number of characters/words in the speech; precision is the ratio of the number of correctly recognized characters/words to the total number of characters/words the system recognized. Here, "correctly recognized" means the candidate speech segment and the character/word of the labeled answer agree in content and overlap in position. Based on the measurement results, a threshold (the first/second threshold mentioned above) may be set; when the probability output by an HMM exceeds this threshold, the input speech is judged to be pronunciation data for the character/word corresponding to that HMM.
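A minimal sketch of the two metrics, simplified to content matching only (the positional-overlap requirement above is omitted for brevity):

```python
def recall_and_precision(accepted, reference):
    """accepted: characters/words the system recognized;
    reference: the characters/words actually spoken (labeled answer)."""
    correct = sum(1 for w in accepted if w in reference)
    recall = correct / len(reference) if reference else 0.0
    precision = correct / len(accepted) if accepted else 0.0
    return recall, precision
```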
The speech evaluation scheme of the present disclosure is described in detail with reference to fig. 1 to 5.
The present disclosure may also be implemented as a speech recognition scheme: speech is received and then recognized based on the character/word recognition models to obtain a recognition result. Thus, when recognizing multilingual mixed speech, switching between language recognizers need not be considered, and character/word-level recognition of the speech yields a more accurate result. For the structure and recognition principle of the character/word recognition model, see the description above.
Optionally, the similarity between the speech and the audio feature distribution characterized by the feature distribution model may be calculated, and the recognition result adjusted according to this similarity, with the recognition score negatively correlated with the similarity. As an example, the similarity output by the feature distribution model may be subtracted from the similarity output by the character/word recognition model, the difference serving as the similarity between the speech and the recognized character/word, and the final recognition result obtained from this difference. For the adjustment principle of the feature distribution model, see the description above.
Fig. 6 shows a schematic block diagram of the structure of a speech evaluation device according to an embodiment of the present disclosure. Wherein the functional modules of the speech evaluation device can be realized by hardware, software or a combination of hardware and software for realizing the principles of the present invention. It will be appreciated by those skilled in the art that the functional blocks described in fig. 6 may be combined or divided into sub-blocks to implement the principles of the invention described above. Thus, the description herein may support any possible combination, or division, or further definition of the functional modules described herein.
The functional modules that the speech evaluation device can have and the operations that each functional module can perform are briefly described below, and for the details related thereto, reference may be made to the above-mentioned description, which is not repeated here.
Referring to fig. 6, the speech evaluation device 600 includes an output module 610, a receiving module 620, a recognition module 630, and an evaluation module 640.
The output module 610 is configured to output prompt information used to prompt a user to utter speech for a test text, where the test text includes one or more characters/words.
The receiving module 620 is used for receiving voice.
The recognition module 630 is configured to recognize the speech based on a word/word recognition model corresponding to the word/word in the test text, and the word/word recognition model is configured to recognize whether the speech matches the word/word corresponding to the word/word recognition model. For the word/phrase recognition model and the specific recognition process, reference may be made to the above description, and details are not repeated here.
The evaluating module 640 is used for evaluating the voice based on the recognition result. For a specific evaluation process, see the above description, and no further details are given here.
Optionally, the speech evaluating apparatus 600 further includes a calculating module configured to calculate a similarity between the speech and the audio feature distribution represented by the speech based on the feature distribution model, and an adjusting module configured to adjust the evaluation result according to the similarity calculated by the feature distribution model, where the level of the evaluation result is inversely related to the magnitude of the similarity. For the feature distribution model and the adjustment process, see the above description, and are not described herein again.
Optionally, the speech evaluation device 600 further comprises a training module for training the feature distribution model and/or the word/phrase recognition model. For the training process, see the above description, and are not repeated here.
Fig. 7 shows a schematic block diagram of the structure of a speech evaluation device according to another embodiment of the present disclosure. Wherein the functional modules of the speech evaluation device can be realized by hardware, software or a combination of hardware and software for realizing the principles of the present invention. It will be appreciated by those skilled in the art that the functional blocks described in fig. 7 may be combined or divided into sub-blocks to implement the principles of the invention described above. Thus, the description herein may support any possible combination, or division, or further definition of the functional modules described herein.
The functional modules that the speech evaluation device can have and the operations that each functional module can perform are briefly described below, and for the details related thereto, reference may be made to the above-mentioned description, which is not repeated here.
Referring to fig. 7, the speech evaluation device 700 includes a receiving module 710, a recognition module 720, and an evaluation module 730.
The receiving module 710 is used for receiving voice.
The recognition module 720 is configured to recognize the speech based on a word/phrase recognition model, where the word/phrase recognition model is configured to recognize whether the speech matches a word/phrase corresponding to the word/phrase recognition model. For the word/phrase recognition model and the specific recognition process, reference may be made to the above description, and details are not repeated here.
The evaluation module 730 is configured to evaluate the speech based on the recognition result. For a specific evaluation process, see the above description, and no further details are given here.
Optionally, the speech evaluation device 700 further includes a calculation module configured to calculate a similarity between the speech and the audio feature distribution represented by the speech based on the feature distribution model, and an adjustment module configured to adjust the evaluation result according to the similarity calculated by the feature distribution model, where the level of the evaluation result is inversely related to the magnitude of the similarity. For the feature distribution model and the adjustment process, see the above description, and are not described herein again.
Optionally, the speech evaluation device 700 further comprises a training module for training the feature distribution model and/or the word/phrase recognition model. For the training process, see the above description, and are not repeated here.
Fig. 8 shows a schematic block diagram of the structure of a speech recognition apparatus according to an embodiment of the present disclosure. Wherein the functional blocks of the speech recognition apparatus may be implemented by hardware, software, or a combination of hardware and software implementing the principles of the present invention. It will be appreciated by those skilled in the art that the functional blocks described in fig. 8 may be combined or divided into sub-blocks to implement the principles of the invention described above. Thus, the description herein may support any possible combination, or division, or further definition of the functional modules described herein.
In the following, functional modules that the speech recognition apparatus can have and operations that each functional module can perform are briefly described, and for details related thereto, reference may be made to the above-mentioned related description, which is not repeated herein.
Referring to fig. 8, the speech recognition apparatus 800 includes a receiving module 810 and a recognition module 820.
The receiving module 810 is used for receiving voice.
The recognition module 820 is used for recognizing the voice based on a word/word recognition model, and the word/word recognition model is used for recognizing whether the voice is matched with the word/word corresponding to the word/word recognition model. For the recognition principle of the word/phrase recognition module, the above related description can be referred to, and the details are not repeated herein.
Optionally, the speech recognition apparatus 800 may further include a calculation module and an adjustment module. The calculation module may calculate, based on the feature distribution model, the similarity between the speech and the audio feature distribution it characterizes, and the adjustment module may adjust the recognition result according to this similarity, with the recognition score negatively correlated with the similarity.
As an example, the calculation module may subtract the similarity output by the feature distribution model from the similarity output by the character/word recognition model to obtain the similarity between the speech and the recognized character/word, and the adjustment module may obtain the final recognition result from this difference. For the principle of using the feature distribution model for adjustment, see the description above.
Fig. 9 is a schematic structural diagram of a computing device that can be used to implement the speech evaluation method or the speech recognition method according to an embodiment of the present invention.
Referring to fig. 9, the computing device 1000 includes a memory 1010 and a processor 1020.
The processor 1020 may be a multi-core processor or may include multiple processors. In some embodiments, the processor 1020 may include a general-purpose main processor and one or more special-purpose coprocessors, such as a graphics processing unit (GPU) or a digital signal processor (DSP). In some embodiments, the processor 1020 may be implemented using customized circuits, such as an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA).
The memory 1010 may include various types of storage units, such as system memory, read-only memory (ROM), and a persistent storage device. The ROM may store static data or instructions needed by the processor 1020 or other modules of the computer. The persistent storage device may be a readable and writable non-volatile storage device that does not lose stored instructions and data even after the computer is powered off. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is employed as the persistent storage device. In other embodiments, the persistent storage device may be a removable storage device (e.g., a floppy disk or an optical drive). The system memory may be a readable and writable volatile memory device, such as a dynamic random access memory, and may store instructions and data that some or all of the processors require at runtime. Further, the memory 1010 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory), and magnetic and/or optical disks. In some embodiments, the memory 1010 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM or dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density optical disc, a flash memory card (e.g., an SD card, a mini SD card, or a Micro-SD card), or a magnetic floppy disk. Computer-readable storage media do not contain carrier waves or transitory electronic signals transmitted by wireless or wired means.
The memory 1010 has executable code stored thereon which, when executed by the processor 1020, causes the processor 1020 to perform the speech evaluation method or the speech recognition method described above.
The speech evaluation method, speech recognition method, corresponding apparatuses, voice interaction device, and computing device according to the present invention have been described in detail above with reference to the accompanying drawings.
Furthermore, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for carrying out the steps defined in the above-described method of the invention.
Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the steps of the above-described method according to the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary rather than exhaustive, and is not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles of the embodiments, their practical application, or improvements over technologies in the marketplace, and to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (23)

1. A speech evaluation method, comprising:
outputting prompt information, wherein the prompt information is used for prompting a user to utter speech for a test text, and the test text comprises one or more characters/words;
receiving voice;
recognizing the voice based on a character/word recognition model corresponding to the character/word in the test text, wherein the character/word recognition model is used for recognizing whether the voice matches the character/word corresponding to the character/word recognition model;
and evaluating the voice based on the recognition result.
2. The speech evaluation method of claim 1, wherein the test text comprises a plurality of characters/words, the method further comprising:
segmenting the voice to obtain a plurality of audio segments, wherein
the step of recognizing the voice based on the character/word recognition model corresponding to the character/word in the test text comprises: recognizing each audio segment based on a plurality of character/word recognition models to determine the character/word corresponding to each audio segment, wherein each character/word recognition model corresponds to one of the plurality of characters/words.
3. The speech evaluation method according to claim 2, wherein the character/word recognition model is a hidden Markov model, and wherein the step of recognizing each audio segment based on a plurality of character/word recognition models to determine the character/word corresponding to each audio segment comprises:
performing feature extraction on each audio segment to obtain a feature sequence of the audio segment;
inputting the feature sequence into the plurality of character/word recognition models respectively to obtain a probability value of each character/word recognition model generating the feature sequence;
and in a case where a character/word recognition model whose probability value is greater than a first threshold exists, recognizing the audio segment as the character/word corresponding to the character/word recognition model with the largest probability value.
4. The speech evaluation method according to claim 1, wherein the character/word recognition model is a hidden Markov model, the test text comprises a single character/word, and the step of recognizing the voice based on the character/word recognition model corresponding to the character/word in the test text comprises:
performing feature extraction on the voice to obtain a feature sequence of the voice;
inputting the feature sequence into the character/word recognition model corresponding to the character/word to obtain a probability value of the character/word recognition model generating the feature sequence;
and in a case where the probability value is greater than a second threshold, recognizing the voice as the character/word.
5. The speech evaluation method according to claim 3 or 4, wherein the feature extraction is performed based on a neural network.
6. The speech evaluation method according to claim 1, wherein the step of evaluating the voice based on the recognition result comprises:
evaluating the voice according to the difference between the characters/words recognized from the voice and the characters/words in the test text; and/or
evaluating the voice according to the similarity, output by the character/word recognition model, between the voice and the character/word corresponding to the voice.
7. The speech evaluation method of claim 1, further comprising:
calculating a similarity between the voice and the audio feature distribution characterized by a feature distribution model;
and adjusting the evaluation result according to the similarity calculated by the feature distribution model, wherein the evaluation result is negatively correlated with the similarity.
8. The speech evaluation method of claim 7,
wherein the feature distribution model is a Gaussian mixture model.
9. The speech evaluation method of claim 1, further comprising:
acquiring training data, wherein the training data comprises voice data of a plurality of characters/words;
and training, based on the voice data of the same character/word, a character/word recognition model of the character/word.
10. The speech evaluation method of claim 9, further comprising:
training a feature distribution model based on the voice data of the plurality of characters/words, the feature distribution model being used for characterizing the audio feature distribution of the voice data of the plurality of characters/words.
11. The speech evaluation method of claim 10, further comprising: screening the voice data of the same character/word based on the feature distribution model,
wherein the step of training the character/word recognition model of the character/word based on the voice data of the same character/word comprises: training the character/word recognition model based on the screened voice data.
12. The speech evaluation method according to claim 9, wherein the plurality of characters/words comprise characters/words in one or more languages.
13. A speech evaluation method, comprising:
receiving voice;
recognizing the voice based on a character/word recognition model, wherein the character/word recognition model is used for recognizing whether the voice matches the character/word corresponding to the character/word recognition model;
and evaluating the voice based on the recognition result.
14. The speech evaluation method of claim 13, further comprising:
calculating a similarity between the voice and the audio feature distribution characterized by a feature distribution model;
and adjusting the evaluation result according to the similarity calculated by the feature distribution model, wherein the evaluation result is negatively correlated with the similarity.
15. A speech recognition method, comprising:
receiving voice;
and recognizing the voice based on a character/word recognition model, wherein the character/word recognition model is used for recognizing whether the voice matches the character/word corresponding to the character/word recognition model.
16. A speech evaluation apparatus, comprising:
an output module for outputting prompt information, wherein the prompt information is used for prompting a user to utter speech for a test text, and the test text comprises one or more characters/words;
a receiving module for receiving voice;
a recognition module for recognizing the voice based on a character/word recognition model corresponding to the character/word in the test text, wherein the character/word recognition model is used for recognizing whether the voice matches the character/word corresponding to the character/word recognition model;
and an evaluation module for evaluating the voice based on the recognition result.
17. A speech evaluation apparatus, comprising:
a receiving module for receiving voice;
a recognition module for recognizing the voice based on a character/word recognition model, wherein the character/word recognition model is used for recognizing whether the voice matches the character/word corresponding to the character/word recognition model;
and an evaluation module for evaluating the voice based on the recognition result.
18. A voice interaction device, comprising:
the device comprises a first output module, a second output module and a third output module, wherein the first output module is used for outputting prompt information, and the prompt information is used for prompting a user to send out voice aiming at a test text, and the test text comprises one or more characters/words;
the receiving module is used for receiving voice;
the recognition module is used for recognizing the voice based on a character/word recognition model corresponding to the character/word in the test text, and the character/word recognition model is used for recognizing whether the voice is matched with the character/word corresponding to the character/word recognition model;
and the evaluation module is used for evaluating the voice based on the recognition result.
19. The voice interaction device of claim 18, further comprising:
a second output module for outputting voice teaching data.
20. The voice interaction device of claim 18, wherein the voice interaction device is a smart speaker or a smart watch.
21. A speech recognition apparatus, comprising:
a receiving module for receiving voice;
and a recognition module for recognizing the voice based on a character/word recognition model, wherein the character/word recognition model is used for recognizing whether the voice matches the character/word corresponding to the character/word recognition model.
22. A computing device, comprising:
a processor; and
a memory having executable code stored thereon, wherein the executable code, when executed by the processor, causes the processor to perform the method of any one of claims 1 to 15.
23. A non-transitory machine-readable storage medium having executable code stored thereon, wherein the executable code, when executed by a processor of an electronic device, causes the processor to perform the method of any one of claims 1 to 15.
CN201910496211.1A 2019-06-10 2019-06-10 Voice evaluation and voice recognition method, device, equipment and storage medium Pending CN112151018A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910496211.1A CN112151018A (en) 2019-06-10 2019-06-10 Voice evaluation and voice recognition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910496211.1A CN112151018A (en) 2019-06-10 2019-06-10 Voice evaluation and voice recognition method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112151018A true CN112151018A (en) 2020-12-29

Family

ID=73868309

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910496211.1A Pending CN112151018A (en) 2019-06-10 2019-06-10 Voice evaluation and voice recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112151018A (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1279806A * 1997-09-19 2001-01-10 Microsoft Corporation Speech recognition system for recognizing continuous and isolated speech
KR20040061070A * 2002-12-30 2004-07-07 KT Corporation Apparatus and Method for Speech Recognition in Speech Recognition System
JP2005234504A * 2004-02-23 2005-09-02 Advanced Telecommunication Research Institute International Speech recognition apparatus and method for training HMM pronunciation model
CN106057206A * 2016-06-01 2016-10-26 Tencent Technology (Shenzhen) Co., Ltd. Voiceprint model training method, voiceprint recognition method and device
CN105845134A * 2016-06-14 2016-08-10 iFLYTEK Co., Ltd. Spoken language evaluation method through freely read topics and spoken language evaluation system thereof
CN107316638A * 2017-06-28 2017-11-03 Beijing Fenbi Future Technology Co., Ltd. A kind of poem recites evaluating method and system, a kind of terminal and storage medium
CN108549637A * 2018-04-19 2018-09-18 BOE Technology Group Co., Ltd. Method for recognizing semantics, device based on phonetic and interactive system
CN109273012A * 2018-09-06 2019-01-25 Hohai University A kind of identity identifying method based on Speaker Identification and spoken digit recognition

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SONG Yang; WANG Lan: "A feature extraction method for semantic-similarity scoring of spontaneous spoken-language assessment texts", Journal of Integration Technology, no. 01, 15 January 2013 (2013-01-15), pages 29-34 *
MI Chenggang; YANG Yating; ZHOU Xi; LI Xiao; YANG Mingzhong: "Recognition of Chinese loanwords in Uyghur based on string similarity", Journal of Chinese Information Processing, no. 05, 15 September 2013 (2013-09-15), pages 173-179 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115188369A (en) * 2022-09-09 2022-10-14 北京探境科技有限公司 Voice recognition rate testing method, system, chip, electronic device and storage medium

Similar Documents

Publication Publication Date Title
CN105845134B (en) Spoken language evaluation method and system for freely reading question types
US9704413B2 (en) Non-scorable response filters for speech scoring systems
US8990082B2 (en) Non-scorable response filters for speech scoring systems
CN109410664B (en) Pronunciation correction method and electronic equipment
CN101751919B (en) Spoken Chinese stress automatic detection method
CN111402862B (en) Speech recognition method, device, storage medium and equipment
WO2018192186A1 (en) Speech recognition method and apparatus
CN109697988B (en) Voice evaluation method and device
CN108899033B (en) Method and device for determining speaker characteristics
CN103559892A (en) Method and system for evaluating spoken language
CN103594087A (en) Method and system for improving oral evaluation performance
KR20100130263A (en) Apparatus and method for extension of articulation dictionary by speech recognition
CN104464734A (en) Simultaneous speech processing apparatus, method and program
CN110503941B (en) Language ability evaluation method, device, system, computer equipment and storage medium
CN110176251B (en) Automatic acoustic data labeling method and device
CN113486970B (en) Reading capability evaluation method and device
CN108831503B (en) Spoken language evaluation method and device
CN109697975B (en) Voice evaluation method and device
CN112116181B (en) Classroom quality model training method, classroom quality evaluation method and classroom quality evaluation device
CN117809655A (en) Audio processing method, device, equipment and storage medium
CN112151018A (en) Voice evaluation and voice recognition method, device, equipment and storage medium
CN113742461A (en) Dialogue system test method and device and statement rewriting method
Middag et al. Towards an ASR-free objective analysis of pathological speech
Schuller et al. Incremental acoustic valence recognition: an inter-corpus perspective on features, matching, and performance in a gating paradigm
CN112687296B (en) Audio disfluency identification method, device, equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination