WO2022194044A1 - Pronunciation assessment method and apparatus, storage medium, and electronic device - Google Patents

Pronunciation assessment method and apparatus, storage medium, and electronic device

Info

Publication number
WO2022194044A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
sample
organ
pronunciation
action
Prior art date
Application number
PCT/CN2022/080357
Other languages
French (fr)
Chinese (zh)
Inventor
顾宇
马泽君
Original Assignee
北京有竹居网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司 filed Critical 北京有竹居网络技术有限公司
Publication of WO2022194044A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Definitions

  • the present disclosure relates to the field of education, and in particular, to a pronunciation evaluation method and device, a storage medium and an electronic device.
  • the present disclosure provides a pronunciation evaluation method, including: displaying example text to a user; collecting audio to be evaluated that the user reads aloud based on the example text; generating an articulator action video based on the audio to be evaluated; generating pronunciation evaluation information based on the articulator action video and the standard articulator action video corresponding to the example text; and displaying the pronunciation evaluation information to the user.
  • the present disclosure provides a pronunciation evaluation device, comprising: an example sentence display module, configured to display example text to a user; an audio collection module, configured to collect the audio to be evaluated that the user reads aloud based on the example text; a video generation module, configured to generate an articulator action video based on the audio to be evaluated; a pronunciation evaluation module, configured to generate pronunciation evaluation information based on the articulator action video and the standard articulator action video corresponding to the example text; and an evaluation display module, configured to display the pronunciation evaluation information to the user.
  • the present disclosure provides a non-transitory computer-readable medium on which a computer program is stored, and when the program is executed by a processing apparatus, implements the steps of the method described in the first aspect of the present disclosure.
  • the present disclosure provides an electronic device, including a storage device and a processing device, where a computer program is stored on the storage device, and the processing device is configured to execute the computer program in the storage device so as to implement the steps of the method described in the first aspect of the present disclosure.
  • the present disclosure provides a computer program comprising instructions that, when executed by a processor, cause the processor to perform the steps of the method of the first aspect of the present disclosure.
  • the present disclosure provides a computer program product comprising instructions that, when executed by a processor, cause the processor to perform the steps of the method of the first aspect of the present disclosure.
  • FIG. 1 is a flow chart of a pronunciation evaluation method according to an exemplary disclosed embodiment.
  • FIG. 2 is a flowchart of another pronunciation evaluation method according to an exemplary disclosed embodiment of the present disclosure.
  • FIG. 3 is a block diagram of an apparatus for evaluating pronunciation according to an exemplary disclosed embodiment of the present disclosure.
  • FIG. 4 is a block diagram of an electronic device according to an exemplary disclosed embodiment of the present disclosure.
  • the term “including” and variations thereof are open-ended inclusions, i.e., “including but not limited to”.
  • the term “based on” is “based at least in part on.”
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one additional embodiment”; the term “some embodiments” means “at least some embodiments”. Relevant definitions of other terms will be given in the description below.
  • Fig. 1 is a flowchart of a pronunciation evaluation method according to an exemplary disclosed embodiment. As shown in Fig. 1 , the method includes steps S11-S15.
  • the example sentence text can be a text of any length, such as a phrase, a sentence, a paragraph, an article, etc.
  • the example sentence text can also be a clause obtained after a longer text is segmented into clauses.
  • the example text can be displayed to the user in the form of text, so that the user can perform a pronunciation test. If the user wants to learn pronunciation, the example text can be displayed to the user in the form of audio, so that the user can follow along.
  • the present disclosure is not limited in this respect; the example text may also be presented to the user in the form of text and audio together.
  • the example text can be displayed in the form of text through the display device of the user terminal, and the example text can also be presented in the form of speech through the playback device of the user terminal, wherein the speech corresponding to the example text can be stored in advance, or can be generated when needed by converting the text to speech directly.
  • the user terminal may include any device with a display function, such as a mobile phone, a computer, a learning machine, and a wearable device.
  • the example sentence audio is generated based on the example sentence text
  • the audio and the standard action video of the vocal organ are synthesized into an example sentence demonstration video
  • the example sentence text and the example sentence demonstration video are displayed to the user.
  • the standard action video of the pronunciation organ is generated based on the example sentence text, and the video features can be generated by the pre-trained video feature generation model.
  • the example sentence text is divided into unit text sequences, and the unit text sequences are input into a video feature generation model to obtain a video feature sequence, and based on the video feature sequence, a standard action video of a vocal organ is generated.
  • the unit text sequence is a sequence obtained by dividing the example text into small units for generating videos.
  • the unit text can be phonemes, words, single characters, etc. After the example text is segmented, the units are more fine-grained, so that the model can more efficiently generate accurate video feature sequences based on the unit text. For example, when the example text is "How are you", the example text can be divided into the unit text sequence "how", "are" and "you" by using words as the division unit, or the example text can be split into a phoneme-level unit text sequence by using phonemes as the division unit.
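  • The following is a minimal sketch of such unit-text segmentation at word and phoneme granularity. The tiny phoneme lexicon is a hypothetical stand-in; a real system would use a full pronouncing dictionary or a grapheme-to-phoneme model.

```python
def split_into_words(example_text: str) -> list:
    """Word-level unit text sequence, e.g. "How are you" -> ["how", "are", "you"]."""
    return example_text.lower().replace("?", "").replace(",", "").split()

# Hypothetical pronouncing lexicon used only for illustration.
LEXICON = {"how": ["HH", "AW"], "are": ["AA", "R"], "you": ["Y", "UW"]}

def split_into_phonemes(example_text: str) -> list:
    """Phoneme-level unit text sequence built from the word-level sequence."""
    phonemes = []
    for word in split_into_words(example_text):
        phonemes.extend(LEXICON.get(word, ["<unk>"]))
    return phonemes

print(split_into_words("How are you"))     # ['how', 'are', 'you']
print(split_into_phonemes("How are you"))  # ['HH', 'AW', 'AA', 'R', 'Y', 'UW']
```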
  • the video feature generation model is trained in the following ways:
  • the sample vocal organ action video is a demonstration video produced or recorded based on the sample text.
  • the demonstration video can be an animation demonstration video of the oral cavity made by any animation production and rendering software, or it can be a video of the head of a person reading the sample text aloud, captured by an MRI machine.
  • the feature information is principal component information
  • the principal component information of each video frame is obtained by performing principal component analysis on the sample articulator action video frame by frame, and the principal component information of each video frame is arranged in the order of the video frames to obtain the sample video feature sequence.
  • a restored image can be obtained from each item of principal component information, and the restored images can be arranged and synthesized according to the order of the sample video feature sequence to obtain a restored demonstration video.
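  • The following is a minimal sketch of building such a frame-by-frame principal component feature sequence and restoring a demonstration video from it; the frame shape and the number of components are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

def video_to_feature_sequence(frames: np.ndarray, n_components: int = 32):
    """frames: (num_frames, height, width) sample articulator action video."""
    num_frames, h, w = frames.shape
    flat = frames.reshape(num_frames, h * w).astype(np.float32)
    pca = PCA(n_components=n_components)
    feature_sequence = pca.fit_transform(flat)   # one principal component vector per frame
    return pca, feature_sequence

def feature_sequence_to_video(pca, feature_sequence, frame_shape):
    """Inverse-transform each feature vector back into a restored frame."""
    restored = pca.inverse_transform(feature_sequence)
    return restored.reshape(-1, *frame_shape)

frames = np.random.rand(120, 96, 96)              # stand-in for an MRI clip
pca, feats = video_to_feature_sequence(frames)    # sample video feature sequence
restored_video = feature_sequence_to_video(pca, feats, (96, 96))
```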
  • the video feature generation model is trained so that the video feature generation model can generate a corresponding video feature or video feature sequence based on any unit text.
  • the video feature generation model can be a deep learning model, which generates the training samples input to the deep learning model by labeling each sample unit text in the sample unit text sequence. After multiple rounds of iterative training, the deep learning model can accurately generate video features based on unit text.
  • the video feature generation model may also be an attention model
  • the video feature generation model includes an encoder and a decoder
  • the encoder is configured to generate an encoding result based on a unit text sequence
  • the decoder is configured to generate a video feature sequence based on the encoding result
  • the encoder and decoder are trained in the form of end-to-end training from unit text sequences to video feature sequences, so that the attention model can accurately generate video feature sequences based on unit text sequences.
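  • The following is a minimal sketch of such an encoder-decoder model mapping a unit text sequence to a video feature sequence; the vocabulary size, dimensions and the use of a standard Transformer are assumptions rather than the patent's specific architecture.

```python
import torch
import torch.nn as nn

class VideoFeatureGenerator(nn.Module):
    """Encoder-decoder sketch: unit text ids in, video feature sequence out."""
    def __init__(self, vocab_size=64, d_model=128, feat_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=4,
                                          num_encoder_layers=2, num_decoder_layers=2,
                                          batch_first=True)
        self.in_proj = nn.Linear(feat_dim, d_model)   # previous video features -> decoder input
        self.out_proj = nn.Linear(d_model, feat_dim)  # decoder output -> one video feature per frame

    def forward(self, unit_text_ids, prev_video_feats):
        src = self.embed(unit_text_ids)               # (batch, text_len, d_model)
        tgt = self.in_proj(prev_video_feats)          # (batch, video_len, d_model)
        return self.out_proj(self.transformer(src, tgt))

model = VideoFeatureGenerator()
unit_text = torch.randint(0, 64, (1, 6))              # e.g. phoneme ids of the sample unit text
target_feats = torch.randn(1, 10, 32)                 # sample video feature sequence (teacher forcing)
pred = model(unit_text, target_feats)
loss = nn.functional.mse_loss(pred, target_feats)     # end-to-end regression objective
```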
  • the sample demonstration video may be an MRI video, captured by an MRI apparatus, of a user reciting the sample demonstration text. Multiple sample texts are obtained by segmenting the sample demonstration text into clauses, and the sample demonstration video is segmented, based on the sentence segmentation result, into sub-videos corresponding to the different sample texts, so that a plurality of sample articulator action videos can be obtained.
  • the sample demonstration text is segmented to obtain a plurality of sample texts
  • the speech recognition is performed on the sample speech recorded synchronously with the sample demonstration video
  • the speech corresponding to each sample text is determined based on the speech recognition result.
  • Based on the time axis information of each speech segment, the sample articulator action video corresponding to each speech segment is determined from the sample demonstration video. For example, by segmenting the sample demonstration text "How are you? I'm fine thank you, and you?", four clauses "How are you", "I'm fine", "thank you" and "and you" can be obtained.
  • the time axis information of the speech segment corresponding to "How are you" can be obtained as "00:00:00 to 00:01:40", the time axis information of the speech segment corresponding to "I'm fine" is "00:01:40 to 00:02:50", the time axis information of the speech segment corresponding to "thank you" is "00:02:50 to 00:04:40", and the time axis information of the speech segment corresponding to "and you" is "00:04:40 to 00:06:00". The sample demonstration video with a duration of 6 seconds can thus be divided into the four video clips "00:00:00 to 00:01:40", "00:01:40 to 00:02:50", "00:02:50 to 00:04:40" and "00:04:40 to 00:06:00", where each video clip is the sample demonstration sub-video of its corresponding sample clause text.
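  • The following is a minimal sketch of cutting the sample demonstration video into per-clause sub-videos from such time axis information; the frame rate and the interpretation of the timestamps as seconds are assumptions for illustration.

```python
import numpy as np

def split_video_by_segments(frames: np.ndarray, fps: float, segments):
    """frames: (num_frames, H, W); segments: list of (clause, start_s, end_s)."""
    sub_videos = {}
    for clause, start_s, end_s in segments:
        start, end = int(round(start_s * fps)), int(round(end_s * fps))
        sub_videos[clause] = frames[start:end]        # sample articulator action video for the clause
    return sub_videos

frames = np.zeros((150, 96, 96))                      # 6 s of video at an assumed 25 fps
segments = [("How are you", 0.00, 1.40), ("I'm fine", 1.40, 2.50),
            ("thank you", 2.50, 4.40), ("and you", 4.40, 6.00)]
clips = split_video_by_segments(frames, fps=25.0, segments=segments)
```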
  • the above sentence segmentation methods are only shown as examples, and those skilled in the art may use other sentence segmentation methods to perform sentence segmentation processing
  • Because the MRI apparatus may not have an audio recording function during MRI video recording, additional recording equipment is required to record the sample speech, and the sample demonstration video and the sample speech may have a time difference caused by problems such as different start times and different end times during recording. Therefore, in a possible implementation, alignment processing is performed on the time axis information of the sample speech and the sample demonstration video, and the length of the sample speech or the sample demonstration video is adjusted so that the sample speech is consistent with the length of the sample demonstration video.
  • the face position in the sample demonstration video is adjusted frame by frame, so that the same organ in each video frame is located at the same image position.
  • the adjustment can be performed in the form of pixel tracking or optical flow tracking, or it can be performed by extracting and aligning feature points.
  • the processing of video frames includes but is not limited to rotation, translation, zooming in, and zooming out.
  • the screen size is uniformly cropped to reduce the interference information in the video.
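  • The following is a minimal sketch of such frame-by-frame alignment and uniform cropping, using a mean optical-flow translation estimate; the reference-frame strategy and the crop box are assumptions, and feature-point alignment could be used instead as described above.

```python
import cv2
import numpy as np

def align_and_crop(frames, crop_box=(10, 10, 80, 80)):
    """frames: list of grayscale uint8 images; crop_box: (x, y, w, h)."""
    ref = frames[0]
    x, y, w, h = crop_box
    aligned = []
    for frame in frames:
        # Dense optical flow from the reference frame to the current frame.
        flow = cv2.calcOpticalFlowFarneback(ref, frame, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        dx, dy = float(np.mean(flow[..., 0])), float(np.mean(flow[..., 1]))
        # Shift the frame back so the same organ stays at the same image position.
        m = np.float32([[1, 0, -dx], [0, 1, -dy]])
        shifted = cv2.warpAffine(frame, m, (frame.shape[1], frame.shape[0]))
        aligned.append(shifted[y:y + h, x:x + w])     # uniform crop to reduce interference
    return aligned
```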
  • S12 Collect the audio to be evaluated read aloud by the user based on the example text.
  • the voice read aloud by the user can be collected through the voice collecting device of the user terminal.
  • speech recognition can be performed on the collected audio to be evaluated, and the recognition result can be compared with the example text.
  • If the text similarity is lower than a preset similarity threshold, a prompt can be sent to the user to remind the user to re-read the example text.
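  • The following is a minimal sketch of this check; the ASR call is omitted, and difflib's ratio and the 0.8 threshold stand in for whatever similarity measure and threshold an implementation actually uses.

```python
import difflib

SIMILARITY_THRESHOLD = 0.8   # assumed preset similarity threshold

def check_readback(example_text: str, recognized_text: str) -> bool:
    """Compare the ASR result of the collected audio with the example text."""
    similarity = difflib.SequenceMatcher(
        None, example_text.lower(), recognized_text.lower()).ratio()
    if similarity < SIMILARITY_THRESHOLD:
        print("Please re-read the example text.")     # prompt sent to the user
        return False
    return True

check_readback("How are you", "how are you")          # True, no prompt needed
```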
  • the audio to be evaluated is converted into an audio feature vector to be processed; the audio feature vector to be processed is input into a video generation model, and the articulator action video corresponding to the audio to be evaluated is obtained from the output of the video generation model.
  • An achievable implementation is to convert the audio to be evaluated into an audio feature vector to be evaluated.
  • the audio to be evaluated can be input into a speech recognition model to obtain the audio feature vector to be evaluated.
  • the audio feature vector to be evaluated includes: The phoneme posterior probability vector of each frame of audio in the audio to be evaluated, and the dimension of each phoneme posterior probability vector is the phoneme dimension included in the language type corresponding to the audio to be evaluated.
  • a phoneme is the smallest unit of speech divided according to the natural properties of speech.
  • Each human voice, animal voice, and musical instrument sound can be divided into a limited number of minimum phonetic units based on attributes.
  • Each frame of audio in the audio to be evaluated may be audio of one phoneme.
  • a phoneme can be represented by a phoneme posterior probability vector.
  • the dimension of each phoneme posterior probability vector is the phoneme dimension included in the language type corresponding to the audio to be evaluated. For example, assuming that the language type corresponding to the audio to be evaluated is English, since the number of English phonemes is 48, the dimension of the English phoneme posterior probability vector is 48. That is to say, an English phoneme posterior probability vector includes 48 probability values greater than or equal to 0 and less than 1, and the sum of the 48 probability values is 1.
  • the phoneme corresponding to the maximum value among the 48 probability values is the English phoneme represented by the phoneme posterior probability vector.
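  • The following is a minimal sketch of a per-frame phoneme posterior probability vector for English with 48 phonemes as stated above; producing it with a softmax over acoustic-model logits and reading off the phoneme by argmax are assumptions about a typical implementation.

```python
import numpy as np

NUM_ENGLISH_PHONEMES = 48

def phoneme_posterior(logits: np.ndarray) -> np.ndarray:
    """Per-frame posterior: 48 values in [0, 1) summing to 1."""
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

frame_logits = np.random.randn(NUM_ENGLISH_PHONEMES)  # stand-in for one frame of ASR output
posterior = phoneme_posterior(frame_logits)
predicted_phoneme_index = int(posterior.argmax())      # phoneme represented by this vector
assert abs(posterior.sum() - 1.0) < 1e-6
```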
  • If the language type corresponding to the audio to be evaluated is a language type that imitates a target musical instrument, and that language type includes, for example, 50 sound units, the dimension of the phoneme posterior probability vector is also 50, that is, it includes 50 probability values whose sum is 1.
  • Each frame of audio in the audio to be evaluated may also be the audio of one character/word.
  • In that case, a character/word is represented by a character/word posterior probability vector.
  • a speech recognition model (Automatic Speech Recognition, ASR for short) is a model that converts sounds into corresponding text or commands.
  • the speech recognition model can be obtained by training in the following way: constructing model training samples according to sample audio frames and the phonemes corresponding to the sample audio frames, and training according to the model training samples to obtain the speech recognition model.
  • a mapping table of speech feature parameters and phonemes is constructed according to the sample audio frame and the phoneme corresponding to the sample audio frame.
  • the feature parameters of the speech to be processed are obtained, and the feature parameters of the speech to be processed are matched one by one against the speech templates in the speech parameter library to obtain the matching probability between the feature parameters of the speech to be processed and each speech feature parameter in the speech parameter library.
  • the phoneme posterior probability vector of each frame of audio in the audio to be evaluated is obtained according to the mapping table between speech feature parameters and phonemes.
  • This method of training a speech recognition model using a small, limited number of phonemes and phoneme audio can reduce the model training workload compared with the method of using a large number of characters/words and their audio to train a speech recognition model, and a trained speech recognition model can be obtained quickly.
  • the video generation model is obtained by training in the following manner: constructing model training data according to the sample audio and sample vocal organ action videos corresponding to the sample audio; training and obtaining the video generation model according to the model training data.
  • the present disclosure does not specifically limit the loss function of the video generation model.
  • the sample audio is the audio of all phonemes corresponding to the target language type.
  • the sample articulator action video may be an animation demonstration video of the articulator action corresponding to each phoneme produced by any animation production and rendering software.
  • the sample voice organ motion video may also be a voice organ motion video corresponding to each phoneme captured by an anatomical imaging instrument such as a camera, a nuclear magnetic resonance apparatus, a CT apparatus, or the like. Because users can not only read words or words in various human languages, but also imitate sounds such as animals and musical instruments.
  • the above-mentioned sound segments may refer to sound segments in other sounds that imitate non-human languages (such as a sound segment corresponding to a key or a string of a musical instrument).
  • the sample audio is the audio of all characters or words (or sound segments) corresponding to the target language type.
  • the sample articulator action video may be an animation demonstration video of the articulator action corresponding to each character or word (or sound segment) produced by any animation production and rendering software.
  • the sample articulator action video may also be an articulator action video corresponding to each character or word (or sound segment) captured by an anatomical imaging instrument such as a camera, a nuclear magnetic resonance apparatus, a CT apparatus, or the like.
  • Constructing the model training data according to the sample audio and the sample articulator action video corresponding to the sample audio may specifically include the following steps:
  • the sample articulator video features corresponding to each of the sample phoneme posterior probability vectors in the sample phoneme posterior probability vector sequence are extracted to obtain the sample articulator video feature sequence; the sample phoneme posterior probability vector sequence and the sample articulator video feature sequence are used as the model training data.
  • Each frame of audio in the sample audio is in one-to-one correspondence with each sample phoneme posterior probability vector in the sample phoneme posterior probability vector sequence, and each sample phoneme posterior probability vector in the sample phoneme posterior probability vector sequence is in one-to-one correspondence with each sample articulator video feature in the sample articulator video feature sequence.
  • each of the sample vocal organ video features is the pixel point feature information of at least one frame of video image in the sample vocal organ motion video; or, each of the sample vocal organ video features is the sample vocal organ motion Principal component feature information of at least one frame of video image in the video.
  • the principal component feature information is principal component coefficient data representing the video image obtained by performing dimension reduction processing on the video image through the principal component analysis algorithm.
  • sample articulator video feature corresponding to each of the sample phoneme posterior probability vectors in the sample phoneme posterior probability vector sequence is extracted based on the sample articulator action video.
  • the adjustment can be performed in the form of pixel tracking or optical flow tracking, or can be performed in the form of feature point extraction and alignment.
  • the size of the frame video image is uniformly cropped.
  • the position of the articulator in the sample articulator action video is adjusted frame by frame, so that the same articulator in each frame of video image is located at the same image position, which is conducive to reducing the impact of different positions of the same articulator in each frame of video images. The resulting interference to the model training effect and the model convergence speed.
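  • The following is a minimal sketch of constructing such training data by pairing one sample phoneme posterior vector per audio frame with one articulator video feature (e.g. principal component coefficients) per video frame; equal frame counts and the softmax over acoustic logits are illustrative assumptions.

```python
import numpy as np

def build_training_pairs(sample_audio_logits: np.ndarray,
                         sample_video_features: np.ndarray):
    """sample_audio_logits: (T, 48) per-frame acoustic logits;
       sample_video_features: (T, feat_dim) per-frame articulator video features."""
    assert len(sample_audio_logits) == len(sample_video_features)
    exp = np.exp(sample_audio_logits - sample_audio_logits.max(axis=1, keepdims=True))
    posterior_sequence = exp / exp.sum(axis=1, keepdims=True)    # sample phoneme posterior vector sequence
    return list(zip(posterior_sequence, sample_video_features))  # one-to-one model training pairs

pairs = build_training_pairs(np.random.randn(120, 48), np.random.randn(120, 32))
```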
  • Since the audio feature vector to be evaluated includes the phoneme posterior probability vector of each frame of audio in the audio to be evaluated, after inputting the audio feature vector to be evaluated into the trained video generation model, the articulator video feature related to the phoneme posterior probability vector of each frame of audio can be obtained.
  • the sample articulator action video used for training the video generation model is the video corresponding to the sample audio, while the sample articulator action video used for training the video feature generation model in step S11 is the video corresponding to the sample text.
  • Since the sample audio can be recorded synchronously with the sample articulator action video, the sample articulator action video in step S11 and step S13 may be the same video.
  • operations such as audio and video alignment, video cropping, and video center alignment for the sample vocal organ action video can be performed only once.
  • the aligned video and audio are used for training.
  • the pronunciation evaluation information includes at least one of pronunciation scoring information of the user, pronunciation action suggestion information, or a comparison video of the articulator action video and the articulator standard action video.
  • The comparison video is generated in the following way: based on the unit text content of the example text, the video clips that represent the same unit text content in the articulator action video and in the standard articulator action video are taken as a group of video clips.
  • the video clips belonging to the articulator action video and to the standard articulator action video in each video clip group are aligned;
  • the aligned articulator action video and standard articulator action video are spliced to obtain the comparison video.
  • the action difference information is obtained by comparing the articulator action video with the standard articulator action video corresponding to the example text; pronunciation scoring information is generated according to the action difference information, and/or the action difference information is matched against preset pronunciation action suggestion information to obtain target action suggestion information that matches the action difference information.
  • the action difference information may refer to the difference information of the movement trajectories of the feature points of the vocal organs.
  • the movement trajectory of the feature points of the speech organ is used to reflect the speech movement process of the speech organ.
  • the feature points of the articulators can be the centroid points, center points, contour feature points, etc. of the articulators, or other feature points that take the centroid points, center points, contour feature points, etc. of the articulators as reference points.
  • the present disclosure does not specifically limit the number and types of feature points.
  • the articulator action video includes at least one frame of video image. The position coordinates of the articulator feature points are determined in each frame of the articulator action video, so that a number of (groups of) feature point position coordinates corresponding to the number of frames of the articulator action video can be obtained. Based on the position coordinates of all the articulator feature points, the motion trajectory of the articulator feature points corresponding to the time axis of the articulator action video can be constructed.
  • the preset movement trajectory of the feature point corresponding to the example sentence text is the standard movement trajectory of the vocal organ feature point corresponding to the example sentence text. Similarity calculation is performed on the movement trajectory of the feature points of the speech organ corresponding to the action video of the speech organ and the preset movement trajectory of the standard feature points, and the similarity information of the two trajectory lines can be obtained.
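  • The following is a minimal sketch of such a trajectory similarity calculation; equal numbers of coordinates are assumed (see the alignment described later), and the mean-distance-to-similarity mapping is an illustrative choice rather than the patent's specific metric.

```python
import numpy as np

def trajectory_similarity(traj: np.ndarray, preset_traj: np.ndarray,
                          scale: float = 10.0) -> float:
    """traj, preset_traj: (num_points, 2) feature-point position coordinates."""
    assert traj.shape == preset_traj.shape
    mean_dist = float(np.linalg.norm(traj - preset_traj, axis=1).mean())
    return 1.0 / (1.0 + mean_dist / scale)        # 1.0 means the trajectories coincide

user_traj = np.array([[0, 0], [1, 2], [2, 3], [3, 3]], dtype=float)
standard_traj = np.array([[0, 0], [1, 1], [2, 2], [3, 3]], dtype=float)
similarity = trajectory_similarity(user_traj, standard_traj)
```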
  • the preset motion trajectory of the feature points corresponding to the example text can be determined in the following manner:
  • Based on the model training data for training the video generation model, all the phonemes (or other unit-granularity information such as characters, words, and sentences) that make up the example text are determined, the articulator video feature sequence corresponding to all the phonemes is determined, and the standard articulator action video of the example text is generated based on that articulator video feature sequence.
  • the position coordinates of the feature points of the vocal organs are determined in each frame of video images of the standard motion video of the vocal organs, and the preset motion trajectories of the feature points corresponding to the example text are obtained.
  • multiple groups of phoneme sequences that form the example text can be determined, and based on the multiple sets of phoneme sequences that make up the example text, a plurality of preset motion trajectories of feature points can be determined.
  • In this way, comprehensive and more accurate preset feature point motion trajectories can be obtained.
  • the pronunciation score of the audio to be evaluated is determined according to the similarity value in the similarity information.
  • the pronunciation score of the audio to be evaluated is used as the pronunciation evaluation result.
  • the pronunciation level of the audio to be evaluated may be determined as excellent, medium, qualified, unqualified, or missing pronunciation.
  • the pronunciation level of the audio to be evaluated, such as excellent, medium, qualified, unqualified, or missing pronunciation, is used as the pronunciation evaluation result.
  • the audio to be evaluated in which the user reads the example text aloud can be input into the video generation model, and the user's pronunciation organ action video can be fitted and restored.
  • the position coordinates of the feature points of the vocal organs are determined in each frame of video images of the vocal organs action video, and the movement trajectory of the feature points of the vocal organs is obtained.
  • the similarity calculation is performed between the articulator feature point motion trajectory and the standard preset feature point motion trajectory corresponding to the example text, so as to obtain the pronunciation action similarity information of the articulators.
  • the pronunciation evaluation result can be obtained based on the pronunciation action similarity information of the pronunciation organs. Since pronunciation is directly related to the movements of the vocal organs, the pronunciation evaluation results obtained in this way are more accurate.
  • Generating the pronunciation evaluation result of the audio to be evaluated according to the similarity information may further include the following steps:
  • a more accurate pronunciation evaluation result can be determined by further combining the similarity information determined based on the motion trajectory of the articulator feature points.
  • This method further improves the accuracy of pronunciation evaluation results.
  • the duration of the audio to be evaluated is related to the speed of the user's pronunciation, that is, the duration of the audio to be evaluated is variable.
  • If the duration of the audio to be evaluated differs, the number of frames of the audio to be evaluated differs, and when audios to be evaluated with different durations are input into the video generation model, the durations of the resulting articulator action videos also differ.
  • If the duration of the articulator action video differs, the number of video image frames included in the articulator action video differs.
  • the present disclosure provides the following two implementations to avoid the problem of large errors in the similarity information obtained by calculation.
  • In an achievable embodiment, before the similarity calculation between the articulator feature point motion trajectory and the preset feature point motion trajectory corresponding to the example text is performed to obtain the similarity information, the number of feature point position coordinates of the articulator feature point motion trajectory is adjusted according to the number of feature point position coordinates that constitute the preset feature point motion trajectory, so that the number of feature point position coordinates of the preset feature point motion trajectory is the same as the number of feature point position coordinates of the articulator feature point motion trajectory.
  • the number of feature point position coordinates of the preset motion trajectory of the feature point is 5, which are coordinates A, B, C, D, and E, respectively.
  • the number of feature point position coordinates of the feature point motion trajectory is 4, which are coordinates a, b, c, and e respectively.
  • In this case, a feature point f(0, 0) can be inserted into the feature point motion trajectory so that the number of feature point position coordinates of the feature point motion trajectory is adjusted to match.
  • the insertion position of the feature point f(0, 0) can be determined according to the position of the missing phoneme in the audio to be evaluated.
  • Another achievable implementation is that, before the position coordinates of the articulator feature points are determined in each frame of the articulator action video, the number of video image frames in the articulator action video is adjusted according to the number of video image frames in the standard articulator action video corresponding to the example text, so that the number of video image frames in the articulator action video is the same as the number of video image frames in the standard articulator action video.
  • the number of frames of the video image in the standard action video of the vocal organ is 5 frames, which are 1, 2, 3, 4, and 5 frames respectively.
  • the number of frames of the video image in the speech organ action video of the audio to be evaluated is 3 frames, which are 1, 4, and 5 frames respectively.
  • frame interpolation processing can be performed on the video image frame sequences 1, 4, and 5.
  • image frames 1 and 4 can be inserted into the current video image frame sequence 1, 4, 5, so that the obtained video image frame sequence is 1, 1, 4, 4, 5.
  • Alternatively, blank image frames 0 can be inserted into the current video image frame sequence 1, 4, 5 to obtain the video image frame sequence 1, 0, 0, 4, 5.
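  • The following is a minimal sketch of matching the frame count of the user's articulator action video to the standard action video by frame interpolation; the choice of which frames to duplicate is an assumption, and blank frames could be inserted instead as described above.

```python
def pad_to_frame_count(frames: list, target_count: int) -> list:
    """Duplicate existing frames until the sequence reaches the target frame count."""
    out = list(frames)
    while len(out) < target_count:
        idx = len(out) // 2               # illustrative choice of where to duplicate
        out.insert(idx, out[idx])
    return out

user_frames = ["frame1", "frame4", "frame5"]   # 3 frames in the user's articulator action video
aligned = pad_to_frame_count(user_frames, 5)   # matches the 5-frame standard action video
```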
  • In step S13, determining the position coordinates of the articulator feature points in each frame of video image of the articulator action video and obtaining the articulator feature point motion trajectory may also include the following steps:
  • the audio to be evaluated is divided according to the preset pronunciation evaluation granularity to obtain a plurality of sub-audios to be evaluated; in each frame of video image of the articulator action video, the position coordinates of the articulator feature points corresponding to each sub-audio to be evaluated are determined, and the articulator feature point motion trajectory segment corresponding to each sub-audio to be evaluated is obtained.
  • the preset pronunciation evaluation granularity is a pronunciation evaluation unit set according to user requirements.
  • the granularity of pronunciation evaluation can be phonemes, characters, words, sentences, paragraphs, articles, etc., which is not specifically limited in the present disclosure.
  • the audio to be evaluated may be divided according to the duration corresponding to the preset pronunciation evaluation granularity, thereby obtaining a plurality of sub-audios to be evaluated.
  • a motion track segment of the feature point of the vocal organ corresponding to each sub-audio to be evaluated can be obtained based on the motion video of the vocal organ. Specifically, the position coordinates of the feature points of the vocal organs corresponding to each sub-audio to be evaluated can be determined in each frame of the video image of the vocal organ action video, and the motion trajectory of the vocal organ feature points corresponding to each sub-audio to be evaluated can be obtained. Fragment.
  • Alternatively, the articulator feature point motion trajectory of the entire audio to be evaluated can be divided in the same way in which the sub-audios to be evaluated were obtained, so as to obtain the articulator feature point motion trajectory segment corresponding to each sub-audio to be evaluated.
  • the similarity calculation is performed between the articulator feature point motion trajectory segment of each sub-audio to be evaluated and the corresponding preset feature point motion trajectory segment to obtain a first similarity value corresponding to that sub-audio to be evaluated, and the similarity information includes the first similarity value of each sub-audio to be evaluated.
  • the feature point preset motion track segment is a track segment in the complete feature point preset motion track.
  • the method of obtaining the feature point preset motion track segment is similar to the method of dividing the feature point motion track segment of the vocal organ feature point of each sub-audio to be evaluated from the feature point motion track of the vocal organ of the entire audio to be evaluated, and will not be repeated here. .
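  • The following is a minimal sketch of cutting the full feature point motion trajectory into per-sub-audio segments; the per-segment frame counts are assumed to come from the same division used to obtain the sub-audios to be evaluated.

```python
import numpy as np

def split_trajectory_by_subaudio(trajectory: np.ndarray, frames_per_subaudio):
    """trajectory: (total_frames, 2); frames_per_subaudio: frame count per sub-audio."""
    segments, start = [], 0
    for count in frames_per_subaudio:
        segments.append(trajectory[start:start + count])   # one trajectory segment per sub-audio
        start += count
    return segments

trajectory = np.random.rand(10, 2)
segments = split_trajectory_by_subaudio(trajectory, [3, 3, 4])   # e.g. one segment per word
```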
  • As shown in FIG. 2, the method for locating which phoneme or which character/word in the audio to be evaluated is inaccurately pronounced includes steps S21-S28.
  • the preset threshold may be preset values such as 90% and 98%. In the case that the first similarity value is smaller than the preset threshold, it is determined that the target sub-audio to be evaluated corresponding to the first similarity is inaccurate in pronunciation.
  • the magnitude of the first similarity value is used to represent the similarity between the target sub-audio to be evaluated and the standard pronunciation corresponding to the target sub-audio to be evaluated.
  • the target example sentence text segment corresponding to the target sub-audio to be evaluated can be determined.
  • the target example sentence text segment may include one or more phonemes/characters/words/sentences, etc.
  • the standard action video of the vocal organ corresponding to the part with the wrong pronunciation and the preset motion trajectory segment of the standard feature point of the vocal organ are displayed to the user.
  • the inaccurate articulation organ feature point motion track segment is displayed to the user, so that the user can know which pronunciation is inaccurate and where the difference from the standard pronunciation is.
  • the articulator action video in the embodiments of the present disclosure includes the action of at least one organ among the upper lip, lower lip, upper teeth, lower teeth, gums, hard palate, soft palate, uvula, tongue tip, tongue surface, tongue base, nasal cavity, oral cavity, pharynx, epiglottis, esophagus, trachea, vocal cords, and larynx, and the articulator feature point motion trajectory (or feature point motion trajectory segment) includes the feature point motion trajectory (or feature point motion trajectory segment) of any articulator in the articulator action video.
  • the feature point motion trajectory (or feature point motion trajectory segment) of each articulator can undergo a similarity calculation with the preset feature point motion trajectory (or preset feature point motion trajectory segment) corresponding to that organ under the example text to obtain a second similarity value, and the second similarity value represents the degree of similarity between the feature point motion trajectory (or feature point motion trajectory segment) of an articulator and the standard preset feature point motion trajectory (or preset feature point motion trajectory segment) of that articulator.
  • the target second similarity value that is smaller than the threshold may be determined, and the target articulator may be determined according to the target second similarity value. In this way, it can be determined which specific one or several of the multiple articulators performed an incorrect pronunciation action for the example text (or example text segment).
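  • The following is a minimal sketch of locating the target articulator by computing a per-organ second similarity value and flagging organs below a threshold; the organ dictionary, threshold, and similarity mapping are illustrative assumptions.

```python
import numpy as np

THRESHOLD = 0.9   # assumed second-similarity threshold

def find_target_articulators(user_trajs: dict, preset_trajs: dict) -> list:
    """user_trajs / preset_trajs: organ name -> (num_points, 2) trajectory."""
    targets = []
    for organ, traj in user_trajs.items():
        dist = float(np.linalg.norm(traj - preset_trajs[organ], axis=1).mean())
        second_similarity = 1.0 / (1.0 + dist)
        if second_similarity < THRESHOLD:
            targets.append(organ)                  # articulator whose action is incorrect
    return targets

user = {"tongue tip": np.random.rand(8, 2), "upper lip": np.random.rand(8, 2)}
preset = {"tongue tip": np.random.rand(8, 2), "upper lip": np.random.rand(8, 2)}
problem_organs = find_target_articulators(user, preset)
```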
  • the voice organ action video is a magnetic resonance imaging MRI video.
  • the sample articulator action video used for training the video generation model is also a magnetic resonance imaging (MRI) video, and the sample articulator action video includes the action of at least one articulator among the upper lip, lower lip, upper teeth, lower teeth, gums, hard palate, soft palate, uvula, tongue tip, tongue surface, tongue base, nasal cavity, oral cavity, pharynx, epiglottis, esophagus, trachea, vocal cords, and larynx.
  • the speech organs also include speech power organs such as the lungs, the diaphragm, and the trachea
  • the speech organs action video and the sample speech organ action video may also include the action of at least one speech organ among the lungs, the diaphragm, and the trachea.
  • If the action difference information indicates that the position of the upper jaw in the user's articulator action video is lower than the position of the upper jaw in the standard articulator action video, the corresponding target action suggestion information "raise the upper jaw" can be matched; if the action difference information indicates that the position of the tongue in the user's articulator action video is further back than the position of the tongue in the standard articulator action video, the corresponding target action suggestion information "protrude the tongue" can be matched.
  • the displayed pronunciation evaluation information can be at least one of the user's pronunciation scoring information, the pronunciation action suggestion information, or the comparison video of the articulator action video and the standard articulator action video; these can be displayed individually, in combination, or all three at the same time.
  • the articulator action video or the standard articulator action video can be rendered frame by frame through an animation generation model to obtain an articulator animation video, and the articulator animation video can be presented as the articulator action video or the standard articulator action video.
  • the training samples of the animation generation model include a plurality of MRI sample images and an animated articulator map corresponding to each MRI sample image. The training samples of the animation generation model are obtained by determining the position of the articulator in each MRI sample image and, at the position of the articulator in each MRI sample image, generating an animated articulator corresponding to that position to obtain the animated articulator map.
  • MRI video is composed of multiple video frames.
  • the animation frames can be recombined in the order of the video frames to obtain the animation video corresponding to the video frames.
  • the animation generation model can be any machine learning model that can learn samples, such as an adversarial generation network model, a recurrent neural network model, a convolutional network model, etc., which is not limited in the present disclosure.
  • the training samples of the model include multiple MRI sample images and animated organ maps corresponding to each MRI sample image.
  • the animation generation model can generate corresponding animation images based on the input MRI images, so that the MRI video frames can be converted into the corresponding animation images. Effects converted to animation frames.
  • the animation generation model can sequentially output the animation frames corresponding to the video frames in the order in which the video frames are input, wherein the positions of the vocal organs in the animation frames are filled by the animation vocal organs, which is convenient for users to view and understand.
  • different colors can be filled for each animated vocal organ according to different vocal organs, and the name of the organ can also be marked on the animated vocal organ.
  • For example, the upper jaw position can be filled with light yellow and marked with the text "upper jaw", the tongue position can be filled with bright red and marked with the text "tongue", and the teeth position can be filled with white and marked with the text "tooth", so that the position and connection relationship of each organ is reflected more intuitively and is easier for users to understand.
  • The above color filling method and name labeling method are only described as examples, and the present disclosure does not limit the color filling method and name labeling method of an organ.
  • The name labeling can also be done in a foreign language, or phonetic symbols and pinyin can be added for pronunciation.
  • the animation frames are reorganized according to the sequence of the video frames, and a complete animation video can be obtained.
  • the playback speed of the animation frames can be consistent with the video frames, or the playback speed of the animation frames can be adjusted according to the application requirements. For example, when the animation video is applied in an education scenario, in order to show the movement mode and force of the articulators more clearly, the playback speed of the animation frames can be reduced. When the playback speed of the animation frames is reduced, in order to make the animation video smoother, frames can also be supplemented between each pair of frames to increase the number of frames of the animation video.
  • the animation generation model is an adversarial generation network model
  • the animation generation model includes a generator for generating animation images based on MRI images
  • the animation generation model is obtained by training in the following manner:
  • The following is repeatedly executed: the generator generates a training animation image based on the MRI sample image; a loss value is generated based on the animated articulator map corresponding to the MRI sample image and a preset loss function; the parameters in the generator are adjusted based on the loss value; and the discriminator of the adversarial generation network model evaluates the training animation image based on the animated articulator map, until the evaluation result satisfies a preset evaluation result condition.
  • the generator is used to generate an image based on the input data, and the discriminator is used to evaluate whether the image output by the generator has the same characteristics as the images in the specified set, that is, it can be judged whether the picture is a picture in the specified set.
  • the evaluation result of the discriminator may be correct or wrong.
  • the evaluation result of the discriminator is usually correct, that is to say, the discriminator can correctly judge whether the picture is a picture in the specified set; but when the feature difference between the picture generated by the generator and the pictures in the specified set is not obvious, it is difficult for the discriminator to always correctly judge whether the picture is in the specified set.
  • the training stop condition can be set by setting the correct ratio threshold of the discriminative evaluation results, so that the images generated by the generator are more in line with the characteristics of the training target in the training set.
  • Before training the generator, the discriminator can also be pre-trained: for example, random features are input to the generator to obtain an image, the discriminator evaluates whether the image features are consistent with the animated articulator maps in the training samples, and the parameters in the discriminator are adjusted based on whether the evaluation result is correct, until the discriminator can correctly judge whether the image generated by the generator is consistent with the animated articulator maps in the training samples.
  • After the discriminator is trained, the generator can be trained using the discriminator. It is worth noting that the training of the generator and the discriminator can also be carried out synchronously, so that they constrain each other, making the images generated by the generator more consistent with the characteristics of the animated articulator maps and enabling the discriminator to evaluate the images more correctly.
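  • The following is a minimal sketch of such adversarial training, with the generator mapping an MRI frame to an animation frame and the discriminator judging whether a frame looks like an animated articulator map; the fully connected architectures, image size, and the added L1 reconstruction term are placeholders rather than the patent's specific design.

```python
import torch
import torch.nn as nn

generator = nn.Sequential(nn.Flatten(), nn.Linear(96 * 96, 256), nn.ReLU(),
                          nn.Linear(256, 96 * 96), nn.Sigmoid())
discriminator = nn.Sequential(nn.Flatten(), nn.Linear(96 * 96, 256), nn.ReLU(),
                              nn.Linear(256, 1), nn.Sigmoid())
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
bce = nn.BCELoss()

def train_step(mri_frame, animated_organ_map):
    """mri_frame, animated_organ_map: (batch, 1, 96, 96) paired training images."""
    batch = mri_frame.size(0)
    real, fake = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator: tell real animated articulator maps from generated ones.
    d_opt.zero_grad()
    generated = generator(mri_frame).view_as(animated_organ_map)
    d_loss = (bce(discriminator(animated_organ_map), real) +
              bce(discriminator(generated.detach()), fake))
    d_loss.backward()
    d_opt.step()

    # Generator: fool the discriminator while staying close to the paired target map.
    g_opt.zero_grad()
    generated = generator(mri_frame).view_as(animated_organ_map)
    g_loss = (bce(discriminator(generated), real) +
              nn.functional.l1_loss(generated, animated_organ_map))
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```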
  • the training samples are obtained by: determining the position of the articulator in each MRI sample image, and generating, at the position of the articulator in each MRI sample image, an animated articulator corresponding to that position, so as to obtain the animated articulator map.
  • the position of each organ can be distinguished by the color block areas in the MRI sample image, the position of the articulator can also be identified by a recognition model, or an organ template image can be overlaid on the MRI sample image, the regions of the organ positions in the organ template image can be mapped onto the MRI sample image, and the color block in the region where the articulator is located is used as the position of the articulator.
  • the organ contour of the MRI sample image is extracted, and the organ image corresponding to the articulating organ is filled in the organ contour of each articulating organ.
  • the organ image can be a cartoon image or a realistic image.
  • the organ map can be called from the preset flash animation library, and the organ map corresponding to the vocal organ is filled in the organ outline of each vocal organ . It is worth noting that there may be multiple organ textures for the same vocal organ in the flash animation library, and one type of organ texture can be automatically selected for filling, or the type of texture can be modified according to the user's designation.
  • For the MRI sample image corresponding to the first video frame, the organ map is called from a preset flash animation library, and the organ outline of each articulator is filled with the organ map corresponding to that articulator; for the MRI sample images corresponding to the other video frames, the organ map that was used for each articulator in the MRI sample image corresponding to the first frame is called from the flash animation library and filled into the organ outline of the corresponding articulator.
  • For example, if the "tongue 1" texture is selected for the tongue and the "tooth 3" texture is selected for the teeth, these textures fill the contour of the tongue and the contour of the teeth respectively; in the other frames, the "tongue 1" map can be automatically selected to fill the contour of the tongue, and the "tooth 3" map can be selected to fill the contour of the teeth.
  • Organ contours can be corrected frame by frame, and after the organ contour of the first frame is corrected, the organ contour can be tracked by means of feature point recognition, so as to achieve organ contour correction in other frames.
  • For the MRI sample image corresponding to the first video frame, the organ contour in the MRI sample image is adjusted based on the MRI sample image, so that the articulator contour corresponds to the feature points in the MRI sample image; for the MRI sample images corresponding to the other video frames, feature point tracking is performed between the feature points in the MRI sample image and the feature points in the previous video frame, and the organ contour in the MRI sample image is automatically adjusted based on the feature point tracking results.
  • steps S11 to S15 in this embodiment may all be performed on the user terminal.
  • steps S13 and S14 may also be performed on the server. After the audio to be evaluated is generated, the audio can be sent to the server, and after the server processes the audio, the pronunciation evaluation information is returned to the user terminal.
  • the user's pronunciation can be evaluated more accurately, which more intuitively reflects whether the user's pronunciation is accurate.
  • FIG. 3 is a block diagram of a pronunciation evaluation apparatus according to an exemplary disclosed embodiment. As shown in FIG. 3 , the pronunciation evaluation apparatus 300 includes:
  • the example sentence display module 310 is used to display the example sentence text to the user;
  • the audio collection module 320 is used to collect the audio to be evaluated that the user reads aloud based on the example text;
  • Video generation module 330 for generating the pronunciation organ action video that reflects the action of the pronunciation organ when the user reads the example sentence text;
  • the pronunciation evaluation module 340 is used for generating pronunciation evaluation information based on the pronunciation organ action video and the pronunciation organ standard action video corresponding to the example text;
  • the evaluation display module 350 is configured to display the pronunciation evaluation information to the user.
  • the pronunciation evaluation information includes at least one of the pronunciation scoring information of the user, the pronunciation action suggestion information, or the comparison video of the articulator action video and the articulator standard action video one.
  • the example sentence display module 310 is configured to generate example sentence audio based on the example sentence text; synthesize the example sentence audio and the standard action video of the pronunciation organ into an example sentence demonstration video; display the example sentence to the user Text and demo video of said example sentences.
  • the pronunciation evaluation module 340 is configured to obtain action difference information by comparing the pronunciation organ action video and the pronunciation organ standard action video corresponding to the example text; according to the action difference information Pronunciation scoring information is generated, and/or, according to the action difference information and preset pronunciation action suggestion information, target action suggestion information matching the action difference information is obtained.
  • the action difference information is difference information of movement trajectories of feature points of speech organs.
  • the pronunciation evaluation module is configured to, based on the unit text content of the example text, take the video clips representing the same unit text content in the articulator action video and the standard articulator action video as a group of video clips; align the video clips belonging to the articulator action video and the standard articulator action video in each video clip group; and splice the aligned articulator action video and standard articulator action video to obtain the comparison video.
  • the video generation module 330 is configured to convert the to-be-evaluated audio into a to-be-processed audio feature vector; input the to-be-processed audio feature vector into a video generation model to obtain the video generated The voice organ action video output by the model and corresponding to the audio to be evaluated.
  • the pronunciation evaluation apparatus 300 further includes a video generation model training module, which is configured to construct model training data according to the sample audio and the sample articulator action video corresponding to the sample audio, and to train according to the model training data to obtain the video generation model.
  • the video generation model training module is further configured to convert each frame of audio in the sample audio into a sample phoneme posterior probability vector to obtain a sample including at least one sample phoneme posterior probability vector A sequence of phoneme posterior probability vectors; based on the sample voice organ action video, extract the sample voice organ video features corresponding to each of the sample phoneme posterior probability vectors in the sample phoneme posterior probability vector sequence, to obtain a sample voice organ A video feature sequence; the sample phoneme posterior probability vector sequence and the sample vocal organ video feature sequence are used as the model training data.
  • the sample vocal organ video feature is at least one of pixel point feature information or principal component feature information of at least one frame of video image in the sample vocal organ motion video.
  • the video generation module 330 is further configured to divide the example text into a unit text sequence; input the unit text sequence into a video feature generation model to obtain a video feature sequence; and generate the standard articulator action video based on the video feature sequence; wherein the video feature generation model is obtained by training in the following manner: dividing the sample text into a sample unit text sequence; constructing model training data according to the sample unit text sequence and the sample video feature sequence of the sample articulator action video corresponding to the sample unit text sequence; and training according to the model training data to obtain the video feature generation model.
  • the voice organ action video and the voice organ standard action video are voice organ animation videos generated based on magnetic resonance imaging (MRI) videos, and the device further includes a video rendering module configured to render the voice organ action video or the voice organ standard action video frame by frame through an animation generation model to obtain a voice organ animation video.
  • the training samples of the animation generation model include a plurality of MRI sample images and an animated articulator map corresponding to each MRI sample image, and the apparatus further includes a training sample generation module configured to determine the position of the articulators in each MRI sample image, and generate, at the position of the articulators in each MRI sample image, animated articulators corresponding to that position to obtain the animated articulator map.
  • in this way, the user's pronunciation can be evaluated more accurately, which more intuitively reflects whether the user's pronunciation is accurate.
  • Referring to FIG. 4, it shows a schematic structural diagram of an electronic device (e.g., a user equipment or a server) 400 suitable for implementing an embodiment of the present disclosure.
  • Terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle-mounted terminals (e.g., in-vehicle navigation terminals), as well as stationary terminals such as digital TVs and desktop computers.
  • the electronic device shown in FIG. 4 is only an example, and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.
  • an electronic device 400 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 401, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage device 408 into a random access memory (RAM) 403. The RAM 403 also stores various programs and data required for the operation of the electronic device 400.
  • the processing device 401, the ROM 402, and the RAM 403 are connected to each other through a bus 404.
  • An input/output (I/O) interface 405 is also connected to bus 404 .
  • The following devices may be connected to the I/O interface 405: an input device 406 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; an output device 407 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; a storage device 408 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 409. The communication device 409 may allow the electronic device 400 to communicate wirelessly or by wire with other devices to exchange data.
  • Although FIG. 4 shows the electronic device 400 having various devices, it should be understood that not all of the illustrated devices are required to be implemented or provided; more or fewer devices may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via the communication device 409, or from the storage device 408, or from the ROM 402.
  • When the computer program is executed by the processing device 401, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are performed.
  • the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • the computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to, an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with computer-readable program code embodied thereon. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device .
  • Program code embodied on a computer readable medium may be transmitted using any suitable medium including, but not limited to, electrical wire, optical fiber cable, RF (radio frequency), etc., or any suitable combination of the foregoing.
  • the user terminal and the server may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network).
  • Examples of communication networks include local area networks ("LAN"), wide area networks ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; or may exist alone without being assembled into the electronic device.
  • the above-mentioned computer-readable medium carries one or more programs, and when the above-mentioned one or more programs are executed by the electronic device, the electronic device: acquires at least two Internet Protocol addresses; sends a node evaluation request including the at least two Internet Protocol addresses to a node evaluation device, wherein the node evaluation device selects an Internet Protocol address from the at least two Internet Protocol addresses and returns it; and receives the Internet Protocol address returned by the node evaluation device; wherein the acquired Internet Protocol address indicates an edge node in a content distribution network.
  • the above computer-readable medium carries one or more programs, and when the above one or more programs are executed by the electronic device, the electronic device: receives a node evaluation request including at least two Internet Protocol addresses; selects an Internet Protocol address from the at least two Internet Protocol addresses; and returns the selected Internet Protocol address; wherein the received Internet Protocol address indicates an edge node in a content distribution network.
  • Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., through the Internet using an Internet service provider).
  • each block in the flowcharts or block diagrams may represent a module, program segment, or portion of code, which contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by dedicated hardware-based systems that perform the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • the modules involved in the embodiments of the present disclosure may be implemented in software or hardware. In some cases, the name of a module does not constitute a limitation on the module itself; for example, the first acquisition module may also be described as "a module for acquiring at least two Internet Protocol addresses".
  • exemplary types of hardware logic components include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chips (SOCs), Complex Programmable Logical Devices (CPLDs) and more.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with the instruction execution system, apparatus or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing.
  • More specific examples of machine-readable storage media would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, compact disk read-only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
  • Example 1 provides a pronunciation evaluation method, the method including: displaying example sentence text to a user; collecting audio to be evaluated read aloud by the user based on the example sentence text; generating, based on the audio to be evaluated, an articulator action video reflecting the actions of the user's articulators when reading the example sentence text aloud; generating pronunciation evaluation information based on the articulator action video and an articulator standard action video corresponding to the example sentence text; and displaying the pronunciation evaluation information to the user.
  • Example 2 provides the method of Example 1, wherein the pronunciation evaluation information includes at least one of pronunciation scoring information for the user, pronunciation action suggestion information, or a comparison video of the articulator action video and the articulator standard action video.
  • Example 3 provides the method of Example 1, wherein presenting the example sentence text to the user includes: generating example sentence audio based on the example sentence text; synthesizing the example sentence audio and the articulator standard action video into an example sentence demonstration video; and presenting the example sentence text and the example sentence demonstration video to the user.
  • Example 4 provides the method of Example 2, wherein, in a case that the pronunciation evaluation information includes the pronunciation scoring information and/or the pronunciation action suggestion information, generating the pronunciation evaluation information based on the articulator action video and the articulator standard action video corresponding to the example sentence text includes: obtaining action difference information by comparing the articulator action video with the articulator standard action video corresponding to the example sentence text; and generating pronunciation scoring information according to the action difference information, and/or obtaining, according to the action difference information and preset pronunciation action suggestion information, target action suggestion information matching the action difference information.
  • Example 5 provides the method of Example 4, where the action difference information is difference information of the movement trajectories of the feature points of the vocal organs.
  • Example 6 provides the method of Example 2, wherein the comparison video is generated by: taking, based on the unit text content of the example sentence text, the video clips in the articulator action video and the articulator standard action video that represent the same unit text content as one video clip group; aligning, in each video clip group, the video clips belonging to the articulator action video and to the articulator standard action video; and splicing the aligned articulator action video and articulator standard action video to obtain the comparison video.
  • Example 7 provides the method of Example 1, wherein generating the articulator action video reflecting the actions of the articulators when the user reads the example sentence text aloud includes: converting the audio to be evaluated into an audio feature vector to be processed; and inputting the audio feature vector to be processed into a video generation model to obtain the articulator action video output by the video generation model and corresponding to the audio to be evaluated.
  • Example 8 provides the method of Example 7, further including: constructing model training data according to sample audio and a sample articulator action video corresponding to the sample audio; and training the video generation model according to the model training data.
  • Example 9 provides the method of Example 8, wherein constructing the model training data according to the sample audio and the sample articulator action video corresponding to the sample audio includes: converting each frame of audio in the sample audio into a sample phoneme posterior probability vector to obtain a sample phoneme posterior probability vector sequence including at least one sample phoneme posterior probability vector; extracting, based on the sample articulator action video, the sample articulator video feature corresponding to each sample phoneme posterior probability vector in the sample phoneme posterior probability vector sequence to obtain a sample articulator video feature sequence; and using the sample phoneme posterior probability vector sequence and the sample articulator video feature sequence as the model training data.
  • Example 10 provides the method of Example 9, wherein the sample articulator video feature is at least one of pixel point feature information or principal component feature information of at least one frame of video image in the sample articulator action video.
  • Example 11 provides the method of Example 1, wherein the articulator standard action video is generated by: splitting the example sentence text into a unit text sequence; inputting the unit text sequence into a video feature generation model to obtain a video feature sequence; and generating the articulator standard action video based on the video feature sequence; wherein the video feature generation model is obtained by training in the following manner: splitting sample text into a sample unit text sequence; constructing model training data according to the sample unit text sequence and a sample video feature sequence of a sample articulator action video corresponding to the sample unit text sequence; and training the video feature generation model according to the model training data.
  • Example 12 provides the method of any one of Examples 1-11, wherein the articulator action video and the articulator standard action video are articulator animation videos generated based on magnetic resonance imaging (MRI) videos, and the method further includes: rendering the articulator action video or the articulator standard action video frame by frame through an animation generation model to obtain an articulator animation video.
  • Example 13 provides the method of Example 12, wherein the training samples of the animation generation model include a plurality of MRI sample images and an animated articulator map corresponding to each MRI sample image, and the method further includes: determining the position of the articulators in each MRI sample image; and generating, at the position of the articulators in each MRI sample image, animated articulators corresponding to that position to obtain the animated articulator map.
  • Example 14 provides a pronunciation evaluation apparatus, the apparatus including: an example sentence display module configured to display example sentence text to a user; an audio collection module configured to collect audio to be evaluated read aloud by the user based on the example sentence text; a video generation module configured to generate an articulator action video reflecting the actions of the user's articulators when reading the example sentence text aloud; a pronunciation evaluation module configured to generate pronunciation evaluation information based on the articulator action video and an articulator standard action video corresponding to the example sentence text; and an evaluation display module configured to display the pronunciation evaluation information to the user.
  • Example 15 provides the apparatus of Example 14, wherein the pronunciation evaluation information includes at least one of pronunciation scoring information for the user, pronunciation action suggestion information, or a comparison video of the articulator action video and the articulator standard action video.
  • Example 16 provides the apparatus of Example 14, wherein the example sentence display module is configured to generate example sentence audio based on the example sentence text, synthesize the example sentence audio and the articulator standard action video into an example sentence demonstration video, and present the example sentence text and the example sentence demonstration video to the user.
  • Example 17 provides the apparatus of Example 10, wherein the pronunciation evaluation module is configured to obtain action difference information by comparing the pronunciation organ action video with the pronunciation organ standard action video corresponding to the example sentence text; and generate pronunciation scoring information according to the action difference information, and/or obtain, by matching against preset pronunciation action suggestion information according to the action difference information, target action suggestion information that matches the action difference information.
  • Example 18 provides the apparatus of Example 15, wherein the motion difference information is difference information of the movement trajectory of the feature points of the speech organ.
  • Example 19 provides the apparatus of Example 15, wherein the pronunciation evaluation module is further configured to, based on the unit text content of the example sentence text, take the video clips in the articulator action video and the articulator standard action video that represent the same unit text content as one video clip group; align, in each video clip group, the video clips belonging to the articulator action video and to the articulator standard action video; and splice the aligned articulator action video and articulator standard action video to obtain the comparison video.
  • Example 20 provides the apparatus of Example 14, wherein the video generation module is configured to convert the audio to be evaluated into an audio feature vector to be processed, and input the audio feature vector to be processed into a video generation model to obtain the articulator action video output by the video generation model and corresponding to the audio to be evaluated.
  • Example 21 provides the apparatus of Example 20, wherein the pronunciation evaluation apparatus further includes a video generation model training module configured to construct model training data according to sample audio and a sample articulator action video corresponding to the sample audio, and to train the video generation model according to the model training data.
  • Example 22 provides the apparatus of Example 21, wherein the video generation model training module is further configured to convert each frame of audio in the sample audio into a sample phoneme posterior probability vector to obtain a sample phoneme posterior probability vector sequence including at least one sample phoneme posterior probability vector; extract, based on the sample articulator action video, the sample articulator video feature corresponding to each sample phoneme posterior probability vector in the sample phoneme posterior probability vector sequence to obtain a sample articulator video feature sequence; and use the sample phoneme posterior probability vector sequence and the sample articulator video feature sequence as the model training data.
  • Example 23 provides the apparatus of Example 22, wherein the sample articulator video feature is at least one of pixel point feature information or principal component feature information of at least one frame of video image in the sample articulator action video.
  • Example 24 provides the apparatus of Example 14, wherein the video generation module is further configured to split the example sentence text into a unit text sequence; input the unit text sequence into a video feature generation model to obtain a video feature sequence; and generate the articulator standard action video based on the video feature sequence; wherein the video feature generation model is obtained by training in the following manner: splitting sample text into a sample unit text sequence; constructing model training data according to the sample unit text sequence and a sample video feature sequence of a sample articulator action video corresponding to the sample unit text sequence; and training the video feature generation model according to the model training data.
  • Example 25 provides the apparatus of any one of Examples 14-24, wherein the articulator action video and the articulator standard action video are articulator animation videos generated based on magnetic resonance imaging (MRI) videos, and the apparatus further includes a video rendering module configured to render the articulator action video or the articulator standard action video frame by frame through an animation generation model to obtain an articulator animation video.
  • Example 26 provides the apparatus of Example 25, wherein the training samples of the animation generation model include a plurality of MRI sample images and an animated articulator map corresponding to each MRI sample image, and the training samples of the animation generation model are obtained in the following manner: determining the position of the articulators in each MRI sample image; and generating, at the position of the articulators in each MRI sample image, animated articulators corresponding to that position to obtain the animated articulator map.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The present disclosure relates to a pronunciation assessment method and apparatus, a storage medium, and an electronic device. The method comprises: displaying an example sentence text to a user; capturing audio to be assessed of the user reading aloud on the basis of the example sentence text; generating a vocal organ movement video reflecting movement of a vocal organ when the user reads aloud the example sentence text; generating pronunciation assessment information on the basis of the vocal organ movement video and a vocal organ standard movement video corresponding to the example sentence text; and displaying the pronunciation assessment information to the user. The present disclosure can accurately assess the pronunciation of a user, and intuitively represent whether the pronunciation of the user is accurate.

Description

Pronunciation evaluation method and apparatus, storage medium, and electronic device
CROSS-REFERENCE TO RELATED APPLICATIONS
This application is based on, and claims priority to, Chinese Patent Application No. 202110298227.9, filed on March 19, 2021, the disclosure of which is incorporated herein by reference in its entirety.
TECHNICAL FIELD
The present disclosure relates to the field of education, and in particular, to a pronunciation evaluation method and apparatus, a storage medium, and an electronic device.
BACKGROUND
When learning pronunciation, users can usually only imitate the pronunciation they hear or the way other people's lips move. It is difficult for users to observe the specific movements of other people's articulators, and it is therefore difficult for them to judge their own pronunciation correctly, which hinders pronunciation learning.
SUMMARY
This Summary is provided to introduce concepts in a simplified form that are described in detail in the Detailed Description that follows. This Summary is not intended to identify key features or essential features of the claimed technical solution, nor is it intended to be used to limit the scope of the claimed technical solution.
In a first aspect, the present disclosure provides a pronunciation evaluation method, including: displaying example sentence text to a user; collecting audio to be evaluated read aloud by the user based on the example sentence text; generating an articulator action video based on the audio to be evaluated; generating pronunciation evaluation information based on the articulator action video and an articulator standard action video corresponding to the example sentence text; and displaying the pronunciation evaluation information to the user.
In a second aspect, the present disclosure provides a pronunciation evaluation apparatus, including: an example sentence display module configured to display example sentence text to a user; an audio collection module configured to collect audio to be evaluated read aloud by the user based on the example sentence text; a video generation module configured to generate an articulator action video based on the audio to be evaluated; a pronunciation evaluation module configured to generate pronunciation evaluation information based on the articulator action video and an articulator standard action video corresponding to the example sentence text; and an evaluation display module configured to display the pronunciation evaluation information to the user.
In a third aspect, the present disclosure provides a non-transitory computer-readable medium on which a computer program is stored, where the program, when executed by a processing device, implements the steps of the method described in the first aspect of the present disclosure.
In a fourth aspect, the present disclosure provides an electronic device, including a storage device and a processing device, where a computer program is stored on the storage device, and the processing device is configured to execute the computer program in the storage device to implement the steps of the method described in the first aspect of the present disclosure.
In a fifth aspect, the present disclosure provides a computer program, including instructions that, when executed by a processor, cause the processor to perform the steps of the method described in the first aspect of the present disclosure.
In a sixth aspect, the present disclosure provides a computer program product, including instructions that, when executed by a processor, cause the processor to perform the steps of the method described in the first aspect of the present disclosure.
Other features and advantages of the present disclosure will be described in detail in the Detailed Description that follows.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent with reference to the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flowchart of a pronunciation evaluation method according to an exemplary disclosed embodiment.
FIG. 2 is a flowchart of another pronunciation evaluation method according to an exemplary disclosed embodiment of the present disclosure.
FIG. 3 is a block diagram of a pronunciation evaluation apparatus according to an exemplary disclosed embodiment of the present disclosure.
FIG. 4 is a block diagram of an electronic device according to an exemplary disclosed embodiment of the present disclosure.
DETAILED DESCRIPTION
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for exemplary purposes only and are not intended to limit the scope of protection of the present disclosure.
It should be understood that the steps described in the method embodiments of the present disclosure may be performed in different orders and/or in parallel. Furthermore, method embodiments may include additional steps and/or omit some of the illustrated steps. The scope of the present disclosure is not limited in this regard.
As used herein, the term "including" and variations thereof are open-ended inclusions, i.e., "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms will be given in the description below.
It should be noted that concepts such as "first" and "second" mentioned in the present disclosure are only used to distinguish different devices, modules, or units, and are not used to limit the order of, or interdependence between, the functions performed by these devices, modules, or units.
It should be noted that the modifiers "a/an" and "a plurality of" mentioned in the present disclosure are illustrative rather than restrictive, and those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be understood as "one or more".
The names of messages or information exchanged between multiple devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of these messages or information.
FIG. 1 is a flowchart of a pronunciation evaluation method according to an exemplary disclosed embodiment. As shown in FIG. 1, the method includes steps S11 to S15.
S11: Display the example sentence text to the user.
The example sentence text may be text of any length, such as a phrase, a sentence, a paragraph, or an article; the example sentence text may also refer to a clause obtained after a longer text is split into sentences.
For example, in a scenario where the user is learning pronunciation, if the user wants to test and practice pronunciation, the example sentence text may be presented to the user in text form so that the user can take a pronunciation test. If the user wants to learn pronunciation, the example sentence text may be presented to the user in audio form so that the user can read along. The present disclosure is also not limited to presenting the example sentence text to the user in the form of text and audio together.
The example sentence text may be displayed in text form through a display device of the user terminal, and may also be presented in speech form through a playback device of the user terminal, where the speech corresponding to the example sentence text may be stored in advance, or the text may be converted into speech for direct use when the speech needs to be presented.
The user terminal may include any device with a display function, such as a mobile phone, a computer, a learning machine, or a wearable device.
In a possible implementation, example sentence audio is generated based on the example sentence text, the audio and the articulator standard action video are synthesized into an example sentence demonstration video, and the example sentence text and the example sentence demonstration video are presented to the user.
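As a purely illustrative sketch of this synthesis step (not part of the original disclosure), the generated example sentence audio could be muxed onto the standard action video with an off-the-shelf tool; the file names and the moviepy 1.x-style API below are assumptions.

```python
# Sketch: combine generated example-sentence audio with the articulator standard
# action video to obtain an example-sentence demonstration video.
from moviepy.editor import VideoFileClip, AudioFileClip

def make_demo_video(standard_video_path: str, example_audio_path: str, out_path: str) -> None:
    video = VideoFileClip(standard_video_path)   # articulator standard action video
    audio = AudioFileClip(example_audio_path)    # synthesized audio for the example sentence
    demo = video.set_audio(audio)                # attach the audio track to the video
    demo.write_videofile(out_path, codec="libx264", audio_codec="aac")

# make_demo_video("standard_action.mp4", "example_sentence.wav", "example_demo.mp4")
```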
The articulator standard action video is generated based on the example sentence text, and the video features may be generated by a pre-trained video feature generation model. The example sentence text is split into a unit text sequence, the unit text sequence is input into the video feature generation model to obtain a video feature sequence, and the articulator standard action video is generated based on the video feature sequence.
A unit text sequence is a sequence obtained by splitting the example sentence text into small units used for generating the video and arranging them in order. In the present disclosure, a unit text may be a phoneme, a word, a single character, and so on. Splitting the example sentence text yields finer-grained model inputs, so that the model can generate an accurate video feature sequence more efficiently based on the unit texts. For example, when the example sentence text is "How are you", the example sentence text may be split, with words as the splitting unit, into the unit text sequence "how", "are", "you"; it may also be split, with phonemes as the splitting unit, into the unit text sequence of the corresponding phonemes (shown in the original publication as the image Figure PCTCN2022080357-appb-000001).
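A minimal sketch of this splitting step follows; it is not the disclosed implementation, and the tiny pronunciation dictionary stands in for whatever phoneme inventory or grapheme-to-phoneme tool is actually used.

```python
# Sketch: split example sentence text into a unit text sequence,
# either word by word or phoneme by phoneme.
TOY_LEXICON = {
    "how": ["HH", "AW"],
    "are": ["AA", "R"],
    "you": ["Y", "UW"],
}  # purely illustrative grapheme-to-phoneme lookup

def split_into_units(text: str, unit: str = "word") -> list:
    words = text.lower().split()
    if unit == "word":
        return words
    if unit == "phoneme":
        units = []
        for w in words:
            units.extend(TOY_LEXICON.get(w, [w]))  # fall back to the word if unknown
        return units
    raise ValueError("unsupported unit: " + unit)

print(split_into_units("How are you"))              # ['how', 'are', 'you']
print(split_into_units("How are you", "phoneme"))   # ['HH', 'AW', 'AA', 'R', 'Y', 'UW']
```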
The video feature generation model is obtained by training in the following manner:
The sample text is split into a sample unit text sequence, model training data is constructed according to the sample unit text sequence and a sample video feature sequence of the sample articulator action video corresponding to the sample unit text sequence, and the video feature generation model is obtained by training on the model training data.
The sample articulator action video is a demonstration video produced or recorded based on the sample text. The demonstration video may be an animated demonstration video of the oral cavity produced with any animation production and rendering software, or a video of a person's head captured by a magnetic resonance imaging (MRI) scanner while the person reads the sample text aloud.
By extracting the video features of the sample articulator action video frame by frame or at sampled frames, feature information of multiple image frames of the sample articulator action video can be obtained, and arranging the video feature information in the order of the video frames yields the sample video feature sequence. It is worth noting that the present disclosure does not limit the form of the feature information of an image frame; any form of feature information that can be extracted and restored to an image through processing can serve as the feature information in the video feature sequence of the present disclosure.
In a possible implementation, the feature information is principal component information. Principal component analysis is performed on the sample articulator action video frame by frame to obtain the principal component information of each video frame, and the principal component information of each video frame is arranged in video frame order to obtain the sample video feature sequence. By restoring the principal component information, restored images can be obtained; arranging and synthesizing the restored images in the order of the sample video feature sequence yields a restored demonstration video. The sample unit text sequence and the corresponding sample video feature sequence are used as training samples to train the video feature generation model, so that the video feature generation model can generate the corresponding video feature or video feature sequence based on any unit text.
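The principal component route can be sketched as follows with scikit-learn; the number of components, the frame size, and the use of flattened grayscale frames are assumptions for illustration only.

```python
# Sketch: per-frame principal component features for an articulator action video,
# plus restoration of approximate frames from those features.
import numpy as np
from sklearn.decomposition import PCA

def video_to_pca_features(frames, n_components=64):
    """frames: (num_frames, height, width) grayscale video frames."""
    num_frames, h, w = frames.shape
    flat = frames.reshape(num_frames, h * w).astype(np.float64)
    pca = PCA(n_components=n_components)
    features = pca.fit_transform(flat)   # sample video feature sequence, one row per frame
    return pca, features

def pca_features_to_video(pca, features, frame_shape):
    restored = pca.inverse_transform(features)   # approximate (restored) frames
    return restored.reshape((-1,) + tuple(frame_shape))

# frames = np.random.rand(120, 96, 96)            # stand-in for an MRI clip
# pca, feats = video_to_pca_features(frames)
# restored = pca_features_to_video(pca, feats, (96, 96))
```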
The video feature generation model may be a deep learning model. Training samples input to the deep learning model are generated by labeling each sample unit text in the sample unit text sequence, and after multiple rounds of iterative training, the deep learning model can accurately generate video features based on unit texts.
The video feature generation model may also be an attention model. The video feature generation model includes an encoder and a decoder, where the encoder is used to generate an encoding result based on the unit text sequence, and the decoder is used to generate a video feature sequence based on the encoding result. The encoder and the decoder are trained end to end, from unit text sequences to video feature sequences, so that the attention model can accurately generate video feature sequences based on unit text sequences. It is worth noting that, when the demonstration video to be generated is a magnetic resonance imaging (MRI) video, considering that recording MRI video is costly and that recording a longer video in one session can reduce the recording cost, the sample articulator action video may be obtained by segmenting a complete sample demonstration video, and, correspondingly, the sample text is also obtained by segmenting the complete sample demonstration text.
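The shape of such an attention-based text-to-video-feature model might resemble the following sketch; the dimensions, the self-attention encoder, and the simplification of producing one video feature per unit text are assumptions, not the disclosed architecture.

```python
# Sketch: an attention-based model mapping a unit-text (e.g., phoneme) ID sequence
# to a sequence of video feature vectors, trainable end to end with an L2 loss.
import torch
import torch.nn as nn

class TextToVideoFeatures(nn.Module):
    def __init__(self, vocab_size, d_model=128, feat_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)   # self-attention encoder
        self.proj = nn.Linear(d_model, feat_dim)                    # projection to video feature space

    def forward(self, unit_ids):
        # unit_ids: (batch, seq_len) integer IDs of the unit texts
        return self.proj(self.encoder(self.embed(unit_ids)))        # (batch, seq_len, feat_dim)

# model = TextToVideoFeatures(vocab_size=60)
# loss = nn.functional.mse_loss(model(unit_ids), sample_video_features)
```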
The sample demonstration video may be an MRI video, captured by a magnetic resonance imaging apparatus, of the user reciting the sample text. Multiple sample texts are obtained by splitting the sample demonstration text into sentences, and the sample demonstration video is segmented, based on the sentence-splitting result, into sub-videos corresponding to each sample text, thereby obtaining multiple sample articulator action videos.
In a possible implementation, the sample demonstration text is split into sentences to obtain multiple sample texts; speech recognition is performed on the sample speech recorded synchronously with the sample demonstration video, and the speech segment corresponding to each sample text is determined based on the speech recognition result; and the sample articulator action video corresponding to each speech segment is determined from the sample demonstration video based on the time axis information of each speech segment. For example, by splitting the sample demonstration text "How are you? I'm fine thank you, and you?" into sentences, the four clauses "How are you", "I'm fine", "thank you", and "and you" can be obtained. By recognizing the 6-second sample speech, the time axis information of the speech segment corresponding to "How are you" can be obtained as 00:00:00 to 00:01:40, that of "I'm fine" as 00:01:40 to 00:02:50, that of "thank you" as 00:02:50 to 00:04:40, and that of "and you" as 00:04:40 to 00:06:00. The 6-second sample demonstration video can then be divided according to this time axis information into four video segments, 00:00:00 to 00:01:40, 00:01:40 to 00:02:50, 00:02:50 to 00:04:40, and 00:04:40 to 00:06:00, where each video segment is the sample demonstration sub-video of its corresponding sample clause text. The above sentence-splitting manner is shown only as an example; those skilled in the art may split sentences in other ways, and the present disclosure does not limit this.
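A minimal sketch of cutting the demonstration video by these recognition-derived timestamps is shown below; the frame rate, the in-memory frame array, and the second-based timestamps are assumptions.

```python
# Sketch: cut a sample demonstration video into per-clause sub-videos
# using the start/end times produced by speech recognition alignment.
import numpy as np

def split_video_by_segments(frames, fps, segments):
    """frames: (num_frames, H, W[, C]); segments: list of (start_sec, end_sec)."""
    sub_videos = []
    for start_sec, end_sec in segments:
        start = int(round(start_sec * fps))
        end = min(int(round(end_sec * fps)), len(frames))
        sub_videos.append(frames[start:end])
    return sub_videos

# Illustrative segmentation of a 6-second clip at 25 fps:
# segments = [(0.0, 1.40), (1.40, 2.50), (2.50, 4.40), (4.40, 6.00)]
# clips = split_video_by_segments(frames, 25.0, segments)
```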
Considering that the recording instrument may not have an audio recording function when the MRI video is recorded, the sample speech needs to be recorded by a separate recording device, and the sample demonstration video and the sample speech may have a time offset caused by different start times, different end times, and so on. Therefore, in a possible implementation, the time axis information of the sample speech and the sample demonstration video is aligned, and the length of the sample speech or the sample demonstration video is adjusted so that the sample speech and the sample demonstration video have the same length.
Considering that a person may change posture while the video is being recorded, the facial position in the recorded video is not fixed, which may affect the visual quality of the video, may affect the extraction of feature information from the video, and may increase the training cost of the model. Therefore, in a possible implementation, the facial position in the sample demonstration video is adjusted frame by frame so that the same organ in each video frame is located at the same image position. The adjustment may be performed by pixel tracking or optical flow tracking, or by extracting and aligning feature points. The processing of the video frames includes, but is not limited to, rotation, translation, enlargement, and reduction; the frames may also be uniformly cropped to a common size to reduce interfering information in the video.
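One possible form of the feature-point-based alignment mentioned above is sketched here with OpenCV; the tracking parameters and the choice of a similarity (partial affine) transform are illustrative assumptions.

```python
# Sketch: stabilize an MRI demonstration video so that the articulators stay at
# fixed image positions, using feature-point tracking and a similarity transform.
import cv2
import numpy as np

def align_frames(frames):
    """frames: list of grayscale uint8 images; the first frame is the reference."""
    ref = frames[0]
    h, w = ref.shape[:2]
    ref_pts = cv2.goodFeaturesToTrack(ref, maxCorners=200, qualityLevel=0.01, minDistance=7)
    aligned = [ref]
    for frame in frames[1:]:
        cur_pts, status, _ = cv2.calcOpticalFlowPyrLK(ref, frame, ref_pts, None)
        good = status.ravel() == 1
        # Estimate rotation/translation/scale mapping the current frame back onto the reference.
        m, _ = cv2.estimateAffinePartial2D(cur_pts[good], ref_pts[good])
        aligned.append(cv2.warpAffine(frame, m, (w, h)) if m is not None else frame)
    return aligned
```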
S12: Collect the audio to be evaluated that the user reads aloud based on the example sentence text.
The speech read aloud by the user can be collected through a speech collection device of the user terminal.
In a possible implementation, speech recognition may be performed on the collected audio to be evaluated, and the recognition result may be compared with the example sentence text. When the text similarity is lower than a preset similarity threshold, a prompt may be sent to the user to remind the user to read the example sentence text again.
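A minimal sketch of such a similarity check follows; the similarity measure and the threshold value are assumptions, as the disclosure does not specify them.

```python
# Sketch: compare the ASR transcript of the collected audio with the example
# sentence text and prompt the user to read again when similarity is too low.
from difflib import SequenceMatcher

def needs_reread(recognized_text: str, example_text: str, threshold: float = 0.6) -> bool:
    similarity = SequenceMatcher(None, recognized_text.lower(), example_text.lower()).ratio()
    return similarity < threshold

# needs_reread("how are u", "How are you")     -> False (similar enough)
# needs_reread("good morning", "How are you")  -> True (prompt the user to re-read)
```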
S13: Generate an articulator action video reflecting the actions of the articulators when the user reads the example sentence text aloud.
In a possible implementation, the audio to be evaluated is converted into an audio feature vector to be processed, and the audio feature vector to be processed is input into a video generation model to obtain the articulator action video output by the video generation model and corresponding to the audio to be evaluated.
In an implementable embodiment, converting the audio to be evaluated into an audio feature vector to be evaluated may specifically be: inputting the audio to be evaluated into a speech recognition model to obtain the audio feature vector to be evaluated, where the audio feature vector to be evaluated includes a phoneme posterior probability vector for each frame of audio in the audio to be evaluated, and the dimension of each phoneme posterior probability vector is the number of phonemes included in the language type corresponding to the audio to be evaluated.
A phoneme is the smallest speech unit divided according to the natural attributes of speech. Every human speech sound, animal sound, and musical instrument sound can be divided into a finite number of minimum speech units based on its attributes.
Each frame of audio in the audio to be evaluated may be the audio of one phoneme. A phoneme can be represented by a phoneme posterior probability vector, and the dimension of each phoneme posterior probability vector is the number of phonemes of the language type corresponding to the audio to be evaluated. For example, assuming that the language type corresponding to the audio to be evaluated is English, since English has 48 phonemes, the dimension of the English phoneme posterior probability vector is 48. That is, an English phoneme posterior probability vector includes 48 probability values that are greater than or equal to 0 and less than 1, and the sum of the 48 probability values is 1. The phoneme corresponding to the largest of the 48 probability values is the English phoneme represented by the phoneme posterior probability vector. As another example, assuming that the language type corresponding to the audio to be evaluated is a language type that imitates a target musical instrument, if the target musical instrument corresponds to 50 phonemes, the dimension of the phoneme posterior probability vector is also 50, and it is composed of 50 probability values whose sum is 1.
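For illustration only, one such phoneme posterior probability vector could be produced as in the sketch below; the random scores and the 48-phoneme assumption simply mirror the English example above.

```python
# Sketch: a phoneme posterior probability vector for one audio frame.
import numpy as np

NUM_ENGLISH_PHONEMES = 48
scores = np.random.randn(NUM_ENGLISH_PHONEMES)   # raw acoustic scores for one frame (illustrative)
ppg = np.exp(scores) / np.exp(scores).sum()       # softmax: 48 values in [0, 1) that sum to 1

assert np.isclose(ppg.sum(), 1.0)
predicted_phoneme_index = int(np.argmax(ppg))     # the phoneme this frame most likely represents
print(predicted_phoneme_index, float(ppg[predicted_phoneme_index]))
```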
Each frame of audio in the audio to be evaluated may also be the audio of one character or word. Correspondingly, a character or word is represented by the posterior probability vector of that character or word. It is therefore worth noting that the playback duration corresponding to each frame of audio in the audio to be evaluated can be set freely as required, so that each frame of audio is the audio of one or more phonemes, characters, or words.
A speech recognition model (Automatic Speech Recognition, ASR) is a model that converts sound into corresponding text or commands.
Since the number of characters or words in any language is large while the number of phonemes is small, and the pronunciation of every character or word consists of one or more phonemes, in a preferred implementation the speech recognition model may be obtained through the following training: constructing model training samples according to sample audio frames and the phonemes corresponding to the sample audio frames, and training the speech recognition model according to the model training samples.
In detail, signal processing and knowledge mining are performed on the sample audio frames, the speech feature parameters of the sample audio frames are analyzed, speech templates are produced, and a speech parameter library is obtained. A mapping table between speech feature parameters and phonemes is constructed according to the sample audio frames and the phonemes corresponding to the sample audio frames.
After the audio to be evaluated is input into the trained speech recognition model, for each frame of audio in the audio to be evaluated, the same analysis as in training is performed to obtain speech feature parameters to be processed, the speech feature parameters to be processed are matched one by one against the speech templates in the speech parameter library, and the matching probability between the speech feature parameters to be processed and each speech feature parameter in the speech parameter library is obtained. Further, the phoneme posterior probability vector of each frame of audio in the audio to be evaluated is obtained according to the mapping table between speech feature parameters and phonemes.
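A sketch of this template-matching step is given below; the cosine-similarity measure, the softmax normalization, and the MFCC-like feature dimension are assumptions used only to make the idea concrete.

```python
# Sketch: turn one frame's speech feature parameters into a phoneme posterior
# probability vector by matching against a template library.
import numpy as np

def frame_to_phoneme_posteriors(frame_feat, templates, template_phoneme, num_phonemes):
    """templates: (num_templates, feat_dim); template_phoneme: phoneme index per template."""
    sims = templates @ frame_feat / (
        np.linalg.norm(templates, axis=1) * np.linalg.norm(frame_feat) + 1e-8)
    match_prob = np.exp(sims) / np.exp(sims).sum()     # matching probability per template
    posteriors = np.zeros(num_phonemes)
    for p, prob in zip(template_phoneme, match_prob):  # accumulate via the feature-to-phoneme map
        posteriors[p] += prob
    return posteriors                                   # sums to 1 by construction

# feat = np.random.rand(13)                             # e.g., one MFCC-like frame (illustrative)
# templates = np.random.rand(100, 13); phone_ids = np.random.randint(0, 48, 100)
# ppg = frame_to_phoneme_posteriors(feat, templates, phone_ids, 48)
```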
Compared with training a speech recognition model with the audio of a massive number of characters or words, this way of training the speech recognition model with a small and finite number of phonemes and their audio can reduce the model training task and quickly yield a trained speech recognition model.
The video generation model is obtained by training in the following manner: constructing model training data according to sample audio and a sample articulator action video corresponding to the sample audio, and training the video generation model according to the model training data.
The present disclosure does not specifically limit the loss function of the video generation model.
Since the number of characters or words (or sound segments) in any language is large while the number of phonemes is small, and the pronunciation of every character or word (or sound segment) consists of one or more phonemes, in a preferred implementation the sample audio is the audio of all phonemes corresponding to the target language type. The sample articulator action video may be an animated demonstration video of the articulator actions corresponding to each phoneme, produced with any animation production and rendering software. The sample articulator action video may also be an articulator action video corresponding to each phoneme captured by an anatomical imaging instrument such as a camera, an MRI scanner, or a CT scanner. Since users can not only read characters or words of various human languages but also imitate the sounds of animals, musical instruments, and so on, to help those of ordinary skill in the art understand the embodiments of the present disclosure, it should be noted that the above-mentioned sound segments may refer to sound fragments in other sounds that imitate non-human languages (such as the sound fragment corresponding to one key or one string of a musical instrument).
Similarly, in another implementation, the sample audio is the audio of all characters or words (or sound segments) corresponding to the target language type. The sample articulator action video may be an animated demonstration video of the articulator actions corresponding to each character or word (or sound segment), produced with any animation production and rendering software. The sample articulator action video may also be an articulator action video corresponding to each character or word (or sound segment) captured by an anatomical imaging instrument such as a camera, an MRI scanner, or a CT scanner.
In an implementable embodiment, constructing the model training data from the sample audio and the corresponding sample pronunciation organ action video may specifically include the following steps:
Each frame of the sample audio is converted into a sample phoneme posterior probability vector, yielding a sample phoneme posterior probability vector sequence containing at least one such vector. Based on the sample pronunciation organ action video, a sample pronunciation organ video feature corresponding to each sample phoneme posterior probability vector in the sequence is extracted, yielding a sample pronunciation organ video feature sequence. The sample phoneme posterior probability vector sequence and the sample pronunciation organ video feature sequence serve as the model training data.
Each frame of the sample audio corresponds one-to-one to a sample phoneme posterior probability vector in the sequence, and each sample phoneme posterior probability vector corresponds one-to-one to a sample pronunciation organ video feature in the sample pronunciation organ video feature sequence.
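A minimal sketch of assembling such one-to-one training pairs is given below; the flattened-pixel video feature, the array shapes and the placeholder data are assumptions made only for illustration.

    import numpy as np

    def extract_video_feature(frame_image):
        # Placeholder video feature: flattened pixel intensities of one video frame
        # (the disclosure equally allows principal component coefficients, sketched further below).
        return np.asarray(frame_image, dtype=np.float32).ravel()

    def build_training_data(phoneme_posteriors, video_frames):
        """Pair each frame's sample phoneme posterior vector with one sample video feature."""
        assert len(phoneme_posteriors) == len(video_frames)   # one-to-one correspondence
        inputs = np.stack([np.asarray(p, dtype=np.float32) for p in phoneme_posteriors])
        targets = np.stack([extract_video_feature(v) for v in video_frames])
        return inputs, targets    # (N, n_phonemes) model inputs, (N, feature_dim) targets

    # e.g. 100 frames of one sample recording: 20-phoneme posteriors and 64x64 video frames
    X, Y = build_training_data(np.random.rand(100, 20), np.random.rand(100, 64, 64))
    print(X.shape, Y.shape)       # (100, 20) (100, 4096)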
It is easy to understand that, when one audio frame corresponds to one phoneme, the articulation process of the pronunciation organs for that phoneme is reflected by one or more video frames. Accordingly, each sample pronunciation organ video feature is either the pixel feature information of at least one video frame of the sample pronunciation organ action video, or the principal component feature information of at least one video frame of the sample pronunciation organ action video.
It should be noted that the principal component feature information is the principal component coefficient data representing a video frame, obtained after the frame has been dimensionality-reduced by a principal component analysis algorithm.
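For example, the principal component coefficients of each video frame could be obtained as sketched below, assuming scikit-learn's PCA and illustrative frame sizes and component counts.

    import numpy as np
    from sklearn.decomposition import PCA

    # Fit PCA on all frames of the sample pronunciation organ action video (each frame
    # flattened to a pixel vector), then use the low-dimensional coefficients as the
    # "principal component feature information" of each frame.
    frames = np.random.rand(200, 64 * 64)   # 200 placeholder frames of 64x64 pixels
    pca = PCA(n_components=32)               # 32 components is an illustrative choice
    coeffs = pca.fit_transform(frames)       # shape (200, 32): one coefficient vector per frame

    # A frame can be approximately reconstructed from its coefficients when turning
    # predicted video features back into video images.
    reconstructed = pca.inverse_transform(coeffs[0])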
In an implementable embodiment, before extracting, based on the sample pronunciation organ action video, the sample pronunciation organ video feature corresponding to each sample phoneme posterior probability vector in the sequence, the method may further include the following step: adjusting the pronunciation organ positions in the sample pronunciation organ action video frame by frame, so that the same pronunciation organ is located at the same image position in every video frame.
The adjustment may be performed by pixel tracking or optical flow tracking, or by feature point extraction and alignment; the processing of each video frame includes, but is not limited to, rotation, translation, enlargement and reduction, and the frames may also be uniformly cropped to the same size. Adjusting the pronunciation organ positions frame by frame so that the same organ appears at the same image position in every frame helps reduce the interference with the model's training effectiveness and convergence speed that would otherwise be caused by the same organ appearing at different positions across frames.
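One possible form of such an alignment step is sketched below with OpenCV optical-flow tracking of feature points followed by a similarity transform; the tracker, the point count and the transform type are assumptions, since the disclosure allows any of the listed adjustment methods.

    import cv2
    import numpy as np

    def align_to_reference(frame, ref):
        """Translate/rotate/scale one 8-bit grayscale video frame so that tracked
        pronunciation organ feature points match a reference frame."""
        ref_pts = cv2.goodFeaturesToTrack(ref, maxCorners=50, qualityLevel=0.01, minDistance=5)
        cur_pts, status, _ = cv2.calcOpticalFlowPyrLK(ref, frame, ref_pts, None)
        good_ref = ref_pts[status.ravel() == 1]
        good_cur = cur_pts[status.ravel() == 1]
        # Estimate a similarity transform (rotation + translation + scale) and warp the frame back.
        matrix, _ = cv2.estimateAffinePartial2D(good_cur, good_ref)
        return cv2.warpAffine(frame, matrix, (ref.shape[1], ref.shape[0]))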
Since the audio feature vector to be evaluated includes the phoneme posterior probability vector of every frame of the audio to be evaluated, inputting this feature vector into the trained video generation model yields the pronunciation organ video feature corresponding to each frame's phoneme posterior probability vector. The pronunciation organ action video can then be generated and output from the resulting pronunciation organ video feature sequence.
It is worth noting that the sample pronunciation organ action video used in step S13 to train the video generation model corresponds to the sample audio, whereas the simplified pronunciation organ action video used in step S11 to train the video feature generation model corresponds to the sample text. The sample audio may be recorded synchronously with the sample pronunciation organ action video; when the sample audio and the sample pronunciation organ action video are recorded synchronously from the same sample text, the sample pronunciation organ action videos in step S11 and step S13 are the same video. In that case, operations on the sample pronunciation organ action video such as audio-video alignment, video cropping and video center alignment need only be performed once, and the aligned video and audio are used when training both models.
S14: Generate pronunciation evaluation information based on the pronunciation organ action video and the pronunciation organ standard action video corresponding to the example sentence text.
The pronunciation evaluation information includes at least one of: pronunciation scoring information for the user, pronunciation action suggestion information, or a comparison video of the pronunciation organ action video and the pronunciation organ standard action video.
The comparison video is generated as follows: based on the unit text content of the example sentence text, the video segments of the pronunciation organ action video and of the pronunciation organ standard action video that represent the same unit text content are treated as one video segment group; within each video segment group, the segments belonging to the pronunciation organ action video and to the pronunciation organ standard action video are aligned; the aligned pronunciation organ action video and pronunciation organ standard action video are then spliced together to obtain the comparison video.
When the pronunciation evaluation information includes the pronunciation scoring information and/or the pronunciation action suggestion information, action difference information is obtained by comparing the pronunciation organ action video with the pronunciation organ standard action video corresponding to the example sentence text; pronunciation scoring information is generated from the action difference information, and/or the action difference information is matched against preset pronunciation action suggestion information to obtain target action suggestion information matching the action difference information.
The action difference information may refer to difference information between the feature point movement trajectories of the pronunciation organs.
The feature point movement trajectory of a pronunciation organ reflects its articulation movement during pronunciation. The feature points of a pronunciation organ may be its centroid, center point, contour feature points and the like, or other feature points outside the organ that use the organ's centroid, center point or contour feature points as reference points. The present disclosure places no specific limit on the number or type of feature points.
The pronunciation organ action video includes at least one video frame. Determining the position coordinates of the pronunciation organ's feature points in every frame of the video yields a number of sets of feature point position coordinates equal to the number of frames. From all of these coordinates, the feature point movement trajectory of the pronunciation organ, aligned with the time axis of the pronunciation organ action video, can be constructed.
The preset feature point movement trajectory corresponding to the example sentence text is the standard pronunciation organ feature point movement trajectory for that text. Computing the similarity between the feature point movement trajectory derived from the pronunciation organ action video and the standard preset trajectory yields similarity information between the two trajectories.
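A minimal sketch of such a trajectory similarity calculation is given below; the disclosure does not prescribe a particular metric, so the exponential of the mean point-wise distance used here is only an illustrative choice.

    import numpy as np

    def trajectory_similarity(traj, ref):
        """Similarity in [0, 1] between two feature point trajectories of equal length.

        Both inputs have shape (n_frames, 2): one (x, y) feature point position per frame.
        """
        traj, ref = np.asarray(traj), np.asarray(ref)
        assert traj.shape == ref.shape
        mean_dist = np.linalg.norm(traj - ref, axis=1).mean()
        return float(np.exp(-mean_dist))   # 1.0 for identical trajectories, toward 0 as they diverge

    user_traj = np.cumsum(np.random.randn(50, 2), axis=0)   # placeholder measured trajectory
    std_traj = np.cumsum(np.random.randn(50, 2), axis=0)    # placeholder preset trajectory
    print(trajectory_similarity(user_traj, std_traj))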
In an implementable embodiment, the preset feature point movement trajectory corresponding to the example sentence text can be determined as follows:
From the model training data used to train the video generation model, determine all the phonemes (or information at other unit granularities such as characters, words or sentences) that make up the example sentence text, determine the pronunciation organ video feature sequence corresponding to those phonemes, and generate the pronunciation organ standard action video of the example sentence text from that feature sequence. Determining the position coordinates of the pronunciation organ's feature points in every frame of the standard action video yields the preset feature point movement trajectory corresponding to the example sentence text.
Where accuracy is a concern, multiple phoneme sequences making up the example sentence text can be determined from the model training data of the video generation model, multiple preset feature point movement trajectories can be determined from these sequences, and a weighted average of them yields a single, more accurate combined preset trajectory.
For example, the pronunciation score of the audio to be evaluated is determined according to the magnitude of the similarity value in the similarity information, and this score is used as the pronunciation evaluation result. As another example, the similarity value is used to grade the audio to be evaluated as excellent, medium, qualified, unqualified, missing pronunciation and so on, and this grade is used as the pronunciation evaluation result.
With the above pronunciation evaluation method, the audio of the user reading the example sentence text aloud can be input into the video generation model to fit and reconstruct the user's pronunciation organ action video. The position coordinates of the pronunciation organ feature points are determined in every frame of this video to obtain the feature point movement trajectory. Computing the similarity between this trajectory and the standard preset trajectory corresponding to the example sentence text yields similarity information about the pronunciation organ's articulation actions, from which the pronunciation evaluation result is obtained. Because pronunciation is directly related to the movements of the pronunciation organs, the evaluation result obtained in this way is more accurate.
Generating the pronunciation evaluation result of the audio to be evaluated from the similarity information may further include the following steps:
Perform spectrum analysis on the audio to be evaluated and extract sound spectrum feature information; compute the similarity between the extracted spectrum features and the standard sound spectrum features corresponding to the example sentence text to obtain spectrum similarity information; and combine the spectrum similarity information with the similarity information determined above from the pronunciation organ feature point trajectories to obtain the pronunciation evaluation result.
In this way, on top of the pronunciation accuracy computed from the single dimension of the sound spectrum, the similarity information determined from the pronunciation organ feature point trajectories is further combined to produce a more precise pronunciation evaluation result, which further improves the accuracy of the evaluation.
Because of individual differences, different users read the same example sentence text at different speeds. In other words, the duration of the audio to be evaluated depends on how fast the user speaks, i.e. it is variable. When the durations differ, the numbers of audio frames differ, and feeding audio of different durations into the video generation model yields pronunciation organ action videos of different durations, and therefore of different numbers of video frames. If the number of video frames in the pronunciation organ action video corresponding to the audio to be evaluated differs from the number of frames in the pronunciation organ standard action video corresponding to the example sentence text, the lengths of the pronunciation organ feature point trajectory and the preset trajectory corresponding to the example sentence text will be inconsistent, and the similarity computed between them will contain a large error. The present disclosure therefore provides the following two embodiments to avoid this problem.
In detail, in one implementable embodiment, before computing the similarity between the pronunciation organ feature point trajectory and the preset trajectory corresponding to the example sentence text, the number of feature point position coordinates in the pronunciation organ feature point trajectory is adjusted according to the number of coordinates making up the preset trajectory, so that the two trajectories contain the same number of feature point position coordinates.
For example, suppose the preset trajectory contains five feature point position coordinates, A, B, C, D and E, while the measured trajectory contains four coordinates, a, b, c and e. The number of coordinates in the measured trajectory can then be adjusted, for instance by inserting a feature point f(0, 0) into the current pronunciation organ feature point trajectory to obtain a trajectory composed of coordinates a, b, c, f and e. The insertion position of f(0, 0) can be determined from the position of the missing phoneme in the audio to be evaluated. It is easy to understand that, once the phonemes of the audio to be evaluated and the phonemes of the known example sentence text are known through the ASR model, the missing phonemes in the audio to be evaluated can be identified (and likewise any extra phonemes, so that the number of coordinates in the trajectory can also be adjusted by removing points).
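The coordinate adjustment in this example could be sketched as follows; the filler value (0, 0) follows the example above, while the function and its inputs are hypothetical.

    def pad_trajectory(traj, missing_positions, filler=(0.0, 0.0)):
        """Insert a filler coordinate at each index where a phoneme is missing.

        `traj` is a list of (x, y) coordinates from the user's pronunciation organ action
        video; `missing_positions` are indices of missing phonemes obtained by comparing
        the ASR phonemes of the audio against the example sentence text.
        """
        padded = list(traj)
        for pos in sorted(missing_positions):
            padded.insert(pos, filler)
        return padded

    # Example from the text: measured a, b, c, e vs. preset A, B, C, D, E;
    # the phoneme at index 3 is missing, so f(0, 0) is inserted there.
    traj = [(1.0, 1.0), (2.0, 1.5), (3.0, 2.0), (5.0, 3.0)]   # a, b, c, e
    print(pad_trajectory(traj, missing_positions=[3]))         # -> a, b, c, (0, 0), e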
In another implementable embodiment, before determining the position coordinates of the pronunciation organ feature points in every frame of the pronunciation organ action video, the number of video frames in the pronunciation organ action video is adjusted according to the number of frames in the pronunciation organ standard action video corresponding to the example sentence text, so that the two videos contain the same number of frames.
It is easy to understand that, when the pronunciation organ standard action video and the pronunciation organ action video of the audio to be evaluated contain the same number of frames, and on the premise that one video frame corresponds to one feature point, the preset feature point trajectory corresponding to the example sentence text and the feature point trajectory of the audio to be evaluated contain the same number of feature point position coordinates.
For example, suppose the pronunciation organ standard action video contains five frames, numbered 1, 2, 3, 4 and 5, while the pronunciation organ action video of the audio to be evaluated contains three frames, numbered 1, 4 and 5. Frame interpolation can then be applied to the frame sequence 1, 4, 5: for instance, frames 1 and 4 can be inserted into the current sequence to obtain the sequence 1, 1, 4, 4, 5, or empty frames 0 can be inserted to obtain the sequence 1, 0, 0, 4, 5.
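Both padding strategies from this example could be sketched as follows; where the filler frames are placed is an illustrative choice, not mandated by the disclosure.

    def match_frame_count(frames, target_len, mode="repeat"):
        """Pad a shorter frame sequence to the length of the standard action video.

        mode="repeat" duplicates early frames (e.g. [1, 4, 5] -> [1, 1, 4, 4, 5]);
        mode="empty" inserts placeholder frames 0 after the first frame
        (e.g. [1, 4, 5] -> [1, 0, 0, 4, 5]).
        """
        deficit = target_len - len(frames)
        if deficit <= 0:
            return list(frames)
        if mode == "empty":
            return [frames[0]] + [0] * deficit + list(frames[1:])
        padded = []
        for f in frames:
            padded.append(f)
            if deficit > 0:
                padded.append(f)   # duplicate this frame once
                deficit -= 1
        return padded

    print(match_frame_count([1, 4, 5], 5, mode="repeat"))   # [1, 1, 4, 4, 5]
    print(match_frame_count([1, 4, 5], 5, mode="empty"))    # [1, 0, 0, 4, 5]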
In an implementable embodiment, in order to further locate, on top of determining the pronunciation evaluation result of the audio to be evaluated, which phoneme or which word in the audio is pronounced inaccurately (or not pronounced at all), the step described in step S13 above of determining the position coordinates of the pronunciation organ feature points in every video frame of the pronunciation organ action video to obtain the feature point movement trajectory may further include the following steps:
Divide the audio to be evaluated according to a preset pronunciation evaluation granularity to obtain multiple sub-audios to be evaluated; in every frame of the pronunciation organ action video, determine the position coordinates of the pronunciation organ feature points corresponding to each sub-audio to be evaluated, thereby obtaining a pronunciation organ feature point movement trajectory segment for each sub-audio.
The preset pronunciation evaluation granularity is a pronunciation evaluation unit set according to user requirements. It may be a phoneme, character, word, sentence, paragraph, passage and so on, which the present disclosure does not specifically limit. When dividing the audio to be evaluated according to the preset granularity, the audio may specifically be divided according to the duration corresponding to the preset granularity, yielding multiple sub-audios to be evaluated.
Once the sub-audios to be evaluated are determined, a pronunciation organ feature point movement trajectory segment corresponding to each sub-audio can be obtained from the pronunciation organ action video. Specifically, the position coordinates of the pronunciation organ feature points corresponding to each sub-audio may be determined in every frame of the video, yielding a trajectory segment per sub-audio. Alternatively, after the complete feature point trajectory of the pronunciation organ for the entire audio to be evaluated has been obtained, it can be divided in the same way the sub-audios were divided, again yielding a trajectory segment for each sub-audio.
Correspondingly, after the trajectory segment of each sub-audio to be evaluated is obtained, for each sub-audio the similarity between its pronunciation organ feature point trajectory segment and the corresponding preset feature point trajectory segment is computed, giving a first similarity value for that sub-audio; the similarity information includes the first similarity value of every sub-audio to be evaluated.
A preset feature point trajectory segment is a segment of the complete preset feature point trajectory. It is obtained in a manner similar to dividing the complete feature point trajectory of the entire audio to be evaluated into per-sub-audio segments, which is not repeated here.
Referring to Fig. 2, the flowchart of the method for locating which phoneme or word in the audio to be evaluated is pronounced inaccurately includes steps S21 to S28.
S21: Acquire the audio to be evaluated, which is the audio of the user reading the example sentence text aloud.
S22: Input the audio to be evaluated into the video generation model, and obtain the pronunciation organ action video, output by the video generation model, corresponding to the audio to be evaluated.
S23: Divide the audio to be evaluated according to the preset pronunciation evaluation granularity to obtain multiple sub-audios to be evaluated.
S24: In every frame of the pronunciation organ action video, determine the position coordinates of the pronunciation organ feature points corresponding to each sub-audio to be evaluated, obtaining a pronunciation organ feature point movement trajectory segment for each sub-audio.
S25: For each sub-audio to be evaluated, compute the similarity between its pronunciation organ feature point trajectory segment and the corresponding preset feature point trajectory segment, obtaining the first similarity value of that sub-audio.
S26: Determine the target first similarity values that are smaller than a preset threshold, and determine the target sub-audios to be evaluated corresponding to those values.
The preset threshold may be a value such as 90% or 98%. When a first similarity value is smaller than the preset threshold, the target sub-audio corresponding to that value is determined to be pronounced inaccurately. The magnitude of the first similarity value represents how similar the target sub-audio to be evaluated is to the standard pronunciation corresponding to it.
S27: Determine the target example sentence text segment according to the target sub-audio to be evaluated, the target example sentence text segment being a segment of the example sentence text.
Once a target sub-audio with inaccurate pronunciation has been determined, the target example sentence text segment corresponding to it can be determined. This segment may include one or more phonemes, characters, words, sentences and so on.
S28: Display the target example sentence text segment in association with the target first similarity value to obtain the pronunciation evaluation result, so as to remind the user of the mispronounced text segment.
With the approach shown in Fig. 2, it is possible to locate which phoneme or word in the audio to be evaluated is pronounced inaccurately and to make the user aware of it, so that the user can practise the mispronounced part in a targeted way. For example, the pronunciation organ standard action video and the standard preset feature point trajectory segment corresponding to the mispronounced part can be shown to the user, and at the same time the user's own pronunciation organ action video and the inaccurate feature point trajectory segment for that part can also be shown, so that the user knows which pronunciation is inaccurate and how it differs from the standard pronunciation.
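Steps S26 and S27 amount to thresholding the per-sub-audio similarity values and mapping the low-scoring sub-audios back to text segments, which could be sketched as follows; the threshold and the data layout are assumptions.

    def locate_mispronunciations(similarities, text_segments, threshold=0.9):
        """Return the example-text segments whose first similarity value falls below the threshold.

        `similarities[i]` is the first similarity value of the i-th sub-audio to be evaluated
        and `text_segments[i]` is the example sentence text segment it was divided from;
        the 0.9 threshold mirrors the 90% example given above.
        """
        return [(seg, sim) for seg, sim in zip(text_segments, similarities) if sim < threshold]

    # e.g. the word "think" scored 0.72 and would be flagged for the user
    print(locate_mispronunciations([0.95, 0.72, 0.91], ["I", "think", "so"]))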
Since sound is produced by multiple pronunciation organs acting together, the pronunciation organ action video in the embodiments of the present disclosure includes the action of at least one of the following organs: upper lip, lower lip, upper teeth, lower teeth, gums, hard palate, soft palate, uvula, tongue tip, tongue surface, tongue root, nasal cavity, oral cavity, pharynx, epiglottis, esophagus, trachea, vocal cords, or larynx. The feature point movement trajectory (or trajectory segment) of the pronunciation organs includes the feature point movement trajectory (or trajectory segment) of each organ appearing in the pronunciation organ action video.
In other words, using the methods of the above embodiments of the present disclosure, the feature point movement trajectory (or trajectory segment) of any pronunciation organ can be obtained.
For the feature point trajectory (or trajectory segment) of each pronunciation organ, the similarity between that organ's trajectory (or trajectory segment) and the organ's preset trajectory (or preset trajectory segment) under the example sentence text can be computed, giving a second similarity value. The second similarity value represents how similar one pronunciation organ's feature point trajectory (or trajectory segment) is to that organ's standard preset feature point trajectory (or preset trajectory segment).
Further, the target second similarity values that are smaller than a threshold can be determined, and the target pronunciation organs can be determined from them. In this way it can be established which specific organ or organs, among the multiple pronunciation organs, performed an incorrect articulation action and thereby caused the mispronunciation of the example sentence text (or text segment).
In this way, on the basis of locating which phoneme or word in the audio to be evaluated is pronounced inaccurately, it is further possible to locate which one or more pronunciation organs caused the inaccuracy. Showing the user the pronunciation organ standard action video and the standard preset feature point trajectory of those organs helps the user carry out targeted corrective learning of the pronunciation organ actions.
The pronunciation organ action video is a magnetic resonance imaging (MRI) video; correspondingly, the sample pronunciation organ action video used to train the video generation model is also an MRI video, and it includes the action of at least one of the following pronunciation organs: upper lip, lower lip, upper teeth, lower teeth, gums, hard palate, soft palate, uvula, tongue tip, tongue surface, tongue root, nasal cavity, oral cavity, pharynx, epiglottis, esophagus, trachea, vocal cords, or larynx.
In addition, since the pronunciation organs also include power organs such as the lungs, diaphragm and trachea, the pronunciation organ action video and the sample pronunciation organ action video may further include the action of at least one of the lungs, diaphragm and trachea.
After the mispronounced phoneme or word is obtained, the action difference information between the mispronounced phoneme or word and the correct action video can be matched against preset pronunciation action suggestion information to obtain the target action suggestion information matching the action difference information.
For example, after the mispronounced word is obtained, if the action difference information indicates that the position of the palate in the user's pronunciation organ action video is lower than in the pronunciation organ standard action video, the corresponding target action suggestion information "raise the palate" can be matched; if the action difference information indicates that the tongue position in the user's video is further back than in the standard video, the corresponding target action suggestion information "move the tongue forward" can be matched.
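Such matching could be realized, for instance, as a lookup from action difference descriptors to preset suggestion text; the rule table and descriptors below are hypothetical and only illustrate the pattern.

    # Hypothetical rule table mapping an (organ, deviation) pair to preset suggestion text;
    # the organs, deviation labels and wording are illustrative, not a fixed rule set of the disclosure.
    SUGGESTIONS = {
        ("palate", "too_low"): "raise the palate",
        ("tongue", "too_far_back"): "move the tongue forward",
    }

    def suggest(action_differences):
        """Match action difference information to preset pronunciation action suggestions."""
        return [SUGGESTIONS[key] for key in action_differences if key in SUGGESTIONS]

    # e.g. comparing the user's video with the standard video yielded these two deviations
    print(suggest([("palate", "too_low"), ("tongue", "too_far_back")]))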
S15: Display the pronunciation evaluation information to the user.
The displayed pronunciation evaluation information may be at least one of the user's pronunciation scoring information, the pronunciation action suggestion information, or the comparison video of the pronunciation organ action video and the pronunciation organ standard action video; any two of the three may also be displayed together, or all three may be displayed at the same time.
Considering that MRI images are not very clear and that non-professionals are not familiar enough with organ shapes, it is difficult for users to extract information from MRI videos. Therefore, when the original pronunciation organ action video and the original pronunciation organ standard action video are MRI videos, the pronunciation organ action video or the pronunciation organ standard action video can be rendered frame by frame through an animation generation model to obtain a pronunciation organ animation video, which is then displayed in place of the pronunciation organ action video or the pronunciation organ standard action video.
The training samples of the animation generation model include multiple MRI sample images and the animated organ image corresponding to each MRI sample image, and are obtained as follows: determine the position of the organs in each MRI sample image; at the position of each organ in the MRI sample image, generate an animated organ corresponding to that position, thereby obtaining the animated organ image.
An MRI video consists of multiple video frames. When generating the animation, all video frames may be input into the animation generation model; after the animation frames output by the model are obtained, they can be recombined in the original order of the video frames to obtain the animation video corresponding to those video frames.
In a possible embodiment, video frames may also be selected at intervals of a preset number of frames for input into the animation generation model; in this way, after the animation frames generated by the model are obtained, intermediate frames can be interpolated between them to produce a smooth animation video. This reduces the workload of the animation generation model, lowers the consumption of computing resources and improves the efficiency of animation generation.
The animation generation model may be any machine learning model capable of learning from samples, such as a generative adversarial network model, a recurrent neural network model or a convolutional network model, which the present disclosure does not limit. The training samples of the model include multiple MRI sample images and the animated organ image corresponding to each MRI sample image; by learning from the training samples, the animation generation model can generate a corresponding animated image from an input MRI image, thereby converting MRI video frames into animation frames.
The animation generation model can output the animation frames corresponding to the video frames in the order in which the video frames are input, with the pronunciation organ positions in each animation frame filled in by animated pronunciation organs, which makes the frames easier for the user to view and understand.
In a possible embodiment, different colors can be filled in for the different animated pronunciation organs according to the organ, and the organ name can also be labeled on the animated organ. For example, the palate area can be filled with light yellow and labeled "palate", the tongue area filled with bright red and labeled "tongue", and the teeth filled with white and labeled "teeth". In this way, the position of each organ and the connections between them are shown more intuitively, which is easier for the user to understand.
It should be noted that the above color filling and name labeling are described only as an example; the present disclosure does not limit how organ colors are filled or how names are labeled. For instance, the names may also be labeled in a foreign language, or phonetic symbols or pinyin for the pronunciation may be added.
Recombining the animation frames in the order of the video frames yields the complete animation video. The playback speed of the animation frames may match that of the video frames, or may be adjusted according to the application. For example, when the animation video is used in an educational scenario, the playback speed of the animation frames can be reduced to show the movement and exertion of the pronunciation organs more clearly. When the playback speed is reduced, intermediate frames can also be interpolated between frames to increase the frame count and keep the animation video smooth.
In a possible embodiment, the animation generation model is a generative adversarial network model that includes a generator for generating animated images from MRI images, and the animation generation model is trained as follows:
Repeatedly perform the steps of: the generator generating a training animation image from an MRI sample image; generating a loss value from the animated pronunciation organ image corresponding to the MRI sample image and a preset loss function; adjusting the parameters of the generator based on the loss value; and the discriminator of the generative adversarial network model evaluating the training animation image based on the animated pronunciation organ image, until the evaluation result satisfies a preset evaluation result condition.
The generator generates images from the input data, and the discriminator evaluates whether the images output by the generator share consistent features with the images in a specified set, i.e. whether a picture belongs to that set. The discriminator's evaluation may be correct or incorrect: when the generator's output differs obviously in its features from the pictures in the specified set, the discriminator's evaluation is usually correct, meaning it can correctly judge whether a picture belongs to the set; when that difference is no longer obvious, it becomes difficult for the discriminator to always judge correctly. The training stop condition can therefore be set through a threshold on the proportion of correct discriminator judgements, so that the images produced by the generator better match the features of the training targets in the training set.
Before training the generator, the discriminator can also be pre-trained. For example, random features are fed to the generator to produce an image, the discriminator evaluates whether the image's features are consistent with the animated organ images in the training samples, and the discriminator's parameters are adjusted according to whether the evaluation is correct, until the discriminator can correctly judge whether an image generated by the generator is consistent with the animated pronunciation organ images in the training samples. After the discriminator is trained, it can in turn be used to train the generator. It is worth noting that the generator and the discriminator can also be trained synchronously, so that they constrain each other: the generator's images better match the features of the animated pronunciation organ images, while the discriminator evaluates images more correctly.
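A compact sketch of this adversarial training pattern is given below, using PyTorch with toy fully-connected networks; the architectures, image size and loss weighting are assumptions, and only the generator/discriminator update pattern follows the description above.

    import torch
    import torch.nn as nn

    IMG = 64 * 64
    generator = nn.Sequential(nn.Linear(IMG, 256), nn.ReLU(), nn.Linear(256, IMG), nn.Sigmoid())
    discriminator = nn.Sequential(nn.Linear(IMG, 256), nn.ReLU(), nn.Linear(256, 1))

    g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
    adv_loss, recon_loss = nn.BCEWithLogitsLoss(), nn.L1Loss()

    def train_step(mri_batch, animated_batch):
        """One adversarial update: the discriminator judges real vs. generated animation frames,
        and the generator is pushed toward the paired animated organ images."""
        fake = generator(mri_batch)

        # Discriminator: real animated organ images -> 1, generated images -> 0
        d_opt.zero_grad()
        d_loss = adv_loss(discriminator(animated_batch), torch.ones(len(mri_batch), 1)) + \
                 adv_loss(discriminator(fake.detach()), torch.zeros(len(mri_batch), 1))
        d_loss.backward()
        d_opt.step()

        # Generator: fool the discriminator while staying close to the paired animation target
        g_opt.zero_grad()
        g_loss = adv_loss(discriminator(fake), torch.ones(len(mri_batch), 1)) + \
                 recon_loss(fake, animated_batch)
        g_loss.backward()
        g_opt.step()
        return d_loss.item(), g_loss.item()

    mri = torch.rand(8, IMG)    # placeholder batch of flattened MRI frames
    anim = torch.rand(8, IMG)   # paired animated organ images
    print(train_step(mri, anim))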
In a possible embodiment, the training samples are obtained as follows: determine the position of the pronunciation organs in each MRI sample image, and generate, at the position of the pronunciation organs in each MRI sample image, animated pronunciation organs corresponding to those positions, thereby obtaining the animated pronunciation organ image.
The position of each organ can be distinguished by outlining the color block regions in the MRI sample image; the pronunciation organ positions can also be identified by a recognition model, or an organ template image can be overlaid on the MRI sample image and regions merged in the MRI sample image based on the organ positions of the template, with the color block of the region where a pronunciation organ lies taken as the position of that organ.
In a possible embodiment, for each MRI sample image, the organ contours of the MRI sample image are extracted, and the organ contour of each pronunciation organ is filled with the organ image corresponding to that organ.
The organ image may be a cartoon image or a realistic image. In a possible embodiment, organ textures can be retrieved from a preset flash animation library, and the organ contour of each pronunciation organ is filled with the texture corresponding to that organ. It should be noted that the flash animation library may contain multiple textures for the same pronunciation organ; one texture may be selected automatically for filling, or the texture type to be filled may be modified according to the user's specification.
In a possible embodiment, for the MRI sample image corresponding to the first frame of the MRI sample video, organ textures are retrieved from the preset flash animation library and the organ contour of each pronunciation organ is filled with the texture corresponding to that organ; for the MRI sample images corresponding to the other video frames, the organ textures corresponding to the pronunciation organs in the first-frame MRI sample image are retrieved from the flash animation library and used to fill the contours of the corresponding pronunciation organs.
In other words, after the first frame is filled with textures, the other frames can be filled based on the texture types used for the first frame, so that the texture style of the same pronunciation organ is consistent across all animation frames, making the final animation video look more natural.
For example, if the flash animation library contains three textures for the tongue and four textures for the teeth, and when filling the MRI sample image of the first frame the tongue-1 texture is chosen for the tongue contour and the teeth-3 texture for the teeth contour, then when the subsequent frames are filled, the tongue-1 texture can automatically be selected to fill the tongue contour and the teeth-3 texture to fill the teeth contour.
Considering that organ contour extraction may contain deviations, in a possible embodiment the organ contours can be corrected after they are extracted. The contours may be corrected frame by frame, or, after the contours of the first frame are corrected, the contours can be tracked by feature point recognition to achieve contour correction in the other frames.
In a possible embodiment, for the MRI sample image corresponding to the first frame of the MRI sample video, the organ contours in that image are adjusted based on the MRI sample image so that the pronunciation organ contours correspond to the feature points in the image; for the MRI sample images corresponding to the other video frames, feature point tracking is performed between the feature points in the current image and those in the previous video frame, and the organ contours in the current image are automatically adjusted based on the tracking result.
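Such frame-to-frame contour propagation could be sketched as follows with Lucas-Kanade feature point tracking; the tracker choice and the handling of unreliably tracked points are assumptions.

    import cv2
    import numpy as np

    def propagate_contour(prev_frame, cur_frame, prev_contour):
        """Move a corrected organ contour from one MRI frame to the next by feature point tracking.

        `prev_frame` and `cur_frame` are 8-bit grayscale MRI frames; `prev_contour` is an
        (N, 1, 2) float32 array of contour points located on `prev_frame`.
        """
        cur_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_frame, cur_frame, prev_contour, None)
        ok = status.ravel() == 1
        # Keep only reliably tracked points as the adjusted contour for the current frame
        return cur_pts[ok]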
It is worth noting that steps S11 to S15 of this embodiment may all be executed on the user terminal. Optionally, to reduce the computing load on the terminal, steps S13 and S14 may also be executed on a server: after collecting the user's audio to be evaluated, the user terminal sends the audio to the server, and after processing the audio the server returns the pronunciation evaluation information to the user terminal.
Through the above technical solutions, at least the following technical effects can be achieved:
By acquiring the audio to be evaluated that the user reads aloud based on the example sentence text, and generating pronunciation evaluation information from the pronunciation organ action video generated from that audio and the pronunciation organ standard action video corresponding to the example sentence text, the user's pronunciation can be evaluated more accurately, and whether the user's pronunciation is accurate is reflected more intuitively.
Fig. 3 is a block diagram of a pronunciation evaluation apparatus according to an exemplary disclosed embodiment. As shown in Fig. 3, the pronunciation evaluation apparatus 300 includes:
an example sentence display module 310, configured to display the example sentence text to the user;
an audio collection module 320, configured to collect the audio to be evaluated that the user reads aloud based on the example sentence text;
a video generation module 330, configured to generate a pronunciation organ action video reflecting the actions of the user's pronunciation organs when reading the example sentence text aloud;
a pronunciation evaluation module 340, configured to generate pronunciation evaluation information based on the pronunciation organ action video and the pronunciation organ standard action video corresponding to the example sentence text;
an evaluation display module 350, configured to display the pronunciation evaluation information to the user.
在一种可能的实施方式中,所述发音评价信息包括对所述用户的发音打分信息、发音动作建议信息、或所述发音器官动作视频与所述发音器官标准动作视频的对比视频中的至少一者。In a possible implementation, the pronunciation evaluation information includes at least one of the pronunciation scoring information of the user, the pronunciation action suggestion information, or the comparison video of the articulator action video and the articulator standard action video one.
在一种可能的实施方式中,所述例句展示模块310,用于基于所述例句文本生成例句音频;将所述例句音频与所述发音器官标准动作视频合成为例句演示视频;向用户展示例句文本和所述例句演示视频。In a possible implementation, the example sentence display module 310 is configured to generate example sentence audio based on the example sentence text; synthesize the example sentence audio and the standard action video of the pronunciation organ into an example sentence demonstration video; display the example sentence to the user Text and demo video of said example sentences.
在一种可能的实施方式中,所述发音评价模块340,用于通过对比所述发音器官动作视频和所述例句文本对应的发音器官标准动作视频,得到动作差异信息;根据所述动作差异信息生成发音打分信息,和/或,根据所述动作差异信息与预设的发音动作建议信息进行匹配,得到与所述动作差异信息相匹配的目标动作建议信息。In a possible implementation, the pronunciation evaluation module 340 is configured to obtain action difference information by comparing the pronunciation organ action video and the pronunciation organ standard action video corresponding to the example text; according to the action difference information Pronunciation scoring information is generated, and/or, according to the action difference information and preset pronunciation action suggestion information, target action suggestion information matching the action difference information is obtained.
在一种可能的实施方式中,所述动作差异信息为发音器官的特征点运动轨迹的差异信息。In a possible implementation manner, the action difference information is difference information of movement trajectories of feature points of speech organs.
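A hedged sketch of how such trajectory differences might be turned into scoring information and matched against preset suggestions is given below; the trajectories are assumed to be pre-extracted, time-aligned arrays, and the thresholds, suggestion texts, and score mapping are illustrative assumptions rather than the disclosed scheme.

```python
# Hedged sketch: scoring from feature-point trajectory differences.
# user_traj / ref_traj: arrays of shape (num_frames, num_points, 2), time-aligned.
import numpy as np

def trajectory_difference(user_traj: np.ndarray, ref_traj: np.ndarray) -> float:
    # Mean Euclidean distance between corresponding feature points per frame.
    return float(np.linalg.norm(user_traj - ref_traj, axis=-1).mean())

def score_and_suggestion(user_traj: np.ndarray, ref_traj: np.ndarray):
    diff = trajectory_difference(user_traj, ref_traj)
    score = max(0.0, 100.0 - 100.0 * diff)  # assumed mapping to a 0-100 score
    # Match the difference against preset suggestion rules (illustrative thresholds).
    if diff < 0.05:
        suggestion = "Pronunciation is close to the reference."
    elif diff < 0.15:
        suggestion = "Open the mouth slightly wider and raise the tongue tip."
    else:
        suggestion = "Watch the comparison video and imitate the tongue movement."
    return score, suggestion
```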
在一种可能的实施方式中，所述发音评价模块，用于基于例句文本的单位文本内容，将所述发音器官动作视频和所述发音器官标准动作视频中表征同一单位文本内容的视频片段作为一组视频片段组；将各视频片段组中属于所述发音器官动作视频和所述发音器官标准动作视频的视频片段进行对齐；将对齐后的所述发音器官动作视频和所述发音器官标准动作视频拼接，得到所述对比视频。In a possible implementation, the pronunciation evaluation module is configured to, based on the unit text content of the example sentence text, take the video clips representing the same unit text content in the articulator action video and the articulator standard action video as a video clip group; align, within each video clip group, the video clips belonging to the articulator action video and to the articulator standard action video; and splice the aligned articulator action video and articulator standard action video to obtain the comparison video.
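The following is a minimal sketch of the comparison-video assembly described above, assuming each clip is a numpy array of frames and that per-unit segment boundaries (e.g., per phoneme or word) are already known from forced alignment; the naive resampling used for temporal alignment is an assumption for illustration.

```python
# Minimal sketch of comparison-video assembly. Each clip: (T, H, W, 3) frames.
import numpy as np

def align_pair(user_clip: np.ndarray, ref_clip: np.ndarray) -> tuple:
    # Naive temporal alignment: resample the user clip to the reference length.
    t_ref = ref_clip.shape[0]
    idx = np.linspace(0, user_clip.shape[0] - 1, t_ref).round().astype(int)
    return user_clip[idx], ref_clip

def build_comparison_video(user_segments, ref_segments) -> np.ndarray:
    # user_segments / ref_segments: lists of clips, one per unit text content.
    pairs = [align_pair(u, r) for u, r in zip(user_segments, ref_segments)]
    # Place user and reference frames side by side, then concatenate the segments.
    stitched = [np.concatenate([u, r], axis=2) for u, r in pairs]  # width-wise
    return np.concatenate(stitched, axis=0)                        # time-wise
```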
在一种可能的实施方式中，所述视频生成模块330，用于将所述待评价音频转换成待处理音频特征向量；将所述待处理音频特征向量输入视频生成模型，得到所述视频生成模型输出的与所述待评价音频对应的发音器官动作视频。In a possible implementation, the video generation module 330 is configured to convert the audio to be evaluated into an audio feature vector to be processed, and input the audio feature vector to be processed into a video generation model to obtain the articulator action video output by the video generation model and corresponding to the audio to be evaluated.
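A hedged inference sketch of this module is shown below; the per-frame acoustic feature and the model are placeholders (any frame-level feature such as a phoneme posterior vector could be used), not the actual disclosed model.

```python
# Hedged inference sketch for the video generation model (placeholder model).
import numpy as np
import torch

def generate_articulator_video(model: torch.nn.Module, audio_features: np.ndarray) -> np.ndarray:
    # audio_features: (num_frames, feature_dim), one vector per audio frame.
    x = torch.from_numpy(audio_features).float().unsqueeze(0)  # add batch dim
    with torch.no_grad():
        frames = model(x)        # assumed output shape: (1, num_frames, H, W)
    return frames.squeeze(0).cpu().numpy()
```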
在一种可能的实施方式中，发音评价装置300还包括视频生成模型训练模块，被配置为根据样本音频以及与所述样本音频对应的样本发音器官动作视频构建模型训练数据；根据所述模型训练数据训练得到所述视频生成模型。In a possible implementation, the pronunciation evaluation apparatus 300 further includes a video generation model training module configured to construct model training data according to sample audio and a sample articulator action video corresponding to the sample audio, and to train the video generation model according to the model training data.
在一种可能的实施方式中，视频生成模型训练模块进一步被配置为将所述样本音频中的每一帧音频转换成样本音素后验概率向量，得到包括至少一个样本音素后验概率向量的样本音素后验概率向量序列；基于所述样本发音器官动作视频，提取与所述样本音素后验概率向量序列中每一所述样本音素后验概率向量对应的样本发音器官视频特征，得到样本发音器官视频特征序列；将所述样本音素后验概率向量序列和所述样本发音器官视频特征序列作为所述模型训练数据。In a possible implementation, the video generation model training module is further configured to convert each frame of audio in the sample audio into a sample phoneme posterior probability vector to obtain a sample phoneme posterior probability vector sequence including at least one sample phoneme posterior probability vector; extract, based on the sample articulator action video, a sample articulator video feature corresponding to each sample phoneme posterior probability vector in the sample phoneme posterior probability vector sequence to obtain a sample articulator video feature sequence; and use the sample phoneme posterior probability vector sequence and the sample articulator video feature sequence as the model training data.
在一种可能的实施方式中,所述样本发音器官视频特征为所述样本发音器官动作视频中的至少一帧视频图像的像素点特征信息或主成分特征信息中的至少一种。In a possible implementation manner, the sample vocal organ video feature is at least one of pixel point feature information or principal component feature information of at least one frame of video image in the sample vocal organ motion video.
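For illustration, training pairs of the kind described above could be assembled as follows; the acoustic model producing the phoneme posteriors is assumed to exist elsewhere, and the PCA dimensionality is an arbitrary choice for the sketch.

```python
# Illustrative construction of (phoneme posterior sequence, video feature sequence)
# training pairs; the posterior extractor and PCA dimensionality are assumptions.
import numpy as np
from sklearn.decomposition import PCA

def build_training_pair(frame_posteriors: np.ndarray, video_frames: np.ndarray, n_components: int = 32):
    # frame_posteriors: (T, num_phonemes) - one posterior vector per audio frame.
    # video_frames:     (T, H, W) grayscale MRI frames aligned to the audio frames.
    flat = video_frames.reshape(video_frames.shape[0], -1)   # pixel-point features
    pca = PCA(n_components=n_components).fit(flat)
    video_features = pca.transform(flat)                     # principal-component features
    assert frame_posteriors.shape[0] == video_features.shape[0]
    return frame_posteriors, video_features
```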
在一种可能的实施方式中，所述视频生成模块330，还用于将所述例句文本分割为单位文本序列；将所述单位文本序列输入视频特征生成模型，得到视频特征序列；基于所述视频特征序列生成发音器官标准动作视频；其中，所述视频特征生成模型是通过如下方式训练得到的：将样本文本分割为样本单位文本序列；根据样本单位文本序列以及与所述样本单位文本序列对应的样本发音器官动作视频的样本视频特征序列构建模型训练数据；根据所述模型训练数据训练得到所述视频特征生成模型。In a possible implementation, the video generation module 330 is further configured to divide the example sentence text into a unit text sequence; input the unit text sequence into a video feature generation model to obtain a video feature sequence; and generate the articulator standard action video based on the video feature sequence. The video feature generation model is trained as follows: dividing sample text into a sample unit text sequence; constructing model training data according to the sample unit text sequence and a sample video feature sequence of the sample articulator action video corresponding to the sample unit text sequence; and training the video feature generation model according to the model training data.
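A hedged sketch of standard-action-video generation from example text follows; the split into unit texts (characters stand in for phonemes or syllables here), the vocabulary, the feature model, and the frame decoder are all assumptions for illustration.

```python
# Hedged sketch: example text -> unit text sequence -> video feature sequence -> frames.
import numpy as np
import torch

def generate_standard_video(example_text: str, unit_vocab: dict,
                            feature_model: torch.nn.Module, decode_frame) -> np.ndarray:
    # Split the example text into unit texts (characters used here as a stand-in
    # for phonemes or syllables) and map them to ids with an assumed vocabulary.
    units = list(example_text.replace(" ", ""))
    ids = torch.tensor([[unit_vocab[u] for u in units]])
    with torch.no_grad():
        feature_seq = feature_model(ids).squeeze(0).numpy()   # assumed (T, feature_dim)
    # Decode each feature vector back into an image frame (e.g. inverse PCA or a decoder).
    frames = np.stack([decode_frame(f) for f in feature_seq])
    return frames
```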
在一种可能的实施方式中，所述发音器官动作视频和所述发音器官标准动作视频为基于核磁共振MRI视频生成的发音器官动画视频，所述装置还包括视频渲染模块，用于通过动画生成模型，逐帧对所述发音器官动作视频或所述发音器官标准动作视频进行渲染，得到发音器官动画视频。In a possible implementation, the articulator action video and the articulator standard action video are articulator animation videos generated based on magnetic resonance imaging (MRI) video, and the apparatus further includes a video rendering module configured to render the articulator action video or the articulator standard action video frame by frame through an animation generation model to obtain the articulator animation video.
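A minimal sketch of this frame-by-frame rendering step is given below, assuming the animation generation model is an image-to-image network (e.g., a pix2pix-style translator); this is not the disclosed model, only an illustration of the per-frame loop.

```python
# Minimal sketch of frame-by-frame rendering with an assumed image-to-image model.
import numpy as np
import torch

def render_animation(anim_model: torch.nn.Module, mri_frames: np.ndarray) -> np.ndarray:
    # mri_frames: (T, H, W) grayscale MRI video; output: (T, H, W, 3) animation frames.
    out = []
    with torch.no_grad():
        for frame in mri_frames:
            x = torch.from_numpy(frame).float()[None, None]   # (1, 1, H, W)
            y = anim_model(x)                                  # assumed (1, 3, H, W)
            out.append(y.squeeze(0).permute(1, 2, 0).cpu().numpy())
    return np.stack(out)
```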
在一种可能的实施方式中，所述动画生成模型的训练样本包括多张MRI样本图像和各MRI样本图像对应的动画发音器官图，并且所述装置还包括训练样本生成模块，被配置为确定各MRI样本图像中的发音器官的位置；在各MRI样本图像中的发音器官的位置，生成与所述发音器官的位置对应的动画发音器官，得到动画发音器官图。In a possible implementation, the training samples of the animation generation model include a plurality of MRI sample images and an animated articulator image corresponding to each MRI sample image, and the apparatus further includes a training sample generation module configured to determine the position of the articulator in each MRI sample image, and generate, at the position of the articulator in each MRI sample image, an animated articulator corresponding to the position of the articulator to obtain the animated articulator image.
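As an illustrative sketch of this training-sample generation, an articulator outline located in an MRI image could be redrawn as a stylized animated organ with OpenCV; the locator callable and the colors are assumptions for the sketch.

```python
# Illustrative training-sample generation: locate an articulator outline in an MRI
# image (via a caller-supplied locator) and draw a stylized animated-organ image.
import cv2
import numpy as np

def make_animated_organ_image(mri_image: np.ndarray, locate_tongue) -> np.ndarray:
    # locate_tongue: callable returning an (N, 2) polygon of the organ outline.
    outline = locate_tongue(mri_image).astype(np.int32)
    canvas = np.full((*mri_image.shape[:2], 3), 255, dtype=np.uint8)  # white background
    cv2.fillPoly(canvas, [outline], color=(0, 0, 255))                # stylized organ
    cv2.polylines(canvas, [outline], True, (0, 0, 0), 2)              # black contour
    return canvas
```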
上述各模块所具体执行的步骤在方法部分实施例中已经进行了详细阐述,在此不做赘述。The specific steps performed by the above modules have been described in detail in some embodiments of the method, and are not repeated here.
通过上述技术方案,至少可以达到以下的技术效果:Through the above technical solutions, at least the following technical effects can be achieved:
通过获取用户基于例句文本朗读的待评价音频，并通过基于待评价音频生成的发音器官动作视频和例句文本对应的发音器官标准动作视频生成发音评价信息，可以更准确地对用户的发音进行评价，从而更直观地体现用户的发音是否准确。By acquiring the audio to be evaluated that the user reads aloud based on the example sentence text, and generating the pronunciation evaluation information from the articulator action video generated from that audio and the articulator standard action video corresponding to the example sentence text, the user's pronunciation can be evaluated more accurately, thereby reflecting more intuitively whether the user's pronunciation is accurate.
下面参考图4,其示出了适于用来实现本公开实施例的电子设备(例如用户设备或服务器)400的结构示意图。本公开实施例中的终端设备可以包括但不限于诸如移动电话、笔记本电脑、数字广播接收器、PDA(个人数字助理)、PAD(平板电脑)、PMP(便携式多媒体播放器)、车载终端(例如车载导航终端)等等的移动终端以及诸如数字TV、台式计算机等等的固定终端。图4示出的电子设备仅仅是一个示例,不应对本公开实施例的功能和使用范围带来任何限制。Referring next to FIG. 4 , it shows a schematic structural diagram of an electronic device (eg, user equipment or server) 400 suitable for implementing an embodiment of the present disclosure. Terminal devices in the embodiments of the present disclosure may include, but are not limited to, such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablets), PMPs (portable multimedia players), vehicle-mounted terminals (eg, mobile terminals such as in-vehicle navigation terminals), etc., and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in FIG. 4 is only an example, and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.
如图4所示，电子设备400可以包括处理装置（例如中央处理器、图形处理器等）401，其可以根据存储在只读存储器（ROM）402中的程序或者从存储装置408加载到随机访问存储器（RAM）403中的程序而执行各种适当的动作和处理。在RAM 403中，还存储有电子设备400操作所需的各种程序和数据。处理装置401、ROM 402以及RAM 403通过总线404彼此相连。输入/输出（I/O）接口405也连接至总线404。As shown in FIG. 4, the electronic device 400 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 401, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage device 408 into a random access memory (RAM) 403. Various programs and data required for the operation of the electronic device 400 are also stored in the RAM 403. The processing device 401, the ROM 402 and the RAM 403 are connected to each other through a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.
通常，以下装置可以连接至I/O接口405：包括例如触摸屏、触摸板、键盘、鼠标、摄像头、麦克风、加速度计、陀螺仪等的输入装置406；包括例如液晶显示器（LCD）、扬声器、振动器等的输出装置407；包括例如磁带、硬盘等的存储装置408；以及通信装置409。通信装置409可以允许电子设备400与其他设备进行无线或有线通信以交换数据。虽然图4示出了具有各种装置的电子设备400，但是应理解的是，并不要求实施或具备所有示出的装置。可以替代地实施或具备更多或更少的装置。Typically, the following devices may be connected to the I/O interface 405: input devices 406 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; output devices 407 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; storage devices 408 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 409. The communication device 409 may allow the electronic device 400 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 4 shows the electronic device 400 having various devices, it should be understood that not all of the illustrated devices are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
特别地,根据本公开的实施例,上文参考流程图描述的过程可以被实现为计算机 软件程序。例如,本公开的实施例包括一种计算机程序产品,其包括承载在非暂态计算机可读介质上的计算机程序,该计算机程序包含用于执行流程图所示的方法的程序代码。在这样的实施例中,该计算机程序可以通过通信装置409从网络上被下载和安装,或者从存储装置408被安装,或者从ROM 402被安装。在该计算机程序被处理装置401执行时,执行本公开实施例的方法中限定的上述功能。In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from the network via the communication device 409, or from the storage device 408, or from the ROM 402. When the computer program is executed by the processing apparatus 401, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are executed.
需要说明的是,本公开上述的计算机可读介质可以是计算机可读信号介质或者计算机可读存储介质或者是上述两者的任意组合。计算机可读存储介质例如可以是——但不限于——电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。计算机可读存储介质的更具体的例子可以包括但不限于:具有一个或多个导线的电连接、便携式计算机磁盘、硬盘、随机访问存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑磁盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。在本公开中,计算机可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。而在本公开中,计算机可读信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了计算机可读的程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。计算机可读信号介质还可以是计算机可读存储介质以外的任何计算机可读介质,该计算机可读信号介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。计算机可读介质上包含的程序代码可以用任何适当的介质传输,包括但不限于:电线、光缆、RF(射频)等等,或者上述的任意合适的组合。It should be noted that the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two. The computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above. More specific examples of computer readable storage media may include, but are not limited to, electrical connections with one or more wires, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable Programmable read only memory (EPROM or flash memory), fiber optics, portable compact disk read only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing. In this disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In the present disclosure, however, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with computer-readable program code embodied thereon. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device . Program code embodied on a computer readable medium may be transmitted using any suitable medium including, but not limited to, electrical wire, optical fiber cable, RF (radio frequency), etc., or any suitable combination of the foregoing.
在一些实施方式中，用户终端、服务器可以利用诸如HTTP（HyperText Transfer Protocol，超文本传输协议）之类的任何当前已知或未来研发的网络协议进行通信，并且可以与任意形式或介质的数字数据通信（例如，通信网络）互连。通信网络的示例包括局域网（“LAN”），广域网（“WAN”），网际网（例如，互联网）以及端对端网络（例如，ad hoc端对端网络），以及任何当前已知或未来研发的网络。In some embodiments, the user terminal and the server may communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication (e.g., a communication network) in any form or medium. Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and a peer-to-peer network (e.g., an ad hoc peer-to-peer network), as well as any currently known or future-developed network.
上述计算机可读介质可以是上述电子设备中所包含的;也可以是单独存在,而未装配入该电子设备中。The above-mentioned computer-readable medium may be included in the above-mentioned electronic device; or may exist alone without being assembled into the electronic device.
上述计算机可读介质承载有一个或者多个程序，当上述一个或者多个程序被该电子设备执行时，使得该电子设备：向用户展示例句文本；采集用户基于所述例句文本朗读的待评价音频；生成反映所述用户朗读所述例句文本时的发音器官的动作的发音器官动作视频；基于所述发音器官动作视频和所述例句文本对应的发音器官标准动作视频生成发音评价信息；向所述用户展示所述发音评价信息。The above computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: display example sentence text to a user; collect audio to be evaluated that the user reads aloud based on the example sentence text; generate an articulator action video reflecting the actions of the user's articulators when reading the example sentence text aloud; generate pronunciation evaluation information based on the articulator action video and the articulator standard action video corresponding to the example sentence text; and display the pronunciation evaluation information to the user.
可以以一种或多种程序设计语言或其组合来编写用于执行本公开的操作的计算机程序代码，上述程序设计语言包括但不限于面向对象的程序设计语言——诸如Java、Smalltalk、C++，还包括常规的过程式程序设计语言——诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算机上执行、部分地在用户计算机上执行、作为一个独立的软件包执行、部分在用户计算机上部分在远程计算机上执行、或者完全在远程计算机或服务器上执行。在涉及远程计算机的情形中，远程计算机可以通过任意种类的网络——包括局域网（LAN）或广域网（WAN）——连接到用户计算机，或者，可以连接到外部计算机（例如利用因特网服务提供商来通过因特网连接）。Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
附图中的流程图和框图，图示了按照本公开各种实施例的系统、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上，流程图或框图中的每个方框可以代表一个模块、程序段、或代码的一部分，该模块、程序段、或代码的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。也应当注意，在有些作为替换的实现中，方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如，两个接连地表示的方框实际上可以基本并行地执行，它们有时也可以按相反的顺序执行，这依所涉及的功能而定。也要注意的是，框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合，可以用执行规定的功能或操作的专用的基于硬件的系统来实现，或者可以用专用硬件与计算机指令的组合来实现。The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It is also noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
描述于本公开实施例中所涉及到的模块可以通过软件的方式实现，也可以通过硬件的方式来实现。其中，模块的名称在某种情况下并不构成对该模块本身的限定，例如，例句展示模块还可以被描述为“向用户展示例句文本的模块”。The modules involved in the embodiments of the present disclosure may be implemented in software or in hardware. The name of a module does not, in some cases, constitute a limitation of the module itself; for example, the example sentence display module may also be described as "a module for displaying example sentence text to a user".
本文中以上描述的功能可以至少部分地由一个或多个硬件逻辑部件来执行。例如,非限制性地,可以使用的示范类型的硬件逻辑部件包括:现场可编程门阵列(FPGA)、专用集成电路(ASIC)、专用标准产品(ASSP)、片上系统(SOC)、复杂可编程 逻辑设备(CPLD)等等。The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chips (SOCs), Complex Programmable Logical Devices (CPLDs) and more.
在本公开的上下文中,机器可读介质可以是有形的介质,其可以包含或存储以供指令执行系统、装置或设备使用或与指令执行系统、装置或设备结合地使用的程序。机器可读介质可以是机器可读信号介质或机器可读储存介质。机器可读介质可以包括但不限于电子的、磁性的、光学的、电磁的、红外的、或半导体系统、装置或设备,或者上述内容的任何合适组合。机器可读存储介质的更具体示例会包括基于一个或多个线的电气连接、便携式计算机盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦除可编程只读存储器(EPROM或快闪存储器)、光纤、便捷式紧凑盘只读存储器(CD-ROM)、光学储存设备、磁储存设备、或上述内容的任何合适组合。In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with the instruction execution system, apparatus or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, devices, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media would include one or more wire-based electrical connections, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), fiber optics, compact disk read only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
根据本公开的一个或多个实施例，示例1提供了一种发音评价方法，所述方法包括：向用户展示例句文本；采集用户基于所述例句文本朗读的待评价音频；生成反映所述用户朗读所述例句文本时的发音器官的动作的发音器官动作视频；基于所述发音器官动作视频和所述例句文本对应的发音器官标准动作视频生成发音评价信息；向所述用户展示所述发音评价信息。According to one or more embodiments of the present disclosure, Example 1 provides a pronunciation evaluation method, the method including: displaying example sentence text to a user; collecting audio to be evaluated that the user reads aloud based on the example sentence text; generating an articulator action video reflecting the actions of the user's articulators when reading the example sentence text aloud; generating pronunciation evaluation information based on the articulator action video and the articulator standard action video corresponding to the example sentence text; and displaying the pronunciation evaluation information to the user.
根据本公开的一个或多个实施例，示例2提供了示例1的方法，所述发音评价信息包括对所述用户的发音打分信息、发音动作建议信息、或所述发音器官动作视频与所述发音器官标准动作视频的对比视频中的至少一者。According to one or more embodiments of the present disclosure, Example 2 provides the method of Example 1, wherein the pronunciation evaluation information includes at least one of pronunciation scoring information for the user, pronunciation action suggestion information, or a comparison video between the articulator action video and the articulator standard action video.
根据本公开的一个或多个实施例，示例3提供了示例1的方法，所述向用户展示例句文本，包括：基于所述例句文本生成例句音频；将所述例句音频与所述发音器官标准动作视频合成为例句演示视频；向用户展示例句文本和所述例句演示视频。According to one or more embodiments of the present disclosure, Example 3 provides the method of Example 1, wherein displaying the example sentence text to the user includes: generating example sentence audio based on the example sentence text; synthesizing the example sentence audio and the articulator standard action video into an example sentence demonstration video; and displaying the example sentence text and the example sentence demonstration video to the user.
根据本公开的一个或多个实施例，示例4提供了示例2的方法，在所述发音评价信息包括所述发音打分信息和/或所述发音动作建议信息的情况下，所述基于所述发音器官动作视频和所述例句文本对应的发音器官标准动作视频生成发音评价信息，包括：通过对比所述发音器官动作视频和所述例句文本对应的发音器官标准动作视频，得到动作差异信息；根据所述动作差异信息生成发音打分信息，和/或，根据所述动作差异信息与预设的发音动作建议信息进行匹配，得到与所述动作差异信息相匹配的目标动作建议信息。According to one or more embodiments of the present disclosure, Example 4 provides the method of Example 2, wherein, in a case where the pronunciation evaluation information includes the pronunciation scoring information and/or the pronunciation action suggestion information, generating the pronunciation evaluation information based on the articulator action video and the articulator standard action video corresponding to the example sentence text includes: obtaining action difference information by comparing the articulator action video with the articulator standard action video corresponding to the example sentence text; and generating pronunciation scoring information according to the action difference information, and/or matching the action difference information against preset pronunciation action suggestion information to obtain target action suggestion information matching the action difference information.
根据本公开的一个或多个实施例,示例5提供了示例4的方法,所述动作差异信息为发音器官的特征点运动轨迹的差异信息。According to one or more embodiments of the present disclosure, Example 5 provides the method of Example 4, where the action difference information is difference information of the movement trajectories of the feature points of the vocal organs.
根据本公开的一个或多个实施例，示例6提供了示例2的方法，所述对比视频是通过以下的方式生成的：基于例句文本的单位文本内容，将所述发音器官动作视频和所述发音器官标准动作视频中表征同一单位文本内容的视频片段作为一组视频片段组；将各视频片段组中属于所述发音器官动作视频和所述发音器官标准动作视频的视频片段进行对齐；将对齐后的所述发音器官动作视频和所述发音器官标准动作视频拼接，得到所述对比视频。According to one or more embodiments of the present disclosure, Example 6 provides the method of Example 2, wherein the comparison video is generated by: based on the unit text content of the example sentence text, taking the video clips representing the same unit text content in the articulator action video and the articulator standard action video as a video clip group; aligning, within each video clip group, the video clips belonging to the articulator action video and to the articulator standard action video; and splicing the aligned articulator action video and articulator standard action video to obtain the comparison video.
根据本公开的一个或多个实施例，示例7提供了示例1的方法，所述生成反映所述用户朗读所述例句文本时的发音器官的动作的发音器官动作视频，包括：将所述待评价音频转换成待处理音频特征向量；将所述待处理音频特征向量输入视频生成模型，得到所述视频生成模型输出的与所述待评价音频对应的发音器官动作视频。According to one or more embodiments of the present disclosure, Example 7 provides the method of Example 1, wherein generating the articulator action video reflecting the actions of the user's articulators when reading the example sentence text aloud includes: converting the audio to be evaluated into an audio feature vector to be processed; and inputting the audio feature vector to be processed into a video generation model to obtain the articulator action video output by the video generation model and corresponding to the audio to be evaluated.
根据本公开的一个或多个实施例，示例8提供了示例7的方法，还包括：根据样本音频以及与所述样本音频对应的样本发音器官动作视频构建模型训练数据；根据所述模型训练数据训练得到所述视频生成模型。According to one or more embodiments of the present disclosure, Example 8 provides the method of Example 7, further including: constructing model training data according to sample audio and a sample articulator action video corresponding to the sample audio; and training the video generation model according to the model training data.
根据本公开的一个或多个实施例，示例9提供了示例8的方法，所述根据样本音频以及与所述样本音频对应的样本发音器官动作视频构建模型训练数据包括：将所述样本音频中的每一帧音频转换成样本音素后验概率向量，得到包括至少一个样本音素后验概率向量的样本音素后验概率向量序列；基于所述样本发音器官动作视频，提取与所述样本音素后验概率向量序列中每一所述样本音素后验概率向量对应的样本发音器官视频特征，得到样本发音器官视频特征序列；将所述样本音素后验概率向量序列和所述样本发音器官视频特征序列作为所述模型训练数据。According to one or more embodiments of the present disclosure, Example 9 provides the method of Example 8, wherein constructing the model training data according to the sample audio and the sample articulator action video corresponding to the sample audio includes: converting each frame of audio in the sample audio into a sample phoneme posterior probability vector to obtain a sample phoneme posterior probability vector sequence including at least one sample phoneme posterior probability vector; extracting, based on the sample articulator action video, a sample articulator video feature corresponding to each sample phoneme posterior probability vector in the sample phoneme posterior probability vector sequence to obtain a sample articulator video feature sequence; and using the sample phoneme posterior probability vector sequence and the sample articulator video feature sequence as the model training data.
根据本公开的一个或多个实施例，示例10提供了示例9的方法，所述样本发音器官视频特征为所述样本发音器官动作视频中的至少一帧视频图像的像素点特征信息或主成分特征信息中的至少一种。According to one or more embodiments of the present disclosure, Example 10 provides the method of Example 9, wherein the sample articulator video feature is at least one of pixel point feature information or principal component feature information of at least one frame of video image in the sample articulator action video.
根据本公开的一个或多个实施例，示例11提供了示例1的方法，所述发音器官标准动作视频是通过以下方式生成的：将所述例句文本分割为单位文本序列；将所述单位文本序列输入视频特征生成模型，得到视频特征序列；基于所述视频特征序列生成发音器官标准动作视频；其中，所述视频特征生成模型是通过如下方式训练得到的：将样本文本分割为样本单位文本序列；根据样本单位文本序列以及与所述样本单位文本序列对应的样本发音器官动作视频的样本视频特征序列构建模型训练数据；根据所述模型训练数据训练得到所述视频特征生成模型。According to one or more embodiments of the present disclosure, Example 11 provides the method of Example 1, wherein the articulator standard action video is generated by: dividing the example sentence text into a unit text sequence; inputting the unit text sequence into a video feature generation model to obtain a video feature sequence; and generating the articulator standard action video based on the video feature sequence; wherein the video feature generation model is trained as follows: dividing sample text into a sample unit text sequence; constructing model training data according to the sample unit text sequence and a sample video feature sequence of the sample articulator action video corresponding to the sample unit text sequence; and training the video feature generation model according to the model training data.
根据本公开的一个或多个实施例，示例12提供了示例1-11的方法，所述发音器官动作视频和所述发音器官标准动作视频为基于核磁共振MRI视频生成的发音器官动画视频，所述方法还包括：通过动画生成模型，逐帧对所述发音器官动作视频或所述发音器官标准动作视频进行渲染，得到发音器官动画视频。According to one or more embodiments of the present disclosure, Example 12 provides the method of any one of Examples 1-11, wherein the articulator action video and the articulator standard action video are articulator animation videos generated based on magnetic resonance imaging (MRI) video, and the method further includes: rendering the articulator action video or the articulator standard action video frame by frame through an animation generation model to obtain the articulator animation video.
根据本公开的一个或多个实施例，示例13提供了示例12的方法，所述动画生成模型的训练样本包括多张MRI样本图像和各MRI样本图像对应的动画发音器官图，并且所述方法还包括：确定各MRI样本图像中的发音器官的位置；在各MRI样本图像中的发音器官的位置，生成与所述发音器官的位置对应的动画发音器官，得到动画发音器官图。According to one or more embodiments of the present disclosure, Example 13 provides the method of Example 12, wherein the training samples of the animation generation model include a plurality of MRI sample images and an animated articulator image corresponding to each MRI sample image, and the method further includes: determining the position of the articulator in each MRI sample image; and generating, at the position of the articulator in each MRI sample image, an animated articulator corresponding to the position of the articulator to obtain the animated articulator image.
根据本公开的一个或多个实施例，示例14提供了一种发音评价装置，所述装置包括：例句展示模块，用于向用户展示例句文本；音频采集模块，用于采集用户基于所述例句文本朗读的待评价音频；视频生成模块，用于生成反映所述用户朗读所述例句文本时的发音器官的动作的发音器官动作视频；发音评价模块，用于基于所述发音器官动作视频和所述例句文本对应的发音器官标准动作视频生成发音评价信息；评价展示模块，用于向所述用户展示所述发音评价信息。According to one or more embodiments of the present disclosure, Example 14 provides a pronunciation evaluation apparatus, the apparatus including: an example sentence display module configured to display example sentence text to a user; an audio collection module configured to collect audio to be evaluated that the user reads aloud based on the example sentence text; a video generation module configured to generate an articulator action video reflecting the actions of the user's articulators when reading the example sentence text aloud; a pronunciation evaluation module configured to generate pronunciation evaluation information based on the articulator action video and the articulator standard action video corresponding to the example sentence text; and an evaluation display module configured to display the pronunciation evaluation information to the user.
根据本公开的一个或多个实施例，示例15提供了示例14的装置，所述发音评价信息包括对所述用户的发音打分信息、发音动作建议信息、或所述发音器官动作视频与所述发音器官标准动作视频的对比视频中的至少一者。According to one or more embodiments of the present disclosure, Example 15 provides the apparatus of Example 14, wherein the pronunciation evaluation information includes at least one of pronunciation scoring information for the user, pronunciation action suggestion information, or a comparison video between the articulator action video and the articulator standard action video.
根据本公开的一个或多个实施例，示例16提供了示例14的装置，所述例句展示模块，用于基于所述例句文本生成例句音频；将所述例句音频与所述发音器官标准动作视频合成为例句演示视频；向用户展示例句文本和所述例句演示视频。According to one or more embodiments of the present disclosure, Example 16 provides the apparatus of Example 14, wherein the example sentence display module is configured to generate example sentence audio based on the example sentence text; synthesize the example sentence audio and the articulator standard action video into an example sentence demonstration video; and display the example sentence text and the example sentence demonstration video to the user.
根据本公开的一个或多个实施例，示例17提供了示例15的装置，所述发音评价模块，用于通过对比所述发音器官动作视频和所述例句文本对应的发音器官标准动作视频，得到动作差异信息；根据所述动作差异信息生成发音打分信息，和/或，根据所述动作差异信息与预设的发音动作建议信息进行匹配，得到与所述动作差异信息相匹配的目标动作建议信息。According to one or more embodiments of the present disclosure, Example 17 provides the apparatus of Example 15, wherein the pronunciation evaluation module is configured to obtain action difference information by comparing the articulator action video with the articulator standard action video corresponding to the example sentence text; and to generate pronunciation scoring information according to the action difference information, and/or to match the action difference information against preset pronunciation action suggestion information to obtain target action suggestion information matching the action difference information.
根据本公开的一个或多个实施例，示例18提供了示例17的装置，所述动作差异信息为发音器官的特征点运动轨迹的差异信息。According to one or more embodiments of the present disclosure, Example 18 provides the apparatus of Example 17, wherein the action difference information is difference information of the movement trajectories of feature points of the articulators.
根据本公开的一个或多个实施例，示例19提供了示例15的装置，所述发音评价模块，还用于基于例句文本的单位文本内容，将所述发音器官动作视频和所述发音器官标准动作视频中表征同一单位文本内容的视频片段作为一组视频片段组；将各视频片段组中属于所述发音器官动作视频和所述发音器官标准动作视频的视频片段进行对齐；将对齐后的所述发音器官动作视频和所述发音器官标准动作视频拼接，得到所述对比视频。According to one or more embodiments of the present disclosure, Example 19 provides the apparatus of Example 15, wherein the pronunciation evaluation module is further configured to, based on the unit text content of the example sentence text, take the video clips representing the same unit text content in the articulator action video and the articulator standard action video as a video clip group; align, within each video clip group, the video clips belonging to the articulator action video and to the articulator standard action video; and splice the aligned articulator action video and articulator standard action video to obtain the comparison video.
根据本公开的一个或多个实施例，示例20提供了示例14的装置，所述视频生成模块，用于将所述待评价音频转换成待处理音频特征向量；将所述待处理音频特征向量输入视频生成模型，得到所述视频生成模型输出的与所述待评价音频对应的发音器官动作视频。According to one or more embodiments of the present disclosure, Example 20 provides the apparatus of Example 14, wherein the video generation module is configured to convert the audio to be evaluated into an audio feature vector to be processed, and input the audio feature vector to be processed into a video generation model to obtain the articulator action video output by the video generation model and corresponding to the audio to be evaluated.
根据本公开的一个或多个实施例，示例21提供了示例20的装置，发音评价装置还包括视频生成模型训练模块，被配置为根据样本音频以及与所述样本音频对应的样本发音器官动作视频构建模型训练数据；根据所述模型训练数据训练得到所述视频生成模型。According to one or more embodiments of the present disclosure, Example 21 provides the apparatus of Example 20, wherein the pronunciation evaluation apparatus further includes a video generation model training module configured to construct model training data according to sample audio and a sample articulator action video corresponding to the sample audio, and to train the video generation model according to the model training data.
根据本公开的一个或多个实施例，示例22提供了示例21的装置，视频生成模型训练模块进一步被配置为将所述样本音频中的每一帧音频转换成样本音素后验概率向量，得到包括至少一个样本音素后验概率向量的样本音素后验概率向量序列；基于所述样本发音器官动作视频，提取与所述样本音素后验概率向量序列中每一所述样本音素后验概率向量对应的样本发音器官视频特征，得到样本发音器官视频特征序列；将所述样本音素后验概率向量序列和所述样本发音器官视频特征序列作为所述模型训练数据。According to one or more embodiments of the present disclosure, Example 22 provides the apparatus of Example 21, wherein the video generation model training module is further configured to convert each frame of audio in the sample audio into a sample phoneme posterior probability vector to obtain a sample phoneme posterior probability vector sequence including at least one sample phoneme posterior probability vector; extract, based on the sample articulator action video, a sample articulator video feature corresponding to each sample phoneme posterior probability vector in the sample phoneme posterior probability vector sequence to obtain a sample articulator video feature sequence; and use the sample phoneme posterior probability vector sequence and the sample articulator video feature sequence as the model training data.
根据本公开的一个或多个实施例，示例23提供了示例22的装置，所述样本发音器官视频特征为所述样本发音器官动作视频中的至少一帧视频图像的像素点特征信息或主成分特征信息中的至少一种。According to one or more embodiments of the present disclosure, Example 23 provides the apparatus of Example 22, wherein the sample articulator video feature is at least one of pixel point feature information or principal component feature information of at least one frame of video image in the sample articulator action video.
根据本公开的一个或多个实施例，示例24提供了示例14的装置，所述视频生成模块，还用于将所述例句文本分割为单位文本序列；将所述单位文本序列输入视频特征生成模型，得到视频特征序列；基于所述视频特征序列生成发音器官标准动作视频；其中，所述视频特征生成模型是通过如下方式训练得到的：将样本文本分割为样本单位文本序列；根据样本单位文本序列以及与所述样本单位文本序列对应的样本发音器官动作视频的样本视频特征序列构建模型训练数据；根据所述模型训练数据训练得到所述视频特征生成模型。According to one or more embodiments of the present disclosure, Example 24 provides the apparatus of Example 14, wherein the video generation module is further configured to divide the example sentence text into a unit text sequence; input the unit text sequence into a video feature generation model to obtain a video feature sequence; and generate the articulator standard action video based on the video feature sequence; wherein the video feature generation model is trained as follows: dividing sample text into a sample unit text sequence; constructing model training data according to the sample unit text sequence and a sample video feature sequence of the sample articulator action video corresponding to the sample unit text sequence; and training the video feature generation model according to the model training data.
根据本公开的一个或多个实施例，示例25提供了示例14-24的装置，所述发音器官动作视频和所述发音器官标准动作视频为基于核磁共振MRI视频生成的发音器官动画视频，所述装置还包括视频渲染模块，用于通过动画生成模型，逐帧对所述发音器官动作视频或所述发音器官标准动作视频进行渲染，得到发音器官动画视频。According to one or more embodiments of the present disclosure, Example 25 provides the apparatus of any one of Examples 14-24, wherein the articulator action video and the articulator standard action video are articulator animation videos generated based on magnetic resonance imaging (MRI) video, and the apparatus further includes a video rendering module configured to render the articulator action video or the articulator standard action video frame by frame through an animation generation model to obtain the articulator animation video.
根据本公开的一个或多个实施例，示例26提供了示例25的装置，所述动画生成模型的训练样本包括多张MRI样本图像和各MRI样本图像对应的动画发音器官图，所述动画生成模型的训练样本是通过以下方式得到的：确定各MRI样本图像中的发音器官的位置；在各MRI样本图像中的发音器官的位置，生成与所述发音器官的位置对应的动画发音器官，得到动画发音器官图。According to one or more embodiments of the present disclosure, Example 26 provides the apparatus of Example 25, wherein the training samples of the animation generation model include a plurality of MRI sample images and an animated articulator image corresponding to each MRI sample image, and the training samples of the animation generation model are obtained by: determining the position of the articulator in each MRI sample image; and generating, at the position of the articulator in each MRI sample image, an animated articulator corresponding to the position of the articulator to obtain the animated articulator image.
以上描述仅为本公开的较佳实施例以及对所运用技术原理的说明。本领域技术人员应当理解,本公开中所涉及的公开范围,并不限于上述技术特征的特定组合而成的技术方案,同时也应涵盖在不脱离上述公开构思的情况下,由上述技术特征或其等同特征进行任意组合而形成的其它技术方案。例如上述特征与本公开中公开的(但不限于)具有类似功能的技术特征进行互相替换而形成的技术方案。The above description is merely a preferred embodiment of the present disclosure and an illustration of the technical principles employed. Those skilled in the art should understand that the scope of the disclosure involved in the present disclosure is not limited to the technical solutions formed by the specific combination of the above-mentioned technical features, and should also cover, without departing from the above-mentioned disclosed concept, the technical solutions formed by the above-mentioned technical features or Other technical solutions formed by any combination of its equivalent features. For example, a technical solution is formed by replacing the above features with the technical features disclosed in the present disclosure (but not limited to) with similar functions.
此外,虽然采用特定次序描绘了各操作,但是这不应当理解为要求这些操作以所示出的特定次序或以顺序次序执行来执行。在一定环境下,多任务和并行处理可能是有利的。同样地,虽然在上面论述中包含了若干具体实现细节,但是这些不应当被解释为对本公开的范围的限制。在单独的实施例的上下文中描述的某些特征还可以组合地实现在单个实施例中。相反地,在单个实施例的上下文中描述的各种特征也可以单独地或以任何合适的子组合的方式实现在多个实施例中。Additionally, although operations are depicted in a particular order, this should not be construed as requiring that the operations be performed in the particular order shown or in a sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although the above discussion contains several implementation-specific details, these should not be construed as limitations on the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
尽管已经采用特定于结构特征和/或方法逻辑动作的语言描述了本主题,但是应当理解所附权利要求书中所限定的主题未必局限于上面描述的特定特征或动作。相反,上面所描述的特定特征和动作仅仅是实现权利要求书的示例形式。关于上述实施例中的装置,其中各个模块执行操作的具体方式已经在有关该方法的实施例中进行了详细描述,此处将不做详细阐述说明。Although the subject matter has been described in language specific to structural features and/or logical acts of method, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims. Regarding the apparatus in the above-mentioned embodiment, the specific manner in which each module performs operations has been described in detail in the embodiment of the method, and will not be described in detail here.

Claims (18)

  1. 一种发音评价方法,包括:A pronunciation evaluation method, including:
    向用户展示例句文本;show the user the example text;
    采集用户基于所述例句文本朗读的待评价音频;Collect the audio to be evaluated that the user reads aloud based on the example text;
    生成反映所述用户朗读所述例句文本时的发音器官的动作的发音器官动作视频;Generate a pronunciation organ action video that reflects the action of the pronunciation organ when the user reads the example sentence text;
    基于所述发音器官动作视频和所述例句文本对应的发音器官标准动作视频生成发音评价信息;Generate pronunciation evaluation information based on the pronunciation organ action video and the pronunciation organ standard action video corresponding to the example text;
    向所述用户展示所述发音评价信息。The pronunciation evaluation information is displayed to the user.
  2. 根据权利要求1所述的方法，其中，所述发音评价信息包括对所述用户的发音打分信息、发音动作建议信息、或所述发音器官动作视频与所述发音器官标准动作视频的对比视频中的至少一者。The method according to claim 1, wherein the pronunciation evaluation information includes at least one of pronunciation scoring information for the user, pronunciation action suggestion information, or a comparison video between the articulator action video and the articulator standard action video.
  3. 根据权利要求1所述的方法,其中,所述向用户展示例句文本,包括:The method of claim 1 , wherein the presenting the example text to the user comprises:
    基于所述例句文本生成例句音频;generating example sentence audio based on the example sentence text;
    将所述例句音频与所述发音器官标准动作视频合成为例句演示视频；Synthesizing the example sentence audio and the articulator standard action video into an example sentence demonstration video;
    向用户展示例句文本和所述例句演示视频。The user is presented with the example text and a demonstration video of the example.
  4. 根据权利要求2所述的方法，其中，在所述发音评价信息包括所述发音打分信息和/或所述发音动作建议信息的情况下，所述基于所述发音器官动作视频和所述例句文本对应的发音器官标准动作视频生成发音评价信息，包括：The method according to claim 2, wherein, in a case where the pronunciation evaluation information includes the pronunciation scoring information and/or the pronunciation action suggestion information, generating the pronunciation evaluation information based on the articulator action video and the articulator standard action video corresponding to the example sentence text includes:
    通过对比所述发音器官动作视频和所述例句文本对应的发音器官标准动作视频,得到动作差异信息;Action difference information is obtained by comparing the pronunciation organ action video and the pronunciation organ standard action video corresponding to the example text;
    根据所述动作差异信息生成发音打分信息,和/或,根据所述动作差异信息与预设的发音动作建议信息进行匹配,得到与所述动作差异信息相匹配的目标动作建议信息。Pronunciation scoring information is generated according to the motion difference information, and/or, target motion suggestion information matching the motion difference information is obtained by matching with preset pronunciation motion suggestion information according to the motion difference information.
  5. 根据权利要求4所述的方法,其中,所述动作差异信息为发音器官的特征点运动轨迹的差异信息。The method according to claim 4, wherein the action difference information is the difference information of the movement trajectories of the feature points of the vocal organs.
  6. 根据权利要求2所述的方法,其中,所述对比视频是通过以下的方式生成的:The method of claim 2, wherein the comparison video is generated in the following manner:
    基于例句文本的单位文本内容,将所述发音器官动作视频和所述发音器官标准动作视频中表征同一单位文本内容的视频片段作为一组视频片段组;Based on the unit text content of the example sentence text, the video clips representing the same unit text content in the articulator action video and the articulator standard action video are used as a video clip group;
    将各视频片段组中属于所述发音器官动作视频和所述发音器官标准动作视频的视频片段进行对齐;Aligning the video clips belonging to the articulator action video and the articulator standard action video in each video clip group;
    将对齐后的所述发音器官动作视频和所述发音器官标准动作视频拼接，得到所述对比视频。Splicing the aligned articulator action video and the articulator standard action video to obtain the comparison video.
  7. 根据权利要求1所述的方法,其中,所述生成反映所述用户朗读所述例句文本时的发音器官的动作的发音器官动作视频,包括:The method according to claim 1, wherein said generating a pronunciation organ action video reflecting the action of the pronunciation organ when the user reads the example text aloud, comprises:
    将所述待评价音频转换成待处理音频特征向量;Converting the to-be-evaluated audio into a to-be-processed audio feature vector;
    将所述待处理音频特征向量输入视频生成模型,得到所述视频生成模型输出的与所述待评价音频对应的发音器官动作视频。The to-be-processed audio feature vector is input into a video generation model to obtain a voice organ action video corresponding to the to-be-evaluated audio output by the video generation model.
  8. 根据权利要求7所述的方法,还包括:The method of claim 7, further comprising:
    根据样本音频以及与所述样本音频对应的样本发音器官动作视频构建模型训练数据;Build model training data according to the sample audio and the sample articulator action video corresponding to the sample audio;
    根据所述模型训练数据训练得到所述视频生成模型。The video generation model is obtained by training according to the model training data.
  9. 根据权利要求8所述的方法,其中,所述根据样本音频以及与所述样本音频对应的样本发音器官动作视频构建模型训练数据包括:The method according to claim 8, wherein the building model training data according to the sample audio and the sample articulator action video corresponding to the sample audio comprises:
    将所述样本音频中的每一帧音频转换成样本音素后验概率向量,得到包括至少一个样本音素后验概率向量的样本音素后验概率向量序列;Converting each frame of audio in the sample audio into a sample phoneme posterior probability vector to obtain a sample phoneme posterior probability vector sequence including at least one sample phoneme posterior probability vector;
    基于所述样本发音器官动作视频,提取与所述样本音素后验概率向量序列中每一所述样本音素后验概率向量对应的样本发音器官视频特征,得到样本发音器官视频特征序列;Based on the sample articulator action video, extract a sample articulator video feature corresponding to each of the sample phoneme posterior probability vectors in the sample phoneme posterior probability vector sequence, to obtain a sample articulator video feature sequence;
    将所述样本音素后验概率向量序列和所述样本发音器官视频特征序列作为所述模型训练数据。The sample phoneme posterior probability vector sequence and the sample vocal organ video feature sequence are used as the model training data.
  10. 根据权利要求9所述的方法,其中,所述样本发音器官视频特征为所述样本 发音器官动作视频中的至少一帧视频图像的像素点特征信息或主成分特征信息中的至少一种。The method according to claim 9, wherein the sample vocal organ video feature is at least one of pixel point feature information or principal component feature information of at least one frame of video image in the sample vocal organ action video.
  11. 根据权利要求1所述的方法,其中,所述发音器官标准动作视频是通过以下方式生成的:The method according to claim 1, wherein the standard motion video of the vocal organs is generated in the following manner:
    将所述例句文本分割为单位文本序列;dividing the example sentence text into unit text sequences;
    将所述单位文本序列输入视频特征生成模型,得到视频特征序列;Inputting the unit text sequence into a video feature generation model to obtain a video feature sequence;
    基于所述视频特征序列生成发音器官标准动作视频;Generate a standard motion video of vocal organs based on the video feature sequence;
    其中,所述视频特征生成模型是通过如下方式训练得到的:Wherein, the video feature generation model is obtained by training in the following ways:
    将样本文本分割为样本单位文本序列;Divide the sample text into sample unit text sequences;
    根据样本单位文本序列以及与所述样本单位文本序列对应的样本发音器官动作视频的样本视频特征序列构建模型训练数据;Build model training data according to the sample unit text sequence and the sample video feature sequence of the sample vocal organ action video corresponding to the sample unit text sequence;
    根据所述模型训练数据训练得到所述视频特征生成模型。The video feature generation model is obtained by training according to the model training data.
  12. 根据权利要求1-11任一项所述的方法,其中,所述发音器官动作视频和所述发音器官标准动作视频为基于核磁共振MRI视频生成的发音器官动画视频,所述方法还包括:The method according to any one of claims 1-11, wherein the voice organ motion video and the voice organ standard motion video are voice organ animation videos generated based on nuclear magnetic resonance MRI video, and the method further comprises:
    通过动画生成模型,逐帧对所述发音器官动作视频或所述发音器官标准动作视频进行渲染,得到发音器官动画视频。The animation generation model is used to render the articulator action video or the articulator standard action video frame by frame to obtain the articulator animation video.
  13. 根据权利要求12所述的方法,其中,所述动画生成模型的训练样本包括多张MRI样本图像和各MRI样本图像对应的动画发音器官图,并且所述方法还包括:The method according to claim 12 , wherein the training samples of the animation generation model include a plurality of MRI sample images and an animated articulation organ map corresponding to each MRI sample image, and the method further comprises:
    确定各MRI样本图像中的发音器官的位置;determining the position of the vocal organs in each MRI sample image;
    在各MRI样本图像中的发音器官的位置,生成与所述发音器官的位置对应的动画发音器官,得到动画发音器官图。At the position of the articulator in each MRI sample image, an animated articulator corresponding to the position of the articulator is generated, and an animated articulation diagram is obtained.
  14. 一种发音评价装置,包括:A pronunciation evaluation device, comprising:
    例句展示模块,用于向用户展示例句文本;The example sentence display module is used to display the example sentence text to the user;
    音频采集模块,用于采集用户基于所述例句文本朗读的待评价音频;An audio collection module for collecting the audio to be evaluated based on the example text read aloud by the user;
    视频生成模块,用于生成反映所述用户朗读所述例句文本时的发音器官的动作的 发音器官动作视频;Video generation module, for generating the pronunciation organ action video that reflects the action of the pronunciation organ when the user reads the example sentence text;
    发音评价模块,用于基于所述发音器官动作视频和所述例句文本对应的发音器官标准动作视频生成发音评价信息;A pronunciation evaluation module for generating pronunciation evaluation information based on the pronunciation organ action video and the pronunciation organ standard action video corresponding to the example text;
    评价展示模块,用于向所述用户展示所述发音评价信息。An evaluation display module, configured to display the pronunciation evaluation information to the user.
  15. 一种非瞬时性计算机可读介质,其上存储有计算机程序,其中,该程序被处理装置执行时实现权利要求1-13中任一项所述方法的步骤。A non-transitory computer-readable medium having stored thereon a computer program, wherein the program, when executed by a processing device, implements the steps of the method of any one of claims 1-13.
  16. 一种电子设备,其中,包括:An electronic device comprising:
    存储装置,其上存储有计算机程序;a storage device on which a computer program is stored;
    处理装置,用于执行所述存储装置中的所述计算机程序,以实现权利要求1-13中任一项所述方法的步骤。A processing device, configured to execute the computer program in the storage device, to implement the steps of the method of any one of claims 1-13.
  17. 一种计算机程序,包括:A computer program comprising:
    指令,所述指令当由处理器执行时使所述处理器执行根据权利要求1-13中任一项所述方法的步骤。Instructions which, when executed by a processor, cause the processor to perform the steps of the method according to any of claims 1-13.
  18. 一种计算机程序产品,包括指令,所述指令当由处理器执行时使所述处理器执行根据权利要求1-13中任一项所述方法的步骤。A computer program product comprising instructions which, when executed by a processor, cause the processor to perform the steps of the method according to any of claims 1-13.
PCT/CN2022/080357 2021-03-19 2022-03-11 Pronunciation assessment method and apparatus, storage medium, and electronic device WO2022194044A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110298227.9A CN113077819A (en) 2021-03-19 2021-03-19 Pronunciation evaluation method and device, storage medium and electronic equipment
CN202110298227.9 2021-03-19

Publications (1)

Publication Number Publication Date
WO2022194044A1 true WO2022194044A1 (en) 2022-09-22

Family

ID=76612827

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/080357 WO2022194044A1 (en) 2021-03-19 2022-03-11 Pronunciation assessment method and apparatus, storage medium, and electronic device

Country Status (2)

Country Link
CN (1) CN113077819A (en)
WO (1) WO2022194044A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116705070A (en) * 2023-08-02 2023-09-05 南京优道言语康复研究院 Method and system for correcting speech pronunciation and nasal sound after cleft lip and palate operation

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113077819A (en) * 2021-03-19 2021-07-06 北京有竹居网络技术有限公司 Pronunciation evaluation method and device, storage medium and electronic equipment

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090305203A1 (en) * 2005-09-29 2009-12-10 Machi Okumura Pronunciation diagnosis device, pronunciation diagnosis method, recording medium, and pronunciation diagnosis program
CN103218841A (en) * 2013-04-26 2013-07-24 中国科学技术大学 Three-dimensional vocal organ animation method combining physiological model and data driving model
CN104505089A (en) * 2014-12-17 2015-04-08 福建网龙计算机网络信息技术有限公司 Method and equipment for oral error correction
CN107424450A (en) * 2017-08-07 2017-12-01 英华达(南京)科技有限公司 Pronunciation correction system and method
US20180137778A1 (en) * 2016-08-17 2018-05-17 Ken-ichi KAINUMA Language learning system, language learning support server, and computer program product
CN110880315A (en) * 2019-10-17 2020-03-13 深圳市声希科技有限公司 Personalized voice and video generation system based on phoneme posterior probability
CN111445925A (en) * 2020-03-31 2020-07-24 北京字节跳动网络技术有限公司 Method and apparatus for generating difference information
CN111933110A (en) * 2020-08-12 2020-11-13 北京字节跳动网络技术有限公司 Video generation method, generation model training method, device, medium and equipment
CN111951828A (en) * 2019-05-16 2020-11-17 上海流利说信息技术有限公司 Pronunciation evaluation method, device, system, medium and computing equipment
CN111968676A (en) * 2020-08-18 2020-11-20 北京字节跳动网络技术有限公司 Pronunciation correction method and device, electronic equipment and storage medium
CN113077819A (en) * 2021-03-19 2021-07-06 北京有竹居网络技术有限公司 Pronunciation evaluation method and device, storage medium and electronic equipment

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20010107035A (en) * 2000-05-24 2001-12-07 서주철 An internet english study service method for voice recognition and voice synthesis
CN108537702A (en) * 2018-04-09 2018-09-14 深圳市鹰硕技术有限公司 Foreign language teaching evaluation information generation method and device
CN108962216B (en) * 2018-06-12 2021-02-02 北京市商汤科技开发有限公司 Method, device, equipment and storage medium for processing speaking video
CN108922563B (en) * 2018-06-17 2019-09-24 海南大学 Based on the visual verbal learning antidote of deviation organ morphology behavior
CN109697976B (en) * 2018-12-14 2021-05-25 北京葡萄智学科技有限公司 Pronunciation recognition method and device
CN111723606A (en) * 2019-03-19 2020-09-29 北京搜狗科技发展有限公司 Data processing method and device and data processing device
CN110347867B (en) * 2019-07-16 2022-04-19 北京百度网讯科技有限公司 Method and device for generating lip motion video
CN111429885B (en) * 2020-03-02 2022-05-13 北京理工大学 Method for mapping audio clip to human face-mouth type key point
CN111741326B (en) * 2020-06-30 2023-08-18 腾讯科技(深圳)有限公司 Video synthesis method, device, equipment and storage medium
CN111833859B (en) * 2020-07-22 2024-02-13 科大讯飞股份有限公司 Pronunciation error detection method and device, electronic equipment and storage medium

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090305203A1 (en) * 2005-09-29 2009-12-10 Machi Okumura Pronunciation diagnosis device, pronunciation diagnosis method, recording medium, and pronunciation diagnosis program
CN103218841A (en) * 2013-04-26 2013-07-24 中国科学技术大学 Three-dimensional vocal organ animation method combining physiological model and data driving model
CN104505089A (en) * 2014-12-17 2015-04-08 福建网龙计算机网络信息技术有限公司 Method and equipment for oral error correction
US20180137778A1 (en) * 2016-08-17 2018-05-17 Ken-ichi KAINUMA Language learning system, language learning support server, and computer program product
CN107424450A (en) * 2017-08-07 2017-12-01 英华达(南京)科技有限公司 Pronunciation correction system and method
CN111951828A (en) * 2019-05-16 2020-11-17 上海流利说信息技术有限公司 Pronunciation evaluation method, device, system, medium and computing equipment
CN110880315A (en) * 2019-10-17 2020-03-13 深圳市声希科技有限公司 Personalized voice and video generation system based on phoneme posterior probability
CN111445925A (en) * 2020-03-31 2020-07-24 北京字节跳动网络技术有限公司 Method and apparatus for generating difference information
CN111933110A (en) * 2020-08-12 2020-11-13 北京字节跳动网络技术有限公司 Video generation method, generation model training method, device, medium and equipment
CN111968676A (en) * 2020-08-18 2020-11-20 北京字节跳动网络技术有限公司 Pronunciation correction method and device, electronic equipment and storage medium
CN113077819A (en) * 2021-03-19 2021-07-06 北京有竹居网络技术有限公司 Pronunciation evaluation method and device, storage medium and electronic equipment

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116705070A (en) * 2023-08-02 2023-09-05 南京优道言语康复研究院 Method and system for correcting speech pronunciation and nasal sound after cleft lip and palate operation
CN116705070B (en) * 2023-08-02 2023-10-17 南京优道言语康复研究院 Method and system for correcting speech pronunciation and nasal sound after cleft lip and palate operation

Also Published As

Publication number Publication date
CN113077819A (en) 2021-07-06

Similar Documents

Publication Title
WO2022194044A1 (en) Pronunciation assessment method and apparatus, storage medium, and electronic device
JP6206960B2 (en) Pronunciation operation visualization device and pronunciation learning device
US11514634B2 (en) Personalized speech-to-video with three-dimensional (3D) skeleton regularization and expressive body poses
CN113256821B (en) Three-dimensional virtual image lip shape generation method and device and electronic equipment
CN113077537B (en) Video generation method, storage medium and device
US11847726B2 (en) Method for outputting blend shape value, storage medium, and electronic device
US11968433B2 (en) Systems and methods for generating synthetic videos based on audio contents
US20230082830A1 (en) Method and apparatus for driving digital human, and electronic device
CN112785670B (en) Image synthesis method, device, equipment and storage medium
Wang et al. Computer-assisted audiovisual language learning
CN113223123A (en) Image processing method and image processing apparatus
CN113223555A (en) Video generation method and device, storage medium and electronic equipment
CN112383721B (en) Method, apparatus, device and medium for generating video
Liu et al. An interactive speech training system with virtual reality articulation for Mandarin-speaking hearing impaired children
CN112381926A (en) Method and apparatus for generating video
CN111415662A (en) Method, apparatus, device and medium for generating video
WO2023035969A1 (en) Speech and image synchronization measurement method and apparatus, and model training method and apparatus
KR20140087956A (en) Apparatus and method for learning phonics by using native speaker's pronunciation data and word and sentence and image data
CN113079327A (en) Video generation method and device, storage medium and electronic equipment
CN114428879A (en) Multimode English teaching system based on multi-scene interaction
CN113035235A (en) Pronunciation evaluation method and apparatus, storage medium, and electronic device
CN111445925A (en) Method and apparatus for generating difference information
Fabre et al. Automatic animation of an articulatory tongue model from ultrasound images using Gaussian mixture regression.
KR20210131698A (en) Method and apparatus for teaching foreign language pronunciation using articulator image
CN112185186A (en) Pronunciation correction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22770399

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 22770399

Country of ref document: EP

Kind code of ref document: A1