WO2022194044A1 - Pronunciation assessment method and apparatus, storage medium, and electronic device - Google Patents

Pronunciation assessment method and apparatus, storage medium, and electronic device

Info

Publication number
WO2022194044A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
sample
organ
pronunciation
action
Prior art date
Application number
PCT/CN2022/080357
Other languages
French (fr)
Chinese (zh)
Inventor
顾宇
马泽君
Original Assignee
北京有竹居网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司 filed Critical 北京有竹居网络技术有限公司
Publication of WO2022194044A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Definitions

  • the present disclosure relates to the field of education, and in particular, to a pronunciation evaluation method and device, a storage medium and an electronic device.
  • the present disclosure provides a pronunciation evaluation method, including: displaying example text to a user; collecting audio to be evaluated that the user reads aloud based on the example text; generating an articulator action video based on the audio to be evaluated; generating pronunciation evaluation information based on the articulator action video and the standard articulator action video corresponding to the example text; and displaying the pronunciation evaluation information to the user.
  • the present disclosure provides a pronunciation evaluation device, comprising: an example sentence display module, configured to display example text to a user; an audio collection module, configured to collect the audio to be evaluated that the user reads aloud based on the example text; a video generation module, configured to generate an articulator action video based on the audio to be evaluated; a pronunciation evaluation module, configured to generate pronunciation evaluation information based on the articulator action video and the standard articulator action video corresponding to the example text; and an evaluation display module, configured to display the pronunciation evaluation information to the user.
  • the present disclosure provides a non-transitory computer-readable medium on which a computer program is stored, and when the program is executed by a processing apparatus, implements the steps of the method described in the first aspect of the present disclosure.
  • the present disclosure provides an electronic device, including a storage device and a processing device, where a computer program is stored on the storage device, and the processing device is configured to execute the computer program in the storage device so as to implement the steps of the method described in the first aspect of the present disclosure.
  • the present disclosure provides a computer program comprising instructions that, when executed by a processor, cause the processor to perform the steps of the method of the first aspect of the present disclosure.
  • the present disclosure provides a computer program product comprising instructions that, when executed by a processor, cause the processor to perform the steps of the method of the first aspect of the present disclosure.
  • FIG. 1 is a flow chart of a pronunciation evaluation method according to an exemplary disclosed embodiment.
  • FIG. 2 is a flowchart of another pronunciation evaluation method according to an exemplary disclosed embodiment of the present disclosure.
  • FIG. 3 is a block diagram of an apparatus for evaluating pronunciation according to an exemplary disclosed embodiment of the present disclosure.
  • FIG. 4 is a block diagram of an electronic device according to an exemplary disclosed embodiment of the present disclosure.
  • the term “including” and variations thereof are open-ended inclusions, i.e., “including but not limited to”.
  • the term “based on” is “based at least in part on.”
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one additional embodiment”; the term “some embodiments” means “at least some embodiments”. Relevant definitions of other terms will be given in the description below.
  • Fig. 1 is a flowchart of a pronunciation evaluation method according to an exemplary disclosed embodiment. As shown in Fig. 1 , the method includes steps S11-S15.
  • the example sentence text can be a text of any length, such as a phrase, a sentence, a paragraph, an article, etc.
  • the example sentence text can also be a clause obtained after a longer text is segmented into clauses.
  • the example text can be displayed to the user in the form of text, so that the user can perform a pronunciation test. If the user wants to learn pronunciation, the example text can be displayed to the user in the form of audio, so that the user can follow along.
  • the present disclosure is not limited in this respect; the example text may also be presented to the user in the form of text and audio together.
  • the example text can be displayed in the form of text through the display device of the user terminal, and the example text can also be presented in the form of speech through the playback device of the user terminal, wherein the speech corresponding to the example text can be stored in advance, or can be generated when needed by converting the text to speech directly.
  • the user terminal may include any device with a display function, such as a mobile phone, a computer, a learning machine, and a wearable device.
  • the example sentence audio is generated based on the example sentence text
  • the audio and the standard action video of the vocal organ are synthesized into an example sentence demonstration video
  • the example sentence text and the example sentence demonstration video are displayed to the user.
  • the standard action video of the pronunciation organ is generated based on the example sentence text, and the video features can be generated by the pre-trained video feature generation model.
  • the example sentence text is divided into unit text sequences, and the unit text sequences are input into a video feature generation model to obtain a video feature sequence, and based on the video feature sequence, a standard action video of a vocal organ is generated.
  • the unit text sequence is a sequence obtained by dividing the example text into small units for generating videos.
  • the unit text can be phonemes, words, single characters, etc. After the example text is segmented, the units are more fine-grained, so that the model can more efficiently generate accurate video feature sequences based on the unit text. For example, when the example text is "How are you", the example text can be divided into the unit text sequence "how", "are" and "you" by using words as the division unit, or the example text can be split into a phoneme-level unit text sequence by using phonemes as the division unit.
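  • The following is a minimal sketch of such unit-text segmentation at word and phoneme granularity. The tiny phoneme lexicon is a hypothetical stand-in; a real system would use a full pronouncing dictionary or a grapheme-to-phoneme model.

```python
def split_into_words(example_text: str) -> list:
    """Word-level unit text sequence, e.g. "How are you" -> ["how", "are", "you"]."""
    return example_text.lower().replace("?", "").replace(",", "").split()

# Hypothetical pronouncing lexicon used only for illustration.
LEXICON = {"how": ["HH", "AW"], "are": ["AA", "R"], "you": ["Y", "UW"]}

def split_into_phonemes(example_text: str) -> list:
    """Phoneme-level unit text sequence built from the word-level sequence."""
    phonemes = []
    for word in split_into_words(example_text):
        phonemes.extend(LEXICON.get(word, ["<unk>"]))
    return phonemes

print(split_into_words("How are you"))     # ['how', 'are', 'you']
print(split_into_phonemes("How are you"))  # ['HH', 'AW', 'AA', 'R', 'Y', 'UW']
```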
  • the video feature generation model is trained in the following ways:
  • the sample vocal organ action video is a demonstration video produced or recorded based on the sample text.
  • the demonstration video can be an animation demonstration video of the oral cavity made by any animation production and rendering software, or it can be a video of the head of a person reading the sample text aloud, captured by an MRI machine.
  • the feature information is principal component information
  • the principal component information of each video frame is obtained by performing principal component analysis on the sample articulator action video frame by frame, and the principal component information of each video frame is arranged in the order of the video frames to obtain the sample video feature sequence.
  • a restored image can be obtained from each item of principal component information, and the restored images can be arranged and synthesized according to the order of the sample video feature sequence to obtain a restored demonstration video.
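  • The following is a minimal sketch of building such a frame-by-frame principal component feature sequence and restoring a demonstration video from it; the frame shape and the number of components are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

def video_to_feature_sequence(frames: np.ndarray, n_components: int = 32):
    """frames: (num_frames, height, width) sample articulator action video."""
    num_frames, h, w = frames.shape
    flat = frames.reshape(num_frames, h * w).astype(np.float32)
    pca = PCA(n_components=n_components)
    feature_sequence = pca.fit_transform(flat)   # one principal component vector per frame
    return pca, feature_sequence

def feature_sequence_to_video(pca, feature_sequence, frame_shape):
    """Inverse-transform each feature vector back into a restored frame."""
    restored = pca.inverse_transform(feature_sequence)
    return restored.reshape(-1, *frame_shape)

frames = np.random.rand(120, 96, 96)              # stand-in for an MRI clip
pca, feats = video_to_feature_sequence(frames)    # sample video feature sequence
restored_video = feature_sequence_to_video(pca, feats, (96, 96))
```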
  • the video feature generation model is trained so that the video feature generation model can generate a corresponding video feature or video feature sequence based on any unit text.
  • the video feature generation model can be a deep learning model, which generates the training samples input to the deep learning model by labeling each sample unit text in the sample unit text sequence. After multiple rounds of iterative training, the deep learning model can accurately generate video features based on unit text.
  • the video feature generation model may also be an attention model
  • the video feature generation model includes an encoder and a decoder
  • the encoder is configured to generate an encoding result based on a unit text sequence
  • the decoder is configured to generate a video feature sequence based on the encoding result
  • the encoder and decoder are trained in the form of end-to-end training from unit text sequences to video feature sequences, so that the attention model can accurately generate video feature sequences based on unit text sequences.
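  • The following is a minimal sketch of such an encoder-decoder model mapping a unit text sequence to a video feature sequence; the vocabulary size, dimensions and the use of a standard Transformer are assumptions rather than the patent's specific architecture.

```python
import torch
import torch.nn as nn

class VideoFeatureGenerator(nn.Module):
    """Encoder-decoder sketch: unit text ids in, video feature sequence out."""
    def __init__(self, vocab_size=64, d_model=128, feat_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=4,
                                          num_encoder_layers=2, num_decoder_layers=2,
                                          batch_first=True)
        self.in_proj = nn.Linear(feat_dim, d_model)   # previous video features -> decoder input
        self.out_proj = nn.Linear(d_model, feat_dim)  # decoder output -> one video feature per frame

    def forward(self, unit_text_ids, prev_video_feats):
        src = self.embed(unit_text_ids)               # (batch, text_len, d_model)
        tgt = self.in_proj(prev_video_feats)          # (batch, video_len, d_model)
        return self.out_proj(self.transformer(src, tgt))

model = VideoFeatureGenerator()
unit_text = torch.randint(0, 64, (1, 6))              # e.g. phoneme ids of the sample unit text
target_feats = torch.randn(1, 10, 32)                 # sample video feature sequence (teacher forcing)
pred = model(unit_text, target_feats)
loss = nn.functional.mse_loss(pred, target_feats)     # end-to-end regression objective
```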
  • the sample demonstration video may be an MRI video, captured by an MRI apparatus, of a user reciting the sample demonstration text. Multiple sample texts are obtained by segmenting the sample demonstration text into clauses, and the sample demonstration video is segmented, based on the sentence segmentation result, into sub-videos corresponding to the different sample texts, so that a plurality of sample articulator action videos can be obtained.
  • the sample demonstration text is segmented to obtain a plurality of sample texts
  • the speech recognition is performed on the sample speech recorded synchronously with the sample demonstration video
  • the speech corresponding to each sample text is determined based on the speech recognition result.
  • Based on the time axis information of each speech segment, the sample articulator action video corresponding to each speech segment is determined from the sample demonstration video. For example, by segmenting the sample demonstration text "How are you? I'm fine thank you, and you?", four clauses "How are you", "I'm fine", "thank you" and "and you" can be obtained.
  • the time axis information of the speech segment corresponding to "How are you" can be obtained as "00:00:00 to 00:01:40", the time axis information of the speech segment corresponding to "I'm fine" is "00:01:40 to 00:02:50", the time axis information of the speech segment corresponding to "thank you" is "00:02:50 to 00:04:40", and the time axis information of the speech segment corresponding to "and you" is "00:04:40 to 00:06:00". The sample demonstration video with a duration of 6 seconds can thus be divided into the four video clips "00:00:00 to 00:01:40", "00:01:40 to 00:02:50", "00:02:50 to 00:04:40" and "00:04:40 to 00:06:00", where each video clip is the sample demonstration sub-video of its corresponding sample clause text.
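  • The following is a minimal sketch of cutting the sample demonstration video into per-clause sub-videos from such time axis information; the frame rate and the interpretation of the timestamps as seconds are assumptions for illustration.

```python
import numpy as np

def split_video_by_segments(frames: np.ndarray, fps: float, segments):
    """frames: (num_frames, H, W); segments: list of (clause, start_s, end_s)."""
    sub_videos = {}
    for clause, start_s, end_s in segments:
        start, end = int(round(start_s * fps)), int(round(end_s * fps))
        sub_videos[clause] = frames[start:end]        # sample articulator action video for the clause
    return sub_videos

frames = np.zeros((150, 96, 96))                      # 6 s of video at an assumed 25 fps
segments = [("How are you", 0.00, 1.40), ("I'm fine", 1.40, 2.50),
            ("thank you", 2.50, 4.40), ("and you", 4.40, 6.00)]
clips = split_video_by_segments(frames, fps=25.0, segments=segments)
```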
  • the above sentence segmentation methods are only shown as examples, and those skilled in the art may use other sentence segmentation methods to perform sentence segmentation processing
  • Because the MRI apparatus may not have an audio recording function during MRI video recording, additional recording equipment is required to record the sample speech, and the sample demonstration video and the sample speech may have a time difference caused by problems such as different start times and different end times during recording. Therefore, in a possible implementation, alignment processing is performed on the time axis information of the sample speech and the sample demonstration video, and the length of the sample speech or the sample demonstration video is adjusted so that the sample speech is consistent with the length of the sample demonstration video.
  • the face position in the sample demonstration video is adjusted frame by frame, so that the same organ in each video frame is located at the same image position.
  • the adjustment can be performed in the form of pixel tracking or optical flow tracking, or it can be performed by extracting and aligning feature points.
  • the processing of video frames includes but is not limited to rotation, translation, zooming in, and zooming out.
  • the screen size is uniformly cropped to reduce the interference information in the video.
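  • The following is a minimal sketch of such frame-by-frame alignment and uniform cropping, using a mean optical-flow translation estimate; the reference-frame strategy and the crop box are assumptions, and feature-point alignment could be used instead as described above.

```python
import cv2
import numpy as np

def align_and_crop(frames, crop_box=(10, 10, 80, 80)):
    """frames: list of grayscale uint8 images; crop_box: (x, y, w, h)."""
    ref = frames[0]
    x, y, w, h = crop_box
    aligned = []
    for frame in frames:
        # Dense optical flow from the reference frame to the current frame.
        flow = cv2.calcOpticalFlowFarneback(ref, frame, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        dx, dy = float(np.mean(flow[..., 0])), float(np.mean(flow[..., 1]))
        # Shift the frame back so the same organ stays at the same image position.
        m = np.float32([[1, 0, -dx], [0, 1, -dy]])
        shifted = cv2.warpAffine(frame, m, (frame.shape[1], frame.shape[0]))
        aligned.append(shifted[y:y + h, x:x + w])     # uniform crop to reduce interference
    return aligned
```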
  • S12 Collect the audio to be evaluated read aloud by the user based on the example text.
  • the voice read aloud by the user can be collected through the voice collecting device of the user terminal.
  • speech recognition can be performed on the collected audio to be evaluated, and the recognition result can be compared with the example text.
  • If the text similarity is lower than a preset similarity threshold, a prompt can be sent to the user to remind the user to re-read the example text.
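  • The following is a minimal sketch of this check; the ASR call is omitted, and difflib's ratio and the 0.8 threshold stand in for whatever similarity measure and threshold an implementation actually uses.

```python
import difflib

SIMILARITY_THRESHOLD = 0.8   # assumed preset similarity threshold

def check_readback(example_text: str, recognized_text: str) -> bool:
    """Compare the ASR result of the collected audio with the example text."""
    similarity = difflib.SequenceMatcher(
        None, example_text.lower(), recognized_text.lower()).ratio()
    if similarity < SIMILARITY_THRESHOLD:
        print("Please re-read the example text.")     # prompt sent to the user
        return False
    return True

check_readback("How are you", "how are you")          # True, no prompt needed
```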
  • the audio to be evaluated is converted into an audio feature vector to be processed; the audio feature vector to be processed is input into a video generation model, and the articulator action video corresponding to the audio to be evaluated is obtained from the output of the video generation model.
  • An achievable implementation is to convert the audio to be evaluated into an audio feature vector to be evaluated.
  • the audio to be evaluated can be input into a speech recognition model to obtain the audio feature vector to be evaluated.
  • the audio feature vector to be evaluated includes: The phoneme posterior probability vector of each frame of audio in the audio to be evaluated, and the dimension of each phoneme posterior probability vector is the phoneme dimension included in the language type corresponding to the audio to be evaluated.
  • a phoneme is the smallest unit of speech divided according to the natural properties of speech.
  • Each human voice, animal voice, and musical instrument sound can be divided into a limited number of minimum phonetic units based on attributes.
  • Each frame of audio in the audio to be evaluated may be audio of one phoneme.
  • a phoneme can be represented by a phoneme posterior probability vector.
  • the dimension of each phoneme posterior probability vector is the phoneme dimension included in the language type corresponding to the audio to be evaluated. For example, assuming that the language type corresponding to the audio to be evaluated is English, since the number of English phonemes is 48, the dimension of the English phoneme posterior probability vector is 48. That is to say, an English phoneme posterior probability vector includes 48 probability values greater than or equal to 0 and less than 1, and the sum of the 48 probability values is 1.
  • the phoneme corresponding to the maximum value among the 48 probability values is the English phoneme represented by the phoneme posterior probability vector.
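  • The following is a minimal sketch of a per-frame phoneme posterior probability vector for English with 48 phonemes as stated above; producing it with a softmax over acoustic-model logits and reading off the phoneme by argmax are assumptions about a typical implementation.

```python
import numpy as np

NUM_ENGLISH_PHONEMES = 48

def phoneme_posterior(logits: np.ndarray) -> np.ndarray:
    """Per-frame posterior: 48 values in [0, 1) summing to 1."""
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

frame_logits = np.random.randn(NUM_ENGLISH_PHONEMES)  # stand-in for one frame of ASR output
posterior = phoneme_posterior(frame_logits)
predicted_phoneme_index = int(posterior.argmax())      # phoneme represented by this vector
assert abs(posterior.sum() - 1.0) < 1e-6
```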
  • If the language type corresponding to the audio to be evaluated is a language type that imitates a target musical instrument, and that language type includes, for example, 50 sound units, the dimension of the phoneme posterior probability vector is also 50, that is, it includes 50 probability values whose sum is 1.
  • Each frame of audio in the audio to be evaluated may also be the audio of one character/word.
  • In that case, a character/word is represented by a character/word posterior probability vector.
  • a speech recognition model (Automatic Speech Recognition, ASR for short) is a model that converts sounds into corresponding text or commands.
  • the speech recognition model can be obtained by training in the following way: constructing model training samples according to sample audio frames and the phonemes corresponding to the sample audio frames, and training according to the model training samples to obtain the speech recognition model.
  • a mapping table of speech feature parameters and phonemes is constructed according to the sample audio frame and the phoneme corresponding to the sample audio frame.
  • the feature parameters of the speech to be processed are obtained, and the feature parameters of the speech to be processed are matched one by one against the speech templates in the speech parameter library to obtain the matching probability between the feature parameters of the speech to be processed and each speech feature parameter in the speech parameter library.
  • the phoneme posterior probability vector of each frame of audio in the audio to be evaluated is obtained according to the mapping table between speech feature parameters and phonemes.
  • This method of training a speech recognition model using a small, limited number of phonemes and phoneme audio can reduce the model training workload compared with the method of using a large number of characters/words and their audio to train a speech recognition model, and a trained speech recognition model can be obtained quickly.
  • the video generation model is obtained by training in the following manner: constructing model training data according to the sample audio and sample vocal organ action videos corresponding to the sample audio; training and obtaining the video generation model according to the model training data.
  • the present disclosure does not specifically limit the loss function of the video generation model.
  • the sample audio is the audio of all phonemes corresponding to the target language type.
  • the sample articulator action video may be an animation demonstration video of the articulator action corresponding to each phoneme produced by any animation production and rendering software.
  • the sample voice organ motion video may also be a voice organ motion video corresponding to each phoneme captured by an anatomical imaging instrument such as a camera, a nuclear magnetic resonance apparatus, a CT apparatus, or the like. Because users can not only read words or words in various human languages, but also imitate sounds such as animals and musical instruments.
  • the above-mentioned sound segments may refer to sound segments in other sounds that imitate non-human languages (such as a sound segment corresponding to a key or a string of a musical instrument).
  • the sample audio is the audio of all characters or words (or sound segments) corresponding to the target language type.
  • the sample articulator action video may be an animation demonstration video of the articulator action corresponding to each character or word (or sound segment) produced by any animation production and rendering software.
  • the sample articulator action video may also be an articulator action video corresponding to each character or word (or sound segment) captured by an anatomical imaging instrument such as a camera, a nuclear magnetic resonance apparatus, a CT apparatus, or the like.
  • Constructing the model training data according to the sample audio and the sample articulator action video corresponding to the sample audio may specifically include the following steps:
  • the sample articulator video features corresponding to each of the sample phoneme posterior probability vectors in the sample phoneme posterior probability vector sequence are extracted to obtain the sample articulator video feature sequence; the sample phoneme posterior probability vector sequence and the sample articulator video feature sequence are used as the model training data.
  • Each frame of audio in the sample audio is in one-to-one correspondence with each sample phoneme posterior probability vector in the sample phoneme posterior probability vector sequence, and each sample phoneme posterior probability vector in the sample phoneme posterior probability vector sequence is in one-to-one correspondence with each sample articulator video feature in the sample articulator video feature sequence.
  • each of the sample vocal organ video features is the pixel point feature information of at least one frame of video image in the sample vocal organ motion video; or, each of the sample vocal organ video features is the sample vocal organ motion Principal component feature information of at least one frame of video image in the video.
  • the principal component feature information is principal component coefficient data representing the video image obtained by performing dimension reduction processing on the video image through the principal component analysis algorithm.
  • sample articulator video feature corresponding to each of the sample phoneme posterior probability vectors in the sample phoneme posterior probability vector sequence is extracted based on the sample articulator action video.
  • the adjustment can be performed in the form of pixel tracking or optical flow tracking, or can be performed in the form of feature point extraction and alignment.
  • the size of the frame video image is uniformly cropped.
  • the position of the articulator in the sample articulator action video is adjusted frame by frame, so that the same articulator in each frame of video image is located at the same image position, which is conducive to reducing the impact of different positions of the same articulator in each frame of video images. The resulting interference to the model training effect and the model convergence speed.
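  • The following is a minimal sketch of constructing such training data by pairing one sample phoneme posterior vector per audio frame with one articulator video feature (e.g. principal component coefficients) per video frame; equal frame counts and the softmax over acoustic logits are illustrative assumptions.

```python
import numpy as np

def build_training_pairs(sample_audio_logits: np.ndarray,
                         sample_video_features: np.ndarray):
    """sample_audio_logits: (T, 48) per-frame acoustic logits;
       sample_video_features: (T, feat_dim) per-frame articulator video features."""
    assert len(sample_audio_logits) == len(sample_video_features)
    exp = np.exp(sample_audio_logits - sample_audio_logits.max(axis=1, keepdims=True))
    posterior_sequence = exp / exp.sum(axis=1, keepdims=True)    # sample phoneme posterior vector sequence
    return list(zip(posterior_sequence, sample_video_features))  # one-to-one model training pairs

pairs = build_training_pairs(np.random.randn(120, 48), np.random.randn(120, 32))
```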
  • Since the audio feature vector to be evaluated includes the phoneme posterior probability vector of each frame of audio in the audio to be evaluated, after inputting the audio feature vector to be evaluated into the trained video generation model, the articulator video feature related to the phoneme posterior probability vector of each frame of audio can be obtained.
  • the sample articulator action video used for training the video generation model is the video corresponding to the sample audio, while the sample articulator action video used for training the video feature generation model in step S11 is the video corresponding to the sample text.
  • Since the sample audio can be recorded synchronously with the sample articulator action video, the sample articulator action video in step S11 and step S13 may be the same video.
  • operations such as audio and video alignment, video cropping, and video center alignment for the sample vocal organ action video can be performed only once.
  • the aligned video and audio are used for training.
  • the pronunciation evaluation information includes at least one of pronunciation scoring information of the user, pronunciation action suggestion information, or a comparison video of the articulator action video and the articulator standard action video.
  • The comparison video is generated in the following way: based on the unit text content of the example text, the video clips that represent the same unit text content in the articulator action video and in the standard articulator action video are taken as a group of video clips.
  • the video clips belonging to the articulator action video and to the standard articulator action video in each video clip group are aligned;
  • the aligned articulator action video and standard articulator action video are spliced to obtain the comparison video.
  • the action difference information is obtained by comparing the articulator action video with the standard articulator action video corresponding to the example text; pronunciation scoring information is generated according to the action difference information, and/or the action difference information is matched against preset pronunciation action suggestion information to obtain target action suggestion information that matches the action difference information.
  • the action difference information may refer to the difference information of the movement trajectories of the feature points of the vocal organs.
  • the movement trajectory of the feature points of the speech organ is used to reflect the speech movement process of the speech organ.
  • the feature points of the articulators can be the centroid points, center points, contour feature points, etc. of the articulators, or other feature points that take the centroid points, center points, contour feature points, etc. of the articulators as reference points.
  • the present disclosure does not specifically limit the number and types of feature points.
  • the articulator action video includes at least one frame of video image. The position coordinates of the articulator feature points are determined in each frame of the articulator action video, so that a number of (groups of) feature point position coordinates corresponding to the number of frames of the articulator action video can be obtained. Based on the position coordinates of all the articulator feature points, the motion trajectory of the articulator feature points corresponding to the time axis of the articulator action video can be constructed.
  • the preset movement trajectory of the feature point corresponding to the example sentence text is the standard movement trajectory of the vocal organ feature point corresponding to the example sentence text. Similarity calculation is performed on the movement trajectory of the feature points of the speech organ corresponding to the action video of the speech organ and the preset movement trajectory of the standard feature points, and the similarity information of the two trajectory lines can be obtained.
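  • The following is a minimal sketch of such a trajectory similarity calculation; equal numbers of coordinates are assumed (see the alignment described later), and the mean-distance-to-similarity mapping is an illustrative choice rather than the patent's specific metric.

```python
import numpy as np

def trajectory_similarity(traj: np.ndarray, preset_traj: np.ndarray,
                          scale: float = 10.0) -> float:
    """traj, preset_traj: (num_points, 2) feature-point position coordinates."""
    assert traj.shape == preset_traj.shape
    mean_dist = float(np.linalg.norm(traj - preset_traj, axis=1).mean())
    return 1.0 / (1.0 + mean_dist / scale)        # 1.0 means the trajectories coincide

user_traj = np.array([[0, 0], [1, 2], [2, 3], [3, 3]], dtype=float)
standard_traj = np.array([[0, 0], [1, 1], [2, 2], [3, 3]], dtype=float)
similarity = trajectory_similarity(user_traj, standard_traj)
```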
  • the preset motion trajectory of the feature points corresponding to the example text can be determined in the following manner:
  • Based on the model training data for training the video generation model, all the phonemes (or other unit-granularity information such as characters, words, and sentences) that make up the example text are determined, the articulator video feature sequence corresponding to all the phonemes is determined, and the standard articulator action video of the example text is generated based on that articulator video feature sequence.
  • the position coordinates of the feature points of the vocal organs are determined in each frame of video images of the standard motion video of the vocal organs, and the preset motion trajectories of the feature points corresponding to the example text are obtained.
  • multiple groups of phoneme sequences that form the example text can be determined, and based on the multiple sets of phoneme sequences that make up the example text, a plurality of preset motion trajectories of feature points can be determined.
  • In this way, comprehensive and more accurate preset feature point motion trajectories can be obtained.
  • the pronunciation score of the audio to be evaluated is determined according to the similarity value in the similarity information.
  • the pronunciation score of the audio to be evaluated is used as the pronunciation evaluation result.
  • the pronunciation level of the audio to be evaluated may be determined as excellent, medium, qualified, unqualified, or missing pronunciation.
  • the pronunciation level of the audio to be evaluated, such as excellent, medium, qualified, unqualified, or missing pronunciation, is used as the pronunciation evaluation result.
  • the audio to be evaluated in which the user reads the example text aloud can be input into the video generation model, and the user's pronunciation organ action video can be fitted and restored.
  • the position coordinates of the feature points of the vocal organs are determined in each frame of video images of the vocal organs action video, and the movement trajectory of the feature points of the vocal organs is obtained.
  • the similarity calculation is performed between the articulator feature point motion trajectory and the standard preset feature point motion trajectory corresponding to the example text, so as to obtain the pronunciation action similarity information of the articulators.
  • the pronunciation evaluation result can be obtained based on the pronunciation action similarity information of the pronunciation organs. Since pronunciation is directly related to the movements of the vocal organs, the pronunciation evaluation results obtained in this way are more accurate.
  • Generating the pronunciation evaluation result of the audio to be evaluated according to the similarity information may further include the following steps:
  • a more accurate pronunciation evaluation result can be determined by further combining the similarity information determined based on the motion trajectory of the articulator feature points.
  • This method further improves the accuracy of pronunciation evaluation results.
  • the duration of the audio to be evaluated is related to the speed of the user's pronunciation, that is, the duration of the audio to be evaluated is variable.
  • If the duration of the audio to be evaluated differs, the number of frames of the audio to be evaluated differs, and when audios to be evaluated with different durations are input into the video generation model, the durations of the resulting articulator action videos also differ.
  • If the duration of the articulator action video differs, the number of video image frames included in the articulator action video differs.
  • the present disclosure provides the following two implementations to avoid the problem of large errors in the similarity information obtained by calculation.
  • In an achievable embodiment, before the similarity calculation between the articulator feature point motion trajectory and the preset feature point motion trajectory corresponding to the example text is performed to obtain the similarity information, the number of feature point position coordinates of the articulator feature point motion trajectory is adjusted according to the number of feature point position coordinates that constitute the preset feature point motion trajectory, so that the number of feature point position coordinates of the preset feature point motion trajectory is the same as the number of feature point position coordinates of the articulator feature point motion trajectory.
  • the number of feature point position coordinates of the preset motion trajectory of the feature point is 5, which are coordinates A, B, C, D, and E, respectively.
  • the number of feature point position coordinates of the feature point motion trajectory is 4, which are coordinates a, b, c, and e respectively.
  • In this case, a feature point f(0, 0) can be inserted into the feature point motion trajectory so that the number of feature point position coordinates of the feature point motion trajectory is adjusted to match.
  • the insertion position of the feature point f(0, 0) can be determined according to the position of the missing phoneme in the audio to be evaluated.
  • Another achievable implementation is that, before the position coordinates of the articulator feature points are determined in each frame of the articulator action video, the number of video image frames in the articulator action video is adjusted according to the number of video image frames in the standard articulator action video corresponding to the example text, so that the number of video image frames in the articulator action video is the same as the number of video image frames in the standard articulator action video.
  • the number of frames of the video image in the standard action video of the vocal organ is 5 frames, which are 1, 2, 3, 4, and 5 frames respectively.
  • the number of frames of the video image in the speech organ action video of the audio to be evaluated is 3 frames, which are 1, 4, and 5 frames respectively.
  • frame interpolation processing can be performed on the video image frame sequences 1, 4, and 5.
  • image frames 1 and 4 can be inserted into the current video image frame sequence 1, 4, 5, so that the obtained video image frame sequence is 1, 1, 4, 4, 5.
  • Alternatively, blank image frames 0 can be inserted into the current video image frame sequence 1, 4, 5 to obtain the video image frame sequence 1, 0, 0, 4, 5.
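  • The following is a minimal sketch of matching the frame count of the user's articulator action video to the standard action video by frame interpolation; the choice of which frames to duplicate is an assumption, and blank frames could be inserted instead as described above.

```python
def pad_to_frame_count(frames: list, target_count: int) -> list:
    """Duplicate existing frames until the sequence reaches the target frame count."""
    out = list(frames)
    while len(out) < target_count:
        idx = len(out) // 2               # illustrative choice of where to duplicate
        out.insert(idx, out[idx])
    return out

user_frames = ["frame1", "frame4", "frame5"]   # 3 frames in the user's articulator action video
aligned = pad_to_frame_count(user_frames, 5)   # matches the 5-frame standard action video
```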
  • In step S13, determining the position coordinates of the articulator feature points in each frame of video image of the articulator action video and obtaining the articulator feature point motion trajectory may also include the following steps:
  • the audio to be evaluated is divided according to the preset pronunciation evaluation granularity to obtain a plurality of sub-audios to be evaluated; in each frame of video image of the articulator action video, the position coordinates of the articulator feature points corresponding to each sub-audio to be evaluated are determined, and the articulator feature point motion trajectory segment corresponding to each sub-audio to be evaluated is obtained.
  • the preset pronunciation evaluation granularity is a pronunciation evaluation unit set according to user requirements.
  • the granularity of pronunciation evaluation can be phonemes, characters, words, sentences, paragraphs, articles, etc., which is not specifically limited in the present disclosure.
  • the audio to be evaluated may be divided according to the duration corresponding to the preset pronunciation evaluation granularity, thereby obtaining a plurality of sub-audios to be evaluated.
  • a motion track segment of the feature point of the vocal organ corresponding to each sub-audio to be evaluated can be obtained based on the motion video of the vocal organ. Specifically, the position coordinates of the feature points of the vocal organs corresponding to each sub-audio to be evaluated can be determined in each frame of the video image of the vocal organ action video, and the motion trajectory of the vocal organ feature points corresponding to each sub-audio to be evaluated can be obtained. Fragment.
  • Alternatively, the articulator feature point motion trajectory of the entire audio to be evaluated can be divided in the same way in which the sub-audios to be evaluated were obtained, so as to obtain the articulator feature point motion trajectory segment corresponding to each sub-audio to be evaluated.
  • the similarity calculation is performed between the articulator feature point motion trajectory segment of each sub-audio to be evaluated and the corresponding preset feature point motion trajectory segment to obtain a first similarity value corresponding to that sub-audio to be evaluated, and the similarity information includes the first similarity value of each sub-audio to be evaluated.
  • the feature point preset motion track segment is a track segment in the complete feature point preset motion track.
  • the method of obtaining the feature point preset motion track segment is similar to the method of dividing the feature point motion track segment of the vocal organ feature point of each sub-audio to be evaluated from the feature point motion track of the vocal organ of the entire audio to be evaluated, and will not be repeated here. .
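  • The following is a minimal sketch of cutting the full feature point motion trajectory into per-sub-audio segments; the per-segment frame counts are assumed to come from the same division used to obtain the sub-audios to be evaluated.

```python
import numpy as np

def split_trajectory_by_subaudio(trajectory: np.ndarray, frames_per_subaudio):
    """trajectory: (total_frames, 2); frames_per_subaudio: frame count per sub-audio."""
    segments, start = [], 0
    for count in frames_per_subaudio:
        segments.append(trajectory[start:start + count])   # one trajectory segment per sub-audio
        start += count
    return segments

trajectory = np.random.rand(10, 2)
segments = split_trajectory_by_subaudio(trajectory, [3, 3, 4])   # e.g. one segment per word
```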
  • As shown in FIG. 2, the method for locating which phoneme or which character/word in the audio to be evaluated is inaccurately pronounced includes steps S21-S28.
  • the preset threshold may be preset values such as 90% and 98%. In the case that the first similarity value is smaller than the preset threshold, it is determined that the target sub-audio to be evaluated corresponding to the first similarity is inaccurate in pronunciation.
  • the magnitude of the first similarity value is used to represent the similarity between the target sub-audio to be evaluated and the standard pronunciation corresponding to the target sub-audio to be evaluated.
  • the target example sentence text segment corresponding to the target sub-audio to be evaluated can be determined.
  • the target example sentence text segment may include one or more phonemes/characters/words/sentences, etc.
  • the standard action video of the vocal organ corresponding to the part with the wrong pronunciation and the preset motion trajectory segment of the standard feature point of the vocal organ are displayed to the user.
  • the inaccurate articulation organ feature point motion track segment is displayed to the user, so that the user can know which pronunciation is inaccurate and where the difference from the standard pronunciation is.
  • the articulator action video in the embodiments of the present disclosure includes the action of at least one organ among the upper lip, lower lip, upper teeth, lower teeth, gums, hard palate, soft palate, uvula, tongue tip, tongue surface, tongue base, nasal cavity, oral cavity, pharynx, epiglottis, esophagus, trachea, vocal cords, and larynx, and the articulator feature point motion trajectory (or feature point motion trajectory segment) includes the feature point motion trajectory (or feature point motion trajectory segment) of any articulator in the articulator action video.
  • the feature point motion trajectory (or feature point motion trajectory segment) of each articulator can undergo a similarity calculation with the preset feature point motion trajectory (or preset feature point motion trajectory segment) corresponding to that organ under the example text to obtain a second similarity value, and the second similarity value represents the degree of similarity between the feature point motion trajectory (or feature point motion trajectory segment) of an articulator and the standard preset feature point motion trajectory (or preset feature point motion trajectory segment) of that articulator.
  • the target second similarity value that is smaller than the threshold may be determined, and the target articulator may be determined according to the target second similarity value. In this way, it can be determined which specific one or several of the multiple articulators performed an incorrect pronunciation action for the example text (or example text segment).
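  • The following is a minimal sketch of locating the target articulator by computing a per-organ second similarity value and flagging organs below a threshold; the organ dictionary, threshold, and similarity mapping are illustrative assumptions.

```python
import numpy as np

THRESHOLD = 0.9   # assumed second-similarity threshold

def find_target_articulators(user_trajs: dict, preset_trajs: dict) -> list:
    """user_trajs / preset_trajs: organ name -> (num_points, 2) trajectory."""
    targets = []
    for organ, traj in user_trajs.items():
        dist = float(np.linalg.norm(traj - preset_trajs[organ], axis=1).mean())
        second_similarity = 1.0 / (1.0 + dist)
        if second_similarity < THRESHOLD:
            targets.append(organ)                  # articulator whose action is incorrect
    return targets

user = {"tongue tip": np.random.rand(8, 2), "upper lip": np.random.rand(8, 2)}
preset = {"tongue tip": np.random.rand(8, 2), "upper lip": np.random.rand(8, 2)}
problem_organs = find_target_articulators(user, preset)
```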
  • the voice organ action video is a magnetic resonance imaging MRI video.
  • the sample articulator action video used for training the video generation model is also a magnetic resonance imaging (MRI) video, and the sample articulator action video includes the action of at least one articulator among the upper lip, lower lip, upper teeth, lower teeth, gums, hard palate, soft palate, uvula, tongue tip, tongue surface, tongue base, nasal cavity, oral cavity, pharynx, epiglottis, esophagus, trachea, vocal cords, and larynx.
  • the speech organs also include speech power organs such as the lungs, the diaphragm, and the trachea
  • the speech organs action video and the sample speech organ action video may also include the action of at least one speech organ among the lungs, the diaphragm, and the trachea.
  • If the action difference information indicates that the position of the upper jaw in the user's articulator action video is lower than the position of the upper jaw in the standard articulator action video, the corresponding target action suggestion information "raise the upper jaw" can be matched; if the action difference information indicates that the position of the tongue in the user's articulator action video is further back than the position of the tongue in the standard articulator action video, the corresponding target action suggestion information "protrude the tongue" can be matched.
  • the displayed pronunciation evaluation information can be at least one of the user's pronunciation scoring information, the pronunciation action suggestion information, or the comparison video of the articulator action video and the standard articulator action video; these can be displayed individually, in combination, or all three at the same time.
  • the articulator action video or the standard articulator action video can be rendered frame by frame through an animation generation model to obtain an articulator animation video, and the articulator animation video can be presented as the articulator action video or the standard articulator action video.
  • the training samples of the animation generation model include a plurality of MRI sample images and an animated articulator map corresponding to each MRI sample image. The training samples of the animation generation model are obtained by determining the position of the articulator in each MRI sample image and, at the position of the articulator in each MRI sample image, generating an animated articulator corresponding to that position to obtain the animated articulator map.
  • MRI video is composed of multiple video frames.
  • the animation frames can be recombined in the order of the video frames to obtain the animation video corresponding to the video frames.
  • the animation generation model can be any machine learning model that can learn samples, such as an adversarial generation network model, a recurrent neural network model, a convolutional network model, etc., which is not limited in the present disclosure.
  • the training samples of the model include multiple MRI sample images and animated organ maps corresponding to each MRI sample image.
  • the animation generation model can generate corresponding animation images based on the input MRI images, so that the MRI video frames can be converted into the corresponding animation images. Effects converted to animation frames.
  • the animation generation model can sequentially output the animation frames corresponding to the video frames in the order in which the video frames are input, wherein the positions of the vocal organs in the animation frames are filled by the animation vocal organs, which is convenient for users to view and understand.
  • different colors can be filled for each animated vocal organ according to different vocal organs, and the name of the organ can also be marked on the animated vocal organ.
  • For example, the upper jaw position can be filled with light yellow and marked with the text "upper jaw", the tongue position can be filled with bright red and marked with the text "tongue", and the teeth position can be filled with white and marked with the text "tooth", so that the position and connection relationship of each organ is reflected more intuitively and is easier for users to understand.
  • The above color filling method and name labeling method are only described as examples, and the present disclosure does not limit the color filling method and name labeling method of an organ.
  • The name labeling can also be done in a foreign language, or phonetic symbols and pinyin can be added for pronunciation.
  • the animation frames are reorganized according to the sequence of the video frames, and a complete animation video can be obtained.
  • the playback speed of the animation frames can be consistent with the video frames, or the playback speed of the animation frames can be adjusted according to the application requirements. For example, when the animation video is applied in an education scenario, in order to show the movement mode and force of the articulators more clearly, the playback speed of the animation frames can be reduced. When the playback speed of the animation frames is reduced, in order to make the animation video smoother, frames can also be supplemented between each pair of frames to increase the number of frames of the animation video.
  • the animation generation model is an adversarial generation network model
  • the animation generation model includes a generator for generating animation images based on MRI images
  • the animation generation model is obtained by training in the following manner:
  • The following is repeatedly executed: the generator generates a training animation image based on the MRI sample image; a loss value is generated based on the animated articulator map corresponding to the MRI sample image and a preset loss function; the parameters in the generator are adjusted based on the loss value; and the discriminator of the adversarial generation network model evaluates the training animation image based on the animated articulator map, until the evaluation result satisfies a preset evaluation result condition.
  • the generator is used to generate an image based on the input data, and the discriminator is used to evaluate whether the image output by the generator has the same characteristics as the images in the specified set, that is, it can be judged whether the picture is a picture in the specified set.
  • the evaluation result of the discriminator may be correct or wrong.
  • the evaluation result of the discriminator is usually correct, that is to say, the discriminator can correctly judge whether the picture is a picture in the specified set; but when the feature difference between the picture generated by the generator and the pictures in the specified set is not obvious, it is difficult for the discriminator to always correctly judge whether the picture is in the specified set.
  • the training stop condition can be set by setting the correct ratio threshold of the discriminative evaluation results, so that the images generated by the generator are more in line with the characteristics of the training target in the training set.
  • Before training the generator, the discriminator can also be pre-trained: for example, random features are input to the generator to obtain an image, the discriminator evaluates whether the image features are consistent with the animated articulator maps in the training samples, and the parameters in the discriminator are adjusted based on whether the evaluation result is correct, until the discriminator can correctly judge whether the image generated by the generator is consistent with the animated articulator maps in the training samples.
  • After the discriminator is trained, the generator can be trained using the discriminator. It is worth noting that the training of the generator and the discriminator can also be carried out synchronously, so that they constrain each other, making the images generated by the generator more consistent with the characteristics of the animated articulator maps and enabling the discriminator to evaluate the images more correctly.
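  • The following is a minimal sketch of such adversarial training, with the generator mapping an MRI frame to an animation frame and the discriminator judging whether a frame looks like an animated articulator map; the fully connected architectures, image size, and the added L1 reconstruction term are placeholders rather than the patent's specific design.

```python
import torch
import torch.nn as nn

generator = nn.Sequential(nn.Flatten(), nn.Linear(96 * 96, 256), nn.ReLU(),
                          nn.Linear(256, 96 * 96), nn.Sigmoid())
discriminator = nn.Sequential(nn.Flatten(), nn.Linear(96 * 96, 256), nn.ReLU(),
                              nn.Linear(256, 1), nn.Sigmoid())
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
bce = nn.BCELoss()

def train_step(mri_frame, animated_organ_map):
    """mri_frame, animated_organ_map: (batch, 1, 96, 96) paired training images."""
    batch = mri_frame.size(0)
    real, fake = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator: tell real animated articulator maps from generated ones.
    d_opt.zero_grad()
    generated = generator(mri_frame).view_as(animated_organ_map)
    d_loss = (bce(discriminator(animated_organ_map), real) +
              bce(discriminator(generated.detach()), fake))
    d_loss.backward()
    d_opt.step()

    # Generator: fool the discriminator while staying close to the paired target map.
    g_opt.zero_grad()
    generated = generator(mri_frame).view_as(animated_organ_map)
    g_loss = (bce(discriminator(generated), real) +
              nn.functional.l1_loss(generated, animated_organ_map))
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```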
  • the training samples are obtained by: determining the position of the articulator in each MRI sample image, and generating, at the position of the articulator in each MRI sample image, an animated articulator corresponding to that position, so as to obtain the animated articulator map.
  • the position of each organ can be distinguished by the color block areas in the MRI sample image, the position of the articulator can also be identified by a recognition model, or an organ template image can be overlaid on the MRI sample image, the regions of the organ positions in the organ template image can be mapped onto the MRI sample image, and the color block in the region where the articulator is located is used as the position of the articulator.
  • the organ contour of the MRI sample image is extracted, and the organ image corresponding to the articulating organ is filled in the organ contour of each articulating organ.
  • the organ image can be a cartoon image or a realistic image.
  • the organ map can be called from the preset flash animation library, and the organ map corresponding to the vocal organ is filled in the organ outline of each vocal organ . It is worth noting that there may be multiple organ textures for the same vocal organ in the flash animation library, and one type of organ texture can be automatically selected for filling, or the type of texture can be modified according to the user's designation.
  • For the MRI sample image corresponding to the first video frame, the organ map is called from a preset flash animation library, and the organ outline of each articulator is filled with the organ map corresponding to that articulator; for the MRI sample images corresponding to the other video frames, the organ map that was used for each articulator in the MRI sample image corresponding to the first frame is called from the flash animation library and filled into the organ outline of the corresponding articulator.
  • For example, if the "tongue 1" texture is selected for the tongue and the "tooth 3" texture is selected for the teeth, these textures fill the contour of the tongue and the contour of the teeth respectively; in the other frames, the "tongue 1" map can be automatically selected to fill the contour of the tongue, and the "tooth 3" map can be selected to fill the contour of the teeth.
  • Organ contours can be corrected frame by frame, and after the organ contour of the first frame is corrected, the organ contour can be tracked by means of feature point recognition, so as to achieve organ contour correction in other frames.
  • For the MRI sample image corresponding to the first video frame, the organ contour in the MRI sample image is adjusted based on the MRI sample image, so that the articulator contour corresponds to the feature points in the MRI sample image; for the MRI sample images corresponding to the other video frames, feature point tracking is performed between the feature points in the MRI sample image and the feature points in the previous video frame, and the organ contour in the MRI sample image is automatically adjusted based on the feature point tracking results.
  • steps S11 to S15 in this embodiment may all be performed on the user terminal.
  • steps S13 and S14 may also be performed on the server. After the audio to be evaluated is generated, the audio can be sent to the server, and after the server processes the audio, the pronunciation evaluation information is returned to the user terminal.
  • the user's pronunciation can be evaluated more accurately, which more intuitively reflects whether the user's pronunciation is accurate.
  • FIG. 3 is a block diagram of a pronunciation evaluation apparatus according to an exemplary disclosed embodiment. As shown in FIG. 3 , the pronunciation evaluation apparatus 300 includes:
  • the example sentence display module 310 is used to display the example sentence text to the user;
  • the audio collection module 320 is used to collect the audio to be evaluated that the user reads aloud based on the example text;
  • Video generation module 330 for generating the pronunciation organ action video that reflects the action of the pronunciation organ when the user reads the example sentence text;
  • the pronunciation evaluation module 340 is used for generating pronunciation evaluation information based on the pronunciation organ action video and the pronunciation organ standard action video corresponding to the example text;
  • the evaluation display module 350 is configured to display the pronunciation evaluation information to the user.
  • the pronunciation evaluation information includes at least one of the pronunciation scoring information of the user, the pronunciation action suggestion information, or the comparison video of the articulator action video and the articulator standard action video one.
  • the example sentence display module 310 is configured to generate example sentence audio based on the example sentence text; synthesize the example sentence audio and the standard action video of the pronunciation organ into an example sentence demonstration video; display the example sentence to the user Text and demo video of said example sentences.
  • the pronunciation evaluation module 340 is configured to obtain action difference information by comparing the pronunciation organ action video and the pronunciation organ standard action video corresponding to the example text; according to the action difference information Pronunciation scoring information is generated, and/or, according to the action difference information and preset pronunciation action suggestion information, target action suggestion information matching the action difference information is obtained.
  • the action difference information is difference information of movement trajectories of feature points of speech organs.
  • the pronunciation evaluation module is configured to, based on the unit text content of the example text, take the video clips representing the same unit text content in the articulator action video and the standard articulator action video as a group of video clips; align the video clips belonging to the articulator action video and the standard articulator action video in each video clip group; and splice the aligned articulator action video and standard articulator action video to obtain the comparison video.
  • the video generation module 330 is configured to convert the to-be-evaluated audio into a to-be-processed audio feature vector; input the to-be-processed audio feature vector into a video generation model to obtain the video generated The voice organ action video output by the model and corresponding to the audio to be evaluated.
  • the pronunciation evaluation apparatus 300 further includes a video generation model training module, which is configured to construct model training data according to the sample audio and the sample articulator action video corresponding to the sample audio, and to train according to the model training data to obtain the video generation model.
  • the video generation model training module is further configured to convert each frame of audio in the sample audio into a sample phoneme posterior probability vector to obtain a sample including at least one sample phoneme posterior probability vector A sequence of phoneme posterior probability vectors; based on the sample voice organ action video, extract the sample voice organ video features corresponding to each of the sample phoneme posterior probability vectors in the sample phoneme posterior probability vector sequence, to obtain a sample voice organ A video feature sequence; the sample phoneme posterior probability vector sequence and the sample vocal organ video feature sequence are used as the model training data.
  • the sample vocal organ video feature is at least one of pixel point feature information or principal component feature information of at least one frame of video image in the sample vocal organ motion video.
  • the video generation module 330 is further configured to divide the example text into a unit text sequence; input the unit text sequence into a video feature generation model to obtain a video feature sequence; and generate the standard articulator action video based on the video feature sequence; wherein the video feature generation model is obtained by training in the following manner: dividing the sample text into a sample unit text sequence; constructing model training data according to the sample unit text sequence and the sample video feature sequence of the sample articulator action video corresponding to the sample unit text sequence; and training according to the model training data to obtain the video feature generation model.
  • the voice organ action video and the voice organ standard action video are voice organ animation videos generated based on magnetic resonance imaging (MRI) videos, and the device further includes a video rendering module configured to render the voice organ action video or the voice organ standard action video frame by frame through an animation generation model to obtain a voice organ animation video.
  • the training samples of the animation generation model include a plurality of MRI sample images and an animated articulator map corresponding to each MRI sample image, and the apparatus further includes a training sample generation module configured to determine the position of the articulators in each MRI sample image, and generate, at the position of the articulators in each MRI sample image, animated articulators corresponding to that position to obtain the animated articulator map.
  • in this way, the user's pronunciation can be evaluated more accurately, which more intuitively reflects whether the user's pronunciation is accurate.
  • Referring to FIG. 4, it shows a schematic structural diagram of an electronic device (e.g., a user equipment or a server) 400 suitable for implementing an embodiment of the present disclosure.
  • Terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle-mounted terminals (e.g., in-vehicle navigation terminals), as well as stationary terminals such as digital TVs and desktop computers.
  • the electronic device shown in FIG. 4 is only an example, and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.
  • an electronic device 400 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 401, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage device 408 into a random access memory (RAM) 403. The RAM 403 also stores various programs and data required for the operation of the electronic device 400.
  • the processing device 401, the ROM 402, and the RAM 403 are connected to each other through a bus 404.
  • An input/output (I/O) interface 405 is also connected to bus 404 .
  • The following devices may be connected to the I/O interface 405: an input device 406 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; an output device 407 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; a storage device 408 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 409. The communication device 409 may allow the electronic device 400 to communicate wirelessly or by wire with other devices to exchange data.
  • Although FIG. 4 shows the electronic device 400 having various devices, it should be understood that not all of the illustrated devices are required to be implemented or provided; more or fewer devices may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via the communication device 409, or from the storage device 408, or from the ROM 402.
  • When the computer program is executed by the processing device 401, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are performed.
  • the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • the computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to, an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with computer-readable program code embodied thereon. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device .
  • Program code embodied on a computer readable medium may be transmitted using any suitable medium including, but not limited to, electrical wire, optical fiber cable, RF (radio frequency), etc., or any suitable combination of the foregoing.
  • the user terminal and the server may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network).
  • Examples of communication networks include local area networks ("LAN"), wide area networks ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; or may exist alone without being assembled into the electronic device.
  • the above-mentioned computer-readable medium carries one or more programs, and when the above-mentioned one or more programs are executed by the electronic device, the electronic device: acquires at least two Internet Protocol addresses; sends a node evaluation request including the at least two Internet Protocol addresses to a node evaluation device, wherein the node evaluation device selects an Internet Protocol address from the at least two Internet Protocol addresses and returns it; and receives the Internet Protocol address returned by the node evaluation device; wherein the acquired Internet Protocol address indicates an edge node in a content distribution network.
  • the above computer-readable medium carries one or more programs, and when the above one or more programs are executed by the electronic device, the electronic device: receives a node evaluation request including at least two Internet Protocol addresses; selects an Internet Protocol address from the at least two Internet Protocol addresses; and returns the selected Internet Protocol address; wherein the received Internet Protocol address indicates an edge node in a content distribution network.
  • Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., through the Internet using an Internet service provider).
  • each block in the flowcharts or block diagrams may represent a module, program segment, or portion of code, which contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by dedicated hardware-based systems that perform the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • the modules involved in the embodiments of the present disclosure may be implemented in software or hardware. In some cases, the name of a module does not constitute a limitation on the module itself; for example, the first acquisition module may also be described as "a module for acquiring at least two Internet Protocol addresses".
  • exemplary types of hardware logic components include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chips (SOCs), Complex Programmable Logical Devices (CPLDs) and more.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with the instruction execution system, apparatus or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing.
  • More specific examples of machine-readable storage media would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, compact disk read-only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
  • Example 1 provides a pronunciation evaluation method, the method including: displaying example sentence text to a user; collecting audio to be evaluated read aloud by the user based on the example sentence text; generating, based on the audio to be evaluated, an articulator action video reflecting the actions of the user's articulators when reading the example sentence text aloud; generating pronunciation evaluation information based on the articulator action video and an articulator standard action video corresponding to the example sentence text; and displaying the pronunciation evaluation information to the user.
  • Example 2 provides the method of Example 1, wherein the pronunciation evaluation information includes at least one of pronunciation scoring information for the user, pronunciation action suggestion information, or a comparison video of the articulator action video and the articulator standard action video.
  • Example 3 provides the method of Example 1, wherein presenting the example sentence text to the user includes: generating example sentence audio based on the example sentence text; synthesizing the example sentence audio and the articulator standard action video into an example sentence demonstration video; and presenting the example sentence text and the example sentence demonstration video to the user.
  • Example 4 provides the method of Example 2, wherein, in a case that the pronunciation evaluation information includes the pronunciation scoring information and/or the pronunciation action suggestion information, generating the pronunciation evaluation information based on the articulator action video and the articulator standard action video corresponding to the example sentence text includes: obtaining action difference information by comparing the articulator action video with the articulator standard action video corresponding to the example sentence text; and generating pronunciation scoring information according to the action difference information, and/or obtaining, according to the action difference information and preset pronunciation action suggestion information, target action suggestion information matching the action difference information.
  • Example 5 provides the method of Example 4, where the action difference information is difference information of the movement trajectories of the feature points of the vocal organs.
  • Example 6 provides the method of Example 2, wherein the comparison video is generated by: taking, based on the unit text content of the example sentence text, the video clips in the articulator action video and the articulator standard action video that represent the same unit text content as one video clip group; aligning, in each video clip group, the video clips belonging to the articulator action video and to the articulator standard action video; and splicing the aligned articulator action video and articulator standard action video to obtain the comparison video.
  • Example 7 provides the method of Example 1, wherein generating the articulator action video reflecting the actions of the articulators when the user reads the example sentence text aloud includes: converting the audio to be evaluated into an audio feature vector to be processed; and inputting the audio feature vector to be processed into a video generation model to obtain the articulator action video output by the video generation model and corresponding to the audio to be evaluated.
  • Example 8 provides the method of Example 7, further including: constructing model training data according to sample audio and a sample articulator action video corresponding to the sample audio; and training the video generation model according to the model training data.
  • Example 9 provides the method of Example 8, wherein constructing the model training data according to the sample audio and the sample articulator action video corresponding to the sample audio includes: converting each frame of audio in the sample audio into a sample phoneme posterior probability vector to obtain a sample phoneme posterior probability vector sequence including at least one sample phoneme posterior probability vector; extracting, based on the sample articulator action video, the sample articulator video feature corresponding to each sample phoneme posterior probability vector in the sample phoneme posterior probability vector sequence to obtain a sample articulator video feature sequence; and using the sample phoneme posterior probability vector sequence and the sample articulator video feature sequence as the model training data.
  • Example 10 provides the method of Example 9, wherein the sample articulator video feature is at least one of pixel point feature information or principal component feature information of at least one frame of video image in the sample articulator action video.
  • Example 11 provides the method of Example 1, wherein the articulator standard action video is generated by: splitting the example sentence text into a unit text sequence; inputting the unit text sequence into a video feature generation model to obtain a video feature sequence; and generating the articulator standard action video based on the video feature sequence; wherein the video feature generation model is obtained by training in the following manner: splitting sample text into a sample unit text sequence; constructing model training data according to the sample unit text sequence and a sample video feature sequence of a sample articulator action video corresponding to the sample unit text sequence; and training the video feature generation model according to the model training data.
  • Example 12 provides the method of any one of Examples 1-11, wherein the articulator action video and the articulator standard action video are articulator animation videos generated based on magnetic resonance imaging (MRI) videos, and the method further includes: rendering the articulator action video or the articulator standard action video frame by frame through an animation generation model to obtain an articulator animation video.
  • Example 13 provides the method of Example 12, wherein the training samples of the animation generation model include a plurality of MRI sample images and an animated articulator map corresponding to each MRI sample image, and the method further includes: determining the position of the articulators in each MRI sample image; and generating, at the position of the articulators in each MRI sample image, animated articulators corresponding to that position to obtain the animated articulator map.
  • Example 14 provides a pronunciation evaluation apparatus, the apparatus including: an example sentence display module configured to display example sentence text to a user; an audio collection module configured to collect audio to be evaluated read aloud by the user based on the example sentence text; a video generation module configured to generate an articulator action video reflecting the actions of the user's articulators when reading the example sentence text aloud; a pronunciation evaluation module configured to generate pronunciation evaluation information based on the articulator action video and an articulator standard action video corresponding to the example sentence text; and an evaluation display module configured to display the pronunciation evaluation information to the user.
  • Example 15 provides the apparatus of Example 14, wherein the pronunciation evaluation information includes at least one of pronunciation scoring information for the user, pronunciation action suggestion information, or a comparison video of the articulator action video and the articulator standard action video.
  • Example 16 provides the apparatus of Example 14, wherein the example sentence display module is configured to generate example sentence audio based on the example sentence text, synthesize the example sentence audio and the articulator standard action video into an example sentence demonstration video, and present the example sentence text and the example sentence demonstration video to the user.
  • Example 17 provides the apparatus of Example 10, wherein the pronunciation evaluation module is configured to obtain action difference information by comparing the pronunciation organ action video with the pronunciation organ standard action video corresponding to the example sentence text; and generate pronunciation scoring information according to the action difference information, and/or obtain, by matching against preset pronunciation action suggestion information according to the action difference information, target action suggestion information that matches the action difference information.
  • Example 18 provides the apparatus of Example 15, wherein the motion difference information is difference information of the movement trajectory of the feature points of the speech organ.
  • Example 19 provides the apparatus of Example 15, wherein the pronunciation evaluation module is further configured to, based on the unit text content of the example sentence text, take the video clips in the articulator action video and the articulator standard action video that represent the same unit text content as one video clip group; align, in each video clip group, the video clips belonging to the articulator action video and to the articulator standard action video; and splice the aligned articulator action video and articulator standard action video to obtain the comparison video.
  • Example 20 provides the apparatus of Example 14, wherein the video generation module is configured to convert the audio to be evaluated into an audio feature vector to be processed, and input the audio feature vector to be processed into a video generation model to obtain the articulator action video output by the video generation model and corresponding to the audio to be evaluated.
  • Example 21 provides the apparatus of Example 20, wherein the pronunciation evaluation apparatus further includes a video generation model training module configured to construct model training data according to sample audio and a sample articulator action video corresponding to the sample audio, and to train the video generation model according to the model training data.
  • Example 22 provides the apparatus of Example 21, wherein the video generation model training module is further configured to convert each frame of audio in the sample audio into a sample phoneme posterior probability vector to obtain a sample phoneme posterior probability vector sequence including at least one sample phoneme posterior probability vector; extract, based on the sample articulator action video, the sample articulator video feature corresponding to each sample phoneme posterior probability vector in the sample phoneme posterior probability vector sequence to obtain a sample articulator video feature sequence; and use the sample phoneme posterior probability vector sequence and the sample articulator video feature sequence as the model training data.
  • Example 23 provides the apparatus of Example 22, wherein the sample articulator video feature is at least one of pixel point feature information or principal component feature information of at least one frame of video image in the sample articulator action video.
  • Example 24 provides the apparatus of Example 14, wherein the video generation module is further configured to split the example sentence text into a unit text sequence; input the unit text sequence into a video feature generation model to obtain a video feature sequence; and generate the articulator standard action video based on the video feature sequence; wherein the video feature generation model is obtained by training in the following manner: splitting sample text into a sample unit text sequence; constructing model training data according to the sample unit text sequence and a sample video feature sequence of a sample articulator action video corresponding to the sample unit text sequence; and training the video feature generation model according to the model training data.
  • Example 25 provides the apparatus of any one of Examples 14-24, wherein the articulator action video and the articulator standard action video are articulator animation videos generated based on magnetic resonance imaging (MRI) videos, and the apparatus further includes a video rendering module configured to render the articulator action video or the articulator standard action video frame by frame through an animation generation model to obtain an articulator animation video.
  • Example 26 provides the apparatus of Example 25, wherein the training samples of the animation generation model include a plurality of MRI sample images and an animated articulator map corresponding to each MRI sample image, and the training samples of the animation generation model are obtained in the following manner: determining the position of the articulators in each MRI sample image; and generating, at the position of the articulators in each MRI sample image, animated articulators corresponding to that position to obtain the animated articulator map.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The present disclosure relates to a pronunciation assessment method and apparatus, a storage medium, and an electronic device. The method comprises: displaying an example sentence text to a user; capturing audio to be assessed of the user reading aloud on the basis of the example sentence text; generating a vocal organ movement video reflecting movement of a vocal organ when the user reads aloud the example sentence text; generating pronunciation assessment information on the basis of the vocal organ movement video and a vocal organ standard movement video corresponding to the example sentence text; and displaying the pronunciation assessment information to the user. The present disclosure can accurately assess the pronunciation of a user, and intuitively represent whether the pronunciation of the user is accurate.

Description

Pronunciation evaluation method and apparatus, storage medium, and electronic device
CROSS-REFERENCE TO RELATED APPLICATIONS
This application is based on, and claims priority to, Chinese Patent Application No. 202110298227.9, filed on March 19, 2021, the disclosure of which is incorporated herein by reference in its entirety.
TECHNICAL FIELD
The present disclosure relates to the field of education, and in particular, to a pronunciation evaluation method and apparatus, a storage medium, and an electronic device.
BACKGROUND
When learning pronunciation, users can usually only imitate the pronunciation they hear or the way other people's lips move. It is difficult for users to observe the specific movements of other people's articulators, and it is therefore difficult for them to judge their own pronunciation correctly, which hinders pronunciation learning.
SUMMARY
This Summary is provided to introduce concepts in a simplified form that are described in detail in the Detailed Description that follows. This Summary is not intended to identify key features or essential features of the claimed technical solution, nor is it intended to be used to limit the scope of the claimed technical solution.
In a first aspect, the present disclosure provides a pronunciation evaluation method, including: displaying example sentence text to a user; collecting audio to be evaluated read aloud by the user based on the example sentence text; generating an articulator action video based on the audio to be evaluated; generating pronunciation evaluation information based on the articulator action video and an articulator standard action video corresponding to the example sentence text; and displaying the pronunciation evaluation information to the user.
In a second aspect, the present disclosure provides a pronunciation evaluation apparatus, including: an example sentence display module configured to display example sentence text to a user; an audio collection module configured to collect audio to be evaluated read aloud by the user based on the example sentence text; a video generation module configured to generate an articulator action video based on the audio to be evaluated; a pronunciation evaluation module configured to generate pronunciation evaluation information based on the articulator action video and an articulator standard action video corresponding to the example sentence text; and an evaluation display module configured to display the pronunciation evaluation information to the user.
In a third aspect, the present disclosure provides a non-transitory computer-readable medium on which a computer program is stored, where the program, when executed by a processing device, implements the steps of the method described in the first aspect of the present disclosure.
In a fourth aspect, the present disclosure provides an electronic device, including a storage device and a processing device, where a computer program is stored on the storage device, and the processing device is configured to execute the computer program in the storage device to implement the steps of the method described in the first aspect of the present disclosure.
In a fifth aspect, the present disclosure provides a computer program, including instructions that, when executed by a processor, cause the processor to perform the steps of the method described in the first aspect of the present disclosure.
In a sixth aspect, the present disclosure provides a computer program product, including instructions that, when executed by a processor, cause the processor to perform the steps of the method described in the first aspect of the present disclosure.
Other features and advantages of the present disclosure will be described in detail in the Detailed Description that follows.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent with reference to the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flowchart of a pronunciation evaluation method according to an exemplary disclosed embodiment.
FIG. 2 is a flowchart of another pronunciation evaluation method according to an exemplary disclosed embodiment of the present disclosure.
FIG. 3 is a block diagram of a pronunciation evaluation apparatus according to an exemplary disclosed embodiment of the present disclosure.
FIG. 4 is a block diagram of an electronic device according to an exemplary disclosed embodiment of the present disclosure.
DETAILED DESCRIPTION
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for exemplary purposes only and are not intended to limit the scope of protection of the present disclosure.
It should be understood that the steps described in the method embodiments of the present disclosure may be performed in different orders and/or in parallel. Furthermore, method embodiments may include additional steps and/or omit some of the illustrated steps. The scope of the present disclosure is not limited in this regard.
As used herein, the term "including" and variations thereof are open-ended inclusions, i.e., "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms will be given in the description below.
It should be noted that concepts such as "first" and "second" mentioned in the present disclosure are only used to distinguish different devices, modules, or units, and are not used to limit the order of, or interdependence between, the functions performed by these devices, modules, or units.
It should be noted that the modifiers "a/an" and "a plurality of" mentioned in the present disclosure are illustrative rather than restrictive, and those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be understood as "one or more".
The names of messages or information exchanged between multiple devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of these messages or information.
FIG. 1 is a flowchart of a pronunciation evaluation method according to an exemplary disclosed embodiment. As shown in FIG. 1, the method includes steps S11 to S15.
S11: Display the example sentence text to the user.
The example sentence text may be text of any length, such as a phrase, a sentence, a paragraph, or an article; the example sentence text may also refer to a clause obtained after a longer text is split into sentences.
For example, in a scenario where the user is learning pronunciation, if the user wants to test and practice pronunciation, the example sentence text may be presented to the user in text form so that the user can take a pronunciation test. If the user wants to learn pronunciation, the example sentence text may be presented to the user in audio form so that the user can read along. The present disclosure is also not limited to presenting the example sentence text to the user in the form of text and audio together.
The example sentence text may be displayed in text form through a display device of the user terminal, and may also be presented in speech form through a playback device of the user terminal, where the speech corresponding to the example sentence text may be stored in advance, or the text may be converted into speech for direct use when the speech needs to be presented.
The user terminal may include any device with a display function, such as a mobile phone, a computer, a learning machine, or a wearable device.
In a possible implementation, example sentence audio is generated based on the example sentence text, the audio and the articulator standard action video are synthesized into an example sentence demonstration video, and the example sentence text and the example sentence demonstration video are presented to the user.
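As a purely illustrative sketch of this synthesis step (not part of the original disclosure), the generated example sentence audio could be muxed onto the standard action video with an off-the-shelf tool; the file names and the moviepy 1.x-style API below are assumptions.

```python
# Sketch: combine generated example-sentence audio with the articulator standard
# action video to obtain an example-sentence demonstration video.
from moviepy.editor import VideoFileClip, AudioFileClip

def make_demo_video(standard_video_path: str, example_audio_path: str, out_path: str) -> None:
    video = VideoFileClip(standard_video_path)   # articulator standard action video
    audio = AudioFileClip(example_audio_path)    # synthesized audio for the example sentence
    demo = video.set_audio(audio)                # attach the audio track to the video
    demo.write_videofile(out_path, codec="libx264", audio_codec="aac")

# make_demo_video("standard_action.mp4", "example_sentence.wav", "example_demo.mp4")
```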
The articulator standard action video is generated based on the example sentence text, and the video features may be generated by a pre-trained video feature generation model. The example sentence text is split into a unit text sequence, the unit text sequence is input into the video feature generation model to obtain a video feature sequence, and the articulator standard action video is generated based on the video feature sequence.
A unit text sequence is a sequence obtained by splitting the example sentence text into small units used for generating the video and arranging them in order. In the present disclosure, a unit text may be a phoneme, a word, a single character, and so on. Splitting the example sentence text yields finer-grained model inputs, so that the model can generate an accurate video feature sequence more efficiently based on the unit texts. For example, when the example sentence text is "How are you", the example sentence text may be split, with words as the splitting unit, into the unit text sequence "how", "are", "you"; it may also be split, with phonemes as the splitting unit, into the unit text sequence of the corresponding phonemes (shown in the original publication as the image Figure PCTCN2022080357-appb-000001).
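A minimal sketch of this splitting step follows; it is not the disclosed implementation, and the tiny pronunciation dictionary stands in for whatever phoneme inventory or grapheme-to-phoneme tool is actually used.

```python
# Sketch: split example sentence text into a unit text sequence,
# either word by word or phoneme by phoneme.
TOY_LEXICON = {
    "how": ["HH", "AW"],
    "are": ["AA", "R"],
    "you": ["Y", "UW"],
}  # purely illustrative grapheme-to-phoneme lookup

def split_into_units(text: str, unit: str = "word") -> list:
    words = text.lower().split()
    if unit == "word":
        return words
    if unit == "phoneme":
        units = []
        for w in words:
            units.extend(TOY_LEXICON.get(w, [w]))  # fall back to the word if unknown
        return units
    raise ValueError("unsupported unit: " + unit)

print(split_into_units("How are you"))              # ['how', 'are', 'you']
print(split_into_units("How are you", "phoneme"))   # ['HH', 'AW', 'AA', 'R', 'Y', 'UW']
```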
The video feature generation model is obtained by training in the following manner:
The sample text is split into a sample unit text sequence, model training data is constructed according to the sample unit text sequence and a sample video feature sequence of the sample articulator action video corresponding to the sample unit text sequence, and the video feature generation model is obtained by training on the model training data.
The sample articulator action video is a demonstration video produced or recorded based on the sample text. The demonstration video may be an animated demonstration video of the oral cavity produced with any animation production and rendering software, or a video of a person's head captured by a magnetic resonance imaging (MRI) scanner while the person reads the sample text aloud.
By extracting the video features of the sample articulator action video frame by frame or at sampled frames, feature information of multiple image frames of the sample articulator action video can be obtained, and arranging the video feature information in the order of the video frames yields the sample video feature sequence. It is worth noting that the present disclosure does not limit the form of the feature information of an image frame; any form of feature information that can be extracted and restored to an image through processing can serve as the feature information in the video feature sequence of the present disclosure.
In a possible implementation, the feature information is principal component information. Principal component analysis is performed on the sample articulator action video frame by frame to obtain the principal component information of each video frame, and the principal component information of each video frame is arranged in video frame order to obtain the sample video feature sequence. By restoring the principal component information, restored images can be obtained; arranging and synthesizing the restored images in the order of the sample video feature sequence yields a restored demonstration video. The sample unit text sequence and the corresponding sample video feature sequence are used as training samples to train the video feature generation model, so that the video feature generation model can generate the corresponding video feature or video feature sequence based on any unit text.
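The principal component route can be sketched as follows with scikit-learn; the number of components, the frame size, and the use of flattened grayscale frames are assumptions for illustration only.

```python
# Sketch: per-frame principal component features for an articulator action video,
# plus restoration of approximate frames from those features.
import numpy as np
from sklearn.decomposition import PCA

def video_to_pca_features(frames, n_components=64):
    """frames: (num_frames, height, width) grayscale video frames."""
    num_frames, h, w = frames.shape
    flat = frames.reshape(num_frames, h * w).astype(np.float64)
    pca = PCA(n_components=n_components)
    features = pca.fit_transform(flat)   # sample video feature sequence, one row per frame
    return pca, features

def pca_features_to_video(pca, features, frame_shape):
    restored = pca.inverse_transform(features)   # approximate (restored) frames
    return restored.reshape((-1,) + tuple(frame_shape))

# frames = np.random.rand(120, 96, 96)            # stand-in for an MRI clip
# pca, feats = video_to_pca_features(frames)
# restored = pca_features_to_video(pca, feats, (96, 96))
```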
The video feature generation model may be a deep learning model. Training samples input to the deep learning model are generated by labeling each sample unit text in the sample unit text sequence, and after multiple rounds of iterative training, the deep learning model can accurately generate video features based on unit texts.
The video feature generation model may also be an attention model. The video feature generation model includes an encoder and a decoder, where the encoder is used to generate an encoding result based on the unit text sequence, and the decoder is used to generate a video feature sequence based on the encoding result. The encoder and the decoder are trained end to end, from unit text sequences to video feature sequences, so that the attention model can accurately generate video feature sequences based on unit text sequences. It is worth noting that, when the demonstration video to be generated is a magnetic resonance imaging (MRI) video, considering that recording MRI video is costly and that recording a longer video in one session can reduce the recording cost, the sample articulator action video may be obtained by segmenting a complete sample demonstration video, and, correspondingly, the sample text is also obtained by segmenting the complete sample demonstration text.
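The shape of such an attention-based text-to-video-feature model might resemble the following sketch; the dimensions, the self-attention encoder, and the simplification of producing one video feature per unit text are assumptions, not the disclosed architecture.

```python
# Sketch: an attention-based model mapping a unit-text (e.g., phoneme) ID sequence
# to a sequence of video feature vectors, trainable end to end with an L2 loss.
import torch
import torch.nn as nn

class TextToVideoFeatures(nn.Module):
    def __init__(self, vocab_size, d_model=128, feat_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)   # self-attention encoder
        self.proj = nn.Linear(d_model, feat_dim)                    # projection to video feature space

    def forward(self, unit_ids):
        # unit_ids: (batch, seq_len) integer IDs of the unit texts
        return self.proj(self.encoder(self.embed(unit_ids)))        # (batch, seq_len, feat_dim)

# model = TextToVideoFeatures(vocab_size=60)
# loss = nn.functional.mse_loss(model(unit_ids), sample_video_features)
```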
The sample demonstration video may be an MRI video, captured by a magnetic resonance imaging apparatus, of the user reciting the sample text. Multiple sample texts are obtained by splitting the sample demonstration text into sentences, and the sample demonstration video is segmented, based on the sentence-splitting result, into sub-videos corresponding to each sample text, thereby obtaining multiple sample articulator action videos.
In a possible implementation, the sample demonstration text is split into sentences to obtain multiple sample texts; speech recognition is performed on the sample speech recorded synchronously with the sample demonstration video, and the speech segment corresponding to each sample text is determined based on the speech recognition result; and the sample articulator action video corresponding to each speech segment is determined from the sample demonstration video based on the time axis information of each speech segment. For example, by splitting the sample demonstration text "How are you? I'm fine thank you, and you?" into sentences, the four clauses "How are you", "I'm fine", "thank you", and "and you" can be obtained. By recognizing the 6-second sample speech, the time axis information of the speech segment corresponding to "How are you" can be obtained as 00:00:00 to 00:01:40, that of "I'm fine" as 00:01:40 to 00:02:50, that of "thank you" as 00:02:50 to 00:04:40, and that of "and you" as 00:04:40 to 00:06:00. The 6-second sample demonstration video can then be divided according to this time axis information into four video segments, 00:00:00 to 00:01:40, 00:01:40 to 00:02:50, 00:02:50 to 00:04:40, and 00:04:40 to 00:06:00, where each video segment is the sample demonstration sub-video of its corresponding sample clause text. The above sentence-splitting manner is shown only as an example; those skilled in the art may split sentences in other ways, and the present disclosure does not limit this.
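A minimal sketch of cutting the demonstration video by these recognition-derived timestamps is shown below; the frame rate, the in-memory frame array, and the second-based timestamps are assumptions.

```python
# Sketch: cut a sample demonstration video into per-clause sub-videos
# using the start/end times produced by speech recognition alignment.
import numpy as np

def split_video_by_segments(frames, fps, segments):
    """frames: (num_frames, H, W[, C]); segments: list of (start_sec, end_sec)."""
    sub_videos = []
    for start_sec, end_sec in segments:
        start = int(round(start_sec * fps))
        end = min(int(round(end_sec * fps)), len(frames))
        sub_videos.append(frames[start:end])
    return sub_videos

# Illustrative segmentation of a 6-second clip at 25 fps:
# segments = [(0.0, 1.40), (1.40, 2.50), (2.50, 4.40), (4.40, 6.00)]
# clips = split_video_by_segments(frames, 25.0, segments)
```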
Considering that the recording instrument may not have an audio recording function when the MRI video is recorded, the sample speech needs to be recorded by a separate recording device, and the sample demonstration video and the sample speech may have a time offset caused by different start times, different end times, and so on. Therefore, in a possible implementation, the time axis information of the sample speech and the sample demonstration video is aligned, and the length of the sample speech or the sample demonstration video is adjusted so that the sample speech and the sample demonstration video have the same length.
Considering that a person may change posture while the video is being recorded, the facial position in the recorded video is not fixed, which may affect the visual quality of the video, may affect the extraction of feature information from the video, and may increase the training cost of the model. Therefore, in a possible implementation, the facial position in the sample demonstration video is adjusted frame by frame so that the same organ in each video frame is located at the same image position. The adjustment may be performed by pixel tracking or optical flow tracking, or by extracting and aligning feature points. The processing of the video frames includes, but is not limited to, rotation, translation, enlargement, and reduction; the frames may also be uniformly cropped to a common size to reduce interfering information in the video.
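One possible form of the feature-point-based alignment mentioned above is sketched here with OpenCV; the tracking parameters and the choice of a similarity (partial affine) transform are illustrative assumptions.

```python
# Sketch: stabilize an MRI demonstration video so that the articulators stay at
# fixed image positions, using feature-point tracking and a similarity transform.
import cv2
import numpy as np

def align_frames(frames):
    """frames: list of grayscale uint8 images; the first frame is the reference."""
    ref = frames[0]
    h, w = ref.shape[:2]
    ref_pts = cv2.goodFeaturesToTrack(ref, maxCorners=200, qualityLevel=0.01, minDistance=7)
    aligned = [ref]
    for frame in frames[1:]:
        cur_pts, status, _ = cv2.calcOpticalFlowPyrLK(ref, frame, ref_pts, None)
        good = status.ravel() == 1
        # Estimate rotation/translation/scale mapping the current frame back onto the reference.
        m, _ = cv2.estimateAffinePartial2D(cur_pts[good], ref_pts[good])
        aligned.append(cv2.warpAffine(frame, m, (w, h)) if m is not None else frame)
    return aligned
```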
S12: Collect the audio to be evaluated that the user reads aloud based on the example sentence text.
The speech read aloud by the user can be collected through a speech collection device of the user terminal.
In a possible implementation, speech recognition may be performed on the collected audio to be evaluated, and the recognition result may be compared with the example sentence text. When the text similarity is lower than a preset similarity threshold, a prompt may be sent to the user to remind the user to read the example sentence text again.
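A minimal sketch of such a similarity check follows; the similarity measure and the threshold value are assumptions, as the disclosure does not specify them.

```python
# Sketch: compare the ASR transcript of the collected audio with the example
# sentence text and prompt the user to read again when similarity is too low.
from difflib import SequenceMatcher

def needs_reread(recognized_text: str, example_text: str, threshold: float = 0.6) -> bool:
    similarity = SequenceMatcher(None, recognized_text.lower(), example_text.lower()).ratio()
    return similarity < threshold

# needs_reread("how are u", "How are you")     -> False (similar enough)
# needs_reread("good morning", "How are you")  -> True (prompt the user to re-read)
```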
S13: Generate an articulator action video reflecting the actions of the articulators when the user reads the example sentence text aloud.
In a possible implementation, the audio to be evaluated is converted into an audio feature vector to be processed, and the audio feature vector to be processed is input into a video generation model to obtain the articulator action video output by the video generation model and corresponding to the audio to be evaluated.
In an implementable embodiment, converting the audio to be evaluated into an audio feature vector to be evaluated may specifically be: inputting the audio to be evaluated into a speech recognition model to obtain the audio feature vector to be evaluated, where the audio feature vector to be evaluated includes a phoneme posterior probability vector for each frame of audio in the audio to be evaluated, and the dimension of each phoneme posterior probability vector is the number of phonemes included in the language type corresponding to the audio to be evaluated.
A phoneme is the smallest speech unit divided according to the natural attributes of speech. Every human speech sound, animal sound, and musical instrument sound can be divided into a finite number of minimum speech units based on its attributes.
Each frame of audio in the audio to be evaluated may be the audio of one phoneme. A phoneme can be represented by a phoneme posterior probability vector, and the dimension of each phoneme posterior probability vector is the number of phonemes of the language type corresponding to the audio to be evaluated. For example, assuming that the language type corresponding to the audio to be evaluated is English, since English has 48 phonemes, the dimension of the English phoneme posterior probability vector is 48. That is, an English phoneme posterior probability vector includes 48 probability values that are greater than or equal to 0 and less than 1, and the sum of the 48 probability values is 1. The phoneme corresponding to the largest of the 48 probability values is the English phoneme represented by the phoneme posterior probability vector. As another example, assuming that the language type corresponding to the audio to be evaluated is a language type that imitates a target musical instrument, if the target musical instrument corresponds to 50 phonemes, the dimension of the phoneme posterior probability vector is also 50, and it is composed of 50 probability values whose sum is 1.
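For illustration only, one such phoneme posterior probability vector could be produced as in the sketch below; the random scores and the 48-phoneme assumption simply mirror the English example above.

```python
# Sketch: a phoneme posterior probability vector for one audio frame.
import numpy as np

NUM_ENGLISH_PHONEMES = 48
scores = np.random.randn(NUM_ENGLISH_PHONEMES)   # raw acoustic scores for one frame (illustrative)
ppg = np.exp(scores) / np.exp(scores).sum()       # softmax: 48 values in [0, 1) that sum to 1

assert np.isclose(ppg.sum(), 1.0)
predicted_phoneme_index = int(np.argmax(ppg))     # the phoneme this frame most likely represents
print(predicted_phoneme_index, float(ppg[predicted_phoneme_index]))
```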
Each frame of audio in the audio to be evaluated may also be the audio of one character or word. Correspondingly, a character or word is represented by the posterior probability vector of that character or word. It is therefore worth noting that the playback duration corresponding to each frame of audio in the audio to be evaluated can be set freely as required, so that each frame of audio is the audio of one or more phonemes, characters, or words.
A speech recognition model (Automatic Speech Recognition, ASR) is a model that converts sound into corresponding text or commands.
Since the number of characters or words in any language is large while the number of phonemes is small, and the pronunciation of every character or word consists of one or more phonemes, in a preferred implementation the speech recognition model may be obtained through the following training: constructing model training samples according to sample audio frames and the phonemes corresponding to the sample audio frames, and training the speech recognition model according to the model training samples.
In detail, signal processing and knowledge mining are performed on the sample audio frames, the speech feature parameters of the sample audio frames are analyzed, speech templates are produced, and a speech parameter library is obtained. A mapping table between speech feature parameters and phonemes is constructed according to the sample audio frames and the phonemes corresponding to the sample audio frames.
After the audio to be evaluated is input into the trained speech recognition model, for each frame of audio in the audio to be evaluated, the same analysis as in training is performed to obtain speech feature parameters to be processed, the speech feature parameters to be processed are matched one by one against the speech templates in the speech parameter library, and the matching probability between the speech feature parameters to be processed and each speech feature parameter in the speech parameter library is obtained. Further, the phoneme posterior probability vector of each frame of audio in the audio to be evaluated is obtained according to the mapping table between speech feature parameters and phonemes.
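A sketch of this template-matching step is given below; the cosine-similarity measure, the softmax normalization, and the MFCC-like feature dimension are assumptions used only to make the idea concrete.

```python
# Sketch: turn one frame's speech feature parameters into a phoneme posterior
# probability vector by matching against a template library.
import numpy as np

def frame_to_phoneme_posteriors(frame_feat, templates, template_phoneme, num_phonemes):
    """templates: (num_templates, feat_dim); template_phoneme: phoneme index per template."""
    sims = templates @ frame_feat / (
        np.linalg.norm(templates, axis=1) * np.linalg.norm(frame_feat) + 1e-8)
    match_prob = np.exp(sims) / np.exp(sims).sum()     # matching probability per template
    posteriors = np.zeros(num_phonemes)
    for p, prob in zip(template_phoneme, match_prob):  # accumulate via the feature-to-phoneme map
        posteriors[p] += prob
    return posteriors                                   # sums to 1 by construction

# feat = np.random.rand(13)                             # e.g., one MFCC-like frame (illustrative)
# templates = np.random.rand(100, 13); phone_ids = np.random.randint(0, 48, 100)
# ppg = frame_to_phoneme_posteriors(feat, templates, phone_ids, 48)
```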
Compared with training a speech recognition model with the audio of a massive number of characters or words, this way of training the speech recognition model with a small and finite number of phonemes and their audio can reduce the model training task and quickly yield a trained speech recognition model.
The video generation model is obtained by training in the following manner: constructing model training data according to sample audio and a sample articulator action video corresponding to the sample audio, and training the video generation model according to the model training data.
The present disclosure does not specifically limit the loss function of the video generation model.
Since the number of characters or words (or sound segments) in any language is large while the number of phonemes is small, and the pronunciation of every character or word (or sound segment) consists of one or more phonemes, in a preferred implementation the sample audio is the audio of all phonemes corresponding to the target language type. The sample articulator action video may be an animated demonstration video of the articulator actions corresponding to each phoneme, produced with any animation production and rendering software. The sample articulator action video may also be an articulator action video corresponding to each phoneme captured by an anatomical imaging instrument such as a camera, an MRI scanner, or a CT scanner. Since users can not only read characters or words of various human languages but also imitate the sounds of animals, musical instruments, and so on, to help those of ordinary skill in the art understand the embodiments of the present disclosure, it should be noted that the above-mentioned sound segments may refer to sound fragments in other sounds that imitate non-human languages (such as the sound fragment corresponding to one key or one string of a musical instrument).
Similarly, in another implementation, the sample audio is the audio of all characters or words (or sound segments) corresponding to the target language type. The sample articulator action video may be an animated demonstration video of the articulator actions corresponding to each character or word (or sound segment), produced with any animation production and rendering software. The sample articulator action video may also be an articulator action video corresponding to each character or word (or sound segment) captured by an anatomical imaging instrument such as a camera, an MRI scanner, or a CT scanner.
In an implementable embodiment, constructing the model training data from the sample audio and the corresponding sample pronunciation organ action video may specifically include the following steps:
Each frame of the sample audio is converted into a sample phoneme posterior probability vector, yielding a sample phoneme posterior probability vector sequence containing at least one such vector. Based on the sample pronunciation organ action video, a sample pronunciation organ video feature corresponding to each sample phoneme posterior probability vector in the sequence is extracted, yielding a sample pronunciation organ video feature sequence. The sample phoneme posterior probability vector sequence and the sample pronunciation organ video feature sequence serve as the model training data.
Each frame of the sample audio corresponds one-to-one to a sample phoneme posterior probability vector in the sequence, and each sample phoneme posterior probability vector corresponds one-to-one to a sample pronunciation organ video feature in the sample pronunciation organ video feature sequence.
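A minimal sketch of assembling such one-to-one training pairs is given below; the flattened-pixel video feature, the array shapes and the placeholder data are assumptions made only for illustration.

    import numpy as np

    def extract_video_feature(frame_image):
        # Placeholder video feature: flattened pixel intensities of one video frame
        # (the disclosure equally allows principal component coefficients, sketched further below).
        return np.asarray(frame_image, dtype=np.float32).ravel()

    def build_training_data(phoneme_posteriors, video_frames):
        """Pair each frame's sample phoneme posterior vector with one sample video feature."""
        assert len(phoneme_posteriors) == len(video_frames)   # one-to-one correspondence
        inputs = np.stack([np.asarray(p, dtype=np.float32) for p in phoneme_posteriors])
        targets = np.stack([extract_video_feature(v) for v in video_frames])
        return inputs, targets    # (N, n_phonemes) model inputs, (N, feature_dim) targets

    # e.g. 100 frames of one sample recording: 20-phoneme posteriors and 64x64 video frames
    X, Y = build_training_data(np.random.rand(100, 20), np.random.rand(100, 64, 64))
    print(X.shape, Y.shape)       # (100, 20) (100, 4096)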
It is easy to understand that, when one audio frame corresponds to one phoneme, the articulation process of the pronunciation organs for that phoneme is reflected by one or more video frames. Accordingly, each sample pronunciation organ video feature is either the pixel feature information of at least one video frame of the sample pronunciation organ action video, or the principal component feature information of at least one video frame of the sample pronunciation organ action video.
It should be noted that the principal component feature information is the principal component coefficient data representing a video frame, obtained after the frame has been dimensionality-reduced by a principal component analysis algorithm.
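For example, the principal component coefficients of each video frame could be obtained as sketched below, assuming scikit-learn's PCA and illustrative frame sizes and component counts.

    import numpy as np
    from sklearn.decomposition import PCA

    # Fit PCA on all frames of the sample pronunciation organ action video (each frame
    # flattened to a pixel vector), then use the low-dimensional coefficients as the
    # "principal component feature information" of each frame.
    frames = np.random.rand(200, 64 * 64)   # 200 placeholder frames of 64x64 pixels
    pca = PCA(n_components=32)               # 32 components is an illustrative choice
    coeffs = pca.fit_transform(frames)       # shape (200, 32): one coefficient vector per frame

    # A frame can be approximately reconstructed from its coefficients when turning
    # predicted video features back into video images.
    reconstructed = pca.inverse_transform(coeffs[0])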
In an implementable embodiment, before extracting, based on the sample pronunciation organ action video, the sample pronunciation organ video feature corresponding to each sample phoneme posterior probability vector in the sequence, the method may further include the following step: adjusting the pronunciation organ positions in the sample pronunciation organ action video frame by frame, so that the same pronunciation organ is located at the same image position in every video frame.
The adjustment may be performed by pixel tracking or optical flow tracking, or by feature point extraction and alignment; the processing of each video frame includes, but is not limited to, rotation, translation, enlargement and reduction, and the frames may also be uniformly cropped to the same size. Adjusting the pronunciation organ positions frame by frame so that the same organ appears at the same image position in every frame helps reduce the interference with the model's training effectiveness and convergence speed that would otherwise be caused by the same organ appearing at different positions across frames.
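One possible form of such an alignment step is sketched below with OpenCV optical-flow tracking of feature points followed by a similarity transform; the tracker, the point count and the transform type are assumptions, since the disclosure allows any of the listed adjustment methods.

    import cv2
    import numpy as np

    def align_to_reference(frame, ref):
        """Translate/rotate/scale one 8-bit grayscale video frame so that tracked
        pronunciation organ feature points match a reference frame."""
        ref_pts = cv2.goodFeaturesToTrack(ref, maxCorners=50, qualityLevel=0.01, minDistance=5)
        cur_pts, status, _ = cv2.calcOpticalFlowPyrLK(ref, frame, ref_pts, None)
        good_ref = ref_pts[status.ravel() == 1]
        good_cur = cur_pts[status.ravel() == 1]
        # Estimate a similarity transform (rotation + translation + scale) and warp the frame back.
        matrix, _ = cv2.estimateAffinePartial2D(good_cur, good_ref)
        return cv2.warpAffine(frame, matrix, (ref.shape[1], ref.shape[0]))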
Since the audio feature vector to be evaluated includes the phoneme posterior probability vector of every frame of the audio to be evaluated, inputting this feature vector into the trained video generation model yields the pronunciation organ video feature corresponding to each frame's phoneme posterior probability vector. The pronunciation organ action video can then be generated and output from the resulting pronunciation organ video feature sequence.
It is worth noting that the sample pronunciation organ action video used in step S13 to train the video generation model corresponds to the sample audio, whereas the simplified pronunciation organ action video used in step S11 to train the video feature generation model corresponds to the sample text. The sample audio may be recorded synchronously with the sample pronunciation organ action video; when the sample audio and the sample pronunciation organ action video are recorded synchronously from the same sample text, the sample pronunciation organ action videos in step S11 and step S13 are the same video. In that case, operations on the sample pronunciation organ action video such as audio-video alignment, video cropping and video center alignment need only be performed once, and the aligned video and audio are used when training both models.
S14: Generate pronunciation evaluation information based on the pronunciation organ action video and the pronunciation organ standard action video corresponding to the example sentence text.
The pronunciation evaluation information includes at least one of: pronunciation scoring information for the user, pronunciation action suggestion information, or a comparison video of the pronunciation organ action video and the pronunciation organ standard action video.
The comparison video is generated as follows: based on the unit text content of the example sentence text, the video segments of the pronunciation organ action video and of the pronunciation organ standard action video that represent the same unit text content are treated as one video segment group; within each video segment group, the segments belonging to the pronunciation organ action video and to the pronunciation organ standard action video are aligned; the aligned pronunciation organ action video and pronunciation organ standard action video are then spliced together to obtain the comparison video.
When the pronunciation evaluation information includes the pronunciation scoring information and/or the pronunciation action suggestion information, action difference information is obtained by comparing the pronunciation organ action video with the pronunciation organ standard action video corresponding to the example sentence text; pronunciation scoring information is generated from the action difference information, and/or the action difference information is matched against preset pronunciation action suggestion information to obtain target action suggestion information matching the action difference information.
The action difference information may refer to difference information between the feature point movement trajectories of the pronunciation organs.
The feature point movement trajectory of a pronunciation organ reflects its articulation movement during pronunciation. The feature points of a pronunciation organ may be its centroid, center point, contour feature points and the like, or other feature points outside the organ that use the organ's centroid, center point or contour feature points as reference points. The present disclosure places no specific limit on the number or type of feature points.
The pronunciation organ action video includes at least one video frame. Determining the position coordinates of the pronunciation organ's feature points in every frame of the video yields a number of sets of feature point position coordinates equal to the number of frames. From all of these coordinates, the feature point movement trajectory of the pronunciation organ, aligned with the time axis of the pronunciation organ action video, can be constructed.
The preset feature point movement trajectory corresponding to the example sentence text is the standard pronunciation organ feature point movement trajectory for that text. Computing the similarity between the feature point movement trajectory derived from the pronunciation organ action video and the standard preset trajectory yields similarity information between the two trajectories.
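A minimal sketch of such a trajectory similarity calculation is given below; the disclosure does not prescribe a particular metric, so the exponential of the mean point-wise distance used here is only an illustrative choice.

    import numpy as np

    def trajectory_similarity(traj, ref):
        """Similarity in [0, 1] between two feature point trajectories of equal length.

        Both inputs have shape (n_frames, 2): one (x, y) feature point position per frame.
        """
        traj, ref = np.asarray(traj), np.asarray(ref)
        assert traj.shape == ref.shape
        mean_dist = np.linalg.norm(traj - ref, axis=1).mean()
        return float(np.exp(-mean_dist))   # 1.0 for identical trajectories, toward 0 as they diverge

    user_traj = np.cumsum(np.random.randn(50, 2), axis=0)   # placeholder measured trajectory
    std_traj = np.cumsum(np.random.randn(50, 2), axis=0)    # placeholder preset trajectory
    print(trajectory_similarity(user_traj, std_traj))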
In an implementable embodiment, the preset feature point movement trajectory corresponding to the example sentence text can be determined as follows:
From the model training data used to train the video generation model, determine all the phonemes (or information at other unit granularities such as characters, words or sentences) that make up the example sentence text, determine the pronunciation organ video feature sequence corresponding to those phonemes, and generate the pronunciation organ standard action video of the example sentence text from that feature sequence. Determining the position coordinates of the pronunciation organ's feature points in every frame of the standard action video yields the preset feature point movement trajectory corresponding to the example sentence text.
Where accuracy is a concern, multiple phoneme sequences making up the example sentence text can be determined from the model training data of the video generation model, multiple preset feature point movement trajectories can be determined from these sequences, and a weighted average of them yields a single, more accurate combined preset trajectory.
For example, the pronunciation score of the audio to be evaluated is determined according to the magnitude of the similarity value in the similarity information, and this score is used as the pronunciation evaluation result. As another example, the similarity value is used to grade the audio to be evaluated as excellent, medium, qualified, unqualified, missing pronunciation and so on, and this grade is used as the pronunciation evaluation result.
With the above pronunciation evaluation method, the audio of the user reading the example sentence text aloud can be input into the video generation model to fit and reconstruct the user's pronunciation organ action video. The position coordinates of the pronunciation organ feature points are determined in every frame of this video to obtain the feature point movement trajectory. Computing the similarity between this trajectory and the standard preset trajectory corresponding to the example sentence text yields similarity information about the pronunciation organ's articulation actions, from which the pronunciation evaluation result is obtained. Because pronunciation is directly related to the movements of the pronunciation organs, the evaluation result obtained in this way is more accurate.
Generating the pronunciation evaluation result of the audio to be evaluated from the similarity information may further include the following steps:
Perform spectrum analysis on the audio to be evaluated and extract sound spectrum feature information; compute the similarity between the extracted spectrum features and the standard sound spectrum features corresponding to the example sentence text to obtain spectrum similarity information; and combine the spectrum similarity information with the similarity information determined above from the pronunciation organ feature point trajectories to obtain the pronunciation evaluation result.
In this way, on top of the pronunciation accuracy computed from the single dimension of the sound spectrum, the similarity information determined from the pronunciation organ feature point trajectories is further combined to produce a more precise pronunciation evaluation result, which further improves the accuracy of the evaluation.
Because of individual differences, different users read the same example sentence text at different speeds. In other words, the duration of the audio to be evaluated depends on how fast the user speaks, i.e. it is variable. When the durations differ, the numbers of audio frames differ, and feeding audio of different durations into the video generation model yields pronunciation organ action videos of different durations, and therefore of different numbers of video frames. If the number of video frames in the pronunciation organ action video corresponding to the audio to be evaluated differs from the number of frames in the pronunciation organ standard action video corresponding to the example sentence text, the lengths of the pronunciation organ feature point trajectory and the preset trajectory corresponding to the example sentence text will be inconsistent, and the similarity computed between them will contain a large error. The present disclosure therefore provides the following two embodiments to avoid this problem.
In detail, in one implementable embodiment, before computing the similarity between the pronunciation organ feature point trajectory and the preset trajectory corresponding to the example sentence text, the number of feature point position coordinates in the pronunciation organ feature point trajectory is adjusted according to the number of coordinates making up the preset trajectory, so that the two trajectories contain the same number of feature point position coordinates.
For example, suppose the preset trajectory contains five feature point position coordinates, A, B, C, D and E, while the measured trajectory contains four coordinates, a, b, c and e. The number of coordinates in the measured trajectory can then be adjusted, for instance by inserting a feature point f(0, 0) into the current pronunciation organ feature point trajectory to obtain a trajectory composed of coordinates a, b, c, f and e. The insertion position of f(0, 0) can be determined from the position of the missing phoneme in the audio to be evaluated. It is easy to understand that, once the phonemes of the audio to be evaluated and the phonemes of the known example sentence text are known through the ASR model, the missing phonemes in the audio to be evaluated can be identified (and likewise any extra phonemes, so that the number of coordinates in the trajectory can also be adjusted by removing points).
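The coordinate adjustment in this example could be sketched as follows; the filler value (0, 0) follows the example above, while the function and its inputs are hypothetical.

    def pad_trajectory(traj, missing_positions, filler=(0.0, 0.0)):
        """Insert a filler coordinate at each index where a phoneme is missing.

        `traj` is a list of (x, y) coordinates from the user's pronunciation organ action
        video; `missing_positions` are indices of missing phonemes obtained by comparing
        the ASR phonemes of the audio against the example sentence text.
        """
        padded = list(traj)
        for pos in sorted(missing_positions):
            padded.insert(pos, filler)
        return padded

    # Example from the text: measured a, b, c, e vs. preset A, B, C, D, E;
    # the phoneme at index 3 is missing, so f(0, 0) is inserted there.
    traj = [(1.0, 1.0), (2.0, 1.5), (3.0, 2.0), (5.0, 3.0)]   # a, b, c, e
    print(pad_trajectory(traj, missing_positions=[3]))         # -> a, b, c, (0, 0), e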
In another implementable embodiment, before determining the position coordinates of the pronunciation organ feature points in every frame of the pronunciation organ action video, the number of video frames in the pronunciation organ action video is adjusted according to the number of frames in the pronunciation organ standard action video corresponding to the example sentence text, so that the two videos contain the same number of frames.
It is easy to understand that, when the pronunciation organ standard action video and the pronunciation organ action video of the audio to be evaluated contain the same number of frames, and on the premise that one video frame corresponds to one feature point, the preset feature point trajectory corresponding to the example sentence text and the feature point trajectory of the audio to be evaluated contain the same number of feature point position coordinates.
For example, suppose the pronunciation organ standard action video contains five frames, numbered 1, 2, 3, 4 and 5, while the pronunciation organ action video of the audio to be evaluated contains three frames, numbered 1, 4 and 5. Frame interpolation can then be applied to the frame sequence 1, 4, 5: for instance, frames 1 and 4 can be inserted into the current sequence to obtain the sequence 1, 1, 4, 4, 5, or empty frames 0 can be inserted to obtain the sequence 1, 0, 0, 4, 5.
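Both padding strategies from this example could be sketched as follows; where the filler frames are placed is an illustrative choice, not mandated by the disclosure.

    def match_frame_count(frames, target_len, mode="repeat"):
        """Pad a shorter frame sequence to the length of the standard action video.

        mode="repeat" duplicates early frames (e.g. [1, 4, 5] -> [1, 1, 4, 4, 5]);
        mode="empty" inserts placeholder frames 0 after the first frame
        (e.g. [1, 4, 5] -> [1, 0, 0, 4, 5]).
        """
        deficit = target_len - len(frames)
        if deficit <= 0:
            return list(frames)
        if mode == "empty":
            return [frames[0]] + [0] * deficit + list(frames[1:])
        padded = []
        for f in frames:
            padded.append(f)
            if deficit > 0:
                padded.append(f)   # duplicate this frame once
                deficit -= 1
        return padded

    print(match_frame_count([1, 4, 5], 5, mode="repeat"))   # [1, 1, 4, 4, 5]
    print(match_frame_count([1, 4, 5], 5, mode="empty"))    # [1, 0, 0, 4, 5]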
In an implementable embodiment, in order to further locate, on top of determining the pronunciation evaluation result of the audio to be evaluated, which phoneme or which word in the audio is pronounced inaccurately (or not pronounced at all), the step described in step S13 above of determining the position coordinates of the pronunciation organ feature points in every video frame of the pronunciation organ action video to obtain the feature point movement trajectory may further include the following steps:
Divide the audio to be evaluated according to a preset pronunciation evaluation granularity to obtain multiple sub-audios to be evaluated; in every frame of the pronunciation organ action video, determine the position coordinates of the pronunciation organ feature points corresponding to each sub-audio to be evaluated, thereby obtaining a pronunciation organ feature point movement trajectory segment for each sub-audio.
The preset pronunciation evaluation granularity is a pronunciation evaluation unit set according to user requirements. It may be a phoneme, character, word, sentence, paragraph, passage and so on, which the present disclosure does not specifically limit. When dividing the audio to be evaluated according to the preset granularity, the audio may specifically be divided according to the duration corresponding to the preset granularity, yielding multiple sub-audios to be evaluated.
Once the sub-audios to be evaluated are determined, a pronunciation organ feature point movement trajectory segment corresponding to each sub-audio can be obtained from the pronunciation organ action video. Specifically, the position coordinates of the pronunciation organ feature points corresponding to each sub-audio may be determined in every frame of the video, yielding a trajectory segment per sub-audio. Alternatively, after the complete feature point trajectory of the pronunciation organ for the entire audio to be evaluated has been obtained, it can be divided in the same way the sub-audios were divided, again yielding a trajectory segment for each sub-audio.
Correspondingly, after the trajectory segment of each sub-audio to be evaluated is obtained, for each sub-audio the similarity between its pronunciation organ feature point trajectory segment and the corresponding preset feature point trajectory segment is computed, giving a first similarity value for that sub-audio; the similarity information includes the first similarity value of every sub-audio to be evaluated.
A preset feature point trajectory segment is a segment of the complete preset feature point trajectory. It is obtained in a manner similar to dividing the complete feature point trajectory of the entire audio to be evaluated into per-sub-audio segments, which is not repeated here.
Referring to Fig. 2, the flowchart of the method for locating which phoneme or word in the audio to be evaluated is pronounced inaccurately includes steps S21 to S28.
S21: Acquire the audio to be evaluated, which is the audio of the user reading the example sentence text aloud.
S22: Input the audio to be evaluated into the video generation model, and obtain the pronunciation organ action video, output by the video generation model, corresponding to the audio to be evaluated.
S23: Divide the audio to be evaluated according to the preset pronunciation evaluation granularity to obtain multiple sub-audios to be evaluated.
S24: In every frame of the pronunciation organ action video, determine the position coordinates of the pronunciation organ feature points corresponding to each sub-audio to be evaluated, obtaining a pronunciation organ feature point movement trajectory segment for each sub-audio.
S25: For each sub-audio to be evaluated, compute the similarity between its pronunciation organ feature point trajectory segment and the corresponding preset feature point trajectory segment, obtaining the first similarity value of that sub-audio.
S26: Determine the target first similarity values that are smaller than a preset threshold, and determine the target sub-audios to be evaluated corresponding to those values.
The preset threshold may be a value such as 90% or 98%. When a first similarity value is smaller than the preset threshold, the target sub-audio corresponding to that value is determined to be pronounced inaccurately. The magnitude of the first similarity value represents how similar the target sub-audio to be evaluated is to the standard pronunciation corresponding to it.
S27: Determine the target example sentence text segment according to the target sub-audio to be evaluated, the target example sentence text segment being a segment of the example sentence text.
Once a target sub-audio with inaccurate pronunciation has been determined, the target example sentence text segment corresponding to it can be determined. This segment may include one or more phonemes, characters, words, sentences and so on.
S28: Display the target example sentence text segment in association with the target first similarity value to obtain the pronunciation evaluation result, so as to remind the user of the mispronounced text segment.
With the approach shown in Fig. 2, it is possible to locate which phoneme or word in the audio to be evaluated is pronounced inaccurately and to make the user aware of it, so that the user can practise the mispronounced part in a targeted way. For example, the pronunciation organ standard action video and the standard preset feature point trajectory segment corresponding to the mispronounced part can be shown to the user, and at the same time the user's own pronunciation organ action video and the inaccurate feature point trajectory segment for that part can also be shown, so that the user knows which pronunciation is inaccurate and how it differs from the standard pronunciation.
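Steps S26 and S27 amount to thresholding the per-sub-audio similarity values and mapping the low-scoring sub-audios back to text segments, which could be sketched as follows; the threshold and the data layout are assumptions.

    def locate_mispronunciations(similarities, text_segments, threshold=0.9):
        """Return the example-text segments whose first similarity value falls below the threshold.

        `similarities[i]` is the first similarity value of the i-th sub-audio to be evaluated
        and `text_segments[i]` is the example sentence text segment it was divided from;
        the 0.9 threshold mirrors the 90% example given above.
        """
        return [(seg, sim) for seg, sim in zip(text_segments, similarities) if sim < threshold]

    # e.g. the word "think" scored 0.72 and would be flagged for the user
    print(locate_mispronunciations([0.95, 0.72, 0.91], ["I", "think", "so"]))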
Since sound is produced by multiple pronunciation organs acting together, the pronunciation organ action video in the embodiments of the present disclosure includes the action of at least one of the following organs: upper lip, lower lip, upper teeth, lower teeth, gums, hard palate, soft palate, uvula, tongue tip, tongue surface, tongue root, nasal cavity, oral cavity, pharynx, epiglottis, esophagus, trachea, vocal cords, or larynx. The feature point movement trajectory (or trajectory segment) of the pronunciation organs includes the feature point movement trajectory (or trajectory segment) of each organ appearing in the pronunciation organ action video.
In other words, using the methods of the above embodiments of the present disclosure, the feature point movement trajectory (or trajectory segment) of any pronunciation organ can be obtained.
For the feature point trajectory (or trajectory segment) of each pronunciation organ, the similarity between that organ's trajectory (or trajectory segment) and the organ's preset trajectory (or preset trajectory segment) under the example sentence text can be computed, giving a second similarity value. The second similarity value represents how similar one pronunciation organ's feature point trajectory (or trajectory segment) is to that organ's standard preset feature point trajectory (or preset trajectory segment).
Further, the target second similarity values that are smaller than a threshold can be determined, and the target pronunciation organs can be determined from them. In this way it can be established which specific organ or organs, among the multiple pronunciation organs, performed an incorrect articulation action and thereby caused the mispronunciation of the example sentence text (or text segment).
In this way, on the basis of locating which phoneme or word in the audio to be evaluated is pronounced inaccurately, it is further possible to locate which one or more pronunciation organs caused the inaccuracy. Showing the user the pronunciation organ standard action video and the standard preset feature point trajectory of those organs helps the user carry out targeted corrective learning of the pronunciation organ actions.
The pronunciation organ action video is a magnetic resonance imaging (MRI) video; correspondingly, the sample pronunciation organ action video used to train the video generation model is also an MRI video, and it includes the action of at least one of the following pronunciation organs: upper lip, lower lip, upper teeth, lower teeth, gums, hard palate, soft palate, uvula, tongue tip, tongue surface, tongue root, nasal cavity, oral cavity, pharynx, epiglottis, esophagus, trachea, vocal cords, or larynx.
In addition, since the pronunciation organs also include power organs such as the lungs, diaphragm and trachea, the pronunciation organ action video and the sample pronunciation organ action video may further include the action of at least one of the lungs, diaphragm and trachea.
After the mispronounced phoneme or word is obtained, the action difference information between the mispronounced phoneme or word and the correct action video can be matched against preset pronunciation action suggestion information to obtain the target action suggestion information matching the action difference information.
For example, after the mispronounced word is obtained, if the action difference information indicates that the position of the palate in the user's pronunciation organ action video is lower than in the pronunciation organ standard action video, the corresponding target action suggestion information "raise the palate" can be matched; if the action difference information indicates that the tongue position in the user's video is further back than in the standard video, the corresponding target action suggestion information "move the tongue forward" can be matched.
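Such matching could be realized, for instance, as a lookup from action difference descriptors to preset suggestion text; the rule table and descriptors below are hypothetical and only illustrate the pattern.

    # Hypothetical rule table mapping an (organ, deviation) pair to preset suggestion text;
    # the organs, deviation labels and wording are illustrative, not a fixed rule set of the disclosure.
    SUGGESTIONS = {
        ("palate", "too_low"): "raise the palate",
        ("tongue", "too_far_back"): "move the tongue forward",
    }

    def suggest(action_differences):
        """Match action difference information to preset pronunciation action suggestions."""
        return [SUGGESTIONS[key] for key in action_differences if key in SUGGESTIONS]

    # e.g. comparing the user's video with the standard video yielded these two deviations
    print(suggest([("palate", "too_low"), ("tongue", "too_far_back")]))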
S15: Display the pronunciation evaluation information to the user.
The displayed pronunciation evaluation information may be at least one of the user's pronunciation scoring information, the pronunciation action suggestion information, or the comparison video of the pronunciation organ action video and the pronunciation organ standard action video; any two of the three may also be displayed together, or all three may be displayed at the same time.
Considering that MRI images are not very clear and that non-professionals are not familiar enough with organ shapes, it is difficult for users to extract information from MRI videos. Therefore, when the original pronunciation organ action video and the original pronunciation organ standard action video are MRI videos, the pronunciation organ action video or the pronunciation organ standard action video can be rendered frame by frame through an animation generation model to obtain a pronunciation organ animation video, which is then displayed in place of the pronunciation organ action video or the pronunciation organ standard action video.
The training samples of the animation generation model include multiple MRI sample images and the animated organ image corresponding to each MRI sample image, and are obtained as follows: determine the position of the organs in each MRI sample image; at the position of each organ in the MRI sample image, generate an animated organ corresponding to that position, thereby obtaining the animated organ image.
An MRI video consists of multiple video frames. When generating the animation, all video frames may be input into the animation generation model; after the animation frames output by the model are obtained, they can be recombined in the original order of the video frames to obtain the animation video corresponding to those video frames.
In a possible embodiment, video frames may also be selected at intervals of a preset number of frames for input into the animation generation model; in this way, after the animation frames generated by the model are obtained, intermediate frames can be interpolated between them to produce a smooth animation video. This reduces the workload of the animation generation model, lowers the consumption of computing resources and improves the efficiency of animation generation.
The animation generation model may be any machine learning model capable of learning from samples, such as a generative adversarial network model, a recurrent neural network model or a convolutional network model, which the present disclosure does not limit. The training samples of the model include multiple MRI sample images and the animated organ image corresponding to each MRI sample image; by learning from the training samples, the animation generation model can generate a corresponding animated image from an input MRI image, thereby converting MRI video frames into animation frames.
The animation generation model can output the animation frames corresponding to the video frames in the order in which the video frames are input, with the pronunciation organ positions in each animation frame filled in by animated pronunciation organs, which makes the frames easier for the user to view and understand.
In a possible embodiment, different colors can be filled in for the different animated pronunciation organs according to the organ, and the organ name can also be labeled on the animated organ. For example, the palate area can be filled with light yellow and labeled "palate", the tongue area filled with bright red and labeled "tongue", and the teeth filled with white and labeled "teeth". In this way, the position of each organ and the connections between them are shown more intuitively, which is easier for the user to understand.
It should be noted that the above color filling and name labeling are described only as an example; the present disclosure does not limit how organ colors are filled or how names are labeled. For instance, the names may also be labeled in a foreign language, or phonetic symbols or pinyin for the pronunciation may be added.
Recombining the animation frames in the order of the video frames yields the complete animation video. The playback speed of the animation frames may match that of the video frames, or may be adjusted according to the application. For example, when the animation video is used in an educational scenario, the playback speed of the animation frames can be reduced to show the movement and exertion of the pronunciation organs more clearly. When the playback speed is reduced, intermediate frames can also be interpolated between frames to increase the frame count and keep the animation video smooth.
In a possible embodiment, the animation generation model is a generative adversarial network model that includes a generator for generating animated images from MRI images, and the animation generation model is trained as follows:
Repeatedly perform the steps of: the generator generating a training animation image from an MRI sample image; generating a loss value from the animated pronunciation organ image corresponding to the MRI sample image and a preset loss function; adjusting the parameters of the generator based on the loss value; and the discriminator of the generative adversarial network model evaluating the training animation image based on the animated pronunciation organ image, until the evaluation result satisfies a preset evaluation result condition.
The generator generates images from the input data, and the discriminator evaluates whether the images output by the generator share consistent features with the images in a specified set, i.e. whether a picture belongs to that set. The discriminator's evaluation may be correct or incorrect: when the generator's output differs obviously in its features from the pictures in the specified set, the discriminator's evaluation is usually correct, meaning it can correctly judge whether a picture belongs to the set; when that difference is no longer obvious, it becomes difficult for the discriminator to always judge correctly. The training stop condition can therefore be set through a threshold on the proportion of correct discriminator judgements, so that the images produced by the generator better match the features of the training targets in the training set.
Before training the generator, the discriminator can also be pre-trained. For example, random features are fed to the generator to produce an image, the discriminator evaluates whether the image's features are consistent with the animated organ images in the training samples, and the discriminator's parameters are adjusted according to whether the evaluation is correct, until the discriminator can correctly judge whether an image generated by the generator is consistent with the animated pronunciation organ images in the training samples. After the discriminator is trained, it can in turn be used to train the generator. It is worth noting that the generator and the discriminator can also be trained synchronously, so that they constrain each other: the generator's images better match the features of the animated pronunciation organ images, while the discriminator evaluates images more correctly.
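A compact sketch of this adversarial training pattern is given below, using PyTorch with toy fully-connected networks; the architectures, image size and loss weighting are assumptions, and only the generator/discriminator update pattern follows the description above.

    import torch
    import torch.nn as nn

    IMG = 64 * 64
    generator = nn.Sequential(nn.Linear(IMG, 256), nn.ReLU(), nn.Linear(256, IMG), nn.Sigmoid())
    discriminator = nn.Sequential(nn.Linear(IMG, 256), nn.ReLU(), nn.Linear(256, 1))

    g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
    adv_loss, recon_loss = nn.BCEWithLogitsLoss(), nn.L1Loss()

    def train_step(mri_batch, animated_batch):
        """One adversarial update: the discriminator judges real vs. generated animation frames,
        and the generator is pushed toward the paired animated organ images."""
        fake = generator(mri_batch)

        # Discriminator: real animated organ images -> 1, generated images -> 0
        d_opt.zero_grad()
        d_loss = adv_loss(discriminator(animated_batch), torch.ones(len(mri_batch), 1)) + \
                 adv_loss(discriminator(fake.detach()), torch.zeros(len(mri_batch), 1))
        d_loss.backward()
        d_opt.step()

        # Generator: fool the discriminator while staying close to the paired animation target
        g_opt.zero_grad()
        g_loss = adv_loss(discriminator(fake), torch.ones(len(mri_batch), 1)) + \
                 recon_loss(fake, animated_batch)
        g_loss.backward()
        g_opt.step()
        return d_loss.item(), g_loss.item()

    mri = torch.rand(8, IMG)    # placeholder batch of flattened MRI frames
    anim = torch.rand(8, IMG)   # paired animated organ images
    print(train_step(mri, anim))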
In a possible embodiment, the training samples are obtained as follows: determine the position of the pronunciation organs in each MRI sample image, and generate, at the position of the pronunciation organs in each MRI sample image, animated pronunciation organs corresponding to those positions, thereby obtaining the animated pronunciation organ image.
The position of each organ can be distinguished by outlining the color block regions in the MRI sample image; the pronunciation organ positions can also be identified by a recognition model, or an organ template image can be overlaid on the MRI sample image and regions merged in the MRI sample image based on the organ positions of the template, with the color block of the region where a pronunciation organ lies taken as the position of that organ.
In a possible embodiment, for each MRI sample image, the organ contours of the MRI sample image are extracted, and the organ contour of each pronunciation organ is filled with the organ image corresponding to that organ.
The organ image may be a cartoon image or a realistic image. In a possible embodiment, organ textures can be retrieved from a preset flash animation library, and the organ contour of each pronunciation organ is filled with the texture corresponding to that organ. It should be noted that the flash animation library may contain multiple textures for the same pronunciation organ; one texture may be selected automatically for filling, or the texture type to be filled may be modified according to the user's specification.
In a possible embodiment, for the MRI sample image corresponding to the first frame of the MRI sample video, organ textures are retrieved from the preset flash animation library and the organ contour of each pronunciation organ is filled with the texture corresponding to that organ; for the MRI sample images corresponding to the other video frames, the organ textures corresponding to the pronunciation organs in the first-frame MRI sample image are retrieved from the flash animation library and used to fill the contours of the corresponding pronunciation organs.
In other words, after the first frame is filled with textures, the other frames can be filled based on the texture types used for the first frame, so that the texture style of the same pronunciation organ is consistent across all animation frames, making the final animation video look more natural.
For example, if the flash animation library contains three textures for the tongue and four textures for the teeth, and when filling the MRI sample image of the first frame the tongue-1 texture is chosen for the tongue contour and the teeth-3 texture for the teeth contour, then when the subsequent frames are filled, the tongue-1 texture can automatically be selected to fill the tongue contour and the teeth-3 texture to fill the teeth contour.
Considering that organ contour extraction may contain deviations, in a possible embodiment the organ contours can be corrected after they are extracted. The contours may be corrected frame by frame, or, after the contours of the first frame are corrected, the contours can be tracked by feature point recognition to achieve contour correction in the other frames.
In a possible embodiment, for the MRI sample image corresponding to the first frame of the MRI sample video, the organ contours in that image are adjusted based on the MRI sample image so that the pronunciation organ contours correspond to the feature points in the image; for the MRI sample images corresponding to the other video frames, feature point tracking is performed between the feature points in the current image and those in the previous video frame, and the organ contours in the current image are automatically adjusted based on the tracking result.
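Such frame-to-frame contour propagation could be sketched as follows with Lucas-Kanade feature point tracking; the tracker choice and the handling of unreliably tracked points are assumptions.

    import cv2
    import numpy as np

    def propagate_contour(prev_frame, cur_frame, prev_contour):
        """Move a corrected organ contour from one MRI frame to the next by feature point tracking.

        `prev_frame` and `cur_frame` are 8-bit grayscale MRI frames; `prev_contour` is an
        (N, 1, 2) float32 array of contour points located on `prev_frame`.
        """
        cur_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_frame, cur_frame, prev_contour, None)
        ok = status.ravel() == 1
        # Keep only reliably tracked points as the adjusted contour for the current frame
        return cur_pts[ok]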
It is worth noting that steps S11 to S15 of this embodiment may all be executed on the user terminal. Optionally, to reduce the computing load on the terminal, steps S13 and S14 may also be executed on a server: after collecting the user's audio to be evaluated, the user terminal sends the audio to the server, and after processing the audio the server returns the pronunciation evaluation information to the user terminal.
Through the above technical solutions, at least the following technical effects can be achieved:
By acquiring the audio to be evaluated that the user reads aloud based on the example sentence text, and generating pronunciation evaluation information from the pronunciation organ action video generated from that audio and the pronunciation organ standard action video corresponding to the example sentence text, the user's pronunciation can be evaluated more accurately, and whether the user's pronunciation is accurate is reflected more intuitively.
Fig. 3 is a block diagram of a pronunciation evaluation apparatus according to an exemplary disclosed embodiment. As shown in Fig. 3, the pronunciation evaluation apparatus 300 includes:
an example sentence display module 310, configured to display the example sentence text to the user;
an audio collection module 320, configured to collect the audio to be evaluated that the user reads aloud based on the example sentence text;
a video generation module 330, configured to generate a pronunciation organ action video reflecting the actions of the user's pronunciation organs when reading the example sentence text aloud;
a pronunciation evaluation module 340, configured to generate pronunciation evaluation information based on the pronunciation organ action video and the pronunciation organ standard action video corresponding to the example sentence text;
an evaluation display module 350, configured to display the pronunciation evaluation information to the user.
在一种可能的实施方式中,所述发音评价信息包括对所述用户的发音打分信息、发音动作建议信息、或所述发音器官动作视频与所述发音器官标准动作视频的对比视频中的至少一者。In a possible implementation, the pronunciation evaluation information includes at least one of the pronunciation scoring information of the user, the pronunciation action suggestion information, or the comparison video of the articulator action video and the articulator standard action video one.
在一种可能的实施方式中,所述例句展示模块310,用于基于所述例句文本生成例句音频;将所述例句音频与所述发音器官标准动作视频合成为例句演示视频;向用户展示例句文本和所述例句演示视频。In a possible implementation, the example sentence display module 310 is configured to generate example sentence audio based on the example sentence text; synthesize the example sentence audio and the standard action video of the pronunciation organ into an example sentence demonstration video; display the example sentence to the user Text and demo video of said example sentences.
在一种可能的实施方式中,所述发音评价模块340,用于通过对比所述发音器官动作视频和所述例句文本对应的发音器官标准动作视频,得到动作差异信息;根据所述动作差异信息生成发音打分信息,和/或,根据所述动作差异信息与预设的发音动作建议信息进行匹配,得到与所述动作差异信息相匹配的目标动作建议信息。In a possible implementation, the pronunciation evaluation module 340 is configured to obtain action difference information by comparing the pronunciation organ action video and the pronunciation organ standard action video corresponding to the example text; according to the action difference information Pronunciation scoring information is generated, and/or, according to the action difference information and preset pronunciation action suggestion information, target action suggestion information matching the action difference information is obtained.
在一种可能的实施方式中,所述动作差异信息为发音器官的特征点运动轨迹的差异信息。In a possible implementation manner, the action difference information is difference information of movement trajectories of feature points of speech organs.
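A hedged sketch of how such trajectory differences might be turned into scoring information and matched against preset suggestions is given below; the trajectories are assumed to be pre-extracted, time-aligned arrays, and the thresholds, suggestion texts, and score mapping are illustrative assumptions rather than the disclosed scheme.

```python
# Hedged sketch: scoring from feature-point trajectory differences.
# user_traj / ref_traj: arrays of shape (num_frames, num_points, 2), time-aligned.
import numpy as np

def trajectory_difference(user_traj: np.ndarray, ref_traj: np.ndarray) -> float:
    # Mean Euclidean distance between corresponding feature points per frame.
    return float(np.linalg.norm(user_traj - ref_traj, axis=-1).mean())

def score_and_suggestion(user_traj: np.ndarray, ref_traj: np.ndarray):
    diff = trajectory_difference(user_traj, ref_traj)
    score = max(0.0, 100.0 - 100.0 * diff)  # assumed mapping to a 0-100 score
    # Match the difference against preset suggestion rules (illustrative thresholds).
    if diff < 0.05:
        suggestion = "Pronunciation is close to the reference."
    elif diff < 0.15:
        suggestion = "Open the mouth slightly wider and raise the tongue tip."
    else:
        suggestion = "Watch the comparison video and imitate the tongue movement."
    return score, suggestion
```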
在一种可能的实施方式中，所述发音评价模块，用于基于例句文本的单位文本内容，将所述发音器官动作视频和所述发音器官标准动作视频中表征同一单位文本内容的视频片段作为一组视频片段组；将各视频片段组中属于所述发音器官动作视频和所述发音器官标准动作视频的视频片段进行对齐；将对齐后的所述发音器官动作视频和所述发音器官标准动作视频拼接，得到所述对比视频。In a possible implementation, the pronunciation evaluation module is configured to, based on the unit text content of the example sentence text, take the video clips representing the same unit text content in the articulator action video and the articulator standard action video as a video clip group; align, within each video clip group, the video clips belonging to the articulator action video and to the articulator standard action video; and splice the aligned articulator action video and articulator standard action video to obtain the comparison video.
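The following is a minimal sketch of the comparison-video assembly described above, assuming each clip is a numpy array of frames and that per-unit segment boundaries (e.g., per phoneme or word) are already known from forced alignment; the naive resampling used for temporal alignment is an assumption for illustration.

```python
# Minimal sketch of comparison-video assembly. Each clip: (T, H, W, 3) frames.
import numpy as np

def align_pair(user_clip: np.ndarray, ref_clip: np.ndarray) -> tuple:
    # Naive temporal alignment: resample the user clip to the reference length.
    t_ref = ref_clip.shape[0]
    idx = np.linspace(0, user_clip.shape[0] - 1, t_ref).round().astype(int)
    return user_clip[idx], ref_clip

def build_comparison_video(user_segments, ref_segments) -> np.ndarray:
    # user_segments / ref_segments: lists of clips, one per unit text content.
    pairs = [align_pair(u, r) for u, r in zip(user_segments, ref_segments)]
    # Place user and reference frames side by side, then concatenate the segments.
    stitched = [np.concatenate([u, r], axis=2) for u, r in pairs]  # width-wise
    return np.concatenate(stitched, axis=0)                        # time-wise
```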
在一种可能的实施方式中，所述视频生成模块330，用于将所述待评价音频转换成待处理音频特征向量；将所述待处理音频特征向量输入视频生成模型，得到所述视频生成模型输出的与所述待评价音频对应的发音器官动作视频。In a possible implementation, the video generation module 330 is configured to convert the audio to be evaluated into an audio feature vector to be processed, and input the audio feature vector to be processed into a video generation model to obtain the articulator action video output by the video generation model and corresponding to the audio to be evaluated.
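A hedged inference sketch of this module is shown below; the per-frame acoustic feature and the model are placeholders (any frame-level feature such as a phoneme posterior vector could be used), not the actual disclosed model.

```python
# Hedged inference sketch for the video generation model (placeholder model).
import numpy as np
import torch

def generate_articulator_video(model: torch.nn.Module, audio_features: np.ndarray) -> np.ndarray:
    # audio_features: (num_frames, feature_dim), one vector per audio frame.
    x = torch.from_numpy(audio_features).float().unsqueeze(0)  # add batch dim
    with torch.no_grad():
        frames = model(x)        # assumed output shape: (1, num_frames, H, W)
    return frames.squeeze(0).cpu().numpy()
```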
在一种可能的实施方式中，发音评价装置300还包括视频生成模型训练模块，被配置为根据样本音频以及与所述样本音频对应的样本发音器官动作视频构建模型训练数据；根据所述模型训练数据训练得到所述视频生成模型。In a possible implementation, the pronunciation evaluation apparatus 300 further includes a video generation model training module configured to construct model training data according to sample audio and a sample articulator action video corresponding to the sample audio, and to train the video generation model according to the model training data.
在一种可能的实施方式中，视频生成模型训练模块进一步被配置为将所述样本音频中的每一帧音频转换成样本音素后验概率向量，得到包括至少一个样本音素后验概率向量的样本音素后验概率向量序列；基于所述样本发音器官动作视频，提取与所述样本音素后验概率向量序列中每一所述样本音素后验概率向量对应的样本发音器官视频特征，得到样本发音器官视频特征序列；将所述样本音素后验概率向量序列和所述样本发音器官视频特征序列作为所述模型训练数据。In a possible implementation, the video generation model training module is further configured to convert each frame of audio in the sample audio into a sample phoneme posterior probability vector to obtain a sample phoneme posterior probability vector sequence including at least one sample phoneme posterior probability vector; extract, based on the sample articulator action video, a sample articulator video feature corresponding to each sample phoneme posterior probability vector in the sample phoneme posterior probability vector sequence to obtain a sample articulator video feature sequence; and use the sample phoneme posterior probability vector sequence and the sample articulator video feature sequence as the model training data.
在一种可能的实施方式中,所述样本发音器官视频特征为所述样本发音器官动作视频中的至少一帧视频图像的像素点特征信息或主成分特征信息中的至少一种。In a possible implementation manner, the sample vocal organ video feature is at least one of pixel point feature information or principal component feature information of at least one frame of video image in the sample vocal organ motion video.
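For illustration, training pairs of the kind described above could be assembled as follows; the acoustic model producing the phoneme posteriors is assumed to exist elsewhere, and the PCA dimensionality is an arbitrary choice for the sketch.

```python
# Illustrative construction of (phoneme posterior sequence, video feature sequence)
# training pairs; the posterior extractor and PCA dimensionality are assumptions.
import numpy as np
from sklearn.decomposition import PCA

def build_training_pair(frame_posteriors: np.ndarray, video_frames: np.ndarray, n_components: int = 32):
    # frame_posteriors: (T, num_phonemes) - one posterior vector per audio frame.
    # video_frames:     (T, H, W) grayscale MRI frames aligned to the audio frames.
    flat = video_frames.reshape(video_frames.shape[0], -1)   # pixel-point features
    pca = PCA(n_components=n_components).fit(flat)
    video_features = pca.transform(flat)                     # principal-component features
    assert frame_posteriors.shape[0] == video_features.shape[0]
    return frame_posteriors, video_features
```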
在一种可能的实施方式中，所述视频生成模块330，还用于将所述例句文本分割为单位文本序列；将所述单位文本序列输入视频特征生成模型，得到视频特征序列；基于所述视频特征序列生成发音器官标准动作视频；其中，所述视频特征生成模型是通过如下方式训练得到的：将样本文本分割为样本单位文本序列；根据样本单位文本序列以及与所述样本单位文本序列对应的样本发音器官动作视频的样本视频特征序列构建模型训练数据；根据所述模型训练数据训练得到所述视频特征生成模型。In a possible implementation, the video generation module 330 is further configured to divide the example sentence text into a unit text sequence; input the unit text sequence into a video feature generation model to obtain a video feature sequence; and generate the articulator standard action video based on the video feature sequence. The video feature generation model is trained as follows: dividing sample text into a sample unit text sequence; constructing model training data according to the sample unit text sequence and a sample video feature sequence of the sample articulator action video corresponding to the sample unit text sequence; and training the video feature generation model according to the model training data.
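A hedged sketch of standard-action-video generation from example text follows; the split into unit texts (characters stand in for phonemes or syllables here), the vocabulary, the feature model, and the frame decoder are all assumptions for illustration.

```python
# Hedged sketch: example text -> unit text sequence -> video feature sequence -> frames.
import numpy as np
import torch

def generate_standard_video(example_text: str, unit_vocab: dict,
                            feature_model: torch.nn.Module, decode_frame) -> np.ndarray:
    # Split the example text into unit texts (characters used here as a stand-in
    # for phonemes or syllables) and map them to ids with an assumed vocabulary.
    units = list(example_text.replace(" ", ""))
    ids = torch.tensor([[unit_vocab[u] for u in units]])
    with torch.no_grad():
        feature_seq = feature_model(ids).squeeze(0).numpy()   # assumed (T, feature_dim)
    # Decode each feature vector back into an image frame (e.g. inverse PCA or a decoder).
    frames = np.stack([decode_frame(f) for f in feature_seq])
    return frames
```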
在一种可能的实施方式中，所述发音器官动作视频和所述发音器官标准动作视频为基于核磁共振MRI视频生成的发音器官动画视频，所述装置还包括视频渲染模块，用于通过动画生成模型，逐帧对所述发音器官动作视频或所述发音器官标准动作视频进行渲染，得到发音器官动画视频。In a possible implementation, the articulator action video and the articulator standard action video are articulator animation videos generated based on magnetic resonance imaging (MRI) video, and the apparatus further includes a video rendering module configured to render the articulator action video or the articulator standard action video frame by frame through an animation generation model to obtain the articulator animation video.
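A minimal sketch of this frame-by-frame rendering step is given below, assuming the animation generation model is an image-to-image network (e.g., a pix2pix-style translator); this is not the disclosed model, only an illustration of the per-frame loop.

```python
# Minimal sketch of frame-by-frame rendering with an assumed image-to-image model.
import numpy as np
import torch

def render_animation(anim_model: torch.nn.Module, mri_frames: np.ndarray) -> np.ndarray:
    # mri_frames: (T, H, W) grayscale MRI video; output: (T, H, W, 3) animation frames.
    out = []
    with torch.no_grad():
        for frame in mri_frames:
            x = torch.from_numpy(frame).float()[None, None]   # (1, 1, H, W)
            y = anim_model(x)                                  # assumed (1, 3, H, W)
            out.append(y.squeeze(0).permute(1, 2, 0).cpu().numpy())
    return np.stack(out)
```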
在一种可能的实施方式中，所述动画生成模型的训练样本包括多张MRI样本图像和各MRI样本图像对应的动画发音器官图，并且所述装置还包括训练样本生成模块，被配置为确定各MRI样本图像中的发音器官的位置；在各MRI样本图像中的发音器官的位置，生成与所述发音器官的位置对应的动画发音器官，得到动画发音器官图。In a possible implementation, the training samples of the animation generation model include a plurality of MRI sample images and an animated articulator image corresponding to each MRI sample image, and the apparatus further includes a training sample generation module configured to determine the position of the articulator in each MRI sample image, and generate, at the position of the articulator in each MRI sample image, an animated articulator corresponding to the position of the articulator to obtain the animated articulator image.
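As an illustrative sketch of this training-sample generation, an articulator outline located in an MRI image could be redrawn as a stylized animated organ with OpenCV; the locator callable and the colors are assumptions for the sketch.

```python
# Illustrative training-sample generation: locate an articulator outline in an MRI
# image (via a caller-supplied locator) and draw a stylized animated-organ image.
import cv2
import numpy as np

def make_animated_organ_image(mri_image: np.ndarray, locate_tongue) -> np.ndarray:
    # locate_tongue: callable returning an (N, 2) polygon of the organ outline.
    outline = locate_tongue(mri_image).astype(np.int32)
    canvas = np.full((*mri_image.shape[:2], 3), 255, dtype=np.uint8)  # white background
    cv2.fillPoly(canvas, [outline], color=(0, 0, 255))                # stylized organ
    cv2.polylines(canvas, [outline], True, (0, 0, 0), 2)              # black contour
    return canvas
```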
上述各模块所具体执行的步骤在方法部分实施例中已经进行了详细阐述,在此不做赘述。The specific steps performed by the above modules have been described in detail in some embodiments of the method, and are not repeated here.
通过上述技术方案,至少可以达到以下的技术效果:Through the above technical solutions, at least the following technical effects can be achieved:
通过获取用户基于例句文本朗读的待评价音频，并通过基于待评价音频生成的发音器官动作视频和例句文本对应的发音器官标准动作视频生成发音评价信息，可以更准确地对用户的发音进行评价，从而更直观地体现用户的发音是否准确。By acquiring the audio to be evaluated that the user reads aloud based on the example sentence text, and generating the pronunciation evaluation information from the articulator action video generated from that audio and the articulator standard action video corresponding to the example sentence text, the user's pronunciation can be evaluated more accurately, thereby reflecting more intuitively whether the user's pronunciation is accurate.
下面参考图4,其示出了适于用来实现本公开实施例的电子设备(例如用户设备或服务器)400的结构示意图。本公开实施例中的终端设备可以包括但不限于诸如移动电话、笔记本电脑、数字广播接收器、PDA(个人数字助理)、PAD(平板电脑)、PMP(便携式多媒体播放器)、车载终端(例如车载导航终端)等等的移动终端以及诸如数字TV、台式计算机等等的固定终端。图4示出的电子设备仅仅是一个示例,不应对本公开实施例的功能和使用范围带来任何限制。Referring next to FIG. 4 , it shows a schematic structural diagram of an electronic device (eg, user equipment or server) 400 suitable for implementing an embodiment of the present disclosure. Terminal devices in the embodiments of the present disclosure may include, but are not limited to, such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablets), PMPs (portable multimedia players), vehicle-mounted terminals (eg, mobile terminals such as in-vehicle navigation terminals), etc., and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in FIG. 4 is only an example, and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.
如图4所示，电子设备400可以包括处理装置（例如中央处理器、图形处理器等）401，其可以根据存储在只读存储器（ROM）402中的程序或者从存储装置408加载到随机访问存储器（RAM）403中的程序而执行各种适当的动作和处理。在RAM 403中，还存储有电子设备400操作所需的各种程序和数据。处理装置401、ROM 402以及RAM 403通过总线404彼此相连。输入/输出（I/O）接口405也连接至总线404。As shown in FIG. 4, the electronic device 400 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 401, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage device 408 into a random access memory (RAM) 403. Various programs and data required for the operation of the electronic device 400 are also stored in the RAM 403. The processing device 401, the ROM 402 and the RAM 403 are connected to each other through a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.
通常，以下装置可以连接至I/O接口405：包括例如触摸屏、触摸板、键盘、鼠标、摄像头、麦克风、加速度计、陀螺仪等的输入装置406；包括例如液晶显示器（LCD）、扬声器、振动器等的输出装置407；包括例如磁带、硬盘等的存储装置408；以及通信装置409。通信装置409可以允许电子设备400与其他设备进行无线或有线通信以交换数据。虽然图4示出了具有各种装置的电子设备400，但是应理解的是，并不要求实施或具备所有示出的装置。可以替代地实施或具备更多或更少的装置。Typically, the following devices may be connected to the I/O interface 405: input devices 406 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; output devices 407 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; storage devices 408 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 409. The communication device 409 may allow the electronic device 400 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 4 shows the electronic device 400 having various devices, it should be understood that not all of the illustrated devices are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
特别地,根据本公开的实施例,上文参考流程图描述的过程可以被实现为计算机 软件程序。例如,本公开的实施例包括一种计算机程序产品,其包括承载在非暂态计算机可读介质上的计算机程序,该计算机程序包含用于执行流程图所示的方法的程序代码。在这样的实施例中,该计算机程序可以通过通信装置409从网络上被下载和安装,或者从存储装置408被安装,或者从ROM 402被安装。在该计算机程序被处理装置401执行时,执行本公开实施例的方法中限定的上述功能。In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from the network via the communication device 409, or from the storage device 408, or from the ROM 402. When the computer program is executed by the processing apparatus 401, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are executed.
需要说明的是,本公开上述的计算机可读介质可以是计算机可读信号介质或者计算机可读存储介质或者是上述两者的任意组合。计算机可读存储介质例如可以是——但不限于——电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。计算机可读存储介质的更具体的例子可以包括但不限于:具有一个或多个导线的电连接、便携式计算机磁盘、硬盘、随机访问存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑磁盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。在本公开中,计算机可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。而在本公开中,计算机可读信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了计算机可读的程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。计算机可读信号介质还可以是计算机可读存储介质以外的任何计算机可读介质,该计算机可读信号介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。计算机可读介质上包含的程序代码可以用任何适当的介质传输,包括但不限于:电线、光缆、RF(射频)等等,或者上述的任意合适的组合。It should be noted that the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two. The computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above. More specific examples of computer readable storage media may include, but are not limited to, electrical connections with one or more wires, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable Programmable read only memory (EPROM or flash memory), fiber optics, portable compact disk read only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing. In this disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In the present disclosure, however, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with computer-readable program code embodied thereon. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device . Program code embodied on a computer readable medium may be transmitted using any suitable medium including, but not limited to, electrical wire, optical fiber cable, RF (radio frequency), etc., or any suitable combination of the foregoing.
在一些实施方式中，用户终端、服务器可以利用诸如HTTP（HyperText Transfer Protocol，超文本传输协议）之类的任何当前已知或未来研发的网络协议进行通信，并且可以与任意形式或介质的数字数据通信（例如，通信网络）互连。通信网络的示例包括局域网（“LAN”），广域网（“WAN”），网际网（例如，互联网）以及端对端网络（例如，ad hoc端对端网络），以及任何当前已知或未来研发的网络。In some embodiments, the user terminal and the server may communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication (e.g., a communication network) in any form or medium. Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and a peer-to-peer network (e.g., an ad hoc peer-to-peer network), as well as any currently known or future-developed network.
上述计算机可读介质可以是上述电子设备中所包含的;也可以是单独存在,而未装配入该电子设备中。The above-mentioned computer-readable medium may be included in the above-mentioned electronic device; or may exist alone without being assembled into the electronic device.
上述计算机可读介质承载有一个或者多个程序，当上述一个或者多个程序被该电子设备执行时，使得该电子设备：向用户展示例句文本；采集用户基于所述例句文本朗读的待评价音频；生成反映所述用户朗读所述例句文本时的发音器官的动作的发音器官动作视频；基于所述发音器官动作视频和所述例句文本对应的发音器官标准动作视频生成发音评价信息；向所述用户展示所述发音评价信息。The above computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: display example sentence text to a user; collect audio to be evaluated that the user reads aloud based on the example sentence text; generate an articulator action video reflecting the actions of the user's articulators when reading the example sentence text aloud; generate pronunciation evaluation information based on the articulator action video and the articulator standard action video corresponding to the example sentence text; and display the pronunciation evaluation information to the user.
可以以一种或多种程序设计语言或其组合来编写用于执行本公开的操作的计算机程序代码，上述程序设计语言包括但不限于面向对象的程序设计语言——诸如Java、Smalltalk、C++，还包括常规的过程式程序设计语言——诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算机上执行、部分地在用户计算机上执行、作为一个独立的软件包执行、部分在用户计算机上部分在远程计算机上执行、或者完全在远程计算机或服务器上执行。在涉及远程计算机的情形中，远程计算机可以通过任意种类的网络——包括局域网（LAN）或广域网（WAN）——连接到用户计算机，或者，可以连接到外部计算机（例如利用因特网服务提供商来通过因特网连接）。Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
附图中的流程图和框图，图示了按照本公开各种实施例的系统、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上，流程图或框图中的每个方框可以代表一个模块、程序段、或代码的一部分，该模块、程序段、或代码的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。也应当注意，在有些作为替换的实现中，方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如，两个接连地表示的方框实际上可以基本并行地执行，它们有时也可以按相反的顺序执行，这依所涉及的功能而定。也要注意的是，框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合，可以用执行规定的功能或操作的专用的基于硬件的系统来实现，或者可以用专用硬件与计算机指令的组合来实现。The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It is also noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
描述于本公开实施例中所涉及到的模块可以通过软件的方式实现，也可以通过硬件的方式来实现。其中，模块的名称在某种情况下并不构成对该模块本身的限定，例如，例句展示模块还可以被描述为“向用户展示例句文本的模块”。The modules involved in the embodiments of the present disclosure may be implemented in software or in hardware. The name of a module does not, in some cases, constitute a limitation of the module itself; for example, the example sentence display module may also be described as "a module for displaying example sentence text to a user".
本文中以上描述的功能可以至少部分地由一个或多个硬件逻辑部件来执行。例如,非限制性地,可以使用的示范类型的硬件逻辑部件包括:现场可编程门阵列(FPGA)、专用集成电路(ASIC)、专用标准产品(ASSP)、片上系统(SOC)、复杂可编程 逻辑设备(CPLD)等等。The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chips (SOCs), Complex Programmable Logical Devices (CPLDs) and more.
在本公开的上下文中,机器可读介质可以是有形的介质,其可以包含或存储以供指令执行系统、装置或设备使用或与指令执行系统、装置或设备结合地使用的程序。机器可读介质可以是机器可读信号介质或机器可读储存介质。机器可读介质可以包括但不限于电子的、磁性的、光学的、电磁的、红外的、或半导体系统、装置或设备,或者上述内容的任何合适组合。机器可读存储介质的更具体示例会包括基于一个或多个线的电气连接、便携式计算机盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦除可编程只读存储器(EPROM或快闪存储器)、光纤、便捷式紧凑盘只读存储器(CD-ROM)、光学储存设备、磁储存设备、或上述内容的任何合适组合。In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with the instruction execution system, apparatus or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, devices, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media would include one or more wire-based electrical connections, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), fiber optics, compact disk read only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
根据本公开的一个或多个实施例，示例1提供了一种发音评价方法，所述方法包括：向用户展示例句文本；采集用户基于所述例句文本朗读的待评价音频；生成反映所述用户朗读所述例句文本时的发音器官的动作的发音器官动作视频；基于所述发音器官动作视频和所述例句文本对应的发音器官标准动作视频生成发音评价信息；向所述用户展示所述发音评价信息。According to one or more embodiments of the present disclosure, Example 1 provides a pronunciation evaluation method, the method including: displaying example sentence text to a user; collecting audio to be evaluated that the user reads aloud based on the example sentence text; generating an articulator action video reflecting the actions of the user's articulators when reading the example sentence text aloud; generating pronunciation evaluation information based on the articulator action video and the articulator standard action video corresponding to the example sentence text; and displaying the pronunciation evaluation information to the user.
根据本公开的一个或多个实施例，示例2提供了示例1的方法，所述发音评价信息包括对所述用户的发音打分信息、发音动作建议信息、或所述发音器官动作视频与所述发音器官标准动作视频的对比视频中的至少一者。According to one or more embodiments of the present disclosure, Example 2 provides the method of Example 1, wherein the pronunciation evaluation information includes at least one of pronunciation scoring information for the user, pronunciation action suggestion information, or a comparison video between the articulator action video and the articulator standard action video.
根据本公开的一个或多个实施例，示例3提供了示例1的方法，所述向用户展示例句文本，包括：基于所述例句文本生成例句音频；将所述例句音频与所述发音器官标准动作视频合成为例句演示视频；向用户展示例句文本和所述例句演示视频。According to one or more embodiments of the present disclosure, Example 3 provides the method of Example 1, wherein displaying the example sentence text to the user includes: generating example sentence audio based on the example sentence text; synthesizing the example sentence audio and the articulator standard action video into an example sentence demonstration video; and displaying the example sentence text and the example sentence demonstration video to the user.
根据本公开的一个或多个实施例，示例4提供了示例2的方法，在所述发音评价信息包括所述发音打分信息和/或所述发音动作建议信息的情况下，所述基于所述发音器官动作视频和所述例句文本对应的发音器官标准动作视频生成发音评价信息，包括：通过对比所述发音器官动作视频和所述例句文本对应的发音器官标准动作视频，得到动作差异信息；根据所述动作差异信息生成发音打分信息，和/或，根据所述动作差异信息与预设的发音动作建议信息进行匹配，得到与所述动作差异信息相匹配的目标动作建议信息。According to one or more embodiments of the present disclosure, Example 4 provides the method of Example 2, wherein, in a case where the pronunciation evaluation information includes the pronunciation scoring information and/or the pronunciation action suggestion information, generating the pronunciation evaluation information based on the articulator action video and the articulator standard action video corresponding to the example sentence text includes: obtaining action difference information by comparing the articulator action video with the articulator standard action video corresponding to the example sentence text; and generating pronunciation scoring information according to the action difference information, and/or matching the action difference information against preset pronunciation action suggestion information to obtain target action suggestion information matching the action difference information.
根据本公开的一个或多个实施例,示例5提供了示例4的方法,所述动作差异信息为发音器官的特征点运动轨迹的差异信息。According to one or more embodiments of the present disclosure, Example 5 provides the method of Example 4, where the action difference information is difference information of the movement trajectories of the feature points of the vocal organs.
根据本公开的一个或多个实施例，示例6提供了示例2的方法，所述对比视频是通过以下的方式生成的：基于例句文本的单位文本内容，将所述发音器官动作视频和所述发音器官标准动作视频中表征同一单位文本内容的视频片段作为一组视频片段组；将各视频片段组中属于所述发音器官动作视频和所述发音器官标准动作视频的视频片段进行对齐；将对齐后的所述发音器官动作视频和所述发音器官标准动作视频拼接，得到所述对比视频。According to one or more embodiments of the present disclosure, Example 6 provides the method of Example 2, wherein the comparison video is generated by: based on the unit text content of the example sentence text, taking the video clips representing the same unit text content in the articulator action video and the articulator standard action video as a video clip group; aligning, within each video clip group, the video clips belonging to the articulator action video and to the articulator standard action video; and splicing the aligned articulator action video and articulator standard action video to obtain the comparison video.
根据本公开的一个或多个实施例，示例7提供了示例1的方法，所述生成反映所述用户朗读所述例句文本时的发音器官的动作的发音器官动作视频，包括：将所述待评价音频转换成待处理音频特征向量；将所述待处理音频特征向量输入视频生成模型，得到所述视频生成模型输出的与所述待评价音频对应的发音器官动作视频。According to one or more embodiments of the present disclosure, Example 7 provides the method of Example 1, wherein generating the articulator action video reflecting the actions of the user's articulators when reading the example sentence text aloud includes: converting the audio to be evaluated into an audio feature vector to be processed; and inputting the audio feature vector to be processed into a video generation model to obtain the articulator action video output by the video generation model and corresponding to the audio to be evaluated.
根据本公开的一个或多个实施例，示例8提供了示例7的方法，还包括：根据样本音频以及与所述样本音频对应的样本发音器官动作视频构建模型训练数据；根据所述模型训练数据训练得到所述视频生成模型。According to one or more embodiments of the present disclosure, Example 8 provides the method of Example 7, further including: constructing model training data according to sample audio and a sample articulator action video corresponding to the sample audio; and training the video generation model according to the model training data.
根据本公开的一个或多个实施例，示例9提供了示例8的方法，所述根据样本音频以及与所述样本音频对应的样本发音器官动作视频构建模型训练数据包括：将所述样本音频中的每一帧音频转换成样本音素后验概率向量，得到包括至少一个样本音素后验概率向量的样本音素后验概率向量序列；基于所述样本发音器官动作视频，提取与所述样本音素后验概率向量序列中每一所述样本音素后验概率向量对应的样本发音器官视频特征，得到样本发音器官视频特征序列；将所述样本音素后验概率向量序列和所述样本发音器官视频特征序列作为所述模型训练数据。According to one or more embodiments of the present disclosure, Example 9 provides the method of Example 8, wherein constructing the model training data according to the sample audio and the sample articulator action video corresponding to the sample audio includes: converting each frame of audio in the sample audio into a sample phoneme posterior probability vector to obtain a sample phoneme posterior probability vector sequence including at least one sample phoneme posterior probability vector; extracting, based on the sample articulator action video, a sample articulator video feature corresponding to each sample phoneme posterior probability vector in the sample phoneme posterior probability vector sequence to obtain a sample articulator video feature sequence; and using the sample phoneme posterior probability vector sequence and the sample articulator video feature sequence as the model training data.
根据本公开的一个或多个实施例，示例10提供了示例9的方法，所述样本发音器官视频特征为所述样本发音器官动作视频中的至少一帧视频图像的像素点特征信息或主成分特征信息中的至少一种。According to one or more embodiments of the present disclosure, Example 10 provides the method of Example 9, wherein the sample articulator video feature is at least one of pixel point feature information or principal component feature information of at least one frame of video image in the sample articulator action video.
根据本公开的一个或多个实施例，示例11提供了示例1的方法，所述发音器官标准动作视频是通过以下方式生成的：将所述例句文本分割为单位文本序列；将所述单位文本序列输入视频特征生成模型，得到视频特征序列；基于所述视频特征序列生成发音器官标准动作视频；其中，所述视频特征生成模型是通过如下方式训练得到的：将样本文本分割为样本单位文本序列；根据样本单位文本序列以及与所述样本单位文本序列对应的样本发音器官动作视频的样本视频特征序列构建模型训练数据；根据所述模型训练数据训练得到所述视频特征生成模型。According to one or more embodiments of the present disclosure, Example 11 provides the method of Example 1, wherein the articulator standard action video is generated by: dividing the example sentence text into a unit text sequence; inputting the unit text sequence into a video feature generation model to obtain a video feature sequence; and generating the articulator standard action video based on the video feature sequence; wherein the video feature generation model is trained as follows: dividing sample text into a sample unit text sequence; constructing model training data according to the sample unit text sequence and a sample video feature sequence of the sample articulator action video corresponding to the sample unit text sequence; and training the video feature generation model according to the model training data.
根据本公开的一个或多个实施例，示例12提供了示例1-11的方法，所述发音器官动作视频和所述发音器官标准动作视频为基于核磁共振MRI视频生成的发音器官动画视频，所述方法还包括：通过动画生成模型，逐帧对所述发音器官动作视频或所述发音器官标准动作视频进行渲染，得到发音器官动画视频。According to one or more embodiments of the present disclosure, Example 12 provides the method of any one of Examples 1-11, wherein the articulator action video and the articulator standard action video are articulator animation videos generated based on magnetic resonance imaging (MRI) video, and the method further includes: rendering the articulator action video or the articulator standard action video frame by frame through an animation generation model to obtain the articulator animation video.
根据本公开的一个或多个实施例，示例13提供了示例12的方法，所述动画生成模型的训练样本包括多张MRI样本图像和各MRI样本图像对应的动画发音器官图，并且所述方法还包括：确定各MRI样本图像中的发音器官的位置；在各MRI样本图像中的发音器官的位置，生成与所述发音器官的位置对应的动画发音器官，得到动画发音器官图。According to one or more embodiments of the present disclosure, Example 13 provides the method of Example 12, wherein the training samples of the animation generation model include a plurality of MRI sample images and an animated articulator image corresponding to each MRI sample image, and the method further includes: determining the position of the articulator in each MRI sample image; and generating, at the position of the articulator in each MRI sample image, an animated articulator corresponding to the position of the articulator to obtain the animated articulator image.
根据本公开的一个或多个实施例，示例14提供了一种发音评价装置，所述装置包括：例句展示模块，用于向用户展示例句文本；音频采集模块，用于采集用户基于所述例句文本朗读的待评价音频；视频生成模块，用于生成反映所述用户朗读所述例句文本时的发音器官的动作的发音器官动作视频；发音评价模块，用于基于所述发音器官动作视频和所述例句文本对应的发音器官标准动作视频生成发音评价信息；评价展示模块，用于向所述用户展示所述发音评价信息。According to one or more embodiments of the present disclosure, Example 14 provides a pronunciation evaluation apparatus, the apparatus including: an example sentence display module configured to display example sentence text to a user; an audio collection module configured to collect audio to be evaluated that the user reads aloud based on the example sentence text; a video generation module configured to generate an articulator action video reflecting the actions of the user's articulators when reading the example sentence text aloud; a pronunciation evaluation module configured to generate pronunciation evaluation information based on the articulator action video and the articulator standard action video corresponding to the example sentence text; and an evaluation display module configured to display the pronunciation evaluation information to the user.
根据本公开的一个或多个实施例，示例15提供了示例14的装置，所述发音评价信息包括对所述用户的发音打分信息、发音动作建议信息、或所述发音器官动作视频与所述发音器官标准动作视频的对比视频中的至少一者。According to one or more embodiments of the present disclosure, Example 15 provides the apparatus of Example 14, wherein the pronunciation evaluation information includes at least one of pronunciation scoring information for the user, pronunciation action suggestion information, or a comparison video between the articulator action video and the articulator standard action video.
根据本公开的一个或多个实施例，示例16提供了示例14的装置，所述例句展示模块，用于基于所述例句文本生成例句音频；将所述例句音频与所述发音器官标准动作视频合成为例句演示视频；向用户展示例句文本和所述例句演示视频。According to one or more embodiments of the present disclosure, Example 16 provides the apparatus of Example 14, wherein the example sentence display module is configured to generate example sentence audio based on the example sentence text; synthesize the example sentence audio and the articulator standard action video into an example sentence demonstration video; and display the example sentence text and the example sentence demonstration video to the user.
根据本公开的一个或多个实施例，示例17提供了示例15的装置，所述发音评价模块，用于通过对比所述发音器官动作视频和所述例句文本对应的发音器官标准动作视频，得到动作差异信息；根据所述动作差异信息生成发音打分信息，和/或，根据所述动作差异信息与预设的发音动作建议信息进行匹配，得到与所述动作差异信息相匹配的目标动作建议信息。According to one or more embodiments of the present disclosure, Example 17 provides the apparatus of Example 15, wherein the pronunciation evaluation module is configured to obtain action difference information by comparing the articulator action video with the articulator standard action video corresponding to the example sentence text; and to generate pronunciation scoring information according to the action difference information, and/or to match the action difference information against preset pronunciation action suggestion information to obtain target action suggestion information matching the action difference information.
根据本公开的一个或多个实施例，示例18提供了示例17的装置，所述动作差异信息为发音器官的特征点运动轨迹的差异信息。According to one or more embodiments of the present disclosure, Example 18 provides the apparatus of Example 17, wherein the action difference information is difference information of the movement trajectories of feature points of the articulators.
根据本公开的一个或多个实施例，示例19提供了示例15的装置，所述发音评价模块，还用于基于例句文本的单位文本内容，将所述发音器官动作视频和所述发音器官标准动作视频中表征同一单位文本内容的视频片段作为一组视频片段组；将各视频片段组中属于所述发音器官动作视频和所述发音器官标准动作视频的视频片段进行对齐；将对齐后的所述发音器官动作视频和所述发音器官标准动作视频拼接，得到所述对比视频。According to one or more embodiments of the present disclosure, Example 19 provides the apparatus of Example 15, wherein the pronunciation evaluation module is further configured to, based on the unit text content of the example sentence text, take the video clips representing the same unit text content in the articulator action video and the articulator standard action video as a video clip group; align, within each video clip group, the video clips belonging to the articulator action video and to the articulator standard action video; and splice the aligned articulator action video and articulator standard action video to obtain the comparison video.
根据本公开的一个或多个实施例，示例20提供了示例14的装置，所述视频生成模块，用于将所述待评价音频转换成待处理音频特征向量；将所述待处理音频特征向量输入视频生成模型，得到所述视频生成模型输出的与所述待评价音频对应的发音器官动作视频。According to one or more embodiments of the present disclosure, Example 20 provides the apparatus of Example 14, wherein the video generation module is configured to convert the audio to be evaluated into an audio feature vector to be processed, and input the audio feature vector to be processed into a video generation model to obtain the articulator action video output by the video generation model and corresponding to the audio to be evaluated.
根据本公开的一个或多个实施例，示例21提供了示例20的装置，发音评价装置还包括视频生成模型训练模块，被配置为根据样本音频以及与所述样本音频对应的样本发音器官动作视频构建模型训练数据；根据所述模型训练数据训练得到所述视频生成模型。According to one or more embodiments of the present disclosure, Example 21 provides the apparatus of Example 20, wherein the pronunciation evaluation apparatus further includes a video generation model training module configured to construct model training data according to sample audio and a sample articulator action video corresponding to the sample audio, and to train the video generation model according to the model training data.
根据本公开的一个或多个实施例，示例22提供了示例21的装置，视频生成模型训练模块进一步被配置为将所述样本音频中的每一帧音频转换成样本音素后验概率向量，得到包括至少一个样本音素后验概率向量的样本音素后验概率向量序列；基于所述样本发音器官动作视频，提取与所述样本音素后验概率向量序列中每一所述样本音素后验概率向量对应的样本发音器官视频特征，得到样本发音器官视频特征序列；将所述样本音素后验概率向量序列和所述样本发音器官视频特征序列作为所述模型训练数据。According to one or more embodiments of the present disclosure, Example 22 provides the apparatus of Example 21, wherein the video generation model training module is further configured to convert each frame of audio in the sample audio into a sample phoneme posterior probability vector to obtain a sample phoneme posterior probability vector sequence including at least one sample phoneme posterior probability vector; extract, based on the sample articulator action video, a sample articulator video feature corresponding to each sample phoneme posterior probability vector in the sample phoneme posterior probability vector sequence to obtain a sample articulator video feature sequence; and use the sample phoneme posterior probability vector sequence and the sample articulator video feature sequence as the model training data.
根据本公开的一个或多个实施例，示例23提供了示例22的装置，所述样本发音器官视频特征为所述样本发音器官动作视频中的至少一帧视频图像的像素点特征信息或主成分特征信息中的至少一种。According to one or more embodiments of the present disclosure, Example 23 provides the apparatus of Example 22, wherein the sample articulator video feature is at least one of pixel point feature information or principal component feature information of at least one frame of video image in the sample articulator action video.
根据本公开的一个或多个实施例，示例24提供了示例14的装置，所述视频生成模块，还用于将所述例句文本分割为单位文本序列；将所述单位文本序列输入视频特征生成模型，得到视频特征序列；基于所述视频特征序列生成发音器官标准动作视频；其中，所述视频特征生成模型是通过如下方式训练得到的：将样本文本分割为样本单位文本序列；根据样本单位文本序列以及与所述样本单位文本序列对应的样本发音器官动作视频的样本视频特征序列构建模型训练数据；根据所述模型训练数据训练得到所述视频特征生成模型。According to one or more embodiments of the present disclosure, Example 24 provides the apparatus of Example 14, wherein the video generation module is further configured to divide the example sentence text into a unit text sequence; input the unit text sequence into a video feature generation model to obtain a video feature sequence; and generate the articulator standard action video based on the video feature sequence; wherein the video feature generation model is trained as follows: dividing sample text into a sample unit text sequence; constructing model training data according to the sample unit text sequence and a sample video feature sequence of the sample articulator action video corresponding to the sample unit text sequence; and training the video feature generation model according to the model training data.
根据本公开的一个或多个实施例，示例25提供了示例14-24的装置，所述发音器官动作视频和所述发音器官标准动作视频为基于核磁共振MRI视频生成的发音器官动画视频，所述装置还包括视频渲染模块，用于通过动画生成模型，逐帧对所述发音器官动作视频或所述发音器官标准动作视频进行渲染，得到发音器官动画视频。According to one or more embodiments of the present disclosure, Example 25 provides the apparatus of any one of Examples 14-24, wherein the articulator action video and the articulator standard action video are articulator animation videos generated based on magnetic resonance imaging (MRI) video, and the apparatus further includes a video rendering module configured to render the articulator action video or the articulator standard action video frame by frame through an animation generation model to obtain the articulator animation video.
根据本公开的一个或多个实施例，示例26提供了示例25的装置，所述动画生成模型的训练样本包括多张MRI样本图像和各MRI样本图像对应的动画发音器官图，所述动画生成模型的训练样本是通过以下方式得到的：确定各MRI样本图像中的发音器官的位置；在各MRI样本图像中的发音器官的位置，生成与所述发音器官的位置对应的动画发音器官，得到动画发音器官图。According to one or more embodiments of the present disclosure, Example 26 provides the apparatus of Example 25, wherein the training samples of the animation generation model include a plurality of MRI sample images and an animated articulator image corresponding to each MRI sample image, and the training samples of the animation generation model are obtained by: determining the position of the articulator in each MRI sample image; and generating, at the position of the articulator in each MRI sample image, an animated articulator corresponding to the position of the articulator to obtain the animated articulator image.
以上描述仅为本公开的较佳实施例以及对所运用技术原理的说明。本领域技术人员应当理解,本公开中所涉及的公开范围,并不限于上述技术特征的特定组合而成的技术方案,同时也应涵盖在不脱离上述公开构思的情况下,由上述技术特征或其等同特征进行任意组合而形成的其它技术方案。例如上述特征与本公开中公开的(但不限于)具有类似功能的技术特征进行互相替换而形成的技术方案。The above description is merely a preferred embodiment of the present disclosure and an illustration of the technical principles employed. Those skilled in the art should understand that the scope of the disclosure involved in the present disclosure is not limited to the technical solutions formed by the specific combination of the above-mentioned technical features, and should also cover, without departing from the above-mentioned disclosed concept, the technical solutions formed by the above-mentioned technical features or Other technical solutions formed by any combination of its equivalent features. For example, a technical solution is formed by replacing the above features with the technical features disclosed in the present disclosure (but not limited to) with similar functions.
此外,虽然采用特定次序描绘了各操作,但是这不应当理解为要求这些操作以所示出的特定次序或以顺序次序执行来执行。在一定环境下,多任务和并行处理可能是有利的。同样地,虽然在上面论述中包含了若干具体实现细节,但是这些不应当被解释为对本公开的范围的限制。在单独的实施例的上下文中描述的某些特征还可以组合地实现在单个实施例中。相反地,在单个实施例的上下文中描述的各种特征也可以单独地或以任何合适的子组合的方式实现在多个实施例中。Additionally, although operations are depicted in a particular order, this should not be construed as requiring that the operations be performed in the particular order shown or in a sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although the above discussion contains several implementation-specific details, these should not be construed as limitations on the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
尽管已经采用特定于结构特征和/或方法逻辑动作的语言描述了本主题,但是应当理解所附权利要求书中所限定的主题未必局限于上面描述的特定特征或动作。相反,上面所描述的特定特征和动作仅仅是实现权利要求书的示例形式。关于上述实施例中的装置,其中各个模块执行操作的具体方式已经在有关该方法的实施例中进行了详细描述,此处将不做详细阐述说明。Although the subject matter has been described in language specific to structural features and/or logical acts of method, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims. Regarding the apparatus in the above-mentioned embodiment, the specific manner in which each module performs operations has been described in detail in the embodiment of the method, and will not be described in detail here.

Claims (18)

  1. 一种发音评价方法,包括:A pronunciation evaluation method, including:
    向用户展示例句文本;show the user the example text;
    采集用户基于所述例句文本朗读的待评价音频;Collect the audio to be evaluated that the user reads aloud based on the example text;
    生成反映所述用户朗读所述例句文本时的发音器官的动作的发音器官动作视频;Generate a pronunciation organ action video that reflects the action of the pronunciation organ when the user reads the example sentence text;
    基于所述发音器官动作视频和所述例句文本对应的发音器官标准动作视频生成发音评价信息;Generate pronunciation evaluation information based on the pronunciation organ action video and the pronunciation organ standard action video corresponding to the example text;
    向所述用户展示所述发音评价信息。The pronunciation evaluation information is displayed to the user.
  2. 根据权利要求1所述的方法，其中，所述发音评价信息包括对所述用户的发音打分信息、发音动作建议信息、或所述发音器官动作视频与所述发音器官标准动作视频的对比视频中的至少一者。The method according to claim 1, wherein the pronunciation evaluation information includes at least one of pronunciation scoring information for the user, pronunciation action suggestion information, or a comparison video between the articulator action video and the articulator standard action video.
  3. 根据权利要求1所述的方法,其中,所述向用户展示例句文本,包括:The method of claim 1 , wherein the presenting the example text to the user comprises:
    基于所述例句文本生成例句音频;generating example sentence audio based on the example sentence text;
    将所述例句音频与所述发音器官标准动作视频合成为例句演示视频；Synthesizing the example sentence audio and the articulator standard action video into an example sentence demonstration video;
    向用户展示例句文本和所述例句演示视频。The user is presented with the example text and a demonstration video of the example.
  4. 根据权利要求2所述的方法，其中，在所述发音评价信息包括所述发音打分信息和/或所述发音动作建议信息的情况下，所述基于所述发音器官动作视频和所述例句文本对应的发音器官标准动作视频生成发音评价信息，包括：The method according to claim 2, wherein, in a case where the pronunciation evaluation information includes the pronunciation scoring information and/or the pronunciation action suggestion information, generating the pronunciation evaluation information based on the articulator action video and the articulator standard action video corresponding to the example sentence text includes:
    通过对比所述发音器官动作视频和所述例句文本对应的发音器官标准动作视频,得到动作差异信息;Action difference information is obtained by comparing the pronunciation organ action video and the pronunciation organ standard action video corresponding to the example text;
    根据所述动作差异信息生成发音打分信息,和/或,根据所述动作差异信息与预设的发音动作建议信息进行匹配,得到与所述动作差异信息相匹配的目标动作建议信息。Pronunciation scoring information is generated according to the motion difference information, and/or, target motion suggestion information matching the motion difference information is obtained by matching with preset pronunciation motion suggestion information according to the motion difference information.
  5. 根据权利要求4所述的方法,其中,所述动作差异信息为发音器官的特征点运动轨迹的差异信息。The method according to claim 4, wherein the action difference information is the difference information of the movement trajectories of the feature points of the vocal organs.
  6. 根据权利要求2所述的方法,其中,所述对比视频是通过以下的方式生成的:The method of claim 2, wherein the comparison video is generated in the following manner:
    基于例句文本的单位文本内容,将所述发音器官动作视频和所述发音器官标准动作视频中表征同一单位文本内容的视频片段作为一组视频片段组;Based on the unit text content of the example sentence text, the video clips representing the same unit text content in the articulator action video and the articulator standard action video are used as a video clip group;
    将各视频片段组中属于所述发音器官动作视频和所述发音器官标准动作视频的视频片段进行对齐;Aligning the video clips belonging to the articulator action video and the articulator standard action video in each video clip group;
    将对齐后的所述发音器官动作视频和所述发音器官标准动作视频拼接，得到所述对比视频。Splicing the aligned articulator action video and the articulator standard action video to obtain the comparison video.
  7. 根据权利要求1所述的方法,其中,所述生成反映所述用户朗读所述例句文本时的发音器官的动作的发音器官动作视频,包括:The method according to claim 1, wherein said generating a pronunciation organ action video reflecting the action of the pronunciation organ when the user reads the example text aloud, comprises:
    将所述待评价音频转换成待处理音频特征向量;Converting the to-be-evaluated audio into a to-be-processed audio feature vector;
    将所述待处理音频特征向量输入视频生成模型,得到所述视频生成模型输出的与所述待评价音频对应的发音器官动作视频。The to-be-processed audio feature vector is input into a video generation model to obtain a voice organ action video corresponding to the to-be-evaluated audio output by the video generation model.
  8. 根据权利要求7所述的方法,还包括:The method of claim 7, further comprising:
    根据样本音频以及与所述样本音频对应的样本发音器官动作视频构建模型训练数据;Build model training data according to the sample audio and the sample articulator action video corresponding to the sample audio;
    根据所述模型训练数据训练得到所述视频生成模型。The video generation model is obtained by training according to the model training data.
  9. 根据权利要求8所述的方法,其中,所述根据样本音频以及与所述样本音频对应的样本发音器官动作视频构建模型训练数据包括:The method according to claim 8, wherein the building model training data according to the sample audio and the sample articulator action video corresponding to the sample audio comprises:
    将所述样本音频中的每一帧音频转换成样本音素后验概率向量,得到包括至少一个样本音素后验概率向量的样本音素后验概率向量序列;Converting each frame of audio in the sample audio into a sample phoneme posterior probability vector to obtain a sample phoneme posterior probability vector sequence including at least one sample phoneme posterior probability vector;
    基于所述样本发音器官动作视频,提取与所述样本音素后验概率向量序列中每一所述样本音素后验概率向量对应的样本发音器官视频特征,得到样本发音器官视频特征序列;Based on the sample articulator action video, extract a sample articulator video feature corresponding to each of the sample phoneme posterior probability vectors in the sample phoneme posterior probability vector sequence, to obtain a sample articulator video feature sequence;
    将所述样本音素后验概率向量序列和所述样本发音器官视频特征序列作为所述模型训练数据。The sample phoneme posterior probability vector sequence and the sample vocal organ video feature sequence are used as the model training data.
  10. 根据权利要求9所述的方法,其中,所述样本发音器官视频特征为所述样本 发音器官动作视频中的至少一帧视频图像的像素点特征信息或主成分特征信息中的至少一种。The method according to claim 9, wherein the sample vocal organ video feature is at least one of pixel point feature information or principal component feature information of at least one frame of video image in the sample vocal organ action video.
  11. 根据权利要求1所述的方法,其中,所述发音器官标准动作视频是通过以下方式生成的:The method according to claim 1, wherein the standard motion video of the vocal organs is generated in the following manner:
    将所述例句文本分割为单位文本序列;dividing the example sentence text into unit text sequences;
    将所述单位文本序列输入视频特征生成模型,得到视频特征序列;Inputting the unit text sequence into a video feature generation model to obtain a video feature sequence;
    基于所述视频特征序列生成发音器官标准动作视频;Generate a standard motion video of vocal organs based on the video feature sequence;
    其中,所述视频特征生成模型是通过如下方式训练得到的:Wherein, the video feature generation model is obtained by training in the following ways:
    将样本文本分割为样本单位文本序列;Divide the sample text into sample unit text sequences;
    根据样本单位文本序列以及与所述样本单位文本序列对应的样本发音器官动作视频的样本视频特征序列构建模型训练数据;Build model training data according to the sample unit text sequence and the sample video feature sequence of the sample vocal organ action video corresponding to the sample unit text sequence;
    根据所述模型训练数据训练得到所述视频特征生成模型。The video feature generation model is obtained by training according to the model training data.
  12. 根据权利要求1-11任一项所述的方法,其中,所述发音器官动作视频和所述发音器官标准动作视频为基于核磁共振MRI视频生成的发音器官动画视频,所述方法还包括:The method according to any one of claims 1-11, wherein the voice organ motion video and the voice organ standard motion video are voice organ animation videos generated based on nuclear magnetic resonance MRI video, and the method further comprises:
    通过动画生成模型,逐帧对所述发音器官动作视频或所述发音器官标准动作视频进行渲染,得到发音器官动画视频。The animation generation model is used to render the articulator action video or the articulator standard action video frame by frame to obtain the articulator animation video.
  13. 根据权利要求12所述的方法,其中,所述动画生成模型的训练样本包括多张MRI样本图像和各MRI样本图像对应的动画发音器官图,并且所述方法还包括:The method according to claim 12 , wherein the training samples of the animation generation model include a plurality of MRI sample images and an animated articulation organ map corresponding to each MRI sample image, and the method further comprises:
    确定各MRI样本图像中的发音器官的位置;determining the position of the vocal organs in each MRI sample image;
    在各MRI样本图像中的发音器官的位置,生成与所述发音器官的位置对应的动画发音器官,得到动画发音器官图。At the position of the articulator in each MRI sample image, an animated articulator corresponding to the position of the articulator is generated, and an animated articulation diagram is obtained.
  14. 一种发音评价装置,包括:A pronunciation evaluation device, comprising:
    例句展示模块,用于向用户展示例句文本;The example sentence display module is used to display the example sentence text to the user;
    音频采集模块,用于采集用户基于所述例句文本朗读的待评价音频;An audio collection module for collecting the audio to be evaluated based on the example text read aloud by the user;
    视频生成模块,用于生成反映所述用户朗读所述例句文本时的发音器官的动作的 发音器官动作视频;Video generation module, for generating the pronunciation organ action video that reflects the action of the pronunciation organ when the user reads the example sentence text;
    发音评价模块,用于基于所述发音器官动作视频和所述例句文本对应的发音器官标准动作视频生成发音评价信息;A pronunciation evaluation module for generating pronunciation evaluation information based on the pronunciation organ action video and the pronunciation organ standard action video corresponding to the example text;
    评价展示模块,用于向所述用户展示所述发音评价信息。An evaluation display module, configured to display the pronunciation evaluation information to the user.
  15. 一种非瞬时性计算机可读介质,其上存储有计算机程序,其中,该程序被处理装置执行时实现权利要求1-13中任一项所述方法的步骤。A non-transitory computer-readable medium having stored thereon a computer program, wherein the program, when executed by a processing device, implements the steps of the method of any one of claims 1-13.
  16. 一种电子设备,其中,包括:An electronic device comprising:
    存储装置,其上存储有计算机程序;a storage device on which a computer program is stored;
    处理装置,用于执行所述存储装置中的所述计算机程序,以实现权利要求1-13中任一项所述方法的步骤。A processing device, configured to execute the computer program in the storage device, to implement the steps of the method of any one of claims 1-13.
  17. 一种计算机程序,包括:A computer program comprising:
    指令,所述指令当由处理器执行时使所述处理器执行根据权利要求1-13中任一项所述方法的步骤。Instructions which, when executed by a processor, cause the processor to perform the steps of the method according to any of claims 1-13.
  18. 一种计算机程序产品,包括指令,所述指令当由处理器执行时使所述处理器执行根据权利要求1-13中任一项所述方法的步骤。A computer program product comprising instructions which, when executed by a processor, cause the processor to perform the steps of the method according to any of claims 1-13.
PCT/CN2022/080357 2021-03-19 2022-03-11 Pronunciation assessment method and apparatus, storage medium, and electronic device WO2022194044A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110298227.9A CN113077819A (en) 2021-03-19 2021-03-19 Pronunciation evaluation method and device, storage medium and electronic equipment
CN202110298227.9 2021-03-19

Publications (1)

Publication Number Publication Date
WO2022194044A1 true WO2022194044A1 (en) 2022-09-22

Family

ID=76612827

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/080357 WO2022194044A1 (en) 2021-03-19 2022-03-11 Pronunciation assessment method and apparatus, storage medium, and electronic device

Country Status (2)

Country Link
CN (1) CN113077819A (en)
WO (1) WO2022194044A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116705070A (en) * 2023-08-02 2023-09-05 南京优道言语康复研究院 Method and system for correcting speech pronunciation and nasal sound after cleft lip and palate operation

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113077819A (en) * 2021-03-19 2021-07-06 北京有竹居网络技术有限公司 Pronunciation evaluation method and device, storage medium and electronic equipment

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090305203A1 (en) * 2005-09-29 2009-12-10 Machi Okumura Pronunciation diagnosis device, pronunciation diagnosis method, recording medium, and pronunciation diagnosis program
CN103218841A (en) * 2013-04-26 2013-07-24 中国科学技术大学 Three-dimensional vocal organ animation method combining physiological model and data driving model
CN104505089A (en) * 2014-12-17 2015-04-08 福建网龙计算机网络信息技术有限公司 Method and equipment for oral error correction
CN107424450A (en) * 2017-08-07 2017-12-01 英华达(南京)科技有限公司 Pronunciation correction system and method
US20180137778A1 (en) * 2016-08-17 2018-05-17 Ken-ichi KAINUMA Language learning system, language learning support server, and computer program product
CN110880315A (en) * 2019-10-17 2020-03-13 深圳市声希科技有限公司 Personalized voice and video generation system based on phoneme posterior probability
CN111445925A (en) * 2020-03-31 2020-07-24 北京字节跳动网络技术有限公司 Method and apparatus for generating difference information
CN111933110A (en) * 2020-08-12 2020-11-13 北京字节跳动网络技术有限公司 Video generation method, generation model training method, device, medium and equipment
CN111951828A (en) * 2019-05-16 2020-11-17 上海流利说信息技术有限公司 Pronunciation evaluation method, device, system, medium and computing equipment
CN111968676A (en) * 2020-08-18 2020-11-20 北京字节跳动网络技术有限公司 Pronunciation correction method and device, electronic equipment and storage medium
CN113077819A (en) * 2021-03-19 2021-07-06 北京有竹居网络技术有限公司 Pronunciation evaluation method and device, storage medium and electronic equipment

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20010107035A (en) * 2000-05-24 2001-12-07 서주철 An internet english study service method for voice recognition and voice synthesis
CN108537702A (en) * 2018-04-09 2018-09-14 深圳市鹰硕技术有限公司 Foreign language teaching evaluation information generation method and device
CN108962216B (en) * 2018-06-12 2021-02-02 北京市商汤科技开发有限公司 Method, device, equipment and storage medium for processing speaking video
CN108922563B (en) * 2018-06-17 2019-09-24 海南大学 Based on the visual verbal learning antidote of deviation organ morphology behavior
CN109697976B (en) * 2018-12-14 2021-05-25 北京葡萄智学科技有限公司 Pronunciation recognition method and device
CN111723606A (en) * 2019-03-19 2020-09-29 北京搜狗科技发展有限公司 Data processing method and device and data processing device
CN110347867B (en) * 2019-07-16 2022-04-19 北京百度网讯科技有限公司 Method and device for generating lip motion video
CN111429885B (en) * 2020-03-02 2022-05-13 北京理工大学 Method for mapping audio clip to human face-mouth type key point
CN111741326B (en) * 2020-06-30 2023-08-18 腾讯科技(深圳)有限公司 Video synthesis method, device, equipment and storage medium
CN111833859B (en) * 2020-07-22 2024-02-13 科大讯飞股份有限公司 Pronunciation error detection method and device, electronic equipment and storage medium

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090305203A1 (en) * 2005-09-29 2009-12-10 Machi Okumura Pronunciation diagnosis device, pronunciation diagnosis method, recording medium, and pronunciation diagnosis program
CN103218841A (en) * 2013-04-26 2013-07-24 中国科学技术大学 Three-dimensional vocal organ animation method combining physiological model and data driving model
CN104505089A (en) * 2014-12-17 2015-04-08 福建网龙计算机网络信息技术有限公司 Method and equipment for oral error correction
US20180137778A1 (en) * 2016-08-17 2018-05-17 Ken-ichi KAINUMA Language learning system, language learning support server, and computer program product
CN107424450A (en) * 2017-08-07 2017-12-01 英华达(南京)科技有限公司 Pronunciation correction system and method
CN111951828A (en) * 2019-05-16 2020-11-17 上海流利说信息技术有限公司 Pronunciation evaluation method, device, system, medium and computing equipment
CN110880315A (en) * 2019-10-17 2020-03-13 深圳市声希科技有限公司 Personalized voice and video generation system based on phoneme posterior probability
CN111445925A (en) * 2020-03-31 2020-07-24 北京字节跳动网络技术有限公司 Method and apparatus for generating difference information
CN111933110A (en) * 2020-08-12 2020-11-13 北京字节跳动网络技术有限公司 Video generation method, generation model training method, device, medium and equipment
CN111968676A (en) * 2020-08-18 2020-11-20 北京字节跳动网络技术有限公司 Pronunciation correction method and device, electronic equipment and storage medium
CN113077819A (en) * 2021-03-19 2021-07-06 北京有竹居网络技术有限公司 Pronunciation evaluation method and device, storage medium and electronic equipment

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116705070A (en) * 2023-08-02 2023-09-05 南京优道言语康复研究院 Method and system for correcting speech pronunciation and nasal sound after cleft lip and palate operation
CN116705070B (en) * 2023-08-02 2023-10-17 南京优道言语康复研究院 Method and system for correcting speech pronunciation and nasal sound after cleft lip and palate operation

Also Published As

Publication number Publication date
CN113077819A (en) 2021-07-06

Similar Documents

Publication Title
WO2022194044A1 (en) Pronunciation assessment method and apparatus, storage medium, and electronic device
JP6206960B2 (en) Pronunciation operation visualization device and pronunciation learning device
US11514634B2 (en) Personalized speech-to-video with three-dimensional (3D) skeleton regularization and expressive body poses
CN113256821B (en) Three-dimensional virtual image lip shape generation method and device and electronic equipment
CN113077537B (en) Video generation method, storage medium and device
US11847726B2 (en) Method for outputting blend shape value, storage medium, and electronic device
US11968433B2 (en) Systems and methods for generating synthetic videos based on audio contents
US20230082830A1 (en) Method and apparatus for driving digital human, and electronic device
CN112785670B (en) Image synthesis method, device, equipment and storage medium
Wang et al. Computer-assisted audiovisual language learning
CN113223123A (en) Image processing method and image processing apparatus
CN113223555A (en) Video generation method and device, storage medium and electronic equipment
CN112383721B (en) Method, apparatus, device and medium for generating video
Liu et al. An interactive speech training system with virtual reality articulation for Mandarin-speaking hearing impaired children
CN112381926A (en) Method and apparatus for generating video
CN111415662A (en) Method, apparatus, device and medium for generating video
WO2023035969A1 (en) Speech and image synchronization measurement method and apparatus, and model training method and apparatus
KR20140087956A (en) Apparatus and method for learning phonics by using native speaker's pronunciation data and word and sentence and image data
CN113079327A (en) Video generation method and device, storage medium and electronic equipment
CN114428879A (en) Multimode English teaching system based on multi-scene interaction
CN113035235A (en) Pronunciation evaluation method and apparatus, storage medium, and electronic device
CN111445925A (en) Method and apparatus for generating difference information
Fabre et al. Automatic animation of an articulatory tongue model from ultrasound images using Gaussian mixture regression.
KR20210131698A (en) Method and apparatus for teaching foreign language pronunciation using articulator image
CN112185186A (en) Pronunciation correction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22770399

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 22770399

Country of ref document: EP

Kind code of ref document: A1