CN112614489A - User pronunciation accuracy evaluation method and device and electronic equipment - Google Patents

User pronunciation accuracy evaluation method and device and electronic equipment

Info

Publication number
CN112614489A
Authority
CN
China
Prior art keywords
pronunciation
user
mouth shape
matching degree
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011522673.5A
Other languages
Chinese (zh)
Inventor
王岩
安�晟
蔡红
杨森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zuoyebang Education Technology Beijing Co Ltd
Original Assignee
Zuoyebang Education Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zuoyebang Education Technology Beijing Co Ltd filed Critical Zuoyebang Education Technology Beijing Co Ltd
Priority to CN202011522673.5A priority Critical patent/CN112614489A/en
Publication of CN112614489A publication Critical patent/CN112614489A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/24Speech recognition using non-acoustical features
    • G10L15/25Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Social Psychology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Psychiatry (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention belongs to the technical field of online education and provides a method and a device for evaluating the pronunciation accuracy of a user, an electronic device and a recording medium. The method comprises the following steps: acquiring audio information and image information while a user pronounces; screening out, from the image information, at least one frame of image generated while the user pronounces; extracting mouth shape information of the user from the image of the pronunciation; and inputting the audio information and the mouth shape information into different deep learning models respectively, calculating the pronunciation matching degree and the mouth shape matching degree of the user, and judging whether the pronunciation of the user is accurate according to the pronunciation matching degree and the mouth shape matching degree. The method and the device evaluate whether the pronunciation of the user is accurate across multiple dimensions, so the evaluation result is more accurate; the evaluation result is fed back to the user in real time, and a corresponding correction scheme is given according to the evaluation result, which makes it convenient for the user to adjust the pronunciation mouth shape and tone and improves the user experience.

Description

User pronunciation accuracy evaluation method and device and electronic equipment
Technical Field
The invention belongs to the technical field of education, is particularly suitable for online education, and more particularly relates to a method and a device for evaluating pronunciation accuracy of a user, electronic equipment and a computer readable medium.
Background
In the process of language learning, learning correct spoken pronunciation is a very important part. In the past, spoken language could only be learned by following offline teachers; with the development of technology, online spoken language learning has become a trend, and in recent years the scoring and correction of spoken pronunciation have mainly been based on the representation of voice characteristics. Whether the mouth shape is correct during pronunciation plays an important role: mastering the correct mouth shape helps learners produce standard pronunciation.
When existing products are used for pronunciation practice, either the mouth shape of the user during pronunciation is compared with a standard mouth shape, or the voice of the user during pronunciation is compared with a standard voice, to judge whether the pronunciation of the user is standard. However, the judgment produced by such a single comparison is not accurate: a situation can easily arise in which the mouth shape of the user is standard but the pronunciation is still inaccurate, and the corresponding correction effect for inaccurate pronunciation is therefore limited.
Disclosure of Invention
Technical problem to be solved
The invention aims to solve the problem of how to effectively evaluate and correct the inaccurate pronunciation of the user.
(II) technical scheme
In order to solve the above technical problem, an aspect of the present invention provides a method for evaluating pronunciation accuracy of a user, including:
acquiring audio information and image information when a user pronounces;
screening out at least one frame of image generated when the user pronounces from the image information;
extracting mouth shape information of a user from the image during pronunciation;
and respectively inputting the audio information and the mouth shape information into different deep learning models, calculating the pronunciation matching degree and the mouth shape matching degree of the user, and judging whether the pronunciation of the user is accurate or not according to the pronunciation matching degree and the mouth shape matching degree.
According to a preferred embodiment of the present invention, the calculating the mouth shape matching degree of the user further comprises:
extracting key point area images of the mouth from the mouth shape information of the images when each frame pronounces;
inputting the key point region image into a first deep learning model to obtain a first mouth shape type of the user;
and judging whether the first mouth shape type is the same as the correct mouth shape type.
According to a preferred embodiment of the present invention, the calculating the mouth shape matching degree of the user further comprises:
inputting the mouth shape information into a second deep learning model, and extracting key point regional characteristics of the mouth;
matching the key point region characteristics with characteristics in a preset mouth shape library to obtain a corresponding second mouth shape category;
and judging whether the second mouth shape type is the same as the correct mouth shape type.
According to a preferred embodiment of the present invention, the matching the feature of the key point region with the feature in the preset mouth shape library to obtain the corresponding mouth shape category further includes:
matching the key point region characteristics with characteristics in a preset mouth shape library, and selecting the mouth shape class with the most characteristics as the mouth shape class of the user;
and carrying out similarity calculation on the characteristics of the mouth shape type and the characteristics of the correct mouth shape type to obtain a similarity value.
According to a preferred embodiment of the present invention, the determining whether the pronunciation of the user is accurate according to the pronunciation matching degree and the mouth shape matching degree further includes:
judging whether the pronunciation of the user is accurate according to a preset rule, wherein the rule comprises the following steps: and when at least one of the pronunciation matching degree and the mouth shape matching degree is lower than a preset lower limit value, the pronunciation of the user is inaccurate.
According to a preferred embodiment of the present invention, the determining whether the pronunciation of the user is accurate according to a predetermined rule further includes:
setting the score of the first mouth shape category as mouth shape matching degree, and when the first mouth shape category is the same as the correct mouth shape category but the mouth shape matching degree score is lower than a preset first lower limit value, the pronunciation of the user is inaccurate;
setting the feature similarity of the second mouth shape type as mouth shape matching degree, and when the second mouth shape type is the same as the correct mouth shape type but the feature similarity is lower than a preset second lower limit value, the pronunciation of the user is inaccurate;
and when the matching degree of the audio information and the correct pronunciation is smaller than a preset third lower limit value, the pronunciation of the user is inaccurate.
According to a preferred embodiment of the present invention, the determining whether the pronunciation of the user is accurate according to the pronunciation matching degree and the mouth shape matching degree further includes:
and inputting the calculated pronunciation matching degree and the mouth shape matching degree as parameters, calculating the pronunciation accuracy of the user by using a preset calculation formula, and judging that the pronunciation of the user is inaccurate when the calculated pronunciation accuracy of the user is lower than a preset threshold value.
According to a preferred embodiment of the invention, the method further comprises:
and displaying pronunciation correction indication information for users with pronunciation accuracy lower than a preset threshold value.
According to a preferred embodiment of the present invention, the pronunciation correction instruction information includes: correct pronunciation and corresponding correct mouth shape.
According to a preferred embodiment of the present invention, the displaying pronunciation correction instruction information further includes: and displaying different pronunciation correction indication information according to the combination of different pronunciation matching degrees and mouth shape matching degrees.
The second aspect of the present invention provides a user pronunciation accuracy assessment apparatus, including:
the information acquisition module is used for acquiring audio information and image information when a user pronounces;
the image acquisition module is used for screening out at least one frame of image generated when the user pronounces from the image information;
the mouth shape acquisition module is used for extracting the mouth shape information of the user from the image during pronunciation;
and the pronunciation judging module is used for respectively inputting the audio information and the mouth shape information into different deep learning models, calculating the pronunciation matching degree and the mouth shape matching degree of the user, and judging whether the pronunciation of the user is accurate or not according to the pronunciation matching degree and the mouth shape matching degree.
A third aspect of the invention proposes an electronic device comprising a processor and a memory for storing a computer-executable program, which, when executed by the processor, performs the method.
The fourth aspect of the present invention also provides a computer-readable medium storing a computer-executable program, which when executed, implements the method.
(III) advantageous effects
The method extracts at least one frame of image from the images captured while the user pronounces, extracts the mouth shape image and the audio of the pronunciation, and inputs them into different deep learning models respectively to obtain the mouth shape and audio recognized by the models; these are compared with the standard mouth shape and audio, and whether the pronunciation of the user is accurate is evaluated according to the matching degrees of the mouth shape and the audio, after which the user is corrected in both the mouth shape and the pronunciation dimensions. Evaluating whether the pronunciation of the user is accurate across multiple dimensions makes the evaluation result more accurate; the evaluation result is fed back to the user in real time, and a corresponding correction scheme is given according to the evaluation result, which makes it convenient for the user to adjust the pronunciation mouth shape and tone and improves the user experience.
Drawings
FIG. 1 is a flow diagram of a user pronunciation accuracy assessment method in accordance with an embodiment of the present invention;
FIG. 2 is a schematic diagram of a key point region of a mouth of an embodiment of the present invention;
FIG. 3 is a schematic diagram of a user pronunciation accuracy assessment apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an electronic device of one embodiment of the invention;
fig. 5 is a schematic diagram of a computer-readable recording medium of an embodiment of the present invention.
Detailed Description
In describing particular embodiments, specific details of structures, properties, effects, or other features are set forth in order to provide a thorough understanding of the embodiments by one skilled in the art. However, it is not excluded that a person skilled in the art may implement the invention in a specific case without the above-described structures, performances, effects or other features.
The flow chart in the drawings is only an exemplary flow demonstration, and does not represent that all the contents, operations and steps in the flow chart are necessarily included in the scheme of the invention, nor does it represent that the execution is necessarily performed in the order shown in the drawings. For example, some operations/steps in the flowcharts may be divided, some operations/steps may be combined or partially combined, and the like, and the execution order shown in the flowcharts may be changed according to actual situations without departing from the gist of the present invention.
The block diagrams in the figures generally represent functional entities and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different network and/or processing unit devices and/or microcontroller devices.
The same reference numerals denote the same or similar elements, components, or parts throughout the drawings, and thus, a repetitive description thereof may be omitted hereinafter. It will be further understood that, although the terms first, second, third, etc. may be used herein to describe various elements, components, or sections, these elements, components, or sections should not be limited by these terms. That is, these phrases are used only to distinguish one from another. For example, a first device may also be referred to as a second device without departing from the spirit of the present invention. Furthermore, the term "and/or" is intended to include all combinations of any one or more of the listed items.
In order to solve the technical problem, the invention provides a method for evaluating the pronunciation accuracy of a user.
The method comprises the steps of collecting image information of a user during pronunciation in real time during live broadcasting, extracting each frame of image during pronunciation from the image information, extracting mouth shape information of the user from the image, collecting audio information of the user during pronunciation, and comprehensively judging whether the pronunciation of the user is accurate or not through a trained deep learning model from two aspects of sound and mouth shape images.
For the mouth shape images generated when a user pronounces, two different deep learning models are used in the embodiment of the invention. The image of the key point region of the mouth is input into one of the deep learning models, which outputs the recognized mouth shape with the highest score; this mouth shape is then compared with the standard mouth shape to see whether they are the same, so that whether the mouth shape produced when the user pronounces is accurate can be judged.
The mouth shape image captured when the user pronounces is also input into another deep learning model, which extracts the features of the key point region of the mouth in the image and matches them against the features in a mouth shape library; the mouth shape with the highest matching degree is selected and then compared with the standard mouth shape to see whether they are the same, so that whether the mouth shape of the user's pronunciation is accurate can be judged.
Finally, the audio information generated when the user pronounces is recognized and matched against the standard pronunciation through a deep learning model, to judge whether the user's pronunciation is standard.
Through the three dimensions, if any dimension of the user is judged to be inaccurate, the pronunciation of the user can be determined to be nonstandard, and a corresponding correction scheme is provided according to the evaluation results of different dimensions, so that the evaluation results are more accurate, and the learning efficiency of the user is improved.
In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments.
Fig. 1 is a flowchart of a user pronunciation accuracy assessment method according to an embodiment of the present invention.
As shown in fig. 1, the method includes:
s101, acquiring audio information and image information when a user pronounces.
Specifically, when a user learns pronunciation in a live-broadcast class using a designated application program on a terminal device, the application program on the terminal device can record the learning process of the user in real time. The terminal device may be a mobile phone, a tablet computer or other general-purpose equipment with video shooting and communication functions. When the user needs to pronounce, for example while reading aloud or answering questions, the terminal device collects video and/or audio of the user's pronunciation through data collection equipment such as a camera (image collection equipment) and/or a microphone (audio collection equipment), sends the collected data to a server for processing, and feeds the result back to the user through the terminal device, so that incorrect pronunciation mouth shapes of the user can be evaluated.
Considering that the acquired video may include not only the user's pronunciation process but also some invalid video segments (such as the preparation stage before the user pronounces), in one implementation of this embodiment the acquired video is first processed to obtain the images and audio information of the user's pronunciation. The valid video segments may be obtained by removing the invalid video, that is, video that does not contain the pronunciation process, such as frames before the user opens the mouth and frames after the user closes the mouth. For example, the pronunciation video is cropped based on the fluctuation of the video signal and the audio signal to remove the video frames in which the user is not pronouncing, so as to obtain the valid video segments.
In this embodiment, whether the current video is valid is determined from the fluctuation of the video signal and the audio signal: the smaller the fluctuation of the signals, the smaller the change in the video picture, and therefore the smaller the probability that the video contains a frame of the user pronouncing. Whether the current video frame contains the user pronouncing can thus be determined by setting a reasonable threshold.
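As a minimal illustration of this thresholding idea (not code from the patent), the following sketch assumes the video has been decoded into grayscale frames and the audio into a sample array; the function name and threshold values are assumptions.

import numpy as np

def select_valid_frames(frames, audio, sample_rate, fps,
                        video_threshold=2.0, audio_threshold=0.01):
    # frames: list of grayscale frames (2-D numpy arrays); audio: 1-D sample array
    samples_per_frame = int(sample_rate / fps)
    valid_indices = []
    for i in range(1, len(frames)):
        # fluctuation of the video signal: mean absolute pixel change between frames
        frame_change = np.mean(np.abs(frames[i].astype(float) - frames[i - 1].astype(float)))
        # fluctuation of the audio signal: RMS energy of the window belonging to this frame
        window = audio[i * samples_per_frame:(i + 1) * samples_per_frame]
        energy = np.sqrt(np.mean(window ** 2)) if len(window) else 0.0
        if frame_change > video_threshold or energy > audio_threshold:
            valid_indices.append(i)      # frame likely contains the user pronouncing
    return valid_indices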
S102, screening out at least one frame of image generated when the user pronounces from the image information.
Specifically, after the image information of the user's pronunciation has been acquired, a key frame is extracted from it (that is, from the user's pronunciation video). Because the mouth shape that determines whether the pronunciation is correct differs for different pronunciation contents, it is difficult to select the key frame with a single uniform standard. In an embodiment of the present invention, the key frame is therefore determined as follows:
when the pronunciation content needs the user to open the mouth, if the mouth shape is kept unchanged during pronunciation, such as "a", "c", etc., only one or two frames of images in the image can be extracted as the key frame, for example, one frame with the maximum and minimum opening degree is respectively selected as the key frame; when the pronunciation content does not need the user to open the mouth, a frame with pronunciation pause in the pronunciation video of the user can be selected as a key frame; when the mouth shape needs to be changed during pronunciation, such as m, w and the like, the pronunciation accuracy of the user is difficult to judge by using one frame or two frames of images, all frames of images of the user with complete pronunciation can be extracted, and in order to ensure that the selected key frame is accurate and effective, a plurality of frames adjacent to the key frame can be selected to be analyzed together.
S103, extracting mouth shape information of a user from the image during pronunciation;
Specifically, the extracted mouth shape information may be the area of the mouth region or the outline of the mouth region. Fig. 2 is a schematic diagram of the key point region of the mouth according to an embodiment of the present invention; a partial image around the 8 key points of the mouth contour shown in fig. 2, such as the mouth corners, the highest and lowest points of the lips and the places with large curvature, may be extracted as the mouth shape information, and corresponding mouth shape features may also be extracted from the mouth region.
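As an illustration of how such key points could be obtained, the sketch below assumes the widely used dlib 68-point facial landmark model; the patent does not name a specific detector, and the eight selected landmark indices are an illustrative choice.

import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed model file

def extract_mouth_keypoints(gray_image):
    faces = detector(gray_image, 1)
    if not faces:
        return None
    landmarks = predictor(gray_image, faces[0])
    # keep 8 points on the outer lip contour: the mouth corners, the highest and
    # lowest points of the lips, and points where the curvature is large
    indices = [48, 50, 51, 52, 54, 56, 57, 58]           # illustrative selection
    return np.array([(landmarks.part(i).x, landmarks.part(i).y) for i in indices])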
And S104, respectively inputting the audio information and the mouth shape information into different deep learning models, calculating the pronunciation matching degree and the mouth shape matching degree of the user, and judging whether the pronunciation of the user is accurate or not according to the pronunciation matching degree and the mouth shape matching degree.
Specifically, the deep learning model may be trained first. The key point region image of the mouth is extracted from the mouth shape information of each frame of image of a historical user's pronunciation, and the extracted key point region images are input into the deep learning model. If the mouth shape remains unchanged during pronunciation, for example for "a" or "c", only one frame of the mouth key point region image needs to be input into the deep learning model. The output mouth shape is matched against the actual mouth shape of the user, and the parameters of the deep learning model are adjusted according to the matching result until the output mouth shape is consistent with the actual mouth shape of the user; the deep learning model trained in this way is the first deep learning model.
To make the evaluation result more accurate, the first deep learning model is improved: the mouth key point region images of all frames of one complete pronunciation of the historical user are extracted and input into the deep learning model together, and the output mouth shape is matched against the actual mouth shape of the user. For example, if the model outputs the mouth shape of "m" while the historical user actually produced the mouth shape of "f", the model parameters need to be adjusted until the output mouth shape is consistent with the actual mouth shape of the user.
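A highly simplified PyTorch sketch of how such a mouth shape classifier (the first deep learning model) might be trained is given here; the network architecture, the 64x64 grayscale input size and the hyper-parameters are assumptions, since the patent does not specify them.

import torch
import torch.nn as nn

class MouthShapeNet(nn.Module):
    # small CNN standing in for the first deep learning model (assumed architecture)
    def __init__(self, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)   # assumes 64x64 input

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

def train(model, loader, epochs=10, lr=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:        # labels: the actual mouth shape of the user
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()                  # adjust parameters until output matches the actual mouth shape
            optimizer.step()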
The mouth key point region images of a historical user's pronunciation are input into another deep learning model, which matches the features of the historical user's mouth region against the features in a pre-stored mouth shape library; the mouth shape in the library sharing the most features is taken as the model output, its similarity with the actual mouth shape of the user is calculated, and the parameters of the deep learning model are adjusted according to the result until the output mouth shape is consistent with the actual mouth shape of the user (the similarity approaches 1). The deep learning model trained in this way is the second deep learning model.
There are two ways to calculate the mouth shape matching degree. In the first way, the mouth key point region images extracted in the above embodiment are obtained first. When the mouth shape of the user does not change during pronunciation, one frame of the mouth key point region image is obtained and input into the trained first deep learning model; when the mouth shape of the user changes during pronunciation, the mouth key point region images of all frames are obtained and input into the improved first deep learning model, which comprehensively calculates the mouth shape recognition result with the highest score from the change of the mouth contour and the change of the positions of the mouth key points in each frame. The obtained mouth shape is matched with the mouth shape of the standard pronunciation and the matching degree of the two is calculated; when the matching degree is greater than a preset threshold, the mouth shape of the user's pronunciation is judged to be standard, otherwise the user's pronunciation is judged to be nonstandard.
Similarly, in the second way, the mouth key point region image extracted in the above embodiment is obtained first and input into the trained second deep learning model, which extracts the key point region features of the mouth. For example, in fig. 2, the interior angles at the 8 key points of the mouth contour are 60°, 170°, 175°, 160°, 45°, 160°, 165° and 145° in sequence, so the mouth shape feature of the key frame can be encoded as λ = (60, 170, 175, 160, 45, 160, 165, 145); other features of the mouth can also be extracted. The extracted features are matched against the features in a preset mouth shape library, and the mouth shape in the library sharing the most features is taken as the model output. The similarity between the features of the output mouth shape and the features of the mouth shape of the standard pronunciation is then calculated, and this similarity value is taken as the matching degree; when the matching degree is greater than a preset threshold, the mouth shape of the user's pronunciation is judged to be standard, otherwise the user's pronunciation is judged to be nonstandard.
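A minimal sketch of this angle-based encoding follows, assuming the eight contour key points are supplied as (x, y) coordinates in order around the mouth contour; the helper name is illustrative.

import numpy as np

def mouth_angle_features(points):
    # interior angle (in degrees) at each key point, formed with its two neighbours
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    angles = []
    for i in range(n):
        v_prev = pts[(i - 1) % n] - pts[i]
        v_next = pts[(i + 1) % n] - pts[i]
        cos_a = np.dot(v_prev, v_next) / (np.linalg.norm(v_prev) * np.linalg.norm(v_next) + 1e-8)
        angles.append(np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0))))
    return np.array(angles)    # e.g. lambda = (60, 170, 175, 160, 45, 160, 165, 145)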
When the feature of the key point region of the mouth of the user during pronunciation is matched with the feature in the mouth shape library, a similarity matching method can be used, namely, a cosine value between feature vectors is calculated to serve as the similarity, and the higher the similarity is, the closer the two features are.
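The cosine similarity mentioned here reduces to a one-line computation over the feature vectors; matching against the library then simply keeps the entry with the highest similarity (the library format below is an assumption).

import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def match_mouth_shape(user_features, mouth_shape_library):
    # mouth_shape_library: dict mapping mouth shape category -> feature vector (assumed format)
    best = max(mouth_shape_library,
               key=lambda k: cosine_similarity(user_features, mouth_shape_library[k]))
    return best, cosine_similarity(user_features, mouth_shape_library[best])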
Preferably, the two methods can be used in combination: they are carried out simultaneously, and whether the user's pronunciation is standard is then judged comprehensively from the two results.
As another dimension, the extracted audio information may be input into a corresponding deep learning model, the audio information is converted into a waveform diagram, and the waveform diagram is compared with a waveform diagram of a standard voice to calculate a matching degree between the voice of the user when pronouncing and the standard voice, thereby determining whether the pronunciation of the user is standard.
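As a crude stand-in for the waveform comparison described here (the actual system uses a trained deep learning model, which is not reproduced), the following sketch resamples both waveforms to a common length and uses their normalized correlation as the matching degree; the file-path arguments and target length are assumptions.

import numpy as np
from scipy.io import wavfile
from scipy.signal import resample

def waveform_match(user_wav_path, standard_wav_path, length=16000):
    def load(path):
        _, data = wavfile.read(path)
        data = data.astype(float)
        if data.ndim > 1:                    # stereo -> mono
            data = data.mean(axis=1)
        return resample(data, length)
    a, b = load(user_wav_path), load(standard_wav_path)
    # normalized correlation in [-1, 1], clipped to [0, 1] as a rough matching degree
    corr = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return max(0.0, corr)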
For example, the lower limit of the mouth shape matching degree score calculated by the first method may be set to 0.9 (range 0-1), the lower limit of the mouth shape matching degree (corresponding similarity) calculated by the second method may be set to 0.95 (range 0-1), and the lower limit of the pronunciation matching degree calculated by the voice matching may be set to 0.8 (range 0-1).
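Using the example lower limits above, the decision rule ("inaccurate if any matching degree falls below its lower limit") can be written directly; the dictionary keys are illustrative names.

LOWER_LIMITS = {"mouth_score": 0.9, "mouth_similarity": 0.95, "pronunciation_match": 0.8}

def is_pronunciation_accurate(mouth_score, mouth_similarity, pronunciation_match):
    # accurate only if every matching degree reaches its preset lower limit
    return (mouth_score >= LOWER_LIMITS["mouth_score"]
            and mouth_similarity >= LOWER_LIMITS["mouth_similarity"]
            and pronunciation_match >= LOWER_LIMITS["pronunciation_match"])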
Suppose the standard pronunciation has mouth shape A and voice "a". After user A pronounces through the mobile terminal, the terminal device sends the images of user A's pronunciation to the server; the server extracts the mouth images and audio information of user A's pronunciation and inputs the mouth images into the first deep learning model, which outputs the mouth shape A with a score of 0.94. The mouth images are also input into the second deep learning model, and the mouth shape matched from the mouth shape library is A with a similarity of 0.99. The matching degree between user A's voice and the standard voice is 0.8. All of these values reach the preset lower limits, so the pronunciation of user A can be judged to be accurate.
After user B pronounces through the mobile terminal, the terminal device sends the images of user B's pronunciation to the server; the server extracts the mouth images and audio information of user B's pronunciation and inputs the mouth images into the first deep learning model, which outputs the mouth shape A with a score of 0.74. The mouth images are also input into the second deep learning model, and the mouth shape matched from the mouth shape library is A with a similarity of 0.9. The matching degree between user B's voice and the standard voice is 0.6. Although the matched mouth shape is correct, the mouth shape matching degrees obtained in the two ways are both lower than the set lower limits, and the pronunciation matching degree is also lower than its lower limit, so the mouth shape of user B is judged to be nonstandard and the pronunciation inaccurate.
Further, as a preferred embodiment, the calculated pronunciation matching degree and mouth shape matching degree may be used as parameters and input into a predetermined calculation formula to calculate the user's pronunciation accuracy; when the calculated pronunciation accuracy is lower than a predetermined threshold, the user's pronunciation is judged to be inaccurate.
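The patent does not disclose the predetermined calculation formula; as one hedged possibility, a weighted sum of the matching degrees compared against a threshold is sketched below, with the weights and threshold purely illustrative.

def pronunciation_accuracy(mouth_score, mouth_similarity, pronunciation_match,
                           weights=(0.3, 0.3, 0.4), threshold=0.85):
    # weighted combination of the matching degrees (weights and threshold are assumptions)
    accuracy = (weights[0] * mouth_score
                + weights[1] * mouth_similarity
                + weights[2] * pronunciation_match)
    return accuracy, accuracy >= threshold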
The embodiment of the invention also describes how to correct the user's pronunciation according to the evaluation result. First the server displays the evaluation result to the user through the terminal device, and then different correction strategies are given according to the evaluation result: the server indicates whether the mouth shape and the voice of the user are accurate, and if not, it provides the correct mouth shape image and voice, so that the user can correct the wrong pronunciation accordingly. If the user is in a live-broadcast class, the server sends the evaluation result and the correction strategy to the user in the form of an in-app message, and the user can correct the pronunciation according to the content of the message after the live broadcast ends.
For example, when the user's mouth shape is correct but the speech is inaccurate, the server prompts the user that the mouth shape is correct and provides the correct speech. When the user's mouth shape is inaccurate, the server can also describe the form of the mouth shape error, such as the mouth being opened too wide or the tongue tip pressing against the lower teeth.
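The mapping from the different matching-degree combinations to correction hints can be expressed as a simple rule table; the message wording below is an assumption and not taken from the patent.

def correction_message(mouth_shape_ok, voice_ok):
    if mouth_shape_ok and voice_ok:
        return "Pronunciation is accurate."
    if mouth_shape_ok and not voice_ok:
        return "Mouth shape is correct; listen to the standard voice and adjust your tone."
    if voice_ok and not mouth_shape_ok:
        return "Voice is close to the standard; compare your mouth shape with the standard mouth shape image."
    return "Compare both the standard mouth shape image and the standard voice, then practise again."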
According to the embodiment of the invention, the mouth shape images and the audio of the user's pronunciation are compared respectively with the mouth shape and the audio of the standard pronunciation to obtain the corresponding results, whether the user's pronunciation is accurate is evaluated from these results, and the user is then corrected in both the mouth shape and the pronunciation dimensions. Evaluating whether the user's pronunciation is accurate across multiple dimensions makes the evaluation result more accurate; the evaluation result is fed back to the user in real time, and a corresponding correction scheme is given according to the evaluation result, which makes it convenient for the user to adjust the pronunciation mouth shape and tone and improves the user experience.
Embodiments of the apparatus of the present invention are described below, which may be used to perform method embodiments of the present invention. The details described in the device embodiments of the invention should be regarded as complementary to the above-described method embodiments; reference is made to the above-described method embodiments for details not disclosed in the apparatus embodiments of the invention.
Fig. 3 is a schematic diagram of a user pronunciation accuracy assessment apparatus according to an embodiment of the present invention.
The apparatus 200 comprises:
the information acquisition module 201 is used for acquiring audio information and image information when a user pronounces;
the image acquisition module 202 is configured to screen out at least one frame of image generated by the user during pronunciation from the image information;
a mouth shape obtaining module 203, configured to extract mouth shape information of the user from the image during pronunciation;
and the pronunciation judging module 204 is used for respectively inputting the audio information and the mouth shape information into different deep learning models, calculating the pronunciation matching degree and the mouth shape matching degree of the user, and judging whether the pronunciation of the user is accurate according to the pronunciation matching degree and the mouth shape matching degree.
According to a preferred embodiment of the present invention, the pronunciation determination module 204 further comprises:
an image extraction unit, configured to extract a key point region image of the mouth from the mouth shape information;
the first mouth shape type obtaining unit is used for inputting the key point region image into a first deep learning model to obtain a first mouth shape type of the user;
and the first mouth shape type evaluation unit is used for judging whether the first mouth shape type is the same as the correct mouth shape type.
According to a preferred embodiment of the present invention, the pronunciation determination module 204 further comprises:
the feature extraction unit is used for inputting the mouth shape information into a second deep learning model and extracting key point region features of the mouth;
the second mouth shape type acquisition unit is used for matching the key point region characteristics with the characteristics in a preset mouth shape library to obtain a corresponding second mouth shape type;
and the second mouth shape type evaluation unit is used for judging whether the second mouth shape type is the same as the correct mouth shape type.
According to a preferred embodiment of the present invention, the second mouth shape class acquisition unit further includes:
the characteristic matching unit is used for matching the key point region characteristics with the characteristics in a preset mouth shape library and selecting the mouth shape class with the most characteristics as the mouth shape class of the user;
and the similarity calculation unit is used for performing similarity calculation on the characteristics of the mouth shape type and the characteristics of the correct mouth shape type to obtain a similarity value.
According to a preferred embodiment of the present invention, the pronunciation determination module 204 further comprises:
the matching degree judging unit is used for judging whether the pronunciation of the user is accurate according to a preset rule, and the rule comprises the following steps: and when at least one of the pronunciation matching degree and the mouth shape matching degree is lower than a preset lower limit value, the pronunciation of the user is inaccurate.
According to a preferred embodiment of the present invention, the matching degree determination unit further includes:
the first mouth shape matching degree judging unit is used for setting the score of the first mouth shape type as the mouth shape matching degree, and when the first mouth shape type is the same as the correct mouth shape type but the mouth shape matching degree score is lower than a preset first lower limit value, the pronunciation of the user is inaccurate;
the second mouth shape matching degree judging unit is used for setting the feature similarity of the second mouth shape type as the mouth shape matching degree, and when the second mouth shape type is the same as the correct mouth shape type but the feature similarity is lower than a preset second lower limit value, the pronunciation of the user is inaccurate;
and the audio matching degree judging unit is used for judging that the pronunciation of the user is inaccurate when the matching degree of the audio information and the correct pronunciation is smaller than a preset third lower limit value.
According to a preferred embodiment of the present invention, the pronunciation determination module 204 further comprises:
and the comprehensive matching degree judging unit is used for inputting the calculated pronunciation matching degree and the mouth shape matching degree as parameters, calculating the pronunciation accuracy of the user by using a preset calculation formula, and judging that the pronunciation of the user is inaccurate when the calculated pronunciation accuracy of the user is lower than a preset threshold value.
According to a preferred embodiment of the present invention, the apparatus further comprises a pronunciation correction module for presenting pronunciation correction indication information to a user whose pronunciation accuracy is below a predetermined threshold.
According to a preferred embodiment of the present invention, the pronunciation correction instruction information includes: correct pronunciation and corresponding correct mouth shape.
According to a preferred embodiment of the present invention, the pronunciation correction module further comprises:
and the correction indication information display unit is used for displaying different pronunciation correction indication information according to the combination of different pronunciation matching degrees and mouth shape matching degrees.
Those skilled in the art will appreciate that the modules in the above-described embodiments of the apparatus may be distributed as described in the apparatus, and may be correspondingly modified and distributed in one or more apparatuses other than the above-described embodiments. The modules of the above embodiments may be combined into one module, or further split into multiple sub-modules.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention; the electronic device includes a processor and a memory, the memory stores a computer-executable program, and when the program is executed by the processor, the processor performs the user pronunciation accuracy assessment method.
As shown in fig. 4, the electronic device is in the form of a general purpose computing device. The processor can be one or more and can work together. The invention also does not exclude that distributed processing is performed, i.e. the processors may be distributed over different physical devices. The electronic device of the present invention is not limited to a single entity, and may be a sum of a plurality of entity devices.
The memory stores a computer executable program, typically machine readable code. The computer readable program may be executed by the processor to enable an electronic device to perform the method of the invention, or at least some of the steps of the method.
The memory may include volatile memory, such as Random Access Memory (RAM) and/or cache memory, and may also be non-volatile memory, such as read-only memory (ROM).
Optionally, in this embodiment, the electronic device further includes an I/O interface, which is used for data exchange between the electronic device and an external device. The I/O interface may be a local bus representing one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, and/or a memory storage device using any of a variety of bus architectures.
It should be understood that the electronic device shown in fig. 4 is only one example of the present invention, and elements or components not shown in the above example may be further included in the electronic device of the present invention. For example, some electronic devices further include a display unit such as a display screen, and some electronic devices further include a human-computer interaction element such as a button, a keyboard, and the like. Electronic devices are considered to be covered by the present invention as long as the electronic devices are capable of executing a computer-readable program in a memory to implement the method of the present invention or at least a part of the steps of the method.
Fig. 5 is a schematic diagram of a computer-readable recording medium of an embodiment of the present invention. As shown in fig. 5, the computer-readable recording medium stores therein a computer-executable program that, when executed, implements the user pronunciation accuracy assessment method of the present invention described above. The computer readable storage medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable storage medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable storage medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
From the above description of the embodiments, those skilled in the art will readily appreciate that the present invention can be implemented by hardware capable of executing a specific computer program, such as the system of the present invention, and electronic processing units, servers, clients, mobile phones, control units, processors, etc. included in the system. The invention may also be implemented by computer software for performing the method of the invention. It should be noted, however, that the computer software for executing the method of the present invention is not limited to be executed by one or a specific hardware entity, but may also be implemented in a distributed manner by hardware entities without specific details, for example, some method steps executed by a computer program may be executed by a mobile client, and another part may be executed by a smart meter, a smart pen, or the like. For computer software, the software product may be stored in a computer readable storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or may be distributed over a network, as long as it enables the electronic device to perform the method according to the present invention.
In summary, the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that some or all of the functionality of some or all of the components in embodiments in accordance with the invention may be implemented in practice using a general purpose data processing device such as a microprocessor or a Digital Signal Processor (DSP). The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
While the foregoing embodiments have described the objects, aspects and advantages of the present invention in further detail, it should be understood that the present invention is not inherently related to any particular computer, virtual machine or electronic device, and various general-purpose machines may be used to implement the present invention. The invention is not to be considered as limited to the specific embodiments described; it is intended to cover all modifications, changes and equivalents that come within the spirit and scope of the invention.

Claims (10)

1. A method for evaluating pronunciation accuracy of a user, comprising:
acquiring audio information and image information when a user pronounces;
screening out at least one frame of image generated when the user pronounces from the image information;
extracting mouth shape information of a user from the image during pronunciation;
and respectively inputting the audio information and the mouth shape information into different deep learning models, calculating the pronunciation matching degree and the mouth shape matching degree of the user, and judging whether the pronunciation of the user is accurate or not according to the pronunciation matching degree and the mouth shape matching degree.
2. The method for evaluating pronunciation accuracy of a user according to claim 1, wherein the calculating the mouth shape matching degree of the user further comprises:
extracting key point area images of the mouth from the mouth shape information of the images when each frame pronounces;
inputting the key point region image into a first deep learning model to obtain a first mouth shape type of the user;
and judging whether the first mouth shape type is the same as the correct mouth shape type.
3. The method according to claim 1 or 2, wherein the calculating the mouth shape matching degree of the user further comprises:
inputting the mouth shape information into a second deep learning model, and extracting key point regional characteristics of the mouth;
matching the key point region characteristics with characteristics in a preset mouth shape library to obtain a corresponding second mouth shape category;
and judging whether the second mouth shape type is the same as the correct mouth shape type.
4. The method for assessing user pronunciation accuracy according to any one of claims 1-3, wherein the matching of the key point region features with features in a preset mouth shape library to obtain corresponding mouth shape categories further comprises:
matching the key point region characteristics with characteristics in a preset mouth shape library, and selecting the mouth shape class with the most characteristics as the mouth shape class of the user;
carrying out similarity calculation on the characteristics of the mouth shape type and the characteristics of the correct mouth shape type to obtain a similarity value;
optionally, the determining whether the pronunciation of the user is accurate according to the pronunciation matching degree and the mouth shape matching degree further includes:
judging whether the pronunciation of the user is accurate according to a preset rule, wherein the rule comprises the following steps: and when at least one of the pronunciation matching degree and the mouth shape matching degree is lower than a preset lower limit value, the pronunciation of the user is inaccurate.
5. The method for evaluating the accuracy of pronunciation by a user according to any one of claims 1 to 4, wherein the determining whether the pronunciation by the user is accurate according to a predetermined rule further comprises:
setting the score of the first mouth shape category as mouth shape matching degree, and when the first mouth shape category is the same as the correct mouth shape category but the mouth shape matching degree score is lower than a preset first lower limit value, the pronunciation of the user is inaccurate;
setting the feature similarity of the second mouth shape type as mouth shape matching degree, and when the second mouth shape type is the same as the correct mouth shape type but the feature similarity is lower than a preset second lower limit value, the pronunciation of the user is inaccurate;
and when the matching degree of the audio information and the correct pronunciation is smaller than a preset third lower limit value, the pronunciation of the user is inaccurate.
6. The method for evaluating the accuracy of pronunciation of a user according to any one of claims 1 to 5, wherein the determining whether the pronunciation of the user is accurate according to the pronunciation matching degree and the mouth shape matching degree further comprises:
and inputting the calculated pronunciation matching degree and the mouth shape matching degree as parameters, calculating the pronunciation accuracy of the user by using a preset calculation formula, and judging that the pronunciation of the user is inaccurate when the calculated pronunciation accuracy of the user is lower than a preset threshold value.
7. The method for evaluating user pronunciation accuracy according to any one of claims 1-6, wherein the method further comprises:
displaying pronunciation correction indication information to users whose pronunciation accuracy is lower than a preset threshold value;
optionally, the pronunciation correction indication information includes: the correct pronunciation and the corresponding correct mouth shape;
optionally, the displaying of pronunciation correction indication information further includes: displaying different pronunciation correction indication information according to the combination of the pronunciation matching degree and the mouth shape matching degree.
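A minimal sketch of selecting different correction feedback from the combination of the two matching degrees, as claim 7 describes; the messages and the cut-off value are illustrative assumptions.

```python
# Sketch only: choose correction feedback based on which matching degree is low.
def correction_message(pronunciation_match: float, mouth_match: float,
                       threshold: float = 0.7) -> str:
    low_audio = pronunciation_match < threshold
    low_mouth = mouth_match < threshold
    if low_audio and low_mouth:
        return "Listen to the correct pronunciation and watch the correct mouth shape demo."
    if low_mouth:
        return "Your sound is close; adjust your mouth shape as shown in the demo."
    if low_audio:
        return "Your mouth shape is fine; listen again to the correct pronunciation."
    return "Good job - pronunciation and mouth shape both match."
```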
8. A user pronunciation accuracy assessment apparatus, comprising:
the information acquisition module is used for acquiring audio information and image information when a user pronounces;
the image acquisition module is used for screening out at least one frame of image generated when the user pronounces from the image information;
the mouth shape acquisition module is used for extracting the mouth shape information of the user from the image captured during pronunciation;
and the pronunciation judging module is used for respectively inputting the audio information and the mouth shape information into different deep learning models, calculating the pronunciation matching degree and the mouth shape matching degree of the user, and judging whether the pronunciation of the user is accurate or not according to the pronunciation matching degree and the mouth shape matching degree.
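The four modules of claim 8 form a simple pipeline; the sketch below only shows how they could be wired together, with placeholder internals and assumed data fields standing in for the claimed components.

```python
# Sketch only: a pipeline mirroring the four modules of the claimed apparatus.
class PronunciationAssessor:
    def __init__(self, audio_model, mouth_model):
        self.audio_model = audio_model   # placeholder for the pronunciation-matching model
        self.mouth_model = mouth_model   # placeholder for the mouth-shape-matching model

    def acquire(self, recording: dict):
        """Information acquisition module: split the recording into audio and frames."""
        return recording["audio"], recording["frames"]

    def select_frames(self, frames: list) -> list:
        """Image acquisition module: keep the frames captured while the user speaks."""
        return [f for f in frames if f.get("speaking")]

    def extract_mouth(self, frame: dict):
        """Mouth shape acquisition module: return the mouth region of one frame."""
        return frame["mouth_region"]

    def judge(self, audio, mouth_regions: list) -> bool:
        """Pronunciation judging module: combine both matching degrees (limits assumed)."""
        pronunciation_match = self.audio_model(audio)
        mouth_match = max(self.mouth_model(m) for m in mouth_regions)
        return pronunciation_match >= 0.7 and mouth_match >= 0.7
```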
9. An electronic device comprising a processor and a memory, the memory being configured to store a computer-executable program, characterized in that:
the computer-executable program, when executed by the processor, performs the method of any one of claims 1-7.
10. A computer-readable medium storing a computer-executable program, wherein the computer-executable program, when executed, implements the method of any of claims 1-7.
CN202011522673.5A 2020-12-22 2020-12-22 User pronunciation accuracy evaluation method and device and electronic equipment Pending CN112614489A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011522673.5A CN112614489A (en) 2020-12-22 2020-12-22 User pronunciation accuracy evaluation method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011522673.5A CN112614489A (en) 2020-12-22 2020-12-22 User pronunciation accuracy evaluation method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN112614489A true CN112614489A (en) 2021-04-06

Family

ID=75243876

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011522673.5A Pending CN112614489A (en) 2020-12-22 2020-12-22 User pronunciation accuracy evaluation method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN112614489A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105070118A (en) * 2015-07-30 2015-11-18 广东小天才科技有限公司 Method of correcting pronunciation aiming at language class learning and device of correcting pronunciation aiming at language class learning
US20180137778A1 (en) * 2016-08-17 2018-05-17 Ken-ichi KAINUMA Language learning system, language learning support server, and computer program product
CN111652165A (en) * 2020-06-08 2020-09-11 北京世纪好未来教育科技有限公司 Mouth shape evaluating method, mouth shape evaluating equipment and computer storage medium
CN111951828A (en) * 2019-05-16 2020-11-17 上海流利说信息技术有限公司 Pronunciation evaluation method, device, system, medium and computing equipment
CN111950327A (en) * 2019-05-16 2020-11-17 上海流利说信息技术有限公司 Mouth shape correcting method, mouth shape correcting device, mouth shape correcting medium and computing equipment

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113257231A (en) * 2021-07-07 2021-08-13 广州思正电子股份有限公司 Language sound correcting system method and device
CN113257231B (en) * 2021-07-07 2021-11-26 广州思正电子股份有限公司 Language sound correcting system method and device
CN113782055A (en) * 2021-07-15 2021-12-10 北京墨闻教育科技有限公司 Student characteristic-based voice evaluation method and system
CN114758647A (en) * 2021-07-20 2022-07-15 无锡柠檬科技服务有限公司 Language training method and system based on deep learning
CN114664132B (en) * 2022-04-05 2024-04-30 苏州市立医院 Language rehabilitation training device and method
CN115798513A (en) * 2023-01-31 2023-03-14 新励成教育科技股份有限公司 Talent expression management method, system and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN112614489A (en) User pronunciation accuracy evaluation method and device and electronic equipment
CN107818798B (en) Customer service quality evaluation method, device, equipment and storage medium
CN106485984B (en) Intelligent teaching method and device for piano
CN111193834B (en) Man-machine interaction method and device based on user sound characteristic analysis and electronic equipment
CN108305618B (en) Voice acquisition and search method, intelligent pen, search terminal and storage medium
CN110740389A (en) Video positioning method and device, computer readable medium and electronic equipment
CN112232276B (en) Emotion detection method and device based on voice recognition and image recognition
CN111526405B (en) Media material processing method, device, equipment, server and storage medium
CN113703579B (en) Data processing method, device, electronic equipment and storage medium
CN109286848B (en) Terminal video information interaction method and device and storage medium
CN115205764B (en) Online learning concentration monitoring method, system and medium based on machine vision
CN113392273A (en) Video playing method and device, computer equipment and storage medium
CN113886641A (en) Digital human generation method, apparatus, device and medium
CN115798518B (en) Model training method, device, equipment and medium
CN111951629A (en) Pronunciation correction system, method, medium and computing device
CN111666820A (en) Speaking state recognition method and device, storage medium and terminal
CN114065720A (en) Conference summary generation method and device, storage medium and electronic equipment
CN111415662A (en) Method, apparatus, device and medium for generating video
US20190228765A1 (en) Speech analysis apparatus, speech analysis system, and non-transitory computer readable medium
CN112101046B (en) Conversation analysis method, device and system based on conversation behavior
CN112989252A (en) Foreign language teaching system for original film and television
CN113255470A (en) Multi-mode piano partner training system and method based on hand posture estimation
CN114119819A (en) Data processing method and device, electronic equipment and computer storage medium
CN111554269A (en) Voice number taking method, system and storage medium
CN111078073B (en) Handwriting amplification method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination