CN111950327A - Mouth shape correcting method, mouth shape correcting device, mouth shape correcting medium and computing equipment - Google Patents

Mouth shape correcting method, mouth shape correcting device, mouth shape correcting medium and computing equipment

Info

Publication number
CN111950327A
CN111950327A (application CN201910405361.7A)
Authority
CN
China
Prior art keywords
mouth
user
key frame
pronunciation
mouth shape
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910405361.7A
Other languages
Chinese (zh)
Inventor
胡太
孙怿
沈欣尧
刘晨晨
张蕾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Liulishuo Information Technology Co ltd
Original Assignee
Shanghai Liulishuo Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Liulishuo Information Technology Co ltd filed Critical Shanghai Liulishuo Information Technology Co ltd
Priority to CN201910405361.7A priority Critical patent/CN111950327A/en
Publication of CN111950327A publication Critical patent/CN111950327A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/50: Extraction of image or video features by performing operations within image blocks; by using histograms, e.g. histogram of oriented gradients [HoG]; by summing image-intensity values; Projection analysis
    • G: PHYSICS
    • G09: EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B: EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B19/00: Teaching not covered by other main groups of this subclass
    • G09B19/04: Speaking
    • G: PHYSICS
    • G09: EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B: EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B5/00: Electrically-operated educational appliances
    • G09B5/06: Electrically-operated educational appliances with both visual and audible presentation of the material to be studied
    • G09B5/065: Combinations of audio and video presentations, e.g. videotapes, videodiscs, television systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Educational Technology (AREA)
  • Educational Administration (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention provides a mouth shape correction method. The method comprises the following steps: acquiring a key frame of a user pronunciation video; extracting mouth shape features of the key frame; classifying the key frame based on the mouth shape features; confirming whether the classification result is consistent with the category of the key frame of the standard pronunciation video; and if the two are inconsistent, giving a corresponding prompt according to the category of the key frame of the standard pronunciation video. The method can determine from the key frames in the user's pronunciation video whether the user's pronunciation mouth shape is standard, and can then correct an incorrect pronunciation mouth shape, bringing a better experience to the user. The embodiments of the invention also provide a mouth shape correction device, a medium and a computing device.

Description

Mouth shape correcting method, mouth shape correcting device, mouth shape correcting medium and computing equipment
Technical Field
The embodiments of the invention relate to the field of spoken language evaluation, and in particular to a mouth shape correction method, a mouth shape correction device, a mouth shape correction medium and a computing device.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
In language learning, acquiring correct spoken pronunciation is a very important part. In earlier years, spoken language could only be learned offline with a teacher; with the development of technology, online spoken language learning has become a trend, and in recent years the scoring and correction of spoken pronunciation have mainly been built on representations of acoustic features.
However, the correctness of the pronunciation mouth shape plays an important role in pronunciation: mastering the correct mouth shape helps learners produce standard pronunciation. Existing mouth shape recognition and judgment rely heavily on lip-reading technology, which requires a high-performance GPU and a large amount of memory and scores the pronunciation content with a deep learning algorithm; because of these high hardware requirements, such approaches are not well suited to mobile terminal devices.
Disclosure of Invention
In this context, embodiments of the present invention are intended to provide a method, apparatus, medium, and computing device for correction of a mouth shape.
In a first aspect of embodiments of the present invention, there is provided a mouth shape correction method, comprising:
Acquiring a key frame of a user pronunciation video;
extracting mouth shape features of the key frames;
classifying the keyframes based on the mouth-shape features;
confirming whether the classification result is consistent with the category of the key frame of the standard pronunciation video;
and if the two are inconsistent, giving a corresponding prompt according to the category of the key frame of the standard pronunciation video.
In one embodiment of the invention, the extraction mode of the key frame is determined according to the category of the content of the pronunciation of the user.
In another embodiment of the invention, the content of the pronunciation of the user is the pronunciation content displayed on the screen of the terminal equipment.
In yet another embodiment of the present invention, the content of the pronunciation of the user is a phonetic symbol.
In yet another embodiment of the present invention, the categories of the content of the pronunciation include at least four categories;
when the content of the pronunciation belongs to the first class, acquiring the frame with the maximum mouth opening degree in the user pronunciation video as the key frame;
when the content of the pronunciation belongs to the second class, acquiring a frame of pronunciation pause in the user pronunciation video as the key frame;
when the pronunciation content belongs to the third class, acquiring the frame with the minimum mouth opening degree and the frame with the maximum mouth opening degree in the user pronunciation video as key frames, wherein the key frame with the minimum mouth opening degree precedes the key frame with the maximum mouth opening degree in time;
and when the pronunciation content belongs to the fourth class, acquiring the frame with the maximum mouth opening degree during the first vowel and the frame with the minimum mouth opening degree during the second vowel in the user pronunciation video as key frames.
In yet another embodiment of the present invention, extracting the mouth shape feature of the key frame comprises:
and extracting the shape characteristics of the mouth shape in the key frame.
In another embodiment of the present invention, extracting shape features of the mouth shape in the key frame includes:
acquiring key points of the outline of the mouth region in the key frame;
obtaining angles of all internal angles of a polygon constructed based on the key points;
and coding the angle of each internal angle according to a preset rule to obtain the shape characteristic of the mouth in the key frame.
In a further embodiment of the invention, the key points of the mouth region contour are key points of a contour within the mouth region.
In yet another embodiment of the present invention, obtaining the keypoints of the contour of the mouth region in the keyframe includes:
performing face detection on the key frame to obtain a boundary frame of the face;
and performing key point detection on the face region using a gradient-boosted ensemble-of-regression-trees algorithm based on the bounding box, to obtain the key points of the mouth region contour.
In another embodiment of the present invention, performing face detection on the key frame to obtain a bounding box of a face includes:
constructing an image descriptor by combining the local gradient and the gradient strength of the key frame image;
and judging whether the image in the window is a human face region or not by adopting a sliding window based on the image descriptor.
In yet another embodiment of the present invention, extracting the mouth shape feature of the key frame comprises:
and acquiring a directional gradient histogram and a color histogram of the mouth region in the key frame as the characteristics of the key frame.
In yet another embodiment of the present invention, the keyframes are classified using a pre-trained mouth-shape classifier.
In yet another embodiment of the present invention, the mouth shape classifier is constructed based on:
performing feature dimension reduction on the training data set;
and training based on the reduced-dimension low-dimension spatial features to obtain the mouth shape classifier.
In yet another embodiment of the present invention, feature dimensionality reduction is performed using one of principal component analysis, linear discriminant analysis, and local linear embedding.
In yet another embodiment of the present invention, the mouth shape classifier is trained using one of a support vector machine, random forest and extreme gradient boosting.
In a further embodiment of the invention, the training data set comprises mouth shape features for different poses of different faces.
In a second aspect of embodiments of the present invention, there is provided a mouth shape correction device, including:
the key frame acquisition module is configured to acquire key frames of a user pronunciation video;
a feature extraction module configured to extract mouth shape features of the key frames;
a classification module configured to classify the keyframes based on the mouth-shape features;
the judging module is configured to confirm whether the classification result is consistent with the category of the key frame of the standard pronunciation video;
and the prompting module is configured to give a corresponding prompt according to the category of the key frame of the standard pronunciation video if the two are inconsistent.
In a third aspect of embodiments of the present invention, there is provided a computer readable storage medium storing program code, which when executed by a processor, implements a method as described in any of the embodiments of the first aspect.
In a fourth aspect of embodiments of the present invention, there is provided a computing device comprising a processor and a storage medium storing program code that, when executed by the processor, implements a method as described in any of the embodiments of the first aspect.
According to the mouth shape correction method, device, medium and computing equipment of the embodiments of the present invention, whether the user's pronunciation mouth shape is standard can be determined from the key frames in the user's pronunciation video, and an incorrect pronunciation mouth shape can then be corrected. There is no need to score the pronunciation content with a deep learning algorithm as in lip-reading solutions, which require a high-performance GPU and a large amount of memory. Resource consumption is thereby significantly reduced and the hardware limitation is overcome (the GPU and memory of a mobile terminal such as a mobile phone do not meet such requirements), so the method is suitable for mobile terminals and brings a better experience to the user.
Drawings
The foregoing and other objects, features and advantages of exemplary embodiments of the present invention will be readily understood by reading the following detailed description with reference to the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
FIG. 1 schematically illustrates an application scenario in accordance with various embodiments of the present invention;
FIG. 2 schematically illustrates a flow diagram of a mouth shape correction method according to an embodiment of the invention;
FIG. 3 schematically illustrates yet another application scenario in accordance with various embodiments of the present invention;
FIG. 4 schematically illustrates a schematic diagram of inside and outside contour keypoints for a mouth region, according to embodiments of the invention;
FIG. 5 schematically illustrates a schematic diagram of construction of polygons based on contour keypoints within a mouth region, according to various embodiments of the invention;
FIG. 6 schematically illustrates a block diagram of a mouth shape correction apparatus according to an embodiment of the present invention;
FIG. 7 schematically illustrates a schematic diagram of a computer-readable storage medium provided in accordance with an embodiment of the present invention;
in the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present invention will be described with reference to a number of exemplary embodiments. It should be understood that these embodiments are given only for the purpose of enabling those skilled in the art to better understand and to implement the present invention, and do not limit the scope of the present invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As will be appreciated by one skilled in the art, embodiments of the present invention may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
According to the embodiment of the invention, a mouth shape correction method, a mouth shape correction device, a mouth shape correction medium and a computing device are provided.
Moreover, any number of elements in the drawings are by way of example and not by way of limitation, and any nomenclature is used solely for differentiation and not by way of limitation.
The principles and spirit of the present invention are explained in detail below with reference to several exemplary embodiments thereof.
Summary of The Invention
The existing mouth shape recognition and judgment is particularly dependent on a lip language recognition technology, a high-performance GPU and a larger storage memory are required to be used, and a deep learning algorithm is combined to score pronunciation contents.
The invention designs a mouth shape correction method that can determine from the key frames in a user's pronunciation video whether the user's pronunciation mouth shape is standard, and can then correct an incorrect pronunciation mouth shape. There is no need to score the pronunciation content with a deep learning algorithm as in lip-reading solutions, which require a high-performance GPU and a large amount of memory; resource consumption is thereby significantly reduced, the hardware limitation is overcome (the GPU and memory of a mobile terminal such as a mobile phone do not meet such requirements), the method is suitable for mobile terminals, and a better experience is brought to the user.
Having described the general principles of the invention, various non-limiting embodiments of the invention are described in detail below.
Application scene overview
Referring first to fig. 1, fig. 1 is a schematic view of an application scenario of the mouth shape correction method of the present invention. In fig. 1, a user performs spoken language learning through a terminal device A, which can display the content to be learned (such as phonetic symbols, words or sentences) on its screen, and can capture video and/or audio of the user pronouncing that content through data acquisition devices such as a camera (image acquisition device) and/or a microphone (audio acquisition device), so that the user's pronunciation mouth shape can be evaluated by the mouth shape correction method.
It is understood that the content may be downloaded by terminal A from a server, and that the data collected by terminal A may be analyzed (i.e., the mouth shape correction method may be performed) either by terminal A itself or by the server. In practice the server may have multiple tiers: a receiving server receives the video and/or audio data sent by the terminal device and forwards it to a processing server, and the processing server processes the received video data according to the mouth shape correction method of the present invention to obtain the user's mouth shape evaluation result and feeds it back to terminal device A for display.
Exemplary method
In the following, in connection with the application scenario of fig. 1, a method for mouth shape correction according to an exemplary embodiment of the present invention is described with reference to fig. 2. It should be noted that the above application scenarios are merely illustrative for the convenience of understanding the spirit and principles of the present invention, and the embodiments of the present invention are not limited in this respect. Rather, embodiments of the present invention may be applied to any scenario where applicable.
Fig. 2 is a schematic flow chart of an example of a mouth shape correction method according to the first aspect of the embodiments of the present invention. Although the present invention provides the method operation steps or apparatus structures shown in the following embodiments or figures, the method or apparatus may, through routine and non-inventive effort, include more or fewer operation steps or module units, or combinations thereof. For steps or structures with no logically necessary ordering, the execution order of the steps, or the module structure of the apparatus, is not limited to that shown in the embodiments or the drawings. When the described method or module structure is applied in an actual device, server or end product, it may be executed sequentially or in parallel according to the embodiments or the figures (for example, in an environment with parallel processors or multi-threaded processing, or even in a distributed-processing or server-cluster environment).
For clarity, the following embodiments are described in a specific implementation scenario in which a user performs mouth shape correction through a mobile terminal. The mobile terminal can comprise a mobile phone, a tablet computer or other general or special equipment with a video shooting function and a data communication function. The mobile terminal and the server may be deployed with corresponding application modules, such as a certain spoken language learning APP (application) installed in the mobile terminal, to implement corresponding data processing. However, those skilled in the art can understand that the spirit of the present solution can be applied to other implementation scenarios of mouth shape correction, for example, referring to fig. 3, after the mobile terminal collects data, the collected data is sent to the server for processing, and is fed back to the user through the mobile terminal.
In a specific embodiment, as shown in fig. 2, an embodiment of the mouth shape correction method provided by the present invention may include:
step S110, acquiring a key frame of a user pronunciation video;
considering that the captured video may include not only the user's pronunciation process but also some invalid video segments (for example, a preparation stage before the user pronounces), in one embodiment the captured video is first processed, before the key frame is acquired, to obtain the user pronunciation video segment within it. The valid video segment may be obtained by removing the invalid video, i.e., video that does not contain the user's pronunciation process, such as frames before the user opens the mouth to pronounce and frames after the mouth has closed again. Specifically, the method includes:
acquiring a video signal of the video;
and cutting the pronunciation video based on the fluctuation of the video signal, and removing the video frames which are not pronounced by the user to obtain an effective video segment.
In this embodiment, whether the current video is valid is determined from the fluctuation of the video signal: the smaller the fluctuation of the signal, the smaller the change in the video picture, and hence the smaller the probability that the video contains a picture of the user pronouncing. Whether the current video frame contains a pronunciation picture can therefore be determined by setting a reasonable threshold.
In one embodiment of the present invention, the fluctuation of the video signal is measured by z-score threshold matching: the mean of the signal is subtracted from the current signal and the result is divided by the standard deviation to obtain the z-score; the smaller the z-score, the smaller the fluctuation of the signal. A threshold is therefore preset, and if the z-score obtained from the current video signal is smaller than the preset threshold, it is determined that the current video frame does not contain a pronunciation picture and should be clipped.
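As an illustration of this step, the following Python sketch trims a video to its valid segment with a z-score threshold; it is a minimal example rather than the patent's reference implementation, and the frame-difference signal and threshold value are assumptions.

```python
import numpy as np

def valid_segment(frames, z_threshold=1.0):
    """Return (start, stop) frame indices of the valid (pronunciation) segment.

    frames: sequence of grayscale frames as numpy arrays. The per-frame signal
    is the mean absolute difference from the previous frame; frames whose
    z-score falls below the threshold are clipped.
    """
    diffs = np.array([
        np.abs(frames[i].astype(np.float32) - frames[i - 1].astype(np.float32)).mean()
        for i in range(1, len(frames))
    ])
    z = (diffs - diffs.mean()) / (diffs.std() + 1e-8)  # z-score of each frame-to-frame change
    active = np.where(z >= z_threshold)[0]             # indices judged to contain articulation
    if active.size == 0:
        return 0, len(frames)                          # nothing confidently detected; keep all
    # diff index i compares frames[i] and frames[i + 1]; the stop bound is exclusive
    return int(active[0]), int(active[-1]) + 2
```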
After obtaining the effective video segment, acquiring a key frame (of the user pronunciation video) in the effective video segment;
considering that the mouth shapes determining whether the pronunciation is correct or not are different when the pronunciation contents are different, it is difficult to acquire the key frames in the video in a uniform manner (standard), in an embodiment of the present embodiment, the extraction manner of the key frames is determined according to the category of the content of the pronunciation of the user, specifically, in the present embodiment, the category of the content of the pronunciation includes at least four categories;
when the content of the pronunciation belongs to the first class, acquiring the frame with the maximum mouth opening degree in the user pronunciation video as the key frame;
in this embodiment the pronunciation content is taken to be a phonetic symbol as an example. When the phonetic symbol is of the first class (MAX_HEIGHT), for example /a:/ (further examples appear as phonetic-symbol images in the original publication and are not reproduced here), the mouth shape at the moment the mouth reaches its maximum opening during pronunciation is a suitable criterion for judging whether the pronunciation mouth shape is correct, so the frame with the maximum mouth opening degree can be selected from the user pronunciation video as the key frame;
when the content of the pronunciation belongs to the second class, acquiring a frame of pronunciation pause in the user pronunciation video as the key frame;
some phonetic symbols, such as [e], are not pronounced with a wide-open mouth; in that case it is not appropriate to choose the frame with the greatest mouth opening as the key frame. Therefore, when the pronunciation content belongs to the second class (STANDTILL), a frame of pronunciation pause in the user pronunciation video can be chosen as the key frame;
when the pronunciation content belongs to the third class, acquiring the frame with the minimum mouth opening degree and the frame with the maximum mouth opening degree in the user pronunciation video as key frames, wherein the key frame with the minimum mouth opening degree precedes the key frame with the maximum mouth opening degree in time;
the pronunciation of some phonetic symbols is dynamic (the mouth opens from small to large), for example the plosives ([p], [b], [t], [d], [k], [g], etc.); such content belongs to the third class (MIN_MAX). Acquiring a single frame as the key frame would not allow an accurate judgment of whether the user's mouth shape is standard, so the frame with the minimum mouth opening degree and the frame with the maximum mouth opening degree are acquired from the user pronunciation video as key frames, the former preceding the latter in time;
and when the pronunciation content belongs to the fourth class, acquiring the frame with the maximum mouth opening degree during the first vowel and the frame with the minimum mouth opening degree during the second vowel in the user pronunciation video as key frames.
For some vowel phonetic symbols the mouth shape goes from large to small during pronunciation (the examples appear as phonetic-symbol images in the original publication and are not reproduced here), so key frames obtained in the above manners cannot accurately determine whether the pronunciation mouth shape is standard. When the pronunciation content belongs to the fourth class (MAX_MIN), the frame with the maximum mouth opening degree during the first vowel is acquired first, then the frame with the minimum mouth opening degree during the second vowel, and finally the two acquired frames are taken as the key frames.
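Putting the four classes together, the following sketch (an illustrative assumption, not text from the filing) dispatches on the category label, given a per-frame mouth-openness score such as the polygon area or opening height described in the next paragraph; pause_frame stands in for a hypothetical pause detector used in the STANDTILL case.

```python
import numpy as np

def extract_key_frames(frames, openness, category, pause_frame=None):
    """frames: valid-segment frames; openness: per-frame mouth-opening scores;
    category: 'MAX_HEIGHT', 'STANDTILL', 'MIN_MAX' or 'MAX_MIN'."""
    o = np.asarray(openness, dtype=np.float64)
    if category == 'MAX_HEIGHT':                 # first class: widest opening
        return [frames[int(o.argmax())]]
    if category == 'STANDTILL':                  # second class: pronunciation pause
        return [frames[pause_frame(o)]]          # pause_frame: hypothetical helper
    if category == 'MIN_MAX':                    # third class: minimum before maximum
        i_max = int(o.argmax())
        i_min = int(o[:i_max].argmin()) if i_max > 0 else 0
        return [frames[i_min], frames[i_max]]
    if category == 'MAX_MIN':                    # fourth class: maximum (first vowel) then minimum (second vowel)
        i_max = int(o.argmax())
        i_min = i_max + int(o[i_max:].argmin())
        return [frames[i_max], frames[i_min]]
    raise ValueError(f"unknown category: {category}")
```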
It should be noted that, in one embodiment of the present invention, the mouth opening degree may be determined from the area of the mouth region: a polygon is constructed from the key points of the mouth region contour in the key frame, the area of the polygon is calculated, and that area is taken as the area of the mouth region. Alternatively, the mouth opening degree may be determined from the mouth opening height, for example by calculating the distance between the highest and lowest key points of the mouth region contour in the key frame, where the highest and lowest key points are taken consistently either both from the inner contour or both from the outer contour.
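Both measures are straightforward to compute from the landmarks. The sketch below assumes the common 68-point landmark layout (an assumption consistent with the 68 landmarks mentioned later), in which points 48-59 form the outer lip contour and points 60-67 the inner lip contour.

```python
import numpy as np

def polygon_area(points):
    """Shoelace formula: area of a polygon given its (x, y) vertices in order."""
    x, y = points[:, 0], points[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1)))

def mouth_openness(landmarks68):
    """Two openness measures from 68-point face landmarks."""
    inner = landmarks68[60:68]                                   # 8 inner-contour points
    area = polygon_area(inner)                                   # openness as inner-mouth area
    height = np.linalg.norm(landmarks68[62] - landmarks68[66])   # top vs. bottom inner-lip point
    return area, height
```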
After acquiring the key frame in the user pronunciation video, executing step S120 to extract the mouth shape feature of the key frame;
in this embodiment, the mouth shape features may be obtained from several different dimensions, such as the geometric shape of the mouth. In one embodiment, the shape features of the mouth in the key frame are extracted as the mouth shape features of the key frame: first, the key points of the mouth region contour in the key frame are obtained; then the angles of the interior angles of a polygon constructed from those key points; finally, the angles of the interior angles are encoded according to a preset rule to obtain the shape feature of the mouth in the key frame.
In an embodiment of the present invention, when key points of the mouth region contour in the key frame are obtained, face detection may be performed on the key frame to obtain a bounding box of the face;
in this embodiment, face detection may be performed on the image using the HOG (Histogram of Oriented Gradients) feature to obtain the bounding box of the face: an image descriptor is first constructed by combining the local gradients and gradient magnitudes of the key frame image; a sliding window is then used, based on the image descriptor, to judge whether the image inside the window is a face region.
Specifically, based on the bounding box, a gradient-boosted ensemble-of-regression-trees algorithm (an algorithm with excellent processing speed on mobile terminal devices) may be used to perform key point detection on the face region to obtain the key points of the mouth region contour. The key points are taken from the 68 landmarks commonly annotated on human faces, which include the top of the chin, the outer contour of each eye, the inner contour of each eyebrow, and so on.
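The combination just described (HOG sliding-window face detection plus a gradient-boosted ensemble of regression trees for the 68 landmarks) is what the dlib library ships, so a dlib-based sketch is a natural illustration; the patent does not name a library, and the model file name follows dlib's pretrained distribution, which must be downloaded separately.

```python
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()       # HOG descriptor + sliding window
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def mouth_contour_points(gray_frame):
    """Return the 20 mouth-contour key points (12 outer + 8 inner) of the first detected face."""
    faces = detector(gray_frame)                  # face bounding boxes
    if not faces:
        return None
    shape = predictor(gray_frame, faces[0])       # 68 facial landmarks
    pts = np.array([[p.x, p.y] for p in shape.parts()])
    return pts[48:68]                             # mouth region: outer 48-59, inner 60-67
```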
As can be seen from the above steps, the key points of the mouth region contour obtained in this embodiment include key points of both the inner and outer contours (as shown in fig. 4). In this embodiment there are 20 mouth-region key points: 8 on the inner contour and 12 on the outer contour. When selecting key points to construct a polygon, the inner and outer contours may be used separately; for example, the inner-contour key points alone may be used to construct a polygon (as shown in fig. 5), and of course the outer-contour key points alone may be used as well. This embodiment takes the case of constructing a polygon from the inner-contour key points of the mouth region as an example. After the polygon is obtained, the angles of its interior angles are obtained and then encoded according to the preset rule to give the mouth shape feature of the key frame. For example, in fig. 5 the angles of the 8 interior angles of the polygon are 60°, 170°, 175°, 160°, 45°, 160°, 165° and 145°, so the mouth shape feature of the key frame may be encoded as λ = (60, 170, 175, 160, 45, 160, 165, 145);
it is to be understood that, in an embodiment of the present invention, polygons may also be constructed based on the key points of the inner and outer contours, and then features obtained by encoding the inner angles of the two polygons are combined to serve as the mouth shape features of the key frame.
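The interior-angle encoding itself is simple geometry; the following sketch (illustrative, not from the filing) computes the angle at each vertex of the inner-lip polygon:

```python
import numpy as np

def interior_angles(polygon):
    """polygon: (N, 2) array of vertices in order; returns the N interior angles in degrees."""
    n = len(polygon)
    angles = []
    for i in range(n):
        a, b, c = polygon[i - 1], polygon[i], polygon[(i + 1) % n]
        v1, v2 = a - b, c - b                     # the two edges meeting at vertex b
        cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
        angles.append(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
    return np.array(angles)                       # e.g. encode as lambda = tuple(round(a) for a in angles)
```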
Optionally, in one embodiment of the present invention, the histogram of oriented gradients and the color histogram of the mouth region in the key frame are obtained as the features of the key frame. Specifically, the color distribution within a rectangular mouth region may be counted as a feature of the key frame, and mouth shape classification then performed with the color-distribution histogram; this also effectively distinguishes whether the teeth are exposed.
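As an illustration of the color-histogram variant (the crop, color space and bin count are assumptions for the example), a hue histogram over the mouth's bounding rectangle can be computed with OpenCV:

```python
import cv2
import numpy as np

def mouth_color_histogram(frame_bgr, mouth_pts, bins=16):
    """Normalized hue histogram of the rectangular mouth region."""
    x, y, w, h = cv2.boundingRect(mouth_pts.astype(np.int32))   # rectangular mouth region
    roi = frame_bgr[y:y + h, x:x + w]
    hsv = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0], None, [bins], [0, 180])     # hue channel only
    return cv2.normalize(hist, hist).flatten()                  # scale-invariant feature vector
```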
After the mouth shape features of the key frames are obtained, step S130 may be executed to classify the key frames based on the mouth shape features;
in one example of this embodiment, the mouth shape categories may first be defined as follows: ENLARGE (the user's mouth needs to open wider), MIDDLE (the user's mouth opening should be moderate), SMALL (the user's mouth needs to open less), ROUND (the user's mouth needs to be rounded) and FLAT (the user's mouth needs to open slightly more and be stretched flat). In this step it is first determined, according to the mouth shape features of the key frame, which of the above categories the key frame of the user pronunciation video belongs to. In one embodiment, a pre-trained mouth shape classifier may be used to classify the key frame, where the mouth shape classifier is constructed based on the following:
performing feature dimension reduction on the training data set;
in this embodiment, feature dimensionality reduction may be performed using one of principal component analysis, linear discriminant analysis, and local linear embedding;
considering that principal component analysis is mature and performs well in this field, in one embodiment of the present embodiment it is preferentially adopted for feature dimension reduction on the training data set;
training based on the reduced-dimension low-dimension spatial features to obtain the mouth shape classifier;
in this embodiment, the mouth shape classifier may be trained using one of a support vector machine, random forest, and extreme gradient boosting.
Considering that the support vector machine is mature and easy to use, in one embodiment of the present embodiment the mouth shape classifier is preferably trained using a support vector machine.
After training is finished, the mouth shape classifier is tested with a corresponding test data set to determine its accuracy. The test data set is distinct from the training data set, and each mouth shape category contains mouth shape features of different faces in different poses.
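A minimal sketch of this construction, assuming scikit-learn (the component count and SVM settings are illustrative, not values from the filing):

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

CLASSES = ['ENLARGE', 'MIDDLE', 'SMALL', 'ROUND', 'FLAT']

def build_mouth_classifier(X_train, y_train):
    """X_train: (n_samples, n_features) mouth shape features; y_train: labels from CLASSES."""
    clf = make_pipeline(
        PCA(n_components=16),                  # feature dimension reduction
        SVC(kernel='rbf', probability=True),   # probability=True enables per-class scores
    )
    clf.fit(X_train, y_train)
    return clf

# Accuracy is then checked on a held-out test set:
# accuracy = clf.score(X_test, y_test)
```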
After determining the (mouth shape) category to which the key frame of the user pronunciation video belongs, executing step S140 to confirm whether the classification result is consistent with the category of the key frame of the standard pronunciation video;
and if the two are inconsistent, giving a corresponding prompt according to the category of the key frame of the standard pronunciation video.
In this embodiment, if the (mouth shape) category of the key frame of the user pronunciation video is consistent with the category of the key frame of the standard pronunciation video, no mouth shape correction prompt need be given; an encouraging prompt such as "your mouth shape is very standard" may be given instead. It should be understood that when classifying the key frame of the user pronunciation video, the mouth shape classifier does not give a single definite judgment for one mouth shape type; rather, it determines the probability that the mouth shape feature of the key frame belongs to each mouth shape type. For example, the classifier may determine that the probability that the mouth shape feature of the key frame belongs to ENLARGE (the user's mouth needs to open wider) is 80%, to MIDDLE (moderate opening) 40%, to FLAT (open slightly more and stretch flat) 50%, and to the other mouth shapes 10%, 15% and 20% respectively; it can then be determined that the mouth shape feature of the key frame belongs to ENLARGE, and this probability can be reported to the user as the score of the user's mouth shape.
In addition, in one embodiment of the present invention, it may not be enough simply to select the mouth shape type with the highest probability as the category of the mouth shape feature of the key frame; the probability of the most probable mouth shape type may also be required to exceed a threshold (for example 80%). In some cases, although a certain mouth shape type has the highest probability, it does not reach a decidable standard: for example, if the probability that the mouth shape feature of the key frame belongs to ENLARGE (the user's mouth needs to open wider) is 40%, this probability, although the highest, is still not sufficient to decide that the feature belongs to ENLARGE, and the user can instead be prompted to learn and try again.
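The decision rule just described can be sketched as follows (the prompt strings and the 80% threshold are illustrative; clf is the pipeline from the previous sketch):

```python
import numpy as np

def judge_mouth_shape(clf, feature, standard_category, threshold=0.8):
    """Compare the classified user mouth shape with the standard video's category."""
    proba = clf.predict_proba([feature])[0]        # per-class probabilities
    top = int(np.argmax(proba))
    if proba[top] < threshold:                     # highest probability not decidable
        return "Please practise this sound again."
    if clf.classes_[top] == standard_category:     # consistent with the standard key frame
        return "Your mouth shape is very standard."
    return f"Hint: adjust your mouth shape ({standard_category})."
```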
According to the mouth shape correction method provided by the embodiments of the present invention, whether the user's pronunciation mouth shape is standard can be determined from the key frames in the user's pronunciation video, and an incorrect pronunciation mouth shape can then be corrected. There is no need to score the pronunciation content with a deep learning algorithm as in lip-reading solutions, which require a high-performance GPU and a large amount of memory. Resource consumption is thereby significantly reduced and the hardware limitation is overcome (the GPU and memory of a mobile terminal such as a mobile phone do not meet such requirements), so the method is suitable for mobile terminals (the scheme needs only a CPU for computation and a small memory, and can produce results in real time), bringing a better experience to the user.
Exemplary devices
Having described the method of an exemplary embodiment of the present invention, next, a description is given of a mouth shape correcting apparatus of an exemplary embodiment of the present invention with reference to fig. 6, the apparatus including:
a key frame acquisition module 610 configured to acquire key frames of a user pronunciation video;
a feature extraction module 620 configured to extract mouth shape features of the key frames;
a classification module 630 configured to classify the keyframes based on the mouth-shape features;
a judging module 640 configured to confirm whether the classification result is consistent with the classification of the key frame of the standard pronunciation video;
and the prompt module 650 is configured to give a corresponding prompt according to the category of the key frame of the standard pronunciation video if the two are inconsistent.
In an embodiment of the present invention, the key frame acquiring module 610 is further configured to determine an extraction manner of the key frame according to a category of content uttered by the user.
In one embodiment of the present invention, the content of the pronunciation of the user is the pronunciation content displayed on the screen of the terminal device.
In one embodiment of the present invention, the content of the pronunciation of the user is a phonetic symbol.
In one embodiment of this embodiment, the categories of the content of the pronunciation include at least four categories;
when the content of the pronunciation belongs to the first class, acquiring the frame with the maximum mouth opening degree in the user pronunciation video as the key frame;
when the content of the pronunciation belongs to the second class, acquiring a frame of pronunciation pause in the user pronunciation video as the key frame;
when the pronunciation content belongs to the third class, acquiring the frame with the minimum mouth opening degree and the frame with the maximum mouth opening degree in the user pronunciation video as key frames, wherein the key frame with the minimum mouth opening degree precedes the key frame with the maximum mouth opening degree in time;
and when the pronunciation content belongs to the fourth class, acquiring the frame with the maximum mouth opening degree during the first vowel and the frame with the minimum mouth opening degree during the second vowel in the user pronunciation video as key frames.
In an embodiment of this embodiment, the feature extraction module 620 includes:
a shape feature extraction unit configured to extract shape features of the mouth shape in the key frame.
In one embodiment of the present embodiment, the shape feature extraction unit includes:
a key point acquisition subunit configured to acquire key points of the mouth region contour in the key frame;
an angle acquisition subunit configured to acquire angles of respective internal angles of a polygon constructed based on the key points;
and the shape characteristic obtaining subunit is configured to encode the angle of each internal angle according to a preset rule to obtain the shape characteristic of the mouth shape in the key frame.
In an embodiment of the present invention, the key points of the contour of the mouth region are key points of the contour in the mouth region.
In an embodiment of this embodiment, the key point obtaining subunit is further configured to perform face detection on the key frame to obtain a bounding box of a face; and
perform key point detection on the face region using a gradient-boosted ensemble-of-regression-trees algorithm based on the bounding box, to obtain the key points of the mouth region contour.
In an embodiment of the present invention, performing face detection on the key frame to obtain a bounding box of a face includes:
constructing an image descriptor by combining the local gradient and the gradient strength of the key frame image;
and judging whether the image in the window is a human face region or not by adopting a sliding window based on the image descriptor.
In an embodiment of this embodiment, the feature extraction module 620 is further configured to obtain a histogram of directional gradients and a color histogram of the mouth region in the key frame as the feature of the key frame.
In an embodiment of this embodiment, the classification module 630 is further configured to classify the key frames by using a pre-trained mouth shape classifier.
In one embodiment of this embodiment, the mouth shape classifier is constructed based on:
performing feature dimension reduction on the training data set;
and training based on the reduced-dimension low-dimension spatial features to obtain the mouth shape classifier.
In one embodiment of the present embodiment, feature dimensionality reduction is performed using one of principal component analysis, linear discriminant analysis, and local linear embedding.
In one embodiment of this embodiment, the mouth shape classifier is trained using one of support vector machine, random forest, and extreme gradient boosting.
In an embodiment of this embodiment, the training data set comprises mouth shape features of different faces in different poses.
Exemplary Medium
Having described the method and apparatus of the exemplary embodiments of the present invention, a computer-readable storage medium of the exemplary embodiments is described with reference to fig. 7, which illustrates a computer-readable storage medium, an optical disc 70, having a computer program (i.e., a program product) stored thereon. When executed by a processor, the program implements the steps described in the above method embodiments, for example: acquiring key frames of a user pronunciation video; extracting mouth shape features of the key frames; classifying the key frames based on the mouth shape features; confirming whether the classification result is consistent with the category of the key frame of the standard pronunciation video; and if they are inconsistent, giving a corresponding prompt according to the category of the key frame of the standard pronunciation video. The specific implementation of each step is not repeated here.
It should be noted that examples of the computer-readable storage medium may also include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory, or other optical and magnetic storage media, which are not described in detail herein.
Exemplary computing device
Having described the methods, apparatus and media of exemplary embodiments of the present invention, computing devices of exemplary embodiments are now described; a computing device may be a computer system or a server, and the block diagram shows an exemplary computing device suitable for implementing embodiments of the invention. The computing device shown is only an example and should not limit the scope of use or functionality of embodiments of the invention.
Components of the computing device may include, but are not limited to: one or more processors or processing units, a system memory, and a bus connecting the various system components (including the system memory and the processing units).
The computing device typically includes a variety of computer system readable media. Such media may be any available media that is accessible by a computing device and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) and/or cache memory. The computing device may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, a storage system may be provided for reading from and writing to non-removable, non-volatile magnetic media (not shown, commonly called a "hard drive"). Although not shown, a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to the bus by one or more data media interfaces. The system memory may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the invention.
A program/utility having a set (at least one) of program modules may be stored, for example, in system memory, and such program modules include, but are not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment. The program modules generally perform the functions and/or methodologies of the described embodiments of the invention.
The computing device may also communicate with one or more external devices (e.g., keyboard, pointing device, display, etc.). Such communication may occur through an input/output (I/O) interface. The computing device may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) through a network adapter. The network adapter communicates with the other modules of the computing device (e.g., the processing unit) via the bus. It should be appreciated that, although not shown in the figures, other hardware and/or software modules may be used in conjunction with the computing device.
The processing unit executes various functional applications and data processing by running programs stored in the system memory, for example: acquiring key frames of a user pronunciation video; extracting mouth shape features of the key frames; classifying the key frames based on the mouth shape features; confirming whether the classification result is consistent with the category of the key frame of the standard pronunciation video; and if they are inconsistent, giving a corresponding prompt according to the category of the key frame of the standard pronunciation video. The specific implementation of each step is not repeated here. It should be noted that, although several units/modules or sub-units/sub-modules of the mouth shape correction device are mentioned in the detailed description above, this division is merely exemplary and not mandatory. Indeed, according to embodiments of the invention, the features and functionality of two or more of the units/modules described above may be embodied in one unit/module; conversely, the features and functions of one unit/module described above may be further divided into a plurality of units/modules to be embodied.
Further, while the operations of the method of the invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
While the spirit and principles of the invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the disclosed embodiments. The division into aspects is for convenience of description only, and features in these aspects may be combined to advantage. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
Through the above description, the embodiments of the present invention provide the following technical solutions, but are not limited thereto:
1. a method of mouth shape correction comprising:
acquiring a key frame of a user pronunciation video;
extracting mouth shape features of the key frames;
classifying the keyframes based on the mouth-shape features;
confirming whether the classification result is consistent with the category of the key frame of the standard pronunciation video;
and if the two are inconsistent, giving a corresponding prompt according to the category of the key frame of the standard pronunciation video.
2. The method according to scheme 1, wherein the extraction manner of the key frame is determined according to the category of the content of the user's pronunciation.
3. The method according to scheme 2, wherein the content of the user's pronunciation is the pronunciation content displayed on the screen of the terminal device.
4. The method according to scheme 2, wherein the content of the user's pronunciation is a phonetic symbol.
5. The method of any one of schemes 1-4, wherein the categories of the content of the pronunciation include at least four categories;
when the content of the pronunciation belongs to the first class, acquiring the frame with the maximum mouth opening degree in the user pronunciation video as the key frame;
when the content of the pronunciation belongs to the second class, acquiring a frame of pronunciation pause in the user pronunciation video as the key frame;
when the pronunciation content belongs to the third class, acquiring the frame with the minimum mouth opening degree and the frame with the maximum mouth opening degree in the user pronunciation video as key frames, wherein the key frame with the minimum mouth opening degree precedes the key frame with the maximum mouth opening degree in time;
and when the pronunciation content belongs to the fourth class, acquiring the frame with the maximum mouth opening degree during the first vowel and the frame with the minimum mouth opening degree during the second vowel in the user pronunciation video as key frames.
6. The method of scheme 1, wherein extracting the mouth shape feature of the key frame comprises:
and extracting the shape characteristics of the mouth shape in the key frame.
7. The method of scheme 6, wherein extracting shape features of the mouth shape in the key frame comprises:
acquiring key points of the outline of the mouth region in the key frame;
obtaining angles of all internal angles of a polygon constructed based on the key points;
and coding the angle of each internal angle according to a preset rule to obtain the shape characteristic of the mouth in the key frame.
8. The method of scheme 7, wherein the key points of the mouth region contour are key points of the contour within the mouth region.
9. The method according to scheme 7 or 8, wherein acquiring the key points of the mouth region contour in the key frame includes:
performing face detection on the key frame to obtain a bounding box of the face;
and performing key point detection on the face region using a gradient-boosted ensemble-of-regression-trees algorithm based on the bounding box, to obtain the key points of the mouth region contour.
10. The method according to scheme 9, wherein performing face detection on the key frame to obtain a bounding box of a face includes:
constructing an image descriptor by combining the local gradient and the gradient strength of the key frame image;
and judging whether the image in the window is a human face region or not by adopting a sliding window based on the image descriptor.
11. The method of scheme 1, wherein extracting the mouth shape feature of the key frame comprises:
and acquiring a directional gradient histogram and a color histogram of the mouth region in the key frame as the characteristics of the key frame.
12. The method of scheme 1, wherein the key frames are classified using a pre-trained mouth shape classifier.
13. The method of scheme 12, wherein the mouth shape classifier is constructed based on:
performing feature dimension reduction on the training data set;
and training based on the reduced-dimension low-dimension spatial features to obtain the mouth shape classifier.
14. The method of scheme 13, wherein feature dimensionality reduction is performed using one of principal component analysis, linear discriminant analysis, and local linear embedding.
15. The method of scheme 13, wherein the mouth shape classifier is trained using one of a support vector machine, a random forest, and extreme gradient boosting.
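One combination permitted by schemes 13-15, PCA for dimensionality reduction followed by a support vector machine, sketched with scikit-learn; the feature matrix X and label vector y are assumed to exist:

    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    def train_mouth_classifier(X, y, n_components=50):
        # reduce the mouth shape features to a low-dimensional space,
        # then fit the mouth shape classifier on the reduced features
        clf = make_pipeline(PCA(n_components=n_components), SVC(kernel="rbf"))
        clf.fit(X, y)
        return clf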
16. The method of scheme 13, wherein the training data set includes mouth shape features of different faces in different poses.
17. An apparatus for mouth shape correction, comprising:
a key frame acquisition module configured to acquire key frames of a user pronunciation video;
a feature extraction module configured to extract mouth shape features of the key frames;
a classification module configured to classify the key frames based on the mouth shape features;
a judging module configured to confirm whether the classification result is consistent with the category of the key frame of the standard pronunciation video;
and a prompting module configured to, if the two are inconsistent, provide a corresponding prompt according to the category of the key frame of the standard pronunciation video.
18. The apparatus of scheme 17, wherein the key frame acquisition module is further configured to determine the key frame extraction manner according to the category of the content uttered by the user.
19. The apparatus according to scheme 18, wherein the content of the user's pronunciation is the pronunciation content displayed on the screen of the terminal device.
20. The apparatus of scheme 18, wherein the content of the user's pronunciation is a phonetic symbol.
21. The apparatus according to any of schemes 17-20, wherein the categories of pronunciation content include at least four categories;
when the pronunciation content belongs to the first category, the frame with the maximum mouth opening degree while the user's mouth is open is acquired from the user pronunciation video as the key frame;
when the pronunciation content belongs to the second category, the frame at the pronunciation pause in the user pronunciation video is acquired as the key frame;
when the pronunciation content belongs to the third category, the frame with the minimum mouth opening degree and the frame with the maximum mouth opening degree in the user pronunciation video are acquired as key frames, the minimum-opening key frame temporally preceding the maximum-opening key frame;
and when the pronunciation content belongs to the fourth category, the frame with the maximum mouth opening degree during the first vowel and the frame with the minimum mouth opening degree during the later vowel in the user pronunciation video are acquired as key frames.
22. The apparatus of scheme 17, wherein the feature extraction module comprises:
a shape feature extraction unit configured to extract shape features of the mouth shape in the key frame.
23. The apparatus of scheme 22, wherein the shape feature extraction unit comprises:
a key point acquisition subunit configured to acquire key points of the mouth region contour in the key frame;
an angle acquisition subunit configured to obtain the angle of each internal angle of a polygon constructed from the key points;
and a shape feature acquisition subunit configured to encode the angle of each internal angle according to a preset rule to obtain the shape features of the mouth shape in the key frame.
24. The apparatus of scheme 23, wherein the key points of the mouth region contour are key points of the inner contour of the mouth region.
25. The apparatus according to scheme 23 or 24, wherein the key point acquisition subunit is further configured to perform face detection on the key frame to obtain a bounding box of the face,
and to perform key point detection on the face region based on the bounding box, using a gradient-boosted ensemble of regression trees, to obtain the key points of the mouth region contour.
26. The apparatus according to scheme 25, wherein performing face detection on the key frame to obtain the bounding box of the face comprises:
constructing an image descriptor by combining the local gradients and gradient magnitudes of the key frame image;
and sliding a window over the image and judging, based on the image descriptor, whether the image within the window is a face region.
27. The apparatus of scheme 17, wherein the feature extraction module is further configured to acquire a histogram of oriented gradients and a color histogram of the mouth region in the key frame as the features of the key frame.
28. The apparatus of scheme 17, wherein the classification module is further configured to classify the key frames using a pre-trained mouth shape classifier.
29. The apparatus of scheme 28, wherein the mouth shape classifier is constructed based on:
performing feature dimensionality reduction on the training data set;
and training on the resulting low-dimensional features to obtain the mouth shape classifier.
30. The apparatus of scheme 29, wherein the feature dimensionality reduction is performed using one of principal component analysis, linear discriminant analysis, and locally linear embedding.
31. The apparatus of scheme 29, wherein the mouth shape classifier is trained using one of a support vector machine, a random forest, and extreme gradient boosting.
32. The apparatus of scheme 29, wherein the training data set includes mouth shape features of different faces in different poses.
33. A computer-readable storage medium storing program code which, when executed by a processor, implements a method as in one of schemes 1-16.
34. A computing device comprising a processor and a storage medium storing program code that, when executed by the processor, implements a method as in one of schemes 1-16.

Claims (10)

1. A method of mouth shape correction comprising:
acquiring a key frame of a user pronunciation video;
extracting mouth shape features of the key frames;
classifying the key frames based on the mouth shape features;
confirming whether the classification result is consistent with the category of the key frame of the standard pronunciation video;
and if the two are inconsistent, providing a corresponding prompt according to the category of the key frame of the standard pronunciation video.
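For orientation only, a sketch of the end-to-end flow of claim 1, reusing the hypothetical helpers from the scheme sketches above; extract_mouth_features and the prompt text are assumptions:

    def correct_mouth_shape(user_frames, openings, category, standard_category, classifier):
        # 1. acquire the key frame(s) of the user pronunciation video
        key_indices = select_key_frames(openings, category)
        for idx in key_indices:
            # 2-3. extract mouth shape features and classify the key frame
            features = extract_mouth_features(user_frames[idx])  # hypothetical helper
            predicted = classifier.predict([features])[0]
            # 4-5. compare with the standard video's key frame category and prompt
            if predicted != standard_category:
                print("Adjust your mouth shape towards category:", standard_category)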
2. The method of claim 1, wherein the key frame extraction manner is determined according to the category of the content uttered by the user.
3. The method of claim 2, wherein the content uttered by the user is the pronunciation content displayed on the screen of the terminal device.
4. The method of claim 2, wherein the content uttered by the user is a phonetic symbol.
5. An apparatus for mouth shape correction, comprising:
a key frame acquisition module configured to acquire key frames of a user pronunciation video;
a feature extraction module configured to extract mouth shape features of the key frames;
a classification module configured to classify the key frames based on the mouth shape features;
a judging module configured to confirm whether the classification result is consistent with the category of the key frame of the standard pronunciation video;
and a prompting module configured to, if the two are inconsistent, provide a corresponding prompt according to the category of the key frame of the standard pronunciation video.
6. The apparatus of claim 5, wherein the key frame acquisition module is further configured to determine the key frame extraction manner according to the category of the content uttered by the user.
7. The apparatus of claim 6, wherein the content uttered by the user is the pronunciation content displayed on the screen of the terminal device.
8. The apparatus of claim 6, wherein the content uttered by the user is a phonetic symbol.
9. A computer-readable storage medium storing program code which, when executed by a processor, implements a method according to one of claims 1 to 4.
10. A computing device comprising a processor and a storage medium storing program code which, when executed by the processor, implements the method of one of claims 1 to 4.
CN201910405361.7A 2019-05-16 2019-05-16 Mouth shape correcting method, mouth shape correcting device, mouth shape correcting medium and computing equipment Pending CN111950327A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910405361.7A CN111950327A (en) 2019-05-16 2019-05-16 Mouth shape correcting method, mouth shape correcting device, mouth shape correcting medium and computing equipment

Publications (1)

Publication Number Publication Date
CN111950327A true CN111950327A (en) 2020-11-17

Family

ID=73335472

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910405361.7A Pending CN111950327A (en) 2019-05-16 2019-05-16 Mouth shape correcting method, mouth shape correcting device, mouth shape correcting medium and computing equipment

Country Status (1)

Country Link
CN (1) CN111950327A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101520903A (en) * 2009-04-23 2009-09-02 北京水晶石数字科技有限公司 Method for matching Chinese mouth shape of cartoon role
CN103092329A (en) * 2011-10-31 2013-05-08 南开大学 Lip reading technology based lip language input method
CN104808794A (en) * 2015-04-24 2015-07-29 北京旷视科技有限公司 Method and system for inputting lip language
CN105070118A (en) * 2015-07-30 2015-11-18 广东小天才科技有限公司 Pronunciation correcting method and device for language learning
CN106997451A (en) * 2016-01-26 2017-08-01 北方工业大学 Lip contour positioning method
CN109697976A (en) * 2018-12-14 2019-04-30 北京葡萄智学科技有限公司 A kind of pronunciation recognition methods and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HOU Yarong, XIONG Zhang: "Research on Automatic Recognition and Verification of Lip Synchronization", Computer Engineering and Design, no. 02, 28 February 2004 (2004-02-28) *
SHAN Wei, YAO Hongxun, GAO Wen: "Classification of Mouth-Shape Sequences in Lipreading", Journal of Chinese Information Processing, no. 01, 25 January 2002 (2002-01-25) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112614489A (en) * 2020-12-22 2021-04-06 作业帮教育科技(北京)有限公司 User pronunciation accuracy evaluation method and device and electronic equipment
CN112949554A (en) * 2021-03-22 2021-06-11 湖南中凯智创科技有限公司 Intelligent children accompanying education robot
CN112949554B (en) * 2021-03-22 2022-02-08 湖南中凯智创科技有限公司 Intelligent children accompanying education robot
CN114664132B (en) * 2022-04-05 2024-04-30 苏州市立医院 Language rehabilitation training device and method

Similar Documents

Publication Publication Date Title
CN111738251B (en) Optical character recognition method and device fused with language model and electronic equipment
CN111160533B (en) Neural network acceleration method based on cross-resolution knowledge distillation
US20180197547A1 (en) Identity verification method and apparatus based on voiceprint
CN111339913A (en) Method and device for recognizing emotion of character in video
CN110970018B (en) Speech recognition method and device
CN111723791A (en) Character error correction method, device, equipment and storage medium
CN110717492B (en) Method for correcting direction of character string in drawing based on joint features
CN111950327A (en) Mouth shape correcting method, mouth shape correcting device, mouth shape correcting medium and computing equipment
CN111951828B (en) Pronunciation assessment method, device, system, medium and computing equipment
US10255487B2 (en) Emotion estimation apparatus using facial images of target individual, emotion estimation method, and non-transitory computer readable medium
CN111951825A (en) Pronunciation evaluation method, medium, device and computing equipment
CN110111778B (en) Voice processing method and device, storage medium and electronic equipment
US7702145B2 (en) Adapting a neural network for individual style
Sterpu et al. Towards lipreading sentences with active appearance models
CN111191073A (en) Video and audio recognition method, device, storage medium and device
US10592733B1 (en) Computer-implemented systems and methods for evaluating speech dialog system engagement via video
KR20170081350A (en) Text Interpretation Apparatus and Method for Performing Text Recognition and Translation Per Frame Length Unit of Image
CN109766419A (en) Products Show method, apparatus, equipment and storage medium based on speech analysis
CN114639150A (en) Emotion recognition method and device, computer equipment and storage medium
CN113283327A (en) Video text generation method, device, equipment and storage medium
WO2021196390A1 (en) Voiceprint data generation method and device, and computer device and storage medium
CN114357206A (en) Education video color subtitle generation method and system based on semantic analysis
CN115423908A (en) Virtual face generation method, device, equipment and readable storage medium
US20220012520A1 (en) Electronic device and control method therefor
CN115312030A (en) Display control method and device of virtual role and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination