WO2017088727A1 - Image processing method and apparatus - Google Patents

Image processing method and apparatus Download PDF

Info

Publication number
WO2017088727A1
Authority
WO
WIPO (PCT)
Prior art keywords
mouth
frame
state
face
facial features
Prior art date
Application number
PCT/CN2016/106752
Other languages
French (fr)
Chinese (zh)
Inventor
汪铖杰
Original Assignee
Tencent Technology (Shenzhen) Company Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Company Limited
Publication of WO2017088727A1 publication Critical patent/WO2017088727A1/en
Priority to US15/680,976 priority Critical patent/US10360441B2/en

Links

Images

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161: Detection; Localisation; Normalisation
    • G06V40/165: Detection; Localisation; Normalisation using facial parts and geometric relationships
    • G06V40/172: Classification, e.g. identification

Definitions

  • the present application relates to the field of communications technologies, and in particular, to an image processing method and apparatus.
  • Facial recognition is also known as face recognition, face-image recognition, and countenance recognition.
  • Compared with fingerprint scanning or iris recognition, facial recognition is convenient to use, intuitive, highly accurate, and difficult to counterfeit, and is therefore more readily accepted by users.
  • The embodiments of the present application provide an image processing method and apparatus, which can improve recognition accuracy and the recognition effect.
  • An embodiment of the present application provides an image processing method, including: acquiring video data; extracting frames having facial features from the video data; determining a mouth position from the frames to obtain a mouth image; analyzing the mouth image to obtain mouth features; identifying a mouth state according to the mouth features by using a preset rule; and identifying a mouth motion of the corresponding face in the video data based on the identified mouth state.
  • an image processing apparatus including:
  • An acquiring unit configured to acquire video data, and extract a frame having facial features from the video data
  • a determining unit configured to determine a mouth position from the frames to obtain a mouth image
  • An analyzing unit configured to analyze the mouth image to obtain a mouth feature
  • An identifier unit configured to identify a mouth state according to the mouth feature by using a preset rule
  • an identifying unit configured to identify a mouth motion of the corresponding face in the video data based on the identified mouth state.
  • In the embodiments of the present application, after video data is acquired, frames having facial features are extracted from the video data; the mouth position is then determined from the extracted frames to obtain a mouth image, which is analyzed to obtain mouth features; finally, using a preset rule, the mouth state is identified according to the mouth features as the basis for judging whether the mouth is moving, thereby recognizing mouth motion. Because this scheme depends little on the precision of facial-feature key-point positioning, it is more stable than existing schemes: even if the face shakes in the video, the recognition result is not greatly affected. In short, the scheme can greatly improve recognition accuracy and the recognition effect.
  • FIG. 1 is a flowchart of an image processing method provided by an embodiment of the present application.
  • FIG. 2a is another flowchart of an image processing method provided by an embodiment of the present application.
  • FIG. 2b is a schematic diagram of a face-coordinate rectangular frame in an image processing method according to an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application.
  • FIG. 4 is another schematic structural diagram of an image processing apparatus according to an embodiment of the present application.
  • facial recognition is widely used.
  • facial recognition technology may be applied to data security, or facial recognition technology may be used for face capture and tracking.
  • In facial recognition, recognition of the mouth is one of the most important parts. For example, by judging whether a face in video data exhibits mouth motion, it is possible to infer the facial expression of the object or determine whether the object is talking, and so on.
  • In the existing art, a facial-feature key-point positioning technique is generally used: a plurality of points are used to locate the mouth in each frame of the video image sequence, the inner area of the mouth is then computed from these point coordinates, and finally the change of this area across frames is used to determine whether the face in the video exhibits mouth motion.
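The area-based prior scheme can be sketched as follows; a minimal Python illustration, where the shoelace-formula helper and the relative change threshold are assumptions for demonstration rather than values from the patent:

```python
def polygon_area(points):
    """Shoelace formula for the area enclosed by the located mouth
    key points (the inner-mouth region the prior scheme measures)."""
    area = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0

def area_change_indicates_motion(areas, rel_threshold=0.3):
    """The prior scheme reports mouth motion when the per-frame
    inner-mouth area changes enough; the threshold is illustrative."""
    lo, hi = min(areas), max(areas)
    return hi > 0 and (hi - lo) / hi > rel_threshold
```

As the description notes, this approach degrades when key-point positioning is imprecise, for example under face shake, since small coordinate errors directly perturb the computed area.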
  • the embodiment of the present application provides an image processing method and apparatus. The details will be described separately below.
  • the image processing apparatus may be specifically integrated in a device such as a terminal or a server.
  • The terminal may include a mobile phone, a tablet computer, a notebook computer, or a personal computer (PC).
  • An image processing method includes: acquiring video data, and extracting frames having facial features from the video data; determining a mouth position from each frame to obtain a mouth image; analyzing the mouth image to obtain mouth features; identifying a mouth state according to the mouth features by using a preset rule; and, based on the identified mouth state, identifying a mouth motion of the corresponding face in the video data.
  • the specific process of the image processing method may include the following steps:
  • The facial features may include eyebrows, eyes, nose, and/or mouth; if these features are present in a frame image, that frame may be considered a frame having facial features.
  • the mouth image can be determined by the following method.
  • For example, face detection may be performed on the frame to obtain a face-coordinate rectangular frame, facial-feature key-point positioning may be performed according to the face-coordinate rectangular frame to obtain key points of the facial features, and the coordinate positions of the facial features may then be determined according to those key points.
  • Face key points, also known as key facial feature points, refer to areas of the face with distinctive features, such as the corners of the eyes or the corners of the mouth.
  • Facial-feature key points are a subset of the face key points, mainly used to locate the facial features.
  • For example, the facial-feature key points can be obtained by performing facial-feature key-point positioning within the face-coordinate rectangular frame.
  • For example, the key point of the nose area can be determined as the midpoint of the line connecting the two nostrils, that is, the nose-lip center point.
  • the key points of the mouth area can be determined by locating the two corner points of the mouth.
  • the position of the mouth is determined based on the coordinate position of the facial features to obtain a mouth image.
  • the mouth position may be determined according to the coordinate position of the facial features, and then the image corresponding to the mouth position is intercepted or captured from the frame image to obtain a mouth image.
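The "intercept or capture" step can be sketched as a simple crop, assuming a grayscale NumPy frame and two mouth-corner key points; the padding factor is an assumption for illustration, not a value from the patent:

```python
import numpy as np

def crop_mouth(frame, left_corner, right_corner, pad=0.4):
    """Cut the mouth image out of the frame using the two mouth-corner
    key points, expanding the box by pad * mouth width so the lips
    are fully covered."""
    (x1, y1), (x2, y2) = left_corner, right_corner
    margin = int(pad * (x2 - x1))
    h, w = frame.shape
    xa, xb = max(0, x1 - margin), min(w, x2 + margin)
    ya, yb = max(0, min(y1, y2) - margin), min(h, max(y1, y2) + margin)
    return frame[ya:yb, xa:xb]
```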
  • a texture feature can be specifically extracted from the mouth image to obtain a mouth feature.
  • the texture feature may include a histogram of oriented gradient (HOG) feature, a local binary pattern (LBP) feature, or a Gabor feature.
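Of the texture features named (HOG, LBP, Gabor), LBP is compact enough to sketch directly. Below is a minimal 3x3 LBP histogram over a grayscale NumPy mouth image; it is an illustrative implementation rather than the patent's, and HOG or Gabor descriptors would be drop-in alternatives:

```python
import numpy as np

def lbp_histogram(img):
    """Basic 3x3 local binary pattern: each interior pixel is encoded by
    comparing its 8 neighbors against it, and the codes are pooled into a
    normalized 256-bin histogram used as the mouth feature vector."""
    center = img[1:-1, 1:-1]
    code = np.zeros_like(center, dtype=np.uint8)
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(shifts):
        nb = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        code |= (nb >= center).astype(np.uint8) << bit
    hist, _ = np.histogram(code, bins=256, range=(0, 256))
    return hist / hist.sum()
```

The normalized histogram then serves as the mouth feature handed to the classifier in the next step.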
  • the mouth state is identified according to the mouth feature.
  • the preset rule may be set according to the requirements of the actual application.
  • the mouth feature may be classified by a regression device or a classifier, and then the mouth state is identified based on the classification, and the like.
  • Using the preset rule to identify the mouth state according to the mouth feature may include the following steps:
  • the mouth features are classified using a regression or classifier.
  • For example, the mouth features can be classified by a support vector machine (SVM), or by other regressors or classifiers such as linear regression or a random forest, and so on.
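The text names an SVM, linear regression, or random forest; as a dependency-free stand-in, the sketch below learns the same kind of linear decision boundary over mouth-feature vectors with a simple perceptron. The classifier choice and training loop here are assumptions for illustration only; a real system would use an SVM as stated:

```python
import numpy as np

def train_linear_classifier(X, y, epochs=100, lr=0.1):
    """Perceptron stand-in for the SVM: learns a hyperplane separating
    open-mouth (+1) from closed-mouth (-1) feature vectors."""
    w = np.zeros(X.shape[1] + 1)
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append bias column
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:  # misclassified: nudge the hyperplane
                w += lr * yi * xi
    return w

def predict(w, X):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.sign(Xb @ w)
```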
  • If the classification result is the open-mouth state, an open-state flag is set for the frame;
  • if the classification result is the closed-mouth state, a closed-state flag is set for the frame;
  • if the classification result is neither the open-mouth state nor the closed-mouth state, it can be determined that the mouth state is a fuzzy state, and no flag needs to be set, that is, neither the open-state flag nor the closed-state flag is set.
  • The frames may be processed in parallel, or a loop operation may be adopted: the frame currently requiring mouth-state identification is determined first, the operations of steps 102 to 104 are performed on it, and after that frame is processed, the process returns to determine the next frame requiring mouth-state identification, until all frames having facial features in the video data have been processed (that is, until mouth-state identification is complete).
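The loop variant just described amounts to the following per-frame driver; the three callables are placeholders for steps 102 to 104, not names from the patent:

```python
def identify_states(frames, locate_mouth, extract_feature, classify_state):
    """Process each frame with facial features in turn: locate the mouth
    image, extract the mouth feature, and record the resulting state flag
    (or None for the fuzzy state), until every frame is processed."""
    flags = []
    for frame in frames:
        mouth = locate_mouth(frame)            # step 102: mouth image
        feature = extract_feature(mouth)       # step 103: mouth feature
        flags.append(classify_state(feature))  # step 104: state flag
    return flags
```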
  • One face or multiple faces may appear in the video data, and one frame may include one or more faces; different faces can be distinguished by their facial features.
  • The corresponding frames may be extracted from the video data according to the facial features of the target face, to obtain a target frame set. For example, if mouth-motion analysis of face A is required, the facial features of face A may be used to extract from the video data all frames containing face A, to obtain the target frame set, and so on. Identifying the mouth motion of the corresponding face in the video data based on the identified mouth state may include the following steps:
  • a mouth motion analysis request triggered by a user by clicking or sliding a trigger key may be received, and the like.
  • For example, the facial features of the target face may be acquired according to the target face, and frames having those facial features may then be extracted from the video data to obtain a target frame set.
  • For example, the frames having the facial features of the target face can be acquired from the frames whose mouth states were identified by performing steps 102 to 104 described above.
  • For example, suppose the target frame set includes four frames: frame 1, frame 2, frame 3, and frame 4, where frame 1 and frame 2 carry the open-state flag, frame 3 carries no flag, and frame 4 carries the closed-state flag. It can then be determined that the frames in the target frame set carry both the open-state flag and the closed-state flag, and step S4 is performed. Otherwise, if frame 1, frame 2, frame 3, and frame 4 carry no flags, or carry only the open-state flag or only the closed-state flag, it can be determined that the frames in the target frame set do not carry both the open-state flag and the closed-state flag, and step S5 is performed.
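The decision above reduces to a set-membership test over the per-frame flags; a minimal sketch, with illustrative flag names:

```python
def has_mouth_motion(frame_flags):
    """A target face is judged to have mouth motion only when its frame
    set carries both an open-state flag and a closed-state flag; frames
    in the fuzzy state (None) are ignored."""
    present = {flag for flag in frame_flags if flag is not None}
    return "open" in present and "closed" in present
```

With the example above, frames 1 and 2 open, frame 3 unflagged, and frame 4 closed, the set contains both flags, so mouth motion is reported.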
  • In summary, in this embodiment, after video data is acquired, frames having facial features are extracted from the video data; the mouth position is determined from the extracted frames to obtain a mouth image, which is analyzed to obtain mouth features; then, using a preset rule, the mouth state is identified according to the mouth features as the basis for judging whether the mouth is moving, thereby recognizing mouth motion. Because this scheme depends little on the precision of facial-feature key-point positioning results, it is more stable than the existing solution: even if the face shakes in the video, the recognition result is not greatly affected. In short, the solution can greatly improve recognition accuracy and the recognition effect.
  • In this embodiment, the image processing apparatus is integrated in a terminal, and the mouth state of the face in each frame is identified by a loop operation, as an example.
  • an image processing method may be as follows:
  • the terminal acquires video data, and performs face detection on the video data to extract a frame having facial features.
  • the facial features may include eyebrows, eyes, nose and/or mouth, and the like. If these features are present in a certain frame image, the frame can be considered to be a frame having facial features.
  • For example, the first frame, the second frame, and the third frame are extracted.
  • the terminal determines, according to the extracted frame with the facial feature, a frame that needs to be identified by the mouth state.
  • For example, the frames may be processed in sequence: the first frame is determined to be the frame currently requiring mouth-state identification, and steps 203 to 209 are performed; then the second frame is determined to be the frame requiring mouth-state identification, and steps 203 to 209 are performed; the third frame follows, and so on.
  • FIG. 2b is a schematic diagram of a rectangular frame of the face coordinate in the image processing method provided by the embodiment of the present application.
  • The terminal performs facial-feature key-point positioning according to the face-coordinate rectangular frame to obtain key points of the facial features, and determines the coordinate positions of the facial features according to those key points.
  • Face key points, also known as key facial feature points, refer to areas of the face with distinctive features, such as the corners of the eyes or the corners of the mouth.
  • Facial-feature key points are a subset of the face key points, mainly used to locate the facial features.
  • The facial-feature key points can be obtained in various ways, which may be determined according to the requirements of the actual application.
  • For example, the key point of the nose area can be determined as the midpoint of the line connecting the two nostrils, that is, the nose-lip center point.
  • The key points of the mouth area can be determined by locating the two corner points of the mouth.
  • the terminal determines a mouth position according to a coordinate position of the facial features to obtain a mouth image.
  • the terminal may determine the position of the mouth according to the coordinate position of the facial features of the face, and then intercept or capture an image corresponding to the position of the mouth from the image of the frame to obtain a mouth image.
  • the terminal extracts a texture feature from the mouth image to obtain a mouth feature.
  • the texture feature may include: an HOG feature, an LBP feature, or a Gabor feature.
  • the terminal classifies the mouth features by using an SVM.
  • the terminal identifies the state of the mouth according to the classification result.
  • For example, the details may be as follows:
  • If the classification result is the open-mouth state, an open-state flag is set for the frame;
  • if the classification result is the closed-mouth state, a closed-state flag is set for the frame;
  • if the classification result is neither the open-mouth state nor the closed-mouth state, the mouth state is determined to be a fuzzy state, and neither the open-state flag nor the closed-state flag is set.
  • The terminal determines whether all frames having facial features have had their mouth states identified; if yes, step 210 is performed, and if not, the process returns to step 202 to process the next frame.
  • the terminal identifies, according to the identifier, a mouth motion of the corresponding face in the video data. For example, it can be as follows:
  • the terminal receives a mouth motion analysis request, and the mouth motion analysis request indicates a target face that needs to perform mouth motion analysis.
  • a mouth motion analysis request triggered by a user by clicking or sliding a trigger key may be received, and the like.
  • the terminal extracts a frame corresponding to the target face from the video data according to the target face, to obtain a target frame set.
  • the terminal may acquire a facial feature of the target face according to the target face, and then extract a frame having the facial feature of the target facial from the video data according to the facial feature of the target facial to obtain a target frame set.
  • For example, after all frames having facial features are extracted from the video data and their mouth states are identified, the frames having the facial features of the target face can be extracted from all of those identified frames.
  • The terminal determines whether the frames in the target frame set carry both the open-state flag and the closed-state flag. If yes, S4 is executed; if not, S5 is executed.
  • For example, suppose the target frame set includes four frames: frame 1, frame 2, frame 3, and frame 4, where frame 1 and frame 2 carry the open-state flag, frame 3 carries no flag, and frame 4 carries the closed-state flag. It can then be determined that the frames in the target frame set carry both the open-state flag and the closed-state flag, and step S4 is performed. Otherwise, if frame 1, frame 2, frame 3, and frame 4 carry no flags, or carry only the open-state flag or only the closed-state flag, it can be determined that the frames in the target frame set do not carry both flags, and step S5 is performed.
  • In summary, in this embodiment, after video data is acquired, frames having facial features are extracted from the video data; the mouth position is determined from the extracted frames to obtain a mouth image; texture features are extracted from the mouth image and classified by an SVM; and the mouth state is identified based on the classification result as the basis for judging whether the mouth is moving, thereby recognizing mouth motion. Because this scheme depends little on the precision of the facial-feature key points, it is more stable than the existing schemes: even if the face shakes in the video, the recognition result is not greatly affected. In short, the scheme can greatly improve recognition accuracy and the recognition effect.
  • the embodiment of the present application further provides an image processing apparatus.
  • The image processing apparatus may include: an obtaining unit 301, a determining unit 302, an analyzing unit 303, a marking unit 304, and an identifying unit 305.
  • the obtaining unit 301 is configured to acquire video data, and extract a frame having facial features from the video data.
  • the obtaining unit 301 may be specifically configured to read video data that needs to be recognized by the face, and extract a frame having facial features from the video data by using a face recognition technology.
  • the facial features may include eyebrows, eyes, nose and/or mouth, and the like. If these features are present in a certain frame image, it can be considered as a frame having facial features.
  • the determining unit 302 is configured to determine the mouth position from each frame to obtain a mouth image.
  • the determining unit 302 can include a positioning subunit and a determining subunit.
  • the positioning sub-unit is configured to locate facial features in each frame to obtain a coordinate position of the facial features.
  • The positioning sub-unit can be used to perform face detection on each frame to obtain a face-coordinate rectangular frame, perform facial-feature key-point positioning according to the face-coordinate rectangular frame to obtain key points of the facial features, and determine the coordinate positions of the facial features according to those key points.
  • the key points of the facial features can be obtained, and the manner of obtaining the key points of the facial features can be various, which can be determined according to the requirements of the actual application.
  • For example, the key point of the nose area can be determined as the midpoint of the line connecting the two nostrils, that is, the nose-lip center point.
  • the key points of the mouth area can be determined by determining the two corner points of the mouth.
  • the determining subunit is configured to determine a mouth position according to a coordinate position of the facial features to obtain a mouth image.
  • the determining subunit may be specifically configured to determine a mouth position according to a coordinate position of the facial features, and then intercept or capture an image corresponding to the mouth position from the frame image to obtain a mouth image.
  • the analyzing unit 303 is configured to analyze the mouth image to obtain a mouth feature.
  • the analyzing unit 303 is specifically configured to extract a texture feature from the mouth image to obtain a mouth feature.
  • the texture feature may include an HOG feature, an LBP feature, or a Gabor feature.
  • the marking unit 304 is configured to identify the mouth state according to the mouth feature by using a preset rule.
  • the preset rule may be set according to requirements of an actual application.
  • the identifier unit may include: a classification subunit and an identifier subunit.
  • the classification subunit is configured to classify the mouth features by using a regression or classifier.
  • the classification sub-unit may be specifically used to classify the mouth features by using the SVM, or may also use a linear regression, a random forest or other regression or classifier to classify the mouth features, and the like.
  • the identifier subunit is configured to identify the mouth state according to the classification result. For example, it can be as follows:
  • If the classification result is the open-mouth state, an open-state flag is set for the frame;
  • if the classification result is the closed-mouth state, a closed-state flag is set for the frame;
  • if the classification result is neither the open-mouth state nor the closed-mouth state, the mouth state is determined to be a fuzzy state, and neither the open-state flag nor the closed-state flag is set.
  • The identifying unit 305 is configured to identify a mouth motion of the corresponding face in the video data based on the identified mouth state. For example, the details can be as follows:
  • the mouth motion analysis request indicating a target face that needs to perform mouth motion analysis
  • For example, suppose the target frame set includes four frames: frame 1, frame 2, frame 3, and frame 4, where frame 1 and frame 2 carry the open-state flag, frame 3 carries no flag, and frame 4 carries the closed-state flag. It can then be determined that the frames in the target frame set carry both the open-state flag and the closed-state flag, so the target face is in a mouth-motion state. Otherwise, if frame 1, frame 2, frame 3, and frame 4 carry no flags, or carry only the open-state flag or only the closed-state flag, it can be determined that the frames in the target frame set do not carry both flags, and the target face is determined to have no mouth motion.
  • the image processing device may be specifically integrated in a device such as a terminal or a server, and the terminal may include a device such as a mobile phone, a tablet computer, a notebook computer, or a PC.
  • Each of the foregoing units may be implemented as a separate entity, or the units may be combined arbitrarily and implemented as one or more entities.
  • For specific implementations of the foregoing units, refer to the foregoing method embodiments; details are not described herein again.
  • In the image processing apparatus of this embodiment, after the obtaining unit 301 acquires video data, it extracts frames having facial features from the video data; the determining unit 302 then determines the mouth position from the extracted frames to obtain a mouth image; the analyzing unit 303 analyzes the mouth image to obtain mouth features; and the marking unit 304 uses a preset rule to identify the mouth state according to the mouth features, as the basis for the identifying unit 305 to recognize the mouth motion of the corresponding face in the video data.
  • FIG. 4 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application. As shown in FIG. 4, the apparatus includes a processor 401, a non-volatile computer readable memory 402, a display unit 403, and a network communication interface 404. These components communicate over bus 405.
  • a plurality of program modules are stored in the memory 402, including an operating system 406, a network communication module 407, and an application 408.
  • the processor 401 can read various modules (not shown) included in the application in the memory 402 to perform various functional applications and data processing of image processing.
  • There may be one or more processors 401 in this embodiment, and the processor 401 may be a CPU, a processing unit/module, an ASIC, a logic module, or a programmable gate array.
  • the operating system 406 can be: a Windows operating system, an Android operating system, or an Apple iPhone OS operating system.
  • Application 408 can include an image processing module 409.
  • The image processing module 409 may include: a computer-executable instruction set 409-1 formed by the obtaining unit 301, the determining unit 302, the analyzing unit 303, the marking unit 304, and the identifying unit 305 in FIG. 3, and corresponding metadata and heuristic algorithms 409-2. These sets of computer-executable instructions may be executed by the processor 401 to perform the functions of the method illustrated in FIG. 1 or FIG. 2a, or of the image processing apparatus illustrated in FIG. 3.
  • the network communication interface 404 cooperates with the network communication module 407 to complete transmission and reception of various network signals of the image processing apparatus.
  • the display unit 403 has a display panel for completing input and display of related information.
  • In some embodiments, the network communication interface 404 and the network communication module 407 may be omitted.
  • A person of ordinary skill in the art may understand that all or some of the steps of the foregoing embodiments may be implemented by a program instructing relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

Disclosed are an image processing method and apparatus. The image processing method comprises: acquiring video data, and extracting frames having face features from the video data; determining a mouth position from each frame to obtain a mouth image; analyzing the mouth image to obtain mouth features; identifying a mouth state according to the mouth features by utilizing a preset rule; and identifying a mouth action of a corresponding face in the video data on the basis of the identified mouth state. The solution can increase the identification accuracy and improve the identification effect.

Description

Image processing method and apparatus
This application claims priority to Chinese Patent Application No. 201510827420.1, filed with the Chinese Patent Office on November 25, 2015 and entitled "Facial recognition method and apparatus", which is incorporated herein by reference in its entirety.
Technical field
The present application relates to the field of communications technologies, and in particular, to an image processing method and apparatus.
Background
With the development of communication technology, a variety of biometric recognition technologies have emerged, and facial recognition is one of them. Facial recognition is also known as face recognition, face-image recognition, and countenance recognition. Compared with fingerprint scanning or iris recognition, facial recognition is convenient to use, intuitive, highly accurate, and difficult to counterfeit, and is therefore more readily accepted by users.
Summary of the invention
The embodiments of the present application provide an image processing method and apparatus, which can improve recognition accuracy and the recognition effect.
An embodiment of this application provides an image processing method, including:
acquiring video data;
extracting frames having facial features from the video data;
determining a mouth position in each of the frames to obtain a mouth image;
analyzing the mouth image to obtain mouth features;
labeling the mouth state according to the mouth features by using a preset rule; and
recognizing a mouth action of the corresponding face in the video data based on the labeled mouth state.
Correspondingly, an embodiment of this application further provides an image processing apparatus, including:
an acquiring unit, configured to acquire video data and extract frames having facial features from the video data;
a determining unit, configured to determine a mouth position in each of the frames to obtain a mouth image;
an analyzing unit, configured to analyze the mouth image to obtain mouth features;
a labeling unit, configured to label the mouth state according to the mouth features by using a preset rule; and
a recognizing unit, configured to recognize a mouth action of the corresponding face in the video data based on the labeled mouth state.
In the embodiments of this application, after video data is acquired, frames having facial features are extracted from the video data; the mouth position is then determined in each extracted frame to obtain a mouth image, which is analyzed to obtain mouth features; a preset rule is then used to label the mouth state according to the mouth features, and the labels serve as the basis for judging whether the mouth is moving, thereby recognizing the mouth action. Because this solution depends only weakly on the precision of facial landmark localization, it is more stable than existing solutions: even if the face shakes in the video, the recognition result is not greatly affected. In short, this solution can greatly improve recognition accuracy and the overall recognition result.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. Clearly, the drawings described below show only some embodiments of this application, and a person skilled in the art may derive other drawings from them without creative effort.
FIG. 1 is a flowchart of an image processing method according to an embodiment of this application;
FIG. 2a is another flowchart of an image processing method according to an embodiment of this application;
FIG. 2b is a schematic diagram of a rectangular face-coordinate box in an image processing method according to an embodiment of this application;
FIG. 3 is a schematic structural diagram of an image processing apparatus according to an embodiment of this application; and
FIG. 4 is a schematic structural diagram of an image processing apparatus according to an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the accompanying drawings. Clearly, the described embodiments are only some, rather than all, of the embodiments of this application. All other embodiments obtained by a person skilled in the art based on the embodiments of this application without creative effort fall within the protection scope of this application.
In an embodiment of this application, facial recognition has a wide range of applications; for example, it can be applied to data security, or used for face capture and tracking. Within facial recognition, recognizing the mouth is an important part: by judging whether a face in video data makes a mouth-opening movement, the facial expression of the subject can be judged, or it can be determined whether the subject is speaking, and so on. In this embodiment, when judging whether a face in video data makes a mouth-opening movement, facial landmark localization is generally used: multiple points locate the mouth in each face image of the video sequence, the internal area of the mouth is computed from these point coordinates, and finally the change in this area determines whether the face in the video makes a mouth-opening movement.
In that approach, if the face in the video shakes, landmark localization may fail or deviate substantially, which makes the computed internal mouth area wrong and ultimately causes detection of the mouth-opening movement to fail. In other words, the accuracy of that approach is not high and its recognition result is poor.
To improve the accuracy and result of facial recognition, the embodiments of this application provide an image processing method and apparatus, which are described in detail below.
This embodiment is described from the perspective of an image processing apparatus, which may be integrated in a device such as a terminal or a server. The terminal may be a mobile phone, a tablet computer, a notebook computer, a personal computer (PC), or the like.
An image processing method includes: acquiring video data and extracting frames having facial features from the video data; determining a mouth position in each frame to obtain a mouth image; analyzing the mouth image to obtain mouth features; labeling the mouth state according to the mouth features by using a preset rule; and recognizing a mouth action of the corresponding face in the video data based on the labels.
As shown in FIG. 1, the image processing method may include the following steps:
101. Acquire video data, and extract frames having facial features from the video data.
For example, the video data on which facial recognition is to be performed may be read, and frames having facial features may be extracted from the video data by using face detection technology.
The facial features may include eyebrows, eyes, a nose, and/or a mouth. If a frame contains these features, it can be regarded as a frame having facial features.
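Step 101 can be sketched as a simple filter over decoded frames. In the sketch below, `has_face` is a hypothetical stand-in for whatever real face detector is used (the source does not prescribe one), and the frames are toy records rather than real images:

```python
# Minimal sketch of step 101: scan decoded video frames and keep those in
# which a face detector fires. `has_face` is a hypothetical predicate standing
# in for a real detector; here it just checks a flag on a toy frame record.

def extract_face_frames(frames, has_face):
    """Return (index, frame) pairs for frames containing facial features."""
    return [(i, f) for i, f in enumerate(frames) if has_face(f)]

# Toy data: five frames, of which the first three contain a face.
frames = [{"face": True}, {"face": True}, {"face": True},
          {"face": False}, {"face": False}]
kept = extract_face_frames(frames, lambda f: f["face"])
print([i for i, _ in kept])  # indices of the frames kept for later steps
```

The kept frames (here, indices 0 to 2) are the ones passed on to steps 102 to 104.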
102. Determine the mouth position in each frame to obtain a mouth image.
In this embodiment, the mouth image may be determined as follows.
(1) Locate the facial features in the frame to obtain their coordinate positions.
For example, face detection may be performed on the frame to obtain a rectangular face-coordinate box; facial landmark localization is performed within this box to obtain the facial landmarks; and the coordinate positions of the facial features are then determined from these landmarks.
Face key points, also called face key feature points, are regions of the face with distinctive characteristics, such as the corners of the eyes or the corners of the mouth. Facial-feature landmarks are a subset of the face key points and are mainly used to recognize the facial features.
The facial-feature landmarks can be derived from the rectangular face-coordinate box in a number of ways. For example, the key point of the nose region may be taken as the midpoint of the line connecting the centers of the two nostrils, that is, the nose-lip center point; the key points of the mouth region may be determined by locating the two mouth corners.
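The nose-lip center point described above is plain midpoint geometry. A minimal sketch, assuming the two nostril-center coordinates have already been produced by some landmark detector (the coordinate values below are illustrative):

```python
# Sketch of the landmark geometry from the text: the nose-lip center point is
# the midpoint of the line connecting the two nostril centers. Input
# coordinates are assumed outputs of a landmark detector, in (x, y) pixels.

def midpoint(p, q):
    return ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)

left_nostril, right_nostril = (48.0, 60.0), (56.0, 60.0)  # example coordinates
nose_lip_center = midpoint(left_nostril, right_nostril)
print(nose_lip_center)  # (52.0, 60.0)
```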
(2) Determine the mouth position according to the coordinate positions of the facial features to obtain the mouth image.
For example, the mouth position may be determined from the coordinate positions of the facial features, and the image patch corresponding to the mouth position may then be cropped from the frame to obtain the mouth image.
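One way to realize this crop is to build a bounding box from the two mouth-corner landmarks and cut it out of the frame. A hedged sketch: the margin value is an illustrative assumption (the source only says the mouth region is cropped), and the frame is modeled as a row-major list of rows rather than a real image buffer:

```python
# Sketch of step (2): derive a bounding box from the two mouth-corner
# landmarks (with a margin, an assumed parameter) and crop it from the frame.

def mouth_bbox(left_corner, right_corner, margin=2):
    xs = (left_corner[0], right_corner[0])
    ys = (left_corner[1], right_corner[1])
    return (min(xs) - margin, min(ys) - margin,
            max(xs) + margin, max(ys) + margin)

def crop(frame, box):
    x0, y0, x1, y1 = box
    return [row[x0:x1 + 1] for row in frame[y0:y1 + 1]]

frame = [[(y, x) for x in range(20)] for y in range(20)]  # toy 20x20 "image"
box = mouth_bbox((6, 10), (13, 10))
mouth_image = crop(frame, box)
print(box, len(mouth_image), len(mouth_image[0]))  # (4, 8, 15, 12) 5 12
```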
103. Analyze the mouth image to obtain mouth features.
For example, texture features may be extracted from the mouth image as the mouth features.
The texture features may include histogram of oriented gradients (HOG) features, local binary pattern (LBP) features, Gabor features, or the like.
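Of the texture features named above, LBP is the simplest to illustrate. The sketch below implements the basic 8-neighbor variant only (real systems typically histogram these codes over the patch to form the feature vector; the patent does not specify which variant is used):

```python
# Basic 8-neighbor local binary pattern (LBP): for each interior pixel,
# neighbors greater than or equal to the center each set one bit of an
# 8-bit code. A histogram of these codes can serve as the mouth feature
# vector fed to the classifier in the next step.

def lbp_codes(img):
    """img: 2-D list of grayscale values; returns LBP codes for interior pixels."""
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]  # clockwise from top-left
    h, w = len(img), len(img[0])
    codes = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            center = img[y][x]
            code = 0
            for bit, (dy, dx) in enumerate(offsets):
                if img[y + dy][x + dx] >= center:
                    code |= 1 << bit
            codes.append(code)
    return codes

img = [[10, 10, 10],
       [10, 50, 10],
       [10, 10, 90]]
print(lbp_codes(img))  # one interior pixel; only the brighter neighbor sets a bit
```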
104. Use a preset rule to label the mouth state according to the mouth features.
The preset rule may be set according to the needs of the actual application. For example, a regressor or a classifier may be used to classify the mouth features, and the mouth state may then be labeled based on the classification result. Labeling the mouth state according to the mouth features by using the preset rule may include the following steps:
(1) Classify the mouth features by using a regressor or a classifier.
For example, a support vector machine (SVM) may be used to classify the mouth features; alternatively, other regressors or classifiers, such as a linear regressor or a random forest, may be used.
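At inference time a trained linear-kernel SVM reduces to a decision function of the form sign(w·x + b). The sketch below shows only that decision step, not training; the weights, bias, and feature vectors are made-up values, not a trained model:

```python
# Illustrative stand-in for the classification step: evaluating a linear
# decision function w . x + b over the mouth feature vector. All numbers
# below are hypothetical, for demonstration only.

def linear_decision(weights, bias, features):
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return score  # > 0 -> "open mouth" class, < 0 -> "closed mouth" class

w, b = [0.8, -0.5, 0.3], -0.2     # hypothetical trained parameters
open_like = [1.0, 0.1, 0.9]       # hypothetical feature vectors
closed_like = [0.1, 0.9, 0.1]
print(linear_decision(w, b, open_like) > 0,
      linear_decision(w, b, closed_like) > 0)  # True False
```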
(2) Label the mouth state according to the classification result. For example:
If the classification result indicates that the mouth state is the open state, an open-mouth flag is set for the frame.
If the classification result indicates that the mouth state is the closed state, a closed-mouth flag is set for the frame.
Note that if the classification result cannot determine whether the mouth state is open or closed, the mouth state can be regarded as ambiguous; in that case no flag is set, i.e. neither the open-mouth flag nor the closed-mouth flag is set.
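The three-way labeling rule above can be sketched as a mapping from a classifier score to a per-frame flag. The score margin used to declare a frame ambiguous is an illustrative assumption; the source only requires that ambiguous frames receive no flag:

```python
# Sketch of the labeling rule: map a classifier score to a per-frame flag,
# leaving ambiguous frames unflagged. The margin threshold is assumed.

OPEN, CLOSED, UNFLAGGED = "open", "closed", None

def label_frame(score, margin=0.25):
    if score >= margin:
        return OPEN
    if score <= -margin:
        return CLOSED
    return UNFLAGGED  # ambiguous: set neither flag

print([label_frame(s) for s in (0.9, -0.7, 0.1)])  # ['open', 'closed', None]
```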
In addition, when labeling the mouth state of each frame having facial features in the video data, the frames may be processed in parallel, or in a loop: first determine the frame whose mouth state currently needs labeling, then perform steps 102 to 104; after that frame has been processed, return to determining the next frame to label, until all frames having facial features in the video data have been processed (that is, labeled with a mouth state).
105. Recognize the mouth action of the corresponding face in the video data based on the labeled mouth states.
One face or multiple faces may appear in the video data, and a single frame may contain one or more faces; different faces can be distinguished by their facial features. That is, the frames corresponding to a target face can be extracted from the video data by its facial features to obtain a target frame set. For example, if mouth-action analysis is required for face A, all frames containing face A can be extracted from the video data according to the facial features of face A to obtain the target frame set, and so on. Recognizing the mouth action of the corresponding face in the video data based on the labels includes the following steps:
S1. Receive a mouth-action analysis request, which indicates the target face on which mouth-action analysis is to be performed.
For example, a mouth-action analysis request triggered by the user clicking or sliding a trigger key may be received.
S2. Extract the frames corresponding to the target face from the video data to obtain a target frame set.
For example, the facial features of the target face may be acquired, and frames having those facial features may then be extracted from the video data to obtain the target frame set.
In this step, the frames having the facial features of the target face can be taken from the mouth-state-labeled frames obtained by performing steps 102 to 104 in a loop.
S3. Determine whether the target frame set contains both frames with the open-mouth flag and frames with the closed-mouth flag; if so, perform S4; otherwise, perform S5.
For example, suppose the target frame set includes four frames: frame 1 and frame 2 carry the open-mouth flag, frame 3 carries no flag, and frame 4 carries the closed-mouth flag. In this case the set contains both open-mouth and closed-mouth flags, so step S4 is performed. Otherwise, if frames 1 to 4 carry no flags, or carry only open-mouth flags or only closed-mouth flags, the set does not contain both flags, so step S5 is performed.
S4. When the target frame set contains both open-mouth and closed-mouth flags, determine that the target face exhibits a mouth-opening movement.
S5. When the target frame set does not contain both open-mouth and closed-mouth flags, determine that the target face does not exhibit a mouth-opening movement.
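Steps S3 to S5 reduce to a set-membership check over the per-frame flags, which can be sketched as:

```python
# Sketch of steps S3-S5: a mouth-opening movement is reported only when the
# target frame set contains at least one open-mouth flag AND at least one
# closed-mouth flag; unflagged (ambiguous) frames are ignored.

def has_mouth_motion(frame_flags):
    flags = set(f for f in frame_flags if f is not None)
    return "open" in flags and "closed" in flags

# The four-frame example from the text: frames 1-2 open, frame 3 unflagged,
# frame 4 closed -> both flags present, so motion is detected.
print(has_mouth_motion(["open", "open", None, "closed"]))  # True
print(has_mouth_motion(["open", "open", None, None]))      # False
```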
As can be seen from the above, in this embodiment, after video data is acquired, frames having facial features are extracted from it; the mouth position is determined in each extracted frame to obtain a mouth image, which is analyzed to obtain mouth features; a preset rule then labels the mouth state according to the mouth features, and the labels serve as the basis for judging whether the mouth is moving, thereby recognizing the mouth action. Because this solution depends only weakly on the precision of facial landmark localization, it is more stable than existing solutions: even if the face shakes in the video, the recognition result is not greatly affected. In short, this solution can greatly improve recognition accuracy and the overall recognition result.
The method described in the above embodiment is further illustrated below with an example.
In this embodiment, the description takes as an example a case in which the image processing apparatus is integrated in a terminal and the mouth state of the face in each frame is labeled in a loop.
As shown in FIG. 2a, an image processing method may proceed as follows:
201. The terminal acquires video data and performs face detection on it to extract frames having facial features.
The facial features may include eyebrows, eyes, a nose, and/or a mouth. If a frame contains these features, it can be regarded as a frame having facial features.
For example, if face detection determines that the first, second, and third frames of the video data have facial features while the fourth and fifth frames do not, the first, second, and third frames are extracted.
202. The terminal determines, from the extracted frames having facial features, the frame whose mouth state currently needs labeling.
For example, if the extracted frames are the first, second, and third frames, their mouth states may be labeled in turn: first the first frame is taken as the frame to be labeled and steps 203 to 209 are performed; then the second frame is taken as the frame to be labeled and steps 203 to 209 are performed again; then the third frame; and so on.
203. The terminal performs face detection on the frame whose mouth state currently needs labeling to obtain a rectangular face-coordinate box; see FIG. 2b, a schematic diagram of the rectangular face-coordinate box in the image processing method provided by this embodiment.
204. The terminal performs facial landmark localization within the rectangular face-coordinate box to obtain the facial landmarks, and determines the coordinate positions of the facial features from these landmarks.
Face key points, also called face key feature points, are regions of the face with distinctive characteristics, such as the corners of the eyes or the corners of the mouth. Facial-feature landmarks are a subset of the face key points and are mainly used to recognize the facial features.
The facial-feature landmarks can be derived from the rectangular face-coordinate box in a number of ways, depending on the needs of the actual application. For example, the key point of the nose region may be taken as the midpoint of the line connecting the centers of the two nostrils, that is, the nose-lip center point; the key points of the mouth region may be determined by locating the two mouth corners.
205. The terminal determines the mouth position from the coordinate positions of the facial features to obtain a mouth image.
For example, the terminal may determine the mouth position from the coordinate positions of the facial features and then crop the image patch corresponding to the mouth position from the frame to obtain the mouth image.
206. The terminal extracts texture features from the mouth image to obtain mouth features.
The texture features may include HOG features, LBP features, Gabor features, or the like.
207. The terminal classifies the mouth features by using an SVM.
Note that besides an SVM, other regressors or classifiers, such as a linear regressor or a random forest, may be used to classify the mouth features.
208. The terminal labels the mouth state according to the classification result. For example:
If the classification result indicates that the mouth state is the open state, the open-mouth flag is set for the frame.
If the classification result indicates that the mouth state is the closed state, the closed-mouth flag is set for the frame.
Note that if the classification result cannot determine whether the mouth state is open or closed, the mouth state can be regarded as ambiguous; in that case no flag is set, i.e. neither the open-mouth flag nor the closed-mouth flag is set.
209. The terminal determines whether all frames having facial features in the video data have been processed; if so, step 210 is performed; otherwise, the flow returns to step 202.
For example, if only the first, second, and third frames of the video data have facial features, then after the first frame has been labeled, the second and third frames remain unprocessed, so the flow must return to step 202 to label the mouth state of the second frame; once the second and third frames have also been labeled, step 210 is performed.
210. The terminal recognizes the mouth action of the corresponding face in the video data based on the labels. For example:
S1. The terminal receives a mouth-action analysis request, which indicates the target face on which mouth-action analysis is to be performed.
For example, a mouth-action analysis request triggered by the user clicking or sliding a trigger key may be received.
S2. The terminal extracts the frames corresponding to the target face from the video data to obtain a target frame set.
For example, the terminal may acquire the facial features of the target face and then extract frames having those facial features from the video data to obtain the target frame set.
Through steps 202 to 209, all frames having facial features can be extracted from the video data and labeled with their mouth states. In this embodiment, the frames having the facial features of the target face are taken from all of these labeled frames.
S3. The terminal determines whether the target frame set contains both frames with the open-mouth flag and frames with the closed-mouth flag; if so, S4 is performed; otherwise, S5 is performed.
For example, suppose the target frame set includes four frames: frame 1 and frame 2 carry the open-mouth flag, frame 3 carries no flag, and frame 4 carries the closed-mouth flag. In this case the set contains both open-mouth and closed-mouth flags, so step S4 is performed. Otherwise, if frames 1 to 4 carry no flags, or carry only open-mouth flags or only closed-mouth flags, the set does not contain both flags, so step S5 is performed.
S4. When the terminal determines that the target frame set contains both open-mouth and closed-mouth flags, it determines that the target face exhibits a mouth-opening movement.
S5. When the terminal determines that the target frame set does not contain both open-mouth and closed-mouth flags, it determines that the target face does not exhibit a mouth-opening movement.
As can be seen from the above, in this embodiment, after video data is acquired, frames having facial features are extracted from it; the mouth position is determined in each extracted frame to obtain a mouth image; texture features are extracted from the mouth image and classified by an SVM; and the mouth state is labeled based on the classification result, serving as the basis for judging whether the mouth is moving, thereby recognizing the mouth action. Because this solution depends only weakly on the precision of facial landmark localization, it is more stable than existing solutions: even if the face shakes in the video, the recognition result is not greatly affected. In short, this solution can greatly improve recognition accuracy and the overall recognition result.
本申请实施例还提供一种图像处理装置,如图3所示,该图像处理装置可以包括:获取单元301、确定单元302、分析单元303、标识单元304和识别单元305。The embodiment of the present application further provides an image processing apparatus. As shown in FIG. 3, the image processing apparatus may include: an obtaining unit 301, a determining unit 302, an analyzing unit 303, an identifying unit 304, and an identifying unit 305.
获取单元301,用于获取视频数据,并从该视频数据中提取具有面部特征的帧。The obtaining unit 301 is configured to acquire video data, and extract a frame having facial features from the video data.
例如,获取单元301,具体可以用于读取需要进行面部识别的视频数据,并利用人脸识别技术从该视频数据中提取具有面部特征的帧。For example, the obtaining unit 301 may be specifically configured to read video data that needs to be recognized by the face, and extract a frame having facial features from the video data by using a face recognition technology.
其中,该面部特征可以包括眉毛、眼睛、鼻子和/或嘴巴等。若某一帧图像中具有这些特征,则可以认为是具有面部特征的帧。Wherein, the facial features may include eyebrows, eyes, nose and/or mouth, and the like. If these features are present in a certain frame image, it can be considered as a frame having facial features.
确定单元302,用于从各帧中确定出嘴部位置,得到嘴部图像。The determining unit 302 is configured to determine the mouth position from each frame to obtain a mouth image.
例如,该确定单元302可以包括定位子单元和确定子单元。For example, the determining unit 302 can include a positioning subunit and a determining subunit.
该定位子单元,用于对各帧中的面部五官进行定位,得到面部五官的坐标位置。例如,该定位子单元,可以用于对各帧进行面部检测,得到面部坐标矩形框,根据该面部坐标矩形框进行五官关键点定位,得到五官关键点,根据该五官关键点确定面部五官的坐标位置。The positioning sub-unit is configured to locate facial features in each frame to obtain a coordinate position of the facial features. For example, the positioning sub-unit can be used for performing face detection on each frame to obtain a rectangular frame of the face coordinate, and performing a five-point key point positioning according to the rectangular frame of the face coordinate to obtain a key point of the facial features, and determining the coordinates of the facial features according to the key points of the facial features. position.
其中,根据面部坐标矩形框进行五官关键点定位,得到五官关键点的方式可以有多种,具体可以根据实际应用的需求而定,比如,可以将人脸鼻子区域的关键点确定为两个鼻孔中心连线的中点处,即鼻唇中心点。可以通过确定两个嘴角点来嘴部区域的关键点。Among them, according to the rectangular coordinate frame of the face coordinate, the key points of the facial features can be obtained, and the manner of obtaining the key points of the facial features can be various, which can be determined according to the requirements of the actual application. For example, the key points of the nose area can be determined as two nostrils. The midpoint of the center line, the center point of the nose and lips. The key points of the mouth area can be determined by determining the two corner points of the mouth.
The determining subunit is configured to determine the mouth position according to the coordinate positions of the facial features, to obtain the mouth image.
For example, the determining subunit may be specifically configured to determine the mouth position according to the coordinate positions of the facial features, and then crop or extract the image corresponding to that mouth position from the frame, to obtain the mouth image.
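One plausible way to realize the "crop the mouth position" step is to build a padded bounding box around the two mouth-corner points and slice it out of the frame. This is a hedged sketch under assumed conventions (frame modeled as a list of pixel rows; the 0.4 margin is illustrative, not from the patent).

```python
def mouth_box(left_corner, right_corner, margin=0.4):
    """Axis-aligned box around the mouth corners, padded on every side
    by `margin` times the mouth width. Returns (left, top, right, bottom)."""
    (x1, y1), (x2, y2) = left_corner, right_corner
    width = x2 - x1
    pad = int(width * margin)
    top = min(y1, y2) - pad
    bottom = max(y1, y2) + pad
    return (x1 - pad, top, x2 + pad, bottom)

def crop(frame, box):
    """Slice a region out of a frame given as rows of pixels."""
    left, top, right, bottom = box
    return [row[left:right] for row in frame[top:bottom]]

# 200x200 dummy frame of zeros, mouth corners at hypothetical coordinates:
frame = [[0] * 200 for _ in range(200)]
box = mouth_box((80, 150), (120, 152))
print(box)  # (64, 134, 136, 168)
mouth_image = crop(frame, box)
```

In a real pipeline the box would additionally be clamped to the frame boundaries; that detail is omitted here for brevity.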
The analyzing unit 303 is configured to analyze the mouth image to obtain mouth features.
For example, the analyzing unit 303 may be specifically configured to extract texture features from the mouth image as the mouth features.
The texture features may include HOG (histogram of oriented gradients) features, LBP (local binary pattern) features, Gabor features, and the like.
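Of the texture features listed, LBP is simple enough to sketch in full: each pixel is re-coded as an 8-bit number whose bits record whether each of its 8 neighbors is at least as bright as the center, and a histogram of these codes over the mouth image serves as the feature vector. This is a minimal illustration of the basic operator, not the patent's feature extractor.

```python
def lbp_code(img, y, x):
    """Basic 8-neighbor LBP code of pixel (y, x) of a 2-D grayscale grid."""
    center = img[y][x]
    # neighbors enumerated clockwise from the top-left pixel
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dy, dx) in enumerate(offsets):
        if img[y + dy][x + dx] >= center:
            code |= 1 << bit  # set bit if neighbor >= center
    return code

img = [
    [5, 9, 1],
    [4, 6, 7],
    [2, 3, 8],
]
# neighbors >= 6 are 9, 7, 8 -> bits 1, 3, 4 -> 2 + 8 + 16 = 26
print(lbp_code(img, 1, 1))  # 26
```

Collecting `lbp_code` over every interior pixel of the mouth image and histogramming the 256 possible codes yields the kind of texture descriptor the classifier in the next step would consume.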
The marking unit 304 is configured to label the mouth state according to the mouth features by using a preset rule.
The preset rule may be set according to the needs of the actual application. For example, the marking unit may include a classification subunit and a marking subunit.
The classification subunit is configured to classify the mouth features by using a regressor or a classifier.
For example, the classification subunit may be specifically configured to classify the mouth features by using an SVM (support vector machine), or by using another regressor or classifier such as a linear regressor or a random forest.
The marking subunit is configured to label the mouth state according to the classification result, for example, as follows:
if it is determined from the classification result that the mouth state is an open state, an open-mouth flag is set for the frame;
if it is determined from the classification result that the mouth state is a closed state, a closed-mouth flag is set for the frame.
It should be noted that if the classification result cannot determine whether the mouth state is open or closed, the mouth state may be regarded as ambiguous; in that case no flag is set for the frame, i.e., neither the open-mouth flag nor the closed-mouth flag is set.
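The three-way flagging rule above can be sketched as follows, assuming the regressor or classifier yields a real-valued score (higher meaning more "open"). Scores inside an ambiguity band yield no flag at all, matching the ambiguous ("fuzzy") state described above. The flag names and thresholds are hypothetical, not taken from the patent.

```python
OPEN, CLOSED = "open_flag", "closed_flag"

def mark_frame(score, open_thresh=0.6, closed_thresh=0.4):
    """Return the flag to set for a frame, or None for the ambiguous state."""
    if score >= open_thresh:
        return OPEN      # confident open mouth: set the open-mouth flag
    if score <= closed_thresh:
        return CLOSED    # confident closed mouth: set the closed-mouth flag
    return None          # ambiguous: set neither flag

print([mark_frame(s) for s in (0.9, 0.5, 0.1)])
# ['open_flag', None, 'closed_flag']
```

With a hard classifier (e.g., an SVM decision without scores), the ambiguous case could instead be triggered by a rejection option or a low decision-function magnitude; the patent leaves this choice open.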
(5) Identifying unit 305;
The identifying unit 305 is configured to determine, based on the flags, the mouth motion of the corresponding face in the video data. For example, this may be done as follows:
receive a mouth motion analysis request, the request indicating a target face on which mouth motion analysis needs to be performed;
extract the frames corresponding to the target face from the video data, to obtain a target frame set;
determine whether the target frame set contains both frames carrying the open-mouth flag and frames carrying the closed-mouth flag;
if so, determine that the target face exhibits a mouth-opening motion;
if not, determine that the target face does not exhibit a mouth-opening motion.
For example, suppose the target frame set includes four frames: frame 1, frame 2, frame 3, and frame 4, where frame 1 and frame 2 carry the open-mouth flag, frame 3 carries no flag, and frame 4 carries the closed-mouth flag. The target frame set then contains both the open-mouth flag and the closed-mouth flag, so the target face is determined to exhibit a mouth-opening motion. Conversely, if none of frames 1 to 4 carries a flag, or if only the open-mouth flag (or only the closed-mouth flag) is present, the target frame set does not contain both flags, and the target face is determined not to exhibit a mouth-opening motion.
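The identifying unit's decision reduces to a set test over the flags of the target frame set: a mouth-opening motion exists exactly when both an open-mouth flag and a closed-mouth flag occur somewhere in the set. A self-contained sketch, with hypothetical string flag values and `None` for unflagged frames:

```python
def has_mouth_motion(frame_flags):
    """True iff the target frame set carries both flag kinds.

    `frame_flags`: iterable of 'open_flag', 'closed_flag', or None
    (one entry per frame of the target frame set).
    """
    flags = set(frame_flags) - {None}   # ignore unflagged (ambiguous) frames
    return "open_flag" in flags and "closed_flag" in flags

# The four-frame example from the text: frames 1-2 open, frame 3 unflagged,
# frame 4 closed -> a mouth-opening motion is detected.
print(has_mouth_motion(["open_flag", "open_flag", None, "closed_flag"]))  # True
# Only open-mouth flags present -> no motion.
print(has_mouth_motion(["open_flag", "open_flag", None, None]))           # False
```

Note that this rule deliberately ignores frame order and ambiguous frames, which is what makes the scheme robust to occasional misdetections in shaky video.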
The image processing apparatus may be integrated in a device such as a terminal or a server; the terminal may be, for example, a mobile phone, a tablet computer, a notebook computer, or a PC.
In a specific implementation, the foregoing units may be implemented as separate entities, or combined in any manner and implemented as one or several entities. For the specific implementation of each unit, refer to the foregoing method embodiments; details are not repeated here.
As can be seen from the above, after acquiring video data, the image processing apparatus of this embodiment extracts frames containing facial features from the video data; the determining unit 302 then determines the mouth position in the extracted frames to obtain mouth images; the analyzing unit 303 analyzes the mouth images to obtain mouth features; and the marking unit 304 labels the mouth state according to the mouth features by using a preset rule, providing the basis on which the identifying unit 305 judges whether the mouth is moving, thereby recognizing the mouth motion. Because this scheme depends only weakly on the precision of facial keypoint localization, it is more stable than existing schemes: even if the face shakes in the video, the recognition result is not greatly affected. In short, the scheme can greatly improve recognition accuracy and the recognition effect.
FIG. 4 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application. As shown in FIG. 4, the apparatus includes a processor 401, a non-volatile computer-readable memory 402, a display unit 403, and a network communication interface 404. These components communicate over a bus 405.
In this embodiment, the memory 402 stores a plurality of program modules, including an operating system 406, a network communication module 407, and an application program 408.
The processor 401 can read the modules (not shown in the figure) included in the application program in the memory 402 to execute the various functional applications of image processing and perform data processing. There may be one or more processors 401, and each may be a CPU, a processing unit/module, an ASIC, a logic module, a programmable gate array, or the like.
The operating system 406 may be, for example, a Windows, Android, or Apple iPhone OS operating system.
The application program 408 may include an image processing module 409. The image processing module 409 may include a set of computer-executable instructions 409-1 formed by the obtaining unit 301, the determining unit 302, the analyzing unit 303, the marking unit 304, and the identifying unit 305 of FIG. 3, together with the corresponding metadata and heuristic algorithms 409-2. These computer-executable instruction sets can be executed by the processor 401 to perform the method shown in FIG. 1 or FIG. 2a, or the functions of the image processing apparatus shown in FIG. 3.
In this embodiment, the network communication interface 404 cooperates with the network communication module 407 to send and receive the various network signals of the image processing apparatus.
The display unit 403 has a display panel for the input and display of related information.
If the image processing apparatus has no communication requirement, the network communication interface 404 and the network communication module 407 may be omitted.
A person of ordinary skill in the art can understand that all or part of the steps of the methods in the foregoing embodiments may be performed by related hardware instructed by a program, and the program may be stored in a computer-readable storage medium. The storage medium may include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
The image processing method and apparatus provided in the embodiments of the present application have been described in detail above. Specific examples have been used herein to explain the principles and implementations of the present application, and the description of the above embodiments is intended only to help understand the method and core idea of the present application. Meanwhile, a person skilled in the art may, following the idea of the present application, make changes to the specific implementation and the scope of application. In conclusion, the contents of this specification should not be construed as limiting the present application.

Claims (15)

  1. An image processing method, comprising:
    obtaining video data;
    extracting frames having facial features from the video data;
    determining a mouth position in each of the frames to obtain a mouth image;
    analyzing the mouth image to obtain mouth features;
    labeling a mouth state according to the mouth features by using a preset rule; and
    identifying a mouth motion of the corresponding face in the video data based on the labeled mouth state.
  2. The method according to claim 1, wherein the determining a mouth position in each of the frames to obtain a mouth image comprises:
    locating facial features in each of the frames to obtain coordinate positions of the facial features; and
    determining the mouth position according to the coordinate positions of the facial features to obtain the mouth image.
  3. The method according to claim 2, wherein the locating facial features in each of the frames to obtain coordinate positions of the facial features comprises:
    performing face detection on each of the frames to obtain a rectangular face bounding box;
    performing facial keypoint localization according to the rectangular face bounding box to obtain facial keypoints; and
    determining the coordinate positions of the facial features according to the facial keypoints.
  4. The method according to any one of claims 1 to 3, wherein the analyzing the mouth image to obtain mouth features comprises:
    extracting texture features from the mouth image as the mouth features.
  5. The method according to any one of claims 1 to 3, wherein the labeling a mouth state according to the mouth features by using a preset rule comprises:
    classifying the mouth features by using a regressor or a classifier; and
    labeling the mouth state according to a classification result.
  6. The method according to claim 5, wherein the labeling the mouth state according to a classification result comprises:
    if it is determined according to the classification result that the mouth state is an open state, setting an open-mouth flag for the frame; and
    if it is determined according to the classification result that the mouth state is a closed state, setting a closed-mouth flag for the frame.
  7. The method according to claim 6, wherein the identifying a mouth motion of the corresponding face in the video data based on the labeled mouth state comprises:
    receiving a mouth motion analysis request, the mouth motion analysis request indicating a target face on which mouth motion analysis needs to be performed;
    extracting frames corresponding to the target face from the frames to obtain a target frame set;
    determining whether the target frame set contains both a frame with the open-mouth flag set and a frame with the closed-mouth flag set;
    if so, determining that the target face exhibits a mouth-opening motion; and
    if not, determining that the target face does not exhibit a mouth-opening motion.
  8. An image processing apparatus, comprising:
    an obtaining unit, configured to obtain video data and extract frames having facial features from the video data;
    a determining unit, configured to determine a mouth position in each of the frames to obtain a mouth image;
    an analyzing unit, configured to analyze the mouth image to obtain mouth features;
    a marking unit, configured to label a mouth state according to the mouth features by using a preset rule; and
    an identifying unit, configured to identify a mouth motion of the corresponding face in the video data based on the labeled mouth state.
  9. The apparatus according to claim 8, wherein the determining unit comprises a positioning subunit and a determining subunit;
    the positioning subunit is configured to locate facial features in each of the frames to obtain coordinate positions of the facial features; and
    the determining subunit is configured to determine the mouth position according to the coordinate positions of the facial features to obtain the mouth image.
  10. The apparatus according to claim 9, wherein
    the positioning subunit is further configured to perform face detection on each of the frames to obtain a rectangular face bounding box, perform facial keypoint localization according to the rectangular face bounding box to obtain facial keypoints, and determine the coordinate positions of the facial features according to the facial keypoints.
  11. The apparatus according to any one of claims 8 to 10, wherein
    the analyzing unit is configured to extract texture features from the mouth image as the mouth features.
  12. The apparatus according to any one of claims 8 to 10, wherein the marking unit comprises a classification subunit and a marking subunit;
    the classification subunit is configured to classify the mouth features by using a regressor or a classifier; and
    the marking subunit is configured to label the mouth state according to a classification result.
  13. The apparatus according to claim 12, wherein
    the marking subunit is further configured to: if it is determined according to the classification result that the mouth state is an open state, set an open-mouth flag for the frame; and if it is determined according to the classification result that the mouth state is a closed state, set a closed-mouth flag for the frame.
  14. The apparatus according to claim 13, wherein the identifying unit is further configured to:
    receive a mouth motion analysis request, the mouth motion analysis request indicating a target face on which mouth motion analysis needs to be performed;
    extract frames corresponding to the target face from the frames to obtain a target frame set;
    determine whether the target frame set contains both a frame with the open-mouth flag set and a frame with the closed-mouth flag set;
    if so, determine that the target face exhibits a mouth-opening motion; and
    if not, determine that the target face does not exhibit a mouth-opening motion.
  15. A computer-readable storage medium storing computer-readable instructions that can be executed by at least one processor to perform the method according to any one of claims 1 to 7.
PCT/CN2016/106752 2015-11-25 2016-11-22 Image processing method and apparatus WO2017088727A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/680,976 US10360441B2 (en) 2015-11-25 2017-08-18 Image processing method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510827420.1 2015-11-25
CN201510827420.1A CN106778450B (en) 2015-11-25 2015-11-25 Face recognition method and device

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/079163 Continuation WO2017107345A1 (en) 2015-11-25 2016-04-13 Image processing method and apparatus

Related Child Applications (2)

Application Number Title Priority Date Filing Date
PCT/CN2016/079163 Continuation WO2017107345A1 (en) 2015-11-25 2016-04-13 Image processing method and apparatus
US15/680,976 Continuation US10360441B2 (en) 2015-11-25 2017-08-18 Image processing method and apparatus

Publications (1)

Publication Number Publication Date
WO2017088727A1 true WO2017088727A1 (en) 2017-06-01

Family

ID=58763013

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/106752 WO2017088727A1 (en) 2015-11-25 2016-11-22 Image processing method and apparatus

Country Status (2)

Country Link
CN (1) CN106778450B (en)
WO (1) WO2017088727A1 (en)


Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107330370B (en) * 2017-06-02 2020-06-19 广州视源电子科技股份有限公司 Forehead wrinkle action detection method and device and living body identification method and system
CN107368777A (en) * 2017-06-02 2017-11-21 广州视源电子科技股份有限公司 A kind of smile motion detection method and device and vivo identification method and system
CN107330914B (en) * 2017-06-02 2021-02-02 广州视源电子科技股份有限公司 Human face part motion detection method and device and living body identification method and system
CN107358155A (en) * 2017-06-02 2017-11-17 广州视源电子科技股份有限公司 A kind of funny face motion detection method and device and vivo identification method and system
CN107609474B (en) * 2017-08-07 2020-05-01 深圳市科迈爱康科技有限公司 Limb action recognition method and device, robot and storage medium
CN107992813A (en) * 2017-11-27 2018-05-04 北京搜狗科技发展有限公司 A kind of lip condition detection method and device
CN112826486A (en) * 2019-11-25 2021-05-25 虹软科技股份有限公司 Heart rate estimation method and device and electronic equipment applying same
CN111666820B (en) * 2020-05-11 2023-06-20 北京中广上洋科技股份有限公司 Speech state recognition method and device, storage medium and terminal
CN114299596B (en) * 2022-03-09 2022-06-07 深圳联和智慧科技有限公司 Smart city face recognition matching method and system and cloud platform

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1421816A (en) * 2001-11-23 2003-06-04 纬创资通股份有限公司 Wireless recognition apparatus for fingerprint and method thereof
CN1439997A (en) * 2002-02-22 2003-09-03 杭州中正生物认证技术有限公司 Fingerprint identifying method and system
CN104637246A (en) * 2015-02-02 2015-05-20 合肥工业大学 Driver multi-behavior early warning system and danger evaluation method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101877056A (en) * 2009-12-21 2010-11-03 北京中星微电子有限公司 Facial expression recognition method and system, and training method and system of expression classifier
CN102097003B (en) * 2010-12-31 2014-03-19 北京星河易达科技有限公司 Intelligent traffic safety system and terminal
US9159321B2 (en) * 2012-02-27 2015-10-13 Hong Kong Baptist University Lip-password based speaker verification system
CN104951730B (en) * 2014-03-26 2018-08-31 联想(北京)有限公司 A kind of lip moves detection method, device and electronic equipment
CN104134058B (en) * 2014-07-21 2017-07-11 成都万维图新信息技术有限公司 A kind of face image processing process


Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107451564A (en) * 2017-07-31 2017-12-08 上海爱优威软件开发有限公司 A kind of human face action control method and system
CN109034064A (en) * 2018-07-26 2018-12-18 长沙舍同智能科技有限责任公司 Near-infrared face identification method, device and realization device
CN109034064B (en) * 2018-07-26 2021-01-08 长沙舍同智能科技有限责任公司 Near-infrared face recognition method, device and implementation device
CN109815806A (en) * 2018-12-19 2019-05-28 平安科技(深圳)有限公司 Face identification method and device, computer equipment, computer storage medium
CN111382624A (en) * 2018-12-28 2020-07-07 杭州海康威视数字技术股份有限公司 Action recognition method, device, equipment and readable storage medium
CN111382624B (en) * 2018-12-28 2023-08-11 杭州海康威视数字技术股份有限公司 Action recognition method, device, equipment and readable storage medium
CN110544200A (en) * 2019-08-30 2019-12-06 北京宠拍科技有限公司 method for realizing mouth interchange between human and cat in video
CN110544200B (en) * 2019-08-30 2024-05-24 北京神州数码云科信息技术有限公司 Method for realizing mouth exchange between person and cat in video
CN111611850A (en) * 2020-04-09 2020-09-01 吴子华 Seat use state analysis processing method, system and storage medium

Also Published As

Publication number Publication date
CN106778450B (en) 2020-04-24
CN106778450A (en) 2017-05-31

Similar Documents

Publication Publication Date Title
WO2017088727A1 (en) Image processing method and apparatus
US10956719B2 (en) Depth image based face anti-spoofing
WO2019232866A1 (en) Human eye model training method, human eye recognition method, apparatus, device and medium
Carcagnì et al. Facial expression recognition and histograms of oriented gradients: a comprehensive study
WO2019232862A1 (en) Mouth model training method and apparatus, mouth recognition method and apparatus, device, and medium
US8913798B2 (en) System for recognizing disguised face using gabor feature and SVM classifier and method thereof
US9405962B2 (en) Method for on-the-fly learning of facial artifacts for facial emotion recognition
US10706267B2 (en) Compact models for object recognition
Samangouei et al. Attribute-based continuous user authentication on mobile devices
US10733279B2 (en) Multiple-tiered facial recognition
US9575566B2 (en) Technologies for robust two-dimensional gesture recognition
GB2500321A (en) Dealing with occluding features in face detection methods
Smith-Creasey et al. Continuous face authentication scheme for mobile devices with tracking and liveness detection
EP2370932B1 (en) Method, apparatus and computer program product for providing face pose estimation
Findling et al. Towards face unlock: on the difficulty of reliably detecting faces on mobile phones
US10360441B2 (en) Image processing method and apparatus
Dave et al. Face recognition in mobile phones
US10296782B2 (en) Processing device and method for face detection
Kawulok Energy-based blob analysis for improving precision of skin segmentation
Han et al. Efficient eye-blinking detection on smartphones: A hybrid approach based on deep learning
Lahiani et al. Hand pose estimation system based on Viola-Jones algorithm for android devices
Findling et al. Towards pan shot face unlock: Using biometric face information from different perspectives to unlock mobile devices
CN110363187B (en) Face recognition method, face recognition device, machine readable medium and equipment
Karappa et al. Detection of sign-language content in video through polar motion profiles
US11074676B2 (en) Correction of misaligned eyes in images

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16867950

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 06/11/2018)

122 Ep: pct application non-entry in european phase

Ref document number: 16867950

Country of ref document: EP

Kind code of ref document: A1