WO2017107345A1

WO2017107345A1 - Image processing method and apparatus

Info

Publication number: WO2017107345A1
Application number: PCT/CN2016/079163
Authority: WO
Inventors: 倪辉
Original assignee: 腾讯科技（深圳）有限公司
Priority date: 2015-12-26
Filing date: 2016-04-13
Publication date: 2017-06-29
Also published as: CN106919891B; CN106919891A

Abstract

Disclosed are an image processing method and apparatus. The method may comprise: detecting a face area in each frame of images contained in a video to be processed, and positioning a lip area from the face area (S101); extracting a feature column pixel of the lip area from each frame of images to construct a lip variation graph (S102); and performing lip movement recognition according to a texture feature of the lip variation graph to obtain a recognition result (S103). A lip movement is recognised according to a lip variation of an image over a time span, which can avoid the impact of amplitude of the lip variation, improve the accuracy of a recognition result, and improve the practicability of image processing.

Description

Image processing method and device

The present application claims the priority of the priority of the application Serial No. 201510996643.0, filed on Dec.

Technical field

The present invention relates to the field of Internet technologies, and in particular, to the field of video image processing technologies, and in particular, to an image processing method and apparatus.

Background technique

Some Internet scenes usually involve the process of lip recognition. For example, in an identity authentication scenario, in order to prevent illegal users from using static pictures to confuse audio and video, it is usually necessary to record a video image of the user's speech, and then perform lip movement recognition on the video image. Process to confirm the identity of a legitimate user. One of the solutions for performing lip movement recognition processing on an image in the prior art is: calculating the area of the lip region in each frame of the image in the video, and confirming whether the lip occurs by the difference in the area of the lip region between the frame images. move. Another solution is: extracting a lip opening and closing state in each frame of the image in the video, and detecting whether a lip motion occurs according to the opening and closing amplitude. The prior art relies on the amplitude of the change of the lip. If the change of the lip is small, the change of the area of the lip area and the extent of the opening and closing of the lip are not obvious enough, which may affect the accuracy of the lip recognition result and influence. The utility of the prior art solutions.

Summary of the invention

The embodiment of the invention provides an image processing method and device, which can recognize the lip movement according to the lip variation of the image over time span, can avoid the influence of the lip variation amplitude, improve the accuracy of the recognition result, and improve the practicability of the image processing.

A first aspect of the embodiments of the present invention provides an image processing method, which may include:

Detecting a face area in each frame image included in the to-be-processed video, and locating a lip area from the face area;

Extracting a feature column pixel of the lip region from each of the frame images to construct a lip variation map;

The lip motion recognition is performed according to the texture feature of the lip change map, and the recognition result is obtained.

Preferably, the detecting a face area in each frame image included in the to-be-processed video, and locating the lip area from the face area, includes:

Parsing the processed video to obtain at least one frame of image;

A face detection algorithm is used to detect a face region in each frame image;

A lip registration region is used to locate the lip region from the face region.

Preferably, the feature column pixel construction lip variation map of extracting a lip region from each frame image comprises:

Intercepting a lip region map in each frame of image;

Extracting a feature column pixmap from the lip region map;

The extracted feature column pixel map is spliced according to the chronological order of each frame image to obtain a lip change map.

Preferably, the extracting the feature column pixmap from the lip region map comprises:

Determining a preset position in the lip region map;

Drawing a vertical axis along the preset position;

A column of pixel maps composed of all the pixels located on the vertical axis in the lip region map is extracted as a feature column pixmap.

Preferably, the preset position is a central pixel point position of the lip region map.

Preferably, the lip movement recognition is performed according to the texture feature of the lip change map, and the recognition result is obtained, including:

Calculating a texture feature of the lip variation map, the texture feature comprising an LBP feature and/or an HOG feature;

The texture features are classified by using a preset classification algorithm to obtain a lip motion recognition result, and the recognition result includes: lip movement or lip motion does not occur.

A second aspect of the embodiments of the present invention provides an image processing apparatus, which may include:

a positioning unit, configured to detect a face area in each frame image included in the to-be-processed video, and locate a lip area from the face area;

a building unit, configured to extract a feature column pixel of the lip region from the image of each frame to construct a lip variation map;

a lip motion recognition unit configured to perform lip motion recognition according to the texture feature of the lip change map to obtain a recognition result.

Preferably, the positioning unit comprises:

a parsing unit, configured to parse the video to be processed to obtain at least one frame of image;

a face detecting unit, configured to detect a face region in each frame image by using a face detection algorithm;

A face registration unit is configured to locate a lip region from the face region using a face registration algorithm.

Preferably, the building unit comprises:

An intercepting unit for intercepting a lip region map in each frame image;

An extracting unit, configured to extract a feature column pixmap from the lip region map;

The splicing processing unit is configured to perform splicing processing on the extracted feature column pixmap according to the chronological order of each frame image to obtain a lip change map.

Preferably, the extracting unit comprises:

a position determining unit, configured to determine a preset position in the lip area map;

a vertical axis determining unit for drawing a vertical axis along the preset position;

The feature column pixel extracting unit is configured to extract a column of pixel maps composed of all the pixels located on the vertical axis in the lip region map as a feature column pixel map.

Preferably, the lip movement recognition unit comprises:

a calculating unit, configured to calculate a texture feature of the lip variation map, the texture feature comprising an LBP (Local Binary Patterns) feature and/or a HOG (Histogram of Oriented Gradient) feature;

And a classification unit, configured to classify the texture features by using a preset classification algorithm, to obtain a lip recognition result, where the recognition result comprises: lip motion or lip motion does not occur.

Embodiments of the present invention have the following beneficial effects:

In the embodiment of the present invention, the face region detection and the lip region localization are performed on each frame image included in the video, and the feature column pixels of the lip region are extracted from each frame image to construct a lip variation map, because the lip portion The change graph is from each frame image, which enables the lip change map to reflect the time span of each image composition as a whole; the lip motion recognition through the texture feature of the lip change map obtains the recognition result, that is, the lip according to the time span Change recognition lip movement, can avoid the influence of the amplitude of the lip change, identify the effect The rate is higher and the recognition result is more accurate.

DRAWINGS

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the description of the prior art will be briefly described below. Obviously, the drawings in the following description are only It is a certain embodiment of the present invention, and other drawings can be obtained from those skilled in the art without any creative work.

FIG. 1 is a flowchart of an image processing method according to an embodiment of the present invention;

2 is a schematic structural diagram of an Internet device according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present invention.

detailed description

The technical solutions in the embodiments of the present invention are clearly and completely described in the following with reference to the accompanying drawings in the embodiments of the present invention. It is obvious that the described embodiments are only a part of the embodiments of the present invention, but not all embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative efforts are within the scope of the present invention.

In the embodiment of the present invention, the face region detection and the lip region localization are performed on each frame image included in the video, and the feature column pixels of the lip region are extracted from each frame image to construct a lip variation map, because the lip portion The change graph is from each frame image, which enables the lip change map to reflect the time span of each image composition as a whole; the lip motion recognition through the texture feature of the lip change map obtains the recognition result, that is, the lip according to the time span The change recognizes the lip movement, which can avoid the influence of the change range of the lip, the recognition efficiency is high and the recognition result is high.

The image processing method of the embodiment of the present invention can be applied to many Internet scenarios, for example, in a voice input scenario, the voice acquisition process can be controlled by lip movement recognition of the user's speech video; for example, in an identity authentication scenario. The identity of the legitimate user can be confirmed by lip-moving recognition of the user's speaking video, and the illegal user is prevented from using the static picture to confuse the audiovisual; Similarly, the image processing apparatus of the embodiment of the present invention can be applied to various devices in an Internet scenario, for example, can be applied to a terminal or applied to a server.

Based on the foregoing description, an embodiment of the present invention provides an image processing method. Referring to FIG. 1, the method may perform the following steps S101-S103.

S101: Detect a face region in each frame image included in the to-be-processed video, and locate a lip region from the face region.

The video to be processed may be a video recorded in real time. For example, when the user initiates a voice input request to the terminal, the terminal may record the user's speaking video as a to-be-processed video in real time. The to-be-processed video may also be the received real-time video. For example, when the server authenticates the terminal-side user, the server may receive the user-speaking video recorded by the terminal in real time as the to-be-processed video. The face detection technology refers to determining whether a given image contains a human face by using a certain strategy scan, and determining the position, size and posture of the face in the image after determining the inclusion. The face registration technique refers to the use of a certain algorithm to clearly distinguish the contours of the face, nose, and lips of the face according to the position, size, and posture of the face. The method of the present embodiment specifically relates to the face detection technology and the face registration technology in the process of performing step S101; specifically, the method performs the following steps s11-s13 when performing step S101:

S11, parsing the processed video to obtain at least one frame of image. The video is composed of one frame and one frame of image in chronological order. Therefore, the frame to be processed can be framed to obtain one frame and one frame of image.

S12, using a face detection algorithm to detect a face region in each frame image.

The face detection algorithm may include, but is not limited to, a PCA (Principal Component Analysis) algorithm, an elastic model based method, a Hidden Markov Model, and the like. For each frame image obtained by the video framing process, a face detection algorithm can be used to determine a face area, which is used to display the position, size and posture of the face in each frame image.

S13, using a face registration algorithm to locate a lip region from the face region.

The face registration algorithm may include, but is not limited to, a Lasso face regression registration algorithm, a wavelet domain algorithm, and the like. The face registration algorithm can accurately locate the lip region for the face position, size and posture of the face region displayed in each frame image.

S102. Extract a feature column of the lip region from the image of each frame to construct a lip variation map.

The lip change map requires an overall reflection of the lip change over the time span. Since the video is composed of one frame and one frame of image in chronological order, and the video is movable over the time span of the frame of each frame The state reflects the change of the lip. Therefore, this step can construct the lip change map by using the changing characteristics of the lip region in each frame image. In a specific implementation, the method performs the following steps s21-s23 when performing step S101:

S21, the lip region map is intercepted in each frame image. Since the lip region has been accurately positioned from each frame image, the lip region map can be directly intercepted from each frame image in this step s21, and then the first lip region map can be intercepted in the first frame image. A second lip area map can be captured in the second frame image, and so on.

S22, extracting a feature column pixmap from the lip region map.

The feature column pixel refers to a column of pixels in a frame image that can reflect the characteristics of the lip change, and the image formed by the feature column is referred to as a feature column pixel map. In a specific implementation, the method performs the following steps ss221-ss223 when performing step s22:

Ss221, determining a preset position in the lip area map.

The preset position may be the position of any pixel point in the lip area map. The change of the center of the lip is most obvious when the lip is moved. Therefore, in the embodiment of the present invention, the preset position is the lip area. The center pixel position of the graph.

Ss222, drawing a vertical axis along the preset position.

Ss223 extracts a column of pixel maps composed of all the pixels located on the vertical axis in the lip region map as a feature column pixel map.

The change of the lip during lip movement is directly manifested by the opening of the lip, which belongs to the longitudinal change of the lip. Therefore, in step ss222-ss223, the characteristic column pixmap can be extracted longitudinally along the preset position; it can be understood that if the pre-prevention The position is the central pixel position of the lip region map, and the extracted feature column pixel map is a column of pixel maps in the center of the lip region.

S23: Perform splicing processing on the extracted feature column pixmap according to the chronological order of each frame image to obtain a lip change map.

After the above step s22, the feature column pixel map can be extracted from the preset position in each frame image, and the lip change map obtained by splicing the feature column pixel map extracted from each frame image also reflects the lip portion. The change in the preset position. Taking the preset position as the central pixel position of the lip region map as an example: the pixel map of the central column of the lip region extracted from the first frame image may be referred to as a first central column pixel map; Extracted into the central column pixel map of the lip region, which can be called the second central column image The splicing process in the step s23 may be: after the second central column pixel image is horizontally spliced into the first central column pixel image, the third central column pixel image is horizontally spliced to the second central portion. After the pixmap is listed, and so on, a lip change map is formed, which reflects the change in the center of the lip.

S103: Perform lip movement recognition according to the texture feature of the lip change map to obtain a recognition result.

Lip recognition is the process of confirming whether or not lip movement occurs. The method specifically performs the following steps s31-s32 when performing step S103:

S31. Calculate texture features of the lip variation map, including but not limited to: LBP features and/or HOG features.

The LBP feature can effectively describe and measure the local texture information of the image, and has significant advantages such as rotation invariance and gray invariance. In the process of performing step s31, the LBP algorithm can be used to calculate the LBP feature of the lip change map. . The HOG feature is a feature descriptor for performing object detection in image processing; in the process of performing step s31, the HOG algorithm may be used to calculate the HOG feature of the lip change map. It can be understood that the texture feature may also include other features such as SIFT features, so the method may also use other algorithms to calculate the texture features of the lip variation map during the execution of step s31.

S32: classify the texture feature by using a preset classification algorithm to obtain a lip recognition result, where the recognition result includes: lip motion or lip motion does not occur.

The preset classification algorithm may include, but is not limited to, a Bayesian algorithm, a logistic regression algorithm, and an SVM (Support Vector Machine) algorithm. Taking the SVM algorithm as an example, the texture feature is substituted into the SVM algorithm classifier as an input parameter, and the SVM algorithm classifier can output the classification result (ie, the lip recognition result).

In the embodiment of the present invention, the image processing method is used to perform face region detection and lip region localization for each frame image included in the video, and the feature column pixel of the lip region is extracted from each frame image to construct a lip variation map. Since the lip change map is from each frame image, the lip change map can reflect the time span of each image composition as a whole; the lip motion recognition through the texture feature of the lip change map obtains the recognition result, that is, according to the time span. The lip change on the upper side recognizes the lip movement, which can avoid the influence of the amplitude of the lip change, and the recognition efficiency is high and the recognition result is high.

The embodiment of the present invention further provides an Internet device, which may be a terminal or a server. Referring to FIG. 2, the internal structure of the Internet device may include, but is not limited to, processing. , user interface, network interface and memory. The processor, the user interface, the network interface, and the memory in the Internet device can be connected by a bus or other means. In FIG. 2 shown in the embodiment of the present invention, a bus connection is taken as an example.

The user interface is a medium for realizing interaction and information exchange between the user and the Internet device, and the specific embodiment thereof may include a display for output and a keyboard for input, etc., it should be noted that The keyboard here can be either a physical keyboard or a touch screen virtual keyboard, or a keyboard combining physical and touch screen virtual. A processor (or CPU (Central Processing Unit)) is the computing core and control core of an Internet device, which can parse various types of instructions in the Internet device and process various types of data. Memory is a memory device in an Internet device that stores programs and data. It can be understood that the memory herein may be a high speed RAM memory, or may be a non-volatile memory, such as at least one disk memory; optionally, at least one storage located away from the foregoing processor. Device. The memory provides a storage space that stores an operating system of the Internet device and also stores an image processing device.

In the embodiment of the present invention, the Internet device can execute the corresponding steps of the method flow shown in FIG. 1 by running the image processing device in the memory. Referring to FIG. 3 together, the image processing apparatus operates as follows:

The locating unit 101 is configured to detect a face area in each frame image included in the to-be-processed video, and locate a lip area from the face area.

The building unit 102 is configured to extract a feature column pixel construction lip change map of the lip region from each frame image.

The lip motion recognition unit 103 is configured to perform lip motion recognition according to the texture feature of the lip change map to obtain a recognition result.

In a specific implementation, the image processing apparatus runs the following unit in the process of running the positioning unit 101:

The parsing unit 1001 is configured to parse the video to be processed to obtain at least one frame of image.

The face detecting unit 1002 is configured to detect a face region in each frame image by using a face detection algorithm. area.

The face registration unit 1003 is configured to locate a lip region from the face region by using a face registration algorithm.

In a specific implementation, the image processing apparatus runs the following units in the process of running the building unit 102:

The intercepting unit 2001 is configured to intercept a lip region map in each frame image.

The extracting unit 2002 is configured to extract a feature column pixmap from the lip region map.

The splicing processing unit 2003 is configured to perform splicing processing on the extracted feature column pixmap according to the chronological order of each frame image to obtain a lip change map.

In a specific implementation, the image processing apparatus runs the following unit in the process of running the extracting unit 2002:

The position determining unit 2221 is configured to determine a preset position in the lip area map; preferably, the preset position is a central pixel point position of the lip area map.

The vertical axis determining unit 2222 is configured to draw a vertical axis along the preset position.

The feature column pixel extracting unit 2223 is configured to extract, as a feature column pixmap, a column of pixel maps composed of all the pixels located on the vertical axis in the lip region map.

In a specific implementation, the image processing apparatus runs the following unit in the process of running the lip recognition unit 103:

The calculating unit 3001 is configured to calculate a texture feature of the lip variation map, the texture feature including an LBP feature and/or an HOG feature.

The classification unit 3002 is configured to classify the texture features by using a preset classification algorithm to obtain a lip motion recognition result, where the recognition result includes: lip motion or lip motion does not occur.

In the same manner as the method shown in FIG. 2, the embodiment of the present invention performs face detection and lip region localization for each frame image included in the video by running the image processing device, and extracts the lip from each frame image. The feature column of the region constructs a lip variation map. Since the lip variation map is from each frame image, the lip variation map can reflect the time span of each image composition as a whole; the lip motion is performed by the texture feature of the lip variation map. Recognizing the recognition result, that is, recognizing the lip movement according to the change of the lip over the time span, can avoid the influence of the amplitude of the lip change, the recognition efficiency is high and the recognition result is high.

One of ordinary skill in the art can understand all or part of the process in implementing the above embodiments. This may be accomplished by a computer program instructing the associated hardware, which may be stored in a computer readable storage medium, which, when executed, may include the flow of an embodiment of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).

The above is only the preferred embodiment of the present invention, and the scope of the present invention is not limited thereto, and thus equivalent changes made in the claims of the present invention are still within the scope of the present invention.

Claims

An image processing method, comprising:

Detecting a face area in each frame image included in the to-be-processed video, and locating a lip area from the face area;

Extracting a feature column pixel of the lip region from each of the frame images to construct a lip variation map;

The lip motion recognition is performed according to the texture feature of the lip change map, and the recognition result is obtained.
The method according to claim 1, wherein the detecting a face region in each frame image included in the video to be processed and locating the lip region from the face region comprises:

Parsing the processed video to obtain at least one frame of image;

A face detection algorithm is used to detect a face region in each frame image;

A lip registration region is used to locate the lip region from the face region.
The method of claim 2, wherein the extracting the characteristic column of the lip region from the image of each frame to construct a lip variation map comprises:

Intercepting a lip region map in each frame of image;

Extracting a feature column pixmap from the lip region map;

The extracted feature column pixel map is spliced according to the chronological order of each frame image to obtain a lip change map.
The method of claim 3, wherein the extracting the feature column pixmap from the lip region map comprises:

Determining a preset position in the lip region map;

Drawing a vertical axis along the preset position;

A column of pixel maps composed of all the pixels located on the vertical axis in the lip region map is extracted as a feature column pixmap.
The method of claim 4 wherein said predetermined position is a central pixel point location of said lip region map.
The method according to any one of claims 1 to 5, wherein the lip movement recognition is performed according to the texture feature of the lip change map, and the recognition result is obtained, including:

Calculating a texture feature of the lip variation map, the texture feature comprising an LBP feature and/or an HOG feature;

The texture features are classified by using a preset classification algorithm to obtain a lip motion recognition result, and the recognition result includes: lip movement or lip motion does not occur.
An image processing apparatus, comprising:

a positioning unit, configured to detect a face area in each frame image included in the to-be-processed video, and locate a lip area from the face area;

a building unit, configured to extract a feature column pixel of the lip region from the image of each frame to construct a lip variation map;

a lip motion recognition unit configured to perform lip motion recognition according to the texture feature of the lip change map to obtain a recognition result.
The device according to claim 7, wherein the positioning unit comprises:

a parsing unit, configured to parse the video to be processed to obtain at least one frame of image;

a face detecting unit, configured to detect a face region in each frame image by using a face detection algorithm;

A face registration unit is configured to locate a lip region from the face region using a face registration algorithm.
The apparatus of claim 8 wherein said building unit comprises:

An intercepting unit for intercepting a lip region map in each frame image;

An extracting unit, configured to extract a feature column pixmap from the lip region map;

The splicing processing unit is configured to perform splicing processing on the extracted feature column pixmap according to the chronological order of each frame image to obtain a lip change map.
The apparatus according to claim 9, wherein said extracting unit comprises:

a position determining unit, configured to determine a preset position in the lip area map;

a vertical axis determining unit for drawing a vertical axis along the preset position;

The feature column pixel extracting unit is configured to extract a column of pixel maps composed of all the pixels located on the vertical axis in the lip region map as a feature column pixel map.
The device of claim 10 wherein said predetermined position is a central pixel point location of said lip region map.
The device according to any one of claims 7 to 11, wherein the lip movement recognition unit comprises:

a calculating unit, configured to calculate a texture feature of the lip variation map, the texture feature comprising an LBP feature and/or an HOG feature;

And a classification unit, configured to classify the texture features by using a preset classification algorithm, to obtain a lip recognition result, where the recognition result comprises: lip motion or lip motion does not occur.
An internet device, including:

a memory that stores a set of program code;

a processor for executing the program code to perform the following operations:

Detecting a face area in each frame image included in the to-be-processed video, and locating a lip area from the face area;

Extracting a feature column pixel of the lip region from each of the frame images to construct a lip variation map;

The lip motion recognition is performed according to the texture feature of the lip change map, and the recognition result is obtained.
The Internet device according to claim 13, wherein the detecting a face region in each frame image included in the to-be-processed video and locating the lip region from the face region comprises:

Parsing the processed video to obtain at least one frame of image;

A face detection algorithm is used to detect a face region in each frame image;

A lip registration region is used to locate the lip region from the face region.
The Internet device according to claim 14, wherein the feature column pixel construction lip variation map for extracting a lip region from each frame image comprises:

Intercepting a lip region map in each frame of image;

Extracting a feature column pixmap from the lip region map;

The extracted feature column pixel map is spliced according to the chronological order of each frame image to obtain a lip change map.
The Internet device according to claim 15, wherein the extracting the feature column pixmap from the lip region map comprises:

Determining a preset position in the lip region map;

Drawing a vertical axis along the preset position;

A column of pixel maps composed of all the pixels located on the vertical axis in the lip region map is extracted as a feature column pixmap.
The Internet device according to claim 16, wherein said preset position is a central pixel point position of said lip region map.
The Internet device according to any one of claims 13 to 17, wherein the lip recognition is performed according to the texture feature of the lip change map, and the recognition result is obtained, including:

Calculating a texture feature of the lip variation map, the texture feature comprising an LBP feature and/or an HOG feature;

The texture features are classified by using a preset classification algorithm to obtain a lip motion recognition result, and the recognition result includes: lip movement or lip motion does not occur.