CN111160178A - Image processing method and device, processor, electronic device and storage medium - Google Patents


Info

Publication number
CN111160178A
Authority
CN
China
Prior art keywords
position information
image
key point
reference points
point
Prior art date
Legal status
Granted
Application number
CN201911322102.4A
Other languages
Chinese (zh)
Other versions
CN111160178B (en)
Inventor
Li Ruodai (李若岱)
Gao Zhefeng (高哲峰)
Zhuang Nanqing (庄南庆)
Ma Kun (马堃)
Current Assignee
Shenzhen Sensetime Technology Co Ltd
Original Assignee
Shenzhen Sensetime Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Sensetime Technology Co Ltd
Priority to CN201911322102.4A
Publication of CN111160178A
Application granted
Publication of CN111160178B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS; G06: COMPUTING; CALCULATING OR COUNTING
    • G06V 40/45: Detection of the body part being alive (G06V 40/40 Spoof detection, e.g. liveness detection; G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data)
    • G06T 7/55: Depth or shape recovery from multiple images (G06T 7/50 Depth or shape recovery; G06T 7/00 Image analysis)
    • G06V 40/103: Static body considered as a whole, e.g. static pedestrian or occupant recognition (G06V 40/10 Human or animal bodies; body parts, e.g. hands)
    • G06V 40/161: Detection; Localisation; Normalisation (G06V 40/16 Human faces, e.g. facial parts, sketches or expressions)
    • G06T 2207/20228: Disparity calculation for image-based rendering (G06T 2207/20 Special algorithmic details; G06T 2207/00 Indexing scheme for image analysis or image enhancement)
    • G06T 2207/30196: Human being; Person (G06T 2207/30 Subject of image; Context of image processing)
    • G06T 2207/30201: Face

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses an image processing method and device, a processor, an electronic device and a storage medium. The method comprises the following steps: acquiring binocular images and parameters of the binocular camera used to acquire the binocular images; obtaining, according to the binocular images, depth information, horizontal position information and vertical position information of at least four reference points in the human body region; obtaining three-dimensional position information of the at least four reference points in a world coordinate system according to the parameters of the binocular camera and the horizontal position information, vertical position information and depth information of the at least four reference points; and determining that the person object to be detected is a living body under the condition that the variance, in the depth direction, of the three-dimensional position information of the at least four reference points in the world coordinate system is greater than or equal to a first threshold. Corresponding products are also disclosed.

Description

Image processing method and device, processor, electronic device and storage medium
Technical Field
The present application relates to the field of security technologies, and in particular, to an image processing method and apparatus, a processor, an electronic device, and a storage medium.
Background
With the development of face recognition technology, it has been widely applied in different scenarios. Confirming the identity of a person through face recognition is an important application, for example in real-name authentication and identity verification. Recently, however, attacks on face recognition systems using "non-living" face images, such as paper photographs and electronic images, have become increasingly common. In such an attack, a "non-living" face image is presented in place of a real person's face. Effectively preventing attacks on face recognition technology by "non-living" face images is therefore of great significance.
Traditional methods determine whether the person in an acquired face image is a living body based on a binocular image captured by a binocular camera, but their detection accuracy is low.
Disclosure of Invention
The application provides an image processing method and device, a processor, an electronic device and a storage medium, which are used for detecting whether a person object to be detected is a living body.
In a first aspect, an image processing method is provided, the method comprising:
acquiring binocular images and parameters of the binocular camera used to acquire the binocular images, wherein the binocular images comprise the human body region of a person object to be detected;
obtaining, according to the binocular images, depth information of at least four reference points in the human body region, horizontal position information of the at least four reference points and vertical position information of the at least four reference points, wherein the reference points comprise face key points, or comprise both face key points and trunk key points;
obtaining three-dimensional position information of the at least four reference points in a world coordinate system according to the parameters of the binocular camera, the horizontal position information of the at least four reference points, the vertical position information of the at least four reference points and the depth information of the at least four reference points;
and under the condition that the variance of the three-dimensional position information of the at least four reference points in the world coordinate system in the depth direction is greater than or equal to a first threshold value, determining that the human object to be detected is a living body, wherein the depth direction is a direction perpendicular to an image plane of the binocular camera when the binocular camera collects the binocular image.
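The decision rule in the last step can be sketched as follows; the function name, the sample coordinates and the threshold value are illustrative assumptions, not values taken from this application:

```python
import numpy as np

def is_live_by_depth_variance(points_3d, depth_var_threshold=4.0):
    """Decide liveness from the depth spread of the reference points.

    points_3d: (N, 3) array of (X, Y, Z) world coordinates, N >= 4.
    A flat attack medium (paper photo, screen) puts every reference
    point at nearly the same depth, so the variance of the Z column
    stays close to zero; a real face and torso do not.
    """
    points_3d = np.asarray(points_3d, dtype=float)
    depth_variance = np.var(points_3d[:, 2])  # variance in the depth direction
    return depth_variance >= depth_var_threshold

# Four reference points on a flat photo: identical depths.
flat = [[0, 0, 50], [10, 0, 50], [0, 10, 50], [10, 10, 50]]
# Four reference points on a real person: depths spread out.
solid = [[0, 0, 50], [10, 0, 58], [0, 10, 57], [10, 10, 44]]
```

With the illustrative threshold, `flat` is rejected and `solid` is accepted; in practice the first threshold would be calibrated on real binocular data.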
In this aspect, the depth information of at least four reference points on the person object to be detected is obtained from the binocular image, and the three-dimensional position information of those reference points in the world coordinate system is then derived. From this three-dimensional position information, it can be determined whether the human body region of the person object to be detected is a three-dimensional region, which effectively defends against two-dimensional attacks on face recognition technology. Because the three-dimensional position information is obtained with the same hardware used by a two-dimensional liveness detection method, this implementation improves detection accuracy without increasing hardware cost.
In one possible implementation, the binocular image includes: a first image to be processed and a second image to be processed; the binocular camera includes: the first camera is used for collecting the first image to be processed and the second camera is used for collecting the second image to be processed; the parameters of the binocular camera include: a distance between the first camera and the second camera, a first focal length of the first camera, and a second focal length of the second camera;
the obtaining of the depth information of at least four reference points in the human body region according to the binocular image includes:
obtaining parallax images of the first image to be processed and the second image to be processed according to the first image to be processed and the second image to be processed, wherein the parallax images carry parallax information of the at least four reference points;
performing stereo correction processing on the first image to be processed and the second image to be processed to normalize the first focal length and the second focal length to obtain a normalized focal length;
and obtaining the depth information of the at least four reference points according to the parallax information of the at least four reference points, the normalized focal length and the distance.
In this possible implementation, performing stereo correction (rectification) processing on the first image to be processed and the second image to be processed normalizes the first focal length of the first camera and the second focal length of the second camera, yielding the normalized focal length, and reduces the vertical displacement difference of homonymous (corresponding) points between the two images. Obtaining the depth information of the at least four reference points from their parallax information, the distance between the first camera and the second camera, and the normalized focal length improves the accuracy of the obtained depth information of the reference points.
In another possible implementation, the at least four reference points include a first reference point;
the obtaining the depth information of the at least four reference points according to the parallax information of the at least four reference points, the normalized focal length and the distance includes:
determining the product of the normalized focal length and the distance to obtain a first intermediate value;
and determining the quotient of the first intermediate value and the parallax information of the first reference point to obtain the depth information of the first reference point.
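These two steps are the standard depth-from-disparity relation Z = f·b/d. A minimal sketch, with illustrative numbers (the names are our own, not from this application):

```python
import numpy as np

def depth_from_disparity(disparity, normalized_focal_px, baseline):
    """Two steps as described above: the first intermediate value is the
    product of the normalized focal length and the camera distance, and
    the depth is the quotient of that value and the disparity."""
    first_intermediate = normalized_focal_px * baseline
    return first_intermediate / np.asarray(disparity, dtype=float)

# e.g. normalized focal length 800 px, 60 mm baseline, 96 px disparity:
# depth = 800 * 60 / 96 = 500 mm
```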
In another possible implementation manner, the performing stereo correction processing on the first image to be processed and the second image to be processed to normalize the first focal length and the second focal length to obtain a normalized focal length includes:
performing stereo correction processing on the first image to be processed and the second image to be processed to normalize the parameters of the first camera and the parameters of the second camera to obtain normalized camera parameters;
the obtaining of the three-dimensional position information of the at least four reference points in the world coordinate system according to the parameters of the binocular camera, the horizontal position information of the at least four reference points, the vertical position information of the at least four reference points, and the depth information of the at least four reference points includes:
and obtaining three-dimensional position information of the at least four reference points in a world coordinate system according to the normalized camera parameters, the horizontal position information of the at least four reference points, the vertical position information of the at least four reference points and the depth information of the at least four reference points.
In the former possible implementation manner, by performing stereo correction processing on the first image to be processed and the second image to be processed, the vertical displacement difference of the homonymous point in the first image to be processed and the second image to be processed can be reduced, and the normalized camera parameter is obtained. In this possible implementation manner, the three-dimensional position information of the at least four reference points in the world coordinate system is obtained by using the normalized camera parameters, the horizontal position information of the at least four reference points, the vertical position information of the at least four reference points, and the depth information of the at least four reference points, so that the accuracy of the obtained three-dimensional position information of the at least four reference points in the world coordinate system can be improved.
In yet another possible implementation, the at least four reference points include a second reference point; the normalized camera parameters include: the normalized horizontal position information of the central point of the camera and the normalized vertical position information of the central point are obtained, wherein the central point is the intersection point of the image plane of the first camera and the optical axis of the first camera; the normalized focal length includes: the horizontal position information of the normalized focal length and the vertical position information of the normalized focal length; the three-dimensional position information of the at least four reference points in the world coordinate system comprises: horizontal position information of the at least four reference points in a world coordinate system, vertical position information of the at least four reference points in the world coordinate system and depth position information of the at least four reference points in the world coordinate system;
the obtaining three-dimensional position information of the at least four reference points in a world coordinate system according to the normalized camera parameters, the horizontal position information of the at least four reference points, the vertical position information of the at least four reference points, and the depth information of the at least four reference points includes:
determining a difference between the horizontal position information of the second reference point and the horizontal position information of the central point to obtain a second intermediate value, and determining a quotient of the depth information of the second reference point and the horizontal position information of the normalized focal length to obtain a third intermediate value;
determining a difference between the vertical position information of the second reference point and the vertical position information of the central point to obtain a fourth intermediate value, and determining a quotient of the depth information of the second reference point and the vertical position information of the normalized focal length to obtain a fifth intermediate value;
and taking the product of the second intermediate value and the third intermediate value as the horizontal position information of the second reference point in the world coordinate system, taking the product of the fourth intermediate value and the fifth intermediate value as the vertical position information of the second reference point in the world coordinate system, and taking the depth information of the second reference point as the depth position information of the second reference point in the world coordinate system.
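Reading the quotient terms as depth divided by focal length, these steps are the standard pinhole back-projection X = (u - cx)·Z/fx, Y = (v - cy)·Z/fy, Z = depth. A sketch under that assumption (all names are illustrative):

```python
import numpy as np

def back_project(u, v, depth, fx, fy, cx, cy):
    """Second intermediate value: u - cx; third: depth / fx; their
    product is the horizontal world coordinate, and likewise for the
    vertical one. The depth passes through as the third coordinate."""
    X = (u - cx) * (depth / fx)
    Y = (v - cy) * (depth / fy)
    return np.array([X, Y, float(depth)])

# Pixel (900, 700) at depth 500 with fx = fy = 800, principal point (640, 360):
point = back_project(900, 700, 500, 800, 800, 640, 360)  # [162.5, 212.5, 500.0]
```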
In another possible implementation manner, in a case that the variance of the three-dimensional position information of the at least four reference points in the world coordinate system in the depth direction is greater than or equal to a first threshold, the determining that the human object to be detected is a living body includes:
constructing a matrix by using the three-dimensional position information of the at least four reference points in a world coordinate system, so that each row in the matrix comprises the three-dimensional position information of one reference point, and obtaining a coordinate matrix;
and determining at least one singular value of the coordinate matrix, and determining that the human object to be detected is a living body under the condition that the ratio of the minimum value of the at least one singular value to the sum of the at least one singular value is greater than or equal to a second threshold value.
In this possible implementation, a coordinate matrix is constructed from the three-dimensional coordinates of the at least four reference points in the world coordinate system, and the singular values of this matrix are used to determine whether the human body region of the person object to be detected is a three-dimensional region, and hence whether the person object is a living body. Since such a coordinate matrix can be constructed for any human body region of the person object to be detected, the method of this embodiment is applicable to any scene, which improves the universality of the three-dimensional liveness detection method.
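A sketch of this singular-value test. Mean-centering the coordinate matrix before the decomposition is our addition (it makes a perfectly flat target exactly rank-2), and the second-threshold value is illustrative:

```python
import numpy as np

def is_live_by_svd(points_3d, ratio_threshold=0.05):
    """Build an N x 3 coordinate matrix with one reference point per row,
    compute its singular values, and accept the target as a living body
    when the smallest singular value carries a non-negligible share of
    their sum. Coplanar points leave the centered matrix rank-2, so the
    smallest singular value, and hence the ratio, collapses to zero."""
    m = np.asarray(points_3d, dtype=float)
    m = m - m.mean(axis=0)  # centering: flat targets become exactly rank-2
    singular_values = np.linalg.svd(m, compute_uv=False)
    return singular_values.min() / singular_values.sum() >= ratio_threshold
```

For example, four coplanar points such as `[[0, 0, 50], [10, 0, 50], [0, 10, 50], [10, 10, 50]]` are rejected, while the same points with real depth relief are accepted.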
In yet another possible implementation manner, the human body region includes a human face region; the at least four reference points comprise a first face key point;
the obtaining of the horizontal position information of the at least four reference points and the vertical position information of the at least four reference points according to the binocular image includes:
respectively performing face key point detection processing on the first image to be processed and the second image to be processed to obtain initial horizontal position information and initial vertical position information of the first face key point in the first image to be processed, and initial horizontal position information and initial vertical position information of the second face key point in the second image to be processed, wherein the first face key point and the second face key point are homonymous points;
taking the initial vertical position information of the first face key point or the initial vertical position information of the second face key point as the vertical position information of the first face key point;
obtaining a first horizontal parallax displacement between the first face key point and the second face key point according to the parallax image;
determining a sum of the initial horizontal position information of the first face keypoints and the first horizontal parallax displacement as horizontal position information of the first face keypoints.
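The steps above can be sketched as follows; the disparity-map indexing convention and every name here are assumptions for illustration:

```python
import numpy as np

def keypoint_position(u_left, v_left, v_right, disparity_map):
    """Vertical position: taken directly from either view (after
    rectification the two should agree; v_right is accepted but unused
    here). Horizontal position: the left-view coordinate plus the
    horizontal parallax displacement read from the disparity map."""
    d = disparity_map[int(round(v_left)), int(round(u_left))]
    return u_left + d, v_left

# A toy 10 x 10 disparity map with a 2 px displacement at row 7, column 3:
disparity = np.zeros((10, 10))
disparity[7, 3] = 2.0
```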
In yet another possible implementation, the human body region includes a face region and a torso region; the at least four reference points comprise: a third face key point and a first trunk key point;
the obtaining of the horizontal position information of the at least four reference points and the vertical position information of the at least four reference points according to the binocular image includes:
respectively performing face key point detection processing and trunk key point detection processing on the first image to be processed and the second image to be processed to obtain initial horizontal position information and initial vertical position information of the third face key point and of the first trunk key point in the first image to be processed, and initial horizontal position information and initial vertical position information of the fourth face key point and of the second trunk key point in the second image to be processed, wherein the third face key point and the fourth face key point are homonymous points, and the first trunk key point and the second trunk key point are homonymous points;
taking the initial vertical position information of the third face key point or the initial vertical position information of the fourth face key point as the vertical position information of the third face key point, and taking the initial vertical position information of the first trunk key point or the initial vertical position information of the second trunk key point as the vertical position information of the first trunk key point;
obtaining, according to the parallax image, a second horizontal parallax displacement between the third face key point and the fourth face key point and a third horizontal parallax displacement between the first trunk key point and the second trunk key point;
determining the sum of the initial horizontal position information of the third face key point and the second horizontal parallax displacement as the horizontal position information of the third face key point, and determining the sum of the initial horizontal position information of the first trunk key point and the third horizontal parallax displacement as the horizontal position information of the first trunk key point.
In a second aspect, there is provided an image processing apparatus, the apparatus comprising:
the binocular image acquisition unit is used for acquiring binocular images and parameters of the binocular camera used to acquire the binocular images, wherein the binocular images comprise the human body region of a person object to be detected;
the first processing unit is used for obtaining, according to the binocular images, depth information of at least four reference points in the human body region, horizontal position information of the at least four reference points and vertical position information of the at least four reference points, wherein the reference points comprise face key points, or comprise both face key points and trunk key points;
the second processing unit is used for obtaining three-dimensional position information of the at least four reference points in a world coordinate system according to the parameters of the binocular camera, the horizontal position information of the at least four reference points, the vertical position information of the at least four reference points and the depth information of the at least four reference points;
the determining unit is used for determining that the person object to be detected is a living body under the condition that the variance of the three-dimensional position information of the at least four reference points in the world coordinate system in the depth direction is greater than or equal to a first threshold, wherein the depth direction is a direction perpendicular to the image plane of the binocular camera when the binocular camera collects the binocular images.
In one possible implementation, the binocular image includes: a first image to be processed and a second image to be processed; the binocular camera includes: the first camera is used for collecting the first image to be processed and the second camera is used for collecting the second image to be processed; the parameters of the binocular camera include: a distance between the first camera and the second camera, a first focal length of the first camera, and a second focal length of the second camera;
the first processing unit is configured to:
obtaining parallax images of the first image to be processed and the second image to be processed according to the first image to be processed and the second image to be processed, wherein the parallax images carry parallax information of the at least four reference points;
performing stereo correction processing on the first image to be processed and the second image to be processed to normalize the first focal length and the second focal length to obtain a normalized focal length;
and obtaining the depth information of the at least four reference points according to the parallax information of the at least four reference points, the normalized focal length and the distance.
In another possible implementation, the at least four reference points include a first reference point;
the first processing unit is configured to:
determining the product of the normalized focal length and the distance to obtain a first intermediate value;
and determining the quotient of the first intermediate value and the parallax information of the first reference point to obtain the depth information of the first reference point.
In another possible implementation manner, the first processing unit is configured to:
performing stereo correction processing on the first image to be processed and the second image to be processed to normalize the parameters of the first camera and the parameters of the second camera to obtain normalized camera parameters;
the second processing unit is configured to:
and obtaining three-dimensional position information of the at least four reference points in a world coordinate system according to the normalized camera parameters, the horizontal position information of the at least four reference points, the vertical position information of the at least four reference points and the depth information of the at least four reference points.
In yet another possible implementation, the at least four reference points include a second reference point; the normalized camera parameters include: the normalized horizontal position information of the central point of the camera and the normalized vertical position information of the central point are obtained, wherein the central point is the intersection point of the image plane of the first camera and the optical axis of the first camera; the normalized focal length includes: the horizontal position information of the normalized focal length and the vertical position information of the normalized focal length; the three-dimensional position information of the at least four reference points in the world coordinate system comprises: horizontal position information of the at least four reference points in a world coordinate system, vertical position information of the at least four reference points in the world coordinate system and depth position information of the at least four reference points in the world coordinate system;
the determining unit is configured to include:
determining a difference between the horizontal position information of the second reference point and the horizontal position information of the central point to obtain a second intermediate value, and determining a quotient of the depth information of the second reference point and the horizontal position information of the normalized focal length to obtain a third intermediate value;
determining a difference between the vertical position information of the second reference point and the vertical position information of the central point to obtain a fourth intermediate value, and determining a quotient of the depth information of the second reference point and the vertical position information of the normalized focal length to obtain a fifth intermediate value;
and taking the product of the second intermediate value and the third intermediate value as the horizontal position information of the second reference point in the world coordinate system, taking the product of the fourth intermediate value and the fifth intermediate value as the vertical position information of the second reference point in the world coordinate system, and taking the depth information of the second reference point as the depth position information of the second reference point in the world coordinate system.
In yet another possible implementation manner, the determining unit is configured to:
constructing a matrix by using the three-dimensional position information of the at least four reference points in a world coordinate system, so that each row in the matrix comprises the three-dimensional position information of one reference point, and obtaining a coordinate matrix;
and determining at least one singular value of the coordinate matrix, and determining that the human object to be detected is a living body under the condition that the ratio of the minimum value of the at least one singular value to the sum of the at least one singular value is greater than or equal to a second threshold value.
In yet another possible implementation manner, the human body region includes a human face region; the at least four reference points comprise a first face key point;
the first processing unit is configured to:
respectively performing face key point detection processing on the first image to be processed and the second image to be processed to obtain initial horizontal position information and initial vertical position information of the first face key point in the first image to be processed, and initial horizontal position information and initial vertical position information of the second face key point in the second image to be processed, wherein the first face key point and the second face key point are homonymous points;
taking the initial vertical position information of the first face key point or the initial vertical position information of the second face key point as the vertical position information of the first face key point;
obtaining a first horizontal parallax displacement between the first face key point and the second face key point according to the parallax image;
determining a sum of the initial horizontal position information of the first face keypoints and the first horizontal parallax displacement as horizontal position information of the first face keypoints.
In yet another possible implementation, the human body region includes a face region and a torso region; the at least four keypoints comprise: a third face key point and a first trunk key point;
the first processing unit is configured to:
performing face key point detection processing and trunk key point detection processing on the first image to be processed and the second image to be processed respectively, so as to obtain initial horizontal position information of the third face key point, initial vertical position information of the third face key point, initial horizontal position information of the first trunk key point and initial vertical position information of the first trunk key point in the first image to be processed, and initial horizontal position information of the fourth face key point, initial vertical position information of the fourth face key point, initial horizontal position information of the second trunk key point and initial vertical position information of the second trunk key point in the second image to be processed, wherein the third face key point and the fourth face key point are homonymous points, and the first trunk key point and the second trunk key point are homonymous points;
taking the initial vertical position information of the third face key point or the initial vertical position information of the fourth face key point as the vertical position information of the third face key point, and taking the initial vertical position information of the first trunk key point or the initial vertical position information of the second trunk key point as the vertical position information of the first trunk key point;
obtaining, according to the parallax image, a second horizontal parallax displacement between the third face key point and the fourth face key point, and a third horizontal parallax displacement between the first trunk key point and the second trunk key point;
determining the sum of the initial horizontal position information of the third face key point and the second horizontal parallax displacement as the horizontal position information of the third face key point, and determining the sum of the initial horizontal position information of the first trunk key point and the third horizontal parallax displacement as the horizontal position information of the first trunk key point.
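The horizontal-position determination described above (the initial horizontal position plus the horizontal parallax displacement read from the parallax image) can be sketched minimally; the function name and the row/column lookup convention are assumptions for illustration only.

```python
def keypoint_horizontal_position(initial_x, disparity_map, row, col):
    """Illustrative sketch: look up the horizontal parallax displacement
    of a key point at pixel (row, col) in the parallax (disparity) map
    and add it to the key point's initial horizontal position."""
    horizontal_parallax = disparity_map[row][col]
    return initial_x + horizontal_parallax
```

The same lookup-and-sum applies to face key points and trunk key points alike; only the source of the initial position (face or trunk detection) differs.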
In a third aspect, a processor is provided, which is configured to perform the method according to the first aspect and any one of the possible implementations thereof.
In a fourth aspect, an electronic device is provided, comprising: a processor, transmitting means, input means, output means, and a memory for storing computer program code comprising computer instructions which, when executed by the processor, cause the electronic device to perform the method of the first aspect and any one of its possible implementations.
In a fifth aspect, there is provided a computer readable storage medium having stored therein a computer program comprising program instructions which, when executed by a processor of an electronic device, cause the processor to perform the method of the first aspect and any one of its possible implementations.
A sixth aspect provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of the first aspect and any of its possible implementations.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments or the background art of the present application, the drawings required to be used in the embodiments or the background art of the present application will be described below.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a schematic flowchart of an image processing method according to an embodiment of the present application;
fig. 2 is a schematic diagram of a homonymy point in a binocular image according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of a face key point provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of a torso key point provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of a distribution of reference points provided in an embodiment of the present application;
fig. 6 is a schematic flowchart of another image processing method according to an embodiment of the present application;
fig. 7 is a schematic flowchart of another image processing method according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application;
fig. 9 is a schematic diagram of a hardware structure of an image processing apparatus according to an embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," and the like in the description and claims of the present application and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
With the development of face recognition technology, face recognition technology has been widely applied to different application scenarios, wherein confirming the identity of a person through face recognition is an important application scenario, for example, real-name authentication, identity authentication, and the like are performed through face recognition technology.
The face recognition technology obtains face feature data by performing feature extraction processing on a face image obtained by collecting a face region of a person. And comparing the extracted face feature data with the face feature data in the database to determine the identity of the person in the face image.
However, attacks on face recognition technology using "non-living" face images have recently occurred with increasing frequency. The above-mentioned "non-living" face images include: paper photos, electronic images, and the like. Attacking face recognition technology with a non-living face image means substituting the non-living face image for the face region of a person, so as to deceive the face recognition technology. For example, Zhang San places a photo of Li Si in front of Li Si's mobile phone to perform face recognition unlocking. The mobile phone captures the photo of Li Si through its camera to obtain a face image containing the face region of Li Si, determines the identity of Zhang San to be Li Si, and unlocks the phone. In this way, Zhang San successfully deceives the phone's face recognition technology with the photo of Li Si and unlocks Li Si's phone. Effectively preventing attacks on face recognition technology using "non-living" face images (hereinafter referred to as two-dimensional attacks) is therefore of great importance.
The attack of the human face image of the non-living body on the human face recognition technology can be effectively prevented by carrying out the living body detection on the human face image. The conventional in-vivo detection methods can be classified into two-dimensional in-vivo detection methods and three-dimensional in-vivo detection methods.
The two-dimensional living body detection method comprises the steps of collecting binocular images of a face area of a person object to be detected through a binocular camera, obtaining horizontal position information and vertical position information of face key points in the face area of the person to be detected based on the binocular images, and determining whether the person object to be detected is a living body according to the horizontal position information and the vertical position information of the face key points.
In the traditional three-dimensional in-vivo detection method, hardware (such as a depth camera and a structured light camera) for obtaining depth information of face key points in a face area of a person object to be detected is added on the basis of a two-dimensional in-vivo detection method, and the depth information of the face key points in the face area of the person to be detected is obtained through the hardware. Inputting the horizontal position information, the vertical position information and the depth information of the key points of the human face into the trained deep learning model to determine whether the character object to be detected is a living body.
Of the two methods, the two-dimensional living body detection method uses only the horizontal position information and vertical position information of the face key points, so its detection accuracy is lower than that of the traditional three-dimensional living body detection method. The traditional three-dimensional living body detection method, however, needs dedicated hardware (such as a depth camera) to obtain the depth information of the face key points, so its hardware cost is high. In addition, the detection accuracy of the trained deep learning model used in the traditional three-dimensional living body detection method depends heavily on the model's training data. Specifically, only the application scenarios covered by the training data are scenarios to which the deep learning model is applicable. For example, if the training data includes paper photos but does not include electronic photos, a deep learning model trained on this data will have low living body detection accuracy on electronic photos. This also results in poor versatility of the traditional three-dimensional living body detection method.
Based on the current situation, the embodiment of the application provides a three-dimensional in-vivo detection method which has the same hardware cost as a two-dimensional in-vivo detection method and strong universality. The embodiments of the present application will be described below with reference to the drawings.
Referring to fig. 1, fig. 1 is a schematic flowchart illustrating an image processing method according to an embodiment (a) of the present application.
101. And acquiring binocular images and parameters of a binocular camera for acquiring the binocular images.
The binocular image comprises a human body area of a character object to be detected;
the technical scheme provided by the embodiment of the application can be applied to the first terminal, wherein the first terminal comprises a mobile phone, a computer, a tablet computer, a server and the like.
The binocular images are two images obtained by shooting the same object or scene from different positions at the same time by two different imaging devices (namely, the binocular camera). An imaging device may be a camera or a video camera, for example, the two cameras on a mobile phone, the two cameras mounted on a smart car, or the two cameras on a drone.
Since the term "homonymous point" will appear several times below, its meaning is clarified first. In the embodiment of the present application, the pixel points in different images of the binocular image that correspond to the same physical point are homonymous points. Fig. 2 shows the two images of a binocular image, in which pixel point A and pixel point C are homonymous points, and pixel point B and pixel point D are homonymous points.
It should be understood that the embodiments of the present application illustrate how to implement three-dimensional live body detection based on binocular images by taking two different imaging devices as examples. In practical application, at least three images can be obtained by shooting the same object or scene from different positions at the same time through three or more than three imaging devices, the technical scheme provided by the embodiment of the application can realize three-dimensional living body detection based on at least three images, and the number of the imaging devices is not limited in the application.
The binocular image may be acquired by receiving a binocular image input by a user through an input assembly, wherein the input assembly includes: keyboard, mouse, touch screen, touch pad, audio input device, etc. The binocular image sent by the second terminal can also be received, wherein the second terminal comprises a mobile phone, a computer, a tablet computer, a server and the like, and the mode of obtaining the binocular image is not limited in the application.
In the embodiment of the present application, the parameters of the binocular camera include: internal parameters of each imaging device and external parameters of each imaging device. The internal parameters include: the coordinates of the focal length and the coordinates of a central point, wherein the central point is an intersection point of an optical axis of the imaging device and the image plane. The external parameters include: the distance between the center points of the two imaging devices (which will be referred to as the baseline length hereinafter), the rotation matrix between the camera coordinate system of the imaging device and the world coordinate system, and the amount of translation between the camera coordinate system of the imaging device and the world coordinate system.
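As a hedged illustration, the internal parameters described above are commonly collected into a 3×3 intrinsic matrix, with the external parameters held as a rotation matrix and translation amount. All numeric values below are assumed placeholders, not values from this application.

```python
import numpy as np

# Illustrative intrinsic matrix K for one imaging device: fx, fy are the
# focal length in pixels, (ux, uy) the coordinates of the central point
# (intersection of the optical axis with the image plane).
fx, fy, ux, uy = 1000.0, 1000.0, 640.0, 360.0
K = np.array([[fx, 0.0, ux],
              [0.0, fy, uy],
              [0.0, 0.0, 1.0]])

# Illustrative external parameters: rotation R and translation t between
# the camera coordinate system and the world coordinate system, plus the
# baseline (distance between the two imaging devices' center points).
R = np.eye(3)
t = np.zeros(3)
baseline_mm = 60.0  # assumed baseline length
```

The baseline length and the focal length are exactly the quantities combined with the parallax image in step 102 to recover depth.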
Optionally, before using the binocular camera to collect binocular images, the binocular camera may be calibrated so that two imaging devices in the binocular camera are on the same horizontal plane, and image planes of the two imaging devices are on the same plane, wherein the image planes are planes where imaging components in the imaging devices are located. The parameters of the binocular camera can be obtained by calibrating the binocular camera.
In the embodiment of the application, the binocular image includes a human body region of the human object to be detected, wherein the human body region may include a human face region, for example, the binocular camera acquires an image of the face region of the human object to be detected to obtain a human face image only including the face region of the human object to be detected. The body region may also include a face region and a torso region. For example, the binocular camera acquires images of a face area and a torso area of a human subject to be detected to obtain a human body image including the face area and the torso area of the human subject to be detected.
102. And obtaining depth information of at least four reference points, horizontal position information of the at least four reference points and vertical position information of the at least four reference points in the human body region according to the binocular image.
The face key points include: face contour key points and facial features key points. As shown in fig. 3, the key points of the five sense organs include: key points of the eyebrow region, key points of the eye region, key points of the nose region, key points of the mouth region, key points of the ear region. The face contour key points include key points on a face contour line. It should be understood that the number and the positions of the face key points shown in fig. 3 are only an example provided in the embodiment of the present application, and should not be construed as limiting the present application.
The above torso keypoints include keypoints at joints of the torso. As shown in fig. 4, the torso key points include: shoulder joint key points, elbow joint key points, wrist joint key points, hip joint key points, knee joint key points and ankle joint key points. It should be understood that the number and location of the torso key points shown in fig. 4 are only an example provided in the embodiments of the present application and should not be construed as limiting the present application.
In this embodiment of the application, the reference points include face key points, for example, the at least four reference points include: key points in the nose region, key points in the mouth region, key points in the ear region, and key points on the face contour. The reference points may include face key points and torso key points, for example, the at least four key points include: key points in the nose region, key points in the mouth region, key points in the ear region, and key points in the shoulder joint.
In the embodiment of the application, the face key points and the position information thereof (i.e., the horizontal position information of the face key points and the vertical position information of the face key points) in the binocular image may be obtained through any face key point detection algorithm, and the face key point detection algorithm may be one of OpenFace, a multi-task cascaded convolutional neural network (MTCNN), a Tuned Convolutional Neural Network (TCNN), or a task-constrained deep convolutional neural network (TCDCN), which does not limit the face key point detection algorithm.
If the human body region contains a trunk region, trunk key points and position information thereof (namely, horizontal position information of the trunk key points and vertical position information of the trunk key points) in the binocular image can be obtained through any trunk key point detection algorithm, and the trunk key point detection algorithm can be as follows: the trunk keypoint detection algorithm is not limited by the present application, and may be any one of a cascaded pyramid neural network (CPN), a region-masked convolutional neural network (RCNN), or a multi-person pose estimation (RMPE).
In the embodiment of the present application, the binocular image includes: a first image to be processed and a second image to be processed. According to the first image to be processed and the second image to be processed, a parallax image between the first image to be processed and the second image to be processed can be obtained. The parallax image carries information of the horizontal parallax displacement between homonymous points in the first image to be processed and the second image to be processed. It should be understood that, if the two imaging devices in the binocular camera are located on the same vertical plane and the image planes of the two imaging devices are located on the same plane, the parallax image instead carries information of the vertical parallax displacement between homonymous points in the first image to be processed and the second image to be processed.
In an implementation manner of obtaining a parallax image between a first image to be processed and a second image to be processed according to the first image to be processed and the second image to be processed, a homonymous point in the first image to be processed and a homonymous point in the second image to be processed can be determined by performing feature matching processing on the first image to be processed and the second image to be processed. And obtaining the parallax image according to the horizontal parallax displacement between the homonymous points in the first image to be processed and the second image to be processed. The feature matching process may be implemented by one of a storm algorithm (brute force), a k-nearest neighbor algorithm (KNN), or a fast nearest neighbor search algorithm (FLANN), which is not limited in this application.
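As a toy illustration of the matching idea (not the brute force, KNN, or FLANN algorithms named above), the following sketch performs 1-D sum-of-squared-differences block matching along a single rectified image row; the function name, block size, and search range are assumptions.

```python
import numpy as np

def disparity_row(left_row, right_row, block=3, max_disp=8):
    """Toy stand-in for the feature matching step: for each pixel in the
    left row, search the right row up to `max_disp` pixels to the left
    for the best-matching block, and record the horizontal displacement
    of the match as the disparity."""
    left = np.asarray(left_row, dtype=float)
    right = np.asarray(right_row, dtype=float)
    half = block // 2
    disp = np.zeros(len(left), dtype=int)
    for x in range(half, len(left) - half):
        patch = left[x - half:x + half + 1]
        best, best_cost = 0, np.inf
        for d in range(0, min(max_disp, x - half) + 1):
            cand = right[x - d - half:x - d + half + 1]
            cost = np.sum((patch - cand) ** 2)  # sum of squared differences
            if cost < best_cost:
                best, best_cost = d, cost
        disp[x] = best
    return disp
```

Running this over every row of a rectified pair yields a dense disparity (parallax) image; production systems use far more robust matching, but the recorded quantity is the same horizontal displacement between homonymous points.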
After the parallax image is obtained, the depth information of the at least four key points can be obtained according to the focal length of the binocular camera, the horizontal parallax information carried by the parallax image, and the length of the baseline. For example, the at least four key points include an eye key point; the eye key point is a first pixel point in the first image to be processed and a second pixel point in the second image to be processed, that is, the first pixel point and the second pixel point are homonymous points. According to the parallax image, the horizontal displacement difference between the first pixel point and the second pixel point can be determined to be d1. The baseline length can be determined to be d2 by calibrating the binocular camera. Based on d1 and d2, the depth information of the eye key point can be obtained.
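The d1/d2 example above follows the standard stereo triangulation relation depth = focal length × baseline / disparity. A minimal sketch, with the function name and units chosen for illustration:

```python
def depth_from_disparity(focal_px, baseline_mm, disparity_px):
    """Standard stereo triangulation: depth = f * b / d. The focal
    length is in pixels, so the result takes the baseline's units
    (millimetres here)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_mm / disparity_px
```

For instance, with an assumed 1000-pixel focal length, a 60 mm baseline, and a 12-pixel disparity, the key point lies about 5 m from the camera; larger disparities correspond to closer points.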
103. And obtaining three-dimensional position information of the at least four reference points in a world coordinate system according to the parameters of the binocular camera, the horizontal position information of the at least four reference points, the vertical position information of the at least four reference points and the depth information of the at least four reference points.
Through the processing of step 101 and step 102, three-dimensional position information of the at least four reference points in the camera coordinate system of the binocular camera can be obtained. For the convenience of subsequent processing, the three-dimensional position information under the camera coordinate system needs to be converted into the three-dimensional position information under the world coordinate system.
In one possible implementation (which will be referred to as a non-correction implementation hereinafter), the three-dimensional coordinates of the reference point in the world coordinate system and thus the three-dimensional position information of the reference point in the world coordinate system may be obtained from parameters of the binocular camera and the three-dimensional coordinates determined from the three-dimensional position information of the reference point in the camera coordinate system.
In another possible implementation (which will be referred to as a correction implementation hereinafter), the binocular camera includes: a first camera and a second camera. Before determining the three-dimensional position information of the reference point in the world coordinate system, stereo correction processing can be performed on the binocular image so as to normalize the parameters of the first camera and the parameters of the second camera, obtaining normalized camera parameters. The three-dimensional position information of the reference point in the world coordinate system is then obtained according to the normalized parameters and the three-dimensional position information of the reference point in the camera coordinate system. For example, the normalized camera parameters obtained by performing stereo correction processing on the parameters of the first camera and the parameters of the second camera include: horizontal position information of the normalized focal length, vertical position information of the normalized focal length, horizontal position information of the normalized central point, and vertical position information of the normalized central point.
Let the horizontal coordinate of the reference point in the camera coordinate system, determined by the horizontal position information of the reference point in the camera coordinate system, be xc; let the vertical coordinate of the reference point in the camera coordinate system, determined by the vertical position information of the reference point in the camera coordinate system, be yc; let the depth value determined by the depth information of the reference point be d; let the horizontal coordinate of the normalized focal length, determined by the horizontal position information of the normalized focal length, be fx; let the vertical coordinate of the normalized focal length, determined by the vertical position information of the normalized focal length, be fy; let the horizontal coordinate of the normalized central point, determined by the horizontal position information of the normalized central point, be ux; and let the vertical coordinate of the normalized central point, determined by the vertical position information of the normalized central point, be uy. The horizontal coordinate xw of the reference point in the world coordinate system satisfies the following formula:

xw = (xc - ux) × d / fx

The vertical coordinate yw of the reference point in the world coordinate system satisfies the following formula:

yw = (yc - uy) × d / fy

The depth coordinate z of the reference point in the world coordinate system satisfies the following formula: z = d. The above depth coordinate may be understood as the coordinate of the reference point, with the normalized central point as the origin, in the direction perpendicular to the image plane of the binocular camera.
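Under the notation of this step (camera-plane coordinates xc, yc, depth d, normalized focal lengths fx, fy, and normalized central point ux, uy), the back-projection can be sketched as follows; the function name is illustrative.

```python
def pixel_to_world(xc, yc, depth, fx, fy, ux, uy):
    """Pinhole back-projection sketch: map a pixel coordinate plus its
    depth into world coordinates, with z taken directly as the depth."""
    xw = (xc - ux) * depth / fx
    yw = (yc - uy) * depth / fy
    return xw, yw, depth
```

For example, a pixel 100 pixels right of the central point, seen at a depth of 1000 units with an assumed 1000-pixel focal length, back-projects to a point 100 units to the right of the optical axis.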
104. And determining that the human object to be detected is a living body under the condition that the variance of the three-dimensional position information of the at least four reference points in the world coordinate system in the depth direction is greater than or equal to a first threshold value.
In an embodiment of the present application, the depth direction is a direction perpendicular to an image plane of the binocular camera when the binocular camera collects the binocular image.
After the three-dimensional position information of the four reference points in the world coordinate system is obtained through the processing of step 103, the three-dimensional coordinates of the four reference points in the world coordinate system can be determined, and further whether the human object to be detected is a living body can be determined according to the three-dimensional coordinates of the four reference points in the world coordinate system.
In a possible implementation manner, the variance of the three-dimensional coordinates of the four reference points in the depth direction in the world coordinate system can be used for representing the dispersion of the three-dimensional coordinates of the four reference points in the depth direction in the world coordinate system. If the variance of the three-dimensional position information of the four reference points in the world coordinate system in the depth direction is greater than or equal to a first threshold, the dispersion degree of the at least four reference points in the depth direction is not 0, that is, at least one reference point of the at least four reference points in the depth direction and other reference points do not belong to the same plane, that is, the human body region of the human body object to be detected is a three-dimensional solid object, not a two-dimensional plane object, and thus the human body object to be detected can be determined to be a living body. If the variance of the three-dimensional position information of the four reference points in the world coordinate system in the depth direction is smaller than a first threshold, the four reference points in the human body region of the human body object to be detected in the depth direction belong to the same plane, that is, the human body region of the human body object to be detected is a two-dimensional object, and thus the human body object to be detected can be determined to be a non-living body. The first threshold is a positive number, and the specific value of the first threshold can be set according to the actual use condition.
For example, as shown in fig. 5, the at least four reference points include: reference point A, reference point B, reference point C, and reference point D. Reference point B, reference point C, and reference point D belong to plane a, while reference point A does not belong to plane a. The dispersion in the depth direction of the three-dimensional coordinates of the four reference points shown in fig. 5 is therefore greater than 0, that is, the variance in the depth direction of the three-dimensional coordinates of the four reference points is greater than or equal to the first threshold value.
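The depth-direction variance test of step 104 can be sketched minimally as follows; the function name and the threshold value are illustrative assumptions.

```python
import numpy as np

def is_live_by_depth_variance(points_3d, first_threshold=1.0):
    """Sketch of step 104: each row of `points_3d` is the (x, y, z)
    world coordinate of one reference point. The subject is judged a
    living body when the variance of the depth (z) coordinates reaches
    the first threshold, i.e. the points are not all on one plane
    parallel to the image plane."""
    z = np.asarray(points_3d, dtype=float)[:, 2]
    return z.var() >= first_threshold
```

A flat photo held frontally gives near-identical depth values and is rejected; a real face, whose nose, eyes, and contour lie at different depths, passes. The threshold would be tuned for the camera's depth noise in practice.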
It should be understood that, since three points can determine a plane, in the embodiment of the present application, three-dimensional coordinates of at least four reference points are required to determine whether the human body region of the human object to be detected is a three-dimensional region. The greater the number of reference points, the higher the accuracy of the in vivo test. But an increase in the number of reference points will also result in an increase in the data throughput. Therefore, in practical application, a user can adjust the number of the reference points according to requirements, and the number of the reference points is not limited in the application.
The implementation obtains the depth information of at least four reference points in the character object to be detected based on the binocular image, and further obtains the three-dimensional position information of the at least four reference points in the world coordinate system. According to the three-dimensional position information of the at least four reference points in the world coordinate system, whether the human body area of the person object to be detected is a three-dimensional area or not can be determined, and therefore two-dimensional attack on the face recognition technology can be effectively prevented. The implementation obtains the three-dimensional position information of at least four reference points in the character object to be detected on the basis of the hardware of the two-dimensional in-vivo detection method, and compared with the two-dimensional in-vivo detection method, the implementation does not increase the hardware cost but improves the detection accuracy.
In step 103, two implementations are provided for converting the three-dimensional position information of the reference point in the camera coordinate system into three-dimensional position information in the world coordinate system. Compared with the correction implementation, the non-correction implementation has a larger data processing load and a lower processing speed. In order to reduce the data processing load of the living body detection and improve the processing speed, the three-dimensional position information of the reference point in the camera coordinate system may optionally be converted into three-dimensional position information in the world coordinate system through the correction implementation. Referring to fig. 6, fig. 6 is a flowchart illustrating a possible implementation manner of step 103 according to the second embodiment of the present application.
601. And performing stereo correction processing on the first image to be processed and the second image to be processed to normalize the parameters of the first camera and the parameters of the second camera to obtain normalized camera parameters.
In this embodiment, the binocular image includes: a first image to be processed and a second image to be processed. Although the binocular camera is calibrated before it acquires the binocular image of the human body region of the person object to be detected, so that the two imaging devices are located on the same horizontal plane and their image planes lie in the same plane, homonymous points in the first image to be processed and the second image to be processed may still have a vertical displacement difference due to calibration errors, lens distortion of the imaging devices and the like. If homonymous points in the first image to be processed and the second image to be processed have a vertical displacement difference, the precision of the parallax image between the first image to be processed and the second image to be processed is reduced, which in turn affects the accuracy of the living body detection. Therefore, the present embodiment performs stereo correction processing on the first image to be processed and the second image to be processed to reduce the vertical displacement difference of homonymous points in the two images.
In an implementation manner of performing stereo correction processing on a first image to be processed and a second image to be processed, distortion parameters of a first camera can be obtained according to parameters of the first camera which collects the first image to be processed, and distortion parameters of a second camera can be obtained according to parameters of the second camera which collects the second image to be processed. And adjusting the first image to be processed based on the distortion parameter of the first camera to obtain the first image to be processed after the distortion is eliminated, and adjusting the second image to be processed based on the distortion parameter of the second camera to obtain the second image to be processed after the distortion is eliminated. And obtaining a rotation matrix and a translation amount (which will be referred to as an epipolar rotation matrix and an epipolar translation amount hereinafter) for aligning the epipolar line of the first camera and the epipolar line of the second camera according to the parameters of the first camera and the parameters of the second camera. The epipolar line of the first camera is an intersection line between any plane containing the baseline between the first camera and the second camera and the image plane of the first camera, and the epipolar line of the second camera is an intersection line between any plane containing the baseline between the first camera and the second camera and the image plane of the second camera, wherein the baseline between the first camera and the second camera refers to a straight line between the center point of the first camera and the center point of the second camera. 
And respectively adjusting the first image to be processed after distortion elimination and the second image to be processed after distortion elimination based on the epipolar line rotation matrix and the epipolar line translation amount to obtain the first image to be processed after alignment and the second image to be processed after alignment. And obtaining the normalized camera parameters according to the aligned first image to be processed and the aligned second image to be processed, so as to realize the normalization of the parameters of the first camera and the parameters of the second camera. Optionally, after the aligned first to-be-processed image and the aligned second to-be-processed image are obtained, cropping processing may be performed on the aligned first to-be-processed image and the aligned second to-be-processed image to remove irregular corner regions in the aligned first to-be-processed image and the aligned second to-be-processed image, so as to obtain the cropped first to-be-processed image and the cropped second to-be-processed image. And obtaining the normalized camera parameters according to the clipped first image to be processed and the clipped second image to be processed.
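The epipolar-alignment step described above can be sketched in a few lines of numpy. This is a simplified, Bouguet-style construction that assumes lens distortion has already been eliminated; the function name and the baseline values are illustrative and not part of the embodiment:

```python
import numpy as np

def rectifying_rotation(t):
    """Rotation whose rows form a new camera basis aligned with the baseline t.

    After applying this rotation to both (already de-distorted) views, the
    epipolar lines become horizontal, so homonymous points differ only by a
    horizontal parallax displacement.
    """
    e1 = t / np.linalg.norm(t)                 # new x-axis: baseline direction
    e2 = np.cross([0.0, 0.0, 1.0], e1)         # new y-axis: perpendicular to e1
    e2 /= np.linalg.norm(e2)                   #   and to the old optical axis
    e3 = np.cross(e1, e2)                      # new z-axis completes the basis
    return np.stack([e1, e2, e3])              # rows are the new axes

# The rotated baseline then has no vertical or depth component:
t = np.array([0.12, 0.002, 0.001])             # illustrative baseline (metres)
aligned = rectifying_rotation(t) @ t
```

The key property, used by the disparity formula that follows, is that `aligned` equals `[||t||, 0, 0]`: the residual vertical displacement of the baseline is driven to zero.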
Alternatively, after the aligned first image to be processed and the aligned second image to be processed are obtained by performing the correction processing on the first image to be processed and the second image to be processed, the parallax image between the first image to be processed and the second image to be processed may be obtained based on the aligned first image to be processed and the aligned second image to be processed. After the normalized camera parameters are obtained, depth information of the reference point can be obtained based on the parallax information of the reference point in the parallax image, the normalized focal length and the length of the base line. For example, the at least four reference points include a first reference point, and a depth coordinate z of the first reference point in the camera coordinate system satisfies the following formula:
z = (f × b) / d
wherein f is the normalized focal length, b is the baseline length, and d is the horizontal parallax displacement of the first reference point determined according to the parallax image.
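As a minimal sketch of the formula above, the depth coordinate can be computed directly from the normalized focal length, the baseline length and the horizontal parallax displacement; the numeric values below are illustrative only:

```python
def depth_from_disparity(f, b, d):
    """z = (f * b) / d.

    f: normalized focal length in pixels, b: baseline length in metres,
    d: horizontal parallax displacement of the reference point in pixels.
    """
    if d <= 0:
        raise ValueError("disparity must be positive for a point in front of both cameras")
    return f * b / d

# e.g. f = 700 px, b = 0.06 m, d = 35 px gives z = 1.2 m
z = depth_from_disparity(700.0, 0.06, 35.0)
```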
602. And obtaining three-dimensional position information of the at least four reference points in a world coordinate system according to the normalized camera parameters, the horizontal position information of the at least four reference points, the vertical position information of the at least four reference points and the depth information of the at least four reference points.
The implementation process of this step can refer to the correction implementation manner in step 103, and will not be described herein.
In this embodiment, the first image to be processed and the second image to be processed are subjected to stereo correction processing, so that the vertical displacement difference between homonymous points in the first image to be processed and the second image to be processed is reduced and the lens distortion of the first camera and the second camera is eliminated. This improves the precision of the subsequently obtained parallax image, thereby improving the precision of the obtained depth information and the accuracy of the living body detection.
The three-dimensional coordinates of the at least four reference points in the world coordinate system are determined according to the three-dimensional position information of the at least four reference points in the world coordinate system, and whether the human body region of the person object to be detected is a three-dimensional region is then determined accordingly.
Whether the at least four reference points lie in the same plane in the depth direction is determined through the singular values of a coordinate matrix constructed from the three-dimensional coordinates of the at least four reference points in the world coordinate system, and whether the human body region of the person object to be detected is a three-dimensional region is determined accordingly. The present embodiment thus provides a technical scheme for determining whether the person object to be detected is a living body based on the singular values of a coordinate matrix constructed from the three-dimensional coordinates of at least four reference points in the world coordinate system.
Referring to fig. 7, fig. 7 is a flowchart illustrating a possible implementation manner of step 104 according to the third embodiment of the present application.
701. And constructing a matrix by using the three-dimensional position information of the at least four reference points in the world coordinate system, so that each row in the matrix comprises the three-dimensional position information of one reference point, and obtaining a coordinate matrix.
In this embodiment, each row in the coordinate matrix includes the three-dimensional coordinates of only one reference point. For example, the at least four reference points include: a first reference point, a second reference point, a third reference point, and a fourth reference point. The three-dimensional coordinate of the first reference point in the world coordinate system is (x1, y1, z1), the three-dimensional coordinate of the second reference point in the world coordinate system is (x2, y2, z2), the three-dimensional coordinate of the third reference point in the world coordinate system is (x3, y3, z3), and the three-dimensional coordinate of the fourth reference point in the world coordinate system is (x4, y4, z4). The coordinate matrix constructed with the three-dimensional coordinates of the first reference point, the second reference point, the third reference point, and the fourth reference point may be:
| x1  y1  z1 |
| x2  y2  z2 |
| x3  y3  z3 |
| x4  y4  z4 |

or any matrix obtained by permuting the rows of the above matrix, i.e. any matrix in which each row holds the three-dimensional coordinates of exactly one of the four reference points.
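The construction of the coordinate matrix in step 701 can be sketched with numpy; the coordinate values below are illustrative. Permuting the rows does not change the singular values used in step 702, which is why any row ordering yields a valid coordinate matrix:

```python
import numpy as np

# Illustrative three-dimensional coordinates of the four reference points
# in the world coordinate system.
p1, p2, p3, p4 = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [10.0, 11.0, 12.0]

# Step 701: stack the points so that each row holds the coordinates of
# exactly one reference point.
A = np.array([p1, p2, p3, p4])
```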
702. And determining at least one singular value of the coordinate matrix, and determining that the person object to be detected is a living body under the condition that the ratio of the minimum value of the at least one singular value to the sum of the at least one singular value is greater than or equal to a second threshold value.
The at least one singular value of the coordinate matrix can be calculated by the singular value decomposition theorem. In the embodiment of the present application, singular value decomposition is performed on the coordinate matrix constructed from the three-dimensional coordinates of the at least four reference points to obtain 3 singular values, which respectively represent the dispersion of the at least four reference points in the horizontal direction, the dispersion of the at least four reference points in the vertical direction, and the dispersion of the at least four reference points in the depth direction. When a singular value is greater than 0, the dispersion of the at least four reference points in the direction corresponding to that singular value is not 0, that is, in the corresponding direction at least one of the at least four reference points does not lie in the same plane as the other reference points. For example, the at least four reference points include: a first reference point, a second reference point, a third reference point and a fourth reference point; a singular value A, a singular value B and a singular value C can be obtained by performing singular value decomposition on a matrix constructed from the coordinates of the first reference point, the second reference point, the third reference point and the fourth reference point. The singular value A represents the dispersion of the first reference point, the second reference point, the third reference point and the fourth reference point in the horizontal direction, the singular value B represents their dispersion in the vertical direction, and the singular value C represents their dispersion in the depth direction.
Assuming that the singular value a is greater than 0, the dispersion in the horizontal direction characterizing the first reference point, the second reference point, the third reference point, and the fourth reference point is not 0. Assuming that the singular value B is greater than 0, the dispersion in the vertical direction characterizing the first reference point, the second reference point, the third reference point, and the fourth reference point is not 0. Assuming that the singular value C is larger than 0, the dispersion in the depth direction characterizing the first reference point, the second reference point, the third reference point, and the fourth reference point is not 0.
If the human body region of the person object to be detected is a three-dimensional region, the dispersion of the at least four reference points in all three directions (the horizontal direction, the vertical direction and the depth direction) is not 0. Because the human body region of the person object to be detected is either a two-dimensional region (such as a paper photo or an electronic picture) or a three-dimensional region, at least two of the three singular values of the coordinate matrix are greater than 0. Therefore, to determine whether the human body region of the person object to be detected is a three-dimensional region, it is only necessary to determine whether the minimum of the three singular values of the coordinate matrix is greater than 0.
In order to reduce errors, the embodiment of the present application determines whether the human body region of the person object to be detected is a three-dimensional region by determining whether the ratio of the minimum singular value among the at least one singular value to the sum of the at least one singular value is greater than or equal to a second threshold. The second threshold is a small positive number whose value range is greater than 0 and less than 1.
When the ratio of the minimum value of the three singular values of the coordinate matrix to the sum of the three singular values is greater than or equal to the second threshold, it is determined that the human body region of the person object to be detected is a three-dimensional region, and further that the person object to be detected is a living body. When the ratio of the minimum value of the three singular values of the coordinate matrix to the sum of the three singular values is smaller than the second threshold, it is determined that the human body region of the person object to be detected is a two-dimensional region, and further that the person object to be detected is not a living body.
For example, the three singular values of the coordinate matrix are: a singular value A of size 3, a singular value B of size 4 and a singular value C of size 1. The minimum value of the three singular values is 1, the sum of the three singular values is 8, and the ratio of the minimum value to the sum of the three singular values is 0.125. Assuming that the second threshold is 0.08, since 0.125 is greater than 0.08, it is determined that the person object to be detected is a living body.
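Steps 701 and 702 together can be sketched as below. The threshold 0.08 is taken from the worked example above; note that in this sketch the coplanar (two-dimensional) case is a plane through the origin of the world coordinate system, for which the rank of the raw coordinate matrix drops and the smallest singular value vanishes:

```python
import numpy as np

def is_living_body(points, second_threshold=0.08):
    """Liveness decision from the singular values of the coordinate matrix.

    points: at least four (x, y, z) world coordinates, one reference point
    per row of the resulting coordinate matrix.
    """
    coord_matrix = np.asarray(points, dtype=float)          # step 701
    s = np.linalg.svd(coord_matrix, compute_uv=False)       # three singular values
    return s.min() / s.sum() >= second_threshold            # step 702

# A non-planar set of reference points passes the test ...
assert is_living_body([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]])
# ... while four coplanar points (as from a flat photo) fail it.
assert not is_living_body([[1, 0, 0], [0, 1, 0], [1, 1, 0], [2, 1, 0]])
```

For the non-planar example the singular values are (2, 1, 1), giving a ratio of 0.25 ≥ 0.08; for the coplanar example the smallest singular value is 0, giving a ratio of 0 < 0.08.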
In this embodiment, a coordinate matrix is constructed from the three-dimensional coordinates of the at least four reference points in the world coordinate system, and whether the human body region of the person object to be detected is a three-dimensional region is determined according to the singular values of the coordinate matrix, so as to determine whether the person object to be detected is a living body. Since a coordinate matrix can be constructed from the three-dimensional coordinates of at least four reference points in the world coordinate system for any human body region of the person object to be detected, the method for determining whether the person object to be detected is a living body provided by this embodiment is applicable to any scene. The technical scheme provided by this implementation can therefore improve the universality of the three-dimensional living body detection method.
It will be understood by those skilled in the art that in the method of the present invention, the order of writing the steps does not imply a strict order of execution and any limitations on the implementation, and the specific order of execution of the steps should be determined by their function and possible inherent logic.
The method of the embodiments of the present application is set forth above in detail and the apparatus of the embodiments of the present application is provided below.
Referring to fig. 8, fig. 8 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application, where the apparatus 1 includes: an acquisition unit 11, a first processing unit 12, a second processing unit 13 and a determination unit 14, wherein:
the acquisition unit 11 is configured to acquire a binocular image and parameters of a binocular camera that acquires the binocular image, where the binocular image contains a human body region of a person object to be detected;
the first processing unit 12 is configured to obtain depth information of at least four reference points in the human body region, horizontal position information of the at least four reference points, and vertical position information of the at least four reference points according to the binocular image, where the reference points include face key points, or the reference points include face key points and trunk key points;
the second processing unit 13 is configured to obtain three-dimensional position information of the at least four reference points in a world coordinate system according to the parameters of the binocular camera, the horizontal position information of the at least four reference points, the vertical position information of the at least four reference points, and the depth information of the at least four reference points;
the determining unit 14 is configured to determine that the person object to be detected is a living body when the variance, in a depth direction, of the three-dimensional position information of the at least four reference points in the world coordinate system is greater than or equal to a first threshold, where the depth direction is the direction perpendicular to the image planes of the binocular camera when the binocular camera acquires the binocular image.
In one possible implementation, the binocular image includes: a first image to be processed and a second image to be processed; the binocular camera includes: the first camera is used for collecting the first image to be processed and the second camera is used for collecting the second image to be processed; the parameters of the binocular camera include: a distance between the first camera and the second camera, a first focal length of the first camera, and a second focal length of the second camera;
the first processing unit 12 is configured to:
obtaining parallax images of the first image to be processed and the second image to be processed according to the first image to be processed and the second image to be processed, wherein the parallax images carry parallax information of the at least four reference points;
performing stereo correction processing on the first image to be processed and the second image to be processed to normalize the first focal length and the second focal length to obtain a normalized focal length;
and obtaining the depth information of the at least four reference points according to the parallax information of the at least four reference points, the normalized focal length and the distance.
In another possible implementation, the at least four reference points include a first reference point;
the first processing unit 12 is configured to:
determining the product of the normalized focal length and the distance to obtain a first intermediate value;
and determining the quotient of the first intermediate value and the parallax information of the first reference point to obtain the depth information of the first reference point.
In yet another possible implementation manner, the first processing unit 12 is configured to:
performing stereo correction processing on the first image to be processed and the second image to be processed to normalize the parameters of the first camera and the parameters of the second camera to obtain normalized camera parameters;
the second processing unit 13 is configured to:
and obtaining three-dimensional position information of the at least four reference points in a world coordinate system according to the normalized camera parameters, the horizontal position information of the at least four reference points, the vertical position information of the at least four reference points and the depth information of the at least four reference points.
In yet another possible implementation, the at least four reference points include a second reference point; the normalized camera parameters include: the normalized horizontal position information of the central point of the camera and the normalized vertical position information of the central point are obtained, wherein the central point is the intersection point of the image plane of the first camera and the optical axis of the first camera; the normalized focal length includes: the horizontal position information of the normalized focal length and the vertical position information of the normalized focal length; the three-dimensional position information of the at least four reference points in the world coordinate system comprises: horizontal position information of the at least four reference points in a world coordinate system, vertical position information of the at least four reference points in the world coordinate system and depth position information of the at least four reference points in the world coordinate system;
the second processing unit 13 is configured to:
determining a difference between the horizontal position information of the second reference point and the horizontal position information of the central point to obtain a second intermediate value, and determining a quotient of the depth information of the second reference point and the horizontal position information of the normalized focal length to obtain a third intermediate value;
determining a difference between the vertical position information of the second reference point and the vertical position information of the central point to obtain a fourth intermediate value, and determining a quotient of the depth information of the second reference point and the vertical position information of the normalized focal length to obtain a fifth intermediate value;
and taking the product of the second intermediate value and the third intermediate value as the horizontal position information of the second reference point in the world coordinate system, taking the product of the fourth intermediate value and the fifth intermediate value as the vertical position information of the second reference point in the world coordinate system, and taking the depth information of the second reference point as the depth position information of the second reference point in the world coordinate system.
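Under the assumption of a standard pinhole back-projection — where the quantity divided by the focal length is the depth coordinate z — the intermediate-value computations above can be sketched as follows; the function name and all numeric values are illustrative:

```python
import numpy as np

def to_world(u, v, z, fx, fy, cx, cy):
    """Back-project a rectified pixel (u, v) with depth z to 3-D coordinates.

    (cx, cy) is the central point (principal point); (fx, fy) are the
    horizontal and vertical components of the normalized focal length.
    """
    x = (u - cx) * (z / fx)   # (second intermediate) * (third intermediate)
    y = (v - cy) * (z / fy)   # (fourth intermediate) * (fifth intermediate)
    return np.array([x, y, z])  # depth is carried over unchanged

# e.g. a pixel at (960, 540) with depth 2 m, normalized focal length 700 px
# and central point (640, 360):
p = to_world(960.0, 540.0, 2.0, 700.0, 700.0, 640.0, 360.0)
```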
In yet another possible implementation manner, the determining unit 14 is configured to:
constructing a matrix by using the three-dimensional position information of the at least four reference points in a world coordinate system, so that each row in the matrix comprises the three-dimensional position information of one reference point, and obtaining a coordinate matrix;
and determining at least one singular value of the coordinate matrix, and determining that the human object to be detected is a living body under the condition that the ratio of the minimum value of the at least one singular value to the sum of the at least one singular value is greater than or equal to a second threshold value.
In yet another possible implementation manner, the human body region includes a human face region; the at least four reference points include a first face key point;
the first processing unit 12 is configured to:
respectively performing face key point detection processing on the first image to be processed and the second image to be processed to obtain initial horizontal position information of the first face key point and initial vertical position information of the first face key point in the first image to be processed, and initial horizontal position information of the second face key point and initial vertical position information of the second face key point in the second image to be processed, where the first face key point and the second face key point are homonymous points;
taking the initial vertical position information of the first face key point or the initial vertical position information of the second face key point as the vertical position information of the first face key point;
obtaining a first horizontal parallax displacement between the first face key point and the second face key point according to the parallax image;
determining a sum of the initial horizontal position information of the first face keypoints and the first horizontal parallax displacement as horizontal position information of the first face keypoints.
In yet another possible implementation, the human body region includes a face region and a torso region; the at least four reference points include: a third face key point and a first trunk key point;
the first processing unit 12 is configured to:
respectively performing face key point detection processing and trunk key point detection processing on the first image to be processed and the second image to be processed to obtain initial horizontal position information of the third face key point, initial vertical position information of the third face key point, initial horizontal position information of the first trunk key point and initial vertical position information of the first trunk key point in the first image to be processed, and initial horizontal position information of the fourth face key point, initial vertical position information of the fourth face key point, initial horizontal position information of the second trunk key point and initial vertical position information of the second trunk key point in the second image to be processed, where the third face key point and the fourth face key point are homonymous points, and the first trunk key point and the second trunk key point are homonymous points;
taking the initial vertical position information of the third face key point or the initial vertical position information of the fourth face key point as the vertical position information of the third face key point, and taking the initial vertical position information of the first trunk key point or the initial vertical position information of the second trunk key point as the vertical position information of the first trunk key point;
obtaining a second horizontal parallax displacement between the third face key point and the fourth face key point and a third horizontal parallax displacement between the first trunk key point and the second trunk key point according to the parallax image;
determining the sum of the initial horizontal position information of the third face key point and the second horizontal parallax displacement as the horizontal position information of the third face key point, and determining the sum of the initial horizontal position information of the first trunk key point and the third horizontal parallax displacement as the horizontal position information of the first trunk key point.
This implementation obtains the depth information of at least four reference points in the person object to be detected based on the binocular image, and further obtains the three-dimensional position information of the at least four reference points in the world coordinate system. According to the three-dimensional position information of the at least four reference points in the world coordinate system, whether the human body region of the person object to be detected is a three-dimensional region can be determined, so that two-dimensional attacks on the face recognition technology can be effectively prevented. This implementation obtains the three-dimensional position information of the at least four reference points on the basis of the hardware used by the two-dimensional living body detection method; compared with the two-dimensional living body detection method, it improves the detection accuracy without increasing the hardware cost.
In some embodiments, functions of or modules included in the apparatus provided in the embodiments of the present disclosure may be used to execute the method described in the above method embodiments, and specific implementation thereof may refer to the description of the above method embodiments, and for brevity, will not be described again here.
Fig. 9 is a schematic diagram of a hardware structure of an image processing apparatus according to an embodiment of the present disclosure. The image processing apparatus 2 includes a processor 21, a memory 22, an input device 23, and an output device 24. The processor 21, the memory 22, the input device 23 and the output device 24 are coupled by a connector, which includes various interfaces, transmission lines or buses, etc., and the embodiment of the present application is not limited thereto. It should be appreciated that in various embodiments of the present application, coupled refers to being interconnected in a particular manner, including being directly connected or indirectly connected through other devices, such as through various interfaces, transmission lines, buses, and the like.
The processor 21 may be one or more Graphics Processing Units (GPUs), and in the case that the processor 21 is one GPU, the GPU may be a single-core GPU or a multi-core GPU. Alternatively, the processor 21 may be a processor group composed of a plurality of GPUs, and the plurality of processors are coupled to each other through one or more buses. Alternatively, the processor may be other types of processors, and the like, and the embodiments of the present application are not limited.
Memory 22 may be used to store computer program instructions and various types of computer program code, including program code for executing aspects of the present application. Optionally, the memory includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or compact disc read-only memory (CD-ROM), which is used for related instructions and data.
The input means 23 are for inputting data and/or signals and the output means 24 are for outputting data and/or signals. The input device 23 and the output device 24 may be separate devices or may be an integral device.
It is understood that, in the embodiment of the present application, the memory 22 may be used to store not only the related instructions, but also the related data, for example, the memory 22 may be used to store the binocular image acquired through the input device 23, or the memory 22 may also be used to store the depth information obtained through the processor 21, and the like, and the embodiment of the present application is not limited to the data stored in the memory.
It will be appreciated that fig. 9 only shows a simplified design of an image processing apparatus. In practical applications, the image processing apparatuses may further include other necessary components, including but not limited to any number of input/output devices, processors, memories, etc., and all image processing apparatuses that can implement the embodiments of the present application are within the scope of the present application.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. It is also clear to those skilled in the art that the descriptions of the various embodiments of the present application have different emphasis, and for convenience and brevity of description, the same or similar parts may not be repeated in different embodiments, so that the parts that are not described or not described in detail in a certain embodiment may refer to the descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, the implementation may take the form, in whole or in part, of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the application are produced, in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted through a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital versatile disc (DVD)), or a semiconductor medium (e.g., a solid state disk (SSD)), among others.
One of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by hardware related to instructions of a computer program, which may be stored in a computer-readable storage medium, and when executed, may include the processes of the above method embodiments. And the aforementioned storage medium includes: various media that can store program codes, such as a read-only memory (ROM) or a Random Access Memory (RAM), a magnetic disk, or an optical disk.

Claims (20)

1. An image processing method, characterized in that the method comprises:
acquiring a binocular image and parameters of a binocular camera used to acquire the binocular image, wherein the binocular image comprises a human body region of a person object to be detected;
obtaining depth information of at least four reference points in the human body region, horizontal position information of the at least four reference points and vertical position information of the at least four reference points according to the binocular image, wherein the reference points comprise face key points, or the reference points comprise face key points and trunk key points;
obtaining three-dimensional position information of the at least four reference points in a world coordinate system according to the parameters of the binocular camera, the horizontal position information of the at least four reference points, the vertical position information of the at least four reference points and the depth information of the at least four reference points;
and under the condition that the variance of the three-dimensional position information of the at least four reference points in the world coordinate system in the depth direction is greater than or equal to a first threshold value, determining that the human object to be detected is a living body, wherein the depth direction is a direction perpendicular to an image plane of the binocular camera when the binocular camera collects the binocular image.
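The decision rule of claim 1 reduces to a spread-of-depth test: a flat spoof (printed photo or replayed screen) yields nearly equal depth at every reference point, while a real head and torso span a range of depths. A minimal sketch under that reading (the function names and the example threshold are our own illustration, not part of the claim):

```python
def depth_variance(points_3d):
    # points_3d: list of (x, y, z) world coordinates of the reference
    # points, where z is the depth direction (perpendicular to the
    # image plane of the binocular camera at capture time).
    zs = [p[2] for p in points_3d]
    mean_z = sum(zs) / len(zs)
    return sum((z - mean_z) ** 2 for z in zs) / len(zs)

def is_living_body(points_3d, first_threshold):
    # Live when the spread of depths is at least the first threshold;
    # a coplanar spoof keeps the depth variance near zero.
    return depth_variance(points_3d) >= first_threshold
```

For example, four points at identical depth give a variance of zero and are rejected, while points spread over a few tens of centimetres of depth pass.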
2. The method of claim 1, wherein the binocular image comprises: a first image to be processed and a second image to be processed; the binocular camera includes: the first camera is used for collecting the first image to be processed and the second camera is used for collecting the second image to be processed; the parameters of the binocular camera include: a distance between the first camera and the second camera, a first focal length of the first camera, and a second focal length of the second camera;
the obtaining of the depth information of at least four reference points in the human body region according to the binocular image includes:
obtaining parallax images of the first image to be processed and the second image to be processed according to the first image to be processed and the second image to be processed, wherein the parallax images carry parallax information of the at least four reference points;
performing stereo correction processing on the first image to be processed and the second image to be processed to normalize the first focal length and the second focal length to obtain a normalized focal length;
and obtaining the depth information of the at least four reference points according to the parallax information of the at least four reference points, the normalized focal length and the distance.
3. The method of claim 2, wherein the at least four reference points comprise a first reference point;
the obtaining the depth information of the at least four reference points according to the parallax information of the at least four reference points, the normalized focal length and the distance includes:
determining the product of the normalized focal length and the distance to obtain a first intermediate value;
and determining the quotient of the first intermediate value and the parallax information of the first reference point to obtain the depth information of the first reference point.
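The two steps of claim 3 are the standard stereo triangulation formula Z = f·B/d: depth equals the normalized focal length times the baseline, divided by the point's disparity. A sketch (the function name, units, and the zero-disparity guard are our additions):

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    # First intermediate value: the product of the normalized focal
    # length (pixels) and the distance between the two cameras (metres).
    first_intermediate = focal_px * baseline_m
    if disparity_px <= 0:
        # Zero disparity corresponds to a point at infinity.
        raise ValueError("disparity must be positive")
    # Depth is the quotient of that product and the point's disparity.
    return first_intermediate / disparity_px
```

With a 1000 px normalized focal length and a 6 cm baseline, a 50 px disparity corresponds to a depth of 1.2 m.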
4. The method according to claim 2, wherein the performing stereo correction processing on the first image to be processed and the second image to be processed to normalize the first focal length and the second focal length to obtain a normalized focal length includes:
performing stereo correction processing on the first image to be processed and the second image to be processed to normalize the parameters of the first camera and the parameters of the second camera to obtain normalized camera parameters;
the obtaining of the three-dimensional position information of the at least four reference points in the world coordinate system according to the parameters of the binocular camera, the horizontal position information of the at least four reference points, the vertical position information of the at least four reference points, and the depth information of the at least four reference points includes:
and obtaining three-dimensional position information of the at least four reference points in a world coordinate system according to the normalized camera parameters, the horizontal position information of the at least four reference points, the vertical position information of the at least four reference points and the depth information of the at least four reference points.
5. The method of claim 4, wherein the at least four reference points comprise a second reference point; the normalized camera parameters include: horizontal position information of the central point of the camera after normalization and vertical position information of the central point after normalization, wherein the central point is the intersection point of the image plane of the first camera and the optical axis of the first camera; the normalized focal length includes: horizontal position information of the normalized focal length and vertical position information of the normalized focal length; the three-dimensional position information of the at least four reference points in the world coordinate system comprises: horizontal position information of the at least four reference points in the world coordinate system, vertical position information of the at least four reference points in the world coordinate system and depth position information of the at least four reference points in the world coordinate system;
the obtaining three-dimensional position information of the at least four reference points in a world coordinate system according to the normalized camera parameters, the horizontal position information of the at least four reference points, the vertical position information of the at least four reference points, and the depth information of the at least four reference points includes:
determining a difference between the horizontal position information of the second reference point and the horizontal position information of the central point to obtain a second intermediate value, and determining a quotient of the depth information of the second reference point and the horizontal position information of the normalized focal length to obtain a third intermediate value;
determining a difference between the vertical position information of the second reference point and the vertical position information of the central point to obtain a fourth intermediate value, and determining a quotient of the depth information of the second reference point and the vertical position information of the normalized focal length to obtain a fifth intermediate value;
and taking the product of the second intermediate value and the third intermediate value as the horizontal position information of the second reference point in the world coordinate system, taking the product of the fourth intermediate value and the fifth intermediate value as the vertical position information of the second reference point in the world coordinate system, and taking the depth information of the second reference point as the depth position information of the second reference point in the world coordinate system.
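The intermediate values of claim 5 match the standard pinhole back-projection X = (u − cx)·Z/fx, Y = (v − cy)·Z/fy, reading the claimed quotient as the point's depth over the corresponding focal-length component. A sketch under that reading (all names are our own):

```python
def back_project(u, v, z, fx, fy, cx, cy):
    # Second intermediate value: horizontal offset of the pixel from
    # the central (principal) point; third: depth over the horizontal
    # focal length. Their product is the world X coordinate.
    second = u - cx
    third = z / fx
    # Fourth and fifth intermediate values are the vertical analogues.
    fourth = v - cy
    fifth = z / fy
    # The depth itself serves as the world Z coordinate.
    return (second * third, fourth * fifth, z)
```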
6. The method according to claim 1, wherein the determining that the human object to be detected is a living body in the case that the variance of the three-dimensional position information of the at least four reference points in the world coordinate system in the depth direction is greater than or equal to a first threshold value comprises:
constructing a matrix by using the three-dimensional position information of the at least four reference points in a world coordinate system, so that each row in the matrix comprises the three-dimensional position information of one reference point, and obtaining a coordinate matrix;
and determining at least one singular value of the coordinate matrix, and determining that the human object to be detected is a living body under the condition that the ratio of the minimum value of the at least one singular value to the sum of the at least one singular value is greater than or equal to a second threshold value.
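Claim 6 restates the liveness test as a coplanarity check on the N×3 coordinate matrix: if the reference points lie in (or near) a plane, the smallest singular value is (near) zero, so the ratio of the smallest singular value to the sum of all singular values is small. A library-free sketch (the Jacobi diagonalization of the 3×3 Gram matrix and all names are our own illustration):

```python
import math

def singular_values(points):
    # Singular values of the N x 3 coordinate matrix, largest first,
    # computed as square roots of the eigenvalues of the 3 x 3 Gram
    # matrix A^T A, diagonalized by classical Jacobi rotations.
    m = [[sum(pt[i] * pt[j] for pt in points) for j in range(3)] for i in range(3)]
    for _ in range(100):
        # Pick the largest off-diagonal entry and rotate it to zero.
        p, q = max([(0, 1), (0, 2), (1, 2)], key=lambda ij: abs(m[ij[0]][ij[1]]))
        if abs(m[p][q]) < 1e-12:
            break
        theta = 0.5 * math.atan2(2 * m[p][q], m[p][p] - m[q][q])
        c, s = math.cos(theta), math.sin(theta)
        r = [[float(i == j) for j in range(3)] for i in range(3)]
        r[p][p], r[q][q], r[p][q], r[q][p] = c, c, -s, s
        # m <- R^T m R annihilates the chosen off-diagonal entry.
        rtm = [[sum(r[k][i] * m[k][j] for k in range(3)) for j in range(3)] for i in range(3)]
        m = [[sum(rtm[i][k] * r[k][j] for k in range(3)) for j in range(3)] for i in range(3)]
    eig = sorted((max(m[i][i], 0.0) for i in range(3)), reverse=True)
    return [math.sqrt(e) for e in eig]

def is_live(points, second_threshold):
    # Live when the smallest singular value is a large enough fraction
    # of the total, i.e. the points are not (near-)coplanar.
    sv = singular_values(points)
    return min(sv) / sum(sv) >= second_threshold
```

For a genuinely three-dimensional point set, such as the corners of a tetrahedron, the ratio is large; for four coplanar points it is exactly zero.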
7. The method of claim 2, wherein the human body region comprises a face region, and the at least four reference points comprise a first face key point;
the obtaining of the horizontal position information of the at least four reference points and the vertical position information of the at least four reference points according to the binocular image includes:
respectively performing face key point detection processing on the first image to be processed and the second image to be processed to obtain initial horizontal position information and initial vertical position information of the first face key point in the first image to be processed, and initial horizontal position information and initial vertical position information of the second face key point in the second image to be processed, wherein the first face key point and the second face key point are homonymous points;
taking the initial vertical position information of the first face key point or the initial vertical position information of the second face key point as the vertical position information of the first face key point;
obtaining a first horizontal parallax displacement between the first face key point and the second face key point according to the parallax image;
determining a sum of the initial horizontal position information of the first face keypoints and the first horizontal parallax displacement as horizontal position information of the first face keypoints.
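The combination step of claim 7 can be sketched as follows, assuming the two images have already been rectified so that homonymous (corresponding) points share a row; the function name and argument order are our own:

```python
def fuse_keypoint(u_first, v_first, v_second, horizontal_parallax_displacement):
    # Horizontal position: initial horizontal position in the first
    # image plus the horizontal parallax displacement read from the
    # disparity image at this keypoint.
    u = u_first + horizontal_parallax_displacement
    # Vertical position: either view's value may be used, since after
    # rectification the two rows coincide (v_second is equally valid).
    v = v_first
    return (u, v)
```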
8. The method of claim 2, wherein the human body region comprises a face region and a trunk region; the at least four reference points comprise: a third face key point and a first trunk key point;
the obtaining of the horizontal position information of the at least four reference points and the vertical position information of the at least four reference points according to the binocular image includes:
respectively performing face key point detection processing and trunk key point detection processing on the first image to be processed and the second image to be processed to obtain initial horizontal position information and initial vertical position information of the third face key point and of the first trunk key point in the first image to be processed, and initial horizontal position information and initial vertical position information of the fourth face key point and of the second trunk key point in the second image to be processed, wherein the third face key point and the fourth face key point are homonymous points, and the first trunk key point and the second trunk key point are homonymous points;
taking the initial vertical position information of the third face key point or the initial vertical position information of the fourth face key point as the vertical position information of the third face key point, and taking the initial vertical position information of the first trunk key point or the initial vertical position information of the second trunk key point as the vertical position information of the first trunk key point;
obtaining a second horizontal parallax displacement between the third face key point and the fourth face key point and a third horizontal parallax displacement between the first trunk key point and the second trunk key point according to the parallax image;
determining the sum of the initial horizontal position information of the third face key point and the second horizontal parallax displacement as the horizontal position information of the third face key point, and determining the sum of the initial horizontal position information of the first trunk key point and the third horizontal parallax displacement as the horizontal position information of the first trunk key point.
9. An image processing apparatus, characterized in that the apparatus comprises:
the binocular image acquisition unit is used for acquiring a binocular image and parameters of a binocular camera used to acquire the binocular image, wherein the binocular image comprises a human body region of a person object to be detected;
the first processing unit is used for obtaining depth information of at least four reference points in the human body region, horizontal position information of the at least four reference points and vertical position information of the at least four reference points according to the binocular image, wherein the reference points comprise face key points, or the reference points comprise face key points and trunk key points;
the second processing unit is used for obtaining three-dimensional position information of the at least four reference points in a world coordinate system according to the parameters of the binocular camera, the horizontal position information of the at least four reference points, the vertical position information of the at least four reference points and the depth information of the at least four reference points;
the determining unit is used for determining that the person object to be detected is a living body under the condition that the variance of the three-dimensional position information of the at least four reference points in the world coordinate system in the depth direction is greater than or equal to a first threshold, wherein the depth direction is the direction perpendicular to the image plane of the binocular camera when the binocular camera acquires the binocular image.
10. The apparatus of claim 9, wherein the binocular image comprises: a first image to be processed and a second image to be processed; the binocular camera includes: the first camera is used for collecting the first image to be processed and the second camera is used for collecting the second image to be processed; the parameters of the binocular camera include: a distance between the first camera and the second camera, a first focal length of the first camera, and a second focal length of the second camera;
the first processing unit is configured to:
obtaining parallax images of the first image to be processed and the second image to be processed according to the first image to be processed and the second image to be processed, wherein the parallax images carry parallax information of the at least four reference points;
performing stereo correction processing on the first image to be processed and the second image to be processed to normalize the first focal length and the second focal length to obtain a normalized focal length;
and obtaining the depth information of the at least four reference points according to the parallax information of the at least four reference points, the normalized focal length and the distance.
11. The apparatus of claim 10, wherein the at least four reference points comprise a first reference point;
the first processing unit is configured to:
determining the product of the normalized focal length and the distance to obtain a first intermediate value;
and determining the quotient of the first intermediate value and the parallax information of the first reference point to obtain the depth information of the first reference point.
12. The apparatus of claim 10, wherein the first processing unit is configured to:
performing stereo correction processing on the first image to be processed and the second image to be processed to normalize the parameters of the first camera and the parameters of the second camera to obtain normalized camera parameters;
the second processing unit is configured to:
and obtaining three-dimensional position information of the at least four reference points in a world coordinate system according to the normalized camera parameters, the horizontal position information of the at least four reference points, the vertical position information of the at least four reference points and the depth information of the at least four reference points.
13. The apparatus of claim 12, wherein the at least four reference points comprise a second reference point; the normalized camera parameters include: horizontal position information of the central point of the camera after normalization and vertical position information of the central point after normalization, wherein the central point is the intersection point of the image plane of the first camera and the optical axis of the first camera; the normalized focal length includes: horizontal position information of the normalized focal length and vertical position information of the normalized focal length; the three-dimensional position information of the at least four reference points in the world coordinate system comprises: horizontal position information of the at least four reference points in the world coordinate system, vertical position information of the at least four reference points in the world coordinate system and depth position information of the at least four reference points in the world coordinate system;
the second processing unit is configured to:
determine a difference between the horizontal position information of the second reference point and the horizontal position information of the central point to obtain a second intermediate value, and determine a quotient of the depth information of the second reference point and the horizontal position information of the normalized focal length to obtain a third intermediate value;
determine a difference between the vertical position information of the second reference point and the vertical position information of the central point to obtain a fourth intermediate value, and determine a quotient of the depth information of the second reference point and the vertical position information of the normalized focal length to obtain a fifth intermediate value;
and take the product of the second intermediate value and the third intermediate value as the horizontal position information of the second reference point in the world coordinate system, take the product of the fourth intermediate value and the fifth intermediate value as the vertical position information of the second reference point in the world coordinate system, and take the depth information of the second reference point as the depth position information of the second reference point in the world coordinate system.
14. The apparatus of claim 9, wherein the determining unit is configured to:
constructing a matrix by using the three-dimensional position information of the at least four reference points in a world coordinate system, so that each row in the matrix comprises the three-dimensional position information of one reference point, and obtaining a coordinate matrix;
and determining at least one singular value of the coordinate matrix, and determining that the human object to be detected is a living body under the condition that the ratio of the minimum value of the at least one singular value to the sum of the at least one singular value is greater than or equal to a second threshold value.
15. The apparatus of claim 10, wherein the human body region comprises a face region, and the at least four reference points comprise a first face key point;
the first processing unit is configured to:
respectively performing face key point detection processing on the first image to be processed and the second image to be processed to obtain initial horizontal position information and initial vertical position information of the first face key point in the first image to be processed, and initial horizontal position information and initial vertical position information of the second face key point in the second image to be processed, wherein the first face key point and the second face key point are homonymous points;
taking the initial vertical position information of the first face key point or the initial vertical position information of the second face key point as the vertical position information of the first face key point;
obtaining a first horizontal parallax displacement between the first face key point and the second face key point according to the parallax image;
determining a sum of the initial horizontal position information of the first face keypoints and the first horizontal parallax displacement as horizontal position information of the first face keypoints.
16. The apparatus of claim 10, wherein the human body region comprises a face region and a trunk region; the at least four reference points comprise: a third face key point and a first trunk key point;
the first processing unit is configured to:
respectively performing face key point detection processing and trunk key point detection processing on the first image to be processed and the second image to be processed to obtain initial horizontal position information and initial vertical position information of the third face key point and of the first trunk key point in the first image to be processed, and initial horizontal position information and initial vertical position information of the fourth face key point and of the second trunk key point in the second image to be processed, wherein the third face key point and the fourth face key point are homonymous points, and the first trunk key point and the second trunk key point are homonymous points;
taking the initial vertical position information of the third face key point or the initial vertical position information of the fourth face key point as the vertical position information of the third face key point, and taking the initial vertical position information of the first trunk key point or the initial vertical position information of the second trunk key point as the vertical position information of the first trunk key point;
obtaining a second horizontal parallax displacement between the third face key point and the fourth face key point and a third horizontal parallax displacement between the first trunk key point and the second trunk key point according to the parallax image;
determining the sum of the initial horizontal position information of the third face key point and the second horizontal parallax displacement as the horizontal position information of the third face key point, and determining the sum of the initial horizontal position information of the first trunk key point and the third horizontal parallax displacement as the horizontal position information of the first trunk key point.
17. A processor configured to perform the method of any one of claims 1 to 8.
18. An electronic device, comprising: a processor, transmitting means, input means, output means and a memory for storing computer program code comprising computer instructions which, when executed by the processor, cause the electronic device to perform the method of any of claims 1 to 8.
19. A computer-readable storage medium, in which a computer program is stored, the computer program comprising program instructions which, when executed by a processor of an electronic device, cause the processor to carry out the method of any one of claims 1 to 8.
20. A computer program product comprising instructions for causing a computer to perform the method of any one of claims 1 to 8 when the computer program product is run on a computer.
CN201911322102.4A 2019-12-19 2019-12-19 Image processing method and device, processor, electronic equipment and storage medium Active CN111160178B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911322102.4A CN111160178B (en) 2019-12-19 2019-12-19 Image processing method and device, processor, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111160178A true CN111160178A (en) 2020-05-15
CN111160178B CN111160178B (en) 2024-01-12

Family

ID=70557488

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911322102.4A Active CN111160178B (en) 2019-12-19 2019-12-19 Image processing method and device, processor, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111160178B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105023010A (en) * 2015-08-17 2015-11-04 中国科学院半导体研究所 Face living body detection method and system
CN108764091A (en) * 2018-05-18 2018-11-06 北京市商汤科技开发有限公司 Biopsy method and device, electronic equipment and storage medium
CN109558764A (en) * 2017-09-25 2019-04-02 杭州海康威视数字技术股份有限公司 Face identification method and device, computer equipment

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111898553B (en) * 2020-07-31 2022-08-09 成都新潮传媒集团有限公司 Method and device for distinguishing virtual image personnel and computer equipment
CN111898552A (en) * 2020-07-31 2020-11-06 成都新潮传媒集团有限公司 Method and device for distinguishing person attention target object and computer equipment
CN111898553A (en) * 2020-07-31 2020-11-06 成都新潮传媒集团有限公司 Method and device for distinguishing virtual image personnel and computer equipment
CN112150527A (en) * 2020-08-31 2020-12-29 深圳市慧鲤科技有限公司 Measuring method and device, electronic device and storage medium
CN112150527B (en) * 2020-08-31 2024-05-17 深圳市慧鲤科技有限公司 Measurement method and device, electronic equipment and storage medium
WO2022218161A1 (en) * 2021-04-16 2022-10-20 上海商汤智能科技有限公司 Method and apparatus for target matching, device, and storage medium
CN113504890A (en) * 2021-07-14 2021-10-15 炬佑智能科技(苏州)有限公司 ToF camera-based speaker assembly control method, apparatus, device, and medium
CN113568595A (en) * 2021-07-14 2021-10-29 上海炬佑智能科技有限公司 ToF camera-based display assembly control method, device, equipment and medium
CN113568595B (en) * 2021-07-14 2024-05-17 上海炬佑智能科技有限公司 Control method, device, equipment and medium of display assembly based on ToF camera
CN113689484A (en) * 2021-08-25 2021-11-23 北京三快在线科技有限公司 Method and device for determining depth information, terminal and storage medium
CN113705428A (en) * 2021-08-26 2021-11-26 北京市商汤科技开发有限公司 Living body detection method and apparatus, electronic device, and computer-readable storage medium
WO2023024473A1 (en) * 2021-08-26 2023-03-02 上海商汤智能科技有限公司 Living body detection method and apparatus, and electronic device, computer-readable storage medium and computer program product
CN114963025A (en) * 2022-04-19 2022-08-30 深圳市城市公共安全技术研究院有限公司 Leakage point positioning method and device, electronic equipment and readable storage medium
CN114963025B (en) * 2022-04-19 2024-03-26 深圳市城市公共安全技术研究院有限公司 Leakage point positioning method and device, electronic equipment and readable storage medium
CN116503570A (en) * 2023-06-29 2023-07-28 聚时科技(深圳)有限公司 Three-dimensional reconstruction method and related device for image

Also Published As

Publication number Publication date
CN111160178B (en) 2024-01-12

Similar Documents

Publication Publication Date Title
CN111160178B (en) Image processing method and device, processor, electronic equipment and storage medium
US7554575B2 (en) Fast imaging system calibration
CN110942032B (en) Living body detection method and device, and storage medium
CN104933389B (en) Identity recognition method and device based on finger veins
CN108345779B (en) Unlocking control method and related product
CN111194449A (en) System and method for human face living body detection
WO2022095596A1 (en) Image alignment method, image alignment apparatus and terminal device
CN111563924B (en) Image depth determination method, living body identification method, circuit, device, and medium
CN110782412B (en) Image processing method and device, processor, electronic device and storage medium
CN113280752B (en) Groove depth measurement method, device and system and laser measurement equipment
CN115526983B (en) Three-dimensional reconstruction method and related equipment
CN110008943B (en) Image processing method and device, computing equipment and storage medium
CN112270709B (en) Map construction method and device, computer readable storage medium and electronic equipment
CN112423191B (en) Video call device and audio gain method
CN112802081B (en) Depth detection method and device, electronic equipment and storage medium
CN110554356A (en) Equipment positioning method and system in visible light communication
CN114511608A (en) Method, device, terminal, imaging system and medium for acquiring depth image
CN109816628A (en) Face evaluation method and Related product
CN111724421B (en) Image processing method and device, electronic equipment and storage medium
CN111160233B (en) Human face in-vivo detection method, medium and system based on three-dimensional imaging assistance
CN112365530A (en) Augmented reality processing method and device, storage medium and electronic equipment
CN115031635A (en) Measuring method and device, electronic device and storage medium
CN115880206A (en) Image accuracy judging method, device, equipment, storage medium and program product
CN113705428A (en) Living body detection method and apparatus, electronic device, and computer-readable storage medium
CN113240602A (en) Image defogging method and device, computer readable medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant