CN111079520B - Image recognition method, device and storage medium - Google Patents

Image recognition method, device and storage medium

Info

Publication number
CN111079520B
CN111079520B
Authority
CN
China
Prior art keywords
image
target
images
visible light
frames
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911057996.9A
Other languages
Chinese (zh)
Other versions
CN111079520A (en)
Inventor
林坤宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingdong Technology Holding Co Ltd
Original Assignee
Jingdong Technology Holding Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jingdong Technology Holding Co Ltd filed Critical Jingdong Technology Holding Co Ltd
Priority to CN201911057996.9A priority Critical patent/CN111079520B/en
Publication of CN111079520A publication Critical patent/CN111079520A/en
Application granted granted Critical
Publication of CN111079520B publication Critical patent/CN111079520B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The embodiments of the application disclose an image recognition method, device, and storage medium. The method includes: acquiring at least two frames of images of a target object to obtain a first video and a second video, where one of the first video and the second video consists of at least two frames of visible light images of the target object, and the other consists of at least two frames of structured light images of the target object; determining a target contour image from the at least two frames of visible light images and a depth image matching the target contour image from the at least two frames of structured light images, or determining a target depth image from the at least two frames of structured light images and a contour image matching the target depth image from the at least two frames of visible light images; and obtaining a face target image from the target contour image and the depth image matching the target contour image, or from the target depth image and the contour image matching the target depth image.

Description

Image recognition method, device and storage medium
Technical Field
The present application relates to image processing technology, and in particular to an image recognition method, device, and storage medium.
Background
Face recognition in the related art generally regards the desired face image as recognized once the same face, or a similar face with high similarity, has been identified. The recognition relies on contour features of the face, such as the size and shape of the facial features, the distances between them, and their color. A typical implementation identifies, among multiple frames captured by a camera, an image with high similarity and takes it as the recognized face image. It can be understood that, beyond differences in overall contour, different faces also differ in details such as the depth of the eye sockets and the heights of the nose and the cheekbones. For convenience of description, such details are referred to here as the depth information of the face. The related art also includes schemes that perform face recognition using depth information: an image whose depth information has high similarity is identified among the captured frames and taken as the recognized face depth image. As can be seen, the face recognition schemes in the related art start from a single dimension, either the face contour or the depth information of the face, and the accuracy of recognition performed from a single dimension still leaves room for improvement.
Disclosure of Invention
In order to solve the above technical problems, the embodiments of the application provide an image recognition method, device, and storage medium.
The technical solutions of the embodiments of the application are implemented as follows:
An embodiment of the application provides an image recognition method, which includes the following steps:
acquiring at least two frames of images of a target object to obtain a first video and a second video, where one of the first video and the second video consists of at least two frames of visible light images of the target object, and the other consists of at least two frames of structured light images of the target object;
determining a target contour image from the at least two frames of visible light images and a depth image matching the target contour image from the at least two frames of structured light images; or determining a target depth image from the at least two frames of structured light images and a contour image matching the target depth image from the at least two frames of visible light images;
and obtaining a face target image from the target contour image and the depth image matching the target contour image, or from the target depth image and the contour image matching the target depth image.
In the above solution, determining the target contour image from the at least two frames of visible light images includes:
obtaining at least some of the at least two frames of visible light images;
identifying, from the obtained frames of visible light images, an image whose face similarity reaches a first threshold, and determining it as the target contour image;
and acquiring the acquisition time or return time information of the target contour image.
Correspondingly, determining the depth image matching the target contour image from the at least two frames of structured light images includes:
caching each frame of structured light image one by one, where the caching is stopped once the target contour image is identified;
acquiring the acquisition time or return time information of each cached frame of structured light image;
and determining, among the cached frames of structured light images, the image whose acquisition time difference or return time difference is less than or equal to a third threshold as the depth image matching the target contour image.
In the above solution, determining the target depth image from the at least two frames of structured light images includes:
obtaining at least some of the at least two frames of structured light images;
identifying, from the obtained frames of structured light images, an image whose face depth information similarity reaches a second threshold, and determining it as the target depth image;
and acquiring the acquisition time or return time information of the target depth image.
Correspondingly, determining the contour image matching the target depth image from the at least two frames of visible light images includes:
caching each frame of visible light image one by one, where the caching is stopped once the target depth image is identified;
acquiring the acquisition time or return time information of each cached frame of visible light image;
and determining, among the cached frames of visible light images, the image whose acquisition time difference or return time difference is less than or equal to a fourth threshold as the contour image matching the target depth image.
In the above solution, the captured structured light images or visible light images are cached one by one using a buffer of preset capacity.
In the above solution, when the caching has not been stopped and the buffer is not full, the structured light images or visible light images are cached one by one according to acquisition time or return time;
and when the caching has not been stopped and the buffer is full, the structured light or visible light image with the earliest acquisition time or transmission time is deleted, and the newly acquired or newly returned structured light image or visible light image is cached.
In the above solution, obtaining some of the at least two frames of visible light images includes:
creating a thread pool for the at least two frames of visible light images, where the threads in the thread pool are used serially;
when a first thread in the thread pool performs target contour image recognition on one of the at least two frames of visible light images,
a second thread in the thread pool cannot perform target contour image recognition on at least one other visible light image among the at least two frames of visible light images;
and the at least one other visible light image is deleted from the at least two frames of visible light images to obtain the partial frames of visible light images.
In the above solution, obtaining some of the at least two frames of structured light images includes:
creating a thread pool for the at least two frames of structured light images, where the threads in the thread pool are used serially;
when a first thread in the thread pool performs target depth image recognition on one of the at least two frames of structured light images,
a second thread in the thread pool cannot perform target depth image recognition on at least one other structured light image among the at least two frames of structured light images;
and the at least one other structured light image is deleted from the at least two frames of structured light images to obtain the partial frames of structured light images.
An embodiment of the application also provides an image recognition device, which includes:
a camera, configured to acquire at least two frames of images of a target object to obtain a first video and a second video, where one of the first video and the second video consists of at least two frames of visible light images of the target object, and the other consists of at least two frames of structured light images of the target object;
a first determining unit, configured to determine a target contour image from the at least two frames of visible light images, or to determine a target depth image from the at least two frames of structured light images;
a second determining unit, configured to determine a depth image matching the target contour image from the at least two frames of structured light images, or to determine a contour image matching the target depth image from the at least two frames of visible light images;
and an obtaining unit, configured to obtain a face target image from the target contour image and the depth image matching the target contour image, or from the target depth image and the contour image matching the target depth image.
In the above solution, the first determining unit is configured to:
obtain at least some of the at least two frames of visible light images;
identify, from the obtained frames of visible light images, an image whose face similarity reaches a first threshold, and determine it as the target contour image;
and acquire the acquisition time or return time information of the target contour image.
Correspondingly, the second determining unit is configured to:
cache each frame of structured light image one by one, where the caching is stopped once the target contour image is identified;
acquire the acquisition time or return time information of each cached frame of structured light image;
and determine, among the cached frames of structured light images, the image whose acquisition time difference or return time difference is less than or equal to a third threshold as the depth image matching the target contour image.
In the above solution, the first determining unit is configured to:
obtain at least some of the at least two frames of structured light images;
identify, from the obtained frames of structured light images, an image whose face depth information similarity reaches a second threshold, and determine it as the target depth image;
and acquire the acquisition time or return time information of the target depth image.
Correspondingly, the second determining unit is configured to:
cache each frame of visible light image one by one, where the caching is stopped once the target depth image is identified;
acquire the acquisition time or return time information of each cached frame of visible light image;
and determine, among the cached frames of visible light images, the image whose acquisition time difference or return time difference is less than or equal to a fourth threshold as the contour image matching the target depth image.
An embodiment of the application also provides a computer-readable storage medium on which a computer program is stored, where the program, when executed by a processor, implements the steps of the foregoing image recognition method.
An embodiment of the application also provides an image recognition device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the foregoing image recognition method when executing the program.
The embodiments of the application provide an image recognition method, device, and storage medium. The method includes: acquiring at least two frames of images of a target object to obtain a first video and a second video, where one of the first video and the second video consists of at least two frames of visible light images of the target object, and the other consists of at least two frames of structured light images of the target object; determining a target contour image from the at least two frames of visible light images and a depth image matching the target contour image from the at least two frames of structured light images, or determining a target depth image from the at least two frames of structured light images and a contour image matching the target depth image from the at least two frames of visible light images; and obtaining a face target image from the target contour image and the depth image matching the target contour image, or from the target depth image and the contour image matching the target depth image.
The embodiments of the application thus provide a finer-grained face recognition scheme: face contour features and face depth information are combined for face recognition, which can greatly improve its accuracy. In addition, by matching contour images against depth images, a contour image and a depth image captured at the same or a similar moment are identified, so that the desired face image is obtained, which gives a degree of assurance to the various applications built on the face image.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and a person skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a schematic flowchart of a first embodiment of the image recognition method provided by the present application;
Fig. 2 is a schematic flowchart of a second embodiment of the image recognition method provided by the present application;
Fig. 3 is a schematic flowchart of a third embodiment of the image recognition method provided by the present application;
Fig. 4 is a schematic diagram of a face to be captured in an application scenario of an embodiment of the present application;
Fig. 5 is a schematic diagram of the working principle of the buffer according to the present application;
Fig. 6 is a schematic diagram of the hardware structure of an embodiment of the image recognition device provided by the present application;
Fig. 7 is a schematic diagram of the composition of an embodiment of the image recognition device provided by the present application.
Detailed Description
To make the objects, technical solutions, and advantages of the present application clearer, the technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application. The embodiments of the application and the features of the embodiments may be combined with one another arbitrarily provided there is no conflict. The steps illustrated in the flowcharts of the figures may be performed in a computer system, such as one executing a set of computer-executable instructions. Also, although a logical order is depicted in the flowcharts, in some cases the steps depicted or described may be performed in an order different from the one presented here.
In the embodiments of the application, the face contour and the face depth features are combined to recognize the face. Compared with a scheme that recognizes the face from the face contour alone or from the face depth information alone, combining the two can greatly improve the accuracy of face recognition. In addition, not only is the desired face contour image (target contour image) or the desired face depth image (target depth image) recognized, but also the depth image matching the desired face contour image, or the contour image matching the desired face depth image, is recognized; such a pair corresponds to a contour image and a depth image captured of the same scene of the same face. It can be understood that a contour image and a depth image captured of the same scene of the same face can be regarded as having been captured of that face at the same or a similar moment; identifying both, and taking the pair together as the face target image, is equivalent to identifying the contour features and the depth features of the face at the same or a similar moment. The scheme therefore both recognizes the face image by combining contour information and depth information, and matches the contour image with the depth image acquired at the same or a similar time, taking the image synthesized from the two as the finally recognized desired face image. For a further description of the solutions provided by the embodiments of the application, see below.
Fig. 1 shows a first embodiment of the image recognition method provided by the present application. As shown in Fig. 1, the method includes:
Step (S) 101: acquiring at least two frames of images of a target object to obtain a first video and a second video, where one of the first video and the second video consists of at least two frames of visible light images of the target object, and the other consists of at least two frames of structured light images of the target object.
Specifically, the images of the target object may be captured by an acquisition device capable of separating the visible light information and the structured light information in an image, for example a multi-view camera such as a binocular or trinocular camera. Since the embodiments of the application are intended to recognize face images, the target object is a face, for example the face shown in Fig. 4. The multi-view camera captures images of the face to obtain visible light images and structured light images of the face. The principle by which a multi-view camera captures visible light images and structured light images is not described here in detail; please refer to the related descriptions.
S102: determining a target contour image from the at least two frames of visible light images and a depth image matching the target contour image from the at least two frames of structured light images; or determining a target depth image from the at least two frames of structured light images and a contour image matching the target depth image from the at least two frames of visible light images.
S103: obtaining a face target image from the target contour image and the depth image matching the target contour image, or from the target depth image and the contour image matching the target depth image.
In S103, the target contour image and the depth image matching it may be synthesized to obtain the final desired face image; or the target depth image and the contour image matching it may be synthesized to obtain the final desired face image. A sketch of this synthesis step is given below.
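As an illustration of the synthesis step, a minimal Python sketch follows. It assumes the matched pair has already been found, that the contour image is an 8-bit RGB array, that the depth image is a single-channel array of the same resolution, and that synthesis simply means stacking the two into one RGB-D array; the function name and the channel layout are illustrative assumptions, not taken from this application.

```python
import numpy as np

def compose_face_image(contour_rgb: np.ndarray, depth: np.ndarray) -> np.ndarray:
    """Stack a matched contour (RGB) image and depth image into one RGB-D array.

    Assumes both images show the same face at the same or a similar moment
    and share the same height and width (an illustrative notion of synthesis).
    """
    if contour_rgb.shape[:2] != depth.shape[:2]:
        raise ValueError("matched contour and depth images must share a resolution")
    # Append the depth map as a fourth channel next to the three color channels.
    return np.dstack([contour_rgb, depth.astype(contour_rgb.dtype)])
```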
The entities performing S101 to S103 are any image recognition devices that can recognize, or need to recognize, a face, such as smartphones, tablet computers, all-in-one computers, desktop computers, and smart door locks; of course, any other reasonable device with a face recognition function is also included. Since such devices are numerous, they are not enumerated one by one.
It can be appreciated that the visible light information in an image facilitates analysis of the contour features of the face, while the structured light information facilitates analysis of the heights of points on the face, such as the height of the nose and the depth of the eye sockets. S101 amounts to separating the captured images along the two dimensions of contour features and depth features, that is, dividing the at least two frames captured of the face into two video streams, one carrying the visible light information and the other the structured light information. Based on this separation, a target contour image and a depth image matching it, or a target depth image and a contour image matching it, are determined, thereby obtaining the desired face image (face target image). The recognition method provided by the embodiments of the application therefore not only recognizes the desired face contour image (target contour image) or the desired face depth image (target depth image) from one of the two videos, but also recognizes, from the other video stream, the depth image matching the desired face contour image or the contour image matching the desired face depth image. Combining face contour features and face depth information for face recognition can greatly improve its accuracy. In addition, a contour image and a depth image from the same or a similar moment can be identified, and the image synthesized from them can be taken as the finally recognized desired face image.
In S102 and S103, the choice of which video serves as the base video and which as the secondary video, and the recognition of the contour and depth information of the face to obtain the desired face contour image (target contour image) or the desired face depth image (target depth image), may be implemented in one of the following two ways:
Mode one: determining a target contour image from the at least two frames of visible light images; determining a depth image matching the target contour image from the at least two frames of structured light images; and obtaining a face target image from the target contour image and the depth image matching it.
In mode one, the video carrying the visible light information is used as the base video, and the video carrying the structured light information as the secondary video. The desired face contour image is identified from the video of visible light information; the depth image acquired at the same or a similar time as the desired face contour image is identified from the video of structured light information; and the composite of the desired face contour image and its matching depth image is taken as the finally recognized face image.
Mode two: determining a target depth image from the at least two frames of structured light images; determining a contour image matching the target depth image from the at least two frames of visible light images; and obtaining a face target image from the target depth image and the contour image matching it.
In mode two, the video carrying the structured light information is used as the base video, and the video carrying the visible light information as the secondary video. The desired face depth image is identified from the video of structured light information; the contour image acquired at the same or a similar time as the desired face depth image is identified from the video of visible light information; and the composite of the desired face depth image and its matching contour image is taken as the finally recognized face image.
Whichever mode is adopted, face contour features and face depth information are combined for face recognition, which can greatly improve its accuracy. In addition, a contour image and a depth image from the same or a similar moment can be identified, and the image synthesized from them taken as the finally recognized face image. The finally recognized face image carries both the contour features and the depth features of the face, which improves the accuracy of face recognition. Moreover, compared with the related art, in which only the contour features or only the depth features of the face are recognized, more facial features are recognized, which makes subsequent applications of the recognition result, such as face unlocking and the identification of suspects, more convenient. A sketch covering both modes follows.
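Since the two modes differ only in which stream is scanned for the high-quality image and which is searched for a time-matched partner, they can share one routine. The sketch below is a simplified, single-threaded rendering of that idea under assumed types: each frame carries a capture (or return) timestamp, `is_target` stands in for the face-similarity or depth-similarity test against the first or second threshold, and `max_dt` plays the role of the third or fourth threshold. None of these names come from this application.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple

@dataclass
class Frame:
    timestamp: float  # capture time or return time, in milliseconds
    data: object      # a visible light image or a structured light image

def recognize(base: Iterable[Frame],
              secondary: List[Frame],
              is_target: Callable[[Frame], bool],
              max_dt: float) -> Optional[Tuple[Frame, Frame]]:
    """Scan the base stream for the target frame, then pick the secondary
    frame captured at the same or the closest similar moment."""
    for frame in base:
        if is_target(frame):
            if not secondary:
                return None
            # Nearest-timestamp search over the cached secondary frames.
            partner = min(secondary, key=lambda f: abs(f.timestamp - frame.timestamp))
            if abs(partner.timestamp - frame.timestamp) <= max_dt:
                return frame, partner  # their composite is the recognized face image
            return None
    return None
```

In mode one, `base` would be the visible light frames and `secondary` the structured light frames; in mode two the roles are swapped.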
Fig. 2 shows a second embodiment of the image recognition method provided by the present application. As shown in Fig. 2, the method includes:
S201: acquiring at least two frames of images of a target object to obtain a first video and a second video, where one of the first video and the second video consists of at least two frames of visible light images of the target object, and the other consists of at least two frames of structured light images of the target object.
For a description of S201, refer to the earlier description of S101; it is not repeated here.
S202: obtaining at least some of the at least two frames of visible light images.
S203: identifying, from the obtained frames of visible light images, an image whose face similarity reaches a first threshold, and determining it as the target contour image.
S202 and S203 are a further description of determining the target contour image from the at least two frames of visible light images.
S204: acquiring the acquisition time or return time information of the target contour image.
S204 can serve as a further part of determining the target contour image from the at least two frames of visible light images, or can stand as a separate step in the second embodiment.
S205: caching each frame of structured light image one by one, where the caching is stopped once the target contour image is identified.
S206: acquiring the acquisition time or return time information of each cached frame of structured light image.
S207: determining, among the cached frames of structured light images, the image whose acquisition time difference or return time difference is less than or equal to a third threshold as the depth image matching the target contour image.
S205 to S207 are a further description of determining the depth image matching the target contour image from the at least two frames of structured light images. It can be understood that S202 to S204 form one branch of the processing flow and S205 to S207 form another; the two branches can be carried out simultaneously, without a strict order between them, as in the sketch after this embodiment.
S208: obtaining a face target image from the target contour image and the depth image matching the target contour image.
In S208, the target contour image and the depth image matching it may be synthesized to obtain the final desired face image.
In the foregoing S201 to S208, the video consisting of visible light images is taken as the base video, and the video consisting of structured light images as the secondary video.
An image whose face similarity reaches the first threshold is identified from the video of visible light images as the desired face contour image, and its acquisition time or return time information is acquired. An image whose acquisition time difference or return time difference is less than or equal to the third threshold is then identified from the cached structured light images as the depth image matching the target contour image, and the composite of the desired face contour image and its matching depth image is taken as the finally recognized face image. In the embodiments of the application, on the one hand, face contour features and face depth information are combined for face recognition; on the other hand, the finally recognized face image carries both the contour features and the depth features of the face, so the accuracy of face recognition can be greatly improved, more facial features are recognized, and the result can be conveniently used in various application fields such as face unlocking and the identification and tracking of suspects.
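S202 to S204 and S205 to S207 can run as two concurrent branches, with the recognition branch signalling the caching branch to stop once the target contour image is found. A minimal sketch of that hand-off follows, using a `threading.Event` as the notification message and reusing the `Frame` type from the earlier sketch; `face_similarity` and the concrete value of the first threshold are assumptions for illustration only.

```python
import threading
from typing import List, Optional

FIRST_THRESHOLD = 0.9  # assumed value; the application only says it is preset

notification = threading.Event()
cached_structured_frames: List[Frame] = []
target_contour: Optional[Frame] = None

def face_similarity(image) -> float:
    # Stand-in for any detector scoring contour similarity against the
    # preset face features; not specified by this application.
    return 0.0

def contour_branch(visible_frames: List[Frame]) -> None:
    """S202 to S204: find a visible light frame whose face similarity reaches
    the first threshold, then signal the caching branch to stop."""
    global target_contour
    for frame in visible_frames:
        if face_similarity(frame.data) >= FIRST_THRESHOLD:
            target_contour = frame
            notification.set()  # the notification message
            return

def structured_branch(structured_frames: List[Frame]) -> None:
    """S205: cache structured light frames one by one until notified."""
    for frame in structured_frames:
        if notification.is_set():
            break  # caching stops once the target contour image is identified
        cached_structured_frames.append(frame)
```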
Fig. 3 shows a third embodiment of the image recognition method provided by the present application. As shown in Fig. 3, the method includes:
S301: acquiring at least two frames of images of a target object to obtain a first video and a second video, where one of the first video and the second video consists of at least two frames of visible light images of the target object, and the other consists of at least two frames of structured light images of the target object.
For a description of S301, refer to the earlier description of S101; it is not repeated here.
S302: obtaining at least some of the at least two frames of structured light images.
S303: identifying, from the obtained frames of structured light images, an image whose face depth information similarity reaches a second threshold, and determining it as the target depth image.
S304: acquiring the acquisition time or return time information of the target depth image.
S305: caching each frame of visible light image one by one, where the caching is stopped once the target depth image is identified.
S306: acquiring the acquisition time or return time information of each cached frame of visible light image.
S307: determining, among the cached frames of visible light images, the image whose acquisition time difference or return time difference is less than or equal to a fourth threshold as the contour image matching the target depth image.
S308: obtaining a face target image from the target depth image and the contour image matching the target depth image.
In S308, the target depth image and the contour image matching it are synthesized to obtain the final desired face image.
In the foregoing S301 to S308, the video consisting of structured light images is taken as the base video, and the video consisting of visible light images as the secondary video.
An image whose face depth information similarity reaches the second threshold is identified from the video of structured light images as the desired face depth image, and its acquisition time or return time information is acquired. An image whose acquisition time difference or return time difference is less than or equal to the fourth threshold is then identified from the cached visible light images as the contour image matching the target depth image, and the composite of the desired face depth image and its matching contour image is taken as the finally recognized face image. In the embodiments of the application, on the one hand, face contour features and face depth information are combined for face recognition; on the other hand, the finally recognized face image carries both the contour features and the depth features of the face, so the accuracy of face recognition can be greatly improved, more facial features are recognized, and the result can be conveniently used in various application fields such as face unlocking and the identification and tracking of suspects.
For the foregoing embodiments, particularly the second and/or third embodiment, the caching of images in the secondary video may further be implemented as follows: the captured structured light images or visible light images are cached one by one in a buffer of preset capacity. While the buffer is in use, if the caching has not been stopped and the buffer is not full, the structured light images or visible light images are cached one by one according to acquisition time or return time; if the caching has not been stopped and the buffer is full, the structured light or visible light image with the earliest acquisition time or transmission time is deleted, and the newly acquired or newly returned structured light image or visible light image is cached. Caching images in this way makes it possible to quickly identify, from the secondary video stream, the image captured at the same or a similar time as the desired image identified in the base video stream, which speeds up image identification, shortens the time needed for face recognition, and makes the scheme convenient to use in various application fields. A sketch of such a buffer follows.
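The preset-capacity buffer described above behaves like a bounded first-in, first-out store that evicts the oldest frame when full. In Python this can be sketched with `collections.deque`, whose `maxlen` argument performs the eviction automatically; the capacity of 25 is the value used with Fig. 5 below and is only an example.

```python
from collections import deque

BUFFER_CAPACITY = 25  # preset capacity, as in the Fig. 5 example

frame_buffer = deque(maxlen=BUFFER_CAPACITY)

def cache_frame(frame, caching_stopped: bool) -> None:
    """Cache frames one by one in acquisition or return order; once the
    buffer is full, appending evicts the earliest-cached frame."""
    if caching_stopped:
        return  # the buffer is locked once the notification message is generated
    frame_buffer.append(frame)
```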
In addition, in order to ensure normal recognition of the target image in the base video stream and to avoid the congestion that would be caused by recognizing multiple images of the base video stream at the same time, in the embodiments of the application a thread pool is created; the thread pool includes multiple threads, and the threads in the thread pool are used serially. Further,
in the case where the base video stream consists of visible light images, while a first thread in the thread pool performs target contour image recognition on one of the at least two frames of visible light images, a second thread in the thread pool cannot perform target contour image recognition on at least one other visible light image among the at least two frames; the at least one other visible light image is deleted from the at least two frames of visible light images, and the remaining partial frames of visible light images are used for identifying the desired contour image.
In the case where the base video stream consists of structured light images, while a first thread in the thread pool performs target depth image recognition on one of the at least two frames of structured light images, a second thread in the thread pool cannot perform target depth image recognition on at least one other structured light image among the at least two frames; the at least one other structured light image is deleted from the at least two frames of structured light images, and the remaining partial frames of structured light images are used for identifying the desired depth image.
The foregoing scheme, besides avoiding congestion, can be understood from another perspective: for all the (visible light or structured light) images in the base video stream, recognition of the target contour image or target depth image may use all of the images, or only a part of them, and using only a part is preferred. In other words, in the embodiments of the application an accurate target contour image or target depth image can be identified using only part of the images, so that recognition accuracy is ensured while excessive resources need not be invoked during recognition, which reduces the resource burden and speeds up recognition. A sketch of the serialized detector follows.
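A minimal sketch of the serialized detector is given below, built on Python's standard thread pool: a non-blocking lock models the single face detection program being in a working or idle state, and a frame whose worker cannot take the lock is simply dropped, which corresponds to the deletion of the other frames described above. `detect_high_quality_face` is a labeled stand-in, not an API of this application.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

detector_lock = threading.Lock()  # models the single detection program

def detect_high_quality_face(image) -> bool:
    # Stand-in for the face detection program (e.g. local binary patterns,
    # eigenfaces, or a neural network, as suggested later in the text).
    return False

def worker(image) -> None:
    # Only one thread may use the detection program at a time; frames that
    # arrive while it is busy are discarded rather than queued.
    if not detector_lock.acquire(blocking=False):
        return  # frame deleted to avoid congestion waiting
    try:
        if detect_high_quality_face(image):
            pass  # output the target image and generate the notification message
    finally:
        detector_lock.release()

pool = ThreadPoolExecutor(max_workers=5)  # e.g. five threads, as in scenario one below
# for each captured frame: pool.submit(worker, frame)
```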
The embodiments of the present application are described in further detail below with reference to Figs. 4 and 5 and specific application scenarios. It can be understood that the execution subject of the schemes in the following application scenarios is an image recognition device.
Application scenario one: a multi-frame video consisting of visible light images is taken as the base video stream.
It can be understood that the execution subject in this application scenario is an image recognition device that includes a multi-view camera.
In this application scenario, images of the face shown in Fig. 4 are captured multiple times using a multi-view camera such as a binocular or trinocular camera. Those skilled in the art will understand that the multi-view camera can separately capture the visible light information and the 3D structured light information of the same shooting scene (the ambient light at the time of shooting, the shooting angle of the face, the pose of the face, and so on). Shooting the face multiple times with the multi-view camera yields two video streams: one consists of the visible light information obtained by shooting the face in the scene, and the other of the 3D structured light information obtained by shooting the face in the scene. While each frame of the two video streams is captured, the capture time of each frame must also be recorded. The specific process by which a multi-view camera obtains 3D structured light information and visible light information is not repeated here; refer to the related descriptions.
The processing of the video stream consisting of visible light images is as follows:
The frames of the video stream carrying visible light information, say the 1st to the Nth (N being a positive integer greater than or equal to 1) frames of visible light images, are ordered by acquisition time. For the 1st frame of visible light image captured by the multi-view camera, thread 1 in the thread pool is assigned to identify whether the image is a high-quality face image. For the 2nd frame, thread 2 in the thread pool is assigned to identify whether that image is a high-quality face image, and so on: each thread in the thread pool is responsible for its corresponding frame and for determining whether that frame is a high-quality face image. Of course, the number of threads in the thread pool may be greater than or equal to, or less than, the number of captured visible light images. For example, with 5 threads in the pool assigned as above, if thread 1 determines that the 1st frame is not a high-quality face image, its identification task for the 1st frame can end, and thread 1 can be assigned to the 6th frame captured by the multi-view camera to identify whether it is a high-quality face image. That is, a thread in this application scenario may end its recognition task for one visible light image and carry out recognition on another, which saves thread resources.
In this application scenario, the multi-view camera captures the visible light images and the image recognition device assigns the threads. In a specific implementation, a face detection program is preset, and the identification of high-quality face images is performed through this program. Specifically, the face detection program is called by the thread assigned to the nth frame (any one of the 1st to Nth frames of visible light images), say thread n, and the program determines whether the nth frame of visible light image is a high-quality face image. Because calling the face detection program occupies it on behalf of thread n, other threads cannot call it to perform high-quality face image recognition. Another thread can call the face detection program only after thread n releases it (for example, releasing the program when no high-quality face image is recognized), and can then identify whether its corresponding frame of visible light image is a high-quality face image. That is, the serial use adopted by the threads in the thread pool in this application scenario allows another thread to use the face detection program only after the current thread has finished calling and using it.
In practical applications, because the acquisition rate of the multi-view camera exceeds the speed at which the face detection program recognizes a high-quality face, too many threads could end up waiting for the program to be released while one thread is using it. For example, if the multi-view camera acquires one frame every 30 ms and one identification pass of the face detection program takes 100 ms, then while thread 1, assigned to the 1st frame of visible light image, calls the face detection program and performs high-quality face image recognition, the multi-view camera captures the 2nd to 4th frames within those 100 ms; these frames are deleted or discarded to avoid congestion waiting. For the 5th frame of visible light image, thread 2 is assigned; by then thread 1 has released the face detection program, and thread 2 calls it and performs high-quality face image recognition on the 5th frame. The face detection program can be any face detection algorithm capable of recognizing a face, such as local binary patterns, the eigenface method, or a neural network. At the technical level, if the face detection program is not being called, it is set to an idle state, allowing a thread to call it; if it is being called, it is set to a non-idle state, such as a working state, and other threads are prohibited from calling it. It can thus be appreciated that not all the visible light images captured by the multi-view camera are checked for being high-quality face images; only some of the captured visible light images qualify for identification, and whether a qualifying image is a high-quality face image is identified by having its assigned thread call the face detection program. This reduces the burden on recognition resources and speeds up recognition. It should be noted that, since the images captured by the multi-view camera are images of the face in the same or similar shooting scenes as shown in Fig. 4, the differences between them are small; performing high-quality face image recognition on only part of all the captured visible light images therefore does not affect the recognition result, which remains accurate. The timing can be checked with the short simulation below.
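The frame-dropping arithmetic above can be verified with a short simulation: with a frame arriving every 30 ms and a 100 ms detection pass, the detector is busy for frames 2 to 4, so only frames 1, 5, 9, and so on are examined. The figures are the ones from the example above, not values fixed by this application.

```python
ACQUISITION_PERIOD_MS = 30   # one visible light frame every 30 ms
DETECTION_TIME_MS = 100      # one pass of the face detection program

detector_free_at = 0
examined = []
for frame_index in range(1, 11):                 # frames 1 to 10
    arrival = (frame_index - 1) * ACQUISITION_PERIOD_MS
    if arrival >= detector_free_at:              # detector idle: examine the frame
        examined.append(frame_index)
        detector_free_at = arrival + DETECTION_TIME_MS
    # otherwise the frame is deleted or discarded, as described above

print(examined)  # [1, 5, 9]
```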
With the above scheme, each frame of visible light image is processed by its corresponding thread. After several frames have been processed, if a high-quality face image can be detected, for example a visible light image whose similarity to the preset contour features of the face reaches the first threshold, that frame of visible light image (the target contour image) is output and a notification message is generated. The first threshold is preset and may be any reasonable value, such as 90% or 97%. After the notification message is generated, the buffer stops caching.
It should be noted that if a high-quality face image is detected, the multi-view camera can at least stop capturing visible light images, and depending on the situation may also stop capturing 3D structured light images at the same time.
The processing of the video stream consisting of 3D structured light images is as follows:
The video stream carrying the 3D structured light information is captured by the multi-view camera and cached in a buffer. While capturing the 3D structured light images, the multi-view camera must also record the capture time of each one.
It can be understood that the capacity (M) of the buffer is fixed; for example, with a maximum capacity of M = 25, 25 frames of 3D structured light images can be stored. While the buffer is not full and caching has not been stopped, the images are cached one by one according to the capture time of each 3D structured light image. Taking the buffer shown in Fig. 5 with a maximum capacity of 25 (25 buffer positions) as an example, and assuming the buffer starts in an idle state with no data stored: when the multi-view camera captures the 1st frame of 3D structured light image, it is cached at position 1 of the buffer; when the 2nd frame is captured, it is cached at position 2; and so on. It should be noted that, when the buffer is full, in order to ensure that a newly captured 3D structured light image is still cached normally, the cached 3D structured light image with the earliest capture time must be deleted to vacate space for it. For example, when the 1st to 25th frames of 3D structured light images are cached and the buffer is full, the image with the earliest capture time among the 25, namely the 1st frame, is deleted; the remaining 24 frames each move forward one buffer position, so that positions 1 to 24 are occupied in turn by the 2nd to 25th frames. The 26th frame captured by the multi-view camera is then cached at the position that originally held the 25th frame, namely position 25 of the buffer. For the 27th frame, the image with the earliest capture time among the 25 in the buffer, namely the 2nd frame now held at position 1, is deleted, the remaining structured light images each move forward one position, and the 27th frame is cached at position 25. This caching rule ensures that the image deleted is always the structured light image at position 1 of the buffer, and that the newly captured structured light image is stored at its last position.
The multi-view camera captures and stores the 3D structured light images in this way. Once the generation of a notification message is detected while the frames of 3D structured light are being cached into the buffer, the caching operation stops and the buffer is locked. It is then determined whether a 3D structured light image whose capture time is closest to that of the target contour image, with an error less than or equal to the third threshold, for example 30 ms, can be found in the locked buffer. If so, the 3D structured light image found is the one matching the target contour image; it and the target contour image can be regarded as an image pair captured of the face shown in Fig. 4 at the same or a similar moment, with the found image carrying the depth information. Synthesizing the two images yields, and outputs, a face image (the desired face image) that has the most accurate contour features and simultaneously carries the depth features of the same or a similar moment. A sketch of this matching step follows.
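Once the notification message locks the buffer, the matching step reduces to a nearest-timestamp search with a tolerance equal to the third threshold (30 ms in this scenario). A sketch under those assumptions, reusing the `Frame` type from the earlier sketch:

```python
from typing import List, Optional

THIRD_THRESHOLD_MS = 30  # tolerance used as the example in this scenario

def find_matching_depth_frame(target_time_ms: float,
                              locked_buffer: List[Frame]) -> Optional[Frame]:
    """Return the cached structured light frame captured closest in time to
    the target contour image, provided the gap is within the threshold."""
    if not locked_buffer:
        return None
    best = min(locked_buffer, key=lambda f: abs(f.timestamp - target_time_ms))
    if abs(best.timestamp - target_time_ms) <= THIRD_THRESHOLD_MS:
        return best  # captured at the same or a similar moment
    return None
```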
In application scenario one, face contour features and face depth information are combined for face recognition. The finally recognized face image carries both the contour features and the depth features of the face, so the accuracy of face recognition can be greatly improved, more facial features are recognized, and the result can be conveniently used in various application fields such as face unlocking and the identification and tracking of suspects.
Application scenario two: a multi-frame video consisting of structured light images is taken as the base video stream.
For the video stream consisting of 3D structured light images:
The way the multi-view camera captures the 3D structured light images and the threads are assigned, the number of threads, the assignment method, and the deletion or discarding of some 3D structured light frames are as described in application scenario one and are not repeated. In application scenario two, a face depth information detection program is preset, and a thread identifies high-quality depth images by calling this program. While a thread is calling the program, it is set to a working state and other threads are prohibited from calling it; when no thread is calling it, it is set to an idle state, allowing other threads to call it. Each frame of 3D structured light image is processed by its corresponding thread. After several frames have been processed, if a high-quality depth image can be detected, for example a 3D structured light image whose similarity to the preset depth features of the face reaches the second threshold, that frame of 3D structured light image (the target depth image) is output and a notification message is generated. The second threshold is preset and may be any reasonable value, such as 95% or 98%.
For the video stream consisting of visible light images: the multi-view camera captures it and caches it in a buffer, recording the capture time of each visible light image. While the buffer is not full and caching has not been stopped, the images are cached one by one according to their capture times. When the buffer is full, the image with the earliest capture time is deleted and the newly captured visible light image is cached in the vacated position. Once the generation of the notification message is detected while the captured visible light images are being cached into the buffer, the caching operation stops and the buffer is locked. It is then determined whether a visible light image whose capture time differs from that of the target depth image by no more than the fourth threshold, for example 25 ms, can be found in the locked buffer. If so, the visible light image found is the one matching the target depth image; it and the target depth image can be regarded as an image pair captured of the face shown in Fig. 4 at the same or a similar moment, one carrying the contour features and the other the depth information. Synthesizing the two images yields, and outputs, a face image that has the most accurate depth features and simultaneously carries the contour features of the same or a similar moment.
In the second application scenario, the face contour features and the face depth information are likewise combined to perform face recognition. The finally recognized face image has both the depth features and the contour features of the face, so the accuracy of face recognition can be greatly improved; more face features are recognized, which makes the scheme convenient to apply in various fields such as face unlocking and suspicious person identification and tracking.
It can be understood that, for the parts of application scenario two that are similar to application scenario one, reference may be made to the detailed description in application scenario one; they are not repeated in application scenario two.
The foregoing scheme is described by taking as an example the case where the multi-view camera is provided in the image recognition apparatus and the recorded time information is the time at which the multi-view camera collects the images (the acquisition time). In addition, in the embodiment of the present application the multi-view camera may be provided separately rather than in the image recognition apparatus, in which case the multi-view camera also needs to transmit the collected images back to the image recognition apparatus. The image recognition apparatus may still perform the scheme described above using the acquisition time of each image, or it may instead record the time at which each image is transmitted back to it (the transmission time) and perform the scheme described above using the transmission time of each frame. Taking the basic video stream as a video composed of visible light images as an example: the visible light images may be used for face detection. For each frame of visible light image transmitted back by the multi-view camera, the image recognition apparatus receives the image and records its transmission time. A thread is allocated for the received visible light image from the established thread pool; when the face detection program is in the idle state, the visible light image transmitted back by the multi-view camera is sent into the face detection program to detect the face in it, and the face detection program is set to the working state. When the face detection program has processed several frames of visible light images, if a high-quality face image is detected, that frame of visible light image is output and a notification message is generated. For the 3D structured light image frames transmitted back by the multi-view camera, a cache such as a concurrency-safe container is used to buffer a certain number of frames; by default, when the number of cached frames exceeds 25, the earliest transmitted frame in the container is removed and the newly transmitted frame is put in. At the moment the generation of the notification message is detected, the container is locked and no longer accepts newly transmitted frames. According to the transmission time of the high-quality visible light image frame, the 3D structured light image frame whose transmission time is closest to it is searched for in the container; the found 3D structured light image frame can be paired with the high-quality visible light image frame, and the two are synthesized and output. It can be understood that the paired image frames are a set of two images, capable of characterizing depth features and contour features respectively, captured in the same shooting scene for the same target object at the same or a similar time. Therefore, in the scheme of the embodiment of the present application, the finally recognized face image has both the contour features and the depth features of images shot at the same or a similar time, which can greatly improve the accuracy of face recognition. More face features are recognized, which makes the scheme convenient to apply in various fields such as face unlocking and suspicious person identification and tracking.
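Building on the illustrative FrameBuffer and find_matching_depth_frame helpers sketched above, the transmission-time variant might look as follows. Stamping frames with a locally recorded receive time is an assumption for illustration, and a production version would need a genuinely concurrency-safe container rather than this single-threaded sketch:

```python
import time

depth_frames = FrameBuffer(capacity=25)       # stands in for the concurrency-safe container

def on_structured_light_frame(frame):
    recv_ms = time.monotonic() * 1000.0       # transmission-back time, stamped locally
    depth_frames.push(recv_ms, frame)

def on_high_quality_face_frame(target_recv_ms):
    depth_frames.lock()                       # stop accepting newly transmitted frames
    # Pair by closest transmission time instead of acquisition time.
    return find_matching_depth_frame(list(depth_frames.frames), target_recv_ms)
```

Whether the acquisition time or the transmission time is used, the pairing logic is the same; only the source of the timestamp changes.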
The embodiment of the present application also provides an image recognition apparatus, as shown in fig. 7, which comprises: a camera 11, a first determining unit 12, a second determining unit 13, and an obtaining unit 14; wherein,
the camera 11 is configured to collect at least two frames of images of a target object to obtain a first video and a second video, where one of the first video and the second video is characterized as at least two frames of visible light images of the target object and the other is characterized as at least two frames of structured light images of the target object;
the first determining unit 12 is configured to determine a target contour image according to the at least two frames of visible light images, or to determine a target depth image according to the at least two frames of structured light images;
the second determining unit 13 is configured to determine a depth image matched with the target contour image according to the at least two frames of structured light images, or to determine a contour image matched with the target depth image according to the at least two frames of visible light images;
the obtaining unit 14 is configured to obtain a face target image according to the target contour image and the depth image matched with it, or according to the target depth image and the contour image matched with it.
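The embodiments leave the synthesis method open. One minimal interpretation, offered purely as an assumption for illustration, is to register the two frames and stack the depth map as an extra channel of the visible light image; all names and shapes below are hypothetical:

```python
import numpy as np

def synthesize_face_target(contour_img, depth_img):
    """contour_img: HxWx3 uint8 visible light image; depth_img: HxW depth map."""
    depth = depth_img.astype(np.float32)
    span = float(depth.max() - depth.min())
    if span == 0.0:
        depth_u8 = np.zeros(depth.shape, dtype=np.uint8)
    else:
        depth_u8 = ((depth - depth.min()) / span * 255.0).astype(np.uint8)
    # HxWx4 result carrying both contour (RGB) and depth features.
    return np.dstack([contour_img, depth_u8])
```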
In an alternative embodiment, the first determining unit 12 is configured to:
Obtaining at least part of the at least two frames of visible light images;
Identifying, from the obtained at least partial frames of visible light images, an image whose similarity to a human face reaches a first threshold, and determining it as the target contour image;
acquiring acquisition time or return time information of the target contour image;
correspondingly, the second determining unit 13 is configured to:
Caching each frame of structured light image one by one; wherein the caching is stopped in case the target contour image is identified;
Acquiring the acquisition time or return time information of each cached frame of structured light image;
And determining, among the cached frames of structured light images, the image whose acquisition time difference or return time difference from the target contour image is smaller than or equal to a third threshold as the depth image matched with the target contour image.
In an alternative embodiment, the first determining unit 12 is configured to:
Obtaining at least partial frames of structured light images of the at least two frames of structured light images;
Identifying, from the obtained at least partial frames of structured light images, an image whose similarity to face depth information reaches a second threshold, and determining it as the target depth image;
acquiring acquisition time or return time information of the target depth image;
correspondingly, the second determining unit 13 is configured to:
Caching each frame of visible light image one by one; wherein the caching is stopped in case a target depth image is identified;
Acquiring the acquisition time or the return time information of each cached frame of visible light image;
And determining, among the cached frames of visible light images, the image whose acquisition time difference or return time difference from the target depth image is smaller than or equal to a fourth threshold as the contour image matched with the target depth image.
In an alternative embodiment, the collected structured-light images or visible-light images are cached one by one using a buffer of a preset capacity.
In an alternative embodiment, while the caching is not stopped and the capacity of the buffer is not full, the structured light images or the visible light images are cached one by one according to their acquisition time or return time;
And while the caching is not stopped and the capacity of the buffer is full, the structured light image or visible light image with the earliest acquisition time or return time is deleted, and the newly collected or newly transmitted structured light image or visible light image is cached.
In an alternative embodiment, the first determining unit 12 is configured to obtain partial frames of the at least two frames of visible light images. Further, the first determining unit 12 is specifically configured to perform the following (a sketch of this behaviour follows the list below):
Creating a thread pool for the at least two frames of visible light images, wherein each thread in the thread pool is used in series;
In the case where a first thread in the thread pool performs target contour image recognition on one of the at least two frames of visible light images,
A second thread in the thread pool cannot perform target contour image recognition on at least one other frame of visible light image, other than the one frame, of the at least two frames of visible light images;
And deleting the at least one other frame of visible light image from the at least two frames of visible light images to obtain the partial frames of visible light images.
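A hedged sketch of this drop-while-busy behaviour using a standard thread pool; the pool size and all function names are illustrative assumptions, not the claimed implementation:

```python
from concurrent.futures import ThreadPoolExecutor
import threading

recognizer_busy = threading.Lock()

def recognize_or_drop(frame, recognize_fn):
    # Frames arriving while another thread holds the lock are dropped,
    # yielding the "partial frames" that actually undergo recognition.
    if not recognizer_busy.acquire(blocking=False):
        return None                    # frame deleted without recognition
    try:
        return recognize_fn(frame)     # serial use: one frame at a time
    finally:
        recognizer_busy.release()

def process_stream(frames, recognize_fn, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(recognize_or_drop, f, recognize_fn) for f in frames]
        return [fut.result() for fut in futures]
```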
In an alternative embodiment, the first determining unit 12 is configured to obtain partial frames of structured light images of the at least two frames of structured light images. Further, it is configured to:
Creating a thread pool for the at least two frames of structured light images, wherein each thread in the thread pool is used in series;
In case a first thread in the thread pool performs target depth image recognition on one of the at least two frame structured light images,
A second thread in the thread pool cannot perform target depth image recognition on at least one other frame of structured light image, other than the one frame, of the at least two frames of structured light images;
And deleting the at least one other frame of structured light image from the at least two frames of structured light images to obtain the partial frames of structured light images.
It will be appreciated that, in practical applications, the first determining unit 12, the second determining unit 13 and the obtaining unit 14 in the apparatus may be implemented by a central processing unit (CPU), a digital signal processor (DSP), a micro control unit (MCU) or a field-programmable gate array (FPGA) of the apparatus. The camera 11 may be implemented by any number of cameras, such as a binocular or trinocular camera.
It should be noted that, since the principle of the image recognition device for solving the problem is similar to that of the image recognition method, the implementation process and the implementation principle of the image recognition device can be described with reference to the implementation process and the implementation principle of the image recognition method, and the repetition is omitted.
The embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program performs at least the steps of the method shown in any one of figs. 1 to 5. The computer-readable storage medium may in particular be a memory, such as the memory 62 shown in fig. 6.
The embodiment of the application also provides image recognition equipment. Fig. 6 is a schematic hardware structure of an image recognition apparatus according to an embodiment of the present application, and as shown in fig. 6, the image recognition apparatus includes: a communication component 63 for data transmission, at least one processor 61 and a memory 62 for storing a computer program capable of running on the processor 61. The various components in the terminal are coupled together by a bus system 64. It is understood that the bus system 64 is used to enable connected communications between these components. The bus system 64 includes a power bus, a control bus, and a status signal bus in addition to the data bus. But for clarity of illustration the various buses are labeled as bus system 64 in fig. 6.
Wherein the processor 61, when executing the computer program, performs at least the steps of the method shown in any of fig. 1 to 5.
It will be appreciated that the memory 62 may be volatile memory or non-volatile memory, and may include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a ferroelectric random access memory (FRAM), a flash memory, a magnetic surface memory, an optical disc, or a compact disc read-only memory (CD-ROM); the magnetic surface memory may be a disk memory or a tape memory. The volatile memory may be a random access memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM), synchronous static random access memory (SSRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous link dynamic random access memory (SLDRAM), and direct Rambus random access memory (DRRAM). The memory 62 described in the embodiments of the present application is intended to comprise, without being limited to, these and any other suitable types of memory.
The method disclosed in the above embodiments of the present application may be applied to the processor 61 or implemented by the processor 61. The processor 61 may be an integrated circuit chip with signal processing capability. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 61 or by instructions in the form of software. The processor 61 may be a general-purpose processor, a DSP, another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The processor 61 may implement or perform the methods, steps and logic blocks disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor, any conventional processor, or the like. The steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium located in the memory 62; the processor 61 reads the information in the memory 62 and completes the steps of the foregoing method in combination with its hardware.
In an exemplary embodiment, the image recognition device may be implemented by one or more application-specific integrated circuits (ASICs), DSPs, programmable logic devices (PLDs), complex programmable logic devices (CPLDs), FPGAs, general-purpose processors, controllers, MCUs, microprocessors, or other electronic elements, for performing the aforementioned image recognition methods.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The device embodiments described above are only illustrative; for example, the division of the units is only a division by logical function, and there may be other divisions in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical or in other forms.
The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may be separately used as one unit, or two or more units may be integrated in one unit; the integrated units may be implemented in hardware or in hardware plus software functional units.
Those of ordinary skill in the art will appreciate that all or part of the steps for implementing the above method embodiments may be completed by hardware related to program instructions. The foregoing program may be stored in a computer-readable storage medium, and when executed, the program performs the steps comprising the above method embodiments. The aforementioned storage medium includes various media that can store program code, such as a removable storage device, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
Alternatively, if the above-described integrated units of the present application are implemented in the form of software functional modules and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application, in essence or the part contributing to the prior art, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a removable storage device, a ROM, a RAM, a magnetic disk or an optical disc.
The methods disclosed in the method embodiments provided by the application can be arbitrarily combined under the condition of no conflict to obtain a new method embodiment.
The features disclosed in the several product embodiments provided by the application can be combined arbitrarily under the condition of no conflict to obtain new product embodiments.
The features disclosed in the embodiments of the method or the apparatus provided by the application can be arbitrarily combined without conflict to obtain new embodiments of the method or the apparatus.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (11)

1. An image recognition method, the method comprising:
Acquiring, through an acquisition device with a function of separating visible light information and structured light information, at least two frames of images of a target object to obtain a first video and a second video, wherein one of the first video and the second video is characterized as at least two frames of visible light images of the target object, and the other video is characterized as at least two frames of structured light images of the target object;
Determining a target contour image according to the at least two frames of visible light images, determining a depth image matched with the target contour image according to the at least two frames of structured light images, caching each frame of structured light image one by one, and stopping the caching when the target contour image is identified; or determining a target depth image according to the at least two frames of structured light images, determining a contour image matched with the target depth image according to the at least two frames of visible light images, caching each frame of visible light image one by one, and stopping the caching when the target depth image is identified;
And synthesizing the target contour image and a depth image matched with the target contour image, or synthesizing the target depth image and a contour image matched with the target depth image, so as to obtain a face target image.
2. The method of claim 1, wherein determining the target profile image from the at least two frames of visible light images comprises:
Obtaining at least part of the at least two frames of visible light images;
Identifying, from the obtained at least partial frames of visible light images, an image whose similarity to a human face reaches a first threshold, and determining it as the target contour image;
acquiring acquisition time or return time information of the target contour image;
Correspondingly, determining the depth image matched with the target contour image according to at least two frames of structured light images comprises the following steps:
Acquiring the acquisition time or return time information of each cached frame of structured light image;
And determining, among the cached frames of structured light images, the image whose acquisition time difference or return time difference from the target contour image is smaller than or equal to a third threshold as the depth image matched with the target contour image.
3. The method of claim 1, wherein determining the target depth image from the at least two frames of structured light images comprises:
Obtaining at least partial frames of structured light images of the at least two frames of structured light images;
Identifying, from the obtained at least partial frames of structured light images, an image whose similarity to face depth information reaches a second threshold, and determining it as the target depth image;
acquiring acquisition time or return time information of the target depth image;
Correspondingly, the determining the contour image matched with the target depth image according to at least two frames of visible light images comprises the following steps:
Acquiring the acquisition time or return time information of each cached frame of visible light image;
And determining, among the cached frames of visible light images, the image whose acquisition time difference or return time difference from the target depth image is smaller than or equal to a fourth threshold as the contour image matched with the target depth image.
4. The method according to claim 1, wherein, while the caching is not stopped and the capacity of the buffer is not full, the structured light images or the visible light images are cached one by one according to their acquisition time or return time;
And while the caching is not stopped and the capacity of the buffer is full, the structured light image or visible light image with the earliest acquisition time or return time is deleted, and the newly collected or newly transmitted structured light image or visible light image is cached.
5. The method of claim 2, wherein the obtaining partial frames of the at least two frames of visible light images comprises:
Creating a thread pool for the at least two frames of visible light images, wherein each thread in the thread pool is used in series;
In the case where a first thread in the thread pool performs target contour image recognition on one of the at least two frames of visible light images,
A second thread in the thread pool cannot perform target contour image recognition on at least one other frame of visible light image, other than the one frame, of the at least two frames of visible light images;
And deleting the at least one other frame of visible light image from the at least two frames of visible light images to obtain the partial frames of visible light images.
6. The method according to claim 3, wherein the obtaining partial frames of structured light images of the at least two frames of structured light images comprises:
Creating a thread pool for the at least two frames of structured light images, wherein each thread in the thread pool is used in series;
In case a first thread in the thread pool performs target depth image recognition on one of the at least two frame structured light images,
A second thread in the thread pool cannot perform target depth image recognition on at least one other frame of structured light image, other than the one frame, of the at least two frames of structured light images;
And deleting the at least one other frame of structured light image from the at least two frames of structured light images to obtain the partial frames of structured light images.
7. An image recognition apparatus, characterized by comprising:
The camera is configured to acquire, through the acquisition device with the function of separating visible light information and structured light information, at least two frames of images of a target object to obtain a first video and a second video, wherein one of the first video and the second video is characterized as at least two frames of visible light images of the target object, and the other video is characterized as at least two frames of structured light images of the target object;
The first determining unit is used for determining a target contour image according to the at least two frames of visible light images; or determining a target depth image according to the at least two frames of structured light images;
The second determining unit is configured to determine a depth image matched with the target contour image according to the at least two frames of structured light images, cache each frame of structured light image one by one, and stop the caching when the target contour image is identified; or determine a contour image matched with the target depth image according to the at least two frames of visible light images, cache each frame of visible light image one by one, and stop the caching when the target depth image is identified;
And the obtaining unit is used for synthesizing the target contour image and the depth image matched with the target contour image or synthesizing the target depth image and the contour image matched with the target depth image to obtain the face target image.
8. The apparatus of claim 7, wherein the first determining unit is configured to:
Obtaining at least part of the at least two frames of visible light images;
Identifying, from the obtained at least partial frames of visible light images, an image whose similarity to a human face reaches a first threshold, and determining it as the target contour image;
acquiring acquisition time or return time information of the target contour image;
correspondingly, the second determining unit is configured to:
Acquiring the acquisition time or return time information of each cached frame of structured light image;
And determining, among the cached frames of structured light images, the image whose acquisition time difference or return time difference from the target contour image is smaller than or equal to a third threshold as the depth image matched with the target contour image.
9. The apparatus of claim 7, wherein the first determining unit is configured to:
Obtaining at least partial frames of structured light images of the at least two frames of structured light images;
Identifying, from the obtained at least partial frames of structured light images, an image whose similarity to face depth information reaches a second threshold, and determining it as the target depth image;
acquiring acquisition time or return time information of the target depth image;
correspondingly, the second determining unit is configured to:
Acquiring the acquisition time or return time information of each cached frame of visible light image;
And determining, among the cached frames of visible light images, the image whose acquisition time difference or return time difference from the target depth image is smaller than or equal to a fourth threshold as the contour image matched with the target depth image.
10. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the steps of the image recognition method as claimed in any one of claims 1 to 6.
11. An image recognition device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the image recognition method according to any one of claims 1 to 6 when the program is executed by the processor.
CN201911057996.9A 2019-11-01 2019-11-01 Image recognition method, device and storage medium Active CN111079520B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911057996.9A CN111079520B (en) 2019-11-01 2019-11-01 Image recognition method, device and storage medium

Publications (2)

Publication Number Publication Date
CN111079520A CN111079520A (en) 2020-04-28
CN111079520B true CN111079520B (en) 2024-05-21

Family

ID=70310656

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911057996.9A Active CN111079520B (en) 2019-11-01 2019-11-01 Image recognition method, device and storage medium

Country Status (1)

Country Link
CN (1) CN111079520B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH03101720A (en) * 1989-09-16 1991-04-26 Nikon Corp Camera capable of changing photographic condition
CN101238500A (en) * 2005-08-30 2008-08-06 松下电器产业株式会社 Parking position search assisting apparatus, method and program
CN206515931U (en) * 2016-11-21 2017-09-22 江苏金辰图像科技有限公司 A kind of face identification system
WO2017177768A1 (en) * 2016-04-13 2017-10-19 腾讯科技(深圳)有限公司 Information processing method, terminal, and computer storage medium
CN108965732A (en) * 2018-08-22 2018-12-07 Oppo广东移动通信有限公司 Image processing method, device, computer readable storage medium and electronic equipment
CN109144426A (en) * 2018-09-07 2019-01-04 郑州云海信息技术有限公司 A kind of memory space management of object storage system, system and associated component
CN109670421A (en) * 2018-12-04 2019-04-23 青岛小鸟看看科技有限公司 A kind of fatigue state detection method and device
CN109670482A (en) * 2019-01-13 2019-04-23 北京镭特医疗科技有限公司 Face identification method and device in a kind of movement
CN109961583A (en) * 2018-10-22 2019-07-02 大连艾米移动科技有限公司 A kind of intelligent market scratch system based on face recognition technology

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110333779B (en) * 2019-06-04 2022-06-21 Oppo广东移动通信有限公司 Control method, terminal and storage medium

Also Published As

Publication number Publication date
CN111079520A (en) 2020-04-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 221, 2nd floor, Block C, 18 Kechuang 11th Street, Daxing Economic and Technological Development Zone, Beijing, 100176

Applicant after: Jingdong Technology Holding Co.,Ltd.

Address before: Room 221, 2nd floor, Block C, 18 Kechuang 11th Street, Daxing Economic and Technological Development Zone, Beijing, 100176

Applicant before: JINGDONG DIGITAL TECHNOLOGY HOLDINGS Co.,Ltd.

GR01 Patent grant