CN111079520A - Image recognition method, device and storage medium - Google Patents

Image recognition method, device and storage medium

Info

Publication number
CN111079520A
Authority
CN
China
Prior art keywords
image
target
frames
visible light
images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911057996.9A
Other languages
Chinese (zh)
Other versions
CN111079520B (en)
Inventor
林坤宁
Current Assignee
JD Digital Technology Holdings Co Ltd
Original Assignee
JD Digital Technology Holdings Co Ltd
Priority date
Filing date
Publication date
Application filed by JD Digital Technology Holdings Co Ltd filed Critical JD Digital Technology Holdings Co Ltd
Priority to CN201911057996.9A priority Critical patent/CN111079520B/en
Publication of CN111079520A publication Critical patent/CN111079520A/en
Application granted granted Critical
Publication of CN111079520B publication Critical patent/CN111079520B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

Embodiments of the present application disclose an image recognition method, device, and storage medium. The method includes: acquiring at least two frames of images of a target object to obtain a first video and a second video, where one of the two videos is characterized as at least two frames of visible light images of the target object and the other as at least two frames of structured light images of the target object; determining a target contour image from the at least two frames of visible light images and determining a depth image matching the target contour image from the at least two frames of structured light images, or determining a target depth image from the at least two frames of structured light images and determining a contour image matching the target depth image from the at least two frames of visible light images; and obtaining a face target image from the target contour image and its matched depth image, or from the target depth image and its matched contour image.

Description

Image recognition method, device and storage medium
Technical Field
The present application relates to image processing technologies, and in particular, to an image recognition method, an image recognition apparatus, and a storage medium.
Background
Face recognition in the related art generally treats a face image as recognized when the same face, or a similar face with high similarity, is identified. Recognition is performed from the contour features of the face, such as the size and shape of the facial features, the distances between them, and their color. The scheme is roughly implemented as follows: an image with high similarity is identified from multiple frames captured by a camera and taken as the recognized face image. It will be appreciated that, beyond differences in facial contour, different people also differ in details such as the depth of the eye sockets, the height of the nose, and the height of the cheekbones. For convenience of description, such details are referred to as the depth information of the face. The related art also provides a scheme for performing face recognition through depth information: an image whose depth information has high similarity is identified from the captured frames and taken as the recognized face depth image. In summary, face recognition schemes in the related art operate from a single dimension, either the face contour or the depth information of the face, and the accuracy of such single-dimension schemes needs to be improved.
Disclosure of Invention
In order to solve the existing technical problem, embodiments of the present application provide an image recognition method, an image recognition apparatus, and a storage medium.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides an image identification method, which comprises the following steps:
acquiring at least two frames of images of a target object to obtain a first video and a second video, wherein one of the first video and the second video is characterized as at least two frames of visible light images of the target object, and the other is characterized as at least two frames of structured light images of the target object;
determining a target contour image according to the at least two frames of visible light images, and determining a depth image matched with the target contour image according to the at least two frames of structured light images; or determining a target depth image according to the at least two frames of structured light images, and determining a contour image matched with the target depth image according to the at least two frames of visible light images;
and obtaining a face target image according to the target contour image and the depth image matched with it, or according to the target depth image and the contour image matched with it.
In the foregoing solution, determining a target contour image according to the at least two frames of visible light images includes:
obtaining at least some of the at least two frames of visible light images;
identifying, from the obtained frames of visible light images, an image whose face similarity reaches a first threshold, and determining it as the target contour image;
acquiring the acquisition time or return time information of the target contour image.
Correspondingly, determining a depth image matched with the target contour image according to the at least two frames of structured light images includes:
caching the frames of structured light images one by one, wherein the caching is stopped once the target contour image is identified;
acquiring the acquisition time or return time information of each cached frame of structured light image;
and determining, among the cached frames of structured light images, the image whose acquisition time difference or return time difference is smaller than or equal to a third threshold as the depth image matched with the target contour image.
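The time-matching step above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: `Frame`, `capture_time`, and the 40 ms window are hypothetical names and values standing in for the third threshold.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Frame:
    capture_time: float  # acquisition (or return) time, seconds since stream start
    data: object = None  # pixel payload, omitted in this sketch

def match_depth_frames(target_time: float,
                       cached: List[Frame],
                       threshold_s: float) -> List[Frame]:
    """Return the cached structured-light frames whose acquisition-time
    difference from the target contour image is within threshold_s."""
    return [f for f in cached if abs(f.capture_time - target_time) <= threshold_s]

cached = [Frame(0.00), Frame(0.03), Frame(0.07), Frame(0.50)]
matches = match_depth_frames(target_time=0.05, cached=cached, threshold_s=0.04)
# frames at 0.03 s and 0.07 s fall within the 40 ms window; 0.00 s and 0.50 s do not
```

In practice more than one cached frame may satisfy the threshold, as here; the scheme would then use one (or a composite) of the matched frames as the depth image.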
In the foregoing solution, determining a target depth image according to the at least two frames of structured light images includes:
obtaining at least some of the at least two frames of structured light images;
identifying, from the obtained frames of structured light images, an image whose face depth information similarity reaches a second threshold, and determining it as the target depth image;
acquiring the acquisition time or return time information of the target depth image.
Correspondingly, determining a contour image matched with the target depth image according to the at least two frames of visible light images includes:
caching the frames of visible light images one by one, wherein the caching is stopped once the target depth image is identified;
acquiring the acquisition time or return time information of each cached frame of visible light image;
and determining, among the cached frames of visible light images, the image whose acquisition time difference or return time difference is smaller than or equal to a fourth threshold as the contour image matched with the target depth image.
In this scheme, the captured structured light images or visible light images are buffered one by one using a buffer of preset capacity.
When buffering has not stopped and the buffer is not full, the structured light images or visible light images are buffered one by one in order of acquisition time or return time;
when buffering has not stopped and the buffer is full, the structured light or visible light image with the earliest acquisition time or return time is deleted, and the newly acquired or newly returned structured light or visible light image is buffered.
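The preset-capacity buffer behaviour just described (append while not full; evict the earliest frame when full) can be sketched with a bounded deque. This is an illustrative sketch, not the patent's implementation; `FrameBuffer` and the capacity of 3 are hypothetical.

```python
from collections import deque

class FrameBuffer:
    """Fixed-capacity frame buffer: frames are appended in acquisition order,
    and once the buffer is full the earliest frame is deleted to make room."""
    def __init__(self, capacity: int):
        self._frames = deque(maxlen=capacity)  # deque drops the oldest item itself

    def push(self, frame):
        self._frames.append(frame)

    def frames(self):
        return list(self._frames)

buf = FrameBuffer(capacity=3)
for t in (0.0, 0.1, 0.2, 0.3):  # four frames into a three-slot buffer
    buf.push(t)
# the earliest frame (0.0) has been evicted; 0.1, 0.2, 0.3 remain
```

A `deque` with `maxlen` gives exactly the delete-oldest-then-append behaviour in one operation, which is why it is a natural fit here.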
In the foregoing solution, obtaining a partial frame visible light image from the at least two frames of visible light images includes:
creating a thread pool for the at least two frames of visible light images, wherein the threads in the thread pool are used serially;
while a first thread in the thread pool performs target contour image recognition on one of the at least two frames of visible light images, a second thread in the thread pool cannot perform target contour image recognition on at least one other frame of visible light image among the at least two frames;
and deleting the at least one other frame of visible light image from the at least two frames of visible light images to obtain the partial frame visible light image.
In the foregoing solution, obtaining a partial frame structured light image from the at least two frames of structured light images includes:
creating a thread pool for the at least two frames of structured light images, wherein the threads in the thread pool are used serially;
while a first thread in the thread pool performs target depth image recognition on one of the at least two frames of structured light images, a second thread in the thread pool cannot perform target depth image recognition on at least one other frame of structured light image among the at least two frames;
and deleting the at least one other frame of structured light image from the at least two frames of structured light images to obtain the partial frame structured light image.
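A hedged sketch of the serial-use behaviour described in the two thread-pool passages above: while one thread is recognizing a frame, frames that cannot be picked up are deleted rather than queued. `SerialRecognizer` and the non-blocking lock are illustrative stand-ins, not the patent's implementation.

```python
import threading

class SerialRecognizer:
    """Threads are used serially: while one frame is being analysed,
    newly arriving frames are dropped (deleted), not queued."""
    def __init__(self, recognize_fn):
        self._lock = threading.Lock()
        self._recognize = recognize_fn
        self.dropped = 0  # frames deleted because recognition was busy

    def submit(self, frame):
        if self._lock.acquire(blocking=False):  # non-blocking: do not wait
            try:
                return self._recognize(frame)
            finally:
                self._lock.release()
        self.dropped += 1  # another thread holds the lock: delete this frame
        return None

rec = SerialRecognizer(lambda f: f * 2)
first = rec.submit(21)   # lock is free: the frame is processed
rec._lock.acquire()      # simulate a recognition already in progress
busy = rec.submit(5)     # busy: the frame is dropped, not queued
rec._lock.release()
```

Dropping rather than queueing keeps recognition working on recent frames, which matches the scheme's goal of pairing images by acquisition time.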
An embodiment of the present application further provides an image recognition apparatus, including:
the camera is used for acquiring at least two frames of images of a target object to obtain a first video and a second video, wherein one of the first video and the second video is characterized as at least two frames of visible light images of the target object, and the other is characterized as at least two frames of structured light images of the target object;
the first determining unit is used for determining a target contour image according to the at least two frames of visible light images; or determining a target depth image according to the at least two frames of structured light images;
the second determining unit is used for determining a depth image matched with the target contour image according to at least two frames of structured light images; or determining a contour image matched with the target depth image according to at least two frames of visible light images;
and the obtaining unit is used for obtaining the face target image according to the target contour image and the depth image matched with it, or according to the target depth image and the contour image matched with it.
In the foregoing solution, the first determining unit is configured to:
obtain at least some of the at least two frames of visible light images;
identify, from the obtained frames of visible light images, an image whose face similarity reaches a first threshold, and determine it as the target contour image;
acquire the acquisition time or return time information of the target contour image.
Correspondingly, the second determining unit is configured to:
cache the frames of structured light images one by one, wherein the caching is stopped once the target contour image is identified;
acquire the acquisition time or return time information of each cached frame of structured light image;
and determine, among the cached frames of structured light images, the image whose acquisition time difference or return time difference is smaller than or equal to a third threshold as the depth image matched with the target contour image.
In the foregoing solution, the first determining unit is configured to:
obtain at least some of the at least two frames of structured light images;
identify, from the obtained frames of structured light images, an image whose face depth information similarity reaches a second threshold, and determine it as the target depth image;
acquire the acquisition time or return time information of the target depth image.
Correspondingly, the second determining unit is configured to:
cache the frames of visible light images one by one, wherein the caching is stopped once the target depth image is identified;
acquire the acquisition time or return time information of each cached frame of visible light image;
and determine, among the cached frames of visible light images, the image whose acquisition time difference or return time difference is smaller than or equal to a fourth threshold as the contour image matched with the target depth image.
Embodiments of the present application further provide a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the foregoing image recognition method are implemented.
Embodiments of the present application further provide an image recognition device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the image recognition method when executing the program.
Embodiments of the present application provide an image recognition method, device, and storage medium. The method includes: acquiring at least two frames of images of a target object to obtain a first video and a second video, where one of the two videos is characterized as at least two frames of visible light images of the target object and the other as at least two frames of structured light images of the target object; determining a target contour image from the at least two frames of visible light images and determining a depth image matching the target contour image from the at least two frames of structured light images, or determining a target depth image from the at least two frames of structured light images and determining a contour image matching the target depth image from the at least two frames of visible light images; and obtaining a face target image from the target contour image and its matched depth image, or from the target depth image and its matched contour image.
The embodiments of the present application provide a finer-grained face recognition scheme: face contour features and face depth information are combined for face recognition, which can greatly improve recognition accuracy. In addition, by matching the contour image with the depth image, a contour image and a depth image captured at the same or a similar time are identified and used to obtain the desired face image, which provides a degree of assurance for applications based on the recognized face image.
Drawings
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flow chart illustrating an implementation of a first embodiment of an image recognition method provided in the present application;
fig. 2 is a schematic flow chart illustrating an implementation of a second embodiment of the image recognition method provided in the present application;
fig. 3 is a schematic flow chart of an implementation of a third embodiment of an image recognition method provided in the present application;
fig. 4 is a schematic diagram of a face to be acquired in an application scenario according to an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating the operation of a buffer according to the present invention;
FIG. 6 is a schematic diagram of a hardware configuration of an embodiment of an image recognition apparatus provided in the present application;
fig. 7 is a schematic structural diagram of an embodiment of an image recognition apparatus provided in the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of the present application clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments derived by those skilled in the art from these embodiments without creative effort fall within the protection scope of the present application. In the present application, the embodiments and the features of the embodiments may be combined with one another in the absence of conflict. The steps illustrated in the flowcharts of the figures may be performed in a computer system, for example as a set of computer-executable instructions. Also, although a logical order is shown in the flowcharts, the steps shown or described may, in some cases, be performed in a different order.
In the embodiments of the present application, the face contour and the face depth features are combined to recognize the face. Compared with a scheme that performs face recognition only from the face contour or only from the face depth information, combining the two can greatly improve the accuracy of face recognition. Further, not only is the desired face contour image (target contour image) or the desired face depth image (target depth image) recognized, but also the depth image matching the desired contour image, or the contour image matching the desired depth image, which corresponds to recognizing a contour image and a depth image captured of the same scene of the same face. It can be understood that such a pair may be regarded as a contour image and a depth image captured at the same or a similar time; recognizing both, and treating the pair as the face target image, is equivalent to recognizing the contour features and the depth features of the face at the same or a similar time. The scheme thus identifies the face image by combining contour information and depth information, matches a contour image and a depth image acquired at the same or a similar time, and takes their composite as the finally recognized desired face image. The solutions provided in the embodiments of the present application are further described below.
Fig. 1 is a first embodiment of an image recognition method provided in the present application, and as shown in fig. 1, the method includes:
step (S) 101: acquiring at least two frames of images of a target object to obtain a first video and a second video, wherein one of the first video and the second video is characterized as at least two frames of visible light images of the target object, and the other is characterized as at least two frames of structured light images of the target object;
in a specific implementation, the acquisition device capable of separating visible light information and structured light information in the image can be used for acquiring the image of the target object. If collection system can be for many mesh cameras like binocular camera, three mesh cameras. Since the solution of the embodiment of the present application is to recognize a face image, the target object is a human face, for example, a human face as shown in fig. 4. Namely, the multi-camera is used for collecting images of the human face to obtain a visible light image and a structured light image of the human face. The principle of how to realize the collection of the visible light image and the structured light image by the specific multi-view camera is not specifically described here, and please refer to the related description.
S102: determining a target contour image according to the at least two frames of visible light images, and determining a depth image matched with the target contour image according to the at least two frames of structured light images; or determining a target depth image according to the at least two frames of structured light images, and determining a contour image matched with the target depth image according to the at least two frames of visible light images;
S103: obtaining a face target image according to the target contour image and the depth image matched with it, or according to the target depth image and the contour image matched with it.
In S103, the target contour image and its matched depth image are synthesized to obtain the final desired face image; or the target depth image and its matched contour image are synthesized to obtain the final desired face image.
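The synthesis step can be illustrated as a simple channel stack of the matched pair. The patent does not specify the fusion method, so `synthesize_rgbd` and the RGB-D stacking below are an assumption: one common way to combine a visible light (contour) image with its matched depth image.

```python
def synthesize_rgbd(rgb, depth):
    """Combine a contour image and its matched depth image into one RGB-D grid.
    rgb: HxW grid of (r, g, b) tuples; depth: HxW grid of floats.
    Returns an HxW grid of (r, g, b, d) tuples."""
    if len(rgb) != len(depth) or any(len(r) != len(d) for r, d in zip(rgb, depth)):
        raise ValueError("contour and depth images must share dimensions")
    return [[(*px, d) for px, d in zip(rgb_row, depth_row)]
            for rgb_row, depth_row in zip(rgb, depth)]

rgb = [[(255, 0, 0), (0, 255, 0)]]   # one-row toy contour image
depth = [[1.5, 2.0]]                 # matched depth values per pixel
rgbd = synthesize_rgbd(rgb, depth)
```

The resulting image carries both contour and depth features per pixel, which is the property the scheme relies on for higher recognition accuracy.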
The entity performing S101 to S103 is any image recognition device that can, or needs to, recognize a human face, such as a smartphone, tablet computer, all-in-one computer, desktop computer, or smart door lock; of course, any other reasonable device with a face recognition function is also included, and these are too numerous to enumerate.
It will be appreciated that the visible light information in an image facilitates the analysis of the contour features of the face, while the structured light information helps to analyze the height of each point of the face, such as nose height and eye-socket depth. S101 is equivalent to splitting the captured images along two aspects, contour features and depth features: the at least two frames captured of the face are divided into two video streams, one representing visible light information and the other representing structured light information. Based on this split, a target contour image and a depth image matched with it, or a target depth image and a contour image matched with it, are determined, thereby obtaining the desired face image (the face target image). Thus, the recognition method provided in the embodiments can identify the desired face contour image (target contour image) or the desired face depth image (target depth image) from one of the two videos, and can identify the depth image matching the desired contour image, or the contour image matching the desired depth image, from the other video stream. Combining the face contour features and the face depth information can greatly improve the accuracy of face recognition. In addition, a contour image and a depth image from the same or a similar time can be identified, and their composite image used as the finally recognized desired face image.
In S102 and S103, which video is selected as the primary video and which as the secondary video, and how the contour and depth information of the face are identified to obtain the desired face contour image (target contour image) or the desired face depth image (target depth image), may be implemented in one of the following two ways:
Mode one: determining a target contour image according to the at least two frames of visible light images; determining a depth image matched with the target contour image according to the at least two frames of structured light images; and obtaining a face target image according to the target contour image and the depth image matched with it.
In mode one, the video characterized by visible light information is used as the primary video and the video characterized by structured light information as the secondary video. The desired face contour image is identified from the visible light video; a depth image from the same or a similar time as the desired contour image is identified from the structured light video; and the composite of the desired face contour image and its matched depth image is taken as the finally recognized face image.
Mode two: determining a target depth image according to the at least two frames of structured light images; determining a contour image matched with the target depth image according to the at least two frames of visible light images; and obtaining a face target image according to the target depth image and the contour image matched with it.
In mode two, the video characterized by structured light information is used as the primary video and the video characterized by visible light information as the secondary video. The desired face depth image is identified from the structured light video; a contour image from the same or a similar time as the desired depth image is identified from the visible light video; and the composite of the desired face depth image and its matched contour image is taken as the finally recognized face image.
In this scheme, whichever mode is adopted, face contour features and face depth information are combined for face recognition, which can greatly improve accuracy. In addition, a contour image and a depth image from the same or a similar time can be identified, and the image synthesized from them used as the finally recognized face image. The finally recognized face image carries both the contour features and the depth features of the face, which improves recognition accuracy. Compared with related-art schemes that recognize only the contour features or only the depth features of the face, more facial features are recognized, which facilitates subsequent applications of the recognition result, such as face unlocking and the identification of suspicious persons.
Fig. 2 is a second embodiment of the image recognition method provided in the present application, and as shown in fig. 2, the method includes:
S201: acquiring at least two frames of images of a target object to obtain a first video and a second video, wherein one of the first video and the second video is characterized as at least two frames of visible light images of the target object, and the other is characterized as at least two frames of structured light images of the target object;
for the description of S201, reference is made to the foregoing description of S101, which is not repeated.
S202: obtaining at least part of the frame visible light image in the at least two frames of visible light images;
s203: identifying an image with the similarity of human faces reaching a first threshold value according to the obtained at least partial frame visible light image, and determining the image as a target contour image;
the foregoing S202 and S203 serve as a further explanation for determining the target contour image according to the at least two frames of visible light images.
S204: acquiring the acquisition time or the return time information of the target contour image;
s204 may be used as a further description for determining the target contour image according to the at least two frames of visible light images; or may occur as a separate step in example two.
S205: Caching each frame of structured light image one by one, where the caching is stopped once the target contour image is identified;
S206: Acquiring the acquisition time or return time information of each cached frame of structured light image;
S207: Determining, among the cached frames of structured light images, an image whose acquisition time difference or return time difference from the target contour image is less than or equal to a third threshold, as the depth image matched with the target contour image;
S205 to S207 further describe how the depth image matched with the target contour image is determined according to the at least two frames of structured light images. It should be understood that S202 to S204 form one branch of the processing flow and S205 to S207 form another; the two branches have no strict order and may also be performed simultaneously.
S208: Obtaining a face target image according to the target contour image and the depth image matched with the target contour image.
In S208, the target contour image and its matched depth image may be synthesized to obtain the finally desired face image.
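As an illustration, the synthesis in S208 might be sketched as stacking the matched contour and depth images into a single multi-channel face image. This is a minimal sketch using NumPy arrays; the patent does not prescribe a particular synthesis method, and the function name is illustrative.

```python
import numpy as np

def fuse_contour_and_depth(contour_rgb: np.ndarray, depth: np.ndarray) -> np.ndarray:
    """Hypothetical synthesis step: stack a visible-light (contour) image
    with its matched depth map into one 4-channel RGB-D face image."""
    if contour_rgb.shape[:2] != depth.shape[:2]:
        raise ValueError("contour and depth images must share height/width")
    # Scale depth into the same 0-255 range as the RGB channels.
    d = depth.astype(np.float32)
    d = (d - d.min()) / (np.ptp(d) + 1e-6) * 255.0
    return np.dstack([contour_rgb.astype(np.float32), d])
```

The fused result keeps the contour features in the first three channels and the depth features in the fourth, so downstream recognition can use both.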
In the foregoing S201 to S208, the video characterized by visible light information serves as the primary video, and the video characterized by structured light information serves as the secondary video.
An image whose face similarity reaches the first threshold is identified from the video characterized by visible light information and taken as the expected face contour image, and its acquisition time or return time information is obtained. From the cached structured light images, an image whose acquisition time difference or return time difference is less than or equal to the third threshold is identified as the depth image matched with the target contour image, and a composite of the expected face contour image and its matched depth image is taken as the finally recognized face image. In the embodiment of the application, on the one hand, face contour features and face depth information are combined for face recognition; on the other hand, the finally recognized face image carries both the contour features and the depth features of the face. The accuracy of face recognition can thus be greatly improved, more facial features are recognized, and the result can be conveniently applied in fields such as face unlocking and suspicious-part recognition and tracking.
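The time-based matching of S207 can be sketched as a nearest-neighbour search over the cached frames, keeping the result only if the time gap is within the third threshold. Function and variable names here are illustrative, not taken from the patent; the 30 ms default mirrors the example threshold used later in application scenario one.

```python
from typing import Optional, Sequence, Tuple

def match_by_timestamp(target_time_ms: int,
                       cached: Sequence[Tuple[int, str]],
                       threshold_ms: int = 30) -> Optional[str]:
    """Return the cached frame whose acquisition (or return) time is
    closest to the target image's time, provided the gap does not
    exceed the threshold; otherwise return None."""
    if not cached:
        return None
    t, frame = min(cached, key=lambda item: abs(item[0] - target_time_ms))
    return frame if abs(t - target_time_ms) <= threshold_ms else None
```

For example, with frames cached at 100 ms, 140 ms, and 200 ms, a target contour image captured at 150 ms pairs with the 140 ms frame; a target at 400 ms finds no match within the threshold.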
Fig. 3 shows a third embodiment of the image recognition method provided in the present application. As shown in Fig. 3, the method includes:
S301: Acquiring at least two frames of images of a target object to obtain a first video and a second video, where one of the first video and the second video is characterized as at least two frames of visible light images of the target object, and the other video is characterized as at least two frames of structured light images of the target object;
For the description of S301, refer to the foregoing description of S101, which is not repeated here.
S302: Obtaining at least some frames of the at least two frames of structured light images;
S303: Identifying, from the obtained frames of structured light images, an image whose face depth information similarity reaches a second threshold, and determining that image as the target depth image;
S304: Acquiring the acquisition time or return time information of the target depth image;
s305: caching the visible light images of each frame one by one; wherein the caching is stopped in case a target depth image is identified;
s306: acquiring the acquisition time or returning time information of each cached frame of visible light image;
s307: determining an image with the acquisition time difference or the return time difference smaller than or equal to a fourth threshold value in each frame of cached visible light image as a contour image matched with the target depth image;
S308: Obtaining a face target image according to the target depth image and the contour image matched with the target depth image.
In S308, the target depth image and its matched contour image are synthesized to obtain the finally expected face image.
In the foregoing S301 to S308, the video characterized by structured light information serves as the primary video, and the video characterized by visible light information serves as the secondary video.
An image whose face depth information similarity reaches the second threshold is identified from the video characterized by structured light information and taken as the expected face depth image, and its acquisition time or return time information is obtained. From the cached visible light images, an image whose acquisition time difference or return time difference is less than or equal to the fourth threshold is identified as the contour image matched with the target depth image, and a composite of the expected face depth image and its matched contour image is taken as the finally recognized face image. In the embodiment of the application, on the one hand, face contour features and face depth information are combined for face recognition; on the other hand, the finally recognized face image carries both the contour features and the depth features of the face. The accuracy of face recognition can thus be greatly improved, more facial features are recognized, and the result can be conveniently applied in fields such as face unlocking and suspicious-part recognition and tracking.
For the foregoing embodiments, and especially for the buffering of images of the second video in the second and/or third embodiment, the buffering may further be performed as follows: the collected structured light images or visible light images are cached one by one in a buffer of preset capacity. While the buffer is in use, if caching has not been stopped and the buffer is not full, the images are cached one by one in order of acquisition time or return time; if caching has not been stopped but the buffer is full, the structured light or visible light image with the earliest acquisition time or return time is deleted, and the newly acquired or newly returned structured light or visible light image is cached. Caching images in this way, using the characteristics of the buffer, makes it convenient to quickly identify from the secondary video stream the images captured at the same or a close time as the expected image identified in the basic video stream, which speeds up image recognition, shortens the duration of face recognition, and facilitates the use of face recognition in each application field.
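The preset-capacity buffer described above behaves like a bounded FIFO: frames are cached in arrival order, the earliest frame is evicted when the buffer is full, and caching is locked once the target image is identified. A minimal sketch, assuming a capacity of 25 as in the later application scenarios; the class and method names are illustrative.

```python
from collections import deque

class FrameBuffer:
    """Bounded frame cache: once full, the frame with the earliest
    acquisition/return time is dropped to make room for the newest."""
    def __init__(self, capacity: int = 25):
        self._frames = deque(maxlen=capacity)  # deque evicts the oldest automatically
        self._stopped = False

    def put(self, timestamp_ms: int, frame) -> bool:
        """Cache one (timestamp, frame) pair; refused once caching stopped."""
        if self._stopped:
            return False
        self._frames.append((timestamp_ms, frame))
        return True

    def stop(self):
        """Lock the buffer, e.g. when the notification message is generated."""
        self._stopped = True

    def snapshot(self):
        """Frames currently cached, oldest first."""
        return list(self._frames)
```

Using `deque(maxlen=...)` gives the evict-oldest behavior directly, without the explicit position shifting described later for the 25-slot buffer; the observable contents are the same.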
In addition, in order to ensure normal recognition of the depth image or the contour image in the basic video stream, and to avoid the congestion caused by recognizing multiple images of the basic video stream simultaneously, a thread pool is created in the embodiment of the present application. The thread pool includes multiple threads, and the threads in the pool are used serially. Further:
For the case where the basic video stream is characterized by visible light images: while a first thread in the thread pool is performing target contour image identification on one frame of the at least two frames of visible light images, a second thread in the thread pool cannot perform target contour image identification on at least one other frame of visible light image; that at least one other frame is deleted from the at least two frames of visible light images, yielding the partial frames of visible light images on which the expected contour image identification is performed.
For the case where the basic video stream is characterized by structured light images: while a first thread in the thread pool is performing target depth image identification on one frame of the at least two frames of structured light images, a second thread in the thread pool cannot perform target depth image identification on at least one other frame of structured light image; that at least one other frame is deleted from the at least two frames of structured light images, yielding the partial frames of structured light images on which the expected depth image identification is performed.
Besides avoiding congestion, the foregoing solution can be understood from another perspective: for all the (visible light or structured light) images in the basic video stream, the target contour image or target depth image may be identified using all of the images, or using only a subset of them. Preferably, only a subset of all the images is used. That is, in the embodiment of the application, an accurate target contour image or target depth image is identified using only part of the images, which ensures recognition accuracy while avoiding calling excessive resources during recognition, thereby reducing the resource burden and speeding up recognition.
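The serial use of the detection program can be sketched with a non-blocking lock: a thread that finds the detector busy simply drops its frame, which is exactly how only a subset of frames ends up being recognized. This is a hedged sketch; `detect` stands in for whichever face or depth detection program is preset, and the names are illustrative.

```python
import threading
from typing import Callable, Optional

detector_lock = threading.Lock()  # the detection program serves one thread at a time

def try_recognize(frame, detect: Callable[[object], bool]) -> Optional[bool]:
    """Run the detector on `frame` only if it is idle; frames arriving
    while the detector is busy are discarded rather than queued."""
    if not detector_lock.acquire(blocking=False):
        return None                # busy: this frame is deleted, no waiting
    try:
        return detect(frame)       # True once a high-quality image is found
    finally:
        detector_lock.release()
```

Because `acquire(blocking=False)` never waits, no queue of waiting threads can build up, which is the congestion the scheme is designed to avoid.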
The embodiments of the present application are described in further detail below with reference to Figs. 4 and 5 and specific application scenarios. It should be understood that the subject implementing the scheme in the following application scenarios is an image recognition device.
Application scenario one: multiple frames of images characterized by visible light information are taken as the basic video stream.
It can be understood that the execution subject in this application scenario is an image recognition device, and the image recognition device includes a multi-view camera.
In this application scenario, the face shown in Fig. 4 is shot (captured) multiple times with a multi-view camera, such as a binocular or trinocular camera. Those skilled in the art will understand that a multi-view camera can separately collect visible light information and 3D structured light information of the same shooting scene (the ambient light, the shooting angle of the face, the pose of the face during shooting, and so on). Shooting the face multiple times with the multi-view camera yields two video streams: one characterized by the visible light information obtained by shooting the face in that scene, and the other characterized by the 3D structured light information obtained by shooting the face in that scene. While each frame of the two video streams is shot (captured), its acquisition time must be recorded. For the specific process by which the multi-view camera acquires 3D structured light information and visible light information, refer to the related description, which is not repeated here.
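The two streams and their recorded acquisition times might be modelled as below. This is only a sketch: `camera_read` is a hypothetical driver callback, since the patent does not specify a camera API, and the type names are illustrative.

```python
import time
from dataclasses import dataclass

@dataclass
class Frame:
    stream: str        # "visible" or "structured"
    timestamp_ms: int  # acquisition time recorded at capture
    data: object

def capture_step(camera_read, clock_ms=lambda: int(time.time() * 1000)):
    """One capture step: both modalities are read in the same step, so
    their frames share one recorded acquisition time, which is what
    later pairs a contour image with its depth image."""
    ts = clock_ms()
    visible, structured = camera_read()
    return Frame("visible", ts, visible), Frame("structured", ts, structured)
```

Recording the timestamp once per capture step is what makes the later "acquisition time difference within a threshold" matching meaningful.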
The scheme for processing the video stream composed of the visible light images is as follows:
The video stream characterized by visible light information, for example the 1st frame to the Nth frame of visible light images (N being a positive integer greater than or equal to 1), is ordered by acquisition time. For the 1st frame of visible light image collected by the multi-view camera, thread 1 in the thread pool is allocated to identify whether it is a high-quality face image; for the 2nd frame, thread 2 is allocated likewise, and so on: each thread in the thread pool is responsible for deciding whether its corresponding frame is a high-quality face image. Of course, the number of threads in the thread pool may be greater than, equal to, or smaller than the number of collected visible light images. For example, if the thread pool contains 5 threads allocated in the above manner, then once thread 1 determines that the 1st frame is not a high-quality face image, its task for that frame ends, and thread 1 can be allocated to the 6th frame collected by the multi-view camera, to identify whether the 6th frame is a high-quality face image. That is, once a thread in this application scenario finishes its recognition task for one visible light image, it can carry out recognition for another visible light image, thereby saving thread resources.
In this application scenario, the multi-view camera collects the visible light images, and the image recognition device allocates the threads. In a concrete implementation, a face detection program is preset, and this program identifies whether an image is a high-quality face image. Specifically, the thread assigned to the nth frame of visible light image (any of the 1st to Nth frames), such as thread n, calls the face detection program, and the program identifies whether the nth frame is a high-quality face image. Because thread n occupies the face detection program while calling it, other threads cannot call the program to perform high-quality face image recognition. Only when thread n releases the program (for example, upon recognizing that the image is not a high-quality face image) can one of the other threads call it and identify whether its corresponding frame of visible light image is a high-quality face image. That is, in the serial usage mode adopted by the threads of the thread pool in this application scenario, only after one thread has finished its call to the face detection program is one of the other threads allowed to use it.
In practical application, because the acquisition interval of the multi-view camera is shorter than the time the face detection program needs to identify a high-quality face, too many threads would end up waiting for the program to be released while one thread is using it. To avoid the congestion caused by excessive waiting threads, in this application scenario the visible light images collected by the multi-view camera during the wait are deleted or discarded. For example, suppose the camera acquires one frame every 30 milliseconds and the face detection program needs 100 ms to recognize one frame. Thread 1 is allocated to the 1st frame of visible light image and calls the face detection program to perform high-quality face image recognition; during the 100 ms the program spends on the 1st frame, the camera collects the 2nd to 4th frames, which are deleted or discarded to avoid congestion waiting. Thread 2 is allocated to the 5th frame; by then thread 1 has released the face detection program, so thread 2 calls it and performs high-quality face image recognition on the 5th frame. The face detection program may be any face detection algorithm capable of identifying a face, such as a local binary pattern algorithm, the eigenface method, or a neural network. In implementation terms, if the face detection program is not being called, it is set to the idle state and threads are allowed to call it.
If the face detection program is being called, it is set to a non-idle state, such as the working state, and other threads are forbidden to call it. It can therefore be understood that, of all the visible light images collected by the multi-view camera, not every one is checked for being a high-quality face image; only some of them qualify for recognition, and the assigned thread calls the face detection program to identify whether a qualifying image is a high-quality face image. This reduces the load on recognition resources and speeds up recognition. It should be noted that because the images collected by the multi-view camera are all images of the face shown in Fig. 4 under the same or a similar shooting scene, the differences between them are small; even if high-quality face image recognition is performed on only part of the collected visible light images, the recognition result is not affected and remains accurate.
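The worked example above (one frame captured every 30 ms, 100 ms per detection) can be checked with a short simulation; the function name is illustrative.

```python
def recognized_frame_numbers(n_frames: int,
                             capture_ms: int = 30,
                             detect_ms: int = 100):
    """Simulate which captured frames reach the detector when detection
    is slower than capture: frames arriving while the detector is busy
    are dropped (deleted) rather than queued."""
    free_at = 0        # time at which the detection program becomes idle
    picked = []
    for i in range(n_frames):
        t = i * capture_ms                 # acquisition time of frame i+1
        if t >= free_at:                   # detector idle: assign a thread
            picked.append(i + 1)           # 1-based numbering as in the text
            free_at = t + detect_ms
    return picked
```

With the numbers from the text, frames 1 and 5 are recognized and frames 2 to 4 are dropped, matching the description above.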
With this scheme, each frame of visible light image is processed by its corresponding thread. After several frames have been processed, if a high-quality face image is detected, that is, a visible light image whose similarity to the preset contour features of the face reaches the first threshold, that frame of visible light image (the target contour image) is output and a notification message is generated. The first threshold is preset and may be any reasonable value, such as 90% or 97%. The buffer stops buffering once the notification message is generated.
It should be noted that once a high-quality face image is detected, the multi-view camera may stop collecting at least the visible light images, and may also stop collecting the 3D structured light images at the same time, as the case requires.
The scheme for processing the video stream composed of 3D structured light images is as follows:
For the video stream characterized by 3D structured light information, the multi-view camera caches it in a buffer as it is collected. When the multi-view camera collects the 3D structured light images, the acquisition time of each image must be recorded.
It can be understood that the capacity (M) of the buffer is constant; for example, a maximum capacity of M = 25 means 25 3D structured light images can be stored. While the buffer is not full and caching has not been stopped, each image is cached one by one in order of its acquisition time. As shown in Fig. 5, with a maximum buffer capacity of 25 (25 cache positions), and assuming the buffer is initially idle and stores no data, the 1st frame of 3D structured light image collected by the multi-view camera is cached at position 1 of the buffer, the 2nd frame at position 2, and so on. It should be noted that when the buffer is full, to ensure that a newly collected 3D structured light image is still cached normally, the 3D structured light image with the earliest acquisition time must be deleted from the buffer, freeing a position for the newly collected image. For example, when the 1st to 25th frames of 3D structured light images are cached and the buffer is full, the frame with the earliest acquisition time, namely the 1st frame, is deleted; the remaining 24 frames are each shifted forward one cache position, so that positions 1 to 24 are occupied in turn by the 2nd to 25th frames of structured light images.
The 26th frame of 3D structured light image collected by the multi-view camera is then cached at the position originally holding the 25th frame, i.e., position 25 of the buffer. For the 27th frame, the image with the earliest acquisition time among the 25 frames then in the buffer, such as the 2nd frame cached at position 1, is deleted; the remaining structured light images are shifted forward one position, and the 27th frame is cached at position 25. This regular operation of the buffer ensures that the deleted image is always the structured light image at position 1, and that the newly collected structured light image is always stored at the last position of the buffer.
While the multi-view camera collects and stores the 3D structured light images, and each collected frame is being cached into the buffer, the caching operation stops and the buffer is locked as soon as the notification message is detected. It is then judged whether the locked buffer contains a 3D structured light image whose acquisition time is closest to that of the target contour image, with an error smaller than a third threshold such as 30 ms. If such an image can be found, it is regarded as the 3D structured light image matched with the target contour image; it and the target contour image can be regarded as an image characterized by contour features and an image characterized by depth information, obtained by shooting the face shown in Fig. 4 at the same or a close time. The two images are synthesized to obtain, and output, a face image (the expected face image) that carries the most accurate contour features together with the depth features from the same or a close time.
In application scenario one, face contour features and face depth information are combined for face recognition. The finally recognized face image carries both the contour features and the depth features of the face, so the accuracy of face recognition can be greatly improved; more facial features are recognized, which facilitates application of face recognition in fields such as face unlocking and suspicious-part recognition and tracking.
Application scenario two: multiple frames of images characterized by structured light information are taken as the basic video stream.
For the video stream characterized by 3D structured light images, the multi-view camera collects the images and the threads are allocated; the number of threads, the allocation method, and the scheme for deleting or discarding part of the 3D structured light images are as described in application scenario one and are not repeated. In application scenario two, a face depth information detection program is preset, and a high-quality depth image is identified by having a thread call this program. While a thread is calling the program, the program is set to the working state and other threads are forbidden to call it; when no thread is calling it, it is set to the idle state and other threads are allowed to call it. Each frame of 3D structured light image is processed by its corresponding thread. After several frames have been processed, if a high-quality depth image is detected, that is, a 3D structured light image whose similarity to the preset depth features of the face reaches the second threshold, that frame of 3D structured light image (the target depth image) is output and a notification message is generated. The second threshold is preset and may be any reasonable value, such as 95% or 98%.
For the video stream characterized by visible light images, the multi-view camera caches it in the buffer as it collects it, and the acquisition time of each visible light image is recorded. While the buffer is not full and caching has not been stopped, each image is cached one by one in order of its acquisition time; when the buffer is full, the image with the earliest acquisition time is deleted and the newly collected visible light image is cached in the freed position. While the visible light images collected by the multi-view camera are being cached into the buffer, the caching operation stops and the buffer is locked as soon as the notification message is detected. It is then judged whether the locked buffer contains a visible light image whose acquisition time is closest to that of the target depth image, for example with an error smaller than a fourth threshold such as 25 ms. If such an image can be found, it is regarded as the visible light image matched with the target depth image; it and the target depth image can be regarded as an image characterized by contour features and an image characterized by depth information, obtained by shooting the face shown in Fig. 4 at the same or a close time. The two images are synthesized to obtain, and output, the final face image that carries the most accurate depth features together with the contour features from the same or a close time.
In application scenario two, face contour features and face depth information are combined for face recognition. The finally recognized face image carries both the depth features and the contour features of the face, so the accuracy of face recognition can be greatly improved; more facial features are recognized, which facilitates application of face recognition in fields such as face unlocking and suspicious-part recognition and tracking.
It can be understood that, for the parts of application scenario two that are the same as in application scenario one, reference is made to the detailed description in application scenario one; they are not repeated in application scenario two.
In the foregoing scheme, the multi-view camera is disposed in the image recognition device, and the recorded time information is the time at which the camera acquires an image (the acquisition time). If the multi-view camera is instead disposed separately and is not located in the image recognition device of the embodiment of the present application, the camera needs to return the acquired images to the device. The device may still carry out the above scheme using the acquisition time of each image; in addition, it may record the time at which each image is returned to it (the return time) and carry out the above scheme using the return time of each frame. Take the case where the basic video stream is composed of visible light images as an example: the visible light images can be used for face detection, and for each frame of visible light image returned by the multi-view camera, the image recognition device receives the image and records its return time. Threads are allocated to the received visible light images from the created thread pool; when the face detection program is in the idle state, a visible light image returned by the multi-view camera is sent to the program to detect the face in it, and the program is set to the working state. After the face detection program has processed several frames of visible light images, if a high-quality face image is detected, that frame of visible light image is output and a notification message is generated.
For the 3D structured light image frames returned by the multi-view camera, a buffer is used by default, for example a concurrency-safe container that caches a certain number of frames. When the number of cached frames exceeds 25, the earliest returned frame in the container is removed and the newly returned frame is put in. The moment a notification message is detected, the container is locked and newly returned image frames are no longer accepted. According to the return time of the high-quality visible light frame, the 3D structured light frame with the closest return time is searched for in the container; the found frame can be paired with the high-quality visible light frame, and the two are synthesized and output. It can be understood that the paired frames are a set of two images, one characterizing depth features and one characterizing contour features, captured of the same target object in the same shooting scene at the same or close times. Therefore, in the scheme of the embodiment of the application, the finally recognized face image carries both the contour features of the face and the depth features of an image shot at the same or a close time, so the accuracy of face recognition can be greatly improved. Moreover, more facial features are recognized, which facilitates application of face recognition in fields such as face unlocking and suspicious-part recognition and tracking.
An embodiment of the present application further provides an image recognition apparatus. As shown in Fig. 7, the apparatus includes a camera 11, a first determining unit 12, a second determining unit 13, and an obtaining unit 14, where:
the camera 11 is configured to acquire at least two frames of images of a target object to obtain a first video and a second video, where one of the first video and the second video is characterized as at least two frames of visible light images of the target object, and the other video is characterized as at least two frames of structured light images of the target object;
the first determining unit 12 is configured to determine a target contour image according to the at least two frames of visible light images, or to determine a target depth image according to the at least two frames of structured light images;
a second determining unit 13, configured to determine a depth image matching the target contour image according to at least two frames of structured light images; or determining a contour image matched with the target depth image according to at least two frames of visible light images;
the obtaining unit 14 is configured to obtain a human face target image according to the target contour image and the depth image matched with the target contour image, or according to the target depth image and the contour image matched with the target depth image.
In an alternative embodiment, the first determining unit 12 is configured to:
obtaining at least part of the frame visible light image in the at least two frames of visible light images;
identifying, from the obtained at least partial frames of visible light images, an image whose face similarity reaches a first threshold, and determining that image as the target contour image;
acquiring the acquisition time or the return time information of the target contour image;
correspondingly, the second determining unit 13 is configured to:
caching the structured light images of each frame one by one; wherein the caching is stopped in case a target contour image is identified;
acquiring the acquisition time or returning time information of each cached frame structured light image;
and determining, among the cached frames of structured light images, an image whose acquisition time difference or return time difference is smaller than or equal to a third threshold as the depth image matched with the target contour image.
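As an illustration of the matching step, the cached structured light frames whose time difference from the target contour image does not exceed the third threshold can be selected with a simple filter. The function and parameter names below are hypothetical:

```python
def match_depth_images(cached_frames, target_time, threshold):
    """Select, from cached (time, frame) tuples, the structured light frames
    whose acquisition or return time differs from the target contour image's
    time by at most `threshold`. All names are illustrative."""
    return [(t, f) for t, f in cached_frames if abs(t - target_time) <= threshold]
```

The same filter applies symmetrically when matching cached visible light frames against a target depth image (the fourth threshold of the other embodiment).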
In an alternative embodiment, the first determining unit 12 is configured to:
obtaining at least a portion of the at least two frames of structured light images;
identifying, from the obtained at least partial frames of structured light images, an image whose face depth information similarity reaches a second threshold, and determining that image as the target depth image;
acquiring the acquisition time or return time information of the target depth image;
correspondingly, the second determining unit 13 is configured to:
caching the visible light images of each frame one by one; wherein the caching is stopped in case a target depth image is identified;
acquiring the acquisition time or returning time information of each cached frame of visible light image;
and determining, among the cached frames of visible light images, an image whose acquisition time difference or return time difference is smaller than or equal to a fourth threshold as the contour image matched with the target depth image.
In an alternative embodiment, the collected structured light images or visible light images are cached one by one using a buffer of preset capacity.
In an optional embodiment, while caching has not stopped and the buffer is not full, the structured light or visible light images are cached one by one in order of acquisition time or return time;
and while caching has not stopped and the buffer is full, the structured light or visible light image with the earliest acquisition time or return time is deleted, and the newly acquired or newly returned structured light or visible light image is cached.
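The preset-capacity buffer with oldest-first eviction described in this embodiment maps naturally onto a fixed-length double-ended queue. A minimal Python sketch, with the capacity value chosen for illustration:

```python
from collections import deque

CAPACITY = 25  # preset buffer capacity; the concrete value is illustrative

# Appending to a full fixed-length deque automatically discards the item
# at the opposite end, i.e. the earliest cached frame.
buffer = deque(maxlen=CAPACITY)

def cache_frame(return_time, frame):
    """Cache frames in order of acquisition/return time; once the buffer is
    full, the frame with the earliest time is deleted to admit the new one."""
    buffer.append((return_time, frame))
```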
In an alternative embodiment, the first determining unit 12 is configured to obtain a partial frame visible light image of the at least two frames of visible light images. Further, the first determining unit 12 includes:
creating a thread pool for the at least two frames of visible light images, wherein each thread in the thread pool is used in series;
while a first thread in the thread pool performs target contour image recognition on one of the at least two frames of visible light images,
a second thread of the thread pool cannot perform target contour image recognition on at least one other frame of visible light image, other than the one frame, among the at least two frames of visible light images;
and deleting the at least one other frame of visible light image from the at least two frames of visible light images to obtain the partial frame of visible light image.
In an alternative embodiment, the first determining unit 12 is configured to obtain a partial frame structured light image of the at least two frames of structured light images. Further, the first determining unit 12 is configured to:
creating a thread pool for the at least two frames of structured light images, wherein each thread in the thread pool is used in series;
while a first thread in the thread pool performs target depth image recognition on one of the at least two frames of structured light images,
a second thread of the thread pool cannot perform target depth image recognition on at least one other frame of structured light image, other than the one frame, among the at least two frames of structured light images;
and deleting the at least one other frame of structured light image from the at least two frames of structured light images to obtain the partial frame of structured light image.
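The serial thread pool described in the two embodiments above drops any frame that arrives while recognition is already in progress, so only a partial subset of the incoming frames is ever examined. A hypothetical sketch of that discipline (class and method names are illustrative, not the patent's implementation):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class SerialRecognizer:
    """Run recognition on at most one frame at a time.

    Frames that arrive while recognition is in progress are deleted rather
    than queued, which yields the 'partial frame' subset of the description.
    """

    def __init__(self, recognize_fn):
        self._recognize = recognize_fn
        self._executor = ThreadPoolExecutor(max_workers=1)  # threads used in series
        self._busy = threading.Lock()
        self.dropped = 0  # count of frames deleted without recognition

    def submit(self, frame):
        # Non-blocking acquire: if recognition is running, drop this frame.
        if not self._busy.acquire(blocking=False):
            self.dropped += 1
            return None

        def task():
            try:
                return self._recognize(frame)
            finally:
                self._busy.release()  # a plain Lock may be released by the worker

        return self._executor.submit(task)
```

For example, a frame submitted while a slow recognition call is running returns `None` (deleted), while frames submitted when the recognizer is idle are processed in order.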
It is understood that, in practical applications, the first determining unit 12, the second determining unit 13 and the obtaining unit 14 in the apparatus can be implemented by a Central Processing Unit (CPU), a Digital Signal Processor (DSP), a Microcontroller Unit (MCU) or a Field Programmable Gate Array (FPGA) of the apparatus. The camera 11 may be implemented by any multi-view camera, such as a binocular or trinocular camera.
It should be noted that, because the image recognition apparatus of the embodiment of the present application solves its problem on a principle similar to that of the image recognition method, both the implementation process and the implementation principle of the apparatus can be understood with reference to those of the method, and repeated details are not described again.
An embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is configured to, when executed by a processor, perform at least the steps of the method shown in any one of fig. 1 to 5. The computer readable storage medium may be specifically a memory. The memory may be the memory 62 as shown in fig. 6.
The embodiment of the present application further provides an image recognition device. Fig. 6 is a schematic diagram of the hardware structure of an image recognition device according to an embodiment of the present application. As shown in fig. 6, the device includes: a communication component 63 for data transmission, at least one processor 61, and a memory 62 for storing computer programs capable of running on the processor 61. The various components in the device are coupled together by a bus system 64. It will be appreciated that the bus system 64 is used to enable communication among these components. In addition to a data bus, the bus system 64 includes a power bus, a control bus, and a status signal bus. For clarity of illustration, however, the various buses are all labeled as the bus system 64 in fig. 6.
Wherein the processor 61 executes the computer program to perform at least the steps of the method of any of fig. 1 to 5.
It will be appreciated that the memory 62 can be either volatile memory or nonvolatile memory, and can include both volatile and nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferroelectric Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disk, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk storage or tape storage. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory 62 described in the embodiments herein is intended to comprise, without being limited to, these and any other suitable types of memory.
The method disclosed in the above embodiments of the present application may be applied to the processor 61, or implemented by the processor 61. The processor 61 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 61. The processor 61 described above may be a general purpose processor, a DSP, or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. The processor 61 may implement or perform the methods, steps and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed in the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software modules may be located in a storage medium located in the memory 62, and the processor 61 reads the information in the memory 62 and performs the steps of the aforementioned method in conjunction with its hardware.
In an exemplary embodiment, the image recognition device may be implemented by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), FPGAs, general purpose processors, controllers, MCUs, microprocessors, or other electronic components, for performing the aforementioned image recognition method.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a mobile storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially implemented or portions thereof contributing to the prior art may be embodied in the form of a software product stored in a storage medium, and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to obtain new method embodiments.
Features disclosed in several of the product embodiments provided in the present application may be combined in any combination to yield new product embodiments without conflict.
The features disclosed in the several method or apparatus embodiments provided in the present application may be combined arbitrarily, without conflict, to arrive at new method embodiments or apparatus embodiments.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (12)

1. An image recognition method, characterized in that the method comprises:
acquiring at least two frames of images of a target object to obtain a first video and a second video, wherein one of the first video and the second video is characterized as at least two frames of visible light images of the target object, and the other is characterized as at least two frames of structured light images of the target object;
determining a target contour image according to the at least two frames of visible light images; determining a depth image matched with the target contour image according to at least two frames of structured light images; or determining a target depth image according to the at least two frames of structured light images; determining a contour image matched with the target depth image according to at least two frames of visible light images;
and obtaining a human face target image according to the target contour image and the depth image matched with the target contour image or according to the target depth image and the contour image matched with the target depth image.
2. The method according to claim 1, wherein determining the target contour image from the at least two frames of visible light images comprises:
obtaining at least part of the frame visible light image in the at least two frames of visible light images;
identifying an image with the similarity of human faces reaching a first threshold value according to the obtained at least partial frame visible light image, and determining the image as a target contour image;
acquiring the acquisition time or the return time information of the target contour image;
correspondingly, the determining a depth image matched with the target contour image according to at least two frames of the structured light image comprises:
caching the structured light images of each frame one by one; wherein the caching is stopped in case a target contour image is identified;
acquiring the acquisition time or returning time information of each cached frame structured light image;
and determining the image with the acquisition time difference or return time difference smaller than or equal to a third threshold value in each cached frame of structured light image as a depth image matched with the target contour image.
3. The method of claim 1, wherein determining a target depth image from the at least two frames of structured light images comprises:
obtaining at least a portion of the at least two frames of structured light images;
identifying an image with the similarity of face depth information reaching a second threshold value according to the obtained at least partial frame structured light image, and determining the image as a target depth image;
acquiring the acquisition time or return time information of the target depth image;
correspondingly, the determining a contour image matched with the target depth image according to at least two frames of visible light images includes:
caching the visible light images of each frame one by one; wherein the caching is stopped in case a target depth image is identified;
acquiring the acquisition time or returning time information of each cached frame of visible light image;
and determining the images with the acquisition time difference or return time difference smaller than or equal to a fourth threshold value in the cached frames of visible light images as contour images matched with the target depth images.
4. The method according to claim 2 or 3, wherein the collected structured light images or visible light images are cached one by one using a buffer of preset capacity.
5. The method according to claim 4, wherein, while the caching has not stopped and the buffer is not full, the structured light or visible light images are cached one by one in order of acquisition time or return time;
and while the caching has not stopped and the buffer is full, the structured light or visible light image with the earliest acquisition time or return time is deleted, and the newly acquired or newly returned structured light or visible light image is cached.
6. The method of claim 2, wherein the obtaining a partial frame visible light image of the at least two frames of visible light images comprises:
creating a thread pool for the at least two frames of visible light images, wherein each thread in the thread pool is used in series;
in the case that a first thread in the thread pool performs target contour image recognition on one of the at least two frames of visible light images,
a second thread of the thread pool cannot perform target contour image recognition on at least one other frame of visible light image, other than the one frame, among the at least two frames of visible light images;
and deleting the at least one other frame of visible light image from the at least two frames of visible light images to obtain the partial frame of visible light image.
7. The method of claim 3, wherein obtaining the partial frame structured light image of the at least two frame structured light images comprises:
creating a thread pool for the at least two frames of structured light images, wherein each thread in the thread pool is used in series;
in the case that a first thread in the thread pool performs target depth image recognition on one of the at least two frames of structured light images,
a second thread of the thread pool cannot perform target depth image recognition on at least one other frame of structured light image, other than the one frame, among the at least two frames of structured light images;
and deleting the at least one other frame of structured light image from the at least two frames of structured light images to obtain the partial frame of structured light image.
8. An image recognition apparatus characterized by comprising:
the camera is used for acquiring at least two frames of images of a target object to obtain a first video and a second video, wherein one of the first video and the second video is characterized as at least two frames of visible light images of the target object, and the other is characterized as at least two frames of structured light images of the target object;
the first determining unit is used for determining a target contour image according to the at least two frames of visible light images; or determining a target depth image according to the at least two frames of structured light images;
the second determining unit is used for determining a depth image matched with the target contour image according to at least two frames of structured light images; or determining a contour image matched with the target depth image according to at least two frames of visible light images;
and the obtaining unit is used for obtaining the human face target image according to the target contour image and the depth image matched with the target contour image or according to the target depth image and the contour image matched with the target depth image.
9. The apparatus of claim 8, wherein the first determining unit is configured to:
obtaining at least part of the frame visible light image in the at least two frames of visible light images;
identifying an image with the similarity of human faces reaching a first threshold value according to the obtained at least partial frame visible light image, and determining the image as a target contour image;
acquiring the acquisition time or the return time information of the target contour image;
correspondingly, the second determining unit is configured to:
caching the structured light images of each frame one by one; wherein the caching is stopped in case a target contour image is identified;
acquiring the acquisition time or returning time information of each cached frame structured light image;
and determining the image with the acquisition time difference or return time difference smaller than or equal to a third threshold value in each cached frame of structured light image as a depth image matched with the target contour image.
10. The apparatus of claim 8, wherein the first determining unit is configured to:
obtaining at least a portion of the at least two frames of structured light images;
identifying an image with the similarity of face depth information reaching a second threshold value according to the obtained at least partial frame structured light image, and determining the image as a target depth image;
acquiring the acquisition time or return time information of the target depth image;
correspondingly, the second determining unit is configured to:
caching the visible light images of each frame one by one; wherein the caching is stopped in case a target depth image is identified;
acquiring the acquisition time or returning time information of each cached frame of visible light image;
and determining the images with the acquisition time difference or return time difference smaller than or equal to a fourth threshold value in the cached frames of visible light images as contour images matched with the target depth images.
11. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the image recognition method of any one of claims 1 to 7.
12. An image recognition apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the image recognition method according to any one of claims 1 to 7 are implemented when the processor executes the program.
CN201911057996.9A 2019-11-01 2019-11-01 Image recognition method, device and storage medium Active CN111079520B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911057996.9A CN111079520B (en) 2019-11-01 2019-11-01 Image recognition method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911057996.9A CN111079520B (en) 2019-11-01 2019-11-01 Image recognition method, device and storage medium

Publications (2)

Publication Number Publication Date
CN111079520A true CN111079520A (en) 2020-04-28
CN111079520B CN111079520B (en) 2024-05-21

Family

ID=70310656

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911057996.9A Active CN111079520B (en) 2019-11-01 2019-11-01 Image recognition method, device and storage medium

Country Status (1)

Country Link
CN (1) CN111079520B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH03101720A (en) * 1989-09-16 1991-04-26 Nikon Corp Camera capable of changing photographic condition
CN101238500A (en) * 2005-08-30 2008-08-06 松下电器产业株式会社 Parking position search assisting apparatus, method and program
CN206515931U (en) * 2016-11-21 2017-09-22 江苏金辰图像科技有限公司 A kind of face identification system
WO2017177768A1 (en) * 2016-04-13 2017-10-19 腾讯科技(深圳)有限公司 Information processing method, terminal, and computer storage medium
CN108965732A (en) * 2018-08-22 2018-12-07 Oppo广东移动通信有限公司 Image processing method, device, computer readable storage medium and electronic equipment
CN109144426A (en) * 2018-09-07 2019-01-04 郑州云海信息技术有限公司 A kind of memory space management of object storage system, system and associated component
CN109670421A (en) * 2018-12-04 2019-04-23 青岛小鸟看看科技有限公司 A kind of fatigue state detection method and device
CN109670482A (en) * 2019-01-13 2019-04-23 北京镭特医疗科技有限公司 Face identification method and device in a kind of movement
CN109961583A (en) * 2018-10-22 2019-07-02 大连艾米移动科技有限公司 A kind of intelligent market scratch system based on face recognition technology
CN110333779A (en) * 2019-06-04 2019-10-15 Oppo广东移动通信有限公司 Control method, terminal and storage medium

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH03101720A (en) * 1989-09-16 1991-04-26 Nikon Corp Camera capable of changing photographic condition
CN101238500A (en) * 2005-08-30 2008-08-06 松下电器产业株式会社 Parking position search assisting apparatus, method and program
WO2017177768A1 (en) * 2016-04-13 2017-10-19 腾讯科技(深圳)有限公司 Information processing method, terminal, and computer storage medium
CN206515931U (en) * 2016-11-21 2017-09-22 江苏金辰图像科技有限公司 A kind of face identification system
CN108965732A (en) * 2018-08-22 2018-12-07 Oppo广东移动通信有限公司 Image processing method, device, computer readable storage medium and electronic equipment
CN109144426A (en) * 2018-09-07 2019-01-04 郑州云海信息技术有限公司 A kind of memory space management of object storage system, system and associated component
CN109961583A (en) * 2018-10-22 2019-07-02 大连艾米移动科技有限公司 A kind of intelligent market scratch system based on face recognition technology
CN109670421A (en) * 2018-12-04 2019-04-23 青岛小鸟看看科技有限公司 A kind of fatigue state detection method and device
CN109670482A (en) * 2019-01-13 2019-04-23 北京镭特医疗科技有限公司 Face identification method and device in a kind of movement
CN110333779A (en) * 2019-06-04 2019-10-15 Oppo广东移动通信有限公司 Control method, terminal and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CAPTAINJNO: "Serial implementation of multithreading" *

Also Published As

Publication number Publication date
CN111079520B (en) 2024-05-21

Similar Documents

Publication Publication Date Title
WO2016180224A1 (en) Method and device for processing image of person
US8903139B2 (en) Method of reconstructing three-dimensional facial shape
CN111310724A (en) In-vivo detection method and device based on deep learning, storage medium and equipment
CN116824016A (en) Rendering model training, video rendering method, device, equipment and storage medium
CN111479095B (en) Service processing control system, method and device
CN109784220A (en) A kind of method and device of determining passerby track
CN112954212A (en) Video generation method, device and equipment
CN110913138B (en) Continuous shooting method, device, terminal and computer readable storage medium
CN111314395B (en) Image transmission method, terminal and storage medium
CN108040244B (en) Snapshot method and device based on light field video stream and storage medium
CN111967529B (en) Identification method, device, equipment and system
CN110223219B (en) 3D image generation method and device
CN111079520A (en) Image recognition method, device and storage medium
US7672370B1 (en) Deep frame analysis of multiple video streams in a pipeline architecture
CN111191612B (en) Video image matching method, device, terminal equipment and readable storage medium
CN114218411A (en) System for generating picture through video
CN114740975A (en) Target content acquisition method and related equipment
CN108431867B (en) Data processing method and terminal
CN114268730A (en) Image storage method and device, computer equipment and storage medium
CN111246244A (en) Method and device for rapidly analyzing and processing audio and video in cluster and electronic equipment
CN112738387A (en) Target snapshot method, device and storage medium
CN113259680B (en) Video stream decoding method, device, computer equipment and storage medium
CN115243098B (en) Screen recording method, device, computer equipment and storage medium
CN110996057B (en) Media data processing method and device, computer equipment and storage medium
CN117097982B (en) Target detection method and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: Room 221, 2nd floor, Block C, 18 Kechuang 11th Street, Daxing Economic and Technological Development Zone, Beijing, 100176

Applicant after: Jingdong Technology Holding Co.,Ltd.

Address before: Room 221, 2nd floor, Block C, 18 Kechuang 11th Street, Daxing Economic and Technological Development Zone, Beijing, 100176

Applicant before: JINGDONG DIGITAL TECHNOLOGY HOLDINGS Co.,Ltd.

GR01 Patent grant
GR01 Patent grant