CN110390234B - Image processing apparatus and method, and storage medium - Google Patents

Image processing apparatus and method, and storage medium

Info

Publication number
CN110390234B
Authority
CN
China
Prior art keywords
neural network
feature
current
face
image processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810366536.3A
Other languages
Chinese (zh)
Other versions
CN110390234A (en)
Inventor
黄耀海
李岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canon Inc
Original Assignee
Canon Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Canon Inc filed Critical Canon Inc
Priority to CN201810366536.3A priority Critical patent/CN110390234B/en
Publication of CN110390234A publication Critical patent/CN110390234A/en
Application granted granted Critical
Publication of CN110390234B publication Critical patent/CN110390234B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses an image processing apparatus and method, and a storage medium. The image processing apparatus includes: a unit that acquires more than one face image of a person, wherein the face images have different resolutions; a unit that merges first features respectively obtained from the face images, wherein the first features to be merged with each other have the same feature expression; and a unit that generates a second feature based on the combined features. According to the invention, clearer features can be obtained regardless of the resolutions of the images from which the features are to be acquired, in particular from low-resolution images.

Description

Image processing apparatus and method, and storage medium
Technical Field
The present invention relates to image processing, and more particularly to, for example, feature extraction processing and recognition processing.
Background
Currently, monitoring systems typically use facial attribute recognition and/or facial recognition techniques (e.g., face matching, face searching, etc.) to monitor a location. The monitoring effect of such a system generally depends on the accuracy of the facial attribute recognition and the facial recognition, which in turn depends on how much information useful for recognition the input image contains (e.g., how clear the facial features available from it are). In monitoring systems, particularly wide-area monitoring systems, the devices used to capture images (e.g., cameras, video cameras, etc.) typically capture a wide variety of images, such as images with different resolutions. Because a single image or a low-resolution image contains very limited information useful for recognition (for example, the facial features available from it are not very clear), the accuracy of facial attribute recognition and facial recognition may suffer, and the monitoring effect of the monitoring system may therefore be degraded.
To avoid the impact of single images or low-resolution images on facial attribute recognition and facial recognition, the input image is typically processed, prior to recognition, using techniques such as face restoration (face hallucination) to obtain a corresponding high-resolution image for the subsequent facial attribute recognition and facial recognition. An exemplary technique is disclosed in the non-patent document "Video Super-Resolution with Convolutional Neural Networks" (Armin Kappeler et al., IEEE Transactions on Computational Imaging, Vol. 2, No. 2, 2016). The method mainly comprises the following steps: obtaining a plurality of images (e.g., a plurality of low-resolution images) having different resolutions from a video; connecting these images (e.g., directly concatenating them) to obtain a multi-channel image (e.g., an RGB image, an HSV image, etc.); and then extracting features from the obtained multi-channel image for facial attribute recognition or facial recognition.
The above exemplary technique treats the multiple images equally when connecting them. That is, it connects the images directly, regardless of the resolution differences between them. In the subsequent feature extraction operation, it then extracts features of the same hierarchy from the connected images; in other words, images with different resolutions are all expressed with features of the same hierarchy. However, images with different resolutions require features of different hierarchies: a high-resolution image typically requires features of more hierarchies to express the information it contains, while a low-resolution image often requires features of fewer hierarchies. Thus, if images with different resolutions are all equally expressed with features of the same hierarchy, the resulting features will not actually share the same feature expression and will not express well the information actually contained in each image; the features ultimately extracted from them will therefore lack definition, and the accuracy of the final facial attribute recognition and facial recognition will be affected.
Disclosure of Invention
In view of the foregoing background, the present invention aims to solve at least one of the problems described above.
According to an aspect of the present invention, there is provided an image processing apparatus including: an acquisition unit that acquires more than one face image of one person, wherein the face images have different resolutions; a merging unit that merges first features respectively obtained from the face images, wherein the first features for mutual merging have the same feature expression; and a generation unit that generates a second feature based on the combined features.
According to another aspect of the present invention, there is provided an image processing method including: an acquisition step of acquiring more than one face image of one person, wherein the face images have different resolutions; a merging step of merging first features respectively obtained from the face images, wherein the first features for mutual merging have the same feature expression; and a generation step of generating a second feature based on the combined features.
According to yet another aspect of the present invention, there is provided a storage medium storing instructions that, when executed by a processor, enable the image processing method as described above to be performed.
With the present invention, the features to be merged (i.e., the first features), although obtained from images having different resolutions, have the same feature expression as each other, so that they can well express the information actually contained in the corresponding images. Therefore, according to the invention, clearer features can be obtained regardless of the resolutions of the images from which they are acquired, in particular from low-resolution images, so that the accuracy of the corresponding facial attribute recognition and facial recognition can be improved, and the monitoring effect of the monitoring system can be improved.
Other features and advantages of the present invention will become apparent from the following description of exemplary embodiments, which refers to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description of the embodiments, serve to explain the principles of the invention.
Fig. 1 is a block diagram schematically showing a hardware configuration in which a technique according to an embodiment of the present invention can be implemented.
Fig. 2 is a block diagram illustrating the construction of an image processing apparatus according to an embodiment of the present invention.
Fig. 3 schematically shows a flow chart of image processing according to an embodiment of the invention.
Fig. 4 schematically shows a flow chart of the merging step S320 as shown in fig. 3 according to an embodiment of the present invention.
Fig. 5 schematically shows an exemplary process in which the merging unit 220 as shown in fig. 2 utilizes a first convolutional neural network among the convolutional neural networks generated in advance according to the flowchart as shown in fig. 4.
Fig. 6 schematically shows a schematic structure of the network branches 1 to 3 as shown in fig. 5.
Fig. 7 schematically shows a flow chart of a generation method for generating a neural network that can be used in the present invention.
Fig. 8 schematically shows a flow chart of another generation method for generating a neural network that can be used in the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It should be noted that the following description is merely illustrative and exemplary in nature and is in no way intended to limit the invention, its application, or uses. The relative arrangement of the components and steps, numerical expressions and numerical values set forth in the examples do not limit the scope of the present invention unless it is specifically stated otherwise. In addition, techniques, methods, and apparatus known to those of skill in the art may not be discussed in detail, but are intended to be part of this specification where appropriate.
Note that like reference numerals and letters refer to like items in the drawings, and thus once an item is defined in one drawing, it is not necessary to discuss it in the following drawings.
On the one hand, in a monitoring location, the images in the video or image sequence obtained by the monitoring system will have different resolutions (i.e. have different spatiotemporal structures) due to differences/variations in the speed of movement of the person and the distance of the person from the device used to capture the images. On the other hand, in the case of expressing information contained in a certain image in a video or image sequence by hierarchical features, generally, the lower the resolution of the image, the less hierarchy will be required for its corresponding feature. Wherein the hierarchical features include, for example, low-level features, middle-level features, high-level features, and the like. The low-level features may be, for example, edge features, corner features, etc., the mid-level features may be, for example, component features, etc. (e.g., eye features, joint features, arm features, etc.), and the high-level features may be, for example, semantic layer features, etc. (e.g., age attributes, gender attributes, etc.).
In view of this, the inventors found that, when features for subsequent facial attribute recognition or facial recognition are obtained jointly from multiple images of a video or image sequence, images with different resolutions should be treated differently in order to obtain from each image a feature expression that sufficiently embodies the information it contains. This is mainly embodied in the following two aspects. On the one hand, when a respective feature (e.g., a respective hierarchical feature) is obtained from each image, the hierarchy of the obtained feature should be related (e.g., positively correlated) to the resolution of the image. In other words, images with different resolutions should be expressed with features of different hierarchies. On the other hand, since a common feature is obtained from multiple images for use in, for example, subsequent facial attribute recognition or facial recognition, there is an operation of merging features obtained from images having different resolutions; when these features are merged, the features that are merged with each other should have the same feature expression as each other. In other words, for features obtained from images having different resolutions, the features that are merged with each other should lie in the same or similar feature spaces. Based on this differential treatment, the features obtained from the multiple images can well express the information actually contained in each image, which facilitates the subsequent facial attribute recognition or facial recognition.
The present invention has been made in view of the above findings and will be described in detail below with reference to the accompanying drawings. According to the invention, clearer features can be obtained regardless of the resolutions of the images from which they are acquired, in particular from low-resolution images, so that the accuracy of the corresponding facial attribute recognition and facial recognition can be improved, and the monitoring effect of the monitoring system can be improved.
(hardware construction)
First, a hardware configuration that can implement the techniques described below will be described with reference to fig. 1.
The hardware configuration 100 includes, for example, a Central Processing Unit (CPU) 110, a Random Access Memory (RAM) 120, a Read Only Memory (ROM) 130, a hard disk 140, an input device 150, an output device 160, a network interface 170, and a system bus 180. Further, the hardware configuration 100 may be implemented by a device such as a camera, a video camera, a Personal Digital Assistant (PDA), a smart phone, a tablet, a notebook, a desktop computer, or other suitable electronic device.
In one implementation, the image processing according to the present invention is configured by hardware or firmware and serves as a module or component of the hardware configuration 100. For example, the image processing apparatus 200, which will be described in detail below with reference to fig. 2, is used as a module or component of the hardware configuration 100. In another implementation, the image processing according to the present invention is configured by software stored in the ROM 130 or the hard disk 140 and executed by the CPU 110. For example, the process 300 described in detail below with reference to fig. 3 and the processes 700 and 800 described in detail below with reference to fig. 7 to 8 are used as programs stored in the ROM 130 or the hard disk 140.
The CPU 110 is any suitable programmable control device (such as a processor), and can execute various functions to be described below by executing various application programs stored in the ROM 130 or the hard disk 140 (such as a memory). The RAM 120 is used to temporarily store programs or data loaded from the ROM 130 or the hard disk 140, and is also used as a space in which the CPU 110 performs various processes (such as implementing techniques that will be described in detail below with reference to fig. 3 to 8) and other available functions. The hard disk 140 stores various information such as an Operating System (OS), various applications, control programs, videos, images, pre-generated networks (e.g., neural networks), data or models for facial attribute recognition and/or facial recognition, pre-defined data (e.g., threshold values (THs)), and the like.
In one implementation, the input device 150 is used to allow a user to interact with the hardware configuration 100. In one example, a user may input image/video/data through the input device 150. In another example, a user may trigger a corresponding process of the present invention through the input device 150. Further, the input device 150 may take a variety of forms, such as buttons, a keyboard, or a touch screen. In another implementation, the input device 150 is used to receive images/video output from specialized electronic devices such as digital cameras, video cameras, and/or web cameras.
In one implementation, the output device 160 is used to display the processing results (such as the generated features, facial attribute recognition, and/or results of facial recognition) to the user. Also, the output device 160 may take various forms such as a Cathode Ray Tube (CRT) or a liquid crystal display.
The network interface 170 provides an interface for connecting the hardware configuration 100 to a network. For example, the hardware configuration 100 may be in data communication via the network interface 170 with other electronic devices connected to the network. Alternatively, a wireless interface may be provided for the hardware configuration 100 for wireless data communication. The system bus 180 may provide a data transmission path for transmitting data among the CPU 110, the RAM 120, the ROM 130, the hard disk 140, the input device 150, the output device 160, the network interface 170, and the like. Although referred to as a bus, the system bus 180 is not limited to any particular data transfer technique.
The hardware configuration 100 described above is merely illustrative and is in no way intended to limit the invention, its applications or uses. Also, only one hardware configuration is shown in fig. 1 for simplicity. However, a plurality of hardware configurations may be used as needed.
(image processing)
Next, image processing according to the present invention will be described with reference to fig. 2 to 6.
Fig. 2 is a block diagram illustrating the construction of an image processing apparatus 200 according to an embodiment of the present invention. Wherein some or all of the modules shown in fig. 2 may be implemented by dedicated hardware. As shown in fig. 2, the image processing apparatus 200 includes an acquisition unit 210, a combining unit 220, and a generating unit 230.
First, the input device 150 shown in fig. 1 receives a video or image sequence output from a specialized electronic device (e.g., a video camera, etc.) or input by a user. The input device 150 then transmits the received video or image sequence to the image processing apparatus 200 via the system bus 180.
Then, as shown in fig. 2, the acquisition unit 210 acquires more than one face image of one person from the received video or image sequence, wherein the face images have different resolutions. Hereinafter, this person will be referred to as the "target person". In one implementation, the acquisition unit 210 acquires the face images of the target person from the received video or image sequence, for example, as follows. First, the acquisition unit 210 acquires a plurality of images containing the target person from the received video or image sequence, wherein the acquired images have different resolutions. In one example, images that contain the target person and match predetermined resolutions are selected from the received video or image sequence according to those predetermined resolutions. Then, the acquisition unit 210 performs interpolation processing, such as resizing and alignment operations, on each of the acquired images to obtain a plurality of face images of the target person of a predetermined size.
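As a minimal sketch of this acquisition step (assuming OpenCV is available and the face regions have already been cropped from the frames; the target size and all names are illustrative and not specified in the patent):

```python
import cv2

# Illustrative "predetermined size"; the patent does not fix a concrete value.
TARGET_SIZE = (112, 112)

def acquire_face_images(face_crops):
    """Resize face crops of different resolutions to a common predetermined size.

    face_crops: list of numpy arrays (H, W, 3), each cropped around the target
    person's face from frames of the video or image sequence.
    """
    resized = []
    for crop in face_crops:
        # Bilinear interpolation stands in for the "interpolation processing"
        # (resizing/alignment) mentioned in the text.
        resized.append(cv2.resize(crop, TARGET_SIZE, interpolation=cv2.INTER_LINEAR))
    return resized
```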
Then, on the one hand, the merging unit 220 obtains first features, which are expressed in the form of, for example, a "feature map" (e.g., multi-channel matrix data), from each of the face images acquired by the acquisition unit 210, respectively. Alternatively, the first feature may be obtained without the merging unit 220 and with a special feature extraction unit (not shown). On the other hand, the merging unit 220 merges the first features. In other words, the merging unit 220 is configured to commonly obtain a certain feature from a plurality of images. As described above, the present invention treats differently for images having different resolutions when a certain feature is commonly obtained from a plurality of images. Therefore, for each of the face images acquired by the acquisition unit 210 with different resolutions, on the one hand, the features (i.e., the first features) obtained from each of the face images need to have different levels; on the other hand, the features to be combined with each other are required to have the same feature expression.
Thus, in one implementation, for each face image, the merging unit 220 may perform a multi-layer mapping operation on it to obtain features that meet the requirements of the present invention. On the one hand, for each face image, the number of layers (i.e., the number of times) of the mapping operation performed is, for example, positively correlated with the resolution of the face image; on the other hand, for two face images whose features (i.e., first features) are to be merged, the numbers of layers of the mapping operations performed also need to be such that the features to be merged have the same feature expression. Here, the feature expression is, for example, the response of a certain mapping operation.
Since deep-learning-based methods, such as neural networks (NNs), can learn feature expressions hierarchically, this property of neural networks can be used, for example, to treat images with different resolutions differently. Thus, in another implementation, a neural network that can be used to obtain features meeting the requirements of the present invention can be trained/generated in advance from a variety of preset resolutions and training samples corresponding to each resolution, and the trained neural network is stored in a corresponding storage device. Hereinafter, a method of generating a neural network usable in the present invention will be described in detail with reference to fig. 7 to 8. The storage device for storing the previously generated neural network is, for example, the storage device 250 shown in fig. 2. Further, the storage device 250 may also store, for example, data or models for facial attribute recognition and/or facial recognition. In one example, the storage device 250 is the ROM 130 or the hard disk 140 shown in fig. 1. In another example, the storage device 250 is a server or an external storage device connected to the image processing apparatus 200 via a network (not shown). Thus, in this implementation, on the one hand, the merging unit 220 retrieves the pre-generated neural network from the storage device 250. On the other hand, the merging unit 220 obtains the first features from each of the face images acquired by the acquisition unit 210 using the pre-generated neural network, and merges the obtained first features. In this case, the feature expression of the first feature obtained from any one of the face images is, for example, the response of a convolution operation.
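Purely as an illustration, if the pre-generated network were serialized as PyTorch weights (an assumption; the patent specifies neither a framework nor a file format, and the path below is hypothetical), retrieving it from the storage device 250 could be sketched as:

```python
import torch

# Hypothetical file name on the storage device 250; the actual storage location,
# serialization format, and framework are not specified in the patent.
pretrained_state = torch.load("storage_250/pre_generated_cnn.pth", map_location="cpu")

# The weights would then be restored into an instance of the pre-generated
# convolutional neural network (first, second, and third sub-networks), e.g.:
# model.load_state_dict(pretrained_state)
```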
Returning to fig. 2, in order to obtain, from the features obtained by the merging unit 220, a feature that satisfies the requirements of the subsequent target task (e.g., facial attribute recognition, facial recognition, etc.), the generating unit 230 generates a second feature based on the features obtained by the merging unit 220, wherein the second feature is represented, for example, in the form of a "feature map" (e.g., multi-channel matrix data). In the case where the merging unit 220 obtains the corresponding features using the pre-generated neural network, the generating unit 230 may also generate the second feature using that pre-generated neural network. In this case, the generating unit 230 generates the second feature based on the features obtained by the merging unit 220 using the pre-generated neural network acquired by the merging unit 220.
Finally, after the generating unit 230 generates the second feature, it transmits the generated second feature to the output device 160 via the system bus 180 shown in fig. 1 for display to the user. Further, as described above, according to the present invention, clearer features can be obtained from the face images regardless of their resolutions (e.g., low resolutions). Thus, if the feature obtained according to the present invention (i.e., the second feature generated by the generating unit 230) is used for, for example, facial attribute recognition and/or facial recognition in a monitoring system, the recognition accuracy of the facial attribute recognition and the facial recognition will be improved, and the monitoring effect of the monitoring system will be improved. Thus, as an alternative, the second feature generated by the generating unit 230 may not be displayed to the user but instead used for, for example, facial attribute recognition and/or facial recognition, in which case the image processing apparatus 200 shown in fig. 2 may further include the recognition unit 240.
In one aspect, the recognition unit 240 obtains a third feature from the second feature generated by the generation unit 230, where the third feature is represented, for example, in the form of a "feature vector". Alternatively, the third feature may be obtained not by the recognition unit 240 but by another dedicated feature extraction unit (not shown). Further, in the case where the merging unit 220 and the generating unit 230 perform the corresponding operations using the above-described pre-generated neural network, the recognition unit 240 may also obtain the third feature using that pre-generated neural network. In this case, the recognition unit 240 obtains the third feature from the second feature generated by the generation unit 230 using the pre-generated neural network acquired by the merging unit 220. In another aspect, the recognition unit 240 acquires, for example, data or a model for facial attribute recognition and/or facial recognition from the storage device 250, and performs facial attribute recognition and/or facial recognition on the target person based on the third feature and the acquired data/model. The facial attribute recognition is, for example, recognition of the age, gender, race, etc. of the target person. The facial recognition is, for example, face matching performed on the target person, searching a database for an image of the same person as the target person (i.e., face search), face tracking based on apparent features, and the like. In this case, after the recognition unit 240 performs the corresponding recognition operation, it transmits the recognition result (i.e., the facial attribute recognition and/or facial recognition result) to the output device 160 via the system bus 180 shown in fig. 1 for display of the corresponding recognition result to the user.
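As an illustration only, a minimal sketch of how the recognition unit 240 might use the third feature (a feature vector) for face search; the cosine-similarity measure, the threshold value, and all names are assumptions rather than details given in the patent:

```python
import numpy as np

def face_search(third_feature, gallery, threshold=0.5):
    """Return the gallery identity whose stored feature vector is most similar
    to the third feature; `threshold` is an assumed acceptance value."""
    best_id, best_score = None, -1.0
    q = third_feature / (np.linalg.norm(third_feature) + 1e-12)
    for person_id, feat in gallery.items():
        g = feat / (np.linalg.norm(feat) + 1e-12)
        score = float(np.dot(q, g))          # cosine similarity between feature vectors
        if score > best_score:
            best_id, best_score = person_id, score
    return (best_id, best_score) if best_score >= threshold else (None, best_score)
```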
The flowchart 300 shown in fig. 3 is a corresponding procedure of the image processing apparatus 200 shown in fig. 2. Hereinafter, description will be given taking an example in which the merging unit 220 and the generating unit 230 perform corresponding operations using a neural network generated in advance. Hereinafter, a description will be given also taking, as an example, a previously generated Convolutional Neural Network (CNN) which can also be generated with reference to the generation methods shown in fig. 7 to 8. However, it is obviously not necessarily limited thereto.
As shown in fig. 3, in the acquiring step S310, the acquiring unit 210 acquires more than one face image (for example, N face images, where N is a natural number) of the target person from the received video or image sequence, where the face images have different resolutions.
In the merging step S320, on the one hand, the merging unit 220 acquires the pre-generated convolutional neural network from the storage device 250, wherein the pre-generated convolutional neural network includes, for example, at least a first convolutional neural network and a second convolutional neural network. On the other hand, the merging unit 220 obtains a first feature from each face image using the pre-generated convolutional neural network (in particular, using the first convolutional neural network among the pre-generated convolutional neural networks) and merges the obtained first features. As described above, the present invention treats images having different resolutions differently; thus, in one implementation, the merging unit 220 obtains the merged feature as described with reference to fig. 4 to 6. Hereinafter, description will be given taking as an example that the acquisition unit 210 acquires 3 face images of the target person (i.e., N=3) by default. However, it is obviously not necessarily limited thereto.
As shown in fig. 4, in step S321, for each of the 3 face images, the merging unit 220 determines, from the first convolutional neural network, a network branch for obtaining the corresponding first feature from that face image, based on the resolution of the face image. In the present invention, the number of network branches in the first convolutional neural network is by default the same as the number of face images acquirable by the acquisition unit 210. That is, in the case where the acquisition unit 210 can acquire 3 face images by default, the number of network branches of the first convolutional neural network is 3. In one example, the merging unit 220 determines the corresponding network branch by comparing the resolution of the face image with predefined thresholds. For example, as shown in fig. 5, assume that the 3 face images acquired by the acquisition unit 210 are the face images 510 to 530, and that the network branches included in the first convolutional neural network are the network branches 1 to 3; in the case where the resolution of a face image is greater than a first threshold (e.g., TH1), the first feature will be obtained from that face image using the network branch 1; in the case where the resolution of a face image is smaller than a second threshold (e.g., TH2), the first feature will be obtained from that face image using the network branch 3; and in the case where the resolution of a face image is between TH1 and TH2 (where, for example, TH1 is larger than TH2), the first feature will be obtained from that face image using the network branch 2. In other words, the network branch 1 processes, for example, face images with high resolution, the network branch 2 processes, for example, face images with moderate resolution, and the network branch 3 processes, for example, face images with low resolution. For example, as shown in fig. 5, assume that based on the above comparison, the merging unit 220 determines to obtain the corresponding first feature from the face image 510 using the network branch 1, from the face image 520 using the network branch 2, and from the face image 530 using the network branch 3.
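The branch-selection rule of step S321 can be sketched as follows; the concrete threshold values and the use of a single scalar as the resolution measure are assumptions, since the patent does not give concrete numbers:

```python
# Placeholder thresholds on the face-image resolution (e.g., face height in pixels),
# with TH1 > TH2 as stated in the text; the patent does not specify concrete values.
TH1 = 96
TH2 = 48

def select_branch(face_resolution):
    """Map a face image's resolution to one of the three network branches of fig. 5."""
    if face_resolution > TH1:
        return 1   # network branch 1: high-resolution face images
    if face_resolution < TH2:
        return 3   # network branch 3: low-resolution face images
    return 2       # network branch 2: moderate-resolution face images
```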
Further, as described above, in the case of expressing the information contained in an image by hierarchical features, in general, the higher the resolution of the image, the higher the hierarchy required for its corresponding features. Thus, in the present invention, for a network branch in the first convolutional neural network, the higher the resolution of the face images from which it obtains features, the higher the number of layers of that network branch. In other words, in the present invention, for one face image, the resolution of the face image is positively correlated with the number of layers of the network branch from which its corresponding first feature is obtained. For example, fig. 6 schematically shows a schematic structure of the network branches 1 to 3 shown in fig. 5. As shown in fig. 6, since the network branch 1 processes a high-resolution face image (for example, the face image 510), in order for the obtained first feature to sufficiently embody the information contained in the corresponding face image, the network branch 1 has, for example, 3 convolution layers having a scaling/mapping function (for example, the convolution layers 611 to 613). Since the network branch 2 processes a moderate-resolution face image (e.g., the face image 520), the network branch 2 has, for example, 2 convolution layers with a scaling/mapping function (e.g., the convolution layers 621 to 622). Since the network branch 3 processes a low-resolution face image (e.g., the face image 530), the network branch 3 has, for example, 1 convolution layer with a scaling/mapping function (e.g., the convolution layer 631). Further, as described above, in order that the first features merged with each other can have the same feature expression, the network branches 1 to 3 each also need to have, for example, 1 convolution layer with a scaling/mapping function (for example, the convolution layer 614, the convolution layer 623, and the convolution layer 632, respectively). For example, as shown in fig. 6, the first feature 1 obtained via the network branch 1 and the first feature 2 obtained via the network branch 2 will then have the same feature expression. The structure of the network branches as shown in fig. 6 is merely illustrative and obviously not necessarily limited thereto.
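The branch structure of fig. 6 could be sketched as follows in PyTorch (the framework, channel counts, kernel sizes, and activation functions are assumptions of this illustration); the final convolution of each branch plays the role of the layers 614, 623, and 632 that map into the shared feature expression:

```python
import torch.nn as nn

def make_branch(num_scaling_layers, in_ch=3, mid_ch=32, shared_ch=64):
    """One network branch: `num_scaling_layers` scaling/mapping convolutions followed
    by one convolution that projects into the shared feature expression."""
    layers, ch = [], in_ch
    for _ in range(num_scaling_layers):
        layers += [nn.Conv2d(ch, mid_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
        ch = mid_ch
    # Alignment layer (convolution layers 614/623/632 in fig. 6): the same output
    # channel count for every branch, so that the first features can be merged.
    layers.append(nn.Conv2d(ch, shared_ch, kernel_size=3, padding=1))
    return nn.Sequential(*layers)

branch1 = make_branch(3)   # high resolution: conv layers 611-613 plus 614
branch2 = make_branch(2)   # moderate resolution: conv layers 621-622 plus 623
branch3 = make_branch(1)   # low resolution: conv layer 631 plus 632
```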
Returning to fig. 4, after determining the corresponding network branch for each face image, in step S322, the merging unit 220 extracts the first feature from each face image using the network branch corresponding to it. For example, as shown in fig. 5, using the network branches 1 to 3, the features extracted from the face images 510 to 530 are the first feature 1, the first feature 2, and the first feature 3, respectively.
Then, in step S323, the merging unit 220 merges the extracted first features to obtain the "merged feature". In the present invention, in the case where the first convolutional neural network has more than two network branches, for any network branch after the second one, the merging unit 220 merges the "intermediate merged feature", obtained by merging the first features obtained by the preceding network branches, with the first feature obtained by that network branch. For example, as shown in fig. 5, first, the merging unit 220 merges the first feature 1 obtained via the network branch 1 and the first feature 2 obtained via the network branch 2 by the merge operator 540, thereby obtaining the corresponding "intermediate merged feature". Then, for the first feature 3 obtained via the network branch 3, where the convolution layer 632 in the network branch 3 shown in fig. 6 may cause the first feature 3 to have the same feature expression as the above "intermediate merged feature", the merging unit 220 merges the "intermediate merged feature" and the first feature 3 by the merge operator 550, thereby obtaining the final "merged feature". In one example, the merge operators 540 and 550 are pixel-matrix sum or pixel-matrix concatenation operations. For example, if the pixel matrix of the first feature 1 is A and the pixel matrix of the first feature 2 is B, the pixel matrix of the "intermediate merged feature" obtained by the merge operator 540 through pixel-matrix summation is the element-wise sum A + B.
however, it is obviously not necessarily limited thereto.
Returning to fig. 3, after the merging unit 220 obtains the merged feature, in a generating step S330, the generating unit 230 generates a second feature based on the merged feature using the previously generated convolutional neural network (in particular, using a second convolutional neural network of the previously generated convolutional neural networks) obtained by the merging unit 220. Wherein, in order to enable the generated second feature to meet the requirements of the subsequent target task (e.g. facial attribute recognition, facial recognition, etc.), the second convolutional neural network may also have multiple convolutional layers with scaling/mapping functions, wherein the number of layers of the convolutional layers is related to the actual application scenario or determined based on experience.
As described above, after the second feature is generated, it may be further used for, for example, facial attribute recognition and/or facial recognition. In this case, the convolution neural network generated in advance acquired by the merging unit 220 further includes, for example, a third convolution neural network. Thus, as shown in fig. 3, in the identifying step S340, on the one hand, the identifying unit 240 obtains a third feature from the generated second features using the third convolutional neural network. On the other hand, the recognition unit 240 acquires data or a model, for example, for face attribute recognition and/or face recognition from the storage device 250, and performs face attribute recognition and/or face recognition on the above-described target person based on the third feature and the acquired data/model.
Finally, after the recognition unit 240 performs the corresponding recognition operation, the recognition unit 240 transmits the recognition result (i.e., the facial attribute recognition and/or the facial recognition result) to the output device 160 via the system bus 180 shown in fig. 1 for displaying the corresponding recognition result to the user. It is apparent that in the case where the flowchart 300 shown in fig. 3 does not include the recognition operation, after the generation unit 230 generates the second feature, the generation unit 230 may transmit the generated second feature to the output device 160 via the system bus 180 shown in fig. 1 for displaying the generated second feature to the user.
As described above, according to the present invention, clearer features can be obtained from the face images regardless of their resolutions, in particular from low-resolution face images, so that the recognition accuracy of the corresponding facial attribute recognition and facial recognition can be improved, and the monitoring effect of the monitoring system can be improved.
(Generation of neural networks)
The neural network usable in the present invention may be generated in advance, based on a preset initial neural network and a plurality of training samples having different resolutions, using the generation methods described with reference to fig. 7 to 8. The generation methods described with reference to fig. 7 and 8 may also be performed by the hardware configuration 100 shown in fig. 1. In the present invention, the first, second, and third neural networks usable in the present invention, that is, the first, second, and third convolutional neural networks among the previously generated convolutional neural networks described above, are updated in common.
In one implementation, to reduce the time required to generate the neural networks, the first, second, and third neural networks are updated in common by a reverse transfer (back-propagation) approach. Fig. 7 schematically illustrates a flow chart 700 of a generation method for generating a neural network that may be used with the present invention.
As shown in fig. 7, first, the CPU 110 shown in fig. 1 acquires, through the input device 150, a preset initial neural network and a plurality of training samples having different resolutions, wherein each training sample is labeled with a desired feature or ground-truth (GT) feature.
Then, in step S710, on the one hand, the CPU 110 passes the training samples through the current neural network (e.g., the initial neural network) to obtain the corresponding features. That is, the CPU 110 passes the training samples through the current first neural network, the current second neural network, and the current third neural network (e.g., the initial first neural network, the initial second neural network, and the initial third neural network) to obtain, for example, the third features. On the other hand, the CPU 110 determines a loss (e.g., Loss1) between the obtained feature (e.g., the third feature) and the sample feature, where the sample feature may be obtained from the desired or GT feature labeled in the training sample. Loss1 represents the error between the predicted feature value obtained with the current neural network and the sample feature value, where the error can be measured, for example, by a distance. For example, Loss1 can be obtained by the following formula (1):

Loss1 = − Σ_{j=1}^{C} y_j · log(p_j)    (1)

where j indexes the categories to which the face sample in the training sample may belong and C denotes the maximum number of categories, different categories corresponding to faces with different IDs; y_j denotes the real label of the face sample on category j; and p_j denotes the predicted feature value of the face sample on category j.
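Under the reading of formula (1) given above as a cross-entropy over the C face-ID categories, a minimal numpy sketch; the one-hot encoding of the real label and the function name are illustrative choices:

```python
import numpy as np

def loss1_cross_entropy(y_true, p_pred, eps=1e-12):
    """Formula (1): cross-entropy between the one-hot real label y (length C, one entry
    per face-ID category) and the predicted values p over the same C categories."""
    y = np.asarray(y_true, dtype=np.float64)
    p = np.asarray(p_pred, dtype=np.float64)
    return float(-np.sum(y * np.log(p + eps)))   # eps avoids log(0)
```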
Then, in step S720, the CPU 110 determines, based on Loss1, whether the current neural network satisfies a predetermined condition. For example, Loss1 is compared with a threshold (e.g., TH3); in the case where Loss1 is less than or equal to TH3, the current neural network is judged to satisfy the predetermined condition and is output as the final neural network, for example, to the storage device 250 shown in fig. 2 for the image processing described with reference to fig. 2 to 6. In the case where Loss1 is greater than TH3, the current neural network is judged not to satisfy the predetermined condition yet, and the generation process proceeds to step S730.
In step S730, the CPU 110 updates the parameters of each layer in the current second neural network and the current third neural network based on Loss1, where the parameters of each layer are, for example, the weight values in the convolution layers having a scaling/mapping function in the current second neural network and the current third neural network. In one example, the parameters of each layer are updated based on Loss1 using, for example, a stochastic gradient descent method.
In step S740, the CPU 110 updates the parameters of each layer in each network branch in the current first neural network based on Loss1, where the parameters of each layer are, for example, the weight values in the convolution layers having a scaling/mapping function in each network branch. In one example, for any one network branch, Loss1 is assigned to it, and the parameters of the layers in that network branch are then updated based on Loss1 using a stochastic gradient descent method. After that, the generation process proceeds again to step S710.
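One joint update iteration of steps S710, S730, and S740 might be sketched as follows in PyTorch; the optimizer construction, the module names, and the assumption that the third network ends in a C-way classifier are illustrative choices not specified by the patent:

```python
import torch
import torch.nn.functional as F

def joint_update_step(branches, second_net, third_net, optimizer, face_batches, id_labels):
    """One iteration of steps S710, S730, and S740: forward pass, Loss1, reverse
    transfer (back-propagation), and a stochastic-gradient-descent parameter update."""
    # Each face image goes through its resolution-matched branch of the first network.
    first_feats = [branch(x) for branch, x in zip(branches, face_batches)]
    merged = first_feats[0] + first_feats[1] + first_feats[2]   # pixel-matrix sum, fig. 5
    # The third network is assumed here to flatten the feature map and end in a
    # C-way classifier so that Loss1 (formula (1)) can be computed on its output.
    logits = third_net(second_net(merged))

    loss1 = F.cross_entropy(logits, id_labels)
    optimizer.zero_grad()
    loss1.backward()    # reverse transfer of the loss through all three networks
    optimizer.step()    # update the parameters of each layer
    return loss1.item()

# The optimizer would be built once over the parameters of all three networks, e.g.:
# params = [p for b in branches for p in b.parameters()]
# params += list(second_net.parameters()) + list(third_net.parameters())
# optimizer = torch.optim.SGD(params, lr=0.01)
```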
In the flow 700 shown in fig. 7, whether Loss1 satisfies a predetermined condition is taken as a condition for stopping updating the current neural network. However, it is obviously not necessarily limited thereto. As an alternative, for example, step S720 may be omitted, and the corresponding update operation may be stopped after the number of updates to the current neural network reaches a predetermined number.
In another implementation, in order to make the features obtained with the generated neural network more accurate, the first neural network and the second neural network are first updated in common by a reverse transfer method, and the third neural network is then updated by a reverse transfer method together with the updated first and second neural networks. Fig. 8 schematically illustrates a flow chart 800 of another generation method for generating a neural network that may be used with the present invention.
As shown in fig. 8, first, the CPU 110 shown in fig. 1 acquires, through the input device 150, a preset initial neural network and a plurality of training samples having different resolutions, wherein each training sample is labeled with desired features and data or ground-truth (GT) features and data.
Then, in step S810, on the one hand, the CPU 110 passes the training samples through the current first neural network and the current second neural network (e.g., the initial first neural network and the initial second neural network) to obtain the corresponding features (e.g., the second features). On the other hand, the CPU 110 obtains an image from the obtained features via, for example, a recurrent neural network.
In step S820, the CPU 110 determines a loss (e.g., Loss2) between the obtained image and the GT data. Loss2 represents the error between the GT data and the predicted image obtained with the current first neural network and the current second neural network, where the error can be measured, for example, by a distance. For example, Loss2 can be obtained by the following formula (2):

Loss2 = (1/N) · Σ_{i=1}^{N} (y_i − p_i)²    (2)

where N denotes the number of pixels in the predicted image, y_i denotes the normalized pixel value of the i-th pixel in the GT data, and p_i denotes the normalized pixel value of the i-th pixel in the predicted image.
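Under the reading of formula (2) given above as a mean pixel-wise squared error, a minimal numpy sketch; the array handling and function name are illustrative choices:

```python
import numpy as np

def loss2_pixel_mse(gt_pixels, pred_pixels):
    """Formula (2): mean squared error over the N normalized pixel values of the
    GT data (y_i) and of the predicted image (p_i)."""
    y = np.asarray(gt_pixels, dtype=np.float64).ravel()
    p = np.asarray(pred_pixels, dtype=np.float64).ravel()
    return float(np.sum((y - p) ** 2) / y.size)
```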
Then, in step S830, the CPU 110 determines, based on Loss2, whether the current first neural network and the current second neural network satisfy a predetermined condition. For example, Loss2 is compared with a threshold (e.g., TH4); in the case where Loss2 is less than or equal to TH4, the current first neural network and the current second neural network are judged to satisfy the predetermined condition and are taken as the final first neural network and the final second neural network, and the generation process proceeds to step S860. In the case where Loss2 is greater than TH4, the current first neural network and the current second neural network are judged not to satisfy the predetermined condition yet, and the generation process proceeds to step S840.
In step S840, the CPU 110 updates the parameters of each layer in the current second neural network based on Loss2, where the parameters of each layer are, for example, the weight values in the convolution layers having a scaling/mapping function in the current second neural network. In one example, the parameters of each layer are updated based on Loss2 using, for example, a stochastic gradient descent method.
In step S850, the CPU 110 updates parameters of each layer in each network branch in the current first neural network based on the Loss2, where the parameters of each layer are, for example, weight values in convolution layers each having a scaling/mapping function in each network branch. Since the operation of step S850 is the same as the operation of step S740 shown in fig. 7, a detailed description thereof will be omitted. After that, the generation process advances again to step S810.
In steps S810 to S850 shown in fig. 8, whether Loss2 satisfies a predetermined condition is taken as a condition for stopping updating the current first neural network and the current second neural network. However, it is obviously not necessarily limited thereto. As an alternative, step S830 may be omitted, for example, and the corresponding updating operation may be stopped after the number of updates to the current first neural network and the current second neural network reaches a predetermined number.
In step S860, the CPU 110 passes the training samples through the updated first neural network and the updated second neural network (i.e., the final first neural network and the final second neural network) to obtain the corresponding features (e.g., the second features).
In step S870, on the one hand, the CPU 110 passes the obtained features (e.g., the second features) through the current third neural network (e.g., the initial third neural network) to obtain the corresponding features (e.g., the third features). On the other hand, the CPU 110 determines a loss (e.g., Loss3) between the obtained feature (e.g., the third feature) and the sample feature, where the sample feature may be obtained from the desired or GT feature labeled in the training sample. Loss3 represents the error between the predicted feature value obtained with the current third neural network and the sample feature value, where the error can be measured, for example, by a distance. For example, Loss3 can also be obtained by the above formula (1).
Then, in step S880, the CPU 110 determines, based on Loss3, whether the current third neural network satisfies a predetermined condition. For example, Loss3 is compared with a threshold (e.g., TH5); in the case where Loss3 is less than or equal to TH5, the current third neural network is judged to satisfy the predetermined condition and is taken as the final third neural network, which is output together with the final first neural network and the final second neural network; these final neural networks (i.e., the final first, second, and third neural networks) are output, for example, to the storage device 250 shown in fig. 2 for the image processing described with reference to fig. 2 to 6. In the case where Loss3 is greater than TH5, the current third neural network is judged not to satisfy the predetermined condition yet, and the generation process proceeds to step S890.
In step S890, the CPU 110 updates the parameters of each layer in the current third neural network based on Loss3, where the parameters are, for example, the weight values in the convolution layers having a scaling/mapping function in the current third neural network. In one example, the parameters of each layer are updated based on Loss3 using, for example, a stochastic gradient descent method. After that, the generation process proceeds again to step S870.
In steps S870 to S890 shown in fig. 8, whether Loss3 satisfies a predetermined condition is taken as a condition for stopping updating the current third neural network. However, it is obviously not necessarily limited thereto. As an alternative, for example, step S880 may be omitted, and the corresponding update operation may be stopped after the number of updates to the current third neural network reaches a predetermined number.
All of the elements described above are exemplary and/or preferred modules for implementing the processes described in this disclosure. These units may be hardware units (such as Field Programmable Gate Arrays (FPGAs), digital signal processors, application specific integrated circuits, etc.) and/or software modules (such as computer readable programs). The units for implementing the steps are not described in detail above. However, where there are steps to perform a particular process, there may be corresponding functional modules or units (implemented by hardware and/or software) for implementing that same process. The technical solutions by means of all combinations of the described steps and the units corresponding to these steps are included in the disclosure of the application as long as they constitute a complete, applicable technical solution.
The method and apparatus of the present invention can be implemented in a variety of ways. For example, the methods and apparatus of the present invention may be implemented by software, hardware, firmware, or any combination thereof. The above-described sequence of steps of the method is intended to be illustrative only, and the steps of the method of the present invention are not limited to the order specifically described above, unless specifically stated otherwise. Furthermore, in some embodiments, the present invention may also be implemented as a program recorded in a recording medium including machine-readable instructions for implementing the method according to the present invention. Therefore, the present invention also covers a recording medium storing a program for implementing the method according to the present invention.
While certain specific embodiments of the present invention have been illustrated in detail by way of example, it should be understood by those skilled in the art that the foregoing examples are intended to be illustrative only and are not limiting of the scope of the invention. It will be appreciated by those skilled in the art that modifications may be made to the embodiments described above without departing from the scope and spirit of the invention. The scope of the invention is to be limited by the following claims.

Claims (16)

1. An image processing apparatus, the image processing apparatus comprising:
an acquisition unit that acquires more than one face image of one person, wherein the face images have different resolutions;
a merging unit that merges first features respectively obtained from the face images, wherein the first features for mutual merging have the same feature expression, wherein the merging unit obtains the first features from the face images using a first neural network; and
a generation unit that generates a second feature based on the combined features, wherein the generation unit generates the second feature using a second neural network,
wherein, for any one face image, the resolution of the face image has a positive correlation with the layer number of the network branches in the first neural network, which are used for obtaining the first feature corresponding to the face image from the face image.
2. The image processing apparatus according to claim 1, wherein, for any one face image, a network branch for obtaining a corresponding first feature from the face image is determined from the first neural network based on a resolution of the face image.
3. The image processing apparatus according to claim 1, the image processing apparatus further comprising:
an identification unit that performs face attribute identification and/or face identification on the person based on a third feature obtained from the second feature.
4. The image processing apparatus according to claim 3, wherein the identification unit obtains the third feature from the second feature using a third neural network.
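
Continuing the same sketch for claims 3 and 4, the third network can be read as mapping the second feature to a third feature on which face attribute recognition and/or face recognition is then performed. The trunk structure, identity count and attribute count below are placeholders introduced only for the sketch.

```python
class ThirdNetwork(nn.Module):
    """Third neural network: third feature plus recognition heads (sketch for claims 3-4)."""
    def __init__(self, feat_dim=128, n_identities=1000, n_attributes=5):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU())
        self.id_head = nn.Linear(feat_dim, n_identities)    # face recognition logits
        self.attr_head = nn.Linear(feat_dim, n_attributes)  # face attribute logits

    def forward(self, second_feature):
        third_feature = self.trunk(second_feature)  # third feature obtained from the second feature
        return third_feature, self.id_head(third_feature), self.attr_head(third_feature)
```
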
5. The image processing apparatus according to claim 4, wherein the first, second, and third neural networks are jointly updated.
6. The image processing apparatus according to claim 5, wherein the current first neural network, the current second neural network, and the current third neural network are jointly updated by back-propagation.
7. The image processing apparatus according to claim 6, wherein the current first neural network, the current second neural network, and the current third neural network are jointly updated at least once by:
determining a loss between a sample feature and a feature obtained via the current first neural network, the current second neural network, and the current third neural network;
updating parameters of each layer in the current second neural network and the current third neural network based on the loss; and
updating parameters of each layer in each network branch of the current first neural network based on the loss.
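
Continuing the sketch, one plausible reading of the joint update of claims 5-7 is ordinary end-to-end back-propagation: a loss is computed between a sample feature and the feature produced by the three current networks, and a single optimizer then updates every layer of the second and third networks and of every branch of the first network. The MSE loss and SGD optimizer are illustrative choices, not requirements of the claims.

```python
import torch.nn.functional as F

first, second, third = FirstNetwork(), SecondNetwork(), ThirdNetwork()
params = list(first.parameters()) + list(second.parameters()) + list(third.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)

def joint_update_step(faces, sample_feature):
    """One joint update of the current first, second and third networks."""
    first_features = [first(face) for face in faces]   # one first feature per resolution
    second_feature = second(first_features)            # merge, then second feature
    third_feature, _, _ = third(second_feature)        # feature compared with the sample feature
    loss = F.mse_loss(third_feature, sample_feature)
    optimizer.zero_grad()
    loss.backward()        # back-propagation through all three networks
    optimizer.step()       # updates every branch of the first network and both later networks
    return loss.item()
```
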
8. The image processing apparatus according to claim 5, wherein:
the current first neural network and the current second neural network are jointly updated by back-propagation; and
the current third neural network is updated by back-propagation using the updated first and second neural networks.
9. The image processing apparatus according to claim 8, wherein the current first neural network and the current second neural network are jointly updated at least once by:
determining a loss between ground-truth data and an image, wherein the image is obtained based on features obtained via the current first neural network and the current second neural network;
updating parameters of each layer in the current second neural network based on the loss; and
updating parameters of each layer in each network branch of the current first neural network based on the loss.
10. The image processing apparatus according to claim 8, wherein the current third neural network is updated at least once by:
determining a loss between a sample feature and a feature obtained by passing a pre-obtained feature through the current third neural network, wherein the pre-obtained feature is obtained through the updated first and second neural networks; and
updating parameters of each layer in the current third neural network based on the loss.
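
Claims 8-10 describe an alternative two-stage schedule, sketched below under the same assumptions as the earlier code: first, the first and second networks are jointly updated with a loss between ground-truth data and an image obtained from their features (the small decoder used to produce that image is an assumption added purely so the sketch has something to compare against ground truth); then, with those two networks held fixed, the third network is updated with a loss between its output feature and a sample feature.

```python
# Stage 1 (claims 8-9): jointly update the first and second networks with an
# image-reconstruction loss. The decoder below is an assumption of the sketch.
decoder = nn.Sequential(nn.Linear(128, 3 * 32 * 32), nn.Unflatten(1, (3, 32, 32)))
stage1_opt = torch.optim.SGD(
    list(first.parameters()) + list(second.parameters()) + list(decoder.parameters()), lr=0.01)

def stage1_step(faces, truth_image):
    second_feature = second([first(face) for face in faces])
    image = decoder(second_feature)            # image obtained based on the features
    loss = F.mse_loss(image, truth_image)      # loss against ground-truth data
    stage1_opt.zero_grad(); loss.backward(); stage1_opt.step()
    return loss.item()

# Stage 2 (claim 10): update only the third network, feeding it features produced
# by the already-updated (and here frozen) first and second networks.
stage2_opt = torch.optim.SGD(third.parameters(), lr=0.01)

def stage2_step(faces, sample_feature):
    with torch.no_grad():                      # first and second networks stay fixed
        pre_obtained = second([first(face) for face in faces])
    third_feature, _, _ = third(pre_obtained)
    loss = F.mse_loss(third_feature, sample_feature)
    stage2_opt.zero_grad(); loss.backward(); stage2_opt.step()
    return loss.item()
```
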
11. An image processing method, the image processing method comprising:
an acquisition step of acquiring more than one face image of one person, wherein the face images have different resolutions;
a merging step of merging first features respectively obtained from the face images, wherein the first features to be merged with one another have the same feature expression, and wherein, in the merging step, the first features are obtained from the face images using a first neural network; and
a generation step of generating a second feature based on the merged features, wherein, in the generation step, the second feature is generated using a second neural network,
wherein, for any face image, the resolution of that face image is positively correlated with the number of layers of the network branch in the first neural network that is used to obtain, from that face image, the first feature corresponding to it.
12. The image processing method according to claim 11, wherein, for any face image, the network branch used to obtain the corresponding first feature from that face image is selected from the first neural network based on the resolution of that face image.
13. The image processing method according to claim 11, the image processing method further comprising:
a recognition step of performing face attribute recognition and/or face recognition on the person based on a third feature obtained from the second feature.
14. The image processing method according to claim 13, wherein, in the recognition step, the third feature is obtained from the second feature using a third neural network.
15. The image processing method according to claim 14, wherein the first, second, and third neural networks are jointly updated.
16. A storage medium storing instructions which, when executed by a processor, cause performance of the image processing method according to any one of claims 11-15.
CN201810366536.3A 2018-04-23 2018-04-23 Image processing apparatus and method, and storage medium Active CN110390234B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810366536.3A CN110390234B (en) 2018-04-23 2018-04-23 Image processing apparatus and method, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810366536.3A CN110390234B (en) 2018-04-23 2018-04-23 Image processing apparatus and method, and storage medium

Publications (2)

Publication Number Publication Date
CN110390234A (en) 2019-10-29
CN110390234B (en) 2023-10-13

Family

ID=68284487

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810366536.3A Active CN110390234B (en) 2018-04-23 2018-04-23 Image processing apparatus and method, and storage medium

Country Status (1)

Country Link
CN (1) CN110390234B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112070181B (en) * 2020-11-16 2021-02-19 深圳市华汉伟业科技有限公司 Image stream-based cooperative detection method and device and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107633223A (en) * 2017-09-15 2018-01-26 深圳市唯特视科技有限公司 A kind of video human attribute recognition approach based on deep layer confrontation network
CN107767343A (en) * 2017-11-09 2018-03-06 京东方科技集团股份有限公司 Image processing method, processing unit and processing equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101385599B1 (en) * 2012-09-26 2014-04-16 한국과학기술연구원 Method and apparatus for interfering montage
JP2016164709A (en) * 2015-03-06 2016-09-08 キヤノン株式会社 Image processing device, imaging apparatus, and image processing program

Also Published As

Publication number Publication date
CN110390234A (en) 2019-10-29

Similar Documents

Publication Publication Date Title
US11645506B2 (en) Neural network for skeletons from input images
CN108205655B (en) Key point prediction method and device, electronic equipment and storage medium
JP7265034B2 (en) Method and apparatus for human body detection
US20200279124A1 (en) Detection Apparatus and Method and Image Processing Apparatus and System
CN112767329B (en) Image processing method and device and electronic equipment
CN110869938A (en) Personnel identification system and method
Zhou et al. Semi-supervised salient object detection using a linear feedback control system model
US20200012887A1 (en) Attribute recognition apparatus and method, and storage medium
US20220058821A1 (en) Medical image processing method, apparatus, and device, medium, and endoscope
CN111783620A (en) Expression recognition method, device, equipment and storage medium
US20230080098A1 (en) Object recognition using spatial and timing information of object images at diferent times
Durga et al. A ResNet deep learning based facial recognition design for future multimedia applications
US20200242345A1 (en) Detection apparatus and method, and image processing apparatus and system
CN112784765B (en) Method, apparatus, device and storage medium for recognizing motion
EP3945456B1 (en) Video event recognition method and apparatus, electronic device and storage medium
WO2020092276A1 (en) Video recognition using multiple modalities
CN111091010A (en) Similarity determination method, similarity determination device, network training device, network searching device and storage medium
Yu Emotion monitoring for preschool children based on face recognition and emotion recognition algorithms
CN113780145A (en) Sperm morphology detection method, sperm morphology detection device, computer equipment and storage medium
Venkatesvara Rao et al. Real-time video object detection and classification using hybrid texture feature extraction
Qin et al. Depth estimation by parameter transfer with a lightweight model for single still images
Uddin et al. An indoor human activity recognition system for smart home using local binary pattern features with hidden markov models
CN110390234B (en) Image processing apparatus and method, and storage medium
Wang et al. Cross-task feature alignment for seeing pedestrians in the dark
CN116453222B (en) Target object posture determining method, training device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant