CN113569598A - Image processing method and image processing apparatus

Info

Publication number
CN113569598A
CN113569598A
Authority
CN
China
Prior art keywords
face
image
feature
face image
neural network
Prior art date
Legal status
Pending
Application number
CN202010355012.1A
Other languages
Chinese (zh)
Inventor
费扬
陈凯
龚文洪
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202010355012.1A
Priority to PCT/CN2021/071017 (WO2021218238A1)
Publication of CN113569598A
Pending legal status

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods

Abstract

The embodiments of the application relate to computer image processing technology in the field of artificial intelligence, and disclose an image processing method that is applied to naked-eye recognition of a face image or to recognition by a face recognition system, and that improves the success rate of face recognition by reconstructing an unobstructed frontal face image. The method includes the following steps: an image processing apparatus acquires the face features of a face image and the position information of key points; and the image processing apparatus obtains, through a first neural network model and according to the face features and the position information of the key points, an unobstructed frontal face image corresponding to the face image.

Description

Image processing method and image processing apparatus
Technical Field
The present application relates to computer image processing technologies in the field of artificial intelligence, and in particular, to an image processing method and an image processing apparatus.
Background
Face retrieval is widely applied in scenarios such as security monitoring and access gates. In practical application scenarios, the image acquisition device often captures images in which frontal face information is missing, such as blurred images, occluded images, and large-angle images such as side-face images. Such images make recognition difficult both for recognition algorithms and for the naked eye: the face information in these images is not easily identified by the naked eye, and a recognition algorithm produces low similarity when matching frontal-face-missing images of the same person.
In the prior art, a one-dimensional feature vector, namely a face feature, is extracted from a frontal-face-missing image by a pre-constructed neural network model, and the occluded frontal face image is restored, or an unobstructed frontal face image is reconstructed, from this face feature.
Because the prior art reconstructs the frontal face image from the face feature alone, much of the information in the original image is lost. As a result, the quality of the reconstructed image is low, and the reconstructed image either cannot be identified by the naked eye or matches poorly when identified by a recognition algorithm.
Disclosure of Invention
The embodiments of the application provide an image processing method for generating an unobstructed frontal face image from an image in which frontal face information is missing. The generated unobstructed frontal face image is used for naked-eye identification or for identity recognition by a recognition algorithm, and can improve the success rate of face recognition.
A first aspect of an embodiment of the present application provides an image processing method, including: acquiring face features of a face image and position information of key points, wherein the face features comprise feature vectors of the face information in the face image, and the key points comprise feature points representing face positions in the face image; and obtaining an unobstructed frontal face image corresponding to the face image through a pre-trained first neural network model according to the face features and the position information of the key points.
In a scenario where identity recognition is performed through a face image, the face image is usually captured by an image acquisition device. A common image acquisition device such as a surveillance camera usually does not capture complete frontal face information, which makes subsequent visual identification of the face image, or recognition by a face recognition algorithm in a face recognition system, very difficult. In the image processing method provided in the embodiments of the application, for a face image captured by the image acquisition device, the image processing apparatus acquires the face features of the face image and the position information of key points of the face image; a pre-trained first neural network model fuses the face features and the position information of the key points to obtain an unobstructed frontal face image corresponding to the face image. Performing visual identification, or identity recognition by a recognition algorithm, on the generated unobstructed frontal face image can improve the success rate of face recognition.
Optionally, the position information of the key point is a five-dimensional feature map, and each dimension in the five-dimensional feature map represents one key point of the face. Optionally, the size of the face image is 224 × 224, and the size of the five-dimensional feature map is 5 × 224. The first neural network model is a generator model.
In one possible implementation form of the first aspect, the first neural network model includes one or more deconvolution layers; optionally, the first neural network model includes a plurality of deconvolution layers.
In the image processing method provided in the embodiments of the application, the first neural network model includes a deconvolution layer. The deconvolution layer can expand its input, so that more data (a reconstructed face image) can be recovered from a smaller amount of data (the face features and the position information of the key points).
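For intuition, the following minimal sketch (with arbitrary layer parameters, shown only to illustrate the direction of the size change) contrasts a convolution, which reduces its input, with a deconvolution, which expands it:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 56, 56)                                    # arbitrary input

conv = nn.Conv2d(3, 8, kernel_size=4, stride=2, padding=1)
deconv = nn.ConvTranspose2d(3, 8, kernel_size=4, stride=2, padding=1)

print(conv(x).shape)    # torch.Size([1, 8, 28, 28])  -- convolution reduces the input
print(deconv(x).shape)  # torch.Size([1, 8, 112, 112]) -- deconvolution expands the input
```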
In a possible implementation manner of the first aspect, the position information of the key points includes a key-point feature map, and obtaining the unobstructed frontal face image corresponding to the face image through the pre-trained first neural network model according to the face features and the position information of the key points includes: inputting the face features into a first deconvolution layer to obtain a first feature map; fusing the first feature map and the key-point feature map to obtain a second feature map; and inputting the second feature map into a second deconvolution layer to obtain the unobstructed frontal face image. Optionally, the first deconvolution layer includes one or more deconvolution layers. Optionally, the second deconvolution layer includes one or more deconvolution layers. Optionally, fusing the first feature map and the key-point feature map includes fusing them by element-wise multiplication. Optionally, the first feature map has a size of 8 × 56, the second feature map has a size of 40 × 56, and the unobstructed frontal face image has a size of 3 × 112.
The image processing method provided in the embodiments of the application specifies a concrete way of fusing the face features with the position information of the key points. The position information of the key points is usually used to locate the face region in a face image; in the embodiments of the application it is fused with the face features: the face features are first processed by several deconvolution layers, then fused with the position information of the key points by element-wise multiplication, and finally the unobstructed frontal face image is reconstructed by further deconvolution layers. This increases the identity information retained in the reconstructed face image and improves the face recognition performance of the unobstructed frontal face image.
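For illustration, the following is a minimal PyTorch-style sketch of such a generator. The channel counts, layer counts, kernel sizes, and the interpretation of the element-wise fusion (each of 8 feature channels multiplied by each of 5 key-point maps to give 40 channels at 56 × 56) are assumptions consistent with the sizes quoted in the embodiments, not the exact architecture of the application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeneratorSketch(nn.Module):
    """Illustrative generator: 1-D face feature + key-point maps -> frontal face image."""
    def __init__(self, feat_dim=512, mid_channels=8, keypoints=5):
        super().__init__()
        # First deconvolution stage: expand the 1-D feature into an 8 x 56 x 56 map
        # (the channel count and sizes are assumptions).
        self.expand = nn.Sequential(
            nn.ConvTranspose2d(feat_dim, 64, kernel_size=7),                  # 1x1  -> 7x7
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),   # 7x7  -> 14x14
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, mid_channels, kernel_size=4, stride=4),    # 14x14 -> 56x56
            nn.ReLU(inplace=True),
        )
        # Second deconvolution stage: reconstruct a 3-channel frontal face image.
        self.reconstruct = nn.Sequential(
            nn.ConvTranspose2d(mid_channels * keypoints, 16,
                               kernel_size=4, stride=2, padding=1),           # 56x56 -> 112x112
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(16, 3, kernel_size=1),
            nn.Tanh(),
        )

    def forward(self, face_feat, keypoint_maps):
        # face_feat: (N, feat_dim); keypoint_maps: (N, 5, H, W)
        x = self.expand(face_feat.view(face_feat.size(0), -1, 1, 1))          # (N, 8, 56, 56)
        kp = F.interpolate(keypoint_maps, size=x.shape[-2:],
                           mode='bilinear', align_corners=False)              # (N, 5, 56, 56)
        # Element-wise multiplication of every feature channel with every
        # key-point map -> (N, 8 * 5 = 40, 56, 56).
        fused = (x.unsqueeze(2) * kp.unsqueeze(1)).flatten(1, 2)
        return self.reconstruct(fused)                                        # (N, 3, 112, 112)
```

In this sketch the key-point maps are resampled to the spatial size of the expanded feature map before the element-wise multiplication; whether the application does this, or produces the maps at that size directly, is not specified here.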
In a possible implementation manner of the first aspect, the first neural network model is obtained after training of a first initial network model, and the method further includes: inputting first face features of a first training sample and position information of a first key point into the first initial network model for training to obtain a first loss; and updating the weight parameters in the first initial network model according to the first loss to obtain the first neural network model.
The image processing method provided by the embodiment of the application provides a specific training method of the first neural network model, namely, the image processing device performs training through the facial features of the first training sample and the position information of the key points, and updates the weight parameters of the first initial network model according to the obtained loss.
In a possible implementation manner of the first aspect, the first loss is a difference between a first feature and a second feature, the first feature is a face feature extracted from a face image corresponding to the first training sample, the second feature is a face feature extracted from a first generated image, and the first generated image is an unobstructed frontal face image obtained by inputting the first training sample into the first initial network model. Optionally, the difference between the first feature and the second feature is a mean square error between the first feature and the second feature:

\mathrm{MSE}(E, Z) = \frac{1}{F} \sum_{f=1}^{F} \left( E_f - Z_f \right)^2

where F is the total length of the first feature (equal to the total length of the second feature), f indexes the f-th element, E_f is the value of the f-th element of the first feature, and Z_f is the value of the f-th element of the second feature.
The image processing method provided in the embodiments of the application specifies a concrete training method for the first neural network model. The loss function includes a first loss, which is the difference between the first feature and the second feature; this difference between features can be measured by a mean square error. By setting this loss function, the second feature extracted from the generated image is driven to be similar to the first feature extracted from the face image (i.e., the ground truth) corresponding to the training sample, so that the unobstructed frontal face image generated by the trained first neural network model retains more identity information and improves the success rate of face recognition.
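For illustration, the mean square error between the two features can be computed as in the following sketch (the feature length 512 is an arbitrary assumption):

```python
import torch

def feature_loss(E: torch.Tensor, Z: torch.Tensor) -> torch.Tensor:
    """Mean square error between the face feature E of the ground-truth face image
    and the face feature Z of the generated image (both of equal length F)."""
    assert E.shape == Z.shape
    return ((E - Z) ** 2).mean()

# Example: two 512-dimensional face features.
E = torch.randn(512)
Z = torch.randn(512)
print(feature_loss(E, Z))
```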
In one possible implementation manner of the first aspect, the first loss includes a difference between the first feature and the second feature, and a difference between a frontal face image and a first generated image corresponding to the first training sample. Optionally, a difference between the front face image corresponding to the first training sample and the first generated image is a mean square error between a pixel value of a first position in the front face image corresponding to the first training sample and a pixel value of a second position in the first generated image, where the second position in the first generated image corresponds to the first position in the front face image corresponding to the first training sample.
The image processing method provided in the embodiments of the application specifies another training method for the first neural network model. The loss function further includes the difference between the frontal face image corresponding to the first training sample and the first generated image, which drives the generated image to resemble the frontal face image corresponding to the training sample; the unobstructed frontal face image generated by the trained first neural network model can therefore improve the success rate of naked-eye identification.
In a possible implementation manner of the first aspect, the first loss includes the difference between the first feature and the second feature, the difference between the frontal face image corresponding to the first training sample and the first generated image, and a determination loss, where the determination loss is the probability that a discriminator discriminates the first generated image as false; the discriminator is configured to discriminate a real image as true and a generated image as false. The determination loss is log(1 - D(G)), where D is the discriminator, G is the first generated image, and D(G) is a value between 0 and 1: a value of 1 represents a determination that the first generated image is true, and a value of 0 represents a determination that it is false.
The image processing method provided in the embodiments of the application specifies a further training method for the first neural network model. The loss function also includes the discriminator loss: the discriminator is used to judge whether the input image is a generated image, outputting false if the input is a generated image and true if the input is a real image.
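As an illustration only, the three terms described above could be combined as in the following sketch; the weighting coefficients, the clamping of D(G), and the discriminator interface are assumptions and are not defined by the application.

```python
import torch
import torch.nn.functional as F

def generator_loss(E, Z, gt_face, gen_face, discriminator,
                   w_feat=1.0, w_pixel=1.0, w_adv=1.0):
    """First loss = feature MSE + pixel-wise MSE + determination loss log(1 - D(G)).
    The weights w_* are illustrative assumptions."""
    feat_term = F.mse_loss(Z, E)                  # difference between first and second feature
    pixel_term = F.mse_loss(gen_face, gt_face)    # difference between generated and true frontal face
    d_out = discriminator(gen_face).clamp(1e-6, 1 - 1e-6)   # D(G), assumed to lie in (0, 1)
    adv_term = torch.log(1.0 - d_out).mean()      # determination loss log(1 - D(G))
    return w_feat * feat_term + w_pixel * pixel_term + w_adv * adv_term
```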
In a possible implementation manner of the first aspect, the acquiring the facial features of the facial image and the location information of the key points includes: inputting the face image into a second neural network model to obtain the face features; and inputting the face image into a third neural network model to obtain the position information of the key point.
The image processing method provided in the embodiments of the application can also obtain the unobstructed frontal face image directly from the face image, which increases the variety of possible implementations.
In a possible implementation manner of the first aspect, the second neural network model includes a feature extractor or an encoder, the feature extractor is a neural network that outputs facial features according to an input facial image, and the encoder is a part of an auto-encoder.
The image processing method provided in the embodiments of the application specifies two concrete ways of extracting the face features: obtaining a feature vector through the feature extractor, or extracting a latent space representation through the encoder. Obtaining the face features through the encoder requires less computation, but the latent space representation also contains less identity information. In a practical application scenario, the specific way of extracting the face features can be selected as needed, which improves the flexibility of implementing the scheme.
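As an illustration of the two alternatives, the following minimal sketch shows a stand-alone feature extractor and the encoder part of an auto-encoder; the layer sizes and network structures are arbitrary assumptions, not the models defined by the application.

```python
import torch
import torch.nn as nn

in_dim, latent_dim = 224 * 224 * 3, 128        # illustrative sizes

# Option 1: a dedicated feature extractor mapping an image to a face feature.
feature_extractor = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, 512), nn.ReLU(),
                                  nn.Linear(512, latent_dim))

# Option 2: the encoder part of an auto-encoder, h = f(x); the decoder r = g(h)
# is only needed when training the auto-encoder to reconstruct its input.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, 512), nn.ReLU(),
                        nn.Linear(512, latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                        nn.Linear(512, in_dim))

x = torch.randn(1, 3, 224, 224)                # a face image
face_feature = feature_extractor(x)            # option 1: feature vector
latent = encoder(x)                            # option 2: latent space representation h = f(x)
reconstruction = decoder(latent)               # r = g(h), used during auto-encoder training
```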
In a possible implementation manner of the first aspect, the third neural network model includes a key point model, and the key point model is a neural network model that outputs position information of key points according to an input face image.
The image processing method provided in the embodiments of the application specifies a concrete way of extracting the position information of the key points, namely extraction through a key-point model. The position information extracted by the key-point model is a feature map for each key point, which indicates, for each pixel position in the image, the probability that the pixel is that key point. The position information of the key points is generally used to locate the face region in the image; fusing it with the face features increases the identity information contained in the generated unobstructed frontal face image and improves the success rate of face recognition.
In a possible implementation manner of the first aspect, the non-occluded frontal face image is used for naked eye recognition or is used for inputting a face recognition system to realize face recognition.
The image processing method provided in the embodiments of the application specifies two concrete application scenarios for the unobstructed frontal face image generated by the image processing apparatus.
A second aspect of the embodiments of the present application provides an image processing apparatus, including: an acquisition unit, configured to acquire the face features of a face image and the position information of key points, where the face features include feature vectors of the face information in the face image, and the key points include feature points that represent positions of the face in the face image;
and a processing unit, configured to obtain an unobstructed frontal face image corresponding to the face image through a pre-trained first neural network model according to the face features and the position information of the key points.
In one possible implementation of the second aspect, the first neural network model includes one or more deconvolution layers. Optionally, the first neural network model comprises a plurality of deconvolution layers.
In one possible implementation manner of the second aspect, the location information of the keypoint includes a keypoint feature map; the processing unit is specifically configured to: inputting the human face features into a first deconvolution layer to obtain a first feature map; fusing the first feature and the key point feature map to obtain a second feature map; and inputting the second feature map into a second deconvolution layer to obtain the unobstructed front face image.
In a possible implementation manner of the second aspect, the first neural network model is obtained after training of a first initial network model, and the apparatus further includes: the training unit is used for inputting the first face characteristics of the first training sample and the position information of the first key point into the first initial network model for training to obtain a first loss; the processing unit is further configured to update a weight parameter in the first initial network model according to the first loss to obtain the first neural network model.
In one possible implementation manner of the second aspect, the first loss is a difference between a first feature and a second feature, the first feature is a face feature extracted from a face image corresponding to the first training sample, the second feature is a face feature extracted from a first generated image, and the first generated image is an unobstructed face image obtained by inputting the first training sample into the first initial network model.
In one possible implementation manner of the second aspect, the first loss includes a difference between the first feature and the second feature, and a difference between the front face image and the first generated image corresponding to the first training sample.
In a possible implementation manner of the second aspect, the first loss includes a difference between the first feature and the second feature, a difference between a frontal face image corresponding to the first training sample and a first generated image, and a determination loss, the determination loss is a probability that a discriminator discriminates the first generated image as false, and the discriminator is configured to discriminate a true image as true and discriminate a generated image as false.
In a possible implementation manner of the second aspect, the obtaining unit is specifically configured to: inputting the face image into a second neural network model to obtain the face features; and inputting the face image into a third neural network model to obtain the position information of the key point.
In one possible implementation manner of the second aspect, the second neural network model includes a feature extractor or an encoder, the feature extractor is a neural network that outputs facial features according to an input facial image, and the encoder is a part of an auto-encoder.
In a possible implementation manner of the second aspect, the third neural network model includes a key point model, and the key point model is a neural network model that outputs position information of key points according to an input face image.
In a possible implementation manner of the second aspect, the non-occluded frontal face image is used for naked eye recognition or is used for inputting a face recognition system to realize face recognition.
A third aspect of embodiments of the present application provides an image processing apparatus, which includes a processor and a memory, where the processor and the memory are connected to each other, where the memory is configured to store a computer program, and the computer program includes program instructions, and the processor is configured to call the program instructions to execute the method according to any one of the foregoing first aspect and various possible implementation manners.
A fourth aspect of embodiments of the present application provides a computer program product containing instructions, which when run on a computer, causes the computer to perform the method according to the first aspect and any one of the various possible implementations.
A fifth aspect of embodiments of the present application provides a computer-readable storage medium, which includes instructions that, when executed on a computer, cause the computer to perform the method according to the first aspect and any one of the various possible implementations.
A sixth aspect of embodiments of the present application provides a chip, including a processor. The processor is used for reading and executing the computer program stored in the memory so as to execute the method in any possible implementation mode of any one aspect. Optionally, the chip may include a memory, and the memory and the processor may be connected to the memory through a circuit or a wire. Further optionally, the chip further comprises a communication interface, and the processor is connected to the communication interface. The communication interface is used for receiving data and/or information needing to be processed, the processor acquires the data and/or information from the communication interface, processes the data and/or information, and outputs a processing result through the communication interface. The communication interface may be an input output interface.
According to the technical scheme, the embodiment of the application has the following advantages:
According to the image processing method provided in the embodiments of the application, the face features of the face image and the position information of the key points are input into the neural network model, so that an unobstructed frontal face image can be generated from an image in which frontal face information is missing. The generated unobstructed frontal face image can be used for naked-eye identification, improving the success rate of naked-eye identification; it can also be input into a face recognition system for discrimination by a recognition algorithm, and the unobstructed frontal face image improves the discrimination success rate of the recognition model.
Drawings
FIG. 1 is a schematic diagram of an artificial intelligence agent framework provided by an embodiment of the present application;
FIG. 2 is a system architecture diagram according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a convolutional neural network according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of another convolutional neural network structure provided in the embodiments of the present application;
FIG. 5a is a schematic diagram of an application scenario of an image processing method in an embodiment of the present application;
FIG. 5b is a schematic diagram of another application scenario of the image processing method in the embodiment of the present application;
FIG. 6 is a schematic diagram of an embodiment of an image processing method in an embodiment of the present application;
FIG. 7 is a schematic diagram of another embodiment of an image processing method in an embodiment of the present application;
FIG. 8 is a diagram illustrating an embodiment of a training process of a generator in an embodiment of the present application;
FIG. 9 is a diagram illustrating an embodiment of a training process of an arbiter in an embodiment of the present application;
FIG. 10 is a schematic diagram of an embodiment of an image processing apparatus according to the embodiment of the present application;
fig. 11 is a schematic diagram of another embodiment of an image processing apparatus in an embodiment of the present application;
fig. 12 is a diagram of a chip hardware structure according to an embodiment of the present application.
Detailed Description
The embodiments of the application provide an image processing method for generating an unobstructed frontal face image from an image in which frontal face information is missing. The generated unobstructed frontal face image is used for naked-eye identification or for identity recognition by a recognition algorithm, and can improve the success rate of face recognition.
For the sake of understanding, some technical terms related to the embodiments of the present application are briefly described as follows:
1. A convolutional neural network (CNN) is essentially an input-to-output mapping; it can learn the mapping relationship between inputs and outputs from a large number of input-output pairs without requiring any precise mathematical expression relating the inputs to the outputs.
2. A deconvolution neural network (TC) belongs to the same class of networks as a CNN and can likewise learn a mapping relationship from a large number of inputs and outputs. Unlike a convolutional neural network, which generally reduces its input (i.e., the output is smaller than the input), a TC expands its input.
3. The generator (Generator) provided in the embodiments of the application is a neural network model that takes features (a one-dimensional array) and the position information of key points as input and an image as output. For ease of distinction, the neural network model that generates the unobstructed frontal face image from the face features and the position information of the key points is called the generator. The generator is composed of multiple TC layers and learns the mapping relationship between features and images using the back-propagation algorithm, so that the generated image serves a specific purpose. The generator is also referred to as the generator model in the embodiments of the application.
4. A Discriminator (Discriminator) is a neural network for judging whether an input image is a generated image. If the image is generated, false is output, and if the image is real, true is output. The discriminator is also referred to as a discriminator model in the embodiments of the present application.
5. The feature extractor (Extractor) is a neural network that takes an image as input and features as output. The output one-dimensional feature vector contains face information and can be called a face feature; whether two face images are similar is judged by comparing the similarity of the face features of the two images. In the embodiments of the application, the feature extractor is also referred to as the feature extractor model.
6. A key-point model (Facial Landmark Detector) is a model that takes an image as input and the position information of key points as output. The key points are feature points in a face image that represent the position of the face, and the position information of the key points indicates the pixel position of one or more such feature points in the face image; it is generally used to locate the position of the face. For example, in the embodiments of the application, the position information of the key points is a five-dimensional feature map whose five dimensions represent five key points of the face (the key points may, for example, be identification points of the nose, the eyes, or the mouth). Each pixel position on the feature map corresponds to a pixel position on the input image, and the five-dimensional data at each pixel position of the feature map represents the probability that each of the five key points is located at the corresponding pixel position of the input image.
7. Mean square error (MSE) is a metric that reflects the degree of difference between an estimator and the quantity being estimated. The mean of the squared differences between predicted values and actual values can be used as an index for measuring the quality of a prediction.
8. Face features: face features can be divided into structured features and unstructured features. Structured features are features with a specific physical meaning, such as age, gender, or angle; unstructured features have no specific physical meaning but consist of a series of numbers, and are also called feature vectors. The similarity between feature vectors (e.g., Euclidean distance or cosine distance) can represent the similarity between the original images. The face features mentioned in the embodiments of the application all refer to unstructured features, specifically one-dimensional arrays.
9. An auto-encoder (autoencoder, AE) is a neural network that uses the back-propagation algorithm to make the output value equal to the input value: it first compresses the input into a latent space representation and then reconstructs the output from this representation. The auto-encoder consists of two parts: the encoder compresses the input into the latent space representation, which can be expressed by the encoding function h = f(x), and the decoder reconstructs the input from the latent space representation, which can be expressed by the decoding function r = g(h).
10. Identity information: face recognition is a biometric technology for identity recognition based on facial feature information of a person. In the face image, the feature information for performing identity recognition may be referred to as identity information.
11. Unobstructed frontal face image: in the embodiments of the application, the unobstructed frontal face image obtained by the generator is a frontal face image that includes all parts of the face and has less occlusion or decoration (such as a hat, a mask, or sunglasses) than the original input face image.
Embodiments of the present application will now be described with reference to the accompanying drawings, and it is to be understood that the described embodiments are merely illustrative of some, but not all, embodiments of the present application. As can be known to those skilled in the art, with the development of technology and the emergence of new scenarios, the technical solution provided in the embodiments of the present application is also applicable to similar technical problems.
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Moreover, the terms "comprises," "comprising," and any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules explicitly listed, but may include other steps or modules not expressly listed or inherent to such process, method, article, or apparatus. The naming or numbering of the steps appearing in the present application does not mean that the steps in the method flow have to be executed in the chronological/logical order indicated by the naming or numbering, and the named or numbered process steps may be executed in a modified order depending on the technical purpose to be achieved, as long as the same or similar technical effects are achieved.
FIG. 1 shows a schematic diagram of an artificial intelligence body framework that describes the overall workflow of an artificial intelligence system, applicable to the general artificial intelligence field requirements.
The artificial intelligence topic framework described above is set forth below in terms of two dimensions, the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
The "smart information chain" reflects a list of processes processed from the acquisition of data. For example, the general processes of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision making and intelligent execution and output can be realized. In this process, the data undergoes a "data-information-knowledge-wisdom" refinement process.
The 'IT value chain' reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (the provision and processing of technology) up to the industrial ecology of the system.
(1) Infrastructure:
the infrastructure provides computing power support for the artificial intelligent system, realizes communication with the outside world, and realizes support through a foundation platform. Communicating with the outside through a sensor; the computing power is provided by intelligent chips (hardware acceleration chips such as CPU, NPU, GPU, ASIC, FPGA and the like); the basic platform comprises distributed computing framework, network and other related platform guarantees and supports, and can comprise cloud storage and computing, interconnection and intercommunication networks and the like. For example, sensors and external communications acquire data that is provided to intelligent chips in a distributed computing system provided by the base platform for computation.
(2) Data
Data at the upper level of the infrastructure is used to represent the data source for the field of artificial intelligence. The data relates to graphs, images, voice and texts, and also relates to the data of the Internet of things of traditional equipment, including service data of the existing system and sensing data such as force, displacement, liquid level, temperature, humidity and the like.
(3) Data processing
Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
The machine learning and the deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.
Inference means a process of simulating an intelligent human inference mode in a computer or an intelligent system, using formalized information to think about and solve a problem by a machine according to an inference control strategy, and a typical function is searching and matching.
The decision-making refers to a process of making a decision after reasoning intelligent information, and generally provides functions of classification, sequencing, prediction and the like.
(4) General capabilities
After the above-mentioned data processing, further based on the result of the data processing, some general capabilities may be formed, such as algorithms or a general system, e.g. translation, analysis of text, computer vision processing, speech recognition, recognition of images, etc.
(5) Intelligent product and industrial application
The intelligent product and industry application refers to the product and application of an artificial intelligence system in various fields, and is the encapsulation of an artificial intelligence integral solution, the intelligent information decision is commercialized, and the landing application is realized, and the application field mainly comprises: intelligent manufacturing, intelligent transportation, intelligent home, intelligent medical treatment, intelligent security, automatic driving, safe city, intelligent terminal and the like.
Referring to fig. 2, a system architecture 200 is provided in an embodiment of the present application. The data acquisition device 260 is configured to acquire facial image data and store the facial image data in the database 230, and the training device 220 generates the target model/rule 201 based on the facial image data maintained in the database 230. How the training device 220 derives the target model/rule 201 based on the face image data will be described in more detail below, and the target model/rule 201 can be used in application scenarios such as face recognition, image classification, and virtual reality.
In the embodiment of the present application, training may be performed based on face image data, and specifically, various face images, including a face image with occlusion, may be acquired through the data acquisition device 260 and stored in the database 230. In addition, the face image data can also be directly obtained from commonly used databases, such as databases of LFW, YaleB, CMU PIE, CFW, Celeba and the like.
The target model/rule 201 may be derived based on a deep neural network, which is described below.
The operation of each layer in the deep neural network can be expressed mathematically as

y = a(W · x + b)

From a physical perspective, the work of each layer in the deep neural network can be understood as performing a transformation from the input space to the output space (i.e., from the row space to the column space of the matrix) through five operations on the input space (the set of input vectors): 1. raising/lowering the dimension; 2. zooming in/out; 3. rotation; 4. translation; 5. "bending". Operations 1, 2, and 3 are completed by W · x, operation 4 is completed by + b, and operation 5 is realized by a( ). The word "space" is used here because the object being classified is not a single thing but a class of things, and space refers to the collection of all individuals of that class. W is a weight vector, in which each value represents the weight of one neuron in that layer of the neural network. The vector W determines the spatial transformation from the input space to the output space described above, i.e., the weight W of each layer controls how the space is transformed. The purpose of training the deep neural network is ultimately to obtain the weight matrices (formed by the vectors W of many layers) of all layers of the trained network. Therefore, the training process of a neural network is essentially learning how to control the spatial transformation, and more specifically, learning the weight matrices.
Because the output of the deep neural network should be as close as possible to the value that is actually desired, the weight vector of each layer can be updated by comparing the predicted value of the current network with the actually desired value and adjusting the weights according to the difference between them (of course, there is usually an initialization process before the first update, in which parameters are pre-configured for each layer of the deep neural network). It is therefore necessary to define in advance how to measure the difference between the predicted value and the target value; this is done by loss functions or objective functions, which are the key equations for measuring that difference. Taking the loss function as an example, a higher output value (loss) indicates a larger difference, so training the deep neural network becomes the process of reducing this loss as much as possible.
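As a minimal sketch of these ideas, the following toy example runs one layer y = a(W · x + b), measures the difference between the prediction and the target value with an MSE loss function, and repeatedly updates the weights to reduce the loss (the layer sizes, activation, and learning rate are arbitrary assumptions):

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)                 # W . x + b
act = nn.ReLU()                         # a(.)

x = torch.randn(8, 4)                   # input vectors
target = torch.randn(8, 2)              # values we actually want to predict

opt = torch.optim.SGD(layer.parameters(), lr=0.1)
loss_fn = nn.MSELoss()                  # loss function: difference between prediction and target

for step in range(100):
    pred = act(layer(x))                # y = a(W . x + b)
    loss = loss_fn(pred, target)        # how far the prediction is from the target
    opt.zero_grad()
    loss.backward()                     # back-propagate the difference
    opt.step()                          # update W and b to reduce the loss
```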
The target models/rules obtained by the training device 220 may be applied in different systems or devices. In FIG. 2, the execution device 210 is configured with an I/O interface 212 to interact with data from an external device, and a "user" may input data to the I/O interface 212 via a client device 240.
The execution device 210 may call data, code, etc. from the data storage system 250 and may store data, instructions, etc. in the data storage system 250.
The calculation module 211 processes the input data using the target model/rule 201, and taking facial image recognition as an example, the calculation module 211 may analyze the input facial image to obtain image features such as texture information in the facial image.
The correlation function module 213 may perform preprocessing on the image data in the calculation module 211, such as face image preprocessing, including face alignment, etc.
The correlation function module 214 may perform preprocessing on the image data in the calculation module 211, such as face image preprocessing, including face alignment, etc.
Finally, the I/O interface 212 returns the results of the processing to the client device 240 for presentation to the user.
Further, the training device 220 may generate corresponding target models/rules 201 based on different data for different targets to provide better results to the user.
In the case shown in FIG. 2, the user may manually specify data to be input into the execution device 210, for example, to operate in an interface provided by the I/O interface 212. Alternatively, the client device 240 may automatically enter data into the I/O interface 212 and obtain the results, and if the client device 240 automatically enters data to obtain authorization from the user, the user may set the corresponding permissions in the client device 240. The user can view the result output by the execution device 210 at the client device 240, and the specific presentation form can be display, sound, action, and the like. The client device 240 may also act as a data collection end to store the collected training data in the database 230.
It should be noted that fig. 2 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the position relationship between the devices, modules, and the like shown in the diagram does not constitute any limitation, for example, in fig. 2, the data storage system 250 is an external memory with respect to the execution device 210, and in other cases, the data storage system 250 may also be disposed in the execution device 210.
In the embodiment of the present application, the neural network used for extracting the deep neural network of the face feature from the image and extracting the position information of the key point may be, for example, a Convolutional Neural Network (CNN). In the embodiment of the present application, the neural network used for generating an unobstructed frontal face image according to the facial features and the location information of the key points may be, for example, a deconvolution neural network (TC), where CNN usually reduces the input, and TC enlarges the input. CNN is described in detail below.
A CNN is a deep neural network with a convolution structure and is a deep learning architecture, where deep learning refers to learning at multiple levels of abstraction by a machine learning algorithm. As a deep learning architecture, a CNN is a feed-forward artificial neural network; in image processing, for example, individual neurons respond to overlapping regions of the image input to the network. Of course, other types are possible, and the application does not limit the type of deep neural network.
As shown in fig. 3, Convolutional Neural Network (CNN)100 may include an input layer 110, a convolutional/pooling layer 120, where the pooling layer is optional, and a neural network layer 130.
Convolutional layer/pooling layer 120:
Convolutional layers:
as shown in FIG. 3, convolutional layer/pooling layer 120 may include, for example, 121-126 layers, in one implementation, 121 layers are convolutional layers, 122 layers are pooling layers, 123 layers are convolutional layers, 124 layers are pooling layers, 125 layers are convolutional layers, and 126 layers are pooling layers; in another implementation, 121, 122 are convolutional layers, 123 are pooling layers, 124, 125 are convolutional layers, and 126 are pooling layers. I.e., the output of a convolutional layer may be used as input to a subsequent pooling layer, or may be used as input to another convolutional layer to continue the convolution operation.
Taking convolutional layer 121 as an example, convolutional layer 121 may include a plurality of convolution operators, also called kernels. In image processing, a convolution operator acts as a filter that extracts specific information from the input image matrix. A convolution operator is essentially a weight matrix, which is usually predefined; during the convolution operation on an image, the weight matrix is typically moved across the input image one pixel at a time (or two pixels at a time, and so on, depending on the value of the stride) in the horizontal direction, so as to extract a specific feature from the image.
When the convolutional neural network 100 has multiple convolutional layers, the initial convolutional layers (e.g., 121) tend to extract more general features, which may also be called low-level features. As the depth of the convolutional neural network 100 increases, the later convolutional layers (e.g., 126) extract increasingly complex features, such as features with high-level semantics; features with higher semantics are better suited to the problem to be solved. To facilitate description of the network structure, a group of convolutional layers may be referred to as a block.
A pooling layer:
Since it is often necessary to reduce the number of training parameters, pooling layers are often introduced periodically after convolutional layers. In the layers 121-126 illustrated by 120 in FIG. 3, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. During image processing, the only purpose of the pooling layer is to reduce the spatial size of the image.
The neural network layer 130:
after processing by convolutional layer/pooling layer 120, convolutional neural network 100 is not sufficient to output the required output information. Accordingly, a plurality of hidden layers (such as 131, 132, to 13n shown in fig. 3) and an output layer 140 may be included in the neural network layer 130, and parameters included in the plurality of hidden layers may be pre-trained according to related training data of a specific task type, for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and the like.
After the plurality of hidden layers in the neural network layer 130, the last layer of the entire convolutional neural network 100 is the output layer 140.
It should be noted that the convolutional neural network 100 shown in FIG. 3 is only an example of a convolutional neural network; in a specific application, the convolutional neural network may also take the form of other network models. For example, as shown in FIG. 4, multiple convolutional/pooling layers may operate in parallel, and the features they respectively extract are all input to the overall neural network layer 130 for processing.
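As an illustration of the structure described above (input layer, alternating convolutional and pooling layers, hidden layers, and an output layer), the following sketch builds a small CNN; the channel counts and layer counts are arbitrary assumptions and do not correspond to any network defined by the application:

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    # convolutional layer / pooling layer block (alternation in the style of layers 121-126)
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                               # pooling only reduces the spatial size
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    # hidden layers (131 ... 13n) and output layer (140)
    nn.Linear(32 * 56 * 56, 128), nn.ReLU(),
    nn.Linear(128, 10),
)

x = torch.randn(1, 3, 224, 224)                    # a 224 x 224 input image
print(cnn(x).shape)                                # torch.Size([1, 10])
```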
In practical application scenarios, the image acquisition device often captures images in which frontal face information is missing, such as blurred images, occluded images, and large-angle images such as side-face images. Such images make recognition difficult for both recognition algorithms and the naked eye: the faces in these images are not easily identified by the naked eye, and a recognition algorithm produces low similarity when recognizing frontal-face-missing images of the same person.
A general image processing method extracts face features, which are one-dimensional feature vectors, from a frontal-face-missing image using a pre-constructed neural network model, and restores the occluded frontal face image or reconstructs an unobstructed frontal face image from these face features.
Because the face features in the prior art are one-dimensional arrays, much of the original image information is lost. The quality of the reconstructed image is therefore low and it contains little identity information, so the image either cannot be identified by the naked eye or matches poorly when identified by a recognition algorithm.
Please refer to fig. 5a, which is a schematic diagram of an application scenario of the image processing method in the embodiment of the present application;
various human face images collected by the image collecting device, including blurred images, shielded images, or large-angle images such as side face images, can be generated into non-shielded front face images by the image processing method provided by the embodiment of the application, the generated images can be used for naked eye identification, or can be used for identification by inputting a human face identification system based on an identification algorithm, and the image processing method can improve the success rate of naked eye identification and the success rate of identification model identification. The application of the generated non-occlusion frontal image is not limited herein.
Please refer to fig. 5b, which is a schematic diagram of another application scenario of the image processing method in the embodiment of the present application;
In another possible application scenario, the image acquisition device can process the collected face image to obtain the face features of the face image and the position information of the key points. Based on the image processing method provided in the embodiments of the application, an unobstructed frontal face image can then be generated from the face features and the position information of the key points. The generated image can be used for naked-eye identification or input into a face recognition system for discrimination based on a recognition algorithm; the image processing method can improve both the success rate of naked-eye identification and the success rate of recognition-model discrimination.
Alternatively, the image capturing device may be a Software Defined Camera (SDC), which is a camera with capturing and algorithmic processing capabilities, and which can process the captured images in real time, locate the face, and extract features and key points of the face.
Alternatively, the image capturing device may also be a video content management system (VCM), which is a system for performing algorithmic processing on video captured by a camera; after the camera captures a picture, face detection, feature extraction, and key-point extraction are performed by the system.
It should be noted that the image processing method provided in the embodiments of the present application may be implemented by being integrated in an image capturing device, such as SDC or VCM, or may be implemented by being deployed in a separate image processing apparatus.
Please refer to fig. 6, which is a schematic diagram illustrating an embodiment of an image processing method according to an embodiment of the present application;
601. acquiring the face characteristics of the image and the position information of the key points;
optionally, extracting the face features of the image and the position information of the key points according to the trained feature extraction model and the trained key point model, or directly obtaining the face features of the image and the position information of the key points.
The input of the feature extractor model is an image, and its output is a face feature, specifically an unstructured feature used to compare the similarity between images. Optionally, the face feature is a one-dimensional feature vector, i.e., a one-dimensional array; the specific form is not limited here. The feature extractor model is pre-trained, and the specific training process is not limited here.
The input of the key-point model is an image, and its output is the position information of the key points, which provides the position of the face in the image. Optionally, the position information of the key points is a five-dimensional feature map whose five dimensions represent five key points of the face (for example, identification points of the nose, the eyes, or the mouth). Each pixel position on the feature map corresponds to a pixel position on the input image, and the five-dimensional data at each pixel position represents the probability that each of the five key points is located at the corresponding pixel position of the input image.
Illustratively, the size of the image input to the key-point model is 224 × 224 (pixels × pixels), and the output is a five-dimensional feature map of size 5 × 224, each dimension being 1 × 224.
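As an illustration, the most likely pixel position of each key point can be read off such a feature map as in the following sketch; the shape 5 × 224 × 224 is an assumption based on each of the five dimensions being a probability map the size of the input image:

```python
import torch

# Output of the key-point model: one 224 x 224 probability map per key point
# (e.g. identification points of the nose, eyes, or mouth).
heatmaps = torch.rand(5, 224, 224)

for k in range(heatmaps.size(0)):
    flat_idx = heatmaps[k].flatten().argmax().item()
    row, col = divmod(flat_idx, heatmaps.size(2))
    print(f"key point {k}: most likely pixel position ({row}, {col})")
```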
602. Inputting the face features and the position information of the key points into a trained generator model to generate an unobstructed frontal face image;
the face features obtained by the feature extractor model and the position information of the key points obtained by the key point model are input into a generator model trained in advance to generate an unobstructed frontal face image, and the training process of the generator model is specifically referred to in the following embodiments. The generator model is also referred to as a generator in the embodiments of the present application.
The generator is used for fusing the input position information of the key points with the face features and generating an unobstructed frontal face image. Because the position information of the key points is multi-dimensional data containing rich information about the face key points, fusing it with the face features to jointly generate the unobstructed frontal face image yields a high-quality frontal face image; specifically, the face features extracted from the generated frontal face image have high similarity with the face features extracted from the original image in step 601.
Please refer to fig. 7, which is a schematic diagram illustrating another embodiment of an image processing method according to an embodiment of the present application. The fusion method of the generator is as follows: an input image I 701 is passed through a key point model 702 and a feature extraction model 703 to obtain, respectively, the position information 704 of the key points, namely a feature map F, and the face features 705, namely a feature A. The feature map F and the feature A are then input into the generator 706 for fusion: first, the feature A is expanded into a feature map 708 of size 8 × 56 × 56 through multiple deconvolution layers 707; then the feature map 708 and the feature map F 704 are fused by element-wise multiplication into a feature map 709 of size 40 × 56 × 56; finally, a frontal face image 711 of size 3 × 112 × 112 is output through multiple deconvolution layers 710.
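The following sketch illustrates one possible reading of this fusion, assuming that the 8-channel expanded feature map and the 5-channel key point map are combined by pairwise element-wise products (8 × 5 = 40 channels) and that the key point map is resized to 56 × 56 before fusion; the layer widths are likewise assumptions, since the text only fixes the tensor sizes.

```python
# Sketch of the generator fusion under stated assumptions: the face feature A is expanded
# by transposed convolutions to an 8 x 56 x 56 map, each of its 8 channels is multiplied
# element-wise with each of the 5 keypoint channels (8 * 5 = 40 channels), and further
# transposed convolutions produce a 3 x 112 x 112 frontal face image.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeneratorSketch(nn.Module):
    def __init__(self, feature_dim=512):
        super().__init__()
        # expand the 1-D feature to 8 x 56 x 56 (1x1 -> 7 -> 14 -> 28 -> 56)
        self.expand = nn.Sequential(
            nn.ConvTranspose2d(feature_dim, 64, kernel_size=7), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 8, kernel_size=4, stride=2, padding=1),
        )
        # decode the fused 40 x 56 x 56 map to a 3 x 112 x 112 image
        self.decode = nn.Sequential(
            nn.ConvTranspose2d(40, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, kernel_size=3, padding=1), nn.Tanh(),
        )

    def forward(self, feature_a, keypoint_map):
        a = self.expand(feature_a.view(feature_a.size(0), -1, 1, 1))       # B x 8 x 56 x 56
        f = F.interpolate(keypoint_map, size=(56, 56), mode="bilinear",
                          align_corners=False)                             # B x 5 x 56 x 56 (assumed resize)
        fused = (a.unsqueeze(2) * f.unsqueeze(1)).flatten(1, 2)            # B x 40 x 56 x 56
        return self.decode(fused)                                          # B x 3 x 112 x 112

g = GeneratorSketch()
img = g(torch.randn(1, 512), torch.rand(1, 5, 224, 224))
print(img.shape)  # torch.Size([1, 3, 112, 112])
```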
During use, the generator outputs a frontal face image G.
603. Carrying out face recognition through the non-shielding frontal face image;
The unobstructed frontal face image G obtained in step 602 can be used for naked-eye identification. Optionally, the image G is passed through a feature extractor to generate a face feature E of the generated image, and the feature E may be input into a face recognition system for feature comparison.
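A minimal sketch of this comparison step is given below, assuming cosine similarity against a gallery of enrolled features and an arbitrary acceptance threshold; the function and parameter names are illustrative, not part of the embodiment.

```python
# Sketch of the recognition step: extract feature E from the generated frontal image G
# and compare it with enrolled gallery features by cosine similarity.
# The threshold and all names are illustrative assumptions.
import torch
import torch.nn.functional as F

def recognize(generated_image, extractor, gallery_features, threshold=0.5):
    e = F.normalize(extractor(generated_image), dim=1)   # feature E of the generated image (1 x D)
    gallery = F.normalize(gallery_features, dim=1)        # one row per enrolled identity (N x D)
    scores = (e @ gallery.t()).squeeze(0)                 # cosine similarity to every identity
    best_score, best_id = scores.max(dim=0)
    if best_score.item() >= threshold:
        return best_id.item(), best_score.item()
    return None, best_score.item()
```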
According to the image processing method provided by the embodiment of the application, when the acquired face image is a large-angle side face image, an occluded image, a blurred image, or another image from which it is difficult to distinguish the corresponding identity with the naked eye, an unobstructed frontal face image can be generated, so that the identity can easily be distinguished with the naked eye.
When the collected face image is a large-angle side face image, an occluded image, a blurred image, or the like, the face features extracted from the original face image generally have low similarity with the face features of the identification photos in the database, so the recognition matching degree is low. By generating an unobstructed frontal face image and extracting face features from it, the image processing method provided by the embodiment of the application improves the matching degree and the recognition success rate.
The generator used in step 602 in the embodiment corresponding to fig. 6 is a deep neural network model trained in advance, and the training process of the generator is described in detail below. Please refer to fig. 8, which is a diagram illustrating an embodiment of a training process of a generator in an embodiment of the present application.
In the embodiment of the application, the generator is obtained by training a first neural network model, and optionally, the first neural network model is constructed by a plurality of deconvolution neural networks.
The face images in the image dataset used for training include side face images or occluded face images. It should be noted that each face image in the dataset has a corresponding unobstructed frontal face image; in a specific implementation, the unobstructed frontal face image may be a certificate photo. For example, the certificate photo corresponding to face image I1 is R1; the certificate photo corresponding to face image I2 is R2; the certificate photo corresponding to face image I3 is R3.
The training process of the generator comprises the following steps:
The face features A of a training sample I are extracted through the feature extractor, the position information F of the key points is extracted through the key point model, and the face features A and the position information F of the key points are input into the generator for fusion. Illustratively, the face features A are expanded into a feature map of size 8 × 56 × 56 through multiple deconvolution layers, this feature map and the key point feature map are fused by element-wise multiplication into a feature map of size 40 × 56 × 56, and finally the unobstructed frontal face image G is generated through multiple deconvolution layers. For example, taking the input image I1 801 as an example: I1 801 is input into the feature extractor to obtain the face features A1 804, I1 is input into the key point model 805 to obtain the position information F1 806 of the key points, and A1 and F1 are input into the generator 807 to obtain the unobstructed frontal face image G1 808 corresponding to the input image I1.
The face features obtained by passing the unobstructed frontal face image G1 808 through the feature extractor 809 are features E1 810, and the face features extracted by the feature extractor 803 from the identification photo R1 802 corresponding to the image I1 801 are features Z1 811.
Optionally, the objective function form of the reconstructed feature loss is:
loss_feature = (1/F) * Σ_{f=1}^{F} (E_f − Z_f)²
The face features are one-dimensional arrays, and the features E and Z have the same length, where F represents the total length of the features, f indexes the f-th value in the one-dimensional arrays, E_f represents the f-th value in feature E, and Z_f represents the f-th value in feature Z. The closer the values of E and Z at each corresponding position, the smaller the mean square error, and the smaller the difference between features E and Z.
The meaning of the objective function is to make the feature values of E and Z at each position as similar as possible.
Loss function: loss = reconstructed feature loss
Optionally, the difference between the unobstructed frontal face image G generated from the input image and the identification photo R corresponding to the input image should also be as small as possible, that is, the reconstructed image loss should be as small as possible. Optionally, the reconstructed image loss is the mean square error (MSE) between image G and image R. For example, the difference between the unobstructed frontal face image G1 808 and the identification photo R1 802 should be as small as possible.
The target function form of the loss of the reconstructed image is as follows:
loss_image = (1/P) * Σ_{p=1}^{P} (G_p − R_p)²
The sizes of image G and image R are the same, that is, the two images have the same total number of pixels, where P represents the total number of pixels, p indexes the p-th pixel in the image, G_p represents the value of the pixel at point p of image G, and R_p represents the value of the pixel at point p of image R. The meaning of the objective function is to make each pixel of G and R as similar as possible.
Loss function: loss = reconstructed feature loss + reconstructed image loss
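As an illustration of these two terms, the sketch below writes both losses as mean square errors using torch's built-in mse_loss; the tensor names mirror E, Z, G, and R above, and the equal weighting of the two terms is an assumption.

```python
# Sketch of the combined reconstruction loss: feature MSE between E (from the generated
# image G) and Z (from the certificate photo R), plus pixel MSE between G and R.
import torch.nn.functional as F

def reconstruction_loss(feat_e, feat_z, img_g, img_r):
    feature_loss = F.mse_loss(feat_e, feat_z)   # (1/F) * sum_f (E_f - Z_f)^2
    image_loss = F.mse_loss(img_g, img_r)       # (1/P) * sum_p (G_p - R_p)^2
    return feature_loss + image_loss            # loss = reconstructed feature loss + reconstructed image loss
```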
Optionally, a discriminator may be introduced in the training of the generator, and for each training sample, the discriminator and the generator need to be trained alternately. The following is specifically described:
The joint training process of the discriminator and the generator is as follows:
the discriminator is a neural network for judging whether the input image is a generated image. The training goal of the discriminator is to be able to discriminate the generated image, i.e. if it is the generated image input discriminator, the discriminator output value is false; if the image is a real image input discriminator, the output value of the discriminator is true. For example, in fig. 8, the discriminator determines whether the generated non-occlusion frontal face image G1808 is true, and the obtained discrimination result D1812 aims at recognizing that the true image is true for the discriminator, and the image generated by the generator is false, please refer to fig. 9, which is a schematic diagram illustrating an embodiment of the training process of the discriminator in the embodiment of the present application.
For example, after features and key points are extracted from the input image I1 901 and the image is reconstructed by the generator 902, the generated image G1 903 is judged by the discriminator 904 to obtain a judgment result D1 905, and the identification photo R1 906 corresponding to the input image, i.e., the real face image, is also input into the discriminator 904 to obtain a judgment result T1 907. The discriminator should, as far as possible, recognize the generated image as false and the identification photo as real; its objective function is min(log(D(G)) + log(1 − D(R))), where D(G) is the discriminator's judgment of the image generated by the generator and D(R) is its judgment of the real face image. During this discriminator training process, the generator is not trained.
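For illustration, the discriminator objective above can be written directly as code, assuming the discriminator outputs a probability in (0, 1); the eps term guarding against log(0) is an added assumption.

```python
# Sketch of the discriminator objective min( log(D(G)) + log(1 - D(R)) ):
# minimizing it drives D(G) toward 0 (generated image judged false)
# and D(R) toward 1 (real certificate photo judged true).
import torch

def discriminator_loss(d_of_g, d_of_r, eps=1e-7):
    # d_of_g = D(G): score for the generated frontal face image
    # d_of_r = D(R): score for the real face image
    return torch.log(d_of_g + eps).mean() + torch.log(1.0 - d_of_r + eps).mean()
```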
After the discriminator has trained for one cycle, the generator is trained. As shown in fig. 8, the loss function of the generator includes three parts: the reconstructed feature loss, the reconstructed image loss, and the judgment loss. The judgment loss means that the generated image should confuse the discriminator as much as possible so that the discriminator judges it to be true, i.e., min(log(1 − D(G))). In the embodiment of the application, during training of the generator, the generated image can be input into a pre-trained discriminator for judgment, with the goal that the generated unobstructed frontal face image is judged to be true, that is, the discriminator is confused into judging the generated unobstructed frontal face image to be a real frontal face image rather than a generated one.
Loss function: loss = reconstructed feature loss + reconstructed image loss + judgment loss
The judgment loss is log(1 − D(G)), where 1 − D(G) is the probability that the discriminator judges the unobstructed frontal face image generated by the generator to be false; minimizing this term drives the discriminator to judge the generated unobstructed frontal face image as true.
Here D is the discriminator, G is the image generated by the generator, and D(G) is a decimal between 0 and 1 representing the discriminator's judgment of the generated image, where a value of 1 means G is judged true and 0 means G is judged false; R is the real image, and D(R) is a decimal between 0 and 1 representing the discriminator's judgment of the real face image, where a value of 1 means R is judged true and 0 means R is judged false.
The whole training process alternates between training the discriminator and training the generator.
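The sketch below strings the pieces together as one alternating step, under the assumptions already stated: one discriminator update with the generator frozen, then one generator update whose loss is the sum of the reconstructed feature loss, the reconstructed image loss, and the judgment term log(1 − D(G)). The optimizers, helper names, and single-step alternation are illustrative choices, not the embodiment's prescribed schedule.

```python
# Sketch of one alternating training step: discriminator first (generator frozen),
# then the generator with reconstruction losses plus the judgment loss.
import torch

def train_step(batch_img, cert_photo, extractor, keypoint_model, generator, discriminator,
               opt_g, opt_d, eps=1e-7):
    feat_a = extractor(batch_img)          # face features A of the training sample
    kp_map = keypoint_model(batch_img)     # keypoint position information F

    # --- discriminator step: generator is not updated ---
    with torch.no_grad():
        fake = generator(feat_a, kp_map)
    d_loss = torch.log(discriminator(fake) + eps).mean() + \
             torch.log(1.0 - discriminator(cert_photo) + eps).mean()
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- generator step: reconstructed feature loss + reconstructed image loss + judgment loss ---
    fake = generator(feat_a, kp_map)
    feat_e, feat_z = extractor(fake), extractor(cert_photo)
    g_loss = torch.nn.functional.mse_loss(feat_e, feat_z) \
           + torch.nn.functional.mse_loss(fake, cert_photo) \
           + torch.log(1.0 - discriminator(fake) + eps).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```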
When face retrieval is performed with a face recognition system, the images stored in the system database to represent people's identities are usually frontal face pictures such as identification photos, while the input query images are usually surveillance photos, so the picture style and data distribution of the query data domain are inconsistent with those of the queried data domain. In the image processing method provided by the embodiment of the application, the generator model is trained with the identification photos corresponding to the face images, so the face features extracted from the unobstructed frontal face image generated by the trained generator shift the surveillance-photo features of the query data domain toward the identification-photo features of the queried data domain, that is, they achieve a domain transfer function, which can improve the overall performance of the retrieval system.
According to the image processing method provided by the embodiment of the application, the feature extraction model can be replaced by an encoder in the self-encoder. This will be described in detail below.
An autoencoder is a neural network that uses a back-propagation algorithm to make the output equal to the input: it first compresses the input into a latent space representation and then reconstructs the output from this representation. The autoencoder consists of two parts: the encoder compresses the input into a latent space representation, which can be expressed by the coding function h = f(x), and the decoder reconstructs the input from the latent space representation, which can be expressed by the decoding function r = g(h). The latent space representation is a feature vector that the decoder needs to restore to the original as far as possible, i.e., it needs to retain as much of the original image information as possible; it generally cannot be used to compare similarity with features of other images and has no discriminative power. In the embodiment of the application, the encoder of an autoencoder can compress the input image into a latent space representation to replace the face features extracted by the feature extraction module. The latent space representation and the position information of the key points are then input into the trained generator to generate an unobstructed frontal face image.
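A minimal sketch of such an autoencoder, under the assumption of a fully connected encoder and decoder over a flattened image, is shown below; only the encoder half would be used in place of the feature extraction module.

```python
# Minimal autoencoder sketch: an encoder h = f(x) compressing the input into a latent
# representation and a decoder r = g(h) reconstructing it. Layer sizes are assumptions.
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=224 * 224 * 3, latent_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 1024), nn.ReLU(), nn.Linear(1024, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 1024), nn.ReLU(), nn.Linear(1024, input_dim))

    def forward(self, x):
        h = self.encoder(x)   # h = f(x): latent space representation
        r = self.decoder(h)   # r = g(h): reconstruction of the input
        return r, h

# Training drives r back toward x, e.g. with nn.MSELoss()(r, x.flatten(1)).
```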
In this variant, the encoder of the autoencoder compresses the input image into a latent space representation in place of the face features extracted by the feature extraction module. The autoencoder has a simpler structure and reduces the amount of computation; the drawback is that the latent space representation extracted by the autoencoder contains less identity information than the face features, so compared with the scheme of extracting face features with the feature extractor, the generated frontal face image contains less identity information, the naked-eye recognition effect is worse or the matching degree of face recognition by a recognition algorithm is lower, and the recognition success rate is lower.
With reference to fig. 10, a schematic diagram of an embodiment of an image processing apparatus according to the embodiment of the present application is shown.
One or more of the various modules in fig. 10 may be implemented in software, hardware, firmware, or a combination thereof. The software or firmware includes, but is not limited to, computer program instructions or code and may be executed by a hardware processor. The hardware includes, but is not limited to, various integrated circuits such as a Central Processing Unit (CPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), or an Application Specific Integrated Circuit (ASIC).
The image processing apparatus includes:
an obtaining unit 1001, configured to obtain a face feature of a face image and position information of a key point, where the face feature includes a feature vector of face information in the face image, and the key point includes a feature point representing a face position in the face image;
the processing unit 1002 is configured to obtain an unobstructed front face image corresponding to the face image through a pre-trained first neural network model according to the face features and the location information of the key points.
Optionally, the first neural network model comprises one or more deconvolution layers.
Optionally, the location information of the keypoints includes a keypoint feature map; the processing unit 1002 is specifically configured to: inputting the human face features into a first deconvolution layer to obtain a first feature map; fusing the first feature and the key point feature map to obtain a second feature map; and inputting the second feature map into a second deconvolution layer to obtain the unobstructed front face image.
Optionally, the first neural network model is obtained after the first initial network model is trained, and the apparatus further includes: a training unit 1003, configured to input the first facial feature of the first training sample and the position information of the first keypoint into the first initial network model for training, so as to obtain a first loss; the processing unit 1002 is further configured to update a weight parameter in the first initial network model according to the first loss to obtain the first neural network model.
Optionally, the first loss is a difference between a first feature and a second feature, the first feature is a face feature extracted from a front face image corresponding to the first training sample, the second feature is a face feature extracted from a first generated image, and the first generated image is an unobstructed front face image obtained by inputting the first training sample into the first initial network model.
Optionally, the first loss includes a difference between the first feature and the second feature, and a difference between the frontal face image and the first generated image corresponding to the first training sample.
Optionally, the first loss includes a difference between the first feature and the second feature, a difference between a frontal face image corresponding to the first training sample and a first generated image, and a determination loss, the determination loss is a probability that the first generated image is determined to be false by a discriminator, and the discriminator is configured to determine the real image to be true and determine the generated image to be false.
Optionally, the obtaining unit 1001 is specifically configured to: inputting the face image into a second neural network model to obtain the face features; and inputting the face image into a third neural network model to obtain the position information of the key point.
Optionally, the second neural network model includes a feature extractor or an encoder, the feature extractor is a neural network that outputs human face features according to an input human face image, and the encoder is a part of an auto-encoder.
Optionally, the third neural network model includes a key point model, and the key point model is a neural network model that outputs position information of key points according to the input face image.
Optionally, the non-occluded frontal face image is used for naked eye recognition, or is used for inputting a face recognition system to realize face recognition.
Please refer to fig. 11, which is a schematic diagram of another embodiment of an image processing apparatus according to an embodiment of the present application.
The image processing apparatus 1100 may vary considerably in configuration or performance, and may include one or more processors 1101 and a memory 1102, where the memory 1102 stores programs or data.
The memory 1102 may be volatile or non-volatile memory. Optionally, the processor 1101 is one or more Central Processing Units (CPUs), which may be single-core or multi-core CPUs; the processor 1101 may communicate with the memory 1102 to execute a series of instructions in the memory 1102 on the image processing apparatus 1100.
The image processing device 1100 also includes one or more wired or wireless network interfaces 1103, such as ethernet interfaces.
Optionally, although not shown in fig. 11, the image processing apparatus 1100 may further include one or more power supplies and an input/output interface. The input/output interface may be used to connect a display, a mouse, a keyboard, a touch screen device, a sensing device, or the like; the input/output interface is an optional component and may or may not be present, which is not limited here.
The process executed by the processor 1101 in the image processing apparatus 1100 in this embodiment may refer to the method process described in the foregoing method embodiment, which is not described herein again.
Please refer to fig. 12, which is a diagram of a chip hardware structure according to an embodiment of the present disclosure.
The embodiment of the present application provides a chip system, which can be used to implement the image processing method, and specifically, the algorithm based on the convolutional neural network shown in fig. 3 and 4 can be implemented in the NPU chip shown in fig. 12.
The neural network processor NPU 50 is mounted as a coprocessor on a main CPU (Host CPU), and tasks are allocated by the Host CPU. The core portion of the NPU is an arithmetic circuit 503, and the controller 504 controls the arithmetic circuit 503 to extract matrix data in the memory and perform multiplication.
In some implementations, the arithmetic circuit 503 includes a plurality of processing units (PEs) therein. In some implementations, the operational circuitry 503 is a two-dimensional systolic array. The arithmetic circuit 503 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuitry 503 is a general-purpose matrix processor.
For example, assume there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to matrix B from the weight memory 502 and buffers it on each PE in the arithmetic circuit. The arithmetic circuit takes the data of matrix A from the input memory 501 and performs a matrix operation with matrix B, and the partial or final results of the obtained matrix are stored in the accumulator 508.
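As a software analogy only (not a model of the actual circuit), the accumulation pattern described here can be pictured as summing tile-by-tile partial products of A × B into an accumulator:

```python
# Illustration of accumulating partial products of A x B into an accumulator, tile by tile.
# Matrix sizes and the tile width are arbitrary choices for the example.
import numpy as np

A = np.random.rand(8, 16)   # input matrix from the input memory
B = np.random.rand(16, 4)   # weight matrix from the weight memory
C = np.zeros((8, 4))        # accumulator holding partial results

tile = 4
for k in range(0, 16, tile):                  # walk over the shared dimension in tiles
    C += A[:, k:k + tile] @ B[k:k + tile, :]  # accumulate each partial product
assert np.allclose(C, A @ B)
```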
The unified memory 506 is used to store input data as well as output data. The weight data is directly transferred to the weight memory 502 by a memory access controller 505 (DMAC). The input data is also carried through the DMAC into the unified memory 506.
The BIU is the Bus Interface Unit 510, which is used for the interaction between the AXI bus and the DMAC and the instruction fetch buffer 509.
The bus interface unit 510 (BIU) is used for the instruction fetch buffer 509 to fetch instructions from the external memory, and for the storage unit access controller 505 to fetch the original data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 506 or to transfer weight data into the weight memory 502 or to transfer input data into the input memory 501.
The vector calculation unit 507 may include a plurality of operation processing units and, if necessary, further processes the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, and magnitude comparison. It is mainly used for non-convolution/FC layer calculations in the neural network, such as pooling, batch normalization, and local response normalization.
In some implementations, the vector calculation unit 507 can store the processed output vector to the unified memory 506. For example, the vector calculation unit 507 may apply a non-linear function to the output of the arithmetic circuit 503, such as a vector of accumulated values, to generate activation values. In some implementations, the vector calculation unit 507 generates normalized values, combined values, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 503, for example for use in subsequent layers of the neural network.
An instruction fetch buffer 509 connected to the controller 504 for storing instructions used by the controller 504;
the unified memory 506, the input memory 501, the weight memory 502, and the instruction fetch memory 509 are all On-Chip memories. The external memory is private to the NPU hardware architecture.
Among them, the operations of the layers in the convolutional neural networks shown in fig. 3 and 4 may be performed by the matrix calculation unit 212 or the vector calculation unit 507.
In the embodiments of the present application, various illustrations are given to aid understanding. However, these examples are merely examples and are not meant to be the best mode of carrying out the present application.
The above-described embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof, and when implemented using software, may be implemented in whole or in part in the form of a computer program product.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (23)

1. An image processing method, comprising:
acquiring face features of a face image and position information of key points, wherein the face features comprise feature vectors of the face information in the face image, and the key points comprise feature points representing face positions in the face image;
and obtaining an unobstructed frontal face image corresponding to the face image through a pre-trained first neural network model according to the face features and the position information of the key points.
2. The method of claim 1,
the first neural network model includes one or more deconvolution layers.
3. The method of claim 2, wherein the location information of the keypoints comprises a keypoint feature map;
obtaining an unobstructed frontal face image corresponding to the face image through a pre-trained first neural network model according to the face features and the position information of the key points, wherein the unobstructed frontal face image comprises:
inputting the human face features into a first deconvolution layer to obtain a first feature map;
fusing the first feature and the key point feature map to obtain a second feature map;
and inputting the second feature map into a second deconvolution layer to obtain the unobstructed front face image.
4. The method according to any one of claims 1 to 3,
the first neural network model is obtained after the first initial network model is trained, and the method further comprises the following steps:
inputting first face features of a first training sample and position information of a first key point into the first initial network model for training to obtain a first loss;
and updating the weight parameters in the first initial network model according to the first loss to obtain the first neural network model.
5. The method of claim 4,
the first loss is a difference between a first feature and a second feature, the first feature is a face feature extracted from a front face image corresponding to the first training sample, the second feature is a face feature extracted from a first generated image, and the first generated image is an unobstructed front face image obtained by inputting the first training sample into the first initial network model.
6. The method of claim 4,
the first loss includes a difference between the first feature and the second feature, and a difference between a frontal face image and a first generated image corresponding to the first training sample.
7. The method of claim 4,
the first loss includes a difference between the first feature and the second feature, a difference between a frontal face image corresponding to the first training sample and a first generated image, and a determination loss, the determination loss is a probability that the first generated image is determined to be false by a discriminator, the discriminator is configured to discriminate the real image as true, and discriminate the generated image as false.
8. The method according to any one of claims 1 to 7, wherein the acquiring the face features of the face image and the position information of the key points comprises:
inputting the face image into a second neural network model to obtain the face features;
and inputting the face image into a third neural network model to obtain the position information of the key point.
9. The method of claim 8,
the second neural network model comprises a feature extractor or an encoder, the feature extractor is a neural network which outputs human face features according to an input human face image, and the encoder is a part in an autoencoder.
10. The method according to any one of claims 1 to 9,
the non-shielding front face image is used for naked eye identification or is input into a face identification system to realize face identification.
11. An image processing apparatus characterized by comprising:
the system comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring the face characteristics of a face image and the position information of key points, the face characteristics comprise characteristic vectors of the face information in the face image, and the key points comprise characteristic points which represent the positions of the face in the face image;
and the processing unit is used for obtaining an unobstructed frontal face image corresponding to the face image through a pre-trained first neural network model according to the face features and the position information of the key points.
12. The apparatus of claim 11,
the first neural network model includes one or more deconvolution layers.
13. The apparatus of claim 12, wherein the location information of the keypoints comprises a keypoint feature map;
the processing unit is specifically configured to:
inputting the human face features into a first deconvolution layer to obtain a first feature map;
fusing the first feature and the key point feature map to obtain a second feature map;
and inputting the second feature map into a second deconvolution layer to obtain the unobstructed front face image.
14. The apparatus according to any one of claims 11 to 13,
the first neural network model is obtained after the first initial network model is trained, and the device further comprises:
the training unit is used for inputting the first face characteristics of the first training sample and the position information of the first key point into the first initial network model for training to obtain a first loss;
the processing unit is further configured to update a weight parameter in the first initial network model according to the first loss to obtain the first neural network model.
15. The apparatus according to claim 14, wherein the first loss is a difference between a first feature extracted from the face image corresponding to the first training sample and a second feature extracted from a first generated image that is an unobstructed face image obtained by inputting the first training sample into the first initial network model.
16. The apparatus of claim 14, wherein the first loss comprises a difference between the first feature and the second feature and a difference between the corresponding frontal face image and the first generated image of the first training sample.
17. The apparatus of claim 14, wherein the first loss comprises a difference between the first feature and the second feature, a difference between a frontal face image corresponding to the first training sample and a first generated image, and a determination loss, wherein the determination loss is a probability that a discriminator discriminates the first generated image as false, wherein the discriminator is configured to discriminate a true image as true and discriminate a generated image as false.
18. The apparatus of any one of claims 11 to 17,
the obtaining unit is specifically configured to:
inputting the face image into a second neural network model to obtain the face features;
and inputting the face image into a third neural network model to obtain the position information of the key point.
19. The apparatus of claim 18, wherein the second neural network model comprises a feature extractor or an encoder, the feature extractor is a neural network that outputs facial features from an input facial image, and the encoder is a part of an autoencoder.
20. The apparatus according to any one of claims 11 to 19, wherein the unobstructed frontal image is used for visual recognition or for input to a face recognition system for face recognition.
21. An image processing apparatus comprising a processor and a memory, the processor and the memory being interconnected, wherein the memory is configured to store a computer program comprising program instructions, and wherein the processor is configured to invoke the program instructions to perform the method of any of claims 1 to 10.
22. A computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1 to 10.
23. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the method of any of claims 1 to 10.
CN202010355012.1A 2020-04-29 2020-04-29 Image processing method and image processing apparatus Pending CN113569598A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010355012.1A CN113569598A (en) 2020-04-29 2020-04-29 Image processing method and image processing apparatus
PCT/CN2021/071017 WO2021218238A1 (en) 2020-04-29 2021-01-11 Image processing method and image processing apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010355012.1A CN113569598A (en) 2020-04-29 2020-04-29 Image processing method and image processing apparatus

Publications (1)

Publication Number Publication Date
CN113569598A true CN113569598A (en) 2021-10-29

Family

ID=78158432

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010355012.1A Pending CN113569598A (en) 2020-04-29 2020-04-29 Image processing method and image processing apparatus

Country Status (2)

Country Link
CN (1) CN113569598A (en)
WO (1) WO2021218238A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116311568A (en) * 2023-05-16 2023-06-23 广州铭创通讯科技有限公司 ETC-based parking lot face recognition quick fee deduction method and device

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113963183B (en) * 2021-12-22 2022-05-31 合肥的卢深视科技有限公司 Model training method, face recognition method, electronic device and storage medium
CN114331904B (en) * 2021-12-31 2023-08-08 电子科技大学 Face shielding recognition method
CN116030512B (en) * 2022-08-04 2023-10-31 荣耀终端有限公司 Gaze point detection method and device
CN117238020B (en) * 2023-11-10 2024-04-26 杭州启源视觉科技有限公司 Face recognition method, device and computer equipment

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104361328B (en) * 2014-11-21 2018-11-02 重庆中科云丛科技有限公司 A kind of facial image normalization method based on adaptive multiple row depth model
CN106599878A (en) * 2016-12-28 2017-04-26 深圳市捷顺科技实业股份有限公司 Face reconstruction correction method and device based on deep learning
CN108319953B (en) * 2017-07-27 2019-07-16 腾讯科技(深圳)有限公司 Occlusion detection method and device, electronic equipment and the storage medium of target object
CN108205659A (en) * 2017-11-30 2018-06-26 深圳市深网视界科技有限公司 Face occluder removes and its method, equipment and the medium of model construction
CN108875511B (en) * 2017-12-01 2022-06-21 北京迈格威科技有限公司 Image generation method, device, system and computer storage medium
US10733292B2 (en) * 2018-07-10 2020-08-04 International Business Machines Corporation Defending against model inversion attacks on neural networks
CN109063695A (en) * 2018-09-18 2018-12-21 图普科技(广州)有限公司 A kind of face critical point detection method, apparatus and its computer storage medium
CN109684973B (en) * 2018-12-18 2023-04-07 哈尔滨工业大学 Face image filling system based on symmetric consistency convolutional neural network
CN110751009A (en) * 2018-12-20 2020-02-04 北京嘀嘀无限科技发展有限公司 Face recognition method, target recognition device and electronic equipment

Also Published As

Publication number Publication date
WO2021218238A1 (en) 2021-11-04

Similar Documents

Publication Publication Date Title
CN111274916B (en) Face recognition method and face recognition device
US11232286B2 (en) Method and apparatus for generating face rotation image
WO2021043168A1 (en) Person re-identification network training method and person re-identification method and apparatus
WO2021218238A1 (en) Image processing method and image processing apparatus
CN113362382A (en) Three-dimensional reconstruction method and three-dimensional reconstruction device
WO2021139324A1 (en) Image recognition method and apparatus, computer-readable storage medium and electronic device
CN110503076B (en) Video classification method, device, equipment and medium based on artificial intelligence
CN111368672A (en) Construction method and device for genetic disease facial recognition model
CN110222718B (en) Image processing method and device
CN112215180B (en) Living body detection method and device
CN110555481A (en) Portrait style identification method and device and computer readable storage medium
CN111832592B (en) RGBD significance detection method and related device
CN112530019A (en) Three-dimensional human body reconstruction method and device, computer equipment and storage medium
CN111783748A (en) Face recognition method and device, electronic equipment and storage medium
CN113065576A (en) Feature extraction method and device
CN115222896B (en) Three-dimensional reconstruction method, three-dimensional reconstruction device, electronic equipment and computer readable storage medium
CN113807183A (en) Model training method and related equipment
CN113781519A (en) Target tracking method and target tracking device
CN113361549A (en) Model updating method and related device
CN113449548A (en) Method and apparatus for updating object recognition model
CN114842466A (en) Object detection method, computer program product and electronic device
CN113673505A (en) Example segmentation model training method, device and system and storage medium
Reale et al. Facial action unit analysis through 3d point cloud neural networks
CN111382791A (en) Deep learning task processing method, image recognition task processing method and device
CN113128285A (en) Method and device for processing video

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination