CN108229532B - Image recognition method and device and electronic equipment

Image recognition method and device and electronic equipment

Info

Publication number
CN108229532B
Authority
CN
China
Prior art keywords
image
recognized
feature vector
identified
image sample
Prior art date
Legal status
Active
Application number
CN201711042845.7A
Other languages
Chinese (zh)
Other versions
CN108229532A (en)
Inventor
王飞
黄诗尧
钱晨
Current Assignee
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd filed Critical Beijing Sensetime Technology Development Co Ltd
Priority to CN201711042845.7A
Publication of CN108229532A
Application granted
Publication of CN108229532B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/22 - Matching criteria, e.g. proximity measures
    • G06F18/24 - Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The embodiment of the application discloses an image recognition method, an image recognition device, electronic equipment and a computer readable medium, wherein the method comprises the following steps: inputting an image to be recognized into a deep neural network; outputting image features of the image to be recognized through the deep neural network; and recognizing the image to be recognized based on the image features of the image to be recognized. The deep neural network is obtained through training with a ternary loss function, and the triplet in the ternary loss function is obtained by using at least two centers of the category centers included in the deep neural network and the feature vectors of those centers.

Description

Image recognition method and device and electronic equipment
Technical Field
The present application relates to computer vision technologies, and in particular, to an image recognition method, medium, image recognition apparatus, and electronic device.
Background
For implementing face recognition, object classification, scene classification, or motion recognition on an image, a corresponding image recognition algorithm is usually used, for example, a face recognition algorithm, an object class recognition algorithm, a scene class recognition algorithm, or a motion recognition algorithm.
One of the purposes of image recognition algorithms is to enable a deep neural network (such as a deep convolutional neural network) to learn more compact target feature vectors, such as face feature vectors, object feature vectors, scene feature vectors, or action feature vectors. Learning a more compact target feature vector means that the target feature vectors extracted by the deep neural network for different images of the same target are as close as possible in the feature space, i.e., the intra-class variance is as small as possible, while the target feature vectors extracted by the deep neural network for different images of different targets are as far apart as possible in the feature space, i.e., the inter-class variance is as large as possible. As a specific example, a deep neural network learning a more compact face feature vector means that the face feature vectors extracted by the deep neural network for different images of the same person are as close as possible in the feature space (the intra-class variance is as small as possible), while the face feature vectors extracted by the deep neural network for different images of different persons are as far apart as possible in the feature space (the inter-class variance is as large as possible).
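As a concrete illustration of this notion of compactness (a hypothetical NumPy sketch, not part of the application), one can measure the average intra-class and inter-class cosine similarity of a set of extracted feature vectors; a more compact embedding yields a higher intra-class value and a lower inter-class value:

    # Hypothetical illustration: measuring how "compact" a set of target feature vectors is.
    import numpy as np

    def compactness(features, labels):
        """features: (n, K) array of feature vectors; labels: (n,) array of class ids."""
        f = features / np.linalg.norm(features, axis=1, keepdims=True)
        sims = f @ f.T                                  # pairwise cosine similarities
        same = labels[:, None] == labels[None, :]
        np.fill_diagonal(same, False)                   # ignore self-similarity
        diff = ~same
        np.fill_diagonal(diff, False)
        intra = sims[same].mean()                       # higher = smaller intra-class variance
        inter = sims[diff].mean()                       # lower = larger inter-class variance
        return intra, inter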
How to enable the deep neural network to learn more compact target feature vectors (such as human face feature vectors) so as to improve the image recognition accuracy of the deep neural network is a technical problem worthy of attention.
Disclosure of Invention
The embodiment of the application provides an implementation technical scheme of image recognition.
According to one aspect of the embodiments of the present application, there is provided an image recognition method, including: inputting an image to be recognized into a deep neural network; outputting image features of the image to be recognized through the deep neural network; and recognizing the image to be recognized based on the image features of the image to be recognized; wherein the deep neural network is obtained through training with a ternary loss function, and the triplet in the ternary loss function is obtained by using at least two centers of the category centers included in the deep neural network and the feature vectors of those centers.
In an embodiment of the application, the recognizing the image to be recognized based on the image features of the image to be recognized includes at least one of: performing face recognition on the image to be recognized based on the image features of the image to be recognized; performing gesture recognition on the image to be recognized based on the image features of the image to be recognized; performing pedestrian recognition on the image to be recognized based on the image features of the image to be recognized; performing vehicle recognition on the image to be recognized based on the image features of the image to be recognized; performing action recognition on the image to be recognized based on the image features of the image to be recognized; and performing scene recognition on the image to be recognized based on the image features of the image to be recognized.
In yet another embodiment of the present application, the method further comprises: training the deep neural network; the training the deep neural network comprises: acquiring a target feature vector of an image sample to be recognized based on the deep neural network; calculating the similarity between the target feature vector of the image sample to be recognized and the feature vectors of at least two centers in the category centers, and selecting one similarity from the calculated similarities; and performing supervised learning on the deep neural network through a ternary loss function based on a triplet formed by the target feature vector of the image sample to be recognized, the feature vector of the center corresponding to the image sample to be recognized in the category centers, and the feature vector of the center corresponding to the selected similarity.
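A minimal sketch of one such training step for a single sample is given below (assumed PyTorch-style code; embedding_net, class_centers, the margin value, and the choice to compare against all other centers are illustrative assumptions rather than the prescribed implementation):

    # Illustrative sketch of one training step: extract the target feature vector, compute its
    # similarity to the other category centers, pick the most similar one, and form the triplet loss.
    import torch
    import torch.nn.functional as F

    def training_step(embedding_net, class_centers, image, label, margin=0.2):
        """class_centers: (N, K) tensor holding one feature vector per category center."""
        f = F.normalize(embedding_net(image.unsqueeze(0)).squeeze(0), dim=0)   # (K,)
        centers = F.normalize(class_centers, dim=1)                            # (N, K)

        with torch.no_grad():
            sims = centers @ f                   # cosine similarity to every center
            sims[label] = -float("inf")          # exclude the sample's own center
            neg_idx = int(sims.argmax())         # most similar other center

        pos_sim = torch.dot(f, centers[label])   # basic element vs. positive example element
        neg_sim = torch.dot(f, centers[neg_idx]) # basic element vs. negative example element
        return torch.clamp(neg_sim - pos_sim + margin, min=0.0)   # per-sample ternary loss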
In yet another embodiment of the present application, the dimension of the target feature vector of the image sample to be identified, which is obtained based on the deep neural network, is the same as the dimension of the feature vector of each of the category centers.
In yet another embodiment of the present application, the calculating the similarity between the target feature vector of the image sample to be recognized and the feature vectors of at least two of the class centers includes: and calculating the similarity between the target feature vector of the image sample to be recognized and the feature vectors of all the centers except the center corresponding to the image sample to be recognized in the category center.
In yet another embodiment of the present application, the calculating the similarity between the target feature vector of the image sample to be recognized and the feature vectors of at least two of the class centers includes: selecting M centers except the center corresponding to the image sample to be identified from the category centers, and respectively calculating the similarity between the target feature vector of the image sample to be identified and the feature vectors of the M centers; wherein M is an integer not less than 2, M is less than N-1, and N is the number of centers contained in the category centers.
In yet another embodiment of the present application, the calculating the similarity between the target feature vector of the image sample to be recognized and the feature vectors of at least two of the class centers includes: and calculating cosine similarity between the target feature vector of the image sample to be identified and the feature vectors of at least two of the class centers.
In another embodiment of the present application, the process of performing cosine similarity calculation on the target feature vector of the image sample to be recognized and the feature vector of one of the class centers includes: respectively carrying out normalization processing on the target characteristic vector of the image sample to be recognized and the characteristic vector of the center by utilizing the modulus of the target characteristic vector of the image sample to be recognized and the modulus of the characteristic vector of the center; and calculating the dot product of the two results after the normalization processing to obtain the cosine similarity between the target characteristic vector of the image sample to be identified and the characteristic vector of the center.
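A small NumPy sketch of this normalize-then-dot-product computation (variable names are illustrative):

    # Illustrative: normalize each vector by its modulus, then take the dot product
    # of the two normalized results to obtain the cosine similarity.
    import numpy as np

    def cosine_similarity(f, c):
        """f: target feature vector of the image sample; c: feature vector of one center."""
        f_hat = f / np.linalg.norm(f)   # normalization by the modulus of f
        c_hat = c / np.linalg.norm(c)   # normalization by the modulus of c
        return float(np.dot(f_hat, c_hat))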
In another embodiment of the present application, the selecting one of the similarities includes: and selecting the highest similarity from the similarities.
In another embodiment of the present application, the performing supervised learning on the deep neural network via a ternary loss function based on a triplet formed by a target feature vector of the image sample to be recognized, a feature vector of a center corresponding to the image sample to be recognized in a category center, and a feature vector of a center corresponding to the selected similarity includes: taking the target characteristic vector of the image sample to be identified as a basic element of the triple, taking the characteristic vector of the center corresponding to the image sample to be identified in the category center as a positive example, and taking the characteristic vector of the center corresponding to the selected similarity as a negative example; and performing supervised learning on the deep neural network through a ternary loss function based on the cosine similarity between the basic elements and the positive examples and the cosine similarity between the basic elements and the negative examples.
In yet another embodiment of the present application, the ternary loss function includes: and a ternary loss function for relaxing the difference between the cosine similarity of the basic element and the negative example element and the cosine similarity of the basic element and the positive example element by using a preset constant.
In another embodiment of the present application, the feature vector of the center corresponding to the image sample to be identified in the category centers includes: the feature vector of the center in the category centers whose label matches the label of the image sample to be identified.
In another embodiment of the present application, the image sample to be recognized includes a face-based image sample to be recognized, and the target feature vector includes a face feature vector; or the image sample to be recognized includes a gesture-based image sample to be recognized, and the target feature vector includes a gesture feature vector; or the image sample to be recognized includes a pedestrian-based image sample to be recognized, and the target feature vector includes a pedestrian feature vector; or the image sample to be recognized includes a vehicle-based image sample to be recognized, and the target feature vector includes a vehicle feature vector; or the image sample to be recognized includes a motion-based image sample to be recognized, and the target feature vector includes a motion feature vector; or the image sample to be recognized includes a scene-based image sample to be recognized, and the target feature vector includes a scene feature vector.
According to another aspect of the embodiments of the present application, there is provided an image recognition apparatus including: an input module, configured to input an image to be recognized into a deep neural network; an acquisition module, configured to output image features of the image to be recognized through the deep neural network; and a recognition module, configured to recognize the image to be recognized based on the image features of the image to be recognized; wherein the deep neural network is obtained through training with a ternary loss function, and the triplet in the ternary loss function is obtained by using at least two centers of the category centers included in the deep neural network and the feature vectors of those centers.
In an embodiment of the application, the recognition module is specifically configured to perform at least one of: face recognition on the image to be recognized based on the image features of the image to be recognized; gesture recognition on the image to be recognized based on the image features of the image to be recognized; pedestrian recognition on the image to be recognized based on the image features of the image to be recognized; vehicle recognition on the image to be recognized based on the image features of the image to be recognized; action recognition on the image to be recognized based on the image features of the image to be recognized; and scene recognition on the image to be recognized based on the image features of the image to be recognized.
In yet another embodiment of the present application, the apparatus further comprises: the training module is used for training the deep neural network; the training module comprises: the feature vector obtaining submodule is used for obtaining a target feature vector of an image sample to be identified based on a deep neural network; the similarity calculation submodule is used for calculating the similarity between the target feature vector of the image sample to be identified and the feature vectors of at least two centers in the category centers; selecting a similarity submodule for selecting a similarity from the similarities; and the supervised learning submodule is used for carrying out supervised learning on the deep neural network through a ternary loss function based on a triplet formed by the target characteristic vector of the image sample to be recognized, the characteristic vector of the center corresponding to the image sample to be recognized in the class center and the characteristic vector of the center corresponding to the selected similarity.
In yet another embodiment of the present application, the dimension of the target feature vector of the image sample to be identified that is obtained based on the deep neural network is the same as the dimension of the feature vector of each center in the category centers.
In another embodiment of the present application, the module for calculating similarity is specifically configured to: and calculating the similarity between the target feature vector of the image sample to be recognized and the feature vectors of all the centers except the center corresponding to the image sample to be recognized in the category center.
In another embodiment of the present application, the module for calculating similarity is specifically configured to: selecting M centers except the center corresponding to the image sample to be identified from the category centers, and respectively calculating the similarity between the target feature vector of the image sample to be identified and the feature vectors of the M centers; wherein M is an integer not less than 2, M is less than N-1, and N is the number of centers contained in the category centers.
In another embodiment of the present application, the module for calculating similarity is specifically configured to: and calculating cosine similarity between the target feature vector of the image sample to be identified and the feature vectors of at least two of the class centers.
In another embodiment of the present application, the process of performing cosine similarity calculation on the target feature vector of the image sample to be identified and the feature vector of one of the class centers by the calculate similarity submodule includes: respectively carrying out normalization processing on the target characteristic vector of the image sample to be recognized and the characteristic vector of the center by utilizing the modulus of the target characteristic vector of the image sample to be recognized and the modulus of the characteristic vector of the center; and calculating the dot product of the two results after the normalization processing to obtain the cosine similarity between the target characteristic vector of the image sample to be identified and the characteristic vector of the center.
In another embodiment of the present application, the selected similarity submodule is specifically configured to: and selecting the highest similarity from the similarities.
In yet another embodiment of the present application, the supervised learning sub-module is specifically configured to: taking the target characteristic vector of the image sample to be identified as a basic element of the triple, taking the characteristic vector of the center corresponding to the image sample to be identified in the category center as a positive example, and taking the characteristic vector of the center corresponding to the selected similarity as a negative example; and performing supervised learning on the deep neural network through a ternary loss function based on the cosine similarity between the basic elements and the positive examples and the cosine similarity between the basic elements and the negative examples.
In yet another embodiment of the present application, the ternary loss function includes: and a ternary loss function for relaxing the difference between the cosine similarity of the basic element and the negative example element and the cosine similarity of the basic element and the positive example element by using a preset constant.
In another embodiment of the present application, the feature vector of the center corresponding to the image sample to be identified in the category centers includes: the feature vector of the center in the category centers whose label matches the label of the image sample to be identified.
In another embodiment of the present application, the image sample to be recognized includes a face-based image sample to be recognized, and the target feature vector includes a face feature vector; or the image sample to be recognized includes a gesture-based image sample to be recognized, and the target feature vector includes a gesture feature vector; or the image sample to be recognized includes a pedestrian-based image sample to be recognized, and the target feature vector includes a pedestrian feature vector; or the image sample to be recognized includes a vehicle-based image sample to be recognized, and the target feature vector includes a vehicle feature vector; or the image sample to be recognized includes a motion-based image sample to be recognized, and the target feature vector includes a motion feature vector; or the image sample to be recognized includes a scene-based image sample to be recognized, and the target feature vector includes a scene feature vector.
According to still another aspect of embodiments of the present application, there is provided an electronic apparatus including: a memory for storing a computer program; a processor for executing the computer program stored in the memory, and when the computer program is executed, executing the steps of the method embodiments of the present application.
According to yet another aspect of embodiments of the present application, there is provided a computer storage medium having a computer program stored thereon, which, when executed by a processor, performs the steps of embodiments of the method of the present application.
According to a further aspect of embodiments of the present application, there is provided a computer program which, when executed by a processor in a device, performs the steps of embodiments of the method of the present application.
Based on the image recognition method, the image recognition device, the electronic equipment and the computer storage medium described above, the deep neural network is obtained by training with a ternary loss function. By setting category centers for the feature vectors of the categories and calculating the similarity between the target feature vector of the image sample to be recognized and the feature vectors of at least two centers in the category centers, a triplet can be conveniently and quickly formed from the target feature vector of the image sample to be recognized, the feature vector of the center corresponding to the image sample to be recognized in the category centers, and the feature vector of the center corresponding to the similarity selected from the calculated similarities, which avoids the difficulty of selecting suitable samples when forming triplets. Performing supervised learning on the deep neural network through the ternary loss function based on such triplets enables the deep neural network to learn more compact target feature vectors, thereby improving the image recognition accuracy of the deep neural network.
The technical solution of the present application is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description, serve to explain the principles of the application.
The present application may be more clearly understood from the following detailed description with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of one embodiment of an image recognition method of the present application;
FIG. 2 is a flow chart of one embodiment of a face recognition method of the present application;
FIG. 3 is a flow diagram of one embodiment of a method of training a deep neural network of the present application;
FIG. 4 is a schematic structural diagram of an embodiment of an image recognition apparatus of the present application;
FIG. 5 is a schematic structural diagram of one embodiment of a training module for training a deep neural network according to the present application;
FIG. 6 is a block diagram of an exemplary device implementing embodiments of the present application;
fig. 7 is a schematic view of an application scenario of the present application.
Detailed Description
Various exemplary embodiments of the present application will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present application unless specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the application, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in a figure, further discussion of it is not necessary in subsequent figures.
The embodiments of the application are applicable to computer systems/servers operable with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with the computer system/server include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above systems, and the like.
The computer system/server may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, and data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
Exemplary embodiments
The technical scheme of image recognition provided by the present application may be implemented by an electronic device capable of running a computer program (also referred to as a program code) such as a single chip microcomputer, a microprocessor, an FPGA (Field Programmable Gate Array), an intelligent mobile phone, a notebook computer, a tablet computer, a desktop computer, or a server, and the computer program may be stored in a computer-readable storage medium such as a flash memory, a cache, a hard disk, or an optical disk.
In an alternative example, the image recognition algorithm in the present application may be embodied as a face recognition algorithm, an object class recognition algorithm, a scene class recognition algorithm, or an action recognition algorithm, etc. That is to say, the technical scheme of the application can be applied to various technical fields based on classification recognition, such as the technical field of face recognition, the technical field of object class recognition, the technical field of scene class recognition, the technical field of action recognition and the like.
The following describes a technical solution of image recognition provided in the present application with reference to fig. 1 to 3.
Fig. 1 is a flowchart of an embodiment of an image recognition method of the present application. As shown in fig. 1, the method of this embodiment mainly includes: step S100, step S110, and step S120.
And S100, inputting the image to be recognized into a deep neural network.
In an optional example, in a case that the image recognition method provided by the present application is applied to the technical field of face recognition, the image to be recognized may specifically be an image to be recognized based on a face, and the target feature vector of the image to be recognized may specifically be a face feature vector of the image to be recognized. Under the condition that the image identification method provided by the application is applied to the technical field of object category identification, the image to be identified can be specifically an image to be identified based on an object, and the target characteristic vector of the image to be identified can be specifically an object characteristic vector of the image to be identified. Under the condition that the image identification method provided by the application is applied to the technical field of scene category identification, the image to be identified can be specifically the image to be identified based on the scene, and the target characteristic vector of the image to be identified can be specifically the scene characteristic vector of the image to be identified. Under the condition that the image identification method provided by the application is applied to the technical field of motion identification, the image to be identified can be specifically an image to be identified based on motion, and the target characteristic vector of the image to be identified can be specifically a motion characteristic vector of the image to be identified.
In an alternative example, the deep neural network in the present application is obtained through training with a ternary loss function, where the triplet in the ternary loss function is obtained by using at least two centers of the category centers included in the deep neural network and the feature vectors of those centers. One specific example of training the deep neural network is described below with respect to fig. 2.
In an alternative example, the deep neural network in the present application may specifically be a deep convolutional neural network based on a network structure such as AlexNet, VGGNet, GoogLeNet, or ResNet. The application does not limit the specific structure of the deep neural network.
And S110, outputting the image characteristics of the image to be recognized through a deep neural network.
In an optional example, after the image to be recognized is processed by at least one convolution layer and at least one nonlinear ReLU layer, the deep neural network extracts image features from the input image to be recognized and outputs the image features. The specific implementation mode of extracting the image features by the deep neural network is not limited in the application.
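For illustration only, a toy feature extractor of this kind might stack convolution and ReLU layers followed by pooling and a fully connected layer (a hypothetical PyTorch sketch; the application does not prescribe any particular structure):

    # Hypothetical toy feature extractor: convolution + ReLU layers followed by a fully
    # connected layer that outputs a K-dimensional image feature for each input image.
    import torch.nn as nn

    class ToyFeatureExtractor(nn.Module):
        def __init__(self, feature_dim=256):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool2d(1),
            )
            self.fc = nn.Linear(64, feature_dim)

        def forward(self, x):                # x: (batch, 3, H, W) images to be recognized
            h = self.conv(x).flatten(1)      # (batch, 64)
            return self.fc(h)                # (batch, feature_dim) image features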
And S120, identifying the image to be identified based on the image characteristics of the image to be identified.
In an optional example, the face recognition can be performed on the image to be recognized based on the image characteristics of the image to be recognized; gesture recognition can also be carried out on the image to be recognized based on the image characteristics of the image to be recognized; the pedestrian recognition can be carried out on the image to be recognized based on the image characteristics of the image to be recognized; the vehicle identification can be carried out on the image to be identified based on the image characteristics of the image to be identified; the motion recognition can be carried out on the image to be recognized based on the image characteristics of the image to be recognized; and scene recognition can be carried out on the image to be recognized based on the image characteristics of the image to be recognized. In addition, the method and the device can also perform recognition processing such as face position detection, face key point detection, human body position detection, human body action detection, human body key point detection, living body detection and the like on the image to be recognized based on the image characteristics of the image to be recognized, and the specific expression form of recognizing the image to be recognized is not limited in the method and the device. In addition, the present application may adopt an existing neural network (such as a convolutional neural network, etc.) to implement an operation of identifying an image to be identified based on image features of the image to be identified.
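As one possible illustration of recognition based on the extracted image features, e.g. face recognition against a gallery of enrolled feature vectors, nearest-neighbour matching by cosine similarity could be used (an assumed sketch; the gallery, the threshold, and the matching rule are not specified by the application):

    # Assumed illustration: compare the feature of the image to be recognized against
    # enrolled (gallery) features by cosine similarity and return the best-matching identity.
    import numpy as np

    def recognize(query_feature, gallery_features, gallery_ids, threshold=0.5):
        """gallery_features: (n, K) array; gallery_ids: list of n identity labels."""
        q = query_feature / np.linalg.norm(query_feature)
        g = gallery_features / np.linalg.norm(gallery_features, axis=1, keepdims=True)
        sims = g @ q                          # cosine similarity to each enrolled identity
        best = int(np.argmax(sims))
        return gallery_ids[best] if sims[best] >= threshold else None   # None = no match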
Fig. 2 is a flowchart of an embodiment of training a deep neural network applied in an image recognition method according to the present application. As shown in fig. 2, the method of this embodiment mainly includes: step S200, step S210, and step S220.
S200, acquiring a target characteristic vector of an input image sample to be recognized based on the deep neural network.
In an optional example, when the technical solution provided by the present application is applied to the technical field of face recognition, the image sample to be recognized may specifically be an image sample to be recognized based on a face, and the target feature vector of the image sample to be recognized may specifically be a face feature vector of the image sample to be recognized. Under the condition that the technical scheme provided by the application is applied to the technical field of object class identification, the image sample to be identified can be specifically an image sample to be identified based on an object, and the target feature vector of the image sample to be identified can be specifically an object feature vector of the image sample to be identified. Under the condition that the technical scheme provided by the application is applied to the technical field of scene category identification, the image sample to be identified can be specifically an image sample to be identified based on a scene, and the target feature vector of the image sample to be identified can be specifically a scene feature vector of the image sample to be identified. Under the condition that the technical scheme provided by the application is applied to the technical field of motion recognition, the image sample to be recognized can be specifically a motion-based image sample to be recognized, and the target feature vector of the image sample to be recognized can be specifically a motion feature vector of the image sample to be recognized.
In an alternative example, the present application provides an image sample set, where the image sample set generally includes a large number of image samples, and the image sample set may also be referred to as a training set, and the image samples may also be referred to as training samples. Optionally, each image sample in the image sample set has a label, and the label is used for identifying a category to which the object in the image sample belongs. In an alternative example, in the face-based image sample set, different image samples belonging to the face of the same person have the same label, that is, one category corresponds to one person, and different categories correspond to different persons; in the object-based image sample set, different image samples belonging to the same object have the same label, that is, one class corresponds to one object, and different classes correspond to different objects; in the scene-based image sample set, different image samples belonging to the same scene have the same label, namely, one category corresponds to one scene, and different categories correspond to different scenes; in the motion-based image sample set, different image samples belonging to the same motion have the same label, i.e., one category corresponds to one motion, and different categories correspond to different motions.
In an alternative example, the present application may select (e.g., randomly select) one image sample from the image sample set as an image sample to be recognized, input the image sample to the deep neural network, and extract the target feature vector from the image sample to be recognized by the deep neural network.
S210, calculating the similarity between the target feature vector of the image sample to be recognized and the feature vectors of at least two centers in the category centers, and selecting one similarity from the similarities.
In an alternative example, a category center is provided in the present application, which may be considered as a category center layer in the deep neural network of the present application. The category center in the present application generally includes a plurality of centers, optionally, each center corresponds to a category, and different centers correspond to different categories. Optionally, the feature vector of any center in the category centers may reflect an average position of the feature of the category corresponding to the center in the feature space, that is, the feature vector of any center in the category centers may reflect an average direction of the feature vector of the category corresponding to the center in the feature space or an average position on the hypersphere. In addition, the feature vectors of the centers in the class center are dynamically updated as the deep neural network is trained.
In an alternative example, the feature vector for any one of the class centers may reflect the features that are most representative of the class to which the center corresponds. A feature vector for any one of the class centers may be understood as an average/compromise feature vector of all the feature vectors that have appeared for the class to which the center corresponds. Any one of the category centers has a unique one label, and different centers have different labels.
Under the condition that the technical scheme provided by the application is applied to the technical field of face recognition, the category center can be specifically a face-based category center, each center in the category center corresponds to one person, different centers correspond to different persons, and the feature vector of the center is the average/compromise feature vector of all face feature vectors of the persons corresponding to the center, which have appeared once. In the case that the technical scheme provided by the application is applied to the technical field of object class identification, the class center may specifically be a class center based on objects, each center in the class center corresponds to one object, different centers correspond to different objects, and the feature vector of the center is an average/compromise feature vector of all object feature vectors that have appeared once of the object corresponding to the center. Under the condition that the technical scheme provided by the application is applied to the technical field of scene category identification, the category center can be specifically a category center based on scenes, each center in the category center corresponds to one scene, different centers correspond to different scenes, and the feature vector of the center is the average/compromise feature vector of all scene feature vectors appearing once of the scene corresponding to the center. In the case that the technical scheme provided by the application is applied to the technical field of motion recognition, the category centers may be specifically motion-based category centers, each center in the category centers corresponds to one motion, different centers correspond to different motions, and the feature vector of a center is an average/compromise feature vector of all motion feature vectors that have appeared in the motions corresponding to the center.
In one optional example, the dimensions of the feature vectors for all centers in the category center are the same. The dimension of the feature vector of any one of the category centers is the same as the dimension of the target feature vector extracted by the deep neural network from the input image sample to be recognized.
In an optional example, the method can obtain a label of an image sample to be recognized input into a deep neural network, judge whether a center corresponding to the label exists in a category center, if the category center does not have the center corresponding to the label, add a new center in the category center, enable the label of the new center to be the label of the image sample to be recognized, and set a feature vector of the new center according to a target feature vector extracted from the image sample to be recognized by the deep neural network; if the center corresponding to the label exists in the category center, the similarity between the target feature vector of the image sample to be identified and the feature vectors of at least two centers in the category center can be calculated.
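A simple way to realize this bookkeeping of the category centers is sketched below (illustrative code; the label-to-center mapping, the initialization from the sample's feature vector, and the moving-average update are assumptions; in practice the centers could equally be learned parameters updated by back-propagation):

    # Illustrative bookkeeping for the category centers: if no center corresponds to the
    # label of the input sample yet, add a new center initialized from the sample's feature.
    import numpy as np

    class CategoryCenters:
        def __init__(self):
            self.centers = {}                        # label -> K-dimensional feature vector

        def ensure_center(self, label, target_feature):
            if label not in self.centers:            # no center corresponds to this label yet
                self.centers[label] = np.array(target_feature, dtype=float)
            return self.centers[label]

        def update(self, label, target_feature, lr=0.05):
            # One possible way to update center feature vectors dynamically during training.
            c = self.centers[label]
            self.centers[label] = (1.0 - lr) * c + lr * np.asarray(target_feature, dtype=float)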
In an optional example, the present application may calculate similarities between the target feature vector of the image sample to be identified and feature vectors of centers in the category center other than the center corresponding to the image sample to be identified, for example, the category center includes N centers, a center (for example, an X-th center) in the category center having the label may be found by using the label of the image sample to be identified, N-1 centers exist in the category center other than the X-th center, and the present application may calculate similarities between the target feature vector of the image sample to be identified and feature vectors of each of the N-1 centers.
In an optional example, the present application may calculate a similarity between a target feature vector of the image sample to be recognized and feature vectors of partial centers of the class centers except for a center corresponding to the image sample to be recognized, for example, the class center includes N centers, a center (for example, an X-th center) having a label in the class center may be found by using the label of the image sample to be recognized, N-1 centers may exist in the class center besides the X-th center, the present application may select (for example, randomly select) M (M < N-1) centers from the N-1 centers, and calculate a similarity between the target feature vector of the image sample to be recognized and a feature vector of each of the M centers.
When the similarity between the target feature vector of the image sample to be recognized and the feature vectors of all centers in the category centers except the center corresponding to the image sample to be recognized needs to be calculated, the number of floating point operations required is (N-1)×K², where N is the number of centers contained in the category centers and K is the dimension of the target feature vector and of the feature vectors of the centers. When the similarity between the target feature vector of the image sample to be recognized and the feature vectors of only part of the centers in the category centers except the center corresponding to the image sample to be recognized needs to be calculated, the number of floating point operations required is M×K², where M is the number of selected centers and M < N-1. Therefore, controlling the number M of centers selected from the category centers helps to control the computational overhead and memory overhead in the training process of the deep neural network.
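A sketch of this sampling strategy (hypothetical code; the value of M, the uniform random selection, and the array layout are assumptions):

    # Illustrative: compute cosine similarities only against M randomly selected centers other
    # than the center corresponding to the sample, reducing the per-sample similarity computations.
    import numpy as np

    def sample_center_similarities(target_feature, centers, own_index, M=32, rng=None):
        """centers: (N, K) array; own_index: index of the center matching the sample's label."""
        if rng is None:
            rng = np.random.default_rng()
        candidates = [i for i in range(len(centers)) if i != own_index]
        chosen = rng.choice(candidates, size=min(M, len(candidates)), replace=False)
        f = target_feature / np.linalg.norm(target_feature)
        c = centers[chosen] / np.linalg.norm(centers[chosen], axis=1, keepdims=True)
        return chosen, c @ f      # selected center indices and their cosine similarities to f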
In an optional example, when similarity calculation is performed on a target feature vector of an image sample to be recognized and a feature vector of any one of the class centers, normalization processing may be performed on the target feature vector and the feature vector of the center of the image sample to be recognized, and then cosine similarity between the target feature vector of the image sample to be recognized and the feature vector of the center is obtained by calculating a dot product of two results after the normalization processing. Specifically, the cosine similarity between the two can be calculated by using the following formula (1):
cos(f, c) = (f / |f|) · (c / |c|)        (1)

In the above formula (1), f represents the target feature vector extracted from the image sample to be recognized via the deep neural network, c represents the feature vector of any one center in the category centers, cos(f, c) represents the cosine similarity between f and c, |f| is the modulus of the target feature vector f, and |c| is the modulus of the feature vector c of the center.
In an optional example, the application may select the highest similarity (e.g., the shortest euclidean distance or the highest cosine similarity, etc.) from all the similarities (e.g., the euclidean distance or the cosine similarity, etc.) obtained through the calculation. The feature vector of the center corresponding to the highest similarity can be used as the feature vector of the difficult sample which is most similar to the feature vector of the image sample to be identified. Of course, this application does not exclude the possibility of selecting the second highest similarity. It should be particularly noted that the present application may also use other calculation methods (such as euclidean distance, etc.) besides the cosine similarity to determine the similarity between the two feature vectors, and the present application does not limit the specific calculation method for determining the similarity between the two feature vectors.
According to the method and the device, the category center is set, the cosine similarity between the feature vector of the image sample to be identified and the feature vectors of the centers is calculated, one center can be selected from the category center based on the cosine similarity, for example, the center with the highest cosine similarity is selected, and therefore the feature vector of the difficult sample or the semi-difficult sample can be sampled conveniently and quickly.
S220, performing supervised learning on the deep neural network through a ternary loss function based on a triplet formed by a target feature vector of the image sample to be recognized, a feature vector of a center corresponding to the image sample to be recognized in the class center and the feature vector of the center corresponding to the selected similarity.
In an alternative example, a triplet in the present application includes three components, which may be referred to as a base element, a positive element, and a negative element, respectively, for convenience of description. According to the method and the device, the target characteristic vector of the image sample to be recognized extracted through the deep neural network is used as a basic element in the triple, the characteristic vector of the center corresponding to the image sample to be recognized in the class center is used as a positive element in the triple, and the selected characteristic vector of the center corresponding to the highest similarity is used as a negative element.
In one alternative example, an image sample in the image sample set may be denoted as x_i, and the label of the image sample x_i may be denoted as y_i. The K-dimensional target feature vector (e.g., face feature vector) extracted from the image sample x_i via the deep neural network may be denoted as f_i, and the center in the category centers having the label y_i may be denoted as c_{y_i}. Under the above setting, the present application may take f_i as the basic element a in the triplet, take c_{y_i} as the positive example element p in the triplet, and take the feature vector of the center with the highest similarity determined based on the similarity calculation result as the negative example element n in the triplet, thereby forming the triplet (a, p, n).
In an alternative example, the present application may calculate a similarity (e.g., cosine similarity or euclidean distance, etc.) between the basic element a and the positive element p in the triplet, and calculate a similarity (e.g., cosine similarity or euclidean distance, etc.) between the basic element a and the negative element n in the triplet. In order to improve the clustering effect, the similarity between the basic element a and the positive example element p needs to be made as large as possible, and the similarity between the basic element a and the negative example element n needs to be made as small as possible. Optionally, the application may adopt a ternary loss function shown in the following formula (2) to perform supervised learning on the deep neural network:
L = (1/Z) · Σ_{i=1..Z} [ cos(a_i, n_i) - cos(a_i, p_i) + γ ]_+        (2)

In the above formula (2), Z represents the batch size used in one iteration when training the deep neural network, i.e., the number of image samples to be recognized input into the deep neural network in one iteration; i represents the sequence number of an image sample to be recognized input into the deep neural network; γ represents a preset constant, which is mainly used to relax the condition that cos(a_i, n_i) is greater than cos(a_i, p_i); the [·]_+ symbol means that if the value inside the bracket is greater than or equal to 0, the triplet is counted into the ternary loss function, and if the value inside the bracket is less than 0, the triplet is not counted into the ternary loss function; cos(a_i, n_i) represents the cosine similarity calculated between a and n of the triplet (a, p, n) of the i-th image sample to be recognized input into the deep neural network; and cos(a_i, p_i) represents the cosine similarity calculated between a and p of the triplet (a, p, n) of the i-th image sample to be recognized input into the deep neural network.
By presetting the constant γ, the condition that cos(a_i, n_i) is greater than cos(a_i, p_i) is relaxed, so that a triplet contributes to the ternary loss function whenever cos(a_i, n_i) + γ is greater than cos(a_i, p_i), which enables the deep neural network to learn more compact target feature vectors.
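For illustration, formula (2) could be implemented for a batch of Z samples roughly as follows (an assumed PyTorch sketch; the tensor shapes and the averaging over the batch are assumptions based on the description above):

    # Assumed implementation of formula (2): for each sample i, the hinge term
    # [ cos(a_i, n_i) - cos(a_i, p_i) + gamma ]_+ , averaged over the batch of size Z.
    import torch
    import torch.nn.functional as F

    def ternary_center_loss(anchors, positives, negatives, gamma=0.2):
        """anchors, positives, negatives: (Z, K) tensors: f_i, its own center, the selected center."""
        cos_ap = F.cosine_similarity(anchors, positives, dim=1)     # cos(a_i, p_i)
        cos_an = F.cosine_similarity(anchors, negatives, dim=1)     # cos(a_i, n_i)
        per_sample = torch.clamp(cos_an - cos_ap + gamma, min=0.0)  # the [.]_+ operation
        return per_sample.mean()                                    # average over the batch

Triplets whose bracketed value is negative contribute zero to this loss, matching the [·]_+ behaviour described above.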
Fig. 3 is a flowchart of an embodiment of training a deep neural network applied to a face recognition method according to the present application. As shown in fig. 3, the method of this embodiment includes: step S300, step S310, and step S320.
S300, acquiring a face feature vector of the input face image sample to be recognized based on the deep neural network.
In an optional example, the present application provides a facial image sample set, where the facial image sample set generally includes a large number of facial image samples, the facial image sample set may also be referred to as a training set, and the facial image samples may also be referred to as training samples. Optionally, each face image sample in the face image sample set has a label, and different face image samples belonging to a face of a same person have the same label, that is, one category corresponds to one person, and different categories correspond to different persons.
In an alternative example, the present application may select (e.g., randomly select) one face image sample from a face image sample set as a face image sample to be recognized, input the face image sample to a deep neural network, and extract a face feature vector from the face image sample to be recognized by the deep neural network.
S310, calculating the similarity between the face feature vector of the face image sample to be recognized and the feature vectors of at least two centers in the category centers, and selecting one similarity from the similarities.
In an alternative example, the feature vector of any one of the class centers may reflect the face features that are most representative of the class to which the center corresponds. The dimensions of the feature vectors of all centers in a category center are the same. The dimension of the feature vector of any one of the category centers is the same as the dimension of the face feature vector extracted by the deep neural network from the input face image sample to be recognized.
In an optional example, the method can obtain a label of a face image sample to be recognized input into a deep neural network, judge whether a center corresponding to the label exists in a category center, if the category center does not exist in the center corresponding to the label, add a new center in the category center, enable the label of the new center to be the label of the face image sample to be recognized, and set a feature vector of the new center according to a face feature vector extracted from the face image sample to be recognized by the deep neural network; if the center corresponding to the label exists in the category center, the similarity between the facial feature vector of the facial image sample to be recognized and the feature vectors of at least two centers in the category center can be calculated.
In an optional example, the present application may calculate similarities between the facial feature vector of the facial image sample to be recognized and feature vectors of centers in the category centers except for the center corresponding to the facial image sample to be recognized, for example, the category centers include N centers, a center (for example, an X-th center) in the category centers having the label may be found by using the label of the facial image sample to be recognized, N-1 centers exist in the category centers except for the X-th center, and the present application may calculate similarities between the facial feature vector of the facial image sample to be recognized and feature vectors of each of the N-1 centers.
In an optional example, the present application may calculate a similarity between a facial feature vector of the facial image sample to be recognized and a feature vector of a part of centers in the category center except for a center corresponding to the facial image sample to be recognized, for example, the category center includes N centers, a center (for example, an X-th center) having a label in the category center may be found by using the label of the facial image sample to be recognized, N-1 centers exist in the category center except for the X-th center, the present application may select (for example, randomly select) M (M < N-1) centers from the N-1 centers, and calculate a similarity between the facial feature vector of the facial image sample to be recognized and a feature vector of each of the M centers.
When the similarity between the face feature vector of the face image sample to be recognized and the feature vectors of all centers in the category centers except the center corresponding to the face image sample to be recognized needs to be calculated, the number of floating point operations required is (N-1)×K², where N is the number of centers contained in the category centers and K is the dimension of the face feature vector and of the feature vectors of the centers. When the similarity between the face feature vector of the face image sample to be recognized and the feature vectors of only part of the centers in the category centers except the center corresponding to the face image sample to be recognized needs to be calculated, the number of floating point operations required is M×K², where M is the number of selected centers and M < N-1. Therefore, controlling the number M of centers selected from the category centers helps to control, according to the actual situation, the computational overhead and memory overhead in the training process of the deep neural network for face recognition.
In an optional example, when similarity calculation is performed on a face feature vector of a face image sample to be recognized and a feature vector of any center in a category center, normalization processing may be performed on the face feature vector of the face image sample to be recognized and the feature vector of the center, respectively, and cosine similarity between the face feature vector of the face image sample to be recognized and the feature vector of the center is obtained by calculating a dot product of results after the two normalization processing. As specifically described above with respect to formula (1).
In an optional example, the application may select the highest similarity (e.g., the shortest euclidean distance or the highest cosine similarity, etc.) from all the similarities (e.g., the euclidean distance or the cosine similarity, etc.) obtained through the calculation. The feature vector of the center corresponding to the highest similarity can be used as the feature vector of the difficult sample which is most similar to the face feature vector of the face image sample to be recognized.
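A minimal sketch of this similarity computation and highest-similarity selection is given below, assuming PyTorch tensors and cosine similarity obtained by normalization followed by a dot product; the function name and tensor shapes are illustrative assumptions, not the application's code.

```python
import torch
import torch.nn.functional as F

def hardest_negative(sample_feat, candidate_centers):
    """Pick the candidate center most similar to the sample's feature vector.

    sample_feat:        (K,)   face feature vector of the sample
    candidate_centers:  (M, K) feature vectors of the M selected centers
                        (centers other than the one matching the label)
    Returns the index of the most similar center and its cosine similarity.
    """
    # Normalization followed by a dot product yields cosine similarity,
    # matching the normalization-then-dot-product step described above.
    a = F.normalize(sample_feat, dim=0)
    c = F.normalize(candidate_centers, dim=1)
    sims = c @ a                      # (M,) cosine similarities
    idx = torch.argmax(sims)          # highest similarity = hardest sample
    return idx.item(), sims[idx].item()

# Usage sketch with random stand-in data
feat = torch.randn(256)
centers = torch.randn(100, 256)
neg_idx, neg_sim = hardest_negative(feat, centers)
```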
According to the method and the device, the category center is set, and the cosine similarity between the face feature vector of the face image sample to be recognized and the feature vectors of the centers is calculated, so that one center can be selected from the category center based on the cosine similarity, for example, the center with the highest cosine similarity is selected, and the sampling of the feature vectors of the difficult sample or the semi-difficult sample can be conveniently and quickly realized.
S320, performing supervised learning on the deep neural network for face recognition through a ternary loss function based on a triple formed by the face feature vector of the face image sample to be recognized, the feature vector of the center corresponding to the class center of the face image sample to be recognized and the feature vector of the center corresponding to the selected similarity.
In an alternative example, a triplet in the present application includes three components, which may be referred to as a base element, a positive element, and a negative element, respectively, for convenience of description. The method and the device can take the face feature vector of the face image sample to be recognized extracted through the deep neural network for face recognition as a basic element in the triple, take the feature vector of the center corresponding to the face image sample to be recognized in the category center as a positive element in the triple, and take the feature vector of the center corresponding to the highest similarity selected as a negative element.
In an alternative example, the present application may calculate a similarity (e.g., cosine similarity or euclidean distance) between a basic element a and a positive element p in a triplet, calculate a similarity (e.g., cosine similarity or euclidean distance) between the basic element a and a negative element n in the triplet, and then perform supervised learning on the deep neural network by using the ternary loss function shown in the above formula (2).
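Formula (2) itself is not reproduced in this passage, so the sketch below uses a common margin-based triplet loss on cosine similarities as a stand-in assumption; it illustrates the supervised learning step but is not necessarily the exact loss of the application.

```python
import torch
import torch.nn.functional as F

def cosine_triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based ternary loss on cosine similarities (illustrative form).

    anchor:   face feature vector of the sample (basic element a)
    positive: feature vector of the center matching the sample's label (p)
    negative: feature vector of the selected most-similar other center (n)
    The loss penalises a small a-p similarity and a large a-n similarity.
    """
    a = F.normalize(anchor, dim=0)
    p = F.normalize(positive, dim=0)
    n = F.normalize(negative, dim=0)
    sim_ap = torch.dot(a, p)
    sim_an = torch.dot(a, n)
    # Loss becomes zero once sim_ap exceeds sim_an by at least the margin.
    return torch.clamp(sim_an - sim_ap + margin, min=0.0)

loss = cosine_triplet_loss(torch.randn(256, requires_grad=True),
                           torch.randn(256), torch.randn(256))
loss.backward()
```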
Fig. 4 is a schematic structural diagram of an embodiment of an image recognition apparatus according to the present application. As shown in fig. 4, the image recognition apparatus of this embodiment includes: an input module 400, an acquisition module 410, and a recognition module 420. Optionally, the image recognition apparatus may further include: a training module 430.
The input module 400 is mainly used for inputting the image to be recognized into the deep neural network.
In an optional example, in a case that the image recognition apparatus provided in the present application is applied to the technical field of face recognition, the image to be recognized may specifically be an image to be recognized based on a face, and the target feature vector of the image to be recognized may specifically be a face feature vector of the image to be recognized. Under the condition that the image recognition device provided by the application is applied to the technical field of object category recognition, the image to be recognized can be specifically an image to be recognized based on an object, and the target feature vector of the image to be recognized can be specifically an object feature vector of the image to be recognized. Under the condition that the image recognition device provided by the application is applied to the technical field of scene category recognition, the image to be recognized can be specifically a scene-based image to be recognized, and the target feature vector of the image to be recognized can be specifically a scene feature vector of the image to be recognized. In the case that the image recognition apparatus provided by the present application is applied to the field of motion recognition technology, the image to be recognized may specifically be a motion-based image to be recognized, and the target feature vector of the image to be recognized may specifically be a motion feature vector of the image to be recognized.
The obtaining module 410 is mainly used for outputting image features of an image to be recognized through a deep neural network.
In an optional example, after the image to be recognized is processed by at least one convolutional layer and at least one nonlinear ReLU layer in the deep neural network, the deep neural network extracts image features from the input image to be recognized, and the acquisition module 410 acquires the image features of the image to be recognized from the deep neural network.
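As a minimal illustration of a feature extractor built from convolutional and nonlinear ReLU layers followed by a projection to a feature vector, the toy network below uses arbitrary layer sizes chosen for illustration; it is not the network of the application.

```python
import torch
import torch.nn as nn

class TinyExtractor(nn.Module):
    """Toy feature extractor: conv + ReLU stages followed by a projection."""
    def __init__(self, feature_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, feature_dim)

    def forward(self, image):
        x = self.backbone(image).flatten(1)   # (batch, 64)
        return self.proj(x)                   # (batch, feature_dim) image features

features = TinyExtractor()(torch.randn(1, 3, 112, 112))
```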
The recognition module 420 is mainly configured to recognize the image to be recognized based on the image features of the image to be recognized.
In an alternative example, the recognition module 420 may perform face recognition on the image to be recognized based on the image features of the image to be recognized; gesture recognition can also be carried out on the image to be recognized based on the image characteristics of the image to be recognized; the pedestrian recognition can be carried out on the image to be recognized based on the image characteristics of the image to be recognized; the vehicle identification can be carried out on the image to be identified based on the image characteristics of the image to be identified; the motion recognition can be carried out on the image to be recognized based on the image characteristics of the image to be recognized; and scene recognition can be carried out on the image to be recognized based on the image characteristics of the image to be recognized. In addition, the recognition module 420 may perform recognition processing, such as face position detection, face key point detection, human body position detection, human body motion detection, human body key point detection, and living body detection, on the image to be recognized based on the image characteristics of the image to be recognized. The application does not limit the specific representation form of the recognition module 420 for recognizing the image to be recognized. Also, the identification module 420 may be implemented using an existing neural network (e.g., a convolutional neural network, etc.).
The training module 430 is mainly used for training the deep neural network. The structure of the training module 430 is shown in fig. 5.
FIG. 5 is a block diagram of an embodiment of a training module 430 of the present application. As shown in fig. 5, the training module 430 of this embodiment includes: the feature vector acquisition sub-module 500, the similarity calculation sub-module 510, the similarity selection sub-module 520, and the supervised learning sub-module 530.
The feature vector obtaining submodule 500 is mainly used for obtaining a target feature vector of an input image sample to be recognized based on a deep neural network.
In an alternative example, the present application provides an image sample set, where the image sample set generally includes a large number of image samples, and the image sample set may also be referred to as a training set, and the image samples may also be referred to as training samples. Optionally, each image sample in the image sample set has a label, and the label is used for identifying a category to which the object in the image sample belongs.
In an alternative example, the feature vector obtaining sub-module 500 may select (e.g., randomly select) an image sample from the image sample set as an image sample to be identified and input it to the deep neural network, and the deep neural network extracts the target feature vector from the image sample to be identified. The deep neural network in the present application may specifically be a deep convolutional neural network to be trained based on network structures such as AlexNet, VGGNet, GoogLeNet, or ResNet. The application does not limit the specific structure of the deep neural network.
The calculate similarity sub-module 510 is mainly used for calculating the similarity between the target feature vector of the image sample to be identified and the feature vectors of at least two of the category centers.
In an alternative example, a category center is provided in the present application, which may be considered as a category center layer in a deep neural network. The category center in the present application generally includes a plurality of centers, optionally, each center corresponds to a category, and different centers correspond to different categories. Optionally, the feature vector of any center in the category centers may reflect an average position of the feature of the category corresponding to the center in the feature space, that is, the feature vector of any center in the category centers may reflect an average direction of the feature vector of the category corresponding to the center in the feature space or an average position on the hypersphere. In addition, the feature vectors of the centers in the class center are dynamically updated as the deep neural network is trained.
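The passage states that the feature vectors of the centers are updated dynamically as the network is trained but does not fix the update rule here; an exponential moving average of the center toward the features of the current sample is one common choice and is used below purely as an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def update_center(center, sample_feat, momentum=0.9):
    """Move a class center toward the latest sample feature (illustrative rule).

    Re-normalising the center keeps it on the unit hypersphere, in line with
    the 'average direction on the hypersphere' interpretation mentioned above.
    """
    new_center = momentum * center + (1.0 - momentum) * sample_feat.detach()
    return F.normalize(new_center, dim=0)

center = F.normalize(torch.randn(256), dim=0)
center = update_center(center, torch.randn(256))
```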
In one optional example, the dimensions of the feature vectors for all centers in the category center are the same. The dimension of the feature vector of any one of the category centers is the same as the dimension of the target feature vector extracted by the deep neural network from the input image sample to be recognized.
In an optional example, the similarity calculation submodule 510 may obtain a label of an image sample to be recognized input to the deep neural network, and determine whether a center corresponding to the label exists in the category center, if the category center does not have the center corresponding to the label, the similarity calculation submodule 510 may add a new center in the category center, so that the label of the new center is the label of the image sample to be recognized, and set a feature vector of the new center according to a target feature vector extracted from the image sample to be recognized by the deep neural network; if there is a center corresponding to the label in the category center, the calculate similarity sub-module 510 may calculate the similarity between the target feature vector of the image sample to be identified and the feature vectors of at least two of the category centers.
In an alternative example, the calculate similarity sub-module 510 may calculate similarities between the target feature vector of the image sample to be recognized and feature vectors of centers other than the center corresponding to the image sample to be recognized in the category center, for example, the category center includes N centers, the calculate similarity sub-module 510 may find a center (e.g., an X-th center) having the label in the category center using the label of the image sample to be recognized, N-1 centers exist in the category center other than the X-th center, and the calculate similarity sub-module 510 may calculate similarities between the target feature vector of the image sample to be recognized and the feature vectors of each of the N-1 centers, respectively.
In an alternative example, the calculate similarity sub-module 510 may calculate the similarity between the target feature vector of the image sample to be recognized and the feature vectors of only a part of the centers in the category centers except the center corresponding to the image sample to be recognized. For example, the category centers may include N centers; the calculate similarity sub-module 510 may use the label of the image sample to be identified to find the center having that label (e.g., the X-th center) in the category centers, and there may be N-1 centers in the category centers in addition to the X-th center; the calculate similarity sub-module 510 may select (e.g., randomly select) M (M < N-1) centers from the N-1 centers and calculate the similarity between the target feature vector of the image sample to be identified and the feature vector of each of the M centers.
In the case that the similarity between the target feature vector of the image sample to be recognized and the feature vectors of all centers in the category centers other than the center corresponding to the image sample to be recognized needs to be calculated, the number of floating point operations that the calculate similarity sub-module 510 needs to perform is (N-1)×K², where N is the number of centers contained in the category centers and K is the dimension of the target feature vector and of the feature vectors of the centers. In the case that the similarity between the target feature vector of the image sample to be recognized and the feature vectors of only a part of the centers in the category centers other than the center corresponding to the image sample to be recognized needs to be calculated, the number of floating point operations that the calculate similarity sub-module 510 needs to perform is M×K², where M is the number of selected centers and M < N-1. As can be seen, by controlling the number M of centers selected from the category centers, the calculate similarity sub-module 510 is beneficial to controlling the calculation overhead and the memory overhead in the deep neural network training process.
In an optional example, when the similarity calculation is performed on the target feature vector of the image sample to be recognized and the feature vector of any one of the category centers, the calculate similarity sub-module 510 may perform normalization processing on the target feature vector of the image sample to be recognized and the feature vector of the center, respectively, and then the calculate similarity sub-module 510 obtains the cosine similarity of the target feature vector of the image sample to be recognized and the feature vector of the center by calculating a dot product of two results after the normalization processing. Specifically, the calculate similarity sub-module 510 may calculate the cosine similarity between the two using the above formula (1). It should be particularly noted that the similarity calculation sub-module 510 may also determine the similarity between the two feature vectors by using other calculation methods (such as euclidean distance) besides the cosine similarity, and the application does not limit the specific calculation method of the similarity calculation sub-module 510 for calculating the similarity between the two feature vectors.
In an optional example, the calculate similarity sub-module 510 may further calculate a similarity (e.g., cosine similarity or euclidean distance, etc.) between the target feature vector of the image sample to be recognized and a center of the class center having the label of the image sample to be recognized.
The select similarity submodule 520 is mainly configured to select a similarity from the similarities calculated by the calculate similarity submodule 510.
In an optional example, the select similarity submodule 520 may select a highest similarity (e.g., a shortest euclidean distance or a highest cosine similarity, etc.) from all the calculated similarities (e.g., the euclidean distance or the cosine similarity, etc.). The feature vector of the center corresponding to the highest similarity can be used as the feature vector of the difficult sample which is most similar to the feature vector of the image sample to be identified. Of course, the present application does not exclude the possibility of selecting the second highest similarity from the similarity sub-module 520.
According to the method, the category center is set, the cosine similarity between the feature vector of the image sample to be identified and the feature vectors of the centers is calculated by the similarity calculation submodule 510, the similarity selection submodule 520 can select one center from the category centers based on the cosine similarity, for example, the similarity selection submodule 520 selects the center with the highest cosine similarity, and therefore sampling of the feature vectors of the difficult sample or the semi-difficult sample can be achieved conveniently and quickly.
The supervised learning sub-module 530 is mainly configured to perform supervised learning on the deep neural network through a ternary loss function based on a triplet formed by the target feature vector of the image sample to be recognized, the feature vector of the center corresponding to the category center of the image sample to be recognized, and the feature vector of the center corresponding to the selected similarity.
In an alternative example, a triplet in the present application includes: a basic element, a positive example element, and a negative example element. The supervised learning sub-module 530 may use the target feature vector of the image sample to be recognized extracted through the deep neural network as a basic element in the triplet, use the feature vector of the center corresponding to the category center of the image sample to be recognized as a positive example element in the triplet, and use the feature vector of the center corresponding to the selected highest similarity as a negative example element.
In an optional example, the supervised learning sub-module 530 may obtain the similarity (e.g., cosine similarity or euclidean distance, etc.) between the basic element a and the positive example element p in the triplet calculated by the calculate similarity sub-module 510, and obtain the similarity (e.g., cosine similarity or euclidean distance, etc.) between the basic element a and the negative example element n in the triplet calculated by the calculate similarity sub-module 510. In order to make the similarity between the basic element a and the positive example element p as large as possible and the similarity between the basic element a and the negative example element n as small as possible, the supervised learning sub-module 530 may use a ternary loss function to penalize a small similarity between the basic element a and the positive example element p and a large similarity between the basic element a and the negative example element n. Optionally, the supervised learning sub-module 530 may perform supervised learning on the deep neural network by using the ternary loss function shown in the above formula (2).
In the case where the training module 430 of the present application is used to train a deep neural network that implements face recognition, the operations performed by the modules in the training module 430 are as follows.
The feature vector obtaining submodule 500 is mainly used for obtaining a face feature vector of an input face image sample to be recognized based on a deep neural network.
In an optional example, the present application provides a facial image sample set, where the facial image sample set generally includes a large number of facial image samples, the facial image sample set may also be referred to as a training set, and the facial image samples may also be referred to as training samples. Optionally, each face image sample in the face image sample set has a label, and different face image samples belonging to a face of a same person have the same label, that is, one category corresponds to one person, and different categories correspond to different persons.
In an alternative example, the obtain feature vector sub-module 500 may select (e.g., randomly select) a face image sample from the face image sample set as a face image sample to be recognized, and input the face image sample to the deep neural network, where the deep neural network extracts a face feature vector from the face image sample to be recognized.
The similarity calculation sub-module 510 is mainly used for calculating the similarity between the facial feature vectors of the facial image sample to be recognized and the feature vectors of at least two of the category centers.
In an alternative example, the feature vector of any one of the class centers may reflect the face features that are most representative of the class to which the center corresponds. The dimensions of the feature vectors of all centers in a category center are the same. The dimension of the feature vector of any one of the category centers is the same as the dimension of the face feature vector extracted by the deep neural network from the input face image sample to be recognized.
In an optional example, the calculate similarity sub-module 510 may obtain the label of a to-be-recognized face image sample input to the deep neural network and determine whether a center corresponding to the label exists in the category centers. If no center corresponding to the label exists in the category centers, the calculate similarity sub-module 510 may add a new center to the category centers, set the label of the new center to the label of the to-be-recognized face image sample, and set the feature vector of the new center according to the face feature vector extracted from the to-be-recognized face image sample by the deep neural network; if a center corresponding to the label exists in the category centers, the calculate similarity sub-module 510 may calculate the similarity between the face feature vector of the face image sample to be recognized and the feature vectors of at least two of the category centers.
In an alternative example, the calculate similarity sub-module 510 may calculate similarities between the facial feature vector of the facial image sample to be recognized and feature vectors of centers other than the center corresponding to the facial image sample to be recognized in the category center, for example, the category center includes N centers, the calculate similarity sub-module 510 may find a center (for example, an X-th center) having the label in the category center using the label of the facial image sample to be recognized, there are N-1 centers in the category center other than the X-th center, and the calculate similarity sub-module 510 may calculate similarities between the facial feature vector of the facial image sample to be recognized and the feature vectors of each of the N-1 centers.
In an alternative example, the calculate similarity sub-module 510 may calculate the similarity between the face feature vector of the to-be-recognized face image sample and the feature vectors of only a part of the centers in the category centers except the center corresponding to the to-be-recognized face image sample. For example, the category centers may include N centers; the calculate similarity sub-module 510 may use the label of the face image sample to be recognized to find the center having that label (e.g., the X-th center) in the category centers, and there may be N-1 centers in the category centers in addition to the X-th center; the calculate similarity sub-module 510 may select (e.g., randomly select) M (M < N-1) centers from the N-1 centers and calculate the similarity between the face feature vector of the face image sample to be recognized and the feature vector of each of the M centers.
In the case that the similarity between the face feature vector of the face image sample to be recognized and the feature vectors of all centers in the category centers other than the center corresponding to the face image sample to be recognized is calculated, the number of floating point operations that the calculate similarity sub-module 510 needs to perform is (N-1)×K², where N is the number of centers contained in the category centers and K is the dimension of the face feature vector and of the feature vectors of the centers.
In the case that the similarity between the face feature vector of the to-be-recognized face image sample and the feature vectors of only a part of the centers in the category centers except the center corresponding to the to-be-recognized face image sample needs to be calculated, the number of floating point operations that the calculate similarity sub-module 510 needs to perform is M×K², where M is the number of selected centers and M < N-1.
Therefore, the calculation similarity submodule 510 is beneficial to controlling the calculation overhead and the memory overhead in the training process of the deep neural network for face recognition according to the actual situation by controlling the number M of the centers selected from the category centers.
In an optional example, when the similarity calculation is performed on the face feature vector of the to-be-recognized face image sample and the feature vector of any one of the class centers, the similarity calculation sub-module 510 may perform normalization processing on the face feature vector of the to-be-recognized face image sample and the feature vector of the center, and obtain cosine similarity of the face feature vector of the to-be-recognized face image sample and the feature vector of the center by calculating a dot product of two results after the normalization processing. As specifically described above with respect to formula (1).
In an optional example, the calculate similarity sub-module 510 may further calculate a similarity (such as cosine similarity or euclidean distance) between the face feature vector of the face image sample to be recognized and a center of the class center having the label of the face image sample to be recognized.
The select similarity submodule 520 is mainly configured to select a similarity from the similarities calculated by the calculate similarity submodule 510.
In an optional example, the select similarity sub-module 520 may select the highest similarity (e.g., the shortest euclidean distance or the highest cosine similarity, etc.) from all the similarities (e.g., the euclidean distance or the cosine similarity, etc.) calculated by the calculate similarity sub-module 510. The feature vector of the center corresponding to the highest similarity can be used as the feature vector of the difficult sample which is most similar to the face feature vector of the face image sample to be recognized.
According to the method, the category center is set, the cosine similarity between the face feature vector of the face image sample to be recognized and the feature vectors of a plurality of centers is calculated through the calculation similarity submodule 510, the selection similarity submodule 520 can select one center from the category centers based on the cosine similarity, for example, the selection similarity submodule 520 selects the center with the highest cosine similarity, and therefore sampling of the feature vectors of the difficult sample or the semi-difficult sample can be achieved conveniently and quickly.
The supervised learning sub-module 530 is configured to perform supervised learning on the deep neural network for face recognition through a ternary loss function based on a triplet formed by the face feature vector of the face image sample to be recognized, the feature vector of the center corresponding to the class center of the face image sample to be recognized, and the feature vector of the center corresponding to the selected similarity.
In an alternative example, a triplet in the present application includes three components, which may be referred to as a basic element, a positive example element, and a negative example element for convenience of description. The supervised learning sub-module 530 may take the face feature vector of the face image sample to be recognized extracted through the deep neural network for face recognition as the basic element a in the triplet, take the feature vector of the center corresponding to the face image sample to be recognized in the category centers as the positive example element p in the triplet, and take the feature vector of the center corresponding to the selected highest similarity as the negative example element n.
In an optional example, the supervised learning sub-module 530 may obtain a similarity (e.g., cosine similarity or euclidean distance, etc.) between the basic element a and the positive element p in the triplet calculated by the calculation similarity sub-module 510, and obtain a similarity (e.g., cosine similarity or euclidean distance, etc.) between the basic element a and the negative element n in the triplet calculated by the calculation similarity sub-module 510, and the supervised learning sub-module 530 may perform supervised learning on the deep neural network by using the ternary loss function shown in the above formula (2).
Exemplary device
Fig. 6 illustrates an exemplary device 600 suitable for implementing the present application, where the device 600 may be a mobile terminal (e.g., a smart mobile phone, etc.), a personal computer (PC, e.g., a desktop or notebook computer, etc.), a tablet, a server, and so forth. In fig. 6, the device 600 includes one or more processors, a communication section, and the like. The one or more processors may be: one or more central processing units (CPUs) 601, and/or one or more graphics processing units (GPUs) 613, etc., and the processors may perform various appropriate actions and processes according to executable instructions stored in a read only memory (ROM) 602 or loaded from a storage section 608 into a random access memory (RAM) 603. The communication section 612 may include, but is not limited to, a network card, which may include, but is not limited to, an IB (InfiniBand) network card. The processor may communicate with the read only memory 602 and/or the random access memory 603 to execute the executable instructions, communicate with the communication section 612 through the bus 604, and communicate with other target devices through the communication section 612, thereby completing the corresponding steps in the method embodiments of the present application.
In addition, the RAM 603 can store various programs and data necessary for the operation of the apparatus. The CPU 601, the ROM 602, and the RAM 603 are connected to each other via the bus 604. When the RAM 603 is present, the ROM 602 is an optional module. The RAM 603 stores executable instructions, or writes executable instructions into the ROM 602 at runtime, and the executable instructions cause the central processing unit 601 to perform the steps included in the above-described method embodiments. An input/output (I/O) interface 605 is also connected to the bus 604. The communication section 612 may be provided integrally, or may be provided with a plurality of sub-modules (e.g., a plurality of IB network cards) that are respectively connected to the bus.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. The driver 610 is also connected to the I/O interface 605 as needed. A removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 610 as necessary, so that a computer program read out therefrom is mounted in the storage section 608 as necessary.
It should be particularly noted that the architecture shown in fig. 6 is only an optional implementation manner, and in specific practice the number and types of the components in fig. 6 may be selected, deleted, added or replaced according to actual needs. In the case of different functional component settings, separate settings or integrated settings may also be used; for example, the GPU and the CPU may be provided separately, or the GPU may be integrated on the CPU, and the communication section may be provided separately or may be integrated on the CPU or the GPU. These alternative embodiments all fall within the scope of the present application.
In particular, the processes described above with reference to the flowcharts may be implemented as a computer software program according to the embodiments of the present application. For example, the embodiments of the present application include a computer program product, which comprises a computer program tangibly embodied on a machine-readable medium; the computer program comprises program code for performing the steps shown in the flowcharts, and the program code may include instructions corresponding to the steps provided in the present application.
In such embodiments, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. When the computer program is executed by a Central Processing Unit (CPU)601, the above-described instructions described in the present application are executed.
Exemplary application scenarios
Referring to fig. 7, one application scenario in which embodiments according to the present application may be implemented is schematically illustrated.
In fig. 7, the deep neural network 700 may be a deep convolutional neural network used for face recognition, object classification, scene classification, motion recognition, or the like, and the image sample set used for training the deep neural network 700 includes at least Z image samples, that is, at least image sample 1, image sample 2, ..., and image sample Z. Each image sample has a label characterizing the class to which it belongs; different image samples belonging to the same class have the same label, and image samples belonging to different classes have different labels. After the deep neural network 700 is trained on a plurality of image samples in the image sample set by using the technical scheme of the application, the deep neural network can conveniently and quickly learn more compact target feature vectors (such as face feature vectors), which is beneficial to improving the image recognition accuracy of the deep neural network.
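Putting the pieces together, a single hypothetical training step over the labelled image sample set could look as follows. The helper names (extractor, table, hardest_negative, cosine_triplet_loss, update_center) reuse the illustrative sketches given earlier in this document and are assumed to be in scope; none of them is from the application itself, and the random-selection and batching details are omitted for brevity.

```python
import torch

# Assumes the illustrative helpers sketched earlier are defined:
# extractor (TinyExtractor), table (CenterTable), hardest_negative,
# cosine_triplet_loss, update_center.
optimizer = torch.optim.SGD(extractor.parameters(), lr=0.01)

def training_step(image, label, num_negatives=100):
    feat = extractor(image.unsqueeze(0)).squeeze(0)   # target feature vector
    if not table.has_center(label):                   # new class: just add a center
        table.add_center(label, feat)
        return
    others = [v for k, v in table.centers.items() if k != label]
    candidates = torch.stack(others)[:num_negatives]  # M selected centers
    neg_idx, _ = hardest_negative(feat, candidates)
    loss = cosine_triplet_loss(feat, table.centers[label], candidates[neg_idx])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    table.centers[label] = update_center(table.centers[label], feat)
```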
However, it is fully understood by those skilled in the art that the applicable scenarios for the embodiments of the present application are not limited by any aspect of this framework.
The methods and apparatus, electronic devices, and computer-readable storage media of the present application may be implemented in a number of ways. For example, the methods and apparatus, electronic devices, and computer-readable storage media of the present application may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present application are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present application may also be embodied as a program recorded in a recording medium, the program including machine-readable instructions for implementing a method according to the present application. Thus, the present application also covers a recording medium storing a program for executing the method according to the present application.
The description of the present application has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the application in the form disclosed. Many modifications and variations will be apparent to practitioners skilled in this art. The embodiment was chosen and described in order to best explain the principles of the application and the practical application, and to enable others of ordinary skill in the art to understand the application for various embodiments with various modifications as are suited to the particular use contemplated.

Claims (22)

1. An image recognition method, characterized in that the method comprises:
inputting an image to be recognized into a deep neural network;
outputting image features of the image to be recognized through the deep neural network;
identifying the image to be identified based on the image characteristics of the image to be identified;
the deep neural network is obtained through training of a ternary loss function, and a triplet in the ternary loss function is obtained by utilizing a class center of at least two centers contained in the deep neural network and a feature vector of the center;
the method further comprises the following steps:
training the deep neural network;
the training the deep neural network comprises:
acquiring a target feature vector of an image sample to be identified based on a deep neural network;
selecting M centers except the center corresponding to the image sample to be identified from the category centers, respectively calculating the similarity between the target feature vector of the image sample to be identified and the feature vectors of the M centers, and selecting one similarity from the similarities; wherein M is an integer not less than 2, M is less than N-1, and N is the number of centers contained in the category centers;
and performing supervised learning on the deep neural network through a ternary loss function based on a triplet formed by the target characteristic vector of the image sample to be recognized, the characteristic vector of the center corresponding to the image sample to be recognized in the class center and the characteristic vector of the center corresponding to the selected similarity.
2. The method of claim 1, wherein the identifying the image to be identified based on the image features of the image to be identified comprises at least one of:
carrying out face recognition on the image to be recognized based on the image characteristics of the image to be recognized;
performing gesture recognition on the image to be recognized based on the image characteristics of the image to be recognized;
identifying the pedestrian of the image to be identified based on the image characteristics of the image to be identified;
performing vehicle identification on the image to be identified based on the image characteristics of the image to be identified;
performing action recognition on the image to be recognized based on the image characteristics of the image to be recognized;
and carrying out scene recognition on the image to be recognized based on the image characteristics of the image to be recognized.
3. The method according to claim 1, wherein the dimension of the target feature vector of the image sample to be identified obtained based on the deep neural network is the same as the dimension of the feature vector of each of the class centers.
4. The method according to claim 1, wherein the calculating the similarity between the target feature vector of the image sample to be recognized and the feature vectors of the M centers comprises:
and calculating cosine similarity between the target characteristic vector of the image sample to be identified and the characteristic vectors of the M centers.
5. The method according to claim 4, wherein the cosine similarity calculation for the target feature vector of the image sample to be identified and the feature vector of one of the class centers comprises:
respectively carrying out normalization processing on the target characteristic vector of the image sample to be recognized and the characteristic vector of the center by utilizing the modulus of the target characteristic vector of the image sample to be recognized and the modulus of the characteristic vector of the center;
and calculating the dot product of the two results after the normalization processing to obtain the cosine similarity between the target characteristic vector of the image sample to be identified and the characteristic vector of the center.
6. The method according to any one of claims 1 to 5, wherein the selecting one of the similarities comprises:
and selecting the highest similarity from the similarities.
7. The method according to claim 6, wherein the supervised learning of the deep neural network via the ternary loss function based on the triplet formed by the target feature vector of the image sample to be recognized, the feature vector of the center corresponding to the image sample to be recognized in the class center, and the feature vector of the center corresponding to the selected similarity comprises:
taking the target characteristic vector of the image sample to be identified as a basic element of the triple, taking the characteristic vector of the center corresponding to the image sample to be identified in the category center as a positive example, and taking the characteristic vector of the center corresponding to the selected similarity as a negative example;
and performing supervised learning on the deep neural network through a ternary loss function based on the cosine similarity between the basic elements and the positive examples and the cosine similarity between the basic elements and the negative examples.
8. The method of claim 7, wherein the ternary loss function comprises: and a ternary loss function for relaxing the difference between the cosine similarity of the basic element and the negative example element and the cosine similarity of the basic element and the positive example element by using a preset constant.
9. The method according to any one of claims 1 to 5, wherein the feature vector of the center corresponding to the image sample to be identified in the category center comprises: feature vectors of centers where labels in category centers match labels of the image samples to be identified.
10. The method according to any one of claims 1 to 5,
the image sample to be identified comprises: based on the image sample to be recognized of the human face, the target feature vector comprises: a face feature vector; or
The image sample to be identified comprises: based on the image sample to be recognized of the gesture, the target feature vector comprises: a gesture feature vector; or
The image sample to be identified comprises: based on the image sample to be identified of the pedestrian, the target feature vector comprises: a pedestrian feature vector; or
The image sample to be identified comprises: based on the image sample to be identified of the vehicle, the target feature vector comprises: a vehicle feature vector; or
The image sample to be identified comprises: based on the motion of the image sample to be recognized, the target feature vector comprises: a motion feature vector; or
The image sample to be identified comprises: based on the image sample to be identified of the scene, the target feature vector comprises: a scene feature vector.
11. An image recognition apparatus, comprising:
the input module is used for inputting the image to be recognized into the deep neural network;
the acquisition module is used for outputting the image characteristics of the image to be recognized through the deep neural network;
the identification module is used for identifying the image to be identified based on the image characteristics of the image to be identified;
the deep neural network is obtained through training of a ternary loss function, and a triplet in the ternary loss function is obtained by utilizing a class center of at least two centers contained in the deep neural network and a feature vector of the center;
the device further comprises:
the training module is used for training the deep neural network;
the training module comprises:
the feature vector obtaining submodule is used for obtaining a target feature vector of an image sample to be identified based on a deep neural network;
the similarity calculation submodule is used for selecting M centers except the center corresponding to the image sample to be identified from the category centers and calculating the similarity between the target feature vector of the image sample to be identified and the feature vectors of the M centers respectively; wherein M is an integer not less than 2, M is less than N-1, and N is the number of centers contained in the category centers;
selecting a similarity submodule for selecting a similarity from the similarities;
and the supervised learning submodule is used for carrying out supervised learning on the deep neural network through a ternary loss function based on a triplet formed by the target characteristic vector of the image sample to be recognized, the characteristic vector of the center corresponding to the image sample to be recognized in the class center and the characteristic vector of the center corresponding to the selected similarity.
12. The apparatus of claim 11, wherein the identification module is specifically configured to at least one of:
carrying out face recognition on the image to be recognized based on the image characteristics of the image to be recognized;
performing gesture recognition on the image to be recognized based on the image characteristics of the image to be recognized;
identifying the pedestrian of the image to be identified based on the image characteristics of the image to be identified;
performing vehicle identification on the image to be identified based on the image characteristics of the image to be identified;
performing action recognition on the image to be recognized based on the image characteristics of the image to be recognized;
and carrying out scene recognition on the image to be recognized based on the image characteristics of the image to be recognized.
13. The apparatus according to claim 11, wherein the supervised learning sub-module obtains the target feature vector of the image sample to be identified based on the deep neural network with the same dimension as the feature vector of each of the category centers.
14. The apparatus of claim 11, wherein the calculate similarity submodule is specifically configured to:
and calculating cosine similarity between the target characteristic vector of the image sample to be identified and the characteristic vectors of the M centers.
15. The apparatus of claim 14, wherein the calculating similarity submodule performs a cosine similarity calculation for the target feature vector of the image sample to be identified and the feature vector of one of the class centers, including:
respectively carrying out normalization processing on the target characteristic vector of the image sample to be recognized and the characteristic vector of the center by utilizing the modulus of the target characteristic vector of the image sample to be recognized and the modulus of the characteristic vector of the center;
and calculating the dot product of the two results after the normalization processing to obtain the cosine similarity between the target characteristic vector of the image sample to be identified and the characteristic vector of the center.
16. The apparatus according to any one of claims 11 to 15, wherein the select similarity submodule is specifically configured to:
and selecting the highest similarity from the similarities.
17. The apparatus of claim 16, wherein the supervised learning sub-module is specifically configured to:
taking the target characteristic vector of the image sample to be identified as a basic element of the triple, taking the characteristic vector of the center corresponding to the image sample to be identified in the category center as a positive example, and taking the characteristic vector of the center corresponding to the selected similarity as a negative example;
and performing supervised learning on the deep neural network through a ternary loss function based on the cosine similarity between the basic elements and the positive examples and the cosine similarity between the basic elements and the negative examples.
18. The apparatus of claim 17, wherein the ternary loss function comprises: and a ternary loss function for relaxing the difference between the cosine similarity of the basic element and the negative example element and the cosine similarity of the basic element and the positive example element by using a preset constant.
19. The apparatus according to any one of claims 11 to 15, wherein the feature vector of the center corresponding to the image sample to be identified in the category center comprises: feature vectors of centers where labels in category centers match labels of the image samples to be identified.
20. The apparatus of any one of claims 11 to 15,
the image sample to be identified comprises: based on the image sample to be recognized of the human face, the target feature vector comprises: a face feature vector; or
The image sample to be identified comprises: based on the image sample to be recognized of the gesture, the target feature vector comprises: a gesture feature vector; or
The image sample to be identified comprises: based on the image sample to be identified of the pedestrian, the target feature vector comprises: a pedestrian feature vector; or
The image sample to be identified comprises: based on the image sample to be identified of the vehicle, the target feature vector comprises: a vehicle feature vector; or
The image sample to be identified comprises: based on the motion of the image sample to be recognized, the target feature vector comprises: a motion feature vector; or
The image sample to be identified comprises: based on the image sample to be identified of the scene, the target feature vector comprises: a scene feature vector.
21. An electronic device, comprising:
a memory for storing a computer program;
a processor for executing a computer program stored in the memory, and which when executed performs the steps of the method of any of claims 1-10.
22. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method of any one of the preceding claims 1-10.
CN201711042845.7A 2017-10-30 2017-10-30 Image recognition method and device and electronic equipment Active CN108229532B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711042845.7A CN108229532B (en) 2017-10-30 2017-10-30 Image recognition method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711042845.7A CN108229532B (en) 2017-10-30 2017-10-30 Image recognition method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN108229532A CN108229532A (en) 2018-06-29
CN108229532B true CN108229532B (en) 2021-02-12

Family

ID=62655683

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711042845.7A Active CN108229532B (en) 2017-10-30 2017-10-30 Image recognition method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN108229532B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101602B (en) * 2018-08-01 2023-09-12 腾讯科技(深圳)有限公司 Image retrieval model training method, image retrieval method, device and storage medium
CN109583332B (en) * 2018-11-15 2021-07-27 北京三快在线科技有限公司 Face recognition method, face recognition system, medium, and electronic device
CN109784415B (en) * 2019-01-25 2021-02-26 北京地平线机器人技术研发有限公司 Image recognition method and device and method and device for training convolutional neural network
CN110009052B (en) * 2019-04-11 2022-11-18 腾讯科技(深圳)有限公司 Image recognition method, image recognition model training method and device
CN110889429A (en) * 2019-10-22 2020-03-17 杭州效准智能科技有限公司 Intelligent dish matching identification method based on deep learning
CN111368644B (en) * 2020-02-14 2024-01-05 深圳市商汤科技有限公司 Image processing method, device, electronic equipment and storage medium
CN111860588B (en) * 2020-06-12 2024-06-21 华为技术有限公司 Training method for graphic neural network and related equipment
CN111783573B (en) * 2020-06-17 2023-08-25 杭州海康威视数字技术股份有限公司 High beam detection method, device and equipment
CN111814655B (en) * 2020-07-03 2023-09-01 浙江大华技术股份有限公司 Target re-identification method, network training method thereof and related device
CN113076840A (en) * 2021-03-25 2021-07-06 高新兴科技集团股份有限公司 Vehicle post-shot image brand training method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104899579A (en) * 2015-06-29 2015-09-09 小米科技有限责任公司 Face recognition method and face recognition device
CN105320945A (en) * 2015-10-30 2016-02-10 小米科技有限责任公司 Image classification method and apparatus
CN105608450A (en) * 2016-03-01 2016-05-25 天津中科智能识别产业技术研究院有限公司 Heterogeneous face identification method based on deep convolutional neural network
CN105975959A (en) * 2016-06-14 2016-09-28 广州视源电子科技股份有限公司 Face feature extraction modeling and face recognition method and device based on neural network
CN106845330A (en) * 2016-11-17 2017-06-13 北京品恩科技股份有限公司 A kind of training method of the two-dimension human face identification model based on depth convolutional neural networks
CN106897390A (en) * 2017-01-24 2017-06-27 北京大学 Target precise search method based on depth measure study

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106295476B (en) * 2015-05-29 2019-05-17 腾讯科技(深圳)有限公司 Face key point localization method and device


Also Published As

Publication number Publication date
CN108229532A (en) 2018-06-29

Similar Documents

Publication Publication Date Title
CN108229532B (en) Image recognition method and device and electronic equipment
Ren et al. Face alignment at 3000 fps via regressing local binary features
He et al. l 2, 1 regularized correntropy for robust feature selection
US10891465B2 (en) Methods and apparatuses for searching for target person, devices, and media
He et al. Dynamic feature matching for partial face recognition
CN108427927B (en) Object re-recognition method and apparatus, electronic device, program, and storage medium
Cao et al. Ensemble extreme learning machine and sparse representation classification
US9141885B2 (en) Visual pattern recognition in an image
CN109255392B (en) Video classification method, device and equipment based on non-local neural network
WO2020143255A1 (en) Computer-implemented method of recognizing facial expression, apparatus for recognizing facial expression, method of pre-training apparatus for recognizing facial expression, computer-program product for recognizing facial expression
CN108229301B (en) Eyelid line detection method and device and electronic equipment
CN113255557B (en) Deep learning-based video crowd emotion analysis method and system
JP2012038244A (en) Learning model creation program, image identification information giving program, learning model creation device, image identification information giving device
Ma et al. Random projection-based partial feature extraction for robust face recognition
CN112668482A (en) Face recognition training method and device, computer equipment and storage medium
Houcine et al. Ear recognition based on multi-bags-of-features histogram
Sisodia et al. Fast and accurate face recognition using SVM and DCT
Du et al. Block dictionary learning-driven convolutional neural networks for fewshot face recognition
Forczmański et al. Comparative analysis of simple facial features extractors
Elsayed et al. Hand gesture recognition based on dimensionality reduction of histogram of oriented gradients
CN114694150B (en) Method and system for improving generalization capability of digital image classification model
CN113887535B (en) Model training method, text recognition method, device, equipment and medium
Li et al. An efficient robust eye localization by learning the convolution distribution using eye template
Liang et al. Random forest with suppressed leaves for Hough voting
Lee et al. Octagonal prism LBP representation for face recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant