WO2020114119A1 - Cross-domain network training method and cross-domain image recognition method - Google Patents

Cross-domain network training method and cross-domain image recognition method

Info

Publication number
WO2020114119A1
Authority
WO
WIPO (PCT)
Prior art keywords
domain
training
cross
learning rate
loss
Prior art date
2018-12-07
Application number
PCT/CN2019/112492
Other languages
French (fr)
Chinese (zh)
Inventor
刘若鹏
栾琳
赵盟盟
Original Assignee
深圳光启空间技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
2019-10-22
Publication date
Application filed by 深圳光启空间技术有限公司
Publication of WO2020114119A1 publication Critical patent/WO2020114119A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification


Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to a cross-domain network training method and a cross-domain image recognition method. The cross-domain network training method comprises the following steps: S1: inputting sample data of a first domain and a second domain into a deep neural network and training on the sample data of the first domain and the second domain, such that the deep neural network has classification ability on the first domain and on the second domain respectively; S2: eliminating the inter-domain difference in statistical distribution, such that the first domain and the second domain have similar statistical distribution characteristics; S3: training the first domain and the second domain to strengthen internal aggregation; and S4: saving a training result that meets a preset condition. According to the present invention, an image can be correctly recognized even when the data have different statistical distribution characteristics.

Description

A cross-domain network training and image recognition method

Technical field

The invention relates to the field of artificial intelligence and, in particular, to a cross-domain network training and image recognition method.

Background art

Face recognition is a biometric technology that identifies a person based on facial feature information. It covers a series of related techniques that use cameras to capture images or video streams containing human faces, automatically detect and track the faces in the images, and then recognize the detected faces; it is also commonly called portrait recognition or facial recognition.

Most existing face recognition algorithms can solve the single-domain face recognition problem, in which the image to be recognized and the training sample images share the same statistical distribution characteristics, for example training on video and then using the trained network to recognize faces in other videos. The Facenet algorithm proposed by Florian Schroff et al. is a single-domain face recognition algorithm that currently performs well, and its authors also give a corresponding single-domain training method.

Technical problem

In practical application scenarios, however, the face photos extracted from video captured by surveillance cameras exhibit highly complex variations in lighting, angle, resolution, and expression, so the image to be recognized and the training sample images have greatly differing statistical distribution characteristics; this is the cross-domain recognition problem. Current artificial-intelligence networks have difficulty recognizing cross-domain images accurately.

Technical solution

The purpose of the present invention is to provide a new cross-domain image recognition method that overcomes the inability of existing image recognition methods to recognize cross-domain scenes accurately, thereby addressing the poor accuracy of current image recognition technology in cross-domain recognition scenarios.
One aspect of the present invention provides a cross-domain network training method comprising the following steps:

S1: input sample data of a first domain and a second domain into a deep neural network and train on the sample data of the first domain and the second domain, so that the deep neural network has classification ability on the first domain and on the second domain respectively;

S2: eliminate the inter-domain difference in statistical distribution, so that the first domain and the second domain have similar statistical distribution characteristics;

S3: train the first domain and the second domain to strengthen internal aggregation;

S4: save the training result that meets a preset condition.
Preferably, training on the sample data of the first domain and the second domain in step S1 comprises: training the loss function Triplet-Loss on the sample data of the first domain and the second domain.

Preferably, step S2 comprises: when the loss function Triplet-Loss is stable and has converged, computing the maximum mean discrepancy loss MMD-Loss from the highest-dimensional features of the first domain and the second domain, adding the result to a composite loss function, and jointly performing back propagation and gradient derivation.

Preferably, step S3 comprises: removing the MMD-Loss from the composite loss function and adding the mixed Triplet-Loss of the first domain and the second domain, to perform the training that strengthens internal aggregation.

Preferably, step S4 comprises: saving the training result when the composite loss function of the training that strengthens internal aggregation has converged and is smaller than a set value.

Preferably, in step S1, training the loss function Triplet-Loss further comprises: setting a first learning rate with an initial value of 0.001 to 0.01, and every 10 rounds of the Triplet-Loss training setting the first learning rate to 0.7 to 0.9 times its value.

Preferably, in step S2, a second learning rate is set whose initial value is smaller than the initial value of the first learning rate, and every 10 rounds of training the second learning rate is set to 0.7 to 0.9 times its value.

Preferably, the initial value of the second learning rate is 0.0001 to 0.001.

Preferably, in step S3, a third learning rate is set whose initial value is smaller than the initial value of the first learning rate, and every 5 rounds of the training that strengthens internal aggregation the third learning rate is set to 0.7 to 0.9 times its value.

Preferably, the initial value of the third learning rate is 0.0001 to 0.001.

Preferably, step S1 further comprises: after the Triplet-Loss training ends, extracting the features of the first domain and the second domain, performing data dimensionality reduction, and drawing the feature position distribution in a two-dimensional space.

Preferably, step S2 further comprises: after the training of the composite loss function ends, extracting the features of the first domain and the second domain, performing data dimensionality reduction, and drawing the feature position distribution in a two-dimensional space.

Preferably, step S3 further comprises: after the training that strengthens internal aggregation ends, extracting the features of the first domain and the second domain, performing data dimensionality reduction, and drawing the feature position distribution in a two-dimensional space.

Another aspect of the present invention provides an image recognition method in which a deep neural network is trained as described above and the trained deep neural network is used to recognize images.

The present invention also provides a storage medium storing a computer program, wherein the computer program is configured to execute, when run, the training method described above.
Beneficial effects

By implementing the cross-domain network training method of the present invention, the network is trained with cross-domain data as input, so that images can be recognized correctly even when the data have different statistical distribution characteristics. A deep neural network obtained with this training method can recognize and match images from different environmental domains, and is particularly suitable for identifying identity information from video images in the security field.
Brief description of the drawings

The accompanying drawings, which form a part of this application, are provided to give a further understanding of the present invention. The schematic embodiments of the present invention and their descriptions are used to explain the invention and do not constitute an undue limitation on it. In the drawings:

FIG. 1 is a flowchart of a preferred embodiment of the cross-domain network training method of the present invention;

FIG. 2 is a flowchart of an image recognition method based on a deep neural network with cross-domain recognition capability;

FIG. 3 is a flowchart of another embodiment of the cross-domain network training method;

FIG. 4 is a flowchart of a further preferred embodiment of the cross-domain network training method of the present invention;

FIG. 5 is a schematic diagram of a preferred embodiment of the learning rate adjustment scheme;

FIG. 6 shows the effect after the Triplet-Loss training ends;

FIG. 7 shows the effect after the MMD+Triplet-Loss training ends;

FIG. 8 shows the effect after the training of the composite loss function is completed.
本发明的实施方式Embodiments of the invention
需要说明的是,在不冲突的情况下,本申请中的实施例及实施例中的特征可以相互组合。下面将参考附图并结合实施例来详细说明本发明。It should be noted that the embodiments in the present application and the features in the embodiments can be combined with each other if there is no conflict. The present invention will be described in detail below with reference to the drawings and in conjunction with the embodiments.
需要指出的是,除非另有指明,本申请使用的所有技术和科学术语具有与本申请所属技术领域的普通技术人员通常理解的相同含义。It should be noted that, unless otherwise specified, all technical and scientific terms used in this application have the same meaning as commonly understood by those of ordinary skill in the technical field to which this application belongs.
在本发明中,在未作相反说明的情况下,使用的方位词如“上、下、顶、底”通常是针对附图所示的方向而言的,或者是针对部件本身在竖直、垂直或重力方向上而言的;同样地,为便于理解和描述,“内、外”是指相对于各部件本身的轮廓的内、外,但上述方位词并不用于限制本发明。In the present invention, the directional words such as "up, down, top, and bottom" are generally used in the direction shown in the drawings, or in the vertical direction of the component itself, unless otherwise stated. In terms of vertical or gravity directions; similarly, for ease of understanding and description, "inside and outside" refers to inside and outside relative to the contour of each component itself, but the above directional words are not used to limit the present invention.
FIG. 1 is a flowchart of a preferred embodiment of the cross-domain network training method provided by the present invention. It should be noted that the training method of FIG. 1 uses Facenet as an example; by training other deep neural networks according to the training method of the present invention, those skilled in the art can obtain a cross-domain recognition effect similar to that of the present invention.

Step S1: input sample data from two different domains, a first domain and a second domain, into Facenet, and train the first domain and the second domain on the Facenet network architecture, so that the network has classification ability on each of the two domains;

Step S2: eliminate the inter-domain difference in statistical distribution from the result trained in step S1, so that the first domain and the second domain have similar statistical distribution characteristics;

Step S3: strengthen the internal aggregation of the first domain and the second domain, which after the training of step S2 have similar statistical distribution characteristics;

Step S4: after step S3 is executed, if the deep neural network meets the preset condition, stop training, complete the cross-domain network training, and save the trained deep neural network.
The present invention also provides an image recognition method based on a deep neural network with cross-domain recognition capability; FIG. 2 shows a preferred embodiment of this method.

The image recognition method of the FIG. 2 embodiment comprises two stages. First, the deep neural network is trained with sample data from the two different domains, so that the trained network eliminates the difference in the statistical distribution characteristics of the sample data of the different domains; in this embodiment, the method corresponding to FIG. 1 is used for this training.

After the above training steps S1 to S4 are completed, step S5 is executed: image data from the first domain and image data from the second domain are input into the trained deep neural network for recognition and matching, yielding the matching relationship between the image data of the two domains. This completes the image recognition.
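As a rough illustration of step S5, the sketch below pairs embeddings from the two domains by nearest-neighbour search. It is a minimal PyTorch sketch under stated assumptions, not the patent's implementation; the trained embedding network `net` is assumed to map image batches to feature vectors.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def match_domains(net, first_domain_imgs, second_domain_imgs):
    # Embed both domains with the trained network, L2-normalise the
    # embeddings, and pair each first-domain image with the nearest
    # second-domain image in feature space.
    e1 = F.normalize(net(first_domain_imgs), dim=1)
    e2 = F.normalize(net(second_domain_imgs), dim=1)
    distances = torch.cdist(e1, e2)   # pairwise Euclidean distances
    return distances.argmin(dim=1)    # best second-domain index per image
```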
Preferably, the present application provides another embodiment of the cross-domain network training method, shown in FIG. 3. This embodiment is again described by training the Facenet network.

First, step S10 (a further refinement of S1): on the basis of the Facenet network structure, input the sample data of the first domain and the second domain separately and train the loss function Triplet-Loss of each domain, with the learning rate set between 0.001 and 0.01; the purpose is to give each of the two domains classification ability.
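A minimal sketch of this per-domain Triplet-Loss stage follows, using PyTorch for illustration. The patent trains Facenet; the toy embedding network, the 0.2 margin (the value used in the Facenet paper), and the pre-mined triplet batches are all assumptions.

```python
import torch
from torch import nn, optim

# Toy stand-in for the Facenet embedding network, only to keep the sketch runnable.
net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 160 * 160, 128))
triplet = nn.TripletMarginLoss(margin=0.2)        # margin value is an assumption
optimizer = optim.SGD(net.parameters(), lr=0.01)  # within the 0.001-0.01 range

def s10_step(domain_batches):
    # domain_batches: one (anchor, positive, negative) image batch per domain;
    # each domain's Triplet-Loss is computed separately and the losses summed.
    loss = sum(triplet(net(a), net(p), net(n)) for a, p, n in domain_batches)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```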
Then, step S20 (a further refinement of S2): when the loss function of step S10 is stable and has converged, compute the maximum mean discrepancy loss MMD-Loss from the highest-dimensional features of the two domains and add it to a composite loss function, i.e. (MMD-Loss) + (Triplet-Loss) (abbreviated below as MMD+Triplet for convenience of description), which is back-propagated and differentiated as a whole, with the learning rate set between 0.0001 and 0.001; this eliminates the inter-domain difference in statistical distribution between the two domains.
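The sketch below shows one common way to compute MMD-Loss, a Gaussian-kernel estimate of the squared maximum mean discrepancy between the two domains' highest-dimensional feature batches, added to the Triplet-Loss as the composite loss. The kernel choice and bandwidth are assumptions; the patent does not specify them.

```python
import torch

def mmd_loss(feat_a, feat_b, sigma=1.0):
    # Biased estimate of the squared Maximum Mean Discrepancy with a
    # Gaussian kernel: MMD^2 = E[k(a,a')] + E[k(b,b')] - 2 E[k(a,b)].
    def kernel(x, y):
        return torch.exp(-torch.cdist(x, y) ** 2 / (2 * sigma ** 2))
    return (kernel(feat_a, feat_a).mean()
            + kernel(feat_b, feat_b).mean()
            - 2 * kernel(feat_a, feat_b).mean())

# Composite loss of step S20 (triplet_a / triplet_b computed as in S10):
# loss = triplet_a + triplet_b + mmd_loss(features_a, features_b)
```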
After the inter-domain statistical distribution difference has been eliminated, and before the next stage of training, it must be checked whether the loss function of step S20 has converged. When it has, step S30 is executed; otherwise step S20 continues until convergence. Since the loss function expresses the difference between the model's predicted values and the true values, its convergence indicates that the recognition of the deep neural network has stabilized at this stage.

In step S30 (a further refinement of S3): once the loss function has converged, the MMD-Loss is removed from the composite loss function and the mixed Triplet-Loss of the first domain and the second domain is added, with the learning rate set between 0.0001 and 0.001; the purpose is to strengthen the intra-class clustering effect.

In step S40 (a further refinement of S4): check whether the composite loss function of step S30 has converged and is smaller than a set value, which in this embodiment is 0.01. If both conditions are met, training is complete; otherwise step S30 continues until the composite loss function converges and its value falls below the set value of 0.01. The trained deep neural network is then saved for subsequent use.
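One simple way to realise the S40 check is sketched below: the composite loss is treated as converged when it has barely moved over a recent window and its latest value is below the 0.01 set value. The window length and stability tolerance are assumptions not given in the patent.

```python
def s40_done(loss_history, set_value=0.01, window=10, tol=1e-4):
    # True when the composite loss is both stable (small spread over the
    # last `window` rounds) and below the set value of 0.01.
    if len(loss_history) < window:
        return False
    recent = loss_history[-window:]
    return (max(recent) - min(recent) < tol) and recent[-1] < set_value
```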
FIG. 4 shows a further preferred embodiment of the cross-domain network training method of the present invention, built on the embodiment corresponding to FIG. 3.

In step S10: on the basis of the Facenet network structure, input the sample data of the first domain and the second domain separately, train the loss function Triplet-Loss of each domain, and set the learning rate between 0.001 and 0.01, so that each of the two domains has classification ability.

Before training the per-domain Triplet-Loss in step S10, the method further includes S11: forward-propagation processing of the two input domains. In this embodiment, forward propagation consists of computing a weighted sum of the input data of the two domains, adding a bias value, and finally passing the result through a non-linear function, i.e. the activation function, to obtain the output.
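The forward propagation described in S11, a weighted sum plus a bias passed through an activation function, corresponds to an ordinary dense layer; a one-layer sketch follows, with ReLU chosen as the activation purely for illustration.

```python
import torch

def forward_layer(x, weight, bias):
    # Weighted sum of the inputs, plus a bias value, passed through a
    # non-linear activation function (ReLU chosen here as an assumption).
    return torch.relu(x @ weight + bias)
```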
In this embodiment, when step S20 is executed, if the loss function of step S10 is stable but the convergence condition is not met, the per-domain Triplet-Loss training is re-run until the loss function meets the condition.

In this embodiment, the training of the composite loss function in step S30 is carried out as follows: first, paired images are selected. A paired image is a data pair in which the same target has both a first-domain image and a second-domain image, for example a person who has both a surveillance-camera video screenshot and an ID card photo. These pairs are used as training data to train the composite loss function of S30.
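One plausible way to turn such image pairs into the mixed Triplet-Loss data of S30 is sketched below: the video shot serves as anchor, the same person's ID photo as positive, and another person's ID photo as negative. This mining rule is an assumption; the patent only states that paired data are used.

```python
def mixed_triplets(pairs):
    # pairs: list of (video_img, id_img) tensors, one pair per person.
    triplets = []
    for i, (video_img, id_img) in enumerate(pairs):
        other_id = pairs[(i + 1) % len(pairs)][1]  # ID photo of a different person
        triplets.append((video_img, id_img, other_id))
    return triplets
```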
Based on the above cross-domain network training method of FIG. 4, the present invention provides another preferred embodiment, which uses the deep neural network trained in the FIG. 4 embodiment to perform image pairing.

Further, when training the cross-domain network in the present application, in order to improve the training effect so that the model converges as quickly as possible while retaining good recognition ability, the learning rate is updated dynamically as training progresses. FIG. 5 shows a preferred embodiment of this learning rate adjustment scheme.

In this embodiment the learning rate is adjusted three times, corresponding to the three trainings in the cross-domain network training method provided by the present invention.

First, when training Triplet-Loss, a first learning rate is set. To bring the loss function down to a low level as quickly as possible, a relatively large learning rate is chosen, for example 0.001 to 0.01 (0.01 in the corresponding step above), and every 10 rounds the learning rate is multiplied by a factor of 0.7 to 0.9 (0.8 in this embodiment). This learning rate adjustment corresponds to step S10 of the preceding embodiments.

Then the second learning rate adjustment is performed. When training MMD+Triplet, a second learning rate is set; to drive the loss function to a lower level, it is smaller than the first learning rate, for example 0.001 in the corresponding step above, and every 10 rounds it is multiplied by a factor of 0.7 to 0.9 (0.8 in this embodiment). This adjustment corresponds to step S20 of the preceding embodiments.

Finally the third learning rate adjustment is performed. When training the composite loss function, a third learning rate is set; again, to drive the loss function to a lower level, it is smaller than the first learning rate, for example 0.001 in the corresponding step above, and every 5 rounds it is multiplied by a factor of 0.7 to 0.9 (0.8 in this embodiment), which lets the loss function converge faster. This adjustment corresponds to step S30 of the preceding embodiments.
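This three-phase schedule maps naturally onto PyTorch's StepLR; the sketch below is an illustration under the embodiment's numbers (initial rates 0.01 / 0.001 / 0.001, factor 0.8, decay every 10 / 10 / 5 rounds), reusing the placeholder `net` from the earlier sketch.

```python
from torch import optim

def make_phase(params, lr0, step_size):
    # One training phase: fresh optimizer, learning rate multiplied by 0.8
    # every `step_size` rounds (call sched.step() once per round).
    opt = optim.SGD(params, lr=lr0)
    sched = optim.lr_scheduler.StepLR(opt, step_size=step_size, gamma=0.8)
    return opt, sched

phase1 = make_phase(net.parameters(), 0.01, 10)   # per-domain Triplet-Loss
phase2 = make_phase(net.parameters(), 0.001, 10)  # MMD + Triplet composite loss
phase3 = make_phase(net.parameters(), 0.001, 5)   # mixed cross-domain Triplet-Loss
```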
The invention further provides a visualization scheme for the training process, adding a visual output to each training stage; it is explained using the embodiment corresponding to FIG. 4. In step S10, after the Triplet-Loss training ends, the features of the different domains are extracted and data dimensionality reduction is performed; for example, the embodiments of the present invention use T-SNE (t-distributed stochastic neighbor embedding). After the T-SNE dimensionality reduction, the feature position distribution is drawn in a two-dimensional space to visualize the training effect, producing the output shown in FIG. 6.
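This visualization step can be reproduced with scikit-learn's t-SNE and matplotlib; a minimal sketch follows, in which the colour map, marker size, and output path are arbitrary choices rather than values from the patent.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_feature_distribution(features, person_ids, path="tsne.png"):
    # features: (N, D) array of extracted embeddings from both domains;
    # person_ids: one integer identity per row, used to colour the dots.
    points = TSNE(n_components=2).fit_transform(np.asarray(features))
    plt.scatter(points[:, 0], points[:, 1], c=person_ids, cmap="tab10", s=8)
    plt.savefig(path)
    plt.close()
```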
Similarly, during the MMD+Triplet-Loss training of step S20 and in the composite-loss-function training stage of step S40, the training results are output visually, producing the training effects shown in FIGS. 7 and 8.

In FIGS. 6 to 8, each dot represents the feature position of one face image, and dots of the same color (or shape) represent different images of the same person; a short distance between dots indicates similar features. The ideal cross-domain recognition result is that all points of the same person cluster tightly while points of different people stay far apart. It can be seen that as training proceeds, points of the same color gradually gather and points of different colors gradually move apart, finally achieving the cross-domain recognition effect.
To explain more clearly how the above embodiments of the present invention are used in an actual image recognition process, and the corresponding effect, a concrete application is described below.

In one practical deployment, the first-domain data are video image data and the second domain consists of the image data on ID cards; the network used is the Facenet network.

First, the video image data and the ID card image data are input into the network. On the one hand the two data streams enter the network through forward propagation; on the other hand they will be used again later, after the MMD-Loss training in the subsequent stage.

After the two data streams have been fed in through forward propagation, the network performs the first training: with a learning rate of 0.01, the per-domain Triplet-Loss training is run over many rounds, and every 10 rounds the learning rate is multiplied by 0.8. After a certain number of rounds the loss function converges, indicating that, on this loss function, the network has now "learned" the data features of these two domains.

Then the maximum mean discrepancy loss MMD-Loss is computed from the highest-dimensional features of the two domains and added to the composite loss function, and back propagation and gradient derivation are carried out jointly. Here the learning rate is set to 0.001; this training likewise runs over multiple rounds, with the learning rate multiplied by 0.8 every 10 rounds, so that after a certain number of rounds a stable convergence is obtained. Trained in this way, the inter-domain difference in statistical distribution between the two domains is eliminated; that is, the network's "learning" of the images of the two domains yields a "basis" for recognition, making it possible later to recognize the same person across the video images and the ID card images.

At this point, additional paired data are needed to train the network at this stage, so that it gains the ability to recognize and match across domains. Specifically, paired data are selected, i.e. images of the same person in the video and on the ID card are prepared in advance and input into the network for training. In this round of training the MMD-Loss is removed from the composite loss function and the mixed Triplet-Loss of the first domain and the second domain is added, with the learning rate set to 0.001 and multiplied by 0.8 every 5 rounds. After this round of training, once the loss function has converged and stabilized, the network model has completed all of the training. The network can now recognize a person in the video and extract a "feature", and this "feature" can be recognized and matched to the corresponding person on an ID card.

When this model is used for person recognition, it suffices to input the relevant data from a video image into the network model; the corresponding person can then be matched from the relevant ID card image data, thereby recognizing the person and achieving the cross-domain recognition effect. Conversely, the ID card image data can be input and the person then identified in the video data.
According to yet another embodiment of the present invention, a storage medium is also provided, the storage medium storing a computer program, wherein the computer program is configured to execute, when run, the steps of any one of the above method embodiments.

As can be seen from the description of the above embodiments, the solution of the present invention solves the problem of the image to be recognized and the training sample images having different statistical distribution characteristics; for example, an image obtained from video can be analyzed and matched to the corresponding identity on an ID card photo, remedying a shortcoming of current techniques, which cannot achieve this effect.
Obviously, the embodiments described above are only some of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.

It should be noted that the terminology used here is only for describing specific embodiments and is not intended to limit the exemplary embodiments of this application. As used here, unless the context clearly indicates otherwise, the singular forms are also intended to include the plural forms; furthermore, it should be understood that when the terms "comprising" and/or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components, and/or combinations thereof.

It should be noted that the terms "first" and "second" in the description and claims of this application and in the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the application described here can be implemented in an order other than that illustrated or described here.

Industrial applicability

The above are only preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention shall be included in the protection scope of the present invention.

Claims (15)

  1. 一种跨域网络训练方法,其特征在于,包括以下步骤:        S1:向深度神经网络输入第一域和第二域的样本数据,对所述第一域和第二域的样本数据进行训练,使得在所述深度神经网络在第一域与第二域上各自具有分类能力;        S2:消除域间统计分布差异,使得第一域和第二域具有相近的统计分布特性;        S3:对第一域和第二域进行加强内聚集的训练;        S4:对符合预设条件的训练结果进行保存。A cross-domain network training method, which is characterized by the following steps: S1: input the sample data of the first domain and the second domain to the deep neural network, and train the sample data of the first domain and the second domain, so that the deep neural network is on the first domain and the second domain Each has the ability to classify; S2: Eliminate the difference in statistical distribution between domains, so that the first domain and the second domain have similar statistical distribution characteristics; S3: Strengthen the internal aggregation training for the first domain and the second domain; S4: Save the training results that meet the preset conditions.
  2. 如权利要求1所述的跨域网络训练方法,其特征在于,所述步骤S1中对所述第一域和第二域的样本数据进行训练包括:        对所述第一域和第二域的样本数据进行损失函数Triplet-Loss的训练。The cross-domain network training method according to claim 1, wherein the training of the sample data of the first domain and the second domain in the step S1 comprises: the training of the first domain and the second domain The sample data is trained for the loss function Triplet-Loss.
  3. 如权利要求2所述的跨域网络训练方法,其特征在于,所述步骤S2包括:        当所述损失函数Triplet-Loss稳定且满足收敛时,用所述第一域和第二域的最高维度特征计算最大均值差异损失MMD-Loss,并将结果加入合成损失函数,共同进行反向传播和梯度求导。The cross-domain network training method of claim 2, wherein the step S2 comprises: When the loss function Triplet-Loss is stable and satisfies convergence, the maximum mean difference loss MMD-Loss is calculated using the highest dimensional features of the first domain and the second domain, and the result is added to the synthetic loss function to jointly perform back propagation And gradient derivation.
  4. 如权利要求3所述的跨域网络训练方法,其特征在于,所述步骤S3包括:        将所述合成损失函数中的MMD-Loss去除,并加入所述第一域和所述第二域的混合Triplet-Loss,进行加强内聚集的训练。The cross-domain network training method of claim 3, wherein the step S3 comprises: The MMD-Loss in the synthesis loss function is removed, and the mixed Triplet-Loss of the first domain and the second domain is added to perform the training of enhanced internal aggregation.
  5. 如权利要求4所述的跨域网络训练方法,其特征在于,所述步骤S4包括:        当加强内聚集的训练的合成损失函数收敛且小于设定值时,对训练结果进行保存。The cross-domain network training method of claim 4, wherein the step S4 comprises: When the synthetic loss function of the training with enhanced intra-aggregation converges and is less than the set value, the training results are saved.
  6. 如权利要求2所述的跨域网络训练方法,其特征在于,所述步骤S1中,所述进行损失函数Triplet-Loss的训练还包括:        设置第一学习率,所述第一学习率初始值为0.001至0.01,每10轮所述进行损失函数Triplet-Loss的训练,将第一学习率设置为0.7至0.9倍。The cross-domain network training method according to claim 2, wherein in step S1, the training of the loss function Triplet-Loss further comprises: Set a first learning rate. The initial value of the first learning rate is 0.001 to 0.01. The training of the loss function Triplet-Loss is performed every 10 rounds, and the first learning rate is set to 0.7 to 0.9 times.
  7. 如权利要求6所述的跨域网络训练方法,其特征在于,所述步骤S2中,设置第二学习率,所述第二学习率的初始值小于所述第一学习率的初始值,每10轮所述进行加强内聚集的训练,将第二学习率设置为0.7至0.9倍。The cross-domain network training method according to claim 6, wherein in step S2, a second learning rate is set, the initial value of the second learning rate is less than the initial value of the first learning rate, each In the 10 rounds of training to strengthen the inner aggregation, the second learning rate is set to 0.7 to 0.9 times.
  8. 如权利要求7所述的跨域网络训练方法,其特征在于,所述第二学习率初始值为0.0001至0.001。The cross-domain network training method according to claim 7, wherein the initial value of the second learning rate is 0.0001 to 0.001.
  9. 如权利要求7所述的跨域网络训练方法,其特征在于,所述步骤S3中,设置第三学习率,所述第三学习率的初始值小于所述第一学习率的初始值,每5轮所述进行加强内聚集的训练,将第三学习率设置为0.7至0.9倍。The cross-domain network training method according to claim 7, wherein in step S3, a third learning rate is set, the initial value of the third learning rate is less than the initial value of the first learning rate, each In the five rounds of training to strengthen the inner aggregation, the third learning rate is set to 0.7 to 0.9 times.
  10. The cross-domain network training method according to claim 9, wherein the initial value of the third learning rate is 0.0001 to 0.001.
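The step-decay schedules of claims 6-10 map onto a standard scheduler; the sketch below reuses optimizer from the sketch under claim 2, and the concrete values inside the claimed ranges (initial rates, decay factor 0.8) are illustrative choices only.

    from torch.optim.lr_scheduler import StepLR

    # Phase 1 (claim 6): initial rate in [0.001, 0.01], x0.7-0.9 every 10 rounds.
    sched = StepLR(optimizer, step_size=10, gamma=0.8)
    # Phase 2 (claims 7-8): restart with a rate in [0.0001, 0.001], same cadence.
    # Phase 3 (claims 9-10): same range, but decayed every 5 rounds:
    #   sched = StepLR(optimizer, step_size=5, gamma=0.8)

    for epoch in range(30):
        # ... one epoch of the current phase's training runs here ...
        sched.step()    # multiplies the rate by gamma once per step_size epochs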
  11. The cross-domain network training method according to claim 2, wherein step S1 further comprises: after the Triplet-Loss training is completed, extracting the features of the first domain and the second domain, performing dimensionality reduction on the data, and plotting the feature position distribution in a two-dimensional space.
  12. The cross-domain network training method according to claim 3, wherein step S2 further comprises: after training with the composite loss function is completed, extracting the features of the first domain and the second domain, performing dimensionality reduction on the data, and plotting the feature position distribution in a two-dimensional space.
  13. The cross-domain network training method according to claim 4, wherein step S3 further comprises: after the aggregation-strengthening training is completed, extracting the features of the first domain and the second domain, performing dimensionality reduction on the data, and plotting the feature position distribution in a two-dimensional space.
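Claims 11-13 end each phase with the same visual check; a sketch follows. t-SNE is an assumed choice of reduction, since the claims only require projecting the features to two dimensions. After step S2 the two domain clouds should largely overlap, and after step S3 same-identity points should tighten into clusters.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    def plot_domain_features(feat_d1, feat_d2):
        # feat_d1, feat_d2: (N, D) and (M, D) arrays of extracted features.
        xy = TSNE(n_components=2).fit_transform(np.vstack([feat_d1, feat_d2]))
        n = len(feat_d1)
        plt.scatter(xy[:n, 0], xy[:n, 1], s=5, label="domain 1")
        plt.scatter(xy[n:, 0], xy[n:, 1], s=5, label="domain 2")
        plt.legend()
        plt.show()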
  14. An image recognition method, characterized in that a deep neural network is trained according to any one of claims 1-13, and the trained deep neural network is used to recognize an image.
  15. A storage medium, characterized in that a computer program is stored in the storage medium, wherein the computer program is configured to execute, when run, the training method according to any one of claims 1-13.
PCT/CN2019/112492 2018-12-07 2019-10-22 Cross-domain network training method and cross-domain image recognition method WO2020114119A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811500433.8A CN111291780A (en) 2018-12-07 2018-12-07 Cross-domain network training and image recognition method
CN201811500433.8 2018-12-07

Publications (1)

Publication Number Publication Date
WO2020114119A1 (en)

Family

ID=70974915

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/112492 WO2020114119A1 (en) 2018-12-07 2019-10-22 Cross-domain network training method and cross-domain image recognition method

Country Status (2)

Country Link
CN (1) CN111291780A (en)
WO (1) WO2020114119A1 (en)

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015143580A1 (en) * 2014-03-28 2015-10-01 Huawei Technologies Co., Ltd Method and system for verifying facial data
CN106096551B (en) * 2016-06-14 2019-05-21 湖南拓视觉信息技术有限公司 Method and apparatus for face position recognition
KR20180027194A (en) * 2016-09-06 2018-03-14 한화테크윈 주식회사 Apparatus For Detecting Face
CN106778519A (en) * 2016-11-25 2017-05-31 深圳市唯特视科技有限公司 Face verification method by matching a user identity document with a selfie
KR102403494B1 (en) * 2017-04-27 2022-05-27 에스케이텔레콤 주식회사 Method for learning Cross-domain Relations based on Generative Adversarial Network
CN107704812A (en) * 2017-09-18 2018-02-16 维沃移动通信有限公司 Face recognition method and mobile terminal
CN108334904A (en) * 2018-02-07 2018-07-27 深圳市唯特视科技有限公司 Multi-domain image conversion technique based on a unified generative adversarial network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170116467A1 (en) * 2015-03-18 2017-04-27 Adobe Systems Incorporated Facial Expression Capture for Character Animation
CN107735795A (en) * 2015-07-02 2018-02-23 北京市商汤科技开发有限公司 Method and system for social relationship identification
WO2018187953A1 (en) * 2017-04-12 2018-10-18 邹霞 Facial recognition method based on neural network
CN107491726A (en) * 2017-07-04 2017-12-19 重庆邮电大学 Real-time expression recognition method based on multi-channel parallel convolutional neural networks
CN108573211A (en) * 2018-03-05 2018-09-25 重庆邮电大学 Face feature extraction method based on local features and deep learning

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113610121A (en) * 2021-07-22 2021-11-05 哈尔滨工程大学 Cross-domain task deep learning identification method
CN113610121B (en) * 2021-07-22 2023-09-29 哈尔滨工程大学 Cross-domain task deep learning identification method
CN113780526A (en) * 2021-08-30 2021-12-10 北京的卢深视科技有限公司 Network training method, electronic device and storage medium
CN113780526B (en) * 2021-08-30 2022-08-05 合肥的卢深视科技有限公司 Face recognition network training method, electronic equipment and storage medium
CN116147724A (en) * 2023-02-20 2023-05-23 青岛鼎信通讯科技有限公司 Metering method suitable for ultrasonic water meter
CN116147724B (en) * 2023-02-20 2024-01-19 青岛鼎信通讯科技有限公司 Metering method suitable for ultrasonic water meter

Also Published As

Publication number Publication date
CN111291780A (en) 2020-06-16

Similar Documents

Publication Publication Date Title
CN107506717B (en) Face recognition method based on depth transformation learning in unconstrained scene
Qin et al. Learning meta model for zero-and few-shot face anti-spoofing
CN109598268B RGB-D salient object detection method based on a single-stream deep network
CN110084135B (en) Face recognition method, device, computer equipment and storage medium
WO2020244434A1 (en) Method and apparatus for recognizing facial expression, and electronic device and storage medium
WO2020114119A1 (en) Cross-domain network training method and cross-domain image recognition method
CN104809426B Convolutional neural network training method, target recognition method and device
WO2019152983A2 (en) System and apparatus for face anti-spoofing via auxiliary supervision
CN109858466A Face key point detection method and device based on convolutional neural networks
JP2018160237A (en) Facial verification method and apparatus
Oyedotun et al. Facial expression recognition via joint deep learning of rgb-depth map latent representations
CN105359162A (en) Image masks for face-related selection and processing in images
Zhang et al. Content-adaptive sketch portrait generation by decompositional representation learning
CN105874474A (en) Systems and methods for facial representation
CN111652082B (en) Face living body detection method and device
WO2012162202A2 (en) Dual-phase red eye correction
EP3839768A1 (en) Mediating apparatus and method, and computer-readable recording medium thereof
CN106529494A (en) Human face recognition method based on multi-camera model
WO2021184754A1 (en) Video comparison method and apparatus, computer device and storage medium
CN112541422A Expression recognition method and device robust to illumination and head pose, and storage medium
CN110046544A (en) Digital gesture identification method based on convolutional neural networks
CN113591763B (en) Classification recognition method and device for face shapes, storage medium and computer equipment
CN103544478A (en) All-dimensional face detection method and system
JP4757787B2 (en) Emotion estimation device
KR102160128B1 (en) Method and apparatus for creating smart albums based on artificial intelligence

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19893239

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 28.09.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19893239

Country of ref document: EP

Kind code of ref document: A1