CN112131970A - Identity recognition method based on multi-channel space-time network and joint optimization loss - Google Patents

Info

Publication number
CN112131970A
CN112131970A (application CN202010926230.6A)
Authority
CN
China
Prior art keywords
gait
network
loss
sequence
joint optimization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010926230.6A
Other languages
Chinese (zh)
Inventor
蒋敏兰 (Jiang Minlan)
吴颖 (Wu Ying)
陈昊然 (Chen Haoran)
Current Assignee
Zhejiang Normal University CJNU
Original Assignee
Zhejiang Normal University CJNU
Priority date
Filing date
Publication date
Application filed by Zhejiang Normal University CJNU filed Critical Zhejiang Normal University CJNU
Priority to CN202010926230.6A
Publication of CN112131970A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/25 — Recognition of walking or running movements, e.g. gait recognition
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/25 — Fusion techniques
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/044 — Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 — Combinations of networks
    • G06N 3/08 — Learning methods

Abstract

The invention provides an identity recognition method based on a multi-channel spatio-temporal network and joint optimization loss. The method comprises a multi-channel spatio-temporal network system and a joint optimization loss system, where the joint optimization loss system combines an improved ternary (triplet) loss function with a label-smoothing-regularized cross entropy loss function; the latter is the cross entropy loss used when training a conventional classification network, with label smoothing regularization incorporated into its calculation. The method addresses the low cross-view accuracy of gait recognition based on traditional image methods and the heavy computation and long running time of model-based gait recognition methods, and thereby supports real-time identity recognition.

Description

Identity recognition method based on multi-channel space-time network and joint optimization loss
Technical Field
The invention relates to the technical field of identity recognition methods, in particular to an identity recognition method based on a multi-channel space-time network and joint optimization loss.
Background
In recent years, artificial intelligence technology has matured and been gradually deployed, and more and more industries have entered a stage of intelligent technological innovation. Identity authentication has evolved from traditional username/password, IC card and dynamic password schemes into present-day biometric recognition. Biometric identification completes the identification of an individual from each person's unique physiological or behavioral characteristics by combining computer and sensor technology, and is currently one of the identity authentication technologies with the highest safety factor. Gait recognition, an emerging biometric technology, has likewise attracted growing study in recent years. The identified person does not need to touch a sensor, the requirements on the acquisition direction and image quality of the gait images are modest, and identity authentication can be completed at long range and from any angle. Although current gait recognition technology has not yet reached commercial maturity, its unique advantages and broad application prospects attract more and more researchers. Reports on building smart cities emphasize the security of citizen digital identity authentication and network identity recognition, so the development of gait recognition can complement other biometric technologies well and offers a new approach to identity authentication in modern intelligent security construction.
The CASIA-B database is a large-scale gait database released by the Institute of Automation, Chinese Academy of Sciences. It contains 124 subjects (93 men and 31 women), each recorded from 11 viewing angles (0°, 18°, 36°, 54°, 72°, 90°, 108°, 126°, 144°, 162° and 180°) and, at each angle, in 3 walking conditions (normal walking, wearing a coat and carrying a bag).
Gait recognition based on gait skeleton sequences is more robust to viewing-angle changes, carried objects and similar scenarios, but the refined skeleton sequence discards many effective features and reduces the differences between individuals; its strengths and weaknesses are exactly complementary to those of methods based on gait contour sequences. Combining the advantages of the two kinds of gait sequence, the invention provides a gait recognition method using a multi-channel spatio-temporal network and joint optimization loss, which accelerates network training convergence while ensuring effective feature-similarity learning, and improves identity recognition accuracy under viewing-angle changes, carried objects and similar scenarios.
Disclosure of Invention
In order to solve one or more technical problems in the prior art, the invention provides an identity recognition method based on a multi-channel spatio-temporal network and joint optimization loss. It addresses the low cross-view accuracy of gait recognition based on traditional image methods and the heavy computation and long running time of model-based gait recognition methods, and thereby supports real-time identity recognition.
In order to solve the above-mentioned existing technical problem, the invention adopts the following scheme:
An identity recognition method based on a multi-channel spatio-temporal network and joint optimization loss comprises a multi-channel spatio-temporal network system and a joint optimization loss system. The joint optimization loss system comprises an improved ternary loss function and a label-smoothing-regularized cross entropy loss function, the latter being the cross entropy loss used when training a conventional classification network, with label smoothing regularization incorporated into its calculation. The method comprises the following steps:
step one, preprocessing the gait sequence: a gait image preprocessing algorithm converts the gait images in the CASIA-B gait database into a gait contour sequence and a skeleton sequence of consistent size with aligned centers;
step two, inputting the gait skeleton sequence and contour sequence obtained by preprocessing together into the multi-channel spatio-temporal network system, so as to fully extract the spatio-temporal features of the gait sequences;
step three, establishing a gait identity recognition model in combination with a triplet network;
step four, jointly supervising network training by combining the improved ternary loss and the optimized cross entropy loss.
Furthermore, the multi-channel spatio-temporal network system adopts a multi-channel shallow convolutional neural network connected in series with a long short-term memory neural network as the backbone for feature extraction, and the one-to-one corresponding gait skeleton and contour sequences within a period are used directly as the network input, so as to fully mine the spatio-temporal information of the gait sequences.
Further, the improved ternary loss improves the way positive and negative samples are selected during ternary loss training, adding a stronger constraint on their selection.
Further, the ternary loss value is calculated by computing the spatial Euclidean distances between all samples in each batch during training, and using the negative sample closest to the original sample and the positive sample farthest from it; the calculation formula is as follows:
$$L_{th}=\sum_{i=1}^{p}\sum_{j=1}^{k}\Big[\max_{x\in A}\big\|f(a_{i,j})-f(x)\big\|_2-\min_{y\in B}\big\|f(a_{i,j})-f(y)\big\|_2+m\Big]_+$$
where p classes of original samples are input per batch and k frames of different gait sequences are selected from each class to form p × k gait sequences; L_th is the final ternary loss value, a denotes an original (anchor) sample, A is the set of positive samples farthest from the original sample, and B is the set of negative samples closest to the original sample.
Further, a calculation formula of the label smoothing regularization method in the label smoothing regularization cross entropy loss function is as follows:
$$q'_i=(1-\lambda)\,q_i+\frac{\lambda}{n}$$
where λ is the weight of the smoothed label, with value range λ ∈ [0,1], and n is the number of label classes.
Further, the expression of the cross entropy loss after incorporating LSR is:
$$L_{LSR\text{-}ce}=-\sum_{i=1}^{n}q'_i\log p_i=-(1-\lambda)\log p_y-\frac{\lambda}{n}\sum_{i=1}^{n}\log p_i$$
Further, the improved ternary loss function and the label-smoothing-regularized cross entropy loss function jointly supervise network training; the fused loss function of the joint optimization loss system is expressed as:
L_total = k × L_LSR-ce + L_th
where L_LSR-ce is the cross entropy loss function incorporating LSR, L_th is the improved ternary loss function, and k is the weight coefficient for fusing the two loss functions.
Further, an attention mechanism is included that enables the multi-channel spatio-temporal network system to capture key frames and emphasize extraction of their gait features, so as to increase the accuracy and robustness of the network model.
Furthermore, the attention mechanism comprises weights for the gait sequence: for the frames of each gait sequence, the scores output by the corresponding long short-term memory neural network are normalized to obtain the weight of each frame; the calculation formula is as follows:
$$Q_j=\frac{\exp(c_j)}{\sum_{i=1}^{Z}\exp(c_i)}$$
where Q_j denotes the weight coefficient of the j-th frame of the gait sequence, and c_j denotes the score output by the long short-term memory neural network for the fused feature of the j-th frame.
Further, from the obtained weight coefficients Q_j, the spatio-temporal feature based on the attention mechanism is calculated as follows:
$$F=\sum_{j=1}^{Z}Q_j\,S_j$$
where F denotes the spatio-temporal feature obtained by the attention mechanism, Q_j is the weight coefficient of the j-th frame sequence obtained by the attention mechanism, and S_j is the fused feature of the j-th frame of the gait skeleton and contour sequences.
Compared with the prior art, the invention has the beneficial effects that:
compared with the gait outline, the existing gait skeleton has stronger robustness under the scenes of visual angle change, carrying objects and the like, but the skeleton sequence after thinning loses a large number of effective characteristics, reduces the difference among different individuals, and has the advantages and the disadvantages which are just complementary with the gait outline. The method comprises the steps of firstly improving the selection mode of positive and negative samples in the process of ternary loss training through a combined optimization loss system, thereby enhancing the generalization performance and robustness of a metric learning network, meanwhile aiming at the problem that the traditional classification network has low accuracy of network classification and identification due to the fact that a cross entropy loss function cannot effectively use the label position of a negative sample in the training process, integrating label smoothing and regularization in the calculation of cross entropy loss for the purpose of improving the accuracy of network classification, jointly supervising network training through the two improved loss functions, improving the accuracy of classification and identification while ensuring effective characteristic distance metric learning, overcoming the problem that the network is not easy to converge, effectively solving the problems of low cross-view-angle accuracy of gait identification based on the traditional image method, complex calculation and long time consumption of the model-based gait identification method, and the like, and providing method guarantee for the real-time identity identification technology, meanwhile, the modern biological behavior characteristic recognition technology is improved, the safety of identity recognition under complex scenes such as carrying objects, wearing and the like is ensured, and meanwhile, the gait identity recognition accuracy is improved.
Drawings
FIG. 1 is a schematic diagram of an identity recognition framework based on a multi-channel spatio-temporal network and joint optimization loss;
FIG. 2 is a schematic diagram of a multi-channel spatio-temporal network system;
FIG. 3 is a schematic diagram of the joint optimization loss scheme of the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings and the detailed description, and it should be noted that any combination of the embodiments or technical features described below can be used to form a new embodiment without conflict.
As shown in figs. 1 to 3, an identity recognition method based on a multi-channel spatio-temporal network and joint optimization loss comprises a multi-channel spatio-temporal network system and a joint optimization loss system. The joint optimization loss system comprises two parts, an improved ternary loss function and a label-smoothing-regularized cross entropy loss function, the latter being the cross entropy loss used when training a conventional classification network, with label smoothing regularization incorporated into its calculation. The method comprises the following steps:
step one, preprocessing the gait sequence: a gait image preprocessing algorithm converts the gait images in the CASIA-B gait database into a gait contour sequence and a skeleton sequence of consistent size with aligned centers;
step two, inputting the gait skeleton sequence and contour sequence obtained by preprocessing together into the multi-channel spatio-temporal network system, so as to fully extract the spatio-temporal features of the gait sequences;
step three, establishing a gait identity recognition model in combination with a triplet network;
step four, jointly supervising network training by combining the improved ternary loss and the optimized cross entropy loss.
When preprocessing the gait sequences, a gait video is first processed into individual gait image frames by the gait image preprocessing algorithm; a gait skeleton image and a contour image are obtained by combining a pose estimation method with a moving-target extraction method; and these are further processed, by interpolation-based upscaling and an image-centroid alignment and centering algorithm, into gait sequences of consistent size with aligned centers, providing experimental samples for the subsequent feature metric learning and classification networks.
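The centroid-alignment step described above can be sketched in a few lines. The following is an illustrative pure-Python sketch, not the patent's actual implementation: the canvas size, grid encoding (0 background / 1 silhouette) and function names are assumptions.

```python
# Illustrative sketch of centroid alignment for a binary gait silhouette:
# shift the silhouette so its centre of mass sits at the centre of a
# fixed-size canvas, as in the preprocessing step described above.

def centroid(mask):
    """Centre of mass (row, col) of a binary mask given as a list of lists."""
    total = rsum = csum = 0
    for r, row in enumerate(mask):
        for c, v in enumerate(row):
            if v:
                total += 1
                rsum += r
                csum += c
    return rsum / total, csum / total

def center_align(mask, height, width):
    """Paste the silhouette onto a height x width canvas, centroid-centred."""
    cr, cc = centroid(mask)
    dr = round(height / 2 - cr)
    dc = round(width / 2 - cc)
    canvas = [[0] * width for _ in range(height)]
    for r, row in enumerate(mask):
        for c, v in enumerate(row):
            if v and 0 <= r + dr < height and 0 <= c + dc < width:
                canvas[r + dr][c + dc] = v
    return canvas

# A 2x2 blob in the top-left corner of a 3x3 mask, centred on a 6x6 canvas.
blob = [[1, 1, 0], [1, 1, 0], [0, 0, 0]]
aligned = center_align(blob, 6, 6)
```

In a real pipeline this would run after interpolation-based upscaling, so that every frame of the contour and skeleton sequences shares the same size and centre.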
The multi-channel spatio-temporal network system adopts a multi-channel shallow convolutional neural network connected in series with a long short-term memory neural network as the backbone for feature extraction, and the one-to-one corresponding gait skeleton and contour sequences within a period are used directly as the network input, so as to fully mine the spatio-temporal information of the gait sequences.
Compared with existing methods, the outstanding differences and contributions of the invented method are:
the novel multi-channel space-time network system is characterized in that a multi-channel shallow convolutional network is connected with a long-time memory network in series to serve as a backbone of a multi-channel space-time network, so that gait sequence space-time information is fully extracted, gait features with higher discriminative power are improved for subsequent gait identity recognition, and the specific structure is shown in figure 2.
The framework of the gait identity recognition method based on a multi-channel spatio-temporal network and joint optimization loss is shown in fig. 1, and the method comprises: (1) the gait skeleton sequence and the gait contour sequence are used together as input to the multi-channel spatio-temporal network system, combined with an attention mechanism to fully extract the spatio-temporal features of the gait sequences, providing gait features with higher discriminative power for the subsequent feature metric learning and classification networks; (2) a joint optimization loss strategy is proposed: to address the slow convergence and low generalization of the triplet network during training, the selection of positive and negative samples in ternary loss training is improved, enhancing the generalization and robustness of the metric learning network; meanwhile, a Label Smoothing Regularization (LSR) method is fused to optimize the cross entropy loss function, because the cross entropy loss of a conventional classification network cannot effectively use the label positions of negative samples during training; incorporating label smoothing regularization into the calculation of the cross entropy loss improves classification accuracy, and the two optimized loss functions jointly supervise network training, improving classification and recognition accuracy while ensuring effective feature distance metric learning and solving the difficulty of network convergence; (3) a large number of comparison experiments on the CASIA-B database further verify the effectiveness of the invention.
The ternary loss function is further improved, mainly in the way positive and negative samples are selected during training, adding a stronger constraint on their selection: a gait sequence from a different class with a small viewing-angle difference is selected as the negative sample, and a gait sequence from the same class with a large viewing-angle difference is selected as the positive sample. Selecting suitable positive and negative samples effectively prevents the Euclidean distance between positive and negative samples from being too large, further enhancing the generalization and robustness of the metric learning network.
The improvement is as follows: the improved ternary loss targets the way positive and negative samples are selected during ternary loss training, adding a stronger constraint on their selection. The ternary loss function is one of the commonly used approaches to feature distance metric learning; its aim is to pull the features extracted from samples of the same class closer together and push the features of different classes farther apart, thereby improving fine-grained classification precision. Ternary loss training requires an original (anchor) sample together with a corresponding positive sample and negative sample, and the traditional random selection tends to satisfy the constraint condition even before the loss function is solved, which easily degrades the network's generalization performance. Therefore, the selection of positive and negative samples during ternary loss training is improved to enhance the generalization and robustness of the metric learning network. Meanwhile, because the cross entropy loss function of a conventional classification network cannot effectively use the label positions of negative samples during training, classification accuracy at test time is low; a fused label smoothing regularization method is therefore adopted to optimize the calculation of the cross entropy loss and improve the classification precision of the test network. The two improved loss functions jointly supervise network training, improving classification and recognition accuracy while ensuring effective feature distance metric learning and solving the difficulty of network convergence.
The ternary loss value is calculated by computing the spatial Euclidean distances between all samples in each batch during training, and using the negative sample closest to the original sample (the high-similarity negative sample) and the positive sample farthest from it (the low-similarity positive sample) to compute the final ternary loss, so that the network converges quickly while avoiding oscillation. For example, original samples of p classes are input in each batch, and k frames of different gait sequences are selected from each class to form p × k gait sequences. The final ternary loss can be expressed as:
$$L_{th}=\sum_{i=1}^{p}\sum_{j=1}^{k}\Big[\max_{x\in A}\big\|f(a_{i,j})-f(x)\big\|_2-\min_{y\in B}\big\|f(a_{i,j})-f(y)\big\|_2+m\Big]_+$$
where p classes of original samples are input per batch and k frames of different gait sequences are selected from each class to form p × k gait sequences; L_th is the final ternary loss value, a denotes an original (anchor) sample, A is the set of positive samples farthest from the original sample, and B is the set of negative samples closest to the original sample.
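The selection rule described above — pair each anchor with its farthest positive and its closest negative within the batch — can be sketched as follows. This is a hedged illustration, not the patent's code: the embedding values and the margin value are assumptions introduced for the example.

```python
# Sketch of the improved ternary (triplet) loss selection rule: within a
# batch, each anchor uses the positive sample farthest from it and the
# negative sample closest to it. Embeddings and margin are illustrative.
import math

def dist(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def batch_hard_triplet_loss(embeddings, labels, margin=0.3):
    """Mean over anchors of max(0, farthest_positive - closest_negative + margin)."""
    losses = []
    for i, (e, y) in enumerate(zip(embeddings, labels)):
        pos = [dist(e, embeddings[j]) for j, t in enumerate(labels) if t == y and j != i]
        neg = [dist(e, embeddings[j]) for j, t in enumerate(labels) if t != y]
        if pos and neg:
            losses.append(max(0.0, max(pos) - min(neg) + margin))
    return sum(losses) / len(losses)

# Toy batch: p = 2 classes, k = 2 sequences each, well separated, so loss is 0.
emb = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
lab = [0, 0, 1, 1]
loss = batch_hard_triplet_loss(emb, lab)
```

When the classes overlap, the farthest-positive/closest-negative pairing yields a positive loss, which is what drives the stronger constraint during training.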
The label-smoothing-regularized cross entropy loss function is further improved. Because the training set contains many sample classes, the number of negative labels is large, and the cross entropy loss computed from the classification probabilities of the traditional Softmax uses a one-hot label scheme that ignores the label positions of negative samples. As a result, the network fits the training-set classification well, but test accuracy drops because the label positions of negative samples cannot be used effectively. Therefore, a Label Smoothing Regularization (LSR) method is added to the calculation of the cross entropy loss function, bringing the result of the Softmax activation closer to the correct output and improving the classification and recognition accuracy of the test network. The calculation formula of the label smoothing regularization method in the label-smoothing-regularized cross entropy loss function is as follows:
$$q'_i=(1-\lambda)\,q_i+\frac{\lambda}{n}$$
where λ is the weight of the smoothed label, with value range λ ∈ [0,1], and n is the number of label classes. The expression of the cross entropy loss after incorporating LSR is:
$$L_{LSR\text{-}ce}=-\sum_{i=1}^{n}q'_i\log p_i=-(1-\lambda)\log p_y-\frac{\lambda}{n}\sum_{i=1}^{n}\log p_i$$
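A minimal sketch of the label-smoothed cross entropy, assuming the standard LSR formulation in which the one-hot target q is replaced by q' = (1 − λ)·q + λ/n before the cross entropy with the Softmax output is computed. The logits below are illustrative values, not from the patent.

```python
# Sketch of cross entropy with Label Smoothing Regularization (LSR):
# smooth the one-hot target, then take cross entropy with the softmax output.
import math

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def lsr_cross_entropy(logits, true_idx, lam=0.1):
    """Cross entropy against the label-smoothed target distribution."""
    p = softmax(logits)
    n = len(logits)
    loss = 0.0
    for i, pi in enumerate(p):
        q = (1.0 - lam) * (1.0 if i == true_idx else 0.0) + lam / n
        loss -= q * math.log(pi)
    return loss

hard = lsr_cross_entropy([2.0, 0.5, 0.1], 0, lam=0.0)   # plain cross entropy
smooth = lsr_cross_entropy([2.0, 0.5, 0.1], 0, lam=0.1)
```

With λ = 0 this reduces to the ordinary cross entropy; with λ > 0 some target mass moves to the negative label positions, which is the effect the passage above describes.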
The joint optimization loss is then constructed: following the design and analysis above, the improved ternary loss function and the label-smoothing-regularized cross entropy loss function jointly supervise network training, and the fused loss function of the joint optimization loss system is expressed as:
L_total = k × L_LSR-ce + L_th
where L_LSR-ce is the cross entropy loss function incorporating LSR, L_th is the improved ternary loss function, and k is the weight coefficient for fusing the two loss functions. As a hyper-parameter of the network, k takes values in [0,1] and can be adjusted according to the training situation of the network.
This joint optimization loss strategy effectively controls the convergence of the model, so that the network approaches the optimization curve, achieves effective similarity measurement, and improves the accuracy of fine-grained individual classification.
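The fusion itself is a one-line weighted sum. The sketch below reuses illustrative loss values; the setting k = 0.5 is an assumption for the example, chosen within the stated [0,1] range, not a value from the patent.

```python
# Sketch of the fused joint-optimization loss: L_total = k * L_LSR-ce + L_th,
# with k a hyper-parameter in [0, 1] adjusted to the training situation.
def joint_loss(lsr_ce_loss, ternary_loss, k=0.5):
    assert 0.0 <= k <= 1.0, "k is a hyper-parameter in [0, 1]"
    return k * lsr_ce_loss + ternary_loss

# Illustrative loss values for one training step.
total = joint_loss(0.43, 0.80, k=0.5)  # approximately 1.015
```

In training, both component losses would be computed on the same batch and the summed scalar back-propagated through the shared backbone.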
Human gait is in essence a time-series problem: the frames of a gait sequence within a period follow a strict order, and each frame of the gait contour and skeleton diagram represents the motion posture at a particular moment of the gait cycle. Therefore, on the basis of extracting the shape features of the images, whether the temporal information of the gait sequences can be sufficiently mined is the key to learning feature distance similarity and improving identity recognition accuracy. Compared with a convolutional neural network, a recurrent neural network is better at feature learning on sequence data, so a multi-channel spatio-temporal network combining a convolutional neural network with a long short-term memory network is proposed, with an attention mechanism fused in to fully extract the spatio-temporal information of the gait sequences; its specific structure is shown in fig. 2.
In the multi-channel spatio-temporal feature extraction network shown in fig. 2, the one-to-one corresponding contour sequence and skeleton sequence are used together as input; because the gait contour sequence consists of binary images while the skeleton sequence consists of RGB images, the two kinds of gait sequence are processed to the same size before input. They are then fed into the two channels of the convolutional network to extract the spatial features of the gait sequences, and the features are fused by tensor concatenation. The long short-term memory layer is a two-layer network with 256 hidden nodes; the fused feature maps output by the convolutional network are fed in temporal order as the input of this later network, and the LSTM module decodes them to extract temporal features.
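As a rough illustration of how data flows through such a backbone, the following sketch traces dimensions through one hypothetical configuration. The frame count, input size and kernel sizes are all assumptions introduced for the example; only the two-channel concatenation fusion and the 256-unit LSTM come from the description above.

```python
# Hedged shape-flow sketch of the two-channel backbone: each channel's shallow
# CNN maps a frame to a spatial feature vector, the two channels are fused by
# concatenation, and the per-frame fused vectors form the LSTM input sequence.

def conv_out(size, kernel, stride=1, pad=0):
    """Output spatial size of a conv/pool layer (standard formula)."""
    return (size + 2 * pad - kernel) // stride + 1

frames, h = 30, 64                       # hypothetical: 30 frames of 64x64 input
feat = conv_out(conv_out(h, 5), 2, 2)    # e.g. 5x5 conv then 2x2 max-pool, per channel
contour_dim = skeleton_dim = feat * feat # flattened per-channel spatial feature
fused_dim = contour_dim + skeleton_dim   # tensor-concatenation fusion
lstm_input_shape = (frames, fused_dim)   # sequence fed to the 2-layer, 256-unit LSTM
```

The point of the sketch is the ordering: per-frame spatial extraction in both channels, concatenation, then temporal decoding over the frame axis.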
On this basis, an attention mechanism (Attention) is also introduced. It overcomes the limitation that traditional networks for continuous image sequences can only extract features from a limited number of input frames, thereby making maximum use of the associations between different frames within a gait cycle and weakening the influence of irrelevant frames on the network output. Through the multi-channel spatio-temporal network, key frames can be captured and their gait features emphasized to increase the accuracy and robustness of the network model. If each gait sequence has Z frames in total, the long short-term memory neural network outputs Z scores. After normalization, the weight of each frame of the sequence is obtained; the calculation is given by the following formula:
Q_j = exp(c_j) / Σ_{i=1}^{Z} exp(c_i)
where Q_j is the weight coefficient of the j-th frame of the gait sequence and c_j is the output value of the long short-term memory neural network for the fused feature of the j-th frame. With the weight coefficients obtained, the attention-based spatio-temporal feature can then be calculated as:
F = Σ_{j=1}^{Z} Q_j f_j

where F denotes the spatio-temporal feature obtained based on the attention mechanism, Q_j is the weight coefficient of the j-th frame obtained from the attention mechanism, and f_j is the fused feature of the j-th skeleton and contour frames.
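As a concrete illustration of the two formulas above, the attention pooling can be sketched in a few lines of NumPy. The names `attention_fuse`, `c`, `f` and the toy sizes are ours, not from the patent; the exponential normalization is the usual softmax reading of the weight formula.

```python
import numpy as np

def attention_fuse(c, f):
    """Attention pooling over a gait cycle of Z frames.

    c : (Z,)   LSTM output score c_j for each frame
    f : (Z, D) fused skeleton+contour feature of each frame
    Returns the spatio-temporal feature F = sum_j Q_j * f_j.
    """
    e = np.exp(c - c.max())      # softmax normalization of the scores
    Q = e / e.sum()              # weight coefficients Q_j, summing to 1
    return Q @ f                 # attention-weighted sum of frame features

# Toy example: Z = 4 frames, D = 3 feature dimensions.
c = np.array([0.1, 2.0, 0.3, -1.0])
f = np.ones((4, 3))              # identical frame features
F = attention_fuse(c, f)         # weights sum to 1, so F stays all ones
```

Because the weights sum to one, identical frame features pass through unchanged; in practice `f` would hold the fused convolutional features decoded by the LSTM.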
The above embodiments are only preferred embodiments of the present invention and do not limit its protection scope; any insubstantial changes and substitutions made on the basis of the present invention by those skilled in the art fall within the protection scope of the present invention.

Claims (10)

1. An identity recognition method based on a multi-channel spatio-temporal network and joint optimization loss, characterized in that: the method comprises a multi-channel spatio-temporal network system and a joint optimization loss system, wherein the joint optimization loss system comprises an improved ternary (triplet) loss function and a label-smoothing-regularized cross-entropy loss function; the latter is the cross-entropy loss function used in training a conventional classification network, with label smoothing regularization incorporated into the cross-entropy calculation; the method is realized by the following steps:
step one, gait sequence preprocessing: gait images in the CASIA-B gait database are preprocessed by a gait image preprocessing algorithm into a gait contour sequence and a skeleton sequence of consistent size and aligned centers;
step two, the gait skeleton sequence and contour sequence obtained by the preprocessing are input together into the multi-channel spatio-temporal network system so as to fully extract the spatio-temporal features of the gait sequences;
step three, a gait identification model is established in combination with a triplet network;
and step four, network training is jointly supervised by combining the improved ternary loss with the optimized cross-entropy loss.
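A minimal sketch of the contour (silhouette) half of step one. The patent does not disclose the exact preprocessing algorithm; bounding-box cropping, center alignment and resizing to a fixed square is the common normalization for CASIA-B silhouettes, and the function name and 64-pixel output size here are our assumptions.

```python
import numpy as np

def preprocess_silhouette(img, out_size=64):
    """Center-align a binary gait contour and resize it to a square.

    Crop to the bounding box of the subject, pad to a square with the
    contour centered, then resize by nearest-neighbour sampling.
    """
    ys, xs = np.nonzero(img)
    crop = img[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    h, w = crop.shape
    side = max(h, w)
    square = np.zeros((side, side), dtype=img.dtype)
    top, left = (side - h) // 2, (side - w) // 2
    square[top:top + h, left:left + w] = crop
    # Nearest-neighbour resize to out_size x out_size.
    idx = (np.arange(out_size) * side / out_size).astype(int)
    return square[np.ix_(idx, idx)]

# Toy example: a 10x6 blob inside a 32x32 frame.
frame = np.zeros((32, 32), dtype=np.uint8)
frame[10:20, 5:11] = 1
sil = preprocess_silhouette(frame)
```

The skeleton frames, being RGB, would be resized to the same output size so that both channels receive inputs of identical dimensions.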
2. The identity recognition method based on the multi-channel spatio-temporal network and the joint optimization loss as claimed in claim 1, characterized in that: the multi-channel spatio-temporal network system adopts, as the backbone for feature extraction, a multi-channel shallow convolutional neural network connected in series with a long short-term memory neural network, and the one-to-one corresponding gait skeleton and contour sequences within a cycle are used directly as the input of the network so as to fully mine the spatio-temporal information between the gait sequences.
3. The identity recognition method based on the multi-channel spatio-temporal network and the joint optimization loss as claimed in claim 1, characterized in that: the improved ternary loss improves the way positive and negative samples are selected during ternary-loss training, imposing a stronger constraint on the selection of positive and negative samples.
4. The identity recognition method based on the multi-channel spatio-temporal network and the joint optimization loss as claimed in claim 3, characterized in that: the ternary loss value is calculated by computing, in each batch during training, the Euclidean distances in feature space between all samples, and evaluating the loss with the negative sample closest to the anchor (original) sample and the positive sample farthest from it; the calculation formula is:
L_th = Σ_{i=1}^{p} Σ_{a=1}^{k} [ max_{x⁺ ∈ A} D(a, x⁺) − min_{x⁻ ∈ B} D(a, x⁻) + α ]_+

where D(·,·) denotes Euclidean distance in feature space, [·]_+ = max(·, 0), and α is the margin.
Each batch inputs original samples of p classes, and k different gait sequences are selected from the samples of each class, giving p × k gait sequences per batch; L_th is the final ternary loss value, a denotes the anchor (original) sample, A is the set of positive samples farthest from the anchor, and B is the set of negative samples closest to the anchor.
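For illustration, the batch-hard selection of claim 4 — farthest positive and closest negative per anchor — can be sketched in NumPy. The function name and the margin value 0.2 are ours, not from the patent.

```python
import numpy as np

def batch_hard_triplet_loss(feats, labels, margin=0.2):
    """Batch-hard ternary (triplet) loss over one batch.

    feats  : (N, D) embedding of each gait sequence in the batch
    labels : (N,)   class id of each sequence
    For every anchor, take the FARTHEST positive and the CLOSEST
    negative by Euclidean distance, then apply a hinge with a margin.
    """
    # Pairwise Euclidean distance matrix.
    diff = feats[:, None, :] - feats[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1) + 1e-12)
    same = labels[:, None] == labels[None, :]
    losses = []
    for a in range(len(feats)):
        pos = dist[a][same[a]]    # includes d(a, a) = 0, harmless for max
        neg = dist[a][~same[a]]
        losses.append(max(0.0, pos.max() - neg.min() + margin))
    return float(np.sum(losses))

# Toy batch: p = 2 classes, k = 2 sequences each, well separated.
feats = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0]])
labels = np.array([0, 0, 1, 1])
loss = batch_hard_triplet_loss(feats, labels)   # separated classes give 0
```

When all embeddings collapse to one point, every anchor contributes exactly the margin, which is what pushes the network to separate the classes.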
5. The identity recognition method based on the multi-channel spatio-temporal network and the joint optimization loss as claimed in claim 4, characterized in that: the label smoothing regularization (LSR) method in the label-smoothing-regularized cross-entropy loss function is calculated as:
q_i = (1 − λ) δ_{i,y} + λ / n
where λ is the weight of the smoothed label, with value range λ ∈ [0, 1]; n is the number of label classes; δ_{i,y} equals 1 when i is the true class y and 0 otherwise; and q_i is the smoothed target probability of class i.
6. The identity recognition method based on the multi-channel spatio-temporal network and the joint optimization loss as claimed in claim 5, characterized in that: the expression after incorporating LSR into the cross-entropy loss is:
L_{LSR-ce} = − Σ_{i=1}^{n} [ (1 − λ) δ_{i,y} + λ / n ] log p_i

where p_i is the probability the network predicts for class i.
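A sketch of the LSR cross-entropy of claims 5-6 in NumPy. The function name and the value lam=0.1 are ours; the patent only requires λ ∈ [0, 1].

```python
import numpy as np

def lsr_cross_entropy(logits, y, lam=0.1):
    """Cross-entropy with label smoothing regularization.

    The one-hot target is replaced by q_i = (1 - lam)*[i == y] + lam/n
    before the usual cross-entropy is taken.
    """
    n = logits.shape[0]
    # Softmax with max-subtraction for numerical stability.
    e = np.exp(logits - logits.max())
    p = e / e.sum()
    # Smoothed target distribution q.
    q = np.full(n, lam / n)
    q[y] += 1.0 - lam
    return float(-(q * np.log(p)).sum())

# With lam = 0 this reduces to the ordinary cross-entropy -log p_y;
# smoothing spreads a little mass over the wrong classes and so
# penalizes over-confident predictions.
logits = np.array([2.0, 0.5, -1.0])
plain = lsr_cross_entropy(logits, y=0, lam=0.0)
smoothed = lsr_cross_entropy(logits, y=0, lam=0.1)
```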
7. The identity recognition method based on the multi-channel spatio-temporal network and the joint optimization loss as claimed in claim 6, characterized in that: the improved ternary loss function and the label-smoothing-regularized cross-entropy loss function are combined to jointly supervise network training; the loss function of the fused joint optimization loss system is expressed as: L_total = k × L_{LSR-ce} + L_th
where L_{LSR-ce} is the cross-entropy loss function incorporating LSR, L_th is the improved ternary loss function, and k is the weight coefficient for fusing the two loss functions.
8. The identity recognition method based on the multi-channel spatio-temporal network and the joint optimization loss as claimed in any one of claims 1 to 7, characterized in that: the system further comprises an attention mechanism by which the multi-channel spatio-temporal network system can capture key frames and focus on extracting their gait features, thereby improving the accuracy and robustness of the network model.
9. The identity recognition method based on the multi-channel spatio-temporal network and joint optimization loss as claimed in claim 8, characterized in that: the attention mechanism comprises a weight for each frame of the gait sequence, obtained by normalizing, over the total number of frames in each gait sequence, the scores output by the corresponding long short-term memory neural network; the calculation formula is:
Q_j = exp(c_j) / Σ_{i=1}^{Z} exp(c_i)
where Q_j is the weight coefficient of the j-th frame of the gait sequence, c_j is the output value of the long short-term memory neural network for the fused feature of the j-th frame, and Z is the total number of frames in the sequence.
10. The identity recognition method based on the multi-channel spatio-temporal network and joint optimization loss as claimed in claim 9, characterized in that: the attention-based spatio-temporal feature is further calculated from the obtained weight coefficients Q_j according to the formula:
F = Σ_{j=1}^{Z} Q_j f_j

where F denotes the spatio-temporal feature obtained based on the attention mechanism, Q_j is the weight coefficient of the j-th frame obtained from the attention mechanism, and f_j is the fused feature of the j-th skeleton and contour frames.
CN202010926230.6A 2020-09-07 2020-09-07 Identity recognition method based on multi-channel space-time network and joint optimization loss Pending CN112131970A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010926230.6A CN112131970A (en) 2020-09-07 2020-09-07 Identity recognition method based on multi-channel space-time network and joint optimization loss


Publications (1)

Publication Number Publication Date
CN112131970A true CN112131970A (en) 2020-12-25

Family

ID=73848229

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010926230.6A Pending CN112131970A (en) 2020-09-07 2020-09-07 Identity recognition method based on multi-channel space-time network and joint optimization loss

Country Status (1)

Country Link
CN (1) CN112131970A (en)


Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016065534A1 (en) * 2014-10-28 2016-05-06 中国科学院自动化研究所 Deep learning-based gait recognition method
CN106778527A (en) * 2016-11-28 2017-05-31 中通服公众信息产业股份有限公司 A kind of improved neutral net pedestrian recognition methods again based on triple losses
US10025950B1 (en) * 2017-09-17 2018-07-17 Everalbum, Inc Systems and methods for image recognition
CN108960184A (en) * 2018-07-20 2018-12-07 天津师范大学 A kind of recognition methods again of the pedestrian based on heterogeneous components deep neural network
CN110059616A (en) * 2019-04-17 2019-07-26 南京邮电大学 Pedestrian's weight identification model optimization method based on fusion loss function
CN110321862A (en) * 2019-07-09 2019-10-11 天津师范大学 A kind of pedestrian's recognition methods again based on the loss of compact ternary
WO2020122985A1 (en) * 2018-12-10 2020-06-18 Interactive-Al, Llc Neural modulation codes for multilingual and style dependent speech and language processing
CN111428658A (en) * 2020-03-27 2020-07-17 大连海事大学 Gait recognition method based on modal fusion


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
JUNYI WU et al.: "Improving Person Re-Identification Performance Using Body Mask Via Cross-Learning Strategy", 2019 IEEE Visual Communications and Image Processing (VCIP), pages 1-4 *
WU, Ying: "Research on Gait Recognition Based on Deep Learning", China Master's Theses Full-text Database (Information Science and Technology), no. 1, pages 138-2108 *
ZHANG, Tao et al.: "An Improved Person Re-identification Algorithm Based on Global Features", Laser & Optoelectronics Progress, vol. 57, no. 24, pages 324-330 *
XIONG, Wei et al.: "Research on Person Re-identification Algorithm Based on Global Feature Concatenation", Application Research of Computers, vol. 38, no. 1, pages 316-320 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2023523502A (en) * 2021-04-07 2023-06-06 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Model training methods, pedestrian re-identification methods, devices and electronics
JP7403673B2 (en) 2021-04-07 2023-12-22 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Model training methods, pedestrian re-identification methods, devices and electronic equipment
CN112906673A (en) * 2021-04-09 2021-06-04 河北工业大学 Lower limb movement intention prediction method based on attention mechanism
CN113222775A (en) * 2021-05-28 2021-08-06 北京理工大学 User identity correlation method integrating multi-mode information and weight tensor
CN114511848A (en) * 2021-12-30 2022-05-17 广西慧云信息技术有限公司 Grape phenological period identification method and system based on improved label smoothing algorithm
CN114511848B (en) * 2021-12-30 2024-05-14 广西慧云信息技术有限公司 Grape waiting period identification method and system based on improved label smoothing algorithm
CN114882593A (en) * 2022-05-18 2022-08-09 厦门市美亚柏科信息股份有限公司 Robust space-time mixed gait feature learning method and system
CN114879849A (en) * 2022-06-07 2022-08-09 吉林大学 Multi-channel air pen gesture recognition method
CN115297441A (en) * 2022-09-30 2022-11-04 上海世脉信息科技有限公司 Method for calculating robustness of individual space-time activity in big data environment
CN115297441B (en) * 2022-09-30 2023-01-17 上海世脉信息科技有限公司 Method for calculating robustness of individual space-time activity in big data environment
CN115841681A (en) * 2022-11-01 2023-03-24 南通大学 Pedestrian re-identification anti-attack method based on channel attention
CN115687934A (en) * 2022-12-30 2023-02-03 智慧眼科技股份有限公司 Intention recognition method and device, computer equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20201225