CN111178115B - Training method and system for object recognition network


Info

Publication number
CN111178115B
CN111178115B (application CN201811340221.8A)
Authority
CN
China
Prior art keywords
network
sub
processing
sample image
target sample
Prior art date
Legal status
Active
Application number
CN201811340221.8A
Other languages
Chinese (zh)
Other versions
CN111178115A (en)
Inventor
袁培江
史震云
李建民
任鹏远
Current Assignee
Beijing Sensing Tech Co ltd
Original Assignee
Beijing Sensing Tech Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sensing Tech Co ltd filed Critical Beijing Sensing Tech Co ltd
Priority to CN201811340221.8A
Publication of CN111178115A
Application granted
Publication of CN111178115B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Human Computer Interaction (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a training method and system for an object recognition network. The method comprises: inputting a plurality of sample images in a training set into a first recognition network respectively for processing, and obtaining teacher features of a plurality of first views of each sample image; inputting each sample image into a second recognition network respectively for processing, obtaining a first network loss of the second recognition network, and obtaining a second network loss and a third network loss of the second recognition network according to the teacher features; and training the second recognition network according to the first network loss, the second network loss and the third network loss to obtain a trained second recognition network. The second recognition network obtained by this training can accurately recognize the target object, and the disclosed training method has good scalability: a plurality of first recognition networks can be added to train the second recognition network.

Description

Training method and system for object recognition network
Technical Field
The disclosure relates to the technical field of machine learning, and in particular relates to a training method and system of an object recognition network.
Background
Pedestrian re-identification (ReID) uses one or more images of a pedestrian to search a gallery for images of the same person captured by cameras with other views.
Early ReID techniques relied on hand-crafted image features and achieved poor accuracy; accuracy improved greatly once deep learning came into use. The current mainstream ReID technology is based on deep learning, and its recognition accuracy can be improved through deep learning; however, deep-learning-based ReID often suffers from low recognition efficiency because the model is poorly trained.
Therefore, a new training method is needed to improve the recognition accuracy and the working efficiency of deep-learning-based ReID.
Disclosure of Invention
According to an aspect of the present disclosure, there is provided a training method of an object recognition network, the method comprising:
respectively inputting a plurality of sample images in a training set into a first recognition network for processing, and obtaining teacher characteristics of a plurality of first views of each sample image;
inputting each sample image into a second identification network respectively for processing, obtaining a first network loss of the second identification network, and obtaining a second network loss and a third network loss of the second identification network according to the teacher characteristics;
Training the second identification network according to the first network loss, the second network loss and the third network loss to obtain a trained second identification network,
the second recognition network is used for recognizing the identity of the target object in the image to be processed, and the first recognition network is a teacher network used for training the second recognition network.
In one possible implementation, the first recognition network includes a first view decomposition sub-network, a first image augmentation sub-network, a first convolution sub-network, a first pooling sub-network and a first embedding sub-network,
the method for processing the plurality of sample images in the training set in the first recognition network comprises the steps of:
inputting a target sample image into a first view decomposition sub-network to perform view decomposition processing to obtain a plurality of first views of the target sample image, wherein the target sample image is any one of the plurality of sample images;
inputting the plurality of first views into a first image augmentation sub-network for augmentation processing to obtain a plurality of augmented second views;
performing convolution, pooling and embedding processing on the plurality of first views and the plurality of second views sequentially through a first convolution sub-network, a first pooling sub-network and a first embedding sub-network, so as to obtain first feature vectors of the plurality of first views and second feature vectors of the plurality of second views of the target sample image;
and determining teacher characteristics of the target sample image according to the first characteristic vector and the second characteristic vector.
In one possible implementation, the second recognition network includes a feature extraction network, a feature map mapping network, and a feature vector mapping network, the feature extraction network including a second image augmentation sub-network, a second convolution sub-network, a second pooling sub-network, a second embedding sub-network, and a classification sub-network,
each sample image is respectively input into a second identification network for processing, a first network loss of the second identification network is obtained, and a second network loss and a third network loss of the second identification network are obtained according to the teacher characteristics, and the method comprises the following steps:
inputting a target sample image into a second image augmentation sub-network to carry out augmentation processing to obtain an augmented third view, wherein the target sample image is any one of the plurality of sample images;
Inputting the third view into a second convolution sub-network to carry out convolution processing to obtain a feature map of the target sample image;
inputting the feature images of the target sample images into a feature image mapping network for processing to obtain first predicted values of a plurality of first views of the target sample images;
and determining a second network loss of the second identification network according to the teacher characteristics of the plurality of sample images and the first predicted value.
In one possible implementation manner, each sample image is respectively input into a second identification network to be processed, a first network loss of the second identification network is obtained, and a second network loss and a third network loss of the second identification network are obtained according to the teacher feature, and the method further includes:
the feature images of the target sample images are subjected to pooling and embedding treatment sequentially through a second pooling sub-network and a second embedding sub-network, so that third feature vectors of the target sample images are obtained;
inputting the third feature vector into the feature vector mapping network for processing to obtain second predicted values of a plurality of first views of the target sample image;
and determining a third network loss of the second identification network according to the teacher characteristics of the plurality of sample images and the second predicted value.
In a possible implementation manner, the feature map mapping network includes a first view extraction sub-network, a third pooling sub-network and a third embedding sub-network, where the first view extraction sub-network is used to map a feature map of the target sample image into feature maps of multiple first views of the target sample image;
the feature vector mapping network includes a second view extraction sub-network for mapping a third feature vector of the target sample image to feature vectors of a plurality of first views of the target sample image and a mapping sub-network.
In one possible implementation manner, each sample image is respectively input into a second identification network to be processed, and a first network loss of the second identification network is obtained, and the method further includes:
inputting the third feature vector of the target sample image into the classification sub-network for processing to obtain classification information of the target sample image;
determining a first network loss of the second identification network based on the classification information and the labeling information of the plurality of sample images,
wherein the first network loss comprises a cross entropy loss function.
In one possible implementation manner, training the second identification network according to the first network loss, the second network loss and the third network loss to obtain a trained second identification network includes:
determining a weighted sum of the first network loss, the second network loss, and the third network loss as an overall network loss of the second identified network;
and carrying out reverse training on the second identification network according to the total network loss to obtain a trained second identification network.
In one possible implementation, inputting the plurality of first views into the first image augmentation sub-network for augmentation processing to obtain a plurality of augmented second views includes:
performing at least one of random flipping, random occlusion, random matting, random color and random rotation processing on the plurality of first views respectively, to obtain a plurality of second views.
In one possible embodiment, the plurality of first views includes a whole-body view and a partial view.
According to another aspect of the present disclosure, there is provided a training system of an object recognition network, the system comprising:
the first processing module is used for respectively inputting a plurality of sample images in the training set into the first recognition network for processing, and obtaining teacher characteristics of a plurality of first views of each sample image;
The second processing module is connected with the first processing module and is used for respectively inputting each sample image into a second identification network for processing, obtaining the first network loss of the second identification network, and obtaining the second network loss and the third network loss of the second identification network according to the teacher characteristics;
the training module is connected with the second processing module and is used for training the second identification network according to the first network loss, the second network loss and the third network loss to obtain a trained second identification network,
the second recognition network is used for recognizing the identity of the target object in the image to be processed, and the first recognition network is a teacher network used for training the second recognition network.
In one possible implementation, the first recognition network includes a first view decomposition sub-network, a first image augmentation sub-network, a first convolution sub-network, a first pooling sub-network and a first embedding sub-network,
wherein the first processing module comprises:
a first decomposition sub-module, configured to input a target sample image into a first view decomposition sub-network to perform view decomposition processing, and obtain a plurality of first views of the target sample image, where the target sample image is any one of the plurality of sample images;
The first augmentation sub-module is connected with the first decomposition sub-module and is used for performing augmentation processing on the plurality of first views through the first image augmentation sub-network to obtain a plurality of augmented second views;
the first processing sub-module is connected with the first augmentation sub-module and is used for performing convolution, pooling and embedding processing on the plurality of first views and the plurality of second views sequentially through a first convolution sub-network, a first pooling sub-network and a first embedding sub-network to obtain first feature vectors of the plurality of first views and second feature vectors of the plurality of second views of the target sample image;
and the first determining sub-module is connected with the first processing sub-module and is used for determining the teacher features of the target sample image according to the first feature vectors and the second feature vectors.
In one possible implementation, the second recognition network includes a feature extraction network, a feature map mapping network, and a feature vector mapping network, the feature extraction network including a second image augmentation sub-network, a second convolution sub-network, a second pooling sub-network, a second embedding sub-network, and a classification sub-network,
wherein the second processing module comprises:
The second augmentation sub-module is used for inputting the target sample image into a second image augmentation sub-network to carry out augmentation processing to obtain an augmented third view, wherein the target sample image is any one of the plurality of sample images;
the convolution sub-module is connected with the second augmentation sub-module and is used for inputting the third view into a second convolution sub-network to carry out convolution processing to obtain a feature map of the target sample image;
the characteristic map mapping sub-module is connected with the convolution sub-module and is used for inputting the characteristic map of the target sample image into a characteristic map mapping network to be processed, so as to obtain first predicted values of a plurality of first views of the target sample image;
and the second determining sub-module is connected with the characteristic map mapping sub-module and is used for determining second network loss of the second identification network according to teacher characteristics of the plurality of sample images and the first predicted value.
In one possible embodiment, the second processing module further includes:
the second processing sub-module is connected with the convolution sub-module and is used for carrying out pooling and embedding processing on the feature map of the target sample image sequentially through a second pooling sub-network and a second embedding sub-network to obtain a third feature vector of the target sample image;
The feature vector mapping sub-module is connected with the second processing sub-module and is used for inputting the third feature vector into the feature vector mapping network for processing to obtain second predicted values of a plurality of first views of the target sample image;
and the third determining sub-module is connected with the feature vector mapping sub-module and is used for determining the third network loss of the second identification network according to the teacher features of the plurality of sample images and the second predicted value.
In a possible implementation manner, the feature map mapping network includes a first view extraction sub-network, a third pooling sub-network and a third embedding sub-network, and the feature map mapping sub-module is further configured to map a feature map of the target sample image into feature maps of a plurality of first views of the target sample image through the first view extraction sub-network;
the feature vector mapping network comprises a second view extraction sub-network and a mapping sub-network, and the feature vector mapping sub-module is further used for mapping a third feature vector of the target sample image into feature vectors of a plurality of first views of the target sample image through the second view extraction sub-network.
In one possible embodiment, the second processing module further includes:
the classification sub-module is connected with the second processing sub-module and is used for inputting the third feature vector of the target sample image into the classification sub-network for processing to obtain the classification information of the target sample image;
a fourth determination sub-module, connected to the classification sub-module, for determining a first network loss of the second identification network according to classification information and labeling information of the plurality of sample images,
wherein the first network loss comprises a cross entropy loss function.
In one possible embodiment, the training module includes:
an operator module for determining a weighted sum of the first, second, and third network losses as an overall network loss of the second identified network;
and the training sub-module is connected with the operation sub-module and is used for carrying out reverse training on the second identification network according to the total network loss to obtain a trained second identification network.
According to another aspect of the present disclosure, there is provided a training apparatus of an object recognition network, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to perform the training method of the object recognition network described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer program instructions, wherein the computer program instructions, when executed by a processor, implement the above-described training method of an object recognition network.
The second recognition network obtained by training with the method can accurately recognize the target object, and the training method of the present disclosure has good scalability: a plurality of first recognition networks can be added to train the second recognition network.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features and aspects of the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 illustrates a flow chart of a training method of an object recognition network according to an embodiment of the present disclosure.
Fig. 2 shows a schematic diagram of a single view training network according to an embodiment of the present disclosure.
Fig. 3 shows a flowchart of step S110 of a training method of an object recognition network according to an aspect of the present disclosure.
Figs. 4a to 4g show schematic diagrams of multiple views according to an embodiment of the present disclosure.
Fig. 5 shows a schematic diagram of a second identification network according to an embodiment of the present disclosure.
Fig. 6 shows a flowchart of step S120 of a training method of an object recognition network according to an embodiment of the present disclosure.
FIG. 7 illustrates a block diagram of a training system of an object recognition network, according to an embodiment of the present disclosure.
FIG. 8 illustrates a block diagram of a training system of an object recognition network, according to an embodiment of the present disclosure.
Fig. 9 shows a block diagram of a training apparatus for an object recognition network according to an embodiment of the present disclosure.
Fig. 10 shows a block diagram of a training apparatus for an object recognition network according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the disclosure will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
In addition, numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art have not been described in detail in order not to obscure the present disclosure.
Referring to fig. 1, fig. 1 shows a flowchart of a training method of an object recognition network according to an embodiment of the present disclosure.
As shown in fig. 1, the method includes:
step S110, respectively inputting a plurality of sample images in a training set into a first recognition network for processing, and obtaining teacher characteristics of a plurality of first views of each sample image;
step S120, each sample image is respectively input into a second identification network for processing, the first network loss of the second identification network is obtained, and the second network loss and the third network loss of the second identification network are obtained according to the teacher characteristics;
step S130, training the second identification network according to the first network loss, the second network loss and the third network loss to obtain a trained second identification network,
The second recognition network is used for recognizing the identity of the target object in the image to be processed, and the first recognition network is a teacher network used for training the second recognition network.
According to the training method of the object recognition network described above, a plurality of sample images in a training set are input into a first recognition network for processing to obtain teacher features of a plurality of first views of each sample image; each sample image is input into a second recognition network for processing to obtain a first network loss of the second recognition network, and a second network loss and a third network loss of the second recognition network are obtained according to the teacher features; the second recognition network is then trained according to the first network loss, the second network loss and the third network loss to obtain a trained second recognition network. The second recognition network obtained through this training can accurately recognize the target object, and the training method of the present disclosure has good scalability: a plurality of first recognition networks can be added to train the second recognition network.
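For illustration only, the following is a minimal PyTorch-style sketch of how the three losses described above could drive one training epoch, assuming a pre-trained teacher module (the first recognition network) and a student module (the second recognition network) whose outputs are organized per first view; all module, function and parameter names are assumptions rather than elements of the disclosure.

```python
import torch
import torch.nn.functional as F

def train_epoch(teacher, student, loader, optimizer, alpha=4.0, beta=2.0):
    # teacher: pre-trained first recognition network; assumed to return a list of
    #          teacher features, one per first view
    # student: second recognition network; assumed to return classification logits plus
    #          per-view predictions from the feature map mapping network and the
    #          feature vector mapping network
    teacher.eval()
    student.train()
    for images, labels in loader:
        with torch.no_grad():
            teacher_feats = teacher(images)
        logits, map_preds, vec_preds = student(images)
        loss_cls = F.cross_entropy(logits, labels)                                   # first network loss
        loss_map = sum(F.mse_loss(p, t) for p, t in zip(map_preds, teacher_feats))   # second network loss
        loss_vec = sum(F.mse_loss(p, t) for p, t in zip(vec_preds, teacher_feats))   # third network loss
        loss = loss_cls + alpha * loss_map + beta * loss_vec                         # overall network loss
        optimizer.zero_grad()
        loss.backward()                                                              # reverse training
        optimizer.step()
```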
For step S110:
In one possible implementation, a training set may be preset, which may include a plurality of sample images for training of the object recognition network.
In one possible implementation, the first recognition network may be a teacher network (or teacher model) that uses knowledge distillation (Knowledge Distillation) or knowledge migration (Knowledge Transfer) for knowledge delivery.
A system employing knowledge distillation (hereinafter referred to as a knowledge distillation system) may further include a student network (or student model). Through knowledge distillation, the teacher network converts hard knowledge into soft knowledge and transfers it to the student network so as to train the student network, thereby improving the accuracy and execution efficiency of the student network.
In one possible implementation, the first recognition network may be a deep learning network trained in advance, for example, a single view training network.
Referring to fig. 2, fig. 2 shows a schematic diagram of a single view training network according to an embodiment of the present disclosure.
As shown in fig. 2, the first recognition network (single view training network) may include a first view decomposition sub-network 401, a first image augmentation sub-network 402, a first convolution sub-network 403, a first pooling sub-network 404, a first embedding sub-network 405, a first classification sub-network 406, and the like.
In one possible implementation, the first recognition network may be a deep learning network built according to model data of the single-view training network, for example, after the first recognition network is built, the network model of the first recognition network may be initialized by using the trained model data of the single-view training network, where the model data may include, for example, weight parameters of each sub-network.
In one possible implementation, the first recognition network may include all of the sub-networks of the single-view training network, or may include a portion of the sub-networks of the single-view training network, e.g., the first recognition network may include other sub-networks of the single-view training network than the first classification sub-network 406.
Referring to fig. 3, fig. 3 is a flowchart illustrating step S110 of a training method of an object recognition network according to an aspect of the present disclosure.
As shown in fig. 3, step S110 of inputting a plurality of sample images in a training set into a first recognition network for processing, to obtain teacher features of a plurality of first views of each sample image may include:
step S111, inputting a target sample image into a first view decomposition sub-network to perform view decomposition processing, and obtaining a plurality of first views of the target sample image, wherein the target sample image is any one of the plurality of sample images;
Step S112, inputting the plurality of first views into a first image augmentation sub-network for augmentation processing to obtain a plurality of augmented second views;
step S113, the plurality of first views and the plurality of second views sequentially perform convolution, pooling and embedding processing through a first convolution sub-network, a first pooling sub-network and a first embedding sub-network, so as to obtain first feature vectors of the plurality of first views and second feature vectors of the plurality of second views of the target sample image;
step S114, determining a teacher feature of the target sample image according to the first feature vector and the second feature vector.
For step S111:
in one possible implementation, the first view decomposition sub-network 401 may decompose the input target sample image into multiple views, and may also perform scaling or the like on the target sample image.
Referring to figs. 4a to 4g, figs. 4a to 4g show schematic diagrams of multiple views according to an embodiment of the present disclosure.
Fig. 4a may be a whole-body view of the target sample image, and figs. 4b to 4g are multiple partial views of the target sample image.
The multiple partial views of figs. 4b to 4g may be obtained as follows:
The whole-body view shown in fig. 4a is divided into 4 parts from top to bottom, and the first part, the second part, the third part and the fourth part can be respectively used as the partial views shown in figs. 4b to 4d.
The whole-body view shown in fig. 4a is divided into 7 parts from top to bottom; the first, second and third parts can be taken as the partial view shown in fig. 4e, the third, fourth and fifth parts as the partial view shown in fig. 4f, and the fifth, sixth and seventh parts as the partial view shown in fig. 4g.
It should be appreciated that other methods may be employed to obtain a different number of partial views, and the present disclosure is not limited by the manner in which the views are partitioned and the number of views.
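Purely as an illustration of the kind of slicing the first view decomposition sub-network might perform, the sketch below cuts an image tensor into a whole-body view and horizontal partial views; the grouping of strips is one possible reading of the description above, not a prescribed scheme.

```python
import torch

def decompose_views(img: torch.Tensor):
    """Split a whole-body image tensor of shape (C, H, W) into a whole-body
    view and horizontal partial views, in the spirit of figs. 4a to 4g."""
    _, h, _ = img.shape
    views = [img]                                   # whole-body view (fig. 4a)
    quarter = h // 4
    for i in range(4):                              # 4-part split, each part a partial view
        views.append(img[:, i * quarter:(i + 1) * quarter, :])
    seventh = h // 7
    for start in (0, 2, 4):                         # 7-part split: parts 1-3, 3-5, 5-7
        views.append(img[:, start * seventh:(start + 3) * seventh, :])
    return views
```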
For step S112:
in one possible implementation, inputting the plurality of first views into the first image augmentation sub-network for augmentation processing may include performing at least one of random flipping, random occlusion, random matting, random color and random rotation processing on the input first views.
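For illustration, the listed operations roughly correspond to standard torchvision transforms; the particular transforms and parameter values below are assumptions, not a pipeline prescribed by the disclosure.

```python
import torchvision.transforms as T

augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                            # random flipping
    T.RandomRotation(degrees=10),                             # random rotation
    T.ColorJitter(brightness=0.2, contrast=0.2),              # random color
    T.RandomResizedCrop(size=(256, 128), scale=(0.8, 1.0)),   # random cropping/matting
    T.ToTensor(),
    T.RandomErasing(p=0.5),                                   # random occlusion
])
```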
For step S113:
in one possible implementation, the first convolution sub-network 403 may employ any of a variety of deep convolutional neural networks, including but not limited to ResNet, DenseNet, SqueezeNet and the like; the input first or second view is processed through the deep convolutional neural network to obtain a corresponding feature map.
In one possible implementation, the first pooling sub-network 404 can employ a variety of global pooling operations, including but not limited to global average pooling (GAP), global max pooling (GMP) and the like, to process the feature map output by the first convolution sub-network 403 and output a globally pooled feature vector.
In one possible implementation, the first embedding sub-network 405 can include one or more fully connected (FC) layers and batch normalization (BN) layers to perform dimension reduction on the features output by the first pooling sub-network 404. For example, when the dimension of the feature output by the first pooling sub-network 404 is 2048, the first embedding sub-network 405 can output feature vectors with a dimension such as 256, 512, 1024 or 2048.
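A minimal sketch of how such a convolution, global pooling and FC+BN embedding pipeline could be composed, assuming a ResNet-50 backbone and an illustrative embedding dimension:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class FeaturePipeline(nn.Module):
    """Convolution sub-network, global average pooling and FC+BN embedding."""
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        backbone = models.resnet50(weights=None)
        self.conv = nn.Sequential(*list(backbone.children())[:-2])  # keep the feature maps
        self.pool = nn.AdaptiveAvgPool2d(1)                         # global average pooling
        self.embed = nn.Sequential(nn.Linear(2048, embed_dim),      # dimension reduction
                                   nn.BatchNorm1d(embed_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fmap = self.conv(x)                   # (N, 2048, h, w)
        vec = self.pool(fmap).flatten(1)      # (N, 2048)
        return self.embed(vec)                # (N, embed_dim)
```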
For step S114:
in one possible implementation, the teacher feature may be obtained, for example, as the mean of the first feature vector and the second feature vector:

T_j^m = ( θ_m(view_m(I_j)) + θ_m(flip(view_m(I_j))) ) / 2

where j denotes the number of the target sample image in the training set, m denotes the number of the first view, T_j^m denotes the teacher feature of target sample image j under first view m, θ_m(view_m(I_j)) denotes the first feature vector of target sample image j under first view m, θ_m(flip(view_m(I_j))) denotes the second feature vector of target sample image j under first view m, and view_m(I_j) denotes the first view m of target sample image I_j.
In this embodiment, flip(·) denotes the random flipping applied to the first view view_m(I_j) to obtain the corresponding second view.
Through step S114, the teacher feature obtained by the first recognition network may be enhanced, and the accuracy of the feature output by the first recognition network may be improved.
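Under the averaging reading of the formula above, the teacher feature of one first view could be computed as follows; theta stands for the first recognition network's feature extractor and is an assumed callable:

```python
import torch

def teacher_feature(theta, view: torch.Tensor) -> torch.Tensor:
    """Average of the feature of a first view and of its horizontally
    flipped copy (the corresponding second view)."""
    with torch.no_grad():
        f1 = theta(view)                              # first feature vector
        f2 = theta(torch.flip(view, dims=[-1]))       # second feature vector (flipped view)
    return 0.5 * (f1 + f2)
```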
Teacher features of the target sample image under different first views can be obtained through a plurality of first recognition networks, and the plurality of first recognition networks can operate simultaneously, so that a plurality of teacher features of the target sample image are obtained. For example, when there are 7 kinds of first views, the 7 first views of the target sample image can be processed simultaneously by 7 first recognition networks, thereby obtaining the teacher features of the 7 first views.
For step S120:
referring to fig. 5, fig. 5 shows a schematic diagram of a second identification network according to an embodiment of the present disclosure.
As shown in fig. 5, the second recognition network may include a feature extraction network 60, a feature map mapping network (Feature Maps Factorization Branch) 61, and a feature vector mapping network (Representation Factorization Branch) 62.
In one possible implementation, the feature extraction network 60 includes a second image augmentation sub-network 602, a second convolution sub-network 603, a second pooling sub-network 604, a second embedding sub-network 605, and a classification sub-network 606.
In the present embodiment, the feature extraction network 60 may be established according to the first identification network, and for example, the feature extraction network 60 may be established by taking a weight parameter of the first identification network as an initial weight parameter of the feature extraction network 60.
In one possible implementation, the feature extraction network 60 may be a student network in a knowledge distillation system.
In one possible implementation, the feature map mapping network 61 may include a first view extraction sub-network 611, a third pooling sub-network 612, and a third embedding sub-network 613.
In one possible implementation, the feature vector mapping network 62 may include a second view extraction sub-network 621 and a mapping sub-network 622.
Referring to fig. 6, fig. 6 is a flowchart illustrating step S120 of a training method of an object recognition network according to an embodiment of the present disclosure.
As shown in fig. 6, step S120 of inputting each sample image into a second recognition network for processing, obtaining a first network loss of the second recognition network, and obtaining a second network loss and a third network loss of the second recognition network according to the teacher feature may include:
Step S231, inputting a target sample image into a second image augmentation sub-network for augmentation processing to obtain an augmented third view, wherein the target sample image is any one of the plurality of sample images;
step S232, inputting the third view into a second convolution sub-network for convolution processing to obtain a feature map of the target sample image;
step S233, inputting the feature map of the target sample image into a feature map mapping network for processing, and obtaining first predicted values of a plurality of first views of the target sample image;
step S234, determining a second network loss of the second identification network according to the teacher characteristics of the plurality of sample images and the first predicted value.
In step S120, the second identification network may process the whole-body view of each sample image, thereby obtaining the first network loss, the second network loss and the third network loss of the second identification network. However, the present disclosure does not limit the processing object of the second identification network, and the second identification network may also process other views of each sample image to obtain the above network losses. The second identification network may also include a second view decomposition sub-network 601 (shown in fig. 5), which may be the first view decomposition sub-network 401 of the first recognition network or a variant thereof; through the second view decomposition sub-network 601, the whole-body view or other partial views of each sample image can be obtained.
For step S231:
in one possible implementation, inputting the target sample image into the second image augmentation sub-network for augmentation processing may include: performing at least one of random flipping, random occlusion, random matting, random color and random rotation processing on the input image.
For step S232:
when the second convolution sub-network 603 receives the third view output by the second image augmentation sub-network 602, the second convolution sub-network 603 may perform convolution processing on the third view, thereby extracting the feature map of the third view.
For step S233:
in one possible implementation, the first view extraction sub-network 611 may be used to map the feature map of the target sample image to feature maps of multiple first views of the target sample image.
In this embodiment, the first view extraction sub-network 611 may also implement a dimension reduction process on the feature map of the first view.
In this embodiment, the first view extraction sub-network 611 may include at least one convolution layer, BN layer, and ReLU (Rectified Linear Units) layer.
In one possible implementation, the third pooling sub-network 612 may perform global pooling processing on the feature maps of the first views.
In a possible implementation manner, the third embedding sub-network 613 may include a fully connected (FC) layer and a BN layer; it receives the feature vectors output by the third pooling sub-network 612 and performs embedding processing on them to obtain the first predicted values.
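A sketch of one per-view head of the feature map mapping network as described above, using a 1x1 convolution with BN and ReLU as the view-extraction step; the channel sizes are assumptions:

```python
import torch
import torch.nn as nn

class FeatureMapMappingHead(nn.Module):
    """View extraction (conv + BN + ReLU), global pooling and FC + BN embedding,
    producing the first predicted value of one first view."""
    def __init__(self, in_channels: int = 2048, mid_channels: int = 512, out_dim: int = 512):
        super().__init__()
        self.extract = nn.Sequential(nn.Conv2d(in_channels, mid_channels, kernel_size=1),
                                     nn.BatchNorm2d(mid_channels),
                                     nn.ReLU(inplace=True))
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.embed = nn.Sequential(nn.Linear(mid_channels, out_dim),
                                   nn.BatchNorm1d(out_dim))

    def forward(self, fmap: torch.Tensor) -> torch.Tensor:
        x = self.extract(fmap)            # feature map of this first view
        x = self.pool(x).flatten(1)       # global pooling
        return self.embed(x)              # first predicted value
```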
For step S234:
in one possible implementation, the first predicted value and the teacher feature may be fitted using a regression loss function to obtain the second network loss.
In this embodiment, the second network loss may be obtained, for example, using an L2 regression loss of the following form:

L_2^k = (1/N) · Σ_{i=1}^{N} ‖ T_i^k − P_i^k ‖²

where L_2^k denotes the second network loss of the first view k, N denotes the number of target sample images in the training set, T_i^k denotes the teacher feature of target sample image i under first view k, and P_i^k denotes the first predicted value of target sample image i under first view k.
Through implementation of the method, the second network loss can be obtained according to the teacher characteristics output by the first identification network and the first predicted value.
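One possible realization of the regression fit above; the squared-error form is an assumption consistent with the reconstructed formula, and the same function can be reused for the third network loss in step S243:

```python
import torch

def view_regression_loss(predicted: torch.Tensor, teacher: torch.Tensor) -> torch.Tensor:
    """(1/N) * sum_i ||T_i - P_i||^2 for one first view; the teacher features
    are detached so that only the second recognition network is trained."""
    diff = predicted - teacher.detach()
    return diff.pow(2).sum(dim=1).mean()
```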
Step S241, the feature map of the target sample image is sequentially subjected to pooling and embedding processing through a second pooling sub-network and a second embedding sub-network, so as to obtain a third feature vector of the target sample image;
Step S242, inputting the third feature vector into the feature vector mapping network for processing, so as to obtain second predicted values of the plurality of first views of the target sample image;
step S243, determining a third network loss of the second identification network according to the teacher characteristics of the plurality of sample images and the second predicted value.
For step S241:
after the second convolution sub-network 603 obtains the feature map of the third view, the second pooling sub-network 604 of the feature extraction network 60 obtains the feature map and performs pooling processing on the feature map, and features obtained after pooling processing are input into the second embedding sub-network 605 to perform embedding processing and then output a third feature vector of the target sample image.
In this embodiment, the second pooling sub-network 604 may perform global pooling processing on the feature map.
For step S242:
in a possible implementation, the second view extraction sub-network 621 is configured to map the third feature vector of the target sample image to feature vectors of a plurality of first views of the target sample image.
In one possible implementation, the mapping sub-network 622 in the feature vector mapping network 62 may map feature vectors of the plurality of first views to obtain second predicted values of the plurality of first views.
In this embodiment, the mapping subnetwork 622 may include at least one FC layer and BN layer.
In this embodiment, when the first view is a whole-body view, the second view extraction sub-network 621 may directly input the third feature vector into the mapping sub-network 622, and the mapping sub-network 622 maps the third feature vector, so as to obtain the second predicted value of the whole-body view.
In the present embodiment, when the second view extraction sub-network 621 extracts the feature vector of the whole-body view, it may input the third feature vector directly into the mapping sub-network 622; alternatively, no second view extraction sub-network 621 need be provided for the whole-body view, in which case the mapping sub-network 622 is connected to the second embedding sub-network 605 and obtains the third feature vector directly from it. When the second view extraction sub-network 621 extracts the feature vector of a partial view, it may extract, from the third feature vector, the part corresponding to that partial view of the target sample image as the feature vector of the first view and input it into the mapping sub-network 622.
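A sketch of one per-view head of the feature vector mapping network; treating the whole-body case as an identity extraction step is one reading of the preceding paragraph, and all dimensions are assumptions:

```python
import torch
import torch.nn as nn

class FeatureVectorMappingHead(nn.Module):
    """Optional view extraction followed by an FC + BN mapping sub-network,
    producing the second predicted value of one first view."""
    def __init__(self, in_dim: int = 512, out_dim: int = 512, whole_body: bool = False):
        super().__init__()
        # for the whole-body view the third feature vector can be passed through directly
        self.extract = nn.Identity() if whole_body else nn.Linear(in_dim, in_dim)
        self.map = nn.Sequential(nn.Linear(in_dim, out_dim),
                                 nn.BatchNorm1d(out_dim))

    def forward(self, third_feature_vector: torch.Tensor) -> torch.Tensor:
        return self.map(self.extract(third_feature_vector))   # second predicted value
```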
For step S243, in one possible implementation, the regression loss function may be used to fit the second predicted value and the teacher feature, so as to obtain the third network loss.
In this embodiment, the third network loss may be obtained, for example, by a formula of the same form:

L_3^k = (1/N) · Σ_{i=1}^{N} ‖ T_i^k − Q_i^k ‖²

where L_3^k denotes the third network loss of the first view k, N denotes the number of sample images in the training set, T_i^k denotes the teacher feature of target sample image i under first view k, and Q_i^k denotes the second predicted value of target sample image i under first view k.
Through implementation of the method, the third network loss can be obtained according to the second predicted value and the teacher characteristic output by the first identification network.
Step S251, inputting the third feature vector of the target sample image into a classification sub-network for processing to obtain classification information of the target sample image;
step S252, determining a first network loss of the second identification network according to the classification information and the labeling information of the plurality of sample images, where the first network loss includes a cross entropy loss function.
For step S251:
in one possible implementation, classification sub-network 606 may be implemented by an FC layer.
In the present embodiment, the classification processing may be performed on the third feature vector by the FC layer, thereby obtaining classification information of the target sample image.
In this embodiment, the classification information may be the probability that the target sample image belongs to the class indicated by its labeling information, and the labeling information of the target sample image may be pre-labeled identity information of the person corresponding to the target sample image, for example the ID of the person or a labeled number of the target sample image.
For step S252:
in one possible implementation, the first network loss of the feature extraction network may be obtained by a softmax cross-entropy of the following form:

L_cls = −(1/N) · Σ_{i=1}^{N} log( exp(Z_{i,y_i}) / Σ_{c=1}^{C} exp(Z_{i,c}) )

where L_cls denotes the first network loss, N denotes the number of sample images in the training set, C denotes the total number of classes in the dataset to which the training set belongs, Z_{i,c} denotes the classification score output by the feature extraction network for target sample image i and class c, y_i denotes the labeling information of sample image i in the training set (for example the labeled ID or number of sample image i), and Z_{i,y_i} denotes the score of target sample image i for its labeled class.
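For reference, the same softmax cross-entropy can be expressed with a standard library call; the tensor shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def classification_loss(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """First network loss: softmax cross-entropy between the classification
    scores Z (shape N x C) and the labeled identities y_i (shape N)."""
    return F.cross_entropy(scores, labels)
```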
For step S130:
as shown in fig. 3, step S130 of training the second identification network according to the first network loss, the second network loss and the third network loss to obtain a trained second identification network may include:
Step S311, determining a weighted sum of the first network loss, the second network loss and the third network loss as an overall network loss of the second identification network;
and step S312, performing reverse training on the second identification network according to the total network loss to obtain a trained second identification network.
For step S311:
in one possible implementation, the overall network loss may be obtained by the following formula:

L_total = L_cls + α · Σ_{k=1}^{K} L_2^k + β · Σ_{k=1}^{K} L_3^k

where L_total denotes the overall network loss, L_cls denotes the first network loss, L_2^k denotes the second network loss of the kth first view, L_3^k denotes the third network loss of the kth first view, K denotes the number of first views, and α and β denote the first constant and the second constant, respectively.
In the present embodiment, the first constant and the second constant may have values of 4 and 2, respectively, and in other embodiments, the first constant and the second constant may have other values, which are not limited herein.
For step S312:
the weight parameters of the second identification network can be updated through random gradient descent algorithms such as SGD and Adam, and therefore reverse training of the second identification network is achieved.
The trained second recognition network can be used for pedestrian re-identification, for example in fields such as security, human-computer interaction and unmanned vending (unmanned shops), and can recognize objects such as persons in images or videos.
By adopting the training method of the present disclosure, an efficient second recognition network can be obtained, which offers high accuracy and high efficiency when performing object recognition.
Referring to fig. 7, fig. 7 illustrates a training system of an object recognition network according to an embodiment of the present disclosure.
As shown in fig. 7, the system may include:
the first processing module 10 is configured to input a plurality of sample images in a training set into a first recognition network for processing, and obtain teacher features of a plurality of first views of each sample image;
the second processing module 20 is connected to the first processing module 10, and is configured to input each sample image into a second identification network for processing, obtain a first network loss of the second identification network, and obtain a second network loss and a third network loss of the second identification network according to the teacher feature;
a training module 30, coupled to the second processing module 20, for training the second identification network according to the first network loss, the second network loss, and the third network loss to obtain a trained second identification network,
The second recognition network is used for recognizing the identity of the target object in the image to be processed, and the first recognition network is a teacher network used for training the second recognition network.
It should be noted that the training system of the object recognition network is the system counterpart of the training method of the object recognition network described above; for its specific description, refer to the foregoing description of the method, which is not repeated here.
Through the cooperation of the modules of the training system of the object recognition network, the second recognition network obtained by training can accurately recognize the target object; moreover, the training method has good scalability, and a plurality of first recognition networks can be added to train the second recognition network.
Referring to fig. 8, fig. 8 illustrates a training system of an object recognition network according to an embodiment of the present disclosure.
In one possible implementation, as shown in fig. 8, the first processing module 10 may include:
a first decomposition sub-module 101, configured to input a target sample image into a first view decomposition sub-network to perform view decomposition processing, and obtain a plurality of first views of the target sample image, where the target sample image is any one of the plurality of sample images;
The first augmentation sub-module 102 is connected to the first decomposition sub-module 101, and is configured to perform augmentation processing on the plurality of first views through the first image augmentation sub-network to obtain a plurality of second views after augmentation;
a first processing sub-module 103, connected to the first augmentation sub-module 102, configured to perform convolution, pooling and embedding processing on the plurality of first views and the plurality of second views sequentially through a first convolution sub-network, a first pooling sub-network and a first embedding sub-network, so as to obtain first feature vectors of the plurality of first views and second feature vectors of the plurality of second views of the target sample image;
a first determining sub-module 104, connected to the first processing sub-module 103, configured to determine a teacher feature of the target sample image according to the first feature vector and the second feature vector.
In one possible embodiment, the second processing module 20 includes:
a second augmentation sub-module 201, configured to input a target sample image into a second image augmentation sub-network to perform augmentation processing, and obtain an augmented third view, where the target sample image is any one of the plurality of sample images;
A convolution sub-module 202, connected to the second augmentation sub-module 201, configured to input the third view into a second convolution sub-network to perform convolution processing, so as to obtain a feature map of the target sample image;
a feature map mapping sub-module 203, coupled to the convolution sub-module 202, configured to input a feature map of the target sample image into a feature map mapping network to process the feature map to obtain first predicted values of a plurality of first views of the target sample image;
a second determining sub-module 204, connected to the feature map mapping sub-module 203, configured to determine a second network loss of the second identification network according to the teacher features of the plurality of sample images and the first predicted value.
The second processing sub-module 211 is connected to the convolution sub-module 202, and is configured to pool and embed the feature map of the target sample image sequentially through a second pooling sub-network and a second embedding sub-network, so as to obtain a third feature vector of the target sample image;
a feature vector mapping sub-module 212, coupled to the second processing sub-module 211, for inputting the third feature vector into the feature vector mapping network for processing, to obtain second predicted values of the plurality of first views of the target sample image;
A third determining sub-module 213, coupled to the feature vector mapping sub-module 212, for determining a third network loss of the second identification network according to the teacher features of the plurality of sample images and the second predicted value.
A classification sub-module 221, connected to the second processing sub-module 211, configured to input a third feature vector of the target sample image into a classification sub-network for processing, so as to obtain classification information of the target sample image;
a fourth determining sub-module 222, connected to the classifying sub-module 221, for determining a first network loss of the second identifying network according to the classifying information and labeling information of the plurality of sample images, wherein the first network loss includes a cross entropy loss function.
In a possible implementation manner, the feature map mapping network includes a first view extraction sub-network, a third pooling sub-network and a third embedding sub-network, and the feature map mapping sub-module is further configured to map a feature map of the target sample image into feature maps of a plurality of first views of the target sample image through the first view extraction sub-network;
in a possible implementation manner, the feature vector mapping network includes a second view extraction sub-network and a mapping sub-network, and the feature vector mapping sub-module is further configured to map a third feature vector of the target sample image into feature vectors of a plurality of first views of the target sample image through the second view extraction sub-network.
In one possible implementation, the training module 30 includes:
an operator module 301, configured to determine a weighted sum of the first network loss, the second network loss, and the third network loss as an overall network loss of the second identification network;
and the training sub-module 302 is connected to the operation sub-module 301, and is configured to perform reverse training on the second identification network according to the overall network loss, so as to obtain a trained second identification network.
It should be noted that the training system for the object recognition network is the system counterpart of the training method for the object recognition network; for the specific description of the system, reference may be made to the foregoing description of the method, which is not repeated herein.
Through the cooperation of the modules of the training system for the object recognition network, the trained second recognition network can accurately recognize the target object. The training approach also has good extensibility: a plurality of first recognition networks can be added to train the second recognition network.
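As one possible way to exploit that extensibility (an illustrative choice, not the only option), the teacher features produced by several first recognition networks could simply be averaged before being used as the distillation target:

```python
import torch

def ensemble_teacher_features(first_networks, sample_images):
    """Average the teacher features of several first recognition networks.

    Each network is assumed to return a tensor of shape [B, V, D] holding teacher
    features for the V first views; the averaging strategy is an assumption.
    """
    with torch.no_grad():
        all_feats = [net(sample_images) for net in first_networks]
    return torch.stack(all_feats, dim=0).mean(dim=0)
```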
Fig. 9 shows a block diagram of a training apparatus for an object recognition network according to an embodiment of the present disclosure. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, or the like.
Referring to fig. 9, apparatus 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the apparatus 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interactions between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the apparatus 800. Examples of such data include instructions for any application or method operating on the device 800, contact data, phonebook data, messages, images, videos, and the like. The memory 804 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply component 806 provides power to the various components of the device 800. The power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device 800.
The multimedia component 808 includes a screen that provides an output interface between the apparatus 800 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or swipe action, but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. The front camera and/or the rear camera may receive external multimedia data when the apparatus 800 is in an operational mode, such as a photographing mode or a video mode. Each of the front camera and the rear camera may be a fixed optical lens system or have focusing and optical zoom capabilities.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, or the like. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing status assessment of various aspects of the apparatus 800. For example, the sensor assembly 814 may detect an on/off state of the device 800 and a relative positioning of components, such as a display and a keypad of the device 800. The sensor assembly 814 may also detect a change in position of the device 800 or of a component of the device 800, the presence or absence of user contact with the device 800, an orientation or acceleration/deceleration of the device 800, and a change in temperature of the device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscopic sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communication between the apparatus 800 and other devices, either in a wired or wireless manner. The device 800 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 816 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 804 including computer program instructions executable by processor 820 of apparatus 800 to perform the above-described methods.
Fig. 10 shows a block diagram of a training apparatus for an object recognition network according to an embodiment of the present disclosure. For example, the apparatus 1900 may be provided as a server.
Referring to fig. 10, the apparatus 1900 includes a processing component 1922 that further includes one or more processors and memory resources represented by memory 1932 for storing instructions, such as application programs, that are executable by the processing component 1922. The application programs stored in memory 1932 may include one or more modules each corresponding to a set of instructions. Further, processing component 1922 is configured to execute instructions to perform the methods described above.
The apparatus 1900 may further include a power component 1926 configured to perform power management of the apparatus 1900, a wired or wireless network interface 1950 configured to connect the apparatus 1900 to a network, and an input/output (I/O) interface 1958. The apparatus 1900 may operate based on an operating system stored in memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 1932, including computer program instructions executable by processing component 1922 of apparatus 1900 to perform the above-described methods.
The present disclosure may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: portable computer disks, hard disks, Random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), Static Random Access Memory (SRAM), portable compact disk read-only memory (CD-ROM), Digital Versatile Disks (DVD), memory sticks, floppy disks, mechanical encoding devices such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. Computer-readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
Computer program instructions for performing the operations of the present disclosure can be assembly instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, Field Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), with state information of computer readable program instructions, which can execute the computer readable program instructions.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (15)

1. A method of training an object recognition network, the method comprising:
respectively inputting a plurality of sample images in a training set into a first recognition network for processing, and acquiring teacher characteristics of a plurality of first views of each sample image, wherein the plurality of first views comprise a whole-body view and a local view;
inputting each sample image into a second identification network respectively for processing, obtaining a first network loss of the second identification network, and obtaining a second network loss and a third network loss of the second identification network according to the teacher characteristics;
training the second identification network according to the first network loss, the second network loss and the third network loss to obtain a trained second identification network,
wherein the second recognition network is used for recognizing the identity of the target object in the image to be processed, and the first recognition network is a teacher network used for training the second recognition network.
2. The method of claim 1, wherein the first identification network comprises a first view decomposition sub-network, a first image augmentation sub-network, a first convolution sub-network, a first pooling sub-network, and a first embedding sub-network,
wherein respectively inputting the plurality of sample images in the training set into the first recognition network for processing, and acquiring the teacher characteristics of the plurality of first views of each sample image, comprises:
inputting a target sample image into a first view decomposition sub-network to perform view decomposition processing to obtain a plurality of first views of the target sample image, wherein the target sample image is any one of the plurality of sample images;
inputting the plurality of first views into a first image augmentation sub-network for augmentation processing to obtain a plurality of augmented second views;
performing convolution, pooling and embedding processing on the plurality of first views and the plurality of second views sequentially through a first convolution sub-network, a first pooling sub-network and a first embedding sub-network, to obtain first feature vectors of the plurality of first views and second feature vectors of the plurality of second views of the target sample image;
and determining teacher characteristics of the target sample image according to the first characteristic vector and the second characteristic vector.
3. The method of claim 1, wherein the second recognition network comprises a feature extraction network, a feature map mapping network, and a feature vector mapping network, the feature extraction network comprising a second image augmentation sub-network, a second convolution sub-network, a second pooling sub-network, a second embedding sub-network, and a classification sub-network,
each sample image is respectively input into a second identification network for processing, a first network loss of the second identification network is obtained, and a second network loss and a third network loss of the second identification network are obtained according to the teacher characteristics, and the method comprises the following steps:
inputting a target sample image into a second image augmentation sub-network to carry out augmentation processing to obtain an augmented third view, wherein the target sample image is any one of the plurality of sample images;
inputting the third view into a second convolution sub-network to carry out convolution processing to obtain a feature map of the target sample image;
inputting the feature map of the target sample image into a feature map mapping network for processing to obtain first predicted values of a plurality of first views of the target sample image;
and determining a second network loss of the second identification network according to the teacher characteristics of the plurality of sample images and the first predicted value.
4. A method according to claim 3, wherein each sample image is respectively input into a second recognition network for processing, a first network loss of the second recognition network is obtained, and a second network loss and a third network loss of the second recognition network are obtained according to the teacher feature, and further comprising:
performing pooling and embedding processing on the feature map of the target sample image sequentially through a second pooling sub-network and a second embedding sub-network to obtain a third feature vector of the target sample image;
inputting the third feature vector into the feature vector mapping network for processing to obtain second predicted values of a plurality of first views of the target sample image;
and determining a third network loss of the second identification network according to the teacher characteristics of the plurality of sample images and the second predicted value.
5. The method of claim 4, wherein
the feature map mapping network comprises a first view extraction sub-network, a third pooling sub-network and a third embedding sub-network, wherein the first view extraction sub-network is used for mapping the feature map of the target sample image into feature maps of a plurality of first views of the target sample image;
the feature vector mapping network includes a second view extraction sub-network for mapping a third feature vector of the target sample image to feature vectors of a plurality of first views of the target sample image and a mapping sub-network.
6. The method of claim 4, wherein each sample image is separately input into a second identification network for processing to obtain a first network loss for the second identification network, further comprising:
inputting the third feature vector of the target sample image into the classification sub-network for processing to obtain classification information of the target sample image;
determining a first network loss of the second identification network based on the classification information and the labeling information of the plurality of sample images,
wherein the first network loss comprises a cross entropy loss function.
7. The method of claim 1, wherein training the second identification network based on the first network loss, the second network loss, and the third network loss to obtain a trained second identification network comprises:
determining a weighted sum of the first network loss, the second network loss, and the third network loss as an overall network loss of the second identification network;
and carrying out reverse training on the second identification network according to the overall network loss to obtain a trained second identification network.
8. The method of claim 2, wherein inputting the plurality of first views into the first image augmentation sub-network for augmentation processing to obtain the plurality of augmented second views comprises:
and respectively carrying out at least one of random overturning processing, random shielding processing, random matting processing, random color processing and random rotation processing on the plurality of first views to obtain a plurality of second views.
9. A training system for an object recognition network, the system comprising:
the first processing module is used for respectively inputting a plurality of sample images in the training set into the first recognition network for processing, and obtaining teacher characteristics of a plurality of first views of each sample image, wherein the plurality of first views comprise a whole-body view and a local view;
The second processing module is connected with the first processing module and is used for respectively inputting each sample image into a second identification network for processing, obtaining the first network loss of the second identification network, and obtaining the second network loss and the third network loss of the second identification network according to the teacher characteristics;
the training module is connected with the second processing module and is used for training the second identification network according to the first network loss, the second network loss and the third network loss to obtain a trained second identification network,
the second recognition network is used for recognizing the identity of the target object in the image to be processed, and the first recognition network is a teacher network used for training the second recognition network.
10. The system of claim 9, wherein the first recognition network comprises a first view decomposition sub-network, a first image augmentation sub-network, a first convolution sub-network, a first pooling sub-network, and a first embedding sub-network,
wherein the first processing module comprises:
a first decomposition sub-module, configured to input a target sample image into a first view decomposition sub-network to perform view decomposition processing, and obtain a plurality of first views of the target sample image, where the target sample image is any one of the plurality of sample images;
the first augmentation sub-module is connected with the first decomposition sub-module and is used for performing augmentation processing on the plurality of first views through the first image augmentation sub-network to obtain a plurality of augmented second views;
the first processing sub-module is connected with the first augmentation sub-module and is used for carrying out convolution, pooling and embedding processing on the plurality of first views and the plurality of second views sequentially through a first convolution sub-network, a first pooling sub-network and a first embedding sub-network to obtain first feature vectors of the plurality of first views and second feature vectors of the plurality of second views of the target sample image;
and the first determining sub-module is connected with the first processing sub-module and is used for determining teacher characteristics of the target sample image according to the first feature vector and the second feature vector.
11. The system of claim 9, wherein the second recognition network comprises a feature extraction network, a feature map mapping network, and a feature vector mapping network, the feature extraction network comprising a second image augmentation sub-network, a second convolution sub-network, a second pooling sub-network, a second embedding sub-network, and a classification sub-network,
Wherein the second processing module comprises:
the second augmentation sub-module is used for inputting the target sample image into a second image augmentation sub-network to carry out augmentation processing to obtain an augmented third view, wherein the target sample image is any one of the plurality of sample images;
the convolution sub-module is connected with the second augmentation sub-module and is used for inputting the third view into a second convolution sub-network to carry out convolution processing to obtain a feature map of the target sample image;
the feature map mapping sub-module is connected with the convolution sub-module and is used for inputting the feature map of the target sample image into a feature map mapping network for processing, so as to obtain first predicted values of a plurality of first views of the target sample image;
and the second determining sub-module is connected with the feature map mapping sub-module and is used for determining a second network loss of the second identification network according to the teacher characteristics of the plurality of sample images and the first predicted value.
12. The system of claim 11, wherein the second processing module further comprises:
the second processing sub-module is connected with the convolution sub-module and is used for carrying out pooling and embedding processing on the feature map of the target sample image sequentially through a second pooling sub-network and a second embedding sub-network to obtain a third feature vector of the target sample image;
The feature vector mapping sub-module is connected with the second processing sub-module and is used for inputting the third feature vector into the feature vector mapping network for processing to obtain second predicted values of a plurality of first views of the target sample image;
and the fourth determination submodule is connected with the feature vector mapping submodule and is used for determining third network loss of the second identification network according to teacher features of the plurality of sample images and the second predicted value.
13. The system of claim 12, wherein
the feature map mapping network comprises a first view extraction sub-network, a third pooling sub-network and a third embedding sub-network, and the feature map mapping sub-module is further used for mapping the feature map of the target sample image into feature maps of a plurality of first views of the target sample image through the first view extraction sub-network;
the feature vector mapping network comprises a second view extraction sub-network and a mapping sub-network, and the feature vector mapping sub-module is further used for mapping a third feature vector of the target sample image into feature vectors of a plurality of first views of the target sample image through the second view extraction sub-network.
14. The system of claim 12, wherein the second processing module further comprises:
the classification sub-module is connected with the second processing sub-module and is used for inputting the third feature vector of the target sample image into the classification sub-network for processing to obtain the classification information of the target sample image;
a fourth determination sub-module, connected to the classification sub-module, for determining a first network loss of the second identification network according to classification information and labeling information of the plurality of sample images,
wherein the first network loss comprises a cross entropy loss function.
15. The system of claim 9, wherein the training module comprises:
an operation sub-module for determining a weighted sum of the first network loss, the second network loss and the third network loss as an overall network loss of the second identification network;
and the training sub-module is connected with the operation sub-module and is used for carrying out reverse training on the second identification network according to the overall network loss to obtain a trained second identification network.
CN201811340221.8A 2018-11-12 2018-11-12 Training method and system for object recognition network Active CN111178115B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811340221.8A CN111178115B (en) 2018-11-12 2018-11-12 Training method and system for object recognition network

Publications (2)

Publication Number Publication Date
CN111178115A CN111178115A (en) 2020-05-19
CN111178115B true CN111178115B (en) 2024-01-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant