CN116882486A - Method, device and equipment for constructing migration learning weight - Google Patents

Method, device and equipment for constructing migration learning weight

Info

Publication number
CN116882486A
Authority
CN
China
Prior art keywords
picture
feature
network
feature extraction
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311141733.2A
Other languages
Chinese (zh)
Other versions
CN116882486B (en)
Inventor
殷俊
吴福明
傅凯
张朋
张学涵
程德强
郑春煌
汪志强
鲁逸峰
蔡丹平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd filed Critical Zhejiang Dahua Technology Co Ltd
Priority to CN202311141733.2A
Publication of CN116882486A
Application granted
Publication of CN116882486B
Active legal status
Anticipated expiration legal status


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/096 Transfer learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The application discloses a method, a device and equipment for constructing transfer learning weights. The method comprises the following steps: acquiring a training sample set, where each sample comprises a picture, a local image obtained by matting a target area on the picture, and a relationship label between the picture and the local image; inputting the picture into a first feature extraction network and the corresponding local image into a second feature extraction network; calculating a loss function value based on the feature similarity of the extracted first picture features and second picture features and an expected similarity; when the loss function value meets the requirement, migrating the weights of the current first feature extraction network and second feature extraction network respectively; otherwise, adjusting the weights of the first feature extraction network and the second feature extraction network and triggering the next training round. The method avoids the poor transfer learning effect caused by excessively large distribution differences between different targets, and is more efficient than methods that generate only one set of model pre-training weights per training.

Description

Method, device and equipment for constructing migration learning weight
Technical Field
The present application relates to the field of deep learning technologies, and in particular, to a method, an apparatus, and a device for constructing a migration learning weight.
Background
With the rapid development of artificial intelligence, intelligent algorithms are widely applied in various vision services. A good vision model typically requires a large amount of data as learning samples, and the growth in data volume brings a new problem: the cost of training a model keeps increasing. This cost shows up in two ways. First, training time increases significantly, and the time savings obtained by reducing algorithm runtime differ markedly across hardware devices. Second, pressure is placed on the hardware: to shorten model training time, research and development funds must be continuously invested in hardware. To accelerate the fitting speed of algorithm training, researchers have proposed a number of transfer learning methods, but various obstacles remain when these methods are deployed in real industrial settings. For example, weights pre-trained on a public data set are migrated to a deployed service, but the transfer learning effect is poor because the real scene is too complex and the distribution differences between different targets are too large.
Disclosure of Invention
The application provides a method, a device and equipment for constructing transfer learning weights, which solve the problem of poor transfer learning effect caused by excessively large distribution differences between different targets when the scene in the service being analyzed has little correlation with the scenes or objects in a public data set.
In a first aspect, the present application provides a method for constructing a weight for transfer learning, where the method includes:
acquiring a training sample set, wherein samples in the training sample set comprise pictures, local images obtained by matting target areas on the pictures, and pre-labeled relationship labels between the pictures and the local images;
inputting the picture into a first feature extraction network to extract first picture features, and inputting a local picture corresponding to the picture into a second feature extraction network to extract second picture features;
determining an expected similarity according to the relationship labels between the pictures and the local images, and determining a loss function value based on the feature similarity of the extracted first picture features and second picture features and the expected similarity;
when it is determined that the loss function value does not meet the requirement, adjusting the weights of the first feature extraction network and the second feature extraction network, and triggering the next training;
when it is determined that the loss function value meets the requirement, taking the weight of the current first feature extraction network as the migration weight of a target detection network, and taking the weight of the current second feature extraction network as the migration weight of a classification network.
In a possible implementation manner, when there are multiple target areas on the picture, a plurality of local images are obtained by matting the multiple target areas on the picture, and one local image selected at random from the plurality of local images is used as the local image corresponding to the picture.
In one possible implementation manner, inputting a picture into a first feature extraction network to perform first picture feature extraction, and inputting a local image corresponding to the picture into a second feature extraction network to perform second picture feature extraction, including:
the plurality of samples in the training sample set are input in batches, wherein the pictures of each batch are input into the first feature extraction network for first picture feature extraction, and the local images corresponding to the pictures of the batch are input into the second feature extraction network for second picture feature extraction;
determining an expected similarity based on the extracted feature similarity of the first picture feature and the second picture feature and a relationship label of the picture and the local picture, and determining a loss function value, wherein the method comprises the following steps:
and determining expected similarity based on the feature similarity between the extracted first picture feature of each picture of the batch and the extracted second picture feature of each partial picture of the batch and the pre-labeled relation label of the picture and the partial picture of the batch, and determining a loss function value.
In a possible implementation manner, the first feature extraction network is a YOLO model with the detection (detect) layer removed, and the second feature extraction network is a ResNet18 classification model with the fully connected layer removed.
In one possible implementation, inputting the picture into the first feature extraction network for first picture feature extraction includes:
inputting a picture into a YOLO model without a detection layer, forcibly projecting feature matrixes with different step sizes extracted from the YOLO model, and converting the feature matrixes into a plurality of vectors with uniform sizes;
splicing a plurality of vectors with uniform sizes to obtain normalized feature vectors;
and adjusting the normalized feature vector into a first picture feature vector with a preset length by using a single-layer perceptron SLP.
In one possible implementation manner, inputting the local graph corresponding to the picture into a second feature extraction network to perform second picture feature extraction, including:
inputting the partial graph corresponding to the picture into a ResNet18 classification model without a full connection layer to obtain an output feature vector subjected to pooling treatment;
and adjusting the feature vector subjected to the pooling treatment into a second picture feature vector with a preset length by using a single-layer perceptron SLP.
In one possible embodiment, determining the loss function value based on the feature similarity between the extracted first picture feature of each picture of the batch and the extracted second picture feature of each local image of the batch and the pre-labeled relationship labels between the pictures and local images of the batch includes:
Acquiring a feature similarity matrix of the first picture features of each picture of the extracted batch and the second picture features of each partial picture of the extracted batch;
acquiring an expected similarity matrix determined according to a relationship label of the pre-labeled picture and the local picture of the batch;
determining a loss function value according to the characteristic similarity matrix and the expected similarity matrix;
the row number/column number in the feature similarity matrix/expected similarity matrix represents the serial number of the picture in the batch, the column number/row number represents the serial number of the local image in the batch, and the pictures and local images on the diagonal have labeled relationship labels.
In one possible implementation, determining the loss function value from the feature similarity matrix and the expected similarity matrix includes:
Loss = CE(V_full · V_local^T, I)

where B is the number of samples in a batch (the similarity matrices are of dimension B×B), CE is the cross entropy loss function, I is the expected similarity matrix, V_full is the full-picture feature matrix corresponding to the first picture features of the pictures of the batch, and V_local is the local-image feature matrix corresponding to the second picture features of the local images of the batch.
In one possible embodiment, the method further comprises:
acquiring a first training sample set and a second training sample set, wherein the first sample in the first training sample set comprises a picture and a first label of whether the picture comprises a target or not, and the second sample in the second training sample set comprises a local image of a target area corresponding to the picture and a second label of whether the local image comprises the target or not;
inputting the first training samples into the target detection network that has completed weight migration, and training the target detection network with the first label as the target of its output;
and inputting the second training samples into the classification network that has completed weight migration, and training the classification network with the second label as the target of its output.
In one possible embodiment, the method further comprises at least one of the following steps:
acquiring a picture to be detected, inputting the picture to be detected into a target detection network, and determining whether a target is identified in the picture to be detected according to a result output by the target detection network;
and acquiring a local image to be detected of a target area of the image to be detected, inputting the local image to be detected into the classification network, and determining whether a target is identified in the local image to be detected according to a result output by the classification network.
In a second aspect, the present application provides a device for constructing a weight for transfer learning, where the device includes:
the sample acquisition module is used for acquiring a training sample set, wherein samples in the training sample set comprise pictures, local images obtained by matting target areas on the pictures and relationship labels of the pictures and the local images, which are marked in advance;
The feature extraction module is used for inputting the picture into a first feature extraction network to extract the first picture feature, and inputting the local picture corresponding to the picture into a second feature extraction network to extract the second picture feature;
the loss function value determining module is used for determining expected similarity based on the feature similarity of the extracted first picture features and the extracted second picture features and the relation label of the picture and the local picture, and determining a loss function value;
the model training module is used for carrying out weight adjustment on the first characteristic extraction network and the second characteristic extraction network when the loss function value is determined to not meet the requirement, and triggering the next training;
and the weight migration module is used for taking the weight of the current first feature extraction network as the migration weight of the target detection network and taking the weight of the current second feature network as the migration weight of the classification network when the loss function value meets the requirement.
In a third aspect, the present application provides an apparatus for constructing a weight for transfer learning, including: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor can execute the method for constructing the migration learning weight.
In a fourth aspect, the present application provides a computer-readable storage medium comprising: the computer-readable storage medium stores a computer program that, when executed by a computer, performs the above-described method of constructing the transfer learning weights.
The application discloses a method, a device and equipment for constructing transfer learning weights. Pictures in a training sample set and their corresponding local images are input into a first feature extraction network and a second feature extraction network respectively, and the networks are trained by comparing the first picture features with the second picture features. On the one hand, this trains the network model effectively and avoids the poor transfer learning effect caused by excessively large distribution differences between different targets; on the other hand, the weights of the target detection network and the classification network are constructed simultaneously, which is more efficient than methods that generate only one set of model pre-training weights per training.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments of the present application will be briefly described below, and it is obvious that the drawings described below are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a method for constructing a transfer learning weight according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for determining a loss function in a method for constructing a transfer learning weight according to an embodiment of the present application;
FIG. 3 shows the loss function values when weight training is incomplete in the method for constructing transfer learning weights according to an embodiment of the present application;
FIG. 4 shows the loss function values when weight training is complete in the method for constructing transfer learning weights according to an embodiment of the present application;
FIG. 5 is a comparison chart of loss function values of a network with constructed transfer learning weights and a control group applied to the same service data;
FIG. 6 is a comparison chart of mAP50 indexes of a network with constructed transfer learning weights and a control group applied to the same service data according to an embodiment of the present application;
Fig. 7 is a block diagram of a device for constructing a weight for transfer learning according to an embodiment of the present application;
fig. 8 is a block diagram of a device for constructing a migration learning weight according to an embodiment of the present application.
Detailed Description
The technical solutions of the embodiments of the present application will be described clearly and thoroughly below with reference to the accompanying drawings. In the description of the embodiments of the present application, unless otherwise indicated, "/" means "or"; for example, A/B may mean A or B.
In order to further explain the technical solution provided by the embodiments of the present application, details are described below with reference to the accompanying drawings and the detailed description. Although the embodiments of the present application provide the method operation steps shown in the following embodiments or figures, more or fewer operation steps may be included in the method based on routine or non-inventive labor. For steps with no logically necessary causal relationship, the execution order is not limited to that provided by the embodiments of the present application. When executed in an actual process or by a control device, the method may be performed sequentially or in parallel in the order shown in the embodiments or drawings.
In the related art, building a training model requires a large amount of data as learning samples, and only one set of network weights can be built in one training, so the cost is high and the efficiency is low. In different detection service scenes, because the real scene is too complex and the distribution differences between different targets are too large, contrastive learning performs poorly when constructing pre-training weights on a supervised target detection data set.
Fig. 1 illustrates a flowchart of a method for constructing a migration learning weight according to an embodiment of the present application, where the flowchart mainly includes:
s101, acquiring a training sample set, wherein samples in the training sample set comprise pictures, local images obtained by matting target areas on the pictures, and relationship labels of the pictures and the local images, which are marked in advance;
the pictures in the samples can be, but not limited to, RGB color patterns, or pictures formed from infrared thermal imaging data.
The data sources of the training sample set may be any of the following modes:
mode 1, training a sample set to obtain scene data as a data source;
The scene data is data from the deployed (landed) service; its basic form is pictures, and it is strongly specific to the scene.
In this embodiment, samples are obtained from the scene data. When these samples are used for feature network training, the trained weights are more specific to the features extracted from pictures of that scene, and the recognition accuracy is high.
Mode 2, the data sources of the training sample set are scene data and public data sets.
A public data set has general information and does not generalize strongly to information in specific application fields. In this embodiment, part of the samples in the training sample set come from the scene data and part come from a public data set. As an alternative implementation, the proportion of public data set samples should not be too large; this avoids the long training time and the overfitting caused by poor generalization that arise when the amount of public data is excessive and the training result deteriorates, while the scene data improves the accuracy of feature extraction for pictures of the scene.
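As an illustration only, the following sketch caps the public data set's share of the combined training set, assuming PyTorch-style datasets; the 20% cap, the function name and all other identifiers are illustrative assumptions rather than values specified by the application:

```python
import torch
from torch.utils.data import ConcatDataset, Subset

def mix_scene_and_public(scene_dataset, public_dataset, max_public_ratio=0.2):
    # Keep public samples at no more than max_public_ratio of the final set.
    max_public = int(len(scene_dataset) * max_public_ratio / (1 - max_public_ratio))
    n_public = min(len(public_dataset), max_public)
    picked = torch.randperm(len(public_dataset))[:n_public].tolist()
    return ConcatDataset([scene_dataset, Subset(public_dataset, picked)])
```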
As an optional implementation manner, when the number of target areas on the picture is multiple, a plurality of partial images are obtained by matting the multiple target areas on the picture, and one partial image selected randomly from the plurality of partial images is used as the partial image corresponding to the picture.
When the picture is acquired, the target area is first identified. The identification result of the target area can be marked with a labeling frame, and the description format of the labeling frame supports YOLO, Pascal VOC, and the like. Labeling is carried out according to the identification result, and the labeling information can be expressed as the center point, width and height of the labeling frame corresponding to the target area, (x, y, w, h), where x and y represent the center point of the labeling frame, w represents the width of the labeling frame, and h represents the height of the labeling frame.
The result of identifying the target area may be that multiple target areas are identified, and then one picture may correspond to multiple labeling information. In the embodiment, one piece of labeling information is randomly selected from a plurality of target areas by using a uniform sampling method, a partial graph of the corresponding target area is obtained by matting according to the selected labeling information, and the uniform sampling method can ensure that the probability of all the labeling information being selected is consistent and the bias condition is eliminated.
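The uniform-sampling and matting step can be illustrated with a short sketch. It assumes YOLO-style normalized (x, y, w, h) labels as described above; the function name, the use of PIL and the variable names are assumptions for illustration, not the patent's implementation:

```python
import random
from PIL import Image

def crop_random_target(image_path, yolo_labels):
    """Pick one labeled box with uniform probability and return the cropped local image."""
    img = Image.open(image_path).convert("RGB")
    W, H = img.size
    # Uniform sampling: every piece of labeling information has the same chance of being chosen.
    x, y, w, h = random.choice(yolo_labels)
    # Convert normalized center/width/height coordinates to pixel corners.
    left, top = int((x - w / 2) * W), int((y - h / 2) * H)
    right, bottom = int((x + w / 2) * W), int((y + h / 2) * H)
    return img.crop((left, top, right, bottom))
```

The picture and the local image cropped from it would then share the same sequence number in the sample set, which serves as their relationship label.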
The mode of the relationship label of the marked picture and the local picture in the embodiment of the application can be as follows: and (3) marking the sequence numbers of the pictures in the training sample set, and marking the sequence numbers of the local pictures in the training sample set, wherein any picture and the local picture corresponding to the picture adopt the same sequence number.
S102, inputting the picture into a first feature extraction network to extract first picture features, and inputting the local image corresponding to the picture into a second feature extraction network to extract second picture features;
in the embodiment of the application, the first feature extraction network and the second feature extraction network are backbone networks which only perform feature extraction in the classification network.
In an alternative embodiment, the first feature extraction network is a YOLO model with the detection (detect) layer removed, and the second feature extraction network is a ResNet18 classification model with the fully connected layer removed.
When a YOLO model with the detection layer removed is employed as the first feature extraction network, only the backbone structure and the neck structure of the YOLO model are used. The second feature extraction network serves as the information extraction structure of the classification network, and the choice of model structure may be, but is not limited to, ResNet18. A classification network based on a neural network can extract high-dimensional feature vectors with strong expressive capability from the data and can encode a two-dimensional image into vector information.
S103, determining an expected similarity according to the relationship labels between the pictures and the local images, and determining a loss function value based on the feature similarity of the extracted first picture features and second picture features and the expected similarity;
the first picture feature corresponds to a full-picture global feature matrix obtained by feature extraction of the first feature extraction network, the second picture feature corresponds to a partial region feature matrix obtained by feature extraction of the second feature extraction network, and feature similarity calculation is carried out according to the full-picture global feature matrix and the partial region feature matrix.
In this embodiment, if a picture and a local image have a relationship label, the expected similarity between that picture and its corresponding local image is determined to be 1, and the expected similarity between that picture and the local images corresponding to other pictures is determined to be 0. The corresponding expected similarity can thus be obtained for the input pictures and local images, and the loss function value is determined from the feature similarity and the expected similarity.
S104, judging whether the loss function value meets the requirement, if not, executing S105, and if so, executing S106;
the smaller the loss function value is, the closer the feature similarity matrix and the expected similarity matrix are, and when the loss function value is smaller than a set threshold value, the feature similarity and the expected similarity obtained by adopting the first feature extraction network and the second feature extraction network are considered to be relatively close to each other for the picture and the corresponding partial picture thereof, so that the requirement of feature extraction accuracy is met.
S105, when the loss function value is determined to not meet the requirement, carrying out weight adjustment on the first feature extraction network and the second feature extraction network, and triggering the next training;
the specific weight adjustment policies for the first feature extraction network and the second feature extraction network may be adjusted using existing reverse delivery mechanisms, and will not be described in detail herein.
S106, when it is determined that the loss function value meets the requirement, taking the weight of the current first feature extraction network as the migration weight of the target detection network, and taking the weight of the current second feature extraction network as the migration weight of the classification network.
The method uses a training sample set to provide pre-training weights with strong generalization capability. Pictures and their corresponding local images are input into the first feature extraction network and the second feature extraction network respectively, and the networks are trained by comparing the first picture features with the second picture features; this is a new way of training network weights. On the one hand, it trains the network model effectively and avoids the poor transfer learning effect caused by excessively large distribution differences between different targets; the constructed weights can be used in a variety of detection service scenes, which solves the problem of constructing pre-training weights on a supervised target detection data set with contrastive learning. On the other hand, by means of the branched (shunt) structure with the classification network, the backbone weights of the target detection network and the classification network are constructed simultaneously, and during migration and generalization only the detection head needs fine-tuning to complete model fitting, which greatly shortens training time. Training the backbone weights of the target detection network and the classification network at the same time is also more efficient than methods that can generate only one set of model pre-training weights per training.
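The overall procedure of S101 to S106 can be summarized in a minimal training-loop sketch, assuming PyTorch. FullImageEncoder/LocalImageEncoder stand in for the YOLO backbone-plus-neck branch and the ResNet18 branch, the loss is the cross entropy over the similarity matrix discussed later in this description, and the threshold, file names and other identifiers are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def batch_contrastive_loss(full_feats, local_feats):
    # Cross entropy between the B x B feature similarity matrix and the identity
    # (expected similarity) targets; expanded near the loss formula below.
    logits = full_feats @ local_feats.T
    return F.cross_entropy(logits, torch.arange(logits.size(0)))

def build_transfer_weights(full_net, local_net, dataloader, threshold=0.05, max_epochs=100):
    optimizer = torch.optim.SGD(
        list(full_net.parameters()) + list(local_net.parameters()), lr=1e-3)
    for _ in range(max_epochs):
        for pictures, local_images in dataloader:      # one batch per weight update
            full_feats = full_net(pictures)            # first picture features  (B x 1024)
            local_feats = local_net(local_images)      # second picture features (B x 1024)
            loss = batch_contrastive_loss(full_feats, local_feats)
            if loss.item() < threshold:
                # Requirement met: keep both weight sets for migration.
                torch.save(full_net.state_dict(), "detection_backbone_init.pt")
                torch.save(local_net.state_dict(), "classification_backbone_init.pt")
                return
            optimizer.zero_grad()
            loss.backward()                            # weight adjustment, then next training
            optimizer.step()
```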
The training sample set comprises a plurality of samples, and when the first characteristic extraction network and the second characteristic extraction network are used for training, the plurality of samples in the training sample set can be input in batches, and the input of each batch triggers weight adjustment once.
As an optional implementation manner, inputting a picture into a first feature extraction network to perform first picture feature extraction, and inputting a local image corresponding to the picture into a second feature extraction network to perform second picture feature extraction, including:
and the plurality of samples in the training data sample set are divided into batches, the pictures of each batch are input into a first characteristic extraction network to perform first picture characteristic extraction, and the local images corresponding to the pictures of the batch are input into a second characteristic extraction network to perform second picture characteristic extraction.
Such a batch of inputs may result in a first picture feature of the batch of pictures and a second picture feature of the partial map corresponding to the batch of pictures.
Determining the expected similarity based on the relationship labels of the pre-labeled pictures and local images, and determining the loss function value based on the feature similarity of the extracted first picture features and second picture features, includes:
And determining expected similarity based on the feature similarity between the extracted first picture feature of each picture of the batch and the extracted second picture feature of each partial picture of the batch and the pre-labeled relation label of the picture and the partial picture of the batch, and determining a loss function value.
In this embodiment, the feature similarity is calculated between the set of first image features of the images of one batch and the set of second image features of each partial image of the batch, and the calculated feature similarity includes the similarity between the first image feature of any image of the batch and the second image feature of any partial image of the batch. The expected similarity is labeled in advance, and the similarity between the first picture characteristic of any picture of the batch and the second picture characteristic of any partial picture of the batch is also labeled.
The specific embodiment of the first picture feature extraction will be described below by taking the example that the first feature extraction network uses the YOLO model with the detection detect layer removed.
In an alternative embodiment, inputting the picture into the first feature extraction network for first picture feature extraction includes:
inputting the picture into a YOLO model without a detection layer, forcibly projecting feature matrixes with different step sizes extracted from the YOLO model, and converting the feature matrixes into a plurality of vectors with uniform sizes;
Splicing a plurality of vectors with uniform sizes to obtain normalized feature vectors;
and adjusting the normalized feature vector to be a first picture feature with a preset length by using a single-layer perceptron SLP.
In the embodiment of the application, the backbone structure and neck structure of a YOLO model are used to perform feature extraction and feature fusion on an input batch of pictures. The YOLO model performs feature extraction through a plurality of cascaded feature extraction layers, and the dimensions of the picture feature matrices extracted by different feature extraction layers differ; that is, different feature extraction layers perform feature extraction with different step sizes (for example, 8, 16 and 32).
After the feature matrices with different step sizes are obtained, the embodiment of the application normalizes them through feature pooling and encoding: the feature matrices corresponding to different step sizes are force-projected, spliced and length-constrained, and the first picture feature represented by a one-dimensional vector is output. Taking a model with 3 feature outputs as an example, because the strides corresponding to different feature layers differ, the obtained feature matrices have different dimensions. To normalize the representation of the feature information, the embodiment of the application uses max pooling to force-project the different feature matrices, so that a dimension projection (projection layer) maps each feature matrix into a normalized feature vector. The YOLO network has 3 output layers in total, and after pooling projection the feature dimension of each of the 3 output layers is 1×1. The 3 sets of vectors are then merged along the channel dimension using a feature stacking operation; at this point the output feature information has been normalized in scale. Finally, a single-layer perceptron SLP (Single Layer Perceptron) is applied: its input is the 1×1 feature dimension and its output is 1024, where 1024 is an adjustable preset parameter. After the 1×1 feature passes through the SLP, the final feature normalization is complete and the first picture feature is obtained.
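A sketch of this pooling, projection, splicing and SLP pipeline is given below, assuming PyTorch. The three input tensors stand for the stride-8/16/32 feature maps produced by the YOLO backbone and neck; the channel counts and the output length 1024 are the adjustable parameters mentioned above, and all names are illustrative:

```python
import torch
import torch.nn as nn

class FullImagePooledHead(nn.Module):
    def __init__(self, channels=(256, 512, 1024), out_dim=1024):
        super().__init__()
        self.pool = nn.AdaptiveMaxPool2d(1)            # force-project each map to 1x1
        self.slp = nn.Linear(sum(channels), out_dim)   # single-layer perceptron

    def forward(self, feature_maps):
        # feature_maps: list of (B, C_i, H_i, W_i) tensors extracted at different strides.
        pooled = [self.pool(f).flatten(1) for f in feature_maps]   # each becomes (B, C_i)
        stacked = torch.cat(pooled, dim=1)                         # merge along the channel dimension
        return self.slp(stacked)                                   # (B, 1024) first picture feature

# Dummy multi-scale maps for a batch of 4 pictures:
maps = [torch.randn(4, 256, 80, 80), torch.randn(4, 512, 40, 40), torch.randn(4, 1024, 20, 20)]
first_features = FullImagePooledHead()(maps)           # -> torch.Size([4, 1024])
```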
The specific implementation of second picture feature extraction will be described below, taking as an example a second feature extraction network that uses a ResNet18 classification model with the fully connected layer removed.
In an optional implementation manner, inputting the local graph corresponding to the picture into a second feature extraction network to perform second picture feature extraction, including:
inputting the partial graph corresponding to the picture into a ResNet18 classification model without a full connection layer to obtain an output feature vector subjected to pooling treatment;
and adjusting the feature vector subjected to the pooling treatment into a second picture feature with a preset length by using a single-layer perceptron SLP.
The backbone and pooling structure of the ResNet18 classification model is used to extract features from the local images of the same batch simultaneously, outputting pooled second feature vectors;
the single-layer perceptron SLP (Single Layer Perceptron) is used for restraining the second feature vector, the input second feature vector is provided with 1024 fixed values in output dimension, and the output dimension in practical application is a preset parameter which can be adjusted according to practical conditions; and finishing final feature normalization operation after the second feature vector of 1*1 feature dimensions passes through the SLP to obtain a second picture feature.
When the plurality of samples in the training sample set are input in batches and the first feature extraction network and the second feature extraction network are trained, since the feature similarity is determined according to any picture in a batch and any partial graph, the expected similarity is determined according to any picture in a batch and the corresponding partial graph, and the loss function value of the batch is obtained according to the feature similarity and the expected similarity, the feature similarity and the expected similarity can be identified in a matrix form.
In one possible embodiment, determining the loss function value based on the feature similarity between the extracted first picture feature of each picture of the batch and the extracted second picture feature of each local image of the batch and the pre-labeled relationship labels between the pictures and local images of the batch includes:
Acquiring a feature similarity matrix of the first picture features of each picture of the extracted batch and the second picture features of each partial picture of the extracted batch;
acquiring an expected similarity matrix determined according to a relationship label of the pre-labeled picture and the local picture of the batch;
determining a loss function value according to the characteristic similarity matrix and the expected similarity matrix;
the line number/column number in the feature similarity matrix/expected similarity matrix represents the serial number of the picture in the batch, the column number/line number represents the serial number of the local picture in the batch, and the picture on the diagonal line and the local picture have marked relation labels.
With the above arrangement of row and column numbers in the feature similarity matrix and the expected similarity matrix, the predicted similarity between any picture and any local image and the expected similarity between any picture and any local image can both be read off, and the similarity between a picture and its corresponding local image lies on the diagonal. A picture and its corresponding local image are called a sample pair. To distinguish the spatial distances of different sample pairs, 1 is used to represent a strong correlation between the two and 0 to represent independence; apart from the diagonal, all other positions of the expected similarity matrix are 0. For example, the serial number of a picture in the batch may be represented by the row number and the serial number of a local image in the batch by the column number, with serial numbers increasing from top to bottom in row order and from right to left in column order. Of course, other representations meeting the above requirements are also possible.
Assume that the full-image feature matrix corresponding to the first picture features of the pictures of any batch is V_full, and that the local-image feature matrix corresponding to the second picture features of the local images of that batch is V_local. In one possible embodiment, the loss function value is determined according to the following formula:
Loss = CE(V_full · V_local^T, I)
where B is the number of samples in a batch (the batch size), CE is the cross entropy loss function, and I is the expected similarity matrix.
The scale of the loss function depends on the batch size used for training. Denote the dimension of the projection-layer output feature vector of the two feature extraction networks by D; then the full-image feature matrix V_full and the local-image feature matrix V_local each have dimension B×D, while I has dimension B×B and is mathematically an identity matrix. Multiplying the full-image feature matrix by the transpose of the local-image (box) feature matrix yields the feature similarity matrix, whose dimension is B×B. The cross entropy loss function measures the distance between the feature similarity matrix and the expected similarity matrix: the closer the two matrices are, the smaller the distance between them, the smaller the cross entropy result, and the more accurate the feature extraction.
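A minimal sketch of this loss computation is shown below. It follows the single direction given in the text (full-image features times the transpose of the local-image features, compared against the identity as the expected similarity matrix); whether any additional normalization or a symmetric second term is used is not stated, so none is assumed here:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(full_feats, local_feats):
    # full_feats, local_feats: (B, D) first/second picture feature matrices.
    similarity = full_feats @ local_feats.T          # (B, B) feature similarity matrix
    targets = torch.arange(similarity.size(0))       # diagonal of the expected (identity) matrix
    # Cross entropy against the one-hot diagonal targets measures the distance
    # between the feature similarity matrix and the expected similarity matrix.
    return F.cross_entropy(similarity, targets)
```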
Fig. 2 is a flowchart of a method for determining a loss function value in a method for constructing a transfer learning weight according to an embodiment of the present application, and specifically includes the following steps:
A training sample set is acquired from scene data, giving a number of pictures and the local images corresponding to each picture. A sample is processed through two branches. In one branch, the picture in the sample is input into the YOLO model with the detection layer removed to extract features with different step sizes; through feature pooling and encoding, the features with different step sizes are converted into vectors of uniform size, spliced, and converted into a feature of preset length. In the other branch, the picture is labeled with its corresponding local images; one local image is randomly extracted based on the labels and input into the ResNet18 classification model with the fully connected layer removed; through feature projection and encoding, the output pooled feature vector is adjusted to a feature of preset length. Finally, feature correlation is computed between the features obtained by the two branches to obtain the loss function value.
When the determined loss function value is smaller than the set threshold, the requirement on feature extraction accuracy is met and the transfer learning weight training is complete. When the model after weight migration is put to use in a real scene, fine-tuning the detection head is enough to complete model fitting: based on samples acquired from new scene data, the fitted first feature extraction network is placed into a target detection network for training, and the fitted second feature extraction network is placed into a classification network for training. Specific implementations of target detection network training and classification network training follow.
As a possible embodiment, the method further includes:
acquiring a first training sample set and a second training sample set, wherein the first sample in the first training sample set comprises a picture and a first mark of whether the picture comprises a target or not, and the second sample in the second training sample set comprises a local image corresponding to the picture and a second mark of whether the local image comprises the target or not;
inputting the first training samples into the target detection network that has completed weight migration, and training the target detection network with the first label as the target of its output;
and inputting the second training samples into the classification network that has completed weight migration, and training the classification network with the second label as the target of its output.
The first training sample set and the second training sample set can be small sample sets, so that fine-tuning of the network parameters is achieved through small-scale training, giving the target detection network the ability to recognize targets in pictures and the classification network the ability to recognize targets in local images.
In the embodiment of the application, the two sets of network weights are trained with a strongly targeted labeled data set, so the trained weights are more specific to the task. When the weights are used in similar scenes, the loss of the weight-migration application on service data is effectively reduced, and the detection precision of the model is improved.
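The fine-tuning stage can be sketched as follows, assuming PyTorch. The migrated backbone weights are loaded into the two networks, the backbones are kept fixed, and only the detection head and classifier head are fitted on the small first and second training sample sets. The attribute names (backbone, head, loss) and file names are placeholders, not the patent's code:

```python
import torch

def fine_tune(detection_net, classification_net, det_loader, cls_loader, epochs=5):
    detection_net.backbone.load_state_dict(torch.load("detection_backbone_init.pt"))
    classification_net.backbone.load_state_dict(torch.load("classification_backbone_init.pt"))
    for net in (detection_net.backbone, classification_net.backbone):
        for p in net.parameters():
            p.requires_grad = False                      # keep the migrated weights fixed
    det_opt = torch.optim.Adam(detection_net.head.parameters(), lr=1e-4)
    cls_opt = torch.optim.Adam(classification_net.head.parameters(), lr=1e-4)
    for _ in range(epochs):
        for pictures, first_labels in det_loader:        # first samples: picture + target present?
            det_loss = detection_net.loss(pictures, first_labels)
            det_opt.zero_grad(); det_loss.backward(); det_opt.step()
        for local_images, second_labels in cls_loader:   # second samples: local image + target present?
            cls_loss = classification_net.loss(local_images, second_labels)
            cls_opt.zero_grad(); cls_loss.backward(); cls_opt.step()
```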
When the target detection network and the classification network after training are put into use in a real scene, the target detection network and the classification network can be used in combination or can be used independently, and a specific implementation mode of using the target detection network and the classification network for weight migration is described below.
As a possible embodiment, the method further includes:
one possible implementation manner is that a picture to be detected is obtained, the picture to be detected is input into a target detection network, and whether a target is identified in the picture to be detected is determined according to a result output by the target detection network; and acquiring a local image to be detected of a target area of the image to be detected, inputting the local image to be detected into the classification network, and determining whether a target is identified in the local image to be detected according to a result output by the classification network.
In another possible implementation manner, when the result output by the detection network cannot determine whether the target is identified in the picture to be detected, a local image of the target area of the picture to be detected is input into the classification network, and whether the target is identified in the local image of the target area of the picture to be detected is redetermined.
The trained target detection network and classification network are applied to an actual scene. If the scene data is a picture to be detected, the picture is input into the target detection network; when the output result identifies a target in the picture, the target detection network has achieved the expected detection effect. If the scene data is a local image to be detected and the output result identifies a target in the local image, the classification network has achieved the expected classification effect.
If the scene data is a picture to be detected and it is input into the target detection network but the output result does not identify a target in the picture, the target detection network has not achieved the expected detection effect, and the local image of the target area of the picture to be detected is input into the classification network for secondary identification.
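The combined use of the two networks can be summarized in a short sketch. The confidence threshold, the crop_region helper and the shape of the network outputs are all assumptions made for illustration:

```python
def detect_target(picture, detection_net, classification_net, crop_region, conf_threshold=0.5):
    detections = detection_net(picture)                 # assumed: boxes with confidence scores
    if any(d.score >= conf_threshold for d in detections):
        return True                                     # target identified by the detection network
    # Inconclusive: re-check each candidate target area with the classification network.
    for det in detections:
        local_image = crop_region(picture, det.box)     # crop the target area of the picture
        if classification_net(local_image).is_target:
            return True
    return False
```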
In the embodiment of the application, the feature similarity is calculated from the extracted first picture features and second picture features. The feature similarity matrix obtained in the initial stage of weight training is shown in FIG. 3, and the feature similarity matrix obtained when weight training is complete is shown in FIG. 4. The darkness of the cells in FIG. 3 and FIG. 4 represents the degree of similarity: the darker the cell, the larger the similarity, and the lighter the cell, the smaller the similarity. The row/column numbers in FIG. 3 and FIG. 4 represent the serial numbers of the pictures in the batch, and the column/row numbers represent the serial numbers of the local images in the batch. As can be seen from FIG. 3, the diagonal is not noticeably dark, which indicates that at the initial stage of weight training, for the same picture and its corresponding local image in a batch, the similarity between the first picture feature computed from the picture and the second picture feature computed from the corresponding local image is small. As can be seen from FIG. 4, the diagonal is darker than the rest of the figure, which indicates that when weight training is complete, for the same picture and its corresponding local image in a batch, the similarity between the first picture feature computed from the picture and the second picture feature computed from the corresponding local image is large.
In a practical application of the embodiment, actual scene data is input into the target detection network and classification network whose weights were constructed by transfer learning. The network with constructed transfer learning weights and a control group are trained and tested on the same service data set; the control group is a network model built with the same target detection code as the detection model, with an identical network structure, but initialized by a random method. The environment factors of the detection model and the control group are kept consistent during training, including the same operating system version, CPU model, physical machine memory size, graphics card model, and so on; all hyperparameters in the model training phase use the same values, for example the learning rate, batch size, data enhancement parameters, and so on. FIG. 5 shows a comparison of the loss function values of the detection model after weight migration and the control group, and FIG. 6 shows a comparison of their mAP50 indexes.
The upper curve in FIG. 5 shows the loss function values of the detection model after weight migration is applied to the service data, and the lower curve shows the loss function values of the control group; all remaining parameters are consistent. The abscissa is the number of training batches and the ordinate is the loss function value. It can be seen that, for the same number of training batches, the loss function value of the detection model after weight migration on the service data is lower than that of the control group. The lower curve in FIG. 6 shows the change of the mAP50 index of the detection model after weight migration on the service data, and the upper curve shows that of the control group; all remaining parameters are consistent. The abscissa is the number of training batches and the ordinate is the mAP50 index, which ranges from 0 to 1. It can be seen that, for the same number of training batches, the recognition accuracy of the detection model after weight migration on the service data is higher than that of the control group. In the embodiment of the application, comparing the local features in the picture with the global features of the picture effectively reduces the loss of the model when training on actual scene data and improves the detection precision of the model.
According to the method for constructing transfer learning weights provided by the embodiment of the application, the local features in the picture and the global features of the picture are compared. Whereas in the prior art only one set of weights can be obtained per training, here multiple sets of weights are obtained in one training: pre-training weights are constructed on a supervised target detection data set and transferred into new training models, which effectively reduces the cost of training a model. Meanwhile, compared with the prior-art practice of training network weights on a public data set, training the two sets of network weights on a strongly targeted labeled data set makes the trained weights more specific; when the weights are used in similar scenes, training time is effectively shortened, training cost is reduced, and network training efficiency is improved.
Based on the same inventive concept, the application also provides a device for constructing the transfer learning weight, as shown in fig. 7, the device comprises:
the sample acquisition module 701 is configured to acquire a training sample set, where samples in the training sample set include a picture, a local graph obtained by matting a target area on the picture, and a relationship label between the picture and the local graph that is labeled in advance;
The feature extraction module 702 is configured to input a picture to a first feature extraction network to perform first picture feature extraction, and input a local image corresponding to the picture to a second feature extraction network to perform second picture feature extraction;
a loss function value determining module 703, configured to determine an expected similarity based on the feature similarities of the extracted first picture feature and the second picture feature and the relationship label between the picture and the local picture, and determine a loss function value;
the model training module 704 is configured to perform weight adjustment on the first feature extraction network and the second feature extraction network when the loss function value does not meet the requirement, and trigger the next training;
and the weight migration module 705 is configured to, when determining that the loss function value meets the requirement, take the weight of the current first feature extraction network as the migration weight of the target detection network, and take the weight of the current second feature network as the migration weight of the classification network.
As an optional implementation manner, when the number of target areas on the picture is multiple, a plurality of partial images are obtained by matting the multiple target areas on the picture, and one partial image selected randomly from the plurality of partial images is used as the partial image corresponding to the picture.
As a possible implementation method, the feature extraction module inputs a picture to a first feature extraction network to perform first picture feature extraction, inputs a local image corresponding to the picture to a second feature extraction network to perform second picture feature extraction, and includes:
the plurality of samples in the training sample set are input in batches, wherein the pictures of each batch are input into the first feature extraction network for first picture feature extraction, and the local images corresponding to the pictures of the batch are input into the second feature extraction network for second picture feature extraction;
the loss function value determining module determines an expected similarity based on the feature similarity of the extracted first picture feature and the second picture feature and a relationship label of the picture and the local picture, and determines a loss function value, including:
and determining expected similarity based on the feature similarity between the extracted first picture feature of each picture of the batch and the extracted second picture feature of each partial picture of the batch and the pre-labeled relation label of the picture and the partial picture of the batch, and determining a loss function value.
As a possible implementation manner, the first feature extraction network is a YOLO model with the detection (detect) layer removed, and the second feature extraction network is a ResNet18 classification model with the fully connected layer removed.
As a possible implementation manner, the feature extraction module inputs a picture into a first feature extraction network to perform first picture feature extraction, and includes:
inputting the picture into the YOLO model without the detection layer, forcibly projecting the feature maps of different strides extracted by the YOLO model, and converting them into a plurality of vectors of uniform size;
splicing a plurality of vectors with uniform sizes to obtain normalized feature vectors;
the normalized feature vector is adjusted to a first picture feature vector with a preset length by using a single-layer perceptron SLP (Single Layer Perceptron).
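For illustration only, the following sketch shows one way these three steps could be realized in PyTorch. The channel widths, the common projection width, and the output length are assumptions, not values from the text; only the overall flow (per-stride projection, splicing, SLP) follows the description above.

```python
# Hedged sketch of the whole-picture branch: multi-stride feature maps from a YOLO
# backbone (detect layer removed) are projected to a uniform size, spliced, and
# mapped by a single-layer perceptron (SLP) to the preset feature length.
import torch
import torch.nn as nn

class WholePictureHead(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024), proj_dim=256, out_dim=512):
        super().__init__()
        # 1x1 convolutions force feature maps of different strides to the same width
        self.proj = nn.ModuleList([nn.Conv2d(c, proj_dim, kernel_size=1) for c in in_channels])
        self.pool = nn.AdaptiveAvgPool2d(1)                     # collapse spatial dimensions
        self.slp = nn.Linear(proj_dim * len(in_channels), out_dim)  # single-layer perceptron

    def forward(self, feature_maps):
        # feature_maps: list of tensors [B, C_i, H_i, W_i] taken from the YOLO backbone
        vecs = [self.pool(p(f)).flatten(1) for p, f in zip(self.proj, feature_maps)]
        fused = torch.cat(vecs, dim=1)                          # splice the uniform-size vectors
        return self.slp(fused)                                  # first picture feature vector
```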
As a possible implementation manner, the feature extraction module inputs the local graph corresponding to the picture to a second feature extraction network to perform second picture feature extraction, and includes:
inputting the partial graph corresponding to the picture into a ResNet18 classification model without a full connection layer to obtain an output feature vector subjected to pooling treatment;
and adjusting the feature vector subjected to the pooling treatment into a second picture feature vector with a preset length by using a single-layer perceptron SLP.
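For illustration only, a matching sketch of the local-image branch is given below; it assumes torchvision's ResNet18, and the output length of 512 is an assumption chosen to match the whole-picture branch above.

```python
# Hedged sketch of the local-image branch: a ResNet18 with the fully connected
# layer removed supplies the pooled feature vector, which an SLP adjusts to the
# preset length of the second picture feature.
import torch.nn as nn
from torchvision.models import resnet18

class LocalImageHead(nn.Module):
    def __init__(self, out_dim=512):
        super().__init__()
        backbone = resnet18(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])  # drop the fc layer
        self.slp = nn.Linear(backbone.fc.in_features, out_dim)          # single-layer perceptron

    def forward(self, local_images):
        pooled = self.backbone(local_images).flatten(1)  # pooled output feature vector
        return self.slp(pooled)                          # second picture feature vector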
As one possible implementation manner, the loss function value determining module determines the expected similarity based on the feature similarity of the first picture feature of each picture of the batch extracted and the second picture feature of each partial picture of the batch extracted, and the relationship label of the pre-labeled picture of the batch and the partial picture, and determines the loss function value, including:
Acquiring a feature similarity matrix of the first picture features of each picture of the extracted batch and the second picture features of each partial picture of the extracted batch;
acquiring an expected similarity matrix determined according to a relationship label of the pre-labeled picture and the local picture of the batch;
determining a loss function value according to the characteristic similarity matrix and the expected similarity matrix;
the row index/column index in the feature similarity matrix/expected similarity matrix represents the serial number of the picture in the batch, the column index/row index represents the serial number of the local image in the batch, and the pictures and local images on the diagonal have labeled relationship labels.
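For illustration only, the sketch below builds the two matrices for a batch; the cosine-style normalization and the identity form of the expected similarity matrix (one labeled local image per picture) are assumptions consistent with the description above.

```python
# Hedged sketch: BxB feature similarity matrix between whole-picture and local-image
# features, and the corresponding expected similarity matrix from the labels.
import torch
import torch.nn.functional as F

def similarity_matrices(first_feats: torch.Tensor, second_feats: torch.Tensor):
    # first_feats: [B, D] first picture features; second_feats: [B, D] second picture features
    f_g = F.normalize(first_feats, dim=1)
    f_l = F.normalize(second_feats, dim=1)
    sim = f_g @ f_l.t()                    # rows: pictures in the batch, columns: local images
    expected = torch.eye(sim.size(0))      # diagonal pairs carry the labeled relationship
    return sim, expected
```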
As one possible implementation manner, the loss function value determining module determines a loss function value according to the feature similarity matrix and the expected similarity matrix, including:
where B is the number of samples in a batch, CE is the cross entropy loss function, I is the expected similarity matrix, F_g is the whole-picture feature matrix formed by the first picture features of each picture of the batch, and F_l is the local-image feature matrix formed by the second picture features of each local image of the batch.
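The loss equation itself appears as an image in the original publication and is not reproduced here. For illustration only, one plausible reading consistent with the stated symbols (B, CE, I, and the two feature matrices) is a symmetric cross entropy over the similarity matrix, sketched below; this is an assumption, not the patent's confirmed formula.

```python
# Hedged sketch: symmetric cross-entropy loss over the feature similarity matrix
# against the expected similarity matrix I, averaged over both matching directions.
import torch
import torch.nn.functional as F

def contrastive_loss(sim: torch.Tensor, expected: torch.Tensor) -> torch.Tensor:
    targets = expected.argmax(dim=1)              # index of the matching local image per picture
    loss_rows = F.cross_entropy(sim, targets)     # picture -> local image direction
    loss_cols = F.cross_entropy(sim.t(), targets) # local image -> picture direction
    return (loss_rows + loss_cols) / 2
```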
As a possible embodiment, the apparatus further comprises:
a first sample acquisition module, used for acquiring a first training sample set and a second training sample set, wherein a first sample in the first training sample set comprises a picture and a first label indicating whether the picture contains a target, and a second sample in the second training sample set comprises a local image of a target area corresponding to the picture and a second label indicating whether the local image contains the target;
the target detection network training module is used for inputting the first training samples into the target detection network that has completed weight migration, and training the target detection network with the first labels as the targets of its output;
and the classification network training module is used for inputting the second training samples into the classification network that has completed weight migration, and training the classification network with the second labels as the targets of its output, as illustrated in the sketch below.
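For illustration only, the sketch below shows one way the migrated weights could initialize the two downstream networks before supervised fine-tuning. The model interfaces, the binary-cross-entropy objective, and the optimizer settings are assumptions, not details stated in the text.

```python
# Hedged sketch: migrate the converged feature-extraction weights into the target
# detection network and the classification network, then fine-tune each with its labels.
import torch

def migrate_and_finetune(detector, classifier, first_head, second_head,
                         det_loader, cls_loader, epochs=10, lr=1e-4):
    # load matching parameters only; task-specific heads keep their own initialization
    detector.load_state_dict(first_head.state_dict(), strict=False)
    classifier.load_state_dict(second_head.state_dict(), strict=False)

    det_opt = torch.optim.AdamW(detector.parameters(), lr=lr)
    cls_opt = torch.optim.AdamW(classifier.parameters(), lr=lr)
    bce = torch.nn.BCEWithLogitsLoss()

    for _ in range(epochs):
        for pictures, first_labels in det_loader:       # first label: picture contains a target?
            det_opt.zero_grad()
            bce(detector(pictures), first_labels).backward()
            det_opt.step()
        for crops, second_labels in cls_loader:         # second label: crop contains a target?
            cls_opt.zero_grad()
            bce(classifier(crops), second_labels).backward()
            cls_opt.step()
```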
As a possible embodiment, the device further comprises at least one of the following modules:
the target detection network identification module is used for acquiring a picture to be detected, inputting the picture to be detected into a target detection network, and determining whether a target is identified in the picture to be detected according to a result output by the target detection network;
the classification network identification module is used for acquiring a to-be-detected partial graph of a target area of the to-be-detected picture, inputting the to-be-detected partial graph into the classification network, and determining whether the target is identified in the to-be-detected partial graph according to a result output by the classification network.
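For illustration only, a minimal sketch of the two recognition paths follows; the single-logit outputs and the 0.5 decision threshold are assumptions.

```python
# Hedged sketch: whole pictures go through the migrated target detection network,
# target-area crops through the migrated classification network.
import torch

@torch.no_grad()
def recognize(detector, classifier, picture=None, local_crop=None, threshold=0.5):
    results = {}
    if picture is not None:
        results["picture_has_target"] = torch.sigmoid(detector(picture)).item() > threshold
    if local_crop is not None:
        results["crop_has_target"] = torch.sigmoid(classifier(local_crop)).item() > threshold
    return results
```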
Having described the method and apparatus for constructing the transfer learning weights according to the exemplary embodiment of the present application, next, a construction apparatus for the transfer learning weights according to another exemplary embodiment of the present application will be described.
An embodiment of the present application provides an apparatus that includes the device for constructing transfer learning weights provided in the foregoing embodiments and, as shown in fig. 8, may include at least one processor 801 and at least one memory 802, wherein the memory stores program code that, when executed by the processor, causes the processor to perform:
acquiring a training sample set, wherein samples in the training sample set comprise pictures, local images obtained by matting target areas on the pictures, and relationship labels of the pictures and the local images, which are marked in advance;
inputting the picture into a first feature extraction network to extract first picture features, and inputting a local picture corresponding to the picture into a second feature extraction network to extract second picture features;
determining expected similarity based on the feature similarity of the extracted first picture features and the second picture features and the relation labels of the pictures and the local pictures, and determining a loss function value;
When the loss function value is determined to not meet the requirement, carrying out weight adjustment on the first feature extraction network and the second feature extraction network, and triggering the next training;
when the loss function value is determined to meet the requirement, the weight of the current first feature extraction network is used as the migration weight of the target detection network, and the weight of the current second feature extraction network is used as the migration weight of the classification network.
As a possible embodiment, when the number of target areas on the picture is multiple, a plurality of partial images are obtained by matting the multiple target areas on the picture, and one partial image selected randomly from the plurality of partial images is used as the partial image corresponding to the picture.
As a possible embodiment, the processor inputs a picture to a first feature extraction network to perform a first picture feature extraction, inputs a local image corresponding to the picture to a second feature extraction network to perform a second picture feature extraction, and includes:
the training sample set comprises a plurality of batches of samples, wherein the pictures of each batch are input into the first feature extraction network for first picture feature extraction, and the local images corresponding to the pictures of each batch are input into the second feature extraction network for second picture feature extraction;
Determining an expected similarity based on the extracted feature similarity of the first picture feature and the second picture feature and a relationship label of the picture and the local picture, and determining a loss function value, wherein the method comprises the following steps:
and determining expected similarity based on the feature similarity between the extracted first picture feature of each picture of the batch and the extracted second picture feature of each partial picture of the batch and the pre-labeled relation label of the picture and the partial picture of the batch, and determining a loss function value.
As a possible implementation manner, the first feature extraction network is a YOLO model with the detection (detect) layer removed, and the second feature extraction network is a ResNet18 classification model with the fully connected layer removed.
As a possible implementation manner, the processor inputs the picture into a first feature extraction network for first picture feature extraction, including:
inputting the picture into the YOLO model without the detection layer, forcibly projecting the feature maps of different strides extracted by the YOLO model, and converting them into a plurality of vectors of uniform size;
splicing a plurality of vectors with uniform sizes to obtain normalized feature vectors;
the normalized feature vector is adjusted to a first picture feature vector with a preset length by using a single-layer perceptron SLP (Single Layer Perceptron).
As a possible embodiment, the processor inputs the local graph corresponding to the picture to a second feature extraction network to perform second picture feature extraction, including:
inputting the partial graph corresponding to the picture into a ResNet18 classification model without a full connection layer to obtain an output feature vector subjected to pooling treatment;
and adjusting the feature vector subjected to the pooling treatment into a second picture feature vector with a preset length by using a single-layer perceptron SLP.
As a possible embodiment, the processor determines the expected similarity based on the feature similarity of the extracted first picture feature of each picture of the batch and the extracted second picture feature of each partial picture of the batch, and the pre-labeled relationship label of the pictures of the batch and the partial pictures, and determines the loss function value, including:
acquiring a feature similarity matrix of the first picture features of each picture of the extracted batch and the second picture features of each partial picture of the extracted batch;
acquiring an expected similarity matrix determined according to a relationship label of the pre-labeled picture and the local picture of the batch;
determining a loss function value according to the characteristic similarity matrix and the expected similarity matrix;
the row index/column index in the feature similarity matrix/expected similarity matrix represents the serial number of the picture in the batch, the column index/row index represents the serial number of the local image in the batch, and the pictures and local images on the diagonal have labeled relationship labels.
As a possible embodiment, the determining, by the processor, a loss function value according to the feature similarity matrix and the expected similarity matrix includes:
where B is the number of samples in a batch, CE is the cross entropy loss function, I is the expected similarity matrix, F_g is the whole-picture feature matrix formed by the first picture features of each picture of the batch, and F_l is the local-image feature matrix formed by the second picture features of each local image of the batch.
As a possible embodiment, the processor is further configured to:
acquiring a first training sample set and a second training sample set, wherein the first sample in the first training sample set comprises a picture and a first label of whether the picture comprises a target or not, and the second sample in the second training sample set comprises a local image of a target area corresponding to the picture and a second label of whether the local image comprises the target or not;
inputting the first training sample into a target detection network for completing weight migration, and training the target detection network by taking the output first label as a target;
And inputting a second training sample into the classification network for completing weight migration, and training the classification network by taking the output second label as a target.
As a possible embodiment, the processor is further configured to perform at least one of the following steps:
acquiring a picture to be detected, inputting the picture to be detected into a target detection network, and determining whether a target is identified in the picture to be detected according to a result output by the target detection network;
and acquiring a local image to be detected of a target area of the image to be detected, inputting the local image to be detected into the classification network, and determining whether a target is identified in the local image to be detected according to a result output by the classification network.
In some possible embodiments, aspects of the method for constructing transfer learning weights provided by the present application may also be implemented in the form of a program product, which includes program code for causing a computer device to perform the steps of the method for constructing transfer learning weights according to the various exemplary embodiments of the present application described above when the program product is run on the computer device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A program product for a method of migration learning weight construction according to an embodiment of the present application may employ a portable compact disc read-only memory (CD-ROM) and include program code and may be run on an electronic device. However, the program product of the present application is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the consumer electronic device, partly on the consumer electronic device, as a stand-alone software package, partly on the consumer electronic device, partly on the remote electronic device, or entirely on the remote electronic device or server. In the case of remote electronic devices, the remote electronic device may be connected to the consumer electronic device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external electronic device (e.g., connected through the internet using an internet service provider).
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, the features and functions of two or more of the elements described above may be embodied in one element in accordance with embodiments of the present application. Conversely, the features and functions of one unit described above may be further divided into a plurality of units to be embodied.
Furthermore, although the operations of the methods of the present application are depicted in the drawings in a particular order, this should not be understood as requiring that the operations be performed in that particular order or that all of the illustrated operations be performed to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flowchart and/or block of the flowchart and block diagrams, and combinations of flowcharts and block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (13)

1. The method for constructing the transfer learning weight is characterized by comprising the following steps:
acquiring a training sample set, wherein samples in the training sample set comprise pictures, local images obtained by matting target areas on the pictures, and relationship labels of the pictures and the local images, which are marked in advance;
inputting the picture into a first feature extraction network to extract first picture features, and inputting a local picture corresponding to the picture into a second feature extraction network to extract second picture features;
determining expected similarity based on the feature similarity of the extracted first picture features and the second picture features and the relation labels of the pictures and the local pictures, and determining a loss function value;
when the loss function value is determined to not meet the requirement, carrying out weight adjustment on the first feature extraction network and the second feature extraction network, and triggering the next training;
When the loss function value is determined to meet the requirement, the weight of the current first feature extraction network is used as the migration weight of the target detection network, and the weight of the current second feature extraction network is used as the migration weight of the classification network.
2. The method according to claim 1, wherein when the number of target areas on the picture is plural, a plurality of partial images are obtained by matting the plurality of target areas on the picture, and one partial image selected randomly from the plurality of partial images is used as the partial image corresponding to the picture.
3. The method of claim 1, wherein inputting a picture into a first feature extraction network for first picture feature extraction and inputting a local map corresponding to the picture into a second feature extraction network for second picture feature extraction comprises:
the training data sample set comprises a plurality of samples in batches, wherein each batch of pictures is input into a first feature extraction network to perform first picture feature extraction, and a local image corresponding to each batch of pictures is input into a second feature extraction network to perform second picture feature extraction;
determining an expected similarity based on the extracted feature similarity of the first picture feature and the second picture feature and a relationship label of the picture and the local picture, and determining a loss function value, wherein the method comprises the following steps:
And determining expected similarity based on the feature similarity between the extracted first picture feature of each picture of the batch and the extracted second picture feature of each partial picture of the batch and the pre-labeled relation label of the picture and the partial picture of the batch, and determining a loss function value.
4. The method according to claim 1, wherein:
the first feature extraction network is a YOLO model with the detection (detect) layer removed;
the second feature extraction network is a ResNet18 classification model with the fully connected layer removed.
5. The method of claim 4, wherein inputting the picture into the first feature extraction network for first picture feature extraction comprises:
inputting a picture into the YOLO model without the detection layer, forcibly projecting the feature maps of different strides extracted by the YOLO model, and converting them into a plurality of vectors of uniform size;
splicing a plurality of vectors with uniform sizes to obtain normalized feature vectors;
and adjusting the normalized feature vector to be a first picture feature with a preset length by using a single-layer perceptron SLP.
6. The method of claim 4, wherein inputting the local graph corresponding to the picture into the second feature extraction network for second picture feature extraction comprises:
Inputting the partial graph corresponding to the picture into a ResNet18 classification model without a full connection layer to obtain an output feature vector subjected to pooling treatment;
and adjusting the feature vector subjected to the pooling treatment into a second picture feature with a preset length by using a single-layer perceptron SLP.
7. A method according to claim 3, wherein determining the expected similarity based on the feature similarity of the extracted first picture feature of each picture of the batch to the extracted second picture feature of each partial picture of the batch and the pre-labeled relationship label of the pictures of the batch to the partial pictures, determining the loss function value comprises:
acquiring a feature similarity matrix of the first picture features of each picture of the extracted batch and the second picture features of each partial picture of the extracted batch;
acquiring an expected similarity matrix determined according to a relationship label of the pre-labeled picture and the local picture of the batch;
determining a loss function value according to the characteristic similarity matrix and the expected similarity matrix;
the row index/column index in the feature similarity matrix/expected similarity matrix represents the serial number of the picture in the batch, the column index/row index represents the serial number of the local image in the batch, and the pictures and local images on the diagonal have labeled relationship labels.
8. The method of claim 7, wherein determining a loss function value from the feature similarity matrix and an expected similarity matrix comprises:
wherein B is the number of samples in a batch, CE is the cross entropy loss function, I is the expected similarity matrix, F_g is the whole-picture feature matrix formed by the first picture features of each picture of the batch, and F_l is the local-image feature matrix formed by the second picture features of each local image of the batch.
9. The method as recited in claim 1, further comprising:
acquiring a first training sample set and a second training sample set, wherein a first sample in the first training sample set comprises a picture and a first label indicating whether the picture contains a target, and a second sample in the second training sample set comprises a local image corresponding to the picture and a second label indicating whether the local image contains the target;
inputting the first training sample into a target detection network for completing weight migration, and training the target detection network by taking the output first label as a target;
and inputting a second training sample into the classification network for completing weight migration, and training the classification network by taking the output second label as a target.
10. The method of claim 9, further comprising at least one of the steps of:
acquiring a picture to be detected, inputting the picture to be detected into a target detection network, and determining whether a target is identified in the picture to be detected according to a result output by the target detection network;
and acquiring a local image to be detected of a target area of the image to be detected, inputting the local image to be detected into the classification network, and determining whether a target is identified in the local image to be detected according to a result output by the classification network.
11. A device for constructing a transfer learning weight, the device comprising:
the sample acquisition module is used for acquiring a training sample set, wherein samples in the training sample set comprise pictures, local images obtained by matting target areas on the pictures and relationship labels of the pictures and the local images, which are marked in advance;
the feature extraction module is used for inputting the picture into a first feature extraction network to extract the first picture feature, and inputting the local picture corresponding to the picture into a second feature extraction network to extract the second picture feature;
the loss function value determining module is used for determining expected similarity based on the feature similarity of the extracted first picture features and the extracted second picture features and the relation label of the picture and the local picture, and determining a loss function value;
The model training module is used for carrying out weight adjustment on the first characteristic extraction network and the second characteristic extraction network when the loss function value is determined to not meet the requirement, and triggering the next training;
and the weight migration module is used for taking the weight of the current first feature extraction network as the migration weight of the target detection network and taking the weight of the current second feature extraction network as the migration weight of the classification network when the loss function value meets the requirement.
12. An apparatus for constructing a transfer learning weight, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10.
13. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program for performing the method according to any of claims 1-10 when being executed by a computer.
CN202311141733.2A 2023-09-05 2023-09-05 Method, device and equipment for constructing migration learning weight Active CN116882486B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311141733.2A CN116882486B (en) 2023-09-05 2023-09-05 Method, device and equipment for constructing migration learning weight

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311141733.2A CN116882486B (en) 2023-09-05 2023-09-05 Method, device and equipment for constructing migration learning weight

Publications (2)

Publication Number Publication Date
CN116882486A true CN116882486A (en) 2023-10-13
CN116882486B CN116882486B (en) 2023-11-14

Family

ID=88271939

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311141733.2A Active CN116882486B (en) 2023-09-05 2023-09-05 Method, device and equipment for constructing migration learning weight

Country Status (1)

Country Link
CN (1) CN116882486B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117240689A (en) * 2023-11-10 2023-12-15 北京航空航天大学杭州创新研究院 Node attacked complex network reconstruction method based on deep contrast learning

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635835A (en) * 2018-11-08 2019-04-16 深圳蓝韵医学影像有限公司 A kind of breast lesion method for detecting area based on deep learning and transfer learning
CN109902798A (en) * 2018-05-31 2019-06-18 华为技术有限公司 The training method and device of deep neural network
US20220108134A1 (en) * 2020-10-01 2022-04-07 Nvidia Corporation Unsupervised domain adaptation with neural networks
CN114463552A (en) * 2021-12-27 2022-05-10 浙江大华技术股份有限公司 Transfer learning and pedestrian re-identification method and related equipment
CN115731441A (en) * 2022-11-29 2023-03-03 浙江大学 Target detection and attitude estimation method based on data cross-modal transfer learning
US20230072445A1 (en) * 2021-09-07 2023-03-09 Huawei Technologies Co., Ltd. Self-supervised video representation learning by exploring spatiotemporal continuity

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109902798A (en) * 2018-05-31 2019-06-18 华为技术有限公司 The training method and device of deep neural network
CN109635835A (en) * 2018-11-08 2019-04-16 深圳蓝韵医学影像有限公司 A kind of breast lesion method for detecting area based on deep learning and transfer learning
US20220108134A1 (en) * 2020-10-01 2022-04-07 Nvidia Corporation Unsupervised domain adaptation with neural networks
US20230072445A1 (en) * 2021-09-07 2023-03-09 Huawei Technologies Co., Ltd. Self-supervised video representation learning by exploring spatiotemporal continuity
CN114463552A (en) * 2021-12-27 2022-05-10 浙江大华技术股份有限公司 Transfer learning and pedestrian re-identification method and related equipment
CN115731441A (en) * 2022-11-29 2023-03-03 浙江大学 Target detection and attitude estimation method based on data cross-modal transfer learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
YIN CUI 等: ""Large Scale Fine-Grained Categorization and Domain-Specific Transfer Learning"", 《ARXIV》 *
吴建;许镜;丁韬;: "基于集成迁移学习的细粒度图像分类算法", 重庆邮电大学学报(自然科学版), no. 03 *
周立君;刘宇;白璐;茹志兵;于帅;: "一种基于GAN和自适应迁移学习的样本生成方法", 应用光学, no. 01 *
张椰;朱卫纲;: "基于迁移学习的SAR图像目标检测", 雷达科学与技术, no. 05 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117240689A (en) * 2023-11-10 2023-12-15 北京航空航天大学杭州创新研究院 Node attacked complex network reconstruction method based on deep contrast learning
CN117240689B (en) * 2023-11-10 2024-02-06 北京航空航天大学杭州创新研究院 Node attacked complex network reconstruction method based on deep contrast learning

Also Published As

Publication number Publication date
CN116882486B (en) 2023-11-14

Similar Documents

Publication Publication Date Title
Cortinhal et al. Salsanext: Fast, uncertainty-aware semantic segmentation of lidar point clouds
Lu et al. Class-agnostic counting
Cordonnier et al. Differentiable patch selection for image recognition
CN111950453B (en) Random shape text recognition method based on selective attention mechanism
CN114202672A (en) Small target detection method based on attention mechanism
Ghanem et al. Maximum margin distance learning for dynamic texture recognition
CN109446889B (en) Object tracking method and device based on twin matching network
CN103503029B (en) The method of detection facial characteristics
US11030750B2 (en) Multi-level convolutional LSTM model for the segmentation of MR images
CN116882486B (en) Method, device and equipment for constructing migration learning weight
CN110610210B (en) Multi-target detection method
CN111881958B (en) License plate classification recognition method, device, equipment and storage medium
CN116824307B (en) Image labeling method and device based on SAM model and related medium
CN113269224A (en) Scene image classification method, system and storage medium
CN112801236A (en) Image recognition model migration method, device, equipment and storage medium
CN116579616A (en) Risk identification method based on deep learning
CN115797691A (en) Target detection method and device based on small sample learning and storage medium
EP4220555A1 (en) Training method and apparatus for image segmentation model, image segmentation method and apparatus, and device
CN114091519A (en) Shielded pedestrian re-identification method based on multi-granularity shielding perception
CN114463772A (en) Deep learning-based traffic sign detection and identification method and system
CN114373071A (en) Target detection method and device and electronic equipment
CN113076819A (en) Fruit identification method and device under homochromatic background and fruit picking robot
CN117423021B (en) Method for identifying damaged mangrove images of unmanned aerial vehicle
CN115240121B (en) Joint modeling method and device for enhancing local features of pedestrians
CN113723450B (en) Hyperspectral image band selection method based on evolutionary multitasking optimization

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant