CN114708462A - Method, system, device and storage medium for generating detection model for multi-data training - Google Patents

Method, system, device and storage medium for generating detection model for multi-data training


Publication number
CN114708462A
Authority
CN
China
Prior art keywords
training
detection
network
data set
bounding box
Prior art date
Legal status
Pending
Application number
CN202210465516.8A
Other languages
Chinese (zh)
Inventor
陈文晶
王坚
李兵
余昊楠
胡卫明
Current Assignee
Renmin Zhongke Beijing Intelligent Technology Co ltd
Original Assignee
Renmin Zhongke Beijing Intelligent Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Renmin Zhongke Beijing Intelligent Technology Co ltd
Priority to CN202210465516.8A
Publication of CN114708462A
Legal status: Pending

Classifications

    • G06F 18/2415: Pattern recognition; classification techniques relating to the classification model, based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F 18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N 3/045: Computing arrangements based on biological models; neural networks; combinations of networks
    • G06N 3/08: Computing arrangements based on biological models; neural networks; learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method, a system, a device and a storage medium for generating a detection model for multi-data training. The method comprises the following steps: inputting a pre-training data set into a base network as matrix data; outputting, by the base network, the coordinates and target labels of a plurality of target bounding boxes; obtaining the loss of the base network according to the output coordinates and target labels of the target bounding boxes, and completing model pre-training of the base network to form a pre-training detection network model; inputting a task data set into the pre-training detection network model; and adjusting the pre-training detection network model according to the task data set to generate the detection model for multi-data training. The system comprises a first input unit, an output unit, a pre-training detection network model unit, a second input unit and a generation unit. The computer device comprises a memory, a processor and a computer program. The storage medium contains computer-executable instructions for performing the above method.

Description

Method, system, device and storage medium for generating detection model for multi-data training
Technical Field
The invention relates to the technical field of computer vision, in particular to a method, a system, a device and a storage medium for generating a target detection model based on multi-data joint training.
Background
The rapid development of deep learning technology and of large-scale labeled data sets has driven progress in computer vision tasks, including image recognition, target detection and image segmentation. Among these, target detection has received wide attention as a basic computer vision task with a wide range of applications and demands.
Many data sets in real-world tasks suffer from data scarcity. Take the problem of detecting endangered birds: since endangered birds are by nature rare species, no huge volume of varied labeled image data can exist for them; yet, given the high similarity among bird species, almost all birds share wings, feathers, claws, beaks and the like. Even so, training an endangered-bird detection model from general bird data, for which data are sufficient, remains difficult to realize.
In summary, target detection model generation schemes in the prior art cannot solve the technical problem of associating a data-sufficient data set with a data-deficient data set, transferring to it, and then training on the deficient data to obtain the target detection model it calls for.
Disclosure of Invention
The invention provides a method, a system, a device and a storage medium for generating a detection model for multi-data training, to solve the prior-art problem that a data-sufficient data set cannot be associated with and transferred to a data-deficient data set so that a target detection model taking the deficient data as its task can then be obtained by training.
In order to achieve this purpose, the invention provides the following technical solutions:
In a first aspect, a method for generating a detection model for multi-data training in an embodiment of the present invention includes the following steps: inputting a pre-training data set into a base network as matrix data; outputting, by the base network, the coordinates and target labels of a plurality of target bounding boxes; obtaining the loss of the base network according to the output coordinates and target labels of the target bounding boxes, completing model pre-training of the base network, and forming a pre-training detection network model; inputting a task data set into the pre-training detection network model; and adjusting the pre-training detection network model according to the task data set to generate the detection model for multi-data training.
Preferably, before the pre-training data set is input into the base network as matrix data, the method comprises the following steps: performing differential sampling on the samples in a first classification data set and a first detection data set, using a low sampling probability for classes with many samples and a high sampling probability for classes with few samples, and merging the sampled sample data and a second detection data set into the pre-training data set; or, each time a training mini-batch is generated, dynamically sampling the first classification data set and the first detection data set with a dynamic sampling probability, and merging the sample data obtained after sampling and the second detection data set into the pre-training data set.
More preferably, the sample data of the first detection data set and the sample data of the second detection data set undergo label mapping and are then merged into the pre-training data set.
More preferably, the loss consists of the classification loss of a bounding box and the regression loss of the bounding box:

L(p, p*, t, t*) = L_cls(p, p*) + λ[p* ≥ 1] L_reg(t, t*)   (Formula 1)

smooth_L1(x) = 0.5 x², if |x| < 1; |x| - 0.5, otherwise   (Formula 2)

L_cls(p, p*) = -log p_{p*}   (Formula 3)

L_reg(t, t*) = Σ_{i ∈ {x, y, w, h}} smooth_L1(t_i - t_i*)   (Formula 4)

wherein p represents the classification probability output by the model for the target bounding box; p* represents the true classification of the target bounding box; t = (t_x, t_y, t_w, t_h) represents the four coordinates of the target bounding box output by the model; t* represents the true values of the four coordinates of the target bounding box; [·] is an indicator that takes the value 1 when the condition inside the brackets is satisfied and 0 otherwise, and λ is a balancing weight; smooth_L1 represents the smoothing function with argument x; L_cls represents the classification loss of the bounding box calculated by the bounding-box classification loss function, and L_reg represents the regression loss of the bounding box calculated by the bounding-box regression loss function; t_i represents the four coordinates t = (t_x, t_y, t_w, t_h) of the output target bounding box, i ∈ {x, y, w, h}, and t_i* represents the corresponding true values. Formulas 1 to 4 all calculate the loss of a single target bounding box. The pre-training data set comprises the classification-set pictures in a first classification data set and the detection-set pictures in a first detection data set and a second detection data set; for the classification-set pictures in the pre-training data set, the classification loss of the bounding box within the loss is calculated with Formula 3; for the detection-set pictures in the pre-training data set, the classification loss of the bounding box and the regression loss of the bounding box are calculated with Formulas 3 and 4 together, and the classification loss of the bounding box of the classification-set pictures, the classification loss of the bounding box of the detection-set pictures and the regression loss of the bounding box of the detection-set pictures are then added together to obtain the loss.
More preferably, the pre-training data set includes the classification-set pictures in the first classification data set and the detection-set pictures in the first detection data set and the second detection data set; when a classification-set picture is used, the loss is only the classification loss of the bounding box, the regression loss of the bounding box is not calculated, and the classification loss of the bounding box adopts a multi-label loss, calculated with the following Formula 5:

L_cls = -Σ_i [ p_i* log p_i + (1 - p_i*) log(1 - p_i) ]   (Formula 5)

wherein L_cls represents the classification loss of the bounding box obtained by the bounding-box classification loss function, i represents the index of the corresponding bounding box, p_i represents the probability of the classification output by the model for the bounding box, and p_i* represents the true classification of the bounding box.
More preferably, the base network includes a region proposal network and a network detection head.
More preferably, when training with the classification-set pictures in the first classification data set, the classification-set pictures in the first classification data set are input as matrix data into the region proposal network of the base network, and the training target of the region proposal network is preset as the central region of each classification-set picture.
More preferably, the base network is a Faster RCNN network or a Cascade RCNN network.
More preferably, each detection-set picture in the first detection data set and the second detection data set carries at least one bounding-box coordinate and one bounding-box category label, and each picture in the first classification data set carries at least one image category label.
More preferably, the adjusting of the pre-training detection network model according to the task data set includes: adjusting the region proposal network in the pre-training detection network model; specifically, the L target bounding boxes with the largest IoU against the true values are selected, together with N target bounding boxes whose box scores rank in the top M and whose IoU against the true values is next-highest, where L, N, M are preset constants, and the loss of the region proposal network is calculated by the following Formula 6 to select and output the target bounding boxes:

L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)   (Formula 6)

wherein p_i represents the classification probability of the bounding box output by the model, p_i* represents the true classification of the bounding box, L_cls represents the classification loss of the bounding box calculated by the bounding-box classification loss function, L_reg represents the regression loss of the bounding box calculated by the bounding-box regression loss function, N_cls represents the number of bounding boxes used to calculate the classification loss, N_reg represents the number of bounding boxes used to calculate the regression loss of the bounding box, λ is a balancing weight, t_i represents the coordinates of the bounding box output by the model, and t_i* represents the true coordinates of the bounding box.
More preferably, the adjusting of the pre-training detection network model according to the task data set includes: adjusting the network detection head in the pre-training detection network model; specifically, the probability distribution of the target bounding boxes selected and output by the region proposal network is learned: the uncertainty of the target-bounding-box coordinates is predicted through Gaussian-distribution modeling with the following Formula 7, the true value of the target bounding box is modeled through the Dirac distribution of the following Formula 8, and the loss is finally obtained by fitting the two distributions through the KL divergence of Formulas 9 and 10, completing the adjustment of the network detection head:

P_Θ(x) = (1/√(2πσ²)) e^{-(x - x_e)²/(2σ²)}   (Formula 7)

P_D(x) = δ(x - x_g)   (Formula 8)

L_reg = D_KL(P_D(x) || P_Θ(x)) = (x_g - x_e)²/(2σ²) + (1/2) log(σ²) + (1/2) log(2π) - H(P_D(x))   (Formula 9)

L_reg ∝ (x_g - x_e)²/(2σ²) + (1/2) log(σ²)   (Formula 10)

wherein P_Θ(x) represents the Gaussian distribution, x represents the function argument, x_e represents the mean of the Gaussian distribution, σ represents the standard deviation of the Gaussian distribution, e represents the natural constant, P_D(x) represents the Dirac distribution, x_g represents the center value of the Dirac distribution, H(P_D(x)) represents the entropy of P_D(x) (a constant independent of the estimated parameters), L_reg represents the regression loss of the bounding box calculated by the bounding-box regression loss function, and D_KL represents the KL-divergence fit.
In a second aspect, a system for generating a detection model for multi-data training according to an embodiment of the present invention includes: a first input unit for inputting the pre-training data set into the base network as matrix data; an output unit for causing the base network to output the coordinates and target labels of a plurality of target bounding boxes; a pre-training detection network model unit for obtaining the loss of the base network according to the output coordinates and target labels of the target bounding boxes, completing model pre-training of the base network and forming a pre-training detection network model; a second input unit for inputting the task data set into the pre-training detection network model; and a generation unit for adjusting the pre-training detection network model according to the task data set to generate the detection model for multi-data training.
In a third aspect, a computer device according to an embodiment of the present invention includes a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the method for generating a detection model for multi-data training according to any embodiment of the invention when executing the program.
In a fourth aspect, a storage medium containing computer-executable instructions according to an embodiment of the present invention performs, when the instructions are executed by a computer processor, the method for generating a detection model for multi-data training according to any embodiment of the present invention.
With the method, system, device and storage medium for generating a detection model for multi-data training described above, the detection performance of the final detection model is preserved while the target classes gain richer expressive power, so the model can adapt to more object classes appearing in reality, achieving the purpose of associating a data-sufficient data set with a data-deficient data set, transferring to it, and then training to obtain a target detection model that takes the deficient data as its task.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and do not limit the invention. In the drawings:
fig. 1 is a flowchart of a method for generating a detection model for multiple data training according to embodiment 1 of the present invention;
FIG. 2 is a diagram illustrating a pre-training data set according to embodiment 1 of the present invention;
FIG. 3 is a schematic diagram of the Faster RCNN network structure according to embodiment 1 of the present invention;
fig. 4 is a schematic structural diagram of a regional proposal network according to embodiment 1 of the present invention;
FIG. 5 is a schematic structural diagram of a multi-data training detection model generation system according to embodiment 2 of the present invention;
fig. 6 is a schematic structural diagram of a computer device according to embodiment 3 of the present invention.
Detailed Description
In order to achieve the purpose of associating and transferring a data-sufficient data set to a data-deficient data set and then training to obtain a target detection model that takes the deficient data as its task, the inventors found through research that training a model on common birds, for which data are sufficient, and then training that model further on a small data set of endangered birds yields a final model with better performance than one trained from scratch directly on the small data. Furthermore, the inventors considered that existing public detection data sets can be used to obtain a better pre-trained model, which is then fine-tuned so that the model performs best on the target task. Based on this research and thinking, the inventors propose a method, a system, a device and a storage medium for generating a detection model for multi-data training, described in detail below through the embodiments.
Embodiment 1. The method for generating a detection model for multi-data training of this embodiment, as shown in fig. 1, includes the following main steps:
100. Prepare the pre-training data.
To improve the expressive power of pre-training, this embodiment trains with classification-set data and detection-set data at the same time, and the correspondence between different classes and the balanced sampling of training data need to be considered during training. When preparing the pre-training data, a first classification data set, a first detection data set and a second detection data set are adopted, and the samples in the first classification data set and the first detection data set are differentially sampled: classes with many samples use a low sampling probability and classes with few samples use a high sampling probability. For example, the first classification data set is a 6000-class classification data set in which the number of samples per class differs greatly (some classes have only hundreds of sample pictures while others have hundreds of thousands), so different sampling rates are applied to the sample pictures of different classes to balance the number of samples selected per class. The sampled sample data are merged with the second detection data set into the pre-training data set. Alternatively, each time a training mini-batch is generated, the first classification data set and the first detection data set are sampled dynamically with a dynamic sampling probability, and the sampled sample data are merged with the second detection data set into the pre-training data set. In addition, the sampled sample data of the first detection data set and the data of the second detection data set undergo label mapping and are then merged into the pre-training data set.
In a specific implementation, referring to fig. 2, the pre-training data may use, but is not limited to, the MS COCO data set and the Open Images data set. The Open Images training set supplies both a 6000-class classification set (the first classification data set) and a 500-class detection set (the first detection data set), while the MS COCO data set is an 80-class detection set (the second detection data set). The classification set contains pictures and class labels (image-level labels); the detection sets contain pictures, box coordinates and class labels (bounding-box-level labels). That is, taking Open Images as an example, the label corresponding to each picture carries multiple marked-box coordinates and marked-box classes; on average, each picture has more than 8 marked boxes.
The Open Images data set suffers from unbalanced data distribution, so when the pre-training data is prepared, differential sampling is performed according to the distribution of the data: classes with many samples use a low sampling probability and classes with few samples use a high sampling probability. A dynamic sampling mode may also be used, i.e., each time a training mini-batch is generated, pictures are sampled from the Open Images data set with different sampling probabilities. A pre-training data set with 6000 classification categories is finally obtained.
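As an illustration, a minimal sketch of the differential sampling described above follows, written in Python; the framework choice and the inverse-square-root weighting are assumptions for the example, not taken from the patent:

```python
import random
from collections import Counter

def class_sampling_probs(labels):
    """Map each class to a sampling probability inversely related to its frequency."""
    counts = Counter(labels)
    weights = {c: n ** -0.5 for c, n in counts.items()}  # rare classes get larger weights
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}

def sample_mini_batch(samples, labels, probs, batch_size=16):
    """Draw one mini-batch, weighting each sample by its class probability."""
    per_sample = [probs[c] for c in labels]
    return random.choices(samples, weights=per_sample, k=batch_size)
```

Computing `probs = class_sampling_probs(all_labels)` once gives the static differential sampling, while calling `sample_mini_batch` anew for every mini-batch corresponds to the dynamic sampling mode.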
When the 80-class and 500-class detection sets are brought into the 6000-category pre-training label space, label mapping is needed. Since the 6000 Open Images classes are not independent but contain hierarchical relations among the classes, label mapping must keep the labels accurate without omitting upper-level labels. For example, since label names differ across data sets, one of the 80 MS COCO classes may have no exactly matching name among the 6000 Open Images classes but only similar classes, in which case the corresponding classes need to be mapped: the MS COCO data set has a person class, while the Open Images data set has man and woman classes, and both need to be mapped.
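A minimal sketch of such a label mapping follows; the tables are toy examples (only the person/man/woman case comes from the text above), and keeping parent labels reflects the hierarchical-relation requirement:

```python
# Toy mapping tables (assumptions for illustration, not the patent's actual mapping)
COCO_TO_OPENIMAGES = {
    "person": ["man", "woman"],  # no exact name match, so map to the similar classes
    "car": ["car"],
}
PARENTS = {"man": ["person"], "woman": ["person"], "car": ["vehicle"]}

def map_labels(coco_label):
    """Return the mapped classes plus their hierarchy parents (upper levels kept)."""
    mapped = COCO_TO_OPENIMAGES.get(coco_label, [coco_label])
    with_parents = set(mapped)
    for m in mapped:
        with_parents.update(PARENTS.get(m, []))
    return sorted(with_parents)

print(map_labels("person"))  # ['man', 'person', 'woman']
```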
110. Input the pre-training data set into the base network as matrix data.
The base network in this embodiment contains a Region Proposal Network (RPN) and a network detection head (Detection head), including but not limited to a Faster RCNN network or a Cascade RCNN network. This embodiment uses a Faster RCNN network as the base network of the pre-training detection network model; the network structure, shown in fig. 3, comprises a backbone network layer, a region proposal network, target regions of interest and a network detection head. The input to the Faster RCNN is a mini-batch of data from the pre-training data set, which contains multiple pictures represented as matrix data; for example, for a mini-batch of 16 pictures, each with a resolution of 256 × 256 and 3 channels (RGB), the mini-batch is a 16 × 3 × 256 × 256 matrix.
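As a small illustration of that layout, using PyTorch (an assumption, since the patent names no framework):

```python
import torch

# 16 RGB pictures at 256 x 256 stacked into one (batch, channels, height, width) tensor
mini_batch = torch.rand(16, 3, 256, 256)
print(mini_batch.shape)  # torch.Size([16, 3, 256, 256])
```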
120. The base network outputs coordinates and target labels for a plurality of target bounding boxes.
In a particular implementation, the output of the Faster RCNN network is a plurality of target bounding box coordinates (x1, y1, x2, y2) and a target label (i.e., category c).
130. Obtain the loss of the base network according to the output coordinates and target labels of the target bounding boxes, and complete model pre-training of the base network to form the pre-training detection network model.
In a specific implementation, the pre-training data set includes both the classification-set pictures in the first classification data set and the detection-set pictures in the first detection data set and the second detection data set. When classification-set pictures and detection-set pictures are used together, the loss of the Faster RCNN network has two parts: the classification loss of the bounding box and the regression loss of the bounding box. When only classification-set pictures are used, the loss is the classification loss of the bounding box alone.
Because the pre-training data set contains classification-set pictures and detection-set pictures at the same time, they need to be distinguished during training according to the class labels the pictures carry: each picture in the first detection data set and the second detection data set carries at least one bounding-box coordinate and one bounding-box class label, and each picture in the first classification data set carries at least one image class label. That is, the pictures of the classification set carry image-level labels and the pictures of the detection sets carry bounding-box-level labels.
When classification-set pictures and detection-set pictures are used together, only the classification-loss part of the loss is considered for a classification-set picture, while both the classification loss of the bounding box and the regression loss of the bounding box are considered for a detection-set picture. The specific calculation formulas are:

L(p, p*, t, t*) = L_cls(p, p*) + λ[p* ≥ 1] L_reg(t, t*)   (Formula 1)

smooth_L1(x) = 0.5 x², if |x| < 1; |x| - 0.5, otherwise   (Formula 2)

L_cls(p, p*) = -log p_{p*}   (Formula 3)

L_reg(t, t*) = Σ_{i ∈ {x, y, w, h}} smooth_L1(t_i - t_i*)   (Formula 4)

wherein p represents the classification probability output by the model for the target bounding box; p* represents the true classification of the target bounding box; t = (t_x, t_y, t_w, t_h) represents the four coordinates of the target bounding box output by the model; t* represents the true values of the four coordinates of the target bounding box; [·] is an indicator that takes the value 1 when the condition inside the brackets is satisfied and 0 otherwise, and λ is a balancing weight; smooth_L1 represents the smoothing function with argument x; L_cls represents the classification loss of the bounding box calculated by the bounding-box classification loss function, and L_reg represents the regression loss of the bounding box calculated by the bounding-box regression loss function; t_i represents the four coordinates t = (t_x, t_y, t_w, t_h) of the output target bounding box, i ∈ {x, y, w, h}, and t_i* represents the corresponding true values. Formulas 1 to 4 all calculate the loss of a single target bounding box. For a classification-set picture, the classification loss of the bounding box within the loss is calculated with Formula 3; for a detection-set picture, the classification loss of the bounding box and the regression loss of the bounding box are calculated with Formulas 3 and 4 together, and then the classification loss of the bounding box of the classification-set pictures, the classification loss of the bounding box of the detection-set pictures and the regression loss of the bounding box of the detection-set pictures are added together to obtain the loss.
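A hedged sketch of the per-box loss of Formulas 1 to 4 follows, with cross-entropy for Formula 3 and smooth-L1 (threshold 1) for Formulas 2 and 4; the PyTorch usage and variable names are assumptions:

```python
import torch
import torch.nn.functional as F

def per_box_loss(cls_logits, cls_target, t, t_star, is_detection_pic, lam=1.0):
    """Formula 1: L = L_cls + lam * [p* >= 1] * L_reg, for a single box."""
    # Formula 3: cross-entropy classification loss for one box
    l_cls = F.cross_entropy(cls_logits.unsqueeze(0), cls_target.unsqueeze(0))
    l_reg = torch.tensor(0.0)
    # Regression only for detection-set pictures and non-background targets ([p* >= 1])
    if is_detection_pic and cls_target.item() >= 1:
        l_reg = F.smooth_l1_loss(t, t_star, reduction="sum")  # Formulas 2 and 4
    return l_cls + lam * l_reg
```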
When only classification-set pictures are used, since pictures in the Open Images data set are usually multi-label (even if only one label is annotated, a picture may belong to several of the 6000 categories at the same time), the classification loss of the bounding box adopts a multi-label loss rather than cross-entropy loss, calculated specifically with the following Formula 5:

L_cls = -Σ_i [ p_i* log p_i + (1 - p_i*) log(1 - p_i) ]   (Formula 5)

wherein L_cls represents the classification loss of the bounding box obtained by the bounding-box classification loss function, i represents the index of the corresponding box, p_i represents the probability of the classification output by the model for the box, and p_i* represents the true classification of the box.
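A corresponding sketch of the multi-label loss of Formula 5 as per-category binary cross-entropy; the reduction choice and PyTorch usage are assumptions:

```python
import torch
import torch.nn.functional as F

def multi_label_cls_loss(logits, targets):
    """logits: (num_boxes, 6000); targets: multi-hot 0/1 matrix of the same shape."""
    return F.binary_cross_entropy_with_logits(logits, targets, reduction="mean")

loss = multi_label_cls_loss(torch.randn(4, 6000), torch.randint(0, 2, (4, 6000)).float())
```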
The region proposal network, whose structure is shown in fig. 4, comprises 3×3 convolutional layers and 1×1 convolutional layers. Although a classification-set picture only carries a target label, the target of such a picture usually sits at the center, so when the pictures of the first classification data set are input as matrix data into the region proposal network of the base network, the training target of the region proposal network can be preset as the central region of the picture. And because the region proposal network is trained without regard to the specific category, a category-independent localization capability can be learned from more target categories (the 6000 classes). A sketch of this central-region preset follows.
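In this minimal sketch, the covered fraction of the picture is an assumed constant:

```python
def central_pseudo_box(width, height, fraction=0.5):
    """Return (x1, y1, x2, y2) of a centered box covering `fraction` of each side."""
    bw, bh = width * fraction, height * fraction
    x1, y1 = (width - bw) / 2, (height - bh) / 2
    return (x1, y1, x1 + bw, y1 + bh)

print(central_pseudo_box(256, 256))  # (64.0, 64.0, 192.0, 192.0)
```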
After the above training process is finished, model pre-training of the Faster RCNN is complete and the pre-training detection network model is formed.
140. Input the task data set into the pre-training detection network model.
The task data set, i.e., the data-deficient data set, is usually a small data set, which shows in its small number of target classes and small number of picture samples (compared with the pre-training data set described above). Each picture in the task data set carries a label, and the label of each picture comprises at least one target-bounding-box coordinate and a target label. The task data set is input into the pre-training detection network model, so that the network model architecture during adjustment is consistent with that during pre-training.
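A sketch of one task-data-set annotation record under these constraints; the field names and values are hypothetical:

```python
# One picture's label: at least one target bounding box plus its target label
sample = {
    "image": "endangered_bird_0001.jpg",    # hypothetical file name
    "boxes": [(48.0, 60.0, 201.0, 190.0)],  # (x1, y1, x2, y2)
    "labels": ["black-faced spoonbill"],    # hypothetical task class
}
```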
150. Adjust the pre-training detection network model according to the task data set to generate the detection model for multi-data training.
When the pre-training detection network model is adjusted, because the number of task data set pictures is small, training too many epochs loses the strong expressive power the pre-training detection network model originally has and easily overfits it; with too few epochs, the pre-training detection network model may fail to adapt to the new task classes.
Therefore, in this embodiment, because the pre-training data set and the task data set differ and the coordinate values of the target bounding boxes carry uncertainty during region-proposal-network training, in order to ensure the best fine-tuning effect for the pre-training detection network model, when target bounding boxes are output, the L target bounding boxes with the largest IoU against the true values (ground truth) are selected, together with N target bounding boxes whose box scores rank in the top M and whose IoU against the ground truth is next-highest, where L, N, M are preset constants. The loss of the region proposal network is calculated by the following Formula 6 to select and output the target bounding boxes:

L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)   (Formula 6)

wherein p_i represents the classification probability of the box output by the model, p_i* represents the true classification of the box, L_cls represents the classification loss of the bounding box calculated by the bounding-box classification loss function, L_reg represents the regression loss of the bounding box calculated by the bounding-box regression loss function, N_cls represents the number of bounding boxes used to calculate the classification loss, N_reg represents the number of bounding boxes used to calculate the regression loss of the bounding box, λ is a balancing weight, t_i represents the coordinates of the box output by the model, and t_i* represents the true coordinates of the box.
Note that the box score p_i lies between 0 and 1. During training, the region proposal network outputs a series of boxes, each with a score and coordinates; the higher the score, the more likely the box contains an object, and i denotes the index of an anchor box. The Faster RCNN can in fact be understood as two parts: the first part is the backbone network, used to extract basic features of the picture; the second part is the Region Proposal Network (RPN) plus the network detection head, with two losses, namely the classification loss of the bounding box plus the regression loss of the bounding box. The difference between the two is that the classification part of the RPN is binary, i.e., the RPN only judges whether an anchor is an object or background, while the classification part of the network detection head is multi-class, with the number of classes depending on the specific task. In other words, detection with Faster RCNN proceeds in two steps. First, from preset anchor boxes (rectangular boxes, usually thousands to tens of thousands, whose number and positions are fixed and tied to the network), the RPN produces a series of target regions of interest, in practice a series of rectangular boxes obtained by classification and regression, each with coordinates and a class (binary: object or background). Then the network detection head further classifies and regresses the rectangular boxes output by the RPN to obtain finer classes and more accurate box coordinates. That is, the RPN yields coarse boxes and the network detection head yields the final fine boxes, and the selection rule above governs how the RPN output becomes the head input; a sketch of this rule follows.
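One hedged reading of the selection rule in code: keep the L proposals with the largest IoU against the ground truth, plus up to N of the top-M-scoring proposals whose IoU is still acceptable. The IoU threshold stands in for the "next-highest IoU" criterion and, like the values of L, N, M here, is an assumption:

```python
import torch

def select_proposals(ious, scores, L=32, N=16, M=64, iou_thresh=0.5):
    """ious: (P,) max IoU of each proposal vs. ground truth; scores: (P,) RPN box scores."""
    top_iou = torch.topk(ious, k=min(L, ious.numel())).indices
    top_score = torch.topk(scores, k=min(M, scores.numel())).indices
    # Among the top-M scorers, keep up to N whose IoU is still high enough
    extra = [i for i in top_score.tolist() if ious[i] >= iou_thresh][:N]
    keep = torch.cat([top_iou, torch.tensor(extra, dtype=torch.long)])
    return torch.unique(keep)
```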
In the network detection head part, considering the uncertainty of the coordinates of the target bounding boxes selected and output by the region proposal network, the probability distribution of the box is learned: the uncertainty of the target-bounding-box coordinates is predicted through Gaussian-distribution modeling with the following Formula 7, the true value (ground truth) of the target bounding box is modeled through the Dirac distribution of the following Formula 8, and the two distributions are finally fitted through the KL divergence of Formulas 9 and 10 as the loss, completing the adjustment of the network detection head.
P_Θ(x) = (1/√(2πσ²)) e^{-(x - x_e)²/(2σ²)}   (Formula 7)

P_D(x) = δ(x - x_g)   (Formula 8)

L_reg = D_KL(P_D(x) || P_Θ(x)) = (x_g - x_e)²/(2σ²) + (1/2) log(σ²) + (1/2) log(2π) - H(P_D(x))   (Formula 9)

L_reg ∝ (x_g - x_e)²/(2σ²) + (1/2) log(σ²)   (Formula 10)

wherein P_Θ(x) represents the Gaussian distribution, x represents the function argument, x_e represents the mean of the Gaussian distribution, σ represents the standard deviation of the Gaussian distribution, e represents the natural constant, P_D(x) represents the Dirac distribution, x_g represents the center value of the Dirac distribution, H(P_D(x)) represents the entropy of P_D(x) (a constant independent of the estimated parameters), L_reg represents the regression loss of the bounding box calculated by the bounding-box regression loss function, and D_KL represents the KL-divergence fit. This completes the adjustment of the pre-training detection network model and generates the detection model for multi-data training.
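A hedged sketch of the resulting regression loss per coordinate, following Formula 10; predicting the log variance instead of σ is a common stabilization and an assumption here:

```python
import torch

def kl_regression_loss(x_e, log_sigma2, x_g):
    """L_reg ~ (x_g - x_e)^2 / (2 sigma^2) + log(sigma^2) / 2, per coordinate."""
    sigma2 = torch.exp(log_sigma2)
    return ((x_g - x_e) ** 2 / (2 * sigma2) + 0.5 * log_sigma2).mean()

# Example: 8 boxes, 4 coordinates each; log_sigma2 = 0 means unit variance
loss = kl_regression_loss(torch.randn(8, 4), torch.zeros(8, 4), torch.randn(8, 4))
```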
In summary, the method of this embodiment uses multi-data joint training and accounts for the potential uncertainty of target classes and target bounding boxes during fine-tuning, improving the accuracy of the model when training on a small data set. Model fine-tuning also accounts for the box selection of the region proposal network and the uncertainty of box category and box coordinates in the Faster RCNN network, so the generalization performance of the final model can be improved.
Embodiment 2. The system for generating a detection model for multi-data training of this embodiment, as shown in fig. 5, includes: a pre-training data preparation unit 200, a first input unit 210, an output unit 220, a pre-training detection network model unit 230, a second input unit 240, a generation unit 250 and a mapping unit 260.
The pre-training data preparation unit 200 is configured to adopt a first classification data set, a first detection data set and a second detection data set when the pre-training data is prepared, and to differentially sample the samples in the first classification data set and the first detection data set, classes with many samples using a low sampling probability and classes with few samples using a high sampling probability. For example, the first classification data set is a 6000-class classification data set in which the number of samples per class differs greatly (some classes have only hundreds of sample pictures while others have hundreds of thousands), so different sampling rates are applied to the sample pictures of different classes to balance the number of samples selected per class; the sampled sample data are merged with the second detection data set into the pre-training data set. Alternatively, the unit is configured to adopt the first classification data set, the first detection data set and the second detection data set when the pre-training data is prepared, to sample the first classification data set and the first detection data set dynamically with a dynamic sampling probability each time a training mini-batch is generated, and to merge the sample data obtained after sampling with the second detection data set into the pre-training data set. In a specific implementation, the pre-training data may use, but is not limited to, the MS COCO data set and the Open Images data set. The Open Images training set supplies both a 6000-class classification set (the first classification data set) and a 500-class detection set (the first detection data set), while the MS COCO data set is an 80-class detection set (the second detection data set); the classification set contains pictures and class labels (image-level labels), and the detection sets contain pictures, box coordinates and class labels (bounding-box-level labels), i.e., taking Open Images as an example, the label corresponding to each picture carries multiple marked-box coordinates and marked-box classes, with more than 8 marked boxes per picture on average. The Open Images data set suffers from unbalanced data distribution, so when the pre-training data is prepared, differential sampling is performed according to the distribution of the data: classes with many samples use a low sampling probability and classes with few samples use a high sampling probability. A dynamic sampling mode may also be used, i.e., each time a training mini-batch is generated, pictures are sampled from the Open Images data set with different sampling probabilities. A pre-training data set with 6000 classification categories is finally obtained.
The mapping unit 260 is configured to perform label mapping on the sampled data of the first detection data set and the data of the second detection data set before they are merged into the pre-training data set. In a specific implementation, when the 80-class and 500-class detection sets are brought into the 6000-category pre-training label space, label mapping is required; since the 6000 Open Images classes are not independent but contain hierarchical relations, upper-level labels cannot be omitted while the mapping is kept accurate. For example, since label names differ across data sets, one of the 80 MS COCO classes may have no exactly matching name among the 6000 Open Images classes but only similar classes, in which case the corresponding classes need to be mapped: the MS COCO data set has a person class, while the Open Images data set has man and woman classes, and both need to be mapped.
The first input unit 210 is configured to input the pre-training data set into the base network as matrix data. The base network in this embodiment contains a region proposal network and a network detection head, including but not limited to a Faster RCNN network or a Cascade RCNN network. This embodiment uses a Faster RCNN network as the base network of the pre-training detection network model, comprising a backbone network layer, a region proposal network, target regions of interest and a network detection head. The input to the Faster RCNN is a mini-batch of data from the pre-training data set, which contains multiple pictures represented as matrix data; for example, for a mini-batch of 16 pictures, each with a resolution of 256 × 256 and 3 channels (RGB), the mini-batch is a 16 × 3 × 256 × 256 matrix.
The output unit 220 is configured to cause the base network to output a plurality of target bounding-box coordinates and target labels. In a specific implementation, the output of the Faster RCNN network is a plurality of target bounding-box coordinates (x1, y1, x2, y2) and target labels (i.e., category c).
The pre-training detection network model unit 230 is configured to obtain the loss of the base network according to the output coordinates and target labels of the multiple target bounding boxes, and to complete model pre-training of the base network to form the pre-training detection network model. In a specific implementation, the loss of the Faster RCNN network has two parts: the classification loss of the bounding box and the regression loss of the bounding box. Because the pre-training data set contains classification-set pictures and detection-set pictures at the same time, they need to be distinguished during training according to the class labels the pictures carry: classification-set pictures carry image-level labels and detection-set pictures carry bounding-box-level labels. For a classification-set picture, only the classification-loss part of the loss is considered; for a detection-set picture, both the classification loss of the bounding box and the regression loss of the bounding box are considered. It should be noted that, since pictures in the Open Images data set are usually multi-label (even if only one label is annotated, a picture may belong to several of the 6000 categories at the same time), the classification loss of the bounding box adopts a multi-label loss rather than cross-entropy loss. The region proposal network, whose structure comprises 3×3 convolutional layers and 1×1 convolutional layers, is also involved: although a classification-set picture only carries a target label, the target of such a picture usually sits at the center, so the training target of the region proposal network can be preset as the central region of the picture; and because the region proposal network is trained without regard to the specific category, a category-independent localization capability can be learned from more target categories (the 6000 classes). After the training process is finished, model pre-training of the Faster RCNN is complete and the pre-training detection network model is formed.
The second input unit 240 is configured to input the task data set into the pre-training detection network model. The task data set, i.e., the data-deficient data set, is usually a small data set, which shows in its small number of target classes and small number of picture samples (compared with the pre-training data set described above). Each picture in the task data set carries a label, and the label of each picture comprises at least one target-bounding-box coordinate and a target label. The task data set is input into the pre-training detection network model, so that the network model architecture during adjustment is consistent with that during pre-training.
The generation unit 250 is configured to adjust the pre-training detection network model according to the task data set and generate the detection model for multi-data training. In a specific implementation, when the pre-training detection network model is adjusted, because the number of task data set pictures is small, training too many epochs loses the strong expressive power the pre-training detection network model originally has and easily overfits it, while too few epochs may leave the pre-training detection network model unadapted to the new task classes. Therefore, in this embodiment, to ensure the best fine-tuning effect for the pre-training detection network model, the category difference between the pre-training data set and the task data set needs to be considered, and uncertain target bounding boxes are considered during region-proposal-network training: specifically, when the network part selects the output target bounding boxes, not only the target bounding boxes with the largest IoU against the ground truth are selected, but also those with high network scores whose IoU against the ground truth is next-highest. In addition, in the network detection head part, the uncertainty of the regressed box class and of the target-bounding-box coordinates is considered to further correct the target bounding box, and the class uncertainty can serve as a reference at inference, which improves the generalization capability of the model. Here the probability distribution of the box is learned, which can also be viewed from the perspective of a Bayesian neural network, with a new loss (the KL divergence).
In summary, the system of this embodiment uses multi-data joint training and accounts for the potential uncertainty of target classes and target bounding boxes during fine-tuning, improving the accuracy of the model when training on a small data set. Model fine-tuning also accounts for the box selection of the region proposal network and the uncertainty of box category and box coordinates in the Faster RCNN network, so the generalization performance of the final model can be improved.
Embodiment 3. For the computer device of this embodiment, referring to fig. 6, the computer device 300 shown is only an example and should not impose any limitation on the function or scope of use of the embodiments of the present invention.
As shown in fig. 6, computer device 300 is in the form of a general purpose computing device. The components of computer device 300 may include, but are not limited to: one or more processors or processing units 301, a system memory 302, and a bus 303 that couples various system components including the system memory 302 and the processing unit 301.
Bus 303 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Computer device 300 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer device 300 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 302 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM)304 and/or cache 305. The computer device 300 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 306 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 6, commonly referred to as a "hard drive"). Although not shown in FIG. 6, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 303 by one or more data media interfaces. System memory 302 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 308 having a set (at least one) of program modules 307 may be stored, for example, in system memory 302, such program modules 307 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 307 generally perform the functions and/or methodologies of the described embodiments of the invention.
The computer device 300 may also communicate with a display 310 or a plurality of external devices 309 (e.g., keyboard, pointing device, etc.), with one or more devices that enable a user to interact with the computer device 300, and/or with any devices (e.g., network card, modem, etc.) that enable the computer device 300 to communicate with one or more other computing devices. Such communication may occur through input/output (I/O) interfaces 311. Further, the computer device 300 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network such as the Internet) via the network adapter 312. As shown in FIG. 6, the network adapter 312 communicates with the other modules of the computer device 300 via the bus 303.
The processing unit 301 executes various functional applications and data processing by running programs stored in the system memory 302, for example implementing the method for generating a detection model for multi-data training provided by the embodiment of the present invention, which includes the following main steps: inputting a pre-training data set into a base network as matrix data; outputting, by the base network, a plurality of target bounding-box coordinates and target labels; obtaining the loss of the base network according to the output coordinates and target labels of the target bounding boxes, and completing model pre-training of the base network to form a pre-training detection network model; inputting a task data set into the pre-training detection network model; and adjusting the pre-training detection network model according to the task data set to generate the detection model for multi-data training.
Embodiment 4. The storage medium containing computer-executable instructions of this embodiment stores a computer program which, when executed by a processor, implements the method for generating a detection model for multi-data training provided by the embodiment of the present invention, including the following main steps: inputting a pre-training data set into a base network as matrix data; outputting, by the base network, a plurality of target bounding-box coordinates and target labels; obtaining the loss of the base network according to the output coordinates and target labels of the target bounding boxes, and completing model pre-training of the base network to form a pre-training detection network model; inputting a task data set into the pre-training detection network model; and adjusting the pre-training detection network model according to the task data set to generate the detection model for multi-data training.
The storage media containing computer-executable instructions for the present embodiments may take the form of any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present embodiment, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing. Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (14)

1. A method for generating a detection model for multi-data training, characterized by comprising the following steps:
inputting a pre-training data set into a basic network in the form of matrix data;
outputting, by the basic network, coordinates of a plurality of target bounding boxes and target labels;
obtaining the loss of the basic network from the output target bounding box coordinates and target labels, completing model pre-training of the basic network, and forming a pre-training detection network model;
inputting a task data set into the pre-training detection network model;
and adjusting the pre-training detection network model according to the task data set to generate a multi-data training detection model.
2. The method for generating a detection model for multi-data training as claimed in claim 1, wherein the step of inputting the pre-training data set into the basic network in the form of matrix data comprises:
performing differential sampling on the samples in a first classification data set and a first detection data set, applying a low sampling probability to classes with many samples and a high sampling probability to classes with few samples, and merging the sampled data with a second detection data set into the pre-training data set; or,
each time a training mini-batch is generated, dynamically sampling the first classification data set and the first detection data set with a dynamic sampling probability, and merging the sampled data with the second detection data set into the pre-training data set.
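By way of illustration only (not part of the claims), one plausible realisation of the differential sampling above is inverse-frequency sampling, where each class receives a probability inversely proportional to its sample count; the sketch below is plain Python with hypothetical names:

```python
# Illustrative differential sampling: classes with many samples get a low
# sampling probability, classes with few samples a high one.
# class_counts is a hypothetical {class_id: num_samples} mapping.
def sampling_probabilities(class_counts, smoothing=1.0):
    inv = {c: 1.0 / (n + smoothing) for c, n in class_counts.items()}
    total = sum(inv.values())
    return {c: w / total for c, w in inv.items()}  # probabilities sum to 1

probs = sampling_probabilities({"cat": 10000, "rare_bird": 50})
# probs["rare_bird"] is far larger than probs["cat"]
```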
3. The method for generating a detection model for multi-data training as claimed in claim 2, wherein the sampled data of the first detection data set and the data of the second detection data set are merged into the pre-training data set after label mapping.
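Again for illustration only, label mapping of the kind recited in claim 3 might align the category names of the detection sets to a unified label space before merging; all names and the annotation format below are hypothetical:

```python
# Hypothetical label mapping: rename detection-set categories into a
# unified pre-training label space before merging the data sets.
LABEL_MAP = {"automobile": "car", "kitty": "cat"}   # source -> unified label

def map_labels(annotations, label_map=LABEL_MAP):
    """Return annotations with labels renamed into the unified space."""
    return [{**ann, "label": label_map.get(ann["label"], ann["label"])}
            for ann in annotations]

boxes = [{"bbox": [10, 20, 50, 60], "label": "automobile"}]
print(map_labels(boxes))  # [{'bbox': [10, 20, 50, 60], 'label': 'car'}]
```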
4. The method for generating a detection model for multi-data training as defined in claim 2, wherein the loss is composed of a bounding box classification loss and a bounding box regression loss:

$$L(p, p^{*}, t, t^{*}) = L_{cls}(p, p^{*}) + \lambda[p^{*} \geq 1]\,L_{reg}(t, t^{*}) \qquad \text{(formula 1)}$$

$$\mathrm{smooth}_{L_{1}}(x) = \begin{cases} 0.5x^{2}, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases} \qquad \text{(formula 2)}$$

$$L_{cls}(p, p^{*}) = -\log p_{p^{*}} \qquad \text{(formula 3)}$$

$$L_{reg}(t, t^{*}) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_{1}}(t_{i} - t_{i}^{*}) \qquad \text{(formula 4)}$$

wherein p denotes the classification probability of the target bounding box output by the model; p* denotes the true classification of the target bounding box; t = (t_x, t_y, t_w, t_h) denotes the four coordinates of the target bounding box output by the model, and t* denotes the true values of the four coordinates of the target bounding box; λ[·] is an indicator function whose value is 1 when the condition inside the brackets is satisfied and 0 otherwise; smooth_{L1} denotes the smoothing function, with x as its argument; L_{cls} denotes the bounding box classification loss calculated by the bounding box classification loss function, and L_{reg} denotes the bounding box regression loss calculated by the bounding box regression loss function; t_i, i ∈ {x, y, w, h}, denotes the four coordinates of the output target bounding box, and t_i* denotes the corresponding true values; formulas 1 to 4 all calculate the loss of a single target bounding box;
wherein the pre-training data set comprises classification-set pictures from the first classification data set and detection-set pictures from the first detection data set and the second detection data set; for a classification-set picture in the pre-training data set, the bounding box classification loss in the loss is calculated using formula 3; for a detection-set picture in the pre-training data set, the bounding box classification loss and the bounding box regression loss are calculated using formula 3 and formula 4 together; and the bounding box classification loss of the classification-set pictures, the bounding box classification loss of the detection-set pictures, and the bounding box regression loss of the detection-set pictures are added to obtain the loss.
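A minimal PyTorch sketch of formulas 1 to 4 for a single target bounding box follows, under the assumption that p is a vector of class probabilities, p_star a true class index, and t, t_star four-dimensional coordinate tensors; this is illustrative only, not the patent's reference code:

```python
# Illustrative sketch of equations 1-4 (single target bounding box).
import torch

def smooth_l1(x):                       # formula 2: smoothing function
    absx = x.abs()
    return torch.where(absx < 1, 0.5 * x ** 2, absx - 0.5)

def cls_loss(p, p_star):                # formula 3: -log p_{p*}
    return -torch.log(p[p_star])

def reg_loss(t, t_star):                # formula 4: summed smooth-L1 terms
    return smooth_l1(t - t_star).sum()

def total_loss(p, p_star, t, t_star, lam=1.0):   # formula 1
    indicator = 1.0 if p_star >= 1 else 0.0      # [p* >= 1]
    return cls_loss(p, p_star) + lam * indicator * reg_loss(t, t_star)
```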
5. The method for generating a detection model for multi-data training as claimed in claim 2, wherein the pre-training data set comprises a first classification data set, a first detection data set and a second detection data set; when a classification-set picture is used, the loss consists only of the bounding box classification loss and no bounding box regression loss is calculated, the bounding box classification loss being calculated as a multi-label loss using formula 5 below:

$$L_{cls}(p, p^{*}) = -\sum_{i} \left( p_{i}^{*} \log p_{i} + (1 - p_{i}^{*}) \log(1 - p_{i}) \right) \qquad \text{(formula 5)}$$

wherein L_{cls} denotes the bounding box classification loss obtained from the bounding box classification loss function, i denotes the corresponding index, p denotes the classification probability output by the model for the bounding box, and p* denotes the true classification of the bounding box.
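For illustration, formula 5 corresponds to a summed binary cross-entropy; a short PyTorch sketch, assuming p and p_star are tensors of per-class probabilities and 0/1 labels:

```python
# Illustrative sketch of the multi-label classification loss of formula 5.
import torch

def multilabel_cls_loss(p, p_star, eps=1e-7):
    p = p.clamp(eps, 1 - eps)  # numerical safety; an implementation choice
    return -(p_star * torch.log(p)
             + (1 - p_star) * torch.log(1 - p)).sum()
```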
6. The method for generating a detection model for multi-data training as defined in claim 2, wherein the basic network comprises a region proposal network and a network detection head.
7. The method for generating a detection model for multi-data training as claimed in claim 6, wherein, when training with the classification-set pictures in the first classification data set, the classification-set pictures in the first classification data set are input as matrix data into the region proposal network of the basic network, and the training input of the region proposal network is preset to the central area of each classification-set picture.
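As an illustration of presetting the training input to the central area, one might synthesise a central pseudo bounding box for each classification-set picture; the 0.5 size ratio below is an assumed value, not taken from the claims:

```python
# Hypothetical sketch: build a central pseudo box (x1, y1, x2, y2) for a
# classification-set picture of size (w, h); the 0.5 ratio is an assumption.
def central_pseudo_box(w, h, ratio=0.5):
    bw, bh = w * ratio, h * ratio
    cx, cy = w / 2, h / 2
    return (cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2)

print(central_pseudo_box(640, 480))  # (160.0, 120.0, 480.0, 360.0)
```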
8. The method for generating a detection model for multi-data training as claimed in claim 6, wherein the basic network is a Faster RCNN network or a Cascade RCNN network.
9. The method for generating a detection model for multi-data training as claimed in claim 2, wherein each detection-set picture in the first detection data set and the second detection data set carries at least one bounding box coordinate and a bounding box class label, and each picture in the first classification data set carries at least one image category label.
10. The method for generating a detection model for multi-data training as claimed in claim 6, wherein adjusting the pre-training detection network model according to the task data set comprises:
adjusting the region proposal network in the pre-training detection network model, specifically: selecting the L target bounding boxes having the largest IoU with the true values, and selecting, from the first M bounding boxes ranked by score, the N target bounding boxes whose IoU with the true values is the next highest, where L, N and M are preset constants; and calculating the loss of the region proposal network by formula 6 below so as to select and output target bounding boxes:
$$L(\{p_{i}\}, \{t_{i}\}) = \frac{1}{N_{cls}} \sum_{i} L_{cls}(p_{i}, p_{i}^{*}) + \frac{1}{N_{reg}} \sum_{i} p_{i}^{*}\, L_{reg}(t_{i}, t_{i}^{*}) \qquad \text{(formula 6)}$$

wherein p_i denotes the classification probability of the bounding box output by the model; p_i* denotes the true classification of the bounding box; L_{cls} denotes the bounding box classification loss calculated by the bounding box classification loss function, and L_{reg} denotes the bounding box regression loss calculated by the bounding box regression loss function; N_{cls} denotes the number of bounding boxes used to calculate the classification loss, and N_{reg} denotes the number of bounding boxes used to calculate the bounding box regression loss; t_i denotes the coordinates of the bounding box output by the model, and t_i* denotes the true coordinates of the bounding box.
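A PyTorch sketch of formula 6 follows, assuming p and p_star are length-N tensors of predicted objectness probabilities and 0/1 true labels, and t, t_star are (N, 4) coordinate tensors; illustrative only:

```python
# Illustrative sketch of the region proposal network loss of formula 6.
import torch

def rpn_loss(p, p_star, t, t_star):
    n_cls = p.numel()
    n_reg = max(int(p_star.sum().item()), 1)      # number of positive boxes
    eps = 1e-7
    cls = -(p_star * torch.log(p + eps)
            + (1 - p_star) * torch.log(1 - p + eps)).sum() / n_cls
    diff = (t - t_star).abs()
    smooth = torch.where(diff < 1, 0.5 * diff ** 2, diff - 0.5).sum(dim=1)
    reg = (p_star * smooth).sum() / n_reg         # only positive boxes count
    return cls + reg
```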
11. The method for generating a detection model for multi-data training as claimed in claim 10, wherein adjusting the pre-training detection network model according to the task data set further comprises:
adjusting the network detection head in the pre-training detection network model, specifically: learning the probability distribution of the target bounding boxes selected and output by the region proposal network, modelling the uncertainty of the target bounding box coordinates as the Gaussian distribution of formula 7 below, expressing the true value of the target bounding box as the Dirac distribution of formula 8 below, and finally obtaining the loss by fitting through formulas 9 and 10, thereby completing the adjustment of the network detection head:
$$P_{\Theta}(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\, e^{-\frac{(x - x_{e})^{2}}{2\sigma^{2}}} \qquad \text{(formula 7)}$$

$$P_{D}(x) = \delta(x - x_{g}) \qquad \text{(formula 8)}$$

$$L_{reg} = D_{KL}\left(P_{D}(x)\,\|\,P_{\Theta}(x)\right) = \frac{(x_{g} - x_{e})^{2}}{2\sigma^{2}} + \frac{1}{2}\log(\sigma^{2}) + \frac{1}{2}\log(2\pi) + H(P_{D}(x)) \qquad \text{(formula 9)}$$

$$L_{reg} \propto \frac{(x_{g} - x_{e})^{2}}{2\sigma^{2}} + \frac{1}{2}\log(\sigma^{2}) \qquad \text{(formula 10)}$$

wherein P_Θ(x) denotes the Gaussian distribution, x denotes the function argument, x_e denotes the mean of the Gaussian distribution, σ denotes the standard deviation of the Gaussian distribution, e denotes the natural constant, P_D(x) denotes the Dirac distribution, x_g denotes the centre value of the Dirac distribution, H(P_D(x)) denotes the entropy of the Dirac distribution, L_{reg} denotes the bounding box regression loss calculated by the bounding box regression loss function, and D_{KL} denotes the KL divergence fit.
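For illustration, the fitted loss of formula 10 can be written compactly in terms of a predicted log-variance alpha = log(sigma^2); the sketch below is an assumed formulation, not the patent's reference code:

```python
# Illustrative uncertainty-aware regression loss fitted from formulas 7-10:
# x_e is the predicted coordinate, alpha = log(sigma^2) the predicted
# log-variance, and x_g the ground-truth coordinate.
import torch

def kl_reg_loss(x_e, alpha, x_g):
    # L_reg ∝ (x_g - x_e)^2 / (2 sigma^2) + 0.5 * log(sigma^2)
    return torch.exp(-alpha) * 0.5 * (x_g - x_e) ** 2 + 0.5 * alpha
```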
12. A system for generating a detection model for multi-data training, comprising:
a first input unit for inputting a pre-training data set into a basic network in the form of matrix data;
an output unit for causing the basic network to output coordinates of a plurality of target bounding boxes and target labels;
a pre-training detection network model unit for obtaining the loss of the basic network from the output target bounding box coordinates and target labels, completing model pre-training of the basic network, and forming a pre-training detection network model;
a second input unit for inputting a task data set into the pre-training detection network model; and
a generating unit for adjusting the pre-training detection network model according to the task data set to generate a multi-data-trained detection model.
13. A computer device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method for generating a detection model for multi-data training of any one of claims 1-11.
14. A storage medium containing computer-executable instructions which, when executed by a computer processor, perform the method for generating a detection model for multi-data training of any one of claims 1-11.
CN202210465516.8A 2022-04-29 2022-04-29 Method, system, device and storage medium for generating detection model for multi-data training Pending CN114708462A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210465516.8A CN114708462A (en) 2022-04-29 2022-04-29 Method, system, device and storage medium for generating detection model for multi-data training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210465516.8A CN114708462A (en) 2022-04-29 2022-04-29 Method, system, device and storage medium for generating detection model for multi-data training

Publications (1)

Publication Number Publication Date
CN114708462A true CN114708462A (en) 2022-07-05

Family

ID=82176973

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210465516.8A Pending CN114708462A (en) 2022-04-29 2022-04-29 Method, system, device and storage medium for generating detection model for multi-data training

Country Status (1)

Country Link
CN (1) CN114708462A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024012179A1 (en) * 2022-07-15 2024-01-18 马上消费金融股份有限公司 Model training method, target detection method and apparatuses

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination