CN116246128B - Training method and device of detection model crossing data sets and electronic equipment

Publication number: CN116246128B (granted); other version: CN116246128A
Application number: CN202310221072.8A
Authority: CN (China); other language: Chinese (zh)
Inventors: 王鹏, 韩冰冰, 刘加美
Applicant / assignee: Shenzhen Ruiming Pixel Technology Co., Ltd.
Legal status: Active (granted)

Classifications

    • G06V10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N3/084: Backpropagation, e.g. using gradient descent
    • G06V10/40: Extraction of image or video features
    • G06V10/806: Fusion, at feature extraction level, of extracted features
    • G06V10/82: Image or video recognition or understanding using neural networks
    • G06V20/58: Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; recognition of traffic objects
    • G06V20/582: Recognition of traffic signs
    • G06V20/584: Recognition of vehicle lights or traffic lights
    • Y02T10/40: Engine management systems

Abstract

The application discloses a training method for a detection model across data sets, a training device for the detection model across data sets, an electronic device and a computer storage medium. The training method comprises the following steps: a plurality of detection modules are arranged in one detection model to be trained, so that feature extraction and feature fusion can be carried out on training samples from different data sets at the same time to obtain a fusion feature map of each training sample; for each fusion feature map, each detection module predicts the fusion feature map according to its corresponding detection task, outputs the corresponding prediction feature map and matches positive and negative samples for it; the detection model to be trained is optimized by combining the training loss values of the plurality of detection modules, and its model parameters are continuously updated while training is not yet complete, until training is finished. The method can reduce the data labeling cost of the detection model training set.

Description

Training method and device of detection model crossing data sets and electronic equipment
Technical Field
The present application relates to the field of image processing, and in particular, to a training method for a detection model across data sets, a training device for a detection model, an electronic device, and a computer readable storage medium.
Background
In the related art, when training a detection model that performs multiple detection tasks across data sets, labels for every detection task must be added to an unlabeled data set, and different data sets that each carry labels for only a single detection task must be cross-labeled with the other detection tasks, which results in heavy labeling work. Obviously, heavy labeling work increases the training cost of the detection model.
Disclosure of Invention
The application provides a training method of a cross-data set detection model, a training device of the cross-data set detection model, electronic equipment and a computer readable storage medium, which can reduce the training cost of the detection model.
In a first aspect, the present application provides a method for training a detection model across a dataset, comprising:
extracting features and fusing features of each training sample in the training set through a detection model to be trained to obtain a fused feature map of each training sample; the training set is determined by at least 2 data sets without cross labels, and each data set is labeled with a real labeling frame corresponding to a detection task; the detection model to be trained comprises at least 2 detection modules, and each detection module corresponds to one detection task;
Predicting the fusion feature map through each detection module aiming at each fusion feature map to obtain a prediction feature map corresponding to each detection module;
for each detection module, matching positive samples and negative samples for each associated real labeling frame based on the prior frames in each corresponding prediction feature map, wherein the associated real labeling frames are matched with detection tasks executed by the detection modules;
for each detection module, determining a training loss value of the detection module based on the positive sample and the negative sample matched with each corresponding associated real annotation frame;
and under the condition that the evaluation index of the detection model to be trained does not meet the preset condition and the iteration times are smaller than the preset times, updating the model parameters of the detection model to be trained by combining the training loss values of each detection module, and returning to execute the steps of feature extraction and feature fusion on each training sample in the training set through the detection model to be trained until the evaluation index meets the preset condition or the iteration times reach the preset times, so as to obtain the detection model to be trained after the training is completed.
In a second aspect, the present application provides a training apparatus for a detection model across a dataset, comprising:
The extraction module is used for carrying out feature extraction and feature fusion on each training sample in the training set through the detection model to be trained to obtain a fusion feature map of each training sample; the training set is determined by at least 2 data sets without cross labels, and each data set is labeled with a real labeling frame corresponding to a detection task; the detection model to be trained comprises at least 2 detection modules, and each detection module corresponds to one detection task;
the prediction module is used for predicting the fusion feature images through each detection module aiming at each fusion feature image to obtain a prediction feature image corresponding to each detection module;
the matching module is used for matching the positive sample and the negative sample for each associated real labeling frame based on the prior frame in each corresponding prediction feature map aiming at each detection module, and the associated real labeling frame is matched with the detection task executed by the detection module;
the first determining module is used for determining a training loss value of each detecting module based on the positive sample and the negative sample matched with each corresponding associated real annotation frame;
and the updating module is used for updating the model parameters of the detection model to be trained by combining the training loss value of each detection module under the condition that the evaluation index of the detection model to be trained does not meet the preset condition and the iteration times are smaller than the preset times, and triggering the execution of the extraction module until the evaluation index meets the preset condition or the iteration times reach the preset times, so as to obtain the detection model to be trained after the training is completed.
In a third aspect, the present application provides an electronic device comprising a memory, a processor and a computer program stored in said memory and executable on said processor, said processor implementing the steps of the method according to the first aspect when said computer program is executed.
In a fourth aspect, the present application provides a computer readable storage medium storing a computer program which, when executed by a processor, performs the steps of the method of the first aspect.
In a fifth aspect, the present application provides a computer program product comprising a computer program which, when executed by one or more processors, implements the steps of the method of the first aspect described above.
Compared with the prior art, the application has the beneficial effects that:
for a detection model to be trained which needs to perform at least 2 detection tasks, the detection model comprises at least 2 detection modules, each detection module corresponding to one detection task; the training set used in the training process can be determined from at least 2 data sets without cross labels. Among these at least 2 data sets without cross labels, each data set is labeled with only the real labeling frames corresponding to one detection task.
Specifically, the training process of the detection model to be trained includes: carrying out feature extraction and feature fusion on each training sample in the training set with the detection model to be trained to obtain a fusion feature map of each training sample. For each detection module, the real annotation frames matched with its corresponding detection task can be recorded as associated real annotation frames; for each fusion feature map, a corresponding prediction feature map can be obtained through prediction by each detection module; for each prediction feature map corresponding to each detection module, positive and negative samples can be matched for each associated real annotation frame based on the prior frames in the prediction feature map. After all the associated real labeling frames are matched with positive and negative samples, the training loss value of each detection module can be determined separately; when the evaluation index does not meet the preset condition and the iteration count is smaller than the preset count, the model to be trained still needs training; at this time, in order to obtain a detection model with higher robustness, the model parameters of the model to be trained can be updated by combining the training loss values of all detection modules; the training set can then be processed again based on the updated model parameters until the evaluation index meets the preset condition or the iteration count reaches the preset count, and a trained detection model is obtained.
According to the training method, a detection module of a model to be trained is additionally arranged, so that one detection module corresponds to one detection task; and model parameters are updated by combining training loss values of all detection modules, and the labeling data of training samples labeled with the detection tasks can be migrated to detect training samples not labeled with the detection tasks, so that cross labeling work among more than 2 different training samples is omitted, and the training cost of the detection model is reduced.
It will be appreciated that the advantages of the second to fifth aspects may be found in the relevant description of the first aspect, and are not described here again.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments or the description of the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic structural diagram of a detection model to be trained according to an embodiment of the present application;
Fig. 2 is a schematic structural diagram of an SPPFCSPC structure provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of a single detection module according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a segmentation module according to an embodiment of the present application;
FIG. 5 is a flow chart of a training method of a detection model according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a training device for a detection model according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
In the related art, for a detection model to perform at least two detection tasks, the heavy labeling work causes a problem of high cost in training the detection model.
In order to solve the problem, the application provides a training method of a detection model crossing a data set, which can reduce the labeling workload and reduce the training cost of the detection model. The training method proposed by the application will be described by specific examples.
The training method of the cross-data set detection model provided by the embodiment of the application can be applied to mobile phones, tablet computers, vehicle-mounted equipment, augmented reality (augmented reality, AR)/Virtual Reality (VR) equipment, notebook computers, ultra-mobile personal computer (UMPC), netbooks, personal digital assistants (personal digital assistant, PDA) and other electronic equipment, and the embodiment of the application does not limit the specific types of the electronic equipment.
Referring to fig. 1, fig. 1 shows the model structure of a detection model to be trained. The detection model to be trained can be considered to consist mainly of an encoder and a decoder, with the encoder connected to the decoder. The encoder comprises the Backbone network and the Neck network in the figure; the neck network is composed of a scale transformation network and a feature fusion network. The connection relation of the networks in the encoder is as follows: the backbone network is connected with the feature fusion network through the scale transformation network. The decoder is a detection network comprising at least 2 detection modules; the feature fusion network is connected with the detection network.
The detection model to be trained is obtained by improving the structure of the target detection model YOLOv5 in two aspects, wherein the two aspects of improvement are specifically as follows:
improvement of the first aspect: the spatial pyramid pooling (spatial pyramid pooling, SPP) structure of YOLOv5 was replaced with the SPPFCSPC structure. Referring to fig. 2, fig. 2 shows a schematic structural diagram of the SPPFCSPC structure. Compared with the parallel calculation of the maximum pooling layers with the sizes of 5, 9 and 13 of the three cores adopted in the SPP structure, the SPPFCSPC structure can achieve smaller parameter quantity, is more beneficial to information fusion and further improves the calculation speed under the condition of keeping the receptive field unchanged.
Improvement of the second aspect: the structure of the detection module is decoupled. The number of detection modules corresponds to the number of detection tasks, that is, if there are several detection tasks, the same number of detection modules is arranged. For example, with two detection tasks, the detection module is a doubly decoupled detection module. For the structure of a single detection module, reference may be made to fig. 3.
In some embodiments, if the detection model is further required to perform a segmentation task, a decoder may be added that shares the encoder with the existing decoder. The added decoder is a segmentation module capable of real-time, effective segmentation. A schematic of the architecture of the segmentation module is shown in fig. 4; it includes a cross-stage partial network (Cross Stage Partial Network, CSPNet). The cross-stage partial network can reduce the amount of calculation, increase feature fusion, and enhance the variability of the features learned in different layers. Furthermore, the CSPNet is connected with a C3SPP structure; the C3SPP structure combines the ideas of CSP and SPP, and during calculation it can enlarge the receptive field and enhance the feature learning capability of the network. The C3SPP structure is connected with an up-sampling module to obtain the final segmentation result.
In order to illustrate the training method of the detection model across the data set proposed by the present application, various embodiments will be described below.
FIG. 5 shows a schematic flow chart of a training method of a detection model across a dataset, the training method of the detection model comprising:
and 510, carrying out feature extraction and feature fusion on each training sample in the training set through the detection model to be trained to obtain a fusion feature map of each training sample.
The training set may be determined from at least 2 data sets without cross-labels. The data set without cross marking refers to a data set marked with a real marking frame corresponding to only one detection task. The training set is provided with a plurality of training samples, and for each training sample, the training sample can be subjected to feature extraction through a backbone network of a detection model to be trained, and then the extracted features are subjected to feature fusion through a neck network, so that a fusion feature map of the training sample is obtained.
Specifically, the detection model to be trained includes at least 2 detection modules. The number of detection modules corresponds to the number of detection tasks, that is, for several detection tasks the same number of detection modules can be arranged. Each detection module corresponds to one detection task, i.e. the detection modules correspond one-to-one with the detection tasks.
Step 520, predicting the fused feature map through each detection module according to each fused feature map, so as to obtain a predicted feature map corresponding to each detection module.
And predicting the fusion feature map through each detection module aiming at each obtained fusion feature map to obtain a prediction feature map corresponding to each detection module. For example only, assume that there are 3 fused feature maps, 2 detection modules. For each fusion feature map, the fusion feature map can be respectively predicted through 2 detection modules to obtain a prediction feature map corresponding to each detection module, and when the three fusion feature maps are all predicted, 6 prediction feature maps can be obtained in total.
Specifically, the process of predicting each fusion feature map by each detection module may include: firstly, carrying out convolution operation on the fusion feature map by using a convolution layer with the convolution kernel size of 1, and reducing the number of feature channels to 256; then using 2 parallel convolution branches to respectively execute classification operation and regression operation on the fusion feature map after the feature channel number is reduced, wherein each convolution branch comprises 3 convolution layers with the convolution kernel size of 3; and finally, obtaining a corresponding prediction characteristic diagram.
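By way of example only, the following sketch follows the prediction process just described (a 1x1 convolution reducing the channels to 256, then two parallel branches of three 3x3 convolutions for classification and regression); the number of classes, the number of prior frames per grid and the output packing are assumptions.

```python
import torch
import torch.nn as nn

class DecoupledDetectHead(nn.Module):
    """Sketch of one decoupled detection module as described above."""
    def __init__(self, c_in, num_classes, num_priors=3):
        super().__init__()
        def conv(ci, co, k):
            return nn.Sequential(nn.Conv2d(ci, co, k, padding=k // 2, bias=False),
                                 nn.BatchNorm2d(co), nn.SiLU())
        self.stem = conv(c_in, 256, 1)                       # 1x1 conv -> 256 channels
        self.cls_branch = nn.Sequential(*[conv(256, 256, 3) for _ in range(3)])
        self.reg_branch = nn.Sequential(*[conv(256, 256, 3) for _ in range(3)])
        self.cls_pred = nn.Conv2d(256, num_priors * num_classes, 1)
        self.reg_pred = nn.Conv2d(256, num_priors * 5, 1)    # box (4) + objectness (1)

    def forward(self, fused_feat):
        x = self.stem(fused_feat)
        cls_map = self.cls_pred(self.cls_branch(x))
        reg_map = self.reg_pred(self.reg_branch(x))
        return torch.cat([reg_map, cls_map], dim=1)          # prediction feature map
```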
Step 530, for each detection module, matching the positive sample and the negative sample for each associated real labeling frame based on the prior frame in each corresponding prediction feature map.
From the above, a detection module corresponds to a detection task, and a real labeling frame matched with the detection task can be regarded as an associated real labeling frame of the detection module. After the associated real labeling frames are determined, positive samples and negative samples can be matched for each associated real labeling frame according to the prior frames in each prediction feature map corresponding to the detection module.
In the prediction feature map, the anchor frame can be controlled to generate corresponding prior frames according to a preset rule, for example by sliding the anchor frame with a fixed step.
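By way of example only, the following sketch generates prior frames by sliding anchor frames over the grid of a prediction feature map with a fixed step; the stride and anchor sizes in the usage line are assumptions.

```python
import torch

def generate_prior_boxes(feat_h, feat_w, stride, anchor_sizes):
    """Slide each anchor over the grid with a fixed step (the feature-map stride)
    to obtain prior boxes (cx, cy, w, h) in image coordinates."""
    ys, xs = torch.meshgrid(torch.arange(feat_h), torch.arange(feat_w), indexing="ij")
    centers = torch.stack([(xs + 0.5) * stride, (ys + 0.5) * stride], dim=-1)  # (H, W, 2)
    priors = []
    for (aw, ah) in anchor_sizes:                 # one prior per anchor per grid cell
        wh = torch.tensor([aw, ah], dtype=centers.dtype).expand(feat_h, feat_w, 2)
        priors.append(torch.cat([centers, wh], dim=-1))
    return torch.stack(priors, dim=2).reshape(-1, 4)          # (H*W*A, 4)

# e.g. an 80x80 prediction map with stride 8 and three assumed anchor sizes
boxes = generate_prior_boxes(80, 80, 8, [(10, 13), (16, 30), (33, 23)])
```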
Step 540, for each detection module, determining a training loss value of the detection module based on the positive sample and the negative sample matched by each corresponding associated real annotation frame.
After all of the associated real annotation boxes match positive and negative samples, a training loss value for each detection module may be determined based on the positive and negative samples. Specifically, the detection module is designed with a corresponding loss function. For each detection module, the relevant data of the positive sample and the negative sample matched with each associated real annotation frame corresponding to the detection module can be substituted into the loss function, and the corresponding training loss value is calculated.
Step 550, judging whether the evaluation index meets the preset condition or whether the iteration number reaches the preset number.
The detection model to be trained can end training in two situations. In the first case, the detection model to be trained has converged: its evaluation index already meets the corresponding requirement, so the currently obtained detection model to be trained can be regarded as the target detection model. In the second case, the detection model to be trained has iterated the preset number of times and has not yet converged: more iterations may not allow the evaluation index to meet the corresponding requirement, and the model needs to be adjusted, the training strategy changed, or the training samples adjusted, so that the detection model to be trained can converge quickly within a limited number of iterations.
From the above, whether the detection model to be trained can end training can be determined by two conditions. The first condition: whether the evaluation index meets the corresponding preset condition. The second condition: whether the current iteration count has reached the preset count. Obviously, in terms of training effect, the two conditions have a priority order, i.e. the first condition takes priority over the second. That is, after each round of training it is first determined whether the evaluation index satisfies the preset condition, and only when the determination result is negative is it further determined whether the current iteration count has reached the preset count.
Step 560, updating model parameters of the to-be-trained detection model by combining the training loss value of each detection module under the condition that the evaluation index of the to-be-trained detection model does not meet the preset condition and the iteration times are smaller than the preset times, and returning to execute step 510 and the subsequent steps.
After determining the training loss value of each detection module, determining whether to stop training based on the evaluation index of the detection model to be trained and the current iteration number, and obtaining the detection model to be trained after training is completed. It can be understood that the training process of the model is a cyclic process, and when the evaluation index does not meet the preset condition and the iteration number is smaller than the preset number, the current detection model to be trained is not converged yet and training cannot be stopped; at this time, in order to obtain a detection model with higher robustness, the model parameters of the detection model to be trained may be updated in combination with the training loss value of each detection module, and training may be performed again based on the updated model parameters, that is, the execution step 510 and the subsequent steps are returned.
Specifically, updating the model parameters of the detection model to be trained in combination with the training loss value of each detection module includes: accumulating the training loss values of all detection modules to obtain a comprehensive training loss value; computing the backpropagation gradient of the detection model to be trained based on the comprehensive training loss value; and updating the model parameters according to the gradient.
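By way of example only, one training iteration of this update rule may be sketched as follows; the model and per-module loss callables are placeholders and the optimizer choice is an assumption.

```python
import torch

def training_step(model, batch, optimizer, detection_module_losses):
    """One parameter update: sum the per-module training losses into a
    comprehensive loss, back-propagate it, and update the parameters.
    `detection_module_losses` is an assumed list of callables, one per module."""
    fused_feats = model(batch["images"])                # feature extraction + fusion
    total_loss = sum(loss_fn(fused_feats, batch) for loss_fn in detection_module_losses)
    optimizer.zero_grad()
    total_loss.backward()                               # backpropagation gradient
    optimizer.step()                                    # update model parameters
    return total_loss.detach()
```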
The evaluation index may be one, or a combination of 2 or more, of overall mean average precision (mAP), accuracy, precision, recall, average precision and intersection-over-union. Different evaluation indexes correspond to different preset conditions. For example, when the evaluation index is mAP, the corresponding preset condition may be an mAP threshold, which may be determined empirically; when the evaluation index is accuracy, the corresponding preset condition may be an accuracy threshold. It is understood that when the evaluation index includes 2 or more sub-evaluation indexes, the evaluation index is determined to satisfy the corresponding preset condition only if each sub-evaluation index satisfies its corresponding preset sub-condition.
And 570, obtaining a training-completed detection model to be trained when the evaluation index meets a preset condition or the iteration number reaches a preset number.
As the detection model to be trained continues to iterate, if the evaluation index meets the preset condition or the iteration count reaches the preset count, the model has finished training, and the trained detection model can be obtained.
It will be appreciated that the training that ends in different situations, the corresponding resulting test models to be trained are different. Under the condition that the evaluation index meets the preset condition, the robustness of the obtained detection model to be trained is higher; under the condition that the iteration times reach the preset times, the robustness of the obtained detection model to be trained is low because the detection model to be trained is not converged yet.
According to the embodiment of the application, a plurality of detection modules are arranged in the same detection model to be trained, so that feature extraction and feature fusion can be carried out on training samples from different data sets at the same time, and a fusion feature map of each training sample is obtained; for each fusion feature map, each detection module can predict the fusion feature map according to the corresponding detection task, output the corresponding prediction feature map and match positive and negative samples for the fusion feature map; optimizing the detection model to be trained by combining training loss values of the plurality of detection modules, and continuously updating model parameters of the detection model to be trained under the condition that the detection model to be trained does not complete training until the detection model to be trained is trained.
According to the training method, a detection module of a model to be trained is additionally arranged, so that one detection module corresponds to one detection task; and model parameters are updated by combining training loss values of all detection modules, and the labeling data of training samples labeled with the detection tasks can be migrated to detect training samples not labeled with the detection tasks, so that cross labeling work among more than 2 different training samples is omitted, and the training cost of the detection model is reduced.
In some embodiments, the training set may be determined by:
and A1, carrying out format conversion on each data in the data set based on a specified format, and deleting the data with failed format conversion in the data set to obtain a cleaned data set.
The training set is determined from at least 2 data sets without cross-labels. In order to facilitate the subsequent processing of the training set, the data format of each data in the data set may be unified first, i.e. the format of each training sample in the obtained training set is ensured to be unified. Specifically, the data format to be converted, i.e., the designated format, may be determined first; then, each data in the data set can be subjected to format conversion, namely, the format of each data in the data set is tried to be converted into a specified format; and finally, the data with the format conversion failure can be cleaned, namely, the data with the format conversion failure is deleted from the data set, and the cleaned data set is obtained.
For example only, the specified format may be a text format.
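By way of example only, step A1 may be sketched as follows, where the source annotations are assumed to be JSON files and the specified format is assumed to be YOLO-style text lines; data that fails conversion is dropped from the cleaned data set.

```python
import json
from pathlib import Path

def clean_dataset(annotation_files):
    """Sketch of step A1: convert every annotation to a unified text format
    (assumed here to be 'class cx cy w h' lines) and drop files that fail."""
    cleaned = []
    for path in annotation_files:
        try:
            record = json.loads(Path(path).read_text())          # assumed source format
            lines = [f'{obj["class_id"]} {obj["cx"]} {obj["cy"]} {obj["w"]} {obj["h"]}'
                     for obj in record["objects"]]
            out_path = Path(path).with_suffix(".txt")
            out_path.write_text("\n".join(lines))
            cleaned.append(out_path)
        except (json.JSONDecodeError, KeyError, OSError):
            continue                                             # conversion failed: drop it
    return cleaned
```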
And A2, performing data enhancement operation on each data in the cleaned data set to obtain a preprocessed data set.
Meanwhile, in order to improve the generalization ability of the detection model, data enhancement operations may be performed on the data. The data enhancement operations may include scaling, cropping, rearrangement, rotation, stitching and the like, and for each data enhancement operation its operation attribute may be set, for example randomly or according to a preset rule. For example, for the scaling operation, the operation attribute may be a random scale or a predetermined scale. The data set obtained after preprocessing with the data enhancement operations has more complex data features; a training set determined from such a data set contains training samples of corresponding complexity, which helps avoid overfitting of the detection model to be trained and enhances its generalization ability.
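By way of example only, a data enhancement operation with configurable operation attributes may be sketched as follows; the scale range and flip probability are assumptions, and boxes are assumed to be a NumPy array of absolute (x1, y1, x2, y2) coordinates.

```python
import random
import numpy as np

def augment(image, boxes, scale_range=(0.5, 1.5), flip_prob=0.5):
    """Sketch of step A2: random scaling and horizontal flipping; the operation
    attributes (scale range, flip probability) are configurable."""
    h, w = image.shape[:2]
    s = random.uniform(*scale_range)                      # random scaling
    new_h, new_w = int(h * s), int(w * s)
    idx_y = (np.arange(new_h) / s).astype(int).clip(0, h - 1)
    idx_x = (np.arange(new_w) / s).astype(int).clip(0, w - 1)
    image = image[idx_y][:, idx_x]                        # nearest-neighbour resize
    boxes = boxes * s
    if random.random() < flip_prob:                       # random horizontal flip
        image = image[:, ::-1].copy()
        boxes[:, [0, 2]] = new_w - boxes[:, [2, 0]]
    return image, boxes
```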
And A3, determining a training set from the preprocessed data set according to a preset proportion.
For the preprocessed data set, the training set may be determined therefrom according to a preset ratio. After determining the training set, the data of the non-training set in the data set may also be used as the verification set. After training of the detection model is finished, the final model parameters can be determined by using the verification set, so that the detection model with higher robustness is obtained.
In the embodiment of the application, cleaning the data in the data set improves the uniformity of the data formats in the training set, and preprocessing the data in the cleaned data set increases the complexity of the data features. On this basis, each training sample in the training set determined from the preprocessed data set has a uniform format and strong feature complexity.
In some embodiments, after step A2, further comprising: and determining an anchor frame applicable to each data in the preprocessed data set based on a clustering algorithm.
The prior frame in the prediction feature map is obtained by sliding the anchor frame in the fusion feature map. Thus, to obtain a priori frames, an anchor frame that is adapted to the individual data of the data set may be determined. Specifically, the preprocessed data set may be input into a clustering algorithm to determine the size of the anchor frame.
The clustering algorithm may be a partitional clustering algorithm, a density-based clustering algorithm, a hierarchical clustering algorithm, or the like. The embodiment of the application prefers the mature partitional clustering algorithm, in which the number of clusters or the cluster centers is designated in advance, and clustering is achieved by repeated iteration until points within a cluster are sufficiently close and points between clusters are sufficiently separated.
The partitioned clustering algorithm specifically comprises a K-means clustering algorithm (K-Means Clustering Algorithm) and a derivative algorithm of the K-means clustering algorithm. The K-means clustering algorithm adopted by the embodiment of the application can perform clustering analysis on the anchor frames required by the detection model to be trained so as to determine the anchor frames with the optimal size.
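By way of example only, the anchor clustering may be sketched as a K-means procedure over the labeled box sizes; using 1 - IoU as the distance is a common choice and an assumption here, as are the number of anchors and the iteration count.

```python
import numpy as np

def kmeans_anchors(wh, k=9, iters=100):
    """Sketch of anchor clustering: K-means over (width, height) pairs of the
    labeled boxes, assigning each box to the anchor with the highest IoU."""
    wh = np.asarray(wh, dtype=np.float64)                       # (N, 2) box sizes
    anchors = wh[np.random.choice(len(wh), k, replace=False)].copy()
    for _ in range(iters):
        inter = np.minimum(wh[:, None, 0], anchors[None, :, 0]) * \
                np.minimum(wh[:, None, 1], anchors[None, :, 1])
        union = (wh[:, 0] * wh[:, 1])[:, None] + anchors[:, 0] * anchors[:, 1] - inter
        assign = np.argmax(inter / union, axis=1)               # nearest anchor by IoU
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = np.median(wh[assign == j], axis=0)
    return anchors[np.argsort(anchors.prod(axis=1))]            # sorted by area
```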
In some embodiments, some targets to be predicted are small, and if targets of different sizes are detected at only one scale, the prediction accuracy of the detection module is reduced. Therefore, in order to improve the prediction accuracy of the detection module, the fusion feature map may include a fusion feature map at each preset scale, where one preset scale corresponds to one class of targets to be predicted of similar size. Specifically, the fusion feature maps may be determined by:
For each training sample:
and 511, extracting features of the training samples based on the backbone network to obtain a basic feature map.
For each training sample, the training sample can firstly pass through a backbone network to realize feature extraction, so as to obtain a basic feature map. In particular, the backbone network may be CSP-DarkNet. Meanwhile, in order to acquire receptive field information with different scales, an SPPFCSPC structure is introduced, and compared with an SPP structure of YOLOv5, the computation speed of the model can be improved under the condition that the receptive field is kept unchanged, and the fusion of subsequent information is facilitated.
And step 512, performing scale conversion and feature fusion on the basic feature map based on the neck network to obtain a fused feature map corresponding to each preset scale.
After the basic feature map is obtained, in order to generate fusion feature maps under different preset scales, a neck network can be adopted to conduct scale conversion operation on the basic feature map to obtain feature maps corresponding to each preset scale, and then feature fusion is conducted on each feature map to obtain fusion feature maps corresponding to each preset scale.
In particular, the neck network includes a scaling network and a feature fusion network. The scale conversion network can scale-convert the basic feature map to obtain feature maps under each preset scale.
The feature fusion network performs feature fusion on the feature maps to obtain fusion feature maps at each preset scale. The feature fusion network is a feature pyramid network (FPN); the FPN uses the information of every layer of the convolutional neural network (CNN) to generate the final combination of expressive features. For the features of the convolutional neural network at a preset scale, the feature information of the corresponding dimension can be output and processed; in addition, the features generated by the top-down processing are fused, that is, the upper-layer features influence the expression of the lower-layer features.
For example only, assume that there are 3 preset dimensions, 20x20, 40x40, and 80x80, respectively. Firstly, determining a feature map with a preset scale of 20x20, wherein the feature map is a first fusion feature map as the network level corresponding to the preset scale is the top layer; processing the first fusion feature map to obtain first semantic information; fusing the first semantic information with a feature map corresponding to a preset scale of 40x40 to generate a second fused feature map with the preset scale of 40x 40; processing the second fusion feature map to obtain second semantic information; and fusing the second semantic information with the feature map corresponding to the preset scale 80x80, so as to generate a third fused feature map with the preset scale 80x80.
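By way of example only, the top-down fusion in this example may be sketched as follows; the channel widths are assumptions, and element-wise addition is used to merge the upsampled semantic information (concatenation is an equally common choice).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFusion(nn.Module):
    """Sketch of the top-down fusion above: the 20x20 map is the first fused map;
    its semantic information is upsampled and merged into the 40x40 and 80x80 maps."""
    def __init__(self, channels=(1024, 512, 256)):
        super().__init__()
        c20, c40, c80 = channels
        self.reduce_20 = nn.Conv2d(c20, c40, 1)    # semantic information from 20x20
        self.reduce_40 = nn.Conv2d(c40, c80, 1)    # semantic information from 40x40
        self.smooth_40 = nn.Conv2d(c40, c40, 3, padding=1)
        self.smooth_80 = nn.Conv2d(c80, c80, 3, padding=1)

    def forward(self, f20, f40, f80):
        fused_20 = f20                                                    # first fused map
        up = F.interpolate(self.reduce_20(fused_20), scale_factor=2, mode="nearest")
        fused_40 = self.smooth_40(f40 + up)                               # second fused map
        up = F.interpolate(self.reduce_40(fused_40), scale_factor=2, mode="nearest")
        fused_80 = self.smooth_80(f80 + up)                               # third fused map
        return fused_20, fused_40, fused_80
```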
In the embodiment of the application, the feature extraction and feature fusion are carried out on the training sample, so that the fusion feature images under different preset scales can be obtained, and the detection module can accurately predict the target to be predicted under different preset scales, thereby accelerating the convergence of the detection model to be trained.
In some embodiments, at each preset scale, the foregoing step 530 specifically includes:
and 531, determining a prediction frame based on the position relation between each prior frame and each associated real annotation frame according to each prediction feature map.
For each associated real annotation frame, it can be matched with each prior frame separately to determine the prediction frames. Specifically, the width ratio and the height ratio between the associated real labeling frame and each prior frame can be calculated, and the reciprocals of the two ratios taken; for prior frames whose width-ratio reciprocal and height-ratio reciprocal are both smaller than a preset reciprocal threshold, the probability that such prior frames contain the target to be predicted can be considered high.
By way of example only, assume that the preset reciprocal threshold is 4 and that there are three prior frames, prior frame 1, prior frame 2 and prior frame 3, together with one associated real annotation frame. The 1st set of aspect ratios (the width ratio and height ratio between prior frame 1 and the associated real annotation frame) is 0.2, 0.1; the 2nd set (between prior frame 2 and the associated real annotation frame) is 0.3, 0.2; the 3rd set (between prior frame 3 and the associated real annotation frame) is 0.5, 0.8. Taking the reciprocal of each set of ratios gives three sets of reciprocals: the 1st set is 5, 10; the 2nd set is approximately 3.3, 5; the 3rd set is 2, 1.25. Compared against the reciprocal threshold, only prior frame 3 satisfies the condition, i.e. the difference between prior frame 3 and the associated real labeling frame is small, so the target to be predicted corresponding to the associated real labeling frame is easier to predict.
On the basis, in order to further improve the prediction capability of the detection module, the size of the screened prior frame can be adjusted, so that the size of the prior frame after the size adjustment is the same as the size of the associated real labeling frame, and the accuracy of the detection module for predicting the target to be predicted is improved.
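By way of example only, the shape-based screening of prior frames described in step 531 may be sketched as follows; boxes are assumed to be in (cx, cy, w, h) form and the reciprocal threshold of 4 is taken from the example above.

```python
import torch

def match_priors_by_shape(gt_box, priors, ratio_threshold=4.0):
    """Keep prior frames whose width- and height-ratio reciprocals relative to the
    associated real labeling frame are both below the threshold (see step 531)."""
    gt_wh = gt_box[2:4]                           # (w, h) of the real labeled box
    ratio = gt_wh / priors[:, 2:4]                # per-prior width and height ratios
    recip = 1.0 / ratio
    keep = (recip < ratio_threshold).all(dim=1)   # both reciprocals under the threshold
    return priors[keep]

# e.g. the three priors from the example above, against one labeled box of size (10, 8)
priors = torch.tensor([[0., 0., 50., 80.], [0., 0., 33., 40.], [0., 0., 20., 10.]])
gt = torch.tensor([0., 0., 10., 8.])
print(match_priors_by_shape(gt, priors))          # only the third prior survives
```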
Step 532, for each prediction box: mapping the prediction frame into a prediction feature map, and determining a corresponding target grid from grids in the prediction feature map based on the central point of the prediction frame; and determining a positive sample corresponding to each associated real annotation frame based on the center point and the target grid.
To expand the number of positive samples, each prediction frame may be mapped into the prediction feature map. A grid is preset in the prediction feature map, so a target grid can be determined according to the position of the center point of the prediction frame within the grid; the positive sample corresponding to the associated real annotation frame can then be determined from the center point and the target grid.
And 533, determining a negative sample corresponding to each associated real annotation frame based on the positive samples.
For each associated real annotation frame, after the positive samples are determined, all prior frames in the corresponding prediction feature map other than those determined to be positive samples can be regarded as negative samples.
In some embodiments, the target grid may include a first target grid and a second target grid, the first target grid being a grid where a center point of the prediction frame is located, and the second target grid being a grid adjacent to the first target grid and having a distance from the center point less than a preset distance threshold.
That is, the first target grid may be determined from the center point of the prediction frame, and the second target grids satisfying the condition may then be determined among the four grids adjacent to the first target grid. For example, among the four adjacent grids, those whose distance from the center point is smaller than 0.5 grid unit may be determined as second target grids.
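By way of example only, the selection of the first and second target grids may be sketched as follows; the 0.5-cell offset follows the example above.

```python
def select_target_grids(center_xy, stride, offset=0.5):
    """The first target grid is the cell containing the prediction-box center;
    adjacent cells whose shared border lies within `offset` grid units of the
    center are selected as second target grids."""
    cx, cy = center_xy[0] / stride, center_xy[1] / stride    # center in grid units
    gx, gy = int(cx), int(cy)
    grids = [(gx, gy)]                                       # first target grid
    fx, fy = cx - gx, cy - gy                                # fractional position in the cell
    if fx < offset:
        grids.append((gx - 1, gy))                           # left neighbour
    if fx > 1 - offset:
        grids.append((gx + 1, gy))                           # right neighbour
    if fy < offset:
        grids.append((gx, gy - 1))                           # upper neighbour
    if fy > 1 - offset:
        grids.append((gx, gy + 1))                           # lower neighbour
    return grids
```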
In some embodiments, the determination of positive samples may be determined by:
for each associated real annotation box:
and B1, matching at least 2 first candidate samples for the associated real annotation frame based on the prior frame with the center point falling into the target grid.
After the target grid is determined, determining which prior frames have center points falling into the target grid, and screening prior frames based on the condition; for the prior frame obtained by screening, the prior frame can be determined to be a first candidate sample of the corresponding associated real annotation frame. Wherein the first candidate sample is at least 2.
And step B2, calculating the total loss of the samples based on the first candidate samples.
The first candidate samples may be regarded as the samples initially matched for the associated real annotation frame. Therefore, the total loss of these samples during the matching process can be calculated. Specifically, the total sample loss consists of the regression loss and the classification loss, i.e. it is the sum of the two loss values.
And B3, determining the cross-over ratio between the associated real annotation frame and each corresponding first candidate sample.
The larger the intersection ratio between the associated real annotation frame and the first candidate sample is, the greater the possibility that the first candidate sample is a positive sample is; conversely, the smaller the overlap ratio, the less likely the first candidate sample is to be a positive sample. Based on the method, the cross-over ratio between the associated real annotation frame and each first candidate sample can be determined, so that positive samples corresponding to the associated real annotation frame can be accurately determined later.
And B4, determining a second candidate sample from the first candidate samples based on the cross-over ratio.
As previously described, the magnitude of the intersection-over-union reflects the likelihood that a first candidate sample is a positive sample. Accordingly, the first candidate samples that are most likely to be positive samples can be selected, according to the intersection-over-union, as second candidate samples, so that positive samples can be determined more accurately later.
In some embodiments, the step of determining the second candidate sample comprises:
and step B41, sorting the first candidate samples in a descending order based on the cross comparison.
Each first candidate sample corresponds to a cross-over ratio, and the larger the cross-over ratio is, the greater the possibility of the first candidate sample being a positive sample is; thus, the first candidate samples may be sorted in a descending order according to the cross-over ratio, i.e. the cross-over ratio of each sorted first candidate sample becomes gradually smaller.
And step B42, determining the first k ordered candidate samples as second candidate samples corresponding to the associated real annotation frames.
After the first candidate samples are ordered, the top k ordered first candidate samples may be determined to be second candidate samples. Where k is determined by the total loss of samples calculated as described above. The greater the total loss of samples, the less likely it is that positive samples can be determined in the first candidate samples, and k can be set smaller; if the total loss of samples is small, indicating that the likelihood of being determined to be a positive sample in the first candidate sample is large, k may be set to be larger.
And B5, de-duplicating the second candidate sample based on the total loss to obtain a positive sample corresponding to the associated real annotation frame.
In order to improve the quality of the positive samples and obtain finer positive samples, the second candidate samples can be de-duplicated according to the total loss, finally yielding the positive samples corresponding to the associated real annotation frame.
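By way of example only, steps B2 to B5 for a single associated real labeling frame may be sketched as follows; the rule that maps the total sample loss to k, and the de-duplication by per-sample loss, are assumptions consistent with the description above.

```python
import torch

def box_iou(gt, priors):
    """IoU between one box and a set of boxes, all in (x1, y1, x2, y2) form."""
    lt = torch.maximum(gt[:2], priors[:, :2])
    rb = torch.minimum(gt[2:], priors[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area_g = (gt[2:] - gt[:2]).prod()
    area_p = (priors[:, 2:] - priors[:, :2]).prod(dim=1)
    return inter / (area_g + area_p - inter)

def assign_positive_samples(gt_box, first_cand, first_cand_loss, max_k=10):
    """Sketch of steps B2-B5: `first_cand` are the first candidate samples (B1),
    `first_cand_loss` their per-sample total (regression + classification) loss."""
    total_loss = first_cand_loss.sum()                           # B2: total sample loss
    ious = box_iou(gt_box, first_cand)                           # B3: IoU per first candidate
    order = torch.argsort(ious, descending=True)                 # B41: descending sort by IoU
    k = max(1, min(max_k, int(max_k / (1.0 + total_loss.item()))))  # larger loss -> smaller k
    second_cand = order[:k]                                      # B42: second candidate samples
    # B5: order the survivors by their own total loss so a prior claimed by several
    # labeling frames can be kept only where it costs least (de-duplication)
    return second_cand[torch.argsort(first_cand_loss[second_cand])]
```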
In some embodiments, in order to enable the detection model to be trained to converge faster, the matched positive and negative samples may be used to calculate the training loss value of each detection module.
Specifically, when computing the training loss value of a single detection module, its loss function may take the following form:

$$L(t_p, t_{gt}) = \sum_{k=0}^{K}\left[\alpha_{box}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{k,i,j}^{obj}\,L_{box} + \alpha_{obj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\lambda_k\,L_{obj} + \alpha_{cls}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{k,i,j}^{obj}\,L_{cls}\right]$$

where t_p is the prediction vector and t_gt is the ground-truth vector; K is the number of output feature maps, S^2 is the number of grids, and B is the number of prior frames corresponding to each grid. α_* denotes the weight of the corresponding term, with α_box = 0.05, α_cls = 0.3 and α_obj = 0.7. λ_k balances the fusion feature maps output at each preset scale, taking the values [4.0, 1.0, 0.4] for the output feature sizes 80x80, 40x40 and 20x20 respectively. 1^{obj}_{k,i,j} indicates whether, in the k-th output fusion feature map and the i-th grid, the j-th prior frame is a positive sample: its value is 1 if so and 0 otherwise. L_box is the complete intersection-over-union regression loss (CIoU Loss), which considers the distance, overlap, scale and aspect-ratio similarity between the prediction frame and the associated real labeling frame. L_obj and L_cls are binary cross-entropy losses (BCE Loss), representing the prediction confidence loss and the classification loss respectively.
During training, the training loss values of all the detection modules are accumulated to obtain comprehensive training loss values, model parameters of the detection model with training are optimized based on the comprehensive training loss values, the priori knowledge corresponding to the detection modules can be migrated, and the training samples which do not contain the priori knowledge are predicted, so that cross labeling work among more than 2 different training samples is omitted, and the training cost of the detection model is reduced.
In some embodiments, when the detection model to be trained can also perform a segmentation task, a dynamic loss function L_dynamic of the detection model to be trained can be calculated. The dynamic loss function improves the robustness and stability of the detection model to be trained and effectively alleviates the gradient explosion problem caused by multiple tasks and multiple data sets.
Specifically, L_dynamic combines the individual task losses with a dynamic weight: λ represents the weight of the dynamic loss function; assuming there are two detection tasks, L_detection1 is the loss function of the training samples corresponding to the first detection task, L_detection2 is the loss function of the training samples corresponding to the second detection task, L_segment is the loss function of the segmentation task, and mAP_j is the average precision (AP) over all classes in the j-th training round.
Specifically, the segmentation loss function is the cross-entropy loss:

$$L_{segment} = -\frac{1}{N}\sum_{n=1}^{N}\sum_{c=1}^{C} y_{c}\,\log(p_{c})$$

where N is the training batch size, C is the number of segmentation classes, y_c is the ground-truth value for each segmentation class, and p_c is the predicted probability for each segmentation class.
In some embodiments, when updating the model parameters of the detection model to be trained, L_dynamic can be used to compute the backpropagation gradient of the detection model to be trained, and the model parameters are updated based on this gradient.
In some embodiments, the detection model to be trained may be a panoramic perception model of a vehicle, and the perception model completes two detection tasks simultaneously: scene detection and traffic sign detection. Scene detection is mainly used for detecting pedestrians, vehicles and traffic lights.
Correspondingly, before training the perception model, a training set can be determined from 2 data sets without cross labels: the first data set is labeled with the real scene detection frames corresponding to scene detection, and the second data set is labeled with the real sign detection frames of 19 types of traffic signs. After the data processing of the preceding steps, the corresponding training set can be obtained.
The perceptual model comprises two detection modules, assumed to be detection module 1 and detection module 2. The detection module 1 is used for detecting scenes, and the detection module 2 is used for detecting traffic signs.
For each training sample, after being processed by a backbone network and a neck network, feature extraction and feature fusion can be realized, and a fusion feature map under three scales is obtained.
For the detection module 1, the prediction of scene identification can be respectively carried out on the feature fusion graphs under three scales to obtain three scene prediction feature graphs; i.e. a scene prediction feature map for each scale. Each scene prediction feature map comprises a plurality of prior frames with sizes matched with the current scale.
For the detection module 2, the prediction of the traffic sign board can be carried out on the feature fusion map under three scales respectively to obtain three sign board prediction feature maps; namely, under each scale, a sign board prediction feature map is corresponding, and each sign board prediction feature map also comprises a plurality of prior frames with the sizes matched with the current scale.
For the detection module 1, positive and negative samples can be matched for each scene real annotation frame according to the prior frame in each scene prediction feature map, and a first loss value under scene detection can be determined according to the matching condition of the positive and negative samples of each scene real annotation frame.
Similarly, for the detection module 2, positive and negative samples can be matched for each real marking frame of each signboard according to each signboard prediction feature map, and a second loss value under traffic signboard detection can be determined according to the matching condition of the positive and negative samples of each real marking frame of each signboard.
Finally, the first loss value and the second loss value can be accumulated to obtain a comprehensive loss value; the backpropagation gradient of the perception model is computed from the comprehensive loss value, and the model parameters are updated according to the gradient. With the updated model parameters, the detection module 1 can more accurately predict training samples that carry no real scene annotation frames, and the detection module 2 can more accurately predict training samples that carry no real sign annotation frames. Through continued iteration of the perception model, once the evaluation index of the perception model meets the corresponding condition, the perception model can be considered trained; however, if the number of iterations reaches the preset threshold and the evaluation index still fails to meet the requirement, training can be stopped first and the training strategy adjusted, so that the perception model can converge quickly in a new round of iterations.
It will be appreciated that if the perception model is also required to perform a segmentation task, a segmentation module may be included, and either the first data set or the second data set may contain segmentation labels for lane lines and zebra crossings. Correspondingly, the segmentation module can process the fused feature map at a designated scale (the scale carrying the richest semantic information) to obtain a corresponding segmentation prediction map. A third loss value for the segmentation module can then be determined from the predicted segmentation labels in the segmentation prediction map and the segmentation labels annotated in the data set; when the model parameters of the perception model are optimized, the perception model can be jointly optimized based on the loss values of the two detection modules and the loss value of the segmentation module, so that each module of the optimized perception model yields more accurate predictions when performing the three tasks.
Corresponding to the training method of the detection model across data sets in the above embodiment, fig. 6 shows a block diagram of the training device 6 of the detection model across data sets provided in the embodiment of the present application, and for convenience of explanation, only the portion relevant to the embodiment of the present application is shown.
Referring to fig. 6, the training device 6 for the detection model includes:
The extraction module 61 is configured to perform feature extraction and feature fusion on each training sample in the training set through the detection model to be trained, so as to obtain a fused feature map of each training sample; the training set is determined from at least 2 data sets without cross labels, and each data set is labeled with real annotation frames corresponding to one detection task; the detection model to be trained comprises at least 2 detection modules, and each detection module corresponds to one detection task;
the prediction module 62 is configured to, for each fused feature map, perform prediction on the fused feature map through each detection module, so as to obtain the prediction feature map corresponding to each detection module;
the matching module 63 is configured to, for each detection module, match positive samples and negative samples for each associated real annotation frame based on the prior frames in each corresponding prediction feature map, where an associated real annotation frame is a real annotation frame that matches the detection task executed by the detection module;
the first determining module 64 is configured to determine, for each detection module, a training loss value of the detection module based on the positive samples and negative samples matched for each corresponding associated real annotation frame;
and the updating module 65 is configured to, when the evaluation index of the detection model to be trained does not meet the preset condition and the number of iterations is smaller than the preset number, update the model parameters of the detection model to be trained in combination with the training loss value of each detection module and trigger the extraction module to execute again, until the evaluation index meets the preset condition or the number of iterations reaches the preset number, thereby obtaining the trained detection model.
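By way of illustration only, the iteration control described for the updating module 65 might be organised as in the following Python sketch; the helper names train_one_epoch and evaluate, as well as the threshold values, are hypothetical placeholders rather than components defined by the present application.

```python
def train_until_done(model, train_loader, val_loader, optimizer,
                     metric_threshold=0.5, max_iterations=300):
    """Iterate until the evaluation index meets the preset condition or the
    iteration count reaches the preset number (hypothetical sketch)."""
    for iteration in range(1, max_iterations + 1):
        # extraction -> prediction -> matching -> loss -> parameter update
        train_one_epoch(model, train_loader, optimizer)
        metric = evaluate(model, val_loader)   # e.g. mAP over all detection tasks
        if metric >= metric_threshold:         # evaluation index meets the preset condition
            break
    return model
```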
It will be appreciated that the execution of the prediction module 62, the matching module 63, the first determining module 64 and the updating module 65 depends, directly or indirectly, on the execution of the extraction module 61; therefore, when the extraction module 61 is triggered again, the prediction module 62, the matching module 63, the first determining module 64 and the updating module 65 are executed in sequence according to the data-processing logic, and training of the detection model to be trained is finally completed.
Optionally, the training device 6 may further include:
the cleaning module is used for performing format conversion on each piece of data in the data set based on a specified format and deleting the data whose format conversion fails from the data set, so as to obtain a cleaned data set;
the preprocessing module is used for performing data enhancement operations on each piece of data in the cleaned data set to obtain a preprocessed data set;
the second determining module is used for determining, based on a clustering algorithm, the anchor frames applicable to the data in the preprocessed data set (a clustering sketch follows this list);
and the third determining module is used for determining a training set from the preprocessed data set according to a preset proportion.
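For illustration, anchor-frame clustering of the kind performed by the second determining module is commonly implemented with k-means over the widths and heights of the real annotation frames; the sketch below is a generic example under that assumption and is not asserted to be the exact algorithm used by the present application.

```python
import numpy as np

def kmeans_anchors(box_whs: np.ndarray, k: int = 9, iters: int = 100) -> np.ndarray:
    """Cluster the (width, height) pairs of all real annotation frames into k anchor sizes.

    box_whs: array of shape (N, 2) holding the widths and heights of the
    annotation frames in the preprocessed data set.
    """
    rng = np.random.default_rng(0)
    anchors = box_whs[rng.choice(len(box_whs), size=k, replace=False)]  # initial centers
    for _ in range(iters):
        # distance = 1 - IoU, treating boxes and anchors as size-only rectangles
        inter = np.minimum(box_whs[:, None, :], anchors[None, :, :]).prod(-1)
        union = box_whs.prod(-1)[:, None] + anchors.prod(-1)[None, :] - inter
        assign = (1.0 - inter / union).argmin(axis=1)
        # move each anchor to the median size of the boxes assigned to it
        anchors = np.array([np.median(box_whs[assign == i], axis=0)
                            if np.any(assign == i) else anchors[i]
                            for i in range(k)])
    return anchors
```

The resulting k sizes would then serve as the prior frames, distributed over the preset scales (for example, three sizes per scale when three scales are used).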
Optionally, the detection model to be trained comprises a backbone network and a neck network, the backbone network being connected to the neck network; the fused feature maps include a fused feature map corresponding to each preset scale, and the extraction module 61 may include:
The extraction unit is used for, for each training sample, performing feature extraction on the training sample based on the backbone network to obtain a basic feature map;
the conversion and fusion unit is used for performing scale conversion and feature fusion on the basic feature map based on the neck network to obtain the fused feature map corresponding to each preset scale.
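A compact sketch of one possible backbone-plus-neck arrangement is given below, with an FPN-style neck producing fused maps at three preset scales; the layer choices are illustrative assumptions only and do not describe the actual network of the present application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyBackboneNeck(nn.Module):
    """Toy backbone + neck: extract features at three strides, then fuse them top-down."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, channels, 3, stride=8, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU())
        self.stage3 = nn.Sequential(nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU())
        self.fuse = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x: torch.Tensor):
        p3 = self.stage1(x)   # basic feature map, stride 8
        p4 = self.stage2(p3)  # stride 16
        p5 = self.stage3(p4)  # stride 32
        # neck: upsample the deeper maps and fuse them with the shallower ones
        f4 = self.fuse(p4 + F.interpolate(p5, size=p4.shape[-2:]))
        f3 = self.fuse(p3 + F.interpolate(f4, size=p3.shape[-2:]))
        return [f3, f4, p5]   # fused feature maps at the three preset scales
```

Each detection module would then run its own prediction head on each of the three returned maps.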
Optionally, the matching module 63 may include:
the determining unit is used for, for each prediction feature map, determining prediction frames based on the positional relationship between each prior frame and each associated real annotation frame; a prediction frame is a prior frame in which the proportion of the detection target is larger than a preset proportion threshold, the detection target corresponding to the detection task at the preset scale;
the matching unit is used for, for each prediction frame: mapping the prediction frame into the prediction feature map, and determining the corresponding target grids from the grids in the prediction feature map based on the center point of the prediction frame; determining the positive samples corresponding to each associated real annotation frame based on the center point and the target grids; and determining the negative samples corresponding to each associated real annotation frame based on the positive samples.
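As an illustration of the grid step only, a prediction-frame center can be mapped to the cell it falls in together with its nearest neighbouring cells, in line with the first and second target grids of claim 5; the function below is a sketch with an assumed distance rule, omits boundary clamping, and is not the exact rule of the present application.

```python
def target_grids(cx: float, cy: float, stride: int, dist_threshold: float = 0.5):
    """Map a prediction-frame center (cx, cy, in pixels) onto the grid of a prediction feature map.

    Returns the cell containing the center (first target grid) plus each adjacent
    cell whose border lies within dist_threshold cells of the center (second target grids).
    """
    gx, gy = cx / stride, cy / stride   # center expressed in grid units
    col, row = int(gx), int(gy)         # first target grid
    grids = [(row, col)]
    fx, fy = gx - col, gy - row         # offsets of the center inside its cell, in [0, 1)
    if fx < dist_threshold:
        grids.append((row, col - 1))    # left neighbour
    elif fx > 1 - dist_threshold:
        grids.append((row, col + 1))    # right neighbour
    if fy < dist_threshold:
        grids.append((row - 1, col))    # upper neighbour
    elif fy > 1 - dist_threshold:
        grids.append((row + 1, col))    # lower neighbour
    return grids
```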
Optionally, the matching unit may include:
the matching subunit is used for, for each associated real annotation frame, matching at least 2 first candidate samples for the associated real annotation frame based on the prior frames whose center points fall into the target grids;
the calculation subunit is used for calculating a total sample loss based on the first candidate samples;
the first determining subunit is used for determining the intersection-over-union ratio between the associated real annotation frame and each corresponding first candidate sample;
the second determining subunit is used for determining second candidate samples from the first candidate samples based on the intersection-over-union ratio;
and the de-duplication subunit is used for de-duplicating the second candidate samples based on the total loss to obtain the positive samples corresponding to the associated real annotation frame.
Optionally, the second determining subunit is specifically configured to:
sorting the first candidate samples in descending order based on the intersection-over-union ratio;
and determining the first k sorted candidate samples as the second candidate samples corresponding to the associated real annotation frame.
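A hedged sketch of this selection step (computing the intersection-over-union ratio and keeping the top k candidates in descending order) is given below; the value of k and the box format are assumptions, and the subsequent de-duplication by total loss is omitted.

```python
import numpy as np

def iou(gt_box: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    """Intersection-over-union between one real annotation frame and each candidate box (x1, y1, x2, y2)."""
    x1 = np.maximum(gt_box[0], candidates[:, 0])
    y1 = np.maximum(gt_box[1], candidates[:, 1])
    x2 = np.minimum(gt_box[2], candidates[:, 2])
    y2 = np.minimum(gt_box[3], candidates[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_gt = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    area_cand = (candidates[:, 2] - candidates[:, 0]) * (candidates[:, 3] - candidates[:, 1])
    return inter / (area_gt + area_cand - inter + 1e-9)

def select_second_candidates(gt_box: np.ndarray, first_candidates: np.ndarray, k: int = 10) -> np.ndarray:
    """Sort the first candidate samples by IoU with the annotation frame (descending) and keep the first k."""
    order = np.argsort(-iou(gt_box, first_candidates))
    return first_candidates[order[:k]]
```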
It should be noted that, since the information interaction between the above devices/units and their execution process are based on the same concept as the method embodiments of the present application, reference may be made to the method embodiment section for their specific functions and technical effects, which are not described again here.
Fig. 7 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the present application. As shown in Fig. 7, the electronic device 7 of this embodiment includes: at least one processor 70 (only one is shown in Fig. 7), a memory 71, and a computer program 72 stored in the memory 71 and executable on the at least one processor 70; when the processor 70 executes the computer program 72, the steps in any of the above embodiments of the training method of the detection model are implemented, such as steps 110 to 150 shown in Fig. 1.
The processor 70 may be a central processing unit (Central Processing Unit, CPU), and may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 71 may, in some embodiments, be an internal storage unit of the electronic device 7, such as a hard disk or an internal memory of the electronic device 7. In other embodiments, the memory 71 may also be an external storage device of the electronic device 7, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card or a flash card (Flash Card) equipped on the electronic device 7.
Further, the memory 71 may also include both an internal storage unit and an external storage device of the electronic device 7. The memory 71 is used to store an operating system, application programs, a boot loader (BootLoader), data and other programs, such as the program code of the computer program. The memory 71 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
Embodiments of the present application also provide a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
Embodiments of the present application also provide a computer program product which, when run on a mobile terminal, causes the mobile terminal to perform the steps of the method embodiments described above.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the present application may implement all or part of the flow of the methods of the above embodiments by instructing related hardware through a computer program; the computer program may be stored in a computer readable storage medium, and when executed by a processor, the computer program may implement the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer readable medium may include at least: any entity or device capable of carrying the computer program code to the camera device/electronic apparatus, a recording medium, a computer memory, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example, a USB flash drive, a removable hard disk, a magnetic disk or an optical disk.
In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts that are not described or detailed in a particular embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other manners. For example, the apparatus/network device embodiments described above are merely illustrative, e.g., the division of modules or elements described above is merely a logical functional division, and there may be additional divisions in actual implementation, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims (10)

1. A method of training a detection model across a dataset, the method comprising:
performing feature extraction and feature fusion on each training sample in a training set through a detection model to be trained to obtain a fused feature map of each training sample; the training set is determined from at least 2 data sets without cross labels, and each data set is labeled with real annotation frames corresponding to one detection task; the detection model to be trained comprises at least 2 detection modules, and each detection module corresponds to one detection task;
for each fused feature map, performing prediction on the fused feature map through each detection module to obtain a prediction feature map corresponding to each detection module;
for each detection module, matching positive samples and negative samples for each associated real annotation frame based on the prior frames in each corresponding prediction feature map, wherein an associated real annotation frame is a real annotation frame matched with the detection task executed by the detection module;
for each detection module, determining a training loss value of the detection module based on the positive samples and negative samples matched for each corresponding associated real annotation frame;
and when the evaluation index of the detection model to be trained does not meet a preset condition and the number of iterations is smaller than a preset number, updating the model parameters of the detection model to be trained in combination with the training loss value of each detection module, and returning to the step of performing feature extraction and feature fusion on each training sample in the training set through the detection model to be trained, until the evaluation index meets the preset condition or the number of iterations reaches the preset number, so as to obtain the trained detection model.
2. The training method of claim 1, further comprising, before the performing feature extraction and feature fusion on each training sample in the training set through the detection model to be trained:
performing format conversion on each piece of data in the data set based on a specified format, and deleting the data whose format conversion fails from the data set to obtain a cleaned data set;
performing data enhancement operations on each piece of data in the cleaned data set to obtain a preprocessed data set;
determining, based on a clustering algorithm, anchor frames applicable to each piece of data in the preprocessed data set;
and determining the training set from the preprocessed data set according to a preset proportion.
3. The training method of claim 1, wherein the detection model to be trained comprises a backbone network and a neck network, the backbone network being connected to the neck network; the prediction feature maps comprise a prediction feature map corresponding to each preset scale; and the performing feature extraction and feature fusion on each training sample in the training set through the detection model to be trained to obtain a fused feature map of each training sample comprises:
For each of the training samples:
performing feature extraction on the training samples based on the backbone network to obtain a basic feature map;
and performing scale conversion and feature fusion on the basic feature map based on the neck network to obtain the fusion feature map corresponding to each preset scale.
4. The training method of claim 3, wherein the matching positive samples and negative samples for each associated real annotation frame based on the prior frames in each of the prediction feature maps corresponding to each preset scale comprises:
for each of the prediction feature maps: determining prediction frames based on the positional relationship between each prior frame and each associated real annotation frame, wherein a prediction frame is a prior frame in which the proportion of the detection target is larger than a preset proportion threshold, and the detection target corresponds to the detection task at the preset scale;
for each of the prediction frames:
mapping the prediction frame into the prediction feature map, and determining a corresponding target grid from grids in the prediction feature map based on the central point of the prediction frame;
and determining positive samples and negative samples corresponding to each associated real annotation frame based on the center point and the target grid.
5. The training method of claim 4, wherein the target grid comprises a first target grid and a second target grid, the first target grid being a grid in which a center point of the prediction frame is located, the second target grid being a grid adjacent to the first target grid and having a distance from the center point less than a preset distance threshold.
6. The training method of claim 4, wherein the determining positive samples corresponding to each of the associated real annotation frames based on the center point and the target grid comprises:
for each of the associated real annotation frames:
matching at least 2 first candidate samples for the associated real annotation frame based on the prior frames whose center points fall into the target grid;
calculating a total sample loss based on the first candidate sample;
determining the intersection-over-union ratio between the associated real annotation frame and each corresponding first candidate sample;
determining second candidate samples from the first candidate samples based on the intersection-over-union ratio;
and de-duplicating the second candidate samples based on the total loss to obtain the positive samples corresponding to the associated real annotation frame.
7. The training method of claim 6, wherein the determining second candidate samples from the first candidate samples based on the intersection-over-union ratio comprises:
sorting the first candidate samples in descending order based on the intersection-over-union ratio;
and determining the first k sorted candidate samples as the second candidate samples corresponding to the associated real annotation frame.
8. A training apparatus for a detection model across a dataset, comprising:
the extraction module is used for performing feature extraction and feature fusion on each training sample in the training set through the detection model to be trained to obtain a fused feature map of each training sample; the training set is determined from at least 2 data sets without cross labels, and each data set is labeled with real annotation frames corresponding to one detection task; the detection model to be trained comprises at least 2 detection modules, and each detection module corresponds to one detection task;
the prediction module is used for, for each fused feature map, performing prediction on the fused feature map through each detection module to obtain a prediction feature map corresponding to each detection module;
the matching module is used for, for each prediction feature map, matching positive samples and negative samples for each associated real annotation frame based on the prior frames in the prediction feature map, wherein the associated real annotation frames are matched with the detection task executed by the detection module corresponding to the prediction feature map;
the first determining module is used for determining, for each detection module, a training loss value of the detection module based on the positive samples and negative samples matched for each corresponding associated real annotation frame;
and the updating module is used for, when the evaluation index of the detection model to be trained does not meet a preset condition and the number of iterations is smaller than a preset number, updating the model parameters of the detection model to be trained in combination with the training loss value of each detection module and triggering the extraction module to execute, until the evaluation index meets the preset condition or the number of iterations reaches the preset number, so as to obtain the trained detection model.
9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the training method of the detection model according to any of claims 1 to 7 when the computer program is executed.
10. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the training method of the detection model according to any one of claims 1 to 7.
CN202310221072.8A 2023-02-28 2023-02-28 Training method and device of detection model crossing data sets and electronic equipment Active CN116246128B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310221072.8A CN116246128B (en) 2023-02-28 2023-02-28 Training method and device of detection model crossing data sets and electronic equipment

Publications (2)

Publication Number Publication Date
CN116246128A CN116246128A (en) 2023-06-09
CN116246128B (en) 2023-10-27

Family

ID=86632948

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310221072.8A Active CN116246128B (en) 2023-02-28 2023-02-28 Training method and device of detection model crossing data sets and electronic equipment

Country Status (1)

Country Link
CN (1) CN116246128B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325538A (en) * 2018-09-29 2019-02-12 北京京东尚科信息技术有限公司 Object detection method, device and computer readable storage medium
CN112800906A (en) * 2021-01-19 2021-05-14 吉林大学 Improved YOLOv 3-based cross-domain target detection method for automatic driving automobile
WO2022170742A1 (en) * 2021-02-10 2022-08-18 北京优幕科技有限责任公司 Target detection method and apparatus, electronic device and storage medium
CN114998652A (en) * 2022-05-25 2022-09-02 易视腾科技股份有限公司 Hybrid training method for target detection cross data set
CN115187772A (en) * 2022-07-11 2022-10-14 上海商汤智能科技有限公司 Training method, device and equipment of target detection network and target detection method, device and equipment
CN115249304A (en) * 2022-08-05 2022-10-28 腾讯科技(深圳)有限公司 Training method and device for detecting segmentation model, electronic equipment and storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Improved YOLOv3 Model for miniature camera detection; Jiahao Huang et al.; Optics & Laser Technology; Vol. 142; full text *
Multi-target segmentation, detection and recognition method in active terahertz imaging; Xue Fei; Liang Dong; Yu Yang; Pan Jiaxing; Wu Tianpeng; Infrared (No. 02); full text *
Image multi-target segmentation algorithm based on fast region proposal network; Huang Jinchao; Journal of Shandong University (Engineering Science) (No. 04); full text *
A survey of deep learning object detection methods; Zhao Yongqiang; Rao Yuan; Dong Shipeng; Zhang Junyi; Journal of Image and Graphics (No. 04); full text *

Also Published As

Publication number Publication date
CN116246128A (en) 2023-06-09

Similar Documents

Publication Publication Date Title
CN107885764B (en) Rapid Hash vehicle retrieval method based on multitask deep learning
Mao et al. Finding every car: a traffic surveillance multi-scale vehicle object detection method
CN111797893A (en) Neural network training method, image classification system and related equipment
Huo et al. Vehicle type classification and attribute prediction using multi-task RCNN
CN110991311A (en) Target detection method based on dense connection deep network
Rahman et al. Densely-populated traffic detection using yolov5 and non-maximum suppression ensembling
Aradhya Object detection and tracking using deep learning and artificial intelligence for video surveillance applications
Xiang et al. Lightweight fully convolutional network for license plate detection
CN110008899B (en) Method for extracting and classifying candidate targets of visible light remote sensing image
CN112766170B (en) Self-adaptive segmentation detection method and device based on cluster unmanned aerial vehicle image
CN113239753A (en) Improved traffic sign detection and identification method based on YOLOv4
CN114913498A (en) Parallel multi-scale feature aggregation lane line detection method based on key point estimation
CN112738470A (en) Method for detecting parking in expressway tunnel
CN115187844A (en) Image identification method and device based on neural network model and terminal equipment
KR20230171966A (en) Image processing method and device and computer-readable storage medium
Mohd-Isa et al. Detection of Malaysian traffic signs via modified YOLOv3 algorithm
CN115170611A (en) Complex intersection vehicle driving track analysis method, system and application
Zheng et al. A deep learning–based approach for moving vehicle counting and short-term traffic prediction from video images
Zhang et al. Bus passenger flow statistics algorithm based on deep learning
CN111832463A (en) Deep learning-based traffic sign detection method
CN116246128B (en) Training method and device of detection model crossing data sets and electronic equipment
CN110555425A (en) Video stream real-time pedestrian detection method
CN115457288A (en) Multi-target tracking method and device based on aerial view angle, storage medium and equipment
Sadakatul Bari et al. Performance evaluation of convolution neural network based object detection model for Bangladeshi traffic vehicle detection
Liu et al. Research on Small Target Pedestrian Detection Algorithm Based on Improved YOLOv3

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant