CN112633340A - Target detection model training method and device, detection method and device, and storage medium - Google Patents
Target detection model training method and device, detection method and device, and storage medium
- Publication number
- CN112633340A (application CN202011475085.0A)
- Authority
- CN
- China
- Prior art keywords
- target
- detection model
- filter
- filters
- target detection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24133—Distances to prototypes
- G06F18/24137—Distances to cluster centroïds
- G06F18/2414—Smoothing the distance, e.g. radial basis function networks [RBFN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Abstract
The application relates to a target detection model training method, a target detection method, a device, and a storage medium, wherein the target detection model training method comprises the following steps: acquiring a training image, and labeling a sample target in the training image; inputting the training image into a target detection model to obtain a predicted target of the training image, wherein the target detection model comprises a backbone network, the backbone network comprises a plurality of convolutional layers, each convolutional layer comprises a plurality of filter groups, each filter group comprises a predetermined number of filters obtained by rotating and/or flipping one filter, and weights are shared among the filters of the same filter group; and training the target detection model with the objectives of minimizing the difference between the predicted target and the sample target and minimizing the cosine similarity between the filters of each filter group. Because the filters of the same group share the same parameters, the number of irrelevant filters is reduced and the number of parameters of the target detection model is effectively reduced, while the effectiveness of feature extraction and the accuracy of target detection are preserved.
Description
Technical Field
The application belongs to the technical field of target detection, and particularly relates to a target detection model training method and device, a detection method and device based on the target detection model, and a storage medium.
Background
Object detection in images is one of the four classical tasks in computer vision; unlike object recognition, it requires detecting multiple objects present in the same picture. Because of the complexity of the algorithms, a neural network model must contain a large number of trainable parameters to achieve a good detection effect, which makes the model inefficient; existing methods for reducing the number of parameters cause the detection accuracy of the neural network model to drop.
Therefore, how to reduce the number of parameters and the volume of a neural network model while ensuring its detection accuracy is an urgent problem to be solved.
Disclosure of Invention
The application provides a target detection model training method, a detection method, a device, and a storage medium, which aim to solve the technical problem of the large number of parameters in neural network models.
In order to solve the technical problem, the application adopts a technical scheme that: a target detection model training method, the method comprising: acquiring a training image, and processing the training image to label a sample target in the training image; inputting the training image into the target detection model to obtain a predicted target of the training image, wherein the target detection model comprises a backbone network, the backbone network comprises a plurality of convolutional layers, each convolutional layer comprises a plurality of filter groups, each filter group comprises a predetermined number of filters obtained by rotating and/or flipping one filter, and weights are shared among the filters of the same filter group; and training the target detection model with the objectives of minimizing the difference between the predicted target and the sample target and minimizing the cosine similarity between the filters of each filter group.
According to an embodiment of the present application, training the target detection model with the objectives of minimizing the difference between the predicted target and the sample target and minimizing the cosine similarity between the filters of each filter bank includes: training the target detection model by using a back propagation gradient algorithm to minimize a preset loss function; the preset loss function comprises the sum of a target box loss function, a classification loss function, a confidence loss function, and a filter bank loss function, the filter bank loss function being r = α′·n·Σ_{i≠j} (k_i·k_j) / tr(KK^T), where α′ is a constant, k_i is the ith filter in the filter bank, k_j is the jth filter in the filter bank, n is the predetermined number, K is a filter bank matrix, and tr(KK^T) is the trace of K multiplied by its transpose.
According to an embodiment of the present application, sharing weights among the filters of the same filter group includes: in the back propagation gradient algorithm, sharing the weights and the weight corrections among the filters of the same filter group.
According to an embodiment of the present application, the target detection model further includes a feature enhancement network and a detection head module sequentially connected to the backbone network.
According to an embodiment of the application, each filter bank comprises eight filters obtained from one filter by no rotation, rotation by 90°, 180°, and 270°, and the symmetric transformations of these four.
In order to solve the above technical problem, the present application adopts another technical solution: a target detection model-based detection method, the method comprising: acquiring a target image; inputting the target image into the target detection model to obtain a detection result of the target image; the target detection model comprises a backbone network, the backbone network comprises a plurality of convolutional layers, each convolutional layer comprises a plurality of filter groups, each filter group comprises a predetermined number of filters obtained by rotating and/or overturning one filter, and weights are shared among the filters of the same filter group.
According to an embodiment of the present application, the detection result includes a target box value of an initial target, an initial classification result of the initial target, and an initial confidence of the initial target, and the method includes: obtaining the classification index with the maximum probability in the initial classification result, and obtaining a final classification result by looking it up in an index table; acquiring the target box value of the initial target, and obtaining an initial target box by using a target box conversion method; and re-scoring the initial confidence of the initial target box to screen out a final target detection result.
According to an embodiment of the present application, the target detection model is obtained by training according to any one of the above training methods.
In order to solve the above technical problem, the present application adopts another technical solution: an electronic device comprising a memory and a processor coupled to each other, the processor being configured to execute program instructions stored in the memory to implement any of the above methods.
In order to solve the above technical problem, the present application adopts another technical solution: a computer readable storage medium having stored thereon program data which, when executed by a processor, implements any of the methods described above.
The beneficial effects of this application are: different from the prior art, each filter group of the backbone network of the target detection model comprises a predetermined number of filters obtained by rotating and/or flipping one filter, and the transformed filters of the same group share the same parameters. Similar features can therefore be extracted from multiple angles through the rotated and symmetrically transformed weights, the number of irrelevant filters is reduced, and the number of parameters of the target detection model is effectively reduced, while the effectiveness of feature extraction and the accuracy of target detection are ensured.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that those skilled in the art can obtain other drawings from them without inventive effort:
FIG. 1 is a schematic flow chart diagram illustrating one embodiment of object detection model training according to the present application;
FIG. 2 is a schematic diagram of the fourth-order dihedral group used in training the target detection model according to an embodiment of the present application;
FIG. 3 is a schematic flow chart diagram illustrating an embodiment of a target detection model-based detection method according to the present application;
FIG. 4 is a block diagram of an embodiment of an object detection model training apparatus according to the present application;
FIG. 5 is a block diagram of an embodiment of an object detection model-based detection apparatus according to the present application;
FIG. 6 is a block diagram of an embodiment of an electronic device of the present application;
FIG. 7 is a block diagram of an embodiment of a computer-readable storage medium of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1 and 2, fig. 1 is a schematic flowchart illustrating an embodiment of a target detection model training method according to the present application; FIG. 2 is a schematic diagram of the fourth-order dihedral group used in training the target detection model according to an embodiment of the present application.
An embodiment of the present application provides a target detection model training method, including the following steps:
S101: acquiring a training image, and processing the training image to label a sample target in the training image.
A training image is acquired, and the sample targets in the training image are labeled to obtain a data set. Specifically, the training image may be detected and labeled by an existing target detection model, such as a standard YOLOv4 target detection model, to obtain the data set of the training image. When the target detection model is trained with the training images, the data set is divided according to a cross-validation method, so that as much useful information as possible is obtained from the limited data and a more stable target detection model is obtained. It should be noted that the training images form an image sequence containing a sufficient number of images to train the target detection model effectively.
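For illustration only (the disclosure names cross-validation without fixing the number of folds or a library; scikit-learn and the file names below are assumptions), the data set split might be sketched as:

```python
from sklearn.model_selection import KFold

# Hypothetical file list standing in for the labeled training images.
image_paths = [f"train_{i:04d}.jpg" for i in range(1000)]

# Divide the data set by cross-validation so that every image is used for
# validation exactly once across the five folds.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kfold.split(image_paths)):
    train_set = [image_paths[i] for i in train_idx]
    val_set = [image_paths[i] for i in val_idx]
    # one training run of the detector per fold would start here
```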
S102: inputting the training image into the target detection model to obtain a predicted target of the training image.
An initial target detection model is constructed. The target detection model comprises a backbone network, the backbone network comprises a plurality of convolutional layers, and each convolutional layer comprises a plurality of filter banks. Unlike conventional convolutional neural network models, each filter bank of the backbone network constructed in the present application includes a predetermined number of filters obtained by rotating and/or flipping one filter, and weights are shared among the filters of the same filter bank. The shared weights include the weight corrections.
Through statistics, the inventors of the present application found that, during back propagation training of a convolutional neural network, filters in the same convolutional layer converge to similar weights: the weights of different filters are often symmetric to each other, or can be obtained from one another by a rotation or symmetric transformation.
Because mutually independent filters tend toward such symmetric similarity after training, each filter bank of the backbone network constructed by the method comprises a predetermined number of filters obtained by rotating and/or flipping one filter, and the transformed filters of the same group share the same parameters. Similar features can therefore be extracted from multiple angles through the rotated and symmetrically transformed weights, the number of irrelevant filters is reduced, and the number of parameters of the target detection model is effectively reduced, while the effectiveness of feature extraction and the accuracy of target detection are ensured.
When constructing a convolutional layer of the backbone network, the first randomly generated filter in each filter group is the unit filter, the other filters in the group, obtained by rotation and/or flipping, are the generator filters, and the predetermined number of filters in each group is at least two. For example, as shown in FIG. 2, according to the properties of the fourth-order dihedral group, each filter bank includes eight filters obtained from one filter by no rotation, rotation by 90°, 180°, and 270°, and the symmetric transformations of these four. Through these transformations, the filter can extract similar features in 8 different directions. The eight transformed symmetric convolution filters share the same parameters. In back propagation training, the weight corrections obtained by the eight convolution filters of each group are superimposed, and the basic parameters are corrected together.
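As an illustrative sketch (PyTorch is an assumption; the disclosure does not prescribe a framework), the eight fourth-order dihedral group variants of one unit filter can be generated as follows:

```python
import torch

def d4_transforms(base: torch.Tensor) -> list:
    """Return the eight D4 (fourth-order dihedral group) variants of a 2-D
    filter: rotations by 0/90/180/270 degrees of the filter and of its mirror."""
    mirrored = torch.flip(base, dims=[-1])            # symmetric (flip) transform
    return [torch.rot90(f, k, dims=[-2, -1])          # the four rotations
            for f in (base, mirrored) for k in range(4)]

unit_filter = torch.randn(3, 3)                       # randomly generated unit filter
bank = d4_transforms(unit_filter)                     # one bank of eight filters
assert len(bank) == 8
```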
In one embodiment, the target detection model in the present application can be constructed and initialized according to the structure of the standard YOLOv4 target detection network, including sequentially connected backbone network, feature enhancement network, and detection head modules. The backbone network uses a modified CSPDarknet53 network in which the convolutional layers are constructed with the method of the present application. The backbone of the original CSPDarknet53 comprises 5 convolution modules and 52 convolutional layers in total, with the convolutional layers of each module arranged in units of 2; the numbers of filters in the convolutional layers are 32, 64, 128, 256, 512, and 1024, and the connections between convolution modules likewise comprise 32, 64, 128, 256, 512, and 1024 convolution filters. In the convolutional layers of the present application, the number of filters in each convolutional layer is initialized to 1/8 of the original, and these filters are used to construct groups of 8 filters each according to the fourth-order dihedral group generators. After construction, the total number of filters is the same as in the backbone of the original CSPDarknet53 network, but the filters of each group share weights, share weight corrections during back propagation, and extract the features of the group's unit filter in different directions under the different generator filter transformations. (Supplementary note: the standard CSPDarknet53 is a backbone structure built on the YOLOv3 backbone network Darknet53 by drawing on the 2019 CSPNet design, and contains 5 CSP modules (cross-stage partial connection modules). The standard YOLOv4 network improves accuracy by nearly 10 points relative to YOLOv3 with almost no loss of speed, so YOLOv4 is a faster and more accurate detection model whose training can be completed on a single 1080Ti or 2080Ti.)
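Building on the sketch above, a minimal illustration of such a convolutional layer (hypothetical class and parameter names, PyTorch assumed): it stores only 1/8 of the filters as trainable parameters and expands them into their eight variants at each forward pass, so the eight filters of a group share weights by construction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class D4SharedConv2d(nn.Module):
    """Convolution whose out_ch filters are the D4 expansions of out_ch // 8
    trainable base filters, so each group of eight shares one set of weights."""
    def __init__(self, in_ch: int, out_ch: int, kernel: int = 3):
        super().__init__()
        assert out_ch % 8 == 0
        self.base = nn.Parameter(
            0.01 * torch.randn(out_ch // 8, in_ch, kernel, kernel))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mirrored = torch.flip(self.base, dims=[-1])
        variants = [torch.rot90(w, k, dims=[-2, -1])
                    for w in (self.base, mirrored) for k in range(4)]
        weight = torch.cat(variants, dim=0)            # (out_ch, in_ch, k, k)
        return F.conv2d(x, weight, padding=weight.shape[-1] // 2)

layer = D4SharedConv2d(in_ch=3, out_ch=32)             # only 4 trainable base filters
out = layer(torch.randn(1, 3, 64, 64))                 # out: (1, 32, 64, 64)
```

Because the eight variants are differentiable functions of one shared parameter, back propagation automatically superimposes their weight corrections onto the base filter, matching the behavior described above.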
After the feature detection of the backbone network, a standard feature enhancement network is used. (The standard feature enhancement architecture, based on the feature pyramid framework, strengthens the propagation of features between layers, adding a bottom-up enhancement path to reinforce low-dimensional features in the detection and high-dimensional feature extraction tasks. Using cross connections, the feature output extracted from each convolutional layer is added to the same-stage feature map of the top-down path, which is then sent to the next stage.)
After the feature enhancement network, the standard YOLOv3 detection heads are connected in sequence through convolutions. (The YOLOv3 network consists of the feature extraction network Darknet53 and the YOLOv3 detection heads; the detection heads detect the confidence, category, and position of targets on 3 feature maps of different scales, so that finer-grained features can be detected, which benefits the detection of small targets.)
Leaky ReLU is used as the activation function of the target detection model.
The training image is input into the target detection model to obtain the predicted target of the training image.
S103: training the target detection model with the objectives of minimizing the difference between the predicted target and the sample target and minimizing the cosine similarity between the filters of each filter bank.
In existing approaches, the loss function comprises the sum of a target box loss function, a classification loss function, and a confidence loss function.
The preset loss function comprises the sum of a target box loss function, a classification loss function, a confidence loss function, and a filter bank loss function: Loss = Loss_cls + Loss_conf + Loss_box + λ·r, where Loss_cls is the classification loss function, Loss_conf is the confidence loss function, Loss_box is the target box loss function, r is the filter bank loss function, and λ is a coefficient. The target detection model is optimally trained by minimizing the preset loss function Loss.
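As a trivial sketch of this composition (λ's value is not specified in the disclosure; 0.1 below is an assumed example):

```python
def preset_loss(loss_cls, loss_conf, loss_box, r, lam=0.1):
    """Loss = Loss_cls + Loss_conf + Loss_box + λ·r; lam = 0.1 stands in
    for the unspecified coefficient λ."""
    return loss_cls + loss_conf + loss_box + lam * r
```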
According to the method, the cosine similarity between the filters of each filter bank is calculated, and this cosine similarity is minimized to suppress the generation of similar filters after rotation or symmetric transformation. For a filter bank matrix K, the constraint term r is calculated as follows:

r = α · Σ_{i≠j} (k_i·k_j) / (‖k_i‖·‖k_j‖)

where α is a constant, k_i is the ith filter in the filter bank, k_j is the jth filter in the filter bank, and ‖·‖ denotes the Frobenius norm. All filters in the filter bank matrix K have equal Frobenius norms because they are rotated or flipped copies of the same filter. Thus, assuming that the filter bank contains a predetermined number n of filters, so that tr(KK^T) = n·‖k_i‖², the above equation can be converted to:

r = α′ · n · Σ_{i≠j} (k_i·k_j) / tr(KK^T)

where α′ is a constant, k_i is the ith filter in the filter bank, k_j is the jth filter in the filter bank, n is the predetermined number, K is the filter bank matrix, and tr(KK^T) is the trace of K multiplied by its transpose.
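A sketch of this constraint term (PyTorch assumed; the function name is illustrative), with the n filters of one bank flattened into the rows of K:

```python
import torch

def filter_bank_regularizer(K: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """r = alpha * n * sum_{i != j} (k_i . k_j) / tr(K K^T).
    K: (n, d) matrix whose n rows are the flattened filters of one bank;
    equal Frobenius norms let the cosine denominators reduce to tr(K K^T)/n."""
    gram = K @ K.t()                                   # pairwise dot products
    n = K.shape[0]
    off_diag = gram.sum() - torch.trace(gram)          # drop the i == j terms
    return alpha * n * off_diag / torch.trace(gram)

# Degenerate example: eight identical filters give the maximal penalty.
bank = torch.randn(1, 9).repeat(8, 1)
print(filter_bank_regularizer(bank))                   # ≈ tensor(56.)
```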
Some filters possess rotation invariance: the series of filters generated from them by rotation and symmetric transformation produce similar feature extraction results for the same input, which increases computational complexity and reduces network efficiency. The method therefore adds an anti-symmetry constraint when constructing the preset loss function. By minimizing the cosine similarity among the filters of each filter bank, the occurrence of rotation-invariant filters is effectively suppressed, the occurrence of redundant parameters is further suppressed, and the feature extraction efficiency of the algorithm is increased.
Further, the target detection model is trained using a back propagation gradient algorithm so that the preset loss function is minimized. The training images are processed in batches of size batchsize, the learning rate learnrate is initialized, the number of training periods epoch is initialized, and the target detection model is trained with a gradient descent training method.
As shown in step S102, the process of obtaining the filter banks when constructing the convolutional layers of the backbone network of the present application can be expressed as:

k_si = k_i,  K_di = F(k_i),  i = 1, 2, ..., N

where k_i is the first filter randomly generated in each filter group, i.e. the unit filter, also denoted k_si; F(·) denotes the rotation and symmetric transformations described above; K_di is the filter matrix obtained after transforming k_i with the generator filters; and k_di denotes the elements of K_di other than the unit filter. For each filter k_i, the filters k_di together with k_si form a filter bank with shared weights and corrections, and these filters are used on the same convolutional layer at the same time.
Due to weight multiplexing, the total gradient in the back propagation gradient calculation is obtained as the sum of two parts: the gradient with respect to the unit filter k_si itself, and the gradients with respect to the transformed filters k_di mapped back through the corresponding inverse transformations:

∂Loss/∂k_i = ∂Loss/∂k_si + Σ_d F_d⁻¹(∂Loss/∂k_di)
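This superposition can be checked directly when the transformed filters are built as differentiable views of the unit filter; the following PyTorch sketch (an illustration, not part of the disclosure) shows all eight corrections accumulating on the shared parameter:

```python
import torch

base = torch.randn(3, 3, requires_grad=True)           # unit filter k_i
mirrored = torch.flip(base, dims=[1])
variants = [torch.rot90(w, k, dims=[0, 1])             # the eight D4 variants
            for w in (base, mirrored) for k in range(4)]

loss = sum(v.sum() for v in variants)                  # stand-in for a real loss
loss.backward()
print(base.grad)   # every entry is 8.0: the eight corrections superimpose
```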
and repeatedly and iteratively updating the parameters of the target detection model until the training cycle number reaches epoch and then stopping training.
In back propagation gradient training, the weight corrections obtained by the predetermined number of filters of each group are superimposed, and the basic parameters of the target detection model are corrected together. This effectively reduces overfitting during model training and accelerates the training of the model parameters. It also reduces the influence on the detection result of the non-uniform distribution of features across directions and of the mismatch in feature distribution between the training set and the test set.
Each filter group of the backbone network of the target detection model comprises a predetermined number of filters obtained by rotating and/or flipping one filter, and the transformed filters of the same group share the same parameters, so similar features can be extracted from multiple angles through the rotated and symmetrically transformed weights; the number of irrelevant filters is reduced, the number of parameters of the target detection model is effectively reduced, and the effectiveness of feature extraction and the accuracy of target detection are ensured.
Referring to fig. 3, fig. 3 is a schematic flowchart illustrating an embodiment of a detection method based on a target detection model according to the present application.
Another embodiment of the present application provides a detection method based on a target detection model, including the following steps:
S201: acquiring a target image.
A target image is acquired. The target image can be a digital image, or can be obtained by preprocessing a video image: an analog or digital video stream is converted into digital images, the standard RGB image is normalized so that pixel values lie between [-1, 1], and the processed video image frame is sent into the target detection model.
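A sketch of the preprocessing step, assuming 8-bit RGB frames (the exact scaling constants are an assumption consistent with mapping [0, 255] to [-1, 1]):

```python
import numpy as np

def normalize_frame(frame_rgb: np.ndarray) -> np.ndarray:
    """Scale 8-bit RGB pixel values from [0, 255] to [-1, 1] before the
    frame is sent into the target detection model."""
    return frame_rgb.astype(np.float32) / 127.5 - 1.0
```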
S202: inputting the target image into the target detection model to obtain a detection result of the target image.
The target image is input into the target detection model to obtain the detection result of the target image. The target detection model comprises a backbone network, the backbone network comprises a plurality of convolutional layers, each convolutional layer comprises a plurality of filter groups, each filter group comprises a predetermined number of filters obtained by rotating and/or flipping one filter, and weights are shared among the filters of the same filter group. The shared weights include the weight corrections.
Because mutually independent filters tend toward symmetric similarity after training, each filter bank of the backbone network constructed by the method comprises a predetermined number of filters obtained by rotating and/or flipping one filter, and the transformed filters of the same group share the same parameters, so similar features can be extracted from multiple angles through the rotated and symmetrically transformed weights; the number of irrelevant filters is reduced, the number of parameters of the target detection model is effectively reduced, and the effectiveness of feature extraction and the accuracy of target detection are ensured.
When constructing the convolutional layers of the backbone network, the first randomly generated filter in each filter group is the unit filter, the other filters in the group, obtained by rotation and/or symmetric transformation, are the generator filters, and the predetermined number of filters in each group is at least two. For example, according to the properties of the fourth-order dihedral group, each filter bank includes eight filters obtained from one filter by no rotation, rotation by 90°, 180°, and 270°, and the symmetric transformations of these four. Through these transformations, the filter can extract similar features in 8 different directions. The eight transformed symmetric convolution filters share the same parameters. In back propagation training, the weight corrections obtained by the eight convolution filters of each group are superimposed, and the basic parameters are corrected together.
The target detection model of the present application can be obtained by training through the target detection model training method in any of the embodiments described above.
S203: screening the detection result to obtain a final target detection result.
The detection result comprises the target box of the initial target, the initial classification result of the initial target, and the initial confidence of the initial target.
Screening the detection result, wherein the step of obtaining the final target detection result comprises the following steps:
and obtaining the classification index with the maximum probability in the initial classification result, and obtaining the final classification result of the initial target by contrasting the index table.
The target box value of the initial target is acquired, and the initial target box is obtained by using a target box conversion method. Specifically, the regression values of the initial target are taken, and the result is converted using the standard YOLOv4 target box conversion to output the initial target box.
The initial confidence of each initial target box is re-scored, and standard Matrix NMS screening is used to select the high-confidence initial target boxes as the target boxes of the final targets; the final target detection result, comprising the target box, the final classification result, and the confidence of each final target, is then displayed. (Matrix NMS re-scores the confidence of each target box by computing, for that box, the maximum IoU against all same-category boxes with a higher confidence than itself, and screens the target boxes accordingly.)
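A simplified sketch of this re-scoring (PyTorch assumed; the Gaussian decay and single-class setting are simplifications of full Matrix NMS, whose exact form is not reproduced in this text):

```python
import torch

def rescore(scores: torch.Tensor, ious: torch.Tensor,
            sigma: float = 0.5) -> torch.Tensor:
    """Decay each box's confidence by its worst overlap with any box that
    scores higher. scores: (N,) sorted descending; ious: (N, N) pairwise
    IoU in the same order."""
    upper = torch.triu(ious, diagonal=1)       # IoU against higher-scored boxes
    decay_iou = upper.max(dim=0).values        # max such overlap per box
    return scores * torch.exp(-decay_iou ** 2 / sigma)
```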
By this method, video collected by a camera in real time can be initialized into a video image stream, and the video image frames can be sent into the target detection model to obtain accurate target detection results. The target detection model has a small number of parameters, while the effectiveness of feature extraction and the accuracy of the target detection result are effectively guaranteed.
Referring to fig. 4, fig. 4 is a schematic diagram of a framework of an embodiment of a target detection model training apparatus according to the present application.
The present application further provides a target detection model training apparatus 30, which includes an obtaining module 31, a network module 32, and a processing module 33, to implement the target detection model training method of the corresponding embodiments above. Specifically, the obtaining module 31 obtains a training image, and the processing module 33 processes the training image to label a sample target in the training image; the processing module 33 inputs the training image into the network module 32 to obtain a predicted target of the training image; the network module 32 includes the target detection model, which comprises a backbone network, the backbone network comprises a plurality of convolutional layers, each convolutional layer comprises a plurality of filter banks, each filter bank comprises a predetermined number of filters obtained by rotating and/or flipping one filter, and the filters of the same filter bank share weights; the processing module 33 trains the target detection model with the objectives of minimizing the difference between the predicted target and the sample target and minimizing the cosine similarity between the filters of each filter bank.
Each filter group of the backbone network of the target detection model of the training apparatus 30 includes a predetermined number of filters obtained by rotating and/or flipping one filter, and the transformed filters of the same group share the same parameters, so that similar features can be extracted from multiple angles through the rotated and symmetrically transformed weights; the number of irrelevant filters is reduced, the number of parameters of the target detection model is effectively reduced, and the effectiveness of feature extraction and the accuracy of target detection are ensured.
Referring to fig. 5, fig. 5 is a schematic diagram of a frame of an embodiment of a detection apparatus based on a target detection model according to the present application.
The present application further provides a target detection model-based detection apparatus 40, which includes an obtaining module 41, a network module 42, and a processing module 43, to implement the target detection model-based detection method of the corresponding embodiments above. Specifically, the obtaining module 41 acquires a target image; the obtaining module 41 inputs the target image into the network module 42 to obtain a detection result of the target image, and the network module 42 includes the target detection model; the target detection model comprises a backbone network, the backbone network comprises a plurality of convolutional layers, each convolutional layer comprises a plurality of filter groups, each filter group comprises a predetermined number of filters obtained by rotating and/or flipping one filter, and weights are shared among the filters of the same filter group.
The detection apparatus 40 can initialize video collected by a camera in real time into a video image stream and send the video image frames into the target detection model to obtain accurate target detection results. The target detection model has a small number of parameters, while the effectiveness of feature extraction and the accuracy of the target detection result are effectively guaranteed.
Referring to fig. 6, fig. 6 is a schematic diagram of a frame of an embodiment of an electronic device according to the present application.
Yet another embodiment of the present application provides an electronic device 50, which includes a memory 51 and a processor 52 coupled to each other, wherein the processor 52 is configured to execute program instructions stored in the memory 51 to implement the target detection model training method of any of the above embodiments and the target detection model-based detection method of any of the above embodiments. In a particular implementation scenario, the electronic device 50 may include, but is not limited to, a microcomputer or a server; the electronic device 50 may also be a mobile device such as a notebook computer or a tablet computer, which is not limited herein.
Specifically, the processor 52 is configured to control itself and the memory 51 to implement the steps of the target detection model training method of any of the above embodiments and the target detection model-based detection method of any of the above embodiments. The processor 52 may also be referred to as a CPU (Central Processing Unit). The processor 52 may be an integrated circuit chip having signal processing capabilities. The processor 52 may also be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor 52 may be jointly implemented by a plurality of integrated circuit chips.
Referring to fig. 7, fig. 7 is a block diagram illustrating an embodiment of a computer-readable storage medium according to the present application.
Yet another embodiment of the present application provides a computer-readable storage medium 60 on which program data 61 are stored; when executed by a processor, the program data 61 implement the target detection model training method of any of the above embodiments and the target detection model-based detection method of any of the above embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a module or a unit is merely one type of logical division, and an actual implementation may have another division, for example, a unit or a component may be combined or integrated with another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on network elements. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium 60. Based on such understanding, the technical solution of the present application in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a readable storage medium 60 and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned readable storage medium 60 includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings, or which are directly or indirectly applied to other related technical fields, are intended to be included within the scope of the present application.
Claims (10)
1. A method for training an object detection model, the method comprising:
acquiring a training image, and processing the training image to label a sample target in the training image;
inputting the training image into the target detection model to obtain a predicted target of the training image; the target detection model comprises a backbone network, wherein the backbone network comprises a plurality of convolutional layers, each convolutional layer comprises a plurality of filter groups, each filter group comprises a predetermined number of filters obtained by rotating and/or flipping one filter, and weights are shared among the filters of the same filter group;
and training the target detection model with the objectives of minimizing the difference between the predicted target and the sample target and minimizing the cosine similarity between the filters of each filter group.
2. The method of claim 1, wherein training the target detection model with the objectives of minimizing the difference between the predicted target and the sample target and minimizing the cosine similarity between the filters of each filter bank comprises:
training the target detection model by using a back propagation gradient algorithm to minimize a preset loss function; the preset loss function comprises the sum of a target box loss function, a classification loss function, a confidence loss function, and a filter bank loss function, the filter bank loss function being:
r = α′·n·Σ_{i≠j} (k_i·k_j) / tr(KK^T), where α′ is a constant, k_i is the ith filter in the filter bank, k_j is the jth filter in the filter bank, n is the predetermined number, K is a filter bank matrix, and tr(KK^T) is the trace of K multiplied by its transpose.
3. The method of claim 1, wherein sharing weights among the filters of the same filter bank comprises:
in the back propagation gradient algorithm, sharing the weights and the weight corrections among the filters of the same filter bank.
4. The method of claim 1, wherein the target detection model further comprises a feature enhancement network and a detection head module connected in series with the backbone network.
5. The method of claim 1, wherein each filter bank comprises eight filters obtained from one filter by no rotation, rotation by 90°, 180°, and 270°, and the symmetric transformations of these four.
6. A detection method based on an object detection model is characterized by comprising the following steps:
acquiring a target image;
inputting the target image into the target detection model to obtain a detection result of the target image; the target detection model comprises a backbone network, the backbone network comprises a plurality of convolutional layers, each convolutional layer comprises a plurality of filter groups, each filter group comprises a predetermined number of filters obtained by rotating and/or flipping one filter, and weights are shared among the filters of the same filter group.
7. The method of claim 6, wherein the detection result comprises a target box value of an initial target, an initial classification result of the initial target, and an initial confidence of the initial target, the method comprising:
obtaining the classification index with the maximum probability in the initial classification result, and obtaining a final classification result by looking it up in an index table;
acquiring the target box value of the initial target, and obtaining an initial target box by using a target box conversion method;
and re-scoring the initial confidence of the initial target box to screen out a final target detection result.
8. The method of claim 6, wherein the object detection model is trained by the training method of any one of claims 1-5.
9. An electronic device comprising a memory and a processor coupled to each other, the processor being configured to execute program instructions stored in the memory to implement the method of any of claims 1 to 8.
10. A computer-readable storage medium, on which program data are stored, which program data, when being executed by a processor, carry out the method of any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011475085.0A CN112633340B (en) | 2020-12-14 | 2020-12-14 | Target detection model training and detection method, device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011475085.0A CN112633340B (en) | 2020-12-14 | 2020-12-14 | Target detection model training and detection method, device and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112633340A true CN112633340A (en) | 2021-04-09 |
CN112633340B CN112633340B (en) | 2024-04-02 |
Family
ID=75312807
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011475085.0A Active CN112633340B (en) | 2020-12-14 | 2020-12-14 | Target detection model training and detection method, device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112633340B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112781634A (en) * | 2021-04-12 | 2021-05-11 | 南京信息工程大学 | BOTDR distributed optical fiber sensing system based on YOLOv4 convolutional neural network |
CN113378635A (en) * | 2021-05-08 | 2021-09-10 | 北京迈格威科技有限公司 | Target attribute boundary condition searching method and device of target detection model |
CN113487022A (en) * | 2021-06-17 | 2021-10-08 | 千芯半导体科技(北京)有限公司 | High-precision compression method and device suitable for hardware circuit and electronic equipment |
- 2020-12-14: Application CN202011475085.0A (CN) filed; granted as CN112633340B (status: Active)
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104811276A (en) * | 2015-05-04 | 2015-07-29 | 东南大学 | DL-CNN (deep leaning-convolutional neutral network) demodulator for super-Nyquist rate communication |
US20170344879A1 (en) * | 2016-05-31 | 2017-11-30 | Linkedln Corporation | Training a neural network using another neural network |
CN108416250A (en) * | 2017-02-10 | 2018-08-17 | 浙江宇视科技有限公司 | Demographic method and device |
KR102037484B1 (en) * | 2019-03-20 | 2019-10-28 | 주식회사 루닛 | Method for performing multi-task learning and apparatus thereof |
CN111325169A (en) * | 2020-02-26 | 2020-06-23 | 河南理工大学 | Deep video fingerprint algorithm based on capsule network |
CN111695522A (en) * | 2020-06-15 | 2020-09-22 | 重庆邮电大学 | In-plane rotation invariant face detection method and device and storage medium |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112781634A (en) * | 2021-04-12 | 2021-05-11 | 南京信息工程大学 | BOTDR distributed optical fiber sensing system based on YOLOv4 convolutional neural network |
CN113378635A (en) * | 2021-05-08 | 2021-09-10 | 北京迈格威科技有限公司 | Target attribute boundary condition searching method and device of target detection model |
CN113487022A (en) * | 2021-06-17 | 2021-10-08 | 千芯半导体科技(北京)有限公司 | High-precision compression method and device suitable for hardware circuit and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN112633340B (en) | 2024-04-02 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||