CN116630768A - Target detection method and device, electronic equipment and storage medium


Info

Publication number
CN116630768A
Authority
CN
China
Prior art keywords
target detection
convolution
size
equivalent
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310401337.2A
Other languages
Chinese (zh)
Inventor
张诚成
马子昂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Huacheng Software Technology Co Ltd
Original Assignee
Hangzhou Huacheng Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Huacheng Software Technology Co Ltd
Priority to CN202310401337.2A
Publication of CN116630768A
Legal status: Pending


Classifications

    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/048: Activation functions
    • G06N 3/08: Learning methods
    • G06V 10/806: Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V 2201/07: Target detection
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a target detection method and device, an electronic device and a storage medium. An initial target detection network built based on RepGhost Bottleneck is trained with a sample image to obtain a first target detection network, where the RepGhost Bottleneck is formed from residual-connected RepGhost Modules, and the RepGhost Module is formed by fusing structural re-parameterization into the Ghost Module. The RepGhost Module in the first target detection network is then converted into an equivalent Ghost Module based on structural re-parameterization to obtain a second target detection network, and target detection is performed on an image to be detected based on the second target detection network to obtain its target detection result. By this scheme, the accuracy and efficiency of target detection can be improved.

Description

Target detection method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a target detection method and apparatus, an electronic device, and a storage medium.
Background
With advances in technology and the growing demand for convenience, object detection is widely used in computer vision task scenarios, including home security, machine vision, autonomous driving, and the like.
At present, various neural-network-based target detection algorithms have been proposed in succession, but their huge parameter counts and complex computation make them inefficient, which limits their application scenarios; moreover, some high-performance target detection models such as RetinaFace and YOLOv5Face lose considerable accuracy when their parameter counts are reduced. In view of this, how to improve the accuracy and efficiency of target detection is a problem to be solved.
Disclosure of Invention
The application mainly solves the technical problem of providing a target detection method and device, electronic equipment and a storage medium, and can improve the accuracy and efficiency of target detection.
In order to solve the above technical problem, a first aspect of the present application provides a target detection method, including: training an initial target detection network built based on RepGhost Bottleneck by using a sample image to obtain a first target detection network, where the RepGhost Bottleneck is formed from residual-connected RepGhost Modules, and the RepGhost Module is formed by fusing structural re-parameterization into the Ghost Module; converting the RepGhost Module in the first target detection network into an equivalent Ghost Module based on structural re-parameterization to obtain a second target detection network; and performing target detection on an image to be detected based on the second target detection network to obtain a target detection result of the image to be detected.
In order to solve the above technical problem, a second aspect of the present application provides a target detection device, which includes: a network training module, configured to train an initial target detection network built based on RepGhost Bottleneck by using a sample image to obtain a first target detection network, where the RepGhost Bottleneck is formed from residual-connected RepGhost Modules, and the RepGhost Module is formed by fusing structural re-parameterization into the Ghost Module; a structure conversion module, configured to convert the RepGhost Module in the first target detection network into an equivalent Ghost Module based on structural re-parameterization to obtain a second target detection network; and a target detection module, configured to perform target detection on an image to be detected based on the second target detection network to obtain a target detection result of the image to be detected.
In order to solve the above technical problem, a third aspect of the present application provides an electronic device, which includes a memory and a processor coupled to each other, where the memory stores program instructions, and the processor is configured to execute the program instructions to implement the object detection method in the first aspect.
In order to solve the above technical problem, a fourth aspect of the present application provides a computer-readable storage medium storing program instructions executable by a processor for implementing the object detection method of the first aspect.
According to the above scheme, the initial target detection network is built based on RepGhost Bottleneck, where the RepGhost Bottleneck is formed from residual-connected RepGhost Modules and the RepGhost Module is formed by fusing structural re-parameterization into the Ghost Module; integrating RepGhost Bottleneck into the structure of the initial target detection network improves the image feature extraction capability, so the accuracy of target detection can be improved. A sample image is obtained and the initial target detection network is trained to obtain the first target detection network. The RepGhost Module in the first target detection network is then converted into an equivalent Ghost Module based on structural re-parameterization, which simplifies the structure of the target detection network, reduces the parameter count of the algorithm, and yields a second target detection network with a higher operation speed; target detection is performed on the image to be detected based on the second target detection network to obtain its target detection result. In this way, sample training is performed on the first target detection network to improve the accuracy of the target detection network, while target detection on the image to be detected is performed with the structurally simpler second target detection network, so both the accuracy and the efficiency of target detection can be improved.
Drawings
FIG. 1 is a flow chart of an embodiment of a target detection method according to the present application;
FIG. 2 is a schematic diagram illustrating the structure of an embodiment of an initial target detection network according to the present application;
FIG. 3 is a schematic framework diagram of an embodiment of an object detection device of the present application;
FIG. 4 is a schematic framework diagram of an embodiment of an electronic device of the present application;
FIG. 5 is a schematic framework diagram of an embodiment of a computer readable storage medium of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terms "system" and "network" are often used interchangeably herein. The term "and/or" herein merely describes an association relationship between associated objects, meaning that there may be three relationships; for example, A and/or B may represent: A exists alone, A and B both exist, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship. Further, "a plurality" herein means two or more.
Referring to fig. 1, fig. 1 is a flow chart of an embodiment of the target detection method of the present application.
Specifically, the method may include the steps of:
step S10: training an initial target detection network built based on RepGhost Bottleneck by using the sample image to obtain a first target detection network.
In one implementation scenario, the sample category, sample position and sample key points of the sample object are marked in the obtained sample image. The sample object can be an automobile, a house, an animal and the like appearing in the sample image; for example, when the sample object is an automobile, the corresponding sample key points can be the tires, front cover, rear cover and roof.
In one implementation scenario, after a labeled sample image is input into the initial target detection network, target detection is performed on the sample image based on the initial target detection network to obtain the predicted category, predicted position and predicted key points of the sample object in the sample image. The network parameters of the initial target detection network are then adjusted based on a first loss between the sample category and the predicted category, a second loss between the sample position and the predicted position, and a third loss between the sample key points and the predicted key points, so as to obtain the first target detection network. Adjusting the network parameters against these three losses yields a first target detection network with higher accuracy.
In a specific implementation scenario, the loss function of the first target detection network is the sum of a first loss function, a second loss function and a third loss function, where the first loss is used to adjust the network parameters through the first loss function, the second loss through the second loss function, and the third loss through the third loss function. For example, the loss function is calculated as follows:

L = L_cls(p_i, p_i*) + λ_1·L_box(t_i, t_i*) + λ_2·L_pts(l_i, l_i*)……(1)

In formula (1), L_cls is the first loss function, p_i is the sample category and p_i* is the predicted category; L_box is the second loss function and λ_1 is its coefficient, t_i is the sample position and t_i* is the predicted position; L_pts is the third loss function and λ_2 is its coefficient, l_i is the sample key point and l_i* is the predicted key point.
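As a concrete illustration (not the patent's own code), a minimal PyTorch sketch of formula (1) might look as follows; the smooth-L1 choices for the box and key-point terms and the default coefficients are illustrative placeholders, since the text only fixes the first loss (cross entropy) and the second loss (CIoU):

```python
import torch.nn.functional as F

# Minimal sketch of formula (1); box/key-point loss choices are placeholders.
def detection_loss(cls_logits, cls_target,
                   box_pred, box_target,
                   kpt_pred, kpt_target,
                   lambda1=1.0, lambda2=1.0):
    l_cls = F.cross_entropy(cls_logits, cls_target)  # first loss
    l_box = F.smooth_l1_loss(box_pred, box_target)   # second loss (CIoU in the text)
    l_kpt = F.smooth_l1_loss(kpt_pred, kpt_target)   # third loss
    return l_cls + lambda1 * l_box + lambda2 * l_kpt
```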
In a specific implementation scenario, the first loss function may use a cross entropy function. The cross entropy function measures the difference between two probability distributions: for example, if p represents the distribution of the true labels and q is the prediction of the trained network, the cross entropy loss measures how close q is to p.
In one specific implementation scenario, the second loss function may use the CIoU loss function; for example, the second loss function may be calculated as follows:

L_CIoU = 1 - IoU + ρ²(b, b^gt)/c² + α·ν……(2)

In formula (2), b represents the center point of the predicted position, b^gt is the center point of the sample position, ρ is the Euclidean distance between the two center points, c is the diagonal distance of the minimum circumscribed rectangle of the predicted position and the real position, IoU is the degree of overlap, ν is a parameter measuring aspect-ratio consistency, and α is obtained based on IoU and ν.
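A minimal PyTorch sketch of formula (2) is given below, assuming boxes in (x1, y1, x2, y2) format; it follows the standard CIoU definition rather than any patent-specific variant:

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    # Formula (2) for boxes in (x1, y1, x2, y2) format; standard CIoU definition.
    px1, py1, px2, py2 = pred.unbind(-1)
    tx1, ty1, tx2, ty2 = target.unbind(-1)
    # overlap degree IoU
    iw = (torch.min(px2, tx2) - torch.max(px1, tx1)).clamp(min=0)
    ih = (torch.min(py2, ty2) - torch.max(py1, ty1)).clamp(min=0)
    inter = iw * ih
    union = (px2 - px1) * (py2 - py1) + (tx2 - tx1) * (ty2 - ty1) - inter
    iou = inter / (union + eps)
    # squared center distance rho^2 and squared enclosing-rectangle diagonal c^2
    rho2 = ((px1 + px2 - tx1 - tx2) ** 2 + (py1 + py2 - ty1 - ty2) ** 2) / 4
    cw = torch.max(px2, tx2) - torch.min(px1, tx1)
    ch = torch.max(py2, ty2) - torch.min(py1, ty1)
    c2 = cw ** 2 + ch ** 2 + eps
    # aspect-ratio consistency term nu and its weight alpha
    nu = (4 / math.pi ** 2) * (torch.atan((tx2 - tx1) / (ty2 - ty1 + eps))
                               - torch.atan((px2 - px1) / (py2 - py1 + eps))) ** 2
    alpha = nu / (1 - iou + nu + eps)
    return 1 - iou + rho2 / c2 + alpha * nu
```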
The function types of the first, second, and third loss functions are not limited in the present application.
In one implementation scenario, RepGhost Bottleneck is formed from residual-connected RepGhost Modules, and the RepGhost Module is formed by fusing structural re-parameterization into the Ghost Module. Compared with ordinary convolution layers, RepGhost Bottleneck increases the number of channels and improves the image feature extraction capability, so the accuracy of target detection can be improved.
In a specific implementation scenario, the Ghost Module is a model-compression method that can replace each convolution layer in an existing convolutional network, generating feature maps through a series of cheap linear operations to enhance the feature extraction capability of the target detection model. Here the Ghost Module is formed from a convolution of a first size and a depth separable convolution of a second size connected by a residual, where the second size is not smaller than the first size. Depth separable convolution (depthwise separable convolution) improves on standard convolution in convolutional neural networks by splitting the correlation between the spatial and channel dimensions, reducing the number of parameters required: the computation is divided into two parts, a spatial convolution (depthwise convolution) applied to each channel separately, whose outputs are concatenated, followed by a channel convolution (pointwise convolution) with a 1×1 kernel to obtain the feature map. For example, the Ghost Module may be obtained from a 1×1 convolution and a 3×3 depth separable convolution connected by a residual.
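For illustration, a minimal PyTorch sketch of this residual Ghost Module variant (not the concat-based original Ghost Module) might look as follows; equal input and output channel counts and ReLU activations are assumptions:

```python
import torch.nn as nn

class GhostModule(nn.Module):
    # Residual variant described above: a first-size (1x1) convolution followed
    # by a second-size (3x3) depthwise convolution, joined by residual addition.
    def __init__(self, ch):
        super().__init__()
        self.pw = nn.Conv2d(ch, ch, 1)                        # first-size convolution
        self.dw = nn.Conv2d(ch, ch, 3, padding=1, groups=ch)  # second-size depthwise conv
        self.act1 = nn.ReLU(inplace=True)
        self.act2 = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.act1(self.pw(x))
        return y + self.act2(self.dw(y))  # residual connection of the two branches
```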
In one implementation scenario, the step of obtaining the RepGhost Module by fusing structural re-parameterization into the Ghost Module comprises: connecting a convolution of the first size and a batch normalization layer in parallel and then accessing a first activation function to obtain a first path; connecting a depth separable convolution of the second size in parallel with a batch normalization layer and a depth separable convolution of the first size, and then accessing a second activation function to obtain a second path; and connecting the first path and the second path through a residual, thereby obtaining the RepGhost Module. An activation function is a function running on the neurons of an artificial neural network, responsible for mapping a neuron's input to its output.
In one implementation scenario, model parameters are obtained during network training, and a set of parameters corresponds one-to-one to a structure. For example, if there is no nonlinearity between them, a fully connected layer A and a fully connected layer B can be converted into a single fully connected layer C: with the parameters of the two layers given by matrices A and B, input x and output y = B(Ax), we can construct a matrix C = BA such that y = B(Ax) = Cx; C is then the parameter of the merged fully connected layer. Parameters A and B correspond to structures A and B respectively, and parameter C corresponds to structure C. Structural re-parameterization (structural re-parameterization) means first constructing one series of structures, then equivalently converting its parameters into another set of parameters, thereby equivalently converting that series of structures into another series of structures. In practical applications, training resources are generally relatively abundant while inference cost and performance matter, so we prefer a larger training-time structure with higher accuracy or other useful properties, and a smaller inference-time structure that retains those properties. Through structural re-parameterization, the training-time structure corresponds to one set of parameters and the inference-time structure to another; as long as the former parameters can be equivalently converted into the latter, the former structure can be equivalently converted into the latter. For example, if structure A corresponds to parameters X and structure B to parameters Y, and X can be equivalently converted into Y, then structure A can be equivalently converted into B.
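As a concrete illustration of this equivalence, the following sketch fuses two bias-free fully connected layers into one (the shapes are hypothetical):

```python
import torch
import torch.nn as nn

# y = B(Ax) = (BA)x, so the fused parameter is C = BA (note the order).
A = nn.Linear(8, 16, bias=False)
B = nn.Linear(16, 4, bias=False)
C = nn.Linear(8, 4, bias=False)
with torch.no_grad():
    C.weight.copy_(B.weight @ A.weight)  # structure C carries parameter C = BA

x = torch.randn(2, 8)
assert torch.allclose(B(A(x)), C(x), atol=1e-6)  # the two structures are equivalent
```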
In a specific implementation scenario, the first path of the RepGhost Module is obtained by connecting, in parallel, a branch consisting of a convolution of the first size followed by a batch normalization layer and a separate batch normalization layer branch, then accessing the first activation function; the second path is obtained by connecting, in parallel, a branch consisting of a depth separable convolution of the second size followed by a batch normalization layer, a branch consisting of a depth separable convolution of the first size followed by a batch normalization layer, and a separate batch normalization layer branch, then accessing the second activation function. For example, the first path of the RepGhost Module comprises a 1×1 convolution branch and a batch normalization layer branch in parallel, connected to a ReLU activation function; the second path comprises, in parallel, a 3×3 depth separable convolution branch with batch normalization, a 1×1 depth separable convolution branch with batch normalization, and a batch normalization branch; the outputs of the first path and the second path are then connected through a residual to obtain the RepGhost Module. In this way, structural re-parameterization is fused into the basic structure of the Ghost Module, the RepGhost Module is integrated into the target detection network, feature fusion is performed in the weight space by fusing the parameters of each branch, and more feature maps of the sample image can be generated to expand network capacity, so the accuracy of target detection can be improved.
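A minimal training-time sketch of this RepGhost Module under the above description might look as follows; equal channel counts (so the bare batch-normalization branches can be added element-wise), ReLU for both activations, and element-wise addition as the residual are all assumptions:

```python
import torch.nn as nn

class RepGhostModule(nn.Module):
    # Training-time form. After training, each group of parallel branches is
    # folded into a single convolution (see step S20), yielding the GhostModule
    # sketched earlier.
    def __init__(self, ch):
        super().__init__()
        # first path: 1x1 convolution branch || batch-normalization branch -> ReLU
        self.pw = nn.Sequential(nn.Conv2d(ch, ch, 1, bias=False),
                                nn.BatchNorm2d(ch))
        self.pw_bn = nn.BatchNorm2d(ch)
        self.act1 = nn.ReLU(inplace=True)
        # second path: 3x3 depthwise conv + BN || 1x1 depthwise conv + BN || BN -> ReLU
        self.dw3 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1, groups=ch, bias=False),
                                 nn.BatchNorm2d(ch))
        self.dw1 = nn.Sequential(nn.Conv2d(ch, ch, 1, groups=ch, bias=False),
                                 nn.BatchNorm2d(ch))
        self.bn = nn.BatchNorm2d(ch)
        self.act2 = nn.ReLU(inplace=True)

    def forward(self, x):
        y1 = self.act1(self.pw(x) + self.pw_bn(x))
        y2 = self.act2(self.dw3(y1) + self.dw1(y1) + self.bn(y1))
        return y1 + y2  # residual connection of the first and second paths
```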
It should be noted that the type of the activation function used in the RepGhost Module is not limited in the present application.
Referring to fig. 2, fig. 2 is a schematic structural diagram of an initial target detection network according to an embodiment of the present application. As shown in fig. 2, in one implementation scenario, the initial target detection network includes a backbone network and a detection network, in a network training process, the backbone network is used for extracting a sample feature map of a sample image, the detection network is used for predicting a target detection result of the sample image based on the sample feature map extracted by the backbone network, and the backbone network is obtained by sequentially connecting a preset number of basic units, where the basic units are formed based on RepGhost Bottleneck.
In one implementation scenario, the plurality of basic units in the backbone network are sequentially connected, and feature maps of corresponding scales can be output by the basic units at different stages; the feature maps of different scales extracted by the backbone network are respectively input into the detection sub-networks of corresponding scales in the detection network, so as to obtain the target detection result.
In one implementation scenario, the initial target detection network further comprises a fusion network located between the backbone network and the detection network. The fusion network is formed based on a path aggregation network and a feature pyramid network; when the backbone network outputs multi-scale feature maps, feature fusion can be carried out between feature maps of different scales based on the fusion network, improving generalization capability and therefore the target detection accuracy of the initial target detection network. For example, the backbone network is obtained by sequentially connecting six basic units, and the third-stage, fourth-stage, fifth-stage and sixth-stage basic units output sample feature maps of different scales; the fusion network obtains the third-scale sample feature map output by the third-stage basic unit, the fourth-scale sample feature map output by the fourth-stage basic unit, the fifth-scale sample feature map output by the fifth-stage basic unit and the sixth-scale sample feature map output by the sixth-stage basic unit, and these sample feature maps are respectively input into the fusion network for feature fusion, so as to obtain fusion feature maps corresponding to the different scales.
In a specific implementation scenario, the feature pyramid network includes upsampling sub-networks corresponding to multiple scales, and the path aggregation network includes downsampling sub-networks corresponding to the same scales; upsampling and downsampling sub-networks of the same scale are connected by skip connections. The detection network includes detection sub-networks corresponding to the multiple scales, with the downsampling sub-network and detection sub-network of the same scale connected to each other; both the upsampling sub-networks and the downsampling sub-networks are formed based on RepGhost Bottleneck. The main purpose of the upsampling sub-network is to enlarge the size of the feature map and reduce its number of channels, which can be realized by methods such as bilinear interpolation and transposed convolution; the main purpose of the downsampling sub-network is to reduce the size of the feature map and increase its number of channels, which can be realized by methods such as pooling operations.
In one specific implementation scenario, the upsampling sub-network is formed based on an upsampling substructure and one RepGhost Bottleneck, and the downsampling sub-network is formed based on one RepGhost Bottleneck. For example, following the foregoing embodiment, after the fusion network obtains the third-scale, fourth-scale, fifth-scale and sixth-scale sample feature maps, feature extraction is performed on the sixth-scale sample feature map based on convolution to obtain a first sub-feature map; the first sub-feature map is upsampled, overlapped with the fifth-scale sample feature map along the channel dimension, and passed through a RepGhost Bottleneck for feature extraction to obtain a second sub-feature map; the second sub-feature map is upsampled, overlapped with the fourth-scale sample feature map along the channel dimension, and passed through a RepGhost Bottleneck to obtain a third sub-feature map; and the third sub-feature map is upsampled, overlapped with the third-scale sample feature map along the channel dimension, and passed through a RepGhost Bottleneck to obtain a first fusion feature map. Then, the first fusion feature map and the third sub-feature map are overlapped along the channel dimension and passed through a RepGhost Bottleneck for feature extraction to obtain a second fusion feature map; the second fusion feature map and the second sub-feature map are overlapped along the channel dimension and passed through a RepGhost Bottleneck to obtain a third fusion feature map; and the third fusion feature map and the first sub-feature map are overlapped along the channel dimension and passed through a RepGhost Bottleneck to obtain a fourth fusion feature map.
In one specific implementation, when the stride in RepGhost Bottleneck is 1, the RepGhost Bottleneck is formed from two RepGhost Modules connected by a residual.
In another specific implementation, when the stride in RepGhost Bottleneck is 2, two RepGhost Modules and one depth separable convolution are connected by a residual.
In different implementation scenarios, the stride in RepGhost Bottleneck is not limited in the present application, nor is the number of basic units in the backbone network.
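For illustration, a RepGhost Bottleneck under the two stride cases above might be sketched as follows, reusing the RepGhostModule sketch from earlier; the exact placement of the depth separable convolution and the form of the stride-2 shortcut are assumptions, since the patent only states which blocks are residual-connected:

```python
import torch.nn as nn

class RepGhostBottleneck(nn.Module):
    def __init__(self, ch, stride=1):
        super().__init__()
        mid = [RepGhostModule(ch)]
        if stride == 2:
            # stride-2 case: a depthwise convolution between the two
            # RepGhost Modules performs the downsampling (assumption)
            mid.append(nn.Sequential(
                nn.Conv2d(ch, ch, 3, stride=2, padding=1, groups=ch, bias=False),
                nn.BatchNorm2d(ch)))
        mid.append(RepGhostModule(ch))
        self.body = nn.Sequential(*mid)
        # stride-1: identity residual; stride-2: downsampling shortcut (assumption)
        self.shortcut = nn.Identity() if stride == 1 else nn.Sequential(
            nn.Conv2d(ch, ch, 3, stride=2, padding=1, groups=ch, bias=False),
            nn.BatchNorm2d(ch))

    def forward(self, x):
        return self.body(x) + self.shortcut(x)
```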
In an implementation scenario, when the scale of the target object varies greatly, a preset number of RepGhost Bottleneck blocks may be appended after the basic unit of a preset stage of the backbone network to deepen the backbone network; as in the previous embodiment, two RepGhost Bottleneck blocks may be appended after the sixth-stage basic unit, so as to improve the accuracy of target detection.
In one implementation scenario, the detection network acquires the sample fusion feature maps of different scales output by the fusion network, performs feature extraction on each of them based on convolution, and limits the number of channels of the feature maps to a preset channel number, so as to obtain the prediction categories, prediction positions and prediction key points corresponding to the different scales. Because the numbers of channels of the feature maps extracted by the backbone network at different stages differ, limiting the outputs to a preset channel number derived from the preset target categories, positions and key points yields feature maps with the same channel number but different sizes, from which the prediction results are obtained, improving the accuracy of the multi-scale detection network.
In another implementation scenario, the backbone network is sequentially connected by a plurality of basic units, but only outputs a feature map of one scale, and defines the number of channels of the feature map as a preset number of channels, and the detection network obtains a target detection result based on the feature map, where the target detection result includes a prediction category, a prediction position and a prediction key point of the target object.
In one implementation scenario, the preset number of channels is obtained based on the preset target category, the predicted position and the predicted key point, for example, the calculation formula of the preset number of channels is as follows:
C = N_anchors × (Y + N_cls + N_points·x)……(3)

In formula (3), C is the preset channel number, N_anchors is the number of detection windows (anchors) of the detection network, N_cls is the number of target object categories, N_points is the number of key points, x is the dimension required for each key point position, and Y is the sum of the position-regression output dimension and the confidence output dimension.
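For example, with hypothetical values N_anchors = 2 detection windows, N_cls = 1 target category, N_points = 5 key points each requiring x = 2 dimensions, and Y = 5 (four position-regression dimensions plus one confidence dimension), formula (3) gives C = 2 × (5 + 1 + 5 × 2) = 32 preset channels.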
Step S20: and converting the RepGhost Module in the first target detection network into an equivalent Ghost Module based on the structural reparameterization to obtain a second target detection network.
In one implementation scenario, the batch normalization layer in the first path is converted into an equivalent convolution of the first size, and the convolution of the first size in the first path is fused with this equivalent convolution to form the convolution of the first size in the equivalent Ghost Module. The batch normalization layer in the second path is converted into a first equivalent depth separable convolution of the second size, the depth separable convolution of the first size in the second path is converted into a second equivalent depth separable convolution of the second size, and the depth separable convolution of the second size in the second path is fused with the first and second equivalent depth separable convolutions to form the depth separable convolution of the second size in the equivalent Ghost Module. The convolution of the first size in the equivalent Ghost Module is connected to the depth separable convolution of the second size through a residual.
In one specific implementation scenario, the batch normalization layer is converted into a minimum-unit equivalent convolution, and the equivalent convolution of the first size is obtained based on the minimum-unit equivalent convolution and the first size, for example by padding the kernel with zeros around its border. The biases and weights of all convolutions of the first size in the first path are obtained and correspondingly superposed to obtain a first bias and a first weight; a second bias and a second weight of the equivalent convolution of the first size are obtained; the first bias and the second bias are superposed, and the first weight and the second weight are superposed, to obtain a third bias and a third weight; and the convolution of the first size in the equivalent Ghost Module is obtained based on the third bias, the third weight and the first size.
In one specific implementation scenario, the batch normalization layer is converted into a minimum-unit equivalent convolution, and the first equivalent depth separable convolution of the second size is obtained based on the minimum-unit equivalent convolution and the second size, for example by padding the kernel with zeros around its border. The biases and weights of all depth separable convolutions of the second size in the second path are obtained and correspondingly superposed to obtain a fourth bias and a fourth weight; a fifth bias and a fifth weight of the first equivalent depth separable convolution, and a sixth bias and a sixth weight of the second equivalent depth separable convolution, are obtained; the fourth, fifth and sixth biases are superposed, and the fourth, fifth and sixth weights are superposed, to obtain a seventh bias and a seventh weight; and the depth separable convolution of the second size in the equivalent Ghost Module is obtained based on the seventh bias, the seventh weight and the second size.
In one specific implementation scenario, the calculation formula of the convolution layer is as follows:

Conv(x) = W(x) + b……(4)

The calculation formula of the batch normalization layer, with channel mean μ, variance σ², learned scale γ and shift β, and a small constant ε, is as follows:

BN(x) = γ·(x - μ)/√(σ² + ε) + β……(5)

Combining the convolution layer and the batch normalization layer then gives:

BN(Conv(x)) = (γ·W/√(σ² + ε))(x) + γ·(b - μ)/√(σ² + ε) + β……(6)

That is, the fused layer is again a convolution, with weight W' = γ·W/√(σ² + ε) and bias b' = γ·(b - μ)/√(σ² + ε) + β.
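A minimal PyTorch sketch of this batch-normalization folding (formula (6)) is given below; it assumes inference mode, i.e., the running statistics of the batch normalization layer are used:

```python
import torch
import torch.nn as nn

def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    # Formula (6): scale the convolution weight by gamma / sqrt(var + eps)
    # and absorb the batch-normalization shift into the bias.
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, conv.dilation, conv.groups,
                      bias=True)
    with torch.no_grad():
        fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
        b = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
        fused.bias.copy_((b - bn.running_mean) * scale + bn.bias)
    return fused
```

With the layers in eval mode, bn(conv(x)) and fold_bn_into_conv(conv, bn)(x) agree up to floating-point error, which is exactly the equivalence that lets the RepGhost Module's parallel batch-normalization branches be folded into convolutions.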
step S30: and performing target detection on the image to be detected based on the second target detection network to obtain a target detection result of the image to be detected.
According to the above scheme, the initial target detection network is built based on RepGhost Bottleneck, where the RepGhost Bottleneck is formed from residual-connected RepGhost Modules and the RepGhost Module is formed by fusing structural re-parameterization into the Ghost Module; RepGhost Bottleneck increases the number of channels and improves the image feature extraction capability, so the accuracy of target detection can be improved. A sample image is obtained and the initial target detection network is trained to obtain the first target detection network. The RepGhost Module in the first target detection network is then converted into an equivalent Ghost Module based on structural re-parameterization, which simplifies the structure of the target detection network, reduces the parameter count of the algorithm, and yields a second target detection network with a higher operation speed; target detection is performed on the image to be detected based on the second target detection network to obtain its target detection result. In this way, sample training is performed on the first target detection network to improve the accuracy of the target detection network, while target detection on the image to be detected is performed with the structurally simpler second target detection network, so both the accuracy and the efficiency of target detection can be improved.
Referring to fig. 3, fig. 3 is a schematic framework diagram of an object detection device 30 according to an embodiment of the application. As shown in fig. 3, the object detection device 30 includes a network training module 31, a structure conversion module 32 and a target detection module 33. The network training module 31 is used for training an initial target detection network built based on RepGhost Bottleneck by using a sample image to obtain a first target detection network, where the RepGhost Bottleneck is formed from residual-connected RepGhost Modules, and the RepGhost Module is formed by fusing structural re-parameterization into the Ghost Module. The structure conversion module 32 is configured to convert the RepGhost Module in the first target detection network into an equivalent Ghost Module based on structural re-parameterization to obtain a second target detection network. The target detection module 33 is configured to perform target detection on an image to be detected based on the second target detection network to obtain a target detection result of the image to be detected.
According to the above scheme, the target detection device 30 builds the initial target detection network based on RepGhost Bottleneck, trains it with a sample image to obtain the first target detection network, converts the RepGhost Module in the first target detection network into an equivalent Ghost Module based on structural re-parameterization to obtain the structurally simpler and faster second target detection network, and performs target detection on the image to be detected with the second target detection network. Training improves the accuracy of the target detection network while the simplified structure reduces the parameter count, so both the accuracy and the efficiency of target detection can be improved.
In some disclosed embodiments, the network training module 31 further includes a RepGhost Module obtaining sub-module, configured to connect the convolution of the first size and the batch normalization layer in parallel and then access a first activation function to obtain a first path; connect the depth separable convolution of the second size in parallel with the batch normalization layer and the depth separable convolution of the first size, and then access a second activation function to obtain a second path; and connect the first path and the second path through a residual to obtain the RepGhost Module.
Therefore, by fusing structural re-parameterization into the basic structure of the Ghost Module, the RepGhost Module is integrated into the target detection network; feature fusion is performed in the weight space by fusing the parameters of each branch, and more feature maps of the sample image can be generated to expand network capacity, so the accuracy of target detection can be improved.
In some disclosed embodiments, the network training module 31 further includes a target prediction sub-module, configured to perform feature extraction on the sample feature map based on convolution, and limit the number of channels of the sample feature map to a preset number of channels, so as to obtain a target detection result; the target detection result at least comprises a prediction category, a prediction position and a prediction key point, and the preset channel number is obtained based on the preset target category, the prediction position and the prediction key point.
Therefore, as the number of channels of the feature graphs extracted by the backbone network at different stages is different, the preset number of channels can be obtained based on the fusion network based on the preset target category, the predicted position and the predicted key point output by the detection network, so that the feature graphs with the same number of channels but different sizes can be obtained to obtain the prediction result, and the accuracy of the multi-scale feature detection network is improved.
In some disclosed embodiments, the network training module 31 further includes a first target detection network training sub-module, configured to perform target detection on the sample image based on the initial target detection network, so as to obtain a predicted category, a predicted position, and a predicted key point of the sample object in the sample image; and adjusting network parameters of an initial target detection network loss function based on the first loss between the sample category and the prediction category, the second loss between the sample position and the prediction position, and the third loss between the sample key point and the prediction key point, so as to obtain a first target detection network.
Therefore, training the sample image based on the initial target detection network to obtain a first loss between the sample category and the prediction category, a second loss between the sample position and the prediction position, and a third loss between the sample key point and the prediction key point, and adjusting network parameters of the initial target detection network to obtain the first target detection network with higher accuracy.
In some disclosed embodiments, the structure conversion Module 32 further includes an equivalent Ghost Module obtaining sub-Module, configured to convert the batch normalization layer in the first path into an equivalent convolution of the first size, and fuse the convolution of the first size in the first path with the equivalent convolution of the first size as the convolution of the first size in the equivalent Ghost Module; converting the batch normalization layer in the second path into a first equivalent depth separable convolution of a second size, converting the depth separable convolution of the first size in the second path into a second equivalent depth separable convolution of the second size, and fusing the depth separable convolution of the second size in the second path with the first equivalent depth separable convolution and the second equivalent depth separable convolution to serve as the depth separable convolution of the second size in the equivalent Ghost Module; wherein the convolution of the first size and the depth separable convolution residual of the second size in the equivalent Ghost Module are connected.
Therefore, the RepGhost Module in the first target detection network is converted into the equivalent Ghost Module based on the structural reparameterization, so that a second target detection network constructed based on the equivalent Ghost Module is obtained, the target detection network is simplified, the network parameter quantity is reduced, and the target detection rate is improved.
In some disclosed embodiments, the equivalent Ghost Module obtaining sub-Module further includes a first path conversion sub-Module for converting the batch normalization layer into a minimum unit equivalent convolution; obtaining an equivalent convolution of a first size based on the minimum unit equivalent convolution and the first size; acquiring the bias and the weight of convolution of a first size in all the first paths, and respectively and correspondingly superposing the bias and the weight to obtain a first bias and a first weight; and obtaining a second bias and a second weight of the equivalent convolution of the first size; superposing the first bias and the second bias, and superposing the first weight and the second weight to obtain a third bias and a third weight; and obtaining convolution of the first size in the equivalent Ghost Module based on the third bias, the third weight and the first size.
Therefore, the weights and biases of the convolution layers in the first path of the RepGhost Module and of the equivalent convolution are respectively superposed to obtain the convolution of the first size in the equivalent Ghost Module.
In some disclosed embodiments, the equivalent Ghost Module obtaining sub-module further includes a second path conversion sub-module for converting the batch normalization layer into a minimum-unit equivalent convolution; obtaining a first equivalent depth separable convolution of the second size based on the minimum-unit equivalent convolution and the second size; obtaining the biases and weights of the depth separable convolutions of the second size in all the second paths, and correspondingly superposing them to obtain a fourth bias and a fourth weight; obtaining a fifth bias and a fifth weight of the first equivalent depth separable convolution, and a sixth bias and a sixth weight of the second equivalent depth separable convolution; superposing the fourth, fifth and sixth biases, and the fourth, fifth and sixth weights, to obtain a seventh bias and a seventh weight; and obtaining the depth separable convolution of the second size in the equivalent Ghost Module based on the seventh bias, the seventh weight and the second size.
Therefore, the weights and biases of the depth separable convolutions in the second path of the RepGhost Module and of the equivalent depth separable convolutions are respectively superposed to obtain the depth separable convolution of the second size in the equivalent Ghost Module.
Referring to fig. 4, fig. 4 is a schematic framework diagram of an electronic device 40 according to an embodiment of the application. The electronic device 40 comprises a memory 41 and a processor 42 coupled to each other, the memory 41 having stored therein program instructions, the processor 42 being adapted to execute the program instructions to implement the steps of any of the above-described object detection method embodiments. Specifically, the electronic device 40 may include, but is not limited to, servers, desktop computers, notebook computers, tablet computers, smart phones, etc., which are not limited herein.
In particular, the processor 42 is configured to control itself and the memory 41 to implement the steps of any of the target detection method embodiments described above. The processor 42 may also be referred to as a CPU (Central Processing Unit ). The processor 42 may be an integrated circuit chip having signal processing capabilities. The processor 42 may also be a general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a Field programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. In addition, the processor 42 may be commonly implemented by an integrated circuit chip.
According to the above scheme, the electronic device 40 builds the initial target detection network based on RepGhost Bottleneck, where the RepGhost Bottleneck is formed from residual-connected RepGhost Modules and the RepGhost Module is formed by fusing structural re-parameterization into the Ghost Module; RepGhost Bottleneck increases the number of channels and improves the image feature extraction capability, so the accuracy of target detection can be improved. After training the initial target detection network with a sample image to obtain the first target detection network, the RepGhost Module in the first target detection network is converted into an equivalent Ghost Module based on structural re-parameterization, simplifying the structure and reducing the parameter count to obtain the faster second target detection network, with which target detection is performed on the image to be detected. In this way, both the accuracy and the efficiency of target detection can be improved.
Referring to fig. 5, fig. 5 is a schematic framework diagram of an embodiment of a computer readable storage medium 50 according to the present application. The computer readable storage medium 50 stores program instructions 51 executable by a processor, the program instructions 51 being for implementing the steps in any of the above-described embodiments of the target detection method.
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
The foregoing description of various embodiments is intended to highlight differences between the various embodiments, which may be the same or similar to each other by reference, and is not repeated herein for the sake of brevity.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical, or other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor (processor) to execute all or part of the steps of the methods of the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.

Claims (12)

1. A method of detecting an object, comprising:
training an initial target detection network built based on RepGhost Bottleneck by using a sample image to obtain a first target detection network; wherein, repGhost Bottleneck is formed based on a residual connection of a RepGhost Module, which is formed by fusing a Ghost Module into a structural reparameterization;
converting the RepGhost Module in the first target detection network into an equivalent Ghost Module based on structural re-parameterization to obtain a second target detection network;
and performing target detection on the image to be detected based on the second target detection network to obtain a target detection result of the image to be detected.
2. The method of claim 1, wherein the Ghost Module is formed based on a convolution of a first size and a depth separable convolution of a second size connected by a residual, the second size being not smaller than the first size, and the step of obtaining the RepGhost Module by fusing structural re-parameterization into the Ghost Module comprises:
connecting the convolution of the first size and the batch normalization layer in parallel and then accessing a first activation function to obtain a first path; and
connecting the depth separable convolution of the second size in parallel with the batch normalization layer and the depth separable convolution of the first size, and then accessing a second activation function to obtain a second path;
and connecting the first path and the second path through a residual to obtain the RepGhost Module.
3. The method of claim 2, wherein converting the RepGhost Module in the first target detection network into an equivalent Ghost Module based on structural re-parameterization comprises:
converting the batch normalization layer in the first path into an equivalent convolution of a first size, and fusing the convolution of the first size in the first path with the equivalent convolution of the first size to serve as the convolution of the first size in the equivalent Ghost Module;
converting the batch normalization layer in the second path into a first equivalent depth separable convolution of a second size, converting the depth separable convolution of the first size in the second path into a second equivalent depth separable convolution of the second size, and fusing the depth separable convolution of the second size in the second path with the first equivalent depth separable convolution and the second equivalent depth separable convolution as the depth separable convolution of the second size in the equivalent Ghost Module;
wherein the convolution of the first size and the depth separable convolution residual of the second size in the equivalent Ghost Module are connected.
4. A method according to claim 3, wherein said converting said batch normalization layer in said first pass to an equivalent convolution of said first size comprises:
converting the batch normalization layer into a minimum unit equivalent convolution;
obtaining an equivalent convolution of the first size based on the minimum unit equivalent convolution and the first size;
and the fusing the convolution of the first size in the first path with the equivalent convolution to serve as the convolution of the first size in the equivalent Ghost Module comprises:
acquiring the biases and the weights of the convolutions of the first size in all the first paths, and correspondingly superposing the biases and the weights, respectively, to obtain a first bias and a first weight;
acquiring a second bias and a second weight of the equivalent convolution of the first size;
superposing the first bias and the second bias, and superposing the first weight and the second weight to obtain a third bias and a third weight;
and obtaining the convolution of the first size in the equivalent Ghost Module based on the third bias, the third weight, and the first size.
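A sketch of claim 4 under the same assumptions as above (square kernels, equal input and output channel counts): the batch normalization layer becomes a diagonal 1x1 "minimum unit" kernel, which is lifted to the first size by placing it at the kernel centre, and the weights and biases are then superposed. The helper names are invented for the sketch.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def bn_to_conv(bn: nn.BatchNorm2d, kernel_size: int):
    """Claim 4, conversion steps (a sketch): write the BN layer, using its
    frozen running statistics, as a diagonal minimum-unit kernel, then lift
    it to the given size by placing it at the kernel centre."""
    c = bn.num_features
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)  # per-channel multiplier
    weight = torch.zeros(c, c, kernel_size, kernel_size)
    centre = kernel_size // 2
    weight[torch.arange(c), torch.arange(c), centre, centre] = scale
    bias = bn.bias - bn.running_mean * scale
    return weight, bias                                      # second weight, second bias

@torch.no_grad()
def fuse_first_path(conv: nn.Conv2d, bn: nn.BatchNorm2d):
    """Claim 4, fusion step (a sketch): superpose the trained convolution with
    the BN-equivalent convolution to get the third weight and third bias."""
    w_bn, b_bn = bn_to_conv(bn, conv.kernel_size[0])
    third_weight = conv.weight + w_bn
    third_bias = (conv.bias if conv.bias is not None else 0) + b_bn
    return third_weight, third_bias
```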
5. The method of claim 3, wherein the converting the batch normalization layer in the second path into a first equivalent depth separable convolution of the second size comprises:
converting the batch normalization layer into a minimum unit equivalent convolution;
obtaining the first equivalent depth separable convolution of the second size based on the minimum unit equivalent convolution and the second size;
and the fusing the depth separable convolution of the second size in the second path with the first equivalent depth separable convolution and the second equivalent depth separable convolution to serve as the depth separable convolution of the second size in the equivalent Ghost Module comprises:
acquiring the biases and the weights of the depth separable convolutions of the second size in all the second paths, and correspondingly superposing the biases and the weights, respectively, to obtain a fourth bias and a fourth weight;
obtaining a fifth bias and a fifth weight of the first equivalent depth separable convolution, and obtaining a sixth bias and a sixth weight of the second equivalent depth separable convolution;
superposing the fourth bias, the fifth bias and the sixth bias, and superposing the fourth weight, the fifth weight and the sixth weight to obtain a seventh bias and a seventh weight;
and obtaining the depth separable convolution of the second size in the equivalent Ghost Module based on the seventh bias, the seventh weight, and the second size.
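Claim 5 is the depthwise counterpart of claim 4; a sketch under the same assumptions and with invented helper names follows. The batch normalization layer folds into a depthwise kernel of the second size, the first-size depthwise convolution is lifted to the second size by zero-padding its kernel to the centre, and all weights and biases are superposed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def bn_to_dwconv(bn: nn.BatchNorm2d, k: int):
    """Claim 5 (a sketch): the BN layer as an equivalent depthwise convolution
    of the second size k, with one [1, k, k] filter per channel."""
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    weight = torch.zeros(bn.num_features, 1, k, k)
    weight[:, 0, k // 2, k // 2] = scale
    return weight, bn.bias - bn.running_mean * scale          # fifth weight, fifth bias

@torch.no_grad()
def fuse_second_path(dw_large: nn.Conv2d, bn: nn.BatchNorm2d, dw_small: nn.Conv2d):
    """Claim 5, fusion step (a sketch): lift the first-size depthwise kernel to
    the second size by zero-padding, then superpose all weights and biases."""
    k = dw_large.kernel_size[0]
    w_bn, b_bn = bn_to_dwconv(bn, k)
    pad = (k - dw_small.kernel_size[0]) // 2
    w_small = F.pad(dw_small.weight, [pad] * 4)               # sixth weight, lifted to size k
    seventh_weight = dw_large.weight + w_bn + w_small
    seventh_bias = b_bn.clone()
    for b in (dw_large.bias, dw_small.bias):                  # fourth and sixth biases, if any
        if b is not None:
            seventh_bias += b
    return seventh_weight, seventh_bias
```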
6. The method of claim 1, wherein the initial target detection network comprises at least a backbone network and a detection network, the backbone network is configured to extract a sample feature map of the sample image, the detection network is configured to predict a target detection result of the sample image based on the sample feature map, the backbone network is formed by sequentially connecting a preset number of basic units, and the basic units are formed based on the RepGhost Bottleneck.
7. The method of claim 6, wherein the initial target detection network further comprises a fusion network between the backbone network and the detection network, and the fusion network is formed based on a path aggregation network and a feature pyramid network; the feature pyramid network comprises up-sampling sub-networks respectively corresponding to a plurality of scales, the path aggregation network comprises down-sampling sub-networks respectively corresponding to the plurality of scales, and a skip connection is provided between the up-sampling sub-network and the down-sampling sub-network of a same scale; the detection network comprises detection sub-networks respectively corresponding to the plurality of scales, and the down-sampling sub-network and the detection sub-network of a same scale are connected; and both the up-sampling sub-networks and the down-sampling sub-networks are formed based on the RepGhost Bottleneck.
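A two-scale schematic of the claim-7 fusion network, with plain 3x3 convolutions standing in for the RepGhost-Bottleneck sub-networks; the real topology, channel counts, and number of scales are not fixed by this sketch, and the class and attribute names are invented.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoScaleFusion(nn.Module):
    """Schematic of claim 7 for two scales (a sketch): an FPN-style top-down
    pass, a PAN-style bottom-up pass, and a skip connection between the
    up-sampling and down-sampling sub-networks of the same scale."""

    def __init__(self, c: int):
        super().__init__()
        self.up_fine = nn.Conv2d(c, c, 3, padding=1)                # up-sampling sub-network, fine scale
        self.down_fine = nn.Conv2d(c, c, 3, padding=1)              # down-sampling sub-network, fine scale
        self.down_coarse = nn.Conv2d(c, c, 3, stride=2, padding=1)  # down-sampling sub-network, coarse scale

    def forward(self, feat_fine, feat_coarse):
        # Feature pyramid pass: propagate coarse semantics to the finer scale.
        up = self.up_fine(feat_fine + F.interpolate(feat_coarse, scale_factor=2.0))
        # Path aggregation pass, with the same-scale skip connection from the FPN output.
        out_fine = self.down_fine(up) + up
        out_coarse = self.down_coarse(out_fine) + feat_coarse       # even spatial sizes assumed
        return out_fine, out_coarse                                 # fed to the per-scale detection sub-networks

s, c = TwoScaleFusion(16)(torch.randn(1, 16, 32, 32), torch.randn(1, 16, 16, 16))
```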
8. The method of claim 6, wherein the predicting, by the detection network, the target detection result of the sample image based on the sample feature map comprises: performing feature extraction on the sample feature map based on convolution, and limiting the number of channels of the sample feature map to a preset number of channels to obtain the target detection result; wherein the target detection result at least comprises a prediction category, a prediction position, and prediction key points, and the preset number of channels is obtained based on preset target categories, the prediction position, and the prediction key points.
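One hedged reading of claim 8: the detection head is a convolution whose output channel count is fixed in advance from the preset target categories plus the position and key-point terms. The concrete layout below (4 box values, 2 coordinates per key point, 64 input channels, no objectness term) is an assumption for illustration, not taken from the patent.

```python
import torch
import torch.nn as nn

# Assumed layout per spatial location: num_classes class scores, 4 box values,
# and 2 coordinates per key point; the input channel count is made up.
num_classes, num_keypoints, in_channels = 2, 5, 64
preset_channels = num_classes + 4 + 2 * num_keypoints              # = 16 here
head = nn.Conv2d(in_channels, preset_channels, kernel_size=1)      # limits the channel count

pred = head(torch.randn(1, in_channels, 20, 20))                   # [1, 16, 20, 20]
```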
9. The method of claim 1, wherein the sample image is labeled with a sample category, a sample position, and sample key points of a sample object, and the training the initial target detection network built based on the RepGhost Bottleneck by using the sample image to obtain the first target detection network comprises:
performing target detection on the sample image based on the initial target detection network to obtain a prediction category, a prediction position, and prediction key points of the sample object in the sample image; and
adjusting network parameters of the initial target detection network based on a first loss between the sample category and the prediction category, a second loss between the sample position and the prediction position, and a third loss between the sample key points and the prediction key points, to obtain the first target detection network.
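A sketch of the claim-9 training step. The concrete losses (cross-entropy for the category, Smooth-L1 for the position, mean squared error for the key points), the toy stand-in network, and the equal weighting of the three terms are assumptions; the claim only requires three losses over the labeled and predicted category, position, and key points.

```python
import torch
import torch.nn as nn

class ToyDetector(nn.Module):
    """Stand-in network emitting the three claim-9 prediction heads."""
    def __init__(self, num_keypoints: int = 5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, 3, padding=1), nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.cls = nn.Linear(8, 2)                    # prediction category (2 classes assumed)
        self.box = nn.Linear(8, 4)                    # prediction position
        self.kpt = nn.Linear(8, 2 * num_keypoints)    # prediction key points
    def forward(self, x):
        f = self.features(x)
        return self.cls(f), self.box(f), self.kpt(f)

net = ToyDetector()
opt = torch.optim.SGD(net.parameters(), lr=0.01)
img = torch.randn(4, 3, 32, 32)                       # sample images
cls_t = torch.randint(0, 2, (4,))                     # sample category labels
box_t, kpt_t = torch.randn(4, 4), torch.randn(4, 10)  # sample positions / key points

cls_p, box_p, kpt_p = net(img)
loss = (nn.functional.cross_entropy(cls_p, cls_t)     # first loss: category
        + nn.functional.smooth_l1_loss(box_p, box_t)  # second loss: position
        + nn.functional.mse_loss(kpt_p, kpt_t))       # third loss: key points
opt.zero_grad(); loss.backward(); opt.step()          # adjust the network parameters
```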
10. A target detection apparatus, comprising:
a network training module configured to train an initial target detection network built based on a RepGhost Bottleneck by using a sample image to obtain a first target detection network, wherein the RepGhost Bottleneck is formed based on a residual connection of a RepGhost Module, and the RepGhost Module is formed by fusing a Ghost Module into structural re-parameterization;
a structure conversion module configured to convert the RepGhost Module in the first target detection network into an equivalent Ghost Module based on structural re-parameterization to obtain a second target detection network; and
a target detection module configured to perform target detection on an image to be detected based on the second target detection network to obtain a target detection result of the image to be detected.
11. An electronic device, comprising a memory and a processor coupled to each other, the processor being configured to execute program instructions stored in the memory to implement the target detection method of any one of claims 1 to 9.
12. A computer-readable storage medium having stored thereon program instructions which, when executed by a processor, implement the target detection method of any one of claims 1 to 9.
CN202310401337.2A 2023-04-14 2023-04-14 Target detection method and device, electronic equipment and storage medium Pending CN116630768A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310401337.2A CN116630768A (en) 2023-04-14 2023-04-14 Target detection method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310401337.2A CN116630768A (en) 2023-04-14 2023-04-14 Target detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116630768A true CN116630768A (en) 2023-08-22

Family

ID=87608753

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310401337.2A Pending CN116630768A (en) 2023-04-14 2023-04-14 Target detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116630768A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117237788A (en) * 2023-11-14 2023-12-15 浙江大华技术股份有限公司 Image processing method, apparatus and storage medium
CN117237788B (en) * 2023-11-14 2024-03-01 浙江大华技术股份有限公司 Image processing method, apparatus and storage medium

Similar Documents

Publication Publication Date Title
EP3289529B1 (en) Reducing image resolution in deep convolutional networks
CN109086722B (en) Hybrid license plate recognition method and device and electronic equipment
CN111738110A (en) Remote sensing image vehicle target detection method based on multi-scale attention mechanism
CN111079570A (en) Human body key point identification method and device and electronic equipment
CN111027576B (en) Cooperative significance detection method based on cooperative significance generation type countermeasure network
CN111191654B (en) Road data generation method and device, electronic equipment and storage medium
CN110059728B (en) RGB-D image visual saliency detection method based on attention model
Akey Sungheetha Classification of remote sensing image scenes using double feature extraction hybrid deep learning approach
Yang et al. Spatio-temporal domain awareness for multi-agent collaborative perception
CN110765882A (en) Video tag determination method, device, server and storage medium
JP2023527615A (en) Target object detection model training method, target object detection method, device, electronic device, storage medium and computer program
CN116630768A (en) Target detection method and device, electronic equipment and storage medium
CN114626503A (en) Model training method, target detection method, device, electronic device and medium
CN112819073A (en) Classification network training method, image classification device and electronic equipment
CN115131634A (en) Image recognition method, device, equipment, storage medium and computer program product
CN117036843A (en) Target detection model training method, target detection method and device
Meng et al. A mobilenet-SSD model with FPN for waste detection
CN112418256A (en) Classification, model training and information searching method, system and equipment
Zong et al. A cascaded refined rgb-d salient object detection network based on the attention mechanism
CN113704276A (en) Map updating method and device, electronic equipment and computer readable storage medium
CN111914809A (en) Target object positioning method, image processing method, device and computer equipment
CN116152368A (en) Font generation method, training method, device and equipment of font generation model
CN111539922B (en) Monocular depth estimation and surface normal vector estimation method based on multitask network
CN114462707A (en) Web service multidimensional QoS (quality of service) joint prediction method based on feature depth fusion
CN114693951A (en) RGB-D significance target detection method based on global context information exploration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination