CN113378858A - Image detection method, apparatus, device, vehicle, and medium - Google Patents
- Publication number
- CN113378858A (application CN202110722127.4A)
- Authority
- CN
- China
- Prior art keywords
- convolution
- target
- layer
- convolution kernel
- kernels
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/25—Fusion techniques
- G06N3/045—Combinations of networks
- G06N3/048—Activation functions
- G06N3/08—Learning methods
Abstract
The present disclosure provides an image detection method, apparatus, device, vehicle, and medium, relating to the field of artificial intelligence, in particular to computer vision and deep learning technology, and particularly applicable to smart city and intelligent transportation scenarios. The specific implementation scheme is as follows: acquire a trained initial detection model, wherein the initial detection model comprises at least one target convolution layer and convolution kernels of at least two sizes are arranged in parallel in the target convolution layer; fuse the convolution kernels of at least two sizes in the same target convolution layer to obtain a fused convolution kernel; and replace the convolution kernels of at least two sizes in the corresponding target convolution layer in the initial detection model with the fused convolution kernel to obtain a target detection model for performing target detection on an image to be detected. This technical scheme balances image detection precision and detection efficiency while improving the stability of detection results.
Description
Technical Field
The present disclosure relates to the field of artificial intelligence, in particular to computer vision and deep learning technology, and is specifically applicable to smart city and intelligent transportation scenarios.
Background
Target detection is an active direction in computer vision and digital image processing. It is widely applied in fields such as robot navigation, intelligent video surveillance, industrial inspection, and aerospace, and has become a research hotspot of both theory and application in recent years.
Disclosure of Invention
The present disclosure provides an image detection method, apparatus, device, vehicle, and medium.
According to an aspect of the present disclosure, there is provided an image detection method including:
acquiring a trained initial detection model, wherein the initial detection model comprises at least one target convolution layer, and convolution kernels of at least two sizes are arranged in parallel in the target convolution layer;
fusing the convolution kernels of at least two sizes in the same target convolution layer to obtain a fused convolution kernel; and
replacing the convolution kernels of at least two sizes in the corresponding target convolution layer in the initial detection model with the fused convolution kernel to obtain a target detection model for performing target detection on an image to be detected.
According to another aspect of the present disclosure, there is also provided an image detection apparatus including:
an initial detection model obtaining module, configured to obtain a trained initial detection model, wherein the initial detection model comprises at least one target convolution layer, and convolution kernels of at least two sizes are arranged in parallel in the target convolution layer;
a fused convolution kernel obtaining module, configured to fuse the convolution kernels of at least two sizes in the same target convolution layer to obtain a fused convolution kernel; and
a target detection model obtaining module, configured to replace the convolution kernels of at least two sizes in the corresponding target convolution layer in the initial detection model with the fused convolution kernel to obtain a target detection model for performing target detection on an image to be detected.
According to another aspect of the present disclosure, there is also provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method of image detection provided by any of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is also provided a vehicle, wherein the vehicle is provided with the electronic device provided in any of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is also provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform an image detection method provided by any of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is also provided a computer program product comprising a computer program which, when executed by a processor, implements an image detection method provided by any embodiment of the present disclosure.
The disclosed technology balances image detection precision and detection efficiency while improving the stability of detection results.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a flowchart of an image detection method provided by an embodiment of the present disclosure;
FIG. 2 is a flow chart of another image detection method provided by the embodiments of the present disclosure;
FIG. 3A is a block diagram of an initial detection model provided by embodiments of the present disclosure;
FIG. 3B is a block diagram of a branch of an FPN network according to an embodiment of the present disclosure;
fig. 3C is a block diagram of an FPN network branch in a specific implementation provided by an embodiment of the present disclosure;
FIG. 3D is a schematic diagram of a fusion convolution kernel generation process provided by an embodiment of the present disclosure;
fig. 3E is a diagram of network branches of an FPN network in a target detection network provided by an embodiment of the present disclosure;
FIG. 4A is a block diagram of another initial detection model provided by embodiments of the present disclosure;
fig. 4B is a block diagram of a decoding sub-module of a detector network according to an embodiment of the disclosure;
fig. 4C is a block diagram of a decoding module in a specific implementation provided by the embodiment of the present disclosure;
FIG. 4D is a schematic diagram illustrating a process for generating a fused convolution kernel according to an embodiment of the present disclosure;
fig. 4E is a block diagram of a network of detection heads according to an embodiment of the present disclosure;
fig. 5 is a structural diagram of an image detection apparatus provided in an embodiment of the present disclosure;
fig. 6 is a block diagram of an electronic device for implementing an image detection method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the prior art, when performing image detection, the accuracy of the detection result is generally improved by increasing the network width or network depth of the detection model. However, this approach increases the amount of computation, which raises the performance requirements on the computing device and the time cost of the computation process. In addition, as network depth and width grow, the number of parameters to be learned during model training also grows, and a large parameter count is prone to over-fitting, which harms the robustness of the model and thus the stability of its detection results.
In view of the above, the present disclosure provides an image detection method, an image detection apparatus, a device, a vehicle, and a storage medium, so as to overcome the problem that image detection accuracy, detection efficiency, and detection-result stability cannot be achieved simultaneously in image detection scenarios.
Each image detection method according to the present disclosure may be executed by an image detection apparatus, which may be implemented by software and/or hardware and is specifically configured in an electronic device. The electronic device may be a terminal device or a server. In a preferred embodiment, the electronic device may be a vehicle-mounted terminal.
Referring to fig. 1, an image detection method includes:
s101, acquiring a trained initial detection model; wherein the initial detection model comprises at least one target convolutional layer; convolution kernels with at least two sizes are arranged in the target convolution layer in parallel.
The initial detection model is a trained model for target detection. The target convolutional layer is a convolutional layer in which convolutional kernels of at least two sizes are arranged in parallel. The number of target convolutional layers may be at least one, and is set by a technician who constructs or trains the initial detection model according to needs or empirical values.
The convolution kernels are used for extracting higher-dimensional features of the image; each convolution kernel represents one feature extraction mode and correspondingly generates one feature map, and the size of a convolution kernel corresponds to the size of its receptive field. It can be understood that, because convolution kernels of at least two sizes are arranged in parallel in the target convolution layer, different receptive fields are introduced within the same target convolution layer, which improves the richness and comprehensiveness of the target convolution layer's output.
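As a concrete illustration, such a target convolution layer can be sketched in PyTorch as parallel convolutions over the same input whose outputs are summed. This is a minimal sketch under assumed channel counts and kernel sizes, not the patent's reference implementation:

```python
import torch
import torch.nn as nn

class TargetConvLayer(nn.Module):
    """Sketch of a target convolution layer: convolution kernels of at
    least two sizes arranged in parallel over the same input. The channel
    count and kernel sizes are illustrative assumptions."""

    def __init__(self, channels: int = 64, kernel_sizes=(1, 3)):
        super().__init__()
        # padding=k // 2 keeps every branch's output at the same spatial
        # size, so the per-kernel feature maps can be fused element-wise.
        self.convs = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2, bias=False)
            for k in kernel_sizes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each kernel size contributes a different receptive field; summing
        # the branch outputs combines the multi-scale responses.
        return sum(conv(x) for conv in self.convs)
```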
In an alternative embodiment, the initial detection model may include a feature extraction network for performing feature extraction on the input image; accordingly, the target convolution layer may be a convolution layer provided in the feature extraction network, so that feature extraction at different scales is performed on the input image features by the target convolution layer.
In yet another alternative embodiment, the initial detection model may include a detection head network for performing feature recombination on the input features; accordingly, the target convolution layer may be a convolution layer provided in the detection head network, so that feature recombination at different scales is performed on the input image features by the target convolution layer.
It can be understood that by placing target convolution layers in different network parts of the initial detection model for feature extraction or recombination, the diversity of the initial detection model is improved, which in turn contributes to the diversity of the image detection method.
S102, fusing the convolution kernels of at least two sizes in the same target convolution layer to obtain a fused convolution kernel.
S103, replacing the convolution kernels of at least two sizes in the corresponding target convolution layer in the initial detection model with the fused convolution kernel to obtain a target detection model for performing target detection on an image to be detected.
The convolution kernels of at least two sizes in each target convolution layer are fused to obtain a fused convolution kernel, which then replaces the corresponding target convolution layer in the initial detection model. This realizes an equivalent substitution between the target convolution layer of the initial detection model and the fused convolution kernel of the target detection model.
It can be understood that fusing at least two convolution kernels of different sizes into one fused convolution kernel reduces the network width of the initial detection model, which helps reduce network complexity and the inference time when the model is used, and in turn reduces the amount of computation while improving data processing efficiency and model robustness. Meanwhile, because the fused convolution kernel is obtained from convolution kernels of at least two sizes, it retains the capability of feature extraction or feature recombination at different scales, so image features at different scales are still taken into account during data processing, improving feature richness and diversity and thus the subsequent image detection precision.
The image to be detected may be any image for which target detection is required. The application scenario of the image detection method can be embodied by the device carrying the target detection model and by the attributes of the image to be detected, which reflects the scenario diversity of the image detection method of the present disclosure.
For example, the image to be detected may be an image acquired by an image acquisition module in a scene for tracking and identifying a target by a monitoring system; accordingly, the target detection model can be arranged in an intelligent monitoring device (such as a camera) so as to be adapted to a scene of personnel monitoring in a public place or an enterprise and public institution. For another example, the image to be detected may be an image acquired in the driving environment of the unmanned vehicle in the unmanned scene; correspondingly, the target detection model can be arranged in the unmanned vehicle to adapt to the driving scene of the unmanned vehicle, so that a foundation is laid for realizing intelligent transportation or constructing a smart city.
In this embodiment, a trained initial detection model is obtained in which convolution kernels of at least two sizes are arranged in parallel in each target convolution layer; the convolution kernels of at least two sizes in the same target convolution layer are fused to generate a fused convolution kernel; and the fused convolution kernel replaces the convolution kernels of at least two sizes in the corresponding target convolution layer of the initial detection model, generating a target detection model for performing target detection on an image to be detected. This avoids the increase in model complexity that would result from increasing the network width or depth of the detection model, thereby lowering the computational-performance requirements on the image detection model, reducing the inference time of image detection, and improving image detection efficiency and model robustness. Meanwhile, because the fused convolution kernel is obtained from convolution kernels of at least two sizes, it possesses multi-scale feature extraction capability, which guarantees the richness and diversity of the target convolution layer's feature extraction or feature recombination, and in turn the detection precision of the target detection model.
On the basis of the above technical solutions, the present disclosure also provides an alternative embodiment. In this alternative embodiment, the generation process of the fused convolution kernel is optimized and improved. It should be noted that, in portions of the disclosure that are not described in detail, reference may be made to the description of the foregoing embodiments, and details are not repeated herein.
Referring to fig. 2, an image detection method includes:
s201, acquiring a trained initial detection model; wherein the initial detection model comprises at least one target convolutional layer; convolution kernels with at least two sizes are arranged in the target convolution layer in parallel.
S202, for each target convolution layer, adjusting the size of each convolution kernel in the target convolution layer to a target size.
The sizes of the convolution kernels in the target convolution layer are uniformly adjusted to the target size, so that the consistency of the sizes of the convolution kernels in the target convolution layer is ensured, and a foundation is laid for the fusion of the convolution kernels.
Illustratively, the target size may be determined by a skilled artisan based on desired or empirical values, or may be set based on the sizes of the convolution kernels in the target convolution layer.
Alternatively, the target size may be the size of one of the convolution kernels in the target convolution layer.
When the target size is too large, all convolution kernels have to be enlarged, which introduces a certain amount of computational redundancy; when the target size is too small, receptive fields in the target convolution layer are lost and equivalent substitution of the target convolution layer can no longer be achieved, which affects the detection precision of the finally generated target detection model. To balance the two, in an optional embodiment, the maximum size of the convolution kernels in the target convolution layer may be taken as the target size, and the convolution kernels of non-maximum size in the target convolution layer may be taken as the convolution kernels to be adjusted; each convolution kernel to be adjusted is then expanded to the target size. This avoids both the computational redundancy caused by an oversized target size and the receptive-field loss caused by an undersized one.
In a specific implementation, expanding each convolution kernel to be adjusted to the target size may be done as follows: taking each convolution kernel to be adjusted as the center, adjust it to the target size by uniform zero padding. Correspondingly, the convolution kernels of the target size (including the adjusted convolution kernels and any convolution kernels originally of the target size in the target convolution layer) are superposed to obtain the fused convolution kernel corresponding to the target convolution layer.
It can be understood that adjusting each convolution kernel to the target size by centering it and uniformly zero padding aligns the centers of the differently sized convolution kernels in the target convolution layer without introducing any extra computation. This enables the fused convolution kernel to equivalently substitute for the convolution kernels of at least two sizes arranged in parallel in the target convolution layer, preserving the multi-scale feature extraction or recombination capability while reducing network complexity, and thereby guarantees a simultaneous improvement in the detection precision, detection efficiency, and robustness of the target detection model.
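For instance, assuming kernels stored as PyTorch weight tensors of shape [out_channels, in_channels, k, k] (an assumption for illustration, not the patent's notation), the center-aligned uniform zero padding can be written as:

```python
import torch
import torch.nn.functional as F

def pad_kernel_to(kernel: torch.Tensor, target_size: int) -> torch.Tensor:
    """Zero-pad a [out_ch, in_ch, k, k] kernel evenly on all four sides to
    [out_ch, in_ch, target_size, target_size], keeping the original weights
    centered (e.g. a 1x1 kernel becomes a 3x3 kernel whose only nonzero
    entry is the center). Assumes k and target_size are both odd."""
    pad = (target_size - kernel.shape[-1]) // 2
    return F.pad(kernel, [pad, pad, pad, pad])
```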
S203, fusing the adjusted convolution kernels to obtain the fused convolution kernel corresponding to the target convolution layer.
Since the adjusted convolution kernels all have the target size, the convolution kernels of the target size (including the adjusted convolution kernels and any convolution kernels originally of the target size in the target convolution layer) can be fused by numerical superposition to obtain the fused convolution kernel.
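Building on the padding helper above, the fusion itself is then a plain element-wise sum of the size-aligned kernels. This is again a hedged sketch, with the maximum kernel size taken as the target size:

```python
import torch

def fuse_kernels(kernels: list) -> torch.Tensor:
    """Fuse parallel kernels of different sizes into one kernel: pad every
    kernel to the maximum size present (the target size), then superpose
    the values. Relies on pad_kernel_to defined above."""
    target_size = max(k.shape[-1] for k in kernels)
    return sum(pad_kernel_to(k, target_size) for k in kernels)
```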
S204, replacing the convolution kernels of at least two sizes in the corresponding target convolution layer in the initial detection model with the fused convolution kernel to obtain a target detection model for performing target detection on an image to be detected.
In this embodiment, the generation of the fused convolution kernel is refined into: for each target convolution layer, adjusting the size of each convolution kernel in the target convolution layer to the target size, and fusing the adjusted convolution kernels to obtain the fused convolution kernel corresponding to the target convolution layer. This perfects the generation mechanism of the fused convolution kernel and realizes an equivalent substitution of the fused convolution kernel for the convolution kernels of at least two sizes in the target convolution layer, which reduces the model complexity of the target detection model, improves model robustness, and reduces model inference time. Meanwhile, the fused convolution kernel retains the multi-scale characteristics of the target convolution layer while reducing model complexity, so the target detection model still possesses multi-scale feature extraction or recombination capability, improving the richness and diversity of the features used by the model and in turn the model's detection precision.
On the basis of the above technical solutions, the present disclosure details the generation process of the target detection model by taking a feature extraction network implemented as an FPN (Feature Pyramid Network) as an example.
Referring to fig. 3A, a structural diagram of an initial detection model is shown; the initial detection model includes a backbone network (backbone), an FPN network, and a detection head network. The backbone network performs level-by-level down-sampling to extract basic features; the FPN network performs multi-scale feature fusion by up-sampling and fusing the features extracted by the backbone network; and the detection head network outputs target detection results from the feature data fused by the FPN. For example, the target detection result may include at least one of target coordinates and confidence data.
The data processing logic of the initial detection model is described in detail taking the initial detection model shown in fig. 3A as an example. The image to be detected I1 is input into the backbone network, which performs step-by-step down-sampling and basic feature extraction to obtain initial feature data C3, C4, and C5. The initial feature data C3, C4, and C5 undergo feature extraction in the corresponding target convolution layers of the FPN network, with each level's output fused with the up-sampled output of the previous level, obtaining target feature data P3, P4, and P5. The target feature data P3, P4, and P5 pass through the detection head network, which outputs the corresponding target detection results.
Referring to fig. 3B, a structure diagram of an FPN network branch in the initial detection model is shown to describe the specific processing mechanism of the FPN network in detail. The FPN network branch comprises a target convolution layer, a fusion layer, and an activation layer. The target convolution layer comprises at least two convolution kernels of different sizes arranged in parallel, used for extracting features at different scales from the input initial feature data; the fusion layer is used for fusing the feature extraction results output by the convolution kernels of different sizes to obtain intermediate feature data; and the activation layer is used for activating the intermediate feature data with an activation function to handle non-linearity and obtain the target feature data. The activation function may be set by a skilled person according to needs or empirical values, or determined iteratively through extensive experiments. It should be noted that one FPN network includes at least one FPN network branch.
In an optional embodiment, the number of convolution kernels in a higher-level target convolution layer is greater than that in a lower-level target convolution layer. This enables comprehensive extraction of feature data at different scales, such as shallow semantic features (e.g. color and edge information in an image) and deep semantic features (e.g. texture information in an image), improving the richness and diversity of the features extracted by the FPN network and thus the detection accuracy of the initial detection network.
In one specific implementation, refer to the structure diagram of each network branch of the FPN network in the initial detection network shown in fig. 3C. The number of convolution kernels in the target convolution layers processing C3, C4, and C5 respectively increases in sequence: the target convolution layer corresponding to C3 contains 2 convolution kernels, that corresponding to C4 contains 3, and that corresponding to C5 contains 4. Illustratively, the activation function adopted by each FPN network branch is the ReLU function. It should be noted that the number of target convolution layers, the number of convolution kernels in each target convolution layer, and the specific activation function used in fig. 3C can be set or adjusted by a skilled person according to needs or empirical values, or determined through extensive experiments, and should not be construed as limiting the present disclosure. The activation functions employed by different FPN network branches may be the same or different.
In an alternative embodiment, the maximum convolution kernel size in a higher-level target convolution layer is greater than the maximum convolution kernel size in a lower-level target convolution layer. It can be understood that gradually increasing the convolution kernel size across the levels of target convolution layers enlarges the receptive field of the FPN network, so that feature extraction is performed from a global perspective, loss of boundary information is avoided, and the accuracy of the extracted features is improved.
In one specific implementation, with continued reference to fig. 3C, the maximum convolution kernel size in the target convolution layers processing C3, C4, and C5 respectively increases in sequence: the convolution kernels in the target convolution layer corresponding to C3 have sizes 1 × 1 and 3 × 3 (maximum size 3 × 3); those corresponding to C4 have sizes 1 × 1, 3 × 3, and 5 × 5 (maximum size 5 × 5); and those corresponding to C5 have sizes 1 × 1, 3 × 3, 5 × 5, and 7 × 7 (maximum size 7 × 7). It should be noted that the convolution kernel sizes used in different target convolution layers may be set by a skilled person according to needs or empirical values, or determined through extensive experimental adjustment; it is only necessary to ensure that the maximum convolution kernel size increases with the level. The disclosure merely illustrates the kernel sizes and should not be construed as limiting.
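To make this per-level layout concrete, the branches of fig. 3C could be instantiated with the TargetConvLayer sketched earlier; the channel width of 256 is an assumption for illustration:

```python
import torch.nn as nn

# Kernel-size layout per the description of fig. 3C: higher levels get more
# parallel kernels and a larger maximum kernel size.
fpn_target_layers = nn.ModuleDict({
    "C3": TargetConvLayer(channels=256, kernel_sizes=(1, 3)),        # max 3x3
    "C4": TargetConvLayer(channels=256, kernel_sizes=(1, 3, 5)),     # max 5x5
    "C5": TargetConvLayer(channels=256, kernel_sizes=(1, 3, 5, 7)),  # max 7x7
})
```

Each branch's summed output would then pass through its activation layer (e.g. ReLU), as in the branch structure of fig. 3B.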
The initial detection model improves the richness and comprehensiveness of the extracted features by arranging convolution kernels of at least two sizes in parallel in the target convolution layer, which increases the model accuracy of the initial detection model. However, arranging convolution kernels in parallel increases the complexity of the model and affects its robustness, while also increasing inference time during data processing and reducing target detection efficiency.
To overcome this drawback, in an optional embodiment, the present disclosure fuses the convolution kernels of different sizes in the target convolution layer to obtain a fused convolution kernel, and then replaces the corresponding target convolution layer in the initial detection model with the fused convolution kernel to generate the target detection model.
Referring to the schematic diagram of the fused convolution kernel generation process shown in fig. 3D, the generation process is described in detail. For any FPN network branch, the convolution kernels in the target convolution layer of that branch in the initial detection model are fused to obtain a fused convolution kernel; the target convolution layer is then replaced with the fused convolution kernel, and the now-redundant fusion layer is removed, yielding the FPN network branch of the target detection model.
In an optional embodiment, each convolution kernel in a target convolution layer of the initial detection model is centered on itself and adjusted, by uniform zero padding, to the maximum kernel size in that target convolution layer, so that all convolution kernels in the target convolution layer have consistent dimensions and matching numbers of input and output channels; the adjusted convolution kernels are then fused to obtain the fused convolution kernel of the target convolution layer.
Specifically, since convolution is linear, that is, (A + B) * X = A * X + B * X, the at least two convolution kernels of different sizes in the target convolution layer can be merged into one fused convolution kernel by merging the network parameters of the convolution kernels for each FPN network branch. Illustratively, the network parameters of a convolution kernel may include convolution parameters and/or bias parameters.
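This linearity argument can be checked numerically; the following sketch (assumed shapes, bias omitted for brevity) verifies that convolving with the fused kernel matches summing the branch outputs:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 4, 32, 32)      # assumed input: 4 channels, 32x32
k1 = torch.randn(8, 4, 1, 1)       # 1x1 kernel (A)
k3 = torch.randn(8, 4, 3, 3)       # 3x3 kernel (B)

# Center-align the 1x1 kernel inside a 3x3 frame by uniform zero padding.
k1_padded = F.pad(k1, [1, 1, 1, 1])
fused = k1_padded + k3             # (A + B)

branch_sum = F.conv2d(x, k1) + F.conv2d(x, k3, padding=1)   # A*X + B*X
fused_out = F.conv2d(x, fused, padding=1)                   # (A + B)*X
assert torch.allclose(branch_sum, fused_out, atol=1e-5)

# If the branches carried biases, the fused bias would likewise be the
# element-wise sum of the branch biases.
```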
Taking the initial feature data C3 as an example: suppose that, in the corresponding FPN network branch, the 1 × 1 convolution kernel K1 is [a] and the 3 × 3 convolution kernel K3 is

    [b11 b12 b13]
    [b21 b22 b23]
    [b31 b32 b33]

The convolution kernel K1 is adjusted to

    K1' = [0 0 0]
          [0 a 0]
          [0 0 0]

such that the adjusted convolution kernel K1' and the convolution kernel K3 have the same size. The convolution kernels K1' and K3 are then fused (K1' + K3) to obtain the fused convolution kernel KC3:

    KC3 = [b11  b12      b13]
          [b21  b22 + a  b23]
          [b31  b32      b33]
In one embodiment, the target convolution layers in the FPN network branches corresponding to C3, C4, and C5 in the original fig. 3C may be correspondingly replaced with the fused convolution kernels KC3, KC4, and KC5, obtaining the network branches of the FPN network in the target detection network shown in fig. 3E.
It can be understood that, in the model training stage, convolution kernels of at least two sizes are arranged in parallel in the target convolution layer to expand the network width and improve model precision; in the model inference stage, i.e. the model usage stage, the convolution kernels of different sizes in the target convolution layer are fused to replace the target convolution layer, avoiding the increase in model complexity caused by the expanded network width and improving the detection efficiency and robustness of the model. Since the fused convolution kernel still possesses multi-scale feature extraction capability, model precision is preserved as well.
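A hedged sketch of this training-to-inference switch, reusing the TargetConvLayer and fuse_kernels helpers sketched earlier (the replacement step and removal of the now-redundant fusion layer are as described above; this is not the patent's literal implementation):

```python
import torch.nn as nn

def to_deploy(layer: TargetConvLayer) -> nn.Conv2d:
    """Collapse a trained multi-kernel target convolution layer into one
    equivalent convolution for the inference stage (stride-1 case)."""
    weight = fuse_kernels([conv.weight.data for conv in layer.convs])
    out_ch, in_ch, k, _ = weight.shape
    fused_conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False)
    fused_conv.weight.data.copy_(weight)
    return fused_conv  # the explicit fusion layer becomes redundant
```

Training keeps the parallel branches for accuracy; at deployment, each target convolution layer is swapped for `to_deploy(layer)` so the extra width costs nothing at inference time.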
On the basis of the above technical solutions, the present disclosure takes a detection head network as an example to detail the generation process of the target detection model.
Referring to fig. 4A, a diagram of an initial detection model is shown; the initial detection model includes a feature extraction network and a detection head network. The feature extraction network is used for extracting features from the image to be detected to obtain target feature data; the detection head network is used for performing target detection on the image to be detected according to the target feature data to obtain a target detection result, which may include, for example, at least one of target position and confidence data.
The detection head network comprises an encoding module and a decoding module. The encoding module is used for encoding the target feature data output by the feature extraction network to obtain encoded feature data; the decoding module is used for performing feature recombination on the encoded feature data to obtain decoded feature data. For example, a bounding box regression module, such as one based on the YOLO (You Only Look Once) algorithm, may be used to determine the target detection result from the decoded feature data.
The decoding module comprises at least one cascaded decoding submodule and is used for performing feature recombination at different scales on the encoded feature data output by the encoding module. The number of decoding submodules can be set by a technician according to needs or empirical values, or determined through extensive experimental adjustment. Referring to fig. 4B, a block diagram of a decoding submodule of the detection head network in the initial detection model is shown. The decoding submodule comprises a target convolution layer, a fusion layer, and an activation layer. The target convolution layer comprises at least two convolution kernels of different sizes arranged in parallel, used for decoding the input encoded feature data, or the decoded feature data output by the previous decoding submodule, at different scales, so that the multi-scale feature data in the target feature data is retained as much as possible; the fusion layer is used for fusing the feature data output by the convolution kernels of different sizes to obtain intermediate feature data; and the activation layer is used for activating the intermediate feature data with an activation function to handle non-linearity and obtain the decoded feature data. The activation function may be set by a skilled person according to needs or empirical values, or determined iteratively through extensive experiments.
In an optional embodiment, the number of convolution kernels in a higher-level target convolution layer is greater than that in a lower-level target convolution layer. This enables comprehensive recombination of feature data at different scales in the target feature data, such as shallow semantic features (e.g. color and edge information in an image) and deep semantic features (e.g. texture information in an image), improving the richness and diversity of the features used by the detection head network and thus the detection accuracy of the initial detection network.
In a specific implementation, refer to the structure diagram of the decoding module of the detection head network in the initial detection network shown in fig. 4C. The number of convolution kernels in the target convolution layer of each decoding submodule increases in sequence with network depth. The input of the first decoding submodule is the encoded feature data H1 output by the encoding module; the input of each subsequent decoding submodule is the decoded feature data H2, H3, or H4 output by the previous decoding submodule; and the output of the last decoding submodule is the decoded feature data H5. The target convolution layer corresponding to H1 contains 2 convolution kernels, that corresponding to H2 contains 3, that corresponding to H3 contains 4, and that corresponding to H4 contains 5. Illustratively, the activation function employed in each decoding submodule is the ReLU function. It should be noted that the number of target convolution layers, the number of convolution kernels in each target convolution layer, and the specific activation function used in fig. 4C may be set or adjusted by a skilled person according to needs or empirical values, or determined through extensive experiments, and should not be construed as limiting the present disclosure. The activation functions employed by different decoding submodules may be the same or different.
In an alternative embodiment, the maximum convolution kernel size in a higher-level target convolution layer is greater than the maximum convolution kernel size in a lower-level target convolution layer. It can be understood that gradually increasing the convolution kernel size across the levels of target convolution layers enlarges the receptive field of the detection head network, so that feature recombination is performed from a global perspective, loss of boundary information is avoided, and the accuracy of detection results is improved.
In one specific implementation, with continued reference to fig. 4C, the maximum size of the convolution kernels in each target convolution layer increases in sequence with network depth: the convolution kernels in the target convolution layer corresponding to H1 have sizes 1 × 1 and 3 × 3 (maximum size 3 × 3); those corresponding to H2 have sizes 1 × 1, 3 × 3, and 5 × 5 (maximum size 5 × 5); those corresponding to H3 have sizes 1 × 1, 3 × 3, 5 × 5, and 7 × 7 (maximum size 7 × 7); and those corresponding to H4 have sizes 1 × 1, 3 × 3, 5 × 5, 7 × 7, and 9 × 9 (maximum size 9 × 9). The convolution kernel sizes used in different target convolution layers may be set by a skilled person according to needs or empirical values, or determined through extensive experimental adjustment; it is only necessary to ensure that the maximum convolution kernel size increases.
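Under the same assumptions as the FPN sketch, the cascade of decoding submodules in fig. 4C could be laid out as follows (channel width assumed; each submodule is a TargetConvLayer, as sketched earlier, followed by its activation layer):

```python
import torch.nn as nn

# Decoding submodules per the description of fig. 4C: kernel count and
# maximum kernel size grow with network depth (H1 -> H5).
decoder = nn.Sequential(
    nn.Sequential(TargetConvLayer(256, (1, 3)), nn.ReLU()),           # from H1
    nn.Sequential(TargetConvLayer(256, (1, 3, 5)), nn.ReLU()),        # from H2
    nn.Sequential(TargetConvLayer(256, (1, 3, 5, 7)), nn.ReLU()),     # from H3
    nn.Sequential(TargetConvLayer(256, (1, 3, 5, 7, 9)), nn.ReLU()),  # from H4
)
```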
The initial detection model improves the richness and comprehensiveness of the decoded feature data by arranging convolution kernels of at least two sizes in parallel in the target convolution layer, thereby improving the model accuracy of the initial detection model. However, arranging convolution kernels in parallel increases the complexity of the model and affects its robustness, while also increasing inference time during data processing and reducing target detection efficiency.
To overcome this drawback, in an optional embodiment, the present disclosure fuses the convolution kernels of different sizes in the target convolution layer to obtain a fused convolution kernel, and then replaces the corresponding target convolution layer in the initial detection model with the fused convolution kernel to generate the target detection model.
Referring to the schematic diagram of the fused convolution kernel generation process shown in fig. 4D, the generation process is described in detail. For each target convolution layer, the convolution kernels in the target convolution layer are fused to obtain a fused convolution kernel; the target convolution layer is then replaced with the fused convolution kernel, and the now-redundant fusion layer is removed.
In an optional embodiment, each convolution kernel in a target convolution layer of the initial detection model is centered on itself and adjusted, by uniform zero padding, to the maximum kernel size in that target convolution layer, so that all convolution kernels in the target convolution layer have consistent dimensions and matching numbers of input and output channels; the adjusted convolution kernels are then fused to obtain the fused convolution kernel of the target convolution layer.
Specifically, since convolution is linear, that is, (A + B) * X = A * X + B * X, the at least two convolution kernels of different sizes in the target convolution layer can be merged into one fused convolution kernel by merging the network parameters of the convolution kernels for each decoding submodule. Illustratively, the network parameters of a convolution kernel may include convolution parameters and/or bias parameters.
Taking the decoding submodule corresponding to the encoded feature data H1 as an example: suppose the 1 × 1 convolution kernel K1 is [a] and the 3 × 3 convolution kernel K3 is

    [b11 b12 b13]
    [b21 b22 b23]
    [b31 b32 b33]

The convolution kernel K1 is adjusted to

    K1' = [0 0 0]
          [0 a 0]
          [0 0 0]

such that the adjusted convolution kernel K1' and the convolution kernel K3 have the same size. The convolution kernels K1' and K3 are then fused (K1' + K3) to obtain the fused convolution kernel KH1:

    KH1 = [b11  b12      b13]
          [b21  b22 + a  b23]
          [b31  b32      b33]
In one embodiment, the fused convolution kernels KH1, KH2, KH3, and KH4 may be generated for the target convolution layers in the decoding submodules of the original fig. 4C, obtaining the detection head network structure shown in fig. 4E.
It can be understood that, in the model training stage, convolution kernels of at least two sizes are arranged in parallel in the target convolution layer to expand the network width and improve model precision; in the model inference stage, i.e. the model usage stage, the convolution kernels of different sizes in the target convolution layer are fused to replace the target convolution layer, avoiding the increase in model complexity caused by the expanded network width and improving the detection efficiency and robustness of the model. Since the fused convolution kernel still possesses multi-scale feature extraction capability, model precision is preserved as well.
In a specific implementation, in order to fuse multi-scale feature information, an FPN network is generally introduced into the feature extraction network to fuse the initial feature data extracted from different levels of the backbone network (such as the aforementioned C3, C4, and C5), thereby handling multi-scale targets in the image detection process and improving the accuracy of detection results. However, this approach requires fusing the features of each FPN level with those of adjacent levels, which increases the amount of computation and affects computational efficiency. To improve the accuracy of detection results without introducing an FPN network into the initial detection model, in the model training stage, the initial feature data output by the backbone network that carries the most feature information (such as C5) can be taken as the input of the detection head network, and a target convolution layer with convolution kernels of at least two scales arranged in parallel can be introduced into the detection head network, making full use of the existing features and realizing the recombination of multi-scale feature information. In the model inference stage, i.e. the model usage stage, the convolution kernels of all scales in the target convolution layer are fused to generate a fused convolution kernel that replaces the target convolution layer, which simplifies the model, reduces its computation, and improves its robustness and computational efficiency.
On the basis of the above technical solutions, the present disclosure also provides an optional embodiment of an apparatus for implementing the above image detection methods. Referring to fig. 5, an image detection apparatus 500 includes: an initial detection model obtaining module 501, a fused convolution kernel obtaining module 502, and a target detection model obtaining module 503. Wherein,
an initial detection model obtaining module 501, configured to obtain a trained initial detection model, wherein the initial detection model comprises at least one target convolution layer, and convolution kernels of at least two sizes are arranged in parallel in the target convolution layer;
a fused convolution kernel obtaining module 502, configured to fuse the convolution kernels of at least two sizes in the same target convolution layer to obtain a fused convolution kernel; and
a target detection model obtaining module 503, configured to replace the convolution kernels of at least two sizes in the corresponding target convolution layer in the initial detection model with the fused convolution kernel to obtain a target detection model for performing target detection on an image to be detected.
In this embodiment, the initial detection model obtaining module obtains an initial detection model in which convolution kernels of at least two sizes are arranged in parallel in each target convolution layer; the fused convolution kernel obtaining module fuses the convolution kernels of at least two sizes in the same target convolution layer to generate a fused convolution kernel; and the target detection model obtaining module replaces the convolution kernels of at least two sizes in the corresponding target convolution layer of the initial detection model with the fused convolution kernel to generate a target detection model for performing target detection on an image to be detected. This avoids the increase in model complexity that would result from increasing the network width or depth of the detection model, lowering the computational-performance requirements on the image detection model, reducing the inference time of image detection, and improving image detection efficiency and model robustness. Meanwhile, because the fused convolution kernel is obtained from convolution kernels of at least two sizes, it possesses multi-scale feature extraction capability, guaranteeing the richness and diversity of the target convolution layer's feature extraction or feature encoding, and in turn the detection precision of the target detection model.
In an alternative embodiment, the fused convolution kernel obtaining module 502 includes:
a convolution kernel size adjustment unit for adjusting the size of each convolution kernel in each target convolution layer to a target size;
a fused convolution kernel obtaining unit, configured to fuse the adjusted convolution kernels to obtain the fused convolution kernel corresponding to the target convolution layer.
In an alternative embodiment, the convolution kernel size adjustment unit includes:
a target size determination subunit, configured to take the maximum size of the convolution kernels in the target convolution layer as the target size, and take the convolution kernels of non-maximum size in the target convolution layer as the convolution kernels to be adjusted;
a convolution kernel size adjustment subunit, configured to respectively expand each convolution kernel to be adjusted to the target size.
In an alternative embodiment, the convolution kernel size adjustment subunit includes:
a convolution kernel size adjustment sub-subunit, configured to adjust each convolution kernel to be adjusted to the target size by taking each convolution kernel to be adjusted as the center and applying uniform zero padding.
In an alternative embodiment, the initial detection model includes a feature extraction network and a detection head network;
the target convolution layer is located in the feature extraction network and/or the detection head network.
In an alternative embodiment, in the feature extraction network or the detection head network, the number of convolution kernels in a higher-level target convolution layer is greater than the number of convolution kernels in a lower-level target convolution layer.
In an alternative embodiment, in the feature extraction network or the detection head network, the maximum convolution kernel size in a higher-level target convolution layer is greater than the maximum convolution kernel size in a lower-level target convolution layer.
In an alternative embodiment, the target detection model is provided in an unmanned vehicle, and the image to be detected is an image acquired in the driving environment of the unmanned vehicle.
The image detection device can execute the image detection method provided by any embodiment of the disclosure, and has corresponding functional modules and beneficial effects for executing the image detection method.
In the technical scheme of the disclosure, the acquisition, storage, application and the like of the data such as the initial detection model and the image to be detected accord with the regulations of related laws and regulations without violating the public order and good customs.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 6 illustrates a schematic block diagram of an example electronic device 600 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in FIG. 6, the device 600 includes a computing unit 601, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 602 or a computer program loaded from a storage unit 608 into a random access memory (RAM) 603. The RAM 603 can also store various programs and data required for the operation of the device 600. The computing unit 601, the ROM 602, and the RAM 603 are connected to one another via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, or the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 601 may be any of a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 601 executes the respective methods and processes described above, such as the image detection method. For example, in some embodiments, the image detection method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the image detection method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the image detection method in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the drawbacks of difficult management and weak service scalability found in traditional physical host and VPS services. The server may also be a server of a distributed system, or a server combined with a blockchain.
On the basis of the above technical solutions, an optional embodiment of the present disclosure further provides a vehicle. The vehicle is provided with the electronic device shown in FIG. 6. In an alternative embodiment, the vehicle may be an unmanned vehicle.
Artificial intelligence is the discipline that studies how to make computers simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it spans both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision technology, speech recognition technology, natural language processing technology, machine learning/deep learning technology, big data processing technology, knowledge graph technology, and the like.
Cloud computing refers to a technology system that accesses a flexibly scalable shared pool of physical or virtual resources through a network; the resources may include servers, operating systems, networks, software, applications, and storage devices, and may be deployed and managed on demand in a self-service manner. Cloud computing technology can provide efficient and powerful data processing capabilities for artificial intelligence, blockchain, and other technical applications and for model training.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in this disclosure may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions provided by this disclosure can be achieved, and are not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.
Claims (20)
1. An image detection method, comprising:
acquiring a trained initial detection model; wherein the initial detection model includes at least one target convolution layer; convolution kernels with at least two sizes are arranged in the target convolution layer in parallel;
fusing convolution kernels with at least two sizes in the same target convolution layer to obtain a fused convolution kernel;
and replacing the convolution kernels with the at least two sizes in the corresponding target convolution layer in the initial detection model by the fusion convolution kernel to obtain a target detection model for carrying out target detection on the image to be detected.
2. The method of claim 1, wherein said fusing convolution kernels of at least two sizes in the same target convolution layer to obtain a fused convolution kernel comprises:
aiming at each target convolution layer, adjusting the size of each convolution kernel in the target convolution layer to a target size;
and fusing the adjusted convolution kernels to obtain fused convolution kernels corresponding to the target convolution layer.
3. The method of claim 2, wherein said adjusting the size of each convolution kernel in the target convolution layer to a target size comprises:
taking the maximum size of the convolution kernel in the target convolution layer as a target size, and taking the convolution kernel which is not the maximum size in the target convolution layer as a convolution kernel to be adjusted;
and respectively expanding each convolution kernel to be adjusted so as to adjust each convolution kernel to be adjusted to the target size.
4. The method of claim 3, wherein said respectively expanding each of said convolution kernels to be adjusted to adjust each of said convolution kernels to be adjusted to said target size comprises:
and respectively taking each convolution kernel to be adjusted as a center, and adjusting each convolution kernel to be adjusted to the target size in a uniform zero padding mode.
5. The method of any of claims 1-4, wherein the initial detection model comprises a feature extraction network and a detection head network;
the target convolutional layer is located in the feature extraction network and/or the detection head network.
6. The method of claim 5, wherein in the feature extraction network or the detection head network, a number of convolution kernels in a higher-level target convolution layer is greater than a number of convolution kernels in a lower-level target convolution layer.
7. The method of claim 5, wherein, in the feature extraction network or the detection head network, a maximum size of convolution kernels in a higher-level target convolution layer is larger than a maximum size of convolution kernels in a lower-level target convolution layer.
8. The method according to any one of claims 1-7, wherein the object detection model is provided in an unmanned vehicle, and the image to be detected is an image captured in a driving environment of the unmanned vehicle.
9. An image detection apparatus comprising:
the initial detection model acquisition module is used for acquiring a trained initial detection model; wherein the initial detection model includes at least one target convolution layer; convolution kernels with at least two sizes are arranged in the target convolution layer in parallel;
a fusion convolution kernel obtaining module for fusing convolution kernels with at least two sizes in the same target convolution layer to obtain a fusion convolution kernel;
and the target detection model obtaining module is used for replacing the convolution kernels with at least two sizes in the corresponding target convolution layer in the initial detection model by the fusion convolution kernel to obtain a target detection model which is used for carrying out target detection on the image to be detected.
10. The apparatus of claim 9, wherein the fusion convolution kernel obtaining module comprises:
a convolution kernel size adjustment unit for adjusting the size of each convolution kernel in each target convolution layer to a target size;
and the fusion convolution kernel obtaining unit is used for fusing the adjusted convolution kernels to obtain a fusion convolution kernel corresponding to the target convolution layer.
11. The apparatus of claim 10, wherein the convolution kernel size adjustment unit comprises:
a target size determining subunit, configured to use a maximum size of the convolution kernel in the target convolution layer as a target size, and use a convolution kernel of a non-maximum size in the target convolution layer as a convolution kernel to be adjusted;
and the convolution kernel size adjusting subunit is used for respectively expanding each convolution kernel to be adjusted so as to adjust each convolution kernel to be adjusted to the target size.
12. The apparatus of claim 11, wherein the convolution kernel size adjustment subunit comprises:
and the convolution kernel size adjusting slave unit is used for adjusting each convolution kernel to be adjusted to a target size by taking each convolution kernel to be adjusted as a center and adopting a uniform zero padding mode.
13. The apparatus according to any one of claims 9-11, wherein the initial detection model comprises a feature extraction network and a detection head network;
the target convolutional layer is located in the feature extraction network and/or the detection head network.
14. The apparatus of claim 13, wherein a number of convolution kernels in a higher-level target convolution layer is greater than a number of convolution kernels in a lower-level target convolution layer in the feature extraction network or the detection head network.
15. The apparatus of claim 13, wherein a maximum size of convolution kernels in a higher-level target convolution layer is larger than a maximum size of convolution kernels in a lower-level target convolution layer in the feature extraction network or the detection head network.
16. The apparatus according to any one of claims 9-15, wherein the object detection model is provided in an unmanned vehicle, and the image to be detected is an image captured in a driving environment of the unmanned vehicle.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform an image detection method as claimed in any one of claims 1 to 8.
18. A vehicle, wherein the vehicle is provided with an electronic device as claimed in claim 17.
19. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform an image detection method according to any one of claims 1-8.
20. A computer program product comprising a computer program which, when executed by a processor, implements an image detection method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110722127.4A | 2021-06-28 | 2021-06-28 | Image detection method, apparatus, device, vehicle, and medium
Publications (1)
Publication Number | Publication Date |
---|---|
CN113378858A (en) | 2021-09-10
Family
ID=77579719
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110722127.4A (Pending) | Image detection method, apparatus, device, vehicle, and medium | 2021-06-28 | 2021-06-28
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113378858A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108259997A (en) * | 2018-04-02 | 2018-07-06 | Tencent Technology (Shenzhen) Co., Ltd. | Image correlation processing method and device, intelligent terminal, server, storage medium
CN109117940A (en) * | 2018-06-19 | 2019-01-01 | Tencent Technology (Shenzhen) Co., Ltd. | Forward acceleration method, apparatus and system for a convolutional neural network
CN111932437A (en) * | 2020-10-10 | 2020-11-13 | Shenzhen Intellifusion Technologies Co., Ltd. | Image processing method, image processing device, electronic equipment and computer readable storage medium
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20210910 |