CN110765886A

CN110765886A - Road target detection method and device based on convolutional neural network

Info

Publication number: CN110765886A
Application number: CN201910931498.6A
Authority: CN
Inventors: 李国法; 杨一帆; 赖伟鉴; 朱方平; 陈耀昱; 曲行达
Original assignee: Shenzhen University
Current assignee: Shenzhen University
Priority date: 2019-09-29
Filing date: 2019-09-29
Publication date: 2020-02-07
Anticipated expiration: 2039-09-29
Also published as: CN110765886B

Abstract

The application is applicable to the technical field of image processing, and provides a target detection method and device based on a convolutional neural network, wherein the method comprises the following steps: importing a real-time image into a target detection network, and outputting a target object contained in the real-time image; the target detection network comprises a convolution layer, an inverse convolution layer, a feature enhancement layer, a feature fusion block, a first regressor and a second regressor. The method can solve the problem that the target detection method based on the convolutional neural network in the prior art is insensitive to the detection of small-scale objects.

Description

Road target detection method and device based on convolutional neural network

Technical Field

The application belongs to the technical field of image processing, and particularly relates to a road target detection method and device based on a convolutional neural network.

Background

For the autonomous vehicle, the visual perception unit has an important meaning for the autonomous vehicle to perceive the surrounding environment, wherein the road target detection task is the most basic and important task in the visual perception unit of the autonomous vehicle. For a driving scene picture shot by a vehicle-mounted camera, most objects in the driving scene picture are small in size, and the problem of safe driving of an automatic driving vehicle is further solved by effectively identifying the small objects, so that the road target detection method suitable for autonomous driving has high accuracy and high efficiency and has strong capability of detecting small-size targets.

In recent years, the detection accuracy of a target detection method based on a deep neural network in the prior art is continuously improved, however, generally speaking, most detection methods do not make good optimization for detecting small-scale targets, and the capability of detecting the small-scale targets needs to be improved.

Disclosure of Invention

In order to solve the problem that a target detection method based on a deep neural network is insensitive to small-scale object detection in the prior art, the embodiment of the application provides a road target detection method and a road target detection device based on a convolutional neural network, the accuracy of small-scale object detection is improved, and 9 common road traffic objects such as private cars, buses, trucks, pedestrians, motorcycles, bicycles, riders, traffic lights, traffic signs and the like in a road environment can be effectively detected.

In a first aspect, an embodiment of the present application provides a target detection method, including: acquiring a training scene image; the training scene image comprises a truth value frame of each training target;

outputting N layers of convolution images of the training scene image through N layers of convolution layers in a convolution neural network, and executing a first characteristic fusion operation on the N layers of convolution images to obtain an inverse convolution image corresponding to each level; n is a positive integer greater than 2;

outputting a first enhanced image corresponding to the first layer of convolution image and a second enhanced image corresponding to the second layer of convolution image based on a preset feature enhancement algorithm;

generating a plurality of initial anchor point frames in the training scene image according to a preset anchor point frame positioning algorithm according to the N layers of convolution images;

outputting a first loss parameter of the convolutional neural network according to the first enhanced image, the second enhanced image, the residual convolutional image, the initial anchor point frame and the true value frame of each training target contained in the training scene image; the residual convolution images are convolution images corresponding to other levels except the first layer of convolution image and the second layer of convolution image;

adjusting the initial anchor point frame according to the first loss parameter to obtain a first adjustment frame;

performing a second feature fusion operation on all the inverse convolution images, the first enhanced image, the second enhanced image and the residual convolution images to obtain fusion feature maps corresponding to all the levels;

outputting a second loss parameter of the convolutional neural network according to all the fusion feature maps, the first adjusting frame and the true value frames of the training targets contained in the training scene image;

adjusting parameters of the convolutional neural network based on the first loss parameter and the second loss parameter to obtain a target detection network;

and importing the real-time image into the target detection network, and outputting the target object contained in the real-time image.

In a possible implementation form of the first aspect, after the training scene images are acquired, the training scene images are subjected to image enhancement, including image enhancement by random cropping, flipping, color change, affine transformation, and/or gaussian noise, so as to expand the number of training scene images.

Illustratively, the convolutional neural network may be a combination of one or more of Vgg, Resenet, mobilene series networks; the target detection network may be built based on tensorflow or other deep learning framework.

It should be understood that the convolutional neural network may be any one of the prior art; setting a corresponding convolution parameter according to the convolution corresponding to the convolution neural network, and obtaining a convolution kernel through api setting in a deep learning frame; after the image is processed by the convolution kernel, the image size of the processed image is half of the image size before processing.

In a second aspect, an embodiment of the present application provides an apparatus, including:

the image acquisition module is used for acquiring a training scene image; the training scene image comprises a truth value frame of each training target;

the convolutional layer module is used for outputting a plurality of convolutional images of the training scene image through N convolutional layers in the convolutional neural network; n is a positive integer greater than 2;

the deconvolution module is used for executing a first feature fusion operation on the Nth layer of convolution images to obtain deconvolution images corresponding to all levels;

the feature enhancement module is used for outputting a first enhanced image corresponding to the first layer of convolution image and a second enhanced image corresponding to the second layer of convolution image based on a preset feature enhancement algorithm;

the anchor point frame presetting module is used for generating a plurality of initial anchor point frames in the training scene image according to a preset anchor point frame positioning algorithm according to the N layers of convolution images;

a first loss module, configured to output a first loss parameter of the convolutional neural network according to the first enhanced image, the second enhanced image, a residual convolutional image, the initial anchor block, and the true value block of each training target included in the training scene image; the residual convolution images are convolution images corresponding to other levels except the first layer of convolution image and the second layer of convolution image;

the first regression module is used for adjusting the initial anchor point frame according to the first loss parameter to obtain a first adjusting frame;

the feature fusion module is used for executing second feature fusion operation on all the inverse convolution images, the first enhanced images, the second enhanced images and the residual convolution images to obtain fusion feature maps corresponding to all the levels;

a second loss module, configured to output a second loss parameter of the convolutional neural network according to all the fused feature maps, the first adjustment frame, and the true value frames of the training targets included in the training scene image;

a second regression module, configured to adjust parameters of the convolutional neural network based on the first loss parameter and the second loss parameter, to obtain a target detection network;

and the target detection module is used for importing the real-time image into the target detection network and outputting the target object contained in the real-time image.

In a third aspect, an embodiment of the present application provides a terminal device, including: a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the object detection method of the first aspect when executing the computer program.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, including: the computer storage medium stores a computer program that, when executed by a processor, implements the object detection method of the first aspect described above.

In a fifth aspect, an embodiment of the present application provides a computer program product, which, when running on a terminal device, causes the terminal device to execute the object detection method according to the first aspect.

It is understood that the beneficial effects of the second aspect to the fifth aspect can be referred to the related description of the first aspect, and are not described herein again.

Compared with the prior art, the embodiment of the application has the advantages that: the target detection method comprises deep learning training of the convolutional neural network and importing the real-time image into the trained convolutional neural network to obtain a target detection result, and can solve the problem that the detection of the small-scale target is not sensitive in the prior art. According to the target detection method, the smaller anchor point frame is utilized to predict the anchor point frame, and two-stage frame regression is adopted, so that the predicted target detection result, especially the detection result of the small-scale target, of the trained convolutional neural network is closer to a true value compared with the prior art while the target detection speed is considered.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.

Fig. 1 is a flowchart of an implementation of a target detection method provided in a first embodiment of the present application;

FIG. 2 is a schematic diagram of an application scenario provided in an embodiment of the present application;

FIG. 3 is a schematic structural diagram of a convolutional neural network provided in an embodiment of the present application;

FIG. 4 is a schematic diagram of a feature fusion method provided by an embodiment of the present application;

fig. 5 is a flowchart of an implementation of the target detection method S103 according to the second embodiment of the present application;

FIG. 6 is a schematic illustration of an attention mechanism provided by an embodiment of the present application;

fig. 7 is a flowchart of an implementation of the target detection method S104 according to the third embodiment of the present application;

fig. 8 is a flowchart of an implementation of the target detection method S105 according to the fourth embodiment of the present application;

fig. 9 is a flowchart of an implementation of the target detection method S106 according to the fifth embodiment of the present application;

fig. 10 is a flowchart of an implementation of the target detection method S108 according to the sixth embodiment of the present application;

fig. 11 is a flowchart of an implementation of the target detection method S109 according to the seventh embodiment of the present application;

FIG. 12 is a schematic structural diagram of an object detection apparatus provided in an embodiment of the present application;

fig. 13 is a schematic structural diagram of a network device according to an embodiment of the present application.

Detailed Description

In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.

It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.

As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to" determining "or" in response to detecting ". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".

Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.

Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.

In the embodiment of the invention, the execution subject of the process is the terminal equipment. The terminal devices include but are not limited to: the target detection method can be executed by devices such as a server, a computer, a smart phone and a tablet computer. Preferably, the terminal device is specifically a target detection apparatus, and the terminal device can train a target detection network based on a convolutional neural network by inputting a training image, and import a real-time image into the target detection network, thereby implementing target detection. Fig. 1 shows a flowchart of an implementation of the target detection method according to the first embodiment of the present invention, which is detailed as follows:

in S101, acquiring a training scene image; the training scene image comprises a truth box of each training target.

In this embodiment, training scene images are acquired, optionally after acquiring the training scene images, the training scene images are image enhanced, including image enhancement by random cropping, flipping, color change, affine transformation, and/or gaussian noise, in order to expand the number of training scene images. The training scene image is imported into the terminal device, or the terminal device may randomly extract a plurality of training scene images from a training database, the training scene image may be a real-time scene image obtained by pre-shooting, that is, a detection object needing target detection network identification exists in the training scene image, that is, the training object, an administrator may manually mark a region of each training object in the training scene image, that is, the above-mentioned true value box, each true value box corresponds to a box information, the box information may include size information of the true value box, such as a center position, a size, an aspect ratio, and the like, and the box information may also include a type of the training object. The true value box is used as a standard for subsequently judging the prediction accuracy of the target detection network.

In this embodiment, the type of the training target is determined by one-hot codes, and the target detection method provided in this embodiment can detect at least 9 types of target types, including private cars, buses, trucks, pedestrians, motorcycles, bicycles, riders, traffic lights, traffic signs, and background types in the road environment, and can be represented by at least 10 types of one-hot codes.

In S102, N layers of convolution images of the training scene image are output through N layers of convolution layers within a convolutional neural network.

In the present embodiment, N is a positive integer greater than 2, and is exemplarily illustrated as 6. The method comprises the steps of outputting 6 layers of convolution images of a training scene image through 6 layers of convolution layers in a convolution neural network, specifically, performing convolution once on the training scene image to obtain a first layer of convolution layer image, performing convolution once again on the first layer of convolution layer image to obtain a second layer of convolution layer image, and so on to obtain 6 layers of convolution layer images in total, wherein the 6 layers of convolution layers are used for capturing characteristics of training targets with different scales in the training scene image.

In the present embodiment, the convolution kernel for each convolution is set by the API in the deep learning framework, and the ratio of the image size before and after convolution for each layer of convolution kernel is the same, and in general, the image size refers to the height and width of the image. Illustratively, the ratio of the image sizes before and after the convolution is twice.

In S103, a first feature fusion operation is performed on the nth layer of convolution image, so as to obtain an inverse convolution image corresponding to each hierarchy.

In this embodiment, the numerical value of N is exemplarily illustrated as 6, and by way of example and not limitation, a 6 th layer of deconvolution image may be obtained by performing one convolution on a 6 th layer of convolution image, a 5 th layer of deconvolution image may be obtained by performing one standard deconvolution on the 6 th layer of deconvolution image, a 4 th layer of deconvolution image may be obtained by performing one standard deconvolution on the 5 th layer of deconvolution image, and the like, so as to obtain an deconvolution image corresponding to each hierarchy, that is, a 6-layer of deconvolution image may be obtained. The inverse convolution image contains the characteristic information of the convolution image, so that the characteristic information is summarized, the subsequent second characteristic fusion is facilitated, and the accuracy of target detection is improved.

In S104, based on a preset feature enhancement algorithm, a first enhanced image corresponding to the first layer of the convolution image and a second enhanced image corresponding to the second layer of the convolution image are output.

In this embodiment, because the first layer of convolution image and the second layer of convolution image have larger image sizes, that is, the features of the small-scale target in the image are more obvious, the present application performs feature enhancement on the first layer of convolution image and the second layer of convolution image of the algorithm based on preset feature enhancement to obtain the corresponding first enhanced image and second enhanced image, so as to further amplify the features of the small-scale target in the first layer of convolution image and the second layer of convolution image, thereby improving the accuracy of detecting the small-scale target. In order to enable the target detection network to detect the small-scale target with higher precision and simultaneously take the detection speed of the target detection network into consideration, particularly, only the first layer of convolutional layer image and the second layer of convolutional layer image are subjected to feature enhancement to amplify feature information of the small-scale target.

In S105, according to the N layers of convolution images, a plurality of initial anchor frames are generated in the training scene image according to a preset anchor frame positioning algorithm.

In this embodiment, exemplarily, the value of N is 6, and according to the 6 layers of convolution images, a plurality of initial anchor boxes are generated according to a preset anchor box positioning algorithm in the training scene image, where the initial anchor boxes are specifically used for screening anchor boxes meeting a preset condition based on the initial anchor boxes, so as to calculate an offset between a subsequent prediction anchor box and a true box.

In S106, a first loss parameter of the convolutional neural network is output according to the first enhanced image, the second enhanced image, the residual convolutional image, the initial anchor block, and the true value block of each training target included in the training scene image.

In this embodiment, since the first enhanced image and the second enhanced image enlarge the feature information of the small-scale object, the first enhanced image and the second enhanced image are input in S106 instead of the first layer of convolution image and the second layer of convolution image, thereby improving the accuracy of detecting the small-scale object. Illustratively, the first loss parameter includes a first loss value and a first offset value, which are used for subsequently obtaining the first adjusting block and the parameter for subsequently adjusting the convolutional neural network.

In S107, the initial anchor point frame is adjusted according to the first loss parameter, so as to obtain a first adjustment frame.

In this embodiment, the initial anchor frame is adjusted according to the first loss parameter to obtain a first adjustment frame, optionally, the first loss parameter includes a first offset value, where the first offset value is specifically an offset value between the initial anchor frame and a prediction frame of the same training target predicted by the convolutional neural network, and the initial anchor frame is adjusted according to the first offset value to obtain the first adjustment frame. The first adjusting block is used for calculating a second loss parameter subsequently.

In S108, a second feature fusion operation is performed on all the deconvolution images, the first enhanced image, the second enhanced image, and the residual convolution images, so as to obtain a fusion feature map corresponding to each hierarchy.

In this embodiment, the obtained 6 fusion feature maps respectively include feature information of the convolution image and the inverse convolution image of the corresponding hierarchy, so that the subsequent convolution neural network predicts the second offset value through the 6 fusion feature maps and calculates the second loss parameter.

In S109, a second loss parameter of the convolutional neural network is output according to all the fused feature maps, the first adjustment box, and the true value box of each training target included in the training scene image.

In this embodiment, all of the fused feature map, the first adjustment box, and the truth box of each training target included in the training scene image are input, and the second loss parameter is output. The second loss parameter may include a second loss value and a second offset value for subsequently adjusting a parameter of the convolutional neural network.

In S110, the parameter of the convolutional neural network is adjusted according to the first loss parameter and the second loss parameter, so as to obtain a target detection network.

In this embodiment, optionally, after performing S101 to S109, the parameters of the convolutional neural network are adjusted according to the obtained first loss parameter and the second loss parameter, and the above steps are repeated, and after multiple adjustments, the target detection network is obtained.

In this embodiment, the parameters of the convolutional neural network are adjusted according to the first loss parameter and the second loss parameter obtained in S101 to S109, that is, two loss regressions are performed on the convolutional neural network, including correction of two predictions of the convolutional neural network, so that the target detection accuracy of the target detection network obtained after adjustment is improved, and therefore, the target detection of the real-time image is performed in the following step.

It should be understood that, in this embodiment, the number of the training scene images may be multiple, and therefore, the terminal device needs to perform multiple training studies on the target detection network, and in the process of the multiple training studies, the terminal device may repeat S101 to S106 first, adjust parameters of the convolutional neural network according to the first loss parameters obtained from all the training scene images, then repeat S107 to S109, adjust parameters of the convolutional neural network according to the second loss parameters obtained from all the training scene images, and finally obtain the target detection network; or training the target detection network individually according to each training scene image, namely performing S101 to S106, adjusting the parameter of the convolutional neural network according to the obtained first loss parameter, then performing S107 to S109, adjusting the parameter of the convolutional neural network according to the obtained second loss parameter, repeating the above steps, and obtaining the target detection network after multiple adjustments.

In S111, the real-time image is imported into the target detection network, and the target object included in the real-time image is output.

In this embodiment, specifically, a real-time image is imported into the target detection network, optionally, the real-time image is obtained by a camera, performing target detection on the real-time image through a target detection network, determining a target object contained in the real-time image, the specific manner of identifying the target object may refer to operations in S101 to S109, and the process of detecting the target object is similar to the process of training the target detection network, and the main difference is that the training process calculates the loss value in the first loss parameter and the second loss parameter, and the process of target detection does not need to execute the operation, and finally, the real-time image predicted by the target detection network comprises a prediction frame of the target object, the target object is marked, and the real-time image which comprises the target object and is marked on the target object is output, so that the result of target detection is obtained.

In this embodiment, compared with the target detection network obtained in the prior art, the target detection network adjusted in S110 improves the accuracy of target detection, especially the accuracy of detection of small-scale targets, while taking the detection speed into account. Referring to fig. 2, fig. 2 shows an application scenario of an embodiment of the present application, which is detailed as follows:

the target detection method provided by an embodiment of the application is applied to an automatic driving automobile, as an example and not by way of limitation, a real-time image is acquired through a camera built in the automatic driving automobile, the real-time image is imported into an adjusted target detection network, detection results of each road target of the real-time image are obtained, and the automatic driving automobile can carry out safe driving according to the detection results, including avoiding passerby and vehicles, and carrying out safe driving according to traffic rules and information about traffic signs and traffic lights in the detection results.

In this example, to further demonstrate the beneficial effects of this example, the experimental data are provided as follows:

TABLE 1 comparison of results of different target detection methods on BDD100K data set (index AP₅₀)

TABLE 2 comparison of results of different target detection methods on the BDD100K dataset

The adjusted target detection network obtained in the present application is named as catcher, the target detection method based on the convolutional neural network provided in an embodiment of the present application is also named as catcher, and SSD and YoloV3 are methods of the target detection network based on the convolutional neural network in the prior art, where AP refers to average accuracy of different kinds of target detection, and AP refers to average accuracy of different kinds of target detection₅₀Means that the target detection IOU is greater than 0.5 average accuracy, MAP means that the AP value of all kinds of target detection are integrated, AP_sThe average precision of small-scale target detection is referred, the FPS is the target detection speed, and the experimental environment is i9-9900X, TITANTX.

Tables 1 and 2 show the detection results of different object detection methods on an image dataset, which may be particularly BDD100K, where traffic lights and traffic signs are small scale objects. As can be seen from Table 1, the average accuracy of the target detection of the catcher Det at two small scale targets is higher than that of the two methods in the prior art. As can be seen from Table 2, the average accuracy of the catch Det on all the small-scale target detections is higher than that of the two methods in the prior art, and although the average accuracy MAP of all kinds of target detections is not as good as that of YoloV3, the detection speed FPS is lower than that of YoloV 3. From the above two tables, it can be seen that the catcher det improves the detection accuracy of the small-scale target while taking into account the detection speed.

Fig. 3 shows a schematic structural diagram of a convolutional neural network provided in an embodiment of the present application, which is described in detail as follows:

the convolutional neural network comprises a base network, an inverse convolutional network, a feature enhancement block, a feature fusion block, a first regressor and a second regressor. The first loss parameter and the first loss parameter are obtained by inputting the training scene image and are used for adjusting the parameter of the convolutional neural network and improving the accuracy of target detection.

In this embodiment, the whole network is named as a catcher det, and the basic network is also called a down-sampling network, and is used for acquiring images with less characteristic information, including 6 layers of convolutional layers, where, for example, the size of the convolutional layer image of each layer is half of the size of the convolutional layer image of the previous layer; the deconvolution network is also called as an up-sampling network and is used for collecting images with more characteristic information, the rightmost deconvolution layer in the graph is a first layer of deconvolution layer, and the size of the image of each layer of deconvolution layer is half of that of the image of the last layer of deconvolution layer by way of example; the feature enhancement block is used for performing feature enhancement on the first layer of convolution image and the second layer of convolution image, and the feature enhancement is only performed on the first layer of convolution image and the second layer of convolution image so as to give consideration to the target detection speed and the detection precision of the small-scale target; the characteristic fusion block is used for integrating the characteristic information of the convolution layer and the reverse convolution layer so as to improve the target detection precision of the target detection network; the first regressor calculates a first loss parameter and performs first loss regression according to the first loss parameter, and the second regressor calculates a second loss parameter and performs second loss regression according to the second loss parameter, so as to improve the target detection precision of the target detection network, and particularly improve the detection precision of the target detection network on small-scale targets.

Fig. 4 shows a schematic diagram of a feature fusion method provided in an embodiment of the present application, which is detailed as follows:

where H × W × C represents the height, width, and number of channels of the image. In this embodiment, the height, the width, and the number of channels of the two images subjected to feature fusion are equal, and optionally, the two images subjected to feature fusion may be subjected to feature fusion through an element-wise addition algorithm, that is, corresponding elements of the two images subjected to feature fusion are added to obtain a feature fusion map with unchanged height, width, and number of channels; optionally, the two images subjected to feature fusion may be subjected to feature fusion through a channel splicing algorithm, that is, the two images to be subjected to feature fusion are subjected to feature fusion according to a method of splicing corresponding channels together, so as to obtain a feature fusion image with unchanged height and width and twice the number of channels as the images to be subjected to feature fusion.

Fig. 5 shows a flowchart of an implementation of the target detection method S103 according to the second embodiment of the present application. Referring to fig. 5, with respect to the embodiment described in fig. 1, the target detection method S103 provided in this embodiment includes: s1031 to S1034 are specifically described as follows: the details are as follows:

in S1031, the nth layer of convolution image is convolved to obtain an nth layer of inverse convolution image.

In this embodiment, N is a positive integer greater than 2. Illustratively, N in this embodiment is 6, that is, the 6 th-layer convolved image is convolved to obtain a 6 th-layer deconvolved image, and specifically, the 6 th-layer convolved image and the 6 th-layer deconvolved image have the same size.

In S1032, a first preliminary feature map corresponding to the mth layer of the deconvoluted image is output by a preset reuse algorithm, and a second preliminary feature map corresponding to the mth layer of the deconvoluted image is output by a standard deconvolution algorithm.

Wherein M is a positive integer, and the initial value of M is N; the image size of the first preliminary feature map and the second preliminary feature map is twice the image size of the M-th layer inverse convolution image.

In the present embodiment, the preset reuse algorithm includes a method capable of resizing an image or a mapping algorithm between depth and space, such as nearest neighbor interpolation, bilinear interpolation, etc., and the mapping algorithm refers to changing the width (space) of the image by changing the number of channels (depth) of the image, specifically, in the present embodiment, increasing the width (space) of the image by reducing the number of channels of the image, so as to resize the image, and as an example and not by way of limitation, images with the numbers of height, width and channels being 7, 7 and 192, respectively, pass through the mapping algorithm between depth and space, so as to obtain images with the numbers of height, width and channels being 14, 14 and 48, respectively. By way of example and not limitation, the image size of the first preliminary feature map and the second preliminary feature map is twice the image size of the 6 th layer deconvolution image.

In S1033, feature fusion is performed on the first preliminary feature map and the second preliminary feature map to obtain an M-1 th layer of deconvolution image.

In this embodiment, the first preliminary feature map and the second preliminary feature map are feature-fused. For example, specifically, the content of the embodiment described in fig. 4 may be referred to for specific implementation of feature fusion, and details are not described here.

In S1034, if the value of M is greater than 2, the value of M is decreased, and S1032 is executed again.

In this embodiment, if the value of M is greater than 2, the value of M is decreased, and the process returns to step S1032, that is, the operation of outputting the first preliminary feature map corresponding to the mth layer of deconvolution image through the preset reuse algorithm and outputting the second preliminary feature map corresponding to the mth layer of deconvolution image through the standard deconvolution algorithm is performed. Specifically, S1032 and S1033 are repeated until a 4 th layer deconvolution image, a 3 rd layer deconvolution image, a 2 nd layer deconvolution image, and a 1 st layer deconvolution image are obtained.

In this embodiment, the obtained 6-layer deconvolution image includes the feature information of the 6-layer convolution image, so that the feature information is summarized, and therefore, the subsequent second feature fusion is performed, and the accuracy of target detection is increased.

Fig. 6 shows a schematic structural diagram of an attention mechanism model provided in an embodiment of the present application, which is detailed as follows:

referring to fig. 6, fig. 6 shows an attention mechanism model, also called a compression excitation mechanism, which is a channel attention model, and is characterized in that important features are screened out by configuring different weights ω for each channel of an input feature map, the weights ω are obtained by a deep learning framework API, the updating direction is a loss descending direction, the feature map introduced into the attention mechanism model is subjected to dimensionality reduction processing by 1 × 1 convolution, then feature values of the channels are obtained by global pooling, then the weights ω are calculated by a full connection layer, and then the weights ω are multiplied by the compressed feature map, and finally a re-calibrated feature map is obtained.

Wherein H, W, C represents the height, width and channel number of the image, X represents the input characteristic diagram, F_trRepresents a dimension reduction operation of 1 × 1 convolution, in order to reduce the amount of computation required for the following steps by dimension reduction (reducing the number of channels). F_sq(. cndot.) represents a compression operation on a channel, in this embodiment, illustratively, specifically, using global average pooling, F_ex(. W) represents that the feature information with the shape of 1 × 1 × C is mapped into another feature information with the shape of 1 × 1 × C (the mapped feature information represents the importance coefficient of the channel), the mapping scheme adopts a multi-layer perceptron, W represents the weight of the multi-layer perceptron, and the updating direction of the weight is the direction of the descending of the loss gradient. F_scala(-) represents a multiplication operation on the channel by which the feature map U is mapped with feature information (channel importance coefficient) corresponding thereto. Finally obtaining the re-calibrated characteristic diagram

Fig. 7 shows a flowchart of a specific implementation of the target detection method S104 according to the third embodiment of the present application. Referring to fig. 7, with respect to the embodiment described in fig. 1, the target detection method S104 provided in this embodiment includes: s1041 to S1044, detailed description is as follows:

in S1041, determining a neighboring convolution image associated with the first layer convolution image, and transforming the neighboring convolution image and the first layer convolution image according to a preset transformation algorithm to obtain a plurality of transformation feature maps.

In this embodiment, with the first layer of the convolution image as a reference, determining adjacent convolution images related to the first layer of the convolution image, including an upper layer of the convolution image of the first layer, or an upper layer of the convolution image and a lower layer of the convolution image of the first layer, or a lower layer of the convolution image of the first layer; specifically, in the present embodiment, the adjacent convolution images associated with the first layer of convolution image are the second layer of convolution image and the third layer of convolution image; referring to the schematic diagram of the catcher det network structure shown in fig. 3, optionally, the adjacent convolution images associated with the first layer of convolution images include a training scene image and a second layer of convolution images, and the adjacent convolution images associated with the second layer of convolution images include the training scene image and the first layer of convolution images.

By way of example and not limitation, the preset transformation algorithm may be a mapping algorithm between depth and space, that is, the size of the image is changed by changing the number of channels of the image, specifically, in the present embodiment, the width of the image may be increased by decreasing the number of channels of the image, or the width of the image may be decreased by increasing the number of channels of the image, for example, the images with the numbers of channels of 28, and 16 respectively undergo a transformation method between space and depth, so as to obtain images with the numbers of channels of 7, and 256 respectively.

In S1042, based on a preset combination algorithm, a plurality of the transformation feature maps are combined into a combined feature map.

In this embodiment, the preset combination algorithm includes a channel splicing algorithm, and the specific implementation can be seen in fig. 4.

In S1043, the combined feature map is imported to an attention mechanism model, and the first enhanced image corresponding to the first layer of the convolution image is output.

In this embodiment, the above-mentioned combined feature map is introduced into an attention mechanism model, and the attention mechanism model is specifically implemented as shown in fig. 6, and the output re-calibrated feature map is the first enhanced image. The image height and width of the first enhanced image are the same as those of the first layer convolution image.

In S1044, performing feature enhancement on the second convolution layer image to obtain a second enhanced image, which is specifically implemented as in S1041 to S1043.

In this embodiment, only the first layer of convolution image and the second layer of convolution image are subjected to feature enhancement to obtain a corresponding first enhanced image and a corresponding second enhanced image, so as to further amplify features of the small-scale target in the first layer of convolution image and the second layer of convolution image, thereby improving the detection precision for the small-scale target while considering the target detection speed.

Fig. 8 shows a flowchart of a specific implementation of the target detection method S105 according to the fourth embodiment of the present application. Referring to fig. 8, with respect to the embodiment described in fig. 1, the object detection method S105 provided by the present embodiment includes: s1051 to S1057 are described in detail as follows:

in S1051, a first frame size of a first anchor frame associated with the first layer of convolution images is determined based on the image size of the training scene image and a preset first ratio.

In this embodiment, the first ratio is any value between 0.03 and 0.04.

In S1052, a first center position of the first anchor frame is determined based on the first layer convolution image.

In this embodiment, the training scene image is equally divided into H × W networks based on the height and width of the first layer of convolution image, specifically, H is the height of the first layer of convolution image, W is the width of the first layer of convolution image, each network corresponds to anchor points of B frame sizes, specifically, B corresponding to the first layer of convolution image is 1, and the center of each network is the center of the first anchor point.

In S1053, the first anchor frame is marked based on the first frame size and the first center position.

In this embodiment, S1051 determines the first frame size of the first anchor frame, and S1052 determines the center position of the first anchor frame, specifically, this embodiment uses three different anchor frame aspect ratio combinations to determine the first anchor frame, the aspect ratios are 1:1 respectively,

and

thereby determining first anchor frames, specifically, the number of first anchor frames is 3 × B × H × W.

In S1054, a second frame size of a second anchor frame associated with each of the second layer of the convolution image and the residual convolution image is determined based on the image size and a second ratio associated with each layer level.

In the present embodiment, the above-mentioned residual convolution images refer to a third layer convolution image, a fourth layer convolution image, a fifth layer convolution image, and a sixth layer convolution image, the second proportion is any value between 0.05 and 0.8, specifically, the number of the second proportion corresponding to each layer of the convolution image is 3, the size of the second proportion is proportional to the corresponding level of the convolution image, that is, the higher the level of the convolution image is, the larger the value of the corresponding second ratio is, by way of example and not limitation, the second ratio corresponding to the second layer of convolution image is 0.05, 0.1, 0.15, the second ratio corresponding to the third layer of convolution image is 0.2, 0.25, 0.3, the second ratio corresponding to the fourth layer of convolution image is 0.35, 0.4, 0.45, the second ratio corresponding to the fifth layer of convolution image is 0.5, 0.55, 0.6, and the second ratio corresponding to the sixth layer of convolution image is 0.65, 0.7, 0.75.

In S1055, a second center position of the second anchor point frame, each of which is associated with each level, is determined based on the second layer of the convolution images and the residual convolution images.

In this embodiment, taking the second layer of convolution images as an example, based on the height and width of the second layer of convolution images, the training scene image is equally divided into H × W networks, specifically, H is the height of the layer of convolution images, W is the width of the layer of convolution images, each network corresponds to anchor blocks of B frame sizes, specifically, B corresponding to the second layer of convolution images and the remaining convolution images is 3, and the center of each network is the center of the second anchor block. The above residual convolved images and so on.

In S1056, the second anchor box is marked based on the second bounding box size and the second center position.

In this embodiment, taking the second layer of the convolution image as an example, S1054 determines three different second bounding box sizes of the second anchor frame, and S1055 determines the center position of the second anchor frame, specifically, this embodiment determines the second anchor frame by combining three different anchor frame aspect ratios, where the aspect ratios are 1:1 respectively,

and

thereby determining second anchor blocks corresponding to the second layer of the convolutional image, specifically, the number of the second anchor blocks corresponding to the second layer of the convolutional image is 3 × B × H × W. And determining a second anchor point frame corresponding to the residual convolution image, and so on.

In S1057, the initial anchor frame is obtained according to the first anchor frame and the second anchor frame of all the marks.

In this embodiment, all of the first anchor boxes and the second anchor boxes are collectively referred to as initial anchor boxes. The initial anchor block is spread over the training scene image for subsequent convolutional neural network target prediction based on the initial anchor block. In order to enable the target detection network to have better capability of detecting the small-scale target, particularly, the size of the initial anchor point frame corresponding to the first layer of convolution image is set, and through the setting, the capability of detecting the small-scale target is greatly improved.

Fig. 9 shows a flowchart of an implementation of the target detection method S106 according to the fifth embodiment of the present application, which is detailed as follows:

in S1061, calculating an IOU score of an overlapping degree between the initial anchor box and the truth box of the same training target associated with each level, and calculating to obtain the IOU score of each initial anchor box.

In this embodiment, the IOU score calculation is performed on all the initial anchor point frames and the above-mentioned true value frames of the same training target corresponding to each layer of convolution image. The above-mentioned calculation of the IOU score refers to calculating a ratio of an intersection and a union of two boxes that perform the calculation of the IOU score.

In S1062, if the IOU score is greater than the IOU threshold corresponding to the level associated with the initial anchor block, the initial anchor block is identified as the first regular anchor block.

In this embodiment, specifically, the IOU threshold associated with each level is less than or equal to 0.5, and is positively associated with a level, that is, the higher the current level is, the larger the value of the corresponding IOU threshold is, by way of example and not limitation, specifically, the IOU threshold associated with the first layer of convolutional image is 0.3, the IOU threshold associated with the second layer of convolutional image is 0.34, the IOU threshold associated with the third layer of convolutional image is 0.38, the IOU threshold associated with the fourth layer of convolutional image is 0.42, the IOU threshold associated with the fifth layer of convolutional image is 0.46, and the IOU threshold associated with the sixth layer of convolutional image is 0.5. And if the IOU score is larger than the IOU threshold value corresponding to the level associated with the initial anchor frame, identifying the initial anchor frame as a first regular anchor frame.

In S1063, the first enhanced image, the second enhanced image, and the residual convolution image are imported into the convolution neural network, and a first prediction frame of each training target is output.

In this embodiment, as an example and not by way of limitation, a convolutional neural network is built through a tensorflow or other deep learning framework, and a first prediction frame of each training target is output through analysis of the first enhanced image, the second enhanced image and the residual convolutional image, where the first prediction frame includes position information of the training target predicted by the network in a training scene image.

In S1064, a first prediction offset related to the training target is calculated according to an offset between the first positive example anchor block and the first prediction block corresponding to the same training target, and a first offset value is obtained according to the first prediction offsets of all the training targets.

In this embodiment, the offset of the transformation from the first normal anchor frame to the first prediction frame corresponding to the same training target is the first prediction offset of the training target, and the data set of the first prediction offsets of all the training targets is the first offset.

In S1065, a first positive example anchor point box, the first prediction offset, and the true value box corresponding to the same training target are imported into a preset first loss function, a first loss amount related to the training target is calculated, and the first loss parameter is obtained according to the first loss amounts and the first offset values of all the training targets.

In this embodiment, the preset first loss function is as follows:

wherein L is_boxWhich represents the first amount of loss and,

representing the offset of the transformation of the initial anchor block into the first prediction block,

is the offset of the initial anchor block transformation to the true block,

it is used to indicate whether the kth initial anchor point frame located at grid i, j is used to predict the target (i.e. whether the IOU score is greater than the corresponding threshold, if greater than 1, or 0, i.e. the formula only needs to obtain the offset of the transformation from the first positive example anchor point frame to the first prediction frame and the offset of the transformation from the first positive example anchor point frame to the true value frame to obtain the calculation result), H, W represents the height and width of the corresponding level of the initial anchor point frame, B represents the corresponding B initial frames located at grid i, j, see the related description of the third embodiment of the anchor point in this application, i.e. B of the first layer convolution image is 1, and B of the rest levels is 3.

In this embodiment, the first loss parameter includes a first loss value and a first offset value, which are used to subsequently obtain the first adjustment frame and the parameter used to subsequently adjust the convolutional neural network, so that the first prediction frame obtained next time by the convolutional neural network is closer to the true value frame.

Fig. 10 shows a flowchart of an implementation of the target detection method S108 according to the sixth embodiment of the present application, which is detailed as follows:

in S1081, feature fusion is performed on the first enhanced image and the first layer of inverse convolution image, so as to obtain a first layer of fusion feature map.

In this embodiment, feature fusion is performed on the first enhanced image and the first layer of inverse convolution image to obtain a first layer of fusion feature map, where the first layer of fusion feature map includes target feature information of the first enhanced image and the first layer of inverse convolution image, and particularly, the target feature information includes small-scale target feature information.

In S1082, feature fusion is performed on the second enhanced image and the second layer deconvolution image, so as to obtain a second layer fusion feature map.

In this embodiment, feature fusion is performed on the second enhanced image and the second layer of deconvolution image to obtain a second layer fusion feature map, where the second layer fusion feature map includes target feature information of the second enhanced image and the second layer of deconvolution image, and particularly, the target feature information includes small-scale target feature information.

In S1083, feature fusion is performed on the residual convolution image and the inverse convolution image corresponding to the residual convolution image associated hierarchy, so as to obtain the fusion feature map corresponding to the residual convolution image associated hierarchy.

In this embodiment, feature fusion is performed on the residual convolution image and the deconvolution image corresponding to the residual convolution image association level, so as to obtain a fusion feature map corresponding to the residual convolution image association level, where the fusion feature map includes target feature information in the convolution image and the corresponding deconvolution image.

In this embodiment, all the obtained fusion feature maps respectively include feature information of the convolution image and the inverse convolution image of the corresponding hierarchy, so that it is more accurate to predict the second offset value and calculate the second loss parameter when the convolution neural network passes through the fusion feature map.

Fig. 11 shows a flowchart of an implementation of the target detection method S109 according to the seventh embodiment of the present application, which is detailed as follows:

in S1091, the IOU score of each first adjustment box is calculated by performing the IOU score calculation on the first adjustment box and the true value box corresponding to the same training target associated with each level.

In this embodiment, the IOU score calculation is performed on all the first adjustment boxes and the above-mentioned true value boxes of the same training target corresponding to each layer of the convolution image.

In S1092, if the IOU score is greater than the IOU threshold corresponding to the level associated with the first adjustment box, the first adjustment box is identified as a second regular anchor box.

In this embodiment, specifically, the IOU thresholds of the respective level associations are greater than 0.5, for example and without limitation, the IOU thresholds of the respective level associations may all be 0.75, and if the IOU score is greater than 0.75, the first adjustment box is identified as the second positive example anchor box.

In S1093, all the fused feature maps are imported into the convolutional neural network, and a second prediction box of each of the training targets is output.

In this embodiment, by way of example and not limitation, a convolutional neural network is built by using a tensoflow or other deep learning framework, and a second prediction frame of each training target is predicted by analyzing all the fusion feature maps, where the second prediction frame includes the position information and the category information of the training target in the training scene image predicted by the network.

In S1094, a second prediction offset related to the training target is calculated according to the offset between the second regular anchor block and the second prediction block corresponding to the same training target, and a second offset value is obtained according to the second prediction offsets of all the training targets.

In this embodiment, the offset of the transformation from the second normal anchor frame to the second prediction frame corresponding to the same training target is the second prediction offset of the training target, and the data set of the second offsets of all the training targets is the second offset.

In S1095, a second normal anchor point frame, the second prediction offset value, and the true value frame corresponding to the same training target are imported into a preset second loss function, a second loss amount related to the training target is calculated, and the second loss parameter is obtained according to the second loss amounts and the second offset values of all the training targets.

In this embodiment, the preset second loss function is as follows:

L_total＝λ_classL_class+λ_boxL_box

wherein L is_totalRepresents the second loss, L_boxRepresenting a loss of relocation, L_classRepresenting a loss of classification, class_i,j,kOne-hot codes representing the kind of objects predicted by the network,

one-hot codes representing the kind of object,

representing the amount of offset of the first regularization anchor block to transform to the second prediction block,

representing the initial anchorThe offset of the point frame to the first trim frame transform,is the offset of the initial anchor block transformation to the true block,

i.e. the offset, lambda, of the transformation of the first adjustment box into the truth box_classAnd λ_boxTo balance the different types of losses, crossEncopy represents a cross-entropy function,it is used to indicate whether the kth initial anchor point frame located at grid i, j is used to predict the target (i.e. whether the IOU score is greater than the corresponding threshold, if it is greater than the corresponding threshold, it is 1, otherwise, it is 0, i.e. the formula only needs to obtain the offset of the second positive anchor point frame transformed to the second prediction frame and the offset of the second positive anchor point frame transformed to the true value frame to obtain the calculation result), H, W represents the height and width of the corresponding level of the initial anchor point frame, B represents the corresponding B initial frames located at grid i, j, see the related description of the third embodiment of the anchor point in this application, i.e. B of the first layer convolution image is 1, and B of the rest levels is 3.

In this embodiment, the second loss parameter includes a second loss value and a second offset value, which are used to subsequently adjust a parameter of the convolutional neural network, so that the target detection accuracy is improved.

It should be understood that, the sizes and letters of the sequence numbers of the steps in the embodiments do not mean the execution sequence, and the execution sequence of each process should be determined by the function and the inherent logic of the process, and should not constitute any limitation on the implementation process of the embodiments of the present application.

Fig. 12 is a block diagram of a target detection apparatus based on a convolutional neural network according to an embodiment of the present application, which corresponds to the target detection method based on a convolutional neural network described in the above embodiment, and only the relevant parts to the embodiment of the present application are shown for convenience of illustration.

Referring to fig. 12, the apparatus includes: the image acquisition module is used for acquiring a training scene image; the training scene image comprises a truth value frame of each training target; the convolutional layer module is used for outputting a plurality of convolutional images of the training scene image through N convolutional layers in the convolutional neural network; n is a positive integer greater than 2; the deconvolution module is used for executing a first feature fusion operation on the Nth layer of convolution images to obtain deconvolution images corresponding to all levels; the feature enhancement module is used for outputting a first enhanced image corresponding to the first layer of convolution image and a second enhanced image corresponding to the second layer of convolution image based on a preset feature enhancement algorithm; the anchor point frame presetting module is used for generating a plurality of initial anchor point frames in the training scene image according to a preset anchor point frame positioning algorithm according to the N layers of convolution images; a first loss module, configured to output a first loss parameter of the convolutional neural network according to the first enhanced image, the second enhanced image, a residual convolutional image, the initial anchor block, and the true value block of each training target included in the training scene image; the residual convolution images are convolution images corresponding to other levels except the first layer of convolution image and the second layer of convolution image; the first regression module is used for adjusting the initial anchor point frame according to the first loss parameter to obtain a first adjusting frame; the feature fusion module is used for executing second feature fusion operation on all the inverse convolution images, the first enhanced images, the second enhanced images and the residual convolution images to obtain fusion feature maps corresponding to all the levels; a second loss module, configured to output a second loss parameter of the convolutional neural network according to all the fused feature maps, the first adjustment frame, and the true value frames of the training targets included in the training scene image; a second regression module, configured to adjust parameters of the convolutional neural network based on the first loss parameter and the second loss parameter, to obtain a target detection network; and the target detection module is used for importing the real-time image into the target detection network and outputting the target object contained in the real-time image.

It should be noted that, for the information interaction, the execution process, and other contents between the above-mentioned apparatuses, the specific functions and the technical effects of the embodiments of the method of the present application are based on the same concept, and specific reference may be made to the section of the embodiments of the method, which is not described herein again.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

Fig. 13 is a schematic structural diagram of a network device according to an embodiment of the present application. As shown in fig. 13, the network device 13 of this embodiment includes: at least one processor 130 (only one shown in fig. 13), a memory 131, and a computer program 132 stored in the memory 131 and executable on the at least one processor 130, wherein the processor 130 executes the computer program 132 to implement the steps in any of the above-mentioned embodiments of the method for determining a shared service indicator based on a communication credential sharing service.

The network device 13 may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The network device may include, but is not limited to, a processor 130, a memory 131. Those skilled in the art will appreciate that fig. 13 is merely an example of the network device 13, and does not constitute a limitation of the network device 13, and may include more or less components than those shown, or combine some of the components, or different components, such as an input output device, a network access device, etc.

The Processor 130 may be a Central Processing Unit (CPU), and the Processor 130 may be other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.

The storage 131 may be an internal storage unit of the network device 13 in some embodiments, such as a hard disk or a memory of the network device 13. The memory 131 may also be an external storage device of the network device 13 in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the network device 13. Further, the memory 131 may also include both an internal storage unit and an external storage device of the network device 13. The memory 131 is used for storing an operating system, an application program, a BootLoader (BootLoader), data, and other programs, such as program codes of the computer programs. The memory 131 may also be used to temporarily store data that has been output or is to be output.

The embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps in the above-mentioned method embodiments.

The embodiments of the present application provide a computer program product, which when running on a mobile terminal, enables the mobile terminal to implement the steps in the above method embodiments when executed.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, all or part of the processes in the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and can implement the steps of the embodiments of the methods described above when the computer program is executed by a processor. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer readable medium may include at least: any entity or device capable of carrying computer program code to a photographing apparatus/terminal apparatus, a recording medium, computer Memory, Read-Only Memory (ROM), random-access Memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium. Such as a usb-disk, a removable hard disk, a magnetic or optical disk, etc. In certain jurisdictions, computer-readable media may not be an electrical carrier signal or a telecommunications signal in accordance with legislative and patent practice.

In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other ways. For example, the above-described apparatus/network device embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implementing, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims

1. A target detection method based on a convolutional neural network is characterized by comprising the following steps:

acquiring a training scene image; the training scene image comprises a truth value frame of each training target;

obtaining fusion feature maps corresponding to each hierarchy by performing a second feature fusion operation on all the inverse convolution images, the first enhanced image, the second enhanced image and the residual convolution images;

2. The object detection method of claim 1, wherein the performing the first feature fusion operation on the nth layer convolution image to obtain the inverse convolution image corresponding to each hierarchy level comprises:

convolving the N layer of convolution image to obtain an N layer of inverse convolution image; the image size of the Nth layer of convolution image is the same as that of the Nth layer of inverse convolution image;

outputting a first preparation feature map corresponding to the M-th layer of the deconvolution image through a preset reuse algorithm, and outputting a second preparation feature map corresponding to the M-th layer of the deconvolution image through a standard deconvolution algorithm; wherein M is a positive integer, and the initial value of M is N; the image size of the first and second preliminary feature maps is twice the image size of the mth layer inverse convolution image;

performing feature fusion on the first preparation feature map and the second preparation feature map to obtain an M-1 layer of inverse convolution image;

and if the value of the M is larger than 2, reducing the value of the M, and returning to execute the operation of outputting a first prepared feature map corresponding to the M-th layer of the deconvolution image through a preset reuse algorithm and outputting a second prepared feature map corresponding to the M-th layer of the deconvolution image through a standard deconvolution algorithm.

3. The object detection method of claim 1, wherein outputting a first enhanced image corresponding to the first layer of the convolution image and a second enhanced image corresponding to the second layer of the convolution image based on a preset feature enhancement algorithm comprises:

determining an adjacent convolution image associated with the first layer of convolution image by taking the first layer of convolution image as a reference, and transforming the adjacent convolution image and the first layer of convolution image according to a preset transformation algorithm to obtain a plurality of transformation feature maps;

combining the plurality of transformation feature maps into a combined feature map based on a preset combined algorithm;

and importing the combined feature map into an attention mechanism model, and outputting the first enhanced image corresponding to the first layer of convolution image.

4. The method of claim 1, wherein the generating a plurality of initial anchor boxes according to a predetermined anchor box location algorithm within the training scene image based on the N-layered convolution image comprises:

determining a first frame size of a first anchor point frame associated with the first layer of convolution images based on the image size of the training scene image and a preset first proportion; the first ratio is any value between 0.03 and 0.04;

determining a first center position of the first anchor point frame based on the first layer of the convolution image;

marking the first anchor frame based on the first frame size and the first center position;

determining a second frame size of a second anchor frame associated with each of a second layer of the convolution image and the residual convolution image based on the image size and a second proportion associated with each layer; the second ratio is any value between 0.05 and 0.8;

determining a second center position of the second anchor point frame associated with each level based on a second layer of convolution images and the residual convolution images;

marking the second anchor box based on the second bezel size and the second center position;

and obtaining the initial anchor point frame according to the first anchor point frame and the second anchor point frame of all the marks.

5. The object detection method of claim 1, wherein said outputting a first loss parameter of said convolutional neural network based on said first enhanced image, said second enhanced image, a residual convolutional image, said initial anchor block, and said true value block of each of said training objects contained in said training scene image comprises:

performing overlapping degree IOU (input output) score calculation on the initial anchor frame and the true value frame of the same training target associated with each level, and calculating to obtain the IOU score of each initial anchor frame;

if the IOU score is larger than the IOU threshold value corresponding to the level associated with the initial anchor frame, identifying the initial anchor frame as a first regular anchor frame; the IOU threshold associated with each level is less than or equal to 0.5 and positively correlated with a level;

importing the first enhanced image, the second enhanced image and the residual convolution image into the convolution neural network, and outputting a first prediction frame of each training target;

calculating a first prediction offset related to the training target according to the offset between the first positive example anchor frame and the first prediction frame corresponding to the same training target, and obtaining a first offset value according to the first prediction offsets of all the training targets;

and importing a first positive example anchor point frame, the first prediction offset and the true value frame corresponding to the same training target into a preset first loss function, calculating a first loss amount related to the training target, and obtaining the first loss parameter according to the first loss amounts and the first offset values of all the training targets.

6. The object detection method of claim 1, wherein the performing a second feature fusion operation on all the deconvoluted images, the first enhanced image, the second enhanced image and the residual convolved images to obtain a fused feature map corresponding to each hierarchy level comprises:

performing feature fusion on the first enhanced image and the first layer of inverse convolution image to obtain a first layer of fusion feature map;

performing feature fusion on the second enhanced image and the second layer of inverse convolution image to obtain a second layer of fusion feature map;

and performing feature fusion on the residual convolution image and the inverse convolution image corresponding to the residual convolution image association level to obtain the fusion feature map corresponding to the residual convolution image association level.

7. The object detection method of claim 1, wherein the outputting the second loss parameter of the convolutional neural network according to all the fused feature maps, the first adjustment box, and the true value box of each training object included in the training scene image comprises:

performing IOU score calculation on the first adjusting frame and the true value frame corresponding to the same training target associated with each level, and calculating to obtain the IOU score of each first adjusting frame;

if the IOU score is larger than the IOU threshold value corresponding to the level associated with the first adjusting frame, identifying the first adjusting frame as a second regular anchor frame; the IOU threshold associated with each level is greater than 0.5;

importing all the fusion feature maps into the convolutional neural network, and outputting a second prediction box of each training target;

calculating a second prediction offset related to the training target according to the offset between the second regular example anchor frame and the second prediction frame corresponding to the same training target, and obtaining a second offset value according to the second prediction offsets of all the training targets;

and importing a second regular example anchor point frame, the second prediction deviant and the true value frame corresponding to the same training target into a preset second loss function, calculating a second loss quantity related to the training target, and obtaining the second loss parameter according to the second loss quantity and the second deviant of all the training targets.

8. A road target detection device based on a convolutional neural network, comprising:

9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.