CN111967452B - Target detection method, computer equipment and readable storage medium - Google Patents


Info

Publication number
CN111967452B
Authority
CN
China
Prior art keywords
target
output
output layer
network model
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011129543.5A
Other languages
Chinese (zh)
Other versions
CN111967452A (en)
Inventor
Zhang Hao (张浩)
Current Assignee
Zhejiang Xinmai Microelectronics Co ltd
Original Assignee
Hangzhou Xiongmai Integrated Circuit Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Xiongmai Integrated Circuit Technology Co Ltd
Priority to CN202011129543.5A
Publication of CN111967452A
Application granted
Publication of CN111967452B
Legal status: Active


Classifications

    • G06V 20/00 Scenes; scene-specific elements
    • G06F 18/214 Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N 3/045 Combinations of networks
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G06V 2201/07 Target detection

Abstract

The invention discloses a target detection method, computer equipment and a readable storage medium, relating to the technical field of target detection. In the scheme provided by the invention, the first step calculates the loss of the first output feature map of a stage using the classification labels and updates the network by back-propagation; the second step filters the first output feature map of the same stage, decodes the result into classification labels for the second output feature map, calculates the classification loss, and again updates the network by back-propagation. The two steps are cycled continuously to iteratively optimize the network, improving the detection performance of a single-stage end-to-end detection network.

Description

Target detection method, computer equipment and readable storage medium
Technical Field
The invention relates to the technical field of target detection, in particular to a target detection method, computer equipment and a readable storage medium.
Background
In the prior art, deep learning methods are generally used for road target detection. Current algorithms based on a region proposal network (RPN) include R-CNN, Fast R-CNN, Faster R-CNN, Cascade R-CNN, etc. Single-stage end-to-end algorithms with anchor boxes include the YOLO series, the SSD series, etc. Single-stage end-to-end anchor-free algorithms include FCOS, CenterNet, CornerNet, etc.
However, although the target detection accuracy of a multi-stage detection network with a region proposal network (RPN) is high, its time complexity is much higher, so it is not suitable for embedded deployment. A single-stage end-to-end detection network is fast and suitable for embedded deployment, but its detection accuracy is poorer. On this basis, the single-stage end-to-end detection network is continuously improved in order to raise the detection rate of the network.
Disclosure of Invention
In order to solve the foregoing problems, the present invention provides a target detection method, which improves the detection performance of a single-stage end-to-end detection network.
In order to achieve the purpose, the invention adopts the following technical scheme:
an object detection method for detecting a road object, comprising the steps of:
acquiring road picture data and preprocessing the road picture data to be used as a sample;
establishing a network model, training the network model by using a sample, and detecting a road target by using the trained network model;
the network model comprises a target frame output layer and at least two target output layers, wherein each target output layer outputs a feature map, and the feature map output by each target output layer comprises a class axis;
the network model training method comprises the following steps:
first update: calculating the loss of the feature map output by the first target output layer, outputting that feature map to the next target output layer, and updating the network model by back-propagation;
cyclic update: the next target output layer filters the received feature map, the filtering comprising the following steps:
obtaining the maximum value along the class axis and its corresponding index value;
filtering the index values according to the following formula, the filtered index values being used as the filtered result:

f_stage_i_0_out = f_stage_i_0_index, if f_stage_i_0_value > thresh'

where f_stage_i_0_value is the maximum value of the class axis in the first feature map of the i-th stage, f_stage_i_0_index is the index value corresponding to that maximum, and thresh' is the label threshold;
the filtered result is used as the label information of the feature map output by the current target output layer, the loss of that feature map is calculated according to the label information, the feature map is output to the next target output layer, and the network model is updated by back-propagation;
the cyclic update step is repeated until the last target output layer.
Optionally, before calculating the loss of the feature map output by the target output layer, classifying the targets in the sample to obtain classification labels of the classified targets, and then calculating the loss of the feature map output by the target output layer according to the following formula:
E_loss_class = -Σ_i L_i · log(y_i)

where E_loss_class is the classification loss, L_i is the classification label in one-hot form, and y_i is the output value of the network model.
Optionally, the classifying the target in the sample includes the following steps:
calculating the IOU between the target frame output by the target frame output layer and the real target frame of the target in the sample according to the following formula:
IOU = area of intersection of the two frames / area of union of the two frames

where the two frames are the real target frame of the target in the sample, given by its upper-left point and lower-right point, and the target frame output by the target frame output layer, likewise given by its upper-left point and lower-right point;
determining the classification of the target in the sample according to the IOU, and marking a classification label:
class = pos_class, if IOU > thresh; otherwise class = neg_class

where neg_class is the negative sample frame, pos_class is the positive sample frame, and thresh is the threshold distinguishing positive and negative sample frames.
Optionally, when the network model is trained by using the sample, the loss is calculated for the target frame output layer according to the following formula:
Figure 840970DEST_PATH_IMAGE008
where E_loss_frame is the loss of the target frame output layer.
Optionally, when the network model is trained, the feature graph output by the target output layer is screened according to the following formula:
Figure 41007DEST_PATH_IMAGE009
where f_stage_i_out is the screened feature map, f_stage_i_last is the feature map output by the last target output layer, f_stage_i_(last-1) is the feature map output by the target output layer preceding the last one, and thresh'' is the screening threshold;
traverse the screened feature map and screen out the coordinates of targets greater than the confidence threshold.
Optionally, when the network model is trained, the coordinates of the target in the sample are decoded according to the coordinates of the screened target according to the following formula:
Figure 706474DEST_PATH_IMAGE010
where x and y are the coordinates of the screened target, x1_offset and y1_offset are the coordinates of the upper-left point of the target frame predicted by the target frame output layer, x2_offset and y2_offset are the coordinates of the lower-right point of the predicted target frame, box_x1 and box_y1 are the coordinates of the upper-left point of the target frame output by the target frame output layer, box_x2 and box_y2 are the coordinates of its lower-right point, and stride is the step size of the feature map relative to the sample;
and finally, carrying out non-maximum suppression operation on the output target frame.
Optionally, the network model has several levels of target detection, where different levels of target detection are for targets of different sizes, and each level of target detection includes a target frame output layer and at least two target output layers.
Optionally, the acquiring and preprocessing the road picture data includes the following steps:
selecting road picture data in a natural scene;
normalizing the road picture data according to the following formula:
Figure 212253DEST_PATH_IMAGE011
where the result is the normalized input data and m is the road picture data;
randomly scale the normalized road picture data, then randomly crop it; each crop is 256 × 256 in size, and if a crop contains no road target, the current crop is used as a negative sample.
Optionally, an Adam optimization method is used for training the network model, the basic learning rate is 0.001, and the training batch size is 25.
The invention has the following beneficial effects:
according to the technical scheme provided by the invention, the classification label is dynamically assigned to the single-stage target detection, and a multi-step learning method is adopted in the sub-training process, so that the network detection is more stable, the more robust characteristic is learned, and the detection rate of the detection network is improved while the characteristic that the single-stage end-to-end detection network is suitable for embedded end deployment is maintained.
Furthermore, the present invention also provides a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing any of the above methods when executing the computer program.
Meanwhile, the invention also provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the method of any one of the above.
These features and advantages of the present invention will be disclosed in more detail in the following detailed description. The preferred embodiments and means described are not intended to limit the technical scope of the present invention. In addition, the features, elements and components appearing hereinafter may occur in the plural and are given different symbols or numerals for convenience of representation, but all denote the same or similar structural or functional parts.
Detailed Description
The technical solutions of the embodiments of the present invention are explained and illustrated below, but the following embodiments are only preferred embodiments of the present invention, not all of them. Based on these embodiments, other embodiments obtained by those skilled in the art without creative effort fall within the protection scope of the present invention.
Reference in the specification to "one embodiment" or "an example" means that a particular feature, structure or characteristic described in connection with the embodiment itself may be included in at least one embodiment of the patent disclosure. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment.
The first embodiment is as follows:
the embodiment provides a target detection method for detecting a road target, which comprises the following steps:
and acquiring road picture data and preprocessing the road picture data to be used as a sample. In this step, firstly, road picture data in a natural scene is selected; then, normalizing the road picture data according to the following formula:
Figure 144754DEST_PATH_IMAGE013
where the result is the normalized input data and m is the road picture data;
randomly scale the normalized road picture data to adapt to different target sizes, then take random crops of size 256 × 256; if a crop contains no road target, the current crop is used as a negative sample for training, to increase the network's negative-sample learning. Finally, data-enhancement operations such as Gaussian blur, brightness jitter, flipping and Cutout are applied at random.
A network model is then established, comprising a target frame output layer and at least two target output layers, each target output layer outputting a feature map. Because the size of road targets varies greatly, the network model has several levels of target detection, different levels being aimed at targets of different sizes, and each level comprising a target frame output layer and at least two target output layers. Specifically, the network model provided in this embodiment has two levels of prediction, each with two target output layers; the 1st level predicts road targets with height greater than 8 and less than 48, and the 2nd level predicts road targets with height greater than 48 and less than 256:
target ∈ stage1, if 8 < h < 48; target ∈ stage2, if 48 < h < 256
where w is the width of the road target, h is the height of the road target, and stage1 and stage2 respectively refer to the two-stage outputs of the network.
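The height ranges above can be read as a simple assignment rule; how the boundary value h = 48 is handled is not stated in the source, so the choice below is an assumption:

```python
def assign_stage(h):
    """Assign a road target to a prediction level by its height h
    (stage 1: 8 < h < 48, stage 2: 48 <= h < 256; returns None for
    targets outside the detectable range)."""
    if 8 < h < 48:
        return 1
    if 48 <= h < 256:
        return 2
    return None
```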
The specific network configuration is as follows:
Figure 278298DEST_PATH_IMAGE016
Figure 452928DEST_PATH_IMAGE017
where k represents the convolution kernel size, n represents the number of output convolution feature maps, s represents the convolution sliding step, Bn represents the BatchNormalization operation, and ReLU and Softmax represent the activation functions used. The class1_0, class1_1, class2_0 and class2_1 output layers employ the softmax activation function:
softmax(x_i) = e^{x_i} / Σ_j e^{x_j}

where x_i refers to the output of the i-th neuron and the denominator sums the exponentials of all output neurons, so that the probability values of the neural nodes output by the formula sum to 1.
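The softmax described above can be sketched as follows; subtracting the per-row maximum before exponentiating is a standard numerical-stability step added here, not part of the source formula:

```python
import numpy as np

def softmax(x):
    """Softmax over the last axis: exponentiate each neuron output and
    divide by the sum of all exponentials, so the outputs sum to 1."""
    z = np.asarray(x, dtype=float)
    z = z - z.max(axis=-1, keepdims=True)  # stability shift
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```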
Training the network model by using the sample, wherein the training of the network model comprises the following steps:
First update: calculate the loss of the feature map output by the first target output layer. Before calculating this loss, classify the targets in the sample according to the following steps to obtain the classification labels of the classified targets:
calculating the IOU between the target frame output by the target frame output layer and the real target frame of the target in the sample according to the following formula:
IOU = area of intersection of the two frames / area of union of the two frames

where the two frames are the real target frame of the target in the sample, given by its upper-left point and lower-right point, and the target frame output by the target frame output layer, likewise given by its upper-left point and lower-right point;
determining the classification of the target in the sample according to the IOU, and marking a classification label:
class = pos_class, if IOU > thresh; otherwise class = neg_class

where neg_class is the negative sample frame, pos_class is the positive sample frame, and thresh is the threshold distinguishing positive and negative sample frames; in this embodiment, thresh is 0.6.
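The IOU-based labelling of this embodiment can be sketched as follows, assuming (x1, y1, x2, y2) box coordinates; the helper names are introduced here for illustration:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix = max(min(a[2], b[2]) - max(a[0], b[0]), 0.0)
    iy = max(min(a[3], b[3]) - max(a[1], b[1]), 0.0)
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def classify_box(pred_box, gt_box, thresh=0.6):
    """Mark a predicted frame as a positive or negative sample frame by
    its IOU with the real target frame, using thresh = 0.6 as in this
    embodiment."""
    return "pos_class" if iou(pred_box, gt_box) > thresh else "neg_class"
```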
After the classification label of the classification target is obtained, calculating the loss of the feature map output by the target output layer according to the classification label and the following formula:
E_loss_class = -Σ_i L_i · log(y_i)

where E_loss_class is the classification loss, L_i is the classification label in one-hot form, and y_i is the output value of the network model.
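A sketch of the classification loss as reconstructed above; averaging over the boxes in a batch is an assumption here, since the source gives the formula only per output:

```python
import numpy as np

def class_loss(one_hot_labels, probs, eps=1e-12):
    """One-hot cross-entropy: E_loss_class = -sum_i L_i * log(y_i),
    averaged over boxes. eps guards against log(0)."""
    one_hot_labels = np.asarray(one_hot_labels, dtype=float)
    probs = np.asarray(probs, dtype=float)
    return float(-(one_hot_labels * np.log(probs + eps)).sum(axis=-1).mean())
```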
At the same time, the loss is calculated for the target box output layer according to the following formula:
Figure 508477DEST_PATH_IMAGE025
where E_loss_frame is the loss of the target frame output layer.
The feature map output by the first target output layer is then output to the next target output layer, and the network model is updated by back-propagation according to the calculated losses, completing the first update step.
In particular, this embodiment has two levels of prediction, each with two target output layers, namely class1_0, class1_1, class2_0 and class2_1; the feature maps they output are denoted f_stage1_0, f_stage1_1, f_stage2_0 and f_stage2_1 respectively, with size (batch, f_num_box, class), where batch is the number of samples input into the network model, f_num_box is the number of target frames of the feature map, and class is the number of classes. In this embodiment the number of classes is 4, i.e. pedestrians, motor vehicles, non-motor vehicles and the background. The first update calculates the losses of the f_stage1_0 and f_stage2_0 feature maps from the classification labels and updates the network model by back-propagation, and then outputs the feature maps f_stage1_0 and f_stage2_0 to the class1_1 and class2_1 layers.
Cyclic update: the next target output layer filters the received feature map. The feature map output by the target output layer comprises a class axis, and filtering the received feature map comprises the following steps:
obtain the maximum value along the class axis and its corresponding index value;
filtering the index value according to the following formula, wherein the filtered index value is used as a filtered result:
f_stage_i_0_out = f_stage_i_0_index, if f_stage_i_0_value > thresh' (positions that do not satisfy the condition are filtered out)

where f_stage_i_0_value is the maximum value of the class axis in the first feature map of the i-th stage, f_stage_i_0_index is the index value corresponding to that maximum, and thresh' is the label threshold;
the filtered result is used as the label information of the feature map output by the current target output layer; the loss of that feature map is calculated from the label information as in the loss-calculation step above, the feature map is output to the next target output layer, and the network model is updated by back-propagation according to the calculated loss;
Real labels often contain noise and outliers. After the iterative learning of the first step, the result is fed to the second step, so the labels learned in the second step are friendlier, tending toward soft labels learned by the network. Through multi-step learning, the network reinforces its training results and can handle outliers well.
And repeating the cyclic updating step until the last target output layer.
In this embodiment, the maximum value along the class axis of the feature maps f_stage1_0 and f_stage2_0 output by the target output layers, together with its corresponding index, is denoted f_stage1_0_value, f_stage1_0_index, f_stage2_0_value and f_stage2_0_index respectively. The index values are filtered according to the filtering formula, in which i takes the value 1 or 2 according to the sequence number of the target output layer:

f_stage_i_0_out = f_stage_i_0_index, if f_stage_i_0_value > thresh'
tag thresholdthresh’Taking the index value as 0.6f_stage1_1、f_stage2_1, calculating loss of label information of the characteristic diagram, finally updating the network model reversely, training the network model each time, executing the two training steps, continuously circulating the first step and the second step to iteratively optimize the network model, wherein usually, the real label often has noise and abnormal points, and sending the result to the second step after the first step iterative learning, so that the label learned by the second step is more friendly and tends to the soft label learned by the network. Through multi-step learning, the network strengthens the training result of the network, and can well process abnormal values.
Then, the feature graph output by the target output layer is screened according to the following formula:
Figure 167495DEST_PATH_IMAGE028
where f_stage_i_out is the screened feature map, f_stage_i_last is the feature map output by the last target output layer, f_stage_i_(last-1) is the feature map output by the target output layer preceding the last one, and thresh'' is the screening threshold;
traverse the screened feature map and screen out the coordinates of targets greater than the confidence threshold; then decode the corresponding coordinates of the target in the sample from the screened coordinates according to the following formula:
Figure 546523DEST_PATH_IMAGE029
where x and y are the coordinates of the screened target, x1_offset and y1_offset are the coordinates of the upper-left point of the target frame predicted by the target frame output layer, x2_offset and y2_offset are the coordinates of the lower-right point of the predicted target frame, box_x1 and box_y1 are the coordinates of the upper-left point of the target frame output by the target frame output layer, box_x2 and box_y2 are the coordinates of its lower-right point, and stride is the step size of the feature map relative to the sample; in this embodiment, the step size of level 1 relative to the original image is 8, and the step size of level 2 is 16;
and then carrying out non-maximum suppression operation on the output target frame.
In this embodiment, the formula for screening the feature maps f_stage1_0 and f_stage2_0 output by the two-level target output layers is specifically as follows:
Figure 2913DEST_PATH_IMAGE030
where i and last each take the value 1 or 2 according to the sequence number of the target output layer.
In this embodiment, the Adam optimization method is used for training the network model, the basic learning rate is 0.001, and the training batch size is 25. And training the network model by using the samples according to the basic learning rate, iterating the target detection network model, and finally, detecting the road target by using the trained network model.
According to the technical scheme provided by this embodiment, classification labels are dynamically assigned during single-stage target detection, and a multi-step learning method is adopted within each training pass, so that detection is more stable and more robust features are learned; the detection rate of the detection network is improved while the suitability of a single-stage end-to-end detection network for embedded deployment is retained.
Example two
The present embodiment provides a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the method of any of the above embodiments when executing the computer program. It will be understood by those skilled in the art that all or part of the flow of the methods of the above embodiments may be implemented by a computer program instructing associated hardware. Accordingly, the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can implement the method of any one of the above embodiments. Any reference to memory, storage, database or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus DRAM (RDRAM), and direct Rambus DRAM (DRDRAM).
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that the invention is not limited thereto, and may be embodied in other forms without departing from the spirit or essential characteristics thereof. Any modification which does not depart from the functional and structural principles of the present invention is intended to be included within the scope of the claims.

Claims (11)

1. An object detection method for detection of a road object, comprising the steps of:
acquiring road picture data and preprocessing the road picture data to be used as a sample;
establishing a network model, training the network model by using a sample, and detecting a road target by using the trained network model;
the method being characterized in that the network model comprises a target frame output layer and at least two target output layers, each target output layer outputs a feature map, and the feature map output by each target output layer comprises a class axis;
the network model training method comprises the following steps:
first update: calculating the loss of the feature map output by the first target output layer, outputting that feature map to the next target output layer, and updating the network model by back-propagation;
cyclic update: the next target output layer filters the received feature map, the filtering comprising the following steps:
obtaining the maximum value along the class axis and its corresponding index value;
filtering the index values according to the following formula, the filtered index values being used as the filtered result:

f_stage_i_0_out = f_stage_i_0_index, if f_stage_i_0_value > thresh'

where f_stage_i_0_value is the maximum value of the class axis in the first feature map of the i-th stage, f_stage_i_0_index is the index value corresponding to that maximum, and thresh' is the label threshold;
using the filtered result as the label information of the feature map output by the current target output layer, calculating the loss of that feature map according to the label information, outputting the feature map to the next target output layer, and updating the network model by back-propagation;
and repeating the cyclic updating step until the last target output layer.
2. The method of claim 1, wherein before calculating the loss of the feature map output by the target output layer, the targets in the sample are classified to obtain classification labels of the classified targets, and then the loss of the feature map output by the target output layer is calculated according to the following formula:
(formula not reproduced: available only as an image in the original)
wherein E_loss_class is the classification loss of the target, L_i is the classification label in one-hot form, and (symbol not reproduced: available only as an image in the original) is the output value of the network model.
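The classification loss of claim 2 is stated over one-hot labels L_i and the network's output value; the formula itself is available only as an image, so the standard cross-entropy form below is a hedged reading, not a verbatim reproduction.

```python
import numpy as np

def classification_loss(one_hot_labels, predictions, eps=1e-12):
    """Cross-entropy between one-hot labels L_i and network output values.

    Assumed form of the claim's loss: the original formula is shown only
    as an image. eps guards against log(0).
    """
    predictions = np.clip(predictions, eps, 1.0)
    return -np.sum(one_hot_labels * np.log(predictions))

L = np.array([0.0, 1.0, 0.0])   # one-hot classification label
p = np.array([0.1, 0.8, 0.1])   # network output (e.g. softmax scores)
loss = classification_loss(L, p)  # -log(0.8)
```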
3. The method of claim 2, wherein classifying the objects in the sample comprises the steps of:
calculating the IOU between the target frame output by the target frame output layer and the real target frame of the target in the sample according to the following formula:
(IOU formula not reproduced: available only as an image in the original)
wherein the first pair of points (not reproduced: available only as an image in the original) are the upper-left and lower-right points of the real target frame of the target in the sample, and the second pair are the upper-left and lower-right points of the target frame output by the target frame output layer;
determining the classification of the target in the sample according to the IOU, and marking a classification label:
(formula not reproduced: available only as an image in the original)
wherein neg_class denotes a negative sample frame, pos_class denotes a positive sample frame, and thresh is the threshold for distinguishing positive from negative sample frames.
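The IOU computation and the positive/negative labelling of claim 3 can be illustrated as follows. The standard corner-point IOU is assumed, and since the text does not say whether the comparison with thresh is strict, `>=` below is an arbitrary choice.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two frames given as (x1, y1, x2, y2),
    with (x1, y1) the upper-left and (x2, y2) the lower-right point,
    matching the points named in claim 3."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def classify_sample(iou_value, thresh):
    """Label a predicted frame positive or negative by its IOU (>= assumed)."""
    return "pos_class" if iou_value >= thresh else "neg_class"

# Real frame (0,0)-(2,2) vs predicted frame (1,1)-(3,3):
# intersection 1, union 7
v = iou((0, 0, 2, 2), (1, 1, 3, 3))
label = classify_sample(v, 0.5)
```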
4. The method of claim 2, wherein when the network model is trained using the samples, the loss is calculated for the target box output layer according to the following formula:
(formula not reproduced: available only as an image in the original)
wherein E_loss_frame is the loss of the target frame output layer.
5. The method of claim 1, wherein when the network model is trained, the feature map output by the target output layer is filtered according to the following formula:
(formula not reproduced: available only as an image in the original)
wherein f_stage_i_out is the feature map after screening, f_stage_i_last is the feature map output by the last target output layer, f_stage_i_(last-1) is the feature map output by the target output layer preceding the last one, and thresh'' is a screening threshold;
and traversing the screened feature map to pick out the coordinates of targets whose confidence exceeds the confidence threshold.
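A sketch of the screening step of claim 5. The rule combining f_stage_i_last and f_stage_i_(last-1) survives only as an image in the original, so element-wise multiplication of the two maps is assumed here purely for illustration.

```python
import numpy as np

def screen_targets(f_last, f_prev, conf_thresh):
    """Combine the last two target-output-layer maps and return the
    coordinates of locations above the confidence threshold.

    The combination rule is an assumption (element-wise product);
    the original formula is shown only as an image.
    """
    f_out = f_last * f_prev                  # hypothetical combination
    ys, xs = np.where(f_out > conf_thresh)   # traverse the screened map
    return list(zip(xs.tolist(), ys.tolist()))

f_last = np.array([[0.9, 0.2], [0.1, 0.8]])
f_prev = np.array([[0.8, 0.9], [0.9, 0.9]])
coords = screen_targets(f_last, f_prev, 0.5)
# keeps (x=0, y=0) and (x=1, y=1)
```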
6. The target detection method of claim 5, wherein when the network model is trained, the corresponding coordinates of the target in the sample are decoded from the coordinates of the screened target according to the following formula:
(formula not reproduced: available only as an image in the original)
wherein x, y are the coordinates of the screened target, x1_offset, y1_offset are the coordinates of the upper-left point of the target frame predicted by the target frame output layer, x2_offset, y2_offset are the coordinates of the lower-right point of the predicted target frame, box_x1, box_y1 are the coordinates of the upper-left point of the target frame output by the target frame output layer, box_x2, box_y2 are the coordinates of its lower-right point, and stride is the step size of the feature map relative to the sample;
and finally, performing a non-maximum suppression operation on the output target frames.
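The decoding of claim 6 maps a screened feature-map location (x, y) back to sample coordinates using the predicted offsets and the stride. The exact formula is available only as an image, so the grid-cell-plus-offset form below is an assumption.

```python
def decode_box(x, y, offsets, stride):
    """Decode a predicted frame back to sample coordinates.

    (x, y): screened target's feature-map coordinates.
    offsets: (x1_offset, y1_offset, x2_offset, y2_offset) from the
             target frame output layer.
    stride:  step size of the feature map relative to the sample.
    Assumed decode: anchor point scaled by stride plus predicted offsets.
    """
    x1o, y1o, x2o, y2o = offsets
    box_x1 = x * stride + x1o
    box_y1 = y * stride + y1o
    box_x2 = x * stride + x2o
    box_y2 = y * stride + y2o
    return box_x1, box_y1, box_x2, box_y2

box = decode_box(2, 3, (-4.0, -4.0, 4.0, 4.0), 8)
```

The decoded frames would then pass through a standard non-maximum suppression step, as the claim states.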
7. The method according to one of claims 1 to 6, wherein the network model has several levels of target detection, different levels being directed at targets of different sizes, each level comprising a target frame output layer and at least two target output layers.
8. The target detection method according to one of claims 1 to 6, wherein obtaining and preprocessing road picture data comprises the steps of:
selecting road picture data in a natural scene;
normalizing the road picture data according to the following formula:
(formula not reproduced: available only as an image in the original)
wherein (symbol not reproduced: available only as an image in the original) is the input data after normalization, and m is the road picture data;
and randomly scaling the normalized road picture data and randomly cropping the scaled data, the cropped blocks being 256 × 256 in size; if a cropped block contains no road target, the current block is used as a negative sample.
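The preprocessing of claim 8 can be sketched as follows. The normalization formula survives only as an image in the original; (m - 128) / 128, mapping 8-bit pixels to roughly [-1, 1], is assumed here, and the random-scaling step is omitted for brevity.

```python
import numpy as np

def preprocess(image, crop_size=256):
    """Normalize a road image and take one random 256x256 crop.

    Normalization (image - 128) / 128 is an assumption; the patent's
    formula is shown only as an image. Crops containing no road target
    would be kept as negative samples, per claim 8.
    """
    norm = (image.astype(np.float32) - 128.0) / 128.0  # hypothetical formula
    h, w = norm.shape[:2]
    top = np.random.randint(0, h - crop_size + 1)
    left = np.random.randint(0, w - crop_size + 1)
    return norm[top:top + crop_size, left:left + crop_size]

img = np.full((512, 512, 3), 255, dtype=np.uint8)  # dummy road picture
crop = preprocess(img)
```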
9. The method of one of claims 1 to 6, wherein the network model is trained using an Adam optimization method with a base learning rate of 0.001 and a training batch size of 25.
10. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the method of any one of claims 1 to 9 when executing the computer program.
11. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1 to 9.
CN202011129543.5A 2020-10-21 2020-10-21 Target detection method, computer equipment and readable storage medium Active CN111967452B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011129543.5A CN111967452B (en) 2020-10-21 2020-10-21 Target detection method, computer equipment and readable storage medium


Publications (2)

Publication Number Publication Date
CN111967452A CN111967452A (en) 2020-11-20
CN111967452B true CN111967452B (en) 2021-02-02

Family

ID=73387228

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011129543.5A Active CN111967452B (en) 2020-10-21 2020-10-21 Target detection method, computer equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111967452B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113283541B (en) * 2021-06-15 2022-07-22 无锡锤头鲨智能科技有限公司 Automatic floor sorting method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108256498A (en) * 2018-02-01 2018-07-06 上海海事大学 A kind of non power driven vehicle object detection method based on EdgeBoxes and FastR-CNN
US10223611B1 (en) * 2018-03-08 2019-03-05 Capital One Services, Llc Object detection using image classification models
CN111210439A (en) * 2019-12-26 2020-05-29 中国地质大学(武汉) Semantic segmentation method and device by suppressing uninteresting information and storage device




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A target detection method, computer device, and readable storage medium

Effective date of registration: 20230308

Granted publication date: 20210202

Pledgee: Fuyang sub branch of Bank of Hangzhou Co.,Ltd.

Pledgor: Hangzhou xiongmai integrated circuit technology Co.,Ltd.

Registration number: Y2023330000470

CP03 Change of name, title or address

Address after: 311422 4th floor, building 9, Yinhu innovation center, 9 Fuxian Road, Yinhu street, Fuyang District, Hangzhou City, Zhejiang Province

Patentee after: Zhejiang Xinmai Microelectronics Co.,Ltd.

Address before: 311400 4th floor, building 9, Yinhu innovation center, No.9 Fuxian Road, Yinhu street, Fuyang District, Hangzhou City, Zhejiang Province

Patentee before: Hangzhou xiongmai integrated circuit technology Co.,Ltd.