CN114842320A - Robot target detection method and system based on DW-SEnet model - Google Patents

Robot target detection method and system based on DW-SEnet model Download PDF

Info

Publication number
CN114842320A
CN114842320A
Authority
CN
China
Prior art keywords
feature
target detection
image
compression
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210265173.0A
Other languages
Chinese (zh)
Inventor
张洪 (Zhang Hong)
于源卓 (Yu Yuanzhuo)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangnan University
Original Assignee
Jiangnan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangnan University filed Critical Jiangnan University
Priority to CN202210265173.0A priority Critical patent/CN114842320A/en
Publication of CN114842320A publication Critical patent/CN114842320A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention relates to a robot target detection method based on a DW-SEnet model, which comprises: acquiring image information of a target to be detected and preprocessing it to obtain an input image; performing a depth separable convolution DW operation on the input image to extract image features; during forward propagation, using a compression excitation module SE to calculate importance weights for the different feature channels of the image features, and suppressing or enhancing the features of each channel according to these weights to obtain feature images; constructing a plurality of candidate detection frames from the feature images, locating the target to be detected, and classifying the images within the candidate detection frames according to their features to obtain a classification result; and eliminating redundant candidate detection frames through non-maximum suppression and labeling the classification results to obtain the target detection result. The target detection method provided by the invention has fewer model parameters, a higher detection speed and higher accuracy in image target detection.

Description

Robot target detection method and system based on DW-SEnet model
Technical Field
The invention relates to the technical field of robot vision detection, in particular to a robot target detection method and system based on a DW-SEnet model.
Background
Mobile robots are ubiquitous in daily life, and the demands on their intelligence keep growing. The successive appearance of various cameras has made robotic target detection possible. Target detection means finding the exact position of an object in an image and classifying it by some method.
Target detection for mobile robots differs from conventional target detection: most mobile robots do not have high-performance processors, and many even use a single-chip microcomputer as the processor. Networks with overly complex parameters are therefore unsuitable for target detection on mobile robots. Research in this area has mostly used a PC as the host computer for image processing and has concentrated on improving detection precision, but given the limited processing capacity of most mobile robots, a lightweight algorithm model is crucial. Among one-stage detectors, the YOLO series and the SSD algorithm are widely applied to mobile robots, and the SSD algorithm performs particularly well in terms of detection time and model size. The SSD algorithm uses a single neural network: the picture is fed in, bounding boxes are generated by the network, and the images inside them are classified. The SSD network is formed by adding extra layers to a modified VGG16 deep learning backbone. The fully connected layers of VGG16 are not needed for target detection, so they are removed and replaced with new convolutional layers for additional feature extraction. A 300 × 300 image is propagated forward through the network; as successive convolutions continuously reduce the feature-map size, bounding boxes of different but fixed sizes are generated, the network classifies the picture inside each bounding box, and redundant boxes are finally removed to produce the detection result. Because the SSD target detection algorithm has many network parameters and its classification network has a relatively high error rate, both detection speed and detection accuracy still need improvement.
Therefore, a mobile robot target detection method with few network parameters, short detection time and high detection accuracy is urgently needed.
Disclosure of Invention
The technical problem to be solved by the invention is therefore to overcome the above problems in the prior art by providing a robot target detection method and system based on a DW-SEnet model. DW convolution replaces conventional convolution, which reduces the computation and the number of parameters required and effectively increases the computation speed. The compression excitation module SE calculates weights for the different feature channels, and image features are suppressed or promoted according to these weights; suppressing unimportant features effectively improves the accuracy and speed of image classification, thereby effectively shortening the detection time and improving the detection accuracy.
In order to solve the technical problem, the invention provides a robot target detection method based on a DW-SEnet model, which comprises the following steps:
S1: acquiring image information of a target to be detected, and preprocessing the image information to obtain an input image;
S2: performing a depth separable convolution DW operation on the input image to extract image features;
S3: during forward propagation, using a compression excitation module SE to calculate importance weights for the different feature channels of the image features, and suppressing or enhancing the features of each channel according to these weights to obtain feature images;
S4: constructing a plurality of candidate detection frames from the feature images, locating the target to be detected, and classifying the images within the candidate detection frames according to their features to obtain a classification result;
S5: eliminating redundant candidate detection frames through non-maximum suppression, and labeling the classification result to obtain a target detection result.
In one embodiment of the present invention, in S1, an RGB image of the object to be detected is taken by the mobile robot using an RGB-D vision camera.
In one embodiment of the invention, the depth separable convolution DW and the compression excitation module SE constitute a DW-SEnet detection model.
In one embodiment of the invention, the detection model comprises 4 convolutional Block networks of the same type, and an Inception structure is added between the different Block network modules.
In an embodiment of the present invention, the Inception structure includes the compression excitation module SE, the compression excitation module SE includes a compression unit, and the method for performing the compression operation by the compression unit includes:
compressing the features along the spatial dimension, so that each two-dimensional feature channel is turned into a real number; this real number has a global receptive field and characterizes the global distribution of responses on that feature channel, so that layers close to the input also obtain a global receptive field. The compression formula is:
z_c = (1 / (w × h)) Σ_{i=1}^{w} Σ_{j=1}^{h} u_c(i, j)
where z_c is the global receptive field, w is the feature matrix length, h is the feature matrix width, u_c(i, j) is the convolution result of the c-th feature channel at position (i, j), i is the feature index along the length direction, and j is the feature index along the width direction.
In one embodiment of the present invention, the compression excitation module SE comprises an excitation unit, and the method for performing the excitation operation by the excitation unit includes:
generating a weight for each feature channel through a parameter w, where the parameter w is used to model the correlation between feature channels:
s = σ(g(z, w)) = σ(w_2 δ(w_1 z))
where s is the set of correlation weights of the different feature channels, σ(·) is the Sigmoid function, z is the result obtained by the compression unit, δ(·) is the ReLU activation function, and w_1 and w_2 are the coefficients to be learned by the fully connected layers;
and applying the output weights to each feature channel through a Reweight operation to complete the recalibration of the original features in the channel dimension.
In one embodiment of the present invention, the classification of the image in S4 employs a Softmax function.
In addition, the invention also provides a robot target detection system based on the DW-SEnet model, which comprises:
an image acquisition module, used for acquiring image information of a target to be detected and preprocessing the image information to obtain an input image;
a feature extraction module, used for performing a depth separable convolution DW operation on the input image to extract image features;
a weight excitation module, used for calculating, during forward propagation, importance weights for the different feature channels of the image features with a compression excitation module SE, and suppressing or enhancing the features of each channel according to these weights to obtain feature images;
a target positioning and classification module, used for constructing a plurality of candidate detection frames from the feature images, locating the target to be detected, and classifying the images within the candidate detection frames according to their features to obtain a classification result;
and a target detection module, used for eliminating redundant candidate detection frames through non-maximum suppression and labeling the classification results to obtain target detection results.
In an embodiment of the present invention, in the weight excitation module, the method for performing the compression operation by the compression excitation module SE includes:
compressing the features along the spatial dimension, so that each two-dimensional feature channel is turned into a real number; this real number has a global receptive field and characterizes the global distribution of responses on that feature channel, so that layers close to the input also obtain a global receptive field. The compression formula is:
z_c = (1 / (w × h)) Σ_{i=1}^{w} Σ_{j=1}^{h} u_c(i, j)
where z_c is the global receptive field, w is the feature matrix length, h is the feature matrix width, u_c(i, j) is the convolution result of the c-th feature channel at position (i, j), i is the feature index along the length direction, and j is the feature index along the width direction.
In an embodiment of the present invention, in the weight excitation module, the method for the compression excitation module SE to perform the excitation operation includes:
generating a weight for each feature channel through a parameter w, where the parameter w is used to model the correlation between feature channels:
s = σ(g(z, w)) = σ(w_2 δ(w_1 z))
where s is the set of correlation weights of the different feature channels, σ(·) is the Sigmoid function, z is the result obtained by the compression unit, δ(·) is the ReLU activation function, and w_1 and w_2 are the coefficients to be learned by the fully connected layers;
and applying the output weights to each feature channel through a Reweight operation to complete the recalibration of the original features in the channel dimension.
Compared with the prior art, the technical scheme of the invention has the following advantages:
the method is based on a ResNet model, and combines a depth separable convolution DW and a compression excitation module SE to construct a DW-SEnet detection model so as to realize the detection of the target; DW convolution is adopted to replace the traditional convolution, so that the calculation amount and parameters required by the operation are reduced, and the calculation speed is effectively improved; and weights of different characteristic channels are calculated through the compression excitation module SE, different image characteristics are inhibited or promoted according to the weights, unimportant characteristics are inhibited, and the accuracy and the speed of image classification are effectively improved, so that the detection time is effectively shortened, and the detection accuracy is improved.
Drawings
In order that the present disclosure may be more readily and clearly understood, reference will now be made in detail to the present disclosure, examples of which are illustrated in the accompanying drawings.
FIG. 1 is a schematic flow chart of the robot target detection method based on the DW-SEnet model according to the present invention.
FIG. 2 is a schematic diagram of a depth separable convolution DW in accordance with the present invention.
Fig. 3 is a schematic diagram of a compression excitation module SE according to the present invention.
Fig. 4 is a schematic diagram of the Re_SSD network structure in the present invention.
FIG. 5 is a schematic diagram of a detection model according to the present invention.
Detailed Description
The present invention is further described below in conjunction with the following figures and specific examples so that those skilled in the art may better understand the present invention and practice it, but the examples are not intended to limit the present invention.
Example one
Referring to fig. 1, an embodiment of the present invention provides a method for detecting a robot target based on a DW-SEnet model, including the following steps:
S1: acquiring image information of a target to be detected, and preprocessing the image information to obtain an input image;
S2: performing a depth separable convolution DW operation on the input image to extract image features;
S3: during forward propagation, using a compression excitation module SE to calculate importance weights for the different feature channels of the image features, and suppressing or enhancing the features of each channel according to these weights to obtain feature images;
S4: constructing a plurality of candidate detection frames from the feature images, locating the target to be detected, and classifying the images within the candidate detection frames according to their features to obtain a classification result;
S5: eliminating redundant candidate detection frames through non-maximum suppression, and labeling the classification result to obtain a target detection result.
In the robot target detection method based on the DW-SEnet model disclosed by the embodiment of the invention, the detection method is based on a ResNet model and combines the depth separable convolution DW and the compression excitation module SE to construct the DW-SEnet detection model.
The detection model comprises 4 convolutional Block networks of the same type, and an Inception structure is added between the different Block network modules.
In the robot target detection method based on the DW-SEnet model disclosed by the embodiment of the invention, the DW-SEnet detection model is constructed on the basis of a ResNet model by combining a depth separable convolution DW module and a compression excitation module SE, so as to detect the target. DW convolution replaces conventional convolution, which reduces the computation and the number of parameters required and effectively increases the computation speed. The compression excitation module SE calculates weights for the different feature channels, and image features are suppressed or promoted according to these weights; suppressing unimportant features effectively improves the accuracy and speed of image classification, thereby effectively shortening the detection time and improving the detection accuracy.
In the method for detecting a robot target based on the DW-SEnet model disclosed in the embodiment of the present invention, in implementation S1, the mobile robot uses an RGB-D vision camera to capture RGB images of the target to be detected, and preprocesses the RGB images to obtain input images whose size meets the requirements. The preprocessing operations include resizing and padding.
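As a concrete illustration of this preprocessing step, the sketch below resizes an RGB image and pads it onto a square canvas; the 300 × 300 target size (taken from the SSD input size mentioned in the background), the function name and the use of OpenCV are illustrative assumptions rather than details specified by the patent.

```python
import numpy as np
import cv2  # OpenCV is assumed available for image handling

def preprocess(rgb_image: np.ndarray, target_size: int = 300) -> np.ndarray:
    """Resize the longer side to `target_size` and pad the remainder (illustrative sketch)."""
    h, w = rgb_image.shape[:2]
    scale = target_size / max(h, w)
    resized = cv2.resize(rgb_image, (int(round(w * scale)), int(round(h * scale))))
    canvas = np.zeros((target_size, target_size, 3), dtype=resized.dtype)  # padded square canvas
    canvas[:resized.shape[0], :resized.shape[1]] = resized                 # paste into the top-left corner
    return canvas
```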
In the robot target detection method based on the DW-SEnet model disclosed in the embodiment of the present invention, for implementation S2, please refer to Fig. 2. Unlike the conventional convolution operation, in DW convolution one convolution kernel is responsible for one channel, and one channel is convolved by only one convolution kernel. In conventional convolution, each kernel operates on every channel of the input image simultaneously. For a 5 × 5 pixel, three-channel color input image (shape 5 × 5 × 3), DW convolution first performs a convolution that, unlike conventional convolution, is carried out entirely in a two-dimensional plane. The number of convolution kernels equals the number of channels of the previous layer (channels and kernels correspond one to one), so a three-channel image produces 3 feature images after this operation.
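The following PyTorch sketch illustrates the depthwise-then-pointwise structure described above; the kernel size, channel counts and class name are illustrative assumptions, not values taken from the patent.

```python
import torch
import torch.nn as nn

class DWSeparableConv(nn.Module):
    """Depthwise separable convolution: per-channel 3x3 convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # groups=in_channels makes each kernel operate on exactly one input channel
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   padding=1, groups=in_channels, bias=False)
        # the 1x1 convolution mixes the per-channel outputs into out_channels feature maps
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

# Example: a 5x5, three-channel input as in the description above
x = torch.randn(1, 3, 5, 5)
y = DWSeparableConv(3, 16)(x)  # -> shape (1, 16, 5, 5)
```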
In the robot target detection method based on the DW-SEnet model disclosed in the embodiment of the present invention, for implementation S3, please refer to Fig. 3. The input is x with c1 feature channels; after successive convolution and pooling, the number of channels of the feature matrix becomes c2, and a compression-excitation operation is then performed to obtain the importance of the different feature channels.
A compression operation is performed first: the features are compressed along the spatial dimension, so that each two-dimensional feature channel is turned into a real number. This real number has a global receptive field and characterizes the global distribution of responses on that feature channel, so that layers close to the input can also obtain a global receptive field. The compression formula is:
z_c = (1 / (w × h)) Σ_{i=1}^{w} Σ_{j=1}^{h} u_c(i, j)
where z_c is the global receptive field, w is the feature matrix length, h is the feature matrix width, u_c(i, j) is the convolution result of the c-th feature channel at position (i, j), i is the feature index along the length direction, and j is the feature index along the width direction.
Next, an excitation operation is performed; this is a mechanism similar to a gate in a recurrent neural network. A weight is generated for each feature channel through a parameter w, where w is learned to explicitly model the correlation between feature channels:
s = σ(g(z, w)) = σ(w_2 δ(w_1 z))
where s is the set of correlation weights of the different feature channels, σ(·) is the Sigmoid function, z is the result obtained by the compression unit, δ(·) is the ReLU activation function, and w_1 and w_2 are the coefficients to be learned by the fully connected layers; because two fully connected layers are used in the excitation operation, w is split into w_1 and w_2.
Finally, a Reweight operation is performed: the weights output by the excitation are the importance weights of each feature channel, and they are applied to each feature channel to complete the recalibration of the original features in the channel dimension.
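The compression, excitation and Reweight steps described above can be sketched in PyTorch as follows; the reduction ratio of 16 and the class name are assumptions made for illustration and are not specified by the patent.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze (global average pool), excitation (two fully connected layers), and reweight."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)           # z_c: average over the w x h plane per channel
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # w_1
            nn.ReLU(inplace=True),                       # delta
            nn.Linear(channels // reduction, channels),  # w_2
            nn.Sigmoid(),                                # sigma
        )

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = u.shape
        z = self.squeeze(u).view(b, c)                   # compression: one real number per channel
        s = self.excite(z).view(b, c, 1, 1)              # excitation: per-channel importance weight
        return u * s                                     # reweight: rescale each feature channel

x = torch.randn(2, 64, 38, 38)
out = SEBlock(64)(x)  # same shape; channels are suppressed or enhanced by the learned weights
```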
In the robot target detection method based on the DW-SEnet model disclosed in the embodiment of the invention, for implementation S4, the default bounding boxes of the feature maps are predicted against the ground truth, which removes feature resampling and effectively improves the running speed of the algorithm. Operations such as horizontal flipping and cropping are applied to the input image to improve prediction accuracy. Because targets in a picture have different sizes, default boxes with different length-width ratios are used. Assuming there are m feature maps, the scale of the default bounding boxes of the k-th feature map is:
s_k = s_min + (s_max − s_min)(k − 1) / (m − 1)
where s_min is 0.2 and s_max is 0.9, i.e. the feature scale of the lowest layer is 0.2 and that of the highest layer is 0.9; k ∈ [1, m] is the index of the feature map, and m is the total number of feature maps. Meanwhile, different length-width ratios a ∈ {1, 2, 3, 1/2, 1/3} are set for the default bounding boxes; from the ratio a and the scale s_k, the width and height of each candidate detection frame are calculated as
w_k^a = s_k · √a and h_k^a = s_k / √a.
When the length-width ratio of the default bounding box is 1, an additional default bounding box with scale
s_k' = √(s_k · s_{k+1})
is used.
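Under the formulas above, the per-layer scales and default-box sizes can be computed as in the following sketch; the choice of m = 6 feature maps and the treatment of the extra aspect-ratio-1 box on the last layer (using 1.0 as the next scale) are illustrative assumptions.

```python
import math

def default_box_sizes(m: int = 6, s_min: float = 0.2, s_max: float = 0.9,
                      aspect_ratios=(1.0, 2.0, 3.0, 0.5, 1.0 / 3.0)):
    """Return, for each of the m feature maps, a list of (width, height) default-box sizes."""
    scales = [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]
    boxes_per_layer = []
    for k, s_k in enumerate(scales, start=1):
        boxes = [(s_k * math.sqrt(a), s_k / math.sqrt(a)) for a in aspect_ratios]
        # extra box for aspect ratio 1: geometric mean of this scale and the next one
        s_next = scales[k] if k < m else 1.0
        boxes.append((math.sqrt(s_k * s_next),) * 2)
        boxes_per_layer.append(boxes)
    return boxes_per_layer
```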
In the robot target detection method based on the DW-SEnet model disclosed in the embodiment of the present invention, for implementation S5, all detected candidate detection frames are sorted by their scores. The detection frame A with the highest score is selected, a threshold b is set, and the intersection over union (IoU) between A and each of the remaining detection frames is calculated; any frame whose IoU with A exceeds the threshold b, i.e. a frame with a high overlap rate, is deleted. Frames that do not overlap the current frame, or whose overlap is very small (IoU below the threshold b), remain unprocessed; these are re-sorted, the frame with the highest score among them is selected, the IoU values between it and the other remaining frames are computed, and frames whose IoU exceeds the threshold are deleted again. This process is iterated until all detection frames have been processed, and the final detection result is output.
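The pruning procedure described above corresponds to standard non-maximum suppression; a plain-Python sketch is given below, with the IoU threshold b left as a parameter because the patent does not fix its value.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, b: float = 0.5):
    """Keep the highest-scoring boxes, discarding any box whose IoU with a kept box exceeds b."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= b]
    return keep
```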
In order to further illustrate the beneficial effects of the present invention, simulation experiments were performed as follows:
under the experimental environment of a Window10 system, a video card GTX1650, a CPU i5-9300H and a memory 16G; pythroch is used as a deep learning frame, a target detection standard dataset Pascal Voc dataset is adopted, and a mobile ROBOT common dataset is recorded to form a VOC _ ROBOT dataset, wherein 600 pictures are used as a verification set, and 1300 pictures are used as a training set. And respectively comparing the model with other common models in terms of parameter size, detection accuracy and detection time.
In order to compare the model parameters, the network models trained on YOLOv3, SSD, and DW _ SENet were compared and adjusted, and the specific results are shown in table 1.
TABLE 1 comparison of model parameters
Network model Parameter size/M
YOLOv3 235
SSD 148
DW_SENet 109
As can be seen from Table 1, the proposed model occupies a noticeably smaller space, only about 50% of the size of the mainstream YOLOv3 target detection network. The model adopted by the invention uses ResNet as the basic classification network for target detection; compared with the VGG16 used by the SSD target detection model, it not only has fewer parameters but also higher accuracy, and the DW convolution further greatly reduces the space occupied by the network model.
To test the detection accuracy of the target detection models, the Pascal VOC2012 dataset is first used to train the common networks DW_SENet, SSD, YOLO, YOLOv2, YOLOv3 and Faster RCNN; the results obtained are shown in Table 2:
TABLE 2 accuracy of different network models
Network model mAP
FasterRCNN 0.708
YOLO 0.518
YOLOv2 0.704
YOLOv3 0.713
SSD 0.701
DW_SENet 0.711
According to the experimental results, the proposed method achieves relatively high target detection accuracy among the common algorithms compared. On the VOC2012 dataset, the improved algorithm achieves accuracy comparable to the mainstream algorithms while having a smaller parameter model.
The SSD target detection algorithm and the algorithm of the invention are trained on the VOC_ROBOT dataset, and the training results are shown in Table 3:
TABLE 3 VOC _ ROBOT data set training result accuracy
(Table 3 appears as an image in the original publication; its contents are not available in text form.)
As can be seen from Table 3, on the common mobile-robot dataset the accuracy of the invention for the various targets is improved to a certain extent compared with the conventional SSD algorithm. The detection time for the same pictures is then compared with that of the conventional algorithms in the same environment: 5 pictures are selected for detection, each group is detected 150 times, and the average time is taken for comparison. The results are shown in Table 4.
TABLE 4 comparison of the test times of different models
(Table 4 appears as an image in the original publication; its contents are not available in text form.)
Through comparison, the detection time of the network model adopted by the invention is shorter than that of the conventional target detection algorithm.
In summary, the target detection method provided by the invention has fewer model parameters, a higher detection speed and higher accuracy in image target detection.
Example two
In the following, a robot target detection system based on a DW-SEnet model disclosed in the second embodiment of the present invention is introduced, and a robot target detection system based on a DW-SEnet model described below and a robot target detection method based on a DW-SEnet model described above may be referred to correspondingly.
The embodiment II of the invention discloses a robot target detection system based on a DW-SEnet model, which comprises:
an image acquisition module, used for acquiring image information of a target to be detected and preprocessing the image information to obtain an input image;
a feature extraction module, used for performing a depth separable convolution DW operation on the input image to extract image features;
a weight excitation module, used for calculating, during forward propagation, importance weights for the different feature channels of the image features with a compression excitation module SE, and suppressing or enhancing the features of each channel according to these weights to obtain feature images;
a target positioning and classification module, used for constructing a plurality of candidate detection frames from the feature images, locating the target to be detected, and classifying the images within the candidate detection frames according to their features to obtain a classification result;
and a target detection module, used for eliminating redundant candidate detection frames through non-maximum suppression and labeling the classification results to obtain target detection results.
In an embodiment of the present invention, in the weight excitation module, the method for performing the compression operation by the compression excitation module SE includes:
compressing the features along the spatial dimension, so that each two-dimensional feature channel is turned into a real number; this real number has a global receptive field and characterizes the global distribution of responses on that feature channel, so that layers close to the input also obtain a global receptive field. The compression formula is:
z_c = (1 / (w × h)) Σ_{i=1}^{w} Σ_{j=1}^{h} u_c(i, j)
where z_c is the global receptive field, w is the feature matrix length, h is the feature matrix width, u_c(i, j) is the convolution result of the c-th feature channel at position (i, j), i is the feature index along the length direction, and j is the feature index along the width direction.
In an embodiment of the present invention, in the weight excitation module, the method for the compression excitation module SE to perform the excitation operation includes:
generating a weight for each feature channel through a parameter w, where the parameter w is used to model the correlation between feature channels:
s = σ(g(z, w)) = σ(w_2 δ(w_1 z))
where s is the set of correlation weights of the different feature channels, σ(·) is the Sigmoid function, z is the result obtained by the compression unit, δ(·) is the ReLU activation function, and w_1 and w_2 are the coefficients to be learned by the fully connected layers;
and applying the output weights to each feature channel through a Reweight operation to complete the recalibration of the original features in the channel dimension.
The robot target detection system based on the DW-SEnet model of this embodiment is used to implement the aforementioned robot target detection method based on the DW-SEnet model; its specific implementation can therefore be found in the embodiment sections of the method above and is not described again here.
In addition, since the robot target detection system based on the DW-SEnet model of this embodiment is used to implement the aforementioned method, its role corresponds to that of the method, and details are omitted here.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be understood that the above examples are given only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to enumerate all embodiments here, and obvious variations or modifications may be made without departing from the spirit or scope of the invention.

Claims (10)

1. A robot target detection method based on a DW-SEnet model is characterized by comprising the following steps:
S1: acquiring image information of a target to be detected, and preprocessing the image information to obtain an input image;
S2: performing a depth separable convolution DW operation on the input image to extract image features;
S3: during forward propagation, using a compression excitation module SE to calculate importance weights for the different feature channels of the image features, and suppressing or enhancing the features of each channel according to these weights to obtain feature images;
S4: constructing a plurality of candidate detection frames from the feature images, locating the target to be detected, and classifying the images within the candidate detection frames according to their features to obtain a classification result;
S5: eliminating redundant candidate detection frames through non-maximum suppression, and labeling the classification result to obtain a target detection result.
2. The DW-SEnet model-based robot target detection method of claim 1, wherein: in S1, an RGB image of the object to be detected is taken by the mobile robot using an RGB-D vision camera.
3. The DW-SEnet model-based robot target detection method of claim 1, wherein: the depth separable convolution DW and the compression excitation module SE constitute a DW-SEnet detection model.
4. The DW-SEnet model-based robot target detection method of claim 3, characterized in that: the detection model comprises 4 convolutional Block networks of the same type, and an Inception structure is added between the different Block network modules.
5. The DW-SEnet model-based robot target detection method of claim 4, wherein the Inception structure comprises the compression excitation module SE, the compression excitation module SE comprises a compression unit, and the method for performing the compression operation by the compression unit comprises:
compressing the features along the spatial dimension, so that each two-dimensional feature channel is turned into a real number; this real number has a global receptive field and characterizes the global distribution of responses on that feature channel, so that layers close to the input also obtain a global receptive field. The compression formula is:
z_c = (1 / (w × h)) Σ_{i=1}^{w} Σ_{j=1}^{h} u_c(i, j)
where z_c is the global receptive field, w is the feature matrix length, h is the feature matrix width, u_c(i, j) is the convolution result of the c-th feature channel at position (i, j), i is the feature index along the length direction, and j is the feature index along the width direction.
6. The DW-SEnet model-based robot target detection method of claim 5, wherein the compression excitation module SE comprises an excitation unit, and the method for performing the excitation operation by the excitation unit comprises:
generating a weight for each feature channel through a parameter w, where the parameter w is used to model the correlation between feature channels:
s = σ(g(z, w)) = σ(w_2 δ(w_1 z))
where s is the set of correlation weights of the different feature channels, σ(·) is the Sigmoid function, z is the result obtained by the compression unit, δ(·) is the ReLU activation function, and w_1 and w_2 are the coefficients to be learned by the fully connected layers;
and applying the output weights to each feature channel through a Reweight operation to complete the recalibration of the original features in the channel dimension.
7. The DW-SEnet model-based robot target detection method of claim 1, wherein: the classification of the image in S4 uses the Softmax function.
8. A robot target detection system based on a DW-SEnet model is characterized by comprising:
an image acquisition module, used for acquiring image information of a target to be detected and preprocessing the image information to obtain an input image;
a feature extraction module, used for performing a depth separable convolution DW operation on the input image to extract image features;
a weight excitation module, used for calculating, during forward propagation, importance weights for the different feature channels of the image features with a compression excitation module SE, and suppressing or enhancing the features of each channel according to these weights to obtain feature images;
a target positioning and classification module, used for constructing a plurality of candidate detection frames from the feature images, locating the target to be detected, and classifying the images within the candidate detection frames according to their features to obtain a classification result;
and a target detection module, used for eliminating redundant candidate detection frames through non-maximum suppression and labeling the classification results to obtain target detection results.
9. The DW-SEnet model-based robot target detection system of claim 8, wherein in the weight excitation module, the method of the compression excitation module SE performing a compression operation comprises:
compressing the features along the spatial dimension, so that each two-dimensional feature channel is turned into a real number; this real number has a global receptive field and characterizes the global distribution of responses on that feature channel, so that layers close to the input also obtain a global receptive field. The compression formula is:
z_c = (1 / (w × h)) Σ_{i=1}^{w} Σ_{j=1}^{h} u_c(i, j)
where z_c is the global receptive field, w is the feature matrix length, h is the feature matrix width, u_c(i, j) is the convolution result of the c-th feature channel at position (i, j), i is the feature index along the length direction, and j is the feature index along the width direction.
10. The DW-SEnet model-based robot target detection system of claim 9, wherein in the weight excitation module, the compression excitation module SE comprises an excitation unit, and the method for performing the excitation operation by the excitation unit comprises:
generating a weight for each feature channel through a parameter w, where the parameter w is used to model the correlation between feature channels:
s = σ(g(z, w)) = σ(w_2 δ(w_1 z))
where s is the set of correlation weights of the different feature channels, σ(·) is the Sigmoid function, z is the result obtained by the compression unit, δ(·) is the ReLU activation function, and w_1 and w_2 are the coefficients to be learned by the fully connected layers;
and applying the output weights to each feature channel through a Reweight operation to complete the recalibration of the original features in the channel dimension.
CN202210265173.0A 2022-03-17 2022-03-17 Robot target detection method and system based on DW-SEnet model Pending CN114842320A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210265173.0A CN114842320A (en) 2022-03-17 2022-03-17 Robot target detection method and system based on DW-SEnet model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210265173.0A CN114842320A (en) 2022-03-17 2022-03-17 Robot target detection method and system based on DW-SEnet model

Publications (1)

Publication Number Publication Date
CN114842320A true CN114842320A (en) 2022-08-02

Family

ID=82562338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210265173.0A Pending CN114842320A (en) 2022-03-17 2022-03-17 Robot target detection method and system based on DW-SEnet model

Country Status (1)

Country Link
CN (1) CN114842320A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108427920A (en) * 2018-02-26 2018-08-21 杭州电子科技大学 A kind of land and sea border defense object detection method based on deep learning
CN110569901A (en) * 2019-09-05 2019-12-13 北京工业大学 Channel selection-based countermeasure elimination weak supervision target detection method
CN112734732A (en) * 2021-01-11 2021-04-30 石家庄铁道大学 Railway tunnel leaky cable clamp detection method based on improved SSD algorithm
CN112836651A (en) * 2021-02-04 2021-05-25 浙江理工大学 Gesture image feature extraction method based on dynamic fusion mechanism
US20210365716A1 (en) * 2018-09-11 2021-11-25 Intel Corporation Method and system of deep supervision object detection for reducing resource usage
CN113822185A (en) * 2021-09-09 2021-12-21 安徽农业大学 Method for detecting daily behavior of group health pigs
CN114005003A (en) * 2021-12-09 2022-02-01 齐齐哈尔大学 Remote sensing scene image classification method based on channel multi-packet fusion
CN114120019A (en) * 2021-11-08 2022-03-01 贵州大学 Lightweight target detection method


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
XIAOFEI CHAO et al.: "Construction of Apple Leaf Diseases Identification Networks Based on Xception Fused by SE Module", pages 1-14 *
常虹 (Chang Hong): 《机器学习应用视角》 [Machine Learning: An Application Perspective], China Machine Press, pages 283-284 *
邬可 (Wu Ke) et al.: "基于压缩激励残差网络与特征融合的行人重识别" [Person re-identification based on squeeze-and-excitation residual network and feature fusion], vol. 57, no. 18, pages 97-103 *

Similar Documents

Publication Publication Date Title
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN110930454B (en) Six-degree-of-freedom pose estimation algorithm based on boundary box outer key point positioning
CN108416266B (en) Method for rapidly identifying video behaviors by extracting moving object through optical flow
CN107229904B (en) Target detection and identification method based on deep learning
WO2020228446A1 (en) Model training method and apparatus, and terminal and storage medium
CN114202672A (en) Small target detection method based on attention mechanism
CN111738344B (en) Rapid target detection method based on multi-scale fusion
CN113469073A (en) SAR image ship detection method and system based on lightweight deep learning
CN111627050B (en) Training method and device for target tracking model
CN110222718B (en) Image processing method and device
CN110610210B (en) Multi-target detection method
CN111768415A (en) Image instance segmentation method without quantization pooling
CN110532959B (en) Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network
CN110930378A (en) Emphysema image processing method and system based on low data demand
CN111242026A (en) Remote sensing image target detection method based on spatial hierarchy perception module and metric learning
CN114627290A (en) Mechanical part image segmentation algorithm based on improved DeepLabV3+ network
CN112308825A (en) SqueezeNet-based crop leaf disease identification method
US20220215617A1 (en) Viewpoint image processing method and related device
CN116863194A (en) Foot ulcer image classification method, system, equipment and medium
CN115272691A (en) Training method, recognition method and equipment for steel bar binding state detection model
CN112132145B (en) Image classification method and system based on model extended convolutional neural network
CN113033371A (en) CSP model-based multi-level feature fusion pedestrian detection method
CN112132207A (en) Target detection neural network construction method based on multi-branch feature mapping
CN117037215A (en) Human body posture estimation model training method, estimation device and electronic equipment
CN109583584B (en) Method and system for enabling CNN with full connection layer to accept indefinite shape input

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination