CN113705359B - Multi-scale clothes detection system and method based on drum images of washing machine

Info

Publication number: CN113705359B
Application number: CN202110883847.9A
Authority: CN
Inventors: 陈莹; 郑棨元; 化春键; 胡蒙; 裴佩
Original assignee: Jiangnan University
Current assignee: Jiangnan University
Priority date: 2021-08-03
Filing date: 2021-08-03
Publication date: 2024-05-03
Anticipated expiration: 2041-08-03
Also published as: CN113705359A

Abstract

The invention discloses a multi-scale clothes detection system and method based on drum images of a washing machine, belonging to the technical field of 2D image target detection. The system comprises an improved ResNet network module, a feature enhancement module (SRM), a dynamic receptive field (DRF) module and a dynamic deformable convolution (DDH) module. During clothes detection, high-quality shallow features are first obtained using the improved ResNet network module and the SRM module, and a regression operation on these shallow features preserves the positioning information of the clothes targets to the maximum extent. The DRF module constructs a pyramid structure with stronger semantic information, so that clothes targets are classified and their positions further calibrated while the features of every scale are fully used. The offsets that the DDH module applies to the detection frames enrich the diversity of prediction scales. The invention effectively improves the identification and classification of clothes in a drum washing machine, improves clothes detection precision, and can be applied to detection scenes with complex clothes in a washing machine.

Description

Multi-scale clothes detection system and method based on drum images of washing machine

Technical Field

The invention relates to a multi-scale clothes detection system and method based on a drum image of a washing machine, and belongs to the technical field of target detection of 2D images.

Background

A traditional washing machine has no intelligent clothes-recognition function, so the user must set the washing mode manually according to the known clothes type and personal experience. At the global embedded systems exhibition held in Nuremberg, Germany, NXP Semiconductors presented a demonstration model of an intelligent washing machine based on RFID and NFC technology; the machine reads information such as fabric fiber type and color from buttons with built-in RFID tags and optimizes the washing program accordingly, but this technology requires the clothes to be modified. Another approach places a high-definition camera in the washing machine, acquires images of the clothes to be washed, and converts the task into problems of image segmentation and texture image classification: the quantity of clothes and information about the clothes in the washing machine are obtained by designing an image segmentation algorithm and a texture image classification algorithm based on convolutional neural networks. However, this scheme requires two deep convolutional neural networks, an image segmentation network and an image classification network, and has high computational complexity; moreover, the clothes are arranged manually and regularly rather than in the natural state in which various clothes occlude one another inside the washing machine, so the scheme is not suitable for an actual washing scene.

The advent of deep-learning object detection makes it possible to learn the image features of laundry directly with a single network and to locate laundry objects in the image based on those features. The technology is widely applied in fields such as pedestrian detection, vehicle detection, face detection and retail commodity detection; it is also a prerequisite for tracking and other high-level vision applications, and has huge market demand and application value.

Current target detection techniques fall into two main categories:

(1) Two-stage algorithms, mainly R-CNN and its variants, rely on generated region proposals: a region proposal network (RPN) generates candidate boxes that may contain targets, and a subsequent detection network predicts the category and adjusts the position coordinates of each candidate box. The two-stage structure makes the generation of positive and negative samples more balanced and achieves excellent detection precision through this secondary correction, but it is slow.

(2) Single-stage algorithms divide the picture into a grid of smaller cells, each with fixed preset prior boxes (anchors); objects in the picture are assigned to different cells and then classified, so the categories and positions of different objects can be predicted directly with a single CNN. Execution speed is excellent, but precision is lower.

However, there is no detection network specifically adapted to washing machine laundry images, so the most commonly used general-purpose target detection networks are applied instead, such as the two-stage network Faster R-CNN and the single-stage YOLO series (Chen Yaya, Meng Chaohui. FashionAI garment attribute identification based on target detection algorithm [J]. Computer System Application, 2019.). Because clothing is rich in detail, the similarity between attributes is high, and external interference such as illumination seriously affects identification and classification accuracy, some design details of general target detection frameworks directly limit how well clothing attributes are identified and classified. In addition, general target detection models cannot match multi-scale targets well under the scale variability caused by irregular placement of clothes in the drum, which easily leads to inaccurate positioning.

Disclosure of Invention

In order to solve the problems of weak positioning and classifying capability and low recognition accuracy of the existing clothes detection method of the washing machine, the invention firstly provides a multi-scale clothes detection system based on a drum image of the washing machine, which comprises the following components:

the improved ResNet network module, the feature enhancement module SRM, the dynamic receptive field DRF module and the dynamic deformable convolution DDH module;

The improved ResNet network module is connected with the feature enhancement module SRM, a four-layer multi-scale pyramid structure is constructed on the basis of the output features of the feature enhancement module SRM, and the dynamic receptive field DRF module is used for connecting all feature layers of the four-layer multi-scale pyramid; the dynamic deformable convolution DDH module is connected with the dynamic receptive field DRF module;

The DRF module comprises multi-branch convolutions of different sizes.

Optionally, the improved ResNet network module includes:

A 2D convolution layer with a 7×7 convolution kernel and a stride of 1, a maximum pooling layer with a 3×3 kernel and a stride of 2, and 4 convolution layers connected in series; each of the 4 convolution layers is formed by stacking residual blocks, the numbers of residual blocks being 3, 4, 23 and 3 respectively, and the output features are taken from the third layer and the fourth layer of the 4 convolution layers.

Optionally, the method detects laundry in a washing machine by using the multi-scale laundry detection system based on drum images of the washing machine according to any one of claims 1-2, the method comprising:

Step one: preprocessing an input washing machine drum image;

Step two: performing feature extraction on the drum image preprocessed in step one by using the improved ResNet network module, and outputting feature layers with 8× and 16× downsampling rates;

Step three: sending the feature layers extracted in step two into the feature enhancement module SRM to aggregate information, so as to obtain shallow features with stronger characterization capability;

Step four: inputting the shallow features obtained in step three into the four-layer multi-scale pyramid structure, passing them through a DRF module between the layers of the pyramid, and finally obtaining the output features of each feature layer of the pyramid;

Step five: carrying out a multi-scale regression operation on the shallow features obtained in step three, and coarsely positioning the clothes by using the shallow feature information to obtain prediction frames;

Step six: using the dynamic deformable convolution DDH module to offset the output features of each feature layer of the pyramid of step four;

Step seven: taking the prediction frames obtained in step five as the default frames of each feature layer of the four-layer multi-scale pyramid, and adjusting the default frames by using the offsets generated by the DDH module in step six;

Step eight: performing secondary regression and classification by using the DDH module;

Step nine: combining the regression loss functions of step five and step eight and training them jointly, and finally outputting the classification and accurate positioning information of the clothes.

Optionally, the step three of aggregating information includes:

where $S_3$ is the output feature of the third layer of the improved ResNet101 network at an 8× downsampling rate, $S_4$ is the output feature of the fourth layer of the improved ResNet101 network at a 16× downsampling rate, $f_{k\times k}(\cdot)$ is a $k \times k$ convolution operation, the addition operator denotes element-wise addition, $C(\cdot)$ is channel stacking, $U(\cdot)$ is the upsampling operation, and $y$ is the output feature that aggregates the features of the two layers at an 8× downsampling rate.

Optionally, the calculating of the DRF module in the fourth step includes:

where $x$ is the output feature of the preceding layer for each layer of the pyramid structure, the convolution term is a $k \times k$ convolution with dilation rate $r$, $i$ denotes the $i$-th branch of the DRF module, $W_1[i]$ and $W_2[i]$ are weight parameters learned by the network on the $i$-th branch, the stacking term represents a stack of $n+1$ feature maps, and $U$ is the output feature of the DRF module.

Optionally, the multi-scale regression algorithm of the fifth step includes:

S1: applying max pooling to the output feature $y$ of step three to obtain four scales consistent with the four-layer pyramid features of step four;

$D_k = f_{3\times 3}(M^k(y)),\quad k = 0, 1, 2, 3$

where $M^k(\cdot)$ denotes $k$ successive max pooling operations, giving a downsampling rate of $2^{3+k}$; $D_k$ is the output feature, whose number of channels is $N_{box} \times 4$, representing the 4 offsets (center and width-height) of the $N_{box}$ default frames configured at each pixel of the output feature $D_k$;

S2: concatenating the prediction results of each $D_k$ to obtain an integrated vector $l$ of the prediction results;

S3: applying the smooth L1 function to $l$ as the regression loss:

where $cx, cy, w, h$ are the center and width-height coordinates of the default box, $N$ is the total number of default boxes, $l$ is the integrated vector of all $D_k$ prediction results, representing the 4 predicted offsets of all $N$ default boxes, and the target vector holds the 4 offsets of the corresponding known real box relative to the default box;

S4: during training, the network back-propagates according to the loss function of S3, reducing the difference between $l$ and the target offsets, so that a more accurate integrated prediction vector $l$ is finally obtained.
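The regression loss of S3 can be illustrated with a short Python sketch. It uses the usual definition of smooth L1 (quadratic below 1, linear above) and, as an assumption not stated in the text, averages over the N default boxes; the names `smooth_l1`, `regression_loss` and `target` are illustrative only.

```python
def smooth_l1(x):
    """Standard smooth L1: 0.5*x^2 if |x| < 1, otherwise |x| - 0.5."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def regression_loss(pred, target):
    """Smooth L1 over the 4 offsets (cx, cy, w, h) of every default box.

    pred and target are lists of 4-tuples, one per default box.
    Assumption: the sum is divided by the number of boxes N.
    """
    n = len(pred)
    total = 0.0
    for p_box, t_box in zip(pred, target):
        for p, t in zip(p_box, t_box):
            total += smooth_l1(p - t)
    return total / n


# Two default boxes, each with one non-zero offset error.
pred = [(0.5, 0.0, 0.0, 0.0), (2.0, 0.0, 0.0, 0.0)]
target = [(0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)]
loss = regression_loss(pred, target)
```

Small errors are penalized quadratically and large ones linearly, which keeps outlier boxes from dominating the gradient during the coarse positioning stage.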

Optionally, the calculating of the DDH module in the step six includes:

where $R$ defines the region and relative positions of the receptive field, centered on the coordinate $(0, 0)$, $R = \{(-1,-1), (-1,0), \ldots, (0,1), (1,1)\}$; $p_n$ enumerates the positions listed in $R$; $w(\cdot)$ is the weight value at the corresponding position in the convolution kernel, $I(\cdot)$ is the input feature value at the corresponding position, and $O(\cdot)$ is the output feature value at the corresponding position. The offset $\Delta p_n$ is obtained by applying a 3×3 convolution to the $D_k$ obtained in S1 of step five; the number of output channels is $k \times k \times 2$, representing the offset parameters for each position in a $k \times k$ convolution kernel.
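The terms defined above match the usual deformable-convolution sum, in which the output at a position p0 is the sum over p_n in R of w(p_n) · I(p0 + p_n + Δp_n). A minimal Python sketch under that assumption follows; bilinear interpolation for fractional positions, zero padding outside the feature map, and all function names are illustrative assumptions rather than the patent's implementation.

```python
import math

# Regular 3x3 sampling grid R, centred on (0, 0), as (dy, dx) pairs.
R = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def bilinear(feature, y, x):
    """Bilinearly interpolate a 2D feature map at (y, x); zero outside."""
    h, w = len(feature), len(feature[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * feature[yy][xx]
    return value


def deformable_output(feature, kernel, offsets, p0):
    """Sum of w(p_n) * I(p0 + p_n + delta_p_n) over the grid R.

    kernel: 9 weights, offsets: 9 (dy, dx) pairs, p0: (row, col).
    """
    total = 0.0
    for n, (dy, dx) in enumerate(R):
        off_y, off_x = offsets[n]
        total += kernel[n] * bilinear(feature,
                                      p0[0] + dy + off_y,
                                      p0[1] + dx + off_x)
    return total


feature = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
ones = [1.0] * 9
plain = deformable_output(feature, ones, [(0, 0)] * 9, (1, 1))
shifted = deformable_output(feature, ones, [(0, 1)] * 9, (1, 1))
```

With zero offsets the result is an ordinary 3×3 convolution; shifting every sampling point one column to the right moves the whole sampling area, which is the freedom the DDH module exploits.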

Optionally, the formula for adjusting the default box in the step seven includes:

$cx^* = cx + \Delta p|_x + l^{cx} \times w$

$cy^* = cy + \Delta p|_y + l^{cy} \times h$

where $cx^*, cy^*, w^*, h^*$ are the center and width-height coordinates of the default frame after adjustment, $\Delta p|_x$ and $\Delta p|_y$ are the components of the DDH module offset $\Delta p$ in the x and y directions, and $l^{cx}, l^{cy}, l^{w}, l^{h}$ are the predicted offsets of the center and width-height coordinates of the default frame.
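As a worked example of the two center formulas, the following Python sketch applies them to one default frame. Only the center update is shown because only those two formulas appear in the text; the function name and the sample numbers are illustrative assumptions.

```python
def adjust_center(cx, cy, w, h, dp_x, dp_y, l_cx, l_cy):
    """Apply cx* = cx + dp|x + l^cx * w and cy* = cy + dp|y + l^cy * h."""
    new_cx = cx + dp_x + l_cx * w
    new_cy = cy + dp_y + l_cy * h
    return new_cx, new_cy


# A 32x32 default frame centred at (100, 50), a DDH offset of (2, -1)
# pixels, and predicted centre offsets of 0.25 and 0.5 (hypothetical values).
new_center = adjust_center(100, 50, 32, 32, 2, -1, 0.25, 0.5)
```

The predicted offsets are scaled by the frame's width and height, so the same prediction moves a large default frame further than a small one, while the DDH offset is added directly in pixels.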

Optionally, the quadratic regression loss in the step eight is:

the classification loss is:

where the indicator term indicates whether the z-th prediction box and the j-th real box match with respect to category t, the confidence term is the softmax loss over category confidence, and $N_{pos}$ and $N_{neg}$ are the numbers of positive and negative samples respectively; a positive sample is a prediction box containing a laundry target and a negative sample is a prediction box not containing a laundry target.

Optionally, the integrated loss function in the step nine is:

Optionally, the default frame sizes of the 4 feature layers in the fifth step are 32×32, 64×64, 128×128 and 256×256, respectively.

Optionally, in the first step, the input size of the drum image of the washing machine is uniformly scaled to 512×512.
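The 512×512 input size, the downsampling rates $2^{3+k}$ and the default frame sizes fit together as simple arithmetic. A minimal Python sketch, assuming one default frame per pixel (the value of N_box is not given in the text, so `n_box=1` and the function name are illustrative assumptions):

```python
DEFAULT_FRAME_SIZES = [32, 64, 128, 256]  # one size per feature layer


def regression_map_shapes(input_size=512, n_box=1):
    """Describe the four regression maps D_k for k = 0..3."""
    shapes = []
    for k, frame in enumerate(DEFAULT_FRAME_SIZES):
        rate = 2 ** (3 + k)            # downsampling rate of D_k
        side = input_size // rate      # spatial side length of D_k
        shapes.append({
            "k": k,
            "rate": rate,
            "side": side,
            "channels": n_box * 4,     # 4 offsets per default frame
            "frame_size": frame,
            "default_boxes": side * side * n_box,
        })
    return shapes


shapes = regression_map_shapes()
total_default_boxes = sum(s["default_boxes"] for s in shapes)
```

Under these assumptions the four maps are 64, 32, 16 and 8 pixels on a side, and each default frame is four times the downsampling rate of its layer.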

Optionally, the feature enhancement module in the third step is a multi-connection structure, and the shallow features with richer detail information are generated by obtaining multi-granularity information through stacking between adjacent layers.

Optionally, in the fourth step, the number of output channels of each layer of the multi-scale pyramid is 256.

Optionally, in step six, the DDH module includes a shortcut branch with spatial self-attention, implemented by a simple 3×3 convolution, so that the network can dynamically allocate weights to objects of each scale based on the distribution of the current features, making the final detection result more accurate.

The invention has the beneficial effects that:

To address the weak positioning and classification capability and low recognition precision of existing clothes detection methods for washing machines, the invention provides a multi-scale clothes detection system based on drum images of a washing machine, and a multi-scale clothes detection method built on that system. The improved ResNet network changes the first convolution layer into a large 7×7 convolution with a stride of 1 to prevent excessive loss of clothing detail, and extracts the third and fourth layers to ensure sufficient clothing semantic information. The feature enhancement module obtains shallow features with stronger characterization capability through feature aggregation, integrating the detail and semantic information of the third and fourth layers so that the extracted clothing features are richer. The DRF module constructs a multi-scale pyramid structure with stronger semantic information, and improves the detection system's ability to classify complex clothes by deepening the network and adaptively adjusting the receptive field. The offsets that the DDH module applies to the positioning frames enrich the diversity of prediction scales, so that the detection system adapts better to clothes of different scales. The multi-scale clothes detection system and method for drum images of a washing machine effectively improve the identification and classification of washing machine clothes and improve clothes detection precision.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a diagram of the improved ResNet network architecture.

Fig. 2 is a schematic diagram of a feature enhancement module according to the present invention.

Fig. 3 is a schematic diagram of a DRF module according to the present invention.

Fig. 4 is a diagram of a DDH module according to the present invention.

Fig. 5 is a schematic diagram of the default frame offset effect.

Fig. 6 is a diagram of an overall network framework provided by the present invention.

Fig. 7 is a diagram showing a detection effect of the network on complex clothes.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the embodiments of the present invention will be described in further detail with reference to the accompanying drawings.

Embodiment one:

This embodiment provides a multi-scale clothes detection system and method based on drum images of a washing machine, used for parameter recommendation in an intelligent washing machine. The system and method are based on a deep learning framework. Starting from a 2D RGB image, they extract features with an improved ResNet network, construct multi-scale information by enhancing the extracted features, and perform regression and classification in two stages. Through this cascaded process, the network's ability to discriminate complex clothes is strengthened, it adapts to clothes of every scale, and detection performance improves.

The 2D RGB image is obtained by shooting with a high-definition camera, and the resolution is 1920 x 1080.

The following describes the system setup procedure in terms of the system's modules, architecture, and network loss function, respectively:

(1) Module of system

As shown in Fig. 1, the improved ResNet network module comprises a 2D convolution with a 7×7 kernel and a stride of 1, followed by a pooling layer, and then 4 convolution layers connected in series. Each convolution layer is formed by stacking residual blocks of the ResNet network, with 3, 4, 23 and 3 blocks respectively, and the output features are taken from the convolution blocks of the third and fourth layers.
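The 8× and 16× downsampling rates of the third and fourth layers follow from the strides. A minimal Python sketch of that bookkeeping, assuming the four convolution layers use strides of 1, 2, 2 and 2 as in a standard ResNet101 (the per-layer strides and all names below are assumptions; the text states only the stem strides and the block counts):

```python
# Stride bookkeeping for the improved ResNet backbone.
STEM_CONV_STRIDE = 1               # 7x7 convolution, stride 1
MAX_POOL_STRIDE = 2                # 3x3 max pooling, stride 2
BLOCKS_PER_LAYER = [3, 4, 23, 3]   # residual blocks per convolution layer
LAYER_STRIDES = [1, 2, 2, 2]       # assumed, as in a standard ResNet101


def downsampling_rates():
    """Return the cumulative downsampling rate after each convolution layer."""
    rate = STEM_CONV_STRIDE * MAX_POOL_STRIDE
    rates = []
    for stride in LAYER_STRIDES:
        rate *= stride
        rates.append(rate)
    return rates


rates = downsampling_rates()
# Feature map side lengths for a 512x512 input image.
sizes = [512 // r for r in rates]
```

Under these assumptions, using a stride of 1 in the first convolution halves every rate relative to a standard ResNet, so the third and fourth layers land at 8× and 16× and keep more clothing detail.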

The specific structure of the feature enhancement module (SRM module) is shown in Fig. 2. The module is a lightweight multi-connection module for enhancing shallow feature representations, including multiplexing of up-sampling connections, down-sampling connections and constant-resolution connections. The selected input layers are the shallow features extracted by the ResNet101 network: the third layer with an 8× downsampling rate and the fourth layer with a 16× downsampling rate.

To alleviate the information dilution caused by the up-sampling operation, a cascade fusion scheme is adopted: a 1×1 convolution is applied to the third layer, and the result, at the same size as the fourth layer, is fused with the fourth layer by element-wise addition to complement the fourth layer's information. The complemented fourth-layer feature is then up-sampled by bilinear interpolation to the 8× downsampling rate and stacked with the third-layer feature. This operation integrates multi-granularity information from adjacent layers, yielding high-quality final features.
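The order of operations in this aggregation can be traced with a single-channel toy example in Python. Several simplifications are assumptions made for illustration only: 2×2 max pooling brings the third layer down to the fourth layer's size (the text does not say how the sizes are matched), the 1×1 convolution is reduced to a scalar weight, and nearest-neighbour upsampling stands in for bilinear interpolation.

```python
def conv1x1(feature, weight):
    """Single-channel 1x1 convolution, i.e. a scalar multiplication."""
    return [[weight * v for v in row] for row in feature]


def max_pool2(feature):
    """2x2 max pooling with stride 2 (assumed size-matching step)."""
    h, w = len(feature), len(feature[0])
    return [[max(feature[i][j], feature[i][j + 1],
                 feature[i + 1][j], feature[i + 1][j + 1])
             for j in range(0, w, 2)] for i in range(0, h, 2)]


def add(a, b):
    """Element-wise addition of two equally sized maps."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def upsample2(feature):
    """Nearest-neighbour 2x upsampling (stand-in for bilinear)."""
    out = []
    for row in feature:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def srm_aggregate(s3, s4, weight=1.0):
    """Complement the fourth layer with the third, upsample, then stack."""
    complemented_s4 = add(s4, max_pool2(conv1x1(s3, weight)))
    return [s3, upsample2(complemented_s4)]   # two channels at the 8x rate


s3 = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
s4 = [[1, 1], [1, 1]]
y = srm_aggregate(s3, s4)
```

The output keeps the third layer untouched as one channel and adds the complemented, upsampled fourth layer as another, so detail and semantic information sit side by side at the 8× resolution.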

The specific structure of the dynamic receptive field module (DRF module) is shown in Fig. 3. The design is inspired by a study of the human superficial retina, which found that population receptive field size increases with retinal eccentricity. In the implementation, eccentricity is simulated by the multi-branch convolution of an Inception structure, while dilated (atrous) convolution simulates the relationship between perception scale and eccentricity. Information at different scales is first captured by multi-branch convolutions of sizes 1×1, 3×3 and 5×5; to reduce the number of parameters, the 1×1 convolution is used for channel dimension reduction and the 5×5 convolution is replaced by two 3×3 convolutions. Self-learned vectors are then introduced, and a soft attention mechanism assigns a weight to each scale to simulate local stimuli at different scales. Similarly, self-learned vectors weighted according to the global stimulus select among dilated convolutions with different dilation rates, so that the receptive field is adjusted adaptively to the stimulus. In this way, small convolution kernels give larger weight to positions closer to the convolution center while still obtaining a larger receptive field, capturing more context information and improving the model's generalization to different scales.
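Two ideas in this paragraph lend themselves to a small numeric illustration: how dilation enlarges the receptive field of a small kernel, and how self-learned weights can be turned into soft attention over branches. The Python sketch below assumes that the soft attention is a softmax over the learned weights and reduces each branch output to a single number; both choices, the dilation rates used and the function names are illustrative assumptions.

```python
import math


def effective_kernel(k, r):
    """Effective size of a k x k convolution with dilation rate r."""
    return k + (k - 1) * (r - 1)


def softmax(weights):
    """Numerically stable softmax over a list of learned weights."""
    m = max(weights)
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_branches(branch_outputs, learned_weights):
    """Soft-attention fusion: weighted sum of per-branch responses."""
    attention = softmax(learned_weights)
    return sum(a * b for a, b in zip(attention, branch_outputs))


# A 3x3 kernel at dilation rates 1, 2 and 3 (hypothetical rates).
fields = [effective_kernel(3, r) for r in (1, 2, 3)]
# Equal learned weights give a plain average of the three branches.
fused = fuse_branches([1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
```

A 3×3 kernel with dilation rate 3 covers a 7×7 area with only nine weights, which is how small kernels obtain a larger receptive field; training the weights then shifts the attention toward whichever branch suits the current scale.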

The specific structure of the dynamic deformable convolution detection head module (DDH module) is shown in Fig. 4. The module uses deformable convolution to address the problem that the fixed geometric structure of a convolutional network limits its ability to model geometric transformations. Within the module, the spatial sampling positions are further displaced by adding an offset variable to the position of each sampling point in the convolution kernel, so that the sampling area can be adjusted freely and is no longer restricted to the regular grid. In addition, global spatial self-attention is implemented through a simple 3×3 shortcut connection, enabling the network to dynamically assign appropriate weights to objects of each scale based on the distribution of the current features. This operation lets the network generate different offset values for shallow regression frames of different scales and displace the feature pixels accordingly, so that the default frames placed on those pixels are displaced as well. The network can thus produce different search ranges for different default frames and then fine-tune and match them to the targets, improving its detection performance on clothes instances of varying scale.

(2) System architecture

The overall structure of the system is shown in fig. 6, and mainly comprises four parts:

The first part retains, on the basis of the ResNet network, the third layer with 8× downsampling and the fourth layer with 16× downsampling, and sends both into the designed feature enhancement module (SRM) for information aggregation to obtain shallow features with stronger characterization capability.

The second part passes the enhanced features through the designed dynamic receptive field module (DRF) to build multi-scale features with 8×, 16×, 32× and 64× downsampling rates. By adaptively combining information from different receptive fields, a dynamic multi-scale pyramid with rich semantic information is constructed.

The third part performs a multi-scale regression operation based on the enhanced feature information and takes the regression results as candidate frames for the corresponding features of the dynamic multi-scale pyramid. The default frames derived from the shallow regression results are classified and fine-tuned using the multi-scale pyramid features.

The fourth part introduces the dynamic deformable convolution detection head module (DDH module) as the output layer of the pyramid features.

(3) Network loss function

After the network model is established, the following steps are executed to complete the clothes detection process;

A high-definition camera is adopted to shoot and obtain a 2D RGB image of clothes in a drum of the washing machine, and the resolution is 1920 x 1080;

Step one: data augmentation, namely scaling the input picture to 512×512 and applying random vertical and horizontal flipping, brightness change, blurring and illumination change;

Step two: extracting features of the 2D input image after the enhancement in the step one by using a ResNet network modified as shown in fig. 1, and outputting feature layers with 8 times of downsampling rate and 16 times of downsampling rate;

step three: the feature layer extracted in the second step is sent to a feature enhancement module shown in fig. 2 to aggregate information, so as to obtain shallow features with stronger characterization capability;

Step five: carrying out a multi-scale regression operation on the aggregated features of step three, and coarsely positioning the clothes by using the shallow feature information to obtain prediction frames; the multi-scale regression algorithm is as follows:

Input: the output feature $y$;

Output: the integration of the 4 predicted offsets for all $N$ multi-scale default boxes;

S1: applying max pooling to $y$ to obtain four scales consistent with the four-layer pyramid features of step four;

$D_k = f_{3\times 3}(M^k(y)),\quad k = 0, 1, 2, 3$

where $M^k(\cdot)$ denotes $k$ successive max pooling operations, giving a downsampling rate of $2^{3+k}$; $D_k$ is the output feature, whose number of channels is $N_{box} \times 4$, representing the 4 offsets (center and width-height) of the $N_{box}$ default frames configured at each pixel of the output feature $D_k$;

S3: applying the smooth L1 function to $l$ as the regression loss:

where $cx, cy, w, h$ are the center and width-height coordinates of the default box, $N$ is the total number of default boxes, $l$ is the integration of all $D_k$ prediction results, representing the 4 predicted offsets of all $N$ default boxes, and the target vector holds the 4 offsets of the corresponding real frame relative to the default frame;

S4: during training, the network back-propagates according to the loss function of S3, reducing the difference between $l$ and the target offsets, so that a more accurate integrated prediction vector $l$ is finally obtained;

Step six: offsetting the output features of each feature layer of the pyramid of step four by using the DDH module shown in Fig. 4;

Step seven: taking the prediction frames obtained in step five as the default frames of each feature layer of the pyramid of step four, and adjusting the default frames by using the offsets generated by the DDH module of step six; the effect of the adjustment is shown in Fig. 5. The adjustment formulas for the default frame center and width-height are as follows:

$cx^* = cx + \Delta p|_x + l^{cx} \times w$

$cy^* = cy + \Delta p|_y + l^{cy} \times h$

Step eight: using the DDH module of step six simultaneously as the detection head to perform secondary regression and classification. The secondary regression loss is as follows:

The classification loss is as follows:

where the indicator term indicates whether the z-th prediction box and the j-th real box match with respect to category t, the confidence term is the softmax loss over class confidence, and $N_{pos}$ and $N_{neg}$ are the numbers of positive and negative samples respectively; a positive sample is a prediction frame containing a clothes target, and a negative sample is a prediction frame not containing a clothes target;

Step nine: combining the loss functions of step five and step eight and training them jointly, and finally outputting the classification and accurate positioning information of the clothes.

In order to highlight the advantages of the invention relative to other prior art, a series of simulation experiments are carried out, and the simulation results are as follows:

Table 1 compares the accuracy and model parameters of the method of the present application with those of the Faster RCNN and YOLOv5m networks in laundry detection. The test pictures come from 1000 drum-load samples provided by a company, with 10 pictures per drum load, for 10000 pictures in total.

Table 1 comparison of accuracy and model parameters of the inventive network with other methods in laundry detection

Method	Faster RCNN	YOLOv5m	The method	The method (after compression)
Input size	800×1000	640×640	512×512	512×512
Backbone network	ResNet101	CSPDarknet	ResNet101	ResNet101
Precision	85.2%	50%	89.7%	86.7%
Model parameter quantity	137M	21.4M	48M	26.4M

As can be seen from the comparison in the table, compared with Faster RCNN, the detection system and method of the invention reduce the number of model parameters while maintaining high precision; compared with the YOLOv5m network, the invention greatly improves detection precision, and after pruning with a compression algorithm it also achieves a low parameter count.
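The size of these reductions can be read directly from the parameter counts in Table 1. A short Python sketch of the arithmetic (the dictionary and function names are illustrative; the figures are those in the table):

```python
# Model parameter quantities from Table 1, in millions.
PARAMS_M = {
    "Faster RCNN": 137.0,
    "YOLOv5m": 21.4,
    "The method": 48.0,
    "The method (after compression)": 26.4,
}


def reduction_percent(baseline, model):
    """Percentage by which `model` has fewer parameters than `baseline`."""
    return 100.0 * (1.0 - PARAMS_M[model] / PARAMS_M[baseline])


vs_faster_rcnn = round(reduction_percent("Faster RCNN", "The method"), 1)
after_pruning = round(
    reduction_percent("The method", "The method (after compression)"), 1)
```

The method uses about 65% fewer parameters than Faster RCNN, and compression removes a further 45% of its own parameters at a cost of 3 percentage points of precision (89.7% to 86.7%).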

In summary, compared with existing clothes detection methods for washing machines, the method reduces the number of system parameters while maintaining detection precision; as can be seen from Fig. 7, it also adapts well to scenes with large variation in clothes size and achieves identification and classification of clothes in the washing machine drum.

Some steps in the embodiments of the present invention may be implemented by using software, and the corresponding software program may be stored in a readable storage medium, such as an optical disc or a hard disk.

The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims

1. A multi-scale laundry detection system based on a drum image of a washing machine, the system comprising:

the improved ResNet network module includes:

A 2D convolution layer with a 7×7 convolution kernel and a stride of 1, a maximum pooling layer with a 3×3 kernel and a stride of 2, and 4 convolution layers connected in series; each of the 4 convolution layers is formed by stacking residual blocks, the numbers of residual blocks being 3, 4, 23 and 3 respectively, and output features are taken from the third layer convolution block and the fourth layer convolution block of the 4 convolution layers;

The feature enhancement module SRM is a lightweight multi-connection module and is used for enhancing shallow feature representation, and comprises multiplexing of up-sampling connection, down-sampling connection and constant-resolution connection, wherein the selected input layer is from shallow features extracted by ResNet network, namely a third layer with 8 times of down-sampling rate and a fourth layer with 16 times of down-sampling rate;

The dynamic receptive field DRF module simulates eccentricity through the multi-branch convolution of an Inception structure, and dilated (atrous) convolution is used to simulate the relationship between perception scale and eccentricity; information of different scales is first captured by multi-branch convolutions of 1×1, 3×3 and 5×5, wherein the 1×1 convolution is used for channel dimension reduction and the 5×5 convolution is replaced by two 3×3 convolutions; a self-learning vector is then introduced, and a soft attention mechanism distributes the weights of the scales so as to simulate local stimuli at different scales;

The dynamic deformable convolution DDH module uses deformable convolution to address the problem that the fixed geometric structure of a convolutional network limits its ability to model geometric transformations; within the module, the spatial sampling positions are further displaced by adding an offset variable to the position of each sampling point in the convolution kernel, so that the sampling area is adjusted freely and is not limited to the regular grid.

2. A method for detecting laundry in a washing machine based on a drum image of the washing machine, the method using the multi-scale laundry detection system based on a drum image of the washing machine of claim 1, the method comprising:

step one: preprocessing an input washing machine drum image;

3. The method of claim 2, wherein the step three of aggregating information comprises:

4. The method of claim 3, wherein the computing of the DRF module of step four comprises:

5. The method according to claim 2, wherein the multi-scale regression algorithm of step five comprises:

$D_k = f_{3\times 3}(M^k(y)),\quad k = 0, 1, 2, 3$

S3: applying the smooth L1 function to $l$ as the regression loss:

6. The method of claim 5, wherein the calculating of the DDH module in step six comprises:

where $R$ denotes the region and relative positions of the receptive field, centered on the coordinate $(0, 0)$; $p_n$ enumerates the positions listed in $R$; $w$ is the weight value at the corresponding position in the convolution kernel, $I$ is the input feature value at the corresponding position, and $O$ is the output feature value at the corresponding position. The offset $\Delta p_n$ is obtained by applying a 3×3 convolution to the $D_k$ obtained in S1 of step five; the number of output channels is $k \times k \times 2$, representing the offset parameters for each position in a $k \times k$ convolution kernel.

7. The method of claim 6, wherein the step seven formula for adjusting the default box comprises:

$cx^* = cx + \Delta p|_x + l^{cx} \times w$

$cy^* = cy + \Delta p|_y + l^{cy} \times h$

8. The method of claim 7, wherein the quadratic regression loss of step eight is:

the classification loss is:

the comprehensive loss function in the step nine is as follows: