CN113139896A - Target detection system and method based on super-resolution reconstruction - Google Patents

Target detection system and method based on super-resolution reconstruction

Info

Publication number
CN113139896A
CN113139896A (application CN202010052220.4A)
Authority
CN
China
Prior art keywords: super, image data, resolution, resolution reconstruction, detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010052220.4A
Other languages
Chinese (zh)
Inventor
李永
明悦
张高鑫
刘莹
丰·石
叶翔
李慧
王凡
何子航
王伟刚
李凤男
赵家凤
李婉婷
胡嘉豪
李博瀚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Boeing Co
Original Assignee
Beijing University of Posts and Telecommunications
Boeing Co
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications and Boeing Co
Priority to CN202010052220.4A
Publication of CN113139896A
Legal status: Pending (Current)

Classifications

    • G PHYSICS; G06 COMPUTING, CALCULATING OR COUNTING; G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/4053 Super resolution, i.e. output image resolution higher than sensor resolution
    • G06T3/4076 Super resolution by iteratively correcting the provisional high resolution image using the original low-resolution image
    • G06T3/4046 Scaling the whole image or part thereof using neural networks
    • G06T5/50 Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G06T7/269 Analysis of motion using gradient-based methods
    • G06T2207/10016 Video; image sequence
    • G06T2207/20221 Image fusion; image merging

Abstract

The application discloses a target detection system and method based on super-resolution reconstruction. The system comprises: a data acquisition module configured to acquire image data to be detected; a super-resolution reconstruction module configured to receive the image data acquired by the data acquisition module and perform super-resolution reconstruction processing on the image data; a target detection module configured to perform target detection on the image data subjected to the super-resolution reconstruction processing; and a dividing and fusing module configured to crop the image data subjected to target detection into a plurality of sub-images and map the detection result of each sub-image into the combined image data for coordinate fusion, thereby obtaining a target detection result.

Description

Target detection system and method based on super-resolution reconstruction
Technical Field
The present invention relates to the field of image processing. In particular, the present invention relates to a system and method for object detection based on super-resolution reconstruction.
Background
Image super-resolution is a signal processing technique that improves the spatial resolution of an image or of a target on the basis of existing imaging equipment; it addresses the problem that the imaging resolution of a scene or target is too low in certain video- and image-based applications. Image super-resolution techniques fall into two categories: single-frame techniques that improve resolution using only a single image itself, such as SRCNN and EDSR; and multi-frame techniques that use adjacent frames to increase the resolution of a particular frame, such as sub-pixel convolutional neural networks and ESPCN. Image super-resolution also involves image quality assessment algorithms, which mainly include: algorithms based on convolutional neural networks; and algorithms based on image gradient features. The present invention generally relates to image super-resolution techniques that perform super-resolution reconstruction from multiple image frames.
Video super-resolution processing focuses mainly on three problems: 1) how to make full use of the correlated information among multiple frames; 2) how to efficiently fuse image details into a high-resolution image; and 3) how to increase computation speed. Video super-resolution sometimes requires first mapping the low-resolution image onto a high-definition grid by upsampling, but this operation increases computational complexity. To address this, Shi, Wenzhe et al. published "Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network" (2016): 1874-1883. In the proposed network, the front layers are all ordinary (integer-pixel) convolution layers with activation function layers, while the last layer is a new sub-pixel convolution layer that rearranges pixels by channel instead of performing a convolution operation: an H × W × C·r² feature map is rearranged into an (r·H) × (r·W) × C output. To further increase speed, Caballero, Jose et al. published "Real-Time Video Super-Resolution with Spatio-Temporal Networks and Motion Compensation" (2016): 2848-2857, which proposes an end-to-end jointly trained motion compensation and video super-resolution algorithm and introduces a spatio-temporal sub-pixel convolution network to achieve real-time video image super-resolution. The network mainly uses early fusion and slow fusion to process the time dimension, then builds a motion compensation framework based on spatial transformation, and combines it with the ESPCN spatio-temporal network to compute video super-resolution reconstruction in real time. The prior art for video super-resolution reconstruction has the following problems: (1) the performance of traditional super-resolution reconstruction (interpolation) and target detection methods has been surpassed by deep-learning-based methods: their reconstruction quality is lower, and template-based detection methods have limited descriptive capability and capture little semantic information; (2) dependence on prior knowledge: such algorithms rely on the accuracy of the prior knowledge (the target image template), and their accuracy drops when the actual application scenario does not match the introduced prior knowledge. To solve these problems of traditional super-resolution reconstruction (interpolation) and target detection methods, a method has been proposed that improves detection accuracy by adding context information (an attempt to improve the performance of SSD by adding context), but this method also has the following problems: (1) the parameter (computation) count is large and the algorithm is slow; (2) the larger number of parameters makes the model occupy more storage space.
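The sub-pixel rearrangement described above is now a standard operation. A minimal sketch follows, using PyTorch's built-in PixelShuffle; the tensor sizes are illustrative assumptions, not values from the cited papers:

```python
import torch
import torch.nn as nn

# Sub-pixel (pixel-shuffle) rearrangement: an H x W x C*r^2 feature map
# becomes an (r*H) x (r*W) x C output, replacing an explicit deconvolution.
r = 2                                   # upscaling factor (illustrative)
x = torch.randn(1, 3 * r * r, 32, 32)   # N x (C*r^2) x H x W feature map
shuffle = nn.PixelShuffle(r)            # rearranges channels into space
y = shuffle(x)
print(y.shape)                          # torch.Size([1, 3, 64, 64])
```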
As an important field of image processing using artificial intelligence, target detection is essentially multi-object localization: object detection is a combination of classification and localization tasks whose aim is, given a picture, to accurately find the positions (coordinates) of the objects in it and to label their categories. The main performance indicators of a target detection model are detection accuracy and speed. Mainstream target detection algorithms are currently based on deep learning models and can be divided into two categories: (1) two-stage detection algorithms, which divide the detection problem into two stages, first generating candidate regions and then classifying them; a typical representative is the R-CNN family of algorithms based on region proposals; (2) one-stage detection algorithms, which need no region proposal stage and directly generate class probabilities and position coordinates of objects; typical algorithms include YOLO and SSD.
A scale-invariant representation is crucial for identifying and locating objects. The deeper layers of modern CNNs have large strides (32 pixels), which yields a very coarse representation of the input image and makes small-target detection very challenging. When faced with small targets (in essence a scale-invariance problem), the methods above all detect them insufficiently well. This reflects a contradiction in convolutional network structures: the feature maps of the shallow layers of the network are large but their semantic (context) information is insufficient, while the semantic information of the deep layers is sufficient but their feature maps are too small. To detect multi-scale objects, various solutions have been proposed, such as:
(1) dilated (atrous) convolution is used to increase the resolution of the feature map; this preserves the weights and receptive field of the pre-trained network and does not degrade performance on large objects (see the sketch after this list);
(2) based on the fact that shallow and deep layers contain complementary information, shallow features and deep features (context information) are fused for prediction;
(3) predictions are made directly and independently on the feature maps of both shallow and deep network layers;
(4) the network input image is upsampled during training.
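As an illustration of option (1) above, here is a minimal sketch of a dilated (atrous) convolution in PyTorch; the channel counts and input size are assumptions for demonstration only:

```python
import torch
import torch.nn as nn

# A dilated 3x3 convolution enlarges the receptive field while keeping the
# feature-map resolution unchanged (padding chosen to preserve spatial size).
x = torch.randn(1, 64, 56, 56)
conv = nn.Conv2d(64, 64, kernel_size=3, dilation=2, padding=2)
y = conv(x)
print(y.shape)  # torch.Size([1, 64, 56, 56]): same size, larger receptive field
```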
Therefore, how to perform multi-frame super-resolution reconstruction to obtain a higher-resolution image while preserving the integrity of small targets in that image, so as to achieve fast and accurate target detection, is a problem well worth studying.
Disclosure of Invention
The embodiments of the invention provide an end-to-end target detection method. It introduces scale invariance through a multi-frame super-resolution module, designs a learnable data division module that adaptively crops the super-resolution-reconstructed image while keeping small targets intact within the sub-images, and finally feeds the sub-images into a target detection module for detection, thereby improving the detection of small targets in the application scenario of detecting targets in video (multiple frames).
According to an aspect of the embodiments of the present invention, there is provided a target detection system based on super-resolution reconstruction, the system comprising: a data acquisition module configured to acquire image data to be detected; a super-resolution reconstruction module configured to receive the image data acquired by the data acquisition module and perform super-resolution reconstruction processing on the image data; a target detection module configured to perform target detection on the image data subjected to the super-resolution reconstruction processing; and a dividing and fusing module configured to crop the image data subjected to target detection into a plurality of sub-images and map the detection result of each sub-image into the combined image data for coordinate fusion, thereby obtaining a target detection result.
In the target detection module, the Single Shot MultiBox Detector (SSD) algorithm is used.
In the dividing and fusing module, the step size used to crop the image data subjected to target detection is a value obtained based on edge detection.
The super-resolution reconstruction module is a trained spatio-temporal sub-pixel convolutional network comprising a motion estimation portion and a super-resolution portion, wherein the spatio-temporal sub-pixel convolutional network is trained by:
the Loss formula of the super-resolution part is as follows:
Figure BDA0002371583680000051
the Loss formula of the motion estimation part is as follows:
Figure BDA0002371583680000052
wherein the content of the first and second substances,
Figure BDA0002371583680000057
is approximated to
Figure BDA0002371583680000053
ε=0.01
The total Loss formula of the spatio-temporal sub-pixel convolution network during end-to-end training is as follows:
Figure BDA0002371583680000054
wherein theta isΔIs a parameter of the motion estimation part, theta is a parameter of the super-resolution part,
Figure BDA0002371583680000055
which represents an image frame or frames of an image,
Figure BDA0002371583680000056
representing the image frame that has undergone the warping process.
A target loss function $L_{Det}$ is employed in the target detection module, obtained by the following equation:

$$L_{Det}(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha\,L_{loc}(x, l, g)\right)$$

wherein: $N$ is the number of default boxes matching a real box, $L_{loc}$ is the Smooth L1 loss function from Fast R-CNN, $L_{conf}$ is the Softmax loss, $c$ is the confidence of each class, and $\alpha$ is a weight term set to 1.
According to another aspect of the embodiments of the present invention, there is also provided a target detection method based on super-resolution reconstruction, the method comprising the following steps: a data acquisition step of acquiring image data to be detected; a super-resolution reconstruction step of receiving the acquired image data and performing super-resolution reconstruction processing on it; a target detection step of performing target detection on the image data subjected to the super-resolution reconstruction processing; and a dividing and fusing step of cropping the image data subjected to target detection into a plurality of sub-images and mapping the detection result of each sub-image into the combined image data for coordinate fusion, thereby obtaining a target detection result.
In the target detection step, the Single Shot MultiBox Detector (SSD) algorithm is used.
In the dividing and fusing step, the step size used to crop the image data subjected to target detection is a value obtained based on edge detection.
In the super-resolution reconstruction step, a trained spatio-temporal sub-pixel convolutional network is used, the spatio-temporal sub-pixel convolutional network comprising a motion estimation part and a super-resolution part, wherein the spatio-temporal sub-pixel convolutional network is trained by:
the Loss formula of the super-resolution part is as follows:
Figure BDA0002371583680000071
the Loss formula of the motion estimation part is as follows:
Figure BDA0002371583680000078
wherein the content of the first and second substances,
Figure BDA0002371583680000072
is approximated to
Figure BDA0002371583680000073
ε=0.01
The total Loss formula of the spatio-temporal sub-pixel convolution network during end-to-end training is as follows:
Figure BDA0002371583680000074
wherein theta isΔIs a parameter of the motion estimation part, and theta is a parameter of the super-resolution partThe number of the first and second groups is,
Figure BDA0002371583680000075
which represents an image frame or frames of an image,
Figure BDA0002371583680000076
representing the image frame that has undergone the warping process.
A target loss function $L_{Det}$ is employed in the target detection step, obtained by the following equation:

$$L_{Det}(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha\,L_{loc}(x, l, g)\right)$$

wherein: $N$ is the number of default boxes matching a real box, $L_{loc}$ is the Smooth L1 loss function from Fast R-CNN, $L_{conf}$ is the Softmax loss, $c$ is the confidence of each class, and $\alpha$ is a weight term set to 1.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:
fig. 1 shows a schematic diagram of a super-resolution reconstruction based object detection system according to an embodiment of the present invention.
Fig. 2 shows a schematic diagram of a super-resolution reconstruction module in a super-resolution reconstruction based object detection system according to an embodiment of the present invention.
Fig. 3 shows a schematic diagram of an object detection module in an object detection system based on super-resolution reconstruction according to an embodiment of the present invention.
Fig. 4 shows a schematic diagram of a partitioning and fusion module in a target detection system based on super-resolution reconstruction according to an embodiment of the present invention.
Fig. 5 shows a flowchart of a target detection method based on super-resolution reconstruction according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "comprises" and "comprising," and any variations thereof, in the description and claims of this invention and the above-described drawings, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or modules or elements is not necessarily limited to those steps or modules or elements expressly listed, but may include other steps or modules or elements not expressly listed or inherent to such process, method, article, or apparatus.
In order to facilitate the following description of the present invention, a few basic concepts will be described first.
Deep neural network: a type of neural network; deep learning is a branch of machine learning.
Feature: a way of representing an image. Conventional methods represent an image with RGB three-channel pixels. To make recognition by computer more effective, redundant information in the RGB representation is filtered out and more semantic features are extracted. Image features contain salient information in the image, such as contour edges and color.
Dimension: the size of the image.
Super-resolution: improving the resolution of the original image by hardware or software methods; the process of obtaining a high-resolution image from a series of low-resolution images is super-resolution reconstruction.
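For contrast with learned super-resolution, a minimal sketch of plain bicubic interpolation (not the patented multi-frame method): it raises the pixel count but adds no new detail, which is exactly what learned reconstruction improves upon.

```python
import torch
import torch.nn.functional as F

# Bicubic upsampling: more pixels, but no recovered high-frequency detail.
lr = torch.randn(1, 3, 64, 64)
hr = F.interpolate(lr, scale_factor=4, mode="bicubic", align_corners=False)
print(hr.shape)  # torch.Size([1, 3, 256, 256])
```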
Target detection: object detection is a combination of classification and localization tasks; given a picture, the task is to find exactly where the objects in the picture are located (their coordinates) and to label the classes of the objects.
Flaw detection: a specific application scenario of the target detection problem. Given an image of the surface of some material, the damage appearing in the image and its category (categories include damaged rivets, scratches, cracks, paint loss, etc.) are identified.
Scale invariance (scale invariant): certain characteristics of a system remain unchanged after the system undergoes a scale transformation.
The target detection system and method based on super-resolution reconstruction provided by the invention can be used in practical scenarios for detecting (small) targets in video. The core of the system is to introduce scale invariance into video target detection to improve the detection of small targets; given enough training samples, the learned algorithm model has excellent discrimination capability and strong robustness. The algorithm can learn the characteristics of many kinds of images and can be widely applied to scenarios such as fine scar recognition on various material surfaces, for example rivet damage detection on aircraft surface materials.
Fig. 1 shows a schematic diagram of a super-resolution reconstruction based object detection system according to an embodiment of the present invention. As shown in fig. 1, the object detection system 100 based on super-resolution reconstruction includes: a data acquisition module 102 configured to acquire image data to be detected; a super-resolution reconstruction module 104 configured to receive the image data acquired by the data acquisition module 102 and perform super-resolution reconstruction processing on the image data; a target detection module 106 configured to perform target detection on the image data subjected to the super-resolution reconstruction processing; and a dividing and fusing module 108 configured to cut the image data subjected to the target detection into a plurality of sub-image data, and map the detection result of each of the sub-image data into the combined image data to perform coordinate fusion, thereby obtaining a target detection result.
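A hypothetical sketch of how the four modules of fig. 1 could be wired together; sr_net and detector stand in for the super-resolution module 104 and detection module 106, and the fixed tile size replaces the learnable cropping step of module 108 purely for illustration:

```python
import torch
import torch.nn as nn

class SRDetectionSystem(nn.Module):
    """Pipeline sketch: acquisition -> super-resolution -> divide -> detect -> fuse."""

    def __init__(self, sr_net: nn.Module, detector: nn.Module, tile: int = 300):
        super().__init__()
        self.sr_net = sr_net      # super-resolution reconstruction module (104)
        self.detector = detector  # target detection module (106)
        self.tile = tile          # fixed crop step; the patent learns this instead

    def forward(self, lr_frames: torch.Tensor):
        hr = self.sr_net(lr_frames)              # super-resolved image, N x C x H x W
        _, _, height, width = hr.shape
        detections = []
        for top in range(0, height, self.tile):  # dividing step of module 108
            for left in range(0, width, self.tile):
                sub = hr[:, :, top:top + self.tile, left:left + self.tile]
                # detector is assumed to return (x1, y1, x2, y2, score, cls) tuples
                for x1, y1, x2, y2, score, cls in self.detector(sub):
                    # coordinate fusion: map sub-image boxes back to the full image
                    detections.append((x1 + left, y1 + top, x2 + left, y2 + top, score, cls))
        return detections
```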
Fig. 2 is a schematic diagram illustrating a super-resolution reconstruction module in a super-resolution reconstruction based object detection system according to an embodiment of the present invention. The super-resolution reconstruction module 200 shown in fig. 2, after being trained, may serve as the super-resolution reconstruction module 104 shown in fig. 1. Through training, the super-resolution reconstruction module becomes a trained spatio-temporal sub-pixel convolution network. As shown in fig. 2, the spatio-temporal sub-pixel convolution network includes a motion estimation portion 202 and a super-resolution portion 204. The multi-frame super-resolution module of this network structure is based on real-time video super-resolution with spatio-temporal networks and motion compensation. The network can perform video image super-resolution at real-time speed; an algorithm combining motion compensation and video super-resolution is also used and can be trained end-to-end. Compared with a single-frame model, the spatio-temporal network reduces computation while maintaining output quality.
in the motion estimation section 202, it is,
Figure BDA0002371583680000111
and
Figure BDA0002371583680000112
is distinguished in that
Figure BDA0002371583680000113
And
Figure BDA0002371583680000114
two different frames, the position of the object in the image may have changed, which may be achieved by warping (warp)
Figure BDA0002371583680000115
And
Figure BDA0002371583680000116
the position of the object is almost the same (slightly different) and then fed into the super-resolution section 204.
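A sketch of the warping operation assumed above, implemented with a dense flow field and bilinear resampling; the flow layout (N x 2 x H x W pixel offsets) is an assumption:

```python
import torch
import torch.nn.functional as F

def warp(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Resample `frame` (N x C x H x W) at positions shifted by `flow` (N x 2 x H x W)."""
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow  # sampling positions
    # normalize positions to [-1, 1] as grid_sample expects
    grid[:, 0] = 2.0 * grid[:, 0] / (w - 1) - 1.0
    grid[:, 1] = 2.0 * grid[:, 1] / (h - 1) - 1.0
    return F.grid_sample(frame, grid.permute(0, 2, 3, 1), align_corners=True)
```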
In training, the loss of the motion estimation portion 202 is an MSE loss plus a Huber loss; the Huber loss is added to make the flow spatially smooth. The Loss formula is as follows:

$$L_{ME}(\theta_\Delta) = \left\| I_{t+i}^{LR\prime} - I_t^{LR} \right\|_2^2 + \lambda\,\mathcal{H}\left(\partial_{x,y}\Delta_{t+i}\right)$$

wherein the latter term is approximated as

$$\mathcal{H}(\partial_{x,y}\Delta) \approx \sqrt{\epsilon + \textstyle\sum_{i,j}\left[(\partial_x\Delta_{i,j})^2 + (\partial_y\Delta_{i,j})^2\right]}, \quad \epsilon = 0.01$$

The Loss formula of the super-resolution portion 204 is as follows:

$$L_{SR}(\theta) = \left\| I_t^{HR} - f\left(I_{t-1:t+1}^{LR\prime};\, \theta\right) \right\|_2^2$$

Finally, when end-to-end training is performed through the motion estimation part 202 and the super-resolution part 204, the overall Loss is

$$L(\theta_\Delta, \theta) = \left\| I_t^{HR} - f\left(I_{t-1:t+1}^{LR\prime};\, \theta\right) \right\|_2^2 + \sum_{i=\pm 1}\left[\left\| I_{t+i}^{LR\prime} - I_t^{LR} \right\|_2^2 + \lambda\,\mathcal{H}\left(\partial_{x,y}\Delta_{t+i}\right)\right]$$

wherein $\theta_\Delta$ is the parameter set of the motion estimation section 202 and $\theta$ is the parameter set of the super-resolution section 204.
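A sketch of this combined objective under the reconstruction above; epsilon follows the text (0.01), while the smoothness weight lambda is an assumed hyperparameter:

```python
import torch

def huber(flow_grad: torch.Tensor, eps: float = 0.01) -> torch.Tensor:
    # Huber-style smoothness term: sqrt(eps + sum of squared flow gradients)
    return torch.sqrt(eps + (flow_grad ** 2).sum())

def total_loss(hr, sr_out, lr_t, warped, flow, lam=0.01):
    sr_term = ((hr - sr_out) ** 2).mean()      # super-resolution MSE
    me_term = ((warped - lr_t) ** 2).mean()    # motion-compensation MSE
    dx = flow[..., :, 1:] - flow[..., :, :-1]  # horizontal flow gradient
    dy = flow[..., 1:, :] - flow[..., :-1, :]  # vertical flow gradient
    return sr_term + me_term + lam * (huber(dx) + huber(dy))
```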
Fig. 3 is a schematic diagram illustrating an object detection module in an object detection system based on super-resolution reconstruction according to an embodiment of the present invention. The object detection module 300 shown in fig. 3, after being trained, may serve as the object detection module 106 shown in fig. 1. The target detection module mainly uses the SSD algorithm, but other detection algorithms (such as YOLOv3) may be substituted.
The object detection module 300 includes: an image input unit 302, a first set of convolutional layers 304, a second set of convolutional layers 306, and a detection output unit 308. In the second set of convolutional layers 306, a feature-pyramid structure is used for detection: feature maps of different sizes, such as conv4_3, conv7 (FC7), conv6_2, conv7_2, conv8_2 and conv9_2, are used, and object classification and position regression are performed simultaneously on multiple feature maps. The very beginning of the SSD model, referred to herein as the base network (VGG-16 is used here; lightweight base networks such as MobileNet or ShuffleNet can be used to speed up the algorithm), is an ordinary image-classification network. After the base network, additional auxiliary structures are added, mainly comprising the following three parts: (1) Multi-scale feature maps for detection: after the base network, additional convolutional layers are added whose sizes decrease layer by layer, allowing prediction at multiple scales. (2) Convolutional predictors for detection: each newly added layer (or feature layer in the base network) can use a set of convolution kernels to produce a fixed set of predictions. (3) Default boxes and aspect ratios: the position of each default box relative to its corresponding feature-map cell is fixed. In each feature-map cell, the offsets between the prediction boxes and the default boxes are predicted, together with the score of each box containing an object; that is, the prediction box actually predicts offsets relative to the default box.
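A minimal sketch of part (2), the convolutional predictors: each selected feature map gets a 3x3 convolution predicting, for k default boxes per cell, 4 offsets plus per-class scores. The channel counts, spatial sizes, k, and class count are assumptions:

```python
import torch
import torch.nn as nn

num_classes, k = 21, 4  # assumed values (e.g. 20 classes + background)
heads = nn.ModuleList(
    nn.Conv2d(c, k * (4 + num_classes), kernel_size=3, padding=1)
    for c in (512, 1024, 512, 256)  # e.g. conv4_3, conv7, conv8_2, conv9_2
)
feats = [torch.randn(1, c, s, s) for c, s in ((512, 38), (1024, 19), (512, 10), (256, 5))]
preds = [head(f) for head, f in zip(heads, feats)]  # per-scale offsets + class scores
print([p.shape for p in preds])
```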
The SSD training objective can handle multiple target classes. Let $x_{ij}^p = 1$ indicate that the $i$-th default box matches the $j$-th real box of category $p$, and $x_{ij}^p = 0$ otherwise. Under this matching strategy there must be $\sum_i x_{ij}^p \geq 1$, meaning that for the $j$-th real box there may be multiple default boxes matching it. The total objective loss function is obtained by a weighted sum of the localization loss (loc) and the confidence loss (conf):

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha\,L_{loc}(x, l, g)\right)$$

The parameters have the following meanings: $N$ is the number of default boxes that match a real box. The localization loss is the Smooth L1 loss from Fast R-CNN, applied to the parameters of the predicted box ($l$) and the real box ($g$), i.e. center coordinates, width, and height. The confidence loss is the Softmax loss, whose input is the confidence $c$ of each class. The weight term $\alpha$ is set to 1.
After a series of predictions is generated, many prediction boxes fit the real boxes, but many more do not, and the negative samples far outnumber the positive samples, which makes training hard to converge; hard negative mining is therefore performed. The default boxes at each object position are sorted by confidence, and the top ones are selected so that the final negative-to-positive ratio is 3:1 (the MineHardExamples function of this algorithm is in bbox_util.cpp). Experiments found that this ratio leads to faster optimization and more stable training. Data augmentation is performed on the training data during training: to make the model more robust to target scale and size, the paper augments the training images, with each training sample generated randomly by one of the following methods: (1) use the original image; (2) sample a patch with a minimum Jaccard overlap with the objects of 0.1, 0.3, 0.5, 0.7, or 0.9; (3) sample a patch randomly.
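A sketch of the 3:1 hard-negative selection described above (not the Caffe implementation itself): negatives are ranked by confidence loss and only the hardest are kept.

```python
import torch

def mine_hard_negatives(conf_loss: torch.Tensor, pos: torch.Tensor, ratio: int = 3):
    # conf_loss: per-default-box confidence loss; pos: boolean mask of positives
    neg_loss = conf_loss.clone()
    neg_loss[pos] = 0.0                       # exclude positives from the ranking
    num_neg = int(ratio * pos.sum().item())   # 3:1 negative-to-positive ratio
    idx = neg_loss.argsort(descending=True)[:num_neg]
    neg = torch.zeros_like(pos)
    neg[idx] = True
    return neg                                # mask of selected hard negatives
```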
Fig. 4 shows a schematic diagram of a partitioning and fusion module in a target detection system based on super-resolution reconstruction according to an embodiment of the present invention. The partitioning and fusion module 400 shown in fig. 4 may be implemented as the dividing and fusing module 108 shown in fig. 1. The present invention adds a cropping and fusion operation on the data. The upper part of fig. 4 shows the data partitioning unit 402 in the partitioning and fusion module 400, a learnable module that partitions an image into n sub-images using a sliding window whose step size can change at each step. The purpose of cropping is to preserve the reconstructed resolution while fitting the fixed input size of the object detection module; the purpose of the variable step size is to preserve the integrity of the objects to be detected in each sub-image. In the data partitioning unit 402, the step size for cropping the image data is a value obtained based on edge detection. The cropping step size is thus learnable: the network learns to predict a value that specifies the crop size, with the goal of keeping complete detection targets within each sub-image. The learning label is the edge coordinates of the objects in each image (predicting this value keeps objects relatively complete), while the benefit brought by super-resolution (higher resolution) is preserved.
The lower part of fig. 4 shows the data fusion unit 404 in the partitioning and fusion module 400. The data fusion unit 404 is responsible for recombining the cropped, detected sub-images into one image and mapping the detection result of each sub-image into the recombined large image for coordinate fusion to obtain the final detection result. The data fusion unit 404 performs bounding-box coordinate recovery, illustrated here with a fixed step size and four sub-images. Finally, after data fusion is completed in the data fusion unit 404, the target detection result is output.
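A sketch of the coordinate-recovery step of unit 404, using the fixed-step, four-sub-image example; the box layout and class labels are illustrative assumptions:

```python
def fuse_boxes(per_tile_boxes, tile_offsets):
    """Map boxes detected in each sub-image back into full-image coordinates."""
    fused = []
    for boxes, (dx, dy) in zip(per_tile_boxes, tile_offsets):
        for x1, y1, x2, y2, score, cls in boxes:
            fused.append((x1 + dx, y1 + dy, x2 + dx, y2 + dy, score, cls))
    return fused

# Example: four sub-images of a 600x600 image, fixed step of 300.
tiles = [(0, 0), (300, 0), (0, 300), (300, 300)]
dets = [[(10, 20, 50, 60, 0.9, "rivet")], [], [], [(5, 5, 40, 30, 0.8, "crack")]]
print(fuse_boxes(dets, tiles))
```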
Fig. 5 shows a flowchart of a target detection method based on super-resolution reconstruction according to an embodiment of the present invention. The method comprises the following steps: a data acquisition step S502 of acquiring image data to be detected; a super-resolution reconstruction step S504 of receiving the acquired image data and performing super-resolution reconstruction processing on it; a target detection step S506 of performing target detection on the image data subjected to the super-resolution reconstruction processing; and a dividing and fusing step S508 of cropping the image data subjected to target detection into a plurality of sub-images and mapping the detection result of each sub-image into the combined image data for coordinate fusion, thereby obtaining a target detection result.
The target detection technology and method based on super-resolution reconstruction provided by the invention can achieve the following effects:
1) Accuracy: in the scenario of detecting targets in video, the super-resolution module increases the resolution of the input image, introducing scale invariance, improving the detection of small targets, and raising overall accuracy.
2) Flexibility: the whole network is end-to-end and easy to train; the lightweight base network and the hyper-parameters used in the model sub-modules can be replaced according to the user's needs.
3) Wide applicability: the method can be applied to real-time detection tasks involving small targets in a variety of scenarios.
4) Strong generalization: given enough training samples, the learned algorithm model has excellent accuracy (generalization ability) in practical applications.

Claims (10)

1. A target detection system based on super-resolution reconstruction, the system comprising:
a data acquisition module configured to acquire image data to be detected;
a super-resolution reconstruction module configured to receive the image data acquired by the data acquisition module and perform super-resolution reconstruction processing on the image data;
a target detection module configured to perform target detection on the image data subjected to the super-resolution reconstruction processing; and
a dividing and fusing module configured to crop the image data subjected to target detection into a plurality of sub-image data, and to map the detection result of each piece of sub-image data into the combined image data for coordinate fusion, thereby obtaining a target detection result.
2. The target detection system based on super-resolution reconstruction according to claim 1, wherein in the target detection module, a Single Shot MultiBox Detector (SSD) algorithm is used.
3. The system of claim 1, wherein in the dividing and fusing module, the step size for cropping the image data subjected to target detection is a value obtained based on edge detection.
4. The super resolution reconstruction based object detection system of claim 1, wherein the super resolution reconstruction module is a trained spatiotemporal sub-pixel convolution network comprising a motion estimation part and a super resolution part, wherein the spatiotemporal sub-pixel convolution network is trained by:
the Loss formula of the super-resolution part is as follows:

$$L_{SR}(\theta) = \left\| I_t^{HR} - f\left(I_{t-1:t+1}^{LR\prime};\, \theta\right) \right\|_2^2$$

the Loss formula of the motion estimation part is as follows:

$$L_{ME}(\theta_\Delta) = \left\| I_{t+i}^{LR\prime} - I_t^{LR} \right\|_2^2 + \lambda\,\mathcal{H}\left(\partial_{x,y}\Delta_{t+i}\right)$$

wherein the Huber term $\mathcal{H}(\partial_{x,y}\Delta)$ is approximated as

$$\mathcal{H}(\partial_{x,y}\Delta) \approx \sqrt{\epsilon + \textstyle\sum_{i,j}\left[(\partial_x\Delta_{i,j})^2 + (\partial_y\Delta_{i,j})^2\right]}, \quad \epsilon = 0.01$$

the total Loss formula of the spatio-temporal sub-pixel convolution network during end-to-end training is as follows:

$$L(\theta_\Delta, \theta) = \left\| I_t^{HR} - f\left(I_{t-1:t+1}^{LR\prime};\, \theta\right) \right\|_2^2 + \sum_{i=\pm 1}\left[\left\| I_{t+i}^{LR\prime} - I_t^{LR} \right\|_2^2 + \lambda\,\mathcal{H}\left(\partial_{x,y}\Delta_{t+i}\right)\right]$$

wherein $\theta_\Delta$ is the parameter set of the motion estimation part, $\theta$ is the parameter set of the super-resolution part, $I^{LR}$ represents an image frame, and $I^{LR\prime}$ represents the image frame that has undergone the warping process.
5. The system of claim 1, wherein a target loss function $L_{Det}$ is employed in the target detection module, the target loss function $L_{Det}$ being obtained by the following equation:

$$L_{Det}(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha\,L_{loc}(x, l, g)\right)$$

wherein: $N$ is the number of default boxes matching a real box, $L_{loc}$ is the Smooth L1 loss function from Fast R-CNN, $L_{conf}$ is the Softmax loss, $c$ is the confidence of each class, and $\alpha$ is a weight term set to 1.
6. A target detection method based on super-resolution reconstruction is characterized by comprising the following steps:
a data acquisition step of acquiring image data to be detected;
a super-resolution reconstruction step of receiving the acquired image data and performing super-resolution reconstruction processing on the image data;
a target detection step of performing target detection on the image data subjected to the super-resolution reconstruction processing; and
a dividing and fusing step of cropping the image data subjected to target detection into a plurality of sub-image data and mapping the detection result of each piece of sub-image data into the combined image data for coordinate fusion, thereby obtaining a target detection result.
7. The target detection method based on super-resolution reconstruction according to claim 6, wherein in the target detection step, a Single Shot MultiBox Detector (SSD) algorithm is used.
8. The method of claim 6, wherein in the dividing and fusing step, the step size for cropping the image data subjected to target detection is a value obtained based on edge detection.
9. The super resolution reconstruction-based object detection method according to claim 8, wherein in the super resolution reconstruction step, a trained spatio-temporal sub-pixel convolution network is used, the spatio-temporal sub-pixel convolution network comprising a motion estimation part and a super resolution part,
wherein the spatiotemporal sub-pixel convolutional network is trained by:
the Loss formula of the super-resolution part is as follows:

$$L_{SR}(\theta) = \left\| I_t^{HR} - f\left(I_{t-1:t+1}^{LR\prime};\, \theta\right) \right\|_2^2$$

the Loss formula of the motion estimation part is as follows:

$$L_{ME}(\theta_\Delta) = \left\| I_{t+i}^{LR\prime} - I_t^{LR} \right\|_2^2 + \lambda\,\mathcal{H}\left(\partial_{x,y}\Delta_{t+i}\right)$$

wherein the Huber term $\mathcal{H}(\partial_{x,y}\Delta)$ is approximated as

$$\mathcal{H}(\partial_{x,y}\Delta) \approx \sqrt{\epsilon + \textstyle\sum_{i,j}\left[(\partial_x\Delta_{i,j})^2 + (\partial_y\Delta_{i,j})^2\right]}, \quad \epsilon = 0.01$$

the total Loss formula of the spatio-temporal sub-pixel convolution network during end-to-end training is as follows:

$$L(\theta_\Delta, \theta) = \left\| I_t^{HR} - f\left(I_{t-1:t+1}^{LR\prime};\, \theta\right) \right\|_2^2 + \sum_{i=\pm 1}\left[\left\| I_{t+i}^{LR\prime} - I_t^{LR} \right\|_2^2 + \lambda\,\mathcal{H}\left(\partial_{x,y}\Delta_{t+i}\right)\right]$$

wherein $\theta_\Delta$ is the parameter set of the motion estimation part, $\theta$ is the parameter set of the super-resolution part, $I^{LR}$ represents an image frame, and $I^{LR\prime}$ represents the image frame that has undergone the warping process.
10. The method of claim 6, wherein a target loss function $L_{Det}$ is employed in the target detection step, the target loss function being obtained by the following equation:

$$L_{Det}(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha\,L_{loc}(x, l, g)\right)$$

wherein: $N$ is the number of default boxes matching a real box, $L_{loc}$ is the Smooth L1 loss function from Fast R-CNN, $L_{conf}$ is the Softmax loss, $c$ is the confidence of each class, and $\alpha$ is a weight term set to 1.
CN202010052220.4A 2020-01-17 2020-01-17 Target detection system and method based on super-resolution reconstruction Pending CN113139896A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202010052220.4A | 2020-01-17 | 2020-01-17 | Target detection system and method based on super-resolution reconstruction

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202010052220.4A | 2020-01-17 | 2020-01-17 | Target detection system and method based on super-resolution reconstruction

Publications (1)

Publication Number | Publication Date
CN113139896A | 2021-07-20

Family ID: 76808361

Family Applications (1)

Application Number | Title | Priority Date | Filing Date | Status
CN202010052220.4A | Target detection system and method based on super-resolution reconstruction | 2020-01-17 | 2020-01-17 | Pending

Country Status (1)

Country Link
CN (1) CN113139896A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113420745A (en) * 2021-08-25 2021-09-21 江西中业智能科技有限公司 Image-based target identification method, system, storage medium and terminal equipment
CN115115611A (en) * 2022-07-21 2022-09-27 明觉科技(北京)有限公司 Vehicle damage identification method and device, electronic equipment and storage medium
WO2023060746A1 (en) * 2021-10-14 2023-04-20 中国科学院深圳先进技术研究院 Small image multi-object detection method based on super-resolution
WO2023123924A1 (en) * 2021-12-30 2023-07-06 深圳云天励飞技术股份有限公司 Target recognition method and apparatus, and electronic device and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination