CN115689939A - Video image stabilization method for visual detection scene of power transmission line

Info

Publication number: CN115689939A (application CN202211414383.8A)
Authority: CN (China)
Prior art keywords: target, training, image data, transmission line, power transmission
Prior art date: 2022-11-11
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Application number: CN202211414383.8A
Other languages: Chinese (zh)
Inventors: 庄杰, 付以贤, 杜远, 张健, 程凤璐, 毕宬, 陈培峰, 巩乃奇, 曹亚华
Current and original assignee: Super High Voltage Co. of State Grid Shandong Electric Power Co.
Priority and filing date: 2022-11-11; publication date: 2023-02-03
Application filed by Super High Voltage Co. of State Grid Shandong Electric Power Co.; priority to CN202211414383.8A; publication of CN115689939A. Legal status: Pending.

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a video image stabilization method for a visual detection scene of a power transmission line, which comprises the following steps: collecting outdoor environment images and labeling the non-moving objects in them; establishing original image data and target image data; constructing an original target detection model and training it to obtain a final target detection model; detecting non-moving objects in the real-time video; selecting motion estimation targets and calculating their motion vectors; calculating the global motion vector; processing it with Kalman filtering; and performing motion compensation. Because the method is designed specifically for the visual detection scene of the power transmission line, genuine target motion, such as the shaking of a galloping conductor, is not weakened, which improves the detection accuracy of conductor galloping. The method also provides the positions of non-moving objects, so the block matching algorithm can be focused on the corresponding positions more accurately, improving the accuracy of the motion vectors and hence the subsequent image stabilization effect.

Description

Video image stabilization method for visual detection scene of power transmission line
Technical Field
The invention belongs to the technical field of power transmission engineering, and particularly relates to a video image stabilization method for a visual detection scene of a power transmission line.
Background
With the development and progress of information technology, concepts such as the digital society and new infrastructure have been proposed one after another, and power transmission lines, as the arteries that keep a city running, play an important role in this transformation. Intelligent visual detection technology for power transmission lines is therefore widely applied. However, owing to external factors, the surveillance video of a power transmission line often shakes, which seriously affects the detection of abnormal events such as conductor galloping.
Video image stabilization techniques, which process the original video sequence acquired by a video device with appropriate algorithms to remove the jitter in it, are generally used to overcome this problem. The purposes of video image stabilization are to make the video comfortable for human eyes, facilitating manual observation and discrimination, and to serve as a preprocessing stage for many subsequent tasks such as detection, tracking and compression. The most common approach today is electronic (digital) image stabilization, which estimates the motion between successive video frames and then applies motion filtering and motion compensation to each frame of the video to obtain a stabilized image. Electronic (digital) image stabilization generally comprises three steps: motion estimation, motion compensation and image inpainting.
The basic idea of motion estimation is to divide each frame of an image sequence into several non-overlapping macroblocks and to assume that all pixels within a macroblock share the same displacement. For each macroblock, the most similar block (the matching block) is then sought within a given search range in a reference frame according to some matching criterion, and the relative displacement between the matching block and the current block is the motion vector; this process is called motion estimation. When a video is compressed, the current block can be completely restored by storing only the motion vector and the residual data. In the related art, local motion vectors are generally used to approximate the global motion vector; for example, the motion vectors are calculated by the block matching method described above, the optical flow method, or a feature point detection method. However, the regions selected or computed by these methods are random and have no fixed attributes. This may make the global motion estimation inaccurate: for example, when the camera and a conductor shake simultaneously, if the motion vectors between the previous and current frames are computed from the conductor region, the conductor's own shaking may be weakened after stabilization, further reducing the accuracy of conductor galloping detection.
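For concreteness, the following is a minimal sketch of the exhaustive SAD-based block matching described above. It illustrates the general prior-art technique, not the patent's specific implementation; the function name and the ±16-pixel search range are assumptions.

```python
import numpy as np

def block_match(cur_block, ref_frame, top_left, search_range=16):
    """Exhaustive block matching: compare the current block against every
    candidate position of the reference frame within +/- search_range pixels
    of its original location, using the sum of absolute differences (SAD);
    the displacement (dy, dx) of the best match is the motion vector."""
    h, w = cur_block.shape
    y0, x0 = top_left
    best_sad, best_mv = np.inf, (0, 0)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = y0 + dy, x0 + dx
            if y < 0 or x < 0 or y + h > ref_frame.shape[0] or x + w > ref_frame.shape[1]:
                continue  # candidate block falls outside the reference frame
            cand = ref_frame[y:y + h, x:x + w].astype(np.int32)
            sad = np.abs(cur_block.astype(np.int32) - cand).sum()
            if sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv  # relative displacement of the matching block
```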
The above information disclosed in this Background section is only for enhancement of understanding of the background of the application, and it may therefore contain information that does not constitute prior art already known to a person of ordinary skill in the art.
Disclosure of Invention
The invention addresses the problem that motion estimation algorithms in the prior art generally use local motion vectors to approximate the global motion vector, and that the regions they select or compute are random and have no fixed attributes. This can make the global motion estimation inaccurate; in particular, the shaking of a conductor may be weakened after image stabilization, which in turn reduces the detection accuracy of conductor galloping.
In order to realize the purpose of the invention, the following technical scheme is adopted:
the video image stabilization method for the visual detection scene of the power transmission line is characterized by comprising the following steps of:
step S1: collecting outdoor environment images captured by a data acquisition unit during surveillance, and labeling the non-moving objects in the outdoor environment images; obtaining an annotated image data set of non-moving objects;
step S2: selecting an annotated image data set as original image data; removing from the original image data the annotated object images whose area is smaller than the set rejection area, and taking the resulting annotated image data set as target image data; dividing the target image data into a training image data set and a test image data set;
and step S3: constructing an original target detection model for detecting a non-moving object;
and step S4: training the constructed original target detection model for detecting the non-moving object by using the original image data and the training image data set of the target image data to obtain a final target detection model;
step S5: sampling in real time the surveillance video shot by the data acquisition unit within its monitoring range, extracting consecutive current and reference frames from the video, inputting the current-frame and reference-frame images into the final target detection model, and detecting non-moving objects with the final target detection model;
step S6: sorting the non-moving object targets detected in the current frame and the reference frame by confidence, and selecting those whose confidence is higher than the set confidence as motion estimation targets; calculating the motion vector of each motion estimation target between the current frame and the reference frame with a block matching method;
step S7: averaging the calculated motion vectors of the motion estimation targets, and taking the average as the global motion vector;
step S8: processing the global motion vector V_global with a Kalman Filter algorithm to obtain the global motion vector to be compensated, V_comp;
step S9: performing motion compensation, which may be expressed as (x', y') = (x, y) + V_comp, where (x, y) is a pixel in the image before compensation and (x', y') is the corresponding pixel in the compensated image.
In some embodiments of the present application, the non-moving object comprises a tower, a house, a chimney, and/or a bridge.
In some embodiments of the present application, the outdoor environment image is a 2K image, and the rejection area is set to 60 pixels × 60 pixels.
In some embodiments of the present application, constructing the original target detection model for non-moving object detection includes replacing the VGG-16 backbone network in Faster R-CNN with the ResNet50 model.
In some embodiments of the present application, after the motion compensation is completed, the image-stabilized video is obtained by image inpainting using a mosaic method.
In some embodiments of the present application, training a constructed original target detection model for detecting a non-moving object using original image data and a training image data set of target image data to obtain a final target detection model includes the following steps:
step S41: inputting the images of the training image data set in the target image data into the original target detection model with ResNet50 as the backbone network, performing feature extraction with ResNet50, and applying a 3 × 3 convolution to the feature map output by the Conv_4 layer of ResNet50 to generate a feature map with 256 channels and size (H/16) × (W/16);
step S42: after the 3 × 3 convolution, applying two 1 × 1 convolutions to the ((H/16) × (W/16) × 256) feature map (256 being the number of channels) to predict, respectively, the positive/negative attribute of each prediction box and the coordinate offsets of each prediction box;
step S43: generating proposals in the proposal layer, correcting the positive-attribute prediction boxes and eliminating invalid ones, performing NMS (Non-Maximum Suppression) filtering, and finally selecting a set number of target-class prediction boxes whose probability exceeds the threshold p;
step S44: performing a region-of-interest pooling operation to map the selected target-class prediction boxes onto the feature map output by the Conv_5 layer of ResNet50 and to extract, for each selected box, the corresponding part of the feature map;
step S45: performing regression of the target prediction boxes and classification of the targets on the feature map output by the region-of-interest pooling operation;
step S46: judging whether the original target detection model has reached a first convergence degree;
step S47: when the original target detection model reaches the first convergence degree, stopping training and saving it as a primary target training model;
step S48: inputting the images in the original image data into the primary target training model output in step S47, performing feature extraction with ResNet50, repeating steps S41 to S45, and judging whether the primary target training model has reached a second convergence degree;
step S49: when the primary target training model has completely converged, stopping training and saving it as a secondary target training model;
step S50: inputting the images in the original image data into the secondary target training model output in step S49 and fine-tuning it so that its parameters adapt to the target image data; when the model converges again, stopping fine-tuning and saving the secondary target training model as the final target detection model.
In some embodiments of the present application, in step S42 the 1 × 1 convolution results form two branches. The upper branch generates a feature map of (H/16) × (W/16) × 18 channels, the 18 channels meaning that each pixel position has 9 prediction boxes, each carrying the probability of being a target and the probability of being background. The lower branch generates a feature map of (H/16) × (W/16) × 36 channels, the 36 channels meaning that each pixel position has 9 prediction boxes, each carrying four independent values corresponding to the offset tx of the center-point x coordinate, the offset ty of the center-point y coordinate, the horizontal offset tw of the box, and the vertical offset th of the box.
In some embodiments of the present application, three groups of aspect ratios ratio = [0.5, 1, 2] and three scales scale = [8, 16, 32] are defined to construct the prediction boxes.
In some embodiments of the present application, judging whether the original target detection model has reached the first convergence degree means judging whether slight convergence has been reached: within each epoch of training, judging whether the absolute value of the difference between the loss produced by the previous batch-size training and the loss produced by the current batch-size training is smaller than a first set loss value.
In some embodiments of the present application, the first set loss value is 0.1.
Compared with the prior art, the invention has the advantages and positive effects that:
compared with the prior art, the video image stabilization method for the visual detection scene of the power transmission line is particularly suitable for the visual detection scene of the power transmission line, four non-moving objects such as towers, houses, chimneys and bridges in the monitoring shooting video can be detected by using a final target detection model obtained through training, then the four non-moving objects are used for motion vector calculation and global motion estimation, video jitter can be eliminated, meanwhile, authenticity and reliability of normal moving objects, particularly conducting wires are guaranteed, and inaccurate motion compensation of the moving objects caused by motion estimation generated by superposition of the moving objects and the video jitter is reduced. In addition, the final target detection model can provide the position of a non-moving object, so that the block matching algorithm can focus on the corresponding position more accurately, the accuracy of a motion vector can be improved, and the subsequent image stabilizing effect is further improved.
Other features and advantages of the present invention will become more apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of an embodiment of a video image stabilization method for a visual detection scene of a power transmission line according to the present disclosure;
fig. 2 is a flowchart, within the video image stabilization method for the visual detection scene of the power transmission line, of training the constructed original target detection model for detecting non-moving objects with the original image data and the training image data set of the target image data to obtain the final target detection model;
fig. 3 is a schematic structural diagram of an original target detection model using a ResNet50 model as a backbone network.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings and examples.
It should be noted that, in the description of the present invention, terms indicating direction or positional relationship, such as "upper", "lower", "left", "right", "vertical", "horizontal", "inner" and "outer", are based on the directions or positional relationships shown in the drawings. They are used merely for convenience of description and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first", "second" and the like are used for descriptive purposes only and should not be construed as indicating or implying relative importance.
Aiming at the problems that motion estimation algorithms in the prior art usually use local motion vectors to approximate the global motion vector, that the regions they select or compute are random and have no fixed attributes, that the global motion estimation is consequently inaccurate, and in particular that the shaking of a conductor may be weakened after image stabilization, further reducing the detection accuracy of conductor galloping, the present application designs and provides a video image stabilization method for the visual detection scene of the power transmission line. Conductor galloping refers to the low-frequency, large-amplitude self-excited vibration produced by wind acting on a transmission line with a non-circular cross section. It is a ubiquitous phenomenon on power transmission and distribution lines, occurs over a wide range, and can appear at all voltage levels whenever the conditions are met. Conductor galloping carries enormous energy, and in the section perpendicular to the conductor axis the galloping trajectory is elliptical. The peak amplitude at an antinode on the conductor ranges from tens of centimeters to twelve or thirteen meters, and the maximum amplitude can reach 5 to 300 times the diameter of the transmission line. Long-lasting, energetic, large-amplitude conductor galloping increases the stress on conductors, fittings, cross-arms, poles and so on, and can cause direct damage or fatigue failure of conductors, fittings and even towers, reducing their service life.
Conductor galloping detection can be realized by a power transmission line state monitoring system, which comprises data acquisition units, data monitoring terminals, power transmission line state monitoring devices and a power transmission line state monitoring master station system. The data acquisition units are information measuring devices, based on various principles, installed on conductors, ground wires (including Optical Fiber Composite Overhead Ground Wires, OPGW), insulators, towers, foundations and so on; they transmit measurement information to the data monitoring terminals through a communication network and respond to instructions from the data monitoring terminals. A data monitoring terminal is a device that collects the information of the data acquisition units, stores and processes it on site, and can exchange information with the master station system. A power transmission line state monitoring device is a measuring device that can collect information on the line body, the meteorological environment, channel conditions and so on in real time and transmit it to the master station system through a communication network. The power transmission line state monitoring master station system is a computer system that can receive the state monitoring information of various transmission equipment and perform centralized storage, unified processing and application; it generally comprises an information access front-end processor, a centralized database, a data service module, a data processing module and various state monitoring application function modules.
The video image stabilization method for the visual detection scene of the power transmission line can be realized by any one of a data acquisition unit, a data monitoring terminal, a power transmission line state monitoring device and a master station system. In a preferred embodiment, taking into account the data processing capacity, this can be realized, for example, by the master station system.
The video image stabilization method for the visual detection scene of the power transmission line provided by the application comprises a plurality of steps as shown in fig. 1.
Step S1: collecting outdoor environment images captured by the data acquisition unit during surveillance, and labeling the four types of non-moving objects (towers, houses, chimneys and bridges); obtaining an annotated image data set of the four types of non-moving objects.
In some optional embodiments of the present application, the outdoor environment image is a 2K image composed of 2048 × 1080 pixels, where 2048 is the number of pixels in the horizontal direction and 1080 the number in the vertical direction. The four types of non-moving objects (towers, houses, chimneys and bridges) in the outdoor environment images are labeled with 2D labeling boxes. When labeling, only the chimney body is labeled, excluding smoke and water vapor; for bridges, only images captured from a side view of the bridge are labeled, and bridges shot from a top view are not labeled. The labeling can be done by professional data annotators.
Step S2: selecting an annotated image data set as original image data; removing from the original image data the annotated object images whose area is smaller than the set rejection area, and taking the resulting annotated image data set as target image data; dividing the target image data into a training image data set and a test image data set.
The set rejection area is the critical area below which the feature model of an object can no longer be recognized. In some embodiments of the present application, for a 2K image the rejection area is set to 60 × 60 pixels, 60 being the number of pixels in each of the horizontal and vertical directions. Statistical analysis of the non-moving objects in the original image data shows that a non-moving object inside a labeling box smaller than 60 × 60 pixels of a 2K image is difficult to distinguish as a feature model. The rejection area is therefore set to 60 × 60 pixels, which makes the target image data high-quality.
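As an illustration only, the small-annotation filter of step S2 could look like the following sketch; the annotation format (a list of dicts with an [x, y, w, h] "bbox" entry) and the function name are assumptions, not taken from the patent.

```python
MIN_AREA = 60 * 60  # set rejection area for a 2K (2048 x 1080 pixel) image

def build_target_image_data(annotations):
    """Drop labeled objects whose box area is below the set rejection area;
    what remains forms the target image data (step S2)."""
    return [ann for ann in annotations
            if ann["bbox"][2] * ann["bbox"][3] >= MIN_AREA]
```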
Step S3: constructing an original target detection model for detecting non-moving objects.
Constructing the target detection model for detecting non-moving objects specifically comprises the following:
replacing the VGG-16 backbone network in Faster R-CNN with the ResNet50 model.
In some embodiments of the present application, the target detection model is improved on the basis of Faster R-CNN (Region-based Convolutional Neural Network). The basic structure of the traditional Faster R-CNN comprises a feature extraction part, a Region Proposal Network (RPN) part, a proposal layer part and a region-of-interest pooling (ROI pooling) part. The feature extraction part extracts a feature map from the image to be processed by convolution and pooling; the region proposal network part learns the approximate positions of objects from the feature map through network training; the proposal layer part continues training with those approximate positions to obtain more accurate positions; and the region-of-interest pooling part uses the accurate positions to extract the objects to be classified from the feature map and pools them into fixed-length data.
The traditional feature extraction part adopts the VGG-16 (Visual Geometry Group) model as the backbone network; the model comprises 16 layers, namely 13 convolutional layers and 3 fully connected layers. The VGG-16 model has strong fitting capability, and box and confidence prediction is carried out on the feature map generated after the input image is downsampled by a factor of 16; however, the feature extraction capability of VGG-16 is limited and does not sufficiently match the feature extraction requirements of the present application. To improve the feature extraction capability, in some embodiments of the present application the VGG-16 backbone network in Faster R-CNN is replaced with a deep residual network, namely the ResNet50 model.
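As a rough illustration of this substitution, the sketch below assembles a Faster R-CNN detector on a ResNet50 backbone with torchvision. It is an approximation under stated assumptions: the patent's network takes RPN features from the Conv_4 stage (stride 16) without an FPN, the anchor sizes follow scales [8, 16, 32] on that stride, and the class count is 5 (four non-moving object types plus background).

```python
import torch.nn as nn
import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator

resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
# keep the layers up to and including layer3 (the Conv_4 stage, stride 16)
body = nn.Sequential(*list(resnet.children())[:7])
body.out_channels = 1024  # channel count of the ResNet50 Conv_4 output

# scales [8, 16, 32] on a stride-16 map correspond to 128/256/512-pixel boxes
anchors = AnchorGenerator(sizes=((8 * 16, 16 * 16, 32 * 16),),
                          aspect_ratios=((0.5, 1.0, 2.0),))
model = FasterRCNN(body, num_classes=5, rpn_anchor_generator=anchors)
```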
Step S4: training the constructed original target detection model for detecting non-moving objects with the original image data and the training image data set of the target image data to obtain the final target detection model.
The training process is described below, and includes a number of steps as shown in fig. 2.
The size of the images in the training image data set may be denoted H × W, where H denotes the number of pixels in the horizontal direction and W the number of pixels in the vertical direction; illustratively, H is 2048 pixels and W is 1080 pixels.
Step S41: as shown in fig. 3, the images of the training image data set in the target image data are input into the original target detection model with ResNet50 as the backbone network, feature extraction is performed with ResNet50, and a 3 × 3 convolution is applied to the feature map output by the Conv_4 layer of ResNet50 (Conv is an abbreviation of convolution) to generate a feature map with 256 channels and size (H/16) × (W/16), which can be written ((H/16) × (W/16) × 256). The processing before the Conv_4 layer is identical to the conventional ResNet50 model and is not repeated here; fig. 3 also includes the Conv_1, Conv_2 and Conv_3 layers. In the figure, conv denotes convolution, relu the activation function, full connection the fully connected layer, reshape the reconstruction layer, softmax the softmax function, proposal the proposal layer, ROI pooling region-of-interest pooling, bbox the bounding box, and class the classifier; these terms are well known in the neural network field and are not described one by one here.
Step S42: after the 3 × 3 convolution, two 1 × 1 convolutions are applied to the ((H/16) × (W/16) × 256) feature map, and the 1 × 1 convolution results form two branches. The upper branch generates a feature map of (H/16) × (W/16) × 18 channels; the 18 channels mean that each pixel position has 9 prediction boxes (anchors), each carrying two independent values corresponding to the probability of being a target and the probability of being background. The lower branch generates a feature map of (H/16) × (W/16) × 36 channels; the 36 channels mean that each pixel position has 9 prediction boxes, each carrying four independent values corresponding to the offset tx of the center-point x coordinate, the offset ty of the center-point y coordinate, the horizontal offset tw of the box, and the vertical offset th of the box.
Three groups of aspect ratios ratio = [0.5, 1, 2] and three scales scale = [8, 16, 32] are defined to construct the prediction boxes (anchors); combining the two groups of parameters yields 9 boxes of different shapes and sizes. For each prediction box generated by the upper branch, the larger of the two obtained values (the probability of being a target and the probability of being background) is the maximum prediction probability p, and the category corresponding to that probability is the category of the box; prediction boxes are thereby divided into target-class prediction boxes and background-class prediction boxes.
After the above steps, the positive or negative attribute of each prediction box can be predicted (a target-class prediction box is positive, a background-class prediction box is negative), and the coordinate offsets of the prediction boxes are predicted at the same time.
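The two-branch RPN head of steps S41-S42 can be sketched in PyTorch as follows; the module and variable names are assumptions, and the 1024 input channels correspond to the ResNet50 Conv_4 output.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """3x3 convolution to 256 channels, then two parallel 1x1 convolutions:
    9 anchors x 2 class scores = 18 channels (upper branch) and
    9 anchors x 4 offsets (tx, ty, tw, th) = 36 channels (lower branch)."""
    def __init__(self, in_channels=1024, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 256, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(256, num_anchors * 2, kernel_size=1)
        self.reg = nn.Conv2d(256, num_anchors * 4, kernel_size=1)

    def forward(self, feat):
        x = torch.relu(self.conv(feat))   # (N, 256, H/16, W/16)
        return self.cls(x), self.reg(x)   # positive/negative scores, offsets

# e.g. for a 2048 x 1080 input, the Conv_4 map is roughly 128 x 67:
scores, offsets = RPNHead()(torch.randn(1, 1024, 67, 128))
print(scores.shape, offsets.shape)  # (1, 18, 67, 128) and (1, 36, 67, 128)
```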
Step S43: proposals are generated in the proposal layer, the positive-attribute prediction boxes are corrected and invalid ones eliminated, NMS (Non-Maximum Suppression) filtering is performed, and finally a set number of target-class prediction boxes whose probability exceeds the threshold p are selected.
Step S44: a region-of-interest pooling (ROI pooling) operation is performed to map the selected target-class prediction boxes onto the feature map output by the Conv_5 layer of ResNet50 and to extract, for each selected box, the corresponding part of the feature map.
Step S45: regression of the target prediction boxes and classification of the targets are performed on the feature map output by the region-of-interest pooling (ROI pooling) operation.
Step S46: judging whether the original target detection model has reached a first convergence degree.
Judging whether the original target detection model has reached the first convergence degree means judging whether it has reached a slightly converged state: more specifically, within each epoch of training (an epoch being one complete pass of the training image data set through the original target detection model), it is judged whether the absolute value of the difference between the loss produced by the previous batch-size training and the loss produced by the current batch-size training is smaller than a first set loss value (for example, 0.1). Slight convergence refers to the interval of the training process in which the loss is still decreasing but only gently, i.e. the absolute value of the difference between the loss of the previous round and that of the current round is less than 0.1.
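A minimal sketch of this "slight convergence" stopping rule follows; the training-loop wiring (model, data loader, optimizer, loss function) is an illustrative assumption.

```python
FIRST_SET_LOSS = 0.1  # first set loss value from the description

def train_until_slight_convergence(model, loader, optimizer, loss_fn, max_epochs=50):
    """Stop when two successive batch losses differ by less than the first
    set loss value, i.e. the loss curve has flattened (first convergence)."""
    prev = None
    for _ in range(max_epochs):
        for images, targets in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images), targets)
            loss.backward()
            optimizer.step()
            if prev is not None and abs(prev - loss.item()) < FIRST_SET_LOSS:
                return model  # save this as the primary target training model
            prev = loss.item()
    return model
```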
Step S47: when the original target detection model reaches the first convergence degree, namely slight convergence, training is stopped and the original target detection model is saved as the primary target training model.
Step S48: the images in the original image data are input into the primary target training model output in step S47, feature extraction is performed with ResNet50, steps S41 to S45 are repeated, and it is judged whether the primary target training model has reached a second convergence degree.
Judging whether the primary target training model has reached the second convergence degree means judging whether it has reached a completely converged state, i.e. the set ideal convergence state.
Step S49: when the primary target training model has completely converged, training is stopped and the primary target training model is saved as the secondary target training model.
Step S50: the images in the original image data are input into the secondary training model output in step S49, and the secondary training model is fine-tuned (Fine Tune) so that its parameters adapt to the target image data; when the secondary training model converges again, fine-tuning is stopped and the secondary target training model is saved as the final target detection model. The trained final target detection model can be evaluated with the test image data set.
Step S5: the surveillance video shot by the data acquisition unit within its monitoring range is sampled in real time, consecutive current and reference frames are extracted from the video, the current-frame and reference-frame images are input into the final target detection model, and the final target detection model is used to detect the four types of non-moving objects (towers, houses, chimneys and bridges).
Step S6: the non-moving object targets (towers, houses, chimneys and bridges) detected in the current frame and the reference frame are sorted by confidence, and those whose confidence is higher than the set confidence are selected as motion estimation targets; the motion vector of each motion estimation target between the current frame and the reference frame is then calculated by a block matching method. The block matching method uses a model disclosed in the prior art; this part is not improved by the present application and is not described further here. A sketch of how per-target motion vectors and their average could be computed is given after this paragraph.
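As a sketch only, one way to obtain a motion vector per detected non-moving object is OpenCV template matching over a small search window around each detection box, then averaging the per-target vectors for step S7 below. The box format (x, y, w, h), the 32-pixel margin and the function names are assumptions, and cv2.matchTemplate stands in for the prior-art block matching the patent refers to.

```python
import cv2
import numpy as np

def target_motion_vector(cur_gray, ref_gray, box, margin=32):
    """Match the target's block from the current frame inside a padded
    window of the reference frame; the best-match displacement is the
    target's motion vector (step S6)."""
    x, y, w, h = box
    block = cur_gray[y:y + h, x:x + w]
    y0, x0 = max(0, y - margin), max(0, x - margin)
    window = ref_gray[y0:y + h + margin, x0:x + w + margin]
    scores = cv2.matchTemplate(window, block, cv2.TM_SQDIFF)
    _, _, min_loc, _ = cv2.minMaxLoc(scores)  # TM_SQDIFF: minimum is best
    bx, by = min_loc
    return np.array([bx + x0 - x, by + y0 - y], dtype=np.float32)

def global_motion_vector(cur_gray, ref_gray, boxes):
    """Step S7: average the motion vectors of all motion estimation targets."""
    return np.mean([target_motion_vector(cur_gray, ref_gray, b) for b in boxes],
                   axis=0)
```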
Step S7: the calculated motion vectors of the motion estimation targets are averaged, and the average is taken as the global motion vector:

V_global = (1/n) · (V_1 + V_2 + … + V_n)

wherein V_global represents the global motion vector, n represents the number of motion estimation targets, and V_i represents the motion vector of the i-th motion estimation target.
Step S8: the global motion vector V_global is processed with a Kalman Filter algorithm to obtain the global motion vector to be compensated, V_comp.
Step S9: motion compensation is performed, and the process can be expressed as:

(x', y') = (x, y) + V_comp

wherein (x, y) is a pixel in the image before compensation and (x', y') is the corresponding pixel in the compensated image. After motion compensation is finished, image inpainting is performed with a mosaic method to obtain the final stabilized video.
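The following sketch applies step S9 with cv2.warpAffine and then patches the border strip exposed by the shift with a coarse mosaic fill; using edge replication plus tile averaging as the "mosaic method" is an assumption, since the patent does not detail its inpainting.

```python
import cv2
import numpy as np

def compensate(frame, v_comp, cell=8):
    """Translate the frame by the compensation vector, then cover the border
    strips uncovered by the shift with a coarse mosaic (step S9)."""
    h, w = frame.shape[:2]
    dx, dy = float(v_comp[0]), float(v_comp[1])
    m = np.float32([[1, 0, dx], [0, 1, dy]])  # pure-translation affine matrix
    warped = cv2.warpAffine(frame, m, (w, h), borderMode=cv2.BORDER_REPLICATE)
    # build a mosaic version of the frame by down- and up-sampling in tiles
    mosaic = cv2.resize(cv2.resize(warped, (w // cell, h // cell)),
                        (w, h), interpolation=cv2.INTER_NEAREST)
    out = warped.copy()
    bx, by = int(abs(dx)) + 1, int(abs(dy)) + 1  # width of the exposed strips
    if dx > 0:   out[:, :bx] = mosaic[:, :bx]
    elif dx < 0: out[:, w - bx:] = mosaic[:, w - bx:]
    if dy > 0:   out[:by, :] = mosaic[:by, :]
    elif dy < 0: out[h - by:, :] = mosaic[h - by:, :]
    return out
```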
Compared with the prior art, the video image stabilization method for the visual detection scene of the power transmission line provided by the application is particularly suited to the visual detection scene of the power transmission line. With this method, the final target detection model obtained by training detects the four types of non-moving objects (towers, houses, chimneys and bridges) in the surveillance video, and these non-moving objects are then used for motion vector calculation and global motion estimation. Video jitter can thus be eliminated while the authenticity and reliability of normally moving objects, especially conductors, are preserved, and inaccurate motion compensation of moving objects, caused by motion estimation in which object motion is superimposed on video jitter, is reduced. In addition, the final target detection model provides the positions of the non-moving objects, so the block matching algorithm can be focused on the corresponding positions more accurately, which improves the accuracy of the motion vectors and hence the subsequent image stabilization effect.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof; such modifications and substitutions do not depart from the spirit and scope of the corresponding claims.

Claims (10)

1. The video image stabilization method for the visual detection scene of the power transmission line is characterized by comprising the following steps of:
step S1: collecting outdoor environment images captured by a data acquisition unit during surveillance, and labeling the non-moving objects in the outdoor environment images; obtaining an annotated image data set of non-moving objects;
step S2: selecting an annotated image data set as original image data; removing from the original image data the annotated object images whose area is smaller than the set rejection area, and taking the resulting annotated image data set as target image data; dividing the target image data into a training image data set and a test image data set;
and step S3: constructing an original target detection model for detecting a non-moving object;
and step S4: training the constructed original target detection model for detecting the non-moving object by using the original image data and the training image data set of the target image data to obtain a final target detection model;
step S5: sampling in real time the surveillance video shot by the data acquisition unit within its monitoring range, extracting consecutive current and reference frames from the video, inputting the current-frame and reference-frame images into the final target detection model, and detecting non-moving objects with the final target detection model;
step S6: sorting the non-moving object targets detected in the current frame and the reference frame by confidence, and selecting those whose confidence is higher than the set confidence as motion estimation targets; calculating the motion vector of each motion estimation target between the current frame and the reference frame with a block matching method;
step S7: averaging the calculated motion vectors of the motion estimation targets, and taking the average as the global motion vector;
step S8: processing the global motion vector V_global with a Kalman Filter algorithm to obtain the global motion vector to be compensated, V_comp;
step S9: performing motion compensation, which may be expressed as (x', y') = (x, y) + V_comp, where (x, y) is a pixel in the image before compensation and (x', y') is the corresponding pixel in the compensated image.
2. The video image stabilization method for the visual detection scene of the power transmission line according to claim 1, wherein the non-moving object comprises a tower, a house, a chimney and/or a bridge.
3. The video image stabilization method for the visual detection scene of the power transmission line according to claim 1, wherein the outdoor environment image is a 2K image, and the set rejection area is 60 pixels x 60 pixels.
4. The video image stabilization method for the visual detection scene of the power transmission line according to claim 1, wherein constructing the original target detection model for non-moving object detection comprises replacing the VGG-16 backbone network in Faster R-CNN with the ResNet50 model.
5. The video image stabilization method for the visual detection scene of the power transmission line according to claim 1, wherein, after the motion compensation is completed, image inpainting is performed with a mosaic method to obtain the stabilized video.
6. The video image stabilization method for the visual detection scene of the power transmission line according to any one of claims 1 to 5, characterized in that:
the method for training the constructed original target detection model for detecting the non-moving object by using the original image data and the training image data set of the target image data to obtain the final target detection model comprises the following steps:
step S41: inputting the images of the training image data set in the target image data into the original target detection model with ResNet50 as the backbone network, performing feature extraction with ResNet50, and applying a 3 × 3 convolution to the feature map output by the Conv_4 layer of ResNet50 to generate a feature map with 256 channels and size (H/16) × (W/16);
step S42: after the 3 × 3 convolution, applying two 1 × 1 convolutions to the ((H/16) × (W/16) × 256) feature map (256 being the number of channels) to predict, respectively, the positive/negative attribute of each prediction box and the coordinate offsets of each prediction box;
step S43: generating proposals in the proposal layer, correcting the positive-attribute prediction boxes and eliminating invalid ones, performing NMS (Non-Maximum Suppression) filtering, and finally selecting a set number of target-class prediction boxes whose probability exceeds the threshold p;
step S44: performing a region-of-interest pooling operation to map the selected target-class prediction boxes onto the feature map output by the Conv_5 layer of ResNet50 and to extract, for each selected box, the corresponding part of the feature map;
step S45: performing regression of the target prediction boxes and classification of the targets on the feature map output by the region-of-interest pooling operation;
step S46: judging whether the original target detection model has reached a first convergence degree;
step S47: when the original target detection model reaches the first convergence degree, stopping training and saving it as a primary target training model;
step S48: inputting the images in the original image data into the primary target training model output in step S47, performing feature extraction with ResNet50, repeating steps S41 to S45, and judging whether the primary target training model has reached a second convergence degree;
step S49: when the primary target training model has completely converged, stopping training and saving it as a secondary target training model;
step S50: inputting the images in the original image data into the secondary target training model output in step S49 and fine-tuning it so that its parameters adapt to the target image data; when the model converges again, stopping fine-tuning and saving the secondary target training model as the final target detection model.
7. The video image stabilization method for the visual detection scene of the power transmission line according to claim 6, characterized in that:
in step S42, the 1 × 1 convolution results form two branches; the upper branch generates a feature map of (H/16) × (W/16) × 18 channels, the 18 channels meaning that each pixel position has 9 prediction boxes, each carrying the probability of being a target and the probability of being background; the lower branch generates a feature map of (H/16) × (W/16) × 36 channels, the 36 channels meaning that each pixel position has 9 prediction boxes, each carrying four independent values corresponding to the offset tx of the center-point x coordinate, the offset ty of the center-point y coordinate, the horizontal offset tw of the box, and the vertical offset th of the box.
8. The video image stabilization method for the visual detection scene of the power transmission line according to claim 7, characterized in that:
three groups of aspect ratios ratio = [0.5, 1, 2] and three scales scale = [8, 16, 32] are defined to construct the prediction boxes.
9. The video image stabilization method for the visual detection scene of the power transmission line according to claim 8, characterized in that:
judging whether the original target detection model has reached the first convergence degree means judging whether slight convergence has been reached: within each epoch of training, judging whether the absolute value of the difference between the loss produced by the previous batch-size training and the loss produced by the current batch-size training is smaller than a first set loss value.
10. The video image stabilization method for the visual detection scene of the power transmission line according to claim 9, wherein the first set loss value is 0.1.
CN202211414383.8A (filed 2022-11-11, priority 2022-11-11) — Video image stabilization method for visual detection scene of power transmission line. Status: Pending. Published as CN115689939A (en).

Priority Applications (1)

Application number: CN202211414383.8A; priority and filing date: 2022-11-11; title: Video image stabilization method for visual detection scene of power transmission line

Publications (1)

Publication number: CN115689939A; publication date: 2023-02-03

Family: ID=85052393

Family Applications (1)

Application number: CN202211414383.8A; status: pending; title: Video image stabilization method for visual detection scene of power transmission line

Country Status (1)

CN (1): CN115689939A (en)

Cited By (2)

* Cited by examiner, † Cited by third party

Publication number Priority date Publication date Assignee Title
CN117505811A * 2024-01-08 2024-02-06 北京适创科技有限公司 Die temperature control method and related device
CN117505811B * 2024-01-08 2024-04-05 北京适创科技有限公司 Die temperature control method and related device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination