CN117095317A - Unmanned aerial vehicle three-dimensional image entity identification and time positioning method - Google Patents

Unmanned aerial vehicle three-dimensional image entity identification and time positioning method

Info

Publication number
CN117095317A
CN117095317A (application CN202311352027.2A)
Authority
CN
China
Prior art keywords
entity
time period
resnet
time
classifier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311352027.2A
Other languages
Chinese (zh)
Other versions
CN117095317B (en)
Inventor
周皓然
叶绍泽
陆国锋
陈康
袁杰遵
余齐
张举冠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Senge Data Technology Co ltd
Original Assignee
Shenzhen Senge Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Senge Data Technology Co ltd filed Critical Shenzhen Senge Data Technology Co ltd
Priority to CN202311352027.2A priority Critical patent/CN117095317B/en
Publication of CN117095317A publication Critical patent/CN117095317A/en
Application granted granted Critical
Publication of CN117095317B publication Critical patent/CN117095317B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/17Terrestrial scenes taken from planes or by drones
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Remote Sensing (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses an unmanned aerial vehicle three-dimensional image entity identification and time positioning method, belonging to the technical field of geographic information and comprising the following steps: S10: acquiring time periods that may contain an entity from the unmanned aerial vehicle video, and constructing a two-class training data set; S20: training a ResNet model on the data set to obtain a binary classifier ResNet_1; S30: recommending entity time periods using the features extracted by the binary classifier ResNet_1; S40: refining the boundaries of the recommended entity time periods; S50: constructing a (K+1)-class training data set comprising K classes of entity frames and background frames; S60: training a ResNet model on the data set constructed in step S50 to obtain a (K+1)-class classifier ResNet_2; S70: extracting features of specific entity time periods using the (K+1)-class classifier ResNet_2; S80: classifying the specific entity time periods using a (K+1)-class SVM classifier. The beneficial effects of the application are as follows: the method classifies entire video segments rather than single frames, reduces the influence of single-frame identification errors, and makes the identification of artificial geographic entities more accurate.

Description

Unmanned aerial vehicle three-dimensional image entity identification and time positioning method
Technical Field
The application relates to the technical field of geographic information, in particular to an unmanned aerial vehicle three-dimensional image entity identification and time positioning method.
Background
Artificial geographic entities in real-scene three-dimensional modeling are geographic entities constructed or modified by humans, such as water conservancy works, transportation infrastructure, buildings and site facilities. In real-scene three-dimensional modeling, unmanned aerial vehicles are commonly used to acquire data on artificial geographic entities through oblique photography; the artificial geographic entities are then located in time within the acquired data, the corresponding video segments are extracted, and three-dimensional modeling of the artificial geographic entities is carried out around these video segments. The identification of artificial geographic entities and their time positioning in video are therefore of vital importance. With the continuous progress of geographic information science and the development of spatial data acquisition technology, real-scene three-dimensional technology has become an important means of acquiring urban and natural resource spatial data. A real-scene three-dimensional model can realize a full, multi-scale, multi-source and multi-type three-dimensional visual expression of the real world, plays an important role in the construction of real-scene three-dimensional China, and provides powerful support for the construction of smart cities.
In recent years, unmanned aerial vehicle technology has been widely applied and rapidly developed, and plays an important role in fields such as land exploration, geographic information acquisition and environment monitoring. An unmanned aerial vehicle equipped with a high-resolution camera or sensor performs aerial photography and acquires three-dimensional image information of the ground surface, which has become an important way of acquiring geographic information. Three-dimensional images captured by unmanned aerial vehicles, satellites and similar platforms are essentially complex image data containing a large amount of spatial geographic information, from which useful information can be extracted through specific preprocessing and analysis. Artificial geographic entity identification is the step in this preprocessing and analysis in which geographic entities in the images are intelligently identified and classified by computer algorithms and pattern recognition techniques. How to extract geographic entity information rapidly and accurately from a large number of complex unmanned aerial vehicle images and to position it precisely in time remains an important research topic: the three-dimensional image data collected by unmanned aerial vehicles is huge and complex, and effective data processing and analysis means are required to extract useful information from it. Traditional manual identification and processing methods are time-consuming and labor-intensive, and can hardly meet modern efficiency requirements.
Disclosure of Invention
In order to overcome the defects of the prior art, the application provides an unmanned aerial vehicle three-dimensional image entity identification and time positioning method that classifies entire video segments, reduces the influence of single-frame identification errors, and makes the identification of artificial geographic entities more accurate.
The technical scheme adopted to solve the technical problem is as follows: in an unmanned aerial vehicle three-dimensional image entity identification and time positioning method, the improvement comprises the following steps:
s10: acquiring a time period possibly containing an entity from the unmanned aerial vehicle video, and constructing a two-class training data set;
s20: training a ResNet model on the data set to obtain a two-classifier ResNet_1;
s30: recommending the entity time period by utilizing the characteristics extracted by the two classifiers ResNet_1;
s40: refining the boundary of the recommended entity time period;
s50: constructing a K+1 class training data set, wherein the data set comprises a K class entity frame and a background frame;
s60: training the ResNet model on the data set constructed in the step S50 to obtain a K+1 classifier ResNet_2;
s70: extracting specific entity time period characteristics by using a K+1 classifier ResNet_2;
s80: and classifying the specific entity time period by using a K+1 classifier SVM.
Further, in step S10, the data set includes two classes, entity frames and background frames, and at this stage the entity frames do not distinguish specific building entities, traffic entities or the like.
Further, in step S20, the classifier ResNet_1 classifies a video frame as a background frame or an entity frame.
Further, step S30 includes the following steps:
S301: taking each frame of the video as an initial time period and forming a time period set;
S302: extracting features from the video frames using the binary classifier ResNet_1;
S303: recommending entity time periods based on feature similarity.
Further, step S40 includes the following steps:
S401: extracting features of the recommended entity time periods;
S402: computing the mean of the feature set of each recommended entity time period and applying L2 normalization to obtain its feature expression;
S403: training a fully connected neural network;
S404: taking the feature expression as the input of the neural network and outputting the confidence score of the entity time period and the offsets of the time period boundaries;
S405: removing redundant entity time periods using temporal non-maximum suppression.
Further, in step S403, the fully connected neural network is trained simultaneously as a classifier and a boundary regressor by a multi-task learning method, and the loss function used in training consists of two parts: a Softmax cross entropy loss function for the classification task and a loss function for the boundary offset regression task of entity time periods.
Further, in step S405, the temporal non-maximum suppression measures the degree of overlap of two time periods by calculating a time overlap ratio, expressed as:
tIoU = |time period 1 ∩ time period 2| / |time period 1 ∪ time period 2|
wherein time period 1 and time period 2 are the two overlapping time periods, the numerator is the number of frames in their intersection and the denominator is the number of frames in their union.
Further, the Softmax cross entropy loss function is defined as:
L_cls = -(1/N) · Σ_{i=1}^{N} y_i′ · log(y_i)
where N is the number of samples, y_i′ is the expected classification result of the i-th sample, and y_i is the Softmax score of the i-th sample actually output by the neural network;
L_reg is the loss function of the boundary offset regression task for entity time periods, defined as:
L_reg = (1/N_p) · Σ_{i=1}^{N} label_i · (O_{s,i} + O_{e,i})
where label_i is the label value of sample i (1 for positive samples, 0 for negative samples), N_p is the number of positive samples, O_{s,i} is the offset of the first frame of entity time period i in the video, and O_{e,i} is the offset of the last frame of entity time period i in the video.
Further, step S70 includes the following steps:
S701: extracting features from the video frames using the (K+1)-class classifier ResNet_2;
S702: computing the feature expressions of the recommended specific entity time periods.
The beneficial effects of the application are as follows: the method classifies entire video segments rather than single frames, reduces the influence of single-frame identification errors, and makes the identification of artificial geographic entities more accurate.
Drawings
FIG. 1 is a flow chart of a method for three-dimensional image entity identification and time positioning of an unmanned aerial vehicle according to the present application;
FIG. 2 is a diagram of an example boundary refinement of a time period of an artificial geographic entity of the present application;
FIG. 3 is a diagram of a fully connected neural network of the present application;
fig. 4 is an exemplary diagram of the time overlap ratio tIoU according to the present application.
Detailed Description
The application will be further described with reference to the drawings and examples.
The conception, specific structure and technical effects of the present application will be clearly and completely described below with reference to the embodiments and the drawings, so that the objects, features and effects of the application can be fully understood. It is apparent that the described embodiments are only some embodiments of the present application rather than all of them, and other embodiments obtained by those skilled in the art without inventive effort based on the embodiments of the present application fall within the scope of protection of the present application. In addition, the coupling/connection relationships mentioned in this patent do not refer solely to direct connections between components, but mean that a better coupling structure can be formed by adding or omitting auxiliary coupling components according to the specific implementation. The technical features of the application can be combined with each other provided that they do not contradict or conflict.
Referring to fig. 1 to 4, the application provides an unmanned aerial vehicle three-dimensional image entity identification and time positioning method, which in this embodiment comprises the following steps:
s10: acquiring a time period possibly containing an entity from the unmanned aerial vehicle video, and constructing a two-class training data set;
s20: training a ResNet model on the data set to obtain a two-classifier ResNet_1;
s30: recommending the entity time period by utilizing the characteristics extracted by the two classifiers ResNet_1;
s40: refining the boundary of the recommended entity time period;
s50: constructing a K+1 class training data set, wherein the data set comprises a K class entity frame and a background frame;
s60: training the ResNet model on the data set constructed in the step S50 to obtain a K+1 classifier ResNet_2;
s70: extracting specific entity time period characteristics by using a K+1 classifier ResNet_2;
s80: and classifying the specific entity time period by using a K+1 classifier SVM.
Through this series of operations, the method first recommends artificial geographic entity time periods with the binary classifier ResNet_1 and then classifies the artificial geographic entity time periods with the (K+1)-class classifier. The method exploits the similarity of adjacent frames and classifies entire video segments, which reduces the influence of single-frame identification errors. By handling both time period recommendation and time period classification from the frame level up to the video segment level, the method identifies artificial geographic entities from the local to the global scale, making their identification more accurate.
Further, time periods that may contain artificial geographic entities are recommended from the unmanned aerial vehicle video. Because two entities may overlap in time in the video, the recommended time periods may also overlap in time. A two-class training data set is constructed from video frames; in step S10, the data set includes two classes, entity frames and background (non-entity) frames, and at this stage the entity frames do not distinguish specific building entities, traffic entities or the like. In this process the model learns the data features of the training set and forms a judgment criterion. In step S20, the classifier ResNet_1 classifies a video frame as a background (non-entity) frame or an entity frame. For an unmanned aerial vehicle video of length N frames, the frame set is F = {f_1, …, f_N}. Each frame is classified with the trained classifier ResNet_1, yielding the classification result set C = {c_1, …, c_N} and the classification score set S = {s_1, …, s_N}; at the same time, the Pool5 feature of each frame is extracted, giving the feature set P = {p_1, …, p_N}.
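As an illustration of steps S10 to S20 and of the per-frame outputs described above, the following Python sketch shows how a binary frame classifier and its pooled features might be obtained from a standard torchvision ResNet. The class name FrameClassifier, the choice of ResNet-50 as the backbone and the way the three outputs are returned are assumptions made for illustration; the patent does not specify these details.

```python
import torch
import torch.nn as nn
from torchvision import models

class FrameClassifier(nn.Module):
    """Binary frame classifier (entity frame vs. background frame), a stand-in for ResNet_1.

    The backbone's global-average-pooled output plays the role of the Pool5
    feature referenced in the text.
    """

    def __init__(self, num_classes: int = 2):
        super().__init__()
        backbone = models.resnet50(weights=None)  # pretrained weights could be used instead
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # everything up to global pooling
        self.fc = nn.Linear(backbone.fc.in_features, num_classes)

    def forward(self, x: torch.Tensor):
        pool5 = self.features(x).flatten(1)      # per-frame feature p_i
        logits = self.fc(pool5)
        scores = torch.softmax(logits, dim=1)    # classification scores s_i
        labels = scores.argmax(dim=1)            # classification results c_i
        return labels, scores, pool5
```

Running such a classifier over all N frames of a video yields the sets C, S and P that the time period recommendation algorithm below consumes.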
In the method, each frame of the video is taken as an initial time period, forming a time period set. The two temporally adjacent time periods that are most similar are then repeatedly selected and merged in order to recommend time periods containing artificial geographic entities, and time period recommendation is completed when only one time period containing entity frames remains in the set. Time period recommendation is performed in the Pool5 feature space extracted by the trained binary classifier ResNet_1, and the similarity of two time periods is computed with the L2 distance. The method thereby makes full use of the discrimination ability of ResNet_1. The video segment merging, video segment retention and stopping criteria used throughout the time period recommendation process are as follows:
(1) Video segment merging criterion: in each merge, at least one of the two adjacent time periods being merged contains entity frames.
(2) Video segment retention criterion: if the proportion of entity frames in a video segment is below a threshold θ (typically 0.5), the segment is not recommended and is not taken as an entity time period.
(3) Stopping criterion: merging stops when only one time period containing entity frames remains.
The specific algorithm for time period recommendation is as follows (the symbols F, C, S and P are the frame, classification result, classification score and Pool5 feature sets defined above, and θ is the retention threshold):
Algorithm: time period recommendation
Input: video frame set F = {f_1, …, f_N};
classification result set of the video frames C = {c_1, …, c_N};
classification score set of the video frames S = {s_1, …, s_N};
Pool5 feature set of the video frames P = {p_1, …, p_N}.
Steps:
take the frame set F as the initial time period set T;
initialize the time period similarity set D;
initialize the time period recommendation set PC;
foreach pair of adjacent time periods (t_i, t_j) in T
do
compute the similarity d(t_i, t_j) as the L2 distance between their Pool5 features and add it to D;
end foreach
while at least one time period contains entity frames // stopping criterion
do
select the most similar pair of adjacent time periods (t_i, t_j) in which t_i or t_j contains entity frames; // merging criterion
merge t_i and t_j into a new time period t_m;
take the mean of the features of t_i and t_j as the feature of t_m;
take the mean of the scores of t_i and t_j as the score of t_m;
compute the similarity between t_m and its adjacent time periods;
update the similarity set D;
update the score set;
update the feature set;
update the time period set T;
if the proportion of entity frames in the video segment t_m is greater than θ // retention criterion
then add t_m to the recommendation set PC;
end if
end while
Output: the time period recommendation set PC of artificial geographic entities.
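A minimal Python sketch of this merging procedure is given below, assuming the per-frame outputs of ResNet_1 are already available as NumPy arrays. The function name, the dictionary-based representation of a time period and the reading of the stopping criterion as "more than one entity-containing time period remains" are illustrative assumptions rather than details fixed by the patent.

```python
import numpy as np

def recommend_time_periods(frames_pool5, frame_labels, frame_scores, theta=0.5):
    """Hierarchically merge adjacent time periods and recommend entity time periods.

    frames_pool5 : (N, D) array of per-frame Pool5 features from ResNet_1
    frame_labels : (N,) array, 1 = entity frame, 0 = background frame
    frame_scores : (N,) array of entity-class classification scores
    theta        : retention threshold on the proportion of entity frames
    """
    # Each initial time period is a single frame, carrying its feature and score.
    periods = [{"frames": [i],
                "feat": frames_pool5[i].astype(float),
                "score": float(frame_scores[i])}
               for i in range(len(frame_labels))]
    recommended = []

    def has_entity(p):
        return any(frame_labels[i] == 1 for i in p["frames"])

    # Stopping criterion: continue while more than one time period still contains entity frames.
    while sum(has_entity(p) for p in periods) > 1:
        # Merging criterion: among adjacent pairs where at least one side contains
        # entity frames, pick the most similar pair (smallest L2 feature distance).
        best, best_dist = None, np.inf
        for k in range(len(periods) - 1):
            if not (has_entity(periods[k]) or has_entity(periods[k + 1])):
                continue
            d = np.linalg.norm(periods[k]["feat"] - periods[k + 1]["feat"])
            if d < best_dist:
                best, best_dist = k, d
        if best is None:
            break
        a, b = periods[best], periods[best + 1]
        merged = {"frames": a["frames"] + b["frames"],
                  "feat": (a["feat"] + b["feat"]) / 2.0,      # mean of the two features
                  "score": (a["score"] + b["score"]) / 2.0}   # mean of the two scores
        periods[best:best + 2] = [merged]
        # Retention criterion: keep segments whose entity-frame ratio exceeds theta.
        if np.mean([frame_labels[i] for i in merged["frames"]]) > theta:
            recommended.append((merged["frames"][0], merged["frames"][-1]))
    return recommended
```

Each recommended tuple gives the indices of the first and last frame of a candidate artificial geographic entity time period, which is what the boundary refinement stage below takes as input.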
Further, step S30 includes the following steps:
S301: taking each frame of the video as an initial time period and forming a time period set;
S302: extracting features from the video frames using the binary classifier ResNet_1;
S303: recommending entity time periods based on feature similarity.
Still further, step S40 includes the following steps:
S401: extracting features of the recommended entity time periods;
S402: computing the mean of the feature set of each recommended entity time period and applying L2 normalization to obtain its feature expression;
S403: training a fully connected neural network;
S404: taking the feature expression as the input of the neural network and outputting the confidence score of the entity time period and the offsets of the time period boundaries;
S405: removing redundant entity time periods using temporal non-maximum suppression.
Referring to fig. 2, in step S40, for the entity time periods recommended by the time period recommendation algorithm, the method constructs a multi-task fully connected neural network that classifies each recommended time period and at the same time refines its boundary.
For a recommended entity time period pc ∈ PC, let its frame set be F_pc = {f_s, …, f_e} and the corresponding Pool5 feature set be P_pc = {p_s, …, p_e}, where s and e are the indices in the video of the first and last frames of the time period pc. The feature expression of the entity time period pc is
R = L2norm(mean(P_pc)),
that is, the mean of the feature set P_pc followed by an L2 normalization operation. The feedforward neural network takes the feature expression R of a recommended entity time period as input and outputs its confidence score as an entity time period together with the offsets of its time boundaries. The boundary offsets, using the L1 distance, are defined as
O_s = |s_p - s_g|, O_e = |e_p - e_g|,
where s_p and e_p are the indices in the video of the first and last frames of the entity time period, and s_g and e_g are the indices in the video of the first and last frames of the ground-truth time period matched to it.
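The two quantities above can be computed directly from the per-frame features; the short sketch below assumes the Pool5 features of a video are stored as a NumPy array, with all names chosen only for illustration.

```python
import numpy as np

def period_feature(pool5_feats, s, e):
    """Feature expression R of a time period: mean of its Pool5 features, L2-normalized."""
    mean_feat = pool5_feats[s:e + 1].mean(axis=0)
    return mean_feat / (np.linalg.norm(mean_feat) + 1e-12)  # small epsilon guards against a zero vector

def boundary_offsets(s_p, e_p, s_g, e_g):
    """L1 boundary offsets between a recommended period (s_p, e_p) and its matched ground truth (s_g, e_g)."""
    return abs(s_p - s_g), abs(e_p - e_g)
```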
In step S403, the fully connected neural network is trained simultaneously as a classifier and a boundary regressor by a multi-task learning method, and the loss function used in training consists of two parts: a Softmax cross entropy loss function for the classification task and a loss function for the boundary offset regression task of entity time periods.
Because the fully connected neural network is trained simultaneously as a classifier and a boundary regressor through multi-task learning, the loss function is the combination of these two parts:
L = L_cls + L_reg
the Softmax cross entropy loss function is defined as:
where N is the number of samples, y i ' is the classification result expected for the ith sample, y i Is the Softmax score of the i-th sample actually output by the neural network;
L reg is a loss function in the boundary-offset regression task for an entity period, defined as:
wherein the label value is a sample label value, the positive sample is 1, the negative sample is 0, and N p Is the number of positive samples, O s,i For the deviation of the first frame in the video for the physical period i, O e,i Deviations of the last frame in the video for the physical time period.
The structure of the feedforward neural network adopted by the method is shown in fig. 3. The first layer is a fully connected layer with 1024 neurons and ReLU activation, followed by a dropout layer with a dropout rate of 0.4 and a fully connected layer of 4 neurons without ReLU activation. The last layer is a multi-task layer: during training it uses the two loss functions described above, and during testing it outputs the Softmax score of the entity time period and the offsets of its boundaries. With this arrangement, the fully connected neural network can fine-tune the boundary while identifying the entity time period, so the entity boundary is determined more accurately, the precision of the entity time period boundary is greatly improved, and the entity features in the video can be identified and extracted more effectively.
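The following PyTorch sketch shows a network of this shape together with a two-part multi-task loss in the spirit of L_cls and L_reg above. The class name, the split of the four output neurons into two classification logits and two boundary offsets, the unweighted sum of the two losses and the form of the regression target are assumptions made for illustration; the patent does not fix these details.

```python
import torch
import torch.nn as nn

class BoundaryRefineNet(nn.Module):
    """Multi-task fully connected network: 1024-unit FC with ReLU, dropout 0.4,
    then a 4-unit FC layer with no activation. The 4 outputs are assumed to be
    2 classification logits plus 2 predicted boundary offsets (start, end)."""

    def __init__(self, feat_dim: int = 2048):
        super().__init__()
        self.fc1 = nn.Linear(feat_dim, 1024)
        self.relu = nn.ReLU()
        self.drop = nn.Dropout(p=0.4)
        self.fc2 = nn.Linear(1024, 4)

    def forward(self, r: torch.Tensor):
        h = self.drop(self.relu(self.fc1(r)))
        out = self.fc2(h)
        cls_logits = out[:, :2]   # entity time period vs. background
        offsets = out[:, 2:]      # predicted (O_s, O_e)
        return cls_logits, offsets

def multitask_loss(cls_logits, offsets, labels, target_offsets):
    """Two-part loss: Softmax cross entropy for classification plus an L1
    regression loss on the boundary offsets, averaged over positive samples only."""
    l_cls = nn.functional.cross_entropy(cls_logits, labels)
    pos = labels == 1
    if pos.any():
        l_reg = (offsets[pos] - target_offsets[pos]).abs().sum(dim=1).mean()
    else:
        l_reg = offsets.sum() * 0.0  # no positive samples in the batch
    return l_cls + l_reg
```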
Further, in step S405, the temporal non-maximum suppression measures the degree of overlap of two time periods by calculating the time overlap ratio:
tIoU = |time period 1 ∩ time period 2| / |time period 1 ∪ time period 2|
the cleaning of entity time periods with repeated or excessively high overlapping degree is a key step, so that redundancy can be eliminated, and the final precision is improved. Non-maximum suppression (NMS) is a common method for eliminating redundant overlap areas in target detection results. The non-maximum suppression is performed in the spatial dimension and is mainly used for object detection tasks in the image. However, in processing video or some tasks involving the time dimension, the non-maximum suppression of the application space may not be accurate enough and thus may be extended to processing in the time domain, which is the time domain non-maximum suppression. The recommended redundant entity time period is removed by changing the overlap ratio calculation IoU to a time overlap ratio tIoU, extending the spatial Non-maximum suppression (NMS: non-Maximum Suppression) to the time domain.
Time domain non-maximum suppression in the method uses a time overlap ratio tIoU with a threshold of 0.3, wherein time period 1 and time period 2 are overlapping time periods. And then sequencing all the entity time periods (generally sequencing according to confidence scores), selecting one interval with the highest confidence score from the intervals, calculating tIoU with all other intervals, and deleting all intervals with the tIoU exceeding the threshold. This process is repeated until all intervals have been processed. Thus, the intervals with the highest confidence scores and no overlap (or the overlapping degree is lower than a certain threshold value) can be reserved, and the final result can effectively eliminate redundant entity time periods.
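A minimal sketch of this temporal non-maximum suppression is shown below, assuming each recommended time period is represented as a tuple (start_frame, end_frame, score); the helper names are illustrative.

```python
def t_iou(p, q):
    """Time overlap ratio of two periods given as (start, end, ...) with inclusive frame indices."""
    inter = max(0, min(p[1], q[1]) - max(p[0], q[0]) + 1)
    union = (p[1] - p[0] + 1) + (q[1] - q[0] + 1) - inter
    return inter / union

def temporal_nms(periods, threshold=0.3):
    """Keep the highest-scoring periods and drop any period whose tIoU with a kept one exceeds the threshold."""
    periods = sorted(periods, key=lambda x: x[2], reverse=True)  # sort by confidence score
    kept = []
    while periods:
        best = periods.pop(0)
        kept.append(best)
        periods = [p for p in periods if t_iou(best, p) <= threshold]
    return kept
```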
Further, the artificial geographic entity time periods are classified into specific artificial geographic entity classes or background (non-entity). A (K+1)-class training data set is constructed, comprising K classes of entity frames and background (non-entity) frames, and a ResNet neural network is trained on this data set to obtain a (K+1)-class classifier, called ResNet_2, which classifies a video frame as a specific entity frame or a background (non-entity) frame.
The trained ResNet_2 is used to extract the Pool5 feature of every frame in each artificial geographic entity time period recommended in the first stage. For a recommended entity time period pc ∈ PC, let its frame set be F_pc = {f_s, …, f_e} and the corresponding Pool5 feature set extracted by ResNet_2 be P′_pc = {p′_s, …, p′_e}, where s and e are the indices in the video of the first and last frames of the time period pc. The feature expression of the entity time period pc at this stage is R′ = L2norm(mean(P′_pc)).
Because the (K+1)-class classifier ResNet_2 classifies individual video frames, it cannot classify video segments. ResNet_2 is therefore used only to extract the features of each frame and to compute the feature expression R′ of the entity time period. The SVM then classifies the entity time period, taking the feature expression R′ of the recommended entity time period as input. Unlike the frame-level training data of ResNet_2, the (K+1)-class training data set of the SVM is a time period data set comprising K classes of entity time periods and background (non-entity) time periods.
Still further, step S70 includes the following steps:
S701: extracting features from the video frames using the (K+1)-class classifier ResNet_2;
S702: computing the feature expressions of the recommended specific entity time periods.
The training data used by ResNet_2 is organized by frames, each frame being an independent sample, whereas the training data of the SVM is organized by time periods, comprising K classes of entity time periods and background (non-entity) time periods. The reason is that in video processing a single frame and a sequence of consecutive frames (i.e. a time period) provide different information: in particular, the occurrence of an action or event can only be represented completely and clearly in a sequence of consecutive frames. Combining the two levels therefore allows the video data to be analyzed and processed better and further improves the performance of the model.
While the preferred embodiment of the present application has been illustrated and described, the present application is not limited to the embodiments, and those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present application, and these equivalent modifications or substitutions are included in the scope of the present application as defined in the appended claims.

Claims (9)

1. An unmanned aerial vehicle three-dimensional image entity identification and time positioning method, characterized by comprising the following steps:
S10: acquiring time periods that may contain an entity from the unmanned aerial vehicle video, and constructing a two-class training data set;
S20: training a ResNet model on the data set to obtain a binary classifier ResNet_1;
S30: recommending entity time periods using the features extracted by the binary classifier ResNet_1;
S40: refining the boundaries of the recommended entity time periods;
S50: constructing a (K+1)-class training data set comprising K classes of entity frames and background frames;
S60: training a ResNet model on the data set constructed in step S50 to obtain a (K+1)-class classifier ResNet_2;
S70: extracting features of specific entity time periods using the (K+1)-class classifier ResNet_2;
S80: classifying the specific entity time periods using a (K+1)-class SVM classifier.
2. The unmanned aerial vehicle three-dimensional image entity identification and time positioning method according to claim 1, wherein in step S10 the data set includes two classes, entity frames and background frames, and the entity frames do not distinguish specific building entities, traffic entities or the like.
3. The unmanned aerial vehicle three-dimensional image entity identification and time positioning method according to claim 1, wherein in step S20 the classifier ResNet_1 classifies a video frame as a background frame or an entity frame.
4. The unmanned aerial vehicle three-dimensional image entity identification and time positioning method according to claim 1, wherein step S30 comprises the following steps:
S301: taking each frame of the video as an initial time period and forming a time period set;
S302: extracting features from the video frames using the binary classifier ResNet_1;
S303: recommending entity time periods based on feature similarity.
5. The unmanned aerial vehicle three-dimensional image entity identification and time positioning method according to claim 1, wherein step S40 comprises the following steps:
S401: extracting features of the recommended entity time periods;
S402: computing the mean of the feature set of each recommended entity time period and applying L2 normalization to obtain its feature expression;
S403: training a fully connected neural network;
S404: taking the feature expression as the input of the neural network and outputting the confidence score of the entity time period and the offsets of the time period boundaries;
S405: removing redundant entity time periods using temporal non-maximum suppression.
6. The unmanned aerial vehicle three-dimensional image entity identification and time positioning method according to claim 5, wherein in step S403 the fully connected neural network is trained simultaneously as a classifier and a boundary regressor by a multi-task learning method, and the loss function used in training consists of two parts: a Softmax cross entropy loss function for the classification task and a loss function for the boundary offset regression task of entity time periods.
7. The unmanned aerial vehicle three-dimensional image entity identification and time positioning method according to claim 5, wherein in step S405 the temporal non-maximum suppression measures the degree of overlap of two time periods by calculating a time overlap ratio, expressed as:
tIoU = |time period 1 ∩ time period 2| / |time period 1 ∪ time period 2|
wherein time period 1 and time period 2 are the two overlapping time periods.
8. The unmanned aerial vehicle three-dimensional image entity identification and time positioning method according to claim 6, wherein the Softmax cross entropy loss function is defined as:
L_cls = -(1/N) · Σ_{i=1}^{N} y_i′ · log(y_i)
where N is the number of samples, y_i′ is the expected classification result of the i-th sample, and y_i is the Softmax score of the i-th sample actually output by the neural network;
L_reg is the loss function of the boundary offset regression task for entity time periods, defined as:
L_reg = (1/N_p) · Σ_{i=1}^{N} label_i · (O_{s,i} + O_{e,i})
where label_i is the label value of sample i (1 for positive samples, 0 for negative samples), N_p is the number of positive samples, O_{s,i} is the offset of the first frame of entity time period i in the video, and O_{e,i} is the offset of the last frame of entity time period i in the video.
9. The unmanned aerial vehicle three-dimensional image entity identification and time positioning method according to claim 1, wherein step S70 comprises the following steps:
S701: extracting features from the video frames using the (K+1)-class classifier ResNet_2;
S702: computing the feature expressions of the recommended specific entity time periods.
CN202311352027.2A 2023-10-19 2023-10-19 Unmanned aerial vehicle three-dimensional image entity identification and time positioning method Active CN117095317B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311352027.2A CN117095317B (en) 2023-10-19 2023-10-19 Unmanned aerial vehicle three-dimensional image entity identification and time positioning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311352027.2A CN117095317B (en) 2023-10-19 2023-10-19 Unmanned aerial vehicle three-dimensional image entity identification and time positioning method

Publications (2)

Publication Number Publication Date
CN117095317A 2023-11-21
CN117095317B (en) 2024-06-25

Family

ID=88783730

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311352027.2A Active CN117095317B (en) 2023-10-19 2023-10-19 Unmanned aerial vehicle three-dimensional image entity identification and time positioning method

Country Status (1)

Country Link
CN (1) CN117095317B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103914702A (en) * 2013-01-02 2014-07-09 国际商业机器公司 System and method for boosting object detection performance in videos
CN107430687A (en) * 2015-05-14 2017-12-01 谷歌公司 The segmentation of the time based on entity of video flowing
CN109614956A (en) * 2018-12-29 2019-04-12 上海依图网络科技有限公司 The recognition methods of object and device in a kind of video
US20190325275A1 (en) * 2018-04-19 2019-10-24 Adobe Inc. Active learning method for temporal action localization in untrimmed videos
US20210195286A1 (en) * 2019-12-19 2021-06-24 Sling Media Pvt Ltd Method and system for analyzing live broadcast video content with a machine learning model implementing deep neural networks to quantify screen time of displayed brands to the viewer
US20220262116A1 (en) * 2021-02-12 2022-08-18 Comcast Cable Communications, Llc Methods, Systems, And Apparatuses For Improved Video Frame Analysis And Classification
CN115705706A (en) * 2021-08-13 2023-02-17 腾讯科技(深圳)有限公司 Video processing method, video processing device, computer equipment and storage medium
CN115858863A (en) * 2022-12-22 2023-03-28 广州启生信息技术有限公司 Method and device for labeling video label
US20230206632A1 (en) * 2021-12-23 2023-06-29 Yahoo Ad Tech Llc Computerized system and method for fine-grained video frame classification and content creation therefrom
CN116644755A (en) * 2023-07-27 2023-08-25 中国科学技术大学 Multi-task learning-based few-sample named entity recognition method, device and medium
CN116824463A (en) * 2023-08-31 2023-09-29 江西啄木蜂科技有限公司 Video key frame extraction method, computer readable storage medium and electronic device

Also Published As

Publication number Publication date
CN117095317B (en) 2024-06-25

Similar Documents

Publication Publication Date Title
CN110414368B (en) Unsupervised pedestrian re-identification method based on knowledge distillation
CN112308158B (en) Multi-source field self-adaptive model and method based on partial feature alignment
CN110321910B (en) Point cloud-oriented feature extraction method, device and equipment
CN105869173B (en) A kind of stereoscopic vision conspicuousness detection method
CN109993100B (en) Method for realizing facial expression recognition based on deep feature clustering
CN107145862B (en) Multi-feature matching multi-target tracking method based on Hough forest
CN109871875B (en) Building change detection method based on deep learning
CN106257496B (en) Mass network text and non-textual image classification method
CN111476161A (en) Somatosensory dynamic gesture recognition method fusing image and physiological signal dual channels
WO2022062419A1 (en) Target re-identification method and system based on non-supervised pyramid similarity learning
CN112990282B (en) Classification method and device for fine-granularity small sample images
CN111259735B (en) Single-person attitude estimation method based on multi-stage prediction feature enhanced convolutional neural network
CN113157678B (en) Multi-source heterogeneous data association method
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
CN111985325A (en) Aerial small target rapid identification method in extra-high voltage environment evaluation
CN104680193B (en) Online objective classification method and system based on quick similitude network integration algorithm
CN112528058B (en) Fine-grained image classification method based on image attribute active learning
CN111626357B (en) Image identification method based on neural network model
CN114612450B (en) Image detection segmentation method and system based on data augmentation machine vision and electronic equipment
Ibrahem et al. Real-time weakly supervised object detection using center-of-features localization
CN114548256A (en) Small sample rare bird identification method based on comparative learning
Yang et al. C-RPNs: Promoting object detection in real world via a cascade structure of Region Proposal Networks
CN116469020A (en) Unmanned aerial vehicle image target detection method based on multiscale and Gaussian Wasserstein distance
CN116468935A (en) Multi-core convolutional network-based stepwise classification and identification method for traffic signs
Al-Obodi et al. A Saudi Sign Language recognition system based on convolutional neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant