CN114064971A - Airport apron video semantic retrieval method and retrieval system based on deep learning

Airport apron video semantic retrieval method and retrieval system based on deep learning

Info

Publication number
CN114064971A
CN114064971A
Authority
CN
China
Prior art keywords
video
apron
target detection
visual
deep learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111383673.6A
Other languages
Chinese (zh)
Inventor
吕宗磊
甘雨
郝家祺
张洁盈
张义林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Civil Aviation University of China
Original Assignee
Civil Aviation University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Civil Aviation University of China
Priority to CN202111383673.6A
Publication of CN114064971A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention relates to a deep-learning-based apron video semantic retrieval method and system, belonging to the technical field of video information processing and comprising: S1, constructing an apron target detection data set; S2, training on the apron target detection data set to generate a final target detection model; S3, preprocessing the apron video; S4, detecting the visual targets in the preprocessed apron video to generate position and label information of the visual targets; S5, analyzing the target detection results and screening out the feature sequences of visual targets that conform to the apron operation rules; S6, extracting features from the feature sequences and fusing them with an attention mechanism to generate a feature matrix; S7, inputting the feature matrix into a neural network to train a video semantic retrieval model; S8, acquiring the apron video to be detected, inputting a query event, and generating video candidate segments; and S9, inputting the video candidate segments to obtain the video segments matching the semantics of the query event.

Description

Airport apron video semantic retrieval method and retrieval system based on deep learning
Technical Field
The invention belongs to the technical field of video information processing, and particularly relates to a deep-learning-based apron video semantic retrieval method and system.
Background
With the development of the internet, video is becoming the main information carrier after text and images. In the field of public safety, video surveillance systems have become an important component of maintaining social order and strengthening social management. However, surveillance recordings have characteristics such as large storage volume and long retention time, and finding a required video segment in them means searching through a large number of manual indexes and performing lengthy linear screening, which consumes substantial manpower, material resources and time, is extremely inefficient, and adds unnecessary cost.
With the progress of technology, video retrieval technology has emerged; it is now developing vigorously, is widely applied in fields such as digital television, distance education, telemedicine and security, and has become a rising star of the big data era. However, video retrieval in the civil aviation field has barely started, and retrieval research on apron video remains blank. Apron management is complex: operators and vehicles on the apron must be monitored around the clock, 7×24 hours, to record illegal operation behaviors; to identify vehicle travel paths, the number of towed vehicles, and the placement of traffic cones and flatbed carts; to detect violations such as operators crossing the apron; to compile historical violation statistics; and to correct and handle violations, all of which consumes a large amount of manpower. Therefore, developing a retrieval system with video retrieval at its core is an urgent task.
Disclosure of Invention
To solve the above technical problems in the prior art, the invention provides a deep-learning-based apron video semantic retrieval method and system: using computer data processing technology, the original surveillance video is imported into a retrieval system, and the corresponding video segments can be retrieved by selecting a query event.
The first object of the invention is to provide a deep-learning-based apron video semantic retrieval method, which at least comprises the following steps:
step 1, constructing an apron target detection data set;
step 2, training on the apron target detection data set with a YOLOv5s model to generate a final target detection model;
step 3, preprocessing the apron video according to the apron operation rules;
step 4, detecting the visual targets in the preprocessed apron video with the target detection model obtained in step 2, to generate position and label information of the visual targets;
step 5, analyzing the target detection results and screening out the feature sequences of visual targets that conform to the apron operation rules;
step 6, extracting features from the feature sequences through a temporal convolutional network and fusing the features with an attention mechanism to generate a feature matrix;
step 7, inputting the feature matrix into a neural network to train a video semantic retrieval model;
step 8, acquiring the apron video to be detected, inputting a query event, and generating video candidate segments through preprocessing;
and step 9, inputting the video candidate segments obtained in step 8 into the deep-learning-based apron video semantic retrieval system to obtain the video segments matching the semantics of the query event.
Further: in step 1, the data source of the apron target detection data set is the surveillance video of aircraft stand 209 at Guiyang Longdongbao International Airport. The video is split frame by frame into images using OpenCV, and all images are annotated with the target detection labeling tool labelImg to construct a target detection data set in YOLO format. The image annotations cover 10 classes: people, boarding bridges, garbage trucks, airplanes, refueling trucks, platform trucks, water trucks, luggage trucks, catering trucks and tractors.
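For illustration, the frame-splitting part of this step could look like the following minimal Python sketch; the output file names and the sampling stride are assumptions, not part of the disclosure:

```python
import os
import cv2  # OpenCV, as named in step 1

def split_video_to_frames(video_path: str, out_dir: str, stride: int = 1) -> int:
    """Split an apron surveillance video into frame images with OpenCV."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    index = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of the video
            break
        if index % stride == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{index:06d}.jpg"), frame)
            saved += 1
        index += 1
    cap.release()
    return saved

# e.g. split_video_to_frames("stand209.mp4", "dataset/images")
# The saved frames would then be annotated with labelImg to obtain YOLO-format labels.
```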
Further: in step 2, the YOLOv5s model is used as the pre-trained model and retrained on the apron target detection data set constructed in step 1, with the training parameters set to epochs = 80 and batch_size = 16, generating the final target detection model.
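As a hedged illustration, the two stated parameters map directly onto the training entry point of recent Ultralytics YOLOv5 releases; the dataset YAML name "apron.yaml" is a hypothetical placeholder:

```python
# Assumes a cloned https://github.com/ultralytics/yolov5 repository on the path.
# Equivalent CLI: python train.py --data apron.yaml --weights yolov5s.pt --epochs 80 --batch-size 16
import train  # train.py from the YOLOv5 repository

train.run(
    data="apron.yaml",     # 10-class apron data set in YOLO format (assumed file name)
    weights="yolov5s.pt",  # YOLOv5s pre-trained weights to fine-tune
    epochs=80,             # training parameters stated in the patent
    batch_size=16,
)
```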
Further: in step 3, the stand-209 surveillance video of Guiyang Longdongbao International Airport is taken as the apron video data source; video segments of each apron work flow are cut out according to prior knowledge of the apron work flows, and the average duration and average frame count of each work flow are computed.
Further: in step 4, the apron work-flow segments cut out in step 3 are split frame by frame into pictures using OpenCV and input into the target detection model obtained in step 2 to generate the labels and coordinate information of the visual objects.
Further: in step 5, the labels and coordinate information of the visual objects that conform to the apron operation rules are screened out from the labels and coordinate information of all visual objects obtained in step 4. Three types of features are defined: the relative distance, the relative position and the speed of a visual object. The relative pixel distance of each conforming visual object is computed in every frame to generate a relative-pixel-distance sequence. The relative position of each conforming visual object is computed in every frame, where the relative position is the angle between the positive x axis and the ray from the aircraft's pixel coordinates, taken as the origin, to the pixel coordinates of the other visual object. The speed of each visual object is computed with the inter-frame difference method to generate a visual-object speed feature sequence.
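A minimal sketch of the three feature computations follows; the (x1, y1, x2, y2) box format and the use of box centers are assumptions about how the detections are represented:

```python
import math

def box_center(box):
    """Center (x, y) of a detection box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0

def relative_distance(aircraft_box, object_box):
    """Relative pixel distance between the aircraft and another visual object."""
    (ax, ay), (ox, oy) = box_center(aircraft_box), box_center(object_box)
    return math.hypot(ox - ax, oy - ay)

def relative_position(aircraft_box, object_box):
    """Angle (radians) between the positive x axis and the ray from the
    aircraft's pixel coordinates, taken as the origin, to the other object."""
    (ax, ay), (ox, oy) = box_center(aircraft_box), box_center(object_box)
    return math.atan2(oy - ay, ox - ax)

def speed(prev_box, cur_box, frame_interval: int = 1):
    """Pixel speed of a visual object via the inter-frame difference of box centers."""
    (px, py), (cx, cy) = box_center(prev_box), box_center(cur_box)
    return math.hypot(cx - px, cy - py) / frame_interval
```

Evaluating these functions on every frame of a work-flow segment yields the three feature sequences screened in this step.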
Further: in step 6, features are extracted from the feature sequences obtained in step 5 with a temporal convolutional network, the extracted feature vectors are concatenated into a feature matrix, and an attention mechanism is applied to the feature matrix to assign different weights to its parts and realize feature fusion.
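One way such temporal-convolution-plus-attention fusion could look in PyTorch is sketched below; the layer sizes, dilation pattern and pooling are assumptions rather than the patent's exact architecture:

```python
import torch
import torch.nn as nn

class TemporalFeatureFusion(nn.Module):
    """Extract one vector per feature sequence with 1-D (temporal) convolutions,
    stack the vectors into a feature matrix, and re-weight its rows with
    softmax attention weights."""

    def __init__(self, hidden: int = 64):
        super().__init__()
        self.tcn = nn.Sequential(  # dilated temporal convolutions, shared across sequences
            nn.Conv1d(1, hidden, kernel_size=3, padding=1, dilation=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=2, dilation=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # one feature vector per input sequence
        )
        self.attn = nn.Linear(hidden, 1)  # scores one attention weight per row

    def forward(self, x):  # x: (batch, n_sequences, seq_len)
        b, n, t = x.shape
        feats = self.tcn(x.reshape(b * n, 1, t)).reshape(b, n, -1)  # feature matrix
        weights = torch.softmax(self.attn(feats), dim=1)  # weight distribution over rows
        return feats * weights  # fused (re-weighted) feature matrix

# e.g. fused = TemporalFeatureFusion()(torch.randn(4, 3, 250))  # 3 sequences of 250 frames
```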
Further: in step 7, the feature matrix obtained in step 6 is input into a semantic retrieval network composed of fully connected layers to train the semantic retrieval model.
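The fully connected semantic retrieval network might be as small as the following sketch; the flattened input size (3 sequences of 64-d vectors, matching the fusion sketch above) and the number of event classes are assumptions:

```python
import torch.nn as nn

# Classifies a fused feature matrix into one of the apron work-flow (event) classes.
retrieval_head = nn.Sequential(
    nn.Flatten(),            # flatten the fused feature matrix
    nn.Linear(3 * 64, 128),  # assumed: 3 feature sequences x 64-d vectors
    nn.ReLU(),
    nn.Linear(128, 10),      # assumed: one score per candidate event class
)
# Trained with nn.CrossEntropyLoss against the annotated work-flow labels.
```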
Further: in step 8, any video to be detected in the apron scene is obtained and a query event is input; using a sliding-window method, with the average frame count of the work flow corresponding to the input event as the window size, the long video is split into video candidate segments.
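A minimal sketch of the sliding-window split is given below; the 50% overlap between consecutive windows is an assumption, since the patent only fixes the window size:

```python
from typing import Optional

def sliding_window_segments(n_frames: int, window: int, step: Optional[int] = None):
    """Yield (start_frame, end_frame) candidate segments over a long video, with the
    window size set to the average frame count of the queried work flow."""
    step = step or max(1, window // 2)  # assumed 50% overlap between windows
    start = 0
    while start + window <= n_frames:
        yield start, start + window
        start += step

# e.g. candidates for an event whose average flow length is 600 frames:
# candidates = list(sliding_window_segments(n_frames=18000, window=600))
```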
Further: in step 9, the video candidate segments are input into the target detection model obtained in step 2; the position and label information of the visual targets conforming to the apron operation rules is extracted from the candidate segments as in step 5; a feature matrix is generated as in step 6; and the feature matrix is input into the video semantic retrieval model generated in step 7 to obtain the video segments matching the semantics of the query event.
The second object of the invention is to provide a deep-learning-based apron video semantic retrieval system, which at least comprises:
a construction module, used for constructing an apron target detection data set;
a training module, used for training on the apron target detection data set with a YOLOv5s model to generate a final target detection model;
a preprocessing module, used for preprocessing the apron video according to the apron operation rules;
a detection module, used for detecting the visual targets in the preprocessed apron video with the target detection model to generate position and label information of the visual targets;
an analysis and screening module, used for analyzing the target detection results and screening out the feature sequences of visual targets that conform to the apron operation rules;
a feature generation module, used for extracting features from the feature sequences through a temporal convolutional network and fusing the features with an attention mechanism to generate a feature matrix;
a retrieval module, used for inputting the feature matrix into a neural network to train a video semantic retrieval model;
a video candidate segment generation module, used for acquiring the apron video to be detected, inputting a query event, and generating video candidate segments through preprocessing;
and a result output module, used for inputting the video candidate segments into the deep-learning-based apron video semantic retrieval system to obtain the video segments matching the semantics of the query event.
The third object of the invention is to provide an information data processing terminal for implementing the above deep-learning-based apron video semantic retrieval method.
The fourth object of the invention is to provide a computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the above deep-learning-based apron video semantic retrieval method.
The advantages and positive effects of the invention are as follows:
The method and system exploit the characteristics of the apron work flows to effectively improve the retrieval efficiency of specific events on the apron.
The invention designs a target detection network to identify apron work vehicles and compute their feature sequences, and uses a temporal convolutional network and an attention mechanism to abstract the apron work flow into a feature matrix, which facilitates retrieving specific events on the apron. Compared with manual retrieval, it greatly shortens the time required, improves efficiency and saves cost, while maintaining a high accuracy rate provided that a large data set is available to support training.
Drawings
FIG. 1 is a flow chart of a preferred embodiment of the present invention;
FIG. 2 is a first effect diagram of the apron target detection model in the preferred embodiment of the present invention;
FIG. 3 is a second effect diagram of the apron target detection model in the preferred embodiment of the present invention;
FIG. 4 is a third effect diagram of the apron target detection model in the preferred embodiment of the present invention;
FIG. 5 is a fourth effect diagram of the apron target detection model in the preferred embodiment of the present invention;
FIG. 6 is a bar graph of the detection accuracy of the apron target detection model in the preferred embodiment of the present invention;
FIG. 7 is a relative-position feature sequence in the preferred embodiment of the present invention.
Detailed Description
In order to further understand the contents, features and effects of the present invention, the following embodiments are illustrated and described in detail with reference to the accompanying drawings:
the essence of the apron video semantic retrieval method and retrieval system based on deep learning is a feature sequence matching problem based on target detection. And identifying the visual target of the apron through the target detection model to obtain the label position information of the visual target and generate a characteristic sequence. And performing feature extraction on the feature sequence through a time convolution network to generate a feature matrix, and performing weight redistribution on the feature matrix by using an attention mechanism to realize feature fusion. The steps abstract the work flow height of the apron into a characteristic matrix, and then the characteristic matrix is input into a full-connection network to train a classifier. And for the video to be detected, judging the corresponding operation flow according to the input query event, determining the size of a sliding window according to the priori knowledge, and decomposing the video to be detected into video candidate segments by using a sliding window method. And inputting the video candidate segments into a deep learning-based apron video semantic retrieval system, and selecting video segments which accord with the semantics of the query events after target detection, feature sequence acquisition, feature extraction, feature fusion and classifier classification to finish apron video semantic retrieval.
Referring to FIGS. 1-7, the deep-learning-based apron video semantic retrieval method comprises two implementation stages: establishing the deep-learning-based apron video semantic model, and preprocessing the video to be detected. The method specifically comprises the following steps:
step 1, constructing a apron target detection data set; the data source of the airport target detection data set is a Guiyang Longdongfeng international airport 209 machine position monitoring video, the video is split into images frame by using OpenCV, all the images are marked by using a target detection marking tool labelImg, and a target detection data set in a YOLO format is constructed. The image labeling types cover 10 types, namely people, gallery bridges, garbage trucks, airplanes, refueling trucks, platform trucks, water trucks, luggage trucks, food carts and tractors.
Step 2, training a apron target detection data set by using a YOLOv5s model to generate a final target detection model; the method specifically comprises the following steps: using the YOLOv5S model as a pre-training model, retraining the apron target detection data set constructed in S1, and setting the training parameters as follows: generating a final target detection model by using the epochs as 80 and the batch _ size as 16;
step 3, preprocessing the apron video according to the apron operation rule; the method specifically comprises the following steps: the airport video data source is a Guiyang Longdongpu airport 209 airport position monitoring video, video segments of all airport operation flows are cut out according to the priori knowledge of the airport operation flows, and the average duration and the average video frame number of all airport operation flows are counted;
step 4, detecting the visual target in the preprocessed apron video by using the target detection model obtained in the step S2 to generate the position and label information of the visual target; the method specifically comprises the following steps: dividing the apron operation fragments obtained by cutting in the S3 into pictures frame by using OpenCV, inputting the pictures into the target detection model obtained in the S2, and generating the label and coordinate information of the visual object;
step 5, analyzing the target detection result, and screening out a characteristic sequence of the visual target which accords with the apron operation rule; the method specifically comprises the following steps: and 4, screening the label and the coordinate information of the visual object which accords with the operating rule of the apron from the label and the coordinate information of all the visual objects obtained in the step 4. Three types of characteristics are set, namely the relative distance of the visual objects, the relative position of the visual objects and the speed of the visual objects. And counting the relative pixel distance of the visual object which accords with the apron operation rule in each frame of picture, and generating a relative pixel distance sequence. And counting the relative positions of the visual objects which accord with the apron operation rule in each frame of picture, wherein the relative positions refer to the included angles formed by the pixel coordinates of the plane as the original point and the pixel coordinates of the other visual objects and the positive direction of the x axis. Calculating the speed of the visual object by using an interframe difference method to generate a speed characteristic sequence of the visual object;
step 6, extracting the characteristics of the characteristic sequence through a time convolution network, and fusing the characteristics by using an attention mechanism to generate a characteristic matrix; the method specifically comprises the following steps: performing feature extraction on the feature sequence obtained in the step 5 by using a time convolution network, splicing the extracted feature vectors into a feature matrix, applying an attention machine on the feature matrix, and performing different weight distribution on the feature matrix to realize feature fusion;
step 7, inputting the feature matrix into a neural network training video semantic retrieval model; the method specifically comprises the following steps: inputting the feature matrix obtained in the step 6 into a semantic retrieval network formed by full connection layers, and training a semantic retrieval model;
step 8, acquiring a apron video to be detected, inputting a query event, and generating a video candidate segment through preprocessing; the method specifically comprises the following steps: acquiring any video to be detected in the airport apron scene, inputting a query event, and dividing a long video into video candidate fragments by using the average frame number of a working process corresponding to the input event as the size of a sliding window by using a sliding window method;
step 9, inputting the video candidate segments obtained in the step 8 into a deep learning-based apron video semantic retrieval system to obtain video segments conforming to the semantic of the query event; the method specifically comprises the following steps: inputting the video candidate segments into the target detection model obtained in the step 2, acquiring the position and label information of the visual target which accords with the apron operation rule in the video candidate segments according to the step 5, generating a characteristic matrix according to the step 6, and inputting the characteristic matrix into the video semantic retrieval model generated in the step 7 to obtain the video segments which accord with the query event semantic.
A deep-learning-based apron video semantic retrieval system at least comprises:
a construction module, used for constructing an apron target detection data set. The data source is the surveillance video of aircraft stand 209 at Guiyang Longdongbao International Airport; the video is split frame by frame into images using OpenCV, and all images are annotated with the target detection labeling tool labelImg to construct a target detection data set in YOLO format. The image annotations cover 10 classes: people, boarding bridges, garbage trucks, airplanes, refueling trucks, platform trucks, water trucks, luggage trucks, catering trucks and tractors;
a training module, used for training on the apron target detection data set with a YOLOv5s model to generate a final target detection model: the YOLOv5s model is used as the pre-trained model and retrained on the apron target detection data set constructed by the construction module, with the training parameters set to epochs = 80 and batch_size = 16, generating the final target detection model;
a preprocessing module, used for preprocessing the apron video according to the apron operation rules: the apron video data source is the stand-209 surveillance video of Guiyang Longdongbao International Airport; video segments of each apron work flow are cut out according to prior knowledge of the apron work flows, and the average duration and average frame count of each work flow are computed;
a detection module, used for detecting the visual targets in the preprocessed apron video with the target detection model to generate position and label information of the visual targets: the apron work-flow segments cut out by the preprocessing module are split frame by frame into pictures using OpenCV and input into the target detection model obtained by the training module to generate the labels and coordinate information of the visual objects;
an analysis and screening module, used for analyzing the target detection results and screening out the feature sequences of visual targets that conform to the apron operation rules: the labels and coordinate information of the visual objects conforming to the apron operation rules are screened out from those of all visual objects obtained by the detection module. Three types of features are defined: the relative distance, the relative position and the speed of a visual object. The relative pixel distance of each conforming visual object is computed in every frame to generate a relative-pixel-distance sequence; the relative position of each conforming visual object is computed in every frame, where the relative position is the angle between the positive x axis and the ray from the aircraft's pixel coordinates, taken as the origin, to the pixel coordinates of the other visual object; and the speed of each visual object is computed with the inter-frame difference method to generate a visual-object speed feature sequence;
a feature generation module, used for extracting features from the feature sequences through a temporal convolutional network and fusing the features with an attention mechanism to generate a feature matrix: features are extracted from the feature sequences obtained by the analysis and screening module with a temporal convolutional network, the extracted feature vectors are concatenated into a feature matrix, and an attention mechanism is applied to the feature matrix to assign different weights and realize feature fusion;
a retrieval module, used for inputting the feature matrix into a neural network to train a video semantic retrieval model: the feature matrix obtained by the feature generation module is input into a semantic retrieval network composed of fully connected layers to train the semantic retrieval model;
a video candidate segment generation module, used for acquiring the apron video to be detected, inputting a query event, and generating video candidate segments through preprocessing: any video to be detected in the apron scene is obtained, a query event is input, and the long video is split into video candidate segments with a sliding-window method, using the average frame count of the work flow corresponding to the input event as the window size;
and a result output module, used for inputting the video candidate segments into the deep-learning-based apron video semantic retrieval system to obtain the video segments matching the semantics of the query event: the video candidate segments are input into the target detection model obtained by the training module; the position and label information of the visual targets conforming to the apron operation rules is extracted by the analysis and screening module; a feature matrix is generated by the feature generation module; and the feature matrix is input into the video semantic retrieval model generated by the retrieval module to obtain the video segments matching the semantics of the query event.
An information data processing terminal for implementing the above deep-learning-based apron video semantic retrieval method.
A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the above-described method for deep learning-based apron video semantic retrieval.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, it may be wholly or partially realized in the form of a computer program product comprising one or more computer instructions. When the computer instructions are loaded or executed on a computer, the flows or functions according to the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)).
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and all simple modifications, equivalent changes and modifications made to the above embodiment according to the technical spirit of the present invention are within the scope of the technical solution of the present invention.

Claims (10)

1. A deep-learning-based apron video semantic retrieval method, characterized by at least comprising the following steps:
s1, constructing a apron target detection data set;
s2, training a apron target detection data set by using a YOLOv5S model to generate a final target detection model;
s3, preprocessing the apron video according to the apron operation rule;
s4, detecting the visual target in the preprocessed apron video by using the target detection model to generate the position and label information of the visual target;
s5, analyzing the target detection result, and screening out a characteristic sequence of the visual target which accords with the apron operation rule;
s6, extracting the features of the feature sequence through a time convolution network, and fusing the features by using an attention mechanism to generate a feature matrix;
s7, inputting the feature matrix into a neural network training video semantic retrieval model;
s8, acquiring the apron video to be detected, inputting a query event, and generating a video candidate segment through preprocessing;
s9, inputting the video candidate segments into a deep learning-based apron video semantic retrieval system to obtain video segments which accord with the semantic of the query event.
2. The deep-learning-based apron video semantic retrieval method according to claim 1, characterized in that S1 specifically is: the data source of the apron target detection data set is an airport surveillance video; the video is split into images using OpenCV, and all images are annotated with the target detection labeling tool labelImg to construct a target detection data set in YOLO format; the image annotation classes include people, boarding bridges, garbage trucks, airplanes, refueling trucks, platform trucks, water trucks, luggage trucks, catering trucks and tractors.
3. The deep-learning-based apron video semantic retrieval method according to claim 2, characterized in that S2 specifically is: the YOLOv5s model is used as the pre-trained model and trained on the apron target detection data set constructed in S1, with the training parameters set to epochs = 80 and batch_size = 16, generating the final target detection model.
4. The deep-learning-based apron video semantic retrieval method according to claim 3, characterized in that S3 specifically is: an aircraft-stand surveillance video is taken as the apron video data source; video segments of each apron work flow are cut out according to prior knowledge of the apron work flows; and the average duration and average frame count of each work flow are computed.
5. The deep-learning-based apron video semantic retrieval method according to claim 4, characterized in that S4 specifically is: the apron work-flow segments cut out in S3 are split frame by frame into pictures using OpenCV and input into the target detection model obtained in S2 to generate the labels and coordinate information of the visual objects.
6. The deep-learning-based apron video semantic retrieval method according to claim 5, characterized in that S5 specifically is: the labels and coordinate information of the visual objects conforming to the apron operation rules are screened out from the labels and coordinate information of all visual objects obtained in S4; three types of features are defined, namely the relative distance, the relative position and the speed of a visual object; the relative pixel distance of each conforming visual object is computed in every frame to generate a relative-pixel-distance sequence; the relative position of each conforming visual object is computed in every frame, where the relative position is the angle between the positive x axis and the ray from the aircraft's pixel coordinates, taken as the origin, to the pixel coordinates of the other visual object; and the speed of each visual object is computed with the inter-frame difference method to generate a visual-object speed feature sequence.
7. The deep-learning-based apron video semantic retrieval method according to claim 6, characterized in that:
S6 specifically is: features are extracted from the feature sequences obtained in S5 with a temporal convolutional network, the extracted feature vectors are concatenated into a feature matrix, and an attention mechanism is applied to the feature matrix to assign different weights and realize feature fusion;
S7 specifically is: the feature matrix obtained in S6 is input into a semantic retrieval network composed of fully connected layers to train the semantic retrieval model;
S8 specifically is: any video to be detected in the apron scene is obtained, a query event is input, and the long video is split into video candidate segments with a sliding-window method, using the average frame count of the work flow corresponding to the input event as the window size;
S9 specifically is: the video candidate segments are input into the target detection model obtained in S2; the position and label information of the visual targets conforming to the apron operation rules is extracted as in S5; a feature matrix is generated as in S6; and the feature matrix is input into the video semantic retrieval model generated in S7 to obtain the video segments matching the semantics of the query event.
8. A deep-learning-based apron video semantic retrieval system, characterized by at least comprising:
a construction module, used for constructing an apron target detection data set;
a training module, used for training on the apron target detection data set with a YOLOv5s model to generate a final target detection model;
a preprocessing module, used for preprocessing the apron video according to the apron operation rules;
a detection module, used for detecting the visual targets in the preprocessed apron video with the target detection model to generate position and label information of the visual targets;
an analysis and screening module, used for analyzing the target detection results and screening out the feature sequences of visual targets that conform to the apron operation rules;
a feature generation module, used for extracting features from the feature sequences through a temporal convolutional network and fusing the features with an attention mechanism to generate a feature matrix;
a retrieval module, used for inputting the feature matrix into a neural network to train a video semantic retrieval model;
a video candidate segment generation module, used for acquiring the apron video to be detected, inputting a query event, and generating video candidate segments through preprocessing;
and a result output module, used for inputting the video candidate segments into the deep-learning-based apron video semantic retrieval system to obtain the video segments matching the semantics of the query event.
9. An information data processing terminal for implementing the deep learning-based apron video semantic retrieval method according to any one of claims 1 to 7.
10. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the method for apron video semantic retrieval based on deep learning of any one of claims 1-7.
CN202111383673.6A 2021-11-22 2021-11-22 Airport apron video semantic retrieval method and retrieval system based on deep learning Pending CN114064971A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111383673.6A CN114064971A (en) 2021-11-22 2021-11-22 Airport apron video semantic retrieval method and retrieval system based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111383673.6A CN114064971A (en) 2021-11-22 2021-11-22 Airport apron video semantic retrieval method and retrieval system based on deep learning

Publications (1)

Publication Number Publication Date
CN114064971A true CN114064971A (en) 2022-02-18

Family

ID=80278807

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111383673.6A Pending CN114064971A (en) 2021-11-22 2021-11-22 Airport apron video semantic retrieval method and retrieval system based on deep learning

Country Status (1)

Country Link
CN (1) CN114064971A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115424002A (en) * 2022-10-31 2022-12-02 青岛民航凯亚系统集成有限公司 Method for calculating position of target object of apron camera
CN115984824A (en) * 2023-02-28 2023-04-18 安徽蔚来智驾科技有限公司 Scene information screening method based on track information, electronic equipment and storage medium


Similar Documents

Publication Publication Date Title
CN114064971A (en) Airport apron video semantic retrieval method and retrieval system based on deep learning
KR102002024B1 (en) Method for processing labeling of object and object management server
EP3740935B1 (en) Visual tracking by colorization
CN116049397B (en) Sensitive information discovery and automatic classification method based on multi-mode fusion
CN111784774B (en) Target detection method, target detection device, computer readable medium and electronic equipment
EP3249610A1 (en) A method, an apparatus and a computer program product for video object segmentation
CN112927776A (en) Artificial intelligence automatic interpretation system for medical inspection report
CN114119480A (en) Crack defect detection system based on deep learning
CN111563398A (en) Method and device for determining information of target object
Cao et al. An end-to-end neural network for multi-line license plate recognition
Gupta et al. Detection of Number Plate in Vehicles using Deep Learning based Image Labeler Model
CN115797795B (en) Remote sensing image question-answer type retrieval system and method based on reinforcement learning
CN116246287B (en) Target object recognition method, training device and storage medium
CN115546824B (en) Taboo picture identification method, apparatus and storage medium
Negi et al. Text based traffic signboard detection using YOLO v7 architecture
US11615618B2 (en) Automatic image annotations
CN114463755A (en) Automatic sensitive information detection desensitization method in high-precision map-based acquired picture
CN114627400A (en) Lane congestion detection method and device, electronic equipment and storage medium
CN115565201B (en) Taboo picture identification method, apparatus and storage medium
Acharya et al. Mileage Extraction from Odometer Pictures for Automating Auto Insurance Processes
Salehin et al. Adaptive fusion of human visual sensitive features for surveillance video summarization
CN115565152B (en) Traffic sign extraction method integrating vehicle-mounted laser point cloud and panoramic image
Le et al. A Deep Learning Based Traffic Sign Detection for Intelligent Transportation Systems
Feng et al. Image Dehazing Network Based on Multi-scale Feature Extraction
Luna et al. License Plate Recognition for Stolen Vehicles Using Optical Character Recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination