CN113742520B - Video query and search method of dense video description algorithm based on semi-supervised learning - Google Patents

Video query and search method of dense video description algorithm based on semi-supervised learning Download PDF

Info

Publication number
CN113742520B
CN113742520B (application CN202010476330.3A)
Authority
CN
China
Prior art keywords
video
description
network model
dense
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010476330.3A
Other languages
Chinese (zh)
Other versions
CN113742520A (en)
Inventor
林科 (Lin Ke)
王立威 (Wang Liwei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Samsung Electronics Co Ltd
Original Assignee
Peking University
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University, Samsung Electronics Co Ltd filed Critical Peking University
Priority to CN202010476330.3A priority Critical patent/CN113742520B/en
Publication of CN113742520A publication Critical patent/CN113742520A/en
Application granted granted Critical
Publication of CN113742520B publication Critical patent/CN113742520B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/732Query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling

Abstract

The invention discloses a video query and retrieval method based on a dense video description algorithm with semi-supervised learning. A dense video description network model is trained by a semi-supervised learning method, and a large amount of unlabeled video data is exploited to improve the performance of the dense video description algorithm. The implemented video query and retrieval system comprises: a dense video description network model, a semi-supervised learning module, a video query and retrieval module, and a video database. The technical scheme provided by the invention can effectively improve the accuracy of video query and retrieval.

Description

Video query and search method of dense video description algorithm based on semi-supervised learning
Technical Field
The invention relates to video query and retrieval technology based on dense video description, and in particular to a method for querying and retrieving videos with a dense video description algorithm based on semi-supervised learning, which addresses the problem of inefficient retrieval of video data.
Background
Video data are characterized by large volume, heavy storage footprint, rich content and diverse types, which makes querying and retrieving them difficult. Meanwhile, with the rapid spread of 5G technology, video data are growing explosively both on mobile phones and in the cloud, and how to retrieve them quickly has become a pain point.
Traditional video query and retrieval is mostly implemented with automatic tagging and video classification algorithms. Such a method records the video capture time, obtains the capture location through a GPS sensor or the like, and automatically obtains the scene or action category through a video classification algorithm, so that each video corresponds to several tags (e.g., 1 May 2020, Beijing, China, outdoor, running, etc.). When querying, the user inputs keywords and the system finds the corresponding videos by keyword matching (document [1]). However, such methods can only search on keywords such as time, place, scene and action, so their retrieval capability is limited.
A more natural and efficient way to query is to search with a sentence or phrase (e.g., "a boy eating cake at the seaside last summer"). Such a query can specify the people, objects, actions, relations, time and place of a video more precisely, so the user can accurately retrieve the videos they want. A video description algorithm can automatically generate a natural-language description of a video; based on such an algorithm, assisted by a keyword matching algorithm, retrieval based on sentences or phrases becomes possible (document [2]). The common structure of video description algorithms is currently the encoder-decoder structure (document [3]). The encoder is generally a convolutional neural network (document [4]) that extracts features from multi-frame video data and converts the three-dimensional video data into a time-ordered sequence of feature vectors. The decoder is a recurrent neural network (document [5]) or a self-attention-based decoder (document [6]) that processes the time-ordered feature vectors and converts them into natural language.
Dense video description is a more fine-grained video description algorithm (document [7]). It works in two steps. First, the video is segmented in time into several periods (for example, for a video of length 100 s, the periods might be 1-20 s, 35-60 s and 55-90 s); each period corresponds to an event or action, and periods may overlap. Second, a video description is generated for each period, yielding one descriptive sentence per period. Compared with plain video description, dense video description divides the video finely and describes it in more detail. A video query and retrieval system based on the dense description algorithm applies it to all video data, so each video yields several video segments with corresponding text descriptions. When the user searches, the system computes the matching degree between the user input and the text descriptions of all video segments with a text matching algorithm and returns the corresponding segments. A video query and retrieval system based on dense video description is shown in Fig. 1. Compared with an ordinary video description algorithm, the segments obtained by dense video description are more concise and more precise.
Most existing dense video description algorithms are based on supervised learning, which requires all video data in the training set to be annotated in advance. Annotating dense video descriptions is complex, because both the video segments and their descriptions must be labeled, so the annotation cost is high. Semi-supervised learning learns from a small amount of labeled data together with a large amount of unlabeled data, which reduces the annotation requirement. Meanwhile, with the arrival of the 5G era, video data are multiplying and large amounts of unlabeled data can be obtained easily. If such unlabeled data could be used to train a more accurate dense description algorithm, the annotation cost would be reduced. Recent work (documents [8]-[9]) has effectively applied semi-supervised learning algorithms based on data augmentation to image classification tasks. However, the prior art does not use semi-supervised learning for dense video description, and the technical problem of dense video query and retrieval is not effectively solved.
References:
[1] Shao L, Jones S, Li X. Efficient Search and Localization of Human Actions in Video Databases[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2014, 24(3): 504-512.
[2] Gao L, Guo Z, Zhang H, et al. Video Captioning With Attention-Based LSTM and Semantic Consistency[J]. IEEE Transactions on Multimedia, 2017, 19(9): 2045-2055.
[3] Venugopalan S, Rohrbach M, Donahue J, et al. Sequence to Sequence -- Video to Text[C]. International Conference on Computer Vision, 2015: 4534-4542.
[4] Krizhevsky A, Sutskever I, Hinton G E. ImageNet Classification with Deep Convolutional Neural Networks[C]. NIPS, 2012: 1097-1105.
[5] Graves A. Generating Sequences With Recurrent Neural Networks[J]. arXiv: Neural and Evolutionary Computing, 2013.
[6] Vaswani A, Shazeer N, Parmar N, et al. Attention Is All You Need[C]. Neural Information Processing Systems, 2017: 5998-6008.
[7] Krishna R, Hata K, Ren F, et al. Dense-Captioning Events in Videos[C]. International Conference on Computer Vision, 2017: 706-715.
[8] Xie Q, Dai Z, Hovy E, et al. Unsupervised Data Augmentation for Consistency Training[J]. arXiv: Learning, 2019.
[9] Berthelot D, Carlini N, Goodfellow I, et al. MixMatch: A Holistic Approach to Semi-Supervised Learning[C]. Neural Information Processing Systems, 2019: 5049-5059.
[10] Brown P F, Pietra V J D, Souza P V D, et al. Class-Based n-gram Models of Natural Language[J]. Computational Linguistics, 1992, 18(4): 467-479.
[11] Banerjee S, Lavie A. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments[C]. ACL, 2005: 228-231.
[12] Caba Heilbron F, Escorcia V, Ghanem B, et al. ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding[C]. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
Disclosure of the Invention
In order to overcome the defects of the prior art, the invention reduces the annotation requirement of the dense video description algorithm and improves the precision of video query and retrieval.
According to the invention, video query and retrieval is performed with a dense video description algorithm, so that video segments can be queried and retrieved more accurately. In addition, the dense video description algorithm is trained with semi-supervised learning, which effectively exploits a large amount of unlabeled video data, improves the performance of the dense video description algorithm, and reduces the dependence of the algorithm on annotated data. Experiments show that the algorithm provided by the invention achieves superior performance on a benchmark dataset.
The technical scheme provided by the invention is as follows:
the video query and retrieval method of the dense video description algorithm based on the semi-supervised learning trains the dense video description network model/algorithm by the semi-supervised learning method, effectively utilizes a large amount of unlabeled video data, and improves the performance of the dense video description algorithm; the method comprises the following steps:
1) Establishing a dense video description network model, wherein the dense video description network model comprises: extracting a network model, a video segmentation network model and a video description network model from video features; respectively for: extracting video features; dividing a video into a plurality of sub-video segments; decoding each sub-video segment to obtain natural language description of the sub-video segment;
the method specifically comprises the following steps:
11) Extract video features with a convolutional neural network; the features include, but are not limited to, three-dimensional visual features, two-dimensional visual features, semantic features, etc.
12) Segment the video data with a video segmentation network (which may be a boundary-sensitive network (BSN) or a temporal segment network (TSN)), dividing the video into sub-video segments.
13) Decode each sub-video segment with a video description network (which may use a recurrent neural network or a self-attention-based network) to obtain the natural-language description of the segment.
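For illustration only, a minimal PyTorch-style skeleton of such a three-part model might look as follows; the class name, constructor arguments and call interfaces are assumptions made here for readability, not the patent's actual implementation.

```python
import torch.nn as nn

class DenseVideoDescriptionModel(nn.Module):
    """Illustrative skeleton: feature extractor + temporal segmentation + caption decoder."""
    def __init__(self, feature_extractor, segmentation_net, description_net):
        super().__init__()
        self.feature_extractor = feature_extractor   # e.g. a 2D/3D CNN backbone
        self.segmentation_net = segmentation_net     # e.g. a BSN- or TSN-style proposal network
        self.description_net = description_net       # e.g. an RNN or self-attention decoder

    def forward(self, video):
        feats = self.feature_extractor(video)        # (t, D) frame-level features
        segments = self.segmentation_net(feats)      # list of (start, end) proposals
        captions = [self.description_net(feats, seg) for seg in segments]
        return segments, captions
```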
2) Train the dense video description network with a semi-supervised learning method and compute the training loss, comprising the following steps:
21) Compute the loss on annotated video data.
For annotated video data v, learning is performed with the following loss function:
L^(labeled) = L_seg^(labeled) + α·L_des^(labeled)
where L^(labeled) is the loss function for annotated video data; L_seg^(labeled) is the loss of the video segmentation network on annotated video data; L_des^(labeled) is the loss of the video description network on annotated video data; and α is a weight coefficient used to control the ratio of the two parts. For annotated data, L_seg^(labeled) and L_des^(labeled) are losses (e.g., L2 loss, cross-entropy loss) computed from the similarity between the video segments and natural-language descriptions predicted by the dense video description model and the ground-truth annotations.
22) Compute the loss on unlabeled video data.
For unlabeled video data u, apply data augmentation to obtain u'. Use the video segmentation network to predict the video segments of u and u' respectively. Denote each predicted video segment of u as seg; find in the predicted segment set of u' the segment with the largest intersection-over-union with seg, denoted seg', and compute the L1 loss (one-norm loss) between the start and end times of seg and seg'. Sum the losses over all video segments of u to obtain the segment loss of the unlabeled video data, denoted L_seg^(unlabeled). For each video segment predicted from u, use the video description algorithm to predict a descriptive sentence, which serves as the pseudo description label of that segment. With u and u' respectively as input, compute the output probability distributions of the pseudo description label at each time step, denoted p and p', and take the K-L divergence (Kullback-Leibler divergence) of p and p' as the description loss. Sum the description losses over all video segments of u to obtain the description loss of the unlabeled video data, denoted L_des^(unlabeled). The final loss of the unlabeled video data is:
L^(unlabeled) = L_seg^(unlabeled) + β·L_des^(unlabeled)
where β is a weight coefficient used to control the ratio of the two parts.
23) Sum the annotated-data loss and the unlabeled-data loss with weights to obtain the final loss:
L = L^(labeled) + γ·L^(unlabeled)
where L is the loss over all training data and γ is a weight coefficient used to control the ratio of the two parts.
3) Perform video query and retrieval, comprising the following steps:
31) For each video in the video database, pre-compute its video segments and video descriptions with the dense description algorithm;
32) Input the query sentence, match it against each segment description in the database with a text matching method, and compute the matching degree;
33) Rank by matching degree; the video segment with the highest matching degree is the result of the video query and retrieval.
The video query and retrieval system based on the semi-supervised-learning dense video description algorithm comprises: a dense video description network model, a semi-supervised learning module, a video query and retrieval module, and a video database. Wherein:
the dense video description network model comprises a video feature extraction network model, a video segmentation network model and a video description network model, used respectively for extracting video features, dividing the video into several sub-video segments, and decoding each sub-video segment to obtain its natural-language description;
the video features include, but are not limited to, three-dimensional visual features, two-dimensional visual features, semantic features, etc.; the video segmentation network may be a boundary-sensitive network (BSN) or a temporal segment network (TSN); the video description network may be a recurrent neural network or a self-attention-based network;
the semi-supervised learning module is used for training the dense video description network model, computing the annotated-data loss, the unlabeled-data loss and the total loss, and obtaining the trained dense video description network model;
the video query and retrieval module matches the input query sentence against each segment description in the database and returns the video segment with the highest matching degree as the video query and retrieval result.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a video query and search method based on a dense video description algorithm, which can solve the problem of low video search precision. Meanwhile, the invention provides a dense video description algorithm based on semi-supervised learning, which can learn by utilizing massive unlabeled video data, improves the performance of the dense description algorithm, reduces the labeling cost of dense video description, and improves the efficiency and accuracy of video query and retrieval. The invention verifies that the semi-supervised learning intensive description algorithm has more excellent performance than the traditional supervised learning intensive video description algorithm on a standard data set.
Drawings
Fig. 1 is a block diagram of a prior art video query retrieval technique based on a dense video description algorithm.
Fig. 2 is a schematic diagram of a dense video description network employed in the practice of the present invention.
FIG. 3 is a block diagram of a data processing flow for a semi-supervised learning intensive video description algorithm provided by the present invention.
Fig. 4 is a block diagram of a video query search system based on a semi-supervised learning dense video description algorithm provided by the invention.
Detailed Description
The invention is further described below by way of examples with reference to the accompanying drawings, but these examples in no way limit the scope of the invention.
The invention provides a dense video description algorithm based on semi-supervised learning and constructs a video query and retrieval system with it. The method comprises three parts: the dense video description network, the semi-supervised-learning dense video description algorithm, and the video query and retrieval system.
(1) Overall method
As shown in Fig. 4, the video query and retrieval system built on the semi-supervised dense description algorithm consists of three parts: the dense video description network, the semi-supervised learning training algorithm, and the video query and retrieval system. The input of the dense video description algorithm is a video, and the output is several video segments with their corresponding descriptions; semi-supervised training trains the dense video description algorithm on annotated data and unlabeled data at the same time; the video query and retrieval system matches the segment descriptions produced by the dense video description algorithm against the query input and returns the video segments with the highest matching degree.
(2) Dense video description network
As shown in fig. 2, the dense video description network includes three parts: a feature extraction network, a video segmentation network, and a video description network.
Feature extraction network:
The invention extracts three kinds of features: three-dimensional visual features, two-dimensional visual features and semantic features. The three-dimensional visual features can be extracted from the whole video with a pretrained three-dimensional convolutional neural network such as C3D, I3D or P3D. The two-dimensional visual features can be extracted from several key frames with a pretrained two-dimensional convolutional neural network (e.g., Inception Net, ResNet, etc.). For the semantic features, several high-frequency words are selected from the description annotations as attributes, and each video corresponds to several attributes; the semantic feature extraction network takes the two-dimensional and three-dimensional convolutional features as input and predicts whether the video contains each attribute, and it is trained on the dense video description dataset. The three-dimensional visual features, two-dimensional visual features and semantic features are concatenated to obtain the final video feature v with dimension t × D, where t is the number of frames of the video and D is the feature dimension.
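The concatenation step can be illustrated with the small sketch below; the variable names and the assumption that the 2D features have already been aligned to t time steps are ours, not the patent's.

```python
import torch

def build_video_feature(feat_3d, feat_2d, feat_sem):
    """Illustrative concatenation of the three feature streams into a (t, D) tensor.

    feat_3d: (t, d3) clip-level features from a pretrained 3D CNN (e.g. C3D/I3D/P3D)
    feat_2d: (t, d2) key-frame features from a pretrained 2D CNN (e.g. ResNet),
             assumed here to be already interpolated to t time steps
    feat_sem: (a,)  attribute probabilities from the semantic branch
    """
    t = feat_3d.shape[0]
    sem = feat_sem.unsqueeze(0).expand(t, -1)        # repeat the attribute vector along time
    v = torch.cat([feat_3d, feat_2d, sem], dim=-1)   # (t, d3 + d2 + a) = (t, D)
    return v
```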
Video segmentation network:
The video segmentation network may be implemented with a temporal segment network (TSN), a boundary-sensitive network (BSN), or any other video segmentation network; the invention does not restrict the form of the video segmentation network. Its input is the extracted video feature, and its output is several video segments, the start and end times of each being [b_k, e_k].
Video description network:
For each video segment obtained by the video segmentation network, the video description network predicts its video description. The prediction is serialized: at the first step the input is a predefined start token together with the video segment feature, and the output is the first word; at the second step the input is the start token, the first word and the segment feature, and the output is the second word. This is repeated until the output word is the end token or the description reaches its maximum length. The sentence formed by the words output at all steps is the video description of the segment. The network structure of the video description network can be based on a recurrent neural network (such as a long short-term memory network, LSTM, or a gated recurrent unit network, GRU) or on a self-attention network (such as a Transformer). The invention does not restrict the form of the video description network.
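The serialized prediction above can be illustrated with a short greedy-decoding sketch; `description_net`, its call signature and the token ids are assumptions made for the example, not the patent's actual interface.

```python
import torch

def greedy_decode(description_net, segment_feat, bos_id, eos_id, max_len=20):
    """Illustrative greedy serialized prediction: feed the words generated so far
    plus the segment feature, take the most likely next word, stop at the end token."""
    tokens = [bos_id]                                 # the first input is the start token
    for _ in range(max_len):
        logits = description_net(segment_feat, torch.tensor(tokens))  # (len(tokens), vocab), assumed shape
        next_id = int(logits[-1].argmax())            # most likely next word
        if next_id == eos_id:                         # terminator reached
            break
        tokens.append(next_id)
    return tokens[1:]                                 # the caption without the start token
```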
(3) Semi-supervised learning dense video description algorithm
The traditional dense video description algorithm is trained in a supervised manner, which requires that all video data in the training set carry dense description annotations. The supervised loss function has two parts, the video segmentation network loss and the video description network loss. The video segmentation network loss is the L1 loss between the predicted start/end times of the segments and the ground-truth segment annotations, as follows:
L_seg^(labeled) = Σ_{n=1}^{N} ( |b_n − b_n*| + |e_n − e_n*| )
where N is the number of annotated video segments; b_n is the start time of the n-th annotated segment; b_n* is the start time of the predicted segment with the highest intersection-over-union with the n-th annotated segment; e_n is the end time of the n-th annotated segment; and e_n* is the end time of the predicted segment with the highest intersection-over-union with the n-th annotated segment.
The video description loss is as follows:
L_des^(labeled) = − Σ_{t=1}^{T} log p(y_t | y_<t, v)
where T is the maximum description length, t is the current time step, y is the description annotation, y_t is the annotated word at step t, p is the model output probability, and v is the video feature.
The final supervised loss is as follows:
L^(labeled) = L_seg^(labeled) + α·L_des^(labeled)
where α is a weight coefficient used to control the ratio between the segment loss and the description loss.
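A minimal sketch of this supervised objective is given below, assuming the predicted segments have already been matched to the annotated segments by highest intersection-over-union and that all quantities are PyTorch tensors.

```python
import torch.nn.functional as F

def supervised_loss(pred_segs, gt_segs, word_logits, gt_words, alpha=1.0):
    """Illustrative labeled-data loss: L1 on matched segment boundaries plus
    cross-entropy on the caption words, combined with weight alpha."""
    seg_loss = 0.0
    for (b_pred, e_pred), (b_gt, e_gt) in zip(pred_segs, gt_segs):  # already IoU-matched pairs of tensors
        seg_loss = seg_loss + (b_pred - b_gt).abs() + (e_pred - e_gt).abs()
    des_loss = F.cross_entropy(word_logits, gt_words)               # -log p(y_t | y_<t, v), averaged over t
    return seg_loss + alpha * des_loss
```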
The invention instead trains with semi-supervised learning. As shown in Fig. 3, annotated data and a large amount of unlabeled data are used at the same time. The annotated data are trained with the loss L^(labeled) above.
For unlabeled data u, the invention trains on the similarity between the outputs for the augmented data and the original data. A data augmentation algorithm is applied to u to obtain the augmented video data u'. The augmentation method may be automatic augmentation (AutoAugment), random augmentation (RandAugment), translation, flipping, rotation, random frame removal, and so on; the invention does not restrict the data augmentation method. The feature extraction network is then used to extract the features of u'.
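A toy augmentation sketch (random frame removal followed by a horizontal flip) is shown below; the concrete choice and parameters are assumptions, since the invention does not fix the augmentation.

```python
import torch

def augment_video(frames, drop_prob=0.1):
    """Illustrative augmentation u -> u' on a (t, C, H, W) tensor of decoded frames."""
    keep = torch.rand(frames.shape[0]) > drop_prob    # randomly remove a few frames
    u_aug = frames[keep]
    u_aug = torch.flip(u_aug, dims=[-1])              # horizontal flip of every remaining frame
    return u_aug
```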
For the segmentation network, the video segments of u are predicted first, denoted seg = {[b_k, e_k], k = 1, 2, …, N1}, where b_k and e_k are the start and end times of the k-th segment and N1 is the number of segments predicted for u. The segments of u' are predicted likewise, denoted seg' = {[b'_k, e'_k], k = 1, 2, …, N2}, where N2 is the number of segments predicted for u'. For each segment [b_k, e_k] in seg, the segment in seg' with the largest intersection-over-union is found, denoted [b'_k, e'_k], the L1 loss between the two is computed, and the losses of all segments are summed to give the unlabeled segment loss, as follows:
L_seg^(unlabeled) = Σ_{k=1}^{N1} ( |b_k − b'_k| + |e_k − e'_k| )
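This matching and L1 accumulation can be sketched as below, with segments represented as plain (start, end) tuples for readability (an assumption of the example).

```python
def temporal_iou(a, b):
    """Intersection-over-union of two temporal segments a = (b1, e1), b = (b2, e2)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def unlabeled_segment_loss(seg, seg_prime):
    """Illustrative L_seg^(unlabeled): for each predicted segment of u, take the
    highest-IoU segment of u' and accumulate the L1 distance of their boundaries."""
    loss = 0.0
    for (b_k, e_k) in seg:
        b_p, e_p = max(seg_prime, key=lambda s: temporal_iou((b_k, e_k), s))
        loss += abs(b_k - b_p) + abs(e_k - e_p)
    return loss
```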
For the description network, the video descriptions of u are predicted first, denoted y'. Taking y' as pseudo labels, the K-L divergence between the per-step output distributions of u and u' is computed and then summed, giving the unlabeled description loss, as follows:
L_des^(unlabeled) = Σ_t D_KL( p_t ‖ p'_t )
where D_KL(·‖·) is the K-L divergence, and p_t and p'_t are the output probability distributions at step t of the pseudo label y' when u and u' are used as input, respectively.
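A sketch of this consistency term with PyTorch's KL-divergence is shown below; the divergence direction and the reduction are modelling choices assumed for the example.

```python
import torch.nn.functional as F

def unlabeled_description_loss(logits_u, logits_u_aug):
    """Illustrative L_des^(unlabeled): K-L divergence between the per-step word
    distributions obtained from u and from its augmented version u', both
    conditioned on the same pseudo caption y'."""
    log_p = F.log_softmax(logits_u, dim=-1)      # log p   (from the original video u)
    p_aug = F.softmax(logits_u_aug, dim=-1)      # p'      (from the augmented video u')
    # F.kl_div(log_p, p_aug) computes KL(p' || p), summed over time steps and vocabulary
    return F.kl_div(log_p, p_aug, reduction="sum")
```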
The final loss on unlabeled data is as follows:
L^(unlabeled) = L_seg^(unlabeled) + β·L_des^(unlabeled)
where β is a weight coefficient used to control the ratio of the two parts.
The annotated-data loss and the unlabeled-data loss are summed with a weight to obtain the final loss:
L = L^(labeled) + γ·L^(unlabeled)
where γ is a weight coefficient used to control the ratio of the two parts.
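Putting the pieces together, the overall objective can be assembled as in the small sketch below, with the four partial losses coming from routines like those sketched above.

```python
def semi_supervised_loss(loss_seg_l, loss_des_l, loss_seg_u, loss_des_u,
                         alpha=1.0, beta=1.0, gamma=1.0):
    """Illustrative assembly of the final objective:
    L = (L_seg^(labeled) + alpha*L_des^(labeled)) + gamma*(L_seg^(unlabeled) + beta*L_des^(unlabeled))."""
    loss_labeled = loss_seg_l + alpha * loss_des_l
    loss_unlabeled = loss_seg_u + beta * loss_des_u
    return loss_labeled + gamma * loss_unlabeled
```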
The invention can thus improve the performance of the dense video description network by exploiting massive amounts of unlabeled data.
(4) Video query retrieval system
The invention trains the dense video description network with the semi-supervised-learning dense video description algorithm described above. Each video in the video query and retrieval database is then processed by the trained dense video description network, which predicts several video segments and their corresponding video descriptions. When a query is made, the user inputs a query sentence, the matching degree between the query and all video descriptions in the database is computed with a text matching algorithm, and the video segments with the highest matching degree are returned. The text matching algorithm may be an n-gram-based algorithm (document [10]), a keyword matching method, or a method based on an evaluation metric such as METEOR (document [11]); the invention does not restrict the text matching algorithm. Compared with other methods, this video query and retrieval system has two advantages: queries can be made with sentences or phrases, and the system returns concise video segments rather than whole videos.
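As one possible text matcher (an assumption of this example; the patent allows any n-gram, keyword or METEOR-style matcher), a simple n-gram-overlap ranker might look like this:

```python
def ngrams(text, n=2):
    """Set of word n-grams of a sentence."""
    words = text.lower().split()
    return set(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def match_score(query, description, n=2):
    """Illustrative matching degree: fraction of query n-grams found in the description."""
    q, d = ngrams(query, n), ngrams(description, n)
    return len(q & d) / len(q) if q else 0.0

def retrieve(query, database, top_k=5):
    """Rank pre-computed (video_id, segment, description) triples by matching degree."""
    ranked = sorted(database, key=lambda item: match_score(query, item[2]), reverse=True)
    return ranked[:top_k]
```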
(5) Comparison of the semi-supervised and supervised dense video description algorithms
The semi-supervised-learning dense video description algorithm is used for video query and retrieval. To verify its effect, it is compared with the supervised dense video description algorithm; both are trained and tested on the ActivityNet dataset (document [12]). The dataset contains about 20,000 videos, each with several video segments and description annotations. The evaluation metrics are the segment recall and the METEOR value of the descriptions; for both, larger is better. The results are shown in Table 1. As can be seen from Table 1, semi-supervised learning achieves higher segment recall and description METEOR values than supervised learning, which verifies the superior performance of the semi-supervised learning algorithm.
Table 1. Segment recall and METEOR values of the dense video description algorithm with semi-supervised learning and with supervised learning
It should be noted that the purpose of the disclosed embodiments is to aid further understanding of the present invention, but those skilled in the art will appreciate that: various alternatives and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the invention should not be limited to the disclosed embodiments, but rather the scope of the invention is defined by the appended claims.

Claims (10)

1. A video query and retrieval method based on a dense video description algorithm with semi-supervised learning, wherein a dense video description network model is trained by a semi-supervised learning method and a large amount of unlabeled video data is used to improve the performance of the dense video description algorithm, thereby effectively improving the accuracy of video query and retrieval; the method comprises the following steps:
1) Establishing a dense video description network model, wherein the dense video description network model comprises: a video feature extraction network model, a video segmentation network model and a video description network model; the video feature extraction network model is used for extracting video features; the video segmentation network model is used for dividing the video into several sub-video segments; the video description network model is used for decoding each sub-video segment to obtain the natural-language description of the segment;
2) Training the dense video description network model by the semi-supervised learning method and computing the training loss, comprising the following steps:
21) Computing the loss on annotated video data;
for annotated video data v, learning is performed with the following loss function:
L^(labeled) = L_seg^(labeled) + α·L_des^(labeled)
wherein L^(labeled) is the loss function for annotated video data; L_seg^(labeled) is the loss of the video segmentation network on annotated video data; L_des^(labeled) is the loss of the video description network on annotated video data; and α is a weight coefficient;
22) Computing the loss on unlabeled video data;
for the unlabeled video data u, performing data augmentation to obtain u';
predicting the video segments of u and u' respectively with the video segmentation network model; denoting each predicted video segment of u as seg; denoting the segment in the predicted segment set of u' with the largest intersection-over-union with seg as seg';
computing the L1 loss between the start and end times of seg and seg';
summing the losses over all video segments of u to obtain the segment loss of the unlabeled video data, denoted L_seg^(unlabeled);
for each predicted video segment, predicting its descriptive sentence with the video description algorithm as its pseudo description label;
taking u and u' respectively as input, computing the output probability distributions of the pseudo description label at each time step, denoted p and p' respectively;
taking the K-L divergence of p and p' as the description loss, and summing the description losses over all video segments of u to obtain the description loss of the unlabeled video data, denoted L_des^(unlabeled);
the final loss of the unlabeled video data being:
L^(unlabeled) = L_seg^(unlabeled) + β·L_des^(unlabeled)
wherein β is a weight coefficient for controlling the ratio of the two parts;
23) Performing a weighted summation of the annotated-data loss and the unlabeled-data loss to obtain the final loss:
L = L^(labeled) + γ·L^(unlabeled)
wherein L is the loss over all training data and γ is a weight coefficient for controlling the ratio of the two parts;
3) Performing video query and retrieval, comprising the following steps:
31) for each video in the database, pre-computing its video segments and video descriptions with the dense description algorithm;
32) inputting the query sentence, matching it against each segment description in the database with a text matching method, and computing the matching degree;
33) ranking by matching degree, the video segment with the highest matching degree being the result of the video query and retrieval;
through the above steps, video query and retrieval with the dense video description algorithm based on semi-supervised learning is realized.
2. The video query and retrieval method based on a dense video description algorithm with semi-supervised learning as set forth in claim 1, wherein establishing the dense video description network model in step 1) specifically comprises:
11) extracting video features with a convolutional neural network model, the video features comprising three-dimensional visual features, two-dimensional visual features and semantic features;
12) segmenting the video data with the video segmentation network model, the video segmentation network model being a boundary-sensitive network model or a temporal segment network model;
13) decoding each sub-video segment with the video description network model, the video description network model being a recurrent neural network model or a self-attention-based network model.
3. The video query and retrieval method based on a dense video description algorithm with semi-supervised learning as recited in claim 1, wherein in step 21), for annotated video data, L_seg^(labeled) and L_des^(labeled) are losses calculated from the degree of similarity between the video segments and natural-language descriptions predicted by the dense video description network model and the ground-truth annotations.
4. The video query and retrieval method based on a dense video description algorithm with semi-supervised learning as set forth in claim 3, wherein the loss calculated from the degree of similarity between the predicted video segments and natural-language descriptions and the ground-truth annotations is an L2 loss or a cross-entropy loss.
5. The video query and retrieval method based on a dense video description algorithm with semi-supervised learning as recited in claim 1, wherein in step 21), the loss of the video segmentation network is the L1 loss between the predicted start/end times of the segments and the ground-truth segment annotations, expressed as follows:
L_seg^(labeled) = Σ_{n=1}^{N} ( |b_n − b_n*| + |e_n − e_n*| )
wherein N is the number of video segments; b_n is the start time of the n-th annotated segment; b_n* is the start time of the predicted segment with the highest intersection-over-union with the n-th annotated segment; e_n is the end time of the n-th annotated segment; and e_n* is the end time of the predicted segment with the highest intersection-over-union with the n-th annotated segment;
the video description loss is expressed as follows:
L_des^(labeled) = − Σ_{t=1}^{T} log p(y_t | y_<t, v)
wherein T is the maximum description length, t is the current time step, y is the description annotation, y_t is the annotated word at step t, p is the model output probability, and v is the video feature.
6. The video query and retrieval method based on a dense video description algorithm with semi-supervised learning as recited in claim 1, wherein in step 32), the text matching method comprises an n-gram-based algorithm, a keyword-matching-based method, or a METEOR-evaluation-metric-based method.
7. The video query and retrieval method based on a dense video description algorithm with semi-supervised learning as recited in claim 1, wherein in step 22), the data augmentation method comprises: automatic augmentation, random augmentation, translation, flipping, rotation, or random frame removal.
8. A video query and retrieval system based on a semi-supervised-learning dense video description algorithm, implementing the method of claim 1 and comprising: a dense video description network model, a semi-supervised learning module, a video query and retrieval module, and a video database; wherein:
the dense video description network model comprises a video feature extraction network model, a video segmentation network model and a video description network model; the video feature extraction network model is used for extracting video features; the video segmentation network model is used for dividing the video into several sub-video segments; the video description network model is used for decoding each sub-video segment to obtain the natural-language description of the segment;
the semi-supervised learning module is used for training the dense video description network model and computing the annotated-data loss, the unlabeled-data loss and the total loss, obtaining a trained dense video description network model;
the video query and retrieval module is used for matching the input query sentence against each segmented video description in the video database to obtain the video segment with the highest matching degree as the video query and retrieval result.
9. The video query retrieval system based on a semi-supervised learning dense video description algorithm of claim 8, wherein the video features include three-dimensional visual features, two-dimensional visual features, semantic features.
10. The video query and retrieval system based on a semi-supervised-learning dense video description algorithm of claim 8, wherein the video segmentation network model employs a boundary-sensitive network model or a temporal segment network model, and the video description network model utilizes a recurrent neural network model or a self-attention-based network model.
CN202010476330.3A 2020-05-29 2020-05-29 Video query and search method of dense video description algorithm based on semi-supervised learning Active CN113742520B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010476330.3A CN113742520B (en) 2020-05-29 2020-05-29 Video query and search method of dense video description algorithm based on semi-supervised learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010476330.3A CN113742520B (en) 2020-05-29 2020-05-29 Video query and search method of dense video description algorithm based on semi-supervised learning

Publications (2)

Publication Number Publication Date
CN113742520A CN113742520A (en) 2021-12-03
CN113742520B (en) 2023-11-07

Family

ID=78724665

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010476330.3A Active CN113742520B (en) 2020-05-29 2020-05-29 Video query and search method of dense video description algorithm based on semi-supervised learning

Country Status (1)

Country Link
CN (1) CN113742520B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102142078A (en) * 2010-02-03 2011-08-03 中国科学院自动化研究所 Method for detecting and identifying targets based on component structure model
US9176987B1 (en) * 2014-08-26 2015-11-03 TCL Research America Inc. Automatic face annotation method and system
CN108596940A (en) * 2018-04-12 2018-09-28 北京京东尚科信息技术有限公司 A kind of methods of video segmentation and device
KR20190138238A (en) * 2018-06-04 2019-12-12 삼성전자주식회사 Deep Blind Transfer Learning
CN110765921A (en) * 2019-10-18 2020-02-07 北京工业大学 Video object positioning method based on weak supervised learning and video spatiotemporal features
CN111062215A (en) * 2019-12-10 2020-04-24 金蝶软件(中国)有限公司 Named entity recognition method and device based on semi-supervised learning training

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
An image retrieval method based on semi-supervised learning; Xie Hui, Lu Yueming, Sun Songlin; Application Research of Computers, Vol. 30, No. 7; full text *
An online semi-supervised classification method for web videos fusing heterogeneous information; Du Youtian, Xin Gang, Zheng Qinghua; Journal of Xi'an Jiaotong University, Vol. 47, No. 7; full text *

Also Published As

Publication number Publication date
CN113742520A (en) 2021-12-03

Similar Documents

Publication Publication Date Title
Yuan et al. Exploring a fine-grained multiscale method for cross-modal remote sensing image retrieval
CN110209836B (en) Remote supervision relation extraction method and device
CN107526799B (en) Knowledge graph construction method based on deep learning
CN108694225B (en) Image searching method, feature vector generating method and device and electronic equipment
CN106845411B (en) Video description generation method based on deep learning and probability map model
CN112004111B (en) News video information extraction method for global deep learning
CN107480200B (en) Word labeling method, device, server and storage medium based on word labels
CN108509521B (en) Image retrieval method for automatically generating text index
CN110569405A (en) method for extracting government affair official document ontology concept based on BERT
CN113723166A (en) Content identification method and device, computer equipment and storage medium
CN113282711B (en) Internet of vehicles text matching method and device, electronic equipment and storage medium
CN111598041A (en) Image generation text method for article searching
CN115203421A (en) Method, device and equipment for generating label of long text and storage medium
CN112085120B (en) Multimedia data processing method and device, electronic equipment and storage medium
CN113392265A (en) Multimedia processing method, device and equipment
CN113111836A (en) Video analysis method based on cross-modal Hash learning
Gong et al. A semantic similarity language model to improve automatic image annotation
CN113934835A (en) Retrieval type reply dialogue method and system combining keywords and semantic understanding representation
CN117010500A (en) Visual knowledge reasoning question-answering method based on multi-source heterogeneous knowledge joint enhancement
CN113742520B (en) Video query and search method of dense video description algorithm based on semi-supervised learning
CN116385946A (en) Video-oriented target fragment positioning method, system, storage medium and equipment
CN110941958A (en) Text category labeling method and device, electronic equipment and storage medium
Ronghui et al. Application of Improved Convolutional Neural Network in Text Classification.
CN111061939B (en) Scientific research academic news keyword matching recommendation method based on deep learning
CN114298047A (en) Chinese named entity recognition method and system based on stroke volume and word vector

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant