CN111651635A - Video retrieval method based on natural language description - Google Patents

Video retrieval method based on natural language description

Info

Publication number
CN111651635A
CN111651635A (application CN202010467416.XA)
Authority
CN
China
Prior art keywords
picture
video
word
description
vector
Prior art date
Legal status
Granted
Application number
CN202010467416.XA
Other languages
Chinese (zh)
Other versions
CN111651635B (en)
Inventor
王春辉 (Wang Chunhui)
胡勇 (Hu Yong)
Current Assignee
Polar Intelligence Technology Co ltd
Original Assignee
Polar Intelligence Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Polar Intelligence Technology Co ltd
Priority to CN202010467416.XA
Publication of CN111651635A
Application granted
Publication of CN111651635B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867 Retrieval characterised by using metadata using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a video retrieval method based on natural language description. The method comprises the following steps: extracting pictures frame by frame from an input video file, resizing them to a fixed size, naming each picture by its frame number, and saving it; extracting a text description for each picture, and creating a video description file in which each picture is recorded with the video file path, the frame number of the picture, and the corresponding text description as fields; and searching the video description file according to the query sentence input by the user to obtain the matching pictures in the video files. Because the text description of every frame is generated before any query is issued, a user querying a video whose descriptions have already been generated obtains the target video and its temporal position quickly, which improves video retrieval speed.

Description

Video retrieval method based on natural language description
Technical Field
The invention belongs to the technical field of natural language understanding, and particularly relates to a video retrieval method based on natural language description.
Background
Video retrieval and localization is a complex and challenging problem. Localizing a particular moment in a video in response to a text query is related to many visual tasks, including video retrieval, temporal action localization, and video captioning and question answering.
Video retrieval is the task of, given a set of candidate videos and a language query, using a retrieval algorithm to find the videos that match the query. One retrieval model matches visual concepts in a video with semantic graphs generated by parsing sentence descriptions; another addresses the video-text alignment problem by assigning a time interval to each sentence in a temporally ordered set of sentences for a given video. Recently, Hendricks et al. proposed a joint video-language model for retrieving moments in video from text queries. However, these models can only verify segments that contain the corresponding moments, and the returned results contain much background noise. Although one could densely sample video moments at different scales and use these models to retrieve the corresponding moments, this is not only computationally expensive but also makes the matching task more challenging as the search space grows.
With respect to temporal localization, Gaidon et al. address the problem of temporally localizing actions in untrimmed video, focusing on a limited set of actions. The 3DConvNets model proposes an end-to-end segment-based 3D Convolutional Neural Network (CNN) framework that outperforms Recurrent Neural Network (RNN) based approaches by capturing spatio-temporal information simultaneously. There is also a temporal unit regression network model that jointly predicts action proposals and refines their temporal boundaries through temporal coordinate regression. Since these methods are limited to a predefined list of actions, researchers have proposed using natural language queries to localize activities. They exploit all contextual moments around the current input without explicitly considering the semantic information of the input query.
With respect to the video question-answering task, attention mechanisms have achieved impressive results in neural machine translation, video captioning, and video question answering. The visual attention model for video captioning attends to video frames at each time step without explicitly considering the semantic attributes of the predicted words, which is unnecessary and can even be misleading. To address this problem, hierarchical Long Short-Term Memory (LSTM) networks with adjusted temporal attention have been used for video captioning. Later, attention models were extended to attend selectively not only to specific temporal or spatial regions but also to specific input modalities, such as image features, motion features, and audio features. Recently, a multi-modal attention LSTM network has been developed that leverages multi-modal streams and temporal attention to focus selectively on particular elements during sentence generation.
Existing video retrieval and localization methods combine the approaches mentioned above from other tasks to some extent in order to improve model performance. However, these models are end-to-end: they must be run from scratch for every new query or new video, the running time is long, localization is not fast, and users lose interest in using them.
Disclosure of Invention
In order to solve the above problems in the prior art, the present invention provides a video retrieval method based on natural language description.
In order to achieve the purpose, the invention adopts the following technical scheme:
a video retrieval method based on natural language description comprises the following steps:
step 1, extracting pictures frame by frame from an input video file, resizing the pictures to a fixed size, naming each picture by its frame number and saving it;
step 2, extracting a text description for each picture, wherein each picture is described by one sentence, and creating a video description file in which each picture is recorded with the video file path, the frame number of the picture, and the corresponding text description as fields;
and step 3, searching the video description file according to the query sentence input by the user to obtain the pictures in the video files whose text descriptions match.
Compared with the prior art, the invention has the following beneficial effects:
the method extracts the pictures from the input video file according to the frames, extracts the text description of each picture, creates the video description file by taking the video file path, the frame number of the picture and the corresponding text description as fields for each picture, searches the video description file according to the query sentence input by the user to obtain the picture in the video file corresponding to the matched text description, and realizes the video retrieval based on the natural language description. Because the description of each frame of image is generated before the query, the user can quickly obtain the video to be queried and the time positioning when querying the video with the generated text description, and the video retrieval speed is improved.
Drawings
Fig. 1 is a flowchart of a video retrieval method based on natural language description according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
The embodiment of the invention provides a video retrieval method based on natural language description, which comprises the following steps:
s101, extracting pictures from an input video file according to frames, setting the pictures into a fixed size, naming the pictures according to the frame number of each picture, and storing the pictures;
s102, extracting text description of each picture, wherein each picture is described by one sentence, and a video description file is created by taking a video file path, a frame number of the picture and the corresponding text description as fields aiming at each picture;
s103, searching the video description file according to the query sentence input by the user to obtain the picture in the video file corresponding to the matched text description.
In this embodiment, step S101 is mainly used to extract video images frame by frame, i.e. to obtain a series of pictures with a video file as input. Frames can be extracted according to the number of video frames using the FFmpeg module for Python, and the extracted pictures can be processed to the same size, for example 720 × 480 (pixels × pixels). Each picture is named and saved by its frame number, where the frame number is the sequence number at which the picture is extracted; for example, the frame number of the first extracted picture is 1.
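A minimal sketch of step S101 is given below; it uses OpenCV's cv2 in place of the FFmpeg module (an equivalent way to read frames), and the file names and paths are examples only:

import os
import cv2

def extract_frames(video_path, out_dir, size=(720, 480)):
    # Extract every frame, resize it to a fixed size, and save it named by its frame number.
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    frame_no = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame_no += 1                      # frame numbers start at 1, as in the embodiment
        frame = cv2.resize(frame, size)    # fixed size, e.g. 720 x 480 pixels
        cv2.imwrite(os.path.join(out_dir, "%d.jpg" % frame_no), frame)
    cap.release()
    return frame_no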
In this embodiment, step S102 is mainly used to obtain a text description of each picture. The DenseCap model can be used to extract text descriptions from pictures. The text descriptions include a global description and local descriptions; to improve query speed, only the global description generated by the model is kept, i.e. one picture corresponds to one text description. The DenseCap model consists of three parts: a Convolutional Network, a Fully Convolutional Localization Layer (FCL), and an RNN language model. DenseCap can describe local details of a picture in natural language and can be regarded as a combination of object detection and ordinary image captioning: when the generated description is a single word, the model behaves as object detection; when the described object is the whole picture, it performs image caption generation. In this embodiment, the described objects are always whole pictures. After the text descriptions are generated, the video path, the frame number of each picture, and the picture's description text are combined to produce a video description file organized by video frame.
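One possible way to assemble that file is sketched below as a CSV with the three fields named in the embodiment; densecap_global_caption is a hypothetical wrapper around a pre-trained DenseCap model that returns the single global description of a picture, not part of the patent itself:

import csv
from pathlib import Path

def build_description_file(video_path, frame_dir, out_path, densecap_global_caption):
    # One row per extracted frame: video file path, frame number, global text description.
    frames = sorted(Path(frame_dir).glob("*.jpg"), key=lambda p: int(p.stem))
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["video_path", "frame_number", "description"])
        for frame in frames:
            caption = densecap_global_caption(str(frame))  # global description only
            writer.writerow([video_path, int(frame.stem), caption])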
In this embodiment, step S103 is mainly used to search the video description file according to the query sentence and obtain the corresponding video file and picture. A search framework can be built with the whoosh library in Python, which performs full-text retrieval. Full-text retrieval comprises two processes, index creation and index search: the index is built first, and then it is searched to obtain the text descriptions that match the query sentence. The pictures corresponding to those text descriptions are the pictures being queried.
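A minimal whoosh sketch of this indexing-and-search step might look as follows; the schema field names and the index directory are illustrative choices, not prescribed by the embodiment:

import os
from whoosh import index
from whoosh.fields import Schema, TEXT, ID, NUMERIC
from whoosh.analysis import StemmingAnalyzer
from whoosh.qparser import QueryParser

schema = Schema(video_path=ID(stored=True),
                frame_number=NUMERIC(stored=True),
                description=TEXT(stored=True, analyzer=StemmingAnalyzer()))

def build_index(rows, index_dir="indexdir"):
    # rows: iterable of (video_path, frame_number, description) from the video description file
    os.makedirs(index_dir, exist_ok=True)
    ix = index.create_in(index_dir, schema)
    writer = ix.writer()
    for video_path, frame_no, description in rows:
        writer.add_document(video_path=video_path, frame_number=frame_no, description=description)
    writer.commit()
    return ix

def search(ix, user_query, limit=10):
    # Returns (video_path, frame_number) pairs whose descriptions match the query sentence.
    with ix.searcher() as searcher:
        q = QueryParser("description", ix.schema).parse(user_query)
        return [(hit["video_path"], hit["frame_number"]) for hit in searcher.search(q, limit=limit)]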
As an alternative embodiment, the method for extracting the text description of the picture in step S102 includes:
s1021, extracting a feature map of each picture by using a convolution network of a DenseCap model;
s1022, determining a candidate region and extracting a feature vector in the candidate region: firstly, inputting the feature map into a full convolution network, reversely mapping each pixel point in the feature map into an original image by taking each pixel point as an anchor point, then drawing anchor boxes, namely initial frames, with different aspect ratios and different sizes based on the anchor points, and predicting the confidence fraction and the position information of the initial frames by a positioning layer through a regression model; filtering out initial frames with the overlapping area exceeding 70% of the area with extremely high confidence scores by adopting a non-maximum inhibition mode to obtain candidate frames; finally, extracting the area in each candidate frame into a feature vector with a fixed size by adopting a bilinear interpolation method, wherein all the feature vectors form a feature matrix;
s1023, unfolding the feature matrix into a one-dimensional column vector by using a full connection layer;
s1024, inputting the one-dimensional column vector into an RNN network to obtain a code x-1Constructing a word vector sequence x with the length of T +2-1,x0,x1,x2,…,xT,x0For start flag, x1,x2,…,xTCoding the word series described by the picture text; outputting the vector sequence to RNN to train a prediction model; x is to be-1,x0Inputting the prediction model to obtain a word vector y0According to y0And predicting a first word, then taking the first word as an input of a next-layer RNN network, predicting a second word until the output word is an END mark, and obtaining the text description of the picture.
The embodiment provides a technical scheme for extracting the picture text description. The method comprises 4 steps S1021 to S1024.
Step S1021 is mainly used to extract a feature map of the picture using a convolutional network. The feature map contains various features of the picture, such as texture, light intensity and shape, and the value at each position represents the strength of a certain feature. Owing to the nature of convolutional neural networks, the features become more abstract and carry more semantic information as the network gets deeper. The convolutional network of the DenseCap model adopts a VGG-16-based structure comprising 13 convolutional layers with 3 × 3 kernels and 4 max-pooling layers with 2 × 2 pooling kernels. For a picture of size 3 × 720 × 480 (a three-dimensional matrix whose first dimension is the three colour channels red, green and blue), the output of the convolutional network is a feature map of size 512 × 45 × 30. This feature map is the input to the following fully convolutional localization layer (FCL).
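Purely for illustration, the truncated VGG-16 backbone described above can be approximated with torchvision; this is an assumption about an equivalent off-the-shelf backbone, not the patent's own network:

import torch
import torchvision

vgg = torchvision.models.vgg16()     # standard VGG-16: 13 conv layers + 5 max-pool layers
backbone = vgg.features[:30]         # keep the 13 conv layers and the first 4 max-pools
x = torch.randn(1, 3, 480, 720)      # one 720 x 480 RGB picture, in N x C x H x W order
features = backbone(x)
print(features.shape)                # torch.Size([1, 512, 30, 45]), i.e. the 512 x 45 x 30 feature map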
Step S1022 is mainly used to determine the candidate regions and extract the feature vectors inside them. This step is mainly performed by the fully convolutional localization layer, which is the core of the whole model and, similarly to Faster R-CNN, generates bounding boxes for the objects to be recognized in the picture. Its input is the feature map from the convolutional network; its output is a set of fixed-length feature vectors for a number of candidate regions (for example 300), where each candidate region carries three pieces of data: its coordinates, a confidence score, and its region features. A larger confidence score indicates a region closer to a real region. The processing in the fully convolutional localization layer comprises four steps. The first step is anchor convolution: each pixel of the C × W' × H' feature map produced by the convolutional network is taken as an anchor point and mapped back into the original image, anchor boxes of different aspect ratios and sizes are drawn around each anchor point, k anchor boxes per anchor point (for example k = 12), and for each anchor box the localization layer in the FCL predicts a confidence score and position information with a regression model. Concretely, the feature map is passed through a convolutional layer with 3 × 3 kernels and then a convolutional layer with 1 × 1 kernels whose number of kernels is 5k, so the final output of this layer is a 5k × W' × H' array containing the confidence scores and position information for all anchor points. The second step is bounding-box regression, which refines the initial boxes: because the boxes from the previous step may not match the real regions well, linear regression is used, under the supervision of the real regions, to obtain four offsets per box that update the coordinates of the box centre and the width and height of the candidate box. The third step is sampling: the previous two steps produce too many candidate boxes, so to reduce the running cost they are sampled and 300 candidate boxes are selected by non-maximum suppression, which removes any candidate box whose overlap with a higher-scoring box exceeds 70%; this reduces redundant overlapping outputs and localizes the targets more precisely. The fourth step is bilinear interpolation: each candidate region obtained after sampling is a rectangular box with its own size and aspect ratio, so to connect with the subsequent fully connected layers (the recognition network) and the RNN language model, the model extracts each candidate region into a feature vector of fixed size by bilinear interpolation and combines the feature vectors of all candidate regions into a feature matrix.
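The non-maximum suppression used in the sampling step can be sketched as follows; the box format [x1, y1, x2, y2] and the helper name are assumptions, while the 70% overlap threshold and the 300-box budget follow the embodiment:

import numpy as np

def nms(boxes, scores, iou_thresh=0.7, top_k=300):
    # Keep up to top_k boxes, discarding any box whose IoU with a higher-scoring kept box exceeds iou_thresh.
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)[::-1]          # indices sorted by descending confidence score
    keep = []
    while order.size > 0 and len(keep) < top_k:
        i = order[0]
        keep.append(i)
        # intersection of the best remaining box with every other remaining box
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]  # drop boxes that overlap the kept box too much
    return keep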
Step S1023 is mainly used to flatten the feature matrix obtained in the previous step into one-dimensional column vectors and then combine the one-dimensional vectors of all positive samples into a matrix. This step is mainly performed by a fully connected neural network: the features of each candidate region are flattened into a one-dimensional column vector and passed through two fully connected layers, each followed by a ReLU activation function and Dropout. Each candidate region thus yields a one-dimensional vector of length D = 4096, and all of these vectors are stacked into a 300 × 4096 matrix that is passed to the RNN language model. In addition, the confidence score and position information of each candidate region can be refined a second time to produce its final confidence score and position. This refinement is essentially the same as the earlier bounding-box regression, except that the regression is performed on the length-D vector.
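A sketch of the two fully connected layers of this recognition network is given below; the 512 × 7 × 7 pooled region size is an assumption, since the embodiment only fixes the output length D = 4096:

import torch.nn as nn

recognition_net = nn.Sequential(
    nn.Flatten(),                    # flatten the fixed-size region feature, assumed 512 x 7 x 7
    nn.Linear(512 * 7 * 7, 4096),    # first fully connected layer
    nn.ReLU(inplace=True),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 4096),           # second fully connected layer, output length D = 4096
    nn.ReLU(inplace=True),
    nn.Dropout(p=0.5),
)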
Step S1024 is mainly used to output the text description of the picture. This step is performed by the RNN language model, which takes the one-dimensional feature vector obtained in the previous step as input and outputs a natural language sequence describing the picture content.
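A greedy decoding loop in the spirit of step S1024 might look as follows; rnn_step, embed and vocab are hypothetical stand-ins for the trained prediction model, its word embedding and its vocabulary, so this is a sketch rather than the patent's implementation:

import torch

def generate_caption(image_code, rnn_step, embed, vocab, start_id, end_id, max_len=20):
    # image_code plays the role of x_{-1}; the START token plays the role of x_0.
    hidden = None
    _, hidden = rnn_step(image_code, hidden)        # feed x_{-1}
    inp = embed(torch.tensor([start_id]))           # feed x_0
    words = []
    for _ in range(max_len):
        logits, hidden = rnn_step(inp, hidden)      # next-word distribution (y_0, then y_1, ...)
        word_id = int(logits.argmax(dim=-1))
        if word_id == end_id:                       # stop at the END flag
            break
        words.append(vocab[word_id])
        inp = embed(torch.tensor([word_id]))        # the predicted word becomes the next input
    return " ".join(words)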
The key points of the DenseCap model are the FCLN structure and making the localization layer differentiable through bilinear interpolation, so that end-to-end training from picture regions to natural language descriptions is supported. Experimental results show that, compared with previous network structures, the network structure of this embodiment improves both the quality of the generated picture descriptions and the generation speed. Given these advantages, this embodiment uses a pre-trained DenseCap model, takes the text description of each picture as its output, and constructs a file with the video path, the frame number of the picture, and the picture's text description as fields. Since DenseCap sits between object recognition and ordinary captioning, the generated picture descriptions contain more information about local regions than an ordinary caption generation model, which improves the accuracy of video retrieval and localization.
As an optional embodiment, the step S103 specifically includes:
s1031, reading the video description file, inputting the text description of the picture in the video description file into a word segmentation component, removing punctuation marks and stop words, and performing word segmentation processing to obtain word elements; inputting the lemmas into a language processing component, converting the lemmas into lower case words and converting the lower case words into root words, wherein the root words are indexes;
s1032, performing lexical analysis on the query sentence input by the user, and identifying words and keywords; carrying out syntactic analysis, and constructing a syntactic tree according to syntactic rules of the query statement; performing language processing to process the query statement; searching the index to obtain the text description of the document, namely the picture, which accords with the syntax tree;
s1033, regarding each obtained document and query sentence as a word sequence, and calculating the weight of each word according to the following formula:
w = TF × log_e(n/d)    (1)
in the formula, w is the weight, TF is the number of times the word occurs in the document, d is the number of documents containing the word, and n is the total number of documents;
and replacing each word in each document and in the query sentence with its weight to obtain a vector representation of each document and of the query sentence, and computing the cosine similarity between each document vector and the query sentence vector; the picture corresponding to the document with the largest cosine similarity is the picture being queried.
The embodiment provides a technical scheme for searching the picture matched with the query statement from the video description file. The method comprises 3 steps S1031 to S1033.
Step S1031 is mainly used to create the index. Creating the index is the process of performing language processing on the text descriptions in the video description file and building an index of lemmas. It is mainly carried out by a word segmentation component and a language processing component. The word segmentation component removes punctuation marks and stop words (words without practical meaning, such as "a" and "an") from the text description and segments it into lemmas. For example, the text "I am driving a car on the road" is segmented into the lemmas "I", "driving", "car", "road". The language processing component then converts the lemmas to lower case and reduces them to root words, and these root words are the created index. The example above yields the index terms "i", "driving", "car", "road".
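whoosh's built-in StemmingAnalyzer performs this lower-casing, stop-word removal and stemming in one pass; a small sketch using the example sentence above:

from whoosh.analysis import StemmingAnalyzer

analyzer = StemmingAnalyzer()   # tokenizes, lower-cases, removes stop words, stems to root words
tokens = [t.text for t in analyzer("I am driving a car on the road")]
print(tokens)                   # stop words such as "a", "on", "the" are dropped; the remaining words are lower-cased and stemmed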
Step S1032 is mainly used to search the index. First, lexical analysis is performed on the query sentence to identify words and keywords; then syntactic analysis builds a syntax tree according to the syntactic rules of the query sentence; language processing further processes the original query sentence. Finally, the index built in the previous step is searched to obtain the documents, i.e. the picture text descriptions, that match the syntax tree.
Step S1033 is mainly used to select, from the documents obtained in the previous step, the document (i.e. picture text description) that best matches the query sentence, and thereby obtain the video file and picture being queried. First, the weight of every word in every document and in the query sentence is computed according to formula (1); then each word is replaced by its weight, giving a weight vector for each document and for the query sentence; finally, the cosine similarity between each document vector and the query sentence vector is computed, and the document with the largest cosine similarity is the text description of the picture being queried. With that text description, the name of the video file containing the picture and the picture's frame number are also obtained.
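A plain-Python sketch of this weighting and matching step follows; it applies formula (1) with the natural logarithm and assumes the documents and the query have already been reduced to word lists by the preceding steps:

import math
from collections import Counter

def tfidf_vectors(documents, query):
    # documents: list of word lists (the matched picture descriptions); query: word list.
    n = len(documents)
    vocab = sorted({w for doc in documents + [query] for w in doc})
    df = {w: sum(1 for doc in documents if w in doc) for w in vocab}

    def vec(words):
        tf = Counter(words)
        # formula (1): w = TF * log_e(n / d); words absent from every document get weight 0
        return [tf[w] * math.log(n / df[w]) if df[w] else 0.0 for w in vocab]

    return [vec(doc) for doc in documents], vec(query)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# The best-matching picture is the document with the largest cosine similarity:
# doc_vecs, q_vec = tfidf_vectors(documents, query)
# best = max(range(len(doc_vecs)), key=lambda i: cosine(doc_vecs[i], q_vec))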
The above description merely illustrates a few embodiments of the present invention and should not be taken as limiting its scope; all equivalent changes, modifications, or equivalent enlargements and reductions made in accordance with the spirit of the present invention shall be considered to fall within the scope of the present invention.

Claims (3)

1. A video retrieval method based on natural language description is characterized by comprising the following steps:
step 1, extracting pictures frame by frame from an input video file, resizing the pictures to a fixed size, naming each picture by its frame number and saving it;
step 2, extracting a text description for each picture, wherein each picture is described by one sentence, and creating, for each picture, a video description file with the video file path, the frame number of the picture and the text description of the picture as fields;
and step 3, searching the video description file according to the query sentence input by the user to obtain the pictures in the video files whose text descriptions match.
2. The video retrieval method based on natural language description according to claim 1, wherein the method for extracting the text description of a picture in step 2 comprises:
step 2.1, extracting a feature map of each picture using the convolutional network of a DenseCap model;
step 2.2, determining candidate regions and extracting feature vectors within them: first, the feature map is input into a fully convolutional network; each pixel of the feature map, taken as an anchor point, is mapped back into the original image, anchor boxes (initial boxes) of different aspect ratios and sizes are drawn around the anchor points, and the localization layer predicts a confidence score and position information for each initial box with a regression model; then non-maximum suppression is applied, discarding any initial box whose overlap with a higher-scoring box exceeds 70%, to obtain the candidate boxes; finally, the region inside each candidate box is extracted into a feature vector of fixed size by bilinear interpolation, and all the feature vectors form a feature matrix;
step 2.3, flattening the feature matrix of each picture into one-dimensional column vectors using fully connected layers;
step 2.4, inputting the one-dimensional column vector into an RNN to obtain a code x_{-1}; constructing a word vector sequence of length T+2, x_{-1}, x_0, x_1, x_2, …, x_T, where x_0 is the start flag and x_1, x_2, …, x_T are the encodings of the words of the picture's text description; inputting the vector sequence into the RNN to train a prediction model; inputting x_{-1} and x_0 into the prediction model to obtain a word vector y_0, predicting the first word from y_0, then feeding the first word into the next RNN step to predict the second word, and so on until the output word is the END flag, which yields the text description of the picture.
3. The video retrieval method based on natural language description according to claim 1, wherein the step 3 specifically comprises:
step 3.1, reading the video description file, inputting the text descriptions of the pictures in the video description file into a word segmentation component, removing punctuation marks and stop words, and performing word segmentation to obtain lemmas; inputting the lemmas into a language processing component, which converts them to lower case and reduces them to root words; the root words constitute the index;
step 3.2, performing lexical analysis on the query sentence input by the user to identify words and keywords; performing syntactic analysis to build a syntax tree according to the syntactic rules of the query sentence; performing language processing on the query sentence; and searching the index to obtain the documents, i.e. the picture text descriptions, that match the syntax tree;
step 3.3, each obtained document and query sentence is regarded as a word sequence, and the weight of each word is calculated according to the following formula:
w = TF × log_e(n/d)    (1)
in the formula, w is the weight, TF is the number of times the word occurs in the document, d is the number of documents containing the word, and n is the total number of documents;
and replacing each word in each document and in the query sentence with its weight to obtain a vector representation of each document and of the query sentence, and computing the cosine similarity between each document vector and the query sentence vector; the picture corresponding to the document with the largest cosine similarity is the picture being queried.
CN202010467416.XA 2020-05-28 2020-05-28 Video retrieval method based on natural language description Active CN111651635B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202010467416.XA (CN111651635B) | 2020-05-28 | 2020-05-28 | Video retrieval method based on natural language description

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202010467416.XA (CN111651635B) | 2020-05-28 | 2020-05-28 | Video retrieval method based on natural language description

Publications (2)

Publication Number | Publication Date
CN111651635A | 2020-09-11
CN111651635B | 2023-04-28

Family

ID=72346989

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202010467416.XA (granted as CN111651635B) | Video retrieval method based on natural language description | 2020-05-28 | 2020-05-28

Country Status (1)

Country Link
CN (1) CN111651635B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112925949A (en) * 2021-02-24 2021-06-08 超参数科技(深圳)有限公司 Video frame data sampling method and device, computer equipment and storage medium
CN113468371A (en) * 2021-07-12 2021-10-01 公安部第三研究所 Method, system, device, processor and computer readable storage medium for realizing natural sentence image retrieval
CN113963304A (en) * 2021-12-20 2022-01-21 山东建筑大学 Cross-modal video time sequence action positioning method and system based on time sequence-space diagram
CN115495615A (en) * 2022-11-15 2022-12-20 浪潮电子信息产业股份有限公司 Method, device, equipment, storage medium and terminal for mutual detection of video and text

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102207966A (en) * 2011-06-01 2011-10-05 华南理工大学 Video content quick retrieving method based on object tag
KR20160055511A (en) * 2014-11-10 2016-05-18 주식회사 케이티 Apparatus, method and system for searching video using rhythm
US9361523B1 (en) * 2010-07-21 2016-06-07 Hrl Laboratories, Llc Video content-based retrieval
CN105843930A (en) * 2016-03-29 2016-08-10 乐视控股(北京)有限公司 Video search method and device
KR20160099289A (en) * 2015-02-12 2016-08-22 대전대학교 산학협력단 Method and system for video search using convergence of global feature and region feature of image
CN106708929A (en) * 2016-11-18 2017-05-24 广州视源电子科技股份有限公司 Video program search method and device
CN107229737A (en) * 2017-06-14 2017-10-03 广东小天才科技有限公司 The method and electronic equipment of a kind of video search
US20170357720A1 (en) * 2016-06-10 2017-12-14 Disney Enterprises, Inc. Joint heterogeneous language-vision embeddings for video tagging and search
CN108345679A (en) * 2018-02-26 2018-07-31 科大讯飞股份有限公司 A kind of audio and video search method, device, equipment and readable storage medium storing program for executing
CN109145763A (en) * 2018-07-27 2019-01-04 天津大学 Video monitoring pedestrian based on natural language description searches for image text fusion method
CN109840287A (en) * 2019-01-31 2019-06-04 中科人工智能创新技术研究院(青岛)有限公司 A kind of cross-module state information retrieval method neural network based and device
CN110598048A (en) * 2018-05-25 2019-12-20 北京中科寒武纪科技有限公司 Video retrieval method and video retrieval mapping relation generation method and device

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9361523B1 (en) * 2010-07-21 2016-06-07 Hrl Laboratories, Llc Video content-based retrieval
CN102207966A (en) * 2011-06-01 2011-10-05 华南理工大学 Video content quick retrieving method based on object tag
KR20160055511A (en) * 2014-11-10 2016-05-18 주식회사 케이티 Apparatus, method and system for searching video using rhythm
KR20160099289A (en) * 2015-02-12 2016-08-22 대전대학교 산학협력단 Method and system for video search using convergence of global feature and region feature of image
CN105843930A (en) * 2016-03-29 2016-08-10 乐视控股(北京)有限公司 Video search method and device
US20170357720A1 (en) * 2016-06-10 2017-12-14 Disney Enterprises, Inc. Joint heterogeneous language-vision embeddings for video tagging and search
CN106708929A (en) * 2016-11-18 2017-05-24 广州视源电子科技股份有限公司 Video program search method and device
CN107229737A (en) * 2017-06-14 2017-10-03 广东小天才科技有限公司 The method and electronic equipment of a kind of video search
CN108345679A (en) * 2018-02-26 2018-07-31 科大讯飞股份有限公司 A kind of audio and video search method, device, equipment and readable storage medium storing program for executing
CN110598048A (en) * 2018-05-25 2019-12-20 北京中科寒武纪科技有限公司 Video retrieval method and video retrieval mapping relation generation method and device
CN109145763A (en) * 2018-07-27 2019-01-04 天津大学 Video monitoring pedestrian based on natural language description searches for image text fusion method
CN109840287A (en) * 2019-01-31 2019-06-04 中科人工智能创新技术研究院(青岛)有限公司 A kind of cross-module state information retrieval method neural network based and device

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
LUCA ROSSETTO et al.: "Multimodal Video Retrieval with the 2017 IMOTION System" *
VRUSHALI A. WANKHEDE: "Content-based image retrieval from videos using CBIR and ABIR algorithm", IEEE *
ZHU Aihong et al.: "Research on Key Technologies of Content-Based Video Retrieval", Journal of Intelligence *
HU Zhijun; XU Yong: "A Survey of Content-Based Video Retrieval" *
HU Zhijun et al.: "A Survey of Content-Based Video Retrieval", Computer Science *
YAN Junfei; WANG Song; LI Jun; WU Gang; YAN Qingquan: "A Video Retrieval Method for Video-on-Demand Systems" *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112925949A (en) * 2021-02-24 2021-06-08 超参数科技(深圳)有限公司 Video frame data sampling method and device, computer equipment and storage medium
CN113468371A (en) * 2021-07-12 2021-10-01 公安部第三研究所 Method, system, device, processor and computer readable storage medium for realizing natural sentence image retrieval
CN113963304A (en) * 2021-12-20 2022-01-21 山东建筑大学 Cross-modal video time sequence action positioning method and system based on time sequence-space diagram
CN113963304B (en) * 2021-12-20 2022-06-28 山东建筑大学 Cross-modal video time sequence action positioning method and system based on time sequence-space diagram
CN115495615A (en) * 2022-11-15 2022-12-20 浪潮电子信息产业股份有限公司 Method, device, equipment, storage medium and terminal for mutual detection of video and text
CN115495615B (en) * 2022-11-15 2023-02-28 浪潮电子信息产业股份有限公司 Method, device, equipment, storage medium and terminal for mutual detection of video and text

Also Published As

Publication number Publication date
CN111651635B (en) 2023-04-28

Similar Documents

Publication Publication Date Title
CN111581510B (en) Shared content processing method, device, computer equipment and storage medium
CN110866140B (en) Image feature extraction model training method, image searching method and computer equipment
Karpathy et al. Deep visual-semantic alignments for generating image descriptions
CN111651635B (en) Video retrieval method based on natural language description
CN109815364B (en) Method and system for extracting, storing and retrieving mass video features
CN102549603B (en) Relevance-based image selection
CN110851641B (en) Cross-modal retrieval method and device and readable storage medium
CN110704601A (en) Method for solving video question-answering task requiring common knowledge by using problem-knowledge guided progressive space-time attention network
CN114342353A (en) Method and system for video segmentation
WO2018162896A1 (en) Multi-modal image search
CN111191075B (en) Cross-modal retrieval method, system and storage medium based on dual coding and association
CN110083729B (en) Image searching method and system
CN113836992B (en) Label identification method, label identification model training method, device and equipment
CN108509521A (en) A kind of image search method automatically generating text index
CN116610778A (en) Bidirectional image-text matching method based on cross-modal global and local attention mechanism
Li et al. Adapting clip for phrase localization without further training
CN114418032A (en) Five-modal commodity pre-training method and retrieval system based on self-coordination contrast learning
CN113392265A (en) Multimedia processing method, device and equipment
CN112085120A (en) Multimedia data processing method and device, electronic equipment and storage medium
CN113449066B (en) Method, processor and storage medium for storing cultural relic data by using knowledge graph
EP3096243A1 (en) Methods, systems and apparatus for automatic video query expansion
Saleem et al. Stateful human-centered visual captioning system to aid video surveillance
CN115455249A (en) Double-engine driven multi-modal data retrieval method, equipment and system
Choe et al. Semantic video event search for surveillance video
CN115098646A (en) Multilevel relation analysis and mining method for image-text data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant