CN111309969A - Video retrieval method matched with text information - Google Patents

Video retrieval method matched with text information

Info

Publication number
CN111309969A
Authority
CN
China
Prior art keywords
video
information
vector matrix
model
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010046793.6A
Other languages
Chinese (zh)
Inventor
邓清勇
钱利智
谭智辉
向懿
房海鹏
徐康宇
曾艳
欧阳艳
关屋大雄
胡怡玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiangtan University
Original Assignee
Xiangtan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiangtan University filed Critical Xiangtan University
Priority to CN202010046793.6A priority Critical patent/CN111309969A/en
Publication of CN111309969A publication Critical patent/CN111309969A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Abstract

The invention provides a video retrieval method matched with text information. First, a knowledge graph is used to expand the input text information and build a text feature vector matrix; second, an FCN model is trained with reference to the text feature vector matrix to establish a relation between videos and text information, and a unidirectional LSTM neural network is used to generate feature descriptions for videos and build a video feature vector matrix; the two vector matrices are then imported into an RNN recurrent neural network model for training; finally, the methods for generating the text feature vector matrix and the video feature vector matrix are added to the trained model as interfaces for processing text and video, realizing video retrieval matched with text information. Given input text, the invention can retrieve the videos in a video library whose content best matches it. Because screening and retrieval are completed inside the RNN, the feature description information of the videos does not need to be stored, which reduces the amount of key data stored, improves retrieval efficiency, and realizes video retrieval based on video content.

Description

Video retrieval method matched with text information
Technical Field
The invention relates to the technical field of video retrieval, in particular to a video retrieval method matched with text information.
Background
With the rapid development of internet technology and the continuous upgrading of video shooting, editing and collecting equipment, the number of network videos has grown explosively. People can view videos more conveniently, and they also demand more efficient and accurate video retrieval. Traditional text-based video retrieval requires annotating video information manually and then retrieving videos with a text-based database management system, which consumes a large amount of time and index storage space during retrieval. With the rapid growth of video data, text-based video retrieval can no longer meet retrieval needs: it is difficult to retrieve videos from a small amount of brief text, and retrieval based on video content is inefficient or even infeasible.
In summary, the keys to solving the video retrieval problem are how to expand the text information so as to reduce retrieval complexity, and how to realize retrieval based on video content. With the development of artificial intelligence, deep learning offers new ideas for these problems. A neural network is a machine learning technique that imitates the human brain to realize artificial intelligence. Knowledge graph technology can organize and present a complex knowledge field through data mining, information processing, knowledge measurement and graph drawing, and can be used to expand text information. Video captioning technology can generate textual descriptions for video, that is, convert from the video image domain to the text domain. A recurrent neural network (RNN) can be used to implement the overall functionality of the video retrieval system. Based on these techniques, a video retrieval method matched with text information is designed.
Disclosure of Invention
The invention discloses a video retrieval method matched with text information. It mainly applies knowledge graph and video captioning technology to process text and video, realizes video retrieval based on video content, improves retrieval efficiency and reduces the amount of data stored.
According to the application background of the invention, a video retrieval method matched with text information is provided, which comprises the following steps:
Step 1, using a knowledge graph to expand text information and establish a text feature vector matrix: train a full convolutional neural network (FCN) model with reference to the text feature vector matrix to establish a relation between videos and text information; use a unidirectional LSTM neural network to generate feature descriptions for videos and establish a video feature vector matrix; record the method and parameters for expanding the text information and generating the text feature vector matrix; and generate descriptions for the video under test with video captioning technology, establishing the video feature vector matrix through a word2vec model:
1) Use a knowledge graph to expand the input retrieval text: split the input text into a group of words; obtain vector representations of the words and of knowledge base entities with a word2vec model and a knowledge graph embedding model; map the vectors into the same vector space through a nonlinear transformation and use them to construct a KCNN neural network; given a vocabulary database, obtain the vector representation of the input text for vocabulary retrieval; predict the association probability between the text and candidate expansion information with a DNN neural network model; establish the text feature vector matrix, adding the feature information vectors with the highest association to the matrix so as to expand the input text; and record the method and parameters for expanding the text information and generating the text feature vector matrix;
2) With reference to the information vocabulary in the text feature matrix, establish a corresponding feature vocabulary library and generate descriptions for the video under test with video captioning technology: establish a lexical full convolutional neural network (Lexical-FCN) model; generate data for each frame of the video through the FCN; establish, through model training, a weak mapping between the data and the lexicon gathered from the text feature vector matrix; on the last layer of the FCN output, coarsely divide 16 regions with the anchor method from object detection; confirm the types of the region sequences and select a subset of sequences; generate descriptions based on the text feature vector matrix with a unidirectional LSTM neural network; establish the video feature vector matrix with a word2vec model; and record the method and parameters for generating descriptions for the video under test with video captioning technology and establishing the video feature vector matrix through the word2vec model.
Step 2, importing the text feature vector matrix and the video feature vector matrix into an RNN recurrent neural network model for training: take the method of expanding text information with the knowledge graph and generating the text feature vector matrix as the interface for processing input text; take the method of generating descriptions for the video under test with video captioning technology and establishing the video feature vector matrix through the word2vec model as the interface for processing video; load the whole model into a video retrieval engine; and verify whether the usability of the model reaches the target:
1) Match and relate the text feature vector matrix and the video feature vector matrix imported into the RNN recurrent neural network model. Inputting multiple times generates activation functions of different types and contents, which improves the precision of screening and matching; parameters are continuously adjusted and propagated, generating a multilayer network, and the training process iterates, adjusting the parameters until training is complete. The stored methods for generating the text feature vector matrix and the video feature vector matrix serve as the input interfaces of the model, which is finally loaded into a video retrieval engine;
2) Input text information describing the salient features of the desired video and connect to a video resource library. The expanded text enters the engine as input through the video search engine and participates in the screening and matching process, while the videos of the video library enter the engine and their corresponding features are extracted; the engine then performs its own matching and screening and finally returns the best processed result as the retrieval result.
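To make the two steps concrete, the following is a minimal, hypothetical Python skeleton of the retrieval flow. The names expand_text, caption_video and rank_videos, the vector shapes, and the cosine-similarity scoring are all illustrative assumptions standing in for the trained models described above; they are not the patent's actual implementation.

```python
import numpy as np

DIM = 128  # assumed embedding dimension

def expand_text(query: str) -> np.ndarray:
    """Stand-in for step 1.1: knowledge-graph expansion + word2vec,
    returning a text feature vector matrix (n_terms x DIM)."""
    rng = np.random.default_rng(abs(hash(query)) % 2**32)
    return rng.normal(size=(8, DIM))

def caption_video(video_id: str) -> np.ndarray:
    """Stand-in for step 1.2: Lexical-FCN + LSTM captioning + word2vec,
    returning a video feature vector matrix (n_captions x DIM)."""
    rng = np.random.default_rng(abs(hash(video_id)) % 2**32)
    return rng.normal(size=(8, DIM))

def rank_videos(query: str, video_ids: list[str]) -> list[str]:
    """Stand-in for step 2: score each video against the query text.
    Cosine similarity of the pooled matrices replaces the trained RNN here."""
    t = expand_text(query).mean(axis=0)
    scores = {}
    for vid in video_ids:
        v = caption_video(vid).mean(axis=0)
        scores[vid] = float(t @ v / (np.linalg.norm(t) * np.linalg.norm(v)))
    return sorted(scores, key=scores.get, reverse=True)

print(rank_videos("a dog catching a frisbee", ["vid_001", "vid_002", "vid_003"]))
```

In the invention itself the matching happens inside the trained RNN, so the per-video caption matrices never need to be stored.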
Compared with the prior art, the method has the following advantages:
1. The data used comes from the input text information and from the feature descriptions of the videos, realizing video retrieval based on video content.
2. The retrieval process is optimized: a method for retrieving videos with text information is provided, improving the recognition rate and the retrieval rate.
3. By matching text information against video features, the method avoids storing the many empirically chosen video features required by manually built video indexes, reducing the storage of key data and the number of retrieval operations.
4. A deep learning algorithm is used, with the constructed text feature vector matrix and video feature vector matrix as training samples, to learn the relation for each feature dimension. This overcomes the subjectivity of the empirically assigned feature relations in existing index methods, weights the influence of a video's information elements on the retrieval results more appropriately, improves the screening effect of the video search engine, and returns results that better match user needs, improving the user experience and retrieval efficiency.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of text information expansion according to the present invention;
FIG. 3 is a schematic diagram of the text feature vector matrix according to the present invention;
FIG. 4 is a schematic diagram of video feature generation according to the present invention;
FIG. 5 is a schematic diagram of the video feature vector matrix according to the present invention;
FIG. 6 is a schematic diagram of RNN model training according to the present invention.
Detailed Description
As shown in fig. 1, the technical scheme of the invention comprises the following specific steps:
Step 1, using a knowledge graph to expand text information and establish a text feature vector matrix: train a full convolutional neural network (FCN) model with reference to the text feature vector matrix to establish a relation between videos and text information; use a unidirectional LSTM neural network to generate feature descriptions for videos and establish a video feature vector matrix; record the method and parameters for expanding the text information and generating the text feature vector matrix; and generate descriptions for the video under test with video captioning technology, establishing the video feature vector matrix through a word2vec model:
1) As shown in fig. 2, use a knowledge graph to expand the input retrieval text: split the input text into a group of words; obtain vector representations of the words and of knowledge base entities with a word2vec model and a knowledge graph embedding model; map the vectors into the same vector space through a nonlinear transformation and use them to construct a KCNN neural network; given a vocabulary database, obtain the vector representation of the input text for vocabulary retrieval; predict the association probabilities between the text and candidate expansion information with another DNN neural network model; establish the text feature vector matrix shown in fig. 3, adding the feature information vectors with the highest association to the matrix so as to expand the input text; and record the method and parameters for expanding the text information and generating the text feature vector matrix:
Split the input text into a group of words, link the words with entities of the knowledge base, and find all adjacent entities within one hop of each linked entity; obtain vector representations of the words with a word2vec model and vector representations of the knowledge base entities with a knowledge graph embedding model;
Map the vector representations of the input words, linked entities and context entities into the same vector space through a nonlinear transformation g:
g(e_{1:n}) = [g(e_1) g(e_2) … g(e_n)]
g(e) = tanh(M·e + b)
where M is the trainable transformation matrix and b the bias;
Then, analogously to the three RGB channels of an image, the vector representations of the words, linked entities and context entities are used as the multi-channel input of a CNN, constructing a KCNN neural network, so that the input of the KCNN model can be expressed as:
w_{1:n} = [w_1 w_2 … w_n]

ē = (1/|context(e)|) Σ_{e_i ∈ context(e)} e_i

W = [w_{1:n} g(e_{1:n}) g(ē_{1:n})]
Given a vocabulary database, obtain the vector representation of the text information through the KCNN neural network, and calculate the normalized influence weights by using a DNN neural network model as an attention network with a softmax normalization function:
s_i = exp(H(v, v_i)) / Σ_j exp(H(v, v_j))

where v is the KCNN representation of the input text, v_i that of the i-th vocabulary entry, and H the attention network;
and obtain the vector representation of the vocabulary database with respect to the input text:
v̄ = Σ_i s_i v_i
Use another DNN model to predict the association probability between the text and the expansion information. The results of the two models represent the input at both the semantic and the knowledge level, and the alignment mechanism between entities and words fuses the heterogeneous information sources, so that the implicit relations within the text are better captured and the input text can be expanded through them. Record the method and parameters for expanding the text information and generating the text feature vector matrix. A sketch of this expansion step follows.
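The following minimal Python sketch illustrates the shape of this step: the multi-channel stacking of word, entity and context-entity vectors that feeds the KCNN, and the softmax attention over a vocabulary database. All dimensions, the tanh transformation, and the dot-product attention score (a DNN in the description above) are illustrative assumptions; entity linking is assumed already done.

```python
import numpy as np

dim, n_words, n_vocab = 64, 6, 100
rng = np.random.default_rng(0)

# word2vec word vectors and knowledge-graph entity / context-entity vectors (assumed given)
w = rng.normal(size=(n_words, dim))      # w_{1:n}
e = rng.normal(size=(n_words, dim))      # linked-entity embeddings e_{1:n}
e_ctx = rng.normal(size=(n_words, dim))  # averaged context-entity embeddings

# nonlinear transformation g(e) = tanh(M e + b) into the word-vector space
M = 0.1 * rng.normal(size=(dim, dim))
b = np.zeros(dim)
g = lambda x: np.tanh(x @ M.T + b)

# multi-channel KCNN input, analogous to the RGB channels of an image
W = np.stack([w, g(e), g(e_ctx)])        # shape (3, n_words, dim)

# stand-in for the KCNN itself: pool channels into one text representation v
v = W.mean(axis=(0, 1))

# softmax-normalized influence weights over a vocabulary database
# (a dot product stands in for the DNN attention network H)
vocab = rng.normal(size=(n_vocab, dim))
scores = vocab @ v
s = np.exp(scores - scores.max())
s /= s.sum()

# vector representation of the vocabulary database w.r.t. the input text
v_bar = s @ vocab

# the highest-weight vocabulary vectors become the expansion information
top = np.argsort(s)[::-1][:3]
print("expansion entries:", top, "weights:", s[top].round(3))
```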
2) As shown in fig. 4, with reference to the information vocabulary in the text feature matrix, establish a corresponding feature vocabulary library and generate descriptions for the video under test with video captioning technology: establish a lexical full convolutional neural network (Lexical-FCN) model; generate data for each frame of the video through the FCN; establish, through model training, a weak mapping between the data and the lexicon gathered from the text feature vector matrix; on the last layer of the FCN output, coarsely divide 16 regions with the anchor method from object detection; confirm the types of the region sequences and select a subset of sequences; generate descriptions based on the text feature vector matrix with a unidirectional LSTM neural network; then establish the video feature vector matrix shown in fig. 5 with a word2vec model; and record the method and parameters for generating descriptions for the video under test with video captioning technology and establishing the video feature vector matrix through the word2vec model:
Establish the Lexical-FCN model: generate data for each frame of the video through the FCN, and establish, through model training, a weak mapping between the data and the lexicon gathered from the text feature vector matrix; on the last layer of the FCN output, coarsely divide 16 regions with the anchor method from object detection to serve region-sequence generation;
Region-sequence generation uses a submodular maximization method: extract 30 frames from the video, confirm the type of each region sequence, and select a subset of sequences for description generation. The selection criterion is

A* = argmax_{A ⊆ V} R(A; w)

that is, the region sequence A* is selected to maximize its correlation with the features, where R is specifically

R(A; w) = w^T f(A) = Σ_i w_i f_i(A)

a linear combination of functions f over each sequence A. f must cover three aspects, informativeness, coherence and diversity, with one component function for each: f_inf(A), f_coh(A) and f_div(A).
The region sequence is obtained greedily step by step; the gain of adding a region r at each step is

G(r) = R(A ∪ {r}; w) − R(A; w)

and at each step the r that maximizes this increment is chosen under the parameter weights w:

r* = argmax_{r ∈ V \ A} G(r)

Step by step, this greedy procedure extracts region sequences that are informative and coherent and that differ substantially from one another (diversity), as sketched after this list;
Use a unidirectional LSTM neural network with type information c added, S* = argmax_S P(S | v, c), to generate description words for the different types of sequences; establish a structurally symmetric feature vector matrix through a word2vec model; and record the method and parameters for generating descriptions for the video under test with video captioning technology and establishing the video feature vector matrix through the word2vec model.
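As referenced above, here is a minimal Python sketch of the greedy submodular region-sequence selection. The three component functions are simple stand-ins (feature coverage for informativeness, temporal adjacency for coherence, pairwise dissimilarity for diversity), since their exact definitions are not reproduced in the source; the weights w and all shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_regions, dim = 16, 32
regions = rng.normal(size=(n_regions, dim))  # per-region FCN features (assumed)

def f_inf(A):
    # stand-in informativeness: magnitude of the pooled features
    return float(np.linalg.norm(regions[A].sum(axis=0))) if A else 0.0

def f_coh(A):
    # stand-in coherence: reward temporally adjacent region indices
    idx = sorted(A)
    return float(sum(1.0 for a, b in zip(idx, idx[1:]) if b - a == 1))

def f_div(A):
    # stand-in diversity: penalize pairwise similarity among selected regions
    if len(A) < 2:
        return 0.0
    X = regions[A]
    norms = np.linalg.norm(X, axis=1)
    sim = (X @ X.T) / np.outer(norms, norms)
    return -float(sim[np.triu_indices(len(A), k=1)].mean())

w = np.array([1.0, 0.5, 0.5])  # linear-combination weights w

def R(A):
    # R(A; w) = w^T f(A)
    return float(w @ np.array([f_inf(A), f_coh(A), f_div(A)]))

# greedy maximization: at each step add the region r with the largest gain
A, k = [], 5
for _ in range(k):
    gains = {r: R(A + [r]) - R(A) for r in range(n_regions) if r not in A}
    A.append(max(gains, key=gains.get))
print("selected region sequence:", A)
```

Each selected sequence would then be fed to the unidirectional LSTM to generate its description, and the descriptions embedded by word2vec to form the video feature vector matrix.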
Step 2, importing the text feature vector matrix and the video feature vector matrix into an RNN recurrent neural network model for training: take the method of expanding text information with the knowledge graph and generating the text feature vector matrix as the interface for processing input text; take the method of generating descriptions for the video under test with video captioning technology and establishing the video feature vector matrix through the word2vec model as the interface for processing video; load the whole model into a video retrieval engine; and verify whether the usability of the model reaches the target:
1) As shown in fig. 6, match and relate the text feature vector matrix and the video feature vector matrix imported into the RNN recurrent neural network model. Inputting multiple times generates activation functions of different types and contents, which improves the precision of screening and matching; parameters are continuously adjusted and propagated, generating a multilayer network, and the training process iterates, adjusting the parameters until training is complete. The stored methods for generating the text feature vector matrix and the video feature vector matrix serve as the input interfaces of the model, which is loaded into a video search engine:
The network receives two input feature vectors X_t and Y_t at time t; the value of the hidden layer is then S_t and the output is O_t. The critical point is that S_t depends not only on X_t and Y_t but also on S_{t−1}, so the following formulas can be used, as implemented in the sketch after step 2):
O_t = g(V·S_t)
S_t = f(U·X_t + T·Y_t + W·S_{t−1})
2) Input text information describing the salient features of the desired video and connect to a video resource library. The expanded text enters the engine as input through the video search engine and participates in the screening and matching process, while the videos of the video library enter the engine and their corresponding features are extracted; the engine then performs its own matching and screening and finally returns the best processed result as the retrieval result.
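The following minimal Python sketch implements the two-input recurrent cell given by the formulas in step 1), with tanh and softmax as assumed choices for f and g; all weight shapes and the initialization are illustrative, and real training would learn U, T, W and V by backpropagation.

```python
import numpy as np

dim_x, dim_y, dim_s, dim_o = 128, 128, 64, 10
rng = np.random.default_rng(2)
U = 0.1 * rng.normal(size=(dim_s, dim_x))  # weights for the text input X_t
T = 0.1 * rng.normal(size=(dim_s, dim_y))  # weights for the video input Y_t
W = 0.1 * rng.normal(size=(dim_s, dim_s))  # recurrent weights on S_{t-1}
V = 0.1 * rng.normal(size=(dim_o, dim_s))  # output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def step(X_t, Y_t, S_prev):
    """S_t = f(U·X_t + T·Y_t + W·S_{t-1});  O_t = g(V·S_t)."""
    S_t = np.tanh(U @ X_t + T @ Y_t + W @ S_prev)
    O_t = softmax(V @ S_t)
    return S_t, O_t

# feed paired rows of the text and video feature vector matrices step by step
text_mat = rng.normal(size=(5, dim_x))
video_mat = rng.normal(size=(5, dim_y))
S = np.zeros(dim_s)
for X_t, Y_t in zip(text_mat, video_mat):
    S, O = step(X_t, Y_t, S)
print("final matching output:", O.round(3))
```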

Claims (4)

1. A video retrieval method matched with text information is characterized by at least comprising the following steps:
step 1, using a knowledge graph to expand text information and establish a text feature vector matrix, training a full convolutional neural network (FCN) model with reference to the text feature vector matrix to establish a relation between videos and text information, using a unidirectional LSTM neural network to generate feature descriptions for videos and establish a video feature vector matrix, recording the method and parameters for expanding the text information and generating the text feature vector matrix, and generating descriptions for the video under test with video captioning technology and establishing the video feature vector matrix through a word2vec model;
step 2, importing the text feature vector matrix and the video feature vector matrix into an RNN recurrent neural network model for training, taking the method of expanding text information with the knowledge graph and generating the text feature vector matrix as the interface for processing text information, taking the method of generating descriptions for the video under test with video captioning technology and establishing the video feature vector matrix through the word2vec model as the interface for processing video, and finally loading the whole model into a video retrieval engine to process and judge whether the usability of the model reaches the target.
2. The method of claim 1, wherein said using a knowledge-graph to perform information expansion on textual information and to establish a textual feature vector matrix further comprises:
1) splitting the input text information into a group of words, linking the words with entities of the knowledge base, finding all adjacent entities within one hop of each linked entity, obtaining vector representations of the words by using a word2vec model, and obtaining vector representations of the knowledge base entities by using a knowledge graph embedding model;
2) mapping the vector representations of the input words, linked entities and context entities into the same vector space through a nonlinear transformation g:
g(e_{1:n}) = [g(e_1) g(e_2) … g(e_n)]
g(e) = tanh(M·e + b)
where M is the trainable transformation matrix and b the bias;
3) then, analogously to the three RGB channels of an image, using the vector representations of the words, linked entities and context entities as the multi-channel input of a CNN to construct a KCNN neural network, so that the input of the KCNN model can be expressed as:
w_{1:n} = [w_1 w_2 … w_n]

ē = (1/|context(e)|) Σ_{e_i ∈ context(e)} e_i

W = [w_{1:n} g(e_{1:n}) g(ē_{1:n})]
4) given a vocabulary database, obtaining the vector representation of the text information through the KCNN neural network, and calculating the normalized influence weights by using a DNN neural network model as an attention network with a softmax normalization function:
s_i = exp(H(v, v_i)) / Σ_j exp(H(v, v_j))

where v is the KCNN representation of the input text, v_i that of the i-th vocabulary entry, and H the attention network;
obtaining a vector representation of the lexical database with respect to the input text:
v̄ = Σ_i s_i v_i
and using another DNN neural network model to predict the association probability between the text and the expansion information; the results of the two models represent the input at both the semantic and the knowledge level, and the alignment mechanism between entities and words fuses heterogeneous information sources, so that the implicit relations within the text are better captured and the input text information can be expanded through these implicit relations.
3. The method of claim 1, wherein training a full convolutional neural network (FCN) model with reference to the text feature vector matrix to establish a relation between the video and the text information, and using a unidirectional LSTM neural network to generate feature descriptions for the video and establish a video feature vector matrix, further comprises:
1) establishing a Lexical-FCN model, generating data for each frame of a video through the FCN, establishing through model training a weak mapping between the data and the lexicon gathered from the text feature vector matrix, coarsely dividing 16 regions with the anchor method from object detection on the last layer of the FCN output, and serving region-sequence generation;
2) region-sequence generation uses a submodular maximization method: extract 30 frames from the video, confirm the type of each region sequence, and select a subset of sequences for description generation; the selection criterion is

A* = argmax_{A ⊆ V} R(A; w)

that is, the region sequence A* is selected to maximize its correlation with the features, where R is specifically

R(A; w) = w^T f(A) = Σ_i w_i f_i(A)

a linear combination of functions f over each sequence A; f must cover three aspects, informativeness, coherence and diversity, with one component function for each: f_inf(A), f_coh(A) and f_div(A); the region sequence is obtained greedily step by step, the gain of adding a region r at each step being

G(r) = R(A ∪ {r}; w) − R(A; w)

and at each step the r that maximizes this increment is chosen under the parameter weights w:

r* = argmax_{r ∈ V \ A} G(r)

so that the greedy method step by step extracts region sequences that are informative and coherent and that differ substantially from one another;
3) using a unidirectional LSTM neural network with type information c added, S* = argmax_S P(S | v, c), to generate description words for the different types of sequences, establishing a structurally symmetric feature vector matrix through a word2vec model, and recording the method and parameters for generating descriptions for the video under test with video captioning technology and establishing the video feature vector matrix through the word2vec model.
4. The method of claim 1, wherein the step of importing the text feature vector matrix and the video feature vector matrix into the RNN recurrent neural network model for training further comprises:
1) matching and relating the text feature vector matrix and the video feature vector matrix imported into the RNN recurrent neural network model, inputting multiple times to generate activation functions of different types and contents so as to improve the precision of screening and matching, continuously adjusting and propagating parameters to generate a multilayer network, iterating the training process to keep adjusting the parameters until training is complete, using the stored methods for generating the text feature vector matrix and the video feature vector matrix as the input interfaces of the model, and finally loading the model into a video retrieval engine;
2) inputting text information describing the salient features of the desired video and connecting to a video resource library; the expanded text enters the engine as input through the video search engine and participates in the screening and matching process, while the videos of the video library enter the engine and their corresponding features are extracted; the engine then performs its own matching and screening and finally returns the best processed result as the retrieval result.
CN202010046793.6A 2020-01-16 2020-01-16 Video retrieval method matched with text information Pending CN111309969A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010046793.6A CN111309969A (en) 2020-01-16 2020-01-16 Video retrieval method matched with text information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010046793.6A CN111309969A (en) 2020-01-16 2020-01-16 Video retrieval method matched with text information

Publications (1)

Publication Number Publication Date
CN111309969A (en) 2020-06-19

Family

ID=71145139

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010046793.6A Pending CN111309969A (en) 2020-01-16 2020-01-16 Video retrieval method matched with text information

Country Status (1)

Country Link
CN (1) CN111309969A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107291825A (en) * 2017-05-26 2017-10-24 北京奇艺世纪科技有限公司 With the search method and system of money commodity in a kind of video
CN110598048A (en) * 2018-05-25 2019-12-20 北京中科寒武纪科技有限公司 Video retrieval method and video retrieval mapping relation generation method and device
CN109992676A (en) * 2019-04-01 2019-07-09 中国传媒大学 Across the media resource search method of one kind and searching system

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113641854A (en) * 2021-07-28 2021-11-12 上海影谱科技有限公司 Method and system for converting characters into video
CN113641854B (en) * 2021-07-28 2023-09-26 上海影谱科技有限公司 Method and system for converting text into video
CN113643588A (en) * 2021-08-17 2021-11-12 国匠堂(郑州)教育科技有限公司 Chinese character calligraphy teaching system and using method

Similar Documents

Publication Publication Date Title
US11875267B2 (en) Systems and methods for unifying statistical models for different data modalities
CN109934261B (en) Knowledge-driven parameter propagation model and few-sample learning method thereof
CN112015868B (en) Question-answering method based on knowledge graph completion
US8676725B1 (en) Method and system for entropy-based semantic hashing
CN110362660A (en) A kind of Quality of electronic products automatic testing method of knowledge based map
CN108932342A (en) A kind of method of semantic matches, the learning method of model and server
CN109344266B (en) Dual-semantic-space-based antagonistic cross-media retrieval method
CN110795527B (en) Candidate entity ordering method, training method and related device
CN108446334B (en) Image retrieval method based on content for unsupervised countermeasure training
CN111242033B (en) Video feature learning method based on discriminant analysis of video and text pairs
CN113177141B (en) Multi-label video hash retrieval method and device based on semantic embedded soft similarity
CN111737432A (en) Automatic dialogue method and system based on joint training model
CN113626589B (en) Multi-label text classification method based on mixed attention mechanism
CN113268609A (en) Dialog content recommendation method, device, equipment and medium based on knowledge graph
CN113806554B (en) Knowledge graph construction method for massive conference texts
CN111309969A (en) Video retrieval method matched with text information
CN111512299A (en) Method for content search and electronic device thereof
CN113468891A (en) Text processing method and device
CN113392265A (en) Multimedia processing method, device and equipment
CN111930894A (en) Long text matching method and device, storage medium and electronic equipment
CN110516240B (en) Semantic similarity calculation model DSSM (direct sequence spread spectrum) technology based on Transformer
CN114239730A (en) Cross-modal retrieval method based on neighbor sorting relation
CN110990630B (en) Video question-answering method based on graph modeling visual information and guided by using questions
CN114386424B (en) Industry professional text automatic labeling method, industry professional text automatic labeling device, industry professional text automatic labeling terminal and industry professional text automatic labeling storage medium
CN116452353A (en) Financial data management method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200619