CN114998777A - Training method and device for cross-modal video retrieval model - Google Patents

Training method and device for cross-modal video retrieval model

Info

Publication number
CN114998777A
Authority
CN
China
Prior art keywords
text
vector
target
feature vector
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210429393.2A
Other languages
Chinese (zh)
Inventor
李冠楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing IQIYI Science and Technology Co Ltd
Original Assignee
Beijing IQIYI Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing IQIYI Science and Technology Co Ltd filed Critical Beijing IQIYI Science and Technology Co Ltd
Priority to CN202210429393.2A
Publication of CN114998777A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 - Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/73 - Querying
    • G06F 16/732 - Query formulation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 - Character recognition

Abstract

The embodiment of the invention provides a training method and a training device for a cross-modal video retrieval model. A video stream is segmented to generate target segmented videos; a video sequence position scalar and video sequence feature vectors are acquired; participles and a participle sequence consisting of the participles are generated; text sequence feature vectors, a text label feature vector, and a text position scalar are extracted; the video sequence feature vectors are merged based on the video sequence position scalar to generate a target visual feature vector; the text sequence feature vectors are merged based on the text position scalar to generate a target text feature vector; the vectors lying in different spaces are respectively mapped into the same vector spaces, and the implicit feature vector similarity and the label feature vector similarity are calculated; and the retrieval result is determined based on the text label feature vector, the implicit feature vector similarity, and the label feature vector similarity. In this way, the retrieval precision of cross-modal retrieval on long videos is improved, and a cross-modal retrieval function for searching videos with Chinese or mixed Chinese-and-English text is realized.

Description

Training method and device for cross-modal video retrieval model
Technical Field
The invention relates to the technical field of cross-modal retrieval, and in particular to a training method for a cross-modal video retrieval model, a training device for a cross-modal video retrieval model, an electronic device, and a computer-readable medium.
Background
Cross-modal retrieval is a retrieval method in which a query in one modality returns related results in other modalities; it is an emerging cross-media search technology. With the increasing intelligence and portability of mobile devices and the rapid development of online video platforms, a large number of internet users choose to share and spread information through video. Under this trend, the currently widespread video retrieval method based on text titles suffers from high manual labeling cost and low efficiency, and the text titles cannot fully cover the semantic content of the video, so it is difficult to effectively meet the growing demand for managing and analyzing massive video data.
Cross-modal video-text retrieval aims to minimize the difference between the video-modality representation and the text-modality representation of the same video, so that the similarity of the representations of different modalities can be determined in a shared feature space and cross-modal retrieval becomes possible. In this retrieval paradigm, the query input and the candidate objects can each be either video or text; after the video and the text are vectorized, the cross-modal vector similarity is calculated and the results are ranked to retrieve data in the other modality. In the prior art, the video-modality representation can be determined directly from the whole video. In a long-video scenario, however, a long video simultaneously contains multiple video material segments with different content; if the long video is used directly to determine the video-modality representation for cross-modal retrieval, too many different target elements appear in the video sequence, and the accuracy of the retrieval result is ultimately low.
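A minimal sketch of this retrieval paradigm, assuming both modalities have already been embedded into a shared vector space (the embedding step itself is outside this snippet); cosine similarity and descending sort stand in for whichever similarity measure and ranking scheme are actually used:

    import numpy as np

    def cosine_scores(query_vec: np.ndarray, candidate_vecs: np.ndarray) -> np.ndarray:
        # query_vec: (D,) embedding of the query (text or video);
        # candidate_vecs: (N, D) embeddings of candidates in the other modality.
        q = query_vec / (np.linalg.norm(query_vec) + 1e-8)
        c = candidate_vecs / (np.linalg.norm(candidate_vecs, axis=1, keepdims=True) + 1e-8)
        return c @ q

    def rank_candidates(query_vec: np.ndarray, candidate_vecs: np.ndarray) -> np.ndarray:
        # Higher cosine similarity means a closer cross-modal match.
        return np.argsort(-cosine_scores(query_vec, candidate_vecs))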
Disclosure of Invention
The embodiment of the invention aims to provide a training method for a cross-modal video retrieval model, a training device for the cross-modal video retrieval model, an electronic device, and a computer-readable medium, so as to solve the problem that cross-modal video retrieval cannot be performed on Chinese text. The specific technical scheme is as follows:
In a first aspect of the present invention, there is provided a training method for a cross-modal video retrieval model, where the cross-modal video retrieval model includes a hidden vector space and a tag vector space, and the method may include:
acquiring and segmenting a video stream to generate a plurality of target segmented videos;
acquiring a video sequence position scalar used for expressing the video sequence positions of the target segmented videos;
acquiring a plurality of video sequence feature vectors used for expressing the features of a video sequence in a video stream; the video sequence has a corresponding text sequence;
segmenting the text sequence to generate participles and a participle sequence consisting of the participles;
extracting a plurality of text sequence feature vectors aiming at the participles, text label feature vectors aiming at the participle sequences, and a text position scalar used for expressing text positions;
merging the plurality of video sequence feature vectors based on the video sequence position scalar and generating a target visual feature vector;
combining the plurality of text sequence feature vectors based on the text position scalar and generating a target text feature vector;
mapping the target visual feature vector and the target text feature vector to the hidden vector space, and calculating the similarity of implicit feature vectors aiming at the target visual feature vector and the target text feature vector;
mapping the target visual feature vector and the target text feature vector to the tag vector space, and calculating tag feature vector similarity for the target visual feature vector and the target text feature vector;
and determining a retrieval result based on the text label feature vector, the implicit feature vector similarity and the label feature vector similarity.
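The steps above can be read as the forward pass of one training iteration. The skeleton below is only a structural sketch: every attribute name on the hypothetical model object (shot_detector, visual_backbone, tokenizer, and so on) is an assumption introduced for illustration, not terminology from the patent.

    def train_step(video_stream, text_sequence, model):
        # Segment the video stream and record the video sequence position scalar.
        segments, seq_positions = model.shot_detector(video_stream)
        # Extract one video sequence feature vector per target frame.
        video_feats = model.visual_backbone(segments)
        # Tokenize the text and extract token-level features, a sequence-level
        # text label feature vector, and the text position scalar.
        tokens, token_positions = model.tokenizer(text_sequence)
        text_feats, text_label_vec = model.text_backbone(tokens)
        # Temporal fusion into one target vector per modality.
        v_target = model.visual_fusion(video_feats, seq_positions)
        t_target = model.text_fusion(text_feats, token_positions, text_label_vec)
        # Project into the hidden space and the tag space and score each.
        sim_hidden = model.hidden_space_similarity(v_target, t_target)
        sim_label = model.label_space_similarity(v_target, t_target)
        # Combine with the text label feature vector to get the final result.
        return model.retrieval_score(text_label_vec, sim_hidden, sim_label)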
Optionally, the cross-modal video retrieval model includes a residual attention module, and the step of merging the plurality of video sequence feature vectors based on the video sequence position scalar and generating a target visual feature vector may include:
performing temporal fusion on the plurality of video sequence feature vectors by the residual attention module based on the video sequence position scalar, and generating a target visual feature vector.
Optionally, the residual attention module includes a multi-head attention unit and a multi-layer perceptron, and the step of performing temporal fusion on the plurality of video sequence feature vectors by the residual attention module based on the video sequence position scalar and generating a target visual feature vector may include:
sequentially encoding the video sequence position scalar, and generating first encoding information for the video sequence position scalar;
normalizing the plurality of video sequence feature vectors and generating a plurality of target video sequence feature vectors;
superimposing the plurality of target video sequence feature vectors with the first coding information as a first input signal;
inputting the first input signal to the multi-head attention unit to generate a first output signal;
superimposing the first output signal with the plurality of video sequence feature vectors as a second input signal;
normalizing the second input signal and generating a target second input signal;
inputting the target second input signal to the multilayer perceptron to generate a second output signal;
and superposing the second output signal and the second input signal as a target visual feature vector.
Optionally, the cross-modal video retrieval model includes a residual attention module, and the step of merging the plurality of text sequence feature vectors based on the text position scalar and generating a target text feature vector may include:
performing time domain fusion on the plurality of text sequence feature vectors through the residual attention module based on the text position scalar, and generating a target text feature vector.
Optionally, the residual attention module includes a multi-head attention unit and a multi-layer perceptron, and the step of performing time-domain fusion on the plurality of text sequence feature vectors through the residual attention module based on the text position scalar and generating a target text feature vector may include:
sequentially encoding the text position scalar to generate second encoding information aiming at the text position scalar;
normalizing the text sequence feature vectors to generate a plurality of target text sequence feature vectors;
superposing the plurality of target text sequence feature vectors and the second encoding information to be used as a third input signal;
inputting the third input signal to the multi-head attention unit to generate a third output signal;
superimposing the third output signal with the plurality of text sequence feature vectors as a fourth input signal;
performing normalization operation on the fourth input signal and generating a target fourth input signal;
inputting the target fourth input signal to the multilayer perceptron to generate a fourth output signal;
superposing the fourth output signal and the fourth input signal to serve as an initial target text characteristic vector;
and splicing the initial target text feature vector and the text label feature vector, and generating a target text feature vector.
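The splicing in the final step is a plain concatenation of the fused text vector with the text label feature vector. A tiny sketch, with the dimensions (512-dimensional fused vector, M = 9 labels) assumed purely for illustration:

    import torch

    initial_target_text = torch.randn(512)  # fused output of the residual attention module (dimension assumed)
    text_label_vec = torch.randn(9)         # M-dimensional text label feature vector (M assumed)
    target_text_feature = torch.cat([initial_target_text, text_label_vec], dim=0)
    print(target_text_feature.shape)        # torch.Size([521])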
Optionally, the step of mapping the target visual feature vector and the target text feature vector to the hidden vector space and calculating the implicit feature vector similarity for the target visual feature vector and the target text feature vector may include:
respectively mapping the target visual feature vector and the target text feature vector to the hidden vector space to generate a hidden visual vector and a hidden text vector;
determining a first vector distance of the implied visual vector and the implied text vector;
and calculating the similarity of the implicit characteristic vectors aiming at the target visual characteristic vector and the target text characteristic vector by adopting the first vector distance.
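A sketch of this mapping step, assuming fully connected projection heads (consistent with the multi-layer fully-connected network mentioned below) and cosine similarity as the distance-based similarity; all dimensions are illustrative:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class HiddenSpaceHeads(nn.Module):
        # Illustrative projection heads mapping both modalities into one hidden vector space.
        def __init__(self, visual_dim=512, text_dim=521, hidden_dim=256):
            super().__init__()
            self.visual_proj = nn.Sequential(nn.Linear(visual_dim, hidden_dim), nn.ReLU(),
                                             nn.Linear(hidden_dim, hidden_dim))
            self.text_proj = nn.Sequential(nn.Linear(text_dim, hidden_dim), nn.ReLU(),
                                           nn.Linear(hidden_dim, hidden_dim))

        def forward(self, target_visual_vec, target_text_vec):
            hidden_v = self.visual_proj(target_visual_vec)   # implied visual vector
            hidden_t = self.text_proj(target_text_vec)       # implied text vector
            # Cosine similarity is one common choice of vector-distance-based similarity.
            return hidden_v, hidden_t, F.cosine_similarity(hidden_v, hidden_t, dim=-1)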
Optionally, the cross-modality video retrieval model includes a multi-layer fully-connected neural network having corresponding network parameters, and before the step of determining the first vector distance of the implied visual vector and the implied text vector, may further include:
generating a first target loss function using the implied visual vector and the implied text vector; the first target loss function comprises a first loss function value;
and reducing the first loss function value by controlling the network parameter.
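The text does not fix the form of the first target loss function. One common choice for pulling paired implied visual and text vectors together over a training batch is a symmetric InfoNCE-style contrastive loss, sketched here as an assumption:

    import torch
    import torch.nn.functional as F

    def first_target_loss(hidden_v: torch.Tensor, hidden_t: torch.Tensor, temperature: float = 0.07):
        # hidden_v, hidden_t: (B, D) batches of paired implied visual / text vectors.
        v = F.normalize(hidden_v, dim=-1)
        t = F.normalize(hidden_t, dim=-1)
        logits = v @ t.t() / temperature                      # (B, B) pairwise similarities
        targets = torch.arange(v.size(0), device=v.device)
        # Symmetric cross-entropy: match each video to its text and vice versa.
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

Minimizing this value by updating the network parameters drives paired vectors closer together in the hidden space than unpaired ones.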
Optionally, the step of mapping the target visual feature vector and the target text feature vector to the tag vector space, and calculating the similarity of the tag feature vectors for the target visual feature vector and the target text feature vector may include:
respectively mapping the target visual characteristic vector and the target text characteristic vector to the label vector space to generate a label visual vector and a label text vector;
determining a second vector distance between the tag visual vector and the tag text vector;
and calculating the similarity of the label feature vectors aiming at the target visual feature vector and the target text feature vector by adopting the second vector distance.
Optionally, the cross-modality video retrieval model includes a multi-layer fully-connected neural network having corresponding network parameters, and before the step of determining the second vector distance between the tag visual vector and the tag text vector, may further include:
generating a second target loss function by adopting the label visual vector and the label text vector;
generating a third target loss function by adopting the label visual vector and the text label characteristic vector;
generating a fourth target loss function by adopting the label text vector and the text label characteristic vector;
and reducing the second loss function value, the third loss function value, and the fourth loss function value by controlling the network parameters.
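The exact loss forms are not specified. A sketch of one plausible combination, assuming the second loss compares the two mapped vectors directly and the third and fourth compare each mapped vector against the M-dimensional multi-hot text label feature vector with binary cross-entropy:

    import torch
    import torch.nn.functional as F

    def label_space_losses(label_v, label_t, text_label_vec):
        # label_v, label_t: (B, M) vectors mapped into the tag vector space;
        # text_label_vec: (B, M) multi-hot ground-truth label vectors (float).
        second = 1.0 - F.cosine_similarity(label_v, label_t, dim=-1).mean()
        third = F.binary_cross_entropy_with_logits(label_v, text_label_vec)
        fourth = F.binary_cross_entropy_with_logits(label_t, text_label_vec)
        # All three values are reduced jointly by updating the network parameters.
        return second + third + fourth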
Optionally, the step of determining a retrieval result based on the text label feature vector, the implicit feature vector similarity, and the label feature vector similarity may further include:
determining word segmentation similarity between the target text feature vector and the text label feature vector;
weighting the word segmentation similarity and calculating a weight coefficient;
and calculating a retrieval result by adopting the weight coefficient, the similarity of the implicit characteristic vector and the similarity of the label characteristic vector.
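One possible reading of this scoring rule, with the weight coefficient derived from the word-segmentation similarity; the sigmoid weighting and the assumption that the target text feature vector has already been projected to the label dimension are illustrative choices, not prescribed by the text:

    import torch
    import torch.nn.functional as F

    def retrieval_score(text_vec_in_label_dim, text_label_vec, sim_hidden, sim_label):
        # Word-segmentation similarity between the (projected) target text feature
        # vector and the M-dimensional text label feature vector.
        seg_sim = F.cosine_similarity(text_vec_in_label_dim, text_label_vec, dim=-1)
        weight = torch.sigmoid(seg_sim)                 # illustrative weighting scheme
        return weight * sim_hidden + (1.0 - weight) * sim_label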
Optionally, the step of acquiring and segmenting a video stream to generate a plurality of target segmented videos may include:
judging whether shot changes occur in the video stream;
and if so, segmenting the video stream at the points where the shot changes occur, so as to generate a plurality of target segmented videos.
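The patent leaves the shot detection method itself open. A minimal sketch using OpenCV histogram differences between consecutive frames; the HSV histogram settings and the correlation threshold are assumptions standing in for the video shot detection module:

    import cv2

    def segment_by_shot_change(video_path: str, threshold: float = 0.5):
        """Split a video into (start, end) frame ranges whenever the histogram changes sharply."""
        cap = cv2.VideoCapture(video_path)
        boundaries, prev_hist, idx = [0], None, 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
            hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
            cv2.normalize(hist, hist)
            if prev_hist is not None:
                # Low correlation between consecutive histograms suggests a shot change.
                if cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) < threshold:
                    boundaries.append(idx)
            prev_hist, idx = hist, idx + 1
        cap.release()
        boundaries.append(idx)
        # Each (start, end) frame range corresponds to one target segmented video.
        return list(zip(boundaries[:-1], boundaries[1:]))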
Optionally, the step of obtaining a plurality of video sequence feature vectors for expressing features of the video sequence in the video stream may include:
determining a plurality of target frames in a video stream based on the plurality of target segmented videos;
and extracting video sequence feature vectors of target frame pictures corresponding to the plurality of target frames as a plurality of video sequence feature vectors.
Optionally, the cross-modal video retrieval model includes a visual base network model for obtaining the plurality of video sequence feature vectors, a multilingual text model for obtaining the plurality of text sequence feature vectors, a visual feature sequence fusion module, a text feature sequence fusion module, a feature consistency learning module, and a tag consistency learning module; the cross-modal video retrieval model has a parameter adjustment stage for the visual feature sequence fusion module, the text feature sequence fusion module, the feature consistency learning module, and the tag consistency learning module, and includes a control switch for the multilingual text model and the visual base network model; the method may further include:
and when the cross-modal video retrieval model is in the parameter adjustment stage, closing the control switch.
In a second aspect of the present invention, there is also provided a training apparatus for a cross-modality video retrieval model, where the cross-modality video retrieval model includes a hidden vector space and a tag vector space, and the apparatus may include:
the target segmented video generation module is used for acquiring and segmenting a video stream to generate a plurality of target segmented videos;
a video sequence position scalar obtaining module for obtaining a video sequence position scalar for expressing the video sequence positions of the plurality of target segmented videos;
the video sequence feature vector acquisition module is used for acquiring a plurality of video sequence feature vectors which are used for expressing the features of the video sequence in the video stream; the video sequence has a corresponding text sequence;
the text sequence segmentation module is used for segmenting the text sequence to generate segmentation words and a segmentation word sequence consisting of the segmentation words;
the text sequence feature vector extraction module is used for extracting a plurality of text sequence feature vectors aiming at the participles, text label feature vectors aiming at the participle sequences and a text position scalar used for expressing text positions;
a video sequence feature vector merging module for merging the plurality of video sequence feature vectors based on the video sequence position scalar and generating a target visual feature vector;
a text sequence feature vector merging module, configured to merge the plurality of text sequence feature vectors based on the text position scalar and generate a target text feature vector;
a hidden vector mapping module, configured to map the target visual feature vector and the target text feature vector to the hidden vector space, and calculate an implicit feature vector similarity for the target visual feature vector and the target text feature vector;
a tag vector mapping module, configured to map the target visual feature vector and the target text feature vector to the tag vector space, and calculate a tag feature vector similarity for the target visual feature vector and the target text feature vector;
and the retrieval result determining module is used for determining a retrieval result based on the text label feature vector, the implicit feature vector similarity and the label feature vector similarity.
Optionally, the cross-modality video retrieval model includes a residual attention module, and the video sequence feature vector merging module may include:
and the video sequence feature vector merging submodule is used for performing time domain fusion on the plurality of video sequence feature vectors through the residual attention module based on the video sequence position scalar and generating a target visual feature vector.
Optionally, the residual attention module includes a multi-head attention unit and a multi-layer perceptron, and the video sequence feature vector merging sub-module may include:
a first coding information generation unit for coding the video sequence position scalars in sequence and generating first coding information for the video sequence position scalars;
the target video sequence feature vector generating unit is used for carrying out normalization operation on the plurality of video sequence feature vectors and generating a plurality of target video sequence feature vectors;
a first input signal generating unit configured to superimpose the plurality of target video sequence feature vectors and the first encoding information as a first input signal;
a first output signal generating unit for inputting the first input signal to the multi-head attention unit and generating a first output signal;
a second input signal generation unit for superimposing the first output signal and the plurality of video sequence feature vectors as a second input signal;
a target second input signal generation unit configured to perform a normalization operation on the second input signal and generate a target second input signal;
a second output signal generating unit for inputting the target second input signal to the multi-layer perceptron to generate a second output signal;
and the target visual feature vector generating unit is used for superposing the second output signal and the second input signal to be used as a target visual feature vector.
Optionally, the cross-modality video retrieval model includes a residual attention module, and the text sequence feature vector merging module may include:
and the text sequence feature vector merging submodule is used for performing time domain fusion on the plurality of text sequence feature vectors through the residual attention module based on the text position scalar and generating a target text feature vector.
Optionally, the residual attention module includes a multi-head attention unit and a multi-layer perceptron, and the text sequence feature vector merging sub-module may include:
a second encoding information generation unit configured to encode the text position scalars in order and generate second encoding information for the text position scalars;
the target text sequence feature vector generating unit is used for carrying out normalization operation on the plurality of text sequence feature vectors and generating a plurality of target text sequence feature vectors;
a third input signal generating unit, configured to superimpose the plurality of target text sequence feature vectors and the second encoding information as a third input signal;
a third output signal generating unit, configured to input the third input signal to the multi-head attention unit, and generate a third output signal;
a fourth input signal generation unit, configured to superimpose the third output signal and the plurality of text sequence feature vectors as a fourth input signal;
a target fourth input signal generation unit, configured to perform a normalization operation on the fourth input signal and generate a target fourth input signal;
a fourth output signal output unit, configured to input the target fourth input signal to the multi-layer perceptron and generate a fourth output signal;
the initial target text feature vector generating unit is used for superposing the fourth output signal and the fourth input signal to serve as an initial target text feature vector;
and the target text feature vector generating unit is used for splicing the initial target text feature vector and the text label feature vector and generating a target text feature vector.
Optionally, the hidden vector mapping module may include:
a hidden vector mapping submodule for mapping the target visual feature vector and the target text feature vector to the hidden vector space respectively to generate a hidden visual vector and a hidden text vector;
a first vector distance determination submodule for determining a first vector distance of the implied visual vector and the implied text vector;
and the implicit feature vector similarity calculation operator module is used for calculating the implicit feature vector similarity aiming at the target visual feature vector and the target text feature vector by adopting the first vector distance.
Optionally, the cross-modality video retrieval model includes a multilayer fully-connected neural network having corresponding network parameters, and further includes:
a first target loss function generation submodule for generating a first target loss function using the implicit visual vector and the implicit text vector; the first target loss function comprises a first loss function value;
a first network parameter control sub-module, configured to reduce the first loss function value by controlling the network parameter.
Optionally, the tag vector mapping module may include:
the label vector mapping submodule is used for respectively mapping the target visual feature vector and the target text feature vector to the label vector space to generate a label visual vector and a label text vector;
a second vector distance determination submodule for determining a second vector distance between the tag visual vector and the tag text vector;
and the label feature vector similarity calculation operator module is used for calculating the label feature vector similarity aiming at the target visual feature vector and the target text feature vector by adopting the second vector distance.
Optionally, the cross-modality video retrieval model includes a multilayer fully-connected neural network having corresponding network parameters, and further includes:
a second target loss function generation submodule, configured to generate a second target loss function by using the tag visual vector and the tag text vector;
a third target loss function generation submodule, configured to generate a third target loss function by using the tag visual vector and the text tag feature vector;
a fourth target loss function generation submodule, configured to generate a fourth target loss function by using the tag text vector and the text tag feature vector;
a second network parameter control sub-module for reducing the second loss function value, and the third loss function value, and the fourth loss function value by controlling the network parameter.
Optionally, the retrieval result determining module may further include:
the word segmentation similarity determining submodule is used for determining word segmentation similarity between the target text feature vector and the text label feature vector;
the weight coefficient calculation submodule is used for weighting the word segmentation similarity and calculating a weight coefficient;
and the retrieval result calculation submodule is used for calculating the retrieval result by adopting the weight coefficient, the implicit characteristic vector similarity and the label characteristic vector similarity.
Optionally, the target segmented video generating module may include:
the shot change judging submodule is used for judging whether a shot change occurs in the video stream, and if so, invoking the target segmented video generation submodule;
and the target segmented video generation submodule is used for segmenting the video stream when a shot change occurs in the video stream, so as to generate a plurality of target segmented videos.
Optionally, the video sequence feature vector obtaining module includes:
a target frame determination sub-module for determining a plurality of target frames in the video stream based on the plurality of target segmented videos;
and the video sequence feature vector acquisition sub-module is used for extracting the video sequence feature vectors of the target frame pictures corresponding to the plurality of target frames as a plurality of video sequence feature vectors.
Optionally, the cross-modal video retrieval model includes a visual base network model for obtaining the plurality of video sequence feature vectors, a multilingual text model for obtaining the plurality of text sequence feature vectors, a visual feature sequence fusion module, a text feature sequence fusion module, a feature consistency learning module, and a tag consistency learning module; the cross-modal video retrieval model has a parameter adjustment stage for the visual feature sequence fusion module, the text feature sequence fusion module, the feature consistency learning module, and the tag consistency learning module, and includes a control switch for the multilingual text model and the visual base network model; the apparatus further includes:
and the control switch closing module is used for closing the control switch when the cross-modal video retrieval model is in the parameter adjusting stage.
In yet another aspect of the present invention, there is also provided a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to execute any of the above-mentioned training methods for a cross-modality video retrieval model.
In yet another aspect of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the above-described training methods for a cross-modality video retrieval model.
According to the embodiment of the invention, a video stream is segmented to generate target segmented videos; a video sequence position scalar and video sequence feature vectors are acquired; participles and a participle sequence consisting of the participles are generated; text sequence feature vectors, a text label feature vector, and a text position scalar are extracted; the video sequence feature vectors are merged based on the video sequence position scalar to generate a target visual feature vector; the text sequence feature vectors are merged based on the text position scalar to generate a target text feature vector; the vectors lying in different spaces are respectively mapped into the same vector spaces, and the implicit feature vector similarity and the label feature vector similarity are calculated; and the retrieval result is determined based on the text label feature vector, the implicit feature vector similarity, and the label feature vector similarity. In this way, the retrieval precision of cross-modal retrieval on long videos is improved, and a cross-modal retrieval function for searching videos with Chinese or mixed Chinese-and-English text is realized.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below.
FIG. 1 is a flowchart illustrating steps of a training method for a cross-modal video retrieval model according to an embodiment of the present invention;
Fig. 2 is a schematic model diagram of a cross-modal video retrieval model provided in an embodiment of the present invention;
Fig. 3 is a structural block diagram of a training apparatus for a cross-modal video retrieval model provided in an embodiment of the present invention.
Fig. 4 is a block diagram of an electronic device provided in an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings. The technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention; all other embodiments obtained by a person skilled in the art based on the embodiments of the present invention without inventive effort fall within the scope of the present invention.
In practical applications, besides the fact that the related art directly extracts the video-modality representation from a long video, which leads to low cross-modal retrieval accuracy, existing cross-modal video retrieval methods are usually trained on English data sets. Because their vocabularies contain only a limited number of Chinese words and Chinese and English are segmented differently, the text feature vectors extracted by such models are poorly discriminative, which further lowers the cross-modal retrieval accuracy. Therefore, the embodiment of the invention jointly constrains the target data in combination with a label (vector) space: instead of directly extracting a single sequence feature expressing the whole text sequence as the text-modality representation, each word in each text sequence is classified (segmented) and a plurality of feature sequence vectors expressing each participle are extracted, which together serve as the text-modality representation, thereby realizing a cross-modal retrieval function for searching videos with Chinese or mixed Chinese-and-English text.
Referring to fig. 1, fig. 1 is a flowchart of steps of a training method for a cross-modality video retrieval model according to an embodiment of the present invention, which may specifically include the following steps:
step 101, acquiring and segmenting a video stream to generate a plurality of target segmented videos;
step 102, obtaining a video sequence position scalar for expressing the video sequence positions of the target segmented videos;
step 103, acquiring a plurality of video sequence feature vectors used for expressing the features of the video sequences in the video stream; the video sequence has a corresponding text sequence;
step 104, segmenting the text sequence to generate word segmentation and a word segmentation sequence consisting of the word segmentation;
step 105, extracting a plurality of text sequence feature vectors aiming at the participles, text label feature vectors aiming at the participle sequences, and a text position scalar used for expressing text positions;
step 106, combining the plurality of video sequence feature vectors based on the video sequence position scalar, and generating a target visual feature vector;
step 107, combining the plurality of text sequence feature vectors based on the text position scalar, and generating a target text feature vector;
step 108, mapping the target visual feature vector and the target text feature vector to the hidden vector space, and calculating the similarity of the hidden feature vectors aiming at the target visual feature vector and the target text feature vector;
step 109, mapping the target visual feature vector and the target text feature vector to the tag vector space, and calculating the tag feature vector similarity for the target visual feature vector and the target text feature vector;
and step 110, determining a retrieval result based on the text label feature vector, the implicit feature vector similarity and the label feature vector similarity.
In practical applications, the embodiment of the present invention may be applied to a cross-modal video retrieval model, for example a MultiLingual-CLIP cross-modal video retrieval model, which is a model for cross-modal video retrieval technology. In the embodiment of the present invention, a large number of training sample sets related to cross-modal video retrieval may be used to train the cross-modal video retrieval model, and the model is iterated continuously according to its errors on the training sample set, so that a cross-modal video retrieval model reasonably fitted to the training sample set can be obtained; the trained cross-modal video retrieval model is then applied to an actual cross-modal video retrieval process. The smaller the error of the prediction results of the cross-modal video retrieval model on the video to be retrieved and the text, the more accurate the training of the cross-modal video retrieval model.
In the embodiment of the present invention, a virtual storage space for storing a mapping vector may be configured for the cross-modal search model, specifically, the virtual storage space may include a hidden vector space for storing a hidden vector and a tag vector space for storing a tag vector, and after the creation of the storage space is completed, a video stream may be obtained and segmented to generate a plurality of target segmented videos.
Optionally, in the embodiment of the present invention, a video shot detection module may be integrated in the cross-modal video retrieval model, and the video shot detection module is adopted to determine whether a shot change occurs in the video stream, and if so, the video stream may be segmented when the shot change occurs in the video stream, so as to generate a plurality of target segmented videos.
In practical application, a video photographer can use different shots to display different elements, so that the video is segmented according to shot changes under most conditions, and the element unicity in the segmented video can be effectively ensured.
The embodiment of the present invention may further use the video stream as a training sample set, and specifically, the video stream may include a video sequence composed of multiple frames of ordered images, and the video stream may have corresponding text information, and the video sequence may have a corresponding text sequence, and the text sequence may be composed of ordered words in the text information.
In a specific implementation, a visual base network model, for example a ViT (Vision Transformer) model or a ResNet model, may be integrated in the cross-modal video retrieval model, and the embodiment of the present invention may obtain, through the visual base network model, a plurality of video sequence feature vectors used for expressing the features of a video sequence in a video stream.
The embodiment of the present invention may process the image frames contained in a video sequence, where each image yields a corresponding feature vector. Specifically, the visual base network model may be used to extract the features of the video sequence from each image, and the video sequence feature vectors are then used to express the features of the video sequence. Since a video sequence contains multiple frames of images, multiple video sequence feature vectors may be extracted for one video stream; for example, a set of feature vectors with sequence length S_v and feature dimension D_v may be extracted, which may be expressed as feat_v_seq = [feat_v_1, feat_v_2, ..., feat_v_Sv].
In an alternative embodiment of the present invention, a plurality of target frames in a video stream may be determined based on a plurality of target segmented videos; video sequence feature vectors of target frame pictures corresponding to the plurality of target frames are extracted as a plurality of video sequence feature vectors.
In a specific implementation, because the long video has already been segmented into multiple target segmented videos, and a trained visual base network model can directly determine the target frame of a video, the embodiment of the present invention can directly determine the target frame in each target segmented video by using the visual base network model, and then directly use the video sequence features of the picture corresponding to each target frame as the target video sequence features to be extracted, each expressed by one video sequence feature vector. Specifically, the length of the resulting feature sequence may be the same as the number of target frames; for example, the video sequence feature vectors form a set of feature vectors with sequence length S_v and feature dimension D_v, which may be expressed as feat_v_seq = [feat_v_1, feat_v_2, ..., feat_v_Sv], where the sequence length S_v may be the same as the number of target frames, each feature vector feat_v_i in the sequence is output by the visual base network model (that is, obtained by processing the corresponding target frame image), the feature dimension is the same D_v for all of them, and typical values of D_v may be, but are not limited to, 256, 512, 1024, and so on. This improves the efficiency of extracting the feature vectors of the video sequence.
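A sketch of how feat_v_seq could be produced from the target frame pictures with an off-the-shelf backbone; ResNet-50 is used here only as a stand-in for the visual base network model (with this choice D_v = 2048), and the input size is assumed:

    import torch
    import torch.nn as nn
    import torchvision.models as models

    backbone = models.resnet50()
    feature_extractor = nn.Sequential(*list(backbone.children())[:-1])  # drop the classifier head
    feature_extractor.eval()

    @torch.no_grad()
    def extract_feat_v_seq(key_frames: torch.Tensor) -> torch.Tensor:
        # key_frames: (S_v, 3, 224, 224), one preprocessed picture per target frame.
        feats = feature_extractor(key_frames)          # (S_v, 2048, 1, 1)
        return feats.flatten(1)                        # feat_v_seq with shape (S_v, D_v)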
The embodiment of the invention can determine the video sequence position of each target segmented video in the video sequence after generating a plurality of target segmented videos, and can adopt a scalar to express the video sequence position, and the scalar expressing the video sequence position can be a video sequence position scalar; alternatively, the video sequence position may be obtained from a detection result of monitoring the video by the shot detection module, and the video sequence position of each target segmented video may be expressed by a video sequence position scalar, for example, the video sequence position scalar may be sequence data composed of S _ v scalars, where the sequence length S _ v may be the same as the number of target frames, and the ith scalar element represents time sequence position information of the target frame i, such as shot sequence number information corresponding to the target frame.
In practical applications, the elements in a key frame are generally the best elements for expressing the features of that segment of video. Therefore, in a more preferred embodiment of the present invention, the key frame of each target segmented video can be used as the target frame, and the video sequence feature vectors expressing the picture features of the key frames are extracted as the plurality of video sequence feature vectors expressing the features of the video sequence. This avoids taking different elements in different shots as training targets and ensures the content purity of the video sequence used for model training and inference.
Embodiments of the present invention may also integrate a multilingual text model, such as the multilingual BERT model or a multilingual CLIP model. BERT comes from Google's paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"; BERT is an acronym for "Bidirectional Encoder Representations from Transformers" and is, as a whole, an autoencoding language model, while the CLIP model is a contrastive language-image pre-training model. The embodiment of the invention can segment the text sequence through the multilingual text model and obtain the participles of the text sequence and the participle sequence consisting of the participles, where the participle sequence preserves the order of the participles in the text information.
In the related art, the text sequence feature vector is extracted in such a manner that, when a text sequence is input, one text sequence feature vector is correspondingly output; for example, when a sentence is input as the text sequence, a 1024 × 1 vector may be output. This is currently only suitable for English text, because the existing cross-modal video retrieval models are trained on English text, for which a single vector is sufficient to express the text sequence. For Chinese text, however, the information of each participle is already mixed inside that single vector, so the participles can no longer be distinguished. Therefore, the embodiment of the present invention may not directly use one feature vector to express the text sequence, but instead take the feature sequence vectors output at a shallower layer. Specifically, each text sequence may be segmented in order by the multilingual text model, that is, each word in each text sequence is classified by the multilingual text model, so as to extract a plurality of text sequence feature vectors for the plurality of participles.
In practical applications, the participle sequence may further include the position information of each word in a vocabulary, and the vocabulary may be obtained by statistics over all previous training data or may be pre-constructed in other ways, which is not specifically limited. The embodiment of the invention can not only extract the text sequence feature vectors with the multilingual text model, but also extract, using the position information of each word in the vocabulary, a text label feature vector for the participle sequence. Specifically, given the vocabulary, the order of the participle sequence corresponds to the order of the words in the text information, there is a mapping relationship between the words and the vocabulary, and the participle sequence can be represented by one text label feature vector.
For example, the text sequence is: "I love city A, you hate city B, he likes city D." Segmenting this text sequence gives the participle sequence 1 (I) 3 (love) 4 (city A) 5 (you) 6 (hate) 10 (city B) 7 (he) 8 (likes) 12 (city D), where the participles are in parentheses and the Arabic numerals indicate the position of each participle in the vocabulary.
The predetermined vocabulary is:
(The vocabulary table is shown in the figures of the original publication.)
The length of the participle sequence of this sentence is 9, and the corresponding text sequence feature vectors, 9 in total, are as follows: the 1st text feature sequence vector corresponds to word 1 ("I") in the vocabulary, the 2nd text feature sequence vector corresponds to word 3 ("love") in the vocabulary, ..., and the 9th text feature sequence vector corresponds to word 12 ("city D") in the vocabulary. The text feature sequence vectors form a set of feature vectors with sequence length S_t and feature dimension D_t, which may be expressed as feat_t_seq = [feat_t_1, feat_t_2, ..., feat_t_St], where the sequence length S_t is the same as the participle sequence length; each feature vector feat_t_i in the sequence has the same feature dimension D_t, and typical values of D_t may be, but are not limited to, 256, 512, 1024, and so on.
The text label feature vector may be label information corresponding to the text information, and is a vector with dimension M, where M is the size of the label set.
Statistics are performed on the training data to obtain a label set. This set only expresses the word segmentation content of the text sequence and does not express the order relationship between the participles, and each element in the set may correspond to one participle. Taking the above text sequence as an example, the label set is: "I, you, he, love, hate, A, B, C, D" (M = 9).
The result of the correspondence between the participles of the text sequence 'I love A city, you dislike B city and he likes D city' and the label set is:
(The correspondence table is shown in the figures of the original publication.)
The text label feature vector used to express the participle sequence may then be: [1, 1, 1, 1, 1, 1, 1, 0, 1].
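For illustration, the worked example above can be reproduced with a toy vocabulary and label set; every index here is an illustrative assumption, not the patent's actual vocabulary:

    vocab = {"I": 1, "love": 3, "city A": 4, "you": 5, "hate": 6, "he": 7,
             "likes": 8, "city B": 10, "city D": 12}
    label_set = ["I", "you", "he", "love", "hate",
                 "city A", "city B", "city C", "city D"]          # M = 9

    tokens = ["I", "love", "city A", "you", "hate", "city B", "he", "likes", "city D"]
    participle_sequence = [vocab[t] for t in tokens]              # positions in the vocabulary
    text_position_scalar = list(range(1, len(tokens) + 1))        # i-th participle -> position i

    # Multi-hot text label feature vector: 1 if the label occurs among the participles, else 0.
    text_label_vec = [1 if label in tokens else 0 for label in label_set]
    print(participle_sequence)   # [1, 3, 4, 5, 6, 10, 7, 8, 12]
    print(text_label_vec)        # [1, 1, 1, 1, 1, 1, 1, 0, 1]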
After segmenting the text sequence, the embodiment of the present invention may determine the position of each text unit, such as a word, in the participle sequence, and may express the text position with a scalar; the scalar expressing a text position is a text position scalar. For example, after the multilingual text model segments the text sequence, the text position may be determined according to the position of each participle in the participle sequence, and the text position of each participle may be expressed with a text position scalar. The text position scalar may be sequence data composed of S_t scalars, where the sequence length S_t is the same as the participle sequence length, and the i-th scalar element represents the time-sequence position information of participle i; if the position of the participle in the sequence is used for the representation, its value may be i.
In practical applications, most directly extracted sequence data are, in general, poorly correlated and are not suitable to be used directly for the subsequent similarity calculation. This is because the visual feature model used to obtain the video sequence feature vectors and the text feature model used to obtain the text sequence feature vectors are trained independently, so the feature sequence vectors extracted by the two models are distributed in different spaces and the similarity (correlation) between them cannot be guaranteed. Therefore, the plurality of text sequence feature vectors and the plurality of video sequence feature vectors each need to be mapped into the same vector space as a single vector so that the similarity between them can be calculated, and the cross-modal function of searching videos with Chinese or English text is realized by computing a feature similarity measure in that vector space.
According to the embodiment of the invention, after the video sequence feature vectors and the video sequence position scalar are obtained, the plurality of video sequence feature vectors are merged based on the video sequence position scalar to generate the target visual feature vector; after the text sequence feature vectors and the text position scalar are obtained, the plurality of text sequence feature vectors are merged based on the text position scalar to generate the target text feature vector. In a specific implementation, one or more residual attention modules may be integrated in the cross-modal video retrieval model: for example, the video sequence feature vectors and the video sequence position scalar are input to a residual attention module for sequence processing to obtain the target visual feature vector, and the text sequence feature vectors and the text position scalar are input to a residual attention module for sequence processing to obtain the target text feature vector.
As can be seen from the above, in practical applications, the video sequence feature vectors extracted by using the existing model and the text sequence feature vectors are distributed in different spaces, and the similarity measurement cannot be directly performed, so that the search cannot be directly performed.
After the target visual feature vector and the target text feature vector are generated, the embodiment of the invention can map the target visual feature vector and the target text feature vector to the hidden vector space and calculate the implicit feature vector similarity for the target visual feature vector and the target text feature vector; and map the target visual feature vector and the target text feature vector to the tag vector space and calculate the tag feature vector similarity for the target visual feature vector and the target text feature vector. The retrieval result is then determined based on the text label feature vector, the implicit feature vector similarity, and the tag feature vector similarity.
According to the embodiment of the invention, a video stream is segmented to generate target segmented videos; a video sequence position scalar and video sequence feature vectors are acquired; participles and a participle sequence consisting of the participles are generated; text sequence feature vectors, a text label feature vector, and a text position scalar are extracted; the video sequence feature vectors are merged based on the video sequence position scalar to generate a target visual feature vector; the text sequence feature vectors are merged based on the text position scalar to generate a target text feature vector; the vectors lying in different spaces are respectively mapped into the same vector spaces, and the implicit feature vector similarity and the label feature vector similarity are calculated; and the retrieval result is determined based on the text label feature vector, the implicit feature vector similarity, and the label feature vector similarity. In this way, the retrieval precision of cross-modal retrieval on long videos is improved, and a cross-modal retrieval function for searching videos with Chinese or mixed Chinese-and-English text is realized.
On the basis of the above-described embodiment, a modified embodiment of the above-described embodiment is proposed, and it is to be noted herein that, in order to make the description brief, only the differences from the above-described embodiment are described in the modified embodiment.
The cross-modal video retrieval model comprises a residual attention module, and the step of merging the plurality of video sequence feature vectors based on the video sequence position scalar and generating a target visual feature vector comprises:
performing temporal fusion on the plurality of video sequence feature vectors by the residual attention module based on the video sequence position scalar, and generating a target visual feature vector.
In practical applications, for time-series data with characteristics such as long- and short-term correlation, nonlinearity, and non-stationarity, traditional time-series prediction models have a poor prediction effect. The embodiment of the invention can integrate one or more residual attention modules in the cross-modal video retrieval model, perform time-domain fusion on the plurality of video sequence feature vectors through the residual attention module based on the video sequence position scalar, and generate the target visual feature vector. Time-domain fusion is a processing method used in deep learning models: it can further improve the accuracy and efficiency of the time-series prediction model, ensure that temporal convolution effectively extracts temporal features, improve the residual structure, accelerate model convergence, and also let the attention mechanism play its role of reinforcing the relevant parameters.
Optionally, the residual attention module includes a multi-head attention unit and a multi-layer perceptron, and the step of performing temporal fusion on the plurality of video sequence feature vectors by the residual attention module based on the video sequence position scalar, and generating a target visual feature vector includes:
scalar coding the video sequence position in order, and generating first coding information aiming at the video sequence position scalar;
normalizing the plurality of video sequence feature vectors and generating a plurality of target video sequence feature vectors;
superimposing the plurality of target video sequence feature vectors with the first encoding information as a first input signal;
inputting the first input signal to the multi-head attention unit to generate a first output signal;
superimposing the first output signal with the plurality of video sequence feature vectors as a second input signal;
normalizing the second input signal and generating a target second input signal;
inputting the target second input signal to the multilayer perceptron to generate a second output signal;
and superposing the second output signal and the second input signal as a target visual feature vector.
The residual attention module in the embodiment of the present invention may further include a multi-head attention unit (Multi-Head Attention) and a multi-layer perceptron (MLP, Multi-Layer Perceptron). A residual attention network (Residual Attention Network) is formed by stacking multiple attention modules, and each attention module is divided into two parts, a mask branch and a trunk branch, where the trunk branch performs feature processing and can use any network model. The multi-head attention unit uses multiple queries to compute multiple selections of information from the input in parallel, with each head focusing on a different part of the input; soft attention takes the expectation of all input information under the attention distribution. A multi-layer perceptron is a feed-forward artificial neural network model that maps a set of input data onto a set of output data.
Specifically, the embodiment of the present invention may first perform increasing-order encoding on the video sequence position scalar to generate first encoding information for the video sequence position scalar, then perform a normalization operation on the plurality of video sequence feature vectors to obtain a plurality of target video sequence feature vectors, then superimpose the plurality of target video sequence feature vectors and the first encoding information as a first input signal, and input the first input signal to the multi-head attention unit to generate a first output signal. After the first output signal is generated, the first output signal and the plurality of video sequence feature vectors can be superimposed as a second input signal, the second input signal is normalized to generate a target second input signal, the target second input signal is input to the multi-layer perceptron to generate a second output signal, and the second output signal and the second input signal are superimposed to serve as the target visual feature vector. The normalization operation is a processing method in deep learning models and generally helps accelerate model convergence.
For example, when training is performed on the target segmented video corresponding to a shot, the video sequence feature vectors of the key frames in the current shot, the preceding K1 shots and the subsequent K2 shots may be combined into a visual feature sequence. When the key frames of the current shot are encoded in increasing order starting from 0, each key frame in the preceding K1 shots and the subsequent K2 shots may be position-encoded with the shot sequence number offset as the offset parameter (for example, -1 for the immediately preceding shot), so as to obtain the video sequence position encoding information. The normalized video sequence feature vectors and the video sequence position encoding information are then superimposed as a first input signal, the first input signal is input to the multi-head attention unit to generate a first output signal, the first output signal is superimposed with the original video sequence feature vectors as a second input signal, the normalized second input signal is input to the multi-layer perceptron to generate a second output signal, and the second output signal is superimposed with the second input signal that has not been normalized to serve as the target visual feature vector.
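For illustration, the fusion just described can be sketched in PyTorch-style code as follows. The block follows the enumerated steps (pre-normalization, superposition with position encoding, multi-head attention, residual superposition, normalization, MLP, final superposition); the dimensions, the learned position embedding, and the assumption that position scalars are non-negative integer indices are illustrative choices and not details fixed by this embodiment.

```python
import torch
import torch.nn as nn

class ResidualAttentionFusion(nn.Module):
    """Sketch of the residual attention module: multi-head attention and an MLP,
    each with pre-normalization and a residual (skip) connection."""

    def __init__(self, d_model=512, n_heads=8, max_len=256):
        super().__init__()
        self.pos_embed = nn.Embedding(max_len, d_model)   # increasing-order position encoding
        self.norm1 = nn.LayerNorm(d_model)                # normalization before attention
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)                # normalization before the MLP
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                 nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, seq_feats, positions):
        # seq_feats: (B, S, d_model) video sequence feature vectors
        # positions: (B, S) long tensor of non-negative position indices (assumption)
        first_input = self.norm1(seq_feats) + self.pos_embed(positions)  # first input signal
        first_output, _ = self.attn(first_input, first_input, first_input)
        second_input = first_output + seq_feats                          # residual superposition
        second_output = self.mlp(self.norm2(second_input))
        return second_output + second_input                              # target visual feature vector
```

Whether the resulting sequence is pooled or the vector at the current shot's key frame is taken as the final target visual feature vector is not specified above and would be a separate design choice.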
In an optional embodiment of the present invention, the cross-modality video retrieval model comprises a residual attention module, and the step of combining the plurality of text sequence feature vectors based on the text position scalar and generating the target text feature vector comprises:
performing time domain fusion on the plurality of text sequence feature vectors through the residual attention module based on the text position scalar, and generating a target text feature vector.
In practical applications, traditional time-series prediction models perform poorly on time-series data that exhibit long- and short-term correlations, nonlinearity, non-stationarity and the like. In the embodiment of the invention, one or more residual attention modules can be integrated in the cross-modal video retrieval model, time-domain fusion is performed on the plurality of text sequence feature vectors through the residual attention modules based on the text position scalar, and a target text feature vector is generated; the technical effect achievable by performing time-domain fusion on the text sequence feature vectors is the same as described above and is not repeated here.
Optionally, the residual attention module includes a multi-head attention unit and a multi-layer perceptron, and the time-domain fusion of the text sequence feature vectors by the residual attention module based on the text position scalar and the generation of the target text feature vector includes:
sequentially encoding the text position scalar to generate second encoding information aiming at the text position scalar;
normalizing the text sequence feature vectors to generate a plurality of target text sequence feature vectors;
superposing the plurality of target text sequence feature vectors and the second encoding information to be used as a third input signal;
inputting the third input signal to the multi-head attention unit to generate a third output signal;
superimposing the third output signal with the plurality of text sequence feature vectors as a fourth input signal;
performing normalization operation on the fourth input signal and generating a target fourth input signal;
inputting the target fourth input signal to the multilayer perceptron to generate a fourth output signal;
superposing the fourth output signal and the fourth input signal to serve as an initial target text characteristic vector;
and splicing the initial target text feature vector and the text label feature vector, and generating a target text feature vector.
Specifically, the embodiment of the present invention may first encode the text position scalar in order to generate second encoding information for the text position scalar, then perform a normalization operation on the plurality of text sequence feature vectors to generate a plurality of target text sequence feature vectors, then superimpose the plurality of target text sequence feature vectors and the second encoding information as a third input signal, input the third input signal to the multi-head attention unit to generate a third output signal, and superimpose the third output signal and the plurality of text sequence feature vectors as a fourth input signal; normalize the fourth input signal to generate a target fourth input signal, and input the target fourth input signal to the multi-layer perceptron to generate a fourth output signal; and superimpose the fourth output signal and the fourth input signal that has not been normalized to serve as an initial target text feature vector, which is then spliced with the text label feature vector to generate the target text feature vector.
For example, a plurality of text sequence feature vectors and a text position scalar corresponding to each text word segmentation can be extracted by using Bert. According to the relative position of each word segmentation in the text sequence (the 1st word, the 2nd word, and so on), the position code corresponding to each word segmentation can be obtained. Since a text description may be a sequence consisting of one or more word segmentations (for example, "a man wearing red clothes" is composed of the word segmentation sequence "a", "wearing", "red", "clothes", "man"), the text features of the plurality of word segmentations form a plurality of text sequence feature vectors, and the position codes of the plurality of word segmentations form the second encoding information. The plurality of text sequence feature vectors subjected to the normalization operation are then superimposed with the second encoding information as a third input signal, and the third input signal is input to the multi-head attention unit to generate a third output signal; the third output signal is superimposed with the plurality of text sequence feature vectors as a fourth input signal; the fourth input signal subjected to the normalization operation is input to the multi-layer perceptron to generate a fourth output signal, and the fourth output signal is superimposed with the fourth input signal that has not been normalized to serve as the initial target text feature vector.
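Continuing the sketch above, the text branch could reuse the same fusion block and then splice the result with the text label feature vector. The 768-dimensional BERT feature size, the mean pooling step, and all names are assumptions introduced only for illustration.

```python
import torch

# Hypothetical reuse of the ResidualAttentionFusion sketch above for the text branch.
text_fusion = ResidualAttentionFusion(d_model=768)               # 768 = BERT hidden size (assumption)

def build_target_text_feature(text_seq_feats, text_positions, text_label_feat):
    # text_seq_feats: (B, S_t, 768) text sequence feature vectors, one per word segmentation
    # text_positions: (B, S_t) text position scalars; text_label_feat: (B, 768)
    initial = text_fusion(text_seq_feats, text_positions)         # initial target text feature vector
    pooled = initial.mean(dim=1)                                   # pooling choice is an assumption
    return torch.cat([pooled, text_label_feat], dim=-1)            # splice with the text label feature vector
```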
According to the embodiment of the invention, the residual attention module is adopted to perform time-domain fusion on the plurality of video sequence feature vectors and the plurality of text sequence feature vectors, which improves the feature expression capability of the cross-modal video retrieval model; the video sequences are position-encoded in order in combination with the shot detection result, which enhances the sequence expression of the cross-modal video retrieval model in the learning process and introduces context information, so that the retrieval efficiency is further improved.
In an optional embodiment of the present invention, the step of mapping the target visual feature vector and the target text feature vector to the hidden vector space, and calculating the implicit feature vector similarity for the target visual feature vector and the target text feature vector comprises:
respectively mapping the target visual feature vector and the target text feature vector to the hidden vector space to generate a hidden visual vector and a hidden text vector;
determining a first vector distance of the implied visual vector and the implied text vector;
and calculating the similarity of the implicit characteristic vectors aiming at the target visual characteristic vector and the target text characteristic vector by adopting the first vector distance.
The cross-modality video retrieval model comprises a multi-layer fully-connected neural network having corresponding network parameters, and before the step of determining the first vector distance of the implied visual vector and the implied text vector, the method further comprises:
generating a first target loss function using the implied visual vector and the implied text vector; the first target loss function comprises a first loss function value;
reducing the first loss function value by controlling the network parameter.
In practical applications, the embodiment of the present invention may integrate a multi-layer fully-connected neural network in the cross-modal video retrieval model. In a specific implementation, the multi-layer fully-connected neural network is a cascade of several fully-connected (FC) layers; an FC layer is a typical structure in a neural network and a method of feature dimension transformation. The multi-layer fully-connected neural network may be configured to map the target visual feature vector and the target text feature vector to the hidden vector space respectively, so as to obtain the implied visual vector and the implied text vector in the hidden vector space, and to reduce the hidden-layer feature difference between the implied visual vector and the implied text vector during training. Optionally, the embodiment of the present invention may further generate a first target loss function using the implied visual vector and the implied text vector, the first target loss function may include a first loss function value, and the network parameters of the multi-layer fully-connected neural network may be adjusted by gradient descent to reduce the first loss function value, thereby reducing the feature difference and further reducing the error of cross-modal video retrieval.
After the hidden layer feature difference value between the hidden visual vector and the hidden text vector is minimized, a first vector distance between the hidden visual vector and the hidden text vector can be determined, and the similarity of the hidden feature vector is calculated by using the first vector distance, wherein optionally, the first vector distance can be a Euclidean distance or a cosine distance.
For example, a multi-layer fully-connected neural network (for example, 2 layers) is adopted to map the target visual feature vector and the target text feature vector to the hidden vector space; after the hidden-layer feature difference between the hidden visual vector and the hidden text vector is minimized, the implicit feature vector similarity sim_e can be obtained during model inference by calculating the cosine similarity between the hidden visual vector and the hidden text vector.
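A sketch of this feature consistency part, under stated assumptions: a two-layer fully connected projector into the hidden vector space, a Euclidean-distance form for the first target loss (the text above only fixes that the loss uses the two hidden vectors, so this choice is an assumption), and cosine similarity for sim_e at inference. Dimensions and names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerProjector(nn.Module):
    """Two fully connected layers mapping a feature vector into a shared space (sizes assumed)."""
    def __init__(self, in_dim, hidden_dim=1024, out_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, out_dim))

    def forward(self, x):
        return self.net(x)

visual_to_hidden = TwoLayerProjector(in_dim=512)    # maps the target visual feature vector
text_to_hidden = TwoLayerProjector(in_dim=1536)     # maps the (spliced) target text feature vector

def feature_consistency(target_visual_feat, target_text_feat):
    hidden_visual = visual_to_hidden(target_visual_feat)                 # implied visual vector
    hidden_text = text_to_hidden(target_text_feat)                       # implied text vector
    first_loss = F.pairwise_distance(hidden_visual, hidden_text).mean()  # first loss value (assumed form)
    sim_e = F.cosine_similarity(hidden_visual, hidden_text, dim=-1)      # implicit feature vector similarity
    return first_loss, sim_e
```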
In an optional embodiment of the present invention, the mapping the target visual feature vector and the target text feature vector to the tag vector space, and calculating the similarity of the tag feature vectors for the target visual feature vector and the target text feature vector comprises:
respectively mapping the target visual characteristic vector and the target text characteristic vector to the label vector space to generate a label visual vector and a label text vector;
determining a second vector distance between the tag visual vector and the tag text vector;
and calculating the similarity of the label feature vectors aiming at the target visual feature vector and the target text feature vector by adopting the second vector distance.
Optionally, the cross-modality video retrieval model includes a multi-layer fully-connected neural network having corresponding network parameters, and further includes, before the step of determining a second vector distance between the tag visual vector and the tag text vector:
generating a second target loss function by adopting the label visual vector and the label text vector;
generating a third target loss function by adopting the label visual vector and the text label characteristic vector;
generating a fourth target loss function by adopting the label text vector and the text label characteristic vector;
reducing the second loss function value, the third loss function value and the fourth loss function value by controlling the network parameters.
The label text vector, which is different from the text label feature vector, can be the feature vector obtained after the text feature is mapped to the label space through the multi-layer fully-connected network structure; the label visual vector can be the feature vector obtained after the visual feature is mapped to the label space through the multi-layer fully-connected network structure.
In practical applications, the embodiment of the invention can integrate a multi-layer fully-connected neural network in the cross-modal video retrieval model. The multi-layer fully-connected neural network can be used to map the target visual feature vector and the target text feature vector to the label vector space respectively, so that the label visual vector and the label text vector are obtained in the label vector space, and the pairwise differences among the label visual vector, the label text vector and the text label feature vector are reduced during training. Optionally, the embodiment of the invention can generate a second target loss function using the label visual vector and the label text vector, a third target loss function using the label visual vector and the text label feature vector, and a fourth target loss function using the label text vector and the text label feature vector, and adjust the network parameters of the multi-layer fully-connected neural network by gradient descent to reduce the second loss function value, the third loss function value and the fourth loss function value, thereby reducing the feature differences and further reducing the error of cross-modal video retrieval.
After minimizing the difference value between the tag visual vector, the tag text vector and the text tag feature vector, a second vector distance between the tag visual vector and the tag text vector may be determined, and the tag feature vector similarity may be calculated by using the second vector distance, and optionally, the second vector distance may be a euclidean distance or a cosine distance.
For example, a multi-layer fully-connected neural network (for example, 2 layers) is adopted to map the target visual feature vector and the target text feature vector to the label vector space; after the pairwise differences among the label visual vector, the label text vector and the text label feature vector are minimized, the label feature vector similarity sim_t can be obtained during model inference by calculating the cosine similarity between the label visual vector and the label text vector.
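Analogously, a sketch of the label consistency part, reusing the TwoLayerProjector above: projectors into the label vector space and three pairwise losses pulling the label visual vector, the label text vector and the text label feature vector together. The mean-squared-error form of the losses and the label-space dimension are assumptions; the text above only states which vector pairs each loss uses.

```python
import torch.nn.functional as F

label_dim = 768                                      # assumed to match the text label feature vector
visual_to_label = TwoLayerProjector(in_dim=512, out_dim=label_dim)
text_to_label = TwoLayerProjector(in_dim=1536, out_dim=label_dim)

def label_consistency(target_visual_feat, target_text_feat, text_label_feat):
    tag_visual = visual_to_label(target_visual_feat)              # label visual vector
    tag_text = text_to_label(target_text_feat)                    # label text vector
    second_loss = F.mse_loss(tag_visual, tag_text)                # second target loss (assumed MSE)
    third_loss = F.mse_loss(tag_visual, text_label_feat)          # third target loss (assumed MSE)
    fourth_loss = F.mse_loss(tag_text, text_label_feat)           # fourth target loss (assumed MSE)
    sim_t = F.cosine_similarity(tag_visual, tag_text, dim=-1)     # label feature vector similarity
    return second_loss + third_loss + fourth_loss, sim_t
```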
In an optional embodiment of the present invention, the step of determining a search result based on the text label feature vector, the implicit feature vector similarity, and the label feature vector similarity further includes:
determining word segmentation similarity between the target text feature vector and the text label feature vector;
weighting the word segmentation similarity and calculating a weight coefficient;
and calculating a retrieval result by adopting the weight coefficient, the similarity of the implicit characteristic vector and the similarity of the label characteristic vector.
In a specific implementation, the embodiment of the present invention may calculate the word segmentation similarity between the target text feature vector and the text label feature vector through the Euclidean distance or the cosine distance between them, weight the word segmentation similarity to obtain a weight coefficient, and then calculate the retrieval result using the weight coefficient, the implicit feature vector similarity and the label feature vector similarity. For example, with sim_t denoting the label feature vector similarity and sim_e denoting the implicit feature vector similarity, the normalized word segmentation similarity may be clamped to the interval [min_w, max_w] and used as the weight w_t for feature retrieval, so that the retrieval result is sim = w_t * sim_t + (1 - w_t) * sim_e. Of course, those skilled in the art may use other values as the weight coefficient w_t, for example the typical value 0.5, or may calculate the retrieval result from the weight coefficient, the implicit feature vector similarity and the label feature vector similarity through other algorithms, and the embodiment of the present invention is not limited thereto.
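The weighting above can be written compactly as follows; min_w and max_w are not specified in the text, so the bounds below are placeholders.

```python
import torch

def retrieval_score(sim_t, sim_e, seg_similarity, min_w=0.2, max_w=0.8):
    """sim = w_t * sim_t + (1 - w_t) * sim_e, with w_t obtained by clamping the
    normalized word segmentation similarity to [min_w, max_w] (bounds are illustrative)."""
    w_t = torch.clamp(seg_similarity, min_w, max_w)
    return w_t * sim_t + (1.0 - w_t) * sim_e
```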
In an optional embodiment of the present invention, the cross-modal video retrieval model comprises a visual base network model for obtaining the plurality of video sequence feature vectors, a multilingual text model for obtaining the plurality of text sequence feature vectors, a visual feature sequence fusion module, a text feature sequence fusion module, a feature consistency learning module and a tag consistency learning module; the cross-modal video retrieval model has a parameter adjustment stage for the visual feature sequence fusion module, the text feature sequence fusion module, the feature consistency learning module and the tag consistency learning module, and comprises a control switch for the multilingual text model and the visual base network model, and the method further comprises:
and when the cross-modal video retrieval model is in the parameter adjustment stage, closing the control switch.
In practical applications, the cross-modal video retrieval model has a corresponding parameter adjustment stage, for example the learning stage of the visual feature sequence fusion module, the text feature sequence fusion module, the feature consistency learning module and the tag consistency learning module. Because the multilingual text model and the visual base network model would add an unnecessary amount of data computation in this stage, the control switch for the multilingual text model and the visual base network model can be closed in the parameter adjustment stage of the cross-modal video retrieval model, so that the multilingual text model and the visual base network model do not participate in training, which improves the generalization ability of the model on a small-sample training data set.
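A sketch of how "closing the control switch" could be realized in training code: the pretrained text and visual backbones are frozen so that only the fusion and consistency-learning modules receive gradient updates. The module handles, optimizer and learning rate are illustrative and not prescribed by this embodiment.

```python
import torch

# multilingual_text_model, visual_base_network and retrieval_model are illustrative handles
# to the pretrained backbones and to the full cross-modal retrieval model.
for backbone in (multilingual_text_model, visual_base_network):
    backbone.eval()                      # keep normalization/dropout behaviour fixed
    for p in backbone.parameters():
        p.requires_grad_(False)          # exclude the backbone from the parameter adjustment stage

optimizer = torch.optim.Adam(
    (p for p in retrieval_model.parameters() if p.requires_grad), lr=1e-4)
```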
It should be emphasized that, for the training data of the cross-modal video retrieval model in the embodiment of the present invention, each video is required to have a corresponding text sequence; after model training is completed, that is, when the trained cross-modal video retrieval model is used for cross-modal retrieval, a video may have only video stream information or only a text sequence, and does not need to have both. For example, after model training is completed, N videos (for example, N = 100,000) may form a video database, with no restriction on whether each video in the database has a text sequence; for a query text, for example "a person walking under a blue sky with white clouds", the cross-modal retrieval model searches the video database for the topK videos that best match the description and returns them, thereby completing the cross-modal retrieval.
Of course, those skilled in the art can set the database size (the value of N) and the number of returned results (the topK value) according to actual conditions, and the embodiment of the present invention is not limited thereto.
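For the inference scenario just described, a usage sketch: encode the query text once, score every database video, and return the topK highest-scoring videos. encode_video, encode_text and seg_sim are hypothetical helpers standing in for the trained model's branches, and retrieval_score is the sketch given earlier.

```python
import torch
import torch.nn.functional as F

query_hidden, query_label, seg_sim = encode_text("a person walking under a blue sky with white clouds")

scores = []
for video in video_database:                        # e.g. N = 100,000 videos, text sequences optional
    vid_hidden, vid_label = encode_video(video)     # only the video stream is needed at retrieval time
    sim_e = F.cosine_similarity(vid_hidden, query_hidden, dim=0)
    sim_t = F.cosine_similarity(vid_label, query_label, dim=0)
    scores.append(retrieval_score(sim_t, sim_e, seg_sim))

top_scores, top_indices = torch.topk(torch.stack(scores), k=10)   # topK returned videos
```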
In order to make the embodiments of the present invention better understood by those skilled in the art, the following description is given by way of a full example.
As shown in fig. 2, fig. 2 is a schematic diagram of a cross-modal video retrieval model provided in an embodiment of the present invention, and the cross-modal video retrieval model 200 may include: a video sequence feature extraction module 201, a text sequence feature extraction module 202, a visual feature sequence fusion module 203, a text feature sequence fusion module 204, a label consistency learning module 205 and a feature consistency learning module 206;
the video sequence feature extraction module 201 may be configured to include a video shot detection module and a typical deep learning network such as ViT or Resnet, and may be configured to perform shot detection and key frame extraction on a video file, and specifically, may detect a video stream using the video shot detection module, determine whether a shot changes, if so, segment the video stream to generate a target segmented video, determine a video sequence position scalar for the target segmented video, determine a key frame in the video stream based on the target segmented video, and extract a plurality of video sequence feature vectors of a key frame picture corresponding to the key frame using a visual basic network model.
The text sequence feature extraction module 202 may be formed by a multilingual Bert model. A text sequence feature vector may be obtained by inputting a text sequence to the multilingual Bert model and outputting the word segmentation sequence of the text sequence together with a feature sequence vector expressing the text feature of each word segmentation. For example, the multilingual Bert model may process the whole word segmentation sequence, and the output at the i-th position of the sequence features output by the language model (with sequence length S_t, taken from a shallower, that is, non-last, layer of Bert) is used as feat_t_i; alternatively, the word segmentations may be processed by Bert one by one, and the feature vector output by the language model for the i-th word segmentation is used as feat_t_i. Taking Chinese and English as examples, a Chinese word segmentation result can be a single character, and an English word segmentation result can be a single word.
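Correspondingly, a sketch of extracting per-word-segmentation features with a multilingual BERT from the Hugging Face transformers library; the checkpoint name and the choice of the second-to-last hidden layer as the "shallower" output are assumptions made only for illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = AutoModel.from_pretrained("bert-base-multilingual-cased", output_hidden_states=True)
bert.eval()

text = "a man wearing red clothes"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = bert(**inputs)

# Per-token features feat_t_i from a non-last layer (here the second-to-last), sequence length S_t.
feat_t = outputs.hidden_states[-2].squeeze(0)                             # (S_t, 768)
segmentation = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])    # word segmentation sequence
```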
The visual feature sequence fusion module 203 may be formed by one or more residual attention modules, and each residual attention module may further include a multi-head attention unit and a multi-layer perceptron (MLP). Specifically, the module performs increasing-order encoding on the video sequence position scalar to generate video sequence position encoding information; superimposes the normalized video sequence feature vectors and the video sequence position encoding information as a first input signal; inputs the first input signal to the multi-head attention unit to generate a first output signal; superimposes the first output signal and the video sequence feature vectors as a second input signal; inputs the normalized second input signal to the multi-layer perceptron to generate a second output signal; and superimposes the second output signal and the second input signal that has not been normalized to serve as the target visual feature vector.
The text feature sequence fusion module 204 may be formed by one or more residual attention modules, and each residual attention module may further include a multi-head attention unit and a multi-layer perceptron (MLP). Specifically, the module performs increasing-order encoding on the text position scalar to generate text sequence position encoding information; superimposes the normalized text sequence feature vectors and the text sequence position encoding information as a third input signal; inputs the third input signal to the multi-head attention unit to generate a third output signal; superimposes the third output signal and the text sequence feature vectors as a fourth input signal; inputs the normalized fourth input signal to the multi-layer perceptron to generate a fourth output signal; and superimposes the fourth output signal and the fourth input signal that has not been normalized to serve as the target text feature vector.
The tag consistency learning module 205 may be formed by multiple layers of fully connected networks, and may be configured to respectively map the target visual feature vector and the target text feature vector to the same tag vector space by using the multiple layers of fully connected networks (e.g., 2 layers), obtain a tag visual vector and a tag text vector, and minimize a difference value between the tag visual vector, the tag text vector, and the text tag feature vector in a training process;
the feature consistency learning module 206 may be formed by a plurality of layers of fully connected networks, and may be configured to map the target visual feature vector and the target text feature vector to the same implicit vector space by using the plurality of layers of fully connected networks (e.g., 2 layers), obtain an implicit visual vector and an implicit text vector, and minimize an implicit layer feature difference value between the implicit visual vector and the implicit text vector in a training process.
When the cross-modal video retrieval model 200 performs inference, the implicit feature vector similarity sim_e is calculated using the hidden visual vector and the hidden text vector, and the label feature vector similarity sim_t is calculated using the label visual vector and the label text vector. The query text is segmented, the similarity between each word segmentation result and the hit training text label is calculated, and the weighted word segmentation similarities serve as the text label reconstruction consistency; the text label reconstruction consistency is clamped to the interval [min_w, max_w] and used as the weight w_t for feature retrieval, or the typical value 0.5 is used directly as w_t. The cross-modal retrieval result is then sim = w_t * sim_t + (1 - w_t) * sim_e. In the learning process of the visual feature sequence fusion module 203, the text feature sequence fusion module 204, the label consistency learning module 205 and the feature consistency learning module 206 of the cross-modal video retrieval model 200, the parameters of the video sequence feature extraction module 201 and the text sequence feature extraction module 202 do not need to participate in training, which reduces the amount of training parameters and improves the generalization ability of the model on a small-sample training data set.
It should be noted that, for simplicity of description, the method embodiments are described as a series or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the described order of acts, as some steps may be performed in other orders or concurrently according to the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and the acts involved are not necessarily required by the present invention.
Referring to fig. 3, a block diagram of a structure of a training apparatus for a cross-modality video retrieval model provided in an embodiment of the present invention is shown, which may specifically include the following modules:
a target segmented video generating module 301, configured to obtain and segment a video stream to generate a plurality of target segmented videos;
a video sequence position scalar obtaining module 302 for obtaining a video sequence position scalar for expressing video sequence positions of the plurality of target segmented videos;
a video sequence feature vector obtaining module 303, configured to obtain a plurality of video sequence feature vectors used for expressing features of a video sequence in a video stream; the video sequence has a corresponding text sequence;
a text sequence segmentation module 304, configured to segment the text sequence to generate a word segmentation and a word segmentation sequence composed of the word segmentation;
a text sequence feature vector extraction module 305 for extracting a plurality of text sequence feature vectors for the participles, and text label feature vectors for the participle sequences, and a text position scalar for expressing text positions;
a video sequence feature vector merging module 306, configured to combine the plurality of video sequence feature vectors based on the video sequence position scalar, and generate a target visual feature vector;
a text sequence feature vector merging module 307, configured to combine the text sequence feature vectors based on the text position scalar and generate a target text feature vector;
a hidden vector mapping module 308, configured to map the target visual feature vector and the target text feature vector to the hidden vector space, and calculate an implicit feature vector similarity for the target visual feature vector and the target text feature vector;
a tag vector mapping module 309, configured to map the target visual feature vector and the target text feature vector to the tag vector space, and calculate a tag feature vector similarity for the target visual feature vector and the target text feature vector;
a retrieval result determining module 310, configured to determine a retrieval result based on the text label feature vector, the implicit feature vector similarity, and the label feature vector similarity.
Optionally, the cross-modality video retrieval model includes a residual attention module, and the video sequence feature vector merging module may include:
and the video sequence feature vector merging submodule is used for performing time domain fusion on the plurality of video sequence feature vectors through the residual attention module based on the video sequence position scalar and generating a target visual feature vector.
Optionally, the residual attention module includes a multi-head attention unit and a multi-layer perceptron, and the video sequence feature vector merging sub-module may include:
a first coding information generation unit for coding the video sequence position scalars in sequence and generating first coding information for the video sequence position scalars;
the target video sequence feature vector generating unit is used for carrying out normalization operation on the plurality of video sequence feature vectors and generating a plurality of target video sequence feature vectors;
a first input signal generating unit configured to superimpose the plurality of target video sequence feature vectors and the first encoding information as a first input signal;
a first output signal generating unit for inputting the first input signal to the multi-head attention unit and generating a first output signal;
a second input signal generation unit for superimposing the first output signal and the plurality of video sequence feature vectors as a second input signal;
a target second input signal generation unit configured to perform a normalization operation on the second input signal and generate a target second input signal;
a second output signal generation unit configured to input the target second input signal to the multilayer perceptron and generate a second output signal;
and the target visual feature vector generating unit is used for superposing the second output signal and the second input signal to be used as a target visual feature vector.
Optionally, the cross-modality video retrieval model includes a residual attention module, and the text sequence feature vector merging module may include:
and the text sequence feature vector merging submodule is used for performing time domain fusion on the plurality of text sequence feature vectors through the residual attention module based on the text position scalar and generating a target text feature vector.
Optionally, the residual attention module includes a multi-head attention unit and a multi-layer perceptron, and the text sequence feature vector merging sub-module may include:
a second encoding information generation unit configured to encode the text position scalars in order and generate second encoding information for the text position scalars;
the target text sequence feature vector generating unit is used for carrying out normalization operation on the plurality of text sequence feature vectors and generating a plurality of target text sequence feature vectors;
a third input signal generating unit, configured to superimpose the plurality of target text sequence feature vectors and the second encoding information as a third input signal;
a third output signal generating unit, configured to input the third input signal to the multi-head attention unit, and generate a third output signal;
a fourth input signal generation unit, configured to superimpose the third output signal and the plurality of text sequence feature vectors as a fourth input signal;
a target fourth input signal generation unit, configured to perform a normalization operation on the fourth input signal and generate a target fourth input signal;
a fourth output signal output unit, configured to input the target fourth input signal to the multilayer perceptron, and generate a fourth output signal;
an initial target text feature vector generating unit, configured to superimpose the fourth output signal and the fourth input signal as an initial target text feature vector;
and the target text feature vector generating unit is used for splicing the initial target text feature vector and the text label feature vector and generating a target text feature vector.
Optionally, the hidden vector mapping module may include:
a hidden vector mapping submodule for mapping the target visual feature vector and the target text feature vector to the hidden vector space respectively to generate a hidden visual vector and a hidden text vector;
a first vector distance determination sub-module to determine a first vector distance of the implied visual vector and the implied text vector;
and the implicit characteristic vector similarity operator module is used for calculating the implicit characteristic vector similarity aiming at the target visual characteristic vector and the target text characteristic vector by adopting the first vector distance.
Optionally, the cross-modality video retrieval model includes a multilayer fully-connected neural network having corresponding network parameters, and further includes:
a first target loss function generation submodule for generating a first target loss function using the implicit visual vector and the implicit text vector; the first target loss function comprises a first loss function value;
and the first network parameter control submodule is used for reducing the first loss function value by controlling the network parameters.
Optionally, the tag vector mapping module may include:
the label vector mapping submodule is used for respectively mapping the target visual feature vector and the target text feature vector to the label vector space to generate a label visual vector and a label text vector;
a second vector distance determination submodule for determining a second vector distance between the tag visual vector and the tag text vector;
and the label characteristic vector similarity calculation operator module is used for calculating the label characteristic vector similarity aiming at the target visual characteristic vector and the target text characteristic vector by adopting the second vector distance.
Optionally, the cross-modality video retrieval model includes a multilayer fully-connected neural network having corresponding network parameters, and further includes:
a second target loss function generation submodule, configured to generate a second target loss function by using the tag visual vector and the tag text vector;
a third target loss function generation submodule, configured to generate a third target loss function by using the tag visual vector and the text tag feature vector;
a fourth target loss function generation submodule, configured to generate a fourth target loss function by using the tag text vector and the text tag feature vector;
a second network parameter control sub-module for reducing the second loss function value, the third loss function value and the fourth loss function value by controlling the network parameters.
Optionally, the retrieval result determining module may further include:
the word segmentation similarity determining submodule is used for determining word segmentation similarity between the target text feature vector and the text label feature vector;
the weight coefficient calculation submodule is used for weighting the word segmentation similarity and calculating a weight coefficient;
and the retrieval result calculation submodule is used for calculating the retrieval result by adopting the weight coefficient, the similarity of the implicit characteristic vector and the similarity of the label characteristic vector.
Optionally, the target segmented video generating module may include:
a shot change judging submodule for judging whether a shot change occurs in the video stream; if yes, calling the target segmented video generation submodule;
and the target segmented video generation submodule is used for segmenting the video stream to generate a plurality of target segmented videos when the shot changes in the video stream.
Optionally, the video sequence feature vector obtaining module includes:
a target frame determination sub-module for determining a plurality of target frames in the video stream based on the plurality of target segmented videos;
and the video sequence feature vector acquisition sub-module is used for extracting the video sequence feature vectors of the target frame pictures corresponding to the plurality of target frames as a plurality of video sequence feature vectors.
Optionally, the cross-modal video retrieval model includes a visual base network model for obtaining the plurality of video sequence feature vectors, a multi-language text model for obtaining the plurality of text sequence feature vectors, a visual feature sequence fusion module, a text feature sequence fusion module, a feature consistency learning module and a tag consistency learning module; the cross-modal video retrieval model has a parameter adjustment stage for the visual feature sequence fusion module, the text feature sequence fusion module, the feature consistency learning module and the tag consistency learning module, and includes a control switch for the multi-language text model and the visual base network model, and the apparatus further includes:
and the control switch closing module is used for closing the control switch when the cross-modal video retrieval model is in the parameter adjusting stage.
In addition, an electronic device is further provided in the embodiments of the present invention, as shown in fig. 4, and includes a processor 401, a communication interface 402, a memory 403, and a communication bus 404, where the processor 401, the communication interface 402, and the memory 403 complete mutual communication through the communication bus 404,
a memory 403 for storing a computer program;
the processor 401, when executing the program stored in the memory 403, implements the following steps:
acquiring and segmenting a video stream to generate a plurality of target segmented videos;
acquiring a video sequence position scalar used for expressing the video sequence positions of the target segmented videos;
acquiring a plurality of video sequence feature vectors used for expressing the features of a video sequence in a video stream; the video sequence has a corresponding text sequence;
segmenting the text sequence to generate participles and a participle sequence consisting of the participles;
extracting a plurality of text sequence feature vectors for the participles, text label feature vectors for the participle sequences, and a text position scalar for expressing text positions;
combining the plurality of video sequence feature vectors based on the video sequence position scalar and generating a target visual feature vector;
combining the plurality of text sequence feature vectors based on the text position scalar and generating a target text feature vector;
mapping the target visual feature vector and the target text feature vector to the hidden vector space, and calculating the similarity of implicit feature vectors aiming at the target visual feature vector and the target text feature vector;
mapping the target visual feature vector and the target text feature vector to the tag vector space, and calculating tag feature vector similarity for the target visual feature vector and the target text feature vector;
and determining a retrieval result based on the text label feature vector, the implicit feature vector similarity and the label feature vector similarity.
Optionally, the cross-modality video retrieval model includes a residual attention module, and the step of combining the plurality of video sequence feature vectors based on the video sequence position scalar and generating a target visual feature vector may include:
performing time-domain fusion on the plurality of video sequence feature vectors through the residual attention module based on the video sequence position scalar, and generating a target visual feature vector.
Optionally, the residual attention module includes a multi-head attention unit and a multi-layer perceptron, and the step of performing temporal fusion on the plurality of video sequence feature vectors by the residual attention module based on the video sequence position scalar and generating a target visual feature vector may include:
encoding the video sequence position scalar in order, and generating first encoding information for the video sequence position scalar;
normalizing the plurality of video sequence feature vectors and generating a plurality of target video sequence feature vectors;
superimposing the plurality of target video sequence feature vectors with the first encoding information as a first input signal;
inputting the first input signal to the multi-head attention unit to generate a first output signal;
superimposing the first output signal with the plurality of video sequence feature vectors as a second input signal;
normalizing the second input signal and generating a target second input signal;
inputting the target second input signal to the multilayer perceptron to generate a second output signal;
and superposing the second output signal and the second input signal as a target visual feature vector.
Optionally, the cross-modality video retrieval model includes a residual attention module, and the step of combining the plurality of text sequence feature vectors based on the text position scalar and generating a target text feature vector may include:
performing time domain fusion on the plurality of text sequence feature vectors through the residual attention module based on the text position scalar, and generating a target text feature vector.
Optionally, the residual attention module includes a multi-head attention unit and a multi-layer perceptron, and the step of performing time-domain fusion on the plurality of text sequence feature vectors through the residual attention module based on the text position scalar and generating a target text feature vector may include:
sequentially encoding the text position scalar to generate second encoding information aiming at the text position scalar;
normalizing the text sequence feature vectors and generating a plurality of target text sequence feature vectors;
superposing the plurality of target text sequence feature vectors and the second encoding information to be used as a third input signal;
inputting the third input signal to the multi-head attention unit to generate a third output signal;
superimposing the third output signal with the plurality of text sequence feature vectors as a fourth input signal;
performing normalization operation on the fourth input signal and generating a target fourth input signal;
inputting the target fourth input signal to the multilayer perceptron to generate a fourth output signal;
superposing the fourth output signal and the fourth input signal to serve as an initial target text characteristic vector;
and splicing the initial target text feature vector and the text label feature vector, and generating a target text feature vector.
Optionally, the step of mapping the target visual feature vector and the target text feature vector to the hidden vector space and calculating the implicit feature vector similarity for the target visual feature vector and the target text feature vector may include:
respectively mapping the target visual feature vector and the target text feature vector to the hidden vector space to generate a hidden visual vector and a hidden text vector;
determining a first vector distance of the implied visual vector and the implied text vector;
and calculating the similarity of the implicit characteristic vectors aiming at the target visual characteristic vector and the target text characteristic vector by adopting the first vector distance.
Optionally, the cross-modality video retrieval model includes a multi-layer fully-connected neural network having corresponding network parameters, and before the step of determining the first vector distance of the implied visual vector and the implied text vector, may further include:
generating a first target loss function using the implied visual vector and the implied text vector; the first target loss function comprises a first loss function value;
reducing the first loss function value by controlling the network parameter.
Optionally, the step of mapping the target visual feature vector and the target text feature vector to the tag vector space and calculating the similarity of the tag feature vectors for the target visual feature vector and the target text feature vector may include:
respectively mapping the target visual characteristic vector and the target text characteristic vector to the label vector space to generate a label visual vector and a label text vector;
determining a second vector distance between the tag visual vector and the tag text vector;
and calculating the similarity of the label feature vectors aiming at the target visual feature vector and the target text feature vector by adopting the second vector distance.
Optionally, the cross-modality video retrieval model includes a multi-layer fully-connected neural network having corresponding network parameters, and before the step of determining the second vector distance between the tag visual vector and the tag text vector, may further include:
generating a second target loss function by adopting the label visual vector and the label text vector;
generating a third target loss function by adopting the label visual vector and the text label characteristic vector;
generating a fourth target loss function by adopting the label text vector and the text label characteristic vector;
reducing the second loss function value, the third loss function value and the fourth loss function value by controlling the network parameters.
Optionally, the step of determining a retrieval result based on the text label feature vector, the implicit feature vector similarity, and the label feature vector similarity may further include:
determining word segmentation similarity between the target text feature vector and the text label feature vector;
weighting the word segmentation similarity and calculating a weight coefficient;
and calculating a retrieval result by adopting the weight coefficient, the similarity of the implicit characteristic vector and the similarity of the label characteristic vector.
Optionally, the step of acquiring and segmenting a video stream to generate a plurality of target segmented videos may include:
judging whether shot changes occur in the video stream;
if yes, when shot changes appear in the video stream, the video stream is segmented, and a plurality of target segmented videos are generated.
Optionally, the step of obtaining a plurality of video sequence feature vectors for expressing features of the video sequence in the video stream may include:
determining a plurality of target frames in a video stream based on the plurality of target segmented videos;
and extracting video sequence feature vectors of target frame pictures corresponding to the plurality of target frames as a plurality of video sequence feature vectors.
Optionally, the cross-modal video retrieval model includes a visual base network model for obtaining the plurality of video sequence feature vectors, a multilingual text model for obtaining the plurality of text sequence feature vectors, a visual feature sequence fusion module, a text feature sequence fusion module, a feature consistency learning module and a tag consistency learning module; the cross-modal video retrieval model has a parameter adjustment stage for the visual feature sequence fusion module, the text feature sequence fusion module, the feature consistency learning module and the tag consistency learning module, and includes a control switch for the multilingual text model and the visual base network model, and the method may further include:
and when the cross-modal video retrieval model is in the parameter adjustment stage, closing the control switch.
The communication bus mentioned in the above terminal may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean only one bus or one type of bus.
The communication interface is used for communication between the terminal and other equipment.
The memory may include a Random Access Memory (RAM) or a non-volatile memory, such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In yet another embodiment provided by the present invention, a computer-readable storage medium is further provided, which has instructions stored therein, which when run on a computer, cause the computer to perform the training method for a cross-modality video retrieval model described in any one of the above embodiments.
In a further embodiment provided by the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the training method for a cross-modality video retrieval model described in any of the above embodiments.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions described in accordance with the embodiments of the invention may be generated, in whole or in part, when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wire (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that includes one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and similar parts between the embodiments may be referred to each other, and each embodiment focuses on differences from other embodiments. In particular, as for the system embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and reference may be made to the partial description of the method embodiment for relevant points.
The above description is only a preferred embodiment of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (16)

1. A training method for a cross-modal video retrieval model, wherein the cross-modal video retrieval model comprises a hidden vector space and a label vector space, and the method comprises the following steps:
acquiring and segmenting a video stream to generate a plurality of target segmented videos;
obtaining a video sequence position scalar for expressing video sequence positions of the plurality of target segmented videos;
acquiring a plurality of video sequence feature vectors used for expressing the features of a video sequence in a video stream; the video sequence has a corresponding text sequence;
segmenting the text sequence to generate word segments and a word segment sequence consisting of the word segments;
extracting a plurality of text sequence feature vectors for the word segments, a text label feature vector for the word segment sequence, and a text position scalar for expressing text positions;
merging the plurality of video sequence feature vectors based on the video sequence position scalar, and generating a target visual feature vector;
merging the plurality of text sequence feature vectors based on the text position scalar, and generating a target text feature vector;
mapping the target visual feature vector and the target text feature vector to the hidden vector space, and calculating an implicit feature vector similarity for the target visual feature vector and the target text feature vector;
mapping the target visual feature vector and the target text feature vector to the label vector space, and calculating a label feature vector similarity for the target visual feature vector and the target text feature vector;
and determining a retrieval result based on the text label feature vector, the implicit feature vector similarity and the label feature vector similarity.
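For orientation, the sketch below (PyTorch) shows one plausible way the two projection spaces of claim 1 could be wired together so that each video/text pair is scored in both spaces; the layer shapes, the cosine-similarity scoring, and the module name DualSpaceHead are illustrative assumptions rather than the claimed implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualSpaceHead(nn.Module):
    """Illustrative only: project fused visual/text features into a hidden
    vector space and a label vector space, then score each video/text pair
    by cosine similarity in both spaces."""
    def __init__(self, dim=512, hidden_dim=256, label_dim=256):
        super().__init__()
        self.vis_to_hidden = nn.Linear(dim, hidden_dim)
        self.txt_to_hidden = nn.Linear(dim, hidden_dim)
        self.vis_to_label = nn.Linear(dim, label_dim)
        self.txt_to_label = nn.Linear(dim, label_dim)

    def forward(self, target_visual_vec, target_text_vec):
        # Hidden (implicit) vector space similarity.
        hv = F.normalize(self.vis_to_hidden(target_visual_vec), dim=-1)
        ht = F.normalize(self.txt_to_hidden(target_text_vec), dim=-1)
        implicit_sim = (hv * ht).sum(dim=-1)
        # Label vector space similarity.
        lv = F.normalize(self.vis_to_label(target_visual_vec), dim=-1)
        lt = F.normalize(self.txt_to_label(target_text_vec), dim=-1)
        label_sim = (lv * lt).sum(dim=-1)
        return implicit_sim, label_sim

head = DualSpaceHead()
implicit_sim, label_sim = head(torch.randn(4, 512), torch.randn(4, 512))
```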
2. The method according to claim 1, wherein the cross-modality video retrieval model comprises a residual attention module, and wherein the step of merging the plurality of video sequence feature vectors based on the video sequence position scalar and generating a target visual feature vector comprises:
performing time-domain fusion on the plurality of video sequence feature vectors through the residual attention module based on the video sequence position scalar, and generating a target visual feature vector.
3. The method of claim 2, wherein the residual attention module comprises a multi-head attention unit and a multi-layer perceptron, and wherein the step of temporally fusing the plurality of video sequence feature vectors by the residual attention module based on the video sequence position scalar and generating a target visual feature vector comprises:
sequentially encoding the video sequence position scalar, and generating first encoding information for the video sequence position scalar;
normalizing the plurality of video sequence feature vectors and generating a plurality of target video sequence feature vectors;
superimposing the plurality of target video sequence feature vectors with the first encoding information as a first input signal;
inputting the first input signal to the multi-head attention unit to generate a first output signal;
superimposing the first output signal with the plurality of video sequence feature vectors as a second input signal;
normalizing the second input signal and generating a target second input signal;
inputting the target second input signal to the multilayer perceptron to generate a second output signal;
and superposing the second output signal and the second input signal as a target visual feature vector.
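The pre-norm attention/MLP structure recited in claim 3 maps naturally onto a residual attention block; the sketch below is one possible reading in PyTorch, where the sinusoidal position encoding, the mean-pooling over time, and all hyperparameters are assumptions.

```python
import math
import torch
import torch.nn as nn

class ResidualAttentionBlock(nn.Module):
    """Sketch of the block in claim 3: normalize, add position encoding,
    multi-head attention with a residual connection, then a normalized MLP
    with a second residual connection."""
    def __init__(self, dim=512, heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    @staticmethod
    def position_encoding(positions, dim):
        # Sinusoidal encoding of the sequence-position scalars (an assumption).
        pe = torch.zeros(positions.shape[0], positions.shape[1], dim)
        div = torch.exp(torch.arange(0, dim, 2) * (-math.log(10000.0) / dim))
        pe[..., 0::2] = torch.sin(positions.unsqueeze(-1) * div)
        pe[..., 1::2] = torch.cos(positions.unsqueeze(-1) * div)
        return pe

    def forward(self, seq_feats, positions):
        # First input signal: normalized features plus the first encoding information.
        x = self.norm1(seq_feats) + self.position_encoding(positions, seq_feats.shape[-1])
        attn_out, _ = self.attn(x, x, x)        # first output signal
        y = attn_out + seq_feats                # second input signal
        fused = self.mlp(self.norm2(y)) + y     # second output signal superimposed on the second input
        # Mean-pool over time to get a single target visual feature vector
        # (an assumption; the claim leaves this reduction unspecified).
        return fused.mean(dim=1)

block = ResidualAttentionBlock()
feats = torch.randn(2, 6, 512)                            # six segment-level feature vectors per sample
pos = torch.arange(6, dtype=torch.float32).repeat(2, 1)   # video sequence position scalars
target_visual = block(feats, pos)                         # shape (2, 512)
```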
4. The method of claim 1, wherein the cross-modality video retrieval model comprises a residual attention module, and wherein the step of merging the plurality of text sequence feature vectors based on the text position scalar and generating a target text feature vector comprises:
and performing time domain fusion on the plurality of text sequence feature vectors through the residual attention module based on the text position scalar, and generating a target text feature vector.
5. The method of claim 4, wherein the residual attention module comprises a multi-head attention unit and a multi-layer perceptron, and wherein the step of temporally fusing the plurality of text sequence feature vectors by the residual attention module based on the text position scalar and generating a target text feature vector comprises:
sequentially encoding the text position scalar, and generating second encoding information for the text position scalar;
normalizing the plurality of text sequence feature vectors, and generating a plurality of target text sequence feature vectors;
superposing the plurality of target text sequence feature vectors and the second encoding information as a third input signal;
inputting the third input signal to the multi-head attention unit to generate a third output signal;
superimposing the third output signal with the plurality of text sequence feature vectors as a fourth input signal;
performing normalization operation on the fourth input signal and generating a target fourth input signal;
inputting the target fourth input signal to the multilayer perceptron to generate a fourth output signal;
superposing the fourth output signal and the fourth input signal to serve as an initial target text feature vector;
and splicing the initial target text feature vector and the text label feature vector, and generating a target text feature vector.
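The text branch reuses the same attention/MLP structure as the visual branch; what claim 5 adds is the final splicing step. A minimal illustration of that concatenation, where the linear projection back to the working dimension is an assumption:

```python
import torch
import torch.nn as nn

# Assumed shapes; the linear projection after concatenation is illustrative.
initial_target_text = torch.randn(2, 512)   # output of the text-side residual attention block
text_label_vec = torch.randn(2, 512)        # sentence-level text label feature vector

splice = nn.Linear(512 * 2, 512)
target_text = splice(torch.cat([initial_target_text, text_label_vec], dim=-1))
print(target_text.shape)                    # torch.Size([2, 512])
```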
6. The method of claim 1, wherein the step of mapping the target visual feature vector and the target text feature vector to the hidden vector space and calculating implicit feature vector similarity for the target visual feature vector and the target text feature vector comprises:
respectively mapping the target visual feature vector and the target text feature vector to the hidden vector space to generate a hidden visual vector and a hidden text vector;
determining a first vector distance between the hidden visual vector and the hidden text vector;
and calculating the implicit feature vector similarity for the target visual feature vector and the target text feature vector by using the first vector distance.
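One hedged reading of claim 6 treats the first vector distance as a cosine distance, so the similarity follows directly from it; the snippet also shows the pairwise similarity matrix commonly used for batched retrieval, which goes beyond what the claim states.

```python
import torch
import torch.nn.functional as F

hidden_visual = F.normalize(torch.randn(8, 256), dim=-1)   # hidden visual vectors
hidden_text = F.normalize(torch.randn(8, 256), dim=-1)     # hidden text vectors

first_vector_distance = 1.0 - (hidden_visual * hidden_text).sum(dim=-1)  # cosine distance (assumption)
implicit_sim = 1.0 - first_vector_distance                               # implicit feature vector similarity

# For batched retrieval, a full pairwise similarity matrix is often computed instead.
pairwise_sim = hidden_visual @ hidden_text.t()                           # (8, 8)
```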
7. The method of claim 6, wherein the cross-modality video retrieval model comprises a multi-layer fully-connected neural network having corresponding network parameters, and further comprising, prior to the step of determining the first vector distance between the hidden visual vector and the hidden text vector:
generating a first target loss function using the hidden visual vector and the hidden text vector; the first target loss function comprises a first loss function value;
reducing the first loss function value by controlling the network parameters.
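Claim 7 leaves the form of the first target loss function open; a symmetric InfoNCE-style contrastive loss over matched hidden visual/text vectors is one common choice and is used below purely as an illustrative stand-in.

```python
import torch
import torch.nn.functional as F

def first_target_loss(hidden_visual, hidden_text, temperature=0.07):
    """Illustrative stand-in: a symmetric InfoNCE-style contrastive loss over
    matched video/text pairs in a batch."""
    v = F.normalize(hidden_visual, dim=-1)
    t = F.normalize(hidden_text, dim=-1)
    logits = v @ t.t() / temperature            # pairwise similarities
    labels = torch.arange(v.shape[0])           # the i-th video matches the i-th text
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

hv = torch.randn(8, 256, requires_grad=True)
ht = torch.randn(8, 256, requires_grad=True)
loss = first_target_loss(hv, ht)
loss.backward()   # the optimizer then reduces the first loss function value via the network parameters
```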
8. The method of claim 1, wherein the step of mapping the target visual feature vector and the target text feature vector to the label vector space and calculating the label feature vector similarity for the target visual feature vector and the target text feature vector comprises:
respectively mapping the target visual feature vector and the target text feature vector to the label vector space to generate a label visual vector and a label text vector;
determining a second vector distance between the label visual vector and the label text vector;
and calculating the label feature vector similarity for the target visual feature vector and the target text feature vector by using the second vector distance.
9. The method of claim 8, wherein the cross-modality video retrieval model comprises a multi-layer fully-connected neural network having corresponding network parameters, and further comprising, prior to the step of determining the second vector distance between the label visual vector and the label text vector:
generating a second target loss function using the label visual vector and the label text vector;
generating a third target loss function using the label visual vector and the text label feature vector;
generating a fourth target loss function using the label text vector and the text label feature vector;
and reducing a second loss function value of the second target loss function, a third loss function value of the third target loss function, and a fourth loss function value of the fourth target loss function by controlling the network parameters.
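Claim 9 names three target loss functions over pairs drawn from the label visual vector, the label text vector, and the text label feature vector; the sketch below assumes a simple (1 - cosine similarity) form for each pair, which the claim does not prescribe.

```python
import torch
import torch.nn.functional as F

def pair_loss(a, b):
    # Assumed form: pull each pair together via (1 - cosine similarity).
    return (1.0 - F.cosine_similarity(a, b, dim=-1)).mean()

label_visual = torch.randn(8, 256, requires_grad=True)     # label visual vectors
label_text = torch.randn(8, 256, requires_grad=True)       # label text vectors
text_label_feat = torch.randn(8, 256)                      # text label feature vectors

second_loss = pair_loss(label_visual, label_text)
third_loss = pair_loss(label_visual, text_label_feat)
fourth_loss = pair_loss(label_text, text_label_feat)

total = second_loss + third_loss + fourth_loss
total.backward()   # all three loss values are reduced by updating the network parameters
```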
10. The method of claim 1, wherein the step of determining the retrieval result based on the text label feature vector, the implicit feature vector similarity, and the label feature vector similarity further comprises:
determining a word segment similarity between the target text feature vector and the text label feature vector;
calculating a weight coefficient by weighting the word segment similarity;
and calculating the retrieval result using the weight coefficient, the implicit feature vector similarity, and the label feature vector similarity.
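Claim 10 turns the word segment similarity into a weight that balances the two space similarities; the mapping from similarity to weight is unspecified, so the sigmoid below is an assumption.

```python
import torch
import torch.nn.functional as F

target_text = torch.randn(8, 512)        # target text feature vectors
text_label_feat = torch.randn(8, 512)    # text label feature vectors
implicit_sim = torch.rand(8)             # similarity from the hidden vector space
label_sim = torch.rand(8)                # similarity from the label vector space

word_segment_sim = F.cosine_similarity(target_text, text_label_feat, dim=-1)
weight = torch.sigmoid(word_segment_sim)                       # weight coefficient (sigmoid is an assumption)
retrieval_score = weight * label_sim + (1.0 - weight) * implicit_sim
```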
11. The method of claim 1, wherein the step of obtaining and segmenting the video stream to generate the plurality of target segmented videos comprises:
determining whether a shot change occurs in the video stream;
and if so, segmenting the video stream where the shot change occurs, and generating a plurality of target segmented videos.
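Claim 11 only requires segmenting at shot changes; the detector below, based on mean absolute frame differences with an arbitrary threshold, is one simple stand-in rather than the claimed method.

```python
import numpy as np

def split_on_shot_changes(frames, threshold=30.0):
    """Toy shot-change detector: declare a cut whenever the mean absolute
    difference between consecutive frames exceeds a threshold."""
    cuts = [0]
    for i in range(1, len(frames)):
        diff = np.abs(frames[i].astype(np.float32) - frames[i - 1].astype(np.float32)).mean()
        if diff > threshold:
            cuts.append(i)
    cuts.append(len(frames))
    # Each (start, end) range of frames is one target segmented video.
    return [(cuts[k], cuts[k + 1]) for k in range(len(cuts) - 1)]

frames = np.random.randint(0, 256, size=(120, 224, 224, 3), dtype=np.uint8)
segments = split_on_shot_changes(frames)
```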
12. The method according to claim 1 or 11, wherein the step of obtaining a plurality of video sequence feature vectors for expressing features of the video sequence in the video stream comprises:
determining a plurality of target frames in the video stream based on the plurality of target segmented videos;
and extracting video sequence feature vectors of the target frame pictures corresponding to the plurality of target frames as the plurality of video sequence feature vectors.
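Claim 12 extracts one feature vector per target frame; the sketch below assumes a ResNet-50 backbone and middle-frame sampling, neither of which is specified by the claim.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

backbone = models.resnet50(weights=None)
backbone.fc = torch.nn.Identity()          # keep the 2048-d pooled feature as the sequence feature vector
backbone.eval()

preprocess = T.Compose([T.ConvertImageDtype(torch.float32), T.Resize((224, 224))])

segments = [(0, 40), (40, 90), (90, 120)]                        # from the shot-based segmentation step
video = torch.randint(0, 256, (120, 3, 256, 256), dtype=torch.uint8)

with torch.no_grad():
    # One target frame per target segmented video (middle frame, an assumption).
    target_frames = torch.stack([preprocess(video[(s + e) // 2]) for s, e in segments])
    video_seq_feats = backbone(target_frames)                    # shape (num_segments, 2048)
```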
13. The method of claim 1, wherein the cross-modality video retrieval model comprises a visual base network model for obtaining the plurality of video sequence feature vectors, a multi-language text model for obtaining the plurality of text sequence feature vectors, a visual feature sequence fusion module, a text feature sequence fusion module, a feature consistency learning module, and a label consistency learning module; the cross-modality video retrieval model has a parameter adjustment stage for the visual feature sequence fusion module, the text feature sequence fusion module, the feature consistency learning module, and the label consistency learning module; the cross-modality video retrieval model comprises a control switch for the multi-language text model and the visual base network model; and the method further comprises:
turning off the control switch when the cross-modal video retrieval model is in the parameter adjustment stage.
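Claim 13 can be read as freezing the two base models while the fusion and consistency-learning modules are tuned; the gradient-toggling helper below reflects that reading, which is an interpretation rather than the claimed mechanism, and the placeholder networks are assumptions.

```python
import torch.nn as nn

def set_control_switch(module: nn.Module, switch_on: bool) -> None:
    """Interpret the 'control switch' as toggling gradient flow into a sub-model."""
    for p in module.parameters():
        p.requires_grad = switch_on

# Illustrative placeholders for the two base models named in the claim.
visual_base_network = nn.Sequential(nn.Linear(2048, 512))
multi_language_text_model = nn.Sequential(nn.Embedding(30000, 512))

# Parameter adjustment stage: switch off so only the fusion and
# consistency-learning modules receive gradient updates.
set_control_switch(visual_base_network, switch_on=False)
set_control_switch(multi_language_text_model, switch_on=False)
```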
14. A training apparatus for a cross-modality video retrieval model, wherein the cross-modality video retrieval model includes a hidden vector space and a label vector space, the apparatus comprising:
a target segmented video generation module, configured to acquire and segment a video stream to generate a plurality of target segmented videos;
a video sequence position scalar obtaining module, configured to obtain a video sequence position scalar for expressing video sequence positions of the plurality of target segmented videos;
a video sequence feature vector acquisition module, configured to acquire a plurality of video sequence feature vectors for expressing features of a video sequence in the video stream, wherein the video sequence has a corresponding text sequence;
a text sequence segmentation module, configured to segment the text sequence to generate word segments and a word segment sequence consisting of the word segments;
a text sequence feature vector extraction module, configured to extract a plurality of text sequence feature vectors for the word segments, a text label feature vector for the word segment sequence, and a text position scalar for expressing text positions;
a video sequence feature vector merging module, configured to merge the plurality of video sequence feature vectors based on the video sequence position scalar and generate a target visual feature vector;
a text sequence feature vector merging module, configured to merge the plurality of text sequence feature vectors based on the text position scalar and generate a target text feature vector;
a hidden vector mapping module, configured to map the target visual feature vector and the target text feature vector to the hidden vector space, and calculate an implicit feature vector similarity for the target visual feature vector and the target text feature vector;
a label vector mapping module, configured to map the target visual feature vector and the target text feature vector to the label vector space, and calculate a label feature vector similarity for the target visual feature vector and the target text feature vector;
and a retrieval result determining module, configured to determine a retrieval result based on the text label feature vector, the implicit feature vector similarity, and the label feature vector similarity.
15. An electronic device, characterized by comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any one of claims 1 to 13 when executing the program stored in the memory.
16. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-13.
CN202210429393.2A 2022-04-22 2022-04-22 Training method and device for cross-modal video retrieval model Pending CN114998777A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210429393.2A CN114998777A (en) 2022-04-22 2022-04-22 Training method and device for cross-modal video retrieval model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210429393.2A CN114998777A (en) 2022-04-22 2022-04-22 Training method and device for cross-modal video retrieval model

Publications (1)

Publication Number Publication Date
CN114998777A true CN114998777A (en) 2022-09-02

Family

ID=83024633

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210429393.2A Pending CN114998777A (en) 2022-04-22 2022-04-22 Training method and device for cross-modal video retrieval model

Country Status (1)

Country Link
CN (1) CN114998777A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115481285A (en) * 2022-09-16 2022-12-16 北京百度网讯科技有限公司 Cross-modal video text matching method and device, electronic equipment and storage medium
CN115830519A (en) * 2023-03-01 2023-03-21 杭州遁甲科技有限公司 Intelligent lock message reminding method
CN117765450A (en) * 2024-02-20 2024-03-26 浪潮电子信息产业股份有限公司 Video language understanding method, device, equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN110737801B (en) Content classification method, apparatus, computer device, and storage medium
CN111581510A (en) Shared content processing method and device, computer equipment and storage medium
CN114998777A (en) Training method and device for cross-modal video retrieval model
CN110083729B (en) Image searching method and system
CN110225368B (en) Video positioning method and device and electronic equipment
CN113095346A (en) Data labeling method and data labeling device
CN113836992B (en) Label identification method, label identification model training method, device and equipment
CN114996511A (en) Training method and device for cross-modal video retrieval model
CN115131638B (en) Training method, device, medium and equipment for visual text pre-training model
Gao et al. A hierarchical recurrent approach to predict scene graphs from a visual‐attention‐oriented perspective
CN111651635A (en) Video retrieval method based on natural language description
Zhang et al. Complex deep learning and evolutionary computing models in computer vision
CN110851629A (en) Image retrieval method
Afrasiabi et al. Spatial-temporal dual-actor CNN for human interaction prediction in video
CN114443916B (en) Supply and demand matching method and system for test data
WO2023168818A1 (en) Method and apparatus for determining similarity between video and text, electronic device, and storage medium
CN110969187B (en) Semantic analysis method for map migration
Qian et al. Filtration network: A frame sampling strategy via deep reinforcement learning for video captioning
CN112287159A (en) Retrieval method, electronic device and computer readable medium
CN113761280A (en) Media data processing method and device, electronic equipment and computer storage medium
Lu et al. Multi‐template temporal information fusion for Siamese object tracking
CN117746441B (en) Visual language understanding method, device, equipment and readable storage medium
CN114707591B (en) Data processing method and training method and device of data processing model
CN117540221B (en) Image processing method and device, storage medium and electronic equipment
CN117453949A (en) Video positioning method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination