WO2024098525A1 - Video-text mutual retrieval method and apparatus, training method and apparatus for video-text mutual retrieval model, and device and medium - Google Patents


Info

Publication number
WO2024098525A1
WO2024098525A1 PCT/CN2022/141680 CN2022141680W WO2024098525A1 WO 2024098525 A1 WO2024098525 A1 WO 2024098525A1 CN 2022141680 W CN2022141680 W CN 2022141680W WO 2024098525 A1 WO2024098525 A1 WO 2024098525A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
text
image
features
sample
Prior art date
Application number
PCT/CN2022/141680
Other languages
French (fr)
Chinese (zh)
Inventor
李仁刚
王立
范宝余
郭振华
Original Assignee
苏州元脑智能科技有限公司
Priority date
Filing date
Publication date
Application filed by 苏州元脑智能科技有限公司
Publication of WO2024098525A1 publication Critical patent/WO2024098525A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Definitions

  • the embodiments of the present application relate to the field of information retrieval technology, and in particular to a video-text mutual retrieval model training method and device, a video-text mutual retrieval method and device, an electronic device and a non-volatile storage medium.
  • the related art does not process video directly; it usually divides the video data into multiple frames of image data and then processes the image data.
  • the related art uses an attention mechanism to weight the extracted image features onto the text features, reconstructing the text features and enhancing the similarity between the text and the image.
  • although this method can reconstruct the electronic text features using attention, it only applies unidirectional attention from natural images to electronic texts. Since natural images and electronic texts correspond to each other, their high-order features affect one another; reconstructing only the electronic text features while ignoring the natural image features prevents the two from corresponding accurately, which degrades video-text mutual retrieval.
  • the embodiments of the present application provide a video-text mutual retrieval model training method and device, a video-text mutual retrieval method and device, an electronic device and a non-volatile storage medium, which can effectively improve the accuracy of video-text mutual retrieval.
  • a first aspect of the embodiments of the present application provides a video-text mutual retrieval model training method, comprising:
  • the sample text includes first-category text data, second-category text data and third-category text data
  • the second-category text data includes first-category text data
  • the third-category text data is used to summarize the second-category text data and the first-category text data
  • the text feature information includes first-category text features, second-category text features and third-category text features corresponding to the first-category text data, the second-category text data and the third-category text data
  • the first-category text features and the second-category text features determine the node features and connection edges of the heterogeneous graph neural network in the video-text mutual retrieval model
  • the video-text mutual retrieval model is trained based on the text features and corresponding video features of each group of training samples; the text features are obtained by fusing the features of the second-category text data extracted by the heterogeneous graph neural network with the third-category text features.
  • the multiple frames of images extracted from the sample video are reassembled to obtain multiple image sets, including:
  • the image recombining parameters include the total number of image sets and the total number of image frames contained in each image set;
  • the image frames included in each image set are determined according to the image recombination parameters, so as to segment the image sequence formed by the multiple frames of images.
  • the total number of image frames included in each image set is the same, and the above-mentioned determining the image frames included in each image set according to the above-mentioned image recombination parameters includes:
  • for the first image set, the image frames it includes are determined according to the total number of image frames and the first frame of the image sequence;
  • for each subsequent image set, the image frames it includes are determined by offsetting the frame sequence numbers by the image frame sequence number difference;
  • m is the total number of image frames included in each image set
  • N is the total number of image frames included in the above image sequence
  • n is the total number of image sets
  • k is the image frame sequence number difference, which is an integer.
  • the multiple frames of images extracted from the sample video are reassembled to obtain multiple image sets, including:
  • a target image frame for identifying the current video segment is extracted.
  • the above-mentioned extraction for identifying the target image frame of the current video segment includes:
  • the first frame image of the current video segment is extracted to serve as the target image frame of the current video segment.
  • the generating of the video features of the sample video according to the image features of different image sets and the association relationship between the image sets includes:
  • the image frames contained in the current image set are input into the image feature extraction network to obtain the image features corresponding to the current image set;
  • the image feature extraction network includes a first 3D convolution structure, a first downsampling structure, a second 3D convolution structure, a second downsampling structure, a 2D convolution structure and a fully connected layer.
  • the above-mentioned first 3D convolution structure is used to perform a 3D convolution operation on the input information of the above-mentioned image feature extraction network; the above-mentioned first downsampling structure is used to perform a downsampling operation on the output features of the above-mentioned first 3D convolution structure; the above-mentioned second 3D convolution structure is used to perform a 3D convolution operation on the output features of the above-mentioned first downsampling structure; the above-mentioned second downsampling structure is used to perform a downsampling operation on the features output by the above-mentioned second 3D convolution structure; the above-mentioned 2D convolution structure is used to perform a 2D convolution operation on the output features of the above-mentioned second downsampling structure.
  • the generating of the video features of the sample video according to the image features of different image sets and the association relationship between the image sets includes: for each image set, determining the current initial weight of the current image set based on the image features of the current image set, and determining the weight coefficient of the current image set based on the current initial weight and the initial weight of each image set;
  • the video features of the sample video are generated according to the image features of each image set and the corresponding weight coefficients.
  • the determining of the current initial weight of the current image set based on the image features of the current image set includes:
  • the initial weight calculation formula is called to calculate the current initial weight of the current image set; the initial weight calculation formula is a_i = q^T·ReLU(H·y_i), where:
  • a_i is the initial weight of the i-th image set
  • q is a known vector
  • q^T represents the transpose of q
  • ReLU() is the ReLU function
  • H is the weight matrix
  • y_i is the image feature of the i-th image set.
  • the determining of the weight coefficient of the current image set based on the current initial weight and the initial weight of each image set includes:
  • the weight calculation relationship is called to calculate the weight coefficient of the current image set; the weight calculation relationship is a_i′ = softmax(a_i) = exp(a_i) / Σ_j exp(a_j), where:
  • a_i′ is the weight coefficient of the i-th image set
  • a_i is the initial weight of the i-th image set
  • softmax() is the softmax function
  • a_j is the initial weight of the j-th image set
  • n is the total number of image sets.
  • the video-text mutual retrieval model is trained based on the text features and corresponding video features of each group of training samples, including:
  • a loss function is called to guide the training process of the video-text mutual retrieval model, where:
  • N is the number of training sample groups
  • min d() represents the minimum value of the calculated distance
  • recombining the multiple frames of images extracted from the sample video to obtain multiple image sets includes: integrating the multiple frames of images into an image sequence in the order of extraction, and obtaining the multiple image sets by cross-segmenting the image sequence.
  • recombining the multiple frames of images extracted from the sample video to obtain multiple image sets includes: randomly integrating the multiple frames of images into an image sequence, and obtaining the multiple image sets by segmenting the image sequence.
  • recombining the multiple frames of images extracted from the sample video to obtain multiple image sets includes: randomly allocating the multiple frames of images to different image sets.
  • the video splitting parameters include the number of segments of the sample video and identification information of the sample video.
  • the multiple video segments overlap with each other.
  • a second aspect of the embodiments of the present application provides a video-text mutual retrieval model training device, comprising:
  • a text feature acquisition module is configured to acquire text feature information of sample text in each group of training samples in a training sample set;
  • the sample text includes first-category text data, second-category text data and third-category text data
  • the second-category text data includes first-category text data
  • the third-category text data is used to summarize the second-category text data and the first-category text data
  • the text feature information includes first-category text features, second-category text features and third-category text features corresponding to the first-category text data, the second-category text data and the third-category text data
  • the first-category text features and the second-category text features determine node features and connection edges of a heterogeneous graph neural network in a video-text mutual retrieval model;
  • the video feature generation module is configured to reassemble multiple frames of images extracted from the sample video in each set of training samples to obtain multiple image sets, wherein the same image is included in different image sets; and to generate video features of the sample video according to image features of different image sets and correlations between the image sets;
  • the training module is configured to train the video-text mutual retrieval model based on the text features of each group of training samples and the corresponding video features; the text features are obtained by fusing the features of the second-category text data extracted by the heterogeneous graph neural network with the third-category text features.
  • a third aspect of the embodiments of the present application provides a video-text mutual retrieval method, including:
  • the text features to be matched of the text to be retrieved and the video features to be matched are input into the video-text mutual retrieval model to obtain the video-text mutual retrieval result;
  • the text to be retrieved includes first-category text data, second-category text data and third-category text data
  • the second-category text data includes first-category text data
  • the third-category text data is used to summarize the second-category text data and the first-category text data
  • the text features to be matched are obtained by fusing the features of the second-category text data extracted by the heterogeneous graph neural network of the video-text mutual retrieval model with the third-category text features.
  • a fourth aspect of the embodiments of the present application provides a video-text mutual retrieval device, including:
  • the model training module is configured to pre-train a video-text mutual retrieval model using any of the above-mentioned video-text mutual retrieval model training methods;
  • the video processing module is configured to reassemble multiple frames of images extracted from the video to be retrieved to obtain multiple image sets, wherein the same image is included in different image sets; and to generate the video features to be matched of the video to be retrieved based on the image features of different image sets and the association relationships between the image sets;
  • the mutual retrieval module is configured to input the text features to be matched of the text to be retrieved and the video features to be matched into the video-text mutual retrieval model to obtain the video-text mutual retrieval result;
  • the text to be retrieved includes first-category text data, second-category text data and third-category text data
  • the second-category text data includes first-category text data
  • the third-category text data is used to summarize the second-category text data and the first-category text data
  • the text features to be matched are obtained by fusing the features of the second-category text data extracted by the heterogeneous graph neural network of the video-text mutual retrieval model with the third-category text features.
  • an embodiment of the present application also provides an electronic device, including a processor, wherein the processor is configured to implement the steps of any of the above-mentioned video-text mutual retrieval model training methods and/or the above-mentioned video-text mutual retrieval methods when executing a computer program stored in a memory.
  • an embodiment of the present application further provides a non-volatile storage medium, on which a computer program is stored.
  • when the computer program is executed by a processor, the steps of the video-text mutual retrieval model training method and/or the video-text mutual retrieval method as described above are implemented.
  • the advantage of the technical solution provided by the embodiments of the present application is that different text types are used as heterogeneous nodes of the graph neural network, and using the graph neural network is conducive to extracting deeper and richer text features.
  • the fusion of the features of the third-category text data, which summarizes the other text data, with the features of the second-category text data is used as the text feature for the matching task, which can mine the intrinsic relationships within the text data and thereby helps improve the accuracy of video-text mutual retrieval.
  • recombining the image frames extracted from the video data and then extracting image features is conducive to obtaining image features that reflect the video more accurately.
  • the correlation between different image frames is also considered, which is conducive to obtaining more accurate video features, thereby improving the accuracy of text-video mutual retrieval.
  • the embodiments of the present application also provide, for the video-text mutual retrieval model training method, corresponding implementation devices, electronic devices and non-volatile storage media, as well as a video-text mutual retrieval method and device, making the above methods more practical.
  • the above devices, electronic devices, non-volatile storage media, and video-text mutual retrieval methods and devices all have corresponding advantages.
  • FIG. 1 is a flow chart of a video-text mutual retrieval model training method provided in an embodiment of the present application.
  • FIG. 2 is a schematic diagram of constructing a heterogeneous graph neural network provided in an embodiment of the present application.
  • FIG. 3 is a schematic diagram of multiple image sets generated by recombining multiple frames of images according to an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a sample video cutting process provided in an embodiment of the present application.
  • FIG. 5 is a schematic diagram of image feature extraction provided in an embodiment of the present application.
  • FIG. 6 is a flow chart of a video-text mutual retrieval method provided in an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a video-text mutual retrieval model framework for an exemplary application scenario provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a system structure framework of an exemplary application scenario provided in an embodiment of the present application.
  • FIG. 9 is a structural diagram of an optional implementation of a video-text mutual retrieval model training device provided in an embodiment of the present application.
  • FIG. 10 is a structural diagram of an optional implementation of a video-text mutual retrieval device provided in an embodiment of the present application.
  • FIG. 11 is a structural diagram of an optional implementation of an electronic device provided in an embodiment of the present application.
  • FIG. 1 is a flow chart of a video-text mutual retrieval model training method provided by an embodiment of the present application.
  • the embodiment of the present application may include the following contents:
  • S101 Obtain text feature information of sample text in each group of training samples in the training sample set.
  • the training sample set is the sample data used for training the video-text mutual retrieval model; it includes multiple groups of training samples, and each group of training samples includes a corresponding sample text and sample video, that is, the sample text and the sample video are a pair of mutually matching sample data.
  • the number of training sample groups can be determined according to actual training needs and the database used; the embodiments of the present application do not impose any restrictions on this.
  • the video-text mutual retrieval model is used to perform the mutual retrieval task between video data and text data; it includes a heterogeneous graph neural network and a video coding network.
  • the heterogeneous graph neural network is used to process the second-category text data of the sample text and of the text to be retrieved, and finally outputs the text features corresponding to the text data.
  • the video coding network is used to process the video data and finally outputs the video features of the video data.
  • the model is obtained by training on the text features and the video features.
  • the data types contained in the sample text of this embodiment include at least three types, wherein the text features corresponding to the two data types are used as heterogeneous nodes of the graph structure. For the convenience of description, they can be called the first type of text data and the second type of text data, and the other type of data is the text data summarizing the first type of text data and the second type of text data.
  • the text feature information includes the first-category text features, the second-category text features, and the third-category text features corresponding to the first-category text data, the second-category text data, and the third-category text data;
  • the heterogeneous graph neural network is a network based on a graph structure
  • the nodes of the graph structure are the first-category text features and the second-category text features
  • the connection edges of the graph structure are determined by whether there is an association relationship between the corresponding features of each heterogeneous node. If there is an association relationship between the features corresponding to two nodes, there is a connection edge relationship between the two nodes.
  • the features extracted from the first-category text data and the features extracted from the second-category text data serve as the heterogeneous node features.
  • a connecting edge (for example e11, e32 or e33 in FIG. 2) is established between two nodes of the heterogeneous graph neural network whenever the features corresponding to those nodes are associated with each other.
  • a corresponding graph structure can be selected based on the actual application scenario, and the embodiments of the present application do not impose any limitation on this.
  • S102 and S103 are respectively executed.
  • multiple frames of images representing the sample video are extracted from the sample video.
  • the extraction method can be flexibly selected according to actual needs.
  • the total number of extracted image frames can also be flexibly selected based on actual needs.
  • the embodiment of the present application does not impose any restrictions on this.
  • the multiple frames of images are recombined, and the multiple frames of images can be integrated into an image sequence in the order of extraction, and then multiple image sets are obtained by cross-segmenting the image sequence.
  • the same image in this embodiment is included in different image sets, indicating that the same image appears in at least two image sets.
  • the multiple frames of images can also be randomly integrated into an image sequence, and then multiple image sets are obtained by segmenting the image sequence.
  • the multiple frames of images can also be randomly assigned to different image sets, and the same image can be assigned to multiple image sets.
  • technicians in the relevant field can flexibly decide according to actual needs.
  • S103 Generate video features of the sample video according to the image features of different image sets and the association relationship between the image sets.
  • any existing machine learning model such as convolutional neural network, VGG (Visual Geometry Group Network), Resnet (Residual Neural Network), etc. can be used to extract the image features of each frame image contained in each image set, and the image features of all the frames in the image set are integrated into the image features of the image set.
  • the association between the image sets is used to identify the importance of the image features of different image sets to the entire video, and the final video features of the sample video are determined based on the importance of different image sets and the image features of the image sets.
  • the text features of a sample text correspond to the video features of a sample video.
  • the text features of each sample text in this embodiment are fusion features, obtained by fusing the text features corresponding to the third-category text data of the sample text with the features of its second-category text data extracted by the heterogeneous graph neural network of the video-text mutual retrieval model.
  • the text features corresponding to the third category text data can be extracted by any text feature extraction model, and this embodiment does not impose any restrictions on this.
  • a loss function is used to guide the training of the model, and then the network parameters of the video text mutual inspection model are updated by methods such as gradient back propagation until the model training conditions are met, such as reaching the number of iterations or the convergence effect is good.
  • the training process of the video text mutual inspection model may include a forward propagation stage and a back propagation stage.
  • the forward propagation stage is a stage in which data is propagated from a low level to a high level
  • the back propagation stage is a stage in which the error is propagated from a high level to a low level when the result obtained by the forward propagation does not meet the expectation.
  • all network layer weights are first initialized, such as random initialization; then the video features and text feature information are input and forward propagated through the graph neural network, convolution layer, downsampling layer, fully connected layer and other layers to obtain the output value; the model output value of the video text mutual inspection model is calculated, and the loss value of the output value is calculated based on the loss function.
  • the error is reversed back to the video text mutual inspection model, and the back propagation errors of each part of the video text mutual inspection model such as the graph neural network layer, the fully connected layer, the convolution layer and other layers are obtained in turn.
  • Each layer of the video text mutual inspection model adjusts all weight coefficients of the video text mutual inspection model according to the back propagation errors of each layer to achieve weight update.
  • a new batch of video features and text feature information is then randomly selected, and the above process is repeated to obtain the output value of the forward propagation. The iterations continue until the error between the calculated model output value and the target value (i.e., the label) is less than a preset threshold, or the number of iterations exceeds a preset number, at which point model training is terminated. The layer parameters of the model at the end of training are used as the network parameters of the trained video-text mutual retrieval model.
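  • as an illustration of the forward/backward training procedure described above, a minimal PyTorch-style sketch follows; the encoder method names (encode_text, encode_video), the optimizer choice and the stopping threshold are hypothetical placeholders, not taken from the application.

```python
import torch

def train(model, loader, loss_fn, epochs=10, lr=1e-4, tol=1e-3):
    """Hypothetical sketch of the forward/backward training loop described above."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss = torch.tensor(float("inf"))
    for epoch in range(epochs):
        for text_feats, video_frames in loader:           # one batch of paired samples
            text_emb = model.encode_text(text_feats)      # heterogeneous-graph text branch (assumed name)
            video_emb = model.encode_video(video_frames)  # image-set / video branch (assumed name)
            loss = loss_fn(video_emb, text_emb)           # e.g. the triplet loss sketched below
            optimizer.zero_grad()
            loss.backward()                               # back-propagate the error layer by layer
            optimizer.step()                              # update all weight coefficients
        if loss.item() < tol:                             # stop when the error falls below a preset threshold
            break
    return model
```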
  • different text types are used as heterogeneous nodes of the graph neural network, and using the graph neural network is conducive to extracting deeper and richer text features, which helps improve the accuracy of video-text mutual retrieval.
  • recombining the image frames extracted from the video data and then extracting image features is conducive to obtaining image features that reflect the video more accurately.
  • the correlation between different image frames is also considered, which is conducive to obtaining more accurate video features, thereby improving the accuracy of text-video mutual retrieval.
  • the above embodiment does not limit which loss function is used to guide the model training process in step S104.
  • Technical personnel in the relevant field can use any loss function in the prior art, such as L1 norm loss function, mean square error loss function, cross entropy loss, etc.
  • the loss function is an indicator used to measure how well the prediction model predicts the expected result; the choice of loss function affects the accuracy of the entire model.
  • the embodiments of the present application also provide an optional implementation of the loss function: based on the text features of each group of training samples and the corresponding video features, the loss function is called to guide the training process of the video-text mutual retrieval model, where:
  • N is the number of training sample groups
  • min d() represents the minimum value of the calculated distance
  • the loss function traverses each video feature and each piece of text feature information and calculates the average loss over the paired data.
  • this embodiment traverses N times, where N means there are N paired samples in the batch, that is, N groups of training samples in the training sample set. All sample videos of these N groups can be regarded as a video image group, and all sample texts can be regarded as a text group.
  • the video image group features are traversed (N in total); the video feature selected in each traversal serves as the anchor, where a denotes the anchor sample.
  • the text feature encoding paired with the anchor sample is recorded as the positive, where p denotes positive (the paired match), and the unpaired text features are recorded as negatives; the margin is a hyperparameter that is fixed during training, for example set to 0.3.
  • the same traversal operation is performed for the text features: the text feature selected in each traversal serves as the anchor, the corresponding video image group feature is recorded as the positive, and the non-corresponding video image group features as negatives.
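  • the exact loss formula is not reproduced in this text; the following is a minimal sketch of a bidirectional triplet ranking loss with hard-negative mining and a fixed margin of 0.3, consistent with the traversal described above but not necessarily the exact form used in the application.

```python
import torch
import torch.nn.functional as F

def bidirectional_triplet_loss(video_emb, text_emb, margin=0.3):
    """Sketch: video_emb and text_emb are (N, d) tensors of N paired samples."""
    dist = torch.cdist(video_emb, text_emb)                 # pairwise Euclidean distances, (N, N)
    pos = dist.diag()                                       # d(anchor, positive) for each pair
    # mask out the positives, then take the hardest (closest) negative in each direction
    inf_mask = torch.eye(len(dist), device=dist.device) * 1e9
    hard_neg_text = (dist + inf_mask).min(dim=1).values     # video anchor vs. unpaired texts
    hard_neg_video = (dist + inf_mask).min(dim=0).values    # text anchor vs. unpaired videos
    loss_v = F.relu(margin + pos - hard_neg_text)           # video-to-text direction
    loss_t = F.relu(margin + pos - hard_neg_video)          # text-to-video direction
    return (loss_v + loss_t).mean()                         # average over the N pairs
```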
  • there is no limitation on how to execute step S102.
  • an optional image frame combination method is provided, which may include the following steps:
  • An image recombination parameter is obtained, and the image frames included in each image set are determined according to the image recombination parameter, so as to perform segmentation processing on an image sequence formed by multiple frames of images.
  • the image recombination parameters may include the total number of image sets and the total number of image frames contained in each image set.
  • the total number of image sets and the total number of image frames contained in each image set can be changed in real time, that is, the user can input the latest parameter value in real time, and can directly write it into the specified location of the system, which does not affect the implementation of the embodiments of the present application.
  • the number of image frames contained in each image set can be the same or different; to facilitate subsequent image processing, this embodiment sets the number of image frames contained in each image set to be the same.
  • the image frames can be allocated and reprocessed through manual interaction.
  • an automated image segmentation method can also be used.
  • this embodiment also provides an optional implementation method for determining the image frames contained in each image set according to the image recombination parameters, which may include the following contents:
  • the image frames extracted from the sample video are N frames
  • the N frames are divided into n overlapping image sets, and each image set may include m frames.
  • the image frame sequence number difference k value can be calculated.
  • the first image set includes [1, ..., m]
  • the second image set includes [k+1, ..., m+k]
  • the third image set includes [2k+1, ..., m+2k]
  • the nth image set includes [(n−1)k+1, ..., m+(n−1)k].
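  • a minimal sketch of this overlapping segmentation follows; the formula k = (N − m) / (n − 1) is an assumption (an even offset between consecutive sets), since the text only states that k can be calculated.

```python
def split_into_overlapping_sets(frames, n_sets, m):
    """Split a frame sequence into n_sets overlapping sets of m frames each (sketch).

    frames: list of frames ordered by extraction time, len(frames) == N.
    Assumes an even offset k between consecutive sets: k = (N - m) // (n_sets - 1).
    """
    N = len(frames)
    k = (N - m) // (n_sets - 1) if n_sets > 1 else 0   # image frame sequence number difference
    # the i-th set (1-based) covers indices [(i-1)*k + 1, ..., m + (i-1)*k]
    return [frames[i * k : i * k + m] for i in range(n_sets)]

# example: N = 16 frames, 4 sets of 7 frames -> k = 3, consecutive sets share frames (overlap)
sets = split_into_overlapping_sets(list(range(1, 17)), n_sets=4, m=7)
# sets[0] == [1..7], sets[1] == [4..10], sets[2] == [7..13], sets[3] == [10..16]
```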
  • the sample video is composed of many frames of video images.
  • the above embodiment does not limit the process of extracting multiple frames of images from the sample video.
  • this embodiment also provides an optional implementation method, that is, by parsing the video splitting instruction, the video splitting parameters are obtained; according to the video splitting parameters, the sample video is split into multiple video segments; for each video segment, the target image frame used to identify the current video segment is extracted.
  • the first frame of the current video segment can be extracted as the target image frame of the current video segment.
  • the video splitting parameter refers to the number of sample video segmentation segments and the sample video identification information. This implementation can divide a sample video into N segments on average, and then take the first frame of each segment as the representative of the image of the segment.
  • This embodiment divides the image frames extracted from the video into multiple overlapping intervals, which is beneficial to extracting richer image features and improving the accuracy of model training.
  • the above embodiment does not limit how to generate video features.
  • the embodiment of the present application also provides an illustrative example, which may include the following content:
  • an embodiment of the present application provides a network structure for extracting image features of each frame of each image set, which is called an image feature extraction network in this embodiment.
  • the image feature extraction network may include a first 3D convolution structure, a first downsampling structure, a second 3D convolution structure, a second downsampling structure, a 2D convolution structure, and a fully connected layer; the first 3D convolution structure is used to perform a 3D convolution operation on the input information of the image feature extraction network; the first downsampling structure is used to perform a downsampling operation on the output features of the first 3D convolution structure; the second 3D convolution structure is used to perform a 3D convolution operation on the output features of the first downsampling structure; the second downsampling structure is used to perform a downsampling operation on the features output by the second 3D convolution structure; and the 2D convolution structure is used to perform a 2D convolution operation on the output features of the second downsampling structure.
  • the size of the convolution kernel is shown in Figure 5.
  • the size of the pooling kernel is 2*2, and the step size is 2.
  • finally, the network obtains the output features after one 2D convolution operation and one fully connected layer.
  • the input size of the network is 3*16*224*224, that is, 16 frames of images are input at a time, and the input image size is 224×224. In this embodiment, a 128-dimensional feature vector can be obtained for each input image set.
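  • a PyTorch sketch of such an image feature extraction network follows; the channel counts and kernel sizes are placeholders (the actual sizes are given in Figure 5, not reproduced here), while the overall structure, the 3*16*224*224 input and the 128-dimensional output follow the description above.

```python
import torch
import torch.nn as nn

class ImageSetEncoder(nn.Module):
    """Sketch: 3D conv -> downsample -> 3D conv -> downsample -> 2D conv -> fully connected."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.conv3d_1 = nn.Conv3d(3, 32, kernel_size=3, padding=1)      # first 3D convolution structure
        self.pool_1 = nn.MaxPool3d(kernel_size=2, stride=2)             # first downsampling structure
        self.conv3d_2 = nn.Conv3d(32, 64, kernel_size=3, padding=1)     # second 3D convolution structure
        self.pool_2 = nn.MaxPool3d(kernel_size=2, stride=2)             # second downsampling structure
        self.conv2d = nn.Conv2d(64 * 4, 128, kernel_size=3, padding=1)  # 2D convolution over merged temporal dim
        self.fc = nn.Linear(128 * 56 * 56, out_dim)                     # fully connected layer -> 128-d feature

    def forward(self, x):                                # x: (batch, 3, 16, 224, 224)
        x = self.pool_1(torch.relu(self.conv3d_1(x)))    # -> (batch, 32, 8, 112, 112)
        x = self.pool_2(torch.relu(self.conv3d_2(x)))    # -> (batch, 64, 4, 56, 56)
        x = x.flatten(1, 2)                              # fold time into channels: (batch, 256, 56, 56)
        x = torch.relu(self.conv2d(x))                   # -> (batch, 128, 56, 56)
        return self.fc(x.flatten(1))                     # -> (batch, 128)

feat = ImageSetEncoder()(torch.randn(1, 3, 16, 224, 224))  # one image set of 16 frames -> 128-d vector
```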
  • the process of generating video features of the sample video according to the image features of different image sets and the correlation between the image sets may include: for each image set, determining the current initial weight of the current image set based on the image features of the current image set, and determining the weight coefficient of the current image set based on the current initial weight and the initial weight of each image set; generating the video features of the sample video according to the image features of each image set and the corresponding weight coefficient.
  • the current initial weight of the current image set can be calculated by calling the initial weight calculation formula; the initial weight calculation formula can be expressed as a_i = q^T·ReLU(H·y_i), where:
  • a_i is the initial weight of the i-th image set
  • q is a known vector
  • q^T represents the transpose of q
  • ReLU() is the ReLU function
  • H is the weight matrix
  • y_i is the image feature of the i-th image set. y_i can be mapped to a common space by the matrix multiplication H·y_i, where H can be obtained through model training. Multiplying q^T by ReLU(H·y_i) yields a single number.
  • the weight coefficient of the current image set can be calculated by calling the weight calculation relationship;
  • the weight calculation relationship can be expressed as a_i′ = softmax(a_i) = exp(a_i) / Σ_j exp(a_j), where:
  • a i ′ is the weight coefficient of the i-th image set
  • softmax() is the softmax function
  • a j is the initial weight of the j-th image set
  • n is the total number of image sets.
  • the features of each image set can be expressed more significantly, which is conducive to obtaining more accurate video features and helps to improve the accuracy of model training.
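  • a minimal sketch of this weighting scheme follows, implementing a_i = q^T·ReLU(H·y_i) and the softmax normalization described above; treating q and H as learnable parameters and aggregating the image-set features by a weighted sum are assumptions.

```python
import torch
import torch.nn as nn

class ImageSetAttention(nn.Module):
    """Sketch: weight each image-set feature y_i by a_i' = softmax_i(q^T ReLU(H y_i)) and aggregate."""
    def __init__(self, dim=128):
        super().__init__()
        self.H = nn.Linear(dim, dim, bias=False)   # weight matrix H, obtained through model training
        self.q = nn.Parameter(torch.randn(dim))    # known vector q (treated here as a learnable parameter)

    def forward(self, y):                          # y: (n, dim), one row per image set
        a = torch.relu(self.H(y)) @ self.q         # initial weights a_i = q^T ReLU(H y_i), shape (n,)
        w = torch.softmax(a, dim=0)                # normalized weight coefficients a_i'
        return (w.unsqueeze(1) * y).sum(dim=0)     # video feature: weighted sum of image-set features

video_feat = ImageSetAttention()(torch.randn(5, 128))   # 5 image sets -> one 128-d video feature
```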
  • this embodiment also provides a video text mutual inspection method, please refer to FIG6, which may include the following contents:
  • S601 The video-text mutual retrieval model is trained in advance using the video-text mutual retrieval model training method described in any of the above embodiments.
  • S602 Recombining multiple frames of images extracted from the video to be retrieved to obtain multiple image sets, and the same image is included in different image sets.
  • S603 Generate to-be-matched video features of the to-be-retrieved video according to the image features of different image sets and the association relationship between the image sets.
  • S604 Input the to-be-matched text features of the to-be-retrieved text and the to-be-matched video features into the video-text mutual retrieval model to obtain the video-text mutual retrieval results.
  • the text to be retrieved includes first-category text data, second-category text data and third-category text data
  • the second-category text data includes first-category text data
  • the third-category text data is used to summarize the second-category text data and the first-category text data
  • the text features to be matched are the fusion features of the features of the second-category text data and the third-category text features extracted by the heterogeneous graph neural network of the video text mutual inspection model.
  • the processing of the video to be retrieved in this embodiment is performed in S602 and S603; please refer to the corresponding contents of S102 and S103 in the above embodiment, which are not repeated here.
  • the weight coefficients (network parameters) trained in S601 can be preloaded. Feature extraction is performed on the videos to be retrieved or the texts to be retrieved, and the features are stored in the retrieval data set.
  • the user gives any video to be retrieved or text to be retrieved, which can be called data to be retrieved for the sake of description.
  • the text feature information or video feature of the data to be retrieved is extracted and input into the video-text mutual inspection model.
  • the features of the data to be retrieved are distance-matched against the features of all samples in the retrieval data set. For example, if the data to be retrieved is text data, the Euclidean distance is calculated against the features of all videos in the retrieval data set, and the sample with the smallest distance is output as the recommended sample.
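  • a minimal sketch of this distance matching follows, assuming the gallery features have already been extracted and stored as a tensor.

```python
import torch

def retrieve(query_feat, gallery_feats, top_k=1):
    """Sketch: return indices of the gallery samples closest to the query (Euclidean distance)."""
    dists = torch.cdist(query_feat.unsqueeze(0), gallery_feats).squeeze(0)  # (num_gallery,)
    return torch.topk(dists, k=top_k, largest=False).indices                # smallest distance = best match

# e.g. a text query feature matched against the stored features of all videos to be retrieved
best = retrieve(torch.randn(128), torch.randn(1000, 128))
```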
  • this embodiment can effectively improve the accuracy of video-text mutual retrieval.
  • this embodiment also provides an illustrative example for implementing the mutual retrieval task of recipe text and recipe video, which may include the following contents:
  • This embodiment includes a recipe retrieval terminal device 701 and a server 702 .
  • a user can perform operations on the recipe retrieval terminal device 701 .
  • the recipe retrieval terminal device 701 interacts with the server 702 through a network.
  • the server 702 can deploy a video text mutual inspection model, as shown in FIG8 .
  • the video text mutual inspection model needs to be trained.
  • the recipe retrieval terminal device 701 can transmit a training sample set to the server 702 .
  • the training sample set can include multiple groups of training samples. Each group of training samples includes a corresponding recipe text sample and a recipe video sample.
  • Each recipe text sample includes operation steps (instruction list), ingredient information (ingredients) and dish name (Title). Instructions are steps for cooking, which are uniformly represented by steps in the following text. Ingredients are ingredients of a dish, which are uniformly represented by ingredients in the following text.
  • a heterogeneous graph neural network can be used to encode text information.
  • the text features are constructed into a graph structure, and the graph structure includes nodes, node features and connection relationships, as shown in Figure 2. Ingredients and steps differ in both structure and properties, so they are called heterogeneous nodes.
  • each step is a node, and similarly, each ingredient is a node.
  • a node is composed of a sentence or a phrase.
  • the Bert model can be used to extract the features of each sentence or each word.
  • the implementation method is as follows: All recipe texts are input from the text information at the bottom, and the position information and text type accompanying the recipe text information are also input.
  • Position information means that if there are 5 words "peel and slice the mango" in a sentence, their position information is "1, 2, 3, 4, 5" respectively.
  • Text type means: if the input text is a step, its text type is 1; if the input text is an ingredient, its text type is 2.
  • the extracted features are used as the node features, namely ingredient node features and step node features. Both ingredient node features and step node features are high-dimensional vectors of dimension R^d (d-dimensional real vectors).
  • the step information can be traversed by text comparison: each step text is extracted, and the ingredients are then searched in turn. If an ingredient word appears in the step, an edge is connected between the step and that ingredient, that is, there is a connection relationship.
  • in this way, the connection relationships between the step nodes and the ingredient nodes, that is, the connection relationships of the heterogeneous graph, can be constructed.
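  • a minimal sketch of this edge-construction rule follows; the simple substring test is an assumption about how the text comparison is performed.

```python
def build_step_ingredient_edges(steps, ingredients):
    """Sketch: return (step_idx, ingredient_idx) edges of the heterogeneous graph."""
    edges = []
    for q, step_text in enumerate(steps):                  # traverse each step text
        for p, ingredient in enumerate(ingredients):       # search the ingredients in turn
            if ingredient.lower() in step_text.lower():    # ingredient word appears in the step
                edges.append((q, p))                       # connect step node q and ingredient node p
    return edges

edges = build_step_ingredient_edges(
    ["peel and slice the mango", "mix mango and yogurt"],
    ["mango", "yogurt", "sugar"],
)
# [(0, 0), (1, 0), (1, 1)]
```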
  • the heterogeneous graph information can be updated using a graph attention network to achieve feature aggregation and update.
  • the update method is to traverse each heterogeneous node in turn for update.
  • the aggregation and extraction of text features are realized by heterogeneous graph operations, and the calculation method can be as follows:
  • to update a step node: the q-th step node has a node feature, and the p-th ingredient node has its own feature. If the q-th step node is connected to the p-th ingredient node (that is, they have an edge connection relationship), the feature of the p-th ingredient node is used to update the feature of the q-th step node.
  • the correlation between the nodes needs to be considered.
  • the correlation between the nodes can be represented by assigning weights.
  • relationship (1) can be called to calculate the correlation weight z_pq between the q-th step node and the feature of the p-th ingredient node. For each step node, all ingredient nodes with connected edges are traversed (assuming there are N_p of them) to obtain the corresponding correlation weights z_pq.
  • W_a, W_b and W_c are known R^(d×d)-dimensional matrices; applying them is a matrix multiplication, i.e. a vector mapping.
  • the correlation weights of all ingredient nodes connected to the step node can then be normalized, that is, the normalized correlation weight α_qp is obtained by calling relationship (2):
  • exp denotes the exponential function, and the denominator is the sum over the correlation weights of all ingredient nodes connected to the step node. Finally, the node feature of the step node is updated with the normalized correlation weights, that is, relationship (3) is called for the calculation:
  • W_v is an R^(d×d)-dimensional matrix, and the result is the new feature vector of the step node updated from the ingredient nodes connected to it.
  • relationship (5) can be used to perform the same calculation and update on the ingredient nodes.
  • the network update of one layer of the graph attention network is completed.
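  • the formulas for relationships (1) to (3) are not reproduced in this text; a hedged reconstruction in a standard graph-attention form, consistent with the variable definitions above (c_q and s_p are assumed symbols for the q-th step node feature and the p-th connected ingredient node feature), is:

```latex
% Hedged reconstruction; the exact form of (1) is not given in the text, so only a plausible
% dot-product shape over the mapped vectors W_a c_q, W_b c_q, W_c s_p is indicated.
% (1) correlation weight between step node q and connected ingredient node p:
z_{pq} \approx \big(W_a\, c_q\big)^{\top}\big(W_c\, s_p\big)

% (2) normalization over the N_p ingredient nodes connected to the step node:
\alpha_{qp} = \frac{\exp(z_{pq})}{\sum_{p'=1}^{N_p} \exp(z_{p'q})}

% (3) update of the step node feature from its connected ingredient nodes:
\hat{c}_q = \sum_{p=1}^{N_p} \alpha_{qp}\, W_v\, s_p
```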
  • T layers of graph attention networks can be superimposed, with t representing the tth layer of the graph attention network.
  • the update method of the node features of each layer is as described above.
  • an integrated fully connected layer is added after each layer of the graph attention network to re-encode the node features (including the ingredient nodes and the step nodes), as shown in relationship (6):
  • FFN denotes the feed-forward layer, i.e. a fully connected layer.
  • the update of the node features is completed.
  • it is also necessary to summarize and synthesize the features of all text nodes such as operation steps, ingredient information and dish names.
  • the step node integrates the ingredient node information
  • the ingredient node is updated through the graph neural network, and the relevant step node features are emphasized in the form of keywords.
  • the dish name information contains important main material information and cooking methods
  • dish name text is usually present throughout recipe-based cross-modal retrieval tasks. Based on this, this embodiment can also extract the features of the dish name through the Bert (Bidirectional Encoder Representations from Transformers, bidirectional feature encoder) model.
  • a BiLSTM (Bi-directional Long Short-Term Memory, a bidirectional long short-term memory neural network) is used to further encode the updated step node features.
  • the left and right arrows represent the direction of the LSTM (Long Short-Term Memory) encoding, that is, the forward and reverse encoding of the step node features.
  • the different directions of the arrows represent the BiLSTM encoding outputs obtained from the different orders of step node input.
  • the output of the (q−1)-th unit in the BiLSTM is the output of the previous state.
  • the output of the entire text feature can be obtained by summing and averaging.
  • e_rec denotes the output text feature, which is used for the subsequent retrieval step.
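  • a minimal sketch of this summarization step follows: a BiLSTM over the updated step node features, mean pooling, and fusion with the Bert-encoded dish name feature to produce e_rec; the concatenation-plus-linear fusion is an assumption, since the text only states that the features are fused.

```python
import torch
import torch.nn as nn

class RecipeTextHead(nn.Module):
    """Sketch: BiLSTM over step node features, mean pooling, fusion with the title (dish name) feature."""
    def __init__(self, dim=128):
        super().__init__()
        self.bilstm = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)  # forward + reverse encoding
        self.fuse = nn.Linear(2 * dim + dim, dim)                              # assumed fusion layer

    def forward(self, step_feats, title_feat):
        # step_feats: (1, num_steps, dim) updated step node features; title_feat: (1, dim) Bert title feature
        out, _ = self.bilstm(step_feats)            # (1, num_steps, 2*dim), both encoding directions
        pooled = out.mean(dim=1)                    # sum-and-average over the step sequence
        e_rec = self.fuse(torch.cat([pooled, title_feat], dim=1))  # fused text feature for retrieval
        return e_rec

e_rec = RecipeTextHead()(torch.randn(1, 6, 128), torch.randn(1, 128))  # -> (1, 128)
```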
  • the recipe video serves as the sample video, and any of the above embodiments can be used to encode the recipe video features.
  • the loss function of the above embodiment can be used to guide the training of the video text mutual inspection model to make it converge.
  • the recipe retrieval terminal device 701 may include a display screen, an input interface, an input keyboard, and a wireless transmission module.
  • the input keyboard may be a soft keyboard presented on the display screen.
  • the input interface may be used to realize connection with an external device such as a USB flash drive. There may be multiple input interfaces.
  • the user may input a recipe text to be retrieved or a video to be retrieved to the recipe retrieval terminal device 701 through the input keyboard, or may write the recipe text to be retrieved or the video to be retrieved to a USB flash drive and insert the USB flash drive into the input interface of the recipe retrieval terminal device 701.
  • the user inputs a retrieval request to the recipe retrieval terminal device 701, and the retrieval request carries the recipe text to be retrieved or the recipe video to be retrieved.
  • the recipe retrieval terminal may send the retrieval request to the server 702 through the wireless transmission module.
  • the server 702 retrieves the corresponding database based on the trained model and may feed back the final mutual retrieval result to the recipe retrieval terminal device 701.
  • the recipe retrieval terminal device 701 may display the retrieved recipe text or recipe video to the user through the display screen.
  • the embodiment of the present application also provides a corresponding device for the video text mutual inspection model training method and the video text mutual inspection method, so that the method is more practical.
  • the device can be described from the perspective of functional modules and hardware.
  • the video text mutual inspection model training device and the video text mutual inspection device provided by the embodiment of the present application are introduced below.
  • the video text mutual inspection model training device and the video text mutual inspection device described below can correspond to each other with the video text mutual inspection model training method and the video text mutual inspection method described above.
  • FIG. 9 is a structural diagram of a video text mutual inspection model training device provided in an embodiment of the present application under an optional implementation mode, and the device may include:
  • the text feature acquisition module 901 is configured to acquire text feature information of sample text in each group of training samples in the training sample set, wherein the sample text includes first-category text data, second-category text data, and third-category text data, wherein the second-category text data includes first-category text data, and the third-category text data is used to summarize the second-category text data and the first-category text data;
  • the text feature information includes first-category text features, second-category text features, and third-category text features corresponding to the first-category text data, the second-category text data, and the third-category text data;
  • the first-category text features and the second-category text features determine the node features and connection edges of the heterogeneous graph neural network in the video text mutual inspection model;
  • the video feature generation module 902 is configured to reassemble multiple frames of images extracted from the sample video in each set of training samples to obtain multiple image sets, and the same image is included in different image sets; and generate video features of the sample video according to the image features of different image sets and the correlation between the image sets;
  • the training module 903 is configured to train the video-text mutual retrieval model based on the text features of each group of training samples and the corresponding video features; the text features are obtained by fusing the features of the second-category text data extracted by the heterogeneous graph neural network with the third-category text features.
  • the above-mentioned video feature generation module 902 can also be configured to: obtain image recombination parameters; the image recombination parameters include the total number of image sets and the total number of image frames contained in each image set; according to the image recombination parameters, determine the image frames contained in each image set to perform segmentation processing on the image sequence formed by multiple frames of images.
  • the video feature generation module 902 may further include a video decomposition unit, which is configured to obtain video splitting parameters by parsing video splitting instructions; split the sample video into multiple video segments according to the video splitting parameters; and for each video segment, extract a target image frame for identifying the current video segment.
  • the video decomposition unit may also be configured to extract the first frame image of the current video segment as the target image frame of the current video segment.
  • the above-mentioned video feature generation module 902 may also include a feature extraction unit, which is configured to: pre-train an image feature extraction network; for each image set, input the image frames contained in the current image set into the image feature extraction network to obtain image features corresponding to the current image set; wherein the image feature extraction network includes a first 3D convolution structure, a first downsampling structure, a second 3D convolution structure, a second downsampling structure, a 2D convolution structure and a fully connected layer; the first 3D convolution structure is used to perform a 3D convolution operation on the input information of the image feature extraction network; the first downsampling structure is used to downsample the output features of the first 3D convolution structure; the second 3D convolution structure is used to perform a 3D convolution operation on the output features of the first downsampling structure; the second downsampling structure is used to downsample the features output by the second 3D convolution structure; and the 2D convolution structure is used to perform a 2D convolution operation on the output features of the second downsampling structure.
  • the above-mentioned video feature generation module 902 can also be configured as follows: for each image set, the current initial weight of the current image set is determined based on the image features of the current image set, and the weight coefficient of the current image set is determined based on the current initial weight and the initial weight of each image set; and the video features of the sample video are generated according to the image features of each image set and the corresponding weight coefficients.
  • the above video feature generation module 902 can also be configured to: call the initial weight calculation relationship to calculate the current initial weight of the current image set; the initial weight calculation relationship is a_i = q^T·ReLU(H·y_i), where:
  • a_i is the initial weight of the i-th image set
  • q is a known vector
  • q^T represents the transpose of q
  • ReLU() is the ReLU function
  • H is the weight matrix
  • y_i is the image feature of the i-th image set.
  • the above video feature generation module 902 can also be configured to: call the weight calculation relationship to calculate the weight coefficient of the current image set; the weight calculation relationship is a_i′ = softmax(a_i) = exp(a_i) / Σ_j exp(a_j), where:
  • a i ′ is the weight coefficient of the i-th image set
  • softmax() is the softmax function
  • a j is the initial weight of the j-th image set
  • n is the total number of image sets.
  • the training module 903 may also be configured to: based on the text feature information of each group of training samples and the corresponding video features, call a loss function to guide the training process of the video-text mutual retrieval model, where:
  • N is the number of training sample groups
  • FIG. 10 is a structural diagram of a video text mutual inspection device provided in an embodiment of the present application under an optional implementation mode, and the device may include:
  • the model training module 1001 is configured to pre-train a video text mutual checking model using any of the above-mentioned video text mutual checking model training methods;
  • the video processing module 1002 is configured to reassemble multiple frames of images extracted from the video to be retrieved to obtain multiple image sets, and the same image is included in different image sets; based on the image features of different image sets and the association relationship between the image sets, generate the video features to be matched of the video to be retrieved;
  • the mutual check module 1003 is configured to input the text features to be matched of the text to be retrieved and the above-mentioned video features to be matched into the above-mentioned video-text mutual check model to obtain the video-text mutual check result;
  • the text to be retrieved includes first-category text data, second-category text data and third-category text data, the second-category text data includes the first-category text data, and the third-category text data is used to summarize the second-category text data and the first-category text data;
  • the text features to be matched are fusion features of the features of the second-category text data extracted by the heterogeneous graph neural network of the video-text mutual check model and the third-category text features.
  • this embodiment can effectively improve the accuracy of video-text mutual retrieval.
  • FIG. 11 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present application under one implementation.
  • the electronic device includes a memory 110, which is configured to store a computer program; a processor 111, which is configured to implement the video-text mutual-checking model training method and/or the video-text mutual-checking steps mentioned in any of the above embodiments when executing the computer program.
  • the processor 111 may include one or more processing cores, such as a 4-core processor or an 8-core processor.
  • the processor 111 may also be a controller, a microcontroller, a microprocessor or other data processing chip.
  • the processor 111 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), or PLA (Programmable Logic Array).
  • the processor 111 may also include a main processor and a coprocessor.
  • the main processor is a processor configured to process data in an awake state, also known as a CPU (Central Processing Unit); the coprocessor is a low-power processor configured to process data in a standby state.
  • the processor 111 may be integrated with a GPU (Graphics Processing Unit), which is configured to be responsible for rendering and drawing the content to be displayed on the display screen.
  • the processor 111 may also include an AI (Artificial Intelligence) processor, which is configured to process computing operations related to machine learning.
  • the memory 110 may include one or more non-volatile storage media, which may be non-transitory.
  • the memory 110 may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices and flash memory storage devices.
  • the memory 110 may be an internal storage unit of an electronic device, such as a hard disk of the server 702.
  • the memory 110 may also be an external storage device of an electronic device, such as a plug-in hard disk equipped on the server 702, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), etc.
  • the memory 110 may also include both an internal storage unit of the electronic device and an external storage device.
  • the memory 110 may not only be configured to store application software and various types of data installed in the electronic device, such as: the code of the program in the process of executing the above-mentioned video text mutual inspection model training method and/or the above-mentioned video text mutual inspection method, but may also be configured to temporarily store data that has been output or is to be output.
  • the memory 110 is at least configured to store the following computer program 1101, wherein, after the computer program is loaded and executed by the processor 111, the video text mutual inspection model training method and/or the relevant steps of the video text mutual inspection method disclosed in any of the aforementioned embodiments can be implemented.
  • the resources stored in the memory 110 may also include an operating system 1102 and data 1103, etc., and the storage method may be temporary storage or permanent storage.
  • the operating system 1102 may include Windows, Unix, Linux, etc.
  • Data 1103 may include but is not limited to data generated by the video text mutual inspection model training process and/or data corresponding to the video text mutual inspection results, etc.
  • the electronic device may further include a display screen 112, an input/output interface 113, a communication interface 114 or a network interface, a power supply 115 and a communication bus 116.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, etc.
  • the display may also be appropriately referred to as a display screen or a display unit, which is configured to display information processed in the electronic device and to display a visual user interface.
  • the communication interface 114 may optionally include a wired interface and/or a wireless interface, such as a WI-FI interface, a Bluetooth interface, etc., which is usually used to establish a communication connection between the electronic device and other electronic devices.
  • the communication bus 116 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, etc.
  • the bus can be divided into an address bus, a data bus, a control bus, etc.
  • for ease of illustration, FIG. 11 uses only one thick line to represent the bus, but this does not mean that there is only one bus or only one type of bus.
  • the structure shown in FIG. 11 does not limit the electronic device, which may include more or fewer components than shown in the figure, for example, a sensor 117 for implementing various functions.
  • this embodiment can effectively improve the accuracy of video-text mutual retrieval.
  • if the video text mutual inspection model training method and/or the video text mutual inspection method in the above-mentioned embodiments is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a non-volatile storage medium.
  • in essence, the technical solution of the embodiments of the present application, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a non-volatile storage medium and is used to execute all or part of the steps of the methods in the various embodiments of the present application.
  • the aforementioned non-volatile storage medium includes: a U disk, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), an electrically erasable programmable ROM, a register, a hard disk, a multimedia card, a card-type memory (such as SD or DX memory), a magnetic memory, a removable disk, a CD-ROM, a magnetic disk, an optical disk, and various other non-volatile storage media that can store program code.
  • an embodiment of the present application also provides a non-volatile storage medium storing a computer program.
  • when the above computer program is executed by a processor, it implements the steps of the video-text mutual-checking model training method and/or the video-text mutual-checking method in any of the above embodiments.
  • each embodiment is described in a progressive manner, and each embodiment focuses on the differences from other embodiments.
  • the same or similar parts between the embodiments can be referred to each other.
  • for the devices disclosed in the embodiments, since they correspond to the methods disclosed in the embodiments, the description is relatively simple, and the relevant parts can be referred to the method part.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present application are applied to the technical field of information retrieval. Disclosed are a video-text mutual retrieval method and apparatus, a training method and apparatus for a video-text mutual retrieval model, and a device and a medium. The training method for a video-text mutual retrieval model comprises: acquiring text feature information of sample text in each group of training samples in a training sample set, and on the basis of the text feature information, determining node features and edges of a heterogeneous graph neural network in a video-text mutual retrieval model; for a sample video in each group of training samples, re-combining a plurality of image frames extracted from the sample video, so as to obtain a plurality of image sets; generating a video feature according to image features of different image sets and an association relationship between the image sets; and training the video-text mutual retrieval model on the basis of a text feature in which a third-type text feature and a feature, which is extracted by using the heterogeneous graph neural network, of second-type text data are fused, and a corresponding video feature. The present application can effectively improve the precision of video-text mutual retrieval.

Description

视频文本互检方法及其模型训练方法、装置、设备、介质Video text mutual inspection method and model training method, device, equipment, and medium thereof
相关申请的交叉引用CROSS-REFERENCE TO RELATED APPLICATIONS
本申请要求于2022年11月08日提交中国专利局,申请号为202211388901.3,申请名称为“视频文本互检方法及其模型训练方法、装置、设备、介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims priority to the Chinese patent application filed with the China Patent Office on November 8, 2022, with application number 202211388901.3, and application name “Video-text mutual inspection method and its model training method, device, equipment, and medium”, all of which are incorporated by reference in this application.
技术领域Technical Field
本申请实施例涉及信息检索技术领域,特别是涉及一种视频文本互检模型训练方法及装置、视频文本互检方法及装置、电子设备及非易失性存储介质。The embodiments of the present application relate to the field of information retrieval technology, and in particular to a video text mutual inspection model training method and device, a video text mutual inspection method and device, an electronic device and a non-volatile storage medium.
背景技术Background technique
随着计算机技术以及网络技术被广泛地应用在日常工作生活中,数据呈现数量及多样性的显著增长,文本类数据如新闻报道、微博淘宝等评论数据、微信聊天记录等,图像数据如表情包、文章配图、手机照片、医疗影像等,视频数据如各种视频播放器的电视、电影,以及小视频如抖音、快手等,摄像头采集的数据等,音频数据如各种语音播报、微信语音、视频配音等。这些不同多媒体形式的数据通常还共同用于描述同一物体或同一场景。为了方便管理多样的多媒体内容,不同媒体间实现灵活检索的方法应用而生。As computer technology and network technology are widely used in daily work and life, the amount and diversity of data have increased significantly, including text data such as news reports, Weibo and Taobao comment data, WeChat chat records, etc., image data such as emoticons, article illustrations, mobile phone photos, medical images, etc., video data such as TV and movies from various video players, and short videos such as Douyin and Kuaishou, data collected by cameras, etc., and audio data such as various voice broadcasts, WeChat voice, video dubbing, etc. These different multimedia forms of data are often used together to describe the same object or the same scene. In order to facilitate the management of diverse multimedia content, methods for flexible retrieval between different media have been developed.
其中,对于视频数据和文本数据之间的互检索,相关技术并不以视频为直接处理对象,通常是将视频数据分割为多帧图像数据,然后对图像数据进行处理。在图像处理过程中,相关技术利用注意力方法将提取到的图像特征加权到文本特征中,对文本特征进行重构,增强文本与图像之间的相似性。该方法虽然能够利用注意力重构电子文本特征。但是,其只是简单地在重构电子文本特征时使用自然图像对电子文本的单向注意力,由于自然图像与电子文本存在对应关系,相互对应的高阶特征间互相影响,仅仅重构电子文本特征而忽略自然图像特征,使得自然图像特征无法准确与电子文本特征对应,影响视频文本互相检索。Among them, for the mutual retrieval between video data and text data, the relevant technology does not directly process the video. It usually divides the video data into multiple frames of image data and then processes the image data. In the image processing process, the relevant technology uses the attention method to weight the extracted image features to the text features, reconstruct the text features, and enhance the similarity between the text and the image. Although this method can reconstruct the electronic text features using attention. However, it simply uses the unidirectional attention of natural images to electronic texts when reconstructing the electronic text features. Since there is a corresponding relationship between natural images and electronic texts, the corresponding high-order features affect each other. Only reconstructing the electronic text features while ignoring the natural image features makes it impossible for the natural image features to accurately correspond to the electronic text features, affecting the mutual retrieval of video texts.
鉴于此,如何有效提高视频文本互检索精度,是所属领域技术人员需要解决的技术问题。In view of this, how to effectively improve the accuracy of video-text mutual retrieval is a technical problem that technical personnel in the relevant field need to solve.
发明内容Summary of the invention
本申请实施例提供了一种视频文本互检模型训练方法及装置、视频文本互检方法及装置、电子设备及非易失性存储介质,可有效提高视频文本互检索精度。The embodiments of the present application provide a video-text mutual-check model training method and device, a video-text mutual-check method and device, an electronic device and a non-volatile storage medium, which can effectively improve the accuracy of video-text mutual retrieval.
为解决上述技术问题,本申请实施例提供以下技术方案:To solve the above technical problems, the present application provides the following technical solutions:
本申请实施例第一方面提供了一种视频文本互检模型训练方法,包括:A first aspect of an embodiment of the present application provides a video text mutual inspection model training method, comprising:
获取训练样本集的每组训练样本中的样本文本的文本特征信息;上述样本文本包括第一类文本数据、第二类文本数据及第三类文本数据,上述第二类文本数据包括第一类文本数据,且上述第三类文本数据用于概括上述第二类文本数据和上述第一类文本数据;上述文本特征信息包括上述第一类文本数据、上述第二类文本数据和第三类文本数据对应的第一类文本特征、第二类文本特征和第三类文本特征;上述第一类文本特征和上述第二类文本特征确定视频文本互检模型中的异质图神经网络的节点特征和连接边;Obtaining text feature information of sample text in each group of training samples in the training sample set; the sample text includes first-category text data, second-category text data and third-category text data, the second-category text data includes first-category text data, and the third-category text data is used to summarize the second-category text data and the first-category text data; the text feature information includes first-category text features, second-category text features and third-category text features corresponding to the first-category text data, the second-category text data and the third-category text data; the first-category text features and the second-category text features determine the node features and connection edges of the heterogeneous graph neural network in the video text mutual inspection model;
对每组训练样本中的样本视频,将从上述样本视频中提取的多帧图像进行重新组合,以得到多个图像集,且同一张图像被包含在不同图像集中;For the sample videos in each group of training samples, multiple frames of images extracted from the sample videos are reassembled to obtain multiple image sets, and the same image is included in different image sets;
根据不同图像集的图像特征及各图像集之间的关联关系,生成上述样本视频的视频特征;Generate video features of the sample video according to image features of different image sets and correlations between the image sets;
基于每组训练样本的文本特征及相应的视频特征,训练上述视频文本互检模型;上述文本特征为利用上述异质图神经网络提取上述第二类文本数据的特征和上述第三类文本特征的融合特征。The video-text mutual inspection model is trained based on the text features and corresponding video features of each group of training samples; the text features are fused features of the features of the second category of text data extracted using the heterogeneous graph neural network and the third category of text features.
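The embodiment does not fix a particular fusion operator for combining the third-class text feature with the second-class text features produced by the heterogeneous graph neural network. The following minimal sketch (Python/PyTorch) therefore assumes mean pooling of the graph node features followed by concatenation and a linear projection; the names and the fusion choice are illustrative, not prescribed by the embodiment.

```python
import torch
import torch.nn as nn

class TextFeatureFusion(nn.Module):
    """Fuses the third-class (summary) text feature with the second-class text
    features produced by the heterogeneous graph neural network.
    Mean pooling + concatenation + linear projection is an assumed fusion operator."""
    def __init__(self, dim_graph, dim_summary, dim_out):
        super().__init__()
        self.proj = nn.Linear(dim_graph + dim_summary, dim_out)

    def forward(self, graph_feats, summary_feat):
        # graph_feats: (num_second_class_nodes, dim_graph) from the graph network
        # summary_feat: (dim_summary,) feature of the third-class (summary) text
        pooled = graph_feats.mean(dim=0)              # aggregate the node features
        fused = torch.cat([pooled, summary_feat], 0)  # concatenate the two sources
        return self.proj(fused)                       # final text feature used for matching
```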
可选的,上述将从上述样本视频中提取的多帧图像进行重新组合,以得到多个图像集,包括:Optionally, the multiple frames of images extracted from the sample video are reassembled to obtain multiple image sets, including:
获取图像重组合参数;上述图像重组合参数包括图像集总数以及各图像集包含的图像帧总数;Obtaining image recombining parameters; the image recombining parameters include the total number of image sets and the total number of image frames contained in each image set;
根据上述图像重组合参数,确定每个图像集所包含的图像帧,以对由多帧图像形成的图像序列进行分割处理。According to the above-mentioned image recombining parameters, the image frames included in each image set are determined to perform segmentation processing on the image sequence formed by multiple frames of images.
可选的,各图像集所包含图像帧总数相同,上述根据上述图像重组合参数,确定每个图像集所包含的图像帧,包括:Optionally, the total number of image frames included in each image set is the same, and the above-mentioned determining the image frames included in each image set according to the above-mentioned image recombination parameters includes:
对第一个图像集,根据上述图像帧总数和上述图像序列的第一帧图像确定上述第一个图像集所包含的图像帧;For the first image set, determining the image frames included in the first image set according to the total number of image frames and the first frame of the image sequence;
调用图像分割关系式,确定相邻图像集的图像帧序号差;上述图像分割关系式为:m+nk=N;The image segmentation relational formula is called to determine the difference in the image frame sequence numbers of adjacent image sets; the above image segmentation relational formula is: m+nk=N;
对其余各图像集,基于当前图像集的上一个图像集所包含的图像帧和上述图像帧序号差,确定相应图像集所包含的图像帧;For each of the remaining image sets, based on the image frames included in the previous image set of the current image set and the image frame sequence number difference, the image frames included in the corresponding image set are determined;
式中,m为各图像集所包含图像帧总数,N为上述图像序列所包含图像帧总数,n为图像集总数,k为图像帧序号差,且其为整数。Wherein, m is the total number of image frames included in each image set, N is the total number of image frames included in the above image sequence, n is the total number of image sets, and k is the image frame sequence number difference, which is an integer.
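A minimal sketch of the cross-segmentation just described, assuming frames are indexed from 0 and that the frame-number difference k is obtained by integer division when m + nk = N has no exact integer solution; both details are assumptions beyond what the embodiment specifies.

```python
def cross_split(frame_indices, n_sets, m):
    """Split a frame index sequence of length N into n_sets image sets of m frames
    each; adjacent sets are offset by k frames, with k derived from m + n*k = N."""
    N = len(frame_indices)
    assert m <= N
    k = (N - m) // n_sets            # frame sequence number difference between adjacent sets
    image_sets = []
    for i in range(n_sets):
        start = i * k                # the first set starts at the first frame of the sequence
        image_sets.append(frame_indices[start:start + m])
    return image_sets

# e.g. 16 extracted frames, 4 sets of 8 frames -> k = 2, adjacent sets share 6 frames
sets = cross_split(list(range(16)), n_sets=4, m=8)
```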
可选的,上述将从上述样本视频中提取的多帧图像进行重新组合,以得到多个图像集,包括:Optionally, the multiple frames of images extracted from the sample video are reassembled to obtain multiple image sets, including:
通过解析视频拆分指令,获取视频拆分参数;Obtain video splitting parameters by parsing the video splitting instruction;
按照上述视频拆分参数,将上述样本视频拆分为多个视频段;According to the above video splitting parameters, split the above sample video into multiple video segments;
对每个视频段,提取用于标识当前视频段的目标图像帧。For each video segment, a target image frame for identifying the current video segment is extracted.
可选的,上述提取用于标识当前视频段的目标图像帧,包括:Optionally, the above-mentioned extraction for identifying the target image frame of the current video segment includes:
提取上述当前视频段的第一帧图像,以作为上述当前视频段的目标图像帧。The first frame image of the current video segment is extracted to serve as the target image frame of the current video segment.
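A minimal sketch of this alternative, assuming the decoded frames are available as a Python list and that any overlap between adjacent video segments is given as a frame count; both are illustrative choices, since the embodiment only requires that each segment contributes its first frame as the target image frame.

```python
def split_video_frames(frames, num_segments, overlap=0):
    """Split the decoded frame list into num_segments video segments (optionally
    overlapping by `overlap` frames) and return the first frame of each segment
    as the target image frame that identifies that segment."""
    seg_len = len(frames) // num_segments
    segments = [frames[i * seg_len - (overlap if i else 0):(i + 1) * seg_len]
                for i in range(num_segments)]
    target_frames = [seg[0] for seg in segments]   # first frame identifies the segment
    return segments, target_frames
```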
可选的,上述根据不同图像集的图像特征及各图像集之间的关联关系,生成上述样本视频的视频特征,包括:Optionally, the generating of the video features of the sample video according to the image features of different image sets and the association relationship between the image sets includes:
预先训练图像特征提取网络;Pre-train image feature extraction network;
对每个图像集,将当前图像集所包含的图像帧均输入至上述图像特征提取网络,得到上述当前图像集对应的图像特征;For each image set, the image frames contained in the current image set are input into the image feature extraction network to obtain the image features corresponding to the current image set;
其中,上述图像特征提取网络包括第一3D卷积结构、第一降采样结构、第二3D卷积结构、第二降采样结构、2D卷积结构和全连接层;The image feature extraction network includes a first 3D convolution structure, a first downsampling structure, a second 3D convolution structure, a second downsampling structure, a 2D convolution structure and a fully connected layer.
上述第一3D卷积结构用于对上述图像特征提取网络的输入信息进行3D卷积操作;上述第一降采样结构用于对上述第一3D卷积结构的输出特征进行降采样操作;上述第二3D卷积结构用于对上述第一降采样结构的输出特征进行3D卷积操作;上述第二降采样结构用于对上述第二3D卷积结构输出的特征进行降采样操作;上述2D卷积结构用于对上述第二降采样结构的输出特征进行2D卷积操作。The above-mentioned first 3D convolution structure is used to perform a 3D convolution operation on the input information of the above-mentioned image feature extraction network; the above-mentioned first downsampling structure is used to perform a downsampling operation on the output features of the above-mentioned first 3D convolution structure; the above-mentioned second 3D convolution structure is used to perform a 3D convolution operation on the output features of the above-mentioned first downsampling structure; the above-mentioned second downsampling structure is used to perform a downsampling operation on the features output by the above-mentioned second 3D convolution structure; the above-mentioned 2D convolution structure is used to perform a 2D convolution operation on the output features of the above-mentioned second downsampling structure.
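The embodiment fixes the layer sequence (first 3D convolution, first downsampling, second 3D convolution, second downsampling, 2D convolution, fully connected layer) but not the kernel sizes, channel widths or the downsampling operator. The sketch below (Python/PyTorch) fills those in with assumed values and uses max pooling for the two downsampling structures.

```python
import torch
import torch.nn as nn

class ImageSetFeatureExtractor(nn.Module):
    """3D conv -> downsample -> 3D conv -> downsample -> 2D conv -> fully connected,
    following the layer order in the embodiment; kernel sizes, channel widths and
    max pooling as the downsampling operator are assumptions."""
    def __init__(self, in_channels=3, feat_dim=512, frames=8, height=112, width=112):
        super().__init__()
        self.conv3d_1 = nn.Conv3d(in_channels, 32, kernel_size=3, padding=1)
        self.down_1 = nn.MaxPool3d(kernel_size=2)
        self.conv3d_2 = nn.Conv3d(32, 64, kernel_size=3, padding=1)
        self.down_2 = nn.MaxPool3d(kernel_size=2)
        self.conv2d = nn.Conv2d(64 * (frames // 4), 128, kernel_size=3, padding=1)
        self.fc = nn.Linear(128 * (height // 4) * (width // 4), feat_dim)

    def forward(self, clip):                      # clip: (B, C, T, H, W), one image set
        x = self.down_1(torch.relu(self.conv3d_1(clip)))
        x = self.down_2(torch.relu(self.conv3d_2(x)))
        b, c, t, h, w = x.shape
        x = x.reshape(b, c * t, h, w)             # fold the time axis into channels for the 2D conv
        x = torch.relu(self.conv2d(x))
        return self.fc(x.flatten(1))              # image feature y_i of the image set
```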
可选的,上述根据不同图像集的图像特征及各图像集之间的关联关系,生成上述样本视频的视频特征,包括:对每个图像集,基于当前图像集的图像特征确定上述当前图像集的当前初始权重,并基于上述当前初始权重和每个图像集的初始权重确定上述当前图像集的权重系数;Optionally, the generating of the video features of the sample video according to the image features of different image sets and the association relationship between the image sets includes: for each image set, determining the current initial weight of the current image set based on the image features of the current image set, and determining the weight coefficient of the current image set based on the current initial weight and the initial weight of each image set;
根据各图像集的图像特征及相应的权重系数,生成上述样本视频的视频特征。The video features of the sample video are generated according to the image features of each image set and the corresponding weight coefficients.
可选的,上述基于当前图像集的图像特征确定上述当前图像集的当前初始权重,包括:Optionally, the determining of the current initial weight of the current image set based on the image features of the current image set includes:
调用初始权重计算关系式,计算上述当前图像集的当前初始权重;上述初始权重计算关系式为:The initial weight calculation formula is called to calculate the current initial weight of the current image set; the initial weight calculation formula is:
a_i = q^T ReLU(H·y_i);
式中,ai为第i个图像集的初始权重,q为已知向量,qT表示q的转置,ReLU()为ReLU函数,H为权重矩阵,yi为第i个图像集的图像特征。Where ai is the initial weight of the i-th image set, q is a known vector, qT represents the transpose of q, ReLU() is the ReLU function, H is the weight matrix, and yi is the image feature of the i-th image set.
可选的,上述基于上述当前初始权重和每个图像集的初始权重确定上述当前图像集的权重系数,包括:Optionally, the determining of the weight coefficient of the current image set based on the current initial weight and the initial weight of each image set includes:
调用权重计算关系,计算上述当前图像集的权重系数;上述权重计算关系式为:The weight calculation relationship is called to calculate the weight coefficient of the current image set; the weight calculation relationship is:
a_i′ = softmax(a_i) = exp(a_i) / Σ_{j=1…n} exp(a_j);
式中,a i′为第i个图像集的权重系数,a i为第i个图像集的初始权重,softmax()为softmax函数,a j为第j个图像集的初始权重,n为图像集总数。 Where a i ′ is the weight coefficient of the i-th image set, a i is the initial weight of the i-th image set, softmax() is the softmax function, a j is the initial weight of the j-th image set, and n is the total number of image sets.
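A minimal sketch (Python/PyTorch) that strings the two relationships together and then aggregates the image-set features into the video feature by a weighted sum, which is how the surrounding text describes generating the video feature from the image features and their weight coefficients; treating q and H as learnable parameters is an assumption.

```python
import torch
import torch.nn as nn

class ImageSetAttention(nn.Module):
    """Computes a_i = q^T ReLU(H y_i) for every image set, normalizes the weights
    with softmax to obtain a_i', and aggregates the image-set features into one
    video feature by a weighted sum."""
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.H = nn.Linear(feat_dim, hidden_dim, bias=False)    # weight matrix H
        self.q = nn.Parameter(torch.randn(hidden_dim))          # vector q

    def forward(self, y):                          # y: (n, feat_dim), one row per image set
        a = torch.relu(self.H(y)) @ self.q         # initial weights a_i
        a_prime = torch.softmax(a, dim=0)          # weight coefficients a_i'
        video_feature = (a_prime.unsqueeze(1) * y).sum(dim=0)
        return video_feature, a_prime
```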
可选的,上述基于每组训练样本的文本特征及相应的视频特征,训练上述视频文本互检模型,包括:Optionally, the video text mutual inspection model is trained based on the text features and corresponding video features of each group of training samples, including:
基于每组训练样本的文本特征信息及相应的视频特征,调用损失函数指导视频文本互检模型的训练过程;上述损失函数为:Based on the text feature information and corresponding video features of each set of training samples, a loss function is called to guide the training process of the video text mutual inspection model; the above loss function is:
L = (1/N) Σ_{a=1…N} [ max(0, γ + d(V_a, T_p) − min_n d(V_a, T_n)) + max(0, γ + d(T_a, V_p) − min_n d(T_a, V_n)) ];
In the formula, L is the above loss function, N is the number of training sample groups, min d() represents the minimum value of the calculated distance, V_a is the a-th sample video among all the sample videos contained in the above training sample set, T_p is the p-th sample text among all the sample texts contained in the above training sample set and it corresponds to the a-th sample video, T_n is the n-th sample text among all sample text data and it does not correspond to the a-th sample video, T_a is the a-th sample text among all sample text data, V_p is the p-th sample video among all sample videos and it corresponds to the a-th sample text, V_n is the n-th sample video among all sample video data and it does not correspond to the a-th sample text, and γ is a hyperparameter.
可选的,上述将从上述样本视频中提取的多帧图像进行重新组合,以得到多个图像集,包括:将上述多帧图像按照提取的顺序整合为一个图像序列,通过对上述图像序列进行交叉分割得到上述多个图像集。Optionally, the above-mentioned recombining multiple frames of images extracted from the above-mentioned sample video to obtain multiple image sets includes: integrating the above-mentioned multiple frames of images into an image sequence according to the order of extraction, and obtaining the above-mentioned multiple image sets by cross-segmenting the above-mentioned image sequence.
可选的,上述将从上述样本视频中提取的多帧图像进行重新组合,以得到多个图像集,包括:将上述多帧图像随机整合为一个图像序列,通过对上述图像序列进行分割得到上述多个图像集。Optionally, the above-mentioned recombining the multiple frames of images extracted from the above-mentioned sample video to obtain multiple image sets includes: randomly integrating the above-mentioned multiple frames of images into an image sequence, and obtaining the above-mentioned multiple image sets by segmenting the above-mentioned image sequence.
可选的,上述将从上述样本视频中提取的多帧图像进行重新组合,以得到多个图像集,包括:将上述多帧图像随机分配至不同的图像集。Optionally, the above-mentioned recombining the multiple frames of images extracted from the above-mentioned sample video to obtain multiple image sets includes: randomly allocating the above-mentioned multiple frames of images to different image sets.
可选的,上述视频拆分参数包括上述样本视频的拆分段数以及上述样本视频的标识信息。Optionally, the video splitting parameters include the number of segments of the sample video and identification information of the sample video.
可选的,上述多个视频段相互重叠。Optionally, the multiple video segments overlap with each other.
本申请实施例第二方面提供了一种视频文本互检模型训练装置,包括:A second aspect of an embodiment of the present application provides a video text mutual inspection model training device, comprising:
文本特征获取模块,被设置为获取训练样本集的每组训练样本中的样本文本的文本特征信息;上述样本文本包括第一类文本数据、第二类文本数据及第三类文本数据,上述第二类文本数据包括第一类文本数据,且上述第三类文本数据用于概括上述第二类文本数据和上述第一类文本数据;上述文本特征信息包括上述第一类文本数据、上述第二类文本数据和第三类文本数据对应的第一类文本特征、第二类文本特征和第三类文本特征;上述第一类文本特征和上述第二类文本特征确定视频文本互检模型中的异质图神经网络的节点特征和连接边;A text feature acquisition module is configured to acquire text feature information of sample text in each group of training samples in a training sample set; the sample text includes first-category text data, second-category text data and third-category text data, the second-category text data includes first-category text data, and the third-category text data is used to summarize the second-category text data and the first-category text data; the text feature information includes first-category text features, second-category text features and third-category text features corresponding to the first-category text data, the second-category text data and the third-category text data; the first-category text features and the second-category text features determine node features and connection edges of a heterogeneous graph neural network in a video text mutual inspection model;
视频特征生成模块,被设置为对每组训练样本中的样本视频,将从上述样本视频中提取的多帧图像进行重新组合,以得到多个图像集,且同一张图像被包含在不同图像集中;根据不同图像集的图像特征及各图像集之间的关联关系,生成上述样本视频的视频特征;The video feature generation module is configured to reassemble multiple frames of images extracted from the sample video in each set of training samples to obtain multiple image sets, wherein the same image is included in different image sets; and generate video features of the sample video according to image features of different image sets and correlations between the image sets;
训练模块,被设置为基于每组训练样本的文本特征及相应的视频特征,训练上述视频文本互检模型;上述文本特征为利用上述异质图神经网络提取上述第二类文本数据的特征和上述第三类文本特征的融合特征。The training module is configured to train the video-text mutual inspection model based on the text features of each group of training samples and the corresponding video features; the text features are fused features of the features of the second category of text data extracted using the heterogeneous graph neural network and the third category of text features.
本申请实施例第三方面提供了一种视频文本互检方法,包括:A third aspect of the embodiment of the present application provides a video text mutual inspection method, including:
预先利用如前任意一项上述的视频文本互检模型训练方法,训练得到视频文本互检模型;Preliminarily train a video text mutual inspection model using any of the above-mentioned video text mutual inspection model training methods;
将从待检索视频中提取的多帧图像进行重新组合,以得到多个图像集,且同一张图像被包含在不同图像集中;Recombining multiple frames of images extracted from the video to be retrieved to obtain multiple image sets, and the same image is included in different image sets;
根据不同图像集的图像特征及各图像集之间的关联关系,生成上述待检索视频的待匹配视频特征;Generate matching video features of the video to be retrieved based on the image features of different image sets and the association relationship between the image sets;
将待检索文本的待匹配文本特征和上述待匹配视频特征,输入至上述视频文本互检模型,得到视频文本互检结果;上述待检索文本包括第一类文本数据、第二类文本数据及第三类文本数据,上述第二类文本数据包括第一类文本数据,且上述第三类文本数据用于概括上述第二类文本数据和上述第一类文本数据;上述待匹配文本特征为利用上述视频文本互检模型的异质图神经网络提取上述第二类文本数据的特征和上述第三类文本特征的融合特征。The text features to be matched of the text to be retrieved and the above-mentioned video features to be matched are input into the above-mentioned video-text mutual checking model to obtain the video-text mutual checking results; the above-mentioned text to be retrieved includes first-category text data, second-category text data and third-category text data, the above-mentioned second-category text data includes first-category text data, and the above-mentioned third-category text data is used to summarize the above-mentioned second-category text data and the above-mentioned first-category text data; the above-mentioned text features to be matched are the fusion features of the features of the above-mentioned second-category text data and the above-mentioned third-category text features extracted by the heterogeneous graph neural network of the above-mentioned video-text mutual checking model.
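Once the model has produced the to-be-matched text feature and the to-be-matched video features of the candidate videos, the mutual check result can be obtained by ranking similarities. The sketch below (Python/PyTorch) assumes cosine similarity and top-k ranking, which the embodiment does not prescribe; video-to-text retrieval is obtained by swapping the roles of the two inputs.

```python
import torch

def retrieve_videos(text_feature, video_features, top_k=5):
    """Rank candidate videos for one query text by cosine similarity.
    text_feature: (d,) to-be-matched text feature from the mutual-check model.
    video_features: (num_videos, d) to-be-matched video features."""
    text = torch.nn.functional.normalize(text_feature.unsqueeze(0), dim=1)
    videos = torch.nn.functional.normalize(video_features, dim=1)
    scores = (videos @ text.t()).squeeze(1)          # similarity of each video to the text
    return torch.topk(scores, k=min(top_k, scores.numel()))
```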
本申请实施例第四方面提供了一种视频文本互检装置,包括:A fourth aspect of the embodiments of the present application provides a video text mutual inspection device, including:
模型训练模块,被设置为预先如前任意一项上述的视频文本互检模型训练方法,训练得到视频文本互检模型;The model training module is configured to be preliminarily trained to obtain a video text mutual inspection model using any of the above-mentioned video text mutual inspection model training methods;
视频处理模块,被设置为将从待检索视频中提取的多帧图像进行重新组合,以得到多个图像集,且同一张图像被包含在不同图像集中;根据不同图像集的图像特征及各图像集之间的关联关系,生成上述待检索视频的待匹配视频特征;The video processing module is configured to reassemble multiple frames of images extracted from the video to be retrieved to obtain multiple image sets, and the same image is included in different image sets; based on the image features of different image sets and the association relationship between the image sets, the video features to be matched of the video to be retrieved are generated;
互检模块,被设置为将待检索文本的待匹配文本特征和上述待匹配视频特征,输入至上述视频文本互检模型,得到视频文本互检结果;上述待检索文本包括第一类文本数据、第二类文本数据及第三类文本数据,上述第二类文本数据包括第一类文本数据,且上述第三类文本数据用于概括上述第二类文本数据和上述第一类文本数据;上述待匹配文本特征为利用上述视频文本互检模型的异质图神经网络提取上述第二类文本数据的特征和上述第三类文本特征的融合特征。The mutual check module is configured to input the text features to be matched of the text to be retrieved and the above-mentioned video features to be matched into the above-mentioned video-text mutual check model to obtain the video-text mutual check result; the above-mentioned text to be retrieved includes first-category text data, second-category text data and third-category text data, the above-mentioned second-category text data includes first-category text data, and the above-mentioned third-category text data is used to summarize the above-mentioned second-category text data and the above-mentioned first-category text data; the above-mentioned text features to be matched are the fusion features of the features of the above-mentioned second-category text data and the above-mentioned third-category text features extracted by the heterogeneous graph neural network of the above-mentioned video-text mutual check model.
本申请实施例还提供了一种电子设备,包括处理器,上述处理器被设置为执行存储器中存储的计算机程序时实现如前任一项上述视频文本互检模型训练方法和/或如前上述视频文本互检方法的步骤。An embodiment of the present application also provides an electronic device, including a processor, wherein the processor is configured to implement the steps of any of the above-mentioned video-text mutual-checking model training methods and/or the above-mentioned video-text mutual-checking methods when executing a computer program stored in a memory.
本申请实施例最后还提供了一种非易失性存储介质,上述非易失性存储介质上存储有计算机程序,上述计算机程序被处理器执行时实现如前任一项上述视频文本互检模型训练方法和/或如前上述视频文本互检方法的步骤。Finally, an embodiment of the present application further provides a non-volatile storage medium, on which a computer program is stored. When the computer program is executed by a processor, the steps of the video-text mutual-checking model training method and/or the video-text mutual-checking method as described above are implemented.
本申请实施例提供的技术方案的优点在于,将不同文本类型作为图神经网络的异构节点,采用图神经网络有利于提取更深层次、更丰富的文本特征,将概括文本数据的第三类文本数据和第二类文本数据的融合特征作为执行匹配任务的文本特征,可挖掘文本数据之间的内在关系,进而有利于提升视频文本互检索 的精度。将从视频数据中提取的图像帧进行重新组合后再提取图像视频,有利于获取到可更加精准反映视频的图像特征,在确定视频特征的过程中同时还考虑到不同图像帧之间的关联关系,有利于得到更加准确的视频特征,从而文本视频互检索精度。The advantage of the technical solution provided by the embodiment of the present application is that different text types are used as heterogeneous nodes of the graph neural network, and the use of the graph neural network is conducive to extracting deeper and richer text features. The fusion features of the third type of text data and the second type of text data that summarize the text data are used as text features for performing matching tasks, which can mine the intrinsic relationship between text data, thereby facilitating the improvement of the accuracy of video-text mutual retrieval. Recombining the image frames extracted from the video data and then extracting the image video is conducive to obtaining image features that can more accurately reflect the video. In the process of determining the video features, the correlation between different image frames is also considered, which is conducive to obtaining more accurate video features, thereby improving the accuracy of text-video mutual retrieval.
此外,本申请实施例还针对视频文本互检模型训练方法,提供了相应的实现装置、电子设备及非易失性存储介质,以及视频文本互检方法及装置、使得上述方法更具有实用性,上述装置、电子设备、非易失性存储介质视频文本互检方法及装置均具有相应的优点。In addition, the embodiments of the present application also provide corresponding implementation devices, electronic devices and non-volatile storage media, as well as video text mutual checking methods and devices for video text mutual checking model training methods, making the above methods more practical. The above devices, electronic devices, non-volatile storage medium video text mutual checking methods and devices all have corresponding advantages.
应当理解的是,以上的一般描述和后文的细节描述仅是示例性的,并不能限制本公开。It is to be understood that the foregoing general description and the following detailed description are exemplary only and are not restrictive of the present disclosure.
附图说明BRIEF DESCRIPTION OF THE DRAWINGS
为了更清楚的说明本申请实施例或相关技术的技术方案,下面将对实施例或相关技术描述中所需要使用的附图作简单的介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。In order to more clearly illustrate the technical solutions of the embodiments of the present application or the related technologies, the drawings required for use in the embodiments or the related technical descriptions are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application. For ordinary technicians in this field, other drawings can be obtained based on these drawings without paying creative work.
图1为本申请实施例提供的一种视频文本互检模型训练方法的流程示意图;FIG1 is a flow chart of a video text mutual inspection model training method provided in an embodiment of the present application;
图2为本申请实施例提供的一种构建异质图神经网络的示意图;FIG2 is a schematic diagram of constructing a heterogeneous graph neural network provided in an embodiment of the present application;
图3为本申请实施例提供的多帧图像重新组合生成的多个图像集的示意图;FIG3 is a schematic diagram of multiple image sets generated by recombining multiple frame images according to an embodiment of the present application;
图4为本申请实施例提供的样本视频切割流程示意图;FIG4 is a schematic diagram of a sample video cutting process provided in an embodiment of the present application;
图5为本申请实施例提供的图像特征提取示意图;FIG5 is a schematic diagram of image feature extraction provided in an embodiment of the present application;
图6为本申请实施例提供的一种视频文本互检方法的流程示意图;FIG6 is a flow chart of a video text mutual inspection method provided in an embodiment of the present application;
图7为本申请实施例提供的一个示例性应用场景的视频文本互检模型框架示意图;FIG7 is a schematic diagram of a video text mutual inspection model framework for an exemplary application scenario provided by an embodiment of the present application;
图8为本申请实施例提供的一个示例性应用场景的系统结构框架示意图;FIG8 is a schematic diagram of a system structure framework of an exemplary application scenario provided in an embodiment of the present application;
图9为本申请实施例提供的视频文本互检模型训练装置的一种可选的实施方式结构图;FIG9 is a structural diagram of an optional implementation of a video text mutual inspection model training device provided in an embodiment of the present application;
图10为本申请实施例提供的视频文本互检装置的一种可选的实施方式结构图;FIG10 is a structural diagram of an optional implementation of a video text mutual inspection device provided in an embodiment of the present application;
图11为本申请实施例提供的电子设备的一种可选的实施方式结构图。FIG. 11 is a structural diagram of an optional implementation of an electronic device provided in an embodiment of the present application.
具体实施方式Detailed ways
为了使本技术领域的人员更好地理解本申请实施例方案,下面结合附图和可选的实施方式对本申请实施例作详细说明。显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请实施例保护的范围。In order to make those skilled in the art better understand the embodiments of the present application, the embodiments of the present application are described in detail below in conjunction with the accompanying drawings and optional implementation methods. Obviously, the described embodiments are only part of the embodiments of the present application, rather than all of the embodiments. Based on the embodiments in the present application, all other embodiments obtained by ordinary technicians in the field without making creative work are within the scope of protection of the embodiments of the present application.
本申请实施例的说明书和权利要求书及上述附图中的术语“第一”、“第二”、“第三”、“第四”等是用于区别不同的对象,而不是用于描述特定的顺序。此外术语“包括”和“具有”以及他们任何变形,意图在于覆盖不排他的包含。例如包含了一系列步骤或单元的过程、方法、系统、产品或设备没有限定于已列出的步骤或单元,而是可包括没有列出的步骤或单元。The terms "first", "second", "third", "fourth", etc. in the description and claims of the embodiments of the present application and the above-mentioned drawings are used to distinguish different objects rather than to describe a specific order. In addition, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusions. For example, a process, method, system, product or device that includes a series of steps or units is not limited to the listed steps or units, but may include steps or units that are not listed.
在介绍了本申请实施例的技术方案后,下面详细的说明本申请实施例的各种非限制性实施方式。After introducing the technical solutions of the embodiments of the present application, various non-limiting implementation methods of the embodiments of the present application are described in detail below.
首先参见图1,图1为本申请实施例提供的一种视频文本互检模型训练方法的流程示意图,本申请实施例可包括以下内容:First, refer to FIG. 1 , which is a flow chart of a video text mutual inspection model training method provided by an embodiment of the present application. The embodiment of the present application may include the following contents:
S101:获取训练样本集的每组训练样本中的样本文本的文本特征信息。S101: Obtain text feature information of sample text in each group of training samples in the training sample set.
In this embodiment, the training sample set is sample data for training the video text mutual inspection model, and the training sample set includes multiple groups of training samples, each group of training samples includes corresponding sample texts and sample videos, that is, the sample text and the sample video are a set of sample data that match each other. As for the number of training sample groups, it can be determined according to the actual training needs and the database used, and the embodiment of the present application does not impose any restrictions on this. The video text mutual inspection model is used to perform the mutual retrieval task of video data and text data, which includes a heterogeneous graph neural network and a video coding network. The heterogeneous graph neural network is used to process the sample text and the second type of text data of the text to be retrieved and finally output the text features corresponding to the text data. The video coding network is used to process the video data and finally output the video features of the video data. The model is obtained based on the text features and video features training. The data types contained in the sample text of this embodiment include at least three types, wherein the text features corresponding to the two data types are used as heterogeneous nodes of the graph structure. For the convenience of description, they can be called the first type of text data and the second type of text data, and the other type of data is the text data summarizing the first type of text data and the second type of text data. Correspondingly, the text feature information includes the first-category text features, the second-category text features, and the third-category text features corresponding to the first-category text data, the second-category text data, and the third-category text data; for the heterogeneous graph neural network, which is a network based on a graph structure, the nodes of the graph structure are the first-category text features and the second-category text features, and the connection edges of the graph structure are determined by whether there is an association relationship between the corresponding features of each heterogeneous node. If there is an association relationship between the features corresponding to two nodes, there is a connection edge relationship between the two nodes. As shown in Figure 2, for the two types of text data of the sample text, the features extracted from the first type of text data include
one group of node features, the features extracted from the second type of text data include another group of node features, and the nodes of the heterogeneous graph neural network include both groups of node features. If two node features have an association relationship, for example, if a second-type node feature contains the information of a first-type node feature, connection edges (such as e32 and e33 in FIG. 2) exist between the corresponding nodes; likewise, if two node features are associated with each other, a connection edge (such as e11) is included between them. As for the graph structure of the heterogeneous graph neural network, a corresponding graph structure can be selected based on the actual application scenario, and the embodiments of the present application do not impose any limitation on this.
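A minimal sketch of building the graph inputs (heterogeneous node features and connection edges) from the first-class and second-class text features; the association test is left as a caller-supplied predicate because the embodiment only requires that connected nodes have an association relationship, without fixing how that relationship is measured.

```python
import torch

def build_heterogeneous_graph(first_class_feats, second_class_feats, is_associated):
    """Stack first-class and second-class text features as heterogeneous nodes and
    add a connection edge wherever is_associated(feat_a, feat_b) holds.
    is_associated is a placeholder for the application-specific association test."""
    nodes = torch.cat([first_class_feats, second_class_feats], dim=0)
    edges = []
    for a in range(nodes.size(0)):
        for b in range(a + 1, nodes.size(0)):
            if is_associated(nodes[a], nodes[b]):
                edges.append((a, b))
    adjacency = torch.zeros(nodes.size(0), nodes.size(0))
    for a, b in edges:
        adjacency[a, b] = adjacency[b, a] = 1.0      # undirected connection edges
    return nodes, adjacency
```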
S102:对每组训练样本中的样本视频,将从样本视频中提取的多帧图像进行重新组合,以得到多个图像集。S102: for each set of training samples, recombining multiple frames of images extracted from the sample videos to obtain multiple image sets.
在本实施例中,对训练样本集所包含的所有样本视频,分别执行S102和S103。在本步骤中,对每个样本视频,从样本视频中提取多帧表示该样本视频的图像,至于提取该样本视频的哪些帧图像,可根据实际需求进行灵活选择,可选的,至于提取的图像帧总数,也可基于实际需求灵活选择,本申请实施例对此均不做任何限定。提取到多帧图像之后,对这多帧图像进行重新组合,可将这多帧图像按照提取的顺序再整合为一个图像序列,然后通过对图像序列进行交叉分割得到多个图像集,本实施例的同一张图像被包含在不同图像集中,表示同一张图像至少出现在两个图像集中。当然,提取到多帧图像之后,也可将这多帧图像随机整合为一个图像序列,然后通过对图像序列进行分割得到多个图像集。当然,提取到多帧图像之后,还可以将这多帧图像随机分配至不同的图像集,同一张图像可以被分配至多个图像集中。至于采用何种方法将多帧图像通过重新组合再生成多个新的图像集,所属领域技术人员可根据实际需求灵活决定。In this embodiment, for all sample videos included in the training sample set, S102 and S103 are respectively executed. In this step, for each sample video, multiple frames representing the sample video are extracted from the sample video. As for which frames of the sample video are extracted, it can be flexibly selected according to actual needs. Optionally, as for the total number of extracted image frames, it can also be flexibly selected based on actual needs. The embodiment of the present application does not impose any restrictions on this. After extracting multiple frames of images, the multiple frames of images are recombined, and the multiple frames of images can be integrated into an image sequence in the order of extraction, and then multiple image sets are obtained by cross-segmenting the image sequence. The same image in this embodiment is included in different image sets, indicating that the same image appears in at least two image sets. Of course, after extracting multiple frames of images, the multiple frames of images can also be randomly integrated into an image sequence, and then multiple image sets are obtained by segmenting the image sequence. Of course, after extracting multiple frames of images, the multiple frames of images can also be randomly assigned to different image sets, and the same image can be assigned to multiple image sets. As for which method to adopt to regenerate multiple new image sets by recombining multiple frames of images, technicians in the relevant field can flexibly decide according to actual needs.
S103:根据不同图像集的图像特征及各图像集之间的关联关系,生成样本视频的视频特征。S103: Generate video features of the sample video according to the image features of different image sets and the association relationship between the image sets.
在上个步骤获取各个图像集之后,可采用任何一种现有的机器学习模型如卷积神经网络、VGG(Visual Geometry Group Network,视觉几何群网络)、Resnet(Residual Neural Network,残差网络模型)等提取各图像集中包含的每帧图像的图像特征,并将该图像集中所有帧图像的图像特征整合为该图像集的图像特征。各图像集之间的关联关系用于标识不同图像集的图像特征对整个视频的重要程度,基于不同图像集的重要程度和该图像集的图像特征确定样本视频的最终视频特征。After obtaining each image set in the previous step, any existing machine learning model such as convolutional neural network, VGG (Visual Geometry Group Network), Resnet (Residual Neural Network), etc. can be used to extract the image features of each frame image contained in each image set, and the image features of all the frames in the image set are integrated into the image features of the image set. The association between the image sets is used to identify the importance of the image features of different image sets to the entire video, and the final video features of the sample video are determined based on the importance of different image sets and the image features of the image sets.
S104:基于每组训练样本的文本特征及相应的视频特征,训练视频文本互检模型。S104: Based on the text features of each group of training samples and the corresponding video features, a video text mutual inspection model is trained.
在本实施例中,一个样本文本的文本特征对应一个样本视频的视频特征,本实施例的每个样本文本的文本特征均为融合特征,融合的是该样本文本的第三类文本数据对应的文本特征以及其第二类文本数据由视频文本互检模型的异质图神经网络提取所得到的特征。对于第三类文本数据对应的文本特征可采用任何一种文本特征提取模型提取得到,本实施例对此不做任何限定。模型训练过程中,会采用损失函数来指导模型的训练,然后通过诸如梯度反传等方式实现对视频文本互检模型的各网络参数的更新,直至满足模型 训练条件,如达到迭代次数或者收敛效果较好。举例来说,视频文本互检模型的训练过程可包括前向传播阶段和反向传播阶段,前向传播阶段是数据由低层次向高层次传播的阶段,反向传播阶段是当前向传播得出的结果与预期不相符时,将误差从高层次向底层次进行传播训练的阶段。详细来说,首先初始化所有网络层权值,如随机初始化;然后输入视频特征和文本特征信息经过图神经网络、卷积层、下采样层、全连接层等各层的前向传播得到输出值;计算视频文本互检模型的模型输出值,并基于损失函数计算该输出值的损失值。将误差反向传回视频文本互检模型中,依次求得视频文本互检模型的各部分如图神经网络层,全连接层,卷积层等各层的反向传播误差。视频文本互检模型的各层根据各层的反向传播误差对视频文本互检模型的所有权重系数进行调整,实现权重的更新。重新随机选取新批次的视频特征和文本特征信息,然后再次进行上述过程,获得网络前向传播得到输出值。无限往复迭代,当计算得到的模型输出值与目标值(也即标签)之间的误差小于预设阈值时,或者迭代次数超过预设迭代次数时,结束模型训练。将结束模型训练当前对应的模型的所有层参数作为训练好的视频文本互检模型的网络参数。In this embodiment, the text features of a sample text correspond to the video features of a sample video. The text features of each sample text in this embodiment are fusion features, which are fused with the text features corresponding to the third category text data of the sample text and the features extracted by the heterogeneous graph neural network of the second category text data of the video text mutual inspection model. The text features corresponding to the third category text data can be extracted by any text feature extraction model, and this embodiment does not impose any restrictions on this. During the model training process, a loss function is used to guide the training of the model, and then the network parameters of the video text mutual inspection model are updated by methods such as gradient back propagation until the model training conditions are met, such as reaching the number of iterations or the convergence effect is good. For example, the training process of the video text mutual inspection model may include a forward propagation stage and a back propagation stage. The forward propagation stage is a stage in which data is propagated from a low level to a high level, and the back propagation stage is a stage in which the error is propagated from a high level to a low level when the result obtained by the forward propagation does not meet the expectation. Specifically, all network layer weights are first initialized, such as random initialization; then the video features and text feature information are input and forward propagated through the graph neural network, convolution layer, downsampling layer, fully connected layer and other layers to obtain the output value; the model output value of the video text mutual inspection model is calculated, and the loss value of the output value is calculated based on the loss function. The error is reversed back to the video text mutual inspection model, and the back propagation errors of each part of the video text mutual inspection model such as the graph neural network layer, the fully connected layer, the convolution layer and other layers are obtained in turn. Each layer of the video text mutual inspection model adjusts all weight coefficients of the video text mutual inspection model according to the back propagation errors of each layer to achieve weight update. A new batch of video features and text feature information is randomly selected again, and then the above process is performed again to obtain the output value of the network forward propagation. 
The above process is iterated repeatedly; when the error between the calculated model output value and the target value (i.e., the label) is less than a preset threshold, or the number of iterations exceeds a preset number of iterations, the model training is terminated. All layer parameters of the model at the end of training are used as the network parameters of the trained video text mutual inspection model.
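A minimal sketch of the forward propagation, loss computation, back propagation and weight update cycle just described; the model, optimizer, data loader and stopping threshold are illustrative placeholders rather than components specified by the embodiment.

```python
import torch

def train(model, optimizer, loader, loss_fn, epochs, target_error=1e-3):
    """Repeats forward propagation, loss computation, back propagation and weight
    update until the error is small enough or the epoch budget is exhausted."""
    for epoch in range(epochs):
        for text_features, video_frames in loader:
            optimizer.zero_grad()
            text_emb, video_emb = model(text_features, video_frames)  # forward propagation
            loss = loss_fn(video_emb, text_emb)                       # retrieval loss
            loss.backward()                                           # back-propagate the error
            optimizer.step()                                          # update all weight coefficients
            if loss.item() < target_error:                            # stop on a small enough error
                return
```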
在本申请实施例提供的技术方案中,将不同文本类型作为图神经网络的异构节点,采用图神经网络有利于提取更深层次、更丰富的文本特征,进而有利于提升视频文本互检索的精度。将从视频数据中提取的图像帧进行重新组合后再提取图像视频,有利于获取到可更加精准反映视频的图像特征,在确定视频特征的过程中同时还考虑到不同图像帧之间的关联关系,有利于得到更加准确的视频特征,从而提升文本视频互检索精度。In the technical solution provided in the embodiment of the present application, different text types are used as heterogeneous nodes of the graph neural network, and the use of the graph neural network is conducive to extracting deeper and richer text features, which is conducive to improving the accuracy of video-text mutual retrieval. Recombining the image frames extracted from the video data and then extracting the image video is conducive to obtaining image features that can more accurately reflect the video. In the process of determining the video features, the correlation between different image frames is also considered, which is conducive to obtaining more accurate video features, thereby improving the accuracy of text-video mutual retrieval.
上述实施例对S104步骤中,对于采用哪种损失函数指导模型训练过程并没有进行限定,所属领域技术人员可采用任何一种现有技术中的损失函数,如L1范数损失函数、均方误差损失函数、交叉熵损失等。而可以理解的是,损失函数是用于衡量预测模型预测期望结果表现的指标,损失函数是否准确,影响整个模型精准度,为了提高视频文本互检索精准度,本申请实施例还给出了一种损失函数的可选实施方式,也即可基于每组训练样本的文本特征及相应的视频特征,调用损失函数指导视频文本互检模型的训练过程;损失函数可表述为:The above embodiment does not limit which loss function is used to guide the model training process in step S104. Technical personnel in the relevant field can use any loss function in the prior art, such as L1 norm loss function, mean square error loss function, cross entropy loss, etc. It can be understood that the loss function is an indicator used to measure the performance of the prediction model in predicting the expected result. Whether the loss function is accurate affects the accuracy of the entire model. In order to improve the accuracy of video text mutual retrieval, the embodiment of the present application also provides an optional implementation of the loss function, that is, based on the text features of each group of training samples and the corresponding video features, the loss function is called to guide the training process of the video text mutual inspection model; the loss function can be expressed as:
L = (1/N) Σ_{a=1…N} [ max(0, γ + d(V_a, T_p) − min_n d(V_a, T_n)) + max(0, γ + d(T_a, V_p) − min_n d(T_a, V_n)) ];
In the formula, L is the above loss function, N is the number of training sample groups, min d() represents the minimum value of the calculated distance, V_a is the a-th sample video among all the sample videos contained in the above training sample set, T_p is the p-th sample text among all the sample texts contained in the above training sample set and it corresponds to the a-th sample video, T_n is the n-th sample text among all sample text data and it does not correspond to the a-th sample video, T_a is the a-th sample text among all sample text data, V_p is the p-th sample video among all sample videos and it corresponds to the a-th sample text, V_n is the n-th sample video among all sample video data and it does not correspond to the a-th sample text, and γ is a hyperparameter.
In this embodiment, the loss function will traverse each video feature and text feature information to calculate the average value of the loss function for paired data. This embodiment can traverse N times, where N represents that there are N paired sample data in this batch, that is, there are N groups of training samples in the training sample set. All sample videos of these N groups of training samples can be regarded as a video image group, and all sample texts can be regarded as a text group. First, the video image group feature
Figure PCTCN2022141680-appb-000033
Traverse (a total of N), and the selected video features can be called
Figure PCTCN2022141680-appb-000034
a represents anchor (anchor sample). The text feature encoding paired with the anchor sample is recorded as
Figure PCTCN2022141680-appb-000035
p represents positive (paired matching). Similarly, in this batch
Figure PCTCN2022141680-appb-000036
The unpaired text features are recorded as
Figure PCTCN2022141680-appb-000037
Figure PCTCN2022141680-appb-000038
is a hyperparameter, which is fixed during training, for example, set to 0.3. Similarly, the same traversal operation is performed for text features.
Figure PCTCN2022141680-appb-000039
Represents the sample selected in the traversal, and the corresponding video image group feature sample is recorded as
Figure PCTCN2022141680-appb-000040
The non-corresponding
Figure PCTCN2022141680-appb-000041
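As a concrete illustration of the traversal just described, the following is a minimal PyTorch sketch of a bidirectional triplet-style loss with hardest-negative mining. The tensor names (`video_feats`, `text_feats`), the use of Euclidean distance and the margin `alpha=0.3` are assumptions for illustration, not the patent's exact implementation.

```python
import torch

def mutual_retrieval_loss(video_feats, text_feats, alpha=0.3):
    """Bidirectional triplet loss with hardest-negative mining (illustrative sketch).

    video_feats: (N, D) tensor, one feature per sample video in the batch
    text_feats:  (N, D) tensor, text_feats[i] is the text paired with video_feats[i]
    """
    # Pairwise Euclidean distances: dist[i, j] = d(video_i, text_j)
    dist = torch.cdist(video_feats, text_feats, p=2)            # (N, N)
    pos = dist.diag()                                           # d(anchor, positive)

    # Mask the positives so the minimum runs over negatives only
    n = dist.size(0)
    diag_mask = torch.eye(n, dtype=torch.bool, device=dist.device)
    big = torch.finfo(dist.dtype).max

    # Hardest (closest) negative text for each video anchor, and vice versa
    hardest_neg_text = dist.masked_fill(diag_mask, big).min(dim=1).values
    hardest_neg_video = dist.masked_fill(diag_mask, big).min(dim=0).values

    loss_v2t = torch.clamp(alpha + pos - hardest_neg_text, min=0)
    loss_t2v = torch.clamp(alpha + pos - hardest_neg_video, min=0)
    return (loss_v2t + loss_t2v).mean()
```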
In the above embodiment, how step S102 is executed is not limited. This embodiment provides an optional image frame combination method, which may include the following steps:
An image recombination parameter is obtained, and the image frames included in each image set are determined according to the image recombination parameter, so as to segment the image sequence formed by the multiple frames of images.
In this embodiment, the image recombination parameters may include the total number of image sets and the total number of image frames contained in each image set. Both can be changed in real time; that is, the user may input the latest parameter values in real time, or they may be written directly to a designated location in the system, neither of which affects the implementation of the embodiments of the present application. The number of image frames contained in each image set may be the same or different; to facilitate subsequent image processing, this embodiment may set the number of image frames contained in each image set to be the same. After the total number of image sets and the total number of image frames contained in each image set are determined, combined with the number of extracted image frames, the image frames may be allocated and reprocessed through manual interaction. Of course, an automated image segmentation method may also be used. For the scenario where each image set contains the same total number of image frames, this embodiment further provides an optional implementation for determining the image frames contained in each image set according to the image recombination parameters, which may include the following:
For the first image set, the image frames it contains are determined according to the total number of image frames per set and the first frame of the image sequence; the image segmentation relation is called to determine the frame-index difference between adjacent image sets, the image segmentation relation being m+nk=N; and for each of the remaining image sets, the image frames it contains are determined based on the image frames contained in the previous image set and the frame-index difference. In the relation, m is the total number of image frames contained in each image set, N is the total number of image frames contained in the image sequence, n is the total number of image sets, and k is the frame-index difference, which is an integer.
In this embodiment, in order to make the implementation clearer to those skilled in the art, a schematic example is given with reference to FIG. 3. If N image frames are extracted from the sample video, the N frames are divided into n mutually overlapping image sets, and each image set may include m frames. The frame-index difference k can be calculated from m+nk=N. The first image set includes [1, ..., m], the second image set includes [k+1, ..., m+k], the third image set includes [2k+1, ..., m+2k], and the n-th image set includes [nk+1, ..., m+nk]. For example, if N=32, n=5 and m=16, then k=3.2, which is rounded up to k=4, and the resulting image sets may be: [1,16], [5,20], [9,24], [13,28] and [16,N].
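A small sketch of this index arithmetic is given below; it simply reproduces the m+nk=N relation and the overlapping windows of the example. The exact handling of the last window (kept entirely inside the sequence here) is an assumption made for illustration.

```python
import math

def overlapping_image_sets(N, n, m):
    """Return n overlapping index windows of length m over frames 1..N (illustrative sketch)."""
    # Frame-index difference between adjacent sets, from m + n*k = N, rounded up to an integer
    k = math.ceil((N - m) / n)
    sets = []
    for i in range(n):
        start = i * k + 1
        end = start + m - 1
        if end > N:                      # keep the last window inside the sequence (assumption)
            start, end = N - m + 1, N
        sets.append((start, end))
    return sets

print(overlapping_image_sets(32, 5, 16))   # [(1, 16), (5, 20), (9, 24), (13, 28), (17, 32)]
```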
The sample video is composed of many frames of video images, and the above embodiment does not limit the process of extracting multiple frames from the sample video. As shown in FIG. 4, this embodiment further provides an optional implementation: the video splitting parameters are obtained by parsing a video splitting instruction; the sample video is split into multiple video segments according to the video splitting parameters; and, for each video segment, a target image frame used to identify the current video segment is extracted. Optionally, the first frame of the current video segment may be extracted as the target image frame of the current video segment. The video splitting parameters refer to the number of segments into which the sample video is split, the sample video identification information, and so on. In this implementation, a sample video may be divided evenly into N segments, and the first frame of each segment is taken as the representative image of that segment.
By dividing the image frames extracted from the video into multiple mutually overlapping intervals, this embodiment helps to extract richer image features and improves the accuracy of model training.
The above embodiment does not limit how the video features are generated. An embodiment of the present application further provides an illustrative example, which may include the following content:
First, with reference to FIG. 5, an embodiment of the present application provides a network structure for extracting the image features of each frame of each image set, referred to in this embodiment as the image feature extraction network. The image feature extraction network may include a first 3D convolution structure, a first downsampling structure, a second 3D convolution structure, a second downsampling structure, a 2D convolution structure and a fully connected layer. The first 3D convolution structure is used to perform a 3D convolution operation on the input information of the image feature extraction network; the first downsampling structure is used to downsample the output features of the first 3D convolution structure; the second 3D convolution structure is used to perform a 3D convolution operation on the output features of the first downsampling structure; the second downsampling structure is used to downsample the features output by the second 3D convolution structure; and the 2D convolution structure is used to perform a 2D convolution operation on the output features of the second downsampling structure. Based on the above structure, any image database can be used to train the image feature extraction network until the training end condition is reached. For each image set, the image frames contained in the current image set are all input into the image feature extraction network to obtain the image features corresponding to the current image set.
For example, consider an image set whose input is a voxel block of multiple frames of images of size c*m*h*w, where c is the number of image channels (generally 3 RGB (Red Green Blue) color channels), m is the length of the video sequence, i.e., the number of image frames in this image set, and h and w are the height and width of the sample video respectively. After one 3D convolution with a kernel of K*3*3*3, stride 1, padding=True and K filters, the output size is K*m*h*w; the pooling layers behave analogously. Based on the above 3D convolution operation, this embodiment uses the C3D (Convolutional 3D) network structure shown in FIG. 5, which contains 3D convolution, 2D convolution, subsampling (downsampling) layers and a full connection (fully connected) layer. There are 4 convolution operations and 2 downsampling operations in total, and the sizes of the convolution kernels are shown in FIG. 5. The pooling kernel size is 2*2 with a stride of 2. The network obtains its final output features after one 2D convolution operation and one fully connected layer. The input size of the network is 3*16*224*224, that is, 16 frames are input at a time and the input image size is 224×224. In this embodiment, a 128-dimensional feature vector is obtained for the input of each image set.
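For readers who prefer code, the following PyTorch sketch mirrors the kind of C3D-style extractor described above (two 3D convolution and pooling stages, a 2D convolution and a fully connected layer producing a 128-dimensional vector). The channel counts, the pooling of the temporal dimension and the flattening step are assumptions for illustration and do not reproduce FIG. 5 exactly.

```python
import torch
import torch.nn as nn

class C3DLikeExtractor(nn.Module):
    """Illustrative C3D-style image-set encoder: input (B, 3, 16, 224, 224) -> (B, 128)."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv3d_1 = nn.Conv3d(3, 64, kernel_size=3, padding=1)     # first 3D convolution
        self.pool_1 = nn.MaxPool3d(kernel_size=2, stride=2)            # first downsampling
        self.conv3d_2 = nn.Conv3d(64, 128, kernel_size=3, padding=1)   # second 3D convolution
        self.pool_2 = nn.MaxPool3d(kernel_size=2, stride=2)            # second downsampling
        self.conv2d = nn.Conv2d(128, 128, kernel_size=3, padding=1)    # 2D convolution
        self.fc = nn.Linear(128, feat_dim)                             # fully connected layer

    def forward(self, x):                                # x: (B, 3, 16, 224, 224)
        x = torch.relu(self.pool_1(self.conv3d_1(x)))    # -> (B, 64, 8, 112, 112)
        x = torch.relu(self.pool_2(self.conv3d_2(x)))    # -> (B, 128, 4, 56, 56)
        x = x.mean(dim=2)                                # average over time -> (B, 128, 56, 56)
        x = torch.relu(self.conv2d(x))                   # -> (B, 128, 56, 56)
        x = x.mean(dim=(2, 3))                           # global average pooling -> (B, 128)
        return self.fc(x)                                # -> (B, 128)

feats = C3DLikeExtractor()(torch.randn(1, 3, 16, 224, 224))   # one image set of 16 frames
```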
After the image features of each image set are extracted, the process of generating the video features of the sample video according to the image features of the different image sets and the correlation between the image sets may include: for each image set, determining the current initial weight of the current image set based on the image features of the current image set, and determining the weight coefficient of the current image set based on the current initial weight and the initial weight of every image set; and generating the video features of the sample video according to the image features of each image set and the corresponding weight coefficients.
The current initial weight of the current image set can be calculated by calling the initial weight calculation relation, which can be expressed as:
$a_i = q^{T}\,\mathrm{ReLU}(H \cdot y_i)$
where $a_i$ is the initial weight of the i-th image set, $q$ is a known vector, $q^{T}$ denotes the transpose of $q$, ReLU() is the ReLU function, $H$ is a weight matrix, and $y_i$ is the image feature of the i-th image set. The matrix multiplication $H \cdot y_i$ maps $y_i$ into a common space; $H$ can be obtained through model training, and multiplying $q^{T}$ with $\mathrm{ReLU}(H \cdot y_i)$ yields a scalar.
The weight coefficient of the current image set can be calculated by calling the weight calculation relation, which can be expressed as:
$$a_i' = \mathrm{softmax}(a_i) = \frac{\exp(a_i)}{\sum_{j=1}^{n}\exp(a_j)}$$

where $a_i'$ is the weight coefficient of the i-th image set, softmax() is the softmax function, $a_j$ is the initial weight of the j-th image set, and $n$ is the total number of image sets.
Finally, the video feature $e_{video}$ generated in this embodiment can be expressed as:

$$e_{video}=\sum_{i=1}^{n} a_i'\, y_i$$
In this embodiment, by weighting the features of the different image sets, the features of each image set can be expressed more saliently, which helps to obtain more accurate video features and improves the accuracy of model training.
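The weighting just described can be sketched in a few lines of PyTorch; the dimensions and the randomly initialized parameters `q` and `H` are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def aggregate_video_feature(Y, H, q):
    """Attention-weighted aggregation of image-set features (illustrative sketch).

    Y: (n, d) image features of the n image sets
    H: (d, d) weight matrix, q: (d,) known vector
    """
    a = torch.relu(Y @ H.T) @ q          # initial weights a_i = q^T ReLU(H . y_i), shape (n,)
    a_prime = F.softmax(a, dim=0)        # weight coefficients a_i'
    return a_prime @ Y                   # e_video = sum_i a_i' * y_i, shape (d,)

e_video = aggregate_video_feature(torch.randn(5, 128), torch.randn(128, 128), torch.randn(128))
```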
In addition, this embodiment further provides a video-text mutual inspection method. Referring to FIG. 6, it may include the following content:
S601: Pre-train a video-text mutual inspection model.
In this step, the video-text mutual inspection model may be trained in advance using the video-text mutual inspection model training method described in any of the above embodiments.
S602: Recombine the multiple frames of images extracted from the video to be retrieved to obtain multiple image sets, where the same image is included in different image sets.
S603: Generate the to-be-matched video features of the video to be retrieved according to the image features of the different image sets and the correlation between the image sets.
S604: Input the to-be-matched text features of the text to be retrieved and the to-be-matched video features into the video-text mutual inspection model to obtain the video-text mutual retrieval result.
The text to be retrieved includes first-category text data, second-category text data and third-category text data, where the second-category text data includes the first-category text data, and the third-category text data is used to summarize the second-category text data and the first-category text data. The to-be-matched text features are the fusion of the features of the second-category text data extracted by the heterogeneous graph neural network of the video-text mutual inspection model and the third-category text features.
The processing of the video to be retrieved in this embodiment, namely S602 and S603, can refer to the corresponding content of S102 and S103 in the above embodiment and will not be repeated here.
During inference, the weight coefficients trained in S601 may be loaded in advance. Feature extraction is performed on the videos or texts to be retrieved, and the features are stored in the retrieval data set. Given any video or text to be retrieved by the user (referred to as the data to be retrieved for ease of description), the text features or video features of the data to be retrieved are extracted and input into the video-text mutual inspection model. The features of the data to be retrieved are then distance-matched against all sample features in the retrieval data set. For example, if the data to be retrieved is text data, the Euclidean distance to all video features in the retrieval data set is computed, and the sample with the smallest distance is output as the recommended sample.
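A minimal sketch of this distance-matching step is shown below; the variable names and the top-1 return are assumptions for illustration.

```python
import torch

def retrieve_nearest(query_feat, gallery_feats):
    """Return the index of the gallery feature closest to the query (Euclidean distance)."""
    # query_feat: (d,), gallery_feats: (M, d) features of all candidates in the retrieval set
    dists = torch.cdist(query_feat.unsqueeze(0), gallery_feats).squeeze(0)   # (M,)
    return torch.argmin(dists).item()

# e.g. a text query matched against all video features stored in the retrieval data set
best_video = retrieve_nearest(torch.randn(128), torch.randn(1000, 128))
```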
It can be seen from the above that this embodiment can effectively improve the accuracy of video-text mutual retrieval.
It should be noted that there is no strict execution order between the steps in the embodiments of the present application. As long as the logical order is respected, these steps may be executed simultaneously or in some preset order; FIG. 1 and FIG. 6 are only schematic and do not mean that the execution order must be as shown.
Finally, in order to make the implementation of the embodiments of the present application clearer to those skilled in the art, this embodiment further provides an illustrative example for implementing the mutual retrieval task between recipe texts and recipe videos, which may include the following content:
Referring to FIG. 7, this embodiment includes a recipe retrieval terminal device 701 and a server 702. A user may perform operations on the recipe retrieval terminal device 701, which interacts with the server 702 through a network. The server 702 may deploy the video-text mutual inspection model, as shown in FIG. 8. In order for the video-text mutual inspection model to realize mutual retrieval between recipe texts and recipe videos, the model needs to be trained. During training, the recipe retrieval terminal device 701 may transmit a training sample set to the server 702. The training sample set may contain multiple groups of training samples, each group including a corresponding recipe text sample and recipe video sample, and each recipe text sample including the operation steps (instruction list), the ingredient information (ingredients) and the dish name (title). Instructions are the cooking steps and are referred to below as steps; ingredients are the components of the dish and are referred to below as ingredients.
After obtaining the training sample set, the server 702 performs feature encoding on the recipe texts and the recipe videos respectively. This embodiment may use a heterogeneous graph neural network to encode the text information. The text features are constructed into a graph structure, which includes nodes, node features and connection relationships, as shown in FIG. 2. Ingredients and steps differ in both structure and properties, so they are called heterogeneous nodes. In this embodiment, each step is one node and, likewise, each ingredient is one node. A node consists of one sentence or one phrase, and this embodiment may use a BERT model to extract the features of each sentence or word, implemented as follows. All recipe texts are input from the text information at the bottom, together with the position information and text type that accompany the recipe text. Position information means that if a sentence contains five words, "peel and slice the mango", their position information is "1, 2, 3, 4, 5" respectively. Text type means that if the input text is a step its text type is 1, and if the input text is an ingredient its text type is 2. Through the BERT model, the encoded features of each sentence and each word are obtained; these features represent the node features, namely the ingredient node features and the step node features, each of which is a high-dimensional vector of dimension R^d (a d-dimensional real vector). After the node features are determined, if an ingredient appears in an operation step, the ingredient node and the step node need to be connected by an edge, i.e., there is a connection relationship between the two nodes. Optionally, this can be done by text comparison: the step information is traversed, the text of each step is extracted, and the main ingredients are then looked up in turn; if a word of an ingredient appears in the step, an edge is connected between that step and that ingredient, i.e., a connection relationship exists. By traversing all step texts, the connection relationships between step nodes and ingredient nodes, i.e., the connection relationships of the heterogeneous graph, can be constructed. After the heterogeneous graph is established, its information can be updated with a graph attention network to achieve feature aggregation and updating, each heterogeneous node being traversed and updated in turn. The aggregation and extraction of text features are realized by the heterogeneous graph operations, and the calculation method may be as follows:
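The edge construction by text comparison described above can be sketched as follows; the whitespace tokenization and the data structures are assumptions made for illustration.

```python
def build_hetero_edges(steps, ingredients):
    """Connect step node q and ingredient node p when an ingredient word appears in the step text."""
    edges = []                                         # list of (step_index, ingredient_index)
    for q, step_text in enumerate(steps):
        step_words = set(step_text.lower().split())
        for p, ingredient in enumerate(ingredients):
            # an edge is added if any word of the ingredient phrase occurs in the step
            if any(w in step_words for w in ingredient.lower().split()):
                edges.append((q, p))
    return edges

edges = build_hetero_edges(["peel and slice the mango", "blend mango with milk"],
                           ["mango", "milk"])
# edges == [(0, 0), (1, 0), (1, 1)]
```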
First, the step nodes are updated. Let $\phi_q^{s}$ be the node feature of the q-th step node and $\psi_p^{g}$ be the node feature of the p-th ingredient node. If the q-th step node is connected to the p-th ingredient node (i.e., they share an edge), the feature of the p-th ingredient node is used to update the feature of the q-th step node. During the update, the correlation between nodes needs to be considered; in this embodiment the correlation between nodes is expressed by assigning weights. Optionally, relation (1) can be called to compute the correlation weight $z_{pq}$ between the q-th step node and the p-th ingredient node: $z_{pq}$ is obtained by mapping $\phi_q^{s}$ and $\psi_p^{g}$ with the matrices $W_a$, $W_b$ and $W_c$ and combining the mapped vectors into a scalar, where $W_a$, $W_b$ and $W_c$ are known $R^{d\times d}$ matrices and $\cdot$ denotes matrix multiplication, i.e., vector mapping. For each step node $\phi_q^{s}$, all ingredient nodes connected to it by an edge (assume there are $N_p$ of them) are traversed, and each yields a corresponding correlation weight $z_{pq}$.
After each step node has been processed in this way, the correlation weights of all ingredient nodes connected to the step node by an edge are normalized; that is, relation (2) below is called to obtain the normalized correlation weight $\alpha_{qp}$:

$$\alpha_{qp}=\frac{\exp(z_{pq})}{\sum_{p'=1}^{N_p}\exp(z_{p'q})}\qquad(2)$$

where exp denotes the exponential function and the denominator is the sum, over all ingredient nodes connected to the step node by an edge, of the exponentiated correlation weights. Finally, the node feature of the step node is updated with the normalized correlation weights, that is, relation (3) is called:

$$\hat{\phi}_q^{s}=\sigma\sum_{p=1}^{N_p}\alpha_{qp}\,W_v\,\psi_p^{g}\qquad(3)$$

where $\sigma$ is a hyperparameter in the interval [0, 1], $W_v$ is an $R^{d\times d}$ matrix, and $\hat{\phi}_q^{s}$ is the new feature vector of the step node after being updated by the ingredient nodes connected to it.
Optionally, based on the idea of residual networks, relation (4) can be called to add the updated feature $\hat{\phi}_q^{s}$ to the initial feature $\phi_q^{s}$ before the update:

$$\phi_q^{s}\leftarrow\hat{\phi}_q^{s}+\phi_q^{s}\qquad(4)$$

Similarly, relation (5) performs the same computation and update for the ingredient nodes:

$$\psi_p^{g}\leftarrow\hat{\psi}_p^{g}+\psi_p^{g}\qquad(5)$$

where $\hat{\psi}_p^{g}$ is obtained by aggregating the features of the step nodes connected to the p-th ingredient node in the same way as in relations (1) to (3).
After all ingredient nodes and step nodes have been traversed, one layer of the graph attention network has been updated. Generally, T layers of graph attention network can be stacked, with t denoting the t-th layer, and the node features of each layer are updated as described above. An integrated fully connected layer is usually added after each graph attention layer to re-encode the node features (including both ingredient nodes and step nodes), as shown in relation (6):

$$\phi_q^{s,(t+1)}=\mathrm{FFN}\big(\phi_q^{s,(t)}\big)\qquad(6)$$

where FFN (feed-forward layer, also called a fully connected layer) denotes the fully connected layer and $\phi_q^{s,(t+1)}$ denotes the initialized node feature of the graph attention network at layer t+1.
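Purely as an illustration of relations (2) to (6), the following PyTorch sketch performs one such heterogeneous attention update for a single step node. The scoring function for $z_{pq}$, which the patent defines through $W_a$, $W_b$ and $W_c$, is replaced here by a simple placeholder dot product and is therefore an assumption.

```python
import torch
import torch.nn.functional as F

def update_step_node(phi_q, psi_neighbors, W_v, ffn, sigma=0.5):
    """One heterogeneous graph-attention update of a step node (illustrative sketch).

    phi_q:         (d,)    feature of step node q
    psi_neighbors: (Np, d) features of the ingredient nodes connected to q
    W_v:           (d, d)  projection matrix, ffn: a small feed-forward module
    """
    # Placeholder correlation weights z_pq (the patent's relation (1) uses W_a, W_b, W_c)
    z = psi_neighbors @ phi_q                                                 # (Np,)
    alpha = F.softmax(z, dim=0)                                               # relation (2)
    message = sigma * (alpha.unsqueeze(0) @ (psi_neighbors @ W_v.T)).squeeze(0)  # relation (3)
    phi_hat = message + phi_q                                                 # relation (4): residual
    return ffn(phi_hat)                                                       # relation (6): re-encode

d = 768
ffn = torch.nn.Sequential(torch.nn.Linear(d, d), torch.nn.ReLU(), torch.nn.Linear(d, d))
new_phi = update_step_node(torch.randn(d), torch.randn(4, d), torch.randn(d, d), ffn)
```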
The update of the node features is completed as above. In order to perform retrieval against recipe videos, the features of all text nodes, such as the operation steps, the ingredient information and the dish name, still need to be summarized and combined. In this embodiment, the step nodes have fused the ingredient node information, and the ingredient nodes, updated through the graph neural network, emphasize the relevant step node features in the form of keywords. Meanwhile, the dish name contains important information about the main ingredients and the cooking method, and the dish name text is usually widely available in recipe-based cross-modal retrieval tasks. Based on this, this embodiment may also extract the features of the dish name with a BERT (Bidirectional Encoder Representations from Transformers) model. After the text features are obtained, a BiLSTM (Bi-directional Long Short-Term Memory) network may be used to mine the temporal information of the step nodes, so as to summarize the text node features and pack them into one vector.
In this embodiment, relations (7) and (8) below can be called to extract the temporal information features of all step nodes:

$$\overrightarrow{h}_q=\overrightarrow{\mathrm{LSTM}}\big(\phi_q^{s,(T)},\ \overrightarrow{h}_{q-1}\big)\qquad(7)$$

$$\overleftarrow{h}_q=\overleftarrow{\mathrm{LSTM}}\big(\phi_q^{s,(T)},\ \overleftarrow{h}_{q+1}\big)\qquad(8)$$

where the left and right arrows denote the direction of LSTM (Long Short-Term Memory) encoding, i.e., the step node features are encoded in forward order and in reverse order. $\overrightarrow{h}_q$ denotes the output of the q-th BiLSTM unit, and the different arrow directions denote the BiLSTM outputs obtained for the two input orders of the step nodes. Similarly, $\overrightarrow{h}_{q-1}$ denotes the output of the (q-1)-th unit, i.e., the output of the previous state. Assume the recipe has Q steps in total; $\overrightarrow{h}_0$ is 0, and $\phi_q^{s,(T)}$ denotes the feature of the q-th step node in the T-th layer of the graph neural network. The step node features are fed, in forward order and in reverse order, into the corresponding BiLSTM networks, and finally the BiLSTM encodings of all step nodes are obtained, as shown in relation (9):

$$h_q=\big[\overrightarrow{h}_q;\ \overleftarrow{h}_q\big],\quad q=1,\dots,Q\qquad(9)$$
After the outputs of all BiLSTM units are obtained, the output of the entire text feature can be obtained by summing them and taking the average, where $e_{rec}$ denotes the output of the text feature and is used for the subsequent retrieval. The $e_{rec}$ feature is then fused with the dish title feature: $e_{rec}=[e_{rec}, e_{ttl}]$, where [] denotes feature concatenation, i.e., the features are joined end to end. Finally, the $e_{rec}$ feature passes through a fully connected layer for feature mapping, i.e., $e_{rec}=fc(e_{rec})$, yielding a vector of a new dimension, namely the text feature information of the recipe text, which is used to match against the encoded features of the recipe video.
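The summarization of the step nodes into a single text vector, as described above, can be sketched as follows; the hidden size, the use of `nn.LSTM(bidirectional=True)` and the output dimension are assumptions for illustration.

```python
import torch
import torch.nn as nn

class RecipeTextSummarizer(nn.Module):
    """BiLSTM over step-node features, mean pooling, title concatenation, FC (illustrative sketch)."""
    def __init__(self, d=768, out_dim=128):
        super().__init__()
        self.bilstm = nn.LSTM(input_size=d, hidden_size=d, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * d + d, out_dim)      # concatenated [e_rec, e_ttl] -> new dimension

    def forward(self, step_feats, title_feat):
        # step_feats: (Q, d) features of the Q step nodes from the last graph layer
        # title_feat: (d,)   BERT feature of the dish name
        h, _ = self.bilstm(step_feats.unsqueeze(0))  # (1, Q, 2d): forward and backward states
        e_rec = h.squeeze(0).mean(dim=0)             # sum-and-average over the Q units -> (2d,)
        e_rec = torch.cat([e_rec, title_feat])       # fuse with the title feature
        return self.fc(e_rec)                        # final text feature for matching

text_feat = RecipeTextSummarizer()(torch.randn(6, 768), torch.randn(768))
```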
For the encoding of the recipe videos, the sample video is treated as the recipe video, and any of the above embodiments may be used to encode the recipe video features. After the recipe video features and the recipe text feature information of each group of training samples in the training sample set are obtained, the loss function of the above embodiment may be used to guide the training of the video-text mutual inspection model until it converges.
The recipe retrieval terminal device 701 may include a display screen, an input interface, an input keyboard and a wireless transmission module. When the display screen is a touch screen, the input keyboard may be a soft keyboard presented on the display screen. The input interface may be used to connect an external device such as a USB flash drive, and there may be multiple input interfaces. In practical applications, the user may input the recipe text or video to be retrieved to the recipe retrieval terminal device 701 through the input keyboard, or may write it to a USB flash drive and insert the drive into an input interface of the device. The user inputs a retrieval request to the recipe retrieval terminal device 701, the request carrying the recipe text or recipe video to be retrieved; the terminal may send this retrieval request to the server 702 through the wireless transmission module, the server 702 retrieves the corresponding database based on the trained model and feeds the final mutual retrieval result back to the recipe retrieval terminal device 701, and the terminal may then present the retrieved recipe text or recipe video to the user on the display screen.
The embodiments of the present application further provide corresponding apparatuses for the video-text mutual inspection model training method and the video-text mutual inspection method, making the methods more practical. The apparatuses may be described from the perspective of functional modules and from the perspective of hardware. The video-text mutual inspection model training apparatus and the video-text mutual inspection apparatus provided by the embodiments of the present application are introduced below; they may be referred to in correspondence with the video-text mutual inspection model training method and the video-text mutual inspection method described above.
From the perspective of functional modules, referring first to FIG. 9, FIG. 9 is a structural diagram of a video-text mutual inspection model training apparatus provided by an embodiment of the present application in an optional implementation. The apparatus may include:
a text feature acquisition module 901, configured to acquire the text feature information of the sample text in each group of training samples of the training sample set, where the sample text includes first-category text data, second-category text data and third-category text data, the second-category text data includes the first-category text data, and the third-category text data is used to summarize the second-category text data and the first-category text data; the text feature information includes the first-category, second-category and third-category text features corresponding to the first-category, second-category and third-category text data; and the first-category text features and the second-category text features determine the node features and the connection edges of the heterogeneous graph neural network in the video-text mutual inspection model;
a video feature generation module 902, configured to, for the sample video in each group of training samples, recombine the multiple frames of images extracted from the sample video to obtain multiple image sets, where the same image is included in different image sets, and to generate the video features of the sample video according to the image features of the different image sets and the correlation between the image sets; and
a training module 903, configured to train the video-text mutual inspection model based on the text features of each group of training samples and the corresponding video features, where the text features are the fusion of the features of the second-category text data extracted by the heterogeneous graph neural network and the third-category text features.
Optionally, in some implementations of this embodiment, the video feature generation module 902 may also be configured to: obtain image recombination parameters, the image recombination parameters including the total number of image sets and the total number of image frames contained in each image set; and determine, according to the image recombination parameters, the image frames contained in each image set, so as to segment the image sequence formed by the multiple frames of images.
As an optional implementation of the above embodiment, the video feature generation module 902 may also be configured so that each image set contains the same total number of image frames: for the first image set, the image frames it contains are determined according to the total number of image frames per set and the first frame of the image sequence; the image segmentation relation is called to determine the frame-index difference between adjacent image sets, the image segmentation relation being m+nk=N; and for each of the remaining image sets, the image frames it contains are determined based on the image frames contained in the previous image set and the frame-index difference, where m is the total number of image frames contained in each image set, N is the total number of image frames contained in the image sequence, n is the total number of image sets, and k is the frame-index difference, which is an integer.
Optionally, in some implementations of this embodiment, the video feature generation module 902 may further include a video decomposition unit, configured to obtain video splitting parameters by parsing a video splitting instruction, split the sample video into multiple video segments according to the video splitting parameters, and, for each video segment, extract a target image frame used to identify the current video segment.
As an optional implementation of this embodiment, the video decomposition unit may also be configured to extract the first frame of the current video segment as the target image frame of the current video segment.
Optionally, in other implementations of this embodiment, the video feature generation module 902 may further include a feature extraction unit, configured to: pre-train an image feature extraction network; and, for each image set, input the image frames contained in the current image set into the image feature extraction network to obtain the image features corresponding to the current image set. The image feature extraction network includes a first 3D convolution structure, a first downsampling structure, a second 3D convolution structure, a second downsampling structure, a 2D convolution structure and a fully connected layer; the first 3D convolution structure is used to perform a 3D convolution operation on the input information of the image feature extraction network; the first downsampling structure is used to downsample the output features of the first 3D convolution structure; the second 3D convolution structure is used to perform a 3D convolution operation on the output features of the first downsampling structure; the second downsampling structure is used to downsample the features output by the second 3D convolution structure; and the 2D convolution structure is used to perform a 2D convolution operation on the output features of the second downsampling structure.
Optionally, in still other implementations of this embodiment, the video feature generation module 902 may also be configured to: for each image set, determine the current initial weight of the current image set based on the image features of the current image set, and determine the weight coefficient of the current image set based on the current initial weight and the initial weight of every image set; and generate the video features of the sample video according to the image features of each image set and the corresponding weight coefficients.
As an optional implementation of the above embodiment, the video feature generation module 902 may also be configured to call the initial weight calculation relation to calculate the current initial weight of the current image set, the initial weight calculation relation being:
$a_i = q^{T}\,\mathrm{ReLU}(H \cdot y_i)$
where $a_i$ is the initial weight of the i-th image set, $q$ is a known vector, $q^{T}$ denotes the transpose of $q$, ReLU() is the ReLU function, $H$ is a weight matrix, and $y_i$ is the image feature of the i-th image set.
As another optional implementation of the above embodiment, the video feature generation module 902 may also be configured to call the weight calculation relation to calculate the weight coefficient of the current image set, the weight calculation relation being:
$$a_i' = \mathrm{softmax}(a_i) = \frac{\exp(a_i)}{\sum_{j=1}^{n}\exp(a_j)}$$

where $a_i'$ is the weight coefficient of the i-th image set, softmax() is the softmax function, $a_j$ is the initial weight of the j-th image set, and $n$ is the total number of image sets.
Optionally, in some other implementations of this embodiment, the training module 903 may also be configured to call, based on the text feature information of each group of training samples and the corresponding video features, the loss function to guide the training process of the video-text mutual inspection model, the loss function being:
$$\mathcal{L}=\frac{1}{N}\sum_{a=1}^{N}\Big[\max\big(0,\ \alpha+d(e_{video}^{a},e_{rec}^{p})-\min_{n}d(e_{video}^{a},e_{rec}^{n})\big)+\max\big(0,\ \alpha+d(e_{rec}^{a},e_{video}^{p})-\min_{n}d(e_{rec}^{a},e_{video}^{n})\big)\Big]$$

where $\mathcal{L}$ is the above loss function, $N$ is the number of training sample groups, $e_{video}^{a}$ is the $a$-th sample video among all sample videos contained in the above training sample set, $e_{rec}^{p}$ is the $p$-th sample text among all sample texts contained in the above training sample set and corresponds to the $a$-th sample video, $e_{rec}^{n}$ is the $n$-th sample text among all sample text data and does not correspond to the $a$-th sample video, $e_{rec}^{a}$ is the $a$-th sample text among all sample text data, $e_{video}^{p}$ is the $p$-th sample video among all sample videos and corresponds to the $a$-th sample text, $e_{video}^{n}$ is the $n$-th sample video among all sample video data and does not correspond to the $a$-th sample text, and $\alpha$ is a hyperparameter.
Next, referring to FIG. 10, FIG. 10 is a structural diagram of a video-text mutual inspection apparatus provided by an embodiment of the present application in an optional implementation. The apparatus may include:
a model training module 1001, configured to train a video-text mutual inspection model in advance using any one of the above video-text mutual inspection model training methods;
a video processing module 1002, configured to recombine the multiple frames of images extracted from the video to be retrieved to obtain multiple image sets, where the same image is included in different image sets, and to generate the to-be-matched video features of the video to be retrieved according to the image features of the different image sets and the correlation between the image sets; and
a mutual check module 1003, configured to input the to-be-matched text features of the text to be retrieved and the to-be-matched video features into the video-text mutual inspection model to obtain the video-text mutual retrieval result, where the text to be retrieved includes first-category text data, second-category text data and third-category text data, the second-category text data includes the first-category text data, the third-category text data is used to summarize the second-category text data and the first-category text data, and the to-be-matched text features are the fusion of the features of the second-category text data extracted by the heterogeneous graph neural network of the video-text mutual inspection model and the third-category text features.
The functions of the functional modules of the above apparatus in the embodiments of the present application may be implemented according to the methods in the corresponding method embodiments; for the detailed implementation process, reference may be made to the relevant description of the above method embodiments, which will not be repeated here.
It can be seen from the above that this embodiment can effectively improve the accuracy of video-text mutual retrieval.
The cross-media retrieval apparatus and the video-text mutual inspection apparatus mentioned above are described from the perspective of functional modules. Optionally, an embodiment of the present application further provides an electronic device, described from the perspective of hardware. FIG. 11 is a schematic structural diagram of the electronic device provided by an embodiment of the present application in one implementation. As shown in FIG. 11, the electronic device includes a memory 110 configured to store a computer program, and a processor 111 configured to implement, when executing the computer program, the steps of the video-text mutual inspection model training method and/or the video-text mutual inspection method mentioned in any of the above embodiments.
The processor 111 may include one or more processing cores, such as a 4-core or 8-core processor, and may also be a controller, a microcontroller, a microprocessor or another data processing chip. The processor 111 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array) or PLA (Programmable Logic Array). The processor 111 may also include a main processor and a coprocessor: the main processor, also called the CPU (Central Processing Unit), is configured to process data in the awake state, while the coprocessor is a low-power processor configured to process data in the standby state. In some embodiments, the processor 111 may integrate a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 111 may further include an AI (Artificial Intelligence) processor configured to handle computing operations related to machine learning.
The memory 110 may include one or more non-volatile storage media, which may be non-transitory, and may also include high-speed random access memory and non-volatile memory such as one or more disk storage devices or flash memory devices. In some embodiments, the memory 110 may be an internal storage unit of the electronic device, for example the hard disk of the server 702; in other embodiments, it may be an external storage device of the electronic device, for example a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card equipped on the server 702. Optionally, the memory 110 may include both an internal storage unit and an external storage device of the electronic device. The memory 110 may be configured not only to store application software installed on the electronic device and various types of data, such as the code of the program executed in the course of the above video-text mutual inspection model training method and/or the above video-text mutual inspection method, but also to temporarily store data that has been output or is to be output. In this embodiment, the memory 110 is at least configured to store the following computer program 1101, which, after being loaded and executed by the processor 111, can implement the relevant steps of the video-text mutual inspection model training method and/or the video-text mutual inspection method disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 110 may also include an operating system 1102 and data 1103, and the storage may be temporary or permanent. The operating system 1102 may include Windows, Unix, Linux, and the like, and the data 1103 may include, but is not limited to, data generated during the training of the video-text mutual inspection model and/or data corresponding to the video-text mutual inspection results.
In some embodiments, the electronic device may further include a display screen 112, an input/output interface 113, a communication interface 114 (also called a network interface), a power supply 115 and a communication bus 116. The display screen 112 and the input/output interface 113, such as a keyboard, belong to the user interface, and an optional user interface may also include a standard wired interface, a wireless interface, and the like. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also appropriately be called a display screen or display unit, is configured to display the information processed in the electronic device and to display a visual user interface. The communication interface 114 may optionally include a wired interface and/or a wireless interface, such as a WI-FI interface or a Bluetooth interface, and is usually used to establish a communication connection between the electronic device and other electronic devices. The communication bus 116 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, and may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in FIG. 11, but this does not mean that there is only one bus or one type of bus.
Those skilled in the art can understand that the structure shown in FIG. 11 does not limit the electronic device, which may include more or fewer components than shown, for example a sensor 117 for implementing various functions.
The functions of the functional modules of the above electronic device in the embodiments of the present application may be implemented according to the methods in the above method embodiments; for the detailed implementation process, reference may be made to the relevant description of the above method embodiments, which will not be repeated here.
It can be seen from the above that this embodiment can effectively improve the accuracy of video-text mutual retrieval.
It can be understood that, if the video-text mutual retrieval model training method and/or the video-text mutual retrieval method in the above embodiments is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a non-volatile storage medium. Based on this understanding, the technical solution of the embodiments of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a non-volatile storage medium and executes all or part of the steps of the methods of the embodiments of the present application. The aforementioned non-volatile storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), an electrically erasable programmable ROM, a register, a hard disk, a multimedia card, a card-type memory (for example, SD or DX memory), a magnetic memory, a removable disk, a CD-ROM, a magnetic disk or an optical disc.
Based on this, an embodiment of the present application further provides a non-volatile storage medium storing a computer program; when the computer program is executed by a processor, the steps of the video-text mutual retrieval model training method and/or the video-text mutual retrieval method in any of the above embodiments are implemented.
The functions of the functional modules of the non-volatile storage medium in the embodiments of the present application can be implemented according to the methods in the method embodiments described above; for the detailed implementation, reference may be made to the relevant description of the method embodiments, which is not repeated here.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may refer to one another. For the hardware disclosed in the embodiments, including the apparatuses and the electronic device, the description is relatively brief because the hardware corresponds to the methods disclosed in the embodiments; for the relevant parts, reference may be made to the description of the methods.
Those skilled in the art may further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms of their functions. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different approaches to implement the described functions for each specific application, but such implementations should not be considered to go beyond the scope of the embodiments of the present application.
The video-text mutual retrieval model training method and apparatus, the video-text mutual retrieval method and apparatus, the electronic device and the non-volatile storage medium provided in the embodiments of the present application have been described in detail above. Optional examples are used herein to illustrate the principles and implementations of the embodiments of the present application, and the description of the above embodiments is only intended to help understand the methods and core ideas of the embodiments. It should be noted that those of ordinary skill in the art may make several improvements and modifications to the embodiments of the present application without departing from the principles of the embodiments, and such improvements and modifications also fall within the scope of protection of the claims of the embodiments of the present application.

Claims (20)

  1. A method for training a video-text mutual retrieval model, comprising:
    acquiring text feature information of a sample text in each group of training samples of a training sample set, wherein the sample text comprises first-category text data, second-category text data and third-category text data, the second-category text data comprises the first-category text data, and the third-category text data summarizes the second-category text data and the first-category text data; the text feature information comprises a first-category text feature, a second-category text feature and a third-category text feature respectively corresponding to the first-category text data, the second-category text data and the third-category text data; and the first-category text feature and the second-category text feature determine node features and connection edges of a heterogeneous graph neural network in the video-text mutual retrieval model;
    for a sample video in each group of training samples, recombining multiple image frames extracted from the sample video to obtain multiple image sets, wherein a same image is included in different image sets;
    generating a video feature of the sample video according to image features of the different image sets and the associations between the image sets; and
    training the video-text mutual retrieval model based on a text feature and a corresponding video feature of each group of training samples, wherein the text feature is a fused feature of the feature of the second-category text data extracted by the heterogeneous graph neural network and the third-category text feature.
  2. The method for training a video-text mutual retrieval model according to claim 1, wherein recombining the multiple image frames extracted from the sample video to obtain the multiple image sets comprises:
    acquiring image recombination parameters, the image recombination parameters comprising a total number of image sets and a total number of image frames contained in each image set; and
    determining, according to the image recombination parameters, the image frames contained in each image set, so as to segment an image sequence formed by the multiple image frames.
  3. The method for training a video-text mutual retrieval model according to claim 2, wherein each image set contains the same total number of image frames, and determining, according to the image recombination parameters, the image frames contained in each image set comprises:
    for the first image set, determining the image frames contained in the first image set according to the total number of image frames per set and the first frame of the image sequence;
    invoking an image segmentation relation to determine the frame-index difference between adjacent image sets, the image segmentation relation being m+nk=N; and
    for each remaining image set, determining the image frames contained in that image set based on the image frames contained in the previous image set and the frame-index difference;
    where m is the total number of image frames contained in each image set, N is the total number of image frames contained in the image sequence, n is the total number of image sets, and k is the frame-index difference, which is an integer.
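A minimal sketch of the cross-segmentation in claims 2 and 3, assuming the extracted frames are already available as a list and that N − m divides evenly by n; the function name and the even-division assumption are illustrative, not taken from the patent:

```python
def cross_split(frames, n_sets, set_size):
    """Cross-segment a frame sequence into overlapping image sets (claims 2-3).

    frames   : list of extracted frames, length N
    n_sets   : total number of image sets, n
    set_size : number of frames per image set, m
    The frame-index difference k between adjacent sets follows the stated
    relation m + n*k = N, so k = (N - m) / n (assumed to divide evenly here).
    """
    N, m, n = len(frames), set_size, n_sets
    k = (N - m) // n                                 # frame-index difference between adjacent sets
    image_sets = []
    start = 0
    for _ in range(n):
        image_sets.append(frames[start:start + m])   # window of m consecutive frames
        start += k                                   # shift by k, so neighbouring sets overlap
    return image_sets

# Example: 16 extracted frames, 4 sets of 8 frames -> k = (16 - 8) / 4 = 2,
# so the sets start at frames 0, 2, 4, 6 and share frames with their neighbours.
sets = cross_split(list(range(16)), n_sets=4, set_size=8)
```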
  4. The method for training a video-text mutual retrieval model according to claim 1, wherein recombining the multiple image frames extracted from the sample video to obtain the multiple image sets comprises:
    acquiring video splitting parameters by parsing a video splitting instruction;
    splitting the sample video into multiple video segments according to the video splitting parameters; and
    for each video segment, extracting a target image frame that identifies the current video segment.
  5. The method for training a video-text mutual retrieval model according to claim 4, wherein extracting the target image frame that identifies the current video segment comprises:
    extracting the first image frame of the current video segment as the target image frame of the current video segment.
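Claims 4 and 5 can be read as a sliding split over the decoded frames followed by taking each segment's first frame; the sketch below assumes the video is given as a frame list, splits it evenly, and uses an illustrative `overlap` parameter to reflect claim 15 (overlapping segments). None of the parameter names are taken from the patent:

```python
def split_and_sample(frames, num_segments, overlap=0):
    """Split a frame list into num_segments segments and keep each segment's
    first frame as its target frame (claims 4-5). When overlap > 0, adjacent
    segments share that many frames (claim 15). The even split is an
    illustrative assumption."""
    seg_len = (len(frames) + (num_segments - 1) * overlap) // num_segments
    stride = seg_len - overlap
    targets = []
    for i in range(num_segments):
        segment = frames[i * stride : i * stride + seg_len]
        targets.append(segment[0])   # claim 5: the first frame identifies the segment
    return targets

# Example: 100 decoded frames, 5 segments, 10-frame overlap between neighbours.
first_frames = split_and_sample(list(range(100)), num_segments=5, overlap=10)
```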
  6. The method for training a video-text mutual retrieval model according to claim 1, wherein generating the video feature of the sample video according to the image features of the different image sets and the associations between the image sets comprises:
    pre-training an image feature extraction network; and
    for each image set, inputting all image frames contained in the current image set into the image feature extraction network to obtain the image feature corresponding to the current image set;
    wherein the image feature extraction network comprises a first 3D convolution structure, a first downsampling structure, a second 3D convolution structure, a second downsampling structure, a 2D convolution structure and a fully connected layer; and
    the first 3D convolution structure performs a 3D convolution operation on the input of the image feature extraction network; the first downsampling structure downsamples the output features of the first 3D convolution structure; the second 3D convolution structure performs a 3D convolution operation on the output features of the first downsampling structure; the second downsampling structure downsamples the output features of the second 3D convolution structure; and the 2D convolution structure performs a 2D convolution operation on the output features of the second downsampling structure.
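The structure order in claim 6 (3D convolution, downsampling, 3D convolution, downsampling, 2D convolution, fully connected layer) can be sketched in PyTorch as follows; the channel counts, kernel sizes, temporal pooling step and output dimension are all illustrative assumptions, since the claim fixes only the order of the structures:

```python
import torch
import torch.nn as nn

class ImageSetFeatureNet(nn.Module):
    """3D conv -> downsample -> 3D conv -> downsample -> 2D conv -> FC,
    following the structure order in claim 6. All hyperparameters are
    illustrative assumptions."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.conv3d_1 = nn.Conv3d(3, 32, kernel_size=3, padding=1)    # first 3D convolution structure
        self.down_1   = nn.MaxPool3d(kernel_size=2)                   # first downsampling structure
        self.conv3d_2 = nn.Conv3d(32, 64, kernel_size=3, padding=1)   # second 3D convolution structure
        self.down_2   = nn.MaxPool3d(kernel_size=2)                   # second downsampling structure
        self.conv2d   = nn.Conv2d(64, 128, kernel_size=3, padding=1)  # 2D convolution structure
        self.fc       = nn.Linear(128, feat_dim)                      # fully connected layer

    def forward(self, clips):                    # clips: (B, 3, T, H, W), one image set per sample
        x = self.down_1(torch.relu(self.conv3d_1(clips)))
        x = self.down_2(torch.relu(self.conv3d_2(x)))
        x = x.mean(dim=2)                        # collapse the remaining temporal axis (assumed bridge to 2D)
        x = torch.relu(self.conv2d(x))
        x = x.flatten(2).mean(-1)                # global spatial pooling -> (B, 128)
        return self.fc(x)                        # image-set feature y_i
```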
  7. The method for training a video-text mutual retrieval model according to claim 1, wherein generating the video feature of the sample video according to the image features of the different image sets and the associations between the image sets comprises:
    for each image set, determining a current initial weight of the current image set based on the image feature of the current image set, and determining a weight coefficient of the current image set based on the current initial weight and the initial weights of all image sets; and
    generating the video feature of the sample video according to the image features of the image sets and the corresponding weight coefficients.
  8. The method for training a video-text mutual retrieval model according to claim 7, wherein determining the current initial weight of the current image set based on the image feature of the current image set comprises:
    invoking an initial weight calculation relation to calculate the current initial weight of the current image set, the initial weight calculation relation being:
    a_i = q^T ReLU(H·y_i);
    where a_i is the initial weight of the i-th image set, q is a known vector, q^T is the transpose of q, ReLU() is the ReLU function, H is a weight matrix, and y_i is the image feature of the i-th image set.
  9. The method for training a video-text mutual retrieval model according to claim 7, wherein determining the weight coefficient of the current image set based on the current initial weight and the initial weights of all image sets comprises:
    invoking a weight calculation relation to calculate the weight coefficient of the current image set, the weight calculation relation being:
    a_i' = softmax(a_i) = exp(a_i) / Σ_{j=1}^{n} exp(a_j);
    where a_i' is the weight coefficient of the i-th image set, a_i is the initial weight of the i-th image set, softmax() is the softmax function, a_j is the initial weight of the j-th image set, and n is the total number of image sets.
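Claims 7 to 9 together describe an attention-style weighting of the image-set features; below is a compact NumPy sketch, under the assumption that the video feature of claim 7 is the weighted sum of the image-set features (the claim itself only requires combining the features with their weight coefficients), with q and H treated as already-learned parameters:

```python
import numpy as np

def video_feature(image_set_feats, q, H):
    """Aggregate per-image-set features into one video feature (claims 7-9).

    image_set_feats : array of shape (n, d), one feature y_i per image set
    q               : vector of shape (h,)   -- the 'known vector' q of claim 8
    H               : matrix of shape (h, d) -- the weight matrix H of claim 8
    """
    # Claim 8: initial weight a_i = q^T ReLU(H · y_i)
    a = np.array([q @ np.maximum(H @ y, 0.0) for y in image_set_feats])
    # Claim 9: weight coefficient a_i' = softmax over all initial weights
    a_prime = np.exp(a - a.max())
    a_prime /= a_prime.sum()
    # Claim 7 (assumed reading): video feature as the weighted sum of image-set features
    return (a_prime[:, None] * image_set_feats).sum(axis=0)
```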
  10. The method for training a video-text mutual retrieval model according to any one of claims 1 to 9, wherein training the video-text mutual retrieval model based on the text features and corresponding video features of each group of training samples comprises:
    invoking, based on the text features and corresponding video features of each group of training samples, a loss function to guide the training process of the video-text mutual retrieval model, the loss function being given by the formula published as image PCTCN2022141680-appb-100002;
    in that formula, the quantities involved are: the loss function itself; N, the number of training sample groups; min d(), the minimum of the computed distances; the a-th sample video among all sample videos contained in the training sample set; the p-th sample text among all sample texts contained in the training sample set, which corresponds to the a-th sample video; the n-th sample text among all sample texts, which does not correspond to the a-th sample video; the a-th sample text among all sample texts; the p-th sample video among all sample videos, which corresponds to the a-th sample text; the n-th sample video among all sample videos, which does not correspond to the a-th sample text; and the hyperparameter ▽.
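The loss in claim 10 is published only as a formula image, but its symbol definitions (an anchor sample, its corresponding counterpart, non-corresponding samples, a distance d, a minimum over distances and a hyperparameter ▽) are consistent with a bidirectional margin loss with hardest-negative mining. The sketch below is that reading, stated as an assumption rather than as the patent's exact formula; the Euclidean distance and the use of ▽ as a margin are also assumptions:

```python
import torch

def bidirectional_triplet_loss(video_feats, text_feats, margin=0.2):
    """One plausible reading of claim 10: for each aligned (video, text) pair,
    keep the matching pair closer than the hardest (closest) non-matching
    sample by at least `margin` (the hyperparameter written as ▽), in both the
    video->text and text->video directions.

    video_feats, text_feats : tensors of shape (N, d); row a of each is a pair.
    """
    d = torch.cdist(video_feats, text_feats)            # d[a, b] = distance(video_a, text_b)
    N = d.size(0)
    pos = d.diag()                                       # distance to the corresponding sample
    off_diag = d + torch.eye(N, device=d.device) * 1e9   # mask out the matching pairs
    hardest_text = off_diag.min(dim=1).values            # closest non-matching text per video
    hardest_video = off_diag.min(dim=0).values           # closest non-matching video per text
    loss_v2t = torch.clamp(margin + pos - hardest_text, min=0)
    loss_t2v = torch.clamp(margin + pos - hardest_video, min=0)
    return (loss_v2t + loss_t2v).sum()
```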
  11. The method for training a video-text mutual retrieval model according to claim 1, wherein recombining the multiple image frames extracted from the sample video to obtain the multiple image sets comprises:
    integrating the multiple image frames into one image sequence in the order of extraction, and obtaining the multiple image sets by cross-segmenting the image sequence.
  12. The method for training a video-text mutual retrieval model according to claim 1, wherein recombining the multiple image frames extracted from the sample video to obtain the multiple image sets comprises:
    randomly integrating the multiple image frames into one image sequence, and obtaining the multiple image sets by segmenting the image sequence.
  13. The method for training a video-text mutual retrieval model according to claim 1, wherein recombining the multiple image frames extracted from the sample video to obtain the multiple image sets comprises:
    randomly assigning the multiple image frames to different image sets.
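Claims 11 to 13 list alternative recombination strategies. The random-assignment variant of claim 13, for example, can be sketched as follows; sampling each set independently is an assumption made so that the same frame can appear in several sets, as required by claim 1:

```python
import random

def random_assign(frames, n_sets, set_size):
    """Claim 13 (illustrative reading): randomly assign extracted frames to
    image sets; each set is drawn independently, so a frame may land in more
    than one set."""
    return [random.sample(frames, set_size) for _ in range(n_sets)]
```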
  14. The method for training a video-text mutual retrieval model according to claim 4, wherein the video splitting parameters comprise the number of segments into which the sample video is split and identification information of the sample video.
  15. The method for training a video-text mutual retrieval model according to claim 4, wherein the multiple video segments overlap one another.
  16. A video-text mutual retrieval method, comprising:
    training a video-text mutual retrieval model in advance by using the method for training a video-text mutual retrieval model according to any one of claims 1 to 15;
    recombining multiple image frames extracted from a video to be retrieved to obtain multiple image sets, wherein a same image is included in different image sets;
    generating a to-be-matched video feature of the video to be retrieved according to image features of the different image sets and the associations between the image sets; and
    inputting a to-be-matched text feature of a text to be retrieved and the to-be-matched video feature into the video-text mutual retrieval model to obtain a video-text mutual retrieval result, wherein the text to be retrieved comprises first-category text data, second-category text data and third-category text data, the second-category text data comprises the first-category text data, and the third-category text data summarizes the second-category text data and the first-category text data; and the to-be-matched text feature is a fused feature of the feature of the second-category text data extracted by the heterogeneous graph neural network of the video-text mutual retrieval model and the third-category text feature.
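Once the model of claim 16 produces a to-be-matched text feature and the to-be-matched video features, retrieval reduces to ranking candidates by a similarity score. The sketch below assumes cosine similarity and pre-computed features, neither of which is fixed by the claim:

```python
import torch

def retrieve(query_text_feat, video_feats, top_k=5):
    """Rank candidate videos for one text query by cosine similarity.

    query_text_feat : (d,)   to-be-matched text feature from the model
    video_feats     : (M, d) to-be-matched video features of the candidates
    Returns the indices of the top_k most similar videos; the same routine with
    the roles swapped gives text retrieval for a video query."""
    q = torch.nn.functional.normalize(query_text_feat, dim=0)
    v = torch.nn.functional.normalize(video_feats, dim=1)
    scores = v @ q                                    # cosine similarity per candidate video
    return scores.topk(min(top_k, len(scores))).indices
```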
  17. An apparatus for training a video-text mutual retrieval model, comprising:
    a text feature acquisition module, configured to acquire text feature information of a sample text in each group of training samples of a training sample set, wherein the sample text comprises first-category text data, second-category text data and third-category text data, the second-category text data comprises the first-category text data, and the third-category text data summarizes the second-category text data and the first-category text data; the text feature information comprises a first-category text feature, a second-category text feature and a third-category text feature corresponding to the first-category text data, the second-category text data and the third-category text data; and the first-category text feature and the second-category text feature determine node features and connection edges of a heterogeneous graph neural network in the video-text mutual retrieval model;
    a video feature generation module, configured to, for a sample video in each group of training samples, recombine multiple image frames extracted from the sample video to obtain multiple image sets, wherein a same image is included in different image sets, and to generate a video feature of the sample video according to image features of the different image sets and the associations between the image sets; and
    a training module, configured to train the video-text mutual retrieval model based on a text feature and a corresponding video feature of each group of training samples, wherein the text feature is a fused feature of the feature of the second-category text data extracted by the heterogeneous graph neural network and the third-category text feature.
  18. A video-text mutual retrieval apparatus, comprising:
    a model training module, configured to train a video-text mutual retrieval model in advance by using the method for training a video-text mutual retrieval model according to any one of claims 1 to 15;
    a video processing module, configured to recombine multiple image frames extracted from a video to be retrieved to obtain multiple image sets, wherein a same image is included in different image sets, and to generate a to-be-matched video feature of the video to be retrieved according to image features of the different image sets and the associations between the image sets; and
    a mutual retrieval module, configured to input a to-be-matched text feature of a text to be retrieved and the to-be-matched video feature into the video-text mutual retrieval model to obtain a video-text mutual retrieval result, wherein the text to be retrieved comprises first-category text data, second-category text data and third-category text data, the second-category text data comprises the first-category text data, and the third-category text data summarizes the second-category text data and the first-category text data; and the to-be-matched text feature is a fused feature of the feature of the second-category text data extracted by the heterogeneous graph neural network of the video-text mutual retrieval model and the third-category text feature.
  19. An electronic device, comprising a processor and a memory, wherein the processor is configured to implement, when executing a computer program stored in the memory, the steps of the method for training a video-text mutual retrieval model according to any one of claims 1 to 15 and/or the video-text mutual retrieval method according to claim 16.
  20. A non-volatile storage medium, storing a computer program, wherein, when the computer program is executed by a processor, the steps of the method for training a video-text mutual retrieval model according to any one of claims 1 to 15 and/or the video-text mutual retrieval method according to claim 16 are implemented.
PCT/CN2022/141680 2022-11-08 2022-12-23 Video-text mutual retrieval method and apparatus, training method and apparatus for video-text mutual retrieval model, and device and medium WO2024098525A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211388901.3 2022-11-08
CN202211388901.3A CN115438225B (en) 2022-11-08 2022-11-08 Video text mutual inspection method and model training method, device, equipment and medium thereof

Publications (1)

Publication Number Publication Date
WO2024098525A1 true WO2024098525A1 (en) 2024-05-16

Family

ID=84253119

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/141680 WO2024098525A1 (en) 2022-11-08 2022-12-23 Video-text mutual retrieval method and apparatus, training method and apparatus for video-text mutual retrieval model, and device and medium

Country Status (2)

Country Link
CN (1) CN115438225B (en)
WO (1) WO2024098525A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115438225B (en) * 2022-11-08 2023-03-24 苏州浪潮智能科技有限公司 Video text mutual inspection method and model training method, device, equipment and medium thereof
CN116049459B (en) * 2023-03-30 2023-07-14 浪潮电子信息产业股份有限公司 Cross-modal mutual retrieval method, device, server and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111680173B (en) * 2020-05-31 2024-02-23 西南电子技术研究所(中国电子科技集团公司第十研究所) CMR model for unified searching cross-media information
CN112131449B (en) * 2020-09-21 2022-07-22 西北大学 Method for realizing cultural resource cascade query interface based on ElasticSearch
CN114357124B (en) * 2022-03-18 2022-06-14 成都考拉悠然科技有限公司 Video paragraph positioning method based on language reconstruction and graph mechanism
CN114969405B (en) * 2022-04-30 2024-01-26 苏州浪潮智能科技有限公司 Cross-modal image-text mutual detection method
CN115293348A (en) * 2022-08-15 2022-11-04 腾讯科技(深圳)有限公司 Pre-training method and device for multi-mode feature extraction network

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210349954A1 (en) * 2020-04-14 2021-11-11 Naver Corporation System and method for performing cross-modal information retrieval using a neural network using learned rank images
CN113704546A (en) * 2021-08-23 2021-11-26 西安电子科技大学 Video natural language text retrieval method based on space time sequence characteristics
CN113806482A (en) * 2021-09-17 2021-12-17 中国电信集团系统集成有限责任公司 Cross-modal retrieval method and device for video text, storage medium and equipment
CN115062208A (en) * 2022-05-30 2022-09-16 苏州浪潮智能科技有限公司 Data processing method and system and computer equipment
CN114896429A (en) * 2022-07-12 2022-08-12 苏州浪潮智能科技有限公司 Image-text mutual detection method, system, equipment and computer readable storage medium
CN115438225A (en) * 2022-11-08 2022-12-06 苏州浪潮智能科技有限公司 Video text mutual inspection method and model training method, device, equipment and medium thereof

Also Published As

Publication number Publication date
CN115438225B (en) 2023-03-24
CN115438225A (en) 2022-12-06

Similar Documents

Publication Publication Date Title
WO2024098525A1 (en) Video-text mutual retrieval method and apparatus, training method and apparatus for video-text mutual retrieval model, and device and medium
WO2020155423A1 (en) Cross-modal information retrieval method and apparatus, and storage medium
CN110647614A (en) Intelligent question and answer method, device, medium and electronic equipment
US20230069197A1 (en) Method, apparatus, device and storage medium for training video recognition model
US10878247B2 (en) Method and apparatus for generating information
WO2024098533A1 (en) Image-text bidirectional search method, apparatus and device, and non-volatile readable storage medium
WO2024098623A1 (en) Cross-media retrieval method and apparatus, cross-media retrieval model training method and apparatus, device, and recipe retrieval system
WO2024098524A1 (en) Text and video cross-searching method and apparatus, model training method and apparatus, device, and medium
WO2022166258A1 (en) Behavior recognition method and apparatus, terminal device, and computer-readable storage medium
WO2022140900A1 (en) Method and apparatus for constructing personal knowledge graph, and related device
CN108171189A (en) A kind of method for video coding, video coding apparatus and electronic equipment
CN113570030A (en) Data processing method, device, equipment and storage medium
CN108898549A (en) Image processing method, picture processing unit and terminal device
EP4213097A1 (en) Image generation method and apparatus
CN113673613A (en) Multi-modal data feature expression method, device and medium based on contrast learning
US20230252070A1 (en) Method and apparatus for training retrieval model, retrieval method and apparatus, device and medium
CN111368551A (en) Method and device for determining event subject
US11620547B2 (en) Estimating number of distinct values in a data set using machine learning
JP7309811B2 (en) Data annotation method, apparatus, electronics and storage medium
CN109993026A (en) The training method and device of relatives' identification network model
WO2024098763A1 (en) Text operation diagram mutual-retrieval method and apparatus, text operation diagram mutual-retrieval model training method and apparatus, and device and medium
CN111325212A (en) Model training method and device, electronic equipment and computer readable storage medium
CN114327493A (en) Data processing method and device, electronic equipment and computer readable medium
CN107451194A (en) A kind of image searching method and device
CN108763260A (en) A kind of examination question searching method, system and terminal device