WO2024098525A1

WO2024098525A1 - Video-text mutual retrieval method and apparatus, training method and apparatus for video-text mutual retrieval model, and device and medium

Info

Publication number: WO2024098525A1
Application number: PCT/CN2022/141680
Authority: WO
Inventors: 李仁刚; 王立; 范宝余; 郭振华
Original assignee: 苏州元脑智能科技有限公司
Priority date: 2022-11-08
Filing date: 2022-12-23
Publication date: 2024-05-16
Also published as: CN115438225B; CN115438225A

Abstract

The embodiments of the present application are applied to the technical field of information retrieval. Disclosed are a video-text mutual retrieval method and apparatus, a training method and apparatus for a video-text mutual retrieval model, and a device and a medium. The training method for a video-text mutual retrieval model comprises: acquiring text feature information of sample text in each group of training samples in a training sample set, and on the basis of the text feature information, determining node features and edges of a heterogeneous graph neural network in a video-text mutual retrieval model; for a sample video in each group of training samples, re-combining a plurality of image frames extracted from the sample video, so as to obtain a plurality of image sets; generating a video feature according to image features of different image sets and an association relationship between the image sets; and training the video-text mutual retrieval model on the basis of a text feature in which a third-type text feature and a feature, which is extracted by using the heterogeneous graph neural network, of second-type text data are fused, and a corresponding video feature. The present application can effectively improve the precision of video-text mutual retrieval.

Description

Video text mutual inspection method and model training method, device, equipment, and medium thereof

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to the Chinese patent application filed with the China Patent Office on November 8, 2022, with application number 202211388901.3, and application name “Video-text mutual inspection method and its model training method, device, equipment, and medium”, all of which are incorporated by reference in this application.

Technical Field

The embodiments of the present application relate to the field of information retrieval technology, and in particular to a video text mutual inspection model training method and device, a video text mutual inspection method and device, an electronic device and a non-volatile storage medium.

Background technique

As computer technology and network technology are widely used in daily work and life, the amount and diversity of data have increased significantly, including text data such as news reports, Weibo and Taobao comment data, WeChat chat records, etc., image data such as emoticons, article illustrations, mobile phone photos, medical images, etc., video data such as TV and movies from various video players, and short videos such as Douyin and Kuaishou, data collected by cameras, etc., and audio data such as various voice broadcasts, WeChat voice, video dubbing, etc. These different multimedia forms of data are often used together to describe the same object or the same scene. In order to facilitate the management of diverse multimedia content, methods for flexible retrieval between different media have been developed.

Among them, for the mutual retrieval between video data and text data, the relevant technology does not directly process the video. It usually divides the video data into multiple frames of image data and then processes the image data. In the image processing process, the relevant technology uses the attention method to weight the extracted image features to the text features, reconstruct the text features, and enhance the similarity between the text and the image. Although this method can reconstruct the electronic text features using attention. However, it simply uses the unidirectional attention of natural images to electronic texts when reconstructing the electronic text features. Since there is a corresponding relationship between natural images and electronic texts, the corresponding high-order features affect each other. Only reconstructing the electronic text features while ignoring the natural image features makes it impossible for the natural image features to accurately correspond to the electronic text features, affecting the mutual retrieval of video texts.

In view of this, how to effectively improve the accuracy of video-text mutual retrieval is a technical problem that technical personnel in the relevant field need to solve.

Summary of the invention

The embodiments of the present application provide a video-text mutual-check model training method and device, a video-text mutual-check method and device, an electronic device and a non-volatile storage medium, which can effectively improve the accuracy of video-text mutual retrieval.

To solve the above technical problems, the present application provides the following technical solutions:

A first aspect of an embodiment of the present application provides a video text mutual inspection model training method, comprising:

Obtaining text feature information of sample text in each group of training samples in the training sample set; the sample text includes first-category text data, second-category text data and third-category text data, the second-category text data includes first-category text data, and the third-category text data is used to summarize the second-category text data and the first-category text data; the text feature information includes first-category text features, second-category text features and third-category text features corresponding to the first-category text data, the second-category text data and the third-category text data; the first-category text features and the second-category text features determine the node features and connection edges of the heterogeneous graph neural network in the video text mutual inspection model;

For the sample videos in each group of training samples, multiple frames of images extracted from the sample videos are reassembled to obtain multiple image sets, and the same image is included in different image sets;

Generate video features of the sample video according to image features of different image sets and correlations between the image sets;

The video-text mutual inspection model is trained based on the text features and corresponding video features of each group of training samples; the text features are fused features of the features of the second category of text data extracted using the heterogeneous graph neural network and the third category of text features.

Optionally, the multiple frames of images extracted from the sample video are reassembled to obtain multiple image sets, including:

Obtaining image recombining parameters; the image recombining parameters include the total number of image sets and the total number of image frames contained in each image set;

According to the above-mentioned image recombining parameters, the image frames included in each image set are determined to perform segmentation processing on the image sequence formed by multiple frames of images.

Optionally, the total number of image frames included in each image set is the same, and the above-mentioned determining the image frames included in each image set according to the above-mentioned image recombination parameters includes:

For the first image set, determining the image frames included in the first image set according to the total number of image frames and the first frame of the image sequence;

The image segmentation relational formula is called to determine the difference in the image frame sequence numbers of adjacent image sets; the above image segmentation relational formula is: m+nk=N;

For each of the remaining image sets, based on the image frames included in the previous image set of the current image set and the image frame sequence number difference, the image frames included in the corresponding image set are determined;

Wherein, m is the total number of image frames included in each image set, N is the total number of image frames included in the above image sequence, n is the total number of image sets, and k is the image frame sequence number difference, which is an integer.

Obtain video splitting parameters by parsing the video splitting instruction;

According to the above video splitting parameters, split the above sample video into multiple video segments;

For each video segment, a target image frame for identifying the current video segment is extracted.

Optionally, the above-mentioned extraction for identifying the target image frame of the current video segment includes:

The first frame image of the current video segment is extracted to serve as the target image frame of the current video segment.

Optionally, the generating of the video features of the sample video according to the image features of different image sets and the association relationship between the image sets includes:

Pre-train image feature extraction network;

For each image set, the image frames contained in the current image set are input into the image feature extraction network to obtain the image features corresponding to the current image set;

The image feature extraction network includes a first 3D convolution structure, a first downsampling structure, a second 3D convolution structure, a second downsampling structure, a 2D convolution structure and a fully connected layer.

The above-mentioned first 3D convolution structure is used to perform a 3D convolution operation on the input information of the above-mentioned image feature extraction network; the above-mentioned first downsampling structure is used to perform a downsampling operation on the output features of the above-mentioned first 3D convolution structure; the above-mentioned second 3D convolution structure is used to perform a 3D convolution operation on the output features of the above-mentioned first downsampling structure; the above-mentioned second downsampling structure is used to perform a downsampling operation on the features output by the above-mentioned second 3D convolution structure; the above-mentioned 2D convolution structure is used to perform a 2D convolution operation on the output features of the above-mentioned second downsampling structure.

Optionally, the generating of the video features of the sample video according to the image features of different image sets and the association relationship between the image sets includes: for each image set, determining the current initial weight of the current image set based on the image features of the current image set, and determining the weight coefficient of the current image set based on the current initial weight and the initial weight of each image set;

The video features of the sample video are generated according to the image features of each image set and the corresponding weight coefficients.

Optionally, the determining of the current initial weight of the current image set based on the image features of the current image set includes:

The initial weight calculation formula is called to calculate the current initial weight of the current image set; the initial weight calculation formula is:

ai＝qTReLU(H·yi);

Where ai is the initial weight of the i-th image set, q is a known vector, qT represents the transpose of q, ReLU() is the ReLU function, H is the weight matrix, and yi is the image feature of the i-th image set.

Optionally, the determining of the weight coefficient of the current image set based on the current initial weight and the initial weight of each image set includes:

The weight calculation relationship is called to calculate the weight coefficient of the current image set; the weight calculation relationship is:

Where a _i ′ is the weight coefficient of the i-th image set, a _i is the initial weight of the i-th image set, softmax() is the softmax function, a _j is the initial weight of the j-th image set, and n is the total number of image sets.

Optionally, the video text mutual inspection model is trained based on the text features and corresponding video features of each group of training samples, including:

Based on the text feature information and corresponding video features of each set of training samples, a loss function is called to guide the training process of the video text mutual inspection model; the above loss function is:

In the formula,

is the above loss function, N is the number of training sample groups, min d() represents the minimum value of the calculated distance,

is the ath sample video among all the sample videos contained in the above training sample set,

is the pth sample text among all the sample texts contained in the above training sample set, and it corresponds to the ath sample video,

is the nth sample text in all sample text data, and it does not correspond to the ath sample video,

is the ath sample text in all sample text data,

is the pth sample video among all sample videos, and it corresponds to the ath sample text,

is the nth sample video among all sample video data, and it does not correspond to the ath sample text,

is a hyperparameter.

Optionally, the above-mentioned recombining multiple frames of images extracted from the above-mentioned sample video to obtain multiple image sets includes: integrating the above-mentioned multiple frames of images into an image sequence according to the order of extraction, and obtaining the above-mentioned multiple image sets by cross-segmenting the above-mentioned image sequence.

Optionally, the above-mentioned recombining the multiple frames of images extracted from the above-mentioned sample video to obtain multiple image sets includes: randomly integrating the above-mentioned multiple frames of images into an image sequence, and obtaining the above-mentioned multiple image sets by segmenting the above-mentioned image sequence.

Optionally, the above-mentioned recombining the multiple frames of images extracted from the above-mentioned sample video to obtain multiple image sets includes: randomly allocating the above-mentioned multiple frames of images to different image sets.

Optionally, the video splitting parameters include the number of segments of the sample video and identification information of the sample video.

Optionally, the multiple video segments overlap with each other.

A second aspect of an embodiment of the present application provides a video text mutual inspection model training device, comprising:

A text feature acquisition module is configured to acquire text feature information of sample text in each group of training samples in a training sample set; the sample text includes first-category text data, second-category text data and third-category text data, the second-category text data includes first-category text data, and the third-category text data is used to summarize the second-category text data and the first-category text data; the text feature information includes first-category text features, second-category text features and third-category text features corresponding to the first-category text data, the second-category text data and the third-category text data; the first-category text features and the second-category text features determine node features and connection edges of a heterogeneous graph neural network in a video text mutual inspection model;

The video feature generation module is configured to reassemble multiple frames of images extracted from the sample video in each set of training samples to obtain multiple image sets, wherein the same image is included in different image sets; and generate video features of the sample video according to image features of different image sets and correlations between the image sets;

The training module is configured to train the video-text mutual inspection model based on the text features of each group of training samples and the corresponding video features; the text features are fused features of the features of the second category of text data extracted using the heterogeneous graph neural network and the third category of text features.

A third aspect of the embodiment of the present application provides a video text mutual inspection method, including:

Preliminarily train a video text mutual inspection model using any of the above-mentioned video text mutual inspection model training methods;

Recombining multiple frames of images extracted from the video to be retrieved to obtain multiple image sets, and the same image is included in different image sets;

Generate matching video features of the video to be retrieved based on the image features of different image sets and the association relationship between the image sets;

The text features to be matched of the text to be retrieved and the above-mentioned video features to be matched are input into the above-mentioned video-text mutual checking model to obtain the video-text mutual checking results; the above-mentioned text to be retrieved includes first-category text data, second-category text data and third-category text data, the above-mentioned second-category text data includes first-category text data, and the above-mentioned third-category text data is used to summarize the above-mentioned second-category text data and the above-mentioned first-category text data; the above-mentioned text features to be matched are the fusion features of the features of the above-mentioned second-category text data and the above-mentioned third-category text features extracted by the heterogeneous graph neural network of the above-mentioned video-text mutual checking model.

A fourth aspect of the embodiments of the present application provides a video text mutual inspection device, including:

The model training module is configured to be preliminarily trained to obtain a video text mutual inspection model using any of the above-mentioned video text mutual inspection model training methods;

The video processing module is configured to reassemble multiple frames of images extracted from the video to be retrieved to obtain multiple image sets, and the same image is included in different image sets; based on the image features of different image sets and the association relationship between the image sets, the video features to be matched of the video to be retrieved are generated;

The mutual check module is configured to input the text features to be matched of the text to be retrieved and the above-mentioned video features to be matched into the above-mentioned video-text mutual check model to obtain the video-text mutual check result; the above-mentioned text to be retrieved includes first-category text data, second-category text data and third-category text data, the above-mentioned second-category text data includes first-category text data, and the above-mentioned third-category text data is used to summarize the above-mentioned second-category text data and the above-mentioned first-category text data; the above-mentioned text features to be matched are the fusion features of the features of the above-mentioned second-category text data and the above-mentioned third-category text features extracted by the heterogeneous graph neural network of the above-mentioned video-text mutual check model.

An embodiment of the present application also provides an electronic device, including a processor, wherein the processor is configured to implement the steps of any of the above-mentioned video-text mutual-checking model training methods and/or the above-mentioned video-text mutual-checking methods when executing a computer program stored in a memory.

Finally, an embodiment of the present application further provides a non-volatile storage medium, on which a computer program is stored. When the computer program is executed by a processor, the steps of the video-text mutual-checking model training method and/or the video-text mutual-checking method as described above are implemented.

The advantage of the technical solution provided by the embodiment of the present application is that different text types are used as heterogeneous nodes of the graph neural network, and the use of the graph neural network is conducive to extracting deeper and richer text features. The fusion features of the third type of text data and the second type of text data that summarize the text data are used as text features for performing matching tasks, which can mine the intrinsic relationship between text data, thereby facilitating the improvement of the accuracy of video-text mutual retrieval. Recombining the image frames extracted from the video data and then extracting the image video is conducive to obtaining image features that can more accurately reflect the video. In the process of determining the video features, the correlation between different image frames is also considered, which is conducive to obtaining more accurate video features, thereby improving the accuracy of text-video mutual retrieval.

In addition, the embodiments of the present application also provide corresponding implementation devices, electronic devices and non-volatile storage media, as well as video text mutual checking methods and devices for video text mutual checking model training methods, making the above methods more practical. The above devices, electronic devices, non-volatile storage medium video text mutual checking methods and devices all have corresponding advantages.

It is to be understood that the foregoing general description and the following detailed description are exemplary only and are not restrictive of the present disclosure.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to more clearly illustrate the technical solutions of the embodiments of the present application or the related technologies, the drawings required for use in the embodiments or the related technical descriptions are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application. For ordinary technicians in this field, other drawings can be obtained based on these drawings without paying creative work.

FIG1 is a flow chart of a video text mutual inspection model training method provided in an embodiment of the present application;

FIG2 is a schematic diagram of constructing a heterogeneous graph neural network provided in an embodiment of the present application;

FIG3 is a schematic diagram of multiple image sets generated by recombining multiple frame images according to an embodiment of the present application;

FIG4 is a schematic diagram of a sample video cutting process provided in an embodiment of the present application;

FIG5 is a schematic diagram of image feature extraction provided in an embodiment of the present application;

FIG6 is a flow chart of a video text mutual inspection method provided in an embodiment of the present application;

FIG7 is a schematic diagram of a video text mutual inspection model framework for an exemplary application scenario provided by an embodiment of the present application;

FIG8 is a schematic diagram of a system structure framework of an exemplary application scenario provided in an embodiment of the present application;

FIG9 is a structural diagram of an optional implementation of a video text mutual inspection model training device provided in an embodiment of the present application;

FIG10 is a structural diagram of an optional implementation of a video text mutual inspection device provided in an embodiment of the present application;

FIG. 11 is a structural diagram of an optional implementation of an electronic device provided in an embodiment of the present application.

Detailed ways

In order to make those skilled in the art better understand the embodiments of the present application, the embodiments of the present application are described in detail below in conjunction with the accompanying drawings and optional implementation methods. Obviously, the described embodiments are only part of the embodiments of the present application, rather than all of the embodiments. Based on the embodiments in the present application, all other embodiments obtained by ordinary technicians in the field without making creative work are within the scope of protection of the embodiments of the present application.

The terms "first", "second", "third", "fourth", etc. in the description and claims of the embodiments of the present application and the above-mentioned drawings are used to distinguish different objects rather than to describe a specific order. In addition, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusions. For example, a process, method, system, product or device that includes a series of steps or units is not limited to the listed steps or units, but may include steps or units that are not listed.

After introducing the technical solutions of the embodiments of the present application, various non-limiting implementation methods of the embodiments of the present application are described in detail below.

First, refer to FIG. 1 , which is a flow chart of a video text mutual inspection model training method provided by an embodiment of the present application. The embodiment of the present application may include the following contents:

S101: Obtain text feature information of sample text in each group of training samples in the training sample set.

In this embodiment, the training sample set is sample data for training the video text mutual inspection model, and the training sample set includes multiple groups of training samples, each group of training samples includes corresponding sample texts and sample videos, that is, the sample text and the sample video are a set of sample data that match each other. As for the number of training sample groups, it can be determined according to the actual training needs and the database used, and the embodiment of the present application does not impose any restrictions on this. The video text mutual inspection model is used to perform the mutual retrieval task of video data and text data, which includes a heterogeneous graph neural network and a video coding network. The heterogeneous graph neural network is used to process the sample text and the second type of text data of the text to be retrieved and finally output the text features corresponding to the text data. The video coding network is used to process the video data and finally output the video features of the video data. The model is obtained based on the text features and video features training. The data types contained in the sample text of this embodiment include at least three types, wherein the text features corresponding to the two data types are used as heterogeneous nodes of the graph structure. For the convenience of description, they can be called the first type of text data and the second type of text data, and the other type of data is the text data summarizing the first type of text data and the second type of text data. Correspondingly, the text feature information includes the first-category text features, the second-category text features, and the third-category text features corresponding to the first-category text data, the second-category text data, and the third-category text data; for the heterogeneous graph neural network, which is a network based on a graph structure, the nodes of the graph structure are the first-category text features and the second-category text features, and the connection edges of the graph structure are determined by whether there is an association relationship between the corresponding features of each heterogeneous node. If there is an association relationship between the features corresponding to two nodes, there is a connection edge relationship between the two nodes. As shown in Figure 2, for the two types of text data of the sample text, the features extracted from the first type of text data include

The second type of features extracted from text data include

The nodes of the heterogeneous graph neural network include

like

and

There are related relationships, such as

Included

The characteristics of

and

With connecting edges e ₃₂ and e ₃₃ , if

and

If there is a relationship,

and

The connection edge e ₁₁ is included therebetween. As for the graph structure of the heterogeneous graph neural network, a corresponding graph structure can be selected based on the actual application scenario, and the embodiments of the present application do not impose any limitation on this.

S102: for each set of training samples, recombining multiple frames of images extracted from the sample videos to obtain multiple image sets.

In this embodiment, for all sample videos included in the training sample set, S102 and S103 are respectively executed. In this step, for each sample video, multiple frames representing the sample video are extracted from the sample video. As for which frames of the sample video are extracted, it can be flexibly selected according to actual needs. Optionally, as for the total number of extracted image frames, it can also be flexibly selected based on actual needs. The embodiment of the present application does not impose any restrictions on this. After extracting multiple frames of images, the multiple frames of images are recombined, and the multiple frames of images can be integrated into an image sequence in the order of extraction, and then multiple image sets are obtained by cross-segmenting the image sequence. The same image in this embodiment is included in different image sets, indicating that the same image appears in at least two image sets. Of course, after extracting multiple frames of images, the multiple frames of images can also be randomly integrated into an image sequence, and then multiple image sets are obtained by segmenting the image sequence. Of course, after extracting multiple frames of images, the multiple frames of images can also be randomly assigned to different image sets, and the same image can be assigned to multiple image sets. As for which method to adopt to regenerate multiple new image sets by recombining multiple frames of images, technicians in the relevant field can flexibly decide according to actual needs.

S103: Generate video features of the sample video according to the image features of different image sets and the association relationship between the image sets.

After obtaining each image set in the previous step, any existing machine learning model such as convolutional neural network, VGG (Visual Geometry Group Network), Resnet (Residual Neural Network), etc. can be used to extract the image features of each frame image contained in each image set, and the image features of all the frames in the image set are integrated into the image features of the image set. The association between the image sets is used to identify the importance of the image features of different image sets to the entire video, and the final video features of the sample video are determined based on the importance of different image sets and the image features of the image sets.

S104: Based on the text features of each group of training samples and the corresponding video features, a video text mutual inspection model is trained.

In this embodiment, the text features of a sample text correspond to the video features of a sample video. The text features of each sample text in this embodiment are fusion features, which are fused with the text features corresponding to the third category text data of the sample text and the features extracted by the heterogeneous graph neural network of the second category text data of the video text mutual inspection model. The text features corresponding to the third category text data can be extracted by any text feature extraction model, and this embodiment does not impose any restrictions on this. During the model training process, a loss function is used to guide the training of the model, and then the network parameters of the video text mutual inspection model are updated by methods such as gradient back propagation until the model training conditions are met, such as reaching the number of iterations or the convergence effect is good. For example, the training process of the video text mutual inspection model may include a forward propagation stage and a back propagation stage. The forward propagation stage is a stage in which data is propagated from a low level to a high level, and the back propagation stage is a stage in which the error is propagated from a high level to a low level when the result obtained by the forward propagation does not meet the expectation. Specifically, all network layer weights are first initialized, such as random initialization; then the video features and text feature information are input and forward propagated through the graph neural network, convolution layer, downsampling layer, fully connected layer and other layers to obtain the output value; the model output value of the video text mutual inspection model is calculated, and the loss value of the output value is calculated based on the loss function. The error is reversed back to the video text mutual inspection model, and the back propagation errors of each part of the video text mutual inspection model such as the graph neural network layer, the fully connected layer, the convolution layer and other layers are obtained in turn. Each layer of the video text mutual inspection model adjusts all weight coefficients of the video text mutual inspection model according to the back propagation errors of each layer to achieve weight update. A new batch of video features and text feature information is randomly selected again, and then the above process is performed again to obtain the output value of the network forward propagation. Infinite reciprocating iterations, when the error between the calculated model output value and the target value (ie, the label) is less than the preset threshold, or the number of iterations exceeds the preset number of iterations, the model training is terminated. All layer parameters of the model corresponding to the end of the model training are used as the network parameters of the trained video text mutual inspection model.

In the technical solution provided in the embodiment of the present application, different text types are used as heterogeneous nodes of the graph neural network, and the use of the graph neural network is conducive to extracting deeper and richer text features, which is conducive to improving the accuracy of video-text mutual retrieval. Recombining the image frames extracted from the video data and then extracting the image video is conducive to obtaining image features that can more accurately reflect the video. In the process of determining the video features, the correlation between different image frames is also considered, which is conducive to obtaining more accurate video features, thereby improving the accuracy of text-video mutual retrieval.

The above embodiment does not limit which loss function is used to guide the model training process in step S104. Technical personnel in the relevant field can use any loss function in the prior art, such as L1 norm loss function, mean square error loss function, cross entropy loss, etc. It can be understood that the loss function is an indicator used to measure the performance of the prediction model in predicting the expected result. Whether the loss function is accurate affects the accuracy of the entire model. In order to improve the accuracy of video text mutual retrieval, the embodiment of the present application also provides an optional implementation of the loss function, that is, based on the text features of each group of training samples and the corresponding video features, the loss function is called to guide the training process of the video text mutual inspection model; the loss function can be expressed as:

In the formula,

is the ath sample text among all sample text data,

is a hyperparameter.

In this embodiment, the loss function will traverse each video feature and text feature information to calculate the average value of the loss function for paired data. This embodiment can traverse N times, where N represents that there are N paired sample data in this batch, that is, there are N groups of training samples in the training sample set. All sample videos of these N groups of training samples can be regarded as a video image group, and all sample texts can be regarded as a text group. First, the video image group feature

Traverse (a total of N), and the selected video features can be called

a represents anchor (anchor sample). The text feature encoding paired with the anchor sample is recorded as

p represents positive (paired matching). Similarly, in this batch

The unpaired text features are recorded as

is a hyperparameter, which is fixed during training, for example, set to 0.3. Similarly, the same traversal operation is performed for text features.

Represents the sample selected in the traversal, and the corresponding video image group feature sample is recorded as

The non-corresponding

In the above embodiment, there is no limitation on how to execute step S102. In this embodiment, an optional image frame combination method is provided, which may include the following steps:

An image recombination parameter is obtained, and the image frames included in each image set are determined according to the image recombination parameter, so as to perform segmentation processing on an image sequence formed by multiple frames of images.

In this embodiment, the image recombination parameters may include the total number of image sets and the total number of image frames contained in each image set. The total number of image sets and the total number of image frames contained in each image set can be changed in real time, that is, the user can input the latest parameter value in real time, and can directly write it into the specified location of the system, which does not affect the implementation of the embodiments of the present application. As for the number of image frames contained in each image set, they can be the same or different. In order to facilitate subsequent image processing, this embodiment can set the number of image frames contained in each image set to be the same. After determining the total number of image sets and the total number of image frames contained in each image set, combined with the number of extracted image frames, the image frames can be allocated and reprocessed through manual interaction. Of course, an automated image segmentation method can also be used. For the scenario where the total number of image frames contained in each image set is the same, this embodiment also provides an optional implementation method for determining the image frames contained in each image set according to the image recombination parameters, which may include the following contents:

For the first image set, the image frames included in the first image set are determined according to the total number of image frames and the first frame image of the image sequence; the image segmentation relationship is called to determine the image frame sequence number difference of adjacent image sets; the image segmentation relationship is: m+nk=N; for the remaining image sets, the image frames included in the corresponding image set are determined based on the image frames included in the previous image set of the current image set and the image frame sequence number difference; wherein m is the total number of image frames included in each image set, N is the total number of image frames included in the image sequence, n is the total number of image sets, and k is the image frame sequence number difference, which is an integer.

In this embodiment, in order to make the implementation method clearer to the technicians in the relevant field, a schematic example is given in conjunction with FIG3. If the image frames extracted from the sample video are N frames, the N frames are divided into n overlapping image sets, and each image set may include m frames. Based on m+nk=N, the image frame sequence number difference k value can be calculated. The first image set includes [1, ..., m], the second image set includes [k+1, ..., m+k], the third image set includes [2k+1, ..., m+2k], and the nth image set includes [nk+1, ..., m+nk]. For example, N=32, n=5, m=16, then k=3.2, rounded up, k=4, then the image sets can be: [1,16], [5,20], [9,24], [13,28] and [16,N].

The sample video is composed of many frames of video images. The above embodiment does not limit the process of extracting multiple frames of images from the sample video. As shown in Figure 4, this embodiment also provides an optional implementation method, that is, by parsing the video splitting instruction, the video splitting parameters are obtained; according to the video splitting parameters, the sample video is split into multiple video segments; for each video segment, the target image frame used to identify the current video segment is extracted. Optionally, the first frame of the current video segment can be extracted as the target image frame of the current video segment. Among them, the video splitting parameter refers to the number of sample video segmentation segments and the sample video identification information. This implementation can divide a sample video into N segments on average, and then take the first frame of each segment as the representative of the image of the segment.

This embodiment divides the image frames extracted from the video into multiple overlapping intervals, which is beneficial to extracting richer image features and improving the accuracy of model training.

The above embodiment does not limit how to generate video features. The embodiment of the present application also provides an illustrative example, which may include the following content:

First, in conjunction with FIG. 5, an embodiment of the present application provides a network structure for extracting image features of each frame of each image set, which is called an image feature extraction network in this embodiment. The image feature extraction network may include a first 3D convolution structure, a first downsampling structure, a second 3D convolution structure, a second downsampling structure, a 2D convolution structure, and a fully connected layer; the first 3D convolution structure is used to perform a 3D convolution operation on the input information of the image feature extraction network; the first downsampling structure is used to perform a downsampling operation on the output features of the first 3D convolution structure; the second 3D convolution structure is used to perform a 3D convolution operation on the output features of the first downsampling structure; the second downsampling structure is used to perform a downsampling operation on the features output by the second 3D convolution structure; and the 2D convolution structure is used to perform a 2D convolution operation on the output features of the second downsampling structure. Based on the above structure, any image database can be used to train the image feature extraction network until the training end condition is reached. For each image set, the image frames contained in the current image set are input into the image feature extraction network to obtain the image features corresponding to the current image set.

For example, consider an image set whose input is a voxel block of multiple frames of images, whose size can be c*m*h*w, where c is the image channel (generally RGB (Red Green Blue) color 3 channels), m is the length of the video sequence = the number of frames of the image in this image set, and h and w are the width and height of the sample video respectively. After performing a 3D convolution operation with a 3D convolution kernel of K*3*3*3, stride of 1, padding = True, and a filter number of K, the output size is K*m*h*w. The same is true for the pooling layer. Based on the above 3D convolution operation, this embodiment uses the C3D (Convolutional 3D) network structure shown in Figure 5, which includes 3D convolution = 3D convolution, 2D convolution = 2D convolution, subsampling layer, and full connection = full connection layer. There are 4 convolution operations and 2 downsampling operations. The size of the convolution kernel is shown in Figure 5. The size of the pooling kernel is 2*2, and the step size is 2. The final network obtains the final output features after one 2D convolution operation and one fully connected layer. The input size of the network is 3*16*224*224, that is, 16 frames of images are input at a time, and the input image size is 224×224. In the embodiment, for each image set input, a 128-dimensional feature vector can be obtained.

After extracting the image features of each image set, the process of generating video features of the sample video according to the image features of different image sets and the correlation between the image sets may include: for each image set, determining the current initial weight of the current image set based on the image features of the current image set, and determining the weight coefficient of the current image set based on the current initial weight and the initial weight of each image set; generating the video features of the sample video according to the image features of each image set and the corresponding weight coefficient.

Among them, the current initial weight of the current image set can be calculated by calling the initial weight calculation formula; the initial weight calculation formula can be expressed as:

a _i =q ^T ReLU(H·y _i );

Where a _i is the initial weight of the i-th image set, q is a known vector, q ^T represents the transpose of q, ReLU() is the ReLU function, H is the weight matrix, and _yi is the image feature of the i-th image set. yi can be mapped to a common space by matrix multiplication of H· _yi , which can be obtained through model training. q ^T multiplied by ReLU(H· _yi ) will get a number.

Among them, the weight coefficient of the current image set can be calculated by calling the weight calculation relationship; the weight calculation relationship can be expressed as:

Where a _i ′ is the weight coefficient of the i-th image set, softmax() is the softmax function, a _j is the initial weight of the j-th image set, and n is the total number of image sets.

Finally, the video feature e _video generated by this embodiment can be expressed as:

In this embodiment, by weighting the features of different image sets, the features of each image set can be expressed more significantly, which is conducive to obtaining more accurate video features and helps to improve the accuracy of model training.

In addition, this embodiment also provides a video text mutual inspection method, please refer to FIG6, which may include the following contents:

S601: Pre-train a video text mutual inspection model.

In this step, the video-text mutual-checking model can be trained in advance using the video-text mutual-checking model training method described in any of the above embodiments to obtain the video-text mutual-checking model.

S602: Recombining multiple frames of images extracted from the video to be retrieved to obtain multiple image sets, and the same image is included in different image sets.

S603: Generate to-be-matched video features of the to-be-retrieved video according to the image features of different image sets and the association relationship between the image sets.

S604: Input the to-be-matched text features and the to-be-matched video features of the to-be-retrieved text into the video-text mutual-checking model to obtain the video-text mutual-checking results.

Among them, the text to be retrieved includes first-category text data, second-category text data and third-category text data, the second-category text data includes first-category text data, and the third-category text data is used to summarize the second-category text data and the first-category text data; the text features to be matched are the fusion features of the features of the second-category text data and the third-category text features extracted by the heterogeneous graph neural network of the video text mutual inspection model.

The processing of the video to be retrieved in this embodiment is S602 and S603. Please refer to the corresponding contents of S102 and S103 in the above embodiment, and will not be repeated here.

During the inference process, the weight coefficients trained by S601 can be preloaded. Feature extraction is performed on the video to be retrieved or the text to be retrieved, and the feature is stored in the data set to be retrieved. The user gives any video to be retrieved or text to be retrieved, which can be called data to be retrieved for the sake of description. The text feature information or video feature of the data to be retrieved is extracted and input into the video-text mutual inspection model. The features of the data to be retrieved are distance matched with the features of all samples in the data set to be retrieved. For example: if the data to be retrieved is text data, the Euclidean distance is calculated with all the features of the video to be retrieved in the data set to be retrieved, and the sample with the smallest distance is the recommended sample for output.

It can be seen from the above that this embodiment can effectively improve the accuracy of video-text mutual retrieval.

It should be noted that there is no strict order of execution between the steps in the embodiments of the present application. As long as they comply with the logical order, these steps can be executed simultaneously or in a preset order. Figures 1 and 6 are only a schematic diagram and do not mean that this is the only execution order.

Finally, in order to make those skilled in the art more clearly understand the implementation methods of the embodiments of the present application, this embodiment also provides an illustrative example for implementing the mutual retrieval task of recipe text and recipe video, which may include the following contents:

Please refer to FIG7 . This embodiment includes a recipe retrieval terminal device 701 and a server 702 . A user can perform operations on the recipe retrieval terminal device 701 . The recipe retrieval terminal device 701 interacts with the server 702 through a network. The server 702 can deploy a video text mutual inspection model, as shown in FIG8 . In order to enable the video text mutual inspection model to realize the function of mutual retrieval between recipe text and recipe video, the video text mutual inspection model needs to be trained. During the training process, the recipe retrieval terminal device 701 can transmit a training sample set to the server 702 . The training sample set can include multiple groups of training samples. Each group of training samples includes a corresponding recipe text sample and a recipe video sample. Each recipe text sample includes operation steps (instruction list), ingredient information (ingredients) and dish name (Title). Instructions are steps for cooking, which are uniformly represented by steps in the following text. Ingredients are ingredients of a dish, which are uniformly represented by ingredients in the following text.

After obtaining the training sample set, the server 702 performs feature encoding on the recipe text and the recipe video respectively. In this embodiment, a heterogeneous graph neural network can be used to encode text information. In this embodiment, the text features are constructed into a graph structure, and the graph structure includes nodes, node features and connection relationships, as shown in Figure 2. Components and steps are different from structure to properties, so they are called heterogeneous nodes. In this embodiment, each step is called a node, and similarly, each component is called a node. A node is composed of a sentence or a phrase. In this embodiment, the Bert model can be used to extract the features of each sentence or each word. The implementation method is as follows: All recipe texts are input from the text information at the bottom, and the position information and text type accompanying the recipe text information are also input. Position information means that if there are 5 words "peel and slice the mango" in a sentence, their position information is "1, 2, 3, 4, 5" respectively. Text type means: if the input text is a step, its text type is 1; if the input text is an ingredient, its text type is 2. Through the BERT model, the encoding features of each sentence and each word can be obtained. The features are used to represent node features, namely, component node features and step node features. Both component node features and step node features are high-dimensional vectors, and their dimensions are R ^d dimensions (d-dimensional real vectors). After determining the node features, if the principal component exists in the operation step, the component node and the step node need to be connected by an edge, that is, there is a connection relationship between the two nodes. Optionally, the step information can be traversed by text comparison, each step text can be extracted, and then the principal component can be searched in turn. If the word in the principal component appears in the step, an edge is connected between the step and the principal component, that is, there is a connection relationship. By traversing all step texts, the connection relationship between the step node and the component node can be constructed, that is, the connection relationship of the heterogeneous graph. After the heterogeneous graph is established, the heterogeneous graph information can be updated using a graph attention network to achieve feature aggregation and update. The update method is to traverse each heterogeneous node in turn for update. The aggregation and extraction of text features are realized by heterogeneous graph operations, and the calculation method can be as follows:

First, update the step node.

is the node feature of the qth node of the step node,

Represents the feature of the pth node of the component node. If the qth node of the step node is connected to the pth node of the component node (that is, has an edge connection relationship), the feature of the pth node of the component node is used to update the qth node feature of the step node. During the update process, the correlation between the nodes needs to be considered. In this embodiment, the correlation between the nodes can be represented by assigning weights. Optionally, the following relationship (1) can be called to calculate the correlation weight zpq between the qth node of the step node and the pth node feature of the component node. For each step node, for example

Traverse all component nodes with connected edges, assuming there are Np nodes, and get the corresponding relevant weight zpq.

Among them, Wa, Wb, Wc are known R ^d×d dimensional matrices,

Represents matrix multiplication, aka vector mapping.

After updating each step node, the relevant weights of all component nodes of the edges connected to the step node can be normalized, that is, the normalized relevant weight α _qp can be obtained by calling the following relationship (2):

In the formula, exp represents the exponential function.

It represents the sum of the relevant weights of all the component nodes of the edges connected to the step node. Finally, the node features of the step node are updated by the normalized relevant weights, that is, the following relationship (3) is called for calculation:

Where σ represents a hyperparameter in the interval [0, 1]. _{W v} is an R ^d×d dimensional matrix,

It is the new feature vector updated by the component nodes connected to it.

Optionally, based on the idea of residual network, the updated

Compared with the initial features before

Addition:

Similarly, equation (5) can be used to perform the same calculation and update on the component nodes:

After traversing all component nodes and step nodes, the network update of one layer of the graph attention network is completed. Usually, T layers of graph attention networks can be superimposed, with t representing the tth layer of the graph attention network. The update method of the node features of each layer is as described above. Usually, an integrated fully connected layer is added after each layer of the graph attention network to realize the re-encoding of node features (including component nodes and step nodes), as shown in the following relationship (6):

FFN (feed-forward layer, feed-forward layer, or fully connected layer) stands for fully connected layer.

Represents the initial node features of the graph attention network at layer t+1.

As above, the update of the node features is completed. In order to realize the retrieval with the recipe video, it is also necessary to summarize and synthesize the features of all text nodes such as operation steps, ingredient information and dish names. In this embodiment, since the step node integrates the ingredient node information, the ingredient node is updated through the graph neural network, and the relevant step node features are emphasized in the form of keywords. At the same time, since the dish name information contains important main material information and cooking methods, the dish name text is usually a widespread existence in the recipe-based graphic and text mutual inspection task. Based on this, this embodiment can also extract the features of the dish name through the Bert (Bidirectional Encoder Representations from Transformer, bidirectional feature encoder) model. After obtaining the features of each text, the BiLSTM (Bi-directional Long Short-Term Memory, bidirectional long short-term memory neural network) method can be used to mine the temporal information of the step node, realize the induction and synthesis of the text node features, and package them into a vector.

In this embodiment, the following equations (7) and (8) can be used to extract the timing information features of all step nodes:

Among them, the left and right arrows represent the direction of LSTM (Long Short-Term Memory) encoding, that is, the forward and reverse encoding of the step node features.

represents the output of the qth unit in BiLSTM, and the different directions of the arrows represent the BiLSTM encoding output obtained according to the different order of step node input. Similarly,

represents the output of the q-1th unit in the BiLSTM, that is, the output of the previous state. Assume that there are Q steps in the recipe.

is 0,

Represents the features of the qth step node of the Tth layer of the graph neural network. According to the order and reverse order of the steps, they are input into the corresponding BiLSTM network in sequence, and finally the BiLSTM encoding of all step nodes is obtained, as shown in the following relation (9):

After obtaining the output of all BiLSTM units, the output of the entire text feature can be obtained by summing and averaging. Among them, e _rec represents the output of the text feature, which is used for the next step of retrieval. The e _rec feature is fused with the dish title feature e _rec = [e _rec , e _ttl ], where [] represents feature concatenation, that is, the features are connected end to end. The e _rec feature will finally pass through a fully connected layer for feature mapping, that is, e _rec = fc(e _rec ), to obtain a vector of a new dimension, that is, the text feature information of the recipe text, which is used as the encoding feature for matching with the recipe video.

For the encoding process of recipe videos, the sample video can be used as the recipe video, and any one of the above embodiments can be used to implement the encoding of recipe features. After obtaining the recipe video features and recipe text feature information of each group of training samples in the training sample set, the loss function of the above embodiment can be used to guide the training of the video text mutual inspection model to make it converge.

The recipe retrieval terminal device 701 may include a display screen, an input interface, an input keyboard, and a wireless transmission module. When the display screen is a touch screen, the input keyboard may be a soft keyboard presented on the display screen. The input interface may be used to realize connection with an external device such as a USB flash drive. There may be multiple input interfaces. In practical applications, the user may input a recipe text to be retrieved or a video to be retrieved to the recipe retrieval terminal device 701 through the input keyboard, or may write the recipe text to be retrieved or the video to be retrieved to a USB flash drive and insert the USB flash drive into the input interface of the recipe retrieval terminal device 701. The user inputs a retrieval request to the recipe retrieval terminal device 701, and the retrieval request carries the recipe text to be retrieved or the recipe video to be retrieved. The recipe retrieval terminal may send the retrieval request to the server 702 through the wireless transmission module. The server 702 retrieves the corresponding database based on the trained model and may feed back the final mutual retrieval result to the recipe retrieval terminal device 701. The recipe retrieval terminal device 701 may display the retrieved recipe text or recipe video to the user through the display screen.

The embodiment of the present application also provides a corresponding device for the video text mutual inspection model training method and the video text mutual inspection method, so that the method is more practical. Among them, the device can be described from the perspective of functional modules and hardware. The video text mutual inspection model training device and the video text mutual inspection device provided by the embodiment of the present application are introduced below. The video text mutual inspection model training device and the video text mutual inspection device described below can correspond to each other with the video text mutual inspection model training method and the video text mutual inspection method described above.

From the perspective of functional modules, first refer to FIG. 9 , which is a structural diagram of a video text mutual inspection model training device provided in an embodiment of the present application under an optional implementation mode, and the device may include:

The text feature acquisition module 901 is configured to acquire text feature information of sample text in each group of training samples in the training sample set, wherein the sample text includes first-category text data, second-category text data, and third-category text data, wherein the second-category text data includes first-category text data, and the third-category text data is used to summarize the second-category text data and the first-category text data; the text feature information includes first-category text features, second-category text features, and third-category text features corresponding to the first-category text data, the second-category text data, and the third-category text data; the first-category text features and the second-category text features determine the node features and connection edges of the heterogeneous graph neural network in the video text mutual inspection model;

The video feature generation module 902 is configured to reassemble multiple frames of images extracted from the sample video in each set of training samples to obtain multiple image sets, and the same image is included in different image sets; and generate video features of the sample video according to the image features of different image sets and the correlation between the image sets;

The training module 903 is configured to train the video text mutual inspection model based on the text features of each group of training samples and the corresponding video features; the text features are the fusion features of the features of the second type of text data extracted using the heterogeneous graph neural network and the third type of text features.

Optionally, in some implementations of the present embodiment, the above-mentioned video feature generation module 902 can also be configured to: obtain image recombination parameters; the image recombination parameters include the total number of image sets and the total number of image frames contained in each image set; according to the image recombination parameters, determine the image frames contained in each image set to perform segmentation processing on the image sequence formed by multiple frames of images.

As an optional implementation of the above embodiment, the above video feature generation module 902 can also be set to: the total number of image frames contained in each image set is the same, for the first image set, the image frames contained in the first image set are determined according to the total number of image frames and the first frame image of the image sequence; the image segmentation relationship is called to determine the image frame sequence number difference of adjacent image sets; the image segmentation relationship is: m+nk=N; for the remaining image sets, based on the image frames contained in the previous image set of the current image set and the image frame sequence number difference, the image frames contained in the corresponding image set are determined; wherein, m is the total number of image frames contained in each image set, N is the total number of image frames contained in the image sequence, n is the total number of image sets, and k is the image frame sequence number difference, which is an integer.

Optionally, in some implementations of the present embodiment, the video feature generation module 902 may further include a video decomposition unit, which is configured to obtain video splitting parameters by parsing video splitting instructions; split the sample video into multiple video segments according to the video splitting parameters; and for each video segment, extract a target image frame for identifying the current video segment.

As an optional implementation of this embodiment, the video decomposition unit may also be configured to extract the first frame image of the current video segment as the target image frame of the current video segment.

Optionally, in some other implementations of the present embodiment, the above-mentioned video feature generation module 902 may also include a feature extraction unit, which is configured to: pre-train an image feature extraction network; for each image set, input the image frames contained in the current image set into the image feature extraction network to obtain image features corresponding to the current image set; wherein the image feature extraction network includes a first 3D convolution structure, a first downsampling structure, a second 3D convolution structure, a second downsampling structure, a 2D convolution structure and a fully connected layer; the first 3D convolution structure is used to perform a 3D convolution operation on the input information of the image feature extraction network; the first downsampling structure is used to downsample the output features of the first 3D convolution structure; the second 3D convolution structure is used to perform a 3D convolution operation on the output features of the first downsampling structure; the second downsampling structure is used to downsample the features output by the second 3D convolution structure; and the 2D convolution structure is used to perform a 2D convolution operation on the output features of the second downsampling structure.

Optionally, in some further implementations of the present embodiment, the above-mentioned video feature generation module 902 can also be configured as follows: for each image set, the current initial weight of the current image set is determined based on the image features of the current image set, and the weight coefficient of the current image set is determined based on the current initial weight and the initial weight of each image set; and the video features of the sample video are generated according to the image features of each image set and the corresponding weight coefficients.

As an optional implementation of the above embodiment, the above video feature generation module 902 can also be configured to: call the initial weight calculation relationship to calculate the current initial weight of the current image set; the initial weight calculation relationship is:

ai＝qT ReLU(H·yi);

As another optional implementation of the above embodiment, the above video feature generation module 902 can also be configured to: call the weight calculation relationship to calculate the weight coefficient of the current image set; the weight calculation relationship is:

Where a _i ′ is the weight coefficient of the i-th image set, mind() represents the minimum value of the calculated distance, softmax() is the softmax function, a _j is the initial weight of the j-th image set, and n is the total number of image sets.

Optionally, in some other implementations of this embodiment, the training module 903 may also be configured to: based on the text feature information of each group of training samples and the corresponding video features, call a loss function to guide the training process of the video text mutual inspection model; the loss function is:

In the formula,

is the above loss function, N is the number of training sample groups,

is the ath sample text in all sample text data,

is a hyperparameter.

Next, please refer to FIG. 10 , which is a structural diagram of a video text mutual inspection device provided in an embodiment of the present application under an optional implementation mode, and the device may include:

The model training module 1001 is configured to pre-train a video text mutual checking model using any of the above-mentioned video text mutual checking model training methods;

The video processing module 1002 is configured to reassemble multiple frames of images extracted from the video to be retrieved to obtain multiple image sets, and the same image is included in different image sets; based on the image features of different image sets and the association relationship between the image sets, generate the video features to be matched of the video to be retrieved;

The mutual check module 1003 is configured to input the text features to be matched of the text to be retrieved and the above-mentioned video features to be matched into the above-mentioned video-text mutual check model to obtain the video-text mutual check result; the text to be retrieved includes first-category text data, second-category text data and third-category text data, the second-category text data includes the first-category text data, and the third-category text data is used to summarize the second-category text data and the first-category text data; the text features to be matched are the fusion features of the features of the second-category text data and the third-category text features extracted by the heterogeneous graph neural network of the video-text mutual check model.

The functions of the various functional modules of the above-mentioned device in the embodiment of the present application can be implemented according to the method in the corresponding method embodiment. The detailed implementation process can refer to the relevant description of the above-mentioned method embodiment, which will not be repeated here.

The cross-media retrieval device and video-text mutual-checking device mentioned above are described from the perspective of functional modules. Optionally, the embodiment of the present application also provides an electronic device, which is described from the perspective of hardware. Figure 11 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present application under one implementation. As shown in Figure 11, the electronic device includes a memory 110, which is configured to store a computer program; a processor 111, which is configured to implement the video-text mutual-checking model training method and/or the video-text mutual-checking steps mentioned in any of the above embodiments when executing the computer program.

The processor 111 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 111 may also be a controller, a microcontroller, a microprocessor or other data processing chip. The processor 111 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), or PLA (Programmable Logic Array). The processor 111 may also include a main processor and a coprocessor. The main processor is a processor configured to process data in an awake state, also known as a CPU (Central Processing Unit); the coprocessor is a low-power processor configured to process data in a standby state. In some embodiments, the processor 111 may be integrated with a GPU (Graphics Processing Unit), which is configured to be responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 111 may also include an AI (Artificial Intelligence) processor, which is configured to process computing operations related to machine learning.

The memory 110 may include one or more non-volatile storage media, which may be non-transitory. The memory 110 may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices and flash memory storage devices. In some embodiments, the memory 110 may be an internal storage unit of an electronic device, such as a hard disk of the server 702. In other embodiments, the memory 110 may also be an external storage device of an electronic device, such as a plug-in hard disk equipped on the server 702, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), etc. Optionally, the memory 110 may also include both an internal storage unit of the electronic device and an external storage device. The memory 110 may not only be configured to store application software and various types of data installed in the electronic device, such as: the code of the program in the process of executing the above-mentioned video text mutual inspection model training method and/or the above-mentioned video text mutual inspection method, but may also be configured to temporarily store data that has been output or is to be output. In this embodiment, the memory 110 is at least configured to store the following computer program 1101, wherein, after the computer program is loaded and executed by the processor 111, the video text mutual inspection model training method and/or the relevant steps of the video text mutual inspection method disclosed in any of the aforementioned embodiments can be implemented. In addition, the resources stored in the memory 110 may also include an operating system 1102 and data 1103, etc., and the storage method may be temporary storage or permanent storage. Among them, the operating system 1102 may include Windows, Unix, Linux, etc. Data 1103 may include but is not limited to data generated by the video text mutual inspection model training process and/or data corresponding to the video text mutual inspection results, etc.

In some embodiments, the electronic device may further include a display screen 112, an input/output interface 113, a communication interface 114 or a network interface, a power supply 115 and a communication bus 116. The display screen 112 and the input/output interface 113, such as a keyboard, belong to the user interface, and the optional user interface may also include a standard wired interface, a wireless interface, etc. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, and an OLED (Organic Light-Emitting Diode) touch device, etc. The display may also be appropriately referred to as a display screen or a display unit, which is configured to display information processed in the electronic device and to display a visual user interface. The communication interface 114 may optionally include a wired interface and/or a wireless interface, such as a WI-FI interface, a Bluetooth interface, etc., which is usually used to establish a communication connection between the electronic device and other electronic devices. The communication bus 116 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, etc. The bus can be divided into an address bus, a data bus, a control bus, etc. For ease of representation, FIG11 only uses one thick line, but it does not mean that there is only one bus or one type of bus.

Those skilled in the art will appreciate that the structure shown in FIG. 11 does not limit the electronic device and may include more or fewer components than shown in the figure, for example, may also include a sensor 117 for implementing various functions.

The functions of the functional modules of the electronic device in the embodiment of the present application can be implemented according to the method in the method embodiment. The detailed implementation process can refer to the relevant description of the method embodiment, which will not be repeated here.

It is understandable that if the video text mutual inspection model training method and/or the video text mutual inspection method in the above-mentioned embodiment is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a non-volatile storage medium. Based on this understanding, the technical solution of the embodiment of the present application is essentially or the part that contributes to the prior art or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a non-volatile storage medium to execute all or part of the steps of the various embodiments of the present application. The aforementioned non-volatile storage medium includes: U disk, mobile hard disk, read-only memory (ROM), random access memory (RAM), electrically erasable programmable ROM, register, hard disk, multimedia card, card-type memory (such as SD or DX memory, etc.), magnetic memory, removable disk, CD-ROM, magnetic disk or optical disk, etc. Various non-volatile storage media that can store program code.

Based on this, an embodiment of the present application also provides a non-volatile storage medium storing a computer program. When the above computer program is executed by a processor, it is the above-mentioned video-text mutual-checking model training method and/or video-text mutual-checking method in any of the above embodiments.

The functions of the functional modules of the above-mentioned non-volatile storage medium in the embodiment of the present application can be implemented according to the method in the above-mentioned method embodiment. The detailed implementation process can refer to the relevant description of the above-mentioned method embodiment, which will not be repeated here.

In this specification, each embodiment is described in a progressive manner, and each embodiment focuses on the differences from other embodiments. The same or similar parts between the embodiments can be referred to each other. For the hardware disclosed in the embodiments, including devices and electronic devices, since they correspond to the methods disclosed in the embodiments, the description is relatively simple, and the relevant parts can be referred to the method part.

Professionals may also realize that the units and algorithm steps of each example described in conjunction with the embodiments disclosed herein can be implemented with electronic hardware, computer software, or a combination of the two. In order to clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been generally described in terms of function in the above description. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Professional and technical personnel may use different methods to implement the described functions for each specific application, but such implementation should not be considered to exceed the scope of the embodiments of the present application.

The above is a detailed introduction to a video text mutual inspection model training method and device, a video text mutual inspection method and device, an electronic device and a non-volatile storage medium provided in the embodiments of the present application. Optional examples are used herein to illustrate the principles and implementation methods of the embodiments of the present application. The description of the above embodiments is only used to help understand the methods and core ideas of the embodiments of the present application. It should be pointed out that for ordinary technicians in this technical field, without departing from the principles of the embodiments of the present application, several improvements and modifications can be made to the embodiments of the present application, and these improvements and modifications also fall within the scope of protection of the claims of the embodiments of the present application.

Claims

A video text mutual inspection model training method, characterized by comprising:

Acquire text feature information of sample text in each group of training samples in the training sample set; the sample text includes first-category text data, second-category text data and third-category text data, the second-category text data includes first-category text data, and the third-category text data is used to summarize the second-category text data and the first-category text data; the text feature information includes first-category text features, second-category text features and third-category text features corresponding to the first-category text data, the second-category text data and the third-category text data respectively; the first-category text features and the second-category text features determine the node features and connection edges of the heterogeneous graph neural network in the video text mutual inspection model;

For a sample video in each group of training samples, multiple frames of images extracted from the sample video are reassembled to obtain multiple image sets, and the same image is included in different image sets;

Generating video features of the sample video according to image features of different image sets and associations between the image sets;

The video-text mutual inspection model is trained based on the text features and corresponding video features of each group of training samples; the text features are fused features of the features of the second type of text data extracted using the heterogeneous graph neural network and the third type of text features.
The video text mutual inspection model training method according to claim 1 is characterized in that the multiple frame images extracted from the sample video are recombined to obtain multiple image sets, including:

Acquire image recombination parameters; the image recombination parameters include the total number of image sets and the total number of image frames contained in each image set;

The image frames included in each image set are determined according to the image recombination parameters, so as to perform segmentation processing on an image sequence formed by multiple image frames.
The video text mutual inspection model training method according to claim 2 is characterized in that the total number of image frames contained in each image set is the same, and determining the image frames contained in each image set according to the image recombination parameters includes:

For a first image set, determining the image frames included in the first image set according to the total number of image frames and the first frame of the image sequence;

Calling an image segmentation relational formula to determine the difference in image frame sequence numbers of adjacent image sets; the image segmentation relational formula is: m+nk=N;

For each of the remaining image sets, based on the image frames included in the previous image set of the current image set and the image frame sequence number difference, determine the image frames included in the corresponding image set;

Wherein, m is the total number of image frames included in each image set, N is the total number of image frames included in the image sequence, n is the total number of image sets, and k is the image frame sequence number difference, which is an integer.
The video text mutual inspection model training method according to claim 1 is characterized in that the multiple frame images extracted from the sample video are recombined to obtain multiple image sets, including:

Obtain video splitting parameters by parsing the video splitting instruction;

Splitting the sample video into multiple video segments according to the video splitting parameter;

For each video segment, a target image frame for identifying the current video segment is extracted.
The video text mutual inspection model training method according to claim 4 is characterized in that extracting a target image frame for identifying the current video segment comprises:

The first image frame of the current video segment is extracted as a target image frame of the current video segment.
The video text mutual inspection model training method according to claim 1 is characterized in that the generating of the video features of the sample video according to the image features of different image sets and the correlation between the image sets comprises:

Pre-train image feature extraction network;

For each image set, the image frames contained in the current image set are input into the image feature extraction network to obtain the image features corresponding to the current image set;

The image feature extraction network includes a first 3D convolution structure, a first downsampling structure, a second 3D convolution structure, a second downsampling structure, a 2D convolution structure and a fully connected layer;

The first 3D convolution structure is used to perform a 3D convolution operation on the input information of the image feature extraction network; the first downsampling structure is used to perform a downsampling operation on the output features of the first 3D convolution structure; the second 3D convolution structure is used to perform a 3D convolution operation on the output features of the first downsampling structure; the second downsampling structure is used to downsampling the features output by the second 3D convolution structure; and the 2D convolution structure is used to perform a 2D convolution operation on the output features of the second downsampling structure.
The video text mutual inspection model training method according to claim 1 is characterized in that the generating of the video features of the sample video according to the image features of different image sets and the correlation between the image sets comprises:

For each image set, determining a current initial weight of the current image set based on image features of the current image set, and determining a weight coefficient of the current image set based on the current initial weight and the initial weight of each image set;

The video features of the sample video are generated according to the image features of each image set and the corresponding weight coefficients.
The video text mutual inspection model training method according to claim 7 is characterized in that the determining the current initial weight of the current image set based on the image features of the current image set comprises:

The initial weight calculation formula is called to calculate the current initial weight of the current image set; the initial weight calculation formula is:

a i =q T ReLU(H·y i );

Where a i is the initial weight of the i-th image set, q is a known vector, q T represents the transpose of q, ReLU() is the ReLU function, H is the weight matrix, and yi is the image feature of the i-th image set.
The video text mutual inspection model training method according to claim 7 is characterized in that the determining the weight coefficient of the current image set based on the current initial weight and the initial weight of each image set comprises:

The weight calculation relationship is called to calculate the weight coefficient of the current image set; the weight calculation relationship is:

Where a i ′ is the weight coefficient of the i-th image set, a i is the initial weight of the i-th image set, softmax() is the softmax function, a j is the initial weight of the j-th image set, and n is the total number of image sets.
The video text mutual checking model training method according to any one of claims 1 to 9 is characterized in that the video text mutual checking model is trained based on the text features and corresponding video features of each group of training samples, comprising:

Based on the text features and corresponding video features of each set of training samples, a loss function is called to guide the training process of the video text mutual inspection model; the loss function is:

In the formula,
is the loss function, N is the number of training sample groups, min d() represents the minimum value of the calculated distance,
is the ath sample video among all sample videos contained in the training sample set,
is the pth sample text among all the sample texts contained in the training sample set, and corresponds to the ath sample video,
is the nth sample text in all sample text data, and it does not correspond to the ath sample video,
is the ath sample text among all sample text data,
is the pth sample video among all sample videos, and it corresponds to the ath sample text,
is the nth sample video among all sample video data, and it does not correspond to the ath sample text, and ▽ is a hyperparameter.
The video text mutual inspection model training method according to claim 1 is characterized in that the multiple frame images extracted from the sample video are recombined to obtain multiple image sets, including:

The multiple frames of images are integrated into an image sequence according to the order of extraction, and the multiple image sets are obtained by cross-segmenting the image sequence.
The video text mutual inspection model training method according to claim 1 is characterized in that the multiple frame images extracted from the sample video are recombined to obtain multiple image sets, including:

The multiple frames of images are randomly integrated into an image sequence, and the multiple image sets are obtained by segmenting the image sequence.
The video text mutual inspection model training method according to claim 1 is characterized in that the multiple frame images extracted from the sample video are recombined to obtain multiple image sets, including:

The multiple frames of images are randomly assigned to different image sets.
The video text mutual inspection model training method according to claim 4 is characterized in that the video splitting parameters include the number of segments of the sample video and the identification information of the sample video.
The video-text mutual-checking model training method according to claim 4 is characterized in that the multiple video segments overlap with each other.
A video text mutual checking method, characterized by comprising:

Preliminarily using the video text mutual inspection model training method according to any one of claims 1 to 15 to train a video text mutual inspection model;

Recombining multiple frames of images extracted from the video to be retrieved to obtain multiple image sets, and the same image is included in different image sets;

Generating the to-be-matched video features of the to-be-retrieved video according to the image features of different image sets and the association relationship between the image sets;

The text features to be matched of the text to be retrieved and the video features to be matched are input into the video-text mutual checking model to obtain the video-text mutual checking results; the text to be retrieved includes first-category text data, second-category text data and third-category text data, the second-category text data includes first-category text data, and the third-category text data is used to summarize the second-category text data and the first-category text data; the text features to be matched are fused features of the features of the second-category text data and the third-category text features extracted using the heterogeneous graph neural network of the video-text mutual checking model.
A video text mutual inspection model training device, characterized by comprising:

A text feature acquisition module is configured to acquire text feature information of a sample text in each group of training samples in a training sample set; the sample text includes first-category text data, second-category text data, and third-category text data, the second-category text data includes first-category text data, and the third-category text data is used to summarize the second-category text data and the first-category text data; the text feature information includes first-category text features, second-category text features, and third-category text features corresponding to the first-category text data, the second-category text data, and the third-category text data; the first-category text features and the second-category text features determine the node features and connection edges of the heterogeneous graph neural network in the video text mutual inspection model;

The video feature generation module is configured to reassemble multiple frames of images extracted from the sample video in each group of training samples to obtain multiple image sets, and the same image is included in different image sets; and generate video features of the sample video according to image features of different image sets and correlations between the image sets;

The training module is configured to train the video-text mutual inspection model based on the text features and corresponding video features of each group of training samples; the text features are fused features of the features of the second type of text data extracted using the heterogeneous graph neural network and the third type of text features.
A video text mutual checking device, characterized by comprising:

The model training module is configured to pre-train a video text mutual inspection model using the video text mutual inspection model training method according to any one of claims 1 to 15;

The video processing module is configured to reassemble multiple frames of images extracted from the video to be retrieved to obtain multiple image sets, and the same image is included in different image sets; according to the image features of different image sets and the association relationship between the image sets, the video features to be matched of the video to be retrieved are generated;

The mutual checking module is configured to input the text features to be matched of the text to be retrieved and the video features to be matched into the video-text mutual checking model to obtain the video-text mutual checking result; the text to be retrieved includes first-category text data, second-category text data and third-category text data, the second-category text data includes first-category text data, and the third-category text data is used to summarize the second-category text data and the first-category text data; the text features to be matched are fused features of the features of the second-category text data and the third-category text features extracted using the heterogeneous graph neural network of the video-text mutual checking model.
An electronic device, characterized in that it includes a processor and a memory, wherein the processor is configured to implement the steps of the video text mutual checking model training method as described in any one of claims 1 to 15 and/or the video text mutual checking method as described in claim 16 when executing the computer program stored in the memory.
A non-volatile storage medium, characterized in that a computer program is stored on the non-volatile storage medium, and when the computer program is executed by a processor, the steps of the video text mutual inspection model training method as described in any one of claims 1 to 15 and/or the video text mutual inspection method as described in claim 16 are implemented.