CN115204301A - Video text matching model training method and device and video text matching method and device - Google Patents

Video text matching model training method and device and video text matching method and device

Info

Publication number
CN115204301A
Authority
CN
China
Prior art keywords
video
training
text
similarity
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210868349.1A
Other languages
Chinese (zh)
Inventor
刘烁
全卫泽
陈思宏
陈宸
周明
严冬明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Institute of Automation of Chinese Academy of Science
Original Assignee
Tencent Technology Shenzhen Co Ltd
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd and Institute of Automation of Chinese Academy of Science
Priority to CN202210868349.1A
Publication of CN115204301A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a video text matching model training method and apparatus, a video text matching method and apparatus, a computer device, a storage medium, and a computer program product, and relates to artificial intelligence technology. The method includes: inputting the video features and reference features corresponding to the training videos in a training sample pair set, together with the training text features corresponding to the training texts, into an initial video text matching model, the reference features including at least one of audio features and motion features; performing feature enhancement on the corresponding video features based on the reference features of the same training video to obtain reference enhanced video features for that training video; performing similarity calculation between the training text features corresponding to each training text and, respectively, the video features and the reference enhanced video features corresponding to the training video; and training the initial video text matching model based on the similarity set corresponding to each training sample pair to obtain a target video text matching model. The method can improve the prediction accuracy of the model.

Description

Video text matching model training method and device and video text matching method and device
Technical Field
The present application relates to the field of computer technologies, and in particular, to a video text matching model training method and apparatus, a video text matching method and apparatus, a computer device, a storage medium, and a computer program product.
Background
With the development of computer technology, video text matching models have emerged. Based on such a model, videos and texts that match each other can be determined from large collections of videos and texts, and the model can be applied to scenarios such as video text retrieval, video content recommendation, and video content understanding.
In the conventional technology, a video text matching model is usually trained based on video features and text features. However, video features only provide the image information of a video and cannot accurately represent the rich information the video contains, so a video text matching model trained in this way suffers from low prediction accuracy.
Disclosure of Invention
In view of the foregoing, it is necessary to provide a video text matching model training method and apparatus, a video text matching method and apparatus, a computer device, a computer-readable storage medium, and a computer program product that can improve the accuracy of model prediction.
The application provides a video text matching model training method. The method comprises the following steps:
acquiring a training sample pair set; training sample pairs in the training sample pair set comprise positive sample pairs and negative sample pairs matched with the positive sample pairs, the training sample pairs comprise training videos and training texts, and the training videos and the training texts in the positive sample pairs are matched with each other;
inputting video features and reference features corresponding to training videos in the training sample pair set and training text features corresponding to training texts into an initial video text matching model; the reference features comprise at least one of audio features and action features corresponding to the training videos;
performing feature enhancement on the corresponding video features based on the reference features corresponding to the same training video to obtain the reference enhanced video features corresponding to the training video; the reference enhanced video features include at least one of motion enhanced video features and audio enhanced video features;
for the same training sample pair, performing similarity calculation between the training text features corresponding to the training text and, respectively, the video features and the reference enhanced video features corresponding to the training video, to obtain a similarity set corresponding to each training sample pair;
calculating a training loss based on the similarity set corresponding to the positive sample pair and the similarity set corresponding to the matched negative sample pair, and adjusting model parameters of the initial video text matching model based on the training loss until a convergence condition is met, to obtain a target video text matching model; the target video text matching model is used for determining a matching result between a video and a text.
The application also provides a video text matching model training device. The device comprises:
the training sample pair set acquisition module is used for acquiring a training sample pair set; training sample pairs in the training sample pair set comprise positive sample pairs and negative sample pairs matched with the positive sample pairs, the training sample pairs comprise training videos and training texts, and the training videos and the training texts in the positive sample pairs are matched with each other;
the characteristic input module is used for inputting the video characteristics corresponding to the training videos, the reference characteristics and the training text characteristics corresponding to the training texts in the training sample pair set into an initial video text matching model; the reference features comprise at least one of audio features and action features corresponding to the training videos;
the feature enhancement module is used for performing feature enhancement on the corresponding video features based on the reference features corresponding to the same training video to obtain the reference enhanced video features corresponding to the training video; the reference enhanced video features include at least one of motion enhanced video features and audio enhanced video features;
the similarity calculation module is used for performing similarity calculation on training text characteristics corresponding to the training texts and video characteristics and reference enhanced video characteristics corresponding to the training videos respectively aiming at the same training sample pair to obtain a similarity set corresponding to each training sample pair respectively;
the model adjusting module is used for calculating training loss based on the similarity set corresponding to the positive sample pair and the similarity set corresponding to the matched negative sample pair, adjusting model parameters of the initial video text matching model based on the training loss until a convergence condition is met, and obtaining a target video text matching model; the target video text matching model is used for determining a matching result between the video and the text.
A computer device comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the above-mentioned video text matching model training method when executing the computer program.
A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the above-mentioned video-text matching model training method.
A computer program product comprising a computer program which, when being executed by a processor, carries out the steps of the above-mentioned video text matching model training method.
The video text matching model training method, the device, the computer equipment, the storage medium and the computer program product are characterized in that a training sample pair set is obtained; training sample pairs in the training sample pair set comprise positive sample pairs and negative sample pairs matched with the positive sample pairs, the training sample pairs comprise training videos and training texts, and the training videos and the training texts in the positive sample pairs are matched with each other; inputting video features and reference features corresponding to training videos in the training sample pair set and training text features corresponding to training texts into an initial video text matching model; the reference features comprise at least one of audio features and action features corresponding to the training videos; based on the reference features corresponding to the same training video, performing feature enhancement on the corresponding video features to obtain the reference enhancement video features corresponding to the training video; the reference enhanced video feature comprises at least one of a motion enhanced video feature and an audio enhanced video feature; aiming at the same training sample pair, performing similarity calculation on training text characteristics corresponding to a training text, video characteristics corresponding to a training video and reference enhanced video characteristics respectively to obtain similarity sets corresponding to the training sample pairs respectively; calculating training loss based on the similarity set corresponding to the positive sample pair and the similarity set corresponding to the matched negative sample pair, and adjusting model parameters of the initial video text matching model based on the training loss until a convergence condition is met to obtain a target video text matching model; the target video text matching model is used for determining a matching result between the video and the text. Therefore, the video characteristics can provide image information of the video, the audio characteristics can provide sound information of the video, the action characteristics can provide motion information of the video, the video text matching model is trained based on the video characteristics corresponding to the training video, the reference characteristics and the training text characteristics corresponding to the training text, the understanding of the model on the video content can be improved by utilizing abundant modal information in the video, and the prediction accuracy of the model is improved. And moreover, feature enhancement and feature guidance are carried out on the video features based on the audio features or the motion features, important information in the video can be highlighted, similarity calculation is carried out on the video features and the training text features respectively based on the video features and the reference enhanced video features, model parameters are adjusted based on training loss generated by the similarity set obtained through calculation, the relation between the video and the text can be better established by the model, and the prediction accuracy of the model is further improved.
The application provides a video text matching method. The method comprises the following steps:
acquiring a video feature to be matched and a reference feature to be matched corresponding to a video to be matched, and acquiring a text feature to be matched corresponding to a text to be matched; the reference feature to be matched comprises at least one of an audio feature and an action feature corresponding to the video to be matched;
based on the reference features to be matched, performing feature enhancement on the video features to be matched to obtain reference enhanced video features corresponding to the video to be matched; the reference enhanced video features comprise at least one of motion enhanced video features and audio enhanced video features;
respectively carrying out similarity calculation on the text features to be matched with the video features to be matched and the reference enhanced video features to obtain a similarity set corresponding to the video to be matched and the text to be matched;
and determining a matching result between the video to be matched and the text to be matched based on the similarity set corresponding to the video to be matched and the text to be matched.
In one embodiment, the calculating the similarity between the video text feature corresponding to the video to be matched and the target text feature corresponding to the text to be matched to obtain the reference similarity between the video text feature and the target text feature includes:
Calculating the initial similarity between the video text characteristics and the target text characteristics to obtain an initial similarity matrix;
counting the number of matrix elements in the initial similarity matrix whose values are larger than a preset threshold to obtain a first number;
fusing the numbers of text words corresponding to the text to be matched and to the video text, respectively, to obtain a second number; the video text refers to the text corresponding to the video text features;
and obtaining the reference similarity between the video text features and the target text features based on the first number and the second number.
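As an illustration of the steps above, the following Python sketch computes the reference similarity from word-level features. The cosine normalization, the 0.8 threshold, and the summation used to fuse the two word counts are assumptions for the sketch, not values fixed by the description.

```python
import numpy as np

def reference_similarity(video_text_feats, target_text_feats,
                         n_video_words, n_query_words, threshold=0.8):
    # video_text_feats:  (m, d) word-level features of the video text
    # target_text_feats: (n, d) word-level features of the text to be matched
    a = video_text_feats / np.linalg.norm(video_text_feats, axis=1, keepdims=True)
    b = target_text_feats / np.linalg.norm(target_text_feats, axis=1, keepdims=True)
    sim_matrix = a @ b.T                                 # initial similarity matrix

    first_number = int(np.sum(sim_matrix > threshold))   # strongly matched element count
    second_number = n_video_words + n_query_words        # fused word counts (assumed: sum)

    # Reference similarity: share of strongly matched pairs relative to the total words.
    return first_number / second_number
```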
In one embodiment, the current text is any one of an audio text, an image text and a text to be matched, the text feature corresponding to the current text is any one of an audio text feature, an image text feature or a target text feature, and the generation process of the text feature corresponding to the current text includes the following steps:
extracting nouns from the current text to obtain text nouns;
and performing feature extraction on the text nouns to obtain text features corresponding to the current text.
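A minimal sketch of this noun-based pipeline is given below, assuming spaCy for part-of-speech tagging and an arbitrary caller-supplied text_encoder for the feature extraction step; both choices are illustrative, not prescribed by the description.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # any POS tagger would do; spaCy is an assumption

def noun_text_features(current_text, text_encoder):
    # Extract nouns from the current text to obtain the text nouns.
    nouns = [tok.text for tok in nlp(current_text) if tok.pos_ == "NOUN"]
    # Feature extraction on the text nouns yields the text features of the current text.
    return text_encoder(nouns)
```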
In one embodiment, the performing, based on the reference feature to be matched, feature enhancement on the video feature to be matched to obtain a reference enhanced video feature corresponding to the video to be matched, and performing similarity calculation on the text feature to be matched and the video feature to be matched and the reference enhanced video feature respectively to obtain a similarity set corresponding to the video to be matched and the text to be matched includes:
Inputting the video features to be matched, the reference features to be matched and the text features to be matched into a target video text matching model to obtain a similarity set corresponding to the video to be matched and the text to be matched; the target video text matching model is used for feature enhancement and similarity calculation.
The application also provides a video text matching device. The device comprises:
the characteristic acquisition module is used for acquiring the video characteristics to be matched and the reference characteristics to be matched corresponding to the video to be matched and acquiring the text characteristics to be matched corresponding to the text to be matched; the reference feature to be matched comprises at least one of an audio feature and an action feature corresponding to the video to be matched;
the characteristic enhancement module is used for carrying out characteristic enhancement on the video characteristic to be matched based on the reference characteristic to be matched to obtain a reference enhanced video characteristic corresponding to the video to be matched; the reference enhanced video features comprise at least one of motion enhanced video features and audio enhanced video features;
the similarity calculation module is used for performing similarity calculation on the text features to be matched with the video features to be matched and the reference enhanced video features respectively to obtain a similarity set corresponding to the video to be matched and the text to be matched;
And the matching result determining module is used for determining a matching result between the video to be matched and the text to be matched based on the similarity set corresponding to the video to be matched and the text to be matched.
A computer device comprising a memory storing a computer program and a processor implementing the steps of the above-described video text matching method when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned video text matching method.
A computer program product comprising a computer program which, when being executed by a processor, carries out the steps of the above-mentioned video text matching method.
According to the video text matching method, the video text matching device, the computer equipment, the storage medium and the computer program product, the text features to be matched corresponding to the text to be matched are obtained by obtaining the video features to be matched corresponding to the video to be matched and the reference features to be matched; the reference feature to be matched comprises at least one of an audio feature and an action feature corresponding to the video to be matched; based on the reference features to be matched, performing feature enhancement on the video features to be matched to obtain reference enhanced video features corresponding to the video to be matched; the reference enhanced video feature comprises at least one of a motion enhanced video feature and an audio enhanced video feature; respectively carrying out similarity calculation on the text features to be matched with the video features to be matched and the reference enhanced video features to obtain a similarity set corresponding to the video to be matched and the text to be matched; and determining a matching result between the video to be matched and the text to be matched based on the similarity set corresponding to the video to be matched and the text to be matched. Therefore, the video characteristics can provide image information of the video, the audio characteristics can provide sound information of the video, the action characteristics can provide motion information of the video, the matching result between the video to be matched and the text to be matched is determined based on the video characteristics corresponding to the video to be matched, the reference characteristics and the text characteristics corresponding to the text to be matched, the understanding of the video content can be improved by utilizing rich modal information in the video, and the matching accuracy is improved. In addition, feature enhancement and feature guidance are carried out on the video features based on the audio features or the motion features, so that important information in the video can be highlighted, and the understanding of the video content is further improved. Similarity calculation is carried out on the video features and the reference enhanced video features respectively, similarity calculation is carried out on the video features and the text features, matching results are determined based on a similarity set obtained through calculation, and matching accuracy can be further improved.
Drawings
FIG. 1 is a diagram of an exemplary implementation of a video text matching model training method and a video text matching method;
FIG. 2 is a schematic flow chart diagram illustrating a method for training a video text matching model in one embodiment;
FIG. 3 is a diagram of a motion enhanced video feature and text feature matching network in one embodiment;
FIG. 4 is a diagram of an audio enhanced video feature and text feature matching network in one embodiment;
FIG. 5 is a schematic flow chart illustrating similarity calculation and loss calculation in one embodiment;
FIG. 6 is a flow diagram that illustrates the determination of a predicted match ranking for a test sample pair, according to one embodiment;
FIG. 7 is a diagram illustrating ordering of similarity matrices, according to an embodiment;
FIG. 8 is a flowchart illustrating a video text matching method according to an embodiment;
FIG. 9 is a flowchart illustrating a video text matching method according to another embodiment;
FIG. 10 is a diagram illustrating a cosine similarity matrix computed based on a text to be matched and a video text in one embodiment;
FIG. 11 is a schematic flowchart illustrating a process of calculating similarity between a text to be matched and a video text according to an embodiment;
FIG. 12 is a block diagram illustrating a video text matching method according to an embodiment;
FIG. 13 is a diagram illustrating text to be matched and video text in another embodiment;
FIG. 14 is a block diagram showing the construction of an apparatus for training a matching model of video text according to an embodiment;
FIG. 15 is a block diagram showing the construction of a video text matching apparatus according to one embodiment;
FIG. 16 is a diagram showing an internal structure of a computer device in one embodiment;
FIG. 17 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The scheme provided by the embodiment of the application relates to the technologies of artificial intelligence, such as computer vision technology, voice technology, natural language processing, machine learning and the like, and is specifically explained by the following embodiments:
the embodiment of the invention can be applied to various scenes, including but not limited to cloud technology, artificial intelligence, data retrieval, data recommendation and the like.
The video text matching model training method and the video text matching method provided by the embodiment of the application can be applied to the application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a network. The data storage system may store data that the server 104 needs to process. The data storage system may be integrated on the server 104, or may be placed on the cloud or other server. The terminal 102 may be, but is not limited to, various desktop computers, notebook computers, smart phones, tablet computers, internet of things devices, and portable wearable devices, and the internet of things devices may be smart speakers, smart televisions, smart air conditioners, smart car-mounted devices, and the like. The portable wearable device can be a smart watch, a smart bracelet, a head-mounted device, and the like. The server 104 may be implemented as a stand-alone server or a server cluster consisting of a plurality of servers or a cloud server.
Both the terminal and the server can be independently used for executing the video text matching model training method and the video text matching method provided by the embodiment of the application.
For example, the server obtains a set of training sample pairs; the training sample pairs in the training sample pair set comprise positive sample pairs and negative sample pairs matched with the positive sample pairs, the training sample pairs comprise training videos and training texts, and the training videos and the training texts in the positive sample pairs are matched with each other. The server inputs video features and reference features corresponding to training videos in the training sample pair set and training text features corresponding to training texts into an initial video text matching model; the reference features include at least one of audio features and motion features corresponding to the training video. In the model, based on the reference features corresponding to the same training video, feature enhancement is carried out on the corresponding video features to obtain the reference enhancement video features corresponding to the training video, the reference enhancement video features comprise at least one of motion enhancement video features and audio enhancement video features, and for the same training sample pair, the training text features corresponding to the training text are respectively carried out similarity calculation with the video features corresponding to the training video and the reference enhancement video features to obtain a similarity set corresponding to each training sample pair. And the server calculates training loss based on the similarity set corresponding to the positive sample pair and the similarity set corresponding to the matched negative sample pair, and adjusts the model parameters of the initial video text matching model based on the training loss until the convergence condition is met to obtain the target video text matching model.
The method comprises the steps that a server obtains video features to be matched and reference features to be matched, which correspond to videos to be matched, and obtains text features to be matched, which correspond to texts to be matched; the reference feature to be matched comprises at least one of an audio feature and an action feature corresponding to the video to be matched. The server performs feature enhancement on the video features to be matched based on the reference features to be matched to obtain reference enhanced video features corresponding to the video to be matched; the reference enhanced video feature includes at least one of a motion enhanced video feature and an audio enhanced video feature. The server respectively carries out similarity calculation on the text features to be matched with the video features to be matched and the reference enhanced video features to obtain a similarity set corresponding to the video to be matched and the text to be matched, and determines a matching result between the video to be matched and the text to be matched based on the similarity set corresponding to the video to be matched and the text to be matched.
The terminal and the server can also be cooperatively used for executing the video text matching model training method and the video text matching method provided in the embodiment of the application.
For example, the server obtains a training sample pair set from the terminal, and performs model training on the initial video text matching model based on video features and reference features corresponding to training videos in the training sample pair set and training text features corresponding to training texts to obtain a target video text matching model.
The terminal sends a data matching request to the server, the data matching request carries data identifications corresponding to the video to be matched and the text to be matched respectively, and the server obtains the video features to be matched and the reference features to be matched corresponding to the video to be matched based on the data identifications. The server performs feature enhancement on the video features to be matched based on the reference features to be matched to obtain reference enhanced video features corresponding to the video to be matched; the reference enhanced video feature includes at least one of a motion enhanced video feature and an audio enhanced video feature. The server respectively carries out similarity calculation on the text features to be matched with the video features to be matched and the reference enhanced video features to obtain a similarity set corresponding to the video to be matched and the text to be matched, and determines a matching result between the video to be matched and the text to be matched based on the similarity set corresponding to the video to be matched and the text to be matched. And the server sends the matching result to the terminal.
In one embodiment, as shown in fig. 2, a video text matching model training method is provided, which is exemplified by applying the method to a computer device, which may be the terminal 102 or the server 104 in fig. 1. Referring to fig. 2, the video text matching model training method includes the following steps:
Step S202, acquiring a training sample pair set; the training sample pairs in the training sample pair set comprise positive sample pairs and negative sample pairs matched with the positive sample pairs, the training sample pairs comprise training videos and training texts, and the training videos and the training texts in the positive sample pairs are matched with each other.
Wherein the set of training sample pairs comprises a plurality of training sample pairs. A training sample pair includes a training video and a training text. The training sample pairs may be divided into positive sample pairs and negative sample pairs. The training videos and the training texts in the positive sample pair are matched with each other, that is, the contents and information expressed by the training videos and the training texts in the positive sample pair are matched, and the training texts in the positive sample pair can be used for describing and explaining the training videos in the positive sample pair. The training videos and the training texts in the negative example pairs do not match, that is, the contents and information expressed by the training videos and the training texts in the negative example pairs do not match.
The training sample pairs in the training sample pair set include positive sample pairs and negative sample pairs matched with the positive sample pairs. The negative sample pairs matched with a positive sample pair include at least one of a negative sample pair that contains the same training video as the positive sample pair but a different training text, and a negative sample pair that contains the same training text as the positive sample pair but a different training video.
Specifically, the computer device may obtain a training sample pair set locally or from another device, and perform model training on the training sample pair set to obtain a trained video text matching model.
Step S204, inputting the video characteristics corresponding to the training videos, the reference characteristics and the training text characteristics corresponding to the training texts in the training sample pair set into an initial video text matching model; the reference features include at least one of audio features and motion features corresponding to the training video.
The video features refer to the image features of a video and are used to characterize the image modality information in the video. The audio features are used to characterize the audio modality information in the video. The motion features (also called action features) are used to characterize the motion modality information in the video. The reference features include at least one of the audio features and the motion features corresponding to the training video. Feature extraction is performed on the training video to obtain the video features, motion features, and audio features corresponding to the training video. The feature extraction may be performed based on machine learning models; specifically, the video features corresponding to the training video may be extracted based on a video feature extraction model, the motion features based on a motion feature extraction model, and the audio features based on an audio feature extraction model. For example, the motion features corresponding to the training video are extracted based on an S3D model, and the audio features corresponding to the training video are extracted based on a VGGish model.
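A hedged sketch of this per-modality extraction is shown below. The description names S3D and VGGish as examples; the load_s3d and load_vggish helpers are hypothetical wrappers standing in for whatever pretrained implementations are used, not calls to a specific library.

```python
import torch

def extract_reference_features(video_clip, audio_waveform, load_s3d, load_vggish):
    # Pretrained backbones supplied by the caller (assumed available).
    s3d = load_s3d().eval()        # motion feature extraction model (e.g. S3D)
    vggish = load_vggish().eval()  # audio feature extraction model (e.g. VGGish)
    with torch.no_grad():
        motion_feat = s3d(video_clip)        # clip-level motion features
        audio_feat = vggish(audio_waveform)  # segment-level audio embeddings
    return motion_feat, audio_feat
```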
The text features are used to characterize text modal information of the text. The training text features refer to text features corresponding to the training text. The feature extraction can be performed on the training text based on the machine learning model, and specifically, the training text features corresponding to the training text can be extracted based on the text feature extraction model.
In one embodiment, the video features corresponding to the training videos and the text features corresponding to the training texts may be obtained from the same machine learning model. They can be extracted with a video text processing model that comprises a video encoder and a text encoder: the video features corresponding to the training videos are extracted by the video encoder, and the text features corresponding to the training texts are extracted by the text encoder. In one embodiment, the video text processing model is used to classify videos based on their description texts; its input data are videos and the corresponding description texts, and its output data are the classification results of the videos. For example, the video features corresponding to a training video and the text features corresponding to a training text are extracted based on a CLIP (Contrastive Language-Image Pre-training) model.
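For the CLIP example, frame-level video features and text features could be obtained roughly as follows, using the public clip package; the ViT-B/32 backbone and the per-frame encoding are assumptions for the sketch.

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_video_features(frames):                 # frames: list of PIL images
    batch = torch.stack([preprocess(f) for f in frames]).to(device)
    with torch.no_grad():
        return model.encode_image(batch)         # (num_frames, 512) frame features

def clip_text_features(sentences):               # sentences: list of strings
    tokens = clip.tokenize(sentences).to(device)
    with torch.no_grad():
        return model.encode_text(tokens)         # (num_sentences, 512) text features
```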
The initial video text matching model refers to a video text matching model to be trained. The input data of the video text matching model are video features and reference features corresponding to training videos and training text features corresponding to training texts.
Specifically, the computer device may input the video features and the reference features corresponding to the training videos in the training sample pair set and the training text features corresponding to the training texts into an initial video text matching model, perform data processing on the video features, the reference features and the training text features in the initial video text matching model to obtain similarity sets corresponding to the training sample pairs, and the initial video text matching model may output the similarity sets corresponding to the training sample pairs.
Step S206, based on the reference features corresponding to the same training video, performing feature enhancement on the corresponding video features to obtain the reference enhanced video features corresponding to the training video; the reference enhanced video feature includes at least one of a motion enhanced video feature and an audio enhanced video feature.
Wherein the feature enhancement is used to enhance information in the video feature related to the reference feature and weaken other information in the video feature. For example, based on the motion features corresponding to the training video, the video features corresponding to the training video are subjected to feature enhancement to obtain motion-enhanced video features, the motion-enhanced video features emphasize the feature representation of the moving object in the video, and the feature representations of other backgrounds and noises are weakened.
The motion enhancement video features are obtained by performing feature enhancement on the video features corresponding to the training video based on the motion features corresponding to the training video. The audio enhancement video features are obtained by performing feature enhancement on the video features corresponding to the training video based on the audio features corresponding to the training video.
It is to be understood that the processes of feature enhancing the video features based on the motion features and feature enhancing the video features based on the audio features may be the same or different.
Specifically, in the initial video text matching model, feature enhancement may be performed on corresponding video features based on reference features corresponding to the same training video to obtain reference enhanced video features corresponding to the training video. If the reference features comprise action features, feature enhancement is carried out on the corresponding video features based on the action features corresponding to the same training video to obtain action enhancement video features corresponding to the training video, and the reference enhancement video features comprise action enhancement video features. And if the reference features comprise audio features, performing feature enhancement on the corresponding video features based on the audio features corresponding to the same training video to obtain audio enhancement video features corresponding to the training video, wherein the reference enhancement video features comprise audio enhancement video features.
In one embodiment, reference video features are obtained by fusing the reference features and the video features, so that the reference video features combine the image information and the motion information of the video. Channel attention processing is performed on the reference video features to obtain reference channel attention weights, and feature enhancement is performed on the video features based on the reference channel attention weights to obtain the reference enhanced video features. Enhancing the video features with the reference channel attention weights helps associate objects that have a motion relationship in the video features and highlights the moving objects in the video features.
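One possible realization of this fuse-then-gate enhancement is sketched below; the concatenation-based fusion and the squeeze-excitation style gate are assumptions about the layer design, not the exact architecture of the model.

```python
import torch
import torch.nn as nn

class ChannelAttentionEnhance(nn.Module):
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)       # fuse reference and video features
        self.gate = nn.Sequential(                # reference channel attention weights
            nn.Linear(dim, dim // reduction), nn.ReLU(),
            nn.Linear(dim // reduction, dim), nn.Sigmoid())

    def forward(self, video_feat, reference_feat):
        # video_feat, reference_feat: (batch, dim) pooled per-video features
        fused = self.fuse(torch.cat([video_feat, reference_feat], dim=-1))
        weights = self.gate(fused)                # per-channel attention weights
        return video_feat * weights               # reference enhanced video features
```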
And S208, aiming at the same training sample pair, performing similarity calculation on training text characteristics corresponding to the training text, video characteristics corresponding to the training video and reference enhanced video characteristics respectively to obtain a similarity set corresponding to each training sample pair respectively.
Wherein one similarity set corresponds to one training sample pair. One similarity set comprises target similarity obtained by performing similarity calculation on training text features corresponding to the training texts and video features corresponding to the training videos, and target similarity obtained by performing similarity calculation on the training text features corresponding to the training texts and reference enhancement video features corresponding to the training videos.
Specifically, in the initial video text matching model, for the same training sample pair, similarity calculation is performed on training text features corresponding to training texts and video features and reference enhanced video features corresponding to training videos respectively, and similarity sets corresponding to the training sample pairs are formed by the calculated target similarities. Because of the plurality of training sample pairs, the similarity set corresponding to each training sample pair can be finally obtained.
If the reference enhanced video features comprise motion enhanced video features, the similarity set comprises target similarity obtained by performing similarity calculation on the training text features and the motion enhanced video features. If the reference enhanced video features include audio enhanced video features, the similarity set includes a target similarity obtained by performing similarity calculation on the training text features and the audio enhanced video features.
It can be understood that, when the similarity calculation is performed, the Euclidean distance or the cosine similarity between the two features may be calculated, or the similarity between the two features may be calculated based on a custom formula or algorithm.
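The similarity set of one training sample pair might then be assembled as in the sketch below, taking cosine similarity as one of the options just mentioned; the dictionary keys are illustrative names only.

```python
import torch.nn.functional as F

def similarity_set(text_feat, video_feat, motion_enhanced=None, audio_enhanced=None):
    # Each target similarity compares the training text feature with one video-side feature.
    sims = {"text_video": F.cosine_similarity(text_feat, video_feat, dim=-1)}
    if motion_enhanced is not None:
        sims["text_motion_enhanced"] = F.cosine_similarity(text_feat, motion_enhanced, dim=-1)
    if audio_enhanced is not None:
        sims["text_audio_enhanced"] = F.cosine_similarity(text_feat, audio_enhanced, dim=-1)
    return sims
```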
Step S210, calculating training loss based on the similarity set corresponding to the positive sample pair and the similarity set corresponding to the matched negative sample pair, and adjusting model parameters of the initial video text matching model based on the training loss until a convergence condition is met to obtain a target video text matching model; the target video text matching model is used for determining a matching result between the video and the text.
The target video text matching model refers to a trained video text matching model. The convergence condition may be at least one of a condition that the training loss is less than a preset threshold, a condition that the number of model iterations is greater than a preset number, and the like.
Specifically, after determining the similarity sets corresponding to the training sample pairs, the computer device may calculate the training loss based on the similarity set corresponding to the positive sample pair and the similarity set corresponding to the matched negative sample pair. The computer device may calculate the training loss based on a loss function, which may be a loss function commonly used in model training or a custom loss function. Furthermore, the computer device can adjust model parameters of the initial video text matching model based on the training loss, and perform back propagation adjustment on the model parameters of the initial video text matching model based on the training loss until a convergence condition is met, so as to obtain the target video text matching model.
The target video text matching model is used for determining a matching result between any video and any text. For example, the video features and the reference features corresponding to the video to be matched and the text features corresponding to the text to be matched may be input into the target video text matching model, the target video text matching model outputs the similarity set corresponding to the video to be matched and the text to be matched, and the matching result between the video to be matched and the text to be matched is determined based on the similarity set. The matching result between the video to be matched and the text to be matched can be determined in the target video text matching model based on the similarity set, and the target video text matching model outputs the matching result between the video to be matched and the text to be matched.
In one embodiment, the training loss includes training sub-losses corresponding to the various similarities, that is, the training loss includes training sub-losses corresponding to the various similarity categories. The computer device may calculate training sub-losses based on the target similarities belonging to the same similarity class in the positive sample pair and the matched negative sample pair, obtain training sub-losses corresponding to each similarity class, and then obtain training losses based on the training sub-losses. It can be understood that the target similarity calculated based on the data of the same type belongs to the same similarity category.
In one embodiment, when calculating any one of the training sub-losses, the computer device may fuse the target similarities respectively corresponding to a positive sample pair and its matched negative sample pairs to obtain the fusion similarity corresponding to that positive sample pair. For example, the sum of the target similarities respectively corresponding to the positive sample pair and the matched negative sample pairs is taken as the fusion similarity; or the target similarities respectively corresponding to the positive sample pair and the matched negative sample pairs are first exponentiated with the natural constant e as the base, and the sum of the exponentiated results is taken as the fusion similarity; and so on. Then, the computer device calculates a sample loss based on the difference between the target similarity and the fusion similarity corresponding to the same positive sample pair, obtaining a sample loss for each positive sample pair. For example, the ratio of the target similarity to the fusion similarity is taken as the sample loss; or the target similarity corresponding to the positive sample pair is exponentiated with the natural constant e as the base, and the ratio of this exponentiated result to the fusion similarity is taken as the sample loss; and so on. Finally, the computer device calculates the training sub-loss based on the individual sample losses. For example, the average of the sample losses is taken as the training sub-loss; or the median of the sample losses is taken as the training sub-loss; and so on.
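The exponential variant described above behaves like an InfoNCE-style contrastive loss. A sketch for one similarity category follows; the negative logarithm is an added assumption so that minimizing the loss increases the positive pair's share of the fusion similarity.

```python
import torch

def training_sub_loss(pos_sim, neg_sims):
    # pos_sim:  (batch,)   target similarity of each positive sample pair
    # neg_sims: (batch, k) target similarities of the k matched negative sample pairs
    all_sims = torch.cat([pos_sim.unsqueeze(1), neg_sims], dim=1)  # (batch, k + 1)
    fusion = torch.exp(all_sims).sum(dim=1)        # fusion similarity (sum of exponentials)
    ratio = torch.exp(pos_sim) / fusion            # sample loss term per positive pair
    return (-torch.log(ratio)).mean()              # training sub-loss (mean over samples)
```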
In one embodiment, a computer device may obtain a plurality of sets of training sample pairs, and randomly select one set of training sample pairs from each set of training sample pairs as a current set of sample pairs. The computer equipment inputs video features and reference features corresponding to training videos in the current sample pair set and training text features corresponding to training texts into an initial video text matching model, similarity sets corresponding to all the training sample pairs in the current sample pair set are obtained through data processing of the initial video text matching model, training losses are calculated based on all the similarity sets corresponding to the current sample pair set, model parameters of the initial video text matching model are adjusted based on the training losses, and an intermediate video text matching model is obtained. And the computer equipment takes the next training sample pair set as a new current sample pair set, takes the intermediate video text matching model as a new initial video text matching model, returns to the step of inputting the video features and the reference features corresponding to the training videos in the current sample pair set and the training text features corresponding to the training texts into the initial video text matching model for iterative training, and repeats the steps of iterative training for multiple times, continuously adjusting the model parameters until the convergence condition is met, and obtains the target video text matching model. For example, if the training loss calculated based on each similarity set corresponding to the current sample pair set in a certain round of training is smaller than a preset threshold, stopping adjusting the model parameters, and taking the video text matching model obtained through the latest adjustment as the target video text matching model. And if the iteration times of the model after a certain round of training are greater than the preset times, taking the video text matching model obtained by the latest adjustment as a target video text matching model.
It will be appreciated that adjusting the model parameters based on a current sample pair set is a round of model training, one model iteration. Different sets of training sample pairs may or may not contain repeated training sample pairs.
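The outer loop could look roughly like the sketch below; the model call signature, the loss helper, and the convergence thresholds are placeholders rather than interfaces defined by the description.

```python
def train(model, sample_pair_sets, compute_training_loss, optimizer,
          max_iterations=10000, loss_threshold=1e-3):
    for iteration, pair_set in enumerate(sample_pair_sets):
        similarity_sets = model(pair_set)              # one forward pass per sample pair set
        loss = compute_training_loss(similarity_sets)  # from positive and matched negative pairs
        optimizer.zero_grad()
        loss.backward()                                # back-propagate the training loss
        optimizer.step()                               # adjust the model parameters
        # Convergence: loss below a preset threshold, or iteration budget exhausted.
        if loss.item() < loss_threshold or iteration + 1 >= max_iterations:
            break
    return model
```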
In the video text matching model training method, a training sample pair set is obtained; training sample pairs in the training sample pair set comprise positive sample pairs and negative sample pairs matched with the positive sample pairs, the training sample pairs comprise training videos and training texts, and the training videos and the training texts in the positive sample pairs are matched with each other; inputting video features and reference features corresponding to training videos in the training sample pair set and training text features corresponding to training texts into an initial video text matching model; the reference features comprise at least one of audio features and action features corresponding to the training videos; based on the reference features corresponding to the same training video, performing feature enhancement on the corresponding video features to obtain the reference enhancement video features corresponding to the training video; the reference enhanced video feature comprises at least one of a motion enhanced video feature and an audio enhanced video feature; aiming at the same training sample pair, performing similarity calculation on training text characteristics corresponding to a training text, video characteristics corresponding to a training video and reference enhanced video characteristics respectively to obtain similarity sets corresponding to the training sample pairs respectively; calculating training loss based on the similarity set corresponding to the positive sample pair and the similarity set corresponding to the matched negative sample pair, and adjusting model parameters of the initial video text matching model based on the training loss until a convergence condition is met to obtain a target video text matching model; the target video text matching model is used for determining a matching result between the video and the text. Therefore, the video characteristics can provide image information of the video, the audio characteristics can provide sound information of the video, the action characteristics can provide motion information of the video, the video text matching model is trained based on the video characteristics corresponding to the training video, the reference characteristics and the training text characteristics corresponding to the training text, the understanding of the model on the video content can be improved by utilizing abundant modal information in the video, and the prediction accuracy of the model is improved. And moreover, feature enhancement and feature guidance are carried out on the video features based on the audio features or the motion features, important information in the video can be highlighted, similarity calculation is carried out on the video features and the training text features respectively based on the video features and the reference enhanced video features, model parameters are adjusted based on training loss generated by the similarity set obtained through calculation, the relation between the video and the text can be better established by the model, and the prediction accuracy of the model is further improved.
In one embodiment, step S202 includes:
obtaining a plurality of positive sample pairs; performing data reorganization on each positive sample pair to obtain a plurality of negative sample pairs; taking a negative sample pair that has data in common with a positive sample pair as a negative sample pair matched with that positive sample pair; and obtaining the training sample pair set based on each positive sample pair and each matched negative sample pair.
Data reorganization refers to exchanging training videos or training texts between different positive sample pairs to recombine them into new training sample pairs. For example, the training videos of two positive sample pairs may be exchanged to obtain two negative sample pairs, and the training texts of the two positive sample pairs may be exchanged to obtain two other negative sample pairs.
Specifically, the computer device may obtain a plurality of positive sample pairs locally or from another device, and perform data reorganization on each positive sample pair to obtain a plurality of negative sample pairs. Among the negative sample pairs, the computer device may take each negative sample pair that has data in common with a positive sample pair as a negative sample pair matched with that positive sample pair, thereby obtaining the negative sample pairs corresponding to each positive sample pair. Finally, each positive sample pair and each matched negative sample pair form the training sample pair set.
For example, the positive sample pair A includes a training video a1 and a training text a2, the positive sample pair B includes a training video b1 and a training text b2, and the positive sample pair C includes a training video c1 and a training text c2. Recombining the three positive sample pairs yields at most six negative sample pairs: the negative sample pair D1 includes the training video a1 and the training text b2, D2 includes a1 and c2, D3 includes b1 and a2, D4 includes b1 and c2, D5 includes c1 and a2, and D6 includes c1 and b2. The positive sample pairs A, B, and C and the negative sample pairs D1 to D6 form the training sample pair set.
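This reorganization can be expressed compactly; the sketch below treats videos and texts as hashable identifiers and returns, for each positive pair, the negative pairs that share a video or a text with it (names follow the example above).

```python
from itertools import permutations

def build_training_pairs(positive_pairs):        # positive_pairs: [(video, text), ...]
    videos = [v for v, _ in positive_pairs]
    texts = [t for _, t in positive_pairs]
    # Every cross combination (video_i, text_j) with i != j is a negative sample pair.
    negatives = [(videos[i], texts[j])
                 for i, j in permutations(range(len(positive_pairs)), 2)]
    # A negative pair matches a positive pair when they share a video or a text.
    matched = {
        pos: [neg for neg in negatives if neg[0] == pos[0] or neg[1] == pos[1]]
        for pos in positive_pairs
    }
    return matched
```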
In the above embodiment, each positive sample pair is subjected to data reorganization to obtain a plurality of negative sample pairs, training sample pairs that share data are treated as matched training sample pairs, and each positive sample pair and its matched negative sample pairs form the training sample pair set. In this way, a set of training sample pairs containing both positive and negative sample pairs can be obtained quickly from the positive sample pairs alone.
In one embodiment, the generation of the motion enhanced video features comprises the steps of:
performing intra-modal attention processing on the video features and the motion features corresponding to the current training video, respectively, to obtain self-attention video features and self-attention motion features corresponding to the current training video; performing inter-modal attention processing on the video features and the self-attention motion features corresponding to the current training video to obtain cross attention video features corresponding to the current training video, and performing inter-modal attention processing on the motion features and the self-attention video features corresponding to the current training video to obtain cross attention motion features corresponding to the current training video; fusing the cross attention motion features and the cross attention video features corresponding to the current training video to obtain motion video fusion features corresponding to the current training video; and performing channel attention processing on the motion video fusion features corresponding to the current training video to obtain a first channel attention weight, and performing feature enhancement on the video features corresponding to the current training video based on the first channel attention weight to obtain the motion enhanced video features corresponding to the current training video.
Wherein modality refers to the source or form of information. For example, video features are one modality data, motion features are another modality data, and audio features are another modality data.
The intra-modality attention processing is attention processing performed on single-modality data and is used to highlight the key information in that single-modality data. The inter-modality attention processing refers to attention processing performed on at least two kinds of modality data, and is used to perform information interaction between different modality data so as to highlight the associated information between the two kinds of modality data.
Channel attention processing is attention processing performed on the channel dimensions of a feature to highlight information characterized by important channels in the feature. It can be understood that the data features obtained by feature extraction on the data generally include data sub-features respectively corresponding to a plurality of channels, information attention points of different channels are different, and the data sub-features corresponding to different channels can represent information of different semantics in the data.
Specifically, when the video features are subjected to feature enhancement based on the motion features of the video, the video features and the motion features can be subjected to data processing in multiple attention processing modes to obtain a first channel attention weight, and the video features are subjected to feature enhancement based on the first channel attention weight to obtain motion enhancement video features.
The current training video refers to the currently processed training video, and may be any one of the training videos. The computer equipment firstly carries out intra-modal attention processing on video features corresponding to a current training video to obtain self-attention video features corresponding to the current training video, and carries out intra-modal attention processing on action features corresponding to the current training video to obtain self-attention action features corresponding to the current training video. Through intra-modality attention processing, feature weighting can be carried out on respective modalities, and key contents in the respective modalities are highlighted.
Then, the computer device carries out inter-modal attention processing on the video features corresponding to the current training video and the self-attention motion features to obtain cross attention video features corresponding to the current training video, and carries out inter-modal attention processing on the motion features corresponding to the current training video and the self-attention video features to obtain cross attention motion features corresponding to the current training video. By performing inter-modality attention processing on the video features and the self-attention motion features and weighting the video features based on the self-attention motion features, motion-weighted video features, namely cross-attention video features, can be obtained. The motion weighted video features are used to highlight motion information in the video features. By performing inter-modal attention processing on the motion features and the self-attention video features and weighting the motion features based on the self-attention video features, video-weighted motion features, namely cross-attention motion features, can be obtained. The video-weighted motion features are used to highlight image information in the motion features.
Further, the computer device fuses the cross attention motion feature and the cross attention video feature corresponding to the current training video to obtain the motion video fusion feature corresponding to the current training video. By fusing the cross-attention motion feature and the cross-attention video feature, information that the motion feature and the video feature are linked to each other can be further highlighted.
Finally, the computer device performs channel attention processing on the motion video fusion features corresponding to the current training video to obtain a first channel attention weight. The first channel attention weight helps to establish an association between motion-related objects in the video features. The computer device then performs feature enhancement on the video features corresponding to the current training video based on the first channel attention weight to obtain the motion enhancement video features corresponding to the current training video. The motion enhancement video features emphasize the feature representation of moving objects in the video and de-emphasize the feature representation of other information in the video, such as background and noise.
It will be appreciated that audio enhanced video features may also be obtained in the manner described above.
In one embodiment, intra-modality attention processing and inter-modality attention processing may be implemented based on commonly used attention mechanisms, such as key-value pair attention mechanisms, multi-head attention mechanisms, and the like. Of course, intra-modality attention processing and inter-modality attention processing may also be implemented based on custom formulas or algorithms. The difference between intra-modality attention processing and inter-modality attention processing lies in the input information: the input to intra-modality attention processing is data of one modality, while the input to inter-modality attention processing is data of different modalities. Similarly, the channel attention processing may be attention processing implemented based on a commonly used channel attention mechanism, or may be implemented based on a custom formula or algorithm.
In one embodiment, channel attention processing may be performed on the motion video fusion feature through formula (1), so as to obtain a motion enhancement video feature:
V_t^M = W_t^C ⊙ V_t,   W_t^C = σ(W_2 δ(W_1 MV_t))        Equation (1)

where W_1 and W_2 are two linear transformations, δ and σ represent the ReLU and sigmoid activation functions respectively, d_v denotes the video feature dimension, and d denotes the feature dimension of the data obtained by linearly transforming the motion video fusion feature MV with W_1. MV_t represents the motion video fusion feature corresponding to the t-th video frame in the video, W_t^C represents the first channel attention weight corresponding to the t-th video frame, and V_t^M represents the motion enhancement video feature corresponding to the t-th video frame.
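The channel attention enhancement in formula (1) can be sketched as follows. This is a minimal PyTorch sketch under assumed dimensions d_v and d; the module name ChannelAttentionEnhance and the bias-free linear layers are illustrative choices, not taken from the original disclosure.

```python
import torch
import torch.nn as nn

class ChannelAttentionEnhance(nn.Module):
    """Sketch of formula (1): W_t^C = sigmoid(W2 · ReLU(W1 · MV_t)), V_t^M = W_t^C ⊙ V_t."""

    def __init__(self, d_v: int, d: int):
        super().__init__()
        self.w1 = nn.Linear(d_v, d, bias=False)   # first linear transformation W1
        self.w2 = nn.Linear(d, d_v, bias=False)   # second linear transformation W2

    def forward(self, video_feat: torch.Tensor, fusion_feat: torch.Tensor) -> torch.Tensor:
        # video_feat, fusion_feat: (T, d_v) per-frame features V_t and MV_t
        channel_weight = torch.sigmoid(self.w2(torch.relu(self.w1(fusion_feat))))  # first channel attention weight
        return channel_weight * video_feat  # motion enhancement video feature V_t^M
```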
In this embodiment, by organically combining intra-modality attention processing, inter-modality attention processing and channel attention processing, the important information associated between the video features and the action features can be mined to obtain the first channel attention weight. Feature enhancement is then performed on the video features based on the first channel attention weight to obtain the action enhanced video features, which strengthen the feature expression of moving objects in the video and thus better express the semantic information of the video, thereby helping to improve the prediction accuracy of the model during training.
In one embodiment, the intra-modal attention processing is performed on the video features and the motion features corresponding to the current training video respectively to obtain the self-attention video features and the self-attention motion features corresponding to the current training video, and the method includes:
performing intra-modal fusion on video features corresponding to a current training video to obtain a first self-attention weight, performing fusion on the video features corresponding to the current training video and the first self-attention weight to obtain a first attention feature, and obtaining self-attention video features based on the video features corresponding to the current training video and the first attention feature; and performing intra-modal fusion on the motion features corresponding to the current training video to obtain a second self-attention weight, performing fusion on the motion features corresponding to the current training video and the second self-attention weight to obtain a second attention feature, and obtaining the self-attention motion feature based on the motion features corresponding to the current training video and the second attention feature.
Intra-modal fusion refers to fusing single-modality data. For example, intra-modal fusion may be performed by multiplying the video features with themselves and then performing a softmax (normalized exponential function) operation; or the video features may be linearly transformed first, the transformed video features multiplied with each other, and a softmax operation then performed; and so on.
Specifically, when intra-modal attention processing is performed on video features corresponding to a current training video, the computer device may perform intra-modal fusion on the video features corresponding to the current training video to obtain a first self-attention weight, and perform fusion on the video features corresponding to the current training video and the first self-attention weight to obtain a first attention feature. The first attention feature emphasizes important information in the video feature and emphasizes information with important semantics in the video feature. Furthermore, the computer device obtains the self-attention video feature based on the video feature corresponding to the current training video and the first attention feature. For example, the average of the video feature and the first attention feature is taken as the self-attention video feature; taking the weighted average of the video feature and the first attention feature as a self-attention video feature; and so on.
Similarly, when performing intra-modality attention processing on the motion features corresponding to the current training video, the computer device may perform intra-modality fusion on the motion features corresponding to the current training video to obtain a second self-attention weight, and perform fusion on the motion features corresponding to the current training video and the second self-attention weight to obtain a second attention feature. The second attention feature emphasizes important information in the motion feature and emphasizes information with important semantics in the motion feature. Further, the computer device obtains a self-attention action feature based on the action feature and the second attention feature corresponding to the current training video.
In the above embodiment, the video features are weighted and fused based on the first self-attention weight obtained by intra-modal fusion of the video features, so that feature expression of important information in the video features can be emphasized, and accurate self-attention video features can be obtained based on the video features and the first attention features obtained by weighted fusion. The action features are weighted and fused based on the second self-attention weight obtained through intra-modal fusion of the action features, feature expression of important information in the action features can be emphasized, and accurate self-attention action features can be obtained based on the action features and the second attention features obtained through weighted fusion.
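A minimal sketch of this intra-modal attention is given below, assuming frame-level (or word-level) features of shape (T, d) and using the softmax-of-products option mentioned above together with simple averaging; both choices are just one of the alternatives the text allows.

```python
import torch

def intra_modal_attention(x: torch.Tensor) -> torch.Tensor:
    """Intra-modal attention on single-modality features x of shape (T, d)."""
    # intra-modal fusion: multiply the features with themselves, then softmax -> self-attention weight
    self_attn_weight = torch.softmax(x @ x.t() / x.shape[-1] ** 0.5, dim=-1)
    attn_feat = self_attn_weight @ x          # first/second attention feature
    return (x + attn_feat) / 2                # self-attention video/motion feature (average variant)
```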
In one embodiment, performing inter-modality attention processing on a video feature and a self-attention motion feature corresponding to a current training video to obtain a cross-attention video feature corresponding to the current training video, and performing inter-modality attention processing on a motion feature and a self-attention video feature corresponding to the current training video to obtain a cross-attention motion feature corresponding to the current training video includes:
splicing the video features corresponding to the current training video and the self-attention action features to obtain first splicing features, performing inter-modal fusion on the video features corresponding to the current training video and the first splicing features to obtain first cross attention weights, fusing the first splicing features and the first cross attention weights to obtain first cross attention features, and obtaining cross attention video features based on the first splicing features and the first cross attention features; the motion characteristic corresponding to the current training video and the self-attention video characteristic are spliced to obtain a second splicing characteristic, the motion characteristic corresponding to the current training video and the second splicing characteristic are subjected to inter-modal fusion to obtain a second cross attention weight, the second splicing characteristic and the second cross attention weight are fused to obtain a second cross attention characteristic, and the cross attention motion characteristic is obtained based on the second splicing characteristic and the second cross attention characteristic.
The inter-modality fusion is to fuse data of different modalities. For example, inter-modality fusion may be performed by multiplying the video feature and the first stitching feature and then applying a softmax operation; or the video feature and the first stitching feature may be linearly transformed first, the transformed video feature and the transformed first stitching feature multiplied, and a softmax operation then performed; and so on.
Specifically, when inter-modality attention processing is performed on video features and self-attention motion features corresponding to a current training video, the computer device stitches the video features and the self-attention motion features corresponding to the current training video to obtain first stitching features. The first splicing feature fuses image information and action information of the video and is a new feature different from video features. And then, the computer equipment performs inter-modality fusion on the video features corresponding to the current training video and the first splicing features to obtain first cross attention weights, and performs fusion on the first splicing features and the first cross attention weights to obtain first cross attention features. The first cross attention feature emphasizes important information in the first splicing feature and emphasizes information with important semantics in the first splicing feature. Finally, the computer device obtains cross-attention video features based on the first stitching features and the first cross-attention features.
Similarly, when inter-modality attention processing is performed on the motion feature and the self-attention video feature, the computer device stitches the motion feature and the self-attention video feature corresponding to the current training video to obtain a second stitching feature. The second stitching feature fuses image information and motion information of the video and is a new feature different from the motion feature. Further, the computer device performs inter-modal fusion on the motion feature corresponding to the current training video and the second stitching feature to obtain a second cross attention weight, and fuses the second stitching feature with the second cross attention weight to obtain a second cross attention feature. The second cross attention feature emphasizes the important, semantically significant information in the second stitching feature. Finally, the computer device obtains the cross attention motion feature based on the second stitching feature and the second cross attention feature.
In the above embodiment, the video features and the self-attention motion features are spliced to obtain the first spliced feature, the first spliced feature is weighted and fused based on the first cross attention weight obtained by performing inter-modality fusion on the video features and the first spliced feature, feature expression of important information in the first spliced feature can be emphasized, and then accurate cross attention video features can be obtained based on the first spliced feature and the first cross attention feature obtained by the weighted fusion. The action features and the self-attention video features are spliced to obtain second splicing features, the second splicing features are subjected to weighted fusion based on second cross attention weight obtained by performing inter-modal fusion on the action features and the second splicing features, feature expression of important information in the second splicing features can be emphasized, and accurate cross attention action features can be obtained based on the second splicing features and the second cross attention features obtained by the weighted fusion.
In one embodiment, fusing the cross attention motion feature and the cross attention video feature corresponding to the current training video to obtain the motion video fusion feature corresponding to the current training video, includes:
splicing the cross attention motion characteristics and the cross attention video characteristics corresponding to the current training video to obtain cross attention splicing characteristics; fusing the cross attention action characteristic and the cross attention video characteristic corresponding to the current training video to obtain a cross attention fusion characteristic; and performing inter-modal fusion on the cross attention splicing feature and the cross attention fusion feature to obtain a third cross attention weight, performing fusion on the cross attention splicing feature and the third cross attention weight to obtain a third cross attention feature, and obtaining an action video fusion feature based on the cross attention splicing feature and the third cross attention feature.
Specifically, when the cross attention motion feature and the cross attention video feature are fused, the computer device may splice the cross attention motion feature and the cross attention video feature corresponding to the current training video to obtain the cross attention spliced feature. The computer device may fuse the cross attention motion feature and the cross attention video feature corresponding to the current training video to obtain a cross attention fusion feature, for example, multiply the cross attention motion feature and the cross attention video feature to obtain a cross attention fusion feature; multiplying the cross attention motion characteristic and the cross attention video characteristic and then scaling to obtain a cross attention fusion characteristic; and so on. Furthermore, the computer device performs inter-modal fusion on the cross attention splicing feature and the cross attention fusion feature to obtain a third cross attention weight, and performs fusion on the cross attention splicing feature and the third cross attention weight to obtain a third cross attention feature. The third cross attention feature emphasizes important information in the cross attention splicing feature and emphasizes information with important semantics in the cross attention splicing feature. And finally, obtaining the action video fusion feature based on the cross attention splicing feature and the third cross attention feature.
In the above embodiment, the cross attention motion feature and the cross attention video feature are respectively spliced and fused to obtain a cross attention splicing feature and a cross attention fusion feature, the cross attention splicing feature is weighted and fused based on a third cross attention weight obtained by performing inter-modal fusion on the cross attention splicing feature and the cross attention fusion feature, so that feature expression of important information in the cross attention splicing feature can be emphasized, and an accurate motion video fusion feature can be obtained based on the cross attention splicing feature and the third cross attention feature obtained by the weighted fusion.
In one embodiment, attention processing may be based on a dual stream fusion strategy of a Transformer encoder. The Transformer encoder is shown in equation (2):
Encoder(Q, K, V) = LN(FFN(H) + H),   where H = LN(MHA(Q, K, V) + Q)        Equation (2)

MHA denotes the Multi-Head Attention mechanism, whose working principle is to multiply the Q and K matrices, apply a softmax operation to obtain a weight W, and then multiply W with the V matrix to obtain a weighted result. FFN denotes a Feed Forward Network, LN denotes Layer Normalization, Encoder denotes the Transformer encoder, and Q, K and V in equation (2) are the features input to the encoder; the associated projection matrices are linear transformations, and d is the feature dimension.
The self-attention video feature obtained by intra-modal attention processing of the video feature V is V_self = Encoder(V, V, V).

The self-attention motion feature obtained by intra-modal attention processing of the motion feature M is M_self = Encoder(M, M, M).

The cross attention video feature obtained by inter-modal attention processing of the video feature V and the self-attention motion feature M_self is V_cross = Encoder(V, cat(V, M_self), cat(V, M_self)), where cat denotes concatenation of features in the time dimension.

The cross attention motion feature obtained by inter-modal attention processing of the motion feature M and the self-attention video feature V_self is M_cross = Encoder(M, cat(M, V_self), cat(M, V_self)).

The motion video fusion feature obtained by fusing the cross attention video feature V_cross and the cross attention motion feature M_cross is MV = Encoder(V_cross · M_cross, cat(V_cross, M_cross), cat(V_cross, M_cross)).

A channel-level attention operation is then performed on the motion video fusion feature MV to obtain the first channel attention weight, and feature enhancement and guidance are performed on the video feature V based on the first channel attention weight to obtain the motion enhanced video feature V^M.
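A compact sketch of this dual-stream fusion strategy is given below, assuming video and motion features of the same length T and dimension d. FusionEncoder approximates the Transformer encoder of equation (2) with torch.nn.MultiheadAttention; the head count, the FFN width and the channel_attn argument (a channel attention module such as the one sketched after formula (1)) are assumptions rather than details from the original disclosure.

```python
import torch
import torch.nn as nn

class FusionEncoder(nn.Module):
    """Transformer encoder of equation (2): Encoder(Q,K,V) = LN(FFN(H) + H), H = LN(MHA(Q,K,V) + Q)."""

    def __init__(self, d: int, num_heads: int = 8):
        super().__init__()
        self.mha = nn.MultiheadAttention(d, num_heads)          # expects (seq_len, batch, d)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, q, k, v):
        h = self.ln1(self.mha(q, k, v)[0] + q)
        return self.ln2(self.ffn(h) + h)

def motion_enhance(video: torch.Tensor, motion: torch.Tensor,
                   enc: FusionEncoder, channel_attn: nn.Module) -> torch.Tensor:
    """Dual-stream fusion pipeline; video and motion have shape (T, 1, d)."""
    cat = lambda a, b: torch.cat([a, b], dim=0)                       # concatenation in the time dimension
    v_self = enc(video, video, video)                                 # intra-modal attention on video features
    m_self = enc(motion, motion, motion)                              # intra-modal attention on motion features
    v_cross = enc(video, cat(video, m_self), cat(video, m_self))      # cross attention video feature
    m_cross = enc(motion, cat(motion, v_self), cat(motion, v_self))   # cross attention motion feature
    mv = enc(v_cross * m_cross, cat(v_cross, m_cross), cat(v_cross, m_cross))  # motion video fusion feature
    return channel_attn(video, mv)                                    # enhancement via the first channel attention weight
```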
In one embodiment, the initial video text matching model includes an action enhanced video feature and text feature matching network including a first intra-modality attention layer, a second intra-modality attention layer, a first inter-modality attention layer, a second inter-modality attention layer, an action video fusion attention layer, a first channel attention layer, and a first similarity computation layer.
The first intra-modality attention layer is used for performing intra-modality attention processing on the video features, and the second intra-modality attention layer is used for performing intra-modality attention processing on the action features; the first inter-modality attention layer is used for performing inter-modality attention processing on the video features and the self-attention motion features, and the second inter-modality attention layer is used for performing inter-modality attention processing on the motion features and the self-attention video features; the action video fusion attention layer is used for fusing cross attention action features and cross attention video features corresponding to the same training video; the first channel attention layer is used for carrying out channel attention processing on the motion video fusion features; the first similarity calculation layer is used for calculating the similarity between the motion enhancement video features and the training text features.
The action enhancement video feature and text feature matching network is a network used for carrying out feature enhancement on video features corresponding to training videos based on action features corresponding to the training videos and calculating target similarity between the action enhancement video features and training text features corresponding to the training texts.
Referring to fig. 3, the motion-enhanced video feature and text feature matching network includes a first intra-modality attention layer, a second intra-modality attention layer, a first inter-modality attention layer, a second inter-modality attention layer, a motion video fusion attention layer, a first channel attention layer, and a first similarity calculation layer. The intra-modality attention layer may also be referred to as an intra-modality attention module, the inter-modality attention layer may also be referred to as an inter-modality fusion attention module, and the motion video fusion attention layer may also be referred to as a motion-video fusion attention module.
The video features corresponding to the current training video are input into the first intra-modality attention layer for intra-modality attention processing, which explores the interior of the modality, and the first intra-modality attention layer outputs the self-attention video features. The action features corresponding to the current training video are input into the second intra-modality attention layer for intra-modality attention processing, which likewise explores the interior of the modality, and the second intra-modality attention layer outputs the self-attention action features. The video features corresponding to the current training video and the self-attention motion features are input into the first inter-modality attention layer for inter-modality attention processing, so as to model the inter-modal relationship, and the first inter-modality attention layer outputs the cross attention video features. The action features corresponding to the current training video and the self-attention video features are input into the second inter-modality attention layer for inter-modality attention processing, so as to model the inter-modal relationship, and the second inter-modality attention layer outputs the cross attention action features. The cross attention action features and the cross attention video features corresponding to the current training video are input into the action video fusion attention layer, which further fuses the action features and the video features to obtain the action video fusion features. The action video fusion features are input into the first channel attention layer for channel attention processing, and the first channel attention layer outputs the first channel attention weight. Feature enhancement is performed on the video features corresponding to the current training video based on the first channel attention weight to obtain the motion enhanced video features. The motion enhanced video features corresponding to the current training video and the training text features corresponding to the corresponding training text are input into the first similarity calculation layer for similarity calculation and matching, and the first similarity calculation layer outputs the target similarity between the current training video and the corresponding training text.
In the above embodiment, the initial video text matching model includes a motion enhancement video feature and text feature matching network, the feature enhancement is performed through the special motion enhancement video feature and text feature matching network to obtain a motion enhancement video feature, and the similarity between the motion enhancement video feature and the text feature is calculated through the motion enhancement video feature and text feature matching network, so that the risk of confusion with other modalities can be reduced, and the model training quality can be improved.
In one embodiment, the process of generating audio enhanced video features comprises the steps of:
fusing video features and audio features corresponding to the current training video to obtain initial audio and video fusion features; carrying out random inactivation treatment and pooling treatment on the initial audio and video fusion characteristics to obtain intermediate audio and video fusion characteristics; carrying out normalization processing on the intermediate audio and video fusion characteristics to obtain target audio and video fusion characteristics; and performing channel attention processing on the target audio and video fusion features to obtain a second channel attention weight, and performing feature enhancement on the video features corresponding to the current training video based on the second channel attention weight to obtain audio enhancement video features corresponding to the current training video.
The random inactivation process is to randomly change part of sub-features in the initial audio-video fusion feature, for example, to randomly set a feature value of the part of sub-features in the initial audio-video fusion feature to a preset value.
The pooling process is to aggregate feature values of the sub-features in the feature obtained after the random inactivation process. For example, the pooling process may be sum pooling: the feature obtained by the random inactivation process is divided into a plurality of feature areas, each feature area including a plurality of sub-features; the feature values of the sub-features in each feature area are summed to obtain the feature statistic corresponding to that feature area, and the pooling result is obtained based on these feature statistics.
The normalization process is to map the data into a preset value range.
Specifically, when feature enhancement is performed on the video features based on the audio features of the video, the attention weight of the second channel can be obtained by performing data processing on the video features and the audio features, and the feature enhancement is performed on the video features based on the attention weight of the second channel to obtain the audio-enhanced video features.
The computer equipment can fuse the video features and the audio features corresponding to the current training video to obtain initial audio-video fusion features, for example, the video features and the audio features are expanded to the same dimension and then multiplied to obtain initial audio-video fusion features; and performing linear transformation on the video characteristics and the audio characteristics, and performing point multiplication on the video characteristics and the audio characteristics after the linear transformation to obtain initial audio and video fusion characteristics. And then, the computer equipment carries out random inactivation treatment on the initial audio and video fusion characteristics, carries out pooling treatment on the initial audio and video fusion characteristics after the random inactivation treatment to obtain intermediate audio and video fusion characteristics, and carries out normalization treatment on the intermediate audio and video fusion characteristics to obtain target audio and video fusion characteristics. The video features and the audio features can be effectively and accurately fused through random inactivation processing, pooling processing and normalization processing so as to obtain high-semantic target audio and video fusion features. And finally, the computer equipment performs channel attention processing on the target audio and video fusion features to obtain a second channel attention weight, and performs feature enhancement on the video features corresponding to the current training video based on the second channel attention weight to obtain audio enhancement video features corresponding to the current training video.
It will be appreciated that motion enhanced video features may also be obtained in the manner described above.
In one embodiment, the normalization process includes a square normalization process and an L2 normalization. Square normalization is performed on the intermediate audio-video fusion feature first, and L2 normalization is then performed on the square-normalized intermediate audio-video fusion feature.
In the above embodiment, through a series of fusion, random inactivation, pooling and normalization, the video features and the audio features can be fully fused to obtain accurate target audio/video fusion features. And performing feature enhancement on the video features based on the second channel attention weight obtained by performing channel attention processing on the target audio and video fusion features, so that accurate audio enhancement video features can be obtained.
In one embodiment, the initial video text matching model comprises an audio enhanced video feature and text feature matching network comprising an audio video fusion layer, a random inactivation layer, a pooling layer, a normalization layer, a second channel attention layer, and a second similarity calculation layer.
The audio and video fusion layer is used for fusing the video features and the audio features; the random inactivation layer is used for carrying out random inactivation treatment on input data; the pooling layer is used for pooling the input data; the normalization layer is used for normalizing input data; the second channel attention layer is used for carrying out channel attention processing on input data; the second similarity calculation layer is used for calculating the similarity between the audio enhancement video features and the training text features.
The audio enhancement video feature and text feature matching network is a network used for performing feature enhancement on video features corresponding to training videos based on audio features corresponding to the training videos and calculating target similarity between the audio enhancement video features and training text features corresponding to the training texts.
Referring to fig. 4, the audio enhanced video feature and text feature matching network includes an audio-video fusion layer, a random inactivation layer, a pooling layer, a normalization layer, a second channel attention layer, and a second similarity calculation layer.
The video features and audio features corresponding to the current training video are input into the audio-video fusion layer; fully connected layers (also called FC layers) activated by ReLU expand the audio feature A_t and the video feature V_t to the same dimension k·d_o, and the expanded features are then fused to obtain the initial audio-video fusion feature. The initial audio-video fusion feature is input into the random inactivation layer (dropout layer) for random inactivation processing, and the output of the random inactivation layer is input into a sum pooling layer for pooling processing to obtain the intermediate audio-video fusion feature. The intermediate audio-video fusion feature is input into the normalization layer for normalization processing to obtain the target audio-video fusion feature. The target audio-video fusion feature is input into the second channel attention layer for channel attention processing to obtain the second channel attention weight. Feature enhancement is performed on the video features corresponding to the current training video based on the second channel attention weight to obtain the audio enhanced video features corresponding to the current training video. The audio enhanced video features corresponding to the current training video and the training text features corresponding to the corresponding training text are input into the second similarity calculation layer for similarity calculation and matching, and the second similarity calculation layer outputs the target similarity between the current training video and the corresponding training text.
In the above embodiment, the initial video text matching model includes an audio enhanced video feature and text feature matching network, the feature enhancement is performed through the special audio enhanced video feature and text feature matching network to obtain an audio enhanced video feature, and the similarity between the audio enhanced video feature and the text feature is calculated through the audio enhanced video feature and text feature matching network, so that the risk of confusion with other modalities can be reduced, and the model training quality can be improved.
In a specific embodiment, fully connected layers activated by ReLU are used to expand the audio feature A_t and the video feature V_t to the same dimension k·d_o, and the expanded features are then input into an MFB (Multi-modal Factorized Bilinear pooling) module to fuse the video features and the audio features, obtaining the audio-video fusion feature AV_t (i.e., the target audio-video fusion feature):

AV_t = SP(D(Φ^T A_t ⊙ Ψ^T V_t), k)        Equation (3)

where Φ and Ψ are two learnable matrix parameters, SP(f, k) represents a sum pooling operation with kernel and step size k, and D(·) represents a dropout layer used to prevent overfitting.
In addition, square normalization and L2 normalization are introduced to stabilize model training. Referring to equation (4), square normalization is performed first and L2 normalization is then performed; the square normalization can be implemented as

AV_t ← sign(AV_t) ⊙ sqrt(|AV_t|),   followed by   AV_t ← AV_t / ||AV_t||_2        Equation (4)

where sign(·) represents the sign function, the first operation is the square normalization and the second operation is the L2 normalization.
The audio-video fusion feature AV_t is input into the second channel attention layer, and an attention operation is performed at the channel level to obtain the second channel attention weight; feature enhancement and guidance are then performed on the video feature based on the second channel attention weight to obtain the audio enhanced video feature V_t^A:

V_t^A = W_t^C ⊙ V_t,   W_t^C = σ(W_2 δ(W_1 AV_t))        Equation (5)

where W_1 and W_2 are two linear transformations, and δ and σ represent the ReLU and sigmoid activation operations, respectively.
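The audio branch of equations (3) to (5) can be sketched as follows, assuming per-frame audio features of dimension d_a and video features of dimension d_v. The hyperparameters k, d_o and the dropout rate, as well as the module name AudioEnhance and the exact layer sizes of the channel attention, are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioEnhance(nn.Module):
    """MFB-style audio-video fusion with power/L2 normalization and channel attention."""

    def __init__(self, d_a: int, d_v: int, d_o: int, k: int = 5, dropout: float = 0.1):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(d_a, k * d_o), nn.ReLU())  # expands A_t to dimension k*d_o
        self.psi = nn.Sequential(nn.Linear(d_v, k * d_o), nn.ReLU())  # expands V_t to dimension k*d_o
        self.drop = nn.Dropout(dropout)                               # random inactivation layer D(·)
        self.k, self.d_o = k, d_o
        self.w1 = nn.Linear(d_o, d_o, bias=False)                     # channel attention, first linear map
        self.w2 = nn.Linear(d_o, d_v, bias=False)                     # channel attention, second linear map

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # audio: (T, d_a), video: (T, d_v)
        fused = self.drop(self.phi(audio) * self.psi(video))          # initial audio-video fusion feature
        fused = fused.view(-1, self.d_o, self.k).sum(dim=-1)          # sum pooling SP(·, k): intermediate fusion feature
        fused = torch.sign(fused) * torch.sqrt(torch.abs(fused) + 1e-12)  # square (power) normalization
        av = F.normalize(fused, p=2, dim=-1)                          # L2 normalization: target audio-video fusion feature
        weight = torch.sigmoid(self.w2(torch.relu(self.w1(av))))      # second channel attention weight
        return weight * video                                         # audio enhanced video feature V_t^A
```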
In one embodiment, in addition to the motion-enhanced video feature and text feature matching network and the audio-enhanced video feature and text feature matching network, the initial video text matching model may further include a video feature and text feature matching network, which is used for calculating the similarity between the video features and the text features.
In one embodiment, the video text matching model training method further comprises:
and when the audio frequency is lost in the training video, acquiring the preset characteristics or the video characteristics corresponding to the training video as the audio characteristics corresponding to the training video.
Specifically, some videos lack audio. For such videos, alignment may be performed based on preset features or video features. If a training video lacks audio, the computer device may acquire preset features or the video features corresponding to the training video as the audio features corresponding to the training video. The preset features are features set in advance and can be configured according to actual needs; for example, samples with missing audio features may be aligned by filling the audio features with 1.0.

Feature enhancement and guidance are performed on the video features based on the audio features in order to increase the weight of the sound source. When a video lacks audio, the audio features can either be fixed to 1 or replaced with the video features; both approaches allow the video features to take over the guiding role of the audio features when no audio features exist.
In the above embodiment, when audio is missing in the training video, the preset features or the video features corresponding to the training video are obtained as the audio features corresponding to the training video, so that mode alignment can be performed when the modes are missing, and the robustness of the model can be further improved.
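A minimal sketch of this fallback is given below; the function name, the boolean flag and the assumption that the substitute feature takes the shape of the video feature are illustrative only.

```python
import torch

def audio_or_fallback(audio_feat, video_feat, use_constant: bool = True):
    """Return the audio feature, or a substitute when the training video has no audio.

    The substitute is either a constant all-ones feature or the video feature itself,
    so that the audio-guided enhancement degenerates gracefully when audio is missing.
    """
    if audio_feat is not None:
        return audio_feat
    return torch.ones_like(video_feat) if use_constant else video_feat
```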
In one embodiment, the similarity set includes at least two target similarities, the current video feature is any one of a video feature and a reference enhanced video feature corresponding to a training video in a current training sample pair, the current text feature is a training text feature corresponding to a training text in the current training sample pair, the training video includes a plurality of video frames, and the training text includes a plurality of text words.
The calculation process of the target similarity between the current video characteristic and the current text characteristic comprises the following steps:
Calculating the initial similarity between the current video characteristic and the current text characteristic; the initial similarity comprises sub-similarities between a plurality of text words in the training text and the same video frame respectively, and sub-similarities between a plurality of video frames in the training video and the same text word respectively; obtaining a text weight based on the current text characteristics, and obtaining a video weight based on the current video characteristics; aiming at the initial similarity, obtaining a maximum value from a plurality of sub-similarities corresponding to the same video frame as a first sub-similarity, obtaining a maximum value from a plurality of sub-similarities corresponding to the same text word as a second sub-similarity, and obtaining a first sub-similarity corresponding to each video frame and a second sub-similarity corresponding to each text word; obtaining first similarity based on each first sub-similarity, and obtaining second similarity based on each second sub-similarity; fusing the first similarity and the text weight to obtain first fusion data, and fusing the second similarity and the video weight to obtain second fusion data; and obtaining the target similarity between the current video characteristic and the current text characteristic based on the first fusion data and the second fusion data.
Wherein the similarity set comprises at least two target similarities. For example, the similarity set includes one target similarity calculated based on the training text features and the video features, another target similarity calculated based on the training text features and the motion-enhanced video features, and another target similarity calculated based on the training text features and the audio-enhanced video features.
The current training sample pair refers to the currently processed training sample pair. The current video feature is any one of a video feature corresponding to the training video in the current training sample pair and the reference enhanced video feature. It is to be understood that if the reference enhanced video features include motion enhanced video features and audio enhanced video features, the current video feature is any one of video features, motion enhanced video features, and audio enhanced video features corresponding to the training video in the current training sample pair. And the current text features are the training text features corresponding to the training texts in the current training sample pair.
It is to be understood that the training video includes a plurality of video frames and the training text includes a plurality of text words. The features corresponding to the training videos comprise sub-features corresponding to the video frames respectively, and the features corresponding to the training texts comprise sub-features corresponding to the text words respectively.
Specifically, the various target similarities may be calculated in the same manner; the calculation process is described below by taking the target similarity between the current video feature and the current text feature as an example.
First, the computer device calculates an initial similarity between the current video feature and the current text feature, for example, a cosine similarity between the current video feature and the current text feature may be calculated as the initial similarity. The initial similarity comprises sub-similarities between a plurality of text words in the training text and the same video frame respectively, and sub-similarities between a plurality of video frames in the training video and the same text words respectively. For example, the initial similarity may be represented by a matrix, a horizontal axis of the matrix represents a plurality of text words in the training text, and specifically represents sub-similarities between the plurality of text words and the same video frame, different rows in the matrix correspond to different video frames, a vertical axis of the matrix represents a plurality of video frames in the training video, and specifically represents sub-similarities between the plurality of video frames and the same text word, and different columns in the matrix correspond to different text words. Aiming at the initial similarity, the computer equipment acquires a maximum value from a plurality of sub-similarities corresponding to the same video frame to be used as a first sub-similarity, obtains first sub-similarities corresponding to all the video frames respectively, and forms the first sub-similarities into the first similarity. Similarly, for the initial similarity, the computer device obtains a maximum value from the multiple sub-similarities corresponding to the same text word as a second sub-similarity, obtains a second sub-similarity corresponding to each text word, and combines the second sub-similarities into a second similarity. For example, the initial similarity, the first similarity and the second similarity may be expressed by a matrix, a horizontal axis of the initial similarity matrix represents sub-similarities between a plurality of text words in the training text and the same video frame, a maximum value is taken for each row of the initial similarity matrix to form the first similarity matrix, a vertical axis of the initial similarity matrix represents sub-similarities between a plurality of video frames in the training video and the same text word, and a maximum value is taken for each column of the initial similarity matrix to form the second similarity matrix.
The computer device may also weight the current text feature and the current video feature, obtaining a text weight based on the current text feature and a video weight based on the current video feature. Text words with important semantics receive higher weights in the text weight, and video frames with important semantics receive higher weights in the video weight. The first sub-similarities included in the first similarity represent sub-similarities between the text words and video frames, and the text weight includes the text sub-weight corresponding to each text word; the first similarity and the text weight are fused to obtain first fusion data, in which the sub-similarities between semantically important text words and the video frames are emphasized. The second sub-similarities included in the second similarity represent sub-similarities between the video frames and text words, and the video weight includes the video sub-weight corresponding to each video frame; the second similarity and the video weight are fused to obtain second fusion data, in which the sub-similarities between semantically important video frames and the text words are emphasized.
And finally, the computer equipment obtains the target similarity between the current video characteristic and the current text characteristic based on the first fusion data and the second fusion data. For example, the average value of the first fused data and the second fused data is taken as the target similarity.
In the above embodiment, the first similarity and the second similarity obtained by performing data processing on the initial similarity integrate the sub-similarities between the most matched video frames and text words, the first similarity is subjected to weighted fusion based on the text weight determined by the text features, the second similarity is subjected to weighted fusion based on the video weight determined by the video features, and the target similarity obtained based on the weighted fusion result is the weighted similarity, so that the method has higher accuracy and is beneficial to improving the model training quality.
In a specific embodiment, the target similarity WTI between the video features and the training text features may be calculated by the following formula.
WTI(V_i, C_i) = (c2v_logits + v2c_logits) / 2

c2v_logits = Σ_{p=1..L_C} f_cw,θ(C_i)_p · max_{q=1..L_V} Ā_{p,q}

v2c_logits = Σ_{q=1..L_V} f_vw,θ(V_i)_q · max_{p=1..L_C} Ā_{p,q}

Here, WTI(V_i, C_i) represents the target similarity between the video feature V_i corresponding to the training video in the i-th positive sample pair and the training text feature C_i corresponding to the training text. c2v_logits represents the first similarity corresponding to the i-th positive sample pair and can be regarded as the similarity on the text-to-video (c2v) task; v2c_logits represents the second similarity corresponding to the i-th positive sample pair and can be regarded as the similarity on the video-to-text (v2c) task. The c2v task (which may also be referred to as the t2v task) consists in determining, for a plurality of texts, the video described by each text; the v2c task (which may also be referred to as the v2t task) consists in determining, for a plurality of videos, the video description text corresponding to each video.

f_cw,θ and f_vw,θ are classic MLP (Multi-Layer Perceptron) structures with SoftMax; f_cw,θ is used to weight the text features and f_vw,θ is used to weight the video features. L_C and L_V respectively denote the number of tokens (i.e., text words) of the training text and the number of frames (video frames) of the training video, and p and q respectively denote the token index of the training text and the frame index of the training video.

Ā is the normalized initial similarity matrix obtained by computing the similarity matrix between the training text feature C_i and the video feature V_i (matrix multiplication followed by transposition and normalization); the horizontal axis of the initial similarity matrix corresponds to the tokens of the training text and the vertical axis to the frames of the training video. In the formulas above, max over q takes, for each token, the maximum sub-similarity over all video frames, and max over p takes, for each frame, the maximum sub-similarity over all tokens; these correspond to the row-wise and column-wise maxima of the initial similarity matrix.
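A sketch of the WTI computation above for a single video-text pair is given below. The two-layer MLP weighting heads and the cosine-style normalization of the initial similarity matrix are assumptions about details the text leaves open; the class name WTISimilarity is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WTISimilarity(nn.Module):
    """Weighted token-wise interaction: text_feat C_i is (L_C, d), video_feat V_i is (L_V, d)."""

    def __init__(self, d: int):
        super().__init__()
        self.f_cw = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 1))  # text weight head f_cw,θ
        self.f_vw = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 1))  # video weight head f_vw,θ

    def forward(self, text_feat: torch.Tensor, video_feat: torch.Tensor) -> torch.Tensor:
        text_w = torch.softmax(self.f_cw(text_feat).squeeze(-1), dim=0)    # (L_C,) text weights
        video_w = torch.softmax(self.f_vw(video_feat).squeeze(-1), dim=0)  # (L_V,) video weights
        sim = F.normalize(text_feat, dim=-1) @ F.normalize(video_feat, dim=-1).t()  # (L_C, L_V) initial similarity
        c2v = (text_w * sim.max(dim=1).values).sum()   # per token: best-matching frame, weighted by text weight
        v2c = (video_w * sim.max(dim=0).values).sum()  # per frame: best-matching token, weighted by video weight
        return (c2v + v2c) / 2                         # target similarity WTI(V_i, C_i)
```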
In one embodiment, the similarity set includes at least two target similarities. Calculating training loss based on the similarity set corresponding to the positive sample pair and the similarity set corresponding to the matched negative sample pair, including:
determining a target category from the similarity categories; obtaining, based on the target similarities on the target category of the positive sample pair and of each negative sample pair containing the same training text as the positive sample pair, first similarity weights respectively corresponding to these training sample pairs on the target category, and obtaining, based on the target similarities on the target category of the positive sample pair and of each negative sample pair containing the same training video as the positive sample pair, second similarity weights respectively corresponding to these training sample pairs on the target category; fusing the target similarity and the first similarity weight of the same training sample pair on the target category to obtain the first updated similarity corresponding to each training sample pair, and fusing the target similarity and the second similarity weight of the same training sample pair on the target category to obtain the second updated similarity corresponding to each training sample pair; obtaining a first loss based on the first updated similarities corresponding to the positive sample pair and to each negative sample pair containing the same training video as the positive sample pair, and obtaining a second loss based on the second updated similarities corresponding to the positive sample pair and to each negative sample pair containing the same training text as the positive sample pair; obtaining the training sub-loss corresponding to the target category based on the first loss and the second loss; taking the next similarity category as the target category and returning to the step of obtaining the first similarity weights respectively corresponding to the training sample pairs on the target category, until the training sub-losses respectively corresponding to the various similarity categories are determined; and obtaining the training loss based on the various training sub-losses.
Specifically, the similarity set includes at least two kinds of target similarities, the training loss includes training sub-losses corresponding to the various similarity categories, and the calculation modes of the various training sub-losses are the same.
The computer equipment can randomly select one similarity class from the similarity classes as a target class, and calculate the training sub-loss corresponding to the target class based on the target similarity of the positive sample pair and each negative sample pair matched with the positive sample pair on the target class. The computer device may obtain a next similarity category from the similarity categories as a new target category, and calculate a training sub-loss corresponding to the new target category based on the target similarities of the positive sample pairs and the negative sample pairs matched with the positive sample pairs on the new target category. By analogy, the computer device can finally calculate and obtain the training sub losses respectively corresponding to various similarity classes, and the training losses are obtained based on the various training sub losses.
It is understood that the computer device may also calculate the training sub-losses corresponding to the various similarity categories in parallel.
For any one similarity class, the computer device calculates and obtains first similarity weights respectively corresponding to the training sample pairs on the target class based on the target similarity of the positive sample pair and each negative sample pair containing the same training text with the positive sample pair on the target class. For example, the positive sample pairs and the negative sample pairs containing the same training text as the positive sample pairs form training sample pair subsets, and first similarity weights respectively corresponding to the training sample pairs in the training sample pair subsets are determined based on differences of target similarities of the positive sample pairs and the training sample pairs in the training sample pair subsets on the target category. The first similarity weight is generated based on the target similarity of each training sample pair containing the same training text on the target category, and the first similarity weight can be considered to be fused with relevant information for the c2v task.
Similarly, the computer device calculates, based on the target similarity of the positive sample pair and each negative sample pair containing the same training video as the positive sample pair in the target category, second similarity weights respectively corresponding to the training sample pairs in the target category. The second similarity weight is generated based on the target similarity of each training sample pair containing the same training video in the target category, and the second similarity weight can be considered to be fused with the related information for the v2c task.
Then, the computer device fuses the target similarity and the first similarity weight of the same training sample pair on the target category to obtain the first update similarity corresponding to each training sample pair; for example, the target similarity and the first similarity weight are multiplied to obtain the first update similarity. The first loss is then obtained based on the first update similarities corresponding to the positive sample pair and to each negative sample pair containing the same training video as the positive sample pair; specifically, a first sub-loss corresponding to each positive sample pair is calculated from these first update similarities, and the first loss is obtained based on the first sub-losses. The first similarity weight is fused with the relevant information for the c2v task, so the first update similarity calculated based on the target similarity and the first similarity weight can also be considered to be fused with the relevant information for the c2v task. Data calculated from the data of the positive sample pair and of each negative sample pair containing the same training video as the positive sample pair can be considered to be fused with the relevant information for the v2c task. The first loss is calculated based on the first update similarities corresponding to the positive sample pair and to each negative sample pair containing the same training video as the positive sample pair, and can therefore be considered to comprehensively take into account the relevant information for both the c2v task and the v2c task; adjusting the model based on such a loss enables the model to perform well on both the c2v task and the v2c task.
Similarly, the computer device fuses the target similarity and the second similarity weight of the same training sample pair on the target category to obtain the second update similarities respectively corresponding to the training sample pairs, and obtains the second loss based on the second update similarities corresponding to the positive sample pair and to each negative sample pair containing the same training text as the positive sample pair. The second similarity weight is fused with the relevant information for the v2c task, so the second update similarity calculated based on the target similarity and the second similarity weight can also be considered to be fused with the relevant information for the v2c task. Data calculated from the data of the positive sample pair and of each negative sample pair containing the same training text as the positive sample pair can be considered to be fused with the relevant information for the c2v task. The second loss is calculated based on the second update similarities corresponding to the positive sample pair and to each negative sample pair containing the same training text as the positive sample pair, and can therefore be considered to comprehensively take into account the relevant information for both the c2v task and the v2c task; adjusting the model based on such a loss enables the model to perform well on both the c2v task and the v2c task.
Finally, the computer device obtains a training sub-loss corresponding to the target class based on the first loss and the second loss. For example, the average of the first loss and the second loss is taken as a training sub-loss; taking the weighted average of the first loss and the second loss as a training sub-loss; and so on.
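As an illustrative sketch of this procedure for one similarity category, assuming that the similarity weights are obtained with a softmax over the competing training sample pairs (consistent with the softmax description given for fig. 5 later in this text), that fusion is an element-wise multiplication, and that the first and second losses take a cross-entropy form; the function name, hyperparameters and exact formulation are assumptions:

```python
import torch
import torch.nn.functional as F

def category_sub_loss(sim: torch.Tensor, temp: float = 0.01) -> torch.Tensor:
    """Training sub-loss for one similarity category.

    sim: B x B target-similarity matrix for one similarity category;
         rows share the same training video, columns share the same
         training text, and the diagonal holds the positive sample pairs.
    """
    # First similarity weights: normalise over pairs sharing the same
    # training text (column direction) -> c2v information (assumed softmax).
    w1 = F.softmax(sim / temp, dim=0)
    # Second similarity weights: normalise over pairs sharing the same
    # training video (row direction) -> v2c information (assumed softmax).
    w2 = F.softmax(sim / temp, dim=1)

    # Update similarities: fuse (here: multiply) target similarity and weight.
    upd1 = sim * w1
    upd2 = sim * w2

    # First loss: positive pair against negatives sharing the same video (rows).
    first_loss = -F.log_softmax(upd1 / temp, dim=1).diag().mean()
    # Second loss: positive pair against negatives sharing the same text (columns).
    second_loss = -F.log_softmax(upd2 / temp, dim=0).diag().mean()

    # Training sub-loss for this category, e.g. the average of the two losses.
    return 0.5 * (first_loss + second_loss)
```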
In the above embodiment, the training sub-losses corresponding to the similarity categories are calculated respectively, and an accurate training loss can then be obtained based on these training sub-losses. The first similarity weight is calculated based on the target similarities, on the target category, of the positive sample pair and of each negative sample pair containing the same training text as the positive sample pair, the target similarity and the first similarity weight are fused to obtain the first update similarity, and the first loss calculated based on the first update similarities helps ensure that, during model training, the model performs well on both the c2v task and the v2c task. The second similarity weight is calculated based on the target similarities, on the target category, of the positive sample pair and of each negative sample pair containing the same training video as the positive sample pair, the target similarity and the second similarity weight are fused to obtain the second update similarity, and the second loss calculated based on the second update similarities likewise helps ensure that, during model training, the model performs well on both the c2v task and the v2c task.
In one embodiment, obtaining first similarity weights respectively corresponding to the training sample pairs on the target category based on target similarities of the positive sample pairs and negative sample pairs containing the same training text as the positive sample pairs on the target category, and obtaining second similarity weights respectively corresponding to the training sample pairs on the target category based on target similarities of the positive sample pairs and negative sample pairs containing the same training video as the positive sample pairs on the target category, includes:
obtaining a first similarity matrix based on the target similarities of the training sample pairs on the target category; the first dimension of the first similarity matrix represents the target similarities, on the target category, of the training sample pairs containing the same training video, the second dimension of the first similarity matrix represents the target similarities, on the target category, of the training sample pairs containing the same training text, and the diagonal of the first similarity matrix represents the target similarities of the positive sample pairs on the target category; generating a second matrix element corresponding to the current matrix element based on the current matrix element in the first similarity matrix and a forward matrix element of the current matrix element in the second dimension, and generating a second similarity matrix based on the second matrix elements corresponding to the matrix elements in the first similarity matrix; generating a third matrix element corresponding to the current matrix element based on the current matrix element in the first similarity matrix and a forward matrix element of the current matrix element in the first dimension, and generating a third similarity matrix based on the third matrix elements corresponding to the matrix elements in the first similarity matrix; adjusting each matrix element in the second similarity matrix based on the target similarity of the positive sample pair on the target category to obtain a fourth similarity matrix, and adjusting each matrix element in the third similarity matrix based on the target similarity of the positive sample pair on the target category to obtain a fifth similarity matrix; the fourth similarity matrix represents the first similarity weights respectively corresponding to the training sample pairs on the target category, and the fifth similarity matrix represents the second similarity weights respectively corresponding to the training sample pairs on the target category.
The forward matrix element of the current matrix element in the second dimension refers to a matrix element arranged before the current matrix element in the second dimension in the first similarity matrix. The forward matrix element of the current matrix element in the first dimension refers to a matrix element arranged before the current matrix element in the first dimension in the first similarity matrix.
Specifically, when the first similarity weights and the second similarity weights are calculated, a first similarity matrix is generated based on the target similarities of the training sample pairs on the target category, and the matrix elements are processed in an ordered manner in matrix form, so that the first similarity weights and the second similarity weights can be obtained quickly.
In the first similarity matrix generated based on the target similarities of the training sample pairs on the target category, the first dimension characterizes the target similarities, on the target category, of the training sample pairs containing the same training video, the second dimension characterizes the target similarities, on the target category, of the training sample pairs containing the same training text, and the diagonal characterizes the target similarities of the positive sample pairs on the target category.
For example, assume that the set of training sample pairs includes three positive sample pairs and six negative sample pairs. The three positive sample pairs are a positive sample pair a containing a training video 1 and a training text 1, a positive sample pair B containing a training video 2 and a training text 2, and a positive sample pair C containing a training video 3 and a training text 3, respectively. The six negative sample pairs are obtained by data recombination of the three positive sample pairs.
The first similarity matrix is

$$S=\begin{pmatrix} S_{11} & S_{12} & S_{13} \\ S_{21} & S_{22} & S_{23} \\ S_{31} & S_{32} & S_{33} \end{pmatrix}$$

wherein $S_{11}$, $S_{22}$ and $S_{33}$ respectively represent the target similarities of the three positive sample pairs on the target category; $S_{12}$ and $S_{13}$ represent the target similarities, on the target category, of the negative sample pairs containing the same training video 1 as the positive sample pair A; $S_{21}$ and $S_{23}$ represent the target similarities, on the target category, of the negative sample pairs containing the same training video 2 as the positive sample pair B; and $S_{31}$ and $S_{32}$ represent the target similarities, on the target category, of the negative sample pairs containing the same training video 3 as the positive sample pair C.
When calculating the first similarity weights, a second matrix element corresponding to the current matrix element is generated based on the current matrix element in the first similarity matrix and the forward matrix elements of the current matrix element in the second dimension. For example, exponential processing with the natural constant e as the base is performed on the current matrix element and on its forward matrix elements in the second dimension, and the results are added to obtain the second matrix element corresponding to the current matrix element; alternatively, the current matrix element and its forward matrix elements in the second dimension are first scaled proportionally, exponential processing with the natural constant e as the base is performed on the scaled data, and the exponentially processed data are added to obtain the second matrix element corresponding to the current matrix element; and so on. After the second matrix elements corresponding to the matrix elements are determined, the second matrix elements form the second similarity matrix. Based on the target similarity of the positive sample pair on the target category, each matrix element in the second similarity matrix is adjusted to obtain the fourth similarity matrix; for example, the target similarity of the positive sample pair on the target category is divided by each matrix element to obtain the fourth similarity matrix. The fourth similarity matrix represents the first similarity weights respectively corresponding to the training sample pairs on the target category.
For example, taking the first similarity matrix above, where the first dimension of the first similarity matrix is the row and the second dimension is the column: the second similarity matrix is obtained by replacing each matrix element of the first similarity matrix with the sum of the exponential terms exp() of that matrix element and of its forward matrix elements in the second dimension, and the fourth similarity matrix is then obtained by adjusting each matrix element of the second similarity matrix based on the target similarity of the corresponding positive sample pair on the target category, where exp() represents an exponential function with the natural constant e as the base.
In the fourth similarity matrix, the first row of data contains the first similarity weights, on the target category, respectively corresponding to the positive sample pair A, the negative sample pair containing the training video 1 and the training text 2, and the negative sample pair containing the training video 1 and the training text 3; the second row of data contains the first similarity weights, on the target category, respectively corresponding to the negative sample pair containing the training video 2 and the training text 1, the positive sample pair B, and the negative sample pair containing the training video 2 and the training text 3; and the third row of data contains the first similarity weights, on the target category, respectively corresponding to the negative sample pair containing the training video 3 and the training text 1, the negative sample pair containing the training video 3 and the training text 2, and the positive sample pair C.
Similar to the calculation of the first similarity weight, when the second similarity weight is calculated, a third matrix element corresponding to the current matrix element is generated based on the current matrix element in the first similarity matrix and the forward matrix element of the current matrix element in the first dimension, and a third similarity matrix is generated based on the third matrix elements corresponding to the matrix elements in the first similarity matrix. And adjusting each matrix element in the third similarity matrix based on the target similarity of the positive sample pair on the target category to obtain a fifth similarity matrix, wherein the fifth similarity matrix represents second similarity weights respectively corresponding to each training sample pair on the target category.
For example, taking the same first similarity matrix, where the first dimension is the row and the second dimension is the column: the third similarity matrix is obtained by replacing each matrix element of the first similarity matrix with the sum of the exponential terms exp() of that matrix element and of its forward matrix elements in the first dimension, and the fifth similarity matrix is then obtained by adjusting each matrix element of the third similarity matrix based on the target similarity of the corresponding positive sample pair on the target category.
In the fifth similarity matrix, the first row of data contains the second similarity weights, on the target category, respectively corresponding to the positive sample pair A, the negative sample pair containing the training video 1 and the training text 2, and the negative sample pair containing the training video 1 and the training text 3; the second row of data contains the second similarity weights, on the target category, respectively corresponding to the negative sample pair containing the training video 2 and the training text 1, the positive sample pair B, and the negative sample pair containing the training video 2 and the training text 3; and the third row of data contains the second similarity weights, on the target category, respectively corresponding to the negative sample pair containing the training video 3 and the training text 1, the negative sample pair containing the training video 3 and the training text 2, and the positive sample pair C.
In the above embodiment, the first similarity matrix is obtained based on the target similarities of the training sample pairs on the target category, the second similarity matrix is generated based on the matrix elements along the second dimension of the first similarity matrix, and each matrix element in the second similarity matrix is adjusted based on the target similarity of the positive sample pair on the target category to obtain the fourth similarity matrix; the fourth similarity matrix is fused with the relevant information for the c2v task, which helps improve the processing capability of the model for the c2v task during model training. The third similarity matrix is generated based on the matrix elements along the first dimension of the first similarity matrix, and each matrix element in the third similarity matrix is adjusted based on the target similarity of the positive sample pair on the target category to obtain the fifth similarity matrix; the fifth similarity matrix is fused with the relevant information for the v2c task, which helps improve the processing capability of the model for the v2c task during model training.
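For the 3×3 example above, the fourth and fifth similarity matrices can be illustrated as follows, assuming for simplicity that the exponential terms are normalised over a full column, respectively a full row (a softmax, as described for fig. 5 later in this text); the concrete numbers are hypothetical:

```python
import torch

# Hypothetical 3x3 first similarity matrix: rows share a training video,
# columns share a training text, the diagonal holds the positive pairs.
S = torch.tensor([[0.9, 0.2, 0.1],
                  [0.3, 0.8, 0.2],
                  [0.1, 0.4, 0.7]])

# Fourth similarity matrix: first similarity weights, normalised over the
# pairs that share the same training text (column direction, c2v information).
fourth = torch.softmax(S, dim=0)

# Fifth similarity matrix: second similarity weights, normalised over the
# pairs that share the same training video (row direction, v2c information).
fifth = torch.softmax(S, dim=1)

# Row 1 of `fourth` then holds the first similarity weights of positive pair A
# and of the negative pairs containing training video 1, as described above.
print(fourth)
print(fifth)
```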
In one embodiment, obtaining the first loss based on first update similarities corresponding to the positive sample pair and negative sample pairs containing the same training video as the positive sample pair, and obtaining the second loss based on second update similarities corresponding to the positive sample pair and negative sample pairs containing the same training text as the positive sample pair, respectively comprises:
fusing the positive sample pairs and first updating similarities corresponding to negative sample pairs containing the same training video with the positive sample pairs to obtain first similarity statistic values corresponding to the positive sample pairs, obtaining first sub-losses corresponding to the positive sample pairs based on the first updating similarities and the first similarity statistic values corresponding to the positive sample pairs, and obtaining first losses based on the first sub-losses; and fusing the positive sample pairs and second updating similarity corresponding to each negative sample pair containing the same training text with the positive sample pairs to obtain second similarity statistic values corresponding to each positive sample pair, obtaining second sub-losses corresponding to each positive sample pair based on the second updating similarity and the second similarity statistic values corresponding to the same positive sample pair, and obtaining second losses based on the second sub-losses.
Specifically, when the first loss is calculated, the first update similarities corresponding to the positive sample pair and the negative sample pairs containing the same training video are fused to obtain the first similarity statistic corresponding to the positive sample pair, for example, the sum of the first update similarities corresponding to the positive sample pair and the negative sample pairs containing the same training video is used as the first similarity statistic. Based on the first update similarity and the first similarity statistic corresponding to the same positive sample pair, the first sub-losses corresponding to the respective positive sample pairs are obtained, for example, the ratio of the first update similarity and the first similarity statistic corresponding to the positive sample pair is used as the first sub-loss. Finally, the first loss is obtained based on the respective first sub-losses, and for example, an average value of the respective first sub-losses is calculated as the first loss.
Similar to the calculation of the first loss, when the second loss is calculated, the second updating similarity corresponding to each negative sample pair containing the same training text is fused with the positive sample pair to obtain the second similarity statistic corresponding to each positive sample pair, the second sub-loss corresponding to each positive sample pair is obtained based on the second updating similarity and the second similarity statistic corresponding to the same positive sample pair, and the second loss is obtained based on each second sub-loss.
In the above embodiment, the first update similarities corresponding to the positive sample pair and to each negative sample pair containing the same training video as the positive sample pair are fused to obtain the first similarity statistic, the first sub-loss corresponding to each positive sample pair is obtained based on the first update similarity and the first similarity statistic corresponding to the same positive sample pair, and the first loss is obtained based on the first sub-losses, which embodies a normalization idea; the training sub-loss generated based on such a first loss helps improve the model training quality. Likewise, the second update similarities corresponding to the positive sample pair and to each negative sample pair containing the same training text as the positive sample pair are fused, the second sub-loss corresponding to each positive sample pair is obtained based on the second update similarity and the second similarity statistic corresponding to the same positive sample pair, and the second loss is obtained based on the second sub-losses, which also embodies the normalization idea; the training sub-loss generated based on such a second loss helps improve the model training quality.
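As an illustrative sketch of this computation, assuming the first and second update similarity matrices have already been obtained by multiplying the target similarity matrix with the first and second similarity weights, and that the sum is used as the similarity statistic, the ratio as the sub-loss and the mean as the loss:

```python
import torch

def first_and_second_losses(upd1: torch.Tensor, upd2: torch.Tensor):
    """upd1, upd2: B x B matrices of first / second update similarities.
    Rows share the same training video, columns share the same training text,
    and the diagonal corresponds to the positive sample pairs."""
    pos1 = upd1.diag()              # first update similarity of each positive pair
    stat1 = upd1.sum(dim=1)         # first similarity statistic: sum over pairs sharing the video
    first_sub = pos1 / stat1        # ratio as the first sub-loss of each positive pair
    first_loss = first_sub.mean()   # first loss as the average of the first sub-losses

    pos2 = upd2.diag()              # second update similarity of each positive pair
    stat2 = upd2.sum(dim=0)         # second similarity statistic: sum over pairs sharing the text
    second_sub = pos2 / stat2       # ratio as the second sub-loss of each positive pair
    second_loss = second_sub.mean() # second loss as the average of the second sub-losses

    # Note: for gradient-descent training one would typically minimise
    # -log(first_sub) / -log(second_sub) rather than the raw ratios.
    return first_loss, second_loss
```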
In a specific embodiment, for the target similarity calculated based on the video features and the training text features, the training sub-loss corresponding to the similarity category to which the target similarity belongs may be calculated by the following formula (7).
Loss_v = L_v2c + L_c2v    formula (7)

wherein WTI(V_i, C_j) represents the target similarity corresponding to the training sample pair consisting of training video i and training text j, the target similarity being calculated based on the video features V and the training text features C; Pr_v2c^{ij} represents the first similarity weight corresponding to the training sample pair consisting of training video i and training text j, and Pr_c2v^{ij} represents the second similarity weight corresponding to that training sample pair. Pr_v2c denotes the fourth similarity matrix, which is fused with the relevant information of the c2v task, and Pr_c2v denotes the fifth similarity matrix, which is fused with the relevant information of the v2c task; Pr_v2c and Pr_c2v may also be referred to as priority matrices. L_v2c denotes the first loss and L_c2v denotes the second loss, each obtained from the corresponding update similarities as described above. Loss_v denotes the training sub-loss corresponding to this similarity category and can also represent the loss information corresponding to the video feature and text feature matching network. temp represents a temperature hyperparameter used for smoothing the gradient and can be set according to actual needs; L represents a scaling parameter and can be set according to actual needs; B denotes the number of positive sample pairs.
For the motion-enhanced video features and the training text features, WTI similarity matching can be carried out on the motion-enhanced video features and the training text features to obtain a similarity matrix Sim_c_m = WTI(C, V_M). Similar to formula (7), a loss function can be derived: Loss_m = L_m2c + L_c2m. Loss_m represents the training sub-loss corresponding to the similarity category of the target similarity calculated based on the motion-enhanced video features and the training text features, and can also represent the loss information corresponding to the motion-enhanced video feature and text feature matching network.
For the audio-enhanced video features and the training text features, WTI similarity matching can be carried out on the audio-enhanced video features and the training text features to obtain a similarity matrix Sim_c_a = WTI(C, V_A). Similar to formula (7), a loss function can be derived: Loss_a = L_a2c + L_c2a. Loss_a represents the training sub-loss corresponding to the similarity category of the target similarity calculated based on the audio-enhanced video features and the training text features, and can also represent the loss information corresponding to the audio-enhanced video feature and text feature matching network.
In a specific embodiment, the similarity calculation process and the loss calculation process are described with reference to fig. 5, taking the video features and the text features as an example. The video features corresponding to the training video may be obtained by inputting the training video into a video encoder, and the training text features corresponding to the training text may be obtained by inputting the training text into a text encoder. When calculating the target similarity between the video features and the training text features, the initial similarity between the video features and the training text features is calculated first, and a max operation is then performed on the initial similarity: the maximum value of the initial similarity is taken by rows to obtain a 3*1 matrix, and the maximum value of the initial similarity is taken by columns to obtain a 1*3 matrix. The text weight corresponding to the training text and the video weight corresponding to the training video are also calculated. Finally, the 3*1 matrix and the text weight are weighted and fused (refer to c2v_locations), the 1*3 matrix and the video weight are weighted and fused (refer to v2c_locations), and the target similarity between the video features and the training text features is obtained based on the weighted fusion results.
Since there are a plurality of training sample pairs, the corresponding target similarities can be calculated for all the training sample pairs based on the video features and training text features of each training sample pair; the target similarities corresponding to the training sample pairs form a target similarity matrix, and the diagonal of the target similarity matrix consists of the target similarities corresponding to the positive sample pairs. Based on the target similarities corresponding to the positive sample pairs, softmax calculation is performed on the target similarity matrix by columns to obtain a similarity weight matrix (refer to Pr_v2c), and softmax calculation is performed on the target similarity matrix by rows to obtain a similarity weight matrix (refer to Pr_c2v). The target similarity matrix is then multiplied, element by element, with the two similarity weight matrices respectively to obtain two updated similarity matrices. Based on the target similarities corresponding to the positive sample pairs, softmax calculation is performed on the updated similarity matrices to obtain the losses (refer to L_v2c and L_c2v, i.e., the first loss and the second loss). Finally, the training sub-loss (refer to Loss_v) is obtained based on the first loss and the second loss.
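A minimal sketch of the fig. 5 similarity computation for a single video-text pair, assuming normalized token-level features and softmax-derived text and video weights; the gating layers, the normalization and the final averaging of the two directions are assumptions, since the exact weighting is only given in the drawings:

```python
import torch
import torch.nn.functional as F

def wti_similarity(text_tokens: torch.Tensor,     # Nt x D (word features of one text)
                   video_frames: torch.Tensor,    # Nv x D (frame features of one video)
                   text_gate: torch.nn.Linear,    # D -> 1, assumed gate producing text weights
                   video_gate: torch.nn.Linear):  # D -> 1, assumed gate producing video weights
    """Weighted token-wise interaction between one text and one video."""
    # Initial similarity: sub-similarity of every text word with every video frame.
    sim = F.normalize(text_tokens, dim=-1) @ F.normalize(video_frames, dim=-1).T  # Nt x Nv

    # Max operation: best-matching frame per word, best-matching word per frame
    # (analogous to the 3*1 and 1*3 matrices in fig. 5 when Nt = Nv = 3).
    word_to_frame = sim.max(dim=1).values   # Nt
    frame_to_word = sim.max(dim=0).values   # Nv

    # Text / video weights derived from the respective features.
    text_w = torch.softmax(text_gate(text_tokens).squeeze(-1), dim=0)     # Nt
    video_w = torch.softmax(video_gate(video_frames).squeeze(-1), dim=0)  # Nv

    # Weighted fusion of the two directions, then the target similarity.
    c2v = (word_to_frame * text_w).sum()
    v2c = (frame_to_word * video_w).sum()
    return 0.5 * (c2v + v2c)   # assumed fusion of the two weighted results
```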
In one embodiment, the initial video-text matching model includes a video feature-to-text feature matching network, a reference enhanced video feature-to-text feature matching network, the reference enhanced video feature-to-text feature matching network includes at least one of an action enhanced video feature-to-text feature matching network, an audio enhanced video feature-to-text feature matching network, and the training loss includes training sub-losses corresponding to the various networks, respectively.
Adjusting model parameters of the initial video text matching model based on training loss until a convergence condition is met to obtain a target video text matching model, wherein the method comprises the following steps:
model parameters of the corresponding networks in the initial video text matching model are respectively adjusted based on the training sub-losses until the convergence conditions respectively corresponding to the various networks are met, so as to obtain the target video text matching model.

Specifically, the training mode of the model may adopt an integrated training mode. The integrated training mode refers to training the various networks in the initial video text matching model respectively, with each network performing model training based on its own loss information. After all the networks are trained, the trained networks form the target video text matching model.
And training losses calculated based on the similarity set corresponding to the positive sample pair and the similarity set corresponding to the matched negative sample pair comprise training sub-losses respectively corresponding to various networks, and the computer equipment can respectively adjust model parameters of the corresponding networks in the initial video text matching model by using the training sub-losses until convergence conditions respectively corresponding to the various networks are met, so that the target video text matching model is obtained. Specifically, model parameters of the video feature and text feature matching network can be adjusted based on training sub-losses corresponding to the video feature and text feature matching network until convergence conditions corresponding to the video feature and text feature matching network are met, and the trained video feature and text feature matching network is obtained. And if the action enhanced video feature and text feature matching network exists, adjusting model parameters of the action enhanced video feature and text feature matching network based on training sub-losses corresponding to the action enhanced video feature and text feature matching network until convergence conditions corresponding to the action enhanced video feature and text feature matching network are met, and obtaining the trained action enhanced video feature and text feature matching network. And if the audio enhanced video feature and text feature matching network exists, adjusting model parameters of the audio enhanced video feature and text feature matching network based on training sub-losses corresponding to the audio enhanced video feature and text feature matching network until convergence conditions corresponding to the audio enhanced video feature and text feature matching network are met, and obtaining the trained audio enhanced video feature and text feature matching network.
When the model is applied, the prediction results of various networks in the target video text matching model are integrated to obtain the final prediction result. For example, the maximum value is obtained from the similarity output by the three networks of the target video text matching model and is used as a target value, and if the target value is greater than a preset threshold value, the final matching result is determined to be successful matching; obtaining the similarity of three network outputs of a target video text matching model, calculating an average value as a target value, and if the target value is greater than a preset threshold value, determining that the final matching result is successful in matching; and so on.
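As an illustrative sketch of the post-fusion decision described above, assuming each branch outputs a scalar similarity and using a hypothetical threshold:

```python
def fuse_and_match(sim_v: float, sim_m: float, sim_a: float,
                   threshold: float = 0.5, mode: str = "max") -> bool:
    """Integrate the similarities output by the three networks of the target
    video text matching model and decide whether video and text match."""
    if mode == "max":
        target = max(sim_v, sim_m, sim_a)      # take the maximum as the target value
    else:
        target = (sim_v + sim_m + sim_a) / 3   # or take the average as the target value
    return target > threshold                  # matching succeeds if above the threshold
```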
In the above embodiment, the model parameters of the corresponding networks in the initial video text matching model are respectively adjusted based on the training sub-losses, so that each network in the video text matching model can reach its optimal state, which ensures that the finally obtained target video text matching model has higher prediction accuracy.
In one embodiment, the training loss includes a first training sub-loss corresponding to each similarity class. Adjusting model parameters of the initial video text matching model based on training loss until a convergence condition is met to obtain a target video text matching model, wherein the method comprises the following steps:
Acquiring a minimum value from each first training sub-loss as a first target sub-loss; obtaining loss contribution degrees corresponding to the residual training sub-losses respectively based on the difference between the first target sub-losses and the residual training sub-losses respectively; obtaining loss weights respectively corresponding to the residual training sub-losses based on the loss contribution degrees respectively corresponding to the residual training sub-losses; fusing the losses of the training sub-steps based on the loss weight corresponding to the losses of the training sub-steps to obtain a first target loss; the loss weight corresponding to the first target sub-loss is a preset weight; and adjusting model parameters of the initial video text matching model based on the first target loss until a convergence condition is met to obtain a target video text matching model.
The training loss comprises first training sub-losses corresponding to the similarity classes respectively. If the initial video text matching model comprises multiple networks, the first training sub-loss corresponding to a certain similarity class is the training sub-loss corresponding to the corresponding network.
Specifically, the training mode of the model may adopt an end-to-end training mode (which may also be referred to as an E2E training mode), the final loss function of the end-to-end training mode may adopt a multi-modal balance loss function, various first training sub-losses are fused according to the contribution degree to obtain a final target loss, and the model parameters of the entire model are adjusted based on the final target loss until the convergence condition is satisfied, so as to obtain the target video text matching model.
Firstly, the computer device obtains the minimum value from the first training sub-losses as the first target sub-loss, and takes the other first training sub-losses as the remaining training sub-losses. Since the value of the first target sub-loss is the smallest, the first target sub-loss can be considered the most contributing sub-loss. Then, the computer device calculates the loss contribution degrees respectively corresponding to the remaining training sub-losses based on the differences between the first target sub-loss and the respective remaining training sub-losses. The loss contribution degree decreases as the value of the remaining training sub-loss increases: the larger a remaining training sub-loss is, the more inaccurate the corresponding network is, and the smaller the corresponding loss contribution degree is. Further, the computer device calculates the loss weights respectively corresponding to the remaining training sub-losses based on their loss contribution degrees. The loss weight increases with the loss contribution degree: the larger the loss contribution degree of a training sub-loss is, the larger the corresponding loss weight is. Finally, the computer device fuses the training sub-losses based on the loss weights respectively corresponding to the training sub-losses to obtain the first target loss. The loss weight corresponding to the first target sub-loss is a preset weight. In one embodiment, the preset weight is greater than the loss weights corresponding to the remaining training sub-losses.
The first target loss may be considered a synthetic loss of the model. And the computer equipment performs back propagation on the basis of the first target loss to adjust the model parameters of the initial video text matching model, and performs integral end-to-end training on the model until a convergence condition is met to obtain the target video text matching model.
In one embodiment, three modalities are defined, denoted i, j, k ∈ {v, a, m}, and a contribution ratio between the modalities is defined based on their training sub-losses. Here, i is the modality contributing the most, specifically the modality with the smallest loss. v represents the modality corresponding to the video features and can also represent the similarity category corresponding to the video features; a represents the modality corresponding to the audio-enhanced video features and can also represent the similarity category corresponding to the audio-enhanced video features; m represents the modality corresponding to the motion-enhanced video features and can also represent the similarity category corresponding to the motion-enhanced video features.
First, the modality with the smallest loss is defined as i, i.e., the modality satisfying Loss_i ≤ Loss_j and Loss_i ≤ Loss_k; the other two modalities are j and k, respectively. Then, referring to formula (8) below, the contribution ratio can be used to dynamically adjust and balance the contribution of each modality, so as to obtain the balance coefficient (i.e., loss weight) of each modality. Referring to formula (9) below, the final first target loss is calculated using the balance coefficients.
The balance coefficients theta_j and theta_k of the less contributing modalities are obtained from the contribution ratios through the hyperbolic tangent function tanh() with the adjustment parameter α (formula (8)), and the first target loss is then

Loss1 = theta_i · Loss_i + theta_j · Loss_j + theta_k · Loss_k    formula (9)
wherein theta_i denotes the balance coefficient corresponding to i, theta_j denotes the balance coefficient corresponding to j, and theta_k denotes the balance coefficient corresponding to k; tanh() represents the hyperbolic tangent function; α represents an adjustment parameter, which can be set according to actual needs, for example to 1.0; Loss_i represents the training sub-loss corresponding to i, Loss_j represents the training sub-loss corresponding to j, Loss_k represents the training sub-loss corresponding to k, and Loss1 represents the first target loss.
In the embodiment, the loss contribution degree of each first training sub-loss is used for performing weighted fusion on each first training sub-loss to obtain the first target loss, and the model is trained end to end based on the first target loss, so that the training quality of the model can be effectively improved, and the finally obtained target video text matching model has higher prediction accuracy.
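The exact form of formula (8) is only given in the original drawings; the following sketch therefore assumes a tanh-based balance coefficient that decreases as a modality's loss grows, which matches the behaviour described above, and combines the training sub-losses as in formula (9). The coefficient formula, the contribution ratio and the names are assumptions.

```python
import math

def balanced_target_loss(losses: dict, alpha: float = 1.0, preset_weight: float = 1.0) -> float:
    """First target loss combining the per-modality training sub-losses.

    losses: e.g. {"v": 0.9, "a": 1.4, "m": 1.1}, the training sub-losses
    of the three modalities as floats. The modality with the smallest loss
    keeps the preset weight; the others are down-weighted.
    """
    i = min(losses, key=losses.get)    # modality i with the smallest loss
    total = preset_weight * losses[i]  # preset weight for the most contributing modality
    for j, lj in losses.items():
        if j == i:
            continue
        ratio = lj / losses[i]                           # assumed contribution ratio
        theta = 1.0 - math.tanh(alpha * (ratio - 1.0))   # assumed balance coefficient: smaller for larger losses
        total += theta * lj
    return total
```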
In one embodiment, the training loss includes a second training sub-loss corresponding to each positive sample pair in each similarity class. Adjusting model parameters of the initial video text matching model based on training loss until a convergence condition is met to obtain a target video text matching model, wherein the method comprises the following steps:
acquiring a minimum value from the losses of the second training sub-samples as second target sub-losses aiming at the same positive sample pair to obtain second target sub-losses corresponding to the positive sample pairs respectively; obtaining second target losses based on the statistical values of the sub-losses of the second targets; and adjusting the model parameters of the initial video text matching model based on the second target loss until the convergence condition is met, so as to obtain the target video text matching model.
The training loss comprises the second training sub-losses respectively corresponding to each positive sample pair on each similarity category. For example, with reference to Loss_v, Loss_a and Loss_m, each positive sample pair has its own Loss_v, Loss_a and Loss_m, and these Loss_v, Loss_a and Loss_m are the second training sub-losses respectively corresponding to that positive sample pair on the similarity categories.
Specifically, when the end-to-end training mode is adopted, the final loss function of the end-to-end training mode may adopt an optimal loss function. The optimal loss function takes, for each positive sample pair, the optimal loss of that positive sample pair across the similarity categories as the final loss corresponding to that positive sample pair, and then obtains the final loss of the model based on the final losses corresponding to the positive sample pairs.
For the same positive sample pair, the computer device may obtain a minimum value from each second training sub-loss as a second target sub-loss, and for each positive sample pair, the computer device may finally obtain a second target sub-loss corresponding to each positive sample pair. Then, the computer device obtains the second target loss based on the statistical value of each second target sub-loss, for example, calculates an average value of each second target sub-loss as the second target loss; calculating the median of each second target sub-loss as a second target loss; and so on.
The second target loss can also be considered as a comprehensive loss of the model, the computer device performs back propagation based on the second target loss to adjust model parameters of the initial video text matching model, and performs overall end-to-end training on the model until a convergence condition is met to obtain the target video text matching model.
In one embodiment, the final second target loss is calculated with reference to formula (10) below.
Figure BDA0003760296020000451
From the Loss corresponding to the same positive sample pair v ,Loss m ,Loss a Finding the minimum value as the Loss, obtaining the Loss corresponding to each positive sample pair, and calculating the average value of the Loss as the second target Loss
Figure BDA0003760296020000452
In the above embodiment, for each positive sample pair, the optimal second training sub-loss of each branch is taken as the final training sub-loss for calculating the final second target loss, and the model is trained end to end based on the second target loss, so that the training quality of the model can be effectively improved, and the finally obtained target video text matching model has higher prediction accuracy.
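A minimal sketch of the optimal loss of formula (10), assuming the per-positive-pair second training sub-losses are available as tensors of shape (B,):

```python
import torch

def optimal_target_loss(loss_v: torch.Tensor,
                        loss_m: torch.Tensor,
                        loss_a: torch.Tensor) -> torch.Tensor:
    """Second target loss (cf. formula (10)).

    loss_v, loss_m, loss_a: tensors of shape (B,), the second training
    sub-losses of each positive sample pair on the three similarity categories.
    """
    per_pair = torch.stack([loss_v, loss_m, loss_a], dim=0)  # 3 x B
    best, _ = per_pair.min(dim=0)   # minimum sub-loss per positive sample pair
    return best.mean()              # average over the positive sample pairs
```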
In one embodiment, the video text matching model training method further comprises:
inputting the video features and reference features corresponding to the test videos in the test sample pair set and the test text features corresponding to the test texts into the target video text matching model to obtain the similarity set corresponding to each test sample pair in the test sample pair set; generating similarity test matrices respectively corresponding to the similarity categories based on the similarity sets respectively corresponding to the test sample pairs; determining, based on the similarity test matrices respectively corresponding to the similarity categories, the predicted matching sub-ranks of each test sample pair respectively corresponding to the similarity categories; determining the predicted matching rank based on the predicted matching sub-ranks corresponding to the same test sample pair, so as to obtain the predicted matching rank corresponding to each test sample pair; and determining the prediction accuracy corresponding to the target video text matching model based on the predicted matching ranks corresponding to the matching sample pairs among the test sample pairs.
Wherein the set of test sample pairs comprises a plurality of test sample pairs. The set of test sample pairs includes matching test sample pairs and non-matching test sample pairs. The matching test sample pair means that the test video and the test text in the test sample pair are matched, and the non-matching test sample pair means that the test video and the test text in the test sample pair are not matched.
Specifically, after the target video text matching model is obtained through training, the model can be further tested, the performance of the model is evaluated, and after the model test is passed, the model is put into use. In model testing, a multi-modal post-fusion processing strategy can be employed. The multi-mode post-fusion processing strategy is to calculate the similarity corresponding to each mode and then integrate the similarities to obtain the final matching result.
During testing, the video features and reference features corresponding to the test videos in the test sample pair set and the test text features corresponding to the test texts need to be input into the target video text matching model, and the similarity sets respectively corresponding to the test sample pairs in the test sample pair set can be obtained through the data processing of the model. One similarity set includes the target similarities respectively corresponding to the similarity categories. The computer device may generate the similarity test matrices respectively corresponding to the similarity categories based on the similarity sets respectively corresponding to the test sample pairs; one similarity test matrix includes the target similarities of the test sample pairs on the same similarity category. Furthermore, the computer device may determine, based on the similarity test matrix corresponding to each similarity category, the predicted matching sub-rank of each test sample pair on that similarity category; specifically, the matrix elements of a similarity test matrix may be sorted from large to small by rows or by columns to obtain the predicted matching sub-ranks of the test sample pairs on the corresponding similarity category. The computer device may determine the predicted matching rank of a test sample pair based on the predicted matching sub-ranks corresponding to the same test sample pair, so as to obtain the predicted matching rank corresponding to each test sample pair; for example, the best rank may be taken from the predicted matching sub-ranks as the predicted matching rank, or the average rank may be taken from the predicted matching sub-ranks as the predicted matching rank, and so on. Finally, the computer device determines the prediction accuracy corresponding to the target video text matching model based on the predicted matching ranks corresponding to the matching sample pairs among the test sample pairs.
And if the prediction accuracy is higher than the preset accuracy, determining that the target video text matching model passes the test, and putting the model into use. And if the prediction accuracy is less than or equal to the preset accuracy, determining that the target video text matching model fails the test. If the target video text matching model fails to pass the test, a new training sample pair set can be obtained to further train the target video text matching model, and the model parameters are adjusted again. The preset accuracy refers to the accuracy set in advance, and can be set according to actual needs.
In one embodiment, the model performance is evaluated based on at least one of the predicted-matching-rank statistics R@1, R@5, R@10, Median Score or Mean Score of the matching sample pairs among the test sample pairs, so as to determine the prediction accuracy corresponding to the target video text matching model. R@1 is obtained based on the proportion of matching sample pairs whose predicted matching rank is within the top 1 among all matching sample pairs, R@5 is obtained based on the proportion of matching sample pairs whose predicted matching rank is within the top 5 among all matching sample pairs, and R@10 is obtained based on the proportion of matching sample pairs whose predicted matching rank is within the top 10 among all matching sample pairs. The Median Score is obtained based on the median of the predicted matching ranks of the matching sample pairs, and the Mean Score is obtained based on the average of the predicted matching ranks of the matching sample pairs. For R@1, R@5 and R@10, larger values indicate higher model prediction accuracy and better model performance; for the Median Score and the Mean Score, smaller values indicate higher model prediction accuracy and better model performance.
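A minimal sketch of these evaluation statistics, assuming 1-based predicted matching ranks and interpreting the Median Score and the Mean Score as the median and the mean of those ranks:

```python
import statistics

def retrieval_metrics(ranks: list) -> dict:
    """ranks: 1-based predicted matching ranks of the matching sample pairs."""
    n = len(ranks)
    return {
        "R@1":  sum(r <= 1 for r in ranks) / n,    # proportion ranked within the top 1
        "R@5":  sum(r <= 5 for r in ranks) / n,    # proportion ranked within the top 5
        "R@10": sum(r <= 10 for r in ranks) / n,   # proportion ranked within the top 10
        "MedianScore": statistics.median(ranks),   # median of the predicted matching ranks
        "MeanScore": sum(ranks) / n,               # mean of the predicted matching ranks
    }
```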
In one embodiment, a regularization term may also be introduced to assist the model test. Text feature extraction may be performed on the video to obtain video text features, the target similarity between the video text features of the test video and the test text features of the test text in each test sample pair is calculated, and the target similarities corresponding to the test sample pairs on this new similarity category are obtained. Based on the target similarities corresponding to the test sample pairs on the new similarity category, the predicted matching sub-rank of each test sample pair on the new similarity category is determined. The predicted matching rank is then determined based on the predicted matching sub-ranks corresponding to the same test sample pair, so as to obtain the predicted matching rank corresponding to each test sample pair, and the prediction accuracy corresponding to the target video text matching model is determined based on the predicted matching ranks corresponding to the matching sample pairs among the test sample pairs.
Referring to fig. 6, the video text matching model includes four branches, which respectively calculate the similarity between the video features corresponding to the video and the text features corresponding to the text in a test sample pair, the similarity between the video text features and the text features, the similarity between the motion-guided video features (i.e., the motion-enhanced video features) and the text features, and the similarity between the audio-guided video features (i.e., the audio-enhanced video features) and the text features. The video features, video text features, motion-guided video features and audio-guided video features corresponding to the test videos in the test sample pair set and the test text features corresponding to the test texts are input into the target video text matching model, and the similarity matrix of each branch is calculated respectively. The similarity matrices are sorted by rows or by columns to obtain the predicted matching sub-ranks of each test sample pair, and the optimal ranking result among the branches is then taken as the final predicted matching rank of each test sample pair. The final predicted-matching-rank statistics R@1, R@5, R@10, Median Score and Mean Score of the matching sample pairs are used to evaluate the model performance. Referring to fig. 6, the predicted matching sub-ranks of test sample pair 1 in the branches are 7, 6, 0 and 3 respectively; the optimal rank is taken from these predicted matching sub-ranks as the final predicted matching rank, so the final predicted matching rank of test sample pair 1 is 0.
Referring to fig. 7, when the similarity matrix is sorted, if the horizontal axis of the similarity matrix represents a test sample pair containing the same video and the vertical axis represents a test sample pair containing the same text, at this time, the rank obtained by sorting the similarity matrix by rows is the ranking result on the c2v task, and the rank obtained by sorting the similarity matrix by columns is the ranking result on the v2c task.
It will be appreciated that ranking a test sample to a predicted match sub-rank on a similarity category may include ranking results on at least one of a c2v task or a v2c task.
In the embodiment, the trained video text matching model is subjected to model testing based on the test sample pair set, the model testing adopts a multi-mode post-fusion strategy, and the optimal prediction result of each branch is taken as the final prediction result, so that the model testing effect is favorably ensured. Correspondingly, when the model is applied, a multi-mode post-fusion strategy can be adopted, and the prediction performance of the model can be remarkably improved.
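A minimal sketch of the multi-modal post-fusion strategy at test time, assuming square similarity test matrices whose diagonals correspond to the matching test sample pairs, ranking each matching pair within its row, and taking the best rank over the branches as in fig. 6 (ranks counted from 0 as in that example):

```python
import torch

def fused_ranks(sim_matrices: list) -> torch.Tensor:
    """Multi-branch post-fusion at test time.

    sim_matrices: list of B x B similarity test matrices, one per branch;
    the diagonal corresponds to the matching test sample pairs. Returns the
    best (smallest) rank of each matching pair over all branches.
    """
    best = None
    for sim in sim_matrices:
        # Rank of the diagonal (matching) entry within its row: the number of
        # entries in the same row with a strictly higher similarity.
        diag = sim.diag().unsqueeze(1)   # B x 1
        ranks = (sim > diag).sum(dim=1)  # B, rank 0 = best
        best = ranks if best is None else torch.minimum(best, ranks)
    return best
```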
In one embodiment, as shown in fig. 8, a video text matching method is provided, which is exemplified by applying the method to a computer device, which may be the terminal 102 or the server 104 in fig. 1. Referring to fig. 8, the video text matching method includes the steps of:
Step S802, acquiring a video feature to be matched and a reference feature to be matched corresponding to a video to be matched, and acquiring a text feature to be matched corresponding to a text to be matched; the reference feature to be matched comprises at least one of an audio feature and an action feature corresponding to the video to be matched.
The video to be matched and the text to be matched refer to the video and the text to be determined whether to be matched or not. The video features to be matched refer to video features corresponding to the videos to be matched, and the reference features to be matched refer to reference features corresponding to the videos to be matched. The process of acquiring the video features to be matched and the reference features to be matched can refer to the process of acquiring the video features and the reference features corresponding to the training video. The text features to be matched refer to the text features corresponding to the text to be matched, and the process of acquiring the text features to be matched can refer to the process of acquiring the training text features corresponding to the training text.
Step S804, based on the reference features to be matched, the features of the video to be matched are enhanced, and reference enhanced video features corresponding to the video to be matched are obtained; the reference enhanced video feature includes at least one of a motion enhanced video feature and an audio enhanced video feature.
Specifically, the computer device may obtain a to-be-matched video feature and a to-be-matched reference feature corresponding to the to-be-matched video, and perform feature enhancement on the to-be-matched video feature based on the to-be-matched reference feature to obtain a reference enhanced video feature corresponding to the to-be-matched video. And if the reference features to be matched comprise action features, performing feature enhancement on the video features to be matched based on the action features to obtain action enhanced video features, wherein the reference enhanced video features comprise action enhanced video features. And if the reference features to be matched comprise the audio features, performing feature enhancement on the video features to be matched based on the audio features to obtain audio enhanced video features, wherein the reference enhanced video features comprise the audio enhanced video features.
It can be understood that the specific process of feature enhancement can refer to the content of the related embodiments of the video text matching model training method described above. For example, intra-modality attention processing is respectively performed on the video features to be matched and the motion features to obtain self-attention video features and self-attention motion features; inter-modality attention processing is performed on the video features to be matched and the self-attention motion features to obtain cross-attention video features, and inter-modality attention processing is performed on the motion features and the self-attention video features to obtain cross-attention motion features; the cross-attention motion features and the cross-attention video features are fused to obtain motion-video fusion features; channel attention processing is performed on the motion-video fusion features to obtain a first channel attention weight, and feature enhancement is performed on the video features to be matched based on the first channel attention weight to obtain the motion-enhanced video features.
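A minimal sketch of the motion-guided feature enhancement just described, assuming video and motion sequence features of the same length and dimension; the attention configuration, the additive fusion and the sigmoid channel gate are assumptions:

```python
import torch
import torch.nn as nn

class MotionEnhancer(nn.Module):
    """Illustrative sketch: intra-modality attention, inter-modality (cross)
    attention, fusion, and channel attention for motion-guided enhancement."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.self_attn_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn_m = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn_m = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.channel_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, video: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
        # Assumes video and motion have shape (batch, sequence_length, dim)
        # with the same sequence length.
        # Intra-modality attention: self-attention video / motion features.
        sa_v, _ = self.self_attn_v(video, video, video)
        sa_m, _ = self.self_attn_m(motion, motion, motion)
        # Inter-modality attention: cross-attention video / motion features.
        ca_v, _ = self.cross_attn_v(video, sa_m, sa_m)   # video attends to motion
        ca_m, _ = self.cross_attn_m(motion, sa_v, sa_v)  # motion attends to video
        # Fuse the cross-attention features into a motion-video fusion feature.
        fused = ca_v + ca_m
        # Channel attention: first channel attention weight (mean pooling is assumed),
        # then enhancement of the video features.
        weight = self.channel_gate(fused.mean(dim=1, keepdim=True))
        return video * weight   # motion-enhanced video features
```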
Step 806, similarity calculation is carried out on the text features to be matched, the video features to be matched and the reference enhanced video features respectively, and a similarity set corresponding to the videos to be matched and the texts to be matched is obtained.
The similarity set comprises the similarity between the text features to be matched and the video features to be matched and the similarity between the text features to be matched and the reference enhanced video features.
It can be understood that the specific process of the similarity calculation can refer to the content of the related embodiments of the video text matching model training method described above. For example, the initial similarity between the text features to be matched and the video features to be matched is calculated; the initial similarity comprises the sub-similarities between a plurality of text words in the text to be matched and the same video frame, and the sub-similarities between a plurality of video frames in the video to be matched and the same text word. A text weight is obtained based on the text features to be matched, and a video weight is obtained based on the video features to be matched. For the initial similarity, the maximum value among the sub-similarities corresponding to the same video frame is taken as a first sub-similarity and the maximum value among the sub-similarities corresponding to the same text word is taken as a second sub-similarity, so as to obtain the first sub-similarity corresponding to each video frame and the second sub-similarity corresponding to each text word; a first similarity is obtained based on the first sub-similarities, and a second similarity is obtained based on the second sub-similarities; the first similarity and the text weight are fused to obtain first fusion data, and the second similarity and the video weight are fused to obtain second fusion data; and the target similarity between the video features to be matched and the text features to be matched is obtained based on the first fusion data and the second fusion data.
And step S808, determining a matching result between the video to be matched and the text to be matched based on the similarity set corresponding to the video to be matched and the text to be matched.
Specifically, the computer device may obtain, locally or from another device, a to-be-matched video feature and a to-be-matched reference feature corresponding to a to-be-matched video, obtain a to-be-matched text feature corresponding to a to-be-matched text, then perform feature enhancement on the to-be-matched video feature based on the to-be-matched reference feature, obtain a reference enhanced video feature corresponding to the to-be-matched video, for example, separately train a model for generating an action enhanced video feature and an audio enhanced video feature, input the video feature and the action feature of the to-be-matched video into the model for generating the action enhanced video feature, obtain an action enhanced video feature, input the video feature and the audio feature of the to-be-matched video into the model for generating the audio enhanced video feature, and obtain the audio enhanced video feature. Furthermore, the computer device calculates similarities between the text features to be matched and the video features to be matched and the reference enhanced video features respectively to obtain a similarity set corresponding to the video to be matched and the text to be matched, for example, calculates cosine similarities between the text features to be matched and the video features to be matched and the reference enhanced video features respectively to obtain a similarity set. Finally, the computer device determines a matching result between the video to be matched and the text to be matched based on the similarity set corresponding to the video to be matched and the text to be matched, for example, if each similarity in the similarity set is greater than a preset threshold, the matching result between the video to be matched and the text to be matched is determined to be successful, otherwise, the matching result is failed; if at least two similarities in the similarity set are greater than a preset threshold, determining that the matching result between the video to be matched and the text to be matched is successful, otherwise, determining that the matching result is failure; acquiring a maximum value from the similarity set as a target value, if the target value is greater than a preset threshold value, determining that the matching result is successful, otherwise, determining that the matching result is failed; and so on.
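The example decision rules above can be expressed compactly as in the following sketch; the threshold value and the strategy names are placeholders introduced for illustration only.

```python
def match_decision(similarity_set, threshold=0.5, strategy="all"):
    """Illustrative decision rules over a similarity set.

    The threshold and strategy names are placeholders, not values from the patent.
    """
    sims = list(similarity_set)
    if strategy == "all":            # every similarity must exceed the threshold
        return all(s > threshold for s in sims)
    if strategy == "at_least_two":   # at least two similarities must exceed it
        return sum(s > threshold for s in sims) >= 2
    if strategy == "max":            # only the largest similarity is checked
        return max(sims) > threshold
    raise ValueError(f"unknown strategy: {strategy}")
```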
It is understood that the specific processes of feature enhancement and similarity calculation can refer to the contents of the related embodiments of the foregoing video text matching model training method. The method described in each of the related embodiments of the foregoing video text matching model training method may be implemented not only by one model but also by a plurality of models; for example, models for generating the motion enhanced video features and the audio enhanced video features may be trained separately. The method described in each of the related embodiments of the foregoing video text matching model training method can also be implemented by designing a corresponding algorithm or formula without depending on a model.
In one embodiment, there may be a plurality of videos to be matched, a plurality of texts to be matched, or both, and the videos to be matched and the texts to be matched may be combined pairwise to obtain a plurality of video text pairs. Similar to model testing, when a matching result is determined, a similarity prediction matrix corresponding to each similarity category is generated based on the similarity set corresponding to each video text pair, a prediction matching sub-rank of each video text pair corresponding to each similarity category is determined based on the similarity prediction matrix corresponding to that similarity category, a prediction matching rank is determined based on the prediction matching sub-ranks corresponding to the same video text pair to obtain the prediction matching rank corresponding to each video text pair, and finally a matching result corresponding to each video text pair is determined based on the prediction matching rank corresponding to that video text pair. For example, the matching result corresponding to a video text pair whose prediction matching rank is in the top 10% may be determined as a successful match.
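A rough sketch of this ranking-based matching is given below. Combining the per-category sub-ranks by taking their minimum and treating the top fraction of pairs as matches are assumptions made for illustration; the patent leaves the exact combination rule open.

```python
import numpy as np

def rank_video_text_pairs(sim_sets: np.ndarray, top_fraction: float = 0.1) -> np.ndarray:
    """Illustrative ranking-based matching.

    sim_sets is assumed to have shape (num_pairs, num_categories): one similarity
    per category (e.g. plain video, motion-enhanced, audio-enhanced) for every
    video-text pair. Returns a boolean mask of pairs declared as matches.
    """
    # Rank pairs within each similarity category (rank 1 = most similar).
    order = np.argsort(-sim_sets, axis=0)
    sub_ranks = np.empty_like(order)
    for c in range(sim_sets.shape[1]):
        sub_ranks[order[:, c], c] = np.arange(1, sim_sets.shape[0] + 1)
    # Prediction matching rank: best (smallest) sub-rank across categories.
    pred_rank = sub_ranks.min(axis=1)
    # Pairs ranked in the top fraction are treated as successful matches.
    cutoff = max(1, int(top_fraction * sim_sets.shape[0]))
    return pred_rank <= cutoff
```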
The video text matching method can be applied to data retrieval and data recommendation scenes. For example, in a video retrieval scene, a search statement of a user may be used as a text to be matched, videos in a candidate video set are used as videos to be matched, prediction matching ranks corresponding to respective video text pairs are determined based on a target video text matching model, candidate videos in the video text pairs with the prediction matching ranks at the top 10 are used as target videos, and the respective target videos are used as video search results corresponding to the search statement.
According to the video text matching method, the video features to be matched and the reference features to be matched corresponding to the video to be matched are obtained, and the text features to be matched corresponding to the text to be matched are obtained; the reference features to be matched comprise at least one of an audio feature and an action feature corresponding to the video to be matched; based on the reference features to be matched, feature enhancement is performed on the video features to be matched to obtain the reference enhanced video features corresponding to the video to be matched; the reference enhanced video features comprise at least one of a motion enhanced video feature and an audio enhanced video feature; similarity calculation is performed on the text features to be matched with the video features to be matched and the reference enhanced video features respectively to obtain a similarity set corresponding to the video to be matched and the text to be matched; and a matching result between the video to be matched and the text to be matched is determined based on the similarity set corresponding to the video to be matched and the text to be matched. The video features can provide image information of the video, the audio features can provide sound information of the video, and the action features can provide motion information of the video; determining the matching result between the video to be matched and the text to be matched based on the video features and reference features corresponding to the video to be matched and the text features corresponding to the text to be matched therefore allows the rich modal information in the video to be used to improve the understanding of the video content, thereby improving the matching accuracy. In addition, feature enhancement and feature guidance are carried out on the video features based on the audio features or the action features, so that important information in the video can be highlighted, and the understanding of the video content is further improved. Similarity calculation is carried out on the text features with the video features and the reference enhanced video features respectively, and the matching result is determined based on the similarity set obtained through calculation, so that the matching accuracy can be further improved.
In one embodiment, as shown in fig. 9, steps S804 and S806 include:
step S902, inputting the video features to be matched, the reference features to be matched and the text features to be matched into a target video text matching model to obtain a similarity set corresponding to the video to be matched and the text to be matched.
The training process of the target video text matching model comprises the following steps:
acquiring a training sample pair set; training sample pairs in the training sample pair set comprise positive sample pairs and negative sample pairs matched with the positive sample pairs, the training sample pairs comprise training videos and training texts, and the training videos and the training texts in the positive sample pairs are matched with each other; inputting video features and reference features corresponding to training videos in the training sample pair set and training text features corresponding to training texts into an initial video text matching model; the reference features comprise at least one of audio features and action features corresponding to the training videos; based on the reference features corresponding to the same training video, performing feature enhancement on the corresponding video features to obtain the reference enhancement video features corresponding to the training video; the reference enhanced video feature comprises at least one of a motion enhanced video feature and an audio enhanced video feature; aiming at the same training sample pair, performing similarity calculation on training text characteristics corresponding to a training text, video characteristics corresponding to a training video and reference enhanced video characteristics respectively to obtain a similarity set corresponding to each training sample pair respectively; and calculating training loss based on the similarity set corresponding to the positive sample pair and the similarity set corresponding to the matched negative sample pair, and adjusting the model parameters of the initial video text matching model based on the training loss until a convergence condition is met to obtain a target video text matching model.
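For orientation, a minimal training-loop sketch corresponding to this process is given below; the model, loss function and data loader interfaces are hypothetical names introduced for illustration.

```python
import torch

def train_matching_model(model, loss_fn, optimizer, loader, epochs: int = 10):
    """Minimal training-loop sketch. `model` is assumed to return a similarity
    set per training sample pair; `loss_fn` contrasts positive pairs with their
    matched negative pairs. All names are illustrative."""
    for _ in range(epochs):
        for video_feats, ref_feats, text_feats, pos_mask in loader:
            sim_sets = model(video_feats, ref_feats, text_feats)
            loss = loss_fn(sim_sets, pos_mask)   # positive vs. matched negatives
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```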
It can be understood that the content of each embodiment in the foregoing video text matching model training method may be referred to in the training process of the target video text matching model, and details are not described here again.
Specifically, the computer device can obtain the video features to be matched and the reference features to be matched corresponding to the video to be matched, obtain the text features to be matched corresponding to the text to be matched, input the video features to be matched, the reference features to be matched and the text features to be matched into the target video text matching model, and obtain the similarity set corresponding to the video to be matched and the text to be matched through the data processing of the model. In the target video text matching model, feature enhancement is performed on the video features to be matched based on the reference features to be matched to obtain the reference enhanced video features corresponding to the video to be matched. In the target video text matching model, similarity calculation is carried out on the text features to be matched with the video features to be matched and the reference enhanced video features respectively, and the similarity set corresponding to the video to be matched and the text to be matched is obtained.
In this embodiment, the video features to be matched and the reference features to be matched corresponding to the video to be matched are obtained, and the text features to be matched corresponding to the text to be matched are obtained; the reference features to be matched comprise at least one of an audio feature and an action feature corresponding to the video to be matched; the video features to be matched, the reference features to be matched and the text features to be matched are input into the target video text matching model to obtain the similarity set corresponding to the video to be matched and the text to be matched; and the matching result between the video to be matched and the text to be matched is determined based on the similarity set corresponding to the video to be matched and the text to be matched. Because feature enhancement and similarity calculation are carried out through the target video text matching model, data processing efficiency can be effectively improved, the similarity set corresponding to the video to be matched and the text to be matched is quickly obtained, and the matching result between the video to be matched and the text to be matched is then quickly determined.
In one embodiment, the video text matching method further comprises:
performing similarity calculation on video text characteristics corresponding to the video to be matched and target text characteristics corresponding to the text to be matched to obtain reference similarity between the video text characteristics and the target text characteristics; the video text features comprise at least one of audio text features and image text features, the audio text features are obtained by extracting the features of audio texts extracted from the audio of the video to be matched, and the image text features are obtained by extracting the features of image texts extracted from the image of the video to be matched; and determining a matching result between the video to be matched and the text to be matched based on the similarity set and the reference similarity corresponding to the video to be matched and the text to be matched.
Specifically, when the matching result between the video to be matched and the text to be matched is determined, the similarity between the video text characteristic corresponding to the video to be matched and the target text characteristic corresponding to the text to be matched can be further comprehensively considered, so that the matching accuracy is improved.
The computer device can obtain the video text features corresponding to the video to be matched and the target text features corresponding to the text to be matched, perform similarity calculation on the video text features corresponding to the video to be matched and the target text features corresponding to the text to be matched, take the calculation result as the reference similarity between the video text features and the target text features, and finally comprehensively judge the matching result between the video to be matched and the text to be matched based on the similarity set and the reference similarity corresponding to the video to be matched and the text to be matched. For example, if each similarity in the similarity set and the reference similarity are all greater than a preset threshold, it is determined that the matching result between the video to be matched and the text to be matched is successful, otherwise the matching result is a failure; if at least two of the similarities in the similarity set and the reference similarity are greater than the preset threshold, it is determined that the matching result between the video to be matched and the text to be matched is successful, otherwise the matching result is a failure; and so on.
Wherein the video text features include at least one of audio text features and image text features. The audio text features are obtained by extracting the features of the audio text extracted from the audio of the video to be matched, and the image text features are obtained by extracting the features of the image text extracted from the image of the video to be matched. The audio text may be extracted from the audio of the video to be matched based on ASR (Automatic Speech Recognition), and the image text may be extracted from the image of the video to be matched based on OCR (Optical Character Recognition). The audio text features can be further extracted based on the machine learning model, and the image text features can be obtained by extracting the features of the image text based on the machine learning model.
It is to be understood that the target text feature and the text feature to be matched may be the same or different.
If the video text features comprise audio text features and image text features, similarity calculation can be carried out on the audio text features and the target text features to obtain first reference similarity, similarity calculation can be carried out on the image text features and the target text features to obtain second reference similarity, and a matching result between the video to be matched and the text to be matched is determined based on a similarity set corresponding to the video to be matched and the text to be matched, the first reference similarity and the second reference similarity.
In the embodiment, the matching result between the video to be matched and the text to be matched is determined, and the similarity set and the reference similarity are comprehensively considered, so that the matching accuracy can be further improved.
In one embodiment, the similarity calculation of the video text feature corresponding to the video to be matched and the target text feature corresponding to the text to be matched to obtain the reference similarity between the video text feature and the target text feature includes:
calculating initial similarity between the video text characteristics and the target text characteristics to obtain an initial similarity matrix; counting the number of matrix elements of which the numerical values are greater than a preset threshold value in the initial similarity matrix to obtain a first number; fusing the number of text words respectively corresponding to the text to be matched and the video text to obtain a second number; the video text refers to a text corresponding to the video text characteristics; and obtaining the reference similarity between the video text feature and the target text feature based on the first quantity and the second quantity.
Specifically, the computer device may first calculate an initial similarity based on the video text feature and the target text feature, to obtain an initial similarity matrix, for example, using a cosine similarity between the video text feature and the target text feature as the initial similarity. The larger the numerical value of the matrix element in the initial similarity matrix is, the higher the similarity degree between the text to be matched and the corresponding text word in the video text is. Therefore, the computer device may perform quantity statistics on matrix elements in the initial similarity matrix, the values of which are greater than a preset threshold, and take the statistical result as a first quantity. Meanwhile, the computer device can fuse the text word numbers respectively corresponding to the text to be matched and the video text to obtain a second number. For example, the sum of the number of text words corresponding to the text to be matched and the video text is used as the second number; taking the product of the number of the text words corresponding to the text to be matched and the video text as a second number; and so on. Finally, the computer device obtains a reference similarity between the video text feature and the target text feature based on the first number and the second number. For example, the ratio of the first number and the second number is taken as the reference similarity.
In one embodiment, cosine similarity matrixes of the video text features and the target text features are calculated, then keyword scores are calculated by using the cosine similarity matrixes, and the keyword scores are used as reference similarities. The keyword score can be calculated by the following formula (11).
$$N_{\mathrm{key\_word}} = \bigl|\{(i,j)\,:\,\mathrm{Sim}_{ij} \ge \mathrm{threshold}\}\bigr|$$

$$\mathrm{Score}_{\mathrm{caption\_text}} = \frac{N_{\mathrm{key\_word}}}{L_C + L_T} \tag{11}$$

wherein N_key_word represents the number of elements in the cosine similarity matrix that are greater than or equal to the threshold, Sim represents the cosine similarity matrix, and threshold represents the threshold; Score_caption_text represents the keyword score, L_C represents the number of text words in the text to be matched, and L_T represents the number of text words in the video text.
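A small NumPy sketch of this keyword score computation is given below. The threshold value is a placeholder, and the denominator follows the sum-of-word-counts example given above for fusing the two word counts; other fusion choices (such as the product) are equally possible.

```python
import numpy as np

def keyword_score(video_text_feats: np.ndarray,
                  target_text_feats: np.ndarray,
                  threshold: float = 0.8) -> float:
    """Illustrative computation of the keyword score of formula (11).

    Features are assumed to be L2-normalized word embeddings of shape
    (L_T, dim) for the video text and (L_C, dim) for the text to be matched.
    """
    sim = video_text_feats @ target_text_feats.T    # cosine similarity matrix
    n_key_word = int((sim >= threshold).sum())      # first number
    l_t, l_c = video_text_feats.shape[0], target_text_feats.shape[0]
    return n_key_word / (l_c + l_t)                 # second number fuses the word counts
```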
Fig. 10 is a schematic diagram of a cosine similarity matrix, where the horizontal axis of the cosine similarity matrix represents each text word in the text to be matched, and the vertical axis represents each text word in the video text. A square in fig. 10 represents a matrix element of the cosine similarity matrix, the matrix element of the cosine similarity matrix represents the similarity between corresponding text words in the text to be matched and the video text, and the higher the similarity is, the darker the color of the square is.
In the above embodiment, when the reference similarity is calculated, complex data processing is not required to be performed on the video text features and the target text features, and the reference similarity between the video text features and the target text features can be quickly calculated based on the first number obtained by counting the matrix elements of the initial similarity matrix and the second number obtained by fusing the number of the text words.
In one embodiment, the current text is any one of an audio text, an image text and a text to be matched, and the text feature corresponding to the current text is any one of an audio text feature, an image text feature or a target text feature. The generation process of the text features corresponding to the current text comprises the following steps:
extracting nouns from the current text to obtain text nouns; and performing feature extraction on the text nouns to obtain text features corresponding to the current text.
In particular, the video text features and the target text features may be generated in the same manner. Data analysis shows that abstract nouns, such as names of people and place names in the text, are difficult to express through video features, action features and audio features, yet they play an important role in the video text matching task. Therefore, when generating the video text features and the target text features, the computer device may perform noun extraction on the current text, retaining only the nouns to obtain text nouns, and perform feature extraction only on the text nouns to obtain the text features corresponding to the current text.
In an embodiment, referring to fig. 11, in order to further improve the feature accuracy of the text features corresponding to the current text, the text to be matched and the video text may be preprocessed, feature extraction is performed on the preprocessed text to be matched and video text to obtain the target text features and the video text features, and the keyword score is calculated as the reference similarity based on the target text features and the video text features. The preprocessing includes noun filtering, lemmatization and lowercasing; specifically, nouns are filtered out of the text, the nouns are reduced to their base forms, and the base-form nouns are converted to lowercase.
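One possible way to implement this preprocessing is sketched below using spaCy; the choice of library, model name and the example output are assumptions introduced for illustration, not part of the patent.

```python
import spacy

# An assumed preprocessing pipeline: keep only nouns, lemmatize, lowercase.
nlp = spacy.load("en_core_web_sm")

def extract_text_nouns(text: str) -> list[str]:
    doc = nlp(text)
    return [tok.lemma_.lower() for tok in doc
            if tok.pos_ in ("NOUN", "PROPN")]   # nouns and proper nouns (names, places)

# extract_text_nouns("Two teachers are giving children an education")
# -> e.g. ["teacher", "child", "education"]
```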
In the embodiment, the video text features are generated based on the nouns in the video text, and the target text features are generated based on the nouns in the text to be matched, so that the video text features and the target text features are beneficial to considering noun information in the video and the text which are difficult to express by other modalities when the video text is matched, and the matching accuracy of the video text is improved.
In a specific embodiment, the video text matching method of the present application is a multi-level multi-modal hybrid fusion method. Referring to fig. 12, the target video text matching model includes four branches, which are a video feature and text feature matching branch (also referred to as a video feature and text feature matching network), a video text feature and target text feature matching branch (also referred to as a video text feature and target text feature matching network), a motion-guided video feature and text feature matching branch (also referred to as a motion-enhanced video feature and text feature matching network), and an audio-guided video feature and text feature matching branch (also referred to as an audio-enhanced video feature and text feature matching network), respectively.
The input data of the video feature and text feature matching branch are the video features of the video to be matched and the text features of the text to be matched, and the output data is the target similarity calculated based on the video features and the text features to be matched. The input data of the video text feature and target text feature matching branch are the video text features of the video to be matched and the target text features of the text to be matched, and the output data is the target similarity calculated based on the video text features and the target text features. The input data of the motion-guided video feature and text feature matching branch are the video features and action features of the video to be matched and the text features of the text to be matched, and the output data is the target similarity calculated based on the motion-guided video features and the text features to be matched. The input data of the audio-guided video feature and text feature matching branch are the video features and audio features of the video to be matched and the text features of the text to be matched, and the output data is the target similarity calculated based on the audio-guided video features and the text features to be matched.
The target video text matching model guides the video features with the audio features and the action features respectively in a forward fusion mode, and then performs similarity calculation between the text features of the text to be matched and each of the video features, the audio-guided video features and the action-guided video features, and between the target text features and the video text features.
The target video text matching model then integrates the prediction results of the four branches in a backward fusion mode to obtain a final prediction result.
Meanwhile, the video features can be used to fill in missing audio features so as to align the multi-modal features; in this way, when the modalities are not aligned, the video features themselves can still provide guidance, which improves video text matching performance to a certain extent.
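A trivial sketch of this modality-alignment fallback, under the assumption that a missing audio modality is represented as None, is:

```python
import torch

def fill_missing_audio(video_feats: torch.Tensor, audio_feats=None) -> torch.Tensor:
    """Illustrative modality alignment: when a video has no audio track, the
    video features (or a preset constant tensor) stand in for the audio features."""
    return video_feats.clone() if audio_feats is None else audio_feats
```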
The target video text matching model fully mines and utilizes the retrieval capability of each modal feature through a forward-backward hybrid fusion mode, and finally significantly improves the video text matching performance. The multi-level multi-modal hybrid fusion video text matching method achieves the highest retrieval performance on the relevant data sets.
The video text matching method utilizes multi-modal information, which can provide information for video text matching that image features alone cannot, thereby helping to improve video text matching performance. For example, referring to fig. 13, the text information may provide abstract noun information such as names of people, place names, and the like. The video text corresponding to the video in fig. 13 and the text to be matched respectively include the noun information "people" and "education".
The multi-modal hybrid fusion strategy adopted by the video text matching method can establish relationships between the multi-modal features of the video and the text features of the text in a hierarchical manner, and the forward-backward hybrid fusion strategy remarkably improves the retrieval performance of the model. The forward fusion is carried out at the feature level and fuses the modalities to obtain multi-modal fusion features, while the backward fusion integrates the optimal prediction result of each branch. In addition, a matching branch that is retrieved against the text information of the text to be matched is designed for each kind of modal information, and similarity matching is performed hierarchically between the various kinds of modal information and the text information of the text to be matched, which reduces the risk of modal confusion and improves the retrieval performance of the model.
The video text matching method also explores modality alignment, and can still achieve the effect of guiding the video features when a modality is missing, which increases the robustness of the model.
For model training, the video text matching method of the present application also provides two training modes: an integrated training mode and an E2E (end-to-end) training mode. The integrated training mode integrates the prediction results of all branches by using the proposed multi-modal post-fusion processing strategy. The E2E training mode proposes two loss functions for integrating the branches: a multi-modal balance loss function and an optimal loss function. The multi-modal balance loss function can balance the loss of each branch and adjust the proportion of each branch. The optimal loss function integrates, for each training sample, the lowest loss across the branches as the final loss.
Compared with the traditional method, the video text matching method of the present application can deeply mine the effective information in the video and match it with the text, so that the retrieval performance of both the t2v (text-to-video) task and the v2t (video-to-text) task can be effectively improved.
It can be understood that the video text matching method can be applied to a plurality of scenes such as video content understanding, video content recommendation, video text retrieval and the like.
It should be understood that, although the steps in the flowcharts related to the above embodiments are displayed sequentially as indicated by the arrows, the steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated otherwise herein, the steps are not strictly limited to this order and may be performed in other orders. Moreover, at least a part of the steps in the flowcharts related to the above embodiments may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time but may be performed at different times, and the execution order of these sub-steps or stages is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least a part of the sub-steps or stages of other steps.
Based on the same inventive concept, an embodiment of the present application further provides a video text matching model training apparatus for implementing the above-mentioned video text matching model training method, and the implementation scheme provided by the apparatus for solving the problem is similar to the implementation scheme described in the above-mentioned method. An embodiment of the present application further provides a video text matching apparatus for implementing the above-mentioned video text matching method, and the implementation scheme provided by the apparatus for solving the problem is similar to the implementation scheme described in the above-mentioned method. Therefore, for the specific limitations in the following one or more embodiments of the video text matching model training apparatus, reference may be made to the limitations of the foregoing video text matching model training method, and for the specific limitations in the following one or more embodiments of the video text matching apparatus, reference may be made to the limitations of the foregoing video text matching method, which are not described herein again.
In one embodiment, as shown in fig. 14, there is provided a video text matching model training apparatus, including: a training sample pair set acquisition module 1402, a feature input module 1404, a feature enhancement module 1406, a similarity calculation module 1408, and a model adjustment module 1410, wherein:
A training sample pair set obtaining module 1402, configured to obtain a training sample pair set; the training sample pairs in the training sample pair set comprise positive sample pairs and negative sample pairs matched with the positive sample pairs, the training sample pairs comprise training videos and training texts, and the training videos and the training texts in the positive sample pairs are matched with each other.
A feature input module 1404, configured to input video features and reference features corresponding to training videos in the training sample pair set and training text features corresponding to training texts into an initial video text matching model; the reference features include at least one of audio features and motion features corresponding to the training video.
The feature enhancement module 1406 is configured to perform feature enhancement on corresponding video features based on the reference features corresponding to the same training video to obtain reference enhanced video features corresponding to the training video; the reference enhanced video feature includes at least one of a motion enhanced video feature and an audio enhanced video feature.
The similarity calculation module 1408 is configured to perform similarity calculation on training text features corresponding to the training text, video features corresponding to the training videos, and reference enhanced video features, respectively, for the same training sample pair, so as to obtain a similarity set corresponding to each training sample pair.
The model adjusting module 1410 is configured to calculate a training loss based on the similarity set corresponding to the positive sample pair and the similarity set corresponding to the matched negative sample pair, and adjust model parameters of the initial video text matching model based on the training loss until a convergence condition is met, so as to obtain a target video text matching model; the target video text matching model is used for determining a matching result between the video and the text.
According to the video text matching model training device, the video characteristics can provide video image information, the audio characteristics can provide video sound information, the action characteristics can provide video motion information, the video text matching model is trained based on the video characteristics and the reference characteristics corresponding to the training videos and the training text characteristics corresponding to the training texts, the understanding of the model on video contents can be improved by means of abundant modal information in the videos, and therefore the prediction accuracy of the model is improved. And moreover, feature enhancement and feature guidance are carried out on the video features based on the audio features or the motion features, important information in the video can be highlighted, similarity calculation is carried out on the video features and the training text features respectively based on the video features and the reference enhanced video features, model parameters are adjusted based on training loss generated by the similarity set obtained through calculation, the relation between the video and the text can be better established by the model, and the prediction accuracy of the model is further improved.
In one embodiment, the training sample pair set acquisition module is further configured to:
obtaining a plurality of positive sample pairs; carrying out data recombination on each positive sample pair to obtain a plurality of negative sample pairs; taking the negative sample pair with coincident data with the positive sample pair as the negative sample pair matched with the positive sample pair; and obtaining a training sample pair set based on each positive sample pair and each matched negative sample pair.
In one embodiment, the feature enhancement module is further to:
performing intra-modal attention processing on video features and action features corresponding to the current training video respectively to obtain self-attention video features and self-attention action features corresponding to the current training video; performing inter-modal attention processing on the video features and the self-attention motion features corresponding to the current training video to obtain cross attention video features corresponding to the current training video, and performing inter-modal attention processing on the motion features and the self-attention video features corresponding to the current training video to obtain cross attention motion features corresponding to the current training video; fusing the cross attention motion characteristic and the cross attention video characteristic corresponding to the current training video to obtain a motion video fusion characteristic corresponding to the current training video; and performing channel attention processing on the motion video fusion features corresponding to the current training video to obtain a first channel attention weight, and performing feature enhancement on the video features corresponding to the current training video based on the first channel attention weight to obtain motion enhancement video features corresponding to the current training video.
In one embodiment, the feature enhancement module is further to:
performing intra-modal fusion on video features corresponding to a current training video to obtain a first self-attention weight, performing fusion on the video features corresponding to the current training video and the first self-attention weight to obtain a first attention feature, and obtaining self-attention video features based on the video features corresponding to the current training video and the first attention feature; and performing intra-modal fusion on the motion features corresponding to the current training video to obtain a second self-attention weight, performing fusion on the motion features corresponding to the current training video and the second self-attention weight to obtain a second attention feature, and obtaining the self-attention motion feature based on the motion features corresponding to the current training video and the second attention feature.
In one embodiment, the feature enhancement module is further to:
splicing the video features corresponding to the current training video and the self-attention action features to obtain first splicing features, performing inter-modal fusion on the video features corresponding to the current training video and the first splicing features to obtain first cross attention weights, fusing the first splicing features and the first cross attention weights to obtain first cross attention features, and obtaining cross attention video features based on the first splicing features and the first cross attention features; the motion characteristic corresponding to the current training video and the self-attention video characteristic are spliced to obtain a second splicing characteristic, the motion characteristic corresponding to the current training video and the second splicing characteristic are subjected to inter-modal fusion to obtain a second cross attention weight, the second splicing characteristic and the second cross attention weight are fused to obtain a second cross attention characteristic, and the cross attention motion characteristic is obtained based on the second splicing characteristic and the second cross attention characteristic.
In one embodiment, the feature enhancement module is further to:
splicing the cross attention motion characteristic and the cross attention video characteristic corresponding to the current training video to obtain a cross attention splicing characteristic; fusing the cross attention motion characteristic and the cross attention video characteristic corresponding to the current training video to obtain a cross attention fusion characteristic; and performing inter-modal fusion on the cross attention splicing feature and the cross attention fusion feature to obtain a third cross attention weight, performing fusion on the cross attention splicing feature and the third cross attention weight to obtain a third cross attention feature, and obtaining an action video fusion feature based on the cross attention splicing feature and the third cross attention feature.
In one embodiment, the initial video text matching model includes a motion enhanced video feature and text feature matching network including a first intra-modality attention layer, a second intra-modality attention layer, a first inter-modality attention layer, a second inter-modality attention layer, a motion video fusion attention layer, a first channel attention layer, and a first similarity calculation layer.
The first intra-modality attention layer is used for performing intra-modality attention processing on the video features, and the second intra-modality attention layer is used for performing intra-modality attention processing on the action features; the first inter-modality attention layer is used for performing inter-modality attention processing on the video feature and the self-attention action feature, and the second inter-modality attention layer is used for performing inter-modality attention processing on the action feature and the self-attention video feature; the action video fusion attention layer is used for fusing cross attention action features and cross attention video features corresponding to the same training video; the first channel attention layer is used for carrying out channel attention processing on the motion video fusion features; the first similarity calculation layer is used for calculating the similarity between the motion enhancement video features and the training text features.
In one embodiment, the feature enhancement module is further to:
fusing video features and audio features corresponding to the current training video to obtain initial audio and video fusion features; carrying out random inactivation treatment and pooling treatment on the initial audio and video fusion characteristics to obtain intermediate audio and video fusion characteristics; carrying out normalization processing on the intermediate audio and video fusion characteristics to obtain target audio and video fusion characteristics; and performing channel attention processing on the target audio and video fusion features to obtain a second channel attention weight, and performing feature enhancement on the video features corresponding to the current training video based on the second channel attention weight to obtain audio enhancement video features corresponding to the current training video.
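A minimal PyTorch sketch of this audio-guided enhancement pipeline is given below; the feature dimensions, the linear fusion layer and mean pooling over frames are assumptions of the sketch rather than the patent's exact implementation.

```python
import torch
import torch.nn as nn

class AudioGuidedEnhancer(nn.Module):
    """Illustrative sketch of audio-guided video feature enhancement: fuse video
    and audio features, apply dropout and pooling, normalize, derive a channel
    attention weight and gate the video features."""
    def __init__(self, dim: int = 512, drop: float = 0.1):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)        # initial audio-video fusion
        self.dropout = nn.Dropout(drop)            # random inactivation
        self.norm = nn.LayerNorm(dim)              # normalization
        self.channel_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, video: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # video, audio: (batch, frames, dim)
        fused = self.fuse(torch.cat([video, audio], dim=-1))
        fused = self.dropout(fused)
        pooled = fused.mean(dim=1)                 # pooling over frames
        pooled = self.norm(pooled)                 # target audio-video fusion features
        weight = self.channel_gate(pooled)         # second channel attention weight
        return video * weight.unsqueeze(1)         # audio-enhanced video features
```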
In one embodiment, the initial video text matching model comprises an audio enhanced video feature and text feature matching network comprising an audio video fusion layer, a random inactivation layer, a pooling layer, a normalization layer, a second channel attention layer, and a second similarity calculation layer.
The audio and video fusion layer is used for fusing the video features and the audio features; the random inactivation layer is used for carrying out random inactivation treatment on input data; the pooling layer is used for pooling input data; the normalization layer is used for normalizing input data; the second channel attention layer is used for carrying out channel attention processing on input data; the second similarity calculation layer is used for calculating the similarity between the audio enhancement video features and the training text features.
In one embodiment, the video text matching model training device is further configured to:
and when the audio is missing from a training video, acquiring preset features or the video features corresponding to the training video as the audio features corresponding to the training video.
In one embodiment, the similarity set includes at least two target similarities, the current video feature is any one of a video feature and a reference enhanced video feature corresponding to a training video in a current training sample pair, the current text feature is a training text feature corresponding to a training text in the current training sample pair, the training video includes a plurality of video frames, and the training text includes a plurality of text words.
The similarity calculation module is further configured to:
calculating the initial similarity between the current video characteristic and the current text characteristic; the initial similarity comprises sub-similarities between a plurality of text words in the training text and the same video frame respectively, and sub-similarities between a plurality of video frames in the training video and the same text word respectively; obtaining a text weight based on the current text characteristics, and obtaining a video weight based on the current video characteristics; aiming at the initial similarity, obtaining a maximum value from a plurality of sub-similarities corresponding to the same video frame as a first sub-similarity, obtaining a maximum value from a plurality of sub-similarities corresponding to the same text word as a second sub-similarity, and obtaining a first sub-similarity corresponding to each video frame and a second sub-similarity corresponding to each text word; obtaining first similarity based on each first sub-similarity, and obtaining second similarity based on each second sub-similarity; fusing the first similarity and the text weight to obtain first fusion data, and fusing the second similarity and the video weight to obtain second fusion data; and obtaining the target similarity between the current video characteristic and the current text characteristic based on the first fusion data and the second fusion data.
In one embodiment, the similarity set includes at least two target similarities. The model adjustment module is further to:
determining a target category from each similarity category; obtaining first similarity weights respectively corresponding to the training sample pairs on the target category based on the target similarities of the positive sample pair and each negative sample pair containing the same training text as the positive sample pair on the target category, and obtaining second similarity weights respectively corresponding to the training sample pairs on the target category based on the target similarities of the positive sample pair and each negative sample pair containing the same training video as the positive sample pair on the target category; fusing the target similarity and the first similarity weight of the same training sample pair on the target category to obtain a first updating similarity corresponding to each training sample pair, and fusing the target similarity and the second similarity weight of the same training sample pair on the target category to obtain a second updating similarity corresponding to each training sample pair; obtaining a first loss based on the first updating similarities corresponding to the positive sample pair and each negative sample pair containing the same training video as the positive sample pair, and obtaining a second loss based on the second updating similarities corresponding to the positive sample pair and each negative sample pair containing the same training text as the positive sample pair; obtaining a training sub-loss corresponding to the target category based on the first loss and the second loss; acquiring a next similarity category as the target category, and returning to the step of obtaining the first similarity weights respectively corresponding to the training sample pairs on the target category based on the target similarities of the positive sample pair and each negative sample pair containing the same training text as the positive sample pair on the target category, until the training sub-losses respectively corresponding to the similarity categories are determined; and obtaining the training loss based on each training sub-loss.
In one embodiment, the model adjustment module is further to:
obtaining a first similarity matrix based on the target similarity of each training sample pair on the target category; the first dimension of the first similarity matrix represents the target similarities of the training sample pairs containing the same training video on the target category, the second dimension of the first similarity matrix represents the target similarities of the training sample pairs containing the same training text on the target category, and the diagonal of the first similarity matrix represents the target similarities of the positive sample pairs on the target category; generating a second matrix element corresponding to the current matrix element based on the current matrix element in the first similarity matrix and a forward matrix element of the current matrix element in the second dimension, and generating a second similarity matrix based on the second matrix elements corresponding to the matrix elements in the first similarity matrix; generating a third matrix element corresponding to the current matrix element based on the current matrix element in the first similarity matrix and a forward matrix element of the current matrix element in the first dimension, and generating a third similarity matrix based on the third matrix elements corresponding to the matrix elements in the first similarity matrix; adjusting each matrix element in the second similarity matrix based on the target similarity of the positive sample pair on the target category to obtain a fourth similarity matrix, and adjusting each matrix element in the third similarity matrix based on the target similarity of the positive sample pair on the target category to obtain a fifth similarity matrix; the fourth similarity matrix represents the first similarity weights respectively corresponding to the training sample pairs on the target category, and the fifth similarity matrix represents the second similarity weights respectively corresponding to the training sample pairs on the target category.
In one embodiment, the model adjustment module is further to:
fusing the first updating similarities corresponding to the positive sample pair and to each negative sample pair containing the same training video as the positive sample pair to obtain a first similarity statistic value corresponding to each positive sample pair, obtaining a first sub-loss corresponding to each positive sample pair based on the first updating similarity and the first similarity statistic value corresponding to the same positive sample pair, and obtaining the first loss based on each first sub-loss; and fusing the second updating similarities corresponding to the positive sample pair and to each negative sample pair containing the same training text as the positive sample pair to obtain a second similarity statistic value corresponding to each positive sample pair, obtaining a second sub-loss corresponding to each positive sample pair based on the second updating similarity and the second similarity statistic value corresponding to the same positive sample pair, and obtaining the second loss based on each second sub-loss.
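For intuition only, the sub-loss for one positive pair described above behaves like a contrastive (InfoNCE-style) term over the updating similarities. The sketch below makes the assumed concrete choice of a log-sum-exp as the similarity statistic value; this is an illustration, not the patent's exact formula.

```python
import torch

def contrastive_sub_loss(pos_sim: torch.Tensor, neg_sims: torch.Tensor) -> torch.Tensor:
    """Illustrative first/second sub-loss for one positive pair.

    pos_sim is the updating similarity of the positive pair; neg_sims holds the
    updating similarities of its matched negative pairs.
    """
    all_sims = torch.cat([pos_sim.view(1), neg_sims.view(-1)])
    statistic = torch.logsumexp(all_sims, dim=0)   # similarity statistic value
    return statistic - pos_sim                     # negative log-probability of the positive
```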
In one embodiment, the initial video-text matching model includes a video feature-to-text feature matching network, a reference enhanced video feature-to-text feature matching network, the reference enhanced video feature-to-text feature matching network includes at least one of an action enhanced video feature-to-text feature matching network, an audio enhanced video feature-to-text feature matching network, and the training loss includes training sub-losses corresponding to the various networks, respectively.
The model adjustment module is further to:
and respectively adjusting model parameters of corresponding networks in the initial video text matching model based on the loss of each training sub until convergence conditions respectively corresponding to various networks are met, and obtaining a target video text matching model.
In one embodiment, the training loss includes a first training sub-loss corresponding to each similarity class. The model adjustment module is further to:
acquiring a minimum value from the first training sub-losses as a first target sub-loss; obtaining loss contribution degrees respectively corresponding to the remaining training sub-losses based on the differences between the first target sub-loss and the remaining training sub-losses; obtaining loss weights respectively corresponding to the remaining training sub-losses based on the loss contribution degrees respectively corresponding to the remaining training sub-losses; fusing the training sub-losses based on the loss weights respectively corresponding to the training sub-losses to obtain a first target loss, wherein the loss weight corresponding to the first target sub-loss is a preset weight; and adjusting the model parameters of the initial video text matching model based on the first target loss until a convergence condition is met, so as to obtain the target video text matching model.
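A simplified sketch of such a multi-modal balance loss is given below; deriving the loss weights as the exponential of the negative gap to the smallest sub-loss is an assumed concrete choice made only for illustration.

```python
import torch

def multimodal_balance_loss(sub_losses: list[torch.Tensor],
                            preset_weight: float = 1.0) -> torch.Tensor:
    """Illustrative multi-modal balance loss: the smallest branch loss keeps a
    preset weight, the remaining branches are weighted by their gap to it."""
    losses = torch.stack(sub_losses)               # one training sub-loss per branch
    target = losses.min().detach()                 # first target sub-loss
    gaps = (losses - target).detach()              # loss contribution degrees
    weights = torch.exp(-gaps)                     # loss weights from the contributions
    weights[losses.argmin()] = preset_weight       # preset weight for the best branch
    return (weights * losses).sum()                # first target loss
```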
In one embodiment, the training loss includes a second training sub-loss corresponding to each positive sample pair in each similarity class. The model adjustment module is further to:
for the same positive sample pair, acquiring a minimum value from the second training sub-losses as a second target sub-loss, so as to obtain the second target sub-losses respectively corresponding to the positive sample pairs; obtaining a second target loss based on a statistical value of the second target sub-losses; and adjusting the model parameters of the initial video text matching model based on the second target loss until the convergence condition is met, so as to obtain the target video text matching model.
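A corresponding sketch of the optimal loss follows; using the mean as the statistical value is an assumption.

```python
import torch

def optimal_loss(per_pair_sub_losses: torch.Tensor) -> torch.Tensor:
    """Illustrative optimal loss. per_pair_sub_losses is assumed to have shape
    (num_positive_pairs, num_similarity_categories); for every positive pair the
    lowest branch loss is kept and the results are averaged."""
    second_target_sub_losses = per_pair_sub_losses.min(dim=1).values
    return second_target_sub_losses.mean()        # second target loss
```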
In one embodiment, the video text matching model training device is further configured to:
inputting video features and reference features corresponding to the test videos in the test sample pair set and test text features corresponding to the test texts into the target video text matching model to obtain a similarity set respectively corresponding to each test sample pair in the test sample pair set; generating similarity test matrices respectively corresponding to the similarity categories based on the similarity sets respectively corresponding to the test sample pairs; determining prediction matching sub-ranks of each test sample pair respectively corresponding to the similarity categories based on the similarity test matrices respectively corresponding to the similarity categories; determining a prediction matching rank based on the prediction matching sub-ranks corresponding to the same test sample pair to obtain the prediction matching rank respectively corresponding to each test sample pair; and determining the prediction accuracy corresponding to the target video text matching model based on the prediction matching ranks corresponding to the matching sample pairs in the test sample pairs.
In one embodiment, as shown in fig. 15, there is provided a video text matching apparatus including: a feature acquisition module 1502, a feature enhancement module 1504, a similarity calculation module 1506, and a matching result determination module 1508, wherein:
the feature obtaining module 1502 is configured to obtain a video feature to be matched and a reference feature to be matched corresponding to a video to be matched, and obtain a text feature to be matched corresponding to a text to be matched; the reference feature to be matched comprises at least one of an audio feature and an action feature corresponding to the video to be matched.
The feature enhancement module 1504 is used for enhancing the features of the video to be matched based on the reference features to be matched to obtain the reference enhanced video features corresponding to the video to be matched; the reference enhanced video feature includes at least one of a motion enhanced video feature and an audio enhanced video feature.
The similarity calculation module 1506 is configured to perform similarity calculation on the text features to be matched and the video features to be matched and the reference enhanced video features respectively to obtain a similarity set corresponding to the video to be matched and the text to be matched.
The matching result determining module 1508 is configured to determine a matching result between the video to be matched and the text to be matched based on the similarity set corresponding to the video to be matched and the text to be matched.
According to the video text matching device, the video characteristics can provide video image information, the audio characteristics can provide video sound information, the action characteristics can provide video motion information, the matching result between the video to be matched and the text to be matched is determined based on the video characteristics and the reference characteristics corresponding to the video to be matched and the text characteristics corresponding to the text to be matched, the understanding of the video content can be improved by utilizing rich modal information in the video, and therefore the matching accuracy is improved. In addition, feature enhancement and feature guidance are carried out on the video features based on the audio features or the motion features, so that important information in the video can be highlighted, and the understanding of the video content is further improved. Similarity calculation is carried out on the video features and the text features respectively based on the video features and the reference enhanced video features, and matching results are determined based on a similarity set obtained through calculation, so that matching accuracy can be further improved.
In one embodiment, the video text matching device is further configured to:
performing similarity calculation on the video text characteristics corresponding to the video to be matched and the target text characteristics corresponding to the text to be matched to obtain reference similarity between the video text characteristics and the target text characteristics; the video text features comprise at least one of audio text features and image text features, the audio text features are obtained by extracting the features of audio texts extracted from the audio of the video to be matched, and the image text features are obtained by extracting the features of image texts extracted from the image of the video to be matched; and determining a matching result between the video to be matched and the text to be matched based on the similarity set and the reference similarity corresponding to the video to be matched and the text to be matched.
In one embodiment, the video text matching device is further configured to:
calculating initial similarities between the video text features and the target text features to obtain an initial similarity matrix; counting the number of matrix elements in the initial similarity matrix whose values are greater than a preset threshold to obtain a first number; fusing the numbers of text words respectively corresponding to the text to be matched and the video text to obtain a second number, where the video text refers to the text corresponding to the video text features; and obtaining the reference similarity between the video text features and the target text features based on the first number and the second number.
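A minimal sketch of this reference similarity, assuming cosine word-to-word similarities, a fixed threshold, and summation as the fusion of the two word counts (none of which are fixed by the embodiment above):

```python
import numpy as np

def reference_similarity(video_text_feats: np.ndarray,
                         target_text_feats: np.ndarray,
                         threshold: float = 0.8) -> float:
    """video_text_feats: (m, d) features of the words in the video text (text read
    from the audio or the images); target_text_feats: (n, d) features of the words
    in the text to be matched."""
    a = video_text_feats / np.linalg.norm(video_text_feats, axis=1, keepdims=True)
    b = target_text_feats / np.linalg.norm(target_text_feats, axis=1, keepdims=True)
    initial = a @ b.T                                  # initial similarity matrix
    first_number = int((initial > threshold).sum())    # elements above the preset threshold
    second_number = a.shape[0] + b.shape[0]            # fused (here: summed) word counts
    return first_number / second_number                # reference similarity
```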
In one embodiment, the current text is any one of an audio text, an image text and a text to be matched, and the text feature corresponding to the current text is any one of an audio text feature, an image text feature or a target text feature. The video text matching device is further used for:
performing noun extraction on the current text to obtain text nouns; and performing feature extraction on the text nouns to obtain the text features corresponding to the current text.
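For example, noun extraction could be performed with an off-the-shelf part-of-speech tagger; the snippet below uses spaCy purely as an illustrative choice (the embodiment does not name a tool), and `encode` is a hypothetical placeholder for whatever text encoder produces the text features.

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def extract_nouns(text: str) -> list:
    """Keep only the nouns of the current text before feature extraction."""
    return [tok.text for tok in nlp(text) if tok.pos_ in ("NOUN", "PROPN")]

def noun_features(text: str, encode) -> list:
    """`encode` stands in for whatever text encoder produces the text features."""
    return [encode(noun) for noun in extract_nouns(text)]
```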
In one embodiment, the video text matching device is further configured to:
inputting the video features to be matched, the reference features to be matched and the text features to be matched into a target video text matching model to obtain a similarity set corresponding to the video to be matched and the text to be matched; and the target video text matching model is used for carrying out feature enhancement and similarity calculation.
Each of the modules in the video text matching model training device and the video text matching device may be implemented wholly or partially by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke them and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 16. The computer device includes a processor, a memory, an Input/Output interface (I/O for short), and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer equipment is used for storing data such as a training sample pair set, a testing sample pair set, a target video text matching model and the like. The input/output interface of the computer device is used for exchanging information between the processor and an external device. The communication interface of the computer device is used for connecting and communicating with an external terminal through a network. The computer program is executed by a processor to implement a video text matching model training method and a video text matching method.
In one embodiment, a computer device is provided, which may be a terminal, and whose internal structure diagram may be as shown in fig. 17. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input device. The processor, the memory, and the input/output interface are connected through a system bus, and the communication interface, the display unit, and the input device are connected to the system bus through the input/output interface. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The input/output interface of the computer device is used for exchanging information between the processor and an external device. The communication interface of the computer device is used for wired or wireless communication with an external terminal, and the wireless communication may be implemented through WIFI, a mobile cellular network, NFC (near field communication), or other technologies. The computer program is executed by the processor to implement a video text matching model training method and a video text matching method. The display unit of the computer device is used for forming a visually perceptible picture and may be a display screen, a projection device, or a virtual reality imaging device; the display screen may be a liquid crystal display screen or an electronic ink display screen. The input device of the computer device may be a touch layer covering the display screen, a key, a trackball, or a touch pad arranged on the housing of the computer device, or an external keyboard, touch pad, or mouse.
It will be appreciated by those skilled in the art that the structures shown in fig. 16 and fig. 17 are merely block diagrams of the parts of the structures relevant to the solution of the present application and do not constitute a limitation on the computer devices to which the solution of the present application is applied; a particular computer device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps in the above-mentioned method embodiments.
It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, displayed data, etc.) referred to in the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the relevant laws and regulations and standards of the relevant countries and regions.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program instructing relevant hardware; the computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. The volatile memory may include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, the RAM may take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided herein may include at least one of relational and non-relational databases. The non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be, without limitation, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, or the like.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features is not contradictory, it should be considered to fall within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (25)

1. A method for training a video text matching model, the method comprising:
acquiring a training sample pair set; training sample pairs in the training sample pair set comprise positive sample pairs and negative sample pairs matched with the positive sample pairs, the training sample pairs comprise training videos and training texts, and the training videos and the training texts in the positive sample pairs are matched with each other;
inputting video features and reference features corresponding to training videos in the training sample pair set and training text features corresponding to training texts into an initial video text matching model; the reference features comprise at least one of audio features and action features corresponding to the training videos;
based on the reference features corresponding to the same training video, performing feature enhancement on the corresponding video features to obtain the reference enhanced video features corresponding to the training video; the reference enhanced video features comprise at least one of motion enhanced video features and audio enhanced video features;
for the same training sample pair, performing similarity calculation between the training text features corresponding to the training text and each of the video features and the reference enhanced video features corresponding to the training video, to obtain similarity sets respectively corresponding to the training sample pairs;
calculating training loss based on the similarity set corresponding to the positive sample pair and the similarity set corresponding to the matched negative sample pair, and adjusting model parameters of the initial video text matching model based on the training loss until a convergence condition is met to obtain a target video text matching model; the target video text matching model is used for determining a matching result between the video and the text.
2. The method of claim 1, wherein obtaining the set of training sample pairs comprises:
obtaining a plurality of positive sample pairs;
carrying out data recombination on each positive sample pair to obtain a plurality of negative sample pairs;
taking a negative sample pair having data coincident with a positive sample pair as a negative sample pair matched with the positive sample pair;
and obtaining the training sample pair set based on each positive sample pair and each matched negative sample pair.
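A minimal sketch of this recombination, assuming each positive pair is a (video, text) identifier tuple and interpreting coincident data as sharing either the video or the text of the positive pair; identifiers and the dictionary layout are illustrative.

```python
from itertools import product

def build_training_pairs(positive_pairs):
    """positive_pairs: list of matched (video_id, text_id) tuples.  Recombining
    them yields the negative pairs; a negative pair sharing the video or the text
    of a positive pair is treated as one of that positive pair's matched negatives."""
    videos = [v for v, _ in positive_pairs]
    texts = [t for _, t in positive_pairs]
    negatives = set(product(videos, texts)) - set(positive_pairs)
    return {
        pos: [neg for neg in negatives if neg[0] == pos[0] or neg[1] == pos[1]]
        for pos in positive_pairs
    }
```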
3. The method of claim 1, wherein the generating of the motion-enhanced video features comprises:
performing intra-modal attention processing on video features and motion features corresponding to a current training video respectively to obtain self-attention video features and self-attention motion features corresponding to the current training video;
performing inter-modal attention processing on the video features and the self-attention motion features corresponding to the current training video to obtain cross attention video features corresponding to the current training video, and performing inter-modal attention processing on the motion features and the self-attention video features corresponding to the current training video to obtain cross attention motion features corresponding to the current training video;
fusing the cross attention motion features and the cross attention video features corresponding to the current training video to obtain motion video fusion features corresponding to the current training video;
and performing channel attention processing on the motion video fusion features corresponding to the current training video to obtain a first channel attention weight, and performing feature enhancement on the video features corresponding to the current training video based on the first channel attention weight to obtain motion enhanced video features corresponding to the current training video.
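Purely as an illustrative sketch of the structure recited above (not the claimed implementation), the following PyTorch module chains intra-modal self-attention, inter-modal cross attention, fusion, and a channel attention gate over the video features; the use of standard multi-head attention, mean pooling before fusion, and a sigmoid gate are assumptions.

```python
import torch
import torch.nn as nn

class MotionEnhancedVideo(nn.Module):
    """Frame-level video features (B, Tv, d) and motion features (B, Tm, d) in,
    motion enhanced video features (B, Tv, d) out."""

    def __init__(self, d: int, heads: int = 4):
        super().__init__()
        self.self_video = nn.MultiheadAttention(d, heads, batch_first=True)    # intra-modal
        self.self_motion = nn.MultiheadAttention(d, heads, batch_first=True)   # intra-modal
        self.cross_video = nn.MultiheadAttention(d, heads, batch_first=True)   # video attends to motion
        self.cross_motion = nn.MultiheadAttention(d, heads, batch_first=True)  # motion attends to video
        self.channel_gate = nn.Sequential(nn.Linear(2 * d, d), nn.Sigmoid())   # channel attention

    def forward(self, video: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
        sa_video, _ = self.self_video(video, video, video)            # self-attention video features
        sa_motion, _ = self.self_motion(motion, motion, motion)       # self-attention motion features
        ca_video, _ = self.cross_video(video, sa_motion, sa_motion)   # cross attention video features
        ca_motion, _ = self.cross_motion(motion, sa_video, sa_video)  # cross attention motion features
        fused = torch.cat([ca_video.mean(1), ca_motion.mean(1)], -1)  # motion video fusion features
        weight = self.channel_gate(fused).unsqueeze(1)                # first channel attention weight
        return video * weight                                         # motion enhanced video features
```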
4. The method according to claim 3, wherein the intra-modal attention processing is performed on video features and motion features corresponding to a current training video respectively to obtain self-attention video features and self-attention motion features corresponding to the current training video, and the method comprises:
performing intra-modal fusion on the video features corresponding to the current training video to obtain a first self-attention weight, performing fusion on the video features corresponding to the current training video and the first self-attention weight to obtain a first attention feature, and obtaining the self-attention video features based on the video features corresponding to the current training video and the first attention feature;
and performing intra-modal fusion on the motion features corresponding to the current training video to obtain a second self-attention weight, performing fusion on the motion features corresponding to the current training video and the second self-attention weight to obtain a second attention feature, and obtaining the self-attention motion feature based on the motion features corresponding to the current training video and the second attention feature.
5. The method according to claim 3, wherein the inter-modal attention processing is performed on the video feature and the self-attention motion feature corresponding to the current training video to obtain a cross-attention video feature corresponding to the current training video, and the inter-modal attention processing is performed on the motion feature and the self-attention video feature corresponding to the current training video to obtain a cross-attention motion feature corresponding to the current training video, and includes:
splicing the video features corresponding to the current training video and the self-attention motion features to obtain first spliced features, performing inter-modal fusion on the video features corresponding to the current training video and the first spliced features to obtain first cross attention weights, fusing the first spliced features and the first cross attention weights to obtain first cross attention features, and obtaining the cross attention video features based on the first spliced features and the first cross attention features;
and splicing the motion features corresponding to the current training video and the self-attention video features to obtain second spliced features, performing inter-modal fusion on the motion features corresponding to the current training video and the second spliced features to obtain second cross attention weights, fusing the second spliced features and the second cross attention weights to obtain second cross attention features, and obtaining the cross attention motion features based on the second spliced features and the second cross attention features.
6. The method according to claim 3, wherein the fusing the cross attention motion features and the cross attention video features corresponding to the current training video to obtain the motion video fusion features corresponding to the current training video comprises:
splicing the cross attention motion features and the cross attention video features corresponding to the current training video to obtain cross attention splicing features;
fusing the cross attention motion features and the cross attention video features corresponding to the current training video to obtain cross attention fusion features;
and performing inter-modal fusion on the cross attention splicing features and the cross attention fusion features to obtain a third cross attention weight, performing fusion on the cross attention splicing features and the third cross attention weight to obtain a third cross attention feature, and obtaining the motion video fusion features based on the cross attention splicing features and the third cross attention feature.
7. The method of claim 3, wherein the initial video text matching model comprises a motion-enhanced video feature-to-text feature matching network comprising a first intra-modality attention layer, a second intra-modality attention layer, a first inter-modality attention layer, a second inter-modality attention layer, a motion video fusion attention layer, a first channel attention layer, and a first similarity calculation layer;
the first intra-modality attention layer is used for intra-modality attention processing on video features, and the second intra-modality attention layer is used for intra-modality attention processing on motion features;
the first inter-modality attention layer is used for performing inter-modality attention processing on the video feature and the self-attention motion feature, and the second inter-modality attention layer is used for performing inter-modality attention processing on the motion feature and the self-attention video feature;
the motion video fusion attention layer is used for fusing cross attention motion features and cross attention video features corresponding to the same training video;
the first channel attention layer is used for carrying out channel attention processing on the motion video fusion features;
the first similarity calculation layer is used for calculating the similarity between the motion enhanced video features and the training text features.
8. The method of claim 1, wherein the process of generating the audio enhanced video features comprises the steps of:
fusing video features and audio features corresponding to the current training video to obtain initial audio and video fusion features;
carrying out random inactivation treatment and pooling treatment on the initial audio and video fusion characteristics to obtain intermediate audio and video fusion characteristics;
carrying out normalization processing on the intermediate audio and video fusion characteristics to obtain target audio and video fusion characteristics;
and performing channel attention processing on the target audio and video fusion features to obtain a second channel attention weight, and performing feature enhancement on the video features corresponding to the current training video based on the second channel attention weight to obtain audio enhanced video features corresponding to the current training video.
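A sketch of this audio-guided enhancement under assumed choices (concatenation for fusion, mean pooling, L2 normalization, and a sigmoid channel gate); the random inactivation processing is rendered here with standard dropout.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioEnhancedVideo(nn.Module):
    """Frame-level video features (B, T, d) and a clip-level audio feature (B, d) in,
    audio enhanced video features (B, T, d) out."""

    def __init__(self, d: int, p: float = 0.1):
        super().__init__()
        self.fuse = nn.Linear(2 * d, d)          # audio-video fusion (concatenation assumed)
        self.dropout = nn.Dropout(p)             # random inactivation
        self.channel_gate = nn.Sequential(nn.Linear(d, d), nn.Sigmoid())

    def forward(self, video: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        audio_exp = audio.unsqueeze(1).expand_as(video)
        initial = self.fuse(torch.cat([video, audio_exp], dim=-1))    # initial fusion features
        intermediate = self.dropout(initial).mean(dim=1)              # dropout + mean pooling
        target = F.normalize(intermediate, dim=-1)                    # normalized fusion features
        weight = self.channel_gate(target).unsqueeze(1)               # second channel attention weight
        return video * weight                                         # audio enhanced video features
```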
9. The method of claim 8, wherein the initial video text matching model comprises an audio enhanced video feature and text feature matching network comprising an audio video fusion layer, a random inactivation layer, a pooling layer, a normalization layer, a second channel attention layer, and a second similarity calculation layer;
the audio and video fusion layer is used for fusing the video features and the audio features;
the random inactivation layer is used for carrying out random inactivation treatment on input data;
the pooling layer is used for pooling input data;
the normalization layer is used for normalizing input data;
the second channel attention layer is used for carrying out channel attention processing on input data;
the second similarity calculation layer is used for calculating the similarity between the audio enhancement video features and the training text features.
10. The method of claim 1, further comprising:
and when the audio is missing from a training video, acquiring preset features or the video features corresponding to the training video as the audio features corresponding to the training video.
11. The method of claim 1, wherein the similarity set comprises at least two target similarities, a current video feature is any one of a video feature and a reference enhanced video feature corresponding to a training video in a current training sample pair, a current text feature is a training text feature corresponding to a training text in the current training sample pair, the training video comprises a plurality of video frames, and the training text comprises a plurality of text words;
the calculation process of the target similarity between the current video feature and the current text feature comprises:
calculating the initial similarity between the current video feature and the current text feature; the initial similarity comprises sub-similarities between a plurality of text words in the training text and the same video frame respectively, and sub-similarities between a plurality of video frames in the training video and the same text word respectively;
obtaining a text weight based on the current text feature, and obtaining a video weight based on the current video feature;
aiming at the initial similarity, acquiring a maximum value from a plurality of sub-similarities corresponding to the same video frame as a first sub-similarity, and acquiring a maximum value from a plurality of sub-similarities corresponding to the same text word as a second sub-similarity, so as to obtain a first sub-similarity corresponding to each video frame and a second sub-similarity corresponding to each text word;
obtaining first similarity based on each first sub-similarity, and obtaining second similarity based on each second sub-similarity;
fusing the first similarity and the text weight to obtain first fused data, and fusing the second similarity and the video weight to obtain second fused data;
and obtaining the target similarity between the current video feature and the current text feature based on the first fused data and the second fused data.
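A compact sketch of this target similarity, assuming dot products as sub-similarities, mean pooling of the per-frame and per-word maxima, and scalar text/video weights supplied from elsewhere (for example a learned gating layer); none of these choices are fixed by the claim.

```python
import numpy as np

def target_similarity(video_feat: np.ndarray, text_feat: np.ndarray,
                      video_weight: float, text_weight: float) -> float:
    """video_feat: (F, d) per-frame features; text_feat: (W, d) per-word features.
    video_weight / text_weight are scalars assumed to be produced elsewhere from
    the video / text features (e.g. by a learned gating layer)."""
    sub = video_feat @ text_feat.T                  # (F, W) sub-similarities
    first_sub = sub.max(axis=1)                     # best-matching word for each frame
    second_sub = sub.max(axis=0)                    # best-matching frame for each word
    first_similarity = float(first_sub.mean())
    second_similarity = float(second_sub.mean())
    first_fused = first_similarity * text_weight    # fuse the first similarity with the text weight
    second_fused = second_similarity * video_weight # fuse the second similarity with the video weight
    return first_fused + second_fused               # target similarity
```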
12. The method of claim 1, wherein the similarity set comprises at least two target similarities, and wherein calculating the training loss based on the similarity set corresponding to the positive sample pair and the similarity set corresponding to the matched negative sample pair comprises:
determining a target category from each similarity category;
obtaining first similarity weights respectively corresponding to the training sample pairs on the target category based on the target similarity of the positive sample pairs and each negative sample pair containing the same training text as the positive sample pairs on the target category, and obtaining second similarity weights respectively corresponding to the training sample pairs on the target category based on the target similarity of the positive sample pairs and each negative sample pair containing the same training video as the positive sample pairs on the target category;
fusing the target similarity and the first similarity weight of the same training sample pair on the target category to obtain the corresponding first updating similarity of each training sample pair, and fusing the target similarity and the second similarity weight of the same training sample pair on the target category to obtain the corresponding second updating similarity of each training sample pair;
obtaining a first loss based on first updating similarities corresponding to the positive sample pair and each negative sample pair containing the same training video as the positive sample pair, and obtaining a second loss based on second updating similarities corresponding to the positive sample pair and each negative sample pair containing the same training text as the positive sample pair;
obtaining training sub-losses corresponding to the target category based on the first losses and the second losses;
taking a next similarity category as the target category, and returning to the step of obtaining, based on the target similarities of the positive sample pair and each negative sample pair containing the same training text as the positive sample pair on the target category, the first similarity weights respectively corresponding to the training sample pairs on the target category, until the training sub-losses respectively corresponding to the similarity categories are determined;
and obtaining the training loss based on the training sub-losses respectively corresponding to the similarity categories.
13. The method of claim 12, wherein obtaining first similarity weights respectively corresponding to the training sample pairs on the target class based on the target similarities of the positive sample pairs and the negative sample pairs containing the same training text as the positive sample pairs on the target class, and obtaining second similarity weights respectively corresponding to the training sample pairs on the target class based on the target similarities of the positive sample pairs and the negative sample pairs containing the same training video as the positive sample pairs on the target class comprises:
obtaining a first similarity matrix based on the target similarities of the training sample pairs on the target category; a first dimension of the first similarity matrix represents the target similarities, on the target category, of the training sample pairs containing the same training video, a second dimension of the first similarity matrix represents the target similarities, on the target category, of the training sample pairs containing the same training text, and a diagonal of the first similarity matrix represents the target similarities of the positive sample pairs on the target category;
generating a second matrix element corresponding to the current matrix element based on the current matrix element in the first similarity matrix and a forward matrix element of the current matrix element in a second dimension, and generating a second similarity matrix based on the second matrix element corresponding to each matrix element in the first similarity matrix;
generating a third matrix element corresponding to the current matrix element based on the current matrix element in the first similarity matrix and a forward matrix element of the current matrix element in the first dimension, and generating a third similarity matrix based on the third matrix element corresponding to each matrix element in the first similarity matrix;
adjusting each matrix element in the second similarity matrix based on the target similarity of the positive sample pair on the target category to obtain a fourth similarity matrix, and adjusting each matrix element in the third similarity matrix based on the target similarity of the positive sample pair on the target category to obtain a fifth similarity matrix; the fourth similarity matrix represents the first similarity weights respectively corresponding to the training sample pairs on the target category, and the fifth similarity matrix represents the second similarity weights respectively corresponding to the training sample pairs on the target category.
14. The method of claim 12, wherein obtaining a first loss based on a first update similarity between the positive sample pair and each negative sample pair containing the same training video as the positive sample pair, and obtaining a second loss based on a second update similarity between the positive sample pair and each negative sample pair containing the same training text as the positive sample pair comprises:
fusing the positive sample pairs and first updating similarities corresponding to negative sample pairs containing the same training video with the positive sample pairs to obtain first similarity statistic values corresponding to the positive sample pairs, obtaining first sub-losses corresponding to the positive sample pairs based on the first updating similarities and the first similarity statistic values corresponding to the positive sample pairs, and obtaining the first losses based on the first sub-losses;
and fusing the positive sample pairs and second updating similarities corresponding to each negative sample pair containing the same training text with the positive sample pairs to obtain second similarity statistic values corresponding to each positive sample pair, obtaining second sub-losses corresponding to each positive sample pair based on the second updating similarity and the second similarity statistic value corresponding to the same positive sample pair, and obtaining the second loss based on the second sub-losses.
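One plausible reading of these first and second losses is a bidirectional contrastive objective; the sketch below assumes log-sum-exp as the similarity statistic value and a cross-entropy style sub-loss, which the claim itself does not fix, and it returns the sum of the two directions as the training sub-loss of one category.

```python
import numpy as np

def bidirectional_contrastive_loss(update_sim: np.ndarray) -> float:
    """update_sim: (N, N) updated similarities for a batch; entry (i, j) scores the
    i-th training video against the j-th training text, so the diagonal holds the
    positive pairs and each row / column holds the negatives sharing that video / text."""
    pos = np.diag(update_sim)
    row_stat = np.log(np.exp(update_sim).sum(axis=1))   # statistic over same-video negatives
    col_stat = np.log(np.exp(update_sim).sum(axis=0))   # statistic over same-text negatives
    first_loss = float(np.mean(row_stat - pos))
    second_loss = float(np.mean(col_stat - pos))
    return first_loss + second_loss                     # training sub-loss for this category
```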
15. The method according to any one of claims 1 to 14, wherein the initial video text matching model comprises a video feature and text feature matching network and a reference enhanced video feature and text feature matching network, the reference enhanced video feature and text feature matching network comprises at least one of a motion enhanced video feature and text feature matching network and an audio enhanced video feature and text feature matching network, and the training loss comprises training sub-losses respectively corresponding to the networks;
adjusting model parameters of the initial video text matching model based on the training loss until a convergence condition is met to obtain a target video text matching model, wherein the method comprises the following steps:
and respectively adjusting the model parameters of the corresponding networks in the initial video text matching model based on the corresponding training sub-losses until the convergence conditions corresponding to the networks are met, to obtain the target video text matching model.
16. The method according to any one of claims 1 to 14, wherein the training loss comprises a first training sub-loss corresponding to each similarity class;
adjusting model parameters of the initial video text matching model based on the training loss until a convergence condition is met to obtain a target video text matching model, wherein the method comprises the following steps:
acquiring a minimum value from each first training sub-loss as a first target sub-loss;
obtaining loss contribution degrees respectively corresponding to the remaining training sub-losses based on the differences between the first target sub-loss and the remaining training sub-losses;
obtaining loss weights respectively corresponding to the remaining training sub-losses based on the loss contribution degrees respectively corresponding to the remaining training sub-losses;
fusing the first training sub-losses based on the loss weights respectively corresponding to the first training sub-losses to obtain a first target loss; the loss weight corresponding to the first target sub-loss is a preset weight;
and adjusting the model parameters of the initial video text matching model based on the first target loss until a convergence condition is met, so as to obtain the target video text matching model.
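A sketch of this weighting scheme; mapping the gap between a remaining sub-loss and the first target sub-loss to a weight via exp(-gap / temperature) is an assumption, since the claim does not specify the exact contribution function.

```python
import numpy as np

def first_target_loss(sub_losses, preset_weight: float = 1.0,
                      temperature: float = 1.0) -> float:
    """sub_losses: first training sub-losses, one per similarity category."""
    losses = np.asarray(sub_losses, dtype=float)
    idx = int(np.argmin(losses))                         # first target sub-loss
    gaps = losses - losses[idx]                          # difference to the remaining sub-losses
    weights = np.exp(-gaps / temperature)                # loss contribution degrees as weights (assumed)
    weights[idx] = preset_weight                         # preset weight for the target sub-loss
    return float((weights * losses).sum())               # fused first target loss
```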
17. The method according to any one of claims 1 to 14, wherein the training loss comprises a second training sub-loss respectively corresponding to each positive sample pair on each similarity class;
adjusting model parameters of the initial video text matching model based on the training loss until a convergence condition is met to obtain a target video text matching model, wherein the method comprises the following steps:
for the same positive sample pair, acquiring a minimum value from the second training sub-losses as a second target sub-loss, to obtain second target sub-losses respectively corresponding to the positive sample pairs;
obtaining a second target loss based on a statistical value of the second target sub-losses;
and adjusting the model parameters of the initial video text matching model based on the second target loss until a convergence condition is met, so as to obtain the target video text matching model.
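A short sketch of this selection, assuming the mean over the positive sample pairs as the statistical value.

```python
import numpy as np

def second_target_loss(sub_losses) -> float:
    """sub_losses: (P, C) second training sub-losses, one row per positive sample
    pair and one column per similarity category."""
    losses = np.asarray(sub_losses, dtype=float)
    per_pair = losses.min(axis=1)        # second target sub-loss of each positive pair
    return float(per_pair.mean())        # second target loss (mean as the statistic, assumed)
```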
18. The method according to any one of claims 1 to 14, further comprising:
inputting video characteristics and reference characteristics corresponding to a test video in a test sample pair set and test text characteristics corresponding to a test text into a target video text matching model to obtain a similarity set corresponding to each test sample pair in the test sample pair set;
generating similarity test matrices respectively corresponding to the similarity categories based on the similarity sets respectively corresponding to the test sample pairs;
determining, for each test sample pair, prediction matching sub-ranks respectively corresponding to the similarity categories based on the similarity test matrices respectively corresponding to the similarity categories;
determining a prediction matching rank based on the prediction matching sub-ranks corresponding to the same test sample pair to obtain the prediction matching rank respectively corresponding to each test sample pair;
and determining the prediction accuracy corresponding to the target video text matching model based on the prediction matching ranks corresponding to the matched sample pairs among the test sample pairs.
19. A method for matching video text, the method comprising:
acquiring video features to be matched and reference features to be matched which correspond to videos to be matched, and acquiring text features to be matched which correspond to texts to be matched; the reference feature to be matched comprises at least one of an audio feature and an action feature corresponding to the video to be matched;
based on the reference features to be matched, performing feature enhancement on the video features to be matched to obtain reference enhanced video features corresponding to the video to be matched; the reference enhanced video features comprise at least one of motion enhanced video features and audio enhanced video features;
performing similarity calculation between the text features to be matched and each of the video features to be matched and the reference enhanced video features, to obtain a similarity set corresponding to the video to be matched and the text to be matched;
and determining a matching result between the video to be matched and the text to be matched based on the similarity set corresponding to the video to be matched and the text to be matched.
20. The method of claim 19, further comprising:
performing similarity calculation on the video text features corresponding to the video to be matched and the target text features corresponding to the text to be matched to obtain a reference similarity between the video text features and the target text features; the video text features comprise at least one of audio text features and image text features, the audio text features are obtained by performing feature extraction on an audio text extracted from the audio of the video to be matched, and the image text features are obtained by performing feature extraction on an image text extracted from the images of the video to be matched;
and determining a matching result between the video to be matched and the text to be matched based on the similarity set corresponding to the video to be matched and the text to be matched and the reference similarity.
21. An apparatus for training a video text matching model, the apparatus comprising:
the training sample pair set acquisition module is used for acquiring a training sample pair set; training sample pairs in the training sample pair set comprise positive sample pairs and negative sample pairs matched with the positive sample pairs, the training sample pairs comprise training videos and training texts, and the training videos and the training texts in the positive sample pairs are matched with each other;
the characteristic input module is used for inputting the video characteristics corresponding to the training videos, the reference characteristics and the training text characteristics corresponding to the training texts in the training sample pair set into an initial video text matching model; the reference features comprise at least one of audio features and action features corresponding to the training videos;
the feature enhancement module is used for carrying out feature enhancement on corresponding video features based on the reference features corresponding to the same training video to obtain the reference enhancement video features corresponding to the training video; the reference enhanced video features comprise at least one of motion enhanced video features and audio enhanced video features;
the similarity calculation module is used for, for the same training sample pair, performing similarity calculation between the training text features corresponding to the training text and each of the video features and the reference enhanced video features corresponding to the training video, to obtain a similarity set respectively corresponding to each training sample pair;
the model adjusting module is used for calculating training loss based on the similarity set corresponding to the positive sample pair and the similarity set corresponding to the matched negative sample pair, adjusting model parameters of the initial video text matching model based on the training loss until a convergence condition is met, and obtaining a target video text matching model; the target video text matching model is used for determining a matching result between the video and the text.
22. A video text matching apparatus, characterized in that the apparatus comprises:
the characteristic acquisition module is used for acquiring the video characteristics to be matched and the reference characteristics to be matched corresponding to the video to be matched and acquiring the text characteristics to be matched corresponding to the text to be matched; the reference feature to be matched comprises at least one of an audio feature and an action feature corresponding to the video to be matched;
the characteristic enhancement module is used for carrying out characteristic enhancement on the video characteristic to be matched based on the reference characteristic to be matched to obtain a reference enhanced video characteristic corresponding to the video to be matched; the reference enhanced video features comprise at least one of motion enhanced video features and audio enhanced video features;
the similarity calculation module is used for performing similarity calculation on the text features to be matched with the video features to be matched and the reference enhanced video features respectively to obtain a similarity set corresponding to the video to be matched and the text to be matched;
and the matching result determining module is used for determining a matching result between the video to be matched and the text to be matched based on the similarity set corresponding to the video to be matched and the text to be matched.
23. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 20.
24. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 20.
25. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the method of any one of claims 1 to 20 when executed by a processor.
Priority Applications (1)

CN202210868349.1A, priority date 2022-07-22, filing date 2022-07-22: Video text matching model training method and device and video text matching method and device

Publications (1)

CN115204301A, publication date 2022-10-18

Family ID: 83584640

Country Status (1)

CN: CN115204301A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115982403A (en) * 2023-01-12 2023-04-18 之江实验室 Multi-mode hash retrieval method and device
CN115982403B (en) * 2023-01-12 2024-02-02 之江实验室 Multi-mode hash retrieval method and device
CN116208824A (en) * 2023-02-07 2023-06-02 腾讯音乐娱乐科技(深圳)有限公司 Title generation method, computer device, storage medium, and computer program product
WO2024192937A1 (en) * 2023-03-20 2024-09-26 山东大学 Video retrieval method based on scene information

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination