CN115952317A - Video processing method, device, equipment, medium and program product - Google Patents


Info

Publication number
CN115952317A
Authority
CN
China
Prior art keywords
feature
video
text
sample
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210823046.8A
Other languages
Chinese (zh)
Inventor
黄靖佳
李毅男
冯佳时
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd filed Critical Beijing Zitiao Network Technology Co Ltd
Priority to CN202210823046.8A priority Critical patent/CN115952317A/en
Publication of CN115952317A publication Critical patent/CN115952317A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides a video processing method, apparatus, device, storage medium, and program product. The method comprises the following steps: acquiring an input text; performing feature extraction on the input text based on the multi-modal fusion characteristics of a first model to obtain a text feature with multi-modal characteristics, the first model having a multi-modal fusion characteristic of fusing a video modality and a text modality; searching a video feature set for a target video feature matched with the text feature based on the text feature with multi-modal characteristics; and outputting a target video corresponding to the target video feature.

Description

Video processing method, device, equipment, medium and program product
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a video processing method, apparatus, device, medium, and program product.
Background
In current video search scenarios, it is often necessary to rely on the labels or tags attached to video data. For example, a keyword input by a user is string-matched against the labels or tags of the video data to obtain a label or tag matching the keyword, and the video corresponding to the matched label or tag is then returned to the user as a search result; alternatively, text corresponding to a video is searched for based on the label or tag of the video. However, the labels or tags of video data often cannot fully express the content of the video data, some labels or tags are inaccurate, and some video data have no labels or tags at all. As a result, a search task for video data depends excessively on external descriptions of the video data, and the incompleteness and inaccuracy of those descriptions lead to low accuracy of the search results obtained when video data is searched based on text, or when text is searched based on video data, thereby degrading the user experience.
Disclosure of Invention
The present disclosure provides a video processing method, apparatus, device, storage medium, and program product, so as to address, at least to some extent, the technical problem that search results for video data are of low accuracy.
In a first aspect of the present disclosure, a video processing method is provided, including:
acquiring an input text;
performing feature extraction on the input text based on the multi-modal fusion characteristics of the first model to obtain text features with multi-modal characteristics; the first model has a multi-modal fusion characteristic of fusing a video modality and a text modality;
searching a video feature set for target video features matched with the text features based on the text features with multi-modal characteristics;
and outputting the target video corresponding to the target video characteristics.
In a second aspect of the present disclosure, a video processing method is provided, including:
acquiring video data to be processed;
performing feature extraction on the video data based on the multi-modal fusion characteristics of the first model to obtain video features with multi-modal characteristics; the first model has a multi-modal fusion characteristic of fusing a video modality and a text modality;
searching a target text feature matched with the video feature in a text feature set based on the video feature with the multi-modal characteristic;
and generating and outputting a target text based on the target text characteristics.
In a third aspect of the present disclosure, there is provided a video processing apparatus comprising:
the first acquisition module is used for acquiring an input text;
the first model module is used for carrying out feature extraction on the input text based on the multi-modal fusion characteristics of the first model to obtain text features with multi-modal characteristics; the first model has a multi-mode fusion characteristic of fusing a video mode and a text mode; searching a video feature set for a target video feature matched with the text feature based on the text feature with the multi-modal characteristic; and outputting the target video corresponding to the target video characteristics.
In a fourth aspect of the present disclosure, there is provided a video processing apparatus including:
the second acquisition module is used for acquiring video data to be processed;
the second model module is used for carrying out feature extraction on the video data based on the multi-modal fusion characteristics of the first model to obtain video features with multi-modal characteristics; the first model has a multi-modal fusion characteristic of fusing a video modality and a text modality; searching a text feature set for a target text feature matched with the video feature based on the video feature with the multi-modal characteristic; and generating and outputting a target text based on the target text feature.
In a fifth aspect of the present disclosure, an electronic device is provided, which includes one or more processors, a memory; and one or more programs, wherein the one or more programs are stored in the memory and executed by the one or more processors, the programs comprising instructions for performing the method according to the first or second aspects.
A sixth aspect of the disclosure provides a non-transitory computer readable storage medium containing a computer program which, when executed by one or more processors, causes the processors to perform the method of the first or second aspect.
A seventh aspect of the present disclosure provides a computer program product comprising computer program instructions which, when run on a computer, cause the computer to perform the method of the first aspect.
From the above, according to the video processing method, apparatus, device, medium, and program product provided by the present disclosure, a text feature with multi-modal characteristics is obtained by performing feature extraction on the input text based on the first model, and matching is performed in a video feature set to obtain a target video feature, so as to obtain the corresponding target video; or a video feature with multi-modal characteristics is obtained by performing feature extraction on video data based on the first model, and matching is performed in a text feature set to obtain a target text feature, so as to obtain the corresponding target text. Based on the multi-modal characteristics of the video data, the accuracy of searching videos by text or searching text by video can be improved without relying on the labels or tags of the video data.
Drawings
In order to clearly illustrate the technical solutions of the present disclosure and the related art, the drawings used in the description of the embodiments or the related art are briefly introduced below. Obviously, the drawings in the following description show only embodiments of the present disclosure, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a video processing architecture according to an embodiment of the disclosure.
Fig. 2 is a hardware configuration diagram of an exemplary electronic device according to an embodiment of the present disclosure.
FIG. 3 is a schematic diagram of a model architecture of a multi-modal model according to an embodiment of the disclosure.
Fig. 4 is a schematic diagram of feature alignment training of an embodiment of the present disclosure.
Fig. 5 is a schematic diagram of ranking training according to an embodiment of the disclosure.
Fig. 6 is a schematic flow chart of a video processing method according to an embodiment of the disclosure.
Fig. 7 is a schematic flow chart of a video processing method according to an embodiment of the disclosure.
Fig. 8 is a schematic diagram of a video processing apparatus according to an embodiment of the disclosure.
Fig. 9 is a schematic diagram of a video processing apparatus according to an embodiment of the disclosure.
Detailed Description
For the purpose of promoting a better understanding of the objects, aspects and advantages of the present disclosure, reference is made to the following detailed description taken in conjunction with the accompanying drawings.
It is to be noted that technical terms or scientific terms used in the embodiments of the present disclosure should have a general meaning as understood by those having ordinary skill in the art to which the present disclosure belongs, unless otherwise defined. The use of "first," "second," and similar terms in the embodiments of the disclosure is not intended to indicate any order, quantity, or importance, but rather to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
Searching for videos based on text may refer to searching a video set for a target video associated with the text, for example, searching a video database provided by an application according to a keyword input by a user. Searching for text based on a video may refer to searching a text collection for words matching the video, such as generating descriptive text (e.g., an abstract or a title) about the video, or generating answers to questions about the video. Existing data processing models often need to rely on the labels or tags of video data when processing video search tasks. However, the labels or tags of video data often cannot fully represent the content of the video data, and some video data have no labels or tags at all, which results in low accuracy of the search results obtained by such data processing models. Therefore, how to improve the accuracy of a data processing model when processing a video data search task has become a technical problem that urgently needs to be solved.
In view of this, the embodiments of the present disclosure provide a video processing method, apparatus, device, storage medium, and program product. Feature extraction is performed on an input text based on a first model to obtain a text feature with multi-modal characteristics, and matching is performed in a video feature set to obtain a target video feature, so as to obtain the corresponding target video; or feature extraction is performed on video data based on the first model to obtain a video feature with multi-modal characteristics, and matching is performed in a text feature set to obtain a target text feature, so as to obtain the corresponding target text. Based on the multi-modal characteristics of the video data, the accuracy of searching videos by text or searching text by video can be improved without relying on the labels or tags of the video data.
Fig. 1 shows a schematic diagram of a video processing architecture of an embodiment of the present disclosure. Referring to fig. 1, the video processing architecture 100 may include a server 110, a terminal 120, and a network 130 that provides a communication link. The server 110 and the terminal 120 may be connected through a wired or wireless network 130. The server 110 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, network service, cloud communication, middleware service, security service, and CDN.
The terminal 120 may be implemented in hardware or in software. For example, when the terminal 120 is implemented in hardware, it may be any of various electronic devices having a display screen and supporting page display, including but not limited to a smart phone, a tablet computer, an e-book reader, a laptop computer, a desktop computer, and the like. When the terminal 120 is implemented in software, it may be installed in the electronic devices listed above; it may be implemented as a plurality of pieces of software or software modules (for example, software or software modules for providing distributed services), or as a single piece of software or software module, which is not specifically limited herein.
It should be noted that the video processing method provided in the embodiments of the present application may be executed by the terminal 120 or by the server 110. It should also be understood that the numbers of terminals, networks, and servers in fig. 1 are merely illustrative. There may be any number of terminals, networks, and servers, as required by an implementation.
Fig. 2 shows a hardware structure diagram of an exemplary electronic device 200 provided by the embodiment of the present disclosure. As shown in fig. 2, the electronic device 200 may include: a processor 202, a memory 204, a network module 206, a peripheral interface 208, and a bus 210. The processor 202, the memory 204, the network module 206, and the peripheral interface 208 are communicatively coupled to each other within the electronic device 200 via a bus 210.
The processor 202 may be a central processing unit (CPU), an image processor, a neural network processor (NPU), a microcontroller (MCU), a programmable logic device, a digital signal processor (DSP), an application specific integrated circuit (ASIC), or one or more integrated circuits. The processor 202 may be used to perform functions related to the techniques described in this disclosure. In some embodiments, the processor 202 may also include multiple processors integrated into a single logic component. For example, as shown in FIG. 2, the processor 202 may include a plurality of processors 202a, 202b, and 202c.
The memory 204 may be configured to store data (e.g., instructions, computer code, etc.). As shown in fig. 2, the data stored by the memory 204 may include program instructions (e.g., for implementing the video processing method of the disclosed embodiments) as well as data to be processed (e.g., the memory may store configuration files for other modules, etc.). The processor 202 may also access the program instructions and data stored in the memory 204 and execute the program instructions to operate on the data to be processed. The memory 204 may include a volatile memory device or a non-volatile memory device. In some embodiments, the memory 204 may include random access memory (RAM), read-only memory (ROM), optical disks, magnetic disks, hard disks, solid state disks (SSDs), flash memory, memory sticks, and the like.
The network module 206 may be configured to provide the electronic device 200 with communication with other external devices via a network. The network may be any wired or wireless network capable of transmitting and receiving data. For example, the network may be a wired network, a local wireless network (e.g., Bluetooth, WiFi, near field communication (NFC), etc.), a cellular network, the Internet, or a combination of the above. It is to be understood that the type of the network is not limited to the specific examples described above. In some embodiments, the network module 206 may include any combination of any number of network interface controllers (NICs), radio frequency modules, transceivers, modems, routers, gateways, adapters, cellular network chips, and the like.
The peripheral interface 208 may be configured to connect the electronic apparatus 200 with one or more peripheral devices to enable input and output of information. For example, the peripheral devices may include input devices such as keyboards, mice, touch pads, touch screens, microphones, various sensors, and output devices such as displays, speakers, vibrators, indicator lights, and the like.
The bus 210 may be configured to transfer information between various components of the electronic device 200 (e.g., the processor 202, the memory 204, the network module 206, and the peripheral interface 208), such as an internal bus (e.g., a processor-memory bus), an external bus (a USB port, a PCI-E bus), and so forth.
It should be noted that although the architecture of the electronic device 200 described above only shows the processor 202, the memory 204, the network module 206, the peripheral interface 208 and the bus 210, in a specific implementation, the architecture of the electronic device 200 may also include other components necessary for normal operation. Furthermore, it will be understood by those skilled in the art that the architecture of the electronic device 200 described above may also include only the components necessary to implement the embodiments of the present disclosure, and not necessarily all of the components shown in the figures.
Video data may include multi-modal data such as images, sounds, or words, and the video data itself contains more information than a single text modality such as a label or tag of the video data. Making use of the multi-modal data in video data can facilitate video-language understanding tasks such as video retrieval and video question answering and improve the accuracy of data processing. However, when facing different video processing tasks, different model architectures are usually adopted for different specific tasks, which severely limits model design. A model designed for one task is often difficult to migrate to another task, and the same video processing model cannot guarantee good efficiency and performance across different tasks.
At present, video-text pre-training aims to train a multi-modal model with strong generalization on large-scale video-text samples, so as to better solve challenging video-language understanding tasks such as video-text retrieval and video question answering. When facing different downstream tasks, current methods tend to adopt different model architectures. For example, when designing a pre-training model for a video-text retrieval task, considering the operating efficiency at the application stage, two mutually independent encoders are generally used to construct the pre-training model and strengthen feature alignment between the different modalities; when designing a pre-training model for a video question-answering task, considering that the task places high requirements on the video-text fusion representation, a cross-modal encoder is usually adopted to improve the multi-modal fusion capability of the model. Such downstream-task-oriented models are heavily constrained in design and are difficult to migrate to multiple different downstream tasks at the same time, which greatly reduces the usefulness of the pre-trained models in different real-world scenarios.
Although there are pre-training models that can accommodate multiple downstream tasks, these pre-training models have several problems. For example, some pre-training models combine a single-modal encoder with a cross-modal encoder and apply the cross-modal encoder to retrieval or question-answering tasks. However, when such pre-trained models are migrated to a retrieval task, the query must be matched exhaustively, one by one, against the data in the database, with the cross-modal encoder computing a matching score for every pair. The computational complexity of this process is high, e.g., O(NM), where N and M are the numbers of queries and database entries, respectively. This makes such pre-trained models difficult to apply to realistic large-scale retrieval tasks. Other pre-training models are simply formed by stacking a single-modal encoder and a cross-modal encoder, with different modules used when migrating to a downstream task. Compared with the former class, these models can exploit their flexibility more fully: their complexity in the retrieval task is O(N + M), which guarantees efficiency on downstream tasks. However, with simple stacking alone, training cannot make the two types of encoders advance cooperatively; that is, the cross-modal alignment capability and the cross-modal fusion capability of the model cannot promote each other, so the performance of such pre-trained models degrades compared with models trained independently for each downstream task. Therefore, the embodiments of the present disclosure further provide a pre-training method for a multi-modal data model, so that the resulting pre-trained model can combine high efficiency and high performance across various downstream tasks. On this basis, further training is performed on a downstream task to obtain a multi-modal data model, which can improve search accuracy when processing a video data search task.
Referring to FIG. 3, FIG. 3 shows a schematic diagram of a model architecture of a multi-modal data model according to an embodiment of the disclosure. In fig. 3, the multi-modal data model 300 may include a video encoder 310, a text encoder 320, and a multi-modal encoder 330. The input to the video encoder 310 may be a sequence of flattened video blocks. The sequence of video blocks may include a plurality of video block packets, each video block packet corresponding to a video frame in the video. A video block packet may be the set of video blocks into which a video frame is partitioned according to a preset size. The output of the video encoder 310 is a sequence of video vector representations (embeddings) corresponding to the video blocks. The input to the text encoder 320 is a sequence of text tokens, and the output is the text vector representations corresponding to the input tokens. The multi-modal encoder 330 takes as input a multi-modal vector representation sequence in which the video vector representation sequence output by the video encoder 310 and the text vector representations output by the text encoder 320 are fused, and outputs a fused feature sequence modeled by a self-attention mechanism.
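As a non-limiting illustration, the three-encoder layout of FIG. 3 could be sketched in PyTorch as follows. The module names, dimensions, and the use of standard transformer encoders are assumptions made for clarity and are not taken from the patent text.

```python
import torch
import torch.nn as nn

class MultiModalDataModel(nn.Module):
    def __init__(self, dim=512, n_heads=8, n_layers=4,
                 vocab_size=30522, patch_dim=3 * 16 * 16):
        super().__init__()

        def encoder():
            layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
            return nn.TransformerEncoder(layer, n_layers)

        self.patch_embed = nn.Linear(patch_dim, dim)      # flattened video blocks -> vectors
        self.token_embed = nn.Embedding(vocab_size, dim)  # text tokens -> vectors
        self.video_encoder = encoder()       # plays the role of video encoder 310
        self.text_encoder = encoder()        # plays the role of text encoder 320
        self.multimodal_encoder = encoder()  # plays the role of multi-modal encoder 330

    def encode_video(self, video_blocks):    # (B, num_blocks, patch_dim)
        return self.video_encoder(self.patch_embed(video_blocks))

    def encode_text(self, token_ids):        # (B, num_tokens)
        return self.text_encoder(self.token_embed(token_ids))

    def fuse(self, video_seq, text_seq):
        # concatenate the two vector-representation sequences and model them
        # jointly with self-attention to obtain the fused feature sequence
        return self.multimodal_encoder(torch.cat([video_seq, text_seq], dim=1))
```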
The initial multi-modal data model may be pre-trained based on the first training samples to obtain a pre-trained model based on the multi-modal data. In some embodiments, the first training samples may include at least one video text pair, a video mask sample corresponding to a video sample in the video text pair, and a text mask sample corresponding to a text sample in the video text pair. For example, a video text pair < V, T > includes a video sample V and a corresponding text sample T, which may be used to describe the content of the video sample V. Videos that are pre-labeled or matched with corresponding text may be considered as video-text pairs.
In some embodiments, the video mask samples may be derived from the video samples and a preset video mask policy. Further, obtaining a video mask sample based on the video sample and a preset video mask policy includes: randomly selecting a first preset proportion of video blocks in each video block packet of the video sample for masking, so as to generate the video mask sample. For example, a first preset proportion of the video blocks of each video frame of the video sample 311 may be randomly selected in the spatial domain as mask regions, and the pixel values of these mask regions are set to 0, i.e., set to black, so as to obtain a video sample with partially masked regions as the video mask sample 312, as shown in fig. 3.
In some embodiments, the text mask samples may be derived from the text samples and a preset text mask policy. Further, obtaining a text mask sample based on the text sample and a preset text mask policy includes: randomly selecting a second preset proportion of words of preset types in the text sample for masking, so as to generate the text mask sample. The preset types of words may include at least one of verbs, nouns, or adjectives. Because the text sample carries semantics, auxiliary words are not masked in order to avoid reversing the semantics of the text sample; this ensures that the text sample keeps accurate semantics and further improves the accuracy of the pre-training model. For example, in fig. 3, the text sample 321 is "a black swan is swimming in a pond", which includes 8 words; with a second preset proportion of, e.g., 30%, about 2 words are masked, so 2 words of the preset types in the text sample 321 may be masked at random, and the text "A black swan is [mask] in a [mask]" is used as the text mask sample 322.
It should be understood that the number of video blocks corresponding to the first preset proportion and the number of words corresponding to the second preset proportion may not be integers and may be rounded to integers as needed for masking, which is not limited herein.
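A lightweight sketch of the two masking policies is given below, assuming each video frame is supplied as a list of block tensors and that a part-of-speech tagger is available for the text; the concrete ratios (25% and 30%) and all function names are illustrative assumptions.

```python
import random
import torch

def mask_video_blocks(block_packets, first_ratio=0.25):
    """block_packets: one list of video-block tensors per video frame.
    Randomly sets a `first_ratio` share of the blocks in every packet to zero
    (i.e., black), mirroring the video mask policy described above."""
    masked = []
    for packet in block_packets:
        k = max(1, round(len(packet) * first_ratio))
        chosen = set(random.sample(range(len(packet)), k))
        masked.append([torch.zeros_like(b) if i in chosen else b
                       for i, b in enumerate(packet)])
    return masked

def mask_text(words, pos_tags, second_ratio=0.3, maskable=("VERB", "NOUN", "ADJ")):
    """Masks roughly `second_ratio` of the words whose part of speech is a verb,
    noun, or adjective; auxiliary words are never masked, so the sentence keeps
    its semantics. `pos_tags` is assumed to come from any POS tagger."""
    candidates = [i for i, tag in enumerate(pos_tags) if tag in maskable]
    k = min(len(candidates), max(1, round(len(words) * second_ratio)))
    chosen = set(random.sample(candidates, k)) if candidates else set()
    return [("[mask]" if i in chosen else w) for i, w in enumerate(words)]
```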
In fig. 3, the video sample 311 and the video mask sample 312 may be input to the video encoder 310 for feature extraction, so as to obtain a corresponding video feature sequence Ve and video mask feature sequence Vm. The text sample 321 and the text mask sample 322 may be input to the text encoder 320 for feature extraction, so as to obtain a corresponding text feature sequence Te and text mask feature sequence Tm. Since the video mask feature sequence Vm and the text mask feature sequence Tm are extracted after a masking operation, the feature information they contain is incomplete, so they belong to pseudo feature samples, while the video feature sequence Ve and the text feature sequence Te belong to positive feature samples. The pseudo feature samples and the positive feature samples can be cross-matched to obtain two pseudo-positive feature sample pairs, <Ve, Tm> and <Vm, Te>, which are both input to the multi-modal encoder 330 to obtain a first fusion modality feature sequence M_Tmf and a second fusion modality feature sequence M_Vmf.
In order to promote the cooperative progress of the multi-modal alignment and multi-modal fusion capabilities of the pre-training model, tri-modal feature alignment training can be performed based on the video feature sequence Ve, the video mask feature sequence Vm, the text feature sequence Te, the text mask feature sequence Tm, and the first and second fusion modality feature sequences M_Tmf and M_Vmf. The three modalities are the text modality, the video modality, and the fusion modality: the text modality refers to the text feature sequence Te and the text mask feature sequence Tm, the video modality refers to the video feature sequence Ve and the video mask feature sequence Vm, and the fusion modality refers to the fusion modality feature sequences M_Tmf and M_Vmf. Multi-modal feature alignment is used to determine correspondences between features of different modalities from the same instance, which may be in the time dimension, e.g., aligning video, speech, and subtitle information of three different modalities based on time. Multi-modal feature fusion is used to combine different modal features for a target task (e.g., prediction, classification, regression, or the like). Fig. 4 shows a schematic diagram of feature alignment training according to an embodiment of the present disclosure.
The first training sample may include N (N is a positive integer) video-text pairs and the corresponding N video mask samples Vm and N text mask samples Tm, so that N training feature samples are obtained, where each training feature sample includes a video feature sequence Ve, a video mask feature sequence Vm, a text feature sequence Te, a text mask feature sequence Tm, and fusion modality feature sequences M_Tmf and M_Vmf. The pre-training model may be trained on the training feature samples in batches, where the number of training feature samples in each batch may be B (B is a positive integer). For example, each batch of training feature samples may include B positive sample feature pairs <Ve, Te>, B pseudo-positive feature sample pairs <Ve, Tm>, and B pseudo-positive feature sample pairs <Vm, Te>.
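As a non-limiting illustration, one batch of these feature sequences could be produced as follows, reusing the MultiModalDataModel sketched earlier; all variable and function names are illustrative.

```python
def build_training_features(model, video_blocks, masked_video_blocks,
                            token_ids, masked_token_ids):
    """Produces the six feature sequences used by the alignment and ranking
    losses described below; `model` is the MultiModalDataModel sketch above."""
    Ve = model.encode_video(video_blocks)          # video feature sequence
    Vm = model.encode_video(masked_video_blocks)   # video mask feature sequence
    Te = model.encode_text(token_ids)              # text feature sequence
    Tm = model.encode_text(masked_token_ids)       # text mask feature sequence
    M_Tmf = model.fuse(Ve, Tm)   # first fusion modality features, pair <Ve, Tm>
    M_Vmf = model.fuse(Vm, Te)   # second fusion modality features, pair <Vm, Te>
    return Ve, Vm, Te, Tm, M_Tmf, M_Vmf
```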
In some embodiments, calculating a first feature alignment loss function for the first pair of pseudo-positive feature samples may include:
calculating a first sub-loss function of the first pseudo-positive feature sample pair with the video feature as an anchor point and a second sub-loss function of the first pseudo-positive feature sample pair with the text feature as an anchor point based on the first fusion feature, the video feature, the text feature and the text mask feature;
and obtaining the first feature alignment loss function based on the sum of the first sub-loss function and the second sub-loss function.
In particular, for the first pseudo-positive feature sample pair <Ve, Tm>, a first feature alignment loss function L'_T based on the three modalities can be calculated for the first pseudo-positive feature sample pairs <Ve, Tm> of the batch, where the first feature alignment loss function L'_T may be the sum of a first sub-loss function L'_v2t of the pseudo-positive feature sample pair <Ve, Tm> with the video feature Ve as the anchor and a second sub-loss function L'_t2v of the pseudo-positive feature sample pair <Ve, Tm> with the text feature Te as the anchor.
In some embodiments, the first sub-loss function comprises:
For the ith training feature sample,
calculating, based on a first metric function s(Ve_i, Te_i) of the ith positive sample feature pair <Ve_i, Te_i>, a first ratio of s(Ve_i, Te_i) to the temperature coefficient τ, and calculating a first exponential function of the ith positive sample feature pair <Ve_i, Te_i> based on the first ratio; and calculating a first logarithmic function log(Ve_i, Te_i) of the ith positive sample feature pair <Ve_i, Te_i> based on the first exponential function and a first intermediate function Z';
calculating, based on a second metric function s(Ve_i, Tm_i) of the ith pseudo-positive sample feature pair <Ve_i, Tm_i>, a second ratio of s(Ve_i, Tm_i) to the temperature coefficient τ, and calculating a second exponential function of the ith pseudo-positive sample feature pair <Ve_i, Tm_i> based on the second ratio; and calculating a second logarithmic function log(Ve_i, Tm_i) of the ith pseudo-positive sample feature pair <Ve_i, Tm_i> based on the second exponential function and the first intermediate function Z';
calculating, based on a third metric function s(Ve_i, M_Tmf_i) of the ith feature pair <Ve_i, M_Tmf_i>, a third ratio of s(Ve_i, M_Tmf_i) to the temperature coefficient τ, and calculating a third exponential function of the ith feature pair <Ve_i, M_Tmf_i> based on the third ratio; and calculating a third logarithmic function log(Ve_i, M_Tmf_i) of the ith feature pair <Ve_i, M_Tmf_i> based on the third exponential function and the first intermediate function Z';
calculating a first single sample loss function of the ith training feature sample based on the first logarithmic function log(Ve_i, Te_i), the second logarithmic function log(Ve_i, Tm_i), and the third logarithmic function log(Ve_i, M_Tmf_i);
and calculating the first sub-loss function based on the first single sample loss functions of all B training feature samples.
Here, i may be a positive integer. In some embodiments, calculating the first logarithmic function log(Ve_i, Te_i) of the ith positive sample feature pair <Ve_i, Te_i> based on the first exponential function and the first intermediate function Z' further comprises: calculating the sum of the first exponential function and the first intermediate function Z' as a first sum; and obtaining the first logarithmic function by taking the natural constant e as the base and the ratio of the first exponential function to the first sum as the argument. In some embodiments, calculating the second logarithmic function log(Ve_i, Tm_i) of the ith pseudo-positive sample feature pair <Ve_i, Tm_i> based on the second exponential function and the first intermediate function Z' further comprises: calculating the sum of the second exponential function and the first intermediate function Z' as a second sum; and obtaining the second logarithmic function by taking the natural constant e as the base and the ratio of the second exponential function to the second sum as the argument. In some embodiments, calculating the third logarithmic function log(Ve_i, M_Tmf_i) of the ith feature pair <Ve_i, M_Tmf_i> based on the third exponential function and the first intermediate function Z' further comprises: calculating the sum of the third exponential function and the first intermediate function Z' as a third sum; and obtaining the third logarithmic function by taking the natural constant e as the base and the ratio of the third exponential function to the third sum as the argument.
In some embodiments, calculating the first sub-loss function based on the first single sample loss functions of all B training feature samples comprises: calculating the negative of the sum of all B first single sample loss functions to obtain the first sub-loss function.
In some embodiments, the first intermediate function Z' may comprise:
calculating, based on a fourth metric function s(Ve_i, Te_j) of the feature pair <Ve_i, Te_j>, a fourth ratio of s(Ve_i, Te_j) to the temperature coefficient τ, and calculating a fourth exponential function of the feature pair <Ve_i, Te_j> based on the fourth ratio;
calculating, based on a fifth metric function s(Ve_i, Tm_j) of the feature pair <Ve_i, Tm_j>, a fifth ratio of s(Ve_i, Tm_j) to the temperature coefficient τ, and calculating a fifth exponential function of the feature pair <Ve_i, Tm_j> based on the fifth ratio;
calculating, based on a sixth metric function s(Ve_i, M_Tmf_j) of the feature pair <Ve_i, M_Tmf_j>, a sixth ratio of s(Ve_i, M_Tmf_j) to the temperature coefficient τ, and calculating a sixth exponential function of the feature pair <Ve_i, M_Tmf_j> based on the sixth ratio, where i is different from j (j may be a positive integer);
obtaining a first intermediate sub-function based on the sum of the fourth exponential function, the fifth exponential function, and the sixth exponential function;
and calculating the first intermediate function Z' based on the sum of the first intermediate sub-functions of all B training feature samples.
In some embodiments, the second sub-loss function comprises:
For the ith training feature sample,
calculating a fourth logarithmic function by taking the natural constant e as the base and the ratio of the first exponential function to the sum of the B fourth exponential functions as the argument; calculating a fifth logarithmic function by taking the natural constant e as the base and the ratio of the second exponential function to the sum of the B fifth exponential functions as the argument;
calculating a sixth logarithmic function by taking the natural constant e as the base and the ratio of the third exponential function to the sum of the B sixth exponential functions as the argument;
and obtaining the second sub-loss function based on the negative of the sum of the fourth logarithmic function, the fifth logarithmic function, and the sixth logarithmic function.
In some embodiments, calculating a second feature alignment loss function for the second pseudo-positive feature sample pair may include:
calculating a third sub-loss function of the second pseudo-positive feature sample pair with the video feature as an anchor point and a fourth sub-loss function of the second pseudo-positive feature sample pair with the text feature as an anchor point based on the second fusion feature, the video feature, the text feature and the video mask feature;
and obtaining the second feature alignment loss function based on the sum of the third sub-loss function and the fourth sub-loss function.
In particular, similarly to the first pseudo-positive feature sample pair <Ve, Tm>, for the second pseudo-positive feature sample pair <Vm, Te>, a second feature alignment loss function L'_V based on the three modalities can also be calculated for the second pseudo-positive feature sample pairs <Vm, Te> of the training batch, namely the sum of a third sub-loss function L'_v2t of the pseudo-positive feature sample pair <Vm, Te> with the video feature Ve as the anchor and a fourth sub-loss function L'_t2v of the pseudo-positive feature sample pair <Vm, Te> with the text feature Te as the anchor.
In some embodiments, the third sub-loss function comprises:
For the ith training feature sample,
calculating, based on a seventh metric function s(Te_i, Ve_i) of the ith positive sample feature pair <Te_i, Ve_i>, a seventh ratio of s(Te_i, Ve_i) to the temperature coefficient τ, and calculating a seventh exponential function of the ith positive sample feature pair <Te_i, Ve_i> based on the seventh ratio; calculating, based on an eighth metric function s(Te_i, Ve_j) of the feature pair <Te_i, Ve_j>, an eighth ratio of s(Te_i, Ve_j) to the temperature coefficient τ, and calculating an eighth exponential function of the feature pair <Te_i, Ve_j> based on the eighth ratio; and calculating a seventh logarithmic function by taking the natural constant e as the base and the ratio of the seventh exponential function to the sum of the B eighth exponential functions as the argument;
calculating, based on a ninth metric function s(Te_i, Vm_i) of the ith pseudo-positive sample feature pair <Te_i, Vm_i>, a ninth ratio of s(Te_i, Vm_i) to the temperature coefficient τ, and calculating a ninth exponential function of the ith pseudo-positive sample feature pair <Te_i, Vm_i> based on the ninth ratio; calculating, based on a tenth metric function s(Te_i, Vm_j) of the feature pair <Te_i, Vm_j>, a tenth ratio of s(Te_i, Vm_j) to the temperature coefficient τ, and calculating a tenth exponential function of the feature pair <Te_i, Vm_j> based on the tenth ratio; and calculating an eighth logarithmic function by taking the natural constant e as the base and the ratio of the ninth exponential function to the sum of the B tenth exponential functions as the argument;
calculating, based on an eleventh metric function s(Te_i, M_Vmf_i) of the ith feature pair <Te_i, M_Vmf_i>, an eleventh ratio of s(Te_i, M_Vmf_i) to the temperature coefficient τ, and calculating an eleventh exponential function of the ith feature pair <Te_i, M_Vmf_i> based on the eleventh ratio; calculating, based on a twelfth metric function s(Te_i, M_Vmf_j) of the feature pair <Te_i, M_Vmf_j>, a twelfth ratio of s(Te_i, M_Vmf_j) to the temperature coefficient τ, and calculating a twelfth exponential function of the feature pair <Te_i, M_Vmf_j> based on the twelfth ratio; and calculating a ninth logarithmic function by taking the natural constant e as the base and the ratio of the eleventh exponential function to the sum of the B twelfth exponential functions as the argument;
and obtaining the third sub-loss function based on the negative of the sum of the seventh logarithmic function, the eighth logarithmic function, and the ninth logarithmic function.
In some embodiments, the fourth sub-loss function comprises:
calculating a tenth logarithmic function log'(Te_i, Ve_i) of the ith positive sample feature pair <Te_i, Ve_i> based on the seventh exponential function and a second intermediate function Z'';
calculating an eleventh logarithmic function log'(Te_i, Vm_i) of the ith pseudo-positive sample feature pair <Te_i, Vm_i> based on the ninth exponential function and the second intermediate function Z'';
calculating a twelfth logarithmic function log'(Te_i, M_Vmf_i) of the ith feature pair <Te_i, M_Vmf_i> based on the eleventh exponential function and the second intermediate function Z'';
calculating a second single sample loss function of the ith training feature sample based on the tenth logarithmic function log'(Te_i, Ve_i), the eleventh logarithmic function log'(Te_i, Vm_i), and the twelfth logarithmic function log'(Te_i, M_Vmf_i);
and calculating the fourth sub-loss function based on the second single sample loss functions of all B training feature samples.
Here, i may be a positive integer. In some embodiments, calculating the tenth logarithmic function log'(Te_i, Ve_i) of the ith positive sample feature pair <Te_i, Ve_i> based on the seventh exponential function and the second intermediate function Z'' further comprises: calculating the sum of the seventh exponential function and the second intermediate function Z'' as a fourth sum; and obtaining the tenth logarithmic function by taking the natural constant e as the base and the ratio of the seventh exponential function to the fourth sum as the argument. In some embodiments, calculating the eleventh logarithmic function log'(Te_i, Vm_i) of the ith pseudo-positive sample feature pair <Te_i, Vm_i> based on the ninth exponential function and the second intermediate function Z'' further comprises: calculating the sum of the ninth exponential function and the second intermediate function Z'' as a fifth sum; and obtaining the eleventh logarithmic function by taking the natural constant e as the base and the ratio of the ninth exponential function to the fifth sum as the argument. In some embodiments, calculating the twelfth logarithmic function log'(Te_i, M_Vmf_i) of the ith feature pair <Te_i, M_Vmf_i> based on the eleventh exponential function and the second intermediate function Z'' further comprises: calculating the sum of the eleventh exponential function and the second intermediate function Z'' as a sixth sum; and obtaining the twelfth logarithmic function by taking the natural constant e as the base and the ratio of the eleventh exponential function to the sixth sum as the argument.
In some embodiments, calculating the fourth sub-loss function based on the second single sample loss functions of all B training feature samples comprises: calculating the negative of the sum of all B second single sample loss functions to obtain the fourth sub-loss function.
In some embodiments, the second intermediate function Z'' may comprise:
calculating, based on the fourth metric function s(Ve_i, Te_j) of the feature pair <Ve_i, Te_j>, the fourth ratio to the temperature coefficient τ, and calculating the fourth exponential function of the feature pair <Ve_i, Te_j> based on the fourth ratio;
calculating, based on a thirteenth metric function s(Te_i, Vm_j) of the feature pair <Te_i, Vm_j>, a thirteenth ratio of s(Te_i, Vm_j) to the temperature coefficient τ, and calculating a thirteenth exponential function of the feature pair <Te_i, Vm_j> based on the thirteenth ratio;
calculating, based on a fourteenth metric function s(Te_i, M_Vmf_j) of the feature pair <Te_i, M_Vmf_j>, a fourteenth ratio of s(Te_i, M_Vmf_j) to the temperature coefficient τ, and calculating a fourteenth exponential function of the feature pair <Te_i, M_Vmf_j> based on the fourteenth ratio, where i is different from j (j may be a positive integer);
obtaining a second intermediate sub-function based on the sum of the fourth exponential function, the thirteenth exponential function, and the fourteenth exponential function;
and calculating the second intermediate function Z'' based on the sum of the second intermediate sub-functions of all B training feature samples. Then, based on the feature alignment loss functions of the two pseudo-positive feature sample pairs <Ve, Tm> and <Vm, Te>, the total tri-modal feature alignment loss function L'_TmA can be obtained as the sum of the first feature alignment loss function L'_T and the second feature alignment loss function L'_V.
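A minimal sketch of one video-anchored sub-loss of this tri-modal alignment objective is shown below, taking the metric s(.,.) to be cosine similarity of mean-pooled features; the text-anchored sub-losses and those of the pair <Vm, Te> follow the same pattern. This is an assumed, InfoNCE-style reading of the formulas above, not the patent's exact expression.

```python
import torch
import torch.nn.functional as F

def pooled(x):
    """(B, L, D) feature sequence -> (B, D) mean-pooled, L2-normalised feature."""
    return F.normalize(x.mean(dim=1), dim=-1)

def video_anchored_alignment_loss(Ve, Te, Tm, M_Tmf, tau=0.07):
    """Video-anchored sub-loss: the positive <Ve_i, Te_i>, the pseudo-positive
    <Ve_i, Tm_i>, and the fusion pair <Ve_i, M_Tmf_i> are each contrasted
    against the off-diagonal exponentials playing the role of Z'."""
    ve, te, tm, mf = map(pooled, (Ve, Te, Tm, M_Tmf))
    B = ve.size(0)
    exp_vt  = (ve @ te.t() / tau).exp()   # exp(s(Ve_i, Te_j) / tau)
    exp_vtm = (ve @ tm.t() / tau).exp()   # exp(s(Ve_i, Tm_j) / tau)
    exp_vmf = (ve @ mf.t() / tau).exp()   # exp(s(Ve_i, M_Tmf_j) / tau)
    off_diag = ~torch.eye(B, dtype=torch.bool, device=ve.device)
    # intermediate function Z': sum of off-diagonal exponentials per anchor i
    z = ((exp_vt + exp_vtm + exp_vmf) * off_diag).sum(dim=1)
    d = torch.arange(B, device=ve.device)
    per_sample = (torch.log(exp_vt[d, d]  / (exp_vt[d, d]  + z))
                  + torch.log(exp_vtm[d, d] / (exp_vtm[d, d] + z))
                  + torch.log(exp_vmf[d, d] / (exp_vmf[d, d] + z)))
    return -per_sample.sum()   # negative of the sum over the batch
```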
Since the video mask samples and the text mask samples lack part of the information of the complete video samples and text samples, the semantic consistency within the pseudo-positive feature sample pairs derived from the video mask samples and text mask samples is weaker than that within the video-text feature pairs derived from the complete video samples and text samples. The similarity of the positive sample feature pair <Ve, Te> with complete information is therefore higher than that of the pseudo-positive feature sample pairs <Ve, Tm> and <Vm, Te>. Making the pre-training model aware that a pseudo-positive feature sample pair has a certain semantic deficiency relative to the positive sample feature pair strengthens the model's cross-modal information fusion capability and fine-grained perception capability. Therefore, ranking training can be performed on the sample feature sequences. Fig. 5 shows a schematic diagram of ranking training according to an embodiment of the present disclosure.
In some embodiments, a first correlation loss function may be calculated for the first pseudo positive feature sample pair and the positive sample feature pair.
Further, in some embodiments, calculating a first correlation loss function of the first pseudo positive feature sample pair and the positive sample feature pair comprises:
obtaining a first correlation function based on a first preset value, a first metric function of the first pseudo positive feature sample pair and a positive sample metric function of the positive sample feature pair;
and obtaining the first correlation degree loss function based on a second preset value and a maximum function of the first correlation degree function.
In particular, for the first pseudo-positive feature sample pair <Ve, Tm>, obtaining the first correlation loss function L_rank<Ve,Tm> of the ranking training may comprise the following steps:
calculating a fifteenth ratio of the metric function s(Ve, Te) of the positive feature sample pair <Ve, Te> to the temperature coefficient; and calculating a sixteenth ratio of the metric function s(Ve, Tm) of the first pseudo-positive feature sample pair <Ve, Tm> to the temperature coefficient;
calculating a first difference between the fifteenth ratio and the sixteenth ratio, and calculating the negative of the sum of the first difference and a first preset value λ to obtain the first correlation function;
obtaining the first correlation loss function based on the second preset value (e.g., 0) and the maximum function of the first correlation function.
In some embodiments, a second correlation loss function of the second pseudo-positive feature sample pair and the positive sample feature pair may be calculated.
Further, in some embodiments, calculating the second correlation loss function of the second pseudo-positive feature sample pair and the positive sample feature pair comprises:
obtaining a second correlation function based on the first preset value, a second metric function of the second pseudo-positive feature sample pair, and the positive sample metric function of the positive sample feature pair;
and obtaining the second correlation loss function based on the second preset value and the maximum function of the second correlation function.
In particular, for the second pseudo-positive feature sample pair <Vm, Te>, obtaining the second correlation loss function L_rank<Vm,Te> of the ranking training may comprise the following steps:
calculating a seventeenth ratio of the metric function s(Te, Ve) of the positive feature sample pair <Te, Ve> to the temperature coefficient; and calculating an eighteenth ratio of the metric function s(Vm, Te) of the second pseudo-positive feature sample pair <Vm, Te> to the temperature coefficient;
calculating a second difference between the seventeenth ratio and the eighteenth ratio, and calculating the negative of the sum of the second difference and the first preset value λ to obtain the second correlation function;
obtaining the second correlation loss function based on the second preset value (e.g., 0) and the maximum function of the second correlation function.
Then, the total correlation loss function L_rank of the ranking training may be calculated based on the first correlation loss function and the second correlation loss function, as the sum of the first correlation loss function L_rank<Ve,Tm> and the second correlation loss function L_rank<Vm,Te>.
Further, the total loss function of the entire pre-training process may be calculated from the total feature alignment loss function of the pre-training and the total correlation loss function of the ranking training stage as L = L'_TmA + L_rank.
And adjusting parameters of the pre-training model according to the total loss function in the pre-training process so as to minimize the total loss function, thereby obtaining the trained pre-training model.
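A non-limiting sketch of the ranking losses and the overall objective is given below, continuing the helpers from the alignment sketch above; the hinge margin λ, the exact sign convention, and the restriction to one alignment term are assumptions made for brevity.

```python
import torch

def ranking_loss(s_pos, s_pseudo, tau=0.07, lam=0.2):
    """Hinge-style correlation loss: the positive pair should score higher than
    its masked (pseudo-positive) counterpart by at least the margin lambda."""
    return torch.clamp(lam - (s_pos / tau - s_pseudo / tau), min=0).mean()

def total_pretraining_loss(Ve, Vm, Te, Tm, M_Tmf, M_Vmf, tau=0.07, lam=0.2):
    ve, vm, te, tm = map(pooled, (Ve, Vm, Te, Tm))
    s = lambda a, b: (a * b).sum(dim=-1)      # per-sample cosine similarity
    # total ranking loss: L_rank<Ve,Tm> + L_rank<Vm,Te>
    L_rank = (ranking_loss(s(ve, te), s(ve, tm), tau, lam)
              + ranking_loss(s(te, ve), s(te, vm), tau, lam))
    # tri-modal alignment loss; the remaining sub-losses would be added in the
    # same way as the video-anchored term sketched earlier
    L_align = video_anchored_alignment_loss(Ve, Te, Tm, M_Tmf, tau)
    return L_align + L_rank                   # L = L'_TmA + L_rank
```

In practice, the returned loss would be back-propagated and an optimizer step applied, which corresponds to the parameter adjustment described above.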
It can be seen that the embodiments of the present disclosure adopt a pre-training method based on tri-modal alignment, which explicitly strengthens the connection between the single-modal encoders (e.g., the video encoder 310 and the text encoder 320 in fig. 3) and the multi-modal encoder 330 and promotes their cooperative improvement, so that the multi-modal alignment capability and the multi-modal fusion capability of the model promote each other during pre-training, and the pre-trained model can serve different downstream tasks while remaining highly efficient and high-performing. Compared with a traditional pre-training model designed and trained for a single task, the multi-modal pre-training model according to the embodiments of the present disclosure can achieve higher accuracy on different downstream video-text tasks such as video retrieval and video question answering without reducing computational efficiency.
After the pre-training stage, a pre-trained model is obtained. On this basis, the pre-trained model can be further trained for different downstream tasks, so as to obtain multi-modal data models for the different downstream tasks.
In some embodiments, the method may further include: training the pre-training model based on a second training sample to obtain the first model.
Further, in some embodiments, training the pre-training model based on the second training sample to obtain the first model may further include:
acquiring the second training sample, wherein the second training sample comprises at least one video-text training pair, and each video-text training pair comprises a video training sample and a corresponding text training sample;
and training the pre-training model based on the second training sample until a target training requirement is met to obtain the first model.
The content of the second training sample may differ for different video processing tasks. For example, for a video retrieval task, the first model may search, based on first text information, a video set including a plurality of video data, and obtain a corresponding video retrieval result. The second training sample corresponding to the video retrieval task may then include at least one video-associated-information pair, each video-associated-information pair including a video training sample and corresponding associated information; the associated information may be text information describing the corresponding video training sample. For a video information generation task, the first model may generate, based on video data, second text information (e.g., a summary, a title, a brief introduction, etc.) associated with the video data. The second training sample corresponding to the video information generation task may then include at least one video-text-information pair, each video-text-information pair including a video training sample and corresponding text information such as a summary, a title, or a brief introduction. For a video question-answering task, the first model may obtain, based on question information about video data and the video data, a text answer to the question information. The second training sample corresponding to the video question-answering task may then include at least one video-question-answer information pair, where each video-question-answer information pair includes a video training sample, question information for the video training sample, and corresponding answer information. In this way, the pre-training model obtained by the video processing method according to the embodiments of the present disclosure can serve different downstream tasks while ensuring high efficiency and high performance, and a high-performance first model can be obtained when the pre-training model is adapted to different video tasks.
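As a non-limiting illustration, the second training samples for the three downstream tasks could take the following shapes; the class and field names are assumptions for clarity.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class RetrievalSample:           # video retrieval task
    video: Any                   # a video training sample
    associated_info: str         # text information describing the video

@dataclass
class InfoGenerationSample:      # video information generation task
    video: Any
    text_info: str               # summary / title / brief introduction

@dataclass
class QASample:                  # video question-answering task
    video: Any
    question: str
    answer: str
```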
The trained first model may be used in different video processing tasks, including searching for videos based on text or searching for text based on videos.
According to an embodiment of the present disclosure, there is also provided a video processing method, including:
acquiring an input text;
performing feature extraction on the input text based on the multi-modal fusion characteristics of the first model to obtain text features with multi-modal characteristics; the first model has a multi-modal fusion characteristic of fusing a video modality and a text modality;
searching a video feature set for a target video feature matched with the text feature based on the text feature with the multi-modal characteristic;
and outputting the target video corresponding to the target video characteristics.
In some embodiments, the multi-modal fusion characteristics of the first model are determined based on fusion modality features obtained by the first model fusing the first training feature samples; the first training feature samples are obtained by performing feature extraction on a first training sample comprising a video modality and a text modality.
In the pre-training process of the first model, the first model performs feature fusion on the first training feature sample to obtain fusion modal features, so that the first model has a multi-modal fusion characteristic of fusing a video mode and a text mode. In the subsequent data processing process, the first model can obtain data features with multi-modal characteristics when performing feature extraction on data of any modality (such as a text modality, a video modality and the like).
In particular, the input text may be determined by a user. For example, in an application APP1, N1 pieces of video data may be provided to a user locally or via a network, and the user wishes to find videos of interest among them. The user may input a keyword text KeyA in the corresponding search bar, and the application APP1 may invoke the first model deployed in it to perform feature extraction on the keyword text KeyA, obtaining a text feature FeatureA of the keyword text KeyA, where the text feature FeatureA may be a feature vector. The first model performs search matching in a video feature set based on the text feature FeatureA, where the video feature set may be a set of video features obtained by performing feature extraction on the N1 pieces of video data. The search matching may compute the distance (e.g., Euclidean distance, correlation, etc.) between the text feature FeatureA and the video features in the video feature set, so as to obtain a target video feature matching the text feature FeatureA. The first model may return the target video corresponding to the target video feature, and the target video is presented in the application APP1 so as to be output to the user. With the video processing method according to the embodiments of the present disclosure, searching for videos based on a keyword text using the first model can yield search results with higher accuracy.
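A minimal sketch of this text-to-video search flow is shown below, matching the keyword feature against pre-extracted video features by cosine similarity. The tokenizer, the similarity measure, and the top-k selection are illustrative assumptions rather than the patent's concrete implementation.

```python
import torch
import torch.nn.functional as F

def search_videos(model, tokenizer, keyword_text, video_feature_set, videos, top_k=5):
    """Encodes the keyword text once (the FeatureA role) and matches it against
    video features extracted offline from the N1 candidate videos."""
    token_ids = tokenizer(keyword_text)                    # assumed: (1, num_tokens) LongTensor
    text_feat = F.normalize(model.encode_text(token_ids).mean(dim=1), dim=-1)
    video_feats = F.normalize(video_feature_set, dim=-1)   # (N1, D), pre-extracted
    scores = video_feats @ text_feat.squeeze(0)            # similarity to every video
    best = scores.topk(min(top_k, len(videos))).indices
    return [videos[int(i)] for i in best]                  # target videos to output
```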
According to an embodiment of the present disclosure, there is also provided a video processing method, including:
acquiring video data to be processed;
performing feature extraction on the video data based on the multi-modal fusion characteristic of the first model to obtain a video feature with multi-modal characteristics, wherein the first model has a multi-modal fusion characteristic of fusing a video modality and a text modality;
searching a text feature set for a target text feature matching the video feature, based on the video feature with multi-modal characteristics;
and generating and outputting a target text based on the target text feature.
Specifically, suppose that for video data Video to be processed, a user wishes to automatically generate corresponding summary information. The first model may perform feature extraction on the video data Video to obtain a corresponding video feature FeatureB, where the video feature FeatureB may be a feature vector. The first model then performs search matching in a text feature set based on the video feature FeatureB, where the text feature set may be a set of text features obtained by performing feature extraction on preset texts. The search matching yields one or more target text features matching the video feature FeatureB, and a target text related to the video data Video can be assembled as the summary information from the target preset texts corresponding to those target text features. With this video processing method, using the first model to generate text information related to a video improves the accuracy of that text information.
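Continuing that example, here is a minimal sketch of assembling summary information from the k preset texts whose features best match the video feature. The value of k and the simple concatenation used to form the target text are illustrative assumptions only.

```python
import numpy as np

def generate_summary(video_feature: np.ndarray,
                     text_features: np.ndarray,
                     preset_texts: list,
                     k: int = 3) -> str:
    """Pick the k preset texts closest to the video feature and join them as a summary."""
    v = video_feature / (np.linalg.norm(video_feature) + 1e-8)
    t = text_features / (np.linalg.norm(text_features, axis=1, keepdims=True) + 1e-8)
    scores = t @ v                                   # similarity of each preset text to the video
    top_k = np.argsort(-scores)[:k]                  # indices of the target text features
    return " ".join(preset_texts[i] for i in top_k)  # assembled summary information
```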
As another example, for video data Video' to be processed, the user poses a question Q1 about the video data Video' and wishes to obtain an answer to the question Q1. The first model may perform feature extraction on the video data Video' to obtain a corresponding video feature FeatureC, and perform feature extraction on the question Q1 to obtain a text feature FeatureD. The first model then performs search matching in the text feature set based on the video feature FeatureC and the text feature FeatureD to obtain one or more target text features matching both. A target text is formed based on the one or more target text features and returned to the user as the answer to the question Q1. Thus, with the video processing method of the embodiment of the present disclosure, using the first model to generate an answer based on the video and the question information improves the accuracy of the answer.
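The disclosure does not specify in this passage how the video feature and the question feature are combined for matching. Purely as an assumption, the sketch below averages the two feature vectors into a single query before scoring candidate answer texts.

```python
import numpy as np

def answer_question(video_feature: np.ndarray,
                    question_feature: np.ndarray,
                    answer_features: np.ndarray,
                    answer_texts: list) -> str:
    """Score candidate answer texts against a combined video + question feature."""
    query = (video_feature + question_feature) / 2.0   # assumed way of combining the two features
    query /= (np.linalg.norm(query) + 1e-8)
    a = answer_features / (np.linalg.norm(answer_features, axis=1, keepdims=True) + 1e-8)
    return answer_texts[int(np.argmax(a @ query))]     # best-matching text, returned as the answer
```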
Referring to fig. 6, fig. 6 shows a flow diagram of a video processing method according to an embodiment of the present disclosure. In fig. 6, a video processing method 600 may include the following steps.
Step S610, acquiring an input text;
step S620, performing feature extraction on the input text based on the multi-modal fusion characteristic of the first model to obtain a text feature with multi-modal characteristics, wherein the first model has a multi-modal fusion characteristic of fusing a video modality and a text modality;
step S630, searching a video feature set for a target video feature matching the text feature, based on the text feature with multi-modal characteristics;
and step S640, outputting a target video corresponding to the target video feature.
In some embodiments, the multi-modal fusion characteristic of the first model is determined based on fusion modal features obtained by the first model fusing a first training feature sample; the first training feature sample is obtained by performing feature extraction on a first training sample comprising a video modality and a text modality.
In some embodiments, method 600 further comprises:
the first model needs to be pre-trained on an initial model based on a first training sample, and specifically includes:
obtaining a first training sample;
performing feature extraction based on the first training sample to obtain a first training feature sample (e.g., the video feature sequence Ve, the video mask feature sequence Vm, the text feature sequence Te, and the text mask feature sequence Tm in fig. 3), and fusing the first training feature sample to obtain fusion modal features (e.g., the first fusion modal feature sequence M_Tmf and the second fusion modal feature sequence M_Vmf in fig. 3);
and pre-training the initial model based on the first training feature sample and the fusion modal features to obtain a pre-training model.
In some embodiments, obtaining the first training sample comprises:
obtaining at least one video text pair, wherein the video text pair comprises a video sample and a text sample;
randomly selecting a first preset proportion of video blocks in each video block packet of the video samples for masking to generate the video mask samples (e.g., video mask samples 312 in fig. 3);
randomly selecting a second preset proportion of preset type texts in the text samples for masking to generate the text mask samples (for example, the text mask samples 322 in fig. 3);
deriving the first training sample based on the at least one video text pair, the video mask sample, and the text mask sample.
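A minimal sketch of the masking step described above, assuming the video sample is organized as groups of blocks and the text sample as a token sequence. The concrete proportions, the selection of "preset type" tokens, the zero-block and [MASK] placeholders, and all names here are assumptions for illustration.

```python
import random

def mask_video_blocks(video_block_groups, ratio=0.5):
    """For each video block group, zero out a random `ratio` of its blocks (video mask sample)."""
    masked_groups = []
    for group in video_block_groups:
        group = [list(block) for block in group]
        num_to_mask = int(len(group) * ratio)
        for i in random.sample(range(len(group)), k=num_to_mask):
            group[i] = [0.0] * len(group[i])
        masked_groups.append(group)
    return masked_groups

def mask_text_tokens(tokens, ratio=0.3):
    """Replace a random `ratio` of tokens with a [MASK] placeholder (text mask sample)."""
    masked = list(tokens)
    num_to_mask = int(len(masked) * ratio)
    for i in random.sample(range(len(masked)), k=num_to_mask):
        masked[i] = "[MASK]"
    return masked

# Hypothetical usage: two groups of 4 video blocks and a short caption.
video_mask_sample = mask_video_blocks([[[0.1, 0.2]] * 4, [[0.3, 0.4]] * 4], ratio=0.5)
text_mask_sample = mask_text_tokens("a dog catching a frisbee".split(), ratio=0.4)
```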
In some embodiments, method 600 further comprises:
the first model is obtained by training the pre-training model based on a second training sample, and specifically comprises the following steps:
acquiring the second training sample, wherein the second training sample comprises at least one video-text training pair, and each video-text training pair comprises a video training sample and a corresponding text training sample;
and training the pre-training model based on the second training sample until a target training requirement is met to obtain the first model.
In some embodiments, the fusion modal features include a first fusion modal feature (e.g., the first fusion modal feature sequence M_Tmf in fig. 3) and a second fusion modal feature (e.g., the second fusion modal feature sequence M_Vmf in fig. 3);
Performing feature extraction based on the first training sample to obtain the first training feature sample, and fusing the first training feature sample to obtain the fusion modal features, further comprises:
respectively performing feature extraction on the video samples to obtain video features (for example, a video feature sequence Ve in fig. 3), performing feature extraction on the text samples to obtain text features (for example, a text feature sequence Te in fig. 3), performing feature extraction on the video mask samples to obtain video mask features (for example, a video mask feature sequence Vm in fig. 3), and performing feature extraction on the text mask samples to obtain text mask features (for example, a text mask feature sequence Tm in fig. 3);
performing multi-modal feature fusion based on the video feature and the text mask feature to obtain the first fusion modal feature (e.g., the first fusion modal feature sequence M_Tmf in fig. 3); and
performing multi-modal feature fusion based on the text feature and the video mask feature to obtain the second fusion modal feature (e.g., the second fusion modal feature sequence M_Vmf in fig. 3).
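This passage does not name the fusion mechanism. A cross-attention layer is one common way to fuse the masked sequence of one modality with the full sequence of the other, and the PyTorch sketch below is offered only under that assumption; the query/context assignment, feature dimension, and sequence lengths are all hypothetical.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Fuse a query sequence from one modality with a context sequence from the other."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_seq: torch.Tensor, context_seq: torch.Tensor) -> torch.Tensor:
        fused, _ = self.attn(query_seq, context_seq, context_seq)  # query attends to the other modality
        return self.norm(query_seq + fused)

fusion = CrossModalFusion()
Ve = torch.randn(2, 32, 512)   # video feature sequence
Te = torch.randn(2, 20, 512)   # text feature sequence
Vm = torch.randn(2, 32, 512)   # video mask feature sequence
Tm = torch.randn(2, 20, 512)   # text mask feature sequence
M_Tmf = fusion(Tm, Ve)         # first fusion modal feature: text mask features fused with video features
M_Vmf = fusion(Vm, Te)         # second fusion modal feature: video mask features fused with text features
```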
In some embodiments, pre-training the initial model based on the first training feature sample and the fusion modal features to obtain the pre-training model further includes:
deriving a first pseudo-positive feature sample pair (e.g., the first pseudo-positive feature sample pair <Ve, Tm> in fig. 3) based on the video feature and the text mask feature, deriving a second pseudo-positive feature sample pair (e.g., the second pseudo-positive feature sample pair <Vm, Te> in fig. 3) based on the text feature and the video mask feature, and deriving a positive sample feature pair (e.g., the positive sample feature pair <Ve, Te>) based on the video feature and the text feature;
calculating a first feature alignment loss function (e.g., the first feature alignment loss function L_T') for the first pseudo-positive feature sample pair and a second feature alignment loss function (e.g., the second feature alignment loss function L_V') for the second pseudo-positive feature sample pair;
calculating a first correlation loss function (e.g., the first correlation loss function L_rank<Ve,Tm>) of the first pseudo-positive feature sample pair and the positive sample feature pair, and a second correlation loss function (e.g., the second correlation loss function L_rank<Vm,Te>) of the second pseudo-positive feature sample pair and the positive sample feature pair;
and deriving the pre-training total loss function (e.g., the total loss function L) based on the first feature alignment loss function, the second feature alignment loss function, the first correlation loss function, and the second correlation loss function.
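The disclosure states only that the total loss is derived from these four terms. One natural combination, written here purely as an assumption (an unweighted sum), would be:

```latex
L = L'_{T} + L'_{V} + L_{\mathrm{rank}\langle Ve,\,Tm\rangle} + L_{\mathrm{rank}\langle Vm,\,Te\rangle}
```

Weighted sums are equally plausible; the disclosure does not fix the form of the combination.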
In some embodiments, calculating a first feature alignment loss function for the first pseudo-positive feature sample pair and a second feature alignment loss function for the second pseudo-positive feature sample pair comprises:
calculating, based on the first fusion modal feature, the video feature, the text feature, and the text mask feature, a first sub-loss function (e.g., the first sub-loss function L'_v2t) of the first pseudo-positive feature sample pair with the video feature as an anchor point, and a second sub-loss function (e.g., the second sub-loss function L'_t2v) of the first pseudo-positive feature sample pair with the text feature as an anchor point;
deriving the first feature alignment loss function (e.g., the first feature alignment loss function L_T') based on a sum of the first sub-loss function and the second sub-loss function; and
calculating, based on the second fusion modal feature, the video feature, the text feature, and the video mask feature, a third sub-loss function (e.g., the third sub-loss function L''_v2t) of the second pseudo-positive feature sample pair with the video feature as an anchor point, and a fourth sub-loss function (e.g., the fourth sub-loss function L''_t2v) of the second pseudo-positive feature sample pair with the text feature as an anchor point;
deriving the second feature alignment loss function (e.g., the second feature alignment loss function L_V') based on a sum of the third sub-loss function and the fourth sub-loss function.
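The exact form of the anchored sub-losses is not given here. A common instantiation of a "video as anchor" / "text as anchor" pair of sub-losses is a symmetric, InfoNCE-style contrastive loss over a training batch; the sketch below follows that assumption and, for simplicity, scores plain pooled features rather than the fusion modal features mentioned above.

```python
import torch
import torch.nn.functional as F

def anchored_sub_losses(video_feats: torch.Tensor,
                        text_feats: torch.Tensor,
                        temperature: float = 0.07):
    """Return (L_v2t, L_t2v): video-anchored and text-anchored contrastive sub-losses.

    video_feats, text_feats: (B, d) pooled features; row i of each forms a matched pair.
    """
    v = F.normalize(video_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = v @ t.t() / temperature                  # (B, B) pairwise similarities
    targets = torch.arange(v.size(0), device=v.device)
    l_v2t = F.cross_entropy(logits, targets)          # video anchors against text candidates
    l_t2v = F.cross_entropy(logits.t(), targets)      # text anchors against video candidates
    return l_v2t, l_t2v

# A feature alignment loss could then be, e.g., the sum of the two sub-losses.
l_v2t, l_t2v = anchored_sub_losses(torch.randn(8, 512), torch.randn(8, 512))
alignment_loss = l_v2t + l_t2v
```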
In some embodiments, calculating a first correlation loss function of the first pseudo positive feature sample pair with the positive sample feature pair and a second correlation loss function of the second pseudo positive feature sample with the positive sample feature pair comprises:
deriving a first correlation function based on a first preset value (e.g., the first preset value λ), a first metric function (e.g., s(Ve, Tm)) of the first pseudo-positive feature sample pair, and a positive sample metric function (e.g., s(Ve, Te)) of the positive sample feature pair;
obtaining the first correlation loss function (e.g., the first correlation loss function L_rank<Ve,Tm>) based on a second preset value (e.g., 0) and a maximum function of the first correlation function; and
deriving a second correlation function based on the first preset value (e.g., the first preset value λ), a second metric function (e.g., s(Vm, Te)) of the second pseudo-positive feature sample pair, and a positive sample metric function (e.g., s(Te, Ve)) of the positive sample feature pair;
obtaining the second correlation loss function (e.g., the second correlation loss function L_rank<Vm,Te>) based on the second preset value (e.g., 0) and the maximum function of the second correlation function.
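One plausible reading of the above, stated only as an assumption: the correlation function adds the margin λ to the pseudo-positive similarity and subtracts the true-positive similarity, and the loss clamps that value at the second preset value 0, i.e. a margin ranking loss.

```python
def correlation_loss(s_pseudo: float, s_positive: float, margin: float = 0.2) -> float:
    """E.g., L_rank<Ve,Tm> = max(0, margin + s(Ve, Tm) - s(Ve, Te)); margin plays the role of λ."""
    return max(0.0, margin + s_pseudo - s_positive)
```

Under this reading, the loss pushes the true video-text pair to score at least λ higher than the pseudo-positive pair built from masked features.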
Referring to fig. 7, fig. 7 shows a flow diagram of a video processing method according to an embodiment of the present disclosure. In fig. 7, a video processing method 700 may include the following steps.
In step S710, video data to be processed is acquired;
in step S720, feature extraction is performed on the video data based on the multi-modal fusion characteristic of the first model to obtain video features with multi-modal characteristics; the first model has a multi-modal fusion characteristic of fusing a video modality and a text modality;
in step S730, searching a text feature set for a target text feature matching the video feature based on the video feature with multi-modal characteristics;
in step S740, a target text is generated based on the target text feature and output.
It should be noted that the method of the embodiments of the present disclosure may be executed by a single device, such as a computer or a server. The method of this embodiment may also be applied in a distributed scenario and performed by a plurality of devices cooperating with one another. In such a distributed scenario, one of the devices may perform only one or more steps of the method of the embodiments of the present disclosure, and the devices may interact with one another to complete the method.
It should be noted that the above describes some embodiments of the disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Based on the same technical concept, and corresponding to any of the above method embodiments, the present disclosure further provides a video processing apparatus. Referring to fig. 8, the video processing apparatus includes:
the first acquisition module is used for acquiring an input text;
the first model module is used for performing feature extraction on the input text based on the multi-modal fusion characteristic of the first model to obtain a text feature with multi-modal characteristics, wherein the first model has a multi-modal fusion characteristic of fusing a video modality and a text modality; searching a video feature set for a target video feature matching the text feature, based on the text feature with multi-modal characteristics; and outputting a target video corresponding to the target video feature.
Based on the same technical concept, and corresponding to any of the above method embodiments, the present disclosure further provides a video processing apparatus. Referring to fig. 9, the video processing apparatus includes:
the second acquisition module is used for acquiring video data to be processed;
the second model module is used for performing feature extraction on the video data based on the multi-modal fusion characteristic of the first model to obtain a video feature with multi-modal characteristics, wherein the first model has a multi-modal fusion characteristic of fusing a video modality and a text modality; searching a text feature set for a target text feature matching the video feature, based on the video feature with multi-modal characteristics; and generating and outputting a target text based on the target text feature.
For convenience of description, the above apparatus is described as being divided into various modules by function, each described separately. Of course, when implementing the present disclosure, the functions of the various modules may be implemented in one or more pieces of software and/or hardware.
The apparatus in the foregoing embodiment is used to implement the corresponding video processing method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same technical concept, and corresponding to any of the above method embodiments, the present disclosure further provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the video processing method according to any of the above embodiments.
Computer-readable media of the present embodiments, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
The computer instructions stored in the storage medium of the foregoing embodiment are used to enable the computer to execute the video processing method according to any of the foregoing embodiments, and have the beneficial effects of the corresponding method embodiments, and are not described herein again.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is exemplary only and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples; within the spirit of the present disclosure, technical features of the above embodiments or of different embodiments may also be combined, steps may be implemented in any order, and many other variations of the different aspects of the embodiments of the present disclosure exist as described above, which are not provided in detail for the sake of brevity.
In addition, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown in the provided figures for simplicity of illustration and discussion, and so as not to obscure the embodiments of the disclosure. Furthermore, devices may be shown in block diagram form in order to avoid obscuring embodiments of the present disclosure, and this also takes into account the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the embodiments of the present disclosure are to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the disclosure, it should be apparent to one skilled in the art that the embodiments of the disclosure can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the present disclosure has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures, such as Dynamic RAM (DRAM), may use the discussed embodiments.
The disclosed embodiments are intended to embrace all such alternatives, modifications, and variations that fall within the broad scope of the appended claims. Therefore, any omissions, modifications, equivalents, improvements, and the like that may be made without departing from the spirit or scope of the embodiments of the present disclosure are intended to be included within the scope of the disclosure.

Claims (15)

1. A video processing method, comprising:
acquiring an input text;
performing feature extraction on the input text based on the multi-modal fusion characteristic of the first model to obtain a text feature with multi-modal characteristics, wherein the first model has a multi-modal fusion characteristic of fusing a video modality and a text modality;
searching a video feature set for a target video feature matching the text feature, based on the text feature with multi-modal characteristics;
and outputting a target video corresponding to the target video feature.
2. The method according to claim 1, wherein the multi-modal fusion characteristic of the first model is determined based on fusion modal features obtained by the first model fusing a first training feature sample; the first training feature sample is obtained by performing feature extraction on a first training sample comprising a video modality and a text modality.
3. The method of claim 1, wherein:
the first model needs to be pre-trained on an initial model based on a first training sample, and specifically includes:
obtaining a first training sample;
performing feature extraction on the basis of the first training sample to obtain a first training feature sample, and fusing the first training feature sample to obtain a fusion modal feature;
and pre-training the initial model based on the first training characteristic sample and the fusion modal characteristics to obtain the pre-training model.
4. The method of claim 3, wherein said obtaining a first training sample comprises:
obtaining at least one video text pair, wherein the video text pair comprises a video sample and a text sample;
randomly selecting a first preset proportion of video blocks in each video block packet of the video samples for masking to generate the video mask samples;
randomly selecting a preset type text with a second preset proportion in the text sample for masking to generate the text mask sample;
deriving the first training sample based on the at least one video text pair, the video mask sample, and the text mask sample.
5. The method of claim 3, further comprising:
the first model is obtained by training the pre-training model based on a second training sample, and specifically comprises the following steps:
acquiring the second training sample, wherein the second training sample comprises at least one video-text training pair, and each video-text training pair comprises a video training sample and a corresponding text training sample;
and training the pre-training model based on the second training sample until a target training requirement is met to obtain the first model.
6. A method according to claim 3, wherein the fusion modality features include a first fusion modality feature and a second fusion modality feature;
obtaining a first training feature sample after feature extraction is carried out on the basis of the first training sample, and fusing the first training feature sample to obtain a fusion modal feature, further comprising:
performing feature extraction on the video sample to obtain video features, performing feature extraction on the text sample to obtain text features, performing feature extraction on the video mask sample to obtain video mask features, and performing feature extraction on the text mask sample to obtain text mask features;
performing multi-mode feature fusion based on the video feature and the text mask feature to obtain the first fusion modal feature; and
and performing multi-mode feature fusion based on the text feature and the video mask feature to obtain a second fusion modal feature.
7. The method according to claim 6, wherein the pre-training an initial model based on the first training feature sample and the fused modality features to obtain the pre-trained model, further comprises:
obtaining a first pseudo-positive feature sample pair based on the video feature and the text mask feature, obtaining a second pseudo-positive feature sample pair based on the text feature and the video mask feature, and obtaining a positive sample feature pair based on the video feature and the text feature;
calculating a first feature alignment loss function of the first pseudo-positive feature sample pair and a second feature alignment loss function of the second pseudo-positive feature sample pair;
calculating a first correlation loss function of the first pseudo-positive feature sample pair and the positive sample feature pair and a second correlation loss function of the second pseudo-positive feature sample and the positive sample feature pair;
obtaining the pre-trained total loss function based on the first feature alignment loss function, the second feature alignment loss function, the first correlation loss function, and the second correlation loss function;
and adjusting model parameters of the initial model based on the total loss function so as to minimize the total loss function, thereby obtaining the pre-training model.
8. The method of claim 7, wherein computing a first feature alignment loss function for the first pseudo-positive feature sample pair and a second feature alignment loss function for the second pseudo-positive feature sample pair comprises:
calculating a first sub-loss function of the first pseudo-positive feature sample pair with the video feature as an anchor point and a second sub-loss function of the first pseudo-positive feature sample pair with the text feature as an anchor point based on the first fusion feature, the video feature, the text feature and the text mask feature;
obtaining the first feature alignment loss function based on a sum of the first sub-loss function and the second sub-loss function; and
calculating a third sub-loss function of the second pseudo-positive feature sample pair with the video feature as an anchor point and a fourth sub-loss function of the second pseudo-positive feature sample pair with the text feature as an anchor point based on the second fusion feature, the video feature, the text feature and the video mask feature;
and obtaining the second feature alignment loss function based on the sum of the third sub-loss function and the fourth sub-loss function.
9. The method of claim 7, wherein computing a first correlation loss function of the first pseudo positive feature sample pair and the positive sample feature pair and a second correlation loss function of the second pseudo positive feature sample and the positive sample feature pair comprises:
obtaining a first correlation function based on a first preset value, a first metric function of the first pseudo positive feature sample pair and a positive sample metric function of the positive sample feature pair;
obtaining a first correlation degree loss function based on a second preset value and a maximum function of the first correlation degree function; and
obtaining a second correlation function based on the first preset value, a second metric function of the second pseudo positive feature sample pair, and a positive sample metric function of the positive sample feature pair;
and obtaining the second correlation loss function based on the second preset value and the maximum function of the second correlation function.
10. A video processing method, comprising:
acquiring video data to be processed;
performing feature extraction on the video data based on the multi-modal fusion characteristic of the first model to obtain a video feature with multi-modal characteristics, wherein the first model has a multi-modal fusion characteristic of fusing a video modality and a text modality;
searching a text feature set for a target text feature matching the video feature, based on the video feature with multi-modal characteristics;
and generating and outputting a target text based on the target text feature.
11. A video processing apparatus, comprising:
the first acquisition module is used for acquiring an input text;
the first model module is used for performing feature extraction on the input text based on the multi-modal fusion characteristic of the first model to obtain a text feature with multi-modal characteristics, wherein the first model has a multi-modal fusion characteristic of fusing a video modality and a text modality; searching a video feature set for a target video feature matching the text feature, based on the text feature with multi-modal characteristics; and outputting a target video corresponding to the target video feature.
12. A video processing apparatus, comprising:
the second acquisition module is used for acquiring video data to be processed;
the second model module is used for performing feature extraction on the video data based on the multi-modal fusion characteristic of the first model to obtain a video feature with multi-modal characteristics, wherein the first model has a multi-modal fusion characteristic of fusing a video modality and a text modality; searching a text feature set for a target text feature matching the video feature, based on the video feature with multi-modal characteristics; and generating and outputting a target text based on the target text feature.
13. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of claims 1 to 9 when executing the program.
14. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 9.
15. A computer program product comprising computer program instructions which, when run on a computer, cause the computer to perform the method of any of claims 1 to 9.