CN116310643A - Video processing model training method, device and equipment - Google Patents

Video processing model training method, device and equipment

Info

Publication number
CN116310643A
Authority
CN
China
Prior art keywords
video
mask
visual
decoder
encoder
Prior art date
Legal status
Pending
Application number
CN202310271487.6A
Other languages
Chinese (zh)
Inventor
宋雨鑫
杨敏
吴文灏
李甫
何栋梁
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310271487.6A
Publication of CN116310643A
Legal status: Pending


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a video processing model training method, device and equipment, relating to the technical field of artificial intelligence, and in particular to the technical fields of computer vision, augmented reality, virtual reality, deep learning and the like. One embodiment of the method comprises the following steps: obtaining a mask video frame, wherein the mask video frame comprises a visible video block and a mask video block; inputting the mask video frame to an encoder, and learning the characteristics of the mask video frame; inputting the features of the mask video frame to a visual decoder and a motion decoder respectively, and predicting a visual codebook and hidden motion information of the mask video frame; calculating a loss based on the visual codebook and the hidden motion information; and adjusting parameters of the encoder, the visual decoder and the motion decoder based on the loss to obtain a video processing model. This embodiment provides a video processing model training method based on joint supervision of visual semantics and motion transformation, which not only reduces the training cost but also significantly improves performance on downstream video tasks.

Description

Video processing model training method, device and equipment
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of computer vision, augmented reality, virtual reality, deep learning and the like.
Background
Visual pre-training is currently an important research direction and application focus in deep learning. In a typical large-scale visual pre-training scheme, a model is first pre-trained on a large-scale visual dataset using unsupervised learning and then fine-tuned on a specific downstream task, which can achieve excellent results in a series of downstream tasks such as image classification, target detection, semantic segmentation and action recognition.
Disclosure of Invention
The embodiment of the disclosure provides a video processing model training method, a device, equipment, a storage medium and a program product.
In a first aspect, an embodiment of the present disclosure provides a video processing model training method, including: obtaining a mask video frame, wherein the mask video frame comprises a visible video block and a mask video block; inputting the mask video frame to an encoder, and learning the characteristics of the mask video frame; inputting the features of the mask video frame to a visual decoder and a motion decoder respectively, and predicting a visual codebook and hidden motion information of the mask video frame; calculating a loss based on the visual codebook and the hidden motion information; and adjusting parameters of the encoder, the visual decoder and the motion decoder based on the loss to obtain a video processing model.
In a second aspect, an embodiment of the present disclosure provides a video processing method, including: obtaining a video to be processed of a target task, wherein the target task comprises at least one of the following: image classification, target detection, semantic segmentation and action recognition; inputting the video to be processed into a video processing model to obtain a target task processing result of the video to be processed, wherein the video processing model is trained by the method of the first aspect.
In a third aspect, an embodiment of the present disclosure provides a video processing model training apparatus, including: the first acquisition module is configured to acquire a mask video frame, wherein the mask video frame comprises a visible video block and a mask video block; the encoding module is configured to input the mask video frame to the encoder and learn the characteristics of the mask video frame; the decoding module is configured to input the characteristics of the mask video frame to the visual decoder and the motion decoder respectively, and predict the visual codebook and the hidden motion information of the mask video frame; a first calculation module configured to calculate a loss based on the visual codebook and the hidden motion information; the first training module is configured to adjust parameters of the encoder, the visual decoder and the motion decoder based on the loss to obtain a video processing model.
In a fourth aspect, an embodiment of the present disclosure provides a video processing apparatus, including: the system comprises an acquisition module configured to acquire a video to be processed of a target task, wherein the target task comprises at least one of the following: image classification, target detection, semantic segmentation and action recognition; the processing module is configured to input the video to be processed into a video processing model to obtain a target task processing result of the video to be processed, wherein the video processing model is trained by the device according to the third aspect.
In a fifth aspect, an embodiment of the present disclosure proposes an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method as described in any one of the implementations of the first or second aspect.
In a sixth aspect, embodiments of the present disclosure provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method as described in any one of the implementations of the first or second aspects.
In a seventh aspect, embodiments of the present disclosure propose a computer program product comprising a computer program which, when executed by a processor, implements a method as described in any of the implementations of the first or second aspects.
The embodiment of the disclosure provides a video processing model training method based on joint supervision of visual semantics and motion transformation, which uses two disentangled decoders to simultaneously predict the semantic codebook and the inter-frame motion changes of mask video frames, so as to decouple and reconstruct the visual and motion representations. In addition, jointly predicting the semantic codebook and the inter-frame motion changes of the mask video frames further drives the encoder to acquire a stronger spatio-temporal feature extraction capability, so that a more robust and generalizable spatio-temporal video representation is learned and the encoder can be migrated more efficiently to video-related downstream tasks. This not only reduces the training cost, but also significantly improves performance on downstream video tasks.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
Other features, objects and advantages of the present disclosure will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings. The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of one embodiment of a video processing model training method according to the present disclosure;
FIG. 2 is a flow chart of yet another embodiment of a video processing model training method according to the present disclosure;
FIG. 3 is a scene graph of a video processing model training method in which embodiments of the present disclosure may be implemented;
FIG. 4 is a flow chart of one embodiment of a video processing method according to the present disclosure;
FIG. 5 is a schematic diagram of the architecture of one embodiment of a video processing model training apparatus according to the present disclosure;
FIG. 6 is a schematic diagram of the architecture of one embodiment of a video processing device according to the present disclosure;
fig. 7 is a block diagram of an electronic device used to implement a video processing model training method or video processing method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that, without conflict, the embodiments of the present disclosure and features of the embodiments may be combined with each other. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
FIG. 1 illustrates a flow 100 of one embodiment of a video processing model training method according to the present disclosure. The video processing model training method comprises the following steps:
step 101, obtaining a mask video frame.
In this embodiment, the execution subject of the video processing model training method may acquire the mask video frame.
Generally, a video frame is obtained from a video, and a masked video prediction framework is adopted to mask a portion of the video blocks of the video frame, thereby obtaining a mask video frame. Wherein the mask video frame may include a visible video block and a mask video block. The visible video block is an unmasked video block in the video frame, and the mask video block is a masked video block in the video frame. In some embodiments, a preset proportion (e.g., 75%) of the video blocks in the video frame are randomly masked to obtain the mask video frame.
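As a non-limiting illustration of the random masking in step 101, the following PyTorch sketch splits a clip of T frames into non-overlapping video blocks and masks a preset proportion of them at random; the patch size, the 75% masking ratio and the (T, C, H, W) tensor layout are exemplary assumptions rather than values fixed by the disclosure.

```python
import torch

def random_mask_video(frames: torch.Tensor, patch_size: int = 16, mask_ratio: float = 0.75):
    """Split T frames into non-overlapping video blocks and randomly mask a preset
    proportion of them, returning the flattened blocks and the masking layout."""
    T, C, H, W = frames.shape
    # (T, C, H, W) -> (T * H/ps * W/ps, C * ps * ps) flattened video blocks
    blocks = (
        frames.unfold(2, patch_size, patch_size)
        .unfold(3, patch_size, patch_size)
        .permute(0, 2, 3, 1, 4, 5)
        .reshape(-1, C * patch_size * patch_size)
    )
    num_blocks = blocks.shape[0]
    num_masked = int(num_blocks * mask_ratio)
    perm = torch.randperm(num_blocks)
    masked_idx, visible_idx = perm[:num_masked], perm[num_masked:]
    mask = torch.zeros(num_blocks, dtype=torch.bool)
    mask[masked_idx] = True        # True marks a mask video block
    return blocks, mask, visible_idx, masked_idx
```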
Step 102, inputting the mask video frame to an encoder, and learning the characteristics of the mask video frame.
In this embodiment, the executing body may input the mask video frame to the encoder, and learn the features of the mask video frame.
Wherein the encoder may compress the input into a latent space representation. Here, the encoder learns not only the features of the visible video blocks in the mask video frame, but also the features of the mask video blocks in the mask video frame. Moreover, the features learned by the encoder may also capture the spatio-temporal relationships inherent in the video data.
Step 103, inputting the features of the mask video frame to a visual decoder and a motion decoder respectively, and predicting the visual codebook and hidden motion information of the mask video frame.
In this embodiment, the executing body may input the features of the mask video frame to the visual decoder, predict the visual codebook of the mask video frame, and input the features of the mask video frame to the motion decoder, and predict the hidden motion information of the mask video frame.
Here, two disentangled decoders are used simultaneously to predict the reconstructed visual appearance of the video frames and the motion information in the video frames. The visual decoder may be used to predict the visual codebook of the video frames, and the motion decoder may be used to predict the various motion information hidden in the video data, thereby providing complementary semantic cues during the pre-training stage. Moreover, using the two disentangled decoders decouples the appearance view and the motion view at the target level, which handles redundant spatio-temporal data effectively and accelerates convergence in the pre-training stage. Furthermore, the two disentangled decoders force the encoder to learn the spatio-temporal relationships inherent in the video data.
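As a non-limiting sketch of the two disentangled decoders, the following PyTorch module attaches a visual head (predicting logits over a discrete visual codebook) and a motion head (regressing per-block pixel differences) to the features of the mask video blocks; the Transformer depth, width and output dimensions are illustrative assumptions rather than an architecture prescribed by the disclosure.

```python
import torch
import torch.nn as nn

class DisentangledDecoders(nn.Module):
    """Visual decoder and motion decoder operating on mask-video-block features."""

    def __init__(self, dim: int = 768, codebook_size: int = 8192,
                 motion_dim: int = 768, depth: int = 2, heads: int = 12):
        super().__init__()
        visual_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        motion_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.visual_decoder = nn.Sequential(
            nn.TransformerEncoder(visual_layer, num_layers=depth),
            nn.Linear(dim, codebook_size),   # logits over the discrete visual codebook
        )
        self.motion_decoder = nn.Sequential(
            nn.TransformerEncoder(motion_layer, num_layers=depth),
            nn.Linear(dim, motion_dim),      # regresses per-block pixel differences
        )

    def forward(self, mask_block_features: torch.Tensor):
        # mask_block_features: (B, |M|, dim) features of the mask video blocks
        codebook_logits = self.visual_decoder(mask_block_features)
        motion_pred = self.motion_decoder(mask_block_features)
        return codebook_logits, motion_pred
```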
Step 104, calculating the loss based on the visual codebook and the hidden motion information.
In the present embodiment, the above-described execution subject may calculate the loss based on the visual codebook and the hidden motion information.
In general, for the unmasked original video frame corresponding to the mask video frame, the real visual codebook and the real motion information may be obtained. The loss may then be calculated based on the difference between the real visual codebook and the predicted visual codebook, and the difference between the real motion information and the predicted motion information. For example, one loss may be calculated by inputting the real visual codebook and the predicted visual codebook into a corresponding loss function, and another loss may be obtained by inputting the real motion information and the predicted motion information into a corresponding loss function. The two losses are then weighted and summed to obtain the final total loss.
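The weighted combination of the two losses described above may be sketched as follows; the cross-entropy form for the visual branch and the mean-squared form for the motion branch follow the later embodiments, while the weight values themselves are illustrative assumptions (the disclosure only states that the two losses are weighted and summed).

```python
import torch.nn.functional as F

def pretraining_loss(codebook_logits, target_codes, motion_pred, motion_target,
                     w_visual: float = 1.0, w_motion: float = 1.0):
    """Weighted sum of the visual-codebook loss and the motion loss.

    codebook_logits: (B, |M|, K) predictions of the visual decoder.
    target_codes:    (B, |M|)    discrete codebook indices derived from the original frames.
    motion_pred:     (B, |M|, D) output of the motion decoder.
    motion_target:   (B, |M|, D) real motion information (pixel differences) of the mask blocks.
    """
    loss_visual = F.cross_entropy(codebook_logits.flatten(0, 1), target_codes.flatten())
    loss_motion = F.mse_loss(motion_pred, motion_target)
    return w_visual * loss_visual + w_motion * loss_motion
```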
Step 105, adjusting parameters of the encoder, the visual decoder and the motion decoder based on the loss to obtain a video processing model.
In this embodiment, the execution body may adjust parameters of the encoder, the visual decoder, and the motion decoder based on the loss to obtain the video processing model.
Typically, the parameters of the above encoder, visual decoder and motion decoder are adjusted based on the loss until convergence, i.e., training of the video processing model is completed.
In some embodiments, to enhance the performance of the video processing model on downstream video tasks, after the pre-training of the video processing model is completed, a training sample set of the target task may also be obtained to continue training the video processing model. Specifically, the sample video is taken as input, the target task processing result of the sample video is taken as output, and training of the video processing model continues. The training sample set may be a small labeled video dataset of the target task, and each training sample may include a sample video and the target task processing result of the sample video. The parameters of the encoder, the visual decoder and the motion decoder in the video processing model are fine-tuned using the labeled video dataset, so that the video processing model can achieve excellent results on the target task. The target task is a downstream video task, and may include, but is not limited to, image classification, target detection, semantic segmentation, action recognition, and the like.
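A minimal sketch of one such fine-tuning step on a labeled sample is shown below; the classification head, the pooling over tokens and the use of cross-entropy are assumptions chosen for an action-recognition-style task, since the disclosure only specifies that the pre-trained parameters are fine-tuned with labeled data of the target task.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def finetune_step(encoder: nn.Module, task_head: nn.Module,
                  optimizer: torch.optim.Optimizer,
                  video: torch.Tensor, label: torch.Tensor) -> float:
    """One supervised fine-tuning step: sample video in, target task result out."""
    optimizer.zero_grad()
    features = encoder(video)                 # (B, N, dim) spatio-temporal features
    logits = task_head(features.mean(dim=1))  # pool the tokens, then predict the task output
    loss = F.cross_entropy(logits, label)
    loss.backward()
    optimizer.step()
    return loss.item()
```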
The embodiment of the disclosure provides a video processing model training method based on joint supervision of visual semantics and motion transformation, which uses two disentangled decoders to simultaneously predict the semantic codebook and the inter-frame motion changes of mask video frames, so as to decouple and reconstruct the visual and motion representations. In addition, jointly predicting the semantic codebook and the inter-frame motion changes of the mask video frames further drives the encoder to acquire a stronger spatio-temporal feature extraction capability, so that a more robust and generalizable spatio-temporal video representation is learned and the encoder can be migrated more efficiently to video-related downstream tasks. This not only reduces the training cost, but also significantly improves performance on downstream video tasks.
With continued reference to fig. 2, a flow 200 of yet another embodiment of a video processing model training method in accordance with the present disclosure is shown. The video processing model training method comprises the following steps:
step 201, a mask video frame is acquired.
In this embodiment, the specific operation of step 201 is described in detail in step 101 in the embodiment shown in fig. 1, and will not be described herein.
Step 202, inputting the visible video block to a first encoder, and learning the characteristics of the visible video block.
In this embodiment, the execution subject of the video processing model training method may input the visible video block to the first encoder, and learn the features of the visible video block. Wherein the first encoder may compress the input into a latent space representation. Here, the first encoder learns the features of the visible video blocks in the mask video frame.
Step 203, inputting the mask video block to the second encoder, and learning the features of the mask video block.
In this embodiment, the executing body may input the mask video block to the second encoder, and learn the features of the mask video block. Wherein the second encoder may compress the input into a latent space representation. Here, the second encoder learns the features of the mask video blocks in the mask video frame.
Step 204, inputting the features of the visible video block into a hidden variable regressor to obtain the predicted features of the mask video block.
In this embodiment, the execution body may input the features of the visible video block to the hidden variable regressor to obtain the predicted features of the mask video block.
To eliminate the difference between the appearance view and the motion view, a hidden variable regressor is employed to further map the latent representations output by the encoder. In particular, the hidden variable regressor may query from the embeddings of the visible video blocks and predict the representations of the mask video blocks.
Wherein the hidden variable regressor may be composed of stacked cross-attention modules. Each cross-attention module may take the hidden variable representation of the visible video block as keys and values, take a learnable query mask (mask queries) as queries, and predict the feature representation of the mask video block by cross-attention.
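A non-limiting sketch of such a regressor is given below: a learnable query mask attends, through stacked cross-attention blocks, to the visible-block features used as keys and values, and outputs predicted features for the mask video blocks. The depth, width, head count and maximum number of masked tokens are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HiddenVariableRegressor(nn.Module):
    """Stacked cross-attention regressor predicting mask-video-block features."""

    def __init__(self, dim: int = 768, num_heads: int = 12, depth: int = 4,
                 max_masked_tokens: int = 1568):
        super().__init__()
        self.mask_queries = nn.Parameter(torch.zeros(1, max_masked_tokens, dim))
        nn.init.trunc_normal_(self.mask_queries, std=0.02)
        self.attn_blocks = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(depth)]
        )
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(depth)])

    def forward(self, visible_feats: torch.Tensor, num_masked: int) -> torch.Tensor:
        # visible_feats: (B, N_visible, dim) output of the first encoder
        B = visible_feats.size(0)
        q = self.mask_queries[:, :num_masked].expand(B, -1, -1)
        for attn, norm in zip(self.attn_blocks, self.norms):
            out, _ = attn(query=q, key=visible_feats, value=visible_feats)
            q = norm(q + out)   # residual connection around each cross-attention block
        return q                # (B, num_masked, dim) predicted mask-video-block features
```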
Step 205, an alignment loss is calculated based on the predicted features of the mask video block and the features of the mask video block.
In this embodiment, the execution subject may calculate the alignment loss based on the predicted features of the mask video block and the features of the mask video block, and use it as a constraint.
In the training stage, the real features of the mask video block learned by the second encoder are aligned with the predicted features of the mask video block output by the hidden variable regressor, thereby completing the pretext task of the pre-training stage.
The alignment loss L_align may be calculated by the following formula (written here, for example, as a mean squared distance):

L_align = (1/|M|) * sum_{p in M} || r_p - f_p ||^2

wherein M is the set of all mask video blocks of the T frames, |M| is the number of mask video block tokens, r_p is the output of the regressor for the p-th mask video block, and f_p is the output of the second encoder for the same mask video block; f_p is used as the label against which r_p is compared to calculate the alignment loss L_align.
Step 206, inputting the predicted features of the mask video block to a visual decoder and a motion decoder respectively, and predicting the visual codebook and the hidden motion information.
In this embodiment, the executing body may input the prediction features of the mask video block to the visual decoder, predict the visual codebook, and input the prediction features of the mask video block to the motion decoder to predict the hidden motion information.
Here, two disentangled decoders are used simultaneously to predict the reconstructed visual appearance of the video frames and the motion information in the video frames. The visual decoder may be used to predict the visual codebook of the video frames, and the motion decoder may be used to predict the various motion information hidden in the video data, thereby providing complementary semantic cues during the pre-training stage. Moreover, using the two disentangled decoders decouples the appearance view and the motion view at the target level, which handles redundant spatio-temporal data effectively and accelerates convergence in the pre-training stage. Furthermore, the two disentangled decoders force the encoder to learn the spatio-temporal relationships inherent in the video data.
Step 207, inputting the visual codebook into a pre-trained tokenizer to generate a discrete codebook.
In this embodiment, the execution body may input the visual codebook to a pre-trained tokenizer to generate the discrete codebook.
For the visual decoder, an off-the-shelf pre-trained tokenizer of a discrete autoregressive encoder is used to generate a discrete codebook as the training target.
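As a hedged illustration only, the sketch below shows how such a tokenizer can map continuous block features to discrete codebook indices by nearest-neighbour lookup in a frozen codebook; in practice the pre-trained discrete autoregressive encoder supplies its own codebook and encoding procedure, and the dimensions here are assumptions.

```python
import torch

@torch.no_grad()
def tokenize_to_codebook(features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map each video-block feature to the index of its nearest codebook entry.

    features: (N, D) continuous visual features of the video blocks.
    codebook: (K, D) frozen embedding table of the pre-trained tokenizer.
    Returns:  (N,) discrete code indices used as cross-entropy targets.
    """
    # Squared Euclidean distance between every feature and every codebook entry.
    dists = (
        features.pow(2).sum(dim=1, keepdim=True)
        - 2 * features @ codebook.t()
        + codebook.pow(2).sum(dim=1)
    )
    return dists.argmin(dim=1)
```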
Step 208, cross entropy loss is calculated based on the discrete codebook.
In this embodiment, the execution body may calculate the cross entropy loss based on the discrete codebook, and use it as a constraint.
The cross entropy loss L_ce may be calculated by the following formula:

L_ce = -(1/|M|) * sum_{i in M} sum_{j=1}^{K} y_{m(i,j)} * log p_{i,j}

wherein M is the set of all mask video blocks of the T frames, |M| is the number of mask video block tokens, K is the number of categories of the discrete representations generated by the tokenizer of the discrete autoregressive encoder, y_{m(i,j)} is the label value of the j-th class for the i-th input video block, and p_{i,j} is the predicted value of the j-th class for the i-th input video block.
Step 209, calculating pixel difference values corresponding to the mask video blocks based on the hidden motion information.
In this embodiment, the executing body may calculate the pixel difference value corresponding to the mask video block based on the hidden motion information.
For the motion decoder, the pixel differences of the input T frames in the masked regions are calculated, and these pixel differences are used as the supervision information for the motion branch.
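The frame-difference supervision described above may be computed, for example, as in the following sketch, which takes the differences of consecutive original frames and keeps only the blocks that were masked; the patch layout and tensor shapes are illustrative assumptions.

```python
import torch

def pixel_difference_targets(frames: torch.Tensor, block_mask: torch.Tensor,
                             patch_size: int = 16) -> torch.Tensor:
    """Per-block pixel differences between consecutive frames, used as motion targets.

    frames:     (T, C, H, W) original (unmasked) video clip.
    block_mask: (T, H // patch_size, W // patch_size) True where a block is masked.
    Returns the flattened pixel differences of the masked blocks of frames 2..T.
    """
    diff = frames[1:] - frames[:-1]                      # (T-1, C, H, W)
    T1, C, H, W = diff.shape
    blocks = (
        diff.unfold(2, patch_size, patch_size)
        .unfold(3, patch_size, patch_size)
        .permute(0, 2, 3, 1, 4, 5)
        .reshape(T1, -1, C * patch_size * patch_size)    # (T-1, blocks per frame, D)
    )
    masked = block_mask[1:].reshape(T1, -1)              # masks of frames 2..T
    return blocks[masked]                                # (number of masked blocks, D)
```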
Step 210, a mean square error loss is calculated based on the pixel differences.
In this embodiment, the execution body may calculate the mean square error loss based on the pixel difference value, and optimize by the mean square error loss.
The mean square error loss L_mse may be calculated by the following formula:

L_mse = (1/|M'|) * sum_{p in M'} || D_p - Delta_p ||^2

wherein M' represents the set of all mask video blocks from the second frame to the T-th frame, D_p is the output of the motion decoder for the p-th mask video block, and Delta_p is the pixel difference value corresponding to the p-th mask video block.
Step 211, calculating the sum of the alignment loss, the cross entropy loss and the mean square error loss to obtain the total loss.
In this embodiment, the execution body may calculate the sum of the alignment loss, the cross entropy loss, and the mean square error loss, to obtain the total loss.
The total loss L_total may be calculated by the following formula:

L_total = L_ce + L_mse + L_align

wherein L_ce is the cross entropy loss, L_mse is the mean square error loss, and L_align is the alignment loss.
Step 212, parameters of the first encoder, the second encoder, the visual decoder and the motion decoder are adjusted based on the total loss to obtain a video processing model.
In this embodiment, the execution body may adjust the parameters of the first encoder, the second encoder, the visual decoder and the motion decoder based on the total loss until the training objective is met, so as to obtain the video processing model. The training objective may be that the performance of the video processing model composed of the first encoder, the second encoder, the visual decoder and the motion decoder reaches a preset level.
Typically, the parameters of the above first encoder, second encoder, visual decoder and motion decoder are adjusted based on the total loss until convergence, i.e. training of the video processing model is completed.
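For illustration, one joint pre-training step over the three losses of flow 200 may look like the following sketch; the batch layout, the mean-squared form of the alignment loss and the equal weighting of the three terms are assumptions consistent with the formulas above, and the module names simply mirror the description.

```python
import torch
import torch.nn.functional as F

def pretrain_step(batch, first_encoder, second_encoder, regressor,
                  visual_decoder, motion_decoder, optimizer) -> float:
    """One pre-training step combining alignment, cross-entropy and MSE losses."""
    visible_feats = first_encoder(batch["visible_blocks"])      # features of visible blocks
    mask_feats = second_encoder(batch["mask_blocks"])           # real mask-block features
    pred_feats = regressor(visible_feats, batch["mask_blocks"].size(1))

    # Alignment loss: regressor predictions vs. second-encoder features.
    loss_align = F.mse_loss(pred_feats, mask_feats)

    # Visual branch: cross entropy against the discrete codebook targets.
    logits = visual_decoder(pred_feats)                         # (B, |M|, K)
    loss_ce = F.cross_entropy(logits.flatten(0, 1), batch["codebook_targets"].flatten())

    # Motion branch: mean square error against the pixel-difference targets.
    motion_pred = motion_decoder(pred_feats)
    loss_mse = F.mse_loss(motion_pred, batch["motion_targets"])

    loss_total = loss_ce + loss_mse + loss_align                # sum of the three losses
    optimizer.zero_grad()
    loss_total.backward()
    optimizer.step()
    return loss_total.item()
```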
As can be seen from fig. 2, compared with the embodiment corresponding to fig. 1, the flow 200 of the video processing model training method in this embodiment highlights the training steps. In the scheme described in this embodiment, the real features of the mask video block learned by the second encoder are aligned with the predicted features of the mask video block output by the hidden variable regressor, so that the pretext task of the pre-training stage can be completed. The cross entropy loss is calculated based on the class label values of the mask video blocks and the class predictions of the tokenizer, and is used as the supervision constraint of the visual decoder. The mean square error loss is calculated based on the output of the motion decoder and the pixel differences, and is used as the supervision constraint of the motion decoder. Training with the combination of these three loss functions can improve the effectiveness of the video processing model.
For ease of understanding, fig. 3 illustrates a scene graph of a training method of a video processing model in which embodiments of the present disclosure may be implemented.
Random masking is performed on the input video clip to obtain visible video blocks and mask video blocks. The visible video blocks are input to a first encoder to learn the features of the visible video blocks, and the mask video blocks are simultaneously input to a second encoder to learn the features of the mask video blocks.
To eliminate the difference between the appearance view and the motion view, the features of the visible video blocks are input to the regressor, a learnable query mask is used as the query, and the features of the mask video blocks are predicted by cross-attention. The alignment loss L_align is calculated based on the output of the second encoder and the output of the regressor, thereby completing the pretext task of the pre-training stage.
The prediction features of the masked video blocks are input to both the visual decoder and the motion decoder to predict the reconstructed visual appearance of the video and the motion information in the video. Wherein the motion decoder can predict various motion information hidden in the video data, and the visual decoder can predict the visual codebook of the video, thereby providing supplementary semantic cues in the pre-training stage.
For the visual decoder, the tokenizer of a pre-trained discrete autoregressive encoder is used to generate a discrete codebook as the training target, and the cross entropy loss L_ce is applied as a constraint. For the motion decoder, the pixel differences of the input T frames in the masked regions are calculated; these pixel differences are used as the supervision information for the motion branch, which is optimized with the mean square error loss L_mse.
Finally, the total loss function of the pre-training stage is the sum of the alignment loss, the cross entropy loss and the mean square error loss, and the parameters are adjusted based on the total loss to complete training of the video processing model.
With further reference to fig. 4, a flow 400 of one embodiment of a video processing method according to the present disclosure is shown. The video processing method comprises the following steps:
step 401, obtaining a video to be processed of a target task.
In this embodiment, the execution subject of the video processing method may acquire the video to be processed of the target task. Among other things, target tasks may include, but are not limited to, image classification, target detection, semantic segmentation, and action recognition, among others.
Step 402, inputting the video to be processed into the video processing model to obtain a target task processing result of the video to be processed.
In this embodiment, the executing body may input the video to be processed into the video processing model to obtain a target task processing result of the video to be processed. The video processing model may be trained by the method shown in fig. 1 or fig. 2, which is not described herein.
In general, for a video frame in the video to be processed, the video frame is input to the encoder to learn the features of the video frame; the features of the video frame are input to the visual decoder and the motion decoder to predict the visual codebook and the motion information of the video frame; and target task processing is performed based on the visual codebook and the motion information of the video frame to obtain the target task processing result.
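A minimal inference sketch following this flow is shown below; the attribute names on `model`, the pooling of the decoder outputs and the task head that fuses them are assumptions for illustration rather than a structure fixed by the disclosure.

```python
import torch

@torch.no_grad()
def process_video(frames: torch.Tensor, model, task_head) -> torch.Tensor:
    """Run the trained video processing model on a clip for a downstream task."""
    features = model.encoder(frames)                    # spatio-temporal features
    codebook_logits = model.visual_decoder(features)    # predicted visual codebook
    motion_pred = model.motion_decoder(features)        # predicted motion information
    # Pool and fuse both views, then hand them to the task-specific head.
    fused = torch.cat([codebook_logits.mean(dim=1), motion_pred.mean(dim=1)], dim=-1)
    return task_head(fused)                             # target task processing result
```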
In some embodiments, to enhance the performance of the video processing model on the target task, a small labeled video dataset of the target task may be obtained, and the parameters of the video processing model may be fine-tuned using the labeled video dataset. The video to be processed is then input to the fine-tuned video processing model for processing, so that the video processing model can achieve excellent results on the target task.
With further reference to fig. 5, as an implementation of the method shown in the foregoing figures, the present disclosure provides an embodiment of a video processing model training apparatus, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 1, and the apparatus may be specifically applied to various electronic devices.
As shown in fig. 5, the video processing model training apparatus 500 of the present embodiment may include: a first acquisition module 501, an encoding module 502, a decoding module 503, a first calculation module 504, and a first training module 505. Wherein, the first obtaining module 501 is configured to obtain a mask video frame, where the mask video frame includes a visible video block and a mask video block; an encoding module 502 configured to input the mask video frame to an encoder, learn characteristics of the mask video frame; a decoding module 503 configured to input features of the mask video frame to the visual decoder and the motion decoder, respectively, and predict a visual codebook and hidden motion information of the mask video frame; a first calculation module 504 configured to calculate a loss based on the visual codebook and the hidden motion information; a first training module 505 is configured to adjust parameters of the encoder, the visual decoder and the motion decoder based on the loss, resulting in a video processing model.
In this embodiment, in the video processing model training apparatus 500: specific processing of the first acquisition module 501, the encoding module 502, the decoding module 503, the first calculation module 504, and the first training module 505 and the technical effects thereof may refer to the relevant descriptions of steps 101-105 in the corresponding embodiment of fig. 1, and are not repeated herein.
In some alternative implementations of the present embodiment, the encoding module 502 includes: a first encoding submodule configured to input a visible video block to the first encoder, learn characteristics of the visible video block; and the regression sub-module is configured to input the characteristics of the visible video block into the hidden variable regressor to obtain the predicted characteristics of the mask video block.
In some alternative implementations of the present embodiment, the decoding module 503 is further configured to: and respectively inputting the prediction characteristics of the mask video block into a visual decoder and a motion decoder, and predicting a visual codebook and hiding motion information.
In some alternative implementations of the present embodiment, the first computing module 504 is further configured to: inputting the visual codebook into a pre-trained tokenizer to generate a discrete codebook; calculating cross entropy loss based on the discrete codebook; calculating pixel difference values corresponding to the mask video blocks based on the hidden motion information; based on the pixel difference, a mean square error loss is calculated.
In some alternative implementations of the present embodiment, the encoding module 502 further includes: and a second encoding sub-module configured to input the mask video block to a second encoder, and learn characteristics of the mask video block.
In some optional implementations of this embodiment, the video processing model training apparatus 500 further includes: and a second calculation module configured to calculate an alignment loss based on the predicted features of the mask video block and the features of the mask video block.
In some optional implementations of the present embodiment, the first training module 505 is further configured to: calculating the sum of the alignment loss, the cross entropy loss and the mean square error loss to obtain the total loss; parameters of the first encoder, the second encoder, the visual decoder and the motion decoder are adjusted based on the total loss to obtain a video processing model.
In some alternative implementations of the present embodiment, the hidden variable regressor is composed of stacked cross-attention modules that use hidden variable representations of visible video blocks as keys and values, use a learnable query mask as a query, and predict features of mask video blocks through cross-attention.
In some alternative implementations of the present embodiment, the first acquisition module 501 is further configured to: acquiring a video frame; and carrying out random masking on video blocks with preset proportions in the video frames to obtain masked video frames.
In some optional implementations of this embodiment, the video processing model training apparatus 500 further includes: a second acquisition module configured to acquire a training sample set of target tasks, wherein training samples in the training sample set include sample videos and target task processing results of the sample videos, wherein the target tasks include at least one of: image classification, target detection, semantic segmentation and action recognition; and the second training module is configured to take the sample video as input, take the target task processing result of the sample video as output and train the video processing model continuously.
With further reference to fig. 6, as an implementation of the method shown in the foregoing figures, the present disclosure provides an embodiment of a video processing apparatus, which corresponds to the method embodiment shown in fig. 4, and which is particularly applicable to various electronic devices.
As shown in fig. 6, the video processing apparatus 600 of the present embodiment may include: an acquisition module 601 and a processing module 602. Wherein, the obtaining module 601 is configured to obtain a video to be processed of a target task, where the target task includes at least one of the following: image classification, target detection, semantic segmentation and action recognition; the processing module 602 is configured to input the video to be processed into a video processing model, so as to obtain a target task processing result of the video to be processed, where the video processing model is obtained by training using the apparatus shown in fig. 5.
In the present embodiment, in the video processing apparatus 600: the specific processing of the obtaining module 601 and the processing module 602 and the technical effects thereof may refer to the relevant descriptions of the steps 401 to 402 in the corresponding embodiment of fig. 4, and are not repeated herein.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the user's personal information comply with the provisions of relevant laws and regulations, and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 7 illustrates a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the apparatus 700 includes a computing unit 701 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 may also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in device 700 are connected to I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, etc.; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, an optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 701 performs the respective methods and processes described above, such as a video processing model training method or a video processing method. For example, in some embodiments, the video processing model training method or video processing method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 700 via ROM 702 and/or communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the video processing model training method or video processing method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the video processing model training method or the video processing method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions provided by the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (25)

1. A video processing model training method, comprising:
obtaining a mask video frame, wherein the mask video frame comprises a visible video block and a mask video block;
inputting the mask video frame to an encoder, and learning the characteristics of the mask video frame;
the features of the mask video frames are respectively input to a visual decoder and a motion decoder, and the visual codebook and the hidden motion information of the mask video frames are predicted;
Calculating a loss based on the visual codebook and the hidden motion information;
and adjusting parameters of the encoder, the visual decoder and the motion decoder based on the loss to obtain a video processing model.
2. The method of claim 1, wherein the inputting the masked video frame to an encoder, learning features of the masked video frame, comprises:
inputting the visible video block to a first encoder, and learning the characteristics of the visible video block;
and inputting the characteristics of the visible video block into a hidden variable regressor to obtain the predicted characteristics of the mask video block.
3. The method of claim 2, wherein the inputting the features of the masked video frame to a visual decoder and a motion decoder, respectively, predicting the visual codebook and the hidden motion information of the masked video frame comprises:
and respectively inputting the prediction characteristics of the mask video block to the visual decoder and the motion decoder to predict the visual codebook and the hidden motion information.
4. The method of claim 3, wherein the calculating a loss based on the visual codebook and the hidden motion information comprises:
Inputting the visual codebook into a pre-trained tokenizer to generate a discrete codebook;
calculating cross entropy loss based on the discrete codebook;
calculating pixel difference values corresponding to the mask video blocks based on the hidden motion information;
based on the pixel difference, a mean square error loss is calculated.
5. The method of claim 4, wherein the inputting the masked video frame to an encoder, learning features of the masked video frame, further comprises:
and inputting the mask video block to a second encoder, and learning the characteristics of the mask video block.
6. The method of claim 5, wherein the method further comprises:
and calculating the alignment loss based on the predicted features of the mask video block and the features of the mask video block.
7. The method of claim 6, wherein said adjusting parameters of the encoder, the visual decoder, and the motion decoder based on the loss results in a video processing model, comprising:
calculating the sum of the alignment loss, the cross entropy loss and the mean square error loss to obtain a total loss;
and adjusting parameters of the first encoder, the second encoder, the visual decoder and the motion decoder based on the total loss to obtain the video processing model.
8. The method of claim 2, wherein the hidden variable regressor is comprised of a stacked cross-attention module that takes hidden variable representations of the visible video blocks as keys and values, takes a learnable query mask as a query, and predicts features of the masked video blocks by cross-attention.
9. The method of any of claims 1-8, wherein the acquiring mask video frames comprises:
acquiring a video frame;
and carrying out random masking on video blocks with preset proportions in the video frames to obtain the masked video frames.
10. The method of any of claims 1-8, wherein the method further comprises:
obtaining a training sample set of target tasks, wherein training samples in the training sample set comprise sample videos and target task processing results of the sample videos, and the target tasks comprise at least one of the following: image classification, target detection, semantic segmentation and action recognition;
and taking the sample video as input, taking a target task processing result of the sample video as output, and continuing training the video processing model.
11. A video processing method, comprising:
obtaining a video to be processed of a target task, wherein the target task comprises at least one of the following: image classification, target detection, semantic segmentation and action recognition;
inputting the video to be processed into a video processing model to obtain a target task processing result of the video to be processed, wherein the video processing model is trained by the method of any one of claims 1-10.
12. A video processing model training apparatus, comprising:
a first acquisition module configured to acquire a mask video frame, wherein the mask video frame comprises a visible video block and a mask video block;
an encoding module configured to input the mask video frame to an encoder, learn characteristics of the mask video frame;
the decoding module is configured to input the characteristics of the mask video frame to a visual decoder and a motion decoder respectively, and predict a visual codebook and hidden motion information of the mask video frame;
a first calculation module configured to calculate a loss based on the visual codebook and the hidden motion information;
a first training module configured to adjust parameters of the encoder, the visual decoder, and the motion decoder based on the loss to obtain a video processing model.
13. The apparatus of claim 12, wherein the encoding module comprises:
a first encoding submodule configured to input the visible video block to a first encoder, learn characteristics of the visible video block;
and the regression sub-module is configured to input the characteristics of the visible video block to a hidden variable regressor to obtain the predicted characteristics of the mask video block.
14. The apparatus of claim 13, wherein the decoding module is further configured to:
and respectively inputting the prediction characteristics of the mask video block to the visual decoder and the motion decoder to predict the visual codebook and the hidden motion information.
15. The apparatus of claim 14, wherein the first computing module is further configured to:
inputting the visual codebook into a pre-trained tokenizer to generate a discrete codebook;
calculating cross entropy loss based on the discrete codebook;
calculating pixel difference values corresponding to the mask video blocks based on the hidden motion information;
based on the pixel difference, a mean square error loss is calculated.
16. The apparatus of claim 15, wherein the encoding module further comprises:
A second encoding sub-module configured to input the masked video block to a second encoder, learning features of the masked video block.
17. The apparatus of claim 16, wherein the apparatus further comprises:
a second calculation module configured to calculate an alignment loss based on the predicted features of the mask video block and the features of the mask video block.
18. The apparatus of claim 17, wherein the first training module is further configured to:
calculating the sum of the alignment loss, the cross entropy loss and the mean square error loss to obtain a total loss;
and adjusting parameters of the first encoder, the second encoder, the visual decoder and the motion decoder based on the total loss to obtain the video processing model.
19. The apparatus of claim 13, wherein the hidden variable regressor is comprised of a stacked cross-attention module that takes hidden variable representations of the visible video blocks as keys and values, takes a learnable query mask as a query, and predicts features of the masked video blocks by cross-attention.
20. The apparatus of any of claims 12-19, wherein the acquisition module is further configured to:
acquiring a video frame;
and carrying out random masking on video blocks with preset proportions in the video frames to obtain the masked video frames.
21. The apparatus of any of claims 12-19, wherein the apparatus further comprises:
a second acquisition module configured to acquire a training sample set of target tasks, wherein training samples in the training sample set include a sample video and target task processing results of the sample video, wherein the target tasks include at least one of: image classification, target detection, semantic segmentation and action recognition;
and the second training module is configured to take the sample video as input, take the target task processing result of the sample video as output and train the video processing model continuously.
22. A video processing apparatus comprising:
the system comprises an acquisition module configured to acquire a video to be processed of a target task, wherein the target task comprises at least one of the following: image classification, target detection, semantic segmentation and action recognition;
a processing module configured to input the video to be processed into a video processing model to obtain a target task processing result of the video to be processed, wherein the video processing model is trained by using the apparatus of any one of claims 12-21.
23. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10 or the method of claim 11.
24. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-10 or the method of claim 11.
25. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-10 or the method according to claim 11.
CN202310271487.6A 2023-03-17 2023-03-17 Video processing model training method, device and equipment Pending CN116310643A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310271487.6A CN116310643A (en) 2023-03-17 2023-03-17 Video processing model training method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310271487.6A CN116310643A (en) 2023-03-17 2023-03-17 Video processing model training method, device and equipment

Publications (1)

Publication Number Publication Date
CN116310643A true CN116310643A (en) 2023-06-23

Family

ID=86823745

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310271487.6A Pending CN116310643A (en) 2023-03-17 2023-03-17 Video processing model training method, device and equipment

Country Status (1)

Country Link
CN (1) CN116310643A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118135466A (en) * 2024-05-08 2024-06-04 腾讯科技(深圳)有限公司 Data processing method, device, computer, storage medium and program product

Similar Documents

Publication Publication Date Title
CN113762322B (en) Video classification method, device and equipment based on multi-modal representation and storage medium
CN114020950B (en) Training method, device, equipment and storage medium for image retrieval model
CN114445831A (en) Image-text pre-training method, device, equipment and storage medium
CN112561060A (en) Neural network training method and device, image recognition method and device and equipment
CN114663798B (en) Single-step video content identification method based on reinforcement learning
CN115376211B (en) Lip driving method, lip driving model training method, device and equipment
CN113901909A (en) Video-based target detection method and device, electronic equipment and storage medium
CN113869205A (en) Object detection method and device, electronic equipment and storage medium
CN116363459A (en) Target detection method, model training method, device, electronic equipment and medium
CN116310643A (en) Video processing model training method, device and equipment
CN115409855A (en) Image processing method, image processing device, electronic equipment and storage medium
CN113177483B (en) Video object segmentation method, device, equipment and storage medium
CN115097941B (en) Character interaction detection method, device, equipment and storage medium
CN113360683A (en) Method for training cross-modal retrieval model and cross-modal retrieval method and device
CN114973333B (en) Character interaction detection method, device, equipment and storage medium
CN114707591B (en) Data processing method and training method and device of data processing model
CN116363429A (en) Training method of image recognition model, image recognition method, device and equipment
CN116229095A (en) Model training method, visual task processing method, device and equipment
CN113642654B (en) Image feature fusion method and device, electronic equipment and storage medium
CN114943995A (en) Training method of face recognition model, face recognition method and device
CN114882334A (en) Method for generating pre-training model, model training method and device
CN113139483A (en) Human behavior recognition method, apparatus, device, storage medium, and program product
CN112559727A (en) Method, apparatus, device, storage medium, and program for outputting information
CN114218438B (en) Video data processing method and device, electronic equipment and computer storage medium
CN113869202B (en) Image recognition method, apparatus, device, storage medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination