CN115205763A - Video processing method and device - Google Patents

Video processing method and device

Info

Publication number
CN115205763A
Authority
CN
China
Prior art keywords
video
frame
loss function
processed
learnable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211099158.XA
Other languages
Chinese (zh)
Other versions
CN115205763B (en)
Inventor
岑俊
裴逸璇
张士伟
吕逸良
赵德丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202211099158.XA priority Critical patent/CN115205763B/en
Publication of CN115205763A publication Critical patent/CN115205763A/en
Application granted granted Critical
Publication of CN115205763B publication Critical patent/CN115205763B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention provides a video processing method and device. The video processing method comprises the following steps: acquiring a plurality of video frames of a video to be processed; determining learnable parameters respectively corresponding to the plurality of video frames, wherein the learnable parameters are obtained through a video behavior recognition model and the video behavior recognition model is a machine learning model; and fusing the plurality of video frames based on the learnable parameters respectively corresponding to the plurality of video frames to obtain a fused frame corresponding to the video to be processed. Because the data volume of the fused frame is smaller than that of the video to be processed, storing the fused frame in place of the video effectively reduces the storage space occupied by data storage, so that a large number of fused frames can be kept in limited storage space; model optimization or model update operations can then be performed based on the stored fused frames, effectively ensuring the diversity of data categories and the sufficiency of data quantity during model updating.

Description

Video processing method and device
Technical Field
The present invention relates to the field of video processing technologies, and in particular, to a video processing method and device.
Background
Video class-incremental learning refers to class-incremental learning that takes video samples as input; its basic task is video behavior recognition, yielding a video behavior recognition model. At present, in a single fine-tuning stage, training data of all classes are typically used to train the action recognition model. Because video data contains more redundant information than image data, storing videos requires a larger storage space. When performing model training, it is impractical to store a large number of training videos for each video category in advance; therefore, owing to the limitation of memory space, the data of all categories is unavailable, or only partially available, in the limited memory. This greatly limits the variety and quantity of model training data and thus easily degrades model training performance.
Disclosure of Invention
The embodiment of the invention provides a video processing method and video processing equipment, which can accurately obtain a fusion frame representing a video to be processed, and the fusion frame has a smaller data volume compared with video data, so that the memory consumption can be effectively reduced, and the training quality and performance of a model can be ensured.
In a first aspect, an embodiment of the present invention provides a video processing method, including:
acquiring a plurality of video frames of a video to be processed;
determining learnable parameters corresponding to the plurality of video frames respectively, wherein the learnable parameters are obtained through the video behavior recognition model, and the video behavior recognition model is a machine learning model;
and fusing the plurality of video frames based on the learnable parameters respectively corresponding to the plurality of video frames to obtain a fused frame corresponding to the video to be processed.
Optionally, after obtaining the fusion frame corresponding to the video to be processed, the method further includes:
acquiring a newly added video sample;
and performing learning training on the video behavior recognition model based on the newly added video samples and the fusion frame.
In a second aspect, an embodiment of the present invention provides a video processing apparatus, including:
the first acquisition module is used for acquiring a plurality of video frames of a video to be processed;
a first determining module, configured to determine learnable parameters corresponding to each of the plurality of video frames, where the learnable parameters are obtained through the video behavior recognition model, and the video behavior recognition model is a machine learning model;
and the first processing module is used for fusing the plurality of video frames based on the learnable parameters respectively corresponding to the plurality of video frames to obtain a fused frame corresponding to the video to be processed.
In a third aspect, an embodiment of the present invention provides an electronic device, including: a memory, a processor; wherein the memory is configured to store one or more computer instructions, wherein the one or more computer instructions, when executed by the processor, implement the video processing method of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer storage medium for storing a computer program, where the computer program is used to make a computer implement the video processing method in the first aspect when executed.
In a fifth aspect, an embodiment of the present invention provides a computer program product, including: computer program, which, when executed by a processor of an electronic device, causes the processor to carry out the steps of the video processing method according to the first aspect.
In a sixth aspect, an embodiment of the present invention provides a video processing method, including:
acquiring a plurality of video frames of a video to be processed;
displaying a parameter configuration interface for processing the plurality of video frames;
determining learnable parameters corresponding to the plurality of video frames through parameter configuration operation obtained by the parameter configuration interface;
fusing the plurality of video frames based on the learnable parameters respectively corresponding to the plurality of video frames to obtain a fused frame corresponding to the video to be processed;
and displaying the fused frame.
In a seventh aspect, an embodiment of the present invention provides a video processing apparatus, including:
the second acquisition module is used for acquiring a plurality of video frames of the video to be processed;
the second display module is used for displaying a parameter configuration interface for processing the plurality of video frames;
a second determining module, configured to determine, through the parameter configuration operation obtained by the parameter configuration interface, a learnable parameter corresponding to each of the plurality of video frames;
the second processing module is used for fusing the plurality of video frames based on the learnable parameters respectively corresponding to the plurality of video frames to obtain fused frames corresponding to the video to be processed;
and the second display module is also used for displaying the fusion frame.
In an eighth aspect, an embodiment of the present invention provides an electronic device, including: a memory, a processor; wherein the memory is configured to store one or more computer instructions, wherein the one or more computer instructions, when executed by the processor, implement the video processing method of the sixth aspect.
In a ninth aspect, an embodiment of the present invention provides a computer storage medium for storing a computer program, where the computer program enables a computer to implement the video processing method in the sixth aspect when executed.
In a tenth aspect, an embodiment of the present invention provides a computer program product, including: a computer program which, when executed by a processor of an electronic device, causes the processor to perform the steps in the video processing method of the sixth aspect described above.
In an eleventh aspect, an embodiment of the present invention provides a video processing method, which is applied to an augmented reality device, and the method includes:
acquiring a plurality of video frames of a video to be processed;
determining learnable parameters corresponding to the plurality of video frames respectively, wherein the learnable parameters are obtained through the video behavior recognition model, and the video behavior recognition model is a machine learning model;
fusing the plurality of video frames based on the learnable parameters corresponding to the plurality of video frames respectively to obtain a fused frame corresponding to the video to be processed;
rendering the fused frame to a display screen of the augmented reality device.
In a twelfth aspect, an embodiment of the present invention provides a video processing apparatus, which is applied to an augmented reality device, where the apparatus includes:
the third acquisition module is used for acquiring a plurality of video frames of the video to be processed;
a third determining module, configured to determine learnable parameters corresponding to each of the multiple video frames, where the learnable parameters are obtained through the video behavior recognition model, and the video behavior recognition model is a machine learning model;
the third processing module is used for fusing the plurality of video frames based on the learnable parameters respectively corresponding to the plurality of video frames to obtain fused frames corresponding to the video to be processed;
and the third rendering module is used for rendering the fusion frame to a display screen of the augmented reality device.
In a thirteenth aspect, an embodiment of the present invention provides an electronic device, including: a memory, a processor; wherein the memory is configured to store one or more computer instructions, wherein the one or more computer instructions, when executed by the processor, implement the video processing method of the eleventh aspect.
In a fourteenth aspect, an embodiment of the present invention provides a computer storage medium for storing a computer program, where the computer program is used to enable a computer to implement the video processing method in the eleventh aspect when executed.
In a fifteenth aspect, an embodiment of the present invention provides a computer program product, including: a computer program, which, when executed by a processor of an electronic device, causes the processor to perform the steps of the video processing method of the eleventh aspect.
According to the technical scheme provided by the embodiment, a plurality of video frames of a video to be processed are obtained; and determining learnable parameters corresponding to the plurality of video frames, and finally fusing the plurality of video frames based on the learnable parameters corresponding to the plurality of video frames to obtain fused frames corresponding to the video to be processed, wherein the data volume of the fused frames is far less than that of the video to be processed, and the fused frames are used for representing the video for storage, so that the storage space required by data storage is effectively reduced.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a schematic view of a scene of a video processing method according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of a video processing method according to an embodiment of the present invention;
fig. 3 is a schematic flow chart of another video processing method according to an embodiment of the present invention;
fig. 4 is a schematic diagram of fusing the fusion frame and the learnable information according to the embodiment of the present invention;
fig. 5 is a schematic flowchart of determining learnable parameters corresponding to the plurality of video frames according to an embodiment of the present invention;
fig. 6 is a flowchart illustrating a video processing method according to another embodiment of the present invention;
FIG. 7 is a schematic diagram of a video processing method according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of an electronic device corresponding to the video processing apparatus provided in the embodiment shown in fig. 8;
fig. 10 is a flowchart illustrating another video processing method according to an embodiment of the present invention;
fig. 11 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present invention;
fig. 12 is a schematic structural diagram of an electronic device corresponding to the video processing apparatus provided in the embodiment shown in fig. 11;
fig. 13 is a flowchart illustrating a further video processing method according to an embodiment of the present invention;
fig. 14 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present invention;
fig. 15 is a schematic structural diagram of an electronic device corresponding to the video processing apparatus provided in the embodiment shown in fig. 14.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments of the present invention. It is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in the embodiments of the present invention and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. The term "and/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
The word "if", as used herein, may be interpreted as "when" or "upon" or "in response to determining" or "in response to detecting", depending on the context. Similarly, the phrases "if it is determined" or "if (a stated condition or event) is detected" may be interpreted as "when it is determined" or "in response to determining" or "when (a stated condition or event) is detected" or "in response to detecting (a stated condition or event)", depending on the context.
It is also noted that the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that an article or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such article or system. Without further limitation, an element preceded by "comprising a ..." does not exclude the presence of additional identical elements in the article or system comprising the element.
Definition of terms:
Lifelong learning: a high-level machine learning paradigm that accumulates knowledge from past tasks through continual learning and uses that knowledge to assist future learning.
Class-incremental learning: new classes arrive continuously, the model needs to correctly classify inputs into the corresponding classes, and there is no overlap between the classes contained in different tasks.
Video class-incremental learning: class-incremental learning that takes video samples as input, with video behavior recognition as the basic task.
Catastrophic forgetting: after the model learns new knowledge, the features and information learned in previous training are almost completely forgotten.
Behavior recognition: analyzing the motion category of a target person in a video; behavior recognition is generally learned from a large amount of labeled training data.
Prompt: an input form or template designed for downstream tasks that helps a large pre-trained model "recall" what it "learned" during pre-training.
Instance-specific Prompt: a prompt template designed for each individual sample and adapted to its image characteristics, used to identify the temporal characteristics and/or spatial characteristics of a video.
In order to facilitate understanding of the specific implementation and effect of the technical solution provided by the present embodiment, the related art is first described below:
At present, in the fine-tuning stage of a model, training data of all categories are typically used to train the action recognition model. Because video data contains more redundant information than image data, storing videos requires a larger storage space, and storing a large number of training video samples for each category is impractical due to privacy concerns or technical limitations. Likewise, it is impractical to store a large number of training videos for each video category in advance when performing model update or model optimization operations; therefore, owing to the limitation of memory space, the data of all categories is unavailable, or only partially available, in the limited memory, which greatly limits the variety and quantity of model update or optimization data and thus easily degrades the performance of model update or optimization.
When the stored training data are used for model update or model optimization, all training data are trained and fine-tuned sequentially in class order. Fine-tuning the model on classes in sequence makes it easy for the model to overfit the training data of the current class, which degrades its performance on other classes and causes catastrophic forgetting.
In order to alleviate the problem of catastrophic forgetting, the related art provides an incremental learning method based on sample preservation. Its implementation principle is mainly as follows: a small set of representative videos is selected for subsequent model update or model optimization operations, and significant performance is then achieved in the image domain by retraining on a portion of past examples. Meanwhile, some existing video incremental learning methods have also shown that the ability to alleviate forgetting can be improved by storing more old samples. However, although the above methods can significantly improve and guarantee the performance of the video action recognition model, they still need to store multiple frames for each representative video, which results in non-negligible memory overhead and thereby limits further application of these technical solutions in practical scenarios.
In order to solve the above technical problems, the present embodiment provides a video processing method and device. The execution subject of the video processing method may be a video processing apparatus. In particular, when the video processing apparatus is implemented as a cloud server, the video processing method may be executed in the cloud. In this case, a plurality of computing nodes (cloud servers) may be deployed in the cloud, and each computing node has processing resources such as computation and storage. In the cloud, a plurality of computing nodes may be organized to provide a service; of course, one computing node may also provide one or more services. The cloud may provide the service by exposing a service interface, and a user calls the service interface to use the corresponding service. The service interface may take the form of a Software Development Kit (SDK), an Application Programming Interface (API), or other forms.
According to the scheme provided by the embodiment of the invention, the cloud end can be provided with a service interface of the video processing service, and the user calls the video processing service interface through the client end/the request end so as to trigger a request for calling the video processing service interface to the cloud end. The cloud determines the compute nodes that respond to the request, and performs the specific processing operations of video processing using the processing resources in the compute nodes.
Specifically, referring to fig. 1, the client/request end may be any computing device with a certain data transmission capability; for example, the client/request end may be a mobile phone, a personal computer (PC), a tablet computer, a device with a corresponding application program, and the like. In addition, the basic structure of the client may include: at least one processor. The number of processors depends on the configuration and type of the client. The client may also include a memory, which may be volatile, such as RAM, or non-volatile, such as Read-Only Memory (ROM) or flash memory, or may include both types. The memory typically stores an Operating System (OS) and one or more application programs, and may also store program data and the like. In addition to the processing unit and the memory, the client includes some basic configurations, such as a network card chip, an IO bus, a display component, and some peripheral devices. Optionally, the peripheral devices may include, for example, a keyboard, a mouse, a stylus, a printer, and the like. Other peripheral devices are well known in the art and are not described in detail herein.
The video processing apparatus refers to a device that can provide video processing services in a network virtual environment, and generally refers to an apparatus that performs information planning and video processing operations using a network. In physical implementation, the video processing apparatus may be any device capable of providing a computing service, responding to a video processing request, and performing a video processing service based on the video processing request, for example: a cluster server, a regular server, a cloud host, a virtual center, and the like. The video processing apparatus mainly comprises a processor, a hard disk, a memory, a system bus, and the like, and is similar to a general computer architecture.
In the above embodiment, the client/request end may have a network connection with the video processing apparatus, and the network connection may be a wireless or wired network connection. If the client/request end is communicatively connected to the video processing apparatus through a mobile network, the network format of the mobile network may be any one of 2G (GSM), 2.5G (GPRS), 3G (WCDMA, TD-SCDMA, CDMA2000, UMTS), 4G (LTE), 4G+ (LTE+), WiMax, 5G, and 6G.
In the embodiment of the application, a client/request terminal can acquire videos to be processed, wherein the videos to be processed can be used for training a video behavior recognition model, and specifically, the number of the videos to be processed can be one or more; specifically, the specific implementation manner of the request end for acquiring the video to be processed is not limited in this embodiment, and in some examples, the video to be processed may be stored in a preset area of the request end, and the video to be processed may be acquired by accessing the preset area. Or the video to be processed may be stored in a third device, and the third device is in communication connection with the request terminal, and the video to be processed is actively or passively acquired through the third device. After the video to be processed is acquired, the video to be processed can be sent to the video processing device, so that the video processing device can perform video processing operation on the video to be processed, specifically, the video to be processed can be compressed, and memory consumption for storing the video to be processed can be reduced.
The video processing device is used for acquiring a video to be processed, then sampling the video to be processed to obtain a plurality of video frames, wherein the plurality of video frames can be used as representative frames of the video to be processed, and in order to accurately perform fusion operation on the plurality of video frames, learnable parameters corresponding to the plurality of video frames can be determined, wherein the learnable parameters can be obtained through a video behavior recognition model; and then, the plurality of video frames can be fused based on the learnable parameters respectively corresponding to the plurality of video frames, so that a fused frame corresponding to the video to be processed can be obtained.
After a fusion frame used for representing a video to be processed is obtained, the fusion frame can represent related information included in the video to be processed, so that the fusion frame can represent the video to be processed for storage, and since the data volume of the fusion frame is smaller or far smaller than that of the video to be processed, the storage space occupied by data storage is reduced.
Some embodiments of the invention are described in detail below with reference to the accompanying drawings. The features of the embodiments and examples described below can be combined with or separated from each other without conflict between the embodiments. In addition, the sequence of steps in the embodiments of the methods described below is merely an example, and is not strictly limited.
Fig. 2 is a schematic flowchart of a video processing method according to an embodiment of the present invention; referring to fig. 2, the embodiment provides a video processing method, where an execution subject of the method may be a video processing apparatus, the video processing apparatus may be implemented as software, or a combination of software and hardware, and specifically, when the video processing apparatus is implemented as hardware, it may be embodied as various electronic devices having data processing operations, including but not limited to a tablet computer, a personal computer PC, a server, and the like. When the video processing apparatus is implemented as software, it can be installed in the electronic devices exemplified above. Based on the video processing apparatus, the video processing method in this embodiment may include the following steps:
step S201: a plurality of video frames of a video to be processed are acquired.
Step S202: determining learnable parameters corresponding to the video frames respectively, wherein the learnable parameters are obtained through a video behavior recognition model, and the video behavior recognition model is a machine learning model.
Step S203: and fusing the plurality of video frames based on the learnable parameters respectively corresponding to the plurality of video frames to obtain a fused frame corresponding to the video to be processed.
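As a quick orientation before the detailed description, the following is a minimal, non-authoritative sketch of the three steps in plain NumPy; the helper names (sample_frames, fuse_frames), the uniform-sampling choice, and the weighted-sum fusion are illustrative assumptions rather than the required implementation of this embodiment.

```python
import numpy as np

def sample_frames(video, num_frames):
    """Step S201 (one option): uniformly sample num_frames frames from a (T, H, W, C) video."""
    idx = np.linspace(0, len(video) - 1, num_frames).astype(int)
    return video[idx]

def fuse_frames(frames, weights):
    """Step S203: weighted sum of the sampled frames with per-frame parameters."""
    w = np.asarray(weights, dtype=np.float32)
    w = w / w.sum()                                   # keep the weights summing to 1
    return np.tensordot(w, frames.astype(np.float32), axes=1)

video = np.random.randint(0, 256, size=(120, 224, 224, 3), dtype=np.uint8)  # toy stand-in video
frames = sample_frames(video, num_frames=8)           # step S201
weights = np.full(8, 1.0 / 8)                         # step S202: initialization; later refined via the model
fused_frame = fuse_frames(frames, weights)            # step S203
print(fused_frame.shape)                              # (224, 224, 3): same size and channels as one frame
```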
The following is a detailed description of specific implementation processes and implementation effects of the above steps:
step S201: a plurality of video frames of a video to be processed are obtained.
The video to be processed may refer to a video that needs to undergo a video processing operation (e.g., a video compression operation), and the video to be processed may be used for a training operation, an update operation, or an optimization operation of a video behavior recognition model; that is, the video to be processed may serve as training data of the video behavior recognition model, and the video behavior recognition model can recognize behavior characteristics or behavior information in the video to be processed.
In addition, the plurality of video frames may be stored in a preset area or in a third device; in this case, the plurality of video frames of the video to be processed may be acquired by accessing the preset area or the third device. Alternatively, obtaining a plurality of video frames of the video to be processed may include: acquiring the video to be processed, and sampling the video to be processed to obtain the plurality of video frames. Specifically, the number of videos to be processed may be one or more, and this embodiment does not limit the manner of obtaining the video to be processed. In some examples, the video to be processed may be stored in a preset area in the video processing apparatus, and the video to be processed may be obtained by accessing the preset area; or the video to be processed may be stored in a third device that is communicatively connected to the video processing apparatus, and the video to be processed is actively or passively acquired from the third device. In still other examples, the video processing apparatus may be communicatively connected with a live broadcast terminal; the live broadcast terminal may generate a live video and send it to the video processing apparatus, so that the video processing apparatus obtains a to-be-processed live video, effectively enabling processing of live video. Similarly, the video processing apparatus may be communicatively connected with a conference terminal; the conference terminal may generate a conference video and send it to the video processing apparatus, so that the video processing apparatus obtains a to-be-processed conference video, effectively enabling processing of conference video.
In other examples, the video to be processed may be a part of a plurality of sample videos used for performing a training operation on the video behavior recognition model, and in this case, acquiring the video to be processed may include: acquiring an original video set, wherein the original video set comprises a plurality of sample videos used for training a video behavior recognition model; determining a video category corresponding to each sample video in an original video set; and determining one or more to-be-processed videos in the original video set based on the video categories, wherein at least one to-be-processed video corresponds to each video category.
For the video behavior recognition model, an original video set corresponding to the video behavior recognition model is configured in advance, and the original video set comprises a plurality of sample videos used for training the video behavior recognition model. Because the original video set comprises a plurality of sample videos, different sample videos can correspond to different video categories, and specifically, the video categories can comprise live videos, conference videos, fun videos, cate videos, entertainment videos, life videos, information videos, knowledge videos, game videos, favorite videos, sports videos, cartoon videos, science and technology videos, health videos and the like. In addition, because the focused information of the sample videos of different video types is different from the content to be expressed, in order to ensure accurate and comprehensive characterization of the original video set, videos to be processed may be screened from the original video set based on the video category, at this time, after the original video set is obtained, all sample videos in the original video set may be analyzed and processed to determine the video category corresponding to each sample video in the original video set, in some examples, the sample videos of different video categories may correspond to different identification information, and at this time, the video category corresponding to each sample video may be determined based on the identification information. In still other examples, sample videos of different video categories may correspond to different video features, and at this time, after the original video set is obtained, feature extraction may be performed on each sample video in the original video set to obtain video features, and the video category corresponding to each sample video is determined based on the video features.
After determining the video categories corresponding to the videos in the original video set, one or more videos to be processed may be determined in the original video set based on the video categories, specifically, at least one video to be processed corresponding to each video category. For example, the video categories corresponding to the sample videos in the original video set include: when the videos of the life category, the videos of the travel category, the videos of the knowledge category and the videos of the conference category are taken as the videos, one or more videos to be processed can be determined in the original video set based on the video categories, and specifically, the videos to be processed of at least one life category, the videos to be processed of at least one travel category, the videos to be processed of at least one knowledge category and the videos to be processed of at least one conference category can be obtained, so that the videos to be processed of the whole category can be accurately obtained.
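To make the per-category selection above concrete, here is a brief sketch; the parallel-list input format and the random choice within each category are assumptions for illustration only.

```python
import random
from collections import defaultdict

def select_videos_per_category(sample_videos, categories, per_category=1):
    """Pick at least `per_category` to-be-processed video(s) for every video category.
    `sample_videos` and `categories` are parallel lists (an assumed input format)."""
    by_category = defaultdict(list)
    for video, category in zip(sample_videos, categories):
        by_category[category].append(video)
    selected = []
    for category, videos in by_category.items():
        # random choice within a category is illustrative; any selection rule could be used
        selected.extend(random.sample(videos, min(per_category, len(videos))))
    return selected

# usage: an original set containing life, travel, knowledge and conference videos
videos = ["v1", "v2", "v3", "v4", "v5"]
labels = ["life", "travel", "knowledge", "conference", "life"]
print(select_videos_per_category(videos, labels))     # at least one video per category
```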
In order to implement the processing operation on the video to be processed, after the video to be processed is acquired, sampling processing may be performed on the video to be processed to obtain a plurality of video frames. In some examples, sampling the video to be processed to obtain the plurality of video frames may include: randomly sampling the video to be processed to obtain a plurality of video frames, where the plurality of video frames are a plurality of random frames in the video to be processed. Alternatively, sampling the video to be processed to obtain a plurality of video frames may include: uniformly sampling the video to be processed to obtain a plurality of video frames. Alternatively, sampling the video to be processed to obtain a plurality of video frames may include: sampling the video to be processed at intervals to obtain a plurality of video frames. Alternatively, sampling the video to be processed to obtain a plurality of video frames may include: acquiring motion distribution information corresponding to the video to be processed, and sampling the video to be processed based on the motion distribution information to obtain a plurality of video frames.
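The four sampling strategies mentioned above (random, uniform, interval, and motion-distribution-based) might be sketched as follows; the frame-difference heuristic in the motion-based variant is an assumption, since the embodiment does not specify how the motion distribution information is computed.

```python
import numpy as np

def sample_random(video, n):
    """Random sampling: n random frames from a (T, H, W, C) video."""
    idx = np.sort(np.random.choice(len(video), size=n, replace=False))
    return video[idx]

def sample_uniform(video, n):
    """Uniform sampling: n evenly spaced frames."""
    idx = np.linspace(0, len(video) - 1, n).astype(int)
    return video[idx]

def sample_interval(video, n, interval=4):
    """Interval sampling: one frame every `interval` frames, truncated to n frames."""
    idx = np.arange(0, len(video), interval)[:n]
    return video[idx]

def sample_by_motion(video, n):
    """Motion-based sampling: keep the frames with the largest inter-frame differences
    (a simple stand-in for 'motion distribution information')."""
    diffs = np.abs(np.diff(video.astype(np.float32), axis=0)).mean(axis=(1, 2, 3))
    idx = np.sort(np.argsort(diffs)[-n:] + 1)
    return video[idx]
```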
Step S202: determining learnable parameters corresponding to the video frames respectively, wherein the learnable parameters are obtained through a video behavior recognition model, and the video behavior recognition model is a machine learning model.
After the plurality of video frames are acquired, different video frames can express different information of the video to be processed, so that in order to accurately perform fusion processing operation on the plurality of video frames, learnable parameters corresponding to the plurality of video frames can be determined, and different video frames can correspond to the same or different learnable parameters, wherein the learnable parameters can be obtained through a video behavior recognition model.
In addition, the specific implementation manner of the learnable parameters corresponding to each of the plurality of video frames is not limited in this embodiment, in some examples, the learnable parameters may be parameters that are configured manually by a user based on a video behavior recognition model, or the learnable parameters may be parameters that are configured automatically based on the video behavior recognition model, it should be noted that the video behavior recognition model may be a machine learning model or a neural network model that is trained in advance or configured in advance and is used for performing behavior recognition on a video, and the video behavior recognition model may be configured on any electronic device with an image processing capability to implement the recognition operation of a video behavior. In other examples, the learnable parameter may be obtained by adjusting the initialization parameter based on the video behavior recognition model, and determining the learnable parameter corresponding to each of the plurality of video frames may include: acquiring initialization parameters corresponding to the plurality of video frames respectively, wherein the initialization parameters are acquired based on the number of the plurality of video frames; determining an initial fusion frame corresponding to the video to be processed based on the initialization parameters; acquiring a first loss function corresponding to the initial fusion frame based on the video behavior recognition model; and adjusting the initialization parameters based on the first loss function to obtain learnable parameters corresponding to the plurality of video frames respectively.
After a plurality of video frames are acquired, initialization parameters corresponding to the plurality of video frames may be automatically acquired or determined, where the initialization parameters may be determined based on the number of the video frames, and in some examples, the initialization parameters corresponding to different video frames are the same, that is, in an initial state, it may be default that all the video frames have the same influence on the video to be processed, and at this time, when the number of the video frames is N, then the initialization parameters may be 1/N. In other examples, the initialization parameters corresponding to different video frames may be different, and at this time, the initialization parameters corresponding to each of the plurality of video frames may be determined based on the frame characteristics of the video frame.
Because the initialization parameters are set in the initial state and are used to represent the initial degree of influence of the video frames on the video to be processed, in order to better represent the video to be processed and obtain a more accurate fused frame, the initialization parameters can be optimized and adjusted based on the video behavior recognition model. In this case, after the initialization parameters are obtained, an initial fusion frame corresponding to the video to be processed can be determined based on the initialization parameters. Specifically, the initial fusion frame may be a fusion frame obtained by weighting and summing the plurality of video frames using the initialization parameters; alternatively, the initial fusion frame may be a fusion frame obtained by summing the products of the plurality of video frames and the initialization parameters; alternatively, the initial fusion frame may be a fusion frame obtained by splicing the video frames based on the initialization parameters.
After the initial fusion frame is obtained, feature extraction and behavior recognition processing operations can be performed on the initial fusion frame by using a video behavior recognition model, a first loss function corresponding to the initial fusion frame can be obtained based on a feature extraction result and a behavior recognition result, and then the initialization parameter can be adjusted by using the first loss function as a constraint condition to obtain learnable parameters corresponding to a plurality of video frames, wherein the learnable parameters are determined parameters which are matched with the plurality of video frames in a comparison manner.
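As one possible reading of this optimization, the sketch below adjusts the initialization parameters (1/N per frame) by minimizing a first loss through gradient descent; the PyTorch formulation, the softmax reparameterization, and the feature_extractor/classifier interfaces are assumptions not taken from the embodiment itself.

```python
import torch
import torch.nn.functional as F

def optimize_learnable_parameters(frames, video_feature, label,
                                  feature_extractor, classifier,
                                  steps=100, lr=0.01):
    """Adjust per-frame weights so the fused frame keeps the original video's feature
    and classification. `frames` is an (N, C, H, W) tensor; `feature_extractor` and
    `classifier` stand for the corresponding parts of the video behavior recognition
    model; `video_feature` and `label` come from the original video to be processed."""
    n = frames.shape[0]
    alpha = torch.nn.Parameter(torch.full((n,), 1.0 / n))   # initialization parameters: 1/N per frame
    optimizer = torch.optim.Adam([alpha], lr=lr)
    for _ in range(steps):
        w = torch.softmax(alpha, dim=0)                      # keep weights positive and summing to 1
        fused = (w[:, None, None, None] * frames).sum(dim=0, keepdim=True)  # initial fused frame
        fused_feature = feature_extractor(fused)
        logits = classifier(fused_feature)
        loss_feat = F.mse_loss(fused_feature, video_feature)  # first feature loss (Euclidean)
        loss_cls = F.cross_entropy(logits, label)              # first label loss (cross-entropy)
        loss = loss_feat + loss_cls                             # first loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return torch.softmax(alpha, dim=0).detach()                # learnable parameters
```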
For the first loss function, this embodiment may obtain different first loss functions in different manners, and in a first implementation manner, obtaining, based on the video behavior recognition model, the first loss function corresponding to the initial fusion frame may include: acquiring fusion frame characteristics corresponding to the initial fusion frame, video characteristics corresponding to the video to be processed, an initial prediction label corresponding to the initial fusion frame and a standard label corresponding to the video to be processed based on the video behavior recognition model; acquiring a first feature loss function between the fusion frame feature and the video feature and a first label loss function between an initial prediction label and a standard label; a first loss function corresponding to the initial fused frame is determined based on the first feature loss function and the first tag loss function.
For example, the video behavior recognition model may be represented as $f_{\theta_k}$, where $g_{\phi_k}$ is the feature extractor of the video behavior recognition model $f_{\theta_k}$, the initial fused frame is represented as $\hat{I}$, and the video to be processed is represented as $V$. Here, $\theta_k$ denotes the current parameters of the video behavior recognition model after the $k$-th incremental phase, and $\phi_k$ denotes the current parameters of the feature extractor after the $k$-th incremental phase.
After the initial fused frame $\hat{I}$ is obtained, the fused-frame feature $g_{\phi_k}(\hat{I})$ may be obtained through the feature extractor; similarly, after the video $V$ to be processed is obtained, the video feature $g_{\phi_k}(V)$ may also be obtained through the feature extractor.
For the fused-frame feature and the video feature, since the goal is a fused frame that accurately characterizes the video to be processed, that is, a fused frame whose expressive ability is the same as, or very close to, that of the original video, the embedded feature of the fused frame extracted by the current model should be consistent, or as consistent as possible, with the video feature of the original video. A first feature loss function $\mathcal{L}_{feat}$ between the fused-frame feature and the video feature can therefore be obtained. When the first feature loss function is characterized by the Euclidean distance, $\mathcal{L}_{feat} = \lVert g_{\phi_k}(\hat{I}) - g_{\phi_k}(V) \rVert_2^2$. Of course, the first feature loss function may also be expressed in other ways, for example by cosine similarity, Mahalanobis distance, Manhattan distance, Pearson correlation coefficient, and so on; different ways correspond to different expression formulas, which are not described again here.
To further improve the adaptability of the initial fused frame to the video behavior recognition model, the classification confidence of the initial fused frame can be supervised with a cross-entropy loss. In this case, an initial prediction label corresponding to the initial fused frame can be obtained through the video behavior recognition model, and the initial prediction label can be expressed as $\hat{y}$. The standard label corresponding to the video to be processed, expressed as $y$, is then obtained, and a first label loss function $\mathcal{L}_{cls}$ between the initial prediction label $\hat{y}$ and the standard label $y$ is obtained. When the first label loss function $\mathcal{L}_{cls}$ is characterized by a cross-entropy loss function, it can be expressed as $\mathcal{L}_{cls} = \mathrm{CE}(\hat{y}, y)$. Of course, the first label loss function may also be expressed in other ways, for example by a logarithmic loss function, an exponential loss function, and so on; different ways correspond to different expression formulas, which are not described again here.
After the first feature loss function $\mathcal{L}_{feat}$ and the first label loss function $\mathcal{L}_{cls}$ are obtained, the first loss function $\mathcal{L}_1$ corresponding to the initial fused frame can be determined from the first feature loss function $\mathcal{L}_{feat}$ and the first label loss function $\mathcal{L}_{cls}$. In some examples, $\mathcal{L}_1 = \mathcal{L}_{feat} + \mathcal{L}_{cls}$, that is, the first loss function is the sum of the first feature loss function and the first label loss function. In still other examples, $\mathcal{L}_1 = \lambda_{feat}\,\mathcal{L}_{feat} + \lambda_{cls}\,\mathcal{L}_{cls}$, that is, the first loss function is a weighted sum of the first feature loss function and the first label loss function, where $\lambda_{feat}$ is the weight corresponding to the first feature loss function and $\lambda_{cls}$ is the weight corresponding to the first label loss function.
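A compact sketch of the two combinations just described (plain sum and weighted sum) follows; the weight values, the Euclidean/cross-entropy choices, and the tensor shapes are illustrative assumptions.

```python
import torch.nn.functional as F

def first_loss(fused_feature, video_feature, logits, label, w_feat=1.0, w_cls=1.0):
    """First loss for the initial fused frame: feature loss plus (optionally weighted) label loss.
    Setting w_feat = w_cls = 1 gives the plain sum; other values give the weighted sum."""
    loss_feat = F.mse_loss(fused_feature, video_feature)    # first feature loss, Euclidean distance
    loss_cls = F.cross_entropy(logits, label)                # first label loss, cross-entropy
    return w_feat * loss_feat + w_cls * loss_cls
```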
In a second implementation manner, based on the video behavior recognition model, obtaining the first loss function corresponding to the initial fusion frame may include: acquiring fusion frame characteristics corresponding to the initial fusion frame and video characteristics corresponding to the video to be processed based on the video behavior recognition model; acquiring a first characteristic loss function between the fusion frame characteristic and the video characteristic; based on the first feature loss function, a first loss function corresponding to the initial fused frame is determined.
Different from the above implementation manner, the first loss function $\mathcal{L}_1$ in this implementation only needs to be obtained through the first feature loss function $\mathcal{L}_{feat}$; therefore, there is no need to acquire the first label loss function $\mathcal{L}_{cls}$. In this case, the first feature loss function may be directly determined as the first loss function, that is, $\mathcal{L}_1 = \mathcal{L}_{feat}$.
In a third implementation manner, based on the video behavior recognition model, obtaining the first loss function corresponding to the initial fusion frame may include: acquiring an initial prediction label corresponding to the initial fusion frame and a standard label corresponding to a video to be processed based on the video behavior recognition model; obtaining a first label loss function between an initial prediction label and a standard label; based on the first tag loss function, a first loss function corresponding to the initial fused frame is determined.
Different from the above implementation manner, the first loss function $\mathcal{L}_1$ in this implementation only needs to be obtained through the first label loss function $\mathcal{L}_{cls}$; therefore, there is no need to acquire the first feature loss function $\mathcal{L}_{feat}$. Specifically, the first label loss function may be directly determined as the first loss function, that is, $\mathcal{L}_1 = \mathcal{L}_{cls}$.
Step S203: and fusing the plurality of video frames based on the learnable parameters respectively corresponding to the plurality of video frames to obtain a fused frame corresponding to the video to be processed.
After the learnable parameters corresponding to the multiple video frames are obtained, the multiple video frames can be fused based on the learnable parameters corresponding to the multiple video frames, and a fusion frame corresponding to the video to be processed is obtained, wherein for the fusion frame, in order to enable the fusion frame to accurately express the video to be processed, the number of image channels of the obtained fusion frame is the same as that of the video frames, and the size of the fusion frame is the same as that of the video frames.
In addition, the specific implementation manner of fusing the video frames is not limited in this embodiment, and since the learnable parameters may correspond to different numerical ranges, and the learnable parameters of different numerical ranges may correspond to different implementation manners, in some examples, fusing the multiple video frames based on the learnable parameters corresponding to the multiple video frames, respectively, and obtaining the fused frame corresponding to the video to be processed may include: when the learnable parameters are numerical values which are larger than zero and smaller than 1, carrying out weighted summation on the plurality of video frames based on the learnable parameters respectively corresponding to the plurality of video frames to obtain a fusion frame; or when the learnable parameters are numerical values larger than 1, normalizing the learnable parameters corresponding to the video frames respectively to obtain normalized parameters corresponding to the learnable parameters; and carrying out weighted summation on the plurality of video frames based on the normalization parameters to obtain a fusion frame.
Specifically, when the learnable parameter is a value greater than zero and less than 1, that is, in the process of video processing, the learnable parameter is always in a range greater than zero and less than 1, the learnable parameters can directly reflect the degree of influence of the plurality of video frames on the video to be processed, so the video frames can be weighted and summed based on the learnable parameters corresponding to the plurality of video frames to obtain the fused frame. For example, the plurality of video frames may be $I_1, I_2, \dots, I_N$, and the learnable parameters corresponding to the plurality of video frames may be $\alpha_1, \alpha_2, \dots, \alpha_N$; the fused frame can then be obtained as $\hat{I} = \sum_{i=1}^{N} \alpha_i I_i$.
In addition, when the learnable parameter may be a value greater than 1 (that is, during video processing the learnable parameter may fall either in the range between zero and 1 or in the range above 1), the learnable parameter cannot directly reflect the degree of influence of the plurality of video frames on the video to be processed. In this case, normalization processing may be performed on the learnable parameters respectively corresponding to the plurality of video frames to obtain normalization parameters, which are values greater than zero and less than 1 and can therefore directly reflect the degree of influence of the plurality of video frames on the video to be processed; the video frames can then be weighted and summed based on the normalization parameters to obtain the fused frame. For example, if the plurality of video frames are $\{x_1, x_2, \dots, x_T\}$ and the learnable parameters corresponding to the plurality of video frames are $\{a_1, a_2, \dots, a_T\}$, then the fused frame can be obtained as $I = \sum_{t=1}^{T} \tilde{a}_t x_t$, where $\tilde{a}_t$ (for example $\tilde{a}_t = a_t / \sum_{j=1}^{T} a_j$) are the normalization parameters corresponding to each of the plurality of video frames.
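The following is a minimal Python (PyTorch) sketch of the two fusion branches described above; the tensor shapes, the frame count of 8 and the choice of dividing by the sum for normalization are assumptions made for illustration rather than details fixed by this embodiment.

import torch

def fuse_frames(frames: torch.Tensor, learnable_params: torch.Tensor) -> torch.Tensor:
    # frames: (T, C, H, W); learnable_params: (T,)
    if torch.all((learnable_params > 0) & (learnable_params < 1)):
        # Branch 1: the parameters already reflect each frame's influence directly.
        weights = learnable_params
    else:
        # Branch 2: parameters may exceed 1, so normalize them into (0, 1) first
        # (here by dividing by their sum, which is only one possible normalization).
        weights = learnable_params / learnable_params.sum()
    # Weighted summation over the frame dimension; the fused frame keeps the same
    # channel count and spatial size as a single video frame.
    return (weights.view(-1, 1, 1, 1) * frames).sum(dim=0)

frames = torch.rand(8, 3, 224, 224)          # 8 sampled frames of a 3-channel 224x224 video
params = torch.full((8,), 1.0 / 8)           # equal initial influence for every frame
fused = fuse_frames(frames, params)          # shape (3, 224, 224)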
After the fusion frame corresponding to the video to be processed is obtained, in order to facilitate processing the video to be processed based on the fusion frame, the method in this embodiment may further include: storing the fusion frame in place of the video to be processed. That is, the fusion frame can be stored in a preset area in the video processing device; when the video to be processed needs to be called or used, the fusion frame is obtained by accessing the preset area and is called or used instead of the video to be processed, which effectively reduces the memory space needed for data storage.
For example, device B (a cloud server, a cloud database, or the like) may store fusion frames that represent a plurality of pieces of video data, while an application program for implementing a model optimization operation may be configured on device A (a user side). When the stored fusion frames are required for the optimization operation of the video behavior recognition model, device A may establish a communication connection with device B, acquire the plurality of fusion frames stored in device B by accessing device B, and then optimize or update the video behavior recognition model based on the plurality of fusion frames and other video data.
In the video processing method provided by this embodiment, a plurality of video frames of a video to be processed are acquired, the learnable parameters corresponding to the video frames are determined, and finally the video frames are fused based on these learnable parameters to obtain a fused frame corresponding to the video to be processed. The data volume of the fused frame is far smaller than that of the video to be processed, so using the fused frame to represent the video for storage effectively reduces the storage space required for data storage.
Fig. 3 is a schematic flow chart of another video processing method according to an embodiment of the present invention; fig. 4 is a schematic diagram of fusing a fusion frame and learnable information according to an embodiment of the present invention. On the basis of the foregoing embodiment, as shown in fig. 3 to 4, when a video to be processed is acquired and expressed by a fusion frame, the temporal information and spatial information of the video are inevitably lost to some extent. In order to supplement the lost temporal and spatial information, after obtaining the fusion frame corresponding to the video to be processed, this embodiment further provides an implementation scheme for processing the fusion frame with different prompt information. Specifically, the method in this embodiment may include:
Step S301: acquiring learnable information corresponding to the video to be processed, wherein the learnable information is used for identifying the spatial information and/or the temporal information of the video to be processed.
After the video to be processed is acquired, the video to be processed may be analyzed to acquire learnable information corresponding to the video to be processed, where the learnable information is used to identify the spatial information and/or temporal information of the video to be processed, and the spatial resolution of the learnable information is the same as the spatial resolution of the video to be processed. In some examples, the learnable information may be information configured by a user based on the video to be processed, or the learnable information may be information obtained by processing the video to be processed through a machine learning model trained in advance.
In other examples, the learnable information may be information obtained after performing optimization adjustment on the initialization information based on the video behavior recognition model, and at this time, obtaining the learnable information corresponding to the video to be processed may include: acquiring initialization information corresponding to a video to be processed; fusing the initialization information and the fusion frame to obtain a process fusion frame; acquiring a second loss function corresponding to the process fusion frame based on the video behavior recognition model; and adjusting the initialization information based on a second loss function to obtain learnable information corresponding to each of the plurality of video frames.
After the video to be processed is acquired, the initialization information corresponding to the video to be processed may be automatically acquired or determined, where the initialization information may be determined based on the video to be processed, and in some examples, the initialization information corresponding to different videos to be processed may be the same value, and at this time, the initialization information may be 0; alternatively, the initialization information corresponding to different videos to be processed may be different values.
After the initialization information is obtained, the initialization information and the fusion frame may be fused to obtain a process fusion frame. Fusing the initialization information and the fusion frame may include: performing pixel-by-pixel summation on the initialization information and the fusion frame to obtain the process fusion frame; or performing pixel-by-pixel product on the initialization information and the fusion frame to obtain the process fusion frame; or splicing the initialization information and the fusion frame to obtain the process fusion frame. Specifically, when the initialization information and the fusion frame are spliced, in order to enable the process fusion frame to accurately represent the video to be processed, the fusion frame may be used as a central region and the initialization information as a peripheral edge region surrounding it; after splicing, the number of channels of the process fusion frame is the same as the number of channels of a video frame, while the height and width of the process fusion frame are greater than the height and width of a video frame. A sketch of these fusion options is given below. In this way, the process fusion frame is obtained accurately and reliably.
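As a concrete illustration of the three fusion options above, the following Python (PyTorch) sketch combines the initialization information with the fusion frame; the border width used for splicing and the tensor shapes are assumptions made for this example only.

import torch

def combine(fused_frame: torch.Tensor, info: torch.Tensor, mode: str = "sum", pad: int = 16) -> torch.Tensor:
    # fused_frame: (C, H, W). For "sum"/"product" info has the same shape;
    # for "splice" info is assumed to already have shape (C, H + 2*pad, W + 2*pad).
    if mode == "sum":
        return fused_frame + info            # pixel-by-pixel summation
    if mode == "product":
        return fused_frame * info            # pixel-by-pixel product
    if mode == "splice":
        # The fusion frame becomes the central region and the information forms the
        # peripheral edge region, so the result keeps the channel count of a video
        # frame but is taller and wider than a video frame.
        out = info.clone()
        out[:, pad:-pad, pad:-pad] = fused_frame
        return out
    raise ValueError(f"unknown fusion mode: {mode}")

fused_frame = torch.rand(3, 224, 224)
info = torch.zeros(3, 224 + 32, 224 + 32)    # all-zero initialization information for splicing
process_frame = combine(fused_frame, info, mode="splice")   # shape (3, 256, 256)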
After the process fusion frame is acquired, in order to accurately acquire learnable information corresponding to each of the plurality of video frames, feature extraction and behavior recognition processing operations may be performed on the process fusion frame based on the video behavior recognition model, and a second loss function corresponding to the process fusion frame may be acquired based on a feature extraction result and a behavior recognition result; after the second loss function is obtained, the initialization information can be adjusted based on the second loss function as a constraint condition to obtain learnable information corresponding to each of the plurality of video frames, so that the accuracy and reliability of obtaining the learnable information are effectively ensured.
For the second loss function, different second loss functions may be obtained in different manners in this embodiment, and in the first implementation manner, obtaining the second loss function corresponding to the process fusion frame based on the video behavior recognition model may include: acquiring process frame characteristics corresponding to the process fusion frame, video characteristics corresponding to the video to be processed, a frame prediction label corresponding to the process fusion frame and a standard label corresponding to the video to be processed based on the video behavior recognition model; acquiring a second characteristic loss function between the process frame characteristic and the video characteristic and a second label loss function between the frame prediction label and the standard label; a second loss function corresponding to the process fused frame is determined based on the second feature loss function and the second tag loss function.
For example, the video behavior recognition model may be represented as $f_{\theta_k}$ and its feature extractor as $g_{\phi_k}$; the process fusion frame is represented as $\hat{I}$ and the video to be processed as $V$, where $\theta_k$ denotes the current parameters of the video behavior recognition model after the k-th incremental stage is finished and $\phi_k$ denotes the current parameters of the feature extractor after the k-th incremental stage is finished.
After the process fusion frame $\hat{I}$ is acquired, the process frame feature $g_{\phi_k}(\hat{I})$ may be obtained through the feature extractor; similarly, after the video to be processed is obtained, the video feature $g_{\phi_k}(V)$ may be obtained through the feature extractor. Because a fused frame that accurately represents the video to be processed is desired, that is, the fused frame should have the same or very similar expression capability to the original video, the embedded feature of the fused frame extracted by the video behavior recognition model should be consistent, or as consistent as possible, with the video feature of the original video. A second feature loss function $\mathcal{L}_{feat2}$ between the process frame feature and the video feature is therefore obtained; when the second feature loss function is characterized by a Euclidean distance, $\mathcal{L}_{feat2} = \left\| g_{\phi_k}(\hat{I}) - g_{\phi_k}(V) \right\|_2^2$. Of course, the second feature loss function may also be expressed in other ways, such as cosine similarity, Mahalanobis distance, Manhattan distance, or the Pearson correlation coefficient; different ways correspond to different expression formulas, which are not described herein again.
To further enhance the adaptability of the process fusion frame $\hat{I}$ to the video behavior recognition model, a cross-entropy loss can be used to supervise the classification confidence of the process fusion frame. A frame prediction label $f_{\theta_k}(\hat{I})$ corresponding to the process fusion frame can be obtained through the video behavior recognition model, and a standard label $y$ corresponding to the video to be processed can then be obtained, so that a second label loss function $\mathcal{L}_{label2}$ between the frame prediction label $f_{\theta_k}(\hat{I})$ and the standard label $y$ can be obtained. When the second label loss function $\mathcal{L}_{label2}$ is characterized by a cross-entropy loss function, it can be expressed as $\mathcal{L}_{label2} = \mathrm{CE}\big(f_{\theta_k}(\hat{I}),\, y\big)$. Of course, the second label loss function may also be expressed in other ways, such as a logarithmic loss function or an exponential loss function; different ways correspond to different expression formulas, which are not described herein again.
After the second feature loss function $\mathcal{L}_{feat2}$ and the second label loss function $\mathcal{L}_{label2}$ are acquired, the second loss function $\mathcal{L}_2$ corresponding to the process fusion frame can be determined from them. In some instances, $\mathcal{L}_2 = \mathcal{L}_{feat2} + \mathcal{L}_{label2}$, i.e. the second loss function is the sum of the second feature loss function and the second label loss function.
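A minimal Python (PyTorch) sketch of this first implementation of the second loss function is given below; `classifier` and `feature_extractor` stand in for the video behavior recognition model and its feature extractor, and the assumption that both inputs are already batched tensors accepted by those modules is made only for illustration.

import torch
import torch.nn.functional as F

def second_loss(classifier, feature_extractor, process_frame, video, label):
    # process_frame: (1, C, H, W); video: a representation the extractor accepts;
    # label: (1,) class index. Both callables are placeholders for the trained model.
    frame_feat = feature_extractor(process_frame)        # process frame feature
    video_feat = feature_extractor(video)                 # video feature (assumed same embedding size)
    feat_loss = F.mse_loss(frame_feat, video_feat)         # Euclidean-style second feature loss
    label_loss = F.cross_entropy(classifier(process_frame), label)   # second label loss
    return feat_loss + label_loss                          # second loss = sum of the two terms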
In a second implementation manner, based on the video behavior recognition model, obtaining the second loss function corresponding to the process fusion frame may include: acquiring process frame characteristics corresponding to the process fusion frame and video characteristics corresponding to the video to be processed based on the video behavior recognition model; acquiring a second characteristic loss function between the process frame characteristic and the video characteristic; a second loss function corresponding to the process fused frame is determined based on the second characteristic loss function.
Different from the above implementation manner, the second loss function $\mathcal{L}_2$ in this implementation is acquired only through the second feature loss function $\mathcal{L}_{feat2}$, so there is no need to acquire the second label loss function $\mathcal{L}_{label2}$; in particular, the second feature loss function can be determined directly as the second loss function, i.e. $\mathcal{L}_2 = \mathcal{L}_{feat2}$.
In a third implementation manner, based on the video behavior recognition model, obtaining the second loss function corresponding to the process fusion frame may include: acquiring a frame prediction label corresponding to the process fusion frame and a standard label corresponding to the video to be processed based on the video behavior recognition model; obtaining a second label loss function between the frame prediction label and the standard label; based on the second tag loss function, a second loss function corresponding to the process fused frame is determined.
Different from the above implementation manners, the second loss function $\mathcal{L}_2$ in this implementation is acquired only through the second label loss function $\mathcal{L}_{label2}$, so there is no need to acquire the second feature loss function $\mathcal{L}_{feat2}$; in particular, the second label loss function can be determined directly as the second loss function, i.e. $\mathcal{L}_2 = \mathcal{L}_{label2}$.
In still other examples, after a fusion frame corresponding to the video to be processed is obtained by combining the learnable parameters and the learnable information, and after the second loss function corresponding to the process fusion frame is acquired, this embodiment may further adjust the learnable parameters based on the second loss function. In this case, the method in this embodiment may further include: adjusting the learnable parameters based on the second loss function to obtain target learning parameters corresponding to the learnable parameters, thereby effectively improving the flexibility and reliability of determining the target learning parameters.
Step S302: fusing the fusion frame and the learnable information to obtain a target fusion frame.
Referring to fig. 4, after the fusion frame and the learnable information are acquired, they may be fused to obtain a target fusion frame. Specifically, fusing the fusion frame and the learnable information to obtain the target fusion frame may include: performing pixel-by-pixel summation on the learnable information and the fusion frame to obtain the target fusion frame; or performing pixel-by-pixel product on the learnable information and the fusion frame to obtain the target fusion frame; or splicing the learnable information and the fusion frame to obtain the target fusion frame. In this way, the target fusion frame is obtained accurately and reliably.
In this embodiment, the learnable information corresponding to the video to be processed is acquired, and the fusion frame and the learnable information are fused to obtain the target fusion frame, so that the accuracy and reliability of determining the target fusion frame are effectively ensured, which in turn facilitates accurately representing the video to be processed based on the target fusion frame.
Fig. 5 is a schematic flowchart illustrating a process of determining learnable parameters corresponding to a plurality of video frames according to an embodiment of the present invention; on the basis of the foregoing embodiment, referring to fig. 5, this embodiment provides an implementation manner for determining a learnable parameter, and specifically, determining a learnable parameter corresponding to each of a plurality of video frames in this embodiment may include:
step S501: acquiring initialization parameters corresponding to the plurality of video frames and initialization information corresponding to the video to be processed, wherein the initialization parameters are obtained based on the number of the plurality of video frames, and the initialization information is used for identifying space information and time information corresponding to the video to be processed.
After acquiring the plurality of video frames, the initialization parameters corresponding to the plurality of video frames may be automatically acquired or determined, and the initialization parameters may be determined based on the number of the video frames, in some examples, the initialization parameters corresponding to different video frames are the same, that is, in an initial state, it may be default that all the video frames have the same influence on the behavior recognition operation of the video to be processed, for example, when the number of the video frames is N, then the initialization parameters may be 1/N. In other examples, the initialization parameters corresponding to different video frames may be different, and at this time, the initialization parameters corresponding to each of the plurality of video frames may be determined based on the frame characteristics of the video frame.
Similarly, after the to-be-processed video is acquired, the initialization information corresponding to the to-be-processed video may be automatically acquired or determined, where the initialization information may be determined based on the to-be-processed video, and in some examples, the initialization information corresponding to different to-be-processed videos may be the same value, and at this time, the initialization information may be 0. Or, the initialization information corresponding to different videos to be processed may be different values.
Step S502: determining an initial fusion frame corresponding to the video to be processed based on the initialization parameters.
Because the initialization parameter is only the initial influence degree, set in the initial state, with which a video frame represents the video to be processed, the initialization parameter needs to be optimized in combination with the result of video behavior recognition in order to better represent the video to be processed and obtain a more accurate fusion frame. After the initialization parameters are obtained, an initial fusion frame corresponding to the video to be processed can therefore be determined based on the initialization parameters; specifically, the initial fusion frame can be a fusion frame obtained by performing weighted summation on the plurality of video frames with the initialization parameters.
Step S503: fusing the initial fusion frame and the initialization information to obtain a process fusion frame.
After the initialization information and the initial fusion frame are obtained, they may be fused to obtain a process fusion frame. Fusing the initialization information and the initial fusion frame may include: performing pixel-by-pixel summation on the initialization information and the initial fusion frame to obtain the process fusion frame; or performing pixel-by-pixel product on the initialization information and the initial fusion frame to obtain the process fusion frame; or splicing the initialization information and the initial fusion frame to obtain the process fusion frame. In this way, the process fusion frame is obtained accurately and reliably.
Step S504: acquiring a third loss function corresponding to the process fusion frame based on the video behavior recognition model.
After the process fusion frame is obtained, feature extraction processing and behavior tag identification processing can be performed on the process fusion frame based on the video behavior identification model, and then a third loss function corresponding to the process fusion frame can be obtained based on the obtained feature information and tag information.
For the third loss function, this embodiment may obtain different third loss functions in different manners, and in a first implementation manner, obtaining the third loss function corresponding to the process fusion frame based on the video behavior recognition model may include: acquiring fusion frame characteristics corresponding to the initial fusion frame, initial prediction labels corresponding to the initial fusion frame, process frame characteristics corresponding to the process fusion frame, video characteristics corresponding to the video to be processed, frame prediction labels corresponding to the process fusion frame and standard labels corresponding to the video to be processed based on the video behavior recognition model; acquiring a first sub-loss function between the fusion frame feature and the video feature, a second sub-loss function between the initial prediction label and the standard label, a third sub-loss function between the process frame feature and the video feature, and a fourth sub-loss function between the frame prediction label and the standard label; and determining a third loss function corresponding to the process fusion frame based on the first sub-loss function, the second sub-loss function, the third sub-loss function and the fourth sub-loss function.
Wherein, the first sub-loss function $\mathcal{L}_{s1}$ and the third sub-loss function $\mathcal{L}_{s3}$ can be obtained through a Euclidean distance, a cosine similarity, or the like, and the second sub-loss function $\mathcal{L}_{s2}$ and the fourth sub-loss function $\mathcal{L}_{s4}$ can be obtained through a cross-entropy loss function. Additionally, determining the third loss function corresponding to the process fusion frame based on the first sub-loss function, the second sub-loss function, the third sub-loss function and the fourth sub-loss function may include: acquiring weight information corresponding to each of the first, second, third and fourth sub-loss functions; and performing weighted summation on the first, second, third and fourth sub-loss functions based on the weight information to obtain the third loss function.
Specifically, after the first sub-loss function $\mathcal{L}_{s1}$, the second sub-loss function $\mathcal{L}_{s2}$, the third sub-loss function $\mathcal{L}_{s3}$ and the fourth sub-loss function $\mathcal{L}_{s4}$ are obtained, the weight information $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ corresponding to each of them may be determined respectively, and the sub-loss functions are then weighted and summed with their corresponding weight information to obtain the third loss function corresponding to the process fusion frame: $\mathcal{L}_3 = \lambda_1 \mathcal{L}_{s1} + \lambda_2 \mathcal{L}_{s2} + \lambda_3 \mathcal{L}_{s3} + \lambda_4 \mathcal{L}_{s4}$, where $\lambda_1$ is the weight information corresponding to the first sub-loss function $\mathcal{L}_{s1}$, $\lambda_2$ is the weight information corresponding to the second sub-loss function $\mathcal{L}_{s2}$, $\lambda_3$ is the weight information corresponding to the third sub-loss function $\mathcal{L}_{s3}$, and $\lambda_4$ is the weight information corresponding to the fourth sub-loss function $\mathcal{L}_{s4}$.
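The weighted summation of the four sub-loss functions can be sketched as follows in Python (PyTorch); the default weights of 1 and the use of a mean-squared (Euclidean-style) feature distance are assumptions chosen for the example.

import torch.nn.functional as F

def third_loss(fused_feat, process_feat, video_feat,
               fused_logits, process_logits, label,
               weights=(1.0, 1.0, 1.0, 1.0)):
    l1 = F.mse_loss(fused_feat, video_feat)        # first sub-loss: initial fused frame vs. video features
    l2 = F.cross_entropy(fused_logits, label)      # second sub-loss: initial prediction label vs. standard label
    l3 = F.mse_loss(process_feat, video_feat)      # third sub-loss: process fusion frame vs. video features
    l4 = F.cross_entropy(process_logits, label)    # fourth sub-loss: frame prediction label vs. standard label
    w1, w2, w3, w4 = weights
    return w1 * l1 + w2 * l2 + w3 * l3 + w4 * l4   # weighted summation = third loss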
In a second implementation manner, based on the video behavior recognition model, the obtaining a third loss function corresponding to the process fusion frame may include: acquiring process frame characteristics corresponding to the process fusion frame, video characteristics corresponding to the video to be processed, a frame prediction label corresponding to the process fusion frame and a standard label corresponding to the video to be processed based on the video behavior recognition model; acquiring a third sub-loss function between the process frame characteristic and the video characteristic and a fourth sub-loss function between the frame prediction label and the standard label; and determining a third loss function corresponding to the process fusion frame based on the third sub-loss function and the fourth sub-loss function.
In a third implementation manner, based on the video behavior recognition model, obtaining a third loss function corresponding to the process fusion frame may include: acquiring fusion frame characteristics corresponding to the initial fusion frame, an initial prediction label corresponding to the initial fusion frame, video characteristics corresponding to the video to be processed and a standard label corresponding to the video to be processed based on the video behavior recognition model; acquiring a first sub-loss function between the fusion frame characteristic and the video characteristic and a second sub-loss function between the initial prediction label and the standard label; a third loss function corresponding to the process fused frame is determined based on the first sub-loss function and the second sub-loss function.
Specifically, the specific implementation manner and implementation effect of the three implementation manners of the third loss function in this embodiment are similar to the specific implementation manner and implementation effect of the three implementation manners of the second loss function and the first loss function in the above embodiment, and the above statements may be specifically referred to, and are not repeated herein.
Step S505: adjusting the initialization parameters based on the third loss function to obtain learnable parameters corresponding to the plurality of video frames respectively.
After the third loss function is obtained, the initialization parameters may be adjusted based on the third loss function to obtain the learnable parameters corresponding to the plurality of video frames. Specifically, the video frames may be analyzed and processed by the video behavior recognition model, and the initialization parameters may be adjusted with the third loss function as constraint information to obtain the learnable parameters corresponding to the plurality of video frames; the learnable parameters determined in this way are adapted to the plurality of video frames. A sketch of this adjustment loop is given after this paragraph.
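The following Python (PyTorch) sketch illustrates adjusting the initialization parameters (and initialization information) by gradient descent on a third-loss-style objective; the tiny stand-in feature extractor and classifier, the frame count, learning rate, step count and class index are assumptions made only so the example is self-contained.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins for the trained video behavior recognition model's feature extractor and classifier.
feature_extractor = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 64))
classifier = nn.Linear(64, 10)
for p in list(feature_extractor.parameters()) + list(classifier.parameters()):
    p.requires_grad_(False)                                 # the model itself stays fixed here

T = 8
frames = torch.rand(T, 3, 224, 224)                         # sampled video frames
video_feat = feature_extractor(frames).mean(dim=0, keepdim=True)   # assumed video-level feature
label = torch.tensor([3])                                   # standard label (assumed class index)

weights = torch.full((T,), 1.0 / T, requires_grad=True)     # initialization parameters
prompt = torch.zeros(1, 3, 224, 224, requires_grad=True)    # initialization information
optimizer = torch.optim.SGD([weights, prompt], lr=0.01)

for step in range(100):
    fused = (weights.view(-1, 1, 1, 1) * frames).sum(dim=0, keepdim=True)   # initial fused frame
    process = fused + prompt                                                # process fusion frame
    feat = feature_extractor(process)
    loss = (F.mse_loss(feature_extractor(fused), video_feat)                # first sub-loss
            + F.cross_entropy(classifier(feature_extractor(fused)), label)  # second sub-loss
            + F.mse_loss(feat, video_feat)                                   # third sub-loss
            + F.cross_entropy(classifier(feat), label))                      # fourth sub-loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
# `weights` now holds the learnable parameters and `prompt` the learnable information.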
In still other examples, after the fusion frame corresponding to the video to be processed is obtained by combining the learnable parameters and the learnable information, that is, after the third loss function corresponding to the process fusion frame is obtained, the method in this embodiment may further include adjusting the initialization information based on the third loss function. In this case, the method in this embodiment may further include: adjusting the initialization information based on the third loss function to obtain the learnable information corresponding to the video to be processed, thereby effectively improving the flexibility and reliability of determining the learnable information.
In this embodiment, the initialization parameters corresponding to the plurality of video frames and the initialization information corresponding to the video to be processed are obtained, the initial fusion frame corresponding to the video to be processed is determined based on the initialization parameters, the initial fusion frame and the initialization information are fused to obtain the process fusion frame, the third loss function corresponding to the process fusion frame is obtained based on the video behavior recognition model, and the initialization parameters are adjusted based on the third loss function to obtain the learnable parameters corresponding to the plurality of video frames; this effectively ensures the accuracy and reliability of determining the learnable parameters and further improves the quality and efficiency of video processing.
Fig. 6 is a flowchart illustrating a video processing method according to another embodiment of the present invention; on the basis of any one of the above embodiments, referring to fig. 6, after obtaining a fusion frame corresponding to a video to be processed, the present embodiment provides a technical solution for performing a model update or model optimization operation by using the fusion frame and a newly added video sample, and specifically, the method in the present embodiment may further include:
step S601: and acquiring a new video sample.
The specific acquiring method of the newly added video sample is similar to the specific acquiring method of the video to be processed in the above embodiment, and the above statements may be specifically referred to, and are not repeated here.
Step S602: performing learning training on the video behavior recognition model based on the newly added video samples and the fusion frames to obtain an optimized recognition model.
After the newly added video samples and the fusion frames are obtained, a learning training operation can be performed on the video behavior recognition model based on the newly added video samples and the fusion frames, so that the optimized recognition model can be obtained. In some examples, the newly added video samples and the fusion frames can be alternately and sequentially input into the video behavior recognition model to implement the learning training operation. In other examples, because the number of newly added video samples and the number of fusion frames may not be of the same magnitude, and in order to avoid catastrophic forgetting during the training of video behavior recognition, performing learning training on the video behavior recognition model based on the newly added video samples and the fusion frames to obtain the optimized recognition model may include: acquiring a sample proportion between the newly added video samples and the fusion frames; and training the video behavior recognition model with the newly added video samples and the fusion frames according to the sample proportion to obtain the optimized recognition model.
Specifically, after the newly added video samples and the fusion frames are obtained, the sample proportion between the newly added video samples and the fusion frames can first be obtained, and the newly added video samples and the fusion frames can then be used to train the video behavior recognition model according to this sample proportion, so that the optimized recognition model can be obtained. This effectively avoids catastrophic forgetting degrading the training quality and effect of the optimized recognition model, and thus ensures the training quality and training effect of the video behavior recognition model. A sketch of drawing training batches according to such a sample proportion is given below.
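The sample proportion can be realized, for example, by drawing mixed training batches in Python; the batch size and the 3:1 proportion below are assumptions for illustration, not values prescribed by this embodiment.

import random

def mixed_batch(new_samples, fused_frames, batch_size=16, new_fraction=0.75):
    # Draw one training batch in which newly added video samples and stored fused
    # frames appear according to a fixed sample proportion (here 3:1).
    n_new = int(batch_size * new_fraction)
    n_old = batch_size - n_new
    return random.sample(new_samples, n_new) + random.sample(fused_frames, n_old)

# Example with placeholder items; in practice these would be (video, label) pairs.
new_samples = [f"new_video_{i}" for i in range(100)]
fused_frames = [f"fused_frame_{i}" for i in range(30)]
batch = mixed_batch(new_samples, fused_frames)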
In the embodiment, the newly added video sample is obtained, and then the video behavior recognition model is learned and trained based on the newly added video sample and the fusion frame to obtain the recognition model after optimization, so that the optimization operation of the model is effectively realized, the quality and effect of model updating or model optimization are ensured, and the practicability of the method is further improved.
In a specific application, referring to fig. 7, this application embodiment provides a video processing method that implements memory-efficient video incremental learning; specifically, in this technical scheme, a representative frame is learned for each video sample, so as to further reduce the memory overhead caused by storing video data. Experiments show that, under the same memory consumption, this technical scheme performs much better than other reference methods, and in particular achieves higher data storage quality and effect when the memory consumption is only about 20%. Specifically, the video processing method in this embodiment may include the following steps:
step 1: and acquiring a video data set and a preliminarily trained video behavior recognition model, wherein the video behavior recognition model is used for performing behavior recognition operation on the video data.
Step 2: video samples that can represent category information are screened out of the video dataset.
Specifically, after the video behavior recognition model is initially trained, that is, after an incremental task corresponding to the video behavior recognition model is finished, a representative sample selection algorithm may be used to screen out, from the current video data set, video samples capable of representing category information; the number of video samples corresponding to one category is at least one, and the obtained video samples of all categories form a representative sample video set.
The representative sample selection algorithm may include the Herding strategy, a clustering-style strategy that mainly selects samples around the class center. Of course, the representative sample selection algorithm is not limited to the Herding strategy listed above and may also include other strategies, for example: selecting video samples according to the contribution degree of each video to the loss function; or selecting video samples according to the class gradient corresponding to each video. A person skilled in the art may perform any selection and configuration according to a specific application scenario or application requirement, which is not described herein again. A sketch of a Herding-style selection is given below.
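One common instantiation of such a representative sample selection algorithm is a Herding-style procedure that greedily keeps the mean of the selected set close to the class mean; the Python (NumPy) sketch below assumes per-video embeddings have already been extracted by the feature extractor and is only illustrative.

import numpy as np

def herding_selection(features: np.ndarray, m: int) -> list:
    # features: (N, d) embeddings of all videos of one class; m: number of exemplars to keep.
    class_mean = features.mean(axis=0)
    selected, running_sum = [], np.zeros_like(class_mean)
    for _ in range(m):
        # Pick the video that moves the mean of the selected set closest to the class mean.
        candidate_means = (running_sum + features) / (len(selected) + 1)
        gaps = np.linalg.norm(class_mean - candidate_means, axis=1)
        gaps[selected] = np.inf              # never pick the same video twice
        idx = int(np.argmin(gaps))
        selected.append(idx)
        running_sum += features[idx]
    return selected

exemplars = herding_selection(np.random.rand(50, 128), m=5)   # indices of 5 representative videos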
Step 3: performing a frame sampling operation on each video sample to obtain a plurality of video frames.
In particular, a representative sample selection method commonly used in incremental learning (the Herding strategy) is applied to the video data set $D_k$ obtained after incremental step $k$ to determine a representative sample video set $M_k$. Then, for any video sample $V_i$ in the representative sample video set $M_k$, $T$ frames $\{x_1, x_2, \dots, x_T\}$ may be uniformly sampled or randomly sampled from the video sample, thereby obtaining a plurality of video frames; a sketch of such frame-index sampling follows.
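Frame sampling can be sketched as follows; the default of T = 8 frames and the use of NumPy are assumptions made for the example.

import numpy as np

def sample_frame_indices(num_frames: int, t: int = 8, uniform: bool = True) -> np.ndarray:
    # Returns T frame indices, either spread evenly across the video or drawn at random
    # (random sampling assumes the video has at least T frames).
    if uniform:
        return np.linspace(0, num_frames - 1, t).round().astype(int)
    return np.sort(np.random.choice(num_frames, size=t, replace=False))

indices = sample_frame_indices(300)   # e.g. a 300-frame video sample -> 8 frame indices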
Step 4: determining a learnable weight corresponding to each of the plurality of video frames, wherein the learnable weight can be determined through the video behavior recognition model and the number of the plurality of video frames.
In order to further process the selected video samples, a learnable weight may be optimized for each video frame. In some examples, after the plurality of video frames are acquired, the number $T$ of the video frames may be determined, and the initialization parameter is then determined based on this number; the initialization parameter corresponding to each video frame is the same, specifically $1/T$. For example, if 8 video frames are sampled, the initialization parameter corresponding to each video frame is $1/8$.
Step 5: fusing the plurality of video frames based on the learnable weights respectively corresponding to the plurality of video frames to obtain a fused frame.
Specifically, an initial fused frame can first be obtained based on the initialization parameters, a behavior recognition operation is performed on the initial fused frame by the video behavior recognition model to obtain a loss function corresponding to the initial fused frame, and the initialization parameters are then adjusted and optimized based on this loss function to obtain the learnable weights corresponding to the plurality of video frames.
With the optimized learnable weights, a fused frame that represents the characteristics of the video and can be correctly classified with high confidence can be learned for each video, that is, the plurality of video frames are fused into one representative frame. The fused frame then represents the video sample for storage, and only the fused frame is stored in the next incremental task, which effectively reduces the memory space required for data storage.
For example, for the $i$-th video sample $V_i$ in the representative sample video set $M_k$, learnable weights $a = \{a_1, a_2, \dots, a_T\}$ are defined, and the fused frame can be expressed as the following formula: $I_i = \sum_{t=1}^{T} \sigma(a)_t \, x_t \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$ and $W$ are the number of image channels, the height and the width of a video frame, $I_i$ denotes the fused frame, and $\sigma(\cdot)$ denotes a normalization of the learnable weights; the size of the fused frame is the same as that of a video frame, and the number of image channels of the fused frame is the same as that of a video frame.
Step 6: determining an Instance-Specific Prompt corresponding to each of the plurality of video frames, wherein the instance-specific prompt is used for representing the temporal information and spatial information corresponding to the video sample, and the instance-specific prompt can be determined through the video behavior recognition model.
When a plurality of video frames are fused into one frame (namely, the fused frame), the temporal and spatial information of the original video is inevitably lost to a certain extent. In order to supplement the information loss caused by frame fusion and compensate for the loss of temporal information and the confusion of spatial information introduced in the frame fusion stage, an instance-specific prompt can be further added to the fused frame to recover detail information at the pixel level, so that the feature information of the original video can be better preserved.
In particular, for each video sample $V_i$, a learnable instance-specific prompt $p_i$ is constructed, whose spatial resolution is the same as that of the original video. The instance-specific prompt $p_i$ can be obtained by adjusting an initialization prompt through the video behavior recognition model. In some examples, the initialization prompt may be all zeros; the initialization prompt and the fused frame are then summed pixel by pixel to obtain a target fusion frame, the target fusion frame is analyzed and processed by the video behavior recognition model to obtain a loss function of the target fusion frame, and the initialization prompt is optimized and adjusted based on this loss function, so that the instance-specific prompt can be obtained.
Step 7: processing the fused frame based on the instance-specific prompts corresponding to the video frames to obtain a target fused frame, wherein the target fused frame is used for representing the video sample.
Specifically, the instance-specific prompt $p_i$ and the fused frame $I_i$ are summed pixel by pixel to obtain the target fused frame $\hat{I}_i = I_i + p_i$ used for representing the video sample. The obtained target fused frame can then take the place of the video sample in model training operations, so that more target fused frames can be stored in a limited memory space, which effectively ensures the diversity and quantity of samples during model updating or model optimization operations.
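To make the memory argument concrete, the small Python (PyTorch) comparison below contrasts storing T sampled frames per video with storing a single target fused frame; the frame count, resolution and float32 storage are assumptions made for the example.

import torch

def size_mb(t: torch.Tensor) -> float:
    return t.numel() * t.element_size() / 2**20

video_frames = torch.rand(8, 3, 224, 224)     # T = 8 frames kept per exemplar video
target_fused = torch.rand(3, 224, 224)        # one target fused frame kept instead

print(f"{size_mb(video_frames):.2f} MB per video -> {size_mb(target_fused):.2f} MB per video")
# Storing one fused frame instead of T frames cuts the per-exemplar memory by a factor of T.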
In addition, when the learnable weight is determined by the video behavior recognition model and the number of the plurality of video frames, the method in this embodiment may further include the following steps:
step 11: obtaining initialization parameters corresponding to a plurality of video frames respectively, wherein the plurality of video frames correspond to one or a plurality of video samples
Figure 823617DEST_PATH_IMAGE047
Step 12: performing preliminary fusion on the plurality of video frames based on the initialization parameters respectively corresponding to them to obtain an initial fused frame $I_i$.

Step 13: performing a feature extraction operation on the initial fused frame $I_i$ with the video behavior recognition model to obtain the fused-frame feature $g_{\phi_k}(I_i)$, where $g_{\phi_k}$ is the feature extractor of the video behavior recognition model $f_{\theta_k}$, $\theta_k$ are the parameters of the video behavior recognition model, and $\phi_k$ are the parameters of the feature extractor.

Step 14: performing a feature extraction operation on the video sample $V_i$ with the video behavior recognition model to obtain the video feature $g_{\phi_k}(V_i)$.
Step 15: a first sub-loss function between the fused frame feature and the video feature is obtained.
Specifically, in the process of video processing, in order to obtain a fused frame that accurately represents the video sample, i.e. a fused frame with the same or very similar expression capability to the video sample, the embedded feature of the fused frame extracted by the video behavior recognition model should be consistent with the feature of the original video: $\mathcal{L}_{f1} = \left\| g_{\phi_k}(I_i) - g_{\phi_k}(V_i) \right\|_2^2$, where $g_{\phi_k}$ is the feature extractor of the current model $f_{\theta_k}$ and $\phi_k$ are the parameters of the feature extractor.
Step 16: performing a label behavior prediction operation on the initial fused frame $I_i$ with the video behavior recognition model to obtain the predicted label $f_{\theta_k}(I_i)$.

Step 17: determining the standard label $y_i$ corresponding to the video sample, and obtaining a first sub-cross-entropy loss function between the predicted label $f_{\theta_k}(I_i)$ and the standard label $y_i$.

In order to further improve the adaptability of the fused frame to the video behavior recognition model, a cross-entropy loss can be used to supervise the classification confidence of the fused frame: $\mathcal{L}_{c1} = \mathrm{CE}\big(f_{\theta_k}(I_i),\, y_i\big)$.
step 18: obtaining a first total objective function for optimizing the learnable weight, in particular, the first total objective function, based on the first sub-loss function and the first sub-cross entropy loss function
Figure 901239DEST_PATH_IMAGE064
Further, when determining, based on the video behavior recognition model, an instance-specific hint corresponding to each of the plurality of video frames, the method in this embodiment may further include the following steps:
step 21: obtaining an initial fused frame
Figure 634840DEST_PATH_IMAGE055
Corresponding initialization prompt information
Figure 184770DEST_PATH_IMAGE054
Wherein the prompt information is initialized
Figure 862876DEST_PATH_IMAGE054
May be all zero.
Step 22: determining a process fusion frame based on the initialization prompt information and the initial fusion frame.
In particular, the initialization prompt information $p_i$ and the initial fused frame $I_i$ are summed pixel by pixel to obtain the process fusion frame $\hat{I}_i = I_i + p_i$.

Step 23: performing a feature extraction operation on the process fusion frame $\hat{I}_i$ with the video behavior recognition model to obtain the process frame feature $g_{\phi_k}(\hat{I}_i)$.

Step 24: performing a feature extraction operation on the video sample $V_i$ with the video behavior recognition model to obtain the video feature $g_{\phi_k}(V_i)$.

Step 25: obtaining a second sub-loss function between the process frame feature $g_{\phi_k}(\hat{I}_i)$ and the video feature $g_{\phi_k}(V_i)$.

In particular, to characterize the consistency between the process fusion frame $\hat{I}_i$ and the original video $V_i$, the second sub-loss function can be expressed by the following formula: $\mathcal{L}_{f2} = \left\| g_{\phi_k}(\hat{I}_i) - g_{\phi_k}(V_i) \right\|_2^2$. In this process, the representation of the fused frame is enriched by introducing more flexible learnable parameters in the input space.
Step 26: performing a label behavior prediction operation on the process fusion frame $\hat{I}_i$ with the video behavior recognition model to obtain the predicted label $f_{\theta_k}(\hat{I}_i)$.

Step 27: determining the standard label $y_i$ corresponding to the video sample, and obtaining a second sub-cross-entropy loss function between the predicted label $f_{\theta_k}(\hat{I}_i)$ and the standard label $y_i$.

The second sub-cross-entropy loss function can be used to enhance the semantic perception of the process fusion frame; specifically, it can be expressed as $\mathcal{L}_{c2} = \mathrm{CE}\big(f_{\theta_k}(\hat{I}_i),\, y_i\big)$.

Step 28: determining a second total objective function based on the first total objective function, the second sub-cross-entropy loss function and the second sub-loss function. Specifically, the second total objective function of frame fusion with instance-specific prompts added is as follows: $\mathcal{L}_{total} = \lambda_1 \mathcal{L}_{f1} + \lambda_2 \mathcal{L}_{c1} + \lambda_3 \mathcal{L}_{f2} + \lambda_4 \mathcal{L}_{c2}$, where $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ are the default weights of the above loss functions respectively; in some instances, $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ may all be 1.
Step 29: adjusting the initialization prompt information and/or the initialization parameters based on the second total objective function to obtain the instance-specific prompt and/or the learnable parameters, wherein the obtained instance-specific prompt and/or learnable parameters can be used for determining the fused frame capable of representing the video sample.
After the fused frame of the video sample is determined, in addition to processing the video sample based on the instance-specific prompt and/or the learnable parameters, this embodiment can also process the video sample with sample-replay and knowledge-distillation algorithms. In this way, the number of frames stored per representative video and the information redundancy of the representative videos are reduced, which cuts the large memory consumption required for storing video data, while catastrophic forgetting is further prevented, so that memory-efficient video incremental learning can be realized.
Specifically, after the fusion frame capable of representing the video sample is acquired, the method in this application embodiment may further include performing a training operation on the video behavior recognition model based on the fusion frame, and at this time, the method in this application embodiment may include:
step 31: a plurality of fused frames is obtained, with different fused frames being used to represent different video samples.
Step 32: acquiring a newly added video sample, and performing a training update operation on the video behavior recognition model based on the newly added video sample and the plurality of fused frames to obtain an optimized video behavior recognition model.
In training increment step $k+1$, the data set $D_{k+1} \cup \mathcal{M}$ can be used to train the model $f_{\theta_{k+1}}$, where $D_{k+1}$ is the data of task $k+1$, composed of videos belonging to the category set $C_{k+1}$, and $\mathcal{M}$ is a repository of old samples that contains the plurality of fused frames generated in advance. In order to ensure the quality and effect of model updating or model optimization, the sample proportion between the stored fused frames and the newly added video samples can be determined, and the fused frames and the newly added video samples are then input into the video behavior recognition model according to this sample proportion, so that catastrophic forgetting during the model updating or model optimization operation can be effectively prevented.
In addition, this application embodiment can also adopt a basic knowledge distillation method from incremental learning to transfer the knowledge of the former model $f_{\theta_k}$ to the current model $f_{\theta_{k+1}}$; a sketch of such a distillation term is given below. The technical scheme provided by this application embodiment can realize the video class-incremental learning task with high storage efficiency: a fused frame capable of representing a video sample is obtained by learning the representative characteristics of the video sample, and the lost spatio-temporal information is supplemented at the pixel level through the instance-specific prompt, so that the fused frame can better represent the video sample.
Fig. 8 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present invention; referring to fig. 8, the present embodiment provides a video processing apparatus, which is configured to execute the video processing method shown in fig. 2, and specifically, the video processing apparatus may include:
a first obtaining module 11, configured to obtain multiple video frames of a video to be processed;
the first determining module 12 is configured to determine learnable parameters corresponding to each of a plurality of video frames, where the learnable parameters are obtained through a video behavior recognition model, where the video behavior recognition model is a machine learning model;
the first processing module 13 is configured to fuse the plurality of video frames based on the learnable parameters corresponding to the plurality of video frames, so as to obtain a fused frame corresponding to the video to be processed.
In some examples, when the first processing module 13 fuses the plurality of video frames based on the learnable parameters corresponding to the plurality of video frames, and obtains a fused frame corresponding to the video to be processed, the first processing module 13 is configured to perform: when the learnable parameters are numerical values which are larger than zero and smaller than 1, carrying out weighted summation on the plurality of video frames based on the learnable parameters respectively corresponding to the plurality of video frames to obtain a fusion frame; or when the learnable parameters are numerical values larger than 1, normalizing the learnable parameters corresponding to the video frames respectively to obtain normalized parameters corresponding to the learnable parameters; and carrying out weighted summation on the plurality of video frames based on the normalization parameters to obtain a fusion frame.
In some examples, when the first determination module 12 determines the learnable parameters corresponding to each of the plurality of video frames, the first determination module 12 is configured to perform: acquiring initialization parameters corresponding to the plurality of video frames respectively, wherein the initialization parameters are acquired based on the number of the plurality of video frames; determining an initial fusion frame corresponding to the video to be processed based on the initialization parameters; acquiring a first loss function corresponding to the initial fusion frame based on the video behavior recognition model; and adjusting the initialization parameters based on the first loss function to obtain learnable parameters corresponding to the plurality of video frames respectively.
In some examples, when the first determining module 12 obtains the first loss function corresponding to the initial fused frame based on the video behavior recognition model, the first determining module 12 is configured to perform: acquiring fusion frame characteristics corresponding to the initial fusion frame, video characteristics corresponding to the video to be processed, an initial prediction label corresponding to the initial fusion frame and a standard label corresponding to the video to be processed based on the video behavior recognition model; acquiring a first feature loss function between the fusion frame feature and the video feature and a first label loss function between an initial prediction label and a standard label; a first loss function corresponding to the initial fused frame is determined based on the first feature loss function and the first tag loss function.
In some examples, after obtaining the fusion frame corresponding to the video to be processed, the first obtaining module 11 and the first processing module 13 in this embodiment are configured to perform the following steps:
a first obtaining module 11, configured to obtain learnable information corresponding to a video to be processed, where the learnable information is used to identify spatial information and/or temporal information of the video to be processed;
and a first processing module 13, configured to fuse the fused frame and the learnable information to obtain a target fused frame.
In some examples, when the first processing module 13 fuses the fused frame and the learnable information to obtain the target fused frame, the first processing module 13 is configured to perform: performing pixel-by-pixel summation processing on the learnable information and the fusion frame to obtain a target fusion frame; or, performing pixel-by-pixel product processing on the learnable information and the fusion frame to obtain a target fusion frame; or splicing the learnable information and the fusion frame to obtain a target fusion frame.
In some examples, when the first obtaining module 11 obtains the learnable information corresponding to the video to be processed, the first obtaining module 11 is configured to perform: acquiring initialization information corresponding to a video to be processed; fusing the initialization information and the fusion frame to obtain a process fusion frame; acquiring a second loss function corresponding to the process fusion frame based on the video behavior recognition model; and adjusting the initialization information based on the second loss function to obtain learnable information corresponding to each of the plurality of video frames.
In some examples, when the first obtaining module 11 obtains the second loss function corresponding to the process fusion frame based on the video behavior recognition model, the first obtaining module 11 is configured to perform: acquiring process frame characteristics corresponding to the process fusion frame, video characteristics corresponding to the video to be processed, a frame prediction label corresponding to the process fusion frame and a standard label corresponding to the video to be processed based on the video behavior recognition model; acquiring a second characteristic loss function between the process frame characteristic and the video characteristic and a second label loss function between the frame prediction label and the standard label; a second loss function corresponding to the process fused frame is determined based on the second feature loss function and the second tag loss function.
In some examples, after obtaining the second loss function corresponding to the process fusion frame, the first processing module 13 in this embodiment is configured to perform: and adjusting the learnable parameters based on the second loss function to obtain target learning parameters corresponding to the learnable parameters.
In some examples, when the first determination module 12 determines the learnable parameters corresponding to each of the plurality of video frames, the first determination module 12 is configured to perform: acquiring initialization parameters corresponding to a plurality of video frames and initialization information corresponding to a video to be processed, wherein the initialization parameters are acquired based on the number of the video frames, and the initialization information is used for identifying space information and time information corresponding to the video to be processed; determining an initial fusion frame corresponding to the video to be processed based on the initialization parameters; fusing the initial fusion frame and the initialization information to obtain a process fusion frame; acquiring a third loss function corresponding to the process fusion frame based on the video behavior recognition model; and adjusting the initialization parameters based on a third loss function to obtain learnable parameters corresponding to the plurality of video frames respectively.
In some examples, after obtaining the third loss function corresponding to the process fusion frame, the first processing module 13 in this embodiment is configured to perform: and adjusting the initialization information based on a third loss function to obtain learnable information corresponding to the video to be processed.
In some examples, when the first determination module 12 obtains the third loss function corresponding to the process fusion frame based on the video behavior recognition model, the first determination module 12 is configured to perform: acquiring fusion frame characteristics corresponding to the initial fusion frame, initial prediction labels corresponding to the initial fusion frame, process frame characteristics corresponding to the process fusion frame, video characteristics corresponding to the video to be processed, frame prediction labels corresponding to the process fusion frame and standard labels corresponding to the video to be processed based on the video behavior recognition model; acquiring a first sub-loss function between the fusion frame characteristic and the video characteristic, a second sub-loss function between the initial prediction label and the standard label, a third sub-loss function between the process frame characteristic and the video characteristic, and a fourth sub-loss function between the frame prediction label and the standard label; determining a third loss function corresponding to the process fusion frame based on the first, second, third, and fourth sub-loss functions.
In some examples, when the first determination module 12 determines the third loss function corresponding to the process fusion frame based on the first sub-loss function, the second sub-loss function, the third sub-loss function, and the fourth sub-loss function, the first determination module 12 is configured to perform: acquiring weight information corresponding to a first sub-loss function, a second sub-loss function, a third sub-loss function and a fourth sub-loss function respectively; and carrying out weighted summation on the first sub-loss function, the second sub-loss function, the third sub-loss function and the fourth sub-loss function based on the weight information to obtain a third loss function.
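As an illustrative sketch, the third loss function may be assembled as a weighted summation of the four sub-loss functions as follows; the particular loss types (MSE and cross-entropy) and the default weights are assumptions made only for this example:

```python
# Illustrative sketch only: the third loss function as a weighted summation of
# the four sub-loss functions. MSE/cross-entropy and the default weights are
# assumptions made for the example.
import torch.nn.functional as F

def third_loss(fused_feat, proc_feat, video_feat,
               fused_logits, proc_logits, label,
               weights=(1.0, 1.0, 1.0, 1.0)):
    l1 = F.mse_loss(fused_feat, video_feat)      # first sub-loss: fused-frame feature vs video feature
    l2 = F.cross_entropy(fused_logits, label)    # second sub-loss: initial prediction vs standard label
    l3 = F.mse_loss(proc_feat, video_feat)       # third sub-loss: process-frame feature vs video feature
    l4 = F.cross_entropy(proc_logits, label)     # fourth sub-loss: frame prediction vs standard label
    w1, w2, w3, w4 = weights
    return w1 * l1 + w2 * l2 + w3 * l3 + w4 * l4
```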
In some examples, after obtaining the fused frame corresponding to the video to be processed, the first obtaining module 11 and the first processing module 13 in this embodiment perform the following steps:
the first obtaining module 11 is configured to obtain a newly added video sample;
and the first processing module 13 is configured to perform learning training on the video behavior recognition model based on the newly added video samples and the fusion frames to obtain an optimized recognition model.
In some examples, when the first processing module 13 performs learning training on the video behavior recognition model based on the newly added video samples and the fusion frames to obtain the optimized recognition model, the first processing module 13 is configured to perform: acquiring a sample proportion between the newly added video sample and the fusion frame; and training the video behavior recognition model by the newly added video sample and the fusion frame according to the sample proportion to obtain an optimized recognition model.
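A possible, non-authoritative sketch of assembling a training batch from newly added video samples and stored fusion frames according to a sample proportion is shown below; the batch size and the 50/50 ratio are illustrative values only, not parameters taken from the embodiment:

```python
# Illustrative sketch only: drawing a training batch that mixes newly added
# video samples with stored fusion frames according to a sample proportion.
# The batch size and the 50/50 ratio are example values, not patent parameters.
import random

def build_mixed_batch(new_samples, fusion_frames, batch_size=32, new_ratio=0.5):
    n_new = int(batch_size * new_ratio)
    batch = random.sample(new_samples, min(n_new, len(new_samples)))
    remaining = batch_size - len(batch)
    batch += random.sample(fusion_frames, min(remaining, len(fusion_frames)))
    random.shuffle(batch)  # mix old (fusion-frame) and new samples in one batch
    return batch
```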
The apparatus shown in fig. 8 can perform the method of the embodiment shown in fig. 1-7, and the detailed description of this embodiment can refer to the related description of the embodiment shown in fig. 1-7. The implementation process and technical effect of the technical solution refer to the descriptions in the embodiments shown in fig. 1 to 7, and are not described herein again.
In one possible design, the structure of the video processing apparatus shown in fig. 8 may be implemented as an electronic device, which may be a tablet computer, a personal computer PC, a conference room device, a server, or other various devices. As shown in fig. 9, the electronic device may include: a first processor 21 and a first memory 22. Wherein the first memory 22 is used for storing programs for corresponding electronic devices to execute the video processing method in the embodiments shown in fig. 1-7, and the first processor 21 is configured to execute the programs stored in the first memory 22.
The program comprises one or more computer instructions, wherein the one or more computer instructions, when executed by the first processor 21, are capable of performing the steps of: acquiring a plurality of video frames of a video to be processed; determining learnable parameters corresponding to a plurality of video frames respectively, wherein the learnable parameters are obtained through a video behavior recognition model, and the video behavior recognition model is a machine learning model; and fusing the plurality of video frames based on the learnable parameters respectively corresponding to the plurality of video frames to obtain a fused frame corresponding to the video to be processed.
Further, the first processor 21 is also used to execute all or part of the steps in the embodiments shown in fig. 1-7. The electronic device may further include a first communication interface 23 for communicating with other devices or a communication network.
In addition, the embodiment of the present invention provides a computer storage medium for storing computer software instructions for an electronic device, which includes a program for executing the video processing method in the embodiment shown in fig. 1 to 7.
Furthermore, an embodiment of the present invention provides a computer program product, including: computer program, which, when executed by a processor of an electronic device, causes the processor to perform the steps of the video processing method as described above with reference to fig. 1-7.
Fig. 10 is a flowchart illustrating another video processing method according to an embodiment of the present invention; referring to fig. 10, the present embodiment provides another video processing method, where an execution subject of the method may be a video processing apparatus, the video processing apparatus may be implemented as software, or a combination of software and hardware, and specifically, when the video processing apparatus is implemented as hardware, it may be embodied as various electronic devices having data processing operations, including but not limited to a tablet computer, a personal computer PC, a server, and the like. When the video processing apparatus is implemented as software, it can be installed in the electronic devices exemplified above. Based on the above video processing apparatus, the video processing method in this embodiment may include the following steps:
step S1001: a plurality of video frames of a video to be processed are acquired.
Step S1002: and displaying a parameter configuration interface for processing the plurality of video frames.
After the video to be processed is acquired, in order to process the plurality of video frames of the video to be processed, a parameter configuration interface may be displayed. A parameter adjustment control for adjusting a learnable parameter is displayed in the parameter configuration interface, and a user may configure or adjust the learnable parameter through the control, for example, increase or decrease the learnable parameter to meet different video processing requirements, so that learnable parameters meeting different requirements can be obtained quickly.
Step S1003: and determining learnable parameters corresponding to the plurality of video frames through the parameter configuration operation obtained by the parameter configuration interface.
After the parameter configuration interface is displayed, a parameter configuration operation corresponding to the learnable parameter can be acquired through the parameter configuration interface, and the parameter configuration operation is used for generating or adjusting the learnable parameter corresponding to the video to be processed. In some examples, a default learnable parameter value (e.g., 0, 0.5, etc.) may be displayed in the parameter configuration interface, and the user may confirm or adjust the default learnable parameter through the parameter configuration interface. The default learnable parameter value may be determined by the number of the plurality of video frames; for example, when the number of the plurality of video frames is n, the default learnable parameter value may be 1/n.
In addition, the parameter adjustment control included in the parameter configuration interface may be a character input control: a user can input corresponding characters through the character input control, and the character input operation performed by the user through the character input control is the parameter configuration operation. For example, a default learnable parameter configured in advance (for example, 0.5) may be displayed in the parameter configuration interface. After the video to be processed is acquired, a character input control may be displayed in the parameter configuration interface, and the user directly inputs the corresponding characters through the character input control, for example the characters "0", "." and "6", so that a parameter configuration operation is obtained and the default learnable parameter 0.5 is adjusted to 0.6 through the character input operation.
In other examples, the parameter adjustment control included in the parameter configuration interface is a click control (a "+" control and a "-" control) or a slide control. When the parameter adjustment control is the click control, the user may increase the learnable parameter by clicking the "+" control and decrease the learnable parameter by clicking the "-" control; in this case, the obtained parameter configuration operation is a click operation. When the parameter adjustment control is a slide control, the user may decrease the learnable parameter by sliding left or downward and increase the learnable parameter by sliding right or upward; in this case, the obtained parameter configuration operation is a sliding operation.
After the parameter configuration operation is acquired through the parameter configuration interface, the learnable parameters may be generated or acquired based on the parameter configuration operation, and it can be understood that the learnable parameters corresponding to different numbers of video frames are different.
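Purely as an illustration of how a parameter configuration operation obtained from the interface could be mapped to an updated learnable parameter, the following sketch assumes a simple operation descriptor; the field names and the click step size are hypothetical and not taken from the embodiment:

```python
# Illustrative sketch only: mapping a parameter configuration operation from the
# interface to an updated learnable parameter value. The operation descriptor
# fields and the click step size are hypothetical.
def apply_config_operation(current: float, op: dict, step: float = 0.05) -> float:
    if op["type"] == "input":   # character input control, e.g. {"type": "input", "value": "0.6"}
        return float(op["value"])
    if op["type"] == "click":   # "+" / "-" click control
        return current + step if op.get("direction") == "+" else current - step
    if op["type"] == "slide":   # slide control, e.g. {"type": "slide", "delta": -0.1}
        return current + op["delta"]
    return current
```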
Step S1004: and fusing the plurality of video frames based on the learnable parameters respectively corresponding to the plurality of video frames to obtain a fused frame corresponding to the video to be processed.
Step S1005: and displaying the fused frame.
After the fusion frame is acquired, in order to enable a user to intuitively know an image generation effect corresponding to the fusion frame, the fusion frame generated by using learnable parameters may be displayed in a display interface or a parameter configuration interface, and the fusion frame may represent a video to be processed.
It should be noted that when different learnable parameters are obtained by adjusting the learnable parameters in the parameter configuration interface, fusion frames corresponding to the different learnable parameters may be displayed in a preset area of the parameter configuration interface. For example, after the plurality of video frames are acquired, when a user obtains learnable parameters ai corresponding to the video frames through the parameter configuration interface, the fusion frame corresponding to the learnable parameters ai may be displayed in a preset area of the parameter configuration interface, so that the user can directly check the generation effect of the fusion frame through the interface. If the generation effect of the fusion frame does not meet the user requirement or the quality is poor, the user can continue to adjust or configure the learnable parameters through the parameter configuration interface to obtain learnable parameters bi different from the learnable parameters ai; the fusion frame corresponding to the learnable parameters bi is then displayed in the preset area, so that the user can again check the generation effect directly through the interface. If the fusion frame now meets the user requirement, the configuration operation on the learnable parameters can be stopped. In this way, the learnable parameters can be adjusted flexibly and freely through the interactive operation between the user and the parameter configuration interface, the processing quality and effect of the generated fusion frame can be checked immediately through the interface, and the user can visually judge whether the generated fusion frame meets the requirement; if not, the learnable parameters can be adjusted again, and if so, the fusion frame can be directly generated or output.
The method in this embodiment may further include the method in the embodiment shown in fig. 1 to 7, and reference may be made to the related description of the embodiment shown in fig. 1 to 7 for a part not described in detail in this embodiment. The implementation process and technical effect of the technical solution refer to the descriptions in the embodiments shown in fig. 1 to 7, and are not described herein again.
Fig. 11 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present invention; referring to fig. 11, the present embodiment provides a video processing apparatus capable of executing the video processing method shown in fig. 10, and specifically, the video processing apparatus may include:
a second obtaining module 31, configured to obtain multiple video frames of a video to be processed;
a second display module 32, configured to display a parameter configuration interface for processing the plurality of video frames;
a second determining module 33, configured to determine, through the parameter configuration operation obtained through the parameter configuration interface, a learnable parameter corresponding to each of the plurality of video frames;
a second processing module 34, configured to fuse the multiple video frames based on learnable parameters corresponding to the multiple video frames, to obtain a fused frame corresponding to the video to be processed;
and the second display module 32 is further configured to display the fused frame.
The apparatus in this embodiment may also perform the method in the embodiments shown in fig. 1 to 7, and reference may be made to the related description of the embodiments shown in fig. 1 to 7 for a part not described in detail in this embodiment. The implementation process and technical effect of the technical solution refer to the descriptions in the embodiments shown in fig. 1 to fig. 7, which are not described herein again.
In one possible design, the structure of the video processing apparatus shown in fig. 11 may be implemented as an electronic device, which may be a tablet computer, a personal computer PC, a conference room device, a server, or other various devices. As shown in fig. 12, the electronic device may include: a second processor 41 and a second memory 42. Wherein the second memory 42 is used for storing the program of the corresponding electronic device for executing the video processing method in the embodiment shown in fig. 10, and the second processor 41 is configured for executing the program stored in the second memory 42.
The program comprises one or more computer instructions, wherein the one or more computer instructions, when executed by the second processor 41, are capable of performing the steps of: acquiring a plurality of video frames of a video to be processed; displaying a parameter configuration interface for processing the plurality of video frames; determining learnable parameters corresponding to the plurality of video frames through parameter configuration operation obtained by the parameter configuration interface; fusing the plurality of video frames based on the learnable parameters respectively corresponding to the plurality of video frames to obtain a fused frame corresponding to the video to be processed; and displaying the fused frame.
Further, the second processor 41 is also used to execute all or part of the steps in the embodiment shown in fig. 10. The electronic device may further include a second communication interface 44 for communicating with other devices or a communication network.
In addition, an embodiment of the present invention provides a computer storage medium for storing computer software instructions for an electronic device, which includes a program for executing the video processing method in the embodiment shown in fig. 10.
Furthermore, an embodiment of the present invention provides a computer program product, including: the computer program, when executed by a processor of the electronic device, causes the processor to perform the steps of the video processing method shown in fig. 10 described above.
Fig. 13 is a flowchart illustrating a further video processing method according to an embodiment of the present invention. Referring to fig. 13, the present embodiment provides a further video processing method, where an execution subject of the method may be a video processing apparatus, and the video processing apparatus may be implemented as an augmented reality device; that is, the video processing method may be applied to an augmented reality device. The augmented reality device here refers to a device implemented by Extended Reality (XR) technology, where XR technology refers to a human-computer interactive environment combining the real and the virtual, generated by computer technology and wearable devices. XR may include Augmented Reality (AR), Virtual Reality (VR), Mixed Reality (MR), and Cinematic Reality (CR); in other words, XR is a generic term that specifically covers AR, VR, MR, and CR. In short, XR can be divided into multiple levels, ranging from a virtual world with limited sensor input to a fully immersive virtual world.
Specifically, the video processing method in this embodiment may include the following steps:
step S1301: a plurality of video frames of a video to be processed are acquired.
Step S1302: determining learnable parameters corresponding to the plurality of video frames respectively, wherein the learnable parameters are obtained through the video behavior recognition model, and the video behavior recognition model is a machine learning model.
Step S1303: and fusing the plurality of video frames based on the learnable parameters respectively corresponding to the plurality of video frames to obtain a fused frame corresponding to the video to be processed.
Step S1304: rendering the fused frame to a display screen of the augmented reality device.
After the fusion frame is acquired, in order to enable a user to intuitively know an image generation effect corresponding to the fusion frame through the augmented reality device, the fusion frame can be rendered to a display screen of the augmented reality device, and then the fusion frame generated by using the learnable parameters can be displayed in a display interface, wherein the fusion frame can represent a video to be processed.
The specific implementation process and implementation effect of steps S1301 to S1304 in this embodiment are similar to those of steps S201 to S203 in the foregoing embodiment, and specific reference may be made to the above statements, and details are not repeated here.
Fig. 14 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present invention; referring to fig. 14, the present embodiment provides a video processing apparatus that can be implemented as an augmented reality device, that is, the video processing apparatus can be applied to an augmented reality device; the device comprises:
the third obtaining module 51 is configured to obtain a plurality of video frames of a video to be processed.
A third determining module 52, configured to determine learnable parameters corresponding to each of the plurality of video frames, where the learnable parameters are obtained through the video behavior recognition model, and the video behavior recognition model is a machine learning model.
A third processing module 53, configured to fuse the multiple video frames based on the learnable parameters corresponding to the multiple video frames, so as to obtain a fused frame corresponding to the video to be processed.
A third rendering module 54, configured to render the fused frame to a display screen of the augmented reality device.
The apparatus in this embodiment may also perform the method in the embodiment shown in fig. 13, and for a part not described in detail in this embodiment, reference may be made to the relevant description of the embodiment shown in fig. 13. The implementation process and technical effect of the technical solution refer to the description in the embodiment shown in fig. 13, and are not described herein again.
In one possible design, the structure of the video processing apparatus shown in fig. 14 may be implemented as an electronic device, which may be various devices such as an augmented reality device. As shown in fig. 15, the electronic device may include: a third processor 61 and a third memory 62. Wherein the third memory 62 is used for storing the program for executing the video processing method in the embodiment shown in fig. 13, and the third processor 61 is configured for executing the program stored in the third memory 62.
The program comprises one or more computer instructions, wherein the one or more computer instructions, when executed by the third processor 61, are capable of performing the steps of: acquiring a plurality of video frames of a video to be processed; determining learnable parameters corresponding to the plurality of video frames respectively, wherein the learnable parameters are obtained through the video behavior recognition model, and the video behavior recognition model is a machine learning model; fusing the plurality of video frames based on the learnable parameters corresponding to the plurality of video frames respectively to obtain a fused frame corresponding to the video to be processed; rendering the fused frame to a display screen of the augmented reality device.
Further, the third processor 61 is also configured to execute all or part of the steps in the embodiment shown in fig. 13. The electronic device may further include a third communication interface 63 for communicating with other devices or a communication network.
In addition, an embodiment of the present invention provides a computer storage medium for storing computer software instructions for an electronic device, which includes a program for executing the video processing method in the embodiment shown in fig. 13.
Furthermore, an embodiment of the present invention provides a computer program product, including: the computer program, when executed by a processor of an electronic device, causes the processor to perform the steps of the video processing method shown in fig. 13 described above.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort. Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by adding a necessary general hardware platform, and of course, can also be implemented by a combination of hardware and software. With this understanding in mind, the above-described aspects and portions of the present technology which contribute substantially or in part to the prior art may be embodied in the form of a computer program product, which may be embodied on one or more computer-usable storage media having computer-usable program code embodied therein, including without limitation disk storage, CD-ROM, optical storage, and the like.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory. The memory may include forms of volatile memory in a computer readable medium, random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (14)

1. A video processing method, comprising:
acquiring a plurality of video frames of a video to be processed;
determining learnable parameters corresponding to the plurality of video frames respectively, wherein the learnable parameters are obtained through a video behavior recognition model, and the video behavior recognition model is a machine learning model;
and fusing the plurality of video frames based on the learnable parameters respectively corresponding to the plurality of video frames to obtain a fused frame corresponding to the video to be processed.
2. The method according to claim 1, wherein fusing the plurality of video frames based on the learnable parameters corresponding to the plurality of video frames, and obtaining a fused frame corresponding to the video to be processed, comprises:
when the learnable parameters are numerical values which are larger than zero and smaller than 1, carrying out weighted summation on the plurality of video frames based on the learnable parameters respectively corresponding to the plurality of video frames to obtain the fusion frame; or,
when the learnable parameters are numerical values larger than 1, normalizing the learnable parameters corresponding to the plurality of video frames respectively to obtain normalized parameters corresponding to the learnable parameters; and carrying out weighted summation on a plurality of video frames based on the normalization parameters to obtain the fusion frame.
3. The method of claim 1, wherein determining learnable parameters corresponding to each of the plurality of video frames comprises:
acquiring initialization parameters corresponding to the plurality of video frames respectively, wherein the initialization parameters are obtained based on the number of the plurality of video frames;
determining an initial fusion frame corresponding to the video to be processed based on the initialization parameters;
acquiring a first loss function corresponding to the initial fusion frame based on the video behavior recognition model;
and adjusting the initialization parameters based on the first loss function to obtain learnable parameters corresponding to the plurality of video frames respectively.
4. The method of claim 3, wherein obtaining a first loss function corresponding to the initial fused frame based on the video behavior recognition model comprises:
acquiring fusion frame features corresponding to the initial fusion frame, video features corresponding to the to-be-processed video, an initial prediction label corresponding to the initial fusion frame and a standard label corresponding to the to-be-processed video based on the video behavior recognition model;
acquiring a first feature loss function between the fusion frame feature and the video feature and a first label loss function between the initial prediction label and the standard label;
determining a first loss function corresponding to the initial fused frame based on the first feature loss function and the first tag loss function.
5. The method of claim 1, wherein after obtaining the fused frame corresponding to the video to be processed, the method further comprises:
acquiring learnable information corresponding to the video to be processed, wherein the learnable information is used for identifying spatial information and/or time information of the video to be processed;
and fusing the fusion frame and the learnable information to obtain a target fusion frame.
6. The method of claim 5, wherein fusing the fused frame and the learnable information to obtain a target fused frame comprises:
performing pixel-by-pixel summation processing on the learnable information and the fusion frame to obtain the target fusion frame; or,
performing pixel-by-pixel product processing on the learnable information and the fusion frame to obtain the target fusion frame; or,
and splicing the learnable information and the fusion frame to obtain the target fusion frame.
7. The method according to claim 5, wherein obtaining learnable information corresponding to the video to be processed comprises:
acquiring initialization information corresponding to the video to be processed;
fusing the initialization information and the fusion frame to obtain a process fusion frame;
acquiring a second loss function corresponding to the process fusion frame based on the video behavior recognition model;
and adjusting the initialization information based on the second loss function to obtain learnable information corresponding to each of the plurality of video frames.
8. The method of claim 7, wherein obtaining a second loss function corresponding to the process fusion frame based on the video behavior recognition model comprises:
acquiring process frame characteristics corresponding to the process fusion frame, video characteristics corresponding to the to-be-processed video, a frame prediction label corresponding to the process fusion frame and a standard label corresponding to the to-be-processed video based on the video behavior identification model;
obtaining a second feature loss function between the process frame feature and the video feature and a second label loss function between the frame prediction label and the standard label;
determining a second loss function corresponding to the process fused frame based on the second feature loss function and the second tag loss function.
9. The method of claim 7, wherein after obtaining the second penalty function corresponding to the process fusion frame, the method further comprises:
and adjusting the learnable parameters based on the second loss function to obtain target learning parameters corresponding to the learnable parameters.
10. The method of claim 1, wherein determining the learnable parameters corresponding to each of the plurality of video frames comprises:
acquiring initialization parameters corresponding to the video frames and initialization information corresponding to the to-be-processed video, wherein the initialization parameters are acquired based on the number of the video frames, and the initialization information is used for identifying space information and time information corresponding to the to-be-processed video;
determining an initial fusion frame corresponding to the video to be processed based on the initialization parameters;
fusing the initial fusion frame and the initialization information to obtain a process fusion frame;
acquiring a third loss function corresponding to the process fusion frame based on the video behavior recognition model;
and adjusting the initialization parameters based on the third loss function to obtain learnable parameters corresponding to the plurality of video frames respectively.
11. The method of claim 10, wherein after obtaining a third loss function corresponding to the process fusion frame, the method further comprises:
and adjusting the initialization information based on the third loss function to obtain learnable information corresponding to the video to be processed.
12. The method of claim 10, wherein obtaining a third loss function corresponding to the process fusion frame based on the video behavior recognition model comprises:
acquiring a fusion frame feature corresponding to the initial fusion frame, an initial prediction label corresponding to the initial fusion frame, a process frame feature corresponding to the process fusion frame, a video feature corresponding to the video to be processed, a frame prediction label corresponding to the process fusion frame, and a standard label corresponding to the video to be processed based on the video behavior recognition model;
obtaining a first sub-loss function between the fused frame feature and the video feature, a second sub-loss function between the initial prediction label and the standard label, a third sub-loss function between the process frame feature and the video feature, and a fourth sub-loss function between the frame prediction label and the standard label;
determining a third loss function corresponding to the process fusion frame based on the first, second, third, and fourth sub-loss functions.
13. A video processing method, comprising:
acquiring a plurality of video frames of a video to be processed;
displaying a parameter configuration interface for processing the plurality of video frames;
determining learnable parameters corresponding to the plurality of video frames through parameter configuration operation obtained by the parameter configuration interface;
fusing the plurality of video frames based on the learnable parameters respectively corresponding to the plurality of video frames to obtain a fused frame corresponding to the video to be processed;
and displaying the fused frame.
14. A video processing method is applied to an augmented reality device, and the method comprises the following steps:
acquiring a plurality of video frames of a video to be processed;
determining learnable parameters corresponding to the plurality of video frames respectively, wherein the learnable parameters are obtained through a video behavior recognition model, and the video behavior recognition model is a machine learning model;
fusing the plurality of video frames based on the learnable parameters corresponding to the plurality of video frames respectively to obtain a fused frame corresponding to the video to be processed;
rendering the fused frame to a display screen of the augmented reality device.
CN202211099158.XA 2022-09-09 2022-09-09 Video processing method and device Active CN115205763B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211099158.XA CN115205763B (en) 2022-09-09 2022-09-09 Video processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211099158.XA CN115205763B (en) 2022-09-09 2022-09-09 Video processing method and device

Publications (2)

Publication Number Publication Date
CN115205763A true CN115205763A (en) 2022-10-18
CN115205763B CN115205763B (en) 2023-02-17

Family

ID=83572115

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211099158.XA Active CN115205763B (en) 2022-09-09 2022-09-09 Video processing method and device

Country Status (1)

Country Link
CN (1) CN115205763B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101896A (en) * 2018-07-19 2018-12-28 电子科技大学 A kind of video behavior recognition methods based on temporal-spatial fusion feature and attention mechanism
WO2020221278A1 (en) * 2019-04-29 2020-11-05 北京金山云网络技术有限公司 Video classification method and model training method and apparatus thereof, and electronic device
US20220051025A1 (en) * 2019-11-15 2022-02-17 Tencent Technology (Shenzhen) Company Limited Video classification method and apparatus, model training method and apparatus, device, and storage medium
CN111683269A (en) * 2020-06-12 2020-09-18 腾讯科技(深圳)有限公司 Video processing method, video processing device, computer equipment and storage medium
CN114627397A (en) * 2020-12-10 2022-06-14 顺丰科技有限公司 Behavior recognition model construction method and behavior recognition method
CN114067381A (en) * 2021-04-29 2022-02-18 中国科学院信息工程研究所 Deep forgery identification method and device based on multi-feature fusion
CN114283350A (en) * 2021-09-17 2022-04-05 腾讯科技(深圳)有限公司 Visual model training and video processing method, device, equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HAIYU ZHAO et al.: "Spindle Net: Person Re-Identification With Human Body Region Guided Feature Decomposition and Fusion", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) *
ZHOU Yun et al.: "Behavior Recognition Method Based on Two-Stream Non-Local Residual Network", Journal of Computer Applications *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116017010A (en) * 2022-12-01 2023-04-25 凡游在线科技(成都)有限公司 Video-based AR fusion processing method, electronic device and computer readable medium
CN116017010B (en) * 2022-12-01 2024-05-17 凡游在线科技(成都)有限公司 Video-based AR fusion processing method, electronic device and computer readable medium

Also Published As

Publication number Publication date
CN115205763B (en) 2023-02-17

Similar Documents

Publication Publication Date Title
CN108228703B (en) Image question-answering method, device, system and storage medium
CN113994384A (en) Image rendering using machine learning
CN111738243B (en) Method, device and equipment for selecting face image and storage medium
US11521038B2 (en) Electronic apparatus and control method thereof
CN111160569A (en) Application development method and device based on machine learning model and electronic equipment
CN114155543A (en) Neural network training method, document image understanding method, device and equipment
Lytvyn et al. System development for video stream data analyzing
KR101617649B1 (en) Recommendation system and method for video interesting section
WO2022068320A1 (en) Computer automated interactive activity recognition based on keypoint detection
US11417096B2 (en) Video format classification and metadata injection using machine learning
CN112883257B (en) Behavior sequence data processing method and device, electronic equipment and storage medium
US10904476B1 (en) Techniques for up-sampling digital media content
JP2022533690A (en) Movie Success Index Prediction
CN112052759B (en) Living body detection method and device
CN110414335A (en) Video frequency identifying method, device and computer readable storage medium
CN115205763B (en) Video processing method and device
CN112149642A (en) Text image recognition method and device
CN114330499A (en) Method, device, equipment, storage medium and program product for training classification model
US11636282B2 (en) Machine learned historically accurate temporal classification of objects
CN116702835A (en) Neural network reasoning acceleration method, target detection method, device and storage medium
US20200153873A1 (en) Filtering media data in an internet of things (iot) computing environment
CN109960745B (en) Video classification processing method and device, storage medium and electronic equipment
CN113705293A (en) Image scene recognition method, device, equipment and readable storage medium
US20200050898A1 (en) Intelligent personalization of operations of an image capturing device
CN109948426A (en) Application program method of adjustment, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant