CN115205763A - Video processing method and device - Google Patents

Video processing method and device

Info

Publication number
CN115205763A
Authority
CN
China
Prior art keywords
video
frame
loss function
processed
learnable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211099158.XA
Other languages
Chinese (zh)
Other versions
CN115205763B (en)
Inventor
岑俊
裴逸璇
张士伟
吕逸良
赵德丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202211099158.XA priority Critical patent/CN115205763B/en
Publication of CN115205763A publication Critical patent/CN115205763A/en
Application granted granted Critical
Publication of CN115205763B publication Critical patent/CN115205763B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention provides a video processing method and device. The video processing method comprises the following steps: acquiring a plurality of video frames of a video to be processed; determining learnable parameters respectively corresponding to the plurality of video frames, wherein the learnable parameters are obtained through a video behavior recognition model and the video behavior recognition model is a machine learning model; and fusing the plurality of video frames based on the learnable parameters respectively corresponding to the plurality of video frames to obtain a fused frame corresponding to the video to be processed. Because the data volume of the fused frame is smaller than that of the video to be processed, storing the fused frame in place of the video effectively reduces the storage space occupied by data storage, so that a large number of fused frames can be kept in limited storage space; model optimization or model update operations can then be performed based on the stored fused frames, effectively ensuring the diversity of data categories and the sufficiency of data quantity during model updating.

Description

Video processing method and device
Technical Field
The present invention relates to the field of video processing technologies, and in particular, to a video processing method and device.
Background
Video class-incremental learning refers to class-incremental learning that takes video samples as input; its basic task is video behavior recognition, yielding a video behavior recognition model. At present, in a single fine-tuning stage, training data of all classes are typically used to train the action recognition model. Because video data contains more redundant information than image data, storing videos requires a larger storage space. When performing model training, it is impractical to store a large number of training videos for each video category in advance; therefore, owing to the limitation of memory space, the data of all categories is unavailable, or only partially available, in the limited memory. This greatly limits the variety and quantity of model training data and thus easily degrades model training performance.
Disclosure of Invention
The embodiment of the invention provides a video processing method and video processing equipment, which can accurately obtain a fusion frame representing a video to be processed, and the fusion frame has a smaller data volume compared with video data, so that the memory consumption can be effectively reduced, and the training quality and performance of a model can be ensured.
In a first aspect, an embodiment of the present invention provides a video processing method, including:
acquiring a plurality of video frames of a video to be processed;
determining learnable parameters corresponding to the plurality of video frames respectively, wherein the learnable parameters are obtained through the video behavior recognition model, and the video behavior recognition model is a machine learning model;
and fusing the plurality of video frames based on the learnable parameters respectively corresponding to the plurality of video frames to obtain a fused frame corresponding to the video to be processed.
Optionally, after obtaining the fusion frame corresponding to the video to be processed, the method further includes:
acquiring a newly added video sample;
and performing learning training on the video behavior recognition model based on the newly added video samples and the fusion frame.
In a second aspect, an embodiment of the present invention provides a video processing apparatus, including:
the first acquisition module is used for acquiring a plurality of video frames of a video to be processed;
a first determining module, configured to determine learnable parameters corresponding to each of the plurality of video frames, where the learnable parameters are obtained through the video behavior recognition model, and the video behavior recognition model is a machine learning model;
and the first processing module is used for fusing the plurality of video frames based on the learnable parameters respectively corresponding to the plurality of video frames to obtain a fused frame corresponding to the video to be processed.
In a third aspect, an embodiment of the present invention provides an electronic device, including: a memory, a processor; wherein the memory is configured to store one or more computer instructions, wherein the one or more computer instructions, when executed by the processor, implement the video processing method of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer storage medium for storing a computer program, where the computer program is used to make a computer implement the video processing method in the first aspect when executed.
In a fifth aspect, an embodiment of the present invention provides a computer program product, including: computer program, which, when executed by a processor of an electronic device, causes the processor to carry out the steps of the video processing method according to the first aspect.
In a sixth aspect, an embodiment of the present invention provides a video processing method, including:
acquiring a plurality of video frames of a video to be processed;
displaying a parameter configuration interface for processing the plurality of video frames;
determining learnable parameters corresponding to the plurality of video frames through parameter configuration operation obtained by the parameter configuration interface;
fusing the plurality of video frames based on the learnable parameters respectively corresponding to the plurality of video frames to obtain a fused frame corresponding to the video to be processed;
and displaying the fused frame.
In a seventh aspect, an embodiment of the present invention provides a video processing apparatus, including:
the second acquisition module is used for acquiring a plurality of video frames of the video to be processed;
the second display module is used for displaying a parameter configuration interface for processing the plurality of video frames;
a second determining module, configured to determine, through the parameter configuration operation obtained by the parameter configuration interface, a learnable parameter corresponding to each of the plurality of video frames;
the second processing module is used for fusing the plurality of video frames based on the learnable parameters respectively corresponding to the plurality of video frames to obtain fused frames corresponding to the video to be processed;
and the second display module is also used for displaying the fusion frame.
In an eighth aspect, an embodiment of the present invention provides an electronic device, including: a memory, a processor; wherein the memory is configured to store one or more computer instructions, wherein the one or more computer instructions, when executed by the processor, implement the video processing method of the sixth aspect.
In a ninth aspect, an embodiment of the present invention provides a computer storage medium for storing a computer program, where the computer program enables a computer to implement the video processing method in the sixth aspect when executed.
In a tenth aspect, an embodiment of the present invention provides a computer program product, including: a computer program which, when executed by a processor of an electronic device, causes the processor to perform the steps in the video processing method of the sixth aspect described above.
In an eleventh aspect, an embodiment of the present invention provides a video processing method, which is applied to an augmented reality device, and the method includes:
acquiring a plurality of video frames of a video to be processed;
determining learnable parameters corresponding to the plurality of video frames respectively, wherein the learnable parameters are obtained through the video behavior recognition model, and the video behavior recognition model is a machine learning model;
fusing the plurality of video frames based on the learnable parameters corresponding to the plurality of video frames respectively to obtain a fused frame corresponding to the video to be processed;
rendering the fused frame to a display screen of the augmented reality device.
In a twelfth aspect, an embodiment of the present invention provides a video processing apparatus, which is applied to an augmented reality device, where the apparatus includes:
the third acquisition module is used for acquiring a plurality of video frames of the video to be processed;
a third determining module, configured to determine learnable parameters corresponding to each of the multiple video frames, where the learnable parameters are obtained through the video behavior recognition model, and the video behavior recognition model is a machine learning model;
the third processing module is used for fusing the plurality of video frames based on the learnable parameters respectively corresponding to the plurality of video frames to obtain fused frames corresponding to the video to be processed;
and the third rendering module is used for rendering the fusion frame to a display screen of the augmented reality device.
In a thirteenth aspect, an embodiment of the present invention provides an electronic device, including: a memory, a processor; wherein the memory is configured to store one or more computer instructions, wherein the one or more computer instructions, when executed by the processor, implement the video processing method of the eleventh aspect.
In a fourteenth aspect, an embodiment of the present invention provides a computer storage medium for storing a computer program, where the computer program is used to enable a computer to implement the video processing method in the eleventh aspect when executed.
In a fifteenth aspect, an embodiment of the present invention provides a computer program product, including: a computer program, which, when executed by a processor of an electronic device, causes the processor to perform the steps of the video processing method of the eleventh aspect.
According to the technical scheme provided by the embodiment, a plurality of video frames of a video to be processed are obtained; and determining learnable parameters corresponding to the plurality of video frames, and finally fusing the plurality of video frames based on the learnable parameters corresponding to the plurality of video frames to obtain fused frames corresponding to the video to be processed, wherein the data volume of the fused frames is far less than that of the video to be processed, and the fused frames are used for representing the video for storage, so that the storage space required by data storage is effectively reduced.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a schematic view of a scene of a video processing method according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of a video processing method according to an embodiment of the present invention;
fig. 3 is a schematic flow chart of another video processing method according to an embodiment of the present invention;
fig. 4 is a schematic diagram of fusing the fusion frame and the learnable information according to the embodiment of the present invention;
fig. 5 is a schematic flowchart of determining learnable parameters corresponding to the plurality of video frames according to an embodiment of the present invention;
fig. 6 is a flowchart illustrating a video processing method according to another embodiment of the present invention;
FIG. 7 is a schematic diagram of a video processing method according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of an electronic device corresponding to the video processing apparatus provided in the embodiment shown in fig. 8;
fig. 10 is a flowchart illustrating another video processing method according to an embodiment of the present invention;
fig. 11 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present invention;
fig. 12 is a schematic structural diagram of an electronic device corresponding to the video processing apparatus provided in the embodiment shown in fig. 11;
fig. 13 is a flowchart illustrating a further video processing method according to an embodiment of the present invention;
fig. 14 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present invention;
fig. 15 is a schematic structural diagram of an electronic device corresponding to the video processing apparatus provided in the embodiment shown in fig. 14.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments of the present invention. It is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in the embodiments of the present invention and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. The term "and/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
The word "if", as used herein, may be interpreted as "when" or "upon" or "in response to determining" or "in response to detecting", depending on the context. Similarly, the phrases "if it is determined" or "if (a stated condition or event) is detected" may be interpreted as "when it is determined" or "in response to determining" or "when (a stated condition or event) is detected" or "in response to detecting (a stated condition or event)", depending on the context.
It is also noted that the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that an article or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such article or system. Without further limitation, an element preceded by "comprising a ..." does not exclude the presence of additional identical elements in the article or system comprising the element.
Definition of terms:
Lifelong learning: a high-level machine learning paradigm that accumulates knowledge from past tasks through continual learning and uses that knowledge to assist future learning.
Class-incremental learning: new classes arrive continuously, the model needs to correctly classify inputs into the corresponding classes, and there is no overlap between the classes contained in different tasks.
Video class-incremental learning: class-incremental learning that takes video samples as input, with video behavior recognition as the basic task.
Catastrophic forgetting: after the model learns new knowledge, the features and information learned in previous training are almost completely forgotten.
Behavior recognition: analyzing the motion category of a target person in a video; behavior recognition is generally learned from a large amount of labeled training data.
Prompt: an input form or template designed for downstream tasks that helps a large pre-trained model "recall" what it "learned" during pre-training.
Instance-specific Prompt: a prompt template designed for each individual sample and adapted to its image characteristics, used to identify the temporal characteristics and/or spatial characteristics of a video.
In order to facilitate understanding of the specific implementation and effect of the technical solution provided by the present embodiment, the related art is first described below:
At present, in the fine-tuning stage of a model, training data of all categories are typically used to train the action recognition model. Because video data contains more redundant information than image data, storing videos requires a larger storage space, and storing a large number of training video samples for each category is impractical due to privacy concerns or technical limitations. Likewise, it is impractical to store a large number of training videos for each video category in advance when performing model update or model optimization operations; therefore, owing to the limitation of memory space, the data of all categories is unavailable, or only partially available, in the limited memory, which greatly limits the variety and quantity of model update or optimization data and thus easily degrades the performance of model update or optimization.
When the stored training data are used for model update or model optimization, all training data are trained and fine-tuned sequentially in class order. Fine-tuning the model on classes in sequence makes it easy for the model to overfit the training data of the current class, which degrades its performance on other classes and causes catastrophic forgetting.
In order to alleviate the problem of catastrophic forgetting, the related art provides an incremental learning method based on sample preservation. Its implementation principle is mainly as follows: a small set of representative videos is selected for subsequent model update or model optimization operations, and significant performance is then achieved in the image domain by retraining on a portion of past examples. Meanwhile, some existing video incremental learning methods have also shown that the ability to alleviate forgetting can be improved by storing more old samples. However, although the above methods can significantly improve and guarantee the performance of the video action recognition model, they still need to store multiple frames for each representative video, which results in non-negligible memory overhead and thereby limits further application of these technical solutions in practical scenarios.
In order to solve the above technical problems, the present embodiment provides a video processing method and device. The execution subject of the video processing method may be a video processing apparatus. In particular, when the video processing apparatus is implemented as a cloud server, the video processing method may be executed in the cloud. In this case, a plurality of computing nodes (cloud servers) may be deployed in the cloud, and each computing node has processing resources such as computation and storage. In the cloud, a plurality of computing nodes may be organized to provide a service; of course, one computing node may also provide one or more services. The cloud may provide the service by exposing a service interface, and a user calls the service interface to use the corresponding service. The service interface may take the form of a Software Development Kit (SDK), an Application Programming Interface (API), or other forms.
According to the scheme provided by the embodiment of the invention, the cloud end can be provided with a service interface of the video processing service, and the user calls the video processing service interface through the client end/the request end so as to trigger a request for calling the video processing service interface to the cloud end. The cloud determines the compute nodes that respond to the request, and performs the specific processing operations of video processing using the processing resources in the compute nodes.
Specifically, referring to fig. 1, the client/request end may be any computing device with a certain data transmission capability; for example, the client/request end may be a mobile phone, a personal computer (PC), a tablet computer, a device with a corresponding application program, and the like. In addition, the basic structure of the client may include: at least one processor. The number of processors depends on the configuration and type of the client. The client may also include a memory, which may be volatile, such as RAM, or non-volatile, such as Read-Only Memory (ROM) or flash memory, or may include both types. The memory typically stores an Operating System (OS) and one or more application programs, and may also store program data and the like. In addition to the processing unit and the memory, the client includes some basic configurations, such as a network card chip, an IO bus, a display component, and some peripheral devices. Optionally, the peripheral devices may include, for example, a keyboard, a mouse, a stylus, a printer, and the like. Other peripheral devices are well known in the art and are not described in detail herein.
The video processing apparatus refers to a device that can provide video processing services in a network virtual environment, and generally refers to an apparatus that performs information planning and video processing operations using a network. In physical implementation, the video processing apparatus may be any device capable of providing a computing service, responding to a video processing request, and performing a video processing service based on the video processing request, for example: a cluster server, a regular server, a cloud host, a virtual center, and the like. The video processing apparatus mainly comprises a processor, a hard disk, a memory, a system bus, and the like, and is similar to a general computer architecture.
In the above embodiment, the client/request end may have a network connection with the video processing apparatus, and the network connection may be a wireless or wired network connection. If the client/request end is communicatively connected to the video processing apparatus through a mobile network, the network format of the mobile network may be any one of 2G (GSM), 2.5G (GPRS), 3G (WCDMA, TD-SCDMA, CDMA2000, UMTS), 4G (LTE), 4G+ (LTE+), WiMax, 5G, and 6G.
In the embodiment of the application, a client/request terminal can acquire videos to be processed, wherein the videos to be processed can be used for training a video behavior recognition model, and specifically, the number of the videos to be processed can be one or more; specifically, the specific implementation manner of the request end for acquiring the video to be processed is not limited in this embodiment, and in some examples, the video to be processed may be stored in a preset area of the request end, and the video to be processed may be acquired by accessing the preset area. Or the video to be processed may be stored in a third device, and the third device is in communication connection with the request terminal, and the video to be processed is actively or passively acquired through the third device. After the video to be processed is acquired, the video to be processed can be sent to the video processing device, so that the video processing device can perform video processing operation on the video to be processed, specifically, the video to be processed can be compressed, and memory consumption for storing the video to be processed can be reduced.
The video processing device is used for acquiring a video to be processed, then sampling the video to be processed to obtain a plurality of video frames, wherein the plurality of video frames can be used as representative frames of the video to be processed, and in order to accurately perform fusion operation on the plurality of video frames, learnable parameters corresponding to the plurality of video frames can be determined, wherein the learnable parameters can be obtained through a video behavior recognition model; and then, the plurality of video frames can be fused based on the learnable parameters respectively corresponding to the plurality of video frames, so that a fused frame corresponding to the video to be processed can be obtained.
After a fusion frame used for representing a video to be processed is obtained, the fusion frame can represent related information included in the video to be processed, so that the fusion frame can represent the video to be processed for storage, and since the data volume of the fusion frame is smaller or far smaller than that of the video to be processed, the storage space occupied by data storage is reduced.
Some embodiments of the invention are described in detail below with reference to the accompanying drawings. The features of the embodiments and examples described below can be combined with or separated from each other without conflict between the embodiments. In addition, the sequence of steps in the embodiments of the methods described below is merely an example, and is not strictly limited.
Fig. 2 is a schematic flowchart of a video processing method according to an embodiment of the present invention; referring to fig. 2, the embodiment provides a video processing method, where an execution subject of the method may be a video processing apparatus, the video processing apparatus may be implemented as software, or a combination of software and hardware, and specifically, when the video processing apparatus is implemented as hardware, it may be embodied as various electronic devices having data processing operations, including but not limited to a tablet computer, a personal computer PC, a server, and the like. When the video processing apparatus is implemented as software, it can be installed in the electronic devices exemplified above. Based on the video processing apparatus, the video processing method in this embodiment may include the following steps:
step S201: a plurality of video frames of a video to be processed are acquired.
Step S202: determining learnable parameters corresponding to the video frames respectively, wherein the learnable parameters are obtained through a video behavior recognition model, and the video behavior recognition model is a machine learning model.
Step S203: and fusing the plurality of video frames based on the learnable parameters respectively corresponding to the plurality of video frames to obtain a fused frame corresponding to the video to be processed.
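As a quick orientation before the detailed description, the following is a minimal, non-authoritative sketch of the three steps in plain NumPy; the helper names (sample_frames, fuse_frames), the uniform-sampling choice, and the weighted-sum fusion are illustrative assumptions rather than the required implementation of this embodiment.

```python
import numpy as np

def sample_frames(video, num_frames):
    """Step S201 (one option): uniformly sample num_frames frames from a (T, H, W, C) video."""
    idx = np.linspace(0, len(video) - 1, num_frames).astype(int)
    return video[idx]

def fuse_frames(frames, weights):
    """Step S203: weighted sum of the sampled frames with per-frame parameters."""
    w = np.asarray(weights, dtype=np.float32)
    w = w / w.sum()                                   # keep the weights summing to 1
    return np.tensordot(w, frames.astype(np.float32), axes=1)

video = np.random.randint(0, 256, size=(120, 224, 224, 3), dtype=np.uint8)  # toy stand-in video
frames = sample_frames(video, num_frames=8)           # step S201
weights = np.full(8, 1.0 / 8)                         # step S202: initialization; later refined via the model
fused_frame = fuse_frames(frames, weights)            # step S203
print(fused_frame.shape)                              # (224, 224, 3): same size and channels as one frame
```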
The following is a detailed description of specific implementation processes and implementation effects of the above steps:
step S201: a plurality of video frames of a video to be processed are obtained.
The video to be processed may refer to a video that needs to undergo a video processing operation (e.g., a video compression operation), and the video to be processed may be used for a training operation, an update operation, or an optimization operation of a video behavior recognition model; that is, the video to be processed may serve as training data of the video behavior recognition model, and the video behavior recognition model can recognize behavior characteristics or behavior information in the video to be processed.
In addition, the plurality of video frames may be stored in a preset area or in a third device; in this case, the plurality of video frames of the video to be processed may be acquired by accessing the preset area or the third device. Alternatively, obtaining a plurality of video frames of the video to be processed may include: acquiring the video to be processed, and sampling the video to be processed to obtain the plurality of video frames. Specifically, the number of videos to be processed may be one or more, and this embodiment does not limit the manner of obtaining the video to be processed. In some examples, the video to be processed may be stored in a preset area in the video processing apparatus, and the video to be processed may be obtained by accessing the preset area; or the video to be processed may be stored in a third device that is communicatively connected to the video processing apparatus, and the video to be processed is actively or passively acquired from the third device. In still other examples, the video processing apparatus may be communicatively connected with a live broadcast terminal; the live broadcast terminal may generate a live video and send it to the video processing apparatus, so that the video processing apparatus obtains a to-be-processed live video, effectively enabling processing of live video. Similarly, the video processing apparatus may be communicatively connected with a conference terminal; the conference terminal may generate a conference video and send it to the video processing apparatus, so that the video processing apparatus obtains a to-be-processed conference video, effectively enabling processing of conference video.
In other examples, the video to be processed may be a part of a plurality of sample videos used for performing a training operation on the video behavior recognition model, and in this case, acquiring the video to be processed may include: acquiring an original video set, wherein the original video set comprises a plurality of sample videos used for training a video behavior recognition model; determining a video category corresponding to each sample video in an original video set; and determining one or more to-be-processed videos in the original video set based on the video categories, wherein at least one to-be-processed video corresponds to each video category.
For the video behavior recognition model, an original video set corresponding to the video behavior recognition model is configured in advance, and the original video set comprises a plurality of sample videos used for training the video behavior recognition model. Because the original video set comprises a plurality of sample videos, different sample videos can correspond to different video categories, and specifically, the video categories can comprise live videos, conference videos, fun videos, cate videos, entertainment videos, life videos, information videos, knowledge videos, game videos, favorite videos, sports videos, cartoon videos, science and technology videos, health videos and the like. In addition, because the focused information of the sample videos of different video types is different from the content to be expressed, in order to ensure accurate and comprehensive characterization of the original video set, videos to be processed may be screened from the original video set based on the video category, at this time, after the original video set is obtained, all sample videos in the original video set may be analyzed and processed to determine the video category corresponding to each sample video in the original video set, in some examples, the sample videos of different video categories may correspond to different identification information, and at this time, the video category corresponding to each sample video may be determined based on the identification information. In still other examples, sample videos of different video categories may correspond to different video features, and at this time, after the original video set is obtained, feature extraction may be performed on each sample video in the original video set to obtain video features, and the video category corresponding to each sample video is determined based on the video features.
After determining the video categories corresponding to the videos in the original video set, one or more videos to be processed may be determined in the original video set based on the video categories, specifically, at least one video to be processed corresponding to each video category. For example, the video categories corresponding to the sample videos in the original video set include: when the videos of the life category, the videos of the travel category, the videos of the knowledge category and the videos of the conference category are taken as the videos, one or more videos to be processed can be determined in the original video set based on the video categories, and specifically, the videos to be processed of at least one life category, the videos to be processed of at least one travel category, the videos to be processed of at least one knowledge category and the videos to be processed of at least one conference category can be obtained, so that the videos to be processed of the whole category can be accurately obtained.
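To make the per-category selection above concrete, here is a brief sketch; the parallel-list input format and the random choice within each category are assumptions for illustration only.

```python
import random
from collections import defaultdict

def select_videos_per_category(sample_videos, categories, per_category=1):
    """Pick at least `per_category` to-be-processed video(s) for every video category.
    `sample_videos` and `categories` are parallel lists (an assumed input format)."""
    by_category = defaultdict(list)
    for video, category in zip(sample_videos, categories):
        by_category[category].append(video)
    selected = []
    for category, videos in by_category.items():
        # random choice within a category is illustrative; any selection rule could be used
        selected.extend(random.sample(videos, min(per_category, len(videos))))
    return selected

# usage: an original set containing life, travel, knowledge and conference videos
videos = ["v1", "v2", "v3", "v4", "v5"]
labels = ["life", "travel", "knowledge", "conference", "life"]
print(select_videos_per_category(videos, labels))     # at least one video per category
```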
In order to implement the processing operation on the video to be processed, after the video to be processed is acquired, sampling processing may be performed on the video to be processed to obtain a plurality of video frames. In some examples, sampling the video to be processed to obtain the plurality of video frames may include: randomly sampling the video to be processed to obtain a plurality of video frames, where the plurality of video frames are a plurality of random frames in the video to be processed. Alternatively, sampling the video to be processed to obtain a plurality of video frames may include: uniformly sampling the video to be processed to obtain a plurality of video frames. Alternatively, sampling the video to be processed to obtain a plurality of video frames may include: sampling the video to be processed at intervals to obtain a plurality of video frames. Alternatively, sampling the video to be processed to obtain a plurality of video frames may include: acquiring motion distribution information corresponding to the video to be processed, and sampling the video to be processed based on the motion distribution information to obtain a plurality of video frames.
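The four sampling strategies mentioned above (random, uniform, interval, and motion-distribution-based) might be sketched as follows; the frame-difference heuristic in the motion-based variant is an assumption, since the embodiment does not specify how the motion distribution information is computed.

```python
import numpy as np

def sample_random(video, n):
    """Random sampling: n random frames from a (T, H, W, C) video."""
    idx = np.sort(np.random.choice(len(video), size=n, replace=False))
    return video[idx]

def sample_uniform(video, n):
    """Uniform sampling: n evenly spaced frames."""
    idx = np.linspace(0, len(video) - 1, n).astype(int)
    return video[idx]

def sample_interval(video, n, interval=4):
    """Interval sampling: one frame every `interval` frames, truncated to n frames."""
    idx = np.arange(0, len(video), interval)[:n]
    return video[idx]

def sample_by_motion(video, n):
    """Motion-based sampling: keep the frames with the largest inter-frame differences
    (a simple stand-in for 'motion distribution information')."""
    diffs = np.abs(np.diff(video.astype(np.float32), axis=0)).mean(axis=(1, 2, 3))
    idx = np.sort(np.argsort(diffs)[-n:] + 1)
    return video[idx]
```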
Step S202: determining learnable parameters corresponding to the video frames respectively, wherein the learnable parameters are obtained through a video behavior recognition model, and the video behavior recognition model is a machine learning model.
After the plurality of video frames are acquired, different video frames can express different information of the video to be processed, so that in order to accurately perform fusion processing operation on the plurality of video frames, learnable parameters corresponding to the plurality of video frames can be determined, and different video frames can correspond to the same or different learnable parameters, wherein the learnable parameters can be obtained through a video behavior recognition model.
In addition, the specific implementation manner of the learnable parameters corresponding to each of the plurality of video frames is not limited in this embodiment, in some examples, the learnable parameters may be parameters that are configured manually by a user based on a video behavior recognition model, or the learnable parameters may be parameters that are configured automatically based on the video behavior recognition model, it should be noted that the video behavior recognition model may be a machine learning model or a neural network model that is trained in advance or configured in advance and is used for performing behavior recognition on a video, and the video behavior recognition model may be configured on any electronic device with an image processing capability to implement the recognition operation of a video behavior. In other examples, the learnable parameter may be obtained by adjusting the initialization parameter based on the video behavior recognition model, and determining the learnable parameter corresponding to each of the plurality of video frames may include: acquiring initialization parameters corresponding to the plurality of video frames respectively, wherein the initialization parameters are acquired based on the number of the plurality of video frames; determining an initial fusion frame corresponding to the video to be processed based on the initialization parameters; acquiring a first loss function corresponding to the initial fusion frame based on the video behavior recognition model; and adjusting the initialization parameters based on the first loss function to obtain learnable parameters corresponding to the plurality of video frames respectively.
After a plurality of video frames are acquired, initialization parameters corresponding to the plurality of video frames may be automatically acquired or determined, where the initialization parameters may be determined based on the number of the video frames, and in some examples, the initialization parameters corresponding to different video frames are the same, that is, in an initial state, it may be default that all the video frames have the same influence on the video to be processed, and at this time, when the number of the video frames is N, then the initialization parameters may be 1/N. In other examples, the initialization parameters corresponding to different video frames may be different, and at this time, the initialization parameters corresponding to each of the plurality of video frames may be determined based on the frame characteristics of the video frame.
Because the initialization parameters are set in the initial state and are used to represent the initial degree of influence of the video frames on the video to be processed, in order to better represent the video to be processed and obtain a more accurate fused frame, the initialization parameters can be optimized and adjusted based on the video behavior recognition model. In this case, after the initialization parameters are obtained, an initial fusion frame corresponding to the video to be processed can be determined based on the initialization parameters. Specifically, the initial fusion frame may be a fusion frame obtained by weighting and summing the plurality of video frames using the initialization parameters; alternatively, the initial fusion frame may be a fusion frame obtained by summing the products of the plurality of video frames and the initialization parameters; alternatively, the initial fusion frame may be a fusion frame obtained by splicing the video frames based on the initialization parameters.
After the initial fusion frame is obtained, feature extraction and behavior recognition processing operations can be performed on the initial fusion frame by using a video behavior recognition model, a first loss function corresponding to the initial fusion frame can be obtained based on a feature extraction result and a behavior recognition result, and then the initialization parameter can be adjusted by using the first loss function as a constraint condition to obtain learnable parameters corresponding to a plurality of video frames, wherein the learnable parameters are determined parameters which are matched with the plurality of video frames in a comparison manner.
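As one possible reading of this optimization, the sketch below adjusts the initialization parameters (1/N per frame) by minimizing a first loss through gradient descent; the PyTorch formulation, the softmax reparameterization, and the feature_extractor/classifier interfaces are assumptions not taken from the embodiment itself.

```python
import torch
import torch.nn.functional as F

def optimize_learnable_parameters(frames, video_feature, label,
                                  feature_extractor, classifier,
                                  steps=100, lr=0.01):
    """Adjust per-frame weights so the fused frame keeps the original video's feature
    and classification. `frames` is an (N, C, H, W) tensor; `feature_extractor` and
    `classifier` stand for the corresponding parts of the video behavior recognition
    model; `video_feature` and `label` come from the original video to be processed."""
    n = frames.shape[0]
    alpha = torch.nn.Parameter(torch.full((n,), 1.0 / n))   # initialization parameters: 1/N per frame
    optimizer = torch.optim.Adam([alpha], lr=lr)
    for _ in range(steps):
        w = torch.softmax(alpha, dim=0)                      # keep weights positive and summing to 1
        fused = (w[:, None, None, None] * frames).sum(dim=0, keepdim=True)  # initial fused frame
        fused_feature = feature_extractor(fused)
        logits = classifier(fused_feature)
        loss_feat = F.mse_loss(fused_feature, video_feature)  # first feature loss (Euclidean)
        loss_cls = F.cross_entropy(logits, label)              # first label loss (cross-entropy)
        loss = loss_feat + loss_cls                             # first loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return torch.softmax(alpha, dim=0).detach()                # learnable parameters
```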
For the first loss function, this embodiment may obtain different first loss functions in different manners, and in a first implementation manner, obtaining, based on the video behavior recognition model, the first loss function corresponding to the initial fusion frame may include: acquiring fusion frame characteristics corresponding to the initial fusion frame, video characteristics corresponding to the video to be processed, an initial prediction label corresponding to the initial fusion frame and a standard label corresponding to the video to be processed based on the video behavior recognition model; acquiring a first feature loss function between the fusion frame feature and the video feature and a first label loss function between an initial prediction label and a standard label; a first loss function corresponding to the initial fused frame is determined based on the first feature loss function and the first tag loss function.
For example, the video behavior recognition model may be represented as $f_{\theta_k}$, where $g_{\phi_k}$ is the feature extractor of the video behavior recognition model $f_{\theta_k}$, the initial fused frame is represented as $\hat{I}$, and the video to be processed is represented as $V$. Here, $\theta_k$ denotes the current parameters of the video behavior recognition model after the $k$-th incremental phase, and $\phi_k$ denotes the current parameters of the feature extractor after the $k$-th incremental phase.
After the initial fused frame $\hat{I}$ is obtained, the fused-frame feature $g_{\phi_k}(\hat{I})$ may be obtained through the feature extractor; similarly, after the video $V$ to be processed is obtained, the video feature $g_{\phi_k}(V)$ may also be obtained through the feature extractor.
For the fused-frame feature and the video feature, since the goal is a fused frame that accurately characterizes the video to be processed, that is, a fused frame whose expressive ability is the same as, or very close to, that of the original video, the embedded feature of the fused frame extracted by the current model should be consistent, or as consistent as possible, with the video feature of the original video. A first feature loss function $\mathcal{L}_{feat}$ between the fused-frame feature and the video feature can therefore be obtained. When the first feature loss function is characterized by the Euclidean distance, $\mathcal{L}_{feat} = \lVert g_{\phi_k}(\hat{I}) - g_{\phi_k}(V) \rVert_2^2$. Of course, the first feature loss function may also be expressed in other ways, for example by cosine similarity, Mahalanobis distance, Manhattan distance, Pearson correlation coefficient, and so on; different ways correspond to different expression formulas, which are not described again here.
To further improve the adaptability of the initial fused frame to the video behavior recognition model, the classification confidence of the initial fused frame can be supervised with a cross-entropy loss. In this case, an initial prediction label corresponding to the initial fused frame can be obtained through the video behavior recognition model, and the initial prediction label can be expressed as $\hat{y}$. The standard label corresponding to the video to be processed, expressed as $y$, is then obtained, and a first label loss function $\mathcal{L}_{cls}$ between the initial prediction label $\hat{y}$ and the standard label $y$ is obtained. When the first label loss function $\mathcal{L}_{cls}$ is characterized by a cross-entropy loss function, it can be expressed as $\mathcal{L}_{cls} = \mathrm{CE}(\hat{y}, y)$. Of course, the first label loss function may also be expressed in other ways, for example by a logarithmic loss function, an exponential loss function, and so on; different ways correspond to different expression formulas, which are not described again here.
After the first feature loss function $\mathcal{L}_{feat}$ and the first label loss function $\mathcal{L}_{cls}$ are obtained, the first loss function $\mathcal{L}_1$ corresponding to the initial fused frame can be determined from the first feature loss function $\mathcal{L}_{feat}$ and the first label loss function $\mathcal{L}_{cls}$. In some examples, $\mathcal{L}_1 = \mathcal{L}_{feat} + \mathcal{L}_{cls}$, that is, the first loss function is the sum of the first feature loss function and the first label loss function. In still other examples, $\mathcal{L}_1 = \lambda_{feat}\,\mathcal{L}_{feat} + \lambda_{cls}\,\mathcal{L}_{cls}$, that is, the first loss function is a weighted sum of the first feature loss function and the first label loss function, where $\lambda_{feat}$ is the weight corresponding to the first feature loss function and $\lambda_{cls}$ is the weight corresponding to the first label loss function.
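A compact sketch of the two combinations just described (plain sum and weighted sum) follows; the weight values, the Euclidean/cross-entropy choices, and the tensor shapes are illustrative assumptions.

```python
import torch.nn.functional as F

def first_loss(fused_feature, video_feature, logits, label, w_feat=1.0, w_cls=1.0):
    """First loss for the initial fused frame: feature loss plus (optionally weighted) label loss.
    Setting w_feat = w_cls = 1 gives the plain sum; other values give the weighted sum."""
    loss_feat = F.mse_loss(fused_feature, video_feature)    # first feature loss, Euclidean distance
    loss_cls = F.cross_entropy(logits, label)                # first label loss, cross-entropy
    return w_feat * loss_feat + w_cls * loss_cls
```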
In a second implementation manner, based on the video behavior recognition model, obtaining the first loss function corresponding to the initial fusion frame may include: acquiring fusion frame characteristics corresponding to the initial fusion frame and video characteristics corresponding to the video to be processed based on the video behavior recognition model; acquiring a first characteristic loss function between the fusion frame characteristic and the video characteristic; based on the first feature loss function, a first loss function corresponding to the initial fused frame is determined.
Different from the above implementation manner, the first loss function $\mathcal{L}_1$ in this implementation only needs to be obtained through the first feature loss function $\mathcal{L}_{feat}$; therefore, there is no need to acquire the first label loss function $\mathcal{L}_{cls}$. In this case, the first feature loss function may be directly determined as the first loss function, that is, $\mathcal{L}_1 = \mathcal{L}_{feat}$.
In a third implementation manner, based on the video behavior recognition model, obtaining the first loss function corresponding to the initial fusion frame may include: acquiring an initial prediction label corresponding to the initial fusion frame and a standard label corresponding to a video to be processed based on the video behavior recognition model; obtaining a first label loss function between an initial prediction label and a standard label; based on the first tag loss function, a first loss function corresponding to the initial fused frame is determined.
Different from the above implementation manner, the first loss function $\mathcal{L}_1$ in this implementation only needs to be obtained through the first label loss function $\mathcal{L}_{cls}$; therefore, there is no need to acquire the first feature loss function $\mathcal{L}_{feat}$. Specifically, the first label loss function may be directly determined as the first loss function, that is, $\mathcal{L}_1 = \mathcal{L}_{cls}$.
Step S203: and fusing the plurality of video frames based on the learnable parameters respectively corresponding to the plurality of video frames to obtain a fused frame corresponding to the video to be processed.
After the learnable parameters corresponding to the multiple video frames are obtained, the multiple video frames can be fused based on the learnable parameters corresponding to the multiple video frames, and a fusion frame corresponding to the video to be processed is obtained, wherein for the fusion frame, in order to enable the fusion frame to accurately express the video to be processed, the number of image channels of the obtained fusion frame is the same as that of the video frames, and the size of the fusion frame is the same as that of the video frames.
In addition, the specific implementation manner of fusing the video frames is not limited in this embodiment, and since the learnable parameters may correspond to different numerical ranges, and the learnable parameters of different numerical ranges may correspond to different implementation manners, in some examples, fusing the multiple video frames based on the learnable parameters corresponding to the multiple video frames, respectively, and obtaining the fused frame corresponding to the video to be processed may include: when the learnable parameters are numerical values which are larger than zero and smaller than 1, carrying out weighted summation on the plurality of video frames based on the learnable parameters respectively corresponding to the plurality of video frames to obtain a fusion frame; or when the learnable parameters are numerical values larger than 1, normalizing the learnable parameters corresponding to the video frames respectively to obtain normalized parameters corresponding to the learnable parameters; and carrying out weighted summation on the plurality of video frames based on the normalization parameters to obtain a fusion frame.
Specifically, when the learnable parameter is a value greater than zero and less than 1, that is, in the process of video processing, the learnable parameter is always in a range greater than zero and less than 1, the learnable parameters can directly reflect the degree of influence of the plurality of video frames on the video to be processed, so the video frames can be weighted and summed based on the learnable parameters corresponding to the plurality of video frames to obtain the fused frame. For example, the plurality of video frames may be $I_1, I_2, \dots, I_N$, and the learnable parameters corresponding to the plurality of video frames may be $\alpha_1, \alpha_2, \dots, \alpha_N$; the fused frame can then be obtained as $\hat{I} = \sum_{i=1}^{N} \alpha_i I_i$.
In addition, when the learnable parameter may be a value greater than 1 (that is, during video processing the learnable parameter may fall either in the range between zero and 1 or in the range above 1), the learnable parameter cannot directly reflect the degree of influence of the plurality of video frames on the video to be processed. In this case, normalization processing may be performed on the learnable parameters respectively corresponding to the plurality of video frames to obtain normalization parameters, which are values greater than zero and less than 1 and can therefore directly reflect the degree of influence of the plurality of video frames on the video to be processed; the video frames can then be weighted and summed based on the normalization parameters to obtain the fused frame. For example, if the plurality of video frames are $\{x_1, x_2, \dots, x_T\}$ and the learnable parameters corresponding to the plurality of video frames are $\{a_1, a_2, \dots, a_T\}$, then the fused frame can be obtained as $I = \sum_{t=1}^{T} \tilde{a}_t x_t$, where $\tilde{a}_t$ (for example $\tilde{a}_t = a_t / \sum_{j=1}^{T} a_j$) are the normalization parameters corresponding to each of the plurality of video frames.
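The following is a minimal Python (PyTorch) sketch of the two fusion branches described above; the tensor shapes, the frame count of 8 and the choice of dividing by the sum for normalization are assumptions made for illustration rather than details fixed by this embodiment.

import torch

def fuse_frames(frames: torch.Tensor, learnable_params: torch.Tensor) -> torch.Tensor:
    # frames: (T, C, H, W); learnable_params: (T,)
    if torch.all((learnable_params > 0) & (learnable_params < 1)):
        # Branch 1: the parameters already reflect each frame's influence directly.
        weights = learnable_params
    else:
        # Branch 2: parameters may exceed 1, so normalize them into (0, 1) first
        # (here by dividing by their sum, which is only one possible normalization).
        weights = learnable_params / learnable_params.sum()
    # Weighted summation over the frame dimension; the fused frame keeps the same
    # channel count and spatial size as a single video frame.
    return (weights.view(-1, 1, 1, 1) * frames).sum(dim=0)

frames = torch.rand(8, 3, 224, 224)          # 8 sampled frames of a 3-channel 224x224 video
params = torch.full((8,), 1.0 / 8)           # equal initial influence for every frame
fused = fuse_frames(frames, params)          # shape (3, 224, 224)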
After the fusion frame corresponding to the video to be processed is obtained, in order to facilitate processing the video to be processed based on the fusion frame, the method in this embodiment may further include: storing the fusion frame in place of the video to be processed. That is, the fusion frame can be stored in a preset area in the video processing device; when the video to be processed needs to be called or used, the fusion frame is obtained by accessing the preset area and is called or used instead of the video to be processed, which effectively reduces the memory space needed for data storage.
For example, device B (a cloud server, a cloud database, or the like) may store fusion frames that represent a plurality of pieces of video data, while an application program for implementing a model optimization operation may be configured on device A (a user side). When the stored fusion frames are required for the optimization operation of the video behavior recognition model, device A may establish a communication connection with device B, acquire the plurality of fusion frames stored in device B by accessing device B, and then optimize or update the video behavior recognition model based on the plurality of fusion frames and other video data.
In the video processing method provided by this embodiment, a plurality of video frames of a video to be processed are acquired, the learnable parameters corresponding to the video frames are determined, and finally the video frames are fused based on these learnable parameters to obtain a fused frame corresponding to the video to be processed. The data volume of the fused frame is far smaller than that of the video to be processed, so using the fused frame to represent the video for storage effectively reduces the storage space required for data storage.
Fig. 3 is a schematic flow chart of another video processing method according to an embodiment of the present invention; fig. 4 is a schematic diagram of fusing a fusion frame and learnable information according to an embodiment of the present invention. On the basis of the foregoing embodiment, as shown in fig. 3 to 4, when a video to be processed is acquired and expressed by a fusion frame, the temporal information and spatial information of the video are inevitably lost to some extent. In order to supplement the lost temporal and spatial information, after obtaining the fusion frame corresponding to the video to be processed, this embodiment further provides an implementation scheme for processing the fusion frame with different prompt information. Specifically, the method in this embodiment may include:
Step S301: acquiring learnable information corresponding to the video to be processed, wherein the learnable information is used for identifying the spatial information and/or the temporal information of the video to be processed.
After the video to be processed is acquired, the video to be processed may be analyzed to acquire learnable information corresponding to the video to be processed, where the learnable information is used to identify the spatial information and/or temporal information of the video to be processed, and the spatial resolution of the learnable information is the same as the spatial resolution of the video to be processed. In some examples, the learnable information may be information configured by a user based on the video to be processed, or the learnable information may be information obtained by processing the video to be processed through a machine learning model trained in advance.
In other examples, the learnable information may be information obtained after performing optimization adjustment on the initialization information based on the video behavior recognition model, and at this time, obtaining the learnable information corresponding to the video to be processed may include: acquiring initialization information corresponding to a video to be processed; fusing the initialization information and the fusion frame to obtain a process fusion frame; acquiring a second loss function corresponding to the process fusion frame based on the video behavior recognition model; and adjusting the initialization information based on a second loss function to obtain learnable information corresponding to each of the plurality of video frames.
After the video to be processed is acquired, the initialization information corresponding to the video to be processed may be automatically acquired or determined, where the initialization information may be determined based on the video to be processed, and in some examples, the initialization information corresponding to different videos to be processed may be the same value, and at this time, the initialization information may be 0; alternatively, the initialization information corresponding to different videos to be processed may be different values.
After the initialization information is obtained, the initialization information and the fusion frame may be fused to obtain a process fusion frame. Fusing the initialization information and the fusion frame may include: performing pixel-by-pixel summation on the initialization information and the fusion frame to obtain the process fusion frame; or performing pixel-by-pixel product on the initialization information and the fusion frame to obtain the process fusion frame; or splicing the initialization information and the fusion frame to obtain the process fusion frame. Specifically, when the initialization information and the fusion frame are spliced, in order to enable the process fusion frame to accurately represent the video to be processed, the fusion frame may be used as a central region and the initialization information as a peripheral edge region surrounding it; after splicing, the number of channels of the process fusion frame is the same as the number of channels of a video frame, while the height and width of the process fusion frame are greater than the height and width of a video frame. A sketch of these fusion options is given below. In this way, the process fusion frame is obtained accurately and reliably.
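As a concrete illustration of the three fusion options above, the following Python (PyTorch) sketch combines the initialization information with the fusion frame; the border width used for splicing and the tensor shapes are assumptions made for this example only.

import torch

def combine(fused_frame: torch.Tensor, info: torch.Tensor, mode: str = "sum", pad: int = 16) -> torch.Tensor:
    # fused_frame: (C, H, W). For "sum"/"product" info has the same shape;
    # for "splice" info is assumed to already have shape (C, H + 2*pad, W + 2*pad).
    if mode == "sum":
        return fused_frame + info            # pixel-by-pixel summation
    if mode == "product":
        return fused_frame * info            # pixel-by-pixel product
    if mode == "splice":
        # The fusion frame becomes the central region and the information forms the
        # peripheral edge region, so the result keeps the channel count of a video
        # frame but is taller and wider than a video frame.
        out = info.clone()
        out[:, pad:-pad, pad:-pad] = fused_frame
        return out
    raise ValueError(f"unknown fusion mode: {mode}")

fused_frame = torch.rand(3, 224, 224)
info = torch.zeros(3, 224 + 32, 224 + 32)    # all-zero initialization information for splicing
process_frame = combine(fused_frame, info, mode="splice")   # shape (3, 256, 256)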
After the process fusion frame is acquired, in order to accurately acquire learnable information corresponding to each of the plurality of video frames, feature extraction and behavior recognition processing operations may be performed on the process fusion frame based on the video behavior recognition model, and a second loss function corresponding to the process fusion frame may be acquired based on a feature extraction result and a behavior recognition result; after the second loss function is obtained, the initialization information can be adjusted based on the second loss function as a constraint condition to obtain learnable information corresponding to each of the plurality of video frames, so that the accuracy and reliability of obtaining the learnable information are effectively ensured.
For the second loss function, different second loss functions may be obtained in different manners in this embodiment, and in the first implementation manner, obtaining the second loss function corresponding to the process fusion frame based on the video behavior recognition model may include: acquiring process frame characteristics corresponding to the process fusion frame, video characteristics corresponding to the video to be processed, a frame prediction label corresponding to the process fusion frame and a standard label corresponding to the video to be processed based on the video behavior recognition model; acquiring a second characteristic loss function between the process frame characteristic and the video characteristic and a second label loss function between the frame prediction label and the standard label; a second loss function corresponding to the process fused frame is determined based on the second feature loss function and the second tag loss function.
For example, the video behavior recognition model may be represented as $f_{\theta_k}$ and its feature extractor as $g_{\phi_k}$; the process fusion frame is represented as $\hat{I}$ and the video to be processed as $V$, where $\theta_k$ denotes the current parameters of the video behavior recognition model after the k-th incremental stage is finished and $\phi_k$ denotes the current parameters of the feature extractor after the k-th incremental stage is finished.
After the process fusion frame $\hat{I}$ is acquired, the process frame feature $g_{\phi_k}(\hat{I})$ may be obtained through the feature extractor; similarly, after the video to be processed is obtained, the video feature $g_{\phi_k}(V)$ may be obtained through the feature extractor. Because a fused frame that accurately represents the video to be processed is desired, that is, the fused frame should have the same or very similar expression capability to the original video, the embedded feature of the fused frame extracted by the video behavior recognition model should be consistent, or as consistent as possible, with the video feature of the original video. A second feature loss function $\mathcal{L}_{feat2}$ between the process frame feature and the video feature is therefore obtained; when the second feature loss function is characterized by a Euclidean distance, $\mathcal{L}_{feat2} = \left\| g_{\phi_k}(\hat{I}) - g_{\phi_k}(V) \right\|_2^2$. Of course, the second feature loss function may also be expressed in other ways, such as cosine similarity, Mahalanobis distance, Manhattan distance, or the Pearson correlation coefficient; different ways correspond to different expression formulas, which are not described herein again.
To further enhance the adaptability of the process fusion frame $\hat{I}$ to the video behavior recognition model, a cross-entropy loss can be used to supervise the classification confidence of the process fusion frame. A frame prediction label $f_{\theta_k}(\hat{I})$ corresponding to the process fusion frame can be obtained through the video behavior recognition model, and a standard label $y$ corresponding to the video to be processed can then be obtained, so that a second label loss function $\mathcal{L}_{label2}$ between the frame prediction label $f_{\theta_k}(\hat{I})$ and the standard label $y$ can be obtained. When the second label loss function $\mathcal{L}_{label2}$ is characterized by a cross-entropy loss function, it can be expressed as $\mathcal{L}_{label2} = \mathrm{CE}\big(f_{\theta_k}(\hat{I}),\, y\big)$. Of course, the second label loss function may also be expressed in other ways, such as a logarithmic loss function or an exponential loss function; different ways correspond to different expression formulas, which are not described herein again.
After the second feature loss function $\mathcal{L}_{feat2}$ and the second label loss function $\mathcal{L}_{label2}$ are acquired, the second loss function $\mathcal{L}_2$ corresponding to the process fusion frame can be determined from them. In some instances, $\mathcal{L}_2 = \mathcal{L}_{feat2} + \mathcal{L}_{label2}$, i.e. the second loss function is the sum of the second feature loss function and the second label loss function.
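A minimal Python (PyTorch) sketch of this first implementation of the second loss function is given below; `classifier` and `feature_extractor` stand in for the video behavior recognition model and its feature extractor, and the assumption that both inputs are already batched tensors accepted by those modules is made only for illustration.

import torch
import torch.nn.functional as F

def second_loss(classifier, feature_extractor, process_frame, video, label):
    # process_frame: (1, C, H, W); video: a representation the extractor accepts;
    # label: (1,) class index. Both callables are placeholders for the trained model.
    frame_feat = feature_extractor(process_frame)        # process frame feature
    video_feat = feature_extractor(video)                 # video feature (assumed same embedding size)
    feat_loss = F.mse_loss(frame_feat, video_feat)         # Euclidean-style second feature loss
    label_loss = F.cross_entropy(classifier(process_frame), label)   # second label loss
    return feat_loss + label_loss                          # second loss = sum of the two terms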
In a second implementation manner, based on the video behavior recognition model, obtaining the second loss function corresponding to the process fusion frame may include: acquiring process frame characteristics corresponding to the process fusion frame and video characteristics corresponding to the video to be processed based on the video behavior recognition model; acquiring a second characteristic loss function between the process frame characteristic and the video characteristic; a second loss function corresponding to the process fused frame is determined based on the second characteristic loss function.
Different from the above implementation manner, the second loss function $\mathcal{L}_2$ in this implementation is acquired only through the second feature loss function $\mathcal{L}_{feat2}$, so there is no need to acquire the second label loss function $\mathcal{L}_{label2}$; in particular, the second feature loss function can be determined directly as the second loss function, i.e. $\mathcal{L}_2 = \mathcal{L}_{feat2}$.
In a third implementation manner, based on the video behavior recognition model, obtaining the second loss function corresponding to the process fusion frame may include: acquiring a frame prediction label corresponding to the process fusion frame and a standard label corresponding to the video to be processed based on the video behavior recognition model; obtaining a second label loss function between the frame prediction label and the standard label; based on the second tag loss function, a second loss function corresponding to the process fused frame is determined.
Different from the above implementation manners, the second loss function $\mathcal{L}_2$ in this implementation is acquired only through the second label loss function $\mathcal{L}_{label2}$, so there is no need to acquire the second feature loss function $\mathcal{L}_{feat2}$; in particular, the second label loss function can be determined directly as the second loss function, i.e. $\mathcal{L}_2 = \mathcal{L}_{label2}$.
In still other examples, after a fusion frame corresponding to the video to be processed is obtained by combining the learnable parameters and the learnable information, and after the second loss function corresponding to the process fusion frame is acquired, this embodiment may further adjust the learnable parameters based on the second loss function. In this case, the method in this embodiment may further include: adjusting the learnable parameters based on the second loss function to obtain target learning parameters corresponding to the learnable parameters, thereby effectively improving the flexibility and reliability of determining the target learning parameters.
Step S302: fusing the fusion frame and the learnable information to obtain a target fusion frame.
Referring to fig. 4, after the fusion frame and the learnable information are acquired, they may be fused to obtain a target fusion frame. Specifically, fusing the fusion frame and the learnable information to obtain the target fusion frame may include: performing pixel-by-pixel summation on the learnable information and the fusion frame to obtain the target fusion frame; or performing pixel-by-pixel product on the learnable information and the fusion frame to obtain the target fusion frame; or splicing the learnable information and the fusion frame to obtain the target fusion frame. In this way, the target fusion frame is obtained accurately and reliably.
In this embodiment, the learnable information corresponding to the video to be processed is acquired, and the fusion frame and the learnable information are fused to obtain the target fusion frame, so that the accuracy and reliability of determining the target fusion frame are effectively ensured, which in turn facilitates accurately representing the video to be processed based on the target fusion frame.
Fig. 5 is a schematic flowchart illustrating a process of determining learnable parameters corresponding to a plurality of video frames according to an embodiment of the present invention; on the basis of the foregoing embodiment, referring to fig. 5, this embodiment provides an implementation manner for determining a learnable parameter, and specifically, determining a learnable parameter corresponding to each of a plurality of video frames in this embodiment may include:
step S501: acquiring initialization parameters corresponding to the plurality of video frames and initialization information corresponding to the video to be processed, wherein the initialization parameters are obtained based on the number of the plurality of video frames, and the initialization information is used for identifying space information and time information corresponding to the video to be processed.
After acquiring the plurality of video frames, the initialization parameters corresponding to the plurality of video frames may be automatically acquired or determined, and the initialization parameters may be determined based on the number of the video frames, in some examples, the initialization parameters corresponding to different video frames are the same, that is, in an initial state, it may be default that all the video frames have the same influence on the behavior recognition operation of the video to be processed, for example, when the number of the video frames is N, then the initialization parameters may be 1/N. In other examples, the initialization parameters corresponding to different video frames may be different, and at this time, the initialization parameters corresponding to each of the plurality of video frames may be determined based on the frame characteristics of the video frame.
Similarly, after the to-be-processed video is acquired, the initialization information corresponding to the to-be-processed video may be automatically acquired or determined, where the initialization information may be determined based on the to-be-processed video, and in some examples, the initialization information corresponding to different to-be-processed videos may be the same value, and at this time, the initialization information may be 0. Or, the initialization information corresponding to different videos to be processed may be different values.
Step S502: determining an initial fusion frame corresponding to the video to be processed based on the initialization parameters.
Because the initialization parameter is only the initial influence degree, set in the initial state, with which a video frame represents the video to be processed, the initialization parameter needs to be optimized in combination with the result of video behavior recognition in order to better represent the video to be processed and obtain a more accurate fusion frame. After the initialization parameters are obtained, an initial fusion frame corresponding to the video to be processed can therefore be determined based on the initialization parameters; specifically, the initial fusion frame can be a fusion frame obtained by performing weighted summation on the plurality of video frames with the initialization parameters.
Step S503: fusing the initial fusion frame and the initialization information to obtain a process fusion frame.
After the initialization information and the initial fusion frame are obtained, they may be fused to obtain a process fusion frame. Fusing the initialization information and the initial fusion frame may include: performing pixel-by-pixel summation on the initialization information and the initial fusion frame to obtain the process fusion frame; or performing pixel-by-pixel product on the initialization information and the initial fusion frame to obtain the process fusion frame; or splicing the initialization information and the initial fusion frame to obtain the process fusion frame. In this way, the process fusion frame is obtained accurately and reliably.
Step S504: acquiring a third loss function corresponding to the process fusion frame based on the video behavior recognition model.
After the process fusion frame is obtained, feature extraction processing and behavior tag identification processing can be performed on the process fusion frame based on the video behavior identification model, and then a third loss function corresponding to the process fusion frame can be obtained based on the obtained feature information and tag information.
For the third loss function, this embodiment may obtain different third loss functions in different manners, and in a first implementation manner, obtaining the third loss function corresponding to the process fusion frame based on the video behavior recognition model may include: acquiring fusion frame characteristics corresponding to the initial fusion frame, initial prediction labels corresponding to the initial fusion frame, process frame characteristics corresponding to the process fusion frame, video characteristics corresponding to the video to be processed, frame prediction labels corresponding to the process fusion frame and standard labels corresponding to the video to be processed based on the video behavior recognition model; acquiring a first sub-loss function between the fusion frame feature and the video feature, a second sub-loss function between the initial prediction label and the standard label, a third sub-loss function between the process frame feature and the video feature, and a fourth sub-loss function between the frame prediction label and the standard label; and determining a third loss function corresponding to the process fusion frame based on the first sub-loss function, the second sub-loss function, the third sub-loss function and the fourth sub-loss function.
Wherein, the first sub-loss function $\mathcal{L}_{s1}$ and the third sub-loss function $\mathcal{L}_{s3}$ can be obtained through a Euclidean distance, a cosine similarity, or the like, and the second sub-loss function $\mathcal{L}_{s2}$ and the fourth sub-loss function $\mathcal{L}_{s4}$ can be obtained through a cross-entropy loss function. Additionally, determining the third loss function corresponding to the process fusion frame based on the first sub-loss function, the second sub-loss function, the third sub-loss function and the fourth sub-loss function may include: acquiring weight information corresponding to each of the first, second, third and fourth sub-loss functions; and performing weighted summation on the first, second, third and fourth sub-loss functions based on the weight information to obtain the third loss function.
Specifically, after the first sub-loss function $\mathcal{L}_{s1}$, the second sub-loss function $\mathcal{L}_{s2}$, the third sub-loss function $\mathcal{L}_{s3}$ and the fourth sub-loss function $\mathcal{L}_{s4}$ are obtained, the weight information $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ corresponding to each of them may be determined respectively, and the sub-loss functions are then weighted and summed with their corresponding weight information to obtain the third loss function corresponding to the process fusion frame: $\mathcal{L}_3 = \lambda_1 \mathcal{L}_{s1} + \lambda_2 \mathcal{L}_{s2} + \lambda_3 \mathcal{L}_{s3} + \lambda_4 \mathcal{L}_{s4}$, where $\lambda_1$ is the weight information corresponding to the first sub-loss function $\mathcal{L}_{s1}$, $\lambda_2$ is the weight information corresponding to the second sub-loss function $\mathcal{L}_{s2}$, $\lambda_3$ is the weight information corresponding to the third sub-loss function $\mathcal{L}_{s3}$, and $\lambda_4$ is the weight information corresponding to the fourth sub-loss function $\mathcal{L}_{s4}$.
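The weighted summation of the four sub-loss functions can be sketched as follows in Python (PyTorch); the default weights of 1 and the use of a mean-squared (Euclidean-style) feature distance are assumptions chosen for the example.

import torch.nn.functional as F

def third_loss(fused_feat, process_feat, video_feat,
               fused_logits, process_logits, label,
               weights=(1.0, 1.0, 1.0, 1.0)):
    l1 = F.mse_loss(fused_feat, video_feat)        # first sub-loss: initial fused frame vs. video features
    l2 = F.cross_entropy(fused_logits, label)      # second sub-loss: initial prediction label vs. standard label
    l3 = F.mse_loss(process_feat, video_feat)      # third sub-loss: process fusion frame vs. video features
    l4 = F.cross_entropy(process_logits, label)    # fourth sub-loss: frame prediction label vs. standard label
    w1, w2, w3, w4 = weights
    return w1 * l1 + w2 * l2 + w3 * l3 + w4 * l4   # weighted summation = third loss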
In a second implementation manner, based on the video behavior recognition model, the obtaining a third loss function corresponding to the process fusion frame may include: acquiring process frame characteristics corresponding to the process fusion frame, video characteristics corresponding to the video to be processed, a frame prediction label corresponding to the process fusion frame and a standard label corresponding to the video to be processed based on the video behavior recognition model; acquiring a third sub-loss function between the process frame characteristic and the video characteristic and a fourth sub-loss function between the frame prediction label and the standard label; and determining a third loss function corresponding to the process fusion frame based on the third sub-loss function and the fourth sub-loss function.
In a third implementation manner, based on the video behavior recognition model, obtaining a third loss function corresponding to the process fusion frame may include: acquiring fusion frame characteristics corresponding to the initial fusion frame, an initial prediction label corresponding to the initial fusion frame, video characteristics corresponding to the video to be processed and a standard label corresponding to the video to be processed based on the video behavior recognition model; acquiring a first sub-loss function between the fusion frame characteristic and the video characteristic and a second sub-loss function between the initial prediction label and the standard label; a third loss function corresponding to the process fused frame is determined based on the first sub-loss function and the second sub-loss function.
Specifically, the specific implementation manner and implementation effect of the three implementation manners of the third loss function in this embodiment are similar to the specific implementation manner and implementation effect of the three implementation manners of the second loss function and the first loss function in the above embodiment, and the above statements may be specifically referred to, and are not repeated herein.
Step S505: adjusting the initialization parameters based on the third loss function to obtain learnable parameters corresponding to the plurality of video frames respectively.
After the third loss function is obtained, the initialization parameters may be adjusted based on the third loss function to obtain the learnable parameters corresponding to the plurality of video frames. Specifically, the video frames may be analyzed and processed by the video behavior recognition model, and the initialization parameters may be adjusted with the third loss function as constraint information to obtain the learnable parameters corresponding to the plurality of video frames; the learnable parameters determined in this way are adapted to the plurality of video frames. A sketch of this adjustment loop is given after this paragraph.
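The following Python (PyTorch) sketch illustrates adjusting the initialization parameters (and initialization information) by gradient descent on a third-loss-style objective; the tiny stand-in feature extractor and classifier, the frame count, learning rate, step count and class index are assumptions made only so the example is self-contained.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins for the trained video behavior recognition model's feature extractor and classifier.
feature_extractor = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 64))
classifier = nn.Linear(64, 10)
for p in list(feature_extractor.parameters()) + list(classifier.parameters()):
    p.requires_grad_(False)                                 # the model itself stays fixed here

T = 8
frames = torch.rand(T, 3, 224, 224)                         # sampled video frames
video_feat = feature_extractor(frames).mean(dim=0, keepdim=True)   # assumed video-level feature
label = torch.tensor([3])                                   # standard label (assumed class index)

weights = torch.full((T,), 1.0 / T, requires_grad=True)     # initialization parameters
prompt = torch.zeros(1, 3, 224, 224, requires_grad=True)    # initialization information
optimizer = torch.optim.SGD([weights, prompt], lr=0.01)

for step in range(100):
    fused = (weights.view(-1, 1, 1, 1) * frames).sum(dim=0, keepdim=True)   # initial fused frame
    process = fused + prompt                                                # process fusion frame
    feat = feature_extractor(process)
    loss = (F.mse_loss(feature_extractor(fused), video_feat)                # first sub-loss
            + F.cross_entropy(classifier(feature_extractor(fused)), label)  # second sub-loss
            + F.mse_loss(feat, video_feat)                                   # third sub-loss
            + F.cross_entropy(classifier(feat), label))                      # fourth sub-loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
# `weights` now holds the learnable parameters and `prompt` the learnable information.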
In still other examples, after the fusion frame corresponding to the video to be processed is obtained by combining the learnable parameters and the learnable information, that is, after the third loss function corresponding to the process fusion frame is obtained, the method in this embodiment may further include adjusting the initialization information based on the third loss function. In this case, the method in this embodiment may further include: adjusting the initialization information based on the third loss function to obtain the learnable information corresponding to the video to be processed, thereby effectively improving the flexibility and reliability of determining the learnable information.
In this embodiment, the initialization parameters corresponding to the plurality of video frames and the initialization information corresponding to the video to be processed are obtained, the initial fusion frame corresponding to the video to be processed is determined based on the initialization parameters, the initial fusion frame and the initialization information are fused to obtain the process fusion frame, the third loss function corresponding to the process fusion frame is obtained based on the video behavior recognition model, and the initialization parameters are adjusted based on the third loss function to obtain the learnable parameters corresponding to the plurality of video frames; this effectively ensures the accuracy and reliability of determining the learnable parameters and further improves the quality and efficiency of video processing.
Fig. 6 is a flowchart illustrating a video processing method according to another embodiment of the present invention; on the basis of any one of the above embodiments, referring to fig. 6, after obtaining a fusion frame corresponding to a video to be processed, the present embodiment provides a technical solution for performing a model update or model optimization operation by using the fusion frame and a newly added video sample, and specifically, the method in the present embodiment may further include:
step S601: and acquiring a new video sample.
The specific acquiring method of the newly added video sample is similar to the specific acquiring method of the video to be processed in the above embodiment, and the above statements may be specifically referred to, and are not repeated here.
Step S602: performing learning training on the video behavior recognition model based on the newly added video samples and the fusion frames to obtain an optimized recognition model.
After the newly added video samples and the fusion frames are obtained, a learning training operation can be performed on the video behavior recognition model based on the newly added video samples and the fusion frames, so that the optimized recognition model can be obtained. In some examples, the newly added video samples and the fusion frames can be alternately and sequentially input into the video behavior recognition model to implement the learning training operation. In other examples, because the number of newly added video samples and the number of fusion frames may not be of the same magnitude, and in order to avoid catastrophic forgetting during the training of video behavior recognition, performing learning training on the video behavior recognition model based on the newly added video samples and the fusion frames to obtain the optimized recognition model may include: acquiring a sample proportion between the newly added video samples and the fusion frames; and training the video behavior recognition model with the newly added video samples and the fusion frames according to the sample proportion to obtain the optimized recognition model.
Specifically, after the newly added video samples and the fusion frames are obtained, the sample proportion between the newly added video samples and the fusion frames can first be obtained, and the newly added video samples and the fusion frames can then be used to train the video behavior recognition model according to this sample proportion, so that the optimized recognition model can be obtained. This effectively avoids catastrophic forgetting degrading the training quality and effect of the optimized recognition model, and thus ensures the training quality and training effect of the video behavior recognition model. A sketch of drawing training batches according to such a sample proportion is given below.
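The sample proportion can be realized, for example, by drawing mixed training batches in Python; the batch size and the 3:1 proportion below are assumptions for illustration, not values prescribed by this embodiment.

import random

def mixed_batch(new_samples, fused_frames, batch_size=16, new_fraction=0.75):
    # Draw one training batch in which newly added video samples and stored fused
    # frames appear according to a fixed sample proportion (here 3:1).
    n_new = int(batch_size * new_fraction)
    n_old = batch_size - n_new
    return random.sample(new_samples, n_new) + random.sample(fused_frames, n_old)

# Example with placeholder items; in practice these would be (video, label) pairs.
new_samples = [f"new_video_{i}" for i in range(100)]
fused_frames = [f"fused_frame_{i}" for i in range(30)]
batch = mixed_batch(new_samples, fused_frames)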
In the embodiment, the newly added video sample is obtained, and then the video behavior recognition model is learned and trained based on the newly added video sample and the fusion frame to obtain the recognition model after optimization, so that the optimization operation of the model is effectively realized, the quality and effect of model updating or model optimization are ensured, and the practicability of the method is further improved.
In a specific application, referring to fig. 7, this application embodiment provides a video processing method that implements memory-efficient video incremental learning; specifically, in this technical scheme, a representative frame is learned for each video sample, so as to further reduce the memory overhead caused by storing video data. Experiments show that, under the same memory consumption, this technical scheme performs much better than other reference methods, and in particular achieves higher data storage quality and effect when the memory consumption is only about 20%. Specifically, the video processing method in this embodiment may include the following steps:
step 1: and acquiring a video data set and a preliminarily trained video behavior recognition model, wherein the video behavior recognition model is used for performing behavior recognition operation on the video data.
Step 2: video samples that can represent category information are screened out of the video dataset.
Specifically, after the video behavior recognition model is initially trained, that is, after an incremental task corresponding to the video behavior recognition model is finished, a representative sample selection algorithm may be used to screen out, from the current video data set, video samples capable of representing category information; the number of video samples corresponding to one category is at least one, and the obtained video samples of all categories form a representative sample video set.
The representative sample selection algorithm may include the Herding strategy, a clustering-style strategy that mainly selects samples around the class center. Of course, the representative sample selection algorithm is not limited to the Herding strategy listed above and may also include other strategies, for example: selecting video samples according to the contribution degree of each video to the loss function; or selecting video samples according to the class gradient corresponding to each video. A person skilled in the art may perform any selection and configuration according to a specific application scenario or application requirement, which is not described herein again. A sketch of a Herding-style selection is given below.
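One common instantiation of such a representative sample selection algorithm is a Herding-style procedure that greedily keeps the mean of the selected set close to the class mean; the Python (NumPy) sketch below assumes per-video embeddings have already been extracted by the feature extractor and is only illustrative.

import numpy as np

def herding_selection(features: np.ndarray, m: int) -> list:
    # features: (N, d) embeddings of all videos of one class; m: number of exemplars to keep.
    class_mean = features.mean(axis=0)
    selected, running_sum = [], np.zeros_like(class_mean)
    for _ in range(m):
        # Pick the video that moves the mean of the selected set closest to the class mean.
        candidate_means = (running_sum + features) / (len(selected) + 1)
        gaps = np.linalg.norm(class_mean - candidate_means, axis=1)
        gaps[selected] = np.inf              # never pick the same video twice
        idx = int(np.argmin(gaps))
        selected.append(idx)
        running_sum += features[idx]
    return selected

exemplars = herding_selection(np.random.rand(50, 128), m=5)   # indices of 5 representative videos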
Step 3: performing a frame sampling operation on each video sample to obtain a plurality of video frames.
In particular, a representative sample selection method commonly used in incremental learning (the Herding strategy) is applied to the video data set $D_k$ obtained after incremental step $k$ to determine a representative sample video set $M_k$. Then, for any video sample $V_i$ in the representative sample video set $M_k$, $T$ frames $\{x_1, x_2, \dots, x_T\}$ may be uniformly sampled or randomly sampled from the video sample, thereby obtaining a plurality of video frames; a sketch of such frame-index sampling follows.
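Frame sampling can be sketched as follows; the default of T = 8 frames and the use of NumPy are assumptions made for the example.

import numpy as np

def sample_frame_indices(num_frames: int, t: int = 8, uniform: bool = True) -> np.ndarray:
    # Returns T frame indices, either spread evenly across the video or drawn at random
    # (random sampling assumes the video has at least T frames).
    if uniform:
        return np.linspace(0, num_frames - 1, t).round().astype(int)
    return np.sort(np.random.choice(num_frames, size=t, replace=False))

indices = sample_frame_indices(300)   # e.g. a 300-frame video sample -> 8 frame indices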
Step 4: determining a learnable weight corresponding to each of the plurality of video frames, wherein the learnable weight can be determined through the video behavior recognition model and the number of the plurality of video frames.
In order to further process the selected video samples, a learnable weight may be optimized for each video frame. In some examples, after the plurality of video frames are acquired, the number $T$ of the video frames may be determined, and the initialization parameter is then determined based on this number; the initialization parameter corresponding to each video frame is the same, specifically $1/T$. For example, if 8 video frames are sampled, the initialization parameter corresponding to each video frame is $1/8$.
Step 5: fusing the plurality of video frames based on the learnable weights respectively corresponding to the plurality of video frames to obtain a fused frame.
Specifically, an initial fused frame can first be obtained based on the initialization parameters, a behavior recognition operation is performed on the initial fused frame by the video behavior recognition model to obtain a loss function corresponding to the initial fused frame, and the initialization parameters are then adjusted and optimized based on this loss function to obtain the learnable weights corresponding to the plurality of video frames.
With the optimized learnable weights, a fused frame that represents the characteristics of the video and can be correctly classified with high confidence can be learned for each video, that is, the plurality of video frames are fused into one representative frame. The fused frame then represents the video sample for storage, and only the fused frame is stored in the next incremental task, which effectively reduces the memory space required for data storage.
For example, for the $i$-th video sample $V_i$ in the representative sample video set $M_k$, learnable weights $a = \{a_1, a_2, \dots, a_T\}$ are defined, and the fused frame can be expressed as the following formula: $I_i = \sum_{t=1}^{T} \sigma(a)_t \, x_t \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$ and $W$ are the number of image channels, the height and the width of a video frame, $I_i$ denotes the fused frame, and $\sigma(\cdot)$ denotes a normalization of the learnable weights; the size of the fused frame is the same as that of a video frame, and the number of image channels of the fused frame is the same as that of a video frame.
Step 6: determining an Instance-Specific Prompt corresponding to each of the plurality of video frames, wherein the instance-specific prompt is used for representing the temporal information and spatial information corresponding to the video sample, and the instance-specific prompt can be determined through the video behavior recognition model.
When a plurality of video frames are fused into one frame (namely, the fused frame), the temporal and spatial information of the original video is inevitably lost to a certain extent. In order to supplement the information loss caused by frame fusion and compensate for the loss of temporal information and the confusion of spatial information introduced in the frame fusion stage, an instance-specific prompt can be further added to the fused frame to recover detail information at the pixel level, so that the feature information of the original video can be better preserved.
In particular, for each video sample $V_i$, a learnable instance-specific prompt $p_i$ is constructed, whose spatial resolution is the same as that of the original video. The instance-specific prompt $p_i$ can be obtained by adjusting an initialization prompt through the video behavior recognition model. In some examples, the initialization prompt may be all zeros; the initialization prompt and the fused frame are then summed pixel by pixel to obtain a target fusion frame, the target fusion frame is analyzed and processed by the video behavior recognition model to obtain a loss function of the target fusion frame, and the initialization prompt is optimized and adjusted based on this loss function, so that the instance-specific prompt can be obtained.
Step 7: processing the fused frame based on the instance-specific prompts corresponding to the video frames to obtain a target fused frame, wherein the target fused frame is used for representing the video sample.
Specifically, the instance-specific prompt $p_i$ and the fused frame $I_i$ are summed pixel by pixel to obtain the target fused frame $\hat{I}_i = I_i + p_i$ used for representing the video sample. The obtained target fused frame can then take the place of the video sample in model training operations, so that more target fused frames can be stored in a limited memory space, which effectively ensures the diversity and quantity of samples during model updating or model optimization operations.
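To make the memory argument concrete, the small Python (PyTorch) comparison below contrasts storing T sampled frames per video with storing a single target fused frame; the frame count, resolution and float32 storage are assumptions made for the example.

import torch

def size_mb(t: torch.Tensor) -> float:
    return t.numel() * t.element_size() / 2**20

video_frames = torch.rand(8, 3, 224, 224)     # T = 8 frames kept per exemplar video
target_fused = torch.rand(3, 224, 224)        # one target fused frame kept instead

print(f"{size_mb(video_frames):.2f} MB per video -> {size_mb(target_fused):.2f} MB per video")
# Storing one fused frame instead of T frames cuts the per-exemplar memory by a factor of T.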
In addition, when the learnable weight is determined by the video behavior recognition model and the number of the plurality of video frames, the method in this embodiment may further include the following steps:
step 11: obtaining initialization parameters corresponding to a plurality of video frames respectively, wherein the plurality of video frames correspond to one or a plurality of video samples
Figure 823617DEST_PATH_IMAGE047
Step 12: performing preliminary fusion on the plurality of video frames based on the initialization parameters respectively corresponding to them to obtain an initial fused frame $I_i$.

Step 13: performing a feature extraction operation on the initial fused frame $I_i$ with the video behavior recognition model to obtain the fused-frame feature $g_{\phi_k}(I_i)$, where $g_{\phi_k}$ is the feature extractor of the video behavior recognition model $f_{\theta_k}$, $\theta_k$ are the parameters of the video behavior recognition model, and $\phi_k$ are the parameters of the feature extractor.

Step 14: performing a feature extraction operation on the video sample $V_i$ with the video behavior recognition model to obtain the video feature $g_{\phi_k}(V_i)$.
Step 15: a first sub-loss function between the fused frame feature and the video feature is obtained.
Specifically, in the process of video processing, in order to obtain a fused frame that accurately represents the video sample, i.e. a fused frame with the same or very similar expression capability to the video sample, the embedded feature of the fused frame extracted by the video behavior recognition model should be consistent with the feature of the original video: $\mathcal{L}_{f1} = \left\| g_{\phi_k}(I_i) - g_{\phi_k}(V_i) \right\|_2^2$, where $g_{\phi_k}$ is the feature extractor of the current model $f_{\theta_k}$ and $\phi_k$ are the parameters of the feature extractor.
Step 16: performing a label behavior prediction operation on the initial fused frame $I_i$ with the video behavior recognition model to obtain the predicted label $f_{\theta_k}(I_i)$.

Step 17: determining the standard label $y_i$ corresponding to the video sample, and obtaining a first sub-cross-entropy loss function between the predicted label $f_{\theta_k}(I_i)$ and the standard label $y_i$.

In order to further improve the adaptability of the fused frame to the video behavior recognition model, a cross-entropy loss can be used to supervise the classification confidence of the fused frame: $\mathcal{L}_{c1} = \mathrm{CE}\big(f_{\theta_k}(I_i),\, y_i\big)$.
step 18: obtaining a first total objective function for optimizing the learnable weight, in particular, the first total objective function, based on the first sub-loss function and the first sub-cross entropy loss function
Figure 901239DEST_PATH_IMAGE064
Further, when determining, based on the video behavior recognition model, an instance-specific hint corresponding to each of the plurality of video frames, the method in this embodiment may further include the following steps:
step 21: obtaining an initial fused frame
Figure 634840DEST_PATH_IMAGE055
Corresponding initialization prompt information
Figure 184770DEST_PATH_IMAGE054
Wherein the prompt information is initialized
Figure 862876DEST_PATH_IMAGE054
May be all zero.
Step 22: determining a process fusion frame based on the initialization prompt information and the initial fusion frame.
In particular, the initialization prompt information $p_i$ and the initial fused frame $I_i$ are summed pixel by pixel to obtain the process fusion frame $\hat{I}_i = I_i + p_i$.

Step 23: performing a feature extraction operation on the process fusion frame $\hat{I}_i$ with the video behavior recognition model to obtain the process frame feature $g_{\phi_k}(\hat{I}_i)$.

Step 24: performing a feature extraction operation on the video sample $V_i$ with the video behavior recognition model to obtain the video feature $g_{\phi_k}(V_i)$.

Step 25: obtaining a second sub-loss function between the process frame feature $g_{\phi_k}(\hat{I}_i)$ and the video feature $g_{\phi_k}(V_i)$.

In particular, to characterize the consistency between the process fusion frame $\hat{I}_i$ and the original video $V_i$, the second sub-loss function can be expressed by the following formula: $\mathcal{L}_{f2} = \left\| g_{\phi_k}(\hat{I}_i) - g_{\phi_k}(V_i) \right\|_2^2$. In this process, the representation of the fused frame is enriched by introducing more flexible learnable parameters in the input space.
Step 26: performing a label behavior prediction operation on the process fusion frame $\hat{I}_i$ with the video behavior recognition model to obtain the predicted label $f_{\theta_k}(\hat{I}_i)$.

Step 27: determining the standard label $y_i$ corresponding to the video sample, and obtaining a second sub-cross-entropy loss function between the predicted label $f_{\theta_k}(\hat{I}_i)$ and the standard label $y_i$.

The second sub-cross-entropy loss function can be used to enhance the semantic perception of the process fusion frame; specifically, it can be expressed as $\mathcal{L}_{c2} = \mathrm{CE}\big(f_{\theta_k}(\hat{I}_i),\, y_i\big)$.

Step 28: determining a second total objective function based on the first total objective function, the second sub-cross-entropy loss function and the second sub-loss function. Specifically, the second total objective function of frame fusion with instance-specific prompts added is as follows: $\mathcal{L}_{total} = \lambda_1 \mathcal{L}_{f1} + \lambda_2 \mathcal{L}_{c1} + \lambda_3 \mathcal{L}_{f2} + \lambda_4 \mathcal{L}_{c2}$, where $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ are the default weights of the above loss functions respectively; in some instances, $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ may all be 1.
Step 29: adjusting the initialization prompt information and/or the initialization parameters based on the second total objective function to obtain the instance-specific prompt and/or the learnable parameters, wherein the obtained instance-specific prompt and/or learnable parameters can be used for determining the fused frame capable of representing the video sample.
After the fused frame of the video sample is determined, in addition to processing the video sample based on the instance-specific prompt and/or the learnable parameters, this embodiment can also process the video sample with sample-replay and knowledge-distillation algorithms. In this way, the number of frames stored per representative video and the information redundancy of the representative videos are reduced, which cuts the large memory consumption required for storing video data, while catastrophic forgetting is further prevented, so that memory-efficient video incremental learning can be realized.
Specifically, after the fusion frame capable of representing the video sample is acquired, the method in this application embodiment may further include performing a training operation on the video behavior recognition model based on the fusion frame, and at this time, the method in this application embodiment may include:
step 31: a plurality of fused frames is obtained, with different fused frames being used to represent different video samples.
Step 32: acquiring a newly added video sample, and performing a training update operation on the video behavior recognition model based on the newly added video sample and the plurality of fused frames to obtain an optimized video behavior recognition model.
In training increment step $k+1$, the data set $D_{k+1} \cup \mathcal{M}$ can be used to train the model $f_{\theta_{k+1}}$, where $D_{k+1}$ is the data of task $k+1$, composed of videos belonging to the category set $C_{k+1}$, and $\mathcal{M}$ is a repository of old samples that contains the plurality of fused frames generated in advance. In order to ensure the quality and effect of model updating or model optimization, the sample proportion between the stored fused frames and the newly added video samples can be determined, and the fused frames and the newly added video samples are then input into the video behavior recognition model according to this sample proportion, so that catastrophic forgetting during the model updating or model optimization operation can be effectively prevented.
In addition, this application embodiment can also adopt a basic knowledge distillation method from incremental learning to transfer the knowledge of the former model $f_{\theta_k}$ to the current model $f_{\theta_{k+1}}$; a sketch of such a distillation term is given below. The technical scheme provided by this application embodiment can realize the video class-incremental learning task with high storage efficiency: a fused frame capable of representing a video sample is obtained by learning the representative characteristics of the video sample, and the lost spatio-temporal information is supplemented at the pixel level through the instance-specific prompt, so that the fused frame can better represent the video sample.
Fig. 8 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present invention; referring to fig. 8, the present embodiment provides a video processing apparatus, which is configured to execute the video processing method shown in fig. 2, and specifically, the video processing apparatus may include:
a first obtaining module 11, configured to obtain multiple video frames of a video to be processed;
the first determining module 12 is configured to determine learnable parameters corresponding to each of a plurality of video frames, where the learnable parameters are obtained through a video behavior recognition model, where the video behavior recognition model is a machine learning model;
the first processing module 13 is configured to fuse the plurality of video frames based on the learnable parameters corresponding to the plurality of video frames, so as to obtain a fused frame corresponding to the video to be processed.
In some examples, when the first processing module 13 fuses the plurality of video frames based on the learnable parameters corresponding to the plurality of video frames, and obtains a fused frame corresponding to the video to be processed, the first processing module 13 is configured to perform: when the learnable parameters are numerical values which are larger than zero and smaller than 1, carrying out weighted summation on the plurality of video frames based on the learnable parameters respectively corresponding to the plurality of video frames to obtain a fusion frame; or when the learnable parameters are numerical values larger than 1, normalizing the learnable parameters corresponding to the video frames respectively to obtain normalized parameters corresponding to the learnable parameters; and carrying out weighted summation on the plurality of video frames based on the normalization parameters to obtain a fusion frame.
In some examples, when the first determination module 12 determines the learnable parameters corresponding to each of the plurality of video frames, the first determination module 12 is configured to perform: acquiring initialization parameters corresponding to the plurality of video frames respectively, wherein the initialization parameters are acquired based on the number of the plurality of video frames; determining an initial fusion frame corresponding to the video to be processed based on the initialization parameters; acquiring a first loss function corresponding to the initial fusion frame based on the video behavior recognition model; and adjusting the initialization parameters based on the first loss function to obtain learnable parameters corresponding to the plurality of video frames respectively.
In some examples, when the first determining module 12 obtains the first loss function corresponding to the initial fused frame based on the video behavior recognition model, the first determining module 12 is configured to perform: acquiring fusion frame characteristics corresponding to the initial fusion frame, video characteristics corresponding to the video to be processed, an initial prediction label corresponding to the initial fusion frame and a standard label corresponding to the video to be processed based on the video behavior recognition model; acquiring a first feature loss function between the fusion frame feature and the video feature and a first label loss function between an initial prediction label and a standard label; a first loss function corresponding to the initial fused frame is determined based on the first feature loss function and the first tag loss function.
In some examples, after obtaining the fusion frame corresponding to the video to be processed, the first obtaining module 11 and the first processing module 13 in this embodiment are configured to perform the following steps:
a first obtaining module 11, configured to obtain learnable information corresponding to a video to be processed, where the learnable information is used to identify spatial information and/or temporal information of the video to be processed;
and a first processing module 13, configured to fuse the fused frame and the learnable information to obtain a target fused frame.
In some examples, when the first processing module 13 fuses the fused frame and the learnable information to obtain the target fused frame, the first processing module 13 is configured to perform: performing pixel-by-pixel summation processing on the learnable information and the fusion frame to obtain a target fusion frame; or, performing pixel-by-pixel product processing on the learnable information and the fusion frame to obtain a target fusion frame; or splicing the learnable information and the fusion frame to obtain a target fusion frame.
In some examples, when the first obtaining module 11 obtains the learnable information corresponding to the video to be processed, the first obtaining module 11 is configured to perform: acquiring initialization information corresponding to a video to be processed; fusing the initialization information and the fusion frame to obtain a process fusion frame; acquiring a second loss function corresponding to the process fusion frame based on the video behavior recognition model; and adjusting the initialization information based on the second loss function to obtain learnable information corresponding to each of the plurality of video frames.
In some examples, when the first obtaining module 11 obtains the second loss function corresponding to the process fusion frame based on the video behavior recognition model, the first obtaining module 11 is configured to perform: acquiring process frame characteristics corresponding to the process fusion frame, video characteristics corresponding to the video to be processed, a frame prediction label corresponding to the process fusion frame and a standard label corresponding to the video to be processed based on the video behavior recognition model; acquiring a second characteristic loss function between the process frame characteristic and the video characteristic and a second label loss function between the frame prediction label and the standard label; a second loss function corresponding to the process fused frame is determined based on the second feature loss function and the second tag loss function.
In some examples, after obtaining the second loss function corresponding to the process fusion frame, the first processing module 13 in this embodiment is configured to perform: and adjusting the learnable parameters based on the second loss function to obtain target learning parameters corresponding to the learnable parameters.
In some examples, when the first determination module 12 determines the learnable parameters corresponding to each of the plurality of video frames, the first determination module 12 is configured to perform: acquiring initialization parameters corresponding to a plurality of video frames and initialization information corresponding to a video to be processed, wherein the initialization parameters are acquired based on the number of the video frames, and the initialization information is used for identifying space information and time information corresponding to the video to be processed; determining an initial fusion frame corresponding to the video to be processed based on the initialization parameters; fusing the initial fusion frame and the initialization information to obtain a process fusion frame; acquiring a third loss function corresponding to the process fusion frame based on the video behavior recognition model; and adjusting the initialization parameters based on a third loss function to obtain learnable parameters corresponding to the plurality of video frames respectively.
In some examples, after obtaining the third loss function corresponding to the process fusion frame, the first processing module 13 in this embodiment is configured to perform: and adjusting the initialization information based on a third loss function to obtain learnable information corresponding to the video to be processed.
In some examples, when the first determination module 12 obtains the third loss function corresponding to the process fusion frame based on the video behavior recognition model, the first determination module 12 is configured to perform: acquiring fusion frame characteristics corresponding to the initial fusion frame, initial prediction labels corresponding to the initial fusion frame, process frame characteristics corresponding to the process fusion frame, video characteristics corresponding to the video to be processed, frame prediction labels corresponding to the process fusion frame and standard labels corresponding to the video to be processed based on the video behavior recognition model; acquiring a first sub-loss function between the fusion frame characteristic and the video characteristic, a second sub-loss function between the initial prediction label and the standard label, a third sub-loss function between the process frame characteristic and the video characteristic, and a fourth sub-loss function between the frame prediction label and the standard label; determining a third loss function corresponding to the process fusion frame based on the first, second, third, and fourth sub-loss functions.
In some examples, when the first determination module 12 determines the third loss function corresponding to the process fusion frame based on the first sub-loss function, the second sub-loss function, the third sub-loss function, and the fourth sub-loss function, the first determination module 12 is configured to perform: acquiring weight information corresponding to a first sub-loss function, a second sub-loss function, a third sub-loss function and a fourth sub-loss function respectively; and carrying out weighted summation on the first sub-loss function, the second sub-loss function, the third sub-loss function and the fourth sub-loss function based on the weight information to obtain a third loss function.
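As an illustrative sketch, the third loss function may be assembled as a weighted summation of the four sub-loss functions as follows; the particular loss types (MSE and cross-entropy) and the default weights are assumptions made only for this example:

```python
# Illustrative sketch only: the third loss function as a weighted summation of
# the four sub-loss functions. MSE/cross-entropy and the default weights are
# assumptions made for the example.
import torch.nn.functional as F

def third_loss(fused_feat, proc_feat, video_feat,
               fused_logits, proc_logits, label,
               weights=(1.0, 1.0, 1.0, 1.0)):
    l1 = F.mse_loss(fused_feat, video_feat)      # first sub-loss: fused-frame feature vs video feature
    l2 = F.cross_entropy(fused_logits, label)    # second sub-loss: initial prediction vs standard label
    l3 = F.mse_loss(proc_feat, video_feat)       # third sub-loss: process-frame feature vs video feature
    l4 = F.cross_entropy(proc_logits, label)     # fourth sub-loss: frame prediction vs standard label
    w1, w2, w3, w4 = weights
    return w1 * l1 + w2 * l2 + w3 * l3 + w4 * l4
```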
In some examples, after obtaining the fused frame corresponding to the video to be processed, the first obtaining module 11 and the first processing module 13 in this embodiment perform the following steps:
the first obtaining module 11 is configured to obtain a newly added video sample;
and the first processing module 13 is configured to perform learning training on the video behavior recognition model based on the newly added video samples and the fusion frames to obtain an optimized recognition model.
In some examples, when the first processing module 13 performs learning training on the video behavior recognition model based on the newly added video samples and the fusion frames to obtain the optimized recognition model, the first processing module 13 is configured to perform: acquiring a sample proportion between the newly added video sample and the fusion frame; and training the video behavior recognition model by the newly added video sample and the fusion frame according to the sample proportion to obtain an optimized recognition model.
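A possible, non-authoritative sketch of assembling a training batch from newly added video samples and stored fusion frames according to a sample proportion is shown below; the batch size and the 50/50 ratio are illustrative values only, not parameters taken from the embodiment:

```python
# Illustrative sketch only: drawing a training batch that mixes newly added
# video samples with stored fusion frames according to a sample proportion.
# The batch size and the 50/50 ratio are example values, not patent parameters.
import random

def build_mixed_batch(new_samples, fusion_frames, batch_size=32, new_ratio=0.5):
    n_new = int(batch_size * new_ratio)
    batch = random.sample(new_samples, min(n_new, len(new_samples)))
    remaining = batch_size - len(batch)
    batch += random.sample(fusion_frames, min(remaining, len(fusion_frames)))
    random.shuffle(batch)  # mix old (fusion-frame) and new samples in one batch
    return batch
```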
The apparatus shown in fig. 8 can perform the method of the embodiment shown in fig. 1-7, and the detailed description of this embodiment can refer to the related description of the embodiment shown in fig. 1-7. The implementation process and technical effect of the technical solution refer to the descriptions in the embodiments shown in fig. 1 to 7, and are not described herein again.
In one possible design, the structure of the video processing apparatus shown in fig. 8 may be implemented as an electronic device, which may be a tablet computer, a personal computer PC, a conference room device, a server, or other various devices. As shown in fig. 9, the electronic device may include: a first processor 21 and a first memory 22. Wherein the first memory 22 is used for storing programs for corresponding electronic devices to execute the video processing method in the embodiments shown in fig. 1-7, and the first processor 21 is configured to execute the programs stored in the first memory 22.
The program comprises one or more computer instructions, wherein the one or more computer instructions, when executed by the first processor 21, are capable of performing the steps of: acquiring a plurality of video frames of a video to be processed; determining learnable parameters corresponding to a plurality of video frames respectively, wherein the learnable parameters are obtained through a video behavior recognition model, and the video behavior recognition model is a machine learning model; and fusing the plurality of video frames based on the learnable parameters respectively corresponding to the plurality of video frames to obtain a fused frame corresponding to the video to be processed.
Further, the first processor 21 is also used to execute all or part of the steps in the embodiments shown in fig. 1-7. The electronic device may further include a first communication interface 23 for communicating with other devices or a communication network.
In addition, the embodiment of the present invention provides a computer storage medium for storing computer software instructions for an electronic device, which includes a program for executing the video processing method in the embodiment shown in fig. 1 to 7.
Furthermore, an embodiment of the present invention provides a computer program product, including: computer program, which, when executed by a processor of an electronic device, causes the processor to perform the steps of the video processing method as described above with reference to fig. 1-7.
Fig. 10 is a flowchart illustrating another video processing method according to an embodiment of the present invention; referring to fig. 10, the present embodiment provides another video processing method, where an execution subject of the method may be a video processing apparatus, the video processing apparatus may be implemented as software, or a combination of software and hardware, and specifically, when the video processing apparatus is implemented as hardware, it may be embodied as various electronic devices having data processing operations, including but not limited to a tablet computer, a personal computer PC, a server, and the like. When the video processing apparatus is implemented as software, it can be installed in the electronic devices exemplified above. Based on the above video processing apparatus, the video processing method in this embodiment may include the following steps:
step S1001: a plurality of video frames of a video to be processed are acquired.
Step S1002: and displaying a parameter configuration interface for processing the plurality of video frames.
After the video to be processed is acquired, in order to process the plurality of video frames of the video to be processed, a parameter configuration interface may be displayed. A parameter adjustment control for adjusting a learnable parameter is displayed in the parameter configuration interface, and a user may configure or adjust the learnable parameter through the control, for example, increase or decrease the learnable parameter to meet different video processing requirements, so that learnable parameters meeting different requirements can be obtained quickly.
Step S1003: and determining learnable parameters corresponding to the plurality of video frames through the parameter configuration operation obtained by the parameter configuration interface.
After the parameter configuration interface is displayed, a parameter configuration operation corresponding to the learnable parameter can be acquired through the parameter configuration interface, and the parameter configuration operation is used for generating or adjusting the learnable parameter corresponding to the video to be processed. In some examples, a default learnable parameter value (e.g., 0, 0.5, etc.) may be displayed in the parameter configuration interface, and the user may confirm or adjust the default learnable parameter through the parameter configuration interface. The default learnable parameter value may be determined by the number of the plurality of video frames; for example, when the number of the plurality of video frames is n, the default learnable parameter value may be 1/n.
In addition, the parameter adjustment control included in the parameter configuration interface may be a character input control: a user can input corresponding characters through the character input control, and the character input operation performed by the user through the character input control is the parameter configuration operation. For example, a default learnable parameter configured in advance (for example, 0.5) may be displayed in the parameter configuration interface. After the video to be processed is acquired, a character input control may be displayed in the parameter configuration interface, and the user directly inputs the corresponding characters through the character input control, for example the characters "0", "." and "6", so that a parameter configuration operation is obtained and the default learnable parameter 0.5 is adjusted to 0.6 through the character input operation.
In other examples, the parameter adjustment control included in the parameter configuration interface is a click control (a "+" control and a "-" control) or a slide control. When the parameter adjustment control is the click control, the user may increase the learnable parameter by clicking the "+" control and decrease the learnable parameter by clicking the "-" control; in this case, the obtained parameter configuration operation is a click operation. When the parameter adjustment control is a slide control, the user may decrease the learnable parameter by sliding left or downward and increase the learnable parameter by sliding right or upward; in this case, the obtained parameter configuration operation is a sliding operation.
After the parameter configuration operation is acquired through the parameter configuration interface, the learnable parameters may be generated or acquired based on the parameter configuration operation, and it can be understood that the learnable parameters corresponding to different numbers of video frames are different.
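Purely as an illustration of how a parameter configuration operation obtained from the interface could be mapped to an updated learnable parameter, the following sketch assumes a simple operation descriptor; the field names and the click step size are hypothetical and not taken from the embodiment:

```python
# Illustrative sketch only: mapping a parameter configuration operation from the
# interface to an updated learnable parameter value. The operation descriptor
# fields and the click step size are hypothetical.
def apply_config_operation(current: float, op: dict, step: float = 0.05) -> float:
    if op["type"] == "input":   # character input control, e.g. {"type": "input", "value": "0.6"}
        return float(op["value"])
    if op["type"] == "click":   # "+" / "-" click control
        return current + step if op.get("direction") == "+" else current - step
    if op["type"] == "slide":   # slide control, e.g. {"type": "slide", "delta": -0.1}
        return current + op["delta"]
    return current
```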
Step S1004: and fusing the plurality of video frames based on the learnable parameters respectively corresponding to the plurality of video frames to obtain a fused frame corresponding to the video to be processed.
Step S1005: and displaying the fused frame.
After the fusion frame is acquired, in order to enable a user to intuitively know an image generation effect corresponding to the fusion frame, the fusion frame generated by using learnable parameters may be displayed in a display interface or a parameter configuration interface, and the fusion frame may represent a video to be processed.
It should be noted that when different learnable parameters are obtained by adjusting the learnable parameters in the parameter configuration interface, fusion frames corresponding to the different learnable parameters may be displayed in a preset area of the parameter configuration interface. For example, after the plurality of video frames are acquired, when a user obtains learnable parameters ai corresponding to the video frames through the parameter configuration interface, the fusion frame corresponding to the learnable parameters ai may be displayed in a preset area of the parameter configuration interface, so that the user can directly check the generation effect of the fusion frame through the interface. If the generation effect of the fusion frame does not meet the user requirement or the quality is poor, the user can continue to adjust or configure the learnable parameters through the parameter configuration interface to obtain learnable parameters bi different from the learnable parameters ai; the fusion frame corresponding to the learnable parameters bi is then displayed in the preset area, so that the user can again check the generation effect directly through the interface. If the fusion frame now meets the user requirement, the configuration operation on the learnable parameters can be stopped. In this way, the learnable parameters can be adjusted flexibly and freely through the interactive operation between the user and the parameter configuration interface, the processing quality and effect of the generated fusion frame can be checked immediately through the interface, and the user can visually judge whether the generated fusion frame meets the requirement; if not, the learnable parameters can be adjusted again, and if so, the fusion frame can be directly generated or output.
The method in this embodiment may further include the method in the embodiment shown in fig. 1 to 7, and reference may be made to the related description of the embodiment shown in fig. 1 to 7 for a part not described in detail in this embodiment. The implementation process and technical effect of the technical solution refer to the descriptions in the embodiments shown in fig. 1 to 7, and are not described herein again.
Fig. 11 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present invention; referring to fig. 11, the present embodiment provides a video processing apparatus capable of executing the video processing method shown in fig. 10, and specifically, the video processing apparatus may include:
a second obtaining module 31, configured to obtain multiple video frames of a video to be processed;
a second display module 32, configured to display a parameter configuration interface for processing the plurality of video frames;
a second determining module 33, configured to determine, through the parameter configuration operation obtained through the parameter configuration interface, a learnable parameter corresponding to each of the plurality of video frames;
a second processing module 34, configured to fuse the multiple video frames based on learnable parameters corresponding to the multiple video frames, to obtain a fused frame corresponding to the video to be processed;
and the second display module 32 is further configured to display the fused frame.
The apparatus in this embodiment may also perform the method in the embodiments shown in fig. 1 to 7, and reference may be made to the related description of the embodiments shown in fig. 1 to 7 for a part not described in detail in this embodiment. The implementation process and technical effect of the technical solution refer to the descriptions in the embodiments shown in fig. 1 to fig. 7, which are not described herein again.
In one possible design, the structure of the video processing apparatus shown in fig. 11 may be implemented as an electronic device, which may be a tablet computer, a personal computer PC, a conference room device, a server, or other various devices. As shown in fig. 12, the electronic device may include: a second processor 41 and a second memory 42. Wherein the second memory 42 is used for storing the program of the corresponding electronic device for executing the video processing method in the embodiment shown in fig. 10, and the second processor 41 is configured for executing the program stored in the second memory 42.
The program comprises one or more computer instructions, wherein the one or more computer instructions, when executed by the second processor 41, are capable of performing the steps of: acquiring a plurality of video frames of a video to be processed; displaying a parameter configuration interface for processing the plurality of video frames; determining learnable parameters corresponding to the plurality of video frames through parameter configuration operation obtained by the parameter configuration interface; fusing the plurality of video frames based on the learnable parameters respectively corresponding to the plurality of video frames to obtain a fused frame corresponding to the video to be processed; and displaying the fused frame.
Further, the second processor 41 is also used to execute all or part of the steps in the embodiment shown in fig. 10. The electronic device may further include a second communication interface 44 for communicating with other devices or a communication network.
In addition, an embodiment of the present invention provides a computer storage medium for storing computer software instructions for an electronic device, which includes a program for executing the video processing method in the embodiment shown in fig. 10.
Furthermore, an embodiment of the present invention provides a computer program product, including: the computer program, when executed by a processor of the electronic device, causes the processor to perform the steps of the video processing method shown in fig. 10 described above.
Fig. 13 is a flowchart illustrating a further video processing method according to an embodiment of the present invention. Referring to fig. 13, the present embodiment provides a further video processing method, where an execution subject of the method may be a video processing apparatus, and the video processing apparatus may be implemented as an augmented reality device; that is, the video processing method may be applied to an augmented reality device. The augmented reality device here refers to a device implemented by Extended Reality (XR) technology, where XR technology refers to a human-computer interactive environment combining the real and the virtual, generated by computer technology and wearable devices. XR may include Augmented Reality (AR), Virtual Reality (VR), Mixed Reality (MR), and Cinematic Reality (CR); in other words, XR is a generic term that specifically covers AR, VR, MR, and CR. In short, XR can be divided into multiple levels, ranging from a virtual world with limited sensor input to a fully immersive virtual world.
Specifically, the video processing method in this embodiment may include the following steps:
step S1301: a plurality of video frames of a video to be processed are acquired.
Step S1302: determining learnable parameters corresponding to the plurality of video frames respectively, wherein the learnable parameters are obtained through the video behavior recognition model, and the video behavior recognition model is a machine learning model.
Step S1303: and fusing the plurality of video frames based on the learnable parameters respectively corresponding to the plurality of video frames to obtain a fused frame corresponding to the video to be processed.
Step S1304: rendering the fused frame to a display screen of the augmented reality device.
After the fusion frame is acquired, in order to enable a user to intuitively know an image generation effect corresponding to the fusion frame through the augmented reality device, the fusion frame can be rendered to a display screen of the augmented reality device, and then the fusion frame generated by using the learnable parameters can be displayed in a display interface, wherein the fusion frame can represent a video to be processed.
The specific implementation process and implementation effect of steps S1301 to S1304 in this embodiment are similar to those of steps S201 to S203 in the foregoing embodiment, and specific reference may be made to the above statements, and details are not repeated here.
Fig. 14 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present invention; referring to fig. 14, the present embodiment provides a video processing apparatus that can be implemented as an augmented reality device, that is, the video processing apparatus can be applied to an augmented reality device; the device comprises:
the third obtaining module 51 is configured to obtain a plurality of video frames of a video to be processed.
A third determining module 52, configured to determine learnable parameters corresponding to each of the plurality of video frames, where the learnable parameters are obtained through the video behavior recognition model, and the video behavior recognition model is a machine learning model.
A third processing module 53, configured to fuse the multiple video frames based on the learnable parameters corresponding to the multiple video frames, so as to obtain a fused frame corresponding to the video to be processed.
A third rendering module 54, configured to render the fused frame to a display screen of the augmented reality device.
The apparatus in this embodiment may also perform the method in the embodiment shown in fig. 13, and for a part not described in detail in this embodiment, reference may be made to the relevant description of the embodiment shown in fig. 13. The implementation process and technical effect of the technical solution refer to the description in the embodiment shown in fig. 13, and are not described herein again.
In one possible design, the structure of the video processing apparatus shown in fig. 14 may be implemented as an electronic device, which may be various devices such as an augmented reality device. As shown in fig. 15, the electronic device may include: a third processor 61 and a third memory 62. Wherein the third memory 62 is used for storing the program for executing the video processing method in the embodiment shown in fig. 13, and the third processor 61 is configured for executing the program stored in the third memory 62.
The program comprises one or more computer instructions, wherein the one or more computer instructions, when executed by the third processor 61, are capable of performing the steps of: acquiring a plurality of video frames of a video to be processed; determining learnable parameters corresponding to the plurality of video frames respectively, wherein the learnable parameters are obtained through the video behavior recognition model, and the video behavior recognition model is a machine learning model; fusing the plurality of video frames based on the learnable parameters corresponding to the plurality of video frames respectively to obtain a fused frame corresponding to the video to be processed; rendering the fused frame to a display screen of the augmented reality device.
Further, the third processor 61 is also configured to execute all or part of the steps in the embodiment shown in fig. 13. The electronic device may further include a third communication interface 63 for communicating with other devices or a communication network.
In addition, an embodiment of the present invention provides a computer storage medium for storing computer software instructions for an electronic device, which includes a program for executing the video processing method in the embodiment shown in fig. 13.
Furthermore, an embodiment of the present invention provides a computer program product, including: the computer program, when executed by a processor of an electronic device, causes the processor to perform the steps of the video processing method shown in fig. 13 described above.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort. Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by adding a necessary general hardware platform, and of course, can also be implemented by a combination of hardware and software. With this understanding in mind, the above-described aspects and portions of the present technology which contribute substantially or in part to the prior art may be embodied in the form of a computer program product, which may be embodied on one or more computer-usable storage media having computer-usable program code embodied therein, including without limitation disk storage, CD-ROM, optical storage, and the like.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory. The memory may include forms of volatile memory in a computer readable medium, random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (14)

1. A video processing method, comprising:
acquiring a plurality of video frames of a video to be processed;
determining learnable parameters corresponding to the plurality of video frames respectively, wherein the learnable parameters are obtained through a video behavior recognition model, and the video behavior recognition model is a machine learning model;
and fusing the plurality of video frames based on the learnable parameters respectively corresponding to the plurality of video frames to obtain a fused frame corresponding to the video to be processed.
2. The method according to claim 1, wherein fusing the plurality of video frames based on the learnable parameters corresponding to the plurality of video frames, and obtaining a fused frame corresponding to the video to be processed, comprises:
when the learnable parameters are numerical values which are larger than zero and smaller than 1, carrying out weighted summation on the plurality of video frames based on the learnable parameters respectively corresponding to the plurality of video frames to obtain the fusion frame; or,
when the learnable parameters are numerical values larger than 1, normalizing the learnable parameters corresponding to the plurality of video frames respectively to obtain normalized parameters corresponding to the learnable parameters; and carrying out weighted summation on a plurality of video frames based on the normalization parameters to obtain the fusion frame.
3. The method of claim 1, wherein determining learnable parameters corresponding to each of the plurality of video frames comprises:
acquiring initialization parameters corresponding to the plurality of video frames respectively, wherein the initialization parameters are obtained based on the number of the plurality of video frames;
determining an initial fusion frame corresponding to the video to be processed based on the initialization parameters;
acquiring a first loss function corresponding to the initial fusion frame based on the video behavior recognition model;
and adjusting the initialization parameters based on the first loss function to obtain learnable parameters corresponding to the plurality of video frames respectively.
4. The method of claim 3, wherein obtaining a first loss function corresponding to the initial fused frame based on the video behavior recognition model comprises:
acquiring fusion frame features corresponding to the initial fusion frame, video features corresponding to the to-be-processed video, an initial prediction label corresponding to the initial fusion frame and a standard label corresponding to the to-be-processed video based on the video behavior recognition model;
acquiring a first feature loss function between the fusion frame feature and the video feature and a first label loss function between the initial prediction label and the standard label;
determining a first loss function corresponding to the initial fused frame based on the first feature loss function and the first tag loss function.
5. The method of claim 1, wherein after obtaining the fused frame corresponding to the video to be processed, the method further comprises:
acquiring learnable information corresponding to the video to be processed, wherein the learnable information is used for identifying spatial information and/or time information of the video to be processed;
and fusing the fusion frame and the learnable information to obtain a target fusion frame.
6. The method of claim 5, wherein fusing the fused frame and the learnable information to obtain a target fused frame comprises:
performing pixel-by-pixel summation processing on the learnable information and the fusion frame to obtain the target fusion frame; or,
performing pixel-by-pixel product processing on the learnable information and the fusion frame to obtain the target fusion frame; or,
and splicing the learnable information and the fusion frame to obtain the target fusion frame.
7. The method according to claim 5, wherein obtaining learnable information corresponding to the video to be processed comprises:
acquiring initialization information corresponding to the video to be processed;
fusing the initialization information and the fusion frame to obtain a process fusion frame;
acquiring a second loss function corresponding to the process fusion frame based on the video behavior recognition model;
and adjusting the initialization information based on the second loss function to obtain learnable information corresponding to each of the plurality of video frames.
8. The method of claim 7, wherein obtaining a second loss function corresponding to the process fusion frame based on the video behavior recognition model comprises:
acquiring process frame characteristics corresponding to the process fusion frame, video characteristics corresponding to the to-be-processed video, a frame prediction label corresponding to the process fusion frame and a standard label corresponding to the to-be-processed video based on the video behavior identification model;
obtaining a second feature loss function between the process frame feature and the video feature and a second label loss function between the frame prediction label and the standard label;
determining a second loss function corresponding to the process fused frame based on the second feature loss function and the second tag loss function.
9. The method of claim 7, wherein after obtaining the second penalty function corresponding to the process fusion frame, the method further comprises:
and adjusting the learnable parameters based on the second loss function to obtain target learning parameters corresponding to the learnable parameters.
10. The method of claim 1, wherein determining the learnable parameters corresponding to each of the plurality of video frames comprises:
acquiring initialization parameters corresponding to the video frames and initialization information corresponding to the to-be-processed video, wherein the initialization parameters are acquired based on the number of the video frames, and the initialization information is used for identifying space information and time information corresponding to the to-be-processed video;
determining an initial fusion frame corresponding to the video to be processed based on the initialization parameters;
fusing the initial fusion frame and the initialization information to obtain a process fusion frame;
acquiring a third loss function corresponding to the process fusion frame based on the video behavior recognition model;
and adjusting the initialization parameters based on the third loss function to obtain learnable parameters corresponding to the plurality of video frames respectively.
11. The method of claim 10, wherein after obtaining a third loss function corresponding to the process fusion frame, the method further comprises:
and adjusting the initialization information based on the third loss function to obtain learnable information corresponding to the video to be processed.
12. The method of claim 10, wherein obtaining a third loss function corresponding to the process fusion frame based on the video behavior recognition model comprises:
acquiring a fusion frame feature corresponding to the initial fusion frame, an initial prediction label corresponding to the initial fusion frame, a process frame feature corresponding to the process fusion frame, a video feature corresponding to the video to be processed, a frame prediction label corresponding to the process fusion frame, and a standard label corresponding to the video to be processed based on the video behavior recognition model;
obtaining a first sub-loss function between the fused frame feature and the video feature, a second sub-loss function between the initial prediction label and the standard label, a third sub-loss function between the process frame feature and the video feature, and a fourth sub-loss function between the frame prediction label and the standard label;
determining a third loss function corresponding to the process fusion frame based on the first, second, third, and fourth sub-loss functions.
13. A video processing method, comprising:
acquiring a plurality of video frames of a video to be processed;
displaying a parameter configuration interface for processing the plurality of video frames;
determining learnable parameters corresponding to the plurality of video frames through parameter configuration operation obtained by the parameter configuration interface;
fusing the plurality of video frames based on the learnable parameters respectively corresponding to the plurality of video frames to obtain a fused frame corresponding to the video to be processed;
and displaying the fused frame.
14. A video processing method is applied to an augmented reality device, and the method comprises the following steps:
acquiring a plurality of video frames of a video to be processed;
determining learnable parameters corresponding to the plurality of video frames respectively, wherein the learnable parameters are obtained through a video behavior recognition model, and the video behavior recognition model is a machine learning model;
fusing the plurality of video frames based on the learnable parameters corresponding to the plurality of video frames respectively to obtain a fused frame corresponding to the video to be processed;
rendering the fused frame to a display screen of the augmented reality device.
CN202211099158.XA 2022-09-09 2022-09-09 Video processing method and device Active CN115205763B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211099158.XA CN115205763B (en) 2022-09-09 2022-09-09 Video processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211099158.XA CN115205763B (en) 2022-09-09 2022-09-09 Video processing method and device

Publications (2)

Publication Number Publication Date
CN115205763A true CN115205763A (en) 2022-10-18
CN115205763B CN115205763B (en) 2023-02-17

Family

ID=83572115

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211099158.XA Active CN115205763B (en) 2022-09-09 2022-09-09 Video processing method and device

Country Status (1)

Country Link
CN (1) CN115205763B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101896A (en) * 2018-07-19 2018-12-28 电子科技大学 A kind of video behavior recognition methods based on temporal-spatial fusion feature and attention mechanism
WO2020221278A1 (en) * 2019-04-29 2020-11-05 北京金山云网络技术有限公司 Video classification method and model training method and apparatus thereof, and electronic device
US20220051025A1 (en) * 2019-11-15 2022-02-17 Tencent Technology (Shenzhen) Company Limited Video classification method and apparatus, model training method and apparatus, device, and storage medium
CN111683269A (en) * 2020-06-12 2020-09-18 腾讯科技(深圳)有限公司 Video processing method, video processing device, computer equipment and storage medium
CN114627397A (en) * 2020-12-10 2022-06-14 顺丰科技有限公司 Behavior recognition model construction method and behavior recognition method
CN114067381A (en) * 2021-04-29 2022-02-18 中国科学院信息工程研究所 Deep forgery identification method and device based on multi-feature fusion
CN114283350A (en) * 2021-09-17 2022-04-05 腾讯科技(深圳)有限公司 Visual model training and video processing method, device, equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HAIYU ZHAO et al.: "Spindle Net: Person Re-Identification With Human Body Region Guided Feature Decomposition and Fusion", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) *
ZHOU Yun et al.: "Behavior Recognition Method Based on Two-Stream Non-Local Residual Network", Journal of Computer Applications *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116017010A (en) * 2022-12-01 2023-04-25 凡游在线科技(成都)有限公司 Video-based AR fusion processing method, electronic device and computer readable medium
CN116017010B (en) * 2022-12-01 2024-05-17 凡游在线科技(成都)有限公司 Video-based AR fusion processing method, electronic device and computer readable medium

Also Published As

Publication number Publication date
CN115205763B (en) 2023-02-17

Similar Documents

Publication Publication Date Title
CN108228703B (en) Image question-answering method, device, system and storage medium
CN113994384A (en) Image rendering using machine learning
CN111738243B (en) Method, device and equipment for selecting face image and storage medium
US11521038B2 (en) Electronic apparatus and control method thereof
CN111160569A (en) Application development method and device based on machine learning model and electronic equipment
CN114155543A (en) Neural network training method, document image understanding method, device and equipment
Lytvyn et al. System development for video stream data analyzing
KR101617649B1 (en) Recommendation system and method for video interesting section
WO2022068320A1 (en) Computer automated interactive activity recognition based on keypoint detection
US11417096B2 (en) Video format classification and metadata injection using machine learning
CN112883257B (en) Behavior sequence data processing method and device, electronic equipment and storage medium
US10904476B1 (en) Techniques for up-sampling digital media content
JP2022533690A (en) Movie Success Index Prediction
CN112052759B (en) Living body detection method and device
CN110414335A (en) Video frequency identifying method, device and computer readable storage medium
CN115205763B (en) Video processing method and device
CN112149642A (en) Text image recognition method and device
CN114330499A (en) Method, device, equipment, storage medium and program product for training classification model
US11636282B2 (en) Machine learned historically accurate temporal classification of objects
CN116702835A (en) Neural network reasoning acceleration method, target detection method, device and storage medium
US20200153873A1 (en) Filtering media data in an internet of things (iot) computing environment
CN109960745B (en) Video classification processing method and device, storage medium and electronic equipment
CN113705293A (en) Image scene recognition method, device, equipment and readable storage medium
US20200050898A1 (en) Intelligent personalization of operations of an image capturing device
CN109948426A (en) Application program method of adjustment, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant