CN116958638A - Video processing method and device and storage medium - Google Patents

Video processing method and device and storage medium Download PDF

Info

Publication number
CN116958638A
Authority
CN
China
Prior art keywords
video
granularity
segmentation
unit
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310493940.8A
Other languages
Chinese (zh)
Inventor
陈科宇
吴昊谦
谯睿智
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202310493940.8A
Publication of CN116958638A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a video processing method, a video processing device, and a storage medium. The video to be processed is split based on a unit granularity to obtain a plurality of video units; multi-modal feature extraction is performed on the video units respectively to obtain target features; and the segmentation probability that the point corresponding to each target feature belongs to the segmentation points under a segmentation granularity is determined, so that the video to be processed is segmented according to the segmentation probability, where the granularity level corresponding to the segmentation granularity is greater than the granularity level corresponding to the unit granularity. A multi-level video segmentation process is thus realized; because the video is segmented at different granularity levels and the hierarchical relationship among the granularity levels is exploited, the influence of parameter differences across the recognition processes of different levels on the segmentation result is reduced, and the accuracy of video segmentation is improved.

Description

Video processing method and device and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and apparatus for processing video, and a storage medium.
Background
With the rapid development of internet technology, the demand for video content is increasing. Extracting targeted content through video segmentation has become an important way of processing video content.
Generally, to segment videos at different granularities, a separate model can be built for each granularity of segmentation scenario, so as to perform the multi-granularity segmentation task.
However, when each granularity is modeled separately, the segmentation effect differs from granularity to granularity, and some granularities may not be well handled, which affects the accuracy of video segmentation.
Disclosure of Invention
In view of this, the present application provides a video processing method, which can effectively improve the accuracy of video segmentation.
The first aspect of the present application provides a video processing method, which can be applied to a system or a program including a video processing function in a terminal device, and specifically includes:
acquiring an input video to be processed;
splitting the video to be processed based on unit granularity to obtain a plurality of video units;
performing multi-mode feature extraction on the video units respectively to obtain unit features corresponding to the video units;
adding position features to the unit features based on the time sequence relation among the unit features to obtain target features;
determining the segmentation probability that the segmentation points corresponding to the target features belong to segmentation points under a plurality of segmentation granularities, so as to segment the video to be processed according to the segmentation probability, wherein the granularity level corresponding to the segmentation granularities is larger than the granularity level corresponding to the unit granularity, and the video segments of the video to be processed under the segmentation points under the segmentation granularities contain corresponding video units.
Optionally, in some possible implementations of the present application, the slicing the video to be processed based on the unit granularity to obtain a plurality of video units includes:
acquiring event information corresponding to the unit granularity;
determining the sampling frame number corresponding to the unit granularity according to the content change condition indicated in the event information;
and cutting the video to be processed based on the sampling frame number to obtain a plurality of video units.
Optionally, in some possible implementations of the present application, the slicing the video to be processed based on the sampling frame number to obtain a plurality of video units includes:
acquiring a video type corresponding to the video to be processed;
determining a segmentation parameter corresponding to the video type;
adjusting the sampling frame number based on the segmentation parameter to obtain an adjusted frame number;
and cutting the video to be processed based on the adjustment frame number to obtain a plurality of video units.
Optionally, in some possible implementations of the present application, adding a location feature to the unit feature based on a timing relationship between the unit features to obtain a target feature includes:
acquiring a preset time sequence length configured for the video to be processed;
determining a time sequence relation among the unit features based on the preset time sequence length so as to determine position information corresponding to the unit features;
if the position information indicates that the position of the unit feature is odd, a first position formula is called to calculate the position information to obtain a first time sequence feature;
if the position information indicates that the positions of the unit features are even, a second position formula is called to calculate the position information so as to obtain second time sequence features;
and adding the first time sequence feature and the second time sequence feature to the unit feature to obtain the target feature.
Optionally, in some possible implementations of the present application, determining the segmentation probability that the segmentation point corresponding to the target feature belongs to a segmentation point under a plurality of segmentation granularities, so as to segment the video to be processed according to the segmentation probability includes:
inputting the target feature into a classification header corresponding to a first granularity under the segmentation granularity;
matrix multiplying the target feature and the classification head with the first granularity to obtain a first similarity score of a corresponding cutting point of the target feature and the classification head with the first granularity;
Determining a first segmentation probability that a segmentation point corresponding to the target feature belongs to a segmentation point under a first granularity based on the first similarity score;
inputting the target features into a classification head corresponding to a second granularity under the segmentation granularity, wherein the granularity level corresponding to the second granularity is larger than the granularity level corresponding to the first granularity, and the video fragments of the video to be processed under the segmentation point under the second granularity comprise the video fragments under the segmentation point under the first granularity;
matrix multiplying the target feature with the classification head of the second granularity to obtain a second similarity score of the target feature and a corresponding cutting point of the classification head of the second granularity;
determining a second segmentation probability of the segmentation point corresponding to the target feature belonging to the segmentation point under a second granularity based on the second similarity score;
and comparing the first segmentation probability with the second segmentation probability to segment the video to be processed.
Optionally, in some possible implementations of the present application, the method further includes:
acquiring training videos, wherein the training videos are provided with segmentation point labels for video units under different granularities;
Segmenting the training video based on unit granularity to obtain a plurality of training units;
performing multi-mode feature extraction on the training units respectively to obtain training unit features corresponding to the training units;
adding position features to the training unit features based on the time sequence relation among the training unit features to obtain training features;
respectively inputting the training features into a plurality of classification heads under training granularity to obtain similarity scores corresponding to the classification heads;
sorting the similarity scores corresponding to the classification heads according to increasing granularity of the classification heads;
configuring ordering loss information based on the ordered score sequence and the segmentation point label of the training video mark;
training each corresponding classification head based on the ordering loss information.
Optionally, in some possible implementations of the present application, the training each corresponding classification header based on the ordering loss information includes:
acquiring granularity information corresponding to different classification heads;
configuring cross entropy loss information between different granularities based on the granularity information;
training each corresponding classification header based on the cross entropy loss information and the ordering loss information.
A second aspect of the present application provides a video processing apparatus, including:
the acquisition unit is used for acquiring the input video to be processed;
the segmentation unit is used for segmenting the video to be processed based on unit granularity so as to obtain a plurality of video units;
the processing unit is used for respectively carrying out multi-mode feature extraction on the video units so as to obtain unit features corresponding to the video units;
the processing unit is further used for adding position features to the unit features based on the time sequence relation among the unit features so as to obtain target features;
the processing unit is further configured to determine a segmentation probability that a segmentation point corresponding to the target feature belongs to a segmentation point under a plurality of segmentation granularities, so as to segment the video to be processed according to the segmentation probability, a granularity level corresponding to the segmentation granularity is greater than a granularity level corresponding to the unit granularity, and a video segment of the video to be processed under the segmentation point under the segmentation granularity contains a corresponding video unit.
Optionally, in some possible implementation manners of the present application, the splitting unit is specifically configured to obtain event information corresponding to the granularity of the unit;
The segmentation unit is specifically configured to determine a sampling frame number corresponding to the unit granularity according to a content change condition indicated in the event information;
the segmentation unit is specifically configured to segment the video to be processed based on the sampling frame number, so as to obtain a plurality of video units.
Optionally, in some possible implementation manners of the present application, the splitting unit is specifically configured to obtain a video type corresponding to the video to be processed;
the segmentation unit is specifically configured to determine a segmentation parameter corresponding to the video type;
the segmentation unit is specifically configured to adjust the sampling frame number based on the segmentation parameter to obtain an adjusted frame number;
the segmentation unit is specifically configured to segment the video to be processed based on the adjustment frame number, so as to obtain a plurality of video units.
Optionally, in some possible implementations of the present application, the processing unit is specifically configured to obtain a preset time sequence length for the video configuration to be processed;
the processing unit is specifically configured to determine a timing relationship between each unit feature based on the preset timing length, so as to determine location information corresponding to the unit feature;
The processing unit is specifically configured to invoke a first location formula to calculate the location information to obtain a first timing feature if the location information indicates that the location of the unit feature is odd;
the processing unit is specifically configured to invoke a second location formula to calculate the location information to obtain a second timing sequence feature if the location information indicates that the location of the unit feature is even;
the processing unit is specifically configured to add the first timing characteristic and the second timing characteristic to the unit characteristic, so as to obtain the target characteristic.
Optionally, in some possible implementations of the present application, the processing unit is specifically configured to input the target feature into a classification header corresponding to the segmentation granularity;
the processing unit is specifically configured to perform matrix multiplication on the target feature and the classification head with the segmentation granularity, so as to obtain a first similarity score of a segmentation point corresponding to the target feature and the classification head with the segmentation granularity;
the processing unit is specifically configured to determine, based on the first similarity score, a first segmentation probability that a segmentation point corresponding to the target feature belongs to a segmentation point under the segmentation granularity;
The processing unit is specifically configured to input the target feature into a classification header corresponding to the third granularity, where a granularity level corresponding to the third granularity is greater than a granularity level corresponding to the segmentation granularity, and the video segment under the segmentation point of the video to be processed under the third granularity includes the video segment under the segmentation point under the segmentation granularity;
the processing unit is specifically configured to perform matrix multiplication on the target feature and the classification head with the third granularity, so as to obtain a second similarity score of a corresponding cutting point of the target feature and the classification head with the third granularity;
the processing unit is specifically configured to determine, based on the second similarity score, a second segmentation probability that a segmentation point corresponding to the target feature belongs to a segmentation point under a third granularity;
the processing unit is specifically configured to compare the first segmentation probability with the second segmentation probability, so as to segment the video to be processed.
Optionally, in some possible implementations of the present application, the processing unit is specifically configured to obtain a training video, where the training video configures segmentation point labels for video units under different granularities;
The processing unit is specifically configured to segment the training video based on unit granularity to obtain a plurality of training units;
the processing unit is specifically configured to perform multi-mode feature extraction on the training units respectively, so as to obtain training unit features corresponding to the training units;
the processing unit is specifically configured to add a position feature to the training unit feature based on a time sequence relationship between the training unit features, so as to obtain a training feature;
the processing unit is specifically configured to input the training features into classification heads under multiple training granularities, so as to obtain similarity scores corresponding to the classification heads;
the processing unit is specifically configured to sort the similarity scores corresponding to the classification heads according to increasing granularity of the classification heads;
the processing unit is specifically configured to configure sorting loss information based on the sorted score sequences and the segmentation point labels of the training video markers;
the processing unit is specifically configured to train each corresponding classification header based on the ordering loss information.
Optionally, in some possible implementations of the present application, the processing unit is specifically configured to obtain granularity information corresponding to different classification heads;
The processing unit is specifically configured to configure cross entropy loss information between different granularities based on the granularity information;
the processing unit is specifically configured to train each corresponding classification header based on the cross entropy loss information and the ordering loss information.
A third aspect of the present application provides a computer apparatus comprising: a memory, a processor, and a bus system; the memory is used for storing program codes; the processor is configured to execute the video processing method according to the first aspect or any one of the first aspects according to an instruction in the program code.
A fourth aspect of the application provides a computer readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the method of processing video of the first aspect or any of the first aspects described above.
According to one aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions are read from a computer-readable storage medium by a processor of a computer device, which executes the computer instructions, causing the computer device to perform the method of processing video provided in the above-described first aspect or various alternative implementations of the first aspect.
From the above technical solutions, the embodiment of the present application has the following advantages:
acquiring an input video to be processed; then, segmenting the video to be processed based on the unit granularity to obtain a plurality of video units; performing multi-mode feature extraction on the video units respectively to obtain unit features corresponding to the video units; then adding position features for the unit features based on the time sequence relation among the unit features to obtain target features; and determining the segmentation probability that the segmentation point corresponding to the target feature belongs to the segmentation point under the segmentation granularity, so as to segment the video to be processed according to the segmentation probability, wherein the granularity level corresponding to the segmentation granularity is larger than the granularity level corresponding to the unit granularity, and the video fragment of the video to be processed under the segmentation point under the segmentation granularity contains the corresponding video unit. Therefore, a multi-level video segmentation process is realized, and as different granularity levels are adopted to segment videos, the hierarchical relationship among the granularity levels is utilized, the influence of parameter differences in different level identification processes on segmentation results is reduced, and the accuracy of video segmentation processing is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a network architecture diagram of the operation of a video processing system;
fig. 2 is a flowchart of video processing according to an embodiment of the present application;
fig. 3 is a flowchart of a video processing method according to an embodiment of the present application;
fig. 4 is a schematic view of a video processing method according to an embodiment of the present application;
FIG. 5 is a flowchart of another video processing method according to an embodiment of the present application;
fig. 6 is a schematic view of a scene of another video processing method according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a terminal device according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
The embodiment of the application provides a video processing method and a related device, which can be applied to a system or a program containing a video processing function in terminal equipment, and input videos to be processed are acquired; then, segmenting the video to be processed based on the unit granularity to obtain a plurality of video units; performing multi-mode feature extraction on the video units respectively to obtain unit features corresponding to the video units; then adding position features for the unit features based on the time sequence relation among the unit features to obtain target features; and determining the segmentation probability that the segmentation point corresponding to the target feature belongs to the segmentation point under the segmentation granularity, so as to segment the video to be processed according to the segmentation probability, wherein the granularity level corresponding to the segmentation granularity is larger than the granularity level corresponding to the unit granularity, and the video fragment of the video to be processed under the segmentation point under the segmentation granularity contains the corresponding video unit. Therefore, a multi-level video segmentation process is realized, and as different granularity levels are adopted to segment videos, the hierarchical relationship among the granularity levels is utilized, the influence of parameter differences in different level identification processes on segmentation results is reduced, and the accuracy of video segmentation processing is improved.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented, for example, in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "includes" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that the video processing method provided by the present application may be applied to a system or a program including a video processing function in a terminal device, for example, a video editing application, and specifically, the video processing system may operate in a network architecture as shown in fig. 1, which is a network architecture diagram operated by the video processing system, where, as shown in the fig. 1, the video processing system may provide a processing procedure of a video with multiple information sources, that is, through an uploading operation at a terminal side, the server performs segmentation processing of uploaded videos with different granularity levels; it will be appreciated that various terminal devices are shown in fig. 1, the terminal devices may be computer devices, in which a greater or lesser variety of terminal devices may participate in the processing of video in an actual scenario, the specific number and variety are not limited herein, and in addition, one server is shown in fig. 1, but in an actual scenario, there may also be multiple servers participating, and the specific number of servers is determined by the actual scenario.
In this embodiment, the server may be an independent physical server, or may be a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, and basic cloud computing services such as big data and artificial intelligence platforms. The terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, a smart voice interaction device, a smart home appliance, a vehicle-mounted terminal, and the like. The terminals and servers may be directly or indirectly connected by wired or wireless communication, and the terminals and servers may be connected to form a blockchain network, which is not limited herein.
It will be appreciated that the video processing system described above may be implemented in a personal mobile terminal, for example: the video editing application can be used as an application which can also be run on a server, and can also be used as a processing device which can be run on third-party equipment to provide video so as to obtain the processing result of the video of the information source; the specific video processing system may be in a program form, may also be operated as a system component in the device, and may also be used as a cloud service program, where the specific operation mode is determined by an actual scenario and is not limited herein.
With the rapid development of internet technology, the demand for video content is increasing. Extracting targeted content through video segmentation has become an important way of processing video content.
Generally, to segment videos at different granularities, a separate model can be built for each granularity of segmentation scenario, so as to perform the multi-granularity segmentation task.
However, when each granularity is modeled separately, the segmentation effect differs from granularity to granularity, and some granularities may not be well handled, which affects the accuracy of video segmentation.
In order to solve the above problems, the present application proposes a video processing method, which is applied to the video processing flow framework shown in fig. 2. As shown in fig. 2, in the flow framework of video processing provided by an embodiment of the present application, the server segments the input video in response to the editing operations of a user, and by introducing the hierarchical relationship inside the video into the ordering loss, the model is explicitly guided to learn the hierarchical relationship of the video. This process can be combined with various existing loss functions, and effectively improves the performance of multi-level video segmentation without adding extra computational complexity.
It can be understood that the method provided by the application can be a program writing method, which can be used as a processing logic in a hardware system, and can also be used as a video processing device, and the processing logic can be realized in an integrated or external mode. As one implementation manner, the video processing device obtains an input video to be processed; then, segmenting the video to be processed based on the unit granularity to obtain a plurality of video units; performing multi-mode feature extraction on the video units respectively to obtain unit features corresponding to the video units; then adding position features for the unit features based on the time sequence relation among the unit features to obtain target features; and determining the segmentation probability that the segmentation point corresponding to the target feature belongs to the segmentation point under the segmentation granularity, so as to segment the video to be processed according to the segmentation probability, wherein the granularity level corresponding to the segmentation granularity is larger than the granularity level corresponding to the unit granularity, and the video fragment of the video to be processed under the segmentation point under the segmentation granularity contains the corresponding video unit. Therefore, a multi-level video segmentation process is realized, and as different granularity levels are adopted to segment videos, the hierarchical relationship among the granularity levels is utilized, the influence of parameter differences in different level identification processes on segmentation results is reduced, and the accuracy of video segmentation processing is improved.
The scheme provided by the embodiment of the application relates to an artificial intelligence computer vision technology, and is specifically described by the following embodiments:
with reference to fig. 3, fig. 3 is a flowchart of a video processing method provided by an embodiment of the present application, where the method may be executed by a server or a terminal, and the embodiment of the present application at least includes the following steps:
301. and acquiring the input video to be processed.
In this embodiment, the video to be processed may be a news video, an advertisement video, a movie video or a video in other media, and the embodiment is illustrated by taking a news video as an example, but is not limited thereto.
Specifically, the processing scene of the news video can be applied to news catalogs in the broadcast and television industry, namely, the complete news video is segmented into various units such as fragments, scenes, shots and the like according to different semantic granularity. After the news video is subjected to multi-level segmentation, the method can be applied to downstream news media information arrangement and news media information searching tasks.
302. And splitting the video to be processed based on the unit granularity to obtain a plurality of video units.
In this embodiment, the unit granularity is the event granularity that represents the video content. For a news video, the formal definitions of the levels corresponding to the multiple granularities include:
Segment: contains a complete, independent news event.
Scene: contains a complete location, such as a studio scene or an outdoor scene. A segment may contain several scenes.
Shot: a continuous picture captured from one camera position. A scene may contain several shots.
Event: a change in the number of persons, a change in the background music, and the like. A shot may contain several events.
Thus, in terms of granularity size, segment > scene > shot > event. The event is the unit granularity; the segment, the scene and the shot are the segmentation granularities, and granularity attribution is calculated through the hierarchical relationship between the video units at the unit granularity and the segments at the segmentation granularities.
When the number of persons, the sound information, or the like in a picture changes, an event change may be defined. An open-source general event detection scheme can be used to perform event segmentation on the original video, thereby obtaining a number of consecutive event units. An event consists of several consecutive frames. Because the content and form of the frames inside an event change little, these frames can be sparsely sampled to speed up processing. For each event, 5 frames may be uniformly sampled to represent the event.
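As a minimal sketch of this uniform sampling step, assuming each event is given as an inclusive (start_frame, end_frame) index range (the helper name and this representation are illustrative assumptions):

```python
def sample_event_frames(start_frame: int, end_frame: int, num_samples: int = 5) -> list[int]:
    """Uniformly sample frame indices to represent one event unit.

    Assumes the event covers the inclusive frame range [start_frame, end_frame].
    """
    if end_frame <= start_frame:
        return [start_frame] * num_samples
    step = (end_frame - start_frame) / (num_samples - 1)
    return [round(start_frame + i * step) for i in range(num_samples)]

# Example: an event spanning frames 120..179 is represented by 5 frames.
print(sample_event_frames(120, 179))  # [120, 135, 150, 164, 179]
```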
Specifically, for the frame sampling process, the event information corresponding to the unit granularity may first be obtained; the sampling frame number corresponding to the unit granularity is then determined according to the content change condition indicated in the event information; and the video to be processed is then segmented based on the sampling frame number to obtain a plurality of video units.
Optionally, because the degree of content association differs across video types, frame sampling of different lengths can be performed. That is, the video type corresponding to the video to be processed is first obtained; a segmentation parameter corresponding to the video type is then determined; the sampling frame number is adjusted based on the segmentation parameter to obtain an adjusted frame number; and the video to be processed is then segmented based on the adjusted frame number to obtain a plurality of video units, thereby improving the accuracy of content sampling.
Alternatively, sampling may be performed over a fixed time period, such as 1 second or 2 seconds. The specific process is similar to frame sampling and is not described here in detail.
303. And respectively carrying out multi-mode feature extraction on the video units to obtain unit features corresponding to each video unit.
In this embodiment, the multi-modal feature extraction includes the extraction of visual features, text features, and audio features.
Specifically, for visual feature extraction, a ResNet+NetVLAD model may be used on the 5 sampled frames of each determined event, outputting a feature f_visual with a dimension of 2048.
For text features, audio in the video may be separated out using a ffmpeg tool library. And after the audio file is obtained, calling a voice recognition module to obtain ASR information of the whole video. An event-level ASR alignment operation is then performed, specifically for an event E (start time, end time), all text in the ASR information at that time can be matched as text for that event. After obtaining the text of the event, the Bert model may be used to extract the text feature f_text of each event, with dimension 768.
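For the event-level ASR alignment just described, a minimal sketch could look as follows; the (start, end, text) tuple layout of the ASR segments is an assumption:

```python
def match_asr_to_event(event_start: float, event_end: float, asr_segments) -> str:
    """Collect all ASR text whose time range overlaps the event's (start, end) interval.

    asr_segments: iterable of (start_sec, end_sec, text) tuples (assumed layout).
    """
    texts = [text for seg_start, seg_end, text in asr_segments
             if seg_start < event_end and seg_end > event_start]
    return " ".join(texts)

# Example: an event from 12.0s to 18.5s collects the overlapping ASR lines.
asr = [(10.0, 13.0, "..."), (13.0, 19.0, "..."), (20.0, 25.0, "...")]
event_text = match_asr_to_event(12.0, 18.5, asr)
```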
In addition, for audio features, the audio in the video is separated out using the ffmpeg tool library. After the audio file is obtained, the audio is downsampled to 16 kHz, and an audioTag model is used to extract the audio feature f_audio with a dimension of 2048.
After the multi-modal features are obtained, feature fusion is carried out: the visual features, text features and audio features are concatenated to obtain the multi-modal spliced feature according to the following formula:

f_multi-modal = Concat(f_visual, f_text, f_audio)

where the dimension of f_multi-modal is 4864, the sum of the single-modality dimensions.
After the spliced feature is obtained, the following formula is executed for fusion:

f_multi-modal-fuse = MLP(f_multi-modal)

That is, a multi-layer perceptron fuses the spliced feature and reduces its dimension; the dimension of the fused, dimension-reduced feature f_multi-modal-fuse is 512. In this way, the multi-modal unit features corresponding to all the video units are obtained.
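A minimal sketch of this concatenate-and-fuse step, assuming PyTorch and the stated dimensions (2048 visual, 768 text, 2048 audio, fused to 512); the module name and the hidden width of the MLP are assumptions:

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Concatenate per-event visual/text/audio features and fuse them with an MLP."""

    def __init__(self, d_visual=2048, d_text=768, d_audio=2048, d_fuse=512):
        super().__init__()
        d_concat = d_visual + d_text + d_audio  # 4864, the sum of single-modality dims
        self.mlp = nn.Sequential(
            nn.Linear(d_concat, 1024),  # hidden width is an assumption
            nn.ReLU(),
            nn.Linear(1024, d_fuse),
        )

    def forward(self, f_visual, f_text, f_audio):
        # f_multi-modal = Concat(f_visual, f_text, f_audio)
        f_multi_modal = torch.cat([f_visual, f_text, f_audio], dim=-1)
        # f_multi-modal-fuse = MLP(f_multi-modal)
        return self.mlp(f_multi_modal)

# One event unit: 2048-d visual, 768-d text, 2048-d audio -> 512-d fused feature.
fusion = MultiModalFusion()
f = fusion(torch.randn(1, 2048), torch.randn(1, 768), torch.randn(1, 2048))
print(f.shape)  # torch.Size([1, 512])
```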
304. And adding position features for the unit features based on the time sequence relation among the unit features to obtain target features.
In this embodiment, temporal modeling is performed using a Transformer-Encoder. Specifically, a preset time sequence length configured for the video to be processed can first be obtained; the time sequence relationship among the unit features is then determined based on the preset time sequence length, so as to determine the position information corresponding to each unit feature; for example, the time sequence length is set to 100 event units. A position feature is then added to each event unit feature: if the position information indicates that the position of the unit feature is odd, a first position formula is called to calculate the position information and obtain a first time sequence feature;
if the position information indicates that the position of the unit feature is even, a second position formula is called to calculate the position information and obtain a second time sequence feature.
Then, the first time sequence feature and the second time sequence feature are added to the unit feature to obtain the target feature, where the addition adds the multi-modal feature of the event unit and the position feature of the event unit element by element; the specific addition formula is as follows:

f_multi-modal-fuse-pos = elementwiseAdd(f_multi-modal-fuse, PE)
The event feature f_multi-modal-fuse-pos, which fuses the multi-modal feature and the position feature, is thus obtained and then fed into the Transformer-Encoder module for temporal modeling; the Transformer-Encoder module outputs the target feature with the position feature added.
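A minimal sketch of this temporal-modeling step, assuming the standard sinusoidal position encoding as a stand-in for the first and second position formulas (whose exact form is not given in the text here; the standard encoding alternates sin/cos over feature dimensions rather than over odd/even unit positions) and a PyTorch Transformer encoder; the number of layers and heads are assumptions, and the 512-to-256 reduction mentioned later in the text is not modeled:

```python
import math
import torch
import torch.nn as nn

def sinusoidal_position_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Standard sinusoidal PE; assumed stand-in for the first/second position formulas."""
    pe = torch.zeros(seq_len, d_model)
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                    * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(pos * div)   # one formula for one set of indices
    pe[:, 1::2] = torch.cos(pos * div)   # the other formula for the remaining indices
    return pe

d_model, seq_len = 512, 100
f_multi_modal_fuse = torch.randn(seq_len, d_model)           # fused event-unit features
pe = sinusoidal_position_encoding(seq_len, d_model)
# f_multi-modal-fuse-pos = elementwiseAdd(f_multi-modal-fuse, PE)
f_with_pos = f_multi_modal_fuse + pe

encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
f_temporal = encoder(f_with_pos.unsqueeze(1)).squeeze(1)     # target features with temporal context
print(f_temporal.shape)  # torch.Size([100, 512])
```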
305. Determining the segmentation probability that the segmentation points corresponding to the target features belong to the segmentation points under the segmentation granularity, so as to segment the video to be processed according to the segmentation probability.
In this embodiment, the granularity level corresponding to the segmentation granularity is greater than the granularity level corresponding to the unit granularity, and the video segment of the video to be processed under a segmentation point of the segmentation granularity contains the corresponding video units. That is, for the 3-level (segment, scene, shot) structure that is exactly aligned on atomic event units (i.e., the basic units are all event units), the boundary of a higher-level (coarse-granularity) segmentation task is also a boundary of a lower-level (fine-granularity) task; for example, if an event unit is a segment segmentation point, it must also be a scene and shot segmentation point, whereas the opposite is not necessarily true. Thus, the segmentation confidence score of a coarser-granularity task should not exceed that of a finer-granularity task.
Specifically, for the processing procedure of the multiple segmentation tasks, the target feature is first input into the classification head corresponding to a first granularity under the segmentation granularities; the target feature is then matrix-multiplied with the classification head of the first granularity to obtain a first similarity score between the target feature and the segmentation points corresponding to the classification head of the first granularity; and based on the first similarity score, a first segmentation probability that the segmentation point corresponding to the target feature belongs to a segmentation point under the first granularity is determined. In addition, the target feature is input into the classification head corresponding to a second granularity under the segmentation granularities, where the granularity level corresponding to the second granularity is greater than the granularity level corresponding to the first granularity, and the video segment of the video to be processed under a segmentation point of the second granularity includes the video segments under the segmentation points of the first granularity; for example, the first granularity is the shot and the second granularity is the scene. The target feature is then matrix-multiplied with the classification head of the second granularity to obtain a second similarity score between the target feature and the segmentation points corresponding to the classification head of the second granularity; based on the second similarity score, a second segmentation probability that the segmentation point corresponding to the target feature belongs to a segmentation point under the second granularity is determined; and the first segmentation probability and the second segmentation probability are compared to segment the video to be processed.
In one possible scenario, for a given news video A, the scheme outputs the result of multi-level segmentation of A. Assume that A contains N event units; for each event unit, the Segment, Scene and Shot results are output respectively. For example, for event unit i, the outputs (Segment_i, Scene_i, Shot_i) ∈ {0, 1}. Taking Segment as an example, Segment_i = 0 indicates that the event unit is an internal node of a segment, and Segment_i = 1 indicates that the event unit is the start node of a segment; the same applies to Scene and Shot.
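A sketch of how these per-event Segment/Scene/Shot decisions could be produced, assuming each classification head is a linear layer applied to the temporal feature (the matrix multiplication described above); the names, the 256-d temporal width (taken from the training description later in the text) and the 0.5 decision threshold are assumptions:

```python
import torch
import torch.nn as nn

d_temporal, K = 256, 100                          # temporal feature width, event units per input
heads = nn.ModuleDict({
    "segment": nn.Linear(d_temporal, 2),          # coarsest of the three segmentation granularities
    "scene": nn.Linear(d_temporal, 2),
    "shot": nn.Linear(d_temporal, 2),             # finest of the three
})

f_temporal = torch.randn(K, d_temporal)           # target features after temporal modeling
results = {}
for name, head in heads.items():
    logits = head(f_temporal)                     # K x 2: matrix multiplication with the head
    probs = torch.softmax(logits, dim=-1)         # dimension 1 = probability of being a cut point
    results[name] = (probs[:, 1] > 0.5).long()    # 1: start node of a unit, 0: internal node

# For event unit i: results["segment"][i], results["scene"][i], results["shot"][i] ∈ {0, 1}.
```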
The segmentation process can be performed based on a segmentation model, that is, the segmentation model is provided with a plurality of classification heads. The training process of the segmentation model is described below in combination with a scenario in which the unit granularity is the event. Fig. 4 is a schematic view of a video processing method according to an embodiment of the present application; the figure shows multi-modal feature extraction based on event units, multi-modal temporal feature fusion based on a Transformer-Encoder, a multi-level loss function based on the ordering relationship, and the model training process. For the model training process, this embodiment combines the association relationship among the levels of different granularities: first, a training video is acquired, where the training video is provided with segmentation point labels for video units under different granularities; the training video is then segmented based on the unit granularity to obtain a plurality of training units; multi-modal feature extraction is performed on the training units respectively to obtain training unit features corresponding to the training units; position features are added to the training unit features based on the time sequence relationship among the training unit features to obtain training features; the training features are then input into the classification heads under a plurality of training granularities respectively to obtain the similarity scores corresponding to the classification heads; and next, the ordering loss information is configured.
Because of the multi-task segmentation process, the effectiveness of the multiple segmentation tasks is optimized by ordering the confidence scores between the different tasks through a hierarchical loss based on the ordering relationship. It may comprise two parts: one is to calculate, for each hierarchy, the segmentation confidence of each event unit; the other is to order the confidence scores by increasing granularity of the tasks (from fine to coarse) in a non-increasing sequence and to calculate the loss value obtained after ordering, as shown in the following formula:
the hierarchical loss based on the ordering relation adopts a formula based on sigmoid function and Stop Gradient operation when calculating the similarity score. I.e., for two tasks H and L, where H represents a higher semantic level, coarser granularity task (e.g., segment cut) and L represents a lower semantic level, finer granularity task (e.g., shot cut). In the formula, F represents a confidence score (i.e., a probability that a certain event element is a cut point of the task L) at each position in the task L; SG denotes a Gradient intercept (Stop Gradient) operation; Y_L represents the true tag value (gt) at each location in task L; sigma (·) represents a sigmoid function. K represents the number of event elements fed into the network at a time, which may be 100, the similarity score calculated by this formula may be used for subsequent ranking and loss calculation. The method comprises the steps of sorting similarity scores corresponding to all sorting heads according to the granularity increment of the sorting heads, configuring sorting loss information based on a sorted score sequence and a segmentation point label of a training video mark, namely after sorting, calculating loss values among tasks by the formula, and adding the loss values of all positions to obtain a final loss function, so that the corresponding sorting heads are trained based on the sorting loss information.
It can be appreciated that, since the method is based on similarity score calculation and a ranking loss function, its results are easier to understand and interpret; in addition, for the same reason, it can be used together with other existing methods, such as the cross-entropy loss function, and the combined functions further improve the performance of multi-level video segmentation.
Specifically, for the process of combining with the cross-entropy loss function, the granularity information corresponding to the different classification heads is first obtained; cross-entropy loss information between the different granularities is then configured based on the granularity information; and each corresponding classification head is trained based on the cross-entropy loss information and the ordering loss information. That is, the high-low semantic task pairs (Segment, Scene) and (Segment, Shot) are substituted for H and L in the formula above to obtain L_(Segment, Scene) and L_(Segment, Shot), which are added to the cross-entropy loss function to give the final loss function.
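A sketch of the combined objective, assuming the ranking terms for the pairs (Segment, Scene) and (Segment, Shot) are simply added to the per-task cross-entropy losses with unit weights (the weighting is not specified above); hierarchy_ranking_loss refers to the sketch in the previous block:

```python
import torch.nn.functional as F

def total_loss(logits, labels, scores):
    """Combine per-granularity cross-entropy with the ordering-based hierarchy terms.

    logits / labels: dicts keyed by "segment", "scene", "shot" with K x 2 logits and K labels.
    scores: dict of raw cut-point scores (e.g. logits[..., 1]) used by the ranking terms.
    Unit weights between the terms are an assumption.
    """
    ce = sum(F.cross_entropy(logits[t], labels[t]) for t in ("segment", "scene", "shot"))
    rank = (hierarchy_ranking_loss(scores["segment"], scores["scene"])
            + hierarchy_ranking_loss(scores["segment"], scores["shot"]))
    return ce + rank
```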
in one possible training procedure, k=100 consecutive event elements may be taken as one input. First, f_ (multi-model-fuse) is obtained by multi-modal feature extraction, dim= (100×512). The time sequence modeling module based on the converter-Encoder captures the time sequence relation among event units to obtain f_temporal, dim= (100×256). It is necessary to determine whether each event unit is a segmentation point of a segment, a scene, or a shot, and thus is three bisectional tasks. Three classification heads follow f_temporal: the method comprises the steps of performing matrix multiplication on a time sequence characteristic vector f_temporal and a classification head to obtain three bit results in Kx2 dimension, and performing probability on the results by using the following formula (for a scene, C=2 in the formula represents the number of categories) to obtain probability results in Kx2 dimension.
Wherein dimension 0 represents the probability of not being a cut point and dimension 1 represents the probability of being a cut point. The training is then guided using the hierarchical loss formula based on the ranking relationship set forth in the above embodiments. In the training process, a Pytorch deep learning framework is used for training, the optimization method is that the SGD is reduced by a random gradient, the initial learning rate is 0.01, the learning rate is reduced to 0.001 at the 10 th epoch, and the total iteration times are 20 epochs.
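A condensed sketch of this training setup (PyTorch, SGD with an initial learning rate of 0.01 decayed to 0.001 at the 10th epoch, 20 epochs in total); the model, the data loader, the momentum value and the batch layout are placeholders or assumptions, and total_loss refers to the sketch above:

```python
import torch

# model and train_loader are assumed to exist; each sample holds K=100 consecutive event units.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # momentum is an assumption
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10], gamma=0.1)

for epoch in range(20):                                # 20 epochs in total
    for features, labels in train_loader:
        logits = model(features)                       # dict of K x 2 logits for segment / scene / shot
        scores = {t: logits[t][:, 1] for t in logits}  # raw cut-point scores for the ranking terms
        loss = total_loss(logits, labels, scores)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                                   # lr: 0.01 -> 0.001 at the 10th epoch
```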
As can be seen from the above embodiments, by acquiring an input video to be processed; then, segmenting the video to be processed based on the unit granularity to obtain a plurality of video units; performing multi-mode feature extraction on the video units respectively to obtain unit features corresponding to the video units; then adding position features for the unit features based on the time sequence relation among the unit features to obtain target features; and determining the segmentation probability that the segmentation point corresponding to the target feature belongs to the segmentation point under the segmentation granularity, so as to segment the video to be processed according to the segmentation probability, wherein the granularity level corresponding to the segmentation granularity is larger than the granularity level corresponding to the unit granularity, and the video fragment of the video to be processed under the segmentation point under the segmentation granularity contains the corresponding video unit. Therefore, a multi-level video segmentation process is realized, and as different granularity levels are adopted to segment videos, the hierarchical relationship among the granularity levels is utilized, the influence of parameter differences in different level identification processes on segmentation results is reduced, and the accuracy of video segmentation processing is improved.
A description is given below of a scene of news video slicing. Referring to fig. 5, fig. 5 is a flowchart of another video processing method according to an embodiment of the present application, where the embodiment of the present application at least includes the following steps:
501. and responding to the input operation of the target object to acquire the video to be processed.
In this embodiment, the target object may be a user or a terminal, so that a news video to be processed is acquired through an input operation (uploading).
502. And carrying out multi-granularity segmentation on the video to be processed.
In this embodiment, the multi-granularity slicing process is referred to the embodiment shown in fig. 3, and is not described herein.
503. And calling the segmented video at the target granularity in response to the editing operation of the target object.
In this embodiment, a process of calling a cut video at a target granularity through an editing operation is shown in fig. 6, and fig. 6 is a schematic view of a scene of another video processing method according to an embodiment of the present application; the figure shows that the news video comprises a plurality of layers, and can be divided into a fragment layer, a scene layer and a shot layer according to the size of semantic granularity. Therefore, a news video can be simultaneously segmented into a plurality of units with different semantic granularities, the efficiency of downstream video archiving, distributing and searching can be improved, and downstream tasks are supported.
Therefore, the embodiment can efficiently and accurately split the news video in multiple layers, and split a complete news video into a plurality of news segments, scenes and shots at the same time; according to the hierarchical loss based on the ordering relation, the hierarchical relation inside the news video can be mined, and the performance of the model is improved. The segmented fragments, scenes and shots can be used by downstream media distribution, media resource arrangement and media resource retrieval.
In order to better implement the above-described aspects of the embodiments of the present application, the following provides related apparatuses for implementing the above-described aspects. Referring to fig. 7, fig. 7 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present application, where a video processing apparatus 700 includes:
an acquiring unit 701, configured to acquire an input video to be processed;
a splitting unit 702, configured to split the video to be processed based on unit granularity, so as to obtain a plurality of video units;
a processing unit 703, configured to perform multi-mode feature extraction on the video units respectively, so as to obtain unit features corresponding to each video unit;
the processing unit 703 is further configured to add a location feature to the unit feature based on a time sequence relationship between the unit features, so as to obtain a target feature;
The processing unit 703 is further configured to determine a segmentation probability that a segmentation point corresponding to the target feature belongs to a segmentation point under a plurality of segmentation granularities, so as to segment the video to be processed according to the segmentation probability, where a granularity level corresponding to the segmentation granularity is greater than a granularity level corresponding to the unit granularity, and a video segment of the video to be processed under the segmentation point under the segmentation granularity contains a corresponding video unit.
Optionally, in some possible implementations of the present application, the splitting unit 702 is specifically configured to obtain event information corresponding to the unit granularity;
the slicing unit 702 is specifically configured to determine a sampling frame number corresponding to the unit granularity according to a content change condition indicated in the event information;
the slicing unit 702 is specifically configured to slice the video to be processed based on the sampling frame number, so as to obtain a plurality of video units.
Optionally, in some possible implementations of the present application, the splitting unit 702 is specifically configured to obtain a video type corresponding to the video to be processed;
the slicing unit 702 is specifically configured to determine slicing parameters corresponding to the video type;
The slicing unit 702 is specifically configured to adjust the sampling frame number based on the slicing parameter to obtain an adjusted frame number;
the slicing unit 702 is specifically configured to slice the video to be processed based on the adjustment frame number, so as to obtain a plurality of video units.
Optionally, in some possible implementations of the present application, the processing unit 703 is specifically configured to obtain a preset time sequence length for the video configuration to be processed;
the processing unit 703 is specifically configured to determine a timing relationship between each of the unit features based on the preset timing length, so as to determine location information corresponding to the unit features;
the processing unit 703 is specifically configured to invoke a first location formula to calculate the location information to obtain a first timing feature if the location information indicates that the location of the unit feature is odd;
the processing unit 703 is specifically configured to, if the location information indicates that the location of the unit feature is even, invoke a second location formula to calculate the location information to obtain a second timing feature;
the processing unit 703 is specifically configured to add the first timing feature and the second timing feature to the unit feature, so as to obtain the target feature.
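The exact first and second position formulas are not reproduced here; the sketch below simply follows the odd/even split described above, using sine and cosine as assumed stand-ins for the two formulas.

```python
# Sketch of adding position features by parity of the unit position; the use of
# sin/cos as the first/second position formulas is an assumption.
import math
import torch

def position_features(num_units: int, dim: int) -> torch.Tensor:
    freq = torch.exp(torch.arange(dim, dtype=torch.float32) * (-math.log(10000.0) / dim))
    pe = torch.zeros(num_units, dim)
    for pos in range(num_units):
        angle = pos * freq
        # Odd positions use the "first position formula", even positions the "second".
        pe[pos] = torch.sin(angle) if pos % 2 == 1 else torch.cos(angle)
    return pe

# target_features = unit_features + position_features(unit_features.size(0), unit_features.size(1))
```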
Optionally, in some possible implementations of the present application, the processing unit 703 is specifically configured to input the target feature into a classification head corresponding to the segmentation granularity;
the processing unit 703 is specifically configured to perform matrix multiplication on the target feature and the classification head of the segmentation granularity, so as to obtain a first similarity score between the segmentation point corresponding to the target feature and the classification head of the segmentation granularity;
the processing unit 703 is specifically configured to determine, based on the first similarity score, a first segmentation probability that the segmentation point corresponding to the target feature belongs to a segmentation point under the segmentation granularity;
the processing unit 703 is specifically configured to input the target feature into a classification head corresponding to a third granularity, where the granularity level corresponding to the third granularity is greater than the granularity level corresponding to the segmentation granularity, and the video segment of the video to be processed at a segmentation point under the third granularity contains the video segment at the segmentation point under the segmentation granularity;
the processing unit 703 is specifically configured to perform matrix multiplication on the target feature and the classification head of the third granularity, so as to obtain a second similarity score between the segmentation point corresponding to the target feature and the classification head of the third granularity;
The processing unit 703 is specifically configured to determine, based on the second similarity score, a second segmentation probability that the segmentation point corresponding to the target feature belongs to a segmentation point under the third granularity;
the processing unit 703 is specifically configured to compare the first segmentation probability with the second segmentation probability, so as to segment the video to be processed.
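A sketch of the two-granularity scoring described above, assuming each classification head is a learned class vector, matrix multiplication yields the similarity score, a sigmoid converts it into a segmentation probability, and a simple threshold plus a hierarchy check performs the comparison; the threshold and the comparison rule are assumptions.

```python
import torch

def score_segmentation_points(target_features: torch.Tensor,
                              head_fine: torch.Tensor,
                              head_coarse: torch.Tensor,
                              threshold: float = 0.5):
    # target_features: (num_units, feat_dim); each head: (feat_dim,) class vector.
    p_fine = torch.sigmoid(target_features @ head_fine)      # first segmentation probability
    p_coarse = torch.sigmoid(target_features @ head_coarse)  # second segmentation probability
    fine_cuts = (p_fine >= threshold).nonzero(as_tuple=True)[0]
    # Assumed comparison rule: a coarse (third-granularity) cut point must also be
    # a cut point at the finer segmentation granularity.
    coarse_cuts = ((p_coarse >= threshold) & (p_fine >= threshold)).nonzero(as_tuple=True)[0]
    return fine_cuts, coarse_cuts
```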
Optionally, in some possible implementations of the present application, the processing unit 703 is specifically configured to obtain a training video, where the training video is configured with segmentation point labels for video units at different granularities;
the processing unit 703 is specifically configured to segment the training video based on unit granularity to obtain a plurality of training units;
the processing unit 703 is specifically configured to perform multi-mode feature extraction on the training units respectively, so as to obtain training unit features corresponding to the training units;
the processing unit 703 is specifically configured to add a location feature to the training unit feature based on a time sequence relationship between the training unit features, so as to obtain a training feature;
the processing unit 703 is specifically configured to input the training features into the classification heads under multiple training granularities, respectively, so as to obtain similarity scores corresponding to the classification heads;
The processing unit 703 is specifically configured to sort the similarity scores corresponding to the classification heads in order of increasing granularity of the classification heads;
the processing unit 703 is specifically configured to configure ordering loss information based on the ordered score sequence and the segmentation point labels of the training video;
the processing unit 703 is specifically configured to train each corresponding classification head based on the ordering loss information.
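One way to realize the ordering loss is a margin-style ranking constraint between adjacent granularities, reflecting that a point labelled as a cut at a coarser granularity should score at least as high at every finer granularity; the sketch below is a plausible form of such a loss, not the embodiment's exact formulation.

```python
import torch

def ordering_loss(scores: list, labels: list, margin: float = 0.0) -> torch.Tensor:
    """scores/labels: lists ordered from fine to coarse granularity (e.g. shot,
    scene, segment); each score tensor is (num_units,), each label tensor is 0/1.
    Assumed loss: where a coarse cut point is labelled, the finer-granularity score
    should not fall below the coarser-granularity score."""
    loss = torch.zeros(())
    for fine_score, coarse_score, coarse_label in zip(scores[:-1], scores[1:], labels[1:]):
        violation = torch.relu(coarse_score - fine_score + margin)
        loss = loss + (violation * coarse_label.float()).mean()
    return loss
```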
Optionally, in some possible implementations of the present application, the processing unit 703 is specifically configured to obtain granularity information corresponding to different classification heads;
the processing unit 703 is specifically configured to configure cross entropy loss information between different granularities based on the granularity information;
the processing unit 703 is specifically configured to train each corresponding classification head based on the cross entropy loss information and the ordering loss information.
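The cross entropy loss and the ordering loss can then be combined into a single training objective, reusing the ordering_loss sketch above; the binary formulation and the weighting below are assumptions.

```python
import torch.nn.functional as F

def total_loss(scores: list, labels: list, ordering_weight: float = 0.5):
    # Per-granularity binary cross entropy on the raw similarity scores (cut / no cut),
    # plus the ordering loss sketched above; the weight is an assumption.
    ce = sum(F.binary_cross_entropy_with_logits(s, y.float()) for s, y in zip(scores, labels))
    return ce + ordering_weight * ordering_loss(scores, labels)
```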
In the embodiment of the present application, an input video to be processed is acquired; the video to be processed is then split based on unit granularity to obtain a plurality of video units; multi-mode feature extraction is performed on the video units respectively to obtain unit features corresponding to the video units; position features are then added to the unit features based on the time sequence relationship among the unit features to obtain target features; and the segmentation probability that the segmentation point corresponding to each target feature belongs to a segmentation point under the segmentation granularity is determined, so that the video to be processed is segmented according to the segmentation probability, where the granularity level corresponding to the segmentation granularity is greater than the granularity level corresponding to the unit granularity, and a video segment of the video to be processed at a segmentation point under the segmentation granularity contains the corresponding video units. A multi-level video segmentation process is thus realized; since the video is segmented at different granularity levels and the hierarchical relationship among the granularity levels is utilized, the influence of parameter differences in the recognition processes at different levels on the segmentation result is reduced, and the accuracy of the video segmentation processing is improved.
The embodiment of the present application further provides a terminal device. Fig. 8 is a schematic structural diagram of another terminal device provided in the embodiment of the present application; for convenience of explanation, only the portion related to the embodiment of the present application is shown, and for specific technical details that are not disclosed, please refer to the method portion of the embodiment of the present application. The terminal may be any terminal device, including a mobile phone, a tablet computer, a personal digital assistant (personal digital assistant, PDA), a point of sale (POS) terminal, a vehicle-mounted computer, and the like. The following takes a mobile phone as an example:
Fig. 8 is a block diagram showing a part of the structure of a mobile phone related to a terminal provided by an embodiment of the present application. Referring to fig. 8, the mobile phone includes: radio frequency (RF) circuitry 810, memory 820, input unit 830, display unit 840, sensor 850, audio circuitry 860, wireless fidelity (wireless fidelity, WiFi) module 870, processor 880, and power supply 890. Those skilled in the art will appreciate that the handset structure shown in fig. 8 does not constitute a limitation on the handset, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.
The following describes the components of the mobile phone in detail with reference to fig. 8:
The RF circuit 810 may be used for receiving and transmitting signals during the sending and receiving of messages or during a call; in particular, downlink information received from a base station is delivered to the processor 880 for processing, and uplink data to be sent is transmitted to the base station. Typically, the RF circuitry 810 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (low noise amplifier, LNA), a duplexer, and the like. In addition, the RF circuitry 810 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including, but not limited to, global system for mobile communications (global system of mobile communication, GSM), general packet radio service (general packet radio service, GPRS), code division multiple access (code division multiple access, CDMA), wideband code division multiple access (wideband code division multiple access, WCDMA), long term evolution (long term evolution, LTE), email, short message service (short messaging service, SMS), and the like.
The memory 820 may be used to store software programs and modules, and the processor 880 performs various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 820. The memory 820 may mainly include a program storage area and a data storage area; the program storage area may store an operating system and application programs required for at least one function (such as a sound playing function, an image playing function, etc.), and the data storage area may store data (such as audio data, a phonebook, etc.) created according to the use of the mobile phone. In addition, the memory 820 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage devices.
The input unit 830 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the handset. In particular, the input unit 830 may include a touch panel 831 and other input devices 832. The touch panel 831, also referred to as a touch screen, may collect touch operations by a user on or near it (for example, operations performed by the user on or near the touch panel 831 using a finger, a stylus, or any other suitable object or accessory, as well as contactless touch operations within a certain range of the touch panel 831), and drive the corresponding connection device according to a preset program. Optionally, the touch panel 831 may include two parts: a touch detection device and a touch controller. The touch detection device detects the touch position of the user, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, and sends the coordinates to the processor 880, and can also receive commands from the processor 880 and execute them. In addition, the touch panel 831 may be implemented in various types such as resistive, capacitive, infrared, and surface acoustic wave. The input unit 830 may include other input devices 832 in addition to the touch panel 831. In particular, the other input devices 832 may include, but are not limited to, one or more of a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 840 may be used to display information input by the user or information provided to the user, as well as various menus of the mobile phone. The display unit 840 may include a display panel 841; optionally, the display panel 841 may be configured in the form of a liquid crystal display (liquid crystal display, LCD), an organic light-emitting diode (organic light-emitting diode, OLED), or the like. Further, the touch panel 831 may overlay the display panel 841; when the touch panel 831 detects a touch operation on or near it, the touch operation is transferred to the processor 880 to determine the type of touch event, and the processor 880 then provides a corresponding visual output on the display panel 841 according to the type of touch event. Although in fig. 8 the touch panel 831 and the display panel 841 are implemented as two separate components to realize the input and output functions of the mobile phone, in some embodiments the touch panel 831 and the display panel 841 may be integrated to realize the input and output functions of the mobile phone.
The handset may also include at least one sensor 850, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor; the ambient light sensor may adjust the brightness of the display panel 841 according to the brightness of ambient light, and the proximity sensor may turn off the display panel 841 and/or the backlight when the mobile phone is moved to the ear. As one kind of motion sensor, an accelerometer sensor can detect the magnitude of acceleration in all directions (generally three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications that recognize the gesture of the mobile phone (such as switching between horizontal and vertical screens, related games, and magnetometer gesture calibration), vibration recognition related functions (such as a pedometer and knocking), and the like; other sensors that may also be configured in the handset, such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, are not described in detail herein.
The audio circuitry 860, the speaker 861, and the microphone 862 may provide an audio interface between the user and the handset. The audio circuit 860 may transmit an electrical signal, converted from received audio data, to the speaker 861, and the speaker 861 converts the electrical signal into a sound signal for output; on the other hand, the microphone 862 converts a collected sound signal into an electrical signal, which is received by the audio circuit 860 and converted into audio data; the audio data are output to the processor 880 for processing and then sent, for example, to another mobile phone via the RF circuit 810, or output to the memory 820 for further processing.
WiFi is a short-distance wireless transmission technology. Through the WiFi module 870, the mobile phone can help the user send and receive emails, browse webpages, access streaming media, and the like, providing wireless broadband Internet access for the user. Although fig. 8 shows the WiFi module 870, it is understood that it is not an essential component of the handset and may be omitted entirely as needed without changing the essence of the invention.
The processor 880 is the control center of the mobile phone; it connects the various parts of the entire mobile phone using various interfaces and lines, and performs the various functions of the mobile phone and processes data by running or executing the software programs and/or modules stored in the memory 820 and calling the data stored in the memory 820, thereby monitoring the mobile phone as a whole. Optionally, the processor 880 may include one or more processing units; optionally, the processor 880 may integrate an application processor, which mainly handles the operating system, user interfaces, applications, and the like, and a modem processor, which mainly handles wireless communication. It will be appreciated that the modem processor may also not be integrated into the processor 880.
The handset further includes a power supply 890 (e.g., a battery) for powering the various components; optionally, the power supply may be logically connected to the processor 880 through a power management system, so that functions such as charge management, discharge management, and power consumption management are implemented through the power management system.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which will not be described herein.
In the embodiment of the present application, the processor 880 included in the terminal further has the function of executing each step of the video processing method described above.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a server according to an embodiment of the present application. The server 900 may vary considerably in configuration or performance, and may include one or more central processing units (central processing units, CPU) 922 (e.g., one or more processors), a memory 932, and one or more storage media 930 (e.g., one or more mass storage devices) storing application programs 942 or data 944. The memory 932 and the storage medium 930 may be transitory or persistent storage. The program stored in the storage medium 930 may include one or more modules (not shown), and each module may include a series of instruction operations on the server. Further, the central processing unit 922 may be configured to communicate with the storage medium 930 and execute, on the server 900, the series of instruction operations in the storage medium 930.
The server 900 may also include one or more power supplies 926, one or more wired or wireless network interfaces 950, one or more input/output interfaces 958, and/or one or more operating systems 941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
The steps performed by the management apparatus in the above-described embodiments may be based on the server structure shown in fig. 9.
In an embodiment of the present application, there is further provided a computer-readable storage medium storing video processing instructions which, when run on a computer, cause the computer to perform the steps performed by the video processing apparatus in the method described in the foregoing embodiments shown in fig. 3 to 6.
In an embodiment of the present application, there is also provided a computer program product comprising video processing instructions which, when run on a computer, cause the computer to perform the steps performed by the video processing apparatus in the method described in the embodiment of fig. 3 to 6.
The embodiment of the application also provides a video processing system, which can comprise the video processing device in the embodiment shown in fig. 7, or the terminal equipment in the embodiment shown in fig. 8, or the server shown in fig. 9.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a video processing apparatus, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (read-only memory, ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. A method for processing video, comprising:
acquiring an input video to be processed;
splitting the video to be processed based on unit granularity to obtain a plurality of video units;
performing multi-mode feature extraction on the video units respectively to obtain unit features corresponding to the video units;
adding position features to the unit features based on the time sequence relation among the unit features to obtain target features;
determining the segmentation probability that the segmentation points corresponding to the target features belong to segmentation points under a plurality of segmentation granularities, so as to segment the video to be processed according to the segmentation probability, wherein the granularity level corresponding to the segmentation granularities is larger than the granularity level corresponding to the unit granularity, and the video segments of the video to be processed under the segmentation points under the segmentation granularities contain corresponding video units.
2. The method of claim 1, wherein the splitting the video to be processed based on unit granularity to obtain a plurality of video units comprises:
acquiring event information corresponding to the unit granularity;
determining the sampling frame number corresponding to the unit granularity according to the content change condition indicated in the event information;
and splitting the video to be processed based on the sampling frame number to obtain a plurality of video units.
3. The method according to claim 2, wherein the splitting the video to be processed based on the sampling frame number to obtain a plurality of video units comprises:
acquiring a video type corresponding to the video to be processed;
determining a segmentation parameter corresponding to the video type;
adjusting the sampling frame number based on the segmentation parameter to obtain an adjusted frame number;
and splitting the video to be processed based on the adjusted frame number to obtain a plurality of video units.
4. The method of claim 1, wherein the adding position features to the unit features based on the time sequence relation among the unit features to obtain target features comprises:
acquiring a preset time sequence length configured for the video to be processed;
determining a time sequence relation among the unit features based on the preset time sequence length so as to determine position information corresponding to the unit features;
if the position information indicates that the position of the unit feature is odd, a first position formula is called to calculate the position information to obtain a first time sequence feature;
if the position information indicates that the positions of the unit features are even, a second position formula is called to calculate the position information so as to obtain second time sequence features;
and adding the first time sequence feature and the second time sequence feature to the unit feature to obtain the target feature.
5. The method according to claim 1, wherein determining the segmentation probability that the segmentation point corresponding to the target feature belongs to the segmentation point at a plurality of segmentation granularities, so as to segment the video to be processed according to the segmentation probability, comprises:
inputting the target feature into a classification header corresponding to a first granularity under the segmentation granularity;
matrix multiplying the target feature and the classification head of the first granularity to obtain a first similarity score between the segmentation point corresponding to the target feature and the classification head of the first granularity;
Determining a first segmentation probability that a segmentation point corresponding to the target feature belongs to a segmentation point under a first granularity based on the first similarity score;
inputting the target features into a classification head corresponding to a second granularity under the segmentation granularity, wherein the granularity level corresponding to the second granularity is larger than the granularity level corresponding to the first granularity, and the video fragments of the video to be processed under the segmentation point under the second granularity comprise the video fragments under the segmentation point under the first granularity;
matrix multiplying the target feature and the classification head of the second granularity to obtain a second similarity score between the segmentation point corresponding to the target feature and the classification head of the second granularity;
determining a second segmentation probability of the segmentation point corresponding to the target feature belonging to the segmentation point under a second granularity based on the second similarity score;
and comparing the first segmentation probability with the second segmentation probability to segment the video to be processed.
6. The method according to any one of claims 1-5, further comprising:
acquiring a training video, wherein the training video is provided with segmentation point labels for video units at different granularities;
Segmenting the training video based on unit granularity to obtain a plurality of training units;
performing multi-mode feature extraction on the training units respectively to obtain training unit features corresponding to the training units;
adding position features to the training unit features based on the time sequence relation among the training unit features to obtain training features;
respectively inputting the training features into classification heads under a plurality of training granularities to obtain similarity scores corresponding to the classification heads;
sorting the similarity scores corresponding to the classification heads in order of increasing granularity of the classification heads;
configuring ordering loss information based on the ordered score sequence and the segmentation point labels of the training video;
training each corresponding classification head based on the ordering loss information.
7. The method of claim 6, wherein the training each corresponding classification head based on the ordering loss information comprises:
acquiring granularity information corresponding to different classification heads;
configuring cross entropy loss information between different granularities based on the granularity information;
training each corresponding classification header based on the cross entropy loss information and the ordering loss information.
8. A video processing apparatus, comprising:
the acquisition unit is used for acquiring the input video to be processed;
the segmentation unit is used for segmenting the video to be processed based on unit granularity so as to obtain a plurality of video units;
the processing unit is used for respectively carrying out multi-mode feature extraction on the video units so as to obtain unit features corresponding to the video units;
the processing unit is further used for adding position features to the unit features based on the time sequence relation among the unit features so as to obtain target features;
the processing unit is further configured to determine a segmentation probability that a segmentation point corresponding to the target feature belongs to a segmentation point under a plurality of segmentation granularities, so as to segment the video to be processed according to the segmentation probability, a granularity level corresponding to the segmentation granularity is greater than a granularity level corresponding to the unit granularity, and a video segment of the video to be processed under the segmentation point under the segmentation granularity contains a corresponding video unit.
9. A computer device, the computer device comprising a processor and a memory:
The memory is used for storing program code; the processor is configured to execute the video processing method according to any one of claims 1 to 7 according to instructions in the program code.
10. A computer program product comprising computer programs/instructions stored on a computer readable storage medium, characterized in that the computer programs/instructions in the computer readable storage medium, when executed by a processor, implement the steps of the method of processing video according to any of the preceding claims 1 to 7.
CN202310493940.8A 2023-05-04 2023-05-04 Video processing method and device and storage medium Pending CN116958638A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310493940.8A CN116958638A (en) 2023-05-04 2023-05-04 Video processing method and device and storage medium


Publications (1)

Publication Number Publication Date
CN116958638A true CN116958638A (en) 2023-10-27

Family

ID=88455468



Legal Events

Date Code Title Description
PB01 Publication