CN117014693A - Video processing method, device, equipment and storage medium

Info

Publication number
CN117014693A
CN117014693A (application number CN202211320887.3A)
Authority
CN
China
Prior art keywords
video frame
feature
image feature
initial
sample
Prior art date
Legal status
Pending
Application number
CN202211320887.3A
Other languages
Chinese (zh)
Inventor
林靖渝
郭春超
刘思聪
王红法
刘威
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202211320887.3A priority Critical patent/CN117014693A/en
Publication of CN117014693A publication Critical patent/CN117014693A/en
Pending legal-status Critical Current

Classifications

    • H04N 21/4884: Data services, e.g. news ticker, for displaying subtitles
    • H04N 21/234381: Reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements by altering the temporal resolution, e.g. decreasing the frame rate by frame skipping
    • H04N 21/435: Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
    • H04N 21/44008: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N 21/440281: Reformatting operations of video signals for household redistribution, storage or real-time display by altering the temporal resolution, e.g. by frame skipping
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The embodiment of the application discloses a video processing method, a device, equipment and a storage medium, which can be applied to the fields of artificial intelligence, cloud technology, blockchain and the like. The method comprises the following steps: determining an initial video frame sequence of a video to be processed, wherein the initial video frame sequence comprises a plurality of initial video frames; determining an image feature sequence, wherein the image feature sequence comprises the target image feature of each initial video frame; determining, for each target image feature, the fusion feature corresponding to the target image feature based on the target image feature and the target image features adjacent to it in the image feature sequence; and determining the frame type of the corresponding initial video frame based on each fusion feature, the frame type including a first type for video frames that do not include repeatedly occurring subtitle information and a second type for video frames that include repeatedly occurring subtitle information. By adopting the embodiment of the application, the frame type of each video frame in a video can be determined rapidly and accurately, and the applicability is high.

Description

Video processing method, device, equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a video processing method, apparatus, device, and storage medium.
Background
With the continuous development of multimedia technology, video has become a main medium for presenting information in learning, entertainment, advertising and other fields. As video content becomes richer, subtitle information has become a main means of assisting in understanding video content.
In the fields of subtitle information auditing, subtitle information editing, subtitle key information extraction and the like, it is often necessary to distinguish between video frames including subtitle information that appears for the first time and video frames including subtitle information that appears repeatedly in a video. In the prior art, this distinction is usually made manually or by a model; however, manual distinction is inefficient, and the accuracy of existing models is generally low.
Disclosure of Invention
The embodiment of the application provides a video processing method, a device, equipment and a storage medium, which can rapidly and accurately determine the frame type of each video frame in a video and have high applicability.
In one aspect, an embodiment of the present application provides a video processing method, including:
determining an initial video frame sequence of a video to be processed, wherein the initial video frame sequence comprises a plurality of initial video frames;
determining an image feature sequence, wherein the image feature sequence comprises target image features of each initial video frame, and the arrangement order of each target image feature in the image feature sequence is consistent with the arrangement order of the corresponding initial video frame in the initial video frame sequence;
for each target image feature, determining a fusion feature corresponding to the target image feature based on the target image feature and a target image feature adjacent to the target image feature in the image feature sequence;
for each of the fusion features, determining a frame type of the corresponding initial video frame based on the fusion feature, the frame type including a first type and a second type, the video frame of the first type not including repeatedly occurring caption information, the video frame of the second type including repeatedly occurring caption information.
In another aspect, an embodiment of the present application provides a video processing apparatus, including:
the video frame determining module is used for determining an initial video frame sequence of the video to be processed, wherein the initial video frame sequence comprises a plurality of initial video frames;
an image feature determining module, configured to determine an image feature sequence, where the image feature sequence includes target image features of each initial video frame, and an arrangement order of each target image feature in the image feature sequence is consistent with an arrangement order of a corresponding initial video frame in the initial video frame sequence;
the image feature fusion module is used for determining fusion features corresponding to the target image features based on the target image features and target image features adjacent to the target image features in the image feature sequence for each target image feature;
and the frame type determining module is used for determining the frame type of the corresponding initial video frame based on each fusion feature, wherein the frame type comprises a first type and a second type, the video frame of the first type does not comprise repeatedly-appearing subtitle information, and the video frame of the second type comprises repeatedly-appearing subtitle information.
In another aspect, an embodiment of the present application provides an electronic device, including a processor and a memory, where the processor and the memory are connected to each other;
the memory is used for storing a computer program;
the processor is configured to execute the video processing method provided by the embodiment of the application when the computer program is called.
In another aspect, an embodiment of the present application provides a computer readable storage medium storing a computer program that is executed by a processor to implement the video processing method provided by the embodiment of the present application.
In another aspect, an embodiment of the present application provides a computer program product, where the computer program product includes a computer program, where the computer program implements the video processing method provided by the embodiment of the present application when the computer program is executed by a processor.
In the embodiment of the application, for each initial video frame of the video to be processed, the fusion feature corresponding to the initial video frame can be determined based on the target image feature of the initial video frame and the target image features of the initial video frames adjacent to it, so that the fusion feature of the initial video frame also incorporates the image features of the other adjacent video frames. The frame type of the corresponding initial video frame can therefore be determined accurately and efficiently based on the fusion feature, that is, whether the corresponding initial video frame includes repeatedly occurring subtitle information can be determined accurately and efficiently, and the applicability is high.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic view of a video processing method according to an embodiment of the present application;
fig. 2 is a schematic flow chart of a video processing method according to an embodiment of the present application;
FIG. 3 is a schematic view of a scene for determining initial image features provided by an embodiment of the present application;
FIG. 4 is a schematic view of a scene for determining features of a target image according to an embodiment of the present application;
FIG. 5 is a schematic view of a scenario for determining fusion features according to an embodiment of the present application;
FIG. 6 is another schematic view of a scenario for determining fusion features provided by an embodiment of the present application;
FIG. 7 is a schematic view of yet another scenario for determining fusion features provided by an embodiment of the present application;
FIG. 8 is a flow chart of a method for determining frame type according to an embodiment of the present application;
FIG. 9 is another schematic flow chart for determining a frame type according to an embodiment of the present application;
FIG. 10 is a flowchart of a training method of a frame type prediction model according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The video processing method provided by the embodiment of the application can be applied to the fields of image processing, artificial intelligence and the like, and is used for determining video frames including repeatedly-appearing subtitle information and video frames not including repeatedly-appearing subtitle information in the video to be processed.
For example, with the video processing method provided by the embodiment of the application, the video frames that include repeatedly occurring subtitle information can be identified in the video to be processed, and these frames can then be filtered out so that editing, auditing and other operations are performed on the remaining subtitle information.
Referring to fig. 1, fig. 1 is a schematic view of a video processing method according to an embodiment of the present application. As shown in fig. 1, for the video 11 to be processed, the device 12 may determine the frame type of each video frame in the video 11 to be processed based on the video processing method provided in the embodiment of the present application. The frame type of each video frame is either a first type or a second type: the first type indicates that the video frame does not include repeatedly occurring subtitle information, and the second type indicates that the video frame includes repeatedly occurring subtitle information.
Further, the device 12 may determine the frame type of each video frame of the video 11 to be processed, and further determine video frames that do not include the repeatedly occurring caption information and video frames that include the repeatedly occurring caption information. If the device 12 finally determines that the video frames including the repeatedly occurring subtitle information are the video frame 13 and the video frame 14, the video frames other than the video frame 13 and the video frame 14 do not include the repeatedly occurring subtitle information.
Wherein the device 12 may be a server or a terminal device having data processing capabilities. The server may be a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), big data and artificial intelligence platforms, such as an internet of things edge cloud platform, a cloud computing platform, and the like.
The terminal device may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, an intelligent voice interaction device (e.g. a smart speaker), a wearable electronic device (e.g. a smart watch), a vehicle-mounted terminal, an intelligent home appliance (e.g. a smart television), an AR/VR device, etc.
Referring to fig. 2, fig. 2 is a flowchart of a video processing method according to an embodiment of the present application. As shown in fig. 2, the video processing method provided in the embodiment of the present application may specifically include the following steps:
step S21, determining an initial video frame sequence of the video to be processed.
In some possible embodiments, the video to be processed may be a movie video, an advertisement video, a documentary, a self-media video with subtitle information, etc., which may be specifically determined based on the actual application scene requirement, and is not limited herein.
Specifically, when determining the initial video frame sequence of the video to be processed, all the initial video frames of the video to be processed may be arranged according to the playing order in the video to be processed, so as to obtain the initial video frame sequence.
Optionally, when determining the initial video frame sequence of the video to be processed, frame extraction may be performed on the video to be processed at a preset frequency to obtain a plurality of initial video frames, and the obtained initial video frames may then be arranged according to their play order in the video to be processed to obtain the initial video frame sequence.
For example, the video to be processed is uniformly sampled, an initial video frame is extracted from the video to be processed every 1 second, and then an initial video frame sequence of the video to be processed is determined based on the obtained initial video frame.
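As a non-authoritative illustration (the patent itself provides no code), the following is a minimal Python sketch of this uniform frame extraction step; it assumes OpenCV is available and uses hypothetical helper names.

import cv2

def extract_initial_frames(video_path, interval_sec=1.0):
    """Hypothetical helper: uniformly sample one initial video frame per
    interval_sec from the video to be processed, in playback order."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0      # fall back if FPS metadata is missing
    step = max(int(round(fps * interval_sec)), 1)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)                 # keep frames in their play order
        index += 1
    cap.release()
    return frames                                # the initial video frame sequence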
Step S22, determining an image feature sequence, wherein the image feature sequence comprises target image features of each initial video frame.
The ordering order of each target image feature in the image feature sequence is consistent with the ordering order of the corresponding initial video frame in the initial video frame sequence.
In some possible embodiments, when determining the target image feature of each initial video frame, feature extraction may be performed on each initial video frame based on the feature extraction network to obtain a corresponding initial image feature, so as to determine the initial image feature corresponding to each initial video frame as the target image feature of the initial video frame.
In some possible embodiments, when determining the target image feature of each initial video frame, in order to make the target image feature of each initial video frame include context information of other adjacent video frames, feature extraction may be performed on each initial video frame to obtain a corresponding initial image feature.
As shown in fig. 3, fig. 3 is a schematic view of a scene for determining initial image features according to an embodiment of the present application. For each initial video frame in the initial video frame sequence, the initial video frame can be input into a pre-trained feature extraction network to obtain initial image features corresponding to the initial video frame.
The pre-trained feature extraction network may be a visual geometry group (Visual Geometry Group, VGG) -19 network, or may be other existing related neural networks for extracting image features, which may be specifically determined based on actual application scene requirements, and is not limited herein.
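For illustration only, a sketch of this feature extraction step under the assumption that PyTorch/torchvision is used and the backbone is the VGG-19 network mentioned above (the helper name and preprocessing choices are assumptions, not taken from the patent):

import torch
from torchvision import models, transforms

# VGG-19 with the classifier head removed; only the convolutional feature map
# of each frame is used as its initial image feature.
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),                       # HxWx3 uint8 array -> 3xHxW float tensor in [0, 1]
    transforms.Resize((224, 224)),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def initial_image_feature(frame_rgb):
    # frame_rgb: one initial video frame as an RGB array
    # (convert from BGR first if it comes from OpenCV).
    x = preprocess(frame_rgb).unsqueeze(0)       # (1, 3, 224, 224)
    return vgg(x).squeeze(0)                     # (512, 7, 7) initial image feature map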
For each initial video frame, an initial feature sequence corresponding to each initial video frame may be determined, the initial feature sequence corresponding to each initial video frame including initial image features of a consecutive plurality of initial video frames in the initial video frame sequence, and the consecutive plurality of initial video frames including the initial video frame.
Specifically, for each initial video frame, any number of consecutive plurality of initial video frames including the initial video frame may be determined from the sequence of initial video frames, and an initial feature sequence corresponding to the initial video frame may be determined based on initial image features corresponding to the determined plurality of initial video frames.
Optionally, a preset number of preset video frames may be added to two ends of the initial video frame sequence to obtain the target video frame sequence.
The preset video frame may be a blank frame, or may be another video frame including limited image information.
For each initial video frame, a sub-video frame sequence corresponding to the initial video frame may be determined based on the initial video frame, a consecutive preset number of video frames in the target video frame sequence preceding and adjacent to the initial video frame, and a preset number of video frames in the target video frame sequence following and adjacent to the initial video frame. The initial feature sequence corresponding to the initial video frame is then determined based on the initial image feature of each video frame in the sub-video frame sequence.
For each initial video frame, if the sub-video frame sequence corresponding to the initial video frame includes a preset video frame, the initial image feature of the preset video frame may be determined in the same manner as the initial image feature of an initial video frame is determined, which is not described herein again.
That is, for an initial video frame i, the sub-video frame sequence corresponding to the initial video frame includes a total of 2n+1 video frames, from video frame i-n to video frame i+n in the target video frame sequence, where the initial video frame i is the center video frame of the corresponding sub-video frame sequence.
The video frames i-n to i-1 in the target video frame sequence are a continuous preset number of video frames located before and adjacent to the initial video frame i in the target video frame sequence, and the video frames i+1 to i+n in the target video frame sequence are a preset number of video frames located after and adjacent to the initial video frame i in the target video frame sequence.
Wherein n is a preset number, which is an integer greater than or equal to 1, and specifically can be determined based on actual application scene requirements, without limitation.
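A minimal sketch of this padding-and-windowing step (the function name and the choice of an all-zero blank frame are assumptions for illustration; initial_frames could be the output of the frame extraction sketch above):

import numpy as np

def build_sub_sequences(initial_frames, n=3):
    """Pad both ends of the initial video frame sequence with n preset blank
    frames, then take, for every initial video frame, the 2n+1 window from
    frame i-n to frame i+n as its sub-video frame sequence."""
    height, width, channels = initial_frames[0].shape
    blank = np.zeros((height, width, channels), dtype=initial_frames[0].dtype)  # preset blank frame
    padded = [blank] * n + list(initial_frames) + [blank] * n
    # for the j-th initial frame (0-based), its window is padded[j : j + 2n + 1]
    return [padded[j:j + 2 * n + 1] for j in range(len(initial_frames))]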
Optionally, for each initial video frame, when determining an initial feature sequence corresponding to the initial video frame, if the number of initial video frames located before the initial video frame in the initial video frame sequence is greater than or equal to a preset number and the number of initial video frames located after the initial video frame is greater than or equal to a preset number, determining a sub-video frame sequence corresponding to the initial video frame based on the initial video frame, a continuous preset number of initial video frames located before the initial video frame and adjacent to the initial video frame in the initial video frame sequence, and a continuous preset number of initial video frames located after the initial video frame and adjacent to the initial video frame in the initial video frame sequence, and further determining an initial feature sequence corresponding to the initial video frame based on initial image features of each initial video frame in the sub-video frame sequence.
If the number of initial video frames positioned before the initial video frames in the initial video frame sequence is smaller than the preset number and the number of initial video frames positioned after the initial video frames is larger than or equal to the preset number, determining a sub-video frame sequence corresponding to the initial video frames based on the initial video frames, all initial video frames positioned before the initial video frames in the initial video frame sequence and the continuous preset number of initial video frames positioned after the initial video frames and adjacent to the initial video frames in the initial video frame sequence, further determining a first feature sequence corresponding to the initial video frames based on the initial image features of each initial video frame in the sub-video frame sequence, and adding the preset image features of the first number before the first feature sequence to obtain the initial feature sequence corresponding to the initial video frames.
The first number is the difference between the preset number and the number of initial video frames located before the initial video frame in the initial video frame sequence, and the preset image feature may be specifically determined based on the actual application scene requirements, which is not limited herein.
If the number of the initial video frames positioned before the initial video frame in the initial video frame sequence is greater than or equal to the preset number, and the number of the initial video frames positioned after the initial video frame is smaller than the preset number, determining a sub-video frame sequence corresponding to the initial video frame based on the initial video frame, the initial video frames positioned before the initial video frame and adjacent to the initial video frame in the initial video frame sequence, and all the initial video frames positioned after the initial video frame in the initial video frame sequence, further determining a second feature sequence corresponding to the initial video frame based on the initial image features of each initial video frame in the sub-video frame sequence, and adding a second number of preset image features after the second feature sequence to obtain the initial feature sequence corresponding to the initial video frame.
The second number is a difference between the preset number and the number of initial video frames located after the initial video frame in the initial video frame sequence.
Based on this, for each initial video frame, a target image feature for that initial video frame may be determined based on an initial feature sequence corresponding to that initial video frame.
As an example, referring to fig. 4, fig. 4 is a schematic view of a scene for determining a target image feature according to an embodiment of the present application. As shown in fig. 4, in the case that the preset number is 3, 3 preset video frames may be added to two ends of the initial video frame sequence to obtain the target video frame sequence.
The preset video frames may be blank frames.
Taking the 5th video frame in the target video frame sequence (i.e., the 2nd initial video frame in the initial video frame sequence) as an example, the corresponding sub-video frame sequence may be determined based on the consecutive 3 video frames in the target video frame sequence that precede and are adjacent to the 5th video frame (i.e., the 2nd to 4th video frames in the target video frame sequence), the consecutive 3 video frames in the target video frame sequence that follow and are adjacent to the 5th video frame (i.e., the 6th to 8th video frames in the target video frame sequence), and the 5th video frame in the target video frame sequence itself.
The target image feature corresponding to the 5 th video frame in the target video frame sequence may be further determined based on the initial image feature of each video frame in the sub-video frame sequence corresponding to the 5 th video frame in the target video frame sequence.
In some possible embodiments, for each initial video frame, when determining the target image feature of the initial video frame based on the initial feature sequence corresponding to the initial video frame, the initial image features in the initial feature sequence may be spliced along the channel dimension to obtain a spliced image feature.
Feature processing and dimension reduction may then be performed on the spliced image feature based on a convolution network, and the final result is determined as the target image feature of the initial video frame.
The convolution network may include at least one convolution layer, and the relevant parameters of each convolution layer may be specifically determined based on the actual application scene requirements, which is not limited herein.
As an example, for each initial video frame, the process of determining the target image feature of the initial video frame based on the initial feature sequence corresponding to the initial video frame may be implemented based on the following:
G_i = Conv_{1×1}(Concat(F_{i-n}, …, F_i, …, F_{i+n}))
wherein n is the preset number, i represents the position of the initial video frame in the target video frame sequence, F_i represents the initial image feature of the initial video frame, F_{i-n} represents the initial image feature of the (i-n)-th video frame in the target video frame sequence, F_{i+n} represents the initial image feature of the (i+n)-th video frame in the target video frame sequence, Concat represents the splicing operation on the initial image features in the initial feature sequence corresponding to the initial video frame, Conv_{1×1} represents processing the spliced image feature with a 1×1 convolution network, and G_i represents the target image feature of the initial video frame.
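The formula above can be sketched, purely for illustration, as a small PyTorch module (the channel count, the value of n and the class name are assumptions):

import torch
import torch.nn as nn

class NeighborhoodFusion(nn.Module):
    """Concatenate the 2n+1 neighbouring initial image features along the
    channel dimension and reduce them back with a 1x1 convolution (G_i above)."""
    def __init__(self, channels=512, n=3):
        super().__init__()
        self.conv = nn.Conv2d(channels * (2 * n + 1), channels, kernel_size=1)

    def forward(self, neighbor_feats):
        # neighbor_feats: list of 2n+1 tensors, each (C, H, W), ordered F_{i-n} .. F_{i+n}
        stacked = torch.cat(neighbor_feats, dim=0).unsqueeze(0)   # (1, (2n+1)*C, H, W)
        return self.conv(stacked).squeeze(0)                      # (C, H, W) target image feature G_i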
Step S23, for each target image feature, determining a fusion feature corresponding to the target image feature based on the target image feature and a target image feature adjacent to the target image feature in the image feature sequence.
In some possible embodiments, for each target image feature, the target image feature in the sequence of image features that is adjacent to the target image feature may be the previous target image feature in the sequence of image features.
In this case, for each target image feature, the first hidden layer feature corresponding to the target image feature may be determined based on the target image feature and the first hidden layer feature corresponding to the target image feature preceding the target image feature in the image feature sequence.
When the target image feature is the first target image feature in the image feature sequence, the first preset feature can be used as the first hidden layer feature of the previous target image feature of the target image feature.
That is, for a first target image feature in the sequence of image features, a first hidden layer feature corresponding to the target image feature may be determined based on the first preset feature and the target image feature. For any one of the other target image features in the sequence of image features except for the first target image feature, the first hidden layer feature corresponding to the target image feature may be determined based on the first hidden layer feature corresponding to the target image feature that is previous to the target image feature and the target image feature.
Based on this, the first hidden layer feature corresponding to each target image feature in the image feature sequence may be determined as a fusion feature corresponding to each target image feature. At this time, the fusion feature corresponding to each target image feature may further fuse other target image features before the target image feature.
As an example, referring to fig. 5, fig. 5 is a schematic view of a scenario for determining fusion features according to an embodiment of the present application. As shown in fig. 5, for the 1st target image feature G_1, the 1st target image feature G_1 and the first preset feature are input into the coding layer to obtain the first hidden layer feature h_1 corresponding to the 1st target image feature G_1. For the 2nd target image feature G_2, the 2nd target image feature G_2 and the first hidden layer feature h_1 corresponding to the 1st target image feature are input into the coding layer to obtain the first hidden layer feature h_2 corresponding to the 2nd target image feature G_2. For the 3rd target image feature G_3, the 3rd target image feature G_3 and the first hidden layer feature h_2 corresponding to the 2nd target image feature can be input into the coding layer to obtain the first hidden layer feature h_3 corresponding to the 3rd target image feature G_3.
Similarly, a first hidden layer feature corresponding to each target image feature can be obtained. After the first hidden layer feature corresponding to each target image feature is obtained, the first hidden layer feature corresponding to each target image feature may be determined as a fusion feature corresponding to each target image feature.
The process of determining the characteristics of the first hidden layer may be specifically represented by the following manner, where each coding layer shown in fig. 5 may be each coding layer in a forward Long Short-Term Memory network (LSTM):
h_t = LSTM(h_{t-1}, G_t)
Where t represents the index of the target image feature, h represents the first hidden layer feature, and LSTM represents the encoding fusion operation.
In some possible embodiments, for each target image feature, the target image feature in the sequence of image features that is adjacent to the target image feature may be the next target image feature in the sequence of image features.
In this case, for each target image feature, a second hidden layer feature corresponding to the target image feature may be determined based on the target image feature and a second hidden layer feature corresponding to a next target image feature of the target image feature in the sequence of image features.
When the target image feature is the last target image feature in the image feature sequence, the second preset feature can be used as a second hidden layer feature of the next target image feature of the target image feature.
That is, for the last target image feature in the sequence of image features, a second hidden layer feature corresponding to the target image feature may be determined based on the second preset feature and the target image feature. For any one of the other target image features in the sequence of image features except for the last target image feature, a second hidden layer feature corresponding to the target image feature may be determined based on a second hidden layer feature corresponding to a next target image feature of the target image feature and the target image feature.
Based on this, the second hidden layer feature corresponding to each target image feature in the image feature sequence may be determined as a fusion feature corresponding to each target image feature. At this time, the fusion feature corresponding to each target image feature may further fuse other target image features after the target image feature.
As an example, referring to fig. 6, fig. 6 is another schematic view of a scenario for determining fusion features according to an embodiment of the present application. As shown in fig. 6, for the last (m-th) target image feature G_m, the m-th target image feature G_m and the second preset feature are input into the coding layer to obtain the second hidden layer feature k_m corresponding to the m-th target image feature G_m. For the (m-1)-th target image feature G_{m-1}, the (m-1)-th target image feature G_{m-1} and the second hidden layer feature k_m corresponding to the m-th target image feature can be input into the coding layer to obtain the second hidden layer feature k_{m-1} corresponding to the (m-1)-th target image feature G_{m-1}. For the (m-2)-th target image feature G_{m-2}, the (m-2)-th target image feature G_{m-2} and the second hidden layer feature k_{m-1} corresponding to the (m-1)-th target image feature can be input into the coding layer to obtain the second hidden layer feature k_{m-2} corresponding to the (m-2)-th target image feature G_{m-2}.
Similarly, the second hidden layer feature corresponding to each target image feature can be obtained. After the second hidden layer feature corresponding to each target image feature is obtained, the second hidden layer feature corresponding to each target image feature may be determined as the fusion feature corresponding to that target image feature.
Wherein, each coding layer shown in fig. 6 may be each coding layer in the backward LSTM, and the process of determining the second hidden layer feature may be specifically represented by the following manner:
k_t = LSTM(k_{t+1}, G_t)
where t represents the index of the target image feature, k represents the second hidden layer feature, and LSTM represents the encoding fusion operation.
In some possible embodiments, for each target image feature, the target image feature in the image feature sequence that is adjacent to the target image feature may be a target image feature in the image feature sequence that is successively adjacent to the target image feature.
In this case, for each target image feature, the first hidden layer feature corresponding to the target image feature may be determined based on the target image feature and the first hidden layer feature corresponding to the target image feature preceding the target image feature in the image feature sequence. And determining a second hidden layer feature corresponding to the target image feature based on the target image feature and a second hidden layer feature corresponding to a next target image feature of the target image feature in the image feature sequence.
That is, for a first target image feature in the sequence of image features, a first hidden layer feature corresponding to the target image feature may be determined based on the first preset feature and the target image feature. For any one of the other target image features in the sequence of image features except for the first target image feature, the first hidden layer feature corresponding to the target image feature may be determined based on the first hidden layer feature corresponding to the target image feature that is previous to the target image feature and the target image feature. For a last target image feature in the sequence of image features, a second hidden layer feature corresponding to the target image feature may be determined based on a second preset feature and the target image feature. For any one of the other target image features in the sequence of image features except for the last target image feature, a second hidden layer feature corresponding to the target image feature may be determined based on a second hidden layer feature corresponding to a next target image feature of the target image feature and the target image feature.
Further, for each target image feature, a fusion feature corresponding to the target image feature may be determined based on the first hidden layer feature and the second hidden layer feature corresponding to the target image feature. Specifically, the first hidden layer feature and the second hidden layer feature corresponding to the target image feature may be spliced to obtain a fusion feature, or the first hidden layer feature and the second hidden layer feature corresponding to the target image feature may be spliced, and the spliced features may be processed through the full connection layer to obtain the fusion feature.
Based on the above, the fusion feature corresponding to each target image feature can further fuse other adjacent target image features before and after the target image feature, so that the context information is fully fused interactively.
As an example, referring to fig. 7, fig. 7 is a schematic view of still another scenario for determining fusion features according to an embodiment of the present application. As shown in fig. 7, based on the method shown in fig. 5, the first hidden layer feature of each target image feature in the image feature sequence can be determined, e.g., the first hidden layer feature h_{b-1} corresponding to the (b-1)-th target image feature G_{b-1}, the first hidden layer feature h_b corresponding to the b-th target image feature G_b, and the first hidden layer feature h_{b+1} corresponding to the (b+1)-th target image feature G_{b+1}. Based on the method shown in fig. 6, the second hidden layer feature of each target image feature in the image feature sequence can be determined, e.g., the second hidden layer feature k_{b-1} corresponding to the (b-1)-th target image feature G_{b-1}, the second hidden layer feature k_b corresponding to the b-th target image feature G_b, and the second hidden layer feature k_{b+1} corresponding to the (b+1)-th target image feature G_{b+1}.
Based on the first hidden layer feature and the second hidden layer feature corresponding to each target image feature, the fusion feature corresponding to each target image feature may be determined. For example, based on the first hidden layer feature h_{b-1} and the second hidden layer feature k_{b-1} corresponding to the (b-1)-th target image feature G_{b-1}, the fusion feature Q_{b-1} corresponding to the (b-1)-th target image feature can be determined. Based on the first hidden layer feature h_b and the second hidden layer feature k_b corresponding to the b-th target image feature G_b, the fusion feature Q_b corresponding to the b-th target image feature can be determined. Based on the first hidden layer feature h_{b+1} and the second hidden layer feature k_{b+1} corresponding to the (b+1)-th target image feature G_{b+1}, the fusion feature Q_{b+1} corresponding to the (b+1)-th target image feature can be determined.
The process of determining the fusion feature based on the first hidden layer feature and the second hidden layer feature after obtaining the first hidden layer feature and the second hidden layer feature corresponding to each target image feature may be specifically represented by the following manner:
Q_t = MLP(Concat(h_t, k_t))
wherein Concat denotes splicing the first hidden layer feature h_t and the second hidden layer feature k_t, and the spliced feature is processed through the fully connected network MLP to obtain the fusion feature Q_t of the t-th target image feature.
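A non-authoritative sketch of the forward/backward encoding and the MLP fusion described above (the hidden sizes and the class name are assumptions, and each target image feature is assumed to have already been flattened into a vector):

import torch
import torch.nn as nn

class ContextFusion(nn.Module):
    """A forward LSTM produces h_t, a backward LSTM produces k_t, and an MLP
    maps the concatenation [h_t, k_t] to the fusion feature Q_t."""
    def __init__(self, feat_dim, hidden_dim=256):
        super().__init__()
        self.forward_lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.backward_lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.mlp = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, g_seq):
        # g_seq: (1, T, feat_dim), the target image features in playback order
        h_seq, _ = self.forward_lstm(g_seq)                        # h_1 .. h_T
        k_rev, _ = self.backward_lstm(torch.flip(g_seq, dims=[1])) # run over the reversed sequence
        k_seq = torch.flip(k_rev, dims=[1])                        # k_1 .. k_T, back in playback order
        return self.mlp(torch.cat([h_seq, k_seq], dim=-1))         # Q_1 .. Q_T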
Step S24, for each fusion feature, determining the frame type of the corresponding initial video frame based on the fusion feature.
In some possible implementations, the frame type of each initial video frame includes a first type that does not include recurring subtitle information and a second type that includes recurring subtitle information.
Specifically, for each fusion feature, a type probability value of the initial video frame corresponding to the fusion feature may be determined based on the fusion feature. When the type probability value is smaller than a probability threshold, the frame type of the initial video frame corresponding to the fusion feature is determined to be the first type, that is, the initial video frame corresponding to the fusion feature does not include repeatedly occurring subtitle information. When the type probability value is greater than or equal to the probability threshold, the frame type of the initial video frame corresponding to the fusion feature is determined to be the second type, that is, the initial video frame corresponding to the fusion feature includes repeatedly occurring subtitle information.
Referring to fig. 8, fig. 8 is a flow chart of determining a frame type according to an embodiment of the present application. As shown in fig. 8, for each fusion feature, when determining the type probability value of the initial video frame corresponding to the fusion feature, the fusion feature may be pooled to obtain a corresponding type feature, and the type feature is further processed by the full connection layer to obtain a 1-dimensional probability feature, so that the probability feature is normalized to obtain the type probability value of the corresponding initial video frame.
The process of determining the corresponding type feature based on the fusion feature may specifically be performed in the following manner:
T_t = AvgPooling(Q_t, dimension = channel)
P_t = FC(T_t)
wherein AvgPooling(Q_t, dimension = channel) represents the type feature T_t obtained by performing a global average pooling operation on the fusion feature Q_t along the channel dimension, and P_t represents the probability feature obtained after the type feature is processed through the fully connected layer FC.
The process of determining the type probability value based on the probability feature can be specifically represented in the following manner:
p_t = Normalize(P_t)
that is, the probability feature is processed by the normalization function Normalize to obtain the type probability value p_t of the corresponding initial video frame.
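For illustration, a minimal sketch of this classification head; it assumes a sigmoid as the normalization function and treats the pooling step as a 1-D average pooling over the fusion feature vector (both are assumptions where the text above is generic):

import torch
import torch.nn as nn

class FrameTypeHead(nn.Module):
    """Average pooling -> fully connected layer -> normalization, giving the
    per-frame type probability value described in the text."""
    def __init__(self, fused_dim, pooled_dim=64):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(pooled_dim)  # stand-in for the pooling step
        self.fc = nn.Linear(pooled_dim, 1)            # 1-dimensional probability feature

    def forward(self, q_seq, threshold=0.5):
        # q_seq: (T, fused_dim), the fusion features Q_1 .. Q_T
        pooled = self.pool(q_seq.unsqueeze(1)).squeeze(1)   # (T, pooled_dim) type features
        prob = torch.sigmoid(self.fc(pooled)).squeeze(-1)   # (T,) type probability values
        # probability >= threshold -> second type (repeated subtitles), else first type
        return (prob >= threshold).long(), prob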
The video processing method provided by the embodiment of the application is further described below with reference to fig. 9. Fig. 9 is a schematic diagram of another flow chart for determining a frame type according to an embodiment of the present application. As shown in fig. 9, for a video to be processed, an initial video frame sequence comprising a plurality of initial video frames may be determined, and the initial image feature of each initial video frame in the initial video frame sequence may be determined.
Further, for each initial video frame, the target image features of the initial video frame may be determined based on the initial image features of a plurality of initial video frames adjacent to the initial video frame, and the target image features may be arranged based on the arrangement sequence of the initial video frames corresponding to the target image features to obtain the corresponding image feature sequence.
Further, for each target image feature, the fusion feature corresponding to the target image feature is determined based on the target image feature and the target image features adjacent to it in the image feature sequence. Each fusion feature is then processed based on the flow shown in fig. 8 to obtain the type probability value of the corresponding initial video frame, and the frame type of the initial video frame is determined based on the type probability value, that is, whether each initial video frame includes repeatedly occurring subtitle information is determined.
In the embodiment of the application, for each initial video frame of the video to be processed, the initial image feature of the initial video frame and the initial image features of a plurality of adjacent initial video frames can be fused to obtain the target image feature of the initial video frame. The target image feature of each initial video frame therefore includes both the feature information of the initial video frame itself and the feature information of a plurality of consecutive initial video frames before and after it, which improves the feature expression capability of the target image feature.
Further, for each initial video frame, the fusion feature corresponding to the initial video frame can be determined based on the target image feature of the initial video frame and the target image features of the adjacent initial video frames, so that the fusion feature of the initial video frame further fuses the contextual feature information. The frame type of the corresponding initial video frame can therefore be determined accurately and efficiently based on the inter-frame information, that is, whether the corresponding initial video frame includes repeatedly occurring subtitle information can be determined accurately and efficiently, and the applicability is high.
In some possible implementations, the video processing method provided by the embodiment of the present application may be implemented by a frame type prediction model.
Referring to fig. 10, fig. 10 is a flowchart illustrating a training method of a frame type prediction model according to an embodiment of the present application. As shown in fig. 10, the training method of the frame type prediction model provided in the embodiment of the present application specifically includes the following steps:
step S101, determining a training sample set, where the training sample set includes a plurality of sample videos and actual frame types of respective sample video frames of each sample video.
Wherein the actual frame type of each sample video frame also includes a first type and a second type, the first type of sample video frame not including the repeatedly occurring caption information, and the second type of sample video frame including the repeatedly occurring caption information.
The sample video may be a movie video, an advertisement video, a documentary, a self-media video with subtitle information, and the like, and may be specifically determined based on actual application scene requirements, which is not limited herein.
Specifically, for each sample video, when determining the actual frame type of each sample video frame of the sample video, caption information identification may be performed on each sample video frame of the sample video to obtain caption information of each sample video frame of the sample video.
Further, for each sample video frame of the sample video, if a first sample video frame exists among the sample video frames located after that sample video frame in the sample video, the actual frame type of that sample video frame is determined to be the second type; if no first sample video frame exists among the sample video frames located after that sample video frame, the actual frame type of that sample video frame is determined to be the first type. Here, the first sample video frame is a sample video frame whose subtitle information is the same as the subtitle information included in that sample video frame.
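A minimal sketch of this labelling rule (the subtitle-recognition step itself, e.g. OCR, is assumed to have already produced one subtitle string per sample video frame; the function name is hypothetical):

def label_actual_frame_types(subtitles):
    """subtitles[i] is the recognized subtitle text of sample frame i.
    A frame is labelled second type (1) when some later frame in the same
    sample video carries the same subtitle text, otherwise first type (0)."""
    labels = []
    for i, text in enumerate(subtitles):
        has_same_later = any(text == later for later in subtitles[i + 1:])
        labels.append(1 if has_same_later else 0)
    return labels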
Step S102, inputting each sample video into an initial model to obtain the predicted frame type of each sample video frame of the sample video.
In some possible implementations, for each sample video, the initial model determines the predicted frame type of each sample video frame of the sample video in the following manner:
determining a sample video frame sequence of the sample video, the sample video frame sequence comprising a plurality of sample video frames;
determining a sample image feature sequence, wherein the sample image feature sequence comprises sample image features of each sample video frame of the sample video, and the arrangement order of each sample image feature in the sample image feature sequence is consistent with the arrangement order of the corresponding sample video frame in the sample video frame sequence;
For each sample image feature, determining a sample fusion feature corresponding to the sample image feature based on the sample image feature and a sample image feature adjacent to the sample image feature in a sample image feature sequence;
for each sample fusion feature, a predicted frame type for the corresponding sample video frame is determined based on the sample fusion feature.
Wherein the predicted frame type of each sample video frame also includes a first type and a second type, the first type of sample video frame not including the repeatedly occurring caption information, and the second type of sample video frame including the repeatedly occurring caption information.
The embodiment of determining the sample video frame sequence of the sample video may refer to the implementation manner shown in step S21 in fig. 2, the embodiment of determining the sample image feature sequence may refer to the implementation manner shown in step S22 in fig. 2, the embodiment of determining the sample fusion feature corresponding to each sample image feature based on each sample image feature and the sample image feature adjacent to the sample image feature in the sample image feature sequence may refer to the implementation manner shown in step S23 in fig. 2, and the embodiment of determining the predicted frame type of the corresponding sample video frame based on each sample fusion feature may refer to the implementation manner shown in step S24 in fig. 2, which is not repeated herein.
And step S103, determining a total training loss based on the actual frame type and the predicted frame type of each sample video frame, performing iterative training on the initial model based on the training sample set until the total training loss meets the training ending condition, stopping training, and determining the model at the time training is stopped as the frame type prediction model.
In some possible embodiments, the total training loss may be determined based on the actual frame type and the predicted frame type of each sample video frame, in particular, a cross entropy loss of the actual frame type and the predicted frame type of each sample video frame may be determined, and the cross entropy loss may be determined as the total training loss.
Further, the initial model can be subjected to iterative training based on the training sample set, and for each iterative training, if the total training loss corresponding to the iterative training does not meet the training ending condition, the model parameters of the initial model are adjusted and the next iterative training is performed. And if the total training loss corresponding to the iterative training meets the training ending condition, stopping training and determining the model obtained by the iterative training as a final frame type prediction model.
The training ending condition may be that the total training loss reaches convergence, or that the total training loss is smaller than a preset value, or the like, and may be specifically determined based on the actual application scene requirement, which is not limited herein.
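As a hedged illustration of this step, the sketch below computes the total training loss as a binary cross entropy between the predicted and actual frame types and stops once the loss meets an assumed ending condition; the optimizer, learning rate and loss threshold are placeholders of this sketch, not values disclosed by the embodiment.

```python
import torch
import torch.nn.functional as F

def total_training_loss(pred_logits, actual_types):
    # Cross entropy between predicted and actual frame types (1 = contains repeated subtitles).
    return F.binary_cross_entropy_with_logits(pred_logits, actual_types.float())

def train(initial_model, sample_videos, frame_labels, lr=1e-3, loss_threshold=0.05, max_iters=100):
    optimizer = torch.optim.Adam(initial_model.parameters(), lr=lr)
    for _ in range(max_iters):                      # iterative training on the training sample set
        iter_losses = []
        for frames, labels in zip(sample_videos, frame_labels):
            loss = total_training_loss(initial_model(frames), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            iter_losses.append(loss.item())
        total = sum(iter_losses) / len(iter_losses)
        if total < loss_threshold:                  # assumed training ending condition
            break
    return initial_model                            # frame type prediction model
```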
The sample videos in the training sample set in the embodiment of the present application may be stored in a preset storage space, where the preset storage space may be a database (Database), a cloud storage (Cloud Storage), or a blockchain (Blockchain), which may be specifically determined based on actual application scene requirements and is not limited herein.
The database may be regarded, in short, as an electronic filing cabinet in which electronic files are stored; it may be a relational database (SQL database) or a non-relational database (NoSQL database), which is not limited herein, and may be used to store the sample videos in the training sample set of the present application. A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms; it is essentially a decentralized database, namely a chain of data blocks generated in association using cryptographic methods. In embodiments of the present application, each data block in the blockchain may store a sample video of the training sample set. Cloud storage is a concept that extends and develops from cloud computing: a large number of storage devices (also called storage nodes) of different types in a network are aggregated through functions such as cluster applications, grid technology and distributed storage file systems, and are made to work cooperatively through application software or application interfaces, so that the sample videos in the training sample set are stored jointly.
The training method of the frame type prediction model provided by the embodiment of the present application can be implemented based on the machine learning technology in artificial intelligence. Artificial intelligence is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence; it studies the design principles and implementation methods of various intelligent machines so that the machines have the functions of perception, reasoning and decision-making. Machine learning specifically studies how a computer simulates or implements human learning behaviors to acquire new knowledge or skills and reorganizes existing knowledge structures to continuously improve its own performance; the training process of the frame type prediction model can be implemented through machine learning.
The data processing involved in the video processing method and the training method of the frame type prediction model provided by the embodiments of the present application, such as determining fusion features and determining type probability values, can be implemented based on cloud computing technology in the cloud technology field, so as to improve computing efficiency.
Cloud technology is a hosting technology that unifies hardware, software, network and other resources in a wide area network or a local area network to realize the computing, storage, processing and sharing of data. Cloud computing is a computing model and a product of the fusion of traditional computer and network technologies such as grid computing (Grid Computing), distributed computing (Distributed Computing), parallel computing (Parallel Computing), utility computing (Utility Computing), network storage (Network Storage Technologies), virtualization (Virtualization) and load balancing (Load Balance). Cloud computing distributes computing tasks over a resource pool formed by a large number of computers, so that various application systems can obtain computing power, storage space and information services on demand. The network that provides the resources is called the "cloud"; to users, the resources in the "cloud" appear infinitely expandable and can be obtained at any time, used on demand, expanded at any time and paid for according to use.
Based on the frame type prediction model obtained by the frame type prediction model training method provided by the embodiment of the application, the frame type corresponding to each video frame can be accurately and efficiently determined based on the inter-frame information, namely whether each video frame comprises repeated subtitle information or not can be accurately and efficiently determined, and the applicability is high.
Meanwhile, when the actual frame type of each sample video frame of a sample video used for training the frame type prediction model is determined, the determination can be performed based on caption information recognition technology, which reduces the amount of manual labeling and improves the efficiency of constructing training samples.
In addition, the training method of the frame type prediction model provided by the embodiment of the present application does not need to separately train a model for extracting image features of video frames or an independent frame type classification model, which can effectively reduce training time, save data storage space and improve the frame type prediction effect.
Referring to fig. 11, fig. 11 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present application. The video processing device provided by the embodiment of the application comprises:
a video frame determining module 111, configured to determine an initial video frame sequence of a video to be processed, where the initial video frame sequence includes a plurality of initial video frames;
an image feature determining module 112, configured to determine an image feature sequence, where the image feature sequence includes target image features of each initial video frame, and an arrangement order of each target image feature in the image feature sequence is consistent with an arrangement order of a corresponding initial video frame in the initial video frame sequence;
An image feature fusion module 113, configured to determine, for each of the target image features, a fusion feature corresponding to the target image feature based on the target image feature and a target image feature adjacent to the target image feature in the image feature sequence;
the frame type determining module 114 is configured to determine, for each of the fusion features, a frame type of a corresponding initial video frame based on the fusion feature, where the frame type includes a first type and a second type, the video frame of the first type does not include repeatedly occurring subtitle information, and the video frame of the second type includes repeatedly occurring subtitle information.
In some possible embodiments, the image feature determining module 112 is configured to:
extracting the characteristics of each initial video frame to obtain corresponding initial image characteristics;
determining an initial feature sequence corresponding to each initial video frame, wherein the initial feature sequence corresponding to each initial video frame comprises initial image features of a plurality of continuous initial video frames in the initial video frame sequence, and the plurality of continuous initial video frames comprise the initial video frame;
for each initial video frame, determining the target image characteristics of the initial video frame based on the initial characteristic sequence corresponding to the initial video frame.
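One possible reading of this embodiment is sketched below, under the assumption that the aggregation of the initial feature sequence into a target image feature is a 1-D temporal convolution over the window of initial image features; the embodiment itself does not fix the aggregation operator here.

```python
import torch
import torch.nn as nn

class TargetImageFeature(nn.Module):
    """Turns the initial feature sequence of one frame (its own feature plus neighbours)
    into a single target image feature; the 1-D convolution is an assumed aggregator."""
    def __init__(self, feat_dim=128, window=5):
        super().__init__()
        self.aggregate = nn.Conv1d(feat_dim, feat_dim, kernel_size=window)

    def forward(self, initial_feature_seq):              # (window, feat_dim) for one initial video frame
        x = initial_feature_seq.t().unsqueeze(0)          # (1, feat_dim, window)
        return self.aggregate(x).squeeze(-1).squeeze(0)   # (feat_dim,) target image feature
```

For example, `TargetImageFeature()(torch.randn(5, 128))` maps a five-feature window to one 128-dimensional target image feature.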
In some possible embodiments, the image feature determining module 112 is configured to:
respectively adding a preset number of preset video frames at two ends of the initial video frame sequence to obtain a target video frame sequence;
for each initial video frame, determining a sub-video frame sequence corresponding to the initial video frame based on the initial video frame, a continuous preset number of video frames located before and adjacent to the initial video frame in the target video frame sequence, and a preset number of video frames located after and adjacent to the initial video frame in the target video frame sequence, and determining an initial feature sequence corresponding to the initial video frame based on initial image features corresponding to each video frame in the sub-video frame sequence.
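The padding-and-window construction just described can be sketched as follows; zero vectors stand in for the features of the "preset video frames", which is an assumption of this sketch.

```python
import torch

def build_initial_feature_sequences(initial_feats, k=2):
    """initial_feats: (T, feat_dim) initial image features of the initial video frame sequence.
    Pads k preset (here: zero) features at both ends and returns one (2k+1)-feature window per frame."""
    preset = torch.zeros_like(initial_feats[0])
    padded = [preset] * k + list(initial_feats) + [preset] * k   # corresponds to the target video frame sequence
    return [torch.stack(padded[i:i + 2 * k + 1]) for i in range(initial_feats.size(0))]
```

Each window is centred on its frame, with k features before it and k after it, matching the description of the sub-video frame sequence.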
In some possible embodiments, for each of the target image features, the image feature fusion module 113 is configured to:
determining a first hidden layer feature corresponding to the target image feature based on the target image feature and a first hidden layer feature corresponding to a previous target image feature of the target image feature in the image feature sequence;
determining a second hidden layer feature corresponding to the target image feature based on the target image feature and a second hidden layer feature corresponding to a target image feature subsequent to the target image feature in the image feature sequence;
Determining fusion features corresponding to the target image features based on the first hidden layer features and the second hidden layer features corresponding to the target image features;
when the target image feature is the first target image feature in the image feature sequence, the first preset feature is used as the first hidden layer feature corresponding to the previous target image feature of the target image feature, and when the target image feature is the last target image feature in the image feature sequence, the second preset feature is used as the second hidden layer feature corresponding to the next target image feature of the target image feature.
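This bidirectional recurrence can be written out explicitly as below. The GRU cells and the learnable zero-initialized preset features are assumptions of this sketch; the embodiment only specifies that a first and a second hidden layer feature are propagated in opposite directions, with preset features used at the boundaries.

```python
import torch
import torch.nn as nn

class BidirectionalFusion(nn.Module):
    """Sketch of the fusion step: a forward pass produces the first hidden layer features,
    a backward pass produces the second ones, and preset features are used at the boundaries."""
    def __init__(self, feat_dim=128, hidden_dim=64):
        super().__init__()
        self.forward_cell = nn.GRUCell(feat_dim, hidden_dim)
        self.backward_cell = nn.GRUCell(feat_dim, hidden_dim)
        self.first_preset = nn.Parameter(torch.zeros(1, hidden_dim))   # used before the first feature
        self.second_preset = nn.Parameter(torch.zeros(1, hidden_dim))  # used after the last feature

    def forward(self, target_feats):                  # (T, feat_dim) target image features
        T = target_feats.size(0)
        h, first_hidden = self.first_preset, []
        for t in range(T):                            # left to right: first hidden layer features
            h = self.forward_cell(target_feats[t].unsqueeze(0), h)
            first_hidden.append(h)
        h, second_hidden = self.second_preset, [None] * T
        for t in reversed(range(T)):                  # right to left: second hidden layer features
            h = self.backward_cell(target_feats[t].unsqueeze(0), h)
            second_hidden[t] = h
        # Fusion feature of each frame: concatenation of its two hidden layer features.
        return torch.cat([torch.cat(first_hidden), torch.cat(second_hidden)], dim=-1)  # (T, 2*hidden_dim)
```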
In some possible embodiments, for each of the fusion features, the frame type determining module 114 is configured to:
determining a type probability value of an initial video frame corresponding to the fusion feature based on the fusion feature;
and determining that the frame type of the initial video frame corresponding to the fusion feature is the first type in response to the type probability value being smaller than the probability threshold value, and determining that the frame type of the initial video frame corresponding to the fusion feature is the second type in response to the type probability value being greater than or equal to the probability threshold value.
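In code, this decision reduces to a single comparison; the 0.5 threshold below is an assumed value, not one given by the embodiment.

```python
def decide_frame_type(type_probability: float, probability_threshold: float = 0.5) -> str:
    # Below the threshold: first type (no repeated subtitles); otherwise: second type.
    return "first" if type_probability < probability_threshold else "second"
```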
In some possible embodiments, for each of the fusion features, the frame type determining module 114 is configured to:
pooling the fusion features to obtain corresponding probability features;
and determining the type probability value of the initial video frame corresponding to the fusion feature based on the probability feature.
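A possible reading of this pooling step is sketched below with adaptive max pooling followed by a linear layer and a sigmoid; both the pooling variant and the layer sizes are assumptions of this sketch. Combined with the threshold comparison above, it completes the frame type determining module in code form.

```python
import torch
import torch.nn as nn

class TypeProbabilityHead(nn.Module):
    """Pools a fusion feature into a probability feature and maps it to a type probability value."""
    def __init__(self, pooled_dim=32):
        super().__init__()
        self.pool = nn.AdaptiveMaxPool1d(pooled_dim)       # pooling of the fusion feature
        self.to_probability = nn.Linear(pooled_dim, 1)

    def forward(self, fusion_feature):                     # (fusion_dim,) fusion feature of one frame
        probability_feature = self.pool(fusion_feature.view(1, 1, -1)).view(-1)      # (pooled_dim,)
        return torch.sigmoid(self.to_probability(probability_feature)).squeeze(-1)   # value in [0, 1]
```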
In some possible embodiments, the video processing apparatus further includes a training module configured to train the frame type prediction model, where the training module is configured to:
determining a training sample set, wherein the training sample set comprises a plurality of sample videos and actual frame types of sample video frames of each sample video;
inputting each sample video into an initial model to obtain a predicted frame type of each sample video frame of the sample video;
determining a total training loss based on the actual frame type and the predicted frame type of each sample video frame, performing iterative training on the initial model based on the training sample set until the total training loss meets the training ending condition, stopping training, and determining the model at the time of stopping training as the frame type prediction model;
wherein the actual frame type and the predicted frame type include the first type and the second type.
In some possible embodiments, for each of the sample videos, the training module is configured to:
determining a sample video frame sequence of the sample video, the sample video frame sequence comprising a plurality of sample video frames;
determining a sample image feature sequence, wherein the sample image feature sequence comprises sample image features of each sample video frame of the sample video, and the arrangement order of each sample image feature in the sample image feature sequence is consistent with the arrangement order of the corresponding sample video frame in the sample video frame sequence;
for each sample image feature, determining a sample fusion feature corresponding to the sample image feature based on the sample image feature and a sample image feature adjacent to the sample image feature in the sample image feature sequence;
for each of the above-described sample fusion features, a predicted frame type for the corresponding sample video frame is determined based on the sample fusion feature.
In some possible embodiments, for each of the sample videos, the training module is configured to:
performing caption information identification on each sample video frame of the sample video to obtain caption information of each sample video frame of the sample video;
for each sample video frame of the sample video, if a first sample video frame exists among the sample video frames located after the sample video frame in the sample video, determining that the actual frame type of the sample video frame is the second type, and if no such first sample video frame exists among the sample video frames located after the sample video frame in the sample video, determining that the actual frame type of the sample video frame is the first type, wherein the subtitle information included in the first sample video frame is the same as the subtitle information included in the sample video frame.
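This labelling rule can be written directly over the recognized caption strings. The sketch below assumes that subtitle recognition has already produced one caption string per sample frame, and it treats frames with an empty caption as the first type, which is an added assumption of the sketch.

```python
def label_actual_frame_types(captions):
    """captions[i]: recognized subtitle text of sample frame i (empty string if none was found)."""
    labels = []
    for i, text in enumerate(captions):
        # Second type if the same caption appears again in a later frame; first type otherwise.
        repeated_later = bool(text) and text in captions[i + 1:]
        labels.append("second" if repeated_later else "first")
    return labels


# Example: the caption "Hello" reappears after frame 0, so frame 0 is labelled the second type.
print(label_actual_frame_types(["Hello", "Hello", ""]))   # ['second', 'first', 'first']
```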
In a specific implementation, the apparatus may execute, through the functional modules built into it, the implementation manners provided by the steps in fig. 2 and/or fig. 10; for details, reference may be made to the implementation manners provided by the corresponding steps, which are not described herein again.
Referring to fig. 12, fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 12, the electronic device 1200 in the present embodiment may include: a processor 1201, a network interface 1204 and a memory 1205; in addition, the electronic device 1200 may further include an object interface 1203 and at least one communication bus 1202. The communication bus 1202 is used to implement connection and communication between these components. The object interface 1203 may include a display (Display) and a keyboard (Keyboard); optionally, the object interface 1203 may further include a standard wired interface and a wireless interface. The network interface 1204 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1205 may be a high-speed RAM memory or a non-volatile memory (NVM), such as at least one disk memory. Optionally, the memory 1205 may also be at least one storage device located remotely from the processor 1201. As shown in fig. 12, the memory 1205, as a computer-readable storage medium, may include an operating system, a network communication module, an object interface module and a device control application program.
In the electronic device 1200 shown in fig. 12, the network interface 1204 may provide network communication functions, the object interface 1203 is mainly used to provide an input interface for an object, and the processor 1201 may be configured to invoke the device control application program stored in the memory 1205 to implement:
determining an initial video frame sequence of a video to be processed, wherein the initial video frame sequence comprises a plurality of initial video frames;
determining an image feature sequence, wherein the image feature sequence comprises target image features of each initial video frame, and the arrangement order of each target image feature in the image feature sequence is consistent with the arrangement order of the corresponding initial video frame in the initial video frame sequence;
for each target image feature, determining a fusion feature corresponding to the target image feature based on the target image feature and a target image feature adjacent to the target image feature in the image feature sequence;
for each of the fusion features, determining a frame type of the corresponding initial video frame based on the fusion feature, the frame type including a first type and a second type, the video frame of the first type not including repeatedly occurring caption information, the video frame of the second type including repeatedly occurring caption information.
In some possible embodiments, the processor 1201 is configured to:
extracting the characteristics of each initial video frame to obtain corresponding initial image characteristics;
determining an initial feature sequence corresponding to each initial video frame, wherein the initial feature sequence corresponding to each initial video frame comprises initial image features of a plurality of continuous initial video frames in the initial video frame sequence, and the plurality of continuous initial video frames comprise the initial video frame;
for each initial video frame, determining the target image characteristics of the initial video frame based on the initial characteristic sequence corresponding to the initial video frame.
In some possible embodiments, the processor 1201 is configured to:
respectively adding a preset number of preset video frames at two ends of the initial video frame sequence to obtain a target video frame sequence;
for each initial video frame, determining a sub-video frame sequence corresponding to the initial video frame based on the initial video frame, a continuous preset number of video frames located before and adjacent to the initial video frame in the target video frame sequence, and a preset number of video frames located after and adjacent to the initial video frame in the target video frame sequence, and determining an initial feature sequence corresponding to the initial video frame based on initial image features corresponding to each video frame in the sub-video frame sequence.
In some possible embodiments, for each of the target image features, the processor 1201 is configured to:
determining a first hidden layer feature corresponding to the target image feature based on the target image feature and a first hidden layer feature corresponding to a previous target image feature of the target image feature in the image feature sequence;
determining a second hidden layer feature corresponding to the target image feature based on the target image feature and a second hidden layer feature corresponding to a target image feature subsequent to the target image feature in the image feature sequence;
determining fusion features corresponding to the target image features based on the first hidden layer features and the second hidden layer features corresponding to the target image features;
when the target image feature is the first target image feature in the image feature sequence, the first preset feature is used as the first hidden layer feature corresponding to the previous target image feature of the target image feature, and when the target image feature is the last target image feature in the image feature sequence, the second preset feature is used as the second hidden layer feature corresponding to the next target image feature of the target image feature.
In some possible implementations, for each of the fusion features, the processor 1201 is configured to:
determining a type probability value of an initial video frame corresponding to the fusion feature based on the fusion feature;
and determining that the frame type of the initial video frame corresponding to the fusion feature is the first type in response to the type probability value being smaller than the probability threshold value, and determining that the frame type of the initial video frame corresponding to the fusion feature is the second type in response to the type probability value being greater than or equal to the probability threshold value.
In some possible implementations, for each of the fusion features, the processor 1201 is configured to:
pooling the fusion features to obtain corresponding probability features;
and determining the type probability value of the initial video frame corresponding to the fusion feature based on the probability feature.
In some possible embodiments, the processor 1201 is further configured to train a frame type prediction model, where the processor 1201 is configured to:
determining a training sample set, wherein the training sample set comprises a plurality of sample videos and actual frame types of sample video frames of each sample video;
inputting each sample video into an initial model to obtain a predicted frame type of each sample video frame of the sample video;
determining a total training loss based on the actual frame type and the predicted frame type of each sample video frame, performing iterative training on the initial model based on the training sample set until the total training loss meets the training ending condition, stopping training, and determining the model at the time of stopping training as the frame type prediction model;
wherein the actual frame type and the predicted frame type include the first type and the second type.
In some possible embodiments, for each of the sample videos, the processor 1201 is configured to:
determining a sample video frame sequence of the sample video, the sample video frame sequence comprising a plurality of sample video frames;
determining a sample image feature sequence, wherein the sample image feature sequence comprises sample image features of each sample video frame of the sample video, and the arrangement order of each sample image feature in the sample image feature sequence is consistent with the arrangement order of the corresponding sample video frame in the sample video frame sequence;
for each sample image feature, determining a sample fusion feature corresponding to the sample image feature based on the sample image feature and a sample image feature adjacent to the sample image feature in the sample image feature sequence;
For each of the above-described sample fusion features, a predicted frame type for the corresponding sample video frame is determined based on the sample fusion feature.
In some possible embodiments, for each of the sample videos, the processor 1201 is configured to:
performing caption information identification on each sample video frame of the sample video to obtain caption information of each sample video frame of the sample video;
for each sample video frame of the sample video, if a first sample video frame exists among the sample video frames located after the sample video frame in the sample video, determining that the actual frame type of the sample video frame is the second type, and if no such first sample video frame exists among the sample video frames located after the sample video frame in the sample video, determining that the actual frame type of the sample video frame is the first type, wherein the subtitle information included in the first sample video frame is the same as the subtitle information included in the sample video frame.
It should be appreciated that in some possible embodiments, the processor 1201 may be a central processing unit (CPU), or may be another general purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The memory may include read-only memory and random access memory and provide instructions and data to the processor. A portion of the memory may also include non-volatile random access memory. For example, the memory may also store information about the device type.
In a specific implementation, the electronic device 1200 may execute, through the functional modules built into it, the implementation manners provided by the steps in fig. 2 and/or fig. 10; for details, reference may be made to the implementation manners provided by the corresponding steps, which are not described herein again.
The embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method provided by the steps in fig. 2 and/or fig. 10; for details, reference may be made to the implementation manners provided by the corresponding steps, which are not described herein again.
The computer readable storage medium may be the video processing apparatus or the internal storage unit of the electronic device provided in any one of the foregoing embodiments, for example, a hard disk or a memory of the electronic device. The computer readable storage medium may also be an external storage device of the electronic device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card (flash card) or the like, which are provided on the electronic device. The computer readable storage medium may also include a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (random access memory, RAM), or the like. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of the electronic device. The computer-readable storage medium is used to store the computer program and other programs and data required by the electronic device. The computer-readable storage medium may also be used to temporarily store data that has been output or is to be output.
Embodiments of the present application provide a computer program product comprising a computer program for executing the method provided by the steps of fig. 2 and/or 10 by a processor.
The terms "first", "second" and the like in the claims, the description and the drawings are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have", as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article or electronic device that comprises a list of steps or elements is not limited to the listed steps or elements but may optionally include other steps or elements not listed or inherent to such process, method, article or electronic device. Reference herein to "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments. The term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The foregoing disclosure is illustrative of the present application and is not to be construed as limiting the scope of the application, which is defined by the appended claims.

Claims (13)

1. A method of video processing, the method comprising:
determining an initial video frame sequence of a video to be processed, the initial video frame sequence comprising a plurality of initial video frames;
determining an image feature sequence, wherein the image feature sequence comprises target image features of all initial video frames, and the arrangement order of the target image features in the image feature sequence is consistent with the arrangement order of corresponding initial video frames in the initial video frame sequence;
For each target image feature, determining a fusion feature corresponding to the target image feature based on the target image feature and a target image feature adjacent to the target image feature in the image feature sequence;
for each of the fusion features, determining a frame type of the corresponding initial video frame based on the fusion feature, the frame type including a first type and a second type, the first type video frame not including repeatedly occurring caption information, the second type video frame including repeatedly occurring caption information.
2. The method of claim 1, wherein determining a target image feature of each of the initial video frames comprises:
extracting the characteristics of each initial video frame to obtain corresponding initial image characteristics;
determining an initial feature sequence corresponding to each initial video frame, wherein the initial feature sequence corresponding to each initial video frame comprises initial image features of a plurality of continuous initial video frames in the initial video frame sequence, and the plurality of continuous initial video frames comprise the initial video frame;
for each initial video frame, determining the target image characteristics of the initial video frame based on an initial characteristic sequence corresponding to the initial video frame.
3. The method of claim 2, wherein said determining an initial feature sequence for each of said initial video frames comprises:
respectively adding a preset number of preset video frames at two ends of the initial video frame sequence to obtain a target video frame sequence;
for each initial video frame, determining a sub-video frame sequence corresponding to the initial video frame based on the initial video frame, a continuous preset number of video frames positioned before the initial video frame and adjacent to the initial video frame in the target video frame sequence, and a preset number of video frames positioned after the initial video frame and adjacent to the initial video frame in the target video frame sequence, and determining an initial feature sequence corresponding to the initial video frame based on initial image features corresponding to each video frame in the sub-video frame sequence.
4. The method of claim 1, wherein for each of the target image features, the determining a fusion feature corresponding to the target image feature based on the target image feature and a target image feature in the sequence of image features that is adjacent to the target image feature comprises:
determining a first hidden layer feature corresponding to the target image feature based on the target image feature and a first hidden layer feature corresponding to a previous target image feature of the target image feature in the image feature sequence;
Determining a second hidden layer feature corresponding to the target image feature based on the target image feature and a second hidden layer feature corresponding to a target image feature subsequent to the target image feature in the image feature sequence;
determining fusion features corresponding to the target image features based on the first hidden layer features and the second hidden layer features corresponding to the target image features;
when the target image feature is the first target image feature in the image feature sequence, the first preset feature is used as the first hidden layer feature corresponding to the previous target image feature of the target image feature, and when the target image feature is the last target image feature in the image feature sequence, the second preset feature is used as the second hidden layer feature corresponding to the next target image feature of the target image feature.
5. The method of claim 1, wherein for each of the fusion features, the determining a frame type of the corresponding initial video frame based on the fusion feature comprises:
determining a type probability value of an initial video frame corresponding to the fusion feature based on the fusion feature;
and determining that the frame type of the initial video frame corresponding to the fusion feature is the first type in response to the type probability value being smaller than a probability threshold value, and determining that the frame type of the initial video frame corresponding to the fusion feature is the second type in response to the type probability value being greater than or equal to the probability threshold value.
6. The method of claim 5, wherein for each of the fusion features, the determining a type probability value for an initial video frame to which the fusion feature corresponds based on the fusion feature comprises:
pooling the fusion features to obtain corresponding probability features;
and determining the type probability value of the initial video frame corresponding to the fusion feature based on the probability feature.
7. The method according to any one of claims 1-6, wherein the method is implemented based on a frame type prediction model, and the frame type prediction model is trained based on the following:
determining a training sample set, wherein the training sample set comprises a plurality of sample videos and actual frame types of sample video frames of each sample video;
inputting each sample video into an initial model to obtain a predicted frame type of each sample video frame of the sample video;
determining a total training loss based on the actual frame type and the predicted frame type of each sample video frame, performing iterative training on the initial model based on the training sample set until the total training loss meets a training ending condition, stopping training, and determining the model at the time of stopping training as the frame type prediction model;
Wherein the actual frame type and the predicted frame type include the first type and the second type.
8. The method of claim 7, wherein for each of the sample videos, the initial model determines the predicted frame type of each sample video frame of the sample video by:
determining a sample video frame sequence of the sample video, the sample video frame sequence comprising a plurality of sample video frames;
determining a sample image feature sequence, wherein the sample image feature sequence comprises sample image features of each sample video frame of the sample video, and the arrangement order of each sample image feature in the sample image feature sequence is consistent with the arrangement order of the corresponding sample video frame in the sample video frame sequence;
for each sample image feature, determining a sample fusion feature corresponding to the sample image feature based on the sample image feature and a sample image feature adjacent to the sample image feature in the sample image feature sequence;
for each sample fusion feature, determining a predicted frame type of the corresponding sample video frame based on the sample fusion feature.
9. The method of claim 7, wherein for each of the sample videos, determining an actual frame type for each of the sample video frames of the sample video comprises:
performing caption information identification on each sample video frame of the sample video to obtain caption information of each sample video frame of the sample video;
for each sample video frame of the sample video, if a first sample video frame exists among the sample video frames located after the sample video frame in the sample video, determining that the actual frame type of the sample video frame is the second type, and if no such first sample video frame exists among the sample video frames located after the sample video frame in the sample video, determining that the actual frame type of the sample video frame is the first type, wherein the subtitle information included in the first sample video frame is the same as the subtitle information included in the sample video frame.
10. A video processing apparatus, the apparatus comprising:
the video frame determining module is used for determining an initial video frame sequence of the video to be processed, wherein the initial video frame sequence comprises a plurality of initial video frames;
the image feature determining module is used for determining an image feature sequence, wherein the image feature sequence comprises target image features of the initial video frames, and the arrangement order of the target image features in the image feature sequence is consistent with the arrangement order of the corresponding initial video frames in the initial video frame sequence;
The image feature fusion module is used for determining fusion features corresponding to the target image features based on the target image features and target image features adjacent to the target image features in the image feature sequence for each target image feature;
and the frame type determining module is used for determining, for each fusion feature, the frame type of the corresponding initial video frame based on the fusion feature, wherein the frame type comprises a first type and a second type, the video frame of the first type does not comprise repeatedly occurring subtitle information, and the video frame of the second type comprises repeatedly occurring subtitle information.
11. An electronic device comprising a processor and a memory, the processor and the memory being interconnected;
the memory is used for storing a computer program;
the processor is configured to perform the method of any of claims 1 to 9 when the computer program is invoked.
12. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program, which is executed by a processor to implement the method of any one of claims 1 to 9.
13. A computer program product, characterized in that the computer program product comprises a computer program which, when executed by a processor, implements the method of any one of claims 1 to 9.
CN202211320887.3A 2022-10-26 2022-10-26 Video processing method, device, equipment and storage medium Pending CN117014693A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211320887.3A CN117014693A (en) 2022-10-26 2022-10-26 Video processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211320887.3A CN117014693A (en) 2022-10-26 2022-10-26 Video processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117014693A true CN117014693A (en) 2023-11-07

Family

ID=88566138

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211320887.3A Pending CN117014693A (en) 2022-10-26 2022-10-26 Video processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117014693A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117710777A (en) * 2024-02-06 2024-03-15 腾讯科技(深圳)有限公司 Model training method, key frame extraction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination