CN116980695A - Video processing method, device, equipment and storage medium - Google Patents

Video processing method, device, equipment and storage medium

Info

Publication number
CN116980695A
CN116980695A
Authority
CN
China
Prior art keywords
video
video frame
feature map
processed
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310710061.6A
Other languages
Chinese (zh)
Inventor
文伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202310710061.6A priority Critical patent/CN116980695A/en
Publication of CN116980695A publication Critical patent/CN116980695A/en
Pending legal-status Critical Current


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/4402Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/440245Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display the reformatting operation being performed only on part of the stream, e.g. a region of the image or a time segment
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20092Interactive image processing based on input by user
    • G06T2207/20104Interactive definition of region of interest [ROI]

Abstract

The embodiment of the application provides a video processing method, a device, equipment and a storage medium, wherein the method comprises the following steps: dividing the video to be processed to obtain one or more video clips, where one video clip corresponds to one video scene; acquiring a salient region of the video frames in each video segment; performing video frame clipping processing on the video frames in each video segment based on the salient regions of the video frames in each video segment, to obtain a plurality of clipped video frames; and splicing the plurality of clipped video frames to obtain a target video. In this way, the display conversion processing of the video can be completed quickly.

Description

Video processing method, device, equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a video processing method, apparatus, device, and storage medium.
Background
With the rapid development of new-generation technologies such as internet technology, social networks and intelligent sensors, intelligent terminals such as mobile phones, tablets and digital cameras are widely used, so that people can produce, acquire and spread videos more conveniently and rapidly. Conventional video has mainly been PGC (Professionally Generated Content) data, i.e., professionally produced content, while intelligent terminals such as mobile phones are now also widely used to produce and consume videos.
Because devices such as televisions, computer displays, mobile phones, tablet computers and even smart watches can all be used to play videos, a video may need cropping and other processing so that it adapts to different screen sizes for playback; even switching a mobile phone between landscape and portrait orientation can involve cropping the video. Most current methods analyze the video frames with object detection for faces, human bodies and other objects, and then fuse the per-frame detection results according to certain strategies to obtain the final video, so as to support display conversion between landscape and portrait screens or between display screens of different sizes.
At present, display conversion of a video requires deploying multiple object detection algorithms for different objects such as faces, human bodies and other objects and running them continuously; the processing logic is complex and the resource consumption is high.
Disclosure of Invention
The embodiment of the application provides a video processing method, a device, equipment and a storage medium, which can rapidly complete the display conversion processing of videos.
In one aspect, an embodiment of the present application provides a video processing method, including:
dividing the video to be processed to obtain one or more video clips; one video clip corresponds to one video scene;
acquiring a salient region of a video frame in each video segment;
based on the salient regions of the video frames in each video segment, performing video frame clipping processing on the video frames in each video segment to obtain a plurality of clipped video frames;
and performing splicing processing on the plurality of cut video frames to obtain a target video.
In one aspect, an embodiment of the present application provides a video processing method, including:
acquiring a plurality of video frames to be processed from a target video segment of the video to be processed; the target video segment is one video segment in the video to be processed;
respectively carrying out image feature extraction processing on each video frame to be processed to obtain a first feature map of each video frame to be processed;
obtaining a fusion feature map of each video frame to be processed according to the playing time sequence information of each video frame to be processed and the first feature map of each video frame to be processed; the playing time sequence information is fused in the fusion characteristic diagram of each video frame to be processed;
generating a salient feature map of a target video frame according to a first feature map of the target video frame in the plurality of video frames to be processed and a fusion feature map of the target video frame; the salient feature map is used for indicating salient regions of the target video frame so as to highlight the salient regions of the target video frame when the video to be processed is played.
In one aspect, an embodiment of the present application provides a video processing apparatus, including:
the segmentation unit is used for carrying out segmentation processing on the video to be processed to obtain one or more video clips; one video clip corresponds to one video scene;
an acquisition unit configured to acquire a salient region of a video frame in each video clip;
the clipping unit is used for clipping the video frames in each video clip based on the salient regions of the video frames in each video clip to obtain a plurality of clipped video frames;
and the splicing unit is used for splicing the plurality of cut video frames to obtain a target video.
In one aspect, an embodiment of the present application provides a video processing apparatus, including:
the acquisition unit is used for acquiring a plurality of video frames to be processed from a target video fragment of the video to be processed; the target video segment is one video segment in the video to be processed;
the processing unit is used for respectively carrying out image feature extraction processing on each video frame to be processed to obtain a first feature map of each video frame to be processed;
the processing unit is further configured to obtain a fusion feature map of each video frame to be processed according to the play time sequence information of each video frame to be processed and the first feature map of each video frame to be processed; the playing time sequence information is fused in the fusion characteristic diagram of each video frame to be processed;
the processing unit is further configured to generate a salient feature map of the target video frame according to a first feature map of the target video frame in the plurality of video frames to be processed and a fusion feature map of the target video frame; the salient feature map is used for indicating salient regions of the target video frame so as to highlight the salient regions of the target video frame when the video to be processed is played.
In one aspect, an embodiment of the present application provides a video processing apparatus, where the video processing apparatus includes an input interface and an output interface, and further includes:
a processor adapted to implement one or more instructions; the method comprises the steps of,
a computer storage medium storing one or more instructions adapted to be loaded by the processor and to perform the video processing method described above.
In one aspect, embodiments of the present application provide a computer storage medium having stored therein computer program instructions for performing the above-described video processing method when executed by a processor.
In one aspect, embodiments of the present application provide a computer program product comprising a computer program stored in a computer storage medium; the processor of the video processing apparatus reads the computer program from the computer storage medium, and the processor executes the computer program so that the video processing apparatus executes the video processing method described above.
In the embodiment of the application, the video to be processed can be segmented to obtain the video segments corresponding to the video scenes; furthermore, the salient regions of the video frames in each video segment can be obtained, and video frame clipping processing is carried out on the corresponding video frames based on the salient regions, so that clipped video frames are obtained; and performing splicing processing on a plurality of cut video frames obtained by cutting on the basis of the salient regions of the video frames in each video segment to obtain a target video. The video display conversion processing can be completed rapidly, and the video display conversion processing efficiency is improved to a certain extent.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a video processing scheme provided by an embodiment of the present application when implemented;
fig. 2 is a schematic flow chart of a video processing method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of dividing a video to be processed into video segments according to an embodiment of the present application;
FIG. 4 is a schematic diagram of salient regions of a video frame provided by an embodiment of the present application;
fig. 5 is a flowchart of another video processing method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of video segmentation based on differences in color statistics histograms according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a salient region detection model according to an embodiment of the present application;
FIG. 8a is a schematic diagram of generating a salient feature map based on a salient region detection model, provided by an embodiment of the present application;
FIG. 8b is a schematic diagram of another generation of a salient feature map based on a salient region detection model, provided by the present application;
FIG. 9a is a schematic diagram of another generation of a salient feature map based on a salient region detection model, provided by an embodiment of the present application;
FIG. 9b is a schematic diagram of another generation of a salient feature map based on a salient region detection model, provided by the present application;
FIG. 10a is a schematic diagram of another generation of a salient feature map based on a salient region detection model, provided by an embodiment of the present application;
FIG. 10b is a schematic diagram of another generation of a salient feature map based on a salient region detection model, provided by the present application;
FIG. 11 is a schematic diagram of another generation of a salient feature map based on a salient region detection model, provided by an embodiment of the present application;
FIG. 12 is a schematic diagram of determining centroid positions of salient regions of a video frame, provided by an embodiment of the present application;
FIG. 13a is a schematic diagram of another embodiment of the present application for cropping video frames according to a specified cropping scale;
FIG. 13b is a schematic diagram of another embodiment of the present application for cropping video frames according to a specified cropping scale;
FIG. 14 is a schematic diagram of smoothing a position of a centroid according to an embodiment of the present application;
FIG. 15 is a schematic diagram of another smoothing operation of the position of the centroid according to an embodiment of the present application;
FIG. 16 is a schematic diagram of cropping a video to be processed according to an embodiment of the present application;
FIG. 17 is a schematic diagram of a video cropping effect according to an embodiment of the present application;
FIG. 18 is a flowchart of another video processing method according to an embodiment of the present application;
fig. 19 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present application;
fig. 20 is a schematic structural diagram of another video processing apparatus according to an embodiment of the present application;
fig. 21 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Artificial Intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields and involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include Computer Vision (CV), speech processing, natural language processing, and Machine Learning (ML)/Deep Learning (DL). Computer vision is a science that studies how to make machines "see"; more specifically, it uses cameras and computers instead of human eyes to recognize, detect and measure targets, and further performs image processing so that the image becomes more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data.
Based on the above mentioned computer vision technology, the embodiment of the present application provides a video processing scheme, which can divide a video to be processed into video segments corresponding to video scenes; thereby obtaining the salient region of the video frame in each video segment; based on the salient regions of the video frames in each video segment, performing video frame clipping processing on the video frames in each video segment to obtain a plurality of clipped video frames; and splicing the plurality of cut video frames to obtain a target video. The video to be processed may be any video that needs to be displayed and converted to adapt to a device for video playing, for example, may be a movie video, a variety video, a sports video, etc., which is not limited in the embodiments of the present application; the cropped video frame contains some or all of the salient regions of the video frame before cropping, that is, there is an overlap of the cropped video frame with the salient regions of the corresponding video frame before cropping.
The video processing scheme may be performed by a video processing device, which may be a terminal device or a server; the terminal devices here may include, but are not limited to: computers, smart phones, tablet computers, notebook computers, intelligent home appliances, vehicle terminals, intelligent wearable devices and the like; the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), big data, and artificial intelligence platforms. Further alternatively, the above video processing scheme may be implemented by any electronic device with computing power, which is not limited by the embodiment of the present application; the following embodiments of the present application are described by taking the video processing device as an example.
Referring to fig. 1, a schematic diagram of a video processing scheme according to an embodiment of the present application is implemented; the video processing device may divide the video to be processed to obtain one or more video clips, where one video clip corresponds to one video scene; thereby obtaining the salient region of the video frame in each video segment; based on the salient regions of the video frames in each video segment, performing video frame clipping processing on the video frames in each video segment to obtain a plurality of clipped video frames; and splicing the plurality of cut video frames to obtain a target video. Wherein the cropped video frame comprises part or all of the salient region of the video frame before cropping, that is, there is an overlap of the cropped video frame with the salient region of the corresponding video frame before cropping.
In the application, the collection and processing of related data (such as video to be processed) in the application of the embodiment should be strictly based on the requirements of related laws and regulations, obtain the informed consent or independent consent of the personal information body, and develop the subsequent data use and processing behaviors within the authorized range of laws and regulations and the personal information body.
Based on the video processing scheme, the embodiment of the application provides a video processing method. Referring to fig. 2, a flow chart of a video processing method according to an embodiment of the present application is shown. The video processing method shown in fig. 2 may be performed by a video processing apparatus. The video processing method shown in fig. 2 may include the steps of:
s201, dividing the video to be processed to obtain one or more video clips.
In one embodiment, the video to be processed may be any video that needs display conversion to adapt to a device for video playback, for example a movie video, a variety-show video, a sports video, and the like, which is not limited in the embodiments of the present application. In the one or more video clips obtained by dividing the video to be processed, one video clip corresponds to one video scene; that is, the video to be processed is divided so that adjacent video clips correspond to different video scenes. For example, referring to fig. 3, a schematic diagram of dividing a video to be processed into video segments is provided in an embodiment of the present application, where the video to be processed is indicated by the 301 mark and the divided video segments are indicated by the 302, 303 and 304 marks; the adjacent video segments indicated by 302 and 303 correspond to different video scenes, and the adjacent video segments indicated by 303 and 304 correspond to different video scenes. Here, a video scene is correlated with a video shot, and one video scene corresponds to one video shot; in the embodiment of the application, a video shot refers to the continuous video frames obtained by a shooting device shooting without interruption, and different video shots may correspond to continuous video frames with different numbers of frames. It can be considered that a video scene cut occurs wherever a video shot cut occurs. On this basis, the process of dividing the video to be processed into one or more video clips is a shot segmentation process, in which one video clip corresponds to one video shot; it may be understood that any shot segmentation algorithm may be used to divide the video to be processed into one or more video clips according to video shots. In other possible embodiments, different video scenes may be partitioned based on the difference between two adjacent video frames: when the difference is large (e.g., greater than a threshold), the two adjacent video frames may be considered to belong to different video scenes.
S202, obtaining the salient regions of the video frames in each video clip.
The salient region of a video frame indicates a region of interest in the video frame, that is, a region that draws attention, i.e., a region with a higher degree of attention in the video frame. For example, referring to fig. 4, which is a schematic diagram of a salient region of a video frame provided in an embodiment of the present application, the portrait region indicated by the 401 mark is more noticeable than the background region, so the portrait region indicated by the 401 mark is the salient region of the video frame. The salient region of the video frames in each video segment obtained by the video processing device may be derived from a salient feature map of the video frames in each video segment, where the salient feature map is used to indicate the salient region of the corresponding video frame. Further, the salient feature map of the video frames in each video segment may be obtained by performing salient region detection processing on the video frames in each video segment, which may be implemented based on the visual saliency detection (Visual Saliency Detection) technology in computer vision; visual saliency detection simulates human visual characteristics through algorithms so as to identify the salient regions in video frames.
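As a rough illustration of how a salient feature map can be turned into a salient region, the sketch below thresholds a saliency heat map and takes the bounding box of the strongest responses. The threshold ratio and the NumPy-based implementation are assumptions made for this example, not details from the embodiment.

```python
import numpy as np

def salient_region_from_heatmap(saliency, thresh_ratio=0.5):
    """Derive a rough salient region (bounding box) from a saliency heat map.

    saliency:     2-D float array with the same height/width as the video frame,
                  larger values meaning more salient.
    thresh_ratio: fraction of the maximum value used as the cut-off
                  (an assumption of this sketch, not a value from the patent).
    Returns (x, y, w, h) of the salient region.
    """
    norm = saliency / (saliency.max() + 1e-8)        # normalise to [0, 1]
    ys, xs = np.nonzero(norm >= thresh_ratio)        # pixels judged salient
    if xs.size == 0:                                 # nothing salient: keep whole frame
        h, w = saliency.shape
        return 0, 0, w, h
    x0, x1 = xs.min(), xs.max()
    y0, y1 = ys.min(), ys.max()
    return int(x0), int(y0), int(x1 - x0 + 1), int(y1 - y0 + 1)
```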
S203, based on the salient regions of the video frames in each video segment, video frame clipping processing is performed on the video frames in each video segment, so as to obtain a plurality of clipped video frames.
In one embodiment, when the video processing device performs video frame clipping processing on the video frames in each video clip, clipping is performed based on a clipping ratio, and the clipping ratio can be set according to specific requirements. For example, the clipping ratio may be the ratio indicated in a clipping requirement: if the clipping requirement indicates that the video is to be clipped to an aspect ratio of 9:16, the clipping ratio is 9:16; if the clipping requirement indicates that the video is to be clipped to an aspect ratio of 1:1, the clipping ratio is 1:1; if the clipping requirement indicates that the video is to be clipped to fit the aspect ratio of the device used for video playback, the clipping ratio may be the aspect ratio of that device; and so on. Further, the clipped video frame includes part or all of the salient region of the video frame before clipping, that is, the clipped video frame overlaps with the salient region of the corresponding video frame before clipping; performing the video frame clipping processing based on the salient regions of the video frames in each video clip is intended to preserve the salient region of each video frame as much as possible when clipping.
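To make the clipping step more concrete, the following sketch computes a crop window of a given clipping ratio that is centred on the salient region as far as the frame borders allow. Centring on the salient region is one simple strategy consistent with preserving the salient region; the actual embodiment may combine it with further processing (for example smoothing positions across frames), so the function below is an illustrative assumption rather than the embodiment's exact logic.

```python
def crop_window(frame_w, frame_h, region, target_ratio=9 / 16):
    """Compute an axis-aligned crop with aspect ratio target_ratio (= width / height)
    that tries to keep the salient region (x, y, w, h) inside the crop."""
    rx, ry, rw, rh = region
    cx, cy = rx + rw / 2, ry + rh / 2            # centre of the salient region
    # Largest crop with the requested ratio that still fits inside the frame.
    if frame_w / frame_h > target_ratio:         # frame is wider than the target shape
        crop_h, crop_w = frame_h, int(round(frame_h * target_ratio))
    else:                                        # frame is narrower/taller than the target
        crop_w, crop_h = frame_w, int(round(frame_w / target_ratio))
    # Centre the crop on the salient region, then clamp it to the frame borders.
    x0 = max(0, min(int(round(cx - crop_w / 2)), frame_w - crop_w))
    y0 = max(0, min(int(round(cy - crop_h / 2)), frame_h - crop_h))
    return x0, y0, crop_w, crop_h
```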
S204, splicing the plurality of cut video frames to obtain a target video.
In one embodiment, when the video processing device performs the splicing processing on the plurality of clipped video frames, the video processing device splices the video frames in sequence according to the playing time sequence of the video frames before clipping corresponding to the plurality of clipped video frames in the video to be processed.
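A minimal sketch of the splicing step, assuming OpenCV is available, that all clipped frames already share the same size, and that they are supplied in the play order of their source frames; codec choice and audio handling are outside the scope of this example.

```python
import cv2

def stitch_frames(cropped_frames, out_path, fps):
    """Write clipped frames to a video file in their original play order.

    cropped_frames: list of BGR images (NumPy arrays) already sorted by the play
                    order of the frames they were cropped from.
    """
    h, w = cropped_frames[0].shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in cropped_frames:
        writer.write(frame)
    writer.release()
```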
In the embodiment of the application, the video to be processed can be segmented to obtain the video segments corresponding to the video scenes; furthermore, the salient regions of the video frames in each video segment can be obtained, and video frame clipping processing is carried out on the corresponding video frames based on the salient regions, so that clipped video frames are obtained; and performing splicing processing on a plurality of cut video frames obtained by cutting on the basis of the salient regions of the video frames in each video segment to obtain a target video. The video display conversion processing can be completed rapidly, and the video display conversion processing efficiency is improved to a certain extent.
Based on the related embodiments of the video processing method described above, another video processing method is provided in the embodiments of the present application. Referring to fig. 5, a flowchart of another video processing method according to an embodiment of the present application is shown. The video processing method shown in fig. 5 may be performed by a video processing apparatus. The video processing method shown in fig. 5 may include the steps of:
S501, segmenting the video to be processed to obtain one or more video clips.
In one embodiment, one video clip corresponds to one video scene. The video processing device performs segmentation processing on the video to be processed to obtain one or more video clips, which may include: performing key frame determination processing on the video to be processed to obtain a plurality of key frames; determining a color statistical histogram of each key frame, and determining a difference of the color statistical histograms between adjacent key frames according to the color statistical histogram of each key frame; and determining target key frames from adjacent key frames with differences meeting the dividing conditions, and dividing the video to be processed based on each target key frame to obtain one or more video segments.
In one embodiment, the video processing device determines key frames from the video to be processed in order to pick out video frames that carry certain specific information. In an alternative embodiment, when determining key frames from the video to be processed, the video processing device may determine each video frame in the video to be processed as a key frame. In another alternative embodiment, a key frame may be determined every certain number of video frames; for example, every 5th video frame in the video to be processed may be determined as a key frame, or every 10th video frame may be determined as a key frame. In yet another alternative embodiment, the video processing device may call an application program that supports key frame extraction, for example FFmpeg, to extract the key frames. The embodiment of the present application does not limit the specific implementation of determining key frames from the video to be processed; for convenience of explanation, the subsequent embodiments of the present application are described by taking the case where each video frame in the video to be processed is determined as a key frame as an example.
In one embodiment, the dividing condition indicates the condition under which the video to be processed is divided, and its purpose is to divide the video to be processed at adjacent key frames whose color statistical histograms differ greatly; that is, when the difference of the color statistical histograms between adjacent key frames is small, the adjacent key frames can be considered to belong to the same video scene, and when the difference is large, the adjacent key frames can be considered to belong to different video scenes. Further, the dividing condition can be set according to specific requirements; for example, the dividing condition may be set as: when the difference of the color statistical histograms between adjacent key frames is greater than a difference threshold, video segmentation is performed based on these adjacent key frames. That is, when the difference of the color statistical histograms between adjacent key frames is greater than the difference threshold, the video processing device may judge that the corresponding difference satisfies the dividing condition, determine a target key frame from the adjacent key frames whose difference satisfies the dividing condition, and divide the video to be processed based on each target key frame to obtain one or more video clips; the difference threshold may be set according to specific requirements. Further optionally, when the video processing device determines a target key frame from adjacent key frames whose difference satisfies the dividing condition, the target key frame may be the later of the two adjacent key frames; when the video processing device divides the video to be processed based on each target key frame, each target key frame may be used as a segmentation point, so that the first frame of each resulting video clip is the corresponding target key frame. For example, referring to fig. 6, a schematic diagram of video segmentation based on differences in color statistical histograms provided in an embodiment of the present application: suppose the video to be processed includes 250 frames. As can be seen from fig. 6, the difference of the color statistical histograms between the key frame that is the 65th video frame and its adjacent previous key frame is greater than the difference threshold, the difference between the key frame that is the 150th video frame and its adjacent previous key frame is greater than the difference threshold, and the difference between the key frame that is the 226th video frame and its adjacent previous key frame is greater than the difference threshold (the 1st video frame has no previous key frame and starts the first clip). On this basis, the 1st, 65th, 150th and 226th video frames in the video to be processed are determined as target key frames, and dividing the video to be processed based on these target key frames yields 4 video clips: the clip from the 1st to the 64th video frame, the clip from the 65th to the 149th video frame, the clip from the 150th to the 225th video frame, and the clip from the 226th to the 250th video frame. This process of dividing the video to be processed into one or more video clips is the shot segmentation process, and the shots obtained by segmentation are the resulting video clips; it can be understood that other shot segmentation algorithms may also be used to segment the video to be processed by shot to obtain one or more shots (i.e., video clips).
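The key-frame and colour-histogram comparison described above can be sketched as follows with OpenCV. The histogram bin counts, the use of the Bhattacharyya distance as the "difference", and the threshold value are assumptions chosen for the example rather than values from the embodiment.

```python
import cv2

def segment_by_histogram(video_path, diff_threshold=0.35, key_stride=1):
    """Return the indices of target key frames (segment start frames).

    key_stride: take every key_stride-th frame as a key frame
                (1 = every frame, matching the example in the text).
    """
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % key_stride == 0:
            hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                                [0, 256, 0, 256, 0, 256])
            cv2.normalize(hist, hist)
            if prev_hist is None:
                boundaries.append(idx)            # first frame starts the first clip
            else:
                diff = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
                if diff > diff_threshold:         # dividing condition is met
                    boundaries.append(idx)        # later key frame = target key frame
            prev_hist = hist
        idx += 1
    cap.release()
    return boundaries                             # clip i spans [b_i, b_{i+1} - 1]
```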
S502, obtaining the salient regions of the video frames in each video clip.
The salient regions of the video frames in each video clip acquired by the video processing device may be obtained based on salient feature maps of the video frames in each video clip, where a salient feature map is used to indicate the salient region of the corresponding video frame; on this basis, before acquiring the salient regions of the video frames in each video clip, the video processing device may also acquire the salient feature maps of the video frames in each video clip. In a possible implementation, before obtaining the salient regions of the video frames in each video clip, the video processing device may further: acquire a plurality of video frames to be processed from a target video clip in each video clip, where the target video clip is any one of the video clips; respectively perform image feature extraction processing on each video frame to be processed to obtain a first feature map of each video frame to be processed; obtain a fusion feature map of each video frame to be processed according to the playing time sequence information of each video frame to be processed and the first feature map of each video frame to be processed, where the playing time sequence information is fused into the fusion feature map of each video frame to be processed; and generate a salient feature map of a target video frame according to the first feature map of the target video frame in the plurality of video frames to be processed and the fusion feature map of the target video frame, where the salient feature map is used to indicate the salient region of the target video frame, and the target video frame is any one of the plurality of video frames to be processed. This process is the salient region detection processing performed on the target video frame. For convenience of explanation, in the present application the first feature map of a video frame is also referred to as the first feature map corresponding to the video frame (e.g., the first feature map of a video frame to be processed is referred to as the first feature map corresponding to the video frame to be processed), the fusion feature map of a video frame is referred to as the fusion feature map corresponding to the video frame, and the salient feature map of a video frame is referred to as the salient feature map corresponding to the video frame (e.g., the salient feature map of the target video frame is referred to as the salient feature map corresponding to the target video frame). Because the video processing device obtains the fusion feature map of each video frame to be processed according to the playing time sequence information and the first feature maps of the video frames to be processed, timing features of the plurality of video frames to be processed that include the target video frame are introduced when the salient feature map corresponding to the target video frame is generated; in other words, the influence of the other video frames among the plurality of video frames to be processed on the target video frame is taken into account, which improves the prediction accuracy and prediction stability of the salient feature map of the target video frame and ensures the continuity and effectiveness of salient region prediction.
In another possible implementation manner, the playing time sequence information of each video frame to be processed may also be added to the first feature map of each corresponding video frame to be processed, so as to obtain a fusion feature map of each video frame to be processed.
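One simple way to realise this alternative, adding play timing information directly to each first feature map, is to add a temporal encoding derived from each frame's play index; the normalised-index encoding below is an assumption made purely for illustration.

```python
import torch

def add_play_timing(first_feats):
    """first_feats: tensor of shape (T, C, H, W) holding the first feature maps of
    T video frames to be processed, ordered by play time. Returns fusion feature
    maps of the same shape with a simple timing signal added."""
    T = first_feats.shape[0]
    # Normalised play index in [0, 1] per frame, broadcast over channels and pixels.
    t = torch.arange(T, dtype=first_feats.dtype).view(T, 1, 1, 1) / max(T - 1, 1)
    return first_feats + t        # fusion feature maps carrying play timing information
```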
Further, when the video processing device acquires a plurality of video frames to be processed from a target video clip in each video clip, the number of video frames to be processed to be acquired may be set according to specific requirements; for example, it may be set to 5, 6, 10, or the like. In an optional embodiment, the plurality of video frames to be processed are obtained by performing frame extraction processing on the target video clip according to a frame extraction interval parameter, where the frame extraction interval parameter includes either or both of an interval video frame number and an interval time parameter; both can be set according to specific requirements, for example, the interval video frame number can be set to 5 frames, 6 frames, etc., and the interval time parameter can be set to 0.5 seconds, 1 second, etc. Because the frame extraction interval parameter can be set freely, among any three video frames to be processed obtained by three adjacent frame extraction operations, the numbers of video frames separating two adjacent video frames to be processed can differ. For example, in one frame extraction mode, the 1st video frame to be processed is the 1st video frame of the target video clip, the 2nd video frame to be processed is the 2nd video frame of the target video clip, and the 3rd video frame to be processed is the 6th video frame of the target video clip; here, 0 video frames separate the 1st and 2nd video frames to be processed, while 3 video frames separate the 2nd and 3rd video frames to be processed. For convenience of explanation, the following embodiments of the present application take a uniform frame extraction interval parameter as an example, that is, the same number of video frames between every two adjacent video frames to be processed.
In an alternative embodiment, when the frame extraction interval parameter is 0, there is no video frame between two video frames to be processed obtained by two adjacent frame extraction operations; that is, the video processing device may sequentially determine consecutive video frames as video frames to be processed according to the play timing of the video frames in the target video clip. For example, if the target video clip includes 100 video frames and the number of video frames to be processed to be acquired is set to 6, the plurality of video frames to be processed acquired by the video processing device from the target video clip may be the 1st to 6th video frames in the target video clip, then the 7th to 12th video frames, then the 13th to 18th video frames, and so on. In another alternative embodiment, when the frame extraction interval parameter is greater than 0 (taking the case where the parameter is the interval video frame number as an example), one or more video frames are spaced between two video frames to be processed obtained by two adjacent frame extraction operations; that is, the video processing device may acquire video frames in sequence as video frames to be processed according to the play timing of the video frames in the target video clip, while keeping an interval of one or more video frames between two video frames acquired in sequence. For example, if the target video clip includes 100 video frames, the number of video frames to be processed to be acquired is set to 6, the 1st video frame of the target video clip is taken as the first acquired video frame to be processed (the position of the first acquired video frame to be processed in the target video clip can also be set according to specific requirements, for example the 2nd video frame of the target video clip may be taken as the first one), and a frame is taken every 5 frames, then the plurality of video frames to be processed acquired by the video processing device from the target video clip may be the 1st, 6th, 11th, 16th, 21st and 26th video frames in the target video clip, or the 31st, 36th, 41st, 46th, 51st and 56th video frames in the target video clip; and so on.
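The frame extraction just described can be sketched as a simple sampling routine; the `step` parameter below expresses the distance between two picked frames in play order and is an assumption introduced for the example (step=1 reproduces the consecutive-frame case, step=5 reproduces the 1st, 6th, 11th, ... case).

```python
def sample_frames(clip, num_frames=6, step=1, start=0):
    """Pick video frames to be processed from a clip (a list of frames).

    num_frames: how many frames to take per group (e.g. 6 in the example above).
    step:       distance in play order between two picked frames; 1 picks
                consecutive frames, 5 reproduces the 1st, 6th, 11th, ... example.
    start:      0-based index of the first frame to pick.
    """
    picked, idx = [], start
    while len(picked) < num_frames and idx < len(clip):
        picked.append(clip[idx])
        idx += step
    return picked
```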
In one embodiment, the embodiment of the application provides a salient region detection model for generating the salient feature map corresponding to a video frame. Referring to fig. 7, which is a schematic structural diagram of the salient region detection model according to an embodiment of the present application, the model may include a feature extraction module, a feature fusion module, and a feature analysis module; the feature extraction module is used for respectively performing image feature extraction processing on each video frame to be processed to obtain the first feature map of each video frame to be processed; the feature fusion module is used for obtaining the fusion feature map of each video frame to be processed according to the playing time sequence information of each video frame to be processed and the first feature map of each video frame to be processed; and the feature analysis module is used for generating the salient feature map of the target video frame according to the first feature map of the target video frame in the plurality of video frames to be processed and the fusion feature map of the target video frame. The feature extraction module is intended to perform image feature extraction on video frames, and any neural network model capable of doing so can be selected, which is not limited by the embodiment of the present application; for example, a lightweight network such as the MobileNetV2 model can be selected. The feature fusion module is used for fusing the play time sequence information of each video frame to be processed to obtain the fusion feature map of each video frame to be processed; any neural network model capable of feature fusion processing can be selected, which is not limited by the embodiment of the present application, for example a GRU (Gated Recurrent Unit) network, a common gated recurrent neural network, or a Transformer model, and the like. Further, the feature value of each pixel in the salient feature map may be used to indicate the predicted degree of saliency of the corresponding pixel, and the salient feature map may be a predicted heat map (also referred to as a saliency heat map).
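A minimal PyTorch sketch of the three-module structure described above, using MobileNetV2 features as the feature extraction module and a GRU over pooled per-frame features as the feature fusion module. The layer sizes, the global pooling before the GRU, and the way timing is injected are simplifying assumptions for illustration, not the exact network of the embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import mobilenet_v2

class SalientRegionDetector(nn.Module):
    """Sketch of the three modules: feature extraction, feature fusion, feature analysis."""

    def __init__(self, feat_dim=1280, hidden=256):
        super().__init__()
        self.extractor = mobilenet_v2(weights=None).features      # feature extraction module
        self.reduce = nn.Conv2d(feat_dim, hidden, kernel_size=1)  # 1x1 conv (first convolution feature map)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)       # feature fusion module
        self.head = nn.Conv2d(hidden, 1, kernel_size=1)           # feature analysis module (saliency head)

    def forward(self, frames):
        # frames: (T, 3, H, W) tensor of video frames to be processed, in play order.
        T, _, H, W = frames.shape
        feats = self.reduce(self.extractor(frames))      # (T, hidden, h, w)
        # Fuse play timing with a GRU over globally pooled per-frame features (a simplification).
        pooled = feats.mean(dim=(2, 3)).unsqueeze(0)     # (1, T, hidden)
        fused, _ = self.gru(pooled)                      # (1, T, hidden), carries play timing
        fused = fused.squeeze(0).view(T, -1, 1, 1)       # broadcast over spatial positions
        interact = feats + fused                         # feature interaction by addition
        sal = self.head(interact)                        # (T, 1, h, w)
        return F.interpolate(sal, size=(H, W), mode="bilinear", align_corners=False)
```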
When the video processing device obtains the fusion feature map of each video frame to be processed according to the playing time sequence information of each video frame to be processed and the first feature map of each video frame to be processed, the first feature maps of the video frames to be processed can be input into the feature fusion module (for example, a GRU network) in the salient region detection model, and the feature fusion module obtains and outputs the fusion feature map corresponding to each video frame to be processed according to the playing time sequence information of each video frame to be processed and the first feature map of each video frame to be processed. When the feature fusion module processes the first feature map of any video frame to be processed, it can fuse in the features contained in the first feature maps of the video frames to be processed that precede that video frame according to the playing time sequence information, so that the resulting fusion feature map carries the playing time sequence information. For example, if the number of video frames to be processed is 6, referred to as the 1st to 6th video frames to be processed in play order, the feature fusion module may obtain the fusion feature map corresponding to the 1st video frame to be processed, the fusion feature map corresponding to the 2nd video frame to be processed, and so on up to the fusion feature map corresponding to the 6th video frame to be processed. Further, if the target video frame is the 1st video frame to be processed, the salient feature map corresponding to the 1st video frame to be processed can be generated according to the first feature map corresponding to the 1st video frame to be processed and the fusion feature map corresponding to the 1st video frame to be processed; if the target video frame is the 2nd video frame to be processed, the salient feature map corresponding to the 2nd video frame to be processed can be generated according to the first feature map corresponding to the 2nd video frame to be processed and the fusion feature map corresponding to the 2nd video frame to be processed; and so on.
In one embodiment, the video processing device generating the salient feature map of the target video frame according to the first feature map of the target video frame in the plurality of video frames to be processed and the fusion feature map of the target video frame may include: performing feature interaction processing on the first feature map of the target video frame and the fusion feature map of the target video frame to obtain an interaction feature map of the target video frame; and performing salient feature analysis processing according to the interaction feature map of the target video frame to obtain the salient feature map of the target video frame. This process may be performed by the feature analysis module in the salient region detection model. For ease of explanation, the interaction feature map of a video frame is also referred to herein as the interaction feature map corresponding to the video frame (e.g., the interaction feature map of the target video frame is referred to as the interaction feature map corresponding to the target video frame). Optionally, the feature interaction processing may refer to feature addition; that is, performing feature interaction processing on the first feature map of the target video frame and the fusion feature map of the target video frame means adding the first feature map of the target video frame and the fusion feature map of the target video frame. The process of deriving the salient feature map of the target video frame from the interaction feature map of the target video frame may be illustrated by the schematic diagram of generating a salient feature map based on the salient region detection model shown in fig. 8a. Further alternatively, referring to fig. 8b, which is a schematic diagram of another generation of a salient feature map based on the salient region detection model provided by the present application: after obtaining the first feature maps corresponding to the video frames to be processed, the video processing device may respectively perform convolution processing on the first feature maps corresponding to the video frames to be processed to obtain first convolution feature maps corresponding to the video frames to be processed, so as to fuse the channel information in the first feature maps; subsequent processing of the first feature maps then operates on the first convolution feature maps, that is, the video processing device may obtain the fusion feature map of each video frame to be processed according to the playing time sequence information of each video frame to be processed and the first convolution feature map corresponding to each video frame to be processed, and generate the salient feature map of the target video frame according to the first convolution feature map corresponding to the target video frame in the plurality of video frames to be processed and the fusion feature map of the target video frame. The convolution processing of the first feature map corresponding to each video frame to be processed may be implemented by a convolution layer; specifically, a 1×1 convolution layer may be selected.
In a possible implementation manner, when the video processing device performs salient feature analysis processing according to the interaction feature map corresponding to the target video frame through the feature analysis module in the salient region detection model to obtain the salient feature map corresponding to the target video frame, the video processing device may perform convolution processing on the interaction feature map corresponding to the target video frame to obtain a second convolution feature map. The convolution layer used to obtain the second convolution feature map may be a 1×1 convolution layer with a single convolution kernel, so that the number of channels of the second convolution feature map is reduced to 1 while its feature map size remains the same as that of the feature map input to the convolution layer, i.e., the same as the feature map size of the interaction feature map corresponding to the target video frame. Further, interpolation processing can be performed on the second convolution feature map, and the interpolated feature map is used as the salient feature map corresponding to the target video frame; the interpolation processing is performed on the second convolution feature map so as to restore its feature map size to the size of the target video frame, so the feature map size of the salient feature map corresponding to the target video frame is the same as the size of the target video frame. When performing the interpolation processing on the second convolution feature map, nearest-neighbor interpolation, bilinear interpolation, bicubic interpolation and the like can be selected, which is not limited by the embodiment of the present application.
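The two operations just described, a 1×1 convolution with a single kernel followed by interpolation back to the frame size, can be sketched as a small module; the choice of bilinear interpolation and the in_channels parameter are assumptions made for the example.

```python
import torch.nn as nn
import torch.nn.functional as F

class SaliencyHead(nn.Module):
    """1x1 convolution with a single kernel that collapses the interaction feature map
    to one channel (the second convolution feature map), followed by interpolation back
    to the original frame size to obtain the salient feature map."""

    def __init__(self, in_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=1)     # one 1x1 kernel -> 1 channel

    def forward(self, interaction_map, frame_size):
        second_conv = self.conv(interaction_map)                  # same h, w as the input
        return F.interpolate(second_conv, size=frame_size,        # restore to the frame size
                             mode="bilinear", align_corners=False)
```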
In one embodiment, the first feature map corresponding to the target video frame obtained by the video processing device focuses more on the extraction of deep semantic features of the target video frame; in order to improve the prediction accuracy of the salient feature map corresponding to the target video frame, shallow features of the target video frame can be introduced in the process of predicting the salient feature map corresponding to the target video frame, and the shallow features of the target video frame pay more attention to extraction of texture features, position features and the like of the target video frame, so that the accuracy of the predicted salient feature map can be improved. Based on this, the video processing device performs salient feature analysis processing according to the interactive feature map of the target video frame, to obtain a salient feature map of the target video frame, and may include: performing feature stitching processing on the interactive feature map of the target video frame and the second feature map of the target video frame to obtain a first stitching feature map of the target video frame; performing salient feature analysis processing according to the first spliced feature map of the target video frame to obtain a salient feature map of the target video frame; the second feature map of the target video frame is obtained in the process of extracting image features of the target video frame, and the first feature map of the target video frame is obtained by extracting semantic features of the second feature map of the target video frame. For convenience of explanation, the second feature map of the target video frame is also referred to as a second feature map corresponding to the target video frame, and the first stitching feature map of the target video frame is referred to as a first stitching feature map corresponding to the target video frame. 
Optionally, when the feature extraction module in the salient region detection model selects a neural network model, the feature extraction processing on the video frame to be processed is performed based on a plurality of hidden layers in the neural network model selected by the feature extraction module, and the second feature map corresponding to the target video frame is obtained in the process of performing image feature extraction processing on the target video frame; that is, the second feature map and the first feature map corresponding to the target video frame are extracted at different stages in the process of performing image feature extraction processing on the target video frame. Further, the number of hidden layers between the hidden layer that outputs the second feature map corresponding to the target video frame and the input layer is smaller than the number of hidden layers between the hidden layer that outputs the first feature map corresponding to the target video frame and the input layer, that is, the hidden layer that outputs the second feature map corresponding to the target video frame is closer to the input layer; the feature map output by a certain hidden layer in the neural network model referenced by the feature extraction module in the salient region detection model can be selected as the second feature map according to specific requirements. If the hidden layer that outputs the first feature map corresponding to the target video frame is called the first hidden layer and the hidden layer that outputs the second feature map corresponding to the target video frame is called the second hidden layer, then the semantic feature extraction performed on the second feature map corresponding to the target video frame to obtain the first feature map corresponding to the target video frame is implemented through the hidden layers located after the second hidden layer up to and including the first hidden layer.
Further alternatively, when the feature stitching processing is performed on the interactive feature map of the target video frame and the second feature map of the target video frame, the feature stitching processing may be implemented based on a stitching function (for example, a concat function); and, the interactive feature map of the target video frame is required to be the same as the feature map size of the second feature map of the target video frame; if the feature sizes of the interaction feature map of the target video frame and the second feature map of the target video frame are different, the video processing device performs feature stitching processing on the interaction feature map of the target video frame and the second feature map of the target video frame to obtain a first stitched feature map of the target video frame, which may include: performing size transformation processing on the interactive feature map of the target video frame to obtain an interactive feature map after size transformation; performing feature stitching processing on the interactive feature map after the size transformation and the second feature map of the target video frame to obtain a first stitching feature map of the target video frame; the size of the interactive feature map after the size transformation is the same as that of the feature map of the second feature map of the target video frame, namely the width and the height of the feature map are the same; alternatively, the size transformation process may refer to upsampling, and the size transformation process of the interactive feature map of the target video frame may refer to: up-sampling an interaction feature diagram of a target video frame; this process may be illustrated by another schematic diagram of generating a salient feature map based on the salient region detection model shown in fig. 9 a. Accordingly, in the process of generating the salient feature map based on the salient region detection model shown in fig. 9a, a convolution layer for performing convolution processing on the first feature map of the video frame to be processed may be introduced, so that the subsequent processing on the first feature map of the video frame to be processed, that is, the processing on the first convolution feature map corresponding to the video frame to be processed, may be shown by another schematic diagram of generating the salient feature map based on the salient region detection model shown in fig. 9 b.
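A possible sketch of the size transformation followed by feature stitching described above (PyTorch assumed; shapes are illustrative):

```python
import torch
import torch.nn.functional as F

interaction_feature_map = torch.randn(1, 256, 8, 13)   # deep interaction features
second_feature_map = torch.randn(1, 160, 16, 26)        # shallow features from an earlier hidden layer

# size transformation (upsampling) so that both feature maps have the same width and height
upsampled = F.interpolate(interaction_feature_map,
                          size=second_feature_map.shape[-2:],
                          mode="bilinear", align_corners=False)                  # [1, 256, 16, 26]

# feature stitching along the channel dimension (concat-style)
first_stitching_feature_map = torch.cat([upsampled, second_feature_map], dim=1)  # [1, 416, 16, 26]
```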
In one embodiment, in order to improve the prediction accuracy of the salient feature map corresponding to the target video frame, shallow features of the target video frame may be introduced in the process of predicting the salient feature map corresponding to the target video frame; accordingly, one or more shallow feature maps obtained in the process of performing image feature extraction processing on the target video frame may be introduced, and the feature maps output by different hidden layers in the neural network model referenced by the feature extraction module in the salient region detection model may be selected as the shallow features to be introduced according to specific requirements. Taking the case of introducing, in addition to the second feature map of the target video frame, a third feature map of the target video frame output by a different hidden layer as an example, the video processing device performs salient feature analysis processing according to the first stitching feature map of the target video frame to obtain the salient feature map of the target video frame, which may include: performing feature stitching processing on the first stitching feature map of the target video frame and the third feature map of the target video frame to obtain a second stitching feature map of the target video frame; and performing salient feature analysis processing according to the second stitching feature map of the target video frame to obtain the salient feature map of the target video frame; the third feature map of the target video frame is obtained in the process of extracting image features of the target video frame, and the second feature map of the target video frame is obtained by performing semantic feature extraction on the third feature map of the target video frame. For convenience of explanation, the third feature map of the target video frame is also referred to as the third feature map corresponding to the target video frame, and the second stitching feature map of the target video frame is referred to as the second stitching feature map corresponding to the target video frame. The hidden layer that outputs the third feature map of the target video frame is closer to the input layer than the hidden layer that outputs the second feature map of the target video frame; if the hidden layer that outputs the third feature map corresponding to the target video frame is called the third hidden layer, then the semantic feature extraction performed on the third feature map of the target video frame to obtain the second feature map corresponding to the target video frame is implemented through the hidden layers located after the third hidden layer up to and including the second hidden layer. By further introducing the third feature map, which is obtained in the process of extracting image features of the target video frame and carries richer shallow features, the prediction accuracy of the salient feature map corresponding to the target video frame can be further improved.
Further optionally, performing feature stitching processing on the first stitched feature map of the target video frame and the third feature map of the target video frame, so as to obtain a second stitched feature map of the target video frame, where the second stitched feature map of the target video frame may be implemented based on a stitching function (for example, a concat function); and the first spliced feature map of the target video frame is required to be the same as the feature map of the third feature map of the target video frame in size; if the sizes of the first spliced feature map of the target video frame and the feature map of the third feature map of the target video frame are different, the video processing equipment performs feature splicing processing on the first spliced feature map of the target video frame and the third feature map of the target video frame, and when a second spliced feature map of the target video frame is obtained, the first spliced feature map of the target video frame can be subjected to size transformation processing to obtain a spliced feature map after size transformation; performing feature stitching processing on the stitching feature map after the size transformation and a third feature map of the target video frame to obtain a second stitching feature map of the target video frame; the size of the spliced feature map after the size transformation is the same as that of a feature map of a third feature map of the target video frame, namely the width and the height of the feature map are the same; alternatively, the size transformation process may refer to upsampling, and the size transformation process of the first stitching feature map of the target video frame may refer to: upsampling a first stitching feature map of a target video frame; this process may be illustrated by another schematic diagram of generating a salient feature map based on the salient region detection model shown in fig. 10 a. Accordingly, in the process of generating the salient feature map based on the salient region detection model shown in fig. 10a, a convolution layer for performing convolution processing on the first feature map of the video frame to be processed may be introduced, so that the subsequent processing on the first feature map of the video frame to be processed, that is, the processing on the first convolution feature map corresponding to the video frame to be processed, may be shown by another schematic diagram of generating the salient feature map based on the salient region detection model shown in fig. 10 b.
For example, if the neural network model selected by the feature extraction module in the salient region detection model is a MobileNetV2 model, the feature fusion module selects a GRU network, the feature map whose number of output channels in the MobileNetV2 model is 1280 is selected as the first feature map, the feature map whose number of output channels in the MobileNetV2 model is 160 is selected as the second feature map, the feature map whose number of output channels in the MobileNetV2 model is 64 is selected as the third feature map, and a convolution layer for performing convolution processing on the first feature map of the video frame to be processed is introduced, reference may be made to fig. 11, which is another schematic diagram of generating a salient feature map based on the salient region detection model provided by the embodiment of the present application. If the resolution of each video frame in the video to be processed is (256, 416) (i.e., a width of 256 and a height of 416), and the number of video frames to be processed acquired from the target video segment is 6, the video processing device may first perform image feature extraction processing on each video frame to be processed through the MobileNetV2 model in the salient region detection model, so as to obtain the first feature map, the second feature map and the third feature map corresponding to each video frame to be processed. The 6 video frames to be processed input to the MobileNetV2 model can be input in the form of feature maps and expressed as [6, 3, 256, 416], wherein 6 represents the number of video frames to be processed, (256, 416) represents the feature map size of the input feature map of each video frame to be processed, and 3 represents the number of channels of the input feature map of each video frame to be processed (i.e., the R (red) channel, the G (green) channel and the B (blue) channel); the first feature maps corresponding to the 6 video frames to be processed can be expressed as [6, 1280, 8, 13], the second feature maps corresponding to the 6 video frames to be processed can be expressed as [6, 160, 16, 26] (represented as 2X feature in the figure), and the third feature maps corresponding to the 6 video frames to be processed can be expressed as [6, 64, 32, 52] (represented as 4X feature in the figure).
Further, after the first feature map corresponding to each video frame to be processed is obtained, convolution processing can be performed on the first feature map corresponding to each video frame to be processed to obtain the first convolution feature map corresponding to each video frame to be processed, wherein the corresponding convolution layer may be selected as a 1×1 convolution layer; at this time, the first convolution feature maps corresponding to the 6 video frames to be processed can be expressed as [6, 256, 8, 13], that is, the number of channels of the first convolution feature maps is 256 and the feature map size is (8, 13). Then, the first convolution feature maps corresponding to the video frames to be processed can be input into the GRU network in order according to the playing time sequence information of each video frame to be processed, so as to obtain the fusion feature map corresponding to each video frame to be processed; the fusion feature maps corresponding to the 6 video frames to be processed can be expressed as [6, 256, 8, 13], with 256 channels and a feature map size of (8, 13).
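One way the GRU-based fusion of the per-frame first convolution feature maps could be realized is sketched below (PyTorch assumed; treating every spatial position as an independent temporal sequence is only one possible reading of the fusion step, and a ConvGRU or a pooled-feature GRU would also fit the description):

```python
import torch
import torch.nn as nn

T, C, H, W = 6, 256, 8, 13
first_conv_feature_maps = torch.randn(T, C, H, W)     # [6, 256, 8, 13], in playing order

gru = nn.GRU(input_size=C, hidden_size=C)              # keeps the channel count at 256

# reshape to (sequence length, batch of spatial positions, channels)
seq = first_conv_feature_maps.permute(0, 2, 3, 1).reshape(T, H * W, C)   # [6, 104, 256]
fused_seq, _ = gru(seq)                                                   # [6, 104, 256]

# back to per-frame feature maps: one fusion feature map per video frame to be processed
fusion_feature_maps = fused_seq.reshape(T, H, W, C).permute(0, 3, 1, 2)   # [6, 256, 8, 13]
```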
If the target video frame is the 1st video frame to be processed among the plurality of video frames to be processed, the video processing device can perform feature interaction processing on the first convolution feature map corresponding to the target video frame (expressed as [1, 256, 8, 13]) and the fusion feature map corresponding to the target video frame (namely the fusion feature map corresponding to the first video frame to be processed, expressed as [1, 256, 8, 13]) through the feature analysis module in the salient region detection model, so as to obtain the interaction feature map corresponding to the target video frame; the interaction feature map corresponding to the target video frame may be obtained by performing feature addition on the first convolution feature map corresponding to the target video frame and the fusion feature map corresponding to the target video frame, and can be expressed as [1, 256, 8, 13], that is, the feature map size is not changed.
Further, the video processing device can acquire the second feature map corresponding to the target video frame (expressed as [1, 160, 16, 26]) through the feature analysis module in the salient region detection model; because the interaction feature map corresponding to the target video frame (expressed as [1, 256, 8, 13]) and the second feature map corresponding to the target video frame (expressed as [1, 160, 16, 26]) have different feature map sizes, size transformation processing can be performed on the interaction feature map corresponding to the target video frame to obtain a size-transformed interaction feature map, where the feature map size of the size-transformed interaction feature map is the same as that of the second feature map corresponding to the target video frame, both being (16, 26); feature stitching processing is then performed on the size-transformed interaction feature map and the second feature map corresponding to the target video frame to obtain the first stitching feature map corresponding to the target video frame, which can be expressed as [1, 416, 16, 26]; the size-transformed interaction feature map can be obtained by upsampling the interaction feature map corresponding to the target video frame, and the number of channels of the first stitching feature map corresponding to the target video frame is 416 (256+160).
Furthermore, the video processing device can acquire the third feature map corresponding to the target video frame (expressed as [1, 64, 32, 52]) through the feature analysis module in the salient region detection model; since the first stitching feature map corresponding to the target video frame (expressed as [1, 416, 16, 26]) and the third feature map corresponding to the target video frame (expressed as [1, 64, 32, 52]) have different feature map sizes, size transformation processing can be performed on the first stitching feature map corresponding to the target video frame to obtain a size-transformed stitching feature map, where the feature map size of the size-transformed stitching feature map is the same as that of the third feature map corresponding to the target video frame, both being (32, 52); feature stitching processing is then performed on the size-transformed stitching feature map and the third feature map corresponding to the target video frame to obtain the second stitching feature map corresponding to the target video frame, which can be expressed as [1, 480, 32, 52]; the size-transformed stitching feature map can be obtained by upsampling the first stitching feature map corresponding to the target video frame, and the number of channels of the second stitching feature map corresponding to the target video frame is 480 (416+64). Then, the video processing device can perform salient feature analysis processing according to the second stitching feature map corresponding to the target video frame through the feature analysis module in the salient region detection model, so as to obtain the salient feature map corresponding to the target video frame. In a possible implementation manner, convolution processing may be performed on the second stitching feature map corresponding to the target video frame to obtain a second convolution feature map; the convolution layer used to obtain the second convolution feature map may be a 1×1 convolution layer with a single convolution kernel, so that the number of channels of the second convolution feature map is reduced to 1, while its feature map size is the same as the feature map size of the feature map input to the convolution layer, that is, the same as the feature map size of the second stitching feature map corresponding to the target video frame (the second convolution feature map can be expressed as [1, 1, 32, 52]); further, interpolation processing can be performed on the second convolution feature map, and the feature map after the interpolation processing is used as the salient feature map corresponding to the target video frame, wherein the purpose of the interpolation processing on the second convolution feature map is to restore the feature map size of the second convolution feature map to the size of the target video frame, so that the feature map size of the salient feature map corresponding to the target video frame is the same as the size of the target video frame (it can be expressed as [1, 1, 256, 416]); when the interpolation processing is performed on the second convolution feature map, a nearest-neighbor interpolation method, a bilinear interpolation method, a bicubic interpolation method and the like can be selected, which is not limited by the embodiment of the present application.
Further optionally, in order to improve the prediction smoothness of the salient feature map, after the interpolation processing is performed on the second convolution feature map, the feature map after the interpolation processing may be filled (i.e., padded) to obtain a filled feature map, adjustment processing is then performed on the filled feature map, and the feature map after the adjustment processing is used as the salient feature map corresponding to the target video frame; the filled feature map can be expressed as [1, 1, 296, 456], that is, padding is applied in the height and width dimensions; a larger convolution kernel can be used when the adjustment processing is performed on the filled feature map, so that the feature map size of the feature map after the adjustment processing is restored to the size of the target video frame, i.e., [1, 1, 256, 416].
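The padding-and-adjustment step can be sketched as follows (PyTorch assumed; a padding of 20 on each side and a 41×41 adjustment kernel are inferred from the example sizes above and are therefore assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

interpolated = torch.randn(1, 1, 256, 416)          # salient feature map at frame resolution

padded = F.pad(interpolated, pad=(20, 20, 20, 20))   # [1, 1, 296, 456] (pad width then height)
smooth = nn.Conv2d(1, 1, kernel_size=41)              # "larger kernel", no padding inside the conv
adjusted = smooth(padded)                              # back to [1, 1, 256, 416]
```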
In one embodiment, in order to further improve the prediction accuracy of the salient region in the video frame, after obtaining salient feature maps corresponding to a plurality of to-be-processed video frames in the target video segment through the salient region detection model, the video processing device may further: detecting a video frame corresponding to a voice signal in a plurality of video frames to be processed; carrying out face detection processing on a video frame corresponding to the voice signal; if the existence of the human face is detected, detecting the action change condition of a mouth area of the human face based on the video frame corresponding to the voice signal and the video frame associated with the video frame corresponding to the voice signal in the target video segment; under the condition that the action change exists in the mouth region of the face, the salient feature map of the video frame corresponding to the voice signal is updated according to the portrait region to which the face belongs. The face detection processing of the video frame corresponding to the voice signal may be implemented based on a related technology of face detection in the computer vision field, and the detection of the motion change condition of the mouth region of the face, that is, the detection of whether there is a motion change in the mouth region of the face, may be implemented based on a related technology of motion recognition (Action Recognition) in the computer vision field. Further, the number of video frames in the target video segment, which are associated with the video frames corresponding to the voice signal, is less than a frame number threshold, which may be set according to specific requirements, for example, if the frame number threshold is set to 5, 4 video frames before the video frames corresponding to the voice signal and 4 video frames after the video frames corresponding to the voice signal may be used as video frames in the target video segment, which are associated with the video frames corresponding to the voice signal, so as to detect the motion change condition of the mouth region of the face, that is, detect whether there is a motion change in the mouth region of the face. Further, under the condition that motion change exists in the mouth region of the face, updating the salient feature map of the video frame corresponding to the voice signal according to the portrait region to which the face belongs, wherein the salient region indicated by the updated salient feature map is the portrait region to which the corresponding face belongs; that is, if it is detected that there is a motion change in the mouth region of the face, a new salient feature map can be obtained by using the portrait region to which the face belongs as a new salient region, and the original salient feature map of the video frame corresponding to the voice signal can be replaced with the new salient feature map. According to the process, when a plurality of faces exist in a video frame corresponding to the voice signal, whether the mouth area of each face has action change or not is detected to detect the speaking object, so that the recognition of the speaking object is realized, the significant area indicated by the updated significant feature map is the portrait area to which the corresponding speaking object belongs, and the detection accuracy of the significant area can be improved. 
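A schematic sketch of this speaker-aware refinement is given below; detect_faces, mouth_is_moving and portrait_mask are hypothetical helpers standing in for the face detection, action recognition and portrait segmentation components mentioned above, so this is only an outline of the control flow, not a concrete implementation.

```python
def refine_with_speaker_detection(frames, saliency_maps, voice_frame_indices,
                                  detect_faces, mouth_is_moving, portrait_mask):
    """Update the salient feature map of voice-bearing frames with the speaker's portrait region."""
    for i in voice_frame_indices:
        # associated frames: fewer than the frame-number threshold (e.g. 4 before and 4 after for a threshold of 5)
        context = frames[max(0, i - 4): i + 5]
        for face in detect_faces(frames[i]):
            if mouth_is_moving(face, context):
                # the portrait region of the speaking face becomes the new salient region
                saliency_maps[i] = portrait_mask(frames[i], face)
                break
    return saliency_maps
```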
In the application, when the related embodiment of the application is applied to specific products or technologies, the related data collection, use and processing processes should comply with legal regulations, the information processing rules should be informed and the individual consent of the target object should be solicited before the face information is collected, and the face information is processed in strict compliance with the legal regulations and the personal information processing rules, and technical measures are taken to ensure the safety of the related data.
In one embodiment, if the frame extraction interval parameter is 0, that is, if there is no video frame with an interval between two video frames to be processed obtained by sequentially extracting frames, then, referring to the generation mode of the salient feature map corresponding to the target video frame, the salient feature map corresponding to each video frame in the target video segment can be obtained, and further, the salient feature map corresponding to each video frame in each video segment is obtained. If the frame extraction interval parameter is greater than 0, one or more video frames are spaced between two video frames to be processed, which are obtained through frame extraction processing in sequence; then, referring to the generation manner of the salient feature map corresponding to the target video frame, the salient feature map corresponding to the video frame determined to be the video frame to be processed in the target video segment can be obtained, but the salient feature map corresponding to the remaining video frame cannot be obtained. In a possible implementation manner, if the plurality of video frames to be processed are obtained by performing frame extraction processing on the target video segment through frame extraction interval parameters, the frame extraction interval parameters include any one or more of the number of frames of the video at intervals and the time parameters at intervals; the plurality of video frames to be processed comprise a first video frame to be processed, which is obtained during first frame extraction processing, and a second video frame to be processed, which is obtained during second frame extraction processing; one or more video frames are spaced between the first video frame to be processed and the second video frame to be processed; the second frame extraction processing refers to the next frame extraction processing of the first frame extraction processing; the video processing device may also: and determining the salient feature map of each video frame at the interval between the first to-be-processed video frame and the second to-be-processed video frame according to the salient feature map of the first to-be-processed video frame and the salient feature map of the second to-be-processed video frame. That is, the frame interpolation diffusion process may be performed according to the salient feature map of the first to-be-processed video frame and the salient feature map of the second to-be-processed video frame, so as to obtain salient feature maps of the respective video frames at intervals between the first to-be-processed video frame and the second to-be-processed video frame; the first to-be-processed video frame and the second to-be-processed video frame can be any two to-be-processed video frames obtained through sequential frame extraction processing (namely, two adjacent frame extraction processing), and the interval between the two to-be-processed video frames is one or more video frames. 
Optionally, for any video frame spaced between the first video frame to be processed and the second video frame to be processed, one salient feature map may be selected from the salient feature map of the first video frame to be processed and the salient feature map of the second video frame to be processed, and the selected salient feature map is used as the salient feature map of that video frame; further optionally, when one salient feature map is selected from the salient feature map of the first video frame to be processed or the salient feature map of the second video frame to be processed, the salient feature map of the video frame to be processed that is closest to that video frame (i.e., spaced from it by the smallest number of video frames) may be selected as the salient feature map of that video frame. That is, when the video processing device performs the frame interpolation diffusion processing based on the salient feature maps predicted in the target video segment to obtain the salient feature maps corresponding to the remaining video frames in the target video segment, for any remaining video frame in the target video segment, the salient feature map corresponding to the video frame to be processed that is closest to that remaining video frame (i.e., spaced from it by the smallest number of video frames) may be used as the salient feature map corresponding to that remaining video frame. Based on the above description, the salient feature map corresponding to each video frame in the target video segment can be obtained, and then the salient feature map corresponding to each video frame in each video segment can be obtained. If the plurality of video frames to be processed acquired from the target video segment are such that no video frame is spaced between two video frames to be processed obtained by sequential frame extraction processing, then the salient feature maps corresponding to all the video frames in the target video segment all need to be obtained through the salient region detection model; because the content of a video changes relatively little within a short time (for example, the content change within 1 second is relatively small), when the plurality of video frames to be processed are acquired, they can be acquired at intervals (namely, one or more video frames are spaced between two video frames to be processed obtained by sequential frame extraction processing), so that the frame interpolation diffusion processing can be performed based on the salient feature maps predicted in the target video segment to obtain the salient feature maps corresponding to the remaining video frames in the target video segment, the prediction process of the salient feature maps can be accelerated, and the processing efficiency can be improved.
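The frame interpolation diffusion can be sketched as a nearest-processed-frame assignment (a minimal illustration; the variable names and the example indices are assumptions):

```python
def diffuse_saliency(num_frames, processed_indices, processed_maps):
    """Assign each frame in the segment the salient feature map of the nearest processed frame.

    processed_indices: indices of the frames that went through the detection model,
                       e.g. [0, 5, 10, 15, 20, 25] for 30 frames with an interval of 4.
    processed_maps:    salient feature maps for those frames, in the same order.
    """
    result = []
    for i in range(num_frames):
        nearest = min(range(len(processed_indices)),
                      key=lambda k: abs(processed_indices[k] - i))  # smallest number of spaced frames
        result.append(processed_maps[nearest])
    return result
```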
For example, if the video frames to be processed are acquired every 4 frames, the prediction speed of the salient feature maps can be improved by about 5 times compared with acquiring the video frames to be processed continuously; if there are 30 video frames and 6 video frames to be processed are acquired each time, then with continuous acquisition all 30 video frames (5 batches of 6 frames) need to be acquired and processed through the salient region detection model, whereas if the frames are acquired every 4 frames, only 6 video frames to be processed (the 1st, 6th, 11th, 16th, 21st and 26th video frames respectively) need to be acquired and processed through the salient region detection model.
In one possible implementation, since the feature values of the respective pixels in the saliency map may be used to indicate: based on the predicted significance degree of the corresponding pixel points, the region formed by the pixel points with the indicated significance degree larger than a certain threshold value in the significance feature map corresponding to the video frame can be determined as the significance region of the video frame.
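For instance, the salient region could be extracted from a salient feature map as follows (NumPy assumed; the threshold value is illustrative, the text only requires "larger than a certain threshold"):

```python
import numpy as np

def salient_region_mask(salient_feature_map, threshold=0.5):
    """Binary mask of the salient region: pixels whose predicted saliency degree exceeds the threshold."""
    return (np.asarray(salient_feature_map) > threshold).astype(np.uint8)
```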
S503, determining a centroid position of the salient region of the corresponding video frame based on the salient region of the video frame in each video clip.
In one embodiment, when the video processing device determines the centroid position of the salient region of the corresponding video frame based on the salient region of the video frame in each video clip, taking determining the centroid position of the salient region of one video frame as an example, the region contour of the salient region of the video frame may be determined based on the salient region of the video frame first, and then the centroid position of the region contour may be determined as the centroid position of the salient region of the video frame. When determining the regional outline of the salient region of the video frame, the contour detection algorithm can be adopted, for example, the contour detection algorithm based on an edge detection operator can be adopted, and a neural network model for realizing contour detection can also be adopted; further optionally, because determining the region outline of the salient region of the video frame is to determine the centroid position of the region outline, based on this, the circumscribed rectangle of the salient region indicated by the salient feature map corresponding to the video frame may be determined as the region outline of the salient region of the video frame, the circumscribed circle of the salient region indicated by the salient feature map corresponding to the video frame may be determined as the region outline of the salient region of the video frame, and so on, the determining manner of the region outline is not limited. Referring to fig. 12, a schematic diagram of determining a centroid position of a salient region of a video frame according to an embodiment of the present application is provided; the salient feature map corresponding to the video frame may be shown as 1201 mark, the salient region indicated by the salient feature map corresponding to the video frame may be shown as 1202 mark, the region outline of the salient region of the video frame may be shown as 1203 mark, and the centroid position of the salient region of the video frame may be shown as 1204 mark.
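A possible sketch of the contour-and-centroid computation using OpenCV (the OpenCV 4.x API is assumed; the fallbacks to the frame centre and to the circumscribed rectangle are illustrative choices):

```python
import cv2
import numpy as np

def salient_centroid(mask):
    """Centroid position of the salient region, given its binary mask (uint8, 0/1 or 0/255)."""
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        h, w = mask.shape[:2]
        return w / 2.0, h / 2.0                      # no salient region: fall back to the frame centre
    contour = max(contours, key=cv2.contourArea)      # region contour of the salient region
    m = cv2.moments(contour)
    if m["m00"] == 0:
        x, y, w, h = cv2.boundingRect(contour)        # circumscribed rectangle centre as a fallback
        return x + w / 2.0, y + h / 2.0
    return m["m10"] / m["m00"], m["m01"] / m["m00"]   # centroid of the region contour
```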
S504, obtaining the clipping proportion, clipping the corresponding video frames according to the clipping proportion and the mass center position of the salient region of the video frames in each video clip, and obtaining a plurality of clipped video frames.
Wherein, the video frame after clipping overlaps with the salient region of the corresponding video frame before clipping.
In one embodiment, the clipping ratio can be set according to specific requirements; for example, the clipping ratio may be a ratio indicated in clipping requirements, such as clipping requirement indication: clipping the video to a video with an aspect ratio of 9:16, then the clipping ratio is 9:16, as well as clipping requirement indication: clipping the video into a video with the aspect ratio of 1:1, wherein the clipping ratio is 1:1, and the clipping requirement indicates: the video is cropped to a video that is adapted to the aspect ratio of the device for video playback, the cropping ratio may be the aspect ratio of the device for video playback, etc. Further, the clipped video frame includes part or all of the salient regions of the video frame before clipping, that is, the clipped video frame overlaps with the salient regions of the corresponding video frame before clipping, and the video frame clipping processing is performed based on the salient regions of the video frames in each video clip in order to preserve the salient regions of the video frame as much as possible when clipping the video frame. Based on this, since the centroid position of the salient region of the video frame can be used to indicate the most central point in the salient region of the video frame, when the corresponding video frame is clipped according to the clipping ratio and the centroid position of the salient region of the video frame, the content of the centroid position of the salient region of the corresponding video frame before clipping should be included in the clipped video frame, further, the centroid position of the clipped video frame should be made to overlap as much as possible with the centroid position of the salient region of the corresponding video frame before clipping, and when the centroid position of the clipped video frame cannot be made to overlap with the centroid position of the salient region of the corresponding video frame before clipping, the centroid position of the salient region of the video frame should be made to be located at the center in the abscissa dimension (i.e., width) of the clipped video frame or the centroid position of the salient region of the video frame should be made to be located at the center in the ordinate dimension (i.e., height) of the clipped video frame. For example, referring to fig. 13a, a schematic diagram of clipping video frames according to a specified clipping ratio is provided in an embodiment of the present application; the ratio of video frames is 16:9 (i.e., the aspect ratio is 16:9), and the centroid position of the salient region of the video frame can be as shown by the 1301 label; if the clipping ratio is 9:16, the clipped video frame can be shown as 1302 marks; referring to fig. 13b, another schematic diagram of clipping video frames according to a specified clipping ratio is provided in an embodiment of the present application; the ratio of video frames is 16:9 (i.e., the aspect ratio is 16:9), and the centroid position of the salient region of the video frames can be shown as 1311 marker; if the cropping ratio is 1:1, the cropped video frame may be as indicated by 1312.
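The cropping of a single frame around the centroid at a given clipping ratio can be sketched as follows (a minimal illustration; the 1920×1080 example values are assumptions):

```python
def crop_window(frame_w, frame_h, centroid_x, centroid_y, ratio_w, ratio_h):
    """Largest crop with aspect ratio ratio_w:ratio_h, centred on the centroid as far as
    the frame borders allow (the window is shifted when it cannot stay centred)."""
    # largest crop of the requested ratio that fits inside the frame
    if frame_w * ratio_h >= frame_h * ratio_w:        # frame is wider than the target ratio
        crop_h, crop_w = frame_h, frame_h * ratio_w // ratio_h
    else:                                              # frame is taller than the target ratio
        crop_w, crop_h = frame_w, frame_w * ratio_h // ratio_w
    # centre on the centroid, then clamp so the window stays inside the frame
    left = min(max(int(centroid_x - crop_w // 2), 0), frame_w - int(crop_w))
    top = min(max(int(centroid_y - crop_h // 2), 0), frame_h - int(crop_h))
    return left, top, int(crop_w), int(crop_h)

# e.g. a 1920x1080 (16:9) frame cropped to 9:16 around a centroid at (600, 500)
left, top, w, h = crop_window(1920, 1080, 600.0, 500.0, 9, 16)   # -> a 607x1080 window
```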
In one embodiment, jitter may exist between centroid positions of salient regions of each video frame in a video clip of a video to be processed, so that smoothing of centroid positions of salient regions of each video frame in the video clip of the video to be processed is required to reduce jitter of centroid positions and improve stability between video frames in a target video; based on this, the video processing apparatus clips the corresponding video frame according to the clipping ratio and the centroid position of the salient region of the video frame in each video clip, to obtain a plurality of clipped video frames, which may include: aiming at the target video clips in each video clip, performing smoothing operation on the centroid position of the salient region of each video frame in the target video clip to obtain a smoothed centroid position; cutting corresponding video frames according to the cutting proportion and the smoothed mass center position corresponding to the video frames in each video segment to obtain a plurality of cut video frames; the target video segment can be any video segment in one or more video segments obtained by dividing the video to be processed; the relevant process of clipping the corresponding video frame according to the clipping ratio and the smoothed centroid position corresponding to the video frame in each video clip is similar to the relevant process of clipping the corresponding video frame according to the clipping ratio and the centroid position of the salient region of the video frame in each video clip.
In one embodiment, when the video processing device performs, for the target video segment in each video segment, the smoothing operation on the centroid positions of the salient regions of the video frames in the target video segment to obtain the smoothed centroid positions, a corresponding smoothing algorithm may be used, for example, a kernel smoothing algorithm (Kernel Smoothing), a smoothing algorithm based on polynomial fitting, and the like, which is not limited by the embodiment of the present application; the kernel smoothing algorithm can use a self-defined interval value to draw continuous coordinates towards a certain central point, that is, the average value of the coordinates of a plurality of nearby points is used as the coordinate of the current point; the polynomial fitting can fit discrete points with a continuous curve, so that the whole becomes smoother.
Further, when a kernel smoothing algorithm is used, a nearest neighbor smoothing algorithm (Nearest neighbor smoothing), a kernel average smoothing algorithm (Kernel average smoothing), a gaussian kernel smoothing algorithm (Gaussian kernel smoothing), etc. may be used, and embodiments of the present application are not limited. In an alternative embodiment, the video processing device performs a smoothing operation on the centroid position of the salient region of each video frame in the target video segment, to obtain a smoothed centroid position, which may include: for any video frame in the target video segment, acquiring a plurality of video frames from the target video segment; the number of video frames spaced between each video frame and any video frame is less than or equal to the number of video frames spaced between the rest of video frames in the target video segment and any video frame; carrying out average processing on the centroid positions corresponding to the obtained video frames to obtain average positions; and determining the average position as the smoothed centroid position corresponding to any video frame. The number of video frames to be acquired from the target video clip can be set according to specific requirements, for example, can be set to 10, 15, etc. for any video frame in the target video clip; and, the number of video frames spaced between each video frame acquired and any one of the video frames is required to be less than or equal to the number of video frames spaced between the remaining video frames in the target video clip and any one of the video frames, that is, for any one of the video frames in the target video clip, the plurality of video frames acquired from the target video clip are the plurality of video frames having the shortest distance to any one of the video frames (the minimum number of video frames spaced), and if the number of video frames required to be acquired from the target video clip is set to 5, the 5 video frames acquired from the target video clip are the 5 video frames having the shortest distance to any one of the video frames (the minimum number of video frames spaced). Further, the average position obtained by the video processing device performing average processing on the centroid position corresponding to each acquired video frame may be obtained by performing average processing on the coordinates of the centroid position corresponding to each acquired video frame. Referring to fig. 14, a schematic diagram of smoothing a centroid position according to an embodiment of the present application is provided, where a centroid position of a salient region of each video frame in a video segment may be shown as 1401 mark, and a smoothed centroid position corresponding to each video frame in the video segment may be shown as 1402 mark.
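A kernel-average style smoothing of the centroid trajectory can be sketched as follows (the centred window is an illustrative way of selecting the video frames nearest to the current frame; the window size is also illustrative):

```python
def smooth_centroids(centroids, window=5):
    """Each centroid becomes the mean of the `window` centroids nearest to it in the clip."""
    half = window // 2
    smoothed = []
    for i in range(len(centroids)):
        lo = max(0, i - half)
        hi = min(len(centroids), i + half + 1)
        xs = [c[0] for c in centroids[lo:hi]]
        ys = [c[1] for c in centroids[lo:hi]]
        smoothed.append((sum(xs) / len(xs), sum(ys) / len(ys)))
    return smoothed
```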
Furthermore, discrete points can be fitted by continuous curves by a polynomial fitting-based smoothing algorithm, so that the whole is smoother; based on this, the video processing device performs a smoothing operation on the centroid position of the salient region of each video frame in the target video clip, to obtain a smoothed centroid position, and may include: performing polynomial fitting processing on the centroid positions corresponding to all video frames in the target video segment to obtain a fitting curve; and determining the positions of all video frames in the target video segment corresponding to the fitting curves as smoothed mass center positions corresponding to the corresponding video frames. Referring to fig. 15, another schematic diagram of smoothing a centroid position according to an embodiment of the present application is provided, where a centroid position of a salient region of each video frame in a video clip may be shown as 1501, and a fitted curve may be shown as 1502. Further optionally, the video processing device performs smoothing operation on the centroid position of the salient region of each video frame in the target video segment according to the target video segment in each video segment, and when the smoothed centroid position is obtained, the method may be implemented in combination with a plurality of smoothing algorithms, for example, the kernel smoothing algorithm may be first used to perform smoothing operation on the centroid position of the salient region of each video frame in the target video segment, and then the smoothing algorithm based on polynomial fitting is used to perform smoothing operation on the centroid position obtained by the previous smoothing operation, and the result of this smoothing operation is used as the smoothed centroid position.
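Likewise, the polynomial-fitting based smoothing can be sketched as follows (NumPy assumed; the polynomial degree is illustrative):

```python
import numpy as np

def smooth_centroids_polyfit(centroids, degree=3):
    """Fit one low-degree polynomial per coordinate over the clip and read the
    smoothed centroid of each frame off the fitted curve."""
    t = np.arange(len(centroids))
    xs = np.array([c[0] for c in centroids])
    ys = np.array([c[1] for c in centroids])
    fit_x = np.polyval(np.polyfit(t, xs, degree), t)
    fit_y = np.polyval(np.polyfit(t, ys, degree), t)
    return list(zip(fit_x.tolist(), fit_y.tolist()))
```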
S505, performing splicing processing on the plurality of cut video frames to obtain a target video.
In one embodiment, when the video processing device performs the splicing processing on the plurality of clipped video frames, the video processing device splices the video frames in sequence according to the playing time sequence of the video frames before clipping corresponding to the plurality of clipped video frames in the video to be processed.
Referring to fig. 16, a schematic diagram of cropping a video to be processed according to an embodiment of the present application is provided; the video processing device may divide the video to be processed to obtain one or more video clips; wherein, a video clip corresponds to a video scene; the salient region detection processing can be carried out on the video frames in each video segment to obtain a salient feature map of the video frames, and then the salient regions of the video frames can be obtained based on the salient feature map of the video frames; the video processing device may perform video frame clipping processing on the video frames in each video clip based on the salient regions of the video frames in each video clip to obtain a plurality of clipped video frames, and perform stitching processing on the plurality of clipped video frames to obtain a target video (this process may also be referred to as a post-processing process). Further optionally, the process of segmenting the video to be processed to obtain one or more video segments may be performed by a video shot segmentation module, the process of performing salient region detection processing on the video frames in each video segment to obtain a salient feature map of the video frames may be performed by a video salient region detection module, and the post-processing process may be performed by a post-processing module.
Further, if the multiple to-be-processed video frames obtained from the target video segment in each video segment are obtained at intervals in the process of predicting the obtained salient feature map of the video frame, that is, one or more video frames are spaced between two to-be-processed video frames obtained by sequential frame extraction processing (that is, two adjacent frame extraction processing), then the post-processing module may perform frame insertion diffusion processing based on the salient feature map obtained by predicting the target video segment, so as to obtain the salient feature map corresponding to the remaining video frames in the target video segment. After obtaining the salient feature map corresponding to each video frame in each video segment, determining salient regions of the corresponding video frames based on the salient feature map corresponding to each video frame in each video segment, further determining centroid positions of the salient regions of the corresponding video frames (i.e. salient region centroid calculation in the illustration), and performing smoothing operation on centroid positions of salient regions of each video frame in the target video segment for the target video segment in each video segment to obtain smoothed centroid positions; and then, cutting corresponding video frames according to the cutting proportion and the smoothed mass center position corresponding to the video frames in each video segment to obtain a plurality of cut video frames, and splicing the plurality of cut video frames to obtain the target video. Referring to fig. 17, a schematic diagram of a video cropping effect according to an embodiment of the present application is provided, where a video frame in a video to be processed may be shown as a 1701 mark, and a corresponding cropped video frame in a target video may be shown as a 1702 mark.
In one embodiment, the salient feature map may be generated by a salient region detection model, and the feature values of the pixel points in the salient feature map may be used to indicate the predicted saliency degree of the corresponding pixel points; the salient region detection model can be obtained by training an initial salient region detection model, where the initial salient region detection model has the same model structure as the salient region detection model but different model parameters; the related process of training to obtain the salient region detection model can be executed by the video processing device or by other electronic devices with computing power, and the embodiment of the present application is described by taking the video processing device as an example. The training of the salient region detection model by the video processing device may include: acquiring sample video frames and reference salient feature maps of the sample video frames, where the feature values of the pixel points in the reference salient feature maps are used to indicate the labeled saliency degree of the corresponding pixel points; performing salient region detection processing on the sample video frames through the initial salient region detection model to obtain salient feature maps of the sample video frames; and taking reducing the distribution difference between the feature values of the pixel points in the salient feature maps of the sample video frames and the feature values of the pixel points in the corresponding reference salient feature maps as the training target, training the initial salient region detection model to obtain the salient region detection model. For convenience of explanation, the reference salient feature map of a sample video frame is also referred to as the reference salient feature map corresponding to the sample video frame, and the salient feature map of a sample video frame is referred to as the salient feature map corresponding to the sample video frame. The sample video frames may be video frames obtained from a sample video; the related process of performing salient region detection processing on a sample video frame through the initial salient region detection model to obtain the salient feature map of the sample video frame is similar to the related process of obtaining the salient feature map of the target video frame through the salient region detection model. Furthermore, taking reducing the distribution difference between the feature values of the pixel points in the salient feature map corresponding to each sample video frame and the feature values of the pixel points in the reference salient feature map of the corresponding sample video frame as the training target, the model parameters of the initial salient region detection model are adjusted to obtain the salient region detection model.
The distribution difference between the feature values of the pixel points in the salient feature map corresponding to each sample video frame and the feature values of the pixel points in the reference salient feature map of the corresponding sample video frame can be reflected by the loss value of a loss function; on this basis, taking reducing this distribution difference as the training target means taking reducing the loss value of the loss function as the training target. Accordingly, any loss function that can measure a distribution difference is applicable to the present application; by way of example, one loss function provided by the embodiment of the present application can be shown by the following formula 1:
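(A sketch only: the concrete form of formula 1 is not fixed here, and the KL-divergence-style term below is an assumption consistent with the symbol descriptions that follow.)

$$\mathrm{Loss}(P, Q) = \sum_{i} Q_i \log \frac{Q_i}{P_i + e}$$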
Wherein P represents the salient feature map corresponding to a sample video frame, Q represents the reference salient feature map corresponding to the sample video frame, i is the index variable, P_i represents the salient feature map corresponding to the i-th sample video frame, Q_i represents the reference salient feature map corresponding to the i-th sample video frame, and e represents a regularization parameter, which can be set according to specific requirements.
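A minimal training-step sketch along these lines (PyTorch assumed; the normalization of the maps into distributions and the exact loss form are assumptions, and `model` stands for the initial salient region detection model):

```python
import torch

def training_step(model, optimizer, sample_frames, reference_maps, eps=1e-8):
    """One step: reduce the distribution difference between predicted and reference salient feature maps."""
    pred = model(sample_frames)                                   # [N, 1, H, W], assumed in [0, 1]
    p = pred.flatten(1) / (pred.flatten(1).sum(dim=1, keepdim=True) + eps)
    q = reference_maps.flatten(1) / (reference_maps.flatten(1).sum(dim=1, keepdim=True) + eps)
    loss = (q * torch.log((q + eps) / (p + eps))).sum(dim=1).mean()   # KL-divergence-style distribution loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```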
In the embodiment of the application, the video to be processed can be segmented to obtain the video segments corresponding to the video scenes; furthermore, the salient regions of the video frames in each video segment can be obtained, and video frame clipping processing is carried out on the corresponding video frames based on the salient regions, so that clipped video frames are obtained; and performing splicing processing on a plurality of cut video frames obtained by cutting on the basis of the salient regions of the video frames in each video segment to obtain a target video. The video display conversion processing can be completed rapidly, and the video display conversion processing efficiency is improved to a certain extent.
When the salient regions of the video frames in each video clip are acquired, the salient regions can be acquired based on salient region detection processing performed on the video frames in each video clip, the salient regions can be determined based on salient feature images of the corresponding video frames, and further clipping is performed, so that the clipped video frames preferentially reserve the salient regions of the video frames before clipping, the corresponding salient regions are prevented from being lost after clipping the video to be processed, high-quality clipped video can be output, and the clipped video has good playing effect when playing. And the video to be processed is segmented to obtain video segments corresponding to video scenes, and then, based on each video segment obtained by segmentation, the salient feature images which correspond to video frames in the corresponding video segments and can indicate salient areas are respectively predicted, so that the problems of instability and unfocusing possibly caused by sudden changes of the salient areas during rapid lens switching can be avoided, namely, for video frames in different scene switching, the salient feature images can be accurately predicted, further, the salient areas can be accurately detected, and the quality of the outputted cut video can be further improved.
In the process of performing salient region detection processing on the target video frame of a target video clip among the video clips, a plurality of to-be-processed video frames containing the target video frame can be obtained from the target video clip, and the fusion feature map of each to-be-processed video frame can then be obtained according to the playing time sequence information of each to-be-processed video frame and the first feature map of each to-be-processed video frame, so that the playing time sequence information is fused into the obtained fusion feature maps; the salient feature map of the target video frame is further generated based on the fusion feature map of the target video frame. When the salient feature map of the target video frame is generated, the time sequence features of the plurality of to-be-processed video frames containing the target video frame can be introduced, that is, the influence of the other video frames among the plurality of to-be-processed video frames on the target video frame can be introduced, which improves the prediction accuracy of the salient feature map of the target video frame. Moreover, when the plurality of to-be-processed video frames containing the target video frame are acquired from the target video segment, the to-be-processed video frames can be acquired at intervals, and frame-interpolation diffusion processing is performed within the target video segment based on the predicted salient feature maps to obtain the salient feature maps corresponding to the remaining video frames in the target video segment, which accelerates the prediction of the salient feature maps and improves the processing efficiency. Moreover, after the centroid positions of the salient regions of the video frames in each video segment are determined based on the salient regions of the video frames in each video segment, smoothing processing can be performed on the centroid positions of the salient regions of the video frames in each video segment, so as to reduce jitter of the centroid positions and improve stability among the video frames in the target video. In summary, the embodiment of the application can clip the video to be processed and output a high-quality and stable clipped video, so that the clipped video has a good playing effect when played, and the processing efficiency can be improved.
Based on the related embodiments of the video processing method described above, another video processing method is provided in the embodiments of the present application. Referring to fig. 18, a flowchart of another video processing method according to an embodiment of the present application is shown.
The video processing method shown in fig. 18 may be performed by a video processing device, or may be performed by other electronic devices with computing power alone or in combination, and the embodiment of the present application is described by taking the video processing device as an example. The video processing method shown in fig. 18 may include the steps of:
S1801, acquiring a plurality of video frames to be processed from a target video clip of a video to be processed.
The target video segment is one video segment in the video to be processed; the video to be processed may be any video for which the salient regions of video frames need to be predicted, for example, a movie video, a variety video, a sports video, etc., which is not limited in the embodiment of the present application.
In one embodiment, the target video segment may be any one of one or more video segments obtained by dividing the video to be processed; the video processing device performs segmentation processing on the video to be processed to obtain one or more video clips, which may include: performing key frame determination processing on the video to be processed to obtain a plurality of key frames; determining a color statistical histogram of each key frame, and determining a difference of the color statistical histograms between adjacent key frames according to the color statistical histogram of each key frame; determining target key frames from adjacent key frames with differences meeting the dividing conditions, and dividing the video to be processed based on each target key frame to obtain one or more video segments; this process is similar to the above-described related process of step S501, and will not be described again here.
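As a hedged sketch only, the segmentation step described above could be realised with color statistical histograms compared between adjacent key frames; the helper names, the Bhattacharyya comparison method, the bin count and the threshold below are illustrative assumptions, not the patent's exact procedure.

```python
# Sketch of the segmentation step: compare color histograms of adjacent key
# frames and split the video where the difference meets the dividing condition.
import cv2
import numpy as np

def color_histogram(frame_bgr, bins=16):
    # Per-channel color statistics flattened into a single histogram vector.
    hist = cv2.calcHist([frame_bgr], [0, 1, 2], None,
                        [bins, bins, bins], [0, 256, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def split_into_clips(key_frames, key_frame_indices, diff_threshold=0.5):
    """key_frames: list of BGR key-frame images; key_frame_indices: their frame
    numbers. Returns the frame indices at which the video is divided into clips."""
    cut_points = []
    prev_hist = color_histogram(key_frames[0])
    for frame, idx in zip(key_frames[1:], key_frame_indices[1:]):
        hist = color_histogram(frame)
        # Bhattacharyya distance: larger value means larger histogram difference.
        diff = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
        if diff > diff_threshold:       # difference meets the dividing condition
            cut_points.append(idx)      # this target key frame starts a new clip
        prev_hist = hist
    return cut_points
```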
S1802, respectively performing image feature extraction processing on each video frame to be processed to obtain a first feature map of each video frame to be processed.
S1803, obtaining a fusion feature map of each video frame to be processed according to the playing time sequence information of each video frame to be processed and the first feature map of each video frame to be processed; and playing time sequence information is fused in the fusion characteristic diagram of each video frame to be processed.
And S1804, generating a salient feature map of the target video frame according to the first feature map of the target video frame in the plurality of video frames to be processed and the fusion feature map of the target video frame.
The target video frame may be any one of the plurality of video frames to be processed; the salient feature map of the target video frame is used for indicating the salient region of the target video frame, so that the salient region of the target video frame is highlighted when the video to be processed is played; the related process of steps S1802 to S1804 is similar to the process of predicting the salient feature map of the target video frame described above in step S502.
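As a hedged sketch only, the following illustrates how fusion feature maps carrying playing time sequence information could be produced from the first feature maps of several to-be-processed frames; the 3D convolution used as the fusion operator and all tensor shapes are assumptions for illustration, not the patent's actual network.

```python
# Sketch of steps S1802-S1804: fuse per-frame first feature maps along the
# playing-time axis, then take the fused map of the target frame.
import torch
import torch.nn as nn

class TemporalFusion(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        # Mixes information across neighbouring frames so each fused map
        # carries playing time sequence information.
        self.temporal_conv = nn.Conv3d(channels, channels,
                                       kernel_size=(3, 3, 3), padding=(1, 1, 1))

    def forward(self, first_feature_maps):
        # first_feature_maps: (batch, time, channels, height, width)
        x = first_feature_maps.permute(0, 2, 1, 3, 4)   # -> (B, C, T, H, W)
        fused = self.temporal_conv(x)
        return fused.permute(0, 2, 1, 3, 4)             # fused map per frame

# Usage: fuse the first feature maps of N frames sampled from one clip, then
# keep the fused map of the target frame for the later interaction step.
frame_features = torch.randn(1, 5, 256, 28, 28)          # 5 frames to be processed
fused = TemporalFusion()(frame_features)
target_fused = fused[:, 2]                                # fusion feature map of the target frame
```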
In one embodiment, the video processing device generates a salient feature map of a target video frame according to a first feature map of the target video frame and a fused feature map of the target video frame in the plurality of video frames to be processed, and may include: performing feature interaction processing on the first feature map of the target video frame and the fusion feature map of the target video frame to obtain an interaction feature map of the target video frame; and carrying out salient feature analysis processing according to the interactive feature map of the target video frame to obtain the salient feature map of the target video frame.
In one embodiment, the video processing device performs salient feature analysis processing according to the interactive feature map of the target video frame to obtain a salient feature map of the target video frame, and may include: performing feature stitching processing on the interactive feature map of the target video frame and the second feature map of the target video frame to obtain a first stitching feature map of the target video frame; performing salient feature analysis processing according to the first spliced feature map of the target video frame to obtain a salient feature map of the target video frame; the second feature map of the target video frame is obtained in the process of extracting image features of the target video frame, and the first feature map of the target video frame is obtained by extracting semantic features of the second feature map of the target video frame.
In one embodiment, if the interactive feature map of the target video frame is different from the feature map size of the second feature map of the target video frame; the video processing device performs feature stitching processing on the interaction feature map of the target video frame and the second feature map of the target video frame to obtain a first stitched feature map of the target video frame, which may include: performing size transformation processing on the interactive feature map of the target video frame to obtain an interactive feature map after size transformation; and performing feature stitching processing on the interactive feature map after the size transformation and the second feature map of the target video frame to obtain a first stitching feature map of the target video frame.
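As a hedged sketch, the size transformation and feature stitching described above can be realised, for example, by interpolation followed by channel-wise concatenation; the tensor layout and the use of bilinear interpolation are assumptions.

```python
# Sketch of building the first stitched feature map from the interaction
# feature map and the second feature map when their sizes differ.
import torch
import torch.nn.functional as F

def first_stitched_feature_map(interaction_map, second_feature_map):
    # interaction_map:    (B, C1, h, w)
    # second_feature_map: (B, C2, H, W) with (H, W) possibly != (h, w)
    if interaction_map.shape[-2:] != second_feature_map.shape[-2:]:
        # Size transformation of the interaction feature map.
        interaction_map = F.interpolate(interaction_map,
                                        size=second_feature_map.shape[-2:],
                                        mode='bilinear', align_corners=False)
    # Feature stitching realised as channel-wise concatenation.
    return torch.cat([interaction_map, second_feature_map], dim=1)
```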
In one embodiment, the video processing device performs salient feature analysis processing according to the first spliced feature map of the target video frame to obtain a salient feature map of the target video frame, and may include: performing feature stitching processing on the first stitching feature map of the target video frame and the third feature map of the target video frame to obtain a second stitching feature map of the target video frame; performing salient feature analysis processing according to the second spliced feature map of the target video frame to obtain a salient feature map of the target video frame; the third feature map of the target video frame is obtained in the process of extracting image features of the target video frame, and the second feature map of the target video frame is obtained by extracting semantic features of the third feature map of the target video frame.
In one embodiment, if the plurality of video frames to be processed are obtained by performing frame extraction processing on the target video segment through frame extraction interval parameters, the frame extraction interval parameters include any one or more of an interval video frame number and an interval time parameter; the plurality of video frames to be processed comprise a first video frame to be processed, which is obtained during first frame extraction processing, and a second video frame to be processed, which is obtained during second frame extraction processing; one or more video frames are spaced between the first video frame to be processed and the second video frame to be processed; the second frame extraction processing refers to the next frame extraction processing of the first frame extraction processing; the video processing device may also: and determining the salient feature map of each video frame at the interval between the first to-be-processed video frame and the second to-be-processed video frame according to the salient feature map of the first to-be-processed video frame and the salient feature map of the second to-be-processed video frame.
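As a hedged sketch of this step, the salient feature maps of the skipped frames between the first and second to-be-processed video frames could, for example, be determined by blending the two predicted maps; the linear blend is an assumption about how this determination is realised.

```python
# Sketch: propagate salient feature maps to the frames skipped by the
# frame-extraction interval, using a linear blend of the two predicted maps.
import numpy as np

def interpolate_salient_maps(map_first, map_second, num_skipped):
    """map_first/map_second: salient feature maps (H, W) of the first and second
    to-be-processed video frames; num_skipped: number of frames between them."""
    maps = []
    for k in range(1, num_skipped + 1):
        alpha = k / (num_skipped + 1)
        maps.append((1.0 - alpha) * map_first + alpha * map_second)
    return maps
```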
In one embodiment, in order to improve the prediction accuracy of the salient region of the video frame, the video processing device may detect a video frame corresponding to the speech signal in the plurality of video frames to be processed; carrying out face detection processing on a video frame corresponding to the voice signal; if the existence of the human face is detected, detecting the action change condition of a mouth area of the human face based on the video frame corresponding to the voice signal and the video frame associated with the video frame corresponding to the voice signal in the target video segment; under the condition that the action change exists in the mouth region of the face, the salient feature map of the video frame corresponding to the voice signal is updated according to the portrait region to which the face belongs.
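A hedged sketch of this speech-driven correction follows; detect_faces, mouth_motion_detected and portrait_region are hypothetical helper functions standing in for whatever face detector and mouth-motion test are actually used.

```python
# Sketch: for a frame aligned with a speech signal, raise the saliency of the
# portrait region of a face whose mouth region shows motion changes.
import numpy as np

def update_saliency_for_speaker(frame, neighbour_frames, salient_map,
                                detect_faces, mouth_motion_detected,
                                portrait_region, boost=1.0):
    for face_box in detect_faces(frame):
        if mouth_motion_detected(face_box, frame, neighbour_frames):
            y0, y1, x0, x1 = portrait_region(face_box, frame)   # portrait area of the speaker
            salient_map[y0:y1, x0:x1] = np.maximum(salient_map[y0:y1, x0:x1], boost)
    return salient_map
```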
In one embodiment, the video processing device may perform post-processing operations on the target video frame after generating the salient feature map of the target video frame, such that the salient region of the target video frame is highlighted while the video to be processed is played. For example, after the salient feature map of the target video frame is generated, a zoom-in operation may be performed on the salient region indicated by the salient feature map of the target video frame, so that the salient region of the target video frame is highlighted when the video to be processed is played; optionally, the zoom-in operation may be implemented based on interpolation processing. For another example, after the salient feature map of the target video frame is generated, a special effect may be added to the salient region indicated by the salient feature map of the target video frame, so that the salient region of the target video frame is highlighted when the video to be processed is played.
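As a hedged sketch, the interpolation-based zoom-in operation could crop the region indicated by the salient feature map and enlarge it back to the frame size; the thresholding and the crop-and-resize strategy are assumptions.

```python
# Sketch: enlarge the salient region of a frame using interpolation.
import cv2
import numpy as np

def zoom_in_on_salient_region(frame_bgr, salient_map, threshold=0.5):
    ys, xs = np.where(salient_map >= threshold)
    if len(xs) == 0:
        return frame_bgr                       # nothing salient, keep the frame
    x0, x1, y0, y1 = xs.min(), xs.max(), ys.min(), ys.max()
    crop = frame_bgr[y0:y1 + 1, x0:x1 + 1]
    h, w = frame_bgr.shape[:2]
    # Interpolation-based enlargement back to the original frame size.
    return cv2.resize(crop, (w, h), interpolation=cv2.INTER_LINEAR)
```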
In one embodiment, with reference to the generation mode of the salient feature map of the target video frame, the salient feature maps of the video frames in the target video segment can be generated, and further the salient feature maps of the video frames in each video segment of the video to be processed can be generated; the salient region of each corresponding video frame can be determined based on the salient feature maps of the video frames in each video segment, video frame clipping processing is performed on the video frames in each video segment based on the salient regions of the video frames in each video segment to obtain a plurality of clipped video frames, and the plurality of clipped video frames are spliced to obtain a target video. The clipped video frame comprises part or all of the salient region of the video frame before clipping. The effect of highlighting the salient region of each video frame in the video to be processed can be achieved by playing the target video. This process is similar to the related processes of steps S202 to S204 and steps S502 to S505 described above.
In one embodiment, the video processing device performs video frame cropping processing on the video frames in each video clip based on the salient regions of the video frames in each video clip, to obtain a plurality of cropped video frames, and may include: determining centroid positions of salient regions of corresponding video frames based on salient regions of video frames in each video clip; and acquiring the clipping proportion, clipping the corresponding video frames according to the clipping proportion and the mass center position of the salient region of the video frames in each video clip, and obtaining a plurality of clipped video frames.
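A hedged sketch of this step follows: the centroid is computed as the saliency-weighted mean position, and the crop window is placed around it at the requested ratio; the 9:16 portrait ratio and the full-height crop are illustrative assumptions.

```python
# Sketch: compute the centroid of the salient region and crop the frame
# around it according to a cropping ratio.
import numpy as np

def salient_centroid(salient_map):
    total = salient_map.sum()
    if total == 0:
        h, w = salient_map.shape
        return h / 2.0, w / 2.0
    ys, xs = np.indices(salient_map.shape)
    return (ys * salient_map).sum() / total, (xs * salient_map).sum() / total

def crop_around_centroid(frame, centroid, crop_ratio=(9, 16)):
    h, w = frame.shape[:2]
    crop_w = min(int(h * crop_ratio[0] / crop_ratio[1]), w)   # keep full height, narrow the width
    cx = int(round(centroid[1]))
    x0 = min(max(cx - crop_w // 2, 0), w - crop_w)
    return frame[:, x0:x0 + crop_w]
```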
In one embodiment, to reduce jitter of a centroid position, and improve stability between video frames in a target video, a video processing apparatus clips a corresponding video frame according to a clipping ratio and a centroid position of a salient region of the video frame in each video clip, to obtain a plurality of clipped video frames, which may include: aiming at the target video clips in each video clip, performing smoothing operation on the centroid position of the salient region of each video frame in the target video clip to obtain a smoothed centroid position; and cutting corresponding video frames according to the cutting proportion and the smoothed mass center position corresponding to the video frames in each video segment to obtain a plurality of cut video frames.
In the embodiment of the application, a plurality of video frames to be processed can be obtained from the target video segment of the video to be processed; respectively carrying out image feature extraction processing on each video frame to be processed to obtain a first feature map of each video frame to be processed; furthermore, according to the playing time sequence information of each video frame to be processed and the first feature map of each video frame to be processed, a fusion feature map of each video frame to be processed fused with the playing time sequence information can be obtained, and according to the first feature map of a target video frame in a plurality of video frames to be processed and the fusion feature map of the target video frame, a salient feature map of the target video frame is generated; the time sequence characteristics of a plurality of to-be-processed video frames containing the target video frame can be introduced in the process of predicting the salient feature map of the target video frame, namely, the influence of other video frames in the plurality of to-be-processed video frames on the target video frame can be introduced, and the prediction accuracy of the salient feature map of the target video frame can be improved.
Based on the above embodiments related to the video processing method, the embodiments of the present application provide a video processing apparatus. Referring to fig. 19, a schematic structural diagram of a video processing apparatus according to an embodiment of the present application may include a dividing unit 1901, an obtaining unit 1902, a clipping unit 1903 and a stitching unit 1904. The video processing apparatus shown in fig. 19 can be used to perform the following operations:
a segmentation unit 1901, configured to segment a video to be processed to obtain one or more video segments; one video clip corresponds to one video scene;
an obtaining unit 1902, configured to obtain a salient region of a video frame in each video clip;
a clipping unit 1903, configured to clip video frames in each video clip based on the salient regions of the video frames in each video clip, so as to obtain a plurality of clipped video frames;
and a stitching unit 1904, configured to perform stitching on the plurality of clipped video frames to obtain a target video.
In one embodiment, the cropped video frame comprises part or all of the salient region of the pre-cropped video frame.
In one embodiment, before the obtaining unit 1902 obtains the salient regions of the video frames in each video segment, the obtaining unit is further configured to:
Acquiring a plurality of video frames to be processed from target video clips in each video clip; the target video clip is any one of the video clips;
respectively carrying out image feature extraction processing on each video frame to be processed to obtain a first feature map of each video frame to be processed;
obtaining a fusion feature map of each video frame to be processed according to the playing time sequence information of each video frame to be processed and the first feature map of each video frame to be processed; the playing time sequence information is fused in the fusion characteristic diagram of each video frame to be processed;
generating a salient feature map of a target video frame according to a first feature map of the target video frame in the plurality of video frames to be processed and a fusion feature map of the target video frame; the salient feature map is used to indicate salient regions of the target video frame.
In one embodiment, the obtaining unit 1902 specifically performs the following operations when generating the salient feature map of the target video frame according to the first feature map of the target video frame in the plurality of to-be-processed video frames and the fused feature map of the target video frame:
Performing feature interaction processing on the first feature map of the target video frame and the fusion feature map of the target video frame to obtain an interaction feature map of the target video frame;
and carrying out salient feature analysis processing according to the interactive feature map of the target video frame to obtain the salient feature map of the target video frame.
In one embodiment, the obtaining unit 1902 performs a salient feature analysis process according to the interactive feature map of the target video frame, and when obtaining the salient feature map of the target video frame, specifically performs the following operations:
performing feature stitching processing on the interactive feature map of the target video frame and the second feature map of the target video frame to obtain a first stitching feature map of the target video frame;
performing salient feature analysis processing according to the first spliced feature map of the target video frame to obtain a salient feature map of the target video frame;
the second feature map of the target video frame is obtained in the process of extracting image features of the target video frame, and the first feature map of the target video frame is obtained by extracting semantic features of the second feature map of the target video frame.
In one embodiment, the interactive feature map of the target video frame is of a different feature map size than the second feature map of the target video frame;
the obtaining unit 1902 performs feature stitching processing on the interaction feature map of the target video frame and the second feature map of the target video frame, and when obtaining the first stitching feature map of the target video frame, specifically performs the following operations:
performing size transformation processing on the interactive feature map of the target video frame to obtain an interactive feature map after size transformation;
and performing feature stitching processing on the interactive feature map after the size transformation and the second feature map of the target video frame to obtain a first stitching feature map of the target video frame.
In one embodiment, the obtaining unit 1902 performs a salient feature analysis process according to the first spliced feature map of the target video frame, and when obtaining the salient feature map of the target video frame, specifically performs the following operations:
performing feature stitching processing on the first stitching feature map of the target video frame and the third feature map of the target video frame to obtain a second stitching feature map of the target video frame;
performing salient feature analysis processing according to the second spliced feature map of the target video frame to obtain a salient feature map of the target video frame;
The third feature map of the target video frame is obtained in the process of extracting image features of the target video frame, and the second feature map of the target video frame is obtained by extracting semantic features of the third feature map of the target video frame.
In one embodiment, the plurality of video frames to be processed are obtained by performing frame extraction processing on the target video segment through frame extraction interval parameters, where the frame extraction interval parameters include any one or more of an interval video frame number and an interval time parameter; the plurality of video frames to be processed comprise a first video frame to be processed, which is obtained during first frame extraction processing, and a second video frame to be processed, which is obtained during second frame extraction processing; one or more video frames are spaced between the first video frame to be processed and the second video frame to be processed; the second frame extraction processing refers to next frame extraction processing of the first frame extraction processing;
the obtaining unit 1902 is further configured to:
and determining the salient feature map of each video frame of the interval between the first to-be-processed video frame and the second to-be-processed video frame according to the salient feature map of the first to-be-processed video frame and the salient feature map of the second to-be-processed video frame.
In one embodiment, the obtaining unit 1902 is further configured to:
detecting a video frame corresponding to a voice signal in the plurality of video frames to be processed;
carrying out face detection processing on the video frames corresponding to the voice signals;
if the face is detected to exist, detecting the action change condition of a mouth area of the face based on the video frame corresponding to the voice signal and the video frame associated with the video frame corresponding to the voice signal in the target video segment;
under the condition that the action change exists in the mouth area of the face, updating the salient feature map of the video frame corresponding to the voice signal according to the portrait area to which the face belongs.
In one embodiment, when the segmentation unit 1901 segments the video to be processed to obtain one or more video segments, the following operations are specifically performed:
performing key frame determination processing on the video to be processed to obtain a plurality of key frames;
determining a color statistical histogram of each key frame, and determining the difference of the color statistical histograms between adjacent key frames according to the color statistical histogram of each key frame;
and determining target key frames from adjacent key frames of which the differences meet the dividing conditions, and dividing the video to be processed based on each target key frame to obtain one or more video segments.
In one embodiment, the cropping unit 1903 performs video frame cropping processing on the video frames in each video clip based on the salient regions of the video frames in each video clip, so as to obtain a plurality of cropped video frames, and specifically perform the following operations:
determining centroid positions of salient regions of the corresponding video frames based on salient regions of video frames in the respective video clips;
and acquiring a clipping proportion, clipping corresponding video frames according to the clipping proportion and the mass center position of the salient region of the video frames in each video segment, and obtaining a plurality of clipped video frames.
In one embodiment, the cropping unit 1903 crops the corresponding video frame according to the cropping proportion and the centroid position of the salient region of the video frame in each video segment, and when obtaining a plurality of cropped video frames, specifically performs the following operations:
performing smoothing operation on the centroid position of the salient region of each video frame in the target video segment aiming at the target video segment in each video segment to obtain a smoothed centroid position;
and cutting corresponding video frames according to the cutting proportion and the smoothed mass center positions corresponding to the video frames in each video segment to obtain a plurality of cut video frames.
In one embodiment, the clipping unit 1903 performs a smoothing operation on the centroid position of the salient region of each video frame in the target video segment, and when the smoothed centroid position is obtained, specifically performs the following operations:
for any video frame in the target video segment, acquiring a plurality of video frames from the target video segment; the number of video frames spaced between each video frame and any video frame is less than or equal to the number of video frames spaced between the rest of video frames in the target video segment and any video frame;
carrying out average processing on the centroid positions corresponding to the obtained video frames to obtain average positions;
and determining the average position as the smoothed centroid position corresponding to any video frame.
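A hedged sketch of this moving-average smoothing follows; the window size is an assumption.

```python
# Sketch: smooth centroid positions within one clip by averaging each frame's
# centroid with those of its nearest frames.
import numpy as np

def smooth_centroids_moving_average(centroids, window=5):
    """centroids: array of shape (num_frames, 2); returns the smoothed positions."""
    half = window // 2
    smoothed = np.empty_like(centroids, dtype=float)
    for i in range(len(centroids)):
        lo, hi = max(0, i - half), min(len(centroids), i + half + 1)
        smoothed[i] = centroids[lo:hi].mean(axis=0)   # average over the nearest frames
    return smoothed
```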
In one embodiment, the clipping unit 1903 performs a smoothing operation on the centroid position of the salient region of each video frame in the target video segment, and when the smoothed centroid position is obtained, specifically performs the following operations:
performing polynomial fitting processing on the centroid positions corresponding to all video frames in the target video segment to obtain a fitting curve;
And determining the positions of all video frames in the target video segment corresponding to the fitting curve as smoothed mass center positions corresponding to the corresponding video frames.
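A hedged sketch of this polynomial-fitting smoothing follows; the polynomial degree is an assumption.

```python
# Sketch: fit the centroid trajectory of one clip with a polynomial and use
# the positions on the fitted curve as the smoothed centroid positions.
import numpy as np

def smooth_centroids_polyfit(centroids, degree=3):
    """centroids: array of shape (num_frames, 2); x(t) and y(t) are fitted separately."""
    t = np.arange(len(centroids))
    smoothed = np.empty_like(centroids, dtype=float)
    for dim in range(2):
        coeffs = np.polyfit(t, centroids[:, dim], degree)
        smoothed[:, dim] = np.polyval(coeffs, t)       # position on the fitted curve
    return smoothed
```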
In one embodiment, the salient feature map is generated by a salient region detection model, and feature values of each pixel point in the salient feature map are used for indicating: the predicted significance of the corresponding pixel;
the video processing apparatus further includes a training unit 1905, where the training unit 1905 is configured to perform the following operations when the salient region detection model is obtained by training:
acquiring a sample video frame and a reference salient feature map of the sample video frame; the feature value of each pixel point in the reference salient feature map is used for indicating the marked significance degree of the corresponding pixel point;
performing salient region detection processing on the sample video frame through an initial salient region detection model to obtain a salient feature map of the sample video frame;
and training the initial salient region detection model by taking reducing the distribution difference between the feature value of each pixel point in the salient feature map of the sample video frame and the feature value of each pixel point in the corresponding reference salient feature map as a training target, to obtain the salient region detection model.
According to one embodiment of the present application, the steps involved in the video processing methods shown in fig. 2 and 5 may be performed by the units in the video processing apparatus shown in fig. 19. For example, step S201 shown in fig. 2 may be performed by the dividing unit 1901 in the video processing apparatus shown in fig. 19, step S202 shown in fig. 2 may be performed by the acquiring unit 1902 in the video processing apparatus shown in fig. 19, step S203 shown in fig. 2 may be performed by the clipping unit 1903 in the video processing apparatus shown in fig. 19, and step S204 shown in fig. 2 may be performed by the splicing unit 1904 in the video processing apparatus shown in fig. 19. As another example, step S501 shown in fig. 5 may be performed by the dividing unit 1901 in the video processing apparatus shown in fig. 19, step S502 shown in fig. 5 may be performed by the acquiring unit 1902 in the video processing apparatus shown in fig. 19, steps S503 to S504 shown in fig. 5 may be performed by the clipping unit 1903 in the video processing apparatus shown in fig. 19, and step S505 shown in fig. 5 may be performed by the splicing unit 1904 in the video processing apparatus shown in fig. 19.
Based on the above-mentioned embodiments related to the video processing method, another video processing apparatus is provided in an embodiment of the present application. Referring to fig. 20, a schematic structural diagram of another video processing apparatus according to an embodiment of the present application may include an acquisition unit 2001 and a processing unit 2002. The video processing apparatus shown in fig. 20 can be used to perform the following operations:
An acquiring unit 2001 for acquiring a plurality of video frames to be processed from a target video clip of a video to be processed; the target video segment is one video segment in the video to be processed;
the processing unit 2002 is configured to perform image feature extraction processing on each video frame to be processed, so as to obtain a first feature map of each video frame to be processed;
the processing unit 2002 is further configured to obtain a fusion feature map of each to-be-processed video frame according to the play time sequence information of each to-be-processed video frame and the first feature map of each to-be-processed video frame; the playing time sequence information is fused in the fusion characteristic diagram of each video frame to be processed;
the processing unit 2002 is further configured to generate a salient feature map of a target video frame according to a first feature map of the target video frame in the plurality of video frames to be processed and a fused feature map of the target video frame; the salient feature map is used for indicating salient regions of the target video frame so as to highlight the salient regions of the target video frame when the video to be processed is played.
In one embodiment, the processing unit 2002 generates the salient feature map of the target video frame according to the first feature map of the target video frame in the plurality of to-be-processed video frames and the fused feature map of the target video frame, and specifically performs the following operations:
Performing feature interaction processing on the first feature map of the target video frame and the fusion feature map of the target video frame to obtain an interaction feature map of the target video frame;
and carrying out salient feature analysis processing according to the interactive feature map of the target video frame to obtain the salient feature map of the target video frame.
In one embodiment, the processing unit 2002 performs a salient feature analysis process according to the interactive feature map of the target video frame, and when obtaining the salient feature map of the target video frame, specifically performs the following operations:
performing feature stitching processing on the interactive feature map of the target video frame and the second feature map of the target video frame to obtain a first stitching feature map of the target video frame;
performing salient feature analysis processing according to the first spliced feature map of the target video frame to obtain a salient feature map of the target video frame;
the second feature map of the target video frame is obtained in the process of extracting image features of the target video frame, and the first feature map of the target video frame is obtained by extracting semantic features of the second feature map of the target video frame.
In one embodiment, the interactive feature map of the target video frame is of a different feature map size than the second feature map of the target video frame;
the processing unit 2002 performs feature stitching processing on the interaction feature map of the target video frame and the second feature map of the target video frame, and when obtaining the first stitching feature map of the target video frame, specifically performs the following operations:
performing size transformation processing on the interactive feature map of the target video frame to obtain an interactive feature map after size transformation;
and performing feature stitching processing on the interactive feature map after the size transformation and the second feature map of the target video frame to obtain a first stitching feature map of the target video frame.
In one embodiment, the processing unit 2002 performs the salient feature analysis processing according to the first spliced feature map of the target video frame, and when obtaining the salient feature map of the target video frame, specifically performs the following operations:
performing feature stitching processing on the first stitching feature map of the target video frame and the third feature map of the target video frame to obtain a second stitching feature map of the target video frame;
performing salient feature analysis processing according to the second spliced feature map of the target video frame to obtain a salient feature map of the target video frame;
The third feature map of the target video frame is obtained in the process of extracting image features of the target video frame, and the second feature map of the target video frame is obtained by extracting semantic features of the third feature map of the target video frame.
In one embodiment, the plurality of video frames to be processed are obtained by performing frame extraction processing on the target video segment through frame extraction interval parameters, where the frame extraction interval parameters include any one or more of an interval video frame number and an interval time parameter; the plurality of video frames to be processed comprise a first video frame to be processed, which is obtained during first frame extraction processing, and a second video frame to be processed, which is obtained during second frame extraction processing; one or more video frames are spaced between the first video frame to be processed and the second video frame to be processed; the second frame extraction processing refers to next frame extraction processing of the first frame extraction processing;
the processing unit 2002 is further configured to:
and determining the salient feature map of each video frame of the interval between the first to-be-processed video frame and the second to-be-processed video frame according to the salient feature map of the first to-be-processed video frame and the salient feature map of the second to-be-processed video frame.
In one embodiment, the processing unit 2002 is further configured to:
detecting a video frame corresponding to a voice signal in the plurality of video frames to be processed;
carrying out face detection processing on the video frames corresponding to the voice signals;
if the face is detected to exist, detecting the action change condition of a mouth area of the face based on the video frame corresponding to the voice signal and the video frame associated with the video frame corresponding to the voice signal in the target video segment;
under the condition that the action change exists in the mouth area of the face, updating the salient feature map of the video frame corresponding to the voice signal according to the portrait area to which the face belongs.
In one embodiment, when the processing unit 2002 performs segmentation processing on the video to be processed to obtain one or more video segments, the following operations are specifically performed:
performing key frame determination processing on the video to be processed to obtain a plurality of key frames;
determining a color statistical histogram of each key frame, and determining the difference of the color statistical histograms between adjacent key frames according to the color statistical histogram of each key frame;
and determining target key frames from adjacent key frames of which the differences meet the dividing conditions, and dividing the video to be processed based on each target key frame to obtain one or more video segments.
According to one embodiment of the present application, the steps involved in the video processing method shown in fig. 18 may be performed by the respective units in the video processing apparatus shown in fig. 20. For example, step S1801 shown in fig. 18 may be performed by the acquisition unit 2001 in the video processing device shown in fig. 20, and steps S1802 to S1804 shown in fig. 18 may be performed by the processing unit 2002 in the video processing device shown in fig. 20.
According to another embodiment of the present application, the units in the video processing apparatus shown in fig. 19 and the units in the video processing apparatus shown in fig. 20 may be separately or wholly combined into one or several other units, or one or more of these units may be further split into a plurality of units with smaller functions, which can achieve the same operations without affecting the technical effects of the embodiments of the present application. The above units are divided based on logical functions; in practical applications, the function of one unit may be implemented by a plurality of units, or the functions of a plurality of units may be implemented by one unit, for example, the functions implemented by the respective units may be implemented by one processing unit. In other embodiments of the present application, the video processing apparatus divided based on logical functions may also include other units, and in practical applications, these functions may also be implemented with the assistance of other units or through the cooperation of a plurality of units.
According to another embodiment of the present application, a video processing apparatus as shown in fig. 19 may be constructed by running a computer program (including program code) capable of executing the steps involved in the respective methods as shown in fig. 2 and 5 on a general-purpose computing device such as a computer including a processing element such as a Central Processing Unit (CPU), a random access storage medium (RAM), a read only storage medium (ROM), and the like, and a storage element, and implementing the video processing method of the embodiment of the present application; alternatively, a computer program (including program code) capable of executing steps involved in the respective methods shown in fig. 18 is run to construct a video processing apparatus as shown in fig. 20, and to implement the video processing method of the embodiment of the present application. The computer program may be recorded on, for example, a computer readable storage medium, and loaded into and executed by the computing device described above.
Based on the related embodiments of the video processing method and the embodiments of the video processing device, the application further provides video processing equipment. Referring to fig. 21, a schematic structural diagram of a video processing apparatus according to an embodiment of the present application is provided. The video processing device shown in fig. 21 may include at least a processor 2101, an input interface 2102, an output interface 2103, and a computer storage medium 2104. Wherein the processor 2101, the input interface 2102, the output interface 2103, and the computer storage medium 2104 may be connected by a bus or other means.
The computer storage medium 2104 may be stored in the memory of the video processing device; the computer storage medium 2104 is used for storing a computer program, the computer program comprising program instructions, and the processor 2101 is used for executing the program instructions stored in the computer storage medium 2104. The processor 2101 (or CPU (Central Processing Unit)) is the computing core and control core of the video processing device, and is adapted to implement one or more instructions, in particular to load and execute one or more instructions so as to implement the video processing method flows or corresponding functions described above.
The embodiment of the application also provides a computer storage medium (Memory), which is a Memory device in the video processing device and is used for storing programs and data. It will be appreciated that the computer storage medium herein may include both a built-in storage medium in the terminal and an extended storage medium supported by the terminal. The computer storage medium provides a storage space that stores an operating system of the terminal. Also stored in this memory space are one or more instructions, which may be one or more computer programs (including program code), adapted to be loaded and executed by the processor 2101. Note that the computer storage medium may be a high-speed random access memory (random access memory, RAM) or a non-volatile memory (non-volatile memory), such as at least one magnetic disk memory; optionally, at least one computer storage medium remote from the processor may be present.
In one embodiment, one or more instructions stored in a computer storage medium may be loaded and executed by the processor 2101 to implement the respective steps of the methods described above in connection with the video processing method embodiments of fig. 2 and 5 or to implement the respective steps of the methods described above in connection with the video processing method embodiment of fig. 18.
In particular implementations, when one or more instructions in the computer storage medium are loaded by the processor 2101 and perform the corresponding steps of the methods described above in connection with the video processing method embodiments of fig. 2 and 5, the following operations are specifically performed:
dividing the video to be processed to obtain one or more video clips; one video clip corresponds to one video scene;
acquiring a salient region of a video frame in each video segment;
based on the salient regions of the video frames in each video segment, performing video frame clipping processing on the video frames in each video segment to obtain a plurality of clipped video frames;
and performing splicing processing on the plurality of cut video frames to obtain a target video.
In one embodiment, the cropped video frame comprises part or all of the salient region of the pre-cropped video frame.
In one embodiment, before the processor 2101 obtains the salient regions of the video frames in each video clip, it is further configured to:
acquiring a plurality of video frames to be processed from target video clips in each video clip; the target video clip is any one of the video clips;
respectively carrying out image feature extraction processing on each video frame to be processed to obtain a first feature map of each video frame to be processed;
obtaining a fusion feature map of each video frame to be processed according to the playing time sequence information of each video frame to be processed and the first feature map of each video frame to be processed; the playing time sequence information is fused in the fusion characteristic diagram of each video frame to be processed;
generating a salient feature map of a target video frame according to a first feature map of the target video frame in the plurality of video frames to be processed and a fusion feature map of the target video frame; the salient feature map is used to indicate salient regions of the target video frame.
In one embodiment, the processor 2101 specifically performs the following operations when generating the salient feature map of the target video frame according to the first feature map of the target video frame in the plurality of video frames to be processed and the fused feature map of the target video frame:
Performing feature interaction processing on the first feature map of the target video frame and the fusion feature map of the target video frame to obtain an interaction feature map of the target video frame;
and carrying out salient feature analysis processing according to the interactive feature map of the target video frame to obtain the salient feature map of the target video frame.
In one embodiment, the processor 2101 performs a salient feature analysis process according to the interactive feature map of the target video frame, and when obtaining the salient feature map of the target video frame, specifically performs the following operations:
performing feature stitching processing on the interactive feature map of the target video frame and the second feature map of the target video frame to obtain a first stitching feature map of the target video frame;
performing salient feature analysis processing according to the first spliced feature map of the target video frame to obtain a salient feature map of the target video frame;
the second feature map of the target video frame is obtained in the process of extracting image features of the target video frame, and the first feature map of the target video frame is obtained by extracting semantic features of the second feature map of the target video frame.
In one embodiment, the interactive feature map of the target video frame is of a different feature map size than the second feature map of the target video frame;
the processor 2101 performs feature stitching processing on the interaction feature map of the target video frame and the second feature map of the target video frame, and when obtaining a first stitched feature map of the target video frame, specifically performs the following operations:
performing size transformation processing on the interactive feature map of the target video frame to obtain an interactive feature map after size transformation;
and performing feature stitching processing on the interactive feature map after the size transformation and the second feature map of the target video frame to obtain a first stitching feature map of the target video frame.
In one embodiment, the processor 2101 performs a salient feature analysis process according to the first stitched feature map of the target video frame, and when obtaining the salient feature map of the target video frame, specifically performs the following operations:
performing feature stitching processing on the first stitching feature map of the target video frame and the third feature map of the target video frame to obtain a second stitching feature map of the target video frame;
performing salient feature analysis processing according to the second spliced feature map of the target video frame to obtain a salient feature map of the target video frame;
The third feature map of the target video frame is obtained in the process of extracting image features of the target video frame, and the second feature map of the target video frame is obtained by extracting semantic features of the third feature map of the target video frame.
In one embodiment, the plurality of video frames to be processed are obtained by performing frame extraction processing on the target video segment through frame extraction interval parameters, where the frame extraction interval parameters include any one or more of an interval video frame number and an interval time parameter; the plurality of video frames to be processed comprise a first video frame to be processed, which is obtained during first frame extraction processing, and a second video frame to be processed, which is obtained during second frame extraction processing; one or more video frames are spaced between the first video frame to be processed and the second video frame to be processed; the second frame extraction processing refers to next frame extraction processing of the first frame extraction processing;
the processor 2101 is further configured to:
and determining the salient feature map of each video frame of the interval between the first to-be-processed video frame and the second to-be-processed video frame according to the salient feature map of the first to-be-processed video frame and the salient feature map of the second to-be-processed video frame.
In one embodiment, the processor 2101 is further configured to:
detecting a video frame corresponding to a voice signal in the plurality of video frames to be processed;
carrying out face detection processing on the video frames corresponding to the voice signals;
if the face is detected to exist, detecting the action change condition of a mouth area of the face based on the video frame corresponding to the voice signal and the video frame associated with the video frame corresponding to the voice signal in the target video segment;
under the condition that the action change exists in the mouth area of the face, updating the salient feature map of the video frame corresponding to the voice signal according to the portrait area to which the face belongs.
In one embodiment, when the processor 2101 performs segmentation processing on the video to be processed to obtain one or more video segments, the following operations are specifically performed:
performing key frame determination processing on the video to be processed to obtain a plurality of key frames;
determining a color statistical histogram of each key frame, and determining the difference of the color statistical histograms between adjacent key frames according to the color statistical histogram of each key frame;
and determining target key frames from adjacent key frames of which the differences meet the dividing conditions, and dividing the video to be processed based on each target key frame to obtain one or more video segments.
In one embodiment, the processor 2101 performs a video frame cropping process on the video frames in each video clip based on the salient regions of the video frames in each video clip, so as to obtain a plurality of cropped video frames, and specifically performs the following operations:
determining centroid positions of salient regions of the corresponding video frames based on salient regions of video frames in the respective video clips;
and acquiring a clipping proportion, clipping corresponding video frames according to the clipping proportion and the mass center position of the salient region of the video frames in each video segment, and obtaining a plurality of clipped video frames.
In one embodiment, the processor 2101 clips the corresponding video frame according to the clipping ratio and the centroid position of the salient region of the video frame in each video clip, and when obtaining a plurality of clipped video frames, the processor specifically performs the following operations:
performing smoothing operation on the centroid position of the salient region of each video frame in the target video segment aiming at the target video segment in each video segment to obtain a smoothed centroid position;
and cutting corresponding video frames according to the cutting proportion and the smoothed mass center positions corresponding to the video frames in each video segment to obtain a plurality of cut video frames.
In one embodiment, the processor 2101 performs a smoothing operation on the centroid position of the salient region of each video frame in the target video segment, and when the smoothed centroid position is obtained, the following operations are specifically performed:
for any video frame in the target video segment, acquiring a plurality of video frames from the target video segment; the number of video frames spaced between each video frame and any video frame is less than or equal to the number of video frames spaced between the rest of video frames in the target video segment and any video frame;
carrying out average processing on the centroid positions corresponding to the obtained video frames to obtain average positions;
and determining the average position as the smoothed centroid position corresponding to any video frame.
In one embodiment, the processor 2101 performs a smoothing operation on the centroid position of the salient region of each video frame in the target video segment, and when the smoothed centroid position is obtained, the following operations are specifically performed:
performing polynomial fitting processing on the centroid positions corresponding to all video frames in the target video segment to obtain a fitting curve;
And determining the positions of all video frames in the target video segment corresponding to the fitting curve as smoothed mass center positions corresponding to the corresponding video frames.
In one embodiment, the salient feature map is generated by a salient region detection model, and feature values of each pixel point in the salient feature map are used for indicating: the predicted significance of the corresponding pixel;
the processor 2101 is configured to perform the following operations when training to obtain the salient region detection model:
acquiring a sample video frame and a reference salient feature map of the sample video frame; the feature value of each pixel point in the reference salient feature map is used for indicating the marked significance degree of the corresponding pixel point;
performing salient region detection processing on the sample video frame through an initial salient region detection model to obtain a salient feature map of the sample video frame;
and training the initial salient region detection model by taking reducing the distribution difference between the feature value of each pixel point in the salient feature map of the sample video frame and the feature value of each pixel point in the corresponding reference salient feature map as a training target, to obtain the salient region detection model.
In particular implementations, when one or more instructions in the computer storage medium are loaded by the processor 2101 and perform the corresponding steps of the method described above in connection with the video processing method embodiment of fig. 18, the following operations are specifically performed:
acquiring a plurality of video frames to be processed from a target video segment of the video to be processed; the target video segment is one video segment in the video to be processed;
respectively carrying out image feature extraction processing on each video frame to be processed to obtain a first feature map of each video frame to be processed;
obtaining a fusion feature map of each video frame to be processed according to the playing time sequence information of each video frame to be processed and the first feature map of each video frame to be processed; the playing time sequence information is fused in the fusion characteristic diagram of each video frame to be processed;
generating a salient feature map of a target video frame according to a first feature map of the target video frame in the plurality of video frames to be processed and a fusion feature map of the target video frame; the salient feature map is used for indicating salient regions of the target video frame so as to highlight the salient regions of the target video frame when the video to be processed is played.
In one embodiment, the processor 2101 specifically performs the following operations when generating the salient feature map of the target video frame according to the first feature map of the target video frame in the plurality of video frames to be processed and the fused feature map of the target video frame:
performing feature interaction processing on the first feature map of the target video frame and the fusion feature map of the target video frame to obtain an interaction feature map of the target video frame;
and carrying out salient feature analysis processing according to the interactive feature map of the target video frame to obtain the salient feature map of the target video frame.
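These embodiments do not fix a particular interaction operator; one simple, purely illustrative reading is to let the fused, timing-aware feature map re-weight the first feature map, for example:

```python
import torch

def feature_interaction(first_map, fused_map):
    """Illustrative interaction: use the fused feature map as a gate over the first
    feature map and keep a residual path (an assumption, not the only operator
    compatible with the embodiment). Both inputs share the same shape."""
    gate = torch.sigmoid(fused_map)        # values in [0, 1]
    return first_map * gate + first_map    # gated response plus residual
```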
In one embodiment, when the processor 2101 performs salient feature analysis processing according to the interactive feature map of the target video frame to obtain the salient feature map of the target video frame, the following operations are specifically performed:
performing feature stitching processing on the interactive feature map of the target video frame and the second feature map of the target video frame to obtain a first stitching feature map of the target video frame;
performing salient feature analysis processing according to the first spliced feature map of the target video frame to obtain a salient feature map of the target video frame;
The second feature map of the target video frame is obtained in the process of extracting image features of the target video frame, and the first feature map of the target video frame is obtained by extracting semantic features of the second feature map of the target video frame.
In one embodiment, the interactive feature map of the target video frame is of a different feature map size than the second feature map of the target video frame;
the processor 2101, when performing feature stitching processing on the interaction feature map of the target video frame and the second feature map of the target video frame to obtain the first stitched feature map of the target video frame, specifically performs the following operations:
performing size transformation processing on the interactive feature map of the target video frame to obtain an interactive feature map after size transformation;
and performing feature stitching processing on the interactive feature map after the size transformation and the second feature map of the target video frame to obtain a first stitching feature map of the target video frame.
In one embodiment, when the processor 2101 performs salient feature analysis processing according to the first stitched feature map of the target video frame to obtain the salient feature map of the target video frame, the following operations are specifically performed:
Performing feature stitching processing on the first stitching feature map of the target video frame and the third feature map of the target video frame to obtain a second stitching feature map of the target video frame;
performing salient feature analysis processing according to the second spliced feature map of the target video frame to obtain a salient feature map of the target video frame;
the third feature map of the target video frame is obtained in the process of extracting image features of the target video frame, and the second feature map of the target video frame is obtained by extracting semantic features of the third feature map of the target video frame.
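Taken together, the stitching steps above resemble a decoder with skip connections: the interaction features are resized to the second feature map, concatenated, resized again to the third feature map, and concatenated before the final analysis. A hedged sketch, assuming bilinear resizing and a caller-supplied analysis head:

```python
import torch
import torch.nn.functional as F

def stitch_and_analyze(interaction_map, second_map, third_map, head):
    """Illustrative decoder path over (B, C, H, W) tensors: resize the interaction
    features to the shallower second feature map, concatenate, then repeat with the
    even shallower third feature map before the final saliency head. Bilinear
    resizing and the `head` module are assumptions."""
    x = F.interpolate(interaction_map, size=second_map.shape[-2:],
                      mode="bilinear", align_corners=False)
    x = torch.cat([x, second_map], dim=1)                  # first stitched feature map
    x = F.interpolate(x, size=third_map.shape[-2:],
                      mode="bilinear", align_corners=False)
    x = torch.cat([x, third_map], dim=1)                   # second stitched feature map
    return torch.sigmoid(head(x))                          # salient feature map
```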
In one embodiment, the plurality of video frames to be processed are obtained by performing frame extraction processing on the target video segment according to frame extraction interval parameters, where the frame extraction interval parameters include any one or more of a number of interval video frames and an interval duration parameter; the plurality of video frames to be processed include a first video frame to be processed obtained in a first frame extraction processing and a second video frame to be processed obtained in a second frame extraction processing; one or more video frames are spaced between the first video frame to be processed and the second video frame to be processed; and the second frame extraction processing is the frame extraction processing that immediately follows the first frame extraction processing;
The processor 2101 is further configured to:
and determining the salient feature map of each video frame of the interval between the first to-be-processed video frame and the second to-be-processed video frame according to the salient feature map of the first to-be-processed video frame and the salient feature map of the second to-be-processed video frame.
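One illustrative way to obtain the salient feature maps of the skipped interval frames is linear blending of the two neighbouring maps; the embodiment only requires that they be derived from the salient feature maps of the first and second video frames to be processed.

```python
import numpy as np

def interpolate_saliency(first_map, second_map, num_between):
    """For the frames skipped by the frame-extraction interval, blend the two
    neighbouring saliency maps linearly (an illustrative choice only)."""
    maps = []
    for k in range(1, num_between + 1):
        w = k / (num_between + 1)                  # relative position between the two frames
        maps.append((1.0 - w) * first_map + w * second_map)
    return maps
```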
In one embodiment, the processor 2101 is further configured to:
detecting a video frame corresponding to a voice signal in the plurality of video frames to be processed;
carrying out face detection processing on the video frames corresponding to the voice signals;
if a face is detected, detecting the motion change of a mouth region of the face based on the video frame corresponding to the voice signal and the video frames in the target video segment associated with the video frame corresponding to the voice signal;
and in a case that motion change exists in the mouth region of the face, updating the salient feature map of the video frame corresponding to the voice signal according to the portrait region to which the face belongs.
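A rough stand-in for this speaker-highlighting step is sketched below with OpenCV; the Haar-cascade face detector, the use of the lower third of the face box as the mouth region, the motion threshold, and boosting only the face box rather than the full portrait region are all illustrative assumptions, and detecting which frames carry a voice signal is assumed to be done elsewhere.

```python
import cv2
import numpy as np

def update_saliency_for_speaker(frame, neighbor_frame, saliency_map, motion_thresh=8.0):
    """Detect a face, check the mouth region for motion against a neighbouring frame,
    and boost the corresponding region of the (H, W) saliency map if motion is found."""
    cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray_prev = cv2.cvtColor(neighbor_frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in cascade.detectMultiScale(gray, 1.1, 5):
        mouth_now = gray[y + 2 * h // 3:y + h, x:x + w].astype(np.float32)
        mouth_prev = gray_prev[y + 2 * h // 3:y + h, x:x + w].astype(np.float32)
        if np.abs(mouth_now - mouth_prev).mean() > motion_thresh:
            saliency_map[y:y + h, x:x + w] = saliency_map.max()  # highlight the speaker's region
    return saliency_map
```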
In one embodiment, when the processor 2101 performs segmentation processing on the video to be processed to obtain one or more video segments, the following operations are specifically performed:
performing key frame determination processing on the video to be processed to obtain a plurality of key frames;
Determining a color statistical histogram of each key frame, and determining the difference of the color statistical histograms between adjacent key frames according to the color statistical histogram of each key frame;
and determining target key frames from adjacent key frames of which the differences meet the dividing conditions, and dividing the video to be processed based on each target key frame to obtain one or more video segments.
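As an illustrative sketch of the histogram-difference test, the following compares HSV colour histograms of adjacent key frames and treats a key frame as a segmentation point when similarity falls below a threshold; the bin counts, the correlation metric, and the threshold value are assumptions.

```python
import cv2
import numpy as np

def find_scene_cut_keyframes(keyframes, diff_thresh=0.5):
    """Return indices of key frames whose colour histogram differs enough from the
    previous key frame to start a new video segment (illustrative parameters)."""
    hists, cuts = [], []
    for kf in keyframes:
        hsv = cv2.cvtColor(kf, cv2.COLOR_BGR2HSV)
        h = cv2.calcHist([hsv], [0, 1], None, [32, 32], [0, 180, 0, 256])
        hists.append(cv2.normalize(h, h).flatten())
    for i in range(1, len(hists)):
        similarity = cv2.compareHist(hists[i - 1], hists[i], cv2.HISTCMP_CORREL)
        if similarity < diff_thresh:
            cuts.append(i)                    # key frame i is a candidate target key frame
    return cuts
```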
Embodiments of the present application provide a computer program product comprising a computer program stored in a computer storage medium; the processor of the video processing device reads the computer program from the computer storage medium and executes the computer program, so that the video processing device executes the method embodiment shown in fig. 2 and fig. 5 or the method embodiment shown in fig. 18. The computer readable storage medium may be a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.
The foregoing is merely illustrative of embodiments of the present application and is not intended to limit the present application; any variation or substitution that can readily occur to a person skilled in the art within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (20)

1. A video processing method, comprising:
dividing the video to be processed to obtain one or more video clips; one video clip corresponds to one video scene;
acquiring a salient region of a video frame in each video segment;
based on the salient regions of the video frames in each video segment, performing video frame clipping processing on the video frames in each video segment to obtain a plurality of clipped video frames;
and performing splicing processing on the plurality of cut video frames to obtain a target video.
2. The method of claim 1, wherein the cropped video frame comprises part or all of the salient region of the corresponding video frame before cropping.
3. The method of claim 1, wherein prior to the acquiring the salient regions of the video frames in each video clip, the method further comprises:
acquiring a plurality of video frames to be processed from a target video clip among the video clips; the target video clip is any one of the video clips;
respectively carrying out image feature extraction processing on each video frame to be processed to obtain a first feature map of each video frame to be processed;
Obtaining a fusion feature map of each video frame to be processed according to the playing time sequence information of each video frame to be processed and the first feature map of each video frame to be processed; the playing time sequence information is fused in the fusion characteristic diagram of each video frame to be processed;
generating a salient feature map of a target video frame according to a first feature map of the target video frame in the plurality of video frames to be processed and a fusion feature map of the target video frame; the salient feature map is used to indicate salient regions of the target video frame.
4. The method of claim 3, wherein the generating the salient feature map of the target video frame from the first feature map of the target video frame of the plurality of video frames to be processed and the fused feature map of the target video frame comprises:
performing feature interaction processing on the first feature map of the target video frame and the fusion feature map of the target video frame to obtain an interaction feature map of the target video frame;
and carrying out salient feature analysis processing according to the interactive feature map of the target video frame to obtain the salient feature map of the target video frame.
5. The method of claim 4, wherein the performing salient feature analysis processing according to the interactive feature map of the target video frame to obtain the salient feature map of the target video frame comprises:
performing feature stitching processing on the interactive feature map of the target video frame and the second feature map of the target video frame to obtain a first stitching feature map of the target video frame;
performing salient feature analysis processing according to the first spliced feature map of the target video frame to obtain a salient feature map of the target video frame;
the second feature map of the target video frame is obtained in the process of extracting image features of the target video frame, and the first feature map of the target video frame is obtained by extracting semantic features of the second feature map of the target video frame.
6. The method of claim 5, wherein the interactive feature map of the target video frame is of a different feature map size than the second feature map of the target video frame;
performing feature stitching processing on the interaction feature map of the target video frame and the second feature map of the target video frame to obtain a first stitched feature map of the target video frame, including:
Performing size transformation processing on the interactive feature map of the target video frame to obtain an interactive feature map after size transformation;
and performing feature stitching processing on the interactive feature map after the size transformation and the second feature map of the target video frame to obtain a first stitching feature map of the target video frame.
7. The method of claim 5, wherein the performing salient feature analysis according to the first stitched feature map of the target video frame to obtain the salient feature map of the target video frame comprises:
performing feature stitching processing on the first stitching feature map of the target video frame and the third feature map of the target video frame to obtain a second stitching feature map of the target video frame;
performing salient feature analysis processing according to the second spliced feature map of the target video frame to obtain a salient feature map of the target video frame;
the third feature map of the target video frame is obtained in the process of extracting image features of the target video frame, and the second feature map of the target video frame is obtained by extracting semantic features of the third feature map of the target video frame.
8. The method of claim 3, wherein the plurality of video frames to be processed are obtained by performing frame extraction processing on the target video segment according to frame extraction interval parameters, wherein the frame extraction interval parameters comprise any one or more of a number of interval video frames and an interval duration parameter; the plurality of video frames to be processed comprise a first video frame to be processed obtained in a first frame extraction processing and a second video frame to be processed obtained in a second frame extraction processing; one or more video frames are spaced between the first video frame to be processed and the second video frame to be processed; and the second frame extraction processing is the frame extraction processing that immediately follows the first frame extraction processing;
the method further comprises the steps of:
and determining the salient feature map of each video frame of the interval between the first to-be-processed video frame and the second to-be-processed video frame according to the salient feature map of the first to-be-processed video frame and the salient feature map of the second to-be-processed video frame.
9. The method of claim 3, wherein the method further comprises:
detecting a video frame corresponding to a voice signal in the plurality of video frames to be processed;
Carrying out face detection processing on the video frames corresponding to the voice signals;
if a face is detected, detecting the motion change of a mouth region of the face based on the video frame corresponding to the voice signal and the video frames in the target video segment associated with the video frame corresponding to the voice signal;
and in a case that motion change exists in the mouth region of the face, updating the salient feature map of the video frame corresponding to the voice signal according to the portrait region to which the face belongs.
10. The method of claim 1, wherein the segmenting the video to be processed to obtain one or more video segments comprises:
performing key frame determination processing on the video to be processed to obtain a plurality of key frames;
determining a color statistical histogram of each key frame, and determining the difference of the color statistical histograms between adjacent key frames according to the color statistical histogram of each key frame;
and determining target key frames from adjacent key frames of which the differences meet the dividing conditions, and dividing the video to be processed based on each target key frame to obtain one or more video segments.
11. The method of claim 1, wherein the performing video frame cropping on the video frames in each video clip based on the salient regions of the video frames in each video clip to obtain a plurality of cropped video frames comprises:
determining centroid positions of salient regions of the corresponding video frames based on salient regions of video frames in the respective video clips;
and acquiring a cropping ratio, and cropping the corresponding video frames according to the cropping ratio and the centroid positions of the salient regions of the video frames in each video segment to obtain a plurality of cropped video frames.
12. The method of claim 11, wherein cropping the respective video frame based on the cropping ratio and the centroid position of the salient region of the video frame in the respective video segment to obtain a plurality of cropped video frames, comprising:
performing smoothing operation on the centroid position of the salient region of each video frame in the target video segment aiming at the target video segment in each video segment to obtain a smoothed centroid position;
and cropping the corresponding video frames according to the cropping ratio and the smoothed centroid positions corresponding to the video frames in each video segment to obtain a plurality of cropped video frames.
13. The method of claim 12, wherein smoothing the centroid position of the salient region of each video frame in the target video segment to obtain a smoothed centroid position comprises:
for any video frame in the target video segment, acquiring a plurality of video frames from the target video segment; the number of video frames spaced between each video frame and any video frame is less than or equal to the number of video frames spaced between the rest of video frames in the target video segment and any video frame;
carrying out average processing on the centroid positions corresponding to the obtained video frames to obtain average positions;
and determining the average position as the smoothed centroid position corresponding to any video frame.
14. The method of claim 12, wherein smoothing the centroid position of the salient region of each video frame in the target video segment to obtain a smoothed centroid position comprises:
performing polynomial fitting processing on the centroid positions corresponding to all video frames in the target video segment to obtain a fitting curve;
and determining the positions on the fitting curve corresponding to the video frames in the target video segment as the smoothed centroid positions corresponding to the corresponding video frames.
15. The method of claim 3, wherein the salient feature map is generated by a salient region detection model, and the feature value of each pixel point in the salient feature map is used to indicate the predicted saliency degree of the corresponding pixel point;
the training method for obtaining the salient region detection model comprises the following steps:
acquiring a sample video frame and a reference salient feature map of the sample video frame, where the feature value of each pixel point in the reference salient feature map is used for indicating the annotated saliency degree of the corresponding pixel point;
performing salient region detection processing on the sample video frame through an initial salient region detection model to obtain a salient feature map of the sample video frame;
and training the initial salient region detection model by taking reduction of the distribution difference between the feature values of the pixel points in the salient feature map of the sample video frame and the feature values of the pixel points in the corresponding reference salient feature map as the training target, so as to obtain the salient region detection model.
16. A video processing method, comprising:
acquiring a plurality of video frames to be processed from a target video segment of the video to be processed; the target video segment is one video segment in the video to be processed;
Respectively carrying out image feature extraction processing on each video frame to be processed to obtain a first feature map of each video frame to be processed;
obtaining a fusion feature map of each video frame to be processed according to the playing time sequence information of each video frame to be processed and the first feature map of each video frame to be processed; the playing time sequence information is fused in the fusion characteristic diagram of each video frame to be processed;
generating a salient feature map of a target video frame according to a first feature map of the target video frame in the plurality of video frames to be processed and a fusion feature map of the target video frame; the salient feature map is used for indicating salient regions of the target video frame so as to highlight the salient regions of the target video frame when the video to be processed is played.
17. A video processing apparatus, comprising:
the segmentation unit is used for carrying out segmentation processing on the video to be processed to obtain one or more video clips; one video clip corresponds to one video scene;
an acquisition unit configured to acquire a salient region of a video frame in each video clip;
the clipping unit is used for clipping the video frames in each video clip based on the salient regions of the video frames in each video clip to obtain a plurality of clipped video frames;
And the splicing unit is used for splicing the plurality of cut video frames to obtain a target video.
18. A video processing apparatus, comprising:
the acquisition unit is used for acquiring a plurality of video frames to be processed from a target video fragment of the video to be processed; the target video segment is one video segment in the video to be processed;
the processing unit is used for respectively carrying out image feature extraction processing on each video frame to be processed to obtain a first feature map of each video frame to be processed;
the processing unit is further configured to obtain a fusion feature map of each video frame to be processed according to the play time sequence information of each video frame to be processed and the first feature map of each video frame to be processed; the playing time sequence information is fused in the fusion characteristic diagram of each video frame to be processed;
the processing unit is further configured to generate a salient feature map of the target video frame according to a first feature map of the target video frame in the plurality of video frames to be processed and a fusion feature map of the target video frame; the salient feature map is used for indicating salient regions of the target video frame so as to highlight the salient regions of the target video frame when the video to be processed is played.
19. A video processing device, the video processing device comprising an input interface and an output interface, further comprising:
a processor adapted to implement one or more instructions; the method comprises the steps of,
a computer storage medium storing one or more instructions adapted to be loaded by the processor and to perform the video processing method of any one of claims 1-15; alternatively, the one or more instructions are adapted to be loaded by the processor and to perform the video processing method of claim 16.
20. A computer storage medium having stored therein computer program instructions for performing the video processing method of any of claims 1-15 when executed by a processor; alternatively, the computer program instructions, when executed by a processor, are for performing the video processing method of claim 16.
CN202310710061.6A 2023-06-14 2023-06-14 Video processing method, device, equipment and storage medium Pending CN116980695A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310710061.6A CN116980695A (en) 2023-06-14 2023-06-14 Video processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310710061.6A CN116980695A (en) 2023-06-14 2023-06-14 Video processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116980695A true CN116980695A (en) 2023-10-31

Family

ID=88482290

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310710061.6A Pending CN116980695A (en) 2023-06-14 2023-06-14 Video processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116980695A (en)

Similar Documents

Publication Publication Date Title
US11830230B2 (en) Living body detection method based on facial recognition, and electronic device and storage medium
KR102354692B1 (en) Rule-based video importance analysis
CN109600544B (en) Local dynamic image generation method and device
CN111145308A (en) Paster obtaining method and device
CN112423021B (en) Video processing method and device, readable medium and electronic equipment
CN112541867A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN114339362B (en) Video bullet screen matching method, device, computer equipment and storage medium
CN112101344B (en) Video text tracking method and device
CN111325107A (en) Detection model training method and device, electronic equipment and readable storage medium
WO2023056835A1 (en) Video cover generation method and apparatus, and electronic device and readable medium
CN113516666A (en) Image cropping method and device, computer equipment and storage medium
CN113362371A (en) Target tracking method and device, electronic equipment and storage medium
JP2006270301A (en) Scene change detecting apparatus and scene change detection program
CN109600667B (en) Video redirection method based on grid and frame grouping
CN114416260A (en) Image processing method, image processing device, electronic equipment and storage medium
CN114758027A (en) Image processing method, image processing device, electronic equipment and storage medium
CN112565887B (en) Video processing method, device, terminal and storage medium
CN114449362B (en) Video cover selection method, device, equipment and storage medium
CN116977200A (en) Processing method and device of video denoising model, computer equipment and storage medium
US20220385810A1 (en) Panoramic Video Data Process
CN116980695A (en) Video processing method, device, equipment and storage medium
CN115379290A (en) Video processing method, device, equipment and storage medium
CN112818743B (en) Image recognition method and device, electronic equipment and computer storage medium
CN111340101B (en) Stability evaluation method, apparatus, electronic device, and computer-readable storage medium
CN110996173B (en) Image data processing method and device and storage medium

Legal Events

Date Code Title Description
PB01 Publication