CN116977884A - Training method of video segmentation model, video segmentation method and device

Info

Publication number
CN116977884A
Authority
CN
China
Prior art keywords
sample
frame
image
sample image
video
Prior art date
Legal status
Pending
Application number
CN202211394354.XA
Other languages
Chinese (zh)
Inventor
熊江丰
王臻郅
李智敏
芦清林
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202211394354.XA
Publication of CN116977884A
Legal status: Pending

Classifications

    • G06V 20/46 - Scenes; scene-specific elements in video content: extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06N 3/04 - Computing arrangements based on biological models: neural networks; architecture, e.g. interconnection topology
    • G06N 3/08 - Computing arrangements based on biological models: neural networks; learning methods
    • G06V 10/774 - Image or video recognition or understanding using pattern recognition or machine learning: generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/806 - Image or video recognition or understanding using pattern recognition or machine learning: fusion of extracted features, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level
    • G06V 10/82 - Image or video recognition or understanding using pattern recognition or machine learning: using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a training method of a video segmentation model, a video segmentation method and a video segmentation device, and belongs to the technical field of Internet. The method comprises the following steps: acquiring a multi-frame sample image extracted from a sample video and first annotation information of each frame of sample image; determining first prediction information and second prediction information of each frame of sample image through a neural network model, wherein the first prediction information of the sample image represents the probability that the distance between the sample image and a sample dividing point is not greater than a distance threshold value, and the second prediction information of the sample image represents the prediction offset between the time stamp of the sample image and the time stamp of the sample dividing point; and training the neural network model based on the first labeling information, the first prediction information and the second prediction information of each frame of sample image to obtain a video segmentation model. The time stamp of the video segmentation point can be accurately determined through the video segmentation model, so that the video is segmented based on the time stamp of the video segmentation point, and the accuracy of the segmentation result is improved.

Description

Training method of video segmentation model, video segmentation method and device
Technical Field
The embodiment of the application relates to the technical field of Internet, in particular to a training method of a video segmentation model, a video segmentation method and a device.
Background
With the continuous development of the internet, resources on the network have increased dramatically, and a large amount of video is contained in these resources. In general, a video needs to be segmented to facilitate understanding of a video structure, video content, and the like based on a segmentation result. Therefore, how to segment video becomes a problem to be solved.
Disclosure of Invention
The application provides a training method of a video segmentation model, a video segmentation method and a video segmentation device, which can be used for solving the problems in the related art.
In one aspect, a method for training a video segmentation model is provided, the method comprising:
acquiring a plurality of frames of sample images extracted from a sample video and first annotation information of each frame of sample image, wherein the first annotation information of the sample image represents whether the distance between the sample image and a sample dividing point of the sample video is not more than a distance threshold;
determining first prediction information and second prediction information of each frame of sample image through a neural network model, wherein the first prediction information of the sample image represents the probability that the distance between the sample image and the sample dividing point is not greater than a distance threshold value, and the second prediction information of the sample image represents the prediction offset between the time stamp of the sample image and the time stamp of the sample dividing point;
Training the neural network model based on the first labeling information, the first prediction information and the second prediction information of each frame of sample image to obtain a video segmentation model, wherein the video segmentation model is used for segmenting a target video.
In another aspect, a video slicing method is provided, the method including:
acquiring a video segmentation model and a multi-frame target image extracted from a target video, wherein the video segmentation model is obtained by training according to the training method of the video segmentation model;
determining first prediction information and second prediction information of a target image of each frame through the video segmentation model, wherein the first prediction information of the target image represents the probability that the distance between the target image and a target segmentation point of the target video is not greater than a distance threshold value, and the second prediction information of the target image represents the prediction offset between a time stamp of the target image and a time stamp of the target segmentation point;
and cutting the target video based on the first prediction information and the second prediction information of the target image of each frame to obtain at least two video sequences.
In another aspect, a training device for a video segmentation model is provided, where the device includes:
The acquisition module is used for acquiring a plurality of frames of sample images extracted from a sample video and first annotation information of each frame of sample image, wherein the first annotation information of the sample image represents whether the distance between the sample image and a sample dividing point of the sample video is not more than a distance threshold;
a determining module, configured to determine, by using a neural network model, first prediction information and second prediction information of each frame of sample image, where the first prediction information of the sample image characterizes a probability that a distance between the sample image and the sample segmentation point is not greater than a distance threshold, and the second prediction information of the sample image characterizes a prediction offset between a timestamp of the sample image and a timestamp of the sample segmentation point;
the training module is used for training the neural network model based on the first labeling information, the first prediction information and the second prediction information of each frame of sample image to obtain a video segmentation model, and the video segmentation model is used for segmenting a target video.
In a possible implementation manner, the determining module is configured to determine, through a neural network model, sample features of the sample images of each frame; for any frame of sample image, determining a characteristic difference between the any frame of sample image and an adjacent frame of sample image based on sample characteristics of the any frame of sample image and sample characteristics of the adjacent frame of sample image through the neural network model, wherein the adjacent frame of sample image is a sample image adjacent to the any frame of sample image in the multi-frame sample image; and determining first prediction information and second prediction information of any frame sample image based on the feature difference between the any frame sample image and the adjacent frame image through the neural network model.
In a possible implementation manner, the determining module is configured to determine, through a neural network model, image features of the sample images of each frame; determining, by the neural network model, complementary features of the sample images of each frame, where the complementary features of the sample images include at least one of audio features of sample audio corresponding to the sample images and text features of sample text corresponding to the sample images; and fusing the image characteristics of each frame of sample image and the complementary characteristics of each frame of sample image through the neural network model to obtain the sample characteristics of each frame of sample image.
In a possible implementation manner, the determining module is configured to fuse, through the neural network model, image features of each frame of sample image and audio features of sample audio corresponding to each frame of sample image, so as to obtain first fusion features of each frame of sample image; for any frame of sample image, determining a first text feature related to a first fusion feature of the any frame of sample image from text features of sample texts corresponding to the frames of sample images; fusing the first fusion characteristic of the sample image of any frame with the first text characteristic to obtain a second fusion characteristic of the sample image of any frame; and determining sample characteristics of each frame of sample image based on the second fusion characteristics of each frame of sample image.
In a possible implementation manner, the determining module is configured to determine, for any frame of sample images, a second text feature related to an image feature of the any frame of sample images from text features of sample texts corresponding to the respective frame of sample images; fusing the image features of the sample images of any frame and the second text features through the neural network model to obtain third fusion features of the sample images of any frame; fusing the third fusion characteristic of the sample image of any frame with the audio characteristic of the sample audio corresponding to the sample image of any frame to obtain a fourth fusion characteristic of the sample image of any frame; and determining sample characteristics of each frame of sample image based on the fourth fusion characteristics of each frame of sample image.
In one possible implementation manner, the training module is configured to determine a first loss of each frame of sample image based on the first labeling information and the first prediction information of each frame of sample image; determining a second loss of each frame of sample images based on the second prediction information of each frame of sample images, the time stamp of each frame of sample images, and the time stamp of the sample slicing point; and training the neural network model based on the first loss of each frame of sample image and the second loss of each frame of sample image to obtain a video segmentation model.
In a possible implementation manner, the training module is configured to determine, for any frame of sample images, a sum of second prediction information of the any frame of sample images and a timestamp of the any frame of sample images as a reference timestamp; a second loss of the sample image of any frame is determined based on a difference between a first threshold, the reference timestamp, and a timestamp of the sample cut point.
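For illustration only, the following Python sketch shows one plausible reading of this second loss; the function name, the use of an absolute difference, and the treatment of the first threshold as a tolerance are assumptions rather than part of the embodiment.

```python
import torch

def second_loss(pred_offset, frame_ts, cut_ts, first_threshold=1.0):
    # pred_offset: predicted offset per sample image (seconds), shape (T,)
    # frame_ts:    time stamp of each sample image (seconds), shape (T,)
    # cut_ts:      time stamp of the matched sample segmentation point, shape (T,)
    reference_ts = pred_offset + frame_ts            # "reference timestamp"
    error = torch.abs(reference_ts - cut_ts)         # gap to the true segmentation point
    # Assumption: errors within the first threshold are not penalised.
    return torch.clamp(error - first_threshold, min=0.0).mean()
```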
In one possible implementation, the apparatus further includes:
the determining module is further configured to determine third prediction information of each frame of sample image through the neural network model, where the third prediction information of the sample image characterizes a prediction probability that the sample image belongs to each image class;
the acquisition module is further configured to acquire second labeling information of each frame of sample image, where the second labeling information of the sample image characterizes whether the sample image obtained by labeling belongs to each image class;
the training module is used for training the neural network model based on the first labeling information, the first prediction information, the second prediction information, the third prediction information and the second labeling information of each frame of sample image to obtain a video segmentation model.
In one possible implementation manner, the training module is configured to determine, for any frame of sample image, a positive sample loss of the any frame of sample image based on a prediction probability that the any frame of sample image belongs to a first category, where the first category is an image category to which the any frame of sample image obtained by labeling belongs; determining negative sample loss of any frame of sample image based on a second threshold and a prediction probability that the any frame of sample image belongs to a second category, wherein the second category is an image category that the any frame of sample image obtained through labeling does not belong to; determining a third loss of the any one frame of sample images based on the positive sample loss of the any one frame of sample images and the negative sample loss of the any one frame of sample images; and training the neural network model based on the third loss, the first labeling information, the first prediction information and the second prediction information of each frame of sample image to obtain a video segmentation model.
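For illustration only, the sketch below gives one hedged interpretation of the positive-sample and negative-sample losses; treating the positive part as a log-likelihood term and the second threshold as a margin on the negative categories is an assumption, not a statement of the claimed loss.

```python
import torch

def third_loss(probs, label_onehot, second_threshold=0.1):
    # probs:        predicted probability per image category, shape (T, C)
    # label_onehot: 1 for the annotated (first) categories, 0 otherwise, shape (T, C)
    eps = 1e-8
    # Positive-sample loss: penalise low probability on the annotated categories.
    pos = -(label_onehot * torch.log(probs + eps)).sum(dim=1)
    # Negative-sample loss: penalise probability above the second threshold on
    # categories the sample image does not belong to (assumed margin form).
    neg = ((1.0 - label_onehot) * torch.clamp(probs - second_threshold, min=0.0)).sum(dim=1)
    return (pos + neg).mean()        # third loss, averaged over the sample images
```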
In another aspect, a video slicing apparatus is provided, the apparatus including:
the acquisition module is used for acquiring a video segmentation model and a multi-frame target image extracted from a target video, wherein the video segmentation model is obtained by training according to the training method of the video segmentation model;
The determining module is used for determining first prediction information and second prediction information of each frame of target image through the video segmentation model, wherein the first prediction information of the target image represents the probability that the distance between the target image and a target segmentation point of the target video is not greater than a distance threshold value, and the second prediction information of the target image represents the prediction offset between a time stamp of the target image and a time stamp of the target segmentation point;
and the segmentation module is used for segmenting the target video based on the first prediction information and the second prediction information of the target image of each frame to obtain at least two video sequences.
In one possible implementation manner, the segmentation module is configured to determine a reference image from the multi-frame target image based on first prediction information of each frame of target image, where the first prediction information of the reference image is not less than a probability threshold; extracting second prediction information of the reference image from the second prediction information of each frame of target image; determining a time stamp of the target cut point based on the time stamp of the reference image and second prediction information of the reference image; and cutting the target video based on the time stamp of the target cutting point to obtain at least two video sequences.
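For illustration only, the following Python sketch outlines this slicing step; the NumPy-based implementation and the boundary handling at the start and end of the video are assumptions.

```python
import numpy as np

def slice_video(frame_ts, prob_near_cut, pred_offset, prob_threshold=0.5):
    frame_ts = np.asarray(frame_ts, dtype=float)
    prob = np.asarray(prob_near_cut, dtype=float)
    offset = np.asarray(pred_offset, dtype=float)
    keep = prob >= prob_threshold                       # reference images
    cut_ts = np.sort(frame_ts[keep] + offset[keep])     # target segmentation points
    # Assumed boundary handling: segments run from 0 to the last frame time stamp.
    bounds = [0.0, *cut_ts.tolist(), float(frame_ts[-1])]
    return list(zip(bounds[:-1], bounds[1:]))           # at least two video sequences
```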
In one possible implementation, the apparatus further includes:
the determining module is further configured to determine third prediction information of the target image of each frame according to the video segmentation model, where the third prediction information of the target image characterizes a prediction probability that the target image belongs to each image class; for any frame of target image, if the prediction probability of any frame of target image belonging to any image category is larger than a third threshold value, determining that any frame of target image belongs to any image category; and if the prediction probability of the any frame of target image belonging to any image category is not greater than a third threshold value, determining that the any frame of target image does not belong to any image category.
In another aspect, an electronic device is provided, where the electronic device includes a processor and a memory, where at least one computer program is stored in the memory, where the at least one computer program is loaded and executed by the processor, so that the electronic device implements any one of the training methods of the video segmentation model or implements any one of the video segmentation methods described above.
In another aspect, there is further provided a computer readable storage medium, where at least one computer program is stored, where the at least one computer program is loaded and executed by a processor, so that an electronic device implements the training method of the video segmentation model or implements the video segmentation method of any one of the foregoing.
In another aspect, a computer program or a computer program product is provided, in which at least one computer program is stored, and the at least one computer program is loaded and executed by a processor, so that an electronic device implements any one of the training methods of the video segmentation model described above or any one of the video segmentation methods described above.
The technical scheme provided by the application has at least the following beneficial effects:
according to the technical scheme provided by the application, multiple frames of sample images are extracted from the sample video, the probability that the distance between each frame of sample image and the sample dividing point of the sample video is not greater than the distance threshold value and the prediction offset between the time stamp of each frame of sample image and the time stamp of the sample dividing point are determined through the neural network model, so that the time stamp of the sample dividing point can be determined based on the probability, the prediction offset and the time stamp of each frame of sample image. Therefore, after the neural network model is trained to obtain the video segmentation model through the first labeling information, the first prediction information and the second prediction information of each frame of sample image, the time stamp of the video segmentation point can be accurately determined through the video segmentation model, so that the video is segmented based on the time stamp of the video segmentation point, and the accuracy of a video segmentation result is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of an implementation environment of a training method of a video segmentation model or a video segmentation method according to an embodiment of the present application;
FIG. 2 is a flowchart of a training method of a video segmentation model according to an embodiment of the present application;
fig. 3 is a flowchart of a video slicing method according to an embodiment of the present application;
FIG. 4 is a block diagram of a video segmentation model provided by an embodiment of the present application;
fig. 5 is a schematic diagram of a video parsing result according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a training device for a video segmentation model according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a video slicing device according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a terminal device according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of an implementation environment of a training method of a video slicing model or a video slicing method according to an embodiment of the present application, where, as shown in fig. 1, the implementation environment includes a terminal device 101 and a server 102. The training method of the video slicing model or the video slicing method in the embodiment of the present application may be performed by the terminal device 101, or may be performed by the server 102, or may be performed by the terminal device 101 and the server 102 together.
The terminal device 101 may be a smart phone, a game console, a desktop computer, a tablet computer, a laptop computer, a smart television, a smart car device, a smart voice interaction device, a smart home appliance, etc. The server 102 may be a server, or a server cluster formed by a plurality of servers, or any one of a cloud computing platform and a virtualization center, which is not limited in this embodiment of the present application. The server 102 may be in communication connection with the terminal device 101 via a wired network or a wireless network. The server 102 may have functions of data processing, data storage, data transceiving, etc., which are not limited in the embodiment of the present application. The number of terminal devices 101 and servers 102 is not limited, and may be one or more.
The technical scheme provided by the embodiment of the application can be realized based on artificial intelligence (Artificial Intelligence, AI) technology. Artificial intelligence is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, involving both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include directions such as computer vision, speech processing, natural language processing, machine learning/deep learning, automatic driving, and intelligent transportation.
With the continued development of the internet, the number of videos has increased dramatically. By segmenting the video, the structure, the content and the like of the video can be systematically and conveniently understood, and the accuracy of the functions such as video recommendation, video retrieval and the like can be improved.
In general, video segmentation is implemented by using a trained image classification model, and can be summarized as follows: the image category of each frame of image in the video is determined through the image classification model; when the image category of a frame differs from that of its preceding frame, a video segmentation point is determined based on the time stamp of that frame, so that the video is segmented into at least two clips based on the video segmentation points.
However, one frame of image corresponds to at least one image category. With the approach of determining video segmentation points by judging whether the image categories of two adjacent frames are the same, a video segmentation point is determined only when the two adjacent frames have different image categories, which easily leads to low accuracy of the video segmentation points and thus to inaccurate video segmentation results.
The embodiment of the application provides a training method of a video segmentation model, which can be applied to the implementation environment. The video segmentation model trained by the method provided by the embodiment of the application can determine the accurate video segmentation points, and improves the accuracy of the video segmentation result. Taking the flowchart of the training method of the video segmentation model provided in the embodiment of the present application shown in fig. 2 as an example, for convenience of description, the terminal device 101 or the server 102 that performs the training method of the video segmentation model in the embodiment of the present application is referred to as an electronic device, and the method may be performed by the electronic device. As shown in fig. 2, the method includes the following steps.
Step 201, acquiring a plurality of frames of sample images extracted from a sample video and first labeling information of each frame of sample images.
The electronic device may acquire the sample video in any manner; for example, the electronic device may capture any video from the network as the sample video, or may take an input video as the sample video. The sample video includes multiple frames of images; a part of these images can be extracted from the frames included in the sample video and used as sample images, so that multiple frames of sample images are obtained. The embodiment of the application does not limit the extraction manner: a set number of images (for example, 100) may be extracted from the frames included in the sample video, or images may be extracted from the frames included in the sample video at a set sampling rate (for example, 2 frames per second (FPS)). Extracting multiple frames of sample images from the sample video improves computational efficiency.
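For illustration only, the sketch below extracts sample images at a fixed sampling rate (such as the 2 FPS mentioned above); the use of OpenCV and the helper name are assumptions.

```python
import cv2

def extract_sample_images(video_path, sample_fps=2.0):
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 25.0      # fall back if FPS is unknown
    step = max(int(round(native_fps / sample_fps)), 1)  # keep every `step`-th frame
    frames, timestamps, idx = [], [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
            timestamps.append(idx / native_fps)          # time stamp in seconds
        idx += 1
    cap.release()
    return frames, timestamps
```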
At least one sample segmentation point can be marked in the sample video, and the time at which any sample segmentation point occurs in the sample video is the time stamp of that sample segmentation point. For example, if a sample segmentation point is located at 0 minutes 10 seconds of the sample video, the time stamp of that sample segmentation point is 0:0:10.
For any one of the multiple frames of sample images, the distance between the sample image and a sample segmentation point of the sample video may be determined, and the first labeling information of the sample image is obtained by labeling whether this distance is not greater than a distance threshold. Thus, the first labeling information of the sample image characterizes whether the distance between the sample image and the sample segmentation point is not greater than the distance threshold.
The embodiment of the application does not limit the manner of determining the distance between the sample image and the sample segmentation point. For example, the start time and/or end time of any frame of sample image in the sample video may be used as the time stamp of the sample image; for instance, if a sample image appears in the sample video from the 12th second to the 20th second, its time stamp is 0:0:12-0:0:20. Based on the time stamp of each frame of sample image and the time stamp of the sample segmentation point, the number of sample images lying between any frame of sample image and the sample segmentation point may be taken as the distance between that sample image and the sample segmentation point. Alternatively, the difference between the time stamp of any frame of sample image and the time stamp of the sample segmentation point is taken as the distance between that sample image and the sample segmentation point.
The distance threshold is not limited in the embodiment of the present application, and the distance threshold is a set value, or the distances between the sample images of each frame and the sample dividing points are ordered, and the ordered first number of distances is used as the distance threshold.
Alternatively, since the number of sample segmentation points is at least one, the time difference between the time stamp of each sample image and the time stamp of each sample segmentation point may be calculated. For any sample segmentation point, the smallest of the time differences between the time stamps of the respective sample images and the time stamp of that sample segmentation point is selected; the sample image corresponding to the smallest time difference is marked as a sample image whose distance from the sample segmentation point of the sample video is not greater than the distance threshold, and the sample images corresponding to the other time differences are marked as sample images whose distance from the sample segmentation point of the sample video is greater than the distance threshold. In this way, whether the distance between each frame of sample image and a sample segmentation point of the sample video is not greater than the distance threshold can be marked, and the first labeling information of each frame of sample image is obtained. In this case, the first labeling information of any frame of sample image characterizes whether that sample image is the one closest to a sample segmentation point.
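For illustration only, the following sketch produces first labeling information under the variant just described, where only the sample image closest to each sample segmentation point is marked as positive; the array-based representation is an assumption.

```python
import numpy as np

def first_annotation_info(frame_ts, cut_ts_list):
    frame_ts = np.asarray(frame_ts, dtype=float)
    labels = np.zeros(len(frame_ts), dtype=np.int64)
    for cut_ts in cut_ts_list:
        # Mark only the sample image with the smallest time difference.
        labels[np.argmin(np.abs(frame_ts - cut_ts))] = 1
    return labels   # 1: within the distance threshold, 0: beyond it
```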
It will be appreciated that when the time stamp of the sample image includes a start time and an end time, the difference between the time stamp of the sample image and the time stamp of the sample cut point may be calculated using either the start time or the end time.
Step 202, determining first prediction information and second prediction information of each frame of sample image through a neural network model.
Each frame of sample image may be input into a neural network model, and first prediction information of each frame of sample image and second prediction information of each frame of sample image may be determined by the neural network model.
The first prediction information of any frame of sample image characterizes a probability that a distance of the sample image from a sample segmentation point is not greater than a distance threshold. The greater the probability, the more likely the sample image is to be at a distance from the sample slicing point that is no greater than the distance threshold. Optionally, the first prediction information of the sample image is greater than or equal to 0 and less than or equal to 1.
The second prediction information of any frame of sample image characterizes a prediction offset between the time stamp of the sample image and the time stamp of the sample segmentation point. Optionally, when the second prediction information of the sample image is a negative value, it indicates that the time stamp of the sample image is greater than the time stamp of the sample segmentation point; when the second prediction information of the sample image is a positive value, it indicates that the time stamp of the sample image is smaller than the time stamp of the sample segmentation point.
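For illustration only, the lines below spell out this sign convention, assuming the offset is defined as the segmentation-point time stamp minus the sample-image time stamp (in seconds):

```python
frame_ts, cut_ts = 12.0, 10.0
offset = cut_ts - frame_ts   # -2.0: negative, the sample image lies after the segmentation point
frame_ts, cut_ts = 8.0, 10.0
offset = cut_ts - frame_ts   # +2.0: positive, the sample image lies before the segmentation point
```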
The embodiment of the application does not limit the model structure, model parameters and the like of the neural network model. Optionally, the neural network model is an initial network model; in this case, the model structure, model parameters and the like of the neural network model are the same as those of the initial network model. The model structure of the initial network model includes a feature processing network and a prediction network; the functions, implementation principles and the like of each network are described below and are not repeated here. Optionally, the neural network model is obtained by training the initial network model at least once in the manner of step 201 to step 203. In this case, the model structure of the neural network model is the same as that of the initial network model, but the model parameters of the neural network model differ from those of the initial network model.
In one possible implementation, step 202 includes steps 2021 to 2023.
In step 2021, the sample characteristics of each frame of sample image are determined by the neural network model.
Each frame of sample image can be input into the neural network model, and the sample features of each frame of sample image can be extracted by the feature processing network of the neural network model. Optionally, the feature processing network includes an image feature extraction network: each frame of sample image is input into the image feature extraction network, and the image feature extraction network performs image feature extraction on each frame of sample image to obtain the image features of each frame of sample image. The feature processing network may take the image features of each frame of sample image as the sample features of each frame of sample image; alternatively, the feature processing network further includes a feature fusion network connected in series after the image feature extraction network, and the feature fusion network may perform feature fusion on the image features of each frame of sample image to obtain the sample features of each frame of sample image.
The embodiment of the application does not limit the network structure of the image feature extraction network, and the like, and the image feature extraction network is any one of a Swin Transformer, a visual geometry group network (Visual Geometry Group Network, VGG Net), and the like by way of example.
The embodiment of the application also does not limit the manner in which the feature fusion network fuses the image features of the sample image. For example, the feature fusion network may stitch the image features of the sample image and the position features of the sample image and take the stitched features as the sample features of the sample image. Alternatively, the feature fusion network may perform any kind of convolution processing (e.g., normal convolution processing, dilated convolution processing, etc.) on the image features (or the stitched features) of the sample image to obtain the sample features of the sample image. The position feature of a sample image characterizes the position of the sample image among the multiple frames of sample images; in short, if a sample image is the i-th frame among the multiple frames of sample images, its position feature is i, where i is a positive integer.
In one possible implementation, step 2021 includes steps A1 to A3.
And step A1, determining the image characteristics of each frame of sample image through a neural network model.
It has been mentioned above that the image feature extraction network in the neural network model may be used to extract the image feature of each frame of sample image, and will not be described here.
And A2, determining the supplementary features of each frame of sample image through the neural network model, wherein the supplementary features of the sample image comprise at least one of the audio features of sample audio corresponding to the sample image and the text features of sample text corresponding to the sample image.
In one possible implementation, sample audio corresponding to any frame of sample image may be extracted from the sample video based on the start time and end time of the frame of sample image in the sample video. The starting time of any frame of sample image in the sample video is the same as the starting time of the sample audio corresponding to the frame of sample image in the sample video, and the ending time of any frame of sample image in the sample video is the same as the ending time of the sample audio corresponding to the frame of sample image in the sample video.
The feature processing network of the neural network model further includes an audio feature extraction network. The sample audio corresponding to each frame of sample image can be input into an audio feature extraction network, and the audio feature extraction network performs audio feature extraction on the sample audio corresponding to each frame of sample image to obtain the audio feature of the sample audio corresponding to each frame of sample image. And taking the audio characteristics of the sample audio corresponding to each frame of sample image as the complementary characteristics of each frame of sample image.
The embodiment of the application does not limit the network structure of the audio feature extraction network and the like; by way of example, the audio feature extraction network is any one of VGGish, a recurrent neural network (Recurrent Neural Network, RNN) and the like.
In another possible implementation manner, text recognition can be performed on any frame of sample image, so as to identify contained text from the sample image, and the identified text is taken as sample text corresponding to the frame of sample image. Alternatively, the sample video is subjected to automatic speech recognition (Automatic Speech Recognition, ASR) processing to extract individual text in the sample video. Sample text corresponding to a frame of sample image may be extracted from each text based on a start time of each text in the sample video, an end time of each text in the sample video, a start time of any frame of sample image in the sample video, and an end time of the frame of sample image in the sample video. The starting time of the sample text corresponding to the sample image in the sample video is the same as the starting time of the sample image in the sample video, and the ending time of the sample text corresponding to the sample image in the sample video is the same as the ending time of the sample image in the sample video.
Alternatively, if any one of the sample images does not include text, or if each text does not include a sample text corresponding to any one of the sample images, the set character is used as the sample text corresponding to the sample image, for example, the set character is a null character, a special character, or the like.
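For illustration only, the sketch below matches ASR text segments to sample images by overlapping time spans and falls back to a set character when no text is found; the data layout and the empty-string fallback are assumptions.

```python
def align_text_to_frames(asr_segments, frame_spans, fill_token=""):
    # asr_segments: list of (start_s, end_s, text) recognised from the sample video
    # frame_spans:  list of (start_s, end_s) for each frame of sample image
    texts = []
    for f_start, f_end in frame_spans:
        hits = [t for s, e, t in asr_segments if s < f_end and e > f_start]
        texts.append(" ".join(hits) if hits else fill_token)  # set character if no text
    return texts
```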
The feature processing network of the neural network model also includes a text feature extraction network. The sample text corresponding to each frame of sample image can be input into a text feature extraction network, and the text feature extraction network extracts the text feature of the sample text corresponding to each frame of sample image, so as to obtain the text feature of the sample text corresponding to each frame of sample image. And taking the text characteristics of the sample text corresponding to each frame of sample image as the supplementary characteristics of each frame of sample image. Alternatively, when the sample text corresponding to any one of the sample images is a set character, the text feature extraction network may extract a set feature as the text feature of the sample text corresponding to the sample image, for example, a feature composed of the number 0.
The embodiment of the application does not limit the network structure of the text feature extraction network, which is, by way of example, any one of Bidirectional Encoder Representations from Transformers (BERT), a vector space model (Vector Space Model, VSM) and the like.
Optionally, the feature processing network includes a text feature extraction network and an audio feature extraction network. In this case, the feature processing network may use the text feature of the sample text corresponding to each frame of sample image and the audio feature of the sample audio corresponding to each frame of sample image as the supplementary feature of each frame of sample image.
And step A3, fusing the image characteristics of each frame of sample image and the complementary characteristics of each frame of sample image through a neural network model to obtain the sample characteristics of each frame of sample image.
In the embodiment of the application, the feature processing network comprises a multi-modal feature fusion network, wherein the multi-modal feature fusion network is connected in series behind the image feature extraction network, and the multi-modal feature fusion network is connected in series behind the audio feature extraction network and/or the text feature extraction network. The multi-mode feature fusion network can fuse the image features of each frame of sample image and the complementary features of each frame of sample image to obtain a first fusion result of each frame of sample image. And the first fusion result of any frame of sample image is the sample characteristic of the frame of sample image, or the first fusion result of each frame of sample image is fused again to obtain the sample characteristic of each frame of sample image.
The embodiment of the application does not limit the network structure of the Multi-modal feature fusion network, and illustratively, the Multi-modal feature fusion network comprises a Multi-Head Attention (MHA) network, a Multi-layer perceptron and the like.
In implementation B1, the supplemental features of the sample image include audio features of sample audio corresponding to the sample image and text features of sample text corresponding to the sample image. In this case, step A3 includes steps a31 to a34.
And step A31, fusing the image characteristics of each frame of sample image and the audio characteristics of the sample audio corresponding to each frame of sample image through a neural network model to obtain first fusion characteristics of each frame of sample image.
Optionally, the multi-modal feature fusion network in the neural network model may splice the image features of any frame of sample image with the audio features of the sample audio corresponding to the frame of sample image, to obtain the first spliced feature of the frame of sample image. By splicing the image features of the sample image and the audio features of the sample audio corresponding to the frame of sample image, the time stamp of the image features and the time stamp of the audio features are aligned, and the alignment mode belongs to explicit alignment. And performing attention fusion processing on the first spliced characteristic of the frame sample image to obtain the attention characteristic of the frame sample image. And splicing the attention characteristic of the frame sample image and the position characteristic of the frame sample image to obtain a first fusion characteristic of the frame sample image.
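For illustration only, the following PyTorch sketch mirrors this step: the image and audio features are concatenated (explicit alignment), passed through self-attention over the frame sequence, and concatenated with a position feature; the dimensions and the use of nn.MultiheadAttention are assumptions.

```python
import torch
import torch.nn as nn

class ExplicitAVFusion(nn.Module):
    def __init__(self, img_dim=512, aud_dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(img_dim + aud_dim, heads, batch_first=True)

    def forward(self, img_feat, aud_feat):
        # img_feat: (B, T, img_dim), aud_feat: (B, T, aud_dim), T = number of sample images
        x = torch.cat([img_feat, aud_feat], dim=-1)      # first spliced feature
        attn_out, _ = self.attn(x, x, x)                 # attention fusion processing
        pos = torch.arange(x.size(1), device=x.device).float()
        pos = pos.view(1, -1, 1).expand(x.size(0), -1, 1)
        return torch.cat([attn_out, pos], dim=-1)        # first fusion feature
```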
The embodiment of the present application does not limit the manner of the attention fusion process, and the attention fusion process is, for example, a self-attention fusion process, a spatial attention fusion process, a channel attention fusion process, or the like. Wherein, when the attention fusion process is a channel attention fusion process, the attention features of the sample image include the attention features of at least one channel, and the dimensions of the respective channels are different.
Step A32, for any frame of sample image, determining a first text feature related to the first fusion feature of any frame of sample image from the text features of the sample text corresponding to each frame of sample image.
The multimodal feature fusion network in the neural network model includes a multi-headed attention network. The multi-mode feature fusion network can splice the text features of the sample text corresponding to any frame of sample image and the position features of the frame of sample image to obtain the second spliced features of the frame of sample image, and input the first fused features of each frame of sample image and the second spliced features of each frame of sample image into the multi-head attention network. The first fusion characteristic of each frame of sample image is used as a query vector, the second splicing characteristic of each frame of sample image is used as a value vector, and the second splicing characteristic of each frame of sample image is used as a key vector. The multi-head attention network determines a first text feature related to a first fusion feature of any frame of sample images from text features of sample text corresponding to each frame of sample images by using the key vector, the value vector and the query vector.
And step A33, fusing the first fusion characteristic of any frame of sample image with the first text characteristic to obtain a second fusion characteristic of any frame of sample image.
After the multi-head attention network extracts the first text feature of any frame of sample image, the first fusion feature of any frame of sample image and the first text feature of the frame of sample image can be fused, so as to obtain and output the multi-head attention feature of the frame of sample image. The multi-headed attention feature of the frame sample image may be taken as a second fused feature of the frame sample image. The multi-head attention feature of the frame sample image and the first fusion feature of the frame sample image can be spliced to obtain a third splicing feature of the frame sample image, and the third splicing feature of the frame sample image is used as the second fusion feature of the frame sample image.
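For illustration only, the sketch below captures steps A32 and A33 together: the first fusion feature acts as the query, the text features act as key and value, and the attention output is spliced back onto the query to form the second fusion feature; the projection layer and the dimensions are assumptions.

```python
import torch
import torch.nn as nn

class TextCrossAttention(nn.Module):
    def __init__(self, fuse_dim=641, txt_dim=768, heads=1):
        super().__init__()
        self.q_proj = nn.Linear(fuse_dim, txt_dim)       # assumed projection to the text width
        self.attn = nn.MultiheadAttention(txt_dim, heads, batch_first=True)

    def forward(self, first_fusion, text_feat):
        # first_fusion: (B, T, fuse_dim), text_feat: (B, T, txt_dim)
        q = self.q_proj(first_fusion)
        attn_out, _ = self.attn(q, text_feat, text_feat)     # first text feature, fused
        return torch.cat([first_fusion, attn_out], dim=-1)   # second fusion (third spliced) feature
```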
In general, any frame of sample image may have sample text or may have no sample text. For example, some sample videos have subtitles, while some sample videos do not. Considering that there is a case where the sample image does not contain the sample text, the sample image and the sample text corresponding to the sample image may not be aligned. According to the embodiment of the application, the first text feature of any frame of sample image is determined from the text features of the sample text corresponding to each frame of sample image, and the first fusion feature of the sample image and the first text feature are fused, so that the time stamp of the first fusion feature and the time stamp of the first text feature are aligned, the time stamp of the image feature, the time stamp of the audio feature and the time stamp of the text feature are implicitly aligned, and the accuracy of the second fusion feature of the sample image is improved.
Step a34, determining sample characteristics of each frame of sample image based on the second fusion characteristics of each frame of sample image.
The feature processing network further includes a convolutional network in series with the multi-modal feature fusion network. The convolution network can carry out convolution processing on the second fusion characteristic of each frame of sample image to obtain and output the sample characteristic of each frame of sample image. The embodiment of the application does not limit the network structure of the convolution network, and the convolution network is illustratively a Multi-stage time sequence convolution network (Multi-Stage Temporal Convolutional Networks, MS-TCN) or MS-TCN++, and the like.
Taking a convolution network as an MS-TCN++ as an example, the MS-TCN++ is divided into convolution processing of at least one Stage (Stage), and for convenience of description, the at least one Stage is denoted as N stages, and the N stages are respectively a first Stage to an N-th Stage, and N is a positive integer.
The first stage corresponds to l_0+1 dilated convolution networks whose dilation rates gradually increase and l_0+1 dilated convolution networks whose dilation rates gradually decrease. A dilated convolution network is also called a hole (atrous) convolution network, and a dilated convolution network with dilation rate r can be denoted as Dilated Conv r (DC r). Thus, the l_0+1 dilated convolution networks with gradually increasing dilation rates can be written as DC r=2^0=1, DC r=2^1=2, ..., DC r=2^(l_0), and the l_0+1 dilated convolution networks with gradually decreasing dilation rates can be written as DC r=2^(l_0), ..., DC r=2^1=2, DC r=2^0=1.
In the embodiment of the application, on the one hand, the second fusion feature of each frame of sample image is input into the l_0+1 dilated convolution networks with gradually increasing dilation rates: the first dilated convolution network performs dilated convolution processing on the second fusion feature of each frame of sample image, the second dilated convolution network performs dilated convolution processing on the feature output by the first dilated convolution network, and so on, until the feature output by the last dilated convolution network is obtained; this feature is denoted as the first dilated convolution feature. That is, the l_0+1 dilated convolution networks with gradually increasing dilation rates perform l_0+1 rounds of dilated convolution processing on the second fusion feature of each frame of sample image to obtain the first dilated convolution feature. On the other hand, the second fusion feature of each frame of sample image is input into the l_0+1 dilated convolution networks with gradually decreasing dilation rates: the first dilated convolution network performs dilated convolution processing on the second fusion feature of each frame of sample image, the second dilated convolution network performs dilated convolution processing on the feature output by the first dilated convolution network, and so on, until the feature output by the last dilated convolution network is obtained; this feature is denoted as the second dilated convolution feature. That is, the l_0+1 dilated convolution networks with gradually decreasing dilation rates perform l_0+1 rounds of dilated convolution processing on the second fusion feature of each frame of sample image to obtain the second dilated convolution feature. The first dilated convolution feature and the second dilated convolution feature are concatenated to obtain the feature corresponding to the first stage.
When MS-TCN++ performs convolution processing of only one stage, the feature corresponding to the first stage may be used as the sample feature of each frame of sample image. When MS-TCN++ performs convolution processing of at least two stages, convolution processing of a second stage needs to be performed on the feature corresponding to the first stage, so as to obtain the sample feature of each frame of sample image based on the feature corresponding to the second stage.
The second stage corresponds to l_1+1 dilated convolution networks whose dilation rates gradually increase, which can be written as DC r=2^0=1, DC r=2^1=2, ..., DC r=2^(l_1). The feature corresponding to the first stage is input into these l_1+1 dilated convolution networks: the first dilated convolution network performs dilated convolution processing on the feature corresponding to the first stage, the second dilated convolution network performs dilated convolution processing on the feature output by the first dilated convolution network, and so on, until the feature output by the last dilated convolution network is obtained; this feature is denoted as the feature corresponding to the second stage. That is, the l_1+1 dilated convolution networks with gradually increasing dilation rates perform l_1+1 rounds of dilated convolution processing on the feature corresponding to the first stage to obtain the feature corresponding to the second stage.
When MS-TCN++ performs convolution processing of two stages, the feature corresponding to the second stage may be used as the sample feature of each frame of sample image. When MS-TCN++ performs convolution processing of at least three stages, convolution processing of a third stage needs to be performed on the feature corresponding to the second stage, so as to obtain the sample feature of each frame of sample image based on the feature corresponding to the third stage. The network structure, feature processing manner and the like of the third stage are similar to those of the second stage and are not described herein.
Optionally, the features corresponding to the stages may be fused to obtain the sample feature of each frame of sample image. Performing hole convolution processing on the second fusion feature of each frame of sample image through MS-TCN++ enlarges the receptive field, captures the time-sequence relation between the frames of sample images, and improves the characterization capability of the sample features of the sample images.
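For illustration only, the following PyTorch-style sketch shows the kind of dual-branch hole convolution stage described above (the module name, channel sizes and the choice of l₀ = 3 are assumptions of this description rather than content of the application): one branch stacks hole convolutions with increasing dilation rates, the other with decreasing rates, and the two outputs are spliced.

```python
import torch
import torch.nn as nn

class DualDilatedStage(nn.Module):
    """Minimal sketch of the first-stage processing described above:
    one branch with dilation rates 1, 2, ..., 2**l0 (increasing) and
    one branch with rates 2**l0, ..., 2, 1 (decreasing); the two
    branch outputs are spliced along the channel dimension."""

    def __init__(self, channels: int, l0: int = 3):
        super().__init__()
        rates_up = [2 ** i for i in range(l0 + 1)]    # 1, 2, 4, ...
        rates_down = list(reversed(rates_up))         # ..., 4, 2, 1
        self.branch_up = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=3, padding=r, dilation=r)
            for r in rates_up])
        self.branch_down = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=3, padding=r, dilation=r)
            for r in rates_down])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, num_frames) -- second fusion features per frame
        up, down = x, x
        for conv in self.branch_up:
            up = torch.relu(conv(up))
        for conv in self.branch_down:
            down = torch.relu(conv(down))
        return torch.cat([up, down], dim=1)           # feature of the first stage

features = DualDilatedStage(channels=64, l0=3)(torch.randn(2, 64, 100))
print(features.shape)  # torch.Size([2, 128, 100])
```

In practice MS-TCN++ additionally uses 1×1 convolutions and residual connections around such a stage; the sketch above only illustrates the increasing/decreasing dilation structure described in this paragraph.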
In implementation B2, the supplemental features of the sample image include audio features of sample audio corresponding to the sample image and text features of sample text corresponding to the sample image. In this case, step A3 includes steps a35 to a38.
Step a35, for any frame of sample image, determining a second text feature related to the image feature of any frame of sample image from the text features of the sample text corresponding to each frame of sample image.
The multi-mode feature fusion network can splice the image features of any frame of sample image and the position features of the frame of sample image to obtain a fourth spliced feature of the frame of sample image; and splicing the text characteristics of the sample text corresponding to any frame of sample image and the position characteristics of the frame of sample image to obtain a second splicing characteristic of the frame of sample image. The fourth stitching characteristic of each frame of sample image and the second stitching characteristic of each frame of sample image are input into a multi-head attention network. The fourth stitching feature of each frame of sample image is used as a query vector, the second stitching feature of each frame of sample image is used as a value vector, and the second stitching feature of each frame of sample image is used as a key vector. The multi-head attention network determines a second text feature of any frame of sample images from the text features of the sample text corresponding to each frame of sample images by using the key vector, the value vector and the query vector. The implementation of step a35 is similar to that of step a32, and will not be described herein.
And step A36, fusing the image features of any frame of sample image and the second text features through a neural network model to obtain a third fusion feature of any frame of sample image.
The multimodal feature fusion network includes a multi-headed attention network. After the multi-mode feature fusion network splices the image features of any frame of sample image and the position features of the frame of sample image to obtain the fourth spliced feature of the frame of sample image, the multi-head attention network can fuse the fourth spliced feature of any frame of sample image and the second text feature of the frame of sample image to obtain and output the multi-head attention feature of the frame of sample image. The multi-headed attention feature of the frame sample image may be taken as a third fused feature of the frame sample image. The multi-head attention feature of the frame sample image and the fourth splicing feature of the frame sample image can be spliced, and the spliced feature is used as a third fusion feature of the frame sample image. The implementation of step a36 is similar to that of step a33, and will not be described herein.
And step A37, fusing the third fusion characteristic of any frame of sample image and the audio characteristic of the sample audio corresponding to any frame of sample image to obtain a fourth fusion characteristic of any frame of sample image.
The multi-modal feature fusion network in the neural network model can splice the third fusion feature of any frame of sample image and the audio feature of the sample audio corresponding to the frame of sample image to obtain a fifth splicing feature of the frame of sample image. And performing attention fusion processing on the fifth spliced characteristic of the frame sample image to obtain the attention characteristic of the frame sample image. And taking the attention characteristic of the frame sample image as a fourth fusion characteristic of the sample image, or splicing the attention characteristic of the frame sample image and the position characteristic of the frame sample image to obtain the fourth fusion characteristic of the frame sample image. The implementation of step a37 is similar to that of step a31, and will not be described herein.
Step a38, determining sample characteristics of each frame of sample image based on the fourth fusion characteristics of each frame of sample image. The implementation of step a38 is similar to that of step a34, and will not be described herein.
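As a rough, non-authoritative sketch of the attention-based fusion in steps A35 and A36 (the class name, feature dimensions and the use of nn.MultiheadAttention are assumptions), the image feature spliced with the position feature serves as the query, the text feature spliced with the position feature serves as the key and value, and the attention output spliced with the query gives the third fusion feature:

```python
import torch
import torch.nn as nn

class ImageTextFusion(nn.Module):
    """Minimal sketch of steps A35-A36: the image feature spliced with the
    position feature is the query, the text feature spliced with the
    position feature is the key and value, and the multi-head attention
    output spliced with the query is the third fusion feature per frame."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=2 * dim,
                                          num_heads=num_heads,
                                          batch_first=True)

    def forward(self, image_feat, text_feat, pos_feat):
        # each input: (batch, num_frames, dim)
        query = torch.cat([image_feat, pos_feat], dim=-1)      # fourth spliced feature
        key_value = torch.cat([text_feat, pos_feat], dim=-1)   # second spliced feature
        attended, _ = self.attn(query, key_value, key_value)   # multi-head attention feature
        return torch.cat([attended, query], dim=-1)            # third fusion feature

fusion = ImageTextFusion(dim=32)
third = fusion(torch.randn(1, 100, 32), torch.randn(1, 100, 32), torch.randn(1, 100, 32))
print(third.shape)  # torch.Size([1, 100, 128])
```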
In implementation B3, the supplemental features of the sample image include audio features of sample audio corresponding to the sample image. In this case, step A3 includes: the image features of each frame of sample image and the audio features of the sample audio corresponding to each frame of sample image are fused through the neural network model, so that first fusion features of each frame of sample image are obtained, and the sample features of each frame of sample image are determined based on the first fusion features of each frame of sample image. The determining manner of the first fusion feature of the sample image in step a31 is described, and the determining the content of the sample feature of each frame of sample image based on the first fusion feature of each frame of sample image is similar to the implementation principle of step a34, which is not described herein.
In implementation B4, the supplemental features of the sample image include text features of the sample text to which the sample image corresponds. In this case, step A3 includes: for any frame of sample image, determining a second text feature related to the image feature of any frame of sample image from the text features of the sample text corresponding to each frame of sample image; fusing the image features of any frame of sample image with the second text features to obtain a third fusion feature of any frame of sample image; sample features of each frame of sample images are determined based on the third fused features of each frame of sample images. The determining manner of the third fusion feature of the sample image in step a35 is described, and the determining of the content of the sample feature of each frame of sample image based on the third fusion feature of each frame of sample image is similar to the implementation principle of step a34, which is not described herein.
Step 2022, for any frame of sample images, determining, by the neural network model, a feature difference between any frame of sample images and adjacent frame of images based on the sample features of any frame of sample images and the sample features of the adjacent frame of images.
The sample image adjacent to any one frame of sample image can be determined from the multi-frame sample images, and the determined sample image is used as the adjacent frame image of any one frame of sample image. Wherein, any frame of sample image corresponds to one frame or two frames of adjacent frame images, and the adjacent frame image of any frame of sample image can be positioned before any frame of sample image or can be positioned after any frame of sample image. For example, for a plurality of frame sample images, adjacent frame images of a first frame sample image are second frame sample images, and adjacent frame images of the second frame sample image are first frame sample images and third frame sample images.
Through a prediction network in the neural network model, information such as a difference value, a variance and the like between sample characteristics of any one frame of sample image and sample characteristics of any adjacent frame of sample image can be calculated, and the information is used as characteristic differences between the sample image and the adjacent frame of image.
Step 2023, determining, by the neural network model, the first prediction information and the second prediction information of any frame sample image based on the feature differences between any frame sample image and the adjacent frame images.
In the embodiment of the application, the prediction network in the neural network model can determine the probability that the distance between the sample image and the sample dividing point is not greater than the distance threshold value based on the characteristic difference between any frame of sample image and each adjacent frame of image, and then the first prediction information of the sample image is obtained. Wherein the larger the feature difference between the sample image and the adjacent frame image, the larger the probability that the sample image is a sample image having a distance from the sample dividing point not greater than a distance threshold.
The prediction network in the neural network model can determine the prediction offset between the time stamp of the sample image and the time stamp of the sample dividing point based on the characteristic difference between any frame of sample image and each adjacent frame of image, so as to obtain the second prediction information of the sample image. The larger the characteristic difference between the sample image and the adjacent frame image is, the smaller the prediction offset corresponding to the sample image is.
Alternatively, when any one of the sample images corresponds to two adjacent frame images, the prediction network may determine the first prediction information and the second prediction information of the sample image based on the largest feature difference among the feature differences corresponding to the two adjacent frame images. Alternatively, the prediction network may perform weighted summation on feature differences corresponding to the two adjacent frame images, and determine the first prediction information and the second prediction information of the sample image based on the weighted summation result.
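A minimal sketch of steps 2022 and 2023 might look as follows (the layer sizes and the choice of keeping the larger of the two neighbour differences are assumptions); the feature difference between a frame and its adjacent frames is mapped to the boundary probability (first prediction information) and the timestamp offset (second prediction information):

```python
import torch
import torch.nn as nn

class BoundaryPredictionHead(nn.Module):
    """Rough sketch of steps 2022-2023: the feature difference between a
    frame and its adjacent frames is mapped to a probability of being near
    a cut point and to a predicted timestamp offset."""

    def __init__(self, dim: int):
        super().__init__()
        self.prob_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                       nn.Linear(dim, 1), nn.Sigmoid())
        self.offset_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                         nn.Linear(dim, 1))

    def forward(self, sample_feat: torch.Tensor):
        # sample_feat: (num_frames, dim) sample features of one sample video
        prev = torch.cat([sample_feat[:1], sample_feat[:-1]], dim=0)
        nxt = torch.cat([sample_feat[1:], sample_feat[-1:]], dim=0)
        # absolute difference with each adjacent frame; keep the larger one
        diff = torch.maximum((sample_feat - prev).abs(),
                             (sample_feat - nxt).abs())
        first_pred = self.prob_head(diff).squeeze(-1)    # first prediction information
        second_pred = self.offset_head(diff).squeeze(-1) # second prediction information
        return first_pred, second_pred

head = BoundaryPredictionHead(dim=128)
probs, offsets = head(torch.randn(100, 128))
print(probs.shape, offsets.shape)  # torch.Size([100]) torch.Size([100])
```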
And 203, training the neural network model based on the first labeling information, the first prediction information and the second prediction information of each frame of sample image to obtain a video segmentation model.
The loss of the neural network model may be calculated based on the first labeling information, the first prediction information, and the second prediction information of each frame of the sample image. Model parameters of the neural network model are adjusted based on the loss of the neural network model, so that the neural network model is trained once, and the trained neural network model is obtained.
Alternatively, when the model parameters of the neural network model are adjusted based on the loss of the neural network model, the parameters of one part of the network may be adjusted, leaving the parameters of another part of the network unchanged. For example, parameters of three networks, namely an image feature extraction network, an audio feature extraction network and a text feature extraction network, are kept unchanged, and parameters of several networks, namely a multi-modal feature fusion network, a convolution network and a prediction network, are adjusted.
If the trained neural network model meets the training ending condition, taking the trained neural network model as a video segmentation model; if the trained neural network model does not meet the training ending condition, taking the trained neural network model as a neural network model to be trained next time, and training the neural network model next time according to the modes from step 201 to step 203 until a video segmentation model is obtained. The video segmentation model is used for segmenting the target video.
The embodiment of the application does not limit the training ending condition. Illustratively, the training ending condition is that the number of training iterations reaches a set number, for example 500; or, the training ending condition is that the difference between the loss of the neural network model obtained in the current training iteration and the loss obtained in the previous training iteration is within a set range.
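The training procedure described above could be sketched as follows (all parameter names, the frozen-prefix convention and the threshold values are assumptions of this description); the feature extraction networks are kept unchanged while the remaining parameters are updated, and training stops when the iteration count or the loss-change criterion is met:

```python
import itertools
import torch

def train_segmentation_model(model, optimizer, data_loader, compute_loss,
                             max_steps=500, loss_delta=1e-4):
    """Sketch of the loop around step 203: freeze the image/audio/text
    feature extractors, update the remaining parameters, and stop when the
    step count reaches max_steps or the loss change falls within loss_delta."""
    for name, param in model.named_parameters():
        if name.startswith(("image_extractor", "audio_extractor", "text_extractor")):
            param.requires_grad = False   # keep feature extraction networks unchanged

    prev_loss, step = None, 0
    for batch in itertools.cycle(data_loader):
        loss = compute_loss(model, batch)   # combined first/second/third losses
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        step += 1
        if step >= max_steps:
            break
        if prev_loss is not None and abs(prev_loss - loss.item()) < loss_delta:
            break
        prev_loss = loss.item()
    return model
```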
In one possible implementation, step 203 includes steps 2031 to 2033.
Step 2031, determining a first loss for each frame of sample image based on the first annotation information and the first prediction information for each frame of sample image.
Based on the first labeling information of any frame of sample image and the first prediction information of the frame of sample image, the first loss of the frame of sample image may be determined according to the calculation formula of a first loss function. The first loss function may be a binary cross-entropy loss function, as shown in formula (1) below. Alternatively, the first loss function is a mean square error (Mean Square Error, MSE) loss function, a mean absolute error (Mean Absolute Error, MAE) loss function, or the like. The MSE loss function may also be referred to as an L2 loss function, and the MAE loss function may also be referred to as an L1 loss function.

L₁ = Y·log(y) + (1 − Y)·log(1 − y)    formula (1)

Wherein L₁ represents the first loss of any frame of sample image, Y represents the first labeling information of the frame of sample image, y represents the first prediction information of the frame of sample image, and log represents the logarithmic function.
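For illustration, the per-frame first loss can be computed with a standard binary cross-entropy call (a sketch; the variable names are assumptions, and library implementations apply a negative sign to formula (1) so that the loss can be minimized):

```python
import torch
import torch.nn.functional as F

# Y: first labeling information (1 if the frame is within the distance
#    threshold of a sample cut point, 0 otherwise)
# y: first prediction information (predicted probability)
# F.binary_cross_entropy returns -[Y*log(y) + (1-Y)*log(1-y)], i.e.
# formula (1) with the sign flipped for minimization.
Y = torch.tensor([1.0, 0.0, 0.0, 1.0])
y = torch.tensor([0.9, 0.2, 0.1, 0.6])
first_loss = F.binary_cross_entropy(y, Y, reduction="none")
print(first_loss)  # per-frame first losses
```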
Step 2032, determining a second penalty for each frame of sample image based on the second prediction information for each frame of sample image, the time stamp for each frame of sample image, and the time stamp for the sample cut point.
The second loss of the frame sample image may be determined based on the second prediction information of any frame sample image, the time stamp of the frame sample image, and the time stamp of the sample slicing point according to a calculation formula of the second loss function. The embodiment of the application does not limit the calculation formula of the second loss function, and the second loss function is an L1 loss function or an L2 loss function, for example.
In one possible implementation, step 2032 includes: for any frame of sample image, determining the sum of the second prediction information of any frame of sample image and the time stamp of any frame of sample image as a reference time stamp; a second penalty for any frame of sample images is determined based on the difference between the first threshold, the reference timestamp, and the timestamp of the sample segmentation point.
The time stamp of any frame of sample image is the start time and/or end time of the sample image in the sample video, and the second prediction information of the sample image is the prediction offset between the time stamp of the sample image and the time stamp of the sample slicing point. And taking the sum of the predicted offset and the starting time or the sum of the predicted offset and the ending time as a reference time stamp. Next, the difference between the reference time stamp and the time stamp of the sample split point is calculated.
In one possible implementation, the second loss function is a smooth L1 loss function, as shown in formula (2) below. The second loss of the frame of sample image is determined based on the difference value and the first threshold according to formula (2).

L₂ = 0.5x², if |x| < α;  L₂ = |x| − 0.5, if |x| ≥ α    formula (2)

Wherein L₂ represents the second loss of any frame of sample image, x represents the difference value between the reference timestamp corresponding to the sample image and the timestamp of the sample dividing point, and α represents the first threshold. The value of the first threshold is not limited in the embodiment of the present application; illustratively, the first threshold is 1.
In one possible implementation, the second loss function is an L1 loss function or an L2 loss function, and the second loss of the frame of sample image is determined based on the difference value according to the L1 loss function or the L2 loss function. Wherein the L2 loss function corresponds to the 0.5x² term in the above formula (2), and the L1 loss function corresponds to the |x| − 0.5 term in the above formula (2).
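A minimal sketch of step 2032 and formula (2) (the function and variable names are assumptions): the reference timestamp is the frame timestamp plus the predicted offset, and the smooth L1 value of its difference to the cut-point timestamp is the per-frame second loss.

```python
import torch

def second_loss(pred_offset, frame_timestamp, cut_timestamp, alpha=1.0):
    """Sketch of step 2032 / formula (2): reference timestamp = frame
    timestamp + predicted offset; the smooth L1 loss of
    (reference - cut point) is the second loss of the frame."""
    x = frame_timestamp + pred_offset - cut_timestamp
    return torch.where(x.abs() < alpha, 0.5 * x ** 2, x.abs() - 0.5)

loss = second_loss(pred_offset=torch.tensor([0.4, -1.5]),
                   frame_timestamp=torch.tensor([10.0, 12.0]),
                   cut_timestamp=torch.tensor([10.5, 10.5]))
print(loss)  # tensor([0.0050, 0.0000])
```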
Step 2033, training the neural network model based on the first loss of each frame of sample image and the second loss of each frame of sample image to obtain a video segmentation model.
The first loss of the sample image of any frame and the second loss of the sample image may be subjected to operations such as weighted summation and weighted averaging, and the obtained operation result may be used as the first weighted loss of the sample image. And carrying out operations such as weighted summation, weighted averaging and the like on the first weight loss of each sample image, and taking the obtained operation result as the loss of the neural network model. Training the neural network model through the loss of the neural network model to obtain a video segmentation model.
In one possible implementation, step 203 is preceded by steps 204 to 205.
Step 204, determining third prediction information of each frame of sample image through the neural network model, wherein the third prediction information of the sample image characterizes the prediction probability of the sample image belonging to each image category.
The prediction network of the neural network model may determine the third prediction information of any frame of sample image based on the sample features of the sample image. The third prediction information of the sample image comprises a plurality of prediction data; the number of the prediction data is the same as the number of image categories, and each prediction data corresponds to one image category in order. Each prediction data in the third prediction information of the sample image is greater than or equal to 0 and less than or equal to 1 and characterizes the probability that the sample image belongs to the image category corresponding to that prediction data; the greater the value of the prediction data, the more likely the sample image belongs to the corresponding image category. Thus, the third prediction information of the sample image can characterize the prediction probabilities that the sample image belongs to the respective image categories.
Step 205, obtaining second labeling information of each frame of sample image, where the second labeling information of the sample image characterizes whether the sample image obtained by labeling belongs to each image category.
And determining the second annotation information of each frame of sample image based on the image category of each frame of sample image by determining the image category of each frame of sample image. Embodiments of the present application are not limited to the manner in which the image class of the sample image is determined. The image category to which any frame of sample image belongs is manually marked, or any frame of sample image is input into an image category marking model, and the image category to which the sample image belongs is output through the image category marking model.
Optionally, the second labeling information of the sample image includes a plurality of labeling data, the number of the labeling data is the same as the number of the image categories, and each labeling data sequentially corresponds to each image category. Any frame of sample image belongs to at least one image category. If it is determined that the sample image belongs to a certain image class based on each image class to which any frame of sample image belongs, marking data corresponding to the image class in second marking information of the sample image is first data (for example, the first data is 1); if it is determined that the sample image does not belong to a certain image category based on each image category to which any one of the sample images belongs, the annotation data corresponding to the image category in the second annotation information of the sample image is second data (for example, the second data is 0). In this way, second annotation information for the sample image may be obtained.
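In other words, the second labeling information is a multi-hot vector over the image categories. A small sketch (the category names are hypothetical) is:

```python
# Sketch of the second labeling information described in step 205.
# Each frame gets one annotation value per image category: 1 (first data)
# if the frame belongs to the category, 0 (second data) otherwise.
categories = ["interface recording", "oral-broadcast recommendation", "single image"]

def second_annotation(frame_categories):
    return [1 if c in frame_categories else 0 for c in categories]

print(second_annotation({"interface recording"}))                        # [1, 0, 0]
print(second_annotation({"oral-broadcast recommendation", "single image"}))  # [0, 1, 1]
```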
On the basis of steps 204 and 205, step 203 includes step 2034. The implementation of step 2034 overlaps with that of steps 2031 to 2033; it is described in detail below and the overlapping content is not repeated here.
Step 2034, training the neural network model based on the first labeling information, the first prediction information, the second prediction information, the third prediction information and the second labeling information of each frame of sample image to obtain a video segmentation model.
The loss of the neural network model may be determined based on the first labeling information, the first prediction information, the second prediction information, the third prediction information, and the second labeling information for each frame of the sample image. Training the neural network model based on the loss of the neural network model to obtain a video segmentation model.
In one possible implementation, step 2034 includes steps C1 to C4.
And C1, for any frame of sample image, determining positive sample loss of any frame of sample image based on the prediction probability that any frame of sample image belongs to a first category, wherein the first category is the image category to which any frame of sample image obtained through labeling belongs.
If the labeling data corresponding to one image category in the second labeling information of the sample image is the first data, determining that the sample image belongs to the image category, and marking the image category as the first category. In this way, the respective first categories to which the sample images belong can be determined. And extracting the prediction data corresponding to each first category from the third prediction information of the sample image, wherein the prediction data corresponding to any first category is the prediction probability of the sample image belonging to the first category.
For any one of the first categories to which the sample image belongs, the positive sample loss corresponding to that first category may be determined, according to a positive sample loss function, based on the prediction data corresponding to the first category. The embodiment of the present application does not limit the positive sample loss function; optionally, the positive sample loss function is formula (3) shown below.

L₊ = (1 − p)^(γ₊) · log(p)    formula (3)

Wherein L₊ represents the positive sample loss corresponding to any first category to which any frame of sample image belongs, p represents the prediction data corresponding to the first category, and log represents the logarithmic function. γ₊ is a hyperparameter; the embodiment of the application does not limit the value of γ₊, and illustratively γ₊ = 0.
Since any one of the sample images belongs to at least one first class, the positive sample loss corresponding to each first class can be subjected to operations such as weighted summation and weighted averaging, and the obtained operation result can be used as the positive sample loss of the sample image.
And C2, determining negative sample loss of any frame of sample image based on a second threshold and the prediction probability that any frame of sample image belongs to a second category, wherein the second category is an image category which any frame of sample image obtained through labeling does not belong to.
If the annotation data corresponding to one image category in the second annotation information of the sample image is the second data, determining that the sample image does not belong to the image category, and marking the image category as the second category. In this way, the respective second categories to which the sample images belong can be determined. And extracting the prediction data corresponding to each second category from the third prediction information of the sample image, wherein the prediction data corresponding to any second category is the prediction probability of the sample image belonging to the second category.
For any one of the second categories to which the sample image belongs, a difference between the prediction data corresponding to the second category and a second threshold may be calculated, and a reference probability that the sample image belongs to the second category is determined based on the difference and the reference difference. Optionally, the reference probability that the sample image belongs to the second class is the maximum value of the difference value and the reference difference value.
The reference probability that the sample image belongs to the second class may be determined according to equation (4) as shown below. In formula (4), the reference difference is 0, and in practical application, the reference difference may be other data than 0, for example, the reference difference is 0.1.
pₘ = max(p − m, 0)    formula (4)

Wherein pₘ represents the reference probability that any frame of sample image belongs to any second category, p represents the prediction data corresponding to the second category to which the frame of sample image belongs, m represents the second threshold, and max represents the maximum-value function.
The negative sample loss corresponding to any one of the second categories to which the sample image belongs may be determined, according to a negative sample loss function, based on the reference probability that the frame of sample image belongs to that second category. The embodiment of the present application does not limit the negative sample loss function; optionally, the negative sample loss function is formula (5) shown below.

L₋ = (pₘ)^(γ₋) · log(1 − pₘ)    formula (5)

Wherein L₋ represents the negative sample loss corresponding to any second category to which any frame of sample image belongs, pₘ represents the reference probability that the sample image belongs to the second category, and log represents the logarithmic function. γ₋ is a hyperparameter; the embodiment of the application does not limit the value of γ₋, and illustratively γ₋ > γ₊.
Since any one of the sample images belongs to at least one of the second categories, the negative sample loss corresponding to each of the second categories may be subjected to operations such as weighted summation and weighted averaging, and the obtained operation result may be used as the negative sample loss of the sample image.
And step C3, determining a third loss of any frame of sample image based on the positive sample loss of any frame of sample image and the negative sample loss of any frame of sample image.
The positive sample loss of any one of the frame sample images and the negative sample loss of that frame sample image may be subjected to operations such as weighted summation and weighted averaging, and the obtained operation result may be used as the third loss of that sample image.
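Steps C1 to C3 and formulas (3) to (5) can be sketched together as follows (the hyperparameter values, the sign convention for minimization and the per-frame averaging are assumptions of this description):

```python
import torch

def third_loss(pred, label, gamma_pos=0.0, gamma_neg=2.0, m=0.1):
    """Sketch of steps C1-C3 / formulas (3)-(5).
    pred, label: (num_frames, num_categories); label is the multi-hot second
    labeling information, pred the third prediction information.  The sign
    is flipped so the result can be minimized like an ordinary loss."""
    eps = 1e-8
    # positive sample loss, formula (3): (1 - p)^gamma+ * log(p)
    pos = (1 - pred).pow(gamma_pos) * torch.log(pred + eps)
    # shifted probability, formula (4): p_m = max(p - m, 0)
    p_m = torch.clamp(pred - m, min=0.0)
    # negative sample loss, formula (5): p_m^gamma- * log(1 - p_m)
    neg = p_m.pow(gamma_neg) * torch.log(1 - p_m + eps)
    # step C3: positive losses on labelled categories plus negative losses
    # on the remaining categories, averaged per frame
    return -(label * pos + (1 - label) * neg).mean(dim=-1)

pred = torch.tensor([[0.9, 0.2, 0.05]])
label = torch.tensor([[1.0, 0.0, 0.0]])
print(third_loss(pred, label))  # per-frame third loss
```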
And step C4, training the neural network model based on the third loss of each frame of sample image, the first labeling information, the first prediction information and the second prediction information of each frame of sample image, and obtaining a video segmentation model.
In the embodiment of the present application, the first loss of the sample image and the second loss of the sample image may be determined based on the first labeling information, the first prediction information, and the second prediction information of any frame of sample image, and the determination process may be described in the above related steps 2031 to 2032, which are not repeated herein.
Then, the first loss of the sample image of any frame, the second loss of the sample image, and the third loss of the sample image are subjected to operations such as weighted summation and weighted averaging, and the obtained operation result is used as the second weighted loss of the sample image. And carrying out operations such as weighted summation, weighted averaging and the like on the second weight loss of each sample image, and taking the obtained operation result as the loss of the neural network model. Training the neural network model through the loss of the neural network model to obtain a video segmentation model.
In steps C1 to C3, a third loss of each frame of sample image is determined. When applied, the loss of sample sequences or the loss of sample video may be determined based on the principles of steps C1 to C3.
For the loss of the sample sequence, the sample video can be segmented according to each sample segmentation point in the sample video, so as to obtain each sample sequence. And carrying out maximum pooling processing by using third prediction information of each frame of sample image belonging to the same sample sequence to obtain third prediction information corresponding to the sample sequence, wherein the third prediction information corresponding to the sample sequence comprises the probability that the sample sequence belongs to each image class. In addition, labeling information of the sample sequence is obtained, wherein the labeling information of the sample sequence comprises various image categories to which the sample sequence belongs. And determining the loss of the sample sequence according to the determination mode of the third loss of the sample image based on the third prediction information and the labeling information corresponding to the sample sequence. Training the neural network model based on the loss of each sample sequence and the first labeling information, the first prediction information and the second prediction information of each frame of sample image to obtain a video segmentation model.
For the loss of the sample video, the third prediction information of each frame of sample image can be utilized to carry out the maximum pooling processing to obtain the third prediction information corresponding to the sample video, wherein the third prediction information corresponding to the sample video comprises the probability that the sample video belongs to each image category. In addition, the annotation information of the sample video is obtained, and the annotation information of the sample video comprises each image category to which the sample video belongs. And determining the loss of the sample video according to the determination mode of the third loss of the sample image based on the third prediction information and the labeling information corresponding to the sample video. Training the neural network model based on the loss of the sample video and the first labeling information, the first prediction information and the second prediction information of each frame of sample image to obtain a video segmentation model.
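The sequence-level (or video-level) pooling described in the two preceding paragraphs can be sketched as follows (the function name and the use of element-wise maximum pooling over frame indices are assumptions). The pooled prediction of each sample sequence, or of the whole sample video, can then be compared with the corresponding labeling information using the same computation as the third loss of a sample image.

```python
import torch

def sequence_level_predictions(frame_preds, cut_indices):
    """Sketch of the sequence-level pooling described above.
    frame_preds: (num_frames, num_categories) third prediction information;
    cut_indices: frame indices of the sample cut points.  Each sample
    sequence's prediction is the element-wise maximum over its frames."""
    bounds = [0] + sorted(cut_indices) + [frame_preds.shape[0]]
    pooled = [frame_preds[a:b].max(dim=0).values
              for a, b in zip(bounds[:-1], bounds[1:]) if b > a]
    return torch.stack(pooled)   # (num_sequences, num_categories)

preds = torch.rand(100, 5)
print(sequence_level_predictions(preds, cut_indices=[30, 70]).shape)
# torch.Size([3, 5]) -- one pooled prediction per sample sequence
```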
It can be appreciated that the neural network model can be trained according to at least one of a loss of the sample video, a loss of each sample sequence, a third loss of each sample image, first labeling information and first prediction information of each frame of sample image, and second prediction information of each frame of sample image, to obtain a video segmentation model.
It should be noted that, the information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, presented data, etc.), and signals related to the present application are all authorized by the user or are fully authorized by the parties, and the collection, use, and processing of the related data is required to comply with the relevant laws and regulations and standards of the relevant countries and regions. For example, the sample video and the like involved in the present application are all acquired with sufficient authorization.
According to the method, multiple frames of sample images are extracted from the sample video, the probability that the distance between each frame of sample image and the sample dividing point of the sample video is not greater than a distance threshold value and the prediction offset between the time stamp of each frame of sample image and the time stamp of the sample dividing point are determined through the neural network model, and the time stamp of the sample dividing point can be determined based on the probability, the prediction offset and the time stamp of each frame of sample image. Therefore, after the neural network model is trained to obtain the video segmentation model through the first labeling information, the first prediction information and the second prediction information of each frame of sample image, the time stamp of the video segmentation point can be accurately determined through the video segmentation model, so that the video is segmented based on the time stamp of the video segmentation point, and the accuracy of a video segmentation result is improved.
The embodiment of the application also provides a video segmentation method, which can be applied to the implementation environment described above. The method provided by the embodiment of the application can determine accurate video segmentation points through the video segmentation model, thereby improving the accuracy of the video segmentation result. Taking the flowchart of the video segmentation method shown in fig. 3 as an example, and for convenience of description, the terminal device 101 or the server 102 that performs the video segmentation method in the embodiment of the present application is referred to as an electronic device; the method may be performed by the electronic device. As shown in fig. 3, the method includes the following steps.
Step 301, acquiring a video segmentation model and a multi-frame target image extracted from a target video.
The video segmentation model is obtained by training according to a training method of the video segmentation model related to fig. 2, and will not be described herein. The content of the target video is similar to that of the sample video, and the implementation manner of extracting the multi-frame target image from the target video is similar to that of extracting the multi-frame sample image from the sample video, so that the description of step 201 will be omitted herein.
In step 302, first prediction information and second prediction information of each frame of target image are determined through a video segmentation model.
The first prediction information of the target image represents the probability that the distance between the target image and the target segmentation point of the target video is not larger than a distance threshold value, and the second prediction information of the target image represents the prediction offset between the time stamp of the target image and the time stamp of the target segmentation point. The description of step 302 can be seen in the description of step 202, and the implementation principles of the two are similar, and will not be repeated here.
Step 303, slicing the target video based on the first prediction information and the second prediction information of the target image of each frame to obtain at least two video sequences.
Because the first prediction information of the target image characterizes the probability that the distance between the target image and the target cut point of the target video is not greater than the distance threshold, and the second prediction information of the target image characterizes the prediction offset between the time stamp of the target image and the time stamp of the target cut point, the time stamp of the target cut point can be accurately determined based on the first prediction information of each frame of the target image, the second prediction information of each frame of the target image and the time stamp of each frame of the target image. The time stamp of the target segmentation point is the time of the target segmentation point in the target video, and the target video is segmented based on the time stamp of the target segmentation point, so that each video sequence is obtained. Since there is at least one target cut point in the target video, at least two video sequences can be obtained by cutting the target video based on the time stamps of the respective target cut points.
In one possible implementation, step 303 includes: determining a reference image from the multi-frame target image based on first prediction information of each frame of target image, wherein the first prediction information of the reference image is not smaller than a probability threshold; extracting second prediction information of the reference image from second prediction information of each frame of target image; determining a time stamp of the target cut point based on the time stamp of the reference image and the second prediction information of the reference image; and cutting the target video based on the time stamp of the target cutting point to obtain at least two video sequences.
The first prediction information of the target image characterizes a probability that a distance between the target image and a target segmentation point of the target video is not greater than a distance threshold, and the greater the probability, the more likely the target image is that the distance between the target image and the target segmentation point is not greater than the distance threshold. From the first prediction information of each frame of target image, a reference image with probability greater than or equal to a probability threshold value, which is a target image with a distance from a target dividing point not greater than a distance threshold value, can be determined from multiple frames of target images. The embodiment of the application does not limit the probability threshold value. Illustratively, the probability threshold is set data, e.g., the probability threshold is 0.7. Alternatively, the probability threshold is the maximum probability in the first prediction information of the target image of each frame. Alternatively, the first prediction information of the target image of each frame may be regarded as a curve having at least one peak, and the probability threshold is a peak value of each peak.
Since the reference image is a target image, the second prediction information of the reference image can be extracted from the second prediction information of the target image of each frame. Since the second prediction information of the reference picture characterizes a prediction offset between a time stamp of the reference picture and a time stamp of the target cut point, the time stamp of the target cut point may be determined based on the prediction offset of the time stamp of the reference picture and the reference picture. The time stamp of the reference image comprises a starting time and/or an ending time of the reference image in the target video, and the time stamp of the target slicing point can be determined based on a prediction offset corresponding to the reference image, the starting time or the ending time of the reference image in the target video.
The number of target cut points is at least one. The target video may be sliced based on the time stamps of the respective target slicing points to obtain at least two video sequences.
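A minimal sketch of this slicing procedure (the probability threshold of 0.7 and the function name are assumptions): frames whose first prediction information reaches the threshold are taken as reference images, and each cut-point timestamp is the reference image's timestamp plus its predicted offset.

```python
import numpy as np

def predict_cut_timestamps(first_pred, second_pred, frame_timestamps,
                           prob_threshold=0.7):
    """Sketch of step 303: frames with first prediction information >=
    prob_threshold are reference images; each target cut-point timestamp is
    the reference image's timestamp plus its second prediction information."""
    first_pred = np.asarray(first_pred)
    second_pred = np.asarray(second_pred)
    frame_timestamps = np.asarray(frame_timestamps)
    reference = first_pred >= prob_threshold
    cut_timestamps = frame_timestamps[reference] + second_pred[reference]
    return np.unique(np.round(cut_timestamps, 2))

cuts = predict_cut_timestamps(first_pred=[0.1, 0.9, 0.2, 0.8],
                              second_pred=[0.0, 0.3, 0.0, -0.2],
                              frame_timestamps=[0.0, 4.0, 8.0, 12.0])
print(cuts)  # [ 4.3 11.8] -- slice the target video at these timestamps
```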
In one possible implementation, step 301 further includes: determining third prediction information of each frame of target image through a video segmentation model, wherein the third prediction information of the target image characterizes the prediction probability of the target image belonging to each image category; for any frame of target image, if the prediction probability of any frame of target image belonging to any image category is larger than a third threshold value, determining that any frame of target image belongs to any image category; and if the prediction probability of any frame of target image belonging to any image category is not greater than a third threshold value, determining that any frame of target image does not belong to any image category.
In the embodiment of the present application, the description of step 204 can be seen for the implementation of determining the third prediction information of each frame of target image by using the video segmentation model, and the implementation principles of the two are similar, which is not described herein again.
The third prediction information of the target image comprises a plurality of prediction data; the number of the prediction data is the same as the number of image categories, and each prediction data corresponds to one image category in order. Each prediction data in the third prediction information of the target image is greater than or equal to 0 and less than or equal to 1 and characterizes the probability that the target image belongs to the image category corresponding to that prediction data; the greater the value of the prediction data, the more likely the target image belongs to the corresponding image category. Thus, the third prediction information of the target image can characterize the prediction probabilities that the target image belongs to the respective image categories.
For any one of the third prediction data of the target image of any frame, if the prediction data is larger than a third threshold, that is, if the prediction probability that the target image belongs to the image category corresponding to the prediction data is larger than the third threshold, determining that the target image belongs to the image category corresponding to the prediction data; and if the prediction data is not greater than a third threshold, that is, the prediction probability that the target image belongs to the image category corresponding to the prediction data is not greater than the third threshold, determining that the target image does not belong to the image category corresponding to the prediction data.
The value of the third threshold is not limited in the embodiment of the present application, and the third threshold is a set value, for example, the third threshold is 0.7 or 0.8.
In one possible implementation manner, for any video sequence, the third prediction information corresponding to the video sequence is determined based on the third prediction information of each frame of target image in the video sequence; the third prediction information corresponding to the video sequence comprises the probability that the video sequence belongs to each image category. The image category to which the video sequence belongs is then determined based on the third prediction information corresponding to the video sequence.
For example, maximum pooling may be performed on the third prediction information of each frame of target image in the video sequence to obtain the third prediction information corresponding to the video sequence. If the probability that the video sequence belongs to any image category is greater than the third threshold, it is determined that the video sequence belongs to that image category; if the probability that the video sequence belongs to any image category is not greater than the third threshold, it is determined that the video sequence does not belong to that image category. If the video sequence belongs to a certain image category, each frame of image in the video sequence can be regarded as belonging to that image category.
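For illustration, the sequence-level category decision can be sketched as follows (the threshold value and the category names are assumptions):

```python
import numpy as np

def sequence_categories(frame_third_preds, category_names, third_threshold=0.7):
    """Sketch of the sequence-level decision above: the sequence's probability
    for each image category is the maximum over its frames, and categories
    whose probability exceeds the third threshold are assigned to the sequence."""
    pooled = np.asarray(frame_third_preds).max(axis=0)
    return [name for name, p in zip(category_names, pooled) if p > third_threshold]

frames = [[0.9, 0.1, 0.3],
          [0.8, 0.2, 0.75]]
print(sequence_categories(frames, ["interface recording",
                                   "oral-broadcast recommendation",
                                   "single image"]))
# ['interface recording', 'single image']
```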
Optionally, the image category of each frame of image in the target video may be determined in the same manner as the image category of the target image is determined. By splitting the target video into at least two video sequences and determining the image category of each frame of image in the target video, the accuracy of video editing, video recommendation, video retrieval and the like can be improved. For example, in video editing, the at least two video sequences can be freely combined, reducing the labor cost of video editing; in video recommendation, recommendation can be performed based on the image categories of the frames in a video sequence, which is finer-grained than recommending the whole target video and helps improve recommendation accuracy; in video retrieval, the category of a video sequence can be determined based on the image categories of the frames in the video sequence, and the retrieval relation between the video sequence and its category can be stored, so that retrieval by category can be located to the video sequence rather than the whole target video, improving retrieval precision.
It should be noted that, the information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, presented data, etc.), and signals related to the present application are all authorized by the user or are fully authorized by the parties, and the collection, use, and processing of the related data is required to comply with the relevant laws and regulations and standards of the relevant countries and regions. For example, the target video and the like involved in the present application are all acquired with sufficient authorization.
According to the method, multiple frames of target images are extracted from the target video, the probability that the distance between each frame of target image and the target cutting point of the target video is not greater than the distance threshold value and the prediction offset between the time stamp of each frame of target image and the time stamp of the target cutting point are determined through the video cutting model, so that the time stamp of the target cutting point can be accurately determined based on the probability, the prediction offset and the time stamp of each frame of target image, the target video is cut based on the time stamp of the target cutting point, and the accuracy of the video cutting result is improved.
The foregoing describes, from the perspective of method steps, the training method of the video segmentation model according to the embodiment of the present application and the video segmentation method that performs video segmentation by using the video segmentation model. The following describes the video segmentation model according to the embodiment of the present application from a system perspective.
Referring to fig. 4, fig. 4 is a frame diagram of a video slicing model according to an embodiment of the present application. The video segmentation model is obtained by training the neural network model, so that the frame of the video segmentation model is the same as the frame of the neural network model. In the process of training the video segmentation model, multiple frames of sample images (hereinafter referred to as images), sample audio corresponding to each frame of sample images (hereinafter referred to as audio) and sample text corresponding to each frame of sample images (hereinafter referred to as text) are required to be used as inputs of the neural network model, so that the output of the neural network model is utilized to train and obtain the video segmentation model. The neural network model and the video segmentation model comprise a feature extraction network, a multi-mode feature fusion network, a convolution network and a prediction network. The feature extraction network includes the above-mentioned image feature extraction network, audio feature extraction network, and text feature extraction network, and the above-mentioned feature processing network includes a feature extraction network, a multi-modal feature fusion network, and a convolution network.
The image is input into a feature extraction network, the feature extraction network extracts the image features, similarly, the audio is input into the feature extraction network, the feature extraction network extracts the audio features, the text is input into the feature extraction network, and the feature extraction network extracts the text features. The text comprises a valid text and a filling text, wherein the valid text is a sample text corresponding to a sample image extracted from the video, and the filling text is a setting character filled when the sample text corresponding to the sample image does not exist in the video. Based on this, the text features include an effective feature, which is a text feature corresponding to the effective text, and a fill feature, which is a text feature corresponding to the fill text, for example, the fill feature is the above-mentioned set feature.
In one aspect, image features, audio features, and text features are input into a multimodal feature fusion network, and the image features and audio features are multiplied by the multimodal feature fusion network to fuse the image features and the audio features. The multiplied features are subjected to channel attention processing through a channel attention network, the processed features, the multiplied features and the position features are added to obtain a query vector, and the query vector is the first fusion feature of each frame of sample image. On the other hand, after the text feature and the position feature are added, a key vector and a value vector are obtained. Inputting the query vector, the key vector and the value vector into a multi-head attention network, carrying out feature fusion by the multi-head attention network to obtain multi-head attention features of each frame of sample images, multiplying the multi-head attention features of each frame of sample images by the query vector to fuse the multi-head attention features of each frame of sample images with the query vector, and obtaining a second fusion feature of each frame of sample images.
The convolution network corresponds to the first stage to the N-th stage, where N is a positive integer. The first stage corresponds to l+1 hole convolution networks with gradually increasing dilation rates and l+1 hole convolution networks with gradually decreasing dilation rates. The l+1 hole convolution networks with gradually increasing dilation rates can be recorded as DC r=1, ……, DC r=2^l; the l+1 hole convolution networks with gradually decreasing dilation rates can be recorded as DC r=2^l, ……, DC r=1. The l+1 hole convolution networks with gradually increasing dilation rates perform l+1 hole convolution operations on the second fusion feature of each frame of sample image to obtain the first hole convolution feature, and the l+1 hole convolution networks with gradually decreasing dilation rates perform l+1 hole convolution operations on the second fusion feature of each frame of sample image to obtain the second hole convolution feature. The first hole convolution feature and the second hole convolution feature are added to obtain the feature corresponding to the first stage. The second stage corresponds to l+1 hole convolution networks with gradually increasing dilation rates, recorded as DC r=1, ……, DC r=2^l; these networks perform l+1 hole convolution operations on the feature corresponding to the first stage to obtain the feature corresponding to the second stage. The third to N-th stages are similar to the second stage and are not described here. The features corresponding to the stages are fused to obtain the sample feature of each frame of sample image.
The sample characteristics of each frame of sample images are input into a prediction network, and on one hand, the prediction network determines the characteristic difference between the sample image and the adjacent frame image based on the sample characteristics of the sample image and the sample characteristics of the adjacent frame image for any frame of sample images. Then, first prediction information of the sample image and second prediction information of the sample image are determined based on a feature difference between the sample image and an adjacent frame image. On the other hand, the prediction network determines third prediction information of any one of the sample images based on the sample characteristics of the sample image.
And training the neural network model by using the first prediction information, the second prediction information and the third prediction information of each frame of sample image to obtain a video segmentation model. The implementation of training the neural network model is described in detail in the implementation of fig. 2, and will not be described here.
The training can be utilized to obtain a video segmentation model to carry out video segmentation on the target video. Optionally, multiple frames of target images are extracted from the target video, and target audio corresponding to each frame of target image and target text corresponding to each frame of target image are determined. And obtaining first prediction information, second prediction information and third prediction information of each frame of target image according to the processing modes of the sample image, the sample audio and the sample text. The first prediction information of the target image represents the probability that the distance between the target image and the target segmentation point of the target video is not larger than a distance threshold value, and the second prediction information of the target image represents the prediction offset between the time stamp of the target image and the time stamp of the target segmentation point. Based on the first prediction information and the second prediction information of the target image of each frame, the target video can be segmented to obtain at least two video sequences. The third prediction information of the target image characterizes the prediction probability that the target image belongs to each image category, and the image category to which each frame of target image belongs can be determined based on the third prediction information of each frame of target image.
Referring to fig. 5, fig. 5 is a schematic diagram of a video parsing result according to an embodiment of the present application. Wherein eight images shown in fig. 5 refer to target images of respective frames. Eight images can be segmented through the video segmentation model, and a first curtain, a second curtain and a third curtain are obtained. Wherein the first screen comprises two images, the second screen comprises five images, and the third screen comprises one image.
Each screen may be analyzed to obtain the type of each screen. For example, the type of the first screen is an interface recording screen, the type of the second screen is an oral-broadcast recommendation, and the type of the third screen is a single image with animated packaging. The text of each screen may also be determined. For example, the text of the first screen is: video editing enriches life; the text of the second screen is: many people tell me that video editing is too difficult ……; the third screen has no text. In addition, the watching condition of the target video may be counted to obtain a play-rate curve and a loss-rate curve.
By analyzing videos in this way, a video can be understood in a finer and more systematic manner, and its structure can be broken down. Learning the structure of high-quality videos helps optimize a video, improve its play rate and reduce its loss rate.
The video segmentation model obtained by training by the training method of the video segmentation model provided by the embodiment of the application has excellent performance and better expansibility. Compared with the related art, when the video segmentation model provided by the embodiment of the application is used for video segmentation, the video segmentation precision can be improved, and the processing time can be reduced. For a single video, the processing duration of the related art is 47.9 seconds, while the processing duration of the embodiment of the present application is 10 seconds, the F1@0.5 index (an evaluation index) of the related art on the segmentation accuracy is 53%, and the F1@0.5 index of the embodiment of the present application on the segmentation accuracy is 85%.
The embodiment of the present application uses multi-modal features (namely image features, audio features and text features) to determine accurate video segmentation points. The video is segmented into video sequences based on the video segmentation points, and the image category of each image in a video sequence can be determined. The method is efficient and simple, and has potential applications in multiple fields (such as advertising video and animation), including but not limited to video clipping, video recommendation and video clip retrieval.
Fig. 6 is a schematic structural diagram of a training device for a video segmentation model according to an embodiment of the present application, where, as shown in fig. 6, the device includes:
The obtaining module 601 is configured to obtain multiple frames of sample images extracted from a sample video and first labeling information of each frame of sample image, where the first labeling information of the sample image characterizes whether a distance between the sample image and a sample dividing point of the sample video is not greater than a distance threshold;
a determining module 602, configured to determine, by using a neural network model, first prediction information and second prediction information of each frame of sample image, where the first prediction information of the sample image characterizes a probability that a distance between the sample image and a sample segmentation point is not greater than a distance threshold, and the second prediction information of the sample image characterizes a prediction offset between a timestamp of the sample image and a timestamp of the sample segmentation point;
the training module 603 is configured to train the neural network model based on the first labeling information, the first prediction information, and the second prediction information of each frame of sample image, to obtain a video segmentation model, where the video segmentation model is used for segmenting the target video.
In one possible implementation, the determining module 602 is configured to determine, through the neural network model, sample features of each frame of sample image; for any frame of sample image, determine, through the neural network model, a feature difference between the any frame of sample image and an adjacent frame of sample image based on the sample features of the any frame of sample image and the sample features of the adjacent frame of sample image, where the adjacent frame of sample image is a sample image adjacent to the any frame of sample image in the multiple frames of sample images; and determine, through the neural network model, the first prediction information and the second prediction information of the any frame of sample image based on the feature difference between the any frame of sample image and the adjacent frame of sample image.
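A minimal PyTorch sketch of such a prediction head is given below, assuming the feature difference is taken against the previous frame and fed to two small fully connected branches; the layer sizes and the choice of the previous frame as the adjacent frame are assumptions, not specifics of the embodiment.

```python
# Hedged sketch: feature difference between adjacent frames drives two heads,
# one for the first prediction information (probability) and one for the second
# prediction information (timestamp offset).
import torch
import torch.nn as nn

class CutPointHead(nn.Module):
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.prob_head = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                       nn.Linear(128, 1), nn.Sigmoid())
        self.offset_head = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                         nn.Linear(128, 1))

    def forward(self, sample_feats: torch.Tensor):
        # sample_feats: (num_frames, feat_dim) sample features of each frame.
        # Difference with the previous frame; the first frame is compared with
        # itself, giving a zero difference.
        prev = torch.cat([sample_feats[:1], sample_feats[:-1]], dim=0)
        diff = sample_feats - prev
        first_pred = self.prob_head(diff).squeeze(-1)     # prob. near a cut point
        second_pred = self.offset_head(diff).squeeze(-1)  # predicted timestamp offset
        return first_pred, second_pred

feats = torch.randn(8, 512)
probs, offsets = CutPointHead()(feats)
```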
In one possible implementation, the determining module 602 is configured to determine, through a neural network model, an image feature of each frame of sample image; determining the complementary features of each frame of sample image through a neural network model, wherein the complementary features of the sample image comprise at least one of audio features of sample audio corresponding to the sample image and text features of sample text corresponding to the sample image; and fusing the image characteristics of each frame of sample image and the complementary characteristics of each frame of sample image through a neural network model to obtain the sample characteristics of each frame of sample image.
In a possible implementation manner, the determining module 602 is configured to fuse, through a neural network model, image features of each frame of sample image and audio features of sample audio corresponding to each frame of sample image, to obtain first fusion features of each frame of sample image; for any frame of sample image, determining a first text feature related to a first fusion feature of any frame of sample image from text features of sample texts corresponding to each frame of sample image; fusing the first fusion characteristic of any frame of sample image with the first text characteristic to obtain a second fusion characteristic of any frame of sample image; sample features of each frame of sample images are determined based on the second fused features of each frame of sample images.
In a possible implementation manner, the determining module 602 is configured to determine, for any frame of sample images, a second text feature related to an image feature of any frame of sample images from text features of sample text corresponding to each frame of sample images; fusing the image features of any frame of sample image and the second text features through a neural network model to obtain a third fusion feature of any frame of sample image; fusing the third fusion characteristic of any frame of sample image and the audio characteristic of the sample audio corresponding to any frame of sample image to obtain a fourth fusion characteristic of any frame of sample image; sample features of each frame of sample images are determined based on the fourth fused features of each frame of sample images.
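The following PyTorch sketch illustrates the first of the two fusion orders described above (image features fused with audio features, then related text features retrieved and fused in). Using cross-attention over the text features and linear projections for the fusion steps is an assumption on my part; swapping the roles of the audio and text branches would yield the second fusion order.

```python
# Hedged sketch of the multi-modal fusion; operator choices are assumptions,
# only the fusion order follows the description.
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    def __init__(self, img_dim=512, aud_dim=128, txt_dim=256, out_dim=512):
        super().__init__()
        self.img_aud_proj = nn.Linear(img_dim + aud_dim, out_dim)  # first fusion feature
        self.txt_proj = nn.Linear(txt_dim, out_dim)
        # cross-attention: each frame's first fusion feature queries the text features
        self.cross_attn = nn.MultiheadAttention(out_dim, num_heads=4, batch_first=True)
        self.out_proj = nn.Linear(out_dim * 2, out_dim)            # second fusion feature

    def forward(self, img_feat, aud_feat, txt_feat):
        # img_feat: (N, img_dim), aud_feat: (N, aud_dim), txt_feat: (N, txt_dim)
        fused1 = self.img_aud_proj(torch.cat([img_feat, aud_feat], dim=-1))
        txt = self.txt_proj(txt_feat)
        # "first text feature related to the first fusion feature": attention over the
        # text features of all frames, queried by each frame's first fusion feature
        related_txt, _ = self.cross_attn(fused1.unsqueeze(0), txt.unsqueeze(0),
                                         txt.unsqueeze(0))
        fused2 = self.out_proj(torch.cat([fused1, related_txt.squeeze(0)], dim=-1))
        return fused2  # sample feature of each frame

feats = MultiModalFusion()(torch.randn(8, 512), torch.randn(8, 128), torch.randn(8, 256))
```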
In one possible implementation, the training module 603 is configured to determine a first loss of each frame of sample image based on the first labeling information and the first prediction information of each frame of sample image; determining a second loss of each frame of sample image based on the second prediction information of each frame of sample image, the time stamp of each frame of sample image, and the time stamp of the sample segmentation point; training the neural network model based on the first loss of each frame of sample image and the second loss of each frame of sample image to obtain a video segmentation model.
In a possible implementation, the training module 603 is configured to determine, for any frame of sample image, the sum of the second prediction information of the any frame of sample image and the timestamp of the any frame of sample image as a reference timestamp; and determine a second loss of the any frame of sample image based on the first threshold and the difference between the reference timestamp and the timestamp of the sample segmentation point.
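A NumPy sketch of the first loss and the second loss described above is given below; the binary cross-entropy form of the first loss and the hinge-style use of the first threshold in the second loss are assumptions consistent with, but not dictated by, the description.

```python
# Hedged sketch of the two losses; the unweighted combination is also an assumption.
import numpy as np

def first_loss(first_label: np.ndarray, first_pred: np.ndarray) -> np.ndarray:
    # first_label: 1 if the frame lies within the distance threshold of a sample
    # segmentation point, else 0; first_pred: predicted probability in (0, 1).
    eps = 1e-7
    return -(first_label * np.log(first_pred + eps)
             + (1 - first_label) * np.log(1 - first_pred + eps))

def second_loss(second_pred: np.ndarray, frame_ts: np.ndarray,
                cut_ts: float, first_threshold: float) -> np.ndarray:
    # reference timestamp = predicted offset + frame timestamp
    ref_ts = second_pred + frame_ts
    # penalize only the part of the timestamp error exceeding the first threshold
    return np.maximum(np.abs(ref_ts - cut_ts) - first_threshold, 0.0)

frame_ts = np.arange(0.0, 8.0)
labels = (np.abs(frame_ts - 3.0) <= 1.0).astype(float)   # cut point at t = 3 s
l1 = first_loss(labels, np.clip(np.random.rand(8), 1e-3, 1 - 1e-3))
l2 = second_loss(np.random.uniform(-0.5, 0.5, 8), frame_ts, 3.0, 0.25)
total = l1.mean() + l2.mean()
```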
In one possible implementation, the apparatus further includes:
the determining module 602 is further configured to determine, by using the neural network model, third prediction information of each frame of sample image, where the third prediction information of the sample image characterizes a prediction probability that the sample image belongs to each image class;
the obtaining module 601 is further configured to obtain second labeling information of each frame of sample image, where the second labeling information of the sample image characterizes whether a sample image obtained by labeling belongs to each image class;
the training module 603 is configured to train the neural network model based on the first labeling information and the first prediction information, the second prediction information, the third prediction information and the second labeling information of each frame of sample image, so as to obtain a video segmentation model.
In one possible implementation, the training module 603 is configured to determine, for any frame of sample image, a positive sample loss of any frame of sample image based on a prediction probability that any frame of sample image belongs to a first category, where the first category is an image category to which any frame of sample image obtained by labeling belongs; determining negative sample loss of any frame of sample image based on a second threshold and the prediction probability that any frame of sample image belongs to a second category, wherein the second category is an image category to which any frame of sample image obtained through labeling does not belong; determining a third loss of any frame of sample images based on the positive sample loss of any frame of sample images and the negative sample loss of any frame of sample images; training the neural network model based on the third loss of each frame of sample image, the first labeling information, the first prediction information and the second prediction information of each frame of sample image, and obtaining a video segmentation model.
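The positive/negative sample loss described above could be instantiated as in the following NumPy sketch, where negative categories contribute to the loss only when their predicted probability exceeds the second threshold; the exact functional form is an assumption.

```python
# Hedged sketch of the third loss for one sample image.
import numpy as np

def third_loss(class_probs: np.ndarray, label_mask: np.ndarray,
               second_threshold: float = 0.3) -> float:
    # class_probs: (num_classes,) predicted probabilities for one sample image
    # label_mask:  (num_classes,) 1 for the labeled (first) categories, 0 otherwise
    eps = 1e-7
    pos = -np.log(class_probs + eps) * label_mask                 # positive sample loss
    # negative categories are penalized only when their probability exceeds the
    # second threshold, so confident negatives contribute nothing
    over = np.maximum(class_probs - second_threshold, 0.0)
    neg = -np.log(1.0 - over + eps) * (1 - label_mask)            # negative sample loss
    return float(pos.sum() + neg.sum())

probs = np.array([0.8, 0.1, 0.4, 0.05])
labels = np.array([1, 0, 0, 0])
loss = third_loss(probs, labels)
```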
The device extracts multiple frames of sample images from the sample video, and determines, through the neural network model, the probability that the distance between each frame of sample image and a sample segmentation point of the sample video is not greater than the distance threshold, as well as the prediction offset between the timestamp of each frame of sample image and the timestamp of the sample segmentation point, so that the timestamp of the sample segmentation point can be determined based on the probability, the prediction offset and the timestamp of each frame of sample image. Therefore, after the neural network model is trained into the video segmentation model using the first labeling information, the first prediction information and the second prediction information of each frame of sample image, the timestamp of a video segmentation point can be accurately determined by the video segmentation model, so that a video is segmented based on the timestamp of the video segmentation point, improving the accuracy of the video segmentation result.
Fig. 7 is a schematic structural diagram of a video slicing device according to an embodiment of the present application, where, as shown in fig. 7, the device includes:
the acquiring module 701 is configured to acquire a video segmentation model and a multi-frame target image extracted from a target video, where the video segmentation model is obtained by training according to the training method of the video segmentation model of any one of the above-mentioned aspects;
The determining module 702 is configured to determine, according to a video segmentation model, first prediction information and second prediction information of each frame of a target image, where the first prediction information of the target image characterizes a probability that a distance between the target image and a target segmentation point of the target video is not greater than a distance threshold, and the second prediction information of the target image characterizes a prediction offset between a timestamp of the target image and a timestamp of the target segmentation point;
the segmentation module 703 is configured to segment the target video based on the first prediction information and the second prediction information of each frame of target image, to obtain at least two video sequences.
In one possible implementation, the segmentation module 703 is configured to determine a reference image from the multi-frame target image based on the first prediction information of each frame target image, where the first prediction information of the reference image is not less than a probability threshold; extracting second prediction information of the reference image from second prediction information of each frame of target image; determining a time stamp of the target cut point based on the time stamp of the reference image and the second prediction information of the reference image; and cutting the target video based on the time stamp of the target cutting point to obtain at least two video sequences.
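A NumPy sketch of this slicing step is given below: frames whose first prediction information reaches the probability threshold serve as reference images, their timestamps are corrected by the predicted offsets, and nearby estimates are merged into single cut points (the merging window is an added assumption).

```python
# Hedged sketch of deriving cut points and video sequences from the predictions.
import numpy as np

def compute_cut_points(first_pred, second_pred, timestamps,
                       prob_threshold=0.5, merge_window=0.5):
    ref = first_pred >= prob_threshold                    # reference images
    cut_ts = np.sort(timestamps[ref] + second_pred[ref])  # corrected cut timestamps
    merged = []
    for t in cut_ts:
        if not merged or t - merged[-1] > merge_window:   # merge nearby estimates
            merged.append(float(t))
    return merged

def split_video(duration, cut_points):
    bounds = [0.0] + cut_points + [duration]
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]

timestamps = np.arange(0.0, 8.0)
first_pred = np.array([0.1, 0.2, 0.9, 0.8, 0.1, 0.1, 0.7, 0.2])
second_pred = np.array([0.0, 0.1, 0.4, -0.6, 0.0, 0.2, 0.3, 0.0])
segments = split_video(8.0, compute_cut_points(first_pred, second_pred, timestamps))
# e.g. three video sequences: [0, ~2.4), [~2.4, ~6.3), [~6.3, 8.0)
```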
In one possible implementation, the apparatus further includes:
The determining module 702 is further configured to determine third prediction information of the target image of each frame according to the video segmentation model, where the third prediction information of the target image characterizes a prediction probability that the target image belongs to each image class; for any frame of target image, if the prediction probability of any frame of target image belonging to any image category is larger than a third threshold value, determining that any frame of target image belongs to any image category; and if the prediction probability of any frame of target image belonging to any image category is not greater than a third threshold value, determining that any frame of target image does not belong to any image category.
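Thresholding the third prediction information into image categories can be sketched as follows; the category names and the value of the third threshold are illustrative only.

```python
# Hedged sketch of the category decision based on the third prediction information.
import numpy as np

CATEGORIES = ["interface recording", "spoken recommendation", "single image", "other"]

def assign_categories(third_pred: np.ndarray, third_threshold: float = 0.5):
    # third_pred: (num_frames, num_categories) predicted probabilities
    return [[CATEGORIES[j] for j in range(third_pred.shape[1])
             if third_pred[i, j] > third_threshold]
            for i in range(third_pred.shape[0])]

labels = assign_categories(np.array([[0.9, 0.2, 0.1, 0.0],
                                     [0.1, 0.8, 0.6, 0.0]]))
# frame 0 -> ["interface recording"]; frame 1 -> ["spoken recommendation", "single image"]
```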
The device extracts multiple frames of target images from the target video and determines, through the video segmentation model, the probability that the distance between each frame of target image and a target segmentation point of the target video is not greater than the distance threshold, as well as the prediction offset between the timestamp of each frame of target image and the timestamp of the target segmentation point, so that the timestamp of the target segmentation point can be accurately determined based on the probability, the prediction offset and the timestamp of each frame of target image. The target video is then segmented based on the timestamp of the target segmentation point, improving the accuracy of the video segmentation result.
It should be understood that, when the apparatus provided in fig. 6 or fig. 7 implements its functions, the division into the above functional modules is merely used as an example; in practical applications, the functions may be allocated to different functional modules as required, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus provided in the foregoing embodiments and the corresponding method embodiments belong to the same concept; for the specific implementation process, refer to the method embodiments, and details are not repeated here.
Fig. 8 shows a block diagram of a terminal device 800 according to an exemplary embodiment of the present application. The terminal device 800 includes: a processor 801 and a memory 802.
Processor 801 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 801 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array) and PLA (Programmable Logic Array). The processor 801 may also include a main processor and a coprocessor. The main processor is a processor for processing data in an awake state, also referred to as a CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 801 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 801 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 802 may include one or more computer-readable storage media, which may be non-transitory. Memory 802 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices and flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in the memory 802 is used to store at least one computer program, and the at least one computer program is executed by the processor 801 to implement the training method of the video segmentation model or the video slicing method provided by the method embodiments of the present application.
In some embodiments, the terminal device 800 may further optionally include: a peripheral interface 803, and at least one peripheral. The processor 801, the memory 802, and the peripheral interface 803 may be connected by a bus or signal line. Individual peripheral devices may be connected to the peripheral device interface 803 by buses, signal lines, or a circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 804, a display 805, a camera assembly 806, audio circuitry 807, and a power supply 808.
Peripheral interface 803 may be used to connect at least one Input/Output (I/O) related peripheral to the processor 801 and the memory 802. In some embodiments, the processor 801, the memory 802, and the peripheral interface 803 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 801, the memory 802 and the peripheral interface 803 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 804 is configured to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 804 communicates with a communication network and other communication devices via electromagnetic signals. The radio frequency circuit 804 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 804 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 804 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to: the World Wide Web, metropolitan area networks, intranets, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 804 may also include NFC (Near Field Communication) related circuits, which is not limited in the present application.
The display 805 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display 805 is a touch display, the display 805 also has the ability to collect touch signals on or above the surface of the display 805. The touch signal may be input as a control signal to the processor 801 for processing. At this time, the display 805 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display 805, disposed on the front panel of the terminal device 800; in other embodiments, there may be at least two displays 805, disposed on different surfaces of the terminal device 800 or in a folded design; in still other embodiments, the display 805 may be a flexible display disposed on a curved surface or a folded surface of the terminal device 800. Furthermore, the display 805 may even be arranged in a non-rectangular irregular shape, that is, a shaped screen. The display 805 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode) or other materials.
The camera assembly 806 is used to capture images or videos. Optionally, the camera assembly 806 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each of which is any one of a main camera, a depth camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting and Virtual Reality (VR) shooting functions or other fused shooting functions. In some embodiments, the camera assembly 806 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation under different color temperatures.
Audio circuitry 807 may include a microphone and a speaker. The microphone is used for collecting sound waves of users and the environment, converting the sound waves into electric signals, inputting the electric signals to the processor 801 for processing, or inputting the electric signals to the radio frequency circuit 804 for voice communication. For stereo acquisition or noise reduction purposes, a plurality of microphones may be respectively disposed at different portions of the terminal device 800. The microphone may also be an array microphone or an omni-directional pickup microphone. The speaker is used to convert electrical signals from the processor 801 or the radio frequency circuit 804 into sound waves. The speaker may be a conventional thin film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only the electric signal can be converted into a sound wave audible to humans, but also the electric signal can be converted into a sound wave inaudible to humans for ranging and other purposes. In some embodiments, audio circuit 807 may also include a headphone jack.
The power supply 808 is used to power the various components in the terminal device 800. The power supply 808 may be an alternating current, a direct current, a disposable battery, or a rechargeable battery. When the power supply 808 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, the terminal device 800 also includes one or more sensors 809. The one or more sensors 809 include, but are not limited to: acceleration sensor 811, gyro sensor 812, pressure sensor 813, optical sensor 814, and proximity sensor 815.
The acceleration sensor 811 can detect the magnitudes of accelerations on three coordinate axes of the coordinate system established with the terminal apparatus 800. For example, the acceleration sensor 811 may be used to detect components of gravitational acceleration in three coordinate axes. The processor 801 may control the display screen 805 to display a user interface in a landscape view or a portrait view based on the gravitational acceleration signal acquired by the acceleration sensor 811. Acceleration sensor 811 may also be used for the acquisition of motion data of a game or user.
The gyro sensor 812 may detect the body direction and rotation angle of the terminal device 800, and may cooperate with the acceleration sensor 811 to collect the 3D motion of the user on the terminal device 800. Based on the data collected by the gyro sensor 812, the processor 801 may implement the following functions: motion sensing (for example, changing the UI according to a tilting operation by the user), image stabilization during shooting, game control, and inertial navigation.
The pressure sensor 813 may be disposed at a side frame of the terminal device 800 and/or at a lower layer of the display 805. When the pressure sensor 813 is disposed at a side frame of the terminal device 800, it can detect a grip signal of the user on the terminal device 800, and the processor 801 performs left/right hand recognition or a shortcut operation according to the grip signal collected by the pressure sensor 813. When the pressure sensor 813 is disposed at the lower layer of the display 805, the processor 801 controls an operable control on the UI according to the pressure operation of the user on the display 805. The operable control includes at least one of a button control, a scroll bar control, an icon control, and a menu control.
The optical sensor 814 is used to collect the ambient light intensity. In one embodiment, the processor 801 may control the display brightness of the display screen 805 based on the ambient light intensity collected by the optical sensor 814. Specifically, when the intensity of the ambient light is high, the display brightness of the display screen 805 is turned up; when the ambient light intensity is low, the display brightness of the display screen 805 is turned down. In another embodiment, the processor 801 may also dynamically adjust the shooting parameters of the camera assembly 806 based on the ambient light intensity collected by the optical sensor 814.
A proximity sensor 815, also known as a distance sensor, is typically provided on the front panel of the terminal device 800. The proximity sensor 815 is used to collect the distance between the user and the front face of the terminal device 800. In one embodiment, when the proximity sensor 815 detects a gradual decrease in the distance between the user and the front face of the terminal device 800, the processor 801 controls the display 805 to switch from the bright screen state to the off screen state; when the proximity sensor 815 detects that the distance between the user and the front surface of the terminal device 800 gradually increases, the processor 801 controls the display screen 805 to switch from the off-screen state to the on-screen state.
It will be appreciated by those skilled in the art that the structure shown in fig. 8 is not limiting and that more or fewer components than shown may be included or certain components may be combined or a different arrangement of components may be employed.
Fig. 9 is a schematic structural diagram of a server according to an embodiment of the present application. The server 900 may vary greatly in configuration or performance, and may include one or more processors 901 and one or more memories 902, where the processor 901 is a CPU. The one or more memories 902 store at least one computer program, and the at least one computer program is loaded and executed by the one or more processors 901 to implement the training method of the video segmentation model or the video slicing method provided by the foregoing method embodiments. Of course, the server 900 may also have components such as a wired or wireless network interface, a keyboard and an input/output interface for input and output, and may further include other components for implementing device functions, which are not described herein.
In an exemplary embodiment, a computer-readable storage medium is also provided, in which at least one computer program is stored. The at least one computer program is loaded and executed by a processor to cause an electronic device to implement any of the training methods of the video segmentation model or any of the video slicing methods described above.
Alternatively, the above-mentioned computer readable storage medium may be a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a Read-Only optical disk (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program or a computer program product is also provided, in which at least one computer program is stored. The at least one computer program is loaded and executed by a processor to cause an electronic device to implement any of the training methods of the video segmentation model or any of the video slicing methods described above.
It should be understood that references herein to "a plurality" mean two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate the following cases: A exists alone, both A and B exist, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
The above embodiments are merely exemplary embodiments of the present application and are not intended to limit the present application, any modifications, equivalent substitutions, improvements, etc. that fall within the principles of the present application should be included in the scope of the present application.

Claims (15)

1. A method for training a video segmentation model, the method comprising:
acquiring a plurality of frames of sample images extracted from a sample video and first labeling information of each frame of sample image, wherein the first labeling information of the sample image represents whether the distance between the sample image and a sample dividing point of the sample video is not more than a distance threshold;
determining first prediction information and second prediction information of each frame of sample image through a neural network model, wherein the first prediction information of the sample image represents the probability that the distance between the sample image and the sample dividing point is not greater than a distance threshold value, and the second prediction information of the sample image represents the prediction offset between the time stamp of the sample image and the time stamp of the sample dividing point;
training the neural network model based on the first labeling information, the first prediction information and the second prediction information of each frame of sample image to obtain a video segmentation model, wherein the video segmentation model is used for segmenting a target video.
2. The method of claim 1, wherein determining the first prediction information and the second prediction information for each frame of sample image by a neural network model comprises:
determining sample characteristics of each frame of sample image through a neural network model;
for any frame of sample image, determining a characteristic difference between the any frame of sample image and an adjacent frame of sample image based on sample characteristics of the any frame of sample image and sample characteristics of the adjacent frame of sample image through the neural network model, wherein the adjacent frame of sample image is a sample image adjacent to the any frame of sample image in the multi-frame sample image;
and determining, through the neural network model, first prediction information and second prediction information of the any frame of sample image based on the feature difference between the any frame of sample image and the adjacent frame of sample image.
3. The method of claim 2, wherein determining sample characteristics of the frames of sample images by a neural network model comprises:
determining the image characteristics of each frame of sample image through a neural network model;
determining, by the neural network model, complementary features of the sample images of each frame, where the complementary features of the sample images include at least one of audio features of sample audio corresponding to the sample images and text features of sample text corresponding to the sample images;
And fusing the image characteristics of each frame of sample image and the complementary characteristics of each frame of sample image through the neural network model to obtain the sample characteristics of each frame of sample image.
4. A method according to claim 3, wherein the fusing, by the neural network model, the image features of the respective frame of sample images and the complementary features of the respective frame of sample images to obtain the sample features of the respective frame of sample images comprises:
fusing the image characteristics of each frame of sample image and the audio characteristics of sample audio corresponding to each frame of sample image through the neural network model to obtain first fusion characteristics of each frame of sample image;
for any frame of sample image, determining a first text feature related to a first fusion feature of the any frame of sample image from text features of sample texts corresponding to the frames of sample images;
fusing the first fusion characteristic of the sample image of any frame with the first text characteristic to obtain a second fusion characteristic of the sample image of any frame;
and determining sample characteristics of each frame of sample image based on the second fusion characteristics of each frame of sample image.
5. A method according to claim 3, wherein the fusing, by the neural network model, the image features of the respective frame of sample images and the complementary features of the respective frame of sample images to obtain the sample features of the respective frame of sample images comprises:
for any frame of sample image, determining a second text characteristic related to the image characteristic of any frame of sample image from the text characteristics of the sample text corresponding to each frame of sample image;
fusing the image features of the sample images of any frame and the second text features through the neural network model to obtain third fusion features of the sample images of any frame;
fusing the third fusion characteristic of the sample image of any frame with the audio characteristic of the sample audio corresponding to the sample image of any frame to obtain a fourth fusion characteristic of the sample image of any frame;
and determining sample characteristics of each frame of sample image based on the fourth fusion characteristics of each frame of sample image.
6. The method according to claim 1, wherein training the neural network model based on the first labeling information, the first prediction information, and the second prediction information of each frame of sample image to obtain a video segmentation model includes:
Determining a first loss of each frame of sample image based on the first labeling information and the first prediction information of each frame of sample image;
determining a second loss of each frame of sample image based on the second prediction information of each frame of sample image, the time stamp of each frame of sample image, and the time stamp of the sample dividing point;
and training the neural network model based on the first loss of each frame of sample image and the second loss of each frame of sample image to obtain a video segmentation model.
7. The method of claim 6, wherein the determining the second loss of each frame of sample image based on the second prediction information of each frame of sample image, the time stamp of each frame of sample image, and the time stamp of the sample dividing point comprises:
for any frame of sample image, determining the sum of second prediction information of the any frame of sample image and the time stamp of the any frame of sample image as a reference time stamp;
and determining a second loss of the any frame of sample image based on a first threshold and a difference between the reference time stamp and the time stamp of the sample dividing point.
8. The method according to claim 1, wherein the method further comprises:
Determining third prediction information of each frame of sample image through the neural network model, wherein the third prediction information of the sample image represents the prediction probability that the sample image belongs to each image category;
acquiring second labeling information of each frame of sample image, wherein the second labeling information of the sample image characterizes whether the sample image obtained by labeling belongs to each image category or not;
training the neural network model based on the first labeling information, the first prediction information and the second prediction information of each frame of sample image to obtain a video segmentation model, wherein the training comprises the following steps:
and training the neural network model based on the first labeling information, the first prediction information, the second prediction information, the third prediction information and the second labeling information of each frame of sample image to obtain a video segmentation model.
9. The method according to claim 8, wherein training the neural network model based on the first labeling information, the first prediction information, the second prediction information, the third prediction information, and the second labeling information of each frame of sample image to obtain a video segmentation model includes:
For any frame of sample image, determining positive sample loss of the any frame of sample image based on the prediction probability that the any frame of sample image belongs to a first category, wherein the first category is an image category to which the any frame of sample image obtained through labeling belongs;
determining negative sample loss of any frame of sample image based on a second threshold and a prediction probability that the any frame of sample image belongs to a second category, wherein the second category is an image category that the any frame of sample image obtained through labeling does not belong to;
determining a third loss of the any one frame of sample images based on the positive sample loss of the any one frame of sample images and the negative sample loss of the any one frame of sample images;
and training the neural network model based on the third loss, the first labeling information, the first prediction information and the second prediction information of each frame of sample image to obtain a video segmentation model.
10. A method of video slicing, the method comprising:
acquiring a video segmentation model and a multi-frame target image extracted from a target video, wherein the video segmentation model is obtained by training according to the training method of the video segmentation model as set forth in any one of claims 1 to 9;
Determining first prediction information and second prediction information of a target image of each frame through the video segmentation model, wherein the first prediction information of the target image represents the probability that the distance between the target image and a target segmentation point of the target video is not greater than a distance threshold value, and the second prediction information of the target image represents the prediction offset between a time stamp of the target image and a time stamp of the target segmentation point;
and cutting the target video based on the first prediction information and the second prediction information of the target image of each frame to obtain at least two video sequences.
11. A training device for a video segmentation model, the device comprising:
the acquisition module is used for acquiring a plurality of frames of sample images extracted from a sample video and first labeling information of each frame of sample image, wherein the first labeling information of the sample image represents whether the distance between the sample image and a sample dividing point of the sample video is not more than a distance threshold;
a determining module, configured to determine, by using a neural network model, first prediction information and second prediction information of each frame of sample image, where the first prediction information of the sample image characterizes a probability that a distance between the sample image and the sample dividing point is not greater than a distance threshold, and the second prediction information of the sample image characterizes a prediction offset between a timestamp of the sample image and a timestamp of the sample dividing point;
The training module is used for training the neural network model based on the first labeling information, the first prediction information and the second prediction information of each frame of sample image to obtain a video segmentation model, and the video segmentation model is used for segmenting a target video.
12. A video slicing apparatus, the apparatus comprising:
the acquisition module is used for acquiring a video segmentation model and a multi-frame target image extracted from a target video, wherein the video segmentation model is obtained by training according to the training method of the video segmentation model as set forth in any one of claims 1 to 9;
the determining module is used for determining first prediction information and second prediction information of each frame of target image through the video segmentation model, wherein the first prediction information of the target image represents the probability that the distance between the target image and a target segmentation point of the target video is not greater than a distance threshold value, and the second prediction information of the target image represents the prediction offset between a time stamp of the target image and a time stamp of the target segmentation point;
and the segmentation module is used for segmenting the target video based on the first prediction information and the second prediction information of the target image of each frame to obtain at least two video sequences.
13. An electronic device comprising a processor and a memory, wherein the memory stores at least one computer program, the at least one computer program being loaded and executed by the processor to cause the electronic device to implement the training method of the video segmentation model according to any one of claims 1 to 9 or the video slicing method according to claim 10.
14. A computer readable storage medium, wherein at least one computer program is stored in the computer readable storage medium, and the at least one computer program is loaded and executed by a processor, to cause an electronic device to implement the training method of the video segmentation model according to any one of claims 1 to 9 or implement the video slicing method according to claim 10.
15. A computer program product, characterized in that at least one computer program is stored in the computer program product, which is loaded and executed by a processor to cause an electronic device to implement the training method of the video segmentation model according to any one of claims 1 to 9 or to implement the video slicing method according to claim 10.
CN202211394354.XA 2022-11-08 2022-11-08 Training method of video segmentation model, video segmentation method and device Pending CN116977884A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211394354.XA CN116977884A (en) 2022-11-08 2022-11-08 Training method of video segmentation model, video segmentation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211394354.XA CN116977884A (en) 2022-11-08 2022-11-08 Training method of video segmentation model, video segmentation method and device

Publications (1)

Publication Number Publication Date
CN116977884A true CN116977884A (en) 2023-10-31

Family

ID=88471913

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211394354.XA Pending CN116977884A (en) 2022-11-08 2022-11-08 Training method of video segmentation model, video segmentation method and device

Country Status (1)

Country Link
CN (1) CN116977884A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117710777A (en) * 2024-02-06 2024-03-15 腾讯科技(深圳)有限公司 Model training method, key frame extraction method and device

Similar Documents

Publication Publication Date Title
CN109299315B (en) Multimedia resource classification method and device, computer equipment and storage medium
CN113395542B (en) Video generation method and device based on artificial intelligence, computer equipment and medium
CN110650379B (en) Video abstract generation method and device, electronic equipment and storage medium
CN113010740B (en) Word weight generation method, device, equipment and medium
CN110263131B (en) Reply information generation method, device and storage medium
CN111739517B (en) Speech recognition method, device, computer equipment and medium
CN111027490B (en) Face attribute identification method and device and storage medium
CN111581958A (en) Conversation state determining method and device, computer equipment and storage medium
CN111432245B (en) Multimedia information playing control method, device, equipment and storage medium
CN111836073B (en) Method, device and equipment for determining video definition and storage medium
CN110675473B (en) Method, device, electronic equipment and medium for generating GIF dynamic diagram
CN114333774B (en) Speech recognition method, device, computer equipment and storage medium
CN111835621A (en) Session message processing method and device, computer equipment and readable storage medium
CN116977884A (en) Training method of video segmentation model, video segmentation method and device
CN110728167A (en) Text detection method and device and computer readable storage medium
CN113409770A (en) Pronunciation feature processing method, pronunciation feature processing device, pronunciation feature processing server and pronunciation feature processing medium
CN115168643B (en) Audio processing method, device, equipment and computer readable storage medium
CN113763931B (en) Waveform feature extraction method, waveform feature extraction device, computer equipment and storage medium
CN114462580A (en) Training method of text recognition model, text recognition method, device and equipment
CN115130456A (en) Sentence parsing and matching model training method, device, equipment and storage medium
CN111259252B (en) User identification recognition method and device, computer equipment and storage medium
CN111341317B (en) Method, device, electronic equipment and medium for evaluating wake-up audio data
CN114328815A (en) Text mapping model processing method and device, computer equipment and storage medium
CN114691860A (en) Training method and device of text classification model, electronic equipment and storage medium
CN115691476B (en) Training method of voice recognition model, voice recognition method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication