CN117011744A - Video clip determining method, device, equipment, storage medium and program product - Google Patents

Video clip determining method, device, equipment, storage medium and program product

Info

Publication number
CN117011744A
Authority
CN
China
Prior art keywords
video
key
sequence
slicing
features
Prior art date
Legal status
Pending
Application number
CN202211485084.3A
Other languages
Chinese (zh)
Inventor
甘蓓
谯睿智
吴昊谦
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202211485084.3A priority Critical patent/CN117011744A/en
Publication of CN117011744A publication Critical patent/CN117011744A/en
Pending legal-status Critical Current

Classifications

    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • H04N 21/4394: Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H04N 21/44008: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N 21/8456: Structuring of content by decomposing the content in the time domain, e.g. in time segments

Abstract

The application provides a method, a device, equipment, a storage medium and a program product for determining video clips. The method comprises the following steps: obtaining a video slice sequence obtained by video slicing a target video; for each video slice in the video slice sequence, performing feature extraction on each video key frame to obtain a picture feature of each video key frame, and performing feature extraction on the audio frame corresponding to each video key frame in the video slice to obtain an audio feature of each audio frame; performing feature fusion on the picture features of each video slice and the corresponding audio features to obtain a fusion feature of each video slice; predicting, based on the fusion feature of each video slice, the key degree of each video slice in the target video to obtain a key degree sequence corresponding to the video slice sequence; and determining a key video clip from the target video based on the key degree sequence. The method and the device can effectively improve the accuracy of determining key video clips in a video.

Description

Video clip determining method, device, equipment, storage medium and program product
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and apparatus for determining a video clip, an electronic device, a storage medium, and a program product.
Background
Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
In the related art, detection of video clips is usually implemented by means of picture-level structured detection. For videos whose pictures are complex and changeable, the structured detection itself becomes complicated, so the accuracy of detecting video clips in the related art is low.
Disclosure of Invention
The embodiment of the application provides a method, a device, electronic equipment, a computer readable storage medium and a computer program product for determining video clips, which can effectively improve the accuracy of determining key video clips in videos.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a method for determining video clips, which comprises the following steps:
obtaining a video slice sequence obtained by video slicing a target video, wherein each video slice comprises at least one video key frame and an audio frame corresponding to each video key frame;
for each video slice in the video slice sequence, performing feature extraction on each video key frame to obtain a picture feature of each video key frame, and performing feature extraction on the audio frame corresponding to each video key frame in the video slice to obtain an audio feature of each audio frame;
performing feature fusion on the picture features of each video slice and the corresponding audio features to obtain a fusion feature of each video slice;
predicting, based on the fusion feature of each video slice, the key degree of each video slice in the target video, to obtain a key degree sequence corresponding to the video slice sequence;
and determining a key video clip from the target video based on the key degree sequence.
The embodiment of the application provides a video clip determining device, which comprises:
The acquisition module is configured to acquire a video slice sequence obtained by video slicing a target video, wherein each video slice comprises at least one video key frame and an audio frame corresponding to each video key frame;
the feature extraction module is configured to, for each video slice in the video slice sequence, perform feature extraction on each video key frame to obtain a picture feature of each video key frame, and perform feature extraction on the audio frame corresponding to each video key frame in the video slice to obtain an audio feature of each audio frame;
the feature fusion module is configured to perform feature fusion on the picture features of each video slice and the corresponding audio features to obtain a fusion feature of each video slice;
the prediction module is configured to predict, based on the fusion feature of each video slice, the key degree of each video slice in the target video, to obtain a key degree sequence corresponding to the video slice sequence;
and the determining module is configured to determine a key video clip from the target video based on the key degree sequence.
In some embodiments, the acquisition module is further configured to acquire a target video and a slicing step, where the slicing step characterizes the number of video frames included in a video slice, and the video frames include video key frames and video non-key frames; and perform video slicing on the target video according to the slicing step to obtain the video slice sequence.
In some embodiments, the above feature fusion module is further configured to perform the following processing for each video slice: splicing the picture features of the video clips to obtain spliced picture features of the video clips, and splicing the audio features of the video clips to obtain spliced audio features of the video clips; and acquiring the picture weight and the audio weight of the video segment, and carrying out weighted fusion on the spliced picture characteristic and the spliced audio characteristic of the video segment based on the picture weight and the audio weight of the video segment to obtain the fusion characteristic of the video segment.
In some embodiments, the prediction module is further configured to invoke a target prediction model based on the fusion feature of each video slice to predict each video slice, so as to obtain the key degree of each video slice; combine the key degrees of the video slices into a candidate key degree sequence according to the playing time order of the video slices in the video slice sequence; and smooth the candidate key degree sequence to obtain the key degree sequence corresponding to the video slice sequence.
In some embodiments, the prediction module is further configured to obtain a playing time of a video segment corresponding to each key probability in the candidate key degree sequence and a smooth time interval, and compare each playing time with the smooth time interval to obtain an interval comparison result corresponding to each key probability; and smoothing the key probability of the playing time in the smooth time interval in the candidate key degree sequence based on the interval comparison result to obtain the key degree sequence.
In some embodiments, the prediction module is further configured to perform the following processing for each key probability whose playing time is in the smooth time interval in the candidate key degree sequence: in the candidate key degree sequence, taking the position of the key probability in the candidate key degree sequence as a central position, select at least two reference key probabilities at equal intervals; and perform a weighted average on the at least two reference key probabilities to obtain a weighted average probability, and determine the weighted average probability as the smooth key probability corresponding to the key probability.
In some embodiments, the apparatus for determining a video clip further includes: the training module is used for acquiring at least two video slicing samples, wherein the at least two video slicing samples belong to different video topics, and each video slicing sample comprises at least one video key frame sample and an audio frame sample corresponding to each video key frame sample; for each video slicing sample, performing feature extraction on the video key frame sample to obtain picture sample features of the video key frame sample, and performing feature extraction on an audio frame sample corresponding to the video key frame sample in the video slicing sample to obtain audio sample features of the audio frame sample; performing feature fusion on the picture sample features of each video slicing sample and the corresponding audio sample features to obtain fusion sample features of each video slicing sample; based on the fusion sample characteristics of each video slicing sample, calling a prediction model to predict each video slicing sample, and obtaining the prediction key probability of each video slicing sample; and training the prediction model based on the prediction key probability of each video slice sample to obtain the target prediction model.
In some embodiments, the training module is further configured to obtain a label key probability of each of the video slicing samples; determining a loss value of each tag key probability based on each tag key probability and the corresponding predicted key probability; summing the loss values of the key probabilities of the labels to obtain a training loss value; and training the prediction model based on the training loss value to obtain the target prediction model.
In some embodiments, the determining module is further configured to obtain, from the criticality sequence, at least one criticality sub-sequence composed of a plurality of consecutive criticalities, wherein the number of criticalities in the criticality sub-sequence is greater than or equal to a first threshold, and each criticality in the criticality sub-sequence is greater than or equal to a criticality threshold; determine, from the video slice sequence, the video slice sub-sequence corresponding to each criticality sub-sequence; and determine, from the target video, the video clip corresponding to each video slice sub-sequence, and determine that video clip as the key video clip.
In some embodiments, the determining module is further configured to obtain, from the criticality sequence, at least one criticality sub-sequence composed of a plurality of consecutive criticalities, wherein the number of criticalities in the criticality sub-sequence is greater than or equal to a first threshold and less than or equal to a second threshold, and the number of criticalities in the criticality sub-sequence that reach a criticality threshold is greater than or equal to a third threshold; determine, from the video slice sequence, the video slice sub-sequence corresponding to each criticality sub-sequence; and determine, from the target video, the video clip corresponding to each video slice sub-sequence, and determine that video clip as the key video clip.
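For illustration only, the following Python sketch shows one way the first of these embodiments could be implemented; the function name, the sentinel-based scan and the example values are assumptions of this sketch rather than details of the application, and the mapping of the returned index spans back to video slices and video clips is omitted:

```python
from typing import List, Tuple

def find_key_clip_spans(criticality: List[float],
                        criticality_threshold: float,
                        first_threshold: int) -> List[Tuple[int, int]]:
    """Return (start, end) slice indices (end exclusive) of runs of consecutive
    criticalities that are all >= criticality_threshold and whose length is
    >= first_threshold, following the first embodiment described above."""
    spans, start = [], None
    for i, value in enumerate(criticality + [float("-inf")]):  # sentinel closes the last run
        if value >= criticality_threshold:
            start = i if start is None else start
        else:
            if start is not None and i - start >= first_threshold:
                spans.append((start, i))
            start = None
    return spans

# Example criticality sequence for 9 video slices; threshold 0.7, at least 3 consecutive slices.
print(find_key_clip_spans([0.2, 0.8, 0.9, 0.75, 0.1, 0.9, 0.3, 0.85, 0.9], 0.7, 3))  # [(1, 4)]
```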
In some embodiments, the apparatus for determining a video clip further includes: a recommendation module, configured to clip the key video clip in the target video to obtain a key video snippet; acquire a target object interested in the target video; and recommend the key video snippet to the target object.
An embodiment of the present application provides an electronic device, including:
a memory for storing computer executable instructions or computer programs;
and the processor is used for realizing the method for determining the video clips provided by the embodiment of the application when executing the computer executable instructions or the computer programs stored in the memory.
The embodiment of the application provides a computer readable storage medium, which stores computer executable instructions for causing a processor to execute the method for determining video clips.
Embodiments of the present application provide a computer program product comprising a computer program or computer-executable instructions stored in a computer-readable storage medium. The processor of the electronic device reads the computer-executable instructions from the computer-readable storage medium, and executes the computer-executable instructions, so that the electronic device performs the method for determining a video clip according to the embodiment of the present application.
The embodiment of the application has the following beneficial effects:
and extracting features of video key frames and audio frames of each video slice in the video slice sequence by acquiring a video slice sequence obtained by video slicing of the target video, obtaining picture features and audio features, and fusing the picture features and the audio features to obtain fusion features. And predicting based on the fusion characteristics of the video fragments to obtain a key degree sequence, and determining the key video fragments from the target video based on the key degree sequence. Therefore, the characteristics of the audio and the picture of the video are effectively fused, the determined fusion characteristics can reflect the characteristics of each video fragment more accurately, and the fusion characteristics are used for predicting later.
Drawings
FIG. 1 is a schematic diagram of a method for determining video clips in the related art;
FIG. 2 is a schematic architecture diagram of a video clip determination system according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an electronic device for determining video clips according to an embodiment of the present application;
FIG. 4 is a flowchart of a method for determining a video clip according to an embodiment of the present application;
FIG. 5 is a flowchart of a method for determining a video clip according to an embodiment of the present application;
FIG. 6 is a flowchart of a method for determining a video clip according to an embodiment of the present application;
FIG. 7 and FIG. 8 are schematic flow diagrams of a method for determining video clips according to an embodiment of the present application;
FIG. 9 is a schematic diagram comparing the effects of a key degree sequence and a candidate key degree sequence according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a method for determining a video clip according to an embodiment of the present application;
FIG. 11 is a flowchart of a method for determining a video clip according to an embodiment of the present application;
FIG. 12 is a flowchart of a method for determining a video clip according to an embodiment of the present application;
FIG. 13 is a schematic diagram of a method for determining a video clip according to an embodiment of the present application;
FIG. 14 is a flowchart of a method for determining a video clip according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application clearer, the present application is further described in detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without creative effort fall within the scope of protection of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
In the following description, the terms "first", "second", "third" and the like are merely used to distinguish similar objects and do not represent a specific ordering of the objects, it being understood that the "first", "second", "third" may be interchanged with a specific order or sequence, as permitted, to enable embodiments of the application described herein to be practiced otherwise than as illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
Before the embodiments of the present application are described in further detail, the terms involved in the embodiments of the present application are explained; the following explanations apply to these terms as used herein.
1) Artificial intelligence (Artificial Intelligence, AI): a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain optimal results. Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics.
2) Convolutional neural network (CNN, Convolutional Neural Networks): a type of feedforward neural network (FNN, Feedforward Neural Networks) with a deep structure that involves convolution computation, and one of the representative algorithms of deep learning. Convolutional neural networks have the capability of representation learning (Representation Learning) and, owing to their hierarchical structure, can perform shift-invariant classification of input images.
3) Convolution layer: each convolution layer (Convolutional Layer) in the convolution neural network is composed of a plurality of convolution units, and parameters of each convolution unit are optimized through a back propagation algorithm. The purpose of convolution operations is to extract different features of the input, and the first layer of convolution may only extract some low-level features such as edges, lines, and corners, and more layers of the network may iteratively extract more complex features from the low-level features.
4) Pooling layer: after feature extraction is performed by a convolution layer, the output feature map is passed to the pooling layer for feature selection and information filtering. The pooling layer contains a predefined pooling function, whose role is to replace the result of a single point in the feature map with the statistics of its neighbouring region. The way the pooling layer selects pooling regions is the same as the way a convolution kernel scans the feature map, and is controlled by the pooling size, the stride and the padding.
5) Fully connected layer (Fully-Connected Layer): the fully connected layer in a convolutional neural network is equivalent to the hidden layer in a conventional feedforward neural network. The fully connected layers are located at the last part of the hidden layers of the convolutional neural network and only pass signals to other fully connected layers. The feature map loses its spatial topology in the fully connected layers; it is expanded into a vector and passed through the activation function.
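As an illustrative aside (not part of the embodiments of the application), the following PyTorch sketch stacks the three kinds of layers defined above; all layer sizes and the input resolution are arbitrary assumptions:

```python
import torch
import torch.nn as nn

# Toy CNN: convolution layers extract features, pooling layers downsample them,
# and a fully connected layer maps the flattened feature map to class scores.
cnn = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),   # convolution layer (low-level features)
    nn.ReLU(),
    nn.MaxPool2d(2),                             # pooling layer (feature selection / filtering)
    nn.Conv2d(8, 16, kernel_size=3, padding=1),  # deeper convolution (more complex features)
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),                                # feature map loses spatial topology here
    nn.Linear(16 * 8 * 8, 10),                   # fully connected layer
)
scores = cnn(torch.randn(1, 3, 32, 32))          # one 32x32 RGB image
print(scores.shape)                              # torch.Size([1, 10])
```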
6) Key video clip: a highlight video clip that strongly reflects the content of the video body and contains important information of the video. A video includes key video clips and non-key video clips, and the key degree of a key video clip is greater than that of a non-key video clip.
7) Multiplayer online tactical athletic game (Multiplayer Online Battle Arena, MOBA): also known as an action real-time strategy game (ARTS, Action Real-Time Strategy). In multiplayer online tactical athletic games, equipment is usually purchased, and players are usually divided into two teams that compete against each other on a scattered game map, each player controlling a selected character through a game interface. Such games usually do not require the operation of organizational units common in real-time strategy games, such as building groups and resources; each player only controls the game character he or she has chosen.
In the implementation of the embodiments of the present application, the applicant found that the related art has the following problems:
in the related art, referring to fig. 1, fig. 1 is a schematic diagram of a method for determining video clips in the related art. Although the related deep-learning-based method for detecting key game video can accurately identify key events of a game video and thereby obtain key clips of the game video, the related method for detecting key game clips has great limitations. The method is essentially event detection (for example, mini-map detection, defensive tower detection, broadcast detection, health bar detection, skill detection, etc.), and relies on picture-level structured detection to identify markers of specific areas. On the one hand, once the game interface changes, the viewing angle switches, or the anchor applies templates and occlusions in a live broadcast, the algorithm may fail; it cannot generalize to other games, a customized scheme has to be formulated for each game, and a large amount of manpower is needed to label each game, so the cost is very high. On the other hand, the related recognition schemes cannot make full use of the time-sequence information of the video or the multi-modal information of the video, and some clips that feature strong confrontation and strong player and audience reactions but no casualties cannot be recognized by the current algorithms at all. Therefore, a scheme for determining key clips of game video should have the following characteristics: it has a certain robustness and can recognize correctly when the picture changes; it is not designed only for a certain specific game, but can generalize to various games of the multiplayer online tactical athletic (MOBA) type; and it makes use of the time-sequence information and the audio modality of the video, so that it can combine contextual semantics and audio information and recognize key clips in cases where there are no casualties but the fighting is intense.
The embodiment of the application provides a method, a device, electronic equipment, a computer readable storage medium and a computer program product for determining video clips, which can effectively improve the accuracy of determining key video clips in videos.
Referring to fig. 2, fig. 2 is a schematic architecture diagram of a video clip determining system 100 according to an embodiment of the present application, where a terminal (a terminal 400 is shown in an example) is connected to a server 200 through a network 300, and the network 300 may be a wide area network or a local area network, or a combination of the two.
The terminal 400 is configured to display the key video clip on a graphical interface 410-1 (a graphical interface 410-1 is shown as an example) for a user of the client 410. The terminal 400 and the server 200 are connected to each other through a wired or wireless network.
In some embodiments, the server 200 may be a stand-alone physical server, a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and basic cloud computing services such as big data and artificial intelligence platforms. The terminal 400 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart television, a smart watch, a car terminal, etc. The electronic device provided by the embodiment of the application can be implemented as a terminal or a server. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiment of the present application.
In some embodiments, the terminal 400 acquires the target video and sends the target video to the server 200, and the server 200 performs video slicing on the target video to obtain a video slicing sequence; for each video clip in the video clip sequence, a key video clip is determined from the target video and the determined video clip is sent to the terminal 400.
In other embodiments, the server 200 obtains a target video, performs video slicing on the target video to obtain a video slicing sequence, determines a key video slice from the target video for each video slice in the video slicing sequence, and sends the determined video slice to the terminal 400.
In other embodiments, the embodiments of the present application may be implemented by means of cloud technology (Cloud Technology), which refers to a hosting technology that unifies a series of resources such as hardware, software and networks in a wide area network or a local area network to implement the calculation, storage, processing and sharing of data.
Cloud technology is a general term for network technology, information technology, integration technology, management platform technology, application technology and the like based on the cloud computing business model; these technologies can form a resource pool that is used on demand and is flexible and convenient. Cloud computing technology will become an important support, since the background services of technical network systems require a large amount of computing and storage resources.
Referring to fig. 3, fig. 3 is a schematic structural diagram of an electronic device 500 for determining video clips according to an embodiment of the present application, where the electronic device 500 shown in fig. 3 may be the server 200 or the terminal 400 in fig. 2, and the electronic device 500 shown in fig. 3 includes: at least one processor 410, a memory 450, at least one network interface 420. The various components in electronic device 500 are coupled together by bus system 440. It is understood that the bus system 440 is used to enable connected communication between these components. The bus system 440 includes a power bus, a control bus, and a status signal bus in addition to the data bus. But for clarity of illustration the various buses are labeled in fig. 3 as bus system 440.
The processor 410 may be an integrated circuit chip having signal processing capabilities, such as a general-purpose processor (for example, a microprocessor or any conventional processor), a digital signal processor (DSP, Digital Signal Processor), another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
Memory 450 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like. Memory 450 optionally includes one or more storage devices physically remote from processor 410.
Memory 450 includes volatile memory or nonvolatile memory, and may also include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a random access Memory (RAM, random Access Memory). The memory 450 described in embodiments of the present application is intended to comprise any suitable type of memory.
In some embodiments, memory 450 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 451, including system programs such as a framework layer, a core library layer and a driver layer, for handling various basic system services and performing hardware-related tasks, so as to implement various basic services and handle hardware-based tasks;
a network communication module 452 for accessing other electronic devices via one or more (wired or wireless) network interfaces 420, where exemplary network interfaces 420 include: Bluetooth, Wireless Fidelity (WiFi), universal serial bus (USB, Universal Serial Bus), and the like.
In some embodiments, the video clip determining apparatus provided in the embodiments of the present application may be implemented in software, and fig. 3 shows the video clip determining apparatus 455 stored in the memory 450, which may be software in the form of a program and a plug-in, and includes the following software modules: the acquisition module 4551, the feature extraction module 4552, the feature fusion module 4553, the prediction module 4554, the determination module 4555 are logical, and thus may be arbitrarily combined or further split according to the implemented functions. The functions of the respective modules will be described hereinafter.
In other embodiments, the video clip determining apparatus provided in the embodiments of the present application may be implemented in hardware, and by way of example, the video clip determining apparatus provided in the embodiments of the present application may be a processor in the form of a hardware decoding processor that is programmed to perform the video clip determining method provided in the embodiments of the present application, for example, the processor in the form of a hardware decoding processor may employ one or more application specific integrated circuits (ASIC, application Specific Integrated Circuit), DSP, programmable logic device (PLD, programmable Logic Device), complex programmable logic device (CPLD, complex Programmable Logic Device), field programmable gate array (FPGA, field-Programmable Gate Array), or other electronic components.
In some embodiments, the terminal or the server may implement the method for determining video clips provided by the embodiments of the present application by running a computer program or computer executable instructions. For example, the computer program may be a native program (e.g., a dedicated video clip determining program) or a software module in an operating system, e.g., a video clip determining module that may be embedded in any program (e.g., an instant messaging client, an album program, an electronic map client, a navigation client); for example, a Native Application (APP) may be used, i.e. a program that needs to be installed in an operating system to be run. In general, the computer programs described above may be any form of application, module or plug-in.
The method for determining video clips provided by the embodiment of the present application will be described in conjunction with exemplary applications and implementations of the server or the terminal provided by the embodiment of the present application.
Referring to fig. 4, fig. 4 is a schematic flow chart of a method for determining a video clip according to an embodiment of the present application, which is described with reference to steps 101 to 105 shown in fig. 4. The method for determining a video clip provided by the embodiment of the present application may be implemented by a server or a terminal alone, or by a server and a terminal in cooperation; the following description takes implementation by a server alone as an example.
In step 101, a video slicing sequence obtained by video slicing a target video is acquired.
In some embodiments, each video slice includes at least one (e.g., at least two) video key frames, and an audio frame corresponding to each video key frame.
In some embodiments, the target video comprises a plurality of video slices, each video slice comprising at least one video key frame and at least one (e.g., at least two) video non-key frame, each video key frame corresponding to an audio frame, the target video may be a video in a different form such as a network game video, an audiovisual video, or the like.
In some embodiments, the video key frames may be video frames corresponding to audio frames and the video non-key frames may be video frames not corresponding to audio frames.
In some embodiments, referring to fig. 5, fig. 5 is a flowchart of a method for determining a video clip according to an embodiment of the present application, and step 101 shown in fig. 5 may be implemented by performing the following steps 1011 to 1012.
In step 1011, a target video and a slicing step are acquired, the slicing step representing the number of video frames included in a video slice, the video frames including video key frames and video non-key frames.
In some embodiments, the slicing step characterizes a number of video frames included in the video slice, the number of video frames being equal to a sum of a number of video key frames and a number of video non-key frames.
In step 1012, video slicing is performed on the target video according to the slicing step, so as to obtain a video slicing sequence.
In some embodiments, the slicing step size and the number of video slices in the video slicing sequence have the following relationship: the product of the slicing step size and the number of video slices in the video slicing sequence is equal to the number of video frames in the target video, which is equal to the sum of the number of video key frames and the number of video non-key frames.
As an example, referring to fig. 6, fig. 6 is a schematic diagram of a method for determining a video clip according to an embodiment of the present application. A target video 1 and a slicing step are acquired, the slicing step characterizing the number of video frames included in a video slice, the video frames including video key frames and video non-key frames. Video slicing is performed on the target video according to the slicing step to obtain a video slice sequence 2.
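A minimal sketch of the slicing step described above, assuming the slicing step evenly divides the number of frames; the function name and data layout are illustrative only:

```python
from typing import List

def video_slicing(frame_indices: List[int], slicing_step: int) -> List[List[int]]:
    """Split a frame sequence into consecutive slices of `slicing_step` frames each."""
    return [frame_indices[i:i + slicing_step]
            for i in range(0, len(frame_indices), slicing_step)]

# Example: a target video with 12 frames and a slicing step of 4 yields 3 video slices,
# so slicing_step * number_of_slices == number_of_frames, as described above.
slices = video_slicing(list(range(12)), slicing_step=4)
print(len(slices), slices[0])  # 3 [0, 1, 2, 3]
```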
Therefore, the target video is subjected to video slicing to obtain the video slicing sequence comprising a plurality of video slices, so that feature extraction can be performed on each video slice in parallel in the subsequent feature extraction process, the feature extraction time is effectively saved, the algorithm execution efficiency is effectively improved, and the video slice determination efficiency is effectively improved.
In step 102, for each video slice in the video slice sequence, feature extraction is performed on each video key frame to obtain a picture feature of each video key frame, and feature extraction is performed on the audio frame corresponding to each video key frame in the video slice to obtain an audio feature of each audio frame.
In some embodiments, feature extraction of video key frames may be achieved through an image coding network, the picture feature of a video key frame being a vector representation of the video key frame; feature extraction of audio frames may be achieved through an audio coding network, the audio feature of an audio frame being a vector representation of the audio frame.
As an example, a video slice includes at least one video key frame, and the following processing is performed for each video slice in the sequence of video slices: and extracting the characteristics of each video key frame in the video fragments to obtain the picture characteristics of each video key frame, and extracting the characteristics of the audio frames corresponding to each video key frame in the video fragments to obtain the audio characteristics of each audio frame.
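The following sketch illustrates per-slice feature extraction; the image and audio encoders here are toy stand-ins (simple statistics), not the image coding network and audio coding network of the embodiments:

```python
import numpy as np

def image_encoder(key_frame: np.ndarray) -> np.ndarray:
    # Stand-in picture encoder: per-channel mean as a toy feature vector.
    return key_frame.reshape(-1, key_frame.shape[-1]).mean(axis=0)

def audio_encoder(audio_frame: np.ndarray) -> np.ndarray:
    # Stand-in audio encoder: a few simple statistics as a toy feature vector.
    return np.array([audio_frame.mean(), audio_frame.std(), np.abs(audio_frame).max()])

def extract_slice_features(key_frames, audio_frames):
    """For one video slice, return one picture feature per key frame
    and one audio feature per corresponding audio frame."""
    picture_features = [image_encoder(f) for f in key_frames]
    audio_features = [audio_encoder(a) for a in audio_frames]
    return picture_features, audio_features

# Example: 2 video key frames (8x8 RGB) and their 2 corresponding audio frames.
key_frames = [np.random.rand(8, 8, 3) for _ in range(2)]
audio_frames = [np.random.randn(1024) for _ in range(2)]
pic_feats, aud_feats = extract_slice_features(key_frames, audio_frames)
print(pic_feats[0].shape, aud_feats[0].shape)  # (3,) (3,)
```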
In this way, the picture features and audio features of the corresponding video key frames are extracted for each video slice in the video slice sequence, so that video non-key frames in the target video are filtered out while the video key frames carrying key information and the corresponding audio frames are retained for the subsequent determination of key information of the video slices. This effectively reduces the amount of computation of the algorithm, that is, the efficiency of determining video clips is effectively improved while the accuracy of the determined video clips is also effectively improved.
In step 103, feature fusion is performed on the picture features of each video segment and the corresponding audio features, so as to obtain fusion features of each video segment.
In some embodiments, feature fusion includes splicing and weighted fusion. Splicing refers to a process of splicing at least two vectors into one vector; splicing can reduce the number of vectors and increase their dimension. Weighted fusion refers to a process of fusing at least two vectors according to their respective weights.
In some embodiments, referring to fig. 7, fig. 7 is a flowchart of a method for determining a video clip according to an embodiment of the present application, where the video clip includes at least two video key frames, step 103 shown in fig. 7 may be implemented by performing the following steps 1031 to 1032 for each video clip.
In step 1031, each picture feature of the video clip is spliced to obtain a spliced picture feature of the video clip, and each audio feature of the video clip is spliced to obtain a spliced audio feature of the video clip.
In some embodiments, when the video slice includes at least two video key frames, the number of picture features and the number of audio features corresponding to the video key frames are each the same as the number of video key frames, that is, the video slice includes at least two picture features and the audio feature corresponding to each picture feature. Accordingly, the following processing is performed for each video slice: splicing the picture features of the video slice to obtain the spliced picture feature of the video slice; and splicing the audio features of the video slice to obtain the spliced audio feature of the video slice.
In step 1032, the picture weight and the audio weight of the video segment are obtained, and the spliced picture feature and the spliced audio feature of the video segment are weighted and fused based on the picture weight and the audio weight of the video segment, so as to obtain the fusion feature of the video segment.
In some embodiments, the sum of the picture weight and the audio weight is equal to 1, and the picture weight and the audio weight of the video slice may be specifically set according to the actual situation, for example, may be specifically set according to the type of the target video, for example, the picture weight and the audio weight are set to be equal, or the picture weight is set to be greater than the audio weight, or the picture weight is set to be less than the audio weight.
In some embodiments, when the video clip includes a video key frame, the step 103 may be implemented as follows: and obtaining the picture weight and the audio weight of the video fragments, and carrying out weighted fusion on the picture characteristics and the corresponding audio characteristics of each video fragment according to the picture weight and the audio weight to obtain the fusion characteristics of each video fragment.
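A minimal sketch of the splicing and weighted-fusion steps above, under the assumption that the spliced picture feature and the spliced audio feature have the same dimension so that they can be weighted and summed; the example weights are arbitrary:

```python
import numpy as np

def splice(features):
    """Splice (concatenate) several feature vectors into one vector."""
    return np.concatenate(features, axis=0)

def weighted_fusion(spliced_picture, spliced_audio, picture_weight=0.5, audio_weight=0.5):
    """Weighted fusion of the two modalities; the picture weight and audio weight sum to 1.
    Both spliced features are assumed here to have the same length."""
    assert abs(picture_weight + audio_weight - 1.0) < 1e-6
    return picture_weight * spliced_picture + audio_weight * spliced_audio

# Example: a video slice with 2 key frames and 4-dimensional picture/audio features each.
picture_features = [np.ones(4), 2 * np.ones(4)]
audio_features = [np.zeros(4), np.ones(4)]
fusion_feature = weighted_fusion(splice(picture_features), splice(audio_features),
                                 picture_weight=0.6, audio_weight=0.4)
print(fusion_feature)  # the 8-dimensional fusion feature of this video slice
```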
In this way, feature fusion is performed on the picture features of each video slice and the corresponding audio features to obtain the fusion feature of each video slice, so that the determined fusion feature effectively fuses the features of the two modalities of the video, audio and picture. The fusion feature therefore reflects the characteristics of each video slice more accurately, the subsequent prediction based on it is more accurate, and the accuracy of the determined video clips is effectively improved.
In step 104, based on the fusion characteristics of each video slice, the key degree of each video slice in the target video is predicted, so as to obtain a key degree sequence corresponding to the video slice sequence.
In some embodiments, the prediction is used to determine the criticality of each video slice, and the criticality may be used to determine whether the video slice is a key video slice. The criticality is proportional to the amount of information in the video: the greater the amount of information, the greater the criticality of the corresponding video; the smaller the amount of information, the smaller the corresponding criticality.
In some embodiments, referring to fig. 8, fig. 8 is a flowchart of a method for determining a video clip according to an embodiment of the present application, and step 104 shown in fig. 8 may be implemented by executing the following steps 1041 to 1043.
In step 1041, a target prediction model is invoked to predict each video slice based on the fusion characteristics of each video slice, so as to obtain the key degree of each video slice.
In some embodiments, the target prediction model may be obtained by training a prediction model, and the prediction model may be a time-sequence model. The time-sequence model is a neural network model based on a bidirectional long short-term memory network (Bi-LSTM, Bi-directional Long Short-Term Memory), and includes a convolution layer, an activation layer, a normalization layer, and a bidirectional long short-term memory network serving as the prediction layer.
As an example, referring to fig. 13, based on the fusion feature of each video slice, a convolution layer, an activation layer, a prediction layer, and a normalization layer of the target prediction model are called to predict each video slice, so as to obtain the criticality of each video slice.
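As an illustrative sketch only, the following PyTorch-style model mirrors the layer types mentioned above (convolution, activation, normalization and a bidirectional LSTM with a per-slice output head); the layer sizes, their exact ordering and the sigmoid head are assumptions of this sketch, not the application's actual target prediction model:

```python
import torch
import torch.nn as nn

class TemporalCriticalityModel(nn.Module):
    """Toy time-sequence model: Conv1d -> ReLU -> LayerNorm -> Bi-LSTM -> per-slice probability."""
    def __init__(self, feature_dim: int = 8, hidden_dim: int = 16):
        super().__init__()
        self.conv = nn.Conv1d(feature_dim, hidden_dim, kernel_size=3, padding=1)
        self.act = nn.ReLU()
        self.norm = nn.LayerNorm(hidden_dim)
        self.bilstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, 1)

    def forward(self, fusion_features: torch.Tensor) -> torch.Tensor:
        # fusion_features: (batch, num_slices, feature_dim)
        x = self.conv(fusion_features.transpose(1, 2)).transpose(1, 2)
        x = self.norm(self.act(x))
        x, _ = self.bilstm(x)
        return torch.sigmoid(self.head(x)).squeeze(-1)  # key probability per video slice

model = TemporalCriticalityModel()
fusion = torch.randn(1, 10, 8)      # 1 video, 10 video slices, 8-dim fusion features
key_probabilities = model(fusion)   # shape (1, 10), values in (0, 1)
print(key_probabilities.shape)
```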
In some embodiments, each criticality in the candidate criticality sequence may be indicated by means of a key probability, and the criticality may also be indicated by means of the category obtained by binary classification of the key probability.
In some embodiments, the above step 1041 may be implemented as follows: based on the fusion feature of each video slice, a target prediction model is invoked to predict each video slice, so as to obtain the key probability of each video slice; each key probability is compared with a key probability threshold to obtain a probability comparison result; when the probability comparison result indicates that the key probability is smaller than the key probability threshold, the criticality of the video slice is determined to be that of a non-key video slice; and when the probability comparison result indicates that the key probability is greater than or equal to the key probability threshold, the criticality of the video slice is determined to be that of a key video slice.
In other embodiments, the step 1041 may be implemented as follows: based on the fusion characteristics of each video segment, a target prediction model is called to predict each video segment, so that the key probability of each video segment is obtained; and determining each key probability as the key degree of the corresponding video slice.
In some embodiments, prior to step 1041 described above, the target prediction model may be trained by: acquiring at least two video slicing samples, wherein the at least two video slicing samples belong to different video topics, and each video slicing sample comprises at least one video key frame sample and an audio frame sample corresponding to each video key frame sample; performing feature extraction on video key frame samples aiming at each video slicing sample to obtain picture sample features of the video key frame samples, and performing feature extraction on audio frame samples corresponding to the video key frame samples in the video slicing samples to obtain audio sample features of the audio frame samples; carrying out feature fusion on the picture sample features of each video slicing sample and the corresponding audio sample features to obtain fusion sample features of each video slicing sample; based on the fusion sample characteristics of each video slicing sample, calling a prediction model to predict each video slicing sample, and obtaining the prediction key probability of each video slicing sample; and training the prediction model based on the prediction key probability of each video slice sample to obtain a target prediction model.
In some embodiments, at least two video clip samples are assigned to different video themes, and in an application scenario of a network game, the video theme to which each of the at least two video clip samples is assigned may be a different type of network game, where the type of network game includes a music game, a shooting game, a strategy game, a multiplayer online combat game, a strategic role playing game, an instant strategic game, a sports game, and so on. The types of the video topics can be specifically set according to different application scenes, and it is understood that the more the number of video slicing samples is, the more the types of the video topics are, and the better the training effect on the prediction model is.
In some embodiments, obtaining at least two video slicing samples may be achieved as follows: at least one video slicing sample belonging to a fight game video theme, at least one video slicing sample belonging to a shooting game video theme, at least one video slicing sample belonging to a strategy game video theme, and at least one video slicing sample belonging to a multiplayer online fight game video theme are obtained.
In some embodiments, the video slicing samples carry label key probabilities of the video slicing samples, the label key probabilities represent actual key degrees of the video slicing samples, and the prediction model is trained based on the prediction key probabilities of the video slicing samples to obtain a target prediction model, which can be achieved by the following modes: and determining a training loss value of the prediction model based on the label key probability and the prediction key probability, and training the prediction model based on the training loss value to obtain a target prediction model.
In some embodiments, the training loss value of the prediction model may be the difference between the label key probability and the predicted key probability. Although the loss value of the prediction model is determined based on the label key probability and the predicted key probability, the specific form of the training loss value may differ: it can be determined according to different types of loss functions, a loss function being a function whose parameters are the label key probability and the predicted key probability.
As an example, the expression of the loss value of the above prediction model may be:
Y_1 = P_1 - P_2 (1)
where Y_1 denotes the loss value of the prediction model, P_1 denotes the label key probability, and P_2 denotes the predicted key probability.
In some embodiments, the training manner of the prediction model may be gradient-update training, or a training manner such as mini-batch gradient descent training.
In some embodiments, the training of the prediction model based on the prediction key probability of each video slice sample to obtain the target prediction model may be implemented as follows: acquiring label key probability of each video slicing sample; determining a loss value of each tag key probability based on each tag key probability and the corresponding predicted key probability; summing the loss values of the key probabilities of all the labels to obtain a training loss value; and training the prediction model based on the training loss value to obtain a target prediction model.
As an example, the expression of the training loss value of the above prediction model may be:
Y_2 = (P_1 - P_2) + (P_3 - P_4) + ... + (P_{n-1} - P_n) (2)
where Y_2 denotes the training loss value of the prediction model, P_1, P_3, ..., P_{n-1} denote the label key probabilities, P_2, P_4, ..., P_n denote the predicted key probabilities, and (P_1 - P_2), (P_3 - P_4), ..., (P_{n-1} - P_n) denote the loss values of the respective label key probabilities.
In some embodiments, determining the loss value of each tag key probability based on each tag key probability and the corresponding predicted key probability may be determined as follows: and calling a loss function based on the key probability of each label and the corresponding predicted key probability to obtain a loss value of each label key probability.
In some embodiments, the loss function may be a cross-entropy loss function, a difference loss function, an exponential loss function, a hinge loss function, or the like. The loss function is defined on a single sample and refers to the error of one sample.
In some embodiments, training the prediction model based on the training loss value to obtain the target prediction model may be achieved by: based on the training loss value, updating model parameters of the prediction model in a gradient updating mode to obtain a prediction model with updated parameters, and determining the prediction model with updated parameters as a target prediction model.
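For illustration, the following sketch runs one gradient-update training loop over a stand-in prediction model; the model, the data and the use of a summed binary cross-entropy loss (one of the loss-function choices mentioned above) are assumptions of this sketch:

```python
import torch
import torch.nn as nn

# Stand-in prediction model: maps an 8-dim fusion sample feature to a predicted key probability.
prediction_model = nn.Sequential(nn.Linear(8, 1), nn.Sigmoid())
optimizer = torch.optim.SGD(prediction_model.parameters(), lr=0.01)

fusion_sample_features = torch.randn(32, 8)                     # 32 video slicing samples
label_key_probability = torch.randint(0, 2, (32, 1)).float()    # label key probabilities (here 0/1)

for _ in range(100):                                            # gradient-update training loop
    predicted_key_probability = prediction_model(fusion_sample_features)
    # Training loss: sum of per-sample losses; binary cross-entropy is used here
    # as one of the loss-function choices mentioned above.
    loss = nn.functional.binary_cross_entropy(
        predicted_key_probability, label_key_probability, reduction="sum")
    optimizer.zero_grad()
    loss.backward()                                             # backpropagation
    optimizer.step()                                            # update model parameters
# The trained prediction_model would then serve as the target prediction model.
```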
In step 1042, the key degree of each video clip is combined into a candidate key degree sequence according to the playing time sequence of each video clip in the video clip sequence.
In some embodiments, the playing time of each video slice is the relative playing time of its first frame, i.e., the time offset of that first frame relative to the starting time of the target video during playback of the target video.
In some embodiments, the playing time corresponding to the video clip of the head of the queue in the video clip sequence is the earliest, the playing time corresponding to the video clip of the tail of the queue in the video clip sequence is the latest, and correspondingly, the playing time corresponding to the key degree of the video clip of the head of the queue in the candidate key degree sequence is the earliest, and the playing time corresponding to the key degree of the video clip of the tail of the queue in the candidate key degree sequence is the latest.
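For illustration only, a minimal Python sketch of step 1042 follows, assuming each video slice is represented by a dictionary with hypothetical fields start_time and key_prob.

```python
def build_candidate_sequence(slices):
    """Order slice key probabilities by playing time to form the candidate key degree sequence.

    slices: list of dicts like {"start_time": float, "key_prob": float}
    """
    ordered = sorted(slices, key=lambda s: s["start_time"])   # earliest playing time first
    play_times = [s["start_time"] for s in ordered]
    candidate_sequence = [s["key_prob"] for s in ordered]     # head = earliest, tail = latest
    return play_times, candidate_sequence
```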
In step 1043, the candidate key degree sequence is smoothed to obtain the key degree sequence corresponding to the video slicing sequence.
In some embodiments, the smoothing is used to eliminate abrupt changes in key degree that may be present in the candidate key degree sequence.
In some embodiments, each key degree in the candidate key degree sequence may be indicated by a key probability (for example, a key probability of 0.8 or 0.08; the value range of the key probability is 0 to 1), or by the category obtained after binary classification of the key probability.
In some embodiments, each key degree in the candidate key degree sequence is indicated by a key probability, and step 1043 may be implemented as follows: the playing time of the video slice corresponding to each key probability in the candidate key degree sequence and a smoothing time interval are acquired, and each playing time is compared with the smoothing time interval to obtain an interval comparison result corresponding to each key probability; based on the interval comparison results, the key probabilities whose playing time falls within the smoothing time interval are smoothed in the candidate key degree sequence to obtain the key degree sequence.
In some embodiments, the interval comparison results characterize whether the playing time of the video clip corresponding to the key probability is located in the smooth time interval.
In some embodiments, smoothing the key probabilities whose playing time falls within the smoothing time interval in the candidate key degree sequence may be implemented as follows: for each key probability whose playing time falls within the smoothing time interval, the following processing is performed: in the candidate key degree sequence, taking the position of the key probability in the candidate key degree sequence as the center position, at least two reference key probabilities are selected at equal intervals; the at least two reference key probabilities are weighted-averaged to obtain a weighted average probability, and the weighted average probability is determined as the smoothed key probability corresponding to the key probability.
In some embodiments, when the number of reference key probabilities is two, the expression for smoothing a key probability may be:

m_i = Med(y_{i-σ}, y_{i+σ})

where m_i denotes the smoothed key probability, Med(·) denotes the weighted average function, σ denotes a hyper-parameter, i.e., the length of the smoothing time interval, i denotes the playing time, y_i denotes the key probability at the i-th playing time, y_{i-σ} denotes the key probability at the (i-σ)-th playing time, and y_{i+σ} denotes the key probability at the (i+σ)-th playing time.
In some embodiments, smoothing the key probabilities whose playing time falls within the smoothing time interval in the candidate key degree sequence based on the interval comparison results, to obtain the key degree sequence, may be implemented as follows: when the interval comparison result indicates that a playing time is within the smoothing time interval, the corresponding key probability is smoothed to obtain a smoothed key probability; when the interval comparison result indicates that a playing time is not within the smoothing time interval, the corresponding key probability itself is taken as the smoothed key probability. The key degree sequence is then constructed from the smoothed key probabilities corresponding to the respective playing times.
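The following Python sketch illustrates one possible realization of the smoothing described above. The default σ = 4 follows the hyper-parameter mentioned later in this description, while the equal weights and function names are assumptions.

```python
def smooth_sequence(candidate_seq, play_times, interval, sigma=4, weights=(0.5, 0.5)):
    """Smooth key probabilities whose playing time lies in `interval` (a (start, end) tuple).

    For each such position i, the reference key probabilities at offsets -sigma and
    +sigma are combined by a weighted average; positions outside the interval keep
    their original key probability.
    """
    start, end = interval
    smoothed = list(candidate_seq)
    n = len(candidate_seq)
    for i, t in enumerate(play_times):
        if not (start <= t <= end):
            continue                                    # outside the smoothing time interval
        left = candidate_seq[max(i - sigma, 0)]         # reference key probability y_{i-sigma}
        right = candidate_seq[min(i + sigma, n - 1)]    # reference key probability y_{i+sigma}
        smoothed[i] = weights[0] * left + weights[1] * right
    return smoothed
```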
As an example, referring to fig. 9, fig. 9 is a schematic diagram comparing the key degree sequence and the candidate key degree sequence according to an embodiment of the present application. As shown by the candidate key degree sequence in fig. 9, the predicted key probabilities in the candidate key degree sequence are distributed relatively unevenly; for example, between playing times 10 and 20, multiple probability values appear at the same time, and between playing times 60 and 70, the predicted key probability changes abruptly. After the key probabilities whose playing times fall within the smoothing time interval are smoothed, the resulting key degree sequence is shown as the key degree sequence in fig. 9: the smoothed key probabilities are distributed evenly, e.g., between playing times 10 and 20 a single probability value appears at each time, and between playing times 60 and 70 the abrupt peak of the predicted key probability is significantly reduced. This effectively reduces the degree of abrupt change and achieves effective smoothing of the candidate key degree sequence.
Therefore, smoothing the candidate key degree sequence to obtain the key degree sequence corresponding to the video slicing sequence achieves effective smoothing of the key degree sequence, so that the key degrees of the target video at different playing times are represented more accurately, the key video snippets subsequently determined based on the key degree sequence are more accurate, and the accuracy of the determined key video snippets is effectively improved.
In step 105, a key video snippet is determined from the target video based on the sequence of key degrees.
In some embodiments, the target video includes a key video snippet and a non-key video snippet, the key video snippet being significantly more critical than the non-key video snippet.
As an example, referring to fig. 10, fig. 10 is a schematic diagram of a method for determining a video clip according to an embodiment of the present application, where a key video clip and a non-key video clip are determined from a target video based on a key sequence.
In some embodiments, referring to fig. 11, fig. 11 is a flowchart illustrating a method for determining a video clip according to an embodiment of the present application, and step 105 shown in fig. 11 may be implemented by performing the following steps 1051 to 1053.
In step 1051, at least one criticality sub-sequence consisting of a plurality of consecutive criticalities is acquired from the criticality sequence, where the number of criticalities in the criticality sub-sequence is greater than or equal to a first threshold.
In some embodiments, each criticality in the criticality sub-sequence is greater than or equal to a criticality threshold.
In some embodiments, each criticality sub-sequence selected in step 1051 is composed of a plurality of consecutive criticalities, each of which is greater than or equal to the criticality threshold, and the number of criticalities in the sub-sequence is greater than or equal to the first threshold.
In some embodiments, the threshold of the criticality and the first threshold may be specifically set according to different application scenarios.
In step 1052, from the video slice sequences, video slice sub-sequences corresponding to each criticality sub-sequence are determined.
In some embodiments, the criticality sequence and the video slicing sequence correspond one-to-one at each playing time, and step 1052 may be implemented as follows: the following processing is performed for each criticality sub-sequence: the starting playing time and the ending playing time corresponding to the criticality sub-sequence are determined, and the video slices between the starting playing time and the ending playing time are determined from the video slicing sequence as the video slicing sub-sequence corresponding to that criticality sub-sequence.
In step 1053, from the target video, video clips corresponding to each video clip sub-sequence are determined, and each video clip is determined to be a key video clip.
In some embodiments, the key degree sequence and each playing time in the video slicing sequence are in one-to-one correspondence, and the target video and each playing time in the video slicing sequence are in one-to-one correspondence. The step 1053 may be implemented as follows: the following processing is performed for each video slice sub-sequence: and determining the starting playing time and the ending playing time of the video segment sub-sequence, determining a target video segment between the starting playing time and the ending playing time from the target video, and determining the target video segment as a key video segment.
In this way, by selecting from the criticality sequence those criticality sub-sequences composed of a plurality of consecutive criticalities, in which each criticality is greater than or equal to the criticality threshold and the number of criticalities is greater than or equal to the first threshold, and determining the video segments corresponding to these sub-sequences as key video segments, the number of video frames and audio frames in each key video segment is effectively guaranteed, which helps improve the viewing experience. At the same time, requiring the number of criticalities in the sub-sequence to be greater than or equal to the first threshold ensures that the determined key video segments have a high criticality, effectively improving the accuracy of the determined key video segments.
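For illustration, a minimal Python sketch of steps 1051 to 1053 follows; the helper names and the representation of playing times are assumptions.

```python
def find_key_runs(key_seq, key_threshold, first_threshold):
    """Return (start_idx, end_idx) pairs of consecutive criticalities that are all
    >= key_threshold and whose length is >= first_threshold (step 1051)."""
    runs, run_start = [], None
    for i, k in enumerate(list(key_seq) + [float("-inf")]):   # sentinel closes the last run
        if k >= key_threshold:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= first_threshold:
                runs.append((run_start, i - 1))
            run_start = None
    return runs

def cut_key_snippets(runs, play_times, slice_duration):
    """Map each index run to the (start, end) playing times of a key video snippet
    (steps 1052 and 1053)."""
    return [(play_times[s], play_times[e] + slice_duration) for s, e in runs]
```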
In some embodiments, referring to fig. 12, fig. 12 is a flowchart of a method for determining a video clip according to an embodiment of the present application, and step 105 shown in fig. 12 may be implemented by performing the following steps 1054 to 1056.
In step 1054, at least one criticality sub-sequence consisting of a plurality of consecutive criticalities is obtained from the criticality sequence, where the number of criticalities in the criticality sub-sequence is greater than or equal to a first threshold and less than or equal to a second threshold.
In some embodiments, the number of criticalities in the criticality sub-sequence that reach the criticality threshold is greater than or equal to the third threshold.
In some embodiments, the criticality sub-sequences selected in step 1054 are those in which the number of criticalities lies between the first threshold and the second threshold and the number of criticalities reaching the criticality threshold is greater than or equal to the third threshold (i.e., some criticalities in the sub-sequence are allowed to be below the criticality threshold).
In some embodiments, the magnitude relation among the first threshold, the second threshold, and the third threshold may be: the first threshold is less than the second threshold, and the second threshold is less than the third threshold.
In step 1055, a video slice sub-sequence corresponding to each criticality sub-sequence is determined from the video slice sequences.
In some embodiments, the criticality sequence and the video slicing sequence correspond one-to-one at each playing time, and step 1055 may be implemented as follows: the following processing is performed for each criticality sub-sequence: the starting playing time and the ending playing time corresponding to the criticality sub-sequence are determined, and the video slices between the starting playing time and the ending playing time are determined from the video slicing sequence as the video slicing sub-sequence corresponding to that criticality sub-sequence.
In step 1056, from the target video, video clips corresponding to each video clip sub-sequence are determined, and each video clip is determined to be a key video clip.
In some embodiments, the key degree sequence and each playing time in the video slicing sequence are in one-to-one correspondence, and the target video and each playing time in the video slicing sequence are in one-to-one correspondence. The step 1056 may be implemented as follows: the following processing is performed for each video slice sub-sequence: and determining the starting playing time and the ending playing time of the video segment sub-sequence, determining a target video segment between the starting playing time and the ending playing time from the target video, and determining the target video segment as a key video segment.
In this way, by selecting criticality sub-sequences in which the number of criticalities is greater than or equal to the first threshold and less than or equal to the second threshold, and the number of criticalities reaching the criticality threshold is greater than or equal to the third threshold, and determining the video segments corresponding to these sub-sequences as key video segments, the number of video frames and audio frames in each key video segment is effectively guaranteed, which helps improve the viewing experience. At the same time, the lower bound on the number of criticalities ensures that the determined key video segments have a high criticality, effectively improving the accuracy of the determined key video segments.
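A short illustrative predicate for the selection rule of step 1054, with hypothetical parameter names:

```python
def accept_subsequence(key_probs, key_threshold, first_thr, second_thr, third_thr):
    """Check the rule of step 1054: the sub-sequence length must lie between the
    first and second thresholds, and at least `third_thr` of its criticalities
    must reach the criticality threshold."""
    length_ok = first_thr <= len(key_probs) <= second_thr
    enough_key = sum(1 for p in key_probs if p >= key_threshold) >= third_thr
    return length_ok and enough_key
```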
In some embodiments, following step 105 described above, the key video snippets may be recommended by: editing the key video clips in the target video to obtain the key video clips; acquiring a target object interested in a target video; and recommending the key video snippets to the target object.
In some embodiments, editing the key video snippets in the target video refers to performing nonlinear editing on the key video snippets with a video editing tool: materials such as pictures, background music, special effects, and scenes are added to the key video snippets and re-mixed with the video, the video sources are cut and merged, and new key video snippets with different expressive forms are generated by secondary encoding.
In this way, the video slicing sequence obtained by video slicing the target video is acquired, feature extraction is performed on the video key frames and audio frames of each video slice in the video slicing sequence to obtain picture features and audio features, and the picture features and audio features are fused to obtain fusion features. Prediction is then performed based on the fusion features of the video slices to obtain a key degree sequence, and key video snippets are determined from the target video based on the key degree sequence. The audio and picture characteristics of the video are thereby effectively fused, the determined fusion features reflect the characteristics of each video slice more accurately, and subsequent prediction based on the fusion features is therefore more accurate, effectively improving the accuracy of the determined key video snippets.
In the following, an exemplary application of an embodiment of the present application in an actual application scenario of video clip determination will be described.
With the boom of electronic sports, the volume of game video on the internet is growing rapidly, so users need to browse the key segments in a video efficiently, which improves user experience. Video key segment detection is the key technology for this: it divides a long, un-edited video into a number of equal-length clips and judges the key degree of each clip according to the video information, thereby obtaining the key segments of the video.
The game key segment detection schemes in the related art are usually designed for a specific game and rely only on visual picture information, so a large amount of resources is needed for manual labeling, and the schemes do not generalize: if the game picture changes or the viewing angle is switched, the picture cannot be correctly recognized, the scheme cannot be migrated to other games, and each game needs to be labeled and trained separately. To solve the above problems, embodiments of the present application propose a multi-modal time-sequence modeling framework for multiple multiplayer online battle arena (MOBA) games. The embodiments of the present application combine visual and auditory information to label key segments of several common MOBA games, and in the testing stage test both games seen in the labeled data and MOBA games not seen in the labeled data, so as to verify the generalization and universality of the scheme. The video clip determination provided by the embodiments of the present application is mainly divided into two parts: the first part uses pre-trained models to extract visual and auditory representations of the video, analyzes the importance and roles of the different modalities, and fuses the multi-modal features; the second part performs time-sequence modeling on the video and infers and selects the key segments of the video.
The embodiments of the present application can be applied to the key segment identification task for game videos, also called video summarization or video highlights on the product side: given a long game video, the algorithm automatically identifies, according to the visual and auditory content, one or more eye-catching segments that viewers pay more attention to. These segments serve as the key segments of long videos on video websites, allowing users to browse the key parts of a long video directly and efficiently, or they are automatically clipped out as key material for the secondary production of short videos, key highlight reels, and the like.
In some embodiments, referring to fig. 6, embodiments of the present application include the following three tasks: constructing a general game key segment data set, which must make it possible to verify the generalization capability of the model across different kinds of games; after the data set is acquired, extracting key frames of the video at equal time intervals, extracting multi-modal features of the video with the pre-trained feature extractors M1 and M2, and dividing the features into a plurality of segments with sliding windows; and training the key segment detection model D with the segment features extracted by M1 and M2. In task one, the test set needs to cover the common MOBA games on the market, and the game types in the training set must be fewer than those in the test set, so as to verify the generalization capability of the model. In task two, the given original video is usually a long live video or event video. For the visual features, key frames of the video are first extracted with a step size of 1, and the feature extractor M1 is used to obtain the last-layer features as the visual picture features of the game. For the audio features, after the audio portion of the video is extracted, the original waveform (wav) is converted into a mel spectrogram, and the feature extractor M2 is used to obtain the last-layer features as the audio features of the video. After feature extraction is completed, the visual features and the audio features are spliced (concatenated) to obtain spliced features, which are then slid over in time order and divided into a plurality of equal-length sub-segments. Here, M1 is a visual encoder trained on a very large-scale data set containing 400 million image-text pairs, M2 is a CNN14 pre-trained on the large-scale audio data set AudioSet, and D is a neural network based on a Bi-LSTM structure. In task three, given a video sequence S_t (t = 0, 1, 2, 3, …, T), the objective of the embodiment of the present application is to solve for an output sequence Y_t (t = 0, 1, 2, 3, …, T) using the time-sequence model D. The values of the output sequence represent the probability that the content at each time belongs to the key content, ranging from 0 to 1, where 0 indicates the moment is not a key moment and 1 indicates it is a key moment. The time-sequence model provided by the embodiment of the present application can combine the semantic information of the context to achieve better performance.
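For illustration, the splicing and sliding-window step of task two might look like the following NumPy sketch; the per-second feature layout and function names are assumptions.

```python
import numpy as np

def build_clip_sequences(visual_feats, audio_feats, shot_len, stride=None):
    """Concatenate per-second visual and audio features and split them into
    equal-length sub-segments with a sliding window.

    visual_feats: (T, Dv) array, audio_feats: (T, Da) array, one row per second.
    """
    stride = stride or shot_len // 2
    fused = np.concatenate([visual_feats, audio_feats], axis=1)   # (T, Dv + Da) spliced features
    windows = [fused[s:s + shot_len]
               for s in range(0, fused.shape[0] - shot_len + 1, stride)]
    return np.stack(windows)                                      # (num_windows, shot_len, Dv + Da)
```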
In some embodiments, referring to fig. 6, the algorithm framework shown in fig. 6 mainly comprises two stages: (1) video representation extraction; (2) key segment learning on the video sequence. In the video representation extraction stage, since the pre-trained models have already learned powerful feature extraction functions, the parameters of the feature extraction networks are frozen and do not participate in the parameter updates of the network, so as not to damage the original network performance. After the video representations are obtained, sliding windows are used to obtain a plurality of video sequences, the video sequence features are fed into the Bi-LSTM model for sequence modeling, and a prediction is made for each clip; that is, after Bi-LSTM sequence modeling, the key degree label sequence Y_t is predicted.
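The parameter freezing mentioned above could be done as in the following PyTorch sketch (illustrative only):

```python
import torch

def freeze(extractor: torch.nn.Module) -> torch.nn.Module:
    """Freeze a pre-trained feature extractor (e.g. M1 or M2) so that its
    parameters are excluded from gradient updates."""
    for p in extractor.parameters():
        p.requires_grad = False
    extractor.eval()          # also disable dropout / batch-norm statistic updates
    return extractor
```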
In some embodiments, for the construction of the data set, the data set includes videos of a variety of online games, mainly from web live streams and game videos, including first-person views, spectator views, and the like. Data set partitioning: in order to verify the generalization of the model, the test set contains data of all game types, while only some of the games are selected for the training set, and the universality of the model is measured by the accuracy on the unseen data. The data labeling rules are as follows: if certain events occur in the video, for example relatively intense fight events such as a kill, a double kill, or being killed, they are labeled as key events. The starting time and the ending time of each event need to be labeled: the starting time is the moment the heroes encounter each other, and the ending time is the moment of the event such as killing or being killed. If the enemy disappears from the picture for more than two seconds in the middle, timing restarts from the moment both sides appear on screen at the same time again.
In some embodiments, for visual feature extraction, the collected original video is usually a long live video or event video. Because the original video stream is highly redundant, key frames of the video need to be extracted; however, if sampling is too sparse, some key information, such as event broadcasts, is easily missed. Experiments show that setting step = 1 extracts the key information without excessive redundancy. Because of the large domain difference between game video frames and commonly available data sets, a common pre-trained model (such as a ResNet50 trained on ImageNet, or models such as SlowFast trained on K400) used as a feature extractor cannot extract visual-modality features well; the visual encoder M1 described above is therefore used as the visual feature extractor.
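As an illustrative sketch of key-frame extraction with step = 1 (one frame per second) using OpenCV; the fps fallback value is an assumption.

```python
import cv2

def extract_key_frames(video_path, step_seconds=1):
    """Sample one key frame every `step_seconds` seconds from the video."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0       # fall back if fps metadata is missing
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % int(round(fps * step_seconds)) == 0:
            frames.append(frame)                  # keep one frame per sampling step
        idx += 1
    cap.release()
    return frames
```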
In some embodiments, for auditory feature extraction, after the audio portion of the video is extracted, the audio is cut into equal-length wav segments and converted into mel spectrograms, which simulate how the human ear processes real-world sounds (especially the human voice) and capture pitch-sensitive information. Next, the CNN14 pre-trained on the large-scale audio data set AudioSet is used as the feature extractor; this pre-trained model can better extract the audio feature embedding, and the last-layer features are taken as the audio features of the video.
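A minimal sketch of the audio pre-processing described above, using librosa; the sampling rate and number of mel bands are assumptions, not values stated in this application.

```python
import librosa
import numpy as np

def audio_to_logmel(audio_path, sr=16000, n_mels=64):
    """Convert an extracted audio track into a log-mel spectrogram, the kind of
    input expected by an AudioSet-style CNN14 feature extractor."""
    wav, _ = librosa.load(audio_path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel).astype(np.float32)   # (n_mels, time) log-mel spectrogram
```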
In some embodiments, a boundary-free model is used to model and optimize the video key segment detection task, as shown in fig. 13, which is a schematic diagram of a method for determining video clips provided by an embodiment of the present application. Taking input data of size B x Shot-Len x N as an example (B is the batch size, Shot-Len is the number of segments processed in a single batch, and N is the dimension of the segment features), Bi-LSTM based sequence modeling is used to identify the key degree of each segment; that is, after Bi-LSTM sequence modeling, a sequence of size B x Shot-Len x 2 is output, where 2 is the two-class score, which is binarized by a threshold to obtain the category of each segment, with 1 indicating a key segment and 0 indicating a non-key segment. This approach has two advantages: the complexity and parameter count of the model are moderate, which improves inference efficiency; meanwhile, the key degree identification of each segment also depends on its nearby context information, so the inference result incorporates more global features while avoiding excessively long dependencies.
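For illustration, the Bi-LSTM based detector D could be sketched in PyTorch as follows, combining a bidirectional LSTM with three fully connected layers (per the training description below); the hidden size is an assumption.

```python
import torch
from torch import nn

class KeySegmentDetector(nn.Module):
    """Sequence model D: input (B, Shot_Len, N) clip features,
    output (B, Shot_Len, 2) two-class key / non-key scores."""
    def __init__(self, feat_dim, hidden=256):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(                 # three fully connected layers
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 2),
        )

    def forward(self, x):                          # x: (B, Shot_Len, N)
        h, _ = self.bilstm(x)                      # (B, Shot_Len, 2 * hidden)
        return self.head(h)                        # (B, Shot_Len, 2)
```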
In some embodiments, referring to fig. 13, the selection of the binarization threshold is hindered because the output curve is contaminated by impulse noise. Therefore, a median filter is first applied to smooth the output curve. Assuming y_t (t = 0, 1, 2, 3, …, T) is the original curve output by the network, the smoothed curve m = (m_1, …, m_n) is given by:

m_i = Med(y_{i-σ}, y_{i+σ})

where Med(·) denotes the clipped average within the window and σ is a hyper-parameter, typically 4; the filtering effect is shown in fig. 9. However, the probability curves of different videos are unstable and the absolute key degrees of different videos also vary, so a fixed threshold does not produce satisfactory results.
To address this problem, an adaptive threshold is applied to each curve for binarization, selecting the relatively most key sub-segments of each video. Finally, only the intermediate part of each output sequence is kept: when the features slide through the window, the window size is Shot-Len and the step length is Shot-Len/2, and the classification result output by each window has size B x Shot-Len. Since the clips at the two ends of a sequence cannot contain sufficient context information, only the middle Shot-Len/2 results of each window are kept, and all sliding-window results are then spliced to obtain the complete output for the whole video.
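An illustrative sketch of the middle-window splicing and adaptive binarization described above; the keep_ratio used for the adaptive threshold is an assumption, not a value from this application.

```python
import numpy as np

def stitch_window_outputs(window_probs, shot_len):
    """Keep only the middle Shot_Len/2 predictions of each sliding window (whose
    ends lack context) and splice them into the full-video output sequence."""
    half, quarter = shot_len // 2, shot_len // 4
    parts = [w[quarter:quarter + half] for w in window_probs]   # middle section of each window
    return np.concatenate(parts)

def adaptive_binarize(probs, keep_ratio=0.2):
    """Adaptive threshold: mark the relatively most key positions of this video
    (here the top `keep_ratio` fraction, an illustrative choice) as 1."""
    thr = np.quantile(probs, 1.0 - keep_ratio)
    return (probs >= thr).astype(int)
```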
In some embodiments, referring to fig. 14, fig. 14 is a flowchart of a method for determining a video clip according to an embodiment of the present application. In step 201, a key frame is extracted; in step 202, feature extraction is performed on video key frames; in step 203, a sliding window generates a picture feature; in step 204, audio is extracted; in step 205, extracting features of the audio to obtain audio features; in step 206, the sliding window generates an audio feature; in step 207, the audio features and the video features are fused to obtain fused features. In step 208, a predicted sequence is obtained based on the fusion features; in step 209, cross entropy is calculated; in step 210, network parameters are updated.
The embodiment of the present application aims to perform supervised training of the video scene segmentation model. In this stage, the parameters of the feature extractors are fixed. First, the labeled video sequences are passed through the shot feature extractors to obtain the corresponding embedded features; these embedded features are then fed into the designed framework, and the corresponding output sequence represents the predicted key degree label sequence of the clips. The cross-entropy loss function is used as the optimization target, and a gradient back-propagation strategy is finally used to update the network parameters. In this stage, a Bi-LSTM and three fully connected layers are used as the basic network framework of the scene segmentation model, an SGD optimizer is used for optimization, the initial learning rate is set to 0.01, the training batch size is 8, and the training period is 50 epochs.
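The training settings above could be realized roughly as in the following PyTorch sketch; the data-loader format is an assumption.

```python
import torch
from torch import nn

def train(model, loader, epochs=50, lr=0.01):
    """Supervised training sketch matching the stated settings: SGD optimizer,
    initial learning rate 0.01, 50 epochs, cross-entropy loss, gradient
    back-propagation; frozen feature extractors are not part of `model`."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for feats, labels in loader:               # feats: (B, Shot_Len, N); labels: (B, Shot_Len) long
            logits = model(feats)                  # (B, Shot_Len, 2)
            loss = criterion(logits.reshape(-1, 2), labels.reshape(-1))
            optimizer.zero_grad()
            loss.backward()                        # back-propagate gradients
            optimizer.step()                       # update network parameters
    return model
```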
In this way, the video slicing sequence obtained by video slicing the target video is acquired, feature extraction is performed on the video key frames and audio frames of each video slice in the video slicing sequence to obtain picture features and audio features, and the picture features and audio features are fused to obtain fusion features. Prediction is then performed based on the fusion features of the video slices to obtain a key degree sequence, and key video snippets are determined from the target video based on the key degree sequence. The audio and picture characteristics of the video are thereby effectively fused, the determined fusion features reflect the characteristics of each video slice more accurately, and subsequent prediction based on the fusion features is therefore more accurate, effectively improving the accuracy of the determined key video snippets.
It will be appreciated that in the embodiments of the present application, related data such as target video is involved, when the embodiments of the present application are applied to specific products or technologies, user permissions or agreements need to be obtained, and the collection, use and processing of related data need to comply with related laws and regulations and standards of related countries and regions.
Continuing with the description below of an exemplary structure of the video clip determining apparatus 455 provided by embodiments of the present application implemented as a software module, in some embodiments, as shown in fig. 3, the software module stored in the video clip determining apparatus 455 of the memory 450 may include: the obtaining module 4551 is configured to obtain a video slicing sequence obtained by performing video slicing on a target video, where each video slicing includes at least one video key frame and an audio frame corresponding to each video key frame; the feature extraction module 4552 is configured to perform feature extraction on each video key frame for each video segment in the video segment sequence to obtain a picture feature of each video key frame, and perform feature extraction on an audio frame corresponding to each video key frame in the video segment to obtain an audio feature of each audio frame; the feature fusion module 4553 is configured to perform feature fusion on the picture features of each video segment and the corresponding audio features to obtain fusion features of each video segment; the prediction module 4554 is configured to predict, based on the fusion characteristics of each video slice, a key degree of each video slice in the target video, so as to obtain a key degree sequence corresponding to the video slice sequence; a determining module 4555 is configured to determine a key video snippet from the target video based on the sequence of key degrees.
In some embodiments, the obtaining module 4551 is further configured to obtain a target video, and a slicing step, where the slicing step characterizes a number of video frames included in the video slice, and the video frames include video key frames and video non-key frames; and performing video slicing on the target video according to the slicing step length to obtain a video slicing sequence.
In some embodiments, the feature fusion module 4553 is further configured to perform the following processing for each video slice: splicing the picture features of the video clips to obtain spliced picture features of the video clips, and splicing the audio features of the video clips to obtain spliced audio features of the video clips; and acquiring the picture weight and the audio weight of the video clips, and carrying out weighted fusion on the spliced picture features and the spliced audio features of the video clips based on the picture weight and the audio weight of the video clips to obtain fusion features of the video clips.
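As an illustrative sketch of the weighted fusion performed by the feature fusion module, assuming weighted concatenation as the fusion operation and equal default weights (both are assumptions, not details fixed by this application):

```python
import numpy as np

def fuse_slice_features(picture_feats, audio_feats, picture_weight=0.5, audio_weight=0.5):
    """Weighted fusion for one video slice: splice the picture features of its key
    frames, splice the audio features, then weight and combine the two."""
    pic = np.concatenate([f.ravel() for f in picture_feats])    # spliced picture feature
    aud = np.concatenate([f.ravel() for f in audio_feats])      # spliced audio feature
    return np.concatenate([picture_weight * pic, audio_weight * aud])   # fused feature
```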
In some embodiments, the prediction module 4554 is further configured to invoke the target prediction model to predict each video slice based on the fusion feature of each video slice, so as to obtain the key degree of each video slice; combine the key degrees of the video slices into a candidate key degree sequence according to the playing time order of the video slices in the video slicing sequence; and smooth the candidate key degree sequence to obtain the key degree sequence corresponding to the video slicing sequence.
In some embodiments, the prediction module 4554 is further configured to obtain a playing time and a smooth time interval of the video segment corresponding to each key probability in the candidate key degree sequence, and compare each playing time with the smooth time interval to obtain an interval comparison result corresponding to each key probability; and smoothing the key probability of the playing time in the smooth time interval in the candidate key degree sequence based on the interval comparison result to obtain the key degree sequence.
In some embodiments, the prediction module 4554 is further configured to perform, for each key probability whose playing time falls within the smoothing time interval in the candidate key degree sequence, the following processing: in the candidate key degree sequence, taking the position of the key probability in the candidate key degree sequence as the center position, selecting at least two reference key probabilities at equal intervals; and performing a weighted average on the at least two reference key probabilities to obtain a weighted average probability, and determining the weighted average probability as the smoothed key probability corresponding to the key probability.
In some embodiments, the apparatus for determining a video clip further includes: the training module is used for acquiring at least two video slicing samples, wherein the at least two video slicing samples belong to different video topics, and each video slicing sample comprises at least one video key frame sample and an audio frame sample corresponding to each video key frame sample; performing feature extraction on video key frame samples aiming at each video slicing sample to obtain picture sample features of the video key frame samples, and performing feature extraction on audio frame samples corresponding to the video key frame samples in the video slicing samples to obtain audio sample features of the audio frame samples; carrying out feature fusion on the picture sample features of each video slicing sample and the corresponding audio sample features to obtain fusion sample features of each video slicing sample; based on the fusion sample characteristics of each video slicing sample, calling a prediction model to predict each video slicing sample, and obtaining the prediction key probability of each video slicing sample; and training the prediction model based on the prediction key probability of each video slice sample to obtain a target prediction model.
In some embodiments, the training module is further configured to obtain a label key probability of each video slice sample; determining a loss value of each tag key probability based on each tag key probability and the corresponding predicted key probability; summing the loss values of the key probabilities of all the labels to obtain a training loss value; and training the prediction model based on the training loss value to obtain a target prediction model.
In some embodiments, the determining module 4555 is further configured to obtain at least one criticality sub-sequence composed of a plurality of consecutive criticalities from the criticality sequence, where the number of criticalities in the criticality sub-sequence is greater than or equal to the first threshold; wherein each criticality in the criticality subsequence is greater than or equal to a criticality threshold; determining video slicing sub-sequences corresponding to each key degree sub-sequence from the video slicing sequences; and determining video fragments corresponding to the video fragment sub-sequences from the target video, and determining the video fragments as key video fragments.
In some embodiments, the determining module 4555 is further configured to obtain at least one criticality sub-sequence composed of a plurality of consecutive criticalities from the criticality sequence, where the number of criticalities in the criticality sub-sequence is greater than or equal to a first threshold and the number of criticalities in the criticality sub-sequence is less than or equal to a second threshold; wherein, in the critical degree subsequence, the number of critical degrees reaching the critical degree threshold is greater than or equal to a third threshold; determining video slicing sub-sequences corresponding to each key degree sub-sequence from the video slicing sequences; and determining video fragments corresponding to the video fragment sub-sequences from the target video, and determining the video fragments as key video fragments.
In some embodiments, the apparatus for determining a video clip further includes: the recommendation module is used for editing the key video clips in the target video to obtain the key video clips; acquiring a target object interested in a target video; and recommending the key video snippets to the target object.
Embodiments of the present application provide a computer program product comprising a computer program or computer-executable instructions stored in a computer-readable storage medium. The processor of the electronic device reads the computer-executable instructions from the computer-readable storage medium, and executes the computer-executable instructions, so that the electronic device performs the method for determining a video clip according to the embodiment of the present application.
Embodiments of the present application provide a computer-readable storage medium having stored therein computer-executable instructions that, when executed by a processor, cause the processor to perform the method for determining video clips provided by embodiments of the present application, for example, the method for determining video clips as shown in fig. 4.
In some embodiments, the computer readable storage medium may be FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disc, or CD-ROM; or it may be any of various electronic devices including one or any combination of the above memories.
In some embodiments, computer-executable instructions may be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, in the form of programs, software modules, scripts, or code, and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, computer-executable instructions may, but need not, correspond to files in a file system, may be stored as part of a file that holds other programs or data, such as in one or more scripts in a hypertext markup language (HTML, hyper Text Markup Language) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, computer-executable instructions may be deployed to be executed on one electronic device or on multiple electronic devices located at one site or, alternatively, on multiple electronic devices distributed across multiple sites and interconnected by a communication network.
In summary, the embodiment of the application has the following beneficial effects:
(1) By acquiring the video slicing sequence obtained by video slicing the target video, performing feature extraction on the video key frames and audio frames of each video slice in the video slicing sequence to obtain picture features and audio features, and fusing the picture features and audio features to obtain fusion features, prediction can then be performed based on the fusion features of the video slices to obtain a key degree sequence, and key video snippets are determined from the target video based on the key degree sequence. In this way, the two modalities of audio and picture are effectively fused, the determined fusion features reflect the characteristics of each video slice more accurately, and subsequent prediction based on the fusion features is therefore more accurate, effectively improving the accuracy of the determined key video snippets.
(2) The target video is subjected to video slicing to obtain a video slicing sequence comprising a plurality of video slices, so that feature extraction can be performed on each video slice in parallel in the subsequent feature extraction process, the feature extraction time is effectively saved, the algorithm execution efficiency is effectively improved, and the video slice determination efficiency is effectively improved.
(3) The method has the advantages that the picture characteristics and the audio characteristics of the corresponding video key frames are respectively extracted for each video slice in the video slice sequence, so that the filtering of the video non-key frames in the target video is realized, the video key frames with key information and the corresponding audio frames in the target video are reserved, the determination of the key information for the subsequent video slices is realized, the calculation amount of an algorithm is effectively saved, namely, the determination efficiency of the video slices is effectively improved, and meanwhile, the accuracy of the determined video slices is effectively improved.
(4) The image characteristics of each video segment and the corresponding audio characteristics are subjected to characteristic fusion to obtain fusion characteristics of each video segment, so that the determined fusion characteristics effectively fuse the characteristics of two modes of audio and image of video, the determined fusion characteristics can reflect the characteristics of each video segment more accurately, and the fusion characteristics can reflect the characteristics of each video segment more accurately later, so that the accuracy of prediction is higher, and the accuracy of the determined video segments is effectively improved.
(5) By smoothing the candidate key degree sequence to obtain the key degree sequence corresponding to the video slicing sequence, effective smoothing of the key degree sequence is achieved, the key degrees of the target video at different playing times are represented more accurately, the key video snippets subsequently determined based on the key degree sequence are more accurate, and the accuracy of the determined key video snippets is effectively improved.
(6) By selecting from the key degree sequence those key degree sub-sequences composed of a plurality of consecutive key degrees, in which each key degree is greater than or equal to the key degree threshold and the number of key degrees is greater than or equal to the first threshold, and determining the video segments corresponding to these sub-sequences as key video segments, the number of video frames and audio frames in each key video segment is effectively guaranteed, which helps improve the viewing experience; at the same time, requiring the number of key degrees in the sub-sequence to be greater than or equal to the first threshold ensures that the determined key video segments have a high key degree, effectively improving the accuracy of the determined key video segments.
(7) The number of the key degrees is larger than or equal to a first threshold value, the number of the key degrees in the key degree subsequence is smaller than or equal to a second threshold value, the number of the key degrees of the key degree threshold value is larger than or equal to a third threshold value, and the video segments corresponding to the key degree subsequence are determined to be key video segments, so that the number of video frames and audio frames in the key video segments is effectively ensured, the video watching experience of a viewer is improved, and meanwhile, the number of the key degrees in the key degree subsequence is larger than or equal to the first threshold value, so that the determined key degree of the key video segments is ensured to be higher, and the accuracy of the determined key video segments is effectively improved.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (15)

1. A method for determining a video clip, the method comprising:
obtaining a video slicing sequence obtained by video slicing of a target video, wherein each video slicing comprises at least one video key frame and an audio frame corresponding to each video key frame;
For each video segment in the video segment sequence, extracting features of each video key frame to obtain picture features of each video key frame, and extracting features of audio frames corresponding to each video key frame in the video segment to obtain audio features of each audio frame;
performing feature fusion on the picture features of each video segment and the corresponding audio features to obtain fusion features of each video segment;
based on the fusion characteristics of the video clips, predicting the key degree of the video clips in the target video respectively to obtain a key degree sequence corresponding to the video clip sequence;
and determining a key video snippet from the target video based on the key degree sequence.
2. The method according to claim 1, wherein the obtaining a video slicing sequence obtained by video slicing a target video includes:
obtaining a target video and a slicing step length, wherein the slicing step length represents the number of video frames included in the video slicing, and the video frames comprise the video key frames and the video non-key frames;
And performing video slicing on the target video according to the slicing step length to obtain the video slicing sequence.
3. The method of claim 1, wherein when the video slices include at least two video key frames, the feature fusing the picture features of each video slice with the corresponding audio features to obtain fused features of each video slice comprises:
the following processing is respectively executed for each video slice:
splicing the picture features of the video clips to obtain spliced picture features of the video clips, and splicing the audio features of the video clips to obtain spliced audio features of the video clips;
and acquiring the picture weight and the audio weight of the video segment, and carrying out weighted fusion on the spliced picture characteristic and the spliced audio characteristic of the video segment based on the picture weight and the audio weight of the video segment to obtain the fusion characteristic of the video segment.
4. The method according to claim 1, wherein the predicting, based on the fusion characteristics of each video slice, the key degree of each video slice in the target video to obtain a key degree sequence corresponding to the video slice sequence comprises:
Based on the fusion characteristics of the video fragments, a target prediction model is called to predict the video fragments, so that the key degree of the video fragments is obtained;
combining the key degree of each video fragment into a candidate key degree sequence according to the playing time sequence of each video fragment in the video fragment sequence;
and smoothing the candidate key degree sequence to obtain the key degree sequence corresponding to the video slicing sequence.
5. The method of claim 4, wherein each key degree in the candidate key degree sequence is indicated by a key probability; and the smoothing the candidate key degree sequence to obtain the key degree sequence corresponding to the video slicing sequence comprises:
acquiring playing time and smooth time intervals of video fragments corresponding to the key probabilities in the candidate key degree sequences, and comparing the playing time with the smooth time intervals to obtain interval comparison results corresponding to the key probabilities;
and smoothing the key probability of the playing time in the smooth time interval in the candidate key degree sequence based on the interval comparison result to obtain the key degree sequence.
6. The method of claim 5, wherein smoothing the key probabilities whose playing time is within the smoothing time interval in the candidate key degree sequence comprises:
and respectively executing processing for each key probability that the playing time is in the smooth time interval in the candidate key degree sequence:
in the candidate key degree sequence, taking the position of the key probability in the candidate key degree sequence as a central position, and selecting at least two reference key probabilities at equal intervals;
and carrying out weighted average on the at least two reference key probabilities to obtain weighted average probability, and determining the weighted average probability as smooth key probability corresponding to the key probability.
7. The method of claim 4, wherein the invoking a target prediction model predicts each of the video slices based on the fused features of each of the video slices, the method further comprising, prior to deriving the key probabilities for each of the video slices:
acquiring at least two video slicing samples, wherein the at least two video slicing samples belong to different video topics, and each video slicing sample comprises at least one video key frame sample and an audio frame sample corresponding to each video key frame sample;
For each video slicing sample, performing feature extraction on the video key frame sample to obtain picture sample features of the video key frame sample, and performing feature extraction on an audio frame sample corresponding to the video key frame sample in the video slicing sample to obtain audio sample features of the audio frame sample;
performing feature fusion on the picture sample features of each video slicing sample and the corresponding audio sample features to obtain fusion sample features of each video slicing sample;
based on the fusion sample characteristics of each video slicing sample, calling a prediction model to predict each video slicing sample, and obtaining the prediction key probability of each video slicing sample;
and training the prediction model based on the prediction key probability of each video slice sample to obtain the target prediction model.
8. The method of claim 7, wherein training the predictive model based on the predictive key probabilities for each of the video slice samples to obtain the target predictive model comprises:
acquiring the label key probability of each video slicing sample;
Determining a loss value of each tag key probability based on each tag key probability and the corresponding predicted key probability;
summing the loss values of the key probabilities of the labels to obtain a training loss value;
and training the prediction model based on the training loss value to obtain the target prediction model.
9. The method of claim 1, wherein determining a key video snippet from the target video based on the sequence of criticalities comprises:
obtaining at least one criticality subsequence composed of a plurality of continuous criticality from the criticality sequence, wherein the number of the criticality in the criticality subsequence is greater than or equal to a first threshold value;
wherein each of the criticality subsequences is greater than or equal to a criticality threshold;
determining video fragment sub-sequences corresponding to the key degree sub-sequences from the video fragment sequences;
and determining video fragments corresponding to the video fragment sub-sequences from the target video, and determining the video fragments as the key video fragments.
10. The method of claim 1, wherein determining a key video snippet from the target video based on the sequence of criticalities comprises:
Obtaining at least one criticality sub-sequence composed of a plurality of continuous criticality from the criticality sequence, wherein the number of the criticality in the criticality sub-sequence is larger than or equal to a first threshold value, and the number of the criticality in the criticality sub-sequence is smaller than or equal to a second threshold value;
wherein, in the critical degree subsequence, the number of critical degrees reaching a critical degree threshold is greater than or equal to a third threshold;
determining video fragment sub-sequences corresponding to the key degree sub-sequences from the video fragment sequences;
and determining video fragments corresponding to the video fragment sub-sequences from the target video, and determining the video fragments as the key video fragments.
11. The method of claim 1, wherein after determining a key video snippet from the target video based on the sequence of criticalities, the method further comprises:
editing the key video segments in the target video to obtain the key video segments;
acquiring a target object interested in the target video;
and recommending the key video snippets to the target object.
12. A video clip determination apparatus, the apparatus comprising:
the device comprises an acquisition module, a video segmentation module and a video segmentation module, wherein the acquisition module is used for acquiring a video segmentation sequence obtained by performing video segmentation on a target video, and each video segmentation comprises at least one video key frame and an audio frame corresponding to each video key frame;
the feature extraction module is used for carrying out feature extraction on each video key frame aiming at each video slice in the video slice sequence to obtain the picture feature of each video key frame, and carrying out feature extraction on the audio frame corresponding to each video key frame in the video slice to obtain the audio feature of each audio frame;
the feature fusion module is used for carrying out feature fusion on the picture features of each video segment and the corresponding audio features to obtain fusion features of each video segment;
the prediction module is used for predicting the key degree of each video slice in the target video based on the fusion characteristics of each video slice to obtain a key degree sequence corresponding to the video slice sequence;
and the determining module is used for determining a key video fragment from the target video based on the key degree sequence.
13. An electronic device, the electronic device comprising:
a memory for storing computer executable instructions or computer programs;
a processor for implementing the method of determining video segments according to any one of claims 1 to 11 when executing computer executable instructions or computer programs stored in said memory.
14. A computer readable storage medium storing computer executable instructions which when executed by a processor implement the method of determining video segments of any one of claims 1 to 11.
15. A computer program product comprising a computer program or computer-executable instructions which, when executed by a processor, implement the method of determining video segments as claimed in any one of claims 1 to 11.