CN115905584B - Video splitting method and device - Google Patents

Video splitting method and device

Info

Publication number
CN115905584B
CN115905584B (application CN202310029942.1A)
Authority
CN
China
Prior art keywords
video
feature
audio
sample
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310029942.1A
Other languages
Chinese (zh)
Other versions
CN115905584A (en)
Inventor
赵仪琳
魏海巍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Gongdao Network Technology Co ltd
Original Assignee
Gongdao Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Gongdao Network Technology Co., Ltd.
Priority to CN202310029942.1A
Publication of CN115905584A
Application granted
Publication of CN115905584B
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a video splitting method and device. The method comprises the following steps: acquiring an original video and dividing it into a plurality of video clips; extracting audio features and image features of each video clip, together with text features corresponding to the audio features; respectively calculating a first correlation between the audio features and the text features and a second correlation between the image features and the text features; taking the product of the audio features and the first correlation as an audio feature input vector and the product of the image features and the second correlation as an image feature input vector, and judging, according to the text features, the audio feature input vector and the image feature input vector, whether the corresponding video clip is a splitting node of the original video; and splitting the original video into a plurality of target videos by taking the start time or end time of each video clip determined to be a splitting node as a splitting time.

Description

Video splitting method and device
Technical Field
One or more embodiments of the present application relate to the field of video processing technologies, and in particular, to a method and apparatus for splitting video.
Background
Video is continuous content presented to people through media and is an important part of daily life, leisure, entertainment, information and social activity; users can watch video in many ways, and news video in particular is an important channel for spreading news. However, a news video broadcast on television is a complete news program that usually contains multiple news items and typically runs for 30 minutes or more. With short video now widely popular, people want to spend their limited time on the content that interests them. Splitting a complete video into multiple short videos by news item, so as to meet users' personalized needs, is therefore an important part of video processing technology. Splitting video by item also makes it easier for television stations to catalog and store programs, facilitating query and management.
At present, video websites and news applications rely on manual marking, splitting and publishing of news material gathered from various channels, so that users can click and watch each news item of interest. However, the amount of video content produced by video platforms every day is extremely large and news has strict timeliness requirements, so manual processing cannot keep up; moreover, because of human subjectivity and individual differences, the accuracy of the splitting positions is difficult to guarantee. The present application therefore provides a new video splitting method and apparatus to improve the efficiency and accuracy of video splitting.
Disclosure of Invention
The application provides a video splitting method and a video splitting device, which are used for solving the defects in the related art.
According to a first aspect of one or more embodiments of the present application, there is provided a video splitting method, the method comprising:
acquiring an original video, and dividing the original video into a plurality of video clips;
for each video clip, the following operations are performed: extracting audio features and image features of the corresponding video clips, and text features corresponding to the audio features; respectively calculating a first correlation between the audio feature and the text feature and a second correlation between the image feature and the text feature; taking the product of the audio feature and the first correlation as an audio feature input vector, taking the product of the image feature and the second correlation as an image feature input vector, and judging whether the corresponding video segment is a splitting node of the original video according to the text feature, the audio feature input vector and the image feature input vector;
and splitting the original video into a plurality of target videos by taking the starting time or the ending time of the video segment determined as the splitting node as the splitting time.
Optionally, the extracting the audio feature and the image feature of the corresponding video clip, and the text feature corresponding to the audio feature includes:
respectively extracting original audio features and original image features of corresponding video clips;
extracting the characteristics of the original audio characteristics through a Bi-LSTM model to obtain the audio characteristics of the corresponding video clips; extracting features of the original image features through a Bi-LSTM model to obtain image features of corresponding video clips;
converting the audio part of the corresponding video clip into text content through voice recognition; and extracting the characteristics of the text content through a textCNN model to obtain the text characteristics of the corresponding video clips.
Optionally, the extracting the original audio feature and the original image feature of the corresponding video clip respectively includes:
acquiring an audio part of the corresponding video clip, converting the audio part into an audio waveform, and extracting the original audio feature from the audio waveform by adopting a VGGish model;
acquiring the image part of the corresponding video clip, converting the image part into a set of video frame images, extracting at least a part of the video frame images as representative images, and extracting the original image features from the representative images with an InceptionV3 model.
Optionally, the calculating the first correlation between the audio feature and the text feature, and the second correlation between the image feature and the text feature respectively includes:
calculating a first correlation between the audio feature and the text feature by an attention mechanism; and calculating a second correlation between the image feature and the text feature by an attention mechanism.
Optionally, the determining whether the corresponding video segment is a split node of the original video according to the text feature, the audio feature input vector and the image feature input vector includes:
and inputting the text feature, the audio feature input vector and the image feature input vector into a trained recognition model, and judging whether the corresponding video segment is a splitting node of the original video according to the output result of the recognition model.
Optionally, the recognition model is trained by:
obtaining a plurality of sample video fragments obtained by dividing an original sample video, wherein type labels are added to the sample video fragments, and the type labels indicate whether the corresponding sample video fragments are split nodes of the original sample video or not;
For each sample video segment, the following operations are performed: extracting sample audio features and sample image features of the corresponding sample video segment, and sample text features corresponding to the sample audio features; respectively calculating a first sample correlation between the sample audio feature and the sample text feature and a second sample correlation between the sample image feature and the sample text feature; taking the product of the sample audio feature and the first sample correlation as a sample audio feature input vector, taking the product of the sample image feature and the second sample correlation as a sample image feature input vector, and taking the sample text feature, the sample audio feature input vector and the sample image feature input vector as a group of training samples corresponding to the corresponding sample video segment;
and training the original recognition model according to training samples and type labels respectively corresponding to the plurality of sample video clips to obtain the trained recognition model.
Optionally, the method further comprises:
and retraining the trained recognition model at the end of each preset update period and/or when the accumulated number of sample video segments with added type labels reaches a preset update-sample threshold.
According to a second aspect of one or more embodiments of the present application, there is provided a video splitting apparatus, the apparatus comprising:
the preprocessing unit is used for acquiring an original video and dividing the original video into a plurality of video clips;
a judging unit, configured to perform the following operations for each video clip: extracting audio features and image features of the corresponding video clips, and text features corresponding to the audio features; respectively calculating a first correlation between the audio feature and the text feature and a second correlation between the image feature and the text feature; taking the product of the audio feature and the first correlation as an audio feature input vector, taking the product of the image feature and the second correlation as an image feature input vector, and judging whether the corresponding video segment is a splitting node of the original video according to the text feature, the audio feature input vector and the image feature input vector;
and the splitting unit is used for splitting the original video into a plurality of target videos by taking the starting time or the ending time of the video segment determined as the splitting node as the splitting time.
According to a third aspect of one or more embodiments of the present application, there is provided an electronic device comprising a processor and a memory for storing processor-executable instructions, wherein the processor implements any optional implementation of the above video splitting method by executing the executable instructions.
According to a fourth aspect of one or more embodiments of the present application, there is provided a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement any optional implementation of the above video splitting method.
According to the embodiments of the present application, the audio features and image features of each video clip, and the text features corresponding to the audio features, are extracted to generate the text feature, the audio feature input vector and the image feature input vector; whether the corresponding video clip is a splitting node of the original video is then judged from these three inputs, and the original video is split into a plurality of target videos at the resulting splitting nodes. This realizes automatic and rapid splitting of video, reduces the dependence of video splitting on manual operation, saves labor cost, weakens the negative influence of human subjective factors on the splitting result, and improves the efficiency and accuracy of video splitting by using computer audio-video algorithms.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some of the embodiments described in the present application, and a person of ordinary skill in the art may obtain other drawings from these drawings without creative effort.
FIG. 1 is a flow chart of a video splitting method according to an exemplary embodiment of the present application;
FIG. 2 is a training flow diagram of an identification model according to an embodiment of the present application;
fig. 3 is a schematic structural view of a video splitting apparatus according to an exemplary embodiment of the present application;
fig. 4 is a hardware configuration diagram of a computer device where the video splitting apparatus according to an embodiment of the present application is located.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the application. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon" or "in response to determining".
The video splitting method and the video splitting device provided by the application are specifically described below.
For a specific video splitting task, especially in a business scenario where a news video is split by news item, news programs are usually concentrated in fixed time slots (for example, a midday slot around 12:00 and an evening slot around 7:00 pm) and run for 30 minutes or more, and a complete news video (a video longer than about 10 minutes) contains at least two news items, while a user usually cares only about a particular news item or category of news, so a long video makes it hard for people to quickly find what interests them. Likewise, for video content such as television series, users expect to be able to conveniently skip to a certain plot or character and watch only the parts they are interested in. How to split video automatically according to specific criteria is therefore an objectively existing technical requirement. To meet this requirement, the present application provides a video splitting method that saves part of the labor cost of the repetitive work of splitting videos and improves the efficiency and accuracy of video splitting.
The video splitting method shown in this embodiment can be applied to video splitting tasks in various business scenarios. The original video may be a news video containing multiple news items, or video content such as a short-video collection, a television series or a movie; splitting may be performed by news item, or according to conditions such as plot content, characters or scenes; and each target video obtained by splitting may be a news item with an independent topic, or a video with an independent plot, character or scene. Fig. 1 is a flowchart illustrating a video splitting method according to an exemplary embodiment of the present application. As shown in fig. 1, the video splitting method mainly includes the following steps.
Step S101: and acquiring an original video, and dividing the original video into a plurality of video clips.
Step S102: for each video clip, the following operations are performed: extracting audio features and image features of the corresponding video clips, and text features corresponding to the audio features; respectively calculating a first correlation between the audio feature and the text feature and a second correlation between the image feature and the text feature; taking the product of the audio feature and the first correlation as an audio feature input vector, taking the product of the image feature and the second correlation as an image feature input vector, and judging whether the corresponding video segment is a splitting node of the original video according to the text feature, the audio feature input vector and the image feature input vector.
Step S103: and splitting the original video into a plurality of target videos by taking the starting time or the ending time of the video segment determined as the splitting node as the splitting time.
In step S101, the original video may be any video with any duration, any type, and any source, that is, any video may be used as the original video in the present application, and the original video may be obtained through a news video website, software, a television station, etc., which is not limited by the present application.
The original video can be segmented using various video editing tools, software and algorithms. The segmented video clips can be kept at the same duration to reduce the difference in the amount of information between clips, which improves the accuracy of the subsequent multi-feature extraction and classification; for example, the clip duration can be set to 5 seconds, and it can also be adjusted according to actual needs to continuously optimize the model and improve its accuracy. When the original video is 30 minutes long, it can be cut into 360 clips of 5 seconds each, or into 180 clips of 10 seconds each.
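By way of a non-limiting illustration (the embodiment does not prescribe any particular tool), fixed-length segmentation of this kind could be performed with ffmpeg's segment muxer roughly as follows; the file name and the 5-second clip length are assumptions taken from the example above.

import subprocess

def split_into_clips(video_path: str, clip_seconds: int = 5, out_pattern: str = "clip_%04d.mp4"):
    """Cut the original video into fixed-length clips with ffmpeg's segment muxer.
    Re-encoding (instead of stream copy) keeps every clip close to clip_seconds long
    rather than snapping cut points to keyframes."""
    subprocess.run(
        [
            "ffmpeg", "-i", video_path,
            "-f", "segment",
            "-segment_time", str(clip_seconds),
            "-reset_timestamps", "1",
            "-c:v", "libx264", "-c:a", "aac",
            out_pattern,
        ],
        check=True,
    )

# A 30-minute original video yields 360 five-second clips (or 180 ten-second clips).
split_into_clips("original_video.mp4", clip_seconds=5)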
Video generally refers to various techniques for capturing, recording, processing, storing, transmitting and reproducing a series of still images as electrical signals. When more than 24 frames change continuously per second, according to the persistence-of-vision principle the human eye cannot distinguish the individual still images and perceives only a smooth, continuous visual effect; such continuous images are called video. An audio track is also designed along the timeline and the rhythm of the pictures, and adding audio to the video gives it sound and color. A video file is therefore composed of audio and images, and for a computer to split a video according to its content it must first understand that content, which is why both the audio features and the image features of the video need to be extracted.
In step S102, for each video segment, audio features are extracted by simplifying the original sampled waveform into a digital representation of the audio (sampling and quantization), performing various signal processing operations on that digital signal, and identifying the energy, time-domain, frequency-domain, musical-tone and perceptual characteristics of the audio. This processing may involve convolution, Fourier transforms, Laplace transforms and the like, and some analog devices may process the continuous analog signal directly. The original audio features may be extracted with neural network structures such as CNN, DNN, Transformer or RNN, or with pre-trained models such as L3-Net or AudioSet-based PANNs.
In a specific embodiment, the original audio features of the corresponding video clip may be extracted as follows: the audio part of the corresponding video clip is acquired and converted into an audio waveform, and the original audio features are extracted from the waveform with a VGGish model. The VGGish model can be used as a feature extractor that converts the audio input into 128-dimensional high-level, semantically meaningful feature vectors, which can serve as the input of a downstream model. The input of the VGGish model is a wav audio file. After the audio part of the corresponding video clip is acquired, the audio is resampled to 16 kHz mono and short-time Fourier transformed with a 25 ms Hann window and a 10 ms frame shift to obtain a spectrogram; the spectrogram is mapped onto a 64-band mel filter bank to compute a mel spectrum, and a stable log-mel spectrum is obtained by adding a bias of 0.01 before taking the logarithm so that the logarithm of zero is never taken. These features are then framed into non-overlapping patches of 0.96 s, each patch containing 64 mel bands per 10 ms frame, i.e. 96 frames in total.
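A minimal sketch of this audio front end, assuming librosa is available, is given below; it approximates the resampling, 25 ms / 10 ms log-mel analysis and 0.96 s framing described above and is not the official VGGish/AudioSet implementation (which additionally applies the pretrained VGGish network to map each patch to a 128-dimensional embedding).

import numpy as np
import librosa

def vggish_style_log_mel_patches(wav_path: str) -> np.ndarray:
    """Return non-overlapping 0.96 s patches of shape (96 frames, 64 mel bands)."""
    # resample to 16 kHz mono
    y, sr = librosa.load(wav_path, sr=16000, mono=True)
    # STFT with a 25 ms Hann window (400 samples) and 10 ms hop (160 samples),
    # mapped onto a 64-band mel filter bank
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=400, win_length=400, hop_length=160,
        window="hann", n_mels=64,
    )
    # log-mel with a 0.01 bias so the logarithm of zero is never taken
    log_mel = np.log(mel + 0.01).T                    # (n_frames, 64)
    # frame into non-overlapping 0.96 s patches = 96 frames x 64 bands
    n_patches = log_mel.shape[0] // 96
    return log_mel[: n_patches * 96].reshape(n_patches, 96, 64)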
Image feature extraction is the process of distinguishing images from one another and interpreting their meaning as far as possible. Because each second contains at least 24 frames and the differences between frames within one second are small, image feature extraction first requires frame sampling: a representative frame within a certain time range is selected, and natural features of the image such as brightness, edges, texture and color, as well as numerical features obtained by transformation or processing such as moments, histograms and principal components, are extracted. Several features of a class of objects are combined into a feature vector that represents the object: a single numerical feature gives a one-dimensional vector, while a combination of n features gives an n-dimensional feature vector. The original image features can be extracted with algorithms such as SIFT, HOG, ORB or HAAR, or with deep learning models such as an InceptionV3 model.
In a specific embodiment, the original image features of the corresponding video clip may be extracted as follows: the image part of the corresponding video clip is acquired and converted into a set of video frame images, at least a part of the video frame images are extracted as representative images, and the original image features are extracted from the representative images with an InceptionV3 model. The network can be pre-trained by designing a multi-layer stacked CNN structure: the network is trained in advance on a certain training set, such as training set A or training set B, and the network parameters obtained by training on task A or task B are saved for later use. When facing a third task C, the same network structure is adopted; the shallower CNN layers can be initialized with the parameters learned on task A or task B, while the higher CNN layers are still initialized randomly. The network is then trained with the training data of task C; during this training the parameters of the lower layers continue to change, being further adjusted so that they better suit the current task C.
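The following sketch shows one common way to obtain such representative-frame features with an ImageNet-pretrained InceptionV3 from torchvision; the use of torchvision, the 2048-dimensional pooled output and the preprocessing values are assumptions of this illustration, not requirements of the embodiment.

import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# ImageNet-pretrained InceptionV3 with the classification head removed,
# so the 2048-dimensional pooled activation serves as the raw image feature.
inception = models.inception_v3(weights=models.Inception_V3_Weights.IMAGENET1K_V1)
inception.fc = torch.nn.Identity()
inception.eval()

preprocess = T.Compose([
    T.Resize(342), T.CenterCrop(299),   # InceptionV3 expects 299x299 inputs
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def frame_features(representative_frames: list) -> torch.Tensor:
    """representative_frames: list of PIL.Image frames sampled from one clip."""
    batch = torch.stack([preprocess(img) for img in representative_frames])
    return inception(batch)              # (n_frames, 2048)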
After the original audio features and original image features of the corresponding video clips are extracted, the audio and image features are further refined. This improves the ability to solve the problem, reduces the number of irrelevant and redundant features, speeds up model training and improves learning efficiency while keeping the accuracy loss of the recognition model small, so that the class distribution of the recognition model's output is as close as possible to the true class distribution. The original audio and image features can be refined with various models, such as SuperPoint networks, deep convolutional neural networks (DCNN), LSTM models or Bi-LSTM models.
In a specific embodiment, feature extraction can be performed on the original audio features with a Bi-LSTM model to obtain the audio features of the corresponding video clip, and on the original image features with a Bi-LSTM model to obtain the image features of the corresponding video clip. The Bi-LSTM (Bi-directional Long Short-Term Memory) model consists of two independent LSTMs; the input sequence is fed into the two LSTM networks in forward and reverse order respectively for feature extraction, and the two output vectors (i.e., the extracted feature vectors) are concatenated as the final feature representation. The design idea of Bi-LSTM is that the feature obtained at time t should carry information from both the past and the future, so that the resulting audio and image features preserve the temporal characteristics of the sequence.
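A minimal Bi-LSTM refinement module in PyTorch might look as follows; the 128-dimensional input (matching the VGGish embedding size above) and the hidden size are illustrative assumptions.

import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Refine a sequence of raw (audio or image) feature vectors with a
    bi-directional LSTM; forward and backward hidden states are concatenated."""

    def __init__(self, in_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, in_dim) -> (batch, seq_len, 2 * hidden_dim)
        out, _ = self.lstm(x)
        return out

# e.g. refine the per-patch 128-d audio embeddings of one 5-second clip
audio_encoder = BiLSTMEncoder(in_dim=128)
refined_audio = audio_encoder(torch.randn(1, 5, 128))   # (1, 5, 256)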
Text feature extraction is a text mining and information retrieval process that converts unstructured raw text into structured information a computer can recognize and process, i.e., the text is abstracted scientifically and expressed quantitatively. Without damaging the core information of the text, the number of words to be processed is reduced as much as possible, lowering the dimension of the vector space and simplifying computation, thereby improving the speed and efficiency of text processing. For video, the text content needs to be obtained from the audio part. Text features such as term frequency, document frequency, inverse document frequency, mutual information and expected cross entropy can be obtained by statistics-based feature extraction methods such as constructing evaluation functions or N-gram algorithms; the features carrying the most classification information can also be obtained by semantics-based feature extraction methods such as Ontology models or the VSM vector space model.
In a specific embodiment, the audio part of the corresponding video clip is converted into text content by speech recognition, and features of the text content are extracted with a textCNN model to obtain the text features of the corresponding video clip. Because text content consists of discrete symbolic words that cannot convey semantic information by themselves, the words need to be mapped into a vector space; this not only facilitates the subsequent computation but also gives related vectors a certain semantic meaning during the mapping. Features of the text content are extracted with an ERNIE pre-trained model to obtain trained word vectors, which are then fed into a textCNN module and passed through three convolution-and-pooling branches to obtain the text features.
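A compact textCNN of the kind described above (three parallel convolution-and-pooling branches over pretrained word vectors) could be sketched as follows; the embedding size, filter count and kernel sizes are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Three parallel convolution + max-pooling branches over word vectors,
    concatenated into a single text feature for the clip."""

    def __init__(self, emb_dim: int = 300, n_filters: int = 100, kernel_sizes=(3, 4, 5)):
        super().__init__()
        self.convs = nn.ModuleList([nn.Conv1d(emb_dim, n_filters, k) for k in kernel_sizes])

    def forward(self, word_vectors: torch.Tensor) -> torch.Tensor:
        # word_vectors: (batch, seq_len, emb_dim), e.g. pretrained word embeddings
        x = word_vectors.transpose(1, 2)                       # (batch, emb_dim, seq_len)
        pooled = []
        for conv in self.convs:
            h = F.relu(conv(x))                                # (batch, n_filters, L')
            pooled.append(F.adaptive_max_pool1d(h, 1).squeeze(2))
        return torch.cat(pooled, dim=1)                        # (batch, 3 * n_filters)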
In step S102, after feature extraction, modality fusion is further performed to obtain a correlation representation of the audio, image and text features. When describing video in a computer-readable form, a single modality such as the audio features, the image features, or the text features corresponding to the audio features often cannot contain all of the effective information needed to convey the meaning accurately. Multimodal fusion combines information from two or more modalities so that they supplement one another, widens the coverage of the information contained in the input data, improves the accuracy of the prediction results and improves the robustness of model classification. Modality fusion methods include element-wise multiplication or addition of the modality representations at corresponding positions, building encoder-decoder structures, and integrating information with LSTM networks; they also include rule-based strategies for combining the outputs of different models, such as maximum combination, average combination, Bayesian rule combination and ensemble learning. Here, a first correlation between the audio feature and the text feature and a second correlation between the image feature and the text feature are calculated respectively, the product of the audio feature and the first correlation is taken as the audio feature input vector, and the product of the image feature and the second correlation is taken as the image feature input vector. The audio feature input vector is thus a fused feature vector between the audio and text modalities, and the image feature input vector is a fused feature vector between the image and text modalities.
In a specific embodiment, the first correlation between the audio feature and the text feature is calculated by an attention mechanism, and the second correlation between the image feature and the text feature is calculated by an attention mechanism. Attention mechanisms can assign importance weights to these different feature representations, determining the most relevant aspects while ignoring noise and redundancy in the input. When the input layer of the attention mechanism is the concatenation of the text features and the audio features, global average pooling (AvgPool) and global max pooling (MaxPool) are applied to the input feature layer (both pooling operations act on its height and width); the average-pooled and max-pooled results are each processed by a shared fully connected layer (shared MLP), the two outputs are added, and a Sigmoid activation is applied, yielding a channel attention map, i.e., a weight between 0 and 1 for each channel of the input feature layer. This weight is the first correlation, and multiplying it channel by channel onto the input feature layer gives the audio feature input vector. When the text features and the image features are concatenated as the input layer, the operation is analogous: the resulting channel attention map, i.e., the per-channel weight between 0 and 1, is the second correlation, and multiplying it channel by channel onto the input feature layer gives the image feature input vector.
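The channel-attention computation described above can be sketched as follows for features laid out as (batch, channels, length); treating the concatenated features as a 1-D map and the reduction ratio of the shared MLP are assumptions of this illustration.

import torch
import torch.nn as nn

class ChannelAttentionFusion(nn.Module):
    """Concatenated text+audio (or text+image) features are average- and
    max-pooled, passed through a shared MLP, summed and squashed by a Sigmoid;
    the resulting per-channel weight is the 'correlation', and multiplying it
    channel by channel onto the input yields the modality's input vector."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.shared_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, fused: torch.Tensor):
        # fused: (batch, channels, length) = concatenation of text and audio/image features
        avg = self.shared_mlp(fused.mean(dim=2))      # global average pooling branch
        mx = self.shared_mlp(fused.amax(dim=2))       # global max pooling branch
        weight = torch.sigmoid(avg + mx)              # per-channel correlation in (0, 1)
        return fused * weight.unsqueeze(2), weight    # weighted input vector, correlation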
In step S102, whether the corresponding video segment is a splitting node of the original video is judged according to the text feature, the audio feature input vector and the image feature input vector. The judgment can be made with classification models such as logistic regression, naive Bayes, decision trees, support vector machines, random forests or gradient boosting trees. Based on the text feature, the audio feature input vector, the image feature input vector and other feature vectors, each video segment is classified into one of two classes: it is, or is not, a splitting node of the original video. In one embodiment of the application, a recognition model is used for this classification. Specifically, the text feature, the audio feature input vector and the image feature input vector are input into a trained recognition model, and whether the corresponding video segment is a splitting node of the original video is judged according to the model's output. The recognition model can use a Softmax output layer, so that instead of a single hard maximum it assigns a probability to each classification result, indicating how likely the corresponding video segment is (or is not) a splitting node; in effect, a type label is added to the video segment, indicating whether it is a splitting node of the original video. If the probability that the corresponding video segment is a splitting node exceeds a preset threshold (e.g., 0.9), the segment is determined to be a splitting node of the original video.
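As an illustrative sketch of such a Softmax-based recognition head (the 0.9 threshold follows the example in the text; the network shape and hidden size are assumptions):

import torch
import torch.nn as nn

class SplitNodeClassifier(nn.Module):
    """Binary recognition head: the concatenated text feature, audio feature
    input vector and image feature input vector are mapped to a probability
    over {not a splitting node, splitting node}."""

    def __init__(self, in_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, text_feat, audio_vec, image_vec):
        x = torch.cat([text_feat, audio_vec, image_vec], dim=1)
        return torch.softmax(self.net(x), dim=1)

def is_split_node(probs: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    # probs[:, 1] is the probability that the clip is a splitting node
    return probs[:, 1] > threshold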
In step S103, the original video is split into a plurality of target videos with the start time or the end time of the video clip determined as the splitting node as the splitting time.
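Again purely as an illustration (the split times and file names below are hypothetical), the final cutting at the determined splitting times could be done with ffmpeg:

import subprocess

def cut_target_video(video_path: str, start: float, end: float, out_path: str):
    """Cut one target video between two consecutive split times (in seconds)."""
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-ss", str(start), "-to", str(end),
         "-c:v", "libx264", "-c:a", "aac", out_path],
        check=True,
    )

# Split times derived from the clips judged to be splitting nodes (hypothetical values).
split_times = [0.0, 125.0, 310.0, 1800.0]
for i, (s, e) in enumerate(zip(split_times[:-1], split_times[1:])):
    cut_target_video("original_video.mp4", s, e, f"target_{i:02d}.mp4")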
The implementation of a preferred embodiment of the present application is more intuitively illustrated by a specific example. FIG. 2 is a training flow diagram of an identification model, according to an exemplary embodiment of the present application. As shown in fig. 2, the training process of the recognition model mainly includes the following steps.
Step S201: and obtaining a plurality of sample video fragments obtained by dividing an original sample video, wherein type labels are added to the sample video fragments, and the type labels indicate whether the corresponding sample video fragments are split nodes of the original sample video or not.
Step S202: for each sample video segment, the following operations are performed: extracting sample audio features and sample image features of a corresponding sample video segment, and sample text features corresponding to the sample audio features; respectively calculating a first sample correlation between the sample audio feature and the sample text feature and a second sample correlation between the sample image feature and the sample text feature; taking the product of the correlation of the sample audio feature and the first sample as a sample audio feature input vector, taking the product of the correlation of the sample image feature and the second sample as a sample image feature input vector, and taking the sample text feature, the sample audio feature input vector and the sample image feature input vector as a group of training samples corresponding to corresponding sample video segments.
Step S203: and training the original recognition model according to training samples and type labels respectively corresponding to the plurality of sample video clips to obtain the trained recognition model.
The sample video clips are divided into a training set, a validation set and a test set according to a certain proportion (e.g., 8:1:1); the training-set samples are used to train the recognition model, the validation-set samples are used to evaluate it, and the test-set samples are used to measure its error. When the accuracy of the trained model on the validation set no longer increases and the loss value has stably decayed to a preset value, the model is considered to have converged, the iterative training ends, and the trained recognition model is obtained.
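A small helper for the 8:1:1 split and the convergence check described above might look as follows; the patience value and the loss target are illustrative assumptions.

import random

def split_dataset(labelled_clips: list, ratios=(0.8, 0.1, 0.1), seed: int = 42):
    """Shuffle the labelled sample clips and divide them into training,
    validation and test sets in the given proportion (e.g. 8:1:1)."""
    rng = random.Random(seed)
    clips = list(labelled_clips)
    rng.shuffle(clips)
    n = len(clips)
    n_train, n_val = int(ratios[0] * n), int(ratios[1] * n)
    return clips[:n_train], clips[n_train:n_train + n_val], clips[n_train + n_val:]

def has_converged(val_acc_history: list, val_loss: float,
                  patience: int = 3, loss_target: float = 0.05) -> bool:
    """Stop when validation accuracy has not improved for `patience` epochs
    and the validation loss has decayed below a preset value."""
    if len(val_acc_history) <= patience:
        return False
    no_improvement = max(val_acc_history[-patience:]) <= max(val_acc_history[:-patience])
    return no_improvement and val_loss <= loss_target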
In a specific embodiment, the trained recognition model is retrained at the end of each preset update period and/or when the accumulated number of sample video clips with added type labels reaches a preset update-sample threshold. The trained model is deployed in a software application that provides users, online or offline, with a function for labeling an original video, i.e., confirming its splitting points. A certain time period and/or sample threshold is set; when a preset update period ends, or when the accumulated number of labeled sample video clips reaches the preset update-sample threshold, the video samples labeled by users are uploaded to a database and collected as new training-set samples. To guard against meaningless labeling by users, a portion of this data is randomly sampled for manual review and verification. In this way, with only a small amount of manual labeling combined with user data, the overall effect of the model is improved through this workflow.
Corresponding to the embodiment of the method, the embodiment of the application also provides a video splitting device, which is used for supporting the video splitting method provided by any one embodiment or combination of the embodiments.
Fig. 3 is a schematic structural diagram of a video splitting apparatus according to an exemplary embodiment, where the apparatus includes: a preprocessing unit 31, a judging unit 32 and a splitting unit 33.
The preprocessing unit 31 is configured to acquire an original video, and divide the original video into a plurality of video clips.
A judging unit 32, configured to perform the following operations for each video clip: extracting audio features and image features of the corresponding video clips, and text features corresponding to the audio features; respectively calculating a first correlation between the audio feature and the text feature and a second correlation between the image feature and the text feature; taking the product of the audio feature and the first correlation as an audio feature input vector, taking the product of the image feature and the second correlation as an image feature input vector, and judging whether the corresponding video segment is a splitting node of the original video according to the text feature, the audio feature input vector and the image feature input vector.
A splitting unit 33, configured to split the original video into a plurality of target videos with a start time or a stop time of the video segment determined as the splitting node as a splitting time.
In an exemplary embodiment, the judging unit 32 is further configured to extract the original audio features and the original image features of the corresponding video clip respectively; perform feature extraction on the original audio features through a Bi-LSTM model to obtain the audio features of the corresponding video clip; perform feature extraction on the original image features through a Bi-LSTM model to obtain the image features of the corresponding video clip; convert the audio part of the corresponding video clip into text content through voice recognition; and perform feature extraction on the text content through a textCNN model to obtain the text features of the corresponding video clip.
The judging unit 32 is further configured to acquire the audio part of the corresponding video clip, convert the audio part into an audio waveform, and extract the original audio features from the audio waveform with a VGGish model; acquire the image part of the corresponding video clip, convert the image part into a set of video frame images, extract at least a part of the video frame images as representative images, and extract the original image features from the representative images with an InceptionV3 model; calculate the first correlation between the audio feature and the text feature by an attention mechanism; and calculate the second correlation between the image feature and the text feature by an attention mechanism.
In another exemplary embodiment, the determining unit 32 is further configured to use the text feature, the audio feature input vector, and the image feature input vector as the input and input trained recognition models, and determine whether the corresponding video segment is a splitting node of the original video according to the output result of the recognition models.
In addition, the video splitting apparatus of the embodiment of the application further comprises a training unit (not shown in fig. 3). The training unit is used for acquiring a plurality of sample video clips obtained by dividing an original sample video, wherein type labels are added to the sample video clips and indicate whether the corresponding sample video clips are splitting nodes of the original sample video; for each sample video clip, performing the following operations: extracting sample audio features and sample image features of the corresponding sample video clip, and sample text features corresponding to the sample audio features; respectively calculating a first sample correlation between the sample audio feature and the sample text feature and a second sample correlation between the sample image feature and the sample text feature; taking the product of the sample audio feature and the first sample correlation as a sample audio feature input vector, taking the product of the sample image feature and the second sample correlation as a sample image feature input vector, and taking the sample text feature, the sample audio feature input vector and the sample image feature input vector as a group of training samples corresponding to the corresponding sample video clip; and training the original recognition model according to the training samples and type labels respectively corresponding to the plurality of sample video clips to obtain the trained recognition model.
The video splitting device of the embodiment of the application further comprises: an updating unit (not shown in fig. 3). And the updating unit is used for updating and training the trained identification model when each preset updating period is finished and/or the accumulated number of the obtained sample video fragments added with the type labels reaches a preset updating sample threshold value.
The embodiment of the video splitting apparatus can be applied to computer equipment, such as a server or a terminal device. The apparatus embodiments may be implemented by software, or by hardware or a combination of hardware and software. Taking software implementation as an example, the apparatus in a logical sense is formed by the processor of the computer device on which it is located reading the corresponding computer program instructions from non-volatile memory into memory and running them. In terms of hardware, fig. 4 shows a hardware structure diagram of the computer device on which the video splitting apparatus of the embodiment of the present application is located; in addition to the processor 402, the internal bus 404, the network interface 406, the memory 408 and the non-volatile memory 410 shown in fig. 4, the computer device may include other hardware according to its actual functions, which will not be described here.
The implementation process of the functions and roles of each unit in the above device is specifically shown in the implementation process of the corresponding steps in the above method, and will not be described herein again.
For the device embodiments, reference is made to the description of the method embodiments for the relevant points, since they essentially correspond to the method embodiments. The apparatus embodiments described above are merely illustrative, wherein the modules illustrated as separate components may or may not be physically separate, and the components shown as modules may or may not be physical, i.e., may be located in one place, or may be distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present application. Those of ordinary skill in the art will understand and implement the present application without undue burden.
The foregoing describes certain embodiments of the present application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.
The foregoing description of the preferred embodiments of the application is not intended to be limiting, but rather to enable any modification, equivalent replacement, improvement or the like to be made within the spirit and principles of the application.

Claims (10)

1. A method of video splitting, comprising:
acquiring an original video, and dividing the original video into a plurality of video clips;
For each video clip, the following operations are performed: extracting audio features and image features of the corresponding video clips, and text features corresponding to the audio features; respectively calculating a first correlation between the audio feature and the text feature and a second correlation between the image feature and the text feature; taking the product of the audio feature and the first correlation as an audio feature input vector, taking the product of the image feature and the second correlation as an image feature input vector, and judging whether the corresponding video segment is a splitting node of the original video according to the text feature, the audio feature input vector and the image feature input vector;
and splitting the original video into a plurality of target videos by taking the starting time or the ending time of the video segment determined as the splitting node as the splitting time.
2. The method of claim 1, wherein the extracting audio features and image features of the respective video clip, and text features corresponding to the audio features, comprises:
respectively extracting original audio features and original image features of corresponding video clips;
Extracting the characteristics of the original audio characteristics through a Bi-LSTM model to obtain the audio characteristics of the corresponding video clips; extracting features of the original image features through a Bi-LSTM model to obtain image features of corresponding video clips;
converting the audio part of the corresponding video clip into text content through voice recognition; and extracting the characteristics of the text content through a textCNN model to obtain the text characteristics of the corresponding video clips.
3. The method of claim 2, wherein extracting the original audio features and the original image features of the respective video segments, respectively, comprises:
acquiring an audio part of the corresponding video clip, converting the audio part into an audio waveform, and extracting the original audio feature from the audio waveform by adopting a VGGish model;
the image part of the corresponding video clip is obtained, the image part is converted into a set of video frame images, at least a part of the video frame images are extracted as representative images, and the original image features are extracted from the representative images with an InceptionV3 model.
4. The method of claim 1, wherein the computing a first correlation between the audio feature and the text feature, and a second correlation between the image feature and the text feature, respectively, comprises:
Calculating a first correlation between the audio feature and the text feature by an attention mechanism; and calculating a second correlation between the image feature and the text feature by an attention mechanism.
5. The method of claim 1, wherein said determining whether the respective video segment is a split node of the original video based on the text feature, the audio feature input vector, and the image feature input vector comprises:
and inputting the text feature, the audio feature input vector and the image feature input vector into a trained recognition model, and judging whether the corresponding video segment is a splitting node of the original video according to the output result of the recognition model.
6. The method of claim 5, wherein the recognition model is trained by:
obtaining a plurality of sample video fragments obtained by dividing an original sample video, wherein type labels are added to the sample video fragments, and the type labels indicate whether the corresponding sample video fragments are split nodes of the original sample video or not;
for each sample video segment, the following operations are performed: extracting sample audio features and sample image features of the corresponding sample video segment, and sample text features corresponding to the sample audio features; respectively calculating a first sample correlation between the sample audio feature and the sample text feature and a second sample correlation between the sample image feature and the sample text feature; taking the product of the sample audio feature and the first sample correlation as a sample audio feature input vector, taking the product of the sample image feature and the second sample correlation as a sample image feature input vector, and taking the sample text feature, the sample audio feature input vector and the sample image feature input vector as a group of training samples corresponding to the corresponding sample video segment;
And training the original recognition model according to training samples and type labels respectively corresponding to the plurality of sample video clips to obtain the trained recognition model.
7. The method of claim 6, wherein the method further comprises:
and retraining the trained recognition model at the end of each preset update period and/or when the accumulated number of sample video segments with added type labels reaches a preset update-sample threshold.
8. A video splitting apparatus, comprising:
the preprocessing unit is used for acquiring an original video and dividing the original video into a plurality of video clips;
a judging unit, configured to perform the following operations for each video clip: extracting audio features and image features of the corresponding video clips, and text features corresponding to the audio features; respectively calculating a first correlation between the audio feature and the text feature and a second correlation between the image feature and the text feature; taking the product of the audio feature and the first correlation as an audio feature input vector, taking the product of the image feature and the second correlation as an image feature input vector, and judging whether the corresponding video segment is a splitting node of the original video according to the text feature, the audio feature input vector and the image feature input vector;
And the splitting unit is used for splitting the original video into a plurality of target videos by taking the starting time or the ending time of the video segment determined as the splitting node as the splitting time.
9. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor implements the method of video splitting of any of claims 1-7 by executing the executable instructions.
10. A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the video splitting method of any of claims 1-7.
CN202310029942.1A 2023-01-09 2023-01-09 Video splitting method and device Active CN115905584B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310029942.1A CN115905584B (en) 2023-01-09 2023-01-09 Video splitting method and device

Publications (2)

Publication Number Publication Date
CN115905584A CN115905584A (en) 2023-04-04
CN115905584B true CN115905584B (en) 2023-08-11

Family

ID=86488331

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310029942.1A Active CN115905584B (en) 2023-01-09 2023-01-09 Video splitting method and device

Country Status (1)

Country Link
CN (1) CN115905584B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110737801A (en) * 2019-10-14 2020-01-31 腾讯科技(深圳)有限公司 Content classification method and device, computer equipment and storage medium
US10958982B1 (en) * 2020-09-18 2021-03-23 Alphonso Inc. Closed-caption processing using machine learning for media advertisement detection
CN112860943A (en) * 2021-01-04 2021-05-28 浙江诺诺网络科技有限公司 Teaching video auditing method, device, equipment and medium
CN114363695A (en) * 2021-11-11 2022-04-15 腾讯科技(深圳)有限公司 Video processing method, video processing device, computer equipment and storage medium
WO2022134698A1 (en) * 2020-12-22 2022-06-30 上海幻电信息科技有限公司 Video processing method and device
CN115563316A (en) * 2022-10-27 2023-01-03 桂林电子科技大学 Cross-modal retrieval method and retrieval system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110097026B (en) * 2019-05-13 2021-04-27 北京邮电大学 Paragraph association rule evaluation method based on multi-dimensional element video segmentation
CN113569740B (en) * 2021-07-27 2023-11-21 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Video recognition model training method and device, and video recognition method and device
CN114863948A (en) * 2022-04-28 2022-08-05 新疆大学 CTCATtention architecture-based reference text related pronunciation error detection model


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Learning Relationships between Text, Audio, and Video via Deep Canonical Correlation for Multimodal Language Analysis; Zhongkai Sun; The Thirty-Fourth AAAI Conference on Artificial Intelligence; full text *

Also Published As

Publication number Publication date
CN115905584A (en) 2023-04-04

Similar Documents

Publication Publication Date Title
CN111444428B (en) Information recommendation method and device based on artificial intelligence, electronic equipment and storage medium
CN110737801B (en) Content classification method, apparatus, computer device, and storage medium
CN109117777B (en) Method and device for generating information
US20210020182A1 (en) Personalization of experiences with digital assistants in communal settings through voice and query processing
CN109104620B (en) Short video recommendation method and device and readable medium
CN111062871B (en) Image processing method and device, computer equipment and readable storage medium
CN111209440B (en) Video playing method, device and storage medium
CN112163122B (en) Method, device, computing equipment and storage medium for determining label of target video
CN111428088A (en) Video classification method and device and server
Le et al. NII-HITACHI-UIT at TRECVID 2016.
CN112989212B (en) Media content recommendation method, device and equipment and computer storage medium
CN112733660B (en) Method and device for splitting video strip
CN111783712A (en) Video processing method, device, equipment and medium
CN114218488A (en) Information recommendation method and device based on multi-modal feature fusion and processor
Tian et al. Sequential deep learning for disaster-related video classification
Segura et al. Automatic speech feature learning for continuous prediction of customer satisfaction in contact center phone calls
CN114119136A (en) Product recommendation method and device, electronic equipment and medium
CN116547681A (en) Dynamic language model for continuously evolving content
Yu et al. Speaking style based apparent personality recognition
CN112330442A (en) Modeling method and device based on ultra-long behavior sequence, terminal and storage medium
CN107369450A (en) Recording method and collection device
CN115905584B (en) Video splitting method and device
CN116955591A (en) Recommendation language generation method, related device and medium for content recommendation
CN114780757A (en) Short media label extraction method and device, computer equipment and storage medium
US11363352B2 (en) Video content relationship mapping

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant