CN115129932A - Video clip determination method, device, equipment and storage medium


Info

Publication number
CN115129932A
Authority
CN
China
Prior art keywords
video
audio
image
similarity
determining
Prior art date
Legal status
Pending
Application number
CN202210363724.7A
Other languages
Chinese (zh)
Inventor
冯鑫
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202210363724.7A
Publication of CN115129932A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 - Information retrieval of video data
    • G06F16/75 - Clustering; Classification
    • G06F16/78 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 - Retrieval characterised by using metadata automatically derived from the content
    • G06F16/7834 - Retrieval using metadata automatically derived from the content, using audio features
    • G06F16/7844 - Retrieval using metadata automatically derived from the content, using original textual content or text extracted from visual content or transcript of audio data
    • G06F16/7847 - Retrieval using metadata automatically derived from the content, using low-level visual features of the video content

Abstract

The application provides a method, an apparatus, a device and a storage medium for determining video segments, applicable to scenarios such as video segment identification, artificial intelligence and vehicle-mounted scenarios in computer technology. The method includes: determining audio similarity information based on the audio data of a first video and the audio data of a second video; determining image similarity information based on the image data of the first video and the image data of the second video; and determining a head video segment and a tail video segment (the opening and ending credits) in the first video and the second video based on the audio similarity information and the image similarity information. In this scheme, the head video segment and the tail video segment in a video are determined using information from both the audio dimension and the image dimension, so that the two dimensions complement each other, which reduces the difficulty of determining the head and tail video segments and improves accuracy.

Description

Video clip determination method, device, equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for determining a video clip.
Background
With the development of Internet technology, watching television series has become a common form of entertainment. To improve the viewing experience, video platforms usually provide a function for skipping the opening and ending credits of an episode, and this skipping relies on determining the positions of the video segments corresponding to the head and the tail of each episode of the series.
In the related art, these positions are usually determined by manual viewing: an annotator watches the series and then marks the positions of the video segments corresponding to the head and the tail in each episode.
However, this labeling method consumes a great deal of time and human resources, making the determination of the positions of the head and the tail of a television series inefficient.
Disclosure of Invention
The embodiments of the present application provide a method, an apparatus, a device and a storage medium for determining video segments, which allow the information of the audio dimension and the image dimension to complement each other, thereby reducing the difficulty of determining head and tail video segments and improving accuracy. The technical solution is as follows:
in one aspect, a method for determining a video segment is provided, where the method includes:
determining audio similarity information based on audio data of a first video and audio data of a second video, where the audio similarity information is used to represent the audio similarity between the first video and the second video, and the first video and the second video are both series videos belonging to the same series;
determining image similarity information based on image data of the first video and image data of the second video, where the image similarity information is used to represent the image similarity between the first video and the second video;
and determining a head video segment and a tail video segment in the first video and the second video based on the audio similarity information and the image similarity information.
In another aspect, an apparatus for determining a video segment is provided, the apparatus comprising:
an audio determining module, configured to determine audio similarity information based on audio data of a first video and audio data of a second video, where the audio similarity information is used to represent the audio similarity between the first video and the second video, and the first video and the second video are both series videos belonging to the same series;
an image determining module, configured to determine image similarity information based on image data of the first video and image data of the second video, where the image similarity information is used to represent the image similarity between the first video and the second video;
and a segment determining module, configured to determine a head video segment and a tail video segment in the first video and the second video based on the audio similarity information and the image similarity information.
In some embodiments, the audio determination module comprises:
an audio feature extraction unit, configured to extract an audio feature vector of the first video from audio data of the first video;
the audio feature extraction unit is further configured to extract an audio feature vector of the second video from the audio data of the second video;
a first determining unit, configured to determine the audio similarity information based on a similarity between an element in the audio feature vector of the first video and an element in the audio feature vector of the second video.
In some embodiments, the audio feature extraction unit is configured to extract a temporal feature vector of the first video from audio data of the first video; extracting a frequency domain feature vector of the first video from audio data of the first video; and fusing the time domain characteristic vector and the frequency domain characteristic vector of the first video to obtain the audio characteristic vector of the first video.
In some embodiments, the first determining unit is configured to determine, for any element in the audio feature vector of the first video, the cosine distance between that element and each element in the audio feature vector of the second video to obtain a similarity row vector of that element, and to construct audio similarity information in the form of an audio similarity matrix based on the similarity row vectors of the elements in the audio feature vector of the first video.
In some embodiments, the apparatus further comprises:
an image frame acquisition module, configured to acquire the target dimension number of the audio feature vector extracted from the audio data, and to extract, from the first video and the second video respectively, a number of image frames equal to the target dimension number to obtain the image data of the first video and the image data of the second video.
In some embodiments, the image determination module comprises:
an image feature extraction unit configured to extract an image feature vector of the first video from image data of the first video;
the image feature extraction unit is further configured to extract an image feature vector of the second video from image data of the second video;
a second determining unit, configured to determine the image similarity information based on a similarity between an element in the image feature vector of the first video and an element in the image feature vector of the second video.
In some embodiments, the second determining unit is configured to determine, for any element in the image feature vector of the first video, a cosine distance between the element and each element in the image feature vector of the second video, so as to obtain a similarity row vector of the element; and constructing image similarity information in an image similarity matrix form based on the similarity row vectors of all elements in the image feature vector of the first video.
In some embodiments, the segment determining module is configured to fuse the audio similarity information and the image similarity information to obtain video similarity information, where the video similarity information is used to represent the similarity between the first video and the second video; determine a first time period and a second time period based on the video similarity information; determine a head video segment in the first video and the second video based on the first time period, the head video segment lying within the first time period; and determine a tail video segment in the first video and the second video based on the second time period, the tail video segment lying within the second time period.
In some embodiments, the segment determining module is configured to average the values at corresponding positions in the audio similarity information and the image similarity information to obtain intermediate fusion information, and to normalize the values in the intermediate fusion information to obtain the video similarity information.
In some embodiments, the segment determining module is configured to determine subtitle similarity information based on subtitle data of the first video and subtitle data of the second video, where the subtitle similarity information is used to represent the subtitle similarity between the first video and the second video, and to determine a head video segment and a tail video segment in the first video and the second video based on the audio similarity information, the image similarity information and the subtitle similarity information.
In another aspect, a computer device is provided, including a processor and a memory, where the memory is configured to store at least one computer program, and the at least one computer program is loaded and executed by the processor to implement the method for determining a video segment according to the embodiments of the present application.
In another aspect, a computer-readable storage medium is provided, in which at least one computer program is stored, and the at least one computer program is loaded and executed by a processor to implement the method for determining a video segment according to the embodiments of the present application.
In another aspect, a computer program product is provided, including a computer program which, when executed by a processor, implements the method for determining a video segment as provided in the above aspects or in the various optional implementations of those aspects.
The embodiments of the present application provide a video segment determination scheme in which information from two dimensions, audio and image, is used to determine the head video segment and the tail video segment in a video, so that the audio-dimension information and the image-dimension information complement each other, reducing the difficulty of determining the head and tail video segments and improving accuracy.
Drawings
To describe the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic implementation environment of a video segment determination method according to an embodiment of the present application;
fig. 2 is a flowchart of a method for determining a video segment according to an embodiment of the present application;
fig. 3 is a flowchart of a method for determining a video segment according to an embodiment of the present application;
fig. 4 is a schematic diagram of a PANNs network according to an embodiment of the present application;
FIG. 5 is a schematic diagram of constructing an audio similarity matrix according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a Resnet50 network according to an embodiment of the present application;
fig. 7 is a schematic diagram of a video similarity matrix according to an embodiment of the present application;
fig. 8 is a flowchart of another method for determining a video segment according to an embodiment of the present application;
fig. 9 is a block diagram of an apparatus for determining a video segment according to an embodiment of the present application;
fig. 10 is a block diagram of another apparatus for determining a video segment according to an embodiment of the present application;
fig. 11 is a block diagram of a terminal according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of a server provided according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The terms "first," "second," and the like in this application are used for distinguishing between similar items and items that have substantially the same function or similar functionality, and it should be understood that "first," "second," and "nth" do not have any logical or temporal dependency or limitation on the number or order of execution.
The term "at least one" in this application means one or more, and the meaning of "a plurality" means two or more.
It should be noted that the information (including but not limited to user device information and user personal information), data (including but not limited to data for analysis, stored data and displayed data) and signals involved in this application are all authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data comply with the relevant laws, regulations and standards of the relevant countries and regions. For example, the audio data and image data involved in this application are obtained with full authorization.
Hereinafter, terms related to the present application are explained.
Machine Learning (ML) is a multidisciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It studies how computers can simulate or implement human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from instruction.
Cosine distance, also called cosine similarity, measures the difference between two items by the cosine of the angle between their two vectors in a vector space.
ResNet (Residual Network) adds the idea of residual learning to a conventional convolutional neural network, alleviating the vanishing-gradient problem and the accuracy degradation (on the training set) that occur in deep networks, so that networks can be made deeper while accuracy is maintained and speed remains under control. ResNet50 is a large visual neural network built on the residual network.
PANNs (Pretrained Audio Neural Networks, large-scale audio neural networks pretrained on large audio data sets for audio pattern recognition) are commonly used as front-end encoding networks for audio pattern recognition or for frame-level audio embedding.
The reshape function in MATLAB transforms a given matrix into a matrix of specified dimensions; the number of elements in the matrix is unchanged, and the function can readjust the number of rows, columns and dimensions of the matrix. In the syntax B = reshape(A, size), the function returns an n-dimensional array with the same elements as A, where the size of each dimension of the reconstructed array is determined by the vector size.
The ReLU (Rectified Linear Unit) activation function, also called the rectified linear unit, is a commonly used activation function in artificial neural networks and usually refers to the nonlinear ramp function and its variants.
A bottleneck layer is generally used in deeper networks; its main function is to reduce the number of parameters involved in the computation.
Dynamic programming is a branch of operations research; it is a mathematical method for optimizing decision processes.
Hereinafter, an embodiment of the present application will be described.
The method for determining the video segment provided by the embodiment of the application can be executed by computer equipment. In some embodiments, the computer device is a terminal or a server. Next, an implementation environment of the method for determining a video segment provided in the embodiment of the present application is described, and fig. 1 is a schematic diagram of an implementation environment of a method for determining a video segment provided in the embodiment of the present application. Referring to fig. 1, the implementation environment includes a terminal 101 and a server 102.
The terminal 101 and the server 102 can be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
In some embodiments, the terminal 101 is, but is not limited to, a smartphone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, a smart voice interaction device, a smart appliance or a vehicle-mounted terminal. An application program supporting multimedia playback, such as a player application, a social application, an information-stream application or a vehicle-mounted application, is installed and runs on the terminal 101.
In some embodiments, the server 102 is an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), big data and artificial intelligence platforms. The server 102 provides background services for the application programs supporting multimedia playback. In some embodiments, the server 102 undertakes the primary computing work and the terminal 101 undertakes the secondary computing work; or the server 102 undertakes the secondary computing work and the terminal 101 undertakes the primary computing work; or the server 102 and the terminal 101 perform collaborative computing using a distributed computing architecture.
It should be noted that the number of the terminals 101 and the servers 102 is not limited.
An application scenario of the embodiments of the present application is described below with reference to the foregoing implementation environment, where the terminal is the terminal 101 and the server is the server 102 in the foregoing implementation environment.
The method for determining video segments provided by the embodiments of the present application can be applied to the scenario of determining the head video segment and the tail video segment of series videos, where a series includes multiple multimedia assets of consecutive videos, such as a television series or a documentary. For example, the method is applied to identifying the opening and ending credits of a television series, or of a documentary, and so on.
Take applying the method for determining video segments provided by the embodiments of the present application to identifying the head and tail of a television series as an example. First, the series of videos to be processed is acquired; the series includes a plurality of series videos, each of which is one episode of the television series. Then, any two episodes are extracted from the television series as the first video and the second video, and the head video segment and the tail video segment in the first video and the second video are determined using the technical solution provided by the embodiments of the present application. Two further episodes are then extracted as a new first video and a new second video, and their head and tail video segments are determined. This is repeated until the head video segment and the tail video segment in every episode of the television series have been determined.
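To make the batch procedure above concrete, the following is a minimal Python sketch of pairing episodes and marking each one; the function determine_head_tail is a hypothetical helper standing in for the scheme described in this application, not an interface defined by it.
```python
# Hypothetical sketch of the batch procedure above: process consecutive
# episode pairs until every episode has a marked head/tail segment.
# determine_head_tail stands in for the scheme detailed later.
def mark_series(episodes, determine_head_tail):
    results = {}
    for first, second in zip(episodes, episodes[1:]):
        head, tail = determine_head_tail(first, second)
        results.setdefault(first, {"head": head, "tail": tail})
        results.setdefault(second, {"head": head, "tail": tail})
    return results
```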
In some embodiments, before a television series is released, the server can use the scheme provided by the embodiments of the present application to mark the head video segment and the tail video segment in each episode of the series, so that when a user watches the series on a terminal, triggering a "skip head and tail" operation instructs the terminal to automatically skip the head video segment and the tail video segment during playback.
In some embodiments, while a user watches a television series online on a terminal, the server can also determine the head video segment and the tail video segment of each episode in real time based on the scheme provided by the present application and transmit their positions to the terminal, and the terminal skips the head video segment and the tail video segment in real time while playing the series.
It should be noted that the above description takes identification of the head and tail of a television series as an example; the implementation in other application scenarios follows the same inventive concept and is not repeated here.
In the embodiment of the present application, the computer device can be configured as a terminal or a server, and the technical solution provided in the embodiment of the present application may be executed by the terminal or the server, or may be executed by both the terminal and the server. Fig. 2 is a flowchart of a method for determining a video segment according to an embodiment of the present application, and as shown in fig. 2, the embodiment of the present application is described as an example executed by a server. The method for determining the video clip comprises the following steps:
201. The server determines audio similarity information based on audio data of a first video and audio data of a second video, where the audio similarity information is used to represent the audio similarity between the first video and the second video, and the first video and the second video are both series videos belonging to the same series.
In the embodiments of the present application, the server is the server 102 in fig. 1. The first video and the second video are series videos belonging to the same series, such as episodes of a television series or recorded videos of a documentary. The audio data is the audio signal of a video. The server can extract audio feature vectors from the audio data and then determine the audio similarity information based on the audio feature vectors of the two videos; the audio similarity information represents the similarity between the audio of the first video and the audio of the second video.
202. The server determines image similarity information indicating image similarity between the first video and the second video based on the image data of the first video and the image data of the second video.
In the embodiment of the application, the image data is image frames of videos, and the server can extract image feature vectors from the image data and then determine image similarity information based on the image feature vectors of the two videos, wherein the image similarity information can represent the similarity between the image frames in the first video and the image frames in the second video.
203. The server determines a head video segment and a tail video segment in the first video and the second video based on the audio similarity information and the image similarity information.
In the embodiments of the present application, the server can fuse the audio similarity information and the image similarity information so that the audio-dimension information and the image-dimension information complement each other, and thereby determine the similar video segments in the first video and the second video, that is, the head video segment and the tail video segment in the first video and the second video.
The embodiments of the present application provide a video segment determination scheme in which information from two dimensions, audio and image, is used to determine the head video segment and the tail video segment in a video, so that the audio-dimension information and the image-dimension information complement each other, reducing the difficulty of determining the head and tail video segments and improving accuracy.
Fig. 2 illustrates a main flow of a method for determining a video segment according to an embodiment of the present application, and the foregoing solution according to the embodiment of the present application is further described below based on an application scenario. In the embodiment of the present application, the computer device can be configured as a terminal or a server, and the technical solution provided in the embodiment of the present application may be executed by the terminal or the server, or may be executed by both the terminal and the server. Fig. 3 is a flowchart of a method for determining a video segment according to an embodiment of the present application, and as shown in fig. 3, the embodiment of the present application is described as an example of a server executing the method. The method for determining the video clip comprises the following steps:
301. The server extracts an audio feature vector of a first video from audio data of the first video, where the first video is a series video.
In this embodiment, the first video is any one of a plurality of series videos that belong to the same series; for example, the first video is any episode of a television series, or any recorded video of a documentary. The server performs feature extraction on the audio data of the first video to obtain the audio feature vector.
In some embodiments, the server extracts features in the time-domain dimension and the frequency-domain dimension separately and then fuses the extracted feature vectors to obtain the audio feature vector. Correspondingly, the server extracts the time-domain feature vector of the first video from the audio data of the first video, extracts the frequency-domain feature vector of the first video from the audio data of the first video, and then fuses the time-domain feature vector and the frequency-domain feature vector of the first video to obtain the audio feature vector of the first video. Extracting the time-domain feature vector captures the audio loudness and the amplitude at the audio sample points. Because feature vectors are extracted separately in the time-domain and frequency-domain dimensions and then fused, the resulting audio feature vector contains audio semantic information of both the time-domain and frequency-domain dimensions.
For example, take extracting the audio feature vector with a PANNs network. Referring to fig. 4, fig. 4 is a schematic structural diagram of a PANNs network according to an embodiment of the present application. As shown in fig. 4, the input to the network is audio data sampled at 32 kHz. The network structure includes: 401. a one-dimensional convolution layer whose input is the audio data; it contains a one-dimensional convolutional neural network (Conv1D) with a kernel size of 11 and a stride of 5 and is used to process the audio data. 402. A one-dimensional convolution block layer containing three one-dimensional convolution blocks, each followed by a downsampling layer with a stride of 4. Each one-dimensional convolution block consists of two convolution layers with dilations of 1 and 2 respectively, a design used to enlarge the receptive field of the convolution layers. With the stride and the three downsampling stages, the 32 kHz audio is downsampled to 32000/5/4/4/4 = 100 feature frames per second. 403. A sequence reconstruction layer, which uses a Reshape operation to reconstruct the one-dimensional sequence output by the one-dimensional convolution block layer into a two-dimensional time-domain feature map, i.e. the time-domain feature vector. The one-dimensional sequence contains 100 elements per second, corresponding to the 100 frames, and each element contains 2048-dimensional features, so the two-dimensional time-domain feature map is a 100 x 2048 matrix. 404. A spectrum processing layer, whose input is the audio data, used to compute the log-mel spectrum of the audio data. 405. A two-dimensional convolution block layer, used to process the log-mel spectrum to obtain a two-dimensional frequency-domain feature map, i.e. the frequency-domain feature vector; the number of dimensions of the two-dimensional frequency-domain feature map matches that of the two-dimensional time-domain feature map. 406. A fusion layer, used to concatenate (concat) the time-domain feature vector and the frequency-domain feature vector. 407. A two-dimensional convolution layer, used to process the concatenated feature vectors. 408. A feature processing layer, used to convert the two-dimensional features output by the two-dimensional convolution layer into one-dimensional vectors; it contains an averaging sublayer that computes the mean along one dimension (row or column) of the two-dimensional features, and a maximum sublayer that computes the maximum in the same way. 409. A summation layer, used to add the mean and the maximum. 410. An activation layer, which applies a ReLU activation function to obtain the audio feature vector.
It should be noted that the server can also use other convolutional neural networks to extract the audio feature vector, which is not limited in this embodiment.
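As an illustration only, the following is a simplified PyTorch sketch of an audio encoder in the spirit of the structure shown in fig. 4, fusing a time-domain convolution branch with a log-mel branch; it is not the actual PANNs implementation, and all layer sizes and module names are assumptions.
```python
# Simplified PANNs-style audio encoder: a time-domain branch fused with a
# log-mel (frequency-domain) branch. Layer sizes are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio


class AudioEncoder(nn.Module):
    def __init__(self, feat_dim=2048, n_mels=64, sample_rate=32000):
        super().__init__()
        # Time-domain branch: Conv1D with stride 5, then three downsampling
        # stages with stride 4 (32000 / 5 / 4 / 4 / 4 = 100 frames per second).
        self.time_branch = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=11, stride=5, padding=5), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, stride=4, padding=1), nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=3, stride=4, padding=1), nn.ReLU(),
            nn.Conv1d(256, feat_dim, kernel_size=3, stride=4, padding=1), nn.ReLU(),
        )
        # Frequency-domain branch: log-mel spectrum processed by 2D convolution.
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_mels=n_mels)
        self.freq_branch = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),   # collapse the mel axis
        )
        self.freq_proj = nn.Conv1d(32, feat_dim, kernel_size=1)
        self.fuse = nn.Conv1d(2 * feat_dim, feat_dim, kernel_size=1)

    def forward(self, waveform):                            # (batch, samples)
        t = self.time_branch(waveform.unsqueeze(1))         # (batch, D, T)
        mel = torch.log(self.melspec(waveform) + 1e-6)      # (batch, n_mels, T2)
        f = self.freq_branch(mel.unsqueeze(1)).squeeze(2)   # (batch, 32, T2)
        f = self.freq_proj(f)                               # (batch, D, T2)
        f = F.interpolate(f, size=t.shape[-1])              # align time steps
        fused = self.fuse(torch.cat([t, f], dim=1))         # concatenate and fuse
        return torch.relu(fused).transpose(1, 2)            # (batch, T, D) frame features

# Example: 10 seconds of 32 kHz audio -> roughly 1000 frame-level features
# feats = AudioEncoder()(torch.randn(1, 320000))            # (1, 1000, 2048)
```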
302. The server extracts the audio feature vector of the second video from the audio data of the second video.
In this embodiment, the second video is any one of the plurality of series videos other than the first video; that is, the second video and the first video belong to the same series. The way the server extracts the audio feature vector of the second video is similar to step 301 and is not repeated here.
It should be noted that, the steps in the implementation of the present application are numbered for convenience of description, and the execution order of the steps is not limited. The server can execute step 301 or step 302 first, and can also execute step 301 and step 302 simultaneously in a multi-thread manner, which is not limited in the embodiment of the present application.
303. The server determines the audio similarity information based on the similarity between the elements in the audio feature vector of the first video and the elements in the audio feature vector of the second video, wherein the audio similarity information is used for representing the audio similarity between the first video and the second video.
In an embodiment of the application, the audio feature vector comprises a plurality of feature dimensions, each feature dimension being represented by an element. The server can respectively determine the similarity between each element in the two audio feature vectors to obtain audio similar information.
In some embodiments, the server can represent the similarity between elements by the cosine distance between the elements. Correspondingly, for any element in the audio feature vector of the first video, the server can determine the cosine distance between the element and each element in the audio feature vector of the second video, and obtain the similarity row vector of the element. Then, the server can construct audio similarity information in the form of an audio similarity matrix based on the similarity row vectors of the elements in the audio feature vector of the first video. The audio similarity matrix is constructed so that the value in the audio similarity matrix can represent the similarity between any two elements in the audio feature vectors of the two videos, and thus the audio similarity matrix can represent the similarity between the audio feature vectors of the two videos.
For example, referring to fig. 5, fig. 5 is a schematic diagram for constructing an audio similarity matrix according to an embodiment of the present application. As shown in fig. 5, the audio feature vector of the first video is represented as { a1, a2, A3, …, An }, and the audio feature vector of the second video is represented as { B1, B2, B3, …, Bm }, where n and m are positive integers, n and m represent the number of elements of the audio feature vector, and n and m may be equal or unequal. Taking a1 as an example, cosine distances between a1 and B1, B2, B3, …, Bm are respectively calculated to obtain a similarity row vector of a 1. The plurality of similarity row vectors form an audio similarity matrix. In the audio similarity matrix, A1B1 represents the cosine distance between A1 and B1, A1Bm represents the cosine distance between A1 and Bm, AnB1 represents the cosine distance between An and B1, and AnBm represents the cosine distance between An and Bm.
Note that, the cosine distance between two elements is calculated as shown in the following formula (1).
cos(a, b) = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^{2}} \cdot \sqrt{\sum_{i=1}^{n} b_i^{2}}}   (1)
where n denotes the number of elements, a_i denotes the i-th element in the audio feature vector of the first video, and b_i denotes the i-th element in the audio feature vector of the second video.
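As an illustration, the audio similarity matrix described above can be computed with a few lines of NumPy applying formula (1) to every pair of frame-level features; the function and variable names below are assumptions for the sketch.
```python
# Pairwise cosine similarity (formula (1)) between the frame-level audio
# features of two videos; names are illustrative.
import numpy as np


def cosine_similarity_matrix(feats_a, feats_b):
    """feats_a: (n, d) features A1..An of the first video.
    feats_b: (m, d) features B1..Bm of the second video.
    Returns an (n, m) matrix whose entry (i, j) is cos(Ai, Bj)."""
    a = feats_a / (np.linalg.norm(feats_a, axis=1, keepdims=True) + 1e-12)
    b = feats_b / (np.linalg.norm(feats_b, axis=1, keepdims=True) + 1e-12)
    return a @ b.T    # each row is the similarity row vector of one element

# audio_sim = cosine_similarity_matrix(audio_feats_ep1, audio_feats_ep2)
```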
304. The server acquires image data of the first video and image data of the second video.
In this embodiment, the image data consists of video image frames. The server extracts image frames from the first video and the second video according to the target dimension number of the audio feature vector, so that the number of dimensions of the extracted image feature vector equals the target dimension number.
In some embodiments, the server obtains the target dimension number of the audio feature vector extracted from the audio data, and then extracts a number of image frames equal to the target dimension number from the first video and from the second video respectively, obtaining the image data of the first video and the image data of the second video. Because the number of extracted image frames equals the target dimension number, the number of dimensions of the image feature vector obtained after feature extraction of the image data also equals the target dimension number, which makes it convenient to fuse the audio feature vector and the image feature vector and improves fusion efficiency.
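A minimal sketch of this frame-sampling step, assuming OpenCV is used to decode the video; the function name and the uniform sampling strategy are illustrative assumptions.
```python
# Uniformly sample exactly `target_dim` frames from a video with OpenCV so
# that the number of image features matches the number of audio feature
# frames; the helper name and sampling strategy are assumptions.
import cv2
import numpy as np


def sample_frames(video_path, target_dim):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num=target_dim, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames
```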
305. The server extracts the image feature vector of the first video from the image data of the first video.
In the embodiment of the application, the server can respectively extract the features of each video image frame in the image data to obtain an image feature vector, wherein the image feature vector comprises a plurality of feature dimensions, and the plurality of feature dimensions are in one-to-one correspondence with the plurality of video image frames in the image data.
For example, take extracting the image feature vector with a ResNet50 network. Referring to fig. 6, fig. 6 is a schematic structural diagram of a ResNet50 network according to an embodiment of the present application. As shown in fig. 6, the network is divided into 5 processing stages, stage 0 to stage 4, where stage 0 preprocesses the input image frames. Stages 1 to 4 have similar structures and each consists of bottleneck layers (Bottleneck): stage 1 contains 3 bottleneck layers, stage 2 contains 4, stage 3 contains 6, and stage 4 contains 3.
It should be noted that the server can also use other network models to extract the image feature vector, which is not limited in this embodiment of the present application.
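As a sketch only, frame-level image features can be obtained by removing the classification head of a pretrained ResNet50 from torchvision (older torchvision versions use pretrained=True instead of the weights argument); the preprocessing values are the standard ImageNet statistics and the helper name is assumed.
```python
# Frame-level image features from a pretrained ResNet50 with the final
# classification layer removed; preprocessing uses the standard ImageNet
# statistics. Helper name and batch handling are assumptions.
import torch
import torchvision.models as models
import torchvision.transforms as T

backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()      # keep the 2048-dim pooled features
backbone.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])


@torch.no_grad()
def frame_embeddings(frames):          # frames: list of HxWx3 RGB uint8 arrays
    batch = torch.stack([preprocess(f) for f in frames])
    return backbone(batch)             # (num_frames, 2048)
```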
306. The server extracts the image feature vector of the second video from the image data of the second video.
In the embodiment of the present application, the manner in which the server extracts the image feature vector of the second video is shown in step 305, and is not described herein again.
It should be noted that the steps are numbered for convenience of description, and the numbering does not limit the execution order. The server can execute step 305 before step 306 or vice versa, and can also execute step 305 and step 306 simultaneously in a multi-threaded manner; this is not limited in the embodiments of the present application.
307. The server determines the image similarity information based on the similarity between the elements in the image feature vector of the first video and the elements in the image feature vector of the second video, wherein the image similarity information is used for representing the image similarity between the first video and the second video.
In an embodiment of the present application, an image feature vector includes a plurality of feature dimensions, each feature dimension being represented by an element. The server can respectively determine the similarity between each element in the two image feature vectors to obtain image similarity information.
In some embodiments, the server can represent the similarity between elements by the cosine distance between the elements. Correspondingly, for any element in the image feature vector of the first video, the server can determine the cosine distance between the element and each element in the image feature vector of the second video, and obtain the similarity row vector of the element. Then, the server can construct image similarity information in the form of an image similarity matrix based on the similarity row vector of each element in the image feature vector of the first video. By constructing the image similarity matrix, the value in the image similarity matrix can represent the similarity between any two elements in the image feature vectors of the two videos, so that the image similarity matrix can represent the similarity between the image feature vectors of the two videos. The method for constructing the image similarity matrix is shown in fig. 5, and is not repeated herein.
308. The server determines a head video segment and a tail video segment in the first video and the second video based on the audio similarity information and the image similarity information.
In the embodiments of the present application, the server fuses the acquired audio similarity information and image similarity information, and determines, based on the fused information, the time period corresponding to the head video segment and the time period corresponding to the tail video segment, thereby determining the head video segment and the tail video segment.
In some embodiments, the server fuses the audio similarity information and the image similarity information to obtain video similarity information, where the video similarity information is used to represent the similarity between the first video and the second video. The server then determines a first time period and a second time period based on the video similarity information, determines a head video segment in the first video and the second video based on the first time period (the head video segment lies within the first time period), and determines a tail video segment in the first video and the second video based on the second time period (the tail video segment lies within the second time period). Because the two time periods are determined from the fused video similarity information, they are the time periods in which the head video segment and the tail video segment lie, and the head and tail video segments can therefore be determined more accurately.
In some embodiments, the server fuses the audio similarity information and the image similarity information as follows: the server averages the values at corresponding positions in the audio similarity information and the image similarity information to obtain intermediate fusion information, and then normalizes the values in the intermediate fusion information to obtain the video similarity information. That is, the server averages the values at corresponding positions of the two matrices and normalizes the result to obtain video similarity information in the form of a video similarity matrix.
In some embodiments, because the audio similarity matrix and the image similarity matrix are computed from cosine distances, their values lie between -1 and 1. The server can add 1 to all distance values so that they lie between 0 and 2, average the values of the two similarity matrices, and rescale the result so that the values lie between 0 and 1, where 1 indicates a small distance (high similarity) and 0 indicates a large distance (low similarity). Referring to fig. 7, fig. 7 is a schematic diagram of a video similarity matrix according to an embodiment of the present application. As shown in fig. 7, the value for two similar features is close to 1, shown in fig. 7 as nearly white, indicating that the two features of the two videos are similar; the value for two dissimilar features is close to 0, shown in fig. 7 as nearly black, indicating that the two features of the two videos are dissimilar. Each pixel in fig. 7 represents the similarity between two video frames of the two videos. Taking any row in fig. 7 as an example, the row represents the fused audio and image similarity between one frame of one video and every frame of the other video. If the content of a frame appears frequently in the other video, the similarity between that frame and the frames of the other video is high, and the pixels of the corresponding row in fig. 7 are close to white; if the content of a frame does not appear in the other video, the similarity is low and the pixels of the corresponding row are close to black. Because fig. 7 shows the frame-to-frame similarity between the two videos, and the frames of the head video segments of the two videos are highly similar to each other (as are the frames of the tail video segments), the pixels showing the similarity between frames of the head video segments are close to white, and likewise for the tail video segments; that is, the white line segments in the upper-left and lower-right corners of fig. 7 show that the features of the head video segments of the first video and the second video are similar, as are the features of their tail video segments.
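A minimal sketch of this fusion step, assuming both input matrices contain raw cosine values in [-1, 1]; the function name is illustrative.
```python
# Fuse the audio and image similarity matrices (raw cosine values in [-1, 1])
# into a video similarity matrix with values in [0, 1]; name is illustrative.
import numpy as np


def fuse_similarity(audio_sim, image_sim):
    a = (np.asarray(audio_sim) + 1.0) / 2.0   # rescale cosine values to [0, 1]
    v = (np.asarray(image_sim) + 1.0) / 2.0
    return (a + v) / 2.0                      # element-wise average
```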
In some embodiments, the server determines the first time period and the second time period by dynamic programming. The server uses breadth-first traversal to search the video similarity matrix for the diagonal paths with the smallest distance (highest similarity) in the upper-left and lower-right corners. Because the two dimensions of the video similarity matrix represent the numbers of elements of the two videos, and the number of elements equals the number of extracted video image frames, the time period corresponding to a diagonal path can be determined from the proportional relationship between the diagonal path and the number of elements; the first time period and the second time period are obtained by determining the start and end positions of the diagonals.
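The following is a simplified, greedy stand-in for the breadth-first dynamic-programming search described above: it walks the diagonal from the top-left corner (for the head) or from the bottom-right corner (for the tail) of the fused matrix, collects the first contiguous run above a threshold, and converts the run back to a time period; the threshold, the feature frame rate and the function name are all assumptions.
```python
# Greedy stand-in for the diagonal-path search: find the first contiguous
# high-similarity run along the diagonal starting from the top-left corner
# (head) or the bottom-right corner (tail), and map it to seconds. The
# threshold and the feature frame rate (fps) are assumptions.
import numpy as np


def diagonal_run(sim, threshold=0.85, from_start=True, fps=100):
    n, m = sim.shape
    length = min(n, m)
    if from_start:
        coords = [(k, k) for k in range(length)]                   # top-left
    else:
        coords = [(n - 1 - k, m - 1 - k) for k in range(length)]   # bottom-right
    start = end = None
    for k, (i, j) in enumerate(coords):
        if sim[i, j] >= threshold:
            if start is None:
                start = k
            end = k
        elif start is not None:
            break                                  # the run has ended
    if start is None:
        return None
    rows = sorted((coords[start][0], coords[end][0]))
    # Row index corresponds to time in the first video.
    return rows[0] / fps, (rows[1] + 1) / fps

# head_period = diagonal_run(video_sim, from_start=True)
# tail_period = diagonal_run(video_sim, from_start=False)
```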
It should be noted that, to make the video segment determination scheme provided by the embodiments of the present application easier to understand, refer to fig. 8, which is a flowchart of another method for determining a video segment according to an embodiment of the present application. As shown in fig. 8, the flow is divided into two stages: the first stage computes the similarity matrices, and the second stage determines the positions of the head video segment and the tail video segment based on the similarity matrices. The first stage comprises two parts: computing the audio similarity matrix and computing the image similarity matrix. The inputs to the first stage are two episodes of a television series, episode 1 and episode 2; episode 1 contains audio data 1 and image data 1, and episode 2 contains audio data 2 and image data 2. Feature extraction is performed on audio data 1 and audio data 2 to obtain the audio feature vectors of episode 1 and episode 2, and on image data 1 and image data 2 to obtain the image feature vectors of episode 1 and episode 2. Distance computation between the audio feature vectors of episode 1 and episode 2 yields the audio similarity matrix, and distance computation between the image feature vectors of episode 1 and episode 2 yields the image similarity matrix. In the second stage, the audio similarity matrix and the image similarity matrix obtained in the first stage are fused into a video similarity matrix, and the positions corresponding to the head video segment and the tail video segment in the video similarity matrix are determined. Finally, the start and end times of the head video segment and of the tail video segment are determined from the correspondence between positions in the video similarity matrix and the episode duration.
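Putting the two stages of fig. 8 together, an end-to-end sketch could look as follows; it reuses the hypothetical helpers sketched earlier (cosine_similarity_matrix, sample_frames, frame_embeddings, fuse_similarity, diagonal_run) and assumes an extract_audio_features function for the audio encoder, so every name is illustrative rather than part of this application.
```python
# End-to-end sketch of the two-stage flow in fig. 8, reusing the hypothetical
# helpers sketched earlier; extract_audio_features stands in for the audio
# encoder and returns (num_frames, dim) frame features.
def determine_head_tail(episode1_path, episode2_path, extract_audio_features):
    # Stage 1: audio similarity matrix
    audio1 = extract_audio_features(episode1_path)
    audio2 = extract_audio_features(episode2_path)
    audio_sim = cosine_similarity_matrix(audio1, audio2)

    # Stage 1: image similarity matrix, sampling as many frames as audio frames
    frames1 = sample_frames(episode1_path, target_dim=len(audio1))
    frames2 = sample_frames(episode2_path, target_dim=len(audio2))
    image_sim = cosine_similarity_matrix(frame_embeddings(frames1).numpy(),
                                         frame_embeddings(frames2).numpy())

    # Stage 2: fuse and locate the head (top-left) and tail (bottom-right) runs
    video_sim = fuse_similarity(audio_sim, image_sim)
    head_period = diagonal_run(video_sim, from_start=True)
    tail_period = diagonal_run(video_sim, from_start=False)
    return head_period, tail_period
```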
In some embodiments, the server can also incorporate subtitle data in addition to the audio data and the image data. In that case, the server determines subtitle similarity information based on the subtitle data of the first video and the subtitle data of the second video, where the subtitle similarity information is used to represent the subtitle similarity between the first video and the second video, and then determines the head video segment and the tail video segment in the first video and the second video based on the audio similarity information, the image similarity information and the subtitle similarity information. The server can compute a weighted average of the values at corresponding positions in the audio similarity information, the image similarity information and the subtitle similarity information to obtain the video similarity matrix, and then determine the head video segment and the tail video segment in the first video and the second video based on the video similarity matrix. By adding subtitle data, the server determines the head and tail video segments from three different dimensions, audio, image and text; the features of different dimensions compensate for each other's weaknesses, which improves robustness and further improves accuracy.
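A minimal sketch of the weighted fusion with subtitle similarity, assuming all three matrices have been rescaled to [0, 1]; the weights are illustrative assumptions, since the application only specifies a weighted average.
```python
# Weighted fusion of audio, image and subtitle similarity matrices (each
# assumed rescaled to [0, 1]); the weights are illustrative assumptions.
import numpy as np


def fuse_with_subtitles(audio_sim, image_sim, subtitle_sim,
                        weights=(0.4, 0.4, 0.2)):
    w = np.asarray(weights, dtype=float)
    stacked = np.stack([audio_sim, image_sim, subtitle_sim])
    return np.tensordot(w / w.sum(), stacked, axes=1)   # (n, m) weighted average
```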
The embodiments of the present application provide a video segment determination scheme in which information from two dimensions, audio and image, is used to determine the head video segment and the tail video segment in a video, so that the audio-dimension and image-dimension information complement each other, reducing the difficulty of determining the head and tail video segments and improving accuracy. Compared with manual labeling, the scheme provided by this application reduces cost and improves the efficiency of batch video processing. In addition, for head and tail video segments whose pictures or music differ across episodes, complementing the audio-dimension and image-dimension information reduces the difficulty caused by this variety and improves accuracy. Moreover, by using frame-level audio feature vectors and image feature vectors, the scheme can reach millisecond-level precision, further improving accuracy. Finally, compared with determining the head video segment and the tail video segment separately from each modality, the scheme avoids situations in which the conclusion drawn from the audio similarity matrix differs too much from the conclusion drawn from the image similarity matrix for an accurate determination to be made, improving robustness.
Fig. 9 is a block diagram of a device for determining a video segment according to an embodiment of the present application. The device includes: an audio determination module 901, an image determination module 902, and a segment determination module 903.
an audio determining module 901, configured to determine audio similarity information based on audio data of a first video and audio data of a second video, where the audio similarity information is used to represent the audio similarity between the first video and the second video, and the first video and the second video are both series videos belonging to the same series;
an image determining module 902, configured to determine image similarity information based on the image data of the first video and the image data of the second video, where the image similarity information is used to represent the image similarity between the first video and the second video;
and a segment determining module 903, configured to determine a head video segment and a tail video segment in the first video and the second video based on the audio similarity information and the image similarity information.
In some embodiments, fig. 10 is a block diagram of another apparatus for determining a video segment provided in an embodiment of the present application, where the apparatus includes an audio determining module 901, an image determining module 902, and a segment determining module 903, and reference may be made to the description of fig. 9. Referring to fig. 10, the audio determining module 901 includes:
an audio feature extraction unit 1001 configured to extract an audio feature vector of the first video from the audio data of the first video;
the audio feature extraction unit 1001 is further configured to extract an audio feature vector of the second video from the audio data of the second video;
a first determining unit 1002, configured to determine the audio similarity information based on a similarity between an element in the audio feature vector of the first video and an element in the audio feature vector of the second video.
In some embodiments, the audio feature extraction unit 1001 is configured to extract a temporal feature vector of the first video from the audio data of the first video; extracting a frequency domain feature vector of the first video from the audio data of the first video; and fusing the time domain feature vector and the frequency domain feature vector of the first video to obtain the audio feature vector of the first video.
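As one possible reading of this fusion of time-domain and frequency-domain features (the frame length, hop size, and the specific per-frame statistics below are assumptions for illustration, not the feature extractor of this application), a minimal sketch:

```python
import numpy as np

def frame_level_audio_features(signal, sr, frame_len=0.5, hop=0.5):
    """Per-frame audio features: [time-domain RMS energy, frequency-domain spectral centroid].

    frame_len and hop are in seconds; the chosen statistics are illustrative stand-ins
    for whatever time-domain and frequency-domain features an implementation uses.
    """
    signal = np.asarray(signal, dtype=np.float64)
    n = int(frame_len * sr)
    h = int(hop * sr)
    features = []
    for start in range(0, len(signal) - n + 1, h):
        frame = signal[start:start + n]
        rms = np.sqrt(np.mean(frame ** 2))                    # time-domain feature
        spectrum = np.abs(np.fft.rfft(frame))                 # frequency-domain view
        freqs = np.fft.rfftfreq(n, d=1.0 / sr)
        centroid = np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-8)
        features.append([rms, centroid])                      # fuse by concatenation per frame
    return np.asarray(features)                               # shape: (num_frames, 2)
```

Each row of the returned array plays the role of one "element" of the audio feature vector referred to above.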
In some embodiments, the first determining unit 1002 is configured to determine, for any element in the audio feature vector of the first video, a cosine distance between the element and each element in the audio feature vector of the second video, so as to obtain a similarity row vector of the element; and constructing audio similarity information in the form of an audio similarity matrix based on the similarity row vectors of the elements in the audio feature vector of the first video.
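The pairwise comparison described here maps naturally onto a similarity matrix. The sketch below is an assumption-laden illustration, not the claimed implementation: it treats each "element" as a per-frame feature vector and uses cosine similarity; a cosine distance would simply be one minus each entry.

```python
import numpy as np

def cosine_similarity_matrix(feats_a, feats_b):
    """Entry (i, j) compares frame i of the first video with frame j of the second video;
    each row is the 'similarity row vector' of the corresponding element."""
    a = feats_a / (np.linalg.norm(feats_a, axis=1, keepdims=True) + 1e-8)
    b = feats_b / (np.linalg.norm(feats_b, axis=1, keepdims=True) + 1e-8)
    return a @ b.T   # shape: (frames_in_first_video, frames_in_second_video)
```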
In some embodiments, referring to fig. 10, the apparatus further comprises:
an image frame obtaining module 904, configured to obtain the target dimension count of the audio feature vectors extracted from the audio data, and to extract, from the first video and the second video respectively, image frames equal in number to the target dimension count, so as to obtain the image data of the first video and the image data of the second video.
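One way this frame-sampling step might be realized (the even-spacing policy and the use of OpenCV's VideoCapture are assumptions for illustration) is to pick as many evenly spaced frames as there are rows in the audio feature vector:

```python
import cv2
import numpy as np

def sample_frames(video_path, target_count):
    """Read `target_count` evenly spaced frames from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), target_count).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```

Sampling the same number of frames as audio feature rows keeps the audio similarity matrix and the image similarity matrix the same shape, which makes the later element-wise fusion straightforward.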
In some embodiments, referring to fig. 10, the image determination module 902 includes:
an image feature extraction unit 1003 for extracting an image feature vector of the first video from the image data of the first video;
the image feature extraction unit 1003 is further configured to extract an image feature vector of the second video from the image data of the second video;
a second determining unit 1004, configured to determine the image similarity information based on a similarity between an element in the image feature vector of the first video and an element in the image feature vector of the second video.
In some embodiments, the second determining unit 1004 is configured to determine, for any element in the image feature vector of the first video, a cosine distance between the element and each element in the image feature vector of the second video, so as to obtain a similarity row vector of the element; and constructing image similarity information in an image similarity matrix form based on the similarity row vectors of all elements in the image feature vector of the first video.
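The image branch can reuse the same pairwise cosine routine as the audio branch. In the sketch below, each frame is reduced to a flattened, normalized grayscale thumbnail; this descriptor is an illustrative stand-in for whatever image feature extractor an implementation actually uses:

```python
import cv2
import numpy as np

def frame_feature(frame, size=(16, 16)):
    """Illustrative per-frame descriptor: a flattened, L2-normalized grayscale thumbnail."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    thumb = cv2.resize(gray, size).astype(np.float32).ravel()
    return thumb / (np.linalg.norm(thumb) + 1e-8)

def image_similarity_matrix(frames_a, frames_b):
    # Entry (i, j): similarity of frame i of the first video to frame j of the second video.
    a = np.stack([frame_feature(f) for f in frames_a])
    b = np.stack([frame_feature(f) for f in frames_b])
    return a @ b.T
```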
In some embodiments, the segment determining module 903 is configured to fuse the audio similarity information and the image similarity information to obtain video similarity information, where the video similarity information is used to indicate the similarity between the first video and the second video; determine a first time period and a second time period based on the video similarity information; determine the head video segment in the first video and the second video based on the first time period, the head video segment being within the first time period; and determine the tail video segment in the first video and the second video based on the second time period, the tail video segment being within the second time period.
In some embodiments, the segment determining module 903 is configured to average the values at corresponding positions in the audio similarity information and the image similarity information to obtain intermediate fusion information, and to normalize the values in the intermediate fusion information to obtain the video similarity information.
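Putting the fusion and localization steps together, the following rough sketch averages the two matrices, normalizes the result, and then looks for a run of highly similar frames at the start (candidate head segment) and at the end (candidate tail segment). It assumes both videos were sampled into the same number of frames at a fixed rate so that the diagonal aligns them; the threshold and the frame rate are illustrative assumptions, not values given in this application:

```python
import numpy as np

def locate_shared_segments(audio_sim, image_sim, fps=2.0, threshold=0.8):
    """Return (head_seconds, tail_seconds): durations of the shared opening and ending."""
    fused = (audio_sim + image_sim) / 2.0
    fused = (fused - fused.min()) / (fused.max() - fused.min() + 1e-8)  # normalization step

    diag = np.diag(fused)           # frame-by-frame agreement between the two episodes
    head_frames = 0
    for value in diag:              # consecutive similar frames from the start
        if value < threshold:
            break
        head_frames += 1
    tail_frames = 0
    for value in diag[::-1]:        # consecutive similar frames from the end
        if value < threshold:
            break
        tail_frames += 1
    return head_frames / fps, tail_frames / fps
```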
In some embodiments, the segment determining module 903 is configured to determine subtitle similarity information based on the subtitle data of the first video and the subtitle data of the second video, where the subtitle similarity information is used to indicate the subtitle similarity between the first video and the second video; and to determine the head video segment and the tail video segment in the first video and the second video based on the audio similarity information, the image similarity information, and the subtitle similarity information.
The embodiment of the application provides a video segment determination scheme in which information from two dimensions, audio and image, is used to determine the head video segment and the tail video segment in a video. The audio and image information complement each other, which lowers the difficulty of determining the head and tail video segments and improves accuracy.
It should be noted that when the apparatus for determining video segments provided in the above embodiments determines the head video segment and the tail video segment in a video, the division into the functional modules described above is only an example. In practical applications, the functions can be assigned to different functional modules as needed; that is, the internal structure of the apparatus can be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus for determining video segments and the method for determining video segments provided in the above embodiments belong to the same concept; the specific implementation is detailed in the method embodiments and is not repeated here.
In this embodiment of the present application, the computer device can be configured as a terminal or a server. When the computer device is configured as a terminal, the terminal can serve as the execution subject that implements the technical solution provided in the embodiments of the application. When the computer device is configured as a server, the server can serve as the execution subject, or the technical solution can be implemented through interaction between the terminal and the server. This is not limited in the embodiments of the application.
When the computer device is configured as a terminal, fig. 11 is a block diagram of a terminal 1100 provided according to an embodiment of the present application. The terminal 1100 includes: a processor 1101 and a memory 1102.
Processor 1101 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 1101 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 1101 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, also called a Central Processing Unit (CPU), and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 1101 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 1101 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 1102 may include one or more computer-readable storage media, which may be non-transitory. Memory 1102 can also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in the memory 1102 is used to store at least one computer program for execution by the processor 1101 to implement the method of determining a video segment provided by the method embodiments herein.
In some embodiments, the terminal 1100 may further optionally include: a peripheral interface 1103 and at least one peripheral. The processor 1101, memory 1102 and peripheral interface 1103 may be connected by buses or signal lines. Various peripheral devices may be connected to peripheral interface 1103 by buses, signal lines, or circuit boards. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1104, display screen 1105, camera assembly 1106, audio circuitry 1107, and power supply 1108.
The peripheral interface 1103 may be used to connect at least one peripheral associated with I/O (Input/Output) to the processor 1101 and the memory 1102. In some embodiments, the processor 1101, memory 1102, and peripheral interface 1103 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1101, the memory 1102 and the peripheral device interface 1103 may be implemented on separate chips or circuit boards, which is not limited by this embodiment.
The radio frequency circuit 1104 is used to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 1104 communicates with a communication network and other communication devices via electromagnetic signals. The radio frequency circuit 1104 converts an electric signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electric signal. In some embodiments, the radio frequency circuit 1104 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so on. The radio frequency circuit 1104 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: the world wide web, metropolitan area networks, intranets, successive generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1104 may further include NFC (Near Field Communication) related circuitry, which is not limited in this application.
The display screen 1105 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1105 is a touch display screen, the display screen 1105 also has the ability to capture touch signals on or over its surface. The touch signal may be input to the processor 1101 as a control signal for processing. At this point, the display screen 1105 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 1105, disposed on the front panel of the terminal 1100; in other embodiments, there may be at least two display screens 1105, respectively disposed on different surfaces of the terminal 1100 or in a folded design; in still other embodiments, the display screen 1105 may be a flexible display disposed on a curved or folded surface of the terminal 1100. The display screen 1105 may even be arranged in a non-rectangular irregular pattern, that is, a shaped screen. The display screen 1105 may be an LCD (Liquid Crystal Display), an OLED (Organic Light-Emitting Diode) display, or the like.
Camera assembly 1106 is used to capture images or video. In some embodiments, camera assembly 1106 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of a terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 1106 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
The audio circuitry 1107 may include a microphone and a speaker. The microphone is used for collecting sound waves of the user and the environment, converting the sound waves into electric signals, and inputting them to the processor 1101 for processing or to the radio frequency circuit 1104 to achieve voice communication. For stereo sound collection or noise reduction, multiple microphones may be provided at different parts of the terminal 1100. The microphone may also be an array microphone or an omnidirectional pickup microphone. The speaker is used to convert electric signals from the processor 1101 or the radio frequency circuit 1104 into sound waves. The speaker may be a traditional film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can convert an electric signal into a sound wave audible to humans, or into a sound wave inaudible to humans for purposes such as distance measurement. In some embodiments, the audio circuitry 1107 may also include a headphone jack.
Power supply 1108 is used to provide power to various components within terminal 1100. The power source 1108 may be alternating current, direct current, disposable or rechargeable. When the power source 1108 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, terminal 1100 can also include one or more sensors 1109. The one or more sensors 1109 include, but are not limited to: acceleration sensor 1110, gyro sensor 1111, pressure sensor 1112, optical sensor 1113, and proximity sensor 1114.
The acceleration sensor 1110 can detect the magnitude of acceleration in three coordinate axes of the coordinate system established with the terminal 1100. For example, the acceleration sensor 1110 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 1101 may control the display screen 1105 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 1110. The acceleration sensor 1110 may also be used for game or user motion data acquisition.
The gyro sensor 1111 may detect the body direction and the rotation angle of the terminal 1100, and the gyro sensor 1111 may acquire the 3D motion of the user on the terminal 1100 in cooperation with the acceleration sensor 1110. The processor 1101 can implement the following functions according to the data collected by the gyro sensor 1111: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization while shooting, game control, and inertial navigation.
Pressure sensors 1112 may be disposed on a side bezel of the terminal 1100 and/or under the display screen 1105. When the pressure sensor 1112 is disposed on a side bezel of the terminal 1100, the user's grip signal on the terminal 1100 can be detected, and the processor 1101 performs left-hand or right-hand recognition or shortcut operations according to the grip signal collected by the pressure sensor 1112. When the pressure sensor 1112 is disposed under the display screen 1105, the processor 1101 controls the operability controls on the UI according to the pressure the user applies to the display screen 1105. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The optical sensor 1113 is used to collect the ambient light intensity. In one embodiment, the processor 1101 may control the display brightness of the display screen 1105 according to the ambient light intensity collected by the optical sensor 1113. Specifically, when the ambient light intensity is high, the display brightness of the display screen 1105 is increased; when the ambient light intensity is low, the display brightness of the display screen 1105 is reduced. In another embodiment, the processor 1101 may also dynamically adjust the shooting parameters of the camera assembly 1106 according to the intensity of the ambient light collected by the optical sensor 1113.
Proximity sensor 1114, also known as a distance sensor, is typically disposed on the front panel of the terminal 1100. The proximity sensor 1114 is used to capture the distance between the user and the front face of the terminal 1100. In one embodiment, when the proximity sensor 1114 detects that the distance between the user and the front face of the terminal 1100 is gradually decreasing, the processor 1101 controls the display screen 1105 to switch from the screen-on state to the screen-off state; when the proximity sensor 1114 detects that the distance between the user and the front face of the terminal 1100 is gradually increasing, the processor 1101 controls the display screen 1105 to switch from the screen-off state to the screen-on state.
Those skilled in the art will appreciate that the configuration shown in fig. 11 is not limiting of terminal 1100, and may include more or fewer components than shown, or some components may be combined, or a different arrangement of components may be used.
When the computer device is configured as a server, fig. 12 is a schematic structural diagram of a server according to an embodiment of the present application. The server 1200 may vary considerably depending on configuration or performance, and may include one or more processors (CPUs) 1201 and one or more memories 1202, where the memories 1202 store at least one computer program that is loaded and executed by the processors 1201 to implement the method for determining a video segment provided by the above method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and may further include other components for implementing device functions, which are not described here.
The embodiment of the present application further provides a computer-readable storage medium, where at least one computer program is stored in the computer-readable storage medium, and the at least one computer program is loaded and executed by a processor of a computer device to implement the operations performed by the computer device in the method for determining a video segment according to the foregoing embodiments. For example, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, or the like.
In some embodiments, the computer program according to the embodiments of the present application may be deployed to be executed on one computer device, on multiple computer devices located at one site, or on multiple computer devices distributed across multiple sites and interconnected by a communication network; the multiple computer devices distributed across multiple sites and interconnected by a communication network may constitute a blockchain system.
An embodiment of the present application further provides a computer program product, which includes a computer program. When the computer program is executed by a processor, it implements the method for determining a video segment provided in the foregoing aspects or in the various optional implementations of those aspects.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (14)

1. A method for determining a video segment, the method comprising:
determining audio similar information based on audio data of a first video and audio data of a second video, wherein the audio similar information is used for representing audio similarity between the first video and the second video, and the first video and the second video are series videos and belong to the same series;
determining image similarity information based on the image data of the first video and the image data of the second video, wherein the image similarity information is used for representing the image similarity between the first video and the second video;
and determining a head video segment and a tail video segment in the first video and the second video based on the audio similar information and the image similar information.
2. The method of claim 1, wherein determining audio similarity information based on the audio data of the first video and the audio data of the second video comprises:
extracting an audio feature vector of the first video from the audio data of the first video;
extracting an audio feature vector of the second video from the audio data of the second video;
determining the audio similarity information based on a similarity between elements in the audio feature vector of the first video and elements in the audio feature vector of the second video.
3. The method of claim 2, wherein the extracting the audio feature vector of the first video from the audio data of the first video comprises:
extracting a time domain feature vector of the first video from the audio data of the first video;
extracting a frequency domain feature vector of the first video from audio data of the first video;
and fusing the time domain characteristic vector and the frequency domain characteristic vector of the first video to obtain the audio characteristic vector of the first video.
4. The method of claim 2, wherein the determining the audio similarity information based on similarity between elements in the audio feature vector of the first video and elements in the audio feature vector of the second video comprises:
for any element in the audio characteristic vector of the first video, determining the cosine distance between the element and each element in the audio characteristic vector of the second video to obtain a similarity row vector of the element;
and constructing audio similarity information in the form of an audio similarity matrix based on the similarity row vectors of the elements in the audio feature vector of the first video.
5. The method of claim 1, further comprising:
acquiring the target dimension number of audio feature vectors extracted based on the audio data;
and respectively extracting the image frames with the same number as the target dimension number from the first video and the second video to obtain the image data of the first video and the image data of the second video.
6. The method of claim 1, wherein determining image similarity information based on the image data of the first video and the image data of the second video comprises:
extracting an image feature vector of the first video from image data of the first video;
extracting an image feature vector of the second video from image data of the second video;
determining the image similarity information based on a similarity between elements in an image feature vector of the first video and elements in an image feature vector of the second video.
7. The method of claim 6, wherein the determining the image similarity information based on similarity between elements in the image feature vector of the first video and elements in the image feature vector of the second video comprises:
for any element in the image feature vector of the first video, determining the cosine distance between the element and each element in the image feature vector of the second video to obtain a similarity row vector of the element;
and constructing image similarity information in an image similarity matrix form based on the similarity row vectors of all elements in the image feature vector of the first video.
8. The method of claim 1, wherein the determining a head video segment and a tail video segment in the first video and the second video based on the audio similarity information and the image similarity information comprises:
fusing the audio similar information and the image similar information to obtain video similar information, wherein the video similar information is used for representing the similarity between the first video and the second video;
determining a first time period and a second time period based on the video similarity information;
determining a head video segment in the first video and the second video based on the first time period, the head video segment being within the first time period;
determining a tail video segment in the first video and the second video based on the second time period, the tail video segment being within the second time period.
9. The method according to claim 8, wherein the fusing the audio similarity information and the image similarity information to obtain video similarity information comprises:
averaging values of corresponding positions in the audio similar information and the image similar information to obtain intermediate fusion information;
and carrying out normalization processing on the values in the intermediate fusion information to obtain the video similar information.
10. The method of claim 1, wherein the determining a head video segment and a tail video segment in the first video and the second video based on the audio similarity information and the image similarity information comprises:
determining caption similar information based on the caption data of the first video and the caption data of the second video, wherein the caption similar information is used for representing the caption similarity between the first video and the second video;
and determining a head video segment and a tail video segment in the first video and the second video based on the audio similar information, the image similar information and the subtitle similar information.
11. An apparatus for determining a video segment, the apparatus comprising:
the audio determining module is used for determining audio similar information based on audio data of a first video and audio data of a second video, wherein the audio similar information is used for representing audio similarity between the first video and the second video, and the first video and the second video are series videos and belong to the same series;
an image determining module, configured to determine image similarity information based on image data of the first video and image data of the second video, where the image similarity information is used to represent image similarity between the first video and the second video;
and the segment determining module is used for determining a head video segment and a tail video segment in the first video and the second video based on the audio similar information and the image similar information.
12. A computer device, characterized in that the computer device comprises a processor and a memory for storing at least one piece of computer program, which is loaded by the processor and which performs the method of determining a video segment according to any one of claims 1 to 10.
13. A computer-readable storage medium for storing at least one computer program for performing the method for determining a video segment according to any one of claims 1 to 10.
14. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the method of determining a video segment according to any one of claims 1 to 10.
CN202210363724.7A 2022-04-07 2022-04-07 Video clip determination method, device, equipment and storage medium Pending CN115129932A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210363724.7A CN115129932A (en) 2022-04-07 2022-04-07 Video clip determination method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210363724.7A CN115129932A (en) 2022-04-07 2022-04-07 Video clip determination method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115129932A true CN115129932A (en) 2022-09-30

Family

ID=83376555

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210363724.7A Pending CN115129932A (en) 2022-04-07 2022-04-07 Video clip determination method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115129932A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115359409A (en) * 2022-10-19 2022-11-18 腾讯科技(深圳)有限公司 Video splitting method and device, computer equipment and storage medium
CN115359409B (en) * 2022-10-19 2023-01-17 腾讯科技(深圳)有限公司 Video splitting method and device, computer equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination