CN114241223B - Video similarity determination method and device, electronic equipment and storage medium - Google Patents


Info

Publication number: CN114241223B (grant of application CN114241223A)
Authority: CN (China)
Application number: CN202111552585.4A
Legal status: Active (granted)
Prior art keywords: video, sample, similarity matrix, frame, similarity
Original language: Chinese (zh)
Inventors: 陈翼翼, 刘旭东, 李岩
Current and original assignee: Beijing Dajia Internet Information Technology Co Ltd
Application filed by Beijing Dajia Internet Information Technology Co Ltd, with priority to CN202111552585.4A

Classifications

    • G06F18/22 (Pattern recognition; analysing; matching criteria, e.g. proximity measures)
    • G06N3/045 (Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology; combinations of networks)
    • G06N3/08 (Computing arrangements based on biological models; neural networks; learning methods)

Abstract

The method obtains feature data of a target frame in a first video and a second video, obtains an inter-frame similarity matrix between the target frame in the first video and the target frame in the second video according to the feature data, performs nonlinear conversion on the inter-frame similarity matrix to obtain a similarity matrix between the first video and the second video, and further obtains the similarity between the first video and the second video by performing countermeasure calculation on the similarity matrix. In the embodiment, when the similarity of the video is determined, the similarity characteristics between frames of the video and the similarity characteristics of the video granularity are considered, so that the calculated similarity is more accurate.

Description

Video similarity determination method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of internet technologies, and in particular, to a method and an apparatus for determining video similarity, an electronic device, and a storage medium.
Background
With the development of internet technology, network video platforms have become widespread and more and more users upload videos through them, so controlling the quality of videos on these platforms has become very important.
In the related art, to determine whether two videos are duplicates, frame-by-frame features or embedding vectors are usually extracted from the videos; then either the similarity of the two videos is calculated directly from these frame-by-frame features or embedding vectors, or the frame-by-frame features or embedding vectors are first converted into a feature representation or vector representation of the entire video and the similarity is determined from the distance between those representations.
However, if the resolution of a video is modified, or the video is cropped, mix-cut, given a new opening or ending, has its background modified, and so on, such methods have difficulty accurately identifying duplicate or similar videos.
Disclosure of Invention
The present disclosure provides a method, an apparatus, an electronic device, and a storage medium for determining video similarity, so as to at least solve the problem in the related art that the accuracy of determining a duplicate or similar video is not high. The technical scheme of the disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, a method for determining video similarity is provided, including:
acquiring characteristic data of a target frame in a first video and a second video;
acquiring an inter-frame similarity matrix between a target frame in the first video and a target frame in the second video according to the characteristic data;
carrying out nonlinear conversion on the interframe similarity matrix to obtain a similarity matrix between the first video and the second video;
and performing countermeasure calculation on the similarity matrix to obtain the similarity between the first video and the second video.
In one embodiment, the performing nonlinear conversion on the inter-frame similarity matrix to obtain a similarity matrix between the first video and the second video includes: and carrying out nonlinear conversion on the interframe similarity matrix through a convolutional neural network to obtain a similarity matrix between the first video and the second video.
In one embodiment, the convolutional neural network is generated in a manner that includes: acquiring a training data set, wherein the training data set comprises a plurality of triple sample data, and the triple sample data comprises an anchor point sample video, a first sample video similar to the anchor point sample video and a second sample video dissimilar to the anchor point sample video; obtaining a first sample inter-frame similarity matrix between the anchor sample video and the first sample video, and obtaining a second sample inter-frame similarity matrix between the anchor sample video and the second sample video; and training a basic convolutional network by adopting the first sample inter-frame similarity matrix and the second sample inter-frame similarity matrix to obtain a trained convolutional neural network.
In one embodiment, the training a base convolutional network by using the first sample inter-frame similarity matrix and the second sample inter-frame similarity matrix to obtain a trained convolutional neural network includes: respectively inputting the first sample inter-frame similarity matrix and the second sample inter-frame similarity matrix into a basic convolution network for convolution processing to obtain a first sample video similarity matrix corresponding to the first sample inter-frame similarity matrix and a second sample video similarity matrix corresponding to the second sample inter-frame similarity matrix; obtaining a first sample similarity between the anchor point sample video and the first sample video by performing countermeasure calculation on the first sample video similarity matrix, and obtaining a second sample similarity between the anchor point sample video and the second sample video by performing countermeasure calculation on the second sample video similarity matrix; and determining network loss according to the first sample similarity and the second sample similarity, and performing parameter adjustment on the basic convolutional network according to the network loss to obtain the trained convolutional neural network.
In one embodiment, the obtaining the similarity between the first video and the second video by performing a countermeasure calculation on the similarity matrix includes: determining the maximum value of each column in the similarity matrix; and acquiring the ratio of the sum of the maximum values in each column to the number of rows of the similarity matrix as the similarity between the first video and the second video.
In one embodiment, the obtaining an inter-frame similarity matrix between a target frame in the first video and a target frame in the second video according to the feature data includes: for each target frame in the first video, acquiring feature similarity between corresponding feature data and corresponding feature data of each target frame in the second video respectively to obtain a row of elements or a column of elements corresponding to the target frame; and generating a corresponding inter-frame similarity matrix according to each row element or each column element.
In one embodiment, the acquiring feature data of the target frame in the first video and the second video includes: identifying a video duration of the first video or the second video, respectively; when the video duration of the first video is less than or equal to a set duration, extracting a target frame from the first video according to a first set frequency; when the video duration of the second video is less than or equal to a set duration, extracting a target frame from the second video according to the first set frequency; and carrying out feature identification on the target frames to obtain feature data of each target frame.
In one embodiment, the method further comprises: when the video duration of the first video is longer than the set duration, extracting a target frame from a preset part in the first video by adopting a second set frequency, and extracting a target frame from the rest part except the preset part in the first video by adopting a third set frequency; and when the video duration of the second video is longer than the set duration, extracting a target frame from the preset part in the second video by adopting the second set frequency, and extracting a target frame from the rest part except the preset part in the second video by adopting the third set frequency.
In one embodiment, the starting point of the preset portion is a video starting point of the first video or a video starting point of the second video, and the second set frequency is greater than the third set frequency.
According to a second aspect of the embodiments of the present disclosure, there is provided a video similarity determination apparatus, including:
the characteristic data acquisition module is configured to acquire characteristic data of a target frame in the first video and the second video;
an inter-frame similarity matrix obtaining module configured to obtain an inter-frame similarity matrix between a target frame in the first video and a target frame in the second video according to the feature data;
a video similarity matrix obtaining module configured to perform nonlinear conversion on the inter-frame similarity matrix to obtain a similarity matrix between the first video and the second video;
a similarity determination module configured to perform a countermeasure calculation on the similarity matrix to obtain a similarity between the first video and the second video.
In one embodiment, the similarity matrix obtaining module is configured to perform: and carrying out nonlinear conversion on the interframe similarity matrix through a convolutional neural network to obtain a similarity matrix between the first video and the second video.
In one embodiment, the apparatus further comprises a convolutional neural network generation module configured to perform: acquiring a training data set, wherein the training data set comprises a plurality of triple sample data, and the triple sample data comprises an anchor point sample video, a first sample video similar to the anchor point sample video and a second sample video dissimilar to the anchor point sample video; obtaining a first sample inter-frame similarity matrix between the anchor sample video and the first sample video, and obtaining a second sample inter-frame similarity matrix between the anchor sample video and the second sample video; and training a basic convolutional network by adopting the first sample inter-frame similarity matrix and the second sample inter-frame similarity matrix to obtain a trained convolutional neural network.
In one embodiment, the convolutional neural network generation module is further configured to perform: respectively inputting the first sample inter-frame similarity matrix and the second sample inter-frame similarity matrix into a basic convolution network for convolution processing to obtain a first sample video similarity matrix corresponding to the first sample inter-frame similarity matrix and a second sample video similarity matrix corresponding to the second sample inter-frame similarity matrix; obtaining a first sample similarity between the anchor point sample video and the first sample video by performing countermeasure calculation on the first sample video similarity matrix, and obtaining a second sample similarity between the anchor point sample video and the second sample video by performing countermeasure calculation on the second sample video similarity matrix; and determining network loss according to the first sample similarity and the second sample similarity, and performing parameter adjustment on the basic convolutional network according to the network loss to obtain the trained convolutional neural network.
In one embodiment, the similarity determination module is configured to perform: determining the maximum value of each column in the similarity matrix; and acquiring the ratio of the sum of the maximum values in each column to the row number of the similarity matrix as the similarity between the first video and the second video.
In one embodiment, the inter-frame similarity matrix obtaining module is configured to perform: for each target frame in the first video, acquiring feature similarity between corresponding feature data and corresponding feature data of each target frame in the second video respectively to obtain a row of elements or a column of elements corresponding to the target frame; and generating a corresponding interframe similarity matrix according to each row element or each column element.
In one embodiment, the feature data acquisition module is configured to perform: identifying a video duration of the first video or the second video, respectively; when the video duration of the first video is less than or equal to a set duration, extracting a target frame from the first video according to a first set frequency; when the video duration of the second video is less than or equal to a set duration, extracting a target frame from the second video according to a first set frequency; and carrying out feature identification on the target frames to obtain feature data of each target frame.
In one embodiment, the feature data acquisition module is further configured to perform: when the video duration of the first video is longer than the set duration, extracting a target frame from a preset part in the first video by adopting a second set frequency, and extracting a target frame from the rest part except the preset part in the first video by adopting a third set frequency; and when the video duration of the second video is longer than the set duration, extracting a target frame from the preset part in the second video by adopting the second set frequency, and extracting a target frame from the rest part except the preset part in the second video by adopting the third set frequency.
In one embodiment, the starting point of the preset portion is a video starting point of the first video or a video starting point of the second video, and the second set frequency is greater than the third set frequency.
According to a third aspect of embodiments of the present disclosure, there is provided an electronic device comprising a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement the video similarity determination method according to any one of the above first aspects.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein instructions, when executed by a processor of an electronic device, enable the electronic device to perform the video similarity determination method according to any one of the first aspect above.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product, which includes instructions that, when executed by a processor of an electronic device, enable the electronic device to perform the video similarity determination method according to any one of the above first aspects.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
the method comprises the steps of obtaining feature data of a target frame in a first video and a target frame in a second video, obtaining an inter-frame similarity matrix between the target frame in the first video and the target frame in the second video according to the feature data, carrying out nonlinear conversion on the inter-frame similarity matrix to obtain a similarity matrix between the first video and the second video, and further carrying out countermeasure calculation on the similarity matrix to obtain the similarity between the first video and the second video. In the embodiment, when the similarity of the video is determined, the similarity characteristics between frames of the video and the similarity characteristics of the video granularity are considered, so that the calculated similarity is more accurate.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 is a schematic diagram illustrating video granularity-based re-determination according to an exemplary embodiment.
FIG. 2 is a diagram illustrating frame granularity based re-determination, according to an example embodiment.
Fig. 3 is a flowchart illustrating a video similarity determination method according to an exemplary embodiment.
FIG. 4 is a flowchart illustrating the step of calculating the similarity, according to an exemplary embodiment.
FIG. 5 is a flowchart illustrating the training steps of a convolutional neural network, according to an exemplary embodiment.
FIG. 6 is a training schematic of a convolutional neural network shown in accordance with an exemplary embodiment.
Fig. 7 is a schematic diagram illustrating a video similarity determination method according to an exemplary embodiment.
Fig. 8 is a diagram illustrating the calculation of a video granularity similarity matrix according to an exemplary embodiment.
Fig. 9 is a block diagram illustrating a video similarity determination apparatus according to an exemplary embodiment.
FIG. 10 is a block diagram of an electronic device shown in accordance with an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in other sequences than those illustrated or described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the disclosure, as detailed in the appended claims.
It should also be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for presentation, analyzed data, etc.) referred to in the present disclosure are both information and data that are authorized by the user or sufficiently authorized by various parties.
In the conventional technology, as shown in fig. 1, features of a video X and a video Y may be extracted frame by frame by a CNN (Convolutional Neural Network), and the frame-by-frame features or embedding vectors are converted into a feature representation f or embedding vector e of the entire video (i.e., a feature q corresponding to video X and a feature p corresponding to video Y) by operations such as max pooling or aggregation; the distance between the features q and p of the two videos is then calculated, the similarity between the videos (i.e., the similarity in the figure) is determined according to that distance, and whether the videos are duplicates is determined. However, this video-granularity duplicate determination directly converts the feature expression of each frame into a feature expression of the whole video through simple operations, so the spatiotemporal information of the video is lost, resulting in insufficient accuracy.
As shown in fig. 2, a frame-granularity duplicate determination method calculates the vector products qi·pj of the frame-by-frame features or embedding vectors of two videos (i.e., frame-by-frame features q1, ..., qi, ..., qx corresponding to video X, and frame-by-frame features p1, ..., pj, ..., py corresponding to video Y), where qi is the feature corresponding to the ith frame of video X and pj is the feature corresponding to the jth frame of video Y, so as to obtain a frame-to-frame similarity matrix; the similarity matrix is then processed to obtain the similarity of the two videos and determine whether the videos are duplicates. However, the frame-granularity duplicate determination ignores the spatial relationship between video frames, resulting in poor accuracy.
Because the above schemes cannot accurately identify duplicate or similar videos, the same user is likely to repeatedly see the same or similar videos, which degrades the user experience and limits the conversion rate.
Based on this, the present disclosure provides a video similarity determining method, as shown in fig. 3, including the following steps:
in step S310, feature data of a target frame in the first video and the second video is acquired.
The first video and the second video refer to any two videos which need to be subjected to similarity comparison, and whether the two videos are similar or repeated is judged by performing similarity comparison. The target frame is a frame extracted from the first video and the second video respectively for comparison when the similarity comparison is performed on the first video and the second video, and may be all frames in the first video and the second video, or may be a partial frame extracted from the first video and the second video based on a certain rule. The feature data refers to an abstract concept extracted from the target frame to describe the frame picture. In this embodiment, when the similarity comparison between the first video and the second video is to be performed, the feature data of the target frame in the first video and the feature data of the target frame in the second video are first obtained, and then the similarity comparison is performed through the subsequent steps.
In step S320, an inter-frame similarity matrix between the target frame in the first video and the target frame in the second video is obtained according to the feature data.
Wherein the inter-frame similarity matrix is a set formed based on the similarity between each target frame in the first video and each target frame in the second video. Each element m(i, j) in the inter-frame similarity matrix represents the similarity between the ith target frame in the first video and the jth target frame in the second video. In the present embodiment, based on the feature data of each target frame in the first video and the feature data of each target frame in the second video, the similarity between each target frame in the first video and each target frame in the second video is acquired respectively, and the inter-frame similarity matrix is then obtained from the acquired similarities between the target frames.
In step S330, the inter-frame similarity matrix is subjected to a non-linear transformation to obtain a similarity matrix between the first video and the second video.
Wherein the similarity matrix is a set of similarities between the first video and the second video based on the video granularity representation. The nonlinear conversion is conversion in which the ratio of the variation of each output value to the variation of the corresponding input value is not constant, and can be realized by manual conversion, a kernel method or a neural network. In this embodiment, the inter-frame similarity matrix is subjected to nonlinear conversion, so as to obtain a converted similarity matrix capable of reflecting the granularity similarity of the videos, and thus obtain a similarity matrix between the first video and the second video.
In step S340, a countermeasure calculation is performed on the similarity matrix to obtain a similarity between the first video and the second video.
Here, the countermeasure means that cases in which a new video is constructed by operations such as shuffling the order of video clips, modifying the resolution, simple cropping, mix-cutting, adding an opening or ending clip, or modifying the video background can be withstood: a relatively high similarity can still be obtained through the countermeasure calculation, so that such disguised duplicate video content is not missed. In this embodiment, the similarity between the first video and the second video is obtained by performing a countermeasure calculation on the obtained similarity matrix.
In the video similarity determination method, feature data of a target frame in a first video and a target frame in a second video are obtained, an inter-frame similarity matrix between the target frame in the first video and the target frame in the second video is obtained according to the feature data, the inter-frame similarity matrix is subjected to nonlinear conversion to obtain a similarity matrix between the first video and the second video, and then the similarity matrix is subjected to countermeasure calculation to obtain the similarity between the first video and the second video. In the embodiment, when determining the similarity of the video, the similarity characteristics between frames of the video and the similarity characteristics of the video granularity are considered, and the similarity is obtained in a countermeasure calculation mode, so that the calculated similarity is more accurate.
In an exemplary embodiment, in step S330, performing a nonlinear conversion on the inter-frame similarity matrix to obtain a similarity matrix between the first video and the second video may specifically include: carrying out nonlinear conversion on the inter-frame similarity matrix through a convolutional neural network to obtain the similarity matrix between the first video and the second video. A Convolutional Neural Network (CNN) is a kind of feed-forward neural network that contains convolution calculations and has a deep structure, and is one of the representative algorithms of deep learning. In this embodiment, the convolutional neural network is used to perform convolution processing on the inter-frame similarity matrix, thereby implementing the nonlinear conversion of the inter-frame similarity matrix and obtaining the similarity matrix between the first video and the second video more conveniently.
In an exemplary embodiment, as shown in fig. 4, in step S340, performing a countermeasure calculation on the similarity matrix to obtain a similarity between the first video and the second video specifically includes:
in step S342, the maximum value of each column in the similarity matrix is determined.
Since a matrix is a set of complex or real numbers arranged in a rectangular array, it has corresponding rows and columns; in this embodiment, the similarity matrix is a set of similarities arranged as such a rectangular array. For example, if the similarity matrix has m rows and n columns, each column contains m elements. In this embodiment, the maximum-value element of each column is determined by comparing the m elements in that column, so that for a similarity matrix with n columns, n maximum-value elements are obtained.
In step S344, the ratio between the sum of the maximum values in the columns and the number of rows of the similarity matrix is obtained as the similarity between the first video and the second video.
Specifically, the n maximum-value elements obtained above are summed, and the ratio of this sum to the number of rows of the similarity matrix is calculated; that is, the sum of the n maximum-value elements is divided by the number of rows m of the similarity matrix to obtain the corresponding ratio, which is taken as the similarity between the first video and the second video.
For example, if the similarity matrix is in the form of 3*4 (i.e., 3 rows and 4 columns):
[3 × 4 matrix of frame-level similarity values, not reproduced here; its column maxima are 0.67, 0.73, 0.95 and 0.78]
then for each column in the matrix, the maximum value element in that column is first determined, resulting in a matrix of 1*4:
[0.67,0.73,0.95,0.78]。
then the elements of this 1*4 matrix are summed: 0.67+0.73+0.95+0.78=3.13. The ratio between this sum and the number of rows of the similarity matrix (i.e. 3) is then calculated, i.e. 3.13/3≈1.04, and this ratio 1.04 is the similarity between the first video and the second video.
In the above embodiment, by determining the maximum value of each column in the similarity matrix and then taking the ratio between the sum of these column maxima and the number of rows of the similarity matrix as the similarity between the first video and the second video, operations that construct a new video by shuffling the order of video segments or by mix-cut editing can be resisted, and the accuracy of judging such similar or duplicate videos can be improved.
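For illustration, the countermeasure calculation described in this embodiment may be sketched as follows; the use of NumPy and the function name video_similarity are assumptions introduced here for clarity rather than details given in the disclosure.

    import numpy as np

    def video_similarity(sim_matrix: np.ndarray) -> float:
        """Countermeasure calculation: sum the maximum value of each column
        of the similarity matrix and divide by its number of rows."""
        column_max = sim_matrix.max(axis=0)                  # maximum value of each column
        return float(column_max.sum() / sim_matrix.shape[0])

    # Reproduces the worked example above: column maxima 0.67, 0.73, 0.95 and 0.78
    # in a 3-row matrix give (0.67 + 0.73 + 0.95 + 0.78) / 3 ≈ 1.04.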
In an exemplary embodiment, in step S310, acquiring feature data of a target frame in a first video and a second video specifically includes: identifying the video duration of the first video or the second video respectively; when the video duration of the first video is less than or equal to the set duration, extracting a target frame from the first video according to a first set frequency; when the video duration of the second video is less than or equal to the set duration, extracting a target frame from the second video according to a first set frequency; and carrying out feature identification on the target frames to obtain feature data of each target frame.
The video duration refers to the time required for playing the video. For example, for a first video, when the video duration of the first video is less than or equal to a set duration, a target frame is extracted from the first video at a first set frequency. The set duration refers to a preset video duration for extracting the target frame according to a fixed frequency, the set duration is usually a small value, for example, 10 seconds, 20 seconds, and the like, and the first set frequency is a preset fixed frequency for extracting the target frame in the video with a small video duration, for example, extracting one frame per second, or extracting two frames per second, and the like. Namely, when the video duration of the first video is less than or equal to the set duration, the target frame is extracted from the first video according to the first set frequency. Similarly, the same processing can be performed for the second video, such as when the video duration of the second video is less than or equal to the set duration, the target frame is extracted from the second video according to the first set frequency.
In this embodiment, after the target frames are extracted from the first video and the second video respectively in the above manner, feature recognition is further performed on the target frames to obtain the feature data of each target frame. Classical models such as MobileNet or ResNet (residual network) may be used to identify the target frames and extract the corresponding feature data. Taking ResNet-50 as an example, for each target frame in the first video and the second video, features are extracted from an intermediate convolutional layer of ResNet-50 and aggregated, so as to obtain a feature expression of each target frame, such as (X, feature map size) and (Y, feature map size), where X and Y are the numbers of target frames in the first video and the second video, respectively, and the feature map size is the size of the feature map of the corresponding video.
In this embodiment, when the video duration is less than or equal to the set duration, it indicates that the video duration of the video is smaller, and therefore, the first set frequency for extracting the video with the smaller video duration is used to extract the target frame, thereby avoiding the problem that the final similarity comparison result is affected by too few extracted target frames due to too large frame extraction interval, so as to improve the accuracy of the frame extraction result.
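Purely as an illustration, the ResNet-50-based feature identification mentioned above could be sketched as follows with PyTorch and torchvision (version 0.13 or later is assumed for the weights API); the global max pooling here is only a crude stand-in for the R-MAC-style aggregation referred to later in this disclosure, and none of these implementation details are prescribed by the embodiment.

    import torch
    import torch.nn.functional as F
    from torchvision.models import resnet50, ResNet50_Weights

    backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
    # Keep everything up to the last convolutional stage (drop avgpool and fc)
    # so that intermediate convolutional feature maps are exposed.
    feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2]).eval()

    @torch.no_grad()
    def frame_features(frames: torch.Tensor) -> torch.Tensor:
        """frames: (num_frames, 3, H, W) preprocessed target frames of one video.
        Returns one feature vector per target frame."""
        fmap = feature_extractor(frames)            # (num_frames, 2048, h, w)
        pooled = F.adaptive_max_pool2d(fmap, 1)     # aggregate each feature map
        return pooled.flatten(1)                    # (num_frames, 2048)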
In an exemplary embodiment, for the extracting of the target frame, the method may further include: when the video duration of the first video is longer than the set duration, the second set frequency may be used to extract the target frame from the preset portion of the first video, and the third set frequency may be used to extract the target frame from the remaining portion of the first video except the preset portion. And when the video duration of the second video is longer than the set duration, extracting the target frame from the preset part in the second video by adopting a second set frequency, and extracting the target frame from the rest part except the preset part in the second video by adopting a third set frequency.
The second setting frequency and the third setting frequency are predetermined frequencies for extracting the target frame, such as extracting one frame per second or extracting two frames per second. The second setting frequency is different from the third setting frequency, and can be specifically set according to an actual application scenario. The preset portion is a predetermined portion of the video extracted with the second set frequency from the target frame, and may be, for example, a previous portion or a next portion of the video, or a middle portion, and may be, for example, the first 40% of the video, the last 30% of the video, or a portion between 40% and 80% of the middle of the video. It can also be set according to the actual application scenario.
In this embodiment, when the video duration of a video is longer than the set duration, it indicates that the video is relatively long. Moreover, the front and rear parts of some videos differ in importance: in advertisement videos, for example, the rear part mostly consists of jump pages whose content is unimportant, while the video content of the front part matters more to users. Therefore, the target frames in the video can be determined with a dynamic frame-extraction strategy, for example a strategy that is dense at the front and sparse at the back, that is, the frame-extraction interval is small for the front part of the video and large for the rear part, so that the video characteristics can be better captured. It is understood that in other scenarios, a frame-extraction strategy that is sparse at the front and dense at the back, or another frame-extraction strategy, may also be employed.
Specifically, for the first video, when the video duration of the first video is longer than the set duration, the second set frequency is used to extract the target frame from the preset portion of the first video, and the third set frequency is used to extract the target frame from the remaining portion of the first video except the preset portion. Similarly, the second video may be processed in the same way, for example, when the video duration of the second video is longer than the set duration, the second set frequency is used to extract the target frame from the preset portion of the second video, and the third set frequency is used to extract the target frame from the remaining portion of the second video except the preset portion. For videos with longer video duration, the embodiment can extract the target frames by adopting different frequencies according to the actual scene, thereby realizing dynamic frame extraction of the videos and improving the accuracy of similarity comparison.
In an exemplary embodiment, in a scene where the video content of the first half is more important relative to the user, the start point of the preset portion is the video start point of the frame-extracting video, and the second set frequency is greater than the third set frequency. The frame extraction interval of the former part of the video is small, the frame extraction interval of the latter part is large, namely the frame extraction density of the former part is large, and the frame extraction density of the latter part is small, so that useful video characteristics can be extracted better.
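A minimal sketch of such a dynamic frame-extraction strategy is given below; the concrete numbers (a 20 s set duration, a 40% preset portion, 1 s and 2 s extraction intervals) follow the example given later in this description, and the function name is hypothetical.

    def target_frame_times(duration_s: float,
                           set_duration_s: float = 20.0,
                           first_interval_s: float = 1.0,   # first set frequency
                           second_interval_s: float = 1.0,  # dense front part
                           third_interval_s: float = 2.0,   # sparse rear part
                           preset_fraction: float = 0.4) -> list:
        """Return the timestamps (in seconds) at which target frames are extracted."""
        times, t = [], 0.0
        if duration_s <= set_duration_s:
            # Short video: sample the whole video at the first set frequency.
            while t < duration_s:
                times.append(t)
                t += first_interval_s
            return times
        # Long video: dense sampling in the preset (front) portion, sparse afterwards.
        split = duration_s * preset_fraction
        while t < split:
            times.append(t)
            t += second_interval_s
        while t < duration_s:
            times.append(t)
            t += third_interval_s
        return times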
In an exemplary embodiment, in step S320, acquiring an inter-frame similarity matrix between a target frame in a first video and a target frame in a second video according to the feature data specifically includes: for each target frame in the first video, acquiring the feature similarity between the corresponding feature data and the feature data corresponding to each target frame in the second video respectively, obtaining a row of elements or a column of elements corresponding to the target frame, and generating a corresponding inter-frame similarity matrix according to each row of elements or each column of elements.
For example, suppose the first video has m target frames with corresponding feature data (a1, a2, ..., am), and the second video has n target frames with corresponding feature data (b1, b2, ..., bn). Then, for each target frame in the first video, the feature similarity between its feature data and the feature data of each target frame in the second video is obtained. For instance, for the feature data a1 of a target frame in the first video, the feature similarity a1·b1 between a1 and the feature data b1 in the second video is obtained, the feature similarity a1·b2 between a1 and the feature data b2 in the second video is obtained, and so on; that is, the feature similarities (a1·b1, a1·b2, ..., a1·bn) between a1 and each piece of feature data in the second video are obtained, which form a row of elements or a column of elements for the target frame a1, i.e., they may be used as one row or one column of the matrix. Similarly, for the feature data a2 in the first video, the feature similarities (a2·b1, a2·b2, ..., a2·bn) with each piece of feature data in the second video are obtained as the corresponding row or column of elements, and for the feature data am in the first video, the feature similarities (am·b1, am·b2, ..., am·bn) are obtained as the corresponding row or column of elements. A corresponding inter-frame similarity matrix is then generated from the row of elements or column of elements corresponding to each target frame in the first video, giving the following inter-frame similarity matrix of m rows and n columns:
    a1·b1  a1·b2  ...  a1·bn
    a2·b1  a2·b2  ...  a2·bn
    ...    ...    ...  ...
    am·b1  am·b2  ...  am·bn
Each element in the above matrix, e.g., am·bn, represents the similarity between the feature data am of the mth target frame in the first video and the feature data bn of the nth target frame in the second video. Specifically, the similarity may be determined based on the distance between the feature data am and bn; in this embodiment, the methods for calculating the distance between features include, but are not limited to, the Euclidean distance, cosine similarity, the Tanimoto coefficient, and the like.
In the above embodiment, a row of elements or a column of elements corresponding to each target frame is obtained by obtaining the feature similarity between the feature data of each target frame in the first video and the feature data of each target frame in the second video, and a corresponding inter-frame similarity matrix may then be generated from these rows or columns of elements. The inter-frame similarity matrix comprises the similarity between each target frame in the first video and each target frame in the second video; that is, the spatial relation between the frames of the videos is established, so that the spatial information between the frames of the videos can be reflected comprehensively.
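As an illustrative sketch of this construction, assuming one feature vector per target frame and cosine similarity as the frame-level measure (one of the options named above):

    import numpy as np

    def inter_frame_similarity_matrix(feats_x: np.ndarray,
                                      feats_y: np.ndarray) -> np.ndarray:
        """feats_x: (m, d) feature data of the m target frames of the first video.
        feats_y: (n, d) feature data of the n target frames of the second video.
        Returns the m x n inter-frame similarity matrix whose (i, j) element is the
        cosine similarity between frame i of the first video and frame j of the second."""
        x = feats_x / np.linalg.norm(feats_x, axis=1, keepdims=True)
        y = feats_y / np.linalg.norm(feats_y, axis=1, keepdims=True)
        return x @ y.T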
In an exemplary embodiment, as shown in fig. 5, the generation manner of the convolutional neural network may specifically include the following steps:
in step S510, a training data set is acquired.
Wherein the training data set is a set containing several sample data for training the network. In the present embodiment, the sample data refers to triple sample data, and in particular, the triple sample data includes an anchor sample video (i.e., a reference sample), a first sample video (i.e., a positive sample) similar to the anchor sample video, and a second sample video (i.e., a negative sample) dissimilar to the anchor sample video. The anchor sample video and the first sample video form a positive sample pair, and the anchor sample video and the second sample video form a negative sample pair.
In step S520, a first sample inter-frame similarity matrix between the anchor sample video and the first sample video and a second sample inter-frame similarity matrix between the anchor sample video and the second sample video are obtained.
In this embodiment, for each triple sample data, a first sample inter-frame similarity matrix of a positive sample pair and a second sample inter-frame similarity matrix of a negative sample pair are obtained respectively. Specifically, for the process of acquiring the inter-frame similarity matrix of the first sample and the inter-frame similarity matrix of the second sample, reference may be made to the process of acquiring the inter-frame similarity matrix in the foregoing embodiment, which is not described in detail in this embodiment.
In step S530, a basic convolutional network is trained by using the first sample inter-frame similarity matrix and the second sample inter-frame similarity matrix, so as to obtain a trained convolutional neural network.
Wherein the underlying convolutional network is a conventional network for performing convolutional processing. In this embodiment, a basic convolutional network is trained based on the obtained first sample inter-frame similarity matrix and the obtained second sample inter-frame similarity matrix, so as to obtain a trained convolutional neural network. And then, the convolution neural network is used for carrying out convolution processing on the interframe similarity matrix to obtain a video similarity matrix between the first video and the second video, so that the similarity matrix of the video granularity is extracted.
In an exemplary embodiment, as shown in fig. 6, the following further explains a training process of the convolutional neural network, and in this embodiment, taking the extraction of feature data by using ResNet-50 as an example, a training method specifically may include: inputting the anchor sample video (anchor video), the first sample video (positive video) similar to the anchor sample video and the second sample video (negative video) dissimilar to the anchor sample video into ResNet-50 respectively for feature extraction, thereby obtaining feature data (namely R-MAC in the figure) of the anchor sample video, the first sample video and the second sample video respectively. And then obtaining a first sample inter-frame similarity matrix TD1 between the anchor sample video and the first sample video, and obtaining a second sample inter-frame similarity matrix TD2 between the anchor sample video and the second sample video.
And then the first sample inter-frame similarity matrix TD1 and the second sample inter-frame similarity matrix TD2 are respectively input into a basic convolutional network (CNN) for convolution processing, to obtain a first sample video similarity matrix corresponding to TD1 and a second sample video similarity matrix corresponding to TD2. In this embodiment, a four-layer CNN network is taken as an example. Its processing can be understood as performing a nonlinear transformation on the original matrices TD1 and TD2, mapping them into a new space, so that each element of the resulting first sample video similarity matrix and second sample video similarity matrix is a similarity in the new mapped space. In addition, the dimensionality of the sample video similarity matrix changes: if the original sample inter-frame similarity matrices TD1 and TD2 have X rows and Y columns, the new similarity matrices obtained after passing through this CNN network have X' rows and Y' columns. The network parameters of the CNN shown in the figure are 4 convolutional layers (Conv) and two max-pooling layers: Conv(3×3)×32 indicates 3×3 convolution kernels with 32 output channels, Conv(3×3)×64 indicates 3×3 convolution kernels with 64 output channels, Conv(3×3)×128 indicates 3×3 convolution kernels with 128 output channels, and Conv(1×1)×1 indicates 1×1 convolution kernels with 1 output channel; after the two pooling layers, X' = X/4 and Y' = Y/4.
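Read literally, the four convolutional layers and two max-pooling layers described here could be sketched in PyTorch as follows; the exact ordering of the layers and the placement of the pooling operations are assumptions, since the referenced figure is not reproduced in this text.

    import torch.nn as nn

    # Sketch of the four-layer CNN that maps an (X, Y) inter-frame similarity matrix
    # to an (X/4, Y/4) video-granularity similarity matrix.
    video_similarity_cnn = nn.Sequential(
        nn.Conv2d(1, 32, kernel_size=3, padding=1),    # Conv(3x3) x 32
        nn.ReLU(),
        nn.MaxPool2d(2),                               # halves X and Y
        nn.Conv2d(32, 64, kernel_size=3, padding=1),   # Conv(3x3) x 64
        nn.ReLU(),
        nn.MaxPool2d(2),                               # halves X and Y again, giving 1/4
        nn.Conv2d(64, 128, kernel_size=3, padding=1),  # Conv(3x3) x 128
        nn.ReLU(),
        nn.Conv2d(128, 1, kernel_size=1),              # Conv(1x1) x 1
    )
    # Input:  (batch, 1, X, Y) inter-frame similarity matrix
    # Output: (batch, 1, X/4, Y/4) video similarity matrix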
A countermeasure calculation is then performed on the first sample video similarity matrix and the second sample video similarity matrix respectively, to obtain a first sample similarity CS1 between the anchor sample video and the first sample video and a second sample similarity CS2 between the anchor sample video and the second sample video. The specific calculation process may refer to the method shown in fig. 4. Finally, the network loss (namely the triplet loss in the figure) is determined according to the first sample similarity CS1 and the second sample similarity CS2, and the parameters of the basic convolutional network are adjusted according to the network loss to obtain the trained convolutional neural network. The network loss can be calculated by the following formula:
Loss = max{0, SimScore(anchor, negative) - SimScore(anchor, positive) + delta}
where SimScore(anchor, negative) is the similarity between the anchor sample video and the second sample video (i.e. the dissimilar sample), namely the second sample similarity CS2; SimScore(anchor, positive) is the similarity between the anchor sample video and the first sample video (i.e. the similar sample), namely the first sample similarity CS1; and delta is a constant approaching zero.
In this embodiment, the network is trained based on the triplet loss, so that in the feature space the reference sample is pulled closer to the positive sample of the same category and pushed farther from the negative sample of a different category. The trained convolutional neural network can therefore distinguish fine details well; in particular, when two inputs are very similar, a better feature representation can be learned, which improves the accuracy of the video similarity judgment.
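One training step on a triple could then be sketched as below, reusing the hypothetical video_similarity_cnn defined above; the differentiable sim_score helper mirrors the countermeasure calculation of fig. 4, and the optimizer choice and the delta value are assumptions.

    import torch

    def sim_score(sim_matrix: torch.Tensor) -> torch.Tensor:
        """Differentiable countermeasure calculation on a (1, 1, X', Y') matrix:
        sum of the column maxima divided by the number of rows."""
        m = sim_matrix.squeeze(0).squeeze(0)
        return m.max(dim=0).values.sum() / m.shape[0]

    def triplet_step(td1: torch.Tensor, td2: torch.Tensor,
                     model: torch.nn.Module, optimizer: torch.optim.Optimizer,
                     delta: float = 0.05) -> float:
        """td1: (1, 1, X, Y) inter-frame similarity matrix of the (anchor, positive) pair.
        td2: (1, 1, X, Y) inter-frame similarity matrix of the (anchor, negative) pair."""
        cs1 = sim_score(model(td1))   # first sample similarity  SimScore(anchor, positive)
        cs2 = sim_score(model(td2))   # second sample similarity SimScore(anchor, negative)
        loss = torch.clamp(cs2 - cs1 + delta, min=0.0)   # triplet loss from the formula above
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return float(loss.item())

    # e.g. optimizer = torch.optim.Adam(video_similarity_cnn.parameters(), lr=1e-4)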
In an exemplary embodiment, as shown in fig. 7, the video similarity determining method is further described below, which specifically includes:
Step one, dynamic frame extraction. A dynamic frame-extraction strategy is adopted to extract frames from video X and video Y, whose similarity is to be judged. For example, if the length of video X or video Y is >20s, the frame-extraction interval is 1s for the first 40% of that video and 2s for the last 60%; if the length of video X or video Y is <=20s, the frame-extraction interval is always 1s.
Step two, CNN feature extraction. Feature expressions (X, feature map size) and (Y, feature map size) of the pair of videos (X, Y) are extracted frame by frame based on a CNN; the algorithms for extracting features include, but are not limited to, classical models such as ResNet and MobileNet.
Step three, calculating the frame-granularity similarity matrix. The similarities between all frames of the pair of videos are calculated based on the features. Suppose video X comprises A target frames and video Y comprises B target frames; the feature distance between the ith frame Ai of video X and the jth frame Bj of video Y is calculated and recorded as mij. Finally a matrix with A rows and B columns is obtained, whose elements are mij. The methods for calculating the feature distance include, but are not limited to, classical algorithms such as the Euclidean distance and cosine similarity.
Step four, calculating the video-granularity similarity matrix. The inter-frame similarity matrix (A × B) of frame-granularity similarities of the paired videos is passed through a four-layer CNN network, as shown in fig. 8 below, to obtain a new video similarity matrix (A' × B'), where each element of the matrix is denoted sij, A' is a quarter of A, and B' is a quarter of B.
Step five, calculating the similarity value. The similarity score between the videos is obtained by calculation; the specific calculation process may refer to the embodiment shown in fig. 4. In this way, attacks based on mix-cut editing logic can be resisted, and the accuracy is improved.
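Putting the five steps together, and reusing the hypothetical helpers sketched in the earlier embodiments (frame_features, inter_frame_similarity_matrix, video_similarity_cnn and video_similarity), the overall flow might look like the following sketch; it is an illustration under those assumptions, not the prescribed implementation.

    import torch

    def compare_videos(frames_x: torch.Tensor, frames_y: torch.Tensor) -> float:
        """frames_x / frames_y: the target frames of video X / video Y selected by the
        dynamic frame-extraction step, each of shape (num_frames, 3, H, W)."""
        # Step two: CNN feature extraction, frame by frame.
        qx = frame_features(frames_x).numpy()
        qy = frame_features(frames_y).numpy()
        # Step three: frame-granularity (A x B) similarity matrix.
        td = inter_frame_similarity_matrix(qx, qy)
        # Step four: video-granularity (A/4 x B/4) similarity matrix via the CNN.
        with torch.no_grad():
            sm = video_similarity_cnn(torch.from_numpy(td).float()[None, None])
        # Step five: countermeasure (column-maximum) similarity value.
        return video_similarity(sm.squeeze(0).squeeze(0).numpy())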
It should be understood that although the various steps in the flowcharts of figs. 3-8 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited to this order and may be performed in other orders. Moreover, at least some of the steps in figs. 3-8 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
It is understood that the same/similar parts between the embodiments of the method described above in this specification can be referred to each other, and each embodiment focuses on the differences from the other embodiments, and it is sufficient that the relevant points are referred to the descriptions of the other method embodiments.
Fig. 9 is a block diagram illustrating a video similarity determination apparatus according to an example embodiment. Referring to fig. 9, the apparatus includes a feature data acquisition module 902, an inter-frame similarity matrix acquisition module 904, a video similarity matrix acquisition module 906, and a similarity determination module 908.
A feature data obtaining module 902 configured to perform obtaining feature data of a target frame in the first video and the second video;
an inter-frame similarity matrix obtaining module 904 configured to perform obtaining an inter-frame similarity matrix between a target frame in the first video and a target frame in the second video according to the feature data;
a video similarity matrix obtaining module 906, configured to perform nonlinear conversion on the inter-frame similarity matrix to obtain a similarity matrix between the first video and the second video;
a similarity determination module 908 configured to perform a countermeasure calculation on the video similarity matrix to obtain a similarity between the first video and the second video.
In an exemplary embodiment, the similarity matrix obtaining module is configured to perform: and carrying out nonlinear conversion on the interframe similarity matrix through a convolutional neural network to obtain a similarity matrix between the first video and the second video.
In an exemplary embodiment, the apparatus further comprises a convolutional neural network generating module configured to perform: acquiring a training data set, wherein the training data set comprises a plurality of triple sample data, and the triple sample data comprises an anchor point sample video, a first sample video similar to the anchor point sample video and a second sample video dissimilar to the anchor point sample video; obtaining a first sample inter-frame similarity matrix between the anchor sample video and the first sample video, and obtaining a second sample inter-frame similarity matrix between the anchor sample video and the second sample video; and training a basic convolutional network by adopting the first sample inter-frame similarity matrix and the second sample inter-frame similarity matrix to obtain a trained convolutional neural network.
In an exemplary embodiment, the convolutional neural network generating module is further configured to perform: respectively inputting the first sample inter-frame similarity matrix and the second sample inter-frame similarity matrix into a basic convolution network for convolution processing to obtain a first sample video similarity matrix corresponding to the first sample inter-frame similarity matrix and a second sample video similarity matrix corresponding to the second sample inter-frame similarity matrix; obtaining a first sample similarity between the anchor point sample video and the first sample video by performing countermeasure calculation on the first sample video similarity matrix, and obtaining a second sample similarity between the anchor point sample video and the second sample video by performing countermeasure calculation on the second sample video similarity matrix; and determining network loss according to the first sample similarity and the second sample similarity, and performing parameter adjustment on the basic convolutional network according to the network loss to obtain the trained convolutional neural network.
In an exemplary embodiment, the similarity determination module is configured to perform: determining the maximum value of each column in the similarity matrix; and acquiring the ratio of the sum of the maximum values in each column to the number of rows of the similarity matrix as the similarity between the first video and the second video.
In an exemplary embodiment, the inter-frame similarity matrix acquisition module is configured to perform: for each target frame in the first video, acquiring a feature similarity between the feature data of the target frame and the feature data of each target frame in the second video, to obtain a row of elements or a column of elements corresponding to the target frame; and generating a corresponding inter-frame similarity matrix according to each row of elements or each column of elements.
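A minimal sketch of this construction is shown below; it assumes M target frames in the first video, N target frames in the second video, D-dimensional feature data per frame, and cosine similarity as the feature similarity, none of which are fixed by the embodiment.

import numpy as np

def inter_frame_similarity_matrix(feats_a: np.ndarray, feats_b: np.ndarray) -> np.ndarray:
    # feats_a: (M, D) feature data of the target frames in the first video
    # feats_b: (N, D) feature data of the target frames in the second video
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    # row i holds the similarities between frame i of the first video and every
    # target frame of the second video, i.e. one row of elements per target frame
    return a @ b.T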
In an exemplary embodiment, the feature data acquisition module is configured to perform: identifying a video duration of each of the first video and the second video; when the video duration of the first video is less than or equal to a set duration, extracting a target frame from the first video according to a first set frequency; when the video duration of the second video is less than or equal to the set duration, extracting a target frame from the second video according to the first set frequency; and performing feature recognition on the target frames to obtain feature data of each target frame.
In an exemplary embodiment, the feature data acquisition module is further configured to perform: when the video duration of the first video is longer than the set duration, extracting a target frame from a preset part in the first video by adopting a second set frequency, and extracting a target frame from the remaining part other than the preset part in the first video by adopting a third set frequency; and when the video duration of the second video is longer than the set duration, extracting a target frame from a preset part in the second video by adopting the second set frequency, and extracting a target frame from the remaining part other than the preset part in the second video by adopting the third set frequency.
In one embodiment, the starting point of the preset part is a video starting point of the first video or a video starting point of the second video, and the second set frequency is greater than the third set frequency.
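The duration-dependent sampling policy of these embodiments can be summarized by a small helper; the concrete threshold, frequencies, and preset-part length below are placeholders, as no numeric values are specified.

def frame_sampling_plan(duration_s: float,
                        set_duration_s: float = 60.0,  # placeholder threshold
                        first_freq: float = 1.0,       # frames per second for short videos
                        second_freq: float = 2.0,      # frames per second for the preset (leading) part
                        third_freq: float = 0.5,       # frames per second for the remaining part
                        preset_part_s: float = 15.0):  # placeholder length of the preset part
    # Returns (start_s, end_s, fps) segments from which target frames are extracted.
    if duration_s <= set_duration_s:
        return [(0.0, duration_s, first_freq)]
    # Longer video: the preset part starts at the video start and is sampled more
    # densely than the rest (second_freq > third_freq).
    return [(0.0, preset_part_s, second_freq),
            (preset_part_s, duration_s, third_freq)]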
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 10 is a block diagram illustrating an electronic device S00 for video similarity determination, in accordance with an example embodiment. For example, the electronic device S00 may be a server. Referring to fig. 10, the electronic device S00 comprises a processing component S20, which further comprises one or more processors, and memory resources, represented by memory S22, for storing instructions, e.g. application programs, executable by the processing component S20. The application stored in the memory S22 may include one or more modules each corresponding to a set of instructions. Furthermore, the processing component S20 is configured to execute instructions to perform the above-described method.
The electronic device S00 may further include: a power supply component S24 configured to perform power management of the electronic device S00, a wired or wireless network interface S26 configured to connect the electronic device S00 to a network, and an input/output (I/O) interface S28. The electronic device S00 may operate based on an operating system stored in the memory S22, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, or the like.
In an exemplary embodiment, a computer-readable storage medium comprising instructions, for example the memory S22 comprising instructions, executable by a processor of the electronic device S00 to perform the above-described method, is also provided. The computer-readable storage medium may be, for example, a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, a computer program product is also provided, which includes instructions executable by a processor of the electronic device S00 to perform the above method.
It should be noted that the descriptions of the above-mentioned apparatus, the electronic device, the computer-readable storage medium, the computer program product, and the like according to the method embodiments may also include other embodiments, and specific implementations may refer to the descriptions of the related method embodiments, which are not described in detail herein.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (19)

1. A method for determining video similarity, the method comprising:
acquiring feature data of a target frame in a first video and a second video;
acquiring an inter-frame similarity matrix between a target frame in the first video and a target frame in the second video according to the feature data;
carrying out nonlinear conversion on the inter-frame similarity matrix to obtain a similarity matrix between the first video and the second video;
determining the maximum value of each column in the similarity matrix; and acquiring a ratio between the sum of the maximum values in each column and the number of rows of the similarity matrix, and taking the ratio as the similarity between the first video and the second video.
2. The method of claim 1, wherein the carrying out nonlinear conversion on the inter-frame similarity matrix to obtain the similarity matrix between the first video and the second video comprises:
carrying out nonlinear conversion on the inter-frame similarity matrix through a convolutional neural network to obtain the similarity matrix between the first video and the second video.
3. The method of claim 2, wherein the convolutional neural network is generated in a manner comprising:
acquiring a training data set, wherein the training data set comprises a plurality of triple sample data, and the triple sample data comprises an anchor sample video, a first sample video similar to the anchor sample video, and a second sample video dissimilar to the anchor sample video;
obtaining a first sample inter-frame similarity matrix between the anchor sample video and the first sample video, and obtaining a second sample inter-frame similarity matrix between the anchor sample video and the second sample video;
and training a basic convolutional network by adopting the first sample inter-frame similarity matrix and the second sample inter-frame similarity matrix to obtain a trained convolutional neural network.
4. The method of claim 3, wherein the training a base convolutional network using the first sample inter-frame similarity matrix and the second sample inter-frame similarity matrix to obtain a trained convolutional neural network, comprises:
respectively inputting the first sample inter-frame similarity matrix and the second sample inter-frame similarity matrix into the basic convolutional network for convolution processing to obtain a first sample video similarity matrix corresponding to the first sample inter-frame similarity matrix and a second sample video similarity matrix corresponding to the second sample inter-frame similarity matrix;
performing countermeasure calculation on the first sample video similarity matrix to obtain a first sample similarity between the anchor sample video and the first sample video, and performing countermeasure calculation on the second sample video similarity matrix to obtain a second sample similarity between the anchor sample video and the second sample video;
and determining network loss according to the first sample similarity and the second sample similarity, and performing parameter adjustment on the basic convolutional network according to the network loss to obtain the trained convolutional neural network.
5. The method according to any one of claims 1 to 4, wherein the obtaining an inter-frame similarity matrix between the target frame in the first video and the target frame in the second video according to the feature data comprises:
for each target frame in the first video, acquiring a feature similarity between the feature data of the target frame and the feature data of each target frame in the second video, to obtain a row of elements or a column of elements corresponding to the target frame;
and generating a corresponding inter-frame similarity matrix according to each row of elements or each column of elements.
6. The method according to any one of claims 1 to 4, wherein the obtaining the feature data of the target frame in the first video and the second video comprises:
identifying a video duration of each of the first video and the second video;
when the video duration of the first video is less than or equal to a set duration, extracting a target frame from the first video according to a first set frequency;
when the video duration of the second video is less than or equal to the set duration, extracting a target frame from the second video according to the first set frequency;
and carrying out feature recognition on the target frames to obtain feature data of each target frame.
7. The method of claim 6, further comprising:
when the video duration of the first video is longer than the set duration, extracting a target frame from a preset part in the first video by adopting a second set frequency, and extracting a target frame from the remaining part other than the preset part in the first video by adopting a third set frequency;
and when the video duration of the second video is longer than the set duration, extracting a target frame from a preset part in the second video by adopting the second set frequency, and extracting a target frame from the remaining part other than the preset part in the second video by adopting the third set frequency.
8. The method according to claim 7, wherein the starting point of the preset part is a video starting point of the first video or a video starting point of the second video, and the second set frequency is greater than the third set frequency.
9. A video similarity determination apparatus, comprising:
a feature data acquisition module configured to acquire feature data of a target frame in a first video and a second video;
an inter-frame similarity matrix obtaining module configured to obtain an inter-frame similarity matrix between a target frame in the first video and a target frame in the second video according to the feature data;
a video similarity matrix obtaining module configured to perform nonlinear conversion on the inter-frame similarity matrix to obtain a similarity matrix between the first video and the second video;
a similarity determination module configured to determine the maximum value of each column in the similarity matrix, acquire a ratio between the sum of the maximum values in each column and the number of rows of the similarity matrix, and take the ratio as the similarity between the first video and the second video.
10. The apparatus of claim 9, wherein the similarity matrix obtaining module is configured to perform:
carrying out nonlinear conversion on the inter-frame similarity matrix through a convolutional neural network to obtain the similarity matrix between the first video and the second video.
11. The apparatus of claim 10, further comprising a convolutional neural network generation module configured to perform:
acquiring a training data set, wherein the training data set comprises a plurality of triple sample data, and the triple sample data comprises an anchor sample video, a first sample video similar to the anchor sample video, and a second sample video dissimilar to the anchor sample video;
obtaining a first sample inter-frame similarity matrix between the anchor sample video and the first sample video, and obtaining a second sample inter-frame similarity matrix between the anchor sample video and the second sample video;
and training a basic convolutional network by adopting the first sample inter-frame similarity matrix and the second sample inter-frame similarity matrix to obtain a trained convolutional neural network.
12. The apparatus of claim 11, wherein the convolutional neural network generation module is further configured to perform:
respectively inputting the first sample inter-frame similarity matrix and the second sample inter-frame similarity matrix into the basic convolutional network for convolution processing to obtain a first sample video similarity matrix corresponding to the first sample inter-frame similarity matrix and a second sample video similarity matrix corresponding to the second sample inter-frame similarity matrix;
performing countermeasure calculation on the first sample video similarity matrix to obtain a first sample similarity between the anchor sample video and the first sample video, and performing countermeasure calculation on the second sample video similarity matrix to obtain a second sample similarity between the anchor sample video and the second sample video;
and determining network loss according to the first sample similarity and the second sample similarity, and performing parameter adjustment on the basic convolutional network according to the network loss to obtain the trained convolutional neural network.
13. The apparatus according to any of claims 9 to 12, wherein the inter-frame similarity matrix obtaining module is configured to perform:
for each target frame in the first video, acquiring a feature similarity between the feature data of the target frame and the feature data of each target frame in the second video, to obtain a row of elements or a column of elements corresponding to the target frame;
and generating a corresponding inter-frame similarity matrix according to each row of elements or each column of elements.
14. The apparatus according to any one of claims 9 to 12, wherein the feature data acquisition module is configured to perform:
identifying a video duration of each of the first video and the second video;
when the video duration of the first video is less than or equal to a set duration, extracting a target frame from the first video according to a first set frequency;
when the video duration of the second video is less than or equal to the set duration, extracting a target frame from the second video according to the first set frequency;
and performing feature recognition on the target frames to obtain feature data of each target frame.
15. The apparatus of claim 14, wherein the feature data acquisition module is further configured to perform:
when the video duration of the first video is longer than the set duration, extracting a target frame from a preset part in the first video by adopting a second set frequency, and extracting a target frame from the remaining part other than the preset part in the first video by adopting a third set frequency;
and when the video duration of the second video is longer than the set duration, extracting a target frame from a preset part in the second video by adopting the second set frequency, and extracting a target frame from the remaining part other than the preset part in the second video by adopting the third set frequency.
16. The apparatus according to claim 15, wherein the starting point of the preset part is a video starting point of the first video or a video starting point of the second video, and the second set frequency is greater than the third set frequency.
17. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the video similarity determination method according to any one of claims 1 to 8.
18. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the video similarity determination method of any one of claims 1 to 8.
19. A computer program product comprising instructions which, when executed by a processor of an electronic device, enable the electronic device to perform the video similarity determination method according to any one of claims 1 to 8.
CN202111552585.4A 2021-12-17 2021-12-17 Video similarity determination method and device, electronic equipment and storage medium Active CN114241223B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111552585.4A CN114241223B (en) 2021-12-17 2021-12-17 Video similarity determination method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114241223A CN114241223A (en) 2022-03-25
CN114241223B true CN114241223B (en) 2023-03-24

Family

ID=80758012

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111552585.4A Active CN114241223B (en) 2021-12-17 2021-12-17 Video similarity determination method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114241223B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111382616A (en) * 2018-12-28 2020-07-07 广州市百果园信息技术有限公司 Video classification method and device, storage medium and computer equipment
CN111666922A (en) * 2020-07-02 2020-09-15 上海眼控科技股份有限公司 Video matching method and device, computer equipment and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6751354B2 (en) * 1999-03-11 2004-06-15 Fuji Xerox Co., Ltd Methods and apparatuses for video segmentation, classification, and retrieval using image class statistical models
CN103747255A (en) * 2014-01-27 2014-04-23 深圳大学 Video tamper detection method and device based on airspace perceptual hashing
CN110321958B (en) * 2019-07-08 2022-03-08 北京字节跳动网络技术有限公司 Training method of neural network model and video similarity determination method
CN112584146B (en) * 2019-09-30 2021-09-28 复旦大学 Method and system for evaluating interframe similarity
CN112836600B (en) * 2021-01-19 2023-12-22 新华智云科技有限公司 Video similarity calculation method and system
CN113536939B (en) * 2021-06-18 2023-02-10 西安电子科技大学 Video duplication removing method based on 3D convolutional neural network
CN113486788A (en) * 2021-07-05 2021-10-08 联仁健康医疗大数据科技股份有限公司 Video similarity determination method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN114241223A (en) 2022-03-25

Similar Documents

Publication Publication Date Title
Wang et al. Detect globally, refine locally: A novel approach to saliency detection
Chin et al. Incremental kernel principal component analysis
CN110909651A (en) Video subject person identification method, device, equipment and readable storage medium
CN109271958B (en) Face age identification method and device
KR20230104738A (en) Temporal Bottleneck Attention Architecture for Video Action Recognition
KR102225579B1 (en) Method for semantic segmentation based on knowledge distillation with improved learning performance
CN111310800B (en) Image classification model generation method, device, computer equipment and storage medium
US20200320419A1 (en) Method and device of classification models construction and data prediction
US20240119707A1 (en) Method for training image generation model, method for generating images, and devices thereof
JP2023523029A (en) Image recognition model generation method, apparatus, computer equipment and storage medium
CN111368096A (en) Knowledge graph-based information analysis method, device, equipment and storage medium
CN111052128A (en) Descriptor learning method for detecting and locating objects in video
Li et al. Task relation networks
WO2023035904A9 (en) Video timing motion nomination generation method and system
Zhang et al. Robust facial landmark detection via heatmap-offset regression
Xia et al. Domain fingerprints for no-reference image quality assessment
CN114241223B (en) Video similarity determination method and device, electronic equipment and storage medium
EP3166022A1 (en) Method and apparatus for image search using sparsifying analysis operators
CN115690276A (en) Video generation method and device of virtual image, computer equipment and storage medium
CN111539263B (en) Video face recognition method based on aggregation countermeasure network
Escalante-B et al. Heuristic evaluation of expansions for non-linear hierarchical slow feature analysis
Rai et al. Improved attribute manipulation in the latent space of stylegan for semantic face editing
CN112101154A (en) Video classification method and device, computer equipment and storage medium
KR100612865B1 (en) Apparatus and method for view-robust face identification
Salan et al. Large pose invariant face recognition using feature-based recurrent neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant