CN113609316A - Method and device for detecting similarity of media contents - Google Patents

Method and device for detecting similarity of media contents

Info

Publication number
CN113609316A
CN113609316A (application CN202110850911.3A)
Authority
CN
China
Prior art keywords
media content
key frame
frame
similarity
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110850911.3A
Other languages
Chinese (zh)
Inventor
蒋晨
黄凯明
何思枫
杨旭东
张伟
张晓博
程远
徐富荣
王清
潘覃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Ant Blockchain Technology Shanghai Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Ant Blockchain Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd, Ant Blockchain Technology Shanghai Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202110850911.3A
Publication of CN113609316A
Legal status: Pending

Classifications

    • G — PHYSICS; G06 — COMPUTING; CALCULATING OR COUNTING; G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/483 — Information retrieval of multimedia data; retrieval characterised by using metadata automatically derived from the content
    • G06F16/45 — Information retrieval of multimedia data; Clustering; Classification
    • G06F18/22 — Pattern recognition; Analysing; Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of this specification provide a method and a device for detecting media content similarity, which can be applied to copyright protection in blockchain technology. The method comprises the following steps: sampling target media content to obtain base frame data; determining each key frame in the target media content according to the base frame data; determining reference media content; obtaining each key frame in the predetermined reference media content; for each key frame of the target media content, calculating the inter-frame similarity between that key frame and each key frame in the reference media content; and determining the overall similarity between the target media content and the reference media content according to the calculated inter-frame similarities. The method reduces the amount of computation and storage while improving detection accuracy.

Description

Method and device for detecting similarity of media contents
Technical Field
One or more embodiments of the present disclosure relate to network information technology, and in particular, to a method and apparatus for detecting similarity of media contents.
Background
With the development of network technology, media content of various kinds, such as video, audio, and text, spreads widely across networks. In many application scenarios, it is necessary to detect the similarity between two pieces of media content. For example, to protect the copyright of a movie, a target video propagated on a website needs to be compared with the movie video and their similarity detected, so as to determine whether the target video infringes.
Currently, the similarity between two pieces of media content is mainly detected based on all uniformly sampled frames. This approach involves a large amount of computation and reduces detection efficiency.
Disclosure of Invention
One or more embodiments of the present specification describe a method and an apparatus for detecting media content similarity, which can reduce the amount of computation and improve the detection efficiency.
According to a first aspect, a method for detecting media content similarity is provided, which includes:
sampling target media content to obtain basic frame data;
determining each key frame in the target media content according to the basic frame data;
determining reference media content;
obtaining each key frame in the predetermined reference media content;
for each key frame of the target media content, calculating the inter-frame similarity between that key frame and each key frame in the reference media content;
and determining the overall similarity between the target media content and the reference media content according to the calculated inter-frame similarities.
Wherein the determining the reference media content comprises:
obtaining at least two feature vectors corresponding to at least two frames of the target media content;
obtaining, from a media content database, a retrieval result of feature vectors similar to the at least two feature vectors of the target media content;
and determining, from the media content database and based on the retrieval result, reference media content similar to the target media content.
Determining each key frame in the target media content according to the basic frame data comprises:
converting each piece of base frame data into a two-dimensional thumbnail of a preset size;
splicing the converted thumbnails in the time-sequence order of the base frame data to obtain a two-dimensional mosaic;
inputting the two-dimensional mosaic into a pre-trained classification network;
and obtaining information of each key frame in the target media content according to the output of the classification network.
The information of each key frame in the target media content comprises: a first key frame confidence matrix, in which each vector value is 0 or 1; a vector value of 0 indicates that the frame at the time-sequence position corresponding to that vector is not a key frame, and a vector value of 1 indicates that it is a key frame.
The training method of the classification network comprises the following steps:
performing at least two rounds of training of the classification network using at least two pieces of sample media content, each round of training comprising: inputting a sample two-dimensional mosaic, formed by splicing all base frames of one piece of sample media content, into the classification network, so that the classification network outputs a second key frame confidence matrix; each vector value in the second key frame confidence matrix is a value between 0 and 1, and the higher the value of a vector, the higher the confidence that the frame at the corresponding time-sequence position is a key frame.
The training method of the classification network further comprises the following steps:
converting the two second key frame confidence matrices obtained for the first sample media content and the second sample media content into respective key frame confidence vectors;
pairwise matching and multiplying each key frame confidence value obtained from one second key frame confidence matrix with each key frame confidence value obtained from the other second key frame confidence matrix, to obtain a third key frame confidence matrix;
and adjusting the loss function of the classification network by using the third key frame confidence matrix and the similar frame positions between the first sample media content and the second sample media content output by the deep learning detection model.
After the pairwise matching multiplication and before obtaining the third key frame confidence matrix, the method further comprises: setting the vector values at every set number of positions in the primary matrix to 1, to obtain the third key frame confidence matrix.
The training method of the deep learning detection model comprises the following steps:
calculating the similarity between the feature vector of each frame of the first sample media content and the feature vector of each frame of the second sample media content to obtain a similarity matrix;
multiplying the third key frame confidence matrix by the similarity matrix to obtain a weighted similarity matrix;
and inputting the weighted similarity matrix into a deep learning detection model so as to train the deep learning detection model.
Wherein determining the overall similarity between the target media content and the reference media content according to the calculated inter-frame similarities comprises:
inputting the calculated inter-frame similarities into a pre-trained deep learning detection model to obtain the similar frame positions between the target media content and the reference media content output by the model, and determining the overall similarity between the target media content and the reference media content according to the similar frame positions.
According to a second aspect, there is provided an apparatus for detecting media content similarity, comprising:
the basic frame data acquisition module is configured to sample the target media content to obtain basic frame data;
a reference media content determination module configured to determine reference media content;
the key frame determining module is configured to determine each key frame in the target media content according to the basic frame data; obtaining each key frame in the predetermined reference media content;
the inter-frame similarity calculation module is configured to calculate the inter-frame similarity between each key frame of the target media content and each key frame of the reference media content;
and the overall similarity calculation module is configured to determine the overall similarity of the target media content and the reference media content according to the calculated similarity between the frames.
According to a third aspect, there is provided a computing device comprising a memory having stored therein executable code and a processor that, when executing the executable code, implements a method as described in any of the embodiments of the present specification.
The method and device for detecting media content similarity provided by the embodiments of this specification use key frames to calculate the similarity between two pieces of media content. Because a key frame is a data frame that determines the meaning of the media content, the key information in the media content is retained; and because key frames are not the dense data frames obtained directly after sampling, data redundancy is removed. The number of data frames used in the detection process is therefore greatly reduced, which lowers the amount of computation and storage, reduces implementation complexity, and improves detection efficiency.
Drawings
In order to more clearly illustrate the embodiments of the present specification or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present specification, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a flowchart of a method for detecting media content similarity in one embodiment of the present disclosure.
FIG. 2 is a flow chart of a method for determining the location of each first keyframe in target media content in one embodiment of the present disclosure.
FIG. 3 is a schematic diagram of a two-dimensional mosaic of targeted media content in one embodiment of the present description.
FIG. 4 is a diagram of a key frame confidence matrix A1 in one embodiment of the present description.
FIG. 5 is a flowchart of a method for jointly training a classification network and a deep learning detection model in one embodiment of the present disclosure.
FIG. 6 is a diagram of a key frame confidence matrix C in one embodiment of the present description.
FIG. 7 is a schematic diagram of a matrix D characterizing matching key frames of target media content and reference media content in one embodiment of the present description.
Fig. 8 is a schematic structural diagram of a device for detecting media content similarity in one embodiment of the present specification.
Detailed Description
As mentioned above, the prior art detects the similarity between two pieces of media content based on all uniformly sampled frames. For example, if the target media content is a video, the prior art samples the target video uniformly, typically once every second; for a 300-second video, 300 uniform frames are sampled. The reference video is uniformly sampled in the same way, for example also yielding 300 uniform frames, and whether the target video is similar to the reference video is then determined from the similarities between all uniform frames of the two videos.
It can be seen that when all uniformly sampled frames are used for detection, a large number of data frames are produced, because a frame is taken every short interval, and the subsequent similarity calculation is performed on all of them. As media content in networks grows massively (for example, a target video may need to be compared with one million reference videos) and media content becomes longer (for example, a very long target video may yield millions of uniformly sampled frames), using all uniform frames leads to an excessive number of data frames, which greatly increases implementation complexity and reduces detection efficiency.
To solve the problems in the prior art, an embodiment of this specification provides a method for detecting media content similarity. The method is executed by a device for detecting media content similarity; it is to be understood that the method may also be performed by any apparatus, device, platform, or device cluster having computing and processing capabilities. Referring to fig. 1, the method includes:
step 101: and sampling the target media content to obtain basic frame data.
Step 103: and determining each key frame in the target media content according to the basic frame data.
Step 105: reference media content is determined.
Step 107: individual key frames in the predetermined reference media content are obtained.
Step 109: for each key frame of the target media content, calculating the similarity between each key frame and each key frame in the reference media content.
Step 111: and determining the overall similarity of the target media content and the reference media content according to the calculated similarity between the frames.
For a piece of media content, the data frames that determine its meaning (for example, the frames containing the key actions in the movement or change of a person or object, referred to as key frames) do not occur every short interval. In a movie video, for instance, a key frame that determines the video content does not appear every second; over a stretch as long as 10 seconds the content may simply show a person reading. Consequently, a large number of redundant data frames exist among the uniform frames sampled every second, and for those 10 seconds of content only one frame, for example the first frame of the 10 seconds, actually needs to be sampled and used as a key frame in the subsequent similarity detection. In the process shown in fig. 1, it is the key frames that are used to calculate the similarity between two pieces of media content. Because a key frame is a data frame that determines the meaning of the media content, the key information is retained; and because key frames are not the dense data frames obtained directly after sampling, data redundancy is removed, the number of data frames used in detection is greatly reduced, implementation complexity is lowered, and detection efficiency is improved.
The following is a description of each step shown in fig. 1.
First, step 101: sampling the target media content to obtain base frame data.
In various embodiments of the present description, the media content may be content such as video, audio, text, pictures, etc. transmitted in a network.
For example, in step 101 the target media content is a video to be uploaded to a blockchain system; the video may be uniformly sampled once every second, yielding, say, 100 frames of base frame data.
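The embodiment does not prescribe any particular library for this sampling step. Purely as an illustrative sketch (the function name, the use of OpenCV, and the fallback frame rate are assumptions, not part of the embodiment), one-frame-per-second sampling could look like this:

import cv2

def sample_base_frames(video_path: str, interval_s: float = 1.0):
    # Uniformly sample one frame every `interval_s` seconds as base frame data.
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0   # assumed fallback if FPS metadata is missing
    step = max(int(round(fps * interval_s)), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:                   # keep one frame per sampling interval
            frames.append(frame)
        idx += 1
    cap.release()
    return frames                             # e.g. 100 base frames for a 100-second video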
Next, in step 103, each key frame in the target media content is determined according to the basic frame data.
For example, the processing in step 103 determines that, of the 100 frames of base frame data, the 1st, 4th, 6th, 7th, 11th, and 23rd frames (among others) are key frames.
In one embodiment of the present specification, a one-dimensional basic frame data sequence formed by each basic frame data may be converted into a two-dimensional mosaic, and the two-dimensional mosaic is used to determine which frame in the target media content is a key frame. In this case, the specific implementation process of step 103 can be seen in fig. 2, and includes:
step 201: each of the basic frame data is converted into a two-dimensional thumbnail of a predetermined size.
In step 201, each base frame may be converted into a two-dimensional small map of, for example, 32 × 32 pixels or 64 × 64 pixels.
Step 203: splicing the converted thumbnails in the time-sequence order of the base frame data to obtain a two-dimensional mosaic.
In step 203, an N × N mosaic may be formed. For example, referring to fig. 3, 100 pieces of base frame data are obtained in step 101; in step 203 these 100 pieces are spliced, in time-sequence order, into a 10 × 10 two-dimensional mosaic. The 10 thumbnails in the first row of the mosaic correspond in turn to the thumbnails converted from the 1st through 10th frames of base frame data, the 10 thumbnails in the second row correspond to the thumbnails converted from the 11th through 20th frames, and so on.
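By way of a purely illustrative sketch (the embodiment does not specify an implementation; the function name, the use of OpenCV and NumPy, and the row-major layout are assumptions), the thumbnail conversion of step 201 and the splicing of step 203 could be written as:

import math
import cv2
import numpy as np

def build_mosaic(base_frames, thumb=32):
    # Convert each base frame to a thumb x thumb thumbnail and splice the
    # thumbnails row by row, in time-sequence order, into an N x N mosaic.
    n = math.ceil(math.sqrt(len(base_frames)))             # e.g. 100 frames -> 10 x 10 grid
    canvas = np.zeros((n * thumb, n * thumb, 3), dtype=np.uint8)
    for i, frame in enumerate(base_frames):
        small = cv2.resize(frame, (thumb, thumb))           # 32 x 32 (or 64 x 64) thumbnail
        r, c = divmod(i, n)                                 # row-major: first row = frames 1..n
        canvas[r * thumb:(r + 1) * thumb, c * thumb:(c + 1) * thumb] = small
    return canvas

The returned canvas would then serve as the two-dimensional mosaic fed to the classification network in step 205.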
Step 205: inputting the two-dimensional mosaic into a pre-trained classification network.
Step 207: obtaining information of each key frame in the target media content according to the output of the classification network.
In one embodiment of the present description, because the two-dimensional mosaic is input into the classification network, the classification network may output a two-dimensional matrix of the same dimensions as the mosaic, denoted as key frame confidence matrix A1, and in step 207 the vectors in matrix A1 are used to characterize the key frames in the target media content.
In step 207, the vector value of each vector in the key frame confidence matrix A1 is 0 or 1, where 0 indicates that the frame at the corresponding time-sequence position is not a key frame of the target media content, and 1 indicates that it is a key frame. For example, referring to FIG. 4, in the key frame confidence matrix A1 the vectors X11, X14, X16, X17, X21, and X33 have a vector value of 1 and the rest are 0, which indicates that the frames at the corresponding time-sequence positions in the target media content, namely the 1st frame (corresponding to vector X11), the 4th frame (vector X14), the 6th frame (vector X16), the 7th frame (vector X17), the 11th frame (vector X21), and the 23rd frame (vector X33), are key frames.
In step 207, a key frame confidence threshold, for example 0.5, may be applied: vector values (i.e., key frame confidence values) in the key frame confidence matrix A1 smaller than 0.5 are set to 0, and those not smaller than 0.5 are set to 1.
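A minimal sketch of this thresholding, assuming the network's raw output is available as a NumPy array (the function name and the default threshold are illustrative):

import numpy as np

def binarize_confidence(conf_matrix: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    # Values below the confidence threshold become 0 (not a key frame),
    # the rest become 1 (key frame), producing the 0/1 matrix A1.
    return (conf_matrix >= threshold).astype(np.int32)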
Because a data frame has four adjacency relations (up, down, left, and right) in the two-dimensional mosaic, compared with only two (left and right) in a one-dimensional data sequence, the mosaic provides more adjacency information. When the classification network determines the key frames, it therefore has more information to compute with and can identify key frames more accurately.
In the flow shown in fig. 2, the two-dimensional mosaic and the classification network are used to change the problem of identifying the key frames of the target media content into a classification problem, which can improve the calculation efficiency.
The classification network used in fig. 2 is pre-trained. Referring to fig. 5, the training process of the classification network may include:
step 501: performing at least two rounds of training of a classification network using at least two sample media content, each round of training comprising: and inputting a sample two-dimensional splicing image formed by splicing all basic frames of a sample media content into a classification network, so that the classification network outputs a key frame confidence matrix B.
When the sample two-dimensional mosaic of the sample media content is input into the classification network, information on whether each frame in the sample media content is a key frame is also input into the classification network. In the training phase, the vector value in the key frame confidence matrix B output by the classification network is one value from 0 to 1, and the higher the value of a vector is, the higher the confidence of the key frame is in the frame position corresponding to the vector.
Here, the method for obtaining the sample two-dimensional mosaic of sample media content may refer to the method principle described in the above steps 201 to 203.
After performing multiple rounds of processing in step 501, i.e., training the classification network using multiple sample media contents, the initial training of the classification network is completed.
After the plurality of sample media contents are input into the classification network, a plurality of key frame confidence coefficient matrixes B corresponding to the plurality of sample media contents can be obtained. Such as a key frame confidence matrix B1 for sample media content 1, a key frame confidence matrix B2 for sample media content 2, and so on.
To further improve the training effect of the classification network, an end-to-end joint training mode can be adopted: the classification network is trained jointly with the subsequently used deep learning detection model (which computes the overall similarity of two pieces of media content from the inter-frame similarities), and the parameters of the classification network are further adjusted using the results produced by the detection model. Referring to fig. 5, in the joint training mode, after step 501 is performed, the following steps are further included:
step 503: in chronological order, both the two key frame confidence matrices B1 and B2 obtained for sample media content 1 and sample media content 2 are converted into one-dimensional respective key frame confidence vectors.
The vector values in the key frame confidence matrices B1 and B2 are all values between 0 and 1, and the higher the value of a vector, the higher the confidence that the frame at the corresponding position is a key frame.
Because the two media contents being compared are typically different in length, the resulting keyframe confidence matrices are also typically different in dimension, such as keyframe confidence matrix B1 being a 10 x 10 matrix and keyframe confidence matrix B2 being a 3 x 7 matrix. Therefore, in order to match the locations of the keyframes in the two sample media contents, the two keyframe confidence matrices need to be converted into two one-dimensional keyframe confidence vectors, so that the two keyframe confidence vectors can be multiplied by each other in step 505.
Step 505: pairwise matching and multiplying each key frame confidence value obtained from key frame confidence matrix B1 with each key frame confidence value obtained from key frame confidence matrix B2, to obtain a key frame confidence matrix C.
For example, key frame confidence matrix B1 is a 10 × 10 matrix and thus contains 100 key frame confidence values, and key frame confidence matrix B2 is a 3 × 7 matrix and thus contains 21 key frame confidence values; each of the 100 values is matched and multiplied pairwise with each of the 21 values. One way to implement this pairwise matching multiplication is: represent the 100 confidence values of matrix B1 as values on the X axis, represent the 21 confidence values of matrix B2 as values on the Y axis, and multiply, element-wise, every value on one axis with every value on the other, which yields a 100 × 21 matrix.
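A minimal sketch of this pairwise matching multiplication (the function name is illustrative; flattening in time-sequence order and using an outer product are one possible reading of the step described above):

import numpy as np

def pairwise_confidence(matrix_b1: np.ndarray, matrix_b2: np.ndarray) -> np.ndarray:
    # Flatten the two key frame confidence matrices (e.g. 10 x 10 and 3 x 7)
    # in time-sequence order and multiply every confidence of B1 with every
    # confidence of B2, yielding the primary matrix (e.g. 100 x 21).
    v1 = matrix_b1.reshape(-1)        # 100 confidence values on the "X axis"
    v2 = matrix_b2.reshape(-1)        # 21 confidence values on the "Y axis"
    return np.outer(v1, v2)           # product of every pair of values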
In an embodiment of this specification, in step 505 the primary matrix obtained by the pairwise matching multiplication may be used directly as the key frame confidence matrix C. For example, the key frame confidence matrix C can be seen in fig. 6 (only some vector values are shown in fig. 6; it is to be understood that each vector has a value between 0 and 1).
In another embodiment of this specification, in step 505 the primary matrix may not be used directly as the key frame confidence matrix C. After the primary matrix formed by the pairwise products is obtained, interpolation is performed in it, and the interpolated matrix is used as the key frame confidence matrix C. The reason is that the number of key frames in the two pieces of media content may be insufficient; if too few frames are used in the subsequent similarity comparison, the calculation rests on too little content to meet the accuracy requirement. Therefore, to further improve the training effect of the classification network and the deep learning detection model, the positions of the frames used for subsequent similarity detection can be increased, that is, interpolation can be performed. A preferred implementation is to add uniformly spaced sparse frames that participate in the subsequent detection during the training phase: after the primary matrix is obtained, the vector values at every set number of positions are set to 1, and the resulting matrix is used as the key frame confidence matrix C. For example, in the key frame confidence matrix C, starting from the first vector, every 10th vector is forced to 1 regardless of its current value. This ensures that the frames participating in the subsequent similarity detection include, in addition to the frames that are key frames of the two pieces of media content, frames selected every 10 seconds, enriching the number of frames involved. At the same time, because not all uniform frames are used, detection efficiency is still improved.
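One possible reading of this interpolation step, sketched for illustration only (forcing every 10th position along each content's frame axis; the function name, the spacing, and the choice of axes are assumptions):

import numpy as np

def inject_uniform_frames(primary: np.ndarray, every: int = 10) -> np.ndarray:
    # Starting from the first position, force the confidence at every
    # `every`-th frame of each sample media content to 1, so that uniformly
    # spaced sparse frames also take part in the subsequent similarity
    # detection during training; the result is used as matrix C.
    matrix_c = primary.astype(np.float32)
    matrix_c[::every, :] = 1.0        # every 10th frame of the first sample content
    matrix_c[:, ::every] = 1.0        # every 10th frame of the second sample content
    return matrix_c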
It can be seen that in step 505 each vector value of the key frame confidence matrix C (e.g., the 100 × 21 matrix) is also a value between 0 and 1; for a pair of matched frames, it represents the confidence that both frames are simultaneously key frames of their respective sample media contents.
Step 507: calculating the similarity between the feature vector of each frame of sample media content 1 and the feature vector of each frame of sample media content 2 to obtain a similarity matrix.
In this step 507, the similarity of each frame of content in the sample media content 1 with respect to each frame of content in the sample media content 2 is obtained. Still taking the sample media content 1 as 100 frames and the sample media content 2 as 21 frames as an example, the obtained similarity matrix is a matrix of 100 × 21.
Step 509: multiplying the key frame confidence matrix C by the similarity matrix to obtain a weighted similarity matrix.
Each vector value in the key frame confidence matrix C represents the confidence (a value between 0 and 1) that a frame of sample media content 1 and a frame of sample media content 2 are simultaneously key frames, while the similarity matrix represents how similar the content of each frame of sample media content 1 is to each frame of sample media content 2 (for example, brightness may be used to display the similarity: the brighter, the more similar; the darker, the less similar). Multiplying the two matrices weights the similarities so that the similarity of the two pieces of sample media content at positions where matching frames are both key frames stands out more significantly.
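Steps 507 and 509 could be sketched as follows (cosine similarity is assumed as the inter-frame similarity measure, which the embodiment does not prescribe; the function name and the shapes are illustrative):

import numpy as np

def weighted_similarity(feat1: np.ndarray, feat2: np.ndarray,
                        matrix_c: np.ndarray) -> np.ndarray:
    # feat1: (100, d) frame feature vectors of sample media content 1,
    # feat2: (21, d) frame feature vectors of sample media content 2.
    # Cosine similarity of every frame pair gives the 100 x 21 similarity
    # matrix (step 507); multiplying it element-wise by the key frame
    # confidence matrix C gives the weighted similarity matrix (step 509).
    f1 = feat1 / (np.linalg.norm(feat1, axis=1, keepdims=True) + 1e-8)
    f2 = feat2 / (np.linalg.norm(feat2, axis=1, keepdims=True) + 1e-8)
    similarity = f1 @ f2.T
    return matrix_c * similarity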
Step 511: inputting the weighted similarity matrix into a deep learning detection model so as to train the deep learning detection model.
Here, when the weighted similarity matrix is input to the deep learning detection model, the positions of the similar frames between the two pieces of sample media content are also input, so as to train the model.
The output of the deep learning detection model is the positions of similar frames between the two pieces of sample media content. For example, frames 1 to 3 of sample media content 1 and frames 5 to 6 of sample media content 2 are similar frames: their contents match and may constitute infringing segments.
Step 513: adjusting the loss function of the classification network by using the similar frame positions output by the deep learning detection model and the key frame confidence matrix C.
So far, the related process of determining each key frame in the target media content in step 103 is described.
Next, in step 105, reference media content is determined.
The target media content is the media content to be determined for infringement, and the reference media content is the media content for comparison. Since a huge amount of media content is stored in the media database, it is necessary to determine the reference media content at risk of infringement from the media database. In an embodiment of the present specification, the implementation process of this step 105 includes:
step 1051: and obtaining at least two feature vectors corresponding to at least two frames of the target media content.
Step 1053: and acquiring a retrieval result of the feature vectors similar to the at least two feature vectors of the target media content from a media content database.
In step 1053, the media content database contains one or more feature vectors for each piece of media content, and the top few feature vectors that match the at least two feature vectors of the target media content are taken as the retrieval result.
Step 1055: determining, from the media content database and based on the retrieval result, reference media content similar to the target media content.
For example, for each of the several feature vectors of the target video, the top k matching feature vectors may be retrieved from the video database, and the m reference videos to which those top-k feature vectors belong are then determined, where 1 ≤ m ≤ k. When m equals k, the k feature vectors come from k different reference videos; when m equals 1, the k feature vectors come from the same reference video. Alternatively, for each of the several feature vectors of the target video, only the single best-matching feature vector may be retrieved from the video database, and the reference video to which that best-matching feature vector belongs is determined.
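A brute-force sketch of this retrieval, for illustration only (a real system would typically use an approximate nearest-neighbour index; the function name, the dot-product score, and the id mapping are assumptions):

import numpy as np

def find_reference_media(query_vecs: np.ndarray, db_vecs: np.ndarray,
                         db_media_ids: list, k: int = 5) -> set:
    # query_vecs: (q, d) feature vectors of at least two frames of the target
    # media content; db_vecs: (n, d) feature vectors stored in the media
    # content database; db_media_ids[i] names the media content that
    # db_vecs[i] belongs to. Returns the reference media contents whose
    # feature vectors best match the query vectors.
    scores = query_vecs @ db_vecs.T                 # similarity of every query/database pair
    top_k = np.argsort(-scores, axis=1)[:, :k]      # indices of the k best matches per query
    return {db_media_ids[j] for row in top_k for j in row}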
Next, in step 107, the respective key frames in the predetermined reference media content are obtained.
In step 107, the key frames in the reference media content may have been labeled manually in advance, for example in the form of a key frame confidence matrix. Alternatively, the key frames in the reference media content may be obtained with the method of steps 201 to 207 above: the two-dimensional mosaic formed from the base frames of the reference media content is input into the pre-trained classification network, and the key frame information characterized by the key frame confidence matrix A2 output by the network is obtained. The vector value of each vector in the key frame confidence matrix A2 is 0 or 1, where 0 indicates that the frame at the corresponding time-sequence position is not a key frame of the reference media content and 1 indicates that it is.
Next, in step 109, for each key frame of the target media content, the inter-frame similarity between the key frame and the respective key frame in the reference media content is calculated.
The processing of step 109 may be implemented using the key frame confidence matrix A1 (characterizing the key frames in the target media content) obtained in the description of step 207 above and the key frame confidence matrix A2 (characterizing the key frames in the reference media content) obtained in the description of step 107.
By way of example:
one implementation of step 109 includes: the key frame confidence matrix A1 is converted into a one-dimensional key frame confidence vector 1, represented, for example, as {1, 0, 0, 1, 0, 1, 1 … }, which comprises a total of 100 vectors. The key frame confidence matrix A2 is converted into a one-dimensional key frame confidence vector 2, represented, for example, as {1, 1, 0, 0, 0, 1, 0 … }, which comprises a total of 50 vectors. And correspondingly multiplying the elements in the two vectors one by one to obtain a matrix D of 100 x 50, and calculating the similarity between the target media content and the reference media content at two frame positions corresponding to the vector with the vector value of 1 in the matrix D. For example, the matrix D is shown in FIG. 7 (only the vector values of some of the vectors are shown in FIG. 7, it being understood thatEach vector having a vector value of 0 or 1, vector X11Is 1, then the inter-frame similarity between the target media content on the 1 st frame and the reference media content on the 1 st frame is calculated, vector X12Is 1, then the inter-frame similarity between the target media content on the 1 st frame and the reference media content on the 2 nd frame is calculated, vector X13、X14、X15If the values of the values are all 0, the inter-frame similarity between the target media content on the 1 st frame and the reference media content on the 3 rd, 4 th and 5 th frames, etc. does not need to be calculated, if the vector values of the second row in the matrix D are all 0, the inter-frame similarity between the target media content on the 2 nd frame and all the frames of the reference media content does not need to be calculated, and so on.
Next, in step 111, the overall similarity between the target media content and the reference media content is determined according to the calculated inter-frame similarity.
In this step 111, the calculated inter-frame similarity is input into a pre-trained deep learning detection model, so as to obtain a similar frame position between the target media content and the reference media content output by the deep learning detection model, and the overall similarity between the target media content and the reference media content can be determined according to the similar frame position.
In one implementation example, matrix D (whose vector values are 0 or 1) may be multiplied by the matrix E representing the inter-frame similarities obtained in step 109 (where, for example, bright spots indicate similarity between two frames). If the pattern formed by the bright spots in the resulting weighted similarity matrix (for example, several lines) resembles the pattern formed by the vectors with value 1 in the key frame confidence matrix A1 of the target media content, with similar positions and slopes, it can be determined that the target media content is similar overall to the reference media content and constitutes infringing content with respect to the reference media content.
In addition, in the weighted similarity matrix obtained by multiplying the two matrices, the overall similarity between the target media content and the reference media content is displayed as a matrix or two-dimensional graph, so several similar segments (for example, several line segments with similar positions and slopes) can be identified at once, and thus several infringing segments can be determined at once.
In the embodiments of this specification, the method and device for detecting media content similarity may be applied to blockchain technology. For example, after the detection is performed, if the overall similarity between the target media content and every reference media content in the media library is found to be low, the target media content is considered not to be infringing content and may be uploaded to a blockchain system, thereby implementing copyright protection with blockchain technology.
In an embodiment of the present specification, a device for detecting media content similarity is further provided, and referring to fig. 8, the device 800 includes:
a basic frame data obtaining module 801 configured to sample target media content to obtain basic frame data;
a reference media content determination module 802 configured to determine reference media content;
a key frame determining module 803, configured to determine each key frame in the target media content according to the basic frame data; obtaining each key frame in the predetermined reference media content;
an inter-frame similarity calculation module 804 configured to calculate, for each key frame of the target media content, inter-frame similarities between the key frame and each key frame in the reference media content;
the overall similarity calculation module 805 is configured to determine the overall similarity between the target media content and the reference media content according to the calculated inter-frame similarity.
In one embodiment of the apparatus of the present specification, the reference media content determination module 802 is configured to perform:
obtaining at least two feature vectors corresponding to at least two frames of the target media content;
obtaining, from a media content database, a retrieval result of feature vectors similar to the at least two feature vectors of the target media content;
and determining, from the media content database and based on the retrieval result, reference media content similar to the target media content.
In one embodiment of the apparatus of the present specification, the key frame determination module 803 is configured to perform:
converting each piece of base frame data into a two-dimensional thumbnail of a preset size;
splicing the converted thumbnails in the time-sequence order of the base frame data to obtain a two-dimensional mosaic;
inputting the two-dimensional mosaic into a pre-trained classification network;
and obtaining information of each key frame in the target media content according to the output of the classification network.
In one embodiment of the apparatus of the present specification, the information of each key frame in the target media content comprises: a first key frame confidence matrix, in which each vector value is 0 or 1; a vector value of 0 indicates that the frame at the time-sequence position corresponding to that vector is not a key frame, and a vector value of 1 indicates that it is a key frame.
In one embodiment of the apparatus of the present specification, the apparatus further comprises a classification network training module configured to perform: performing at least two rounds of training of the classification network using at least two pieces of sample media content, each round of training comprising: inputting a sample two-dimensional mosaic, formed by splicing all base frames of one piece of sample media content, into the classification network, so that the classification network outputs a second key frame confidence matrix; each vector value in the second key frame confidence matrix is a value between 0 and 1, and the higher the value of a vector, the higher the confidence that the frame at the corresponding time-sequence position is a key frame.
In one embodiment of the apparatus of the present specification, the classification network training module is further configured to: convert the two second key frame confidence matrices obtained for the first sample media content and the second sample media content into respective key frame confidence vectors; pairwise match and multiply each key frame confidence value obtained from one second key frame confidence matrix with each key frame confidence value obtained from the other second key frame confidence matrix, to obtain a third key frame confidence matrix; and adjust the loss function of the classification network by using the third key frame confidence matrix and the similar frame positions between the first sample media content and the second sample media content output by the deep learning detection model.
In an embodiment of the apparatus of the present specification, after performing the pairwise matching multiplication and before obtaining the third key frame confidence matrix, the classification network training module further: sets the vector values at every set number of positions in the primary matrix to 1, to obtain the third key frame confidence matrix.
In one embodiment of the apparatus of the present specification, the apparatus further comprises a deep learning detection model training module configured to perform:
calculating the similarity between the feature vector of each frame of the first sample media content and the feature vector of each frame of the second sample media content to obtain a similarity matrix;
multiplying the third key frame confidence matrix by the similarity matrix to obtain a weighted similarity matrix;
and inputting the weighted similarity matrix into a deep learning detection model so as to train the deep learning detection model.
In one embodiment of the apparatus of the present specification, the overall similarity calculation module 805 is configured to perform: inputting the calculated inter-frame similarities into a pre-trained deep learning detection model to obtain the similar frame positions between the target media content and the reference media content output by the model, and determining the overall similarity between the target media content and the reference media content according to the similar frame positions.
An embodiment of the present specification provides a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of any of the embodiments of the specification.
One embodiment of the present specification provides a computing device comprising a memory and a processor, the memory having stored therein executable code, the processor implementing a method in accordance with any one of the embodiments of the specification when executing the executable code.
It is to be understood that the illustrated construction of the embodiments of the present disclosure is not to be construed as limiting the apparatus of the present disclosure in any way. In other embodiments of the description, an apparatus may include more or fewer components than illustrated, or some components may be combined, some components may be split, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
For the information interaction, execution process and other contents between the modules in the above-mentioned apparatus and system, because the same concept is based on the embodiment of the method in this specification, specific contents may refer to the description in the embodiment of the method in this specification, and are not described herein again.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this disclosure may be implemented in hardware, software, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The above-mentioned embodiments, objects, technical solutions and advantages of the present invention are further described in detail, it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made on the basis of the technical solutions of the present invention should be included in the scope of the present invention.

Claims (11)

1. A method for detecting media content similarity, comprising the following steps:
sampling target media content to obtain basic frame data;
determining each key frame in the target media content according to the basic frame data;
determining reference media content;
obtaining each key frame in the predetermined reference media content;
for each key frame of the target media content, calculating the inter-frame similarity between that key frame and each key frame in the reference media content;
and determining the overall similarity between the target media content and the reference media content according to the calculated inter-frame similarities.
2. The method of claim 1, wherein the determining reference media content comprises:
obtaining at least two feature vectors corresponding to at least two frames of the target media content;
obtaining, from a media content database, a retrieval result of feature vectors similar to the at least two feature vectors of the target media content;
and determining, from the media content database and based on the retrieval result, reference media content similar to the target media content.
3. The method of claim 1, the determining respective key frames in targeted media content from the base frame data, comprising:
converting each piece of base frame data into a two-dimensional thumbnail of a preset size;
splicing the converted thumbnails in the time-sequence order of the base frame data to obtain a two-dimensional mosaic;
inputting the two-dimensional mosaic into a pre-trained classification network;
and obtaining information of each key frame in the target media content according to the output of the classification network.
4. The method of claim 3, wherein the information of each key frame in the target media content comprises: a first key frame confidence matrix, in which each vector value is 0 or 1; a vector value of 0 indicates that the frame at the time-sequence position corresponding to that vector is not a key frame, and a vector value of 1 indicates that it is a key frame.
5. The method of claim 4, wherein the training method of the classification network comprises:
performing at least two rounds of training of the classification network using at least two pieces of sample media content, each round of training comprising: inputting a sample two-dimensional mosaic, formed by splicing all base frames of one piece of sample media content, into the classification network, so that the classification network outputs a second key frame confidence matrix; each vector value in the second key frame confidence matrix is a value between 0 and 1, and the higher the value of a vector, the higher the confidence that the frame at the corresponding time-sequence position is a key frame.
6. The method of claim 5, wherein the training method of the classification network further comprises:
converting the two second key frame confidence matrices obtained for the first sample media content and the second sample media content into respective key frame confidence vectors;
pairwise matching and multiplying each key frame confidence value obtained from one second key frame confidence matrix with each key frame confidence value obtained from the other second key frame confidence matrix, to obtain a third key frame confidence matrix;
and adjusting the loss function of the classification network by using the third key frame confidence matrix and the similar frame positions between the first sample media content and the second sample media content output by the deep learning detection model.
7. The method of claim 6, wherein, after the pairwise matching multiplication and before obtaining the third key frame confidence matrix, the method further comprises: setting the vector values at every set number of positions in the primary matrix to 1, to obtain the third key frame confidence matrix.
8. The method of claim 6, wherein the training method of the deep learning detection model comprises:
calculating the similarity between the feature vector of each frame of the first sample media content and the feature vector of each frame of the second sample media content to obtain a similarity matrix;
multiplying the third key frame confidence matrix by the similarity matrix to obtain a weighted similarity matrix;
and inputting the weighted similarity matrix into a deep learning detection model so as to train the deep learning detection model.
9. The method of claim 1, wherein determining the overall similarity between the target media content and the reference media content according to the calculated inter-frame similarities comprises:
inputting the calculated inter-frame similarities into a pre-trained deep learning detection model to obtain the similar frame positions between the target media content and the reference media content output by the model, and determining the overall similarity between the target media content and the reference media content according to the similar frame positions.
10. A device for detecting media content similarity, comprising:
the basic frame data acquisition module is configured to sample the target media content to obtain basic frame data;
a reference media content determination module configured to determine reference media content;
the key frame determining module is configured to determine each key frame in the target media content according to the basic frame data; obtaining each key frame in the predetermined reference media content;
the inter-frame similarity calculation module is configured to calculate the inter-frame similarity between each key frame of the target media content and each key frame of the reference media content;
and the overall similarity calculation module is configured to determine the overall similarity of the target media content and the reference media content according to the calculated similarity between the frames.
11. A computing device comprising a memory having executable code stored therein and a processor that, when executing the executable code, implements the method of any of claims 1-9.
CN202110850911.3A 2021-07-27 2021-07-27 Method and device for detecting similarity of media contents Pending CN113609316A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110850911.3A CN113609316A (en) 2021-07-27 2021-07-27 Method and device for detecting similarity of media contents

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110850911.3A CN113609316A (en) 2021-07-27 2021-07-27 Method and device for detecting similarity of media contents

Publications (1)

Publication Number Publication Date
CN113609316A 2021-11-05

Family

ID=78305555

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110850911.3A Pending CN113609316A (en) 2021-07-27 2021-07-27 Method and device for detecting similarity of media contents

Country Status (1)

Country Link
CN (1) CN113609316A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5274714A (en) * 1990-06-04 1993-12-28 Neuristics, Inc. Method and apparatus for determining and organizing feature vectors for neural network recognition
US9036943B1 (en) * 2013-03-14 2015-05-19 Amazon Technologies, Inc. Cloud-based image improvement
WO2019134516A1 (en) * 2018-01-05 2019-07-11 Oppo广东移动通信有限公司 Method and device for generating panoramic image, storage medium, and electronic apparatus
CN110570318A (en) * 2019-04-18 2019-12-13 阿里巴巴集团控股有限公司 Vehicle loss assessment method and device executed by computer and based on video stream
CN110750680A (en) * 2019-10-22 2020-02-04 国网新疆电力有限公司信息通信公司 Video scene classification method based on multiple features
CN111241345A (en) * 2020-02-18 2020-06-05 腾讯科技(深圳)有限公司 Video retrieval method and device, electronic equipment and storage medium
CN111738173A (en) * 2020-06-24 2020-10-02 北京奇艺世纪科技有限公司 Video clip detection method and device, electronic equipment and storage medium
CN111899284A (en) * 2020-08-14 2020-11-06 北京交通大学 Plane target tracking method based on parameterized ESM network
WO2021007846A1 (en) * 2019-07-18 2021-01-21 华为技术有限公司 Method, apparatus and device for video similarity detection
WO2021051885A1 (en) * 2019-09-20 2021-03-25 创新先进技术有限公司 Target labeling method and apparatus

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
余家林; 孙季丰; 宋治国: "Full-resolution video stabilization based on optimally selected feature trajectories" (基于优选特征轨迹的全分辨率视频稳定), 电子与信息学报 (Journal of Electronics & Information Technology), no. 05, 15 May 2015 (2015-05-15) *
杨云涛; 冯莹; 曹毓; 陈运锦: "Fast stitching method for sequence images from a vehicle-mounted camera platform" (车载摄像平台序列图像快速拼接方法), 应用光学 (Journal of Applied Optics), no. 04, 15 July 2011 (2011-07-15) *

Similar Documents

Publication Publication Date Title
CN109165573B (en) Method and device for extracting video feature vector
WO2019242222A1 (en) Method and device for use in generating information
CN109063611B (en) Face recognition result processing method and device based on video semantics
CN110942006B (en) Motion gesture recognition method, motion gesture recognition apparatus, terminal device, and medium
CN109871490B (en) Media resource matching method and device, storage medium and computer equipment
CN113542777B (en) Live video editing method and device and computer equipment
CN109618236B (en) Video comment processing method and device
CN110147469B (en) Data processing method, device and storage medium
CN110347866B (en) Information processing method, information processing device, storage medium and electronic equipment
US20220172476A1 (en) Video similarity detection method, apparatus, and device
CN110049309B (en) Method and device for detecting stability of image frame in video stream
CN111753673A (en) Video data detection method and device
CN111783712A (en) Video processing method, device, equipment and medium
CN110210480B (en) Character recognition method and device, electronic equipment and computer readable storage medium
CN113515997B (en) Video data processing method and device and readable storage medium
KR102546631B1 (en) Apparatus for video data argumentation and method for the same
KR102187741B1 (en) Metadata crowd sourcing system and method
CN112597824A (en) Behavior recognition method and device, electronic equipment and storage medium
CN116958267B (en) Pose processing method and device, electronic equipment and storage medium
CN112488072A (en) Method, system and equipment for acquiring face sample set
CN101311965A (en) Photographic subject tracking method, computer program product and photographic subject tracking device
CN113609316A (en) Method and device for detecting similarity of media contents
CN110489592B (en) Video classification method, apparatus, computer device and storage medium
Cohendet et al. MediaEval 2018: Predicting media memorability
Tran et al. Predicting Media Memorability Using Deep Features with Attention and Recurrent Network.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination