CN112257595A - Video matching method, device, equipment and storage medium


Info

Publication number: CN112257595A
Authority: CN (China)
Prior art keywords: video, video frame, current, similarity, image
Legal status: Pending
Application number: CN202011139600.8A
Other languages: Chinese (zh)
Inventors: 邹昱 (Zou Yu), 刘振强 (Liu Zhenqiang)
Current Assignee: Guangzhou Baiguoyuan Information Technology Co Ltd
Original Assignee: Guangzhou Baiguoyuan Information Technology Co Ltd
Application filed by Guangzhou Baiguoyuan Information Technology Co Ltd
Priority to CN202011139600.8A
Publication of CN112257595A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/48 Matching video sequences

Abstract

The embodiment of the invention discloses a video matching method, a video matching device, video matching equipment and a storage medium. The method comprises the following steps: for each first video frame in a first video frame set, judging whether a second video frame matching the current first video frame exists in a second video frame set, and if so, marking the current first video frame as a target video frame; counting a first total number of target video frames in the first video frame set; calculating the video similarity of the first video and the second video based on an intersection-over-union (IOU) algorithm and the first total number; and when the video similarity meets a first preset requirement, determining that the first video and the second video are successfully matched. The technical scheme provided by the embodiment of the invention can enhance the robustness of the video similarity algorithm to the ordering of the video frame sequence and improve the accuracy of video matching.

Description

Video matching method, device, equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of computers, in particular to a video matching method, a video matching device, video matching equipment and a storage medium.
Background
With the continuous development of multimedia information technology, video information is emerging in large quantities. Video has become an important information carrier in real life as an integrated medium for expressing information.
In many application scenarios there is a video matching requirement, that is, a need to determine whether two videos are similar. For example, short videos have become popular in recent years, attracting a large number of users to watch and stay. Generally, users of short video Applications (APPs) upload videos in two ways: one is original creation, and the other is reposting (carrying) the original videos of others. Video reposting has at least the following negative effects on short video APPs: (1) duplicate videos proliferate, so users may be served the same video repeatedly; (2) it dampens the enthusiasm of original authors; (3) reposting content from other APPs may involve infringement. For these reasons, similarity identification based on short video content becomes exceptionally important.
At present, in an existing video matching scheme, the two videos to be matched are generally input into a video-based neural network, which outputs a video feature of a certain dimensionality for each video; the two video features are then compared to obtain a video matching result. However, the matching result of this scheme is not accurate enough and needs improvement.
Disclosure of Invention
The embodiment of the invention provides a video matching method, a video matching device, video matching equipment and a storage medium, and can optimize the existing video matching scheme.
In a first aspect, an embodiment of the present invention provides a video matching method, where the method includes:
for each first video frame in the first video frame set, judging whether a second video frame matched with the current first video frame exists in the second video frame set, and if so, recording the current first video frame as a target video frame; wherein all first video frames in the first set of video frames are derived from a first video and all second video frames in the second set of video frames are derived from a second video;
counting a first total number of target video frames in the first video frame set;
calculating the video similarity of the first video and the second video based on an intersection-over-union (IOU) algorithm and the first total number;
and when the video similarity meets a first preset requirement, determining that the first video and the second video are successfully matched.
In a second aspect, an embodiment of the present invention provides a video matching apparatus, including:
the target video frame judging module is used for judging whether a second video frame matched with the current first video frame exists in the second video frame set or not for each first video frame in the first video frame set, and if so, marking the current first video frame as a target video frame; wherein all first video frames in the first set of video frames are derived from a first video and all second video frames in the second set of video frames are derived from a second video;
the target video frame counting module is used for counting a first total number of target video frames in the first video frame set;
the video similarity calculation module is used for calculating the video similarity of the first video and the second video based on an intersection-over-union (IOU) algorithm and the first total number;
and the video matching module is used for determining that the first video and the second video are successfully matched when the video similarity meets a first preset requirement.
In a third aspect, an embodiment of the present invention provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor executes the computer program to implement the video matching method according to the embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements a video matching method as provided by an embodiment of the present invention.
According to the video matching scheme provided by the embodiment of the invention, for each first video frame in the first video frame set, it is judged whether a second video frame matching the current first video frame exists in the second video frame set; if so, the current first video frame is marked as a target video frame, wherein all the first video frames in the first video frame set come from the first video and all the second video frames in the second video frame set come from the second video. The first total number of target video frames in the first video frame set is counted, the video similarity of the first video and the second video is calculated based on an intersection-over-union algorithm and the first total number, and when the video similarity meets a first preset requirement, it is determined that the first video and the second video are successfully matched. By adopting this technical scheme, the intersection-over-union algorithm originally used to evaluate object detection accuracy is applied to the calculation of video similarity, which can enhance the robustness of the video similarity algorithm to the ordering of the video frame sequence; when the order of the video frames in the video to be matched is changed, for example by reverse playback or carousel, the similarity can still be calculated accurately, thereby improving the accuracy of video matching.
Drawings
Fig. 1 is a schematic flowchart of a video matching method according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of another video matching method according to an embodiment of the present invention;
fig. 3 is a schematic flowchart of another video matching method according to an embodiment of the present invention;
fig. 4 is a block diagram of a video matching apparatus according to an embodiment of the present invention;
fig. 5 is a block diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures. In addition, the embodiments and features of the embodiments in the present invention may be combined with each other without conflict.
The video matching method provided by the embodiment of the invention can be suitable for various application scenes with video matching requirements, such as video duplicate removal, video retrieval, video carrying or stealing screening and the like.
Fig. 1 is a flowchart of a video matching method according to an embodiment of the present invention, which may be executed by a video matching apparatus, where the apparatus may be implemented by software and/or hardware, and may be generally integrated in a computer device. As shown in fig. 1, the method includes:
step 101, for each first video frame in the first video frame set, judging whether a second video frame matched with the current first video frame exists in the second video frame set, and if so, recording the current first video frame as a target video frame.
Wherein all first video frames in the first set of video frames are derived from a first video and all second video frames in the second set of video frames are derived from a second video.
For example, when video matching is required, generally, one video is used as a reference to determine whether other videos are similar to the one video, where the video used as the reference may be referred to as a reference video, and may be understood as a first video in the embodiment of the present invention, the other videos may be referred to as alternative videos, and each alternative video may be referred to as a second video in the embodiment of the present invention. When video matching is performed, alternative videos can be used as second videos one by one to be respectively matched with a first video to obtain corresponding matching results. The embodiment of the invention does not limit the source, type, format, duration, frame rate and the like of the first video and the second video.
Optionally, before performing this step, the method may further include acquiring the first video frame set and the second video frame set, and an acquisition manner is not limited. For example, the first video and the second video may be decoded separately and then extracted, and the first video frame extraction manner for the first video and the second video frame extraction manner for the second video may be the same or different.
For example, a first frequency is adopted to extract video frames in a first video, the extracted video frames are called first video frames, and the first video frames are added into a first video frame set; and extracting the video frames in the second video by adopting the second frequency, and adding the extracted video frames into the second video frame set. The first frequency and the second frequency may be the same or different. The first frequency and the second frequency may both be fixed values set in advance, such as 1 frame per second; or may be a fixed value determined by referring to other factors, such as video duration and the like; it may also be a dynamically changing value determined with reference to other factors, such as whether the video content contains characters or actions. The number of first video frames in the first set of video frames and the number of second video frames in the second set of video frames may or may not be equal.
In this step, whether a second video frame matched with the current first video frame exists in the second video frame set is sequentially judged for each first video frame in the first video frame set. For example, the first set of video frames includes 3 first video frames, respectively denoted as A, B and C, and the second set of video frames includes 4 second video frames, respectively denoted as a, b, C, and d. Firstly, aiming at A, judging whether at least one of a, b, c and d is matched with A, if so, marking A as a target video frame; then, aiming at B, judging whether at least one of a, B, c and d is matched with B, if so, marking B as a target video frame; and finally, judging whether at least one of a, b, C and d is matched with C or not aiming at C, and if so, marking C as a target video frame.
For example, when determining whether a second video frame matches the current first video frame, the specific determination method is not limited, for example, the similarity between the two video frames may be calculated in a certain manner, and if the similarity is greater than a certain threshold, the two video frames may be considered to match.
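As a minimal sketch of this per-frame judgment (which also folds in the counting of step 102 below), assuming a frame-level `image_similarity` callable and a matching threshold, both placeholders for whatever measure an implementation chooses; the embodiments below use CNN features and a first preset threshold:

```python
def count_target_frames(first_frames, second_frames, image_similarity, threshold):
    """Return the first total number: how many first video frames have at
    least one matching second video frame."""
    first_total = 0
    for f in first_frames:
        # A first frame is recorded as a target video frame as soon as any
        # second frame is similar enough to it.
        if any(image_similarity(f, s) >= threshold for s in second_frames):
            first_total += 1
    return first_total
```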
And 102, counting a first total number of target video frames in the first video frame set.
For example, the first total number may be updated while determining the target video frame by using a counter or the like, that is, the first total number may be initialized to 0, and each time a target video frame is determined, the first total number is added by 1 to obtain a final first total number; for another example, an empty target set may be initialized, and when a target video frame is determined to be obtained, the target video frame is added to the target set, and finally the size of the target set is used as the first total number.
And 103, calculating the video similarity of the first video and the second video based on the cross-over ratio algorithm and the first total number.
Intersection over Union (IOU) may be understood as a criterion for measuring the accuracy of detecting a corresponding object in a particular data set; this criterion is often used to measure the correlation between the real and the predicted result, and the higher the correlation, the higher the value. The IOU algorithm is commonly used to calculate the correlation between a real image and a predicted image, and the IOU is generally calculated as the area of the overlapping region of the two images divided by the total area the two images cover together. For example, let the area of the overlap region (Area of Overlap) be O, the area of the real image be S1, and the area of the predicted image be S2; the total area after overlapping (Area of Union) is U = S1 + S2 - O, so IOU = O / (S1 + S2 - O).
Illustratively, the first total number may serve as the numerator in the IOU algorithm, i.e., the O described above. There may be a variety of ways to calculate the denominator in the IOU algorithm. For example, all video frames contained in the first video may be taken as the S1 described above and all video frames contained in the second video as the S2, in which case the denominator is the sum of the total number of video frames in the first video and the total number of video frames in the second video, minus the first total number. For another example, the total number of first video frames may be taken as S1 and the total number of second video frames as S2, in which case the denominator is the sum of the total number of first video frames and the total number of second video frames, minus the first total number. Of course, other calculation manners are also possible, and the embodiment of the present invention is not particularly limited.
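As an illustration of the second denominator variant above (frame-set sizes as S1 and S2), a minimal sketch of the similarity computation, not a mandated implementation:

```python
def iou_video_similarity(first_total, num_first_frames, num_second_frames):
    """IOU-style similarity: matched frames (O) over the union of the two
    frame sets (S1 + S2 - O)."""
    union = num_first_frames + num_second_frames - first_total
    return first_total / union if union > 0 else 0.0
```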
In the embodiment of the invention, the IOU algorithm originally used to evaluate object detection accuracy is innovatively applied to the calculation of video similarity. Compared with conventional sequence matching algorithms such as the Longest Common Subsequence (LCS), this enhances the robustness of the video similarity algorithm to the ordering of the video frame sequence; when the order of the video frames in the video to be matched is changed, for example by reverse playback or carousel, the similarity can still be calculated accurately, thereby improving the accuracy of video matching.
And step 104, when the video similarity meets a first preset requirement, determining that the first video and the second video are successfully matched.
For example, the first preset requirement may be greater than a preset similarity threshold, and when the video similarity meets the first preset requirement, it indicates that the degree of similarity between the first video and the second video is high, and it may be determined that the first video and the second video are successfully matched.
The video matching method provided in the embodiment of the invention judges, for each first video frame in the first video frame set, whether a second video frame matching the current first video frame exists in the second video frame set; if so, the current first video frame is marked as a target video frame, wherein all first video frames in the first video frame set come from the first video and all second video frames in the second video frame set come from the second video. The first total number of target video frames in the first video frame set is counted, the video similarity of the first video and the second video is calculated based on the IOU algorithm and the first total number, and when the video similarity meets a first preset requirement, it is determined that the first video and the second video are successfully matched. By adopting this technical scheme, the IOU algorithm originally used to evaluate object detection accuracy is applied to the calculation of video similarity, which can enhance the robustness of the video similarity algorithm to the ordering of the video frame sequence; when the order of the video frames in the video to be matched is changed, for example by reverse playback or carousel, the similarity can still be calculated accurately, thereby improving the accuracy of video matching.
In some embodiments, the calculating the video similarity of the first video and the second video based on the intersection-over-union (IOU) algorithm and the first total number includes: calculating a first video similarity of the first video and the second video based on the IOU algorithm and the first total number. Correspondingly, when the video similarity meets a first preset requirement, determining that the first video and the second video are successfully matched includes: if the first video similarity is greater than or equal to a fifth preset threshold, determining that the first video and the second video are successfully matched. The advantage of this arrangement is that a threshold comparison yields the video matching result quickly; the fifth preset threshold can be set according to actual requirements.
In some embodiments, said calculating the video similarity of the first video and the second video based on the IOU algorithm and the first total number comprises: calculating the sum of the number of first video frames in the first video frame set and the number of second video frames in the second video frame set to obtain a second total number; calculating a difference between the second total number and the first total number; and calculating the ratio of the first total quantity to the difference value, and determining the ratio as the video similarity of the first video and the second video. The advantage of this arrangement is that the video similarity based on the IOU algorithm can be more reasonably calculated.
For example, when determining whether there is a second video frame matching the current first video frame in the second video frame set, a variety of determination methods may be adopted, and two of them are schematically described below.
In some embodiments, the first approach may be employed. For example, the determining whether there is a second video frame in the second set of video frames that matches the current first video frame includes: for each second video frame in the second video frame set, calculating the image similarity of the current second video frame and the current first video frame; and determining the maximum image similarity as the target image similarity, and if the target image similarity meets a second preset requirement, determining that a second video frame matched with the current first video frame exists in the second video frame set. The method has the advantages that the image similarity of the current first video frame and all the second video frames in the second video frame set can be more comprehensively and accurately evaluated, the maximum image similarity is used as the target image similarity to be compared with the second preset requirement, and the judgment times can be reduced. The second preset requirement may be freely set, and for example, may be that the target image similarity is greater than or equal to a first preset threshold.
In some embodiments, the second approach may be employed. For example, the determining whether there is a second video frame in the second set of video frames that matches the current first video frame includes: determining a current second video frame in a second video frame set according to a preset sequence, calculating the image similarity between the current second video frame and a current first video frame, determining the current image similarity as a target image similarity, determining that a second video frame matched with the current first video frame exists in the second video frame set if the target image similarity meets a second preset requirement, and re-determining the current second video frame until the current second video frame is the last second video frame in the preset sequence if the target image similarity does not meet the second preset requirement. The advantage of such an arrangement is that, since the second video frames meeting the second preset requirement may appear before all the second video frames are traversed, the number of times of calculating the image similarity can be reduced to a certain extent, and the determination efficiency of the target video frame is further improved. The second preset requirement may be freely set, and for example, may be that the target image similarity is greater than or equal to a first preset threshold. The preset sequence may be, for example, the sequence number of each second video frame in the second video frame set, and the sequence number of the second video frame may be related to the relative position of the second video frame in the second video, where the earlier the playing position of the second video frame in the second video is, the smaller the corresponding sequence number is.
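A minimal sketch of this second, early-exit judging approach (function and parameter names are illustrative):

```python
def has_matching_second_frame(first_frame, second_frames, image_similarity, threshold):
    """Walk the second video frames in their preset (playing) order and stop
    at the first one whose similarity clears the threshold."""
    for s in second_frames:
        if image_similarity(first_frame, s) >= threshold:
            return True  # no need to score the remaining second frames
    return False
```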
In some embodiments, the calculating the image similarity of the current second video frame and the current first video frame includes: extracting a first image feature vector corresponding to a current first video frame by using a preset image feature extraction model, and extracting a second image feature vector corresponding to a current second video frame by using the preset image feature extraction model; calculating the distance between the first image characteristic vector and the second image characteristic vector to obtain the first image similarity of the current second video frame and the current first video frame; correspondingly, if the similarity of the target image meets a second preset requirement, determining that a second video frame matched with the current first video frame exists in the second video frame set, including: and if the similarity of the target first image is greater than or equal to a first preset threshold value, determining that a second video frame matched with the current first video frame exists in the second video frame set. This has the advantage that the similarity between video frames can be calculated more accurately. The first image similarity calculation method may be applied to the first determination method and the second determination method. The preset image feature extraction model may be a Neural network model, and specifically may be a Convolutional Neural Network (CNN) model.
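A sketch of such a feature-based first image similarity, assuming PyTorch/torchvision; ResNet34 is borrowed from a later embodiment, and the preprocessing and trained weights are left open:

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Illustrative backbone; the scheme only requires some preset CNN image
# feature extraction model.
backbone = models.resnet34(weights=None)
backbone.fc = torch.nn.Identity()  # drop the classifier, keep the embedding
backbone.eval()

@torch.no_grad()
def image_feature(frame):
    """frame: preprocessed (3, H, W) tensor -> L2-normalized feature vector."""
    return F.normalize(backbone(frame.unsqueeze(0)).squeeze(0), dim=0)

def first_image_similarity(frame_a, frame_b):
    # With L2-normalized vectors the inner product equals the cosine
    # similarity, i.e., a distance-based score in [-1, 1].
    return float(image_feature(frame_a) @ image_feature(frame_b))
```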
In some embodiments, the training sample set corresponding to the preset image feature extraction model is obtained by: acquiring a preset number of original images as a basic training set; processing each original image in the basic training set by adopting a preset image enhancement mode to obtain an enhanced image, wherein the preset image enhancement mode comprises at least one of cutting, mirroring, rotating, color disturbing, gray color enhancing, black edge adding, aspect ratio changing, watermark adding, expression adding, character adding, Gaussian noise adding, compressing, pixel random removing and sharpening; and summarizing the original images and the corresponding enhanced images in the basic training set to obtain a training sample set.
In an actual application scenario, a video frame is usually turned into a new image through a series of affine or color transformations, that is, a homologous image of the original image. The homologous image is very similar to the original image in low-level visual semantics but differs considerably in the high-level semantic space, so the distance between the homologous image and the original image in the high-level semantic space needs to be reduced, and the relevant transformation data needs to be obtained in a targeted manner. In the prior art, data from actual scenes is generally adopted as training data, which has at least two problems: first, the data is difficult to mine, because similar images are needed as training data and their distribution is unknown, so a large amount of data must be mined and the effective data finally obtained is very limited; second, manual labeling is required. In the embodiment of the invention, the various transformations appearing in actual application scenarios are simulated by the data enhancement modes above, and a training set is constructed. For example, N dissimilar anchor images (i.e., original images) are selected as a basic training set, the application scenario is simulated to apply enhancement transformations to each image, and one anchor image can generate a plurality of enhanced images to be added to the final training set. The advantage is that training sample images can be generated quickly, labeling labor is saved, the scale of the training sample data can be expanded conveniently, and the information covered by the data is increased.
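A small sketch of constructing such a training sample set with Pillow; the particular transforms, magnitudes and copy counts are illustrative stand-ins for the enhancement modes listed above:

```python
import random
from PIL import ImageEnhance, ImageOps

AUGMENTATIONS = [
    lambda im: im.crop((10, 10, im.width - 10, im.height - 10)),  # cutting
    ImageOps.mirror,                                              # mirroring
    lambda im: im.rotate(15, expand=True),                        # rotating
    lambda im: ImageEnhance.Color(im).enhance(0.3),               # color disturbing
    lambda im: ImageOps.grayscale(im).convert("RGB"),             # gray color
    lambda im: ImageOps.expand(im, border=20, fill="black"),      # black edge adding
    lambda im: im.resize((im.width, int(im.height * 0.75))),      # aspect ratio change
]

def build_training_set(anchor_images, copies_per_anchor=5):
    """Each anchor (original) image plus several randomly enhanced copies."""
    samples = []
    for img in anchor_images:
        samples.append(img)
        samples.extend(random.choice(AUGMENTATIONS)(img)
                       for _ in range(copies_per_anchor))
    return samples
```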
Fig. 2 is a schematic flowchart of another video matching method according to an embodiment of the present invention, as shown in fig. 2, the method may include:
step 201, decoding the first video, and extracting video frames in the first video by using the first frequency to obtain a first video frame set.
Illustratively, the first video is recorded as the query video V_query. By decoding it and extracting video frames, the first video frame set G_query = {I_0, …, I_N} can be obtained, i.e., the number of frames in the first video frame set is N.
Optionally, before this step, the method may further include: training a preset CNN model with the training sample set to obtain a preset CNN image feature extraction model (also called a CNN image feature extractor). The preset CNN image feature extraction model may, for example, use a ResNet34 network structure, that is, the preset CNN model may be a ResNet34 network structure. When training the CNN model, an Additive Angular Margin Loss function (ArcFace Loss) may be used. Further, Instance Normalization (IN) may be used to replace the Batch Normalization (BN) of the first two layers of the conventional CNN model, that is, the first two normalization layers in the preset CNN model are changed from BN layers to IN layers, which helps the images maintain stronger independence during training.
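How deep "the first two layers" reaches is not spelled out here; the sketch below guesses it as the stem BN plus the first residual stage of a torchvision ResNet34:

```python
import torch.nn as nn
from torchvision import models

def swap_early_bn_for_in(model: nn.Module) -> nn.Module:
    """Replace early Batch Normalization layers with Instance Normalization;
    the exact depth of the swap is an assumption, not fixed by the text."""
    model.bn1 = nn.InstanceNorm2d(model.bn1.num_features, affine=True)
    for block in model.layer1:  # first residual stage
        block.bn1 = nn.InstanceNorm2d(block.bn1.num_features, affine=True)
        block.bn2 = nn.InstanceNorm2d(block.bn2.num_features, affine=True)
    return model

resnet34_with_in = swap_early_bn_for_in(models.resnet34(weights=None))
```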
Optionally, before the preset CNN model is trained by using the training sample set to obtain the preset CNN image feature extraction model, the method may further include: acquiring a preset number of original images as a basic training set; processing each original image in the basic training set by adopting a preset image enhancement mode to obtain an enhanced image, wherein the preset image enhancement mode comprises cutting, mirroring, rotating, color disturbing, gray level color enhancing, black edge adding, aspect ratio changing, watermark adding, expression adding, character adding, Gaussian noise adding, compressing, pixel random removing and sharpening; and summarizing the original images and the corresponding enhanced images in the basic training set to obtain a training sample set.
Step 202, decoding the second video, and extracting the video frames in the second video by using the second frequency to obtain a second video frame set.
Illustratively, the second video is denoted as the key video V_key. By decoding it and extracting video frames, the second video frame set G_key = {I_0, …, I_M} can be obtained, i.e., the number of frames in the second video frame set is M.
Step 203, for each first video frame in the first video frame set, determining whether a second video frame matched with the current first video frame exists in the second video frame set, and if so, recording the current first video frame as a target video frame.
Illustratively, for each second video frame in the second video frame set, calculating the image similarity of the current second video frame and the current first video frame; and determining the maximum image similarity as the target image similarity, and if the target image similarity is greater than or equal to a first preset threshold, determining that a second video frame matched with the current first video frame exists in the second video frame set. The calculating of the image similarity between the current second video frame and the current first video frame may include extracting a first image feature vector corresponding to the current first video frame by using a preset CNN image feature extraction model, and extracting a second image feature vector corresponding to the current second video frame by using the preset CNN image feature extraction model; and calculating the distance between the first image characteristic vector and the second image characteristic vector to obtain the first image similarity of the current second video frame and the current first video frame. Wherein the distance may be a cosine distance (cosine similarity).
For example, after a video frame passes through the preset CNN image feature extraction model, a one-dimensional vector of length L (e.g., 128) may be obtained, and norm normalization (e.g., L2) may be performed on the vector to obtain an image feature vector f. For one query frame image I_query, the preset CNN image feature extraction model extracts a feature f_query; for one key frame image I_key, it extracts a feature f_key. Taking the inner product of f_query and f_key yields the cosine similarity of the two images, which describes the similarity relationship between them. If the cosine similarity is greater than or equal to the similarity threshold T_cosine (the first preset threshold), the two frames of images are considered to be successfully matched; otherwise, the match is considered unsuccessful.
Illustratively, Set_qk is initialized to an empty set. G_query = {I_0, …, I_N} is traversed; for each frame, the similarity with every frame in G_key = {I_0, …, I_M} is calculated, and the maximum similarity S_qmax and the corresponding frame I_k ∈ G_key are taken. If S_qmax is greater than or equal to T_cosine, then I_k, or rather the I_q ∈ G_query matching I_k, is added as a target video frame to Set_qk.
And step 204, counting the first total number of the target video frames.
Step 205, calculating the sum of the number of the first video frames in the first video frame set and the number of the second video frames in the second video frame set to obtain a second total number.
And step 206, calculating the difference value between the second total number and the first total number.
And step 207, calculating the ratio of the first total number to the difference value, and determining the ratio as the video similarity of the first video and the second video.
Illustratively, the video similarity S_qk can be expressed as:
S_qk = size(Set_qk) / (N + M - size(Set_qk))
step 208, judging whether the video similarity is greater than or equal to a fifth preset threshold, if so, executing step 209; otherwise, step 210 is performed.
Suppose the fifth preset threshold is denoted T_video; then, when S_qk is greater than or equal to T_video, V_query and V_key are considered successfully matched.
And step 209, determining that the first video and the second video are successfully matched.
Step 210, determining that the first video and the second video fail to be matched.
The video matching method provided by the embodiment of the invention can simulate actual application scenarios to generate training data through data enhancement, which saves labeling labor, conveniently expands the scale of the training sample data and increases the information covered by the data; the trained preset CNN image feature extraction model can then recognize the various transformation modes of a video image, thereby accurately extracting image features and improving the accuracy of similarity calculation between video frames. When calculating the video matching degree, the IOU algorithm is adopted, which enhances the robustness of the video similarity algorithm to the ordering of the video frame sequences and further improves the accuracy of video matching. In addition, tests show that in a video recall application scenario, replacing the traditional LCS algorithm with the IOU algorithm improves the recall rate of the video matching scheme by 7% at the same precision.
On the basis of the foregoing optional embodiments, in some embodiments, before the calculating the image similarity between the current second video frame and the current first video frame, the method further includes: calculating a first information entropy corresponding to the current first video frame and/or calculating a second information entropy corresponding to the current second video frame; and if the first information entropy or the second information entropy is smaller than a fourth preset threshold, increasing the first preset threshold to obtain a new first preset threshold. This has the advantage that the accuracy of the video frame matching can be improved.
For example, the preset image feature extraction model is less robust for images with a single content or a small amount of information, for example an image whose background is a solid color and whose foreground contains only a little text or a few simple line strokes. Most of the information is background, the small amount of foreground information is too simple, and it is difficult for the model to extract effective information from such images; as a result, for two images with little information, the model mainly extracts background information for matching, and the two are easily matched by mistake. In the embodiment of the invention, information entropy is used to measure the amount of information contained in a video frame, that is, the information entropy is calculated for the video frame to judge whether the image information is rich enough; if the image information is sparse, the matching threshold (such as the first preset threshold) is raised to reduce the probability of mismatching. Experiments show that in a video recall application scenario, the information entropy method improves the matching precision of the model, so that the recall rate of the model is improved by 10% at the same precision.
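A sketch of this entropy gate; the entropy floor (standing in for the fourth preset threshold) and the bump added to the matching threshold are illustrative values:

```python
import numpy as np

def gray_entropy(gray: np.ndarray) -> float:
    """Shannon entropy of an 8-bit grayscale frame's intensity histogram."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def adjusted_match_threshold(base_threshold, gray_first, gray_second,
                             entropy_floor=4.0, bump=0.05):
    """Raise the first preset threshold when either frame carries little
    information, to reduce the chance of a false match."""
    if min(gray_entropy(gray_first), gray_entropy(gray_second)) < entropy_floor:
        return base_threshold + bump
    return base_threshold
```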
In some embodiments, before the calculating the image similarity of the current second video frame and the current first video frame, further includes: respectively carrying out frame detection on the current first video frame and the current second video frame; and if the frame is detected to exist, performing frame removal processing on the video frame with the frame to obtain a new corresponding video frame. The advantage of this arrangement is that the interference caused by the border is eliminated.
For example, borders with black, white or other colors added at the upper and lower ends, the left and right ends, or the periphery of the video image often appear in the video image, the borders can be regarded as invalid image information for the video content, and even if the enhanced image with the added borders is added in the training sample, the preset image feature extraction model generally automatically matches the identified borders with the features of the original image, so that the similarity is reduced. The embodiment of the invention can remove the frame of the video frame containing the frame in a certain way to avoid the interference caused by the frame. For example, taking the upper and lower borders as an example, the scanning calculation may be performed line by line from the height center line of the image respectively upwards and downwards, if the variance of the pixel values of a certain line of the scanned image is close to 0 (e.g., smaller than a preset variance threshold), it is considered that the critical points of the image border are found, the critical points of the upper border and the lower border may be obtained respectively, and the image between the two critical points is clipped, so that the effect of removing the border may be achieved. Experiments prove that in a video recall application scene, the matching precision of the model is improved by adopting a method for removing image frames, so that the recall rate of the model is improved by 3% under the condition of the same precision.
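A sketch of the row-variance scan for upper and lower borders described above; the variance cutoff is an illustrative stand-in for the preset variance threshold:

```python
import numpy as np

def strip_horizontal_borders(gray: np.ndarray, var_threshold=1.0) -> np.ndarray:
    """Scan row by row from the height center line upwards and downwards; a
    near-constant row (variance close to 0) marks a border critical point,
    and only the image between the two critical points is kept."""
    h = gray.shape[0]
    top, bottom = 0, h
    for row in range(h // 2, -1, -1):        # center line upwards
        if gray[row].var() < var_threshold:
            top = row + 1
            break
    for row in range(h // 2, h):             # center line downwards
        if gray[row].var() < var_threshold:
            bottom = row
            break
    return gray[top:bottom]
```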
On the basis of the above optional embodiments, the specific process of judging whether there is a second video frame in the second video frame set that matches the current first video frame may be optimized. In some embodiments, if the target first image similarity is smaller than the first preset threshold and greater than or equal to a second preset threshold, a second image similarity between the current second video frame and the current first video frame is calculated by a Scale-Invariant Feature Transform (SIFT) method; and if the second image similarity is greater than or equal to a third preset threshold, it is determined that a second video frame matching the current first video frame exists in the second video frame set. The SIFT method is a traditional image feature extraction method that needs no training data, but it is weaker than a learned image feature extraction model in robustness and speed; the image feature extraction model, for its part, is generally sensitive to (easily disturbed by) corner-rich patterns such as text. Combining the SIFT method with the image feature extraction model means that SIFT only needs to process a small part of the data, which preserves matching efficiency while improving image retrieval precision; the similarity between two video frames can thus be evaluated comprehensively in more dimensions, and whether the two video frames match can be judged more accurately.
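A sketch of the SIFT-based second image similarity with OpenCV; the score definition (ratio-test inliers over the smaller keypoint count) is a choice made here, since the exact formula is left open:

```python
import cv2

def sift_similarity(gray_a, gray_b, ratio=0.75):
    """Second image similarity from SIFT keypoint matches (Lowe ratio test)."""
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(gray_a, None)
    kp_b, des_b = sift.detectAndCompute(gray_b, None)
    if des_a is None or des_b is None:
        return 0.0  # no keypoints detected in at least one frame
    pairs = cv2.BFMatcher().knnMatch(des_a, des_b, k=2)
    good = [p[0] for p in pairs
            if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    return len(good) / max(min(len(kp_a), len(kp_b)), 1)
```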
In some embodiments, the calculating the second image similarity of the current second video frame and the current first video frame based on the SIFT method includes: and detecting whether the current second video frame contains a preset type pattern, and if so, calculating the second image similarity of the current second video frame and the current first video frame based on an SIFT method. The predetermined type of pattern may be, for example, a pattern with a relatively obvious corner feature, such as a text.
The advantage of this arrangement is that before the second image similarity is calculated, it can be identified in advance whether the second video frame contains the preset type pattern; if so, the second image similarity calculation continues. If not, there is little point in calculating the second image similarity, and the conclusion that the current second video frame does not match the current first video frame can be reached quickly, thereby improving video matching efficiency.
In some embodiments, the detecting whether the current second video frame includes a preset type pattern includes: and detecting whether the current second video frame contains characters or not by adopting a preset edge detection algorithm. The method has the advantages that the characters are typical clusters which interfere the image feature extraction model, and the characters in the image can be rapidly identified by adopting an edge detection algorithm, so that the character detection efficiency is improved. The preset edge detection algorithm can be set according to actual requirements, for example, based on a Sobel operator or a Roberts operator, and the like.
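A rough Sobel-based text check along those lines; the gradient cutoff and the edge-density ratio are illustrative tuning knobs:

```python
import cv2

def likely_contains_text(gray, magnitude_cutoff=100.0, edge_ratio=0.05):
    """Text regions produce dense strong edges, so flag a frame whose
    strong-edge pixel fraction exceeds a tunable ratio."""
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    strong = (cv2.magnitude(gx, gy) > magnitude_cutoff).mean()
    return strong > edge_ratio
```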
On the basis of the optional embodiments above, the specific process of determining whether the first video and the second video are successfully matched can also be optimized: besides calculating the first video similarity based on the IOU algorithm and the first total number, the similarity of the first video and the second video can be evaluated in the spatio-temporal dimension, so that the video matching result can be determined more accurately. For example, if the first video similarity is smaller than the fifth preset threshold and greater than or equal to a sixth preset threshold, a preset video feature extraction model based on the spatio-temporal dimension is used to extract a first video feature vector corresponding to the first video and a second video feature vector corresponding to the second video; the distance between the first video feature vector and the second video feature vector is calculated to obtain the second video similarity of the first video and the second video; and if the second video similarity is greater than or equal to a seventh preset threshold, it is determined that the first video and the second video are successfully matched. The preset video feature extraction model based on the spatio-temporal dimension may also be a CNN model, for example a Temporal Shift Module (TSM). The distance may, for example, be a cosine distance. A video matching scheme that calculates similarity only with the IOU algorithm cannot match the spatio-temporal information of two videos: for example, if the two videos have the same content but different frame rates, frame misalignment easily occurs after decoding, which makes video matching based on picture similarity difficult; a video CNN model can effectively capture the motion information in the videos and effectively supplements picture-similarity-based matching. Specifically, the TSM model can model the video in the spatio-temporal dimension to capture its motion information, and the similarity of the two videos in the spatio-temporal dimension is then evaluated.
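A sketch of the spatio-temporal second video similarity; `video_model` stands in for the preset TSM-style extractor, which is assumed rather than provided here:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def second_video_similarity(video_model, clip_first, clip_second):
    """Cosine similarity of two spatio-temporal video embeddings; clips are
    (T, 3, H, W) stacks of preprocessed frames."""
    f1 = F.normalize(video_model(clip_first.unsqueeze(0)).squeeze(0), dim=0)
    f2 = F.normalize(video_model(clip_second.unsqueeze(0)).squeeze(0), dim=0)
    return float(f1 @ f2)
```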
Fig. 3 is a schematic flowchart of another video matching method according to an embodiment of the present invention, where the method includes:
step 301, decoding the first video, and extracting video frames in the first video by using the first frequency to obtain a first video frame set.
Step 302, decoding the second video, and extracting the video frames in the second video by using the second frequency to obtain a second video frame set.
Step 303, for each first video frame in the first video frame set, determining whether a second video frame matched with the current first video frame exists in the second video frame set, and if so, recording the current first video frame as a target video frame.
For example, the first image similarity between the current second video frame and the current first video frame is calculated first, and the calculation manner refers to the above related contents, which is not described herein again. For example, the maximum first image similarity is determined as the target first image similarity by the first determination method. Judging whether the similarity of the target first image is greater than or equal to a first preset threshold value, if so, directly determining that a second video frame matched with the current first video frame exists in the second video frame set; otherwise, judging whether the similarity of the target first image is greater than or equal to a second preset threshold, if not, determining that a second video frame matched with the current first video frame does not exist in the second video frame set, and if so, calculating the similarity of the current second video frame and the current first video frame by adopting an SIFT method. Continuously judging whether the similarity of the second image is greater than or equal to a third preset threshold, if so, determining that a second video frame matched with the current first video frame exists in the second video frame set; otherwise, determining that no second video frame matched with the current first video frame exists in the second video frame set.
Step 304, counting the first total number of the target video frames, and calculating the sum of the number of the first video frames in the first video frame set and the number of the second video frames in the second video frame set to obtain a second total number.
And step 305, calculating a difference value between the second total number and the first total number, calculating a ratio of the first total number to the difference value, and determining the ratio as the first video similarity of the first video and the second video.
Step 306, judging whether the first video similarity is greater than or equal to a fifth preset threshold, if so, executing step 311; otherwise, step 307 is executed.
Step 307, extracting a first video feature vector corresponding to the first video and extracting a second video feature vector corresponding to the second video by using a preset TSM video feature extraction model.
And 308, calculating the distance between the first video feature vector and the second video feature vector to obtain the second video similarity of the first video and the second video.
Step 309, judging whether the second video similarity is greater than or equal to a seventh preset threshold, if so, executing step 311; otherwise, step 310 is performed.
Step 310, determining that the first video and the second video fail to be matched.
And 311, determining that the first video and the second video are successfully matched.
In the embodiment of the invention, in the process of judging video matching, the video similarity is calculated by adopting an IOU algorithm, the robustness of the system on the sequence of video frames is ensured, meanwhile, the extraction capability of the motion information is enhanced by combining the modeling of a TSM model on the motion information of the video, the space-time dimension similarity of the video is further calculated, and the problem of unsuccessful matching caused by different frame rates is solved. In the process of calculating the video similarity based on the IOU algorithm, for the judgment of the similarity of two video frames, the CNN model is adopted to extract image features and calculate the similarity, and meanwhile, the traditional SIFT method is combined, so that the defect that the CNN model lacks robustness for information such as characters and the like can be overcome, the problem of unsuccessful matching caused by character information is solved, the accuracy of video frame matching judgment is effectively improved, and the accuracy of video matching judgment is further improved.
Fig. 4 is a block diagram of a video matching apparatus according to an embodiment of the present invention, where the apparatus may be implemented by software and/or hardware, and may be generally integrated in a computer device, and may perform a video matching determination by executing a video matching method. As shown in fig. 4, the apparatus includes:
a target video frame determining module 401, configured to determine, for each first video frame in the first video frame set, whether a second video frame matching the current first video frame exists in the second video frame set, and if so, mark the current first video frame as a target video frame; wherein all first video frames in the first set of video frames are derived from a first video and all second video frames in the second set of video frames are derived from a second video;
a target video frame counting module 402, configured to count a first total number of target video frames in the first video frame set;
a video similarity calculation module 403, configured to calculate the video similarity of the first video and the second video based on an intersection-over-union (IOU) algorithm and the first total number;
a video matching module 404, configured to determine that the first video and the second video are successfully matched when the video similarity meets a first preset requirement.
The video matching device provided in the embodiment of the present invention determines, for each first video frame in the first video frame set, whether a second video frame matching the current first video frame exists in the second video frame set and, if so, marks the current first video frame as a target video frame, where all first video frames in the first video frame set come from the first video and all second video frames in the second video frame set come from the second video; counts a first total number of target video frames in the first video frame set; calculates the video similarity of the first video and the second video based on the IOU algorithm and the first total number; and determines that the first video and the second video are successfully matched when the video similarity meets a first preset requirement. By adopting this technical scheme, the IOU algorithm originally used to evaluate object detection accuracy is applied to the calculation of video similarity, which can enhance the robustness of the video similarity algorithm to the ordering of the video frame sequence; when the order of the video frames in the video to be matched is changed, for example by reverse playback or carousel, the similarity can still be calculated accurately, thereby improving the accuracy of video matching.
The embodiment of the invention provides computer equipment, and the video matching device provided by the embodiment of the invention can be integrated in the computer equipment. Fig. 5 is a block diagram of a computer device according to an embodiment of the present invention. The computer device 500 comprises a memory 501, a processor 502 and a computer program stored on the memory 501 and executable on the processor 502, wherein the processor 502 implements the video matching method provided by the embodiment of the invention when executing the computer program.
Embodiments of the present invention also provide a storage medium containing computer-executable instructions, which when executed by a computer processor, are configured to perform the video matching method provided by the embodiments of the present invention.
The video matching device, the video matching equipment and the storage medium provided in the above embodiments can execute the video matching method provided in any embodiment of the present invention, and have corresponding functional modules and beneficial effects for executing the method. For technical details that are not described in detail in the above embodiments, reference may be made to a video matching method provided in any embodiment of the present invention.
Note that the above is only a preferred embodiment of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in more detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the claims.

Claims (15)

1. A method of video matching, comprising:
for each first video frame in the first video frame set, judging whether a second video frame matched with the current first video frame exists in the second video frame set, and if so, recording the current first video frame as a target video frame; wherein all first video frames in the first set of video frames are derived from a first video and all second video frames in the second set of video frames are derived from a second video;
counting a first total number of target video frames in the first video frame set;
calculating the video similarity of the first video and the second video based on an intersection-over-union (IOU) algorithm and the first total number;
and when the video similarity meets a first preset requirement, determining that the first video and the second video are successfully matched.
2. The method of claim 1, wherein the calculating the video similarity of the first video and the second video based on the intersection-over-union (IOU) algorithm and the first total number comprises:
calculating the sum of the number of first video frames in the first video frame set and the number of second video frames in the second video frame set to obtain a second total number;
calculating a difference between the second total number and the first total number;
and calculating the ratio of the first total quantity to the difference value, and determining the ratio as the video similarity of the first video and the second video.
3. The method of claim 1, wherein determining whether there is a second video frame in the second set of video frames that matches the current first video frame comprises:
for each second video frame in the second video frame set, calculating the image similarity of the current second video frame and the current first video frame; determining the maximum image similarity as a target image similarity, and if the target image similarity meets a second preset requirement, determining that a second video frame matched with the current first video frame exists in the second video frame set;
alternatively,
determining a current second video frame in a second video frame set according to a preset sequence, calculating the image similarity between the current second video frame and a current first video frame, determining the current image similarity as a target image similarity, determining that a second video frame matched with the current first video frame exists in the second video frame set if the target image similarity meets a second preset requirement, and re-determining the current second video frame until the current second video frame is the last second video frame in the preset sequence if the target image similarity does not meet the second preset requirement.
4. The method of claim 3, wherein calculating the image similarity between the current second video frame and the current first video frame comprises:
extracting a first image feature vector corresponding to a current first video frame by using a preset image feature extraction model, and extracting a second image feature vector corresponding to a current second video frame by using the preset image feature extraction model;
calculating the distance between the first image feature vector and the second image feature vector to obtain a first image similarity of the current second video frame and the current first video frame;
correspondingly, if the target image similarity meets a second preset requirement, the determining that a second video frame matched with the current first video frame exists in the second video frame set comprises:
and if the target first image similarity is greater than or equal to a first preset threshold, determining that a second video frame matched with the current first video frame exists in the second video frame set.
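Claim 4 fixes neither the feature extraction model nor the distance measure. The sketch below assumes a generic embedding callable and derives the similarity from cosine geometry; both choices are illustrative, not mandated by the claim:

```python
import numpy as np

def first_image_similarity(model, first_frame, second_frame):
    """Distance-derived similarity between two frame embeddings (claim 4)."""
    # model(frame) is assumed to return a 1-D feature vector; the "preset
    # image feature extraction model" itself is left open by the claim.
    v1 = np.asarray(model(first_frame), dtype=np.float64)
    v2 = np.asarray(model(second_frame), dtype=np.float64)
    # Cosine similarity: 1.0 for identical directions, -1.0 for opposite ones.
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```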
5. The method according to claim 4, wherein the set of training samples corresponding to the preset image feature extraction model is obtained by:
acquiring a preset number of original images as a basic training set;
processing each original image in the basic training set by adopting a preset image enhancement mode to obtain an enhanced image, wherein the preset image enhancement mode comprises at least one of cropping, mirroring, rotation, color perturbation, grayscale enhancement, black border addition, aspect ratio change, watermark addition, emoticon addition, text addition, Gaussian noise addition, compression, random pixel removal and sharpening;
and summarizing the original images and the corresponding enhanced images in the basic training set to obtain a training sample set.
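A sketch of how the training sample set of claim 5 could be assembled, implementing just two of the listed enhancement modes (mirroring and Gaussian noise) with OpenCV; the remaining modes would be added analogously, and the noise parameters are assumptions:

```python
import cv2
import numpy as np

def augment(image):
    """Return enhanced copies of one original image (two example modes)."""
    mirrored = cv2.flip(image, 1)  # horizontal mirror
    noise = np.random.normal(0.0, 10.0, image.shape)
    noisy = np.clip(image.astype(np.float64) + noise, 0, 255).astype(np.uint8)
    return [mirrored, noisy]

def build_training_set(originals):
    # Claim 5: the sample set is the originals plus their enhanced images.
    samples = []
    for img in originals:
        samples.append(img)
        samples.extend(augment(img))
    return samples
```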
6. The method of claim 4, further comprising:
if the target first image similarity is smaller than the first preset threshold and greater than or equal to a second preset threshold, calculating a second image similarity of the current second video frame and the current first video frame by adopting a Scale Invariant Feature Transform (SIFT) method;
and if the second image similarity is greater than or equal to a third preset threshold, determining that a second video frame matched with the current first video frame exists in the second video frame set.
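For the ambiguous band between the first and second preset thresholds, claim 6 falls back to a SIFT comparison. One plausible reading, using OpenCV's SIFT with Lowe's ratio test; turning the surviving matches into a score by dividing by the larger keypoint count is an assumption, since the claim leaves the scoring rule open:

```python
import cv2

def second_image_similarity(img1, img2):
    """SIFT similarity: fraction of keypoints passing Lowe's ratio test."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)
    if des1 is None or des2 is None:  # one image yielded no keypoints
        return 0.0
    matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
    good = [m for m in matches
            if len(m) == 2 and m[0].distance < 0.75 * m[1].distance]
    return len(good) / max(len(kp1), len(kp2))
```

SIFT is slower than a single embedding comparison but tolerates the crops, rotations and overlays that commonly defeat global features, which is why it is reserved for the uncertain middle band.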
7. The method of claim 6, wherein the calculating the second image similarity of the current second video frame and the current first video frame based on the SIFT method comprises:
and detecting whether the current second video frame contains a preset type pattern, and if so, calculating the second image similarity of the current second video frame and the current first video frame based on an SIFT method.
8. The method of claim 7, wherein the detecting whether the current second video frame contains a preset type pattern comprises:
and detecting whether the current second video frame contains characters or not by adopting a preset edge detection algorithm.
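Claims 7 and 8 gate the comparatively expensive SIFT step on whether the frame appears to contain a pattern such as text. A crude sketch using Canny edge density as the "preset edge detection algorithm"; the Canny parameters and the density cutoff are hypothetical placeholders:

```python
import cv2
import numpy as np

def contains_text(frame, density_cutoff=0.08):
    """Heuristic: text regions produce a high density of strong edges."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)
    edge_density = float(np.count_nonzero(edges)) / edges.size
    return edge_density > density_cutoff
```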
9. The method according to claim 4, further comprising, before said calculating the image similarity of the current second video frame and the current first video frame:
calculating a first information entropy corresponding to a current first video frame and/or calculating a second information entropy corresponding to a current second video frame;
and if the first information entropy or the second information entropy is smaller than a fourth preset threshold, increasing the first preset threshold to obtain a new first preset threshold.
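Low-information frames (near-black or single-color frames, for instance) tend to produce spuriously close feature vectors, which is why claim 9 raises the bar for them. A sketch computing the Shannon entropy of the grayscale histogram; the entropy cutoff and the increment are assumed values:

```python
import cv2
import numpy as np

def frame_entropy(frame):
    """Shannon entropy (in bits) of the grayscale intensity histogram."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def adjusted_first_threshold(first_threshold, frame_a, frame_b,
                             entropy_cutoff=3.0, increment=0.05):
    # Claim 9: if either frame carries little information, demand a
    # stricter image similarity before declaring a match.
    if min(frame_entropy(frame_a), frame_entropy(frame_b)) < entropy_cutoff:
        return first_threshold + increment
    return first_threshold
```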
10. The method of claim 3, further comprising, before said calculating the image similarity of the current second video frame and the current first video frame:
respectively carrying out border detection on the current first video frame and the current second video frame;
and if a border is detected, performing border removal processing on the video frame with the border to obtain a new corresponding video frame.
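A letterboxed or pillarboxed copy would otherwise compare poorly against the original, hence the border handling in claim 10. A sketch that crops near-black margins; the darkness threshold is an assumption:

```python
import cv2
import numpy as np

def remove_border(frame, dark_threshold=10):
    """Crop away near-black rows and columns around the frame, if any."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    content = gray > dark_threshold          # True where pixels are non-black
    rows = np.flatnonzero(content.any(axis=1))
    cols = np.flatnonzero(content.any(axis=0))
    if rows.size == 0 or cols.size == 0:     # entirely dark frame: keep as-is
        return frame
    return frame[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
```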
11. The method of claim 1, wherein the calculating the video similarity of the first video and the second video based on the intersection-over-union (IoU) algorithm and the first total number comprises:
calculating a first video similarity of the first video and the second video based on the intersection-over-union (IoU) algorithm and the first total number;
correspondingly, the determining that the first video and the second video are successfully matched when the video similarity meets a first preset requirement comprises:
and if the first video similarity is greater than or equal to a fifth preset threshold, determining that the first video and the second video are successfully matched.
12. The method of claim 11, further comprising:
if the first video similarity is smaller than the fifth preset threshold and greater than or equal to a sixth preset threshold, extracting a first video feature vector corresponding to the first video by using a preset spatio-temporal video feature extraction model, and extracting a second video feature vector corresponding to the second video by using the preset video feature extraction model;
calculating the distance between the first video feature vector and the second video feature vector to obtain the second video similarity of the first video and the second video;
and if the second video similarity is greater than or equal to a seventh preset threshold, determining that the first video and the second video are successfully matched.
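Claims 11 and 12 together form a two-stage decision: accept on a clearly high frame-level score, reject on a clearly low one, and only in between consult whole-video embeddings from a spatio-temporal model (a 3D convolutional network, for example). In the sketch below the model, the cosine similarity, and all three threshold values are illustrative:

```python
import numpy as np

def match_videos(first_video, second_video, frame_level_similarity,
                 video_model, fifth=0.6, sixth=0.4, seventh=0.8):
    # Stage 1: frame-level IoU similarity (claims 1 and 2).
    s1 = frame_level_similarity(first_video, second_video)
    if s1 >= fifth:        # confidently similar
        return True
    if s1 < sixth:         # confidently dissimilar
        return False
    # Stage 2: spatio-temporal video embeddings (claim 12).
    v1 = np.asarray(video_model(first_video), dtype=np.float64)
    v2 = np.asarray(video_model(second_video), dtype=np.float64)
    s2 = float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
    return s2 >= seventh
```

Keeping the expensive video-level model out of the common case is the point of the cascade: most pairs are resolved by the cheap frame-level score alone.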
13. A video matching apparatus, comprising:
the target video frame judging module is used for judging whether a second video frame matched with the current first video frame exists in the second video frame set or not for each first video frame in the first video frame set, and if so, marking the current first video frame as a target video frame; wherein all first video frames in the first set of video frames are derived from a first video and all second video frames in the second set of video frames are derived from a second video;
the target video frame counting module is used for counting a first total number of target video frames in the first video frame set;
the video similarity calculation module is used for calculating the video similarity of the first video and the second video based on an intersection-over-union (IoU) algorithm and the first total number;
and the video matching module is used for determining that the first video and the second video are successfully matched when the video similarity meets a first preset requirement.
14. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1-12 when executing the computer program.
15. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-12.
CN202011139600.8A 2020-10-22 2020-10-22 Video matching method, device, equipment and storage medium Pending CN112257595A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011139600.8A CN112257595A (en) 2020-10-22 2020-10-22 Video matching method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112257595A true CN112257595A (en) 2021-01-22

Family

ID=74264138

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011139600.8A Pending CN112257595A (en) 2020-10-22 2020-10-22 Video matching method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112257595A (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106484837A (en) * 2016-09-30 2017-03-08 腾讯科技(北京)有限公司 The detection method of similar video file and device
CN106503687A (en) * 2016-11-09 2017-03-15 合肥工业大学 The monitor video system for identifying figures of fusion face multi-angle feature and its method
CN108063914A (en) * 2017-11-22 2018-05-22 国政通科技股份有限公司 Monitor video file generated, playback method, device and terminal device
WO2019134516A1 (en) * 2018-01-05 2019-07-11 Oppo广东移动通信有限公司 Method and device for generating panoramic image, storage medium, and electronic apparatus
CN108647245A (en) * 2018-04-13 2018-10-12 腾讯科技(深圳)有限公司 Matching process, device, storage medium and the electronic device of multimedia resource
CN109635657A (en) * 2018-11-12 2019-04-16 平安科技(深圳)有限公司 Method for tracking target, device, equipment and storage medium
CN109857908A (en) * 2019-03-04 2019-06-07 北京字节跳动网络技术有限公司 Method and apparatus for matching video
CN110674673A (en) * 2019-07-31 2020-01-10 国家计算机网络与信息安全管理中心 Key video frame extraction method, device and storage medium
CN111008280A (en) * 2019-12-04 2020-04-14 北京百度网讯科技有限公司 Video classification method, device, equipment and storage medium
CN111369516A (en) * 2020-03-01 2020-07-03 上海置信电气股份有限公司 Transformer bushing heating defect detection method based on infrared image recognition
CN111723692A (en) * 2020-06-03 2020-09-29 西安交通大学 Near-repetitive video detection method based on label features of convolutional neural network semantic classification

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113627341A (en) * 2021-08-11 2021-11-09 人民中科(济南)智能技术有限公司 Method, system, equipment and storage medium for comparing video samples
CN113627341B (en) * 2021-08-11 2024-04-12 人民中科(济南)智能技术有限公司 Video sample comparison method, system, equipment and storage medium
CN114973060A (en) * 2022-04-22 2022-08-30 山东省计算中心(国家超级计算济南中心) Similarity calculation method and system for mobile video
CN115761611A (en) * 2022-12-18 2023-03-07 四川数聚智造科技有限公司 Multi-stage multi-base image difference filtering method based on image contrast anomaly detection
CN115761611B (en) * 2022-12-18 2023-05-30 四川数聚智造科技有限公司 Multi-stage multi-base image difference filtering method based on image contrast anomaly detection
CN116453245A (en) * 2023-04-20 2023-07-18 东莞市伟创动力科技有限公司 Unlocking management method and system for electronic lock
CN116453245B (en) * 2023-04-20 2023-11-14 东莞市伟创动力科技有限公司 Unlocking management method and system for electronic lock
CN117575862A (en) * 2023-12-11 2024-02-20 广州番禺职业技术学院 Knowledge graph-based student personalized practical training guiding method and system

Similar Documents

Publication Publication Date Title
CN112257595A (en) Video matching method, device, equipment and storage medium
US8358837B2 (en) Apparatus and methods for detecting adult videos
Kim et al. Spatiotemporal saliency detection and its applications in static and dynamic scenes
KR100645300B1 (en) Method and apparatus for summarizing and indexing the contents of an audio-visual presentation
WO2018103608A1 (en) Text detection method, device and storage medium
US8965127B2 (en) Method for segmenting text words in document images
CN107798272B (en) Rapid multi-target detection and tracking system
US10127454B2 (en) Method and an apparatus for the extraction of descriptors from video content, preferably for search and retrieval purpose
CN108881947B (en) Method and device for detecting infringement of live stream
KR102087882B1 (en) Device and method for media stream recognition based on visual image matching
CN112153483B (en) Information implantation area detection method and device and electronic equipment
EP2660753B1 (en) Image processing method and apparatus
CN110795595A (en) Video structured storage method, device, equipment and medium based on edge calculation
CN111651636A (en) Video similar segment searching method and device
CN110460838B (en) Lens switching detection method and device and computer equipment
GB2501224A (en) Generating and comparing video signatures using sets of image features
CN111695540A (en) Video frame identification method, video frame cutting device, electronic equipment and medium
Song et al. A novel image text extraction method based on k-means clustering
CN111222409A (en) Vehicle brand labeling method, device and system
Yusufu et al. A video text detection and tracking system
JP2018124689A (en) Moving body detection device, moving body detection system and moving body detection method
CN109697240B (en) Image retrieval method and device based on features
Guo et al. A method of effective text extraction for complex video scene
Arai et al. Text extraction from TV commercial using blob extraction method
CN112215784B (en) Image decontamination method, image decontamination device, readable storage medium and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination