CN117809061A - Video material matching method based on AIGC - Google Patents

Video material matching method based on AIGC

Info

Publication number
CN117809061A
CN117809061A CN202410001524.6A
Authority
CN
China
Prior art keywords
video
video material
visual
matching
visual features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410001524.6A
Other languages
Chinese (zh)
Inventor
刘巍
沈妍琪
叶佳伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Jiancan Technology Co ltd
Original Assignee
Guangzhou Jiancan Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Jiancan Technology Co ltd filed Critical Guangzhou Jiancan Technology Co ltd
Priority to CN202410001524.6A priority Critical patent/CN117809061A/en
Publication of CN117809061A publication Critical patent/CN117809061A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an AIGC-based video material matching method, in the technical field of video material retrieval and matching, which comprises the following steps: a reference video input by a user is structurally decomposed and primarily matched to obtain a primary matching video material set; a video material prescreening index is established from feature information and quality information and used to screen the matched video materials, which improves the consistency between the main class features obtained by cluster analysis of the matched video materials and the main class features obtained by cluster analysis of the user's reference video; a semantic relation index is then established from text subtitle processing information and visual interaction information to evaluate the degree of semantic relation among the different visual feature main classes, and secondary matching is performed. This increases the success rate of video material matching and improves the user's experience; instead of matching the required video materials solely through one-to-one correspondence of visual features, the matching mechanism is optimized by combining the semantic information in the video, reducing the probability of false matching.

Description

Video material matching method based on AIGC
Technical Field
The invention relates to the technical field of video material retrieval and matching, and in particular to an AIGC-based video material matching method.
Background
Video carries a large amount of information and has a vivid visual effect, and is therefore widely used in many fields; retrieving the video materials a user needs from a huge body of video information is a current research hot spot.
Conventional video material search and matching methods are based on text and keywords: video information is manually annotated with text and keywords, and visual features are extracted to match the required video material. Such methods ignore the rich semantic information in video, such as the social relations between people and the subordination relations between people and scenes, and match the required video material only through one-to-one correspondence of visual features, so the matching results are often unsatisfactory; if two videos are very similar in visual features but semantically unrelated, an incorrect match results.
To address the above defects, the following technical solution is provided.
Disclosure of Invention
In order to overcome the above-mentioned drawbacks of the prior art, embodiments of the present invention provide an AIGC-based video material matching method to solve the problems set forth in the background art.
In order to achieve the above purpose, the present invention provides the following technical solutions:
the AIGC-based video material matching method comprises the following steps:
step S1, carrying out structural decomposition on a reference video input by a user, extracting visual features of the reference video, and carrying out primary matching with video materials in a database according to the extracted visual features;
step S2, a video material primary screening index is established according to the characteristic information and the quality information, the video material primary screening index is compared with a video material primary screening index reference threshold value, and video materials in a video material matching set are screened to obtain a primary screening set;
and S3, performing cluster analysis on the visual features extracted from the reference video and the video materials in the primary screening set to obtain different main class features, establishing semantic relation indexes according to text subtitle processing information and visual interaction information, evaluating semantic relation degrees among different main classes of the visual features, and performing secondary matching on the reference video and the video materials in the primary screening set according to the semantic relation indexes.
In a preferred embodiment, in step S1, the method further comprises the steps of:
step S1-1: performing shot decomposition on the reference video input by the user using a shot detection technique;
step S1-2: extracting key frames from each decomposed shot unit in the shot decomposition set using a key frame extraction technique;
step S1-3: extracting visual features of the reference video from the reference video frames in the key frame set using a feature extraction technique;
step S1-4: performing primary matching of the acquired visual features against the visual feature materials in the database using a feature descriptor matching method, and extracting the video materials in which the successfully matched visual feature materials are located to obtain a video material matching set.
In a preferred embodiment, in step S2, the feature information of the video material includes a visual feature consistency quantized value and a visual feature excess ratio, and the quality information includes a blur metric coefficient of the video material; these are denoted YZ, SZ and MH respectively, normalized, and combined by a formula to obtain the video material prescreening index.
In a preferred embodiment, the acquisition logic of the visual feature consistency quantized value is: if the visual features extracted from the video material fully contain the visual features extracted from the reference video, the visual feature consistency quantized value is 1; otherwise, it is 0.
In a preferred embodiment, the acquisition logic of the visual feature excess ratio is: after the visual features extracted from the video material are compared with those extracted from the reference video, any remaining (excess) visual features are marked as S_i, where i = {1, 2, ..., e} numbers the classes of excess visual features and e is a positive integer; the visual feature excess ratio is then calculated by a formula [formula not reproduced], where ZL is the total number of visual feature classes extracted from the video material.
In a preferred embodiment, the blur metric coefficient MH of the video material is calculated from the average of the per-frame mean square errors [formula not reproduced], where the average is MSE_avg = (MSE_1 + MSE_2 + … + MSE_u)/u, j = {1, 2, ..., u} is the order number of each frame image, u is a positive integer, and MSE_j is the mean square error of the j-th frame image.
In a preferred embodiment, the video material prescreening index is compared to a video material prescreening index reference threshold;
and if the video material primary screening index is greater than or equal to the video material primary screening index reference threshold, deleting the video material to obtain a video material primary screening set.
In a preferred embodiment, in step S3, the text subtitle processing information includes a natural language processing success rate, and the visual interaction information includes a visual feature interaction value; the natural language processing success rate and the visual feature interaction value are normalized, and the semantic relation index is calculated by a formula.
The invention has the technical effects and advantages that:
1. According to the method, the visual features of the reference video input by the user are extracted through structural decomposition and primarily matched against the video materials in the database to obtain a primary matching video material set. A video material prescreening index is established from the feature information and quality information and compared with a reference threshold to screen the video materials in the matching set, which improves the consistency between the main class features obtained by cluster analysis of those materials and the main class features obtained by cluster analysis of the user's reference video. A semantic relation index is then established from the text subtitle processing information and visual interaction information to evaluate the degree of semantic relation among the different visual feature main classes, and secondary matching is performed. This increases the success rate of video material matching and improves the user's experience; rather than matching the required video materials purely through one-to-one correspondence of visual features, the matching mechanism is optimized by combining the semantic information in the video, reducing the probability of false matching.
Drawings
For the convenience of those skilled in the art, the present invention will be further described with reference to the accompanying drawings;
fig. 1 is a schematic structural diagram of embodiment 1 of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
Fig. 1 shows an AIGC-based video material matching method of the present invention, which includes the steps of:
step S1, carrying out structural decomposition on a reference video input by a user, extracting visual features of the reference video, and carrying out primary matching with video materials in a database according to the extracted visual features;
step S2, a video material primary screening index is established according to the characteristic information and the quality information, the video material primary screening index is compared with a video material primary screening index reference threshold value, and video materials in a video material matching set are screened to obtain a primary screening set;
step S3, clustering analysis is carried out on visual features extracted from the reference video and the video materials in the primary screening set respectively to obtain different main class features, semantic relation indexes are established according to text subtitle processing information and visual interaction information, semantic relation degrees among different main classes of the visual features are evaluated, and secondary matching is carried out on the reference video and the video materials in the primary screening set according to the semantic relation indexes;
step S1, carrying out structural decomposition according to a reference video input by a user, and extracting visual characteristics of the reference video, wherein the method specifically comprises the following steps:
step S1-1: performing shot decomposition on a reference video input by a user according to a shot detection technology;
shot detection, also known as shot segmentation, refers to the accurate acquisition of shot boundaries within an entire video by automated processing by a computer. Because of the variety of imaging techniques used in the imaging process, the lens transformations also take many forms. In general, the transformation between two shots that is direct and does not employ any effect editing is called shear or abrupt; the change between the two shots, which is slowly changed from frame to frame by various editing means, is changed into gradual change. The form of gradual changes is rich compared to a single type of shear, such as overlapping, sliding, swiping, fading in, fading out, etc.
Shot decomposition is performed on the reference video input by the user using a shot detection algorithm to obtain a shot decomposition set of the reference video, such as {A_1, A_2, A_3, ...}.
It should be noted that commonly used shot detection algorithms mainly include the pixel-pair comparison method, histogram method, edge method, motion method, compressed-domain method, block-matching method, model method, etc.; the specific shot detection algorithm to use can be chosen according to the actual situation and is not described further here.
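As a hedged illustration of step S1-1, the sketch below implements just one of the algorithms listed above, the histogram method, using OpenCV and Python; the 8-bin colour histogram, the Bhattacharyya distance and the 0.5 threshold are assumptions made for illustration rather than values taken from this disclosure.

```python
import cv2

def detect_shot_boundaries(video_path, threshold=0.5):
    """Histogram-based shot detection: return the frame indices at which a new
    shot is assumed to start. `threshold` is an illustrative tuning parameter."""
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, idx = [0], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None:
            # A large Bhattacharyya distance between consecutive frame
            # histograms is treated as an abrupt transition (a cut).
            if cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA) > threshold:
                boundaries.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return boundaries  # shot k spans frames [boundaries[k], boundaries[k+1])
```

Gradual transitions (dissolves, wipes, fades) generally require a windowed comparison rather than a single frame-to-frame threshold; that refinement is omitted here for brevity.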
Step S1-2: extracting key frames from each decomposed shot unit in the shot decomposition set using a key frame extraction technique;
since the number of shot units in the shot decomposition set is still large after shot decomposition, and a large number of redundant frames exist in each shot, in order to achieve the effect of compressing stored data and reducing the index data amount, a representative frame sequence is required to be extracted from the shot according to a certain rule to represent the content of the shot, and the frame sequence is a so-called key frame.
This embodiment adopts a sub-shot segmentation key frame extraction technique to extract key frames from the different shot units. Sub-shot segmentation treats a shot as a short video and applies shot detection again to split it into several sub-shots according to visual content, each sub-shot representing a relatively stable section of content within the shot. One frame image is then extracted from each sub-shot as a key frame to describe the main content of that sub-shot. When the internal variation of a shot is small, the shot is a single sub-shot and only one key frame is extracted; when the internal variation is large, the shot is divided into several sub-shots and several key frames are extracted, giving a key frame set such as {B_1, B_2, B_3, ...}.
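A minimal sketch of the sub-shot idea described above, assuming the same kind of histogram distance is reused to split a shot into sub-shots and that the middle frame of each sub-shot is kept as its key frame; the 0.3 threshold is an assumed parameter, not one fixed by this description.

```python
import cv2

def extract_key_frames(shot_frames, sub_threshold=0.3):
    """Split one shot (a list of decoded frames) into sub-shots and keep the
    middle frame of each sub-shot as its key frame (sub-shot segmentation)."""
    sub_shots, current, prev_hist = [], [], None
    for frame in shot_frames:
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None and \
                cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA) > sub_threshold:
            sub_shots.append(current)  # visual content changed: start a new sub-shot
            current = []
        current.append(frame)
        prev_hist = hist
    if current:
        sub_shots.append(current)
    return [sub[len(sub) // 2] for sub in sub_shots]  # one key frame per sub-shot
```

A shot with little internal variation yields a single sub-shot and therefore a single key frame, matching the behaviour described above.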
Step S1-3: extracting visual features of the reference video by adopting a feature extraction technology according to the reference video frames in the key frame set;
the content of the reference video is based on the content of the image of the video frame, the content of the image is represented by various features of the image, and the feature extraction technology refers to a process of extracting key information or features from the video frame, wherein the features can be used for describing, characterizing and identifying the content of the video, are usually quantifiable attributes in the video, and help a computer understand and process the image or video data, and common feature extraction technologies are based on object detection and segmentation technologies, a deep learning network model and the like, and can realize feature extraction of the image of the video frame.
Features are extracted from the reference video frames in the key frame set using a feature extraction technique to obtain a visual feature set such as {C_1, C_2, C_3, ...}, where the visual features in the set are derived from the content of the given reference video and are not described in detail here.
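As one concrete and freely substitutable example of the feature extraction technique mentioned above, the sketch below computes ORB keypoint descriptors with OpenCV; a deep-learning embedding could be used instead, and the 500-keypoint limit is an assumption.

```python
import cv2

def extract_visual_features(key_frames):
    """Return one ORB descriptor matrix (shape (n_keypoints, 32)) per key frame."""
    orb = cv2.ORB_create(nfeatures=500)
    features = []
    for frame in key_frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        _, descriptors = orb.detectAndCompute(gray, None)
        if descriptors is not None:  # frames with no detectable keypoints are skipped
            features.append(descriptors)
    return features
```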
Step S1-4: performing primary matching of the acquired visual features against the visual feature materials in the database using a feature descriptor matching method, and extracting the video materials in which the successfully matched visual feature materials are located;
it should be noted that, feature descriptor matching is a one-to-one feature description matching method, and feature descriptors refer to numerical values used for representing visual features or representing visual features in a vectorization manner, and generally represent local information around visual feature points, such as pixel intensities, gradient directions, color histograms, and the like around visual feature points.
The similarity measure between feature descriptors is used to compare the descriptors of two features to determine the similarity or degree of match between them. Common similarity measurement methods include:
euclidean distance: the euclidean distance between two descriptors, i.e. the spatial distance between them, is calculated. The smaller the distance, the higher the similarity.
Hamming distance: for binary descriptors, such as BRIEF or ORB, hamming distances can be used to compare the differences between them. Hamming distance represents the number of different bits between two binary strings.
Cosine similarity: the cosine similarity between the two descriptors is calculated, representing the angle between them. Cosine similarity is used in some cases to measure similarity between features.
Using these similarity measures, a video material matching set with high similarity to the reference video is obtained, such as {D_1, D_2, D_3, ...}.
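A hedged sketch of the one-to-one descriptor matching: binary ORB descriptors are compared with the Hamming distance via a brute-force matcher, and the fraction of reference descriptors that find a close match is used as a crude similarity score. The distance cut-off of 40 and the idea of scoring by match fraction are illustrative assumptions.

```python
import cv2

def descriptor_match_score(desc_ref, desc_material, max_hamming=40):
    """One-to-one matching of binary descriptors using the Hamming distance.

    Returns the fraction of reference descriptors matched within `max_hamming`
    (an assumed acceptance threshold); higher means more visually similar.
    """
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(desc_ref, desc_material)
    good = [m for m in matches if m.distance < max_hamming]
    return len(good) / max(len(desc_ref), 1)
```

Video materials whose best key-frame score exceeds an assumed cut-off would then form the primary matching set {D_1, D_2, D_3, ...}.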
Because the similarity measures in step S1-4 only match features one to one, the visual features of the matched video materials obtained in step S1-4 may not be highly consistent with those of the reference video, and the quality of the matched materials is not controlled. Therefore, in step S2 this embodiment establishes a video material primary screening index from the feature information and quality information, compares it with a video material primary screening index reference threshold, and screens the video materials in the video material matching set to obtain a primary screening set. The specific steps are as follows:
acquiring the feature information and quality information of the video materials;
the feature information of the video material includes a visual feature consistency quantized value and a visual feature excess ratio, and the quality information includes a blur metric coefficient of the video material;
the visual feature consistency quantized value, the visual feature excess ratio and the blur metric coefficient of the video material are denoted YZ, SZ and MH respectively;
the video material primary screening index is then calculated by normalizing YZ, SZ and MH and combining them by a formula [formula not reproduced], where SP is the video material primary screening index and a_1, a_2 are the preset scaling factors of the visual feature excess ratio and the blur metric coefficient respectively, with a_1, a_2 both greater than 0.
The visual feature consistency quantized value indicates whether the visual features extracted from the video material fully contain the visual features extracted from the reference video; the extraction of visual features from the video material follows the same process as for the reference video and is not repeated here. If the visual features extracted from the video material fully contain the visual features extracted from the reference video, the visual feature consistency quantized value is 1; otherwise it is 0.
After the visual features extracted from the video material are compared with those extracted from the reference video, any remaining (excess) visual features are marked as S_i, where i = {1, 2, ..., e} numbers the classes of excess visual features and e is a positive integer; the visual feature excess ratio is then calculated by a formula [formula not reproduced], where ZL is the total number of visual feature classes extracted from the video material.
The higher the visual feature excess ratio, the more interfering visual features the video material contains and the higher its primary screening index, so the more likely the video is to be screened out; conversely, the lower the excess ratio, the fewer interfering features, the lower the primary screening index, and the less likely the video is to be screened out.
The blur metric coefficient of the video material measures the degree of blur of the images in the video: the higher the blur, the worse the quality of the video material and the higher its primary screening index, so the more likely the video is to be screened out; conversely, the lower the blur, the better the quality, the lower the index, and the less likely the video is to be screened out. The acquisition logic of the blur metric coefficient is as follows: each frame image is extracted from the video material and preprocessed with an image preprocessing method, giving two images per frame, denoted I (the original image) and I' (the processed image); the mean square error MSE is used to represent the blur value of each frame. Specifically, for each corresponding pixel position (x, y) the difference (error) between them is calculated, E(x, y) = I(x, y) - I'(x, y); each difference is squared, E^2(x, y) = [I(x, y) - I'(x, y)]^2; and the squared errors of all pixels are summed and averaged to obtain the mean square error, MSE = (1/N) * Σ_x Σ_y [I(x, y) - I'(x, y)]^2, where N is the total number of pixels of the image and W and H are its width and height (N = W * H), with x and y ranging over the width and height respectively. The blur metric coefficient MH of the video material is then calculated from the mean square error of each frame image [formula not reproduced], where MSE_avg = (MSE_1 + MSE_2 + … + MSE_u)/u is the average of the per-frame mean square errors, j = {1, 2, ..., u} is the order number of each frame image, u is a positive integer, and MSE_j is the mean square error of the j-th frame image.
The video material primary screening index is compared with the video material primary screening index reference threshold, and the video materials in the video material matching set are screened:
if the video material primary screening index is greater than or equal to the reference threshold, the visual features of the video material differ considerably in consistency from those of the reference video and the quality of the material is relatively poor, so the video material is deleted, yielding a video material primary screening set F_p, where p is the serial number of a video material in the primary screening set, p = {1, 2, ..., v}, v a positive integer;
if the video material primary screening index is smaller than the reference threshold, the video material remains highly consistent with the reference video in visual features, which improves the agreement between the main class features obtained after cluster analysis of the two videos, and the quality of the video material is high, so the video material is retained.
In step S3, clustering analysis is carried out on the visual features extracted from the reference video and the video materials in the primary screening set respectively to obtain different main class features, semantic relation indexes are established according to text subtitle processing information and visual interaction information, semantic relation degrees among different main classes of the visual features are evaluated, and secondary matching is carried out on the reference video and the video materials in the primary screening set according to the semantic relation indexes;
it should be noted that, the method of performing cluster analysis on the visual features in the reference video and the method of performing cluster analysis on the visual features of the video material in the primary screening set both adopt a K-means cluster analysis method, so that the embodiment only introduces the cluster analysis process of the visual features in the reference video;
step S3-1: the method comprises the following steps of performing cluster analysis on visual features extracted from a reference video by adopting a K-means clustering analysis method to obtain main class features of different types:
from a set of visual features of a reference video { C 1 ,C 2 ,C 3 ...C M }(C 1 ,C 2 ,C 3 ...C M Representing different visual feature objects, respectively) selects K visual feature objects as initial cluster centers { C' 1 ,C' 2 ,C' 3 ...C' K The visual characteristics as an initial cluster center need to be able to be representative, and the present embodiment provides some references such as old people, children, young people, vehicles, schools, office buildings, and the like.
Calculating the distance between each visual feature object and the clustering center in the visual feature set of the reference video, and then distributing each visual feature object to the class of the closest clustering center according to the similarity judgment criterion of Euclidean distance;
and calculating the average value of all visual characteristic objects contained in each class as a new clustering center of the class.
Repeating the above steps until the cluster centers no longer change, i.e. the clustering objective function converges and the optimal clustering result is reached. The clustering objective function takes the form of the error sum-of-squares criterion: J = Σ_{α=1..K} Σ_{C_β ∈ N_α} ||C_β - C_α||^2, where C_β denotes the visual feature objects in each class after cluster analysis, C_α denotes the cluster center of the α-th class, K is the number of clusters, N_α is the set of visual feature objects contained in class α, and J is the sum of squared errors of all objects in the data set.
The whole cluster assignment process then ends, yielding K visual feature main classes.
step S3-2: establishing semantic relation indexes according to text subtitle processing information and visual interaction information, and evaluating semantic relation degrees among different visual feature main classes;
the text subtitle processing information comprises a natural language processing success rate, and the visual interaction information comprises a visual characteristic interaction value;
marking the success rate of natural language processing and the interaction value of visual features as CG and IH respectively;
the natural language processing success rate and the visual characteristic interaction value are subjected to normalization processing, and semantic relation indexes are calculated, wherein the expression is as follows: yi=ln (b 1 *CG+b 2 * IH), where YI is the semantic association index, b 1 、b 2 The preset proportionality coefficients are respectively the success rate of natural language processing and the interaction value of visual characteristics, and b 1 、b 2 Are all greater than 0.
Natural language processing is used to extract text information from the reference video, including subtitle text, dialogue text and bullet-screen comment text, convert it into processable text data, extract features from the text data, and turn the text information into numerical feature vectors. Common feature extraction methods include bag-of-words models and word embeddings such as Word2Vec and GloVe. Whether there is a relation between different visual features is then assessed by building a relation model.
The natural language processing success rate is the proportion of texts for which relations between different visual features are successfully determined after multiple pieces of text information are processed with natural language processing, calculated as CG = Z_1/Z_2, where Z_1 is the number of texts for which a relation between different visual features is successfully determined by natural language processing and Z_2 is the total number of texts processed by natural language processing.
The higher the natural language processing success rate, the closer the relations between different visual features and the higher the semantic relation index; conversely, the lower the success rate, the weaker the relations and the lower the semantic relation index.
The visual feature interaction value is obtained by using object detection and tracking and computer vision techniques to determine whether interaction relations exist among different visual features and accumulating the number of interactions; object detection and tracking and computer vision are mature existing technologies in the field and are not described here.
Likewise, the higher the visual feature interaction value, the closer the relations between different visual features and the higher the semantic relation index; the lower the interaction value, the weaker the relations and the lower the semantic relation index.
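Following the expression YI = ln(b_1*CG + b_2*IH) given above, the sketch below computes the index from the text-processing counts and an interaction count; since the text does not spell out how the interaction value is normalized, a simple min-max scaling by an assumed maximum count is used here.

```python
import math

def semantic_relation_index(z1, z2, interaction_count, max_interactions, b1=1.0, b2=1.0):
    """YI = ln(b1*CG + b2*IH) with CG = Z1/Z2 and a min-max-normalised IH.

    The normalisation of the interaction count is an assumption; b1, b2 > 0
    are the preset proportionality coefficients.
    """
    cg = z1 / z2 if z2 else 0.0                                 # NLP success rate CG
    ih = min(interaction_count / max_interactions, 1.0) if max_interactions else 0.0
    value = b1 * cg + b2 * ih
    return math.log(value) if value > 0 else float("-inf")      # guard against ln(0)
```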
Step S3-3: performing secondary matching on the reference video and the video materials in the primary screening set by adopting a similarity matching algorithm based on the semantic relation index, and outputting a matching result for selection of a user;
it should be noted that, the selection of the similarity matching algorithm is determined by those skilled in the art according to actual use situations, and common similarity matching algorithms include cosine similarity, euclidean distance, jaccard similarity, pearson correlation coefficient, and the like.
According to the method, the visual features of the reference video input by the user are extracted through structural decomposition and primarily matched against the video materials in the database to obtain a primary matching video material set. A video material prescreening index is established from the feature information and quality information and compared with a reference threshold to screen the video materials in the matching set, which improves the consistency between the main class features obtained by cluster analysis of those materials and the main class features obtained by cluster analysis of the user's reference video. A semantic relation index is then established from the text subtitle processing information and visual interaction information to evaluate the degree of semantic relation among the different visual feature main classes, and secondary matching is performed. This increases the success rate of video material matching and improves the user's experience; rather than matching the required video materials purely through one-to-one correspondence of visual features, the matching mechanism is optimized by combining the semantic information in the video, reducing the probability of false matching.
All of the above formulas are dimensionless and operate on numerical values; they were obtained by collecting a large amount of data for software simulation so as to approximate the real situation, and the preset parameters in the formulas are set by those skilled in the art according to the actual situation.
It should be understood that, in various embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present application.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (8)

1. An AIGC-based video material matching method, characterized by comprising the following steps:
step S1, carrying out structural decomposition on a reference video input by a user, extracting visual features of the reference video, and carrying out primary matching with video materials in a database according to the extracted visual features;
step S2, a video material primary screening index is established according to the characteristic information and the quality information, the video material primary screening index is compared with a video material primary screening index reference threshold value, and video materials in a video material matching set are screened to obtain a primary screening set;
and S3, performing cluster analysis on the visual features extracted from the reference video and the video materials in the primary screening set to obtain different main class features, establishing semantic relation indexes according to text subtitle processing information and visual interaction information, evaluating semantic relation degrees among different main classes of the visual features, and performing secondary matching on the reference video and the video materials in the primary screening set according to the semantic relation indexes.
2. The AIGC-based video material matching method of claim 1, wherein: in step S1, the method further includes the following steps:
step S1-1: performing shot decomposition on the reference video input by the user using a shot detection technique;
step S1-2: extracting key frames from each decomposed shot unit in the shot decomposition set using a key frame extraction technique;
step S1-3: extracting visual features of the reference video from the reference video frames in the key frame set using a feature extraction technique;
step S1-4: performing primary matching of the acquired visual features against the visual feature materials in the database using a feature descriptor matching method, and extracting the video materials in which the successfully matched visual feature materials are located to obtain a video material matching set.
3. The AIGC-based video material matching method of claim 1, wherein: in step S2, the feature information of the video material includes a visual feature consistency quantized value and a visual feature excess ratio, and the quality information includes a blur metric coefficient of the video material; the visual feature consistency quantized value, the visual feature excess ratio and the blur metric coefficient are denoted YZ, SZ and MH respectively, normalized, and combined by a formula to obtain the video material prescreening index.
4. The AIGC-based video material matching method of claim 3, wherein: the acquisition logic of the visual feature consistency quantized value is: if the visual features extracted from the video material fully contain the visual features extracted from the reference video, the visual feature consistency quantized value is 1; otherwise, it is 0.
5. The AIGC-based video material matching method of claim 4, wherein:
the acquisition logic of the visual feature excess ratio is: after the visual features extracted from the video material are compared with those extracted from the reference video, any remaining (excess) visual features are marked as S_i, where i = {1, 2, ..., e} numbers the classes of excess visual features and e is a positive integer; the visual feature excess ratio is then calculated by a formula [formula not reproduced], where ZL is the total number of visual feature classes extracted from the video material.
6. The AIGC-based video material matching method of claim 5, wherein: the blur metric coefficient MH of the video material is calculated from the average of the per-frame mean square errors [formula not reproduced], where the average is MSE_avg = (MSE_1 + MSE_2 + … + MSE_u)/u, j = {1, 2, ..., u} is the order number of each frame image, u is a positive integer, and MSE_j is the mean square error of the j-th frame image.
7. The AIGC-based video material matching method of claim 3, wherein: comparing the video material prescreening index with a video material prescreening index reference threshold;
and if the video material primary screening index is greater than or equal to the video material primary screening index reference threshold, deleting the video material to obtain a video material primary screening set.
8. The AIGC-based video material matching method of claim 1, wherein: in step S3, the text subtitle processing information includes a natural language processing success rate, and the visual interaction information includes a visual feature interaction value; the natural language processing success rate and the visual feature interaction value are normalized, and the semantic relation index is calculated by a formula.
CN202410001524.6A 2024-01-02 2024-01-02 Video material matching method based on AIGC Pending CN117809061A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410001524.6A CN117809061A (en) 2024-01-02 2024-01-02 Video material matching method based on AIGC

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410001524.6A CN117809061A (en) 2024-01-02 2024-01-02 Video material matching method based on AIGC

Publications (1)

Publication Number Publication Date
CN117809061A true CN117809061A (en) 2024-04-02

Family

ID=90419821

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410001524.6A Pending CN117809061A (en) 2024-01-02 2024-01-02 Video material matching method based on AIGC

Country Status (1)

Country Link
CN (1) CN117809061A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210349940A1 (en) * 2019-06-17 2021-11-11 Tencent Technology (Shenzhen) Company Limited Video clip positioning method and apparatus, computer device, and storage medium
CN112015949A (en) * 2020-08-26 2020-12-01 腾讯科技(上海)有限公司 Video generation method and device, storage medium and electronic equipment
CN114677190A (en) * 2020-12-24 2022-06-28 阿里巴巴集团控股有限公司 Method, device and system for processing visual material and computer terminal
CN112560832A (en) * 2021-03-01 2021-03-26 腾讯科技(深圳)有限公司 Video fingerprint generation method, video matching method, video fingerprint generation device and video matching device and computer equipment
CN114005077A (en) * 2021-12-30 2022-02-01 浙江大学 Audience evaluation data driven silent product video creation auxiliary method and device
US20230316753A1 (en) * 2022-03-30 2023-10-05 Microsoft Technology Licensing, Llc Textless material scene matching in videos

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
赵亚琴 (Zhao Yaqin): "Research on Content-Based Video Clip Retrieval Technology" (基于内容的视频片段检索技术研究), China Doctoral Dissertations Electronic Journals Network, no. 6, 15 June 2007 (2007-06-15), pages 76-84 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination