CN114049591A - Method and device for acquiring video material, storage medium and electronic equipment - Google Patents

Method and device for acquiring video material, storage medium and electronic equipment

Info

Publication number
CN114049591A
Authority
CN
China
Prior art keywords
video
candidate
target
feature
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111350557.4A
Other languages
Chinese (zh)
Inventor
管泽辉
周越夫
白刚
邓昊元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd
Priority to CN202111350557.4A
Publication of CN114049591A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods


Abstract

The disclosure relates to a method, an apparatus, a storage medium, and an electronic device for acquiring video material. A video frame picture of a target video is extracted; video features corresponding to the target video and picture features corresponding to the video frame picture are extracted; the feature similarity between each candidate video and the target video is calculated according to the video features and the picture features; the candidate videos are ranked based on the feature similarity, and a target candidate video of the same type as the target video is determined from the candidate videos according to the ranking result; and the target candidate video is acquired.

Description

Method and device for acquiring video material, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of material recall, and in particular, to a method, an apparatus, a storage medium, and an electronic device for acquiring video material.
Background
In the field of advertisement delivery, to achieve a better delivery effect, materials corresponding to an advertisement type that performs well can be searched for and used to produce creative advertisements; that is, similar materials belonging to the same vertical category as a seed material are recalled from a material library.
For the recall of video materials, existing similar-recall methods generally retrieve using the cover frame or a single video frame. However, the picture features of a single video frame are limited, and neither the cover frame nor any single frame can represent the whole video, so the recall effect for video materials is poor.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a method of acquiring video material, the method comprising:
extracting a video frame picture of a target video;
extracting video features corresponding to the target video and picture features corresponding to the video frame picture;
calculating the feature similarity of each candidate video and the target video according to the video features and the picture features;
sorting the candidate videos based on the feature similarity, and determining a target candidate video of the same type as the target video from the plurality of candidate videos according to the sorting result;
and acquiring the target candidate video.
In a second aspect, the present disclosure provides an apparatus for acquiring video material, the apparatus comprising:
the frame extraction module is used for extracting video frame pictures of the target video;
the feature extraction module is used for extracting video features corresponding to the target video and picture features corresponding to the video frame pictures;
the similarity calculation module is used for calculating the feature similarity of each candidate video and the target video according to the video features and the picture features;
the determining module is used for sorting the candidate videos based on the feature similarity and determining a target candidate video of the same type as the target video from the plurality of candidate videos according to the sorting result;
and the material acquisition module is used for acquiring the target candidate video.
In a third aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which, when executed by a processing apparatus, performs the steps of the method of the first aspect of the present disclosure.
In a fourth aspect, an electronic device is provided, comprising:
a storage device having a computer program stored thereon;
processing means for executing the computer program in the storage means to implement the steps of the method of the first aspect of the present disclosure.
Through the above technical solution, a video frame picture of the target video is extracted; video features corresponding to the target video and picture features corresponding to the video frame picture are extracted; the feature similarity between each candidate video and the target video is calculated according to the video features and the picture features; the candidate videos are ranked based on the feature similarity, a target candidate video of the same type as the target video is determined from the candidate videos according to the ranking result, and the target candidate video is acquired. Since the video features better represent the content of the whole video, introducing the video features of the target video and using both the video features and the picture features as the recall basis for the target candidate video can improve the recall accuracy and recall efficiency of video materials.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flow diagram illustrating a method of obtaining video material in accordance with an exemplary embodiment;
FIG. 2 is a flow diagram illustrating a method of obtaining video material in accordance with an exemplary embodiment;
FIG. 3 is a flow diagram illustrating a method of obtaining video material in accordance with an exemplary embodiment;
FIG. 4 is a block diagram illustrating an apparatus for acquiring video material in accordance with one illustrative embodiment;
fig. 5 is a block diagram illustrating an apparatus for acquiring video material in accordance with an exemplary embodiment;
fig. 6 is a block diagram illustrating an apparatus for acquiring video material in accordance with an exemplary embodiment;
fig. 7 is a block diagram illustrating an apparatus for acquiring video material in accordance with an exemplary embodiment;
fig. 8 is a block diagram illustrating a structure of an electronic device according to an example embodiment.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Specific embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 1 is a flow chart illustrating a method of acquiring video material, as shown in fig. 1, according to an exemplary embodiment, the method comprising the steps of:
in step S101, video frame pictures of the target video are extracted.
The target video is a video material serving as the seed material. The video frame picture may include the cover frame picture of the target video or a preset video frame picture other than the cover frame (for example, the first frame of the target video). In an actual material recall scenario, a user may choose, according to actual business requirements, to extract picture features from the cover frame picture or from a preset video frame picture other than the cover frame.
When determining the cover frame picture of the target video, in one possible implementation, picture features may be extracted for each video frame picture of the target video and the frames scored according to those features, with the highest-scoring frame taken as the cover frame picture; the specific implementation details can be found in the relevant literature and are not repeated here.
In step S102, video features corresponding to the target video and picture features corresponding to the video frame picture are extracted.
The video features refer to features extracted for each video segment of the target video.
In practical application scenarios, the cover frame picture of a short video can generally express the rough content of the video; for short videos in categories such as beauty, the content of the video can be understood from the cover frame picture alone. Therefore, for short-video material recall, extracting the picture features of the cover frame picture achieves a good recall effect. For material recall of action-focused video types such as dancing or horse riding, however, using only the picture features yields poor recall accuracy.
In the step of extracting the picture features corresponding to the video frame picture of the target video, the cover frame picture of the target video, or a preset video frame other than the cover frame, can be input into a pre-trained picture feature extraction model to obtain the picture features.
Here, the picture feature extraction model may include, for example, a ResNet-50 model or a ResNet-101 model.
For example, the cover frame picture of the target video may be input into a pre-trained ResNet-50 model. Considering that ResNet-50 is a classification model, the output of its output layer (generally the last layer of the model structure) is a classification result, whereas the present disclosure aims to extract the picture features of the target video based on the picture feature extraction model. Therefore, the network output of the second-to-last layer of the ResNet-50 model structure may be used as the extracted picture features. This is only an example, and the present disclosure is not limited thereto.
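As an illustration only, and not as part of the claimed method, the following minimal sketch shows one way to take the penultimate-layer output of a pretrained ResNet-50 as a picture feature; the weights, preprocessing, and 2048-dimensional output are assumptions of the torchvision implementation, not details from this disclosure.

```python
# Sketch: penultimate-layer output of ResNet-50 as a picture feature.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = torch.nn.Identity()  # drop the classification head, keep the 2048-d pooled output
model.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def picture_feature(frame_path: str) -> torch.Tensor:
    img = preprocess(Image.open(frame_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return model(img).squeeze(0)  # 2048-dim picture feature vector
```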
In the process of extracting the video features corresponding to the target video, the target video can be divided into a plurality of video segments according to a preset step size, and each video segment is input into a pre-trained video feature extraction model to obtain the video features corresponding to that segment.
The video feature extraction model may include, for example, a ResNet-3D model or a SlowFast model. Taking the ResNet-3D model as an example, it is also a classification model, so in the process of extracting the video features of the target video based on the ResNet-3D model, the network output of the second-to-last layer of the model structure may be used as the video features.
In addition, in the step of dividing the target video into a plurality of video segments according to a preset step size, to ensure that complete and sufficient video features can be extracted, every two adjacent video segments may overlap by a preset duration. For example, with 4 video frames sampled per second, 16 frames may be taken as one video segment and the preset step size set to 2 seconds; that is, a new segment starts every 2 seconds and every two adjacent segments overlap by 2 seconds, so the first video segment includes frames 1 to 16 of the target video, the second video segment includes frames 9 to 24, the third video segment includes frames 17 to 32, and so on, dividing the target video into a plurality of video segments. This is only an example, and the disclosure is not limited thereto.
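A minimal sketch of the overlapping-segment split described above, assuming 4 sampled frames per second, 16 frames per segment, and a 2-second step (so adjacent segments overlap by 8 frames); all parameter values are merely the example values from the text.

```python
# Sketch: split sampled frame indices into overlapping segments.
def split_into_segments(total_frames, frames_per_segment=16, fps_sampled=4, step_seconds=2):
    step_frames = fps_sampled * step_seconds  # 8 frames between segment starts
    segments = []
    start = 0
    while start + frames_per_segment <= total_frames:
        segments.append(range(start, start + frames_per_segment))  # 0-indexed frame indices
        start += step_frames
    return segments

# e.g. 40 sampled frames -> segments [0..15], [8..23], [16..31], [24..39]
```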
In step S103, a feature similarity between each candidate video and the target video is calculated according to the video features and the picture features.
The feature similarity may include a picture feature similarity calculated based on the picture features, a video feature similarity calculated based on the video features, or a fusion feature similarity calculated based on fusion features (features obtained by fusing the video features and the picture features of the same video material).
In step S104, the candidate videos are ranked based on the feature similarity, and a target candidate video that is of the same type as the target video is determined from the candidate videos according to a ranking result.
Here, the target candidate video belonging to the same type as the target video refers to a video material belonging to the same vertical type as the target video.
In one possible implementation, the plurality of candidate videos may be all preset candidate videos in the video material library. In this case, the target candidate video may be determined from the candidate videos as follows: calculate the picture feature similarity between each preset candidate video and the target video according to the picture features, sort the preset candidate videos in descending order of picture feature similarity, and select the first preset number of preset candidate videos as a first candidate video set according to the sorting result; calculate the video feature similarity between each preset candidate video and the target video according to the video features, sort the preset candidate videos in descending order of video feature similarity, and select the second preset number of preset candidate videos as a second candidate video set according to the sorting result; and determine the target candidate video from the first candidate video set and the second candidate video set.
In the present disclosure, the feature similarity between each candidate video and the target video, including the picture feature similarity or the video feature similarity, may be calculated through the cosine similarity formula (1) or the L2 distance formula (2) below:

similarity(s, c) = (f_s · f_c) / (||f_s|| · ||f_c||)    (1)

similarity(s, c) = ||f_s - f_c||_2    (2)

where s denotes the target video serving as the seed material and c denotes any candidate video. When the picture feature similarity is calculated by these formulas, f_s denotes the picture features of the target video and f_c denotes the picture features of the candidate video; when the video feature similarity is calculated, f_s denotes the video features of the target video and f_c denotes the video features of the candidate video.
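For illustration only, formulas (1) and (2) can be written directly with NumPy; this sketch is not from the disclosure itself.

```python
# Sketch of formulas (1) and (2); f_s and f_c are the feature vectors
# of the seed video s and a candidate video c.
import numpy as np

def cosine_similarity(f_s: np.ndarray, f_c: np.ndarray) -> float:  # formula (1)
    return float(np.dot(f_s, f_c) / (np.linalg.norm(f_s) * np.linalg.norm(f_c)))

def l2_distance(f_s: np.ndarray, f_c: np.ndarray) -> float:        # formula (2)
    return float(np.linalg.norm(f_s - f_c))
```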
The target candidate video may then be determined from the first candidate video set and the second candidate video set as follows: rank all candidate videos in the two sets in descending order of feature similarity, where the feature similarity includes the picture feature similarity or the video feature similarity, and select the fourth preset number of candidate videos as the target candidate videos according to the ranking result.
For example, assume the first candidate video set includes six candidate videos a, b, c, d, e, and f with picture feature similarities (a, 0.8), (b, 0.75), (c, 0.6), (d, 0.92), (e, 0.95), and (f, 0.85), and the second candidate video set includes six candidate videos g, h, i, j, k, and l with video feature similarities (g, 0.99), (h, 0.96), (i, 0.83), (j, 0.97), (k, 0.72), and (l, 0.65). Ranking all candidate videos in the two sets in descending order of feature similarity gives: video g, video j, video h, video e, video d, video f, video i, video a, video b, video k, video l, video c. The top 5 candidate videos (that is, the fourth preset number is 5) are selected as the target candidate videos, namely videos g, j, h, e, and d.
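A sketch of this merge-and-rank step using the example figures above; the set contents and the choice of 5 for the fourth preset number come from the example, everything else is illustrative.

```python
# Sketch: pool the two candidate sets, sort by the similarity each entry
# carries, and keep the top-k (the "fourth preset number").
first_set  = [("a", 0.80), ("b", 0.75), ("c", 0.60), ("d", 0.92), ("e", 0.95), ("f", 0.85)]
second_set = [("g", 0.99), ("h", 0.96), ("i", 0.83), ("j", 0.97), ("k", 0.72), ("l", 0.65)]

merged = sorted(first_set + second_set, key=lambda v: v[1], reverse=True)
target_candidates = [name for name, _ in merged[:5]]
print(target_candidates)  # ['g', 'j', 'h', 'e', 'd']
```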
Considering that the above manner requires traversing all preset candidate videos in the video material library, that is, calculating the similarity between each preset candidate video and the target video, the recall cost is high and the recall efficiency is low. To improve the recall efficiency of video materials, the plurality of candidate videos may instead be the traversed candidate videos determined after traversing a pre-established HNSW (Hierarchical Navigable Small World) recall tree according to a preset recall strategy, where the HNSW recall tree includes a first HNSW recall tree or a second HNSW recall tree. The first HNSW recall tree is pre-established according to the picture feature similarity between every two candidate videos; the second HNSW recall tree is pre-established according to the video feature similarity between every two candidate videos. In another possible implementation of this step, the target candidate video may therefore be determined from the plurality of candidate videos as follows: based on the preset recall strategy, calculate the picture feature similarity between each traversed candidate video and the target video through the first HNSW recall tree according to the picture features, and determine a first candidate video set from the candidate videos according to the picture feature similarity; then, based on the preset recall strategy, calculate the video feature similarity between each traversed candidate video and the target video through the second HNSW recall tree according to the video features, and determine a second candidate video set from the traversed candidate videos according to the video feature similarity; the target candidate video may then be determined from the first candidate video set and the second candidate video set.
Wherein, the first HNSW recall tree and the second HNSW recall tree both include a plurality of nodes, and the preset recall policy specifically includes:
executing a preset traversal step until the target HNSW recall tree is traversed;
the preset traversing step comprises the following steps: acquiring a preset node corresponding to the target HNSW recall tree; determining target nodes from the target HNSW recall tree according to the preset nodes, wherein the target nodes comprise the preset nodes and nodes connected with the preset nodes; respectively calculating the feature similarity of each candidate video to be determined and the target video according to the target features, wherein the candidate video to be determined is the candidate video corresponding to the target node; taking the node corresponding to the candidate video with the highest feature similarity as a new preset node; determining whether a node connected with the new preset node exists in the target HNSW recall tree;
under the condition that the node connected with the new preset node exists in the target HNSW recall tree, re-executing the preset traversal step, and under the condition that the node connected with the new preset node does not exist in the target HNSW recall tree, determining that the traversal of the target HNSW recall tree is finished; under the condition that the target HNSW recall tree is traversed, sequencing traversed candidate videos corresponding to traversed nodes according to the sequence of the feature similarity from high to low, and selecting a third preset number of traversed candidate videos as a specific candidate video set according to a sequencing result;
wherein the target feature comprises the picture feature or the video feature, and if the target feature is the picture feature, the target HNSW recall tree is the first HNSW recall tree, the feature similarity is a picture feature similarity, and the specific candidate video set is the first candidate video set; and if the target feature is the video feature, the target HNSW recall tree is the second HNSW recall tree, the feature similarity is a video feature similarity, and the specific candidate video set is the second candidate video set.
In the present embodiment, each feature similarity may be calculated by the above formula (1) or formula (2).
A specific implementation process for determining a first candidate video set from a plurality of candidate videos through a pre-established first HNSW recall tree according to the picture features is described as follows by way of example:
assuming that the preset node randomly generated by the HNSW algorithm is node 1 in the first HNSW recall tree, and the node 1 is directly connected with four nodes 2, 3, 4, and 5 in the first HNSW recall tree, at this time, the picture feature similarity between the candidate video to be determined and the target video corresponding to the nodes 1, 2, 3, 4, and 5, respectively, may be calculated according to the picture feature, then the node with the highest picture feature similarity is selected from the nodes 2, 3, 4, and 5 as a new preset node (assumed as node 3), then the first HNSW recall tree is continuously traversed, the node directly connected to the new preset node is determined from the first HNSW recall tree, assumed as nodes 6, 7, and 8, and the picture feature similarity between the candidate video corresponding to each node 6, 7, and 8 and the target video is recalculated according to the picture feature similarity at node 6, and node 8, 7. 8, selecting a node with the highest picture feature similarity as a new preset node, and re-executing the first traversal step under the condition that a node connected with the new preset node exists in the first HNSW recall tree, until the first HNSW recall tree is determined to be traversed under the condition that the node connected with the new preset node does not exist in the first HNSW recall tree; at this time, the candidate videos corresponding to the traversed nodes may be sorted according to the sequence of the similarity of the image features from high to low; selecting a first preset number of candidate videos to form the first candidate video set according to the sorting result, for example, determining that the traversed nodes include 100 nodes, namely node 1, node 2, after traversing the first HNSW recall tree, wherein determining the nodes located at the top 5 bits after sorting according to the sequence of the picture feature similarity from high to low includes: the node 12, the node 15, the node 23, the node 50, and the node 55, at this time, it may be determined that the first candidate video set includes candidate videos corresponding to the node 12, the node 15, the node 23, the node 50, and the node 55 in a plurality of candidate videos, which is only an example and is not limited in this disclosure.
In addition, the second HNSW recall tree also includes a plurality of nodes, each node corresponds to one candidate video, and a specific implementation manner of determining the second candidate video set from the candidate videos based on the second HNSW recall tree is similar to a specific implementation manner of determining the first candidate video set from the candidate videos based on the first HNSW recall tree, which is not illustrated here.
It should be understood that the similarity between two videos is a number greater than 0 and less than or equal to 1; a similarity greater than 1 has no practical meaning and cannot be used to compare how similar two videos are. Therefore, to avoid the picture feature similarity or video feature similarity calculated by formula (1) or formula (2) above exceeding 1, in one possible implementation of the present disclosure, a preset weight (greater than 0 and less than 1) may be set for the similarity calculated from each type of feature (including the picture features, the video features, and the text features and audio features mentioned later), and for each feature type, the similarity calculated based on that feature is multiplied by the corresponding preset weight to obtain the final similarity for that feature type.
After the first candidate video set and the second candidate video set are obtained, the target candidate video to be recalled may be determined from them. Specifically, all candidate videos in the two sets may be ranked in descending order of feature similarity, where the feature similarity includes the picture feature similarity or the video feature similarity, and the fourth preset number of candidate videos selected as the target candidate videos according to the ranking result.
For example, using the same candidate sets and similarities as in the example above, the descending ranking is again video g, video j, video h, video e, video d, video f, video i, video a, video b, video k, video l, video c, and the top 5 (the fourth preset number) candidate videos g, j, h, e, and d are selected as the target candidate videos.
In addition, considering that in an actual application scenario the first candidate video set and the second candidate video set may contain one or more identical candidate videos, for each such candidate video the present disclosure may take the maximum of its picture feature similarity and its video feature similarity as its feature similarity.
It should be noted that the first HNSW recall tree is constructed in advance according to the picture feature similarity between candidate videos, such that two interconnected nodes correspond to candidate videos with higher picture feature similarity; likewise, the second HNSW recall tree is constructed in advance according to the video feature similarity between candidate videos, such that two interconnected nodes correspond to candidate videos with higher video feature similarity.
In addition, in another possible implementation of this step, when the video features and the picture features have the same dimensionality, the picture features and the video features may be fused, and the target candidate video determined from the plurality of candidate videos based on the fused features. That is, the video features and the picture features may be fused to obtain fusion features; the fusion feature similarity between each candidate video and the target video is then calculated according to the fusion features, the candidate videos are ranked according to the fusion feature similarity, and the target candidate video of the same type as the target video is determined from the candidate videos according to the ranking result.
Wherein, the feature fusion can be carried out by the following two ways:
in a first mode, the video feature and the picture feature of the target video are directly added to obtain a fusion feature, and specifically, for each first feature element of the video feature, the first feature element and a second feature element corresponding to the first feature element in the picture feature are added to obtain the fusion feature.
For example, assuming that a feature vector corresponding to a video feature of the target video is denoted by a, a ═ is (a1, a2,....... ann., an), and a feature vector corresponding to a picture feature of the target video is denoted by B, B ═ is (B1, B2,... ann., bn), directly adding the video feature and the picture feature to obtain a feature vector corresponding to a fusion feature: a + B ═ a1+ B1, a2+ B2, a.
The features that best represent the subject content of a video may differ by video type. For short videos in categories such as beauty, the cover frame picture can generally express the rough content of the video, so for the short video type the representative features are the picture features. For material recall of action-focused video types such as dancing or horse riding, the accuracy of material recall is poor when only the picture features are used, while the video features can represent the subject content of the whole video, so for action-focused video types the representative features are the video features. In other words, for recalling different types of video materials, the features selected to represent the video subject content differ, that is, the weight corresponding to each feature differs. Therefore, the present disclosure can also perform feature fusion in the following second mode:
identifying the video type of the target video through a material type identification model; respectively determining a first preset weight corresponding to the video features and a second preset weight corresponding to the picture features according to the video types; in this way, the video feature and the picture feature may be subjected to weighted summation according to the first preset weight and the second preset weight, so as to obtain the fusion feature.
The model structure of the material type identification model may be the same as that of the video feature extraction model; that is, the material type identification model may include, for example, a ResNet-3D model or a SlowFast model. The video type may include any type such as beauty, romance, dancing, or horse riding, or the video type may simply be a short video or an action video.
In one possible implementation, the output layer of the material type identification model outputs probability values corresponding to a plurality of preset material types, so in the process of inputting the target video into the material type identification model and identifying the video type based on it, the preset material type with the maximum output probability value may be determined as the video type of the target video.
After the video type of the target video is determined, a first preset weight corresponding to the video feature of the target video and a second preset weight corresponding to the picture feature can be determined according to the corresponding relation between the video type and the preset weights, and then the video feature and the picture feature can be subjected to weighted summation according to the first preset weight and the second preset weight to obtain the fusion feature.
For example, one possible correspondence between video types and preset weights is as follows: when the video type is a short video, the first preset weight is 0.2 and the second preset weight is 0.8; when the video type is an action video, the first preset weight is 0.8 and the second preset weight is 0.2. If the video type of the target video is determined to be an action video, the first preset weight corresponding to its video features is 0.8 and the second preset weight corresponding to its picture features is 0.2; assuming the video features are denoted by A and the picture features by B, the fusion features obtained after feature fusion are 0.8A + 0.2B. The above is only an example, and the disclosure is not limited thereto.
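A sketch of the two fusion modes under the assumption that the video features and picture features have the same dimensionality; the 0.2/0.8 weights reproduce the example above, while the lookup-table structure is an illustrative assumption.

```python
# Sketch: mode one adds the features element-wise; mode two weights them
# by the identified video type before summing.
import numpy as np

TYPE_WEIGHTS = {"short_video": (0.2, 0.8),   # (video-feature weight, picture-feature weight)
                "action_video": (0.8, 0.2)}  # example values from the text

def fuse_add(video_feat: np.ndarray, picture_feat: np.ndarray) -> np.ndarray:
    return video_feat + picture_feat                       # mode one: direct addition

def fuse_weighted(video_feat, picture_feat, video_type):   # mode two: weighted sum
    w_video, w_picture = TYPE_WEIGHTS[video_type]
    return w_video * video_feat + w_picture * picture_feat
```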
In addition, the fusion feature similarity between each candidate video and the target video is calculated according to the fusion features; when the candidate videos are ranked based on the fusion feature similarity and a target candidate video of the same type as the target video is determined from the candidate videos according to the ranking result, this can be achieved in the following two ways:
In a first mode, the plurality of candidate videos includes all preset candidate videos in the video material library. The fusion feature similarity between each preset candidate video and the target video can be calculated according to the fusion features, the preset candidate videos sorted in descending order of fusion feature similarity, and the top sixth preset number of preset candidate videos selected as the target candidate videos according to the sorting result.
Similarly, the similarity of the fusion feature of each preset candidate video and the target video can also be calculated by the above formula (1) or formula (2).
In a second mode, note that if video materials are recalled based on fusion features according to the first mode, all preset candidate videos still need to be traversed, that is, the fusion feature similarity between each preset candidate video and the target video must be calculated, which makes the recall cost high and the recall efficiency low. To improve the recall efficiency of video materials, similar to the manner of determining a candidate video set from the picture features and the video features respectively, the number of traversed candidate videos can be reduced based on the idea of the HNSW recall tree. Here, the HNSW recall tree includes a pre-established third HNSW recall tree, which is pre-established according to the fusion feature similarity between every two candidate videos; the plurality of candidate videos includes the traversed candidate videos determined after the third HNSW recall tree is traversed according to the preset recall strategy, and the feature similarity includes the fusion feature similarity. The target candidate video may then be determined from the plurality of candidate videos as follows: traverse the third HNSW recall tree by executing the preset traversal step, calculating during the traversal the fusion feature similarity between the target video and the candidate video corresponding to each traversed node (that is, each traversed candidate video) according to the fusion features; once the third HNSW recall tree has been traversed, sort the candidate videos corresponding to the traversed nodes in descending order of fusion feature similarity, and select the fifth preset number of traversed candidate videos as the target candidate videos according to the sorting result.
Exemplarily, assume the preset node randomly generated by the HNSW algorithm is node 1 of the third HNSW recall tree, and nodes 2, 3, 4, and 5 are directly connected to node 1. The fusion feature similarity between the target video and the candidate video corresponding to each of nodes 1, 2, 3, 4, and 5 is calculated according to the fusion features, and the node with the highest fusion feature similarity among nodes 2, 3, 4, and 5 is selected as the new preset node (assume node 3). Traversal of the third HNSW recall tree then continues: the nodes directly connected to the new preset node are determined, assume nodes 6, 7, and 8; the fusion feature similarity between the target video and the candidate video corresponding to each of nodes 6, 7, and 8 is recalculated; and the node with the highest fusion feature similarity among nodes 6, 7, and 8 is selected as the new preset node. The preset traversal step is re-executed whenever a node connected to the new preset node exists in the third HNSW recall tree, and the traversal is determined to be finished once no such node exists. At that point, the candidate videos corresponding to the traversed nodes may be sorted in descending order of fusion feature similarity, and the fifth preset number of candidate videos selected as the target candidate videos according to the sorting result. For example, assume that after the third HNSW recall tree is traversed, the traversed nodes include 100 nodes, namely node 1, node 2, ..., node 100, and that the top 5 nodes after sorting in descending order of fusion feature similarity are node 12, node 15, node 23, node 50, and node 55. The target candidate videos then include the candidate videos corresponding to nodes 12, 15, 23, 50, and 55. This is only an example, and the disclosure is not limited thereto.
Similarly, the third HNSW recall tree is constructed in advance according to the fusion feature similarity between candidate videos, such that two interconnected nodes correspond to candidate videos with higher fusion feature similarity. During actual material recall, the target candidate video is therefore quickly searched out along the edges between nodes of the pre-constructed HNSW recall tree without traversing all preset candidate videos in the whole material library, which can significantly improve the efficiency and accuracy of material recall.
In step S105, the target candidate video is acquired.
In this step, the target candidate video may be recalled from the material library, so that advertisements with a better delivery effect can be produced according to the target candidate video.
By adopting the method, the video characteristics of the target video serving as the seed material are introduced, and the video characteristics and the picture characteristics are taken as the recall basis of the target candidate video, so that the recall accuracy of the video material can be improved.
Fig. 2 is a flowchart illustrating a method of acquiring video material according to the embodiment shown in fig. 1, and as shown in fig. 2, before step S103 is executed, the method further includes the following steps:
in step S106, a text feature of the target video is obtained through a text feature extraction model obtained through pre-training; and/or acquiring the audio features of the target video through an audio feature extraction model obtained through pre-training.
The text features may include text content appearing in the target video, the speech content of the target video, or a classification tag of the target video, and the text feature extraction model may include a BERT model; the audio features may include the background music and speech content corresponding to the target video, and the audio feature extraction model may include a VGGish model.
In the recall scenario of actual video materials, using the picture features and video features can ensure that the pictures of two videos are highly similar, but the contents of the videos may still differ. Therefore, in this step, the text features and/or audio features of the target video can be extracted, so that video material recall can be performed according to multi-modal features of the target video such as the picture features, video features, text features, and audio features, further improving the recall accuracy of video materials.
In the process of extracting the text features of the target video, the text content in the target video can be encoded. Specifically, the text content can be split into words and sentences, the split words and sentences mapped to a vocabulary to realize vocabulary encoding, and the encoded data then input into a pre-trained BERT model, with the network output of the third-to-last layer of the model structure used as the extracted text features.
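As a sketch only, the text-feature step could look as follows with the Hugging Face transformers library; the checkpoint name and the mean pooling over tokens are assumptions, while taking the third-to-last layer follows the description above.

```python
# Sketch: third-to-last BERT hidden layer, mean-pooled, as a text feature.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
model = AutoModel.from_pretrained("bert-base-chinese", output_hidden_states=True)
model.eval()

def text_feature(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden_states = model(**inputs).hidden_states  # tuple: embeddings + each layer
    return hidden_states[-3].mean(dim=1).squeeze(0)    # third-to-last layer, mean over tokens
```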
In the process of extracting the audio feature of the target video, the audio data in the target video may be extracted first, and then the audio data is input into a pre-trained VGGish model to extract the audio feature.
After the text feature and the audio feature are obtained, when step S103 is executed, the feature similarity between each candidate video and the target video may be calculated according to the picture feature, the video feature, and a specified feature, where the specified feature includes the text feature and/or the audio feature.
The specific implementation of determining the target candidate video according to the four types of features is similar to that of determining it based on the picture features and the video features: a feature similarity may be calculated for each of the four types of features and the target candidate video determined after comprehensively ranking these similarities; or, to improve material recall efficiency, the video materials may be recalled based on the idea of the HNSW recall tree; or the materials may be recalled after calculating the feature similarity based on the fusion of the four types of features. For the specific implementation, refer to the relevant descriptions above, which are not repeated here.
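One of the three options, comprehensive ranking over per-modality similarities, might be sketched as follows; the modality weights and sample values are illustrative assumptions, not values from the disclosure.

```python
# Sketch: combine per-modality similarities with preset weights, then rank.
MODALITY_WEIGHTS = {"picture": 0.3, "video": 0.3, "text": 0.2, "audio": 0.2}  # assumed

def combined_similarity(sims: dict) -> float:
    # sims maps modality name -> similarity of candidate vs. seed on that modality
    return sum(MODALITY_WEIGHTS[m] * s for m, s in sims.items())

candidates = {"g": {"picture": 0.9, "video": 0.95, "text": 0.8, "audio": 0.7},
              "h": {"picture": 0.6, "video": 0.70, "text": 0.9, "audio": 0.8}}
ranked = sorted(candidates, key=lambda c: combined_similarity(candidates[c]), reverse=True)
```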
Fig. 3 is a flowchart illustrating a method of acquiring video material according to the embodiment shown in fig. 1, and before step S103 is executed, as shown in fig. 3, the method further includes the following steps:
in step S107, feature dimension reduction is performed on the picture features and the video features to obtain target picture features and target video features after dimension reduction.
To support recall over tens of millions of videos in the material library and reduce storage and calculation costs, the present disclosure can perform dimension reduction on the extracted feature vectors. For the picture features, dimension reduction can be realized by Singular Value Decomposition (SVD); for the video features, a high-dimensional floating-point feature vector can be mapped to a low-dimensional discrete vector by Product Quantization (PQ).
For example, assuming the video features extracted by the video feature extraction model are 512-dimensional, the features can be divided into groups of 4 dimensions, giving 128 groups of sub-vectors; each group of sub-vectors can be clustered into 128 cluster centers through K-Means, so that any video feature can be encoded as a 128-dimensional vector.
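A sketch of this product-quantization encoding with scikit-learn K-Means, following the 512 = 128 × 4 split and 128 centers per sub-space described above; the training-set shape (which must contain at least 128 samples per sub-space) and the uint8 code type are assumptions.

```python
# Sketch: PQ encoding of a 512-d feature into 128 cluster-center indices.
import numpy as np
from sklearn.cluster import KMeans

N_SUBVECTORS, SUB_DIM, N_CENTERS = 128, 4, 128

def train_codebooks(features: np.ndarray):                 # features: (n_samples, 512)
    subspaces = features.reshape(len(features), N_SUBVECTORS, SUB_DIM)
    return [KMeans(n_clusters=N_CENTERS, n_init=10).fit(subspaces[:, i, :])
            for i in range(N_SUBVECTORS)]

def pq_encode(feature: np.ndarray, codebooks) -> np.ndarray:
    subvectors = feature.reshape(N_SUBVECTORS, SUB_DIM)
    return np.array([cb.predict(subvectors[i:i + 1])[0]    # index of nearest center
                     for i, cb in enumerate(codebooks)], dtype=np.uint8)
```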
In this way, the feature similarity between each candidate video and the target video can be calculated according to the target picture features and the target video features obtained after dimensionality reduction, so that the target candidate video can be determined more efficiently, and the calculation cost and the storage cost are reduced.
In addition, before recalling the material based on the four types of features including the picture feature, the video feature, the text feature and the audio feature, the SVD method can be adopted to perform dimension reduction processing on the text feature and the audio feature.
After the dimension reduction processing is performed, the dimensions corresponding to each type of feature vector can be made the same.
Fig. 4 is a block diagram illustrating an apparatus for acquiring video material according to an exemplary embodiment, the apparatus including, as shown in fig. 4:
a frame extraction module 401, configured to extract a video frame picture of a target video;
a feature extraction module 402, configured to extract video features corresponding to the target video and picture features corresponding to the video frame picture;
a similarity calculation module 403, configured to calculate a feature similarity between each candidate video and the target video according to the video features and the picture features;
a determining module 404, configured to rank the candidate videos based on the feature similarity, and determine, according to a ranking result, a target candidate video that is of the same type as the target video from the plurality of candidate videos;
a material obtaining module 405, configured to obtain the target candidate video.
Optionally, the plurality of candidate videos include preset candidate videos, and the similarity calculation module 403 and the determination module 404 are configured to calculate, according to the picture features, picture feature similarities between each preset candidate video and the target video, sort the preset candidate videos in an order from high to low in the picture feature similarities, and select, according to a sorting result, a first preset number of preset candidate videos as a first candidate video set; respectively calculating the video feature similarity of each preset candidate video and the target video according to the video features, sequencing the preset candidate videos according to the sequence of the video feature similarity from high to low, and selecting a second preset number of preset candidate videos as a second candidate video set according to the sequencing result; determining the target candidate video from the first set of candidate videos and the second set of candidate videos.
Optionally, the plurality of candidate videos include traversed candidate videos determined after a pre-established hierarchical navigation small-world HNSW recall tree is traversed according to a preset recall policy, where the HNSW recall tree includes a first HNSW recall tree or a second HNSW recall tree, and the similarity calculation module 403 and the determination module 404 are configured to calculate, according to the picture features, the picture feature similarity between each traversed candidate video and the target video through the first HNSW recall tree based on the preset recall policy, and determine a first candidate video set from the plurality of traversed candidate videos according to the picture feature similarity, where the first HNSW recall tree is a HNSW recall tree pre-established according to the picture feature similarity between each two candidate videos;
calculating the video feature similarity of each traversed candidate video and the target video according to the video features through a pre-established second HNSW recall tree based on the preset recall strategy, and determining a second candidate video set from the plurality of traversed candidate videos according to the video feature similarity, wherein the second HNSW recall tree is a pre-established HNSW recall tree according to the video feature similarity between every two candidate videos;
determining the target candidate video from the first set of candidate videos and the second set of candidate videos.
Optionally, the first HNSW recall tree and the second HNSW recall tree each include a plurality of nodes, and each node corresponds to the candidate video one to one;
the preset recall strategy comprises:
executing a preset traversal step until the target HNSW recall tree is traversed; the preset traversing step comprises the following steps: acquiring a preset node corresponding to the target HNSW recall tree; determining target nodes from the target HNSW recall tree according to the preset nodes, wherein the target nodes comprise the preset nodes and nodes connected with the preset nodes; respectively calculating the feature similarity of each candidate video to be determined and the target video according to the target features, wherein the candidate video to be determined is the candidate video corresponding to the target node; taking the node corresponding to the candidate video with the highest feature similarity as a new preset node; determining whether a node connected with the new preset node exists in the target HNSW recall tree; under the condition that the node connected with the new preset node exists in the target HNSW recall tree, re-executing the preset traversal step, and under the condition that the node connected with the new preset node does not exist in the target HNSW recall tree, determining that the traversal of the target HNSW recall tree is finished; under the condition that the target HNSW recall tree is traversed, sequencing traversed candidate videos corresponding to traversed nodes according to the sequence of the feature similarity from high to low, and selecting a third preset number of traversed candidate videos as a specific candidate video set according to a sequencing result;
wherein the target feature comprises the picture feature or the video feature, and if the target feature is the picture feature, the target HNSW recall tree is the first HNSW recall tree, the feature similarity is a picture feature similarity, and the specific candidate video set is the first candidate video set; and if the target feature is the video feature, the target HNSW recall tree is the second HNSW recall tree, the feature similarity is a video feature similarity, and the specific candidate video set is the second candidate video set.
Optionally, the determining module 404 is configured to sort each candidate video in the first candidate video set and the second candidate video set according to an order of feature similarity from high to low, where the feature similarity includes the picture feature similarity or the video feature similarity; and selecting a fourth preset number of candidate videos as the target candidate videos according to the sorting result.
Optionally, the similarity calculation module 403 is configured to perform feature fusion on the video feature and the picture feature to obtain a fusion feature, and calculate the fusion feature similarity between each candidate video and the target video according to the fusion feature; the determining module 404 is configured to rank the candidate videos based on the fusion feature similarity, and determine a target candidate video that belongs to the same type as the target video from the plurality of candidate videos according to a ranking result.
Optionally, the similarity calculation module 403 is configured to, for each first feature element of the video feature, add the first feature element to a second feature element corresponding to the first feature element in the picture feature to obtain the fusion feature.
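For illustration only, this element-wise fusion can be sketched as follows; treating the two feature vectors as equal-length arrays is an assumption the embodiment implies but does not state.

```python
import numpy as np

def fuse_by_elementwise_addition(video_feature, picture_feature):
    """Add each first feature element of the video feature to the
    corresponding second feature element of the picture feature."""
    video_feature = np.asarray(video_feature, dtype=np.float32)
    picture_feature = np.asarray(picture_feature, dtype=np.float32)
    if video_feature.shape != picture_feature.shape:
        raise ValueError("element-wise fusion presumes equal-length features")
    return video_feature + picture_feature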
Optionally, fig. 5 is a block diagram of an apparatus for acquiring video material according to the embodiment shown in fig. 4, and as shown in fig. 5, the apparatus further includes:
a weight determination module 406, configured to identify a video type of the target video through a material type identification model; respectively determining a first preset weight corresponding to the video features and a second preset weight corresponding to the picture features according to the video types;
the similarity calculation module 403 is configured to perform weighted summation on the video feature and the picture feature according to the first preset weight and the second preset weight, so as to obtain the fusion feature.
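A hedged sketch of the type-dependent weighted summation follows; the video-type names and weight values are invented placeholders, since the embodiment only states that the two preset weights follow from the video type identified by the material type identification model.

```python
import numpy as np

# Hypothetical weight table: the type names and values below are illustrative.
PRESET_WEIGHTS = {
    "motion_heavy": (0.7, 0.3),   # favour the video feature
    "still_heavy":  (0.3, 0.7),   # favour the picture feature
}

def fuse_by_weighted_sum(video_feature, picture_feature, video_type):
    # First preset weight applies to the video feature, second to the picture feature.
    w_video, w_picture = PRESET_WEIGHTS[video_type]
    return (w_video * np.asarray(video_feature, dtype=np.float32)
            + w_picture * np.asarray(picture_feature, dtype=np.float32))
```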
Optionally, the HNSW recall tree includes a pre-established third HNSW recall tree, the plurality of candidate videos include traversed candidate videos determined after the third HNSW recall tree is traversed according to the preset recall strategy, and the similarity calculation module 403 is configured to traverse the third HNSW recall tree by performing the preset traversal step and calculate, according to the fusion feature, the fusion feature similarity between each traversed candidate video and the target video during the traversal, where the third HNSW recall tree is an HNSW recall tree pre-established according to the fusion feature similarity between every two candidate videos; the determining module 404 is configured to, after the third HNSW recall tree is traversed, sort the traversed candidate videos corresponding to the traversed nodes in descending order of fusion feature similarity and select a fifth preset number of traversed candidate videos as the target candidate videos according to the sorting result.
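The disclosure builds its own recall structures, but for orientation the same index-and-query operation can be reproduced with the open-source hnswlib library — an analogous tool, not the patent's implementation. The dimensionality, index parameters, and random stand-in data below are all assumptions.

```python
import numpy as np
import hnswlib

dim, num_candidates = 128, 10_000
fused_candidates = np.random.rand(num_candidates, dim).astype(np.float32)  # stand-in data
fused_target = np.random.rand(dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)   # cosine is an assumed metric
index.init_index(max_elements=num_candidates, ef_construction=200, M=16)
index.add_items(fused_candidates, np.arange(num_candidates))
index.set_ef(64)                                 # query-time breadth/recall trade-off

# the "fifth preset number" nearest candidates by fusion-feature similarity
labels, distances = index.knn_query(fused_target, k=5)
```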
Optionally, the plurality of candidate videos include preset candidate videos, the similarity calculation module 403 is configured to calculate the fusion feature similarity between each preset candidate video and the target video according to the fusion feature, and the determination module 404 is configured to sort the preset candidate videos in descending order of fusion feature similarity and select a sixth preset number of preset candidate videos as the target candidate videos according to the sorting result.
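When the candidate pool is a fixed set of preset candidate videos, the recall reduces to a brute-force ranking. A minimal sketch, assuming cosine similarity and a dict of precomputed fusion features:

```python
import numpy as np

def rank_preset_candidates(fused_target, fused_candidates, sixth_preset_number):
    """fused_candidates: dict of preset candidate video id -> fusion feature."""
    q = np.asarray(fused_target, dtype=np.float32)
    q /= np.linalg.norm(q)
    scores = {}
    for video_id, feature in fused_candidates.items():
        f = np.asarray(feature, dtype=np.float32)
        scores[video_id] = float(f @ q / np.linalg.norm(f))
    ranked = sorted(scores, key=scores.get, reverse=True)  # high-to-low
    return ranked[:sixth_preset_number]
```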
Optionally, fig. 6 is a block diagram of an apparatus for acquiring video material according to the embodiment shown in fig. 4, and as shown in fig. 6, the apparatus further includes:
an obtaining module 407, configured to obtain a text feature of the target video through a text feature extraction model obtained through pre-training; and/or acquiring the audio features of the target video through an audio feature extraction model obtained through pre-training;
the similarity calculation module 403 is configured to calculate a feature similarity between each candidate video and the target video according to the picture feature, the video feature, and a specified feature, where the specified feature includes the text feature and/or the audio feature.
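How the picture, video, and specified features combine into one similarity is left open by the embodiment; one plausible reading — averaging per-modality cosine similarities — is sketched below purely as an assumption.

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=np.float32), np.asarray(b, dtype=np.float32)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def multi_feature_similarity(target, candidate):
    """target / candidate: dicts that may hold 'picture', 'video', 'text'
    and 'audio' feature vectors; only modalities both sides carry count."""
    modalities = [m for m in ("picture", "video", "text", "audio")
                  if m in target and m in candidate]
    return sum(cosine(target[m], candidate[m]) for m in modalities) / len(modalities)
```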
Optionally, fig. 7 is a block diagram of an apparatus for acquiring video material according to the embodiment shown in fig. 4, and as shown in fig. 7, the apparatus further includes:
a dimension reduction module 408, configured to perform feature dimension reduction on the picture features and the video features to obtain target picture features and target video features after dimension reduction;
the similarity calculation module 403 is configured to calculate a feature similarity between each candidate video and the target video according to the target picture feature and the target video feature.
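The embodiment does not name a reduction method; PCA is one common choice and is used below only as an assumption (n_components must not exceed the number of samples or the feature dimension).

```python
from sklearn.decomposition import PCA

def reduce_features(picture_features, video_features, n_components=128):
    """picture_features / video_features: (num_videos, dim) matrices.
    Returns dimension-reduced target picture and target video features."""
    pic_pca = PCA(n_components=n_components).fit(picture_features)
    vid_pca = PCA(n_components=n_components).fit(video_features)
    return pic_pca.transform(picture_features), vid_pca.transform(video_features)
```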
Optionally, the video frame picture includes a cover frame picture of the target video or a preset video frame picture other than the cover frame picture.
Optionally, the extracting module 402 is configured to divide the target video into a plurality of video segments according to a preset step size; and aiming at each video clip, inputting the video clip into a video feature extraction model obtained by pre-training to obtain the video feature corresponding to each video clip.
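A sketch of this segment-then-extract flow; the 16-frame preset step and the model's clip-to-vector interface are illustrative assumptions only.

```python
def extract_segment_features(frames, video_feature_model, preset_step=16):
    """frames: the decoded frame sequence of the target video.
    video_feature_model: pre-trained video feature extraction model, assumed
    to map one clip (a list of frames) to one feature vector."""
    clips = [frames[i:i + preset_step] for i in range(0, len(frames), preset_step)]
    return [video_feature_model(clip) for clip in clips]
```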
By adopting the apparatus, the video features of the target video serving as the seed material are introduced, and both the video features and the picture features serve as the recall basis for the target candidate videos, which can improve the recall accuracy and recall efficiency for video material.
Referring now to fig. 8, a block diagram of an electronic device (e.g., a terminal device) 800 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 8, an electronic device 800 may include a processing means (e.g., central processing unit, graphics processor, etc.) 801 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage means 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the electronic apparatus 800 are also stored. The processing apparatus 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
Generally, the following devices may be connected to the I/O interface 805: input devices 806 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 807 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, and the like; storage 808 including, for example, magnetic tape, hard disk, etc.; and a communication device 809. The communication means 809 may allow the electronic device 800 to communicate wirelessly or by wire with other devices to exchange data. While fig. 8 illustrates an electronic device 800 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication means 809, or installed from the storage means 808, or installed from the ROM 802. The computer program, when executed by the processing apparatus 801, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some implementations, the clients may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may interconnect with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: extracting a video frame picture of a target video; extracting video characteristics corresponding to the target video and picture characteristics corresponding to the video frame pictures; calculating the feature similarity of each candidate video and the target video according to the video features and the picture features, sorting the candidate videos based on the feature similarity, and determining target candidate videos which belong to the same type as the target video from the candidate videos according to a sorting result; and acquiring the target candidate video.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or hardware. The name of a module does not in some cases constitute a limitation of the module itself, and for example, the extraction module may also be described as a "module that extracts a video frame picture".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Example 1 provides, in accordance with one or more embodiments of the present disclosure, a method of obtaining video material, comprising:
extracting a video frame picture of a target video;
extracting video characteristics corresponding to the target video and picture characteristics corresponding to the video frame pictures;
calculating the feature similarity of each candidate video and the target video according to the video features and the picture features;
sorting the candidate videos based on the feature similarity, and determining target candidate videos which belong to the same type as the target videos from a plurality of candidate videos according to sorting results;
and acquiring the target candidate video.
Example 2 provides the method of example 1, wherein the plurality of candidate videos include preset candidate videos, and calculating the feature similarity between each candidate video and the target video according to the video features and the picture features, sorting the candidate videos based on the feature similarity, and determining, from the plurality of candidate videos according to the sorting result, a target candidate video that belongs to the same type as the target video comprises:
calculating, according to the picture features, the picture feature similarity between each preset candidate video and the target video, sorting the preset candidate videos in descending order of picture feature similarity, and selecting a first preset number of top-ranked preset candidate videos as a first candidate video set according to the sorting result;
calculating, according to the video features, the video feature similarity between each preset candidate video and the target video, sorting the preset candidate videos in descending order of video feature similarity, and selecting a second preset number of top-ranked preset candidate videos as a second candidate video set according to the sorting result;
determining the target candidate video from the first set of candidate videos and the second set of candidate videos.
Example 3 provides the method of example 1, wherein the plurality of candidate videos include traversed candidate videos determined after a pre-established hierarchical navigable small world (HNSW) recall tree is traversed according to a preset recall strategy, the HNSW recall tree includes a first HNSW recall tree or a second HNSW recall tree, and calculating the feature similarity between each candidate video and the target video according to the video features and the picture features, sorting the candidate videos based on the feature similarity, and determining, from the plurality of candidate videos according to the sorting result, a target candidate video that belongs to the same type as the target video comprises:
calculating the picture feature similarity of each traversed candidate video and the target video through the first HNSW recall tree according to the picture features based on the preset recall strategy, and determining a first candidate video set from the traversed candidate videos according to the picture feature similarity, wherein the first HNSW recall tree is an HNSW recall tree which is pre-established according to the picture feature similarity between every two candidate videos;
calculating the video feature similarity of each traversed candidate video and the target video through the second HNSW recall tree according to the video features based on the preset recall strategy, and determining a second candidate video set from the traversed candidate videos according to the video feature similarity, wherein the second HNSW recall tree is an HNSW recall tree which is pre-established according to the video feature similarity between every two candidate videos;
determining the target candidate video from the first set of candidate videos and the second set of candidate videos.
Example 4 provides the method of example 3, the first HNSW recall tree and the second HNSW recall tree each including a plurality of nodes, each node in one-to-one correspondence with the candidate video, in accordance with one or more embodiments of the present disclosure;
the preset recall strategy comprises:
executing a preset traversal step until the target HNSW recall tree is traversed;
the preset traversing step comprises the following steps:
acquiring a preset node corresponding to the target HNSW recall tree;
determining target nodes from the target HNSW recall tree according to the preset nodes, wherein the target nodes comprise the preset nodes and nodes connected with the preset nodes;
respectively calculating the feature similarity of each candidate video to be determined and the target video according to the target features, wherein the candidate video to be determined is the candidate video corresponding to the target node;
taking the node corresponding to the candidate video with the highest feature similarity as a new preset node;
determining whether a node connected with the new preset node exists in the target HNSW recall tree;
under the condition that the node connected with the new preset node exists in the target HNSW recall tree, re-executing the preset traversal step, and under the condition that the node connected with the new preset node does not exist in the target HNSW recall tree, determining that the traversal of the target HNSW recall tree is finished;
under the condition that the target HNSW recall tree is traversed, sorting the traversed candidate videos corresponding to the traversed nodes in descending order of feature similarity, and selecting a third preset number of traversed candidate videos as a specific candidate video set according to the sorting result;
wherein the target feature comprises the picture feature or the video feature, and if the target feature is the picture feature, the target HNSW recall tree is the first HNSW recall tree, the feature similarity is a picture feature similarity, and the specific candidate video set is the first candidate video set; and if the target feature is the video feature, the target HNSW recall tree is the second HNSW recall tree, the feature similarity is a video feature similarity, and the specific candidate video set is the second candidate video set.
Example 5 provides the method of any one of examples 2-4, the determining the target candidate video from the first candidate video set and the second candidate video set including:
sorting each candidate video in the first candidate video set and the second candidate video set in descending order of feature similarity, wherein the feature similarity comprises the picture feature similarity or the video feature similarity;
and selecting a fourth preset number of candidate videos as the target candidate videos according to the sorting result.
Example 6 provides the method of example 4, wherein calculating the feature similarity between each candidate video and the target video according to the video features and the picture features, sorting the candidate videos based on the feature similarity, and determining, from the plurality of candidate videos according to the sorting result, a target candidate video that belongs to the same type as the target video comprises:
performing feature fusion on the video features and the picture features to obtain fusion features;
calculating the fusion feature similarity between each candidate video and the target video according to the fusion features;
and ranking the candidate videos based on the fusion feature similarity, and determining target candidate videos which belong to the same type as the target videos from a plurality of candidate videos according to a ranking result.
Example 7 provides the method of example 6, wherein the performing feature fusion on the video feature and the picture feature to obtain a fused feature comprises:
and for each first feature element of the video features, adding the first feature element to a second feature element corresponding to the first feature element in the picture features to obtain the fusion features.
Example 8 provides the method of example 6, before the feature fusing the video feature and the picture feature to obtain a fused feature, the method further comprising:
identifying the video type of the target video through a material type identification model;
respectively determining a first preset weight corresponding to the video features and a second preset weight corresponding to the picture features according to the video types;
the feature fusion of the video feature and the picture feature to obtain a fusion feature comprises:
and carrying out weighted summation on the video features and the picture features according to the first preset weight and the second preset weight to obtain the fusion features.
Example 9 provides the method of example 6, wherein the HNSW recall tree includes a pre-established third HNSW recall tree, the plurality of candidate videos include traversed candidate videos determined after the third HNSW recall tree is traversed according to the preset recall strategy, and calculating the fusion feature similarity between each candidate video and the target video according to the fusion features, sorting the candidate videos based on the fusion feature similarity, and determining, from the plurality of candidate videos according to the sorting result, a target candidate video that belongs to the same type as the target video comprises:
traversing the third HNSW recall tree by executing the preset traversal step, and calculating the similarity of the fusion characteristics of each traversed candidate video and the target video according to the fusion characteristics in the traversal process, wherein the third HNSW recall tree is an HNSW recall tree which is pre-established according to the similarity of the fusion characteristics between every two candidate videos;
and under the condition that the third HNSW recall tree is traversed, sorting the traversed candidate videos corresponding to the traversed nodes in descending order of fusion feature similarity, and selecting a fifth preset number of traversed candidate videos as the target candidate videos according to the sorting result.
Example 10 provides the method of example 6, wherein the plurality of candidate videos include preset candidate videos, and calculating the fusion feature similarity between each candidate video and the target video according to the fusion features, sorting the candidate videos based on the fusion feature similarity, and determining, from the plurality of candidate videos according to the sorting result, a target candidate video that belongs to the same type as the target video comprises:
calculating, according to the fusion features, the fusion feature similarity between each preset candidate video and the target video, sorting the preset candidate videos in descending order of fusion feature similarity, and selecting a sixth preset number of top-ranked preset candidate videos as the target candidate videos according to the sorting result.
Example 11 provides the method of example 1, before the calculating a feature similarity of each candidate video to the target video from the video features and the picture features, the method further comprising:
acquiring text features of the target video through a text feature extraction model obtained through pre-training; and/or,
acquiring the audio features of the target video through an audio feature extraction model obtained through pre-training;
the calculating the feature similarity of each candidate video and the target video according to the video features and the picture features comprises:
and calculating the feature similarity of each candidate video and the target video according to the picture features, the video features and specified features, wherein the specified features comprise the text features and/or the audio features.
Example 12 provides the method of example 1, before the calculating a feature similarity of each candidate video to the target video from the video features and the picture features, the method further comprising:
performing feature dimensionality reduction on the picture features and the video features to obtain target picture features and target video features subjected to dimensionality reduction;
the calculating the feature similarity of each candidate video and the target video according to the video features and the picture features comprises:
and calculating the feature similarity of each candidate video and the target video according to the target picture features and the target video features.
Example 13 provides the method of example 1, the video frame picture comprising a cover frame picture of the target video or a preset video frame picture other than the cover frame picture, according to one or more embodiments of the present disclosure.
Example 14 provides the method of example 1, wherein extracting the video feature corresponding to the target video includes:
dividing the target video into a plurality of video segments according to a preset step length;
and aiming at each video clip, inputting the video clip into a video feature extraction model obtained by pre-training to obtain the video feature corresponding to each video clip.
Example 15 provides, in accordance with one or more embodiments of the present disclosure, an apparatus to obtain video material, the apparatus comprising:
the extraction module is used for extracting video frame pictures of the target video;
the extraction module is used for extracting video characteristics corresponding to the target video and picture characteristics corresponding to the video frame picture;
the similarity calculation module is used for calculating the feature similarity of each candidate video and the target video according to the video features and the picture features;
the determining module is used for sequencing the candidate videos based on the feature similarity and determining a target candidate video which belongs to the same type as the target video from a plurality of candidate videos according to a sequencing result;
and the material acquisition module is used for acquiring the target candidate video.
Example 16 provides a computer readable medium having stored thereon a computer program that, when executed by a processing apparatus, performs the steps of the method of any of examples 1-14, in accordance with one or more embodiments of the present disclosure.
Example 17 provides, in accordance with one or more embodiments of the present disclosure, an electronic device, comprising:
a storage device having a computer program stored thereon;
processing means for executing the computer program in the storage means to carry out the steps of the method of any of examples 1-14.
The foregoing description is only an explanation of the preferred embodiments of the disclosure and of the technical principles employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the particular combinations of the features described above, but also covers other technical solutions formed by any combination of the features described above or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by substituting the features described above with (but not limited to) features with similar functions disclosed in this disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Claims (17)

1. A method of acquiring video material, the method comprising:
extracting a video frame picture of a target video;
extracting video characteristics corresponding to the target video and picture characteristics corresponding to the video frame pictures;
calculating the feature similarity of each candidate video and the target video according to the video features and the picture features;
sorting the candidate videos based on the feature similarity, and determining target candidate videos which belong to the same type as the target videos from a plurality of candidate videos according to sorting results;
and acquiring the target candidate video.
2. The method according to claim 1, wherein the plurality of candidate videos comprise preset candidate videos, and the calculating the feature similarity between each candidate video and the target video according to the video features and the picture features, the sorting the candidate videos based on the feature similarity, and the determining, from the plurality of candidate videos according to the sorting result, a target candidate video that belongs to the same type as the target video comprise:
calculating, according to the picture features, the picture feature similarity between each preset candidate video and the target video, sorting the preset candidate videos in descending order of picture feature similarity, and determining a first preset number of top-ranked preset candidate videos as a first candidate video set according to the sorting result;
calculating, according to the video features, the video feature similarity between each preset candidate video and the target video, sorting the preset candidate videos in descending order of video feature similarity, and determining a second preset number of top-ranked preset candidate videos as a second candidate video set according to the sorting result;
determining the target candidate video from the first set of candidate videos and the second set of candidate videos.
3. The method according to claim 1, wherein the plurality of candidate videos comprise traversed candidate videos determined after a pre-established hierarchical navigable small world (HNSW) recall tree is traversed according to a preset recall strategy, the HNSW recall tree comprises a first HNSW recall tree or a second HNSW recall tree, and the calculating the feature similarity between each candidate video and the target video according to the video features and the picture features, the sorting the candidate videos based on the feature similarity, and the determining, from the plurality of candidate videos according to the sorting result, a target candidate video that belongs to the same type as the target video comprise:
calculating the picture feature similarity of each traversed candidate video and the target video through the first HNSW recall tree according to the picture features based on the preset recall strategy, and determining a first candidate video set from the traversed candidate videos according to the picture feature similarity, wherein the first HNSW recall tree is an HNSW recall tree which is pre-established according to the picture feature similarity between every two candidate videos;
calculating the video feature similarity of each traversed candidate video and the target video through the second HNSW recall tree according to the video features based on the preset recall strategy, and determining a second candidate video set from the traversed candidate videos according to the video feature similarity, wherein the second HNSW recall tree is an HNSW recall tree which is pre-established according to the video feature similarity between every two candidate videos;
determining the target candidate video from the first set of candidate videos and the second set of candidate videos.
4. The method of claim 3, wherein the first HNSW recall tree and the second HNSW recall tree each comprise a plurality of nodes, each node corresponding to one of the candidate videos;
the preset recall strategy comprises:
executing a preset traversal step until the target HNSW recall tree is traversed;
the preset traversing step comprises the following steps:
acquiring a preset node corresponding to the target HNSW recall tree;
determining target nodes from the target HNSW recall tree according to the preset nodes, wherein the target nodes comprise the preset nodes and nodes connected with the preset nodes;
respectively calculating the feature similarity of each candidate video to be determined and the target video according to the target features, wherein the candidate video to be determined is the candidate video corresponding to the target node;
taking the node corresponding to the candidate video with the highest feature similarity as a new preset node;
determining whether a node connected with the new preset node exists in the target HNSW recall tree;
under the condition that the node connected with the new preset node exists in the target HNSW recall tree, re-executing the preset traversal step, and under the condition that the node connected with the new preset node does not exist in the target HNSW recall tree, determining that the traversal of the target HNSW recall tree is finished;
under the condition that the target HNSW recall tree is traversed, sorting the traversed candidate videos corresponding to the traversed nodes in descending order of feature similarity, and selecting a third preset number of traversed candidate videos as a specific candidate video set according to the sorting result;
wherein the target feature comprises the picture feature or the video feature, and if the target feature is the picture feature, the target HNSW recall tree is the first HNSW recall tree, the feature similarity is a picture feature similarity, and the specific candidate video set is the first candidate video set; and if the target feature is the video feature, the target HNSW recall tree is the second HNSW recall tree, the feature similarity is a video feature similarity, and the specific candidate video set is the second candidate video set.
5. The method of any of claims 2-4, wherein determining the target candidate video from the first candidate video set and the second candidate video set comprises:
sorting each candidate video in the first candidate video set and the second candidate video set in descending order of feature similarity, wherein the feature similarity comprises the picture feature similarity or the video feature similarity;
and selecting a fourth preset number of candidate videos as the target candidate videos according to the sorting result.
6. The method according to claim 4, wherein the calculating the feature similarity between each candidate video and the target video according to the video features and the picture features, the sorting the candidate videos based on the feature similarity, and the determining, from the plurality of candidate videos according to the sorting result, a target candidate video that belongs to the same type as the target video comprise:
performing feature fusion on the video features and the picture features to obtain fusion features;
calculating the fusion feature similarity between each candidate video and the target video according to the fusion features;
and ranking the candidate videos based on the fusion feature similarity, and determining target candidate videos which belong to the same type as the target videos from a plurality of candidate videos according to a ranking result.
7. The method of claim 6, wherein the feature fusing the video feature and the picture feature to obtain a fused feature comprises:
and for each first feature element of the video features, adding the first feature element to a second feature element corresponding to the first feature element in the picture features to obtain the fusion features.
8. The method of claim 6, wherein before the feature fusing the video feature and the picture feature to obtain a fused feature, the method further comprises:
identifying the video type of the target video through a material type identification model;
respectively determining a first preset weight corresponding to the video features and a second preset weight corresponding to the picture features according to the video types;
the feature fusion of the video feature and the picture feature to obtain a fusion feature comprises:
and carrying out weighted summation on the video features and the picture features according to the first preset weight and the second preset weight to obtain the fusion features.
9. The method according to claim 6, wherein the HNSW recall tree comprises a pre-established third HNSW recall tree, the plurality of candidate videos comprise traversed candidate videos determined after the third HNSW recall tree is traversed according to the preset recall strategy, and the calculating the fusion feature similarity between each candidate video and the target video according to the fusion features, the sorting the candidate videos based on the fusion feature similarity, and the determining, from the plurality of candidate videos according to the sorting result, a target candidate video that belongs to the same type as the target video comprise:
traversing the third HNSW recall tree by executing the preset traversal step, and calculating the fusion feature similarity of each traversed candidate video and the target video according to the fusion feature in the traversal process, wherein the third HNSW recall tree is an HNSW recall tree which is pre-established according to the fusion feature similarity between every two candidate videos;
and under the condition that the third HNSW recall tree is traversed, sorting the traversed candidate videos corresponding to the traversed nodes in descending order of fusion feature similarity, and selecting a fifth preset number of traversed candidate videos as the target candidate videos according to the sorting result.
10. The method according to claim 6, wherein the plurality of candidate videos comprise preset candidate videos, and the calculating the fusion feature similarity between each candidate video and the target video according to the fusion features, the sorting the candidate videos based on the fusion feature similarity, and the determining, from the plurality of candidate videos according to the sorting result, a target candidate video that belongs to the same type as the target video comprise:
calculating, according to the fusion features, the fusion feature similarity between each preset candidate video and the target video, sorting the preset candidate videos in descending order of fusion feature similarity, and selecting a sixth preset number of top-ranked preset candidate videos as the target candidate videos according to the sorting result.
11. The method of claim 1, wherein before the calculating the feature similarity of each candidate video to the target video according to the video features and the picture features, the method further comprises:
acquiring text features of the target video through a text feature extraction model obtained through pre-training; and/or acquiring the audio features of the target video through an audio feature extraction model obtained through pre-training;
the calculating the feature similarity of each candidate video and the target video according to the video features and the picture features comprises:
and calculating the feature similarity of each candidate video and the target video according to the picture features, the video features and specified features, wherein the specified features comprise the text features and/or the audio features.
12. The method of claim 1, wherein before the calculating the feature similarity of each candidate video to the target video according to the video features and the picture features, the method further comprises:
performing feature dimensionality reduction on the picture features and the video features to obtain target picture features and target video features subjected to dimensionality reduction;
the calculating the feature similarity of each candidate video and the target video according to the video features and the picture features comprises:
and calculating the feature similarity of each candidate video and the target video according to the target picture features and the target video features.
13. The method of claim 1, wherein the video frame picture comprises a cover frame picture of the target video or a preset video frame picture other than the cover frame picture.
14. The method according to claim 1, wherein the extracting the video feature corresponding to the target video comprises:
dividing the target video into a plurality of video segments according to a preset step length;
and aiming at each video clip, inputting the video clip into a video feature extraction model obtained by pre-training to obtain the video feature corresponding to each video clip.
15. An apparatus for acquiring video material, the apparatus comprising:
the extraction module is used for extracting video frame pictures of the target video;
the extraction module is used for extracting video characteristics corresponding to the target video and picture characteristics corresponding to the video frame picture;
the similarity calculation module is used for calculating the feature similarity of each candidate video and the target video according to the video features and the picture features;
the determining module is used for sequencing the candidate videos based on the feature similarity and determining a target candidate video which belongs to the same type as the target video from a plurality of candidate videos according to a sequencing result;
and the material acquisition module is used for acquiring the target candidate video.
16. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processing apparatus, carries out the steps of the method of any one of claims 1 to 14.
17. An electronic device, comprising:
a storage device having a computer program stored thereon;
processing means for executing the computer program in the storage means to carry out the steps of the method according to any one of claims 1 to 14.
CN202111350557.4A — Priority date 2021-11-15 — Filing date 2021-11-15 — Method and device for acquiring video material, storage medium and electronic equipment — Pending — CN114049591A (en)

Priority Applications (1)

CN202111350557.4A — Priority date 2021-11-15 — Filing date 2021-11-15 — Method and device for acquiring video material, storage medium and electronic equipment


Publications (1)

CN114049591A — published 2022-02-15

Family

ID=80209298

Family Applications (1)

CN202111350557.4A (pending) — Priority date 2021-11-15 — Filing date 2021-11-15 — Method and device for acquiring video material, storage medium and electronic equipment

Country Status (1)

CN — CN114049591A (en)


Legal Events

- PB01 — Publication
- SE01 — Entry into force of request for substantive examination
- CB02 — Change of applicant information:
  Applicant changed from Tiktok vision (Beijing) Co.,Ltd. to Douyin Vision Co.,Ltd., address 100041 B-0035, 2nd floor, Building 3, 30 Shixing Street, Shijingshan District, Beijing;
  Applicant changed from BEIJING BYTEDANCE NETWORK TECHNOLOGY Co.,Ltd. to Tiktok vision (Beijing) Co.,Ltd., same address.