CN113821676A - Video retrieval method, device, equipment and storage medium - Google Patents

Video retrieval method, device, equipment and storage medium

Info

Publication number
CN113821676A
Authority
CN
China
Prior art keywords
video
similarity
sample
spatio
parameter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110853123.XA
Other languages
Chinese (zh)
Inventor
孔伟杰
田上萱
赵文哲
蔡成飞
刘威
王红法
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110853123.XA priority Critical patent/CN113821676A/en
Publication of CN113821676A publication Critical patent/CN113821676A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/732Query formulation
    • G06F16/7328Query by example, e.g. a complete video frame or video sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7847Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures

Abstract

The application discloses a video retrieval method, a video retrieval device, video retrieval equipment and a storage medium, and belongs to the technical field of artificial intelligence. In the method and the device, a target similarity parameter is obtained by jointly considering spatio-temporal information at the video level and spatio-temporal information at the video frame level, so that the target similarity parameter represents the degree of similarity between a first video and a second video more accurately, and the accuracy of retrieving repeated segments between the first video and the second video based on the target similarity parameter is therefore improved.

Description

Video retrieval method, device, equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a video retrieval method, apparatus, device, and storage medium.
Background
With the rapid development of computer technology and the mobile internet, the number of videos on the network has grown explosively, and the large amount of repeated video content among them wastes network resources. Retrieving repeated segments between two videos has therefore become an important research topic. In the related art, for two given videos, one video is taken as the query video (Query) and the other as the query target, that is, the target video. A convolution model is used to extract features of each video frame in the query video and the target video, similarity parameters between the video frames of the two videos are computed from these frame features, the frame-level similarity parameters are processed by algorithms such as dynamic programming and temporal Hough voting to obtain a similarity parameter between the query video and the target video, and repeated segments between the target video and the query video are retrieved based on that similarity parameter.
In the above technology, the video frame features obtained by the convolution model cannot represent the similarity between the query video and the target video well, so the accuracy of repeated segment retrieval between the target video and the query video is low.
Disclosure of Invention
The embodiment of the application provides a video retrieval method, a video retrieval device, video retrieval equipment and a storage medium, and the method can improve the accuracy of repeated segment retrieval between a target video and a query video. The technical scheme is as follows:
in one aspect, a video retrieval method is provided, and the method includes:
acquiring a first global space-time characteristic of a first video, first space-time characteristics of a plurality of first video frames in the first video, a second global space-time characteristic of a second video and second space-time characteristics of a plurality of second video frames in the second video;
acquiring a first similar parameter based on the first global space-time feature and the second global space-time feature, wherein the first similar parameter is used for representing the similarity degree of the video level between the first video and the second video;
acquiring a second similarity parameter based on a plurality of first spatio-temporal features and a plurality of second spatio-temporal features, wherein the second similarity parameter is used for representing the similarity degree of the video frame level between the first video and the second video;
and fusing the first similar parameter and the second similar parameter to obtain a target similar parameter, determining that a repeated segment exists between the first video and the second video in response to the target similar parameter being greater than or equal to a first threshold value, wherein the target similar parameter is used for representing the overall similarity degree between the first video and the second video.
In one aspect, a video retrieval apparatus is provided, the apparatus including:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a first global space-time characteristic of a first video, a first space-time characteristic of a plurality of first video frames in the first video, a second global space-time characteristic of a second video and a second space-time characteristic of a plurality of second video frames in the second video;
the obtaining module is configured to obtain a first similarity parameter based on the first global spatio-temporal feature and the second global spatio-temporal feature, where the first similarity parameter is used to indicate a degree of similarity of video levels between the first video and the second video;
the obtaining module is configured to obtain a second similarity parameter based on a plurality of the first spatio-temporal features and a plurality of the second spatio-temporal features, where the second similarity parameter is used to indicate a degree of similarity at a video frame level between the first video and the second video;
and the determining module is used for fusing the first similar parameter and the second similar parameter to obtain a target similar parameter, determining that a repeated segment exists between the first video and the second video in response to the target similar parameter being greater than a first threshold value, and the target similar parameter is used for representing the overall similarity degree between the first video and the second video.
In some embodiments, for any one of the first video and the second video, the obtaining module is configured to divide each video frame of the video into a plurality of sub-video frames, and obtain spatial features of the plurality of video frames based on spatial information between the plurality of sub-video frames in each video frame; and acquiring global space-time characteristics of the video and space-time characteristics of the video frames based on the spatial characteristics of the video frames and the time sequence information among the video frames.
In some embodiments, the obtaining module is configured to obtain a similarity matrix based on a plurality of the first spatio-temporal features and a plurality of the second spatio-temporal features, where each element in the similarity matrix is used to represent a degree of similarity between each of the first video frames and each of the second video frames; and acquiring the second similarity parameter based on the maximum value of each line in the similarity matrix and the number of the first video frames.
In some embodiments, the determining module is configured to perform weighted summation on the first similar parameter and the second similar parameter based on a first weight and a second weight to obtain the target similar parameter.
In some embodiments, the first weight and the second weight are trained based on a plurality of sample videos and corresponding sample labels;
the plurality of sample videos include a plurality of original videos and videos obtained by transforming the plurality of original videos, any original video and the videos obtained by transforming based on the original videos are in the same category, and the sample label is used for representing the category of the corresponding sample video.
In some embodiments, the obtaining module is further configured to obtain a similarity vector, where each element in the similarity vector is a maximum value of each row in the similarity matrix, and each element in the similarity vector is used to represent a similarity degree between the second video and each first video frame in the first video;
the determining module is further configured to determine the first video frame corresponding to the element, which is greater than or equal to the second threshold, in the similarity vector as a repeated video frame.
In some embodiments, the obtaining module is configured to obtain the first global spatiotemporal feature, a plurality of the first spatiotemporal features, the second global spatiotemporal feature, and a plurality of the second spatiotemporal features based on a video feature extraction model, the first video, and the second video; the video feature extraction model comprises a spatial sub-model and a time-sequence sub-model, each of which is composed of a plurality of Transformers.
In some embodiments, the video feature extraction model is trained based on the plurality of sample videos and corresponding sample labels;
the acquisition module is used for acquiring a target number of sample videos and corresponding sample labels from the plurality of sample videos; randomly combining the sample videos of the target number to obtain a plurality of sample video pairs; acquiring a first sample global space-time characteristic, a plurality of first sample space-time characteristics, a second sample global space-time characteristic and a plurality of second sample space-time characteristics corresponding to the plurality of sample video pairs; obtaining a plurality of first sample similar parameters based on the first sample global space-time characteristics and the second sample global space-time characteristics corresponding to the plurality of sample video pairs; obtaining a plurality of second sample similarity parameters based on the plurality of first sample spatio-temporal features and the plurality of second sample spatio-temporal features corresponding to the plurality of sample video pairs; obtaining a plurality of sample target similar parameters based on the plurality of first sample similar parameters and the plurality of second sample similar parameters; and training the video feature extraction model based on the sample target similarity parameters.
In one aspect, a computer device is provided that includes one or more processors and one or more memories having stored therein at least one computer program that is loaded and executed by the one or more processors to perform the operations performed by the video retrieval method.
In one aspect, a computer-readable storage medium is provided, in which at least one computer program is stored, the at least one computer program being loaded and executed by a processor to perform the operations performed by the video retrieval method.
In one aspect, a computer program product is provided that includes at least one computer program stored in a computer readable storage medium. The processor of the computer device reads the at least one computer program from the computer-readable storage medium, and the processor executes the at least one computer program to cause the computer device to implement the operations performed by the video retrieval method.
According to the technical scheme provided by the embodiment of the application, the spatio-temporal information at the video level and the spatio-temporal information at the video frame level are comprehensively considered to obtain the target similar parameter, so that the target similar parameter can more accurately represent the similarity degree between the first video and the second video, and therefore, the accuracy of retrieval of repeated segments between the first video and the second video can be improved based on the target similar parameter.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic diagram of a repeated video segment provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of a video editing transformation provided by an embodiment of the present application;
fig. 3 is a schematic diagram of an implementation environment of a video retrieval method according to an embodiment of the present application;
fig. 4 is a flowchart of a video retrieval method provided in an embodiment of the present application;
fig. 5 is a flowchart of a video retrieval method provided in an embodiment of the present application;
fig. 6 is a flowchart of a video retrieval method provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of a video feature extraction model provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of a positioning of a repeated segment between a first video and a second video according to an embodiment of the present disclosure;
FIG. 9 is a flowchart illustrating a training process of a video feature extraction model according to an embodiment of the present disclosure;
FIG. 10 is a schematic diagram of a video transform provided by an embodiment of the present application;
fig. 11 is a flowchart of a video retrieval method provided in an embodiment of the present application;
fig. 12 is a flowchart of a video retrieval method provided in an embodiment of the present application;
fig. 13 is a schematic structural diagram of a video retrieval apparatus according to an embodiment of the present application;
fig. 14 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
To make the purpose, technical solutions and advantages of the present application clearer, the following will describe embodiments of the present application in further detail with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," and the like in this application are used for distinguishing between similar items and items that have substantially the same function or similar functionality, and it should be understood that "first," "second," and "nth" do not have any logical or temporal dependency or limitation on the number or order of execution.
In order to facilitate understanding of the technical processes of the embodiments of the present application, some terms referred to in the embodiments of the present application are explained below:
artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
The technical scheme provided by the embodiment of the application can also be combined with a cloud technology, for example, a trained video feature extraction model is deployed on a cloud server. Cloud Technology refers to a hosting Technology for unifying resources of hardware, software, network and the like in a wide area network or a local area network to realize calculation, storage, processing and sharing of data.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism and an encryption algorithm. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product services layer, and an application services layer.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It specializes in studying how a computer simulates or realizes human learning behavior to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way of giving computers intelligence, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from instruction.
A Convolutional Neural Network (CNN) is a feedforward neural network that includes convolution computation and has a deep structure; it is one of the representative algorithms of deep learning. Its design allows an artificial neuron to respond to surrounding units within its coverage range. A CNN is composed of one or more convolutional layers and a fully connected layer at the top (corresponding to a classical neural network), and also includes associated weights and pooling layers.
A rectified linear function (Rectified Linear Unit, ReLU) is an activation function commonly used in artificial neural networks, and generally refers to the nonlinear functions represented by the ramp function and its variants.
An application scenario of the video retrieval method provided in the embodiment of the present application is described below.
The video retrieval method provided by the embodiment of the application can be used for retrieving repeated segments between a first video and a second video. The repeated segments include not only segments of the first video that are identical to the second video, but also segments of the first video that are identical to a transformed version of the second video. For example, as shown in fig. 1, a segment of the first video is not pixel-for-pixel identical to the second video, but it is identical to a version of the second video that has been transformed, for example flipped or resized. At present, methods for transforming a video mainly include photometric transformation, geometric transformation and editing transformation; fig. 2 shows an example of a video editing transformation. Specifically, the video retrieval method can be used in the following three scenarios:
(1) Searching for videos of interest.
If a user wants to retrieve videos that share a repeated segment with a certain video, the user can upload that video on the terminal, the terminal sends the video to the server, and the server retrieves the videos that have repeated segments with it from the database based on the video retrieval method provided by the application. Furthermore, for each retrieved video, the method provided by the application can also locate the repeated segment within the video, which improves the efficiency of the user's search.
(2) Duplicate video removal.
For video applications, especially short video applications, if there are too many videos with repeated contents in the platform, not only the user experience will be affected, but also problems such as copyright dispute will be caused. Based on the method provided by the application, the video with repeated content in the platform can be reduced from the following two aspects:
on one hand, for a video uploaded by a user in a short video application, before the video is published, a server detects the video, and can determine whether a repeated segment exists between the video uploaded by the user and any video in a database based on the method provided by the application.
On the other hand, for videos in the database of the platform, the server can detect the videos in the database, in the detection process, multiple pairs of videos are obtained from the database, and for any pair of videos, if a repeated segment exists between one video and the other video based on the method provided by the application, one video is deleted from the database, so that the videos with repeated contents on the platform are reduced.
(3) Advertisement delivery.
The method provided by the application can be applied to advertisement delivery scenarios in the following two ways:
on one hand, for the advertisement videos in the database, the server can detect the videos in the database, in the detection process, multiple pairs of advertisement videos are obtained from the database, and for any pair of advertisement videos, if a repeated segment exists between one advertisement video and the other advertisement video based on the method provided by the application, one advertisement video is deleted from the database, so that the advertisement videos in the database are deduplicated.
On the other hand, in the advertisement delivery process, advertisement videos are often recommended to users by an advertisement recommendation model. Adding the features of advertisement videos obtained by the method provided by the application into the advertisement recommendation model can improve the performance of the recommendation model and thereby improve the advertisement delivery effect.
It is understood that the video retrieval method provided by the present application is not limited to the above application scenario, and in some embodiments, the video retrieval method can also be applied to other computer vision scenarios, such as image retrieval, image matching, multi-modal media resource retrieval, and the like.
The following describes an implementation environment of the video retrieval method provided by the present application.
Fig. 3 is a schematic diagram of an implementation environment of a video retrieval method provided in an embodiment of the present application. Referring to fig. 3, the implementation environment includes a terminal 301 and a server 302, which are connected directly or indirectly through a wired or wireless network.
The terminal 301 may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like, which is not limited in this embodiment. The terminal 301 may run various applications supporting the video function, such as a short video application, a social application, and the like, and a user can upload a video through the application run by the terminal 301, and the terminal 301 can send the video to the server 302.
The server 302 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like. The server 302 may have associated therewith a database for storing a plurality of videos for video retrieval.
In some embodiments, the terminal 301 and the server 302 can be nodes in a blockchain system.
Optionally, the terminal 301 generally refers to one of a plurality of terminals, and the present embodiment is only illustrated by the terminal 301. Those skilled in the art will appreciate that the number of terminals 301 can be greater. For example, the number of the terminals 301 is several tens or several hundreds, or more, and in this case, the environment in which the video retrieval method is implemented also includes other terminals. The number and the device type of the terminals are not limited in the embodiments of the present application.
Based on the foregoing implementation environment, fig. 4 is a flowchart of a video retrieval method provided in an embodiment of the present application, and as shown in fig. 4, the method includes the following steps.
401. The server obtains a first global spatiotemporal feature of a first video, a first spatiotemporal feature of a plurality of first video frames in the first video, a second global spatiotemporal feature of a second video, and a second spatiotemporal feature of a plurality of second video frames in the second video.
In some embodiments, the server first obtains the first video and the second video, and then obtains the characteristics of the first video and the characteristics of the second video. The first video is a video to be compared with a second video, the second video is a Query (Query) video, and the purpose of comparing the first video with the second video is to determine whether a repeated segment exists between the first video and the second video.
For any one of the first video and the second video, the global spatio-temporal characteristics of the video are video-level characteristics and are used for representing the spatio-temporal information of the whole video, and the spatio-temporal characteristics of the video frames are video-frame-level characteristics and are used for representing the spatio-temporal information of the video frames. The spatio-temporal information of the video as a whole refers to semantic dependency relationship among a plurality of video frames in the video and spatial structure information of each video frame, and the spatio-temporal information of a video frame refers to semantic dependency relationship between the video frame and other video frames and spatial structure relationship among a plurality of sub-video frames of the video frame.
402. The server obtains a first similarity parameter based on the first global space-time feature and the second global space-time feature, wherein the first similarity parameter is used for representing the similarity degree of the video level between the first video and the second video.
The larger the value of the first similar parameter is, the higher the similarity of the video level between the first video and the second video is, and the smaller the value of the first similar parameter is, the lower the similarity of the video level between the first video and the second video is.
403. The server obtains a second similarity parameter based on a plurality of the first spatio-temporal features and a plurality of the second spatio-temporal features, wherein the second similarity parameter is used for representing the similarity degree of the video frame level between the first video and the second video.
The larger the value of the second similarity parameter is, the higher the similarity at the video frame level between the first video and the second video is, and the smaller the value of the second similarity parameter is, the lower the similarity at the video frame level between the first video and the second video is.
404. The server fuses the first similar parameter and the second similar parameter to obtain a target similar parameter, determines that a repeated segment exists between the first video and the second video in response to the target similar parameter being larger than a first threshold value, and the target similar parameter is used for representing the overall similarity degree between the first video and the second video.
The target similarity parameter comprehensively considers the spatio-temporal information at the video level and the spatio-temporal information at the video frame level, so that the similarity degree between the first video and the second video can be more accurately represented.
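As an illustration only (not part of the claimed embodiments), the decision flow of steps 401 to 404 can be sketched in Python/PyTorch as follows; the function name, weight values and threshold value are assumptions introduced for the example, and the feature extraction itself is abstracted away:

```python
import torch
import torch.nn.functional as F

def has_repeated_segment(global_a, frames_a, global_b, frames_b,
                         w1=0.5, w2=0.5, first_threshold=0.8):
    # global_a / global_b: (D,) global spatio-temporal features of the two videos
    # frames_a / frames_b: (N, D) and (M, D) spatio-temporal features of their video frames

    # Step 402: first similarity parameter, video-level similarity
    s_v2v = F.cosine_similarity(global_a, global_b, dim=0)

    # Step 403: second similarity parameter, video-frame-level similarity
    sim = F.normalize(frames_a, dim=-1) @ F.normalize(frames_b, dim=-1).T  # (N, M)
    s_f2f = sim.max(dim=1).values.mean()

    # Step 404: fuse into the target similarity parameter and compare with the first threshold
    s_target = w1 * s_v2v + w2 * s_f2f
    return s_target >= first_threshold, s_target
```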
According to the technical scheme provided by the embodiment of the application, the spatio-temporal information at the video level and the spatio-temporal information at the video frame level are comprehensively considered to obtain the target similar parameter, so that the target similar parameter can more accurately represent the similarity degree between the first video and the second video, and therefore, the accuracy of retrieval of repeated segments between the first video and the second video can be improved based on the target similar parameter.
The video retrieval method provided by the present application is described below based on the above three application scenarios. This embodiment of the present application takes the duplicate video removal scenario as an example and describes the video retrieval method with reference to fig. 5 and 6, which are flowcharts of a video retrieval method provided by an embodiment of the present application. As shown in fig. 5, the method includes the following steps.
501. The terminal responds to the video uploading operation of the user and sends a duplication checking request to the server, and the duplication checking request carries the first video.
In some embodiments, if it is required to detect whether a video uploaded by a user and a video in a database have a repeated segment, a terminal responds to the video uploaded by the user on a platform, acquires the video uploaded by the user as a first video, and sends a duplication checking request to a server, where the duplication checking request carries the first video, and the duplication checking request is used to instruct the server to duplicate the first video.
Optionally, the duplication checking request carries a video identifier, the video identifier is used to indicate any video published on the platform, and the duplication checking request is used to indicate the server to duplicate the video indicated by the video identifier.
Based on the method, the manager of the video platform can detect the video uploaded by the user or the video published on the platform, so as to determine whether repeated segments exist between the video uploaded by the user and the video in the database.
502. And the server responds to the duplication checking request of the terminal and acquires the second video from the database.
In some embodiments, the server responds to the duplication checking request, obtains a first video in the duplication checking request, and randomly obtains a second video from a plurality of videos in the database.
In some embodiments, the duplication checking request further carries video information of the first video, where the video information includes a type, a duration, and the like of the first video, and the server obtains the first video and the video information of the first video in the duplication checking request in response to the duplication checking request, and obtains the second video from the database based on the video information. Illustratively, the server acquires a video of the same type as the first video in the database as the second video based on the video information, or determines a video of which the difference value between the time length in the database and the time length of the first video is smaller than a target threshold value based on the video information, and acquires the video as the second video.
Optionally, if the duplication checking request carries a video identifier, the server acquires the video indicated by the video identifier as a first video, and acquires a second video from the database.
It should be noted that the server may obtain a plurality of second videos from the database, and retrieve the repeated segments between the first video and each second video in a serial or parallel manner.
503. The server obtains a first global spatiotemporal feature of a first video, a first spatiotemporal feature of a plurality of first video frames in the first video, a second global spatiotemporal feature of a second video, and a second spatiotemporal feature of a plurality of second video frames in the second video.
In some embodiments, the server first obtains first spatial features of the plurality of first video frames and second spatial features of the plurality of second video frames, and then obtains a first global spatiotemporal feature, a second global spatiotemporal feature, a plurality of first spatiotemporal features, and a plurality of second spatiotemporal features based on the plurality of first spatial features, the plurality of second spatial features, timing information of the plurality of first video frames, and timing information of the plurality of second video frames. The above process is explained based on steps 503A to 503D as follows:
503A, the server divides each first video frame of the first video into a plurality of sub-video frames, and obtains first spatial features of the plurality of first video frames based on spatial information between the plurality of sub-video frames in each first video frame.
In some embodiments, for any first video frame, the server uniformly divides the first video frame into n × n sub-video frames, each sub-video frame containing part of the picture of the first video frame, where n is an integer greater than 0, and inputs the sub-video frames into the spatial sub-model of the video feature extraction model. As shown in fig. 7, the spatial sub-model is composed of multiple Transformers, each of which includes a multi-head self-attention unit. By inputting the sub-video frames into the spatial sub-model, the spatial structure relationship between each sub-video frame and the other sub-video frames is obtained based on the multi-head self-attention units, that is, the first spatial feature of the first video frame is obtained, and this first spatial feature is used to represent the spatial information of the first video frame.
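A minimal sketch of how such a spatial sub-model might be realized is given below, assuming a PyTorch Transformer encoder; the patch size, embedding dimension, number of heads, number of layers and the mean-pooling of patch tokens are assumptions introduced for illustration and are not specified by the application:

```python
import torch
import torch.nn as nn

class SpatialSubModel(nn.Module):
    """Hypothetical spatial sub-model: self-attention over the sub-video frames of one frame."""

    def __init__(self, patch=16, dim=512, heads=8, layers=4):
        super().__init__()
        self.patch = patch
        self.embed = nn.Linear(3 * patch * patch, dim)  # flatten each sub-video frame into a token
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, frame):                            # frame: (3, H, W)
        c, h, w = frame.shape
        p = self.patch
        # uniformly divide the frame into (h // p) * (w // p) sub-video frames
        patches = frame.unfold(1, p, p).unfold(2, p, p)  # (3, h//p, w//p, p, p)
        patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c * p * p)
        tokens = self.embed(patches).unsqueeze(0)        # (1, n_patches, dim)
        out = self.encoder(tokens)                       # spatial relations via multi-head self-attention
        # mean-pool the patch tokens into this video frame's spatial feature (pooling choice is an assumption)
        return out.mean(dim=1).squeeze(0)
```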
503B, the server obtains a first global spatiotemporal feature and a first spatiotemporal feature of the plurality of first video frames based on the first spatial feature of the plurality of first video frames and the timing information between the plurality of first video frames.
In some embodiments, the server splices a target vector and the plurality of first spatial features in a target order to obtain a spliced feature. Splicing in the target order means that the target vector is taken as the first feature of the spliced feature, and the first spatial features are spliced after the target vector in the order of their corresponding first video frames in the first video. The server inputs the spliced feature into the time-sequence sub-model of the video feature extraction model, processes the spliced feature through the multiple Transformers in the time-sequence sub-model, takes the first feature of the output data, that is, the output feature corresponding to the target vector, as the first global spatio-temporal feature, and takes the other features of the output data, that is, the output features corresponding to the first spatial features, as the plurality of first spatio-temporal features. For example, if the spliced feature is an S × T matrix, its first row is the target vector and its second to S-th rows are the first spatial features; the spliced feature is input into the time-sequence sub-model, and if the output data is an S × J matrix, the first row of the output data is the first global spatio-temporal feature and the second to S-th rows are the plurality of first spatio-temporal features. Here S, T and J are integers greater than 0.
The target vector is a vector randomly generated by the server and used for fusing the spatio-temporal information of the plurality of video frames; it has the same dimension as the spatial features. Optionally, the target vector is a [CLS] embedding vector.
The process in which the server processes the spliced feature through the multiple Transformers of the time-sequence sub-model is described exemplarily below. In the time-sequence sub-model, each Transformer includes a multi-head self-attention unit. For the first Transformer, the server takes the spliced feature as input data and extracts, based on the multi-head self-attention unit of that Transformer, the relationship between each feature in the spliced feature and the others. By obtaining the relationship between each first spatial feature and the other first spatial features, the output feature corresponding to each first spatial feature in the output data of the first Transformer contains the spatial information and timing information of the corresponding first video frame; and by obtaining the relationship between the target vector and the first spatial features, the first spatial features are fused, so that in the output data of the first Transformer the output feature corresponding to the target vector contains the spatial information of the plurality of first video frames.
Further, for any Transformer other than the first one, the server takes the output data of the previous Transformer as its input data. Since the output feature corresponding to each first spatial feature in the output data of the previous Transformer already contains the spatial information and timing information of the corresponding first video frame, extracting the relationships between the features of that output data based on the multi-head self-attention unit of the current Transformer makes the output feature corresponding to the target vector contain the spatial information and timing information of the plurality of first video frames, while the output feature corresponding to each first spatial feature contains richer spatial and timing information. The server takes the output feature corresponding to the target vector in the output data of the last Transformer as the first global spatio-temporal feature, and takes the output features corresponding to the first spatial features as the plurality of first spatio-temporal features.
Through the time sequence submodel, the characteristics of other video frames are blended into the first spatial characteristics of each first video frame, so that the time sequence information of the first video frames is obtained, the characteristics of each first video frame contain richer information, and the video retrieval accuracy is improved. Meanwhile, the time sequence submodel can fuse the spatial information and the time sequence information of a plurality of first video frames through a plurality of transformers to obtain a first global space-time characteristic of the first video, so that the accuracy of video retrieval is further improved.
In some embodiments, for this step 503B, the server inputs the position information of the plurality of video frames, the spatial features of the plurality of video frames, and the target vector into the time-sequence sub-model to obtain the global spatio-temporal feature and the spatio-temporal features of the video frames, where the position information indicates the order of the corresponding video frame in the video. Illustratively, the server first splices each first spatial feature with the position information of its corresponding first video frame and splices the target vector with an invalid (placeholder) position, obtaining a plurality of first spliced features; it then splices these first spliced features in the target order to obtain a second spliced feature, inputs the second spliced feature into the time-sequence sub-model, takes the output feature corresponding to the first spliced feature of the target vector as the global spatio-temporal feature, and takes the output features corresponding to the first spliced features of the spatial features as the spatio-temporal features of the video frames. By inputting the position information of the first video frames into the time-sequence sub-model, the obtained features contain richer video information, which further improves the accuracy of video retrieval. Optionally, the position information is the frame number or timestamp of the video frame.
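The following sketch illustrates, under the same assumptions as above, how such a time-sequence sub-model could be realized: a randomly initialized target ([CLS]-style) vector is spliced before the per-frame spatial features, position information is added, and a Transformer encoder yields the global spatio-temporal feature (first output) and the per-frame spatio-temporal features (remaining outputs). The class name and all hyperparameter values are assumptions:

```python
import torch
import torch.nn as nn

class TemporalSubModel(nn.Module):
    """Hypothetical time-sequence sub-model fusing per-frame spatial features with a target vector."""

    def __init__(self, dim=512, heads=8, layers=4, max_frames=256):
        super().__init__()
        self.cls = nn.Parameter(torch.randn(1, 1, dim))   # randomly generated target vector
        self.pos = nn.Embedding(max_frames + 1, dim)      # position information (frame order)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, spatial_feats):                     # (num_frames, dim) spatial features
        x = spatial_feats.unsqueeze(0)                    # (1, num_frames, dim)
        x = torch.cat([self.cls, x], dim=1)               # splice target vector as the first feature
        idx = torch.arange(x.size(1), device=x.device)
        x = x + self.pos(idx)                             # add per-token position information
        out = self.encoder(x).squeeze(0)
        global_feat = out[0]                              # first output: global spatio-temporal feature
        frame_feats = out[1:]                             # remaining outputs: per-frame spatio-temporal features
        return global_feat, frame_feats
```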
503C, the server divides each second video frame of the second video into a plurality of sub-video frames, and obtains spatial features of the plurality of second video frames based on spatial information between the plurality of sub-video frames in each second video frame.
In some embodiments, the step 503C is similar to the step 503A, and is not described herein again.
503D, the server obtains a second global spatiotemporal feature and spatiotemporal features of the plurality of second video frames based on the spatial features of the plurality of second video frames and the timing information between the plurality of second video frames.
In some embodiments, the step 503D is similar to the step 503B, and is not described herein again.
It should be noted that, the processing processes of the first video and the second video may be performed synchronously or according to a certain sequence, which is not limited in this embodiment of the present application.
504. The server obtains a first similarity parameter based on the first global space-time feature and the second global space-time feature, wherein the first similarity parameter is used for representing the similarity degree of the video level between the first video and the second video.
In some embodiments, the method for obtaining the first similarity parameter based on the first global spatiotemporal feature and the second global spatiotemporal feature by the server is as shown in formula (1).
S_v2v = (V_A · V_B) / (‖V_A‖ · ‖V_B‖)    (1)
where S_v2v denotes the first similarity parameter, V_A denotes the first global spatio-temporal feature, and V_B denotes the second global spatio-temporal feature. Formula (1) is also known as cosine similarity.
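For clarity, formula (1) corresponds directly to the following one-line computation (an illustrative sketch; the function name is an assumption):

```python
import torch

def first_similarity_parameter(v_a: torch.Tensor, v_b: torch.Tensor) -> torch.Tensor:
    # Formula (1): S_v2v = (V_A . V_B) / (||V_A|| * ||V_B||), i.e. cosine similarity
    return torch.dot(v_a, v_b) / (v_a.norm() * v_b.norm())
```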
505. The server obtains a second similarity parameter based on the first space-time characteristics and the second space-time characteristics, wherein the second similarity parameter is used for representing the similarity degree of the video frame level between the first video and the second video.
In some embodiments, this step 505 is implemented based on the following steps 505A to 505B:
505A, the server obtains a similarity matrix based on the plurality of first spatio-temporal features and the plurality of second spatio-temporal features, wherein each element in the similarity matrix is used for representing the similarity degree between each first video frame and each second video frame.
In some embodiments, the server obtains a similarity matrix whose elements are similarity parameters obtained by the above formula (1) from one first spatio-temporal feature and one second spatio-temporal feature, that is, each element is the degree of similarity between one first video frame and one second video frame, and the dimension of the similarity matrix is the product of the number of first video frames and the number of second video frames. For example, if the first video includes N first video frames and the second video includes M second video frames, that is, the first video has N first spatio-temporal features and the second video has M second spatio-temporal features, where M and N are integers greater than 0, the server obtains an N × M similarity matrix P_A2B based on the N first spatio-temporal features and the M second spatio-temporal features; the element P_ij of the matrix indicates the degree of similarity between the i-th first video frame and the j-th second video frame.
505B, the server obtains the second similarity parameter based on the maximum value of each line in the similarity matrix and the number of the first video frames.
Wherein, the maximum value of each row refers to the element with the maximum value of each row in the similarity matrix.
In some embodiments, the server obtains the second similarity parameter based on the similarity matrix, as shown in formula (2).
S_f2f = (1/N) · Σ_{i=1..N} max( P_A2B(i, :) )    (2)
where S_f2f denotes the second similarity parameter, N denotes the number of first video frames, P_A2B denotes the similarity matrix, and P_A2B(i, :) denotes the i-th row of the similarity matrix.
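Steps 505A and 505B can be illustrated together with the following sketch, which builds the similarity matrix from L2-normalized spatio-temporal features (so that each element is the cosine similarity of formula (1)) and then applies formula (2); the function name is an assumption:

```python
import torch
import torch.nn.functional as F

def second_similarity_parameter(first_feats, second_feats):
    """Illustrative computation of S_f2f.

    first_feats:  (N, D) spatio-temporal features of the first video frames
    second_feats: (M, D) spatio-temporal features of the second video frames
    """
    a = F.normalize(first_feats, dim=-1)
    b = F.normalize(second_feats, dim=-1)
    sim_matrix = a @ b.T                      # P_A2B, element (i, j): cosine similarity of frames i and j
    row_max = sim_matrix.max(dim=1).values    # maximum of each row P_A2B(i, :)
    s_f2f = row_max.mean()                    # average over the N first video frames, formula (2)
    return s_f2f, sim_matrix
```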
506. And the server fuses the first similar parameter and the second similar parameter to obtain the target similar parameter.
In some embodiments, the server performs weighted summation on the first similar parameter and the second similar parameter based on the first weight and the second weight to obtain the target similar parameter. Optionally, the target similarity parameter is denoted by S. The first weight and the second weight are weight coefficients preset in the server, and values of the first weight and the second weight can be set based on actual requirements, which is not limited in the embodiment of the present application.
In some embodiments, the first weight and the second weight are trained based on a plurality of sample videos and corresponding sample labels, and the training process is detailed in the corresponding embodiment of fig. 9. The first weight and the second weight are obtained through training, so that the accuracy of the target similarity coefficient obtained based on the first weight and the second weight is higher, and the accuracy of searching repeated segments of the first video and the second video is improved.
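A sketch of the weighted fusion with learnable weights (as in the embodiments where the first weight and the second weight are trained) is shown below; the initial weight values and the class name are assumptions:

```python
import torch
import torch.nn as nn

class SimilarityFusion(nn.Module):
    """Hypothetical fusion of the two similarity parameters with learnable weights."""

    def __init__(self, w1: float = 0.5, w2: float = 0.5):
        super().__init__()
        # first and second weights; initial values here are assumptions
        self.w1 = nn.Parameter(torch.tensor(w1))
        self.w2 = nn.Parameter(torch.tensor(w2))

    def forward(self, s_v2v: torch.Tensor, s_f2f: torch.Tensor) -> torch.Tensor:
        # target similarity parameter S = w1 * S_v2v + w2 * S_f2f
        return self.w1 * s_v2v + self.w2 * s_f2f
```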
507. The server determines that a repeated segment exists between the first video and the second video in response to the target similarity parameter being greater than or equal to a first threshold.
The first threshold is a parameter preset in the server, and can be set based on an actual demand, which is not limited in this embodiment of the application, and optionally, the first threshold is represented by T1.
In some embodiments, the server is able to locate the repeated segment between the first video and the second video, as shown in fig. 8. In this case, step 507 further includes: in response to determining that a segment repeated with the second video exists in the first video, the server obtains a similarity vector, where each element of the similarity vector is the maximum value of the corresponding row of the similarity matrix and represents the degree of similarity between the second video and one first video frame of the first video; the first video frames corresponding to the elements of the similarity vector that are greater than or equal to a second threshold are determined as repeated video frames, where the second threshold is optionally denoted by T2. Consecutive repeated video frames are then connected to obtain the repeated segment of the first video and the second video, and the frame numbers or timestamps of the repeated video frames in the first video give the position of the repeated segment in the first video. Optionally, the first threshold and the second threshold may also be referred to as similarity thresholds.
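The localization procedure described above can be sketched as follows; the threshold value and function name are assumptions, and segments are returned as frame-index ranges in the first video:

```python
import torch

def locate_repeated_segments(sim_matrix, second_threshold=0.8):
    """Illustrative repeated-segment localization from the similarity matrix.

    sim_matrix: (N, M) similarity matrix P_A2B between the first and second video frames.
    Returns a list of (start_frame, end_frame) index pairs in the first video.
    """
    sim_vector = sim_matrix.max(dim=1).values       # maximum of each row: the similarity vector
    repeated = sim_vector >= second_threshold       # repeated video frames (comparison with T2)

    segments, start = [], None
    for i, flag in enumerate(repeated.tolist()):
        if flag and start is None:
            start = i                                # a run of repeated frames begins
        elif not flag and start is not None:
            segments.append((start, i - 1))          # connect consecutive repeated frames
            start = None
    if start is not None:
        segments.append((start, len(repeated.tolist()) - 1))
    return segments
```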
508. And the server sends a duplicate checking message to the terminal, wherein the duplicate checking message is used for indicating that the video uploaded by the user has repeated segments with the video in the database.
In some embodiments, if the server locates a repeated segment between the first video and the second video, the duplicate checking message carries the location of the repeated segment. And the terminal receives the duplicate checking message of the server and displays prompt information on an interface of the video application program, wherein the prompt information is used for prompting a user to modify the video. And if the server also sends the position of the repeated section to the terminal, the terminal displays the indication information and the position of the repeated section on an interface of the video application program.
Optionally, if the first video is a video published on the platform, the server may delete the video, or the server may send the identifier of the first video and the identifier of the second video to the terminal, the terminal displays the identifiers of the two videos on the platform management interface, and a platform manager can call the corresponding video from the database based on the identifiers, and check the two videos to determine whether to delete the first video.
The following describes the training process of the above-mentioned video feature extraction model, and as shown in fig. 9, the training process includes the following two parts:
(1) Sample video preparation.
The video feature extraction model is trained based on a plurality of sample videos and corresponding sample labels. The plurality of sample videos include a plurality of original videos and videos obtained by transforming these original videos; any original video and the videos obtained by transforming it are called same-family videos, and the sample label represents the family group of the corresponding sample video.
The methods for transforming an original video include photometric transformation, geometric transformation and editing transformation. As shown in fig. 10, the photometric transformations include brightness, contrast, hue, saturation and gamma transformations, the geometric transformations include horizontal flipping, rotation, cropping, resizing and translation, and the editing transformations include adding a blurred background, adding an icon (Logo), picture-in-picture, and the like.
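As a purely illustrative sketch, the photometric and geometric transformations listed above could be generated with torchvision's functional transforms as shown below; the specific factor values are assumptions, and the editing transformations (blurred background, Logo overlay, picture-in-picture) would require custom compositing and are omitted:

```python
import torchvision.transforms.functional as TF

def photometric_transform(frame):
    # brightness / contrast / saturation / hue / gamma transformations of one frame
    frame = TF.adjust_brightness(frame, 1.3)
    frame = TF.adjust_contrast(frame, 0.8)
    frame = TF.adjust_saturation(frame, 1.2)
    frame = TF.adjust_hue(frame, 0.05)
    return TF.adjust_gamma(frame, 1.1)

def geometric_transform(frame):
    # horizontal flip, rotation, and crop-plus-resize of one frame
    frame = TF.hflip(frame)
    frame = TF.rotate(frame, angle=10)
    return TF.resized_crop(frame, top=20, left=20, height=180, width=180, size=[224, 224])

def make_same_family_video(frames, transform):
    # apply the same deterministic transform to every frame of an original video
    return [transform(f) for f in frames]
```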
(2) Model training.
The training process of the video feature extraction model is realized by multiple iterations, and the process of any iteration training comprises the following steps (a) to (g):
(a) the server obtains a target number of sample videos and corresponding sample labels from the plurality of sample videos, and randomly combines the target number of sample videos to obtain a plurality of sample video pairs.
If the two sample videos in the sample video pair are the same family videos, the sample video pair is a positive sample pair, and if the two sample videos in the sample video pair are not the same family videos, the sample video pair is a negative sample pair.
(b) The server obtains a first sample global spatiotemporal feature, a plurality of first sample spatiotemporal features, a second sample global spatiotemporal feature and a plurality of second sample spatiotemporal features corresponding to the plurality of sample video pairs based on the same method as the above step 503.
(c) The server obtains a plurality of first sample similarity parameters by a method similar to the step 504 based on the first sample global spatio-temporal feature and the second sample global spatio-temporal feature corresponding to the plurality of sample video pairs.
(d) The server obtains a plurality of second sample similarity parameters based on a plurality of first sample spatiotemporal features and a plurality of second sample spatiotemporal features corresponding to the plurality of sample video pairs by a method similar to the above step 505.
(e) And the server performs weighted summation on the plurality of first sample similar parameters and the plurality of second sample similar parameters based on the first weight and the second weight to obtain a plurality of sample target similar parameters.
(f) The server obtains a deep metric loss based on the plurality of sample target similarity parameters, the corresponding sample labels, and a Deep Metric Learning (DML) loss function. Optionally, the deep metric loss function is a Triplet Loss function, a Contrastive Loss function, or another DML loss function, which is not limited in this embodiment of the present application.
(g) The server trains the video feature extraction model based on the deep metric loss. The DML loss reflects the accuracy of the sample target similarity of each sample video pair, and the DML loss function includes a boundary (Margin) constraint on the sample target similarity of negative sample pairs; when a negative sample pair violates this boundary constraint, the deep metric loss becomes large. Therefore, training the video feature extraction model based on the deep metric loss increases the similarity between positive sample pairs and decreases the similarity between negative sample pairs, so that the video features obtained by the video feature extraction model can accurately represent the similarity between videos, thereby improving the accuracy of video retrieval.
In some embodiments, if the first weight and the second weight in step 505 are obtained based on the sample videos and the corresponding sample labels, step (g) of the training process further includes: the server trains the first weight and the second weight based on the deep metric loss. The first weight and the second weight trained based on the deep metric loss can more accurately represent the respective proportions of the first similarity parameter and the second similarity parameter, so that the accuracy of video retrieval can be improved.
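If the two weights are trained as described, one assumed way to hold them is as learnable parameters that stay positive and sum to one, updated by the same deep metric loss; the softmax parameterization below is an illustrative choice, not the patent's.

```python
import torch
import torch.nn as nn

class LearnableFusion(nn.Module):
    """Learnable first/second weights for fusing the two similarity parameters,
    trained jointly with the deep metric loss."""
    def __init__(self):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(2))

    def forward(self, s1: torch.Tensor, s2: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.logits, dim=0)   # (first weight, second weight)
        return w[0] * s1 + w[1] * s2            # sample target similarity parameter
```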
According to the technical scheme provided by the embodiment of the application, the spatio-temporal information at the video level and the spatio-temporal information at the video frame level are comprehensively considered to obtain the target similar parameter, so that the target similar parameter can more accurately represent the similarity degree between the first video and the second video, and therefore, the accuracy of retrieval of repeated segments between the first video and the second video can be improved based on the target similar parameter.
In this embodiment of the present application, the video retrieval method is described with reference to fig. 11 by taking a video-of-interest retrieval scenario as an example. Fig. 11 is a flowchart of the video retrieval method provided in this embodiment of the present application. As shown in fig. 11, the method includes the following steps.
1101. The terminal sends a retrieval request to the server in response to a video uploading operation of a user, where the retrieval request carries the second video.
In some embodiments, if a user wishes to retrieve videos that have a repeated segment with a certain video, the user may upload that video on the terminal. In response to the video uploading operation of the user, the terminal acquires the uploaded video as the second video and sends a retrieval request to the server, where the retrieval request carries the second video and is used to instruct the server to return videos that have a repeated segment with the second video.
Optionally, the user may select a video of interest through an interface of the video application program. For example, the terminal displays links of a plurality of videos and corresponding selection controls on the interface of the video application program, and in response to a click operation of the user on the selection control of any video, the terminal sends a retrieval request to the server, where the retrieval request carries an identifier of the video.
1102. The server acquires the first video from the database in response to a retrieval request of the terminal.
In some embodiments, in response to the retrieval request, the server obtains the second video carried in the retrieval request, and acquires the first video from the database based on a method similar to step 1002.
Optionally, if the retrieval request carries a video identifier, the server acquires the video indicated by the video identifier as a second video, and acquires the first video from the database based on a method similar to that in step 1002.
It should be noted that the server may obtain a plurality of first videos from the database, and retrieve the repeated segments between each first video and the second video in a serial or parallel manner.
1103. The server obtains a first global spatiotemporal feature of a first video, a first spatiotemporal feature of a plurality of first video frames in the first video, a second global spatiotemporal feature of a second video, and a second spatiotemporal feature of a plurality of second video frames in the second video.
1104. The server obtains a first similarity parameter based on the first global space-time feature and the second global space-time feature, wherein the first similarity parameter is used for representing the similarity degree of the video level between the first video and the second video.
1105. The server obtains a second similarity parameter based on the first space-time characteristics and the second space-time characteristics, wherein the second similarity parameter is used for representing the similarity degree of the video frame level between the first video and the second video.
1106. And the server fuses the first similar parameter and the second similar parameter to obtain the target similar parameter.
1107. The server determines that a repeated segment exists between the first video and the second video in response to the target similarity parameter being greater than or equal to a first threshold.
In some embodiments, the above steps 1103 to 1107 are similar to the steps 503 to 507, and are not described herein again.
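For a concrete picture of steps 1103 to 1107, the sketch below computes the two similarity parameters and fuses them; cosine similarity, the equal fusion weights, and the 0.8 threshold are illustrative assumptions, while the row-maximum aggregation over the number of first video frames follows the similarity-matrix description given later for the obtaining module.

```python
import torch
import torch.nn.functional as F

def retrieve_repeated_segment(g1, g2, f1, f2, w1=0.5, w2=0.5, first_threshold=0.8):
    """g1/g2: global spatio-temporal features of shape (D,);
    f1/f2: per-frame spatio-temporal features of shape (N1, D) and (N2, D)."""
    # Step 1104: video-level (first) similarity parameter.
    s1 = F.cosine_similarity(g1.unsqueeze(0), g2.unsqueeze(0)).item()

    # Step 1105: frame-level (second) similarity parameter via a similarity matrix
    # whose rows index first-video frames and columns index second-video frames.
    sim_matrix = F.normalize(f1, dim=1) @ F.normalize(f2, dim=1).t()   # (N1, N2)
    row_max = sim_matrix.max(dim=1).values                             # best match per first-video frame
    s2 = (row_max.sum() / f1.shape[0]).item()                          # divide by number of first video frames

    # Step 1106: fuse the two similarity parameters by weighted summation.
    target_similarity = w1 * s1 + w2 * s2

    # Step 1107: a repeated segment is deemed to exist when the target
    # similarity parameter is greater than or equal to the first threshold.
    return target_similarity, target_similarity >= first_threshold
```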
1108. The server sends the terminal a link to a video that has a repeated segment with the second video, where the link is used to play the video in response to a click operation.
In some embodiments, if the server has located the repeated segment, the server also sends the position of the repeated segment to the terminal. The terminal receives the video link from the server and displays the video link on an interface of the video application program. If the server also sends the position of the repeated segment, the terminal displays the video link and the position of the repeated segment on the interface of the video application program.
Optionally, the server sends the terminal links to a plurality of videos that each have a repeated segment with the second video, together with the positions of the corresponding repeated segments, and the terminal displays the links of the plurality of videos and the positions of the corresponding repeated segments on the interface of the video application program.
According to the technical scheme provided by the embodiment of the application, the spatio-temporal information at the video level and the spatio-temporal information at the video frame level are comprehensively considered to obtain the target similar parameter, so that the target similar parameter can more accurately represent the similarity degree between the first video and the second video, and therefore, the accuracy of retrieval of repeated segments between the first video and the second video can be improved based on the target similar parameter.
In this embodiment of the present application, an advertisement delivery scenario is taken as an example to describe the video retrieval method with reference to fig. 12. Fig. 12 is a flowchart of the video retrieval method provided in this embodiment of the present application. As shown in fig. 12, the method includes the following steps.
1201. The terminal sends a deduplication request to the server, wherein the deduplication request is used for instructing the server to delete the advertising video with repeated content in the database.
In some embodiments, a manager of the advertisement delivery system can trigger a process of deleting the advertisement video with repeated content in the database through a management interface of the terminal, and the terminal sends a deduplication request to the server in response to a trigger operation of the manager on deduplication of the advertisement video on the management interface, wherein the deduplication request is used for instructing the server to delete the advertisement video with repeated content in the database.
1202. The server responds to the duplication elimination request of the terminal and obtains the first video and the second video from the database.
In some embodiments, in response to the deduplication request of the terminal, the server randomly obtains a plurality of pairs of advertisement videos from the database, or obtains a plurality of pairs of advertisement videos of the same type based on the labels of the advertisement videos, or obtains, based on the lengths of the advertisement videos, pairs of advertisement videos whose length difference is smaller than a target threshold. For any pair of advertisement videos, one advertisement video is used as the first video, and the other advertisement video is used as the second video.
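The following sketch illustrates one assumed way to form candidate advertisement video pairs by type label or by length difference; the metadata fields, the helper name, and the threshold value are hypothetical.

```python
def candidate_ad_pairs(ads, max_length_diff=5.0):
    """ads: list of dicts with 'id', 'label' (type), and 'length' in seconds.
    Pairs videos of the same type, or videos whose length difference is
    smaller than the target threshold."""
    pairs = []
    for i in range(len(ads)):
        for j in range(i + 1, len(ads)):
            same_type = ads[i]["label"] == ads[j]["label"]
            close_length = abs(ads[i]["length"] - ads[j]["length"]) < max_length_diff
            if same_type or close_length:
                pairs.append((ads[i]["id"], ads[j]["id"]))   # (first video, second video)
    return pairs
```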
Optionally, the server can periodically trigger a process of removing duplicate of the advertisement videos in the database, that is, the server obtains multiple pairs of advertisement videos from the database at target time intervals, and for any pair of advertisement videos, one advertisement video is used as a first video, and the other advertisement video is used as a second video, so as to retrieve a repeated segment between the first video and the second video.
It should be noted that, for multiple pairs of acquired advertisement videos, the server can retrieve the repeated segments between each pair of advertisement videos in a serial or parallel manner.
1203. The server obtains a first global spatiotemporal feature of a first video, a first spatiotemporal feature of a plurality of first video frames in the first video, a second global spatiotemporal feature of a second video, and a second spatiotemporal feature of a plurality of second video frames in the second video.
1204. The server obtains a first similarity parameter based on the first global space-time feature and the second global space-time feature, wherein the first similarity parameter is used for representing the similarity degree of the video level between the first video and the second video.
1205. The server obtains a second similarity parameter based on the first space-time characteristics and the second space-time characteristics, wherein the second similarity parameter is used for representing the similarity degree of the video frame level between the first video and the second video.
1206. And the server fuses the first similar parameter and the second similar parameter to obtain the target similar parameter.
1207. The server determines that a repeated segment exists between the first video and the second video in response to the target similarity parameter being greater than or equal to a first threshold.
In some embodiments, steps 1203 to 1207 are similar to steps 503 to 507, and are not described herein again.
1208. The server deletes the first video or the second video from the database.
Optionally, the server sends the identifier of the first video and the identifier of the second video to the terminal, the terminal displays the identifiers of the two videos on a management interface of the advertisement delivery system, and a manager of the advertisement delivery system can call the corresponding video from the database based on the identifiers, check the two videos, and determine whether to delete one of the videos.
According to the technical scheme provided by the embodiment of the application, the spatio-temporal information at the video level and the spatio-temporal information at the video frame level are comprehensively considered to obtain the target similar parameter, so that the target similar parameter can more accurately represent the similarity degree between the first video and the second video, and therefore, the accuracy of retrieval of repeated segments between the first video and the second video can be improved based on the target similar parameter.
The video retrieval method is described in the embodiments corresponding to fig. 5 to fig. 12 based on different application scenarios, where the main difference of the method lies in the process of acquiring the first video and the second video by the server, and it should be understood that in other application scenarios, the server can also acquire the first video and the second video based on other manners, which is not limited in the embodiments of the present application.
Fig. 13 is a schematic structural diagram of a video retrieval apparatus according to an embodiment of the present application, and referring to fig. 13, the apparatus includes: an obtaining module 1301 and a determining module 1302.
An obtaining module 1301, configured to obtain a first global spatio-temporal feature of a first video, first spatio-temporal features of a plurality of first video frames in the first video, a second global spatio-temporal feature of a second video, and second spatio-temporal features of a plurality of second video frames in the second video;
the obtaining module 1301 is configured to obtain a first similar parameter based on the first global spatio-temporal feature and the second global spatio-temporal feature, where the first similar parameter is used to indicate a degree of similarity of video levels between the first video and the second video;
the obtaining module 1301 is configured to obtain a second similarity parameter based on a plurality of the first spatio-temporal features and a plurality of the second spatio-temporal features, where the second similarity parameter is used to indicate a degree of similarity at a video frame level between the first video and the second video;
a determining module 1302, configured to fuse the first similarity parameter and the second similarity parameter to obtain a target similarity parameter, and determine that a segment overlapping with the second video exists in the first video in response to that the target similarity parameter is greater than or equal to a first threshold, where the target similarity parameter is used to indicate an overall similarity degree between the first video and the second video.
In some embodiments, for any one of the first video and the second video, the obtaining module 1301 is configured to divide each video frame of the video into a plurality of sub-video frames, and obtain spatial features of the plurality of video frames based on spatial information among the plurality of sub-video frames in each video frame; and obtain the global spatio-temporal feature of the video and the spatio-temporal features of the plurality of video frames based on the spatial features of the plurality of video frames and the time sequence information among the plurality of video frames.
In some embodiments, the obtaining module 1301 is configured to obtain a similarity matrix based on a plurality of the first spatio-temporal features and a plurality of the second spatio-temporal features, where each element in the similarity matrix is used to represent a degree of similarity between each of the first video frames and each of the second video frames; and obtain the second similarity parameter based on the maximum value of each row in the similarity matrix and the number of the first video frames.
In some embodiments, the determining module 1302 is configured to perform a weighted summation on the first similar parameter and the second similar parameter based on a first weight and a second weight to obtain the target similar parameter.
In some embodiments, the first weight and the second weight are trained based on a plurality of sample videos and corresponding sample labels;
the plurality of sample videos include a plurality of original videos and videos obtained by transforming the plurality of original videos, any original video and the videos obtained by transforming based on the original videos are in the same category, and the sample label is used for representing the category of the corresponding sample video.
In some embodiments, the obtaining module 1301 is further configured to obtain a similarity vector, where each element in the similarity vector is a maximum value of each row in the similarity matrix, and each element in the similarity vector is used to represent a degree of similarity between the second video and each first video frame in the first video;
the determining module 1302 is further configured to determine the first video frame corresponding to the element in the similarity vector that is greater than or equal to the second threshold as a repeated video frame.
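A minimal sketch of this repeated-frame localization, assuming the frame-level similarity matrix has first-video frames on its rows; the 0.7 second threshold and the function name are illustrative assumptions.

```python
import torch

def locate_repeated_frames(sim_matrix: torch.Tensor, second_threshold: float = 0.7):
    """sim_matrix: (N1, N2) similarity matrix between first- and second-video frames.
    Returns the similarity vector (row maxima) and the indices of first-video
    frames whose elements reach the second threshold (repeated video frames)."""
    similarity_vector = sim_matrix.max(dim=1).values
    repeated = torch.nonzero(similarity_vector >= second_threshold).squeeze(1)
    return similarity_vector, repeated.tolist()
```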
In some embodiments, the obtaining module 1301 is configured to obtain the first global spatio-temporal feature, a plurality of the first spatio-temporal features, the second global spatio-temporal feature and a plurality of the second spatio-temporal features based on a video feature extraction model, the first video and the second video; the video feature extraction model includes a spatial sub-model and a time sequence sub-model, where the spatial sub-model and the time sequence sub-model are each composed of a plurality of Transformers.
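A heavily simplified sketch of such a Transformer-based extractor is shown below: a spatial encoder runs over the sub-video frames (patches) of each frame, and a temporal encoder runs over the resulting frame features together with a learnable global token. The patch embedding, dimensions, pooling, and global-token design are assumptions for illustration, not the patent's architecture.

```python
import torch
import torch.nn as nn

class SpatioTemporalExtractor(nn.Module):
    """Spatial sub-model + time sequence sub-model built from Transformer encoders."""
    def __init__(self, patch_dim=3 * 16 * 16, dim=256, heads=4, layers=2):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, dim)

        def encoder():
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=layers)

        self.spatial = encoder()     # models spatial info among sub-video frames
        self.temporal = encoder()    # models time sequence info among video frames
        self.global_token = nn.Parameter(torch.zeros(1, dim))

    def forward(self, patches: torch.Tensor):
        # patches: (num_frames, num_patches, patch_dim) for one video
        x = self.patch_embed(patches)
        spatial_feats = self.spatial(x).mean(dim=1)              # (num_frames, dim) spatial features
        seq = torch.cat([self.global_token, spatial_feats], 0)   # prepend global token
        out = self.temporal(seq.unsqueeze(0)).squeeze(0)
        global_feat = out[0]      # global spatio-temporal feature of the video
        frame_feats = out[1:]     # spatio-temporal features of the video frames
        return global_feat, frame_feats
```

Calling such a module once per video would yield the four inputs used by the obtaining module above: two global spatio-temporal features and two sets of frame-level spatio-temporal features.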
In some embodiments, the video feature extraction model is trained based on the plurality of sample videos and corresponding sample labels;
the obtaining module 1301 is configured to obtain a target number of sample videos and corresponding sample labels from the multiple sample videos; randomly combining the sample videos of the target number to obtain a plurality of sample video pairs; acquiring a first sample global space-time characteristic, a plurality of first sample space-time characteristics, a second sample global space-time characteristic and a plurality of second sample space-time characteristics corresponding to the plurality of sample video pairs; obtaining a plurality of first sample similar parameters based on the first sample global space-time characteristics and the second sample global space-time characteristics corresponding to the plurality of sample video pairs; obtaining a plurality of second sample similarity parameters based on the plurality of first sample spatio-temporal features and the plurality of second sample spatio-temporal features corresponding to the plurality of sample video pairs; obtaining a plurality of sample target similar parameters based on the plurality of first sample similar parameters and the plurality of second sample similar parameters; and training the video feature extraction model based on the sample target similarity parameters.
It should be noted that: in the video retrieval device provided in the above embodiment, only the division of the above functional modules is taken as an example for performing video retrieval, and in practical applications, the above functions may be distributed by different functional modules as needed, that is, the internal structure of the device is divided into different functional modules to complete all or part of the above described functions. In addition, the video retrieval device and the video retrieval method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments in detail and are not described herein again.
The present disclosure provides a computer device for executing the video retrieval method. In some embodiments, the computer device is provided as a server. Fig. 14 is a schematic structural diagram of a server provided in an embodiment of the present disclosure. The server 1400 may vary greatly depending on configuration or performance, and may include one or more processors (CPUs) 1401 and one or more memories 1402, where the one or more memories 1402 store at least one program code, and the at least one program code is loaded and executed by the one or more processors 1401 to implement the methods provided in the foregoing method embodiments. Certainly, the server 1400 may further include components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and the server 1400 may further include other components for implementing device functions, which are not described herein again.
In an exemplary embodiment, a computer readable storage medium, such as a memory including at least one program code, the at least one program code being executable by a processor to perform the video retrieval method of the above embodiments, is also provided. For example, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product is also provided, the computer program product comprising at least one computer program, the at least one computer program being stored in a computer readable storage medium. The processor of the computer device reads the at least one computer program from the computer-readable storage medium, and executes the at least one computer program to cause the computer device to perform the operations performed by the video retrieval method.
In some embodiments, the computer program according to the embodiments of the present application may be deployed to be executed on one computer device, on multiple computer devices located at one site, or on multiple computer devices distributed at multiple sites and interconnected by a communication network, where the multiple computer devices distributed at the multiple sites and interconnected by the communication network may constitute a blockchain system.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (11)

1. A method for video retrieval, the method comprising:
acquiring a first global spatiotemporal feature of a first video, a first spatiotemporal feature of a plurality of first video frames in the first video, a second global spatiotemporal feature of a second video and a second spatiotemporal feature of a plurality of second video frames in the second video;
acquiring a first similarity parameter based on the first global space-time feature and the second global space-time feature, wherein the first similarity parameter is used for representing the similarity degree of the video level between the first video and the second video;
acquiring a second similarity parameter based on the plurality of first spatio-temporal features and the plurality of second spatio-temporal features, wherein the second similarity parameter is used for representing the similarity degree of the video frame level between the first video and the second video;
and fusing the first similar parameter and the second similar parameter to obtain a target similar parameter, and determining that a repeated segment exists between the first video and the second video in response to the target similar parameter being greater than or equal to a first threshold value, wherein the target similar parameter is used for representing the overall similarity degree between the first video and the second video.
2. The method of claim 1, wherein for any of the first video and the second video, obtaining the global spatiotemporal features of the video and the spatiotemporal features of a plurality of video frames in the video comprises:
dividing each video frame of the video into a plurality of sub video frames, and acquiring spatial features of the plurality of video frames based on spatial information among the plurality of sub video frames in each video frame;
and acquiring the global space-time characteristics of the video and the space-time characteristics of the video frames based on the spatial characteristics of the video frames and the time sequence information among the video frames.
3. The method of claim 1, wherein obtaining second similarity parameters based on the plurality of first spatio-temporal features and the plurality of second spatio-temporal features comprises:
acquiring a similarity matrix based on a plurality of the first spatio-temporal features and a plurality of the second spatio-temporal features, wherein each element in the similarity matrix is used for representing the similarity degree between each first video frame and each second video frame;
and acquiring the second similarity parameter based on the maximum value of each row in the similarity matrix and the number of the first video frames.
4. The method according to claim 1, wherein the fusing the first similarity parameter with the second similarity parameter to obtain a target similarity parameter comprises:
and carrying out weighted summation on the first similar parameter and the second similar parameter based on the first weight and the second weight to obtain the target similar parameter.
5. The method of claim 4, wherein the first weight and the second weight are trained based on a plurality of sample videos and corresponding sample labels;
the plurality of sample videos include a plurality of original videos and videos obtained by transforming the plurality of original videos, any original video and the videos obtained by transforming based on the original videos are in the same category, and the sample label is used for representing the category of the corresponding sample video.
6. The method of any of claims 1 to 5, wherein after determining that there is a repeated video segment between the first video and the second video in response to the target similarity parameter being greater than a first threshold, the method further comprises:
obtaining a similarity vector, wherein each element in the similarity vector is the maximum value of each row in the similarity matrix, and each element in the similarity vector is used for representing the similarity between the second video and each first video frame in the first video;
and determining the first video frame corresponding to the element which is greater than or equal to the second threshold value in the similarity vector as a repeated video frame.
7. The method of any one of claims 1 to 6, wherein obtaining the first global spatio-temporal feature of the first video, the first spatio-temporal features of the plurality of first video frames in the first video, the second global spatio-temporal feature of the second video, and the second spatio-temporal features of the plurality of second video frames in the second video comprises:
obtaining the first global spatiotemporal feature, the plurality of first spatiotemporal features, the second global spatiotemporal feature and the plurality of second spatiotemporal features based on a video feature extraction model, the first video and the second video;
the video feature extraction model comprises a spatial sub-model and a time sequence sub-model, wherein the spatial sub-model and the time sequence sub-model are composed of a plurality of Transformers.
8. The method of claim 7, wherein the video feature extraction model is trained based on the plurality of sample videos and corresponding sample labels;
the training process of the video feature extraction model comprises the following steps:
obtaining a target number of sample videos and corresponding sample labels from the plurality of sample videos;
randomly combining the sample videos of the target number to obtain a plurality of sample video pairs;
acquiring a first sample global space-time characteristic, a plurality of first sample space-time characteristics, a second sample global space-time characteristic and a plurality of second sample space-time characteristics corresponding to the plurality of sample video pairs;
obtaining a plurality of first sample similarity parameters based on the first sample global spatio-temporal features and the second sample global spatio-temporal features corresponding to the plurality of sample video pairs;
obtaining a plurality of second sample similarity parameters based on the plurality of first sample spatio-temporal features and the plurality of second sample spatio-temporal features corresponding to the plurality of sample video pairs;
obtaining a plurality of sample target similar parameters based on the plurality of first sample similar parameters and the plurality of second sample similar parameters;
training the video feature extraction model based on the plurality of sample target similarity parameters.
9. A video retrieval apparatus, the apparatus comprising:
an obtaining module, configured to obtain a first global spatio-temporal feature of a first video, a first spatio-temporal feature of a plurality of first video frames in the first video, a second global spatio-temporal feature of a second video, and a second spatio-temporal feature of a plurality of second video frames in the second video;
the obtaining module is configured to obtain a first similarity parameter based on the first global spatio-temporal feature and the second global spatio-temporal feature, where the first similarity parameter is used to indicate a degree of similarity of a video level between the first video and the second video;
the obtaining module is configured to obtain a second similarity parameter based on the plurality of first spatio-temporal features and the plurality of second spatio-temporal features, where the second similarity parameter is used to indicate a degree of similarity at a video frame level between the first video and the second video;
and the determining module is used for fusing the first similar parameter and the second similar parameter to obtain a target similar parameter, determining that a repeated segment exists between the first video and the second video in response to the target similar parameter being greater than a first threshold value, and the target similar parameter is used for representing the overall similarity degree between the first video and the second video.
10. A computer device comprising one or more processors and one or more memories having stored therein at least one computer program, the at least one computer program being loaded and executed by the one or more processors to perform operations performed by the video retrieval method of any one of claims 1 to 8.
11. A computer-readable storage medium, having at least one computer program stored therein, the at least one computer program being loaded and executed by a processor to perform operations performed by the video retrieval method of any one of claims 1 to 8.
CN202110853123.XA 2021-07-27 2021-07-27 Video retrieval method, device, equipment and storage medium Pending CN113821676A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110853123.XA CN113821676A (en) 2021-07-27 2021-07-27 Video retrieval method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110853123.XA CN113821676A (en) 2021-07-27 2021-07-27 Video retrieval method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113821676A true CN113821676A (en) 2021-12-21

Family

ID=78923971

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110853123.XA Pending CN113821676A (en) 2021-07-27 2021-07-27 Video retrieval method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113821676A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114782879A (en) * 2022-06-20 2022-07-22 腾讯科技(深圳)有限公司 Video identification method and device, computer equipment and storage medium
CN114782879B (en) * 2022-06-20 2022-08-23 腾讯科技(深圳)有限公司 Video identification method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination