CN117591698B - Training method of video retrieval model, video retrieval method, device and equipment


Info

Publication number: CN117591698B
Authority: CN (China)
Prior art keywords: text, video, segment, search, combination
Legal status: Active
Application number: CN202410079642.9A
Other languages: Chinese (zh)
Other versions: CN117591698A
Inventor: 邢思远
Current Assignee: Tencent Technology Shenzhen Co Ltd
Original Assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202410079642.9A
Publication of CN117591698A
Application granted
Publication of CN117591698B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application provides a training method of a video retrieval model, a video retrieval method, a device and equipment; the method comprises the following steps: acquiring video data; obtaining a plurality of types of text of the video data, wherein the text of each type comprises a plurality of text segments of different sources; sorting the texts of the multiple types according to a specific sequence, and splicing the sorted texts of the multiple types into a first text combination; acquiring a screening index of the first text combination; screening each text segment in the first text combination through the screening index to obtain a second text combination serving as an input text of the video data; constructing a training sample from the input text, a search text, and a correlation label between the search text and the video data; and training the video retrieval model through a plurality of the training samples. By the method and the device, the modeling accuracy of the video retrieval model on the video content can be improved.

Description

Training method of video retrieval model, video retrieval method, device and equipment
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a training method for a video retrieval model, a video retrieval method, a device and equipment.
Background
The video retrieval task is to input a text segment and retrieve the video that best conforms to the text description. With the rapid development of deep learning in the fields of computer vision and natural language, video retrieval can understand both the text and the content of the video, so that matching between the video and the text can be realized. In the related art, videos and texts are encoded into feature vectors; because vectors with similar meanings are close to each other in the vector space, the video retrieval task can be realized based on semantic matching, that is, by calculating the similarity between vectors. However, video is unstructured data that is rich in types, complex in information and costly to process, and video and text belong to different types of data, so the problem of inaccurate modeling of the video content still exists in this cross-modal task.
Disclosure of Invention
The embodiment of the application provides a training method, a video retrieval method, a device, equipment, a computer program product and a computer readable storage medium for a video retrieval model, which can improve the modeling accuracy of the video retrieval model on video content.
The technical scheme of the embodiment of the application is realized as follows:
The embodiment of the application provides a training method of a video retrieval model, which comprises the following steps:
acquiring video data;
obtaining a plurality of types of text of the video data, wherein the text of each of the types comprises a plurality of text fragments of different sources;
sorting the texts of the multiple types according to a specific sequence, and splicing the sorted texts of the multiple types into a first text combination;
acquiring a screening index of the first text combination; screening each text segment in the first text combination through the screening index to obtain a second text combination serving as an input text of the video data;
constructing a training sample from the input text, a search text, and a correlation label between the search text and the video data;
training the video retrieval model through a plurality of the training samples.
The embodiment of the application provides a video retrieval method of a video retrieval model, which comprises the following steps:
acquiring a video retrieval request, wherein the video retrieval request comprises a text to be searched;
obtaining a correlation prediction value between the text to be searched and a plurality of videos in a video library through the video retrieval model;
sorting the plurality of videos in the video library according to the order of the correlation prediction values between the text to be searched and the videos in the video library from high to low, to obtain a video list;
selecting at least one video from the video list, starting from the head of the list, and responding to the video retrieval request through the at least one video.
The embodiment of the application provides a training device for a video retrieval model, which comprises the following components:
the data acquisition module is used for acquiring video data;
A data processing module for obtaining a plurality of types of text of the video data, wherein the text of each type comprises a plurality of text fragments of different sources;
The data processing module is further used for sequencing the plurality of types of texts according to a specific sequence, and splicing the sequenced plurality of types of texts into a first text combination;
The data processing module is further used for acquiring screening indexes of the first text combinations;
The data processing module is further used for screening each text segment in the first text combination through the screening index to obtain a second text combination which is used as an input text of the video data;
the data processing module is further used for constructing a training sample through the input text, the search text and the correlation label of the search text and the video data;
and the model training module is used for training the video retrieval model through a plurality of training samples.
The embodiment of the application provides a video retrieval device of a video retrieval model, which comprises:
The receiving module is used for acquiring a video retrieval request, wherein the video retrieval request comprises a text to be searched;
The retrieval module is used for obtaining correlation prediction values between the text to be searched and a plurality of videos in a video library through the video retrieval model;
The retrieval module is further used for sequencing the videos in the video library according to the sequence of the correlation prediction values between the text to be searched and the videos in the video library from high to low to obtain a video list;
and the response module is used for selecting at least one video from the video list, starting from the head of the list, and responding to the video retrieval request through the at least one video.
An embodiment of the present application provides an electronic device, including:
a memory for storing computer executable instructions;
and the processor is used for realizing the training method of the video retrieval model provided by the embodiment of the application or the video retrieval method of the video retrieval model provided by the embodiment of the application when executing the computer executable instructions stored in the memory.
The embodiment of the application provides a computer readable storage medium, which stores a computer program or computer executable instructions for realizing the training method of the video retrieval model provided by the embodiment of the application or the video retrieval method of the video retrieval model provided by the embodiment of the application when being executed by a processor.
The embodiment of the application provides a computer program product, which comprises a computer program or a computer executable instruction, wherein the computer program or the computer executable instruction realizes the training method of the video retrieval model provided by the embodiment of the application or the video retrieval method of the video retrieval model provided by the embodiment of the application when being executed by a processor.
The embodiment of the application has the following beneficial effects:
By introducing video texts of different types and sources and fusing video information from multiple parties, the coverage of the content contained in the video data is improved. The multiple types of text are ordered according to a specific sequence, and the resulting first text combination is screened through the screening indexes, so that a second text combination containing as much information as possible is constructed within a limited input length. This further condenses the content information contained in the video, improves the modeling accuracy of the video retrieval model on the video content, and improves the accuracy with which the video retrieval model predicts the correlation between the search text and the video data.
Drawings
FIG. 1 is a schematic diagram of a video retrieval system architecture according to an embodiment of the present application;
FIG. 2A is a schematic diagram of a first structure of a server according to an embodiment of the present application;
FIG. 2B is a schematic diagram of a second structure of a server according to an embodiment of the present application;
FIG. 3A is a first flow chart of a training method of a video retrieval model according to an embodiment of the present application;
FIG. 3B is a second flow chart of a training method of a video retrieval model according to an embodiment of the present application;
FIG. 3C is a third flow chart of a training method of a video retrieval model according to an embodiment of the present application;
FIG. 3D is a fourth flowchart of a training method of a video retrieval model according to an embodiment of the present application;
FIG. 3E is a fifth flowchart of a training method of a video retrieval model according to an embodiment of the present application;
FIG. 3F is a sixth flowchart of a training method of a video retrieval model according to an embodiment of the present application;
FIG. 3G is a seventh flowchart of a training method of a video search model according to an embodiment of the present application;
FIG. 3H is a schematic diagram of an eighth flowchart of a training method of a video retrieval model according to an embodiment of the present application;
FIG. 3I is a ninth flowchart of a training method of a video search model according to an embodiment of the present application;
FIG. 3J is a tenth flowchart of a training method of a video retrieval model according to an embodiment of the present application;
FIG. 3K is an eleventh flowchart of a training method of a video retrieval model according to an embodiment of the present application;
FIG. 3L is a twelfth flowchart of a training method of a video retrieval model according to an embodiment of the present application;
FIG. 4 is a flowchart of a video retrieval method of a video retrieval model according to an embodiment of the present application;
FIG. 5 is a schematic diagram of the principles of a video retrieval model provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of a data structure of multiple types of text provided by an embodiment of the present application;
FIG. 7A is a schematic view of a first flow chart of a process of a video retrieval model in video retrieval according to an embodiment of the present application;
FIG. 7B is a second flow chart of the processing of the video retrieval model in video retrieval according to an embodiment of the present application;
FIG. 8A is a schematic diagram of video summary text generation provided by an embodiment of the present application;
FIG. 8B is a schematic diagram of a framework for topic entity extraction of video provided by an embodiment of the present application;
FIG. 8C is a schematic diagram of a network architecture for BERT-based semantic matching according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application clearer, the present application will be further described in detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without making any inventive effort fall within the scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
In the following description, the terms "first", "second", "third" and the like are merely used to distinguish similar objects and do not represent a specific ordering of the objects, it being understood that the "first", "second", "third" may be interchanged with a specific order or sequence, as permitted, to enable embodiments of the application described herein to be practiced otherwise than as illustrated or described herein.
In the present embodiment, the term "module" or "unit" refers to a computer program or a part of a computer program having a predetermined function and working together with other relevant parts to achieve a predetermined object, and may be implemented in whole or in part by using software, hardware (such as a processing circuit or a memory), or a combination thereof. Also, a processor (or multiple processors or memories) may be used to implement one or more modules or units. Furthermore, each module or unit may be part of an overall module or unit that incorporates the functionality of the module or unit.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the embodiments of the application is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
Before describing the embodiments of the present application in further detail, the terms and terminology involved in the embodiments of the present application are described; the following explanations apply to these terms and terminology.
1) Optical character recognition (Optical Character Recognition, OCR) is a technology of converting characters of a print or handwriting into editable and searchable text, extracting text information in an image through technologies such as image processing and pattern recognition, and performing character recognition and text conversion to finally generate an editable text document.
2) Automatic speech recognition (Automatic Speech Recognition, ASR), which is used to convert human speech into text form, is a speech processing technique that converts speech signals into corresponding text by analyzing and decoding the characteristics of the speech signals.
3) Bidirectional Encoder Representation from Transformers (BERT) is a pre-trained language model; the goal of BERT is to learn a generic language representation by performing unsupervised pre-training on a large-scale text corpus, thereby providing a better feature representation for various downstream natural language processing tasks.
4) The explicit text is video-related text that can be displayed on the human-computer interaction interface, such as the text contained in the cover image of the video, the title of the video, and the like.
5) The content text is text that does not appear at the level of the human-computer interaction interface and needs to be extracted from the video through various technical means, such as a video summary extracted from the video to represent the video content, the subject words of the video, and the like.
In the prior art, video retrieval is performed based on features: feature vectors are extracted by a pre-trained network (such as VGG or ResNet), similarity is calculated by cosine similarity or Euclidean distance, and similar videos are retrieved accordingly; or spatio-temporal information is combined, for example, features are extracted by a 3D convolutional neural network (Convolutional Neural Networks, CNN), or feature extraction and temporal modeling are performed by a temporal convolutional network (Temporal Convolutional Network, TCN); video retrieval may also be performed based on end-to-end learning, that is, feature representations and similarity calculation are learned directly from raw video data, with supervised learning performed on paired data. For longer videos, the feature-based methods need to divide the video into segments for processing, while end-to-end learning needs to consider how to process longer video sequences so as to avoid problems such as excessive consumption of computing resources or gradient vanishing. For video retrieval in complex scenes, the algorithms in the prior art often have difficulty accurately capturing the semantic information of the video content, and the quality of the retrieval results is low.
In order to solve the above problems, embodiments of the present application provide a training method for a video retrieval model, a video retrieval method, a device, equipment, a computer readable storage medium and a computer program product, which can improve the modeling accuracy of the video retrieval model on video content.
Taking the application of the embodiment of the present application to a video retrieval scenario as an example, referring to fig. 1, fig. 1 is a schematic structural diagram of a video retrieval system architecture provided in the embodiment of the present application, and in an example, fig. 1 relates to a server 100, a terminal 200 and a network 300. The terminal 200 is connected to the server 100 through a network 300, wherein the network 300 may be a wide area network or a local area network, or a combination of both.
In some embodiments, the video retrieval system provided by the embodiment of the application can be cooperatively implemented by a server and a terminal. For example, the terminal 200 responds to a text search operation input by the user and sends a video retrieval request to the server 100. After receiving the video retrieval request, the server 100 trains the video retrieval model through the training method of the video retrieval model provided by the embodiment of the present application, acquires video data from a video library through the video retrieval method of the video retrieval model provided by the embodiment of the present application, organizes the video information of the video data (for example, the uniform resource locator (Uniform Resource Locator, URL) or storage path of the video, the title, the description of the video, etc.) into a suitable data form (for example, JSON or XML) to construct a search result, and sends the constructed search result to the terminal 200 as the response of the server 100. After receiving the search result, the terminal 200 processes the video information in the search result (for example, downloads and plays the video through the video link included in the video information); the specific processing manner may be determined according to the client platform and the user requirement.
Here, the server 100 may be a single server, and for this case, the training method of the video retrieval model provided by the embodiment of the present application and the video retrieval method of the video retrieval model provided by the embodiment of the present application may be implemented by the same server. The server 100 may also be a cluster of servers, and for the case that the server 100 is a server cluster, the training method of the video retrieval model and the video retrieval method of the video retrieval model provided in the embodiment of the present application may be implemented by different servers, for example, one server is used for training the video retrieval model, and another server is used for deploying the trained video retrieval model to provide the video retrieval service for the terminal, which is not limited in the embodiment of the present application.
In some embodiments, the server may implement the training method of the video retrieval model provided by the embodiment of the present application or the video retrieval method of the video retrieval model provided by the embodiment of the present application by running various computer executable instructions or computer programs. For example, the computer-executable instructions may be commands at the micro-program level, machine instructions, or software instructions. The computer program may be a native program or a software module in an operating system. In general, the computer-executable instructions may be any form of instructions and the computer program may be any form of application, module, or plug-in.
In some embodiments, the server 100 may be an independent physical server, or may be a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (Content Delivery Network, CDN), and basic cloud computing services such as big data and an artificial intelligence platform, where the cloud services may be interaction processing services for a terminal to call.
In some embodiments, multiple servers may be organized into a blockchain, and server 100 may be nodes on the blockchain, where there may be an information connection between each node in the blockchain, and where information may be transferred between nodes via the information connection. The training method of the video retrieval model provided by the embodiment of the application or the data related to the video retrieval method of the video retrieval model provided by the embodiment of the application can be stored on a blockchain.
The embodiments of the present application may be implemented by means of artificial intelligence (Artificial Intelligence, AI) technology, which is a theory, method, technique, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand the intelligence of a person, sense the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.
The artificial intelligence technology is a comprehensive subject, and relates to the technology with wide fields, namely the technology with a hardware level and the technology with a software level. Artificial intelligence infrastructure technologies generally include, for example, sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, pre-training model technologies, operation/interaction systems, mechatronics, and the like. The pre-training model is also called a large model and a basic model, and can be widely applied to all large-direction downstream tasks of artificial intelligence after fine adjustment. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
Taking the processing of the video retrieval model performed by the server as an example, referring to fig. 2A, fig. 2A is a first structural schematic diagram of the server provided by the embodiment of the present application, and the server 100-1 shown in fig. 2A includes: at least one processor 110-1, a memory 130-1, and at least one network interface 120-1. The various components in server 100-1 are coupled together by bus system 140-1. It is understood that the bus system 140-1 is used to enable connected communications between these components. The bus system 140-1 includes a power bus, a control bus, and a status signal bus in addition to the data bus. But for clarity of illustration the various buses are labeled as bus system 140-1 in fig. 2A.
The processor 110-1 may be an integrated circuit chip having signal processing capabilities, such as a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, etc., wherein the general-purpose processor may be a microprocessor or any conventional processor, etc.
Memory 130-1 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 130-1 optionally includes one or more storage devices physically remote from processor 110-1.
Memory 130-1 includes volatile memory or nonvolatile memory and may include both volatile and nonvolatile memory. The non-volatile Memory may be a Read Only Memory (ROM) and the volatile Memory may be a random access Memory (Random Access Memory, RAM). The memory 130-1 described in embodiments of the present application is intended to comprise any suitable type of memory.
In some embodiments, memory 130-1 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 131-1 including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and handling hardware-based tasks;
The network communication module 132-1 is configured to reach other electronic devices via one or more (wired or wireless) network interfaces 120-1, the exemplary network interface 120-1 comprising: Bluetooth, wireless fidelity (Wi-Fi), universal serial bus (Universal Serial Bus, USB), etc.;
in some embodiments, the apparatus provided by the embodiments of the present application may be implemented in software, and fig. 2A shows a training apparatus 133 of a video retrieval model stored in a memory 130-1, which may be software in the form of a program, a plug-in, or the like, including the following software modules: data acquisition module 1331, data processing module 1332 and model training module 1333, which are logical, and thus may be arbitrarily combined or further split depending on the functions implemented. The functions of the respective modules will be described hereinafter.
Taking the processing of the video search model performed by the server as an example, referring to fig. 2B, fig. 2B is a second schematic structural diagram of the server provided by the embodiment of the present application, and the server 100-2 shown in fig. 2B includes: at least one processor 110-2, a memory 130-2, and at least one network interface 120-2. The various components in server 100-2 are coupled together by bus system 140-2. It is understood that bus system 140-2 is used to enable connected communications between these components. The bus system 140-2 includes a power bus, a control bus, and a status signal bus in addition to the data bus. But for clarity of illustration the various buses are labeled as bus system 140-2 in fig. 2B. The specific description of the processor 110-2 and the memory 130-2 is referred to above and will not be repeated here.
In some embodiments, the apparatus provided by the embodiments of the present application may be implemented in software, and fig. 2B shows a video retrieval apparatus 134 of a video retrieval model stored in a memory 130-2, which may be software in the form of a program, a plug-in, or the like, including the following software modules: a receiving module 1341, a retrieving module 1342 and a responding module 1343, which are logical, and thus can be arbitrarily combined or further split depending on the implemented functionality. The functions of the respective modules will be described hereinafter.
In other embodiments, the apparatus provided by the embodiments of the present application may be implemented in hardware. By way of example, the apparatus provided by the embodiments of the present application may be a processor in the form of a hardware decoding processor programmed to perform the training method of the video retrieval model provided by the embodiments of the present application or the video retrieval method of the video retrieval model provided by the embodiments of the present application; for example, the processor in the form of a hardware decoding processor may employ one or more Application-Specific Integrated Circuits (ASIC), Digital Signal Processors (DSP), Programmable Logic Devices (PLD), Complex Programmable Logic Devices (CPLD), Field-Programmable Gate Arrays (FPGA), or other electronic components.
Referring to fig. 5, fig. 5 is a schematic diagram of the principle of the video retrieval model provided by the embodiment of the present application. First, a plurality of training samples (each training sample includes an input text for representing the video content, a search text, and a relevance label between the search text and the video data) are input into the text input layer of the video retrieval model. After word embedding processing and feature extraction are performed on the input text and the search text in the training samples, the relevance prediction value between the input text and the search text is obtained by methods such as calculating cosine similarity. The relevance prediction value and the real relevance label are input into a first loss function and a second loss function, the loss values are calculated respectively, and the two loss values are weighted and summed to obtain a combined loss value; back-propagation is then performed according to the combined loss value, so as to update the parameters of the video retrieval model.
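For illustration only, the following is a minimal PyTorch-style sketch of the training flow just described (two losses weighted into a combined loss and back-propagated); the encoder interface `model.encode`, the loss weight `alpha`, the rescaling of the relevance labels and the concrete forms of the two loss functions are assumptions rather than the definitions used in the embodiments below.

```python
import torch
import torch.nn.functional as F

def training_step(model, batch, optimizer, alpha=0.5):
    # Encode the input text and the search text into semantic vectors.
    # `model.encode` is a hypothetical helper standing in for the text input layer,
    # word embedding processing and feature extraction described above.
    input_vec = model.encode(batch["input_text"])     # [B, H]
    search_vec = model.encode(batch["search_text"])   # [B, H]

    # Relevance prediction value, e.g. cosine similarity mapped into [0, 1].
    pred = (F.cosine_similarity(input_vec, search_vec, dim=-1) + 1.0) / 2.0

    # Relevance labels (grades 1..5) rescaled into [0, 1] for this sketch.
    label = (batch["relevance_label"].float() - 1.0) / 4.0

    loss_1 = F.binary_cross_entropy(pred, label)   # first loss function (assumed form)
    loss_2 = F.mse_loss(pred, label)               # second loss function (assumed form)

    loss = alpha * loss_1 + (1.0 - alpha) * loss_2  # weighted sum -> combined loss value
    optimizer.zero_grad()
    loss.backward()                                 # back-propagation on the combined loss
    optimizer.step()
    return loss.item()
```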
The training method of the video retrieval model provided by the embodiment of the application will be described below in connection with exemplary application and implementation of the server provided by the embodiment of the application.
Referring to fig. 3A, fig. 3A is a schematic flow chart of a first procedure of a training method of a video search model according to an embodiment of the present application, and the execution subject of fig. 3A to 3L may be the server 100-1 for model training described above, and the steps shown in fig. 3A will be specifically described below.
In step 101, video data is acquired.
In some embodiments, the acquired video data is from multiple video files or multiple video stream segments (e.g., streaming video).
In step 102, a plurality of types of text of video data are acquired, wherein each type of text includes a plurality of text segments of different origin.
In some embodiments, the video data may be a plurality of pieces of video data from different videos. Referring to fig. 6, fig. 6 is a schematic diagram of a data structure of multiple types of text provided by an embodiment of the present application, where each piece of video data corresponds to multiple types of text (e.g., explicit text and content text), and each type of text includes a plurality of text segments from different sources. For example, the sources of the explicit text are the cover, the title, and the topic words edited by the user.
The explicit text is video-related text that can appear at the level of the human-computer interaction interface; for example, the explicit text includes the following text segments: a cover text segment, a title text segment, a topic word text segment, and the like. The content text includes: text obtained by text recognition of the video content, text obtained by speech recognition, the summary content of the video, entity words of the video subject, and the like.
The content text is text that does not appear at the level of the human-computer interaction interface and needs to be extracted from the video through various technical means. For example, the text segments in the content text include: the best-hit sentence segment in the text content of the video obtained through text recognition and speech recognition (that is, the segment that best represents the subject of the video among the segments extracted by at least one of text recognition and speech recognition), a video summary text segment, a video subject word splice text segment, and the like.
Through step 102, by introducing video texts of different sources and fusing related information of video data from multiple parties, the coverage of the content contained in the video data is improved, and the accuracy of modeling the video subject is improved.
In step 103, the plurality of types of text are ordered in a specific order, and the ordered plurality of types of text are spliced into a first text combination.
In some embodiments, the specific sequence includes: a type ordering, which is used to characterize the order of the multiple types; and a source ordering corresponding to each type, which is used to characterize the order of the different sources under that type. Referring to fig. 3B, step 103 shown in fig. 3A may be implemented by the following steps 1031 to 1032, which are specifically described below.
In step 1031, the plurality of types of text are ranked in a type ranking manner.
In some embodiments, the plurality of types of text may include both explicit text and content text; the explicit text includes a plurality of text segments of different sources: the title, the characters recognized from the cover image, and the user-edited topic words; the content text includes a plurality of text segments of different sources: the best-hit sentence segment, the video topic summary text and the topic core word splice text.
In some embodiments, continuing the example in step 102, ordering the plurality of types of text according to the type ordering may place the explicit text in front and the content text behind, e.g., represented as [explicit text, content text].
In step 1032, the following processing is performed for each type of text: and sequencing a plurality of text fragments of different sources included in each type of text according to the source sequencing mode corresponding to the type.
In some embodiments, continuing the example in step 102, the plurality of text segments contained in the explicit text and the content text are sorted respectively; e.g., the sorted explicit text may be represented as [explicit text (cover text segment, title text segment, topic word text segment)]. The sorting of the text segments of different sources here is merely exemplary; the forming of a specific source ordering (i.e., sorting the plurality of text segments of different sources included in each type of text in order of relevance from high to low to form the source ordering) is described in steps 110 to 111 below.
In some embodiments, the sources of the explicit text may include the title of the video data, the cover image, a topic word edited by the user (for example, in the form of # plus the topic word), etc., and the plurality of text segments corresponding to the different sources may include a title text segment, a cover text segment (obtained by optical character recognition of the text in the cover image), a topic word text segment, etc.
In some embodiments, the sources of the content text may include text recognition of video frames in the video data, speech recognition of the video, extraction of topic entities representing the video data and scoring of their association degrees, and topic summary sentence generation for the video data; the plurality of text segments corresponding to the different sources may include the best-hit sentence segment, a video summary text segment, a video subject word splice text segment, and the like.
The best-hit sentence segment may be obtained as follows: the text corresponding to the video is acquired through text recognition of the video frames of the video data and speech recognition of the video, and similarity calculation (e.g., cosine similarity) is performed between each of a plurality of text segments in this text and the search text, so that the sentence segment with the highest similarity is selected as the best-hit sentence segment. Referring to fig. 8A, fig. 8A is a schematic diagram of the video summary text generation principle provided by an embodiment of the present application: using a summary sentence generation model (e.g., BERT), the text corresponding to the video data is input through a text input layer, and after being processed by word embedding coding (Token Embeddings), interval segment coding (Interval Segment Embeddings), position coding (Position Embeddings) and summarization layers (Summarization Layers), the summary text segment corresponding to the video data is generated. Referring to fig. 8B, fig. 8B is a schematic diagram of a framework of topic entity extraction of a video provided by an embodiment of the present application; topic entity extraction of a video refers to extracting entities related to the topic, such as persons, places, organizations, times and dates, from the video. Through joint modeling of the image modality and the text modality, the topic entity extraction includes four steps: multi-modal input, feature extraction, multi-modal fusion and model classification, so as to obtain the weight of each entity among the plurality of video entities corresponding to the video (i.e., the predicted probability value of each entity, which characterizes the degree to which the entity can represent the video topic); the entities are then combined and spliced in descending order of their predicted probability values, thereby obtaining the video subject word splice text segment.
As an example of splicing, different entities may be separated by semicolons, and a prediction probability threshold may also be set for truncation, i.e., only entities whose prediction probability value is greater than the prediction probability threshold are retained.
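As an illustrative sketch of this splicing rule (semicolon separation plus a prediction probability threshold), assuming the topic entities are given as (entity, predicted probability) pairs; the threshold value and the example entities are hypothetical.

```python
def splice_topic_entities(entities, prob_threshold=0.5):
    """Splice video topic entities into a text segment: keep entities whose
    predicted probability exceeds the (assumed) threshold, sort them in
    descending order of probability, and join them with semicolons."""
    kept = [(text, prob) for text, prob in entities if prob > prob_threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return ";".join(text for text, _ in kept)

# Hypothetical entities extracted from a video.
print(splice_topic_entities([("Beijing", 0.92), ("food", 0.71), ("drone", 0.32)]))
# -> "Beijing;food"
```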
In some embodiments, the type ordering and the source ordering have no fixed precedence: the type ordering may be performed first, or the source ordering may be performed first. The final first text combination may be expressed as: [explicit text (cover text segment, title text segment, topic word text segment), content text (best-hit sentence segment, video summary text segment, video subject word splice text segment)].
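The ordering and splicing of step 103 can be sketched as follows; the dictionary of source-ordered text segments and the type order constant are illustrative stand-ins for the type ordering and source ordering described above (see steps 110 to 111 below for how the source ordering is formed).

```python
# Type ordering: explicit text in front, content text behind.
TYPE_ORDER = ["explicit_text", "content_text"]

def build_first_text_combination(texts_by_type):
    """`texts_by_type` maps each text type to its source-ordered list of text
    segments; the first text combination is their ordered concatenation."""
    first_combination = []
    for text_type in TYPE_ORDER:                                    # type ordering
        first_combination.extend(texts_by_type.get(text_type, []))  # keep source ordering
    return first_combination

# Hypothetical segments mirroring the combination shown above.
texts = {
    "explicit_text": ["cover text segment", "title text segment", "topic word text segment"],
    "content_text": ["best-hit sentence segment", "video summary text segment",
                     "video subject word splice text segment"],
}
print(build_first_text_combination(texts))
```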
With continued reference to fig. 3A, in step 104, a screening indicator of the first text combination is obtained.
In some embodiments, the screening criteria include at least one of: smoothness, length, coverage.
Wherein the smoothness characterizes the fluency of expression of the text in the first text combination, the length characterizes the number of minimum text units (e.g., words, numbers, punctuation or other symbols, each of which may be represented by a token) in the first text combination, and the coverage characterizes whether a text segment in the first text combination has an information gain over the preceding text segments.
With continued reference to fig. 3A, in step 105, each text segment in the first text combination is screened by a screening indicator to obtain a second text combination as an input text for the video data.
In some embodiments, referring to fig. 3C, step 105 shown in fig. 3A may be implemented by the following steps 1051 through 1053, which are described in detail below.
In step 1051, a first screening process is performed on the first text combination according to the coverage, resulting in a first screened text combination.
In some embodiments, referring to fig. 3D, step 1051 shown in fig. 3C may be implemented by traversing each text segment in the first text combination to perform steps 10511 through 10512 below, which are described in detail below.
In step 10511, a coverage of the text segment is obtained, wherein the coverage characterizes whether there is an information gain for the currently traversed text segment as compared to the text segment that has been traversed previously, and the text segment is screened out in response to the coverage being greater than a coverage threshold.
In some embodiments, if the coverage of a text segment in the first text combination is greater than a preset coverage threshold, the text segment is considered to have no separate information gain and is therefore removed from the first text combination to leave more input space for subsequent text segments.
In some embodiments, referring to fig. 3E, the capturing coverage of each text segment in step 10511 shown in fig. 3D may be implemented by the following steps 105111 to 105113, which are specifically described below.
In step 105111, word segmentation is performed on the text segment to obtain a segment word set of the text segment.
In some embodiments, punctuation marks such as spaces, periods, commas and the like can be used as word segmentation marks for word segmentation processing on text fragments, namely, the text fragments are segmented into the minimum units, also called tokens (token), and third-party word segmentation tools, such as Jieba, NLTK, spaCy, can be used for word segmentation processing on the text fragments, and the tools can realize accurate Chinese word segmentation and English word segmentation through algorithms and language models.
In step 105112, the word frequency of each word in the segment word set that appears in the preceding text segment is obtained.
In some embodiments, a dictionary or hash table may be created for the segment word set of the text segment to store each segment word and its corresponding number of occurrences. The segmented text segment is traversed; for each segment word, it is determined whether that word has already appeared in the text before the current text segment: if so, the frequency of the segment word is increased by 1; if not, the segment word is added to the dictionary and its frequency is set to 1. In this way, the word frequency with which each word in the segment word set appears in the preceding text segments is obtained.
In step 105113, the average of the word frequency for each word is determined as the coverage of the text segment.
In some embodiments, the word frequencies of each word in the text segment are summed and averaged as a coverage of the text segment.
With continued reference to fig. 3D, in step 10512, the first text combination after the first screening process is taken as the first screened text combination.
In some embodiments, the first text combinations after the first screening processing of the sequentially traversed text segments are combined into the first screened text combinations according to the original text segment sequence.
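A minimal sketch of the coverage computation and the first screening process (steps 10511 to 10512 and 105111 to 105113), assuming a simple whitespace/punctuation splitter stands in for a third-party word segmentation tool such as Jieba, and a hypothetical coverage threshold.

```python
import re

def tokenize(text):
    # Simplified word segmentation: split on whitespace and common punctuation.
    return [t for t in re.split(r"[\s,.;!?]+", text) if t]

def coverage(segment, preceding_segments):
    """Average frequency with which the segment's words already appear in the
    previously traversed text segments (steps 105111 to 105113)."""
    words = tokenize(segment)
    if not words:
        return 0.0
    preceding_words = tokenize(" ".join(preceding_segments))
    freq = {w: preceding_words.count(w) for w in set(words)}
    return sum(freq[w] for w in words) / len(words)

def first_screening(first_combination, coverage_threshold=1.0):
    """First screening process: screen out segments whose coverage exceeds the
    (assumed) coverage threshold; the rest form the first screened text combination."""
    kept, traversed = [], []
    for segment in first_combination:
        if coverage(segment, traversed) <= coverage_threshold:
            kept.append(segment)
        traversed.append(segment)
    return kept
```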
With continued reference to FIG. 3C, in step 1052, a second screening process is performed on the first screened text combination by length to obtain a second screened text combination.
In some embodiments, referring to fig. 3F, step 1052 shown in fig. 3C may be implemented by performing the following steps 10521 through 10522 for each type of text in the first screened text combination, as described in detail below.
In step 10521, at least one text segment in the text beginning at the tail is screened out in response to the length of the text being greater than the length threshold, such that the length of the text is less than or equal to the length threshold.
In some embodiments, when the length of the text is greater than the length threshold, at least one text segment in the text starting from the tail is screened out such that the length of the text is less than or equal to the length threshold, e.g., the length threshold for both the explicit text and the content text is 128 tokens, and if the length of the explicit text or the content text is greater than the length threshold, the text segment of the explicit text or the content text is screened out from the tail until both the explicit text and the content text meet (are less than or equal to) the length threshold of 128 tokens.
In step 10522, the first screened text combination after the second screening process is taken as the second screened text combination.
In some embodiments, the combination of the content text and the explicit text in the first screened text combination whose lengths are less than or equal to the length threshold after the second screening process is the second screened text combination.
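A sketch of the second screening process (steps 10521 to 10522); the token counting relies on a caller-supplied tokenizer and the 128-token threshold from the example above, both of which are assumptions.

```python
def second_screening(segments, tokenize, length_threshold=128):
    """Screen out text segments starting from the tail until the total token
    length of this type of text is less than or equal to the threshold (step 10521)."""
    kept = list(segments)
    while kept and sum(len(tokenize(s)) for s in kept) > length_threshold:
        kept.pop()  # remove one text segment from the tail
    return kept

def second_screened_combination(explicit_segments, content_segments, tokenize):
    """Step 10522: the trimmed explicit text and content text together form the
    second screened text combination."""
    return (second_screening(explicit_segments, tokenize)
            + second_screening(content_segments, tokenize))
```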
With continued reference to fig. 3C, in step 1053, a third screening process is performed on the second screened text combination in terms of smoothness to obtain the second text combination.
In some embodiments, referring to fig. 3G, step 1053 shown in fig. 3C may be implemented by the following steps 10531 through 10534, which are described in detail below.
In step 10531, the smoothness of each text segment in the second screened text combination is obtained.
In some embodiments, referring to fig. 3H, step 10531 shown in fig. 3G may be implemented by performing the following steps 105311 to 105313 on each text segment in the second screened text combination, which are described in detail below.
In step 105311, the text segment is subjected to word segmentation processing to obtain a segment word segmentation set.
In some embodiments, punctuation marks such as spaces, periods, commas and the like can be used as word segmentation marks to segment text fragments (i.e. the text fragments are segmented into the minimum units token above) to obtain a segment word segmentation set, and third-party word segmentation tools can be used to segment text fragments, such as Jieba, NLTK, spaCy, and the like, and the tools can realize accurate Chinese word segmentation and English word segmentation through algorithms and language models.
In step 105312, the output probability of each segment word in the set of segment words is obtained.
In some embodiments, the segment words (tokens) in each text segment are selected one by one for masking (e.g., the token is replaced with a mask tag), the masked text segment is input into a pre-trained language model (e.g., Bidirectional Encoder Representation from Transformers (BERT), etc.), and the output probability at the token position is obtained, i.e., the output probability at each token position given the context.
In step 105313, the smoothness of the text segment is obtained according to the output probability of each segment word.
In some embodiments, the output probabilities of each segment word (token) may be summed and averaged to take the average as the smoothness of the text segment.
In some examples, as an alternative implementation, the segment words (tokens) in each text segment may be selected one by one for masking, for example, the token is replaced with a [MASK] tag; the masked text segment is input into a pre-trained language model (such as BERT) to obtain the predicted segment word at that position, and the loss between the predicted segment word and the real segment word, for example the negative log-likelihood loss (Negative Log Likelihood, NLL), is computed. The losses of all segment words are summed and averaged to obtain a segment loss value, and finally the segment loss value is mapped (for example, by an exponential transformation, so that a text segment with a larger segment loss value has a lower smoothness) to obtain the smoothness of the text segment.
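A sketch of the masked-prediction smoothness of steps 105311 to 105313 using the HuggingFace transformers library; the checkpoint name and the simple averaging over token positions are assumptions, and the exponential mapping of the alternative implementation is omitted.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
model = BertForMaskedLM.from_pretrained("bert-base-chinese")
model.eval()

def segment_smoothness(text):
    """Mask each token in turn and average the model's output probability of
    the original token at the masked position, given the context."""
    input_ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    probs = []
    with torch.no_grad():
        for i in range(1, input_ids.size(0) - 1):        # skip [CLS] and [SEP]
            masked = input_ids.clone()
            masked[i] = tokenizer.mask_token_id          # replace the token with [MASK]
            logits = model(masked.unsqueeze(0)).logits   # [1, seq_len, vocab_size]
            token_probs = logits[0, i].softmax(dim=-1)
            probs.append(token_probs[input_ids[i]].item())
    return sum(probs) / len(probs) if probs else 0.0
```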
With continued reference to fig. 3G, in step 10532, the smoothness values of the text segments are summed to obtain the smoothness of the second screened text combination.
In some embodiments, the smoothness of each text segment may be added up to yield the smoothness of the second screened text combination.
In other embodiments, a smoothness weight may be preset according to the position of the text segment in the second screened text combination, where a text segment in an earlier position has a larger smoothness weight; for example, the text segment in the first position is given a weight of 0.25, the text segment in the second position a weight of 0.15, and the text segment in the third position a weight of 0.1. Similarly, weights are set for the 3 text segments of the content text in the second screened text combination, and the smoothness of each text segment is weighted and summed to obtain the smoothness of the second screened text combination.
In step 10533, in response to the smoothness of the second screened text combination being less than the smoothness threshold, at least one text segment in the second screened text combination is screened out starting from the tail, such that the smoothness of the screened second screened text combination is greater than or equal to the smoothness threshold.
Through step 10533, text segments with lower smoothness are removed, which achieves the beneficial effect of taking into account both the smoothness and the overall input length of the screened second screened text combination.
In step 10534, the second screened text combination after the third screening process is taken as the second text combination.
In some embodiments, the second screened text combination whose smoothness is greater than or equal to the smoothness threshold after the third screening process is taken as the second text combination.
In some embodiments, the embodiments of the present application do not limit the specific selection of the above-mentioned screening indexes (length, smoothness, coverage); for example, only the coverage may be used to screen the first text combination, and the resulting processed text is directly used as the input text. That is, the above screening indexes may be combined and matched according to the specific application scenario, so that the first text combination is screened to obtain the second text combination.
As an example, the second text combination obtained after the screening process of the first text combination may simultaneously satisfy the conditions in the following formula (1):

$$\max \sum_{i=1}^{n} r_i \quad \text{s.t.} \quad \sum_{i=1}^{n} l_i < L_{\max}, \quad \sum_{i=1}^{n} s_i \ge S_{\min} \tag{1}$$

where $n$ represents the number of text segments in the second text combination, $r_i$ represents the degree of correlation between the $i$-th text segment and the search text (the calculation is described in steps 1101 to 1103 below), $l_i$ represents the length of the $i$-th text segment, $L_{\max}$ represents the overall input length threshold of the input text (continuing the example of step 10521 above, e.g., the explicit text and the content text are each limited to 128 tokens, so the overall input length threshold is 256 tokens), $s_i$ represents the smoothness of the $i$-th text segment, and $S_{\min}$ represents the smoothness threshold. The meaning of formula (1) is that the total relevance between the second text combination and the search text is the maximum value that can be achieved by the current combination (the multiple types of text are ordered in a specific sequence through step 103 to obtain the first text combination, so that the text segments with larger relevance within each type of text are placed closer to the front; after that, the first text combination may be subjected to the first screening process through step 1051 to de-duplicate the first text combination), the overall input length of the second text combination is smaller than the overall input length threshold (implemented by the second screening process in step 1052), and the smoothness of the second text combination is greater than or equal to the smoothness threshold (implemented by the third screening process in step 1053). A second text combination satisfying the above conditions is the input text, constructed within the limited input length, that carries the most video content information.
Through steps 103 to 105, by sorting the multiple types of text according to a specific sequence and screening the resulting first text combination through the screening indexes, a second text combination containing as much information as possible is constructed within the limited input length; the content information contained in the video is further condensed, and the accuracy with which the video retrieval model predicts the degree of correlation between the search text and the video data is improved.
With continued reference to fig. 3A, in step 106, training samples are constructed from the input text, the search text, and the relevance label between the search text and the video data.
In some embodiments, the relevance label of the search text and the video data represents the relevance grade between the search text and the video data. For example, grade 5 indicates that the degree of relevance between the search text and the video data is the highest (the relevance to the input text corresponding to the video data can be characterized as the greatest), so that the search requirement of the user can be completely met; grade 4 indicates that the search text and the video data are mostly relevant; grade 3 indicates that the search text and the video data are partially relevant; grade 2 indicates that the search text and the video data are only slightly relevant; and grade 1 indicates that the search text is not relevant to the video data.
In some embodiments, the plurality of training samples may be represented as [training sample 1 (search text 1, input text 1, relevance label 1), training sample 2 (search text 1, input text 2, relevance label 5), training sample 3 (search text 2, input text 3, relevance label 1)], where the same search text may correspond to multiple input texts. This representation of the training samples is merely exemplary; embodiments of the present application do not limit the number of training samples or the number of input texts corresponding to a search text.
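For illustration only, the toy sketch below shows one way such training samples could be organized in code; the field names and example values are assumptions mirroring the exemplary list above.

```python
from typing import List, NamedTuple

class TrainingSample(NamedTuple):
    search_text: str
    input_text: str       # second text combination built in steps 103 to 105
    relevance_label: int  # relevance gear, e.g. 1 (irrelevant) to 5 (fully relevant)

# The same search text may correspond to several input texts with different gears.
samples: List[TrainingSample] = [
    TrainingSample("search text 1", "input text 1", 1),
    TrainingSample("search text 1", "input text 2", 5),
    TrainingSample("search text 2", "input text 3", 1),
]
```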
In step 107, a video retrieval model is trained with a plurality of training samples.
In some embodiments, referring to fig. 3I, step 107 shown in fig. 3A may be implemented by the following steps 1071 to 1077, which are specifically described below.
In step 1071, semantic vector representations of the input text and the search text in the training sample are obtained.
In some embodiments, referring to FIG. 8C, FIG. 8C is a schematic diagram of a network architecture for BERT-based semantic matching provided by embodiments of the present application. The input text (including both the explicit text and the content text, with the explicit text preceding the content text and individual text segments separated by SEP) and the search text are input into the model, and hidden state representations of the model are obtained through a multi-layer Transformer structure based on word embeddings; the output of a particular layer may also be used as the semantic vector representation of the input text and the search text. The input text and the search text are first segmented into words (for example, the first token after word segmentation in FIG. 8C), then input into the model, and the semantic vector representations of the input text and the search text are finally obtained through the word embedding layer and the forward propagation of the model.
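The following is a minimal sketch of how such a joint semantic vector representation could be obtained with a BERT-style encoder using the Hugging Face transformers library; the checkpoint name, the 256-token limit, and the use of the first-token hidden state as the semantic vector are assumptions for illustration, not the exact configuration of fig. 8C.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed checkpoint; the actual encoder in fig. 8C may differ.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
encoder = AutoModel.from_pretrained("bert-base-chinese")

def encode(search_text: str, input_text: str) -> torch.Tensor:
    """Jointly encode the search text and the input text and return the
    hidden state of the first token as the semantic vector representation."""
    batch = tokenizer(search_text, input_text,
                      truncation=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**batch)
    return outputs.last_hidden_state[:, 0]  # first-token hidden state
```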
In step 1072, a relevance prediction between the input text and the search text is obtained by semantic vector representation.
In some embodiments, the relevance prediction value between the input text and the search text may be obtained by classifying and mapping the semantic vector representation with a multi-layer perceptron (corresponding to MLP_2 in fig. 8C); the similarity of the semantic vector representations of the input text and the search text may also be calculated by cosine similarity and used as the relevance prediction value, or a distance (such as the Euclidean distance or Manhattan distance) between the semantic vector representation of the input text and the semantic vector representation of the search text may be calculated to obtain the relevance prediction value, where the closer the distance, the higher the relevance prediction value, i.e., the two are inversely related.
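As an illustration of the two scoring options described above, the sketch below defines a small classification head (loosely analogous to MLP_2 in fig. 8C; the layer sizes are assumptions) and a cosine-similarity alternative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelevanceHead(nn.Module):
    """A small MLP that maps the joint semantic vector to a relevance
    prediction value; hidden sizes are assumed, not from the patent."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, semantic_vec: torch.Tensor) -> torch.Tensor:
        return self.mlp(semantic_vec).squeeze(-1)

def cosine_relevance(query_vec: torch.Tensor, doc_vec: torch.Tensor) -> torch.Tensor:
    """Alternative: cosine similarity between separately encoded vectors."""
    return F.cosine_similarity(query_vec, doc_vec, dim=-1)
```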
In step 1073, a first penalty value for the first penalty function is determined from the relevance prediction value and the relevance label.
In some embodiments, the first loss value may be obtained by:
$$L_1 = -\sum_{t} y_t \ln \hat{y}_t \qquad (2)$$

wherein $y_t$ represents the relevance label of the text segment at time step $t$, $\hat{y}_t$ represents the relevance prediction value of the text segment at time step $t$, and $\ln$ represents the natural logarithm; that is, the losses of all time steps are added to obtain the first loss value. The obtained first loss value may be the first loss value output after a single training batch ends in a training iteration of the video retrieval model, and the training process of the video retrieval model includes multiple training batch iterations.
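A minimal sketch of formula (2) as reconstructed above is shown below; the tensor shapes and the assumption that the prediction values lie in (0, 1) are illustrative only.

```python
import torch

def first_loss(pred: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
    """Formula (2) sketch: negative log term per time step, summed over the
    batch. `pred` holds relevance prediction values assumed to lie in (0, 1);
    `label` holds the corresponding relevance labels."""
    eps = 1e-8  # numerical stability for the logarithm
    return -(label * torch.log(pred + eps)).sum()
```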
In step 1074, two video data with different relevance labels in the plurality of video data corresponding to the search text are acquired, and relevance prediction values corresponding to the two video data are respectively acquired.
Continuing the example in step 106, the search texts of training sample 1 (search text 1, input text 1, relevance label 1) and training sample 2 (search text 1, input text 2, relevance label 5) are identical while their relevance labels differ, so the relevance prediction values between search text 1 and the input text 1 and input text 2 corresponding to the two video data are obtained. Obtaining the relevance prediction values corresponding to two video data is merely an example; the plurality of training samples may contain a plurality of search texts, and multiple groups of two video data with different relevance labels, corresponding to different search texts, may be obtained.
In step 1075, a second loss value of the second loss function is determined from the correlation prediction values corresponding to the two video data.
In some embodiments, the second loss value may be obtained by:

$$L_2 = \max\bigl(0,\ m \cdot \Delta g - (s_1 - s_2)\bigr) \qquad (3)$$

Following the above example, $s_1$ and $s_2$ represent the relevance prediction values corresponding to the two video data, $\Delta g$ represents the gear difference between the relevance labels of the two video data (the relevance gear of the first video data minus the relevance gear of the second video data), and $m$ is a boundary (margin) value. The meaning of formula (3) is that when the difference $(s_1 - s_2)$ between the relevance prediction scores of two input texts whose relevance to the same search text differs by a gear range is greater than $m \cdot \Delta g$, the loss calculated by formula (3) is 0; otherwise the loss calculated by formula (3) is greater than 0, so that the video retrieval model can better distinguish video data of different gears and output more accurate relevance prediction values.
Here, only one pair of relevance prediction values $(s_1, s_2)$ is taken as an example; in a training iteration of the video retrieval model, each training batch includes multiple such pairs, the loss of each pair is calculated by formula (3), and the average of these losses is taken as the second loss value of the current training batch.
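The following sketch illustrates formula (3) as a pairwise hinge loss averaged over the pairs of a batch; the scaling of the margin by the gear difference and the default margin value are assumptions.

```python
import torch

def second_loss(s1: torch.Tensor, s2: torch.Tensor,
                gear_diff: torch.Tensor, margin: float = 0.1) -> torch.Tensor:
    """Formula (3) sketch: the score gap between the higher-gear and lower-gear
    video should exceed the margin scaled by the gear difference; the loss is
    averaged over all pairs in the current training batch."""
    return torch.clamp(margin * gear_diff - (s1 - s2), min=0).mean()
```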
In step 1076, the first loss value and the second loss value are weighted and summed to obtain a combined loss value.
In some embodiments, the first loss value and the second loss value are weighted and summed by a preset weight to obtain a combined loss value.
In step 1077, gradient information is obtained by combining the loss values, and parameters of the video retrieval model are updated according to the gradient information.
In some embodiments, gradient information of the combined loss value with respect to each parameter of the video retrieval model is obtained through a back propagation algorithm, and the obtained gradient information is used to update the parameters of the video retrieval model according to a gradient descent optimization algorithm (such as batch gradient descent or stochastic gradient descent); this process is repeated until a certain number of iterations is reached or the video retrieval model converges, completing the training of the video retrieval model.
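Putting steps 1076 and 1077 together, a hypothetical training step might look like the sketch below; the loss weights, the AdamW optimizer, the learning rate, and the batch field names are assumptions, and `model`, `first_loss`, and `second_loss` refer to the sketches above or to assumed helpers.

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
w1, w2 = 0.7, 0.3  # preset weights for the weighted summation (assumed values)

def training_step(batch):
    # relevance predictions for all samples of the batch
    pred = model(batch["search_texts"], batch["input_texts"])
    loss1 = first_loss(pred, batch["labels"])                    # formula (2)
    loss2 = second_loss(pred[batch["hi"]], pred[batch["lo"]],    # formula (3), paired
                        batch["gear_diff"])
    combined = w1 * loss1 + w2 * loss2   # step 1076: weighted summation
    optimizer.zero_grad()
    combined.backward()                  # step 1077: back propagation for gradients
    optimizer.step()                     # gradient-descent update of model parameters
    return combined.item()
```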
In some embodiments, referring to fig. 3J, after step 102 shown in fig. 3A and before step 103, the following steps 108 to 109 may also be performed, as described in detail below.
In step 108, preset priorities are obtained for the plurality of types of text, wherein the plurality of types of text includes an explicit text and a content text, and the priority of the explicit text is higher than the priority of the content text.
In step 109, the order in which the priority of the explicit text is higher than the priority of the content text is used as the type ordering manner.
In some embodiments, the type ordering may be represented as the explicit text coming first and the content text coming after.
In some embodiments, referring to fig. 3K, prior to step 103 shown in fig. 3A, the following steps 110 to 111 may also be performed for the plurality of text segments of different sources included in each type, as described in detail below.
In step 110, a degree of correlation between each of the plurality of text segments and the search text is determined.
In some embodiments, referring to fig. 3L, step 110 shown in fig. 3K may also be implemented by the following steps 1101 to 1103, which are specifically described below.
In step 1101, word segmentation processing is performed on the search text, resulting in a set of search word segments.
In some embodiments, referring to the description in step 105111 above, the search text is subjected to word segmentation to obtain a set of search segments.
In step 1102, a word weight coverage value is obtained for each search term in the set of search terms, wherein the word weight coverage value characterizes whether the search term appears in the text segment and the word weight of the search term.
In some embodiments, the word weight coverage value for each search term may be represented by the following equation:
$$c_i = \mathbb{1}\bigl(w_i \in \text{segment}\bigr) \cdot \mathrm{weight}(w_i) \qquad (4)$$

wherein $c_i$ represents the word weight coverage value of the search term $w_i$, $\mathbb{1}(w_i \in \text{segment})$ indicates whether the current search term appears in the text segment (taking the value 1 if it appears and 0 otherwise), and $\mathrm{weight}(w_i)$ represents the word weight of the search term. The word weight of the search term may be obtained by using a language model for a word-level labeling task (e.g., BERT, RoBERTa, etc.) to obtain the word weight, tag, and other information of each search term.
In step 1103, the word weight coverage value of each search term is added as a degree of correlation between the text segment and the search text.
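A minimal sketch of steps 1101 to 1103, assuming the search text has already been segmented and that a hypothetical `term_weights` mapping has been produced by a word-level labeling model:

```python
def segment_relevance(search_terms, term_weights, segment_tokens) -> float:
    """Relevance of a text segment per formula (4): sum, over all search terms,
    of the term's weight if the term appears in the segment, and 0 otherwise."""
    segment_vocab = set(segment_tokens)
    return sum(term_weights.get(w, 0.0) for w in search_terms if w in segment_vocab)
```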
With continued reference to fig. 3K, in step 111, the plurality of text segments of different sources are ranked in order of high-to-low relevance to form a source ranking.
Here, the source ranking manner is a ranking of the plurality of text segments inside each type of text. For example, in the explicit text, if the relevance of the cover text segment is 0.9, the relevance of the title text segment is 0.8, and the relevance of the topic word text segment is 0.7, the result after source ranking of the explicit text can be expressed as: [ Explicit text (cover text segment, title text segment, topic word text segment) ].
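As a toy illustration of this source ranking (the relevance values are the hypothetical ones from the example above):

```python
# Segments of the explicit text with their example relevance scores.
explicit_segments = {
    "cover text segment": 0.9,
    "title text segment": 0.8,
    "topic word text segment": 0.7,
}
# Sort segments from high to low relevance to form the source ranking.
source_ranking = sorted(explicit_segments, key=explicit_segments.get, reverse=True)
# -> ['cover text segment', 'title text segment', 'topic word text segment']
```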
Through steps 101 to 111, video texts of different types and sources are introduced and multi-source video information is fused, improving the coverage of the content contained in the video data; by sorting the multiple types of text in a specific order and screening the resulting first text combination with the screening indexes, a second text combination containing the maximum amount of information is constructed within the limited input length, which further condenses the content information contained in the video and improves the video retrieval model's ability to capture the semantic information of the video content, thereby improving modeling accuracy.
Referring to fig. 4, fig. 4 is a flowchart of a video retrieval method of a video retrieval model according to an embodiment of the present application, where the video retrieval model is obtained by training with the training method of any one of fig. 3A to 3K of the embodiment of the present application, and a server deploying the trained video retrieval model, for example the server 100-2 described above, is used as the execution subject; the steps shown in fig. 4 are specifically described below.
In step 201, a video retrieval request is obtained, wherein the video retrieval request includes text to be searched.
In some embodiments, in response to a video retrieval operation by a user, the server obtains a video retrieval request and parses the relevant data, including the text to be searched, from the received video retrieval request; the retrieval request may also include voice data, which is converted into the text to be searched through automatic speech recognition (ASR), so that video retrieval with voice input can be performed.
In step 202, a correlation prediction value between a text to be searched and a plurality of videos in a video library is obtained through a video retrieval model.
In some embodiments, first, referring to steps 102 to 103 above, an input text corresponding to each video in the video library is obtained, next, referring to steps 1071 to 1072, a semantic vector representation of the input text and the text to be searched corresponding to each video in the video library is obtained, and a correlation prediction value between the input text and the text to be searched is obtained through the semantic vector representation.
In step 203, the videos in the video library are ranked according to the ranking of the correlation prediction values between the text to be searched and the videos in the video library from high to low, so as to obtain a video list.
In step 204, at least one video, starting from the head of the video list, is selected from the video list, and the video retrieval request is responded to through the at least one video.
In some embodiments, response data may be constructed according to a preset number of videos; for example, the video information (such as a play link, a screenshot, or streaming media data of the video) of the video data ranked in the top 5 by relevance score in the video list is selected and organized into response data in a suitable data format (such as JSON or XML). The constructed response data is sent to the client; after the client receives the response data from the server, it processes the response data according to the video information it contains, and the specific processing manner may be determined according to the client platform and user requirements.
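As an illustration of constructing such response data, the sketch below packages the top-ranked videos as JSON; the field names and the top-5 cutoff are assumptions.

```python
import json

def build_response(video_list, top_k: int = 5) -> str:
    """Take the top-k videos from the relevance-sorted list and package their
    information as JSON response data (field names assumed)."""
    payload = [{"title": v["title"],
                "play_url": v["play_url"],
                "cover": v["cover"],
                "score": v["score"]} for v in video_list[:top_k]]
    return json.dumps({"videos": payload}, ensure_ascii=False)
```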
In the following, an exemplary application of an embodiment of the present application in a video retrieval scenario will be described.
A video search scenario refers to a user searching a video library or the network for video content that meets specific requirements through keywords or other search conditions. For example, the terminal runs an instant messaging client; the user enters the search keyword "cake making" in a search box, or uses voice input, to initiate a video search request. The server retrieves related videos from the database through the video retrieval model and returns a video list related to the search keyword, which may include tutorial videos, recipe videos, and the like. The user can click one of the videos to watch it, perform further operations such as pausing, fast-forwarding, and adjusting the volume while watching, and other video recommendations related to the video may be displayed below the video so that the user can continue watching related videos. Video retrieval techniques aim to allow users to quickly and accurately find the content they are interested in.
Referring to fig. 7A, fig. 7A is a schematic diagram of a first flow of processing a video retrieval model in video retrieval according to an embodiment of the present application, and is specifically described below.
In step 301, a video retrieval request is obtained, wherein the video retrieval request includes search term text.
Here, see the explanation in step 201.
In step 302, searching is performed according to the text of the search word through a video searching model, and a searching result is obtained.
In some embodiments, referring to fig. 7B, the video retrieval model in step 302 shown in fig. 7A may enable training and prediction of the model through steps 3021 to 3024 below, which are described in detail below.
In step 3021, video content is understood.
In some embodiments, the video content understanding may include text recognition of the video frames or cover of the video data, speech recognition of the video audio, extraction of the subject terms of the video data and scoring of their word weights, generation of a subject abstract sentence of the video data, and the like, as described in detail above with respect to step 1032.
In step 3022, candidate input text combinations are constructed.
Here, see the description of step 102 above.
In step 3023, an optimal information content input text composition is constructed.
Here, see the description of steps 103 to 105 above, in which the input order control corresponds to the specific order in step 103 and to the first screening process in step 1051, the input length control corresponds to the second screening process in step 1052, the smoothness screening corresponds to the third screening process in step 1053, and the second text combination corresponds to the optimal information content input text combination.
In step 3024, a video retrieval model is trained and predictions are made by the video retrieval model.
In some embodiments, training of the video retrieval model is described above with respect to steps 106 through 107, and prediction of the video retrieval model is described above with respect to step 202.
With continued reference to fig. 7A, in step 303, a video search request is responded to by the search result.
Here, see the description of step 203 to step 204 above.
Continuing with the description below of an exemplary architecture of the video retrieval model training device 133 provided by embodiments of the present application implemented as software modules, in some embodiments, as shown in FIG. 2A, the software modules stored in the video retrieval model training device 133 of the memory 130-1 may include:
A data acquisition module 1331 for acquiring video data; a data processing module 1332 for obtaining a plurality of types of text of the video data, wherein the text of each of the types comprises a plurality of text fragments of different sources; in some embodiments, the data processing module 1332 is further configured to filter each of the text segments in the first text combination by the filtering indicator to obtain a second text combination as an input text of the video data.
In some embodiments, the data processing module 1332 is further configured to construct a training sample from the input text, the search text, and a relevance tag of the search text to the video data.
In some embodiments, the data processing module 1332 is further configured to sort the plurality of types of text according to the type sorting scheme; the following processing is performed for each of the types of text: and sequencing the text fragments of different sources included in each type of text according to the source sequencing mode corresponding to the type.
In some embodiments, the data processing module 1332 is further configured to obtain a preset priority for the plurality of types of text, wherein the plurality of types of text includes an explicit text and a content text, and the priority of the explicit text is higher than the priority of the content text; and taking the order of the priority of the explicit texts over the priority of the content texts as the type sorting mode.
In some embodiments, the data processing module 1332 is further configured to, for each of the plurality of text segments of different sources that the type includes, perform the following: determining the relevance between the text fragments and the search text respectively; and sorting the text fragments of different sources according to the sequence of the relevance from high to low to form the source sorting mode.
In some embodiments, the data processing module 1332 is further configured to perform the following processing on each of the text fragments: word segmentation processing is carried out on the search text to obtain a search word segmentation set; acquiring a word weight coverage value of each search word in the search word segmentation set, wherein the word weight coverage value represents whether the search word appears in the text segment or not and the word weight of the search word; and adding the word weight coverage value of each search word to serve as the correlation degree between the text segment and the search text.
In some embodiments, the data processing module 1332 is further configured to perform a first screening process on the first text combination according to the coverage, to obtain a first screened text combination; performing second screening treatment on the first screened text combination according to the length to obtain a second screened text combination; and performing third screening treatment on the second screened text combination according to the smoothness to obtain a second text combination.
In some embodiments, the data processing module 1332 is further configured to traverse each of the text segments in the first text combination to perform the following: acquiring the coverage of the text segment, wherein the coverage represents whether information gain exists in the current text segment compared with the previous text segment, and screening out the text segment in response to the coverage being greater than a coverage threshold; and taking the first text combination after the first screening treatment as the first screening text combination.
In some embodiments, the data processing module 1332 is further configured to perform the following processing for each type of text in the first combination of culled text: in response to the length of the text being greater than a length threshold, screening at least one of the text segments in the text starting from the tail such that the length of the text is less than or equal to the length threshold; and combining the first screening text subjected to the second screening treatment as a second screening text.
In some embodiments, the data processing module 1332 is further configured to obtain a smoothness of each of the text segments in the second pruned text combinations; adding the smoothness of each text segment to obtain the smoothness of the second screened text combination; screening at least one of the text segments of the second screened text combination starting from the tail in response to the smoothness of the second screened text combination being less than a smoothness threshold, such that the smoothness of the screened second screened text combination is greater than or equal to the smoothness threshold; and taking the second screening text combination after the third screening treatment as the second text combination.
In some embodiments, the data processing module 1332 is further configured to perform the following processing on each of the text segments in the second pruned text combinations: performing word segmentation processing on the text fragments to obtain fragment word segmentation sets; obtaining the output probability of each segment word in the segment word set; and acquiring the smoothness of the text fragments according to the output probability of the word segmentation of each fragment.
In some embodiments, the data processing module 1332 is further configured to perform word segmentation on the text segment to obtain a segment word set of the text segment; acquiring word frequency of each word in the segment word set in the previous text segment; and determining an average value of the word frequency of each word as the coverage of the text segment.
Model training module 1333 for training the video retrieval model with a plurality of the training samples.
In some embodiments, the model training module 1333 is further configured to obtain semantic vector representations of the input text and the search text in the training samples; obtaining a correlation prediction value between the input text and the search text through the semantic vector representation; determining a first loss value of a first loss function through the correlation predicted value and the correlation label; acquiring two video data with different relevance labels in a plurality of video data corresponding to the search text, and respectively acquiring relevance predicted values corresponding to the two video data; determining a second loss value of a second loss function through the correlation prediction values corresponding to the two video data; carrying out weighted summation processing on the first loss value and the second loss value to obtain a combined loss value; and acquiring gradient information through the combined loss value, and updating parameters of the video retrieval model according to the gradient information.
Continuing with the description below of an exemplary architecture of video retrieval device 134 implemented as a software module for a video retrieval model provided by an embodiment of the present application, in some embodiments, as shown in FIG. 2B, the software modules stored in video retrieval device 134 for a video retrieval model of memory 130-2 may include: a receiving module 1341, configured to obtain a video search request, where the video search request includes text to be searched; a retrieval module 1342, configured to obtain, through the video retrieval model, correlation prediction values between the text to be searched and a plurality of videos in a video library; a response module 1343, configured to select at least one video from the video list, where the at least one video starts from the video of the header, and respond to the video search request through the at least one video.
In some embodiments, the retrieving module 1342 is further configured to sort the plurality of videos in the video library according to the order of the correlation prediction values between the text to be searched and the plurality of videos in the video library from high to low, so as to obtain a video list.
Embodiments of the present application provide a computer program product comprising a computer program or computer-executable instructions stored in a computer-readable storage medium. The processor of the electronic device reads the computer executable instructions from the computer readable storage medium, and the processor executes the computer executable instructions, so that the electronic device executes the training method of the video retrieval model according to the embodiment of the present application or the video retrieval method of the video retrieval model provided by the embodiment of the present application.
Embodiments of the present application provide a computer-readable storage medium storing computer-executable instructions or a computer program stored therein, which when executed by a processor, cause the processor to perform a training method of a video search model provided by embodiments of the present application or a video search method of a video search model provided by embodiments of the present application, for example, a training method of a video search model as shown in fig. 3A or a video search method of a video search model as shown in fig. 4.
In some embodiments, the computer readable storage medium may be RAM, ROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; but may be a variety of devices including one or any combination of the above memories.
In some embodiments, computer-executable instructions may be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, in the form of programs, software modules, scripts, or code, and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, computer-executable instructions may, but need not, correspond to files in a file system, may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a hypertext markup language (Hyper Text Markup Language, HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, computer-executable instructions may be deployed to be executed on one electronic device or on multiple electronic devices located at one site or distributed across multiple sites and interconnected by a communication network.
In summary, through the embodiments of the present application, video texts of different types and sources are introduced and multi-source video information is fused, improving the coverage of the content contained in the video data; by sorting the multiple types of text in a specific order and screening the resulting first text combination with the screening indexes, a second text combination containing the maximum amount of information is constructed within the limited input length, further condensing the content information contained in the video, improving the modeling accuracy of the video retrieval model on the video content, and improving the accuracy of the relevance between the search text and the video data.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (14)

1. A method of training a video retrieval model, the method comprising:
acquiring video data;
obtaining a plurality of types of text of the video data, wherein the text of each of the types comprises a plurality of text fragments of different sources;
sorting the texts of the multiple types according to a specific sequence, and splicing the sorted texts of the multiple types into a first text combination;
Traversing text segments in each type of text in the first text composition to perform the following: acquiring the coverage of the text segment, wherein the coverage represents whether information gain exists in the currently traversed text segment compared with the previously traversed text segment, and screening out the text segment to obtain a first screened text combination in response to the coverage being greater than a coverage threshold;
The following is performed for each type of text in the first combination of screened text: screening at least one text segment from the tail in the text in response to the length of the text being greater than a length threshold, so that the length of the text is less than or equal to the length threshold, and obtaining a second screened text combination;
acquiring the smoothness of text fragments in each type of text in the second screening text combination;
adding the smoothness of each text segment to obtain the smoothness of the second screened text combination;
screening at least one text segment from the tail in the second screened text combination in response to the smoothness of the second screened text combination being smaller than a smoothness threshold, and obtaining a second text combination with the smoothness higher than or equal to the smoothness threshold as an input text of the video data;
Building a training sample through the input text, the search text and the correlation label of the search text and the video data;
The video retrieval model is trained by a plurality of the training samples.
2. The method of claim 1, wherein the step of determining the position of the substrate comprises,
The specific sequence includes: a type ordering means for characterizing an order of the plurality of types; the source ordering mode corresponding to each type is used for representing the sequence of the different sources corresponding to the type;
The ordering the plurality of types of text in a particular order includes:
Sorting the texts of the multiple types according to the type sorting mode;
the following processing is performed for each of the types of text: and sequencing the text fragments of different sources included in each type of text according to the source sequencing mode corresponding to the type.
3. The method of claim 2, wherein prior to said ordering the plurality of types of text in a particular order, the method further comprises:
Acquiring preset priorities for the plurality of types of texts, wherein the plurality of types of texts comprise an explicit text and a content text, and the priority of the explicit text is higher than that of the content text;
and taking the order of the priority of the explicit texts over the priority of the content texts as the type sorting mode.
4. The method of claim 2, wherein prior to said ordering the plurality of types of text in a particular order, the method further comprises:
For the plurality of text segments of different origin each of the types includes, performing the following:
Determining the relevance between the text fragments and the search text respectively;
and sorting the text fragments of different sources according to the sequence of the relevance from high to low to form the source sorting mode.
5. The method of claim 4, wherein the determining the relevance between the plurality of text segments and the search text, respectively, comprises:
the following processing is performed for each of the text fragments:
Word segmentation processing is carried out on the search text to obtain a search word segmentation set;
Acquiring a word weight coverage value of each search word in the search word segmentation set, wherein the word weight coverage value represents whether the search word appears in the text segment or not and the word weight of the search word;
And adding the word weight coverage value of each search word to serve as the correlation degree between the text segment and the search text.
6. The method of claim 1, wherein said obtaining the smoothness of each of the text segments in the second sifted text combination comprises:
Performing the following processing on each of the text segments in the second combination of culled text:
Performing word segmentation processing on the text fragments to obtain fragment word segmentation sets;
Obtaining the output probability of each segment word in the segment word set;
And acquiring the smoothness of the text fragments according to the output probability of the word segmentation of each fragment.
7. The method of claim 1, wherein the obtaining the coverage of the text segment comprises:
Word segmentation is carried out on the text fragments to obtain fragment word sets of the text fragments;
acquiring word frequency of each word in the segment word set in the previous text segment;
And determining an average value of the word frequency of each word as the coverage of the text segment.
8. The method of any one of claims 1 to 5, wherein the same search text corresponds to a plurality of the video data, and the training the video retrieval model by a plurality of the training samples comprises:
acquiring semantic vector representations of the input text and the search text in the training sample;
Obtaining a correlation prediction value between the input text and the search text through the semantic vector representation;
determining a first loss value of a first loss function through the correlation predicted value and the correlation label;
Acquiring two video data with different relevance labels in a plurality of video data corresponding to the search text, and respectively acquiring relevance predicted values corresponding to the two video data;
Determining a second loss value of a second loss function through the correlation prediction values corresponding to the two video data;
carrying out weighted summation processing on the first loss value and the second loss value to obtain a combined loss value;
And acquiring gradient information through the combined loss value, and updating parameters of the video retrieval model according to the gradient information.
9. A video retrieval method of a video retrieval model, wherein the video retrieval model is trained by the method of any one of claims 1 to 8, the method comprising:
acquiring a video retrieval request, wherein the video retrieval request comprises a text to be searched;
obtaining a correlation prediction value between the text to be searched and a plurality of videos in a video library through the video retrieval model;
Sequencing a plurality of videos in a video library according to the sequence from high to low of the correlation predictive value between the text to be searched and the videos in the video library to obtain a video list;
At least one video, starting from the head of the video list, is selected from the video list, and the video retrieval request is responded to through the at least one video.
10. A training device for a video retrieval model, the device comprising:
the data acquisition module is used for acquiring video data;
A data processing module for obtaining a plurality of types of text of the video data, wherein the text of each type comprises a plurality of text fragments of different sources;
The data processing module is further used for sequencing the plurality of types of texts according to a specific sequence, and splicing the sequenced plurality of types of texts into a first text combination;
The data processing module is further configured to traverse text segments in each type of text in the first text combination to perform the following processing: acquiring the coverage of the text segment, wherein the coverage represents whether information gain exists in the currently traversed text segment compared with the previously traversed text segment, and screening out the text segment to obtain a first screened text combination in response to the coverage being greater than a coverage threshold;
The following is performed for each type of text in the first combination of screened text: screening at least one text segment from the tail in the text in response to the length of the text being greater than a length threshold, so that the length of the text is less than or equal to the length threshold, and obtaining a second screened text combination;
acquiring the smoothness of text fragments in each type of text in the second screening text combination;
adding the smoothness of each text segment to obtain the smoothness of the second screened text combination;
screening at least one text segment from the tail in the second screened text combination in response to the smoothness of the second screened text combination being smaller than a smoothness threshold, and obtaining a second text combination with the smoothness higher than or equal to the smoothness threshold as an input text of the video data;
the data processing module is further used for constructing a training sample through the input text, the search text and the correlation label of the search text and the video data;
and the model training module is used for training the video retrieval model through a plurality of training samples.
11. A video retrieval apparatus of a video retrieval model, the apparatus comprising:
The receiving module is used for acquiring a video retrieval request, wherein the video retrieval request comprises a text to be searched;
the retrieval module is used for obtaining correlation prediction values between the text to be searched and a plurality of videos in a video library through the video retrieval model, wherein the video retrieval model is obtained through training by the method of any one of claims 1 to 8;
The retrieval module is further used for sequencing the videos in the video library according to the sequence of the correlation prediction values between the text to be searched and the videos in the video library from high to low to obtain a video list;
and the response module is used for selecting at least one video from the video list, starting from the video of the head, and responding to the video retrieval request through the at least one video.
12. An electronic device, the electronic device comprising:
a memory for storing computer executable instructions;
A processor for implementing the training method of the video retrieval model of any one of claims 1 to 8 or the video retrieval method of the video retrieval model of claim 9 when executing computer executable instructions stored in the memory.
13. A computer-readable storage medium storing computer-executable instructions or a computer program, wherein the computer-executable instructions or the computer program, when executed by a processor, implement the training method of the video retrieval model according to any one of claims 1 to 8 or the video retrieval method of the video retrieval model according to claim 9.
14. A computer program product comprising computer executable instructions or a computer program, which when executed by a processor implements the method of training a video retrieval model according to any one of claims 1 to 8 or the method of video retrieval of a video retrieval model according to claim 9.
CN202410079642.9A 2024-01-19 2024-01-19 Training method of video retrieval model, video retrieval method, device and equipment Active CN117591698B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410079642.9A CN117591698B (en) 2024-01-19 2024-01-19 Training method of video retrieval model, video retrieval method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410079642.9A CN117591698B (en) 2024-01-19 2024-01-19 Training method of video retrieval model, video retrieval method, device and equipment

Publications (2)

Publication Number Publication Date
CN117591698A CN117591698A (en) 2024-02-23
CN117591698B true CN117591698B (en) 2024-04-26

Family

ID=89922430

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410079642.9A Active CN117591698B (en) 2024-01-19 2024-01-19 Training method of video retrieval model, video retrieval method, device and equipment

Country Status (1)

Country Link
CN (1) CN117591698B (en)

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109614604A (en) * 2018-12-17 2019-04-12 北京百度网讯科技有限公司 Subtitle processing method, device and storage medium
CN111241263A (en) * 2020-04-24 2020-06-05 支付宝(杭州)信息技术有限公司 Text generation method and device and electronic equipment
CN112818110A (en) * 2020-12-31 2021-05-18 鹏城实验室 Text filtering method, text filtering equipment and computer storage medium
CN113111639A (en) * 2021-04-16 2021-07-13 南京奥拓电子科技有限公司 Smooth model training method and auxiliary voice recognition method
CN113590850A (en) * 2021-01-29 2021-11-02 腾讯科技(深圳)有限公司 Multimedia data searching method, device, equipment and storage medium
CN113590881A (en) * 2021-08-09 2021-11-02 北京达佳互联信息技术有限公司 Video clip retrieval method, and training method and device of video clip retrieval model
CN113780007A (en) * 2021-10-22 2021-12-10 平安科技(深圳)有限公司 Corpus screening method, intention recognition model optimization method, equipment and storage medium
CN114090766A (en) * 2021-11-30 2022-02-25 维沃移动通信有限公司 Video text screening method and device and electronic equipment
CN114332679A (en) * 2021-12-07 2022-04-12 腾讯科技(深圳)有限公司 Video processing method, device, equipment, storage medium and computer program product
CN114385815A (en) * 2022-01-12 2022-04-22 平安普惠企业管理有限公司 News screening method, device, equipment and storage medium based on business requirements
WO2022171067A1 (en) * 2021-02-09 2022-08-18 北京有竹居网络技术有限公司 Video processing method and apparatus, and storage medium and device
CN115017264A (en) * 2022-06-13 2022-09-06 特赞(上海)信息科技有限公司 Model effect verification method and device
CN115203380A (en) * 2022-09-19 2022-10-18 山东鼹鼠人才知果数据科技有限公司 Text processing system and method based on multi-mode data fusion
CN115238124A (en) * 2022-08-09 2022-10-25 平安科技(深圳)有限公司 Video character retrieval method, device, equipment and storage medium
CN116055825A (en) * 2023-01-10 2023-05-02 湖南快乐阳光互动娱乐传媒有限公司 Method and device for generating video title
CN116151220A (en) * 2022-08-01 2023-05-23 马上消费金融股份有限公司 Word segmentation model training method, word segmentation processing method and device
WO2023115890A1 (en) * 2021-12-22 2023-06-29 郑州云海信息技术有限公司 Text quality cleaning method and apparatus, and medium

Also Published As

Publication number Publication date
CN117591698A (en) 2024-02-23

Similar Documents

Publication Publication Date Title
CN110717339B (en) Semantic representation model processing method and device, electronic equipment and storage medium
CN110263324B (en) Text processing method, model training method and device
CN112131350B (en) Text label determining method, device, terminal and readable storage medium
CN111444326B (en) Text data processing method, device, equipment and storage medium
CN109840287A (en) A kind of cross-module state information retrieval method neural network based and device
CN108875051A (en) Knowledge mapping method for auto constructing and system towards magnanimity non-structured text
CN109271539B (en) Image automatic labeling method and device based on deep learning
CN115114395B (en) Content retrieval and model training method and device, electronic equipment and storage medium
CN111723295B (en) Content distribution method, device and storage medium
CN110188349A (en) A kind of automation writing method based on extraction-type multiple file summarization method
US10915756B2 (en) Method and apparatus for determining (raw) video materials for news
CN115587175A (en) Man-machine conversation and pre-training language model training method and system and electronic equipment
CN114332679A (en) Video processing method, device, equipment, storage medium and computer program product
CN116304066A (en) Heterogeneous information network node classification method based on prompt learning
CN114528898A (en) Scene graph modification based on natural language commands
CN113688951A (en) Video data processing method and device
Kuehne et al. Mining youtube-a dataset for learning fine-grained action concepts from webly supervised video data
CN112883182A (en) Question-answer matching method and device based on machine reading
CN113159187A (en) Classification model training method and device, and target text determining method and device
CN113128431A (en) Video clip retrieval method, device, medium and electronic equipment
CN113515589A (en) Data recommendation method, device, equipment and medium
CN116955591A (en) Recommendation language generation method, related device and medium for content recommendation
CN117591698B (en) Training method of video retrieval model, video retrieval method, device and equipment
CN114722774B (en) Data compression method, device, electronic equipment and storage medium
CN114595370A (en) Model training and sorting method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant