CN115203476A - Information retrieval method, model training method, device, equipment and storage medium - Google Patents


Info

Publication number
CN115203476A
CN115203476A (application number CN202210826933.0A)
Authority
CN
China
Prior art keywords
information
event
retrieval
input
modality
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210826933.0A
Other languages
Chinese (zh)
Inventor
孔伟杰
蒋杰
蔡成飞
赵文哲
王红法
涂荣成
刘威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210826933.0A
Publication of CN115203476A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7847Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/70Labelling scene content, e.g. deriving syntactic or semantic representations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses an information retrieval method, a model training method, an apparatus, a device, and a storage medium, belonging to the technical field of artificial intelligence. The method comprises the following steps: acquiring input information, wherein the information modality of the input information is a first modality; calling an event representation prediction model to perform prediction processing on the input information to obtain an input event representation of the input information; calculating the event similarity between the input event representation and a retrieval event representation, wherein the information modality of the retrieval information is a second modality; and, when the event similarity exceeds a similarity threshold, determining the retrieval information corresponding to the retrieval event representation as the retrieval result of the input information. By obtaining the input event representation and the retrieval event representation, the method fully extracts the semantic information contained in the input information and the retrieval information; by comparing the difference between the input event representation and the retrieval event representation, differences between the input information and the retrieval information caused by differing information modalities are avoided, and the retrieval effect for cross-modal information is improved.

Description

Information retrieval method, model training method, device, equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to an information retrieval method, a model training method, an apparatus, a device, and a storage medium.
Background
Text-to-video retrieval uses text information as a query to search for the video information related to it, thereby achieving cross-modal retrieval between text information and video information.
In the related art, a video depth feature of video information is usually extracted through a video feature extractor, a text depth feature of text information is extracted through a text feature extractor, and a video text matcher is trained through the extracted video depth feature and the text depth feature, so that cross-modal retrieval between the video information and the text information is realized.
However, the video depth features and the text depth features extracted by the video feature extractor and the text feature extractor cannot sufficiently mine the semantic information contained in the video information and the text information; how to further improve the retrieval effect for cross-modal information is a problem that urgently needs to be solved.
Disclosure of Invention
The application provides an information retrieval method, a model training method, an apparatus, a device, and a storage medium; the technical solutions are as follows:
according to an aspect of the present application, there is provided an information retrieval method, the method including:
acquiring input information, wherein the information modality of the input information is a first modality;
calling an event representation prediction model to perform prediction processing on the input information to obtain an input event representation of the input information, wherein the input event representation is used for indicating event information in the input information;
calculating the event similarity between the input event representation and a retrieval event representation, wherein the retrieval event representation is an event representation corresponding to retrieval information, and the information modality of the retrieval information is a second modality;
and under the condition that the event similarity exceeds a similarity threshold, determining the retrieval information corresponding to the retrieval event representation as the retrieval result of the input information.
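The retrieval flow above can be sketched as follows. This is a minimal illustration, not the patented implementation: the similarity measure (cosine), the threshold value, and the function names are all assumptions for the sketch; the event representation prediction model itself is stubbed out as a precomputed vector.

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.5  # hypothetical threshold; the patent does not fix a value


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """One possible event-similarity measure between two event representations."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def retrieve(input_repr: np.ndarray, candidates: dict) -> list:
    """Return ids of retrieval information whose event similarity with the
    input event representation exceeds the similarity threshold."""
    results = []
    for cand_id, cand_repr in candidates.items():
        if cosine(input_repr, cand_repr) > SIMILARITY_THRESHOLD:
            results.append(cand_id)
    return results
```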
According to an aspect of the present application, there is provided a training method for an event representation prediction model, the method is used for training the event representation prediction model in the above information retrieval method, and the method includes:
acquiring a sample information group, wherein the sample information group comprises a first information group and a second information group, the first information group comprises n pieces of first modal information, the second information group comprises n pieces of second modal information, the information modalities of the first modal information and the second modal information are different, the n pieces of first modal information correspond to the n pieces of second modal information one by one, and n is an integer greater than 1;
predicting the first information group to obtain a first event group of the first information group, and predicting the second information group to obtain a second event group of the second information group, wherein the first event group is used for indicating event information in the first information group, and the second event group is used for indicating event information in the second information group;
and training the event representation prediction model according to the prediction error between the first event group and the second event group to obtain the trained event representation prediction model.
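The training objective above reduces to computing a prediction error between the two paired event groups and minimizing it. A minimal stand-in for that error, assuming mean squared error (the patent does not specify the concrete loss), could look like:

```python
import numpy as np


def prediction_error(first_group: np.ndarray, second_group: np.ndarray) -> float:
    """Prediction error between two paired event groups of shape (n, d).
    MSE is an assumption here; the patent only requires some error measure
    between the first event group and the second event group."""
    assert first_group.shape == second_group.shape
    return float(np.mean((first_group - second_group) ** 2))
```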
According to another aspect of the present application, there is provided an information retrieval apparatus, the apparatus including:
an acquisition module, configured to acquire input information, wherein the information modality of the input information is a first modality;
the prediction module is used for calling an event representation prediction model to perform prediction processing on the input information to obtain an input event representation of the input information, and the input event representation is used for indicating event information in the input information;
the computing module is used for computing the event similarity between the input event representation and the retrieval event representation, the retrieval event representation is an event representation corresponding to retrieval information, and the information modality of the retrieval information is a second modality;
and the determining module is used for determining the retrieval information corresponding to the retrieval event representation as the retrieval result of the input information under the condition that the event similarity exceeds a similarity threshold.
In an optional design of the application, the event characterization prediction model includes a first prediction network corresponding to the first modality;
the prediction module is further configured to:
and under the condition that the information modality of the input information is the first modality, calling the first prediction network to perform prediction processing on the input information to obtain the input event representation of the input information.
In an alternative design of the application, the first prediction network includes a first modality encoder and a first event generator, the input information includes at least two input sub-information;
the prediction module is further configured to:
when the information modality of the input information is the first modality, calling the first modality encoder to encode at least two pieces of input sub information one by one to obtain an input feature sequence of the input information, wherein the input feature sequence comprises at least two input feature representations corresponding to the at least two pieces of input sub information;
and calling the first event generator to perform prediction processing on the input feature sequence to obtain the input event representation of the input information, wherein the input event representation comprises representation information of at least one first modal event of the input information.
In an alternative design of the application, the prediction module is further configured to:
calling the first event generator to perform prediction processing on the input feature sequence to obtain input weight information of the input information, wherein the input weight information is used for describing the weight of at least two pieces of input sub information in the input information in at least one first modal event;
determining the input event representation of the input information according to the input weight information and the input feature sequence.
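The weight-based construction described above amounts to attention-style pooling: the event generator produces, for each predicted event, a weight over the sub-information features, and the event representation is the weighted combination of the feature sequence. A sketch under that reading (softmax normalization is an assumption):

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def event_representations(features: np.ndarray, scores: np.ndarray) -> np.ndarray:
    """features: (num_sub, d) input feature sequence; scores: (num_events, num_sub)
    raw weight information from the event generator. Each event representation is
    the weight-averaged feature sequence."""
    weights = softmax(scores, axis=-1)  # normalize weights per event
    return weights @ features           # (num_events, d)
```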
In an optional design of the application, the event characterization prediction model further includes a second prediction network corresponding to the second modality;
the prediction module is further configured to:
and under the condition that the information modality of the retrieval information is the second modality, calling the second prediction network to perform prediction processing on the retrieval information to obtain the retrieval event representation of the retrieval information.
In an optional design of the application, the second prediction network includes a second modality encoder and a second event generator, and the search information includes at least two search sub-information;
the prediction module is further configured to:
under the condition that the information modality of the retrieval information is the second modality, calling the second modality encoder to encode the at least two pieces of retrieval sub-information one by one to obtain a retrieval feature sequence of the retrieval information, wherein the retrieval feature sequence comprises at least two retrieval feature representations corresponding to the at least two pieces of retrieval sub-information;
and calling the second event generator to perform prediction processing on the retrieval feature sequence to obtain the retrieval event representation of the retrieval information, wherein the retrieval event representation comprises representation information of at least one second modality event of the retrieval information.
In an alternative design of the application, the prediction module is further configured to:
calling the second event generator to perform prediction processing on the retrieval feature sequence to obtain retrieval weight information of the retrieval information, wherein the retrieval weight information is used for describing the weight of at least two pieces of retrieval sub information in the retrieval information in at least one second modal event;
and determining the retrieval event representation of the retrieval information according to the retrieval weight information and the retrieval feature sequence.
In an optional design of the application, the input events characterize characterizing information of at least one first modality event that includes the input information; the retrieval event characterizes characterization information of at least one second modality event comprising the retrieval information;
the calculation module is further configured to:
calculating a relevance score between the characterizing information of the first modality event and the characterizing information of the second modality event;
and constructing and obtaining the event similarity according to the correlation score.
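Since both sides carry a *set* of event characterizations, the event similarity is built from pairwise relevance scores between the two sets. The sketch below uses cosine relevance and aggregates with mean-of-per-event-max; both choices are assumptions, as the patent only states that the similarity is constructed from the relevance scores.

```python
import numpy as np


def event_similarity(input_events: np.ndarray, retrieval_events: np.ndarray) -> float:
    """input_events: (num_input_events, d); retrieval_events: (num_retrieval_events, d).
    Computes cosine relevance between every first-modality event and every
    second-modality event, then aggregates into one scalar similarity."""
    a = input_events / np.linalg.norm(input_events, axis=1, keepdims=True)
    b = retrieval_events / np.linalg.norm(retrieval_events, axis=1, keepdims=True)
    scores = a @ b.T  # pairwise relevance scores
    return float(scores.max(axis=1).mean())
```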
In an optional design of the application, the first modality is a text modality, and the second modality is a video modality;
the acquisition module is further configured to:
acquiring text information;
the prediction module is further configured to:
calling the event representation prediction model to carry out prediction processing on the text information to obtain a text event representation of the text information;
the calculation module is further configured to:
calculating the event similarity between the text event representation and the video event representation, wherein the video event representation is an event representation corresponding to video information;
the determination module is further configured to:
and under the condition that the similarity of the event exceeds a similarity threshold, determining the video information corresponding to the video event representation as a retrieval result of the text information.
According to another aspect of the present application, there is provided a training apparatus for an event characterization prediction model, the apparatus comprising:
an obtaining module, configured to obtain a sample information group, where the sample information group includes a first information group and a second information group, the first information group includes n pieces of first modality information, the second information group includes n pieces of second modality information, information modalities of the first modality information and the second modality information are different, the n pieces of first modality information correspond to the n pieces of second modality information one to one, and n is an integer greater than 1;
the prediction module is used for performing prediction processing on the first information group to obtain a first event group of the first information group, and performing prediction processing on the second information group to obtain a second event group of the second information group, wherein the first event group is used for indicating event information in the first information group, and the second event group is used for indicating event information in the second information group;
and the training module is used for training the event representation prediction model according to the prediction error between the first event group and the second event group to obtain the trained event representation prediction model.
In an alternative design of the application, the first event group includes n first event characterizations corresponding to n pieces of the first modality information, and the second event group includes n second event characterizations corresponding to n pieces of the second modality information;
the training module is further configured to:
calculating a first prediction error between the first event characterizations and the second event group, and calculating a second prediction error between the second event characterizations and the first event group;
and training the event representation prediction model according to the first prediction error and the second prediction error to obtain the trained event representation prediction model.
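Because the n first-modality characterizations correspond one-to-one with the n second-modality characterizations, the two directional errors above can be realized as a symmetric contrastive loss: each first characterization should match its own partner in the second group better than any other member, and vice versa. The InfoNCE-style formulation and the temperature are assumptions for illustration, not the patent's specified loss.

```python
import numpy as np


def bidirectional_error(first: np.ndarray, second: np.ndarray, tau: float = 0.1) -> float:
    """first, second: (n, d) paired event characterizations.
    Returns first-vs-second-group error plus second-vs-first-group error."""
    a = first / np.linalg.norm(first, axis=1, keepdims=True)
    b = second / np.linalg.norm(second, axis=1, keepdims=True)
    logits = (a @ b.T) / tau              # (n, n) pairwise scores
    labels = np.arange(len(first))        # i-th row should match i-th column

    def cross_entropy(l: np.ndarray) -> float:
        l = l - l.max(axis=1, keepdims=True)
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -float(np.mean(np.log(p[labels, labels] + 1e-12)))

    # rows: first error (first vs. second group); columns: second error
    return cross_entropy(logits) + cross_entropy(logits.T)
```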
In an optional design of the application, the obtaining module is further configured to obtain a verification sample pair, where the verification sample pair includes first verification information and second verification information, and information modalities of the first verification information and the second verification information are different;
the prediction module is further configured to invoke the trained event representation prediction model to perform prediction processing on the first verification information to obtain a first verification representation, and invoke the trained event representation prediction model to perform prediction processing on the second verification information to obtain a second verification representation, where the first verification representation includes at least two pieces of representation information of the first verification event, and the second verification representation includes at least two pieces of representation information of the second verification event;
and the updating module is further used for verifying the trained event characterization prediction model according to the prediction error between the first verification characterization and the second verification characterization, and updating the predicted first number of the first verification events in the first verification characterization and the predicted second number of the second verification events in the second verification characterization.
According to another aspect of the present application, there is provided a computer device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by the processor to implement an information retrieval method, and/or a training method of an event characterization prediction model, as described above.
According to another aspect of the present application, there is provided a computer-readable storage medium having stored therein at least one instruction, at least one program, code set or instruction set, which is loaded and executed by a processor to implement an information retrieval method, and/or a training method of an event characterization prediction model, as described above.
According to another aspect of the present application, there is provided a computer program product comprising computer instructions stored in a computer-readable storage medium, which are read and executed by a processor to implement the information retrieval method as described above, and/or the training method of the event characterization prediction model.
The technical solutions provided by the present application bring at least the following beneficial effects:
by acquiring the input event representation of the input information and the retrieval event representation of the retrieval information, the semantic information contained in the input information and the retrieval information is fully extracted, and the means of describing semantic information are expanded; by comparing the difference between the input event representation and the retrieval event representation, semantic information of different information modalities is compared directly, so that differences between the input information and the retrieval information caused by differing information modalities are avoided, and the retrieval effect for cross-modal information is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings required in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a computer system provided in an exemplary embodiment of the present application;
FIG. 2 is a schematic illustration of a method of using an event characterization prediction model provided in an exemplary embodiment of the present application;
FIG. 3 is a schematic diagram of a training method for an event characterization prediction model provided by an exemplary embodiment of the present application;
FIG. 4 is a flow chart of an information retrieval method provided by an exemplary embodiment of the present application;
FIG. 5 is a flow chart of an information retrieval method provided by an exemplary embodiment of the present application;
FIG. 6 is a flow chart of an information retrieval method provided by an exemplary embodiment of the present application;
FIG. 7 is a flow chart of an information retrieval method provided by an exemplary embodiment of the present application;
FIG. 8 is a flow chart of an information retrieval method provided by an exemplary embodiment of the present application;
FIG. 9 is a flow chart of an information retrieval method provided by an exemplary embodiment of the present application;
FIG. 10 is a flow chart of an information retrieval method provided by an exemplary embodiment of the present application;
FIG. 11 is a flow chart of a method for training an event characterization prediction model provided by an exemplary embodiment of the present application;
FIG. 12 is a flow chart of a method for training an event characterization prediction model provided by an exemplary embodiment of the present application;
FIG. 13 is a flow chart of a method for training an event characterization prediction model provided by an exemplary embodiment of the present application;
fig. 14 is a block diagram of an information retrieval apparatus according to an exemplary embodiment of the present application;
FIG. 15 is a block diagram of a training apparatus for an event characterization prediction model according to an exemplary embodiment of the present application;
fig. 16 is a block diagram of a server according to an exemplary embodiment of the present application.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and, together with the description, serve to explain the principles of the application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The terminology used in the disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, displayed data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the relevant laws and regulations and standards of the relevant country and region. For example, the input information, search information, and the like referred to in this application are obtained with sufficient authority.
It should be understood that although the terms first, second, etc. may be used herein to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first parameter may also be referred to as a second parameter, and similarly, a second parameter may also be referred to as a first parameter, without departing from the scope of the present disclosure. The word "if" as used herein may be interpreted as "upon", "when", or "in response to determining", depending on the context.
FIG. 1 is a diagram illustrating a computer system provided by one embodiment of the present application. The computer system may implement a system architecture for a training method and/or an information retrieval method for characterizing a predictive model for an event. The computer system may include: a terminal 100 and a server 200. The terminal 100 may be an electronic device such as a mobile phone, a tablet Computer, a vehicle-mounted terminal (car machine), a wearable device, a PC (Personal Computer), an unmanned terminal, and the like. The terminal 100 may have a client installed therein for running a target application, where the target application may be a training and/or information retrieval application of an event characterization prediction model, or may be another application providing a training and/or information retrieval function of an event characterization prediction model, and this application is not limited thereto. The form of the target Application is not limited in the present Application, and may include, but is not limited to, an App (Application program) installed in the terminal 100, an applet, and the like, and may be a web page form. The server 200 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services. The server 200 may be a background server of the target application program, and is configured to provide a background service for a client of the target application program.
According to the training method and/or the information retrieval method for the event characterization prediction model, an execution subject of each step can be computer equipment, and the computer equipment refers to electronic equipment with data calculation, processing and storage capabilities. Taking the embodiment environment shown in fig. 1 as an example, the terminal 100 may execute the training method and/or the information retrieval method of the event characterization prediction model (for example, a client installed and running in the terminal 100 of a target application executes the training method and/or the information retrieval method of the event characterization prediction model), the server 200 may execute the training method and/or the information retrieval method of the event characterization prediction model, or the terminal 100 and the server 200 cooperate with each other to execute the training method and/or the information retrieval method of the event characterization prediction model, which is not limited in this application.
In addition, the technical solution of the present application can be combined with blockchain technology. For example, some of the data involved in the training method and/or information retrieval method for the event characterization prediction model disclosed in the present application (such as the input information and the retrieval information) may be saved on the blockchain. The terminal 100 and the server 200 may communicate with each other through a network, such as a wired or wireless network.
Next, an event characterization prediction model in the present application is introduced:
FIG. 2 is a diagram illustrating a method for using an event characterization prediction model according to an embodiment of the present application.
The event characterization prediction model 300 is a trained network model; the event characterization prediction model 300 includes: video prediction network 330 and text prediction network 340;
the video prediction network 330 includes: a video encoder 332, a video event generator 334.
The computer device obtains video information 310 and invokes the video encoder 332 to encode the video information 310, obtaining a video feature sequence 310a. Specifically, the video information 310 includes at least two video frames and, correspondingly, the video feature sequence 310a includes at least two video feature vectors, with each video frame corresponding to one video feature vector. Further, the at least two video frames in the video information 310 are encoded one by one to obtain video feature vectors in one-to-one correspondence with the at least two video frames, and these video feature vectors form the video feature sequence 310a.
The computer device invokes the video event generator 334 to perform prediction processing on the video feature sequence 310a, obtaining a video event representation 310b. Specifically, the video event representation 310b includes representation information of at least two video events; it should be noted that the video information 310 can be regarded as a combination of these video events.
Illustratively, the video event representation 310b is used to describe event information in the video information 310, and semantic information of the video information 310 is described by the event information in the video information 310; further, the video event representation 310b includes representation information of at least two video events, and the video event is associated with at least one video frame in the video information 310 for describing semantic information in the associated video frame. In one example, the video event extracts at least one of time, location, participation role, participation action, etc. information of at least one video frame related to the video event.
The text prediction network 340 includes: a text encoder 342, a text event generator 344.
The computer device obtains the text information 320 and invokes the text encoder 342 to encode the text information 320, obtaining a text feature sequence 320a. Specifically, the text information 320 includes at least two phrases and, correspondingly, the text feature sequence 320a includes at least two text feature representations, with each phrase corresponding to one text feature representation. Further, the at least two phrases in the text information 320 are encoded one by one to obtain text feature representations in one-to-one correspondence with the at least two phrases, and these text feature representations form the text feature sequence 320a.
The computer device invokes the text event generator 344 to perform prediction processing on the text feature sequence 320a, obtaining a text event representation 320b. Specifically, the text event representation 320b includes representation information of at least two text events; it should be noted that the text information 320 can be regarded as a combination of these text events.
Illustratively, the text event representation 320b is used for describing event information in the text information 320, and semantic information of the text information 320 is described by the event information in the text information 320; further, the text event representation 320b includes representation information of at least two text events, and the text event is associated with at least one word group in the text information 320 and is used for describing semantic information in the associated word group. In one example, the text event extracts at least one of time, location, participation role, participation action, and the like of at least one phrase related to the text event.
The computer device calculates a similarity score 352 describing the degree of similarity between the video information 310 and the text information 320 based on the difference between the video event representation 310b and the text event representation 320b.
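A minimal sketch of this similarity computation, assuming each event's representation information is a fixed-dimension vector and that the similarity score 352 aggregates pairwise cosine similarities (the aggregation rule is an assumption for illustration; the application states only that the score is computed from the difference between the video event representation 310b and the text event representation 320b):

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two event representation vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def similarity_score(video_events, text_events):
    # For each video event, take its best-matching text event, then
    # average; higher scores indicate that the video information and
    # the text information describe more similar events.
    scores = [max(cosine(v, t) for t in text_events) for v in video_events]
    return sum(scores) / len(scores)
```

With identical event sets the score is 1.0; orthogonal (unrelated) event vectors pull it toward 0.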
Fig. 3 is a schematic diagram illustrating a training method of an event characterization prediction model according to an embodiment of the present application.
Acquiring a sample information group, wherein the sample information group comprises a video information group 410 and a text information group 420; the video information group 410 includes 128 pieces of video information, and the 128 pieces of video information correspond to 128 pieces of text information in the text information group 420, respectively.
First, a video prediction network 430 is introduced; video prediction network 430 includes a video encoder 432, a video event generator 434.
The video encoder 432 is invoked to encode the 128 pieces of video information in the video information group 410 one by one, so as to obtain a video feature group 410a. The video feature set 410a includes: the 128 video information in the video information group 410 corresponds to 128 video feature sequences.
Calling a video event generator 434 to predict 128 video feature sequences one by one to obtain a video event group 410b; video event group 410b includes 128 video event representations for the 128 video information in video information group 410.
Next, the text prediction network 440 is introduced; the text prediction network 440 includes a text encoder 442, a text event generator 444.
The text encoder 442 is invoked to encode the 128 pieces of text information in the text information group 420 one by one, resulting in a text feature group 420a. The text feature set 420a includes: 128 text feature sequences corresponding to the 128 text messages in the text message group 420.
Calling a text event generator 444 to perform prediction processing on the 128 text feature sequences one by one to obtain a text event group 420b; text event group 420b includes 128 text event representations corresponding to the 128 text messages in text message group 420.
Calculating a first prediction error between the video event representations and the text event group 420b; calculating a second prediction error between the text event representations and the video event group 410b; constructing a prediction error 452 from the first prediction error and the second prediction error, and training the event characterization prediction model with it to obtain the trained event characterization prediction model.
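A sketch of how the prediction error 452 could be constructed from the two directional errors over a batch of paired representations. The symmetric InfoNCE form and the temperature value are assumptions for illustration; the application states only that a first (video-to-text) and a second (text-to-video) prediction error are combined:

```python
import numpy as np

def info_nce(sim, axis):
    # Cross-entropy against the diagonal (the matched pairs),
    # normalized along the given axis.
    sim = sim - sim.max(axis=axis, keepdims=True)  # numerical stability
    log_softmax = sim - np.log(np.exp(sim).sum(axis=axis, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))

def prediction_error(video_reps, text_reps, temperature=0.07):
    # video_reps, text_reps: (batch, dim) arrays (e.g. batch = 128),
    # where row i of each belongs to the same video/text pair.
    sim = video_reps @ text_reps.T / temperature
    first = info_nce(sim, axis=1)   # first prediction error: video -> text
    second = info_nce(sim, axis=0)  # second prediction error: text -> video
    return (first + second) / 2     # prediction error 452
```

Matched pairs on the diagonal should yield a lower error than mismatched ones, which is what drives the training.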
Fig. 4 shows a flowchart of an information retrieval method provided by an exemplary embodiment of the present application. The method may be performed by a computer device. The method comprises the following steps:
step 510: acquiring input information;
illustratively, the information modality of the input information is a first modality; the information modality is used for indicating the information form of the input information and/or the information source of the input information.
In one example, the information form of the input information includes, but is not limited to, at least one of: text, audio, video, animation, pictures. The information source of the input information indicates the manner in which the input information is acquired. Taking picture information as an example, the information source indicates how the picture information was captured; for example, the information source of the input information includes, but is not limited to, at least one of the following: a charge-coupled device (CCD), a charge injection device (CID), a complementary metal oxide semiconductor (CMOS) sensor, and computed tomography (CT).
Step 520: calling an event representation prediction model to carry out prediction processing on input information to obtain an input event representation of the input information;
illustratively, the input event representation is used for indicating event information in the input information; the input event representation is used for representing semantic information contained in the input information, and in one example, the input event representation of the input information indicates the semantic information of the input information, and the semantic information is indicated by extracting at least one of time, place, participation role, participation action and the like in the input information. In one example, the input information is text information, the text information includes at least two phrases, and the input event representation of the input information indicates semantic information embedded in at least one of the phrases.
Step 530: calculating the event similarity between the input event representation and the retrieval event representation;
the retrieval event representation is an event representation corresponding to the retrieval information, and the information modality of the retrieval information is a second modality; similarly, the retrieval event representation is used for representing semantic information included in the retrieval information. In one example, the retrieval information is video information, the video information includes at least two video frames, and the retrieval event representation of the retrieval information indicates semantic information embedded in at least one video frame.
It is noted that the information modality of the retrieved information is different from the information modality of the input information, and in one example, the information form of the retrieved information is different from that of the input information, and/or the information source is different.
The event similarity is used for indicating the difference between the input event representation and the retrieval event representation, and the event similarity is obtained by comparing the input event representation and the retrieval event representation.
In an alternative implementation, the retrieval information may be information in a search library, the search library including at least two pieces of information whose information modality is the second modality. The input information is used to retrieve, from the search library, information whose information modality is the second modality.
Step 540: under the condition that the similarity of the event exceeds a similarity threshold, determining retrieval information corresponding to the retrieval event representation as a retrieval result of the input information;
for example, the similarity threshold may be preset, or may be obtained through the training process of the event characterization prediction model; this embodiment does not impose any specific limitation on it.
Illustratively, in the case that the event similarity exceeds the similarity threshold, a correlation exists between the retrieval information and the input information, and the retrieval information is taken as the retrieval result of the input information. The retrieval information may be one or more pieces of information, and in the case where the retrieval information is a plurality of pieces of information, no restrictive provision is made on the order of arrangement of the plurality of pieces of information.
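Steps 530 and 540 can be sketched as a scan over the search library; the function and variable names here are hypothetical, and the unsorted return order reflects that the embodiment places no requirement on the arrangement of multiple results:

```python
def retrieve(input_rep, search_library, event_similarity, threshold):
    # search_library: iterable of (retrieval_info, retrieval_event_rep)
    # pairs whose information modality is the second modality.
    results = []
    for info, retrieval_rep in search_library:
        # Step 530: event similarity between the input event
        # representation and the retrieval event representation.
        # Step 540: keep the retrieval information if the similarity
        # exceeds the similarity threshold.
        if event_similarity(input_rep, retrieval_rep) > threshold:
            results.append(info)
    return results
```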
In summary, the method provided by this embodiment sufficiently extracts semantic information included in the input information and the retrieval information by obtaining the input event representation of the input information and the retrieval event representation of the retrieval information, and expands the way of describing the semantic information; by comparing the difference between the input event representation and the retrieval event representation, semantic information in information of different information modalities is compared, so that the difference between the input information and the retrieval information caused by different information modalities is avoided, and the retrieval effect among cross-modality information is improved.
Next, the event characterization prediction model is further introduced:
in an alternative implementation of the present application, the event characterization prediction model includes a first prediction network.
Further optionally, the event characterization prediction model further comprises a second prediction network.
First, a first prediction network in an event characterization prediction model is introduced:
fig. 5 shows a flowchart of an information retrieval method provided by an exemplary embodiment of the present application. The method may be performed by a computer device. That is, in the embodiment shown in fig. 4, step 520 may be implemented as step 522:
step 522: under the condition that the information modality of the input information is a first modality, calling a first prediction network to perform prediction processing on the input information to obtain an input event representation of the input information;
illustratively, the event characterization prediction model includes a first prediction network corresponding to the first modality.
Illustratively, the first prediction network is configured to perform prediction processing on the input information whose information modality is the first modality, so as to obtain an input event representation of the input information. Taking the example that the input information is text information, the first prediction network may be implemented as a text prediction network, and is used to extract semantic information included in the text information and predict a text event representation of the text information.
It should be noted that, in this embodiment, only the first prediction network is described, and the input information is text information as an example, it is understood that the input information may be implemented as information of other information modalities. In an alternative implementation, the event characterization prediction model may include only the first prediction network, and may also include other neural networks besides the first prediction network.
In summary, according to the method provided by this embodiment, the input information is subjected to prediction processing through the first prediction network to obtain the input event representation, so that semantic information contained in the input information is fully extracted, and a manner for describing the semantic information is expanded; by comparing the difference between the input event representation and the retrieval event representation, semantic information in information of different information modes is compared, so that the difference between the input information and the retrieval information caused by different information modes is avoided, and the retrieval effect among cross-mode information is improved.
Fig. 6 shows a flowchart of an information retrieval method provided in an exemplary embodiment of the present application. The method may be performed by a computer device. That is, in the embodiment shown in fig. 5, step 522 may be implemented as steps 522a and 522b:
step 522a: under the condition that the information modality of the input information is a first modality, calling a first modality encoder to encode at least two pieces of input sub information one by one to obtain an input feature sequence of the input information;
the present embodiments further describe a first predictive network that includes a first modality encoder and a first event generator. The first mode encoder is used for encoding the input information with the information mode being the first mode to obtain an input feature sequence of the input information.
In one implementation, the input information includes at least two pieces of input sub-information; the first modality encoder encodes the input sub-information one by one to obtain the input feature representation corresponding to each piece. The input feature sequence includes at least two input feature representations corresponding to the at least two pieces of input sub-information. Illustratively, the encoding of each piece of input sub-information by the first modality encoder is independent of the others: the first modality encoder obtains the first input feature representation corresponding to the first input sub-information from the first input sub-information alone.
Illustratively, the input information includes at least two input sub-information; the input information is composed of at least two input sub-information. Such as: the input information is text information which is composed of at least two phrases or at least two characters; the input information is video information, and the video information is composed of at least two video frames or at least two video segments; the input information is audio information, and the audio information is composed of at least two audio frames or at least two audio segments; the input information is image information, which is composed of at least two pixel blocks or at least two image blocks.
In one example, the input information is text information. Text information t_i consists of m_t phrases and is expressed as:

t_i = {w_1, w_2, …, w_{m_t}}

where w_{m_t} represents the m_t-th phrase in the text information.
Further, a special classification embedder [CLS] is added at the start position of the text information, and an end-of-sequence identifier [EOS] is added at the end position of the text information. The first modality encoder encodes the special classification embedder [CLS], the m_t phrases, and the end-of-sequence identifier [EOS] respectively, obtaining the corresponding input feature representations, which form the input feature sequence of the input information.

In one example, the input feature sequence is represented as:

F_t = {f_[CLS], f_1, f_2, …, f_{m_t}, f_[EOS]}

where F_t represents the input feature sequence, f_[CLS] represents the input feature representation of the special classification embedder [CLS], f_{m_t} represents the input feature representation of the m_t-th phrase, and f_[EOS] represents the input feature representation of the end-of-sequence identifier [EOS]. In an alternative implementation, the input feature representation of a phrase in the text information is a 512-dimensional vector.
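The construction of the input feature sequence can be sketched as follows. The toy encoder is a hypothetical stand-in so the example is self-contained; a real first modality encoder would be a learned network producing the same [CLS] + m_t phrases + [EOS] layout of 512-dimensional features:

```python
import numpy as np

DIM = 512  # per the alternative implementation, features are 512-dimensional

def toy_encoder(token):
    # Hypothetical deterministic embedding, seeded by the token bytes;
    # stands in for the learned first modality encoder.
    rng = np.random.default_rng(sum(token.encode("utf-8")))
    return rng.standard_normal(DIM)

def input_feature_sequence(phrases):
    # [CLS] at the start position, [EOS] at the end position, and each
    # of the m_t phrases encoded independently of the others.
    tokens = ["[CLS]"] + list(phrases) + ["[EOS]"]
    return np.stack([toy_encoder(tok) for tok in tokens])  # (m_t + 2, 512)
```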
In one example, where the input information is text information, the first modality encoder is at least one of a convolutional neural network (CNN), a Transformer encoder, and a deep averaging network (DAN).
Step 522b: calling a first event generator to carry out prediction processing on the input characteristic sequence to obtain an input event representation of the input information;
illustratively, the input event representation indicates event information in the input information. In one implementation, the input event representation includes characterization information of at least one first modality event of the input information; in another implementation, it includes characterization information of at least two first modality events. Illustratively, a first modality event is related to at least one piece of input sub-information of the input information and represents the semantic information implied by that sub-information; the first modality event extracts at least one of the time, place, participating role, participating action, and the like in the at least one piece of input sub-information to indicate the semantic information. In one example, the first event generator includes a fully connected layer.
In an alternative implementation, when there are a plurality of first modality events, the characterization information of the ith first modality event is generated based on the characterization information of the (i-1)th first modality event. It can be understood that referring to the characterization information of the (i-1)th first modality event when generating that of the ith prevents the two from being identical; in one implementation, this avoids the mutual influence caused by coupling between the characterization information of the ith first modality event and that of the (i-1)th first modality event.
In summary, in the method provided in this embodiment, the first prediction network includes a first modality encoder and a first event generator, and encodes the input sub information one by one, so as to fully extract depth information in each input sub information, and compares semantic information in information of different information modalities by comparing differences between the input event representations and the retrieval event representations, thereby avoiding differences between the input information and the retrieval information caused by different information modalities, and improving a retrieval effect across modalities.
Next, the first event generator is further described:
in one implementation of the present application, step 522b in the embodiment illustrated in fig. 6 may be implemented as sub-step 1 and sub-step 2:
substep 1: calling a first event generator to carry out prediction processing on the input feature sequence to obtain input weight information of the input information;
illustratively, the input weight information describes the weights of the at least two pieces of input sub-information in the input information within at least one first modality event.

In one example, the input information includes first sub-information and second sub-information, and the input event representation includes characterization information of two first modality events of the input information. Correspondingly, the input weight information includes the weights of the first sub-information in the two first modality events and the weights of the second sub-information in the two first modality events.
In one example, the input information is text information. The input weight information for the nth first modality event is computed as:

q^n = ReLU(W_q [e^{n-1} ; f_[EOS]])

u_j^n = tanh(W_pq q^n + W_pt f_j)

a^n = softmax(W_p u^n)

where W_q, W_p, W_pq and W_pt represent learnable parameter matrices in the first event generator; illustratively, the learnable parameter matrices are determined by the training process of the event characterization prediction model.

Here ReLU denotes the activation function, tanh denotes the hyperbolic tangent function, and softmax denotes the normalization method. e^{n-1} represents the characterization information of the (n-1)th first modality event; in the case of n = 1, e^0 is the characterization information of the 0th first modality event, and in an alternative implementation e^0 is a 512-dimensional zero vector. f_[EOS] represents the input feature representation of the end-of-sequence identifier [EOS] in the text information and serves as the global semantic representation of the text information. f_j represents the input feature representation of the jth phrase. u_j^n represents an intermediate variable obtained during the calculation.

a^n represents the weights of the text information over the characterization information of the nth first modality event. In one implementation, a^n = (a_[CLS]^n, a_1^n, …, a_{m_t}^n, a_[EOS]^n) is an (m_t + 2)-dimensional weight vector, where the text information includes m_t phrases, a_[CLS]^n represents the weight of the special classification embedder [CLS] for the characterization information of the nth first modality event, and a_[EOS]^n represents the weight of the end-of-sequence identifier [EOS] for the characterization information of the nth first modality event.

In an alternative implementation, the dimensions of the learnable parameter matrices are as follows, where R represents the real numbers:

W_q ∈ R^(512×1024)

W_p ∈ R^(1×512)

W_pq ∈ R^(512×512)

W_pt ∈ R^(512×512)
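Sub-step 1 can be sketched in numpy as follows, using the matrix dimensions listed above. The exact composition of the query from e^{n-1} and the global [EOS] feature is a reconstruction from the symbol definitions, not the application's reference implementation, and the random matrices stand in for the trained parameters:

```python
import numpy as np

D = 512
rng = np.random.default_rng(0)
# Stand-ins for the learnable parameter matrices (trained in practice).
W_q = rng.standard_normal((D, 2 * D)) * 0.01   # R^(512x1024)
W_p = rng.standard_normal((1, D)) * 0.01       # R^(1x512)
W_pq = rng.standard_normal((D, D)) * 0.01      # R^(512x512)
W_pt = rng.standard_normal((D, D)) * 0.01      # R^(512x512)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def input_weights(prev_event, features):
    # features: (m_t + 2, 512) input feature sequence, with the [EOS]
    # feature (the global semantic representation) in the last row.
    q = np.maximum(W_q @ np.concatenate([prev_event, features[-1]]), 0.0)  # ReLU
    u = np.tanh(W_pq @ q + features @ W_pt.T)   # intermediate variables u_j^n
    return softmax((u @ W_p.T).ravel())         # (m_t + 2)-dim weight vector a^n
```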
Substep 2: and determining input event representation of the input information according to the input weight information and the input characteristic sequence.
In one example, the input information is text information. The characterization information of the nth first modality event is computed as:

e^n = Σ_j a_j^n f_j

where e^n represents the characterization information of the nth first modality event of the text information, a_j^n represents the weight of the jth phrase for the characterization information of the nth first modality event, and f_j represents the input feature representation of the jth phrase.

Illustratively, the characterization information of k first modality events of the text information is calculated sequentially:

E_t = {e^1, e^2, …, e^k}

where E_t represents the input event representation of the text information, which includes the characterization information of the k first modality events, and e^k represents the characterization information of the kth first modality event.
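Sub-step 2 and the sequential generation of the k events can be sketched together: each e^n is the weight vector applied to the input feature sequence, and the nth step is conditioned on e^{n-1} (with e^0 a 512-dimensional zero vector). The attention parameterization is a reconstruction under the dimensions given earlier, with random matrices standing in for trained ones:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def generate_events(features, k, dim=512, seed=0):
    # features: (m_t + 2, dim) input feature sequence; [EOS] in the last row.
    rng = np.random.default_rng(seed)
    W_q = rng.standard_normal((dim, 2 * dim)) * 0.01
    W_p = rng.standard_normal((1, dim)) * 0.01
    W_pq = rng.standard_normal((dim, dim)) * 0.01
    W_pt = rng.standard_normal((dim, dim)) * 0.01

    events = []
    e_prev = np.zeros(dim)  # e^0: 512-dimensional zero vector
    for _ in range(k):
        q = np.maximum(W_q @ np.concatenate([e_prev, features[-1]]), 0.0)
        u = np.tanh(W_pq @ q + features @ W_pt.T)
        a = softmax((u @ W_p.T).ravel())  # weights over the m_t + 2 inputs
        e_prev = a @ features             # e^n = sum_j a_j^n * f_j
        events.append(e_prev)
    return np.stack(events)  # E_t: characterization info of k events, (k, dim)
```

Conditioning each step on the previous event's characterization is what keeps the k characterizations from collapsing to the same vector.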
In summary, in the method provided by this embodiment, the input weight information of the input information is obtained through the first event generator, so that the semantic information included in the input information is fully extracted, and the way of describing the semantic information is expanded; by comparing the difference between the input event representation and the retrieval event representation, semantic information in information of different information modalities is compared, so that the difference between the input information and the retrieval information caused by different information modalities is avoided, and the retrieval effect among cross-modality information is improved.
A second prediction network in the event characterization prediction model is introduced below:
fig. 7 shows a flowchart of an information retrieval method provided by an exemplary embodiment of the present application. The method may be performed by a computer device. That is, on the basis of the embodiment shown in fig. 5, the method further includes step 525:
step 525: under the condition that the information modality of the retrieval information is a second modality, calling a second prediction network to carry out prediction processing on the retrieval information to obtain a retrieval event representation of the retrieval information;
illustratively, the event characterization prediction model further includes a second prediction network corresponding to the second modality.
Illustratively, the second prediction network is configured to perform prediction processing on the retrieval information whose information modality is the second modality, obtaining the retrieval event representation of the retrieval information. Taking the retrieval information being video information as an example, the second prediction network may be implemented as a video prediction network, used to extract the semantic information contained in the video information and predict the video event representation of the video information.
It should be noted that, in this embodiment, only the second prediction network is described, and the retrieval information is video information as an example, it is understood that the retrieval information may be implemented as information of other information modalities.
In summary, according to the method provided by this embodiment, the retrieval information is subjected to prediction processing through the second prediction network to obtain the retrieval event representation, so that semantic information included in the retrieval information is fully extracted, and a manner of describing the semantic information is expanded; by comparing the difference between the input event representation and the retrieval event representation, semantic information in information of different information modalities is compared, so that the difference between the input information and the retrieval information caused by different information modalities is avoided, and the retrieval effect among cross-modality information is improved.
Fig. 8 shows a flowchart of an information retrieval method provided in an exemplary embodiment of the present application. The method may be performed by a computer device. That is, in the embodiment shown in fig. 7, step 525 may be implemented as steps 525a, 525b:
step 525a: under the condition that the information modality of the retrieval information is a second modality, calling a second modality encoder to encode at least two retrieval sub-information one by one to obtain a retrieval feature sequence of the retrieval information;
the present embodiment further introduces a second prediction network comprising a second modality encoder and a second event generator. The second mode encoder is used for encoding the retrieval information with the information mode being the second mode to obtain a retrieval feature sequence of the retrieval information.
In one implementation, the retrieval information includes at least two pieces of retrieval sub-information; the second modality encoder encodes the retrieval sub-information one by one to obtain the retrieval feature representation corresponding to each piece. The retrieval feature sequence includes at least two retrieval feature representations corresponding to the at least two pieces of retrieval sub-information. Illustratively, the encoding of each piece of retrieval sub-information by the second modality encoder is independent of the others: the second modality encoder obtains the first retrieval feature representation corresponding to the first retrieval sub-information from the first retrieval sub-information alone.
Illustratively, the retrieval information includes at least two retrieval sub-information; the retrieval information is composed of at least two retrieval sub-information.
In one example, the retrieval information is video information. Video information v_i consists of m_v video frames and is expressed as:

v_i = {z_1, z_2, …, z_{m_v}}

where z_{m_v} represents the m_v-th video frame in the video information.

Further, the second modality encoder encodes the m_v video frames in the video information respectively, obtaining the corresponding retrieval feature representations, which form the retrieval feature sequence of the retrieval information.

In one example, the retrieval feature sequence is represented as:

F_v = {f_1^v, f_2^v, …, f_{m_v}^v}

where F_v represents the retrieval feature sequence and f_{m_v}^v represents the retrieval feature representation of the m_v-th video frame. In an alternative implementation, the retrieval feature representation of a video frame in the video information is a 512-dimensional vector.
In one example, where the retrieval information is video information, the second modality encoder is at least one of a Convolutional Neural Network (CNN), a Vision Transformer (ViT), a Long Short-Term Memory network (LSTM), and a Multi-Layer Perceptron (MLP).
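As a minimal illustration of this per-frame encoding (the names, the linear encoder, and the sizes are illustrative assumptions; an actual encoder would be one of the networks named above), the following sketch encodes each video frame independently into a 512-dimensional retrieval feature representation:

```python
import numpy as np

# Hypothetical linear frame encoder standing in for the CNN / ViT /
# LSTM / MLP options: each video frame is encoded independently, and
# the per-frame features form the retrieval feature sequence.
def encode_frames(frames, W):
    return [W @ f for f in frames]

rng = np.random.default_rng(0)
m_v, frame_dim, feat_dim = 4, 64, 512   # illustrative sizes
W = 0.01 * rng.standard_normal((feat_dim, frame_dim))
frames = [rng.standard_normal(frame_dim) for _ in range(m_v)]
seq = encode_frames(frames, W)          # retrieval feature sequence
print(len(seq), seq[0].shape)           # 4 (512,)
```

Because each call depends only on its own frame, the encoding processes stay mutually independent, matching the one-by-one encoding described above.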
Step 525b: calling a second event generator to carry out prediction processing on the retrieval feature sequence to obtain a retrieval event representation of the retrieval information;
illustratively, the retrieval event representation is used for indicating event information in the retrieval information. In one implementation, the retrieval event representation includes characterization information of at least one second modality event of the retrieval information; in another implementation, the retrieval event representation includes characterization information of at least two second modality events of the retrieval information. Illustratively, a second modality event is related to at least one piece of retrieval sub-information of the retrieval information and is used for representing the semantic information implied by that retrieval sub-information; the second modality event extracts at least one of the time, place, participating role, participating action and the like in the retrieval sub-information to indicate the semantic information. In one example, the second event generator includes a fully connected layer.
In an alternative implementation, in the case that the number of second modality events is plural, the characterization information of the i-th second modality event is generated based on the characterization information of the (i-1)-th second modality event. It is to be understood that, when generating the characterization information of the i-th second modality event, referring to the characterization information of the (i-1)-th second modality event avoids the two pieces of characterization information being identical. In one implementation, this also avoids the mutual influence caused by coupling between the characterization information of the i-th second modality event and that of the (i-1)-th second modality event.
In summary, in the method provided in this embodiment, the second prediction network includes a second modality encoder and a second event generator, and encodes the retrieval sub-information one by one, so as to fully extract depth information in each retrieval sub-information, and compare semantic information in information of different information modalities by comparing differences between the input event representation and the retrieval event representation, thereby avoiding differences between input information and retrieval information caused by different information modalities, and improving a retrieval effect between cross-modality information.
Next, the second event generator is further described:
in one implementation of the present application, step 525b in the embodiment shown in fig. 8 may be implemented as sub-step 3 and sub-step 4:
substep 3: calling a second event generator to carry out prediction processing on the retrieval feature sequence to obtain retrieval weight information of the retrieval information;
illustratively, the retrieval weight information is used to describe the weights of the at least two pieces of retrieval sub-information in the retrieval information for the at least one second modality event.
In one example, the retrieval information comprises first sub-information and second sub-information, and the retrieval event representation comprises characterization information of two second modality events of the retrieval information. Correspondingly, the retrieval weight information comprises the weights of the first sub-information in the two second modality events and the weights of the second sub-information in the two second modality events.
In one example, the retrieval information is taken as video information:

$$q_n = \mathrm{ReLU}\left(W_q\left[e^v_{n-1}\,;\,\bar{f}^v\right]\right)$$

$$h^n_j = \tanh\left(W_{pq}\,q_n + W_{pt}\,f^v_j\right)$$

$$\alpha^n = \mathrm{softmax}\left(W_p\,h^n\right)$$

wherein $W_q$, $W_p$, $W_{pq}$ and $W_{pt}$ represent learnable parameter matrices in the second event generator; illustratively, the learnable parameter matrices are determined by the training process of the event characterization prediction model. It should be noted that the learnable parameter matrices in this embodiment are generally different from those of the first event generator above, although identical learnable parameter matrices are not excluded. Further, in the training process of the event characterization prediction model, the first prediction network is trained through sample information whose modality is the first modality to obtain the learnable parameter matrices of the first prediction network, and the second prediction network is trained through sample information whose modality is the second modality to obtain the learnable parameter matrices of the second prediction network.

Here, ReLU denotes the rectified linear activation function, tanh denotes the hyperbolic tangent function, and softmax denotes the normalization method. $e^v_{n-1}$ represents the characterization information of the (n-1)-th second modality event; in the case of n = 1, $e^v_0$ is the characterization information of the 0th second modality event and, in an alternative implementation, is a 512-dimensional zero vector. $f^v_j$ represents the input feature representation of the j-th video frame, $\bar{f}^v$ represents the global feature representation of the video information obtained by an average pooling operation, and $h^n_j$ represents an intermediate variable obtained during the calculation.

$\alpha^n$ represents the weights of the video frames for the characterization information of the n-th second modality event; in one implementation, $\alpha^n \in \mathbb{R}^{m_v}$ is an $m_v$-dimensional weight vector, wherein the video information comprises $m_v$ video frames.

In an alternative implementation, the dimensions of the learnable parameter matrices are as follows, where $\mathbb{R}$ represents the real numbers:

$$W_q \in \mathbb{R}^{512 \times 1024},\quad W_p \in \mathbb{R}^{1 \times 512},\quad W_{pq} \in \mathbb{R}^{512 \times 512},\quad W_{pt} \in \mathbb{R}^{512 \times 512}$$
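One plausible NumPy sketch of the weight computation performed by the second event generator, under the ReLU/tanh/softmax structure and the 512-dimensional shapes stated in this embodiment (all function and variable names are illustrative assumptions, not the application's own code):

```python
import numpy as np

def event_weights(e_prev, frame_feats, f_global, Wq, Wpq, Wpt, Wp):
    """One attention step of the second event generator (sketch):
    a query is formed from the previous event characterization and the
    global video feature, each frame is scored through tanh and Wp,
    and softmax over frames yields the weight vector alpha^n."""
    q = np.maximum(0.0, Wq @ np.concatenate([e_prev, f_global]))  # ReLU
    scores = np.array([(Wp @ np.tanh(Wpq @ q + Wpt @ f)).item()
                       for f in frame_feats])
    ex = np.exp(scores - scores.max())      # numerically stable softmax
    return ex / ex.sum()

rng = np.random.default_rng(1)
d, m_v = 512, 5
frame_feats = [rng.standard_normal(d) for _ in range(m_v)]
f_global = np.mean(frame_feats, axis=0)     # average pooling
Wq = 0.02 * rng.standard_normal((d, 2 * d)) # R^{512 x 1024}
Wpq = 0.02 * rng.standard_normal((d, d))    # R^{512 x 512}
Wpt = 0.02 * rng.standard_normal((d, d))    # R^{512 x 512}
Wp = 0.02 * rng.standard_normal((1, d))     # R^{1 x 512}
alpha = event_weights(np.zeros(d), frame_feats, f_global,
                      Wq, Wpq, Wpt, Wp)     # e_0: 512-dim zero vector
```

The returned `alpha` is an $m_v$-dimensional weight vector that sums to one, one weight per video frame.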
And substep 4: determining retrieval event representation of the retrieval information according to the retrieval weight information and the retrieval characteristic sequence;
in one example, the retrieval information is taken as video information:

$$e^v_n = \sum_{j=1}^{m_v} \alpha^n_j\, f^v_j$$

wherein $e^v_n$ represents the characterization information of the n-th second modality event of the video information, $\alpha^n_j$ represents the weight of the j-th video frame for the characterization information of the n-th second modality event, and $f^v_j$ represents the input feature representation of the j-th video frame.

Illustratively, the characterization information of the k second modality events of the video information is calculated sequentially:

$$E^v_i = \{e^v_1, e^v_2, \ldots, e^v_k\}$$

wherein $E^v_i$ represents the retrieval event representation of the video information, which comprises the characterization information of the k second modality events, and $e^v_k$ represents the characterization information of the k-th second modality event. It should be noted that the number of second modality events in this embodiment may be the same as or different from the number of first modality events described above; this embodiment does not set any limitation thereto.
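The weighted aggregation of sub-step 4, in which each event characterization is a weighted sum of the frame features, can be sketched as follows (the attention weights are assumed to be precomputed; all names are illustrative):

```python
import numpy as np

def build_event_representations(frame_feats, alphas):
    """e_n = sum_j alpha[n, j] * f_j: each row of `alphas` holds the
    weights of the m_v frames for one second modality event; returns
    the k event characterizations stacked row-wise."""
    F = np.stack(frame_feats)   # (m_v, d)
    return alphas @ F           # (k, d)

# Tiny worked example with d = 2 frame features.
frame_feats = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
alphas = np.array([[1.0, 0.0],    # event 1 attends only to frame 1
                   [0.5, 0.5]])   # event 2 averages both frames
E_v = build_event_representations(frame_feats, alphas)
print(E_v)   # rows: [1, 0] and [0.5, 0.5]
```

Computing the k rows in sequence (each row's weights derived from the previous row's event characterization) reproduces the sequential generation described above.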
In summary, in the method provided by this embodiment, the retrieval weight information of the retrieval information is obtained through the second event generator, so that the semantic information included in the retrieval information is fully extracted, and the way of describing the semantic information is expanded; by comparing the difference between the input event representation and the retrieval event representation, semantic information in information of different information modes is compared, so that the difference between the input information and the retrieval information caused by different information modes is avoided, and the retrieval effect among cross-mode information is improved.
Fig. 9 shows a flowchart of an information retrieval method provided by an exemplary embodiment of the present application. The method may be performed by a computer device. That is, in the embodiment shown in fig. 4, step 530 may be implemented as steps 532, 534:
step 532: calculating a relevance score between the characterizing information of the first modality event and the characterizing information of the second modality event;
illustratively, the relevance score is calculated by a similarity algorithm. Specifically, the relevance score is expressed as:
$$ES^{ij}_{zl} = \frac{\left(e^{v_i}_z\right)^{\top} e^{t_j}_l}{\left\|e^{v_i}_z\right\|_2 \left\|e^{t_j}_l\right\|_2}$$

wherein $ES^{ij}_{zl}$ represents the relevance score, between the input information $t_j$ and the retrieval information $v_i$, of the l-th first modality event and the z-th second modality event; $e^{t_j}_l$ represents the characterization information of the l-th first modality event, $e^{v_i}_z$ represents the characterization information of the z-th second modality event, and $\|\cdot\|_2$ represents the L2-norm operation.

Illustratively, $ES^{ij}_{zl}$ is the element in the z-th row and l-th column of the correlation matrix $ES^{ij}$; the correlation matrix $ES^{ij}$ represents the similarities between the k first modality events and the k second modality events:

$$ES^{ij} \in [-1, 1]^{k \times k}$$

that is, the correlation matrix $ES^{ij}$ is a matrix of dimensions k by k.
In one example, the retrieved information is video information and the input information is text information.
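Since the relevance score is a cosine similarity, the k-by-k correlation matrix can be built directly from the two sets of event characterizations, as in this minimal sketch (names are illustrative):

```python
import numpy as np

def relevance_matrix(E_v, E_t):
    """ES[z, l] = cosine similarity between the z-th second modality
    event (row of E_v) and the l-th first modality event (row of E_t);
    every entry lies in [-1, 1]."""
    Ev = E_v / np.linalg.norm(E_v, axis=1, keepdims=True)
    Et = E_t / np.linalg.norm(E_t, axis=1, keepdims=True)
    return Ev @ Et.T   # (k, k)

E_v = np.array([[1.0, 0.0], [0.0, 1.0]])   # k = 2 video events
E_t = np.array([[2.0, 0.0], [0.0, -3.0]])  # k = 2 text events
ES = relevance_matrix(E_v, E_t)
print(ES)   # rows: [1, 0] and [0, -1]
```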
Step 534: according to the relevance score, constructing to obtain event similarity;
illustratively, the event similarity between the input event representation and the retrieved event representation includes at least one of a first similarity of the input information to the retrieved information and a second similarity of the retrieved information to the input information.
The description will be given by taking an example in which the event similarity includes the first similarity and the second similarity.
First, the maximum value of the relevance scores between the characterization information of the l-th first modality event of the input information and the k second modality events of the retrieval information is calculated:

$$\max_{z} ES^{ij}_{zl}$$

wherein $\max_{z} ES^{ij}_{zl}$ represents the maximum value in the l-th column of the correlation matrix $ES^{ij}$.
Illustratively, the first similarity is:

$$s(t_j \to v_i) = \frac{1}{k} \sum_{l=1}^{k} \max_{z} ES^{ij}_{zl}$$

wherein $s(t_j \to v_i)$ represents the first similarity, k represents the number of first modality events of the input information, and $\max_{z} ES^{ij}_{zl}$ represents the maximum value in the l-th column of the correlation matrix $ES^{ij}$. That is, the average of the maximum values of the columns of the correlation matrix $ES^{ij}$ is taken as the first similarity.
Similarly, the second similarity is calculated by:
first, the maximum value of the relevance scores between the characterization information of the z-th second modality event of the retrieval information and the k first modality events of the input information is calculated:

$$\max_{l} ES^{ij}_{zl}$$

wherein $\max_{l} ES^{ij}_{zl}$ represents the maximum value in the z-th row of the correlation matrix $ES^{ij}$.

Illustratively, the second similarity is:

$$s(v_i \to t_j) = \frac{1}{k} \sum_{z=1}^{k} \max_{l} ES^{ij}_{zl}$$

wherein $s(v_i \to t_j)$ represents the second similarity, k represents the number of second modality events of the retrieval information, and $\max_{l} ES^{ij}_{zl}$ represents the maximum value in the z-th row of the correlation matrix $ES^{ij}$. That is, the average of the maximum values of the rows of the correlation matrix $ES^{ij}$ is taken as the second similarity.
When the event similarity includes the first similarity and the second similarity, the event similarity is:

$$s(v_i, t_j) = \frac{1}{2}\left(s(t_j \to v_i) + s(v_i \to t_j)\right)$$

wherein $s(v_i, t_j)$ represents the event similarity, $s(t_j \to v_i)$ represents the first similarity, and $s(v_i \to t_j)$ represents the second similarity; that is, the event similarity is the average of the first similarity and the second similarity.
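Steps 532 and 534 combined amount to averaging the column maxima and the row maxima of the correlation matrix, then averaging the two results, as in this sketch (names are illustrative):

```python
import numpy as np

def event_similarity(ES):
    """ES is the k x k correlation matrix. The first similarity is the
    mean of the column maxima (best-matching second modality event per
    first modality event); the second similarity is the mean of the
    row maxima; the event similarity is their average."""
    s_first = ES.max(axis=0).mean()
    s_second = ES.max(axis=1).mean()
    return 0.5 * (s_first + s_second)

ES = np.array([[0.9, 0.1],
               [0.2, 0.8]])
print(event_similarity(ES))   # ~0.85
```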
In summary, in the method provided in this embodiment, the difference between the semantic information included in the different information modalities is described by calculating the correlation score between the characterizing information of the first modality event and the characterizing information of the second modality event, and the semantic information in the information of the different information modalities is compared, so that the difference between the input information and the retrieval information caused by different information modalities is avoided, and the retrieval effect between the cross-modality information is improved.
Fig. 10 is a flowchart illustrating an information retrieval method according to an exemplary embodiment of the present application. The method may be performed by a computer device. That is, in the embodiment shown in FIG. 4, step 510 may be implemented as step 510a; step 520 may be implemented as step 520a; step 530 may be implemented as step 530a; step 540 may be implemented as step 540a:
step 510a: acquiring text information;
illustratively, in this embodiment, the information modality of the input information is a text modality, and the input information is text information.
Step 520a: calling an event representation prediction model to carry out prediction processing on the text information to obtain a text event representation of the text information;
illustratively, the event representation prediction model is used for performing prediction processing on the text information to obtain the text event representation, which is used for representing the semantic information contained in the text information. In an optional implementation, the text event representation comprises characterization information of at least two text modality events, and the text modality events are used for representing semantic information implied by the text information.
Step 530a: calculating the event similarity between the text event representation and the video event representation;
illustratively, the video event representation is the event representation corresponding to the video information, and is used for representing the semantic information contained in the video information. The event similarity is used for indicating the difference between the text event representation and the video event representation, and is obtained by comparing the text event representation with the video event representation.
Step 540a: and under the condition that the similarity of the event exceeds a similarity threshold, determining the video information corresponding to the video event representation as a retrieval result of the text information.
Illustratively, in the case that the event similarity exceeds the similarity threshold, a correlation exists between the video information and the text information, and the video information is used as the retrieval result of the text information.
Note that this embodiment shows only a case where the first modality is a text modality, and the second modality is a video modality. Those skilled in the art will appreciate that in an alternative implementation, the first modality is a video modality and the second modality is a text modality. In another alternative implementation, the first modality and the second modality are different.
In summary, according to the method provided by this embodiment, the text event representation of the text information and the video event representation of the video information are obtained, so that the semantic information contained in the text information and the video information is fully extracted, and the manner of describing semantic information is expanded; by comparing the difference between the text event representation and the video event representation, semantic information in information of different information modalities is compared, so that the difference between the text information and the video information caused by different information modalities is avoided, and the retrieval effect among cross-modality information is improved.
FIG. 11 illustrates a flowchart of a method for training an event characterization prediction model according to an exemplary embodiment of the present application. The method may be performed by a computer device. The method comprises the following steps:
step 610: acquiring a sample information group;
illustratively, the sample information group includes a first information group and a second information group, the first information group includes n pieces of first modality information, the second information group includes n pieces of second modality information, the information modalities of the first modality information and the second modality information are different, the n pieces of first modality information correspond to the n pieces of second modality information one by one, and n is an integer greater than 1;
illustratively, the information modality is used to indicate the information form of the information, and/or the information source of the information. Illustratively, the first modality information and the corresponding second modality information are used to imply the same semantic information, being different expressions for the same information. In an alternative implementation, the first modality information is obtained through manual tagging according to the second modality information.
In a preferred implementation, n has a value of 128.
Step 620: predicting the first information group to obtain a first event group of the first information group, and predicting the second information group to obtain a second event group of the second information group;
illustratively, the first event group is used to indicate event information in the first information group. In one implementation, the n pieces of first modality information in the first information group are subjected to prediction processing one by one to obtain the first event representations corresponding to the n pieces of first modality information; the n first event representations are assembled to obtain the first event group. Similarly, the second event group is used to indicate event information in the second information group. In one implementation, the n pieces of second modality information in the second information group are subjected to prediction processing one by one to obtain the second event representations corresponding to the n pieces of second modality information; the n second event representations are assembled to obtain the second event group.
Step 630: training the event representation prediction model according to the prediction error between the first event group and the second event group to obtain a trained event representation prediction model;
illustratively, the prediction error is used to describe a difference between a first event group and a second event group, and the prediction error is derived by comparing the difference between the first event group and the second event group. And training the event characterization prediction model through the prediction error to obtain the event characterization prediction model in any one of the embodiments.
Illustratively, the parameters of the event characterization prediction model are updated by using a back propagation algorithm based on the prediction error, and the prediction accuracy of the event characterization prediction model is improved by using a plurality of sample information sets, comparing the prediction error for a plurality of times and updating the parameters of the event characterization prediction model.
In summary, in the method provided in this embodiment, the prediction error between the first event group and the second event group is calculated by obtaining the sample information group, and the event representation prediction model is trained; the positive sample and the negative sample are constructed through the plurality of first information and the plurality of second information in the sample information group, the first event representation and the second event representation are compared, semantic information in information of different information modes is compared, the difference between the first information and the second information caused by different information modes is avoided, and the retrieval effect among cross-mode information is improved.
FIG. 12 is a flowchart illustrating a method for training an event characterization prediction model according to an exemplary embodiment of the present application. The method may be performed by a computer device. That is, in the embodiment shown in fig. 11, step 630 may be implemented as steps 632, 634:
step 632: calculating a first prediction error between the first event characterization and the second event group; and calculating a second prediction error between the second event characterization and the first event group;
illustratively, the first information group includes n pieces of first modality information and is represented as

$$T = \{t_1, t_2, \ldots, t_n\}$$

wherein $t_j$ represents the j-th first modality information in the first information group. The second information group includes n pieces of second modality information and is represented as

$$V = \{v_1, v_2, \ldots, v_n\}$$

wherein $v_j$ represents the j-th second modality information in the second information group.
Illustratively, a first prediction error between the first event characterization and the second event group is calculated.
Illustratively, the second event group includes n second event characterizations corresponding to n second modality information; the first event group includes n first event characterizations corresponding to n first modality information.
Illustratively, the first prediction error is:

$$L_{t2v} = -\frac{1}{n} \sum_{i=1}^{n} \log \frac{\exp\left(s(v_i, t_i)\right)}{\sum_{j=1}^{n} \exp\left(s(v_j, t_i)\right)}$$

wherein $L_{t2v}$ represents the first prediction error, and $s(v_i, t_i)$ represents the event similarity between a first event representation and the corresponding second event representation; the first modality information to which the first event representation belongs corresponds to the second modality information to which the second event representation belongs. n represents the number of pieces of first modality information in the first information group, or equivalently the number of pieces of second modality information in the second information group. $s(v_j, t_i)$ represents the event similarity between the i-th first modality information in the first information group and the j-th second modality information in the second information group. exp denotes the exponential function with the natural constant e as the base.
For example, please refer to the content of the embodiment shown in fig. 9 above for the event similarity calculation method, which is not described again in this embodiment.
Illustratively, a second prediction error between the second event characterization and the first event group is calculated.
Illustratively, the first event group includes n first event characterizations corresponding to n first modality information; the second event group includes n second event characterizations corresponding to the n second modality information.
Illustratively, the second prediction error is:

$$L_{v2t} = -\frac{1}{n} \sum_{i=1}^{n} \log \frac{\exp\left(s(v_i, t_i)\right)}{\sum_{j=1}^{n} \exp\left(s(v_i, t_j)\right)}$$

wherein $L_{v2t}$ represents the second prediction error, and $s(v_i, t_i)$ represents the event similarity between a first event representation and the corresponding second event representation; the first modality information to which the first event representation belongs corresponds to the second modality information to which the second event representation belongs. n represents the number of pieces of first modality information in the first information group, or equivalently the number of pieces of second modality information in the second information group. $s(v_i, t_j)$ represents the event similarity between the j-th first modality information in the first information group and the i-th second modality information in the second information group. exp denotes the exponential function with the natural constant e as the base.
For example, please refer to the content of the embodiment shown in fig. 9 above for the event similarity calculation manner, which is not described again in this embodiment.
Step 634: training the event representation prediction model according to the first prediction error and the second prediction error to obtain a trained event representation prediction model;
in one implementation, the event characterization prediction model is trained with an average of the first prediction error and the second prediction error as the prediction error:
$$L = \frac{1}{2}\left(L_{t2v} + L_{v2t}\right)$$

wherein L represents the prediction error, $L_{t2v}$ represents the first prediction error, and $L_{v2t}$ represents the second prediction error.
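A minimal sketch of this symmetric prediction error, computed from an n-by-n matrix S with S[i, j] = s(v_i, t_j) for a batch of n matched video/text pairs (names are illustrative):

```python
import numpy as np

def prediction_error(S):
    """L_t2v normalizes each column (fixed text t_i, all videos v_j);
    L_v2t normalizes each row (fixed video v_i, all texts t_j);
    the prediction error is the average of the two terms."""
    expS = np.exp(S)
    L_t2v = -np.mean(np.log(np.diag(expS) / expS.sum(axis=0)))
    L_v2t = -np.mean(np.log(np.diag(expS) / expS.sum(axis=1)))
    return 0.5 * (L_t2v + L_v2t)

# An uninformative similarity matrix gives log(n); a strongly
# diagonal one drives the error toward 0.
print(prediction_error(np.zeros((2, 2))))   # log(2) ~ 0.693
```

Minimizing this quantity pulls the diagonal (matched-pair) similarities up relative to the off-diagonal (mismatched) ones, which is exactly the positive/negative-sample contrast described above.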
In summary, in the method provided in this embodiment, the event representation prediction model is trained by calculating the first prediction error and the second prediction error, so that the multiple pieces of first information and the multiple pieces of second information in the sample information group are fully utilized, and the constructed positive sample and the negative sample avoid the difference between the first information and the second information caused by different information modalities, thereby improving the retrieval effect between the cross-modality information.
FIG. 13 illustrates a flowchart of a method for training an event characterization prediction model according to an exemplary embodiment of the present application. The method may be performed by a computer device. That is, on the basis of the embodiment shown in fig. 11, the method further includes step 642, step 644 and step 646:
step 642: obtaining a verification sample pair;
the verification sample pair comprises first verification information and second verification information, and the information modalities of the first verification information and the second verification information are different;
step 644: calling the trained event representation prediction model to perform prediction processing on the first verification information to obtain a first verification representation, and calling the trained event representation prediction model to perform prediction processing on the second verification information to obtain a second verification representation;
the first authentication token comprises token information of at least two first authentication events, and the second authentication token comprises token information of at least two second authentication events;
step 646: verifying the trained event characterization prediction model according to a prediction error between the first verification characterization and the second verification characterization, and updating a first quantity of first verification events in the first verification characterization and a second quantity of second verification events in the second verification characterization, which are obtained by prediction;
illustratively, the validation sample pair is used to update a hyper-parameter in the event characterization prediction model; specifically, the hyper-parameter includes at least one of a first number of first verification events and a second number of second verification events. It should be noted that the first number and the second number may be the same or different, and this embodiment does not set any limitation.
In summary, in the method provided in this embodiment, the trained event representation prediction model is verified by obtaining the verification sample pair, the hyper-parameters in the trained event representation prediction model are updated, the first number and the second number are updated, and the prediction effect of the trained event representation prediction model is ensured.
In one specific example, a method of training an event characterization prediction model is presented:
a video information group

$$V = \{v_1, v_2, \ldots, v_N\}$$

and a text information group

$$T = \{t_1, t_2, \ldots, t_N\}$$

are obtained, together with the number of iterations t, the learning rate η, and the batch group size n. The batch group size n equals the number of pieces of video information in one batch group, and likewise the number of pieces of text information in one batch group. Illustratively, the number of iterations t and the learning rate η are generally preset.
The event characterization prediction model is iteratively trained t times; illustratively, for each iteration i = 1, …, t, the n pieces of video information and the n pieces of text information in each batch group are subjected to prediction processing one by one. Illustratively, N represents the total number of pieces of video information used to train the event characterization prediction model.
And respectively coding the n pieces of video information and the n pieces of text information by using a video coder and a text coder to obtain the video characteristic sequences of the n pieces of video information and the text characteristic sequences of the n pieces of text information.
Invoking the video event generator to generate a plurality of video event representations, and invoking the text event generator to generate a plurality of text event representations.
According to the similarity relations between the video event representations and the text event representations, the similarity relations between the video information and the text information are calculated, and the event characterization prediction model is trained by minimizing the prediction error obtained based on these similarity relations, so as to obtain the trained event characterization prediction model.
Illustratively, the performance of the trained event characterization prediction model was verified on the public data set, and the experimental results are shown in Table 1.

Table 1 (retrieval results of the compared models: R@K, MdR and MnR metrics)
Wherein the public data set LSMDC includes 118081 video-text pairs; R@K represents the proportion of queries whose top K returned results contain a result relevant to the query, and a higher value indicates a better retrieval effect. MdR represents the median rank of the query-relevant result in the returned results, and a lower value indicates a better retrieval effect; MnR represents the mean rank of the query-relevant result in the returned results, and a lower value indicates a better retrieval effect. Table 1 shows the prediction results of a collaborative experts network (Collaborative Experts, CE), a multi-modal Transformer video retrieval network (MMT), a multi-domain multi-modal Transformer video retrieval network (MDMMT), a CLIP-based end-to-end video clip retrieval network (An Empirical Study of CLIP for End to End Video Clip Retrieval, CLIP4Clip), and the event characterization prediction model in the present application (CLIP erg).
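The evaluation metrics R@K, MdR and MnR can all be computed from the rank of the relevant result for each query, as in this stdlib-only sketch (names are illustrative):

```python
import statistics

def retrieval_metrics(ranks, k):
    """`ranks` holds, for each query, the 1-based rank of the relevant
    result in the returned list. R@K is the fraction of queries whose
    relevant result appears in the top K (higher is better); MdR and
    MnR are the median and mean rank (lower is better)."""
    r_at_k = sum(r <= k for r in ranks) / len(ranks)
    mdr = statistics.median(ranks)
    mnr = sum(ranks) / len(ranks)
    return r_at_k, mdr, mnr

print(retrieval_metrics([1, 2, 4, 10], k=5))   # (0.75, 3.0, 4.25)
```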
Those skilled in the art will understand that the above embodiments may be implemented independently, or may be freely combined into new embodiments, to implement the information retrieval method and/or the training method of the event characterization prediction model of the present application.
Fig. 14 shows a block diagram of an information retrieval apparatus provided in an exemplary embodiment of the present application. The device comprises:
an obtaining module 810, configured to obtain input information, where an information modality of the input information is a first modality;
the prediction module 820 is configured to invoke an event representation prediction model to perform prediction processing on the input information, so as to obtain an input event representation of the input information, where the input event representation is used to indicate event information in the input information;
a calculating module 830, configured to calculate an event similarity between the input event representation and a retrieval event representation, where the retrieval event representation is an event representation corresponding to retrieval information, and an information modality of the retrieval information is a second modality;
a determining module 840, configured to determine, when the event similarity exceeds a similarity threshold, the retrieval information corresponding to the retrieval event representation as a retrieval result of the input information.
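The four modules above amount to a simple retrieval pipeline: embed the input, score it against the stored retrieval event representations, and keep the results whose similarity exceeds the threshold. A hedged sketch, with cosine similarity and all names assumed for illustration:

```python
# Illustrative pipeline for modules 830/840: compare one input event
# representation against a gallery of retrieval event representations and
# return the indices of hits above the threshold, best match first.
import numpy as np

def retrieve(input_event, gallery_events, threshold=0.5):
    """input_event: (d,); gallery_events: (m, d).
    Returns gallery indices with similarity > threshold, sorted descending."""
    q = input_event / np.linalg.norm(input_event)
    g = gallery_events / np.linalg.norm(gallery_events, axis=1, keepdims=True)
    sims = g @ q                         # cosine similarity to each candidate
    hits = np.flatnonzero(sims > threshold)
    return hits[np.argsort(-sims[hits])] # best match first
```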
In an alternative design of the application, the event characterization prediction model includes a first prediction network corresponding to the first modality;
the prediction module 820 is further configured to:
and under the condition that the information modality of the input information is the first modality, calling the first prediction network to perform prediction processing on the input information to obtain the input event representation of the input information.
In an alternative design of the application, the first prediction network includes a first modality encoder and a first event generator, the input information includes at least two input sub-information;
the prediction module 820 is further configured to:
when the information modality of the input information is the first modality, calling the first modality encoder to encode at least two pieces of input sub information one by one to obtain an input feature sequence of the input information, wherein the input feature sequence comprises at least two input feature representations corresponding to the at least two pieces of input sub information;
and calling the first event generator to perform prediction processing on the input feature sequence to obtain the input event representation of the input information, wherein the input event representation comprises representation information of at least one first modality event of the input information.
In an alternative design of the application, the prediction module 820 is further configured to:
calling the first event generator to carry out prediction processing on the input feature sequence to obtain input weight information of the input information, wherein the input weight information is used for describing the weight of at least two pieces of input sub information in the input information in at least one first modal event;
determining the input event representation of the input information according to the input weight information and the input feature sequence.
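One plausible reading of these two steps is that the event generator emits a weight for each piece of input sub-information per event, and each event representation is the weighted average of the input feature sequence. The softmax normalization and the shapes below are assumptions for illustration, not taken from the text:

```python
# Assumed realization of "determine the input event representation according
# to the input weight information and the input feature sequence": softmax
# the per-event weights over the sub-information, then weight-average the
# feature sequence.
import numpy as np

def events_from_weights(weight_logits, features):
    """weight_logits: (num_events, num_sub); features: (num_sub, d).
    Returns (num_events, d) event representations."""
    w = np.exp(weight_logits - weight_logits.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)  # each event's weights sum to 1
    return w @ features                   # weighted average per event
```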
In an alternative design of the application, the event characterization prediction model further includes a second prediction network corresponding to the second modality;
the prediction module 820 is further configured to:
and under the condition that the information modality of the retrieval information is the second modality, calling the second prediction network to perform prediction processing on the retrieval information to obtain the retrieval event representation of the retrieval information.
In an optional design of the application, the second prediction network includes a second modality encoder and a second event generator, and the search information includes at least two search sub-information;
the prediction module 820 is further configured to:
when the information modality of the retrieval information is the second modality, calling the second modality encoder to encode at least two pieces of retrieval sub information one by one to obtain a retrieval feature sequence of the retrieval information, wherein the retrieval feature sequence comprises at least two retrieval feature representations corresponding to the at least two pieces of retrieval sub information;
and calling the second event generator to perform prediction processing on the retrieval feature sequence to obtain the retrieval event representation of the retrieval information, wherein the retrieval event representation comprises the representation information of at least one second modal event of the retrieval information.
In an alternative design of the application, the prediction module 820 is further configured to: calling the second event generator to perform prediction processing on the retrieval feature sequence to obtain retrieval weight information of the retrieval information, wherein the retrieval weight information is used for describing the weight of at least two pieces of retrieval sub information in the retrieval information in at least one second modal event;
and determining the retrieval event representation of the retrieval information according to the retrieval weight information and the retrieval feature sequence.
In an optional design of the application, the input event representation comprises representation information of at least one first modality event of the input information; the retrieval event representation comprises representation information of at least one second modality event of the retrieval information;
the calculating module 830 is further configured to:
calculating a relevance score between the characterizing information of the first modality event and the characterizing information of the second modality event;
and constructing and obtaining the event similarity according to the relevance score.
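These two steps (pairwise relevance scores between the two sets of events, then aggregation into a single event similarity) can be sketched as below. Max-pooling over each event's best cross-modal match before averaging is one common aggregation choice; it is assumed here, not specified by the text.

```python
# Assumed aggregation: score every (first-modality event, second-modality
# event) pair with cosine similarity, take each event's best match, and
# average symmetrically to construct the event similarity.
import numpy as np

def event_similarity(events_a, events_b):
    """events_a: (ka, d); events_b: (kb, d) event representations."""
    a = events_a / np.linalg.norm(events_a, axis=1, keepdims=True)
    b = events_b / np.linalg.norm(events_b, axis=1, keepdims=True)
    scores = a @ b.T                     # (ka, kb) relevance scores
    return float(0.5 * (scores.max(axis=1).mean()
                        + scores.max(axis=0).mean()))
```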
In an optional design of the application, the first modality is a text modality, and the second modality is a video modality;
the obtaining module 810 is further configured to: acquiring text information;
the prediction module 820 is further configured to:
calling the event representation prediction model to carry out prediction processing on the text information to obtain a text event representation of the text information;
the calculating module 830 is further configured to: calculating the event similarity between the text event representation and the video event representation, wherein the video event representation is an event representation corresponding to video information;
the determining module 840 is further configured to:
and under the condition that the similarity of the event exceeds a similarity threshold, determining the video information corresponding to the video event representation as a retrieval result of the text information.
FIG. 15 is a block diagram illustrating a training apparatus for an event characterization prediction model according to an exemplary embodiment of the present application. The device includes:
an obtaining module 850, configured to obtain a sample information group, where the sample information group includes a first information group and a second information group, the first information group includes n pieces of first modality information, the second information group includes n pieces of second modality information, information modalities of the first modality information and the second modality information are different, the n pieces of first modality information correspond to the n pieces of second modality information one to one, and n is an integer greater than 1;
a prediction module 860, configured to perform prediction processing on the first information group to obtain a first event group of the first information group, and perform prediction processing on the second information group to obtain a second event group of the second information group, where the first event group is used to indicate event information in the first information group, and the second event group is used to indicate event information in the second information group;
a training module 870, configured to train the event characterization prediction model according to a prediction error between the first event group and the second event group, to obtain a trained event characterization prediction model.
In an alternative design of the application, the first event group includes n first event characterizations corresponding to n pieces of the first modality information, and the second event group includes n second event characterizations corresponding to n pieces of the second modality information;
the training module 870 is further configured to:
calculating a first prediction error between the first event characterization and the second event group; and calculating a second prediction error between the second event characterization and the first event group;
and training the event representation prediction model according to the first prediction error and the second prediction error to obtain the trained event representation prediction model.
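Assuming an InfoNCE-style formulation, the first prediction error can be read as scoring each first event characterization against the whole second event group (its paired element should score highest), with the second prediction error computed symmetrically; the model is then trained on their average. The temperature and all function names below are illustrative assumptions:

```python
# Hypothetical form of the first/second prediction errors: the negative
# log-probability that a characterization matches its paired element in
# the opposite group, averaged over both directions.
import numpy as np

def group_prediction_error(query, group, target_index, temperature=0.05):
    """Negative log-probability that `query` matches group[target_index]."""
    q = query / np.linalg.norm(query)
    g = group / np.linalg.norm(group, axis=1, keepdims=True)
    logits = (g @ q) / temperature
    logits = logits - logits.max()               # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[target_index])

def total_error(first_group, second_group, temperature=0.05):
    """Average of the first and second prediction errors over n pairs."""
    n = first_group.shape[0]
    first = [group_prediction_error(first_group[i], second_group, i, temperature)
             for i in range(n)]
    second = [group_prediction_error(second_group[i], first_group, i, temperature)
              for i in range(n)]
    return float((np.mean(first) + np.mean(second)) / 2)
```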
In an optional design of the application, the obtaining module 850 is further configured to obtain a verification sample pair, where the verification sample pair includes first verification information and second verification information, and information modalities of the first verification information and the second verification information are different;
the prediction module 860 is further configured to invoke the trained event representation prediction model to perform prediction processing on the first verification information to obtain a first verification representation, and invoke the trained event representation prediction model to perform prediction processing on the second verification information to obtain a second verification representation, where the first verification representation includes representation information of at least two first verification events, and the second verification representation includes representation information of at least two second verification events;
an updating module 880, configured to verify the trained event representation prediction model according to the prediction error between the first verification representation and the second verification representation, and update the predicted first number of the first verification events in the first verification representation and the predicted second number of the second verification events in the second verification representation.
It should be noted that, when the apparatus provided in the foregoing embodiments implements its functions, the division into the functional modules described above is merely illustrative; in practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to implement all or part of the functions described above.
With regard to the apparatus in the above-described embodiment, the specific manner in which the respective modules perform operations has been described in detail in the embodiment related to the method; the technical effects achieved by the operations performed by the respective modules are the same as those in the embodiments related to the method, and will not be described in detail here.
An embodiment of the present application further provides a computer device, where the computer device includes: a processor and a memory, the memory having stored therein a computer program; the processor is configured to execute the computer program in the memory to implement the information retrieval method provided by the above method embodiments, and/or the training method of the event characterization prediction model.
Optionally, the computer device is a server. Illustratively, fig. 16 is a block diagram of a server according to an exemplary embodiment of the present application.
In general, the server 2300 includes: a processor 2301 and a memory 2302.
The processor 2301 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on. The processor 2301 may be implemented in at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 2301 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in a wake state, also called a Central Processing Unit (CPU), and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 2301 may be integrated with a Graphics Processing Unit (GPU), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 2301 may also include an Artificial Intelligence (AI) processor for processing computing operations related to machine learning.
Memory 2302 may include one or more computer-readable storage media, which may be non-transitory. Memory 2302 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in the memory 2302 is used to store at least one instruction for execution by the processor 2301 to implement the information retrieval methods and/or the training methods of the event characterization prediction models provided by the method embodiments herein.
In some embodiments, the server 2300 may further optionally include: an input interface 2303 and an output interface 2304. The processor 2301, the memory 2302, the input interface 2303 and the output interface 2304 may be connected by a bus or a signal line. Each peripheral device may be connected to the input interface 2303 and the output interface 2304 via a bus, a signal line, or a circuit board. The Input interface 2303 and the Output interface 2304 can be used for connecting at least one peripheral device related to Input/Output (I/O) to the processor 2301 and the memory 2302. In some embodiments, the processor 2301, memory 2302, and input interface 2303 and output interface 2304 are integrated on the same chip or circuit board; in some other embodiments, the processor 2301, the memory 2302, and any one or both of the input interface 2303 and the output interface 2304 can be implemented on separate chips or circuit boards, which are not limited in this application.
Those skilled in the art will appreciate that the above-described illustrated architecture is not meant to be limiting with respect to the server 2300 and may include more or fewer components than those shown, or may combine certain components, or may employ a different arrangement of components.
In an exemplary embodiment, there is also provided a chip comprising programmable logic circuits and/or program instructions for implementing the information retrieval method of the above aspects and/or the training method of the event characterization prediction model when the chip is run on a computer device.
In an exemplary embodiment, a computer program product is also provided that includes computer instructions stored in a computer-readable storage medium. A processor of the computer device reads the computer instructions from the computer-readable storage medium and executes them to implement the information retrieval method and/or the training method of the event characterization prediction model provided by the above method embodiments.
In an exemplary embodiment, a computer-readable storage medium is further provided, in which a computer program is stored, and the computer program is loaded and executed by a processor to implement the information retrieval method provided by the above method embodiments, and/or the training method of the event characterization prediction model.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Those skilled in the art will recognize that, in one or more of the examples described above, the functionality described in the embodiments of the present application may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on, or transmitted over, a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media, the latter including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer.

The above description is intended only to illustrate alternative embodiments of the present application and should not be construed as limiting the present application; any modifications, equivalents, improvements and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (17)

1. An information retrieval method, the method comprising:
acquiring input information, wherein the information modality of the input information is a first modality;
calling an event representation prediction model to perform prediction processing on the input information to obtain an input event representation of the input information, wherein the input event representation is used for indicating event information in the input information;
calculating the event similarity between the input event representation and a retrieval event representation, wherein the retrieval event representation is an event representation corresponding to retrieval information, and the information modality of the retrieval information is a second modality;
and under the condition that the event similarity exceeds a similarity threshold, determining retrieval information corresponding to the retrieval event representation as a retrieval result of the input information.
2. The method of claim 1, wherein the event characterization prediction model comprises a first prediction network corresponding to the first modality;
the calling event representation prediction model carries out prediction processing on the input information to obtain the input event representation of the input information, and the method comprises the following steps:
and under the condition that the information modality of the input information is the first modality, calling the first prediction network to perform prediction processing on the input information to obtain the input event representation of the input information.
3. The method of claim 2, wherein the first prediction network comprises a first modality encoder and a first event generator, and wherein the input information comprises at least two input sub-information;
the calling the first prediction network to perform prediction processing on the input information to obtain the input event representation of the input information under the condition that the information modality of the input information is the first modality includes:
when the information modality of the input information is the first modality, calling the first modality encoder to encode at least two pieces of input sub information one by one to obtain an input feature sequence of the input information, wherein the input feature sequence comprises at least two input feature representations corresponding to the at least two pieces of input sub information;
and calling the first event generator to perform prediction processing on the input feature sequence to obtain the input event representation of the input information, wherein the input event representation comprises representation information of at least one first modal event of the input information.
4. The method of claim 3, wherein invoking the first event generator to perform predictive processing on the sequence of input features to obtain the input event representation of the input information comprises:
calling the first event generator to perform prediction processing on the input feature sequence to obtain input weight information of the input information, wherein the input weight information is used for describing the weight of at least two pieces of input sub information in the input information in at least one first modal event;
determining the input event representation of the input information according to the input weight information and the input feature sequence.
5. The method of claim 2, wherein the event characterization prediction model further comprises a second prediction network corresponding to the second modality;
the method further comprises the following steps:
and under the condition that the information modality of the retrieval information is the second modality, calling the second prediction network to perform prediction processing on the retrieval information to obtain the retrieval event representation of the retrieval information.
6. The method according to claim 5, wherein the second prediction network comprises a second modality encoder and a second event generator, and the retrieval information comprises at least two retrieval sub-information;
the calling the second prediction network to perform prediction processing on the retrieval information to obtain the retrieval event representation of the retrieval information under the condition that the information modality of the retrieval information is the second modality includes:
when the information modality of the retrieval information is the second modality, calling the second modality encoder to encode at least two pieces of retrieval sub information one by one to obtain a retrieval feature sequence of the retrieval information, wherein the retrieval feature sequence comprises at least two retrieval feature representations corresponding to the at least two pieces of retrieval sub information;
and calling the second event generator to perform prediction processing on the retrieval feature sequence to obtain the retrieval event representation of the retrieval information, wherein the retrieval event representation comprises representation information of at least one second modality event of the retrieval information.
7. The method of claim 6, wherein said invoking the second event generator to perform a prediction process on the search feature sequence to obtain the search event representation of the search information comprises:
calling the second event generator to perform prediction processing on the retrieval feature sequence to obtain retrieval weight information of the retrieval information, wherein the retrieval weight information is used for describing the weight of at least two pieces of retrieval sub information in the retrieval information in at least one second modal event;
and determining the retrieval event representation of the retrieval information according to the retrieval weight information and the retrieval feature sequence.
8. The method according to any one of claims 1 to 7, wherein the input event representation comprises characterization information of at least one first modality event of the input information; and the retrieval event representation comprises characterization information of at least one second modality event of the retrieval information;
the calculating of the event similarity between the input event representation and the retrieval event representation comprises:
calculating a relevance score between the characterizing information of the first modality event and the characterizing information of the second modality event;
and constructing and obtaining the event similarity according to the relevance score.
9. The method of claim 1, wherein the first modality is a text modality, and wherein the second modality is a video modality;
the acquiring of the input information includes:
acquiring text information;
the calling event representation prediction model carries out prediction processing on the input information to obtain the input event representation of the input information, and the method comprises the following steps:
calling the event representation prediction model to carry out prediction processing on the text information to obtain a text event representation of the text information;
the calculating of the event similarity between the input event representation and the retrieval event representation comprises:
calculating the event similarity between the text event representation and the video event representation, wherein the video event representation is an event representation corresponding to video information;
determining the retrieval information corresponding to the retrieval event representation as the retrieval result of the input information under the condition that the event similarity exceeds a similarity threshold, including:
and under the condition that the similarity of the event exceeds a similarity threshold, determining the video information corresponding to the video event representation as a retrieval result of the text information.
10. A method for training an event characterization prediction model, the method being used for training the event characterization prediction model in the method according to any one of claims 1 to 9, the method comprising:
acquiring a sample information group, wherein the sample information group comprises a first information group and a second information group, the first information group comprises n pieces of first modal information, the second information group comprises n pieces of second modal information, the information modalities of the first modal information and the second modal information are different, the n pieces of first modal information correspond to the n pieces of second modal information one by one, and n is an integer greater than 1;
predicting the first information group to obtain a first event group of the first information group, and predicting the second information group to obtain a second event group of the second information group, wherein the first event group is used for indicating event information in the first information group, and the second event group is used for indicating event information in the second information group;
and training the event representation prediction model according to the prediction error between the first event group and the second event group to obtain the trained event representation prediction model.
11. The method according to claim 10, wherein the first event group comprises n first event representations corresponding to the n pieces of first modality information, and the second event group comprises n second event representations corresponding to the n pieces of second modality information;
the training the event characterization prediction model according to the prediction error between the first event group and the second event group to obtain a trained event characterization prediction model, including:
calculating a first prediction error between the first event representations and the second event group; and calculating a second prediction error between the second event representations and the first event group;
and training the event representation prediction model according to the first prediction error and the second prediction error to obtain the trained event representation prediction model.
12. The method of claim 10, further comprising:
obtaining a verification sample pair, wherein the verification sample pair comprises first verification information and second verification information, and the information modalities of the first verification information and the second verification information are different;
calling the trained event representation prediction model to perform prediction processing on the first verification information to obtain a first verification representation, and calling the trained event representation prediction model to perform prediction processing on the second verification information to obtain a second verification representation, wherein the first verification representation comprises the representation information of at least two first verification events, and the second verification representation comprises the representation information of at least two second verification events;
and verifying the trained event characterization prediction model according to a prediction error between the first verification characterization and the second verification characterization, and updating the predicted first number of the first verification events in the first verification characterization and the predicted second number of the second verification events in the second verification characterization.
13. An information retrieval apparatus, characterized in that the apparatus comprises:
the device comprises an acquisition module, a display module and a display module, wherein the acquisition module is used for acquiring input information, and the information modality of the input information is a first modality;
the prediction module is used for calling an event representation prediction model to perform prediction processing on the input information to obtain an input event representation of the input information, and the input event representation is used for indicating event information in the input information;
the computing module is used for computing the event similarity between the input event representation and the retrieval event representation, the retrieval event representation is an event representation corresponding to retrieval information, and the information modality of the retrieval information is a second modality;
and the determining module is used for determining the retrieval information corresponding to the retrieval event representation as the retrieval result of the input information under the condition that the event similarity exceeds a similarity threshold.
14. An apparatus for training an event characterization prediction model, the apparatus comprising:
an obtaining module, configured to obtain a sample information group, where the sample information group includes a first information group and a second information group, the first information group includes n pieces of first modality information, the second information group includes n pieces of second modality information, information modalities of the first modality information and the second modality information are different, the n pieces of first modality information correspond to the n pieces of second modality information one to one, and n is an integer greater than 1;
the prediction module is used for performing prediction processing on the first information group to obtain a first event group of the first information group and performing prediction processing on the second information group to obtain a second event group of the second information group, wherein the first event group is used for indicating event information in the first information group, and the second event group is used for indicating event information in the second information group;
and a training module, configured to train the event representation prediction model according to a prediction error between the first event group and the second event group, to obtain a trained event representation prediction model.
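The claim does not pin down how the "prediction error" between the two event groups is measured. For n one-to-one cross-modal pairs, a symmetric InfoNCE-style contrastive loss is one common realization; the sketch below is that assumption, not the patent's specified method, and all names are hypothetical.

```python
import numpy as np

def contrastive_group_loss(first_events: np.ndarray,
                           second_events: np.ndarray,
                           temperature: float = 0.07) -> float:
    """One possible 'prediction error' between the two event groups:
    an InfoNCE-style loss in which the i-th first-modality event should
    match the i-th second-modality event and mismatch the other n-1."""
    # L2-normalize each event representation (rows of shape (n, d)).
    a = first_events / np.linalg.norm(first_events, axis=1, keepdims=True)
    b = second_events / np.linalg.norm(second_events, axis=1, keepdims=True)
    logits = a @ b.T / temperature          # (n, n) similarity matrix
    labels = np.arange(len(a))              # matching pairs lie on the diagonal

    def xent(l: np.ndarray) -> float:
        # Numerically stable cross-entropy against the diagonal targets.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Symmetric: first->second and second->first directions.
    return float((xent(logits) + xent(logits.T)) / 2)
```

Under this loss, correctly matched groups score near zero while permuted (mismatched) groups score high, which is the behavior the training module needs from any prediction error it minimizes.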
15. A computer device, comprising a processor and a memory, the memory storing at least one program, and the processor being configured to execute the at least one program in the memory to implement the information retrieval method according to any one of claims 1 to 9 and/or the method for training an event representation prediction model according to any one of claims 10 to 12.
16. A computer-readable storage medium, storing executable instructions that are loaded and executed by a processor to implement the information retrieval method according to any one of claims 1 to 9 and/or the method for training an event representation prediction model according to any one of claims 10 to 12.
17. A computer program product, comprising computer instructions stored on a computer-readable storage medium, the computer instructions being read and executed by a processor to implement the information retrieval method according to any one of claims 1 to 9 and/or the method for training an event representation prediction model according to any one of claims 10 to 12.
CN202210826933.0A 2022-07-13 2022-07-13 Information retrieval method, model training method, device, equipment and storage medium Pending CN115203476A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210826933.0A CN115203476A (en) 2022-07-13 2022-07-13 Information retrieval method, model training method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210826933.0A CN115203476A (en) 2022-07-13 2022-07-13 Information retrieval method, model training method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115203476A (en) 2022-10-18

Family

ID=83580127

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210826933.0A Pending CN115203476A (en) 2022-07-13 2022-07-13 Information retrieval method, model training method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115203476A (en)

Similar Documents

Publication Publication Date Title
WO2022022152A1 (en) Video clip positioning method and apparatus, and computer device and storage medium
CN112084331A (en) Text processing method, text processing device, model training method, model training device, computer equipment and storage medium
CN112863683B (en) Medical record quality control method and device based on artificial intelligence, computer equipment and storage medium
CN110234018B (en) Multimedia content description generation method, training method, device, equipment and medium
CN112418292B (en) Image quality evaluation method, device, computer equipment and storage medium
CN112214775B (en) Injection attack method, device, medium and electronic equipment for preventing third party from acquiring key diagram data information and diagram data
CN110781413A (en) Interest point determining method and device, storage medium and electronic equipment
CN114358203A (en) Training method and device for image description sentence generation module and electronic equipment
CN114282059A (en) Video retrieval method, device, equipment and storage medium
CN112182281B (en) Audio recommendation method, device and storage medium
CN113240071B (en) Method and device for processing graph neural network, computer equipment and storage medium
CN116975347A (en) Image generation model training method and related device
CN112861474B (en) Information labeling method, device, equipment and computer readable storage medium
CN114155388B (en) Image recognition method and device, computer equipment and storage medium
CN115203476A (en) Information retrieval method, model training method, device, equipment and storage medium
CN111581455B (en) Text generation model generation method and device and electronic equipment
CN115952317A (en) Video processing method, device, equipment, medium and program product
CN112417260B (en) Localized recommendation method, device and storage medium
CN114510592A (en) Image classification method and device, electronic equipment and storage medium
CN114490996B (en) Intention recognition method and device, computer equipment and storage medium
CN116911304B (en) Text recommendation method and device
CN117216534A (en) Model training method, device, equipment, storage medium and product
CN117725545A (en) Model training method, text image matching method and related products
CN116825187A (en) lncRNA-protein interaction prediction method and related equipment thereof
CN116975211A (en) Information generation method, apparatus, computer device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40075324
Country of ref document: HK