CN113761933A - Retrieval method, retrieval device, electronic equipment and readable storage medium - Google Patents

Retrieval method, retrieval device, electronic equipment and readable storage medium

Info

Publication number
CN113761933A
CN113761933A
Authority
CN
China
Prior art keywords
target object
semantic
retrieval
features
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110542204.8A
Other languages
Chinese (zh)
Inventor
曾雅文
王艺如
廖东亮
黎功福
黄炜杰
姚日恒
徐进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110542204.8A
Publication of CN113761933A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G06F40/30 Semantic analysis
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a retrieval method, a retrieval device, an electronic device and a readable storage medium, and belongs to the field of information technology. The retrieval method comprises the following steps: obtaining a retrieval object and extracting initial semantic features of the retrieval object; generating a semantic feature set of the retrieval object according to the initial semantic features of the retrieval object; generating a semantic feature set corresponding to each target object according to the initial semantic features of each of N target objects, wherein N is an integer greater than 1; determining the matching degree between each target object and the retrieval object according to the semantic feature set of each target object and the semantic feature set of the retrieval object; and determining the target objects serving as the retrieval result from the N target objects according to the matching degree between each target object and the retrieval object. Because the matching degree, and hence the retrieval result, is determined based on two diverse semantic feature sets, the retrieval result is more diverse.

Description

Retrieval method, retrieval device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of information technology, and in particular, to a retrieval method, an apparatus, an electronic device, and a readable storage medium.
Background
Retrieval, also called search, generally refers to the process of finding, by some technical means, target objects related to a retrieval object submitted by a user.
At present, the retrieval results obtained by the related art can achieve relatively high accuracy, but their diversity is low. For example, when the retrieval object input by the user is the query text "pet", the user's actual intention may be to search for images of pet supplies or images of a pet market; however, after the retrieval task is performed using the related art, usually only images of dogs or cats are returned to the user, so the diversity of the retrieval results is low.
Disclosure of Invention
In view of the foregoing problems, embodiments of the present application provide a retrieval method, an apparatus, an electronic device, and a readable storage medium, which aim to improve the diversity of retrieval results.
In a first aspect, an embodiment of the present application provides a retrieval method, where the retrieval method includes: obtaining a retrieval object, and extracting initial semantic features of the retrieval object; generating a semantic feature set of the retrieval object according to the initial semantic features of the retrieval object; generating a semantic feature set corresponding to each target object according to the initial semantic features of each target object in N target objects, wherein N is an integer greater than 1; determining the matching degree of each target object and the retrieval object according to the semantic feature set of each target object and the semantic feature set of the retrieval object; and determining a target object serving as a retrieval result from the N target objects according to the matching degree of each target object and the retrieval object.
In a second aspect, an embodiment of the present application provides a retrieval apparatus. The apparatus includes an initial semantic feature extraction module, a semantic feature set generation module, a matching degree determination module, and a retrieval result determination module. The initial semantic feature extraction module is configured to obtain a retrieval object and extract initial semantic features of the retrieval object. The semantic feature set generation module is configured to generate a semantic feature set of the retrieval object according to the initial semantic features of the retrieval object, and is further configured to generate a semantic feature set corresponding to each target object according to the initial semantic features of each target object in N target objects, where N is an integer greater than 1. The matching degree determination module is configured to determine the matching degree between each target object and the retrieval object according to the semantic feature set of each target object and the semantic feature set of the retrieval object. The retrieval result determination module is configured to determine the target object serving as the retrieval result from the N target objects according to the matching degree between each target object and the retrieval object.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor and a memory; one or more programs are stored in the memory and configured to be executed by the processor to implement the methods described above.
In a fourth aspect, the present application provides a computer-readable storage medium, in which a program code is stored, wherein the program code performs the above-mentioned method when executed by a processor.
In a fifth aspect, embodiments of the present application provide a computer program product or a computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the above-described method.
The technical scheme provided by the embodiment of the application can bring the following beneficial effects:
A semantic feature set of the retrieval object is generated according to the initial semantic features of the retrieval object, and a semantic feature set of each target object is generated according to the initial semantic features of that target object, so that the retrieval object and the target object are no longer represented by a single initial semantic feature each, but by diverse semantic features. The matching degree between a target object and the retrieval object is then determined according to the semantic feature set of the target object and the semantic feature set of the retrieval object, and the retrieval result is determined according to the matching degrees between the plurality of target objects and the retrieval object. In this way, for the retrieval object and a target object, the matching degree is determined not from two single initial semantic features but from two diverse semantic feature sets, and the retrieval result is then determined from the matching degrees, which gives the retrieval result more possibilities and improves its diversity.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic illustration of an implementation environment provided by an embodiment of the present application;
FIG. 2 is a schematic illustration of an implementation environment provided by another embodiment of the present application;
FIG. 3 is a flowchart of a retrieval method according to an embodiment of the present application;
FIG. 4 is a schematic illustration of a plurality of first intermediate features set forth in an embodiment of the present application;
FIG. 5 is a diagram illustrating the determination of a degree of match using a model according to an embodiment of the present application;
FIG. 6 is a flow chart of training a semantic generation model according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a training semantic generation model according to an embodiment of the present application;
FIG. 8 is a flow chart of determining a distance penalty loss component according to an embodiment of the present application;
FIG. 9 is a flow chart of a retrieval method according to another embodiment of the present application;
FIG. 10 is a flow chart of a retrieval method according to another embodiment of the present application;
FIG. 11 is a schematic diagram of a training semantic generation model according to another embodiment of the present application;
FIG. 12 is a diagram illustrating a retrieval task performed based on the semantic generation model of FIG. 9 according to an embodiment of the present application;
FIG. 13 is a schematic diagram of a retrieval apparatus 1100 according to an embodiment of the present application;
FIG. 14 is a schematic diagram of a retrieval apparatus 1200 according to another embodiment of the present application;
FIG. 15 shows a block diagram of an electronic device for executing the retrieval method according to the embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning and decision making.
Artificial intelligence is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine Learning (ML) is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how a computer can simulate or realize human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures, so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formal education learning.
With the research and progress of the artificial intelligence technology, the artificial intelligence technology is researched and applied in multiple fields, and the technical scheme provided by the embodiment of the application relates to the application of machine learning in the field of data retrieval, in particular to a retrieval method.
In the related art, the retrieval results can usually achieve relatively high accuracy, but their diversity is low. For example, when the query text input by the user is "pet", the user's actual intention may be to search for images of pet supplies or images of a pet market; however, after the retrieval task is performed using the related art, usually only images of dogs or cats are returned to the user, so the diversity of the retrieval results is low.
In view of this, embodiments of the present application provide a retrieval method, a retrieval apparatus, an electronic device, and a readable storage medium. Specifically, a pre-trained semantic generation model is used to automatically generate diverse semantic features of the retrieval object based on its initial semantic features, and to automatically generate diverse semantic features of each target object based on that target object's initial semantic features. The matching degree between the retrieval object and each target object is then determined according to the diverse semantic features of the retrieval object and of each target object, and the retrieval result is determined according to the matching degrees. The retrieval result determined by the retrieval method provided by the present application is therefore more diverse.
The following describes an implementation environment related to a search method provided by an embodiment of the present application. Referring to fig. 1, fig. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application. As shown in FIG. 1, the implementation environment may include a server 110, a target object library 120, and a terminal 130.
The server 110 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), or big data and artificial intelligence platforms, or a dedicated or platform server providing Internet of Vehicles services, road network cooperation, vehicle-road cooperation, intelligent transportation, autonomous driving, industrial internet services, and data communication (such as 4G and 5G).
The target object library 120 may be configured on the server 110, or may be configured on another device, which is not limited in this application. In fig. 1, the target object library 120 is schematically configured on the server 110. The target object may be text, an image, a video or news; the data modality of the target object is not limited by this application.
The terminal 130 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like. Optionally, a client, such as a browser client, an instant messaging client, a content interaction client, a short video client, or a shopping client, is run on the terminal 130. The terminal 130 and the server 110 may be directly or indirectly connected through wired or wireless communication, which is not limited in this application.
In a specific implementation, the server 110 obtains the search object sent by the terminal 130, and the server 110 selects a plurality of target objects matching the search object from the target object library 120 as the search result by executing the search method provided by the present application. The retrieval object is used as a retrieval basis, and the target object is used as an object to be selected. For example, when the user wishes to retrieve an image related to a pet, the user may input the query text "pet" to the terminal 130, the terminal 130 transmits the query text "pet" to the server 110, and the server 110 selects a number of images matching the query text "pet" from the image library as the retrieval result by performing the retrieval method provided by the present application. The query text "pet" is used as a retrieval object, and the image is used as a target object.
Referring to fig. 2, fig. 2 is a schematic diagram of an implementation environment according to another embodiment of the present application. As shown in FIG. 2, the implementation environment may include a retrieval device 210 and a target object library 220.
The retrieval device 210 may be a computer device, which refers to an electronic device with data computing, processing and storage capabilities. The computer device may be a terminal device such as a PC (Personal Computer), a tablet computer, a smartphone, a smart speaker, a wearable device, a smart robot, or a vehicle-mounted terminal. The retrieval device 210 has an information input device, which may be, for example, a mouse, a keyboard, a touch screen, a touch panel, a microphone, or a camera. Taking a touch screen as an example, a user may input information to the retrieval device 210 by touching or handwriting on the touch screen. Taking a microphone as an example, the user may input information to the retrieval device 210 by speaking into the microphone. Taking a camera as an example, the user may make a predetermined gesture in front of the turned-on camera to input information to the retrieval device 210.
The target object library 220 may be configured on the retrieval device 210, or may be configured on other devices, which is not limited in this application. In fig. 2, the target object library 220 is schematically configured on the retrieval device 210. The target object may be: text, image, video or news, the data modality of the target object is not limited by the application.
In a specific implementation, the user inputs information to the retrieval device 210 through the information input device, and the retrieval device 210 obtains a retrieval object from the information input by the user. The retrieval device 210 selects a plurality of target objects matching the retrieval object from the target object library 220 as the retrieval result by executing the retrieval method provided by the present application. The retrieval object serves as the retrieval basis, and the target objects serve as candidate objects. For example, when the user wishes to retrieve images related to pets, the user may handwrite the query text "pet" on the touch screen of the retrieval device 210. The retrieval device 210 selects a plurality of images matching the query text "pet" from the image library as the retrieval result by performing the retrieval method provided by the present application, and displays the retrieval result to the user through the touch screen. The query text "pet" serves as the retrieval object, and the images serve as the target objects.
It should be noted that the two implementation environments described above in connection with fig. 1 and 2 are only two examples of the various implementation environments of the present application. For simplicity of description, the device that executes the retrieval method of the present application (e.g., the aforementioned server 110 or retrieval device 210) is referred to below simply as the execution subject.
In some implementations, the retrieval object and the target object may belong to the same data modality. For example, the retrieval object and the target object are both text data, or both image data, or both video data, or both news data. For simplicity of illustration, the possible data modalities of the retrieval object and the target object are not exhaustively listed here. Taking the case where the retrieval object and the target objects are both text data as an example, the execution subject obtains the query text (i.e., the retrieval object) and searches for relevant target texts among a plurality of target texts (i.e., the target objects) by executing the retrieval method in the embodiment of the present application.
In other embodiments, the retrieval object and the target object may belong to different data modalities. For example, the retrieval object is text data and the target object is image data; or the retrieval object is text data and the target object is video data; or the retrieval object is text data and the target object is news data; or the retrieval object is image data and the target object is text data; or the retrieval object is image data and the target object is news data. For simplicity of illustration, the possible data modalities of the retrieval object and the target object are not exhaustively listed here. Taking the case where the retrieval object is text data and the target objects are image data as an example, the execution subject obtains a query text (i.e., the retrieval object) and searches for relevant target images among a plurality of target images (i.e., the target objects) by executing the retrieval method in the embodiment of the present application.
It should be noted that in some application scenarios, the execution subject may perform the retrieval task in stages. For example, in the first retrieval stage of the retrieval task, the execution subject selects 100,000 target objects from 10,000,000 target objects as the retrieval result of the first retrieval stage. In the second retrieval stage, the execution subject selects 10,000 target objects from those 100,000 target objects as the retrieval result of the second retrieval stage. In the third retrieval stage, the execution subject selects 1,000 target objects from those 10,000 target objects as the final retrieval result of the retrieval task. The retrieval method in the embodiment of the present application can be applied to each retrieval stage of the retrieval task.
For example, the retrieval method in the embodiment of the present application may be applied to a later retrieval stage of a retrieval task, where a later retrieval stage refers to a retrieval stage after the first retrieval stage, such as the last retrieval stage. Since the execution subject has already filtered out target objects irrelevant to the retrieval object from a large number of target objects in the earlier retrieval stages, by running the retrieval method in the embodiment of the present application in a later retrieval stage, the execution subject can select, from the remaining relevant target objects, a plurality of target objects that are both relevant and diverse as the retrieval result.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
Referring to fig. 3, fig. 3 is a flowchart of a retrieval method according to an embodiment of the present application, and as shown in fig. 3, the retrieval method includes:
s310: and obtaining a retrieval object, and extracting the initial semantic features of the retrieval object.
The retrieval object serves as the basis of the retrieval task: target objects related to the retrieval object are found according to the retrieval object. For example, in a retrieval task where a user desires to search for images related to pets and inputs the query text "pet" into a search box, the query text "pet" is the retrieval object of that retrieval task.
In some implementations, the retrieval object may be proposed by the user. In these embodiments, the execution subject obtains the search object directly input to the execution subject by the user, and specifically, reference may be made to the implementation environment shown in fig. 2. For example, when the user wants to retrieve content related to a pet, the user may input a query text "pet" to the execution main body, and the execution main body obtains the query text, which is a retrieval target.
Alternatively, the execution subject obtains, through a network, a retrieval object input by the user at the other end of the network, for which reference may be made to the implementation environment shown in fig. 1. For example, when a user wants to retrieve content related to pets, the user may input the query text "pet" at their terminal, the terminal transmits the query text "pet" to the execution subject via the network, and the execution subject obtains the query text, which is the retrieval object.
In a specific implementation, when a user performs a search operation through a terminal, the terminal generates a search log in an HTML format, and the search request sent by the terminal to the server (i.e., the execution subject) carries the search log. The search log may include information such as a query timestamp, a user name, a mobile phone model, a query location, or a search mode, in addition to the query text input by the user. Accordingly, the server reads the query text out of the HTML-format search log and takes the query text as the retrieval object.
In addition, since the read query text may also contain special symbols input by the user by mistake, such as the characters "+", "-", or "\\", in order to clean the query text, the server may remove, according to a predefined special character table, the special characters included in that table from the query text, and use the query text with the special characters removed as the retrieval object.
In some embodiments, the search object may also be generated by the execution subject. For example, when a user needs to search a certain segment of text, the execution body may extract a keyword from the segment of text, and use the extracted keyword as a search target. The method of obtaining the search target is not limited in the present application.
In the present application, the initial semantic features and the semantic feature sets described below are two relative concepts, and the number of features in the initial semantic features is lower than that in a semantic feature set. For example, the initial semantic feature may be a single feature characterized by one vector, while the semantic feature set includes a plurality of features respectively characterized by a plurality of vectors. In practical application, the execution subject extracts a feature from the retrieval object, which can be regarded as the initial semantic feature of the retrieval object, and then generates a plurality of features from it, which can be regarded as the semantic feature set of the retrieval object.
As described above, the retrieval target may be text data, image data, video data, or the like.
In an application scene that the retrieval object is character data, the feature of the retrieval object can be extracted by adopting a Doc2Vec algorithm, and the extracted feature is used as the initial semantic feature of the retrieval object. In addition, Word2vec algorithm can be adopted to extract the characteristics of the retrieval object, and the extracted characteristics can be used as the initial semantic characteristics of the retrieval object. Taking the Doc2Vec algorithm as an example, the Doc2Vec algorithm is an algorithm for converting text (such as sentences or paragraphs) into vectors, and the vectors converted by the Doc2Vec algorithm are generally called word vectors, sentence vectors or paragraph vectors, and the vectors are used for characterizing syntactic structural features of the text. After the texts with different lengths are input into the Doc2Vec algorithm, the Doc2Vec algorithm outputs the vectors with the specified lengths. In other words, the Doc2Vec algorithm can extract fixed-length features from text. In specific implementation, a retrieval object (namely, text data) can be input into a pre-trained Doc2Vec algorithm, so that a vector output by the Doc2Vec algorithm is obtained, and the vector is used as an initial semantic feature of the retrieval object. It should be noted that, the manner of extracting the initial semantic features is not limited in the present application.
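The following sketch is illustrative only and is not part of this application: it shows how a fixed-length initial semantic feature could be inferred for a query text with gensim's Doc2Vec. The toy corpus, vector size and training settings are assumptions.

```python
# Illustrative sketch (not the application's implementation): extracting a
# fixed-length initial semantic feature from a query text with gensim's Doc2Vec.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus of tokenized documents used to pre-train the Doc2Vec model (assumption).
corpus = [
    TaggedDocument(words=["pet", "supplies", "store"], tags=[0]),
    TaggedDocument(words=["dog", "and", "cat", "images"], tags=[1]),
]
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=20)

# Infer a fixed-length vector (the "initial semantic feature") for the query text.
query_tokens = ["pet"]
initial_semantic_feature = model.infer_vector(query_tokens)  # numpy array of length 50
print(initial_semantic_feature.shape)  # (50,)
```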
In an application scenario in which the retrieval object is image data, a ResNet (Residual Neural Network) may be used to extract features of the retrieval object, and the extracted features may be used as the initial semantic features of the retrieval object. Alternatively, a LeNet network, a VGG network, or a DenseNet (Densely Connected Convolutional Network) may be used to extract the features of the retrieval object, and the extracted features may be used as the initial semantic features of the retrieval object. It should be noted that the manner of extracting the initial semantic features is not limited in the present application.
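Likewise, the sketch below is only an illustration of using a torchvision ResNet backbone (with its final classification layer removed) to obtain a fixed-length image feature; the choice of ResNet-50, the random input and the input size are assumptions.

```python
# Illustrative sketch: extracting an initial semantic feature from an image with a
# ResNet backbone, by dropping the final classification layer and using the pooled
# activations as the feature vector.
import torch
import torchvision

backbone = torchvision.models.resnet50()  # pretrained weights could be loaded instead (assumption)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-1])  # remove the fc layer
feature_extractor.eval()

image = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image
with torch.no_grad():
    feature = feature_extractor(image).flatten(1)  # (1, 2048) initial semantic feature
print(feature.shape)
```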
In an application scenario in which the retrieval object is video data, the C3D network or the I3D network may be used to extract features of the retrieval object, and the extracted features may be used as initial semantic features of the retrieval object. It should be noted that, the manner of extracting the initial semantic features is not limited in the present application.
S320: and generating a semantic feature set of the retrieval object according to the initial semantic features of the retrieval object.
As mentioned above, the semantic feature set and the initial semantic features are two relative concepts, and the number of features in the semantic feature set is greater than that of the initial semantic features. Specifically, a plurality of features generated from the initial semantic features may be regarded as a semantic feature set.
In the application, the number of the features of the semantic feature set is greater than that of the initial semantic features, and the semantic feature set has higher diversity than the initial semantic features. For ease of understanding, for example, assuming that the retrieval object is the query text "rape flower in mountain valley", the initial semantic feature extracted by performing S310 is a feature, and the semantic feature represented by the feature may be "rape flower in mountain", and the semantic feature set generated by performing S320 includes a plurality of features, and the semantic features represented by the plurality of features may be "mountain valley", "rape flower" and "mountain flower", respectively.
In some embodiments, the semantic feature set of the search object may be automatically generated by a pre-trained semantic generation model. In practical application, the initial semantic features of the retrieval object can be input into a semantic generation model, and the semantic generation model can generate a semantic feature set for the retrieval object according to the initial semantic features of the retrieval object.
The semantic generation model may include a first algorithm model and a third algorithm model. The first algorithm model is used for generating a plurality of first intermediate features of the retrieval object according to the initial semantic features of the retrieval object. The first intermediate features are a concept relative to the initial semantic features: in practical application, the execution subject inputs the initial semantic features of the retrieval object into the first algorithm model, which generates a plurality of features, and the plurality of features generated by the first algorithm model can be regarded as the plurality of first intermediate features.
The plurality of first intermediate features has a higher diversity than the initial semantic features. For ease of understanding, following the above example, the initial semantic features of the search object characterize "flowers of rape in mountains", and the first plurality of intermediate features generated by the first algorithmic model may characterize "mountains", "rape" and "flowers", respectively. It can be seen that the plurality of first intermediate features has a higher diversity than the initial semantic features.
Alternatively, the first algorithm model may be a Variational Auto-Encoder (VAE), a Multi-head Attention mechanism (Multi-head Attention), or a topic model. The variational autoencoder, the multi-head attention mechanism, and the topic model all have the ability to generate multiple features from one feature. It should be noted that the first algorithm model may also be chosen from other models with a one-to-many generation capability (i.e., the capability of generating a plurality of features from one feature), such as an autoencoder.
Taking the case where the first algorithm model is a variational autoencoder as an example, the variational autoencoder is a generative model comprising an encoder and a decoder. The encoder is used for encoding the distribution of a latent variable z according to input data. The decoder is used for generating data similar to the input data based on samples of the latent variable z. In a specific implementation, the initial semantic features are input into the encoder of the variational autoencoder, the encoder encodes the initial semantic features into the distribution of the latent variable z, and a plurality of latent variables z are then sampled from that distribution to serve as the plurality of first intermediate features.
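As an illustration of this idea, the sketch below shows a simple VAE-style encoder that maps one initial semantic feature to a latent distribution and samples several latent vectors as first intermediate features; the layer sizes, the number of samples and all names are assumptions, not taken from the application.

```python
# Illustrative sketch: a VAE-style encoder maps one initial semantic feature to a
# distribution over the latent variable z, then samples several z's to serve as the
# plurality of first intermediate features.
import torch
import torch.nn as nn

class VAEEncoder(nn.Module):
    def __init__(self, feature_dim=50, latent_dim=50):
        super().__init__()
        self.hidden = nn.Linear(feature_dim, 128)
        self.mu = nn.Linear(128, latent_dim)       # mean of the latent distribution
        self.logvar = nn.Linear(128, latent_dim)   # log-variance of the latent distribution

    def forward(self, x, num_samples=6):
        h = torch.relu(self.hidden(x))
        mu, logvar = self.mu(h), self.logvar(h)
        std = torch.exp(0.5 * logvar)
        # Sample several latent variables z; each sample acts as one intermediate feature.
        samples = [mu + std * torch.randn_like(std) for _ in range(num_samples)]
        return torch.stack(samples, dim=1)  # (batch, num_samples, latent_dim)

encoder = VAEEncoder()
initial_feature = torch.randn(1, 50)                     # initial semantic feature of the retrieval object
first_intermediate_features = encoder(initial_feature)   # (1, 6, 50)
```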
Taking the case where the first algorithm model is a multi-head attention mechanism as an example, after the initial semantic features are input into the multi-head attention mechanism, it performs multiple rounds of computation based on the initial semantic features, thereby generating a plurality of first intermediate features.
In the above, the first algorithm model generates a plurality of first intermediate features of the retrieval object according to the initial semantic features of the retrieval object, and although the plurality of first intermediate features have higher diversity than the initial semantic features, in order to further make the plurality of first intermediate features have stronger relevance, the plurality of first intermediate features may be optimized by using the third algorithm model, so that the plurality of first intermediate features are respectively converted into corresponding first semantic features, and the plurality of first semantic features are taken as the semantic feature set of the retrieval object. For ease of understanding, following the above example, a plurality of first semantic features in the semantic feature set may characterize "mountain," rape flower "and" mountain flower, "respectively. As can be seen, the plurality of first semantic features in the semantic feature set have stronger relevance and are closer to the actual intention than the plurality of intermediate features under the same diversity condition.
Wherein the process of optimizing the plurality of first intermediate features using the third algorithm model comprises: the third algorithm model determines a similar feature for each first intermediate feature from the plurality of first intermediate features and converts each first intermediate feature into a corresponding first semantic feature based on the similar features of each first intermediate feature and each first intermediate feature. The plurality of first intermediate features are converted into a plurality of first semantic features, and the plurality of first semantic features are semantic feature sets of the retrieval object.
In the above manner, since the first semantic features are obtained based on the first intermediate features and the similar features of the first intermediate features, the first semantic features can reflect the relevance between the first intermediate features, and the first semantic features are closer to the actual retrieval intention.
Optionally, when determining similar features for each first intermediate feature, the similarity between that first intermediate feature and each of the remaining first intermediate features may be calculated, and if a similarity satisfies a preset condition, the two first intermediate features concerned are determined to be similar features of each other.
The similarity of the two first intermediate features may specifically be cosine similarity, and when the similarity of the two first intermediate features reaches a preset threshold value of 0.6, it is determined that the two first intermediate features are mutually similar features. Alternatively, the similarity between two first intermediate features can be measured by Euclidean Distance (Euclidean Distance), Manhattan Distance (Manhattan Distance), Chebyshev Distance (Chebyshev Distance), or the like. Taking the euclidean distance as an example, when the euclidean distance between the two first intermediate features is lower than a preset threshold, it is determined that the two first intermediate features are mutually similar features.
For ease of understanding, refer to fig. 4, which is a schematic illustration of a plurality of first intermediate features according to an embodiment of the present application. Fig. 4 includes 6 first intermediate features a, b, c, d, e and f. When determining similar features for a, the similarity between a and b is first calculated, and if the similarity satisfies the preset condition, b is determined to be a similar feature of a. As shown in fig. 4, a and b are connected by a line, indicating that b is a similar feature of a (or a is a similar feature of b). The similarity between a and c is then calculated, and if the similarity does not satisfy the preset condition, c is determined not to be a similar feature of a. As shown in fig. 4, there is no connecting line between a and c, indicating that c is not a similar feature of a (or a is not a similar feature of c). By analogy, all similar features of a are determined; as shown in fig. 4, the similar features of a include b, d, and f. Similarly, similar features are determined for b, c, d, e and f, respectively, in the manner described above.
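The following sketch illustrates one way such a similar-feature determination could be implemented with pairwise cosine similarity and the 0.6 threshold from the example above; the function and variable names are assumptions, not the application's implementation.

```python
# Illustrative sketch: determining similar features among intermediate features by
# pairwise cosine similarity with a preset threshold.
import torch
import torch.nn.functional as F

def similar_feature_indices(features, threshold=0.6):
    """features: (K, D) intermediate features. Returns, for each feature, the indices
    of the other features whose cosine similarity meets the threshold."""
    normalized = F.normalize(features, dim=1)
    sim = normalized @ normalized.t()        # (K, K) cosine similarities
    sim.fill_diagonal_(-1.0)                 # a feature is not its own similar feature
    return [(sim[i] >= threshold).nonzero(as_tuple=True)[0] for i in range(features.size(0))]

intermediate = torch.randn(6, 50)            # e.g. features a..f as in FIG. 4
neighbors = similar_feature_indices(intermediate)
```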
Optionally, when converting each first intermediate feature into a corresponding first semantic feature according to that first intermediate feature and its similar features, specifically, for each first intermediate feature: first, the most similar feature is determined from the similar features of that first intermediate feature; then, the determined most similar feature is spliced with the first intermediate feature; and finally, a pre-trained weight matrix and a preset activation function are applied to the splicing result to obtain the corresponding first semantic feature.
In a specific implementation, the aggregated feature (i.e., the most similar feature) can be represented by the following formula:

$$\hat{z}_i = \mathrm{MAX\_AGGREGATE}\left(z_i^{(1)}, z_i^{(2)}, \ldots, z_i^{(N)}\right)$$

where $z_i^{(j)}$ denotes the j-th similar feature of the i-th first intermediate feature, N denotes the total number of similar features possessed by that first intermediate feature, MAX_AGGREGATE() denotes a function that determines the most similar feature from the N similar features, and $\hat{z}_i$ denotes the most similar feature of the i-th first intermediate feature. For example, when determining the most similar feature from the N similar features, the cosine similarity of each of the N similar features to the first intermediate feature may first be calculated, and the similar feature whose cosine similarity is closest to 1 is then determined as the most similar feature.

In a specific implementation, the first semantic feature may be represented by the following formula:

$$e_i = \sigma\left(\mathrm{CONCAT}\left(\hat{z}_i, z_i\right) \cdot W_t\right)$$

where $\hat{z}_i$ denotes the most similar feature of the i-th first intermediate feature, $z_i$ denotes the i-th first intermediate feature, CONCAT() denotes a feature splicing function that concatenates along the feature length, $W_t$ denotes a pre-trained weight matrix including multiple weight values, $\sigma$ denotes an activation function, which may optionally be a sigmoid function, and $e_i$ on the left side of the equation denotes the first semantic feature into which the i-th first intermediate feature is converted.

For ease of understanding, suppose $z_i$ is a 1 x 50 feature, in other words, a matrix with a width equal to 1 and a length equal to 50, and $\hat{z}_i$ is also a 1 x 50 feature. Then $\mathrm{CONCAT}(\hat{z}_i, z_i)$ is a 1 x 100 feature, in other words, a matrix with a width equal to 1 and a length equal to 100. $W_t$ is a 100 x 50 matrix, in other words, a matrix with a width equal to 100 and a length equal to 50. After $\mathrm{CONCAT}(\hat{z}_i, z_i)$ and $W_t$ are multiplied, the result $e_i$ is a 1 x 50 feature, in other words, a matrix with a width equal to 1 and a length equal to 50. The specific numerical values in the above example (for example, 1 x 50, 1 x 100, 100 x 50, and the like) are merely examples and should not be construed as limiting the present application.
In the above, the third algorithm model optimizes the plurality of first intermediate features so as to convert them into a plurality of first semantic features, and the plurality of first semantic features constitute the semantic feature set of the retrieval object. The third algorithm model is, in effect, the algorithm applied during the conversion of the first intermediate features, e.g., the formulas above for computing $\hat{z}_i$ and $e_i$.
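The sketch below illustrates this conversion step under the dimensions of the example above; it simplifies the selection of the most similar feature to a global argmax over cosine similarity, and all names (including W_t here) are assumptions rather than the application's implementation.

```python
# Illustrative sketch of the conversion step: for each intermediate feature, pick its
# most similar feature (here simplified to the globally most cosine-similar feature),
# concatenate the two, and apply a weight matrix W_t followed by a sigmoid activation.
import torch
import torch.nn.functional as F

def to_semantic_features(intermediate, W_t):
    """intermediate: (K, 50) intermediate features; W_t: (100, 50) weight matrix.
    Returns (K, 50) semantic features."""
    normalized = F.normalize(intermediate, dim=1)
    sim = normalized @ normalized.t()
    sim.fill_diagonal_(float("-inf"))                 # exclude the feature itself
    most_similar = intermediate[sim.argmax(dim=1)]    # MAX_AGGREGATE, simplified
    concatenated = torch.cat([most_similar, intermediate], dim=1)  # (K, 100) CONCAT
    return torch.sigmoid(concatenated @ W_t)          # sigma(CONCAT(...) . W_t), (K, 50)

W_t = torch.randn(100, 50)                            # pre-trained in the actual model; random here
semantic_features = to_semantic_features(torch.randn(6, 50), W_t)
```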
S330: and generating a semantic feature set corresponding to each target object according to the initial semantic features of each target object in the N target objects, wherein N is an integer greater than 1.
The target objects are the objects to be searched. For example, in a retrieval task in which a user desires to query images related to pets, when the execution subject performs the retrieval task, it queries the images related to pets from a plurality of images, and these images are the plurality of target objects in the retrieval task.
In some embodiments, as mentioned above, the execution subject may execute the retrieval task in stages, and the retrieval method provided by the embodiment of the present application may be applied to the later retrieval stage of the retrieval task. In these embodiments, the execution subject searches for the target objects in the target object library in the previous search stage to obtain N target objects, where the N target objects serve as search results in the previous search stage. And the execution main body further determines a retrieval result from the N target objects by executing the retrieval method provided by the embodiment of the application in the later retrieval stage.
In some embodiments, the target object library includes N target objects, and the execution subject may also determine the search result directly from the N target objects included in the target object library by executing the search method provided in the embodiment of the present application.
In some embodiments, the initial semantic features of each target object are extracted in advance, and the execution subject can directly utilize the initial semantic features extracted from the target object in advance when generating the semantic feature set corresponding to the target object according to the initial semantic features of the target object, thereby shortening the time for retrieval.
In some embodiments, when the execution subject generates the semantic feature set of the target object according to the initial semantic features of the target object, the execution subject may also extract the initial semantic features of the target object first, and after extracting the initial semantic features of the target object, generate the semantic feature set corresponding to the target object according to the initial semantic features.
As described above, the target object may be text data, image data, video data, or the like.
In an application scene that the target object is character data, the feature of the target object can be extracted by adopting a Doc2Vec algorithm, and the extracted feature is used as the initial semantic feature of the target object. In addition, the Word2vec algorithm can also be adopted to extract the features of the target object, and the extracted features are used as the initial semantic features of the target object. It should be noted that, the manner of extracting the initial semantic features is not limited in the present application.
In an application scenario in which the target object is image data, a ResNet (Residual Neural Network) may be used to extract features of the target object, and the extracted features may be used as initial semantic features of the target object. Alternatively, a LeNet network, a VGG network, or a DenseNet (Densely Connected Convolutional Network) may be used to extract the features of the target object, and the extracted features may be used as the initial semantic features of the target object. It should be noted that the manner of extracting the initial semantic features is not limited in the present application.
In an application scenario in which the target object is video data, the C3D network or the I3D network may be used to extract features of the target object, and the extracted features serve as initial semantic features of the target object. It should be noted that, the manner of extracting the initial semantic features is not limited in the present application.
As previously described, in some embodiments, the semantic feature set of the search object may be automatically generated by a pre-trained semantic generation model. In these embodiments, the semantic feature set of the target object may also be automatically generated by the semantic generation model. In practical application, the initial semantic features of the target object may be input into a semantic generation model, and the semantic generation model may generate a corresponding semantic feature set for the target object according to the initial semantic features of the target object.
In the present application, for each target object of the N target objects, a corresponding semantic feature set may be generated in the same manner. Therefore, for simplicity of illustration, the present application takes a target object as an example, and provides an alternative way of generating a semantic feature set for the target object.
As previously described, the semantic generation model may include a first algorithm model and a third algorithm model. The semantic generation model may also include a second algorithm model. The second algorithm model is used for generating a plurality of second intermediate features of the target object according to the initial semantic features of the target object. The second intermediate feature is a concept relative to the initial semantic feature, and in practical application, the execution subject inputs the initial semantic feature of the target object into the second algorithm model, and the plurality of features are generated by the second algorithm model, so that the plurality of features generated by the second algorithm model can be regarded as the plurality of second intermediate features. The plurality of second intermediate features has a higher diversity than the initial semantic features.
Alternatively, the second algorithm model may be a Variational Auto-Encoder (VAE), a Conditional Variational Auto-Encoder (CVAE), a Multi-head Attention mechanism (Multi-head Attention), or a topic model. The variational autoencoder, the conditional variational autoencoder, the multi-head attention mechanism, and the topic model all have the capability of generating a plurality of features from one feature. It should be noted that other models with a one-to-many generation capability (i.e., the capability of generating a plurality of features from one feature) may also be used as the second algorithm model, such as an autoencoder.
In a specific implementation, in a case where the search object and the target object belong to different data modalities, the first algorithm model preferably uses a multi-head attention mechanism, and the second algorithm model preferably uses a variational auto-encoder or a conditional variational auto-encoder. For example, when the search object is character data and the target object is image data, the first algorithm model uses a multi-head attention mechanism and the second algorithm model uses a conditional variation self-encoder.
The conditional variational autoencoder is a generative model comprising an encoder and a decoder. The encoder is used for encoding the distribution of the latent variable z that satisfies an input condition according to input data and the input condition (the input condition can also be understood as a label). The decoder is used for generating data satisfying the input condition according to samples of the latent variable z and the input condition. For example, when the present application is specifically implemented, the initial semantic features of the target object may be used as the input data, and the initial semantic features of the retrieval object may be used as the input condition. In the present application, the input condition is fed into a fully connected layer (FC) to obtain the transformed feature output by the fully connected layer, and the transformed feature is then spliced with the input data (i.e., the initial semantic features of the target object) to obtain a splicing feature. The splicing feature is input into the encoder of the conditional variational autoencoder, the encoder encodes the splicing feature into the distribution of the latent variable z satisfying the input condition, and a plurality of latent variables z are then sampled from that distribution to serve as a plurality of second intermediate features of the target object.
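The sketch below illustrates this conditional-VAE step: the condition is passed through a fully connected layer, spliced with the target object's initial semantic feature, encoded, and several latent vectors are sampled as second intermediate features. The layer sizes, names and number of samples are assumptions.

```python
# Illustrative sketch: a CVAE-style encoder conditioned on the retrieval object's
# initial semantic feature, producing several second intermediate features for a
# target object.
import torch
import torch.nn as nn

class CVAEEncoder(nn.Module):
    def __init__(self, target_dim=50, condition_dim=50, latent_dim=50):
        super().__init__()
        self.condition_fc = nn.Linear(condition_dim, 50)   # FC layer transforming the condition
        self.hidden = nn.Linear(target_dim + 50, 128)
        self.mu = nn.Linear(128, latent_dim)
        self.logvar = nn.Linear(128, latent_dim)

    def forward(self, target_feature, condition_feature, num_samples=4):
        cond = torch.relu(self.condition_fc(condition_feature))
        spliced = torch.cat([target_feature, cond], dim=1)   # splicing feature
        h = torch.relu(self.hidden(spliced))
        mu, logvar = self.mu(h), self.logvar(h)
        std = torch.exp(0.5 * logvar)
        samples = [mu + std * torch.randn_like(std) for _ in range(num_samples)]
        return torch.stack(samples, dim=1)   # (batch, num_samples, latent_dim)

encoder = CVAEEncoder()
second_intermediate_features = encoder(torch.randn(1, 50), torch.randn(1, 50))  # (1, 4, 50)
```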
In the above, the second algorithm model generates a plurality of second intermediate features of the target object according to the initial semantic features of the target object, and although the plurality of second intermediate features have higher diversity than the initial semantic features, in order to further make the plurality of second intermediate features have stronger relevance, the plurality of second intermediate features may be optimized by using the third algorithm model, so that the plurality of second intermediate features are respectively converted into corresponding second semantic features, and the plurality of second semantic features are taken as a semantic feature set corresponding to the target object.
Wherein the process of optimizing the plurality of second intermediate features using the third algorithm model comprises: the third algorithm model determines similar features for each second intermediate feature from the plurality of second intermediate features, and converts each second intermediate feature into corresponding second semantic features according to the similar features of each second intermediate feature and each second intermediate feature; and the plurality of second semantic features converted from the plurality of second intermediate features are used as a semantic feature set of the target object.
In the above manner, since the second semantic features are obtained based on the second intermediate features and the similar features of the second intermediate features, the second semantic features can reflect the relevance between the second intermediate features, and the second semantic features are closer to the actual retrieval intention.
Alternatively, the process of determining similar features for each second intermediate feature may refer to the process of determining similar features for each first intermediate feature described above.
Optionally, the process of converting each second intermediate feature into a corresponding second semantic feature according to the similar features of each second intermediate feature and each second intermediate feature may refer to the process of converting each first intermediate feature into a corresponding first semantic feature described above.
In the above, the third algorithm model optimizes the plurality of second intermediate features, so as to convert the plurality of second intermediate features into a plurality of second semantic features, where the plurality of second semantic features are semantic feature sets of the target object. The third algorithm model is actually: an algorithm applied in the process of translating the first intermediate feature and the second intermediate feature.
S340: and determining the matching degree of each target object and the retrieval object according to the semantic feature set of each target object and the semantic feature set of the retrieval object.
In the application, the semantic feature set has higher diversity than the initial semantic features, and for the retrieval object and the target object, the matching degree is determined not based on two single initial semantic features but based on two groups of diverse semantic features (namely two semantic feature sets).
For simplicity of explanation, a feature included in the semantic feature set of the retrieval object is referred to as a first semantic feature, and a feature included in the semantic feature set of the target object is referred to as a second semantic feature. The semantic feature set of the retrieval object comprises p first semantic features, and p is an integer greater than 1. The semantic feature set of each target object comprises q second semantic features, q being an integer greater than 1. It should be noted that, for each target object in the N target objects, the matching degree with the search object may be determined in the same manner. For simplicity of explanation, the jth target object is taken as an example, where J is a positive integer equal to or less than N.
In some embodiments, to determine the matching degree of the jth target object and the retrieval object, the difference degree of the jth target object and the ith first semantic feature may be determined according to q second semantic features of the jth target object and the ith first semantic feature of the retrieval object, where i is a positive integer less than or equal to p; and determining the minimum difference degree from the difference degrees of the J-th target object and each first semantic feature, and determining the matching degree of the J-th target object and the retrieval object according to the minimum difference degree.
For understanding, exemplarily, assuming that the semantic feature set of the jth target object includes 4 second semantic features, the semantic feature set of the retrieval object includes 6 first semantic features, when determining the matching degree of the target object and the retrieval object: according to the 4 second semantic features and the 1 st first semantic feature, determining that the difference degree between the target object and the 1 st first semantic feature is 0.3; according to the 4 second semantic features and the 2 nd first semantic feature, determining that the difference degree between the target object and the 2 nd first semantic feature is 0.1; by analogy, the difference degrees of the target object and the 6 first semantic features are finally determined to be 0.3, 0.1, 0.6, 0.4, 0.8 and 0.9 respectively, wherein the minimum difference degree is equal to 0.1, and therefore the matching degree of the target object and the retrieval object is determined based on the minimum difference degree of 0.1.
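As a hedged illustration of this selection step (not code from the present application), the sketch below assumes a helper difference_degree() that returns the degree of difference between a target object's second semantic features and a single first semantic feature, and simply keeps the minimum over the retrieval object's first semantic features; the toy values mirror the six difference degrees above.

```python
def match_degree(second_feats, first_feats, difference_degree):
    """Minimum difference degree across all first semantic features of the query."""
    diffs = [difference_degree(second_feats, t_i) for t_i in first_feats]
    return min(diffs)   # smaller value => better match, under one convention in the text

# toy example mirroring the six difference degrees in the text
fake_diffs = iter([0.3, 0.1, 0.6, 0.4, 0.8, 0.9])
degree = match_degree(None, range(6), lambda s, t: next(fake_diffs))
print(degree)  # 0.1
```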
In the above manner, as long as the target object closely matches one aspect of the retrieval object, the degree of difference between the target object and the retrieval object is small regardless of whether the target object closely matches the other aspects of the retrieval object, and the target object is highly likely to be used as a retrieval result. In this way, the plurality of retrieval results finally determined may match different aspects of the retrieval object, and the retrieval results differ from one another to a certain extent, so that the retrieval results have relatively high diversity.
Optionally, when determining the degree of difference between the jth target object and the ith first semantic feature according to the q second semantic features of the jth target object and the ith first semantic feature of the retrieval object, specifically, the degree of difference may be determined by:
firstly, determining a weighted sum of q second semantic features of the J-th target object according to each second semantic feature of the J-th target object and the attention weight of each second semantic feature, wherein the attention weight of each second semantic feature is the attention weight of the second semantic feature corresponding to the ith first semantic feature; then, determining a vector difference between the ith first semantic feature and the weighted sum, wherein the vector difference comprises a plurality of vector element values; finally, the smallest vector element value in the plurality of vector element values is determined as the difference degree of the J-th target object and the i-th first semantic feature.
In a specific implementation, the difference between the target object and the ith first semantic feature may be represented by the following formula:
d_i = min( t_i - Σ_{j=1..q} α_ij · v_j )

wherein v_j represents the j-th second semantic feature of the target object; α_ij represents the attention weight of the j-th second semantic feature corresponding to the i-th first semantic feature, and α_ij may be a result of pre-training; Σ_{j=1..q} α_ij · v_j represents the weighted sum of the q second semantic features of the target object; t_i represents the i-th first semantic feature of the retrieval object; t_i - Σ_{j=1..q} α_ij · v_j represents the vector difference; min() represents a function that returns the smallest vector element value among the vector's element values; and d_i represents the degree of difference between the target object and the i-th first semantic feature.
For ease of understanding, assume by way of example that the i-th first semantic feature t_i and the weighted sum Σ_{j=1..q} α_ij · v_j take values whose vector difference has 0.1 as its smallest vector element value. Then 0.1 is determined to be the degree of difference between the target object and the i-th first semantic feature.
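A minimal numpy sketch of this formula follows; the dimensions and the attention weights alpha_i used below are made up for illustration (in the present application α_ij would be pre-trained).

```python
import numpy as np

def difference_degree(second_feats, first_feat, attn_weights):
    """d_i = min( t_i - sum_j alpha_ij * v_j ): the smallest element of the vector difference."""
    weighted_sum = (attn_weights[:, None] * second_feats).sum(axis=0)  # sum_j alpha_ij * v_j
    vector_diff = first_feat - weighted_sum                            # t_i - weighted sum
    return float(vector_diff.min())                                    # smallest vector element value

rng = np.random.default_rng(0)
v = rng.normal(size=(4, 8))      # q = 4 second semantic features of the target object
t_i = rng.normal(size=8)         # the i-th first semantic feature of the retrieval object
alpha_i = rng.random(4)          # attention weights alpha_ij for this i (assumed pre-trained)
print(difference_degree(v, t_i, alpha_i))
```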
In the above manner, as long as one second semantic feature of the target object, after being multiplied by its corresponding attention weight, is close to the i-th first semantic feature, the degree of difference between the target object and the i-th first semantic feature is small, regardless of whether the other second semantic features of the target object are close to the i-th first semantic feature after being multiplied by their corresponding attention weights. In other words, as long as one aspect of the target object closely matches one aspect of the retrieval object, the degree of difference between the target object and the retrieval object is relatively small whether or not the other aspects of the target object closely match the other aspects of the retrieval object, and the target object is likely to be used as a retrieval result. In this way, the plurality of retrieval results finally determined may match different aspects of the retrieval object, and the retrieval results differ from one another to a certain extent, so that the retrieval results have relatively high diversity.
Alternatively, when determining the matching degree between the target object and the retrieval object according to the minimum difference degree, the minimum difference degree may be directly determined as the matching degree, so that the smaller the value of the matching degree, the more the target object and the retrieval object are matched. Or the inverse of the minimum degree of difference may be determined as the degree of matching, so that the larger the value of the degree of matching, the more matched the target object and the retrieval object are. The process of determining the matching degree according to the minimum difference degree is not limited in the present application.
Alternatively, the process of determining the matching degree between the target object and the retrieval object may be performed by a pre-trained fourth algorithm model, and the attention weight α_ij referred to in the above formula serves as a parameter of the fourth algorithm model; during training of the fourth algorithm model, the attention weight α_ij gradually converges as the fourth algorithm model is trained. In a specific implementation, the fourth algorithm model may be connected to the output of the third algorithm model. During model application, the third algorithm model outputs the semantic feature set of the retrieval object and the semantic feature set of the target object, which are input into the fourth algorithm model, and the fourth algorithm model calculates, based on the attention weight α_ij, the matching degree between the two semantic feature sets, namely the matching degree between the target object and the retrieval object.
During model training, the third algorithm model outputs a first predicted semantic feature set of a sample retrieval object and a second predicted semantic feature set of a sample target object, the two predicted semantic feature sets are input into the fourth algorithm model, and the fourth algorithm model calculates, based on an initial attention weight α_ij, the predicted matching degree between the two predicted semantic feature sets, namely the predicted matching degree between the sample target object and the sample retrieval object. Thereafter, a loss value is determined according to the predicted matching degree, and the fourth algorithm model, the third algorithm model, the second algorithm model, and the first algorithm model are updated based on the loss value.
For ease of understanding, referring to fig. 5, fig. 5 is a schematic diagram of determining a matching degree by using a model according to an embodiment of the present application. As shown in fig. 5, in actual application, an initial semantic feature 501 of a target object may be input into the second algorithm model 520 of the semantic generation model 500. Meanwhile, an initial semantic feature 503 of the retrieval object 502 is extracted, the extracted initial semantic feature 503 is input into a first algorithm model 510 of the semantic generation model 500, a plurality of second intermediate features 504 are output by a second algorithm model 520, and a plurality of first intermediate features 505 are output by the first algorithm model 510. The third algorithm model 530 performs optimization processing on the plurality of second intermediate features 504 and the plurality of first intermediate features 505, thereby outputting a semantic feature set 506 of the target object and a semantic feature set 507 of the search object 502. The fourth algorithm model 540 determines and outputs the matching degree of the target object and the retrieval object 502 according to the two semantic feature sets.
In some embodiments, in order to determine the matching degree between the J-th target object and the retrieval object, the degree of difference between the j-th second semantic feature of the J-th target object and the retrieval object may also be determined according to the j-th second semantic feature of the J-th target object and the p first semantic features of the retrieval object, where j is a positive integer less than or equal to q; the minimum degree of difference is then determined from the degrees of difference between each second semantic feature of the J-th target object and the retrieval object, and the matching degree between the J-th target object and the retrieval object is determined according to the minimum degree of difference.
S350: and determining the target object as the retrieval result from the N target objects according to the matching degree of each target object and the retrieval object.
According to the method and the device, the retrieval result is determined from the N target objects according to the matching degree of each target object and the retrieval object, so that the retrieval result has high diversity and high accuracy.
In some embodiments, M target objects corresponding to the top M matching degrees may be used as the search result according to the matching degree of each target object with the search object. Wherein M is a positive integer less than or equal to N. For example, in the case where the larger the degree of matching between the target object and the search object, the more the target object and the search object match, the top 1000 target objects with the highest degree of matching may be determined as the search result. Alternatively, for example, in the case where the smaller the degree of matching between the target object and the search object, the more the target object and the search object match, the top 1000 target objects with the smallest degree of matching may be determined as the search result.
In some embodiments, the target object whose matching degree with the retrieval object satisfies a preset condition may be determined as the retrieval result. For example, in the case where a larger matching degree value means the target object and the retrieval object match better, if the matching degree value between the target object and the retrieval object exceeds a preset threshold (for example, 0.6), the target object is determined as a retrieval result. Or, for example, in the case where a smaller matching degree value means the target object and the retrieval object match better, if the matching degree value between the target object and the retrieval object is lower than a preset threshold (for example, 0.3), the target object is determined as a retrieval result.
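Purely as an illustration of the two selection strategies just described, the following sketch (function name and threshold values are assumptions) keeps either the top-M target objects or those whose matching degree passes a threshold.

```python
def select_results(match_degrees, m=None, threshold=None, larger_is_better=True):
    """Either keep the top-M target objects or keep those passing a threshold."""
    indexed = list(enumerate(match_degrees))
    indexed.sort(key=lambda kv: kv[1], reverse=larger_is_better)
    if m is not None:
        return [idx for idx, _ in indexed[:m]]
    if larger_is_better:
        return [idx for idx, d in indexed if d > threshold]
    return [idx for idx, d in indexed if d < threshold]

scores = [0.2, 0.7, 0.65, 0.1]
print(select_results(scores, m=2))             # top-2 by matching degree
print(select_results(scores, threshold=0.6))   # e.g. a threshold of 0.6 as in the text
```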
It should be noted that the present application does not limit the process of determining the retrieval result according to the matching degree.
According to the method and the device, the semantic feature set of the retrieval object is generated according to the initial semantic features of the retrieval object, and the semantic feature set of the target object is generated according to the initial semantic features of the target object, so that the retrieval object and the target object are no longer represented by a single initial semantic feature, but by diverse semantic features. The matching degree of the target object and the retrieval object is then determined according to the semantic feature set of the target object and the semantic feature set of the retrieval object, and the retrieval result is determined according to the matching degree of each of the plurality of target objects with the retrieval object. Therefore, for the retrieval object and the target object, the matching degree is determined not based on two single initial semantic features but based on two diverse semantic feature sets, and the retrieval result is then determined, so that the retrieval result covers more possibilities and the diversity of the retrieval results is improved.
As previously described, in some embodiments, S320 through S340 may be performed by a semantic generation model. Optionally, the training process of the semantic generation model may refer to fig. 6 and fig. 7, fig. 6 is a training flowchart of the semantic generation model provided in an embodiment of the present application, and fig. 7 is a schematic diagram of the training semantic generation model provided in an embodiment of the present application. As shown in fig. 6, the training process of the semantic generation model includes:
s610: the method comprises the steps of obtaining a sample retrieval object and initial semantic features of the sample retrieval object, and obtaining a plurality of sample target objects and the initial semantic features of each sample target object, wherein each sample target object carries a label, and the label carried by each sample target object is used for representing whether the sample target object is retrieved by the sample retrieval object or not.
The initial semantic features of the sample retrieval object and the sample target object can be generated in advance, or the initial semantic features of the sample retrieval object and the sample target object can be extracted during training. For a specific way of extracting the initial semantic features, reference may be made to the foregoing.
Wherein, the label carried by each sample target object is a positive label or a negative label. If the sample target object carries a positive tag, it indicates that the sample target object was retrieved by the sample retrieval object. If the sample target object carries a negative tag, it indicates that the sample target object has not been retrieved by the sample retrieval object.
Alternatively, the sample retrieval object and a part of the sample target objects may be from historical search data accumulated by search products, and the historical search data may include a plurality of sets of search data, each set of search data including one retrieval object and a plurality of retrieval results, the plurality of retrieval results being determined by the search products according to the retrieval object. If the sample target object and the sample retrieval object are from the same set of search data, the sample target object carries a positive tag indicating that the sample target object was retrieved by the sample retrieval object. If the sample target object and the sample retrieval object are not from the same set of search data, the sample target object carries a negative tag indicating that the sample target object has not been retrieved by the sample retrieval object.
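A hedged sketch of how positive and negative labels could be derived from such grouped historical search data follows; the field names 'query' and 'results' and the function name are assumptions for illustration.

```python
def build_training_pairs(search_logs):
    """Each log entry: {'query': ..., 'results': [...]}; pair every query with
    its own results (positive label 1) and with results of other queries (negative label 0)."""
    pairs = []
    for k, entry in enumerate(search_logs):
        for doc in entry['results']:
            pairs.append((entry['query'], doc, 1))        # retrieved by this query
        for other in search_logs[:k] + search_logs[k + 1:]:
            for doc in other['results']:
                pairs.append((entry['query'], doc, 0))    # not retrieved by this query
    return pairs

logs = [{'query': 'q1', 'results': ['d1', 'd2']}, {'query': 'q2', 'results': ['d3']}]
print(build_training_pairs(logs))
```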
S620: and inputting the initial semantic features of the sample retrieval object into a semantic generation model to generate a first prediction semantic feature set of the sample retrieval object.
S630: and respectively inputting the initial semantic features of each sample target object into a semantic generation model, and generating a second prediction semantic feature set corresponding to each sample target object.
As shown in FIG. 7, for each sample target object, the initial semantic features 702 of the sample target object and the initial semantic features 701 of the sample retrieval object may be simultaneously input into the semantic generation model 700. Wherein the initial semantic features 702 of the sample target object are input into the second algorithmic model 720 and the second algorithmic model 720 outputs a plurality of second intermediate features 704 of the sample target object. The initial semantic features 701 of the sample search object are input to a first algorithmic model 710, and the first algorithmic model 710 outputs a plurality of first intermediate features 703 of the sample search object. The third algorithm model 730 optimizes the plurality of second intermediate features 704 to generate a second set of predicted semantic features 706, and the third algorithm model 730 optimizes the plurality of first intermediate features 703 to generate a first set of predicted semantic features 705.
For the specific types of the first algorithm model 710 and the second algorithm model 720, reference is made to the above. For the optimization processing of the plurality of first intermediate features 703 and the plurality of second intermediate features 704 by the third algorithm model 730, reference may also be made to the above, that is, the process of the third algorithm model optimizing the plurality of first intermediate features. It should be appreciated that, during training, the model parameters (e.g., the aforementioned weight vector W_t) within the third algorithm model 730 are not yet optimal; these parameters need to be trained and updated during model training, and thus the optimization effect of the third algorithm model 730 on the plurality of first intermediate features 703 and the plurality of second intermediate features 704 during training may not be ideal.
S640: and determining the prediction matching degree of each sample target object and the sample retrieval object according to the second prediction semantic feature set and the first prediction semantic feature set of each sample target object.
As shown in fig. 7, a fourth algorithm model 740 may be used to determine the predicted matching degree of each sample target object with the sample retrieval object. Specifically, the third algorithm model 730 outputs a first predicted semantic feature set 705 of the sample retrieval object and a second predicted semantic feature set 706 of the sample target object, the two predicted semantic feature sets are input into the fourth algorithm model 740, and the fourth algorithm model 740 calculates, based on an initial attention weight α_ij, the predicted matching degree between the two predicted semantic feature sets, namely the predicted matching degree between the sample target object and the sample retrieval object.
S650: and determining a loss value according to the label and the prediction matching degree of each sample target object, and training a semantic generation model based on the loss value.
Fig. 7 only schematically shows the processing procedure for one sample target object, so that the prediction matching degree of the sample retrieval object and the sample target object is obtained. In the same manner, the present application also performs processing on other sample target objects, respectively, to obtain the prediction matching degrees of the sample search object and each sample target object, such as the prediction matching degree b to the prediction matching degree x in fig. 7. After obtaining the predicted matching degrees corresponding to the plurality of sample target objects, as shown in fig. 7, a loss value may be determined according to the label and the predicted matching degree of each sample target object, and the loss value may be propagated in the reverse direction, so as to update the fourth algorithm model, the third algorithm model, the second algorithm model, and the first algorithm model.
In some embodiments, the loss value may include a triplet loss component L_mil. The triplet loss component L_mil can be represented by the following formula:

L_mil = (1/C) · Σ [ Δ + S(t, v-) - S(t, v+) ]+

wherein S(t, v-) represents the predicted matching degree of the sample retrieval object and a sample target object carrying a negative label; S(t, v+) represents the predicted matching degree of the sample retrieval object and a sample target object carrying a positive label; Δ is a preset distance compensation coefficient; [·]+ equals 0 if the value inside is less than or equal to 0, and equals the value itself if the value is greater than 0; C is the total number of pairs that can be formed between the plurality of S(t, v-) and the plurality of S(t, v+) in one round of training, and the sum is taken over these pairs; L_mil represents the triplet loss component.
During training, if S(t, v-) is large and S(t, v+) is small, it indicates that the semantic generation model cannot yet accurately distinguish between positive and negative sample target objects. In this case, the L_mil calculated by the above formula is relatively large, and therefore, when this L_mil is used to update the semantic generation model, the semantic generation model can be updated to a greater extent, which is beneficial for accelerating the convergence of the semantic generation model.

If S(t, v-) is small and S(t, v+) is large, it indicates that the semantic generation model can already accurately distinguish between positive and negative sample target objects. In this case, the L_mil calculated by the above formula is close to 0, and therefore, when this L_mil is used to update the semantic generation model, the semantic generation model is updated to a smaller extent, which is beneficial for the smooth convergence of the semantic generation model.
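The following numpy sketch illustrates the triplet (hinge) loss described above, averaging [Δ + S(t, v-) - S(t, v+)]+ over all positive/negative pairs; the margin Δ and the scores below are made-up values.

```python
import numpy as np

def triplet_loss(pos_scores, neg_scores, delta=0.2):
    """L_mil = (1/C) * sum over all (pos, neg) pairs of max(0, delta + S(t,v-) - S(t,v+))."""
    losses = [max(0.0, delta + s_neg - s_pos)
              for s_pos in pos_scores
              for s_neg in neg_scores]
    return float(np.mean(losses))   # C = len(pos_scores) * len(neg_scores)

print(triplet_loss(pos_scores=[0.9, 0.7], neg_scores=[0.2, 0.8]))
```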
In some embodiments, the loss value may include a label classification loss component L_label. The process for determining the label classification loss component comprises the following steps: predicting, by a classifier, the type of each predicted semantic feature set (including the first predicted semantic feature set and the second predicted semantic feature sets) output by the semantic generation model to obtain type prediction results; and determining the label classification loss component L_label according to the type prediction results and the type label of each predicted semantic feature set. The type label of a predicted semantic feature set is used for representing the data modality type of the predicted semantic feature set.
For convenience of understanding, in the present round of training, for example, the total number of sample target objects is 100, and the number of sample retrieval objects is 1, then after the initial semantic features of the sample retrieval objects and the initial semantic features of one sample target object are input into the semantic generation model each time, a first predicted semantic feature set and a second predicted semantic feature set output by the semantic generation model are obtained. Therefore, 100 first prediction semantic feature sets and 100 second prediction semantic feature sets are finally obtained through the training in the current round, the 100 first prediction semantic feature sets correspond to the sample retrieval objects, and the 100 second prediction semantic feature sets correspond to the 100 sample target objects respectively.
Then, for each of the 100 first predicted semantic feature sets, the plurality of features in the first predicted semantic feature set are spliced, thereby obtaining a first spliced feature corresponding to the first predicted semantic feature set. In this manner, 100 first spliced features are obtained. Next, a first label is configured for each first spliced feature, wherein the first label is used for characterizing the data modality type of the first spliced feature. For example, if the data modality type of the sample retrieval object is text data, the first label may take the form "0".
Similarly, for each of the 100 second predicted semantic feature sets, the features in the second predicted semantic feature set are spliced, thereby obtaining a second spliced feature corresponding to the second predicted semantic feature set. In this manner, 100 second spliced features are obtained. A second label is then configured for each second spliced feature, wherein the second label is used for characterizing the data modality type of the second spliced feature. For example, if the data modality type of each sample target object is image data, the second label may take the form "1".
Then, the 100 first spliced features and the 100 second spliced features are used as sample data and input into a preset classifier, and 200 classification results output by the classifier are obtained. Then, according to the 200 classification results and the label corresponding to each classification result (i.e. the aforementioned first label or second label), the label classification loss component L_label is determined.
Above, the specific process of determining the label classification loss component L_label has been illustrated by way of example. It should be further noted that the specific values (e.g., 100, 200, etc.) referred to in the above example are only examples and should not be construed as limiting the present application.
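For illustration, a PyTorch-style sketch of this label classification loss follows: the features of each predicted semantic feature set are spliced, the spliced feature is classified into a data-modality class, and a cross-entropy loss is computed. The layer sizes, feature counts and names below are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def label_classification_loss(feature_sets, modality_labels, classifier):
    """feature_sets: list of tensors shaped (num_features, dim); labels: 0 = text, 1 = image."""
    spliced = torch.stack([fs.flatten() for fs in feature_sets])   # one spliced feature per set
    logits = classifier(spliced)
    return F.cross_entropy(logits, torch.tensor(modality_labels))

classifier = nn.Linear(4 * 16, 2)                 # assumes 4 features of dimension 16 per set
sets = [torch.randn(4, 16) for _ in range(6)]
print(label_classification_loss(sets, [0, 0, 0, 1, 1, 1], classifier))
```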
Furthermore, the labels carried by some sample target objects may deviate from the actual situation. For example, some sample target objects carrying a negative label, although not retrieved by the sample retrieval object, are actually rather relevant to the sample retrieval object. However, during model training, these sample target objects still participate in the training as negative-labeled samples.

Even if the above situation is ignored, a semantic generation model with the corresponding capability can still be trained. However, in order to further provide the semantic generation model with stronger fault tolerance, in some embodiments, the loss value may include a distance penalty loss component, and the distance penalty loss component is determined according to the prior matching degrees and the predicted matching degrees of the sample target objects carrying negative labels. The prior matching degree of a sample target object is: the matching degree between the sample target object and the sample retrieval object determined according to the initial semantic features of the sample target object and the initial semantic features of the sample retrieval object.
Alternatively, referring to fig. 8, fig. 8 is a flowchart for determining a distance penalty loss component according to an embodiment of the present application. As shown in fig. 8, the process of determining the distance penalty loss component may include:
s810: and determining the prior matching degree of each sample target object carrying the negative label and the sample retrieval object according to the initial semantic features of each sample target object carrying the negative label and the initial semantic features of the sample retrieval object.
When the prior matching degree of the sample target object and the sample retrieval object is determined, the cosine similarity between the initial semantic features of the sample target object and the initial semantic features of the sample retrieval object can be calculated according to the initial semantic features of the sample target object and the initial semantic features of the sample retrieval object, and then the calculated cosine similarity is used as the prior matching degree of the sample target object and the sample retrieval object.
Or, when determining the prior matching degree between the sample target object and the sample retrieval object, the euclidean distance, the manhattan distance, the chebyshev distance, or the like between the initial semantic features of the sample target object and the initial semantic features of the sample retrieval object may be calculated according to the initial semantic features of the sample target object and the initial semantic features of the sample retrieval object, and then the calculation result is used as the prior matching degree between the sample target object and the sample retrieval object. It should be noted that the present application does not limit the specific manner of determining the prior matching degree.
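A small numpy sketch of the cosine-similarity variant of the prior matching degree follows; any of the other listed distances could be substituted.

```python
import numpy as np

def prior_match_degree(target_feat, query_feat):
    """Cosine similarity between the two initial semantic features."""
    num = float(np.dot(target_feat, query_feat))
    den = float(np.linalg.norm(target_feat) * np.linalg.norm(query_feat)) + 1e-12
    return num / den

print(prior_match_degree(np.array([1.0, 0.0, 1.0]), np.array([1.0, 1.0, 0.0])))  # 0.5
```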
S820: and determining the prior matching degree sequence of a plurality of prior matching degrees according to the prior matching degree of each sample target object carrying the negative label.
When determining the prior matching degree sequence of the multiple prior matching degrees, the sequence can be performed according to a preset sequence rule. For example, the preset ordering rule may be ordered from large to small, or may be ordered from small to large, or may be ordered in other manners. The preset ordering rule is not limited in the present application.
For example, taking sorting from large to small as an example, it is assumed that the sample target objects carrying negative labels in the current training round are sample target object 1, sample target object 2, ..., sample target object 10, and the prior matching degrees of sample target object 1 to sample target object 10 are 0.15, 0.26, 0.71, 0.25, 0.16, 0.22, 0.07, 0.55, 0.42, and 0.12, respectively. Then the prior matching degree ranking of the 10 sample target objects is: sample target object 3, sample target object 8, sample target object 9, sample target object 2, sample target object 4, sample target object 6, sample target object 5, sample target object 1, sample target object 10, sample target object 7.
S830: and determining a predicted matching degree sequence of the plurality of predicted matching degrees according to the predicted matching degree of each sample target object carrying the negative label.
When the predicted matching degree ranking of the plurality of predicted matching degrees is determined, ranking can be performed according to the preset ranking rule.
For example, again taking sorting from large to small as an example, it is assumed that the sample target objects carrying negative labels in the current training round are sample target object 1, sample target object 2, ..., sample target object 10, and the predicted matching degrees of sample target object 1 to sample target object 10 are 0.12, 0.28, 0.63, 0.31, 0.10, 0.25, 0.02, 0.51, 0.39 and 0.07, respectively. Then the predicted matching degree ranking of the 10 sample target objects is: sample target object 3, sample target object 8, sample target object 9, sample target object 4, sample target object 2, sample target object 6, sample target object 1, sample target object 5, sample target object 10, sample target object 7.
S840: and determining a distance penalty loss component according to the similarity of the prior matching degree sequence and the prediction matching degree sequence.
The similarity between the prior matching degree ranking and the predicted matching degree ranking can be characterized by the KL (Kullback-Leibler) divergence between the two rankings, and this KL divergence can be determined as the distance penalty loss component. In a specific implementation, the distance penalty loss component can be represented by the following formula:

L_dis = KL( dist(t, v-) || S(t, v-) )

wherein dist(t, v-) represents the prior matching degree ranking of the plurality of sample target objects carrying negative labels, S(t, v-) represents the predicted matching degree ranking of the plurality of sample target objects carrying negative labels, KL( dist(t, v-) || S(t, v-) ) represents the KL divergence between the prior and predicted matching degree rankings, and L_dis represents the distance penalty loss component.
In the above manner, the distance penalty loss component is determined according to the similarity of the priori matching degree ranking and the predicted matching degree ranking, and the distance penalty loss component is used for training the semantic generation model, so that the semantic generation model can learn the difference between negative examples (namely, target sample objects carrying negative labels), and the predicted matching degrees of multiple negative examples can reflect the distance difference of multiple negative examples. In this way, under the condition that some labels may deviate from the actual situation, the trained semantic generation model is further enabled to have stronger fault tolerance performance.
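One plausible numpy reading of this loss is sketched below: the two sets of negative-sample matching degrees are normalized into distributions with a softmax (an assumption, since the exact normalization is not spelled out here) before the KL divergence is taken; the example values are the ones used above.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def distance_penalty_loss(prior_degrees, predicted_degrees):
    """KL( dist(t, v-) || S(t, v-) ) over the negative-sample matching degrees,
    here normalized with a softmax (an assumption; the normalization is not fixed in the text)."""
    p = softmax(np.asarray(prior_degrees, dtype=float))
    q = softmax(np.asarray(predicted_degrees, dtype=float))
    return float(np.sum(p * np.log(p / q)))

prior = [0.15, 0.26, 0.71, 0.25, 0.16, 0.22, 0.07, 0.55, 0.42, 0.12]
pred  = [0.12, 0.28, 0.63, 0.31, 0.10, 0.25, 0.02, 0.51, 0.39, 0.07]
print(distance_penalty_loss(prior, pred))
```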
Referring to fig. 9, fig. 9 is a flowchart of a retrieval method according to another embodiment of the present application. As shown in fig. 9, the search method includes:
s910: and obtaining a retrieval object, and extracting the initial semantic features of the retrieval object.
S920: and generating a semantic feature set of the retrieval object according to the initial semantic features of the retrieval object.
S930: and generating a semantic feature set corresponding to each target object according to the initial semantic features of each target object in the N target objects and the initial semantic features of the retrieval objects. Wherein N is an integer greater than 1.
In the application, the initial semantic features of the retrieval object can reflect the retrieval intention of the user, so the semantic feature set generated according to the initial semantic features of the retrieval object and the initial semantic features of the target object is closer to the retrieval intention of the user.
It should be noted that, for each target object of the N target objects, the corresponding semantic feature set may be determined in the same manner. For simplicity of description, the kth target object is taken as an example, and K is a positive integer less than or equal to N.
Alternatively, a pre-trained conditional variational auto-encoder may be utilized to generate the semantic feature set of the target object. For example, the initial semantic features of the K-th target object and the initial semantic features of the retrieval object may be input into the pre-trained conditional variational auto-encoder to obtain the hidden variables generated by the conditional variational auto-encoder. Then, a plurality of hidden variables are sampled from the hidden variables generated by the conditional variational auto-encoder, and the semantic feature set of the K-th target object is generated according to the plurality of hidden variables.

The conditional variational auto-encoder takes the initial semantic features of the retrieval object as the encoding guide condition. In other words, the initial semantic features of the retrieval object are used as the input condition (the input condition may also be understood as a label) of the conditional variational auto-encoder, so that the hidden variables generated by the conditional variational auto-encoder are closer to the initial semantic features of the retrieval object.
In a specific implementation, as described above, the semantic feature set of the retrieval object and the semantic feature set of the target object may be generated by using a semantic generation model, where the semantic generation model includes a first algorithm model, a second algorithm model, and a third algorithm model. Specifically, the semantic feature set of the target object is generated by the second algorithm model and the third algorithm model, and the second algorithm model is a conditional variational auto-encoder.
For example, when generating the semantic feature set of the target object by using the conditional variational auto-encoder and the third algorithm model, the initial semantic feature of the target object may be used as input data, and the initial semantic feature of the search object may be used as an input condition. Inputting the input condition into the full connection layer to obtain the transformed feature output by the full connection layer, and then splicing the transformed feature with the input data (namely the initial semantic feature of the target object) to obtain the splicing feature. And inputting the splicing characteristics into an encoder of the conditional variational self-encoder, encoding the splicing characteristics into the distribution of hidden variables z meeting the input conditions by the encoder, and sampling a plurality of hidden variables z from the distribution of the hidden variables z to be used as a plurality of second intermediate characteristics of the target object. The third algorithm model optimizes the second plurality of intermediate features to generate a semantic feature set of the target object.
S940: and determining the matching degree of each target object and the retrieval object according to the semantic feature set of each target object and the semantic feature set of the retrieval object.
S950: and determining the target object as the retrieval result from the N target objects according to the matching degree of each target object and the retrieval object.
For the specific embodiments of S910, S920, S940 and S950, reference may be made to the specific embodiments of S310, S320, S340 and S350, respectively, and no further description is given here to avoid repetition.
In the present application, the conditional variational auto-encoder (i.e., the second algorithm model) in the semantic generation model may be trained in the manner shown in fig. 6. After the loss value is obtained in the manner shown in fig. 6, each algorithm model in the semantic generation model, including the conditional variational auto-encoder, is trained using the loss value. The loss value may include a coding loss component, and the coding loss component may be represented by the following formula:
L_qvae = KL( q(z | f_v, f_t) || p(z) ) - E_{q(z | f_v, f_t)}[ log p(f_v | f_t, z) ]

wherein p(z) represents the distribution of the hidden variable z; q(z | f_v, f_t) represents the distribution of the hidden variable z under a first condition, with the initial semantic feature f_t of the retrieval object and the initial semantic feature f_v of the target object as the first condition; p(f_v | f_t, z) represents the distribution of the initial semantic feature f_v of the target object under a second condition, with the initial semantic feature f_t of the retrieval object and the hidden variable z as the second condition; KL( q(z | f_v, f_t) || p(z) ) represents the KL divergence between p(z) and q(z | f_v, f_t); E_{q(z | f_v, f_t)}[ log p(f_v | f_t, z) ] represents the expectation of log p(f_v | f_t, z); and L_qvae represents the coding loss component.
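Under the usual Gaussian assumptions for a conditional variational auto-encoder, this coding loss is typically computed as an analytic KL term plus a reconstruction term. The PyTorch sketch below makes those assumptions explicit (a unit-Gaussian prior p(z) and a mean-squared-error stand-in for the reconstruction expectation) and is not the exact formulation of the present application.

```python
import torch
import torch.nn.functional as F

def coding_loss(mu, logvar, reconstructed_fv, target_fv):
    """L_qvae = KL( q(z | f_v, f_t) || p(z) ) - E[ log p(f_v | f_t, z) ],
    with p(z) = N(0, I) assumed and the expectation approximated by an MSE reconstruction term."""
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1).mean()
    recon = F.mse_loss(reconstructed_fv, target_fv)   # stands in for -log p(f_v | f_t, z)
    return kl + recon

mu, logvar = torch.zeros(2, 8), torch.zeros(2, 8)
print(coding_loss(mu, logvar, torch.randn(2, 16), torch.randn(2, 16)))
```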
In some embodiments, the loss value obtained in the manner shown in FIG. 6 includes all of the following: the triplet loss component L_mil, the label classification loss component L_label, the distance penalty loss component L_dis, and the coding loss component L_qvae. The loss value can then be expressed by the following formula:

L = L_label + λ1 · L_qvae + λ2 · L_mil + λ3 · L_dis

wherein λ1, λ2 and λ3 are all preset weight coefficients, and L represents the loss value.
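A one-line sketch of this weighted combination follows; the weight values passed in are placeholders, not values from the present application.

```python
def total_loss(l_label, l_qvae, l_mil, l_dis, lambda1=1.0, lambda2=1.0, lambda3=1.0):
    """L = L_label + lambda1 * L_qvae + lambda2 * L_mil + lambda3 * L_dis."""
    return l_label + lambda1 * l_qvae + lambda2 * l_mil + lambda3 * l_dis

print(total_loss(0.4, 0.8, 0.3, 0.05, lambda1=0.5, lambda2=1.0, lambda3=0.1))
```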
Referring to fig. 10, fig. 10 is a flowchart of a retrieval method according to another embodiment of the present application. As shown in fig. 10, the retrieval method includes:
s1010: and obtaining a retrieval object, and extracting the initial semantic features of the retrieval object.
S1020: based on a pre-trained projection strategy, projecting the initial semantic features of the retrieval object to a target vector space corresponding to the projection strategy to obtain first projection features of the retrieval object; and generating a semantic feature set of the retrieval object according to the first projection feature.
S1030: based on a projection strategy, projecting the initial semantic features of each target object in the N target objects to a target vector space to obtain second projection features of each target object; and generating a semantic feature set corresponding to each target object according to the second projection feature of each target object.
As previously described, in some embodiments, the retrieval object and the target object may each belong to a different data modality. In these embodiments, in order to make the retrieval object and the target object of different data modalities more comparable, the initial semantic features of the retrieval object and the initial semantic features of the target object may be projected into the same vector space, i.e. the target vector space. Thus, the first projection feature and the second projection feature obtained by projection are features located in the same vector space. Furthermore, the semantic feature set generated according to the first projection feature is more comparable to the semantic feature set generated according to the second projection feature.
The projection strategy comprises a plurality of projection parameters, and the parameters comprise space parameters for describing a target vector space. The projection strategy can be obtained by pre-training, and each projection parameter in the projection strategy is converged into a proper numerical value by training the projection strategy, so that a proper target vector space is also determined.
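As one common realization (an assumption; the present application only states that the projection strategy has trainable parameters describing the target vector space), the projection could be a learned linear map per data modality into a shared space, as sketched below in PyTorch.

```python
import torch
import torch.nn as nn

class ProjectionStrategy(nn.Module):
    """Project text and image initial semantic features into one shared target vector space."""
    def __init__(self, text_dim=300, image_dim=2048, shared_dim=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, shared_dim)    # projection parameters for queries
        self.image_proj = nn.Linear(image_dim, shared_dim)  # projection parameters for targets

    def forward(self, text_feat=None, image_feat=None):
        first_proj = self.text_proj(text_feat) if text_feat is not None else None
        second_proj = self.image_proj(image_feat) if image_feat is not None else None
        return first_proj, second_proj

proj = ProjectionStrategy()
q, t = proj(torch.randn(1, 300), torch.randn(1, 2048))
print(q.shape, t.shape)   # both features now lie in the same 512-dimensional target vector space
```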
The above may be referred to as a specific manner of generating the semantic feature set according to the first projection feature. For example, the first projection feature may be input into a semantic generation model, so as to generate a semantic feature set of the retrieval object through a first algorithm model and a third algorithm model of the semantic generation model.
Reference may also be made to the above for the specific way of generating the semantic feature set from the second projection features. For example, the second projection feature may be input into a semantic generation model, so as to generate a semantic feature set of the target object through a second algorithm model and a third algorithm model of the semantic generation model. Or for example, the second projection feature and the first projection feature may be spliced and then input into the semantic generation model, so that the semantic feature set of the target object is generated through a second algorithm model and a third algorithm model of the semantic generation model, where the second algorithm model is a conditional variational self-encoder.
S1040: and determining the matching degree of each target object and the retrieval object according to the semantic feature set of each target object and the semantic feature set of the retrieval object.
S1050: and determining the target object as the retrieval result from the N target objects according to the matching degree of each target object and the retrieval object.
For the specific embodiments of S1010, S1040 and S1050, reference may be made to the specific embodiments of S310, S340 and S350, respectively, and no further description is given here to avoid repetition.
In some embodiments, S1020-S1030 may be performed by a pre-trained semantic generation model. During training of the semantic generation model, the projection strategy may be trained simultaneously. Specifically, the sample retrieval object and the sample target object belong to different data modalities, respectively. In order to generate a first prediction semantic feature set of the sample retrieval object, the initial semantic features of the sample retrieval object may be projected to a preset vector space corresponding to a preset projection strategy based on a preset projection strategy, so as to obtain first sample projection features of the sample retrieval object; and then inputting the first sample projection characteristic into a semantic generation model to generate a first prediction semantic characteristic set of the sample retrieval object.
In order to generate a second prediction semantic feature set corresponding to each sample target object, the initial semantic features of each sample target object may be projected to a preset vector space based on a preset projection strategy, so as to obtain second sample projection features of each sample target object; and then, inputting the second sample projection characteristic of each sample target object into a semantic generation model respectively, and generating a second prediction semantic characteristic set corresponding to each sample target object.
By the method, the initial semantic features of the sample retrieval object and the initial semantic features of the sample target object are projected to the same vector space, namely the preset vector space, so that the first sample projection features and the second sample projection features which are located in the same vector space are obtained. Furthermore, the first prediction semantic feature set generated according to the first sample projection features is more comparable to the second prediction semantic feature set generated according to the second sample projection features.
In addition, after the loss value is determined, the preset projection strategy can be trained based on the loss value, so that the projection parameters in the preset projection strategy are updated, and the spatial parameters of the preset vector space are also updated.
Referring to fig. 11, fig. 11 is a schematic diagram of model training proposed in another embodiment of the present application, in which the model training process simultaneously trains a projection strategy, a semantic feature extractor, and a fourth algorithm model in an end-to-end learning manner. As shown in FIG. 11, semantic generation model 1100 includes a conditional variational auto-encoder 1120, a multi-headed attention mechanism 1110, and a multi-instance learning model 1130. To simplify the drawing, the projection strategy is not shown in fig. 11.
As shown in fig. 11, the sample retrieval object is the query text "golden knight", and the first sample target object is a knight image (abbreviated as fig. a). During training, the Doc2Vec algorithm can be used to extract an initial semantic feature 1101 of a sample retrieval object, and then based on a projection strategy, the initial semantic feature 1101 is projected to a preset vector space to obtain a first projection feature 1102. Initial semantic features 1103 of the graph a can be extracted by using ResNet, and then the initial semantic features 1103 are projected to a preset vector space based on a projection strategy, so as to obtain second projection features 1104.
As shown in fig. 11, a first projected feature 1102 is input to a multi-head attention mechanism 1110 of a semantic generation model 1100, resulting in a plurality of first intermediate features 1105 output by the multi-head attention mechanism 1110. The second projection features 1104 and the first projection features 1102 are input into a conditional variational self-encoder 1120, and a plurality of second intermediate features 1106 (i.e. a plurality of sampled implicit variables z) output by the conditional variational self-encoder 1120 are obtained.
As shown in FIG. 11, the multi-example learning model 1130 generates a first example graph 1107 from a plurality of first intermediate features 1105. In the first exemplary diagram 1107, there are lines between some first intermediate features 1105, and two first intermediate features 1105 with lines are similar to each other. The multi-instance learning model 1130 transforms each first intermediate feature 1105 into a corresponding first semantic feature according to each first intermediate feature 1105 and its similar features, thereby obtaining a first set of predicted semantic features of the sample search object.
Likewise, the multi-example learning model 1130 generates the second example graph 1108 from the plurality of second intermediate features 1106. In the second exemplary diagram 1108, a portion of the second intermediate features 1106 has a connection therebetween, and the two second intermediate features 1106 having the connection are similar to each other. The multi-instance learning model 1130 transforms each second intermediate feature 1106 into a corresponding second semantic feature according to each second intermediate feature 1106 and its similar features, thereby obtaining a second set of predicted semantic features for graph a.
The fourth algorithm model 1140 calculates, based on an initial attention weight α_ij, the predicted matching degree a between the two predicted semantic feature sets, namely the predicted matching degree a between the sample retrieval object and graph a.
Fig. 11 only schematically shows the processing procedure for graph a, by which the predicted matching degree a between the sample retrieval object and graph a is obtained. In the present application, in the same manner, the other sample target objects such as graph b, graph c, ..., graph x are processed separately to obtain the predicted matching degree b between the sample retrieval object and graph b, the predicted matching degree c between the sample retrieval object and graph c, ..., and the predicted matching degree x between the sample retrieval object and graph x. Among the sample target objects of graph a to graph x, some carry positive labels, which indicate that those sample target objects were retrieved by the sample retrieval object, and some carry negative labels, which indicate that those sample target objects were not retrieved by the sample retrieval object.
After the predicted matching degrees corresponding to the plurality of sample target objects are obtained, a loss value may be determined according to the label and the predicted matching degree of each sample target object, and the fourth algorithm model 1140, the semantic generation model, and the projection strategy may be trained based on the loss value. In other words, the fourth algorithm model 1140, the multi-instance learning model 1130, the conditional variational auto-encoder 1120, the multi-head attention mechanism 1110, and the projection strategy are trained based on the loss value. By training the fourth algorithm model 1140, the parameters of the fourth algorithm model 1140 (e.g., the attention weight α_ij) are updated. By training the projection strategy, the parameters of the projection strategy (e.g., the preset vector space) are updated.
The loss value may comprise at least one of the following components: the triplet loss component L_mil, the label classification loss component L_label, the distance penalty loss component L_dis, or the coding loss component L_qvae. For the specific determination of these loss components, reference is made to the foregoing.
Referring to fig. 12, fig. 12 is a schematic diagram of performing a search task based on the models in fig. 11 according to an embodiment of the present application.
As shown in fig. 12, the retrieval object is the query text "rape flowers in mountain ditches", and the first target object is a rape flower image (referred to as graph A for short; the target object itself is not shown in fig. 12). During the execution of a retrieval task, the Doc2Vec algorithm can be used to extract an initial semantic feature 1201 of the retrieval object, and then the initial semantic feature 1201 is projected to the target vector space based on the trained projection strategy to obtain a first projection feature 1202. An initial semantic feature 1203 of graph A may be obtained, and then, based on the trained projection strategy, the initial semantic feature 1203 is projected to the target vector space, resulting in a second projection feature 1204. The target vector space is a parameter of the trained projection strategy, that is, the value into which the preset vector space gradually converges during training, and the initial semantic feature 1203 of graph A may be extracted before the retrieval task is performed.
As shown in fig. 12, the first projection feature 1202 and the second projection feature 1204 are input into the pre-trained semantic generation model 1210, so as to obtain two semantic feature sets output by the semantic generation model 1210, where the two semantic feature sets are the semantic feature set 1205 of the retrieval object and the semantic feature set 1206 of graph A, respectively. The fourth algorithm model 1220 calculates, based on a pre-trained attention weight α_ij, the matching degree a between the two semantic feature sets, where the matching degree a is the matching degree between the retrieval object and graph A. Fig. 12 only schematically shows the processing procedure for graph A, by which the matching degree a between the retrieval object and graph A is obtained. In the present application, in the same manner, the other target objects such as graph B, graph C, ..., graph N are processed to obtain the matching degree b between the retrieval object and graph B, the matching degree c between the retrieval object and graph C, ..., and the matching degree n between the retrieval object and graph N. As shown in fig. 12, the target object serving as the retrieval result 1207 is finally determined from the N target objects according to the matching degree of each target object with the retrieval object.
Referring to fig. 13, fig. 13 is a schematic diagram of a retrieving apparatus 1300 according to an embodiment of the present application. As shown in fig. 13, the search apparatus 1300 includes:
an initial semantic feature extraction module 1310, configured to obtain a retrieval object and extract an initial semantic feature of the retrieval object.
The semantic feature set generating module 1320 is configured to generate a semantic feature set of the retrieval object according to the initial semantic features of the retrieval object, and further to generate a semantic feature set corresponding to each target object according to the initial semantic features of each target object in N target objects, where N is an integer greater than 1.
A matching degree determining module 1330, configured to determine the matching degree between each target object and the retrieval object according to the semantic feature set of each target object and the semantic feature set of the retrieval object.
The retrieval result determining module 1340 is configured to determine a target object serving as a retrieval result from the N target objects according to the matching degree between each target object and the retrieval object.
Optionally, the semantic feature set of the retrieval object includes p first semantic features, where p is an integer greater than 1; the semantic feature set of each target object comprises q second semantic features, q being an integer greater than 1.
The matching degree determining module 1330 is specifically configured to: determining the difference degree of the J-th target object and the ith first semantic feature according to q second semantic features of the J-th target object and the ith first semantic feature of the retrieval object; and determining the minimum difference degree from the difference degrees of the J-th target object and each first semantic feature, and determining the matching degree of the J-th target object and the retrieval object according to the minimum difference degree. Wherein J is a positive integer less than or equal to N, and i is a positive integer less than or equal to p.
Optionally, the matching degree determining module 1330 is specifically configured to, when determining the difference degree between the J-th target object and the ith first semantic feature: determining a weighted sum of q second semantic features of the J-th target object according to each second semantic feature of the J-th target object and the attention weight of each second semantic feature; determining a vector difference between the ith first semantic feature and the weighted sum, wherein the vector difference comprises a plurality of vector element values; and determining the smallest vector element value in the plurality of vector element values as the difference degree of the J-th target object and the ith first semantic feature. Wherein the attention weight of each second semantic feature is the attention weight of the second semantic feature corresponding to the ith first semantic feature.
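A small NumPy sketch of the matching-degree computation just described is given below; the attention weights are assumed to be available as a p x q matrix, and mapping the minimum difference degree to a matching degree by negation is an assumption, since the text only states that the matching degree is determined according to the minimum difference degree.

```python
import numpy as np

def matching_degree(first_feats, second_feats, attn):
    """first_feats: (p, d) first semantic features of the retrieval object;
    second_feats: (q, d) second semantic features of one target object;
    attn: (p, q) trained attention weights, attn[i, j] being the weight of the
    j-th second semantic feature with respect to the i-th first semantic feature."""
    diffs = []
    for i in range(first_feats.shape[0]):
        weighted_sum = (attn[i][:, None] * second_feats).sum(axis=0)  # weighted sum of the q features
        vector_diff = first_feats[i] - weighted_sum                   # element-wise vector difference
        diffs.append(vector_diff.min())                               # smallest vector element value
    min_diff = min(diffs)                                             # minimum difference degree
    return -min_diff        # assumption: smaller difference degree -> larger matching degree
```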
Optionally, the semantic feature set generating module 1320, when generating the semantic feature set corresponding to each target object, is specifically configured to: and generating a semantic feature set corresponding to each target object according to the initial semantic features of each target object and the initial semantic features of the retrieval objects.
Optionally, the semantic feature set generating module 1320, when generating the semantic feature set corresponding to each target object, is specifically configured to: input the initial semantic features of the Kth target object and the initial semantic features of the retrieval object into a pre-trained conditional variational self-encoder to obtain hidden variables generated by the conditional variational self-encoder; and acquire a plurality of hidden variables from the hidden variables generated by the conditional variational self-encoder, and generate a semantic feature set of the Kth target object according to the plurality of hidden variables. The conditional variational self-encoder takes the initial semantic features of the retrieval object as the encoding guide condition, and K is a positive integer less than or equal to N.
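The following PyTorch sketch illustrates one possible shape of such a conditional variational self-encoder, with the retrieval object's initial semantic feature as the conditioning input and several sampled hidden variables decoded into the target object's semantic feature set; layer sizes, the number of samples, and the decoder structure are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConditionalVAEGenerator(nn.Module):
    """Minimal sketch: turn one target object's initial semantic feature into a
    set of semantic features, conditioned on the retrieval object's feature."""
    def __init__(self, target_dim, query_dim, latent_dim, out_dim, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(target_dim + query_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.log_var = nn.Linear(hidden, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim + query_dim, hidden),
                                     nn.ReLU(), nn.Linear(hidden, out_dim))

    def forward(self, target_feat, query_feat, num_samples=4):
        h = self.encoder(torch.cat([target_feat, query_feat], dim=-1))
        mu, log_var = self.mu(h), self.log_var(h)
        std = torch.exp(0.5 * log_var)
        feats = []
        for _ in range(num_samples):
            # draw one hidden variable and decode it into one member of the
            # target object's semantic feature set
            z = mu + std * torch.randn_like(std)
            feats.append(self.decoder(torch.cat([z, query_feat], dim=-1)))
        return feats
```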
Optionally, the retrieval object and the target object belong to different data modalities, respectively. The semantic feature set generating module 1320, when generating the semantic feature set of the retrieval object, is specifically configured to: based on a pre-trained projection strategy, projecting the initial semantic features of the retrieval object to a target vector space corresponding to the projection strategy to obtain first projection features of the retrieval object; and generating a semantic feature set of the retrieval object according to the first projection feature.
The semantic feature set generating module 1320, when generating the semantic feature set corresponding to each target object, is specifically configured to: based on a projection strategy, projecting the initial semantic features of each target object to a target vector space to obtain second projection features of each target object; and generating a semantic feature set corresponding to each target object according to the second projection feature of each target object.
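As a simple illustration of the projection step, the sketch below uses one learned linear map per modality to project the initial semantic features into a shared target vector space; the linear form and the dimensions are assumptions, since the description only requires that both modalities be projected into the target vector space associated with the trained projection strategy.

```python
import numpy as np

rng = np.random.default_rng(0)
d_text, d_image, d_target = 300, 2048, 512                 # illustrative dimensions
W_text = 0.01 * rng.standard_normal((d_target, d_text))    # stand-ins for the
W_image = 0.01 * rng.standard_normal((d_target, d_image))  # trained projection strategy

def project_query(q_init):
    # first projection feature of the retrieval object
    return W_text @ q_init

def project_target(t_init):
    # second projection feature of a target object
    return W_image @ t_init
```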
Optionally, the semantic feature set of the retrieval object and the semantic feature set of each target object are generated by a semantic generation model.
Optionally, the semantic generation model includes a first algorithm model, a second algorithm model, and a third algorithm model. The first algorithm model is used for generating a plurality of first intermediate features of the retrieval object according to the initial semantic features of the retrieval object; the third algorithm model is used for determining similar features for each first intermediate feature from the plurality of first intermediate features, and converting each first intermediate feature into corresponding first semantic features according to the similar features of each first intermediate feature and each first intermediate feature; the plurality of first semantic features converted from the plurality of first intermediate features serve as a semantic feature set of the retrieval object.
The second algorithm model is used for generating a plurality of second intermediate features of the target object according to the initial semantic features of the target object; the third algorithm model is further used for determining a similar feature for each second intermediate feature from the plurality of second intermediate features, and converting each second intermediate feature into a corresponding second semantic feature according to the similar feature of each second intermediate feature and each second intermediate feature; and the plurality of second semantic features converted from the plurality of second intermediate features are used as a semantic feature set of the target object.
Optionally, the retrieval object and the target object belong to different data modalities, respectively. The first algorithm model is a multi-head attention mechanism, and the second algorithm model is a variational self-encoder or a conditional variational self-encoder.
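Of the three algorithm models, the third is the least standard; the sketch below shows one way it could work, picking each intermediate feature's most similar peer by cosine similarity and fusing the pair by averaging. Both the similarity measure and the fusion rule are assumptions; the description only fixes the overall structure (find a similar feature, then convert each intermediate feature using itself and its similar feature).

```python
import numpy as np

def third_algorithm_model(intermediate_feats):
    """intermediate_feats: (m, d) first or second intermediate features.
    Returns (m, d) semantic features, one per intermediate feature."""
    feats = np.asarray(intermediate_feats, dtype=float)
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = normed @ normed.T                    # pairwise cosine similarity
    np.fill_diagonal(sim, -np.inf)             # a feature is not its own "similar feature"
    similar_idx = sim.argmax(axis=1)           # most similar peer for each feature
    return 0.5 * (feats + feats[similar_idx])  # fuse each feature with its similar feature
```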
Alternatively, referring to fig. 14, fig. 14 is a schematic diagram of a retrieval apparatus 1400 according to another embodiment of the present application. As shown in fig. 14, in addition to an initial semantic feature extraction module 1410, a semantic feature set generation module 1420, a matching degree determination module 1430, and a retrieval result determination module 1440, the retrieval apparatus 1400 further includes:
a sample obtaining module 1401, configured to obtain a sample retrieval object and an initial semantic feature of the sample retrieval object, and obtain a plurality of sample target objects and an initial semantic feature of each sample target object, where each sample target object carries a label, and the label carried by each sample target object is used to represent whether the sample target object is retrieved by the sample retrieval object.
A predicted semantic feature set generating module 1402, configured to input the initial semantic features of the sample retrieval object into a semantic generation model, and generate a first predicted semantic feature set of the sample retrieval object; and the initial semantic features of each sample target object are respectively input into the semantic generation model, and a second prediction semantic feature set corresponding to each sample target object is generated.
A prediction matching degree determining module 1403, configured to determine the prediction matching degree between each sample target object and the sample retrieval object according to the second prediction semantic feature set and the first prediction semantic feature set of each sample target object.
And the model training module 1404 is configured to determine a loss value according to the label and the prediction matching degree of each sample target object, and train a semantic generation model based on the loss value.
Optionally, the label carried by each sample target object is a positive label or a negative label. If the sample target object carries a positive label, it indicates that the sample target object was retrieved by the sample retrieval object. If the sample target object carries a negative label, it indicates that the sample target object was not retrieved by the sample retrieval object.
The loss value includes a distance penalty loss component, and the model training module 1404, when determining the distance penalty loss component, is specifically configured to: determining the prior matching degree of each sample target object carrying the negative label and the sample retrieval object according to the initial semantic features of each sample target object carrying the negative label and the initial semantic features of the sample retrieval object; determining a priori matching degree sequence of a plurality of priori matching degrees according to the priori matching degree of each sample target object carrying the negative label; determining a prediction matching degree sequence of a plurality of prediction matching degrees according to the prediction matching degree of each sample target object carrying the negative label; and determining a distance penalty loss component according to the similarity of the prior matching degree sequence and the prediction matching degree sequence.
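One concrete (but assumed) realization of this distance penalty loss component is sketched below: the similarity of the two sequences is measured with a Spearman-style rank correlation over the negative-label samples, and disagreement is penalized. The description only requires the loss to depend on the similarity of the prior and predicted matching-degree sequences, so the rank-correlation choice is an assumption.

```python
import numpy as np

def distance_penalty_loss(prior_scores, predicted_scores):
    """prior_scores / predicted_scores: matching degrees of the negative-label
    sample target objects with the sample retrieval object (same ordering,
    at least two samples, no ties assumed)."""
    prior_rank = np.argsort(np.argsort(prior_scores))
    pred_rank = np.argsort(np.argsort(predicted_scores))
    n = len(prior_scores)
    d = prior_rank.astype(float) - pred_rank.astype(float)
    spearman = 1.0 - 6.0 * np.sum(d ** 2) / (n * (n ** 2 - 1))  # rank correlation
    return 1.0 - spearman   # identical orderings -> zero penalty
```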
Optionally, the sample retrieval object and the sample target object belong to different data modalities, respectively. When the predicted semantic feature set generating module 1402 generates the first predicted semantic feature set, it is specifically configured to: based on a preset projection strategy, projecting the initial semantic features of the sample retrieval object to a preset vector space corresponding to the preset projection strategy to obtain first sample projection features of the sample retrieval object; and inputting the first sample projection characteristic into a semantic generation model to generate a first prediction semantic characteristic set of the sample retrieval object.
When the predicted semantic feature set generating module 1402 generates the second predicted semantic feature set, the predicted semantic feature set generating module is specifically configured to: based on a preset projection strategy, projecting the initial semantic features of each sample target object to a preset vector space to obtain second sample projection features of each sample target object; and respectively inputting the second sample projection feature of each sample target object into a semantic generation model, and generating a second prediction semantic feature set corresponding to each sample target object.
The model training module 1404 is further configured to: and training a preset projection strategy based on the loss value.
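Putting the training-related modules together, a highly simplified single training step might look as follows; it assumes the projection strategy, semantic generation model, matching function, and loss are differentiable callables in a PyTorch-style autograd setting, and all names are illustrative.

```python
def training_step(optimizer, query_feat, target_feats, labels,
                  project_query, project_target, semantic_model,
                  match_score, compute_loss):
    # first predicted semantic feature set of the sample retrieval object
    pred_query_set = semantic_model(project_query(query_feat))
    # predicted matching degree of each sample target object
    pred_scores = [match_score(pred_query_set,
                               semantic_model(project_target(t)))
                   for t in target_feats]
    loss = compute_loss(pred_scores, labels)   # e.g. the weighted sum sketched earlier
    optimizer.zero_grad()
    loss.backward()                            # trains the semantic generation model
    optimizer.step()                           # and the preset projection strategy
    return float(loss)
```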
According to the retrieval apparatus described above, the semantic feature set of the retrieval object is generated from the initial semantic features of the retrieval object, and the semantic feature set of each target object is generated from the initial semantic features of that target object, so that the retrieval object and the target objects are no longer represented by a single initial semantic feature each, but by a diverse set of semantic features. The matching degree of each target object with the retrieval object is then determined from the semantic feature set of the target object and the semantic feature set of the retrieval object, and the retrieval result is determined from the matching degrees of the plurality of target objects. Because the matching degree is determined from two diverse semantic feature sets rather than from two single initial semantic features, more candidate results become possible and the diversity of the retrieval results is improved.
It should be noted that the device embodiment and the method embodiment in the present application correspond to each other, and specific principles in the device embodiment may refer to the contents in the method embodiment, which is not described herein again.
An electronic device provided by the present application will be described below with reference to fig. 15.
FIG. 15 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.
It should be noted that the computer system of the electronic device shown in fig. 15 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 15, the computer system includes a Central Processing Unit (CPU) 1501, which can perform various appropriate actions and processes, such as the methods described in the above embodiments, according to a program stored in a Read-Only Memory (ROM) 1502 or a program loaded from a storage portion 1508 into a Random Access Memory (RAM) 1503. The RAM 1503 also stores various programs and data necessary for system operation. The CPU 1501, the ROM 1502, and the RAM 1503 are connected to each other by a bus 1504. An Input/Output (I/O) interface 1505 is also connected to the bus 1504.
The following components are connected to the I/O interface 1505: an input portion 1506 including a keyboard, a mouse, and the like; an output section 1507 including a Display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage portion 1508 including a hard disk and the like; and a communication section 1509 including a Network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 1509 performs communication processing via a network such as the internet. A drive 1510 is also connected to the I/O interface 1505 as needed. A removable medium 1511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 1510 as necessary, so that a computer program read out therefrom is installed into the storage section 1508 as necessary.
In particular, according to embodiments of the application, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated by the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 1509, and/or installed from the removable medium 1511. When the computer program is executed by the Central Processing Unit (CPU) 1501, various functions defined in the system of the present application are executed.
It should be noted that the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash Memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with a computer program embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. The computer program embodied on the computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. Each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software, or may be implemented by hardware, and the described units may also be disposed in a processor. In some cases, the name of a unit does not constitute a limitation on the unit itself.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by an electronic device, cause the electronic device to implement the method described in the above embodiments.
It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the application. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which can be a personal computer, a server, a touch terminal, or a network device, etc.) to execute the method according to the embodiments of the present application.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (15)

1. A method of searching, the method comprising:
obtaining a retrieval object, and extracting initial semantic features of the retrieval object;
generating a semantic feature set of the retrieval object according to the initial semantic features of the retrieval object;
generating a semantic feature set corresponding to each target object according to the initial semantic features of each target object in N target objects, wherein N is an integer greater than 1;
determining the matching degree of each target object and the retrieval object according to the semantic feature set of each target object and the semantic feature set of the retrieval object;
and determining a target object serving as a retrieval result from the N target objects according to the matching degree of each target object and the retrieval object.
2. The method according to claim 1, wherein the semantic feature set of the retrieval object comprises p first semantic features, and p is an integer greater than 1; the semantic feature set of each target object comprises q second semantic features, wherein q is an integer greater than 1; determining the matching degree of each target object and the retrieval object according to the semantic feature set of each target object and the semantic feature set of the retrieval object, including:
determining the difference degree of a J-th target object and an ith first semantic feature of the retrieval object according to q second semantic features of the J-th target object and the ith first semantic feature, wherein J is a positive integer less than or equal to N, and i is a positive integer less than or equal to p;
and determining the minimum difference degree from the difference degrees of the J-th target object and each first semantic feature, and determining the matching degree of the J-th target object and the retrieval object according to the minimum difference degree.
3. The method of claim 2, wherein determining the difference degree of the J-th target object and the ith first semantic feature according to the q second semantic features of the J-th target object and the ith first semantic feature of the retrieval object comprises:
determining a weighted sum of the q second semantic features of the J-th target object according to each second semantic feature of the J-th target object and the attention weight of each second semantic feature, wherein the attention weight of each second semantic feature is the attention weight of the second semantic feature corresponding to the ith first semantic feature;
determining a vector difference between the ith first semantic feature and the weighted sum, wherein the vector difference comprises a plurality of vector element values;
determining the smallest vector element value of the plurality of vector element values as the difference degree of the J-th target object and the ith first semantic feature.
4. The method according to claim 1, wherein the generating a semantic feature set corresponding to each target object according to the initial semantic features of each target object in the N target objects comprises:
and generating a semantic feature set corresponding to each target object according to the initial semantic features of each target object and the initial semantic features of the retrieval objects.
5. The method according to claim 4, wherein the generating a semantic feature set corresponding to each target object according to the initial semantic features of each target object and the initial semantic features of the retrieval object comprises:
inputting the initial semantic features of a Kth target object and the initial semantic features of the retrieval object into a pre-trained conditional variational self-encoder to obtain hidden variables generated by the conditional variational self-encoder, wherein the conditional variational self-encoder takes the initial semantic features of the retrieval object as encoding guide conditions, and K is a positive integer less than or equal to N;
and acquiring a plurality of hidden variables from the hidden variables generated by the conditional variational self-encoder, and generating a semantic feature set of the Kth target object according to the plurality of hidden variables.
6. The method of claim 1, wherein the retrieval object and the target object each belong to a different data modality;
generating a semantic feature set of the retrieval object according to the initial semantic features of the retrieval object, wherein the semantic feature set comprises:
based on a pre-trained projection strategy, projecting the initial semantic features of the retrieval object to a target vector space corresponding to the projection strategy to obtain first projection features of the retrieval object; generating a semantic feature set of the retrieval object according to the first projection feature;
generating a semantic feature set corresponding to each target object according to the initial semantic features of each target object in the N target objects, wherein the semantic feature set comprises:
based on the projection strategy, projecting the initial semantic features of each target object to the target vector space to obtain second projection features of each target object; and generating a semantic feature set corresponding to each target object according to the second projection feature of each target object.
7. The method of claim 1, wherein the semantic feature set of the retrieval object and the semantic feature set of each target object are generated by a semantic generation model.
8. The method of claim 7, wherein the semantic generation model comprises a first algorithm model, a second algorithm model, and a third algorithm model;
the first algorithm model is used for generating a plurality of first intermediate features of the retrieval object according to the initial semantic features of the retrieval object; the third algorithm model is used for determining similar features for each first intermediate feature from the plurality of first intermediate features and converting each first intermediate feature into corresponding first semantic features according to the similar features of each first intermediate feature and each first intermediate feature; a plurality of first semantic features converted from the plurality of first intermediate features are used as a semantic feature set of the retrieval object;
the second algorithm model is used for generating a plurality of second intermediate features of the target object according to the initial semantic features of the target object; the third algorithm model is further configured to determine a similar feature for each second intermediate feature from the plurality of second intermediate features, and convert each second intermediate feature into a corresponding second semantic feature according to the similar feature of each second intermediate feature and each second intermediate feature; and the plurality of second semantic features converted from the plurality of second intermediate features are used as the semantic feature set of the target object.
9. The method of claim 8, wherein the retrieval object and the target object each belong to a different data modality; the first algorithm model is a multi-head attention mechanism, and the second algorithm model is a variational self-encoder or a conditional variational self-encoder.
10. The method of claim 7, wherein the training process of the semantic generation model comprises:
obtaining a sample retrieval object and initial semantic features of the sample retrieval object, and obtaining a plurality of sample target objects and initial semantic features of each sample target object, wherein each sample target object carries a label, and the label carried by each sample target object is used for representing whether the sample target object is retrieved by the sample retrieval object;
inputting the initial semantic features of the sample retrieval object into the semantic generation model to generate a first prediction semantic feature set of the sample retrieval object;
respectively inputting the initial semantic features of each sample target object into the semantic generation model, and generating a second prediction semantic feature set corresponding to each sample target object;
determining the prediction matching degree of each sample target object and the sample retrieval object according to the second prediction semantic feature set and the first prediction semantic feature set of each sample target object;
and determining a loss value according to the label and the prediction matching degree of each sample target object, and training the semantic generation model based on the loss value.
11. The method of claim 10, wherein the label carried by each sample target object is a positive label or a negative label; if the sample target object carries the positive label, the sample target object is retrieved by the sample retrieval object; if the sample target object carries a negative label, the sample target object is not retrieved by the sample retrieval object;
the loss value comprises a distance penalty loss component, and the determination process of the distance penalty loss component comprises the following steps:
determining the prior matching degree of each sample target object carrying the negative label and the sample retrieval object according to the initial semantic features of each sample target object carrying the negative label and the initial semantic features of the sample retrieval object;
determining a priori matching degree sequence of a plurality of priori matching degrees according to the priori matching degree of each sample target object carrying the negative label;
determining a prediction matching degree sequence of a plurality of prediction matching degrees according to the prediction matching degree of each sample target object carrying the negative label;
and determining the distance penalty loss component according to the similarity of the prior matching degree sequence and the prediction matching degree sequence.
12. The method of claim 10, wherein the sample retrieval object and the sample target object each belong to different data modalities;
inputting the initial semantic features of the sample retrieval object into the semantic generation model, and generating a first prediction semantic feature set of the sample retrieval object, including:
based on a preset projection strategy, projecting the initial semantic features of the sample retrieval object to a preset vector space corresponding to the preset projection strategy to obtain first sample projection features of the sample retrieval object; inputting the first sample projection feature into the semantic generation model to generate a first prediction semantic feature set of the sample retrieval object;
the step of inputting the initial semantic features of each sample target object into the semantic generation model to generate a second prediction semantic feature set corresponding to each sample target object includes:
based on the preset projection strategy, projecting the initial semantic features of each sample target object to the preset vector space to obtain second sample projection features of each sample target object; inputting the second sample projection feature of each sample target object into the semantic generation model respectively, and generating a second prediction semantic feature set corresponding to each sample target object;
the method further comprises the following steps:
and training the preset projection strategy based on the loss value.
13. A retrieval apparatus, characterized in that the apparatus comprises:
the initial semantic feature extraction module is used for obtaining a retrieval object and extracting initial semantic features of the retrieval object;
a semantic feature set generating module, configured to generate a semantic feature set of the retrieval object according to the initial semantic features of the retrieval object, and further configured to generate a semantic feature set corresponding to each target object according to the initial semantic features of each target object in N target objects, where N is an integer greater than 1;
the matching degree determining module is used for determining the matching degree of each target object and the retrieval object according to the semantic feature set of each target object and the semantic feature set of the retrieval object;
and the retrieval result determining module is used for determining the target object serving as the retrieval result from the N target objects according to the matching degree of each target object and the retrieval object.
14. An electronic device comprising a processor and a memory; one or more programs are stored in the memory and configured to be executed by the processor to implement the method of any of claims 1-12.
15. A computer-readable storage medium, having program code stored therein, wherein the program code when executed by a processor performs the method of any of claims 1-12.
CN202110542204.8A 2021-05-18 2021-05-18 Retrieval method, retrieval device, electronic equipment and readable storage medium Pending CN113761933A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110542204.8A CN113761933A (en) 2021-05-18 2021-05-18 Retrieval method, retrieval device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110542204.8A CN113761933A (en) 2021-05-18 2021-05-18 Retrieval method, retrieval device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN113761933A true CN113761933A (en) 2021-12-07

Family

ID=78787204

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110542204.8A Pending CN113761933A (en) 2021-05-18 2021-05-18 Retrieval method, retrieval device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113761933A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116361728A (en) * 2023-03-14 2023-06-30 南京航空航天大学 Civil aircraft system level abnormal precursor identification method based on real-time flight data
CN116361728B (en) * 2023-03-14 2024-01-23 南京航空航天大学 Civil aircraft system level abnormal precursor identification method based on real-time flight data


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination